[NHCOLL-L:5553] Re: Crowdsourcing labels

O'Brien, Mark mfobrien at umich.edu
Thu Jul 14 09:06:16 EDT 2011


I think I can say with a degree of confidence that, when it comes to hiring a few people and getting them up to speed with transcribing data, it makes no sense to me to take three or four steps to do something when one will suffice.   The time and effort of making the images, assigning a control number to each and every digitized image, and the subsequent verification and storage of all the images make little sense to me.   At one point I had rotating teams of 2 students: one read the data and placed a catalog number label beneath the upper pin labels, and the other typed the data that was read.  This was done in FileMaker, and within a few months of part-time effort, many thousands of specimens were cataloged, with low error rates.    Having two people read and type, and then switch places, made for a better working environment and for the feedback that results when one questions the data when something is in error.

Mark

------------------------------------------------------------
Mark F. O'Brien, Collection Manager
Insect Division, Museum of Zoology
The University of Michigan
1109 Geddes Avenue
Ann Arbor, MI 48109-1079
(734)-647-2199
-------------------------------------------------------------



-----Original Message-----
From: owner-nhcoll-l at lists.yale.edu [mailto:owner-nhcoll-l at lists.yale.edu] On Behalf Of Doug Yanega
Sent: Wednesday, July 13, 2011 2:27 PM
To: christopher.norris at yale.edu
Cc: NHCOLL-L at lists.yale.edu
Subject: [NHCOLL-L:5547] Re: Crowdsourcing labels

Chris Norris wrote:

>Would you consider using crowdsourcing methods to transcribe 
>handwritten specimen labels? Are you horrified by the idea? Do you have 
>any idea what crowdsourcing is?
>
>Rob Guralnick from the University of Colorado is looking for feedback 
>on this issue from the collections community. You can read more about 
>it, and comment, by following this link:
>
>http://soyouthinkyoucandigitize.wordpress.com/2011/07/11/old-weathers-crowd-and-the-challenge-of-digitization/

As I am in charge of a massive label transcription project, it may not surprise you that I have a question that I see as crucial, but that is discussed neither on the linked webpage nor in Penny's follow-up regarding ALA.

Specifically: a crowdsourcing project requires, as an absolute first step, that each and every unit (in this case, a specimen plus its labels) (1) be assigned a unique identifying label if it does not already have one, and (2) have its existing labels removed (generally; this might be different with herbarium sheets), photographed, and then placed back with the specimen.

The time, effort, and expense in getting just this first step done is not trivial. It is so non-trivial, in fact, that I have to wonder whether anyone has ever done an actual budgetary analysis that compares the cost of taking all of those digital images (especially the labor cost) with the alternative; namely, that instead of paying a technician X amount per hour to handle specimens and photograph labels, that technician is paid to handle specimens and simply transcribe the labels.

Note that the crowdsourcing effort is not cost-free; there are considerable expenses designing, creating and maintaining the specialized infrastructure that supports it (*not* counting the underlying database), right down to the need to pay someone to write instructions for the volunteers to follow, and those expenses have no parallel in a project that simply hires technicians to enter data directly.

Accordingly, the comparative costs are not something simple and straightforward to establish.

Let's say project X hires a technician to take photographs, and this technician manages to process 100 specimens per hour. The cost of this step *per specimen* is not simply 1/100th of their hourly salary, but must also include the investment in the camera and software used to take the photos and put them online. The next step, serving those images to volunteers and having them transcribe the label data, even if the volunteer labor is free, requires the personnel to design, create and maintain it, as noted above (people who may be making hourly salaries, in some cases). That also adds to the cost per specimen.

Now, how does one compare that to project Y where a technician simply sits down and types in label data, at a rate of 50 specimens per hour? Which project is more cost-effective? It *might* be project Y, since it has far fewer expenses. It might depend rather heavily on the scale; a project involving only 50,000 specimens versus one that involves 500,000 will give much worse payoffs for "up-front" infrastructure investments, for example.
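
The comparison between project X and project Y can be sketched as a simple cost model. The processing rates (100 and 50 specimens per hour) and the two project sizes (50,000 and 500,000) come from the text above; the hourly wage and the fixed infrastructure cost are purely hypothetical placeholders, not real budget figures, and the function names are invented for illustration. The model also ignores ongoing maintenance and verification costs, so it is only a rough sketch of how scale can change the answer.

```python
# Hypothetical cost model; every dollar figure below is an assumed placeholder.

def cost_per_specimen(n_specimens, rate_per_hour, hourly_wage, fixed_costs):
    """Return (labor cost + one-time fixed costs) divided by specimen count."""
    labor = n_specimens / rate_per_hour * hourly_wage
    return (labor + fixed_costs) / n_specimens


def break_even_specimens(wage, rate_x, rate_y, fixed_x, fixed_y):
    """Specimen count at which projects X and Y cost the same per specimen."""
    labor_saving_per_specimen = wage / rate_y - wage / rate_x
    return (fixed_x - fixed_y) / labor_saving_per_specimen


HOURLY_WAGE = 15.0    # assumed technician wage, dollars per hour
X_FIXED = 40_000.0    # assumed: camera, software, crowdsourcing platform, instructions
Y_FIXED = 0.0         # assumed: nothing needed beyond the underlying database

if __name__ == "__main__":
    for n in (50_000, 500_000):
        x = cost_per_specimen(n, 100, HOURLY_WAGE, X_FIXED)  # photograph, then crowdsource
        y = cost_per_specimen(n, 50, HOURLY_WAGE, Y_FIXED)   # transcribe directly
        cheaper = "X" if x < y else "Y"
        print(f"{n:>7} specimens: X = ${x:.2f}, Y = ${y:.2f} per specimen -> {cheaper}")
    n_even = break_even_specimens(HOURLY_WAGE, 100, 50, X_FIXED, Y_FIXED)
    print(f"break-even at about {n_even:,.0f} specimens")
```

With these assumed numbers, project Y is cheaper at 50,000 specimens and project X is cheaper at 500,000, with the two meeting at roughly 267,000 specimens. Different wage or infrastructure figures would move that break-even point, which is why the real numbers matter.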

So, who, if anyone, has ever crunched the proper numbers to determine which approach is more cost-effective?

Sincerely,
-- 

Doug Yanega        Dept. of Entomology         Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314        skype: dyanega
phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
              http://cache.ucr.edu/~heraty/yanega.html
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82


