[NHCOLL-L:5547] Re: Crowdsourcing labels

Wed Jul 13 14:27:03 EDT 2011

Chris Norris wrote:

>Would you consider using crowdsourcing methods to transcribe 
>handwritten specimen labels? Are you horrified by the idea? Do you 
>have any idea what crowdsourcing is?
>
>Rob Guralnick from the University of Colorado is looking for 
>feedback on this issue from the collections community. You can read 
>more about it, and comment, by following this link:
>
>http://soyouthinkyoucandigitize.wordpress.com/2011/07/11/old-weathers-crowd-and-the-challenge-of-digitization/

As I am in charge of a massive label transcription project, it may 
not surprise you that I have a question that I see as crucial, but 
that is not discussed on the webpage linked, nor in Penny's follow-up 
regarding ALA.

Specifically: a crowdsourcing project requires, as an absolute first 
step, that each and every unit (in this case, a specimen plus its 
labels) (1) is assigned a unique identifying label if it does not 
already have one, and (2) has its existing labels removed (generally; 
this might be different with herbarium sheets), photographed, and 
then placed back with the specimen.

The time, effort, and expense in getting just this first step done 
are not trivial. They are so non-trivial, in fact, that I have to 
wonder whether anyone has ever done an actual budgetary analysis that 
compares the cost of taking all of those digital images (especially 
the labor cost) with the alternative: namely, that instead of paying 
a technician X amount per hour to handle specimens and photograph 
labels, that technician is paid to handle specimens and simply 
transcribe the labels.

Note that the crowdsourcing effort is not cost-free; there are 
considerable expenses designing, creating and maintaining the 
specialized infrastructure that supports it (*not* counting the 
underlying database), right down to the need to pay someone to write 
instructions for the volunteers to follow, and those expenses have no 
parallel in a project that simply hires technicians to enter data 
directly.

Accordingly, the comparative costs are not something simple and 
straightforward to establish.

Let's say project X hires a technician to take photographs, and this 
technician manages to process 100 specimens per hour. The cost of 
this step *per specimen* is not simply 1/100th of their hourly 
salary, but must also include the investment in the camera and 
software used to take the photos and put them online. The next step, 
serving those images to volunteers and having them transcribe the 
label data, requires personnel to design, create and maintain the 
supporting infrastructure, as noted above (people who may be making 
hourly salaries, in some cases), even if the volunteer labor itself 
is free. That also adds to the cost per specimen.

Now, how does one compare that to project Y where a technician simply 
sits down and types in label data, at a rate of 50 specimens per 
hour? Which project is more cost-effective? It *might* be project Y, 
since it has far fewer expenses. It might depend rather heavily on 
the scale; a project involving only 50,000 specimens versus one that 
involves 500,000 will give much worse payoffs for "up-front" 
infrastructure investments, for example.
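
To make the comparison concrete, here is a minimal sketch in Python of 
the arithmetic involved. Only the two throughput rates (100 and 50 
specimens per hour) and the two project sizes come from the text 
above; the $20 hourly wage and the $30,000 of up-front costs for 
project X are purely hypothetical placeholders, and the function 
names are made up for illustration. Ongoing maintenance is lumped 
into the fixed costs, and the time needed to check volunteer 
transcriptions is not modeled at all.

```python
def cost_per_specimen(hourly_wage, specimens_per_hour, fixed_costs, n_specimens):
    """Technician labor per specimen, plus fixed costs spread over the project."""
    return hourly_wage / specimens_per_hour + fixed_costs / n_specimens


def break_even_specimens(hourly_wage, rate_x, rate_y, fixed_x, fixed_y=0.0):
    """Project size at which X (photograph, then crowdsource) costs the same as
    Y (technician types the data directly). Above this size X is cheaper.

    Returns None when X saves no technician labor per specimen, because its
    extra fixed costs can then never be recovered.
    """
    labor_saving = hourly_wage / rate_y - hourly_wage / rate_x
    if labor_saving <= 0:
        return None
    return (fixed_x - fixed_y) / labor_saving


# Hypothetical inputs (NOT from any real budget).
WAGE = 20.0        # dollars per technician hour (assumed)
FIXED_X = 30000.0  # camera, software, volunteer platform, instructions (assumed)
RATE_X = 100       # specimens photographed per hour (from the text)
RATE_Y = 50        # specimens transcribed per hour (from the text)

for n in (50_000, 500_000):
    x = cost_per_specimen(WAGE, RATE_X, FIXED_X, n)
    y = cost_per_specimen(WAGE, RATE_Y, 0.0, n)
    print(f"{n:>7} specimens: X = ${x:.2f} each, Y = ${y:.2f} each")

print("break-even size:", break_even_specimens(WAGE, RATE_X, RATE_Y, FIXED_X))
```

With these made-up inputs, project Y is cheaper per specimen at 50,000 
specimens and project X is cheaper at 500,000, with the crossover at 
roughly 150,000 specimens. Different wage or infrastructure figures 
would move that crossover considerably, which is exactly why real 
numbers are needed.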

So, who, if anyone, has ever crunched the proper numbers to determine 
which approach is more cost-effective?

Sincerely,
-- 

Doug Yanega        Dept. of Entomology         Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314        skype: dyanega
phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
              http://cache.ucr.edu/~heraty/yanega.html
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82