[NHCOLL-L:5555] Re: Crowdsourcing labels
Doug Yanega
dyanega at ucr.edu
Thu Jul 14 14:39:32 EDT 2011
Penny Berents wrote:
>Hi Doug,
>At the AM we are attempting to do exactly the analysis that you
>describe. We are running a pilot project based on our cicada
>collection and with support from the ALA. We have a team of trained
>volunteers who take images of the specimens and labels and assign a
>unique identifier (registration number).
>We have had assistance from the ALA to employ volunteer coordinators
>and to develop the web portal.
>The assessment at the end of the pilot will examine the real costs
>of the project and the uptake by 'virtual volunteers' to transcribe
>labels.
Do you have equivalent numbers (for direct data entry) to which you
can compare this result? That is, were parts of the cicada collection
already databased using the alternative method, and the costs
analyzed?
That was my central point: knowing that one project costs X and
another costs Y doesn't help a great deal if they're fundamentally
different projects, in different countries, using different types of
specimens. True, it's better than nothing, but those differences increase the
number of variables one has to try to control for.
On one hand, I agree with Mark O'Brien, who wrote: "I think I can say
with a degree of confidence that hiring a few people and getting them
up to speed with transcribing data, that it makes no sense to me to
take three or four steps to do something when one will suffice." BUT
it is also true that I am willing to reconsider that assessment,
should someone be able to demonstrate that the crowdsourcing approach
gives more specimens transcribed per grant dollar spent. There are
many folks now who are writing grants that involve a data
transcription and/or georeferencing component, and it is potentially
very important for us all to know - for certain, if possible - which
approach works out better in the long run.
Andrea Thomer wrote:
>Regarding the quality of the data, we are finding that the
>results of crowdsourcing actually beat the results from "experts"
>doing the work. People get tired, but if you manage to get tens of
>different people to digitize the same item, the balanced result is
>better. Motivation is also a great factor: in Oldweather, for 97% of
>data items digitized, 3 persons have written exactly the same thing,
>so people really take care of writing what is there.
This would be interesting if true, but I have some doubts about
the significance of multiple people entering the same data; in my
experience, a "majority rule" can easily reinforce an incorrect
result in cases where a single careful person would have recognized and
corrected an error. The reason is this: based on 20 years of personal
experience transcribing well over 100,000 labels and georeferencing
them, using specimens from 6 different major institutions, a minimum
of 10-20% of all original specimen labels are either incomplete or
erroneous and must be corrected or supplemented during data entry. In
other words, a technician who does a stellar job transcribing labels
verbatim into the computer is going to be entering 10-20%
erroneous/incomplete information even if they themselves make no
mistakes! Three technicians entering the same data does not guarantee
that the end result is good; a fourth, more careful technician might
spot an error and fix it, or enter additional data not present on the
original label, but their result would be dismissed if a "majority
rule" is implemented.
Of course, enforcing rigorous standards of Quality Control will
inevitably increase the time and cost to generate each record, but
people often overlook the obvious fact that QC costs can be minimized
if the work is done by conscientious technicians (i.e., a technician
who works slowly may in fact save costs overall, because the time
spent correcting their records is much less than the time spent
correcting those of a faster, "more productive" worker). AND they
tend to overlook the fact that including automated steps in the
process (such as OCR or automated georeferencing) actually ADDS to
the overall costs, because these records require far more careful
scrutiny and clean-up (e.g., if it takes one minute to automatically
generate a series of locality coordinates, and then another 30
minutes to have a human double-check all of them to make sure they're
right, this is potentially more costly than having that same human
spend 30 minutes to look them up and enter them manually).
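As a back-of-the-envelope illustration of that last point (the times are the
ones in my example above; the variable names are just for the sketch, and
only the arithmetic matters):

# Hypothetical per-batch times, in minutes, for georeferencing one
# series of localities.
automated_generation = 1    # machine produces the coordinates
human_verification   = 30   # human double-checks every one of them
manual_entry         = 30   # human looks them up and types them in

automated_total = automated_generation + human_verification  # 31 min
manual_total    = manual_entry                                # 30 min

print(automated_total > manual_total)  # True: the "automated" path costs more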
Sincerely,
--
Doug Yanega
Dept. of Entomology, Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314
skype: dyanega
phone: (951) 827-4315
(standard disclaimer: opinions are mine, not UCR's)
http://cache.ucr.edu/~heraty/yanega.html
"There are some enterprises in which a careful disorderliness
is the true method" - Herman Melville, Moby Dick, Chap. 82