[NHCOLL-L:5555] Re: Crowdsourcing labels
Doug Yanega
dyanega at ucr.edu
Thu Jul 14 14:39:32 EDT 2011
Penny Berents wrote:
>Hi Doug,
>At the AM we are attempting to do exactly the analysis that you
>describe. We are running a pilot project based on our cicada
>collection and with support from the ALA. We have a team of trained
>volunteers who take images of the specimens and labels and assign a
>unique identifier (registration number).
>We have had assistance from the ALA to employ volunteer coordinators
>and to develop the web portal.
>The assessment at the end of the pilot will examine the real costs
>of the project and the uptake by 'virtual volunteers' to transcribe
>labels.
Do you have equivalent numbers (for direct data entry) to which you
can compare this result? That is, were parts of the cicada collection
already databased using the alternative method, and the costs
analyzed?
That was my central point: knowing that one project costs X and
another costs Y doesn't help a great deal if they're fundamentally
different projects, in different countries, using different types of
specimens. True, it's better than nothing, but those differences increase the
number of variables one has to try to control for.
On one hand, I agree with Mark O'Brien, who wrote: "I think I can say
with a degree of confidence that hiring a few people and getting them
up to speed with transcribing data, that it makes no sense to me to
take three or four steps to do something when one will suffice." BUT
it is also true that I am willing to reconsider that assessment,
should someone be able to demonstrate that the crowdsourcing approach
gives more specimens transcribed per grant dollar spent. There are
many folks now who are writing grants that involve a data
transcription and/or georeferencing component, and it is potentially
very important for us all to know - for certain, if possible - which
approach works out better in the long run.
Andrea Thomer wrote:
>Regarding the quality of the data, we are finding that the
>results of crowdsourcing actually beat the results from "experts"
>doing the work. People get tired, but if you manage to get tens of
>different people to digitize the same item, the balanced result is
>better. Motivation is also a great factor: in Oldweather, for 97% of
>data items digitized, 3 persons have written exactly the same thing,
>so people really take care of writing what is there.
This would be interesting if true, but I have some doubts about
the significance of multiple people entering the same data; in my
experience, a "majority rule" can easily reinforce an incorrect
result in cases where a single careful person would have recognized and
corrected an error. The reason is this: based on 20 years of personal
experience transcribing well over 100,000 labels and georeferencing
them, using specimens from 6 different major institutions, a minimum
of 10-20% of all original specimen labels are either incomplete or
erroneous and must be corrected or supplemented during data entry. In
other words, a technician who does a stellar job transcribing labels
verbatim into the computer is going to be entering 10-20%
erroneous/incomplete information even if they themselves make no
mistakes! Three technicians entering the same data does not guarantee
that the end result is good; a fourth, more careful technician might
spot an error and fix it, or enter additional data not present on the
original label, but their result would be dismissed if a "majority
rule" is implemented.
Of course, enforcing rigorous standards of Quality Control will
inevitably increase the time and cost to generate each record, but
people often overlook the obvious fact that QC costs can be minimized
if the work is done by conscientious technicians (i.e., a technician
who works slowly may in fact save costs overall, because the time
spent correcting their records is much less than the time spent
correcting those of a faster, "more productive" worker). AND they
tend to overlook the fact that including automated steps in the
process (such as OCR or automated georeferencing) actually ADDS to
the overall costs, because these records require far more careful
scrutiny and clean-up (e.g., if it takes one minute to automatically
generate a series of locality coordinates, and then another 30
minutes to have a human double-check all of them to make sure they're
right, this is potentially more costly than having that same human
spend 30 minutes to look them up and enter them manually).
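As a back-of-the-envelope illustration of that last point (the times are the
ones in my example above; the variable names are just for the sketch, and
only the arithmetic matters):

# Hypothetical per-batch times, in minutes, for georeferencing one
# series of localities.
automated_generation = 1    # machine produces the coordinates
human_verification   = 30   # human double-checks every one of them
manual_entry         = 30   # human looks them up and types them in

automated_total = automated_generation + human_verification  # 31 min
manual_total    = manual_entry                                # 30 min

print(automated_total > manual_total)  # True: the "automated" path costs more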
Sincerely,
--
Doug Yanega
Dept. of Entomology, Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314
skype: dyanega
phone: (951) 827-4315
(standard disclaimer: opinions are mine, not UCR's)
http://cache.ucr.edu/~heraty/yanega.html
"There are some enterprises in which a careful disorderliness
is the true method" - Herman Melville, Moby Dick, Chap. 82