[Nhcoll-l] verbatimLabel DarwinCore Field Addition

Douglas Yanega dyanega at gmail.com
Tue Apr 20 13:39:02 EDT 2021


I'm ambivalent regarding verbatim label data, because it can be 
extremely helpful in some cases, and extremely damaging in others.

Some of you may recall my having given talks, or unhappy comments at 
meetings, regarding the empirical data on error rates on original labels 
of insect specimens. It's pretty disheartening; across tens of thousands 
of specimens in roughly 10 major entomological museums assayed, 
somewhere between 15 and 20% of all original labels had data omissions or 
errors requiring correction prior to georeferencing. While a fair 
percentage of these are omissions that are easily fixed, or obvious 
typos, roughly half either cannot be fixed (e.g., a place name that 
occurs in more than one county, like "Sulphur Springs, Arkansas"), or 
are errors that MUST be fixed but are not immediately obvious.

Such statements have been known to provoke people to roll their eyes at 
me, thinking that I overstate the problem, but it's a genuine issue, and 
includes lines of evidence that aren't immediately obvious, such as 
comparing labels produced by different people who were collecting 
together. Just as a "tip-of-the-iceberg" example, consider these data 
labels, produced by six professional researchers from several 
high-profile entomology museums on an NSF-funded field trip to Mexico:

Chihuahua, 72 km NE Chihuahua, El Carrion, 27-VIII-91
Chihuahua, El Corrion, 72 km NE Chihuahua, 27-VIII-91
Chihuahua, El Morrion, 67 km NW Chihuahua, 27-VIII-91, 1200 m
Chihuahua, 67 km N El Morrion, 27-VIII-91
Chihuahua, 67 km N El Morrion, 27-III-91
Chihuahua, 74 km NE Chihuahua, 27-VIII-91

These labels all refer to the exact same collecting event, yet you'll 
note that no two are the same. You'll also note that *in the absence of 
the comparison*, none of them has an obvious error.

Worse still, *they are all wrong*. The actual data for this particular 
collecting event are

Chihuahua, El Morrion, 67 km NE Chihuahua, 27-VIII-91, 1200 m

As such, the six labels produced had (1) and (2) the wrong distance *and* 
the wrong place name, (3) the wrong cardinal direction, (4) the wrong 
reference point, (5) the wrong reference point and the wrong month, and 
(6) the wrong distance. Note also that the georeferences generated for 
these six labels result in two points that are 67 km from the actual 
location, and one over 100 km off.

When you look specifically for examples like this, with multiple 
collectors' data used side-by-side to evaluate label accuracy, it's 
frightening how poorly people do. It also means that treating verbatim 
label data as *inherently trustworthy* is a serious mistake. As data 
suppliers and consumers, we need to be far more critical. Label data 
underlie so much of people's research, and if we supply or use bad 
data, that undermines the quality of the resulting research.
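
The side-by-side comparison described above can be sketched in a few 
lines of Python. This is only an illustration, not a tool anyone in this 
thread uses: the six labels quoted earlier are parsed by hand into 
fields, and the field names (place, km, bearing, origin, date, elev_m) 
and the function name are assumptions made for the sketch.

```python
from collections import Counter

# Hand-parsed fields from the six labels quoted earlier. The parsing and the
# field names are assumptions of this sketch; None marks an omitted field.
FIELDS = ("place", "km", "bearing", "origin", "date", "elev_m")
ROWS = [
    ("El Carrion", 72, "NE", "Chihuahua", "27-VIII-91", None),
    ("El Corrion", 72, "NE", "Chihuahua", "27-VIII-91", None),
    ("El Morrion", 67, "NW", "Chihuahua", "27-VIII-91", 1200),
    (None, 67, "N", "El Morrion", "27-VIII-91", None),
    (None, 67, "N", "El Morrion", "27-III-91", None),
    (None, 74, "NE", "Chihuahua", "27-VIII-91", None),
]
LABELS = [dict(zip(FIELDS, row)) for row in ROWS]


def conflicting_fields(labels):
    """Return {field: Counter of values} for fields where the labels disagree.

    Omitted values (None) are ignored, so an omission alone is not a conflict.
    """
    conflicts = {}
    for field in FIELDS:
        values = Counter(
            label[field] for label in labels if label[field] is not None
        )
        if len(values) > 1:
            conflicts[field] = values
    return conflicts


for field, values in conflicting_fields(LABELS).items():
    print(field, dict(values))
```

Every field except elevation is flagged. The tally shows only that the 
labels disagree; deciding which value is correct still takes independent 
checking.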

The question is whether we are better off displaying the verbatim data, 
or not, and to me that depends on whether serious quality control has or 
has not *already been exercised*.

My points are these:

(1) If the process of data capture is limited to entering verbatim label 
data and then simply parsing it out into other fields, it is much less 
likely that the data capture person is going to notice those labels that 
are in that roughly 10% where the data are wrong but it isn't obvious. 
If the process of data capture uses verbatim data only as the starting 
point, however, then the person trying to make sense of a label by 
georeferencing it themselves is relatively more likely to view it 
critically, and catch any errors.

(2) If we assume for the moment that you have done the right thing, and 
fixed an error, how are users of your data going to know which version 
of the data they should trust? If a specimen has verbatim data listing a 
country or state or county or mileage or direction that is *not the same 
as the parsed data*, is that not going to confuse them, if they notice 
the discrepancy?

(3) My overall feeling is that including verbatim data is only genuinely 
beneficial to users if quality control has NOT been applied, AND if 
external users have a reliable way to communicate with the data 
providers to *report an error and get it fixed*. In other words, having 
*bad* verbatim data made visible makes it more likely that external 
users will find errors. If quality control HAS been applied, and the 
data are clean, then the discrepancy between verbatim and parsed data 
only stands to confuse external users. Given that the specimens will 
have a GUID label, any discrepancy between what the data labels say and 
what the parsed data say won't be a problem, because the data labels are 
not what you'll refer to when tracking a specimen down.
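
For the discrepancy raised in point (2), a provider could at least flag 
which parsed values no longer match the label text, so a correction 
reads as deliberate rather than as a mistake. A minimal sketch, assuming 
a hypothetical record layout: "verbatimLabel" is taken from this thread's 
subject line, "stateProvince" and "locality" are borrowed loosely from 
Darwin Core, and the function name is made up for illustration.

```python
def unsupported_fields(verbatim_label, parsed):
    """Return names of parsed fields whose value is absent from the label text.

    A plain case-insensitive substring check; it only flags a discrepancy
    and cannot tell which version is correct.
    """
    text = verbatim_label.casefold()
    return [
        name
        for name, value in parsed.items()
        if value and str(value).casefold() not in text
    ]


# Hypothetical record: the first label quoted earlier, with a corrected locality.
record = {
    "verbatimLabel": "Chihuahua, 72 km NE Chihuahua, El Carrion, 27-VIII-91",
    "parsed": {"stateProvince": "Chihuahua", "locality": "El Morrion"},
}

flagged = unsupported_fields(record["verbatimLabel"], record["parsed"])
print(flagged)  # ['locality']
```

A flag like this does nothing to settle which version is right; it only 
tells a user that the difference is known to the provider.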

It's a complex issue.

-- 
Doug Yanega      Dept. of Entomology       Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314     skype: dyanega
phone: (951) 827-4315 (disclaimer: opinions are mine, not UCR's)
              https://faculty.ucr.edu/~heraty/yanega.html
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82
