<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>I'm ambivalent regarding verbatim label data, because it can be

      extremely helpful in some cases, and extremely damaging in others.</p>

    <p>Some of you may recall my having given talks, or made unhappy comments

      at meetings, regarding the empirical data on error rates on

      original labels of insect specimens. It's pretty disheartening;

      across tens of thousands of specimens in roughly 10 major

      entomological museums assayed, somewhere between 15 and 20% of all

      original labels had data omissions or errors requiring correction

      prior to georeferencing. While a fair percentage of these are

      omissions that are easily fixed, or obvious typos, roughly half

      either cannot be fixed (e.g., a place name that occurs in more

      than one county, like "Sulphur Springs, Arkansas"), or are errors

      that MUST be fixed but are not immediately obvious.</p>

    <p>Such statements have been known to provoke people to roll their

      eyes at me, thinking that I overstate the problem, but it's a

      genuine issue, supported by lines of evidence that aren't

      immediately obvious, such as comparing labels produced by

      different people who were collecting together. Just as a

      "tip-of-the-iceberg" example, consider these data labels, produced

      by six professional researchers from several high-profile

      entomology museums on an NSF-funded field trip to Mexico:</p>

    <p>Chihuahua, 72 km NE Chihuahua, El Carrion, 27-VIII-91<br>

      Chihuahua, El Corrion, 72 km NE Chihuahua, 27-VIII-91<br>

      Chihuahua, El Morrion, 67 km NW Chihuahua, 27-VIII-91, 1200 m<br>

      Chihuahua, 67 km N El Morrion, 27-VIII-91<br>

      Chihuahua, 67 km N El Morrion, 27-III-91<br>

      Chihuahua, 74 km NE Chihuahua, 27-VIII-91<br>

    </p>

    <p>These labels all refer to the exact same collecting event, yet

      you'll note that no two are the same. You'll also note that <b>in

        the absence of the comparison</b>, none of them has an obvious

      error. <br>

    </p>

    <p>Worse still, <b>they are all wrong</b>. The actual data for this

      particular collecting event are<br>

      <br>

      Chihuahua, El Morrion, 67 km NE Chihuahua, 27-VIII-91, 1200 m</p>

    <p>As such, the six labels produced had (1) and (2) the wrong

      distance <b>and</b> the wrong place name, (3) the wrong cardinal

      direction, (4) the wrong reference point, (5) the wrong reference

      point and the wrong month, and (6) the wrong distance. Note also

      that the georeferences generated for these six labels result in

      two points that are 67 km from the actual location, and one over

      100 km off.<br>

    </p>

    <p>When you look specifically for examples like this, with multiple

      collectors' data used side-by-side to evaluate label accuracy,

      it's frightening how poorly people do. It also means that treating

      verbatim label data as <b>inherently trustworthy</b> is a serious

      mistake. As data suppliers and consumers, we need to be far more

      critical. Label data underlie so much of people's research, and

      if we supply or use bad data, that undermines the quality of the

      resulting research.<br>

    </p>
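    <p>For anyone who wants to try this kind of side-by-side check on
      their own records, here is a minimal sketch in Python. It assumes
      the six labels quoted above have already been parsed by hand into
      fields; the field names (locality, km, direction, reference, date)
      and both function names are hypothetical, and the elevation on the
      third label is left out because only one label records it.</p>

```python
from collections import Counter

# The six labels quoted above, parsed by hand into hypothetical fields.
# None marks a field that the label simply does not give.
LABELS = [
    {"locality": "El Carrion", "km": 72, "direction": "NE",
     "reference": "Chihuahua", "date": "27-VIII-91"},
    {"locality": "El Corrion", "km": 72, "direction": "NE",
     "reference": "Chihuahua", "date": "27-VIII-91"},
    {"locality": "El Morrion", "km": 67, "direction": "NW",
     "reference": "Chihuahua", "date": "27-VIII-91"},
    {"locality": None, "km": 67, "direction": "N",
     "reference": "El Morrion", "date": "27-VIII-91"},
    {"locality": None, "km": 67, "direction": "N",
     "reference": "El Morrion", "date": "27-III-91"},
    {"locality": None, "km": 74, "direction": "NE",
     "reference": "Chihuahua", "date": "27-VIII-91"},
]


def count_distinct_labels(labels):
    """Count how many different field combinations occur among the labels."""
    return len({frozenset(label.items()) for label in labels})


def field_disagreements(labels):
    """Return {field: Counter(values)} for each field the labels disagree on."""
    fields = sorted({key for label in labels for key in label})
    report = {}
    for field in fields:
        counts = Counter(label.get(field) for label in labels)
        if len(counts) != 1:
            report[field] = counts
    return report


if __name__ == "__main__":
    print("distinct labels:", count_distinct_labels(LABELS))
    for field, counts in field_disagreements(LABELS).items():
        print(field, dict(counts))
```

    <p>A comparison like this only shows where the labels disagree; it
      cannot say which value is right, so every flagged field still has
      to be checked against an independent source.</p>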

    <p>The question is whether we are better off displaying the verbatim

      data, or not, and to me that depends on whether serious quality

      control has or has not <b>already been exercised</b>.<br>

    </p>


    <p>My points are these: <br>

    </p>

    <p>(1) If the process of data capture is limited to entering

      verbatim label data and then simply parsing it out into other

      fields, it is much less likely that the data capture person is

      going to notice those labels that are in that roughly 10% where

      the data are wrong but it isn't obvious. If the process of data

      capture only uses verbatim data as the starting point, however,

      then the person trying to make sense of a label by georeferencing

      it themselves is relatively more likely to view it critically, and

      catch any errors.</p>

    <p>(2) If we assume for the moment that you have done the right

      thing, and fixed an error, how are users of your data going to

      know which version of the data they should trust? If a specimen

      has verbatim data listing a country or state or county or distance

      or direction that is <b>not the same as the parsed data</b>, is

      that not going to confuse them, if they notice the discrepancy?</p>

    <p>(3) My overall feeling is that including verbatim data is only

      genuinely beneficial to users if quality control has NOT been

      applied, AND if external users have a reliable way to communicate

      with the data providers to <b>report an error and get it fixed</b>.

      In other words, having <b>bad</b> verbatim data made visible

      makes it more likely that external users will find errors. If

      quality control HAS been applied, and the data are clean, then the

      discrepancy between verbatim and parsed data only stands to

      confuse external users. Given that the specimens will have a GUID

      label, any discrepancy between what the data labels say and what

      the parsed data say won't be a problem, because the data labels

      are not what you'll refer to when tracking a specimen down.</p>

    <p>It's a complex issue.<br>

    </p>

    <pre class="moz-signature" cols="72">-- 

Doug Yanega      Dept. of Entomology       Entomology Research Museum

Univ. of California, Riverside, CA 92521-0314     skype: dyanega

phone: (951) 827-4315 (disclaimer: opinions are mine, not UCR's)

             <a class="moz-txt-link-freetext" href="https://faculty.ucr.edu/~heraty/yanega.html">https://faculty.ucr.edu/~heraty/yanega.html</a>

  "There are some enterprises in which a careful disorderliness

        is the true method" - Herman Melville, Moby Dick, Chap. 82</pre>

  </body>

</html>