[Nhcoll-l] global unique identifiers and natural history collections

Mon Oct 15 14:36:06 EDT 2012

The thread sprouted and grew like a weed over the weekend, so I'll 
try to collate things:

Dirk Neumann wrote:

>However, it is crucial that the registration numbers are tied to
>collections and specimens in the collection, therefore I would rather
>favour to have the museum acronyms & specimen numbers included in such a
>code (what would be easily feasible if using a combined alpha numeric &
>alphabetical coding system). Problem here surely lies with the
>entomological collections, which can't be individualised in near future,
>but in the light of ongoing barcoding campaigns one should have in mind
>that many modern samples (which do have e.g. individual barcodes
>generated by BOLD) do have unique identifiers as soon as they are
>processed (this applies also for historic specimens picked from the pin)
>- even if the analyses fails.

and Mark O'Brien wrote:

>Reading that whole blog entry gave me a headache.  We have had 
>unique museum identifiers for many years, starting with printed 
>lists back in the 1960s.    If you dissociate a museum acronym  from 
>a specimen number, it will cause confusion and perhaps cause more 
>problems.  Let's say that someone uses a specimen from the UMMZ that 
>has a record number UMMZI-0023578.  In the resulting publication, it 
>becomes a part of a type series, and anyone reading that paper would 
>be able to determine (without even having to go online) that it is 
>from the Univ. Michigan Museum of Zoology Insect collection.  If it 
>was coded instead with 081-211118-87650 it means nothing without an 
>intermediary decoding via some online portal.   I know the old KISS 
>(Keep it Simple, Stupid) adage means more now than it ever did, 
>since people have a tendency to make very complex systems because 
>they can.

These two comments relate to something extremely important: GUID 
labels, even those with acronyms like UMMZ, absolutely DO NOT give 
any information about which collection a specimen is from. We have 
specimens in the UCR collection, including holotypes, bearing GUIDs 
with codens such as AMNH, EMEC, USNM, MEI, and so forth - some of 
which were traded to us, some donated to us, and some which always 
were ours, but were simply databased somewhere else (e.g., the AMNH). 
We even have three different codens for our own specimens (UCRC ENT, 
UCR ENT, UCREM), and a multi-institutional coden (UCIS, for 
"University of California Insect Survey", specimens from which are 
spread throughout several different UC collections), resulting from 
different historical databasing efforts. Conversely, there are over 
50 insect collections around the world that now contain specimens 
with "UCRC ENT" or "UCR ENT" barcode labels on them, simply because 
those specimens were sent here, given GUIDs and databased here, and 
returned.

The bottom line is that any assigned GUIDs absolutely MUST be treated 
as information-free.  This particular point has, in fact, recently 
caused a rather serious debate within the ICZN, when an author 
included a coden-based GUID in a new taxon description but 
did not specify the type repository. Failure to specify a type 
depository is, after 1999, grounds for a name being unavailable, and 
some Commissioners were apparently unaware that GUIDs did not always 
correlate to where a specimen was deposited. They do not, thus there 
is controversy as to whether such a description is valid under the 
Code.
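
To make the "information-free" point concrete, here is a minimal Python sketch. The record layout, the field names, and the sample GUID strings are all hypothetical, invented purely for illustration; the only thing taken from the discussion above is that the coden printed in a GUID need not match where the specimen is actually deposited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecimenRecord:
    guid: str        # opaque label string; never parsed for meaning
    repository: str  # where the specimen is deposited, stated explicitly


# Hypothetical records: these GUID strings are made up for illustration.
RECORDS = {
    "AMNH 000001": SpecimenRecord("AMNH 000001", "UCR"),
    "UCRC ENT 000002": SpecimenRecord("UCRC ENT 000002", "UCR"),
}


def repository_of(guid: str) -> str:
    """Look the repository up in the record; do not infer it from the GUID."""
    return RECORDS[guid].repository


def naive_guess(guid: str) -> str:
    """The tempting but unreliable shortcut: read the leading token."""
    return guid.split()[0]
```

In this toy data the shortcut answers "AMNH" for a specimen whose record says it is deposited at UCR, which is exactly the mismatch described above. The repository has to be stated on its own, not read off the label.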

I will note that Dave Furth already posted the "official" list of 
entomological collection codens, so the entomological community is 
*trying* to use a standard, though there are some conflicts and 
omissions relative to other sources (e.g., DiscoverLife). To continue 
in that vein:

Bill Poly wrote:

>Expanding on what Mark just wrote, standardization of institutional 
>codes for museums has been going on for decades:
>
>1) http://www.biodiversitycollectionindex.org/static/index.html
>
>2) Leviton, A.E., R.H. Gibbs, Jr., E. Heal, and C.E. Dawson.  1985. 
>Standards in herpetology and ichthyology: Part I.  Standard symbolic 
>codes for institutional resource collections in herpetology and 
>ichthyology.  Copeia 1985(3): 802-832.
>
>3) Leviton, A.E. and R.H. Gibbs, Jr.  1988.  Standards in 
>herpetology and ichthyology. Standard symbolic codes for institution 
>resource collections in herpetology and ichthyology. Supplement No. 
>1: additions and corrections.  Copeia 1988(1): 280-282.
>
>4) http://www.asih.org/codons.pdf
>
>These acronyms and associated catalog numbers are used widely in the 
>literature.  What is the need for a new system that is "global?"

and John Reiss wrote:

>It would seem that a solution would be to develop a unique numeric 
>collection code that would go along with (rather than replace) the 
>traditional alphabetic one.  Thus a specimen might be something like:
>
>13429 AAU 001 for Addis Ababa University specimen 001
>
>and
>
>11946 AAU 001 for Aarhus University specimen 001

The problem here is that we DO need a *global* registry of 
institutional codens, because you will never, ever, get people to 
RELABEL specimens that they have already labeled; do you think that 
the people in Aarhus, whose specimen AAU 001 looks fine to them, are 
going to remove that label and replace it with a label that says 
"11946 AAU 001"? Are you going to pay them for the labor that this 
would require? Can you track down all the AAU labels on specimens 
they have sent to other institutions, and replace those, too? There 
will, inevitably, be cases where there are genuinely non-unique GUIDs 
and you can bet the owners of the specimens in question will not 
budge about changing their labels OR their digital records. There may 
not BE a solution for this once it has occurred, but at least with a 
single authoritative registry of codens we can *prevent* as many of 
these conflicts as possible.
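
As a rough illustration of what "prevention" by a single registry could look like, here is a minimal Python sketch. The class name, its methods, and the first-come-first-served rule are assumptions made for the example, not a description of any existing registry; the AAU case is borrowed from John's message, and which institution registers first is arbitrary here.

```python
class CodenRegistry:
    """Toy registry: one coden maps to exactly one institution."""

    def __init__(self):
        self._owners = {}  # normalized coden -> institution name

    def register(self, coden, institution):
        key = coden.strip().upper()
        owner = self._owners.get(key)
        # Refuse a coden that already belongs to a different institution.
        if owner is not None and owner != institution:
            raise ValueError(f"coden {key!r} is already registered to {owner}")
        self._owners[key] = institution

    def owner(self, coden):
        return self._owners[coden.strip().upper()]


registry = CodenRegistry()
registry.register("AAU", "Aarhus University")
try:
    registry.register("AAU", "Addis Ababa University")
except ValueError as err:
    print(err)  # the second claim on "AAU" is rejected
```

Note that a check like this only stops *new* collisions at registration time. It does nothing for labels that are already on pins, which is the part that may have no solution.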

Rob Guralnick wrote:

>Finally, I keep thinking about how much we scan barcodes all the time
>and don't care at all about the numbers in those barcodes that get us
>onto airplanes, or that get us groceries.   Between sticking guids in
>QR codes or cutting and pasting them and resolving their contents to a
>record, does anyone _ever_ really transcribe an identifier number for
>number?  Maybe it's me, but I just can't see this issue about error
>correction being relevant.  What am I missing?

What you are missing is that lots of institutions are using, or have 
used, GUIDs that are not barcoded. The UCIS GUIDs developed and 
employed by Peter Kevan and Ev Schlinger in the 1970s were 
batch-printed by a computer program running a ribbon printer. That's 
over 300,000 specimens, in at least 5 major collections, with GUID 
labels that have no barcodes - are you going to insist that all of 
the UC collections containing these specimens remove those GUIDs and 
replace them with new ones with barcodes?

That aside, this example points to another ugly problem - when one 
specimen has either multiple GUIDs or multiple different records 
under the same GUID. That is, some of the UC collections have in fact 
ignored the UCIS GUID labels and attached new GUID labels to those 
specimens - and in at least a few cases, those original GUIDs had 
actually been databased and put online. The result is that you can 
have one specimen for which there are two different online GUIDs - 
plus the associated duplicated data. Also, many institutions have 
databasing initiatives which make use of specimens borrowed from 
*other* institutions, and sometimes they have work-study students 
doing this, who simply add a new GUID and database it even if a 
specimen already has one ("otherwise how are the data supposed to get 
into our database?") - which results in the same problem of multiple 
GUIDs for one specimen. But even cases where the students are told to 
use pre-existing GUIDs are problematic, when those students are told 
to enter the data. Why is that a problem? Because commonly the source 
institution has already entered the data for that same specimen in 
their own database! The odds that two people in different places 
transcribing the same label will produce *absolutely identical* 
database records are nearly zero, if only because it is exceedingly 
rare for two institutions to use the exact same databasing software 
(same fields, same menu options, etc.), plus the possibility that one 
or the other data entry person might make a typo or omission that 
results in non-identicality. If one then looks online and sees two 
different sets of data under the same GUID, how does one decide 
*which* set to trust?
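
The double-entry problem is easy to demonstrate with a small Python sketch. The function name, the field names, and both sample records below are hypothetical, invented for illustration; the sketch only shows how two transcriptions of the same label can be compared, and it makes no attempt to decide which one is right.

```python
def conflicting_fields(record_a: dict, record_b: dict) -> dict:
    """Return {field: (value_a, value_b)} for every field whose values differ.

    A field present in only one record shows up with None on the other side.
    """
    diffs = {}
    for field in sorted(set(record_a) | set(record_b)):
        a, b = record_a.get(field), record_b.get(field)
        if a != b:
            diffs[field] = (a, b)
    return diffs


# Two hypothetical transcriptions of one label, filed under the same GUID.
source_copy = {
    "guid": "UCIS 000123",
    "locality": "Riverside, CA",
    "collector": "A. Collector",
}
borrower_copy = {
    "guid": "UCIS 000123",
    "locality": "Riverside, Calif.",
    "collector": "A. Collector",
    "habitat": "chaparral",
}

differences = conflicting_fields(source_copy, borrower_copy)
```

Here `differences` flags `locality` (two spellings of the same place) and `habitat` (a field only one database has). Those are the two sources of mismatch mentioned above: different transcription choices and different database fields.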

My point is that even a GUID that is perfect in every way (even if it 
has a barcode, and an embedded DOI or whatever) can still be defeated 
by the simple fact that people who are databasing specimens rarely 
(if ever) implement the policy to NEVER enter data for a specimen 
that has come from another institution and has a GUID on it, and 
instead to (1) request a data file from the loaning institution, and 
(2) only *use* that data file, rather than *exporting* its contents 
into their own database (which typically entails converting it into a 
different format). GUIDs do not solve all of the problems, because 
some problems are related to how people do things. As Robert Heinlein 
said, "It is impossible to make anything foolproof, because fools are 
so ingenious." (ironically, this quote itself is a pseudo-duplicate, 
as at least four famous authors have said something nearly identical 
- including Edward Teller, Douglas Adams, and Gene Brown - on top of 
much older but anonymous quotes).

Sincerely,
-- 

Doug Yanega        Dept. of Entomology         Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314        skype: dyanega
phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
              http://cache.ucr.edu/~heraty/yanega.html
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82