[Nhcoll-l] global unique identifiers and naturalhistory collections

Cellinese,Nico ncellinese at flmnh.ufl.edu
Mon Oct 15 23:01:56 EDT 2012


In response to Rob, I am not convinced that unique, even resolvable,
identifiers are by themselves going to help us 'discover digital
resources on the internet'.  We are going to need services to realize
that dream. But hey, that's what these damn windmills are for... Let's
go get 'em...

Yes Jim, we know that.  And the service is what BiSciCol would provide.  We are trying to lay down the groundwork for doing so.  Implementing GUIDs is the first important step.

Nico



On Tue, Oct 16, 2012 at 9:54 AM, Chuck Miller <Chuck.Miller at mobot.org> wrote:
Rob,
I suspect you're not the only one that disagrees, so no problem.  I just can't escape the reality that this thread has continued for years on different lists, repeating the same points over and over.  DOI as the biodiversity informatics solution appeared around 2006 or so I think, although LSID was on a roll about then and there was a lot of hope for that approach. True, LSID has waned and DOI has continued and as a tangible solution it is definitely there for the future.  I just wish all those other people would stop pointing out the flaws (like botanical duplicates), then we could reach consensus. What do you say, Jim?

Chuck

-----Original Message-----
From: robgur at gmail.com [mailto:robgur at gmail.com] On Behalf Of Robert Guralnick
Sent: Monday, October 15, 2012 5:39 PM
To: Chuck Miller
Cc: Jim Croft; nhcoll-l at mailman.yale.edu
Subject: Re: [Nhcoll-l] global unique identifiers and naturalhistory collections

 Hi Chuck --- I respectfully disagree.  If we want to follow a clear set of best practices that can assure that our digital records have persistent, globally unique and resolvable identifiers, and that the solution can work now and not in ten years or never, then we have a very limited set of options. I'd like to hear more, but the comment that "Only parts of the problem can be solved with any one of them" is not my view.  Or at minimum, there is a conflation of what the problem is.  The problem that I raised is how we discover digital records that are on the Internet quickly and effectively, and provide means to begin tracking these objects (and by the way, thanks so much for all the comments from everyone out there)!

You want to talk about interlinking data, that is a discussion about Linked Open Data and the Semantic Web, and really, that is moot for us unless we can persistently and uniquely and resolvably point to the things that are out there.

Best, Rob




On Mon, Oct 15, 2012 at 3:15 PM, Chuck Miller <Chuck.Miller at mobot.org> wrote:
It seems in these discussions that multiple use cases are addressed in the same thread.  As a result, the discussion circles endlessly because the solution to one use case may be orthogonal to another and vice versa and around we go. Based on the number of words written and years consumed on the topic, it appears that the problem of identifiers for biodiversity informatics at large is far too diverse and complex to be completely solved with a single solution, like DOI, UUID, GUID, LSID, you name it.  Only parts of the problem can be solved with any one of them.

A more segmented but integrated approach seems to be needed and at the core of it would be a "master data switch service" like that which Jim describes because it would presume the complexity of the biodiversity data universe and attempt to order it.  With a comprehensive switching service, any ID could be cross-correlated to another ID after those who have the correlated information made it available to the service.
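
As a rough illustration of the cross-correlation idea above (a sketch only, not a description of any existing service), the following groups identifiers that contributors have asserted to be correlated. The class name IdSwitch, its methods, and the example identifiers are all hypothetical.

```python
class IdSwitch:
    """Toy cross-reference table for identifiers asserted to correlate."""

    def __init__(self):
        # Maps each known identifier to a parent in its group (union-find).
        self._parent = {}

    def _find(self, identifier):
        # Follow parent links to the representative of the group.
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving keeps later lookups short.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def correlate(self, id_a, id_b):
        """Record an assertion that id_a and id_b refer to the same thing."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def lookup(self, identifier):
        """Return every identifier known to correlate with the given one."""
        if identifier not in self._parent:
            return {identifier}
        root = self._find(identifier)
        return {other for other in list(self._parent)
                if self._find(other) == root}


# Hypothetical identifiers, invented for illustration only.
switch = IdSwitch()
switch.correlate("doi:10.1234/hypothetical-1", "urn:uuid:hypothetical-0001")
switch.correlate("urn:uuid:hypothetical-0001", "UCRC ENT 000001")
print(sorted(switch.lookup("UCRC ENT 000001")))
```

The table only knows what has been contributed to it, which matches the caveat above: correlation becomes possible only after those who hold the information make it available.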

Chuck

-----Original Message-----
From: nhcoll-l-bounces at mailman.yale.edu [mailto:nhcoll-l-bounces at mailman.yale.edu] On Behalf Of Jim Croft
Sent: Monday, October 15, 2012 3:57 PM
To: Robert Guralnick
Cc: nhcoll-l at mailman.yale.edu
Subject: Re: [Nhcoll-l] global unique identifiers and naturalhistory collections

This group's concept of GUID and mine appear to be quite different:
http://en.wikipedia.org/wiki/Globally_unique_identifier

We are using 'GUID' as local shorthand for something we are going to
make up, right? If that is the case, we should really give it another
name, because GUID is occupied and formally defined in information
space. (Loving the irony of the duplication here.)

And I agree with your comment about the human opacity of supermarket barcodes. Nobody number checks them. They are just accepted unintelligible infrastructure that makes machines go 'beep' and cause stuff to happen.

The reason I am attracted to UUIDs is that you do not need a 'social infrastructure' to dole them out and control who gets what. If you need one, you just grab one, and within the bounds of human probability, it is going to be unique. There is something very elegant about this that appeals to the inner geek.
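
A minimal sketch of the "just grab one" property, assuming Python's standard uuid module: a version 4 UUID is built locally from random bits, with no registry or issuing authority involved. The variable names are illustrative only.

```python
import uuid

# Mint two identifiers locally; no central authority is consulted.
first_id = uuid.uuid4()
second_id = uuid.uuid4()

# Version 4 UUIDs are built from random bits, so a collision is
# possible in principle but vanishingly unlikely in practice.
print(first_id)
print(second_id)
print(first_id != second_id)
```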

While we are considering resolvability mechanisms, and I am not yet willing to concede that this has to be in the number itself, we also need to consider mapping mechanisms.  Making sure the ID is unique is one thing. But there will always be different IDs relating to the same or similar concepts in various ways (congruent, contains, is part of, is sort of like, etc.) and these will have to be mapped to each other. This is especially important in botany, where the same collection can be represented as different specimens in different institutions (or even within the same institution).
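
One way to picture such a mapping, offered only as a sketch: store each relationship as a (subject, relation, target) triple and query it. The relation name is borrowed loosely from the list above and is not a standard vocabulary; the identifiers and helper functions are hypothetical.

```python
from collections import namedtuple

# One typed relationship between two identifiers.
Mapping = namedtuple("Mapping", ["subject", "relation", "target"])

# Hypothetical botanical duplicates: two sheets in different herbaria
# that derive from the same collecting event.
mappings = [
    Mapping("herbarium-A:sheet-1", "is_part_of", "collecting-event:X"),
    Mapping("herbarium-B:sheet-9", "is_part_of", "collecting-event:X"),
]


def related(identifier, relation, table):
    """Targets reached from identifier through the given relation."""
    return [m.target for m in table
            if m.subject == identifier and m.relation == relation]


def siblings(identifier, relation, table):
    """Other subjects that share a target through the same relation."""
    targets = set(related(identifier, relation, table))
    return sorted({m.subject for m in table
                   if m.relation == relation
                   and m.target in targets
                   and m.subject != identifier})


print(siblings("herbarium-A:sheet-1", "is_part_of", mappings))
```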

jim

On Tue, Oct 16, 2012 at 6:12 AM, Robert Guralnick <Robert.Guralnick at colorado.edu> wrote:
 Doug --- I appreciate (and also groan the teensiest bit) when
reading this lucid description of ALL the challenges we face with
such heterogeneous practices.  I think you have made really nicely clear
just how difficult things are!   Two GUID best practices in our blog
post were:  1) GUIDs must be assigned as close to the source as
possible.  For example, if data is collected in the field, the
identifier for that data needs to be assigned in the field and
attached to the field database with ownership initially stated by the
maintainers of that database.  For existing data, assignment can be
made in the source database.  2)   GUIDs propagate downstream to other
systems.  Creating new GUIDs in warehouses that duplicate existing
ones is bad practice, and thus aggregators need to honor well-curated
GUIDs from providers.   That jibes with what you are saying, I think.
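
The second practice can be sketched as an aggregator ingest step that keeps whatever identifier the provider supplied and mints a new one only when none exists. This is an illustration under assumptions: the function name, the "guid" field, the UUID fallback, and the example records are all hypothetical.

```python
import uuid


def ingest(record, warehouse):
    """Store a provider record, keeping its identifier if it has one."""
    guid = record.get("guid")
    if not guid:
        # Mint only when the source supplied nothing; never replace
        # an identifier that was assigned upstream.
        guid = "urn:uuid:" + str(uuid.uuid4())
    warehouse[guid] = dict(record, guid=guid)
    return guid


warehouse = {}
kept = ingest({"guid": "doi:10.1234/hypothetical-record", "taxon": "Apis"}, warehouse)
minted = ingest({"taxon": "Bombus"}, warehouse)
print(kept)
print(minted)
```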

Please also appreciate that the kind of GUID we are suggesting is a
Digital Object Identifier or DOI.  The DOI website has a great
description of the value of using DOIs.  It says "The DOI system
provides a technical and social infrastructure for the registration
and use of persistent interoperable identifiers for use on digital
networks."  Those words resonate with me strongly.  DOIs are opaque
and not tied to terms used in our databases such as collections codes
or catalog numbers.  I continue to be convinced this makes sense.  It
means the DOI is a wrapper around a specimen, or metadata record, or
digital surrogate (e.g. an image of a specimen) that points to those
objects and allows them to be found.  This is just smart, at least in
my opinion.
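
To make the "wrapper that points" idea concrete, here is a small sketch that turns a DOI string into its resolver URL via the public doi.org proxy. The function name and the example DOI are hypothetical, and no network request is made.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Build the resolver URL for a DOI string such as '10.1234/example'."""
    # Percent-encode anything unsafe in a URL path, but keep the
    # slash that separates the DOI prefix from its suffix.
    return DOI_RESOLVER + quote(doi, safe="/")


# A made-up DOI; the string itself says nothing about the object.
print(doi_to_url("10.1234/example-specimen"))
```

Following the resulting URL is what leads to the specimen record, metadata record, or image; the identifier carries no collection code or catalog number of its own.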

Best, Rob

On Mon, Oct 15, 2012 at 12:36 PM, Doug Yanega <dyanega at ucr.edu> wrote:
The thread sprouted and grew like a weed over the weekend, so I'll
try to collate things:

Dirk Neumann wrote:

However, it is crucial that the registration numbers are tied to
collections and specimens in the collection, therefore I would
rather favour to have the museum acronyms & specimen numbers
included in such a code (what would be easily feasible if using a
combined alpha numeric & alphabetical coding system). Problem here
surely lies with the entomological collections, which can't be
individualised in near future, but in the light of ongoing barcoding
campaigns one should have in mind that many modern samples (which do
have e.g. individual barcodes generated by BOLD) do have unique
identifiers as soon as they are processed (this applies also for
historic specimens picked from the pin) - even if the analysis fails.

and Mark O'Brien wrote:

Reading that whole blog entry gave me a headache.  We have had
unique museum identifiers for many years, starting with printed
lists back in the 1960s.    If you dissociate a museum acronym  from
a specimen number, it will cause confusion and perhaps cause more
problems.  Let's say that someone uses a specimen from the UMMZ that
has a record number UMMZI-0023578.  In the resulting publication, it
becomes a part of a type series, and anyone reading that paper would
be able to determine (without even having to go online) that it is
from the Univ. Michigan Museum of Zoology Insect collection.  If it
was coded instead with 081-211118-87650 it means nothing without an
intermediary decoding via some online portal.   I know the old KISS
(Keep it Simple, Stupid) adage means more now than it ever did,
since people have a tendency to make very complex systems because they can.

These two comments relate to something extremely important: GUID
labels, even those with acronyms like UMMZ, absolutely DO NOT give
any information about which collection a specimen is from. We have
specimens in the UCR collection, including holotypes, bearing GUIDs
with codens such as AMNH, EMEC, USNM, MEI, and so forth - some of
which were traded to us, some donated to us, and some which always
were ours, but were simply databased somewhere else (e.g., the AMNH).
We even have three different codens for our own specimens (UCRC ENT,
UCR ENT, UCREM), and a multi-institutional coden (UCIS, for
"University of California Insect Survey", specimens from which are
spread throughout several different UC collections), resulting from
different historical databasing efforts. Conversely, there are over
50 insect collections around the world that now contain specimens
with "UCRC ENT" or "UCR ENT" barcode labels on them, simply because
those specimens were sent here, given GUIDs and databased here, and
returned.

The bottom line is that any assigned GUIDs absolutely MUST be
treated as information-free.  This particular point has, in fact,
recently caused a rather serious debate within the ICZN, when an
author recently included a coden-based GUID in a new taxon
description but did not specify the type repository. Failure to
specify a type depository is, after 1999, grounds for a name being
unavailable, and some Commissioners were apparently unaware that
GUIDs did not always correlate to where a specimen was deposited.
They do not, thus there is controversy as to whether such a
description is valid under the Code.
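
A sketch of what "information-free" means in practice: the holding institution is stored as its own field and looked up, rather than guessed from the coden in the label. The identifiers, field name, and functions are hypothetical and invented for illustration.

```python
# Hypothetical records: the label prefix and the holder can differ.
specimens = {
    "AMNH 000123": {"held_by": "UCR"},
    "UCRC ENT 000456": {"held_by": "UCR"},
}


def repository(guid, registry):
    """Return the recorded holding institution for an identifier."""
    return registry[guid]["held_by"]


def naive_guess(guid):
    """What a reader might wrongly infer from the label prefix alone."""
    return guid.split()[0]


print(naive_guess("AMNH 000123"))            # prefix on the label
print(repository("AMNH 000123", specimens))  # where it is actually held
```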

I will note that Dave Furth already posted the "official" list of
entomological collection codens, so the entomological community is
*trying* to use a standard, though there are some conflicts and
omissions relative to other sources (e.g., DiscoverLife). To
continue in that vein:

Bill Poly wrote:

Expanding on what Mark just wrote, standardization of institutional
codes for museums has been going on for decades:

1) http://www.biodiversitycollectionindex.org/static/index.html

2) Leviton, A.E., R.H. Gibbs, Jr., E. Heal, and C.E. Dawson.  1985.
Standards in herpetology and ichthyology: Part I.  Standard symbolic
codes for institutional resource collections in herpetology and
ichthyology.  Copeia 1985(3): 802-832.

3) Leviton, A.E. and R.H. Gibbs, Jr.  1988.  Standards in
herpetology and ichthyology. Standard symbolic codes for institution
resource collections in herpetology and ichthyology. Supplement No.
1: additions and corrections.  Copeia 1988(1): 280-282.

4) http://www.asih.org/codons.pdf

These acronyms and associated catalog numbers are used widely in the
literature.  What is the need for a new system that is "global?"

and John Reiss wrote:

It would seem that a solution would be to develop a unique numeric
collection code that would go along with (rather than replace) the
traditional alphabetic one.  Thus a specimen might be something like:

13429 AAU 001 for Addis Ababa University specimen 001

and

11946 AAU 001 for Aarhus University specimen 001

The problem here is that we DO need a *global* registry of
institutional codens, because you will never, ever, get people to
RELABEL specimens that they have already labeled; do you think that
the people in Aarhus, whose specimen AAU 001 looks fine to them, are
going to remove that label and replace it with a label that says
"11946 AAU 001"? Are you going to pay them for the labor that this
would require? Can you track down all the AAU labels on specimens
they have sent to other institutions, and replace those, too? There
will, inevitably, be cases where there are genuinely non-unique
GUIDs and you can bet the owners of the specimens in question will
not budge about changing their labels OR their digital records.
There may not BE a solution for this once it has occurred, but at
least with a single authoritative registry of codens we can
*prevent* as many of these conflicts as possible.

Rob Guralnick wrote:

Finally, I keep thinking about how much we scan barcodes all the
time and don't care at all about the numbers in those barcodes that get us
onto airplanes, or that get us groceries.   Between sticking guids in
QR codes or cutting and pasting them and resolving their contents to
a record, does anyone _ever_ really transcribe an identifier number
for number?  Maybe it's me, but I just can't see this issue about
error correction being relevant.  What am I missing?

What you are missing is that lots of institutions are using, or have
used, GUIDs that are not barcoded. The UCIS GUIDs developed and
employed by Peter Kevan and Ev Schlinger in the 1970s were
batch-printed by a computer program running a ribbon printer. That's
over 300,000 specimens, in at least 5 major collections, with GUID
labels that have no barcodes - are you going to insist that all of
the UC collections containing these specimens remove those GUIDs and
replace them with new ones with barcodes?

That aside, this example points to another ugly problem - when one
specimen has either multiple GUIDs or multiple different records
under the same GUID. That is, some of the UC collections have in
fact ignored the UCIS GUID labels and attached new GUID labels to
those specimens - and in at least a few cases, those original GUIDs
had actually been databased and put online. The result is that you
can have one specimen for which there are two different online GUIDs
- plus the associated duplicated data. Also, many institutions have
databasing initiatives which make use of specimens borrowed from
*other* institutions, and sometimes they have work-study students
doing this, who simply add a new GUID and database it even if a
specimen already has one ("otherwise how are the data supposed to
get into our database?") - which results in the same problem of
multiple GUIDs for one specimen. But even cases where the students
are told to use pre-existing GUIDs are problematic, when those
students are told to enter the data. Why is that a problem? Because
commonly the source institution has already entered the data for
that same specimen in their own database! The odds that two people
in different places transcribing the same label will produce
*absolutely identical* database records are nearly zero, if only
because it is exceedingly rare for two institutions to use the exact
same databasing software (same fields, same menu options, etc.),
plus the possibility that one or the other data entry person might
make a typo or omission that results in non-identicality. If one
then looks online and sees two different sets of data under the same
GUID, how does one decide *which* set to trust?
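
The situation described above can at least be detected mechanically. The sketch below compares two transcriptions filed under the same GUID and reports the fields that disagree; the field names and record contents are hypothetical, and deciding which version to trust remains a human question.

```python
def conflicting_fields(record_a, record_b):
    """Return {field: (value_a, value_b)} for each field that differs."""
    fields = set(record_a) | set(record_b)
    return {
        field: (record_a.get(field), record_b.get(field))
        for field in sorted(fields)
        if record_a.get(field) != record_b.get(field)
    }


# Two made-up transcriptions of the same label.
source_copy = {"locality": "Riverside, CA", "collector": "J. Smith"}
borrower_copy = {"locality": "Riverside, Calif.", "collector": "J. Smith",
                 "sex": "female"}

print(conflicting_fields(source_copy, borrower_copy))
```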

My point is that even a GUID that is perfect in every way (even if
it has a barcode, and an embedded DOI or whatever) can still be
defeated by the simple fact that people who are databasing specimens
rarely (if ever) implement the policy to NEVER enter data for a
specimen that has come from another institution and has a GUID on
it, and instead to (1) request a data file from the loaning
institution, and (2) only *use* that data file, rather than
*exporting* its contents
into their own database (which typically entails converting it into
a different format). GUIDs do not solve all of the problems, because
some problems are related to how people do things. As Robert
Heinlein said, "It is impossible to make anything foolproof, because
fools are so ingenious." (ironically, this quote itself is a
pseudo-duplicate, as at least four famous authors have said
something nearly identical - including Edward Teller, Douglas Adams,
and Gene Brown - on top of much older but anonymous quotes).

Sincerely,
--

Doug Yanega        Dept. of Entomology         Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314        skype: dyanega
phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
             http://cache.ucr.edu/~heraty/yanega.html
  "There are some enterprises in which a careful disorderliness
        is the true method" - Herman Melville, Moby Dick, Chap. 82
_______________________________________________
Nhcoll-l mailing list
Nhcoll-l at mailman.yale.edu
http://mailman.yale.edu/mailman/listinfo/nhcoll-l



--
_________________
Jim Croft ~ jim.croft at gmail.com ~ +61-2-62509499 ~ http://about.me/jrc
'Without the freedom to criticize, there is no true praise.'
- Pierre Beaumarchais
'Whenever you find yourself on the side of the majority, it's time to pause and reflect.'
- Mark Twain
'A civilized society is one which tolerates eccentricity to the point of doubtful sanity.'
- Robert Frost

<><><><><><><><><><><><><><><><><><><><><><><><>
Nico Cellinese, Ph.D.
Assistant Curator, Botany & Informatics
Joint Assistant Professor, Department of Biology

Florida Museum of Natural History
University of Florida
354 Dickinson Hall, PO Box 117800
Gainesville, FL 32611-7800, U.S.A.
Tel. 352-273-1979
Fax 352-846-1861
http://cellinese.blogspot.com/


