<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
What: Visualize Your Text Data Using OCR Output<br>
Why: Fast access to your data, reveal the unexpected<br>
When: Wednesday, 22 January 2014, 10 AM EST<br>
Where: <a class="moz-txt-link-freetext"
href="http://idigbio.adobeconnect.com/augmentocr">http://idigbio.adobeconnect.com/augmentocr</a><br>
Who: All Are Welcome!<br>
<p class="MsoNormal">Note: Headsets are recommended for the best
experience with <a
href="https://www.idigbio.org/wiki/index.php/Web_Conferencing">AdobeConnect</a>.
Please log in 15 minutes early if this is your first time using
AdobeConnect.</p>
<p class="MsoNormal">Twitter: @iDigBio #citscribe #ocrviz</p>
<p class="MsoNormal">See your data in a whole new way!
Museum specimen labels, note cards, field notebooks, ledgers and
other primary source materials are being imaged in many
digitization projects. Other projects plan to OCR their materials
or have questions about what they can do with the output.<br>
</p>
<p class="MsoNormal">OCR text output from these sources opens a
window to your data, <i style="mso-bidi-font-style:normal">before</i>
the data elements are entered into the database fields. It gives
you unprecedented, fast access to your data, revealing insights to
facilitate research, data validation, and public participation in
science. Come see a demonstration of how you might do this with
OCR output. As part of the recent <a
href="https://www.idigbio.org/content/citscribe-hackathon">iDigBio
CITScribe Hackathon</a>, <a
href="https://www.facebook.com/photo.php?fbid=645283398848943&l=bbbcf70f3b">the
LlLl
team</a> demonstrated one technique to do this visualization
with <a href="http://search.carrot2.org/stable/search">Carrot<sup>2</sup></a>
and <a href="https://developers.google.com/chart/">Google Charts</a>
using OCR output <a
href="http://www.techopedia.com/definition/1210/index-idx-database-systems">indexed</a>
by <a href="http://lucene.apache.org/solr/">Apache Solr</a>, while <span
style="font-size:11pt;font-family:Calibri,sans-serif">highlighting
OCR errors using an n-gram model, which estimates the probability
that a string is a valid word.</span> Find what you want fast, and <i
style="mso-bidi-font-style:normal">discover</i> unexpected,
informative search terms. The same approach can be used to guide
what needs to be validated using crowdsourcing outputs, on a
per-field basis. All are welcome.<br>
<br>
See you there! Yes, please share the link, spread the word, and
yes, it will be recorded.<br>
Andrea M, Jason B, Miao C, Sylvia O, Reed B, William U and @idbdeb
from the @idigbio #citscribe <a
href="https://www.facebook.com/photo.php?fbid=645283398848943&l=bbbcf70f3b">LlLl
Team</a>, et al. from the iDigBio CITScribe Hackathon and iDigBio<br>
<br>
N.B. This work was inspired by a Biodiversity Information Standards
(TDWG) 2013 talk, <a
href="http://www.tdwg.org/fileadmin/2013conference/slides/Drinkwater_OCRforHerbaria.pptx">The
use of OCR in the digitisation of herbarium specimens</a>, by Robyn
Drinkwater, Robert Cubey, and Elspeth Haston, RBGE. </p>
keywords: OCR ML NLP SOLR GoogleCharts CARROT2
<pre class="moz-signature" cols="72">--
Upcoming iDigBio Events <a class="moz-txt-link-freetext" href="https://www.idigbio.org/outreach-events-sidebar">https://www.idigbio.org/outreach-events-sidebar</a>
--Deborah Paul
iDigBio Technology Specialist
Institute for Digital Information, 234 LSB
Florida State University
Tallahassee, Florida 32306
850-644-6366</pre>
</body>
</html>