Tuesday, July 13, 2010
Digitizing Engelmann's Legacy
Mapping Plant Specimens that Document the Great American Frontier
ESRI Education User Conference – July 13, 2010 – San Diego, CA
http://www.tropicos.org/Project/Engelmann
Tuesday, June 22, 2010
Monday, April 19, 2010
BHL poster for AETFAT2010
Due to the volcano in Iceland, I may or may not be going to Madagascar for the AETFAT conference on Thursday as planned. I'm routed through London and everything else is full, so I'll have to make a go/no-go decision on Thursday morning. One of my main reasons for going to the conference was to present this poster (and another one for Tropicos), and I hope I get to display it because I think it turned out really well! Using a variety of open software and open data, I made a photomosaic of Africa and Madagascar from the title pages of books tagged with "Africa" or "Madagascar" in the Biodiversity Heritage Library. Here's how I did it:
1. Downloaded the BHL schema from http://www.biodiversitylibrary.org/data/BHLExportSchema.pdf and the following data exports:
• Title: http://www.biodiversitylibrary.org/data/title.txt (10MB+)
• Subject: http://www.biodiversitylibrary.org/data/subject.txt (3MB+)
• Item: http://www.biodiversitylibrary.org/data/item.txt (14MB+)
2. Imported those text files into tables in a simple db app (MySQL or Access). I set up One-to-Many relationships from Title.TitleID to Subject.TitleID and to Item.TitleID, reflecting how one title ("Flore de Madagascar") can have many subjects ("Madagascar") and many items ("Volume 25"). Note the field Item.ThumbnailPageID, which holds the pageID of the image marked as the Title Page or, if no Title Page has been selected, a representative page of interest from the book.
3. Using a simple query editor, I wrote a SQL statement to select the ThumbnailPageID of digitized items whose titles are tagged with subjects matching "%Africa%" or "%Madagascar%". The wildcards pull in subjects like "South Africa" and "Madagascar, Central."
4. Following BHL's API documentation for images, I prepended "http://biodiversitylibrary.org/pagethumb/" to each of the pageIDs from step 3. Each resulting value is the link to a page image, one for each of the 851 title pages.
5. I used a download manager (Speed Download for Mac OS X; there are plenty for Win/Unix) to grab those 851 JPGs. At the default size returned, each tile was small: 200 pixels wide and about 8 KB on average.
6. I used the map of Africa and Madagascar from UiO as a reference image because it didn't have the sea terrain present, which muddled my first few attempts. I blew that image up using *proprietary software alert* Adobe Photoshop. You can use other imaging software to do the same, but I like Photoshop. I made a blank image roughly 3'x4' at 300 dpi and pasted in the source image, then scaled it to the size of the poster.
7. I then used MacOSaiX to build the photomosaic. This is where all the magic happens, and where I did the least. I just told the app to use the reference image from step 6 and the thumbnails from step 5 to build the mosaic, and off it went. After 40 minutes or so it beeped and said it was done. Voila! A photomosaic of Africa and Madagascar made from title pages of open access science books.
8. To make the poster I pasted the JPG into *proprietary software alert* Microsoft PowerPoint, because it's surprisingly easy to use for poster layout. Dropped in some text, logo, & a URL and there you have it - a cool poster using open data and (mostly) open software.
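Steps 2 through 4 can be sketched in a few lines. This is only an illustration on made-up rows, not the exact statement I ran: it uses SQLite through Python instead of MySQL or Access, the column name SubjectText and the sample titles, IDs and page numbers are assumptions, and only TitleID, ThumbnailPageID and the pagethumb prefix come from the steps above.

```python
import sqlite3

# URL prefix quoted in step 4.
THUMB_BASE = "http://biodiversitylibrary.org/pagethumb/"


def build_toy_db():
    """Create an in-memory database with a few invented rows (not real BHL data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Title (TitleID INTEGER PRIMARY KEY, FullTitle TEXT);
        CREATE TABLE Subject (TitleID INTEGER, SubjectText TEXT);  -- SubjectText is an assumed name
        CREATE TABLE Item (ItemID INTEGER PRIMARY KEY, TitleID INTEGER, ThumbnailPageID INTEGER);

        INSERT INTO Title VALUES (1, 'Flore de Madagascar'),
                                 (2, 'Birds of Ohio'),
                                 (3, 'Flora of South Africa');
        INSERT INTO Subject VALUES (1, 'Madagascar'), (1, 'Botany'),
                                   (2, 'Ohio'), (3, 'South Africa');
        INSERT INTO Item VALUES (10, 1, 1001), (11, 1, 1002),
                                (20, 2, 2001), (30, 3, 3001);
        """
    )
    return conn


def thumbnail_urls(conn):
    """Return thumbnail links for items whose title carries a matching subject."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.ThumbnailPageID
        FROM Item AS i
        JOIN Subject AS s ON s.TitleID = i.TitleID
        WHERE (s.SubjectText LIKE '%Africa%' OR s.SubjectText LIKE '%Madagascar%')
          AND i.ThumbnailPageID IS NOT NULL
        ORDER BY i.ThumbnailPageID
        """
    ).fetchall()
    # Step 4: prepend the pagethumb prefix to every page ID.
    return [THUMB_BASE + str(page_id) for (page_id,) in rows]


urls = thumbnail_urls(build_toy_db())
for url in urls:
    print(url)
```

The DISTINCT guards against a title that matches both wildcards and would otherwise return its items twice. The resulting list is what you would hand to a download manager in step 5.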
You can download the finished poster here as a 1MB JPG.
I'm purposefully documenting how I did this to encourage others to incorporate BHL data into their visualizations & presentations. BHL is an incredibly rich dataset with open access policies and open APIs, and this is but one simple example of how I was able to filter the data and extract compelling images from the millions we have scanned.
Friday, March 26, 2010
BHL metadata improvements & impact on exports / cached data
Rod Page, frequent collaborator/agitator (that’s a compliment), recently posted a tweet about stability of ItemIDs in BHL:
http://twitter.com/rdmpage/status/10884940673
I suspected this was related to metadata enhancement in one form or another, and after auditing the issue with Mike Lichtenberg, the real brain behind BHL, we determined that was the case – there was an error in the descriptive metadata for the book & through library curation the problem was resolved.
But it’s a thorny issue, and one with implications for external systems that cache BHL data. It’s also one we need help resolving, so we welcome your feedback, discussion & comments on this post.
Let me be clear on one point: BHL IDs don’t change. We mint identifiers for all primary entities in our schema – Titles, Items, Pages, Names, Authors, Subjects, etc., all get unique identifiers, and those IDs aren’t reused.
What does happen through the course of curation is that a scanned book will have enhancements made to its descriptive metadata. This is analogous to the work that happens within library catalogues all over the world – as new facts are found about an object, or as errors are encountered, those edits are recorded in the catalogue.
The same happens within the digital BHL. We’re always striving to make our content accurate. And we’ve identified 4 scenarios that need to be resolved between BHL and those who have cached its data in their local systems:
1. Item is removed from Internet Archive & BHL due to copyright concern or major errors in digital images
Sometimes, after a book has been scanned and published, it is found to conflict with copyright, or significant errors are identified in its scanned images (Internet Archive, our scanning partner, does the QA post-scanning). In either case the item “goes dark” at Internet Archive and through BHL: it’s not deleted, but it’s no longer publicly exposed through searches or links. If the issue is with the scanned images, the book is rescanned and given a new ItemID.
2. Merging titles
BHL has scanned thousands of journals. Most libraries have gaps in their serial runs, especially for the kind of legacy, public domain materials we’re digitizing. It’s not uncommon for Library A to scan volumes 1-5, 7 & 9 of the Journal of Society Z and for Library B to fill in volumes 6 & 8.
What complicates this issue is that there is no single canonical library record for a title – no one record for the Journal of Society Z. Each library probably has different metadata about those volumes in their collection (vols 6 & 8 may have annotations, for example) and so the bibliographic metadata from each library will be similar but different. Each library submits its local record at the time of scanning and that record follows the scanned item for the purposes of provenance.
To present a complete run of the journal, we have data structures & interfaces that give BHLibrarians the ability to assign volumes 6 & 8 to the title with more volumes, or to reassign volumes 1-5, 7 & 9 if Library B’s metadata is determined to be more complete. When we merge titles, the deprecated title is left in the system & forwards on to the “new” more complete title.
3. Adding a new Primary or Secondary Title
And if all that wasn’t confusing enough, we haven’t even touched on the bibliographic hell of monographic series. These books can be described under one title as a journal and under another title as a monograph.
Take, for example, the following scanned book, “Some mollusks from Afghanistan”:
http://www.biodiversitylibrary.org/item/21009
It has been described as a standalone work and given its own OCLC number and maybe even an ISBN. *But*, and here’s where it gets tricky, it’s been published as a monograph within the series Fieldiana Zoology: http://www.biodiversitylibrary.org/bibliography/42256
You can thus describe and cite this digitized book as either “Some mollusks from Afghanistan” or as “Fieldiana. Zoology, new series, No. 1.” But when it came into BHL, it was only described by the monographic record “Some mollusks from Afghanistan.” It took manual curation to assign it to the series record. We have structures to describe a scanned book as having a “Primary” title as well as a “Secondary” title to reflect this duality in description and citation.
4. Wholesale correction of title metadata
Sometimes (rarely) a book comes into BHL with completely wrong metadata, but the scans are fine. In this case we correct the error with the right metadata, and use the same structure for Primary/Secondary titles as described above.
And, in reviewing Rod’s tweet & the subsequent information he sent, that looks to be what happened here: the primary title for item 46212 was changed to title 1594. Title 12649 remains as a secondary title.
The title IDs still point to the same titles, and the item IDs still point to the same scanned items... nothing has disappeared, and no IDs have changed. The relationship between titles and items is what has been adjusted.
Also, in this case we've actually improved the situation. The title page clearly identifies the item as part XVIII (1850) of the Proceedings of the Zoological Society of London. So the relationship to title 12649 (Lietuvos TSR Mokslu Akademijos Darbuotoju Knygu ir Straipsniu Bibliografija, i.e. a bibliography of books and articles by employees of the Lithuanian SSR Academy of Sciences) looks to be a mistake, but it’s how the book entered our system. We’ve kept that fact for the purpose of completeness, but now show the “correct” title when that item is viewed through BHL.
BHL is a dynamic system & is updated by more than 12 libraries today with more coming online this year and next. The disconnect comes when users cache BHL data and don’t have these new facts.
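For anyone caching BHL data, here is one way the scenarios above could be modelled locally. It is a hypothetical sketch, not BHL’s actual schema or code: the class and method names are invented, and title IDs 900 and 800 are made up to illustrate a merge. Item 46212 and titles 1594 and 12649 are the ones discussed above.

```python
from dataclasses import dataclass, field


@dataclass
class TitleCache:
    """Toy local cache of title/item relationships (hypothetical structure)."""

    # Deprecated TitleID -> the TitleID it forwards to after a merge.
    merged_into: dict = field(default_factory=dict)
    # ItemID -> {"primary": TitleID, "secondary": [TitleID, ...]}
    item_titles: dict = field(default_factory=dict)

    def resolve_title(self, title_id):
        """Follow merge forwards until reaching a title that is not deprecated."""
        seen = set()
        while title_id in self.merged_into:
            if title_id in seen:
                raise ValueError(f"merge cycle detected at title {title_id}")
            seen.add(title_id)
            title_id = self.merged_into[title_id]
        return title_id

    def set_primary(self, item_id, new_primary):
        """Change an item's primary title, keeping the old one as a secondary title."""
        entry = self.item_titles[item_id]
        old_primary = entry["primary"]
        if old_primary == new_primary:
            return
        if new_primary in entry["secondary"]:
            entry["secondary"].remove(new_primary)
        entry["secondary"].append(old_primary)
        entry["primary"] = new_primary


cache = TitleCache()

# Scenario 2: an invented deprecated title 900 forwards to an invented title 800.
cache.merged_into[900] = 800

# Scenario 4: item 46212 arrived under title 12649 and was corrected to title 1594.
cache.item_titles[46212] = {"primary": 12649, "secondary": []}
cache.set_primary(46212, 1594)

print(cache.resolve_title(900))
print(cache.item_titles[46212])
```

Note that no identifier is created or destroyed in either operation. Only the links between existing IDs change, which is why a cache that stores those links needs a way to refresh them.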
Saturday, March 20, 2010
Friday, March 19, 2010
Number of BHL names found on only 1 page
As of March 1, 2010, BHL had identified more than 70 million potential name strings across its 28 million digitized pages using uBio's TaxonFinder. 58 million of those name strings were confirmed as a name with a NameBankID. Of that set, 1,491,000 name strings were unique. 329,000 of those unique names were found on a single page in BHL.
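To put those counts in proportion, here is a quick calculation. It treats "more than 70 million" as exactly 70 million, so the first ratio is only approximate; the variable names are just for illustration.

```python
# Figures quoted above (rounded as in the post).
candidate_strings = 70_000_000   # "more than 70 million", treated as 70 million here
confirmed_strings = 58_000_000
unique_names = 1_491_000
single_page_names = 329_000

# Share of candidate strings confirmed with a NameBankID.
confirmed_share = confirmed_strings / candidate_strings
# Share of unique names that occur on exactly one page.
single_page_share = single_page_names / unique_names

print(f"confirmed strings: {confirmed_share:.1%}")
print(f"single-page names: {single_page_share:.1%}")
```

In other words, roughly 83% of the candidate strings were confirmed, and roughly 22% of the unique names appear on only one page.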
Data:
Single-Page Names.zip (5.5MB) contains the results of the following query, executed on March 1, 2010:
-- Initial list of single-page names
SELECT NameConfirmed, NameBankID
INTO #tmpName
FROM dbo.PageName
WHERE NameBankID IS NOT NULL
GROUP BY NameConfirmed, NameBankID
HAVING COUNT(*) = 1;

-- Add the page ID and EOL ID to the results
SELECT n.PageID, t.NameConfirmed, t.NameBankID, e.EOLID
INTO #tmpFinal
FROM #tmpName t
    INNER JOIN dbo.PageName n
        ON t.NameConfirmed = n.NameConfirmed
        AND t.NameBankID = n.NameBankID
    LEFT JOIN dbo.NamebankEOL e
        ON t.NameBankID = e.NameBankID;

-- Produce the final result set
SELECT PageID, LEFT(NameConfirmed, 50) AS NameConfirmed, NameBankID, EOLID
FROM #tmpFinal
ORDER BY NameConfirmed;

-- Clean up
DROP TABLE #tmpName;
DROP TABLE #tmpFinal;
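If you don't have SQL Server handy, the same logic can be tried on toy data. The sketch below is an assumption-laden port, not the query that produced the zip file: it runs on SQLite through Python, folds the temp tables into a subquery, skips the LEFT(..., 50) truncation, and uses invented names and IDs.

```python
import sqlite3

SINGLE_PAGE_NAMES = """
SELECT n.PageID, n.NameConfirmed, n.NameBankID, e.EOLID
FROM PageName AS n
JOIN (
    -- Confirmed names that occur in exactly one PageName row
    SELECT NameConfirmed, NameBankID
    FROM PageName
    WHERE NameBankID IS NOT NULL
    GROUP BY NameConfirmed, NameBankID
    HAVING COUNT(*) = 1
) AS t
  ON t.NameConfirmed = n.NameConfirmed
 AND t.NameBankID = n.NameBankID
LEFT JOIN NamebankEOL AS e
  ON e.NameBankID = t.NameBankID
ORDER BY n.NameConfirmed
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PageName (PageID INTEGER, NameConfirmed TEXT, NameBankID INTEGER);
    CREATE TABLE NamebankEOL (NameBankID INTEGER, EOLID INTEGER);

    -- Invented rows: one name on two pages, two names on one page each,
    -- and one unconfirmed string with no NameBankID.
    INSERT INTO PageName VALUES
        (100, 'Aaa bbb', 1),
        (101, 'Aaa bbb', 1),
        (102, 'Ccc ddd', 2),
        (103, 'Eee fff', 3),
        (104, 'Ggg hhh', NULL);
    INSERT INTO NamebankEOL VALUES (1, 444), (2, 555);
    """
)

results = conn.execute(SINGLE_PAGE_NAMES).fetchall()
for row in results:
    print(row)
```

The LEFT JOIN keeps names that have no EOL match, so their EOLID comes back empty instead of the row being dropped.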

