Friday, March 26, 2010

BHL metadata improvements & impact on exports / cached data

Rod Page, frequent collaborator/agitator (that’s a compliment), recently posted a tweet about stability of ItemIDs in BHL:

I suspected this was related to metadata enhancement in one form or another, and after auditing the issue with Mike Lichtenberg, the real brain behind BHL, we determined that was the case: there was an error in the book’s descriptive metadata, and library curation resolved it.

But it’s a thorny issue, and one with implications for external systems that cache BHL data. It’s also one we need your help resolving, through feedback, discussion & comments on this post.

Let me be clear on one point: BHL IDs don’t change. We mint identifiers for all primary entities in our schema – Titles, Items, Pages, Names, Authors, Subjects, etc., all get unique identifiers, and those IDs aren’t reused.

What does happen through the course of curation is that a scanned book will have enhancements made to its descriptive metadata. This is analogous to the work that happens within library catalogues all over the world – as new facts are found about an object, or as errors are encountered, those edits are recorded in the catalogue.

The same happens within the digital BHL. We’re always striving to make our content accurate. And we’ve identified 4 scenarios that need to be resolved between BHL and those who have cached its data in their local systems:

1. Item is removed from Internet Archive & BHL due to copyright concern or major errors in digital images
Sometimes, after a book has been scanned & published, it is found either to be in conflict with copyright or to have significant errors in its scanned images (Internet Archive, our scanning partner, does the QA post-scanning). In either case the item “goes dark” at Internet Archive and through BHL: it’s not deleted, but it’s no longer publicly exposed through searches or links. If the issue is with the scanned images, the book is rescanned and the new scan is given a new ItemID.

2. Merging titles
BHL has scanned thousands of journals. Most libraries have gaps in their serial runs, especially for the kind of legacy, public domain materials we’re digitizing. It’s not uncommon for Library A to scan volumes 1-5, 7 & 9 of the Journal of Society Z and for Library B to fill in volumes 6 & 8.

What complicates this issue is that there is no single canonical library record for a title – no one record for the Journal of Society Z. Each library probably has different metadata about those volumes in their collection (vols 6 & 8 may have annotations, for example) and so the bibliographic metadata from each library will be similar but different. Each library submits its local record at the time of scanning and that record follows the scanned item for the purposes of provenance.

To present a complete run of the journal, we have data structures & interfaces that let BHLibrarians assign volumes 6 & 8 to the title with more volumes, or reassign volumes 1-5, 7 & 9 if Library B’s metadata is determined to be more complete. When we merge titles, the deprecated title is left in the system & forwards on to the “new,” more complete title.
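For anyone caching title IDs, the forwarding behaviour can be pictured as a lookup that follows redirects until it lands on a live title. This is only a sketch: the `FORWARDS` table, the function name and every ID in it are made up for illustration, and BHL does not necessarily expose its forwards in this shape.

```python
# Hypothetical forwarding table: deprecated TitleID -> TitleID it was merged into.
# These IDs are invented for the example; they are not real BHL identifiers.
FORWARDS = {101: 100, 100: 42}


def resolve_title(title_id, forwards):
    """Follow merge forwards until reaching a title that has not been deprecated."""
    seen = set()
    while title_id in forwards:
        if title_id in seen:
            # Defensive check: a cycle would otherwise loop forever.
            raise ValueError(f"forwarding loop at title {title_id}")
        seen.add(title_id)
        title_id = forwards[title_id]
    return title_id


print(resolve_title(101, FORWARDS))  # 42
```

A cache that stores the deprecated ID keeps working, because the old ID is never reused; it just needs to follow the forward before displaying or linking.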

3. Adding a new Primary or Secondary Title
And if all that wasn’t confusing enough, we haven’t even touched on the bibliographic hell of monographic series. These books can be described under one title as a journal and under another title as a monograph.

Take, for example, the following scanned book, “Some mollusks from Afghanistan”:

It has been described as a standalone work and given its own OCLC number and maybe even an ISBN. *But*, and here’s where it gets tricky, it’s been published as a monograph within the series Fieldiana Zoology:

You can thus describe and cite this digitized book as either “Some mollusks from Afghanistan” or as “Fieldiana. Zoology, new series, No. 1.” But when it came into BHL, it was only described as the monographic record “Some mollusks from Afghanistan.” It took manual curation to assign it to the series record. We have structures to describe a scanned book as having a “Primary” title as well as a “Secondary” title to reflect this duality in description and citation.
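One way to picture the Primary/Secondary arrangement is an item record that carries one primary title ID plus a list of secondary ones. The sketch below is an assumption about shape, not BHL's actual schema: the class, its field names and all the IDs are hypothetical, and treating the monograph record as primary is purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScannedItem:
    """Hypothetical model of a scanned book and the titles that describe it."""
    item_id: int
    primary_title_id: int
    secondary_title_ids: list = field(default_factory=list)

    def add_secondary(self, title_id):
        """Attach another title record without disturbing the primary one."""
        if title_id != self.primary_title_id and title_id not in self.secondary_title_ids:
            self.secondary_title_ids.append(title_id)

    def all_title_ids(self):
        """Every title this item can be cited under, primary first."""
        return [self.primary_title_id] + self.secondary_title_ids


# Made-up IDs: 9001 is the item, 500 the monograph record, 600 the series record.
book = ScannedItem(item_id=9001, primary_title_id=500)
book.add_secondary(600)  # manual curation assigns the book to the series
print(book.all_title_ids())  # [500, 600]
```

The item ID never changes in this picture; only the list of titles attached to it grows.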

{{{ And I'll note that this point is of particular interest to taxonomists trying to establish the priority of naming when describing new species or doing revisions. A topic for another post, and one that has a direct relationship to another project I lead, "Digitizing Engelmann's Legacy". Look for that in a few weeks, with slides! }}}

4. Wholesale correction of title metadata
Sometimes (rarely) a book comes into BHL with completely wrong metadata, but the scans are fine. In this case we correct the error with the right metadata, and use the same structure for Primary/Secondary titles as described above.

And, in reviewing Rod’s tweet & subsequent information he sent, that looks to be what happened here: the primary title for item 46212 was changed to title 1594. Title 12649 remains as a secondary title.

The title IDs still point to the same titles, and the item IDs still point to the same scanned items... nothing has disappeared, and no IDs have changed. The relationship between titles and items is what has been adjusted.
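For a system holding a cached copy, the practical check is a comparison of the item's title relationships then and now. Here is a rough sketch using the IDs from this case. The record layout and function names are hypothetical, and the "cached" state assumes title 12649 was the only title attached when the data was first harvested.

```python
def title_set(record):
    """All title IDs attached to an item record, primary and secondary."""
    return {record["primary"]} | set(record["secondary"])


def diff_item(cached, fresh):
    """Report how an item's title relationships changed between two snapshots."""
    if cached["item_id"] != fresh["item_id"]:
        raise ValueError("records describe different items")
    old, new = title_set(cached), title_set(fresh)
    return {
        "primary_changed": cached["primary"] != fresh["primary"],
        "titles_added": new - old,
        "titles_removed": old - new,
    }


# Assumed cached snapshot (before curation) and current state (after curation).
cached = {"item_id": 46212, "primary": 12649, "secondary": set()}
fresh = {"item_id": 46212, "primary": 1594, "secondary": {12649}}

changes = diff_item(cached, fresh)
print(changes["primary_changed"])  # True
print(changes["titles_added"])     # {1594}
print(changes["titles_removed"])   # set()
```

Nothing is removed in this example, which matches the point above: the old relationship is kept, and a new primary one is added.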

Also, in this case we've actually improved the situation. The title page clearly identifies the item as part XVIII (1850) of the Proceedings of the Zoological Society of London. So, the relationship to title 12649 (Lietuvos TSR Mokslu Akademijos Darbuotoju Knygu ir Straipsniu Bibliografija... i.e. a bibliography of books and articles by staff of the Lithuanian SSR Academy of Sciences) looks to be a mistake, but it’s how the book entered our system. We’ve kept that fact for the purpose of completeness, but now show the “correct” title when that item is viewed through BHL.

BHL is a dynamic system & is updated by more than 12 libraries today with more coming online this year and next. The disconnect comes when users cache BHL data and don’t have these new facts.

And because this post is WAAAAY long, I will stop now and turn it back to you - here’s where we need your feedback, discussion & recommendations. How should BHL express these changed relationships in its exports & services?

Friday, March 19, 2010

Number of BHL names found on only 1 page

As of March 1, 2010, BHL had identified more than 70 million potential name strings across its 28 million digitized pages using uBio's TaxonFinder. 58 million of those name strings were confirmed as a name with a NameBankID. Of that set, 1,491,000 name strings were unique. 329,000 of those unique names were found on a single page in BHL.

Single-Page (5.5MB) contains the results of the following query, executed on March 1, 2010:

-- Initial list of single-page names: keep only the names
-- that appear on exactly one distinct page
SELECT NameConfirmed, NameBankID
INTO #tmpName
FROM dbo.PageName
GROUP BY NameConfirmed, NameBankID
HAVING COUNT(DISTINCT PageID) = 1

-- Add the page ID and EOL ID to the results
SELECT n.PageID, t.NameConfirmed, t.NameBankID, e.EOLID
INTO #tmpFinal
FROM #tmpName t INNER JOIN dbo.PageName n
ON t.NameConfirmed = n.NameConfirmed
AND t.NameBankID = n.NameBankID
LEFT JOIN dbo.NamebankEOL e
ON t.NameBankID = e.NameBankID

-- Produce the final result set (names truncated to 50 characters)
SELECT PageID, LEFT(NameConfirmed, 50) AS NameConfirmed, NameBankID, EOLID
FROM #tmpFinal ORDER BY NameConfirmed

-- Clean up both temp tables
DROP TABLE #tmpName
DROP TABLE #tmpFinal
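
If you are working from an export rather than the database, the same "found on only one page" filter can be approximated in a few lines of Python. This is a sketch under assumptions: the row layout `(page_id, name_confirmed, namebank_id)` and the sample rows are invented for the example, and the EOL lookup is left out.

```python
from collections import defaultdict


def single_page_names(page_names):
    """Return {(name_confirmed, namebank_id): page_id} for names on exactly one page.

    page_names is an iterable of (page_id, name_confirmed, namebank_id) rows.
    """
    pages = defaultdict(set)
    for page_id, name, namebank_id in page_names:
        pages[(name, namebank_id)].add(page_id)
    # Keep only names whose set of distinct pages has a single member.
    return {key: next(iter(ids)) for key, ids in pages.items() if len(ids) == 1}


# Made-up sample rows; the IDs are not real PageIDs or NameBankIDs.
rows = [
    (1, "Aus bus", 10),
    (2, "Aus bus", 10),   # second page, so "Aus bus" is excluded
    (3, "Cus dus", 20),
    (3, "Cus dus", 20),   # repeated on the same page, still a single page
]
print(single_page_names(rows))  # {('Cus dus', 20): 3}
```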

Sunday, March 14, 2010

Break Bread for Brad

BB4B button, originally uploaded by Andrew Huff.