Friday, March 26, 2010

BHL metadata improvements & impact on exports / cached data

Rod Page, frequent collaborator/agitator (that’s a compliment), recently posted a tweet about stability of ItemIDs in BHL:
http://twitter.com/rdmpage/status/10884940673

I suspected this was related to metadata enhancement in one form or another, and after auditing the issue with Mike Lichtenberg, the real brain behind BHL, we determined that was the case – there was an error in the descriptive metadata for the book & through library curation the problem was resolved.

But it’s a thorny issue, and one with implications for external systems that cache BHL data. And it’s one we need your help resolving, through your feedback, discussion & comments on this post.

Let me be clear on one point: BHL IDs don’t change. We mint identifiers for all primary entities in our schema – Titles, Items, Pages, Names, Authors, Subjects, etc., all get unique identifiers, and those IDs aren’t reused.

What does happen through the course of curation is that a scanned book will have enhancements made to its descriptive metadata. This is analogous to the work that happens within library catalogues all over the world – as new facts are found about an object, or as errors are encountered, those edits are recorded in the catalogue.

The same happens within the digital BHL. We’re always striving to make our content accurate. And we’ve identified 4 scenarios that need to be resolved between BHL and those who have cached its data in their local systems:

1. Item is removed from Internet Archive & BHL due to copyright concern or major errors in digital images
Sometimes, after scanning & publishing, a digitized book is found to be in conflict with copyright, or significant errors are identified in its scanned images (Internet Archive, our scanning partner, does the QA post-scanning). In either case the item “goes dark” at Internet Archive and through BHL – it’s not deleted, but it’s no longer publicly exposed through searches or links. If it’s an issue with the scanned images, the book is rescanned and given a new ItemID.

2. Merging titles
BHL has scanned thousands of journals. Most libraries have gaps in their serial runs, especially for the kind of legacy, public domain materials we’re digitizing. It’s not uncommon for Library A to scan volumes 1-5, 7 & 9 of the Journal of Society Z and for Library B to fill in volumes 6 & 8.

What complicates this issue is that there is no single canonical library record for a title – no one record for the Journal of Society Z. Each library probably has different metadata about those volumes in their collection (vols 6 & 8 may have annotations, for example) and so the bibliographic metadata from each library will be similar but different. Each library submits its local record at the time of scanning and that record follows the scanned item for the purposes of provenance.

To present a complete run of the journal, we have data structures & interfaces that give BHLibrarians the ability to assign volumes 6 & 8 to the title with more volumes, or to reassign volumes 1-5, 7 & 9 if Library B’s metadata is determined to be more complete. When we merge titles, the deprecated title is left in the system & forwards on to the “new” more complete title.
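For anyone caching BHL data, the forwarding behaviour described above could be followed with something like the Python sketch below. It is only an illustration: the `merged_into` mapping and the `resolve_title` function are hypothetical names, the IDs are made up, and none of this reflects BHL's actual schema or services.

```python
def resolve_title(title_id, merged_into):
    """Follow forwarding links from a deprecated title to the current one.

    merged_into maps a deprecated TitleID to the TitleID it was merged into.
    This is a hypothetical structure for illustration only.
    """
    seen = set()
    while title_id in merged_into:
        if title_id in seen:
            # Guard against a forwarding loop in bad data.
            raise ValueError(f"forwarding loop at title {title_id}")
        seen.add(title_id)
        title_id = merged_into[title_id]
    return title_id


# Made-up IDs: title 200 was merged into 100, and title 300 into 200.
merged_into = {200: 100, 300: 200}
print(resolve_title(300, merged_into))  # 100
print(resolve_title(100, merged_into))  # 100 (never deprecated, so unchanged)
```

Because the deprecated title stays in the system, a cached TitleID never dangles; it just needs one or more hops to reach the more complete record.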

3. Adding a new Primary or Secondary Title
And if all that wasn’t confusing enough, we haven’t even touched on the bibliographic hell of monographic series. These books can be described under one title as a journal and under another title as a monograph.

Take, for example, the following scanned book, “Some mollusks from Afghanistan”:
http://www.biodiversitylibrary.org/item/21009

It has been described as a standalone work and given its own OCLC number and maybe even an ISBN. *But*, and here’s where it gets tricky, it’s been published as a monograph within the series Fieldiana Zoology: http://www.biodiversitylibrary.org/bibliography/42256

You can thus describe and cite this digitized book as either “Some mollusks from Afghanistan” or as “Fieldiana. Zoology, new series, No. 1.” But when it came into BHL, it was only described as the monographic record “Some mollusks from Afghanistan.” It took manual curation to assign it to the series record. We have structures to describe a scanned book as having a “Primary” title as well as a “Secondary” title to reflect this duality in description and citation.
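A minimal Python sketch of that Primary/Secondary idea might look like the following. The `ScannedItem` class and its field names are assumptions made for illustration, not BHL's real data model. Item 21009 and series title 42256 come from the links above; the monograph title ID 99999 is invented, and which of the two titles is primary is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScannedItem:
    """One scanned book linked to a primary title and any secondary titles."""
    item_id: int
    primary_title_id: int
    secondary_title_ids: List[int] = field(default_factory=list)

    def title_ids(self) -> List[int]:
        # Primary first, then secondaries, so either description finds the item.
        return [self.primary_title_id, *self.secondary_title_ids]


# 99999 is an invented monograph title ID; 42256 is the Fieldiana Zoology series.
mollusks = ScannedItem(item_id=21009, primary_title_id=99999,
                       secondary_title_ids=[42256])
print(42256 in mollusks.title_ids())  # True
print(99999 in mollusks.title_ids())  # True
```

The point of the structure is that the ItemID stays fixed while the list of titles it belongs to can grow or be reordered.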

{{{ And I'll note that this point is of particular interest to taxonomists trying to establish the priority of naming when describing new species or doing revisions. A topic for another post, and one that has a direct relationship to another project I lead, "Digitizing Engelmann's Legacy". Look for that in a few weeks, with slides! }}}

4. Wholesale correction of title metadata
Sometimes (rarely) a book comes into BHL with completely wrong metadata, but the scans are fine. In this case we correct the error with the right metadata, and use the same structure for Primary/Secondary titles as described above.

And, in reviewing Rod’s tweet & subsequent information he sent, that looks to be what happened here: The primary title for item 46212 was changed to title 1594. Title 12649 remains as a secondary title.

The title IDs still point to the same titles, and the item IDs still point to the same scanned items... nothing has disappeared, and no IDs have changed. The relationship between titles and items is what has been adjusted.
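For those holding a local copy, one way to spot this kind of adjustment is to compare cached item-to-title links against a fresh export. The Python sketch below assumes a made-up dictionary layout (`primary` and `secondary` keys) rather than the real format of the BHL data dump; the IDs are the ones from this case.

```python
def changed_relationships(cached, current):
    """Return (item_id, old_links, new_links) for items whose title links differ.

    Both arguments map ItemID -> {"primary": TitleID, "secondary": [TitleIDs]}.
    This layout is hypothetical, chosen only for illustration.
    """
    changes = []
    for item_id, old_links in cached.items():
        new_links = current.get(item_id)
        # Items missing from the current data are skipped here; they need
        # separate handling (e.g. an item that has gone dark).
        if new_links is not None and new_links != old_links:
            changes.append((item_id, old_links, new_links))
    return changes


cached = {46212: {"primary": 12649, "secondary": []}}
current = {46212: {"primary": 1594, "secondary": [12649]}}

for item_id, old, new in changed_relationships(cached, current):
    print(item_id, old["primary"], "->", new["primary"])  # 46212 12649 -> 1594
```

Note that the comparison is keyed on ItemID, which is exactly the part that did not change.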

Also, in this case we've actually improved the situation. The title page clearly identifies the item as part XVIII (1850) of the Proceedings of the Zoological Society of London. So the relationship to title 12649 (Lietuvos TSR Mokslu Akademijos Darbuotoju Knygu ir Straipsniu Bibliografija, i.e. Bibliography of Books and Articles by Staff of the Lithuanian SSR Academy of Sciences) looks to be a mistake, but it’s how the book entered our system. We’ve kept that fact for the purpose of completeness, but now show the “correct” title when that item is viewed through BHL.

BHL is a dynamic system & is updated by more than 12 libraries today with more coming online this year and next. The disconnect comes when users cache BHL data and don’t have these new facts.

And because this post is WAAAAY long, I will stop now and turn it back to you - here’s where we need your feedback, discussion & recommendations. How should BHL express these changed relationships in its exports & services?

5 comments:

Martin said...

Random first thought (which plays havoc with "unique"): Could multiple BHL identifiers (for items that fall into the classes described above) all resolve to a new "standard" BHL identifier?

Rod Page said...

Thanks for the explanation Chris. I guess I encountered these problems because for various reasons (such as changing structure of the BHL data dump) I've found it easier to use the original download of BHL data I got last year, and then scrape the BHL web page for specific items as I need them. Sometimes I get caught out by the changes you describe, so that my local version of BHL is out of sync with BHL itself.

ItemIDs being reassigned to new TitleIDs can cause problems, as my code has a set of rules for some titles that require special treatment (e.g., in cases where BHL doesn't have enough metadata for me to accurately find an article). I guess there's no solution other than manually investigating when things go wrong.

In the example you describe, I hadn't bothered to check that the problem was at the title level. I saw "Lietuvos TSR Mokslu Akademijos Darbuotoju Knygu ir Straipsniu Bibliografija" and assumed the item was part of that title, rather than being a volume of Proc Zool Soc assigned to the wrong title.

Anonymous said...

Chris, you have opened the door for a flood of accumulated notes and comments.

BHL is an entirely new kind of library, something never before seen. This presents an array of problems not faced by traditional libraries, but also creates the opportunity to experiment, and to solve problems in new and useful ways. BHL fills a variety of needs, and creates new needs not anticipated before BHL made them possible.

Need for link stability: In the application I am collaborating with Evgeniy Meyke to develop, we link taxon names to the specific page images on which they occur. Many of these pages are from BHL. Others are from other Internet sources, author’s PDFs, and my own literature scans. Regarding BHL content, I extract relevant papers from BHL Items, and keep them locally. I also store the BHL PageID and ItemID, but these only come into play when sharing data with others. For raw BHL search results, however, only the PageID and ItemID are stored. These search results may age for some months while I evaluate and process them. Thus far I have encountered only one missing item, Proceedings of the USNM, volume 41 (http://www.biodiversitylibrary.org/item/53811). I inquired about what may have happened to it a month ago today. You said that it had been “marked offline” due to serious errors, and would be rescanned and reindexed. (As of today it is still offline.) This was the first indication we had that there might be a problem with BHL identifier durability.

I also store the sources of bibliographic citations, and reviews of papers that I may need to consult again, from, for example, Revue Critique de Paléozoologie or Archiv für Naturgeschichte. For these, I store only the BHL page link, not the page or article itself. I now understand that these references are fragile, in that if the pageID changes, the source will be lost.

For my purposes, then, the reliability of PageID and ItemID references is most important, TitleID not so much. The needs of others are, of course, different. One example: Robert Moore of the Pacific Conchological Club has been compiling, for more than a year, a catalog of works pertaining to Mollusca that are available on the Internet. His bibliography, in the form of five PDF documents (http://mysite.verizon.net/tjrutkas/page7.html), includes many BHL items, though I believe all of the links are by way of IA URLs. If an item goes dark on IA, his link would be broken.

It is clear that some changes are unavoidable. The question, then, is what can BHL do to mitigate the impact? There is no getting around the copyright issue you mentioned. The item must be removed. But rather than simply deleting all trace of it, perhaps the title, its item(s), and the pages could be replaced with a message explaining why it is no longer available.

Regarding BHL items with errors, is it really necessary to completely remove and replace them? In some cases, perhaps. But I am guessing that in most cases it is a matter of tradeoffs and the relative value placed on expedience vs. stability. With the existing workflow and division of labor, it is surely easier to replace an entire volume than to fix it. Based on the small sample of BHL items I have personally examined (~2%), the few problems I have encountered consist of a few missing or poorly photographed pages, or a defective original book. Defective items could be repaired, preserving the existing itemID and pageID links. The books should not be removed from availability while being repaired. If it is the book, and not the copy, that is defective, it should be left as is, with a second copy added if the opportunity presents itself. When a volume must be replaced, the original item and pages should be excluded from indexing but left accessible to external links, and should include a referral or link to the replacement Item. I no longer report content problems because I don’t want to be responsible for causing the items to be removed, even temporarily, from availability on BHL.

To be continued...

Anonymous said...

Cataloging (metadata) issues, apart from outright errors such as the one mentioned, can be nebulous and subjective. Cataloging is as much art as science, developing over many decades within each institution to fit its particular collections, preferences, and patrons. BHL faces a remarkable challenge to integrate all this diversity, inconsistency, and duplication into one homogeneous and useful collection. But at the same time, BHL has the unique quality of being a digital library, a virtual library, with all that implies. In a conventional library, a monograph in series can only be shelved as a separate monograph or as a member of the series. In BHL, it can be both. It can, in effect, be in two places at once. Journals may be cataloged under any number of titles, first, best known, or current, American or British style. For example, is the journal called Nachrichtsblatt der Deutschen Malakologischen Gesellschaft, Nachrichtsblatt der Deutschen Malakozoologischen Gesellschaft, or Archiv fuer Molluskenkunde? On BHL, it’s the second. In another library it is the first and third. A search of BHL for any one of them could, and should, find the journal.

There are both advantages and problems in multiple copies of items existing in BHL. I think the downside of duplication, messing with statistics of the sort Rod Page likes and Pivot displays and duplication of index terms, is outweighed by the positives. Shortcomings in one copy may be corrected in another. Eccentricities in one volume may be confirmed as general, or not, by a second. In some copies, issue wrappers (helpful in dating the contents) are bound in, while not in others. Useful manuscript notes, indexes, and so on are sometimes to be found in one copy but not another.

OCR errors in two copies of the same work are different, resulting in different indexing and search results, thus improving the chances of a name being found in at least one of them. In the three instances that I counted, the number of name page instances that the second copy added were 42+2 (5%), 107+20 (19%), and 33+7 (21%), 16% overall.

A final thought, and I will stop babbling. It would be useful for there to be a virtual notes page in each Item on which patrons could record useful information about the item such as “Issue wrappers bound at end”, “Title page and table of contents inserted before index” or “Exact dates of publication listed in front matter of volume 32.”

Pat LaFollette

Jim said...

"You can thus describe and cite this digitized book as either 'Some mollusks from Afghanistan' or as 'Fieldiana. Zoology, new series, No. 1'".

Is this true? Even though the content may be identical, they are different publications, different references and will need different IDs. The apparent duplication is unavoidable because you cannot assume that, although declared to be the same thing, minor 'corrections' have not been made between one version and the next.

And as you allude, in matters of taxonomic priority, 'there can be only one'. It is surprising how frequently this parallel publication occurs in the real world. From my perspective, once is too often. :)

jim