Thursday, October 28, 2010

You Too Can Participate in International Botanical Congress 2011

I am holding a symposium at next year's International Botanical Congress in Melbourne, Australia, entitled "Informatics Tools for the Semantic Enhancement of Taxonomic Literature."

The abstract for the symposium is included here:
In recent years the landscape has changed dramatically regarding the availability of digital taxonomic literature, both contemporary publications and legacy texts. Projects like the Biodiversity Heritage Library and Plazi, among others, have digitized and made available a wealth of scientific texts that support the online review of protologues and species descriptions. While this development has been exceptionally useful for scholars and has undoubtedly expedited the taxonomic process, making this literature available in digital form also opens the possibility of new secondary analyses that are impossible to accomplish with traditional printed texts.

Scholars working in natural language processing, semantic markup, and other efforts within biodiversity informatics are developing new tools for the use of these digitized materials beyond the traditional human-paper interaction. These new human-machine and machine-machine interactions are facilitated by emerging software tools that enhance the traditional scientific publication, turning these texts into rich, interactive datasets that can be incorporated into other analyses.

This seminar will explore the motivation behind the digitization of historic taxonomic literature as well as the contemporary publication of new treatments and texts, and how those texts can be enhanced by these new informatics tools. Panelists will review the progress made through both legacy digitization and contemporary publication, and special focus will be given to scholars who are currently building the informatics tools that help provide fine-grained, semantic description of traditional taxonomic texts. Using these novel algorithms and applications, presenters will detail how taxonomic publications can be enhanced through semantic description and how these enriched texts can expedite the taxonomic process and facilitate the open sharing of organismal data to a global audience of scholars and students.
Abstract submission is open through 31 October 2010 at:
http://www.ibc2011.com/Abstracts.htm

Tuesday, July 13, 2010

Digitizing Engelmann's Legacy



Mapping Plant Specimens that Document the Great American Frontier
ESRI Education User Conference – July 13, 2010 – San Diego, CA
http://www.tropicos.org/Project/Engelmann

Monday, April 19, 2010

BHL poster for AETFAT2010

Due to the volcano in Iceland, I may or may not be going to Madagascar for the AETFAT conference on Thursday as planned. I'm routed through London and everything else is full, so I think I'll have to make a go/no-go decision on Thursday morning. One of my main reasons for going to the conference was to present this poster (and another one for Tropicos) and I hope I get to display it because I think it turned out really well! Using a variety of open software and open data, I made a photomosaic of Africa and Madagascar from the title pages of books tagged with "Africa" or "Madagascar" in the Biodiversity Heritage Library. Here's how I did it:

1. Downloaded the BHL schema from http://www.biodiversitylibrary.org/data/BHLExportSchema.pdf and the following data exports:

Title: http://www.biodiversitylibrary.org/data/title.txt (10MB+)

Subject: http://www.biodiversitylibrary.org/data/subject.txt (3MB+)

Item: http://www.biodiversitylibrary.org/data/item.txt (14MB+)


2. Imported those text files into tables in a simple db app (MySQL or Access). I set up a One-to-Many relationship from Title.TitleID to both Subject.TitleID and Item.TitleID, describing how a title ("Flore de Madagascar") has shared data in subjects ("Madagascar") and items ("Volume 25"). Note the field Item.ThumbnailPageID, which holds the pageID of the image marked as the Title Page or, if no Title Page is selected, a representative page of interest from the book.


3. Using a simple query editor I created a SQL statement to select the ThumbnailPageID from digitized items whose titles are tagged with subjects matching "%Africa%" or "%Madagascar%". The wildcards also pulled in subjects like "South Africa" and "Madagascar, Central."


4. Using BHL's API documentation for images, I prepended "http://biodiversitylibrary.org/pagethumb/" to each of the pageIDs from step 3. This field now contains the link to the page image for each of the 851 title pages.


5. I used a download manager (Speed Download for Mac OS X; there are plenty for Win/Unix) to grab those 851 JPGs. At the default size returned, each tile was small: 200 pixels wide and averaging 8 KB.


6. I used the map of Africa and Madagascar from UiO as a reference image because it didn't have the sea terrain present, which muddled my first few attempts. I blew that image up using *proprietary software alert* Adobe Photoshop. You can use other imaging software to do the same, but I like Photoshop. I made a blank image roughly 3'x4' at 300 dpi and pasted in the source image, then scaled it to the size of the poster.


7. I then used MacOSaiX to build the photomosaic. This is where all the magic happens, and where I did the least. I just told the app to use the reference image from 6 & the thumbnails from 5 to build the mosaic, and off it went. After 40 minutes or so it beeped and said it was done. Voila! A photomosaic of Africa and Madagascar made from title pages of open access science books.


8. To make the poster I pasted the JPG into *proprietary software alert* Microsoft PowerPoint, because it's surprisingly easy to use for poster layout. Dropped in some text, logo, & a URL and there you have it - a cool poster using open data and (mostly) open software.


You can download the finished poster here as a 1MB JPG.


I'm purposefully documenting how I did this to encourage others to incorporate BHL data into their visualizations & presentations. BHL is an incredibly rich dataset with open access policies and open APIs, and this is but one simple example of how I was able to filter data and extract compelling images from the millions we have scanned.
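For anyone who would rather script steps 2 through 4 than click through a db app, here is a minimal sketch in Python using an in-memory SQLite database. Only Title.TitleID, Subject.TitleID, Item.TitleID, Item.ThumbnailPageID and the pagethumb prefix come from the steps above; the other column names (ShortTitle, SubjectText, ItemID), the function names and the sample rows are assumptions made up for illustration, so check them against the BHL export schema before relying on them.

```python
import sqlite3

# URL prefix from step 4. The table layout below is an assumed, simplified
# stand-in for the real BHL export schema.
THUMB_PREFIX = "http://biodiversitylibrary.org/pagethumb/"


def build_thumbnail_urls(conn):
    """Return thumbnail URLs for items whose titles are tagged Africa or Madagascar."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.ThumbnailPageID
        FROM Title t
        JOIN Subject s ON s.TitleID = t.TitleID
        JOIN Item i ON i.TitleID = t.TitleID
        WHERE (s.SubjectText LIKE '%Africa%' OR s.SubjectText LIKE '%Madagascar%')
          AND i.ThumbnailPageID IS NOT NULL
        ORDER BY i.ThumbnailPageID
        """
    ).fetchall()
    # DISTINCT keeps a page from repeating when a title matches several subjects.
    return [THUMB_PREFIX + str(page_id) for (page_id,) in rows]


def make_demo_db():
    """Build a tiny in-memory database with made-up rows (not real BHL IDs)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Title (TitleID INTEGER PRIMARY KEY, ShortTitle TEXT);
        CREATE TABLE Subject (TitleID INTEGER, SubjectText TEXT);
        CREATE TABLE Item (ItemID INTEGER PRIMARY KEY, TitleID INTEGER,
                           ThumbnailPageID INTEGER);
        INSERT INTO Title VALUES (1, 'Flore de Madagascar'), (2, 'A made-up European flora');
        INSERT INTO Subject VALUES (1, 'Madagascar'), (1, 'Madagascar, Central'), (2, 'Europe');
        INSERT INTO Item VALUES (10, 1, 1001), (11, 1, 1002), (20, 2, 2001);
        """
    )
    return conn


if __name__ == "__main__":
    for url in build_thumbnail_urls(make_demo_db()):
        print(url)
```

The resulting list of URLs is what gets handed to the download manager in step 5.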

Friday, March 26, 2010

BHL metadata improvements & impact on exports / cached data

Rod Page, frequent collaborator/agitator (that’s a compliment), recently posted a tweet about stability of ItemIDs in BHL:
http://twitter.com/rdmpage/status/10884940673

I suspected this was related to metadata enhancement in one form or another, and after auditing the issue with Mike Lichtenberg, the real brain behind BHL, we determined that was the case – there was an error in the descriptive metadata for the book & through library curation the problem was resolved.

But it’s a thorny issue, and one with implications for external systems that cache BHL data. And it’s one we need your help resolving, through feedback, discussion & comments on this post.

Let me be clear on one point: BHL IDs don’t change. We mint identifiers for all primary entities in our schema – Titles, Items, Pages, Names, Authors, Subjects, etc., all get unique identifiers, and those IDs aren’t reused.

What does happen through the course of curation is that a scanned book will have enhancements made to its descriptive metadata. This is analogous to the work that happens within library catalogues all over the world – as new facts are found about an object, or as errors are encountered, those edits are recorded in the catalogue.

The same happens within the digital BHL. We’re always striving to make our content accurate. And we’ve identified 4 scenarios that need to be resolved between BHL and those who have cached its data in their local systems:

1. Item is removed from Internet Archive & BHL due to copyright concern or major errors in digital images
Sometimes, after scanning & publishing, a digitized book is found to be in conflict with copyright, or significant errors are identified in its scanned images (Internet Archive, our scanning partner, does the QA post-scanning). In either case the item “goes dark” at Internet Archive and through BHL – it’s not deleted, but it’s no longer publicly exposed through searches or links. If it’s an issue with the scanned images, the book is rescanned and given a new ItemID.

2. Merging titles
BHL has scanned thousands of journals. Most libraries have gaps in their serial runs, especially for the kind of legacy, public domain materials we’re digitizing. It’s not uncommon for Library A to scan volumes 1-5, 7 & 9 of the Journal of Society Z and for Library B to fill in volumes 6 & 8.

What complicates this issue is that there is no single canonical library record for a title – no one record for the Journal of Society Z. Each library probably has different metadata about those volumes in their collection (vols 6 & 8 may have annotations, for example) and so the bibliographic metadata from each library will be similar but different. Each library submits its local record at the time of scanning and that record follows the scanned item for the purposes of provenance.

To present a complete run of the journal, we have data structures & interfaces that give BHLibrarians the ability to assign volumes 6 & 8 to the title with more volumes, or, to reassign 1-5, 7 & 9 if Library B’s metadata is determined to be more complete. When we merge titles, the deprecated title is left in the system & forwards on to the “new” more complete title.

3. Adding a new Primary or Secondary Title
And if all that wasn’t confusing enough, we haven’t even touched on the bibliographic hell of monographic series. These books can be described under one title as a journal and under another title as a monograph.

Take, for example, the following scanned book, “Some mollusks from Afghanistan”:
http://www.biodiversitylibrary.org/item/21009

It has been described as a standalone work and given its own OCLC number and maybe even an ISBN. *But*, and here’s where it gets tricky, it’s been published as a monograph within the series Fieldiana Zoology: http://www.biodiversitylibrary.org/bibliography/42256

You can thus describe and cite this digitized book as either “Some mollusks from Afghanistan” or as “Fieldiana. Zoology, new series, No. 1.” But when it came into BHL, it was only described as the monographic record “Some mollusks from Afghanistan.” It took manual curation to assign it to the series record. We have structures to describe a scanned book as having a “Primary” title as well as a “Secondary” title to reflect this duality in description and citation.

{{{ And I'll note that this point is of particular interest to taxonomists trying to establish the priority of naming when describing new species or doing revisions. A topic for another post, and one that has a direct relationship to another project I lead, "Digitizing Engelmann's Legacy". Look for that in a few weeks, with slides! }}}

4. Wholesale correction of title metadata
Sometimes (rarely) a book comes into BHL with completely wrong metadata, but the scans are fine. In this case we correct the error with the right metadata, and use the same structure for Primary/Secondary titles as described above.

And, in reviewing Rod’s tweet & subsequent information he sent, that looks to be what happened here: The primary title for item 46212 was changed to 1594. Title 12649 remains as a secondary title.

The title IDs still point to the same titles, and the item IDs still point to the same scanned items... nothing has disappeared, and no IDs have changed. The relationship between titles and items is what has been adjusted.

Also, in this case we've actually improved the situation. The title page clearly identifies the item as part XVIII (1850) of the Proceedings of the Zoological Society of London. So, the relationship to title 12649 (Lietuvos TSR Mokslu Akademijos Darbuotoju Knygu ir Straipsniu Bibliografija... i.e. Lithuanian SSR Academy of Sciences of Employees bibliography of books and articles.) looks to be a mistake, but it’s how the book entered our system. We’ve kept that fact for the purpose of completeness, but now show the “correct” title when that item is viewed through BHL.

BHL is a dynamic system & is updated by more than 12 libraries today with more coming online this year and next. The disconnect comes when users cache BHL data and don’t have these new facts.

And because this post is WAAAAY long, I will stop now and turn it back to you - here’s where we need your feedback, discussion & recommendations. How should BHL express these changed relationships in its exports & services?
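To make the caching problem concrete, here is a minimal consumer-side sketch in Python of how a local system might notice that an item's primary title has been reassigned. It is not a description of any BHL export or service: the idea of comparing two ItemID-to-primary-TitleID mappings, the function name, and the placeholder IDs in the 900000 range are all assumptions for illustration. Only item 46212 and titles 12649 and 1594 come from the case described above.

```python
def diff_primary_titles(cached, fresh):
    """Compare two {ItemID: primary TitleID} mappings.

    Returns (reassigned, missing):
      reassigned maps ItemID -> (old TitleID, new TitleID)
      missing lists ItemIDs that are in the cache but not in the fresh data,
      e.g. items that have "gone dark".
    """
    reassigned = {}
    missing = []
    for item_id, old_title in cached.items():
        if item_id not in fresh:
            missing.append(item_id)
        elif fresh[item_id] != old_title:
            reassigned[item_id] = (old_title, fresh[item_id])
    return reassigned, sorted(missing)


# What a local cache might hold. IDs in the 900000 range are hypothetical.
cached = {46212: 12649, 900001: 900002, 900003: 900004}

# What a later download might show: item 46212 now has primary title 1594,
# and hypothetical item 900003 is no longer exposed.
fresh = {46212: 1594, 900001: 900002}

reassigned, missing = diff_primary_titles(cached, fresh)
print(reassigned)
print(missing)
```

Note that nothing here requires an ID to change. The check only looks at the relationship between items and titles, which is the part that curation adjusts.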

Friday, March 19, 2010

Number of BHL names found on only 1 page

As of March 1, 2010, BHL had identified more than 70 million potential name strings across its 28 million digitized pages using uBio's TaxonFinder. 58 million of those name strings were confirmed as a name with a NameBankID. Of that set, 1,491,000 name strings were unique. 329,000 of those unique names were found on a single page in BHL.

Data:
Single-Page Names.zip (5.5MB) contains the results of the following query, executed on March 1, 2010:


-- Initial list of single-page names
SELECT NameConfirmed, NameBankID
INTO #tmpName
FROM dbo.PageName
WHERE NameBankID IS NOT NULL
GROUP BY NameConfirmed, NameBankID
HAVING COUNT(*) = 1

-- Add the page ID and EOL ID to the results
SELECT n.PageID, t.NameConfirmed, t.NameBankID, e.EOLID
INTO #tmpFinal
FROM #tmpName t INNER JOIN dbo.PageName n
ON t.NameConfirmed = n.NameConfirmed
AND t.NameBankID = n.NameBankID
LEFT JOIN dbo.NamebankEOL e
ON t.NameBankID = e.NameBankID

-- Produce the final result set
SELECT PageID, LEFT(NameConfirmed, 50) AS NameConfirmed, NameBankID, EOLID
FROM #tmpFinal ORDER BY NameConfirmed

-- Clean up
DROP TABLE #tmpName
DROP TABLE #tmpFinal
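
For readers without SQL Server handy, the core of the query above (keep confirmed names that occur on exactly one page) can be sketched in Python. This is a rough equivalent under stated assumptions, not the code that produced the download: it works on an in-memory list of (PageID, NameConfirmed, NameBankID) tuples, skips the EOL join, and the sample names and IDs are made up for illustration.

```python
from collections import Counter


def single_page_names(page_names):
    """Return rows whose (NameConfirmed, NameBankID) pair occurs exactly once.

    page_names: iterable of (PageID, NameConfirmed, NameBankID) tuples.
    Rows without a NameBankID are ignored, mirroring the WHERE clause.
    """
    confirmed = [row for row in page_names if row[2] is not None]
    counts = Counter((name, bank_id) for _, name, bank_id in confirmed)
    singles = [
        (page_id, name, bank_id)
        for page_id, name, bank_id in confirmed
        if counts[(name, bank_id)] == 1
    ]
    # Same ordering as the final SELECT: by confirmed name.
    return sorted(singles, key=lambda row: row[1])


# Made-up sample rows; the NameBankIDs are placeholders, not real identifiers.
sample = [
    (1, "Quercus alba", 101),
    (2, "Quercus alba", 101),
    (3, "Acer rubrum", 202),
    (4, "Unconfirmed string", None),
]

print(single_page_names(sample))
```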

Sunday, March 14, 2010

Break Bread for Brad


BB4B button, originally uploaded by Andrew Huff.

Friday, January 22, 2010

Remembering Brad Graham: The Lesser Kudu




On January 24, 2010, the Repertory Theatre of St. Louis will open its doors to the public for "A Memorial for Brad L. Graham," 6pm at the Loretto-Hilton Center, 130 Edgar Road, Webster Groves, Missouri 63119. More details, as well as an online Remembrance Book and memorial donation form, on the Rep's (fantastic, courtesy of Brad) web site: http://repstl.org/brad

If you can't attend the service, I encourage you to check out some of Brad's writings on his blog-that's-so-oldschool-it's-called-a-weblog, "Bradlands": http://bradlands.com. For me, The Lesser Kudu is perhaps his finest work; it really captures his spirit and outlook on life. I've gone back to it several times over these past three weeks and am always left upbeat, inspired. I'm reposting it below, with links to the original source of course:

The lesser kudu
My favorite animal at the Zoo is the lesser kudu. You have to admire an animal with a name like that, laboring as he must in the shadow of the greater kudu. It must be like having an older brother who excelled at sports and academics in school, to whom you have always been compared and found lacking. A few months ago, I was visiting the Zoo at lunch with a friend and discovered the area where the lesser kudu is ordinarily found was empty.

I hope he made a break for it. I hope he made his way out into the world, free of expectations, shedding labels, determined only to be the best damn kudu he could be.


--Brad Graham (originally posted July 28, 2000)







Monday, December 07, 2009

8 days 12 hours 49 minutes


2009 Flight Path, originally uploaded by chrisfreeland2002.


8 days 12 hours 49 minutes...or...the approximate amount of time I spent on an airplane this year. Check out more of my stats at OpenFlights.org.

More stats:
Distance
Total flown: 86694 mi
Around the world: 3.48x
To the Moon: 0.363x
To Mars: 0.0025x

Unique
Airports: 19
Carriers: 5
Countries: 8

19 airports in 8 countries
Northernmost: CPH (55.62°N 12.66°E)
Southernmost: MDE (6.16°N 75.42°W)
Westernmost: SFO (37.62°N 122.37°W)
Easternmost: PRG (50.10°N 14.26°E)

Journey records
Longest: CDG-DFW, 4933 mi, 10:21
Shortest: LHR-AMS, 229 mi, 00:57
Average: 1355 mi, 03:12

Top 10 Routes
BOS-STL 13
STL-ORD 13
LHR-ORD 5
SFO-STL 3
LAX-STL 3
CDG-ORD 3
LHR-CPH 2
CDG-PRG 2
TXL-LHR 2
MIA-STL 2


Friday, October 16, 2009

Save Balloon Boy


Save Balloon Boy, originally uploaded by chrisfreeland2002.


I was out to dinner last night with friends and there was a TV on mute covering all the "Balloon Boy" madness. I'm glad the kid's ok. The family seems a bit...odd...but that's beside the point, which is:

I can't believe how big the story got so quickly. But then, with our "always on" culture, why should I be surprised? It all made me consider "Ferris Bueller's Day Off" and my favorite moment from that flick - the shot of SAVE FERRIS painted on the water tower. I remember watching that for the first time in the 80's thinking "oh, ha ha, how could you get the word out that fast?" This was pre-cell phone, pre-internet, pre-social networking, and so of course it was an absurd idea; that's the joke. Now that we're always connected, is the joke on us?

Tuesday, September 29, 2009

Thursday, September 10, 2009

Anchoring and delimiting

I cast anchor with a long, long chain,
it hardly anchored anything at all

I cast anchor with a short, short chain, it anchored pretty well and I was nearly drowned

I cast an anchor with no chain at all,
it anchored just itself, and strictly nothing else

the anchor was a stone I choosed to pick around, and the chain had a look of human artifact

I questioned the stone about its feelings

the stone didn't know it could be an anchor, it didn't even know that it was a stone, and other stones around didn't care more

while nobody thought of questioning the chain
-Pierre Deleporte, Université de Rennes, Station Biologique de Paimpont, France,
"playing deliberately evasive methaphors [sic] game" concerning the differentiation between a species entity, a species concept and a species name. Originally posted to TAXACOM listserv, 10 Sep 2009.

Friday, June 05, 2009

#ebio09, silverbacks, & haiku

I am in London for the e-Biosphere '09 conference. I've been at conferences that had the now-ubiquitous ongoing Twitter sidechats before, but this was the first large biodiversity informatics meeting I've attended where it was used. Read all the chatter here: #ebio09. I've also posted pics via Flickr.

Following #ebio09 there was a 2 day meeting held at the Natural History Museum, London, organized by the #ebio09 Steering Committee. The goal of the meeting was to take the 3 days of sessions from #ebio09 & turn those into a 5-10 year roadmap for the field of biodiversity informatics. Here are the stated Overview & Objectives of the meeting:

  • Use the input provided by the e-Biosphere09 Conference as background to discuss important gaps and overlaps in information system coverage and interoperability among extant systems and services, and to develop a list of priority action items to address these shortcomings;
  • Discuss the experiences, obstacles, opportunities, and potential business models for a sustainable Biodiversity Informatics landscape; and
  • Use the input provided by the e-Biosphere09 Conference as background to discuss the new functionalities and interconnections needed by current and future users, and to develop a list of priority action items.

    The outcome of the Workshop should be a proposal for developing an integrated roadmap for the development of Biodiversity Informatics over the coming 5-10 years.

    During #ebio09 the 2day session was dubbed the 'silverback' meeting by Dean Pentcheff via Twitter, in a tongue-in-cheek biological reference to adult male gorillas, whose back coats turn to grey as they mature. I've heard the term used in this context before (first time in 2000, used it myself many times since) and it's slightly pejorative, in that these 'silverback' sessions are thought of as some special meeting of tribal elders - no young'uns allowed.

    I was invited. I was among 3 of the youngest people in the room. I'm almost 34.

    There was also some grumbling about it being a closed session, so I had promised the dozen or so #ebio09 Tweeters that I'd make reports as we'd done during the conference. Much to my dismay there was no wifi during the 2day session, so I took notes. On paper. Because the auditorium only had power outlets along the wall. Sigh.

    At the close of the first day I had about an hour before dinner started, so I came back to my hotel, typed up my notes & sent 40+ tweets in rapid succession, starting with this one. Some folks didn't like the tweetsplosion, some folks did. Wha'ver - I was pressed for time and wanted to get the info out.

    The second day (today, June 5, 2009) was quite different, with more breakouts & less plenary, so less opportunity to record tweetable bits; also wanted to avoid another tweetsplosion. I also acknowledge that there's going to be an official summary of the meeting that will be made public. I didn't want to steal the organizers' thunder, if you will, especially since they agreed to publish the outcomes & roadmap as quickly & transparently as possible. Woo hoo!

    Instead, what follows here is something I did to amuse myself while the meeting took the inevitable turn into strategery, blah blah blah, and 'my science is bigger than your science' squabbles. I wrote my impressions of Day Two as haiku. And here goes:

    Day Two: more of same.
    Silverbacks thumped, bared teeth, roared.
    Someone fell asleep.

    /

    All recognized we
    say same things time and again.
    Recapitulate!

    /

    Ontologies, tools,
    Registry of registries,
    Killer apps, outreach.

    /

    Thirty two people
    wordsmithed onscreen. "Track Changes?"
    Grammar school this ain't.

    /

    Twitter questioned as
    communication platform.
    World not flat - it's old.


    That last one stems from my frustration when the subject of communication opportunities came up. I naturally suggested Twitter and pointed out that there were hundreds of tweets posted during the conference. The crowd were unimpressed. Then at the close of the session we discussed outreach & engagement opportunities and someone mentioned social networking, and there was a collective groan throughout the audience. I bit my tongue.

    I was the only person in that group who is actively involved in social networking and who has found a way to incorporate it into my work life. Twitter & Facebook & Flickr aren't games; they're communication tools. I've been invited to participate in grants via Facebook. I've made contact with countless others in my field through Twitter. I IM my staff & collaborators while I travel (and boy, do I travel) and while I'm at home. Yes, there's chatter. But I counteract that by being online ALL THE TIME. My iPhone is my alarm clock when I travel. This morning I hit "Snooze" and then checked my email. Some people are horrified by this, but guess what - I learned this behavior from my tweenage nieces. There is no 'off' for them.

    Turns out Gil Scott-Heron was right - "the revolution will NOT be televised." It will be online, in a form we can't even imagine on devices not yet designed. As a discipline, biodiversity informatics has to stay on top of how its workers & enthusiasts (like citizen scientists) use technology. Not the tools that we build through our own efforts, but those that we carry in our pockets and integrate into our daily lives.

    I just hope that when I'm a silverback I remember this message.

    Final note: despite my critique here, the 2day session did accomplish a lot and did set out a roadmap for the next several years. To give this much space to a rant on social technology is unfair as it wasn't the central theme of the meeting, nor should it have been. Bigger problems, bigger issues. You'll see when you read the report. Just don't expect to hear about it via Twitter.

Tuesday, May 12, 2009

BHL Tech Overview for BHL-Europe



Presented at BHL-Europe Kickoff Meeting.
Museum für Naturkunde, Berlin.
12 May 2009

Thursday, May 07, 2009

Rapt attention while Tagert smiles


Mark's 50th Birthday Party, originally uploaded by TheBrad.