Viewing Blog: Inquiring Librarian, Most Recent at Top
"Thoughts on librarianship, technology, and how they affect each other in the 21st Century."

26. ALA Draft Digitization Principles

The ALA Digitization Policy Task Force recently released a draft set of digitization principles for public comment. The comments on them that I've seen boil down, basically, to "duh" and "expand the scope even further." Regarding "duh," I agree that there's little in these general ideas that many would argue against, but that's OK. We need to state principles like this, even when they're not controversial, to help frame further discussion. The blog where these were introduced has some thoughtful comments on the details explaining the principles, which is where I think reaction is best focused.

One big-picture issue I don't think is clear, however, is the label "digitization." The principles are for the most part not about the digitization (conversion from analog to digital format) process, nor do I think they should be. They're more about the properties of "digital libraries" as a whole, which have content that was once analog, content that is born digital, and perhaps even metadata about objects that aren't digital at all. These principles seem to describe systems and organizations more than just objects.

The "expand the scope even further" commentary is also particularly apt. Coming from ALA, the focus on "libraries" could, as one comment on the blog mentions, seem to exclude other producers and maintainers of digital content, even others in the cultural heritage sector such as archives and museums. The direction I'd like to see these principles expand is related to (buzzword warning) interoperability. (Don't fall asleep--although that term is often empty in its usage, it really does describe some essential concepts.) My reading of the principles is that they focus inward, on developing and maintaining digital collections within a single institution or close consortium. But we have an opportunity now to move away from the traditional (another buzzword warning!) "silo" approach to libraries, and create systems that operate in a much more open fashion, promoting re-use and exchange of content and metadata in new and unexpected ways. The digital libraries we maintain shouldn't just be accessible through our well-designed interfaces intended for a human to interact with - we need to supplement that access with additional methods. These methods are constantly expanding, and it will be difficult for us to keep up, but we can't ignore them.

27. Weighing in on speaker compensation

The library blog drama over speaker fees is past its useful life, and I've not expressed an opinion on the issue because, well, first of all I don't usually weigh in on topics like this as they make the rounds, and second, I see the complexity of the problem, and that doesn't exactly make good reading. I do think, however, that the "invited speakers get free registration for that day only" model is actually detrimental, in that it encourages invited speakers to blow in and blow out in a single day, and not talk to anybody or participate in any way other than their own session. Is this really the environment we want to be cultivating? Don't we want those individuals we choose as invitees to engage in coffee conversations and eclectic dinner meetups, and to learn from each of our communities by attending others' presentations? Certainly, not every invited speaker will be able to (or in some cases, want to) stay much longer than his or her talk, but we need models that encourage them to do so rather than making it difficult for them.

28. Staying out of it, for now

Inquiring Librarian has seen only tumbleweeds coming through recently. Various professional and personal challenges have put blogging on the back burner for me—challenges of the sort we all face from time to time. I’m woefully behind on blog, mailing list, and professional literature reading as well. As I’m starting to wade through some of what I’ve missed, I’m seeing a boiling-up of issues related to cataloging practice and its future in a number of different forums.

Many of the discussions are ebbing to the point where it doesn’t make sense for me to weigh in, but even if they weren’t, my inclination is to sit back and watch rather than participate. Perhaps I’m a bit disillusioned thinking that my words will have very little impact. I’m glad the discussion is ongoing, and I still believe that most people out there are reasonable, and can behave themselves while engaging in professional discourse. But many of the current discussions have taken on an “us vs. them” bent, and when that happens I tend to stay away, avoiding situations where emotions and stereotypes start getting in the way of dialogue.

I am sad for feeling this way, but we all must make decisions about how best to spend our precious time. I believe I have a great deal to offer these discussions, particularly in expanding their scope beyond just “catalogs” to the types of digital library systems I’m involved with building and to types of materials not well-served by the MARC/AACR2/LCSH/etc. suite that’s the main focus of these discussions. I’ll probably post some thoughts to this blog (please do comment – I’m not afraid of discussion overall!), but for now I feel the time and intellectual investment of communicating in some of these other forums is best left to others. I’m hoping my aversion to participation is temporary, and that as I get caught up both with others’ thoughts and my own, I’m able to jump back into the community. See you soon.





29. Modular Vocabularies

I read an article recently describing efforts to create a faceted (yay!) music thesaurus to describe the performance archives at BYU, and the implications of this project for the development of an international music thesaurus. It’s an interesting piece. I’m all for more specialized vocabularies and think one for the field of music is sorely needed. Here’s the citation:

Spilker, John D. “Toward an International Music Thesaurus.” Fontes Artis Musicae 52, no. 1 (January-March 2005): 29-44.

The mention in the article of the debate about whether to include a facet for “content subjects” which would include “extra-musical associations” (i.e., topics, things the music is “about”) gave me pause, however. Looking more closely, I also see that they propose a “philosophies and religions” facet, ask whether “distant” terms that appear in source vocabularies as compound terms relating them to music (e.g., “astrology and music”) should be retained in any way, and copy terms from an existing instrument vocabulary into their own instrument facet. This duplication of effort bothers me a great deal. I see this phenomenon fairly often—projects that try to do everything end up doing nothing very well. LCSH does this, trying to shove everything under the heading “subject” and not making very good distinctions between what type of subjects those terms are. Even in faceted vocabularies, I see communities (like the one described in this article) try to include everything that might be needed to index material for that community. The emerging Ethnographic Thesaurus, facing an enormous task in developing a vocabulary for a very large and diverse field, shows signs of this as well, but I know the editors are considering these issues as they move forward with development. I think this would work much better if these communities focused instead on only those facets that they have particular expertise in, and “borrowed” the rest from other communities.
There’s often an assumption in the library world that a record needs to use a single “subject” vocabulary. But as we move forward, surely that’s a constraint (even if it’s only perceived) we can break out of. There’s no reason the terms for different facets (notice how I’m assuming a faceted, post-coordinate structure here) have to come from the same vocabulary. Let’s leave each specific vocabulary to its experts, and not try to have musicians developing terminology for religion and astronomers developing terminology for book bindings.
There are many details to work out, for example the user implications of one vocabulary using singular forms by default and another using plural (there are standards for such things, but let’s be realistic about how many otherwise useful vocabularies we’d throw out of consideration because of the grammatical number of their headings), but there are technological means for doing this. I’d hate to see an inordinate focus on the (potentially many) small challenges derail the larger, necessary move in a more flexible direction.
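
To make the modular-vocabulary idea concrete, here is a minimal sketch (my own illustration; the vocabulary names are hypothetical placeholders, not real published thesauri) of a record whose facets each draw terms from a different source vocabulary, with post-coordinate retrieval simply intersecting facet values:

```python
# Each facet notes the vocabulary its term came from; no single "subject"
# vocabulary is assumed. Vocabulary names below are hypothetical.
record = {
    "title": "Concerto for Two Pianos",
    "facets": {
        "instrumentation": [{"term": "piano", "source": "example-instrument-vocabulary"}],
        "genre_form": [{"term": "concertos", "source": "example-music-genre-vocabulary"}],
        "religion": [{"term": "secular works", "source": "example-religion-vocabulary"}],
    },
}

def matches(rec, facet, term):
    """Post-coordinate check: does this record carry the term in that facet?"""
    return any(v["term"] == term for v in rec["facets"].get(facet, []))

print(matches(record, "instrumentation", "piano"))  # True
```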







30. I *love* this

I'm talkin' 'bout the new version of OCLC's FictionFinder. Specifically, the browse feature in FictionFinder. You heard me. Browse. In a library system. Not the LCSH browse with pages upon pages upon pages upon pages of subdivisions with no discernible grouping, but a real browse.

The best is the "genre" browse (but take out those --Young adult fiction subdivisions and move them to an audience facet). It's not a short list, but it's not too long either. It would be interesting to arrange these hierarchically and see if navigating that list made any sense to users. And "settings"! How cool to be able to locate fiction that takes place in the Pyrenees. This is what library catalogs should do for our users.

I'm also intrigued by the "character" browse. This is something I've never thought of before. My general rule for browsing facets is to only include facets that have a (relatively) small number of categories, each with a (relatively) large number of members. At first, I didn't think characters met this requirement. Then I clicked on Captain Ahab, and I realized just how many works of fiction there are about him! Great works inspire derivatives, and exploring those is a fun way to guide new reading, in my opinion. It would be interesting to have access to a browse list of all characters in some situations, and only those with a large number of works (note works here, not publications) in other situations. Exploring which situations warrant which presentation would be another interesting line of inquiry.

The next improvement I want to see is allowing users to combine these facets (and others) dynamically so I can find Psychological fiction set in the Pyrenees, then narrow it to works after 1960, then remove the Pyrenees requirement, then add in Captain Ahab to the requirements that are left.... ad nauseam. Our catalogs need to support discovery of new works, not just those for which we already know the author and title. Systems like this are light years (sci fi fan here!) ahead of LCSH-style "browsing". I want more!
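
As a rough sketch of the kind of dynamic refinement I have in mind (my own toy illustration, not how FictionFinder is actually built), each step simply adds or removes a facet constraint over the same pool of works:

```python
# Toy records with explicit facet fields; real data would be far richer.
works = [
    {"title": "A", "genre": "Psychological fiction", "setting": "Pyrenees", "year": 1972, "characters": []},
    {"title": "B", "genre": "Psychological fiction", "setting": "Pyrenees", "year": 1950, "characters": []},
    {"title": "C", "genre": "Sea stories", "setting": "Atlantic Ocean", "year": 1985,
     "characters": ["Ahab, Captain"]},
]

def refine(results, **facets):
    """Keep only works matching every supplied facet value."""
    for key, value in facets.items():
        results = [w for w in results if w.get(key) == value]
    return results

step1 = refine(works, genre="Psychological fiction", setting="Pyrenees")   # two works
step2 = [w for w in step1 if w["year"] > 1960]                             # narrow by date
step3 = refine(works, genre="Psychological fiction")                       # drop the setting
step4 = [w for w in works if "Ahab, Captain" in w["characters"]]           # character facet
print(len(step1), len(step2), len(step3), len(step4))                      # 2 1 2 1
```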

(Note to OCLC - the link to "Known problems" is broken. I'm interested to find out what challenges you've faced when building this beta system. I have a very strange idea of fun.)

32. True confessions

I recently checked out David Allen's Getting Things Done from my local public library, thinking I could use a little help calming down the craziness that my life seems to have turned into. Probably predictably, I turned it in late having only read the first two chapters. Oh, well.

In light of this and other related events, I've been thinking a bit about what I do get done and why. I believe I've been spoiled by having jobs for a number of years now where I find the work interesting. It's a whole lot easier to get work done when it's engaging and I care about the outcome. For the most part, the tasks I find interesting are the ones I end up working on, leaving the uninteresting ones until right before a deadline.

So what does this mean for libraries? I think it means that we need to make sure to allow our staff to step up and get involved in projects as deeply as interests them. There are many of us out there who get motivated by understanding and buying into the big picture. Don't "protect" your staff from those high-level discussions - allow them to participate as much as they see fit. Sure, there are lots of folks in library-land that are just interested in the paycheck. We need to meet their needs too. But reward those who think beyond the next five minutes - they're going to be running the place soon enough.

33. Children's Book Week


Reading all the touching stories of favorite childhood books across the biblioblogosphere in honor of Children's Book Week has guilted me into posting my own contribution. I still smile when I think of The Little Old Man Who Could Not Read, written by Irma Simonton Black and illustrated by Seymour Fleishman. It's a story of a man (who cannot read) who goes to the grocery store and selects items based on the box size and color, trying to match them to products he knows he has at home. Of course, he ends up with an amusing assortment of unintended purchases. The story is touching and the illustrations really make the point. Like many books from my childhood, I think it's out of print (and I see it was first published in 1968, before I was born), but it looks like Amazon can hook you up with a copy, as could many local libraries.

34. More structured metadata

I often encounter people who see my job title (Metadata Librarian) and assume I have an agenda to do away with human cataloging entirely and rely solely on full-text searching and uncontrolled metadata generated by authors and publishers. That’s simply not true; I have no such goal. I am interested in exploring new means of description, not for their own sake, but for the retrieval possibilities they suggest for our users. So here are a few statements that begin to explain my metadata philosophy:

I want more automation. Throwing more money at a manual cataloging process is not a reasonable solution. First of all, it would take waaaaaaayyyyy more money than we can even dream of getting, and second, much metadata creation is not a good use of human effort. Let’s automate everything we can, saving our skilled people for the tasks that current automated means are furthest from performing adequately. Let’s get more objective types of metadata, such as pagination, from resources themselves or from their creators (including publishers). Let’s build systems that make data entry and authority control easy. Yes, there will be some mistakes. There will be mistakes if the whole thing is done by humans too. Is catching the few mistakes that will happen from these automated processes more important than devoting our human effort to those few extra resources? More automation means more data total, and the sorts of discovery services I have in mind need lots of that data.

I want more consistency. Users can’t find what’s not there. While we can’t prescribe that all records for all resources everywhere have a large number of features (I’m against metadata police!), the more of those features are present, the more discovery options those users have. Imagine a system that provides access to fiction based on geographic setting. Cool, huh? I read one book recently set in Cape Breton Island and can’t wait to get my hands on more. We can’t do that very well today because that data is in very few of our records, and when it is there, it isn’t always in the same place. The more consistent we are with our metadata, the better able we’ll be to build those next-generation systems.

I want more structure. I’m a big fan of faceted browsing. The ability to move seamlessly through a system, adding and removing features such as language, date, geography, topic, instrumentation (hey, I’m a musician…), and the like based on what I’m currently seeing in a result set is something I believe our users will be demanding more and more. But we can’t do this if that information isn’t explicitly coded. Instrumentation (e.g., “means of performance”) as part of a generic “subject” string isn’t going to cut it. Geographic subdivisions (even in their own subfield) that are structured to be human- rather than machine-readable also aren’t going to cut it. Nor are textual language notes, [ca. 1846?], or most GMDs. Many of these things can be parsed, and turned into more highly structured data with some degree of success. But why aren’t we doing it that way in the first place? More structure = better discovery capabilities.
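
As a small, hedged illustration of the parsing problem just mentioned (my own sketch, not a production tool): a free-text date like “[ca. 1846?]” can be pulled apart after the fact, but recording the year, the approximation, and the uncertainty explicitly in the first place would make this step unnecessary.

```python
import re

def parse_free_text_date(value):
    """Extract a year plus 'approximate' and 'uncertain' flags from a
    free-text date such as '[ca. 1846?]'. Real data is far messier."""
    year_match = re.search(r"\d{4}", value)
    return {
        "year": int(year_match.group()) if year_match else None,
        "approximate": "ca." in value,
        "uncertain": "?" in value,
    }

print(parse_free_text_date("[ca. 1846?]"))
# {'year': 1846, 'approximate': True, 'uncertain': True}
```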

What this all means is I’m glad there are lots of extremely bright people with all sorts of perspectives and skills thinking about improved discovery for library materials, but that doesn’t necessarily mean throwing out metadata-based searching. The sorts of systems I envision require more, more highly structured, more predictable, and higher-quality metadata. I want more, not less.

I’ll stand on one last (smallish) soapbox before wrapping this up. In many communities (including both search engines and libraries), discussions about retrieval possibilities often center around textual resources. However, not everything that people are interested in is textual. That’s of course not a surprise, but I’m shocked at how often discovery models are presented that rely on this assumption. I’m all for using the contents of a textual resource to enhance discovery in interesting ways, but we need systems that can provide good retrieval for other sorts of materials too. Let’s not leave our music, our art, our data sets, our maps hanging out to dry while we plow forward with text alone.









35. Thinking bigger than fixing typos

The Typo of the Day blog, which each day presents a typographical error likely to be found in library catalogs and encourages catalogers to search their own catalogs for it, has generated much discussion and linking in the library world. I’m all for ensuring our catalog data is as accurate as possible; however, I would like to think beyond the needle-in-a-haystack approach presented here. I want our emphasis to be on systems that make it difficult to make a mistake in the first place, rather than on review systems that emphasize what’s wrong over what’s right and give a select few a false sense of intellectual superiority over those who do good work and make the occasional inevitable simple mistake.

There are many ways our cataloging systems could better promote quality records and make it more difficult to commit simple errors. I’ll mention just two here: spell checking and heading control. We hear frequent complaints about the lack of spell checking in our patron search interfaces, but few talk about this feature as being useful to catalogers. And I’m not talking about a button that looks over a record before saving it—I’m talking about real interactive visual feedback that helps a cataloger fix a typo right when it happens. Think Word with its little red squiggly lines—they show up instantly, so all you have to do is hit backspace a few times while you’re thinking about this particular field and not miss a beat. If it’s not really an error, the feedback is easy to ignore. Word also has a feature whereby it can automatically correct a misspelling as you type based on a preset (and customizable) list of common typos. Features like this require a bit more attention to make sure the change isn’t an undesired one, but for most people in most cases it saves a great deal more time than it takes, and the feature can be tuned to an individual’s preferences. Checking the entire record after the fact requires a higher cognitive load—turning back to a title page, remembering what you were thinking when you formulated the value for that field, checking an authority file a second time, etc., and is less helpful than real-time feedback.
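
A minimal sketch of that autocorrect-on-entry idea (my own illustration; the typo list is made up, and a real cataloging client would hook this into its editing widgets):

```python
# Correct known typos as text is entered, based on a preset, customizable
# list of common mistakes; anything not on the list is left alone.
COMMON_TYPOS = {
    "teh": "the",
    "libary": "library",
    "universtiy": "university",
}

def autocorrect(field_value):
    """Replace known typos word by word, preserving everything else."""
    return " ".join(COMMON_TYPOS.get(word, word) for word in field_value.split())

print(autocorrect("universtiy libary annual report"))
# university library annual report
```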

Heading control is the second area in which our systems could make it easy to do the right thing. Easier movement between a bibliographic record and an authority file, widgets that fill in headings based on a single click or keystroke, and automatic checks that ensure a controlled value matches an authority reference before leaving the field can all help the cataloger avoid simple typographical errors in the first place and make the sort of treasure hunt common typo lists provide less necessary.

Consider also the enormous duplication of effort we’re expending, with hundreds of individuals at hundreds of institutions all looking up the same typos in our catalogs and all editing our own copies of the same records. This local editing makes an already tough version control problem worse by increasing the differences between hundreds of copies of a record for the same thing. We have way more cataloging work to do than we can possibly afford, and duplication of effort like this is an embarrassingly poor use of our limited resources. The single most effective step we can take to improve record quality is to stop this insanity we call “cooperative cataloging” today and adopt a streamlined model whereby all benefit instantaneously and automatically from one person fixing a simple typo.





36. Grant proposals

Writing competitive grant proposals for putting analog collections online is difficult, and is becoming more so as more institutions are in a position to submit high-quality proposals and digitization for its own sake is no longer a priority for funding agencies. Collections themselves are no longer enough. There are many more collections that deserve a wider audience, that will significantly contribute to the work of scholars, and that will bring new knowledge to light, than can possibly be funded by even a hundred times the amount of grant funding available. The key is to offer something new. A new search feature. Expert contextual materials. User tagging capabilities. Something to make your project stand out as special and test some new ideas.

The trick is that in order to write that convincing proposal, you have to do a significant amount of the project, even before you write the proposal and before you get any money. Most of the important decisions, such as what metadata standards you will use, must be made before you write the proposal, both to convince a funding agency you know what you are doing and to develop reasonable cost figures. To make these decisions, you need an in-depth understanding of the materials, your users, the sorts of discovery and delivery functionality you will provide, and the systems you will use. Coming to those understandings is no small task, and is one of the most important parts of project planning. Don’t think of grant money as “free”—think of it as a way to do something you were going to do anyway, just a bit faster and sooner.

37. Librarians in the Media

CNET news published an article this week entitled, “Most reliable search tool could be your librarian.” While it’s nice to see librarians getting some press, I remain concerned about our image, both as presented in the media and as we present ourselves.

The article contains the usual rhetoric about caution in evaluating the “authority” of information retrieved by Web search engines, the need for advanced search strategies to achieve better search results, and the bashing of keyword searching. Here, as in so many other places, the subtext is that “our” (meaning libraries’) information is “better” – that if only you, the lowly ignorant user, would simply deign to listen to us, we can enlighten you, teach you the rituals of “quality” searching and location of deserving resources rather than that drivel out there on the Web, which could be written by (gasp!) any yahoo.

Of course we know it’s not that simple. But the oversimplification is what’s out there. We’re not doing ourselves any favors by portraying ourselves (or allowing ourselves to be portrayed) as holier-than-thou, constantly telling people they’re not looking for things the right way or using the right things from what they do find, even though they thought they were getting along just fine. We simply can’t draw a line in the sand and say, “the things you find through libraries are good and the things you don’t are suspect.” There are really terrible articles in academic journals, and equally terrible books, many published by reputable firms. There are, on the other hand, countless very good resources out there on the Web, discoverable through search engines. And the line between the two is becoming ever more blurry as scholarly publishing moves towards open access, libraries are putting their collections online, government resources are increasingly becoming Web-accessible, and search engines gain further access to the deep Web.

The first strategy I feel we should be taking is to move discussion away from focusing on the resource and its authority to the information need. Evaluating an individual resource is of course important, but it’s not the first step. Let’s instead talk first about all the resources and search strategies that can meet a given need, rather than always focusing on resources and search strategies that can’t meet that need. There are many, many ways a user can successfully locate the name of the actor in the movie he saw last night, identify a source to purchase a household item at a reasonable price, find a good novel to read on a given theme, or learn more about how the War of 1812 started. Let’s not assume every information need is best met by a peer-reviewed resource; instead, let’s make those peer-reviewed resources, and the mediation services we can offer for them, more accessible when they are appropriate to the need at hand. Let’s be a part of the information landscape for our patrons, rather than telling them we sit above it.







38. On "authority"

I recently got around to reading the response from Encyclopedia Britannica to the comparison of the “accuracy” of articles in Britannica and Wikipedia by Nature. It’s got me thinking about the nature of authority, accuracy, and truth.

Britannica’s objections to the Nature article arise from a different interpretation of the words “accuracy” and “error.” The refutations by Britannica fall into two general categories. The first is the disputation of certain factual statements, mostly when such facts were established by research. Here, these facts aren’t truly objective; rather, they’re a product of what a human is willing to believe based on the evidence. Different humans will draw different conclusions based on the same evidence. And then there’s the other human element: mistakes. We all make them, those who work for Britannica and those who work for Nature alike. The “error” rates Nature reported for both sources are astonishingly high. Certainly not all of these are true mistakes, maybe not even very many of them, but they exist, in every resource humans create, despite any level of editorial oversight.

Second, and more prevalent, are differing opinions among reasonable people, even experts in a given domain, about what is and isn’t appropriate to include in text written for a given audience. Anything but the most detailed, comprehensive coverage of a subject requires some degree of oversimplification (and maybe even that does as well). By some definition, all such oversimplifications are “wrong” – it’s a matter of perspective and interpretation whether or not they’re useful to make in any given set of circumstances. Truth is circumstantial, much as we hate to admit it.

I’d say the same principles apply to library catalog records. First, think about factual statements. At first glance, something like a publication date would seem to be an objective bit of data that’s either wrong or right. But it’s not that simple. There are multitudes of rules in library cataloging governing how to determine a publication date and how to format it. Interpretation of those rules is necessary, so two catalogers can often reach different, equally reasonable decisions about what the publication date is. In cases where a true mistake has been made, our copy cataloging workflows require huge amounts of effort to distribute corrections among all libraries that have used the record with that mistake. Only sometimes is a library correcting a mistake able to reflect this correction in a shared version of a record, and no reasonable system exists to propagate that correction to libraries that have already made their own copy of that record. The very idea of hundreds of copies of these records, each slightly different, floating around out there is ridiculous in today’s information environment. We’re currently stuck in this mode for historical reasons, and a major cooperative cataloging infrastructure upgrade is in order.

More subjective decisions are not frequently recognized as such when librarians talk about cataloging. We talk as if one would only follow the rules, the perfect catalog record would be produced, and that if two people were to just follow the same rules, they would produce identical records. But of course that’s not true. There will always be individual variation, no matter how well-written, well-organized, or complete the instructions. Librarians complain about “poor” records when subject headings don’t match their ideas of what a work is about. But catalogers don’t (and of course can’t) read every book, watch every video, or listen to every musical composition they describe. Why have we set up a system whereby we spend a great deal of duplicate effort overriding one subjective decision with another, based on only the most cursory understanding of the resources we’re describing, and keeping multiple but different copies of these records in hundreds of locations? How, exactly, does this promote “quality” in cataloging?

An underlying assumption here is that there is one single perfect cataloging record that is the best description of an item. But of course this isn’t true either. All metadata is an interpretation. The choices we make about vocabularies, level of description, and areas of focus all preference certain uses over others. I’m fond of citing Carl Lagoze’s statement that "it is helpful to think of metadata as multiple views that can be projected from a single information object." Few would argue with this statement taken alone, yet our descriptive practices don’t reflect it. It’s high time we stopped pretending that the rules are all we need, changed our cooperative cataloging models to do it truly cooperatively, and use content experts rather than syntax experts to describe our valuable resources.









39. What about dirty OCR?

I often hear discussions as part of the digital project planning process about how best to approach full-text searching of documents. A common theme of these discussions is whether or not “dirty” (uncorrected, raw) OCR is acceptable or not. The “con” position tends to argue that OCR is only so effective (say, 95%) and that the errors made can and will adversely affect searching. The “pro” position is that some access is better than none, and OCR is a relatively cheap option for providing that “some” access.


The con position has some convincing arguments. Providing some sort of full text search sends a very strong implication that the search works – and if the error rate in the full text is more than negligible, it could be said that implied promise has been broken. Error rates themselves are misleading. A colleague of mine likes to use the following (very effective, in my opinion) example, noting that error rates refer to characters, but we search with words:


Quick brown fix jumps ever the lazy dog.

In this case, there are two errors (“fix” and “ever”) out of 40 characters (including spaces), for an accuracy rate of 95%. However, only 75% (6 of 8) of the words in that example are correct.
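
For anyone who wants to check the arithmetic, here is a small sketch of that comparison (zip works here only because the two strings happen to be the same length; real OCR evaluation needs proper alignment):

```python
reference = "Quick brown fox jumps over the lazy dog."
ocr_text  = "Quick brown fix jumps ever the lazy dog."

# Character-level accuracy: 2 wrong characters out of 40.
char_errors = sum(1 for a, b in zip(reference, ocr_text) if a != b)
char_accuracy = 1 - char_errors / len(reference)             # 0.95

# Word-level accuracy: 2 wrong words out of 8.
ref_words, ocr_words = reference.split(), ocr_text.split()
word_errors = sum(1 for a, b in zip(ref_words, ocr_words) if a != b)
word_accuracy = 1 - word_errors / len(ref_words)              # 0.75

print(f"{char_accuracy:.0%} of characters correct, {word_accuracy:.0%} of words correct")
```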


So uncorrected OCR has some problems. But the costs of human editing of OCR-ed texts are high – too high to be a valuable alternative in many situations. Double- and triple-keying (two or three humans manually typing in a text while looking at scanned images) tends to be cheaper than OCR with human editing, but these cost savings are typically achieved by outsourcing the work to third-world countries, raising ethical concerns for many. And both of the human-intervention options themselves carry a non-zero error rate. No solution can reasonably yield completely error-free results.


I’ll argue that the appropriate choice lies, as always, in the details of the situation. How accurate can you expect the OCR to be for the materials in question? 90% vs. 95% vs. 99% makes a big difference. What sorts of funds are available for the project? Are there existing staff available for reassignment, or is there a pool of money available for paying for outsourcing? TEST all the available options with the actual materials needing conversion. Find out what accuracy rate can be achieved via OCR with all available software. Ask editing and double-keying vendors for samples of their work based on samples from the collection. Do a systematic analysis of the results. Don’t guess as to which way is better. Make a decision based on actual evidence, and make sure you get ample quantities of that evidence. Results from one page, or even ten pages, are not sufficient to make a reasoned decision. Use a larger sample, based on the size of the entire collection, to provide an appropriate testbed for making an informed choice between the available options. Too often we assume a small sample represents actual performance and accept quick support of our existing preferences as evidence of their superiority. To make good decisions about the balance of cost and accuracy, we must use all available information, including accurate performance measures from OCR and its alternatives.

40. No more magic bullets

This week OCLC announced worldcat.org, a freely-available site for searching WorldCat data, which will be released in August 2006. Here’s their one-sentence explanation of its purpose:

Where Open WorldCat inserts "Find in a Library" results within regular search engine results, WorldCat.org provides a permanent destination page and search box that lets a broader range of people discover the riches of library-held materials cataloged in the WorldCat database.


I’m a huge fan of this addition to the OCLC arsenal. I’m also a fan of Open WorldCat, however. I think these two tools need to work together (and along with many others) to provide the full set of services our users need. Like others, I use various tricks to limit search engine results to Open WorldCat items when I’m looking for basic information about a book I know exists, and, like others, I’ve never seen an Open WorldCat item appear in a Google search result set that wasn’t intentionally limited to Open WorldCat results. While Open WorldCat has its benefits, it can’t be all things to all users.

And there’s the rub: in libraries (and, to be fair, in many other fields as well) we tend to think there’s a magic solution. We just need to be more like Google. Federated searching is the answer. If we had Endeca, like NCSU, everything would be fine. Shelf-ready cataloging will make all of this affordable. Put like this, it sounds absurd. Yet the magic bullet theory drives all too many library decision-making processes. Of course, only by combining these and many other technologies in innovative ways will we make the substantive changes needed in the discovery systems libraries provide to our users. Systems of different scope need different means of presenting search results. A system with a tightly-controlled scope may be able to present search results in a simple list (note: these are few and far between!). The wider the scope of the system, with regards to format, genre, and subject, the more sophisticated we need to get in presenting the search results. Grouping, drilling down, dynamic expanding and refining of results all need to be incorporated into our next-generation systems. Single books in Google results aren’t going to cut it – we need to find ways to represent groups of our materials in aggregated resources.

For many user needs, sophisticated searching options for a specific genre or format of resource are absolutely essential. For others, more generic access to a variety of resources is the appropriate approach. Flexibility is the key, and the data we’re talking about here will never live in a single location. Mashups of data from multiple sources, presented with a variety of interfaces and search systems, can provide the advanced access envisioned here. We need to stop accepting the quick fix; instead, we must broaden our expectations, and move forward evaluating every option as to its place in the grand vision.

41. A quest for better metadata

I wasn’t able to attend the ACM/IEEE Joint Conference on Digital Libraries this year, but the buzz surrounding the paper by Carl Lagoze (et al.) about the challenges faced by a large aggregator despite using supposedly low-barrier methods such as OAI led me to look up the written version of the paper. This paper demonstrates very well that no matter how “low-barrier” the technology (OAI) or the metadata schema (DC), bad metadata makes life difficult for aggregators. Garbage in, garbage out has been a truism for some time, and the “magic” behind current technology can help, but can only go so far to mitigate poor input.

There has been spirited discussion in the library world recently about next generation catalogs, but that discussion has heavily centered on systems rather than the data that drives them. I’d argue that one needs both highly functional systems and good data in order to provide the sorts of access our users demand. How we get that good data is what I’ve been interested in recently. Humans generating it the way libraries currently do is one part of a larger-scale solution, but given the current ratio of interesting resources to funding for humans to describe them, we must find other means to supplement our current approach.

So what might we do? Here are my thoughts:

  • Tap into our users. There are a whole lot of people out there that know and care a lot more about our resources than Random J. Cataloger. Let’s harness the knowledge and passion of those users, and provide systems that let them quickly and easily share what they know with us and other users.

  • Get more out of existing library data. As Lorcan Dempsey says, we should “make our data work harder.” Although MARC and other library descriptive traditions have many limitations in light of next-generation systems, they still represent a substantial corpus of data that we must use as a basis for future enhancements. Let’s use any and all techniques at our disposal to transform this data into that which drives these next-generation systems.

  • Look outside of libraries. Libraries do things differently than publishers, vendors, enthusiasts, and many other communities that create and use metadata. We should keep in mind the cliché, “Different is not necessarily better.” We need to both look at ways of mining existing metadata from other communities to meet our needs, and re-examine the way we structure our metadata with specific user functions in mind.

  • Put more IR techniques into production. Information retrieval research provides a wide variety of techniques to better process metadata from libraries and other communities. Simple field-to-field mapping is only a portion of what we can make this existing data do for us. We must work with IR experts to push our existing data farther. IR techniques can also be made to work not just on metadata but the data itself. Document summarization, automatic metadata generation, and content-based searching of text, still images, audio, and video can all provide additional data points for our systems to operate upon.

  • Develop better cooperative models. Libraries have a history of cooperative cataloging, yet this process is anything but streamlined. We simply must get away from models where every library hosts local copies of records, and each of those records receives individual attention, changing, enhancing, even removing (!) data for local purposes. Any edits or enhancements performed by one should benefit all, and the current networked environment can support this approach much better than was possible when cooperative cataloging systems were first developed.





My point is, we can’t plug our ears, sing a song, and keep doing things the way we have been doing. Let’s make use of the developments around us, contribute the expertise we have, and all benefit as a result.








42. Finding new perspectives

I spent last week at a conference with an extremely diverse group of attendees. Almost all were trained musicians; among these were traditional humanist scholars, librarians of all sorts, and a smattering of technologists. I spoke at two sessions, each on a topic related to how library systems might better meet the needs of our users. I was pleasantly surprised by the environment in these sessions, and in the conference as a whole.

Due to the diversity of attendees, I had feared that my ideas might be either rejected wholesale in light of very real and valid practical concerns, or ignored due to a perception that they were irrelevant to the work of many attendees. I was wrong. I had many stimulating and mutual idea-generating discussions with other attendees, most of whom don't spend their time thinking about system design like I'm lucky enough to do. My perspective of thinking big and not being satisfied by what current systems deliver us was greeted with a great deal of enthusiasm, showing me in no uncertain terms just how connected and devoted many librarians (and those in related fields) are to the needs of our users. Perhaps those who disagreed with my approach were just being polite in not expressing major differences in perspective publicly or privately (it was an international conference and I admit to not fully understanding all the cultural factors at work); I hope not, or at least I'd like to think that such disagreements could take the form of collegial conversation that starts in a session then continues afterward to the mutual benefit of both parties. But, then again, I can be an optimist about such things.

Perhaps the most surprising thing was that my point of view wasn't the most progressive there. I had a number of conversations with attendees whose vision was broader, more visionary, more of a departure from the current environment than mine. I view myself as striking a reasonable compromise between vision and practicality in the digital library realm, but my preconception of this conference was that I would be very far outside the attendees' respective norms. I was certainly on that side, and it was good to see I had company, and even a few compatriots that were further out to stimulate discussion.

What I took away was that we in the digital library world have a tendency to navel-gaze, to think we're the only ones that can plan our next-generation systems. This week I found an excellent cross-section of groups we need to more fully engage in this discussion. Without them and others like them, we're missing vital ideas.

43. An RDF Revelation

While doing some reading recently, I had an RDF revelation. I've long felt I didn't really get RDF. This time, the parts that sunk in made a bit more sense. I'm not a convert in this particular religious war, but I do feel like I now understand both sides a bit better.

I've read the W3C RDF Primer before; several times, I think. The first thing that struck me this time was a simple fact I know I'd read before but that I'd forgotten--that an object can be either a URIref or a literal (a URI referencing a definition elsewhere, or a string containing a human-readable value). This means the strict machine-readable definition of things that RDF strives to achieve is potentially only half there--only the predicate (the relationship between the subject and object) is expected to be a reference to a presumably shared source. I assume this option exists for ease of use. Certainly building up an infrastructure that allows for all values to be referenced rather than declared would require unreasonable startup time. This sort of thing is better done in an evolutionary fashion rather than forcing it to happen at the start; a reasonable decision on the part of RDF.

RDF contains some other constructs to make things easier, for example, blank nodes to group a set of nodes (or, in the words of the primer, provide "the necessary connectivity"). Blank nodes are a further feature that allows entities to go without formal identification. The primer discusses a case using a blank node to describe a person, rather than relying on a URI such as an email address as an identifier for that person. A convenient feature, certainly, but also a step away from the formal structures envisioned in Semantic Web Nirvana.
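
To make the distinction concrete, here is a small sketch using the rdflib Python library (my own illustration, not an example from the primer): the same "creator" statement made once with a plain literal object, and once with a blank node that is itself described.

```python
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

g = Graph()
ex = Namespace("http://example.org/")
doc = URIRef("http://example.org/report")

# Object as a literal: just a human-readable string, no shared definition.
g.add((doc, ex.creator, Literal("Jane Smith")))

# Object as a blank node: the creator is described, but never formally identified.
person = BNode()
g.add((doc, ex.creator, person))
g.add((person, FOAF.name, Literal("Jane Smith")))
g.add((person, FOAF.mbox, URIRef("mailto:jane@example.org")))

print(g.serialize(format="turtle"))
```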

So now I'm looking at the whole XML vs. RDF discussion much more as a continuum rather than opposing philosophical perspectives. The general tenor of RDF is that it expects everything to be declared in an extremely formal manner. But there are reasonable exceptions to that model, and RDF seems to make them. I'd argue now that both RDF and XML represent practical compromises. Both strive for interoperability in their own way. It's just a question of degree whether one expects a metadata creator to check existing vocabularies, sources, and models for individual concepts (RDF-ish) or for representing entire resources (XML-ish). I see the value of RDF for use in unpredictable environments. Yet I'm still not convinced our library applications are ready for it yet. The reality is that libraries are still for the most part sharing metadata in highly controlled environments where some human semantic understanding is present in the chain somewhere (even in big aggregations like OAIster). (Of course, if we had more machine-understandable data, that human step would be less essential...)

I'm a big champion of two-way sharing of metadata between library sources and the "outside world." I just don't think the applications that can make use of RDF metadata for this purpose are yet mature enough to make it worth the extra development time on my end. And, again, the reality is that it really would take significant extra development time for me. The metadata standards libraries use are overwhelmingly XML-based rather than RDF-based. XML tools are much more mature than RDF tools. I fully understand the power of the RDF vision. But this is one area I just can't be the one pushing the envelope to get there.

44. Whither book scanning

A recent New York Times Magazine article entitled Scan this Book! by Kevin Kelly is getting lip service in the library world. The article describes the current digitization landscape, discussing the Google book project, among other initiatives, and describes both the potential benefits and current challenges to the grand vision of a digitized, hyperlinked world. I was specifically glad to see the discussion not just centering around books, but around other forms of information and expression as well. However, library folk are starting in on our usual reactions to such pieces: finding factual errors, talking about how tags and controlled subjects aren't mutually exclusive, pointing out the economics of digitization efforts, discussing how the digitization part is only the first step and how the rest is much more difficult. All of these points are perfectly valid.

Yet even though these criticisms might be correct, I think that we do ourselves a disservice by letting knee-jerk reactions to "outsiders" talking about our books take center stage. Librarians have a great deal to offer to the digitization discussion. We've done some impressive demonstrations of the potential for information resources in the networked world. Yet we don't have a corner on this particular market. Like any group with a long history, we can be pathetically short-sighted about changes we're facing. I believe it would be a fatal mistake to believe we can face this future alone. We have solid experience and many ideas to bring to Kelly's vision for the information future. However, we simply can't do it alone, and not just for economic reasons. We simply must be listening to other perspectives, just as we expect search engines, publishers, and others we might be working with to listen to ours. Let's keep our defensiveness in check, and start a dialog with those who are interested in these efforts, instead of finding ways to criticize them.

45. On the theoretical and the practical

When I do metadata training, I make a point to talk about theoretical issues first, to help set the stage for why we make the decisions we do. Only then do I give practical advice on approaches to specific situations. I'm a firm believer in that old cliché about teaching a man to fish, and think that doing any digital library project involves creative decision-making, applying general principles rather than hard-and-fast rules.

Yet the feedback I get from these sessions frequently ranks practical advice as the most useful part of the training. I struggle with how to structure these training sessions based on the difference between what I think is important and what others find useful. I learned to make good metadata decisions first by implementing rules and procedures developed by others, and only later to develop those rules and procedures myself. It should make sense that others would learn the same way.

The difference is that I learned these methods over a long period of time. The training sessions I teach don't ever claim to provide anyone with everything they would need to know to be a metadata expert. Instead, their goal is to provide participants with the tools they need to start thinking about their digital projects. I expect each of them will have many more questions and ample opportunity to apply theory presented to them as they begin planning for digital projects. This is where I see the theoretical foundation for metadata decisions coming into play. I can't possibly provide enough practical advice to meet every need in the room; I can make a reasonable attempt to address theoretical issues that would help to address these issues.

I realize the theory (why we do things) can be an overwhelming introduction to the metadata landscape. Without any practical grounding, it doesn't make any sense. Yet I know it's essential in order to plan even one digital project, much less many. I and many others out there need to continue to improve the methods by which we train others to create consistent, high-quality, shareable metadata, finding the appropriate balance between giving a theoretical foundation and providing practical advice.

46. Thesauri and controlled vocabularies

I had a very interesting conversation recently with two colleagues about the differences between thesauri and controlled vocabularies. Both of these colleagues are developers who work in my department. One is finishing up a Ph.D. in Computer Science, is currently in charge of system design for a major initiative of ours, and has a knack for seeing all the aspects of a problem before finding the right solution; the other is a database guru with whom I've collaborated on some very interesting research and has just started pursuing an M.L.S to add to his already considerable expertise. I like and respect both of these individuals a great deal.

The interesting conversation began when the database-guru-and-soon-to-be-librarian (DGASTBL) (geez, that's not any better, is it?) asked if the terms "controlled vocabulary" and "thesaurus" are used interchangeably in the library world. He asked because from our previous work and a solid basis in these concepts he knew they really aren't the same thing, yet he had seen them used in print in ways that didn't match his (correct) understanding. The high-level system diagram we had at the time had a box for "vocabulary" which was intended to handle thesaurus lookups for the system. We discussed how a more precise representation of that diagram would have an outer box for "vocabulary" to handle things like name authority files and subject vocabularies with lead-in terms but no other relationships, and an inner box for "thesauri" (as a subset of controlled vocabularies) with full syndetic structures that the system could make use of. We lamented that the required outer label in this scenario of "controlled vocabulary" isn't as sexy as its subset "thesauri." The latter sounds a great deal more interesting when describing a system to those not involved in developing it.
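
A rough sketch of the distinction (my own illustration, with made-up terms): a simple controlled vocabulary maps lead-in terms to preferred terms, while a thesaurus adds a full syndetic structure that a system can actually traverse.

```python
# Controlled vocabulary: lead-in (non-preferred) term -> preferred term, nothing more.
controlled_vocabulary = {
    "cars": "automobiles",
    "autos": "automobiles",
}

# Thesaurus: preferred terms carry use-for, broader, narrower, and related terms.
thesaurus = {
    "automobiles": {
        "UF": ["cars", "autos"],
        "BT": ["motor vehicles"],
        "NT": ["electric automobiles"],
        "RT": ["highway transportation"],
    },
}

def expand_query(term):
    """Expand a search with narrower terms, something only a thesaurus supports."""
    return [term] + thesaurus.get(term, {}).get("NT", [])

print(expand_query("automobiles"))  # ['automobiles', 'electric automobiles']
```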

The system designer then presented a different perspective on the issue. While the librarian types considered thesauri a subset of controlled vocabularies (perhaps partly for historical reasons - we've been using loosely controlled vocabularies a lot longer than true thesauri), the system designer viewed the situation as the opposite - that controlled vocabularies were a specific type of thesaurus using only one type of relationship (the synonym), or perhaps also some rudimentary broader/narrower relationships that don't qualify as true thesauri (think LCSH). I found the difference in point of view interesting - the C.S. perspective expected a completely structured approach to the vocabulary problem, while the library perspective represented an evolving view that has never quite gotten to the point where we can make robust use of this data in our systems. It struck me that the system designer's perspective in this conversation was overly optimistic as to the state of controlled vocabularies in libraries.

Yet there's light at the end of this particular tunnel. Production systems in digital libraries are starting to emerge that make good use of controlled vocabularies in search systems, rather than relying on users to consult external vocabulary resources before searching. Libraries have not taken advantage of the revolution in search systems shifting many functions from the user to the system (think spell-checking), to our supreme discredit. Making better use of these vocabularies and thesauri is one way of shifting this burden. I hope this integration of vocabularies into search systems will push the development of these vocabularies further and make them more useful as system tools rather than just cataloger tools. By providing search systems that can integrate this structured metadata, we can improve discovery in ways not currently provided by either library catalogs or mainstream search engines.

Add a Comment
47. "Orienteering" as an information seeking strategy

I was introduced today to the notion of "orienteering" as an information seeking strategy, through a paper presented at the CHI 2004 conference by Jamie Teevan and several colleagues. The paper discusses orienteering as a strategy by which users make "small steps...used to narrow in on the target" rather than simply typing words in a search box. For some time I've been struggling to articulate the difference between the search engine model, with a wide-open box for typing in a search, and the library model, with vast resources but a need for users to know ahead of time which of those resources are relevant to their search. This paper spoke to me very clearly, by demonstrating that real users (to use one of my favorite phrases) are somewhere in the middle.

Users have resources they like. We prefer one map site over another, one news site over another, one author over another. And we know where each of our preferred resources can be accessed. For many types of information needs, we know the right place (for us) to start looking. Even as we make the hidden Web more accessible, the resource (like an email) we need often won't be something a generic Web search engine can get to. But for many information needs, a box and "I'm feeling lucky" is an effective solution. I think the point is that we need a wide variety of discovery models to match the wide variety of our searching needs. We can't expect all users to start with the "right" resource (what's "right"?), but we should provide seamless methods for users to move, step by step, towards what they're looking for.
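One way a discovery interface can support orienteering is by letting users narrow a result set one small step at a time, faceted-browsing style, rather than demanding a single perfect query up front. Here's a minimal sketch of that idea; the records and facet names are invented examples, not data from any real system.

```python
# Sketch only: "orienteering" as a series of small narrowing steps
# over a result set, rather than one all-or-nothing query.

records = [
    {"title": "Letters home, 1918", "format": "manuscript", "decade": "1910s"},
    {"title": "Field recordings",   "format": "sound",      "decade": "1950s"},
    {"title": "Campus photographs", "format": "image",      "decade": "1950s"},
]

def narrow(results, facet, value):
    """One orienteering step: keep only records matching a single facet."""
    return [r for r in results if r.get(facet) == value]

# The user takes small steps toward the target instead of typing one
# perfectly formed query:
step1 = narrow(records, "decade", "1950s")   # 2 records left
step2 = narrow(step1, "format", "image")     # 1 record left
print([r["title"] for r in step2])           # -> ['Campus photographs']
```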

Add a Comment
48. techessence.info launched

I was recently honored to be asked to participate with a stunningly informed and diverse group of library technology types in an online initiative called TechEssence. TechEssence is envisioned as a rich resource for library decision-makers to learn just enough about a wide variety of technologies to allow them to make good decisions. I'm a big fan of this approach - not everyone can know everything, and many of us need succinct information with just the right amount of evaluation from those with experience. As of yesterday, the site is now officially launched!

Here's a summary from Roy Tennant, our fearless leader:

TechEssence.info
The essence of technology for library decision-makers

A new web site and collaborative blog on technology for library
decision-makers is now available at http://techessence.info/.
TechEssence provides library managers with summary information about
library technologies, suggested criteria for decision-making, and
links to resources for more information on essential library
technologies.

A collaborative blog provides centralized access to some of the best
writers in the field. By subscribing to the RSS feed of the
TechEssence.info blog, you will be able to keep tabs on the latest
trends of which library administrators should be aware.

To accomplish this I am joined by a truly amazing group:

* Andrew Pace
* Dorothea Salo
* Eric Lease Morgan
* Jenn Riley
* Jerry Kuntz
* Marshall Breeding
* Meredith Farkas
* Thomas Dowling

For more information on the group, see our "about us" page at http://techessence.info/about/.

Add a Comment
49. Library digitization efforts

Many libraries are seeing efforts such as the Google Books Library Project and think they need to follow suit by digitizing books in order not to be left behind. I worry that many of these libraries are jumping in just to be on the bandwagon, without fully considering where their efforts fit in with those of others. Digitizing books, performing dirty OCR, and making use of existing metadata is about as easy as it gets in the digital library world (not that this is exactly a walk in the park), so it's an attractive option for libraries looking to make a splash with their first efforts to deliver their local collections online.

I argue that this is not the right approach for most libraries. The impact libraries are looking for from digitizing local collections comes from the right ratio of benefit to users versus costs to the library. While the costs to the library are lower for digitizing already-described, published books sitting on the shelves, the benefits are also lower than for focusing on other types of materials (more on which materials I'm thinking of later...). We already have reasonable access to the books in our collections. I'll be the first to go on and on ad infinitum about the poor intellectual access we currently provide to our library materials, but there is some intellectual access. For books a library doesn't own, interlibrary loan is a slightly cumbersome but mostly reasonable method of delivering a title to a user. There are also (comparatively) a great many digitized books already out there, though without good registries of what's been digitized and what hasn't, or good ways to share digital versions when they do exist and the institution that owns the files is willing to share them. Take the Google project: they're digitizing collections from five major research libraries, yet libraries planning digitization projects don't have access to lists of the materials being digitized as part of this project, even though we expect to have some (not complete) access to these materials through Google's services at some point in the next few years. And even though library collections contain less duplication than one might expect, a library embarking on a digitization project for published books would, to some non-negligible extent, be duplicating effort already spent.

Libraries in the aggregate hold almost unimaginably vast amounts of material. We're simply never going to get around to digitizing all of it, or even the proportion we would select given any reasonable set of selection guidelines. Only a tiny proportion of these materials are the "easy" type - books, published, with MARC records. The vast majority are rare or unique materials: historical photographs, letters, sound recordings, original works of art, rare imprints. These sorts of materials generally have grossly inadequate or no networked method of intellectual discovery. While digitizing these collections and delivering them online would take more time, effort, and money than doing the same for published collections, I believe strongly that the increase in benefit greatly outweighs the additional costs. In the end, the impact of focusing our efforts on classes of materials that we currently underserve will be greater than taking the easy road. Our money is better spent focusing on those materials that are held by individual libraries, held by few or no others, and to which virtually no intellectual access exists. Isn't this preferable to spending our money digitizing published books to which current access is reasonable, if not perfect?

Add a Comment
50. On metadata "experts"

I'm often asked how one gets the skills required to do my job as a Metadata Librarian. My answer is one I can't stress strongly enough: experience. We need to know the theoretical foundation of what we do inside and out, and need to constantly think about why we're doing something - the big picture. But theory is not enough. The only way to become skilled at making good metadata decisions is practice--seeing what happens as a result of an approach and improving on that approach the next time. No matter how many times I've done a certain type of task, I see the next repetition as a way to re-use good decisions and re-think others.

I've found the metadata community in libraries to be a very open one. When I'm starting on a task that I haven't done before, I use what I can from my experience with similar tasks. But I also ask around for advice from others who do have that experience. "Metadata" is a very big and diverse area of work. Even with the best abstract thinking, applying known principles to new environments, we all often need a boost for getting started from someone who has been through a given situation.

I'm skeptical of the idea of "experts" overall. These things are all relative - only once you start learning enough to be able to effectively share what you've learned with others do you truly realize how much you still have to learn. I put much more stock in the goal of becoming good at thinking about generalized solutions, good at making decisions for classes of problems rather than simply repeating specific implementations over and over. I'm not a programmer, and neither are many in the metadata librarian community. Yet the type of thinking that makes a good programmer can, in my opinion, also make the best metadata experts.

Add a Comment

View Next 3 Posts