
Executive Summary

For this review we talked with a variety of librarians, staff, university administrators, and users (faculty and students) about their experience working with the four platforms, or systems, the Libraries engage for management and delivery of our digital content:  CONTENTdm, DPubS, ETD-db, and Olive.  We report on our findings in two ways: through a summary of each platform, describing some noteworthy issues or concerns to serve as a guide for making recommendations about the next platform(s) or services we wish to put in place; and through cross-platform issues we have categorized in accordance with an analysis matrix (developed from Purdue's Comparative Analysis of IR Software). 

Platform Summaries

CONTENTdm

CONTENTdm is the most widely used of the four platforms reviewed, holding the largest number of items and involving more staff and workflows than the others.  Penn State is invested heavily in CONTENTdm, and impressions are generally positive: CONTENTdm does what we need it to do, is well supported, and has a strong community around it.  By a wide margin, the most common issue raised during the review concerned the user experience of the web interface.

The top navigation bar was a source of confusion for users.  The link to “digital collections” does not connect to the Digital Collections portal, which may have been a launching point for the user, but to the CONTENTdm base URL, at which point users are left to wonder where they wound up now that they are “outside” of a collection.  This is particularly problematic for users of the Visual Resources Centre who tend to be interested primarily in the VRC collection.  There is no noticeable VRC collection home link once a user has delved into the collection.  The back button of the browser is thus a crucial aspect of any CONTENTdm session.

Another navigation issue has to do with display of search results.  CONTENTdm provides the ability to code and embed pre-parameterized search boxes, say, to limit your search to a particular collection.  If a user issues a search via one of these boxes, the user is taken to a results page within CONTENTdm.  Users frequently must issue multiple searches to find what they are looking for, and doing so from within CONTENTdm, rather than from one of the custom search boxes, loses all of the parameters that had been set.  Search results look completely different, oriented, e.g., horizontally rather than vertically, and whereas their initial search may have been contained within a collection, subsequent searches lose that scope.

The general impression left by the user experience of the CONTENTdm web interface is that the emphasis is on decontextualization rather than contextualization of found objects -- and often the context that is lost leaves users similarly lost.

DPubS

In our conversations with internal users of DPubS, the two issues that stood out the most had to do with the loading of content into DPubS and with workflow (particularly from the university press's point of view).  In the DPubS loading process, error messages resulting from XML validation failure are often confusing and not consistently helpful.  For instance, if there are problems with the metadata XML, line numbers for the problematic elements are not provided. Albert and Kurt have worked out a series of workarounds and can usually translate the error messages into more useful equivalents, now that they've been working with the system for about a year. (Dan Coughlin of DLT believes better validation of packages and metadata would be easy to automate, and Albert confirms this would save him significant time.)  Another loading issue in DPubS occurs when a load stops extracting text from items that have intellectual property protection and thus cannot be shared; in such cases, the procedure has worked well previously, but when it doesn't there isn't a telltale way of finding out what happened - or what may have changed in anything else that interacts with the system (such as an upgrade of dependent software applications) - to pinpoint and thus resolve the problem.  Also, occasionally a successful load is inaccurately reported - i.e., DPubS says a load has been successful, when in fact it has not.  The integration of DPubS into workflows at the Press has also proven challenging, mainly because no DPubS developers could tell the Press how to incorporate the platform into their workflow.  The burden has been on users at the Press to figure out how to use it.  Functions related to workflow integration may well exist in DPubS already but are not currently exploited (perhaps because those capabilities haven't been conveyed to anyone). 
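The kind of automated pre-load check suggested above could be sketched as follows. This is a minimal illustration in Python, not part of DPubS: it only tests whether a metadata file is well-formed XML and reports the line and column of the first problem. The function name find_xml_problem is hypothetical, and DPubS's own validation rules (which we do not know in detail) would need a schema-aware check on top of this.

```python
import xml.etree.ElementTree as ET


def find_xml_problem(xml_text):
    """Return None if xml_text is well-formed XML.

    Otherwise return (line, column, message) for the first parse error,
    which is the detail the DPubS loader messages were reported to lack.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position of the first error found
        return line, column, str(err)
    return None


if __name__ == "__main__":
    # Hypothetical metadata fragment with a mistyped closing tag on line 2.
    sample = "<record>\n  <title>PA History</titl>\n</record>"
    print(find_xml_problem(sample))
```

A check like this could be run on each metadata file before a package is handed to the loader, so that problems are caught with a usable location attached.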

Other notable issues with DPubS include weaknesses in the user interface; for example, the display and sorting of the result set could be improved, and sometimes there are problems with certain accented characters in the XML metadata.  When it comes to packaging formats, DPubS also seems to favor one operating system over another, since the DPubS loader prefers .zip files made on Windows machines, rather than on Macs.  Finally, there is a known usability issue with DPubS PDFs that are accessed via the Safari browser.  (This is typically an issue for Safari users who have noted a preference on their Macs for using Adobe Acrobat for viewing PDFs.)  The University of Chicago has a workaround to which we point users who have experienced this usability problem.  (Note:  we will be creating our own instructions for this, to reduce possible confusion for users in terms of where they are on the Web.)

ETD-db

While internal users (Pauletta Leathers in the Graduate College and Robert Hardin in Schreyer Honors College) of the ETD-db application were, by and large, pleased with the performance of the platform, there were a number of features they would like to see added, such as the following:  a notification feature for students submitting before the deadline, to let them know they won't hear from Schreyer Honors College until after the deadline; a feature for flagging certain data points (e.g., if a student's submission doesn't meet format compliance, or have the requisite signatory name - there should be a way to flag this in the system); a way to accommodate the submission of multiple theses (as in the cases of students who are double majors); a way to modify the "submitted date" of an ETD (this is a user-generated value, but occasionally it is off and thus requires correction); and a feature for monitoring which ETDs have opted for embargoes (currently Pauletta must keep track manually - i.e., remember periodically to check if any embargoes have expired).  
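As an illustration of the embargo-monitoring request, the check that is currently done by hand could be sketched as below. This is a hedged example only: the record layout and the field names "id" and "embargo_until" are assumptions made for illustration and do not reflect the actual ETD-db schema.

```python
from datetime import date


def expired_embargoes(records, today):
    """Return the ids of records whose embargo ended on or before `today`.

    `records` is a list of dicts with hypothetical keys "id" and
    "embargo_until" (a date, or None when no embargo was requested).
    """
    return [
        record["id"]
        for record in records
        if record.get("embargo_until") is not None
        and record["embargo_until"] <= today
    ]


if __name__ == "__main__":
    sample = [
        {"id": "etd-1", "embargo_until": date(2010, 1, 1)},
        {"id": "etd-2", "embargo_until": None},
    ]
    print(expired_embargoes(sample, date.today()))
```

Run on a schedule, a report like this would replace the need to remember to check for expired embargoes.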

Pauletta has gathered some feedback from graduate student users of the system, which provides some additional insight into improving the user experience.  These improvements would:

  • Clarify whether just the advisor's or all committee members' names should be entered
  • Break the process up into more of a "wizard" format
  • Transfer the approval page electronically
  • Allow upload of a PDF/Word abstract in addition to an HTML textarea element, presumably for formatting reasons

Other internal users made observations about workflow, fee processing, and intellectual property issues.  The workflow in the ETD-db application has taken precedence over existing workflows, whereas ideally an ETD platform would accommodate users by modeling real-world workflows (rather than vice versa).  Also, the workflow of the current application is not believed to be enforced very well.  For instance, an administrative user may kick off later parts of the approval process before earlier ones, since the functions are independent of one another.  The workflow lives largely in the minds of the system's current users.  In addition, the processing of fees associated with ETD submission is separated out into other applications, as opposed to making this process a one-stop shop for ETDs.  The system also relies on some scheduled tasks that erode the "real time" aspect of the application, which introduces potentially unnecessary delays in processes.  Where intellectual property rights are concerned, while the full text of embargoed ETDs is not available for a period of time to any users, the abstracts and other related metadata of those restricted ETDs do remain accessible.  These materials could contain sensitive information and should perhaps also be protected.  (Or, at least, there could be caveats issued to ETD submitters once they enter the system, in which they're advised of the open availability of their abstracts and related metadata, allowing submitters to make a decision at that point about whether to keep these components of their ETDs accessible or not - or to rewrite the abstract in such a way that doesn't expose sensitive information.)
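To make the point about workflow enforcement concrete, here is a minimal sketch of a submission record that refuses to run approval steps out of order. The step names and the Submission class are hypothetical and chosen for illustration; the actual ETD-db approval steps and their order would have to be confirmed with the Graduate College and Schreyer Honors College.

```python
# Hypothetical, ordered approval steps (an assumption, not the ETD-db's own list).
STEPS = ["submitted", "format_review", "thesis_approval"]


class WorkflowError(Exception):
    """Raised when a step is attempted out of order."""


class Submission:
    def __init__(self):
        self.completed = []  # steps finished so far, in order

    def complete(self, step):
        """Mark `step` as done, but only if it is the next expected step."""
        if len(self.completed) == len(STEPS):
            raise WorkflowError("workflow already finished")
        expected = STEPS[len(self.completed)]
        if step != expected:
            raise WorkflowError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)


if __name__ == "__main__":
    submission = Submission()
    submission.complete("submitted")
    try:
        submission.complete("thesis_approval")  # skips format review
    except WorkflowError as err:
        print("rejected:", err)
```

The design choice here is that the order lives in one place in the software rather than in the minds of the system's users.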

Finally, one internal user noted that a key requirement for any new ETD/EHT platform would be not to break existing ETD/EHT URLs.  This requirement needn't actually live in the ETD software itself; it may be solved otherwise, as long as old URLs dereference properly.
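One way to meet this requirement outside the ETD software is a lookup table of legacy paths consulted by whatever serves the old addresses. The sketch below is illustrative only: the function name and the example paths are hypothetical, and a production setup would more likely express the same mapping in web server configuration.

```python
def resolve_legacy_path(path, redirects):
    """Return (301, new_location) for a known legacy path, else (404, None).

    `redirects` maps old ETD/EHT paths to their new locations.
    """
    target = redirects.get(path)
    if target is None:
        return 404, None
    return 301, target


if __name__ == "__main__":
    # Hypothetical old and new paths, for illustration only.
    table = {"/legacy/etd-0001/": "/new/etd/0001"}
    print(resolve_legacy_path("/legacy/etd-0001/", table))
```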

Olive

Olive’s ActivePaper Archive product has for some time been, for all intents and purposes, the only option for digital newspaper delivery.  The product is well supported, mature, and very widely used, though future software development is unclear since the last version came out in 2007.  There are a few other options now for delivery of such content, but Olive seems to have the best support for article segmentation.

The user interface reflects a somewhat outdated design, and in at least one user’s view, the interface is difficult for students to figure out. One subject expert noted that there are four basic ways users typically search: by an event on a specific day, by keyword, by byline (for a specific person), and by some combination of these.  Each of these approaches needs to be transparent in Olive, but none is. Some users reported feeling that the instructions provided by Olive aren’t clear enough on how to perform certain functions with the tool, such as trying to save a section, or all, of an article.  

Some functionalities that were requested: a better system for finding and tracking newspapers which have had different titles over the years; alternative interfaces for locating newspaper titles, such as a geographical one, or a chronological list of newspapers to give the user a better idea of which eras are spanned by a paper; connecting index terms from the system with the search box for “did you mean”-like functionality; exploiting the ability to create “clippings” to, say, curate custom collections of clippings around a particular topic.

Regarding support within the Libraries, internal users noted that response to issues is much more timely than in the past, but it's still not clear whom to call on for support when a problem arises.  That is, it would be nice if there were a "front-end" person, or owner, for the platform that users could know to call on when there's an issue in need of resolving fairly quickly.

Cross-Platform Issues

Access 

Various internal users we interviewed expressed the need to have more, or at least a level of regular, access to the platforms they work closely with.  For example, most of the public service librarians we spoke to also have subject expertise, which - not surprisingly - is called upon in the creation of metadata for digital collections during the pre-launch phase.  However, metadata for digital collections, post-launch, often call for corrections or some level of maintenance as well.  It would be easier and more efficient if subject librarians could continue to have access to the platforms in which they have experience creating metadata, in case changes need to be made quickly - particularly if the collection is heavily used (as in the case of our digitized maps collections).  Similarly, in the case of the ETD-db platform, administrators who manage the submission piece of the process talked of needing direct access to the user interface for making changes (such as adding new majors in a drop-down menu); at the moment their only alternative is to contact the research programmers in DLT who work with the ETD-db software.  Thus, a decision, or policy, addressing levels/classes of access for our platforms (especially CONTENTdm) would be helpful.   (This is also a question of continuing training for these librarians as the platform undergoes upgrades - i.e., they need to be included in training sessions for new versions of platforms as they are released.) 

Authentication/Authorization

Penn State is a leader in the area of identity and access management, providing robust authentication and authorization services both within the university and in the federated context.  Integration with campus authentication and authorization services is valuable for a number of reasons (that are beyond the scope of this document).  Support for such integration is generally poor in each of the four reviewed platforms.  

On the authentication side, only ETD-db integrates with Co-Sign, Penn State’s single sign-on service.  None of the other platforms uses Co-Sign: Olive’s and CONTENTdm’s authentication methods are bespoke, relying upon DLT-administered instances of Active Directory and Apache, respectively.  (There was insufficient data on DPubS’ approach to authentication.)  None of the platforms use campus authorization services: CONTENTdm and ETD-db handle authorization internally; Olive uses Active Directory.  At least one internal user believed that CONTENTdm can integrate with both campus authentication and authorization; this ability ought to be confirmed and tested.

Adoption

Of the four platforms reviewed, CONTENTdm is the one with the most staying power. Part of this stems from the strong support system it has (see below, under "Support"), and part of this has to do with the upgrade schedule; new versions of CONTENTdm are released fairly regularly. The Olive ActivePaper Archive tool, on the other hand, hasn't had an upgrade since about 2007 or so, and documentation for the platform hasn't been enhanced since 2004.  (Olive seems to have updated ActivePaper Daily much more recently.  For example, ActivePaper Daily has a feature that tracks user activity, providing data about user preferences, reading behavior, page views, etc. - a tool not unlike Google Analytics.)  Nonetheless, the lack of regular upgrades to Olive has not prevented it from continuing to be adopted widely for online delivery of digitized newspaper archives.

The future of the other platforms, DPubS and ETD-db, is less certain, since the former has support issues, and the latter, in its current shape, cannot be updated easily:  the code has been modified considerably since first being implemented more than a decade ago, making the upgrade path more complicated than it is probably worth.

Database support

Both CONTENTdm and Olive store relational data in proprietary databases, to which administrators do not have direct access, whereas DPubS and ETD-db use the free and open-source MySQL relational database management system.  The platforms use a number of technologies for accessing the databases, including the Perl, PHP, and ASP programming languages.  Neither these technologies nor the database systems utilized by the platforms are commonly used by Digital Library Technologies, which provides support for the platforms.

Development

The four platforms offer little in the way of development potential. The proprietary nature of Cdm and Olive means there is no extensible framework for customization; rather, it tends to occur on an ad hoc basis. Both Cdm and Olive offer customizable UIs, but at least one internal user we interviewed confessed a reluctance to customize Cdm's UI too much, in order not to complicate the upgrade path when new versions of the platform are released. (However, this comment was offered before the upgrade to Cdm 5.x, when the Libraries were still using Cdm 4.3. It could be that as improvements to Cdm occur incrementally and versions of Cdm 5.x and beyond are released [the Libraries are currently using Cdm 5.x], the UI customization issue could be revisited to see if a concerted effort to change the UI could be made - particularly with a usability program in place.) A similar concern about UI customization was expressed about Olive. Most internal users would like more flexibility in UI customization.

A key advantage to Cdm is its active user community, namely in the form of regional user groups that communicate via mailing lists and meet annually to hold conferences focused on the use and implementations of Cdm. In contrast, neither DPubS, nor the ETD-db software in its current state, nor Olive, enjoys a robust development network, though for slightly different reasons. In the case of DPubS and ETD-db, these platforms - as PSU knows them - have not been maintained, or updated, and can thus be considered moribund.  Olive's lack of a development community results directly from its proprietary status.

Globalization

While the extent to which our current digital content requires support for internationalization is unknown, our collections are growing and becoming more diverse and this trend seems unlikely to change in the near future. CONTENTdm shines in this area with full Unicode compliance as of version 5 (we currently run 5.2).  Internationalization support is largely unclear in the other three platforms: loading items into DPubS has at times raised character encoding issues, which may or may not be tied to Unicode, and its search system does not allow characters with diacritics; the ETD-db system, according to a couple of internal users, “probably doesn’t” support Unicode; and there is insufficient data to determine Olive's level of support, though it is likely to support Unicode given its wide and international install base.
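A common way for a search system to accept characters with diacritics is to fold them to their base letters in both the index and the query. The sketch below shows that general technique using Python's standard unicodedata module; it is not a description of how any of the four platforms works, and the function name fold_diacritics is our own.

```python
import unicodedata


def fold_diacritics(text):
    """Return `text` with combining marks removed (e.g. accents, carons).

    The text is first decomposed (NFD) so that each accented letter becomes
    a base letter followed by combining marks, which are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_diacritics("café"))
```

Applying the same folding to stored text and to incoming queries lets a search for the unaccented form match the accented one, and vice versa.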

Interoperability

Interoperability, as defined in the context of our analysis matrix, is really about the number of standards (e.g., XML, EAD, METS, OAI-PMH, etc.) that the platform supports and whether APIs are made available or not. (This information was gleaned as much from documentation about the platforms as it was from interviews with users.) Of the applications we reviewed, CONTENTdm - not surprisingly - is the most interoperable (up to a point) and ETD-db is the least (having no known OAI feature). While CONTENTdm has an export function (e.g., to XML, tab-delimited text, etc.) that is used to drive other interoperability functions, interoperability based on exports is limited, since only descriptive metadata is included; without data files, or links to them, the content ends up not being a part of the export. In the case of Olive, no APIs beyond the HTML interface (which has no OAI feed) are made available, rendering integration via the platform a challenge. However, Penn State owns the data behind Olive - essentially image and XML files - and it may be possible to enable interoperability by having an application that sits on top of these files in the file system and that works with other platforms in use. Finally, DPubS includes OAI-PMH functionality, either at the publication or collection level, but it appears not to have been enabled.
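For readers unfamiliar with OAI-PMH, the sketch below shows what a harvest request and a minimal response parse look like. The verb ListRecords, the metadataPrefix oai_dc, the optional set argument, and the Dublin Core namespace come from the OAI-PMH and Dublin Core specifications; the base URL and set name in the usage example are hypothetical, since no OAI-PMH endpoint is currently enabled on our DPubS instance.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of unqualified Dublin Core elements, as used in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for the given repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # limit the harvest to one set
    return base_url + "?" + urlencode(params)


def titles_from_response(xml_text):
    """Return the text of every dc:title element in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC + "title")]


if __name__ == "__main__":
    # Hypothetical endpoint and set name, for illustration only.
    print(list_records_url("https://example.org/oai", set_spec="pahistory"))
```

A harvester would fetch the URL, parse the response as shown, and repeat with the resumption token when the repository returns one; that paging step is omitted here.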

Metadata

Support for item metadata within the platforms is largely limited to descriptive metadata; administrative, structural, and preservation metadata lives outside of the platforms when it exists.  The metadata formats supported by ETD-db, DPubS, and Olive are custom, though DPubS appears to do some internal crosswalking to Dublin Core for its OAI-PMH interface. CONTENTdm supports some well-known formats, or element sets, such as Dublin Core, and there is also a template available for VRA Core.  

The custom metadata in ETD-db, DPubS, and Olive is fairly light: ETD-db supports seven fields as currently built, though this could be expanded through more customized software development; DPubS supports some very basic elements that one would expect to find in a citation; and Olive metadata is light by design, including elements such as title and time period, since the emphasis in that system is on full-text search.
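The crosswalking mentioned above amounts to mapping each custom field onto a Dublin Core element and noting anything that has no home. A minimal sketch follows; the custom field names in CROSSWALK are hypothetical and are not taken from the DPubS, ETD-db, or Olive schemas, while the targets are standard Dublin Core element names.

```python
# Hypothetical custom field names mapped to Dublin Core elements.
CROSSWALK = {"author": "creator", "title": "title", "issue_date": "date"}


def to_dublin_core(record, crosswalk=CROSSWALK):
    """Map a custom metadata record (a dict) onto Dublin Core elements.

    Returns (dc, unmapped): `dc` maps element names to lists of values
    (Dublin Core elements are repeatable), and `unmapped` lists the custom
    fields that have no crosswalk entry and so would be lost.
    """
    dc = {}
    unmapped = []
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            unmapped.append(field)
        else:
            dc.setdefault(target, []).append(value)
    return dc, unmapped


if __name__ == "__main__":
    print(to_dublin_core({"author": "A. Writer", "page_count": "12"}))
```

Reporting the unmapped fields makes explicit what a light crosswalk drops, which matters when the custom metadata is the only descriptive record for an item.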

The users we interviewed expressed common frustrations with search functionalities in our platforms. Almost everyone mentioned the need to enable federated searching, or searching across all collections supported by a platform. One DPubS user would like to be able to search across the monographs offered through this platform. Some CONTENTdm users complained about navigation - notably the loss of context when one is in a collection - and asked for more functionality in the area of item-level searching; this request depends on rich item-level metadata, which is sometimes a gap when archives are digitized, since archives are structurally represented at the series and folder levels - i.e., collection-description levels. (This tension between item-level metadata and collection-level metadata also brings up a side discussion we had with internal users who are archivists: in digitizing archives, archivists sometimes feel they are re-processing a collection; if possible, it would help to process a new collection with an eye toward possible digitization in the future - that is, to somehow make allowances for future digitization of the collection that would streamline processing tasks from the start.) Finally, keyword-searching in Olive is strong, but some internal users pointed out that suggestions/instructions for searching in Olive could be improved.  In turn, a comment like this points up the need to engage public services librarians who work with platforms like Olive more regularly in UI customizations.  They know their users' needs best, and they know these tools.

Support

The customer support provided by CONTENTdm and Olive is regarded by most internal users as strong and responsive, though a notable exception would be the abysmal support OCLC provided for Solaris-based systems, which Digital Library Technologies is gradually moving away from.  In addition to official vendor-provided support, CONTENTdm has an active and engaged virtual community; there are listservs for support from fellow implementers.

There is no longer any support provided for the ETD-db product, partly because we have so heavily customized our instances of the software.  Support for DPubS comes only from our development partner, a small team at Cornell University Libraries, and the extent to which active support of the software is a priority is unclear.

Examination of each platform’s support model shed light on how service providers at Penn State provide internal support.  Generally, customer service for each of the platforms has improved significantly over the past few years, partly as a result of increased attention to process and customer service orientation.  Many internal users, however, remain unaware of key support processes, such as whom to contact when there’s a problem with digital collections.  Support roles, not to mention ownership roles, need to be cleared up within Penn State.

Unexploited functionality

A theme that emerged during the review was unexploited functionality.  While we seem to be using ETD-db to the fullest extent, the other platforms provide functionality that is not used, not known, or has not yet been “turned on:” CONTENTdm is believed to be able to tap into campus authentication and authorization systems; an OAI-PMH interface, for enabling periodic harvest of publication metadata, is provided by DPubS but disabled; the librarian module in Olive, which is used for management of the application, may be used to construct canned searches  -- useful for linking modern phrases e.g. “emancipation proclamation” to ones that exist in historical collections -- but has not been widely used.

Analysis Matrix

Note: Element names are linked to their definitions (which are further down the page).  Also, the fields below contain comments and observations gathered from our interviews with internal users (e.g., staff, including librarians, who work with the back-end of systems) and external users (students and faculty, as well as some public service librarians/staff, who work primarily with the user [search] interface).

Element | CONTENTdm | DPubS | ETD-db | Olive

#Access Control

When a user is added, permissions are assigned for what they can do in Cdm.  There are both user- and group-based permissions.
Collections in Cdm may be published or unpublished, so the platform may be used to work on collections that are not yet made public, preventing that bottleneck.
Cdm 5.2 now has automatic updates to the software, but only those with admin privileges can monitor them, since the admin controls when updates are installed. (Only updates are loaded, not the entire project client.) 

Content loaded into DPubS is immediately available -- there are no items viewable by administrators that are not also viewable by the public.

Not clear for ETD-db under terms of Purdue's matrix. Dev team obviously has direct access to db.

From an internal user perspective, Roberta could use more control with the pull-down menu for majors.  When there's a new major, she'd like to be able to add it, rather than call on Kurt to do so. 

The Director and Librarian modules work only in Internet Explorer, and these modules are accessible only within DLT, from just one IP address.  Internal users raised the question of control in interviews (e.g., who has control over the UI, who is the front-end contact for problems with Olive).

#Adoption

The VRC pushes their faculty to use Cdm, but faculty aren't accustomed to going to a database to search for images -- and, worse, their impression is that content hosted in Cdm isn't very "Googleable" so the discovery options are limited.  Faculty typically ask VRC staff for images.
Cdm in general has been in use for several years at PSU. 2000 organizations around the world use Cdm.

Not widely adopted.
Reasons why publishing PA History with DPubS was difficult:
- no expertise in XML encoding at that time in D&P (no tools for encoding, either)
- the platform wasn't stable enough
 
Since then, tools developed by Cornell have improved how the work gets done.
The future of the software is unclear as it is more or less moribund.

May be moot, since this is the only software for submission of theses and dissertations at PSU (although it is software that originally came from Virginia Tech)

Olive has been the only newspaper digitization software used by Penn State.  It is arguably the most popular - used by 400 organizations on six continents (website: http://www.olivesoftware.com/customers/index.asp)

#Authentication

The Cdm user account/authentication/authorization is separate from PSU access ID (basically, user needs authorization to Cdm, then to the server).  Cdm user account names are made to match PSU Access IDs and are stored in an Apache htpasswd file.  It is believed that Cdm can integrate with the PSU authN/authZ system but this ought to be confirmed and tested.
Anyone with a Penn State user ID and password, including the branch campuses, can access the Art History collection in Cdm.
Just as an aside, we noticed that even though my (Patricia's) laptop was on the campus wireless, a login was still requested.  When the login link was clicked on, though, we were taken immediately to the next page without having to input a login.

 

Authentication to the administrative functions of the app is provided by integration with the Penn State standard, Co-Sign.  There is no authorization for thesis submitters; anyone with a PSU Access ID may submit a thesis.  Authorization for administrative functions of the application, such as format review and thesis approval, is provided via a database table listing PSU Access IDs granted administrative access.

Authentication and authorization are provided via Windows accounts local to the server running the Olive instance.  (Ideally, Olive would integrate into the campus authentication and authorization system, as this has been a requirement for other systems brought up since Olive was adopted)

#Database Support

Data lives in a proprietary database.
Cdm works with a flat database structure. Possible to import from other db systems using tab-delimited format (common to import from Excel, Access). Possible to export to XML or tab-delimited text formats.  Find search engine is basically the data management system underlying Cdm 5.x.

Software appears to use MySQL.

The codebase is a modified version of Virginia Tech's ETD-db, which is a set of Perl scripts sitting atop a MySQL database, neither of which was commonly used by DLT developers.

Olive runs on ASP and some proprietary database or index.

#Developer Ecosystem

Proprietary software for the most part. From the website: "CONTENTdm has a well-defined query API that allows for custom client development."   In general, I-Tech does few customizations to the Cdm interface; this is so that new versions don't throw the customized interfaces off too much (i.e., more customization means more work with the release of a new version).  Lack of an extensible customization framework introduces this upgrade path issue, which seems a common one in the platforms used at PSU.
One person noted that Cdm usage has spawned a robust user community consisting of people who share code (there's a place where you can drop code for people to use), though none of this discussion or code is available to the public.

Software is moribund so this is near non-existent.

The code is modified to the extent that clean upgrades from the VT development team are improbable, or perhaps judged to be more trouble than they're worth, given the primary users' satisfaction with the system they've already got.  It is unclear how actively maintained the ETD-db code is. 

Olive is proprietary software, so it is not possible to modify the code.  The UI is customizable, though. In terms of customizing the UI, not much needs to be done in Olive.  According to Karen, when a collection is built, there is a skin/structure cloner, and this cloner is used to build a duplicate collection.  Beyond the cloner, PSU folks tweak the UI by hand-editing the ASP files that drive the application.  This implies the product is not ideally architected to handle customization and such customizations complicate the upgrade path. 
More UI customization would be nice. 

#Globalization

Cdm 5.x fully supports Unicode.  (As of May/June 2010 Libraries are using Cdm 5.2.)  Has integrated OCR support for 184 languages.

The extent to which Unicode is supported by DPubS is unclear.  Character encoding issues encountered so far may be due to improper entity encoding in the XML.

ETD-db probably doesn't support Unicode.

 

#Installation

There's not much documentation for the sys admin side of things in Cdm.  Most of the Cdm documentation is not relevant for system administration.

 

Janis Mathewson (I-Tech) first adapted the ETD application, and then it was handed off around the fall of 2009 to Joni Barnoff and Kurt Baker (DLT) for maintenance, security/authentication tweaking, and so forth. Code was copied and modified to create the EHT application for Schreyer Honors College.

Vendor typically does installs/updates for this product.

#Interoperability

Cdm shares API for customizing the look of a collection.  All API documentation is on the OCLC/Cdm support site. 
Cdm also supports an export function (to XML, tab-delimited plain-text, etc.), which may be used to drive other interoperability functions, that Kevin uses to do QC on metadata.  Interoperability based on the exports will be limited as only descriptive metadata is included; there are no data files or links thereto, so the content is not included in the export.
XML, EAD, http, OAI-PMH, Z39.50, Unicode and METS/ALTO for import and export from MARC-based ILS and other database management systems.

There is some support for OAI-PMH in DPubS, at either the publication or the collection level, but it is unknown whether this functionality has been exploited.  (Seems like it was just never turned on, for whatever reason.)
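If OAI-PMH were enabled, a harvester would issue requests along the lines of the sketch below. The base URL and set name are placeholders, not an actual DPubS endpoint, and the sample response is hand-written to show the shape of an `oai_dc` record; only the protocol's standard `ListRecords` verb, `metadataPrefix` and `set` arguments, and namespaces are assumed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request; set_spec optionally narrows the harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iter(f"{OAI_NS}record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        title = record.findtext(f".//{DC_NS}title")
        pairs.append((identifier, title))
    return pairs


# Hand-written sample response; identifiers and titles are invented.
sample_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:journal/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Placeholder endpoint and set name.
print(list_records_url("https://example.org/oai", set_spec="pub1"))
print(extract_titles(sample_response))
```

The optional `set` argument is the protocol's mechanism for selective harvesting, which is how a publication-level or collection-level harvest would typically be expressed.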

 

Olive provides no known APIs other than the end-user HTML interface (no feeds or OAI), so integration via the Olive application will be difficult.  However, PSU owns the data that drives Olive -- the page images and XML files -- and so interoperability may be accomplished by other software that sits atop the same collections of files on the filesystem.

#Maturity

Jeff also looks for "bad code," i.e., tries to debug - which can be a challenge, since it's not all clean code.
Jeff has run into issues with the structure of the code and the poor commenting of the code, which makes it a difficult application to troubleshoot.
Cdm has been around since late 1990s/early 2000s. Most recent version released is 5.3 (released in March 2010).

Constructed in a "non-usable" way -- little thought was given to how it would be used and fit into a workflow.

10 years as a platform for the Graduate College. Almost 1 year for Schreyer Honors College.

In use at Penn State since about 2000 or 2002 (at first in collaboration with OCLC).

#Metadata Standards

Default schema is Dublin Core. VRA Core template is included. Metadata schemas are customizable.
Also, metadata templates are improved in Cdm 5.x - can add file extensions for video, audio, URL, project.
Metadata fields are set up in the Server admin module - these fields can get mapped to Dublin Core elements.  MODS-ish records can be created but Cdm constrains creating valid MODS records.
It is believed that later versions of Cdm support importing, or linking to, controlled vocabularies out on the web, e.g., OCLC's terminology services.
Technical metadata for items does not live in Cdm but in spreadsheets and databases that live on, for example, Kevin's workstation.
The metadata formats used by the VRC are same as what's used in the Architecture and Landscape Architecture Library.  Database platform is Filemaker Pro - this is where metadata is input (flat database with repeating fields).
Cdm does not yet support XML (metadata) uploads so this also was not an option for us.

The DPubS software requires that the XML be well-formed, but also validates it against its own specific schema, which is very strict.
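Well-formedness is the first of the two hurdles and can be checked before anything is loaded. The sketch below uses Python's standard library parser and shows how an unescaped ampersand, one common entity-encoding mistake, makes a document fail. The element names are invented and do not reflect the DPubS schema; validation against that schema is a separate step not shown here, since the standard library does not include a schema validator.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Invented element names, for illustration only.
good = "<article><title>Research &amp; Practice</title></article>"
bad = "<article><title>Research & Practice</title></article>"

print(well_formedness_error(good))  # no error, the ampersand is escaped
print(well_formedness_error(bad))   # parser reports where the bare "&" breaks parsing
```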

There is a separate interface for inputting metadata - the fields for which were configured mostly by the graduate school (10 years ago, when the ETD-db software was first put into use).
Metadata schema/record maps to DC, to MARC.

Metadata exposed for Olive entities (titles, issues, pages, segments) in the UI via search or display is simple, including a few elements such as title and time period.  Search has a greater emphasis upon the OCR full-text, the quality of which varies per issue.  (Despite the poor quality of some of the full-text and images, users of the system seem largely happy with it - but access to newspapers via Olive is free.)

#Migration

Jeff has been part of the team working on the migration of content in Cdm 4.3 over to Cdm 5.2.  This basically means copying data over and converting it to the version of Cdm we're implementing (there are also migration scripts, and file encodings are converted, e.g., from ASCII to UTF-8).

 

 

 

#Object Format Support

Just about any file format - JPEG, GIF, or TIFF images, WAV or MP3 audio files, AVI or MPEG video files, PDF files, finding aids in EAD, as well as URLs. Also has JPEG 2000 capability.
Uploads can occur singly or in batches. They can be single items (e.g., a photograph), or compound objects (e.g., a book).
Files in Cdm consist of images, text, and metadata.

 

Supports a variety of files - PDF, doc. Media files may be attached, but they are embedded in the PDF.

PDF is the image format - PDF files are what the users see when they click on the search results.  PSU contracts with Olive on occasion for digitization of the newspapers, and Olive ships back hard drives and otherwise outsources the digitization.

#Performance

Images take a long time to load in the browser.

 

 

 

#Platform Support

Linux, Windows (2003 and 2008).
The (server) code is written in PHP and runs on Solaris. 
Jeff reports that Cdm (server software) is fairly easy to work with on Linux.
Art History is a Mac environment, and - as is well-known - Cdm (project client) isn't supported on Macs.  So, if there's ever a problem with the PC, on which they do not have elevated privileges to make changes, they have to call on the Libraries - not efficient, and at times results in significant duplication of labor.

System is written in Perl.

 

DLT has it on a Windows 2003 box.

#Scalability

Cdm is meant to handle large collections - supposed to scale up without having to upgrade the software.
A single server can handle 300 collections (a collection may have up to 16 million items). 

 

 

 

#Search

Cdm uses the Find search engine (same as in WorldCat). Can customize which fields to search by. The terms list can be generated by the search engine. Indexing operations are improved in Cdm 5.x. Also: optional data filtering, which is advantageous for large full-text collections that have dirty OCR; this gives the user ability to adjust filtering control, which helps with full-text index speed and size.
Internal user perspective: searches cap out at 20K images (this cap needs to be adjusted, if possible).
Has Custom Queries and Results interface for canned searches.  
The custom query builder is essentially a wizard (also, this interface is public).  Can choose between "all," "or," and "exact" phrase search and how to display results (i.e., whether to show compound objects, or not).  Most common option is the "simple hyperlink which invokes a single predefined query."
Searches in Cdm are limited to a maximum of four fields.
Search terms don't often return expected results.
A widely reported issue with CONTENTdm search navigation is the loss of context on search result pages, especially when searching across collections; users want more item-level metadata, where available, surfaced, though there isn't always rich item-level metadata available for archival items where collection-level description is prevalent.
One frustration with Cdm that the VRC has is that when changes are made to a record, Cdm has to re-index everything.  Or if something was changed to the record but the image view wasn't changed accordingly, so VRC staff go back to correct that, the whole collection then has to be re-indexed again.  The VRC has more than 20,000 images, so the re-indexing takes a long time.  Platform isn't forgiving of oversights like this.
When Melissa has done searches in Cdm, she's found that the less specific her keywords are, the easier it is to find images.  The more specific she gets, the harder it is to find what she wants in the collection.  As an example of this, she recalled entering "Mycenaean" and getting back lots of results.  When she typed "snake goddess" in the search box, no results came back.
Other searches we did after the session with Melissa replicated this result: when a search is done using the search box on the results page and two keywords are entered, those keywords are interpreted as joined by the Boolean operator "or".
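To make the operator question concrete, the sketch below shows how the same two-keyword query behaves under "or" versus "and" on a toy index. It is not how the Cdm search engine is implemented; the records and the whitespace tokenization are invented for illustration.

```python
def build_index(records):
    """Map each lowercase word to the set of record ids containing it."""
    index = {}
    for record_id, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(record_id)
    return index


def search(index, query, operator="or"):
    """Combine per-term matches with OR (union) or AND (intersection)."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    if not term_sets:
        return set()
    if operator == "and":
        return set.intersection(*term_sets)
    return set.union(*term_sets)


# Invented sample records.
records = {
    1: "Minoan snake goddess figurine",
    2: "Egyptian goddess relief",
    3: "Bronze snake bracelet",
}
index = build_index(records)

print(sorted(search(index, "snake goddess", "or")))   # [1, 2, 3]
print(sorted(search(index, "snake goddess", "and")))  # [1]
```

Under "or", every record containing either word is returned, so the one record a user actually wants can be buried among partial matches; under "and", only records containing both words come back.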
A final issue we looked at was the Browse categories.  Melissa noted that some of them are not very helpful.  For example, it would improve browsing if a category like "Modern (19th & 20th Centuries)" had sub-categories (facets) like "painting," "sculpture," etc.  The same goes for a browse category like "Painting" - what period of painting, or what school of painting.

All [publications] allow browse by volume and issue, access to PDF issues and limited metadata, and search on a few fields, including full-text/OCR.
Searchable PDF files are not an ideal form for digital monographs because users want more "federated search" across monographs.

ETD-db at PSU has basic text searching (title page of ETDs) and browsing.  Fields for searching are author, title, abstract, and graduate program.

Great for searching on names.
Strengths of Olive as a platform include its keyword search capability -- per Sue, "really incredible" -- its article segmentation features in the UI, and search term highlighting. 

#Search Engine Optimization

 

 

 

Olive was patched recently to facilitate Google crawling of the newspaper content.  (According to Sue, Google started crawling our Olive instance for their own newspaper archive.)

#Storage Abstraction

Digital items are stored in file directories on a server. Each individual item is accessible via a text-based index (database) that points to the item.
Each collection is on its own file system, but it's better to chunk the collection content out.
Our digital collections have 88,000 file directories, even though Cdm has a maximum of 32,000 file directories.  DLT gets around this by using a different file naming convention.  This can be problematic since it exceeds the POSIX limit.
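One general way to "chunk out" content so that no single directory holds too many entries is to spread items across hashed subdirectories. The sketch below is a generic illustration of that idea, not a description of the naming convention DLT uses or of how Cdm lays out its files; the root path, item identifiers, and bucket count are all hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import PurePosixPath


def sharded_path(root, item_id, buckets=256):
    """Place an item under one of `buckets` subdirectories chosen by hash."""
    digest = hashlib.sha1(item_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return PurePosixPath(root) / f"{bucket:03d}" / item_id


# Hypothetical root and identifiers: spread 88,000 items and report
# how many land in the fullest bucket.
occupancy = Counter(
    sharded_path("/data/collection", f"item{n}").parts[-2] for n in range(88000)
)
print(len(occupancy), "buckets in use; fullest holds", max(occupancy.values()))
```

With 256 buckets, 88,000 items average roughly 344 per directory, far below a 32,000-entry ceiling, and the bucket count can be raised if a collection grows.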

 

 

Data stored as .pdf files. 

#Support Model

Externally: OCLC maintains a CONTENTdm User Support Center (requires login) where users can grab documentation.  OCLC also provides training sessions, and there are workshops and user groups for Cdm.
BUT support on Solaris (for Cdm 4.3) has been terrible.
We've been paying for support that we haven't been getting.
The support community (listserv) for Cdm is pretty decent.  Folks at Jeff's level post quite a bit.

Documentation is marginal. There is a very small, nearly non-existent, user community for the software, so any support requests must go directly to Cornell.  It is not clear how highly prioritized providing DPubS support is at Cornell.

The code is modified to the extent that clean upgrades from the VT development team are improbable, or perhaps judged to be more trouble than they're worth, given the primary users' satisfaction with the system they've already got. 
It is unclear how actively maintained the ETD-db code is. 

Customer support is consistently responsive.  The customer support provided by Olive also includes the ability to remote-desktop into the Olive server and work directly on the installation.

#Sustainability

 

 

The cost (including in terms of human resources) of sustaining the system outweighs the benefits at this point.  Hence, plans for investigating a new/alternative system.

Given the widespread (worldwide) usage of Olive, the platform is here to stay - in the sense that it's going to be serving universities and other organizations for a long time. It is considered by DLT folks to be a fairly stable system. 

#System Requirements

Dedicated Web server (IIS with Windows, Apache with UNIX); Intel Pentium 4 class compatible processor or higher; minimum 1 GB RAM, recommended 2GB+ RAM, 4GB RAM required for Level 3 licenses; 300 MB of available hard-disk space for installation; and adequate disk space to hold the collection. 

 

MySQL, Perl, and a web server (Apache with webaccess).  Hardware requirements are minimal, other than necessary back-up procedures for the data.

Olive supports Windows 2000 and Windows 2003. 

#Upgrade

Releases seem timely:  Cdm 5.0 came out in January 2009. Cdm 5.1 in June 2009. Cdm 5.2 in November 2009. Cdm 5.3 in March 2010. Cdm 5.4 is out as of July 2010. Cdm 6 may be out by the end of 2010.

 

 

Olive doesn't seem to make upgrades to the software very often (last one that Steve B. could recall was in 2007).

#Versioning

 

 

Once thesis is submitted, content is frozen. No modification of the PDF files is allowed once an ETD is approved and released.

 

Acknowledgments

We'd like to thank our sponsors for their support and their patience: Mike Furlough, Lisa German, and Mairead Martin.  The project would not have been possible without the participation of a multitude of Penn State staff, faculty, and students who provided the substance for the review via a series of system demonstrations and user interview sessions: Catherine Adams, Patrick Alexander, Julia Allis, John Attig, Kurt Baker, Steve Baylis, Joni Barnoff, Debora Cheney, Kevin Clair, Linda Friend, Roberta Hardin, Andrea Harrington, Sue Kellerman, Linda Klimczyk, Pauletta Leathers, Christy Long, Carolyn Lucarelli, Doris Malkmus, Janis Mathewson, Patrick McGrady, Melissa Mednicov, Jeff Minelli, Stephanie Jakle Movahedi-Lankarani, Linda Musser, Eric Novotny, Amy Paster, Jim Quigel, Albert Rozo, Karen Schwentner, and Tom Weber.  We would also like to acknowledge the work of the Comparative Analysis of Institutional Repository Software project team, who devised a set of evaluation criteria upon which we based our criteria: Dorothea Salo (University of Wisconsin), Siddharth Kumar Singh (Purdue University), and Michael Witt (Purdue University).

Appendix A: Project Charter

Project Charter

Appendix B: Other Products

Below is a list of competing products and where they're being used. 

I. Digital Library (content delivery/management) Software

  • Greenstone (also Greenstone wiki) - open-source software from University of Waikato (New Zealand).  
    • Examples using Greenstone:  Chopin Early Editions (U. Chicago); E. Azalia Hackley Collection (Detroit Public Library); Archives of Indian Labour (V.V.Giri National Labour Institute, Uttar Pradesh); Afghanistan Centre at Kabul University (AKCU) Library Catalogue (has more than 50,000 volumes).
    • Runs on Windows (all versions), Unix/Linux, Mac OS-X
    • Supports Dublin Core
      • Can import metadata in variety of forms (using plug-ins):  XML, MARC, CDS/ISIS, ProCite, BibTex, Refer, OAI, DSpace, METS.
    • OAI-PMH-enabled
    • Can export to METS (has own METS profile) and can ingest documents in METS form.
    • Can export to DSpace and DSpace content can be imported to Greenstone.
    • Also plug-ins for ingesting documents in array of formats:  PDF, PostScript, Word, RTF, HTML, Plain text, Latex, ZIP archives, Excel, PPT, Email (various formats), source code.
      •   And plug-ins for supporting multimedia formats: Images (any format, including GIF, JIF, JPEG, TIFF), MP3 audio, Ogg Vorbis audio, and a generic plug-in that can be configured for audio formats, MPEG, MIDI, etc.
    • Supports array of languages (developers have had input from UNESCO and Human Info NGO in distribution of software). 
  • Luna Insight - proprietary digital collection software
    • Examples of what collections look like in Luna - http://www.lunacommons.org/.
    • Requirements for end-user access & digital-collection building: 
      • Platforms - Windows 2000, XP & Vista; Mac OS-X 10.3 through 10.6
      • Browsers - IE 7.0+, Firefox 2.0+, Safari 3.0+, Chrome 4.0+
    • Admin tools - Windows and Mac OS-X as specified above; also Sun Solaris, Linux
    • Database apps supported:  Oracle 9i+ (Oracle 8i+ Insight only); Microsoft SQL Server 2000/2005; MySQL 4.1+
    • Formats supported - images, audio, video, text, PDF, QTVR
    • Can process web-based derivatives including a JPEG2000 source file for quick image delivery, zooming, and repurposing
    • Metadata - bundled with three data models: Dublin Core, VRA Core 3.0, and something called Simple Labels. Custom implementations have included METS, MARC, VRA Core 4.0.
    • Also OAI-PMH-enabled. Supports XML input and output.
  • MDID - Madison Digital Image Database (project blog, wiki), developed out of James Madison University (MDID3 currently in beta)
    • Demo site
    • Institutions that use, or have used, MDID include American University, Lewis & Clark College, Middlebury College, New Mexico State University, and University of Tennessee.
    • Features include:
      • Remote Collections
      • Personal images
      • IPTC Injection
      • ImageViewer classroom application
      • Integration with RLG CAMIO
      • Image Moderation
      • Support for multiple collections
      • Custom catalog data structures
      • Search and browse functions
      • Cross-collection searching
      • Personal collections in "My Images"
      • Slideshow light table
      • Tools for managing slideshows
      • User image notes and annotations
      • Web-based slideshow viewer
      • Packaged slideshows for offline presentation
      • Printable flashcards
      • Tools for managing user accounts and authentication
      • Data exchange through XML
  • Omeka - open-source web-publishing software for delivery/display of a variety of content types from libraries, museums, and archives.  Quickly gaining a reputation among users as "easy to use," "simple," "logical," and "unscary."
    • Examples of Omeka collections:  Lincoln at 200 (Newberry Library and Chicago History Museum); Daisie M. Helyar, 1906-1910 Scrapbook (Graduate School of Library and Information Science at Simmons College); Digital Worcester (Worcester State College); Treasures of the New York Public Library
    • Can handle large collections of metadata and files (more than 100,000 files) - website says "limitations are on your own server." 
    • Zend framework for PHP to enable customization
    • Wide variety of file formats accepted - image, audio, video, multi-page docs, PDFs, PPTs.  Individual items can have multiple files.
    • Supports Dublin Core
    • OAI-PMH-enabled
    • Data migration feature:  "Populate an Omeka site by adding items individually or batch add using data migration tools, such as the OAI-PMH harvester and CSV importer plugins" (from website).
    • Supports tagging.
    • Unicode-compliant.
    • Geolocation plug-in.
    • Reporting feature:  "Create customized reports with a simple HTML export, or PDF export that prints QR Codes" (from website).
    • Has plugin API - for development of plugins to suit project needs quickly and easily.
  • CQ DAM Repository - The Day CQ 5.3 content management system currently being implemented at the Libraries has a repository application we may wish to explore in the future as one of our options for document/data deposit.  There would definitely need to be some front-facing development done.

II.  ETD Platforms

  • ETD Workflows - open-source ETD app developed at NCSU Libraries, in collaboration with the NCSU Graduate School.
    • PHP/AJAX/MySQL application
    • Tested on Linux and Solaris 10
    • Modular architecture, plug-in based (for supporting local needs, such as authentication, metadata, and repository-to-catalog export)
    • At NCSU, integrates with their human resource management system (PeopleSoft plug-in), enabling population of metadata already captured by a HRMS.
    • DSpace installation provides public access at NCSU.
    • ETD Workflows instance at NCSU - http://www.lib.ncsu.edu/ETD-db/ETD-search/search.
  • Rutgers OpenETD - open-source, web-based ETD tool.  Can be stand-alone, or can be integrated with existing IR (via XML/METS export).  Functionality is for three different kinds of users:  Student, Graduate School Reviewer, and System Administrator.
    • OS/platform requirements - Solaris, Linux
    • Multiple school support (i.e., system is centralized for managing different types of schools and their degrees, curricula, submission terms and policies, etc.)
    • Enables basic customizations (logos, colors, footer info).
    • Unicode compliant
    • Email notification system
    • Automatically validates formatting for margins and page numbers.
    • Authentication can be local or centralized (or both).
    • Some export functionality to ProQuest/UMI.
    • Other application dependencies: MySQL 5.0 or above; PHP 5.2.12 or above; Ghostscript 8.64 or above
    • Other features (and user scenarios) can be found at the website.
  • Vireo: ETD Submission and Management - developed under the Texas Digital Library (TDL) program.  Should be released as open-source soon, if it hasn't been already (as of July 2010).  Workflow tool for handling submission, management, and publication of ETDs. 
    • Works with DSpace (doesn't appear to work with anything else)
    • Requires Shibboleth
    • TDL can provide hosting service
    • Overview of what Vireo provides in this presentation.

Note: At least one institution is using CONTENTdm for submission and dissemination of ETDs. This is Claremont Colleges (see http://ccdl.libraries.claremont.edu/collection.php?alias=/stc).

III.  Scholarly Publishing Platforms (equivalent of DPubS)

  • Open Conference Systems - free, open-source Web publishing application for publishing conference proceedings (and for conference management activities, such as calls for papers, submission, review, conference website development, etc.)
    • Product of the Public Knowledge Project
    • Comes with Open Harvester System - metadata indexing system.
    • Lemon8-XML is integrated in OCS, which makes possible conversions of Word or Open Office docs to XML-based publishing formats.
    • Support is community-based - which means really complicated issues may not get answered.  (See disclaimer on website.)
    • Can opt for hosted service
  • Open Monograph Press - under development.
    • Module-based platform, using various applicable PKP software already existing and re-assembling/customizing those applications to publish monographs.  See diagram for visualization of modules and workflow:

  • Plans include integration of social-networking component that would give authors with possible monograph project a space to develop that project with benefit of interacting with a like-minded community.

IV. Digitized Newspaper Delivery Platforms (equivalent of Olive)

  • CONTENTdm 5.4/6.x
    • Cdm 6.x will include an enhanced newspaper viewer in its public web interface.  Version 6.0 is slated for a late 2010 release. 
      Note: Florida Southern College is using Cdm to deliver its digitized newspaper content. See: http://archives.flsouthern.edu/cdm4/browse.php?CISOROOT=/Southern.
    • A new Cdm product named "Catcher" will provide batch metadata editing.  It will be piloted in autumn 2010.
    • Cdm 5.4 includes numerous features that would aid with digital newspaper management
      • Improved compound object editing
      • Index partitioning
      • New "flex loader" product for loading newspapers en masse w/o having to go through the Project Client
        • Replaces newspaper loader
        • Supports article segmentation
        • Metadata mapping at both the page and collection levels
        • Compatible w/ 5.x, does not require Cdm 5.4
        • Free to all Cdm licensees
      • Article-level search coming later in 2010
  • Chronicling America
    • Powers the publicly available NDNP website
    • ChronAm exposes numerous APIs
    • Free, open-source code at sourceforge.net

Appendix C: Element Definitions

The elements used to review the platforms were adopted from Purdue's Comparative Analysis of IR Software (link no longer active; this is an article that resulted from this work).

  • Access Control

    The purpose of this measure is to evaluate the access control policy maturity of the software - its flexibility and ease. 

  • Adoption

    This parameter evaluates how widely the software has been adopted and is being used.  

  • Authentication

    The purpose of this measure is to explore the authentication and authorization mechanisms of the software. Apart from providing a high level view of how this feature works, we will specifically look for local DB, LDAP/AD, Shibboleth integration for authentication. 

  • Database Support

    This parameter reports which databases are supported by the software repository. 

  • Developer Ecosystem

    This parameter serves to evaluate the development ecosystem for the software - how extensible the software is, and how strong the developer support is. The purpose of this measure is to evaluate the plugins/scripts available for the software repository. The broader point this parameter addresses is how extensible and programmable the repository is.

  • Globalization

    This parameter reflects the level of globalization support from the software. A good way of quantifying would be to check the number of languages supported by the software. A different but equally interesting method would be to note its major installation footprint across countries.

  • Installation

    The purpose of this measure is to evaluate the ease of installation of the software. 

  • Interoperability

    The purpose of this measure is to compare the interoperability feature of the software. By interoperability we mean the number of standards a software supports to be used in conjunction with other standards. 

  • Maturity

    The purpose of this measure is to evaluate the maturity of the software: the duration for which it has been in existence and its rate of development over the years. 

  • Metadata Standards

    The purpose of this measure is to evaluate what metadata is natively supported. In this context support for a given metadata schema means that metadata can be entered into the repository, stored in the database, indexed appropriately, and made searchable through the public user interface.

  • Migration

    This measure will bring forth the migration capabilities provided by the software repository. 

  • Object Format Support

    This parameter evaluates the file formats that are supported by the software. 

  • Performance

    The purpose of this measure is to evaluate the performance of the software. 

  • Platform Support

    The purpose of this measure is to bring to note the availability of the software for different platforms. Essentially, how many operating systems platforms is the software available for (Linux, Windows, Mac, Solaris). 

  • Scalability

    The purpose of this measure is to comment on the ability of the software to handle a sufficiently large number of objects.  

  • Search

    The purpose of this measure is to evaluate the search functionality that the software exposes to the user.  

  • Search Engine Optimization

    The purpose of this measure is to evaluate if the repository performs some optimizations to help in increasing the search engine visibility.  

  • Storage Abstraction

    The purpose of this measure is to find out the ways in which repository data can be stored.  

  • Support Model

    The purpose of this measure is to evaluate the available support for the software.

  • Sustainability

    The purpose of this measure is to report on the sustainability of the software in terms of future releases and feature additions. 

  • System Requirements

    The purpose of this measure is to report on the system requirements of the software, as well as hardware resources in order to install and run the repositories effectively.

  • Upgrade

    The purpose of this measure is to evaluate the ease and reliability of upgrading the software. 

  • Versioning

    The purpose of this measure is to evaluate if the software preserves the original content when modifications are made.



