Building an Institutional Research Repository from the Ground Up: The ARROW Experience

Dr Andrew Treloar [HREF29], Project Manager, Strategic Information Initiatives, Information Technology Services [HREF30] & ARROW [HREF32] Technical Architect. Building 3A, Monash University [HREF31], Victoria, 3800.

Updated version of paper delivered at AusWeb04.

Architecture, Design and Creation of Software to Support an Institutional Research Repository: The ARROW Experience

Abstract

This paper describes the development of the software to support ARROW - Australian Research Repositories Online to the World (a DEST-funded project under the Research Information Infrastructure Framework for Australian Higher Education). One way of conceptualising this process is to think of it as analogous to the process of building a house. The paper therefore begins by describing the vacant lot - the context in which the project came about. It then moves on to the design brief for the architect - the list of requirements. Next comes the resulting architectural drawings - the broad model and list of functions. In order to turn a blueprint into reality, one needs building materials - in this case the pieces of software required. Finally the paper discusses the state of the building site, and when the 'house' might be open to its first visitors.

1. Vacant Lot

This is a story about building an institutional research repository. More specifically, it is about designing the architecture and choosing the building materials to make this building possible. But before the architecture can be designed, the right environment needs to exist. What was the vacant lot that made it possible for this building to even be thought about as a possibility? In this case the vacant lot had two components: a general one and a more specifically Australian one.

1.1 Overall context

There is a growing interest among academic institutions in collecting, preserving, reusing and creating value-added services from digital content produced in and for research, teaching and learning. The emphasis on research outputs and collaboration, and distance, flexible and online learning, together with developments in information technology, has led to an increased awareness that the digital content being created by members of the academic community is an institutional asset. This content is also increasingly being recognised as an institutional challenge, requiring both tactical management and a strategic response.

At the same time many academic libraries are responding to the challenges of new technologies by taking the opportunity to redefine their fundamental role in the creation, distribution and provision of access to information. Over the past decade libraries have moved almost completely towards a digital platform for management of the information (both print and electronic) that they acquire or subscribe to. They have built significant digital collections of material published by others, and they are increasingly producing new content themselves [Harboe-Ree et al. (2004)]. Often this content originates from, or is the intellectual property of, their own institutions.

Meanwhile, all around the world, universities, their libraries, faculties, research centres and information technology and course development units, are trying to cope with the digital revolution. There is a growing recognition and articulation of the convergence that is occurring among the various digital initiatives in which universities are engaged, and the opportunities for potential synergies and more significant outcomes through collaboration and interoperability.

As one example, the COLIS (the Collaborative Online Learning and Information Services model [HREF3]) work at Macquarie University has focused on testing the feasibility of interoperable standards as a way of managing interactions between a range of electronic services. Through the success of the COLIS model, McLean and others have demonstrated that the new electronic environment can and must comprise a complex interactive matrix that is dependent on the information resources mentioned above, as well as on user directories, content and rights management software, and metadata repositories.

Sally A Rogers, from Ohio State University, has argued that the full array of a university's digital assets and information services should be broadly defined, and should include the library's catalogue, the electronic journals, reference databases and other electronic resources available through the library, as well as institutional repositories and resources created or collated elsewhere in the university, such as course material [Rogers (2003)]. She notes the overlapping of such initiatives as digital collections, course web sites, electronic course packs and learning objects, the desirability of integration to search across these repositories and the development of standards to promote interoperability. Rogers also highlights the potential of increased interoperability and connectivity to generate innovation in research, teaching and learning.

Most recently, the UK House of Commons Science and Technology Committee released on July 20, 2004 its long-awaited report entitled Scientific Publications: Free for all? [House of Commons Science and Technology Committee (2004)]. A number of the recommendations in this report refer explicitly to the central role of institutional repositories:

Recommendation 44: "We recommend that the Research Councils and other Government funders mandate their funded researchers to deposit a copy of all their articles in their institution's repository within one month of publication or a reasonable period to be agreed following publication, as a condition of their research grant." (p. 102)
Recommendation 46: "We recommend that DCMS provide adequate funds for the British Library to establish and maintain a central online repository for all UK research articles that are not housed in other institutional repositories." (p. 102)
Recommendation 48: "In order for institutional repositories to achieve maximum effectiveness, Government must adopt a joined-up approach. DTI, OST, DfES and DCMS should work together to create a strategy for the implementation of institutional repositories, with clearly defined aims and a realistic timetable." (p. 103)
Recommendation 52: "The cost to the taxpayer of establishing and maintaining an infrastructure of institutional repositories across UK higher education would be minimal, particularly in proportion to the current total UK higher education spend. When the cost is weighed against the benefits they would bring, institutional repositories plainly represent value for money." (p. 103)
Recommendation 53: "Having taken the step of funding and supporting institutional repositories, the UK Government would need to become an advocate for them at a global level. If all countries archived their research findings in this way, access to scientific publications would increase dramatically. We see this as a great opportunity for the UK to lead the way in broadening access to publicly-funded research findings and making available software tools and resources for accomplishing this work." (p. 103)
Recommendation 55: "We recommend that the Government appoints and funds a central body ... to co-ordinate the implementation of a network of institutional repositories." (p. 104)

1.2 Australian context

It was against this developing backdrop that the November 2002 report of the Higher Education Information Infrastructure Advisory Committee (HEIIAC) of the Australian Government Department of Education, Science and Training (DEST) [DEST (2002)] identified the following critical features of an enhanced research infrastructure:

information infrastructure resources should optimise the efforts of researchers in the higher education sector to create, manage, discover, access and disseminate knowledge;
access to the research information infrastructure should not be constrained by institutional affiliations, geographic locations or disciplines of individual researchers;
collaboration among libraries has improved the effectiveness of individual institutions, and further collaboration, clear strategies and a shared vision would significantly improve the coordination of the national research infrastructure;
opportunities should be sought for the academic community to regain control of scholarly publishing; and
computing and communication technologies provide new opportunities for the creation, management, storage and dissemination of information.

The HEIIAC report was primarily concerned with managing the current problems associated with scholarly communication and publishing, and it stressed the need to adopt a national collaborative approach. As already discussed, a range of players are embracing scholarly communication strategies and arguing that they should be incorporated into a more holistic approach to the management of institutional digital content and intellectual capital.

It was clear that the merging of these two approaches would yield substantial benefits to Australian university communities, consistent with the following statements of principle:

Australian universities have a commitment to support and promote their institutions' research activity through the creation and preservation of digital content, especially institutional repositories and electronic publishing.
Australian universities have a commitment to help their institutions achieve their goals more effectively by assisting with the integration of digital resources.
Australian universities have a commitment to collaborating nationally and internationally in the achievement of a more integrated approach to the management and interoperability of digital content. [Harboe-Ree and Treloar 2004]

These statements reflect the HEIIAC objectives and place them into a framework that, if implemented, would improve institutional and national efficiency and effectiveness. The challenge for HEIIAC was to turn these principles and objectives into action.

1.3 DEST RII Process

In June of 2003, the Australian Commonwealth Department of Education, Science and Training issued a call for proposals to "further the discovery, creation, management and dissemination of Australian research information in a digital environment" [DEST (2003a)]. This sought to "fund proposals which help promote Australian research output and help to build the Australian research information infrastructure, through the development of distributed digital repositories and common technical services that manage access and authorisation to these."

The guidelines for submissions identified the following requirements to be met by successful bids:

The application must provide clear evidence of the overall need for the project proposed in terms of the strategic and long-term benefits for the higher education sector in Australia as a whole and identify the specific outcomes that will be derived.
The application should indicate relevance to sector-wide needs and priorities and demonstrate that the proposal is an innovative approach.
The application must clearly demonstrate that the proposal is a cost effective response to an identified problem and will generate savings or productivity gains through its application.
The application should detail the nature and degree of cooperation between collaborating institutions.
Where relevant, the application should bear in mind future requirements and outline strategies to sustain the project beyond the period of Commonwealth funding.
Institutions should be mindful that in any infrastructure developed under this project the enabling architecture should be both effective and reasonably future proof.

In response to this call, 14 projects were submitted, of which four were funded [DEST (2003b)]. The successful projects were:

The Australian Research Repositories Online to the World (ARROW) [HREF25]
Australian Digital Theses Program Expansion and Redevelopment (ADT) [HREF13]
Towards an Australian Partnership for Sustainable Repositories (APSR) [HREF27]
Meta Access Management System (MAMS) [HREF28]

These four projects were funded for a combined total of A$12 million over a period of 3 years, with funding commencing at the start of 2004 [HREF11].

The focus of this paper will be the architectural design and creation of software to support the ARROW Project.

2. Design Brief

The original design brief was encapsulated in the Summary section of the ARROW Bid document sent to DEST (public version of the bid available at [HREF53]). This read:

"The ARROW project (ARROW) will identify and test a software solution or solutions to support best-practice institutional digital repositories comprising e-prints, digital theses and electronic publishing. A wide range of digital content types will be managed in these repositories. The NLA will develop a repository and associated metadata to support independent scholars (those not associated with institutions). A complementary activity of ARROW is the development and testing of national resource discovery services (developed by the NLA) using metadata harvested from the institutional repositories, and the exposing of metadata to provide services via protocols and toolkits. This will include a potential path for the redevelopment of the Australian Digital Theses (ADT) metadata repository incorporated into the NLA’s national resource discovery services.

Initially ARROW will be tested in the four partner institutions, prior to it being offered more widely across the higher-education sector. The solution will be open-standards based, or will support open standards, and will facilitate interoperability within and between participating institutions."

This was (deliberately) a very high-level statement. What might it mean when fleshed out a bit? The best way to get an accurate sense of this is to focus on the content streams that ARROW will manage and the content types it will have to deal with.

2.1 Content Streams

The functions that ARROW will perform can best be characterised in terms of different content streams. These derive from different origination points within the Australian research community.

2.1.1 E-print repositories

An e-print repository stores and makes available (in digital form) working papers, pre-prints (not yet published in the traditional literature) and post-prints. E-print repositories have been proliferating in recent years. Most have been set up by universities, but many have also been established by scholarly and professional societies and higher education research centres. Australian universities running e-prints repositories include The Australian National University, Monash University, The University of Melbourne, The University of Queensland, and Queensland University of Technology. The increased activity around e-prints has been facilitated by the development of free, open-source software [HREF12] that manages e-print repositories.

A key feature of these repositories is that content is usually available on an open-access basis (anyone can read or view it and no fees are payable). Many e-print repositories also work on a self-submission basis, with researchers depositing material into the repository themselves using an online deposit process. The rationale behind the growing e-prints movement is to reclaim institutional scholarly output and make it widely accessible internationally, thus removing barriers to learning and research, and improving its availability and citation.

2.1.2 Digital thesis repositories

A digital thesis repository stores and makes available online, in digital form, graduate research output (M.A. by research and Ph.D. theses). Digital theses in these repositories are offered on an open access basis. In Australia the Australian Digital Theses Program [HREF13] is a national collaborative distributed database of digitised theses produced at Australian Universities. Twenty-two higher education institutions are participating members of the Program, which uses deposit-process software [HREF14] first developed at Virginia Polytechnic Institute in the United States of America.

2.1.3 Electronic publishing

A growing number of higher education institutions are trying to establish sustainable publishing alternatives to reclaim the scholarly output currently published in heavily protected commercial journals and monographs. Institutional e-presses aim to offer electronic publishing services and functionality similar to those offered by commercial presses publishing product online, but in a way that is more aligned with institutional objectives, thereby tackling problems associated with the current scholarly publishing climate. These problems include pricing and intellectual property issues, as well as long lead times for publication and publishing models that do not allow for publication of media rich titles.

The activities of an e-press can range from digitising material originally designed for print and making it available online, through to the publication (in the sense of making public) of material that was born digital and that can only be fully represented digitally. E-presses are more akin to traditional publishing than e-print repositories in that e-press content tends to be offered on a subscription and/or pay-per-view basis.

As with e-print repositories, the Australian higher education sector is experiencing significant activity in this area. Both Monash University [HREF46] and The Australian National University [HREF47] have established e-presses, and Royal Melbourne Institute of Technology Publishing [HREF15] has been engaged in electronic publishing for several years now.

2.1.4 DEST Returns

Each year, Australian universities need to send to DEST information about their research output for the previous year. In most universities, this process involves manual data collection using paper forms which are then keyed into a database or spreadsheet. This is tedious and susceptible to error. In addition, the end result is a largely static document with no way to link from the publication information to the publications themselves.

ARROW wanted to see if it was possible to partially automate the gathering of publications for the annual Department of Education, Science and Training returns and storage of both the publications and required metadata in the institutional ARROW repository. This would meet the following objectives:

systematic accumulation of a critical mass of content
simplification of Department of Education, Science and Training return creation by universities
facilitation of the way in which the Department of Education, Science and Training verifies compliance

ARROW also wanted to see if it would be possible to enable universities to enter into an ongoing dialogue with their researchers about the issues associated with academics signing over copyright in research output, and the desirability of deposit into an institutional repository.

2.1.5 Non-University Research

Of course, not all research takes place in a university. Much also occurs in research institutes of one sort or another, in R&D centres in corporations or even in informal locations (what one might characterise as the Researcher in the Backyard Shed). Researchers at institutions without institutional repositories would find it difficult to make their research visible. As ARROW was seeking to capture and make visible as much Australian research as possible, it would be useful to find a way to deal with this potential content stream.

2.2 Content types

2.2.1 Content Type Philosophy

Another part of the design brief process was deciding on what content types (as opposed to streams) would be accepted. The project decided to adopt a variant of the model developed by MIT in its DSpace [HREF16] implementation. The DSpace philosophy can be summarised as follows:

Lots of digital material is already lost
Most digital material is at risk
Preserving bits is better than nothing
It is important to capture as much information as possible
It will be necessary to evaluate cost/benefit trade-offs over time

The project also decided to be informed by the National Archives of Australia guidelines on digital formats [HREF26]. Based on this, ARROW decided to accept three types of content:

Supported

The format is recognized, and the hosting institution is confident it can make bitstreams of this format usable in the future, using whatever combination of techniques (such as migration, emulation, etc.) is appropriate given the context of need.

Known

The format is recognized, and the hosting institution will promise to preserve the bitstream as-is, and allow it to be retrieved. The hosting institution will attempt to obtain enough information to enable the format to be upgraded to the 'supported' level.

Unsupported

The format is unrecognised, but the hosting institution will undertake to preserve the bitstream as-is and allow it to be retrieved.

On the vexed subject of Lossy vs Lossless formats, the decision was made that wherever possible, ARROW would endeavour to store data objects in lossless digital formats (these are formats that do not throw away information when compressing the file). Lossy formats (which do throw away information during compression) might be stored in addition, or rendered on the fly (where possible). Storage in lossy formats would be used only as a last resort.

2.2.2 Supported Formats

For Textual content, the supported formats are:

XML
- Files with an accompanying DTD or schema preferred. If not, then well-formed XML is acceptable.
Rich Text Format (RTF)
Adobe PDF
- NOTE: This content will be migrated to PDF-A once this is standardised.
HTML
- Validating as XHTML. Content that does not validate will need to be converted.

For Still Images, the supported formats are:

TIFF (Tagged Image File Format) [HREF34]
JPEG
- Store with no compression, migrate to JPEG-2000 over time.
PNG (Portable Network Graphics) [HREF35]
EPS
SVG (Scalable Vector Graphics) [HREF36]

For Moving Images, the supported format is:

MPEG-4

For Audio, the supported formats are:

WAV
CD Audio

For Multimedia content, the supported format is:

SMIL (Synchronized Multimedia Integration Language) [HREF37]

2.2.3 Known Formats

For Textual content the following formats are known:

Word/Excel/Powerpoint
- All versions, all operating systems.

NOTE: The reason for including Microsoft Office file formats is simply a recognition of the market reality. If alternatives (such as StarOffice [HREF39] or OpenOffice [HREF40]) become more widely deployed in the target environments for ARROW, this list may well be augmented.

For Still Images the following formats are known:

GIF
MrSID (Multi-Resolution Seamless Image Database) [HREF38]

For Moving Images the following formats are known:

Windows Media
AVI
Quicktime video encodings other than MPEG-4

For Audio the following format is known:

For Multimedia content, the following format is known:

Flash

2.2.4 Unsupported Formats

All other formats would be unsupported.
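The three-level scheme in sections 2.2.1 to 2.2.4 amounts to a lookup with a default. The following is a minimal sketch in Python of how such a classification might be expressed. It is an illustration only, not ARROW code: the function name, the enum and the lowercase labels are assumptions made for this example, and the lists contain only the formats named above (no known audio format is included because none is named in section 2.2.3).

```python
from enum import Enum


class SupportLevel(Enum):
    SUPPORTED = "supported"      # recognised; institution expects to keep it usable
    KNOWN = "known"              # recognised; bitstream preserved as-is
    UNSUPPORTED = "unsupported"  # unrecognised; bitstream preserved as-is


# Format labels taken from sections 2.2.2 and 2.2.3. The lowercase keys are an
# illustrative normalisation, not an ARROW data structure.
SUPPORTED_FORMATS = {
    "xml", "rtf", "pdf", "html",            # textual
    "tiff", "jpeg", "png", "eps", "svg",    # still images
    "mpeg-4",                               # moving images
    "wav", "cd audio",                      # audio
    "smil",                                 # multimedia
}
KNOWN_FORMATS = {
    "word", "excel", "powerpoint",          # textual
    "gif", "mrsid",                         # still images
    "windows media", "avi",                 # moving images
    "quicktime",                            # stands for non-MPEG-4 Quicktime encodings
    "flash",                                # multimedia
}


def classify(format_label: str) -> SupportLevel:
    """Return the support level for a format label; anything unlisted is unsupported."""
    key = format_label.strip().lower()
    if key in SUPPORTED_FORMATS:
        return SupportLevel.SUPPORTED
    if key in KNOWN_FORMATS:
        return SupportLevel.KNOWN
    return SupportLevel.UNSUPPORTED
```

For example, classify("TIFF") gives the supported level, classify("Flash") gives the known level, and any label that appears in neither list falls through to unsupported, mirroring section 2.2.4.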

2.3 Overall Philosophy

The final part of the design brief was to make decisions about Open Source and Open Standards. The first decision was an easy one: it was a condition of the funding from DEST that any software developed using project funds had to be made available as open source. This ensured that the Australian (and, ultimately, the global) research communities got the best value from the investment. The second decision also turned out to be an easy one. The core design group agreed that the best approach was to adopt open standards wherever possible when specifying software functionality, data formats or interfaces.

3. Architectural Drawings

Once the project had a clear design brief it was possible to move on to the next step: deciding the broad architecture. This involved a series of iterative steps, as well as a lot of research into what approaches similar projects overseas had adopted. The project ended up defining three categories of required repository functionality.

3.1 Common Repository

The project decided that, if possible, all the various content types should be stored in a common repository. This would:

facilitate linkages between items
allow for more efficient management of the content and the infrastructure
enable exposure of all of an institution's public research output through a common mechanism

3.2 Content Management and Workflow

In order to get the content into the common repository, the project needed a way to efficiently manage different classes of content contributors and different content streams. The project ended up deciding to define a series of Content Management and Workflow modules, corresponding to the content streams discussed under section 2.1. Each of these modules would have its own content submission forms and workflow. Each would also have specific functionality to deal with the requirements of that particular stream type.

The ePrints module would provide software that offers no less functionality than the eprints.org software used by many universities. The major issue with this module was anticipated to be the management of content self-submission and administrative management.

The eTheses module would provide software that offers no less functionality than the current Australian Digital Theses Program software and includes OAI-PMH compliance for metadata harvesting. The main issues were anticipated to be data capture from various sources, efficient harvesting from institutional repositories, identification of software, performance and scalability requirements and interactions with other metadata services.

The ePress module would provide software that offers sufficient functionality to run an open-access ejournal electronic press, including both submission management and publishing of multiple journals.

The DEST Research Directory module would test the feasibility and effectiveness of using an ARROW repository to support the annual Department of Education, Science and Training returns. The initial instance would be a repository holding a proportion of the institution's Department of Education, Science and Training 2003 returns. The main issues were anticipated to be the management of content submission from academics and embedding use of the repository in the existing institutional collection process.

The NLA Repository would provide support for non-university researchers by providing a repository hosted at the National Library of Australia.

The project also recognised that the ARROW infrastructure would be potentially applicable to a wider range of problems. For this reason the possibility of adding other Content Management and Workflow modules later on was left open.
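The relationship between the modules and the common repository can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ARROW design: the class names, the form fields and the workflow step names are all hypothetical, chosen only to show each module carrying its own submission form and workflow while depositing into one shared store.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CommonRepository:
    """Single store shared by every content stream (section 3.1)."""
    items: List[Dict[str, str]] = field(default_factory=list)

    def store(self, item: Dict[str, str]) -> None:
        self.items.append(item)


@dataclass
class WorkflowModule:
    """One Content Management and Workflow module (section 3.2)."""
    name: str
    form_fields: List[str]      # hypothetical submission-form fields
    workflow_steps: List[str]   # hypothetical workflow step names

    def submit(self, repository: CommonRepository,
               submission: Dict[str, str]) -> Dict[str, str]:
        # Each module checks its own form before anything reaches the repository.
        missing = [f for f in self.form_fields if f not in submission]
        if missing:
            raise ValueError(f"{self.name}: missing fields {missing}")
        item = dict(submission)
        item["stream"] = self.name
        item["workflow"] = " > ".join(self.workflow_steps)
        repository.store(item)
        return item


repository = CommonRepository()
eprints = WorkflowModule("ePrints", ["title", "author"],
                         ["self-submission", "administrative review"])
etheses = WorkflowModule("eTheses", ["title", "author", "degree"],
                         ["deposit", "metadata check"])

eprints.submit(repository, {"title": "A working paper", "author": "A. Researcher"})
etheses.submit(repository, {"title": "A thesis", "author": "B. Student",
                            "degree": "Ph.D."})
```

Adding a further module later, as the project left open, would in this sketch mean constructing one more WorkflowModule without touching the repository class.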

3.3 Search and Exposure

The ability to locate appropriate content for citation purposes is a critical success factor in creating reliable scholarly communication and increasing the impact of research. ARROW decided to develop a nationally available resource discovery service to provide access to Australian research output. The project will establish automated mechanisms for harvesting and re-purposing metadata from institutions and individual researchers. This will be done by applying international standards, specifications and technologies to ensure interoperability. Resource discovery will be supported by descriptive metadata. Other types of metadata may also be generated to support digital rights management, persistent identification, and archiving and preservation to ensure the longevity of scholarly content. In addition, it will be possible to search ARROW repositories through a range of discovery tools (such as education portals or search engines). This exposure will increase awareness of unique Australian content, both nationally and internationally. The project will also seek to expose published Australian research in commercial repositories, such as those created by large journal publishers.
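Section 3.2 names OAI-PMH as the harvesting protocol for the eTheses module. Assuming the discovery service harvests other streams the same way, the sketch below shows in Python what a harvest request and the handling of its response might look like. The repository address, the sample record and the function names are invented for illustration; only the verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications. No network call is made, so the response is a hard-coded sample.

```python
from typing import Dict, List
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request for the given repository base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(response_xml: str) -> List[Dict[str, object]]:
    """Pull the identifier and Dublin Core titles out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(
            f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles = [el.text for el in record.iter(f"{{{DC_NS}}}title") if el.text]
        records.append({"identifier": identifier, "titles": titles})
    return records


# Invented sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu.au:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis</dc:title>
          <dc:creator>Researcher, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

harvest_url = list_records_url("https://repository.example.edu.au/oai")
harvested = parse_records(SAMPLE_RESPONSE)
```

A production harvester would also need to follow resumption tokens and handle protocol errors, both of which are left out here to keep the sketch short.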

3.4 OLAD

The end result of the architectural decisions in each of the categories of Common Repository, Content Management and Workflow and Search & Exposure was a layered architecture. The notion of a layered architecture is not particularly controversial. Such architectures have been preferred since at least the days of the International Organization for Standardization (ISO) Open Systems Interconnection (OSI) seven-layer reference model for network services. In the Digital Library field these sorts of high-level models are so common that the project group took to referring to 'obligatory' layered architecture diagrams. Figure 1 therefore is the OLAD (Obligatory Layered Architecture Diagram) for ARROW.

ARROW OLAD

Figure 1: Obligatory Layered Architecture Diagram for ARROW.

4. Building ARROW

Now that the architecture was defined, the project had to work out how to build it. In construction terms, what building materials were available, what were the best ones to choose, and who was going to do the building?

4.1 Foundation - the repository

The project recognised very early on that the decision on the repository was foundational. The choice of repository technology would determine the functionality ARROW could provide and the ways it could provide it. Much of the latter half of 2003 was spent in careful analysis of available candidates, based on a mixture of:

reading publicly available materials including:

system documentation
published articles/conference papers
online presentations
notes from conference sessions

lurking on mailing lists
downloading the software and 'kicking the tyres'
attending conference sessions (and talking to presenters afterwards)
talking to other users to get a less-partisan assessment

As a result of this work, the project rapidly settled on two likely candidates: DSpace and FEDORA.

4.1.1 DSpace

DSpace [HREF16] is a joint activity between MIT Libraries and Hewlett-Packard to develop a software system that enables institutions to:

Capture and describe digital works using customized workflow processes
Provide access to an institution's digital works over the web, so users can search and retrieve items in the collection
Preserve digital works over the long term

It is being made available under the BSD open source license to other groups to run as-is, or to modify and extend as needed.

DSpace can best be thought of as a general-purpose repository application, with a series of both hard-wired and preferred behaviours. It is designed to provide stable long-term storage needed to house the digital products of MIT faculty and researchers. DSpace is intended to have different advantages for different stakeholder groups:

"For the user: DSpace enables easy remote access and the ability to read and search DSpace items from one location: the World Wide Web.
For the contributor: DSpace offers the advantages of digital distribution and long-term preservation for a variety of formats including text, audio, video, images, datasets and more. Authors can store their digital works in collections that are maintained by MIT communities.

For the institution: DSpace offers the opportunity to provide access to all the research of the institution through one interface. The repository is organized to accommodate the varying policy and workflow issues inherent in a multi-disciplinary environment. Submission workflow and access policies can be customized to adhere closely to each community's needs." [HREF17]

While DSpace grew out of the needs of MIT, a group of North American and European universities are now participating in the DSpace Federation [HREF18], which will test the existing software, and offer suggestions about how to further develop and improve it.

DSpace supports a wide range of content types [HREF19], and particular installations can easily extend the range available.

4.1.2 FEDORA

FEDORA [HREF52] is both a software platform and an architecture (it stands for the Flexible Extensible Digital Object and Repository Architecture). Note that this FEDORA is both different to and predates the use of the name by RedHat. The architecture came out of Digital Library work done in the computer science field in the late 1990s [Payette and Staples (2002)]. The history of the FEDORA repository software is described on its website as follows:

"In the summer of 1999 ... the [University of Virginia] Library's research and development group discovered a paper about Fedora written by Sandra Payette and Carl Lagoze of Cornell's Digital Library Research Group. Fedora was designed on the principle that interoperability and extensibility is best achieved by architecting a clean and modular separation of data, interfaces, and mechanisms (i.e., executable programs). With Cornell's help, the Virginia team installed the research software version of Fedora and began experimenting with some of Virginia's digital collections. Convinced that Fedora was exactly the framework they were seeking, the Virginia team reinterpreted the implementation and developed a prototype that used a relational database backend and a Java servlet that provided the repository access functionality. The prototype provided strong evidence that the Fedora architecture could indeed be the foundation for a practical, scalable digital library system. In September of 2001 The University of Virginia received a grant of $1,000,000 from the Andrew W. Mellon Foundation to enable the Library, in collaboration with Cornell University, to build a sophisticated digital object repository system based on the Flexible Extensible Digital Object and Repository Architecture (Fedora). The Mellon grant was based on the success of the Virginia prototype, and the vision of a new open-source version of Fedora that exploits the latest web technologies. Virginia and Cornell have joined forces to build this robust implementation of the Fedora architecture with a full array of management utilities necessary to support it." [HREF41].

Increasingly, the term FEDORA (which was first used over 5 years ago as an acronym for the architecture) is now being used to refer to this software implementation. In this latter sense, FEDORA is "an open source, digital object repository system using public APIs exposed as web services." [Staples, Wayland and Payette (2003)]. FEDORA can best be thought of as services-mediation infrastructure, rather than an off-the-shelf application. It can use web services to call other services as well as expose its own services using web services standards. Key to the FEDORA architecture (yes, I know this is like referring to an ATM Machine...) is its underlying object-based model. FEDORA stores digital content objects, either as datastreams contained within the repository or as links to external resources. It also stores disseminators, which are ways to render these digital content objects. The software maintains bindings between content objects and their disseminators. Each object has a default disseminator, but may be able to be disseminated in other ways. This architecture is extremely flexible, and provides significant advantages as a platform on which to build other applications.
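
The object model just described can be illustrated with a minimal sketch. This is not FEDORA's API or storage format: the class names, field names and the idea of modelling a disseminator as a plain function are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Datastream:
    """Content held inside the repository, or a link to an external resource."""
    ds_id: str
    content: Optional[bytes] = None      # bytes stored in the repository
    external_url: Optional[str] = None   # or a reference to content held elsewhere


# Assumption: a disseminator is modelled as a function that renders an object.
Disseminator = Callable[["DigitalObject"], str]


@dataclass
class DigitalObject:
    """A digital object plus the bindings to its disseminators."""
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    disseminators: Dict[str, Disseminator] = field(default_factory=dict)
    default_disseminator: str = "default"

    def bind(self, name: str, disseminator: Disseminator) -> None:
        # Record the binding between this object and a way of rendering it.
        self.disseminators[name] = disseminator

    def disseminate(self, name: Optional[str] = None) -> str:
        # Fall back to the default disseminator when none is requested.
        chosen = name or self.default_disseminator
        return self.disseminators[chosen](self)


def list_datastreams(obj: DigitalObject) -> str:
    return ", ".join(sorted(obj.datastreams))


def as_citation(obj: DigitalObject) -> str:
    return f"{obj.pid}: {len(obj.datastreams)} datastream(s)"


# Hypothetical object with one internal and one external datastream.
paper = DigitalObject(pid="demo:1")
paper.datastreams["PDF"] = Datastream("PDF", content=b"%PDF-")
paper.datastreams["DATA"] = Datastream("DATA", external_url="http://example.org/data.csv")
paper.bind("default", list_datastreams)
paper.bind("citation", as_citation)
```

Calling paper.disseminate() uses the default binding, while paper.disseminate("citation") renders the same object in a different way without touching the stored content.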

Version 1.2 of FEDORA, released in late December 2003, provides versioning of both objects and their disseminators, as well as a Java-based Administration GUI.

4.1.3 Other Open-Source Repositories

There is also a range of other open-source repository projects underway. The Open Society Institute is currently maintaining a document which summarises the functionality of many of them [HREF8]. In addition to DSpace, the current version also reviews FEDORA, CDSWare, MyCoRe, i-Tor, eprints.org and ARNO. These each come out of particular responses to the challenges of managing large amounts of digital content, and each has its own strengths and weaknesses.

4.1.4 Selection

After careful consideration, the project ended up selecting FEDORA rather than DSpace as the underlying repository. The main reasons for this decision were:

ARROW needed something to build on top of (like FEDORA) rather than an existing application (like DSpace)
The object-oriented data model for FEDORA was much more flexible than DSpace's Repository-Community-Collection-Item-Bitstream hierarchy
ARROW wanted to be able to have persistent identifiers down to the level of individual datastreams (DSpace only has such identifiers for the item)
ARROW wanted to be able to version both content and disseminators
The APIs were exposed much more openly and cleanly in FEDORA (as well-documented SOAP/REST web services) than in DSpace (poorly documented Java APIs)
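
Two of the requirements in this list, identifiers at the level of the individual datastream and versioning of content, can be made concrete with a small sketch. The identifier scheme (object identifier plus datastream name) and all class and method names are hypothetical; neither FEDORA nor DSpace is being quoted here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VersionedDatastream:
    """A datastream that keeps every version and has its own identifier."""
    object_pid: str
    ds_id: str
    versions: List[bytes] = field(default_factory=list)

    @property
    def identifier(self) -> str:
        # Hypothetical scheme: object identifier plus datastream name.
        return f"{self.object_pid}/{self.ds_id}"

    def add_version(self, content: bytes) -> int:
        # Earlier versions are retained, never overwritten.
        self.versions.append(content)
        return len(self.versions)  # 1-based version number

    def get(self, version: Optional[int] = None) -> bytes:
        # No version requested means "latest".
        if version is None:
            return self.versions[-1]
        return self.versions[version - 1]


preprint = VersionedDatastream(object_pid="demo:1", ds_id="PDF")
first = preprint.add_version(b"draft text")
second = preprint.add_version(b"final text")
```

With this shape a citation can point at preprint.identifier and still resolve to a specific version, which is not possible when only the enclosing item carries an identifier.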

However, it should be said that this is an area where a number of players, both open-source and proprietary, are moving very quickly. DSpace and FEDORA have each announced their plans for version 2.0 of their software, probably due out next year. As a result, ARROW agreed to review its software decision every 12 months.

4.2 Framing it up - the application development framework

One of the things that the repository may determine is the choice of application development framework. This is because some repositories only allow particular languages to call their Application Programming Interfaces (APIs). The project felt it was important to be able to code in a variety of languages (not be restricted to one). Given the increasing popularity and uptake of Web Services, it was also deemed critical to be able to expose repository functionality in this way. These two points are partially inter-related: having web services makes it much easier to use a range of languages because it decouples the implementation of the functionality from its invocation. ARROW decided that as well as building on web services it would also expose web services. That is, all the new ARROW functionality would also be made available as web services.

4.3 Doors and Windows - the search and exposure layer

As discussed above, a key driver behind the project was making items in ARROW repositories as accessible as possible. After careful examination of the best ways to do this, the project decided to target three very different technologies.

4.3.1 OAI-PMH

The Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH) was created to facilitate discovery of distributed resources, such as those contained in a repository. The OAI-PMH achieves this by providing a simple, yet powerful framework for metadata harvesting. Harvesters can incrementally gather records contained in OAI-PMH repositories and use them to create services covering the content of several repositories [Van de Sompel, Young and Hickey (2003)]. OAI-PMH is rapidly gathering strength as a way of providing federated resource discovery services and was seen as essential to the success of ARROW.

The National Library will use OAI-PMH where available (and other technologies where not) to harvest the metadata from ARROW and other institutional repositories. These metadata will then be used to provide national and international resource discovery for Australian research. This national resource-discovery service [HREF46] will also link with other national services delivered by the National Library for the Australian Digital Theses Program and the international Networked Digital Library of Theses and Dissertations.
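
To show what incremental harvesting looks like on the wire, the following sketch builds OAI-PMH ListRecords requests. The verb and the metadataPrefix, from and resumptionToken arguments are defined by the protocol; the endpoint URL and the function name are assumptions, and no particular harvester (including the National Library's) is being described.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://repository.example.edu/oai"  # hypothetical endpoint


def list_records_url(base_url: str,
                     from_date: Optional[str] = None,
                     resumption_token: Optional[str] = None,
                     metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it continues an earlier
        # list request and must not be combined with the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Only records created or changed since this date are returned,
            # which is what makes harvesting incremental.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


first_request = list_records_url(BASE_URL, from_date="2004-07-01")
next_request = list_records_url(BASE_URL, resumption_token="abc")
```

A harvester would issue the first request, then keep following resumption tokens until the repository returns an empty one, and record the date of the run for use as the next from value.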

4.3.2 Google and other search engines

There is little need to discuss the success of Google. For most university students (and probably for most staff!) Google is the resource discovery mechanism of choice. Enabling Google and other search engines to access the metadata and the full text of items in ARROW repositories was an easy choice to make. In practice, this will mean provision of a robots.txt file and publicly available content in a directory location accessible by web spidering software, as well as the ability to automatically request re-spidering for new content.
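
As a sketch of the robots.txt approach, the example below checks a sample file with the robots.txt parser in Python's standard library. The directory names and host name are hypothetical; the actual ARROW directory layout is not specified in this paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: repository content is open to all spiders,
# the administrative interface is not.
ROBOTS_TXT = """\
User-agent: *
Allow: /repository/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

can_fetch_item = parser.can_fetch(
    "Googlebot", "http://arrow.example.edu/repository/item/1")
can_fetch_admin = parser.can_fetch(
    "Googlebot", "http://arrow.example.edu/admin/login")
```

Here can_fetch_item is True and can_fetch_admin is False, so a well-behaved spider would index the repository content and leave the administrative pages alone.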

Having the content indexed by search engines is only part of the solution. Given the increasing amount of content searchable on the Web, it is important to ensure that search results are ranked sufficiently highly to be noticed by end-users. The best way to do this varies from search engine to search engine. In addition, there is an active commercial industry focussed around 'optimising' search results. In the case of Google, there is something of an 'arms race' developing between the ranking techniques Google uses and attempts to 'influence' these. ARROW has decided that it does not want to attempt to affect rankings, other than by making our content as accessible (and thus hopefully as citeable) as possible.

4.3.3 SRU/SRW

The third exposure layer was in some ways a less obvious choice. Both OAI-PMH and Google are 'proxy' search services. That is, they collect proxy records and place them in a database where they can be searched. Such proxy systems always run the risk of being out of date (if only slightly). We therefore wanted to make it possible for other search services to connect directly to ARROW repositories and run interactive searches. The standard protocol for such connections in the library world is Z39.50 (more formally known as ISO 23950: "Information Retrieval (Z39.50): Application Service Definition and Protocol Specification") [HREF21]. Z39.50 has not been taken up as quickly as its proponents had hoped (for a variety of reasons too complex to cover here). As a result the Z39.50 Next Generation group (ZNG) have been working on more modern and lightweight protocols to achieve much of the original Z39.50 functionality. These newer protocols are called SRU (Search/Retrieve over URL) [HREF22] and SRW (Search/Retrieve for Web Services) [HREF23]. ARROW decided to support both SRU and SRW connections to make real-time searching possible through things like the portlet technology being developed by education.au [HREF24].
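
Because SRU carries the whole search in a URL, a request is easy to sketch. The operation, version, query, startRecord and maximumRecords parameters come from the SRU specification, and the query is written in CQL; the endpoint, the choice of version 1.1 and the function name are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url: str, cql_query: str,
                   start_record: int = 1, maximum_records: int = 10) -> str:
    """Build an SRU searchRetrieve request URL."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",             # assumed protocol version
        "query": cql_query,           # CQL query string, URL-encoded below
        "startRecord": start_record,  # SRU record positions start at 1
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and query.
url = sru_search_url("http://repository.example.edu/sru", 'dc.title = "arrow"')
decoded = parse_qs(urlsplit(url).query)
```

The response is an XML document returned directly by the repository, so the results reflect its current contents rather than a previously harvested copy.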

4.3.4 RSS

In addition to these three technologies, the project is also looking at ways of providing an alerting service. The obvious technology to support this is RSS [HREF48]. Of course, there is little point in setting up such a facility until there is (i) a sufficient base level of content and (ii) a sufficiently steady flow of new content. The project will accordingly wait until that stage before providing an RSS-based alerting mechanism.
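
A feed of newly added items could be generated along the following lines. This is a minimal sketch of an RSS 2.0 document built with Python's standard library; the channel details, the item and the function name are hypothetical, and the project has not yet settled on a feed design.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_rss(channel_title: str, channel_link: str,
              items: Iterable[Tuple[str, str]]) -> str:
    """Return an RSS 2.0 feed listing (title, link) pairs as items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements.
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Recently added items"
    for title, link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Hypothetical repository and item.
feed = build_rss(
    "New items in the repository",
    "http://repository.example.edu/",
    [("A sample thesis", "http://repository.example.edu/item/1")],
)
```

Regenerating the feed whenever content is loaded would give readers and aggregators an alert without any change to the repository itself.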

4.4 Hire a Builder or DIY?

The other major decision to be made was how ARROW would develop the new software. The original bid to DEST had envisaged that the project would hire its own software developers to write the necessary software (a DIY strategy). The new project manager realised that a potentially far better option would be to hire a builder, preferably one with experience with our preferred building materials. After a good deal of exploration and negotiation, the project announced [HREF49] in July that it was partnering with VTLS [HREF50], who already had a product on the market called VITAL [HREF51] that was built on top of FEDORA. ARROW has licensed VITAL 1.0 (which is primarily aimed at digital image collections) and will be working with VTLS to extend the functionality of FEDORA either by contributing back to the core FEDORA code or by writing a series of ARROW-commissioned modules. This will all be open-sourced using the same license as the FEDORA code. These ARROW-commissioned modules will call FEDORA using the existing APIs and will also expose themselves as a series of Web Services. VTLS will be able to build products on top of these new ARROW-commissioned modules if they wish, and future releases of the VITAL product will almost certainly use these modules. Because the new ARROW modules will be open-sourced, anyone else will equally be able to build on top of them to do whatever they want. To really stretch the construction analogy, we are hiring a builder not just to build us a house, but also to provide building materials to anyone else for free.

This decision has a number of advantages:

It saves the ARROW project 3-6 months of startup time (hiring programmers, getting them up to speed on the FEDORA APIs)
It outsources the risk
It provides a support base for the software beyond the life of the project
It contributes to the functionality of the FEDORA code base
It ensures that the ARROW functionality benefits the global institutional repository community

5. Building Site

5.1 State of works

The point of all the work described so far is, of course, to actually build something. The ARROW project started to receive funds in late January 2004. Since that time it has:

appointed a Project Manager (Geoff Payne, previously with the AARLIN project [HREF43] at La Trobe University)
appointed a company to design an ARROW brand, marketing materials and a website [HREF25]
determined our repository solution
turned the original briefing document into a set of technical requirements
contracted with VTLS to perform custom software development

5.2 Plans for rest of this year

Over the rest of this year, the ARROW project will:

Install VITAL 1.0 and start loading content
Finalise the specification of the software needed to meet the ARROW requirements
Work with the VTLS developers to develop and test this software
Start work to acquire content within the partner institutions
Develop the search/exposure services required

6. Open House!

6.1 When is it going to be open for business?

The project is on track to have functional software available by the end of 2004. This would be the Open House date, and from that point onwards the ARROW Partners will be loading content and providing a semi-production service. Initially this service will only be available at the four project partner institutions. There is an allocation in the budget in year 3 (2006) to roll out the ARROW initiative to up to 10 other institutions across Australia. It may be possible to start this phase earlier if all goes well, but it is not possible to commit to this at such an early stage.

6.2 Plans for the future

The initial round of DEST funding runs out at the end of 2006. One of the DEST requirements was that successful projects should address the issues of sustainability. Both DEST and ARROW are keen to see the initiative continue beyond the end of 2006 and are thinking hard about how to ensure long-term viability for the project (assuming it is successful). It is far too early to say what these plans might be, but one idea that we keep playing with can be summarised as 'Embedding ARROW into the things that universities have to do anyway'.

7. Conclusions

The process of developing the architecture for ARROW has been a constant interaction between the evolving vision for what the project wanted to do and what the software might make possible. Sometimes the software possibilities constrained the vision. Sometimes they expanded it. But the end result should be a flexible architecture that will enable the project to meet the DEST requirements to make Australian research more visible. And, who knows, ARROW may end up becoming something more. In less-guarded moments the ARROW Project Team like to talk about ARROW becoming part of the fundamental infrastructure of higher-education in Australia. Perhaps it will, but there is a lot of work to be done first, and the first challenge is to succeed with the initial (and quite daunting enough) list of deliverables. The architectural and design work described in this paper is just the first step towards what will hopefully be not just a single house, but a thriving community.

8. Acknowledgement

The ARROW Project is sponsored as part of the Commonwealth Government's Backing Australia's Ability [HREF42].

References

DEST (Australian Commonwealth Department of Education, Science and Training) (2002), Research Information Infrastructure Framework for Australian Higher Education. The Final Report of the Higher Education Information Infrastructure Advisory Committee (Systemic Infrastructure Initiative). [HREF4]

DEST (2003a), Information Infrastructure - Call for Proposals 2003. [HREF5]

DEST (2003b), Information Infrastructure - Outcomes of Selections Process. [HREF6]

Harboe-Ree, C., Sabto, M. and Treloar, A. (2004), "The library as digitorium: new modes of creation, distribution and access", Proceedings of VALA 2004, Melbourne, February. [HREF1]

Harboe-Ree, C. and Treloar, A. (2004), "Connecting the Dots Downunder: Towards An Integrated Institutional Approach To Digital Content Management", High Energy Physics Libraries Webzine, issue 9, March. [HREF44]

House of Commons Science and Technology Committee (2004), Scientific Publications: free for all? (HC 399-1), UK Government Stationery Office, July 2004. [HREF45]

Lynch, Clifford A., "Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age" ARL, no. 226 (February 2003): 1-7. [HREF7]

Open Society Institute (2004), A Guide to Institutional Repository Software version 2.0. [HREF8]

Payette, Sandra & Staples, Thornton, "The Mellon Fedora Project: digital library architecture meets XML and web services", Sixth European Conference on Research and Advanced Technology for Digital Libraries. Lecture notes in computer science, vol. 2459. Springer-Verlag, Berlin Heidelberg New York (2002) 406-421. [HREF9]

Rogers, S.A., "Developing an institutional Knowledge Bank at Ohio State University: from Concept to Action Plan", in portal: Libraries and the Academy, January 2003. [HREF2]

Staples, Thornton, Wayland, Ross & Payette, Sandra, "The Fedora Project: an open-source digital object repository management system", in D-lib Magazine, April 2003. [HREF10]

Van de Sompel, H., Young, J. and Hickey, T. (2003), "Using the OAI-PMH ... Differently", D-Lib Magazine, July/August. [HREF20]

Hypertext References

HREF1: http://www.vala.org.au/vala2004/2004pdfs/21HrSaTr.pdf
HREF2: http://www.lib.ohio-state.edu/Lib_Info/rogersKBdoc.pdf
HREF3: http://www.colis.mq.edu.au/
HREF4: http://www.dest.gov.au/highered/otherpub/heiiac/exec_summary.htm
HREF5: http://www.dest.gov.au/highered/research/proposal.htm#1
HREF6: http://www.dest.gov.au/highered/research/outcomes2003.htm
HREF7: http://www.arl.org/newsltr/226/ir.html
HREF8: http://www.soros.org/openaccess/software/
HREF9: http://www.fedora.info/documents/ecdl2002final.pdf
HREF10: http://dlib.org/dlib/april03/staples/04staples.htm
HREF11: http://www.dest.gov.au/Ministers/Media/McGauran/2003/10/mcg002221003.asp
HREF12: http://www.eprints.org
HREF13: http://adt.caul.edu.au/
HREF14: http://etd.vt.edu/
HREF15: http://www.rmitpublishing.com.au/
HREF16: http://www.dspace.org
HREF17: http://libraries.mit.edu/dspace-mit/
HREF18: http://dspace.org/federation/index.html
HREF19: http://dspace.org/faqs/index.html#content
HREF20: http://www.dlib.org/dlib/july03/young/07young.html
HREF21: http://lcweb.loc.gov/z3950/agency/
HREF22: http://www.loc.gov/z3950/agency/zing/srw/sru.html
HREF23: http://www.loc.gov/z3950/agency/zing/
HREF24: http://www.educationau.edu.au/
HREF25: http://arrow.edu.au/
HREF26: http://www.naa.gov.au/recordkeeping/preservation/digital/xml_data_formats.html
HREF27: http://sts.anu.edu.au/downloads/APSR.pdf
HREF28: http://www.melcoe.mq.edu.au/projects/MAMS/index.htm
HREF29: http://andrew.treloar.net/
HREF30: http://www.its.monash.edu.au/
HREF31: http://www.monash.edu.au/
HREF32: http://arrow.edu.au/
HREF33: http://lib.monash.edu.au/
HREF34: http://home.earthlink.net/~ritter/tiff/
HREF35: http://www.libpng.org/pub/png/
HREF36: http://www.w3.org/Graphics/SVG/
HREF37: http://www.w3.org/AudioVideo/
HREF38: http://www.state.ma.us/mgis/mrsid.htm
HREF39: http://www.staroffice.com
HREF40: http://www.openoffice.org/
HREF41: http://www.fedora.info/history.html
HREF42: http://backingaus.innovation.gov.au/
HREF43: http://aarlin.edu.au/
HREF44: http://library.cern.ch/HEPLW/9/papers/1/
HREF45: http://www.publications.parliament.uk/pa/cm200304/cmselect/cmsctech/399/39909.htm
HREF46: http://search.arrow.edu.au/
HREF46: http://epress.monash.edu.au/
HREF47: http://epress.anu.edu.au/
HREF48: http://www.webreference.com/authoring/languages/xml/rss/intro/
HREF49: http://arrow.edu.au/docs/files/ARROW-VITAL.pdf
HREF50: http://www.vtls.com/
HREF51: http://www.vtls.com/Products/vital.html
HREF52: http://www.fedora.info/
HREF53: http://eprint.monash.edu.au/archive/00000046/