Building an Institutional Research Repository from the Ground Up: The
ARROW Experience
Dr Andrew Treloar [HREF29],
Project Manager, Strategic Information Initiatives, Information
Technology Services [HREF30]
& ARROW [HREF32] Technical
Architect. Building 3A, Monash University [HREF31], Victoria, 3800.
Email: [email protected]
Updated version of paper delivered at AusWeb04.
Abstract
This paper describes the development of the software to support
ARROW - Australian Research Repositories Online to the World (a
DEST-funded project under the Research Information Infrastructure
Framework for Australian Higher Education). One way of conceptualising
this process is to think of it as analogous to the process of building
a house. The paper therefore begins by
describing the vacant lot - the context in which the project came
about. It then moves on to the design brief for the architect - the
list of requirements. Next comes the resulting architectural drawings -
the broad model and list of functions. In order to turn a blueprint
into reality, one needs building materials - in this case the pieces of
software required. Finally the paper discusses the state of the
building site, and when the 'house' might be open to its first
visitors.
1. Vacant Lot
This is a story about building an institutional research repository.
More specifically, it is about designing the architecture and choosing
the building materials to make this
building possible.
But before the architecture can be designed, the right
environment needs to exist. What was the vacant lot that made it
possible for this building to even be thought about as a possibility?
In this case the
vacant lot had two components: a general one and a more specifically
Australian one.
1.1 Overall context
There is a growing interest among academic institutions in
collecting, preserving, reusing and creating value-added services from
digital content produced in and for research, teaching and learning.
The emphasis on research outputs and collaboration, and distance,
flexible and online learning, together with developments in information
technology, has led to an increased awareness that the digital content
being created by members of the academic community is an institutional
asset. This content is also increasingly being
recognised as an institutional challenge, requiring both tactical
management and a strategic response.
At the same time many academic libraries are responding to the
challenges of new technologies by taking the opportunity to redefine
their fundamental role in the creation, distribution and provision of
access to information. Over the past decade libraries have moved almost
completely towards a digital platform for management of the information
(both print and electronic) that they acquire or subscribe to. They
have built significant digital collections of material published by
others, and they are increasingly producing new content themselves [Harboe-Ree et. al. (2004)]. Often this content
originates from, or is the
intellectual property of, their own institutions.
Meanwhile, all around the world, universities, their libraries,
faculties, research centres and information technology and course
development units, are trying to cope with the digital revolution.
There is a growing recognition and articulation of the convergence that
is occurring among the various digital initiatives in which
universities are engaged, and the opportunities for potential synergies
and more significant outcomes through collaboration and
interoperability.
As one example, the work on the COLIS (Collaborative Online Learning
and Information Services) model [HREF3] at Macquarie University has
focused on testing the feasibility of interoperable standards as a way
of managing interactions between a range of electronic services.
Through the success of the COLIS model, McLean and others have
demonstrated that the new electronic environment can and must comprise
a complex interactive matrix that is dependent on the information
resources mentioned above, as well as on user directories, content and
rights management software, and metadata repositories.
Sally A Rogers, from Ohio State University, has argued that the
full
array of a university's digital assets and information services should
be broadly defined, and should include the library's catalogue, the
electronic journals, reference databases and other electronic resources
available through the library, as well as institutional repositories
and resources created or collated elsewhere in the university, such as
course material [Rogers
(2003)]. She notes the overlapping of such
initiatives as digital collections, course web sites, electronic course
packs and learning objects, the desirability of integration to search
across these repositories and the development of standards to promote
interoperability. Rogers also highlights the potential of increased
interoperability and connectivity to generate innovation in research,
teaching and learning.
Most recently, the UK House of Commons Science and Technology
Committee released on July 20, 2004 its long-awaited report entitled Scientific Publications: Free for all?
[House
of Commons Science and Technology Committee (2004)].
A number of the recommendations in this report refer explicitly to the
central role of institutional repositories:
- Recommendation 44
- "We recommend that the Research Councils and other Government
funders mandate their funded researchers to deposit a copy of all their
articles in their institution's repository within one month of
publication or a reasonable period to be agreed following publication,
as a condition of their research grant." (p. 102)
- Recommendation 46
- "We recommend that DCMS provide adequate funds for the British
Library to establish and
maintain a central online repository for all UK research articles that
are
not housed in other institutional repositories." (p. 102)
- Recommendation 48
- "In order for institutional repositories to achieve maximum
effectiveness, Government
must adopt a joined-up approach. DTI, OST, DfES and DCMS should work
together to create a strategy for the implementation of
institutional repositories, with clearly defined aims and a realistic
timetable." (p. 103)
- Recommendation 52
- "The cost to the taxpayer of establishing and maintaining an
infrastructure of
institutional repositories across UK higher education would be
minimal,
particularly in proportion to the current total UK higher education
spend. When
the cost is weighed against the benefits they would bring,
institutional repositories plainly represent value for money." (p. 103)
- Recommendation 53
- "Having taken the
step of funding and supporting institutional repositories, the UK
Government would need to become an advocate for them at a global level.
If all countries archived their research findings in this way, access
to scientific
publications would increase dramatically. We see this as a great
opportunity for
the UK to lead the way in broadening access to publicly-funded research
findings and
making available software tools and resources for accomplishing this
work." (p. 103)
- Recommendation 55
- "We recommend that the Government appoints and funds a central
body ... to co-ordinate the implementation of a
network of institutional repositories." (p. 104)
1.2 Australian context
It was against this developing backdrop that the November 2002
report of the
Higher Education Information Infrastructure Advisory Committee (HEIIAC)
of the Australian Government Department of Education, Science and
Training (DEST) [DEST (2002)] identified the
following critical
features of an enhanced
research infrastructure:
- information infrastructure resources should optimise the efforts
of researchers in the higher education sector to create, manage,
discover, access and disseminate knowledge;
- access to the research information infrastructure should not be
constrained by institutional affiliations, geographic locations or
disciplines of individual researchers;
- collaboration among libraries has improved the effectiveness of
individual institutions, and further collaboration, clear strategies
and a shared vision would significantly improve the coordination of the
national research infrastructure;
- opportunities should be sought for the academic community to
regain control of scholarly publishing; and
- computing and communication technologies provide new
opportunities for the creation, management, storage and dissemination
of information.
The HEIIAC report was primarily concerned with managing the current
problems associated with scholarly communication and publishing, and it
stressed the need to adopt a national collaborative approach. As
already discussed, a range of players are embracing scholarly
communication strategies and arguing that they should be
incorporated into a more holistic approach to the management of
institutional digital content and intellectual capital.
It was clear that the merging of these two approaches would yield
substantial
benefits to Australian university communities, consistent with the
following statements of principle:
- Australian universities have a commitment to support and promote
their institutions' research activity through the creation and
preservation of digital content, especially institutional repositories
and electronic publishing.
- Australian universities have a commitment to help their
institutions achieve their goals more effectively by assisting with the
integration of digital resources.
- Australian universities have a commitment to collaborating
nationally and internationally in the achievement of a more integrated
approach to the management and interoperability of digital content. [Harboe-Ree and Treloar 2004]
These statements reflect the HEIIAC objectives and place them
into
a framework that, if implemented, would improve institutional and
national efficiency and effectiveness. The challenge for HEIIAC was to
turn
these principles and objectives into action.
1.3 DEST RII Process
In June of 2003, the Australian Commonwealth Department of
Education, Science and Training issued a call for proposals to "further
the discovery, creation, management and dissemination of Australian
research information in a digital environment" [DEST
(2003a)].
This sought to "fund proposals which help promote Australian research
output
and help to build the Australian research information infrastructure,
through
the development of distributed digital repositories and common
technical services
that manage access and authorisation to
these."
The guidelines for submissions identified the following
requirements to be met by successful bids:
- The application must provide clear evidence of the overall need
for the project proposed in terms of the strategic and long-term
benefits for the higher education sector in Australia as a whole and
identify the specific outcomes that will be derived.
- The application should indicate relevance to sector-wide needs
and priorities and demonstrate that the proposal is an innovative
approach.
- The application must clearly demonstrate that the proposal is a
cost effective response to an identified problem and will generate
savings or productivity gains through its application.
- The application should detail the nature and degree of
cooperation between collaborating institutions.
- Where relevant, the application should bear in mind future
requirements and outline strategies to sustain the project beyond the
period of Commonwealth funding.
- Institutions should be mindful that in any infrastructure
developed under this project the enabling architecture should be both
effective and reasonably future proof.
In response to this call, 14 projects were submitted, of which four
were funded [DEST (2003b)]. The successful
projects were:
- The Australian Research Repositories Online to the World (ARROW)
[HREF25]
- Australian Digital Theses Program Expansion and Redevelopment
(ADT) [HREF13]
- Towards an Australian Partnership for Sustainable Repositories
(APSR) [HREF27]
- Meta Access Management System (MAMS) [HREF28]
These four projects were funded for a combined total of A$12
million over a period of 3 years, with funding commencing at the start
of 2004 [HREF11].
The focus of this
paper will be the architectural design and creation of software to
support the ARROW Project.
2. Design Brief
The original design brief was encapsulated in the Summary section of
the ARROW Bid document sent to DEST (public version of bid available at
[HREF53]).
This read:
"The ARROW project (ARROW) will identify and test a software
solution or solutions to support best-practice institutional digital
repositories comprising e-prints, digital theses and electronic
publishing. A wide range of digital content types will be managed in
these repositories. The NLA will develop a repository and associated
metadata to support independent scholars (those not associated with
institutions). A complementary activity of ARROW is the development and
testing of national resource discovery services (developed by the NLA)
using metadata harvested from the institutional repositories, and the
exposing of metadata to provide services via protocols and toolkits.
This will include a potential path for the redevelopment of the
Australian Digital Theses (ADT) metadata repository incorporated into
the NLA’s national resource discovery
services.
Initially ARROW will be tested in the four partner institutions,
prior to it being offered more widely across the higher-education
sector. The solution will be open-standards based, or will support open
standards, and will facilitate
interoperability within and between participating institutions."
This was (deliberately) a very high-level statement. What might it
mean when fleshed
out a bit? The best way to get an accurate sense of this is to focus on
the content streams that ARROW will manage and the content
types it
will have to deal with.
2.1 Content Streams
The functions that ARROW will perform can best be characterised in
terms
of different content streams. These derive from different origination
points within the Australian research community.
2.1.1 E-print repositories
An e-print repository stores and makes available (in digital form)
working papers, pre-prints (not yet published in the traditional
literature) and post-prints. E-print repositories have been
proliferating in recent years. Most have been set up by universities,
but many have also been established by scholarly and professional
societies and higher education research centres. Australian
universities running e-prints repositories include The Australian
National University, Monash University, The University of Melbourne,
The University of Queensland, and Queensland University of Technology.
The increased activity around e-prints has been facilitated by the
development of free, open-source software [HREF12]
that
manages e-print repositories.
A key feature of these repositories is that content is usually
available on an open-access basis (anyone can read or view it and no
fees are payable). Many e-print repositories also work on a
self-submission basis, with researchers depositing material into the
repository themselves using an online deposit process. The rationale
behind the growing e-prints movement is to reclaim institutional
scholarly output and make it widely accessible internationally, thus
removing barriers to learning and research, and improving its
availability
and citation.
2.1.2 Digital thesis repositories
A digital thesis repository stores and makes available online, in
digital form, graduate research output (M.A. by research and Ph.D.
theses).
Digital theses in these repositories are offered on an open access
basis. In Australia the Australian Digital Theses Program
[HREF13] is a national
collaborative distributed
database of digitised theses produced at Australian Universities.
Twenty-two higher education institutions are participating members of
the Program, which uses deposit-process software [HREF14]
first developed at Virginia Polytechnic
Institute in the United States of America.
2.1.3 Electronic publishing
A growing number of higher education institutions are trying to
establish sustainable publishing alternatives to reclaim the scholarly
output currently published in heavily protected commercial journals and
monographs. Institutional e-presses aim to offer electronic publishing
services and
functionality similar to those offered by commercial presses publishing
product online, but in a way that is more aligned with institutional
objectives, thereby tackling problems associated with the current
scholarly publishing climate. These problems include pricing and
intellectual property issues, as well as long lead times for
publication and publishing models that do not allow for publication of
media-rich titles.
The activities of an e-press can range from digitising material
originally designed for print and making it available online, through
to the publication (in the sense of making public) of material that was
born digital and that can only be fully represented digitally.
E-presses are more akin to traditional publishing than e-print
repositories in that e-press content tends to be offered on a
subscription and/or pay-per-view basis.
As with e-print repositories, the Australian higher education
sector is experiencing significant activity in this area. Both Monash
University [HREF46] and The
Australian National University [HREF47] have established
e-presses, and Royal Melbourne Institute of Technology Publishing
[HREF15] has been
engaged in electronic
publishing for several years now.
2.1.4 DEST Returns
Each year, Australian universities need to send to DEST information
about their research output for the previous year. In most
universities, this
process involves manual data collection using paper forms which are
then keyed into a database or spreadsheet. This is tedious and
susceptible to error. In addition, the end result is a largely static
document with no way to link from the publication information to the
publications themselves.
ARROW wanted to see if it was possible to partially automate the
gathering of publications for the annual Department of Education,
Science and Training returns and storage of both the publications and
required metadata in the institutional ARROW repository. This would
meet the following objectives:
- systematic accumulation of a critical mass of content
- simplification of Department of Education, Science and Training
return creation by universities
- facilitation of the way in which the Department of Education,
Science and Training verifies compliance
ARROW also wanted to see whether it would be possible to
enable universities to enter into an ongoing dialogue with their
researchers about the issues associated with academics signing over
copyright in research output, and the desirability of deposit into an
institutional repository.
2.1.5 Non-University Research
Of course, not all research takes place in a university. Much also
occurs in research institutes of one sort or another, in R&D
centres in corporations or even in informal locations (what one might
characterise as the
Researcher in the Backyard Shed). Researchers at institutions
without institutional repositories would find it difficult to make
their research visible. As ARROW was seeking to capture and make
visible as much Australian research as possible, it would be useful to
find a way to deal with this potential content stream.
2.2 Content types
2.2.1 Content Type Philosophy
Another part of the design brief process was deciding what content
types (as opposed to streams) would be accepted. The project decided to
adopt a variant of the model developed by MIT in its DSpace [HREF16]
implementation. The DSpace philosophy can be summarised as follows:
- Lots of digital material is already lost
- Most digital material is at risk
- Preserving bits is better than nothing
- It is important to capture as much information as possible
- It will be necessary to evaluate cost/benefit trade-offs
over time
The project also decided to be informed by the National Archives of
Australia guidelines on digital formats [HREF26]. Based on this, ARROW
decided to accept three types of content:
- Supported
- The format is recognized, and the hosting institution is
confident it can make bitstreams of this format usable in the future,
using whatever combination of techniques (such as migration, emulation,
etc.) is appropriate given the context of need.
- Known
- The format is recognized, and the hosting institution will
promise to preserve the bitstream as-is, and allow it to be retrieved.
The hosting institution will attempt to obtain enough information to
enable the format to be upgraded to the 'supported' level.
- Unsupported
- The format is unrecognised, but the hosting institution will
undertake to preserve the bitstream as-is and allow it to be retrieved.
On the vexed subject of lossy vs lossless formats, the decision was
made that wherever possible, ARROW would endeavour to store data
objects in lossless digital formats (these are formats that do not
throw away information when compressing the file). Lossy formats (which
do throw away information during compression) might be stored in
addition, or rendered on the fly (where possible). Storage in lossy
formats would be used only as a last resort.
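The distinction between lossless and lossy storage can be illustrated with a minimal sketch. It is not part of the ARROW software: the function names are hypothetical, DEFLATE (via Python's standard zlib module) stands in for any lossless scheme, and simple quantisation stands in for any lossy one.

```python
import zlib
from typing import List


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress with DEFLATE; no information is discarded."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(samples: List[int], step: int = 16) -> List[int]:
    """Round each sample down to a multiple of `step`; fine detail is lost."""
    return [(s // step) * step for s in samples]


# Hypothetical 8-bit sample values, e.g. pixel intensities.
samples = [3, 18, 35, 200, 255]
original = bytes(samples)

restored = lossless_round_trip(original)   # identical to the original bytes
degraded = lossy_round_trip(samples)       # an approximation only

print(restored == original)
print(degraded)
```

Only the lossless path allows the original bitstream to be recovered exactly, which is why a lossy copy would serve as an additional access copy rather than as the preservation copy.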
2.2.2 Supported Formats
For Textual content, the supported formats are:
- XML
- Files with an accompanying DTD or schema preferred. If not,
then well-formed XML is acceptable.
- Rich Text Format (RTF)
- Adobe PDF
- NOTE: This content will be migrated to PDF/A once this is
standardised
- HTML
- Validating as XHTML. Content that does not validate will need
to be converted.
For Still Images, the supported formats are:
- TIFF (Tagged Image File Format) [HREF34]
- JPEG
- Store with no compression; migrate to JPEG-2000 over time
- PNG (Portable Network Graphics) [HREF35]
- EPS
- SVG (Scalable Vector Graphics) [HREF36]
For Moving Images, the supported format is:
For Audio, the supported formats are:
For Multimedia content, the supported format is:
- SMIL (Synchronized Multimedia Integration Language) [HREF37]
2.2.3 Known Formats
For Textual content the following formats are known:
- Word/Excel/PowerPoint
- all versions, all operating systems
NOTE: The reason for including Microsoft Office file formats is
simply a recognition of the market reality. If alternatives (such as
StarOffice [HREF39] or OpenOffice [HREF40]) become more widely deployed
in the target environments for ARROW, this list may well be augmented.
For Still Images the following formats are known:
- GIF
- MrSID (Multi-Resolution Seamless Image Database) [HREF38]
For Moving Images the following formats are known:
- Windows Media
- AVI
- QuickTime video encodings other than MPEG-4
For Audio the following format is known:
For Multimedia content, the following format is known:
2.2.4 Unsupported Formats
All other formats would be unsupported.
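As a rough illustration of the three-level scheme, the sketch below assigns a support level to a submitted file. It is an assumption-laden example rather than ARROW code: the names SupportLevel and classify are hypothetical, the file-extension sets are an illustrative reading of the format lists above, and a real system would inspect file content, since an extension alone cannot show, for example, that an HTML file validates as XHTML.

```python
from enum import Enum
from pathlib import PurePath


class SupportLevel(Enum):
    SUPPORTED = "supported"      # institution expects to keep the content usable
    KNOWN = "known"              # bitstream preserved as-is, format recognised
    UNSUPPORTED = "unsupported"  # bitstream preserved as-is, format unrecognised


# Assumed extension mappings for the formats listed in sections 2.2.2 and 2.2.3.
SUPPORTED_EXTENSIONS = {
    ".xml", ".rtf", ".pdf", ".html",
    ".tif", ".tiff", ".jpg", ".jpeg", ".png", ".eps", ".svg",
    ".smil",
}
KNOWN_EXTENSIONS = {".doc", ".xls", ".ppt", ".gif", ".sid", ".wmv", ".avi"}


def classify(filename: str) -> SupportLevel:
    """Return the support level implied by the file's extension."""
    extension = PurePath(filename).suffix.lower()
    if extension in SUPPORTED_EXTENSIONS:
        return SupportLevel.SUPPORTED
    if extension in KNOWN_EXTENSIONS:
        return SupportLevel.KNOWN
    return SupportLevel.UNSUPPORTED


for name in ("thesis.PDF", "slides.ppt", "results.xyz"):
    print(name, classify(name).value)
```

Anything that falls through both sets is treated as unsupported, matching the catch-all rule in section 2.2.4.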
2.3 Overall Philosophy
The final part of the design brief was to make decisions about Open
Source and Open Standards. The first decision was an easy one: it was a
condition of the funding from DEST that any software developed using
project funds had to be made available as open source. This ensured
that the Australian (and, ultimately, the global) research communities
got the best value from the investment. The second decision also turned
out to be an easy one. The core design group agreed that the best
approach was to adopt open standards wherever possible when specifying
software functionality, data formats or interfaces.
3. Architectural Drawings
Once the project had a clear design brief it was possible to move on to
the
next step: deciding the broad architecture. This involved a series of
iterative steps, as well as a lot of research into what approaches
similar projects overseas had adopted. The project ended up defining
three
categories of required repository functionality.
3.1 Common Repository
The project decided that, if possible, all the various content types
should be stored in a common repository. This would:
- facilitate linkages between items
- allow for more efficient management of the content and the
infrastructure
- enable exposure of all of an institution's public research output
through a common mechanism
3.2 Content Management and Workflow
In order to get the content into the common repository, the project
needed a way
to efficiently manage different classes of content contributors and
different content streams. The project ended up deciding to define a
series of
Content Management and Workflow modules, corresponding to the content
streams discussed under section 2.1. Each of these modules would have
its own content submission forms and workflow. Each would also have
specific functionality to deal with the requirements of that particular
stream type.
The ePrints module would provide software that offers no
less functionality than the eprints.org software used by many
universities. The major issue with this module was anticipated to be
the management of content self-submission and administrative
management.
The eTheses module would provide software that offers no less
functionality than the current Australian Digital
Theses Program software and includes OAI-PMH compliance for metadata
harvesting. The main issues were anticipated to be data capture from
various sources, efficient harvesting from
institutional repositories, identification of software, performance and
scalability requirements and interactions with other metadata services.
The ePress module would provide software that offers
sufficient functionality to run an open-access ejournal electronic
press, including both submission management and publishing of multiple
journals.
The DEST Research Directory module would test the feasibility and
effectiveness of using an ARROW repository to support the annual
Department of Education, Science and Training returns. The initial
instance would be a repository holding a proportion of the
institution's Department of Education, Science and Training 2003
returns. The issues were anticipated to be management of content
submission from academics and embedding use of the repository in the
institution's existing collection process.
The NLA Repository would provide support for non-university
researchers by providing a repository hosted at the
National Library of Australia.
The project also recognised that the ARROW infrastructure would be
potentially
applicable to a wider range of problems. For this reason the
possibility of adding other Content Management and Workflow modules
later on was left open.
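The arrangement described in this section, with stream-specific modules that each apply their own submission rules before depositing into one common repository, can be sketched as follows. This is a minimal illustration only: every class name, field name and rule here is hypothetical and does not reflect the actual ARROW implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Item:
    title: str
    stream: str
    metadata: Dict[str, str] = field(default_factory=dict)


class CommonRepository:
    """Single store shared by every content stream."""

    def __init__(self) -> None:
        self._items: List[Item] = []

    def store(self, item: Item) -> int:
        self._items.append(item)
        return len(self._items)  # simple sequential identifier

    def by_stream(self, stream: str) -> List[Item]:
        return [item for item in self._items if item.stream == stream]


class WorkflowModule:
    """Base module: checks stream-specific metadata, then deposits."""

    stream: str = "generic"
    required_fields: Tuple[str, ...] = ()

    def __init__(self, repository: CommonRepository) -> None:
        self.repository = repository

    def submit(self, title: str, **metadata: str) -> int:
        missing = [f for f in self.required_fields if f not in metadata]
        if missing:
            raise ValueError(f"{self.stream}: missing metadata {missing}")
        return self.repository.store(Item(title, self.stream, dict(metadata)))


class EPrintsModule(WorkflowModule):
    stream = "eprints"
    required_fields = ("author",)


class EThesesModule(WorkflowModule):
    stream = "etheses"
    required_fields = ("author", "degree")


repository = CommonRepository()
eprint_id = EPrintsModule(repository).submit(
    "A working paper", author="A. Researcher")
thesis_id = EThesesModule(repository).submit(
    "A doctoral thesis", author="B. Student", degree="PhD")
print(eprint_id, thesis_id, len(repository.by_stream("etheses")))
```

Adding a further stream later would, in this sketch, mean adding another subclass without touching the repository, which is the property the open-ended module design is intended to preserve.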
3.3 Search and Exposure
The ability to locate appropriate content for citation purposes is a
critical success factor in creating reliable scholarly communication
and increasing the impact of research. ARROW decided to develop a
nationally available resource discovery service to provide access to
Australian research output. The project will establish automated
mechanisms for harvesting and re-purposing metadata from institutions
and individual researchers. This will be done by applying international
standards, specifications and technologies to ensure interoperability.
Resource discovery will be supported by descriptive metadata. Other
types of metadata may also be generated to support digital rights
management, persistent identification, and archiving and preservation
to ensure the longevity of scholarly content. In addition, it will be
possible to search ARROW repositories through a range of discovery
tools (such as education portals or search engines). This exposure will
increase awareness of unique Australian content, both nationally and
internationally. The project will also seek to expose published
Australian research in commercial repositories, such as those created
by large journal publishers.
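Metadata harvesting of the kind described here is commonly done with OAI-PMH, the protocol named for the eTheses module in section 3.2. The sketch below shows the general shape of a harvest under stated assumptions: the base URL and sample response are invented for illustration, the helper names are hypothetical, and paging via resumption tokens and error handling are omitted. No network request is made.

```python
from typing import List, Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace for Dublin Core elements inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml: str) -> List[str]:
    """Collect every dc:title found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter(f"{{{DC_NS}}}title")]


# Invented response, trimmed to the elements used in this sketch.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.edu/oai")
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

A national discovery service would run requests like this against each institutional repository on a schedule, then index the harvested records centrally.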
3.4 OLAD
The end result of the architectural decisions in each of the categories
of Common Repository, Content Management and Workflow, and Search &
Exposure was a layered architecture. The notion of a layered
architecture is not particularly controversial. Such architectures have
been preferred since at least the days of the International
Organization for Standardization (ISO) Open Systems Interconnection
(OSI) seven-layer reference model for network services. In the Digital
Library field these sorts of high-level models are so common that the
project group took to referring to 'obligatory' layered architecture
diagrams. Figure 1 therefore is the OLAD (Obligatory Layered
Architecture Diagram) for ARROW.
Figure 1: Obligatory Layered Architecture Diagram for ARROW.
4. Building ARROW
Now that the architecture was defined, the project had to work out how
to build it. In construction terms, what building materials
were
available, what were the best ones to choose, and who was going to do
the building?
4.1 Foundation - the repository
The project recognised very early on that the decision on the
repository was
foundational. The choice of repository technology would
determine
the functionality ARROW could provide and the ways it could provide it.
Much of the latter half of 2003 was spent in careful analysis of
available candidates, based on a mixture of:
- reading publicly available materials including:
- system documentation
- published articles/conference papers
- online presentations
- notes from conference sessions
- lurking on mailing lists
- downloading the software and 'kicking the tyres'
- attending conference sessions (and talking to presenters
afterwards)
- talking to other users to get a less-partisan assessment
As a result of this work, the project rapidly settled on two likely
candidates:
DSpace and FEDORA.
DSpace [HREF16] is a joint activity between MIT Libraries and
Hewlett-Packard to develop a software system that enables institutions
to:
- Capture and describe digital works using customized workflow
processes
- Provide access to an institution's digital works over the web,
so users can search and retrieve items in the collection
- Preserve digital works over the long term
It is being made available under the BSD open source license to
other groups to run as-is, or to modify and extend as needed.
DSpace can best be thought of as a
general-purpose repository application, with a series of both
hard-wired and preferred behaviours. It is designed to provide stable
long-term storage needed to house the digital products of MIT faculty
and researchers. DSpace is intended to have different advantages for
different stakeholder groups:
"For the user: DSpace enables easy remote access and the
ability to read and search DSpace items from one location: the World
Wide Web.
For the contributor: DSpace offers the advantages of digital
distribution and long-term preservation for a variety of formats
including text, audio, video, images, datasets and more. Authors can
store their digital works in collections that are maintained by MIT
communities.
For the institution: DSpace offers the opportunity to provide
access to all the research of the institution through one interface.
The repository is organized to accommodate the varying policy and
workflow issues inherent in a multi-disciplinary environment.
Submission workflow and access policies can be customized to adhere
closely to each community's needs." [HREF17]
While DSpace grew out of the needs of MIT, a group of North
American and European universities are now participating in the DSpace
Federation [HREF18],
which will test the existing software,
and offer suggestions about how to further develop and improve it.
DSpace supports a wide range of content types [HREF19],
and particular installations can easily extend the range available.
FEDORA [HREF52] is both a software platform and an architecture (it
stands for the Flexible Extensible Digital Object and Repository
Architecture). Note that this FEDORA is distinct from, and predates,
the use of the name by Red Hat. The architecture came out of Digital
Library work done in the computer science field in the late 1990s
[Payette and Staples (2002)]. The history of the FEDORA repository
software is described
on its website as follows:
"In the summer of 1999 ... the
[University of Virginia] Library's research and development group
discovered a paper about Fedora written
by Sandra Payette and Carl Lagoze of Cornell's Digital Library Research
Group. Fedora was designed on the principle that interoperability and
extensibility is best achieved by architecting a clean and modular
separation of data, interfaces, and mechanisms (i.e., executable
programs). With Cornell's help, the Virginia team installed the
research software version of Fedora and began experimenting with some
of Virginia's digital collections. Convinced that Fedora was exactly
the framework they were seeking, the Virginia team reinterpreted the
implementation and developed a prototype that used a relational
database backend and a Java servlet that provided the repository access
functionality. The prototype provided strong evidence that the Fedora
architecture could indeed be the foundation for a practical, scalable
digital library system. In September of 2001 The University of
Virginia received a grant of $1,000,000 from the Andrew W. Mellon
Foundation to enable the Library, in collaboration with Cornell
University, to build a sophisticated digital object repository system
based on the Flexible Extensible Digital Object and Repository
Architecture (Fedora). The Mellon grant was based on the success of the
Virginia prototype, and the vision of a new open-source version of
Fedora that exploits the latest web technologies. Virginia and Cornell
have joined forces to build this robust implementation of the Fedora
architecture with a full array of management utilities necessary to
support it." [HREF41].
Increasingly, the term FEDORA (which was first used
over 5 years ago as an acronym for the architecture) is now being used
to refer to this software implementation. In this latter sense, FEDORA
is "an open source, digital object
repository system using public APIs exposed as web services."
[Staples, Wayland and Payette (2003)].
FEDORA can best be thought of as
services-mediation infrastructure, rather than an off-the-shelf
application. It can use web services to call other services as well as
expose its own services using web services standards. Key to the FEDORA
architecture (yes, I know this is like referring to an ATM Machine...)
is its underlying object-based model. FEDORA stores digital content
objects, either as datastreams contained within the repository or as
links to external resources. It also stores disseminators, which are
ways to render these digital content objects. The software maintains
bindings between content objects and their disseminators. Each object
has a default disseminator, but may also be disseminated in other
ways. This architecture is extremely flexible, and provides significant
advantages as a platform on which to build other applications.
Version 1.2 of FEDORA, released in late December 2003, provides
versioning of both objects and their disseminators, as well as a
Java-based Administration GUI.
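The object model described above (datastreams, disseminators and the bindings between them) can be illustrated with a minimal sketch. This is not FEDORA code and does not use the FEDORA APIs: the class names, field names, identifiers and URLs below are all hypothetical, chosen only to show how an object with a default disseminator might also be rendered in other ways.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Datastream:
    """Content held in the repository, or a link to an external resource."""
    ds_id: str
    mime_type: str
    content: Optional[bytes] = None      # set when content is stored internally
    external_url: Optional[str] = None   # set when content lives elsewhere


# A disseminator is modelled here as a function that renders an object.
Disseminator = Callable[["DigitalObject"], str]


@dataclass
class DigitalObject:
    pid: str  # hypothetical persistent identifier
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    disseminators: Dict[str, Disseminator] = field(default_factory=dict)
    default_disseminator: str = "default"

    def bind(self, name: str, disseminator: Disseminator) -> None:
        """Record a binding between this object and a disseminator."""
        self.disseminators[name] = disseminator

    def disseminate(self, name: Optional[str] = None) -> str:
        """Render via the named disseminator, or the default one."""
        return self.disseminators[name or self.default_disseminator](self)


def list_datastreams(obj: DigitalObject) -> str:
    return ", ".join(sorted(obj.datastreams))


def locate_content(obj: DigitalObject) -> str:
    parts = []
    for ds_id in sorted(obj.datastreams):
        ds = obj.datastreams[ds_id]
        parts.append(f"{ds_id}={ds.external_url if ds.external_url else 'internal'}")
    return "; ".join(parts)


paper = DigitalObject(pid="demo:1")
paper.datastreams["PDF"] = Datastream("PDF", "application/pdf", content=b"%PDF-")
paper.datastreams["DATA"] = Datastream(
    "DATA", "text/csv", external_url="http://data.example.edu/demo1.csv")
paper.bind("default", list_datastreams)
paper.bind("locations", locate_content)
```

Calling `paper.disseminate()` uses the default binding, while `paper.disseminate("locations")` renders the same object through a second disseminator without changing the stored content.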
There is also a range of other open-source repository projects
underway. The Open Society Institute is currently maintaining a
document which summarises the functionality of many of them [HREF8].
In addition to DSpace, the current version also reviews FEDORA,
CDSWare, MyCoRe, i-Tor, eprints.org and ARNO. Each comes out of a
particular response to the challenges of managing large amounts of
digital content, and each has its own strengths and weaknesses.
4.1.4 Selection
After careful consideration, the project ended up selecting FEDORA
rather than DSpace as the underlying repository. The main reasons for
this decision were:
- ARROW needed something to build on top of (like FEDORA) rather
than an existing application (like DSpace)
- The object-oriented data model for FEDORA was much more flexible
than DSpace's Repository-Community-Collection-Item-Bitstream hierarchy
- ARROW wanted to be able to have persistent identifiers down to
the level of individual datastreams (DSpace only has such identifiers
for the item)
- ARROW wanted to be able to version both content and disseminators
- The APIs were exposed much more openly and cleanly in FEDORA (as
well-documented SOAP/REST web services) than in DSpace (poorly
documented Java APIs)
However, it should be said that this is an area where a number of
players, both open-source and proprietary, are moving very quickly.
DSpace and FEDORA have each announced their plans for version 2.0 of
their software, probably due out next year. As a result, ARROW agreed
to review its software decision every 12 months.
4.2 Framing it up - the application development framework
One of the things that the repository may determine is the choice of
application development framework. This is because some repositories
only allow particular languages to call their Application Programming
Interfaces (APIs). The project felt it was important to be able to code
in a variety of languages (not be restricted to one). Given the
increasing popularity and uptake of Web Services, it was also deemed
critical to be able to expose repository functionality in this way.
These two points are partially inter-related: having web services makes
it much easier to use a range of languages because it decouples the
implementation of the functionality from its invocation. ARROW decided
that as well as building on web services it would also expose web
services. That is, all the new ARROW functionality would also be made
available as web services.
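The decoupling of implementation from invocation can be sketched in a few lines. The example below is illustrative only and is not ARROW or FEDORA code: it uses Python's WSGI convention and JSON over HTTP as a stand-in for the SOAP/REST services discussed here, and the item identifiers, titles and the `term` parameter are all assumptions. The point is that a caller depends only on the HTTP contract, not on the language in which `find_titles` happens to be written.

```python
import json
from urllib.parse import parse_qs

# Hypothetical in-memory stand-in for repository content.
ITEMS = {
    "demo:1": "Building an institutional repository",
    "demo:2": "Metadata harvesting in practice",
}


def find_titles(term):
    """The implementation: could be rewritten in any language."""
    term = term.lower()
    return {pid: title for pid, title in ITEMS.items() if term in title.lower()}


def application(environ, start_response):
    """The invocation point: callers see only an HTTP request and response."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    term = query.get("term", [""])[0]
    body = json.dumps(find_titles(term), sort_keys=True).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_locally(query_string):
    """Exercise the service without a network, purely for illustration."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = application({"QUERY_STRING": query_string}, start_response)
    return captured["status"], json.loads(b"".join(chunks))
```

A call such as `call_locally("term=harvesting")` returns the status line and the matching items. The same `application` function could be served over a real socket with the standard library's `wsgiref.simple_server.make_server`, at which point a client written in any language could call it.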
4.3 Doors and Windows - the search and exposure layer
As discussed above, a key driver behind the project was making items in
ARROW repositories as accessible as possible. After careful examination
of the best ways to do this, the project decided to target three very
different technologies.
4.3.1 OAI-PMH
The Open Archives Initiative's Protocol for Metadata Harvesting
(OAI-PMH) was created to facilitate discovery of distributed resources,
such as those contained in a repository. The OAI-PMH achieves this by
providing a simple, yet powerful framework for metadata harvesting.
Harvesters can incrementally gather records contained in OAI-PMH
repositories and use them to create services covering the content of
several repositories [Van de Sompel, Young and Hickey (2003)]. OAI-PMH
is rapidly gathering strength as a way of providing federated resource
discovery services and was seen as essential to the success of ARROW.
The National Library will use OAI-PMH where available (and other
technologies where not) to harvest the metadata from ARROW and other
institutional repositories. These metadata will then be used to provide
national and international resource discovery for Australian research.
This national resource-discovery service [HREF46] will also link with
other national services delivered by the National Library for the
Australian Digital Theses Program and the international Networked
Digital Library of Theses and Dissertations.
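Incremental harvesting of the kind described above can be sketched as follows. The `ListRecords` verb and the `metadataPrefix`, `from` and `resumptionToken` arguments are part of OAI-PMH 2.0; the base URL, the record identifiers and the sample response are invented for illustration, and the sketch makes no claim about how the National Library's harvester is actually built.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, from_date=None, resumption_token=None):
    """Build a ListRecords request, optionally limited to recent changes."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return the record identifiers on one page and the next token, if any."""
    root = ET.fromstring(xml_text)
    identifiers = [e.text for e in
                   root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    next_token = token.text if token is not None and token.text else None
    return identifiers, next_token


# Hypothetical response from a hypothetical repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record><header>
   <identifier>oai:repository.example.edu:1</identifier>
   <datestamp>2004-07-01</datestamp>
  </header></record>
  <record><header>
   <identifier>oai:repository.example.edu:2</identifier>
   <datestamp>2004-07-02</datestamp>
  </header></record>
  <resumptionToken>page2</resumptionToken>
 </ListRecords>
</OAI-PMH>"""

BASE_URL = "http://repository.example.edu/oai"  # assumed endpoint
FIRST_REQUEST = list_records_url(BASE_URL, from_date="2004-07-01")
IDENTIFIERS, NEXT_TOKEN = parse_page(SAMPLE_RESPONSE)
NEXT_REQUEST = list_records_url(BASE_URL, resumption_token=NEXT_TOKEN)
```

A harvester would repeat the request with each returned token until none is supplied, then remember the date of the run so that the next harvest can pass it as `from`.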
4.3.2 Google and other search engines
There is little need to discuss the success of Google. For most
university students (and probably for most staff!) Google is the
resource discovery mechanism of choice. Enabling Google and other
search engines to access the metadata and the full text of items in
ARROW repositories was an easy choice to make. In practice, this will
mean provision of a robots.txt file and publicly available content in a
directory location accessible by web spidering software, as well as the
ability to automatically request re-spidering for new content.
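As an illustration of the robots.txt approach, the sketch below checks which locations a spider would be permitted to fetch under a given set of rules, using the robots.txt parser in Python's standard library. The rules, directory names and URLs are assumptions made for the example and do not describe any actual ARROW configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: open the public content, keep spiders out of admin pages.
ROBOTS_TXT = """\
User-agent: *
Allow: /repository/
Disallow: /admin/
"""


def crawl_permissions(robots_txt, user_agent, urls):
    """Map each URL to True if the given spider may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


URLS = [
    "http://repository.example.edu/repository/item/1",
    "http://repository.example.edu/admin/login",
]
PERMISSIONS = crawl_permissions(ROBOTS_TXT, "Googlebot", URLS)
```

A check of this kind could be run whenever the rules change, to confirm that public content has not been hidden from spiders by accident.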
Having the content indexed by search engines is only part of the
solution. Given the increasing amount of content searchable on the Web,
it is important to ensure that search results are ranked sufficiently
highly to be noticed by end-users. The best way to do this varies from
search engine to search engine. In addition, there is an active
commercial industry focussed around 'optimising' search results. In the
case of Google, there is something of an 'arms race' developing between
the ranking techniques Google uses and attempts to 'influence' these.
ARROW has decided that it does not want to attempt to affect rankings,
other than by making its content as accessible (and thus hopefully as
citeable) as possible.
4.3.3 SRU/SRW
The third exposure layer was in some ways a less obvious choice. Both
OAI-PMH and Google are 'proxy' search services. That is, they collect
proxy records and place them in a database where they can be searched.
Such proxy systems run the risk of always potentially being out of date
(if only slightly). We therefore wanted to make it possible for other
search services to connect directly to ARROW repositories and run
interactive searches. The standard protocol for such connections in the
library world is Z39.50 (more formally known as ISO 23950: "Information
Retrieval (Z39.50): Application Service Definition and Protocol
Specification") [HREF21]. Z39.50 has not been taken up as quickly as
its proponents had hoped (for a variety of reasons too complex to cover
here). As a result the Z39.50 International: Next Generation (ZING)
initiative has been working on more modern and lightweight protocols to
achieve much of the original Z39.50 functionality. These newer
protocols are called SRU (Search/Retrieve over URL) [HREF22] and SRW
(Search/Retrieve for Web Services) [HREF23]. ARROW decided to support
both SRU and SRW connections to make real-time searching possible
through things like the portlet technology being developed by
education.au [HREF24].
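To make the idea of a search carried in a URL concrete, the sketch below builds an SRU searchRetrieve request. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) come from the SRU specification and the query is written in CQL; the base URL and the query itself are hypothetical, and nothing here describes the eventual ARROW implementation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, max_records=10, start_record=1):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


# Assumed endpoint and query, for illustration only.
SEARCH_URL = sru_search_url(
    "http://repository.example.edu/sru", 'dc.title = "research"', max_records=5)

# Decoding the query string shows what a server would receive.
RECEIVED = parse_qs(urlsplit(SEARCH_URL).query)
```

Because the whole request is an ordinary URL, a portal or portlet can run a live search against a repository with nothing more than an HTTP GET, which is what distinguishes this approach from the harvested 'proxy' services.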
4.3.4 RSS
In addition to these three technologies, the project is also looking at
ways of providing an alerting service. The obvious technology to
support this is RSS [HREF48]. Of course, there is little point in
setting up such a facility until there is (i) a sufficient base level
of content and (ii) a sufficiently steady flow of new content. The
project will accordingly wait until that stage before providing an
RSS-based alerting mechanism.
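For readers unfamiliar with RSS, an alerting feed is simply a small XML document listing recent items. The following sketch generates one with Python's standard library; it assumes RSS 2.0, and the feed title, links and item titles are invented for illustration rather than taken from any ARROW design.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description, items):
    """Return an RSS 2.0 document announcing newly added items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link),
                      ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical new arrivals in a hypothetical repository.
NEW_ITEMS = [
    {"title": "Thesis on metadata quality",
     "link": "http://repository.example.edu/repository/item/1"},
    {"title": "Working paper on open access",
     "link": "http://repository.example.edu/repository/item/2"},
]
FEED = build_feed("New items", "http://repository.example.edu/",
                  "Recently added research outputs", NEW_ITEMS)
```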
4.4 Hire a Builder or DIY?
The other major decision to be made was how ARROW would develop the new
software. The original bid to DEST had envisaged that the project would
hire its own software developers to write the necessary software (a DIY
strategy). The new project manager realised that a potentially far
better option would be to hire a builder, preferably one with
experience with our preferred building materials. After a good
deal of exploration and negotiation, the project announced [HREF49] in
July that it was partnering with VTLS [HREF50], which already had a
product on the market called VITAL [HREF51] that was
built on top of FEDORA. ARROW has licensed VITAL 1.0 (which is
primarily aimed at digital image collections) and will be working with
VTLS to extend the functionality of FEDORA either by contributing back
to the core FEDORA code or by writing a series of ARROW-commissioned
modules. This will all be open-sourced using the same license as the
FEDORA code. These ARROW-commissioned modules will call FEDORA
using the existing APIs and will also expose themselves as a series of
Web-Services. VTLS will be able to build products on top of these new
ARROW-commissioned modules if they wish and future releases of the
VITAL product will almost certainly use these modules.
But because the new ARROW modules will be open-sourced, anyone else
will equally be able to build on top of them to do whatever they want.
To really stretch the construction analogy, we are hiring a builder not
just to build us a house, but also to provide building materials to
anyone else for free.
This decision has a number of advantages:
- It saves the ARROW project 3-6 months of startup time (hiring
programmers, getting them up to speed on the FEDORA APIs)
- It outsources the risk
- It provides a support base for the software beyond the life of
the project
- It contributes to the functionality of the FEDORA code base
- It ensures that the ARROW functionality benefits the global
institutional repository community
5. Building Site
5.1 State of works
The point of all the work described so far is, of course, to actually
build something. The ARROW project started to receive funds in late
January 2004. Since that time it has:
- appointed a Project Manager (Geoff Payne, previously with the
AARLIN project [HREF43] at La Trobe University)
- appointed a company to design an ARROW brand, marketing materials
and a website [HREF25]
- determined our repository solution
- turned the original briefing document into a set of technical
requirements
- contracted with VTLS to perform custom software development
5.2 Plans for rest of this year
Over the rest of this year, the ARROW project will:
- Install VITAL 1.0 and start loading content
- Finalise the specification for the software needed to meet the
ARROW requirements
- Work with the VTLS developers to develop and test this software
- Start work to acquire content within the partner institutions
- Develop the search/exposure services required
6. Open House!
6.1 When is it going to be open for business?
The project is on track to have functional software available by the
end of 2004. This would be the Open House date, and from that point
onwards the ARROW Partners will be loading content and providing a
semi-production service. Initially this service will only be available
at the four project partner institutions. There is an allocation in the
budget in year 3 (2006) to roll out the ARROW initiative to up to 10
other institutions across Australia. It may be possible to start this
phase earlier if all goes well, but it is not possible to commit to
this at such an early stage.
6.2 Plans for the future
The initial round of DEST funding runs out at the end of 2006. One of
the DEST requirements was that successful projects should address the
issues of sustainability. Both DEST and ARROW are keen to see the
initiative continue beyond the end of 2006 and are thinking hard about
how to ensure long-term viability for the project (assuming it is
successful). It is far too early to say what these plans might be, but
one idea that we keep playing with can be summarised as 'Embedding
ARROW into the things that universities have to do anyway'.
7. Conclusions
The process of developing the architecture for ARROW has been a
constant interaction between the evolving vision for what the project
wanted to do and what the software might make possible. Sometimes the
software possibilities constrained the vision. Sometimes they expanded
it. But the end result should be a flexible architecture that will
enable the project to meet the DEST requirements to make Australian
research more visible.
And, who knows, ARROW may end up becoming something more. In
less-guarded moments the ARROW Project Team like to talk about ARROW
becoming part of the fundamental infrastructure of higher-education in
Australia. Perhaps it will, but there is a lot of work to be done
first, and the first challenge is to succeed with the initial (and
quite daunting enough) list of deliverables.
The architectural and design work described in this paper is just the
first step towards what will hopefully be not just a single house, but
a thriving community.
8. Acknowledgement
The ARROW Project is sponsored as part of the Commonwealth Government's
Backing Australia's Ability [HREF42].
References
DEST (Australian Commonwealth Department of
Education, Science and Training) (2002), Research Information
Infrastructure Framework for Australian Higher Education. The Final
Report of the Higher Education Information Infrastructure Advisory
Committee (Systemic Infrastructure Initiative). [HREF4]
DEST (2003a), Information Infrastructure - Call for Proposals 2003.
[HREF5]
DEST (2003b), Information Infrastructure - Outcomes of Selections
Process. [HREF6]
Harboe-Ree, C., Sabto, M. and Treloar, A.
(2004), "The
library as digitorium: new modes of creation, distribution and
access", Proceedings of VALA 2004, Melbourne, February. [HREF1]
Harboe-Ree, C. and Treloar, A. (2004),
"Connecting the Dots Downunder: Towards An Integrated Institutional
Approach To Digital Content Management", High Energy Physics Libraries
Webzine, issue 9, March. [HREF44]
House of Commons Science and Technology
Committee (2004), Scientific
Publications: free for all? (HC 399-1), UK Government Stationery
Office, July 2004. [HREF45].
Lynch, Clifford A., "Institutional
Repositories: Essential Infrastructure for Scholarship in the Digital
Age" ARL, no. 226 (February 2003): 1-7. [HREF7]
Open Society Institute (2004), A Guide to
Institutional Repository Software version 2.0. [HREF8]
Payette, Sandra & Staples, Thornton,
"The Mellon Fedora Project: digital library architecture meets XML and
web services", Sixth European Conference on Research and Advanced
Technology for Digital Libraries. Lecture notes in computer science,
vol. 2459. Springer-Verlag, Berlin Heidelberg New York (2002) 406-421. [HREF9]
Rogers, S.A., "Developing an institutional
Knowledge Bank at Ohio State University: from Concept to Action Plan",
in portal: Libraries and the Academy, January 2003. [HREF2]
Staples, Thornton, Wayland, Ross &
Payette, Sandra, "The Fedora Project: an open-source digital object
repository management system", in D-lib Magazine, April 2003. [HREF10]
Van de Sompel, H., Young, J. and Hickey, T.
(2003), "Using the OAI-PMH ... Differently", D-Lib Magazine,
July/August. [HREF20]
Hypertext References
- HREF1
- http://www.vala.org.au/vala2004/2004pdfs/21HrSaTr.pdf
- HREF2
- http://www.lib.ohio-state.edu/Lib_Info/rogersKBdoc.pdf
- HREF3
- http://www.colis.mq.edu.au/
- HREF4
- http://www.dest.gov.au/highered/otherpub/heiiac/exec_summary.htm
- HREF5
- http://www.dest.gov.au/highered/research/proposal.htm#1
- HREF6
- http://www.dest.gov.au/highered/research/outcomes2003.htm
- HREF7
- http://www.arl.org/newsltr/226/ir.html
- HREF8
- http://www.soros.org/openaccess/software/
- HREF9
- http://www.fedora.info/documents/ecdl2002final.pdf
- HREF10
- http://dlib.org/dlib/april03/staples/04staples.htm
- HREF11
- http://www.dest.gov.au/Ministers/Media/McGauran/2003/10/mcg002221003.asp
- HREF12
- http://www.eprints.org
- HREF13
- http://adt.caul.edu.au/
- HREF14
- http://etd.vt.edu/
- HREF15
- http://www.rmitpublishing.com.au/
- HREF16
- http://www.dspace.org
- HREF17
- http://libraries.mit.edu/dspace-mit/
- HREF18
- http://dspace.org/federation/index.html
- HREF19
- http://dspace.org/faqs/index.html#content
- HREF20
- http://www.dlib.org/dlib/july03/young/07young.html
- HREF21
- http://lcweb.loc.gov/z3950/agency/
- HREF22
- http://www.loc.gov/z3950/agency/zing/srw/sru.html
- HREF23
- http://www.loc.gov/z3950/agency/zing/
- HREF24
- http://www.educationau.edu.au/
- HREF25
- http://arrow.edu.au/
- HREF26
- http://www.naa.gov.au/recordkeeping/preservation/digital/xml_data_formats.html
- HREF27
- http://sts.anu.edu.au/downloads/APSR.pdf
- HREF28
- http://www.melcoe.mq.edu.au/projects/MAMS/index.htm
- HREF29
- http://andrew.treloar.net/
- HREF30
- http://www.its.monash.edu.au/
- HREF31
- http://www.monash.edu.au/
- HREF32
- http://arrow.edu.au/
- HREF33
- http://lib.monash.edu.au/
- HREF34
- http://home.earthlink.net/~ritter/tiff/
- HREF35
- http://www.libpng.org/pub/png/
- HREF36
- http://www.w3.org/Graphics/SVG/
- HREF37
- http://www.w3.org/AudioVideo/
- HREF38
- http://www.state.ma.us/mgis/mrsid.htm
- HREF39
- http://www.staroffice.com
- HREF40
- http://www.openoffice.org/
- HREF41
- http://www.fedora.info/history.html
- HREF42
- http://backingaus.innovation.gov.au/
- HREF43
- http://aarlin.edu.au/
- HREF44
- http://library.cern.ch/HEPLW/9/papers/1/
- HREF45
- http://www.publications.parliament.uk/pa/cm200304/cmselect/cmsctech/399/39909.htm
- HREF46
- http://search.arrow.edu.au/
- HREF46
- http://epress.monash.edu.au/
- HREF47
- http://epress.anu.edu.au/
- HREF48
- http://www.webreference.com/authoring/languages/xml/rss/intro/
- HREF49
- http://arrow.edu.au/docs/files/ARROW-VITAL.pdf
- HREF50
- http://www.vtls.com/
- HREF51
- http://www.vtls.com/Products/vital.html
- HREF52
- http://www.fedora.info/
- HREF53
- http://eprint.monash.edu.au/archive/00000046/
Copyright