Scholarly Publishing and the Fluid World Wide Web
Andrew Treloar, School of Computing and Mathematics, Deakin University, Rusden Campus,
662 Blackburn Road, Clayton, 3168, Australia. Phone +61 3 9244 7461.
Fax +61 3 9244 7460.
Email: [email protected].
Home Page: Andrew Treloar.
Version information
This version was last modified on October 30, 1995. The original version as delivered at the Asia-Pacific World Wide Web conference is also available online.
Abstract:
Scholarly publishing has traditionally taken place using the medium of print. This paper will discuss the transition to electronic scholarly publishing and how the fluid nature of the Web introduces special problems. These problems include document linking, document fixity, document durability, and changing standards.
Keywords:
Electronic publishing; scholarly publishing; SGML; HTML; URL; URN; durability; fixity; standards.
Note on Citations:
As the citations need to be accessible on the Web and in print, the
following procedure has been followed. All bibliographic citations in
the text are hyper-linked to an anchor in the reference list at the end
of this paper. This allows the reader to check the details of the
citation easily and determine whether they wish to access it. Cited works for which
there exists a URL can then be accessed by clicking on the text of the
URL itself, which will be given in full after the citation information.
Selecting Go back will, in most Web browsers, return to the
original location in the main text.
Scholarly publishing has traditionally taken place using the technology
of print. Despite the seemingly inevitable rush into the digital age,
print is still the primary publishing technology for the overwhelming
majority of scholarly disciplines. It is also the technology that still
provides the official archival record for almost all publications.
However, print publications suffer from a number of disadvantages:
- Journals tend to be slow to appear. [Har91]
identifies the lag between writing and publication as their major
disadvantage.
- Print cannot be directly searched, leading to a large market for
secondary abstracting and indexing services.
- Print publications are limited to information that can be
represented statically in print.
- Mechanisms for hyper-linking (i.e. traditional citations) are
clumsy at best.
- Print is costly to produce, distribute and store ([Odl95]).
For all these reasons, as soon as the available technology made it
practicable, pioneering scholars began to use whatever means they could
to produce and distribute their writings electronically. Such
electronic publishing is sometimes referred to as epublishing, by
analogy with email. For an excellent selective bibliography on the
subject of scholarly electronic publishing over networks, consult [Bai95]. For an analysis of the views of
academics on electronic publishing, consult [Sch94].
In roughly chronological order, the technologies adopted for
epublishing on the Internet were:
- listserv,
- anonymous file transfer protocol (aftp),
- gopher,
- and the Web.
New technologies tend to be used in addition to older technologies,
rather than supplanting them. Thus, in the field of electronic
publishing it is not unusual to find journals that were initially
distributed by listserv, and which then added aftp, and later perhaps
gopher access. These older distribution technologies are now being
augmented or (increasingly) replaced by the Web. The non-hierarchical
document-based networked hypermedia architecture of the Web provides a
much richer environment for electronic publishing on the Internet than
any of the previous technologies.
Unfortunately, the Internet is an inherently impermanent medium,
characterised by anarchy and chaos. While this dynamic information
environment has many advantages, it poses some real problems as a
publishing medium. The Web inherits all of this by using the Internet
as a transport mechanism, but then adds to it some challenges of its
own. The Web is the area of the Internet that is currently developing
at the greatest rate. This speed of change brings with it a large
degree of fluidity as technologies progress, mechanisms are developed
and standards become fixed. The difficulties for scholarly publishing
on the Web are particularly severe with respect to document linking,
document fixity, document durability, and changing standards. This
paper will consider each of these in turn and try to provide some
solutions or ways forward.
To be useful to scholars, publications need to be readily accessible.
As directory hierarchies are re-organised or servers moved, old URLs no
longer work. This is of course a general problem with Internet-based
electronic information systems, but the problem of broken URLs is
particularly acute on the Web, where documents often contain multiple
links to other documents. Breaks in the electronic spider-web of links
can be extremely frustrating, and detract markedly from the feeling of
immersion in a seamless information environment. A range of solutions
to the problem of broken URLs is available or under development. They
include manual fixes, semi-automated assistance, managing the links
themselves separately from the documents, and rethinking the link
mechanism altogether.
A manual solution is to ensure that when directory hierarchies are
re-organised on servers, links are placed from old locations to new
locations. For Unix ftp servers this can be done with link files. On
the Macintosh, aliases perform a similar function. On Web servers, a
small document that
- indicates the file has moved
- states the new URL
- and provides a clickable link
is usual and sufficient.
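By way of illustration, such a stub document need be no more than a few lines of HTML. The old and new paths below are placeholders only.

    <HTML>
    <HEAD><TITLE>Document moved</TITLE></HEAD>
    <BODY>
    <P>This document has moved to a new location.</P>
    <P>Please update your links to point to
    <A HREF="http://www.example.edu/new/path/paper.html">
    http://www.example.edu/new/path/paper.html</A>.</P>
    </BODY>
    </HTML>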
This sort of fix works, but has a number of deficiencies:
- It requires the active involvement of server administrators and
is therefore prone to errors.
- Such link files are often given only a limited life. After they
expire, there is no way for a user to know where to manually redirect
the HREF.
- The authors of documents which point to altered locations are not
(and cannot be) notified that the target of their HREFs has moved.
Most importantly, this technique does not scale well to large complex
hyper-linked document sets.
A partial improvement would be some mechanism that helped identify links that were broken. It is technically feasible to provide some form of Web robot that would periodically walk all the links out of a site and ensure that they still work. Unfortunately, I know of no commercial vendors of Web publishing software who provide such a facility. The closest thing available is the recently introduced URL-minder service. This allows a user to register document URLs and receive an email message whenever the document moves or changes. This can be very useful, especially if one is concerned about the content of a linked-to document, as well as its location.
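To give a flavour of what such a robot might involve, the following Python sketch simply fetches each outbound URL collected from a site and reports those that no longer respond. The list of links, and the treatment of any retrieval error as a break, are simplifying assumptions made for the illustration.

    # Sketch of a minimal link-checking robot: fetch each outbound URL
    # and report those that fail. Assumes the site's outbound links
    # have already been gathered into a list.
    from urllib.request import urlopen
    from urllib.error import URLError

    outbound_links = [
        "http://www.example.edu/papers/one.html",     # placeholder URLs
        "http://www.example.org/moved/away.html",
    ]

    def check_links(urls):
        broken = []
        for url in urls:
            try:
                urlopen(url, timeout=10).close()      # any response counts as working
            except (URLError, OSError):
                broken.append(url)
        return broken

    if __name__ == "__main__":
        for url in check_links(outbound_links):
            print("Broken link:", url)

Run periodically against a site's outbound links, a report like this at least tells maintainers which HREFs need manual attention.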
Of course, one of the reasons that the Web has difficulties with broken
links is that the links are embedded in the document stream. This makes
automation of link repair difficult. A number of information systems
have made a conscious decision to separate documents and links. Two
usefully illustrative examples are Hyper-G and PASTIME.
The Hyper-G information system ([Kap95]) uses a separate database engine to maintain meta-information (including, but not restricted to, links) about documents as well as their relationships to each other. Hyper-G servers automatically maintain referential integrity for their local documents and communicate with other servers to maintain integrity for the entire system. In contrast to the Web, links are bidirectional, enabling one to find the source of a link from its destination (as well as vice versa). Hyper-G has native clients for a range of platforms, or can be accessed via a WWW to Hyper-G gateway.
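The essential idea of holding links outside the document stream, and making them traversable in both directions, can be sketched very simply. The Python fragment below illustrates the principle only; it makes no claim to reflect Hyper-G's actual data model, and the document names are placeholders.

    # Sketch: links kept in a separate store and indexed in both
    # directions, so the sources pointing at a document can be found
    # as easily as its destinations.
    from collections import defaultdict

    class LinkStore:
        def __init__(self):
            self.forward = defaultdict(set)    # source -> destinations
            self.backward = defaultdict(set)   # destination -> sources

        def add_link(self, source, destination):
            self.forward[source].add(destination)
            self.backward[destination].add(source)

        def links_from(self, document):
            return self.forward[document]

        def links_to(self, document):
            # The reverse lookup that an embedded Web HREF cannot provide.
            return self.backward[document]

    store = LinkStore()
    store.add_link("paper-a.html", "paper-b.html")
    store.add_link("paper-c.html", "paper-b.html")
    print(store.links_to("paper-b.html"))      # {'paper-a.html', 'paper-c.html'}

Because the store knows both ends of every link, a server can find and repair every reference to a document that moves, which is exactly what embedded HREFs make so difficult.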
The PASTIME project ([Thi95]) has abandoned altogether the practice of embedding fixed hyperlinks into documents. The links are generated on the fly, based on sophisticated pattern processing, as the HTML document is served to the remote client. To add or remove hyperlinks, only the pattern-processing needs to be altered. Fixed hyperlinks can be entered if desired. In contrast to Hyper-G, the currency of hyperlinks is not dependent on maintaining a separate database of link information. The runtime efficiency of the pattern matching is claimed to be very high.
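A greatly simplified sketch of the pattern-based approach is given below: a small table of patterns is applied to the text as it is served, and changing the table adds or removes hyperlinks without touching the stored documents. The patterns and link targets are invented for the illustration and do not come from PASTIME itself.

    # Sketch: generate hyperlinks on the fly by pattern matching over
    # the document text as it is served, rather than storing fixed HREFs.
    import re

    link_patterns = {                 # pattern -> target (illustrative only)
        r"\bHyper-G\b": "http://www.example.edu/info/hyper-g.html",
        r"\bSGML\b":    "http://www.example.edu/info/sgml.html",
    }

    def add_links(text):
        for pattern, target in link_patterns.items():
            text = re.sub(pattern,
                          lambda m, t=target: '<A HREF="%s">%s</A>' % (t, m.group(0)),
                          text)
        return text

    print(add_links("Hyper-G and SGML are discussed above."))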
Ultimately, the most satisfactory solution will be to rethink what is
meant by a link. The most appropriate model is to adapt the method used
for scholarly links to other documents for centuries - the scholarly
citation. At the end of this paper is the References section. This provides links to other documents in the form of standardised bibliographic citations. These citations do not make reference to the location of the document - they only specify its name in some unambiguous form.
The Web equivalent, of course, is the distinction between Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). These are part of a proposed wider Internet infrastructure
in which URNs are used for identification and URLs for locating or
finding resources. As in the print world, what scholars want to be able
to link to is the contents of other identifiable documents -
the locations of those documents should be irrelevant. URLs,
with their dependence on a particular machine and directory path, are a
transitional kludge. URNs, with their intended ability to refer to a
known resource and have the system take care of locating it and
accessing it, are the long term solution.
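To make the distinction concrete, the example below contrasts a location-bound URL with the general shape proposed for URNs; the naming authority and identifier shown are invented for illustration and are not assigned names.

    URL (where the document is):  http://www.example.edu/pub/papers/fluid-web.html
    URN (what the document is):   urn:<naming-authority>:<document-identifier>
                                  e.g. urn:example-journal:1995-0042  (hypothetical)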
Of course, building and distributing a robust URL to URN resolution
mechanism is far from trivial, although some prototype implementations
are starting to appear, such as that at Saint Joseph's College
in the U.S.A.
As an alternative transitional solution, [Fre95] proposes using standard message broadcasting mechanisms, such as those provided by Usenet News and Open Distribution Lists services, to announce Uniform Resource Identifiers (URIs) and Uniform Resource Characteristics (URCs), as well as URNs and URLs.
In the granularity of the object being linked to, Web publishing provides a way to improve on traditional print citations, which at best point only to a specific page. Of course, the concept of a page is not
particularly relevant to electronic documents, particularly given the
way Web browsers can dynamically reformat text as the size of the
display window is changed. It would be better if the unit linked to were both smaller than a page and more closely tied to the structure of the document. Obvious alternatives for the Web are named anchors for
sections, or numbered paragraphs. URLs can then point to the correct
section or even paragraph of a document, rather than just to its
beginning. Naturally, if individual Web documents are small, this is
less important. But many documents produced at this transitional stage
in the move to electronic-only publishing are still structured for
printing. This document, and the others produced for this conference, are good examples of this. The style guide requires that submissions be a single document for ease of management.
All Web authors of long documents should bear the needs of
fine-grained linking in mind and provide named anchors for others to
link to. Leading by example, this document has pre-defined names for
each heading. To stop URLs growing to ridiculous lengths, these names
are 2 or 3 character identifiers that are unique (obviously) within
this document. To see what they are for linking purposes, readers can
select View Source in their browser.
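In HTML terms the convention amounts to something like the fragment below: a short named anchor at each heading, which other authors can then address by appending a fragment identifier to the URL. The names and the URL are placeholders.

    <H2><A NAME="df">Document fixity</A></H2>
    ...
    <!-- a link from another document directly to that section -->
    <A HREF="http://www.example.edu/papers/fluid-web.html#df">document fixity</A>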
In the print world, we are used to documents remaining fairly static.
Journal articles, once published, are not normally updated. They become
fixed in time and part of the historical record of scholarship.
Monographs may appear in more than one edition, but are then clearly
recognisable as new products, often with years between successive
versions. Electronic documents on the Internet can change so quickly
that they sport version numbers and the date they were last updated.
Scholarly Web publishing has a range of possibilities with respect to
fixity. At one end of the spectrum are documents that follow the print
model and remain fixed once published. Somewhere in the middle are
documents where only minor change is allowed. At the other end are
documents that are continuously updated as more information becomes
available, or as the author's views change.
As the first step away from static documents, the High Energy Physics
community has moved to a model of electronic publishing which allows
for ongoing corrections and addenda. The hep-th e-print archive, which provides this facility, serves over 20,000 users from more than 60 countries, and processes over 30,000 messages per day ([Gin94]). JAIR, the Journal of Artificial Intelligence Research, has also just introduced the ability to make comments on published papers and read others' comments.
A number of electronic scholarly journals are intending to add
pointers from earlier published articles to related later ones on the
author's behalf. The assumption is that pointers in the reverse
direction (that is, to prior published work) will already have been
included by the authors.
Tony Barry from the ANU has suggested in [Bar95] that
continuously updated documents might be viewed as being more like a
computer database. Such continuous updating may be desirable to cope
both with the broken URL problem discussed above and the exponentially
increasing amount of information online that can be linked to. The
first of these problems may go away in the near future - the second is
unlikely to anytime soon.
In [Bar94] he has also suggested that
scholars should get recognition for the currency of their
documents rather than their number. While attractive on the surface, I
am profoundly sceptical that universities that are currently just
starting to grapple with recognising the validity of electronic
publications are ready for this visionary proposal. Nor am I convinced
that the implications of this model for the workloads of scholars have
been fully thought through. I can imagine an academic trying (and
failing) to keep a number of her articles in different content areas up
to date. As the number of such articles increases (as one might hope it would, particularly in a 'publish or perish' environment), the
workload would increase to crippling levels. This is particularly a
problem in fields undergoing rapid change.
Perhaps the only reasonable interim solution is to distinguish somehow between fixed documents (print-like) and continuously updated documents (database-like), or at least to make it clear at the top of a document into which category it falls. This approach has been used, for example, by Bailey in [Bai95]. The HTML version of that document is continuously updated - the ASCII version is fixed and permanently archived. Public-Access Computer Systems Review (PACS-R), where these articles were published, has until this year only published in fixed ASCII. As a sign of the changing times, it now accepts articles in fixed ASCII, fixed HTML and author-updated HTML.
If documents are continuously changing and evolving over time, which
version should be cited? Which version is the 'publication of record'
(assuming this means anything any more)?
Two solutions to the problem of permanence are in use on the Web at present.
- Every time the document changes, its name changes also. If the
older version is replaced by the newer, then all URLs pointing to the
older version break. Moving to URNs will not help in this case.
- Every time the document changes, the name is kept the same and
the contents updated. Existing URLs will still work, although the
target of the URL may have changed its content significantly. In this
case, what if one scholar cites a section in a document that disappears
in the next revision? The URL-minder
service already mentioned will only be of limited help here. Presumably
under this model, a new URN will need to be assigned, as is done with
successive versions of print monographs.
An alternative solution with wider application than just the Web is
discussed in [Gra95]. The nature of all
digital documents means that a mechanism is needed to ensure the
authenticity of a given document, or to track multiple versions of a
digital original. Graham's proposed solution for such authentication and
version tracking is Digital Time Stamping (DTS). With DTS a one-way
algorithm is used to generate a key that can be produced only by the
original document. These keys would then be made public, thus ensuring
that anyone could verify the version of a given document.
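The one-way key at the heart of such a scheme can be illustrated with an ordinary cryptographic hash, as in the Python sketch below. Publishing the keys and having them time-stamped by a trusted service is the part of DTS that the sketch leaves out, and the hash function used here is simply a convenient stand-in for whatever one-way algorithm a real system would adopt.

    # Sketch: a one-way fingerprint for a document version. Any change
    # to the text, however small, yields a completely different key, so
    # a published key identifies one version unambiguously.
    import hashlib

    def version_key(document_text):
        return hashlib.sha256(document_text.encode("utf-8")).hexdigest()

    original = "Scholarly publishing has traditionally taken place in print."
    revised  = "Scholarly publishing has traditionally taken place in print!"

    print(version_key(original))
    print(version_key(revised))    # bears no resemblance to the key above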
Document durability ([Kau93]) refers to the length of time a document remains available for communicative transactions. Documents printed on paper that is not acid-free
have a durability of some 100 years unless corrective action is taken.
The durability of Web documents is entirely unknown, but there is no technological reason for their life to be limited, provided they are archived in some systematic way. At present there are no mechanisms to
ensure that this will occur.
Graham ([Gra94]) divides the problem of
ensuring the preservation of documents into:
- medium preservation
- technology preservation
- intellectual preservation
Medium preservation relates to the problem of preserving the medium on
which the information is recorded. This has traditionally been
discussed in terms of environmental and handling concerns for storage
media. This will continue to be important, but the rate of
technological change is such that preserving media like 8" floppies is
of little use if nothing can read them.
[Les92] has suggested that attention should instead be directed to technological preservation. That is, the obsolescence of technologies is much more of a problem than the decay of storage media. In his words, for electronic information, "preservation means copying, not physical preservation." The third of Graham's preservation requirements, intellectual preservation, addresses the integrity and authenticity of the original information. In other words, what assurance do we have that what is retrieved is what was originally recorded? [Gra95] proposes the DTS technology discussed above as a solution to this problem.
[Rot95] also discusses a range of
mechanisms to deal with each of these three preservation problems.
In many ways, the digital nature of all electronic publishing can be
both a strength and a weakness in the area of durability. A strength,
because digital documents can easily be copied and replicated at
multiple sites around the world. A weakness, because destroying a
digital document is far easier than destroying a physical document. It
is easy to assume that the document will exist elsewhere on the Net and
that the fate of a single copy is irrelevant. Of course, there is no
mechanism to prevent everyone making this assumption and causing the
loss for ever of a piece of scholarship. In some ways, the modern analogue of the single manuscript forgotten on top of a cupboard in a mediaeval monastery may well be a forgotten directory on a rarely used hard-disk somewhere in a university department. Unfortunately, it is all too easy to inadvertently delete a directory - throwing away a manuscript
without realising it is somewhat harder. Given the lack of any
mechanism to ensure the archiving of print publications, it seems
unlikely (although technologically relatively easy) that anything will
be done about the situation for digital documents.
The explosive development and adoption of the Web has been paralleled
by the evolution of the HTML standard. Starting from what is now called
HTML level 1, we have moved through level 2 and got nearly as far as
level 3 (for most browsers, at least). At the time of writing, HTML
level 3 is still not finalised, although this is not stopping the
developers of Web browsers from adding proposed or likely level 3
features to their products. The W3O and others are no doubt already thinking about the sorts of things that might appear in HTML level 4.
As HTML has grown and mutated it has added a whole range of things
that were not envisaged at its birth, back when the world was a simpler
place. These include at least detailed layout control, tables, and
equation support. The question is, should the process of accreting
features to HTML continue unabated into the future?
Price-Wilkin ([Pri94a], [Pri94b]) has argued that HTML as used on the
Web has a number of fundamental deficiencies as a scholarly markup
language. Some of those he lists are:
- The range of HTML tags available to users is still too limited.
As a consequence, authors are unable to differentiate important
elements with HTML.
- Where the author wishes to define a bounded segment of text, such
as a stanza or chapter, no tag is available for this purpose. Instead,
authors rely on dividing documents into files representing major
structural divisions.
- HTML tagging often confuses function and appearance.
- There are very few HTML tags that define structural relationships.
He argues that those currently coding their documents in HTML may come
to regret their short-sightedness in a few years. Instead, he argues
for coding complex documents in
SGML and
converting this into HTML on the fly for delivery on the Web as well as
through other means. As it turns out, one of the characteristics of the
evolution of HTML is precisely in the direction of greater SGML
compliance.
Philip Greenspun, from MIT, has also written on the deficiencies of
HTML ([Gre95]). His preferred solution is to
make much wider use of the META tag included in HTML level 2.
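For example, the META element defined in HTML level 2 lets descriptive name/value pairs be attached to the head of a document; the values below are purely illustrative.

    <HEAD>
    <TITLE>Scholarly Publishing and the Fluid World Wide Web</TITLE>
    <META NAME="author" CONTENT="Andrew Treloar">
    <META NAME="keywords" CONTENT="electronic publishing, HTML, URN, durability">
    </HEAD>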
HTML is certainly evolving towards full SGML compliance, but betrays
its origin as a formatting language rather than a structuring language
at every turn. It may not be possible to migrate entirely seamlessly
towards SGML. Indeed it may not be necessary. Many types of publishing
do not require the range of features listed by Price-Wilkin. SGML to
HTML gateways may only be required for particular kinds of large
complex documents.
One way out of this mess is to clearly separate the internal
representation of a document from its ultimate delivery format. We
already do this with computer-generated documents that are delivered in
print. In the rapidly-changing world of electronic publishing,
documents may be delivered in HTML, as Adobe
Acrobat PDF
files, as Postscript files, and in print (to name only the most obvious
options). Documents may be prepared in a wide range of word or document
processors for ultimate delivery using these formats. Provided the amount of structuring and layout information held about the document is richer than that in the ultimate delivery format, conversion is relatively simple. A number of manufacturers are already facilitating this with
their products. Framemaker version 5.0 from Frame Technology Corporation and Pagemaker 6.0 from Adobe both provide support for output to PDF and HTML, as well as Postscript and
print. Microsoft Word's
Internet Assistant provides users of the latest version of Word for
Windows with the ability to save in HTML as well as the native document
format. Given the rate of change in electronic delivery technologies,
it is probably best to keep files in a format that can be output in a
range of forms. This provides the maximum flexibility and is reasonably
future-proof.
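As a toy illustration of the principle, the Python sketch below keeps a document in a small structured form and generates HTML from it on demand; producing another delivery format would mean writing another short renderer against the same structure rather than re-keying the document. The structure and renderer are invented for the example.

    # Sketch: keep the document in a form richer than any one delivery
    # format, and generate delivery formats (here, HTML) from it.
    document = {
        "title": "Scholarly Publishing and the Fluid World Wide Web",
        "sections": [
            ("Document linking", "To be useful to scholars, publications..."),
            ("Document fixity",  "In the print world, we are used to..."),
        ],
    }

    def to_html(doc):
        parts = ["<HTML><HEAD><TITLE>%s</TITLE></HEAD><BODY>" % doc["title"],
                 "<H1>%s</H1>" % doc["title"]]
        for heading, text in doc["sections"]:
            parts.append("<H2>%s</H2>" % heading)
            parts.append("<P>%s</P>" % text)
        parts.append("</BODY></HTML>")
        return "\n".join(parts)

    print(to_html(document))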
The Web provides an attractive and accessible environment for scholarly
electronic publishing (provided the content is compatible with the
limitations of HTML). It seems likely that the use of the Web in this
context will increase. In light of the nature of the Web, care needs to
be taken to ensure that the potential dangers of the Web's fluidity are
overcome. This paper has outlined a number of the areas of difficulty
and some possible solutions.
References

[Bai95]
C. Bailey Jr., "Network-Based Electronic Publishing of Scholarly Works: A Selective Bibliography", The Public-Access Computer Systems Review, Vol. 6, No. 1, http://info.lib.uh.edu/pr/v6/n1/bail6n1.html.

[Bar94]
T. Barry, "Publishing on the Internet with World Wide Web", in Proceedings of CAUSE '94 in Australia, CAUDIT/CAUL, Melbourne.

[Bar95]
T. Barry, "Network Publishing on the Internet in Australia", in The Virtual Information Experience - Proceedings of Information Online and OnDisc '95, Information Science Section, Australian Library and Information Association, pp. 239-249.

[Fre95]
V. Freitas, "Supporting a URI Infrastructure by Message Broadcasting", in Proc. INET '95, http://inet.nttam.com/HMP/PAPER/116/abst.html.

[Gin94]
P. Ginsparg, "First Steps towards Electronic Research Communication", Computers in Physics, August.

[Gra94]
Peter S. Graham, "Intellectual Preservation: Electronic Preservation of the Third Kind", Commission on Preservation and Access, Washington, D.C., http://aultnis.rutgers.edu/texts/cpaintpres.html.

[Gra95]
Peter S. Graham, "Long-Term Intellectual Preservation", Proc. RLG Symposium on Digital Imaging Technology for Preservation, http://aultnis.rutgers.edu/texts/dps.html.

[Gre95]
P. Greenspun, "We have Chosen Shame and Will Get War", http://www-swiss.ai.mit.edu/philg/research/shame-and-war.html.

[Har90]
S. Harnad, "Scholarly Skywriting and the Prepublication Continuum of Scientific Inquiry", in Psychological Science, Vol. 1, pp. 342-343 (reprinted in Current Contents 45: 9-13, November 11 1991), ftp://ftp.princeton.edu/pub/harnad/Harnad/harnad90.skywriting.

[Har91]
S. Harnad, "Post-Gutenberg Galaxy: The Fourth Revolution in the Means of Production of Knowledge", in The Public-Access Computer Systems Review, Vol. 2, No. 1, pp. 39-53, ftp://cogsci.ecs.soton.ac.uk/pub/harnad/Harnad/harnad91.postgutenberg.

[Kap95]
F. Kappe, "Maintaining Link Consistency in Distributed Hyperwebs", in Proc. INET '95. Originally referenced at http://inet.nttam.com/HMP/PAPER/073/html/paper.html.

[Kau93]
D. S. Kaufer & K. M. Carley, Communication at a Distance - The Influence of Print on Sociocultural Organization and Change, Lawrence Erlbaum Associates.

[Les92]
M. Lesk, Preservation of New Technology: A Report of the Technology Assessment Advisory Committee to the Commission on Preservation and Access, Washington, DC: CPA. Available from the Commission at $5: 1400 16th St. NW, Suite 740, Washington, DC 20036-2217.

[Odl95]
A. Odlyzko, "Tragic loss or good riddance? The impending demise of traditional scholarly journals", in Electronic Publishing Confronts Academia: The Agenda for the Year 2000, Robin P. Peek and Gregory B. Newby, eds., MIT Press/ASIS monograph, MIT Press, ftp://netlib.att.com/netlib/att/math/odlyzko/tragic.loss.txt.

[Pri94a]
J. Price-Wilkin, "Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries", in The Public-Access Computer Systems Review, Vol. 5, No. 3, pp. 5-21, gopher://info.lib.uh.edu:70/00/articles/e-journals/uhlibrary/pacsreview/v5/n3/pricewil.5n3.

[Pri94b]
J. Price-Wilkin, "A Gateway Between the World-Wide Web and PAT: Exploiting SGML Through the Web", in The Public-Access Computer Systems Review, Vol. 5, No. 7, pp. 5-27, gopher://info.lib.uh.edu:70/00/articles/e-journals/uhlibrary/pacsreview/v5/n7/pricewil.5n7.

[Rot95]
J. Rothenberg, "Ensuring the Longevity of Digital Documents", in Scientific American, January, pp. 24-29.

[Sch94]
D. Schauder, Electronic Publishing of Professional Articles: Attitudes of Academics and Implications for the Scholarly Communication Industry, Unpublished Ph.D. Dissertation, University of Melbourne.

[Thi95]
P. Thistlewaite, "Managing Large Hypermedia Information Bases: a case study involving the Australian Parliament", in Proc. AusWeb'95, http://www.scu.edu.au/ausweb95/papers/management/thistlewaite/.