4. Infrastructure Components
Components of the RDM Infrastructures established by
higher education institutions are briefly considered below. The component
function, the software / platform underlying the component and component
interoperability are described, any evaluations identified, and institutions
employing the component, particularly in an RDMI context, are noted. The list
of components is not exhaustive: only the most relevant and popular are reviewed.
The components are loosely categorised by function, so there may be considerable
overlap.
4.1. Integrated systems and integrating components
DataFlow is a two-stage RDMI consisting of an active data management system,
DataStage [4.4.1], and a data repository, DataBank [4.2.4], built on a Fedora
Commons platform. The system uses the BagIt specification [4.8.2] to transfer
files to a SWORD2 [4.8.1] compliant archive. This system was developed by the Admiral
project [7.2.1] and is being piloted at the University of Oxford [6.14],
having been implemented during the Damaro project [7.1.3].
Dataflow is being evaluated by the Universities of Leeds and the Yorkshire
& Humberside Metropolitan Area Network. [See section 5.5 for Newcastle
University’s Iridium evaluation of DataStage and DataBank].
The pilot RDMI built at the University of
Lincoln centres on the ‘Orbital Bridge’ application (Jackson, 2012), which integrates an
institution-facing data registry built on CKAN, a public data catalogue built on EPrints
and the research management system ‘Nucleus’ (Stainthorp, 2012). The Orbital Bridge provides an
interface, the ‘Researcher Dashboard’ (Winn, 2013b),
which allows researchers to access and add information about projects, funding,
outputs and datasets [6.1.10].
Integrated
Rule-Oriented Data-management System (iRODS) is a software system that allows
the management of a distributed workflow through the chaining of
micro-services. See section [2.1.2.] for more information. The iREAD project
[5.1.4] evaluated iRODS for use in the CARMEN portal.
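The idea of chaining micro-services into a workflow can be illustrated with a minimal Python sketch. This is not the iRODS rule language or any iRODS API: the function names (compute_checksum, tag_size, run_chain) and the record layout are hypothetical, chosen only to show how small single-purpose steps might be composed in sequence.

```python
import hashlib
from typing import Any, Callable, Dict, List

# Assumption: a "micro-service" is modelled as a plain function that
# receives a record (a dict) and returns an updated copy of it.
Record = Dict[str, Any]
MicroService = Callable[[Record], Record]


def compute_checksum(record: Record) -> Record:
    """Hypothetical step: attach a SHA-256 checksum of the content."""
    digest = hashlib.sha256(record["content"]).hexdigest()
    return {**record, "sha256": digest}


def tag_size(record: Record) -> Record:
    """Hypothetical step: attach the content size in bytes."""
    return {**record, "size_bytes": len(record["content"])}


def run_chain(record: Record, chain: List[MicroService]) -> Record:
    """Apply each micro-service in order, passing the result along."""
    for service in chain:
        record = service(record)
    return record


result = run_chain(
    {"name": "sample.csv", "content": b"a,b\n1,2\n"},
    [compute_checksum, tag_size],
)
```

Each step only reads and extends the record, so steps can be reordered or swapped without touching the others, which is the property that makes chaining attractive for distributed workflows.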
4.2. Repository platforms
The most widely used platform for institutional
repositories (used for WRRO), EPrints is open source and free to use. Bespoke
design, hosting and maintenance services are available. EPrints is built from
Apache web server, MySQL [4.10.4] and Perl components, and is recommended to run on
a UNIX-like operating system.
A number of plugins have been developed by the
EPrints user community, some of which modify EPrints to handle datasets: the ReCollect plugin[i],
developed by the UK Data Archive and the University of Essex to implement a dataset
metadata profile; the DataCite DOI registration plugin[ii]; the SWORD 2.0 broker plugin[iii]; and the Arkivum A-Stor storage
backend plugin[iv]. In
addition to these, a number of projects have developed integration of EPrints
with other components: KAPTUR [7.1.8] developed integration with
DataStage and Figshare; the Orbital Project [7.1.12] developed the Orbital
Bridge, which integrates EPrints with CKAN and other components [see above
4.1.2].
A
number of HE institutions use EPrints for their institutional data repository:
the Universities of Essex [6.1.4], Southampton [6.1.14]
and West of England (UWE)
[6.1.17]. The eCrystals repository [6.2.3], also at Southampton, runs on
an EPrints platform. The University of Leeds has chosen EPrints for its proposed
data repository, as reported from the EPrints User Group workshop (Proudfoot, 2013a) at Leeds, October 2013.
CKAN
is an open source data management system
developed by the OKF[v] to
provide access to open data. Technologies used include the PostgreSQL database
engine [4.10.8], SOLR search [4.10.3], a Python backend and a JavaScript frontend.
It has a modular architecture with optional extensions: APIs surrounding a
core system. CKAN is part of the RDMI implemented
by Bristol [6.1.1]
and Lincoln [6.1.10],
and is being trialled by the Kaptur project [evaluation at 5.6] and at Newcastle
[Iridium CKAN use case 5.5]. The DART project[vi] at Leeds
uses CKAN for its data portal [6.2.2].
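As an illustration of the API layer around the CKAN core, the following Python sketch builds a dataset search request for the CKAN Action API (the package_search action) and reads dataset titles from a response. The host name ckan.example.org and the sample response are hypothetical, and no network call is made; treat this as a sketch of the request and response shape rather than a complete client.

```python
import json
from typing import List
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a URL for the CKAN Action API 'package_search' action."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> List[str]:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Hypothetical portal address, used only for illustration.
url = package_search_url("https://ckan.example.org/", "soil moisture", rows=5)

# Hypothetical response body in the shape the Action API returns.
sample_response = json.dumps(
    {"success": True, "result": {"count": 1, "results": [{"title": "Soil survey"}]}}
)
titles = dataset_titles(sample_response)
```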
Fedora (Flexible
Extensible Digital Object Repository Architecture) was originally developed by
Cornell University for managing digital content (DAMS). Fedora is
RDBMS-independent and has been tested with MySQL, Oracle [4.10.7], PostgreSQL,
Microsoft SQL Server and Derby (it is provided with Derby embedded). The Fedora
Commons distribution includes Apache Tomcat, Derby SQL and Java components.
Many workflow and service components and plug-ins have been developed to
integrate Fedora within an RDM infrastructure. UK HEIs using the Fedora Commons
platform include University of York Digital Library (YODL) and the Archaeological Data Service (ADS). Data
repositories based on Fedora include 3TU Datacentrum [6.3.3], DANS Easy [6.3.4]
and RUresearch [6.3.9].
DataBank is a
data repository based on the Fedora Commons platform, designed by the Admiral
project at the University of Oxford. [See section 4.1.1 for a description of
DataFlow components and 5.5 for the Iridium evaluation of DataBank].
Hydra
is a multi-purpose repository framework based on a micro-services architecture.
The main components are a Fedora repository platform, SOLR indexing software
[4.10.3], Blacklight discovery interface [4.5.5] and the Hydra plugin, a ‘Ruby
on Rails’ library, which facilitates workflow in digital object management [as
described in section 2.1.2]. Hydra is the platform for the University of Hull
Digital Repository, Hydra [6.1.9], University of Virginia Libra
[6.3.10] and the LSE Digital Library[vii].
A Fedora-based repository system developed through
the Arrow project. It is used at Arrow, Monash University’s research
repository [6.3.1].
Islandora
is an open source Content Management System developed by the University of
Prince Edward Island, built on a base of Fedora, Drupal [4.10.9] and Solr. This
platform is used for the University of St Andrews Digital Collections Portal[viii].
DSpace
is an open source repository system based on Apache Tomcat, PostgreSQL or
Oracle, and Java. DSpace is the platform used for University of Edinburgh
Datashare [6.1.3] and EDINA ShareGeo
repository [6.2.1], Open Research Exeter [6.1.54], the University of
Hertfordshire Research Archive (UHRA) [6.1.8], the Queen Mary University of
London, Centre for Digital Music
Research Data Repository (C4DM-RDR)
[6.1.13] and DSpace at Cambridge [6.1.2].
Datastar is an open source repository system developed by
Cornell University and Washington University. It is designed to support collaboration
and data sharing among researchers during the research process, and to promote
publishing or archiving data and high-quality metadata to discipline-specific
data centers and/or to the institution's own digital repository.
ContentDM
is a proprietary repository software from OCLC, in use at University of
Sheffield for digital collections at the Library Special Collections and
National Fairground Archive [see 3.1].
This
is proprietary repository software from Ex Libris, in use at the University of Leeds
for LUDOS.
This
is proprietary repository software in use for the institutional repository at Royal Holloway Research
Online[ix] and Oxford Brookes RADAR[x]. At Nottingham[xi] it is
integrated into Moodle and used to house and share digital teaching resources
(except audio and video files, which are recommended to be uploaded to
Kaltura). Equella is the platform used for the research data repository at
Griffith University [6.3.2] and was also being considered by Nottingham ADMIRe
for the data repository / metadata store (Berry and Parsons, 2012a) [see 5.1].
An
open source repository system designed at Laval University Library based on the
DSpace model. The system was designed with internationalisation in mind, so it
has an easily modified multilingual interface. The system is not platform
dependent, and is based on open source components – Java, Apache Ant, MySQL
(recommended) and Lucene.
Repository
software developed by the ARNO Project (Academic Research in the Netherlands
Online): partners were the universities of Amsterdam, Twente and Tilburg. ARNO
is based on Apache, Oracle and Perl architecture.
4.3. ‘Archive data’ storage and digital preservation systems and services
Arkivum
provides a digital archiving service, certified to ISO 27001. Three copies of
the data are kept: two at geographically separate data centres and one at an
escrow service. Data is uploaded from the institutional network to the Arkivum
data centre using a local gateway appliance, the A-Stor (a file server). This
may be achieved within the institutional firewall. Arkivum has entered a data archiving
framework agreement with JANET[xii].
This
is a cloud-based, open access repository for research outputs. Data is
persistently stored under a CC license. Unlimited storage is offered for publicly
accessible data, whereas private data is provided with 1 GB of free storage. The
service is supported by Digital Science[xiii], the
providers of Symplectic Elements [4.6.1] and Projects [4.4.4]. The company now
offers ‘Figshare for institutions’, providing a cloud-based
data repository service. This service is used by Imperial College London
and the University of Oxford in the UK.
Dataverse
Network is an open source web application for publishing, citing, analysing and
preserving research data, which may be installed by any institution. The
architecture is based on PostgreSQL, Lucene (SOLR) and Java. This application
supports the data repositories at Harvard University [6.3.6] and Johns Hopkins
University [6.3.7].
Rosetta
is a proprietary digital preservation system from Ex Libris, and the successor
to DigiTool. The system is based on a distributed architecture which is
scalable and flexible and provides continual preservation actions for long-term
curation. The system is based on the OAIS[xiv] model
and conforms to the TDR[xv]
requirements. The system integrates easily with Ex Libris Primo for the
discovery function.
A
proprietary cloud storage and backup service, optimised for data that are
infrequently accessed and for which short retrieval time is not critical, thus
a low cost option for long-term data storage. Geographical location of data
storage may be chosen to meet regulatory requirements.
DuraSpace
offers this commercial hosted service providing cloud infrastructure for data
preservation and access. This is used for the ICPSR repository [6.3.11].
A
proprietary data backup service that offers a range of options including backup
to cloud, disc or tape and high data availability with continuous data
protection.
4.4. ‘Active data’ management and collaboration platforms
DataStage,
the active data management component of DataFlow, appears as a mapped drive on the
researcher’s computer (an ‘Academic Dropbox’) and provides metadata annotation
and repository submission functions. DataStage is being tested by the
Universities of Essex and Hertfordshire and the Queen Mary University of London Centre for Digital Music [See
section 4.1.1 for a description of DataFlow components and 5.5 for the Iridium evaluation of DataStage].
Sakai
CLE provides a suite of resources for collaboration and project management.
Resources include the means to store, organise and share files, facilities for
blogs, chat and forum management, and a glossary providing contextual
definitions. Sakai CLE provides the VLE part of the Hydra infrastructure at
Hull [6.1.9], VRE at Newcastle [6.1.11], Bath[xvi],
Lancaster[xvii]
and Monash [6.3.1]. The software has been evaluated by the Research360 [5.8]
and Iridium [5.5] projects.
SharePoint
is an established Web application platform introduced by Microsoft in 2001. The
platform provides a range of Web tools, including intranet portals, document
and file management, collaboration, social networks, extranets, websites,
enterprise search, and business intelligence.
SharePoint
is part of the infrastructure at the University of Southampton [evaluation at
5.4].
Projects
is a research project management desktop application for Mac. It allows
researchers to manage research activity, track changes to files, manage backup
and restore previous versions of files, and to annotate and organise files and
folders easily. This application integrates seamlessly with Figshare, though
there is no institutional form to date.
D4Science
is a European e-infrastructure project which provides a mechanism for data
e-infrastructure interoperability. The mechanism is based on the gCube software
framework, which allows distributed virtual organisations to collaborate and
share resources by managing the cloud / grid middleware, thus configuring their
own VREs.
Dropbox
is a commonly used collaboration and cloud storage service, free to individuals
(for volumes up to 2 GB), with added components for organisational subscriptions,
such as file recovery, version tracking and phone support. Data is protected by
256-bit AES and SSL encryption, two-step verification and mobile
passcodes. Dropbox may store clients’
data on servers in another country.
Also
known as Google Documents, this is Google’s
collaborative and cloud storage service for educational institutions,
offering 30 GB of storage per user and integration with its email service and text,
voice and video chat service. Security features include two-step authentication
and encrypted connection to servers. A vault service is offered for secure
archiving of content. Google provides these services for the Universities of
Sheffield and York.
Luminis
is a collaboration portal platform, originally provided by SunGard, now
Ellucian. It is used for the VLE at the University of Leeds.
Dynamics
is a collaboration platform for customer relationship management and enterprise
resource planning. This will be used for the VRE at University of Leeds,
replacing Luminis.
A
web-based portal for sharing data and software, developed at the University of
York and funded by HEFCE. The portal allows the sharing of data and services in
a secure online environment, the execution of analysis code and analysis of
data, and the curation of data, analysis code and experimental protocols.
Alfresco
is an open source content management system, which interfaces with Google mail
and drive (forthcoming) and the institutional filestore. This is being
developed at the University of York for use as a ‘Research Lab management
system’ and at St. Andrews for data archiving involving a Fedora Commons
repository (Allinson, 2013).
AWS
provides a wide range of cloud-based services for organisations, including:
cloud computing and applications, storage (S3 and Glacier), databases,
networking & virtual private cloud (VPC), analytics and deployment,
identity and access management.
Huddle
is a collaboration platform that is designed for content sharing, document
management, project and workflow management, secure intranet and extranet
service. This is marketed as a ‘Sharepoint alternative’.
Kaltura
provides an open source video management platform with a focus on universities
deploying videos within their organisation. This platform includes
collaborative video editing and publishing components.
THREDDS
(Thematic Real-time Environmental Distributed Data Services) Data Server (TDS)
provides catalogue, metadata and data access for scientific datasets. TDS is open source Java middleware, and is
used for part of the 3TU Datacentrum infrastructure [6.3.3].
HUBzero is an
open source content management system designed for collaborative working and
data sharing for scientific research and education. It is the platform for
Purdue University Research Repository [6.3.8].
4.5. Catalogue software
Open
source software, developed at the University of Oxford to provide a catalogue
of research data. The metadata schema has been developed for full description
of data, people responsible, how they were generated, access arrangements,
links to publications etc. Datafinder integrates with Databank software and is
designed to be used, with minimal modification, by other HEIs as part of their
RDM infrastructure.
ReDBox
has been designed as a metadata store / catalogue for research data. It
provides workflows and interfaces for metadata creation. ReDBox is a research
data registry, so the research data is assumed to be stored elsewhere, but data
and related documentation / files may be uploaded to the system. ReDBox has
been developed with, and is therefore closely integrated with, Mint [4.9.7], a
name authority and vocabulary system. Development was supported by the ANDS.
This
is a metadata catalogue storing rich metadata describing data objects stored in
files, repositories or on the web. Metadata schemas are composed of concepts
that describe data. In XMC Cat, the XML metadata schemas are partitioned into
concepts, which act as the unit of metadata storage. This allows for a
dynamically adaptable query interface.
A
hybrid, multi-model data server architecture allows Virtuoso to offer
Relational, XML and RDF data management, full text indexing, linked data, web
application and document web server function and web service deployment (SOAP
or REST).
Blacklight
is an open source discovery interface for any SOLR index. Blacklight is a Ruby
on Rails gem which accommodates heterogeneous data. This is part of the
infrastructure for Hydra, the IR at the University of Hull [6.1.9].
Primo
is the discovery interface that offers a single search box for the whole range
of a library’s collections, be they locally managed or remote electronic
content. This provides the discovery interface for the libraries of the
Universities of Sheffield and York.
The
Sierra platform provides a suite of library services, including a resource
discovery interface. With similar functionality to Ex Libris Primo, this is the
resource discovery interface used by the University of Leeds library.
Open
source multilingual digital library software, able to handle a wide variety of media
formats.
4.6. Current Research Information Systems (CRIS) and DMP tools
Elements
allows the Research Office to manage their researchers’ published outputs by
importing records from external sources such as WoS, Scopus, CrossRef and
Figshare, and by allowing researchers to import details from Google Scholar,
Mendeley, EndNote, Refman and BibTeX. Research information including HR, finance
and grants administration data is managed and may be imported from legacy
databases. Faculty information and academic profiles are managed and reported.
Elements integrates with EPrints, Fedora and DSpace repositories through
community developed plugins. Elements is the RIS in use at the Universities of
Sheffield and Leeds.
Pure
provides comprehensive research information management. Pure aggregates data
from awards management, HR, finance, student administration and other
institutional sources. Publication data is retrieved from external sources such
as Scopus, WoS, PubMed, Worldcat and Mendeley to populate Pure with information
about researcher outputs. Integration with DSpace, EPrints, Fedora and Equella
supports automatic population of the institutional repository. Pure is the RIS
in use at the Universities of York, Lancaster and Edinburgh.
Developed
by Avedas, a Thomson-Reuters company, Converis is in use at Hull and integrated
into the Hydra infrastructure. Converis appears to have the same functionality
as Elements and Pure. This CRIS adheres to Research information standards –
CERIF, CASRAI, VIVO and ORCID (see below).
The
DMPonline tool has been developed by the DCC to help researchers create data
management plans. The tool contains templates that represent the requirements
of various funders and institutions. Guidance is provided during the process
and the DMP may be exported in a variety of formats. The tool is used by the University of Lancaster.
4.7. Data capture and workflow management systems
4.7.1. LIMS
Laboratory
information management systems manage all aspects of laboratory processes, from
data capture, sample management and instrument control to workflow, document
and personnel management. LIMSfinder http://www.limsfinder.com/ provides information
about the numerous LIMS available.
4.7.2. Digital Lab books
Electronic
laboratory notebooks (ELNs) are computer applications designed to
document experiments as an alternative to paper laboratory notebooks. Examples
include:
Quartzy http://www.quartzy.com/;
LabAssistant http://labassistant.en.softonic.com/mac; My Lab http://www.mylab.fi/en/; Sparklix https://www.sparklix.com/;
eCAT http://www.researchspace.com/electronic-lab-notebook/index.html
and Wingu http://signup.wingu.com/index.html. LabArchives
https://mynotebook.labarchives.com/ in
partnership with BioMed Central, acts as the default storage system for
supplementary data published with articles in BMC journals.
Bioconductor
provides tools for the analysis and comprehension of high-throughput genomic
data. More data management software resources for Biomolecular research
are provided by BBMRI http://www.bbmri-wp4.eu/node/45 and Biocompare http://www.biocompare.com/Software/.
COLWIZ
provides web, desktop and mobile apps to facilitate individual and
collaborative research, saving precious time for researchers at every stage:
from an initial idea, through collaboration, to publication of results.
An
open source, Java-based application for building scientific workflows, which
helps users create, execute and share analyses and data.
This
is a project management system for the life sciences. It facilitates project
management and collaboration, linking research data, protocols, results and
published papers, and integrating external data.
Labtrove
is a data-centric digital infrastructure for supporting research. The software
was developed at the University of Southampton as a result of experience gained
through eScience research projects such as CombeChem[xviii],
eBank[xix],
eCrystals [6.2.3], R4L[xx],
Smart Tea[xxi]
and oreChem[xxii].
Research infrastructure components – repository, LIMS, pervasive computing and
RDF, are integrated into a blogging / social network paradigm. In Labtrove, the
data is associated with the project metadata, at the point of, or prior to,
creation. Therefore researchers can recreate and adapt experiments, using
automated procedures and instrument settings. The system provides the necessary
framework for good data management and curation.
LabVIEW
is a graphical development environment providing a range of tools for data
acquisition, instrument control, data management and reporting.
VLab
enables the formation of a Smart Research Framework, helping the creation and
preservation of the record. VLab extends the model of a digital infrastructure
for supporting research (repositories, LIMS, pervasive computing & basic
RDF underpinning) to incorporate the online blog paradigm, in which a
data-centric system with control over visibility and sharing is essential.
Model Interaction Environment for Neuroscience (MIEN) is a
package of interface and library code intended to make a number of scientific
modeling, data markup, and data storage tasks easier. Many of the extension
functions of MIEN are devoted to neuroscience tasks, but the core MIEN package
is a general purpose scientific modeling and data visualization tool with a
flexible extension system.
This
is a social networking site and Virtual Research Environment (VRE) designed for
people to share, discover and reuse workflows and build communities.
MyExperiment was developed using the MyGrid software suite [4.10.2] by a team
from the universities of Southampton, Manchester and Oxford.
OMERO
manages images from the microscope to publication using a central repository.
Data can be viewed, organized, analyzed and shared from anywhere via the
internet, from a desktop app (Windows, Mac or Linux), from the web or from 3rd
party software.
OBiBa
provides open source software components for biobanks and biomolecular
research.
SCAPE
is an open source infrastructure platform that executes institutional digital
preservation strategies, for very large, complex and heterogeneous collections
of digital objects, by extending repository functionality with semi-automated
workflows. The system integrates Fedora Commons, Taverna and Hadoop.
Taverna
is an open source and domain-independent Workflow Management System –
a suite of tools used to design and execute scientific workflows and
aid in silico experimentation. The Taverna suite was written
in Java by the MyGrid team [4.10.2] and includes the Taverna Engine (enacting
workflows), Taverna Workbench (desktop client application) and Taverna Server
(allows remote execution of workflows). Taverna has been widely deployed,
particularly in Bioinformatics and Chemistry, is hosted by the University of Manchester
and supported by JISC, EPSRC, BBSRC[xxiii],
ESRC[xxiv]
and FP7[xxv].
The Yogo
Data Management System is a set of software tools created to enhance the
process of data annotation, analysis and web publication. The system provides a
set of easy to use software tools for data sharing by the scientific community.
It enables researchers to build their own custom designed data management
systems. Another branch of the system provides tools for viewing anatomical and
physiological data.
4.8. Data transfer protocols
Simple Web-service Offering Repository
Deposit (SWORD) is a lightweight protocol for depositing content from one
location to another. It is a profile of the Atom Publishing Protocol (APP) and
designed to ‘lower the barriers to deposit’ any content into repositories.
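Because SWORD is a profile of the Atom Publishing Protocol, a metadata deposit can be expressed as an Atom entry sent to a repository collection. The Python sketch below builds such an entry and the accompanying HTTP headers without sending anything. The function name build_deposit and the sample metadata are hypothetical, and the Content-Type and In-Progress headers follow our reading of the SWORD 2.0 profile, so they should be checked against the target repository before use.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_deposit(title: str, author: str, abstract: str) -> Tuple[bytes, Dict[str, str]]:
    """Return an Atom entry body and headers for a metadata-only deposit."""
    ET.register_namespace("atom", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)

    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    # Descriptive metadata carried as a Dublin Core term inside the entry.
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}abstract").text = abstract

    body = ET.tostring(entry, encoding="utf-8")
    headers = {
        "Content-Type": "application/atom+xml;type=entry",
        # Assumption: the deposit is left open so files can be added later.
        "In-Progress": "true",
    }
    return body, headers


# Hypothetical metadata, for illustration only.
body, headers = build_deposit("Soil survey 2013", "A. Researcher", "Example dataset.")
```

The body and headers would then be sent with an HTTP POST to the collection address advertised by the repository's service document.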
BagIt is a
hierarchical file packaging format for storage and transfer of digital content.
A ‘bag’ is a structure that encloses a ‘payload’ and descriptive ‘tags’, and does
not require knowledge of the payload’s internal semantics. Also at: https://github.com/LibraryOfCongress/bagit-java
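The bag structure described above can be sketched with the Python standard library alone: a data/ directory for the payload, a bagit.txt declaration and a checksum manifest as tag files. The function names make_bag and verify_bag and the sample file are hypothetical; the sketch writes only the minimal required files and omits optional ones such as bag-info.txt, so it is an illustration rather than a full implementation of the specification.

```python
import hashlib
from pathlib import Path
from typing import Dict


def make_bag(bag_dir: Path, payload: Dict[str, bytes]) -> None:
    """Write payload files under data/ plus the minimal BagIt tag files."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest line format: checksum, whitespace, path relative to the bag.
        manifest_lines.append(f"{digest}  data/{name}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True
```

A receiving archive can run the verification step without understanding the payload, which is why the format suits transfer between an active data store and a repository.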
The Open Archives Initiative Protocol
for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository
interoperability. OAI-PMH is a set of six verbs or services that allow data
providers to expose structured metadata and make them available for harvesting
in response to service providers’ requests.
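The six verbs are issued as ordinary HTTP requests with a verb parameter. The following Python sketch lists them and builds a request URL; the base URL repo.example.org/oai is hypothetical and no request is sent, so this only shows the shape of a harvesting call.

```python
from urllib.parse import urlencode

# The six request types defined by OAI-PMH.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository endpoint, for illustration only.
harvest_url = oai_request_url(
    "https://repo.example.org/oai", "ListRecords", metadataPrefix="oai_dc"
)
```

A service provider would typically call Identify first, then ListRecords with a metadata prefix such as oai_dc to harvest Dublin Core records.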
4.9. Identifier services and identity components
DataCite
is an international organisation which supports research data archiving, access
and citation by assigning persistent identifiers to datasets. An institution may
join DataCite in order to have DOIs minted for its datasets.
The
Digital Object Identifier (DOI) is a character string used to uniquely identify any
object. The DOI provides an actionable (clickable), interoperable, persistent
link to metadata about the object, including the URL where the object is
located. The DOI for an object is permanent, whereas the location and other
metadata may change, therefore the DOI may be used for persistent
citation.
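How a DOI becomes an actionable link can be shown with a short Python sketch: the identifier is split into its prefix and suffix, and the whole string is appended to the doi.org resolver address. The DOI used here (10.1234/example.dataset) is made up for illustration, and the function names are hypothetical.

```python
from typing import Tuple
from urllib.parse import quote


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI into its prefix (registrant) and suffix (item) parts."""
    prefix, separator, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and separator and suffix):
        raise ValueError(f"Not a well-formed DOI: {doi}")
    return prefix, suffix


def doi_to_url(doi: str) -> str:
    """Turn a DOI into a resolvable link via the doi.org resolver."""
    split_doi(doi)  # validate before building the link
    return "https://doi.org/" + quote(doi, safe="/")


# Made-up DOI, for illustration only.
link = doi_to_url("10.1234/example.dataset")
```

The link stays the same even if the landing page moves, because only the metadata registered against the DOI needs to be updated.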
The
Common European Research Information Format (CERIF) is a standard for managing
and exchanging research information. It provides a data model that describes
the research domain, defining research entities; researchers, projects,
organisations, outputs and funding, and the relationships between these
entities. CERIF has been developed by EuroCRIS[xxvi].
Shibboleth
is a very widely deployed federated identity authentication system. It is an
open source, free software system that provides single sign-on capabilities for
individual access to protected online resources within and between
organisations. Shibboleth is employed for user authentication at Sheffield,
Leeds, York and many other HEIs.
ORCID
provides a persistent identifier for individual researchers, so that their
identity is unambiguous. Automatic links to research outputs, publishing
activities and grant applications are supported. ORCID is now integrated with
Symplectic Elements and Figshare (Hamnel, 2012).
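An ORCID iD is a 16-character identifier whose final character is a check digit calculated with the ISO 7064 MOD 11-2 algorithm, which lets integrating systems catch mistyped identifiers. The Python sketch below shows that calculation; the function names are hypothetical, and the sample iD is the one ORCID publishes for its fictitious test researcher.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, digits and check character of an ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()
```

For example, is_valid_orcid("0000-0002-1825-0097") returns True, while changing the last digit makes the check fail.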
The
Consortia Advancing Standards in Research Administration Information (CASRAI) is
developing a common data dictionary and advancing best practice for research
information exchange and reuse.
VIVO
is an open source semantic web platform and ontology for representing
researchers and their associated training, background, activities,
organizations, and outputs including publications and research resources. VIVO
has been developed and implemented at Cornell University in association with
other projects including CASRAI, ORCID and EuroCRIS.
Mint
is an open source name authority and vocabulary system that provides services
to web applications. Mint was developed with ReDBox on the Fascinator platform
with support by the ANDS.
The
Building the Research Information Infrastructure project (BRII) at the
University of Oxford aimed to develop infrastructure, built on semantic web
technologies, enabling efficient sharing of research information. The registry implemented
at Oxford, integrated into the Fedora infrastructure, forms a part of the
Oxford DAMS and, as such, benefits from data preservation. This system is not
yet available to other institutions.
4.10. Other software systems and platforms of interest
The
Globus Toolkit is an open source set of software components enabling the
sharing of services and resources across the ‘grid’. The toolkit includes
software for security, information infrastructure, resource and data
management, monitoring and discovery. Services and resources may be shared
across institutional and geographical boundaries whilst retaining local
autonomy.
The
MyGrid team have developed a suite of tools that support the creation of
e-laboratories. These tools have been adopted by a large number of projects,
across a diverse range of domains, including Taverna [4.7.15] and MyExperiment
[4.7.11].
SOLR
is an open source search platform developed by the Apache Lucene project.
Features include full-text search, hit highlighting, faceted search, near
real-time indexing, database integration, rich document handling and geospatial
search. SOLR is written in Java, has REST-like HTTP/XML and JSON APIs and runs
as a standalone full-text search server.
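The REST-like HTTP interface can be illustrated by building a query against the select handler, with JSON output and one facet field. In this Python sketch the server address, the core name (datasets) and the field names (title, subject) are hypothetical, and no request is sent.

```python
from typing import Optional
from urllib.parse import urlencode


def solr_select_url(
    base_url: str,
    core: str,
    query: str,
    facet_field: Optional[str] = None,
    rows: int = 10,
) -> str:
    """Build a URL for the select handler of a SOLR core."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Faceting returns counts of matching documents per field value.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical server, core and fields, for illustration only.
search_url = solr_select_url(
    "http://localhost:8983/solr", "datasets", "title:climate",
    facet_field="subject", rows=5,
)
```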
MySQL
is the world’s most popular free, open source relational database application.
MySQL is the database component of EPrints and an optional database for Fedora
Commons.
SAP
provides a suite of software tools for university management processes.
Agresso
is a range of Enterprise Resource Planning (ERP) software tools.
Oracle
provides a range of database systems and Enterprise management resources.
PostgreSQL
is a free open source object-relational database system. PostgreSQL is one of
the databases that may be incorporated in CKAN, Fedora Commons and DSpace.
Drupal
is a free open source content management system, which may be used to provide a
web-based user interface for many applications (such as catalogue databases).
Moodle
is a free open source learning management system (LMS) / virtual learning
environment (VLE).
The
Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.
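The best-known of these simple programming models is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) in plain Python on a single machine to count words; it does not use any Hadoop API, and the function names are hypothetical. On a real cluster the map and reduce phases would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(shuffle(pairs))
```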
5. Reviews, Evaluations and Comparisons of Infrastructure Elements
Several
of the JISC research data management infrastructure projects published the
results of reviews, evaluations and comparisons they carried out to select the
infrastructure components and RDM tools they were to trial. A number of
projects published the results of user requirements surveys, conducted to
choose from the components and tools available. These are indicated below,
with the findings briefly described where available.
5.1. CKAN for Research Data Management in an Academic Setting
A
workshop was held on 18th February 2013, facilitated by the JISC MRD
programme, to investigate the use of CKAN for RDM. The workshop featured
presentations from Bristol and Lincoln and discussions fed into a user
requirements gathering exercise. CKAN capabilities in fulfilling these
requirements are expressed in the output of this work, which is available at: http://lncn.eu/mxz2 (Winn et al. 2013).
5.2. ADMIRe
The
project issued a document outlining considerations in developing a Research Data
Management repository strategy, which included a review of repository
software: http://admire.jiscinvolve.org/wp/files/2013/05/ADMIRe-RDM-Repository-Strategy-Requirements.pdf (Berry and Parsons, 2012b).
The
EQUELLA digital repository system, in use as a DAMS at Nottingham, was piloted
for use as a data repository. The evaluation of the data repository pilot was
reported at: http://admire.jiscinvolve.org/wp/files/2013/05/ADMIRe-EQUELLA-Research-Data-Repository-Pilot.pdf (Berry and Parsons, 2012a). Key issues revealed include the need
for manual validation of the metadata entered through a wizard, the workflow
requirements of obtaining a DOI and storing non-open datasets.
5.3. Data.bris
The CKAN data portal platform was
investigated by the data.bris project to provide a public read-only catalogue
of research data publications (data discovery) and also to manage
controlled-access active data (collaborative sharing). The team gave a
presentation on their evaluation of CKAN at the JISC ‘CKAN for research data
management in an academic setting’ workshop - reported on the blog at: http://data.bris.ac.uk/2012/12/18/ckan-and-data-bris/ (Price, 2013a), and the slides are available at: http://data.bris.ac.uk/files/2013/02/databris-ckan.pdf (Price, 2013b). CKAN has now been adopted as part of
the implemented infrastructure at Bristol [6.1.1].
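As a rough illustration of the data discovery use case, a CKAN catalogue can be queried through the `package_search` call of CKAN's Action API. The sketch below only builds the request URL and reads titles out of a response. The portal address `https://ckan.example.org` and the sample response are made up for illustration and do not describe the data.bris installation.

```python
import json
from urllib.parse import urlencode


def build_search_url(portal, query, rows=10):
    """Return the Action API package_search URL for a CKAN portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        return []
    return [d.get("title", d.get("name")) for d in payload["result"]["results"]]


# Hypothetical response body, trimmed to the fields used above.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "example-dataset", "title": "Example dataset"}],
    },
})

url = build_search_url("https://ckan.example.org/", "soil moisture", rows=5)
titles = dataset_titles(SAMPLE_RESPONSE)
```

Fetching `url` with any HTTP client and passing the body to `dataset_titles` would complete the round trip; the network call is left out so that the sketch runs offline.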
5.4. Datapool
The JISC Datapool project at the
University of Southampton was concerned with modifying Microsoft SharePoint to
create data deposit interfaces, consisting of project and dataset forms that
can collect metadata to feed through to the EPrints repository. The metadata
profile of the EPrints institutional repository was extended using the
ReCollect plugin, adapting the repository for research data. A report was
published, describing the integration, available at:
http://EPrints.soton.ac.uk/352813/3/EPrints-sharepoint-report-final10.pdf (Hitchcock
and White 2013).
5.5. Iridium
The Iridium project involved the assessment and testing of a number of systems:
· A categorisation and brief review of 67 RDM tools and infrastructure components was carried out and reported at: http://research.ncl.ac.uk/media/sites/researchwebsites/iridium/iridium_external_tools_assessment_17_5_2013_v1_PGR_LW.pdf (Iridium Support Team, 2012).
· An evaluation of DataStage and DataBank, reported at: http://iridiummrd.wordpress.com/2013/02/14/iridium-evaluation-of-datastage-and-databank-research-data-management-tools-from-dataflow-project/ (Wood, 2013).
· A CKAN use case report at: http://research.ncl.ac.uk/media/sites/researchwebsites/iridium/iridium_CKAN_case_study_12_6_2013_v1_BA.pdf (Allen, 2012).
· Sakai integration into the RDMI, reported at: http://iridiummrd.wordpress.com/2011/11/22/research-data-management-at-euro-sakai-2011/ (Martin, 2011).
CKAN was adopted as the platform for the research data portal at Newcastle
University. Information on the CKAN data portal is available at: http://research.ncl.ac.uk/rdm/tools/ckan/
5.6. Kaptur
The
Kaptur project carried out an evaluation of technical systems in May 2012, to
judge their suitability for the management of visual arts research data. A set of user requirements was created with
which to evaluate the technical system capabilities, based on: software type
and cost, storage requirements, interface requirements, system requirements and
institutional requirements. Seventeen software systems were chosen for evaluation
and the five highest-scoring were short-listed: Dataflow, DSpace, EPrints, Fedora
and Figshare. These were then measured against a more detailed set of
requirements, and EPrints was deemed the most viable option, particularly since
it was already in use at the partner institutions. However, Figshare and
Dataflow were strong contenders and fulfilled some of the requirements that
EPrints did not. Two pilots were therefore implemented: an integration of
EPrints with Figshare and an integration of EPrints with DataStage. The findings were reported in the ‘Kaptur
Technical analysis report’ at: http://www.research.ucreative.ac.uk/1239//1/Kaptur_technical_analysis.pdf (Garrett et al. 2012).
In November 2012, it was agreed that neither of the two pilots was viable and
that an integration of EPrints and CKAN, which had not been available for the earlier
technical analysis, would be piloted. It was determined that the EPrints-CKAN
instance, although integration was not fully possible at the time, was a
stronger, sustainable model and worth continuing to develop in the future (Gramstadt, 2013).
5.7. Orbital
The concept of a ‘minimum viable product for RDM’ was developed for the Orbital
project, and its feature set was considered to be authentication, storage,
hosting/publishing, licensing, persistent URI and analytics. CKAN was chosen as the platform for the data repository, as
it was found to meet these requirements ‘out of the box’, and for many other reasons as reported in the
blog post at: https://orbital.blogs.lincoln.ac.uk/2012/09/06/choosing-ckan-for-research-data-management/ (Winn, 2012). An
evaluation of the use of CKAN for RDM, presented at two conferences, is
available at: http://eprints.lincoln.ac.uk/9778/ (Winn,
2013a).
5.8. Research360
The Research360 project at the University of Bath carried out a survey of user
requirements for a research data repository, published at: http://opus.bath.ac.uk/34082/ (Cope, 2013). Development of the technical
infrastructure involved integration of EPrints and the HCP file storage system,
described at: http://opus.bath.ac.uk/35532/3/Research360_EPrints_HCP_Report_FINAL.docx.pdf (Research360, 2013). Modification of Sakai to enable deposit of
material into a SWORD2 compliant repository is described at: http://opus.bath.ac.uk/35540/3/Research360_Sakai_Development_Report_FINAL.pdf (Research360 Project, 2012).
5.9. RoaDMaP
Research data repository functional requirements were compiled by the RoaDMaP repository
working group. The criteria were based on the Kaptur project review and
enlarged to account for the local context. The draft repository functional
requirements are available at: http://library.leeds.ac.uk/downloads/file/389/data_repository_platform_functional_requirements (RoaDMaP repository working group, 2013), and
discussed in a blogpost at: http://blog.library.leeds.ac.uk/blog/roadmap/post/163 (Proudfoot, 2013c).
Initially it was considered expedient to build upon the existing EPrints infrastructure,
but Dataflow offered a better fit with project needs and so was also considered.
Dataflow, however, revealed technical issues (in the link between DataStage and
DataBank), so other platforms were considered. Using three case studies, the
functional requirements were tested against the three main candidates for
repository platform: EPrints, CKAN and DataFlow. EPrints was eventually chosen
for a pilot service, given the short timescale for EPSRC compliance (Proudfoot et al. 2013).
5.10. SMDMRD
User
requirements were gathered by questionnaire and interview, using DAF
methodology. The main user requirements were:
- Seamless access, or command line access with batch import/export support
- easy to use web interface for searching published datasets
- advanced metadata-based search function
- customisable metadata and RDF support
- dataset version control
- multi-level access control
- linking data to published papers (DOI or handle.net)
In order to choose a platform for a prototype data management system, the project
compared installations of Fedora Commons (using an Islandora Drupal module),
DSpace, Dataverse and DataFlow against the following criteria:
- Meeting user requirements out of the box
- Ease of install, getting it running and maintenance
- Ease of customisation
- How many standards supported
- How well developed, supported and widely used
The project team favoured DataFlow because of its DataStage functions, but it was still
under development. DSpace was found to be the easiest to install and run, with much
online help available, and Queen Mary University already had a DSpace
institutional repository. The team found Fedora difficult to install
and run, and Dataverse limited in its functionality, particularly metadata
customisation. Eventually, DSpace was chosen for the pilot data management
system, with the intention to combine it with DataStage to integrate researcher
workflows, using the SWORD protocol to transfer datasets. The report on
platform choice is available at: http://rdm.c4dm.eecs.qmul.ac.uk/platform_choice (Fabiani,
2012).
5.11. UWE Managing Research Data
The project team chose EPrints for the
research data repository at UWE because it was already in use for the
institutional repository, so no further funding was necessary and the team
had the skills needed to repurpose the system. They were also in communication
and sharing knowledge with other institutions that had already used EPrints
for data publication. These factors
are discussed in the document available at: http://www2.uwe.ac.uk/services/library/using_the_library/Services%20for%20researchers/eprints-data-repository-uwe.pdf (Holliday, 2012).
5.12. Loughborough University UK HE Research Data Management Survey
A survey
of UK HEIs was conducted to determine their plans for future RDM services and
received responses from 38 institutions. Regarding technical infrastructure
components and tools, the results revealed that 6 (16%) institutions had an
operational Research Data Service, with 25 (66%) developing one. Most
institutions were storing or aimed to store both data and metadata, 2 planned
to hold only metadata and 2 planned to hold just the data. Regarding the
software system the service used or was intending to use:
· EPrints – 11 institutions
· DSpace – 4
· PURE – 4
· Symplectic – 2
· Converis – 1
· Figshare – 1
· iRODS – 1
· Other systems – 13 (including DataFlow, Fedora/Hydra, Equella and in-house developed systems)
A report on the survey, with links to the results, is available
at: http://blog.martinh.net/2013/10/metadata-is-love-note-to-future-uk.html (Hamilton, 2013).
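The figures quoted above can be checked with a short tally. The sketch below uses only the counts given in this section; the variable and function names are illustrative. Note that the platform counts as listed sum to 37, one fewer than the 38 responding institutions, and this section does not explain the difference.

```python
RESPONDENTS = 38  # institutions that responded to the survey

# Status of the Research Data Service, as reported above.
service_status = {"operational": 6, "in development": 25}

# Software system in use or intended, as listed above.
platform_counts = {
    "EPrints": 11,
    "DSpace": 4,
    "PURE": 4,
    "Symplectic": 2,
    "Converis": 1,
    "Figshare": 1,
    "iRODS": 1,
    "Other systems": 13,
}


def percentage(count, total=RESPONDENTS):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


def ranked(counts):
    """Return (name, count) pairs ordered from most to least used."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


status_percentages = {name: percentage(n) for name, n in service_status.items()}
platform_total = sum(platform_counts.values())
```

Here `status_percentages` reproduces the 16% and 66% quoted in the text, and `ranked(platform_counts)` places the combined "Other systems" group first, followed by EPrints as the most widely chosen named platform.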
5.13. St Andrews
CKAN
was investigated as the platform for a pilot RDM system at the University of St
Andrews, as part of the JISC funded C4D project. A list of user requirements
was composed, with which to measure CKAN suitability, which contributed to the
work done at the ‘CKAN for Research Data Management Workshop’ [5.1], as
published in the blogpost at: https://research-computing.wp.st-andrews.ac.uk/2013/03/15/ckan-for-research-data-management/
(Plietzsch, 2013a). CKAN was chosen for the pilot, the evaluation process being
described in the blogpost at: http://research-computing.wp.st-andrews.ac.uk/2013/11/27/using-ckan-for-research-data-management/
(Plietzsch, 2013b).
5.14. iREAD
The iRODS
evaluation and demonstrator project provided an evaluation and demonstration of
the iRODS system, assessing the capabilities of a demonstrator system against
use-case requirements from the CARMEN project. The evaluation is available at: http://www.wrg.york.ac.uk/iread
5.15. DANS Easy
The process
of deciding between Fedora, EPrints and DSpace for the DANS Easy data
repository service, is described in the paper at: http://www.ais.up.ac.za/digi/docs/bogaards_paper.pdf (Bogaards, 2009)
5.16. DCC
The DCC provides a catalogue of RDM tools and services at: http://www.dcc.ac.uk/resources/external/tools-services
5.17. ANDS
ANDS provides information on technical resources at: http://www.ands.org.au/resource/techdocs.html ;
and on metadata store solutions at: http://ands.org.au/guides/metadata-stores-resources.html .
5.18. JISC Digital Media
Advice on various aspects of managing
digital media collections may be found at: http://www.jiscdigitalmedia.ac.uk/managing
6. Active Institutional Infrastructure examples
Several
institutions have now established institutional research data repositories, or
host discipline-based or multi-institutional project-based research data
repositories. The UK-based institutional data repositories open to external
view are listed below. Some discipline-based data repositories based at UK
HEIs are also listed, together with a number from institutions outside the UK.
6.1. UK institutional data repositories
The Open data
repository is still under development. CKAN has been selected for the data
repository and functions as a catalogue of research data. This integrates with
the PURE RIS, which functions as a catalogue of research outputs (Price, 2013a).
The
institutional repository is now able to preserve and publish research data.
This
repository is based on DSpace. The technical infrastructure at Edinburgh
involves integration with PURE, active data infrastructure and the DMPonline
tool.
This data
repository is built on the EPrints platform, modified using the ReCollect
plugin to accept datasets. The service includes allocation of DataCite DOIs.
The repository is based on the
DSpace platform, and material may be deposited via Symplectic. ORE’s content
includes journal articles, conference papers, working papers, reports, book
chapters, videos, audio, images, multimedia research project outputs, raw data
and analysed data. Exeter's three former repositories (The Exeter Research
and Institutional Content Archive (ERIC), Digital Collections Online (DCO) and
the Exeter Data Archive (EDA)) were merged into ORE and all previous
content is still available via the same permanent link. The merger
took place in March 2013.
This EPrints-
based Glasgow School of Art institutional repository accepts a wide range of
objects including research data. This repository was the subject of a case
study for the KAPTUR project.
Goldsmiths
research data catalogue is built on the EPrints platform and results from the
work done for the KAPTUR project.
This is a
DSpace institutional repository that is being expanded to include a data
catalogue and a research data archive.
The digital
repository at Hull is built on the Hydra micro-services architecture [see 2.1.2
and 4.2.5]. The repository is designed to hold a wide range of digital
resources including research datasets.
The
Researcher Dashboard is the interface for the Data deposit workflow,
facilitated by the ‘Orbital Bridge’ application (Stainthorp, 2013). This links the various components of the RDMI: an EPrints IR for published research papers,
network storage, Lincoln’s Awards Management System and a CKAN based data
registry (Stainthorp, 2012).
A Research Data infrastructure has been
implemented at Newcastle which includes a CKAN data portal (for archiving and
publishing data) together with a number of in-house built systems: MyProject
(a project and awards management system), MyImpact (a researcher profile and
publication information system), a Research Data Catalogue (linking data,
projects and publications), a VRE and e-Science Central (research collaboration
tools).
Databank is a
Fedora based data repository for the University of Oxford. Data may be stored
and preserved in the long-term, retrieved and published from anywhere on the
web. This is a component of the DataFlow infrastructure at Oxford, alongside
DataStage which provides local management of active research data, including
metadata annotation and a collaborative workflow. The RDM infrastructure also
includes the Online Research Database Service (ORDS)[xxvii] and
the institutional repository, Oxford University Research Archive (ORA) [xxviii].
The Research
Data Repository at the Centre for Digital Music, Queen Mary University
of London, is a DSpace-based repository
[the selection process is discussed at 5.10]. This repository was
specifically configured for long-term preservation and sharing of multimedia
file formats.
The EPrints
institutional repository at the University of Southampton has extended
the existing list of data types accepted to include datasets and
experiments, using the ReCollect plugin. EPrints Soton now holds research
data underlying published research outputs (papers). Another strand of work,
using SharePoint to catalogue and share active data, has yet to be implemented.
The
University of Southampton has a federated approach to repository management and
so there are a number of instances of EPrints being used by departments to
curate their research outputs.
This
repository for research data is built on an EPrints platform.
UCARO is the institutional repository and accepts a
wide range of research outputs including research data. This is built on the
EPrints platform.
An instance of EPrints was modified for use as the
data repository at UWE. The
project developed its own metadata profile for research data, having decided
against subscribing to the DataCite scheme and before the ReCollect plugin
became available.
6.2. Discipline-based research data repositories hosted by UK HEIs
Although not an
institutional repository, this service is based at the University of Edinburgh.
Here, DSpace has been customised to offer a repository that eases both the deposit
and discovery of geospatial data.
The Detection
of Archaeological Residues using Remote-sensing Techniques (DART) research
project maintains a CKAN data portal for the open data outputs from the
project.
The University of Southampton department of
Chemistry holds data from X-ray diffraction experiments in an EPrints
repository. Each eprint record consists of bibliographic data, data
collection parameters and files; the files include raw data (.hkl),
visualisations (.jpg), experimental conditions (.htm), structure determination
outputs, final structural result (.cif and .cml) and a validation report.
This is not an institutional repository but a national social and
economic research data repository, based at the University of Essex. The UKDA
provides the UK Data Service[xxix], which curates key quantitative and qualitative
data; UK Data Service ReShare[xxx], which curates data from ESRC funded research;
and the HDS[xxxi] (successor to the AHDS). These are housed on a
modified EPrints repository platform.
The CARMEN
Portal is a VRE to support e-Neuroscience, providing storage and processing
services over a Grid infrastructure. The CARMEN system is a three-tier web
architecture consisting of a web portal, an application layer and a storage
layer, developed by a collaboration of researchers from 11 UK universities. The
Java portal allows the user to access data and to create and run analysis tools
on remote servers. The storage layer is shared between MySQL databases and an
SRB (Storage Resource Broker) system. The application layer consists of Java
servlets, providing a middleware layer that bridges storage and portal.
6.3. Institutional and discipline-based research data repositories outside the UK
Arrow, the
research repository at Monash, provides a place for researchers to store and
manage research data and related publications. The university provides LaRDS
(Large Research Data Store) for research dataset storage, which is used for
collaboration using the Confluence wiki and Sakai VRE, and for publishing data via
the research repository Arrow. Monash also hosts a number of project based
research data repositories. Research datasets are catalogued through the
various current RDM platforms. This metadata may be harvested by the Research
Data Australia (RDA) service, which provides a national research data
catalogue. Monash does not have an institutional research metadata repository
(catalogue) since this service is provided at the national level (Jones, 2013). The software system employed is ‘VITAL’.
The Research
Data Repository is based on Equella, and participates in the RDA catalogue.
Some research data collections may be discovered using the Research Hub service
at: http://research-hub.griffith.edu.au/collections.
3TU.Datacentrum,
a collaboration of TU Delft, TU Eindhoven and University of Twente Libraries,
provides a data repository, storing datasets from technical and scientific
research in the Netherlands, and data processing services. Datacentrum is built
on Fedora Commons and THREDDS dataserver architecture.
Easy
is the online archiving system provided by the Data Archiving and Networked
Services (DANS), an institute of the Royal Netherlands Academy of Arts and
Sciences (KNAW) and the Netherlands Organisation for Scientific Research (NWO).
The repository is built on Fedora Commons architecture.
Merritt is built on a micro-services architecture
providing digital curation through a series of devolved, independent but
interoperable services. By devolving functions to a set of small self-contained
services, they are easier to deploy, maintain and develop, leading to a
flexible system able to respond to diverse needs and an ever changing technical
environment. One of the central services is the Curation Storage micro-service,
which supports a set of behaviors for manipulating and retrieving entities and
their properties. Interaction with the Storage service is provided via a Java
procedural API, a command line API, and a RESTful web API. The micro-services
available are listed at: https://confluence.ucop.edu/display/Curation/Microservices.
The Harvard Dataverse
Network is a repository for sharing, citing and preserving research data;
open to all scientific data from all disciplines worldwide. This is built on
the Dataverse repository application and is part of the Dataverse Network.
The
JHU Data Archive runs on the Dataverse repository software platform and is part
of the Dataverse network.
PURR provides an online, collaborative working space
and data-sharing facility, based on the HUBzero platform.
RUresearch makes research data available to the
scholarly community and provides a collaborative workspace for data processing
and reuse. The system also provides access to supplementary resources,
codebooks, lab books and publications to give context to the data. RUresearch
is built on Fedora Commons architecture.
The University of Virginia institutional repository,
Libra, is built on the Hydra micro-services platform and now accepts research
datasets.
The
Inter-university Consortium for Political and Social Research provides a
discipline-based data repository, located at the University of Michigan, Ann
Arbor. This repository is built on the DuraCloud platform. ICPSR provides a
range of other data curation tools and services.
A
list of research data repositories, including discipline-based, national and
institutional research data repositories can be found at Databib[xxxii]
http://databib.org/ and at re3data.org[xxxiii]
http://www.re3data.org.
[viii] University of St Andrews Digital Collections Portal https://arts.st-andrews.ac.uk/digitalhumanities/
[xiv] Open Archival Information System (OAIS) http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=57284
[xv] Trusted Digital Repository (TDR) http://www.oclc.org/content/dam/research/activities/trustedrep/repositories.pdf?urlm=161690
[xxvi] EuroCRIS – European Organisation for International Research Information http://www.eurocris.org/