A Review of Options for the Development of Research Data Management Technical Infrastructure at the University of Sheffield
A Report to the University of Sheffield Research Data Management Service Delivery Group
Executive Summary
This report reviews the options
available for the development of a technical infrastructure, the software and
hardware systems, to support Research Data Management (RDM) at the University
of Sheffield. The appropriate management of research data throughout the data
lifecycle, during and after the research project, is considered good research
practice. This involves data management
planning during the research proposal stage; looking after active data, its creation, processing, storage and
access during the project; and data
stewardship, long-term curation, publishing and reuse of archive data after
the end of the project.
Good RDM practice benefits all
stakeholders in the research process: researchers will secure their data
against loss or unauthorised access, and may increase research impact through
publishing data; research institutions may consider research data as ‘special
collections’ and will need to minimise risk to data and damage to reputation;
research funders wish to maximise the impact of the research they fund by
enabling reuse; publishers may wish to add value to research papers by
publishing the underlying data.
Many research funders now mandate RDM
procedures, particularly Data Management Planning (DMP), and the UK research
councils’ policies have contributed to the RCUK common principles on data
policy. Notice must be taken of the EPSRC Expectations of organisations
receiving EPSRC funding. These include the requirements that the organisation
will:
- Publish appropriately structured metadata describing
the research data they hold – therefore the institution must create a
public data catalogue.
- Ensure that EPSRC-funded data is securely
preserved for a minimum of ten years – therefore the institution must
create a data archive.
- Ensure that effective data curation is
provided throughout the full data lifecycle – therefore the institution
must provide the necessary human and technical infrastructure
required.
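The ten-year preservation expectation implies a simple date calculation for each archived dataset. The following is a minimal sketch only: the function name is hypothetical, and counting the period from a deposit date is an assumption made for illustration, since the summary above does not say when the period starts.

```python
from datetime import date

# Minimum preservation period taken from the expectation quoted above.
MIN_RETENTION_YEARS = 10


def earliest_review_date(start: date, years: int = MIN_RETENTION_YEARS) -> date:
    """Return the first date on which the minimum retention period has elapsed.

    `start` is assumed here to be the deposit date (an illustrative choice).
    """
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year,
        # so roll forward to 1 March rather than shortening the period.
        return date(start.year + years, 3, 1)
```

A dataset deposited on the compliance date of 1st May 2015 would, under this assumption, need to be preserved until at least 1st May 2025.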
Institutions in receipt of EPSRC
funding are expected to be compliant with these expectations by 1st
May 2015. The University of Sheffield Research Data Management Policy was
developed in response to the RCUK principles and EPSRC expectations. This
states that the University will develop infrastructure and services to support
research data management in consultation with researchers.
The local infrastructure does not exist
in a vacuum and interacts with, and is dependent upon, a range of other services
and processes in an information ecosystem. At one end of the continuum of
research data curation is the local storage of data and metadata (data
identification, description and documentation), usually accessible to the
project team only. At the other end are international discipline-based data
repositories or national data centres that publish research data, facilitating
its discovery and access. Research institutions lie in the middle of this
continuum and provide the means to move research data and metadata from their
local, unpublished state to an international published state. Some institutions
now publish research data, either by modifying the institutional repository
(IR) to accommodate datasets in addition to research papers or through a data
repository, that is, a new instance of a repository system running alongside the IR.
However, discipline-based repositories are considered the most appropriate
facility for data publishing, due to their configuration for the data types and
metadata formats associated with the research community they serve.
The repository is here defined as a
software system composed of three layers – a user interface, a database holding
metadata records, and a storage layer holding the actual research data
bitstreams. In some implementations, often known as data registries, data
catalogues or metadata stores, the repository holds only the metadata records
and links to the data stored elsewhere.
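The distinction between a full repository and a registry-style implementation can be sketched in a few lines. This is an illustrative model only, not a description of any particular repository platform: the class and field names are hypothetical, and the user interface layer is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MetadataRecord:
    """One entry in the metadata database layer (field names are illustrative)."""
    identifier: str
    title: str
    # Registry-style records keep only a link to data stored elsewhere.
    external_link: Optional[str] = None


@dataclass
class Repository:
    """Metadata database plus storage layer; the user interface is omitted."""
    records: Dict[str, MetadataRecord] = field(default_factory=dict)
    storage: Dict[str, bytes] = field(default_factory=dict)  # identifier -> bitstream

    def deposit(self, record: MetadataRecord, bitstream: Optional[bytes] = None) -> None:
        """Store the record, and the bitstream too when the data are held locally."""
        if bitstream is None and record.external_link is None:
            raise ValueError("a record needs either a bitstream or a link to the data")
        self.records[record.identifier] = record
        if bitstream is not None:
            self.storage[record.identifier] = bitstream

    def locate(self, identifier: str) -> str:
        """Report where the data described by a record can be found."""
        record = self.records[identifier]
        if identifier in self.storage:
            return "local storage layer"
        return f"held elsewhere: {record.external_link}"
```

A system whose `storage` stays empty behaves as a data registry or catalogue: every record resolves to a link rather than to a locally held bitstream.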
This report focuses on the outcomes of projects at
UK HEIs funded by the JISC ‘Managing Research Data’ programmes 2009-2011 and
2011-2013. Generally, the infrastructure architectures examined have been
developed in response to the functional requirements derived from researcher
workflows. The major functional components of the RDM technical infrastructure
for the institution are:
- Metadata capture system – identifies, describes and documents the research
data as they are created, captured and processed, and records the context,
conditions, variables and instrument settings. This may be accomplished
manually, by the researcher filling in forms, or automatically, concurrent
with data capture, by using appropriate equipment.
- Active research data management system – active research data need to be
accessed rapidly, may require large computational resources and may require
stringent security and access arrangements. A number of collaborative systems
and virtual research environments have been developed to fulfil these
requirements. These can be considered to comprise a filestore and a data
registry (sometimes known as a metadata store or asset registry).
- Research Data Repository – an appropriate place for the preservation and
publishing of archive research data for which there is no discipline-based
repository or data centre available. The catalogue and archive functions of
the repository may be separated.
- Research Data Catalogue – holds the metadata records of published (but not
necessarily open access) research data. The data themselves may be held in a
discipline-based data repository outside the institution or in an
institutional data archive.
- Research Data Archive – preserves data not, or not yet, submitted to
discipline-based data repositories. The associated metadata records will be
held in the research data catalogue.
- Current Research Information System (CRIS) – manages the metadata associated
with researcher identity, project information, research costing, grant
applications and awards.
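As an illustration of the first component, the two routes for metadata capture (forms filled in by the researcher and settings recorded automatically by equipment) can feed a single record. The sketch below is hypothetical throughout: the function name, the field names and the record layout are assumptions for illustration, not taken from any system reviewed in the report.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_capture_record(
    form_fields: Dict[str, str],
    instrument_settings: Dict[str, str],
    captured_at: Optional[datetime] = None,
) -> Dict[str, object]:
    """Combine manually entered and automatically captured metadata."""
    if captured_at is None:
        # Default to the current time when the equipment supplies no timestamp.
        captured_at = datetime.now(timezone.utc)
    return {
        "description": dict(form_fields),         # entered by the researcher
        "instrument": dict(instrument_settings),  # recorded alongside data capture
        "captured_at": captured_at.isoformat(),
    }
```

Keeping the two sources in separate keys makes it clear which values the researcher supplied and which came from the equipment.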
These components may overlap in
function, but need to be interoperable to provide seamless RDM. Alternative
approaches to an infrastructure composed of diverse components, where ensuring
interoperability may be problematic, are provided by data grids and
micro-services. The technical infrastructure must fit into the researcher
workflow, making RDM processes automatic and, as far as possible, virtually
invisible to the researcher, so as not to burden the researcher with
additional work or changes to their practice. Products, processes and practices
that have been developed by a researcher community should be adopted, adapted
and developed for the needs of other researchers, rather than developing new
solutions.
The choice of technical infrastructure
components and the approach to implementation will need to be considered with
regard to the infrastructure and expertise already present. Integrating and
modifying existing components may be as expensive, in terms of development
work, as implementing new infrastructure. Installing and configuring free
open-source software may prove expensive in terms of development, compared with
proprietary systems. At the University of Sheffield there is currently no
system that adequately supports collaborative active data management – a
virtual research environment, or ‘academic dropbox’. There is little
information available regarding the use of data and metadata capture systems,
although systems such as laboratory information management systems may be used
by some research groups at the institution. It is feasible that the Symplectic
CRIS may be configured for use as a research data registry. The EPrints institutional
repository, WRRO, may be configured for use as a research data catalogue, but
this relies on agreement between the WRUC member institutions. In order to
implement an independent institutional research data repository, expertise will
be needed to do the necessary development work.
A shared approach to RDM services is
being investigated by the White Rose and N8 consortia (of which the institution
is a member); a WR Research Data Catalogue and an N8 shared data archiving service
have been proposed. The development of RDM services delivered through the
White Rose Grid and N8 HPC grid infrastructure needs to be explored. Attention
should be paid to the national research data service being piloted through
the DCC and JISC. Given the benefits of the shared approach, support
for collaborative projects establishing shared RDM services should be a
priority.
This report briefly describes the
eighty most commonly used components of RDM technical infrastructure at UK
HEIs. The report describes evaluations, reviews and comparisons of these
components, gives examples of established RDM services and highlights the
recent projects at UK HEIs which were involved in developing these
services.
By way of conclusion, a number of
recommendations are made regarding the choice of infrastructure components and
the implementation strategy. These recommendations
take into account the current situation of technical infrastructure at the
institution and the constraints on time and cost. Attention is drawn to the
development of shared RDM services with collaborating institutions. Finally, the
proposal is put forward that a number of technical infrastructure components of
an integrated RDM service are first piloted with EPSRC-funded research projects
to ensure compliance with EPSRC expectations by 1st May 2015.
Contents
1. Introduction
1.1 Research data management
1.2 Research data management drivers
1.3 Research data lifecycle
1.4 Data documentation, metadata and data collections
1.5 Data repository or data registry?
1.6 Research data ecology
1.7 Development of RDM services
2. RDM Technical Infrastructure Architecture
2.1 Technical infrastructure components
2.2 Functional requirements
2.3 Institutional considerations
2.4 Requirements gathering methods
3. The University of Sheffield RDM Technical Infrastructure Considerations
3.1 Local infrastructure components
3.2 Consortia options
3.3 Recent reviews of RDM service developments at Sheffield
4. Infrastructure Components
4.1 Integrated systems and integrating components
4.2 Repository platforms
4.3 Archive data storage and digital preservation systems and services
4.4 Active data management and collaboration platforms
4.5 Catalogue software
4.6 Current Research Information Systems (CRIS) and DMP tools
4.7 Data capture and workflow management systems
4.8 Data transfer protocols
4.9 Identifier services and identity components
4.10 Other software systems and platforms of interest
5. Reviews, Evaluations and Comparisons of Infrastructure Components
6. Active Institutional Infrastructure Examples
6.1 UK institutional data repositories
6.2 Discipline-based research data repositories hosted by UK HEIs
6.3 Institutional and discipline-based research data repositories outside the UK
7. RDMI Project Outputs
7.1 Outputs from the JISC RDMI 2011-2013 projects
7.2 Outputs from the JISC RDMI 2009-2011 projects
7.3 Outputs from other relevant projects
8. Conclusions and Recommendations
9. References
9.1 Works cited in the text
9.2 Index of entities noted in the text