2. RDM Technical Infrastructure Architecture
2.1. Technical infrastructure components
Generally,
the infrastructure architecture cases examined below have been designed around
functional requirements derived from researcher workflows. Some innovative infrastructure
platforms have been architected to manage research data throughout the
lifecycle, but most infrastructure projects have had to take into account
current systems and have engineered modifications to facilitate
interoperability (Hitchcock,
2012).
2.1.1. Major functional components of the RDM infrastructure:
· Metadata Capture [4.7] - Metadata capture (data cataloguing) may be accomplished simply by providing an interface for researchers to fill out online forms. It is best for this process to be automated where possible, to reduce the amount of manual annotation required of researchers. As well as reducing ‘double-keying’, which is frustrating for researchers, automation reduces the number of errors inevitably introduced through manual input. Automatic metadata capture, concurrent with data capture, may be facilitated by using appropriate instruments and equipment that save data to the laboratory, departmental or facility file store, or to the institutional network. Electronic lab books, electron microscopes and other imaging instruments, and genetic sequencing and analysis instruments may feed data to a project-based Laboratory Information Management System (LIMS) [4.7.1].
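As an illustration of the automation argued for above, technical metadata (file name, size, checksum, capture time) can be recorded the moment a file lands in the file store, leaving only descriptive fields for the researcher to supply. The sketch below is a generic illustration, not the API of any particular LIMS.

```python
import hashlib
import os
from datetime import datetime, timezone

def capture_technical_metadata(path):
    """Automatically record technical metadata for a newly deposited file,
    so the researcher only has to supply descriptive fields."""
    stat = os.stat(path)
    with open(path, "rb") as f:
        checksum = hashlib.md5(f.read()).hexdigest()
    return {
        "filename": os.path.basename(path),
        "size_bytes": stat.st_size,
        "md5": checksum,
        "captured": datetime.now(timezone.utc).isoformat(),
    }

# Example: a file written by an instrument, then catalogued automatically.
with open("reading_001.csv", "wb") as f:
    f.write(b"wavelength,intensity\n550,0.82\n")

record = capture_technical_metadata("reading_001.csv")
```

A capture service of this kind would typically be triggered by a filesystem watcher or by the instrument software itself, merging the result with project-level metadata pulled from the CRIS.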
· Active Research Data Management [4.4] - Active research data need to be accessed rapidly, may require large computational resources, and may require stringent security and access arrangements. Many institutions have developed collaborative computing systems, such as Virtual Research Environments (VREs) or LIMS, to accommodate these needs. Active data management comprises two functional components: a filestore and a data registry (or metadata store, or asset registry). In some cases these components will be integrated into a single system; in other cases, the metadata may be handled by the CRIS.
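The filestore/registry split described above can be sketched in a few lines: the registry holds metadata and pointers, while the files live separately, so either component can be swapped out (for example, metadata handled by the CRIS). All names below are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class RegistryEntry:
    """One record in the data registry: metadata about an active dataset,
    pointing at files that live separately in the filestore."""
    dataset_id: str
    title: str
    owner: str
    file_paths: list = field(default_factory=list)

class DataRegistry:
    """Minimal asset registry, decoupled from the filestore so that
    either side can be replaced independently."""
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.dataset_id] = entry

    def locate(self, dataset_id):
        """Resolve a dataset identifier to its filestore locations."""
        return self._entries[dataset_id].file_paths

registry = DataRegistry()
registry.register(RegistryEntry("DS-001", "Microscopy run 12", "j.smith",
                                ["/filestore/lab4/run12/img_0001.tif"]))
```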
· Research Data Repository [4.2] - Archive data are possibly best managed in a discipline-based repository or data centre, whilst the institutional repository is the ‘repository of last resort’, as previously discussed. The institutional data repository (or the institutional repository, if it has been modified to accommodate data) is an appropriate home for datasets for which there is no discipline-based repository or data centre, or for temporary storage before data are submitted to a data centre. The research data repository provides a catalogue for all published data and possibly file storage for some published data. The catalogue and archive functions of the repository may be separated.
· Research Data Catalogue [4.5] - holds the metadata records of published research data. The data themselves may be held in a discipline-based data repository outside the institution or in an institutional data archive. The selection of the underlying metadata schema is fundamental, and consideration must be given to the schema used by the proposed National Data Registry[i]. Many institutions favour the DataCite metadata schema[ii], subscription to which provides the means to mint DOIs and assurance of a standard level of preservation.
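To make the schema choice concrete, the DataCite schema defines a small set of mandatory properties (identifier, creators, titles, publisher, publication year, resource type). The sketch below assembles and checks a minimal record of that shape; the helper functions are illustrative, not part of any DataCite client library, and the DOI uses the 10.5072 test prefix.

```python
# Mandatory properties of a DataCite-style record.
MANDATORY = ["identifier", "creators", "titles", "publisher",
             "publicationYear", "resourceType"]

def make_datacite_record(doi, creators, title, publisher, year, resource_type):
    """Assemble a minimal DataCite-style metadata record as a dict."""
    return {
        "identifier": {"identifier": doi, "identifierType": "DOI"},
        "creators": [{"creatorName": c} for c in creators],
        "titles": [{"title": title}],
        "publisher": publisher,
        "publicationYear": str(year),
        "resourceType": {"resourceTypeGeneral": resource_type},
    }

def missing_mandatory(record):
    """Return the mandatory fields that are absent or empty."""
    return [f for f in MANDATORY if not record.get(f)]

record = make_datacite_record("10.5072/example-dataset", ["Smith, Jane"],
                              "Survey microdata 2013", "University of X",
                              2013, "Dataset")
```

A repository would validate a record like this before sending it to the DataCite service to mint the DOI.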
· Research Data Archive [4.3] - preserves data not, or not yet, submitted to discipline-based data repositories. The associated metadata records are held in the research data catalogue.
· Current Research Information System (CRIS) [4.6] - manages the metadata associated with researcher identity, project information, research costing, grant applications and awards. CRISs hold details of researchers’ published outputs with associated citation metrics.
All components of this technical infrastructure need to be interoperable. Interoperability is achieved through adherence to data and metadata formats and standards that allow data and metadata to be exchanged between systems.
Many implementations involve an overlap in these functional components. For example, the CRIS may provide aspects of active data management, acting as an (inward-facing) research data registry; laboratory instruments may be part of a LIMS, so that automated data capture is an integral part of the collaborative active data management system; and the (outward-facing) data storage and data catalogue may be separate systems or may be combined in a data repository. The storage/archive function of the data repository may be achieved using an external archive service, with access and deposit managed seamlessly through the repository platform (see figure 3).
2.1.2. Data grids and micro-services
The data grid is a form of RDMI architecture in which
middleware applications allow researchers to manage data across grid
infrastructures. Grid computing involves a distributed infrastructure served by
interoperable software services, ‘middleware’, allowing resource sharing; the
resulting ‘Grid’ may be considered a ‘virtual organisation’ (Foster
et al. 2003). Data grids permit the sharing of computational resources,
storage resources, network resources, code repositories and catalogues. Access
to Grid resources is controlled by a resource management system, or storage resource broker (SRB). Several middleware toolkits are available, including
the open-source options Globus [4.10.1] and MyGrid [4.10.2].
The micro-services architecture approach considers the
repository as a ‘set of services’ rather than a ‘place’ (Abrams et al. 2009).
Each function in a workflow is embodied in a self-contained micro-service,
which is joined with other micro-services in a ‘pipeline’ to produce complex
processes. The micro-services approach has been developed by the University of California Curation Center[iii] and put into practice in the California Digital Library (CDL) Merritt repository [6.3.5]. At the University of Oxford, micro-services are built on an underlying Fedora repository platform, creating the Databank repository system [4.2.4], used as the platform for Oxford Databank [6.1.12].
The iRODS [4.1.3] software system allows the management of
a distributed workflow through the chaining of micro-services. iRODS software
is termed ‘adaptive middleware’ and allows for a more flexible customisation of
data management functions than can be achieved using a SRB system. These
functions or micro-services are coded as ‘rules’, which may be compiled
together to produce larger macro-level functionality.
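The pattern described above, where self-contained services are chained into a pipeline to produce larger workflows, can be sketched generically. The sketch below illustrates the composition idea only; it is not iRODS rule syntax, and the three services are invented examples.

```python
def chain(*services):
    """Compose micro-services into a pipeline: each service is a
    self-contained function that takes and returns a state dict."""
    def pipeline(state):
        for service in services:
            state = service(state)
        return state
    return pipeline

# Three illustrative micro-services in a curation workflow.
def validate(state):
    state["valid"] = bool(state.get("files"))
    return state

def assign_identifier(state):
    state["id"] = "obj-0001"  # a persistent-identifier service would mint this
    return state

def index(state):
    state["indexed"] = state["valid"]
    return state

# Chain the micro-services into a macro-level ingest function.
ingest = chain(validate, assign_identifier, index)
result = ingest({"files": ["data.csv"]})
```

Because each step is independent, services can be reordered, replaced or reused across pipelines, which is the flexibility claimed for the micro-services and rule-based approaches.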
Hydra [4.2.5] is a multi-purpose repository framework based on a micro-services architecture. The main components are a Fedora repository platform [4.2.3], Solr indexing software [4.10.3], the Blacklight discovery interface [4.5.5] and the Hydra plugin, a Ruby on Rails library, which facilitates workflow in digital object management (Awre, 2012).
Hydra has been implemented at the University of Hull [6.1.9] and the University of Virginia [6.3.10], with provisions made for curating research datasets. At Hull, micro-services implement workflows that allow deposit of materials via the CRIS, Converis [4.6.3], the Sakai VLE [4.4.2] and SharePoint [4.4.3].
2.2. Functional requirements
The functional requirements of the RDM Infrastructure may be
derived from analysis of stakeholder activities, particularly researcher
workflows. Many of the JISC RDMI projects have carried out data audits and
investigated researcher workflows and use case scenarios in order to specify
infrastructure requirements. The following list is derived from the findings of
several of the JISC RDMI projects: ADMIRe (Sero
Consulting, 2012; Parsons
and Berry, 2012), CKAN for RDM (Winn et al.
2013), KAPTUR (Garrett
et al. 2012), Orbital (Stainthorp,
2012) and RoaDMaP (2013)
[see section 7.1. for more information about these JISC RDMI projects].
Researcher requirements:
a) For active data
· Direct capture of data (and metadata) from instruments.
· As much automated metadata annotation as possible, such as project-level metadata (researcher identity and grant information) imported from the CRIS.
· Network storage of adequate capacity (personal and project), regularly backed up, with speedy access to large data volumes.
· Secure, authenticated access mechanisms, especially for sharing sensitive data; this usually involves institutional authentication mechanisms (Shibboleth [4.9.6]).
· Ability to share data with collaborators inside and outside the institution (an ‘academic Dropbox’).
· Mechanisms for secure data destruction.
· Mechanisms for data transformation as required for data curation (such as anonymisation, aggregation and format transformation).
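Of the transformations listed above, anonymisation can be as simple as replacing a direct identifier with an irreversible pseudonym before data are shared. The sketch below is illustrative only; the salting scheme and field names are invented, and a real project would follow its ethics approval and disclosure-control guidance.

```python
import hashlib

def pseudonymise(rows, id_field, salt):
    """Replace a direct identifier with a salted hash so that collaborators
    can link records belonging to the same subject without seeing the
    original identity. The salt must be kept secret by the data owner."""
    out = []
    for row in rows:
        row = dict(row)  # leave the caller's data untouched
        token = hashlib.sha256((salt + row[id_field]).encode()).hexdigest()[:12]
        row[id_field] = token
        out.append(row)
    return out

# Invented example records with a direct identifier to be removed.
participants = [{"patient_id": "943-476-5919", "age": 42},
                {"patient_id": "943-476-5920", "age": 57}]
anon = pseudonymise(participants, "patient_id", salt="project-7f3a")
```

Because the hash is deterministic, the same subject always maps to the same token, so longitudinal linkage survives the transformation.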
b) Depositing archive data
· User-friendly data upload facility (like Dropbox [4.4.6]).
· Customisable workflows for creating or importing metadata and uploading files.
· Simple process for ingest of large data collections (multiple files) and association of collections with a single metadata record (dataset record).
· Controlled lists for some metadata fields.
· Support for versioning of datasets.
· Clear choice of licence options.
· Ability to specify granular access rights to files at data object and collection level.
· Embargo options for metadata and files.
· Mechanisms for secure data destruction.
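Two of the deposit requirements above, associating many files with a single dataset record and versioning, can be sketched together. This is an illustrative model, not the data model of any specific repository platform.

```python
class DatasetRecord:
    """A single metadata record describing a collection of files, with
    whole-dataset versioning: each deposit creates a new immutable version
    rather than silently mutating a published one."""
    def __init__(self, title):
        self.title = title
        self.versions = []  # each version is an immutable tuple of filenames

    def deposit(self, files):
        """Add a new version of the dataset; returns the version number."""
        self.versions.append(tuple(files))
        return self.version

    @property
    def version(self):
        return len(self.versions)

    def files(self, version=None):
        """Files in the given version (latest if not specified)."""
        return self.versions[(version or self.version) - 1]

ds = DatasetRecord("Borehole temperature logs")
ds.deposit(["log_a.csv", "log_b.csv"])
ds.deposit(["log_a.csv", "log_b.csv", "log_c.csv"])  # corrected release
```

Keeping earlier versions retrievable means a citation to version 1 still resolves to exactly the files that were cited.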
c) For data discovery and reuse
· Effective search and discovery mechanisms, using subject-specific terminology; controlled vocabularies of keywords with an auto-complete function.
· Immediate access to datasets.
· Access to datasets held outside the repository.
· Support for access to very large datasets.
· Means of access to restricted data where the metadata are visible; a ‘contact owner’ button.
· Linking of datasets to context/reuse metadata or data documentation describing the process of data generation.
· Related data and research publications indicated and linked to.
· Support for granular access to data and associated metadata.
· Visualisation and data analysis tools to give summary data or an overview of the data, supporting query and processing of data on the repository server rather than after download.
· Support for free tagging - adding discipline-specific tags or metadata to datasets.
· Federated catalogues allowing searching across multiple institutions.
· Advice on data citation.
· Citation data produced, demonstrating impact.
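The controlled-vocabulary auto-complete requirement above is worth making concrete, since it is what steers depositors toward the same keywords instead of free-text variants. A minimal prefix-matching sketch over an invented term list:

```python
def autocomplete(vocabulary, prefix, limit=5):
    """Suggest controlled-vocabulary terms matching a typed prefix,
    case-insensitively, so depositors converge on shared keywords."""
    p = prefix.lower()
    matches = [term for term in sorted(vocabulary)
               if term.lower().startswith(p)]
    return matches[:limit]

# Illustrative discipline keywords; a real service would load these from
# an agreed thesaurus or subject classification scheme.
vocab = ["Geochemistry", "Genomics", "Geophysics", "Glaciology", "Genetics"]
suggestions = autocomplete(vocab, "ge")
```

In a repository deposit form this would sit behind the keyword field, querying the vocabulary as the researcher types.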
Additional RDM service requirements:
· Customisable metadata schema.
· Support for multiple ingest protocols.
· Staged deposit workflow, allowing an administrative area for quality checking and validation.
· Selective metadata harvesting.
· Extraction of metadata and data in open formats.
· Support for open standards and exposure of metadata.
· Support for multiple content licences, exposed clearly.
· Support for technical metadata.
· Generation of persistent unique identifiers.
· Support for open methods of authentication.
· Ability to move data to an access-controlled area, a dark archive for embargoed data.
· Ability to delete data, generating a tombstone reference.
· Access to metadata through the library catalogue – an OAI-PMH [4.8.3] endpoint is required.
· Support for reporting – analysis of repository content, download and view metrics.
· Creation and retrieval of an audit trail reporting management actions.
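The OAI-PMH endpoint required in the list above exposes metadata as XML over simple HTTP requests. The sketch below parses a hard-coded, abbreviated ListRecords response with the standard library rather than calling a live endpoint; the repository identifier and title are invented.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Abbreviated, invented example of an OAI-PMH ListRecords response
# carrying Dublin Core (oai_dc) metadata.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.ac.uk:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Survey microdata 2013</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest_titles(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        ident = rec.find(OAI + "header/" + OAI + "identifier").text
        title = rec.find(".//" + DC + "title").text
        records.append((ident, title))
    return records

records = harvest_titles(SAMPLE)
```

This is the mechanism behind both the library-catalogue integration and the federated, cross-institutional searching listed earlier: any harvester that speaks OAI-PMH can aggregate these records.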
2.3. Institutional considerations
Expediency may well determine the development of the institutional RDM technical infrastructure. In the current climate of budget constraint, and with the need to demonstrate value for money, there should be a focus on appraising the systems currently in place and determining whether these may be modified to fit the proposed infrastructure. Modification of existing components will require local expertise or the employment of developers, often the more expensive aspect of system implementation. Thus, work will be needed in costing the options available: building upon and integrating existing components, or implementing a new, fully integrated system, possibly a proprietary one, replacing existing components where necessary.
At a minimum, the institution needs to do the following to develop the RDM technical infrastructure:
- Implement institutional policy – ‘Additional infrastructure
and services for research data management, to be developed in consultation
with researchers.’ Therefore
a research data audit is recommended to determine researcher practices.
- Fulfil Funder requirements – ‘Research organisations will
ensure that appropriately structured metadata describing the research data
they hold is published...’ Therefore
a data catalogue is required by 1st May 2015.
- Promote and facilitate good RDM practice. Training and
guidance resources need to be developed.
- Select sustainable, inexpensive, open options (open for
interoperability and sustainability). Business cases will need developing
for the various options available.
- Take into account projected future requirements. This involves a consideration of risks to services through the removal of funding (for example, the AHDS[iv] data centre lost its funding in 2008 and ceased to function).
2.4. Requirements gathering methods
In developing the institutional RDM strategy, the DCC
recommends using both requirements-gathering and gap analysis methods (Jones
et al. 2013). The DCC provides a number of tools for this purpose and has published a case study detailing their use (Rans and Jones, 2013).
A number of UK institutions have used the Data Audit Framework (DAF)[v]
developed by JISC and HATII for requirements gathering. The DAF provides a set
of survey methods, questionnaire and interview frameworks in order to identify,
locate and describe research data assets and determine how they are being
managed. The AIDA Toolkit[vi] has also been developed, for institutional self-assessment of readiness and capability in the management of digital assets and digital preservation.
The Collaborative
Assessment of Research Data Infrastructure and Objectives (CARDIO)[vii]
is a benchmarking tool for RDM strategy development developed from key aspects
of DAF and AIDA and other tools. The DCC recommend using CARDIO in conjunction
with the other tools, the emphasis being on strategic planning and identifying
gaps between the current situation and best practice.
[vii] Collaborative
Assessment of Research Data Infrastructure and Objectives (CARDIO) http://cardio.dcc.ac.uk/