Thursday, 3 July 2014

RDM Technical Infrastructure Review - Architecture

2.     RDM Technical Infrastructure Architecture

2.1.   Technical infrastructure components

Generally, the infrastructure architectures examined below have been designed around functional requirements derived from researcher workflows. Some innovative platforms have been designed from the outset to manage research data throughout the lifecycle, but most infrastructure projects have had to take existing systems into account and have engineered modifications to make them interoperable (Hitchcock, 2012).

2.1.1.      Major functional components of the RDM infrastructure:

·         Metadata Capture [4.7] - Metadata capture (data cataloguing) may be accomplished simply by providing an interface for researchers to fill out online forms. Wherever possible the process should be automated, to reduce the amount of manual annotation required of researchers: this avoids ‘double-keying’, which researchers find frustrating, and cuts the number of errors that manual input inevitably introduces. Automatic metadata capture, concurrent with data capture, may be facilitated by instruments and equipment that save data directly to the laboratory, departmental or facility file store, or to the institutional network. Electronic lab books, electron microscopes and other imaging instruments, and genetic sequencing and analysis instruments may feed data to a project-based Laboratory Information Management System (LIMS) [4.7.1]. (A minimal sketch of this kind of capture-time metadata generation follows the list of components below.)

·         Active Research Data Management [4.4] - Active research data needs to be accessed rapidly, may require large computational resources, and often demands stringent security and access arrangements. Many institutions have developed collaborative computing systems, such as Virtual Research Environments (VREs) or LIMS, to accommodate these needs. Active data management comprises two functional components: a filestore and a data registry (also called a metadata store or asset registry). In some cases these components are integrated into a single system; in others, the metadata may be handled by the CRIS.

·         Research Data Repository [4.2] - Archive data is possibly best managed in a discipline-based repository or data centre, with the institutional repository acting as the ‘repository of last resort’, as previously discussed. The institutional data repository (or the institutional repository, if it has been modified to accommodate data) is an appropriate home for datasets for which there is no discipline-based repository or data centre, or for temporary storage before submission to a data centre. The research data repository provides a catalogue for all published data and, possibly, file storage for some published data. The catalogue and archive functions of the repository may be separated.

·         Research Data Catalogue [4.5] - holds the metadata records of published research data. The data themselves may be held in a discipline-based data repository outside the institution or in an institutional data archive. The selection of the underlying metadata schema is fundamental, and consideration must be given to the schema used by the proposed National Data Registry[i]. Many institutions favour the DataCite metadata schema[ii]; subscription provides the means to mint DOIs and assurance of a standard level of preservation.

·         Research Data Archive [4.3] - preserves data not, or not yet, submitted to discipline-based data repositories. The associated metadata records are held in the research data catalogue.

·         Current Research Information System (CRIS) [4.6] - manages the metadata associated with researcher identity, project information, research costing, grant applications and awards. CRISs hold details of researchers’ published outputs with associated citation metrics.
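As a rough illustration of the capture-time metadata generation described under Metadata Capture above, the following Python sketch walks an instrument file store, records basic file-level metadata and checksums, and combines them with project-level metadata of the kind that might be imported from a CRIS. All paths and field names are invented for illustration; a real LIMS or data registry would do considerably more, but the shape of the record is similar.

```python
import hashlib
import json
import os
from datetime import datetime

def file_metadata(path):
    """Capture basic technical metadata for a single data file."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
    stat = os.stat(path)
    return {
        "filename": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified": datetime.utcfromtimestamp(stat.st_mtime).isoformat(),
        "md5": md5.hexdigest(),
    }

def capture_run(run_directory, project_metadata):
    """Combine project-level metadata (e.g. imported from the CRIS) with
    automatically captured file-level metadata for one instrument run."""
    record = dict(project_metadata)
    record["captured"] = datetime.utcnow().isoformat()
    record["files"] = [
        file_metadata(os.path.join(run_directory, name))
        for name in sorted(os.listdir(run_directory))
        if os.path.isfile(os.path.join(run_directory, name))
    ]
    return record

if __name__ == "__main__":
    # Illustrative project-level metadata; in practice imported from the CRIS.
    project = {"project_id": "EXAMPLE-001", "principal_investigator": "A. Researcher"}
    print(json.dumps(capture_run("/data/instrument/run-42", project), indent=2))
```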

All components of this technical infrastructure need to be interoperable. This is achieved through adherence to common data and metadata formats and standards, allowing data and metadata to be exchanged between systems.
Many implementations involve an overlap between these functional components. For example, the CRIS may provide aspects of active data management by acting as an (inward-facing) research data registry; laboratory instruments may be part of a LIMS, so that automated data capture is an integral part of the collaborative active data management system; and data storage and the (outward-facing) data catalogue may be separate systems or may be combined in a data repository. The storage / archive function of the data repository may be provided by an external archive service, with access and deposit managed seamlessly through the repository platform (see figure 3).
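To make the point about metadata standards concrete, the sketch below maps a simple internal catalogue record onto a minimal subset of DataCite kernel-3 elements and serialises it as XML. The internal field names are invented, and the DataCite schema defines many more properties than are shown here; this is only meant to illustrate the kind of crosswalk that metadata exchange between systems relies on.

```python
import xml.etree.ElementTree as ET

# An invented internal record, e.g. as exported from a data registry or CRIS.
internal_record = {
    "doi": "10.5072/example-dataset-1",   # 10.5072 is a test DOI prefix
    "creator": "Researcher, A.",
    "title": "Example sensor readings, 2013-2014",
    "publisher": "Example University",
    "year": "2014",
}

def to_datacite_xml(record):
    """Map internal field names onto a minimal subset of DataCite kernel elements."""
    ns = "http://datacite.org/schema/kernel-3"
    root = ET.Element("resource", xmlns=ns)
    identifier = ET.SubElement(root, "identifier", identifierType="DOI")
    identifier.text = record["doi"]
    creators = ET.SubElement(root, "creators")
    creator = ET.SubElement(creators, "creator")
    ET.SubElement(creator, "creatorName").text = record["creator"]
    titles = ET.SubElement(root, "titles")
    ET.SubElement(titles, "title").text = record["title"]
    ET.SubElement(root, "publisher").text = record["publisher"]
    ET.SubElement(root, "publicationYear").text = record["year"]
    return ET.tostring(root, encoding="unicode")

print(to_datacite_xml(internal_record))
```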

Figure 3. RDM technical infrastructure data flow 

2.1.2. Data grids and Micro-services

The data grid is a form of RDMI architecture in which middleware applications allow researchers to manage data across grid infrastructures. Grid computing involves a distributed infrastructure served by interoperable software services, ‘middleware’, allowing resource sharing; the resulting ‘Grid’ may be considered a ‘virtual organisation’ (Foster et al. 2003). Data grids permit the sharing of computational resources, storage resources, network resources, code repositories and catalogues. Access to grid resources is controlled by a resource management system, or storage resource broker (SRB). Several middleware toolkits are available, including the open source options Globus [4.10.1] and myGrid [4.10.2].
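The broker’s role can be illustrated with a toy Python example: clients refer to data by logical name, and the broker resolves that name to one of several physical replicas held on different storage resources. The catalogue contents, resource names and selection policy below are invented purely for illustration.

```python
# Toy storage-resource-broker: a logical namespace mapped to physical replicas.
REPLICA_CATALOGUE = {
    "/grid/projectA/survey.csv": [
        {"resource": "campus-filestore", "url": "file:///srv/grid/projectA/survey.csv"},
        {"resource": "partner-site", "url": "gsiftp://partner.example.org/data/survey.csv"},
    ],
}

def resolve(logical_name, preferred_resource=None):
    """Return the physical location of a replica for a logical name."""
    replicas = REPLICA_CATALOGUE.get(logical_name)
    if not replicas:
        raise KeyError("no replicas registered for %s" % logical_name)
    if preferred_resource:
        for replica in replicas:
            if replica["resource"] == preferred_resource:
                return replica["url"]
    # Fall back to the first registered replica if no preference matches.
    return replicas[0]["url"]

print(resolve("/grid/projectA/survey.csv", preferred_resource="partner-site"))
```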

The micro-services architectural approach considers the repository as a ‘set of services’ rather than a ‘place’ (Abrams et al. 2009). Each function in a workflow is embodied in a self-contained micro-service, which is joined with other micro-services in a ‘pipeline’ to produce complex processes. The micro-services approach has been developed by the California Curation Centre[iii] and put into practice in the California Digital Library (CDL) Merritt repository [6.3.5]. At the University of Oxford, micro-services are built on an underlying Fedora repository platform, creating the Databank repository system [4.2.4], used as the platform for Oxford Databank [6.1.12].
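The pipeline idea itself is simple. The Python sketch below, with invented step names that are not those used by Merritt or Databank, shows micro-services as small, self-contained functions chained together over a shared state.

```python
import hashlib

# Each "micro-service" is a small, self-contained step that takes and returns a
# state dictionary; a pipeline is simply an ordered chain of such steps.

def fixity(state):
    state["sha1"] = hashlib.sha1(state["content"]).hexdigest()
    return state

def characterise(state):
    state["size_bytes"] = len(state["content"])
    return state

def register(state):
    # In a real system this step would write to a catalogue; here we just flag it.
    state["registered"] = True
    return state

def run_pipeline(state, steps):
    """Chain micro-services by threading the state through each step in turn."""
    for step in steps:
        state = step(state)
    return state

result = run_pipeline({"name": "example.dat", "content": b"some bytes"},
                      [fixity, characterise, register])
print(result)
```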

The iRODS [4.1.3] software system allows the management of a distributed workflow through the chaining of micro-services. iRODS is termed ‘adaptive middleware’ and allows more flexible customisation of data management functions than can be achieved with an SRB system. These functions, or micro-services, are coded as ‘rules’, which may be chained together to produce larger, macro-level functionality.

Hydra [4.2.5] is a multi-purpose repository framework based on a micro-services architecture. The main components are a Fedora repository platform [4.2.3], Solr indexing software [4.10.3], the Blacklight discovery interface [4.5.5] and the Hydra plugin, a Ruby on Rails library which facilitates workflow in digital object management (Awre, 2012). Hydra has been implemented at the University of Hull [6.1.9] and the University of Virginia [6.3.10], with provision made for curating research datasets. At Hull, micro-services implement workflows that allow deposit of materials via the CRIS, Converis [4.6.3], the Sakai VLE [4.4.2] and SharePoint [4.4.3].



2.2.   Functional requirements

The functional requirements of the RDM Infrastructure may be derived from analysis of stakeholder activities, particularly researcher workflows. Many of the JISC RDMI projects have carried out data audits and investigated researcher workflows and use case scenarios in order to specify infrastructure requirements. The following list is derived from the findings of several of the JISC RDMI projects: ADMIRe (Sero Consulting, 2012; Parsons and Berry, 2012), CKAN for RDM (Winn et al. 2013), KAPTUR (Garrett et al. 2012), Orbital (Stainthorp, 2012) and RoaDMaP (2013) [see section 7.1. for more information about these JISC RDMI projects].

Researcher requirements:       
      a) For active data
·         Direct capture of data (and metadata) from instrument.
·         As much automated metadata annotation as possible, such as project level metadata (researcher identity and grant information) imported from the CRIS.
·         Networked storage (personal and project space) that is adequate in size, regularly backed up, and provides speedy access to large data volumes.
·         Secure, authenticated access mechanisms, especially for sharing sensitive data; this usually involves institutional authentication mechanisms (Shibboleth [4.9.6]).
·         Ability to share data with collaborators inside and outside the institution (‘Academic Dropbox’).
·         Mechanisms for secure data destruction.
·         Mechanisms for data transformation as required for data curation (such as anonymisation, aggregation and format transformation).

      b) For depositing archive data
·         User friendly data upload facility (like Dropbox [4.4.6]).
·         Customisable workflows for creating or importing metadata and uploading files.
·         Simple process for ingest of large data collections (multiple files) and association of collections with single metadata record (dataset record).
·         Controlled lists for some metadata fields.
·         Support for versioning of datasets.
·         Clear choice of license options.
·         Ability to specify granular access rights to files at data object and collection level.
·         Embargo options for metadata and files.
·         Mechanisms for secure data destruction.

      c) For data discovery and reuse
·         Effective search and discovery mechanisms, using subject-specific terminology. Controlled vocabularies of keywords with auto-complete function.
·         Enable immediate access to datasets.
·         Access to datasets held outside the repository.
·         Support access to very large datasets.
·         Means of access to restricted data, where the metadata is visible; a ‘contact owner’ button.
·         Linking dataset to context / reuse metadata or data documentation – describing the process of data generation.
·         Indication of, and links to, related data and research publications.
·         Support for granular access to data and associated metadata.
·         Visualisation and data analysis tools to give summary data or overview of data. Support query and processing of data on the repository server rather than after download.
·         Support for free tagging - adding discipline specific tags or metadata to datasets.
·         Federated catalogues allowing searching across multiple institutions.
·         Advice on data citation.
·         Citation data produced, demonstrating impact.

Additional RDM service requirements:
·         Customisable metadata schema.
·         Support multiple ingest protocols.
·         Staged deposit workflow – allows administrative area for quality check / validation.
·         Enable selective metadata harvesting.
·         Enable extraction of metadata and data in open format.
·         Support open standards and exposure of metadata.
·         Support multiple content licensing – exposed clearly.
·         Support technical metadata.
·         Support generation of persistent unique identifiers.
·         Support open methods of authentication.
·         Ability to remove data to an access-controlled area, i.e. a dark archive for embargoed data.
·         Ability to delete data, generating a tombstone reference.
·         Access to metadata through the library catalogue – an OAI-PMH [4.8.3] endpoint is required (a minimal harvest request is sketched after this list).
·         Support reporting – analysis of repository content, download and view metrics.
·         Enable creation and retrieval of an audit trail, reporting management actions.
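Several of the requirements above (selective harvesting, exposure of metadata through open standards, an OAI-PMH endpoint) come down to supporting the OAI-PMH protocol. As a minimal Python sketch, the following issues a ListRecords request for Dublin Core records within a date range and prints the record identifiers; the endpoint URL is a placeholder, not a real service.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder endpoint; a real institutional repository would expose its own.
ENDPOINT = "https://repository.example.ac.uk/oai"

def list_records(endpoint, metadata_prefix="oai_dc", from_date=None, until_date=None):
    """Issue an OAI-PMH ListRecords request and return the parsed XML root."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    url = endpoint + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())

# Example: harvest Dublin Core records added or changed during 2014.
root = list_records(ENDPOINT, from_date="2014-01-01", until_date="2014-12-31")
ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in root.findall(".//oai:record", ns):
    identifier = record.find(".//oai:identifier", ns)
    print(identifier.text if identifier is not None else "(no identifier)")
```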


2.3.   Institutional considerations

Expediency may well determine the development of the institutional RDM technical infrastructure. In the current climate of budget constraint, and with the need to demonstrate value for money, there should be a focus on appraising the systems currently in place and determining whether they may be modified to fit the proposed infrastructure. Modification of existing components will require local expertise or the employment of developers, often the most expensive aspect of system implementation. Work is therefore needed to cost the available options: building upon and integrating existing components, or implementing a new fully-integrated system, possibly a proprietary one, replacing existing components where necessary.

At a minimum, the institution needs to do the following to develop the RDM technical infrastructure:
  • Implement institutional policy – ‘Additional infrastructure and services for research data management, to be developed in consultation with researchers.’ A research data audit is therefore recommended to determine researcher practices.
  • Fulfil funder requirements – ‘Research organisations will ensure that appropriately structured metadata describing the research data they hold is published...’ A data catalogue is therefore required by 1st May 2015.
  • Promote and facilitate good RDM practice. Training and guidance resources need to be developed.
  • Select sustainable, inexpensive, open options (open for interoperability and sustainability). Business cases will need developing for the various options available.
  • Take into account projected future requirements. This involves a consideration of risks to services through the removal of funding (for example, the AHDS[iv] data centre ceased to be funded in 2008 and so stopped functioning).


2.4.   Requirements gathering methods

In developing the institutional RDM strategy, the DCC recommends using both requirements-gathering and gap analysis methods (Jones et al. 2013). The DCC provides a number of tools for this purpose and has published a case study detailing their use (Rans and Jones, 2013).

A number of UK institutions have used the Data Audit Framework (DAF)[v] developed by JISC and HATII for requirements gathering. The DAF provides a set of survey methods, questionnaire and interview frameworks in order to identify, locate and describe research data assets and determine how they are being managed. The AIDA Toolkit[vi] has also been developed for institutional self-assessment of the readiness and capabilities for management of digital assets and digital preservation.

The Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)[vii] is a benchmarking tool for RDM strategy development, built on key aspects of DAF, AIDA and other tools. The DCC recommends using CARDIO in conjunction with the other tools, the emphasis being on strategic planning and on identifying gaps between the current situation and best practice.





[ii] DataCite metadata schema 3.0 http://schema.datacite.org/meta/kernel-3/ [See section 4.9.1.]
[iv] Arts and Humanities Data Service (AHDS) http://www.ahds.ac.uk/
[v] Data Audit Framework (DAF) http://www.data-audit.eu/index.html
[vii] Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO) http://cardio.dcc.ac.uk/
