Thursday, 3 July 2014

RDM Technical Infrastructure Review - Conclusions and Recommendations

8.     Conclusions and Recommendations

The EPSRC expectations of organisations receiving EPSRC funding require that research data created as a result of that funding be effectively curated and securely preserved, and that metadata describing this research data be created and published, by 1st May 2015. The University of Sheffield Data Management Policy was developed in response to the EPSRC expectations. This policy states that the university will provide support for research data management, including infrastructure and services to be developed in consultation with researchers.

In addition to surveying researchers to determine their RDM practices and attitudes, it is appropriate to develop these services so that they fit seamlessly into the researcher workflow and do not burden the researcher with significant changes to their work practices. Rather than a researcher filling out forms, metadata may be exchanged between systems (CRIS, VRE and repository, for example), thus reducing re-keying. Products, processes and practices that have been developed by a research community should be adopted, adapted and developed for the needs of other researchers, rather than developing entirely new solutions. As indicated by Jones et al. (2013: 14):
“…close engagement with researchers is critical when designing RDM systems to ensure their applicability and uptake.”

8.1.  Infrastructure components and implementation strategy

In choosing the components of the technical infrastructure and strategy for instituting it, the following options must be considered:
·         A ‘Big Bang’ implementation or an incremental roll-out, allowing the service to be adopted slowly.
·         A system-wide generic infrastructure or a ‘Bottom-up’ project-based approach, with pilot infrastructure components tested, possibly throughout the whole lifecycle of a research project.
·         Utilising existing infrastructure upon which new services are developed, or implementing new infrastructure. Integrating existing components may require investment in terms of development work, but implementing new infrastructure will also be costly.
·         Choosing components where there is local experience or choosing components where there is, as yet, little community of practice.
·         Choosing open source components, which require local expertise and development, or proprietary components, which are supported but expensive.

Perhaps the greatest factor in considering these options is that some RDM services, namely the data catalogue and data archiving, need to be implemented by May 2015. Therefore, in providing a ‘bare compliance’ option, it may be expedient to utilise the facilities that are already in place, if they provide the necessary functions and can be integrated with little modification or development.


8.2.  Published research data catalogue or repository

To some extent, the choice of infrastructure component providing the public research data catalogue is contingent on the development of WRRO as a catalogue of all published research outputs, covering research data as well as research papers. A White Rose Research Data Catalogue, perhaps implemented as a separate instance of EPrints as with WREO, would need to contain only catalogue records, not the datasets themselves, which would be held in a research data archive. Institutional ownership of research data will likely require local control, if not local siting, of the preservation and storage functions in the research data archive.
EPrints is already integrated with Symplectic at Sheffield and Leeds, and with PURE at York, so dataset metadata may be automatically imported into an EPrints research data catalogue. EPrints would need to be modified with the ReCollect plugin to handle a research dataset metadata profile. Some of this dataset metadata will be recorded automatically by the Symplectic system, some will be harvested from external repositories (it now harvests from Figshare), and researchers will manually input all other necessary metadata fields. At Open Research Exeter, research data is deposited via Symplectic into the DSpace repository.
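
The three metadata sources described above (automatic recording in the CRIS, harvesting from external repositories, and manual input for the remainder) could be combined along the following lines. This is only an illustrative sketch: the field names, the list of required fields and the precedence order (CRIS first, then harvested, then manual) are assumptions made for the example, not the actual Symplectic, EPrints or ReCollect data model.

```python
# Hypothetical minimal metadata profile. A real profile, such as one handled
# by the ReCollect plugin, would define its own fields.
REQUIRED_FIELDS = ("title", "creator", "funder", "identifier", "access_conditions")


def merge_metadata(cris_record, harvested_record, manual_record):
    """Combine metadata from three sources into one catalogue record.

    Assumed precedence: a later source only fills fields that earlier
    sources left empty, so manual input covers the remaining gaps.
    """
    merged = {}
    provenance = {}
    sources = (
        ("cris", cris_record),
        ("harvested", harvested_record),
        ("manual", manual_record),
    )
    for source_name, record in sources:
        for key, value in record.items():
            if value and key not in merged:
                merged[key] = value
                provenance[key] = source_name  # remember where each value came from
    return merged, provenance


def missing_fields(record):
    """List the required fields that are still empty or absent."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Example values are invented for illustration only.
cris = {"title": "Example dataset", "creator": "A. Researcher", "funder": "EPSRC"}
harvested = {"identifier": "doi:10.0000/example", "title": "Example dataset (external copy)"}
manual = {"access_conditions": "Open"}

record, provenance = merge_metadata(cris, harvested, manual)
print(record)
print(provenance)
print(missing_fields(record))
```

Recording the provenance of each field is one possible way to show researchers which values were filled in automatically and which they still need to supply, in keeping with the aim of reducing re-keying.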

If the decision is made not to go ahead with a shared WR research data catalogue, then a locally based system may be implemented that combines the catalogue and archive functions as an institutional research data repository. A local instance of EPrints could be considered, as there is much local experience in the use of EPrints, Symplectic and the connector, and a willingness to share knowledge within the RDM practitioner community. Alternatively, other repository and cataloguing systems should be considered. The open source systems DSpace, Fedora Commons, DataFinder, Hydra and CKAN have a growing community of users willing to share expertise. A number of proprietary systems, such as CONTENTdm, for which there is local expertise, must also be considered.

8.3.  Research data archive
A facility for the preservation of research data that has not been submitted to an external repository needs to be provided by the institution to comply with EPSRC expectations. Such a data archive may also offer a preservation service for unpublished research data. Repository systems that have been designed or modified for research data provide both the archival storage function and the catalogue function. Alternatively, specialist long-term archiving and preservation systems exist. Possible candidates here include Rosetta (as the library has experience with Ex Libris systems), Figshare for Institutions (as it is supported by the providers of, and integrated with, Symplectic), Arkivum (as it is involved in the JANET data archiving framework agreement) and Dataverse (one of the open source data preservation platforms available).

8.4.  Active data management
Currently at Sheffield, collaborative functionality is provided by Google Drive and the HPC facilities, but there is no institutional Virtual Research Environment (VRE) as such. A number of JISC infrastructure projects investigated the use of collaboration tools such as DataStage, SharePoint and Sakai to provide a VRE that is integrated with the CRIS, file servers, data archive and/or a data repository. Surveys of researchers have shown a requirement for ‘Academic Dropbox’ facilities, which allow the sharing of data, and for ‘Social network’ style annotation (Garrett et al. 2012).

For RDM, one function required of the VRE is that of data registry, defined here as an inward-facing data catalogue. This is built into DataStage (at Oxford) and SharePoint (at Southampton), though a number of institutions incorporate a separate data registry component in their active data management facility, for example CKAN at Lincoln and PURE at Bristol. Attention should perhaps be paid to YouShare, developed at the University of York. The capabilities of the WRG infrastructure for collaborative active data management should also be investigated.

8.5.  Data and metadata capture
There is currently a lack of information about the data and metadata capture tools being used and developed at the University of Sheffield, although systems such as laboratory information management systems may be used by some research groups at the institution. Experience in the use of such tools needs to be investigated to inform the choice and development of an active data management infrastructure capable of integrating these tools.


8.6.  Final remarks
The great benefits of a shared approach, in terms of saving money and time, should mean that engaging in collaborative efforts to establish shared services is a priority. Opportunities to collaborate in the development of a WR Research Data Catalogue and the proposed N8 shared data archiving service must be exploited. The development of RDM services delivered through the White Rose Grid and N8 HPC grid infrastructure needs to be explored. Attention should be paid to the national research data service being piloted by the DCC and JISC.
Given the time constraint on compliance with the EPSRC expectations, it may be appropriate to begin by piloting different components of the RDM technical infrastructure with a number of EPSRC research projects. If a pilot component proves sustainable, then it will contribute to the incremental roll-out of a fully integrated RDM service, whilst fulfilling the requirements of the EPSRC expectations.

