Thursday, 3 July 2014

RDM Technical Infrastructure Review - Sheffield Considerations

3.     The University of Sheffield RDM Technical Infrastructure Considerations

3.1.   Local infrastructure components

Some components of an integrated infrastructure already exist at the University of Sheffield:
  •         The institutional repository, WRRO[i], is a shared service hosted at the University of Leeds and based on the EPrints platform [4.2.1]. There is administrative expertise at the University of Sheffield Library, but software development expertise is located in Leeds.
  •         The Research and Innovations Service administers an institutional (inward-facing) catalogue of research publications, MyPublications[ii], which was used to facilitate the REF2014 process. This is built on the Symplectic Elements 4.4 [4.6.1] Research Information Management System (RIMS) and hosted by Corporate Information and Computing Services (CICS). The Symplectic RIMS is a complete and integrated system, but its full functionality is perhaps not yet exploited at Sheffield.
  •         MyPublications is linked to the SAP platform [4.10.5] Corporate Information System (CIS) to import researcher identity and academic unit information.
  •         MyPublications is linked to WRRO by a connector, so that metadata may be pushed into, and files uploaded to, WRRO and outputs published easily.
  •         The University Research Management System (URMS)[iii] is a web-based tool used for costing and pricing research, obtaining approvals, managing awards and post-award administration. It is a module of the CIS and is provided and developed by CICS.
  •         A Digital Asset Management System (DAMS) for library special collections and the National Fairground Archive[iv] is based on the ContentDM [4.2.10] repository platform. Metadata is harvested by the library catalogue system, Primo.
  •         The library management system, Alma, and the resource discovery tool (OPAC), Primo [4.5.6].
  •         CICS Network provides network filestore for individuals and departments. Files are available on and off campus through Shibboleth [4.9.5] authentication. Off-campus access is via the internet or via VPN. Filestores are regularly backed up.
  •         CICS provides cloud storage and collaboration resources using Google Services [4.4.7].
  •         CICS provides a Linux-based high performance computing (HPC) cluster, Iceberg[v], which is the Sheffield node of the White Rose Grid (WRG)[vi]. Iceberg provides backed-up storage, space for using and storing very large amounts of data, and facilities for project group level collaborative work. Several research groups at Sheffield are, or have been, involved in WRG-facilitated collaborations, such as the CARMEN Portal [6.2.5], which uses the iRODS grid management system, and Pegasus[vii], which used the SRB management system.
  •         Sheffield is a member of the N8 HPC Grid, currently operating Polaris[viii], an SGI HPC cluster.
  •         Sheffield is a member of other Grid computing projects, GridPP[ix] and WUN[x].

3.2.   Consortia options
The University of Sheffield is currently involved in two consortia established to share services and foster collaboration. Groups from both consortia are investigating the feasibility of sharing RDM resources – infrastructure components, training materials and expertise. 
3.2.1.  White Rose Universities Consortium (WRUC)[xi]
The WRUC collaboration between the universities of Leeds, Sheffield and York, established in 1997, supports research, knowledge exchange, and teaching and learning through a number of projects. The White Rose Grid (WRG), established in 2002, offers grid infrastructure and HPC for the WRUC. The consortium manages the shared institutional repositories for research publications (WRRO, established in 2004) and for theses (WREO[xii]). Recent work by the White Rose Libraries Systems Architecture Group (2013) has identified the various components of each institution’s infrastructure in order to establish areas of interoperability and shared experience.
RDM Infrastructure components in operation at Leeds:
·         The shared Institutional Repositories WRRO and WREO, based on the EPrints platform and housed on a Leeds server.
·         A Symplectic CRIS component, for managing publications.
·         EPrints-Symplectic connector.
·         Kristal[xiii], an in-house built grants management component of the CRIS.
·         SAP based Corporate Information System.
·         Shibboleth identity authentication.
·         LUDOS[xiv], a DigiTool [4.2.11] digital asset management system. This is to be replaced by an EPrints platform data repository.
·          VRE / Research portal on SunGard Luminis platform [4.4.8] – to be replaced with Microsoft Dynamics [4.4.9] product suite.
·         III Sierra [4.5.7] based library management system.
·         HPC as part of WRG and N8 Polaris HPC.
At Leeds, the RoaDMaP project [7.1.15] is continuing, with interim funding from the University, to investigate service scoping and development. Proposed developments in the RDM infrastructure at Leeds include the choice of EPrints for the data registry or ‘discovery metadata store which points to the data’ (Proudfoot, 2013a), the testing of Arkivum [4.3.1] hardware for archive data storage capabilities and its interaction with EPrints, and the integration of DMPonline [4.6.4] with Kristal (Proudfoot, 2013b).

RDM Infrastructure components in operation at York:
·         The WRRO EPrints repository.
·         A PURE [4.6.2] CRIS.
·         EPrints-PURE connector.
·         York Research Database[xv] – a registry of research publications and project information; a component of the PURE CRIS (equivalent to MyPublications at Sheffield).
·         Shibboleth identity authentication.
·         YODL[xvi], the DAMS based on Fedora Commons [4.2.3] platform. Currently this holds small volumes of Humanities research data.
·         ‘YouShare’ [4.4.10], an in-house built VRE / Portal system for sharing data and software.
·         ‘Alfresco’ [4.4.11], an open source content management system, developed by the department of Biology as a research collaboration system.
·         ExLibris Alma Library management system, with a Primo discovery interface, ‘Yorsearch’[xvii].
·         Google Drive – University supported cloud storage and collaboration space.
·         ‘York Identity Manager’ (IDM)[xviii], the local authentication system (Shibboleth).
·         HPC as part of N8 Polaris HPC and WRG.
Various RDM infrastructure developments being considered at York include integrating DMPonline with the PURE CRIS and using Rosetta [4.3.4] for digital preservation. DataStage [4.4.1], Dropbox and Amazon Glacier [4.3.5] with AWS [4.4.12] are under consideration for active research data management; CKAN [4.2.2], DataBank [4.2.4] and Figshare [4.3.2] are being considered for data storage and discovery; and Arkivum [4.3.1] is being investigated for archive data storage (Allinson, 2013).

The White Rose Research Data Working Group (2013) undertook a high level assessment of the options for a shared research data repository service, resulting in three options:
·         ‘The Bakery’ – a regional service hosted and managed by one institution which the others pay to use.
·         ‘The Cake Mix’ – a recommended best practice service to be deployed and managed in the individual institutions.
·         ‘The Recipe’ – Individually developed solutions to agreed standards, which may be modified to suit local requirements.
It is considered possible for any of these options to share elements of the infrastructure, such as a published API, data catalogue, storage management system or raw storage, and for these shared services to be organised on a regional or national basis.


3.2.2. N8 Research Partnership[xix]
The University of Sheffield is a member of the N8 Research Partnership, a collaboration of eight research intensive universities in northern England (with Durham, Lancaster, Leeds, Liverpool, Manchester, Newcastle and York). The N8 provides the shared HPC facility and an equipment sharing initiative n8equipment.org.uk[xx]. The N8 RDM Architecture Working Group has drafted a reference systems architecture model for RDM across the N8 institutions. This model is useful for visualising the components of the RDM infrastructure of each member institution; to date, three of the eight institutions, including Sheffield and York, have yet to submit maps. The N8 RDM Archiving and Curation Working Group is investigating the feasibility of a shared storage service and developing data appraisal and curation processes and policies.

The N8 institutions’ experience of RDM infrastructure implementation is limited to Leeds, Manchester and Newcastle and results from the JISC RDM projects RoaDMaP [7.1.15] at Leeds, MiSS [7.1.10] at Manchester, and Iridium [7.1.7] at Newcastle. The RDM infrastructure is most highly developed at Newcastle [6.1.11], where a suite of tools[xxi] is offered, including a CKAN data portal[xxii], currently under development. Newcastle have developed their own research information systems as part of this suite, which comprises ‘MyImpact’ (a researcher profile and publication information system), ‘MyProject’ (a project and awards management system), VRE and eScience Central (collaboration and workflow tools) and a Research Data Catalogue (linking data, projects and publications), although the catalogue is currently a proof of concept system.

Common infrastructure components at the five N8 member institutions outside WRUC:
·         EPrints based Institutional Repositories at Durham, Lancaster, Liverpool and Newcastle.
·         Fedora Commons used for Manchester IR ‘eScholar’ and Durham University Library special collections.
·         Agresso / pFACT [4.10.6] financial management tool used at Durham, Lancaster, Liverpool and Manchester.
·         Oracle Financials [4.10.7] used at Durham and Manchester.
·         Oracle based CIS and CRIS ‘ISIS’ developed at Liverpool.
·         CKAN used for Data Repository at Newcastle.
·         PURE used at Lancaster.
·         DMPonline used at Lancaster, whilst Manchester developed their own DMP tool.




3.3.   Recent reviews of RDM service development at Sheffield

3.3.1. Research & Innovation Services RDM project - Case studies

During 2011-12, the University’s Research and Innovation Committee commissioned a Research Data Management Scoping Project to:
·         Establish funders’ current and potential RDM requirements.
·         Identify gaps between funders’ requirements and university practices.
·         Identify and characterise support needs from pathfinder projects.
·         Explore the university’s capabilities for meeting RDM requirements and to propose sustainable, viable extensions to support services.

The pathfinder interviews provided insights into the researchers’ perspectives, many of which had a bearing on RDM. Many find the idea of making research data publicly available contentious, and there is much confusion between open access to research articles and access to data. Many researchers need guidance in choosing between a range of storage options, and do not distinguish between active data storage, back-up, mirroring and archiving. Researchers will support initiatives that are researcher led and involve little bureaucracy. They will tend to adopt RDM practices that fit in with their workflows and the type of data they work with, and that are aligned with their culture. There is a need for clarity on the issue of data ownership.

A list of actions that will potentially lead to the establishment of a University RDM support infrastructure was drawn up through a SWOT analysis of the current position. The resulting project report (Kane et al. 2012) recommends the following actions with implications for technical infrastructure:
·         Data organisation – Update University’s RDM Policy with metadata standards developed across the HE sector. Develop guidance on metadata for researchers. Establish a network of RDM expertise across the university.
·         Data management and planning – Clarification of roles and responsibilities with regard to RDM support. Development of Data management plan (DMP) templates.
·         Data storage and back-up – Clearer guidance on available storage options, the costs involved and advantages / disadvantages of these options is required. Guidance on data retention is needed. Departments should account for RDM in their business continuity plans.
·         Data sharing – Create incentives to encourage data sharing (and RDM generally). Develop case studies that demonstrate the benefits to researchers of data sharing. Expand the University’s RDM policy to cover data sharing and open access to data.
·         Data Repository – Develop a repository for research data that are unsuitable for external archiving, possibly in collaboration with White Rose or N8 partners.
·         Data Catalogue – Develop a system to catalogue, document and log all datasets held by the University, datasets held elsewhere and third party datasets acquired by the University.
·         Data ownership – Clear guidance is needed regarding IP and ownership of data.

The report concludes that the R & I committee must decide which of these actions to support and prioritise. It recommends that the University investigate cost estimation and recovery models that enable research projects’ RDM costs to be covered, and that it investigate ways of incentivising good RDM practices.

3.3.2. White Rose Services Repositories review

A review of the consortium’s repositories and related research information infrastructure was commissioned by the White Rose Library directors in order to establish a five year roadmap for the development of the repositories (Kay and Stevens, 2012). The review highlights the emerging need for research data storage and the broader future role for repositories in the developing research data ecosystem.
Key recommendations with a bearing on the technical infrastructure include:
·         Maintain the WR consortium and consider it the default channel for delivery of shared research services.
·         Continue EPrints development for the current service in parallel with future service design.
·         Prioritise WRRO connector developments and EPrints upgrades to enhance deposit workflows.
·         Work together to design and validate a research information infrastructure framework and to determine the role of WR services in this framework.
·         Test the feasibility of a micro-services architecture for the future research information infrastructure.
·         Assess the potential of the EPrints platform to meet priority requirements of the framework.
·         Take an incremental approach to development of infrastructure rather than a ‘big bang’ implementation.

The report discusses the challenges involved in using a single platform, EPrints, to provide a wide range of functionality. Integration with other institutional systems, such as the CRIS (Symplectic at Sheffield and Leeds, PURE at York), is considered difficult, and the slow development of the connector was considered to have had a negative impact on the WRRO service. Future core EPrints developments may not be in line with WR requirements, but alignment may be achieved by investing in local development work or by commissioning developments through ‘EPrints Services’.

The report strongly recommends testing the feasibility and business case for a micro-services architecture [as described in section 2.1.2.], particularly the Hydra system [4.2.5]. It is suggested that using micro-services to provide the required functions of the infrastructure will avoid the duplication of functions that arises across several components of the alternative EPrints-based architecture, and will mitigate the problems of integration with external components. It is also noted that, although Hydra is integrated with Fedora in current implementations, it could possibly be integrated with EPrints instead, with EPrints providing the same functions as Fedora. However, implementation of the micro-services approach will require substantial investment, and as yet there is no wide community of such repositories.

The report concludes that adoption of the micro-services approach offers the best long-term strategy and proposes a micro-services system trial during 2013-2014. In the short to medium term, the report proposes continued investment in the current repository infrastructure, focussed on mechanisms of deposit via Symplectic and PURE. This will mean a commitment to maintaining the relevant connectors after EPrints upgrades. Service enhancements will need to be tested on the current EPrints platform.



3.3.3. White Rose Consortium shared research data management services

This feasibility study by the DCC, commissioned by the WR library directors, was carried out during 2013. RDM service components were grouped into human and technical infrastructure components, and their feasibility was assessed using six criteria: institutional benefits, service costs, staff resources, suitability of component, drivers for development and level of risk.

The resulting report (Rans et al. 2013) makes the following recommendations regarding technical infrastructure:
·         Investigate joint purchase of active data storage hardware. Each organisation is investigating storage options including that of the WRG HPC facilities and virtualised storage from external providers. Investigation of current practice indicates a preference for direct control over data held locally – with many researchers preferring hard drives on desktop PCs to institutional Filestores.
·         Scope projected data storage requirements for the next 3-5 years by engaging with researchers. The survey found a wide variation in storage requirements and that the determination of future storage requirements will prove difficult.
·         Investigate federated back-up storage using partner sites. The current practice amongst researchers was found to vary widely, with inconsistent procedures reported at Sheffield, where no formal arrangement is in place to ensure good practice.
·         Investigate setting up collaborative spaces. The survey indicated a wide range of collaborations in operation with some holding of third party data. York hosts ‘YouShare’, a portal for sharing data and software, and Google Drive is widely used at York (as well as Sheffield and Leeds). The report suggests a good case for developing a shared service if all partners require a Dropbox-like facility. Such a facility, it was suggested, would potentially strengthen collaborations within the consortium.
·         Share technical experience in the development of repository architecture. The EPSRC deadline of 2015 is likely to drive institutions toward individual solutions. To deliver a shared data repository infrastructure, collaboration needs to be established before individual efforts have developed beyond the point where they can be easily abandoned or integrated. There has been some discussion about extending the WRRO to accommodate datasets, but Leeds and York are investigating alternative options.
·         Develop a formal WRC functional requirements specification. The report recognises that long-term curation will probably be achieved through a blend of local and external services and that the institutional repository will be positioned as the ‘repository of last resort’ for research data.
·         Collect disciplinary requirements for tools supporting data ingest, metadata creation and preservation. Research data shows a wider variation in ingest and deposit requirements than found with publications. The three institutions have experience in managing publications ingest to the WRRO and there is experience of the deposit of other materials at YODL in York and with the Timescapes project[xxiii] in Leeds (using DigiTool). A collaborative approach will avoid duplication of effort. The report recommends the development of ingest and preservation tools if necessary.
·         Liaise with key data centres to develop plugin deposit tools, facilitating easy upload to external repositories. The report notes that there is little information about the use of external data repositories, or how much research data is managed in this way.
·         Share deliberations and decisions made about the use of DataCite metadata schema v.3.0 and DataCite DOIs. Use of the same identifier service will facilitate the creation of a centralised data catalogue or portal.
·         Harmonise explorations of options for data catalogue development. There is an opportunity for the development of a shared service, but the delivery deadline of 2015 limits the time available for successful collaboration. Local options are being explored: the York Research Database (using the PURE CRIS front-end) and the use of EPrints as a data catalogue at Leeds.
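
As an illustration of the DataCite recommendation above, the following minimal Python sketch checks a dataset description against the five properties that DataCite Metadata Schema v3.0 treats as mandatory (identifier, creator, title, publisher and publication year). The record layout, the function name and the example values are assumptions made for illustration only; they do not come from the report or from any White Rose system.

```python
# Property names follow the element names used in the DataCite schema.
# Storing a record as a plain dict is an assumption made for this sketch.
MANDATORY = ("identifier", "creators", "titles", "publisher", "publicationYear")


def missing_mandatory(record):
    """Return the mandatory property names that are absent or empty."""
    return [name for name in MANDATORY if not record.get(name)]


# Hypothetical dataset record; the DOI and names are placeholders.
example = {
    "identifier": "10.1234/example-dataset-1",
    "creators": ["Researcher, Example"],
    "titles": ["Example survey dataset"],
    "publisher": "Example University",
    "publicationYear": "2014",
}

print(missing_mandatory(example))
print(missing_mandatory({"identifier": "10.1234/example-dataset-2"}))
```

A shared check of this kind is one way that partner institutions using the same schema could keep locally created records compatible with a centralised catalogue or portal.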



[i] WRRO – White Rose Research Online http://eprints.whiterose.ac.uk/
[iii] URMS - University Research Management System http://www.sheffield.ac.uk/ris/application/pricing/urms
[iv] University of Sheffield Library Digital Collections http://cdm15847.contentdm.oclc.org/cdm/
[vi] White Rose Grid (WRG) http://www.wrgrid.org.uk/
[x] Worldwide Universities Network http://www.wun.ac.uk/
[xi] White Rose Universities Consortium (WRUC) http://www.whiterose.ac.uk/
[xii] White Rose ETheses Online WREO http://etheses.whiterose.ac.uk/
[xiii] Knowledge Research Innovation System at Leeds (KRISTAL) http://www.leeds.ac.uk/forstaff/news/article/3826/get_started_with_kristal
[xiv] Leeds University Digital Objects (LUDOS) http://ludos.leeds.ac.uk/ludos/
[xvi] York Digital Library (YODL) https://dlib.york.ac.uk/
[xix] N8 Research Partnership http://www.n8research.org.uk/
[xxi] Research Data Management Tools, University of Newcastle https://research.ncl.ac.uk/rdm/tools/
[xxii] CKAN, Research data management, University of Newcastle https://research.ncl.ac.uk/rdm/tools/ckan/
[xxiii] Timescapes: An ESRC Qualitative Longitudinal Initiative http://www.timescapes.leeds.ac.uk/
