3. The University of Sheffield RDM Technical Infrastructure Considerations
3.1. Local infrastructure components
Some
components of an integrated infrastructure already exist at the University of
Sheffield:
- The Institutional repository WRRO[i] is a shared service hosted at Leeds University and based on the ePrints platform [4.2.1]. There is administrative expertise at the University of Sheffield Library but software development expertise is located in Leeds.
- Research and Innovation Services administers an institutional (inward-facing) catalogue of research publications, MyPublications[ii], which was used to facilitate the REF2014 process. This is built on the Symplectic Elements 4.4 [4.6.1] Research Information Management System (RIMS) and hosted by Corporate Information and Computing Services (CICS). The Symplectic RIMS is a complete and integrated system, but its full functionality is perhaps not yet exploited at Sheffield.
- MyPublications is linked to the SAP platform [4.10.5] Corporate Information System (CIS) to import researcher identity and Academic unit information.
- MyPublications is linked to WRRO by a connector so that metadata may be pushed into, and files uploaded to, WRRO and outputs published easily.
- The University Research Management System (URMS)[iii] is a web based tool used for costing and pricing research, obtaining approvals, managing awards and post award administration. It is a module of the CIS and is provided and developed by CICS.
- A Digital Asset Management System (DAMS) for the library special collections and the National Fairground Archive[iv], based on the ContentDM [4.2.10] repository platform. Metadata is harvested by the library catalogue system Primo.
- Library management system, Alma, and Resource discovery tool or OPAC, Primo [4.5.6].
- CICS Network provides network filestore for individuals and departments. Files are available on and off campus, through Shibboleth [4.9.5] authentication. Off-campus access is via the internet or via VPN. Filestores are regularly backed up.
- CICS provides cloud storage and collaboration resources using Google Services [4.4.7].
- CICS provides a Linux-based high performance computing (HPC) cluster, Iceberg[v], which is the Sheffield node of the White Rose Grid (WRG)[vi]. Iceberg provides backed-up storage, space for using and storing very large amounts of data and facilities for project group level collaborative work. Several research groups at Sheffield are, or have been, involved in WRG-facilitated collaborations, such as the CARMEN Portal [6.2.5], which uses the iRODS grid management system, and Pegasus[vii], which used the SRB management system.
- Sheffield is a member of the N8 HPC Grid, currently operating Polaris[viii], an SGI HPC cluster.
- Sheffield is a member of other Grid computing projects, GridPP[ix] and WUN[x].
3.2. Consortia options
The University of Sheffield is
currently involved in two consortia established to share services and foster
collaboration. Groups from both consortia are investigating the feasibility of
sharing RDM resources – infrastructure components, training materials and
expertise.
The WRUC collaboration between the
universities of Leeds, Sheffield and York, established in 1997, supports
research, knowledge exchange, and teaching and learning through a number of
projects. The White Rose Grid (WRG), established
in 2002, offers grid infrastructure and HPC for the WRUC. The consortium
manages the shared institutional repositories for research publications (WRRO, established in 2004) and for
theses (WREO[xii]).
Work done recently by the White Rose Libraries Systems Architecture Group
(2013) has identified the various components of each institution’s
infrastructure in order to establish areas of interoperability and shared
experience.
RDM infrastructure components in operation at Leeds:
- The shared institutional repositories WRRO and WREO, based on the EPrints platform and housed on a Leeds server.
- A Symplectic CRIS component, for managing publications.
- EPrints-Symplectic connector.
- SAP-based Corporate Information System.
- Shibboleth identity authentication.
- ‘LUDOS’[xiv], a Digitool [4.2.11] digital asset management system. This is to be replaced by an EPrints platform data repository.
- VRE / Research portal on the SunGard Luminis platform [4.4.8] – to be replaced with the Microsoft Dynamics [4.4.9] product suite.
- III Sierra [4.5.7] based library management system.
- HPC as part of WRG and N8 Polaris HPC.
At Leeds, the RoaDMaP project [7.1.15] is continuing, thanks to interim funding from the University, to investigate service scoping and development. Proposed developments in the RDM infrastructure at Leeds include the choice of EPrints for the data registry or ‘discovery metadata store which points to the data’ (Proudfoot, 2013a), the testing of Arkivum [4.3.1] hardware for archive data storage capabilities and its interaction with EPrints, and the integration of DMPonline [4.6.4] with Kristal[xiii] (Proudfoot, 2013b).
RDM infrastructure components in operation at York:
- The WRRO EPrints repository.
- A PURE [4.6.2] CRIS.
- EPrints-PURE connector.
- ‘York Research Database’[xv] – a registry of research publications and project information; a component of the PURE CRIS (equivalent to MyPublications at Sheffield).
- Shibboleth identity authentication.
- ‘YODL’[xvi], the DAMS based on the Fedora Commons [4.2.3] platform. Currently this holds small volumes of Humanities research data.
- ‘Alfresco’ [4.4.11], an open source content management system, developed by the Department of Biology as a research collaboration system.
- Google Drive – University-supported cloud storage and collaboration space.
- HPC as part of N8 Polaris HPC and WRG.
RDM infrastructure developments under consideration at York include integrating DMPonline with the PURE CRIS and using Rosetta [4.3.4] for digital preservation. DataStage [4.4.1], Dropbox and Amazon Glacier [4.3.5] with AWS [4.4.12] are under consideration for active research data management; CKAN [4.2.2], DataBank [4.2.4] and Figshare [4.3.2] are being considered for data storage and discovery; and Arkivum [4.3.1] is being investigated for archive data storage (Allinson, 2013).
The White Rose Research Data Working Group (2013) undertook
a high level assessment of the options for a shared research data repository
service, resulting in three options:
- ‘The Bakery’ – a regional service hosted and managed by one institution which the others pay to use.
- ‘The Cake Mix’ – a recommended best practice service to be deployed and managed in the individual institutions.
- ‘The Recipe’ – individually developed solutions to agreed standards, which may be modified to suit local requirements.
It is considered possible for any of these options to share elements of the infrastructure, such as a published API, data catalogue, storage management system or raw storage, and for these shared services to be organised on a regional or national basis.
The University of Sheffield is a
member of the N8 Research Partnership, a collaboration of eight research
intensive universities in northern England (with Durham, Lancaster, Leeds,
Liverpool, Manchester, Newcastle and York). The N8 provides the shared HPC
facility and an equipment sharing initiative n8equipment.org.uk[xx].
The N8 RDM Architecture Working Group has drafted a reference systems
architecture model for RDM across the N8 institutions. This model is useful for
visualising the components of the RDM infrastructure of each member
institution; to date, three of the eight institutions, including Sheffield and
York, have yet to submit maps. The N8 RDM Archiving and Curation Working Group
is investigating the feasibility of a shared storage service and developing
data appraisal and curation processes and policies.
The N8 institutions’ experience of RDM infrastructure implementation is limited to Leeds, Manchester and Newcastle and results from the JISC RDM projects RoaDMaP [7.1.15] at Leeds, MiSS [7.1.10] at Manchester, and Iridium [7.1.7] at Newcastle. The RDM infrastructure is most highly developed at Newcastle [6.1.11], where a suite of tools[xxi] is offered, including a CKAN data portal[xxii], currently under development. Newcastle have developed their own research information systems as part of this suite, which comprises ‘MyImpact’ (a researcher profile and publication information system), ‘MyProject’ (a project and awards management system), VRE and eScience Central (collaboration and workflow tools) and a Research Data Catalogue (linking data, projects and publications), although the catalogue is currently a proof-of-concept system.
Common infrastructure components at the five N8 member institutions outside WRUC:
- EPrints-based Institutional Repositories at Durham, Lancaster, Liverpool and Newcastle.
- Fedora Commons used for Manchester IR ‘eScholar’ and Durham University Library special collections.
- Agresso / pFACT [4.10.6] financial management tool used at Durham, Lancaster, Liverpool and Manchester.
- Oracle Financials [4.10.7] used at Durham and Manchester.
- Oracle-based CIS and CRIS ‘ISIS’ developed at Liverpool.
- CKAN used for Data Repository at Newcastle.
- PURE used at Lancaster.
- DMPonline used at Lancaster, whilst Manchester developed their own DMP tool.
3.3. Recent reviews of RDM service development at Sheffield
3.3.1. Research & Innovation Services RDM project - Case studies
During 2011-12, the University’s Research and Innovation Committee commissioned a Research Data Management Scoping Project to:
- Establish funders’ current and potential RDM requirements.
- Identify gaps between funders’ requirements and university practices.
- Identify and characterise support needs from pathfinder projects.
- Explore the university’s capabilities for meeting RDM requirements and propose sustainable, viable extensions to support services.
The pathfinder interviews provided insights into the researchers’ perspectives, many of which had a bearing on RDM. Many find the idea of making research data publicly available contentious, and there is much confusion between open access to research articles and access to data. Many researchers need guidance in choosing between a range of storage options, and do not distinguish between active data storage, back-up, mirroring and archiving. Researchers will support initiatives that are researcher-led and involve little bureaucracy. They will tend to adopt RDM practices that fit in with their workflows and the type of data they work with, and that are aligned with their culture. There is a need for clarity on the issue of data ownership.
A list of
actions that will potentially lead to the establishment of a University RDM
support infrastructure was drawn up through a SWOT analysis of the current
position. The resulting project report (Kane et al. 2012) recommends the
following actions with implications for technical infrastructure:
- Data organisation – Update the University’s RDM Policy with metadata standards developed across the HE sector. Develop guidance on metadata for researchers. Establish a network of RDM expertise across the university.
- Data management and planning – Clarification of roles and responsibilities with regard to RDM support. Development of Data management plan (DMP) templates.
- Data storage and back-up – Clearer guidance on available storage options, the costs involved and advantages / disadvantages of these options is required. Guidance on data retention is needed. Departments should account for RDM in their business continuity plans.
- Data sharing – Create incentives to encourage data sharing (and RDM generally). Develop case studies that demonstrate the benefits to researchers of data sharing. Expand the University’s RDM policy to cover data sharing and open access to data.
- Data Repository – Develop a repository for research data that are unsuitable for external archiving, possibly in collaboration with White Rose or N8 partners.
- Data Catalogue – Develop a system to catalogue, document and log all datasets held by the University, datasets held elsewhere and third party datasets acquired by the University.
- Data ownership – Clear guidance is needed regarding IP and ownership of data.
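The Data Catalogue recommendation distinguishes three kinds of dataset: those held by the University, those held elsewhere and third party datasets acquired by the University. A minimal sketch of how such a catalogue entry might be modelled is given below. The field names, the `Holding` categories and the example entries are illustrative assumptions; they are not taken from the Kane et al. report or from any existing Sheffield system.

```python
from dataclasses import dataclass
from enum import Enum


class Holding(Enum):
    """Where a dataset lives, following the three cases in the recommendation."""
    UNIVERSITY = "held by the University"
    EXTERNAL = "held elsewhere"
    THIRD_PARTY = "third party data acquired by the University"


@dataclass
class CatalogueEntry:
    # All field names are hypothetical; a real catalogue would follow
    # whatever metadata standard the institution agrees on.
    title: str
    owner: str
    holding: Holding
    location: str  # e.g. a filestore, a repository or a data centre name


def group_by_holding(entries):
    """Return a dict mapping each Holding category to the titles logged under it."""
    grouped = {category: [] for category in Holding}
    for entry in entries:
        grouped[entry.holding].append(entry.title)
    return grouped


# Made-up example entries, one per category.
entries = [
    CatalogueEntry("Example survey data", "Dept A", Holding.UNIVERSITY, "departmental filestore"),
    CatalogueEntry("Example deposited dataset", "Dept B", Holding.EXTERNAL, "external data centre"),
    CatalogueEntry("Example licensed dataset", "Dept A", Holding.THIRD_PARTY, "supplier portal"),
]
grouped = group_by_holding(entries)
```

Recording the holding category explicitly is one way of letting a single catalogue log data that the University does not itself store, which is what the recommendation asks for.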
The report concludes that the R & I committee must
decide which of these actions to support and prioritise and recommends that the
University should investigate cost estimation and recovery models that enable
research projects’ RDM costs to be covered and investigate incentivising good
RDM practices.
3.3.2. White Rose Services Repositories review
A review of the consortium’s repositories and related
research information infrastructure was commissioned by the White Rose Library
directors in order to establish a five year roadmap for the development of the
repositories (Kay and Stevens, 2012). The review highlights the emerging need
for research data storage and the broader future role for repositories in the
developing research data ecosystem.
Key recommendations with a bearing on the technical infrastructure include:
- The WR consortium should be maintained and considered the default channel for delivery of shared research services.
- Continuing ePrints development for the current service in parallel with future service design.
- Prioritising WRRO connector developments and ePrints upgrades to enhance deposit workflows.
- Working together to design and validate a research information infrastructure framework and to determine the role of WR services in this framework.
- Testing the feasibility of micro-services architecture for the future research information infrastructure.
- Assessing the potential of the ePrints platform to meet priority requirements of the framework.
- Taking an incremental approach to development of infrastructure rather than a ‘big bang’ implementation.
The report discusses the challenges involved in using a single platform, ePrints, to provide a wide range of functionality. Integration with other institutional systems, such as the CRIS (Symplectic at Sheffield and Leeds, PURE at York), is considered difficult, and the slow development of the connector was considered to have had a negative impact on the WRRO service. Future core ePrints developments may not be in line with WR requirements, but alignment may be achieved by investing in local development work or by commissioning developments through ‘ePrints Services’.
The report strongly recommends testing the feasibility and
business case for a micro-services architecture [as described in section
2.1.2.], particularly the Hydra system [4.2.5]. It is suggested that using
micro-services to provide the required functions of the infrastructure will
avoid duplication of functions provided by several components of the
alternative ePrints-based architecture, and will mitigate the problems of
integration with external components. It is also noted that although Hydra is
integrated with Fedora in current implementations, there is a possibility of
integration with ePrints providing the same functions as Fedora. However,
implementation of the micro-services approach will require substantial
investment and as yet there is no wide community of such repositories.
The report recommends that adoption of the micro-services
approach offers the best long-term strategy and proposes a micro-services
system trial during 2013-2014. In the short to medium term, the report proposes
continued investment in the current repository infrastructure, focussed on
mechanisms of deposit via Symplectic and PURE. This will mean a commitment to
maintaining the relevant connectors after ePrints upgrades. Service
enhancements will need to be tested on the current ePrints platform.
3.3.3. White Rose Consortium shared research data management services
This feasibility study by the DCC, commissioned by the WR
library directors, was carried out during 2013. RDM service components were
grouped into human and technical infrastructure components and their
feasibility was assessed using six criteria: institutional benefits, service
costs, staff resources, suitability of component, drivers for development and
level of risk.
The resulting report (Rans et al. 2013) makes the following recommendations regarding technical infrastructure:
- Investigate joint purchase of active data storage hardware. Each organisation is investigating storage options, including that of the WRG HPC facilities and virtualised storage from external providers. Investigation of current practice indicates a preference for direct control over data held locally, with many researchers preferring hard drives on desktop PCs to institutional filestores.
- Scope projected data storage requirements for the next 3-5 years by engaging with researchers. The survey found a wide variation in storage requirements and that the determination of future storage requirements will prove difficult.
- Investigate federated back-up storage using partner sites. The current practice amongst researchers was found to vary widely, with inconsistent procedures reported at Sheffield, where no formal arrangement is in place to ensure good practice.
- Investigate setting up collaborative spaces. The survey indicated a wide range of collaborations in operation, with some holding of third party data. York hosts ‘YouShare’, a portal for sharing data and software, and Google Drive is widely used at York (as well as Sheffield and Leeds). The report suggests a good case for developing a shared service if all partners require a Dropbox-like facility. Such a facility, it was suggested, will potentially strengthen collaborations within the consortium.
- Share technical experience in the development of repository architecture. The EPSRC deadline of 2015 is likely to drive institutions toward individual solutions. To deliver a shared data repository infrastructure, collaboration needs to be established before individual efforts have developed beyond the point where they can be easily abandoned or integrated. There has been some discussion about extending the WRRO to accommodate datasets, but Leeds and York are investigating alternative options.
- Development of a formal WRC functional requirements specification. The report recognises that long-term curation will probably be achieved through a blend of local and external services and that the institutional repository will be positioned as the ‘repository of last resort’ for research data.
- Collect disciplinary requirements for tools supporting data ingest, metadata creation and preservation. Research data shows a wider variation in ingest and deposit requirements than is found with publications. The three institutions have experience in managing publications ingest to the WRRO and there is experience of the deposit of other materials at YODL in York and with the Timescapes project[xxiii] in Leeds (using DigiTool). A collaborative approach will avoid duplication of effort. The report recommends the development of ingest and preservation tools if necessary.
- Liaise with key data centres to develop plugin deposit tools, facilitating easy upload to external repositories. The report notes that there is little information about the use of external data repositories, or how much research data is managed in this way.
- Share deliberations and decisions made about the use of DataCite metadata schema v.3.0 and DataCite DOIs. Use of the same identifier service will facilitate the creation of a centralised data catalogue or portal.
- Harmonise explorations of options for data catalogue development. There is an opportunity for the development of a shared service, but the delivery deadline of 2015 limits the time available for successful collaboration. Local options are being explored: the York Research Database (using the PURE CRIS front-end) and the use of ePrints as a data catalogue at Leeds.
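The point that a common identifier service eases the creation of a centralised catalogue can be illustrated with a short sketch. It assumes each institution exports its catalogue as a list of dictionaries carrying the five properties that DataCite Metadata Schema 3.0 treats as mandatory (identifier, creator, title, publisher, publication year). The record layout, the function names and the DOIs are made up for illustration and are not drawn from the report or from any DataCite client library.

```python
# Mandatory properties, using hypothetical dictionary keys.
MANDATORY = ("identifier", "creator", "title", "publisher", "publication_year")


def missing_properties(record):
    """List the mandatory properties that are absent or empty in a record."""
    return [name for name in MANDATORY if not record.get(name)]


def merge_catalogues(*catalogues):
    """Combine institutional record lists into one catalogue keyed by DOI.

    Records lacking a mandatory property are skipped, and a DOI that
    appears in more than one catalogue is kept only once.
    """
    merged = {}
    for catalogue in catalogues:
        for record in catalogue:
            if missing_properties(record):
                continue
            doi = record["identifier"].lower()  # DOIs are case-insensitive
            merged.setdefault(doi, record)
    return merged


# Placeholder records; the DOIs and names are invented.
leeds = [{"identifier": "10.5072/EXAMPLE-1", "creator": "A. Researcher",
          "title": "Example dataset", "publisher": "University of Leeds",
          "publication_year": 2013}]
york = [{"identifier": "10.5072/example-1", "creator": "A. Researcher",
         "title": "Example dataset", "publisher": "University of Leeds",
         "publication_year": 2013},
        {"identifier": "10.5072/example-2", "creator": "B. Researcher",
         "title": "Untitled", "publisher": ""}]  # incomplete record

shared = merge_catalogues(leeds, york)
```

Because both exports identify the first dataset by the same DOI, the merged catalogue holds it once; without a shared identifier, matching the two records would depend on comparing titles or creators.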
[iii] URMS - University Research Management System http://www.sheffield.ac.uk/ris/application/pricing/urms
[xiii] Knowledge Research Innovation System at Leeds (KRISTAL) http://www.leeds.ac.uk/forstaff/news/article/3826/get_started_with_kristal
[xxii] CKAN, Research data management, University of Newcastle https://research.ncl.ac.uk/rdm/tools/ckan/