Data Catalogue aspects of RDM infrastructure projects
The JISC Digital infrastructure: Research management programme (2011-13) Managing Research Data strand is supporting 17 Research Data Management Infrastructure projects [1]. Four of these, RoaDMaP, SwordARM, Open Exeter and Managing Research Data (at UWE), are referred to in part one of this post [2].
Overall there seems to be a consensus in accepting the three-tier metadata model put forward by the IDMB [3] project at Southampton (Takeda et al. 2010 [4]).
This model has been developed through the IRIDIUM (O’loughlin 2011 [5]) and Data.Bris (Boyd 2012 [6]) projects, with the three tiers given slightly different attributes:
1. A minimum mandatory metadata set providing core information. It could be based around a standard metadata element set (the 15 Dublin Core elements, the DataCite kernel [45] or CKAN [47]) but includes other fields such as location, access terms and conditions, and any embargo information. This top level relates to the discoverability of the resource.
2. A second mandatory layer of contextual, administrative metadata covered by elements within the CERIF model [48]: base entities (project, person, organisation unit, collaborators); funding information (funder, grant number); and result entities (publication, patent, research product). Ideally, much of this will be automatically harvested, or fed from administrative systems.
3. A final level of optional metadata providing the rich, more granular, detailed information. This layer provides the discipline-related information required for reuse.
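The three tiers can be pictured as one record with three parts, where only the first two are checked for completeness. The following is a minimal Python sketch under stated assumptions: the class name `DatasetRecord` and the field names in `CORE_FIELDS` and `CONTEXT_FIELDS` are hypothetical, chosen for illustration, and are not taken from any of the projects discussed here.

```python
from dataclasses import dataclass, field

# Hypothetical tier-1 field names (illustration only, not a project schema).
CORE_FIELDS = ("identifier", "title", "creator", "publisher",
               "publication_year", "location", "access_terms")
# Hypothetical tier-2 field names, loosely following the contextual
# information (project, funder, grant number) described in the list above.
CONTEXT_FIELDS = ("project", "funder", "grant_number")


@dataclass
class DatasetRecord:
    core: dict = field(default_factory=dict)        # tier 1: discovery
    context: dict = field(default_factory=dict)     # tier 2: administrative context
    discipline: dict = field(default_factory=dict)  # tier 3: optional, free-form

    def missing_mandatory(self) -> list:
        """Return the names of mandatory tier-1 and tier-2 fields that are empty."""
        missing = [f"core.{n}" for n in CORE_FIELDS if not self.core.get(n)]
        missing += [f"context.{n}" for n in CONTEXT_FIELDS if not self.context.get(n)]
        return missing
```

The third tier is deliberately left unchecked in this sketch, since the model treats discipline-specific metadata as optional.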
•OAI-PMH harvesting of data stores
•SWORD2 compliant
•CERIF compatible
•Metadata schema based on DataCite
•Interfaces directly with DataBank & ORDS (Oxford's Online Research Database Service, based on the DataStage [16] system)
•Users can register non-electronic data.
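To illustrate the first bullet above: an OAI-PMH harvest is driven by plain HTTP requests that carry a `verb` parameter. The Python sketch below only builds a `ListRecords` request URL; it does not contact any server. The function name `list_records_url` and the base URL in the usage note are hypothetical, while `ListRecords`, `metadataPrefix`, `set`, `from` and the `oai_dc` format are defined by the OAI-PMH protocol [24].

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a data store endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # optional: restrict the harvest to one set
    if from_date:
        params["from"] = from_date    # optional: only records changed since this date
    return base_url + "?" + urlencode(params)
```

For example, `list_records_url("https://example.org/oai", from_date="2012-01-01")` would produce a URL asking a (hypothetical) endpoint for all Dublin Core records changed since the start of 2012.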
For Datafinder, a three tier metadata approach is envisaged, comprising:
Minimum core elements:
•Record/digital object ID
•Location of dataset
•Medium
•Creator (if not depositor)
•Creator affiliation (if not depositor)
•Title
•Publisher of data
•Publication year
•Access terms & conditions
•Data owner
•Access date to data
•Rights for metadata
•Subject

Contextual mandatory elements:
•Funding agency
•Grant number
•Project information
•Last access request date
•Source
•Source URL
•Data generation process
•Why the data was generated/Abstract/Brief description
•Date
•Reason for embargo

Optional metadata (selection):
•Co-creators/contributor
•Role
•Affiliation
•Sub-title
•Subject
•Keywords
•Date (other)
•Language
•ResourceType
•AlternateIdentifier: e.g. DOI
•RelatedIdentifier: e.g. DOI of publication
•Size
•Format
•Version
•Data generation process
•Abstract/Brief description
•Documentation 1: descriptive or contextual information about the dataset (e.g. machine settings and experimental conditions under which the data were gathered)
•Documentation 2
•Subject specific metadata
•Subject specific metadata
•Subject specific classification
•Subject specific classification scheme
•Data complying with known standards, e.g. DDI

(Rumsey 2012) [17]
The metadata will have three sources: manual entry, which is generally disliked and can be inaccurate but can produce rich metadata; import, from data capture instruments, from institutional systems (RIMS, DMP) or from a data repository; and autogeneration by the RDMI (Rumsey 2012) [17].
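One way to picture the three sources working together is a merge that records where each value came from. The Python sketch below is an illustration only: the function name `merge_metadata` is hypothetical, and the precedence it applies (manual entry overrides imported values, which override autogenerated ones) is an assumption made for this example, not something stated by Rumsey.

```python
def merge_metadata(autogenerated, imported, manual):
    """Combine metadata from three sources and note the source of each field.

    Assumed precedence (not from the source): manual > imported > autogenerated.
    Empty values never overwrite a value supplied by another source.
    """
    merged = {}
    provenance = {}
    # Later sources in this sequence overwrite earlier ones.
    for source_name, values in (("autogenerated", autogenerated),
                                ("imported", imported),
                                ("manual", manual)):
        for key, value in values.items():
            if value in (None, ""):
                continue  # skip blanks so they do not erase real values
            merged[key] = value
            provenance[key] = source_name
    return merged, provenance
```

Keeping the provenance alongside the merged record makes it possible to show researchers only the fields that still need manual attention.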
The research data catalogue is central to the infrastructure being developed by the IRIDIUM project [18] at Newcastle, recording what data they have and making it discoverable. This will be integrated with MyProject (the Research Management System), MyImpact (Institutional publications system – equivalent to our Symplectics system) and the EPrints based IR. The catalogue “will not be a repository but rather a straight forward web-based searchable catalogue of data and that we will only collect information on data that supports publication. We have opted for this measure as we know that data supporting publication should have already been prepared (i.e. confidentiality respected through the scrubbing of data, fields marked sensibly etc) plus we feel that data is normally available at this point for peer review and as a matter of good scientific practice, so (hopefully) we’re not asking too much more from our academics to fill in data information at the same point they fill in their new publication info in our output system.” (Wood, L. 2012) [19]. A list of twelve key fields and seven further fields has been drawn up, which will be publicly or privately viewable through the catalogue interface. Again the three tier model has been adhered to: “This is quite appealing as we already collect much of the information in the first two levels through our current systems (MyProjects, e-prints and MyImpact) so the main additional input we’d be requiring from the academic would be at the third level.” (O’loughlin 2011) [20].
"This requires us to understand some of the systems the RDC may need to exchange metadata with that have existing information already entered. These could be local research group metadata catalogues, local/national repositories and other online systems" (Wood 2012) [21].
Bristol University’s Data.Bris project [22] is developing an RDMI which will integrate a new CRIS (PURE), which also provides an institutional repository, with the existing University Research Data Storage Facility (RDSF). This extends the storage facility into a Research Data Repository and allows data to be published from the storage facility. The proposed architecture [23] involves the creation of a metadata store (a SPARQL 1.1 service) and will adhere to the OAI-PMH [24], OAI-ORE [25] and SWORD [26] protocols. Data.Bris has defined a minimal set of mandatory metadata to be used when depositing or publishing data (Identifier; Creators; Title; Publisher; Publication year) and is investigating which metadata elements may be created automatically and which need adding manually. Again the three tier metadata model is thought useful, especially since metadata can be pulled in from the CRIS (Boyd 2012) [27].
The Datapool project [28] follows on from the Institutional Data Management Blueprint (IDMB) project. The project will launch and populate an EPrints institutional data repository to collect and store all research data produced across disciplines within the institution, as part of the research data management infrastructure. The repository will have access to storage sufficient for local data assets and will also provide links to data held elsewhere, both externally in subject repositories and internally using other systems. The project is investigating mechanisms for transferring data and metadata into the data repository from other local data stores, and exporting data from the repository using the SWORD2 protocol. They will also use the three tier metadata model developed by IDMB [29].
"The lesson for data repositories is clear: to capture content from data creators you must provide useful services that will become an integral part of the workflow of creating the data. It will not work to isolate particular requirements, such as records creation, from other needs such as storage services. Data does not appear with the same mode and frequency as published papers, so workflow must accommodate many different patterns. Research data is often produced by machines, so deposit workflow must allow scope for non-manual intervention" (Hitchcock 2012) [30].
For the Research Data @Essex project [31], EPrints is being used as the repository. This project is also adhering to the IDMB-inspired three tier metadata model: EPrints metadata was considered to provide for levels 1 and 2, whilst level 3 ‘minutiae’ are derived by drawing from the DataCite, INSPIRE, DDI and DataShare schemas. A multi-schema crosswalk was produced [32] and the metadata schema worked out based on DataCite, INSPIRE and DDI 2.1 [33].
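A crosswalk of this kind is, at its simplest, a lookup table from element names in one schema to element names in another. The Python sketch below is not the Essex crosswalk [32]; it assumes a small illustrative mapping from the five DataCite mandatory properties [45] to simple Dublin Core element names, and the function name `crosswalk` is hypothetical.

```python
# Illustrative mapping only: DataCite mandatory properties -> Dublin Core elements.
DATACITE_TO_DC = {
    "identifier": "identifier",
    "creator": "creator",
    "title": "title",
    "publisher": "publisher",
    "publicationYear": "date",
}


def crosswalk(record, mapping):
    """Translate a flat record from one schema to another.

    Returns the converted record plus any fields that had no target element,
    so that nothing is silently dropped during conversion.
    """
    converted, unmapped = {}, {}
    for name, value in record.items():
        target = mapping.get(name)
        if target is None:
            unmapped[name] = value
        else:
            # Several source fields may share one target, so collect values in a list.
            converted.setdefault(target, []).append(value)
    return converted, unmapped
```

Returning the unmapped fields separately reflects the usual difficulty with crosswalks: richer, discipline-specific elements often have no equivalent in a more general schema.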
The University of the West of England's MRD project uses a schema based on DataCite in a two tier model: 1. basic metadata; 2. detailed domain level metadata. The Hydra project uses the Fedora object model and MODS schema; both are described in part 1 [2].
The C4D project [34] aims to integrate research data metadata with CERIF CRIS metadata, developing mappings between multiple metadata standards with the aim of maximum interoperability.
The ADMIRe project [35] seems to be developing a system based on DataCite minimum mandatory metadata, with additional subject specific metadata including DDI.
KAPTUR [36] involves work integrating DataStage with EPrints, providing a structured metadata collection interface, and Figshare with EPrints, with the intention of creating an API to link Figshare with an EPrints repository using the SWORD 2 protocol. The project is specifically involved with visual arts data management, so the relevant metadata schemas referred to [37] include the Categories for the Description of Works of Art (CDWA) [38], the VRA Core Categories [39] and the Data Dictionary – Technical Metadata for Digital Still Images (ANSI/NISO Z39.87-2006) [40].
MiSS [41] is working towards an RDMI at the University of Manchester. They are developing a system of metadata templates specific to different research domains, for use during data capture. The MiSS Baseline Requirements Report indicates that one advantage of implementing an RDMI is that automating data capture and metadata ingest from instruments reduces the need for manual metadata annotation by researchers; this benefit needs promoting to researchers. With the multitude of data sizes, different instruments and specific proprietary data and metadata formats, community input is needed to achieve integration of metadata schemas in the RDMI.
Open Exeter [42] is developing a prototype DSpace research data repository. They have surveyed post-graduates about their experiences testing the interface and metadata webform (Evans 2012) [43].
Orbital [44] is using the CKAN repository system for its data repository, integrating this with its EPrints repository, its 'Awards Management System' (RIMS) and 'ownCloud' networked storage (an ‘academic dropbox’). The project accepts the minimum metadata requirements of DataCite [45], with agreement on the mandatory and optional attributes.
PIMMS [46] (Portable Infrastructure for the Metafor Metadata System) will refactor the Metafor metadata management tool for use in university departments. The project deals with metadata schemas in the climatology domain.
Part 3 will describe the work of Australian projects in the research data catalogue / metadata stores area.
References
[1] JISC Digital infrastructure: Managing Research Data Programme 2011-13 - Research Data Management Infrastructure Projects http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata/infrastructure.aspx
[2] http://metadatatron.blogspot.co.uk/2012/10/metadata-for-wr-data-catalogue.html
[3] IDMB - Initial findings report http://eprints.soton.ac.uk/195155/1/idmbinitialfindingsreportv4.pdf
[4] Data Management for All - The Institutional Data Management Blueprint project (IDMB at the 6th IDCC) http://eprints.soton.ac.uk/169533/1/6th_international_digital_curation_conference__idmb_final_paper_revised.pdf
[5] IRIDIUM Blogpost http://iridiummrd.wordpress.com/2011/12/09/195/
[6] Data.Bris Blogpost http://data.blogs.ilrt.org/2012/02/16/cerif-tutorial-and-uk-data-surgery/
[7] EIDCSR http://eidcsr.oucs.ox.ac.uk/
[8] Sudamih http://sudamih.oucs.ox.ac.uk/
[9] Admiral http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL
[10] Vidaas http://vidaas.oucs.ox.ac.uk/
[11] Dataflow http://www.dataflow.ox.ac.uk/
[12] DAMARO http://damaro.oucs.ox.ac.uk/index.xml
[13] University of Oxford Bodleian Libraries - Research data services http://www.bodleian.ox.ac.uk/bdlss/research-data-services
[14] Databank Oxford University research data repository https://databank.ora.ox.ac.uk/
[15] Wilson (2012) Infrastructure for Research Data Management at the University of Oxford. ANDS Webinar http://www.ands.org.au/events/webinars/james-wilson-jisc-webinar-slides.pdf
[16] DataStage http://www.dataflow.ox.ac.uk/index.php/datastage/ds-about
[17] Sally Rumsey (2012) Building an institutional research data management infrastructure. OR2012 http://damaro.oucs.ox.ac.uk/docs/Just%20enough%20metadata%20v3-1.pdf
[18] IRIDIUM http://research.ncl.ac.uk/iridium/
[19] Wood, L., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2012/10/02/iridium-requirements-for-a-research-data-catalogue-and-proof-of-concept-development/
[20] O'Laughlan N., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2011/12/09/195/
[21] Wood, L., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2012/10/03/iridium-rdm-systemstools-connectivity-busy-researchers-dont-like-duplication-of-metadata-entry/
[22] Data.Bris project http://data.bris.ac.uk/
[23] Steer, D. - Data.Bris architecture http://data.blogs.ilrt.org/2012/02/03/data-bris-architecture/
[24] OAI-PMH http://www.openarchives.org/pmh/
[25] OAI-ORE http://www.openarchives.org/ore/
[26] SWORD http://swordapp.org/
[27] Boyd, D. - Data.Bris Blog http://data.blogs.ilrt.org/category/metadata/
[28] DataPool project http://datapool.soton.ac.uk/
[29] DataPool project proposal http://datapool.soton.ac.uk/files/2011/12/University-of-Southampton-Proposal-public.pdf
[31] Research Data @Essex Blog http://researchdataessex.posterous.com/metadata
[32] Research Data @Essex Metadata schema crosswalk http://researchdataessex.posterous.com/metadata#
[33] RDE Metadata Profile for EPrints https://docs.google.com/open?id=0B7VJTfTg7nrrcU1WMWVEMW9tY3M
[34] C4D http://cerif4datasets.files.wordpress.com/2012/04/cris2012_35_full_paper.pdf
[35] ADMIRe http://admire.jiscinvolve.org/wp/2012/08/16/notes-from-the-2nd-datacite-workshop/
[36] KAPTUR http://www.vads.ac.uk/kaptur/outputs/Kaptur_technical_analysis.pdf
[37] KAPTUR Blogpost https://kaptur.wordpress.com/2012/06/12/raising-your-redman/
[38] CDWA http://www.getty.edu/research/publications/electronic_publications/cdwa/index.html
[39] VRA Core http://www.vraweb.org/projects/vracore4/
[40] NISO Technical Metadata for Digital Still Images http://www.niso.org/kst/reports/standards?step=2&gid=None&project_key=b897b0cf3e2ee526252d9f830207b3cc9f3b6c2c
[41] MiSS - BaselineRequirementsReport http://www.miss.manchester.ac.uk/wp-content/uploads/2012/09/MiSS-BaselineRequirementsReport-RevisedVersion-Aug2012.pdf
[42] Open Exeter http://blogs.exeter.ac.uk/openexeterrdm/
[43] Evans, J., Open Exeter Blog http://blogs.exeter.ac.uk/openexeterrdm/blog/2012/05/31/pgr-feedback-on-data-upload/
[44] Orbital Blog http://orbital.blogs.lincoln.ac.uk/
[45] DataCite Mandatory Properties http://schema.datacite.org/meta/kernel-2.2/doc/DataCite-MetadataKernel_v2.2.pdf#page=8
[46] PIMMS http://proj.badc.rl.ac.uk/pimms
[47] CKAN http://ckan.org/
[48] CERIF 1.5 - Common European Research Information Format http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xhtml