Friday 16 November 2012

Metadata for Research Data


White Rose meeting 06/11/12 at the University of Leeds

Aims
  • To identify metadata fields common to all datasets
  • Review suitability of DataCite standard & identify additional fields needed
  • Agree required fields for a data catalogue entry (regardless of format and location), including mandatory and optional fields
  • Review Graham Blyth’s 9 layers of metadata (for RoaDMap Work package 5)

Context

One outcome of the WR 'Perspectives on Research Data Management' meeting of 24th May 2012 [1] was an interest amongst some participants in discussing the requirements for a research data catalogue at the WR level. Earlier blog posts discuss the establishment of a 'WRRDC' [2] and projects involving Research Data Catalogues [3] & [4]. The RoaDMap [5] project at Leeds is developing a data catalogue component for its RDM infrastructure as part of Work Package 5, 'repositories and metadata'. Although there are no overwhelming arguments for a data catalogue at the WR level (in addition to the institutional and national levels), it was thought that a meeting including WR people would be useful. Metadata, rather than software systems, was to be discussed.

The EPSRC requires full compliance with its expectations [6] by May 2015, including that 'appropriately structured metadata describing the research data they hold is published (normally within 12 months of the data being generated) and made freely accessible on the internet; in each case the metadata must be sufficient to allow others to understand what research data exists, why, when and how it was generated, and how to access it. Where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier'. In addition, EPSRC required fields include funding agency, grant number, last access date, and privileged access period.

Metadata schemata / models in development

The IDMB 3 tier model [7], developed to enable discoverability, was examined and thought adequate for data discovery:
  • Core metadata for findability
  • Discipline metadata for classification
  • Project metadata for object detail  
This model has been developed by subsequent projects - DAMARO [8], RD@Essex [9], and Iridium [10]:
  • Core metadata based on Dublin Core / DataCite [11]
  • Context/Admin metadata mapped to CERIF [12]
  • Discipline metadata based on subject schema
Compare with the proposed RoaDMap model (WP5 lead Graham Blyth), which also includes metadata for preservation, validation and re-purposing:
  • Core metadata - DC and DataCite
  • Discipline metadata - discipline ontologies are emerging
  • Project / Institution metadata - reflecting context of research?
  • Instrument metadata - data about instrument settings and specifications
  • Management metadata - Access control and lifecycle information
  • Preservation metadata - data formats and file structures
  • Community metadata - 'live' metadata that can be added to after dataset is put in the repository. Repository users can make observations about and reinterpret data. Cross-discipline keywords.  
Unfortunately we did not have time to discuss the details of the RoaDMap metadata scheme.

Minimum mandatory metadata

DataCite proposes five mandatory properties [11] {containing child properties and attributes} as a minimum to be supplied on metadata submission:
  • Identifier {Identifier type} a registered DOI.
  • Creator {creatorName; nameIdentifier; nameIdentifierScheme} the main researchers involved, or authors of the publication in priority order.
  • Title {titleType} open format. Type includes alternative title, subtitle, translated title.
  • Publisher - in the case of datasets, the entity that makes the data available.
  • Publication year - the year that the data is made publicly available; if an embargo applies, the year the embargo ends.
The Data.Bris [13] project at Bristol also requires these five mandatory fields for submission of data.
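As a rough illustration of what a 'five mandatory fields' check could look like at submission time, here is a minimal sketch. It assumes a record is held as a plain Python dict; the key names are illustrative only and are not the DataCite XML element names, and the DOI is made up.

```python
# The five mandatory DataCite properties listed above.
# Key names are illustrative, not the DataCite XML element names.
MANDATORY = ("identifier", "creator", "title", "publisher", "publication_year")


def missing_mandatory(record):
    """Return the mandatory properties that are absent or empty in record."""
    return [name for name in MANDATORY if not record.get(name)]


record = {
    "identifier": "10.1234/example",  # hypothetical DOI
    "creator": ["Smith, J."],
    "title": "Example dataset",
    "publisher": "",                  # deliberately left empty
    "publication_year": 2012,
}

print(missing_mandatory(record))
```

A submission form could refuse to mint a DOI until this list comes back empty.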
The DAMARO project at Oxford [8] is extending this set to include further core metadata:
  • Location of dataset - DOI or URL for digital data
  • Medium - Digital or non-digital or both
  • Creator affiliation
  • Access terms and conditions
  • Access date - expiry of embargo
  • Data owner - Institution / Faculty / Department
  • Metadata rights - CC0 by default
  • Subject - possibly based on FAST (Faceted Application of Subject Terminology) [14]
A contextual metadata set is also mandatory:
  • Funding agency
  • Grant number
  • Project information
  • Last access request date
  • Source & Source URL - if imported record
  • Data generation process
  • Why data was generated
  • Date range of data collection
  • Reason for embargo

Proposed RDMI architectures

At Oxford, DAMARO is building on past projects to implement a three stage Dataflow architecture [15] by May 2013:
  • DataStage (Data & metadata creation and local management) fed into by DMP metadata. Data and metadata transfer via SWORD to
  • DataBank (Institutional data storage on Fedora) records harvested by OAI-PMH to
  • DataFinder (Catalogue of research data), which will exchange metadata with the CRIS (a Symplectic RIS is in development) and with ORA, Oxford's institutional repository.
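To make the harvesting step a little more concrete, the sketch below builds the kind of OAI-PMH request a catalogue might send to a data store. The base URL is hypothetical, and the code only constructs the request URL without contacting any server. `ListRecords`, `metadataPrefix` and `from` are standard OAI-PMH request arguments, and `oai_dc` is the Dublin Core format that the protocol requires every repository to support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
print(list_records_url("https://databank.example.org/oai", from_date="2012-11-01"))
```

A richer metadata format than `oai_dc` would be needed to carry the contextual fields, but that depends on what the repository exposes.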
At Leeds, a similar architecture is being considered as part of RoaDMap. EPrints is also being considered as an alternative to DataBank (Fedora based), hence the interest in RD@Essex [16], where EPrints is being developed as a data repository. It is possible to link DataStage to EPrints via SWORD. Both a DataBank and an EPrints data repository could be a subset of an established institutional repository. DepositMO [17] is being considered for harvesting files from personal storage to the repository.

But the discussion was about metadata standards not infrastructure – was there any consensus?

Discussion
  • A WR catalogue would only need discovery layer metadata if records were pulled from institutional data catalogues; these would require a much larger set of mandatory and optional metadata elements.
  • Why a WR catalogue? Is there any benefit in creating a data catalogue at the WR level? Each institution will probably need to develop its own data catalogue anyway. A WR scale research data catalogue may make it easier to link to the resulting WRRO publications. If based on an EPrints platform, it may be easy to implement as part of, or a subset of, WRRO.
  • Why not a national level catalogue? We should broaden our definition of a catalogue, e.g. Google is a large distributed catalogue. Is DataCite a catalogue? Well yes, DataCite have established a beta metadata search facility [18] to search the records associated with the DOIs they have ascribed [19].
  • Regarding DataCite, the majority of fields are optional; only a minimum of 5 mandatory fields is required to mint DOIs, presumably to encourage people by keeping the entry barrier low.
  • DataCite is a good starting point as a basis for our requirements. We need to know where the main sources of data fields are, and which fields are automatically populated.
  • User tags are needed for developing discipline ontologies / subject taxonomies, through crowdsourced tagging and keywords.
  • But who owns the data? It was decided the department / school / faculty probably owns the data - as they own the facilities that captured the data. Who is the contact? The head of department or equivalent. 
  • Who is the 'Creator'? Everyone involved in the data capture process could be named; or all the people named as authors of published research output; or only the Principal Investigator. Other people involved (co-creators, technicians, students) may be named, with their role specified, in the optional fields 'Contributor' and 'Role'.
  • Metadata mapping: attribute of an element – option of qualified or not.
  • Considering elements for discoverability; what terms will be searched for? – Require more than the 5 mandatory fields of DataCite; including a mandatory subject field.
  • Rights should be mandatory.
  • What metadata needs adding manually? Some preservation metadata will. We can’t have a common schema across institutions for object level metadata.
  • Problem of research not funded by research councils, solely university funded. Where does metadata come from if not the Grant management system? Core project level metadata for funded research is available. Would we need a data catalogue for unfunded research? It won’t be joined up to other institutional systems. Repository can provide unique identifiers. External range of IDs would be available, but would need to be mapped to internal identifier. 

Discussion of detailed mandatory metadata fields

Taking the DAMARO project metadata scheme [8] as a starting point, we looked at each of the mandatory fields to discuss:
a. Would this be a mandatory field for a White Rose Institution data catalogue?
b. Would this field be required for discoverability?
c. How should we specify the exact definition for this field?


| Element | Key | Notes |
|---|---|---|
| Record ID | M | Unique internal repository record ID |
| Location of dataset | | URL / DOI – DOI is best since URL is not persistent. Is DOI location of record or dataset? For non-digital dataset, contact details given. |
| Medium | | Digital or non-digital or both (container for data, rather than format of data) |
| Creator | M R D | Drawn from institutional CRIS. Drawn from Names Project [20] |
| - Creator ID | | Unique ID for person |
| - PID scheme | | Scheme for unique personal ID |
| Creator affiliation | R | Drawn from CRIS – What level, institution or department? |
| - Affiliation dates | | Dates of affiliation |
| Title of dataset | M D | |
| Publisher of data | D | Data centre, repository, institution where dataset is accessed from. |
| Publication year | M D | Year for citation purpose. Date when dataset is openly accessible – end of embargo (DataCite). Alternatively, date submitted to publisher, or date DOI minted? (See below 1.) |
| Access terms | | Administrative metadata |
| Data owner | | PI, HOD, Head of Faculty, school, institute named as representative of body. |
| Access date | | Embargo expiry date. (See below 2.) |
| Rights for metadata | | CC0, ODC. Administrative to ensure open metadata |
| Subject | R D | Mandatory subject description based on a controlled vocabulary. FAST [21]. Automatic base-level = discipline (creator affiliation?). Other subject vocabularies (LCSH). |
| Keywords | R O | User devised subject keywords. |

      Key: M = mandatory for DataCite, O = optional, R = repeatable element,
               D = terms for discovery in catalogue search.
      Text colours: DAMARO notes, meeting suggestions, my suggestions
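One way to picture the 'discovery layer' idea from the discussion is as a filter over a fuller institutional record, keeping only the elements flagged D in the table above. The sketch below assumes records are plain dicts; the key names are illustrative, and the flags are simply copied from the table.

```python
# Flags copied from the table above; key names are illustrative only.
FLAGS = {
    "record_id": "M",
    "creator": "MRD",
    "creator_affiliation": "R",
    "title": "MD",
    "publisher": "D",
    "publication_year": "MD",
    "subject": "RD",
    "keywords": "RO",
}


def discovery_view(record):
    """Keep only the fields flagged D (terms for discovery in catalogue search)."""
    return {key: value for key, value in record.items() if "D" in FLAGS.get(key, "")}


full_record = {
    "record_id": "wr-0001",           # hypothetical internal ID
    "title": "Example dataset",
    "subject": ["Hydrology"],
    "keywords": ["river", "flow"],
}

print(discovery_view(full_record))
```

A WR level catalogue could then hold only this reduced view, with the full record remaining in the institutional catalogue.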



  1. How do people refer to datasets catalogued using DataCite scheme metadata if the dataset is embargoed, but the person is not subject to the restriction – how does a person refer to their own embargoed dataset?
  2. Publication year could be considered first date of access and may be different to access date if the  dataset is re-embargoed after being previously accessible (new embargo).
  3. Problem of management of embargoed elements within a dataset? Best to remove these elements first and publish separately after embargo? Or include embargoed material after embargo period passed and mint new DOI.
  4. Problem of management of embargoed elements within a dataset? Different publication / access dates - publication date for non-embargoed data; access date for embargoed data.
  5. Problem of management of embargoed elements within a dataset? Multiple sublevel DOIs may be minted for different parts within a dataset. 
  6. Other Mandatory fields - Related Identifier DC.12 - mandatory for re-purposed data.
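Note 2 above can be read as treating publication year and access date as two separate values. The following is a small sketch of that reading, not an agreed rule: publication year is fixed by the first date the dataset was openly accessible, while access date follows whatever embargo is currently in force. The function and parameter names are hypothetical.

```python
from datetime import date


def citation_dates(first_open, current_embargo_end=None):
    """Return (publication_year, access_date) for a catalogue record.

    first_open: date the dataset first became openly accessible.
    current_embargo_end: expiry date of an embargo now in force, or None.
    """
    publication_year = first_open.year
    # With no embargo in force, the data has been accessible since first_open.
    access_date = current_embargo_end or first_open
    return publication_year, access_date


# Never re-embargoed: both values derive from the same date.
print(citation_dates(date(2013, 5, 1)))

# Re-embargoed until 2015: publication year stays 2013, access date moves.
print(citation_dates(date(2013, 5, 1), date(2015, 1, 1)))
```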

Things for the group to do
Contact DAMARO  - how's progress?
Keep up to speed with RD@Essex
Reflect on this meeting and continue with the process of identifying required fields.

References

[1] White Rose Perspectives on Research Data Management http://library.leeds.ac.uk/info/377/roadmap/123/roadmap_events/2
[2] Metadatatron Blog - A White Rose Research Data Catalogue http://metadatatron.blogspot.co.uk/2012/09/white-rose-research-data-catalogue.html
[3] Metadatatron Blog - Metadata for a WR Data Catalogue (part 1) http://metadatatron.blogspot.co.uk/2012/10/metadata-for-wr-data-catalogue.html
[4] Metadatatron Blog - Metadata for a WR Data Catalogue (part 2) http://metadatatron.blogspot.co.uk/2012/10/metadata-for-wr-data-catalogue-part-2.html
[5] RoaDMap - Work packages http://blog.library.leeds.ac.uk/downloads/file/260/roadmap_work_packages 
[6] EPSRC Expectations http://www.epsrc.ac.uk/about/standards/researchdata/Pages/expectations.aspx
[7] IDMB Initial findings report (p84 & 89) http://eprints.soton.ac.uk/195155/1/idmbinitialfindingsreportv4.pdf  
[8] Just enough metadata: Metadata for research datasets in institutional data repositories. Rumsey, S (2012) DAMARO http://damaro.oucs.ox.ac.uk/docs/Just%20enough%20metadata%20v3-1.pdf 
[9] Research Data @ Essex Blog http://researchdataessex.posterous.com/metadata
[10] IRIDIUM Blogpost http://iridiummrd.wordpress.com/2011/12/09/195/
[11] DataCite - Mandatory core metadata http://schema.datacite.org/meta/kernel-2.2/doc/DataCite-MetadataKernel_v2.2.pdf#page=8
[12] CERIF 1.5 Common European Research Information Format http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xhtml 
[13] Data.Bris - Minimum set of mandatory metadata http://data.blogs.ilrt.org/2012/05/18/minimal-set-of-mandatory-metadata/
[14] FAST (Faceted Application of Subject Terminology) http://www.oclc.org/research/activities/fast.html
[15] Infrastructure for Research Data Management at the University of Oxford. Wilson, J (2012) DAMARO  http://www.ands.org.au/events/webinars/james-wilson-jisc-webinar-slides.pdf
[16] Opening up research data at Essex: Experiments with EPrints. Ensom, T & Wolton, A (2012)  Research Data @Essex http://www.data-archive.ac.uk/media/368772/rde_or2012_notes.pdf
[17] DepositMO and DepositMOre: Modus Operandi for Repository Deposits http://blog.soton.ac.uk/depositmo/tag/depositmo/
[18] DataCite - Beta search facility at http://search.datacite.org/ui
[19] DataCite - Blog http://datacite.wordpress.com/2012/01/26/datacite-search/
[20] Names Project http://names.mimas.ac.uk/
[21] FAST http://www.oclc.org/research/activities/fast.html