Friday, 16 November 2012

Metadata for Research Data


White Rose meeting 06/11/12 at the University of Leeds

Aims
  • To identify metadata fields common to all datasets
  • Review suitability of DataCite standard & identify additional fields needed
  • Agree required fields for a data catalogue entry (regardless of format and location), including mandatory and optional fields
  • Review Graham Blyth’s 9 layers of metadata (for RoadMap Work package 5)

Context

One outcome of the WR 'Perspectives on Research Data Management' meeting of 24th May 2012 [1] was an interest amongst some participants in discussing the requirements for a research data catalogue at the WR level. Earlier blog posts discuss the establishment of a 'WRRDC' [2] and projects involving Research Data Catalogues [3] & [4]. The RoaDMap [5] project at Leeds is developing a data catalogue component for its RDM infrastructure - Work Package 5 'repositories and metadata'. Although there are no overwhelming arguments for a data catalogue at the WR level (in addition to the institutional and national levels), it was thought a meeting to include WR people would be useful. Metadata rather than software systems were to be discussed.

The EPSRC requires full compliance with its expectations [6] by May 2015, including that 'appropriately structured metadata describing the research data they hold is published (normally within 12 months of the data being generated) and made freely accessible on the internet; in each case the metadata must be sufficient to allow others to understand what research data exists, why, when and how it was generated, and how to access it. Where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier'. In addition to these, EPSRC required fields include funding agency, grant number, last access date, and privileged access period.
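
The 12 month expectation can be read as a simple date check. The following is only a sketch: it assumes a catalogue records a data generation date and a metadata publication date (both hypothetical fields), and it reads '12 months' as the same calendar day one year later, which is an interpretation rather than EPSRC wording.

```python
from datetime import date


def metadata_deadline(generated: date) -> date:
    """Return the date one year after the data was generated.

    Assumption: '12 months' means the same calendar day in the
    following year; 29 February falls back to 28 February.
    """
    try:
        return generated.replace(year=generated.year + 1)
    except ValueError:  # 29 February has no match in the following year
        return generated.replace(year=generated.year + 1, day=28)


def published_in_time(generated: date, metadata_published: date) -> bool:
    """True if the metadata record appeared within 12 months of generation."""
    return metadata_published <= metadata_deadline(generated)
```

Since the expectation says 'normally', a real catalogue would probably flag late records for review rather than reject them.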

Metadata schemata / models in development

The IDMB 3 tier model [7], developed to enable discoverability, was examined and thought adequate for data discovery:
  • Core metadata for findability
  • Discipline metadata for classification
  • Project metadata for object detail  
This model has been developed by subsequent projects - DAMARO [8], RD@Essex [9], and Iridium [10]:
  • Core metadata based on Dublin Core / DataCite [11]
  • Context/Admin metadata mapped to CERIF [12]
  • Discipline metadata based on subject schema
Compare this with the proposed Roadmap model (WP5 lead Graham Blyth), which also includes metadata for preservation, validation and re-purposing:
  • Core metadata - DC and DataCite
  • Discipline metadata - discipline ontologies are emerging
  • Project / Institution metadata - reflecting context of research?
  • Instrument metadata - data about instrument settings and specifications
  • Management metadata - Access control and lifecycle information
  • Preservation metadata - data formats and file structures
  • Community metadata - 'live' metadata that can be added to after dataset is put in the repository. Repository users can make observations about and reinterpret data. Cross-discipline keywords.  
Unfortunately we did not have time to discuss the details of the RoadMap metadata scheme.

Minimum mandatory metadata

DataCite proposes five mandatory properties [11] {containing child properties and attributes} as a minimum to be supplied on metadata submission:
  • Identifier {Identifier type} a registered DOI.
  • Creator {creatorName; nameIdentifier; nameIdentifierScheme} the main researchers involved, or authors of the publication in priority order.
  • Title {titleType} open format. Type includes alternative title, subtitle, translated title.
  • Publisher - in the case of datasets, the entity that makes the data available.
  • Publication year - the year that the data is made publicly available, when an embargo ends.
The Data.Bris [13] project at Bristol also requires these five mandatory fields for submission of data.
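
A check for the five mandatory properties listed above could look like the sketch below. The dictionary keys and the example record are assumptions made for illustration; this is not DataCite's own validation, which works on XML against the published schema.

```python
# The five mandatory DataCite properties listed above.
# The key spellings used here are an assumption for this sketch.
MANDATORY = ("identifier", "creator", "title", "publisher", "publicationYear")


def missing_mandatory(record: dict) -> list:
    """Return the mandatory properties that are absent or empty in `record`."""
    return [name for name in MANDATORY if not record.get(name)]


example = {
    "identifier": "10.1234/example",  # hypothetical DOI, for illustration only
    "creator": ["Researcher, A."],
    "title": "Example dataset",
    "publisher": "",                  # left empty on purpose
}

print(missing_mandatory(example))  # ['publisher', 'publicationYear']
```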
The DAMARO project at Oxford [8] is extending this set to include further core metadata:
  • Location of dataset - DOI or URL for digital data
  • Medium - Digital or non-digital or both
  • Creator affiliation
  • Access terms and conditions
  • Access date - expiry of embargo
  • Data owner - Institution / Faculty / Department
  • Metadata rights - CC0 by default
  • Subject - possibly based on FAST (Faceted Application of Subject Terminology) [14]
A contextual metadata set is also mandatory:
  • Funding agency
  • Grant number
  • Project information
  • Last access request date
  • Source & Source URL - if imported record
  • Data generation process
  • Why data was generated
  • Date range of data collection
  • Reason for embargo

Proposed RDMI architectures

At Oxford, DAMARO is building on past projects to implement a three stage Dataflow architecture [15] by May 2013:
  • DataStage (Data & metadata creation and local management) fed into by DMP metadata. Data and metadata transfer via SWORD to
  • DataBank (Institutional data storage on Fedora) records harvested by OAI-PMH to
  • DataFinder (Catalogue of research data) will exchange metadata with CRIS, a Symplectic RIS is in development, and ORA, Oxford's institutional repository. 
At Leeds, a similar architecture is being considered as part of RoadMap. EPrints is also being considered as an alternative to DataBank (Fedora based) - hence the interest in RD@Essex [16], where they are developing EPrints for a data repository. It is possible to link DataStage to EPrints via SWORD. Both a DataBank and an EPrints data repository could be a subset of an established institutional repository. DepositMO [17] is being considered for harvesting files from personal storage to the repository.
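
The harvesting step in this architecture (a catalogue pulling records from a data store by OAI-PMH) can be sketched as follows. The `verb` and `metadataPrefix` parameters and the XML namespace are standard OAI-PMH and Dublin Core; the base URL and the sample response are invented for illustration, and nothing here is taken from the DataBank or DataFinder code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements inside an oai_dc record.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def titles(response_xml: str) -> list:
    """Extract every dc:title found in an OAI-PMH response document."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iterfind(".//dc:title", NS)]


# Hand-written sample response (hypothetical record), not real repository output.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://databank.example.org/oai"))
print(titles(SAMPLE))
```

A real harvester would also need to follow OAI-PMH resumption tokens for large result sets, which this sketch leaves out.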

But the discussion was about metadata standards not infrastructure – was there any consensus?

Discussion
  • A WR catalogue would only need discovery layer metadata if records were pulled from institutional data catalogues; these would require a much larger set of mandatory and optional metadata elements.
  • Why a WR catalogue? Is there any benefit in creating a data catalogue at the WR level? Each institution will probably need to develop its own data catalogue anyway. A WR scale research data catalogue may make it easier to link to the resulting WRRO publications. If based on an EPrints platform, it may be easy to implement as part of / a subset of WRRO.
  • Why not a national level catalogue? We should broaden our definition of a catalogue, e.g. Google is a large distributed catalogue. Is DataCite a catalogue? Well yes, DataCite have established a beta metadata search facility [18] to search their records associated with the DOIs ascribed [19].
  • Regarding DataCite, it is unclear whether the majority of fields are needed to mint DOIs, but they require a minimum of 5 mandatory fields; presumably this is to encourage people, as the entry barrier is low.
  • DataCite is a good starting point as a basis for our requirements. We need to know where the main sources of data fields are – which are automatically populated?
  • User tags are needed for developing discipline ontologies / subject taxonomies; by crowdsource tagging and keywords.
  • But who owns the data? It was decided the department / school / faculty probably owns the data - as they own the facilities that captured the data. Who is the contact? The head of department or equivalent. 
  • Who is the 'Creator'? Everyone involved in the data capture process could be named; or all the people named as authors of published research output; or only the Principal Investigator may be named. Other people involved (co-creators, technicians, students) may be named, and their role specified, in the optional fields 'Contributor' and 'Role'.
  • Metadata mapping: attribute of an element – option of qualified or not.
  • Considering elements for discoverability; what terms will be searched for? – Require more than the 5 mandatory fields of DataCite; including a mandatory subject field.
  • Rights should be mandatory.
  • What metadata needs adding manually? – some preservation metadata will – can’t have common schema across institutions for object level metadata.
  • Problem of research not funded by research councils, solely university funded. Where does metadata come from if not the Grant management system? Core project level metadata for funded research is available. Would we need a data catalogue for unfunded research? It won’t be joined up to other institutional systems. Repository can provide unique identifiers. External range of IDs would be available, but would need to be mapped to internal identifier. 
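
The last point, mapping a range of external IDs to an internal repository identifier, can be sketched as a small lookup table. The class, the ID format and the scheme names are hypothetical and do not describe any of the systems discussed.

```python
from itertools import count


class IdentifierRegistry:
    """Maps external identifiers (DOI, grant number, ...) to internal record IDs."""

    def __init__(self) -> None:
        self._counter = count(1)
        self._external = {}  # (scheme, value) -> internal record ID

    def new_record(self) -> str:
        """Mint a new internal ID; the 'rec-' format is an assumption."""
        return f"rec-{next(self._counter):06d}"

    def link(self, internal_id: str, scheme: str, value: str) -> None:
        """Attach an external identifier to an existing internal record."""
        self._external[(scheme.lower(), value)] = internal_id

    def resolve(self, scheme: str, value: str):
        """Return the internal ID for an external identifier, or None."""
        return self._external.get((scheme.lower(), value))


registry = IdentifierRegistry()
record_id = registry.new_record()
registry.link(record_id, "DOI", "10.1234/example")  # hypothetical DOI
```

Because the internal ID is minted first, unfunded research with no grant number still gets a record; external identifiers are attached only when they exist.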

Discussion of detailed mandatory metadata fields

Taking the DAMARO project metadata scheme [8] as a starting point, we looked at each of the mandatory fields to discuss:
a. Would this be a mandatory field for a White Rose Institution data catalogue?
b. Would this field be required for discoverability?
c. How should we specify the exact definition for this field?


Element | Key | Notes
Record ID | M | Unique internal repository record ID
Location of dataset | | URL / DOI – DOI is best since URL is not persistent. Is DOI location of record or dataset? For non-digital dataset, contact details given.
Medium | | Digital or non-digital or both (container for data, rather than format of data)
Creator | M R D | Drawn from institutional CRIS. Drawn from Names Project [20]
- Creator ID | | Unique ID for person
- PID scheme | | Scheme for unique personal ID
Creator affiliation | R | Drawn from CRIS – What level, institution or department?
- Affiliation dates | | Dates of affiliation
Title of dataset | M D |
Publisher of data | D | Data centre, Repository, institution where dataset is accessed from.
Publication year | M D | Year for citation purpose. Date when dataset is openly accessible – end of embargo (DataCite). Alternatively, date submitted to publisher, or date DOI minted? (See below 1.)
Access terms | | Administrative metadata
Data owner | | PI, HOD, Head of Faculty, school, institute named as representative of body.
Access date | | Embargo expiry date. (See below 2.)
Rights for metadata | | CC0, ODC. Administrative to ensure open metadata
Subject | R D | Mandatory subject description based on a controlled vocabulary. FAST [21]. Automatic base-level = discipline (creator affiliation?). Other subject vocabularies (LCSH).
Keywords | R O | User devised subject keywords.

      Key: M = mandatory for DataCite, O = optional, R = repeatable element,
               D = terms for discovery in catalogue search.
      Text colours: DAMARO notes, meeting suggestions, my suggestions
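
The key letters lend themselves to a simple lookup, for example to list which elements a catalogue search would index. A minimal sketch, using only the elements that carry a letter in the table; the dictionary layout is an assumption for illustration.

```python
# Elements from the table that carry a key letter:
# M = mandatory for DataCite, O = optional, R = repeatable, D = discovery.
FIELDS = {
    "Record ID": "M",
    "Creator": "MRD",
    "Creator affiliation": "R",
    "Title of dataset": "MD",
    "Publisher of data": "D",
    "Publication year": "MD",
    "Subject": "RD",
    "Keywords": "RO",
}


def with_flag(flag: str) -> list:
    """Return the elements whose key contains the given letter."""
    return [name for name, flags in FIELDS.items() if flag in flags]


print(with_flag("D"))
# ['Creator', 'Title of dataset', 'Publisher of data', 'Publication year', 'Subject']
```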



  1. How do people refer to datasets catalogued using DataCite scheme metadata if the dataset is embargoed, but the person is not subject to the restriction – how does a person refer to their own embargoed dataset?
  2. Publication year could be considered first date of access and may be different to access date if the  dataset is re-embargoed after being previously accessible (new embargo).
  3. Problem of management of embargoed elements within a dataset? Best to remove these elements first and publish separately after embargo? Or include embargoed material after embargo period passed and mint new DOI.
  4. Problem of management of embargoed elements within a dataset? Different publication / access dates - publication date for non-embargoed data; access date for embargoed data.
  5. Problem of management of embargoed elements within a dataset? Multiple sublevel DOIs may be minted for different parts within a dataset. 
  6. Other Mandatory fields - Related Identifier DC.12 - mandatory for re-purposed data.
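
Note 2 distinguishes publication year from access date when a dataset is re-embargoed. One possible reading, sketched below, takes the publication year from the first embargo expiry and the access date from the latest one. This is an interpretation of the note, not an agreed rule, and the input format is hypothetical.

```python
from datetime import date


def publication_year(embargo_expiries: list) -> int:
    """Year the dataset first became openly accessible (earliest expiry)."""
    return min(embargo_expiries).year


def access_date(embargo_expiries: list) -> date:
    """Date the dataset is accessible under the current embargo (latest expiry)."""
    return max(embargo_expiries)


# A dataset opened in June 2012, then re-embargoed until January 2014.
expiries = [date(2012, 6, 1), date(2014, 1, 1)]
print(publication_year(expiries))  # 2012
print(access_date(expiries))       # 2014-01-01
```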

Things for the group to do
Contact DAMARO  - how's progress?
Keep up to speed with RD@Essex
Reflect on this meeting and continue with the process of identifying required fields.

References

[1] White Rose Perspectives on Research Data Management http://library.leeds.ac.uk/info/377/roadmap/123/roadmap_events/2
[2] Metadatatron Blog - A White Rose Research Data Catalogue http://metadatatron.blogspot.co.uk/2012/09/white-rose-research-data-catalogue.html
[3] Metadatatron Blog - Metadata for a WR Data Catalogue (part 1) http://metadatatron.blogspot.co.uk/2012/10/metadata-for-wr-data-catalogue.html
[4] Metadatatron Blog - Metadata for a WR Data Catalogue (part 2) http://metadatatron.blogspot.co.uk/2012/10/metadata-for-wr-data-catalogue-part-2.html
[5] RoaDMap - Work packages http://blog.library.leeds.ac.uk/downloads/file/260/roadmap_work_packages 
[6] EPSRC Expectations http://www.epsrc.ac.uk/about/standards/researchdata/Pages/expectations.aspx
[7] IDMB Initial findings report (p84 & 89) http://eprints.soton.ac.uk/195155/1/idmbinitialfindingsreportv4.pdf  
[8] Just enough metadata: Metadata for research datasets in institutional data repositories. Rumsey, S (2012) DAMARO http://damaro.oucs.ox.ac.uk/docs/Just%20enough%20metadata%20v3-1.pdf 
[9] Research Data @ Essex Blog http://researchdataessex.posterous.com/metadata
[10] IRIDIUM Blogpost http://iridiummrd.wordpress.com/2011/12/09/195/
[11] DataCite - Mandatory core metadata http://schema.datacite.org/meta/kernel-2.2/doc/DataCite-MetadataKernel_v2.2.pdf#page=8
[12] CERIF 1.5 Common European Research Information Format http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xhtml 
[13] Data.Bris - Minimum set of mandatory metadata http://data.blogs.ilrt.org/2012/05/18/minimal-set-of-mandatory-metadata/
[14] FAST (Faceted Application of Subject Terminology) http://www.oclc.org/research/activities/fast.html
[15] Infrastructure for Research Data Management at the University of Oxford. Wilson, J (2012) DAMARO  http://www.ands.org.au/events/webinars/james-wilson-jisc-webinar-slides.pdf
[16] Opening up research data at Essex: Experiments with EPrints. Ensom, T & Wolton, A (2012)  Research Data @Essex http://www.data-archive.ac.uk/media/368772/rde_or2012_notes.pdf
[17] DepositMO and DepositMOre: Modus Operandi for Repository Deposits http://blog.soton.ac.uk/depositmo/tag/depositmo/
[18] DataCite - Beta search facility at http://search.datacite.org/ui
[19] DataCite - Blog http://datacite.wordpress.com/2012/01/26/datacite-search/
[20] Names Project http://names.mimas.ac.uk/
[21] FAST http://www.oclc.org/research/activities/fast.html


Friday, 26 October 2012

Metadata for a WR Data Catalogue - part 2


Data Catalogue aspects of RDM infrastructure projects 
The JISC Digital infrastructure: Research management programme (2011-13) Managing Research Data strand is supporting 17 Research Data Management Infrastructure projects [1]. Four of these, RoaDMaP, SwordARM, Open Exeter and Managing Research Data (at UWE) are referred to in part one of this post [2].
Overall there seems to be a consensus in accepting the three-tier metadata model put forward by the IDMB [3]  project at Southampton (& Takeda et al 2010 [4]) :-



This model has been developed through the IRIDIUM (O’loughlin 2011) [5] and DataBris (Boyd 2012) [6] projects, with the three tiers given slightly different attributes:
1. A minimum mandatory metadata set providing core information, which could be based around a standard metadata element set - the 15 Dublin Core elements, DataCite kernel [45] or CKAN [47] - but includes other fields such as location, access terms and conditions and any embargo information. The top level relates to the discoverability of the resource.
2. A second mandatory layer of contextual metadata and administrative information, covered by elements within the CERIF model [48]. Base entities: project, person, organisation unit, collaborators; funding information: funder, grant number; and result entities: publication, patent, research product. Ideally, much of this will be automatically harvested, or fed from administrative systems.
3. Finally, a specific level of optional metadata providing the rich, more granular, detailed information. This layer provides the discipline related information required for reuse.
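
As a data structure, the three tiers might be modelled along the lines below. This is only a sketch: the class and field names are hypothetical, and none of the projects mentioned is known to use this layout.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueRecord:
    """A catalogue entry split into the three tiers described above."""

    core: dict      # tier 1: mandatory, for discoverability
    context: dict   # tier 2: mandatory, CERIF-style contextual/admin data
    discipline: dict = field(default_factory=dict)  # tier 3: optional, for reuse

    def discovery_view(self) -> dict:
        """Return a copy of tier 1 only, e.g. for a shared catalogue."""
        return dict(self.core)


record = CatalogueRecord(
    core={"title": "Example dataset", "publisher": "Example University"},
    context={"funder": "EPSRC", "grant_number": "hypothetical-123"},
)
print(record.discovery_view())
```

Keeping tier 3 as a free-form dictionary reflects the point that discipline metadata varies too much to fix in a common schema.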

Oxford University has hosted a number of JISC funded RDM projects since 2008, including EIDCSR [7], Sudamih [8], Admiral [9], Vidaas [10], and Dataflow [11]. These were concerned with research workflows, embedding preservation, developing core metadata, sharing research data in collaborative workspaces, cloud storage of data and a research data repository; all towards development of an integrated Research Data Service. The current project, DAMARO [12], is implementing this Research Data Service [13]; key aspects are the development of Datafinder, the Oxford University data catalogue, and Databank [14], the data repository. The Datafinder architecture (Wilson 2012) [15] will have the following characteristics:
  • OAI-PMH harvesting of data stores
  • SWORD2 compliant
  • CERIF compatible
  • Metadata schema based on DataCite
  • Interfaces directly with DataBank & ORDS (Oxford's Online Research Database Service, based on the DataStage [16] system)
  • Users can register non-electronic data.
  

For Datafinder, a three tier metadata approach is envisaged, comprising:

Minimum core elements
Record/digital object ID
Location of dataset
Medium
Creator (if not depositor)
Creator affiliation (if not depositor)
Title
Publisher of data
Publication year
Access terms & conditions
Data owner
Access date to data
Rights for metadata
Subject

Contextual mandatory elements
Funding agency
Grant number
Project information
Last access request date
Source
Source URL
Data generation process
Why the data was generated/Abstract/Brief description
Date
Reason for embargo

Optional metadata (selection)
Co-creators/contributor
Role
Affiliation
Sub-title
Subject
Keywords
Date (other)
Language
ResourceType
AlternateIdentifier: Eg DOI
RelatedIdentifier: eg DOI of publication
Size
Format
Version
Data generation process
Abstract/Brief description
Documentation 1:descriptive or contextual information about the dataset (e.g. machine settings and experimental conditions under which the data were gathered)
Documentation 2
Subject specific m.d.
Subject specific m.d.
Subject specific classification
Subj specific classn scheme
Data complying with known standards eg DDI
(Rumsey 2012) [17]
The metadata will have three sources: manual entry - generally disliked, can be inaccurate but can produce rich metadata; imported - from data capture instruments, from institutional systems (RIMS, DMP), from a data repository; autogenerated by the RDMI. (Rumsey 2012) [17]
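
Combining the three sources into one record needs some rule for conflicts. A minimal sketch, assuming (this is not stated by Rumsey) that manual entry overrides imported values, which in turn override autogenerated ones, and that empty values never overwrite filled ones:

```python
def merge_metadata(autogenerated: dict, imported: dict, manual: dict) -> dict:
    """Combine three metadata sources into one record.

    Assumed precedence: manual > imported > autogenerated.
    Empty strings and None are ignored so they cannot blank out a value.
    """
    merged = {}
    for source in (autogenerated, imported, manual):
        merged.update({k: v for k, v in source.items() if v not in (None, "")})
    return merged


record = merge_metadata(
    autogenerated={"size": "2 MB", "format": "CSV"},
    imported={"funder": "EPSRC", "title": ""},
    manual={"title": "Example dataset", "format": "text/csv"},
)
print(record)
```

The opposite precedence is just as defensible where manual entry is thought less accurate than system data; the order of the tuple in the loop is the only thing that would change.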



The research data catalogue is central to the infrastructure being developed by the IRIDIUM project [18] at Newcastle, recording what data they have and making it discoverable. This will be integrated with MyProject (the Research Management System), MyImpact (Institutional publications system – equivalent to our Symplectics system) and the EPrints based IR. The catalogue will not be a repository but rather a straightforward web-based searchable catalogue of data, and it will only collect information on data that supports publication. “We have opted for this measure as we know that data supporting publication should have already been prepared (i.e. confidentiality respected through the scrubbing of data, fields marked sensibly etc) plus we feel that data is normally available at this point for peer review and as a matter of good scientific practice, so (hopefully) we’re not asking too much more from our academics to fill in data information at the same point they fill in their new publication info in our output system.” (Wood, L. 2012) [19]. A list of twelve key fields and seven further fields has been drawn up, which will be publicly or privately viewable through the catalogue interface. Again the three tier model has been adhered to: “This is quite appealing as we already collect much of the information in the first two levels through our current systems (MyProjects, e-prints and MyImpact) so the main additional input we’d be requiring from the academic would be at the third level.” (O’loughlin 2011) [20].
Interconnectivity of RDC with other elements of the RDMI is important because researchers definitely do not want to enter research project metadata more than once in multiple systems, within or outside the institution.
"This requires us to understand some of the systems the RDC may need to exchange metadata with that have existing information already entered. These could be local research group metadata catalogues, local/national repositories and other online systems" (Wood 2012) [21].

Bristol University’s Data.Bris project [22] is developing a RDMI which will integrate a new CRIS (PURE), which also provides an institutional repository, with the existing University Research Data Storage Facility (RDSF). This extends the storage facility into becoming a Research Data Repository and allows data to be published from the storage facility. The proposed architecture [23] involves the creation of a metadata store (a SPARQL 1.1 service), and will adhere to OAI-PMH [24], OAI-ORE [25], and SWORD [26] protocols. Data.Bris has defined a minimal set of mandatory metadata to be used when depositing or publishing data: Identifier; Creators; Title; Publisher; Publication year; and are investigating which metadata elements may be created automatically and which need adding manually. Again the three tier metadata model is thought useful, especially since metadata can be pulled in from the CRIS (Boyd 2012) [27].

The Datapool project [28] follows on from the Institutional Data Management Blueprint (IDMB) project. The project will launch and populate an EPrints institutional data repository to collect and store all research data produced across disciplines within the institution, as part of the research data management infrastructure. The repository will have access to storage sufficient for local data assets and will also provide links to data held elsewhere, both externally in subject repositories and internally using other systems. The project is investigating mechanisms for transferring data and metadata into the data repository from other local data stores, and exporting data from the repository using the SWORD2 protocol. They will also use the three tier metadata model developed by IDMB  [29].
"The lesson for data repositories is clear: to capture content from data creators you must provide useful services that will become an integral part of the workflow of creating the data. It will not work to isolate particular requirements, such as records creation, from other needs such as storage services. Data does not appear with the same mode and frequency as published papers, so workflow must accommodate many different patterns. Research data is often produced by machines, so deposit workflow must allow scope for non-manual intervention" (Hitchcock 2012) [30].


For the Research Data @Essex project [31], EPrints is being used for the repository. This project is also adhering to the IDMB inspired three tier metadata model; they considered EPrints metadata provided for level 1 & 2, whilst level 3 ‘minutiae’ are derived by drawing from DataCite, INSPIRE, DDI and DataShare schema. A multi-schema crosswalk was produced [32] and the metadata schema worked out based on DataCite, INSPIRE and DDI 2.1 [33].
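
A crosswalk is in essence a field-to-field mapping table. The sketch below is not the Essex crosswalk; it maps only the five DataCite mandatory properties onto simple Dublin Core elements, as an illustrative pairing, and reports anything it cannot map.

```python
# Illustrative mapping of DataCite mandatory properties to simple Dublin Core.
# This pairing is an assumption for the sketch, not the RD@Essex crosswalk.
DATACITE_TO_DC = {
    "Identifier": "identifier",
    "Creator": "creator",
    "Title": "title",
    "Publisher": "publisher",
    "PublicationYear": "date",
}


def crosswalk(record: dict, mapping: dict) -> tuple:
    """Translate field names via `mapping`; return (converted, unmapped names)."""
    converted, unmapped = {}, []
    for name, value in record.items():
        if name in mapping:
            converted[mapping[name]] = value
        else:
            unmapped.append(name)
    return converted, unmapped


converted, unmapped = crosswalk(
    {"Title": "Example dataset", "PublicationYear": "2012", "Size": "2 MB"},
    DATACITE_TO_DC,
)
print(converted, unmapped)
```

Returning the unmapped names makes the gaps visible, which is the part of a multi-schema crosswalk that usually needs human decisions.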



The University of the West of England's MRD uses a schema based on DataCite in a two tier model: 1. basic metadata, 2. detailed domain level metadata. The Hydra project uses the Fedora object model and MODS schema; both are described in part 1 [2].

The C4D project [34] aims to integrate research data metadata with CERIF CRIS metadata, developing mappings between multiple metadata standards and aiming at maximum interoperability.

The ADMIRe project [35] seems to be developing a system based on DataCite minimum mandatory metadata, with additional subject specific metadata including DDI.

KAPTUR [36] involves work integrating DataStage with EPrints, providing a structured metadata collection interface, and FigShare with EPrints, with the intention to create an API to link Figshare with an EPrints repository using the SWORD 2 protocol. The project is specifically involved with visual arts data management, so relevant metadata schema referred to [37] include the Categories for the Description of Works of Art (CDWA) [38], the VRA Core Categories [39] and the Data Dictionary – Technical Metadata for Digital Still Images (ANSI/NISO Z39.87-2006) [40].

MiSS [41] is working towards a RDMI at University of Manchester. They are developing a system of metadata templates specific to different research domains, for use during data capture. The MiSS Baseline Requirements Report indicates the advantages of implementing a RDMI: automating data capture and metadata ingest from instruments reduces the need for manual metadata annotation by researchers – this benefit needs promoting to researchers. With the multitude of data sizes, different instruments and specific proprietary data and metadata formats, community input is needed to achieve integration of metadata schemas in the RDMI.

Open Exeter [42] is developing a prototype DSpace research data repository. They have surveyed post-graduates about their experiences testing the interface and metadata webform (Evans 2012)[43] .

Orbital [44] are using the CKAN repository system for their data repository, integrating this with their EPrints repository, their 'Awards Management System' (RIMS) and 'ownCloud' networked storage (an ‘academic dropbox’). They accept the minimum metadata requirements for DataCite [45], with agreement on the mandatory and optional attributes.

PIMMS [46] (Portable Infrastructure for the Metafor Metadata System) will refactor the Metafor metadata management tool for use in university departments. The project deals with metadata schema in the climatology domain.

Part 3 will describe the work of Australian projects in the research data catalogue / metadata stores area.

References

[1] JISC Digital infrastructure: Managing Research Data Programme 2011-13 - Research Data Management Infrastructure Projects http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata/infrastructure.aspx
[2] http://metadatatron.blogspot.co.uk/2012/10/metadata-for-wr-data-catalogue.html
[4] Data Management for All - The Institutional Data Management Blueprint project (IDMB at the 6th IDCC)  http://eprints.soton.ac.uk/169533/1/6th_international_digital_curation_conference__idmb_final_paper_revised.pdf
[7] EIDCSR http://eidcsr.oucs.ox.ac.uk/
[8] Sudamih http://sudamih.oucs.ox.ac.uk/
[9] Admiral http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL
[10] Vidaas http://vidaas.oucs.ox.ac.uk/
[11] Dataflow http://www.dataflow.ox.ac.uk/
[12] DAMARO http://damaro.oucs.ox.ac.uk/index.xml
[13] University of Oxford Bodleian Libraries - Research data services http://www.bodleian.ox.ac.uk/bdlss/research-data-services 
[14] Databank Oxford University research data repository https://databank.ora.ox.ac.uk/
[15] Wilson (2012) Infrastructure for Research Data Management at the University of Oxford. ANDS Webinar http://www.ands.org.au/events/webinars/james-wilson-jisc-webinar-slides.pdf
[16] DataStage http://www.dataflow.ox.ac.uk/index.php/datastage/ds-about
[17] Sally Rumsey (2012) Building an institutional research data management infrastructure. OR2012  http://damaro.oucs.ox.ac.uk/docs/Just%20enough%20metadata%20v3-1.pdf
[18] IRIDIUM http://research.ncl.ac.uk/iridium/
[19] Wood, L., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2012/10/02/iridium-requirements-for-a-research-data-catalogue-and-proof-of-concept-development/
[20] O'Laughlan N., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2011/12/09/195/
[21] Wood, L., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2012/10/03/iridium-rdm-systemstools-connectivity-busy-researchers-dont-like-duplication-of-metadata-entry/
[22] Data.Bris project http://data.bris.ac.uk/
[23] Steer, D. - Data.Bris architecture http://data.blogs.ilrt.org/2012/02/03/data-bris-architecture/
[24]  OAI-PMH http://www.openarchives.org/pmh/
[25]  OAI-ORE http://www.openarchives.org/ore/
[26]  SWORD http://swordapp.org/
[27] Boyd, D. - Data.Bris Blog http://data.blogs.ilrt.org/category/metadata/
[28] DataPool project http://datapool.soton.ac.uk/
[29] DataPool project proposal http://datapool.soton.ac.uk/files/2011/12/University-of-Southampton-Proposal-public.pdf
[30] Hitchcock, S., DataPool Blog http://datapool.soton.ac.uk/tag/repositories/
[31] Research Data @Essex Blog http://researchdataessex.posterous.com/metadata
[32] Research Data @Essex Metadata schema crosswalk http://researchdataessex.posterous.com/metadata#
[33] RDE Metadata Profile for EPrints https://docs.google.com/open?id=0B7VJTfTg7nrrcU1WMWVEMW9tY3M
[34] C4D http://cerif4datasets.files.wordpress.com/2012/04/cris2012_35_full_paper.pdf
[35] ADMIRe http://admire.jiscinvolve.org/wp/2012/08/16/notes-from-the-2nd-datacite-workshop/
[36] KAPTUR http://www.vads.ac.uk/kaptur/outputs/Kaptur_technical_analysis.pdf
[37] KAPTUR Blogpost https://kaptur.wordpress.com/2012/06/12/raising-your-redman/
[38] CDWA http://www.getty.edu/research/publications/electronic_publications/cdwa/index.html
[39] VRA Core http://www.vraweb.org/projects/vracore4/
[40] NISO Technical Metadata for Digital Still Images http://www.niso.org/kst/reports/standards?step=2&gid=None&project_key=b897b0cf3e2ee526252d9f830207b3cc9f3b6c2c
[41] MiSS - BaselineRequirementsReport http://www.miss.manchester.ac.uk/wp-content/uploads/2012/09/MiSS-BaselineRequirementsReport-RevisedVersion-Aug2012.pdf
[42] Open Exeter http://blogs.exeter.ac.uk/openexeterrdm/
[43] Evans, J., Open Exeter Blog http://blogs.exeter.ac.uk/openexeterrdm/blog/2012/05/31/pgr-feedback-on-data-upload/
[44] Orbital Blog http://orbital.blogs.lincoln.ac.uk/
[45] DataCite Mandatory Properties http://schema.datacite.org/meta/kernel-2.2/doc/DataCite-MetadataKernel_v2.2.pdf#page=8
[46] PIMMS http://proj.badc.rl.ac.uk/pimms
[47] CKAN http://ckan.org/
[48] CERIF 1.5 - Common European Research Information Format http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xhtml

Wednesday, 24 October 2012

Metadata for a WR Data Catalogue - part 1


This will be a brainstorming event about metadata for research data with a particular focus on core fields for a data catalogue. We'll need to identify existing metadata sources and how local systems may feed each other. However, leaving architecture aside, it may be beneficial to discuss core metadata fields for all data types and approaches to metadata for specific data sets.






The possible benefits of a shared WR Data Catalogue have been briefly described in a previous post, White Rose Research Data Catalogue [1]. Each institution will eventually develop a Research Data Management Infrastructure, which may involve some form of Research Data Management System (RDMS). A key component of a RDMS must be a data catalogue or metadata store, which holds details of what objects are being managed by the system. The structure of the RDMS components, the metadata store configuration and the metadata standards adhered to will of course have a great bearing on interoperability and thus the possibility of establishing a shared catalogue. Although an institutional RDMS may need to have some long term data storage capacity, for 'orphaned datasets' and other material not suitable for submitting to data centres, a possible WR Data Catalogue would not need such a storage function.



Local RDM infrastructure and projects
The development of RDMS is already underway at Leeds. RoaDMap [2] includes WP5: Software systems and metadata. This involves assessment and deployment of RDMS options including Dataflow [3] / Databank [4]; agreement of descriptive and contextual metadata requirements for case studies (phased development of metadata standards to be used and core metadata for all case studies); and development of case study metadata templates.
The SWORD-ARM [5] project at York, is developing a semi-automatic ingest system for the ADS; metadata required for describing datasets are outlined in 'Guidelines for Cataloguing Datasets with the ADS' [6].
WRRO and WREO are powered by ePrints3 and are shared databases - the individual institutions do not have their own Institutional Repository. WRRO is (or will be) integrated into the local Symplectics Publications databases at Leeds and Sheffield. A number of institutions use the ePrints platform for an Institutional Data Repository (see below). Although in-house developers are often required to modify ePrints for the data storage function of a data repository, less modification will be required if only the catalogue function of ePrints is needed.
Much metadata may be imported from the institutions' Research Information Management Systems (RIMS). If the three institutions' RIMS conform to the CERIF model [7], they will be more easily integrated into an RDM infrastructure. York is using PURE, which conforms; Leeds and Sheffield use Symplectics (which now conforms to CERIF), but only for publication management, although it can be extended to act as a full RIMS.
Finally, a research data management project, Tools for RDM Development [8], has recently been initiated at the University of York, to improve RDM. There is the possibility of implementing Databank and integrating it with PURE and York Digital Library.



Current Institutional Data catalogues (almost) up and running

The DISC-UK DataShare project [9], a collaboration involving the universities of Edinburgh, Southampton and Oxford to investigate the accommodation of datasets in institutional repositories (IRs), developed the Metadata schema for ePrints Soton [10], a dataset metadata profile based on qualified Dublin Core. The Data Documentation Initiative (DDI) metadata scheme (for microdata and aggregate data) was considered, but the DCMI-based schema was chosen. This project eventually resulted in the establishment of two research data repositories: Edinburgh DataShare [11], based on DSpace software, and Oxford's DataBank [12], based on the Databank [4] platform. Databank was an output of the Dataflow project [3]; it was developed from Fedora Commons, with Solr implemented for indexing, and can be hosted in an external cloud or deployed on local hardware. Databank uses DC for core metadata, and users can extend the metadata to provide domain-specific ontological information and more; further information about metadata is given at the Databank [13] and Datastage [14] webpages. Oxford's research publication repository, ORA, is also based on the Fedora platform; for Databank, another instance of Fedora was developed for research data (Rice 2009) [15].

Hull's digital repository, Hydra [16], is a multipurpose repository based on Fedora Commons repository software, Solr, Ruby on Rails and Blacklight. The Hydra project recommends MODS for the basic descriptive metadata for content, and Hull uses a modified version of it (Green & Awre 2011) [17]; other Hydra repositories use other metadata schemas (Project Hydra Blog) [18].
UWE Research Data Repository [19] is an EPrints repository. EPrints was chosen because UWE already uses the platform for its IR and has the local skills to repurpose it for data (RDM-UWE 2012) [20]. The metadata scheme is based on DCMI, the standard for their IR, extended to include mandatory and optional fields based on the DataCite Metadata Schema v2.1. "Two levels of metadata are planned; the first is a basic level collected on project record entry and data deposit. An optional detailed level will conform to disciplinary and subject metadata standards" (Holliday 2012) [21].
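
To make the two-level idea concrete, here is a minimal Python sketch of what such a catalogue record might look like. It is an illustration only, not UWE's implementation: the class name, the field names and the example values are all hypothetical, and the list of mandatory fields is my reading of the five mandatory properties in the DataCite 2.x schema (identifier, creator, title, publisher, publication year).

```python
from dataclasses import dataclass, field

# Assumption: the five mandatory properties of the DataCite 2.x schema,
# written here with hypothetical snake_case field names.
DATACITE_MANDATORY = (
    "identifier",
    "creator",
    "title",
    "publisher",
    "publication_year",
)


@dataclass
class CatalogueRecord:
    """Hypothetical two-level record: a basic level plus optional detail."""

    basic: dict = field(default_factory=dict)     # collected at deposit
    detailed: dict = field(default_factory=dict)  # discipline-specific, optional

    def missing_mandatory(self):
        """Return the mandatory fields that are absent or empty."""
        return [name for name in DATACITE_MANDATORY if not self.basic.get(name)]


record = CatalogueRecord(
    basic={
        "identifier": "10.1234/example",  # placeholder, not a real DOI
        "creator": "Example, A.",
        "title": "Example dataset",
        "publisher": "Example University",
    },
    detailed={"instrument": "hypothetical discipline-specific field"},
)
print(record.missing_mandatory())
```

A catalogue could refuse to publish a record until `missing_mandatory()` returns an empty list, while leaving the detailed level entirely optional.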

The Open Exeter [22] project has established EDA: Exeter Data Archive [23], a DSpace-based prototype data repository.

A National Research Data Catalogue




Further afield, the Australian National Data Service (ANDS) [24] is taking a national approach to improving research data management, providing advice and tools for institutions to develop RDM policies, plans and infrastructure, similar to the activities of the DCC in the UK. ANDS has established Research Data Australia [25], a discovery service for a registry of Australian research data collections, the Australian Research Data Commons (ARDC) [26]. Records are imported from institutional metadata stores and data repositories; ARDC does not have a data storage function. ANDS requires the Registry Interchange Format for Collections and Services (RIF-CS), based on ISO 2146:2010, for the exchange of records, and provides comprehensive advice on Metadata Content Requirements [27].

The ANDS Data Capture program [28] promotes infrastructure for data creation and capture that feeds into data and metadata storage facilities. It does this by developing 'pipelines' between instruments and data and metadata stores, together with software that enables better description of data and metadata, and by feeding the resulting records into the ARDC.

The ANDS 'Seeding the commons' programme funded projects involved in developing institutional research data metadata stores - these will be described in part 2. 

Finally, a post of 14 September 2012 on the ANDS news page announced: "ANDS and the Ex Libris Group are pleased to announce their recent agreement to syndicate the metadata in Research Data Australia, and make it accessible to researchers through the Ex Libris portal, Primo Central" [29].


References

[1] Metadatatron Blogpost: White Rose Research Data Catalogue http://metadatatron.blogspot.co.uk/2012/09/white-rose-research-data-catalogue.html
[6] ADS - Guidelines for Cataloguing Datasets with the ADS  http://archaeologydataservice.ac.uk/advice/cataloguingDatasets
[8] University of York: Tools for RDM Development http://uoy-rdmproject.blogspot.co.uk/
[9] DISC-UK DataShare project http://www.disc-uk.org/index.html
[10] DataShare Metadata Schema for ePrints Soton (ePrints 3.1) (2009) http://www.disc-uk.org/docs/ePrints_Soton_Metadata.pdf
[11] Edinburgh DataShare  http://datashare.is.ed.ac.uk/
[12] Databank: Bodleian Libraries research data archival store  https://databank.ora.ox.ac.uk/
[13] Databank: Metadata https://github.com/dataflow/RDFDatabank/wiki/Metadata-(how-to-label-and-find-things-in-DataBank)
[14] Datastage Metadata https://github.com/dataflow/DataStage/wiki/Metadata-(how-to-label-and-find-things-in-DataStage)
[15] Rice, R. (2009) DataShare final report http://repository.jisc.ac.uk/336/1/DataSharefinalreport.pdf
[16] Hydra: Hull's digital repository https://hydra.hull.ac.uk/
[17] Green & Awre (2011) Hydra in Hull: Final report https://hydra.hull.ac.uk/resources/hull:5231
[18] Project Hydra Blog http://projecthydra.org/design-principles-2/metadata/
[19] UWE Research Data Repository http://researchdata.uwe.ac.uk/
[20] EPrints as a data repository at UWE (WP 1&2 Stage 6) http://www2.uwe.ac.uk/services/library/using_the_library/Services%20for%20researchers/eprints-data-repository-uwe.pdf
[21] Holliday (2012) Metadata for UWE data repository. MRD Blog http://blogs.uwe.ac.uk/teams/mrd/archive/2012/08/14/metadata-for-uwe-data-repository.aspx
[22] Open Exeter project http://as.exeter.ac.uk/library/resources/openaccess/openexeter/
[23] EDA: Exeter Data Archive https://eda.exeter.ac.uk/repository/
[24] ANDS: Australian National Data Service http://www.ands.org.au/index.html
[25] Research Data Australia  http://researchdata.ands.org.au/
[26] ANDS - Australian Research Data Commons http://ands.org.au/about/approach.html#ardc
[27] ANDS - Metadata Content Requirements http://ands.org.au/resource/metadata-content-requirements.html
[28] ANDS - Data capture programme http://www.ands.org.au/datamanagement/capture.html
[29] ANDS - News and events: More bang for your registration buck http://ands.org.au/news/ands-and-exlibris.html

Tuesday, 2 October 2012

The nature of data

At our second RDMRose training session we considered which objects could be considered 'research data'. The consensus was that anything involved in the research cycle could be considered data, whether digital or physical, even skulls in Archaeological collections or stuffed penguins in Zoological collections.

Thinking further on this, I reckon we need to qualify which objects can be considered data and which cannot, by determining whether they carry information in some form of symbol system. The Wikipedia definition is very succinct: "Data are values of qualitative or quantitative variables, belonging to a set of items".

In considering research data management best practice, research data collected in a non-digital format should be digitised and sufficient metadata collected during the process. So, it is becoming common practice to digitise lab-books, field notes, photographs, plans, and other objects, so the data they contain may be curated more effectively.

Luckily for us, all digital objects can be considered data - because they consist of binary code. Digital objects may contain noise (meaningless data) as well as signal (meaningful data) and need processing to determine what is signal and what is noise. Information may be derived from the signal by processing (i.e. by interpretation of the data). Even a digital object containing no meaningful data contains information - namely, that there is no meaningful data as determined by the interpreting process.

Whether a physical object can be considered data or not depends, firstly, on whether the object contains data that encodes information in some symbol system and, secondly, on the reason for its creation or collection - the purpose it is put to.

1. Consider a stuffed penguin in a Zoological collection. I would argue that this cannot be considered research data because there is no symbol system contained within or on it. The Zoological collection catalogue record for the item can be considered research data. Research data can be derived from the penguin by measuring it using instruments - tape measure, weighing scales; or by subjecting it to other processes, such as chemical or genetic analyses. Research data may also be derived from it by creating other representations of it - drawing, optical photography, X-ray photography.

2. Consider a skull in an Archaeological collection. Again this cannot be considered research data because there is no symbol system contained within it or on it. The Archaeological collection catalogue record for the skull can be considered research data. Research data can be derived from the skull by measuring it using instruments, or by subjecting it to other processes; and by creating other representations of it.

3. Consider a skull in an Archaeological collection that has hieroglyphs carved into it. This, I suggest, may be considered data - because it contains data - the hieroglyphs, and therefore information encoded in a symbol system - though the data only becomes information if the hieroglyphs can be processed through translation. Of course, to curate this data effectively, the carved hieroglyphs would need to be photographed and/or copied in a digital format.

4. Now, a paperback book of fiction contains data (printed text) and information, if we are able to read the text. But this cannot be considered research data unless it serves a purpose in the research process. It may be considered research data if the text is being analysed for literary or sociological research, for example. In this case, representations of it may be made by digitising (where permitted) or by quotation; and the metadata describing this data will be in the form of a reference.

5. Consider the original hand-written manuscript created by the author, which was edited and published as the paperback book. This can be considered a set of data, but can only be considered research data if used by a researcher.

6. The weather cannot be considered data, of course; but measurements of wind-speed, air temperature and rainfall are.

The most important criterion to use in assessing the need for curation will be 'Can these data be recreated or recollected following the same research process?'. Data for which the answer is no are what Jim Gray refers to as ephemeral data, which 'cannot be reproduced or reconstructed a decade from now. If no one records them today, in a decade no one will know today’s rainfall, sunspots, ozone density, or oil price' (Gray 2002 p.1). For the above examples, so long as the Archaeological and Zoological collection items are preserved correctly (museum curation), they can be measured and photographed at any time in the future. The paperback will probably be available from a number of sources, but the original manuscript may well be unique and therefore be a priority case for curation. Weather records will be unique, being collected during a specific timespan, and will therefore also be a priority case for curation.
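
The reasoning in this post can be boiled down to three yes/no tests, which I have sketched below in Python. This is only my own rough encoding of the argument: the class, the function names and the flags assigned to each example are assumptions made for illustration, and real cases will be less clear-cut.

```python
from dataclasses import dataclass


@dataclass
class ResearchObject:
    """Hypothetical description of an object met in the research cycle."""

    name: str
    is_digital: bool          # digital objects always count as data
    has_symbol_system: bool   # carries information encoded in symbols
    used_in_research: bool    # serves a purpose in the research process
    can_be_recollected: bool  # could be measured or gathered again later


def is_data(obj):
    """Data: any digital object, or a physical one carrying a symbol system."""
    return obj.is_digital or obj.has_symbol_system


def is_research_data(obj):
    """Research data: data that is actually put to use in research."""
    return is_data(obj) and obj.used_in_research


def is_curation_priority(obj):
    """Priority: research data that could not be recreated or recollected."""
    return is_research_data(obj) and not obj.can_be_recollected


penguin = ResearchObject("stuffed penguin", False, False, True, True)
paperback = ResearchObject("paperback novel, unread", False, True, False, True)
manuscript = ResearchObject("author's manuscript", False, True, True, False)
rainfall = ResearchObject("rainfall measurements", True, True, True, False)

for obj in (penguin, paperback, manuscript, rainfall):
    print(obj.name, is_research_data(obj), is_curation_priority(obj))
```

The penguin fails the first test however useful it is to researchers, the unread paperback fails the second, and the manuscript and rainfall measurements pass all three.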

References

Gray, J. et al. (2002) Online Scientific Data Curation, Publication, and Archiving
http://arxiv.org/ftp/cs/papers/0208/0208012.pdf