Wednesday, 24 October 2012

Metadata for a WR Data Catalogue - part 1


This will be a brainstorming event about metadata for research data, with a particular focus on core fields for a data catalogue. We'll need to identify existing metadata sources and how local systems may feed each other. However, leaving architecture aside, it may be beneficial to discuss core metadata fields for all data types and approaches to metadata for specific data sets.






The possible benefits of a shared WR Data Catalogue have been briefly described in a previous post, White Rose Research Data Catalogue [1]. Each institution will eventually develop a Research Data Management Infrastructure (RDMI), which may involve some form of Research Data Management System (RDMS). A key component of an RDMS must be a data catalogue or metadata store, which holds details of the objects being managed by the system. The structure of the RDMS components, the metadata store configuration and the metadata standards adhered to will of course have a great bearing on interoperability, and thus on the possibility of establishing a shared catalogue. Although an institutional RDMS may need some long-term data storage capacity, for 'orphaned datasets' and other material not suitable for submission to data centres, a possible WR Data Catalogue would not need such a storage function.
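
To make the idea of a storage-free shared catalogue concrete, here is a minimal sketch in Python. It assumes each institution can expose a list of metadata records that point to wherever the data is actually held. The class, the field names and the example.org URLs are all hypothetical and are not taken from any of the systems discussed in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueRecord:
    """Metadata-only description of a dataset (illustrative field names)."""
    identifier: str   # identifier assigned by the holding institution
    title: str
    institution: str  # e.g. "Leeds", "Sheffield" or "York"
    location: str     # URL of the local RDMS or data centre holding the data


def merge_catalogues(*local_stores):
    """Combine records from several local metadata stores into one index.

    Records are keyed by (institution, identifier), so identifiers only
    need to be unique within an institution. No data files are copied.
    """
    shared = {}
    for store in local_stores:
        for record in store:
            shared[(record.institution, record.identifier)] = record
    return shared


# Hypothetical records from two local metadata stores.
leeds = [CatalogueRecord("ds-001", "Example traffic survey", "Leeds",
                         "https://example.org/leeds/ds-001")]
york = [CatalogueRecord("ds-001", "Example excavation archive", "York",
                        "https://example.org/york/ds-001")]

shared = merge_catalogues(leeds, york)
for (institution, identifier), record in sorted(shared.items()):
    print(institution, identifier, record.location)
```

Keying on the institution as well as the local identifier is one possible design choice; a shared catalogue could equally rely on globally unique identifiers such as DOIs if every record has one.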



Local RDM infrastructure and projects
The development of RDMS is already underway at Leeds. RoaDMap [2] includes WP5: Software systems and metadata. This involves assessment and deployment of RDMS options including Dataflow [3] / Databank [4]; agreement of descriptive and contextual metadata requirements for case studies (phased development of the metadata standards to be used and core metadata for all case studies); and development of case study metadata templates.
The SWORD-ARM [5] project at York is developing a semi-automatic ingest system for the ADS; the metadata required for describing datasets are outlined in 'Guidelines for Cataloguing Datasets with the ADS' [6].
WRRO and WREO are powered by EPrints 3 and are shared databases; the individual institutions do not have their own institutional repositories. WRRO is (or will be) integrated into the local Symplectic publications databases at Leeds and Sheffield. A number of institutions use the EPrints platform for an institutional data repository (see below). Although in-house developers are often required to modify EPrints for the data storage function of a data repository, less modification will be needed if only the catalogue function of EPrints is required.
Much metadata may be imported from the institutions' Research Information Management Systems (RIMS). If the three institutions' RIMS conform to the CERIF model [7], they will be more easily integrated into a Research Data Management Infrastructure (RDMI). York uses PURE, which conforms; Leeds and Sheffield use Symplectic (which now conforms to CERIF), but only for publication management, although it can be extended into a full RIMS.
Finally, a research data management project, Tools for RDM Development [8], has recently been initiated at the University of York, to improve RDM. There is the possibility of implementing Databank and integrating it with PURE and York Digital Library.



Current Institutional Data catalogues (almost) up and running

The DISC-UK DataShare project [9], a collaboration involving the universities of Edinburgh, Southampton and Oxford to investigate the accommodation of datasets in institutional repositories (IRs), developed the Metadata schema for ePrints Soton [10], a dataset metadata profile based on qualified Dublin Core. The Data Documentation Initiative (DDI) metadata scheme (for microdata and aggregate data) was considered, but the DCMI-based schema was chosen. This project eventually resulted in the establishment of two research data repositories: Edinburgh DataShare [11], based on DSpace software, and Oxford's DataBank [12], based on the Databank [4] platform. Databank was an output of the Dataflow project [3]; it was developed from Fedora Commons, with Solr implemented for indexing, and can be hosted in an external cloud or deployed on local hardware. Databank uses DC for core metadata, and users can extend the metadata to provide domain-specific ontological information and more; further information about metadata is given on the Databank [13] and Datastage [14] webpages. Oxford's research publication repository, ORA, is also based on the Fedora platform; for Databank, another instance of Fedora was developed for research data (Rice 2009) [15].
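
As a rough illustration of the "Dublin Core core plus domain-specific extension" pattern, the sketch below serialises a record to XML with Python's standard library. It is not Databank's actual format or API. The Dublin Core element namespace is the real one, but the `record` wrapper element, the extension namespace, the `instrument` field and all the values are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"           # Dublin Core elements
EXT_NS = "http://example.org/ns/domain-extension#"   # hypothetical namespace

ET.register_namespace("dc", DC_NS)
ET.register_namespace("ext", EXT_NS)


def to_xml(core, extension=None):
    """Serialise core DC fields plus optional domain-specific fields."""
    root = ET.Element("record")  # wrapper element name is an assumption
    for name, value in core.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    for name, value in (extension or {}).items():
        ET.SubElement(root, f"{{{EXT_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(
    {"title": "Example dataset", "creator": "A. Researcher", "date": "2012"},
    {"instrument": "Example spectrometer"},  # illustrative extension field
)
print(xml_text)
```

Keeping the core and extension fields in separate namespaces means a shared catalogue could read the Dublin Core elements and ignore extensions it does not understand.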

Hull's digital repository, Hydra [16], is a multipurpose repository based on Fedora Commons repository software, Solr, Ruby on Rails and Blacklight. The Hydra project recommends MODS for the basic descriptive metadata for content, and this has been modified for use at Hull (Green & Awre 2011) [17]; other Hydra repositories use other metadata schemas (Project Hydra Blog) [18].
UWE Research Data Repository [19] is an EPrints repository. EPrints was chosen because UWE already uses the platform for its IR and has the local skills to repurpose it for data (RDM-UWE 2012) [20]. The metadata scheme is based on DCMI, the standard for their IR, extended to include mandatory and optional fields based on the DataCite Metadata Schema v2.1. "Two levels of metadata are planned; the first is a basic level collected on project record entry and data deposit. An optional detailed level will conform to disciplinary and subject metadata standards" (Holliday 2012) [21].
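
A two-level scheme of this kind could be checked at deposit time along the following lines. This is a sketch only: the mandatory list follows the five properties that the DataCite schema treats as mandatory (identifier, creator, title, publisher, publication year), while the Python field names, the helper functions and the sample record are hypothetical and do not reflect UWE's implementation.

```python
# Basic level: assumed here to be the DataCite mandatory properties.
MANDATORY_FIELDS = ("identifier", "creator", "title", "publisher",
                    "publication_year")


def missing_mandatory(record):
    """Return the mandatory fields that are absent or blank, in order."""
    missing = []
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def split_levels(record):
    """Separate the basic level from the optional detailed level."""
    basic = {k: v for k, v in record.items() if k in MANDATORY_FIELDS}
    detailed = {k: v for k, v in record.items() if k not in MANDATORY_FIELDS}
    return basic, detailed


deposit = {
    "identifier": "example-0001",         # placeholder, not a real DOI
    "creator": "A. Researcher",
    "title": "Example dataset",
    "spatial_coverage": "West Yorkshire", # optional, discipline-specific
}

print(missing_mandatory(deposit))
basic, detailed = split_levels(deposit)
print(sorted(detailed))
```

Collecting only the basic level at project record entry keeps the deposit form short; the detailed level can then be added later against whichever disciplinary standard applies.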

The Open Exeter [22] project has established EDA: Exeter Data Archive [23], a DSpace-based prototype data repository.

A National Research Data Catalogue




Further afield, the Australian National Data Service (ANDS) [24] is taking a national approach to improving research data management, providing advice and tools for institutions to develop RDM policies, plans and infrastructure, similar to the activities of the DCC in the UK. ANDS has established Research Data Australia [25], a discovery service for a registry of Australian research data collections, the Australian Research Data Commons (ARDC) [26]. Records are imported from institutional metadata stores and data repositories; ARDC does not have a data storage function. ANDS requires the Registry Interchange Format for Collections and Services (RIF-CS), based on ISO 2146:2010, for the exchange of records, and provides comprehensive advice on Metadata Content Requirements [27].

The ANDS Data Capture program [28] promotes data creation and capture infrastructure that feeds into data and metadata storage facilities. It does this by developing 'pipelines' between instruments and data and metadata stores, by developing software that enables better description of data and metadata, and by feeding the resulting records into the ARDC.

The ANDS 'Seeding the commons' programme funded projects involved in developing institutional research data metadata stores - these will be described in part 2. 

Finally, an item posted on 14 September 2012 on the ANDS news page reports: "ANDS and the Ex Libris Group are pleased to announce their recent agreement to syndicate the metadata in Research Data Australia, and make it accessible to researchers through the Ex Libris portal, Primo Central" [29].


References

[1] Metadatatron Blogpost: White Rose Research Data Catalogue http://metadatatron.blogspot.co.uk/2012/09/white-rose-research-data-catalogue.html
[6] ADS - Guidelines for Cataloguing Datasets with the ADS http://archaeologydataservice.ac.uk/advice/cataloguingDatasets
[8] University of York: Tools for RDM Development http://uoy-rdmproject.blogspot.co.uk/
[9] DISC-UK DataShare project http://www.disc-uk.org/index.html
[10] DataShare Metadata Schema for ePrints Soton (ePrints 3.1) (2009) http://www.disc-uk.org/docs/ePrints_Soton_Metadata.pdf
[11] Edinburgh DataShare http://datashare.is.ed.ac.uk/
[12] Databank: Bodleian Libraries research data archival store https://databank.ora.ox.ac.uk/
[13] Databank: Metadata https://github.com/dataflow/RDFDatabank/wiki/Metadata-(how-to-label-and-find-things-in-DataBank)
[14] Datastage Metadata https://github.com/dataflow/DataStage/wiki/Metadata-(how-to-label-and-find-things-in-DataStage)
[15] Rice, R. (2009) DataShare final report http://repository.jisc.ac.uk/336/1/DataSharefinalreport.pdf
[16] Hydra: Hull's digital repository https://hydra.hull.ac.uk/
[17] Green & Awre (2011) Hydra in Hull: Final report https://hydra.hull.ac.uk/resources/hull:5231
[18] Project Hydra Blog http://projecthydra.org/design-principles-2/metadata/
[19] UWE Research Data Repository http://researchdata.uwe.ac.uk/
[20] EPrints as a data repository at UWE (WP 1&2 Stage 6) http://www2.uwe.ac.uk/services/library/using_the_library/Services%20for%20researchers/eprints-data-repository-uwe.pdf
[21] Holliday (2012) Metadata for UWE data repository. MRD Blog http://blogs.uwe.ac.uk/teams/mrd/archive/2012/08/14/metadata-for-uwe-data-repository.aspx
[22] Open Exeter project http://as.exeter.ac.uk/library/resources/openaccess/openexeter/
[23] EDA: Exeter Data Archive https://eda.exeter.ac.uk/repository/
[24] ANDS: Australian National Data Service http://www.ands.org.au/index.html
[25] Research Data Australia http://researchdata.ands.org.au/
[26] ANDS - Australian Research Data Commons http://ands.org.au/about/approach.html#ardc
[27] ANDS - Metadata Content Requirements http://ands.org.au/resource/metadata-content-requirements.html
[28] ANDS - Data capture programme http://www.ands.org.au/datamanagement/capture.html
[29] ANDS - News and events: More bang for your registration buck http://ands.org.au/news/ands-and-exlibris.html
