Monday 3 September 2012

White Rose Research Data Catalogue


Notes towards a proposal for a White Rose Research Data Catalogue

At the recent RDM Event - White Rose perspectives - 24th May 2012 [0], I was particularly interested in hearing about progress of the RoaDMaP [1] and SWORD-ARM [2] projects. These projects involve, amongst other things, piloting or implementing Research Data Management Systems (RDMS) which automate metadata capture. Metadata attribution could be considered one of the major bottlenecks in the data curation process.

Towards the end of the event I asked whether there was any interest in the idea of a ‘White Rose Data Catalogue’ amongst those present – and I was glad to find there was.  Although I have no experience of repository management, this is a field that greatly interests me and is currently a big issue in academic librarianship.


Research Data Cataloguing projects
Most of the current JISC [3] funded projects are concerned with establishing an institutional RDMS rather than a catalogue of research data. See my visualisations of JISC programmes to help determine which projects are involved (Mindmap [4] and Hypertree [5]). An institutional RDMS is the ideal situation, with research data catalogued and stored adequately as part of the research workflow, but it will only account for current and future research projects that use the RDMS. Hopefully, in the near future there will be projects developing tools for retrospective data cataloguing.

At present Southampton University seem to be leading the way with their Open Data Catalogue [6]; it is worth a look at the Southampton Data Blog [7] and the ECS Web Team Blog [8]. The emphasis of this initiative is on 'Open data' rather than 'Research data' (the original motivation of the Open Data movement involved access to Government data). There seem to be differences between the software systems favoured by the Open Data movement and those favoured by the Repository community, but Repository software (particularly ePrints) is well developed, mature and well suited to the catalogue as well as the storage function. Southampton maintains a list of Open Data Catalogues [9] which includes Leeds [10] and York [11].

Why may we want to catalogue Research Data? 
We know the need for research data curation; however, creating and maintaining a catalogue of RD is only a small part of the curation process. What would the benefits of a Research Data Catalogue (RDC) at the White Rose level be?
1.  As a WRUC research showcase – it may be possible to link the most cited research articles to the underlying data, thus adding value to both. There are a number of projects involved with developing tools for such linking; LAIRD [12], Storelink [13], DryadUK [14], 3TU.Datacentrum [15] and Utopia [16], for example.
2.  For the REF - datasets published by current staff (back to 2008) can be considered a research output for the REF assessment.
3.  To assess the need for curation of our digital assets that have yet to be managed adequately. This would help to prioritise curation of ephemeral data that 'cannot be reproduced or reconstructed a decade from now. If no one records them today, in a decade no one will know today’s rainfall, sunspots, ozone density, or oil price' (Gray 2002 p.1 [17]).
4.  We have the system and expertise already in place. The catalogue (or 'metadata store'), without the data storage function, would be quite small and easily incorporated into the WRRO [18] ePrints server. 
5.  To promote best practice in data curation – inclusion in the catalogue should encourage the adoption of best practice in ongoing and future research projects. The operation of the WRRDC could incorporate a service to help researchers prepare their datasets for admission to data centres and archives. A WRRDC record should consist of adequate metadata for submission of the dataset to a data centre.



What would a WRRDC contain?
The WRRDC would essentially be a repository, but with the catalogue function disassociated from the storage function. We require just the descriptive metadata corresponding to the data - the data itself will be held and curated elsewhere. Much of the existing published data is already curated (with corresponding catalogue records) at data centres and disciplinary repositories. Although there may well be a need for a local repository to carry out curation and storage functions for some material, these functions will be provided by a Research Data Management System (RDMS), implemented in the future.

The separation of the catalogue from the storage function is proposed in the establishment of the University of Western Sydney (UWS) Research Data Repository: 'Unlike a typical monolithic Institutional Repository (IR) the storage and catalogue services are disaggregated, because the data involved can be large and is much more varied in nature than the typical contents of an IR. Also, some data will reside in trusted data stores outside of the central storage supplied by IT. Not to mention that some of it is on paper and some on obsolete digital media.' (P. Sefton 2012 [19])

It may be necessary for each of the three institutions to maintain their own Research Data Catalogue (as part of the institution’s RDMS), but if these utilise common metadata standards, the catalogues should be easily unified. Also, the WRRDC could contain records for all research data, not necessarily Open Data – the record will include access metadata. This raises another issue - that of data ownership and IP. Can we assume that for the WRRDC, any data created at the three universities may be catalogued and included?
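To illustrate why common metadata standards would make unification straightforward, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the CatalogueRecord fields, the unify function and the identifiers in the usage note are hypothetical and are not taken from any existing system. It simply shows a record that carries descriptive and access metadata plus a pointer to where the data is held, and a merge that keeps one record per identifier.

```python
from dataclasses import dataclass


@dataclass
class CatalogueRecord:
    """Descriptive metadata only; the dataset itself is held elsewhere."""
    identifier: str              # e.g. a DOI or a local ID (hypothetical)
    title: str
    creators: list
    institution: str             # which institutional catalogue supplied the record
    location: str                # where the dataset is actually stored
    access: str = "restricted"   # access metadata, e.g. "open" or "restricted"


def unify(*catalogues):
    """Merge several institutional catalogues into one list of records.

    Records are matched on identifier (case-insensitively); the first
    record seen for an identifier is kept and later duplicates are skipped.
    """
    merged = {}
    for catalogue in catalogues:
        for record in catalogue:
            merged.setdefault(record.identifier.lower(), record)
    return list(merged.values())
```

For example, calling unify(leeds_records, sheffield_records, york_records) on three lists of CatalogueRecord objects would return a single list in which a dataset catalogued by two institutions under the same identifier appears once. A real service would need a more careful rule than 'first record wins', but the point is that a shared record structure reduces unification to a simple merge.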

How to build the WRRDC?
It would be wise to adhere to the DataCite Metadata Schema [20], as this is widely accepted. We will need to import or create the catalogue records in three phases:

1.   I'm assuming that, in the case of research data produced by future research projects, the metadata will be automatically created by the RDMS. Hopefully one of the outcomes of the RoaDMaP project will be the implementation of one such system, Dataflow [21]. Other such systems include SWORD-ARM [2], Datapool [22] (building on IDBM [23]), MiSS [24], Iridium [25] and ADMIRe [26]; these are among the 17 Research Data Management Infrastructure Projects [27] funded by JISC. There are currently several interesting RDMS projects running in Australia - for example the ANDS 'Seeding the Commons' projects at the University of Western Sydney [28] (interesting blog post here [29]), University of Melbourne [30] and the University of Wollongong [31]. ANDS have established a Research Data discovery facility 'Research Data Australia' [32]. An institutional RDMS will have a catalogue component which could feed its records into the WRRDC.

2.   Research datasets of current / completed projects that have already been submitted to discipline based repositories and data centres: these will have already been catalogued so should have adequate metadata records, which can be imported directly using the OAI-PMH protocol, or which may need transforming before ingest into the WRRDC. A long list of such data repositories may be found at the DataCite website [33] and at Databib [34].

3.   In the case of research data outputs (for current & past research projects) that have not been submitted to data centres, I will make the following assumptions:
a)   Current researchers will have a record in the institution's Research Information System (RIS), linking them to research project, department, grant and funding information and maybe Data Management Plans (or at least grant application technical appendices).
b)   Researchers will have some form of system for storing data. At the very least, this provides a filepath and file attributes. There may be procedures for storing data on institutional, departmental or project data servers; this core & technical metadata may be compiled when following these procedures and also may be embedded in data files.
c)   Researchers and/or project administration will keep records of methods of data processing, computer code and instrument settings. This contextual metadata may have been systematically recorded.
For these outputs, it would be essential to automate the compiling of metadata as far as possible, since retrospective metadata creation is very time consuming and expensive. I'm assuming that much of this information exists already, but is dispersed and maybe difficult to locate. Therefore we will require some mechanism for compiling this metadata – a sort of retrospective RDMS. Before this is developed, it will be necessary for researchers to fill out forms!
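As a rough idea of what the simplest form of automated compiling could look like, the following Python sketch drafts a catalogue record from nothing more than a filepath and its file attributes (assumption b above). The function name draft_record and the needs_review flag are hypothetical, and the field names only loosely follow DataCite property names (title, creator, publisher, publicationYear, size, format); it is not a DataCite serialiser. The creator and publisher values are assumed to come from elsewhere, for example the RIS.

```python
import os
from datetime import datetime, timezone


def draft_record(path, creator, publisher):
    """Draft a catalogue record from a filepath and its file attributes.

    The result is a starting point only: the title is the bare file name and
    the year is guessed from the last-modified time, so the record is flagged
    for review by the researcher.
    """
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    return {
        "title": os.path.basename(path),      # placeholder title
        "creator": creator,                   # supplied from elsewhere, e.g. the RIS
        "publisher": publisher,
        "publicationYear": modified.year,     # guess based on last modification
        "size": f"{info.st_size} bytes",
        "format": extension or "unknown",
        "location": os.path.abspath(path),    # the data stays where it is
        "needs_review": True,
    }
```

Even a draft like this would reduce the form-filling to checking and correcting values rather than entering them from scratch. Contextual metadata (methods, code, instrument settings) cannot be recovered from file attributes and would still have to be added by hand or from project records.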

Where to build the WRRDC?
Ideally ePrints may be used for the Data catalogue; perhaps on the same server as WRRO [18] (WRRDC would relate to WRRO in a similar way to WREO [35]). Each institution may need its own instance of an ePrints-based RDC before an RDMS is implemented. The catalogue could initially be populated by:

1.   Import of records from the RIS (creator, project and funding metadata fields), in a similar way to how a Symplectic database (such as Sheffield's 'MyPublications' [36]) is populated.
2.   Metadata harvest from discipline data repositories and data centres, searching for current researcher names as creator (and the universities as creator affiliation).
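The harvesting step could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a description of how ePrints or any data centre works: the sample response, its identifiers and the function harvest_matches are all hypothetical. It parses an OAI-PMH ListRecords response carrying Dublin Core (oai_dc) metadata and keeps only the records whose creator matches a list of staff names, which would come from the RIS. No network request is made; fetching the response from a repository is left out.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A hypothetical, heavily trimmed ListRecords response used for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Rainfall measurements</dc:title>
          <dc:creator>Smith, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sunspot counts</dc:title>
          <dc:creator>Jones, B.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_matches(xml_text, staff_names):
    """Return harvested records whose creator exactly matches a staff name."""
    wanted = {name.lower() for name in staff_names}
    root = ET.fromstring(xml_text)
    matches = []
    for record in root.iterfind(".//oai:record", NS):
        creators = [c.text.strip()
                    for c in record.iterfind(".//dc:creator", NS) if c.text]
        if wanted.intersection(c.lower() for c in creators):
            matches.append({
                "identifier": record.findtext(
                    "oai:header/oai:identifier", default="", namespaces=NS),
                "title": record.findtext(
                    ".//dc:title", default="", namespaces=NS),
                "creators": creators,
            })
    return matches
```

Calling harvest_matches(SAMPLE_RESPONSE, ["Smith, A."]) keeps the first record and drops the second. Exact string matching is a deliberate simplification: real creator names vary in form between repositories, so matching on names alone would miss some records and wrongly include others, which is one reason the creator affiliation is worth checking as well.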

Conclusions
We are in a position where the White Rose universities are establishing RDM policies and starting to implement RDM infrastructures (RDM Systems which will involve some aspect of RD catalogue). Leeds, Sheffield and York are establishing data libraries and digital asset management systems. We have considerable expertise in managing an ePrints repository. Considering these points, I would suggest that it will not be too difficult or expensive to establish a pilot WRRDC in the near future.

References
[0] White Rose Perspectives on Research Data Management - 24th May 2012, Ron Cooke Hub, University of York http://blog.library.leeds.ac.uk/info/377/roadmap/123/roadmap_events/2
[1] RoaDMaP  http://library.leeds.ac.uk/roadmap-project
[2] SWORD-ARM  http://archaeologydataservice.ac.uk/research/swordarm
[3] JISC Managing Research Data Programme 2011-13 http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata.aspx
[4] Lewis, J. - JISC Research Data Management project 'Mindmap' visualisation  http://jalewis.staff.shef.ac.uk/jisc/JISC_mindmap.html
[5] Lewis, J. - JISC Research Data Management project 'Hyperbolic tree' visualisation  http://jalewis.staff.shef.ac.uk/jisc/JISC_tree.html
[6] University of Southampton Open Data  http://data.southampton.ac.uk/
[7] Southampton Data Blog  http://blogs.ecs.soton.ac.uk/data/
[8] Southampton ECS Webteam  - Data Catalogue Interoperability Meeting  http://blog.soton.ac.uk/webteam/2011/05/05/data-catalogue-interoperability-meeting/
[9] Gutteridge, C. - Open Data from UK Academic Institutions http://hub.data.ac.uk/
[10] University of Leeds data  http://data.leeds.ac.uk/
[11] York Digital Library Data  http://dlib.york.ac.uk/yodl/app/home/data
[12] LAIRD  http://www.ed.ac.uk/schools-departments/information-services/about/organisation/edl/data-library-projects/laird
[13] Storelink  https://sites.google.com/a/staffmail.ed.ac.uk/storelink/
[14] DryadUK  http://dev.datadryad.org/dryaduk
[15] 3TU.Datacentrum  http://datacentrum.3tu.nl/en/home/
[16] Utopia  http://getutopia.com/
[17] Gray, J. et al (2002) Online Scientific Data Curation, Publication, and Archiving http://arxiv.org/ftp/cs/papers/0208/0208012.pdf
[18] White Rose Research Online  http://eprints.whiterose.ac.uk/
[19] PTSefton Blog post (2012-02-14)  http://ptsefton.com/2012/02/14/an-australian-research-data-repository.htm
[20] DataCite Metadata Schema  http://schema.datacite.org/
[22] Datapool  http://datapool.soton.ac.uk/datapool/
[23] IDBM  http://www.southamptondata.org/
[24] MiSS  http://www.miss.manchester.ac.uk/
[25] Iridium  http://research.ncl.ac.uk/iridium/
[26] ADMIRe  http://admire.jiscinvolve.org/wp/about/
[27] JISC Research Data Management Infrastructure projects http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata/infrastructure.aspx
[28] UWS eResearch Blog - RDC project post  http://eresearch.uws.edu.au/blog/2012/06/04/university-of-western-sydney-enterprise-research-data-catalogue-project/
[29] UWS eResearch Blog - Data capture post  http://eresearch.uws.edu.au/blog/2012/03/16/mixing-our-research-data-metaphors-seeding-the-commons-capturing-data-taming-wild-research-data/
[30] University of Melbourne Seeding the Commons Project  http://projects.ands.org.au/id/SC02
[31] University of Wollongong Seeding the Commons Project  http://www.uow.edu.au/research/eresearch/datamanagement/projects/UOW087138.html
[32] ANDS Research Data Australia  http://researchdata.ands.org.au/
[33] DataCite - Research Data Repository list  http://datacite.org/repolist
[34] Databib  http://databib.org/
[35] White Rose eTheses Online  http://etheses.whiterose.ac.uk/
[36] University of Sheffield RIS - MyPublications  http://www.sheffield.ac.uk/ris/post-project/mypublications