Friday 26 October 2012

Metadata for a WR Data Catalogue - part 2


Data Catalogue aspects of RDM infrastructure projects 
The JISC Digital infrastructure: Research management programme (2011-13) Managing Research Data strand is supporting 17 Research Data Management Infrastructure projects [1]. Four of these, RoaDMaP, SwordARM, Open Exeter and Managing Research Data (at UWE), are referred to in part one of this post [2].
Overall there seems to be a consensus in accepting the three-tier metadata model put forward by the IDMB [3] project at Southampton (& Takeda et al 2010 [4]):-



This model has been developed through the IRIDIUM (O’loughlin 2011 [5]) and Data.Bris (Boyd 2012 [6]) projects, with the three tiers given slightly different attributes:-
1. A minimum mandatory metadata set providing core information. This could be based around a standard metadata element set (the 15 Dublin Core elements, the DataCite kernel [45] or CKAN [47]) but includes other fields such as location, access terms and conditions, and any embargo information. The top level relates to the discoverability of the resource.
2. A second mandatory layer of contextual metadata, covered by elements within the CERIF model [48]: administrative information; base entities (project, person, organisation unit, collaborators); funding information (funder, grant number); and result entities (publication, patent, research product). Ideally, much of this will be automatically harvested or fed from administrative systems.
3. Finally, a level of optional metadata providing rich, more granular, detailed information. This layer provides the discipline-related information required for reuse.
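To make the three tiers concrete, here is a minimal Python sketch of how a catalogue record might be structured. The class and field names are assumptions chosen for illustration; they are not taken from IDMB, IRIDIUM or Data.Bris, and a real implementation would follow whichever element set the institution adopts.

```python
from dataclasses import dataclass, field


@dataclass
class CoreMetadata:
    """Tier 1: minimum mandatory set, aimed at discoverability."""
    identifier: str
    title: str
    creator: str
    location: str
    access_terms: str
    embargo: str = ""  # empty string: no embargo information recorded


@dataclass
class ContextualMetadata:
    """Tier 2: mandatory contextual layer (CERIF-style entities)."""
    project: str
    funder: str
    grant_number: str
    organisation_unit: str = ""


@dataclass
class CatalogueRecord:
    """A record combining the two mandatory tiers with the optional third."""
    core: CoreMetadata
    context: ContextualMetadata
    # Tier 3: optional, discipline-specific key/value pairs.
    discipline: dict = field(default_factory=dict)

    def tiers_present(self):
        """List which tiers actually carry information."""
        tiers = ["core", "contextual"]
        if self.discipline:
            tiers.append("discipline")
        return tiers
```

Keeping the third tier as a free-form dictionary reflects the idea that discipline metadata varies too much to fix in advance, while the first two tiers are stable enough to be typed.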

Oxford University has hosted a number of JISC funded RDM projects since 2008, including EIDCSR [7], Sudamih [8], Admiral [9], Vidaas [10], and Dataflow [11]. These were concerned with research workflows, embedding preservation, developing core metadata, sharing research data in collaborative workspaces, cloud storage of data and a research data repository, all working towards the development of an integrated Research Data Service. The current project, DAMARO [12], is implementing this Research Data Service [13]; key aspects are the development of Datafinder, the Oxford University data catalogue, and Databank [14], the data repository. The Datafinder architecture (Wilson 2012) [15] will have the following characteristics:-
• OAI-PMH harvesting of data stores
• SWORD2 compliant
• CERIF compatible
• Metadata schema based on DataCite
• Interfaces directly with DataBank & ORDS (Oxford's Online Research Database Service, based on the DataStage [16] system)
• Users can register non-electronic data
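As an illustration of the OAI-PMH harvesting mentioned in the list above, the following sketch builds a ListRecords request URL. The verb and the oai_dc metadata prefix come from the OAI-PMH protocol; the base URL and the function name are hypothetical, and this is not a description of how Datafinder itself is implemented.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request for a data store.

    base_url is a hypothetical endpoint; set_spec optionally restricts
    the harvest to one OAI-PMH set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Example with an invented endpoint:
print(list_records_url("https://datastore.example.org/oai"))
```

A harvester would fetch this URL, parse the XML response and follow any resumption token until the list is exhausted.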
  

For Datafinder, a three tier metadata approach is envisaged, comprising:

Minimum core elements
Record/digital object ID
Location of dataset
Medium
Creator (if not depositor)
Creator affiliation (if not depositor)
Title
Publisher of data
Publication year
Access terms & conditions
Data owner
Access date to data
Rights for metadata
Subject

Contextual mandatory elements
Funding agency
Grant number
Project information
Last access request date
Source
Source URL
Data generation process
Why the data was generated/Abstract/Brief description
Date
Reason for embargo

Optional metadata (selection)
Co-creators/contributor
Role
Affiliation
Sub-title
Subject
Keywords
Date (other)
Language
ResourceType
AlternateIdentifier: e.g. DOI
RelatedIdentifier: e.g. DOI of publication
Size
Format
Version
Data generation process
Abstract/Brief description
Documentation 1: descriptive or contextual information about the dataset (e.g. machine settings and experimental conditions under which the data were gathered)
Documentation 2
Subject specific m.d.
Subject specific m.d.
Subject specific classification
Subject specific classification scheme
Data complying with known standards, e.g. DDI
(Rumsey 2012) [17]
The metadata will have three sources: manual entry (generally disliked and sometimes inaccurate, but capable of producing rich metadata); imported (from data capture instruments, from institutional systems such as RIMS and DMP, or from a data repository); and autogenerated by the RDMI (Rumsey 2012) [17].
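A rough sketch of how metadata from the three sources might be combined into one record is given below. The precedence order (autogenerated values first, overridden by imported values, overridden in turn by manual entry) is purely an assumption for illustration and is not stated by Rumsey; the function and field names are likewise hypothetical.

```python
def merge_sources(autogenerated, imported, manual):
    """Merge three metadata dictionaries, later sources overriding earlier.

    Returns the merged record plus a provenance map recording which
    source supplied each field. Empty values never override.
    """
    merged = {}
    provenance = {}
    ordered = (
        ("autogenerated", autogenerated),
        ("imported", imported),
        ("manual", manual),
    )
    for name, source in ordered:
        for key, value in source.items():
            if value in (None, ""):
                continue  # skip blanks so they do not erase real values
            merged[key] = value
            provenance[key] = name
    return merged, provenance
```

Recording provenance alongside the merged values makes it possible to review manually entered fields separately, given that manual entry is the source most prone to inaccuracy.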



The research data catalogue is central to the infrastructure being developed by the IRIDIUM project [18] at Newcastle, recording what data they have and making it discoverable. This will be integrated with MyProject (the Research Management System), MyImpact (Institutional publications system – equivalent to our Symplectics system) and the EPrints based IR. “The catalogue will not be a repository but rather a straight forward web-based searchable catalogue of data and that we will only collect information on data that supports publication. We have opted for this measure as we know that data supporting publication should have already been prepared (i.e. confidentiality respected through the scrubbing of data, fields marked sensibly etc) plus we feel that data is normally available at this point for peer review and as a matter of good scientific practice, so (hopefully) we’re not asking too much more from our academics to fill in data information at the same point they fill in their new publication info in our output system.” (Wood, L. 2012) [19]. A list of twelve key fields and seven further fields has been drawn up, which will be publicly or privately viewable through the catalogue interface. Again the three tier model has been adhered to: “This is quite appealing as we already collect much of the information in the first two levels through our current systems (MyProjects, e-prints and MyImpact) so the main additional input we’d be requiring from the academic would be at the third level.” (O’loughlin 2011) [20].
Interconnectivity of RDC with other elements of the RDMI is important because researchers definitely do not want to enter research project metadata more than once in multiple systems, within or outside the institution.
"This requires us to understand some of the systems the RDC may need to exchange metadata with that have existing information already entered. These could be local research group metadata catalogues, local/national repositories and other online systems" (Wood 2012) [21].

Bristol University’s Data.Bris project [22] is developing a RDMI which will integrate a new CRIS (PURE), which also provides an institutional repository, with the existing University Research Data Storage Facility (RDSF). This extends the storage facility into becoming a Research Data Repository and allows data to be published from the storage facility. The proposed architecture [23] involves the creation of a metadata store (a SPARQL 1.1 service) and will adhere to the OAI-PMH [24], OAI-ORE [25] and SWORD [26] protocols. Data.Bris has defined a minimal set of mandatory metadata to be used when depositing or publishing data (Identifier, Creators, Title, Publisher and Publication year) and is investigating which metadata elements may be created automatically and which need adding manually. Again the three tier metadata model is thought useful, especially since metadata can be pulled in from the CRIS (Boyd 2012) [27].
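A deposit check against the Data.Bris minimal mandatory set could look something like the sketch below. The five element names come from the list above; the lower-case dictionary keys and the function name are assumptions, and this is not the project's own code.

```python
# The five mandatory elements named by Data.Bris, as hypothetical keys.
MANDATORY = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_mandatory(record):
    """Return the mandatory elements that are absent or empty in a record."""
    return [name for name in MANDATORY if not record.get(name)]


# Example: a draft deposit that still lacks a publisher and year.
draft = {"identifier": "ds-001", "creators": ["A. Researcher"], "title": "Survey data"}
print(missing_mandatory(draft))
```

A check like this also helps with the question of which elements can be filled automatically: anything reported missing after the automated sources have run is what the depositor still has to add by hand.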

The Datapool project [28] follows on from the Institutional Data Management Blueprint (IDMB) project. The project will launch and populate an EPrints institutional data repository to collect and store all research data produced across disciplines within the institution, as part of the research data management infrastructure. The repository will have access to storage sufficient for local data assets and will also provide links to data held elsewhere, both externally in subject repositories and internally using other systems. The project is investigating mechanisms for transferring data and metadata into the data repository from other local data stores, and exporting data from the repository using the SWORD2 protocol. They will also use the three tier metadata model developed by IDMB  [29].
"The lesson for data repositories is clear: to capture content from data creators you must provide useful services that will become an integral part of the workflow of creating the data. It will not work to isolate particular requirements, such as records creation, from other needs such as storage services. Data does not appear with the same mode and frequency as published papers, so workflow must accommodate many different patterns. Research data is often produced by machines, so deposit workflow must allow scope for non-manual intervention" (Hitchcock 2012) [30].


For the Research Data @Essex project [31], EPrints is being used for the repository. This project is also adhering to the IDMB inspired three tier metadata model; they considered that EPrints metadata provided for levels 1 & 2, whilst level 3 ‘minutiae’ are derived by drawing from the DataCite, INSPIRE, DDI and DataShare schema. A multi-schema crosswalk was produced [32] and the metadata schema worked out based on DataCite, INSPIRE and DDI 2.1 [33].
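To show what a schema crosswalk amounts to in practice, here is a small sketch that maps local field names onto two target schemas. The local field names are hypothetical, and the table covers only four elements of DataCite and simple Dublin Core; it is an illustration of the technique and not a reproduction of the Essex crosswalk, which also covers INSPIRE and DDI.

```python
# Hypothetical local field -> element name in each target schema.
CROSSWALK = {
    "title": {"datacite": "Title", "dc": "title"},
    "creator": {"datacite": "Creator", "dc": "creator"},
    "publisher": {"datacite": "Publisher", "dc": "publisher"},
    "year": {"datacite": "PublicationYear", "dc": "date"},
}


def translate(record, schema):
    """Rename the fields of a local record for the given target schema.

    Fields with no mapping in the crosswalk are dropped.
    """
    out = {}
    for local_name, value in record.items():
        target = CROSSWALK.get(local_name, {}).get(schema)
        if target is not None:
            out[target] = value
    return out
```

Holding the mappings in one table means a new target schema can be supported by adding a column rather than writing new conversion code.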



The University of the West of England's MRD project uses a schema based on DataCite in a two tier model: 1. basic metadata; 2. detailed domain level metadata. The Hydra project uses the Fedora object model and MODS schema; both are described in part 1 [2].

The C4D project [34] aims to integrate research data metadata with CERIF CRIS metadata, developing mappings between multiple metadata standards and aiming at maximum interoperability.

The ADMIRe project [35] seems to be developing a system based on DataCite minimum mandatory metadata, with additional subject specific metadata including DDI.

KAPTUR [36] involves work integrating DataStage with EPrints, providing a structured metadata collection interface, and FigShare with EPrints, with the intention to create an API to link Figshare with an EPrints repository using the SWORD 2 protocol. The project is specifically involved with visual arts data management, so relevant metadata schema referred to [37] include the Categories for the Description of Works of Art (CDWA) [38], the VRA Core Categories [39] and the Data Dictionary – Technical Metadata for Digital Still Images (ANSI/NISO Z39.87-2006) [40].

MiSS [41] is working towards a RDMI at the University of Manchester. They are developing a system of metadata templates specific to different research domains, for use during data capture. The MiSS Baseline Requirements Report indicates that an advantage of implementing a RDMI is that automating data capture and metadata ingest from instruments reduces the need for manual metadata annotation by researchers; this benefit needs promoting to researchers. With the multitude of data sizes, different instruments and specific proprietary data and metadata formats, community input is needed to achieve integration of metadata schemas in the RDMI.

Open Exeter [42] is developing a prototype DSpace research data repository. They have surveyed post-graduates about their experiences testing the interface and metadata webform (Evans 2012) [43].

Orbital [44] is using the CKAN repository system for its data repository, integrating this with its EPrints repository, its 'Awards Management System' (RIMS) and 'ownCloud' networked storage (an ‘academic dropbox’). The project is accepting the minimum metadata requirements for DataCite [45], with agreement on the mandatory and optional attributes.

PIMMS [46] (Portable Infrastructure for the Metafor Metadata System) will refactor the Metafor metadata management tool for use in university departments. The project deals with metadata schema in the climatology domain.

Part 3 will describe the work of Australian projects in the research data catalogue / metadata stores area.

References

[1] JISC Digital infrastructure: Managing Research Data Programme 2011-13 - Research Data Management Infrastructure Projects http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata/infrastructure.aspx
[2] http://metadatatron.blogspot.co.uk/2012/10/metadata-for-wr-data-catalogue.html
[4] Data Management for All - The Institutional Data Management Blueprint project (IDMB at the 6th IDCC)  http://eprints.soton.ac.uk/169533/1/6th_international_digital_curation_conference__idmb_final_paper_revised.pdf
[7] EIDCSR http://eidcsr.oucs.ox.ac.uk/
[8] Sudamih http://sudamih.oucs.ox.ac.uk/
[9] Admiral http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL
[10] Vidaas http://vidaas.oucs.ox.ac.uk/
[11] Dataflow http://www.dataflow.ox.ac.uk/
[12] DAMARO http://damaro.oucs.ox.ac.uk/index.xml
[13] University of Oxford Bodleian Libraries - Research data services http://www.bodleian.ox.ac.uk/bdlss/research-data-services 
[14] Databank Oxford University research data repository https://databank.ora.ox.ac.uk/
[15] Wilson (2012) Infrastructure for Research Data Management at the University of Oxford. ANDS Webinar http://www.ands.org.au/events/webinars/james-wilson-jisc-webinar-slides.pdf
[16] DataStage http://www.dataflow.ox.ac.uk/index.php/datastage/ds-about
[17] Sally Rumsey (2012) Building an institutional research data management infrastructure. OR2012  http://damaro.oucs.ox.ac.uk/docs/Just%20enough%20metadata%20v3-1.pdf
[18] IRIDIUM http://research.ncl.ac.uk/iridium/
[19] Wood, L., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2012/10/02/iridium-requirements-for-a-research-data-catalogue-and-proof-of-concept-development/
[20] O'Laughlan N., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2011/12/09/195/
[21] Wood, L., IRIDIUM Blogpost http://iridiummrd.wordpress.com/2012/10/03/iridium-rdm-systemstools-connectivity-busy-researchers-dont-like-duplication-of-metadata-entry/
[22] Data.Bris project http://data.bris.ac.uk/
[23] Steer, D. - Data.Bris architecture http://data.blogs.ilrt.org/2012/02/03/data-bris-architecture/
[24]  OAI-PMH http://www.openarchives.org/pmh/
[25]  OAI-ORE http://www.openarchives.org/ore/
[26]  SWORD http://swordapp.org/
[27] Boyd, D. - Data.Bris Blog http://data.blogs.ilrt.org/category/metadata/
[28] DataPool project http://datapool.soton.ac.uk/
[29] DataPool project proposal http://datapool.soton.ac.uk/files/2011/12/University-of-Southampton-Proposal-public.pdf
[30] Hitchcock, S., DataPool Blog http://datapool.soton.ac.uk/tag/repositories/
[31] Research Data @Essex Blog http://researchdataessex.posterous.com/metadata
[32] Research Data @Essex Metadata schema crosswalk http://researchdataessex.posterous.com/metadata#
[33] RDE Metadata Profile for EPrints https://docs.google.com/open?id=0B7VJTfTg7nrrcU1WMWVEMW9tY3M
[34] C4D http://cerif4datasets.files.wordpress.com/2012/04/cris2012_35_full_paper.pdf
[35] ADMIRe http://admire.jiscinvolve.org/wp/2012/08/16/notes-from-the-2nd-datacite-workshop/
[36] KAPTUR http://www.vads.ac.uk/kaptur/outputs/Kaptur_technical_analysis.pdf
[37] KAPTUR Blogpost https://kaptur.wordpress.com/2012/06/12/raising-your-redman/
[38] CDWA http://www.getty.edu/research/publications/electronic_publications/cdwa/index.html
[39] VRA Core http://www.vraweb.org/projects/vracore4/
[40] NISO Technical Metadata for Digital Still Images http://www.niso.org/kst/reports/standards?step=2&gid=None&project_key=b897b0cf3e2ee526252d9f830207b3cc9f3b6c2c
[41] MiSS - BaselineRequirementsReport http://www.miss.manchester.ac.uk/wp-content/uploads/2012/09/MiSS-BaselineRequirementsReport-RevisedVersion-Aug2012.pdf
[42] Open Exeter http://blogs.exeter.ac.uk/openexeterrdm/
[43] Evans, J., Open Exeter Blog http://blogs.exeter.ac.uk/openexeterrdm/blog/2012/05/31/pgr-feedback-on-data-upload/
[44] Orbital Blog http://orbital.blogs.lincoln.ac.uk/
[45] DataCite Mandatory Properties http://schema.datacite.org/meta/kernel-2.2/doc/DataCite-MetadataKernel_v2.2.pdf#page=8
[46] PIMMS http://proj.badc.rl.ac.uk/pimms
[47] CKAN http://ckan.org/
[48] CERIF 1.5 - Common European Research Information Format http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF1.5_Semantics.xhtml

Wednesday 24 October 2012

Metadata for a WR Data Catalogue - part 1


This will be a brainstorming event about metadata for research data with a particular focus on core fields for a data catalogue. We'll need to identify existing metadata sources and how local systems may feed each other. However, leaving architecture aside, it may be beneficial to discuss core metadata fields for all data types and approaches to metadata for specific data sets.






The possible benefits of a shared WR Data Catalogue have been briefly described in a previous post, White Rose Research Data Catalogue [1]. Each institution will eventually develop a Research Data Management Infrastructure, which may involve some form of Research Data Management System (RDMS). A key component of a RDMS must be a Data catalogue or Metadata store, which holds details of what objects are being managed by the system. The structure of the RDMS components, the metadata store configuration and the metadata standards adhered to will of course have a great bearing on interoperability and thus the possibility of establishing a shared catalogue. Although an institutional RDMS may need to have some long term data storage capacity, for 'orphaned datasets' and other material not suitable for submitting to data centres, a possible WR Data Catalogue would not need such a storage function.



Local RDM infrastructure and projects
The development of RDMS is already underway at Leeds. RoaDMaP [2] includes WP5: Software systems and metadata. This involves assessment and deployment of RDMS options including Dataflow [3] / Databank [4]; agreement of descriptive and contextual metadata requirements for case studies (phased development of metadata standards to be used and core metadata for all case studies); and development of case study metadata templates.
The SWORD-ARM [5] project at York is developing a semi-automatic ingest system for the ADS; the metadata required for describing datasets are outlined in 'Guidelines for Cataloguing Datasets with the ADS' [6].
WRRO and WREO are powered by ePrints3 and are shared databases - the individual institutions do not have their own Institutional Repository. WRRO is (or will be) integrated into the local Symplectics Publications databases at Leeds and Sheffield. A number of institutions use the ePrints platform for an Institutional Data Repository (see below). Although in-house developers are often required to modify ePrints for the data storage function of a data repository, less modification will be required if only the catalogue function of ePrints is required.
Much metadata may be imported from the institutions' Research Information Management System (RIMS). If the three institutions' RIMS conform to the CERIF model  [7] they will be more easily integrated into a RDMI. York is using PURE which conforms; Leeds and Sheffield use Symplectics (which now conforms to CERIF) but only for publication management although it can be extended as a full RIMS.
Finally, a research data management project, Tools for RDM Development [8], has recently been initiated at the University of York, to improve RDM. There is the possibility of implementing Databank and integrating it with PURE and York Digital Library.



Current Institutional Data catalogues (almost) up and running

The DISC-UK DataShare project [9], a collaboration involving the universities of Edinburgh, Southampton and Oxford to investigate the accommodation of datasets in institutional IRs, developed the Metadata schema for ePrints Soton [10], a dataset metadata profile based on qualified Dublin Core. The Data Documentation Initiative or DDI metadata scheme (for microdata and aggregate data) was considered, but the DCMI based schema was chosen. This project eventually resulted in the establishment of two research data repositories: Edinburgh DataShare [11], based on DSpace software, and Oxford's DataBank [12], based on the databank [4] platform. The platform was an output of the Dataflow project [3]; it was developed from Fedora-commons with Solr implemented for indexing, and can be hosted within an external cloud or deployed on local hardware. Databank uses DC for core metadata, and users can extend the metadata to provide domain-specific ontological information and more; further information about metadata is given at the Databank [13] and Datastage [14] webpages. Oxford's research publication repository, ORA, is also based on the Fedora platform; for Databank, another instance of Fedora was developed for research data (Rice 2009) [15].

Hull's digital repository, Hydra [16] is a multipurpose repository based on Fedora Commons repository software, Solr, Ruby on Rails and Blacklight. MODS is recommended by the Hydra project, for the basic descriptive metadata for content, which has been modified for use at Hull (Green & Awre 2011) [17], but other Hydra repositories use other metadata schema (Project Hydra Blog) [18].
UWE Research Data Repository [19] is an EPrints repository. EPrints was chosen because they already use the platform for the IR and they have the local skills to repurpose it for data (RDM-UWE 2012) [20]. The metadata scheme is based on DCMI, the standard for their IR, extended to include mandatory and optional fields based on the DataCite Metadata schema v2.1. "Two levels of metadata are planned; the first is a basic level collected on project record entry and data deposit. An optional detailed level will conform to disciplinary and subject metadata standards" (Holliday 2012) [21].

The Open Exeter [22] project has established EDA: Exeter Data Archive [23], a DSpace based prototype data repository. 

A National Research Data Catalogue




Further afield, the Australian National Data Service (ANDS) [24] is taking a national approach to improving research data management, providing advice and tools for institutions to develop RDM policies, plans and infrastructure, similar to the activities of the DCC in the UK. ANDS have established Research Data Australia [25], a discovery service for a registry of Australian research data collections, the Australian Research Data Commons (ARDC) [26]. Records are imported from institutional metadata stores and data repositories; ARDC does not have a data storage function. ANDS requires the Registry Interchange Format for Collections and Services (RIF-CS), based on ISO 2146:2010, for exchange of records and provides comprehensive advice on Metadata Content Requirements [27].

The ANDS Data Capture program [28] promotes data creation and capture infrastructure elements that feed into data and metadata storage facilities - through the development of  'pipelines' between instruments and data and metadata storage and software that enables better description of data and metadata, and feeding of these records into the ARDC.

The ANDS 'Seeding the commons' programme funded projects involved in developing institutional research data metadata stores - these will be described in part 2. 

Finally, posted on 14 September 2012 on the ANDS news page: "ANDS and the Ex Libris Group are pleased to announce their recent agreement to syndicate the metadata in Research Data Australia, and make it accessible to researchers through the Ex Libris portal, Primo Central" [29].


References

[1] Metadatatron Blogpost: White Rose Research Data Catalogue http://metadatatron.blogspot.co.uk/2012/09/white-rose-research-data-catalogue.html
[6] ADS - Guidelines for Cataloguing Datasets with the ADS  http://archaeologydataservice.ac.uk/advice/cataloguingDatasets
[8] University of York: Tools for RDM Development http://uoy-rdmproject.blogspot.co.uk/
[9] DISC-UK DataShare project http://www.disc-uk.org/index.html
[10] DataShare Metadata Schema for ePrints Soton (ePrints 3.1) (2009) http://www.disc-uk.org/docs/ePrints_Soton_Metadata.pdf
[11] Edinburgh DataShare  http://datashare.is.ed.ac.uk/
[12] Databank: Bodleian Libraries research data archival store  https://databank.ora.ox.ac.uk/
[13] Databank: Metadata https://github.com/dataflow/RDFDatabank/wiki/Metadata-(how-to-label-and-find-things-in-DataBank)
[14] Datastage Metadata https://github.com/dataflow/DataStage/wiki/Metadata-(how-to-label-and-find-things-in-DataStage)
[15] Rice, R. (2009) DataShare final report http://repository.jisc.ac.uk/336/1/DataSharefinalreport.pdf
[16] Hydra: Hull's digital repository https://hydra.hull.ac.uk/
[17] Green & Awre (2011) Hydra in Hull: Final report https://hydra.hull.ac.uk/resources/hull:5231
[18] Project Hydra Blog http://projecthydra.org/design-principles-2/metadata/
[19] UWE Research Data Repository http://researchdata.uwe.ac.uk/
[20] EPrints as a data repository at UWE (WP 1&2 Stage 6) http://www2.uwe.ac.uk/services/library/using_the_library/Services%20for%20researchers/eprints-data-repository-uwe.pdf
[21] Holliday (2012) Metadata for UWE data repository. MRD Blog http://blogs.uwe.ac.uk/teams/mrd/archive/2012/08/14/metadata-for-uwe-data-repository.aspx
[22] Open Exeter project http://as.exeter.ac.uk/library/resources/openaccess/openexeter/
[23] EDA: Exeter Data Archive https://eda.exeter.ac.uk/repository/
[24] ANDS: Australian National Data Service http://www.ands.org.au/index.html
[25] Research Data Australia  http://researchdata.ands.org.au/
[26] ANDS - Australian Research Data Commons http://ands.org.au/about/approach.html#ardc
[27] ANDS - Metadata Content Requirements http://ands.org.au/resource/metadata-content-requirements.html
[28] ANDS - Data capture programme http://www.ands.org.au/datamanagement/capture.html
[29] ANDS - News and events: More bang for your registration buck http://ands.org.au/news/ands-and-exlibris.html

Tuesday 2 October 2012

The nature of data

At our second RDMRose training session we considered which objects could be considered 'research data'. The consensus was that anything involved in the research cycle could be considered data, whether digital or physical, even skulls in Archaeological collections or stuffed penguins in Zoological collections.

Thinking further on this, I reckon we need to qualify which objects should be considered data and which cannot, by determining whether they carry information in some form of symbol system. The Wikipedia definition is very succinct "Data are values of qualitative or quantitative variables, belonging to a set of items". 

In considering research data management best practice, research data collected in a non-digital format should be digitised and sufficient metadata collected during the process. So, it is becoming common practice to digitise lab-books, field notes, photographs, plans, and other objects, so the data they contain may be curated more effectively.

Luckily for us, all digital objects can be considered data - because they consist of binary code. Digital objects may contain noise (meaningless data) as well as signal (meaningful data) and need processing to determine what is signal and what is noise. Information may be derived from the signal, by processing (i.e. by interpretation of the data). Even a digital object containing no meaningful data, contains information - that there is no meaningful data as determined by the interpreting process.

Whether a physical object can be considered data or not, depends upon firstly, whether the object contains data that encodes information in some symbol system and secondly, the reason for its creation or collection - the purpose it is put to.

1. Consider a stuffed penguin in a Zoological collection. I would consider that this cannot be considered research data because there is no symbol system contained within or on it. The Zoological collection catalogue record for the item can be considered research data. Research data can be derived from the penguin by measuring it using instruments - tape measure, weighing scales; or by subjecting it to other processes, such as chemical or genetic analyses. Research data may be derived from it by creating other representations of it - drawing, optical photography, X-ray photography.

2. Consider a skull  in an Archaeological collection. Again this cannot be considered research data because there is no symbol system contained within it or on it. The Archaeological collection catalogue record for the skull can be considered research data. Research data can be derived from the skull by measuring it using instruments, or by subjecting it to other processes; and by creating other representations of it.

3. Consider a skull in an Archaeological collection that has hieroglyphs carved into it. This I will suggest may be considered data - because it contains data - the hieroglyphs, and therefore information encoded in a symbol system - though the data only becomes information if the hieroglyphs can be processed through translation. Of course to curate this data effectively, the carved hieroglyphs would need to be photographed and/or copied in a digital format.

4. Now, a paperback book of fiction contains data (printed text) and information, if we are able to read the text. But this cannot be considered research data unless it serves a purpose in the research process. It may be considered research data if the text is being analysed for literary or sociological research, for example. In this case, representations of it may be made by digitising (where permitted) or by quotation; and the metadata describing this data will be in the form of a reference.

5. Consider the original hand-written manuscript created by the author, which was edited and published as the paperback book. This can be considered a set of data, but can only be considered research data if used by a researcher.

6. The weather cannot be considered data, of course; but measurements of wind-speed, air temperature and rainfall are.

The most important criterion to use in assessing the need for curation will be 'Can these data be recreated or recollected following the same research process?'. This is what Jim Gray refers to as Ephemeral data, that 'cannot be reproduced or reconstructed a decade from now. If no one records them today, in a decade no one will know today’s rainfall, sunspots, ozone density, or oil price' (Gray 2002 p.1). For the above examples, so long as the Archaeological and Zoological collection items are preserved correctly (museum curation), then they can be measured and photographed at any time in the future. The paperback will probably be available from a number of sources, but the original manuscript may well be unique and therefore be a priority case for curation. Weather records will be unique, being collected during a specific timespan, therefore will also be a priority case for curation.
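The two tests I have used above (is there a symbol system, and is the object used in research?) plus the curation criterion can be written down as a small decision sketch. The function names and the category labels are my own shorthand for this post, and real cases will of course need more judgement than two yes/no questions.

```python
def classify(has_symbol_system, used_in_research):
    """Apply the two tests from this post, in order."""
    if not has_symbol_system:
        return "not data"  # e.g. the stuffed penguin or the plain skull
    if used_in_research:
        return "research data"
    return "data"  # e.g. a paperback nobody is studying


def curation_priority(can_be_recollected):
    """Ephemeral data that cannot be recreated come first."""
    return "normal" if can_be_recollected else "high"


# The examples from the numbered list, with my reading of each:
print(classify(False, True))     # stuffed penguin in a collection
print(classify(True, False))     # paperback on a shelf
print(classify(True, True))      # paperback analysed for literary research
print(curation_priority(False))  # weather measurements
```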

References

Gray, J. et al (2002) Online Scientific Data Curation, Publication, and Archiving
http://arxiv.org/ftp/cs/papers/0208/0208012.pdf