Tuesday, 2 October 2012

The nature of data

At our second RDMRose training session we discussed which objects count as 'research data'. The consensus was that anything involved in the research cycle could be considered data, whether digital or physical, even skulls in Archaeological collections or stuffed penguins in Zoological collections.

Thinking further on this, I reckon we need to distinguish the objects that should be considered data from those that should not, by determining whether they carry information in some form of symbol system. The Wikipedia definition is very succinct: "Data are values of qualitative or quantitative variables, belonging to a set of items".

Under research data management best practice, research data collected in a non-digital format should be digitised, with sufficient metadata collected during the process. So it is becoming common practice to digitise lab-books, field notes, photographs, plans, and other objects, so that the data they contain may be curated more effectively.

Luckily for us, all digital objects can be considered data, because they consist of binary code. Digital objects may contain noise (meaningless data) as well as signal (meaningful data), and need processing to determine which is which. Information may be derived from the signal by processing, i.e. by interpreting the data. Even a digital object containing no meaningful data carries information: that the interpreting process found no meaningful data in it.

Whether a physical object can be considered data depends firstly on whether the object contains data that encodes information in some symbol system, and secondly on the reason for its creation or collection - the purpose it is put to.

1. Consider a stuffed penguin in a Zoological collection. I would argue that this is not research data, because there is no symbol system contained within or on it. The Zoological collection catalogue record for the item can be considered research data. Research data can be derived from the penguin by measuring it using instruments (tape measure, weighing scales) or by subjecting it to other processes, such as chemical or genetic analyses. Research data may also be derived from it by creating other representations of it: drawing, optical photography, X-ray photography.

2. Consider a skull in an Archaeological collection. Again, this cannot be considered research data, because there is no symbol system contained within it or on it. The Archaeological collection catalogue record for the skull can be considered research data. Research data can be derived from the skull by measuring it using instruments, by subjecting it to other processes, or by creating other representations of it.

3. Consider a skull in an Archaeological collection that has hieroglyphs carved into it. This, I suggest, may be considered data, because it contains data (the hieroglyphs) and therefore information encoded in a symbol system - though the data only becomes information if the hieroglyphs can be processed through translation. Of course, to curate this data effectively, the carved hieroglyphs would need to be photographed and/or copied in a digital format.

4. Now, a paperback book of fiction contains data (printed text) and information, if we are able to read the text. But this cannot be considered research data unless it serves a purpose in the research process. It may be considered research data if the text is being analysed for literary or sociological research, for example. In this case, representations of it may be made by digitising (where permitted) or by quotation; and the metadata describing this data will be in the form of a reference.

5. Consider the original hand-written manuscript created by the author, which was edited and published as the paperback book. This can be considered a set of data, but it is research data only if it is used by a researcher.

6. The weather cannot be considered data, of course; but measurements of wind-speed, air temperature and rainfall are.
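The two tests applied in the examples above (does the object carry a symbol system, and is it put to a research purpose) can be written down as a small decision rule. This is only an illustrative sketch: the names `Item`, `Status` and `classify`, and the yes/no flags, are my own assumptions, and real cases will need more judgement than three booleans allow.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    NOT_DATA = "not data"
    DATA = "data"
    RESEARCH_DATA = "research data"


@dataclass
class Item:
    # Hypothetical description of an object; all flags are assumptions
    # made for illustration, not an established schema.
    name: str
    is_digital: bool = False          # digital objects always count as data
    has_symbol_system: bool = False   # e.g. text, numbers, hieroglyphs
    used_in_research: bool = False    # the purpose the object is put to


def classify(item: Item) -> Status:
    """Apply the symbol-system test first, then the purpose test."""
    carries_data = item.is_digital or item.has_symbol_system
    if not carries_data:
        # A bare physical object: data can be derived from it,
        # but the object itself is not data.
        return Status.NOT_DATA
    if item.used_in_research:
        return Status.RESEARCH_DATA
    return Status.DATA


examples = [
    Item("stuffed penguin", used_in_research=True),
    Item("catalogue record for the penguin", has_symbol_system=True, used_in_research=True),
    Item("skull with carved hieroglyphs", has_symbol_system=True),
    Item("paperback novel on a shelf", has_symbol_system=True),
    Item("paperback novel under literary analysis", has_symbol_system=True, used_in_research=True),
    Item("rainfall measurements", has_symbol_system=True, used_in_research=True),
]

for item in examples:
    print(f"{item.name}: {classify(item).value}")
```

Note that the stuffed penguin comes out as "not data" even though it is used in research, because it fails the first test; it is the measurements, photographs and catalogue records derived from it that pass both.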

The most important criterion to use in assessing the need for curation will be: 'Can these data be recreated or recollected following the same research process?' Data for which the answer is no are what Jim Gray refers to as Ephemeral data, which 'cannot be reproduced or reconstructed a decade from now. If no one records them today, in a decade no one will know today’s rainfall, sunspots, ozone density, or oil price' (Gray 2002 p.1). For the above examples, so long as the Archaeological and Zoological collection items are preserved correctly (museum curation), they can be measured and photographed at any time in the future. The paperback will probably be available from a number of sources, but the original manuscript may well be unique and is therefore a priority case for curation. Weather records are also unique, being collected during a specific timespan, and so are likewise a priority case for curation.
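As a rough illustration, the recollection question can be turned into a simple check. The class `Source`, its three flags and both function names are hypothetical, chosen only to mirror the examples in this post; they are a sketch of the reasoning, not a recognised appraisal method.

```python
from dataclasses import dataclass


@dataclass
class Source:
    # All fields are illustrative assumptions.
    name: str
    time_bound: bool = False          # tied to a moment that will not recur
    original_preserved: bool = False  # physical original kept under museum curation
    widely_available: bool = False    # copies obtainable from a number of sources


def can_be_recollected(source: Source) -> bool:
    """Could the same data be gathered again by the same research process?"""
    if source.time_bound:
        # Ephemeral in Gray's sense: today's rainfall cannot be measured tomorrow.
        return False
    return source.original_preserved or source.widely_available


def is_curation_priority(source: Source) -> bool:
    return not can_be_recollected(source)


sources = [
    Source("penguin measurements", original_preserved=True),
    Source("skull photographs", original_preserved=True),
    Source("paperback text", widely_available=True),
    Source("author's manuscript"),
    Source("rainfall records", time_bound=True),
]

for source in sources:
    label = "priority" if is_curation_priority(source) else "can be recollected"
    print(f"{source.name}: {label}")
```

The manuscript is flagged as a priority here on the assumption that it is unique and not held under any preservation regime; if it were safely archived, the answer would change.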

References

Gray, J. et al. (2002) Online Scientific Data Curation, Publication, and Archiving.
http://arxiv.org/ftp/cs/papers/0208/0208012.pdf
