
a candidate for the most boring email i have ever sent, on geodata quality

I re-post this here, because I don't think the OGC Data Quality Working Group mailing list has open archives, and because i actually managed to connect and summarise many of the things that have been bothering me about the ISO geodata standards stack over the past few years.

If you are not obsessed with geodata/quality, please look away now.

-------- Original Message --------
Subject: Re: [DQ.wg] Feedback on ISO19157 Committee Draft
Date: Wed, 01 Sep 2010 11:44:32 +0100
From: Jo Walsh
To: dq.wg@lists.opengeospatial.org

dear Eliane, i am glad both to see your engaged comments and to have the chance to unpack my own comments, which were too terse,

On 30/08/2010 15:56, Eliane Roos wrote:
> 19157 was initially proposed by Sweden to harmonise 19113, 19114 and the
> descriptive part of 19138 ...
> ISO do not realize analysis, it waits to receive comments.

Interesting.

>> In overview, my notes on the draft often read "need to see the data"
>> and also often read "implementable?" and sometimes "ontologise".
> Sorry Jo, but I do not understand what "need to see the data" means.

Okay, here i mean, i have never understood how we can rely on metadata
for data evaluation - "is it fit for my purpose", e.g. "how well do its
properties match the properties of the data i have already got, and
respond to the functions in the tools i have already got".

There are (at least) two reasons for doing and recording quality
measurements -
1) to improve efficiency of data production process
2) to improve efficiency of data re-use and analysis

It is the second case that concerns me here; data quality measures as
hints for re-use of the data. Right now i'm looking at section G.2.2 of
the draft, which talks about "Ordering in data quality evaluation".
This begins with format consistency, e.g. processability of the data by
software tools. But this will vary across versions of schemas and
versions of software libraries. The easiest way to evaluate this is to
import the data into one's environment and see what doesn't work.
I'm also looking at the Introduction, "Therefore, it is necessary to
compare the quality of the datasets to determine which best fulfils the
requirements of the user." - this "Therefore" does not follow for me.

It may be viable for the user to accept an error threshold that is so
high, that effectively any quality level is acceptable. Here i'm
thinking of collaborative geodata resources like OpenStreetMap and
geonames.org - coverage and consistency are known to be partial, but
even for researchers, it is more important to see the data than to know
the qualities of the data. But we are a long way from NMCA world here,
and I do have more constructive things to say, so should move on.

>> Code lists and controlled vocabularies; identifiers for data
>> objects; attribute, values and relationship; a Linked Data approach
> I need simple sentence with a clear requirement. :-)

And I apparently need to dance with documents for *years*, trying to
figure out why requirements are what they are. ;)

In the UK there's serious government investment in Linked Data; there is some tension with the requirements (protocols, formats) of INSPIRE.
How can one (technically) solve the needs of the Making Public Data
Public initiative, and the UK Location Programme, at the same time?

A tack the UK Location Programme working groups have taken is to
concentrate on what looks the same. Which is
1) identifiers for datasets and dataset elements
2) the use of ontologies/schemas to create logical consistency in the
descriptions of datasets and dataset elements

(This comes from here:
http://location.defra.gov.uk/2010/05/a-guide-to-linked-data-and-the-uk-locati... )

So code lists could be presented as vocabularies, that is, each code is
a URI within a namespace, and software tools can learn about new
namespaces by getting the contents of a URI (the graph describing the
codes and the relations between them).
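As a rough illustration of a code list treated as a vocabulary, here is a minimal Python sketch. The namespace http://example.org/codelist/scope/ and the "broader" relation between codes are invented for the example; they are not taken from 19115 or any published vocabulary. The graph is held in memory here, where a real tool would get it by fetching the namespace URI.

```python
# Hypothetical namespace; a real tool would fetch this graph from the URI.
NS = "http://example.org/codelist/scope/"

# The graph a tool might receive for the namespace: each code is a URI,
# with a label and an optional link to a broader code (assumed relation).
GRAPH = {
    NS + "dataset": {"label": "dataset", "broader": None},
    NS + "series": {"label": "series", "broader": None},
    NS + "tile": {"label": "tile", "broader": NS + "dataset"},
}

def describe(code_uri, graph):
    """Return the labels from a code up through its broader codes."""
    labels = []
    seen = set()  # guards against cycles in the graph
    while code_uri is not None and code_uri not in seen:
        seen.add(code_uri)
        entry = graph[code_uri]  # KeyError: code is not in this vocabulary
        labels.append(entry["label"])
        code_uri = entry["broader"]
    return labels

print(describe(NS + "tile", GRAPH))
```

Nothing about the individual codes is hard-coded in describe(); a code added to the graph later is handled without changing the tool.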

A discussion of MD_ScopeCode in 19115 - here is the tabular structure of the code list. Where is the "authoritative" copy? (I don't know, but I think it is a table in an Appendix to an ISO data standard).
http://home.badc.rl.ac.uk/lawrence/blog/2008/03/19/the_scope_of_iso19115

The more machine-interpretable the code list, the more efficient - less
hard-coding, more forwards compatibility. The less that the data
elements are treated differently from the conceptual schemas - e.g.
everything is a thing that has a URI - the easier the data and metadata
become to work with - the less difference between data and metadata.

To look again at scope as an example (section 6.2) "the scope of the
data quality unit(s) specifies the extent, spatial and/or temporal,
and/or common characteristics that identify the data on which data
quality is to be evaluated." What are the "common characteristics" if
not a schema, e.g. an ontology? The value of having a machine-interpretable description of the ontology, in automating the
evaluation of logical consistency of a dataset, is absolute.
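To make the point about automation concrete, a minimal Python sketch follows. The schema, the attribute names and the allowed values are all hypothetical, and a plain dictionary stands in for what would really be an ontology published at a URI.

```python
# Hypothetical schema: attribute name -> set of allowed values
# (None means any value is accepted).
SCHEMA = {
    "id": None,
    "road_class": {"motorway", "primary", "secondary"},
}

def consistency_errors(feature, schema):
    """List the ways one feature departs from the schema."""
    errors = []
    for name in schema:
        if name not in feature:
            errors.append(f"missing attribute: {name}")
    for name, value in feature.items():
        if name not in schema:
            errors.append(f"unknown attribute: {name}")
        elif schema[name] is not None and value not in schema[name]:
            errors.append(f"value out of domain: {name}={value}")
    return errors

# Made-up features, only to exercise the check.
features = [
    {"id": 1, "road_class": "primary"},
    {"id": 2, "road_class": "footpath"},
    {"id": 3, "surface": "gravel"},
]
for feature in features:
    print(feature["id"], consistency_errors(feature, SCHEMA))
```

The check itself knows nothing about roads; swap the schema and the same function evaluates a different dataset.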

> an attribute level, which is a codelist ... an attribute description where
> you can list a set of features, attributes etc... as GML objects or
> characterString.
> you are really welcome to add a clear comment with a precise proposition.

So my proposition here is to make these definitions using the Linked
Data principles, rather than with GML or free text.
http://www.w3.org/DesignIssues/LinkedData.html

>> 5 ... "purpose, lineage and usage are recorded as metadata in
>> accordance with ISO 19115." A machine-interpretable description of
>> dataset provenance ...

> the standalone report will be associated with the metadata

Right, my problem here is with 19115, rather than how it is being used
in 19157 (though if the DQ elements will move from 19115 in future
revisions then perhaps there *is* an opportunity to address this).

My problem is the reliance here on human description and comprehension
for data provenance and lineage, when it seems as if a
machine-processable description of provenance would have great value -
to providers of services, to data users, and to advances in quality
evaluation. Identifying processing history - links to the source
datasets - links to the processing services that have been used to
derive works - links to the quality evaluation matrices for the source
datasets. Being able to follow the lineage back a few steps and take a
different branch (thereby assisting discovery of other datasets that
share some sources and processes, thereby in some ways look the same).
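A minimal Python sketch of what following lineage links could look like. The dataset identifiers, process names and record layout are all hypothetical; this is not a structure defined in 19115 or 19157, just one assumed way to hold the links.

```python
# Hypothetical lineage records: dataset id -> (source dataset ids, process id).
LINEAGE = {
    "roads_250k": (["roads_10k"], "generalise"),
    "roads_10k": (["survey_2009"], "digitise"),
    "rivers_10k": (["survey_2009"], "digitise"),
    "survey_2009": ([], None),
}

def ancestors(dataset, lineage):
    """All datasets reachable by following source links back from `dataset`."""
    found = set()
    stack = list(lineage[dataset][0])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(lineage[current][0])
    return found

def relatives(dataset, lineage):
    """Other datasets, not themselves ancestors, that share an ancestor."""
    mine = ancestors(dataset, lineage)
    return sorted(
        other for other in lineage
        if other != dataset
        and other not in mine
        and ancestors(other, lineage) & mine
    )

print(sorted(ancestors("roads_250k", LINEAGE)))
print(relatives("roads_250k", LINEAGE))
```

Walking back from roads_250k reaches survey_2009, and taking the other branch from there turns up rivers_10k, a dataset that shares a source and a process with it.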

So free text descriptions of lineage, processing history etc in 19115
and thereby in INSPIRE, always seemed like a cop-out to me (I heard
Harlan Onsrud make this point somewhere, too). We impose a burden of
description without a clear idea of how this description will be
usefully understood. We know we should be describing lineage of data,
but developments in research or practice have not yet given us clear
ideas about solutions (e.g. how to record dataset processing history in
such a way as one can later, working from the original sources, "play
back" the process and arrive at the same results (or not!)).

I don't understand why something half-hearted is mandated (the free
text or CharacterString version of lineage), rather than mandating
nothing, waiting for proposed solutions, and standardising after
consensus has begun to emerge - which seems to me like the appropriate
action here.

A note: the sample standalone report that Annex B says can be found in
Annex F doesn't appear in Annex F (unless i lost a few printouts).

>> Missing a discussion of generalisation and simplification of both
>> data and quality metadata - is this seen as too much to chew - if those

> This standard does not deal with metadata particularly ... but with
> quality evaluation and reporting in general. Here the aim is not to
> simplify, but to provide a quality model as broad as the multiple
> very different "real world" practices.

Something that fascinated me during the last ESDIN data quality work
package meeting i went to, was an argument about the consistency of
quality reporting after data has been generalised - are the error
measures still useful, and consistently so, after generalisation of
data? What happens when datasets have been generalised to the same
scale (and thus "look the same") but the source datasets are at quite
different scales? If one judges accuracy and consistency of the
generalised data from the quality reporting for the source data, one
can quickly and easily go wrong. It entertained me to see two serious
international experts having a muted, professional version of a
stand-up row about the effects of different sampling techniques across
generalised data. So i shouldn't be surprised, i suppose, that this
stuff is not in the standards, but i somehow am surprised to see the
issue not recognised in 19157. Your comments have helped me understand
that, if not enough people are bringing their recognition of the issue
to ISO, it doesn't belong there.

> Precise improvements proposition are really helpful.

Ground-up reform of ISO to fit the 21st century technological and
industrial context? I bet many people working in ISO want this too.
I should almost certainly shut up now.


jo