Re: #correspondencetables : from raw to triplets #correspondencetables

Bo Weidema

I agree with Chris here.

First, most classifications themselves contain hierarchies, typically indicated by a code convention, such as ISIC 4 "011" being a subclass of "01", and so on. These should be related by the appropriate RDF predicate.
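
As a minimal sketch of the point above, assuming that a parent class is always the longest proper prefix of a child code (true for the "01" / "011" example, but not for every classification), the hierarchy triples could be derived like this. The `ISIC4` namespace is a made-up placeholder, and `skos:broader` is only one candidate predicate; `rdfs:subClassOf` would be another.

```python
# Hypothetical namespace; the real URIs are not defined in this thread.
ISIC4 = "http://example.org/isic4/"
# skos:broader is an assumption; the choice of predicate is still open.
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"


def hierarchy_triples(codes):
    """Link each code to its longest proper prefix that is also a known code."""
    known = set(codes)
    triples = []
    for code in sorted(known):
        # Try the longest prefix first, then progressively shorter ones.
        for cut in range(len(code) - 1, 0, -1):
            parent = code[:cut]
            if parent in known:
                triples.append((ISIC4 + code, SKOS_BROADER, ISIC4 + parent))
                break
    return triples


triples = hierarchy_triples(["01", "011", "0111", "012"])
for triple in triples:
    print(triple)
```

Codes without a known prefix (here "01") simply get no parent triple, so top-level sections would need to be linked separately.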

Secondly, in each classification, each class typically has one or more human-readable labels and one or more codes. These should be related to the class by the relevant "main code", "alternative code", "main label", and "alternative label" RDF predicates.
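
A rough illustration of how codes and labels might be attached to a class. The SKOS terms `notation`, `prefLabel` and `altLabel` exist; `alternativeCode` in the `EX` namespace is a hypothetical predicate invented for this sketch, and the alternative code and label values are made up.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/bonsai/"  # hypothetical namespace


def describe_class(uri, main_code, main_label, alt_codes=(), alt_labels=()):
    """Return (subject, predicate, literal) triples for one class."""
    triples = [
        (uri, SKOS + "notation", main_code),    # "main code"
        (uri, SKOS + "prefLabel", main_label),  # "main label"
    ]
    # "alternative code": no standard SKOS term, so a placeholder is used.
    triples += [(uri, EX + "alternativeCode", code) for code in alt_codes]
    # "alternative label"
    triples += [(uri, SKOS + "altLabel", label) for label in alt_labels]
    return triples


described = describe_class(
    EX + "isic4/011",
    main_code="011",
    main_label="Growing of non-perennial crops",
    alt_codes=["A011"],                  # made-up alternative code
    alt_labels=["Non-perennial crops"],  # made-up alternative label
)
```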

When matching two classifications, one of three relations (exact match, fully contained in, or partly contained in) holds for each pair of matched classes and needs to be expressed. Example:

| Original I | Original II | Relations | Evolving BONSAI classification |
|---|---|---|---|
| 1 | A | Exact match | Preferred name of either 1 or A, if different |
| 2 | B | B fully contained in 2 (= B is sub-class of 2) | B, AND implicitly "2lessB" also exists, also being sub-class of 2. |
| 3 | C | C partly contained in 3 (= CpartOf3 is sub-class of 3) | CpartOf3, AND implicitly 3less'CpartOf3' also exists, also being sub-class of 3. |
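
The three rows above could be turned into triples roughly as follows. This is only a sketch: the class names are used as bare strings instead of URIs, the choice of `skos:exactMatch` and `rdfs:subClassOf` is an assumption, and the function name `relation_triples` is invented for illustration.

```python
EXACT = "http://www.w3.org/2004/02/skos/core#exactMatch"
SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def relation_triples(left, right, relation):
    """Triples for one row: `left` is from Original I, `right` from Original II."""
    if relation == "exact":
        return [(right, EXACT, left)]
    if relation == "fully_contained":
        # e.g. B is a sub-class of 2, and the residual "2lessB" is as well.
        residual = f"{left}less{right}"
        return [(right, SUBCLASS, left), (residual, SUBCLASS, left)]
    if relation == "partly_contained":
        # e.g. CpartOf3 is a sub-class of 3, with residual "3lessCpartOf3".
        overlap = f"{right}partOf{left}"
        residual = f"{left}less{overlap}"
        return [
            (overlap, SUBCLASS, left),
            # Inferred here, not stated in the table: the overlap is also part of C.
            (overlap, SUBCLASS, right),
            (residual, SUBCLASS, left),
        ]
    raise ValueError(f"unknown relation: {relation}")


rows = [("1", "A", "exact"), ("2", "B", "fully_contained"), ("3", "C", "partly_contained")]
all_triples = [t for row in rows for t in relation_triples(*row)]
```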

In case both original classifications are (expected to be) exhaustive of the same domain, we can deduce the relations (exact match, fully contained in, or partly contained in) from the existence (or not) of more cases of classes 2, B, 3 and C. We can also deduce that "Original II" will contain one or more classes corresponding to each of 2lessB, 3less'CpartOf3', and Cless'CpartOf3', so that all items within the domain will belong to one specific class in all classifications.
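
One possible reading of this deduction, assuming the raw correspondence table is just a list of (Original I, Original II) pairs and both classifications are exhaustive: a class that appears in only one pair on both sides is an exact match, a class that appears in several pairs contains (or overlaps) its partners. The extra classes "D", "E" and "4" below are made up to complete the example.

```python
from collections import defaultdict


def classify_pairs(pairs):
    """Deduce the relation for each (left, right) pair from how often each class recurs."""
    rights_of = defaultdict(set)
    lefts_of = defaultdict(set)
    for left, right in pairs:
        rights_of[left].add(right)
        lefts_of[right].add(left)

    relations = {}
    for left, right in pairs:
        left_splits = len(rights_of[left]) > 1   # left maps to several classes
        right_splits = len(lefts_of[right]) > 1  # right maps to several classes
        if not left_splits and not right_splits:
            relations[(left, right)] = "exact match"
        elif left_splits and not right_splits:
            relations[(left, right)] = f"{right} fully contained in {left}"
        elif right_splits and not left_splits:
            relations[(left, right)] = f"{left} fully contained in {right}"
        else:
            relations[(left, right)] = f"{right} partly contained in {left}"
    return relations


pairs = [("1", "A"), ("2", "B"), ("2", "D"), ("3", "C"), ("3", "E"), ("4", "C")]
relations = classify_pairs(pairs)
```

If either classification is not exhaustive, these counts no longer justify the conclusions, and the relation would have to be supplied by hand.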

The resulting structure should be one triple for each relation between the classes of the two classifications.

What cannot be done automatically is: 1) the choice of the preferred name of either 1 or A, if different, and 2) improvements in the human readability of new auto-generated labels.

 Best regards


On 2019-04-05 at 20:09, Chris Mutel wrote:

Thanks Miguel-

It seems clear to me that the first step should be defining the verbs we will use, and the reasons we are using these particular verbs. For example, both OWL and SKOS seem to offer similar functionality, but I am sure that some people have strong opinions on which one is preferable. We also need to set up the metadata (i.e. RDF URIs) for the level of confidence we have in the matchings: official, manual and peer reviewed, computer generated, etc.

After looking through the repo and the code Miguel posted, I think we should investigate going directly from the raw data to RDF. The intermediate step doesn't really gain us anything, and it seems a bit silly not to use the power of our RDF database when constructing these correspondences. For example, if ISIC v4 disaggregated the production of some commodities from v3, then we should be storing the region-specific production of these commodities in our database, and using these numbers to do region-specific matches. We can always construct correspondence tables from the database relatively easily afterwards.
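
To illustrate the last point, here is a small sketch of rebuilding a flat correspondence table from stored triples. It assumes the triples are available as plain Python tuples and that matches use `skos:exactMatch` or `rdfs:subClassOf`; neither assumption comes from the repo, and a real RDF database would be queried rather than iterated over.

```python
import csv
import io

EXACT = "http://www.w3.org/2004/02/skos/core#exactMatch"
SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def correspondence_csv(triples, predicates=(EXACT, SUBCLASS)):
    """Write the matching triples as CSV text, one row per relation."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "relation", "object"])
    for subject, predicate, obj in sorted(triples):
        if predicate in predicates:
            # Keep only the local name of the predicate for readability.
            writer.writerow([subject, predicate.rsplit("#", 1)[-1], obj])
    return buffer.getvalue()


table = correspondence_csv([
    ("A", EXACT, "1"),
    ("B", SUBCLASS, "2"),
    ("A", "http://example.org/other", "x"),  # unrelated triple, filtered out
])
print(table)
```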

I also think we need better vocabulary than "sameAs" when storing the label, code, and other code (because why not) from certain classification systems. Maybe we can adapt existing terms for adding more specificity.

Given this need for some fundamental research, one possible priority for the group would be to get as many metadata systems in their native form into arborist (e.g. ISIC 3, ISIC 4, HS1, NACE, NAICS, CPC). The README should also be updated to reflect the data available and the current state of the repo, especially with respect to the existing correspondence tables already available in their native form (in `raw`).
