I agree with Chris here.
First, most classifications themselves contain hierarchies,
typically indicated by some code convention, such as ISIC 4 "011"
being a subclass of "01" and so on. These should be related by the
appropriate RDF predicate.
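
As a rough sketch of this first point, the parent of each code could be derived from the code convention and emitted as a triple. The use of skos:broader, the example.org namespace and the helper names are assumptions for illustration only; the choice of vocabulary is still open, and not every classification follows a pure prefix convention.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/isic4/"  # hypothetical namespace, not an official one


def parent_code(code, known):
    """Return the longest proper prefix of `code` that is itself a known code."""
    for end in range(len(code) - 1, 0, -1):
        if code[:end] in known:
            return code[:end]
    return None  # top-level code, no parent found


def hierarchy_triples(codes):
    """Build (child, predicate, parent) triples from a flat list of codes."""
    known = set(codes)
    triples = []
    for code in sorted(known):
        parent = parent_code(code, known)
        if parent is not None:
            triples.append((BASE + code, SKOS + "broader", BASE + parent))
    return triples


triples = hierarchy_triples(["01", "011", "0111", "02"])
for t in triples:
    print(t)
```

Here "011" is linked to "01" and "0111" to "011", while "01" and "02" get no parent because no shorter known code exists.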
Secondly, in each classification, each class typically has one
or more human-readable labels and one or more codes. These should
be related by relevant "main code", "alternative code", "main
label", "alternative label" RDF predicates.
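
A minimal sketch of this second point, assuming SKOS terms are used where they exist (skos:notation, skos:prefLabel, skos:altLabel). SKOS has no dedicated "alternative code" term as far as I know, so the `altCode` predicate, the example.org namespaces and all labels and alternative codes below are made-up placeholders.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/vocab#"  # hypothetical namespace for terms SKOS lacks


def describe_class(uri, main_code, main_label, alt_codes=(), alt_labels=()):
    """Return (subject, predicate, literal) triples for one class."""
    triples = [
        (uri, SKOS + "notation", main_code),    # "main code"
        (uri, SKOS + "prefLabel", main_label),  # "main label"
    ]
    triples += [(uri, EX + "altCode", code) for code in alt_codes]
    triples += [(uri, SKOS + "altLabel", label) for label in alt_labels]
    return triples


rows = describe_class(
    "http://example.org/isic4/011",
    main_code="011",
    main_label="placeholder main label",
    alt_codes=["A011"],  # made-up alternative code
    alt_labels=["placeholder alternative label"],
)
for row in rows:
    print(row)
```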
When matching two classifications, we have one of three
relations (exact match, fully contained in, or partly contained
in) that needs to be expressed. Example:

  Relation between the original classes, followed by the
  resulting entry in the evolving BONSAI classification:

  exact match (1 and A)
      name of either 1 or A, if different

  B fully contained in 2 (= B is sub-class of 2)
      B, and implicitly "2lessB" also exists, also
      being a sub-class of 2.

  C partly contained in 3 (= CpartOf3 is sub-class of 3)
      CpartOf3, and implicitly 3less'CpartOf3' also
      exists, also being a sub-class of 3.

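
The three cases in the example could be turned into triples roughly as follows. This is only a sketch: the predicates (skos:exactMatch, rdfs:subClassOf) are one possible choice among the vocabularies under discussion, the function name is hypothetical, and the generated class names follow the "2lessB" / "CpartOf3" pattern of the example, without the inner quotes.

```python
def match_triples(left, right, relation):
    """left: class in the first original, right: class in the second original."""
    if relation == "exact":
        return [(right, "skos:exactMatch", left)]
    if relation == "fully_contained":
        # right is a sub-class of left; the remainder of left is implied
        rest = f"{left}less{right}"
        return [
            (right, "rdfs:subClassOf", left),
            (rest, "rdfs:subClassOf", left),
        ]
    if relation == "partly_contained":
        # only the overlap is a sub-class of both; each side keeps a remainder
        part = f"{right}partOf{left}"
        return [
            (part, "rdfs:subClassOf", left),
            (part, "rdfs:subClassOf", right),
            (f"{left}less{part}", "rdfs:subClassOf", left),
            (f"{right}less{part}", "rdfs:subClassOf", right),
        ]
    raise ValueError(f"unknown relation: {relation}")


print(match_triples("1", "A", "exact"))
print(match_triples("2", "B", "fully_contained"))
print(match_triples("3", "C", "partly_contained"))
```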
In case both original classifications are (expected to be)
exhaustive of the same domain, we can deduce the relations (exact
match, fully contained in, or partly contained in) from the
existence (or not) of more cases of classes 2, B, 3 and C. We can
also deduce that 'Original II' will contain one or more classes
corresponding to each of 2lessB, 3less'CpartOf3', and
Cless'CpartOf3', so that all items within the domain will belong to
one specific class in all classifications.
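
A sketch of how this deduction might look, assuming both classifications are exhaustive of the same domain and that the list of overlapping class pairs is complete. The function name and the extra classes B2, D and 4 are hypothetical, added only so that each case appears.

```python
from collections import defaultdict


def deduce_relations(pairs):
    """pairs: (class_in_I, class_in_II) tuples that share at least one item."""
    pairs = set(pairs)
    partners_of_left = defaultdict(set)
    partners_of_right = defaultdict(set)
    for left, right in pairs:
        partners_of_left[left].add(right)
        partners_of_right[right].add(left)

    result = {}
    for left, right in pairs:
        left_single = len(partners_of_left[left]) == 1
        right_single = len(partners_of_right[right]) == 1
        if left_single and right_single:
            result[(left, right)] = "exact match"
        elif right_single:
            result[(left, right)] = f"{right} fully contained in {left}"
        elif left_single:
            result[(left, right)] = f"{left} fully contained in {right}"
        else:
            result[(left, right)] = "partly contained"
    return result


pairs = [("1", "A"), ("2", "B"), ("2", "B2"), ("3", "C"), ("3", "D"), ("4", "C")]
relations = deduce_relations(pairs)
for pair in sorted(relations):
    print(pair, relations[pair])
```

The pair (3, C) comes out as "partly contained" because 3 also overlaps D and C also overlaps 4; without the exhaustiveness assumption, a missing pair would silently turn such a case into a false exact match.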
The resulting structure should be a triple for each of the
relations between each classification.
What cannot be done automatically is: 1) the choice of the
preferred name of either 1 or A, if different, and 2) improvements
in the human readability of new auto-generated labels.
On 2019-04-05 at 20:09, Chris wrote:
It seems clear to me that the first step should be defining the
verbs we will use, and the reasons we are using these
particular verbs. For example, both OWL and SKOS seem to offer
similar functionality, but I am sure that some people have strong
opinions on which one is preferable. We also need to set up the
metadata (i.e. RDF URIs) for the level of confidence we have in the
matchings, either official, manual and peer reviewed, computer
After looking through the repo and the code Miguel posted, I think
we should investigate going directly from the raw data to RDF. The
intermediate step doesn't really gain us anything, and it seems a
bit silly not to use the power of our RDF database when
constructing these correspondences. For example, if ISIC v4
disaggregated the production of some commodities from v3, then we
should be storing the region-specific production of these
commodities in our database, and using these numbers to do
region-specific matches. We can always construct correspondence
tables from the database relatively easily afterwards.
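
To illustrate the region-specific idea, here is a small sketch that turns stored production amounts into per-region split shares for an old class that was disaggregated into several new ones. The function name, region names, class names and numbers are all made up; how the production data would actually be queried from the RDF database is not shown.

```python
def split_shares(production):
    """production: {region: {new_class: amount}} for classes split from one old class.

    Returns {region: {new_class: share}}, with shares summing to 1 per region.
    """
    shares = {}
    for region, amounts in production.items():
        total = sum(amounts.values())
        if total == 0:
            continue  # no production data, so no region-specific share
        shares[region] = {cls: amt / total for cls, amt in amounts.items()}
    return shares


production = {
    "region_x": {"new_a": 30.0, "new_b": 10.0},  # made-up numbers
    "region_y": {"new_a": 5.0, "new_b": 15.0},
}
shares = split_shares(production)
print(shares)
```

With these numbers the same old class splits 75/25 in one region and 25/75 in the other, which is the kind of difference a single global correspondence table would hide.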
I also think we need better vocabulary than "sameAs" when storing
the label, code, and other code (because why not) from certain
classification systems. Maybe we can adapt existing terms to
add more specificity.
Given this need for some fundamental research, one possible
priority for the group would be to get as many metadata systems in
their native form into arborist (e.g. ISIC 3, ISIC 4,
HS1, NACE, NAICS, CPC). The README should also be updated to
reflect the data available, and current state of the repo,
especially WRT the existing correspondence tables already
available in the native form (in `raw`).