
Re: Hackathon report inputs

Rutger Schurgers
 

Hi Agneta,

 

How are you? I’ve just returned from holidays. The link you sent me has expired. I’m interested in the report, but I’m not sure whether it still makes sense to contribute. What do you think?

 

Regards,

 

Rutger

 

From: hackathon2019@bonsai.groups.io [mailto:hackathon2019@bonsai.groups.io] On Behalf Of Agneta
Sent: 11 July 2019 15:55
To: hackathon2019@bonsai.groups.io
Subject: [hackathon2019] Hackathon report inputs

 



Hackathon report inputs

Agneta
 

Hi all

I am trying to put together a summary report of the Hackathon. I have just sent an Overleaf link to all hackathon participants; please update it with a short summary of the tasks completed during and after the hackathon in your respective working group, with links to the relevant repositories.

Please also add the challenges you have faced in the respective tasks, because of which a task might be stalled. I hope this summary will be useful for reflecting on where we are and for communicating our work to others (in the form of a publication for the IJLCA). As discussed previously, all hackathon participants will be authors of this summary report, so please flood in with inputs and reviews.

P.S. Please also add your institute details alongside the author names.

Hope you are all having a good summer!

Agneta

 


[Hackathon] Schedule for regular follow up meetings #poll #schedule #followup #vote

Agneta
 

Hi everyone 

It would be great to have our follow-up meetings scheduled in advance, and if most of us could join a short 30-40 minute meeting to discuss the progress and challenges of each working group.

Until now we have been holding our meetings on Friday evenings (or as otherwise decided on Slack), which may have made it difficult for everyone to participate. I have created a poll to find out which day (and time) is most suitable for everyone.

Once determined, it will be easier to send meeting invitations in advance.

Thanks

Agneta

 



Re: #rdf #exiobase #rdfconversion

Matteo Lissandrini (AAU)
 

Hi all,

I was curious about this as well.
My understanding was that we have all the "metadata" about the types of the flow objects, the activity types, and so on, so now we just need the Python code to extract the actual numbers, right?

Thanks,
Matteo



From: hackathon2019@bonsai.groups.io [hackathon2019@bonsai.groups.io] on behalf of Agneta via Groups.Io [agneta.20@...]
Sent: Wednesday, May 01, 2019 10:19 AM
To: hackathon2019@bonsai.groups.io
Subject: [hackathon2019] #rdf #exiobase #rdfconversion



Re: How does the ontology group support the correspondence table group #rdf #correspondencetables #ontology

Miguel Fernández Astudillo
 

Hello hello

During the hackathon we talked about aggregators, but I have no idea how they are implemented. For example, Exiobase documents emissions of "HFCs", while we have characterisation factors for different types of HFCs (HFC-41, HFC-152, etc.). This requires some kind of rule to disaggregate HFC emissions into the different subclasses of HFC. How can we implement this kind of aggregation/disaggregation in BONSAI?
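For concreteness, a minimal sketch in Python of what such a rule could look like, assuming the split fractions are supplied externally (the shares below are invented placeholders, not real characterisation data):

# Hypothetical shares used to split an aggregate "HFC" emission.
HFC_SPLITS = {"HFC-41": 0.10, "HFC-152": 0.25, "HFC-134a": 0.65}

def disaggregate(amount, splits):
    """Split an aggregate emission into subclasses by fixed shares."""
    assert abs(sum(splits.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return {name: amount * share for name, share in splits.items()}

print(disaggregate(100.0, HFC_SPLITS))
# {'HFC-41': 10.0, 'HFC-152': 25.0, 'HFC-134a': 65.0}

The real question is of course where such shares would come from (region- or sector-specific data, or defaults); aggregation, the other direction, is just a sum over the subclasses.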

best,

Miguel


#rdf #exiobase #rdfconversion

Agneta
 

Dear all 

Could you please let me know if we have the Exiobase data (HSUT) in RDF format yet? We have a request for the data from the computer science department at AAU, which might be further involved in the project.
It would be great to have an overview of the current state of the database. We might be able to get additional support for coding tasks from the university.

Kind regards

Agneta


Re: #evaluation

Massimo Pizzol
 

Sorry for the delay. 11 responses, see below.

 

Worked well. We should continue doing…

 

•             I guess Slack was really easy for working with all participants across different channels. I wonder whether this will remain the point of contact as we continue working

 

•             (I was working remotely) - Keeping the level of ambition high; it gives me great motivation. - Involving people with different backgrounds and expertise. Extremely enriching; I learnt a lot just by listening and seeing how others work. - Having working groups with respective leaders and repositories. It felt good to know that somebody had a clear overview of what was needed for a specific little project and could assign specific tasks, and that the little project was contributing to something bigger. - Working on exiobase. Since it's a large database I think it is a good test case, even though working with it has some challenges for the same reason. - Using slack for chatting during the hackathon. - Using github issues as a way to set the agenda and priorities. In general the whole github setting was great. +1! - Zoom meetings on Fridays (or regularly). - Doing hackathons. More was produced in a week than in 2 months of talk via mail. We need to continue searching for funding for this.

 

•             Experts from different disciplines worked together, taking charge of activities based on their experience and knowledge. The organisers identified a set of deliverables that made it possible to take advantage of members' expertise. Also, the location was very comfortable (Barcelona); having good weather is important.

 

•             Weekly meetings; Slack; use of the GitHub project board. Remote participation meetings were needed sometimes, but not all the time (it was good to cancel some).

 

•             Communicate openly, allow a space for everyone to contribute according to their capabilities.

 

•             It went fine. Sometimes the coordination of tasks was a bit unclear, but I think that is understandable given the circumstances (some participants worked remotely). A lot of work was achieved, which is a sign that the event was productive. It should be repeated periodically, I think.

 

•             Use open source platforms like Dribdat

 

•             In-person hackathons - Ambitious goals that inspire people - Organizing before the hackathon

 

•             Morning meetings summing up work that needs to be done and status updates.

 

•             Slack, Zoom meetings at the end and beginning of the day, the Python skeleton.

 

•             We should continue working in small groups; it is easier to know what the required tasks are and how to do them.

 

 

 

Didn’t work. We should stop doing…

 

•             Probably the fact that we had too many platforms; for the first few days it took me a while to navigate all the different platforms.

 

•             (I was working remotely) - Bad audio on the zoom meetings during the hackathon, which should really be improved. A decent mic and video are highly needed. - Spreading the same discussion over multiple channels. Between slack chat, mails, and github issue comments, I spent a lot of time finding where something was discussed. Perhaps the criteria for when to write in one place or the other are not clear enough and should be specified. - I have been introduced to a lot of new tools during the process and now I experience a certain fatigue. I would be more parsimonious from now on. - Similarly, in the BONSAI github there has perhaps been an excessive proliferation of repos and issues; it should be moderated a bit, or at least the priorities made clear.

 

•             Short time between the invitation to participate and the start of the hackathon; I was full of activities in that period and did not have enough time to understand my part well. Ideally, the next call could go out at least two months in advance instead of one.

 

•             Discussions on GitHub issues were sometimes held in parallel with Slack.

 

•             Very long lunch breaks

 

•             Working remotely myself, I realized it was difficult at times. Next time, we should emphasize the importance of having everyone physically present. Also, because of remote workers, some time was lost on online meetings, etc.

 

•             Don't stop

 

•             Separate website for hackathon progress tracking (hackathon.b.u) - In-person communication that resolved problems but was not put on the mailing list or Slack left outside people wondering what we were working on, what we had decided, etc. We could have a dedicated facilitator for this.

 

•             2-hour lunch breaks without discussing the proposed lunch discussion topics

 

•             Lunches were nice, but lunch time took too long.

 

•             In some parts, it was difficult to understand the full picture of how the different groups interlink.

 

 

Was missing. We should start doing...

 

 

•             Not missing as such, but everyone must keep a check on general housekeeping. With so much happening on each channel, proper documentation is a must to keep all participants on board

 

•             (I was working remotely) - I missed some more presentations. We focused on doing, doing, doing but not so much on updating the others. There were some short wrap-ups, but perhaps given people's different backgrounds this was not sufficient (my impression at least). I easily lost the overview of what was going on and what others had achieved. It would have been nice to see some screenshots of what was done with a short explanation. For example: instead of just saying "the code XXX for doing YYY is out", share the screen, run the code and explain briefly the working principle and output, then remind everyone how it fits in the big picture. It takes more time, but at least we are all on the same page. - Similarly, I missed some housekeeping (for reproducibility) and pedagogic approaches to documentation. - Along the same line, I think we could have been more productive if we knew better what the skills of the various people are, because then e.g. the WG leader knows whom to ask a specific question or for a specific task. (A skill mapping survey was circulated, now that I think about it, but I haven't seen the results.) - An option to call others. It was just easier to talk to somebody and get a quick explanation of what to do instead of chatting. - OK, this might seem stupid, but I would have preferred to have a PC in the room with the zoom meeting running continuously, or a webcam, just to understand where people are physically and what they are doing. From the slack chat alone I got the idea you were in different rooms not talking to each other. - Outreach, plus applications for funding based on this, scientific papers, and conference presentations.

 

•             More remote participation. One group of remote participants worked really well, I think because they were able to understand their tasks and identify the right way to communicate with other members. But there were also remote participants who could not identify tasks, and their potential contribution was diluted, which is a pity. So, better coordination with remote participants should be considered.

 

•             Remote participation : tricky to say what was missing; hologram for remote participants ;-)

 

•             Set some minor milestones during the hackathon to track actual progress. Stop a little earlier and *force* everyone to document their daily work (especially those working remotely).

 

•             Maybe we should set less ambitious goals next time, to favor work of better quality, as opposed to quantity.

 

•             Make it easier to contribute remotely through tidbit tasks

 

•             Could have separate smaller groups, each focused on a topic, in different locations. There weren't - The participant pool could be smaller, with a specific focus (model development, coding, documentation, ontology development) - Time for exercise/quiet alone time

 

•             celebrating milestones or other achievements (minor or amazing)

 

•             Consider existing power relationships within bonsai (e.g. supervisor/student, employer/employee). This comment is not only for the hackathon, but general to the bonsai project. Those in positions of power can (inadvertently) silence disagreement, which is not good. We should be aware of this when communicating and discussing.

 

•             I have no idea. I think that overall things are moving forward.

 

 

From: <hackathon2019@bonsai.groups.io> on behalf of "Massimo Pizzol via Groups.Io" <massimo@...>
Reply-To: "hackathon2019@bonsai.groups.io" <hackathon2019@bonsai.groups.io>
Date: Wednesday, 3 April 2019 at 08.43
To: "hackathon2019@bonsai.groups.io" <hackathon2019@bonsai.groups.io>
Subject: Re: [hackathon2019] #evaluation

 

Just a reminder about filling in the evaluation form. Only 5 have responded so far.

 

I guess everybody is taking a breath after the full immersion of the hackathon…but evaluation is important for improving the process, and I believe the organisers would really appreciate feedback.

 

I will leave the form open for responses until Friday, April 5th, at 12:00, and then upload the results to this discussion forum.

 

BR
Massimo

 

From: <hackathon2019@bonsai.groups.io> on behalf of "Massimo Pizzol via Groups.Io" <massimo@...>
Reply-To: "hackathon2019@bonsai.groups.io" <hackathon2019@bonsai.groups.io>
Date: Friday, 29 March 2019 at 14.25
To: "hackathon2019@bonsai.groups.io" <hackathon2019@bonsai.groups.io>
Subject: [hackathon2019] #evaluation

 

Dear all

 

Here is a simple evaluation form for the Hackathon.

 

It’s anonymous.


BR
Massimo


Re: #correspondencetables : from raw to triplets #correspondencetables

Miguel Fernández Astudillo
 

Interesting, I will have a deeper look when possible.

I was updating the group readme. Should I move the references to the Hackathon somewhere else? It seems that this repo will survive and will have a function in the workflow.

Miguel

-----Original Message-----
From: hackathon2019@bonsai.groups.io <hackathon2019@bonsai.groups.io> On Behalf Of Chris Mutel
Sent: 09 April 2019 13:44
To: hackathon2019@bonsai.groups.io
Subject: Re: [hackathon2019] #correspondencetables : from raw to triplets



Re: #correspondencetables : from raw to triplets #correspondencetables

Chris Mutel

As we are not the only people thinking about these topics, there has already been a lot of work in this area. It is relatively easy to find some half-baked implementations in RDF, e.g. on datahub.io and joinedupdata.org, and the unstats web page Miguel linked is great. However, the best resource I have found is here: http://semstats.org/2016/challenge/classifications, with the actual data available here: http://semstats.org/2016/challenge/challenge-data. The repo to generate these correspondences is https://github.com/FranckCo/Stamina, with documentation here: https://github.com/FranckCo/Stamina/blob/master/doc/content.md.

This data was produced by a project whose website is currently down (stamina-project.org); the easiest alternative would be to work with the original creator, but it doesn't look like he is responding to issues (I am also writing to the creator). There are a few other things to clean up in this data, see e.g. https://github.com/FranckCo/Stamina/issues/11 (and others).

Not sure about the next steps, except that I don't think we can create a better wheel than the professionals already have. Maybe we can polish their wheel a bit, and use it?

On Mon, 8 Apr 2019 at 17:26, <miguel.astudillo@...> wrote:

--
############################
Chris Mutel
Technology Assessment Group, LEA
Paul Scherrer Institut
OHSA D22
5232 Villigen PSI
Switzerland
http://chris.mutel.org
Telefon: +41 56 310 5787
############################


Re: #correspondencetables : from raw to triplets #correspondencetables

Miguel Fernández Astudillo
 

Hello hello

Let's see if I am getting this right.

Chris, when you say "put metadata systems in their native form into arborist (e.g. ISIC 3, ISIC 4, HS1, NACE, NAICS, CPC)", does that mean "as downloaded"? Are we talking about the "list of possible names" (e.g. the files under "codes and descriptions" at https://unstats.un.org/unsd/classifications/business-trade/correspondence.asp#correspondence-head, such as "ISIC_Rev_4_english_structure.txt")? If so, I would put in only the needed ones; I don't think we need "HS1988".

Would the point be to create the URIs, e.g.

<http://rdf.bonsai.uno/activitytype/isic_v4section/>:Manufacturing a bont:ActivityType

and later move to the "official" (= ready to use?) correspondence tables, specifying predicates?

To make use of the existing correspondence tables, I think we would need "exiobase2 to exiobase3"; otherwise they are completely disconnected from the (core?) of the database.

best, Miguel

PS: I think a getting started guide is urgently needed, I am getting lost already!


Re: #correspondencetables : from raw to triplets #correspondencetables

Matteo Lissandrini (AAU)
 

In this case your example seems fine; you can probably still say that fbcl is a subclass of POWC.
You can also say sameAs between POWN and Nuclear, assuming that the only way of producing electricity from nuclear energy is by fission (in contrast to fusion?).

rdf:type doesn't apply when matching different activity types.
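For example, a minimal sketch of this modelling with rdflib (assuming the package is installed; the URIs are taken from Chris's example further down this thread):

# Model the links with rdfs:subClassOf and owl:sameAs instead of SKOS matches.
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS, OWL

EX = Namespace("http://rdf.bonsai.uno/activitytype/exiobase3_3_17/")
EN = Namespace("http://rdf.bonsai.uno/activitytype/entsoe/")

g = Graph()
g.add((EN.fbcl, RDFS.subClassOf, EX.A_POWC))  # lignite is a kind of coal power
g.add((EN.nuke, OWL.sameAs, EX.A_POWN))       # exact identity, with the caveat above
print(g.serialize(format="turtle"))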

________________________________________
From: hackathon2019@bonsai.groups.io [hackathon2019@bonsai.groups.io] on behalf of Chris Mutel via Groups.Io [cmutel=gmail.com@groups.io]
Sent: Monday, April 08, 2019 4:08 PM
To: hackathon2019@bonsai.groups.io
Subject: Re: [hackathon2019] #correspondencetables : from raw to triplets



Re: #correspondencetables : from raw to triplets #correspondencetables

Chris Mutel

Thanks Matteo-

It is a bit tricky keeping the class definitions and instances in line with the idea of `rdf:type` referring to multiple classes - could you provide an alternative implementation of the example?

BTW, "Nuclear" is the label ENTSO-E uses in its API, short for "production of electricity using nuclear fission".

On Mon, 8 Apr 2019 at 15:57, Matteo Lissandrini (AAU) <matteo@...> wrote:



--
############################
Chris Mutel
Technology Assessment Group, LEA
Paul Scherrer Institut
OHSA D22
5232 Villigen PSI
Switzerland
http://chris.mutel.org
Telefon: +41 56 310 5787
############################


Re: #correspondencetables : from raw to triplets #correspondencetables

Matteo Lissandrini (AAU)
 

Hi Chris,
have you checked the very useful examples here: https://www.w3.org/2006/07/SWD/SKOS/skos-and-owl/master.html

In general, let's use subClassOf and rdf:type when we know something is a subset or an instance of something else, and let's use SKOS for "fuzzy" concepts.

ActivityTypes are classes, so you can say that something is a subclass of a specific activity type.

I'm not sure what should be just "Nuclear" in your model.

About automatic tools: usually they introduce uncertainty, but above all, they require an initial ground truth, otherwise we cannot tell whether they are doing what we want them to do.

We do not yet have a first full version of the BONSAI data and system; trying to address automatic data cleaning etc. is more likely to introduce noise and slow down the project.
So I would say, let's get done with an MVP (minimum viable product) with some manual work that assures the highest quality and control (we can limit it to just a portion of the tables).
Later on I will be happy to help you investigate more automatic tools, but I would do this only once we are able to compare against something we know to be right.


Cheers,
Matteo

---
Matteo Lissandrini

Department of Computer Science
Aalborg University

http://people.cs.aau.dk/~matteo

From: hackathon2019@bonsai.groups.io [hackathon2019@bonsai.groups.io] on behalf of Chris Mutel via Groups.Io [cmutel@...]
Sent: Monday, April 08, 2019 2:03 PM
To: hackathon2019@bonsai.groups.io
Subject: Re: [hackathon2019] #correspondencetables : from raw to triplets




Re: #correspondencetables : from raw to triplets #correspondencetables

Chris Mutel

@Matteo, Bo, Miguel; please comment and correct!

Defining correspondence tables in RDF

Based on my reading of https://www.w3.org/TR/skos-reference/, I created the following:

@prefix bont: <http://ontology.bonsai.uno/core#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://rdf.bonsai.uno/activitytype/exiobase3_3_17/A_POWC> a bont:ActivityType ;
    skos:prefLabel "Production of electricity by coal" ;
    skos:altLabel "A_POWC" ;
    skos:narrowMatch <http://rdf.bonsai.uno/activitytype/entsoe/fbcl> .

<http://rdf.bonsai.uno/activitytype/entsoe/fbcl> a bont:ActivityType ;
    skos:prefLabel "Fossil Brown coal/Lignite" ;
    skos:broadMatch <http://rdf.bonsai.uno/activitytype/exiobase3_3_17/A_POWC> .

<http://rdf.bonsai.uno/activitytype/exiobase3_3_17/A_POWN> a bont:ActivityType ;
    skos:prefLabel "Production of electricity by nuclear" ;
    skos:altLabel "A_POWN" ;
    skos:exactMatch <http://rdf.bonsai.uno/activitytype/entsoe/nuke> .

<http://rdf.bonsai.uno/activitytype/entsoe/nuke> a bont:ActivityType ;
    skos:prefLabel "Nuclear" .

This exercise has been very helpful for me, as it helped build a mental model of how to express hierarchical relations, codes, etc. I have surely made mistakes, though!

Outstanding questions:

1. It is unclear to me whether or not `narrowMatch` and `broadMatch` are transitive.
2. Do we need to declare `narrowMatch` and `broadMatch`?
3. Can we drop `rdfs:label` completely in favor of `skos:prefLabel`?
4. Do we agree on using `skos:altLabel` for codes?
5. Partial overlaps, as mentioned by Bo. There are possibilities to describe this in SKOS, but I don't know what approach is best.

Next steps for correspondence tables repo

I still think that the first step should be getting all the basic data (labels, codes, and URIs) into arborist, followed by the official correspondence lists using the above format. The example that Miguel posted should never be needed (A -> C, when we know A -> B and B -> C), as we should be able to get this transitive relationship "automatically" through SPARQL queries (and we need to learn how to write these queries in any case).
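For example, a first sketch of such a query using rdflib (assuming the example above is saved as correspondence.ttl, a hypothetical file name; SPARQL 1.1 property paths do the chaining for us):

from rdflib import Graph

g = Graph()
g.parse("correspondence.ttl", format="turtle")

# skos:broadMatch+ follows one or more broadMatch links, so if A -> B and
# B -> C are stated, the pair (A, C) is returned without ever being stored.
q = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?narrow ?broad WHERE { ?narrow skos:broadMatch+ ?broad . }
"""
for narrow, broad in g.query(q):
    print(narrow, "is (transitively) narrower than", broad)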

We can then proceed with our own self-generated correspondences; there are a number of libraries to help with this besides fuzzywuzzy (though it does have the best name :)

https://recordlinkage.readthedocs.io/en/latest/about.html
https://github.com/dedupeio/dedupe
https://github.com/kvh/match
https://pypi.org/project/py_entitymatching/


Some research and trial phases would be necessary before picking any particular approach.



Re: #correspondencetables : from raw to triplets #correspondencetables

Bo Weidema
 

I agree with Chris here.

First, most classifications themselves contain hierarchies, typically indicated by some code convention, such as ISIC 4 "011" being a subclass of "01" and so on. These should be related by the appropriate RDF predicate.

Secondly, in each classification, each class typically has one or more human-readable labels and one or more codes. These should be linked via relevant "main code", "alternative code", "main label", and "alternative label" RDF properties.
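For example, a minimal sketch of deriving these within-classification hierarchy links from the codes themselves (the codes below are an illustrative subset of ISIC 4):

# A code's parent is its longest proper prefix that is itself a valid code.
codes = {"01", "011", "0111", "012", "02"}

def parent(code, valid):
    for i in range(len(code) - 1, 0, -1):
        if code[:i] in valid:
            return code[:i]
    return None

for c in sorted(codes):
    p = parent(c, codes)
    if p:
        print(c, "is a subclass of", p)  # e.g. 011 is a subclass of 01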

When matching two classifications, we have one of three relations (exact match, fully contained in, or partly contained in) that need to be expressed. Example:

Original I | Original II | Relation | Evolving BONSAI classification
1 | A | Exact match | Preferred name of either 1 or A, if different
2 | B | B fully contained in 2 (= B is a sub-class of 2) | B, AND implicitly "2lessB" also exists, also being a sub-class of 2
3 | C | C partly contained in 3 (= CpartOf3 is a sub-class of 3) | CpartOf3, AND implicitly 3less'CpartOf3' also exists, also being a sub-class of 3

In case both original classifications are (expected to be) exhaustive over the same domain, we can deduce the relations (exact match, fully contained in, or partly contained in) from the existence (or not) of further cases of the classes 2, B, 3 and C. We can also deduce that "Original II" will contain one or more classes corresponding to each of 2lessB, 3less'CpartOf3', and Cless'CpartOf3', so that all items within the domain belong to one specific class in each classification.

The resulting structure should be a triple for each of the relations between each pair of classifications.
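A sketch of this deduction, assuming all we have is the list of raw correspondence pairs between two exhaustive classifications (the pairs below are schematic):

from collections import defaultdict

pairs = [("1", "A"), ("2", "B"), ("2", "B2"), ("3", "C"), ("4", "C")]

left, right = defaultdict(set), defaultdict(set)
for i, j in pairs:
    left[i].add(j)
    right[j].add(i)

for i, j in pairs:
    if len(left[i]) == 1 and len(right[j]) == 1:
        rel = "exact match"
    elif len(right[j]) == 1:   # j maps only to i, but i covers more than j
        rel = j + " fully contained in " + i
    else:                      # j also maps to other classes on the left
        rel = j + " partly contained in " + i
    print(i, j, "->", rel)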

What cannot be done automatically is: 1) the choice of the preferred name of either 1 or A, if different, and 2) improvements in the human readability of new auto-generated labels.

 Best regards

Bo

On 2019-04-05 at 20.09, Chris Mutel wrote:



Re: #correspondencetables : from raw to triplets #correspondencetables

Chris Mutel

Thanks Miguel-

It seems clear to me that the first step should be defining the verbs we will use, and the reasons we are using these particular verbs. For example, both OWL and SKOS seem to offer similar functionality, but I am sure that some people have strong opinions on which one is preferable. We also need to set up the metadata (i.e. RDF URIs) for the level of confidence we have in the matchings, whether official, manual and peer-reviewed, computer-generated, etc.

After looking through the repo and the code Miguel posted, I think we should investigate going directly from the raw data to RDF. The intermediate step doesn't really gain us anything, and it seems a bit silly not to use the power of our RDF database when constructing these correspondences. For example, if ISIC v4 disaggregated the production of some commodities from v3, then we should be storing the region-specific production of these commodities in our database, and using these numbers to do region-specific matches. We can always construct correspondence tables from the database relatively easily afterwards.

I also think we need better vocabulary than "sameAs" when storing the label, code, and other code (because why not) from certain classification systems. Maybe we can adapt existing terms to add more specificity.

Given this need for some fundamental research, one possible priority for the group would be to get as many metadata systems as possible in their native form into arborist (e.g. ISIC 3, ISIC 4, HS1, NACE, NAICS, CPC). The README should also be updated to reflect the data available and the current state of the repo, especially with respect to the existing correspondence tables already available in native form (in `raw`).


Re: #correspondencetables : from raw to triplets #correspondencetables

romain
 

In the future, should manual string matching become too time-consuming, you might consider getting some help from fuzzy string comparison (https://github.com/seatgeek/fuzzywuzzy)
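For example, a minimal sketch (assuming the fuzzywuzzy package is installed; the labels are invented examples):

from fuzzywuzzy import process

exiobase_labels = ["Production of electricity by coal",
                   "Production of electricity by nuclear"]

# Best candidate plus a 0-100 similarity score; below some threshold one
# would fall back to manual matching.
match, score = process.extractOne("Fossil Brown coal/Lignite", exiobase_labels)
print(match, score)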

/Romain


#correspondencetables : from raw to triplets #correspondencetables

Miguel Fernández Astudillo
 

Dear all

 

During the Friday meeting we discussed how to get from “raw data” to the data stored in the database. There was some disagreement, and I think I did not explain myself clearly, so I will try to do so in this email.

 

As I see it, raw data will come in a variety of formats: some will be csv, others xlsx or txt. Some will be existing correspondence tables that we use as intermediate steps to build a different one; or we may create other correspondence tables by “joining” two different classifications. The raw data should be stored when possible, to avoid breaking the system when data is no longer available or is slightly modified.

 

As I see it, it all starts with the dirty job of data cleaning. Data cleaning should be scripted so it can be reproduced easily, avoiding any manual steps. But it can hardly be generalised, and it will be very specific to the tables being created. It will also need to be adapted whenever data providers change the way they output their data. This data-cleaning process should produce a csv* that can be more easily “digested” by other functions, e.g. to add a predicate or a weighting factor. This fits with the recommendations on reproducibility (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285): “record intermediate results in standardized formats”. From the cleaned data we can create a table with subject-predicate-object and maybe some weighting (all with a descriptor of the metadata). This curated info should be (in my opinion) what is “consumed” by arborist (see issue #4).
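As a sketch of that last step with pandas (the file name, column names, and prefixes are hypothetical):

import pandas as pd

# A cleaned intermediate csv with columns: src_code, dst_code
clean = pd.read_csv("clean/exiobase_to_isic.csv")

triples = pd.DataFrame({
    "subject": "exiobase:" + clean["src_code"].astype(str),
    "predicate": "skos:exactMatch",  # or another verb, chosen per row
    "object": "isic:" + clean["dst_code"].astype(str),
})
triples.to_csv("output/exiobase_to_isic_triples.csv", index=False)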

 

Here is an example of (trying) to create two different correspondence tables, just to illustrate how different they can be from one another.

 

https://github.com/BONSAMURAIS/Correspondence-tables/blob/master/scripts/from_raw_to_clean_tables.ipynb

 

Enjoy the weekend!

 

Miguel

 

*For the three different ways of naming the same activities/flows in Exiobase, I think there should be 3 tables with a “same as” (?) predicate.

 

 

 


#rdf #rdfframework #followup #ontology

Matteo Lissandrini (AAU)
 

Hi all,

unfortunately I will not be able to join the call tomorrow.

Here is a quick update on my side:

- The ontology has been pretty stable since the hackathon, although there are open issues for improvements
- The triplestore with Docker is working properly, with password authentication; an open issue is to configure it with portainer.io (I've asked Tomas to help)
- The importer module supports the core upload function from .ttl files; the module still needs tests, documentation, and configuration for CI/CD (any help is welcome)
- I've seen some replies about the "call for data"; if I understood correctly, we are only waiting for bentso and Exiobase flows (not flow objects) to be converted into RDF?
- I'm willing to help with the setup of the workflow that materializes the various RDF datasets into the triplestore

Please let me know if you have any comment/question/feedback, and if possible keep me posted with the outcome of the meeting.

Thanks a lot,
Matteo


Re: #correspondencetables - Getting to 1.0 #correspondencetables

Miguel Fernández Astudillo
 

Dear all

 

An update on this: yesterday I created a repo, following the python skeleton, for the functions that will automate part of the workflow. It's called grafter. I have not had time to work on the functions yet.

 

https://github.com/BONSAMURAIS/grafter

 

see you tomorrow,

 

Miguel

 

 

 

From: hackathon2019@bonsai.groups.io <hackathon2019@bonsai.groups.io> On Behalf Of Chris Mutel
Sent: 01 April 2019 12:54
To: hackathon2019@bonsai.groups.io
Subject: [hackathon2019] #correspondencetables - Getting to 1.0

 

Dear all-

I am happy that there are a number of people participating here, and I think we have everything ready for assembly into a 1.0 version of this package. However, from reading these emails and looking at the repo itself, it seems like a little organization and goal-setting could help move this project forward. Here are some suggestions:

1. The goal and capabilities (user stories) for 1.0 should be clearly defined. Some possibilities:
- Python package that provides for trivial application of correspondence tables. As a BONSAI user, I want to be able to call `correspondence(data, field_identifier, table_name, aggregation_func, disaggregation_func)` and get my `data` updated automatically (a rough sketch follows after this list).
- All output correspondence data should be provided in a 3-column format, with the third column being the SKOS verb. Maybe a fourth column is needed for dis/aggregation weights.
- All output correspondence data should have metadata in DataPackage form

1.5 If a system uses multiple identifiers (e.g. exiobase), all identifiers should be in their own columns, as at some point each one will be needed.

2. This should be a python package based on the python skeleton. Being a python package would provide structure so that people would know what goes where. However, not every directory would need to be included in the python library itself. Instead, you could have this structure:

correspondence (python library code here)
    python code to do matching
    output
        csv and json files
        autogenerated index.html which lists all files and their descriptions
raw (input data in original downloaded form)

Of course, other models are possible...

3. The RDF vocabulary terms needed should be identified and documented in the README

4. RDF terms should be computed automatically from the correspondence tables, perhaps with a bit of manual intervention. The default should probably not be an exact match, but this would be configurable. In general it should be possible to map N-1 relations with one term, 1-1 with another, etc. without having to have a person go through long lists.
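To make the user story in point 1 concrete, here is a rough sketch of what `correspondence` could look like; everything below is a proposal built on pandas, not existing code, and the table registry and "value" column are assumptions:

import pandas as pd

# Hypothetical in-memory registry of 3-column correspondence tables
# (columns: source, target, verb), as proposed above.
TABLES = {}

def correspondence(data, field_identifier, table_name,
                   aggregation_func="sum", disaggregation_func=None):
    table = TABLES[table_name]
    merged = data.merge(table, left_on=field_identifier, right_on="source")
    # N-1 relations collapse several sources onto one target, so aggregate;
    # 1-N relations would additionally need disaggregation_func and weights.
    return merged.groupby("target", as_index=False).agg({"value": aggregation_func})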

I would be happy to help with specific technical implementations of any of these tasks.

Who is now coordinating this working group? Could you please update issue #3 to show the current status and short-term plans?