Miguel Fernández Astudillo
During Friday's meeting we discussed how to get from “raw data” to the data stored in the database. There was some disagreement, and I think I did not explain myself clearly, so I will try to do so in this email.
As I see it, raw data will come in a variety of formats: some will be csv, others xlsx or txt. Some will be existing correspondence tables that we use as intermediate steps to build a different one, and we may also create new correspondence tables by “joining” two different classifications. The raw data should be stored whenever possible, so that the system does not break when the data is no longer available or has been slightly modified.
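A minimal sketch of what storing the raw data could look like, assuming we simply keep a local copy of each downloaded file and name it after its checksum, so that a slightly modified file from the provider shows up as a new copy instead of silently replacing the old one. The function name `archive_raw` and the folder layout are hypothetical, not something we have agreed on.

```python
import hashlib
import shutil
from pathlib import Path


def archive_raw(source, archive_dir):
    """Copy a raw file into archive_dir, prefixed with part of its SHA-256.

    Hypothetical helper: returns the archived path and the full digest.
    """
    source = Path(source)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # The digest depends only on the file content, so an unchanged file
    # always maps to the same archived name.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    target = archive_dir / f"{digest[:12]}_{source.name}"

    # Only copy if this exact content has not been archived before.
    if not target.exists():
        shutil.copy2(source, target)
    return target, digest
```

The cleaning scripts would then read from the archived copy rather than from the provider's URL, and the digest can be recorded in the metadata of the resulting table.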
As I see it, it all starts with the dirty job of data cleaning. Data cleaning should be scripted so that it can be reproduced easily, avoiding any manual steps. But it can hardly be generalised, and it will be very specific to the tables being created. It will also need to be adapted over time, because data providers will change the way they output their data. This data cleaning process should arrive at a csv* that can be more easily “digested” by other functions, e.g. to add a predicate or a weighting factor. This fits with the recommendations on reproducibility (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285): “record intermediate results in standardized formats”. From the cleaned data we can create a table with subject-object-predicate and maybe some weighting (all with a descriptor of the metadata). This curated info should be (in my opinion) what is “consumed” by arborist (see issue #4).
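A rough sketch of the two stages described above, to make the split concrete. It is not actual project code: the raw layout (a title line, a header line, `;` separators, padded codes), the column names, the `maps_to` predicate and the equal-split weighting are all assumptions made up for illustration.

```python
import csv
import io


def clean_raw(raw_text):
    """Provider-specific cleaning step (hypothetical raw layout).

    Skips the title and header lines, drops blank lines and strips
    padding around the codes.
    """
    reader = csv.reader(io.StringIO(raw_text), delimiter=";")
    next(reader)  # title line
    next(reader)  # header line
    rows = []
    for record in reader:
        if len(record) < 2:
            continue  # blank or malformed line
        source, target = record[0].strip(), record[1].strip()
        if source and target:
            rows.append({"source_code": source, "target_code": target})
    return rows


def rows_to_csv(rows):
    """Serialise the cleaned rows as the intermediate csv."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["source_code", "target_code"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_triples(rows, predicate):
    """Generic step: add a predicate and a weighting factor.

    Assumption: a source code mapped to n targets gets weight 1/n each.
    """
    counts = {}
    for row in rows:
        counts[row["source_code"]] = counts.get(row["source_code"], 0) + 1
    return [
        {
            "subject": row["source_code"],
            "predicate": predicate,
            "object": row["target_code"],
            "weight": 1 / counts[row["source_code"]],
        }
        for row in rows
    ]
```

Only `clean_raw` would need rewriting for each provider or whenever a provider changes its output; `to_triples` works on any cleaned csv with the same two columns.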
Here is an example of (trying to) create two different correspondence tables, just to illustrate how different one can be from the other.
Enjoy the weekend!
*For the three different ways of naming the same activities/flows in Exiobase, I think there should be 3 tables with a “same as” (?) predicate.