Re: #infrastructure New working group and practice guidelines

Matteo Lissandrini (AAU)

Hi Chris,

My importer actually does the file upload; this is the command I ran last night:

for f in $(find ../rdf -name '*.ttl'); do bseeder -i "$f"; done

So you do not need to merge the files in the /rdf repo; in fact, if you do,
you end up with a big problem: you lose track of which triples go into which named graph.

In my view, the RDF repo is for the instances of the taxonomies: small datasets that change slowly (e.g., flow objects/items or activity types),
while the actual data would remain outside it.

For very big files, what we can do is:
1) upload them via scp/rsync to a dedicated directory on the server,
2) use the file importer utility provided by Jena itself.
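
For example (a sketch, assuming the server runs Fuseki backed by TDB; the paths and the graph IRI are just placeholders, and the loader must run while Fuseki is stopped):

# 1) copy the big files to a staging directory on the server
rsync -avz big-datasets/ user@server:/srv/bonsai/staging/

# 2) bulk-load them offline with Jena's tdbloader, assigning an explicit
#    named graph so we keep track of where the triples came from
tdbloader --loc=/srv/fuseki/databases/bonsai \
          --graph=http://example.org/graphs/exiobase \
          /srv/bonsai/staging/*.ttl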

I understand that restoring is not easy, but we need it for reproducibility and for reliability (if something bad happens, we may need to rebuild the database from scratch).


From: [] on behalf of Chris Mutel via Groups.Io []
Sent: Thursday, April 04, 2019 1:43 PM
Subject: Re: [bonsai] #infrastructure New working group and practice guidelines

On Thu, 4 Apr 2019 at 13:20, Matteo Lissandrini (AAU) <> wrote:

Hi Chris,

I can imagine that some of my tests may have deleted some data from the database; sorry for that. To be honest, I was under the impression that the data in Jena was temporary:
I would expect the database to be wiped regularly until we reach a stable status.
But this should not be an issue.
I would like to help establish the (automatic) workflow that collects the data (in non-RDF formats), parses it, and merges it with the ontology and the contents of the /rdf repo, so that we can easily wipe and redeploy the Jena instance at will.
I believe this will require coordination between the arborist repo, the rdf repo, the importer, and probably some others.
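
As a first sketch of the wipe-and-redeploy step (the endpoint URL and dataset name are just placeholders):

# wipe every named graph via the SPARQL update endpoint
curl -X POST 'http://localhost:3030/bonsai/update' --data-urlencode 'update=DROP ALL'
# then re-seed the taxonomies from the /rdf repo
for f in $(find ../rdf -name '*.ttl'); do bseeder -i "$f"; done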
No problem, this is to be expected while we are still evolving the
schema and making sure our RDF is valid and implemented properly.
However, we should soon reach a point where the Aalborg server is
considered stable, while db.b.u is still for playing.

It actually isn't that easy to restore everything, as we currently
need a relatively large amount of data (on the order of 3 GB for
EXIOBASE, and 300 MB for the electricity stuff). The metadata is easy:
arborist can rewrite the data in the /rdf repo,
which can in turn be the foundation of the triple store. It would be
nice to have a function that takes all these small turtle files
and merges them into one file (which could then be uploaded to the
triple store).
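
Jena's riot tool could probably do the merging in one line, e.g. (paths assumed; note that plain turtle has no named graphs, so per-graph assignments would be lost in the merged file):

find rdf -name '*.ttl' -print0 | xargs -0 riot --output=turtle > merged.ttl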

In the medium term, I don't think it makes sense to store
metadata for specific databases like EXIOBASE in arborist; this can
just as easily be part of the file that includes the actual data.
We only evolved this code path because we were learning as we
went. Indeed, in the long term it is probably smarter to have this metadata generated from the database itself.

I think the small importer you wrote will work fine for smaller
datasets, but we will need to do file uploads for larger ones, as they
won't fit into memory (to be loaded by RDFLib). This should be easy to
do, though there may be some Jena configuration bugs to work out.
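
For example, curl can stream a file straight from disk to the graph store endpoint, so nothing has to fit in RDFLib's memory (dataset name and graph IRI are placeholders):

curl -X POST -H 'Content-Type: text/turtle' \
     --data-binary @exiobase.ttl \
     'http://localhost:3030/bonsai/data?graph=http://example.org/graphs/exiobase'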

So everything is in a bit of flux, and it would be great if you
could take charge of this little bit of it! Please document the hell
out of stuff, so we don't have to bug you too much.

Probably in the triplestore repo?


From: [] on behalf of Chris Mutel via Groups.Io []
Sent: Thursday, April 04, 2019 10:34 AM
Subject: [bonsai] #infrastructure New working group and practice guidelines

Dear all-

As many of you have already realized, we need to organize and document our infrastructure a bit better. Specifically, I see a need for:

1) Standard practice guidelines for maintaining the RDF database. For example, it looks like is fixed, but I am not sure by whom or when. Also, I think someone (or more than one person :) has wiped this database since the hackathon, as the electricity data is missing.
2) A small guide to help everyday people know which named graphs to use, and how to use them.
3) Backup and restore procedures for the RDF database. We need to be dumping stuff anyway to make the downloads available (see the dump sketch after this list).
4) A private repository with server configs and passwords. I have applied for status, but we could also run a private instance of GitLab.
5) A list of all websites, virtual servers, etc.
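
For the dumps, something like Jena's tdbdump could serve both as the backup and as the source of the downloads (database location assumed):

# N-Quads keeps the named graph names, unlike plain turtle
tdbdump --loc=/srv/fuseki/databases/bonsai | gzip > bonsai-dump.nq.gz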

Tomas, would you coordinate this? It doesn't mean you have to do all the work.

Chris Mutel
Technology Assessment Group, LEA
Paul Scherrer Institut
5232 Villigen PSI
Phone: +41 56 310 5787
