Serializing large LD datasets


 

Our current approach to serializing large graphs may not be that great. You can see the current code here; basically, we convert Python to JSON line by line, with some text mangling. It sounds (and looks) a bit crazy; we chose it because RDFLib can't really handle large datasets, such as BONSAI.
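To make the "line by line" idea concrete, here is a minimal sketch of a streaming writer. It is not our actual code: the function name `write_jsonld_stream`, the `example_nodes` generator and the node IDs are made up for illustration. The point is only that each node is passed through `json.dumps` on its own, so the full graph never has to sit in memory.

```python
import io
import json


def write_jsonld_stream(nodes, context, out):
    """Write a JSON-LD document to `out`, serializing one node at a time.

    Hypothetical helper: `nodes` can be any iterable (for example a generator),
    so the complete graph is never held in memory.
    """
    out.write('{\n  "@context": ')
    out.write(json.dumps(context))
    out.write(',\n  "@graph": [\n')
    first = True
    for node in nodes:
        if not first:
            # The separator goes before every node except the first,
            # which avoids a trailing comma after the last node.
            out.write(",\n")
        out.write("    " + json.dumps(node, ensure_ascii=False))
        first = False
    out.write("\n  ]\n}\n")


def example_nodes():
    # Made-up nodes, only here to exercise the writer.
    for i in range(3):
        yield {"@id": f"brdfat:example_{i}", "label": f"Example activity {i}"}


context = {
    "brdfat": "http://rdf.bonsai.uno/activitytype/exiobase3_3_17/",
    "label": {"@id": "http://www.w3.org/2000/01/rdf-schema#label"},
}

buffer = io.StringIO()
write_jsonld_stream(example_nodes(), context, buffer)

# Parsing the result again confirms that the hand-assembled text is valid JSON.
document = json.loads(buffer.getvalue())
```

Only the outer braces and the commas are written by hand; everything inside a node still goes through `json.dumps`, which keeps the text mangling to a minimum.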

The last straw was realizing that we need to declare a `dataset` for the actual data (not just metadata). In Turtle, this is (for example):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns1: <http://creativecommons.org/ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ns2: <http://purl.org/vocab/vann/> .
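@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dtype: <http://purl.org/dc/dcmitype/> .
@prefix brdfat: <http://rdf.bonsai.uno/activitytype/exiobase3_3_17/> .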

brdfat: a dtype:Dataset ;
    ns1:license <http://creativecommons.org/licenses/by/3.0/> ;
    dc:contributor "BONSAI team" ;
    dc:creator <http://bonsai.uno/foaf/bonsai.rdf#bonsai> ;
    dc:description "ActivityType instances needed for BONSAI modelling of EXIOBASE version 3.3.17" ;
    dc:modified "2019-04-02"^^xsd:date ;
    dc:publisher "bonsai.uno" ;
    dc:title "EXIOBASE 3.3.17 activity types" ;
    ns2:preferredNamespaceUri <http://rdf.bonsai.uno/activitytype/exiobase3_3_17/#> ;
    owl:versionInfo "0.3" ;
    foaf:homepage brdfat:documentation.html .

In JSON-LD, it is... more involved:


{
  "@graph" : [ {
    "@id" : "http://rdf.bonsai.uno/activitytype/exiobase3_3_17/",
    "@type" : "dtype:Dataset",
    "license" : "http://creativecommons.org/licenses/by/3.0/",
    "contributor" : "BONSAI team",
    "creator" : "http://bonsai.uno/foaf/bonsai.rdf#bonsai",
    "description" : "ActivityType instances needed for BONSAI modelling of EXIOBASE version 3.3.17",
    "modified" : "2019-04-02",
    "publisher" : "bonsai.uno",
    "title" : "EXIOBASE 3.3.17 activity types",
    "preferredNamespaceUri" : "brdfat:#",
    "versionInfo" : "0.3",
    "homepage" : "brdfat:documentation.html"
  } ],
  "@context" : {
    "label" : {
      "@id" : "http://www.w3.org/2000/01/rdf-schema#label"
    },
    "versionInfo" : {
      "@id" : "http://www.w3.org/2002/07/owl#versionInfo"
    },
    "homepage" : {
      "@id" : "http://xmlns.com/foaf/0.1/homepage",
      "@type" : "@id"
    },
    "title" : {
      "@id" : "http://purl.org/dc/elements/1.1/title"
    },
    "publisher" : {
      "@id" : "http://purl.org/dc/elements/1.1/publisher"
    },
    "description" : {
      "@id" : "http://purl.org/dc/elements/1.1/description"
    },
    "preferredNamespaceUri" : {
      "@id" : "http://purl.org/vocab/vann/preferredNamespaceUri",
      "@type" : "@id"
    },
    "creator" : {
      "@id" : "http://purl.org/dc/elements/1.1/creator",
      "@type" : "@id"
    },
    "license" : {
      "@id" : "http://creativecommons.org/ns#license",
      "@type" : "@id"
    },
    "contributor" : {
      "@id" : "http://purl.org/dc/elements/1.1/contributor"
    },
    "modified" : {
      "@id" : "http://purl.org/dc/elements/1.1/modified",
      "@type" : "http://www.w3.org/2001/XMLSchema#date"
    },
    "dtype" : "http://purl.org/dc/dcmitype/",
    "brdfat" : "http://rdf.bonsai.uno/activitytype/exiobase3_3_17/"
  }
}

Moreover, I find it difficult to reason about why the JSON-LD is formatted the way it is. The Turtle file, on the other hand, is much nicer to read, and its layout is easier to predict.
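As far as I can tell, the layout comes from the `@context` doing the work that prefixes and typed literals do in Turtle: each short key maps to a predicate IRI, and an optional `@type` says whether the value is an IRI or a typed literal. The sketch below shows that mapping for a few keys from the example above. It is a hand-rolled illustration and not a conforming JSON-LD processor; the helper names `expand_iri` and `expand_node` are my own.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand_iri(value, context):
    """Expand a compact IRI such as 'brdfat:documentation.html'.

    Only handles prefixes declared as plain strings in the context;
    anything else (for example a full http:// IRI) is returned unchanged.
    """
    prefix, sep, suffix = value.partition(":")
    base = context.get(prefix)
    if sep and isinstance(base, str) and not suffix.startswith("//"):
        return base + suffix
    return value


def expand_node(node, context):
    """Return (predicate IRI, object, kind) tuples for one compact node.

    `kind` is "iri", "literal", or a datatype IRI taken from the context.
    """
    statements = []
    for key, value in node.items():
        if key == "@id":
            continue  # the subject itself, not a statement about it
        if key == "@type":
            statements.append((RDF_TYPE, expand_iri(value, context), "iri"))
            continue
        definition = context[key]
        coercion = definition.get("@type")
        if coercion == "@id":
            # The string is an IRI reference, like <...> in Turtle.
            statements.append((definition["@id"], expand_iri(value, context), "iri"))
        elif coercion:
            # The string is a typed literal, like "..."^^xsd:date in Turtle.
            statements.append((definition["@id"], value, coercion))
        else:
            statements.append((definition["@id"], value, "literal"))
    return statements


# A subset of the context and node from the example above.
context = {
    "homepage": {"@id": "http://xmlns.com/foaf/0.1/homepage", "@type": "@id"},
    "title": {"@id": "http://purl.org/dc/elements/1.1/title"},
    "modified": {
        "@id": "http://purl.org/dc/elements/1.1/modified",
        "@type": "http://www.w3.org/2001/XMLSchema#date",
    },
    "dtype": "http://purl.org/dc/dcmitype/",
    "brdfat": "http://rdf.bonsai.uno/activitytype/exiobase3_3_17/",
}
node = {
    "@id": "http://rdf.bonsai.uno/activitytype/exiobase3_3_17/",
    "@type": "dtype:Dataset",
    "title": "EXIOBASE 3.3.17 activity types",
    "modified": "2019-04-02",
    "homepage": "brdfat:documentation.html",
}

statements = expand_node(node, context)
for statement in statements:
    print(statement)
```

Each printed tuple corresponds to one predicate/object pair in the Turtle version, which is roughly the information Turtle keeps inline and JSON-LD moves into the context.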

We had said earlier (though without a formal decision) that we want to use JSON-LD for data interchange, but it would make life a lot easier to use Turtle, if people were OK with that. Let me know what you think!
 
