Processing SKOS files with LINQ – Or – Do you know how fast LINQ is?

by Joerg Lang | Sep 24, 2009

Currently I’m working on a project where I have to deal with RDF files that contain data according to the SKOS standard. SKOS is the acronym for Simple Knowledge Organization System and is a formal language designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is built upon RDF and RDFS, and its main objective is to enable easy publication of controlled structured vocabularies for the Semantic Web.

A SKOS file contains ConceptScheme and Concept nodes. ConceptSchemes are the top-level nodes and can contain Concept nodes. A Concept node can be the parent of other Concept nodes and can also reference other Concepts. So we have to deal with a hierarchical structure that can also have cross-references.

<skos:Concept rdf:about="http://example.com/Concept/0001">
    <skos:prefLabel>English cuisine</skos:prefLabel>
    <skos:altLabel>English dishes</skos:altLabel>
    <skos:altLabel xml:lang="fr">Cuisine anglaise</skos:altLabel>
    <skos:inScheme rdf:resource="http://example.com/thesaurus"/>
    <skos:broader rdf:resource="http://example.com/Concept/0002"/>
    <skos:related rdf:resource="http://example.com/Concept/0003"/>
</skos:Concept>
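
A fragment like this can be read with LINQ to XML once the namespaces are declared. The following is only a minimal sketch, not code from the project: it wraps the concept in an rdf:RDF root so that the prefixes resolve, and the names SkosSample, Skos, Rdf and FirstConcept are assumptions made for illustration. The two namespace URIs are the standard SKOS and RDF ones.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SkosSample
{
    // Standard namespace URIs for SKOS and RDF.
    public static readonly XNamespace Skos = "http://www.w3.org/2004/02/skos/core#";
    public static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // The concept from the text, wrapped in an rdf:RDF root (an assumption)
    // so that the rdf: and skos: prefixes are declared.
    public const string Xml = @"<rdf:RDF
    xmlns:rdf=""http://www.w3.org/1999/02/22-rdf-syntax-ns#""
    xmlns:skos=""http://www.w3.org/2004/02/skos/core#"">
  <skos:Concept rdf:about=""http://example.com/Concept/0001"">
    <skos:prefLabel>English cuisine</skos:prefLabel>
    <skos:altLabel>English dishes</skos:altLabel>
    <skos:altLabel xml:lang=""fr"">Cuisine anglaise</skos:altLabel>
    <skos:inScheme rdf:resource=""http://example.com/thesaurus""/>
    <skos:broader rdf:resource=""http://example.com/Concept/0002""/>
    <skos:related rdf:resource=""http://example.com/Concept/0003""/>
  </skos:Concept>
</rdf:RDF>";

    // Parses the sample and returns its first skos:Concept element.
    public static XElement FirstConcept()
    {
        return XDocument.Parse(Xml).Descendants(Skos + "Concept").First();
    }

    public static void Main()
    {
        XElement concept = FirstConcept();

        // Casting an XAttribute or XElement to string yields null when it is missing.
        Console.WriteLine("about:     " + (string)concept.Attribute(Rdf + "about"));
        Console.WriteLine("prefLabel: " + (string)concept.Element(Skos + "prefLabel"));

        // xml:lang lives in the predefined XML namespace, exposed as XNamespace.Xml.
        foreach (XElement label in concept.Elements(Skos + "altLabel"))
        {
            string lang = (string)label.Attribute(XNamespace.Xml + "lang") ?? "(none)";
            Console.WriteLine("altLabel [" + lang + "]: " + label.Value);
        }
    }
}
```

Element and attribute names are always combined with their namespace (Skos + "Concept", Rdf + "about"); a plain "Concept" would not match anything in such a file.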

The customer has built a comprehensive knowledge system, and we had to process export files from that database system for other purposes. The files we were given were RDF (XML) files, and one of them was 87 MB in size. As I had had good experiences with LINQ, I was going to use it to extract the data I needed from the files. However, when I looked at the size of the file, I was concerned about performance.

Just opening this file with Notepad takes 40 seconds on my (pretty fast) laptop. I know, Notepad is not really a reference application, but opening the file in Altova's SemanticWorks application takes 180 seconds.

To familiarize myself with the SKOS scheme and the structure of the files I was given, I decided to first build a small application that would allow me to browse the hierarchy and inspect single items. This application uses a tree view control to navigate the hierarchy and a user control to display the details of the selected node. I built a DLL that encapsulates all data access for the SKOS file. The DLL that handles the data access has the following class diagram.

[Figure: class diagram of the data access DLL]

The main class is SkosDocument, which provides methods to get specific data from a SKOS file. Most of the public methods return a list of SkosItems (List<SkosItem>). In the constructor of that class the file is opened as an XDocument.

Document = XDocument.Load(fileName);

Loading the 87 MB file takes less than 4 (!) seconds on my laptop. After that you can use LINQ expressions to get to the data you want. Loading all 32 top-level ConceptSchemes into a list of SkosItems takes less than 0.2 seconds, and to do that LINQ has to walk through the whole 87 MB document! Below the 32 ConceptSchemes there are a total of 217346 Concepts stored in the file. To get the root concepts of a specific ConceptScheme, the LINQ expression looks like this:
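
The two measurements above can be reproduced roughly as follows. This is a hedged sketch rather than the code I used: since the customer file cannot be shared, it writes a small synthetic SKOS file first, and the names LoadTiming, WriteSampleFile and CountSchemes are made up for the example. Note that XDocument.Load parses the complete file into an in-memory tree; the query afterwards only walks that tree.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class LoadTiming
{
    static readonly XNamespace Skos = "http://www.w3.org/2004/02/skos/core#";
    static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    const string SchemeId = "http://example.com/thesaurus";

    // Writes a synthetic SKOS file with one ConceptScheme and the given
    // number of Concepts, and returns the path of the temporary file.
    public static string WriteSampleFile(int conceptCount)
    {
        var root = new XElement(Rdf + "RDF",
            new XAttribute(XNamespace.Xmlns + "rdf", Rdf.NamespaceName),
            new XAttribute(XNamespace.Xmlns + "skos", Skos.NamespaceName),
            new XElement(Skos + "ConceptScheme", new XAttribute(Rdf + "about", SchemeId)),
            Enumerable.Range(1, conceptCount).Select(i =>
                new XElement(Skos + "Concept",
                    new XAttribute(Rdf + "about", "http://example.com/Concept/" + i),
                    new XElement(Skos + "inScheme", new XAttribute(Rdf + "resource", SchemeId)))));

        string path = Path.GetTempFileName();
        new XDocument(root).Save(path);
        return path;
    }

    // Counts the ConceptScheme elements by walking the loaded tree.
    public static int CountSchemes(XDocument document)
    {
        return document.Descendants(Skos + "ConceptScheme").Count();
    }

    public static void Main()
    {
        string path = WriteSampleFile(10000);
        try
        {
            // Time the load (parsing the file into memory).
            Stopwatch watch = Stopwatch.StartNew();
            XDocument document = XDocument.Load(path);
            watch.Stop();
            Console.WriteLine("Load:  {0} ms", watch.ElapsedMilliseconds);

            // Time the query (walking the in-memory tree).
            watch.Restart();
            int schemes = CountSchemes(document);
            watch.Stop();
            Console.WriteLine("Query: {0} ms, {1} scheme(s) found", watch.ElapsedMilliseconds, schemes);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The absolute numbers will of course differ from mine, because the synthetic file is far smaller and simpler than the real export.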

/// <summary>
/// Gets the root concepts of a specific scheme.
/// </summary>
/// <param name="conceptSchemePid">The identifier of a skos item that must be of SkosItemType.ConceptScheme</param>
/// <returns>The concepts of the scheme that have no broader concept.</returns>
public List<SkosItem> GetRootConepts(string conceptSchemePid)
{
    if (GetSkosItem(conceptSchemePid).ItemType != SkosItemType.ConceptScheme)
        throw new ArgumentException("The passed skos item must be of SkosItemType.ConceptScheme");

    // Get all nodes from the xml that have the scheme set to the passed identifier,
    // but either have no broader item or name the scheme itself as their broader item.
    // The (string) casts return null instead of throwing when an attribute is missing.
    var q = Document.Descendants(m_skos + "Concept").Where(
        result =>
            (string)result.Elements(m_skos + "inScheme").Attributes(m_rdf + "resource").FirstOrDefault() == conceptSchemePid
            && (
                   ((string)result.Elements(m_skos + "broader").Attributes(m_rdf + "resource").FirstOrDefault() ?? "") == conceptSchemePid
                   || string.IsNullOrEmpty((string)result.Elements(m_skos + "broader").Attributes(m_rdf + "resource").FirstOrDefault())
               ))
        .Select(result => GetNewSkosConceptItem(result));

    return q.ToList();
}

And the function for getting the detail data that is used in the expression is defined as follows: 

private SkosItem GetNewSkosConceptItem(XElement concept)
{
    return new SkosItem()
    {
        ItemType = SkosItemType.Concept,
        Pid = (string)concept.Attribute(m_rdf + "about"),
        LabelEnglish =
            (string)
            concept.Elements(m_skos + "prefLabel").Where(p => (string)p.Attribute(m_xml + "lang") == "en").
                FirstOrDefault(),
        LabelFrench =
            (string)
            concept.Elements(m_skos + "prefLabel").Where(
                p => (string)p.Attribute(m_xml + "lang") == "fr" || p.HasAttributes == false)
                .FirstOrDefault(),
        Broader = concept.Elements(m_skos + "broader").Select(p => p.Attribute(m_rdf + "resource").Value).ToList(),
        InScheme = (string)concept.Elements(m_skos + "inScheme").Attributes(m_rdf + "resource").FirstOrDefault(),
        AlternativeLabels = concept.Elements(m_skos + "altLabel").Select(p => p.Value).ToList(),
        Note =
            (string)
            concept.Elements(m_skos + "scopeNote").Where(
                p => (string)p.Attribute(m_xml + "lang") == "fr" || p.HasAttributes == false)
                .FirstOrDefault(),
        Narrower =
            concept.Elements(m_skos + "narrower").Select(p => p.Attribute(m_rdf + "resource").Value).ToList(),
        Related = concept.Elements(m_skos + "related").Select(p => p.Attribute(m_rdf + "resource").Value).ToList(),
        HasChildren = concept.Elements(m_skos + "narrower").Any(),
        ExactMatch = (string)concept.Elements(m_skos + "exactMatch").Attributes(m_rdf + "resource").FirstOrDefault()
    };
}

 

As you can see, the LINQ expression used here is not one of the simple ones you find in the samples about LINQ. Nonetheless, the performance is amazingly good.
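
To try the root concept filter in isolation, it can be reduced to a few lines that run against a tiny in-memory document. The sketch below is an illustration under assumptions, not the project code: RootConceptDemo, Resource, Concept and Sample are invented names, and the method returns plain identifiers instead of SkosItem objects to stay self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RootConceptDemo
{
    static readonly XNamespace Skos = "http://www.w3.org/2004/02/skos/core#";
    static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    public const string Scheme = "http://example.com/thesaurus";

    // Hypothetical helper: rdf:resource of the first child with the given name, or null.
    static string Resource(XElement concept, string name)
    {
        return (string)concept.Elements(Skos + name).Attributes(Rdf + "resource").FirstOrDefault();
    }

    // Returns the rdf:about of every concept in the scheme that has no broader
    // concept (or that names the scheme itself as its broader item).
    public static List<string> RootConceptIds(XDocument document, string schemeId)
    {
        return document.Descendants(Skos + "Concept")
            .Where(c => Resource(c, "inScheme") == schemeId)
            .Where(c =>
            {
                string broader = Resource(c, "broader");
                return string.IsNullOrEmpty(broader) || broader == schemeId;
            })
            .Select(c => (string)c.Attribute(Rdf + "about"))
            .ToList();
    }

    // Builds one concept element; broader may be null.
    static XElement Concept(string id, string broader)
    {
        var element = new XElement(Skos + "Concept",
            new XAttribute(Rdf + "about", id),
            new XElement(Skos + "inScheme", new XAttribute(Rdf + "resource", Scheme)));
        if (broader != null)
            element.Add(new XElement(Skos + "broader", new XAttribute(Rdf + "resource", broader)));
        return element;
    }

    // Three sample concepts: a root, a child of that root, and one whose broader is the scheme.
    public static XDocument Sample()
    {
        return new XDocument(
            new XElement(Rdf + "RDF",
                Concept("http://example.com/Concept/0001", null),
                Concept("http://example.com/Concept/0002", "http://example.com/Concept/0001"),
                Concept("http://example.com/Concept/0003", Scheme)));
    }

    public static void Main()
    {
        foreach (string id in RootConceptIds(Sample(), Scheme))
            Console.WriteLine(id);
    }
}
```

With the three sample concepts, the call returns Concept/0001 and Concept/0003; Concept/0002 is skipped because its broader item points to another concept.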

Loading all the children of a specific ConceptScheme with more than 23000 records took 0.6 seconds, and another ConceptScheme with more than 55000 records took 0.9 seconds.
Loading the returned items into the tree view control takes longer, but here too I was impressed, this time by the Telerik tree view control. It takes 3.7 seconds to load the 55000 records into the tree, including storing the object in the Tag property.

However, these times vary. Every time I tested I got slightly different results. The values I have given here are the fastest I measured. Depending on other activity on my laptop (which has an Oracle database installed), loading the large RDF document can also take up to 6 or even 8 seconds, and loading the 55000 child records can take up to 2 seconds. But still, this is very impressive.
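
Because single runs fluctuate like this, it helps to repeat a measurement and look at the spread. The helper below is a small sketch of that idea and not the way I took the numbers above; Timing and Measure are hypothetical names, and the summing lambda is only a placeholder workload.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class Timing
{
    // Runs the action the given number of times and returns
    // the elapsed time of each run in milliseconds.
    public static double[] Measure(Action action, int runs)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        var results = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            action();
            watch.Stop();
            results[i] = watch.Elapsed.TotalMilliseconds;
        }
        return results;
    }

    public static void Main()
    {
        // Placeholder workload; substitute the load or query you want to measure.
        double[] times = Measure(() => Enumerable.Range(0, 1000000).Sum(x => (long)x), 5);

        Console.WriteLine("fastest {0:F1} ms, slowest {1:F1} ms, average {2:F1} ms",
            times.Min(), times.Max(), times.Average());
    }
}
```

Reporting the fastest run alongside the slowest one makes it easier to tell a genuinely slow query from a busy machine.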

Conclusion

Using LINQ to process XML documents is a good choice. The performance is outstanding, and I doubt that manual parsing of the file could be that fast. So if you need to read XML documents and extract data from them, you should definitely use LINQ.

Here again is the summary of my (not scientifically watertight) testing:

  • Loading an 87 MB XML file into an XDocument: 4 seconds.
  • Fetching 55000 records from the file and creating typed objects: 0.9 seconds.
