How do I maintain the provenance of statements in Linked Open Data?

Priority 3, Best Practice Established

Problem Statement:

When the AAC begins to generate a graph of all of the partners' information, many entities that refer to the same real-world entity will exist and will be reconciled.

We would like to be able to connect statements made about the entities to the institution making the statement, particularly when they disagree.

For example, given a person named Maria von Trapp, I'd like to know that Princeton thinks that Maria von Trapp's middle name is "von" and the last name is "Trapp", but that Harvard thinks that "von Trapp" is the last name.

If the information is reconciled without provenance, we will only know that "Trapp" and "von Trapp" are two possible last names.
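The loss can be sketched in a few lines of plain Python. This is only an illustration, not an AAC recommendation: the `ex:` identifiers, the `ex:lastName` predicate, and both helper functions are hypothetical, and each statement is modeled as a (subject, predicate, object, source) tuple rather than with any particular RDF library.

```python
from collections import defaultdict

# Hypothetical identifiers, used only for illustration.
MARIA = "ex:maria-von-trapp"
LAST_NAME = "ex:lastName"

# Each statement is a quad: (subject, predicate, object, source).
quads = [
    (MARIA, LAST_NAME, "Trapp", "ex:princeton"),
    (MARIA, LAST_NAME, "von Trapp", "ex:harvard"),
]


def merge_without_provenance(quads):
    """Drop the source from every quad, leaving plain triples."""
    return {(s, p, o) for s, p, o, _source in quads}


def values_by_source(quads, subject, predicate):
    """Group the values asserted for (subject, predicate) by their source."""
    grouped = defaultdict(set)
    for s, p, o, source in quads:
        if s == subject and p == predicate:
            grouped[source].add(o)
    return dict(grouped)


# After merging, both last names remain but nothing says who asserted which.
triples = merge_without_provenance(quads)

# With the source kept, each institution's claim can still be recovered.
by_source = values_by_source(quads, MARIA, LAST_NAME)
```

Here `triples` holds both last names with no indication of who asserted them, while `by_source` still maps each institution to the value it supplied.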

How do we maintain this information?

Best Practice:

A best practice for this problem is still outstanding in the larger Linked Open Data community, and so this question has been determined to be out of scope for the AAC.

Discussion:

(From David Newbury)

I'd like to maintain field-level provenance for data fields, and I'd like to do this without URL parsing. This will be essential for things like constituents, where we're going to end up with conflicting data provided by multiple partners, and we're going to have to be able to distinguish between them.

How do we credit external sources for providing us with information, and how do we allow users to judge the authority of statements made?

(I know we can potentially do this through looking at the URL. Is this sufficient?)

(From Rob)

We shouldn’t care about [the source of external references]. And if we do, I’d like to know why we care about it enough to essentially require reification of the entire dataset, making it miserable to work with.
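(Editorial illustration) To make the cost of reification concrete, here is a minimal sketch in plain Python of what standard RDF reification does to a single statement. The `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` terms come from the RDF vocabulary; the `ex:` identifiers, the `ex:assertedBy` predicate, and the `reify` helper are assumptions made up for this example.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(triple, statement_id, source):
    """Expand one triple into reification triples plus a source link.

    The ex:assertedBy predicate is hypothetical; a real dataset would
    need to choose a vocabulary for attributing the statement.
    """
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
        (statement_id, "ex:assertedBy", source),
    ]


# One simple statement about a (hypothetical) person...
original = ("ex:maria-von-trapp", "ex:lastName", "von Trapp")

# ...becomes five triples once its source is recorded this way.
reified = reify(original, "ex:statement-1", "ex:harvard")
```

Every statement that carries a source grows from one triple to five under this approach, and queries have to go through the statement node instead of matching the original triple directly.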

(From Vladimir)

First of all, we'd like to know that and are the same. Better yet, we want to map them both to VIAF or ULAN if possible. How to deal with labels is a secondary question.

But I see what you mean: for properties of art-research interest, you want to record them all, with provenance.

(From Rob)

To me, this question could be off-loaded to “How do AAC partners contribute to ULAN?” which is a Getty question we need to answer in the near future.


I think PROV-O is over-engineering for singly-mastered data (e.g. data about objects).

(From Rob)

I’ve run afoul of the PROV-O constraints so many times over the past few years that I would recommend either extreme caution and checking any use with real experts, or avoiding it entirely (e.g. https://www.w3.org/TR/prov-constraints/#generation-generation-ordering).

Reference:
