PEARL: the ProjEction of Annotations Rule Language

Introduction

One of the main features of CODA is PEARL (ProjEction of Annotations Rule Language): a language for describing projections of annotations, taken from a UIMA CAS and defined in a UIMA Type System, onto RDF graph patterns.
Core elements of the language are Projection Rules, enabling users to describe matches over a set of annotations produced by UIMA Analysis Engines over streams of unstructured information, and to specify how the matched annotations will be transformed into RDF triples.
PEARL combines the mechanism of UIMA feature paths (to extract relevant information from UIMA annotations) with a subset of the SPARQL syntax to describe patterns for generating RDF triples.
We describe here the structure of a typical projection document (a document containing a set of projection rules) and then we give a list of concise examples to show the expressiveness of this language.

Structure of PEARL documents

A projection document (containing PEARL rules) begins with the Prefix Declarations, followed by one or more Projection Rules.

			
prefix ...

rule ... {
	...
}

Prefix Declaration

At the beginning of the Projection Document each prefix used to shorten a URI in the projection rules is bound to a namespace. Note that these prefixes need not be the same as those declared inside the target ontology (though they may overlap) and are independent of that declaration. They are thus local to the projection process: they are used to expand prefixed names inside the document into valid RDF URIs, and no trace of them is left in the target ontology.

			
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

rule ... {
	...
}
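
The expansion of prefixed names described above can be illustrated with a minimal Python sketch. It is not CODA code: the function name expand and the dictionary PREFIXES are hypothetical, and the only assumption is that a prefixed name is split at its first colon.

```python
def expand(prefixed_name, prefixes):
    """Expand 'prefix:local' into a full URI using the document's prefix table."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"undeclared prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


# Prefix table as it would be read from the declarations of one projection document.
PREFIXES = {"xsd": "http://www.w3.org/2001/XMLSchema#"}

print(expand("xsd:string", PREFIXES))
```

The table is local to the document, which mirrors the fact that the prefixes leave no trace in the target ontology.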

Projection Rules

The rest of the projection document specifies a set of projection rules. A Projection Rule specification is divided into the following parts (some of them optional): a rule declaration, followed by its definition, which is in turn composed of the following sections: bindings, nodes, graph and where.

Rule Declaration

Each rule starts with a declaration, introduced by the keyword "rule" (or "lazy rule", see Advanced Concepts later), and ends before the curly bracket "{", which initiates its definition.

			
...

rule it.uniroma2.art.uima.imdb.IMDBFilmCast id:cast dependsOn ... {
	...
}

The first element in the declaration, following the rule keyword, is a reference to a type (e.g. it.uniroma2.art.uima.imdb.IMDBFilmCast) from the adopted UIMA Type System: any UIMA annotation of that type (or any annotation which is a subtype of the specified type) will trigger the use of this rule.

The rule identifier (cast in the above example) follows the type declaration, and is a hook that can be used to reference the rule from other rules, according to different kinds of dependency.

The declaration may end with an optional list of dependencies on other rules, introduced by the keyword dependsOn. Each dependency specifies the type of relationship with the target rule (see the Advanced Concepts section).

Nodes

The Nodes section (which is mandatory unless the rule dependsOn another one) is the locus for declaring and creating the nodes which will be used in the generated RDF statements.

Each node declaration is composed of:

a name (a placeholder for the newly created node, which will later be used when generating RDF triples)
a node type declaration, indicating the nature of the node (uri, plain or typed literal)
the element of the UIMA feature extracted from the triggered rule, which will be used as a basis for creating the node

By default, conversions are applied to the input from the extracted UIMA feature, and depend on the nature of the specified node type. For example, if the node type is a URI, the feature value is first "sanitized" (i.e. characters which are incompatible with the URI specification are removed or replaced); the sanitized value is then used as a local name and concatenated to the namespace of the target ontology to generate a valid URI.
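
The default URI conversion can be sketched as follows. This is only an illustration: the exact sanitization rules applied by CODA are not given here, so the replacement policy (each run of characters outside letters, digits, "_", "." and "-" becomes one underscore), the function name default_uri and the example namespace are all assumptions.

```python
import re


def default_uri(feature_value, namespace):
    """Sanitize a feature value and use it as a local name under `namespace`."""
    # Assumed policy: every run of disallowed characters becomes a single underscore.
    local_name = re.sub(r"[^A-Za-z0-9_.-]+", "_", feature_value.strip())
    return namespace + local_name


# Hypothetical namespace of the target ontology.
print(default_uri("The Godfather (1972)", "http://example.org/ns#"))
```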

A set of conversion functions (shortly: converters) are available for applying different transformations to the input features. Converters are identified by URIs (corresponding to a contract of the conversion function) and can be invoked by specifying their URI between round brackets after the node type in the node declaration.

Converters represent an extensible part of the language. Each converter is realized by a Java class implementing an interface which represents the contract for the function. Converters may be added to an existing CODA system by deploying a dedicated OSGi bundle inside its installation. The appendix of this PEARL manual has a section with the complete list of converters provided by CODA, together with their descriptions.

The developer manual provides a dedicated section where third party developers may learn how to extend CODA with new converters.

The value of a feature may itself be a feature structure (thus containing other features, and so on). Feature path is a standard notation introduced in UIMA to address feature structures in the Common Analysis Structure (CAS), similar to the XPath expressions used to access elements of an XML document. This notation can be used in node declarations as well.

			
prefix 	xsd:	<http://www.w3.org/2001/XMLSchema#>
prefix 	cdbk:	<http://art.uniroma2.it/book/coda#>

rule it.uniroma2.Book id:book1   {
	nodes = {
		book		uri(cdbk:isbn)		isbn
		title		plainLiteral		title
		author		uri			author
		authorName	literal^^xsd:string	author
	}
	...

}

In the example above, a rule creates an RDF description of a book by converting information extracted by a UIMA Analysis Engine.

There are four node declarations. The placeholder author will host a URI constructed from the feature author of the current annotation with a simple sanitization and by prepending the default namespace. The placeholder book will contain a URI constructed from the value of the feature isbn by invoking a converter implementing the contract cdbk:isbn. This contract declares a function which turns an ISBN into a suitable URI. The other two placeholders, title and authorName, will host a plain literal and an xsd:string typed literal, respectively.
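
A rough Python sketch of how these four declarations might be evaluated is given below. It is not the CODA implementation: the default namespace NS, the sample annotation, and in particular the behaviour of the cdbk:isbn converter (here it simply builds a urn:isbn: identifier) are assumptions made for illustration.

```python
NS = "http://example.org/default#"                       # hypothetical default namespace
ISBN_CONTRACT = "http://art.uniroma2.it/book/coda#isbn"  # cdbk:isbn, expanded


def default_uri(value):
    # Simplified default conversion: spaces become underscores.
    return NS + value.replace(" ", "_")


def isbn_uri(value):
    # Hypothetical behaviour of a converter implementing the cdbk:isbn contract.
    return "urn:isbn:" + value.replace("-", "")


# Converters are looked up by the URI of the contract they implement.
CONVERTERS = {ISBN_CONTRACT: isbn_uri}


def make_node(node_type, value, contract=None):
    """Build a (type, value) node, applying a converter when the type is 'uri'."""
    if node_type == "uri":
        convert = CONVERTERS[contract] if contract else default_uri
        return ("uri", convert(value))
    return (node_type, value)  # literals are passed through unchanged in this sketch


# Hypothetical feature values of one it.uniroma2.Book annotation.
annotation = {"isbn": "978-0-14-044913-6", "title": "The Odyssey", "author": "Homer"}

nodes = {
    "book": make_node("uri", annotation["isbn"], ISBN_CONTRACT),
    "title": make_node("plainLiteral", annotation["title"]),
    "author": make_node("uri", annotation["author"]),
    "authorName": make_node("literal^^xsd:string", annotation["author"]),
}
```

Note that author and authorName read the same feature but yield different nodes, because the node type selects the conversion.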

Graph

The graph section contains the actual projection over the target ontology graph, described as a graph pattern which is dynamically populated with unified placeholders and variables (see the where section later on). The graph pattern consists of a set of triples, where the first element is the subject, the second the predicate and the third the object of an RDF statement. Each element in the graph may be one of the following: a placeholder, a variable, an RDF node or an abbreviation.

Inside a graph pattern, placeholders are identified by the prefix symbol "$" and may be defined in the nodes section of the current rule or of other referenced projection rules. A placeholder from a rule referenced through the dependsOn construct is written with a single "." between the rule id and the placeholder name (e.g. $film.filmId), while a placeholder reached through a binding is written with two dots ".." (e.g. $singlePerson..personId).

RDF nodes can be referenced in graph patterns through the usual notation for URIs (standard URIs delimited by "< >") or by prefixed local names. Abbreviations are a finite list of words that can be used in place of explicit references to RDF resources. For example, this list includes the standard abbreviation from the RDF Turtle format (also adopted in the SPARQL query language), in which the character "a" is interpreted as rdf:type.

			
prefix 	my: http://art.uniroma2.it/imdb#;
prefix 	xsd: http://www.w3.org/2001/XMLSchema#;
prefix 	owl: http://www.w3.org/2002/07/owl#;
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri			title
		filmTitle	literal^^xsd:string	title
		year		literal^^xsd:integer	year
	}
				
	graph = {
		$filmId		a		my:Film	.
		$filmId		my:title	$filmTitle	.
		$filmId		my:releasedIn	$year		.
	}
}
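
The way the graph pattern of this rule is populated can be sketched in Python as a simple substitution of placeholders. The node values in NODES are hypothetical results of the nodes section, and the function instantiate is an illustrative name, not part of CODA.

```python
def instantiate(pattern, placeholders):
    """Replace every $name in a list of triple patterns with its node value."""
    def resolve(term):
        return placeholders[term[1:]] if term.startswith("$") else term
    return [tuple(resolve(term) for term in triple) for triple in pattern]


# The graph section of the rule above, one tuple per triple.
GRAPH = [
    ("$filmId", "a", "my:Film"),
    ("$filmId", "my:title", "$filmTitle"),
    ("$filmId", "my:releasedIn", "$year"),
]

# Hypothetical nodes produced by the nodes section for one annotation.
NODES = {
    "filmId": "<http://example.org/Heat>",
    "filmTitle": '"Heat"^^xsd:string',
    "year": '"1995"^^xsd:integer',
}

TRIPLES = instantiate(GRAPH, NODES)
for triple in TRIPLES:
    print(*triple, ".")
```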

Advanced Concepts

Where

Like the graph section, the (optional) where section contains a graph pattern: the purpose of this graph pattern is to link newly extracted data with information which is already present in the target dataset (i.e. the dataset which will be updated with the triples generated by CODA).

The specified graph pattern is thus matched over the target dataset to retrieve already existing nodes by means of variable unification (variables are identified by a "?" prefix), so that the variable substitutions can be reused in the graph section described above.

In this sense, it is very close in purpose to the WHERE clause of a SPARQL CONSTRUCT query. The unification mechanism assigns values to variables by constraining them on the basis of information which is expected to be present in the dataset: these substitutions are then applied to the graph pattern of the graph section to project data over the target dataset.

			
prefix my: http://IMDB#;
prefix xsd: http://www.w3.org/2001/XMLSchema#;
prefix owl: http://www.w3.org/2002/07/owl#;
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri			title
		filmTitle	literal^^xsd:string	title
	}
				
	graph = {
		$filmId		a			my:Film	.
		?filmId		my:title		$filmTitle	.
	}

	where = {
		?filmId		my:title		$filmTitle .
	}
}

In the WHERE section of the above example we define a variable, filmId (by means of the ? symbol, not to be confused with the $ of placeholders for generated nodes). The clause in the WHERE section should return the subject of the triples having the value of $filmTitle as object of the my:title predicate.

Note that, if no value is retrieved, the WHERE match fails. However, a fallback mechanism in CODA will then use the value held by the placeholder with the same name as the uninstantiated variable.
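
The lookup and its fallback can be illustrated with the following Python sketch, in which the target dataset is modelled as a plain list of triples. The function names and the sample data are hypothetical; only the behaviour (use the matched subject if there is one, otherwise the placeholder of the same name) follows the description above.

```python
def match_subject(dataset, predicate, obj):
    """Return the subject of the first triple matching (?, predicate, obj), or None."""
    for s, p, o in dataset:
        if p == predicate and o == obj:
            return s
    return None


def resolve_film_id(dataset, placeholders):
    """Bind ?filmId from the dataset, falling back to the placeholder $filmId."""
    found = match_subject(dataset, "my:title", placeholders["filmTitle"])
    # Fallback: the placeholder with the same name as the unbound variable.
    return found if found is not None else placeholders["filmId"]


# Hypothetical content of the target dataset and of the rule's placeholders.
DATASET = [("my:film_42", "my:title", '"Heat"')]
PLACEHOLDERS = {"filmId": "my:Heat", "filmTitle": '"Heat"'}

print(resolve_film_id(DATASET, PLACEHOLDERS))  # existing node is reused
print(resolve_film_id([], PLACEHOLDERS))       # fallback to the placeholder
```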

DependsOn

It is possible to state a dependency between two or more rules (and the annotations which trigger the use of these rules).

The rule referenced for the dependency must have an id. The rule stating the dependency on the first one declares it by means of the keyword dependsOn in its declaration, followed by the type of dependency. In the current implementation the following types of dependency are provided:

last: depends on the last annotation that triggered the specified rule
next: depends on the next annotation that triggers the specified rule
previous: depends on all the previous annotations that triggered the specified rule
following: depends on all the following annotations that trigger the specified rule
between: depends on all the annotations that are contained in the current annotation (according to the begin and end features) and that trigger the specified rule
lastOneOf: depends on the last annotation that triggered one of the specified rules

In the example below we use last as the dependency type. This means that when CODA applies the second rule to an annotation, it looks back through the preceding annotations until it finds one for which the first rule was used. CODA then takes that annotation as the target of this instance of the dependency, so the application of the second rule to the given annotation depends on the annotation just found. Once the link between the two rules has been established, the rule that stated the dependency can use the placeholders defined and initialized in the other rule. The syntax is similar to that of a local placeholder; the only difference is that the placeholder name is formed from the other rule's id, followed by "." and by the name of the placeholder defined in that rule. In the example, the second rule uses the placeholder $film.filmId from the first rule in its second triple.

			
prefix 	my: http://IMDB#;
prefix 	xsd: http://www.w3.org/2001/XMLSchema#;
prefix 	owl: http://www.w3.org/2002/07/owl#;
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri		movieId					
	}
	graph= {
		$filmId		a		my:Film	.
	}
};
			
rule it.uniroma2.art.uima.imdb.IMDBFilmCast id:cast dependsOn last(film) {
	nodes = {
		actorId		uri		actorsList/personId
	}
	graph= {
		$actorId	a		my:Actor.
		$actorId	my:actedId	$film.filmId .	
	}
};

An example of how to state and use the dependency mechanism can be seen in this example or in the demo.
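
The look-back performed for the last dependency type can be sketched in Python as follows. The annotation stream is modelled as a list of dictionaries recording which rule each annotation triggered and which nodes it produced; this representation, the sample values and the function names are assumptions made for illustration.

```python
def last_trigger(annotations, index, rule_id):
    """Index of the closest earlier annotation that triggered rule_id, or None."""
    for i in range(index - 1, -1, -1):
        if annotations[i]["rule"] == rule_id:
            return i
    return None


def resolve_last(annotations, index, reference):
    """Resolve a reference such as 'film.filmId' for the annotation at `index`."""
    rule_id, name = reference.split(".", 1)
    target = last_trigger(annotations, index, rule_id)
    return None if target is None else annotations[target]["nodes"][name]


# Hypothetical stream: two films, each followed by one cast annotation.
ANNOTATIONS = [
    {"rule": "film", "nodes": {"filmId": "my:film_1"}},
    {"rule": "cast", "nodes": {"actorId": "my:actor_7"}},
    {"rule": "film", "nodes": {"filmId": "my:film_2"}},
    {"rule": "cast", "nodes": {"actorId": "my:actor_9"}},
]

print(resolve_last(ANNOTATIONS, 3, "film.filmId"))
```

Each cast annotation is linked to the film annotation that most recently preceded it, not to the first one in the stream.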

Bindings

This optional section is used to define bindings with other UIMA annotations. Bindings are a particular type of dependency, used to establish a connection with an inner annotation (an annotation that can be reached through a specific feature path) by referring to a lazy rule.
They are used when the values contained in several placeholders must not be mixed. An example is a list of annotations about several persons: it is important not to mix the first name of one person with the last name of another, so these pieces of information must always remain linked together when CODA suggests RDF triples.

			
prefix 	my:	<http://art.uniroma2.it/ontology#>

lazy rule it.uniroma2.Person id:person   {
	nodes = {
		personId		uri			id
		firstName		plainLiteral		firstName
		secondName		plainLiteral		secondName
	}
}

lazy rule it.uniroma2.People id:people   {
	bindings = {
		singlePerson	personList	person
	}
	nodes = {
		city		uri		mainCity
	}
	graph = {
		$singlePerson..personId		my:livesIn	$city .
		$singlePerson..personId		a		my:Person .
		$singlePerson..personId		my:name		$singlePerson..firstName .
		$singlePerson..personId		my:lastName	$singlePerson..secondName .
		
	}
}

In this example, let us assume that we have an annotation of type it.uniroma2.People which has two features: mainCity (with a single value) and personList (a list of annotations of type it.uniroma2.Person). We want to associate (in RDF) each person with his/her own first name and last name. This can be accomplished using the bindings construct. Each $singlePerson (defined in the bindings section) refers to a single annotation, and in the graph section we can access its inner values using "..", so that each personId is linked through the desired property (my:name or my:lastName) to his/her own name.
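
The effect of the binding can be illustrated with a small Python sketch in which the People annotation is a dictionary and each element of personList plays the role of one $singlePerson. The data and the function name are hypothetical, and feature values are used directly as nodes for brevity.

```python
def project_people(people):
    """Emit the triples of the graph section once per bound person."""
    triples = []
    city = people["mainCity"]
    for person in people["personList"]:  # each iteration is one $singlePerson
        pid = person["id"]
        triples += [
            (pid, "my:livesIn", city),
            (pid, "a", "my:Person"),
            (pid, "my:name", person["firstName"]),
            (pid, "my:lastName", person["secondName"]),
        ]
    return triples


# Hypothetical annotation of type it.uniroma2.People.
PEOPLE = {
    "mainCity": "Rome",
    "personList": [
        {"id": "p1", "firstName": "Ada", "secondName": "Lovelace"},
        {"id": "p2", "firstName": "Alan", "secondName": "Turing"},
    ],
}

TRIPLES = project_people(PEOPLE)
```

Because all values of one person are read inside the same iteration, a first name can never be paired with the last name of another person.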

Use of the OPTIONAL clause in the graph section

A write operation of a graph pattern GP into a graph G succeeds if all the three elements (subject, predicate, object) of the triples in GP are bound (instantiated). A triple may not be fully instantiated for various reasons: for instance, a failed match on the WHERE section might leave a variable uninstantiated, or a missing value for a UIMA feature referenced in a node declaration may leave the placeholder for the generated RDF node empty.

A whole projection rule succeeds if all the write operations in the GRAPH section succeed (i.e. the whole graph is fully instantiated).

The OPTIONAL clause may be used to wrap a given subgraph of the graph specified in the GRAPH section, in order to make it non-mandatory for the successful production of the triples to be generated.

			
prefix 	my: http://art.uniroma2.it/imdb#;
prefix 	xsd: http://www.w3.org/2001/XMLSchema#;
prefix 	owl: http://www.w3.org/2002/07/owl#;
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri			title
		filmTitle	literal^^xsd:string	title
		year		literal^^xsd:integer	year
		description	literal^^xsd:string	description
	}
				
	graph = {
		$filmId		a			my:Film	.
		$filmId		my:title		$filmTitle	.
		$filmId		my:releasedIn		$year		.
		OPTIONAL { 
			$filmId		my:description	$description	.
		}
	}
}

In the above case, if the movie description is not available in the UIMA feature structure, the application of the rule as a whole is not compromised, and the triples outside the OPTIONAL block are still written.
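
The success condition and the role of OPTIONAL can be sketched in Python as follows. A missing placeholder is modelled as None; the treatment of an OPTIONAL group as all-or-nothing is an assumption of this sketch, as are the function name and the sample values.

```python
def write_graph(mandatory, optional_groups, placeholders):
    """Return the triples to write, or None when a mandatory triple is unbound."""
    def bind(triple):
        bound = []
        for term in triple:
            if term.startswith("$"):
                term = placeholders.get(term[1:])
                if term is None:
                    return None  # this triple is not fully instantiated
            bound.append(term)
        return tuple(bound)

    result = []
    for triple in mandatory:
        bound = bind(triple)
        if bound is None:
            return None  # the whole rule fails
        result.append(bound)
    for group in optional_groups:
        bound_group = [bind(triple) for triple in group]
        if all(b is not None for b in bound_group):
            result.extend(bound_group)  # assumed: a group is written only if complete
    return result


MANDATORY = [
    ("$filmId", "a", "my:Film"),
    ("$filmId", "my:title", "$filmTitle"),
    ("$filmId", "my:releasedIn", "$year"),
]
OPTIONAL = [[("$filmId", "my:description", "$description")]]

# Hypothetical nodes: the description feature is missing.
NODES = {"filmId": "my:Heat", "filmTitle": '"Heat"', "year": '"1995"', "description": None}

print(write_graph(MANDATORY, OPTIONAL, NODES))
```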

Lists and FeaturePaths

PEARL allows the use of all the feature values that satisfy a particular FeaturePath, as in the example below:

			
prefix 	my: http://IMDB#;
prefix 	xsd: http://www.w3.org/2001/XMLSchema#;
prefix 	owl: http://www.w3.org/2002/07/owl#;
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri		movieId					
		actorId		uri		actorsList/personId
	}
	graph= {
		$filmId		a		my:Film	.
		$actorId	a		my:Actor.
		$actorId	my:actedId	$filmId .	
	}
};

The second placeholder, actorId, does not contain just a single value, but a list of values. As the name of the feature suggests, actorsList contains a list of values (and these values are not primitive values, such as strings or integers). The syntax for placing a list of values inside a placeholder is the same as the one used for a feature with a single value. It is also possible to select a specific position inside the List/Array using the syntax actorsList[i], where i is the position we wish to use (starting from 0, not from 1). The notation without a specific position is extremely useful in two cases:

when writing the rule we do not know how many elements the list contains, but we know that all of them must be treated in the same way
we know the cardinality of the List/Array, but we are not interested in any particular element and need to treat all the elements in the same manner
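
The behaviour of feature paths over lists can be sketched in Python, modelling a feature structure as nested dictionaries and lists. This is an illustration of the notation only, not the UIMA or CODA implementation; the function name resolve and the sample annotation are hypothetical.

```python
import re


def resolve(value, path):
    """Return all the values reached by a '/'-separated feature path."""
    current = [value]
    for step in path.split("/"):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", step)
        if m is None:
            raise ValueError(f"malformed path element: {step!r}")
        name, index = m.group(1), m.group(2)
        reached = []
        for item in current:
            child = item[name]
            if isinstance(child, list):
                # Without an index every element is kept; with one, only that position.
                reached.extend(child if index is None else [child[int(index)]])
            else:
                reached.append(child)
        current = reached
    return current


# Hypothetical annotation of type IMDBFilm.
FILM = {
    "movieId": "m1",
    "actorsList": [{"personId": "p1"}, {"personId": "p2"}],
}

print(resolve(FILM, "actorsList/personId"))
print(resolve(FILM, "actorsList[1]/personId"))
```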

Appendix

Grammar of the PEARL language

The antlr file containing the grammar used by CODA can be downloaded here

A simplified version of the language grammar, expressed in Backus-Naur form, is given below:

			
prRules := prefixDeclaration* annotations? rules+ ;
prefixDeclaration := 'prefix' prefix_name ':' namespace ';';
annotations :=	'annotations' '=' '{' (annotationDefinition)+ '}';
annotationDefinition := 'Annotation' annotationName ;
metaAnnotation* := '@' metaAnnotationName ('(' parameter (',' otherParameter)* ')')?
rules := (rule|lazyRule)+ ;
rule := 'rule' uimaTypeName 'id:' idVal ('dependsOn' depend (',' depend)*)? '{' 
	bindingsClause? nodesClause? graphClause? whereClause?'}' ;
lazyRule := 'lazy' 'rule' uimaTypeName 'id:' idVal '{' nodesClause? '}' ;
depend := dependType '(' (depRuleIds | params) (',' (depRuleIds | params ) )* ')' 
	'as' depRuleIdAs ;
bindingsClause := 'bindings' '=' '{' bindingDef+ '}' ;
bindingDef :=  bindingId featurePath bindingRuleId ;
featurePath := featurePathElement ('/' featurePathElement)* ;
nodesClause := 'nodes' '=' '{' nodeDef+ '}' ;
nodeDef := nodeId projectionOperator featurePath ;
projectionOperator :=  ('uri' converters?) |  ('literal' '^^' iri converters?) | 
	('literal' langtag  converters?) | ('literal' converters?) ;
converters := '(' iri (',' iri)* ')' ;
graphClause := 'graph' '=' '{' graphElement+ '}' ;
graphElement :=  annotation* (graphTriple | optionalGraphElement) ;
annotation := '@' target ( '(' annotationName ')' ) '.' ;
graphTriple := graphSubject graphPredicate graphObject '.' ;
optionalGraphElement := 'OPTIONAL' '{' graphElement+ '}' ;
graphSubject := var |  iri | blankNode | placeholder ;
graphPredicate := var | iri | abbr | placeholder ;
graphObject := var | iri | literal | blankNode | placeholder ;
whereClause := 'where' '='  '{' graphElement+ '}' ;

Converters overview

CODA provides a set of converters to generate URI or Literal resources.

In addition to the default converters, described in the summary table below, there are two main kinds of converters, available only for URIs:

Deterministic converters: given the same input, they always generate the same URI. A converter of this type composes the URI by concatenating a deterministically generated part to a fixed prefix;
Random converters: a converter of this type composes the URI by concatenating a random part to a fixed prefix.
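
The difference between the two kinds can be illustrated with a short Python sketch. How CODA actually derives the hexadecimal sequence is not stated here: the use of SHA-256 for the deterministic part and of the secrets module for the random part are assumptions, as are the function names. Only the local part of the URI is generated.

```python
import hashlib
import secrets


def det_id(value, prefix="c_", digits=16):
    """Deterministic identifier: the same input always yields the same result."""
    # Assumption: the hexadecimal sequence is taken from a hash of the input.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return prefix + digest[:digits]


def rand_id(prefix="c_", digits=16):
    """Random identifier: independent of any input (digits must be at most 16)."""
    return prefix + secrets.token_hex(8)[:digits]


print(det_id("dog") == det_id("dog"))  # True: repeatable for the same input
print(rand_id("xl_", 8))               # differs from run to run
```

A deterministic converter is useful when the same input must always map to the same resource, while a random one yields a fresh identifier at each invocation.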

The following table gives an overview of the available CODA converters:

Converter	Node type	Description	Contract	Converter class
`(default activated converter)`	`uri`	Default URI converter. Generates a URI by concatenating the baseUri and the given input. If the input string is already a URI, it is returned unchanged.	DefaultContract	DefaultConverter
`(default activated converter)`	`literal`	Default Literal converter. Simply returns the given input as a Literal.	DefaultContract	DefaultConverter
`coda:detGen-ConceptId`	`uri`	Deterministic converter. Generates a URI by concatenating a c_ prefix with a 16-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSConceptContract	DeterministicIdGenForSKOSConceptConverter
`coda:detGen-ConceptId-trunc4`	`uri`	Deterministic converter. Generates a URI by concatenating a c_ prefix with a 4-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSConceptContract	DeterministicIdGenForSKOSConceptConverterTrunc4
`coda:detGen-ConceptId-trunc8`	`uri`	Deterministic converter. Generates a URI by concatenating a c_ prefix with an 8-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSConceptContract	DeterministicIdGenForSKOSConceptConverterTrunc8
`coda:detGen-ConceptId-trunc12`	`uri`	Deterministic converter. Generates a URI by concatenating a c_ prefix with a 12-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSConceptContract	DeterministicIdGenForSKOSConceptConverterTrunc12
`coda:randGen-ConceptId`	`uri`	Random converter. Generates a URI by concatenating a c_ prefix with a 16-digit hexadecimal random character sequence.	RandomIdGenForSKOSConceptContract	RandomIdGenForSKOSConceptConverter
`coda:randGen-ConceptId-trunc4`	`uri`	Random converter. Generates a URI by concatenating a c_ prefix with a 4-digit hexadecimal random character sequence.	RandomIdGenForSKOSConceptContract	RandomIdGenForSKOSConceptConverterTrunc4
`coda:randGen-ConceptId-trunc8`	`uri`	Random converter. Generates a URI by concatenating a c_ prefix with an 8-digit hexadecimal random character sequence.	RandomIdGenForSKOSConceptContract	RandomIdGenForSKOSConceptConverterTrunc8
`coda:randGen-ConceptId-trunc12`	`uri`	Random converter. Generates a URI by concatenating a c_ prefix with a 12-digit hexadecimal random character sequence.	RandomIdGenForSKOSConceptContract	RandomIdGenForSKOSConceptConverterTrunc12
`coda:detGen-XLabelId`	`uri`	Deterministic converter. Generates a URI by concatenating an xl_ prefix with a 16-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSXLLabelContract	DeterministicIdGenForSKOSXLLabelConverter
`coda:detGen-XLabelId-trunc4`	`uri`	Deterministic converter. Generates a URI by concatenating an xl_ prefix with a 4-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSXLLabelContract	DeterministicIdGenForSKOSXLLabelConverterTrunc4
`coda:detGen-XLabelId-trunc8`	`uri`	Deterministic converter. Generates a URI by concatenating an xl_ prefix with an 8-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSXLLabelContract	DeterministicIdGenForSKOSXLLabelConverterTrunc8
`coda:detGen-XLabelId-trunc12`	`uri`	Deterministic converter. Generates a URI by concatenating an xl_ prefix with a 12-digit hexadecimal deterministic character sequence.	DeterministicIdGenForSKOSXLLabelContract	DeterministicIdGenForSKOSXLLabelConverterTrunc12
`coda:randGen-XLabelId`	`uri`	Random converter. Generates a URI by concatenating an xl_ prefix with a 16-digit hexadecimal random character sequence.	RandomIdGenForSKOSXLLabelContract	RandomIdGenForSKOSXLLabelConverter
`coda:randGen-XLabelId-trunc4`	`uri`	Random converter. Generates a URI by concatenating an xl_ prefix with a 4-digit hexadecimal random character sequence.	RandomIdGenForSKOSXLLabelContract	RandomIdGenForSKOSXLLabelConverterTrunc4
`coda:randGen-XLabelId-trunc8`	`uri`	Random converter. Generates a URI by concatenating an xl_ prefix with an 8-digit hexadecimal random character sequence.	RandomIdGenForSKOSXLLabelContract	RandomIdGenForSKOSXLLabelConverterTrunc8
`coda:randGen-XLabelId-trunc12`	`uri`	Random converter. Generates a URI by concatenating an xl_ prefix with a 12-digit hexadecimal random character sequence.	RandomIdGenForSKOSXLLabelContract	RandomIdGenForSKOSXLLabelConverterTrunc12
`coda:randGen-DefinitionId`	`uri`	Random converter. Generates a URI by concatenating a def_ prefix with a 16-digit hexadecimal random character sequence.	RandomIdGenForDefinitionContract	RandomIdGenForDefinitionConverter
`coda:randGen-DefinitionId-trunc4`	`uri`	Random converter. Generates a URI by concatenating a def_ prefix with a 4-digit hexadecimal random character sequence.	RandomIdGenForDefinitionContract	RandomIdGenForDefinitionConverterTrunc4
`coda:randGen-DefinitionId-trunc8`	`uri`	Random converter. Generates a URI by concatenating a def_ prefix with an 8-digit hexadecimal random character sequence.	RandomIdGenForDefinitionContract	RandomIdGenForDefinitionConverterTrunc8
`coda:randGen-DefinitionId-trunc12`	`uri`	Random converter. Generates a URI by concatenating a def_ prefix with a 12-digit hexadecimal random character sequence.	RandomIdGenForDefinitionContract	RandomIdGenForDefinitionConverterTrunc12