PEARL: the ProjEction of Annotations Rule Language

Introduction

One of the main features of CODA is PEARL (ProjEction of Annotations Rule Language): a transformation language from Feature Structures to RDF.

In the specific technological context of CODA, PEARL is used to project UIMA annotations (expressed as feature structures specified by a UIMA Type System and stored in a UIMA CAS) onto RDF graph patterns. The language is however generic enough to be adopted in other similar environments (such as GATE) using feature structures to represent annotations.

Core elements of the language are Projection Rules, enabling users to describe matches over sets of annotations produced by UIMA Analysis Engines over streams of unstructured information, and to specify how the matched annotations will be transformed into RDF triples.
PEARL combines the mechanism of UIMA feature paths (to extract relevant information from UIMA annotations) with a subset of the SPARQL syntax to describe patterns for generating RDF triples.
We describe here the structure of a typical projection document (a document containing a set of projection rules) and then give a list of concise examples to show the expressiveness of the language.

Structure of a PEARL document

A projection document (containing PEARL rules) begins with Prefix Declarations, followed by the optional Annotation Declarations and then by one or more Projection Rules.

			
prefix ...

Annotation ...

rule ... {
	...
}

Prefix Declaration

At the beginning of the Projection Document each prefix used to shorten a URI in the projection rules is bound to a namespace.

			
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

Annotation ...

rule ... {
	...
}
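
To make the role of a prefix binding concrete, the short Python sketch below expands a prefixed name into a full URI. It only illustrates the abbreviation mechanism and is not CODA code; the expand helper and the prefixes dictionary are hypothetical names introduced for this example.

```python
# Illustrative only: models how a prefix declaration shortens URIs.
# "prefixes" mirrors the single declaration of the example above.
prefixes = {"xsd": "http://www.w3.org/2001/XMLSchema#"}


def expand(qname, prefixes):
    """Expand 'prefix:local' into a full URI using the declared namespaces."""
    prefix, _, local = qname.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix}")
    return prefixes[prefix] + local


print(expand("xsd:string", prefixes))
```

A prefix that has not been declared raises an error in this sketch, which reflects the requirement that each prefix be bound to a namespace before it is used.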

Annotation Declaration

In this optional section, it is possible to define the annotations that will be used inside the various projection rules. Only defined annotations can be used in the projection rules.

By default, these annotations only annotate the desired part of a projection rule: they are not propagated to the suggested triples or to any other structure returned by CODA. An annotation can be made persistent (that is, accessible to users and tools during the triple generation process) by marking its definition with the meta-annotation @Retained. This is useful, for example, when a different tool needs to post-process the suggested triples or the generated nodes.

Another useful meta-annotation already provided by CODA is @Target, which states where the annotation carrying it can be used. The possible values are:

node: the annotation can be assigned, in the nodes section, to a single placeholder
subject: the annotation can be assigned in the graph section (or in any equivalent section containing triples) and is associated with the subject of the triple
predicate: the annotation can be assigned in the graph section (or in any equivalent section containing triples) and is associated with the predicate of the triple
object: the annotation can be assigned in the graph section (or in any equivalent section containing triples) and is associated with the object of the triple
triple: the annotation can be assigned in the graph section (or in any equivalent section containing triples) and is associated with the entire triple

The meta-annotation @Description adds a description to the annotation itself; it is meant only for human readers, to help them understand the idea behind the annotation.

An example of a definition of an annotation is:

			
prefix	my: 	<http://art.uniroma2.it/>.			

@Retained
@Description("This annotation declares the set of possible values - IRIs - that are expected to be stored in the related node")
@Target(node)
Annotation ObjectOneOf {
    IRI [] value();
}

rule it.uniroma2.art.coda.test.ae.type.City id:city {
		
	nodes = {
		@ObjectOneOf({my:Rome, <http://art.uniroma2.it/Milan>})
		cityName		uri 		name
	}
	
	graph = {
		$cityName		a			my:City .
	}
}

In this simple example, the user defines an annotation called ObjectOneOf. Because of the meta-annotation @Retained, it will be present in the suggestions and data structures returned by CODA; because of @Target(node), it can be used only on a node placeholder, that is, in the nodes section. The annotation also has a description (given by @Description).

Inside the rule, the annotation @ObjectOneOf is used, correctly, in the nodes section with a list of values: in its definition the parameter value is an array of IRIs, so a list of IRIs can be passed to the annotation.

It is up to the tool calling CODA to check that these annotations are used in a semantically correct way. In this case, the tool should check that the placeholder cityName (and therefore every generated triple in which it is used) holds only one of the values allowed by @ObjectOneOf.
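
The kind of check left to the calling tool can be sketched in a few lines of Python. This is only an assumption about what such a tool might do: the function name and the way the retained annotation values reach the tool are hypothetical, and the IRIs are the two values of the example with the my: prefix expanded.

```python
# Hypothetical post-processing check performed by the tool that calls CODA.
# ALLOWED holds the expanded values passed to @ObjectOneOf in the example.
ALLOWED = {"http://art.uniroma2.it/Rome", "http://art.uniroma2.it/Milan"}


def violates_object_one_of(node_value, allowed=ALLOWED):
    """Return True when the generated node is not one of the allowed IRIs."""
    return node_value not in allowed


print(violates_object_one_of("http://art.uniroma2.it/Rome"))
print(violates_object_one_of("http://art.uniroma2.it/Paris"))
```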

Projection Rules

The rest of the projection document specifies a set of projection rules. There are various types of rules (see Advanced Concepts later); this section of the manual only deals with standard projection rules.

A Projection Rule specification starts with a rule declaration, followed by its definition, which is composed of the following sections: nodes, graph and where. The graph section is the only mandatory section in a rule definition.

Rule Declaration

Each rule starts with a declaration, introduced by the keyword "rule", and ends before the curly bracket "{", which begins its definition.

			
...

rule it.uniroma2.art.uima.imdb.IMDBFilmCast id:cast dependsOn ... {
	...
}

The first element in the declaration, following the rule keyword, is a reference to a type (e.g. it.uniroma2.art.uima.imdb.IMDBFilmCast) from the adopted UIMA Type System: any UIMA annotation of that type (or any annotation which is a subtype of the specified type, depending on the CODA configuration) will trigger the use of this rule.

The rule identifier (cast in the above example) follows the type declaration; it is a hook that can be used to reference the rule from other rules, according to different dependency relationships. These relationships can be declared in an optional list following the rule identifier, introduced by the keyword dependsOn (see the Advanced Concepts section).

The rule declaration and other parts of the rule can be decorated with annotations, providing additional semantics to the annotated content.

Nodes

The Nodes section is the place for declaring and creating the nodes which will be used in the generated RDF statements. Although not mandatory, it provides the actual RDF nodes which will be composed (in the following graph section) into the graph populating the target dataset.

Each node declaration is composed of:

a name (a placeholder for the newly created node)
a node type declaration, indicating the nature of the node (uri or literal), followed by an optional list of converters enclosed in round brackets (e.g. uri(coda:regexp(...), coda:formatter(...)) ). For literals, the node type may be followed either by a language tag (e.g. literal@en) or by a datatype (e.g. literal^^xsd:string)
a feature path locating an element (or a set of elements) of the UIMA annotation matched by the triggered rule

Conversions are applied to the annotation elements identified by the feature path, in order to produce well-formed RDF terms.

A set of conversion functions (shortly: converters) are available for applying different transformations to the input features. Converters are identified by URIs (corresponding to a contract of the conversion function) and can be invoked by specifying their URI between round brackets after the node type in the node declaration. The section on converters provides more details about their usage, while the appendix of this PEARL manual lists available converters in CODA, together with their description; also, the developer manual provides a dedicated section where third party developers may learn how to extend CODA with new converters.

If no converter is specified after the uri or literal reserved words, a default converter is implicitly invoked.

			
prefix 	xsd:	<http://www.w3.org/2001/XMLSchema#>
prefix 	cdbk:	<http://art.uniroma2.it/book/coda#>

rule it.uniroma2.Book id:book1   {
	nodes = {
		book		uri(cdbk:isbn)		isbn
		title		plainLiteral		title
		author		uri			author
		authorName	literal^^xsd:string	author
	}
	...

}

In the example above, the rule creates an RDF description of a book by converting information extracted by a UIMA Analysis Engine.

There are four node declarations. The placeholder author will host a URI constructed from the feature author of the current annotation with a simple sanitization and by prepending the default namespace. The placeholder book will contain a URI constructed from the value of the feature isbn by invoking a converter implementing the contract cdbk:isbn. This contract declares a function which turns an ISBN into a suitable URI. The two other placeholders, title and authorName, will host a plain literal and an xsd:string typed literal, respectively.
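
The three-part structure of a node declaration (placeholder name, node type, feature path) can be made explicit with a small Python sketch that splits the declarations of the example into their components. It is not the PEARL parser: the parse_node function and the NODE_TYPE pattern are hypothetical, and the sketch assumes that the three parts are separated by whitespace and that the node type carries at most one of a converter list, a language tag or a datatype, which covers only the forms shown above.

```python
import re

# Node type: uri, literal or plainLiteral, optionally followed by ONE of
# a converter list in round brackets, a language tag or a datatype.
NODE_TYPE = re.compile(
    r"^(?P<kind>uri|literal|plainLiteral)"
    r"(?:\((?P<converters>.*)\)|@(?P<lang>[A-Za-z-]+)|\^\^(?P<datatype>\S+))?$"
)


def parse_node(line):
    """Split a node declaration into name, node type details and feature path."""
    name, node_type, feature_path = line.split()
    match = NODE_TYPE.match(node_type)
    if match is None:
        raise ValueError(f"unsupported node type: {node_type}")
    return {"name": name, **match.groupdict(), "feature_path": feature_path}


print(parse_node("book uri(cdbk:isbn) isbn"))
print(parse_node("authorName literal^^xsd:string author"))
```

Converter invocations containing spaces (such as those with parameters) are deliberately out of scope for this simplified splitter.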

Graph

The graph section contains the actual projection over the target dataset graph: it describes a graph pattern which is dynamically populated with node placeholders and unified variables (see the section on where below). The graph pattern consists of a set of triples, where the first element is the subject, the second is the predicate and the third is the object of an RDF statement. Each element in the graph may be one of the following: a node placeholder, a variable, an RDF node or an abbreviation. Inside a graph pattern, placeholders are identified by the prefixed symbol "$". They may be defined in the nodes section of the current rule or of other referenced projection rules; in the latter case the placeholder name is qualified: in the examples of this manual, a placeholder reached through the dependsOn construct is written with two dots (e.g. $film..filmId), while one reached through a binding is written with a single dot (e.g. $singlePerson.personId). RDF nodes can be referenced in graph patterns through the usual notation for URIs (or qnames) and literals.

The abbreviations are represented by a finite list of words that can be used in place of explicit reference to RDF resources. This list includes, for instance, the standard abbreviation from the RDF Turtle format - also adopted in the SPARQL query language - which assumes the character "a" to be interpreted as rdf:type.

			
prefix 	my: <http://art.uniroma2.it/imdb#>
prefix 	xsd: <http://www.w3.org/2001/XMLSchema#>
prefix 	owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri			title
		filmTitle	literal^^xsd:string	title
		year		literal^^xsd:integer	year
	}
				
	graph = {
		$filmId		a		my:Film	.
		$filmId		my:title	$filmTitle	.
		$filmId		my:releasedIn	$year		.
	}
}
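
How the graph pattern above is populated can be illustrated with a short Python sketch that substitutes each $placeholder with the node created for it. This is a simplified model, not CODA's implementation: the instantiate function is a hypothetical name and the node values (an invented film title and year) are sample data made up for the example.

```python
def instantiate(pattern, nodes):
    """Replace every $placeholder in the pattern with its node value."""
    def resolve(term):
        return nodes[term[1:]] if term.startswith("$") else term
    return [tuple(resolve(term) for term in triple) for triple in pattern]


# Sample nodes, as they might result from the nodes section (invented values).
nodes = {
    "filmId": "<http://art.uniroma2.it/imdb#Alien>",
    "filmTitle": '"Alien"^^xsd:string',
    "year": '"1979"^^xsd:integer',
}
# The graph pattern of the rule above, one tuple per triple.
pattern = [
    ("$filmId", "a", "my:Film"),
    ("$filmId", "my:title", "$filmTitle"),
    ("$filmId", "my:releasedIn", "$year"),
]

for triple in instantiate(pattern, nodes):
    print(*triple, ".")
```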

INSERT and DELETE

These two sections are similar to the Graph section previously described. In particular, the INSERT section is identical to the Graph section (a rule can contain either a Graph section or an INSERT section, but not both), while the DELETE section contains the graph pattern whose generated triples will be removed from the target dataset. Both support the same constructs as the Graph section, since all three sections represent a graph pattern.

Here is an example, similar to the previous one, which creates an instance of my:Film with some associated data and removes, if present, the statement that the movie is an instance of my2:Film (to query the triple store for existing data, refer to the Where section).

			
prefix 	my: <http://art.uniroma2.it/imdb#>
prefix 	my2: <http://art.uniroma2.it/mymovies#>
prefix 	xsd: <http://www.w3.org/2001/XMLSchema#>
prefix 	owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri			title
		filmTitle	literal^^xsd:string	title
		year		literal^^xsd:integer	year
	}
				
	insert = {
		$filmId		a		my:Film	.
		$filmId		my:title	$filmTitle	.
		$filmId		my:releasedIn	$year		.
	}
	
	delete = {
		$filmId		a		my2:Film	.
	}
}
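
The combined effect of the two sections on the target dataset can be modelled with plain set operations, as in the Python sketch below. It is only an illustration: the dataset is represented as a set of triples, apply_update is a hypothetical helper, and applying the removals before the additions is an assumption borrowed from SPARQL Update, not something stated for CODA.

```python
def apply_update(graph, insert, delete):
    """Remove the 'delete' triples, then add the 'insert' triples."""
    return (graph - delete) | insert


# Target dataset before the rule is applied (invented sample data).
graph = {("ex:alien", "a", "my2:Film")}
insert = {
    ("ex:alien", "a", "my:Film"),
    ("ex:alien", "my:title", '"Alien"'),
}
delete = {("ex:alien", "a", "my2:Film")}

for triple in sorted(apply_update(graph, insert, delete)):
    print(triple)
```

Deleting a triple that is not in the dataset is harmless in this model, which matches the "if present" wording used for the example.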

Advanced Concepts

Converters

Converters are a very powerful tool of PEARL for generating URIs and literals by following different conventions, adapting annotated content or simply generating random identifiers. Converters represent an extensible part of the language: each converter is realized by a Java class implementing an interface which represents the contract for the function. Converters may be added to an existing CODA system by deploying a dedicated OSGi bundle inside its installation or by enabling automatic download of converters from the Web. In the latter case, the URI of the converter contract is used to locate the contract on the Web and to access concrete implementations hosted on OBR repositories.

Converter Contracts

Converters are referred to indirectly, by means of a URI which identifies the desired behavior (a contract) instead of a concrete implementation. In a certain sense a contract identifies a set of functionally equivalent converters, possibly differing in nonfunctional properties such as resource consumption, performance with respect to the task, or even licensing terms. The exact behavior represented by a contract is expressed in terms of specifications in natural language (as usual in contracts for services or methods), provided through standard metadata properties (e.g. dcterms:description), and it is actually the URI of the contract that provides the sole semantic anchor. For instance, the contract <http://art.uniroma2.it/coda/contracts/default> is described as “the procedure invoked by default for transforming a UIMA value into a valid RDF term”.

Converter Resolution

In fact, the input/output behavior of contracts is difficult to express in a formal language. The PEARL execution logic is kept separate from the contract resolution process as contract references are resolved into suitable converters by the Component Provider. Converters are bundled complying with the OSGi specification and stored in repositories organized according to the OBR (OSGi Bundle Repository) architecture. OBR repositories maintain metadata about the hosted bundles including their name, version, provided capabilities and requirements. A CODA converter is advertised on OBR repositories as a service capability, which holds the contract URI, and the Java interface for interrogating the converter. OBR requirements are instead populated with the non-functional properties of the converters.
The Component Provider follows a two-step procedure for resolving contracts into suitable converters. At first it uses the OBR Client to access a known set of OBR repositories (starting from the CODA Local Repository) looking for a candidate whose metadata match the contract. If no candidate is found, the Component Provider relies on its Discoverer module to explore the Web looking for additional repositories. The Discoverer exploits the fact that PEARL specifications are self-describing and, moreover, grounded in the Web of Data, since the required contracts are mentioned through dereferenceable URIs. In compliance with the Linked Data principles, the Discoverer obtains through those URIs an RDF description of the contract, including a list of authoritative repositories of known implementations. This architecture enables autonomous configuration of CODA systems, disburdening the user from manual settings prior to the execution of a PEARL projection. This is especially valuable when reusing PEARL documents written by third parties, as in open and distributed scenarios.
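
The two-step procedure can be summarized with the following Python sketch. It is a deliberately simplified model and not the Component Provider's code: repositories are represented as dictionaries mapping a contract URI to a converter name, and resolve_contract and the discover callback are hypothetical stand-ins for the OBR Client lookup and the Discoverer module.

```python
def resolve_contract(contract_uri, known_repos, discover):
    """Return a converter for the contract, or None if nothing is found."""
    # Step 1: look in the already known repositories, local one first.
    for repo in known_repos:
        if contract_uri in repo:
            return repo[contract_uri]
    # Step 2: ask the discoverer for additional repositories on the Web.
    for repo in discover(contract_uri):
        if contract_uri in repo:
            return repo[contract_uri]
    return None


# Invented sample repositories and contract URIs.
local_repo = {"http://example.org/contracts/a": "LocalConverterA"}
remote_repo = {"http://example.org/contracts/b": "RemoteConverterB"}

print(resolve_contract("http://example.org/contracts/b",
                       [local_repo], lambda uri: [remote_repo]))
```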

A list of all converters available with the standard distribution of CODA is reported in the section of the appendix: Available Converters

Converter Syntax

Some specific converters can be used simply as described in the Nodes section:

book		uri(cdbk:isbn)		isbn

In the example above, the value of the UIMA feature isbn is processed through the cdbk:isbn converter and the resulting URI is stored in the book RDF node.

From version 1.2 of CODA, the converter specification has been extended with the capability to accept parameters for customizing the conversion format.

The parameters handed to a converter are specified between round brackets after the converter name, as in the examples below. Different types of parameters are accepted, such as strings and integers; complex constructs such as maps can also be passed as parameters.

In the following example, a new converter contract, called randIdGen, is being invoked. randIdGen allows for the generation of random URIs (precisely, random localnames for a given baseuri), with customizable patterns which depend on the role of the generated resource (e.g. a skos concept, a skosxl label, an owl class etc.). The following invocation:

xlabelName		uri(coda:randIdGen('xLabel', {lexicalForm = "Rome"@en, lexicalizedResource = $city})) city/name .

first passes the value xLabel, which is a known role for indicating resources of type skosxl:Label, and then, through a map (expressed as a list of attribute/value pairs between braces), informs the converter that the lexical form for the xLabel is "Rome"@en and that the resource to be lexicalized is a node (previously declared in the Nodes section) called city.

The standard pattern of the randIdGen converter for xLabels:

xl_${lexicalForm.language}_${rand()}

ignores the URI of the lexicalized resource and uses instead the language of the lexicalForm to build the random URI for the generated resource. rand() is a random generator which, again by default, produces an 8-digit hexadecimal number. In this case, the produced localname would be of the form:

xl_en_233d4f6a
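
The shape of the generated localname can be reproduced with a few lines of Python. This only mimics the default xLabel pattern described above and says nothing about how randIdGen is actually implemented; xlabel_localname is a hypothetical name.

```python
import secrets


def xlabel_localname(language):
    """Build a localname of the form xl_<language>_<8 hexadecimal digits>."""
    # token_hex(4) returns 4 random bytes rendered as 8 hexadecimal digits.
    return f"xl_{language}_{secrets.token_hex(4)}"


print(xlabel_localname("en"))
```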

Another example of a complex converter is:

cityName		uri(coda:formatter("%s/!s/%s",<http://test>, "Milan")) 		name .

Here the coda:formatter converter is used: it takes a template and a list of values (two in this specific case) and combines them according to the template.

For the complete list of existing converters and the definition of their input parameters, please refer to the Appendix.

Annotations

It is possible to add annotations to various elements of a PEARL rule, e.g. to an entire rule, to one of the nodes (specifying which part of the node definition the annotation is associated with) or to a triple in one of the other sections.

This is an example of a rule with several annotations (@Trim and several uses of @Memoized):

			
prefix	: 	<http://art.uniroma2.it/default#>
prefix	rdf: 	<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix	rdfs: 	<http://www.w3.org/2000/01/rdf-schema#>
prefix	owl: 	<http://www.w3.org/2002/07/owl#>
prefix	skos: 	<http://www.w3.org/2004/02/skos/core#>
prefix	skosxl: 	<http://www.w3.org/2008/05/skos-xl#>
prefix	xsd: 	<http://www.w3.org/2001/XMLSchema#>
prefix	my: 	<http://art.uniroma2.it/#>
prefix	conv: 	<http://converters.it##>
prefix	coda: 	<http://art.uniroma2.it/coda/contracts/#>

rule it.uniroma2.art.coda.test.ae.type.Animal id:animal {
		
	nodes = {
		@Trim
		animalLabel		literal		name
		animalId		uri(coda:randIdGen("concept", {label = $animalLabel }))	.
		@Memoized
		animalMemoizedId	uri(coda:randIdGen("concept", {label = $animalLabel }))	.
		@Memoized("default")
		animalDefaultMapMemoizedId	uri(coda:randIdGen("concept", {label = $animalLabel }))	.
		@Memoized(other)
		animalOtherMapMemoizedId	uri(coda:randIdGen("concept", {label = $animalLabel }))	.
	}
	
	graph = {
		$animalId owl:sameAs $animalMemoizedId .
		$animalId owl:sameAs $animalDefaultMapMemoizedId .
		$animalId owl:sameAs $animalOtherMapMemoizedId .
	}
}

A list of all annotations available with the standard distribution of CODA is reported in the section of the appendix: Available Annotations

Where

As for the graph section, the (optional) where section contains a graph pattern: the purpose of this graph pattern is to link newly extracted data with information which is already present in the target dataset (i.e. the dataset which will be updated with the triples generated by CODA).

The specified graph pattern is thus matched against the target dataset to retrieve already existing nodes by means of variable unification (variables are identified by a prefixed "?" symbol), so that the variable substitutions can be reused in the graph section described above.

In this sense, it is very close in purpose to the WHERE clause of a SPARQL CONSTRUCT query. The unification mechanism assigns values to variables by constraining them on the basis of information expected to be present in the dataset: these substitutions are then applied to the graph pattern of the graph section to project data over the target dataset.

			
		
prefix my: <http://IMDB#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri			title
		filmTitle	literal^^xsd:string	title
	}
				
	graph = {
		?filmId		a			my:Film	.
		?filmId		my:title		$filmTitle	.
	}

	where = {
		?filmId		my:title		$filmTitle .
	}
}

In the WHERE section of the above example we define a variable, filmId (by means of the ? symbol, not to be confused with the $ of placeholders for generated nodes). The clause in the WHERE section should return the subject of the triples having $filmTitle as object of the my:title predicate.

Note that, if no value is retrieved, the WHERE will fail. However, a fallback mechanism in CODA will then use the value held by the placeholder with the same name as the uninstantiated variable.
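
The fallback can be pictured with the minimal Python sketch below. It only models the decision described above: bind_variable is a hypothetical helper, the WHERE results are represented as a dictionary of variable bindings, and the URIs are invented sample values.

```python
def bind_variable(name, where_bindings, placeholders):
    """Use the WHERE binding when available, else the same-named placeholder."""
    if name in where_bindings:
        return where_bindings[name]
    return placeholders[name]


# Node generated by the nodes section for the placeholder filmId.
placeholders = {"filmId": "ex:newFilm"}

# The WHERE pattern matched an existing resource: ?filmId reuses it.
print(bind_variable("filmId", {"filmId": "ex:existingFilm"}, placeholders))
# The WHERE pattern did not match: ?filmId falls back to $filmId.
print(bind_variable("filmId", {}, placeholders))
```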

In the where section, the predicates of the RDF triples can also be property paths (as in the SPARQL standard, except for the constructs that count how many times a property may occur). Examples of accepted graph patterns in the where section are:

			
			
	?subj my:prop1/my:prop2	?obj .
	
	?subj my:prop1/my:prop2+ ?obj .
	
	?subj (my:prop3|my:prop1)/my:prop2+ ?obj .
	
	?subj ^my:prop2 / ^my:prop1 ?obj .
	
	?subj !my:prop3/my:prop2 ?obj .

Conditions for rule triggering

Normally, when deciding which rule(s) to apply, only the UIMA annotation type is considered. A more advanced mechanism for selecting which rule to apply can be implemented through the use of conditions. A dedicated conditions section specifies a list of values which a given feature must or must not hold.

Thus, when the UIMA type specified in the rule definition matches the current annotation, if the conditions section is present, all the conditions in it are checked first and, only if all of them are satisfied, the rule is applied.

Each condition has three elements:

the feature path which is considered for the value comparison
the type of check to perform (currently available conditions are: IN and NOT IN )
the list of values used for the comparison

An example of how to use conditions is:

			
prefix my: <http://IMDB#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.coda.test.ae.type.Person id:person {
	conditions = {
		name			IN		["Andrea", "Armando"] .
	}
	nodes = {
		personName		uri			name 
	}
	graph = {
		$personName		a			my:Person .
	}
}

In this case CODA will apply this rule ONLY if the value contained in the feature path name is contained in the list ["Andrea", "Armando"] , so if its value is either Andrea or Armando.

Conversely:

			
prefix my: <http://IMDB#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.coda.test.ae.type.Person id:person {
	conditions = {
		name			NOT IN		["Andrea", "Armando"] .
	}
	nodes = {
		personName		uri			name 
	}
	graph = {
		$personName		a			my:Person .
	}
}

Here CODA will apply this rule ONLY if the value contained in the feature path name is NOT contained in the list ["Andrea", "Armando"], that is, if its value is neither Andrea nor Armando.
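
The evaluation of a conditions section can be sketched in Python as follows. The sketch is an illustration under simplifying assumptions: conditions_hold is a hypothetical function, the annotation is a plain dictionary, and each feature path is a single feature name as in the two examples above.

```python
def conditions_hold(annotation, conditions):
    """Return True only if every (feature, operator, values) condition holds."""
    for feature, operator, values in conditions:
        present = annotation.get(feature) in values
        if operator == "IN" and not present:
            return False
        if operator == "NOT IN" and present:
            return False
    return True


in_rule = [("name", "IN", ["Andrea", "Armando"])]
not_in_rule = [("name", "NOT IN", ["Andrea", "Armando"])]

print(conditions_hold({"name": "Andrea"}, in_rule))
print(conditions_hold({"name": "Andrea"}, not_in_rule))
```

Since all conditions must be satisfied for the rule to be applied, the function returns as soon as one of them fails.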

DependsOn

It is possible to state a dependency between two or more rules (and the annotations which trigger the use of these rules).

The rule referenced for the dependency is identified using its id. The rule stating the dependency on the first one declares it by means of the keyword dependsOn in its declaration, followed by the type of dependency. In the current implementation the following types of dependency are provided:

last: it depends on the last annotation which triggers the specified rule
next: it depends on the next annotation which triggers the specified rule
previous: it depends on all the previous annotations which trigger the specified rule
following: it depends on all the following annotations which trigger the specified rule
between: it depends on all the annotations which are contained (according to their begin and end features) in the current annotation and which trigger the specified rule
lastOneOf: it depends on the last annotation which triggers one of the specified rules (these parameters are placed next to the id of the dependency rule)

Some of these dependencies have optional parameters. Here are the dependencies with their optional parameters:

last: M, to specify the maximum distance within which the dependency should be looked for
next: M, to specify the maximum distance within which the dependency should be looked for
previous: M, to specify the maximum distance within which the dependency should be looked for; N, to specify the maximum number of annotations to consider; R, to use just one of the candidate annotations (chosen at random)
following: M, to specify the maximum distance within which the dependency should be looked for; N, to specify the maximum number of annotations to consider; R, to use just one of the candidate annotations (chosen at random)
between: R, to use just one of the candidate annotations (chosen at random)
lastOneOf: M, to specify the maximum distance within which the dependency should be looked for

In the example below we use last as the dependency type. This means that when CODA applies the second rule, it looks back through the preceding annotations until it finds one to which the first rule was applied. CODA then takes that annotation as the target of this particular instance of the dependency, so the application of the second rule to the given annotation depends on the annotation just found. Once the link between the two rules has been established, the rule that stated the dependency can use the placeholders defined and initialized in the other rule. The syntax for using such a placeholder is similar to that of a local placeholder; the only difference is that the name is formed by the other rule's id, followed by .. and then by the name of the placeholder defined in the other rule. In the example, the second rule uses the placeholder $film..filmId from the first rule in its second suggested triple.

			
prefix 	my: <http://IMDB#>
prefix 	xsd: <http://www.w3.org/2001/XMLSchema#>
prefix 	owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri		movieId					
	}
	graph= {
		$filmId		a		my:Film	.
	}
};
			
rule it.uniroma2.art.uima.imdb.IMDBFilmCast id:cast dependsOn last(film) {
	nodes = {
		actorId		uri		actorsList/personId
	}
	graph= {
		$actorId	a		my:Actor.
		$actorId	my:actedId	$film..filmId .	
	}
};

An example on how to state and use the dependency mechanism can be seen in this example or in the demo
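
The look-back performed for the last dependency type can be sketched in Python as below. This is a simplified model, not CODA's algorithm: annotations are dictionaries listed in document order, find_last is a hypothetical helper, and the optional maximum distance M is assumed here to be measured in number of annotations, which the manual does not specify.

```python
def find_last(annotations, current_index, rule_id, max_distance=None):
    """Return the closest preceding annotation that triggered rule_id."""
    preceding = range(current_index - 1, -1, -1)
    for distance, i in enumerate(preceding, start=1):
        if max_distance is not None and distance > max_distance:
            break  # assumed meaning of M: stop after M annotations
        if annotations[i]["rule"] == rule_id:
            return annotations[i]
    return None


# Invented sample: two film annotations followed by a cast annotation.
annotations = [
    {"rule": "film", "filmId": "ex:film1"},
    {"rule": "film", "filmId": "ex:film2"},
    {"rule": "cast", "actorId": "ex:actor1"},
]

target = find_last(annotations, 2, "film")
print(target["filmId"])  # the value the cast rule would see as $film..filmId
```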

Bindings

This optional section is used to define bindings with other UIMA annotations. Bindings are a particular type of dependency, used to establish a connection with an inner annotation (an annotation that can be reached through a specific feature path) by referring to a lazy rule.
They are used when we do not want to mix values contained in different placeholders. An example is a list of annotations about several persons: it is important not to mix the first name of one person with the last name of another, so these pieces of information should always remain linked together when CODA suggests RDF triples.

			
prefix 	my:	<http://art.uniroma2.it/ontology#>

lazy rule it.uniroma2.Person id:person   {
	nodes = {
		personId		uri			id
		firstName		plainLiteral		firstName
		secondName		plainLiteral		secondName
	}
}

rule it.uniroma2.People id:people   {
	bindings = {
		singlePerson	personList	person
	}
	nodes = {
		city		uri		mainCity
	}
	graph = {
		$singlePerson.personId		my:livesIn	$city .
		$singlePerson.personId		a		my:Person .
		$singlePerson.personId		my:name		$singlePerson.firstName .
		$singlePerson.personId		my:lastName	$singlePerson.secondName .
		
	}
}

In this example let's assume that we have an annotation of type it.uniroma2.People which has two features: mainCity (with a single value) and personList (which is a list of annotations of type it.uniroma2.Person). We want to associate (in RDF) each person with his/her own first name and last name. This can be accomplished using the bindings construct. Each $singlePerson (defined in the bindings section) refers to a single annotation, and in the graph section we can access its inner values using the . notation, so that each personId is linked through the desired property (my:name or my:lastName) to his/her own name.
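
The effect of the binding, one group of triples per inner annotation with values that never cross from one person to another, can be modelled with the Python sketch below. The people annotation is represented as a nested dictionary with invented sample values, and person_triples is a hypothetical function written only for this illustration.

```python
def person_triples(people):
    """Generate the triples of the graph section once per bound person."""
    city = people["mainCity"]
    triples = []
    for person in people["personList"]:  # one $singlePerson per iteration
        pid = person["id"]
        triples += [
            (pid, "my:livesIn", city),
            (pid, "a", "my:Person"),
            (pid, "my:name", person["firstName"]),
            (pid, "my:lastName", person["secondName"]),
        ]
    return triples


# Invented sample annotation with two inner Person annotations.
people = {
    "mainCity": "ex:Rome",
    "personList": [
        {"id": "ex:p1", "firstName": "Ada", "secondName": "Rossi"},
        {"id": "ex:p2", "firstName": "Luca", "secondName": "Bianchi"},
    ],
}

for triple in person_triples(people):
    print(triple)
```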

Use of the OPTIONAL clause in the graph section

A write operation of a graph pattern GP into a graph G succeeds if all the three elements (subject, predicate, object) of the triples in GP are bound (instantiated). A triple may not be fully instantiated for various reasons: for instance, a failed match on the WHERE section might leave a variable uninstantiated, or a missing value for a UIMA feature referenced in a node declaration may leave the placeholder for the generated RDF node empty.

A whole projection rule succeeds if all the write operations in the GRAPH section succeed (i.e. the whole graph is fully instantiated).

The OPTIONAL clause may be used to wrap a given subgraph of the graph specified in the GRAPH section, in order to make it non-mandatory for the successful production of the triples to be generated.

			
prefix 	my: <http://art.uniroma2.it/imdb#>
prefix 	xsd: <http://www.w3.org/2001/XMLSchema#>
prefix 	owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri			title
		filmTitle	literal^^xsd:string	title
		year		literal^^xsd:integer	year
		description	literal^^xsd:string	description
	}
				
	graph = {
		$filmId		a			my:Film	.
		$filmId		my:title		$filmTitle	.
		$filmId		my:releasedIn		$year		.
		OPTIONAL { 
			$filmId		my:description	$description	.
		}
	}
}

In the above case, if the movie description is not available in the UIMA feature structure, the application of the rule as a whole is not compromised, and the triples outside the OPTIONAL clause are still written.
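
The success criteria just described can be modelled with the Python sketch below. It is an illustration under stated assumptions, not CODA's implementation: bind and project are hypothetical functions, a missing node is represented by an absent dictionary key, and an OPTIONAL group is assumed to be written only when all of its triples are fully instantiated.

```python
def bind(triple, nodes):
    """Return the triple with placeholders replaced, or None if one is unbound."""
    bound = []
    for term in triple:
        if term.startswith("$"):
            value = nodes.get(term[1:])
            if value is None:
                return None
            bound.append(value)
        else:
            bound.append(term)
    return tuple(bound)


def project(mandatory, optional_groups, nodes):
    """Return the triples to write, or [] when the rule as a whole fails."""
    result = [bind(triple, nodes) for triple in mandatory]
    if any(triple is None for triple in result):
        return []  # a mandatory triple is not fully instantiated
    for group in optional_groups:
        bound = [bind(triple, nodes) for triple in group]
        if all(triple is not None for triple in bound):
            result.extend(bound)  # keep the optional subgraph only if complete
    return result


mandatory = [("$filmId", "a", "my:Film"), ("$filmId", "my:title", "$filmTitle")]
optional_groups = [[("$filmId", "my:description", "$description")]]
nodes = {"filmId": "ex:film1", "filmTitle": '"Some title"'}  # no description

for triple in project(mandatory, optional_groups, nodes):
    print(triple)
```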

Lists and FeaturePaths

PEARL makes it possible to use all the values of the features which satisfy a particular FeaturePath. In the example below:

			
prefix 	my: <http://IMDB#>
prefix 	xsd: <http://www.w3.org/2001/XMLSchema#>
prefix 	owl: <http://www.w3.org/2002/07/owl#>
			
rule it.uniroma2.art.uima.imdb.IMDBFilm id:film {
	nodes = {
		filmId		uri		movieId					
		actorId		uri		actorsList/personId
	}
	graph= {
		$filmId		a		my:Film	.
		$actorId	a		my:Actor.
		$actorId	my:actedId	$filmId .	
	}
};

The second placeholder, actorId, does not contain just a single value, but a list of values. As the name of the feature suggests, actorsList contains a list of values (and these values are not primitive values, such as strings or integers). The syntax for placing a list of values inside a placeholder is the same as the one used for a feature that has a single value. It is also possible to select a specific position inside the List/Array using the syntax actorsList[i], where i stands for the position we wish to use (starting from 0, not from 1). The notation without a specific position is extremely useful in the following cases:

when writing the rule, we do not know how many elements are inside the list, but we know that all these elements must be treated in the same way
we know the cardinality of the List/Array, but we are not interested in any particular element and need to treat all the elements in the same manner
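The fan-out over list-valued features can be pictured with a small Python sketch. It is not the UIMA FeaturePath implementation: feature structures are represented here as plain dictionaries and lists, the function name resolve_path is hypothetical, and the identifiers in the sample data are made up.

```python
import re


def resolve_path(structure, path):
    """Return every value reached by following `path` from `structure`."""
    current = [structure]
    for element in path.split("/"):
        match = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", element)
        if match is None:
            raise ValueError(f"unsupported path element: {element!r}")
        name, index = match.group(1), match.group(2)
        reached = []
        for node in current:
            if not isinstance(node, dict) or name not in node:
                continue
            value = node[name]
            if isinstance(value, list):
                if index is None:
                    reached.extend(value)  # no position: use all elements
                elif int(index) < len(value):
                    reached.append(value[int(index)])  # 0-based position
            else:
                reached.append(value)
        current = reached
    return current


# Hypothetical feature structure with one list-valued feature.
film = {
    "movieId": "film-1",
    "actorsList": [
        {"personId": "actor-1"},
        {"personId": "actor-2"},
        {"personId": "actor-3"},
    ],
}

all_ids = resolve_path(film, "actorsList/personId")
second_id = resolve_path(film, "actorsList[1]/personId")
```

Here all_ids holds one value per element of the list, while second_id holds only the value at position 1; in the rule above, each value of actorId would give rise to its own set of triples.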

Regexes

It is possible to express simple or complex regexes over UIMA annotations in PEARL.

Each Regex is composed of two parts:

The Regex rule
Other rules (called the forRegex rules), which are used to define the placeholders that will then be used in the regex itself (the first part)

Let's first see how the forRegex rules are defined. Each forRegex rule has only the nodes part and no graph section. An example is:

			
forRegex rule it.uniroma2.art.coda.test.ae.type.City id:city {
		
	nodes = {
		cityName		uri 		name
	}
} ;

This forRegex rule is used by a Regex rule when its id, city, is referred to. Thanks to this rule, the value found at the feature path name is converted into a URI and placed in the placeholder cityName. The placeholder will then be used in the regex itself.

Another similar forRegex rule is:

			
 forRegex rule it.uniroma2.art.coda.test.ae.type.Person id:person {
		
	nodes = {
		personName		uri 		name
	}
} ;

The Regex rule has a different syntax from the rules seen up to this point. An example of such a rule (using the two previously defined forRegex rules) is:

			
regex id:firstRegex [person as firstPerson] [5 city as firstCity]->
	graph = {
		$firstPerson..personName		a				my:Person .
		$firstCity..cityName			a				my:City .
		$firstCity..cityName			my:linkedTo			$firstCity..cityName .
	} ;

A Regex rule is introduced by the word regex followed by its id. Then comes the regex itself. Each element of the regex is placed within a pair of [ ]. An element is composed of 2 or 3 parts (3 or 4 if we count the mandatory word as):

an optional number: it represents the maximum distance allowed between this element and the previous one
a forRegex rule id: pointing to a specific forRegex id, thus implicitly stating the type of annotation the regex is looking for
a local name: which will be used in the graph part of the regex rule, to distinguish between the forRegex rules according to where they were matched in the regex

After the regex part, there is a -> followed by the graph section. This section is similar to the graph section of the other rules; the only difference is that there are no direct placeholders (since there is no nodes section), and all the placeholders used are in the form $LOCAL_RULE_ID_NAME..PLACEHOLDER

The regex supports the typical regex operators (*, +, ? and |), which should be placed outside each element

It is even possible to use regexes with conditions (in the forRegex rule, see the conditions section)

A more complex example of the forRegex and regex rules is:

			
forRegex rule it.uniroma2.art.coda.test.ae.type.City id:city {
		
	nodes = {
		cityName		uri 		name
	}
}

forRegex rule it.uniroma2.art.coda.test.ae.type.Person id:person {
		
	nodes = {
		personName		uri 		name
	}
}

forRegex rule it.uniroma2.art.coda.test.ae.type.Plant id:plant {
		
	nodes = {
		plantName		uri 		name
	}
}

forRegex rule it.uniroma2.art.coda.test.ae.type.Animal id:animal {
		
	nodes = {
		animalName		uri 		name
	}
}

forRegex rule it.uniroma2.art.coda.test.ae.type.ComplexPerson id:complex {
		
	nodes = {
		lastName		uri 		name/lastName
	}
}



regex id:firstRegex [city as firstCity] ([person as firstPerson]+ | [plant as firstPlant]+) 
					[city as secondCity] ([animal as firstAnimal] | [plant as secondPlant])+
					[city as thirdCity]? [person as secondPerson]* [animal as lastAnimal]
					->
	graph = {
		$firstCity..cityName			a				my:City .
		$lastAnimal..animalName			a				my:Animal .
		$firstCity..cityName			my:linkedTo		$lastAnimal..animalName	.
	} ;

In this case, not all the matched forRegex rules are used in the graph part of the regex rule, and the same forRegex id is used several times in the regex (each time with a different local name, so that the occurrences can be distinguished). In particular, once the regex over the UIMA annotations has been matched (using the forRegex rules), this regex rule links, via the property my:linkedTo, the URI from the first City annotation with the URI from the last Animal annotation.
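One way to picture how such a regex binds local names is to encode the sequence of annotations as a string and match it with an ordinary regular expression. The Python sketch below does this for the rule above. It is an analogy rather than CODA's matching algorithm: the one-letter encoding, the function name match_annotations and the sample annotations are assumptions, and the optional maximum-distance numbers are not modelled.

```python
import re

# One letter per forRegex rule id (an encoding chosen for this sketch).
CODES = {"city": "C", "person": "P", "plant": "L", "animal": "A"}

# Same shape as the PEARL regex; each local name becomes a named group.
PATTERN = re.compile(
    r"(?P<firstCity>C)"
    r"(?:(?P<firstPerson>P)+|(?P<firstPlant>L)+)"
    r"(?P<secondCity>C)"
    r"(?:(?P<firstAnimal>A)|(?P<secondPlant>L))+"
    r"(?P<thirdCity>C)?"
    r"(?P<secondPerson>P)*"
    r"(?P<lastAnimal>A)"
)


def match_annotations(annotations):
    """annotations: list of (rule_id, value) pairs in document order.
    Returns {local_name: value} for the first match, or None."""
    encoded = "".join(CODES[rule_id] for rule_id, _ in annotations)
    match = PATTERN.search(encoded)
    if match is None:
        return None
    # For a repeated element, Python keeps only its last occurrence.
    return {
        name: annotations[match.start(name)][1]
        for name, text in match.groupdict().items()
        if text is not None
    }


# Hypothetical annotation stream.
annotations = [
    ("city", "Rome"), ("person", "Anna"), ("person", "Marco"),
    ("city", "Paris"), ("animal", "cat"), ("plant", "oak"),
    ("animal", "dog"),
]
bound = match_annotations(annotations)
triples = [
    (bound["firstCity"], "a", "my:City"),
    (bound["lastAnimal"], "a", "my:Animal"),
    (bound["firstCity"], "my:linkedTo", bound["lastAnimal"]),
]
```

Only firstCity and lastAnimal are used to build the triples, mirroring the graph section of the rule; the other local names are matched but ignored.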

Appendix

Available Converters

CODA provides a set of converters to generate URI or Literal resources.

In addition to the default converters, described in the summary table below, there are mainly two kinds of converters, available only for URIs:

Deterministic converters: given the same input, they always generate the same URI. Basically, a converter of this type composes the URI by concatenating a deterministically generated part to a fixed prefix;
Random converters: a converter of this type composes the URI by concatenating a random part to a fixed prefix.

The following table gives an overview of the available CODA converters

`(default activated converter)`	`uri`	Default URI converter. Generates a URI by concatenating the baseUri with the given input. If the input string is already a URI, it is returned unchanged.
		DefaultConverter	DefaultConverterImpl
`(default activated converter)`	`literal`	Default Literal converter. Simply returns the given input as a Literal.
		DefaultConverter	DefaultConverterImpl
`coda:randIdGen`	`uri`	Random converter. Generates a URI by concatenating a prefix with an 8-digit random hexadecimal character sequence. The converter takes two input parameters. (1) xRole: tells the nature of the resource. (2) args: a map of further optional arguments, which depend on the xRole parameter as follows. concept (for skos:Concepts): label, the accompanying preferred label of the skos:Concept (or literal form of the accompanying xLabel); schemes, the concept schemes to which the concept is being attached at the moment of its creation (serialized as a Turtle collection). conceptScheme (for skos:ConceptSchemes): label, the accompanying preferred label of the skos:ConceptScheme (or literal form of the accompanying xLabel). skosCollection (for skos:Collections): label, the accompanying preferred label of the skos:Collection (or literal form of the accompanying xLabel). xLabel (for skosxl:Labels): lexicalForm, the lexical form of the skosxl:Label; lexicalizedResource, the resource to which the skosxl:Label will be attached; lexicalizationProperty, the property used for attaching the label. xNote (for reified skos:notes): value, the content of the note; annotatedResource, the resource being annotated; noteProperty, the property used for annotation. A custom prefix can also be given; it is placed before the random sequence.
		RandomIdGenerator	TemplateBasedRandomIdGenerator
`coda:date`	`literal`	Generates a literal with datatype xsd:date. The input value is parsed (compatibly with a set of recognized patterns) and is formatted according to the standard format (ISO 8601) yyyy-MM-dd. If no input value is provided, the converter generates the current date. If the input value cannot be parsed, the converter throws a ConverterConfigurationException.
		DateConverter	DateConverterImpl
`coda:time`	`literal`	Generates a literal with datatype xsd:time. The input value is parsed (compatibly with a set of recognized patterns) and is formatted according to the standard format (ISO 8601) hh:mm:ss. If no input value is provided, the converter generates the current time. If the input value cannot be parsed, the converter throws a ConverterConfigurationException. The converter takes two optional parameters. (1) offset, whose admitted values are: undefined (the output time will not contain any offset; if the input value has an offset, it is ignored); Z (Zulu timezone: the "Z" timezone is simply appended to the output time); <hh>:<mm> (an offset, specified in hours and minutes, that is applied to the input value, replacing the offset of the input if it already has one); reuse (the same offset as the input is applied). (2) additionalOffset: an offset specified in hours and minutes (hh:mm). If the offset parameter is <hh>:<mm>, the additionalOffset is added to it; if the offset is reuse, the additionalOffset is added to the offset of the input (which is considered to be +00:00 if missing). In every other case a ConverterConfigurationException is thrown. If invalid parameters are passed, the converter throws a ConverterConfigurationException.
		TimeConverter	TimeConverterImpl
`coda:datetime`	`literal`	Generates a literal with datatype xsd:dateTime. The input value is parsed (compatibly with a set of recognized patterns) and is formatted according to the standard format (ISO 8601) yyyy-MM-ddThh:mm:ss. If no input value is provided, the converter generates the current datetime. If the input value cannot be parsed, the converter throws a ConverterConfigurationException. The converter takes the same optional parameters as the coda:time converter.
		DatetimeConverter	DatetimeConverterImpl
`coda:langString`	`literal`	Generates a plain literal with the language tag provided as parameter.
		LangStringConverter	LangStringConverterImpl
`coda:formatter`	`uri/literal`	Generates a uri/literal. Produces a resource by replacing placeholders with the values passed as arguments (according to their order). The placeholders are: %s, the string representation of the argument (for IRIs, the string representation of the IRI; for literals, the lexical form); %n, the local name, in case of an IRI; %d, the datatype IRI, in case of a literal; %l, the language, in case of a language-tagged literal; !s, the value from the feature path or the previous converter. Each of these placeholders can be further enriched by appending ^U or ^l to have the extracted value all in upper case or in lower case. Examples (input value "Rome", $cityName1 is <http://art.uniroma2.it/Rome> and default namespace http://art.uniroma2.it/): uri(coda:formatter("%s/!s/%s",<http://test>, "Milan")) generates <http://test/Rome/Milan>; uri(coda:formatter("%s/!s/%n/%n/%l/%s/!s", <http://test>, <http://test/cat>, $cityName1, "Italia"@it, "city")) generates <http://test/Rome/cat/Rome/it/city/Rome>; literal@en(coda:formatter("!s capital of %n in %l in the %s", <http://countries/Italy>, "world"@en, "world"@en)) generates "Rome capital of Italy in en in the world"@en; literal@en(coda:formatter("!s^l capital of %n^U in %l in the %s", <http://countries/Italy>, "world"@en, "world"@en)) generates "rome capital of ITALY in en in the world"@en
		FormatterConverter	FormatterConverterImpl
`coda:regexp`	`uri/literal`	Generates a uri/literal. Produces a resource by applying a regex to the input value, to retrieve sub-parts of it (via the regex groups), and then combining such group(s) in the passed template. The groups are identified with the standard group id (i.e. $NUM). Examples (input value "Rome", $cityName1 is <http://art.uniroma2.it/Rome> and default namespace http://art.uniroma2.it/): uri(coda:regexp("R(.+)", "$1")) generates <http://art.uniroma2.it/ome>; uri(coda:regexp("R(.+)e", "$1/caput")) generates <http://art.uniroma2.it/om/caput>; literal@en(coda:regexp("R(.+)e", "$1 is the center or Rome")) generates "om is the center or Rome"@en
		RegexpConverter	RegexpConverterImpl
`coda:turtleCollection`	`literal`	Produces the TURTLE serialization (as an xsd:string) of the collection formed by the provided items
		TurtleCollectionConverter	TurtleCollectionConverterImpl
`coda:propPathIDResolver`	`URI`	Retrieves an existing URI from the dataset, using the input SPARQL Property Path. If no resource can be retrieved, it returns the input value (the value extracted from the Feature Structure or produced by the previous converter, in case a chain of converters is used) or the input fallback URI. If multiple URIs are retrieved from the dataset using the input PropertyPath and Object, the converter throws an exception. Its input parameters are: Value object, the object of the RDF triple whose subject is to be extracted, using the passed Property Path; String propPath, the Property Path of the RDF triple whose subject is to be extracted, using the passed Object; IRI fallbackIRI (optional), the URI to return in case no URI was retrieved from the dataset. An example of its use (the nodes part of a PEARL rule) is: `nodes = { nameValue literal@en name . fallbackIRI uri(coda:formatter("%s!s", "http://test.it/fallback/")) name . iriValue uri(coda:propPathIDResolver($nameValue, "(skosxl:prefLabel\|skosxl:altLabel)/skosxl:literalForm", $fallbackIRI)) name . }`
		PropPathIDResolverConverter	PropPathIDResolverConverterImpl
`coda:lexiconIDResolver`	`URI`	Retrieves an existing URI from the dataset, using the Lexicalization Model. If no resource can be retrieved, it returns the input value (the value extracted from the Feature Structure or produced by the previous converter, in case a chain of converters is used) or the input fallback URI. If multiple URIs are retrieved from the dataset using the input PropertyPath and Object, the converter throws an exception. Its input parameters are: Value object, the object of the Lexicalization Model; String lexModel (optional, default value CTX), the Lexicalization Model to use; IRI fallbackIRI (optional), the URI to return in case no URI was retrieved from the dataset. The possible values of lexModel are: RDFS (uses the property rdfs:label); SKOS (uses the property skos:prefLabel); SKOSXL (uses the Property Path skosxl:prefLabel/skosxl:literalForm); ONTOLEX (currently not supported); ALL (uses all the other Lexicalization Models); CTX (uses the Lexicalization Model of the dataset). An example of its use (the nodes part of a PEARL rule) is: `nodes = { nameValue literal@en name . fallbackIRI uri(coda:formatter("%s!s", "http://test.it/fallback/")) name . iriValue uri(coda:lexiconIDResolver($nameValue, "RDFS", $fallbackIRI)) name . }`
		TurtleCollectionConverter	TurtleCollectionConverterImpl
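The placeholder substitution performed by coda:formatter can be approximated in Python, which may help when checking what a template will produce. This is a rough model reconstructed only from the description and examples in the table, not the converter's actual code: the classes IRI and Literal, the function names fmt and local_name, and the rule that the local name is whatever follows the last '/' or '#' are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class IRI:
    value: str


@dataclass
class Literal:
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def local_name(iri):
    # Assumption: the local name is whatever follows the last '/' or '#'.
    return re.split(r"[/#]", iri.value)[-1]


def fmt(template, input_value, *args):
    """Expand the template; input_value plays the role of !s."""
    remaining = iter(args)

    def expand(match):
        token, case = match.group(1), match.group(2)
        if token == "!s":
            text = input_value
        else:
            arg = next(remaining)  # each %-placeholder consumes one argument
            if token == "%s":
                text = arg.value if isinstance(arg, IRI) else arg.lexical
            elif token == "%n":
                text = local_name(arg)
            elif token == "%d":
                text = arg.datatype or ""
            else:  # "%l"
                text = arg.lang or ""
        if case == "^U":
            return text.upper()
        if case == "^l":
            return text.lower()
        return text

    return re.sub(r"(!s|%[sndl])(\^[Ul])?", expand, template)


first = fmt("%s/!s/%s", "Rome", IRI("http://test"), Literal("Milan"))
last = fmt(
    "!s^l capital of %n^U in %l in the %s", "Rome",
    IRI("http://countries/Italy"), Literal("world", lang="en"),
    Literal("world", lang="en"),
)
```

The two calls reproduce the first and last formatter examples of the table: first is "http://test/Rome/Milan" and last is "rome capital of ITALY in en in the world". Wrapping the result into a URI or a language-tagged literal is left out of the sketch.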

Available Annotations

`@Memoized`	`NODE`	Uses the memoization mechanism for the creation of this node: passing the same Type+Converters+Value immediately retrieves the previously generated value, without calling the converters. It is mainly used together with a random-generating converter, to make sure that the same input always "generates" the same value. It is possible to specify a pool in which the memoized value will be searched, and the boolean parameter ignoreCase to ignore the case of the various elements when the comparison is done.
`@Confidence`	`TRIPLE`	Used to associate a confidence value with the proposed RDF triples created by CODA
`@DefaultNamespace`	`NODE`	Used to override the default namespace when creating the node (so this namespace will be used instead of the namespace specified in the dataset).
`@Trim`	`NODE, RULE`	Used to instruct CODA to apply the trim function to the value, before creating the nodes (so before calling the converters). Its default value is true, which is also applied when the annotation is not specified. When applied to an entire rule, its behaviour is applied to ALL the nodes in that rule.
`@RemoveDuplicateSpaces`	`NODE, RULE`	Used to instruct CODA to remove multiple sequential spaces in the value, before creating the nodes (so before calling the converters). Its default value is true, which is also applied when the annotation is not specified. When applied to an entire rule, its behaviour is applied to ALL the nodes in that rule.
`@LowerCase`	`NODE, RULE`	Used to instruct CODA to convert the value to lower case, before creating the nodes (so before calling the converters). Its default value is true, but when the annotation is not present, its effect is not applied. When applied to an entire rule, its behaviour is applied to ALL the nodes in that rule.
`@UpperCase`	`NODE, RULE`	Used to instruct CODA to convert the value to upper case, before creating the nodes (so before calling the converters). Its default value is true, but when the annotation is not present, its effect is not applied. When applied to an entire rule, its behaviour is applied to ALL the nodes in that rule.
`@RemovePunctuation`	`NODE, RULE`	Used to instruct CODA to remove the punctuation from the value, before creating the nodes (so before calling the converters). Its default value is true, but when the annotation is not present, its effect is not applied. It is also possible to pass the characters which should be removed, instead of the standard punctuation characters. When applied to an entire rule, its behaviour is applied to ALL the nodes in that rule.
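The interplay between @Memoized and a random converter can be illustrated with a short Python sketch. It only models the idea described in the table (same Type+Converters+Value, same result); the class name MemoizedGenerator, its parameters and the example prefix are hypothetical, and one instance stands for a single pool.

```python
import secrets


class MemoizedGenerator:
    """Random URI generator whose results are remembered per input."""

    def __init__(self, prefix, ignore_case=False):
        self.prefix = prefix
        self.ignore_case = ignore_case
        self.pool = {}  # (type, converters, value) -> generated URI

    def generate(self, node_type, converters, value):
        key_value = value.lower() if self.ignore_case else value
        key = (node_type, tuple(converters), key_value)
        if key not in self.pool:
            # 4 random bytes give an 8-digit hexadecimal sequence.
            self.pool[key] = self.prefix + secrets.token_hex(4)
        return self.pool[key]


generator = MemoizedGenerator("http://example.org/c_", ignore_case=True)
first = generator.generate("uri", ["coda:randIdGen"], "Rome")
again = generator.generate("uri", ["coda:randIdGen"], "rome")
```

Because ignore_case is set, the second call finds the entry created by the first one and returns the same URI instead of drawing a new random sequence.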

Grammar of the PEARL language

The ANTLR file containing the grammar used by CODA can be downloaded here

A simplified version of the language grammar, expressed in Backus-Naur form, is shown below:

			
prRules := prefixDeclaration* annotations? rules+ ;
prefixDeclaration := 'prefix' prefix_name ':' namespace ';';
annotations :=	'annotations' '=' '{' (annotationDefinition)+ '}';
annotationDefinition := 'Annotation' annotationName ;
metaAnnotation := '@' metaAnnotationName ('(' parameter (',' otherParameter)* ')')? ;
rules := (rule|lazyRule)+ ;
rule := annotation* 'rule' uimaTypeName 'id:' idVal ('dependsOn' depend (',' depend)*)? '{' 
	bindingsClause? nodesClause? graphClause? whereClause?'}' ;
lazyRule := 'lazy' 'rule' uimaTypeName 'id:' idVal '{' nodesClause? '}' ;
depend := dependType '(' (depRuleIds | params) (',' (depRuleIds | params ) )* ')' 
	'as' depRuleIdAs ;
bindingsClause := 'bindings' '=' '{' bindingDef+ '}' ;
bindingDef :=  bindingId featurePath bindingRuleId ;
featurePath := featurePathElement ('/' featurePathElement)* ;
nodesClause := 'nodes' '=' '{' nodeDef+ '}' ;
nodeDef := annotation* nodeId projectionOperator featurePath ;
projectionOperator :=  ('uri' converters?) |  ('literal' '^^' iri converters?) | 
	('literal' langtag  converters?) | ('literal' converters?) ;
converters := '(' iri (',' iri)* ')' ;
graphClause := 'graph' '=' '{' graphElement+ '}' ;
graphElement :=  annotation* (graphTriple | optionalGraphElement) ;
annotation := '@' target ( '(' annotationName ')' ) '.' ;
graphTriple := graphSubject graphPredicate graphObject '.' ;
optionalGraphElement := 'OPTIONAL' '{' graphElement+ '}' ;
graphSubject := var |  iri | blankNode | placeholder ;
graphPredicate := var | iri | abbr | placeholder ;
graphObject := var | iri | literal | blankNode | placeholder ;
whereClause := 'where' '='  '{' graphElement+ '}' ;