ART

Internal Research Project

 

Language-driven Ontology learning for Question Answering

It is now widely accepted that linguistic information is relevant to ontology engineering for at least two reasons. First, ontologies are data models whose concepts are identified by names, and processing those names from a linguistic perspective helps in understanding the often implicit semantics imposed by the knowledge engineer. Harmonisation of domain ontologies through the use of linguistic resources has been proposed, for example, in (Magnini et al., 2002). Second, although ontologies capture aspects of semantics that are somewhat independent of linguistic information, most text processing applications (e.g. knowledge-based IR or question answering) require an explicit mapping between domain concepts and their textual counterparts: an example is the terminology (e.g. multiword expressions) that embodies linguistic variants of an often complex concept in the ontology. Other examples are the event matching rules of information extraction (IE) systems. Although targeted at specific event types, IE systems must be aware of the different ways in which events are linguistically expressed: which verbs and which concepts are used to communicate a given event?

This form of lexical semantic information is strictly part of the ontological description, especially with respect to paradigmatic properties. An event type e is a specific concept and, when inheritance is required, it may be connected with topological properties in a hierarchy. However, these semantic dimensions are independent of the linguistic properties (e.g. the rules needed for detecting potential realisations of e in textual material). Linguistic rules should include a combination of syntagmatic (S) and semantic (M) constraints that a text t must satisfy in order to realise e. In other words, when S(t) ∧ M(t) is true, we can state that t realises e, i.e.

∀t. S(t) ∧ M(t) ⇒ e(t)

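As a minimal sketch of how such a rule could be encoded, the Python fragment below treats S and M as boolean predicates over a tokenised text and fires the event type when both hold. The class name EventRule, the toy lexicons and the "acquisition" event are hypothetical and are invented here for illustration only; they are not part of ART.

```python
from dataclasses import dataclass
from typing import Callable, List

Text = List[str]  # assumption: a text fragment t is a list of lowercased tokens


@dataclass
class EventRule:
    """Sufficient condition for an event type e: S(t) and M(t) implies e(t)."""
    event: str
    syntagmatic: Callable[[Text], bool]  # S(t)
    semantic: Callable[[Text], bool]     # M(t)

    def realises(self, t: Text) -> bool:
        # True when both constraints hold. The rule is only sufficient, so
        # False means "this rule does not fire", not "t does not express e".
        return self.syntagmatic(t) and self.semantic(t)


# Hypothetical toy lexicons, invented for this illustration.
ACQUISITION_VERBS = {"acquires", "buys", "purchases"}
COMPANIES = {"acme", "globex"}


def s_acquisition(t: Text) -> bool:
    # S: a trigger verb occurs with at least one token on each side of it.
    return any(tok in ACQUISITION_VERBS and 0 < i < len(t) - 1
               for i, tok in enumerate(t))


def m_acquisition(t: Text) -> bool:
    # M: at least two tokens belong to the semantic class "company".
    return sum(tok in COMPANIES for tok in t) >= 2


rule = EventRule("acquisition", s_acquisition, m_acquisition)
print(rule.realises("acme buys globex".split()))  # True
print(rule.realises("acme buys lunch".split()))   # False
```

Keeping S and M as separate fields, rather than one fused matcher, mirrors the distinction made in the text between syntagmatic and semantic constraints.
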
Rules like the one above are needed for IE over data sets of a realistic size, although they are usually not represented ontologically. Examples of properties S or M are distributional properties (e.g. mutual information of word collocations in corpora) that can suggest a concept (e.g. a terminology item that represents a concept in the domain) or a relation. Machine learning techniques are widely used to observe these properties and inductively develop the required concepts and relations. However, once learned, these are usually mapped into the target KB through validation (often manual, as in (Bozsak et al., 2002)) that determines the (new) topological properties and their implicatures. As a consequence, the textual semantics of a concept or a relation is not preserved in the target ontology. Syntagmatic and semantic properties are used to justify the eligibility of a given lexical structure as an ontological concept but are then neglected. The textual properties that were used to justify an inductive decision (e.g. that a given fragment is a terminological expression, or that a given syntagmatic structure is a prototypical rule for the relation or event type e) are not associated, after the decision is taken, with the resulting concept or relation type. Although such properties in principle depend on the underlying domain ontology, they are never explicitly represented. Rare exceptions are works in which world and textual semantics are integrated for text understanding, as in (Hahn and Schnattinger, 1998; Hahn and Mark, 2002).
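
To make the idea of distributional evidence concrete, the sketch below computes pointwise mutual information (PMI) for adjacent word pairs and keeps the score attached to each candidate term, so that the textual justification is not discarded once a candidate is accepted. PMI over bigrams is only one common reading of "mutual information of word collocations"; the function names, the dictionary layout and the sample sentence are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def pmi_scores(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """PMI(w1, w2) = log2( P(w1 w2) / (P(w1) * P(w2)) ) for adjacent pairs."""
    if len(tokens) < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def candidate_terms(tokens: List[str], min_pmi: float) -> List[dict]:
    # Each candidate keeps the evidence that justified it, instead of
    # dropping the score after the inductive decision is taken.
    return [
        {"label": " ".join(pair), "evidence": {"pmi": score}}
        for pair, score in pmi_scores(tokens).items()
        if score >= min_pmi
    ]


# Hypothetical sample text, invented for this illustration.
tokens = ("question answering system uses question answering "
          "over a domain ontology").split()
for term in candidate_terms(tokens, min_pmi=2.0):
    print(term["label"], round(term["evidence"]["pmi"], 2))
```

On a corpus this small PMI favours rare pairs, so a realistic setting would normally add a minimum frequency cut-off before thresholding.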

In ART (intelligent Agent at Roma Tor vergata), we define a framework where specific lexical semantic information can be integrated as an inherent component of an ontology. The advantage is that the enriched information is made available even in the early phases of the ontology engineering process. Ontology learning is thus mapped into an incremental process in which NL learning interleaves with ontology engineering.

The proposed framework needs to make available at least the following semantic components:

  • a set of domain concepts and relations mainly devoted (as in the traditional view of the ontology) to defining properties of individuals, relations and typical tasks involved in the application process
  • a language component including lexical semantic information (e.g. word sense and thematic descriptions for specific classes like verbs) structured according to linguistic methods and principles and modeled independently from the domain knowledge
  • a mapping between linguistic and domain concepts, which is usually not captured by the linguistic labels of concepts, since naming assumptions may vary hugely across domains and applications
  • a systematic mapping between domain relations (e.g. properties of individuals or binary relationships) and their linguistic counterparts (e.g. terminological expressions or predicate argument structures of verbs denoting those properties). Notice how the distinction between the latter linguistic rules and the former relationships reflects the traditional separation in linguistics between grammatical functions and thematic roles.
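
The four components listed above can be sketched as a single data structure. The fragment below is an illustrative assumption, not ART's actual representation (which the text says will rely on OWL): all class names, field names and the small university example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class DomainRelation:
    name: str
    domain: str   # concept of the first argument
    range_: str   # concept of the second argument


@dataclass(frozen=True)
class VerbFrame:
    verb: str
    roles: Tuple[str, ...]  # thematic roles, e.g. ("Agent", "Theme")


@dataclass
class EnrichedOntology:
    # 1. Domain concepts and relations.
    concepts: Set[str] = field(default_factory=set)
    relations: Dict[str, DomainRelation] = field(default_factory=dict)
    # 2. Language component, modelled independently of the domain.
    senses: Dict[str, List[str]] = field(default_factory=dict)  # word -> sense ids
    frames: Dict[str, VerbFrame] = field(default_factory=dict)  # verb -> frame
    # 3. Mapping from linguistic concepts (word senses) to domain concepts.
    sense_to_concept: Dict[str, str] = field(default_factory=dict)
    # 4. Mapping from a domain relation to (verb, thematic role -> argument).
    realisations: Dict[str, List[Tuple[str, Dict[str, str]]]] = field(
        default_factory=dict)

    def concepts_for_word(self, word: str) -> Set[str]:
        return {self.sense_to_concept[s]
                for s in self.senses.get(word, [])
                if s in self.sense_to_concept}

    def relations_for_verb(self, verb: str) -> List[str]:
        return [name for name, reals in self.realisations.items()
                if any(v == verb for v, _ in reals)]


# Hypothetical university-domain example.
onto = EnrichedOntology()
onto.concepts |= {"Professor", "Course"}
onto.relations["teaches"] = DomainRelation("teaches", "Professor", "Course")
onto.senses["lecturer"] = ["lecturer#1"]
onto.frames["teach"] = VerbFrame("teach", ("Agent", "Theme"))
onto.sense_to_concept["lecturer#1"] = "Professor"
onto.realisations["teaches"] = [("teach", {"Agent": "domain", "Theme": "range"})]

print(onto.concepts_for_word("lecturer"))  # {'Professor'}
print(onto.relations_for_verb("teach"))    # ['teaches']
```

Here the word "lecturer" reaches the concept Professor only through the explicit sense mapping, not through the concept's own label, which is the point of the third component.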

ART will be based on such a representational framework, and its implementation will rely on Semantic Web standards such as SOAP, WSDL and OWL. ART will be able to learn its target ontology extensively from corpora, with weak supervision from the knowledge engineer. ART will then be able to sustain dialogue and question answering about the target domain by combining NLP and ontological reasoning.

Involved People

  • Roberto Basili
  • Fabio Massimo Zanzotto
  • Marco Pennacchiotti
  • Armando Stellato
  • Michele Vindigni

ART Internal page.

Specific Publications

  • Roberto Basili, Marco Pennacchiotti, Fabio Massimo Zanzotto, "Language Learning and Ontology Engineering: an Integrated Model for the Semantic Web", 2nd Meaning Workshop, Trento, Italy, February 2005.
     
  • Roberto Basili, Maria Teresa Pazienza, Fabio Massimo Zanzotto, "Inducing hyperlinking rules in text collections", in Nicolas Nicolov, Kalina Bontcheva, Galia Angelova, Ruslan Mitkov (Eds.): Recent Advances in Natural Language Processing III, Selected Papers from RANLP 2003, Borovets, Bulgaria. Current Issues in Linguistic Theory (CILT) 260, John Benjamins, Amsterdam/Philadelphia, 2004.
     
  • Roberto Basili, Michele Vindigni, Fabio Massimo Zanzotto, "Understanding the Web through its Language", Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (WI'04), Beijing, China, 2004.
     
  • Paolo Atzeni, Roberto Basili, Dorte H. Hansen, Paolo Missier, Patrizia Paggio, Maria Teresa Pazienza, Fabio Massimo Zanzotto, "Ontology-based question answering in a federation of university sites: the MOSES case study", in Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems (NLDB'04), Manchester, United Kingdom, June 2004.
     
  • Roberto Basili, Dorte H. Hansen, Patrizia Paggio, Maria Teresa Pazienza, Fabio Massimo Zanzotto, "Ontological resources and question answering", Workshop on Pragmatics of Question Answering, held jointly with NAACL 2004, Boston, Massachusetts, May 2004.
     
  • R. Basili, M. Vindigni, and F. M. Zanzotto, "Integrating ontological and linguistic knowledge for conceptual information extraction", in Proceedings of the IEEE/WIC WI-2003, Conference on Web Intelligence, Halifax, Canada, 2003.

Start Date: September 2004
Status: in progress