Feed aggregator

“A Little Semantics Goes a Long Way”

AI3:::Adaptive Information (Mike Bergman) - Wed, 02/05/2020 - 18:22
First in an Occasional Series of KBpedia Best Practices

One of my favorite sayings regarding the semantic Web is from James Hendler, now a professor and program director at RPI, but a longstanding contributor to the semantic space, including, among other notable contributions, as a co-author on the seminal paper, “The Semantic Web,” in Scientific American in 2001. His statement was “A little semantics goes a long way,” and I whoeheartedly support that view. I previously gave a shoutout to this saying in my book [1]. In this ‘best practice’ note regarding KBpedia and creating and maintaining knowledge graphs, I want to point out two simple techniques that can immediately benefit your own knowledge representation efforts.

The two items I want to highlight are the use of ‘semsets’ (similar to the synsets used by WordNet) and emphasizing subsumption hierarchies in your knowledge graph design. The actual practice of these items involves, as much as anything, embracing a mindset that is attentive to the twin ideas of semantics and inference.

With this article, I’m also pleased to introduce an occasional series on best practices when creating, applying or maintaining knowledge graphs, using KBpedia as the reference knowledge system. I will be presenting this series throughout 2020 coincident with some exciting expansions and application of the system. These ‘best practice’ articles are not intended to be detailed pieces, my normal practice. Rather, I try to present a brief overview of the item, and then describe the process and benefits of applying it.

Semsets

The fundamental premise of semantic technologies is “things, not strings.” Labels are only the pointers to a thing, and things may be referred to in many different ways, including, of course, many different languages. Is your ‘happy’ the same as my ‘glad’? Examples abound, as language is an ambiguous affair with meaning often dependent on context.

A single term can refer to different things and a single thing can be (and is!!) referred to by many different labels. The lexical database of WordNet helped attack this problem decades ago, by creating what it called ‘synsets‘ to aggregate the multiple ways (terms) by which a given thing may be referred. The portmanteau of this name comes from the ‘synset’ being an aggregation of synonyms. In keeping with Charles Peirce‘s framing of indexes to a given thing as anything which points to or draws attention to it, we have broadened the idea to include any term or phrase that points to a given thing. This is a broadened semantic sense, so we have given this aggregation of terms the name ‘semset‘, a portmanteau using semantics. Elsewhere [2], I have very broadly defined a semset as including: synonyms, abbreviations, acronyms, aliases, argot, buzzwords, cognomens, derogatives, diminutives, epithets, hypocorisms, idioms, jargon, lingo, metonyms, misspellings, nicknames, non-standard terms (e.g., Twitter), pejoratives, pen names, pseudonyms, redirects, slang, sobriquets, stage names, or synsets. Note this listing is itself a semset for semset.

So, the best practice is this. Whenever adding a new relation or entity or concept to a knowledge graph, give it as broad of an enumeration of a semset as you can assemble with reasonable effort [3]. Redirects in Wikipedia and altLabels from Wikidata are two useful starting sources. (You may need to discover other sources for specific domains.) You can see these by the altLabels within the KBpedia knowledge base; see, as examples, abominable snowman, bird, or cake. altLabels are part of the many useful constructs in the SKOS (Simple Knowledge Organization System) RDF language, another best practice to apply to your knowledge graphs.

Then, when querying or retrieving data, one can specify standard prefLabels alone (the single, canonical identifier for the entity) for narrow retrievals, or greatly broaden the query by including the altLabels. In our own deployments, we also often include a standard text search engine such as Lucene or Elasticsearch for such retrievals, which opens up even more control and flexibility. Semsets are an easily deployed way to bridge your semantics from ‘strings’ to ‘things’.

Subsumption Hierarchies

Subsumption hierarchies simply mean that a parent concept embraces (or ‘subsumes’) child concepts [4]. The subsumption relationship can be one of intensionality, extensionality, inheritance, or mereology. In intensionality, the child has attributes embraced by the parent, such as a bear having hair like other mammals. In extensionality, class members belong to an enumerable group, as in lions and tigers and bears all being mammals. In inheritance, an actual child is subsumed under a parent. In mereology, a composite thing like a car engine has parts such as pistons, rods, or timing device. In the W3C standards of RDF or OWL, what we use in KBpedia to capture our semantic knowledge representations, the ‘class’ construct and its related properties are used to express subsumption hierarchies.

The ‘hierarchy’ idea arises from establishing a tree scaffolding of linked items. In this way, subsets of your knowledge graph resemble taxonomies (or tree-like structures) that proceed from the most general at the top (the ‘root’) to most specific at the bottom (the ‘leaf’). Different types of subsumption relationships are best represented by their own trees. Using such subsumption relations do not preclude other connections or relations in your knowledge graph.

When consistently and logically constructed, a practice that can be learned and can be tested, subsumption hierarchies enable one to infer class memberships. For instance, using the ‘mammal’ example means we can infer a bear is a mammal without so specifying, or, alternatively, we can discover that lions and tigers are also mammals if we know that a bear is a mammal. Subsumption hierarchies are an efficient way to specify group memberships, and a powerful way to overcome imprecise query specifications or to discover implicit relationships.

Using semsets and subsumption hierarchies are easy techniques for incorporating semantics into your knowledge graphs. These two simple techniques (among a few others) readily demonstrate the truth of Hendler’s “a little semantics goes a long way” in improving your knowledge representations.

NOTE: This is the first article in an occasional series about KBpedia best practices to coincide with new advances, uses, and applications of KBpedia throughout 2020.

[1] Bergman, M. K. Building Out the System. in A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce (ed. Bergman, M. K.) 273–294 (Springer International Publishing, 2018). doi:10.1007/978-3-319-98092-8_13 [2] See the Glossary in [1]. [3] SKOS also provides a property for capturing misspellings (hiddenLabel), which is a best practice to include, and the W3C standards allow for internationalization of all labels by use of the language tag for labels. [4] In actual language use, one can say a parent ‘subsumes’ a child. Alternatively, one can say a child ‘is subsumed by’ or ‘is subsumed under’ the parent.

Knowledge Representation is a Tricky Business

AI3:::Adaptive Information (Mike Bergman) - Mon, 12/16/2019 - 05:59
The Choice Between Class and Instance Depends on Your Point of View

Readers of this blog know that I use the open-source Protégé ontology editor to build and maintain our knowledge graphs. Besides the usefulness of the tool, there is also an informative user mail list that discusses the Protégé application and modeling choices that may arise when using it [1]. A recent thread, ‘How to Relate Different Classes,’ is but one example of an issue one might encounter on this list [2]. As one of the frequent commenters on the list, Michael DeBellis, noted about this thread [3], “I think this is a common issue with modeling, what to make a class and what to make an instance.”

Michael is indeed correct that the distinction between classes and instances is a frequent topic, one that I have touched upon in various ways through the years. The liveliness of this recent thread convinced me it would be helpful to pull together how one chooses to use a class or instance in their knowledge graphs. The topic is also critical to the questions of knowledge representation and interoperability, two key uses for knowledge graphs. So, let’s look at this question of class v instance from the aspects of the nature of knowledge, modeling, and practical considerations.

Epistemological Issues

Epistemology is simply the study of the nature of knowledge. It gets at the questions of what is knowledge? what is belief? what is justification for action? how can we acquire and validate knowledge? is knowledge infallible? are there different kinds of knowledge?

Charles Sanders Peirce and his theory of signs was intimately related to these questions, as well to how we express and convey knowledge to others. Since, as humans, we communicate through our language as symbols, what we mean and intend to convey when expressing these symbols is also of utmost importance to how we understand and refine knowledge as a community process. My recent book has a number of chapters mostly if not exclusively related to these topics [4,5]. Many of the points in this section are drawn from these chapters.

We can illustrate some of the tricky epistemology issues associated with the nature of language using the example of the ‘toucan’ bird often used in discussions of semantic technologies. When we see something, or point to something, or describe something in words, or think of something, we are, of course, using proxies in some manner for the actual thing. If the something is a ‘toucan’ bird, that bird does not reside in our head when we think of it. The ‘it’ of the toucan is a ‘re-presentation’ of the real, dynamic toucan. The representation of something is never the actual something but is itself another thing — that is, a sign — that conveys to us the idea of the real something. In our daily thinking we rarely make this distinction. (For which we should be thankful, otherwise, our flow of thoughts would be wholly jangled.) Nonetheless, the difference is real, and we should be conscious of it when we are trying to be precise in representing knowledge.

How we ‘re-present’ something is also not uniform or consistent. For the toucan bird, perhaps we make caw-caw bird noises or flap our arms to indicate we are referring to a bird. Perhaps we point at the bird. Alternatively, perhaps we show a picture of a toucan or read or say aloud the word “toucan” or see the word embedded in a sentence or paragraph, as in this one, that also provides additional context. How quickly or accurately we grasp the idea of ‘toucan’ is partly a function of how closely associated one of these accompanying signs may be to the idea of toucan bird. Probably all of us would agree that arm flapping is not nearly as useful as a movie of a toucan in flight or seeing one scolding from a tree branch to convey the ‘toucan’ concept.

The question of what we know and how we know it fascinated Peirce over the course of his intellectual life. He probed this relationship between the real or actual thing, the object, with how that thing is represented and understood. (Also understand that Peirce’s concept of the object may embrace individual or particular things to classifications or generalities.) This triadic relationship between immediate object, representation, and interpretation forms a sign and is the basis for the process of sign-making and understanding, what Peirce called semiosis [6].

Even the idea of the object, in this case, the toucan bird, is not necessarily so simple. The real thing itself, an actual toucan bird, has characters and attributes. How do we ‘know’ this real thing? Bees, like many insects, may perceive different coloration for the toucan because they can see in the ultraviolet spectrum, while we do not. On the other hand, most mammals in the rainforest would also not perceive the reds and oranges of the toucan’s feathers, which we readily see. The ‘toucan’ object is thus perceived differently by bees, humans, and other animals. Beyond physical attributes, this actual toucan may be healthy, happy, or sad, nuances beyond our perception that only some fellow toucans may perceive. Though humans, through our ingenuity, may create devices or technologies that expand our standard sensory capabilities to make up for some of these perceptual gaps, our technology will never make our knowledge fully complete. Given limits to perceptions and the information we have on hand, we can never completely capture the nature of the dynamic object, the real toucan bird.

Things get murkier still when we try to convey to others what we mean by the ‘toucan’ bird. For example, when we inspect what might be a description of a toucan on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we use to refer to ‘toucan’ may be, say, that of the keel-billed toucan (Ramphastos sulfuratus). However, if we view the images of a list of toucan species, we see just how physically divergent various toucans are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration, and range. Further, if I assert that the picture of the toucan is that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole. The point is not a lesson on toucans, but an affirmation that distinctions between what we think we may be describing occurs over multiple levels. The meaning of what we call a ‘toucan’ bird is not embodied in its label or even its name, but in the accompanying referential information that places the referent into context.

If, in our knowledge graph we intend to convey all of these broader considerations, then we are best defining ‘toucan’ as a class. On the other hand, if we are discussing the individual Pretty Bird toucan or are describing ‘toucan’ and average attributes in relation to a wider context of many types of other birds including eagles and wrens, then perhaps treating the ‘toucan’ as an instance is the better approach. Context and what we intend to convey are essential components to how we need to represent our knowledge. Whether something is an ‘instance’ or a ‘class’ is but the first of the distinctions we need to convey, and those may often vary by context.

Modeling Issues

Because these principles are universal, let’s shift our example to ‘truck’ [7]. In the English language, one of the ways we distinguish between an instance and a class is guided by the singular and plural (though English is notorious for its many different plural forms and exceptions). The attributes we assign to a term differ whether we are discussing ‘trucks’, which we think about more in terms of transport purpose, brands, model, and model year; or are discussing a ‘truck’, which has a particular driver, engine, transmission and mileage. Here is one way to look at such ‘truck’ distinctions (for this discussion, we’ll skip the ABox and TBox, another modeling topic importantly using description logics [8]):

To accommodate the twin views of class and individual, we could double the number of entities in our knowledge graphs by separately modeling single instances or plural classes, but that rapidly balloons the size of our graphs. What is more efficient is an approach that would enable us to combine both the organization of concepts and their relations and set members with the description and characterization of these concepts as things unto themselves. As our examples of ‘toucans’ and ‘trucks’ show, this dual treatment is a natural and common way to refer to things for most any domain of interest. Further, class and sub-class relationships enable us to construct tree-like hierarchies over which we can infer or inherit attributes and characteristics between parents and children.

For modeling purposes, we also want our graphs to be decidable, which importantly means we can reason over our knowledge graphs with an expectation that we can get definitive answers (even if the answer is “don’t know”) in a reasonable computation time. It is for these reasons that we have chosen the standard OWL 2 as the representation language for our knowledge graphs (in addition to other benefits [9]). A proper OWL 2 knowledge graph is decidable, and it handles both class and instance views using the metamodeling technique of “punning” [10]. Objects in OWL 2 are named with IRIs (internationalized Web links). The trick with “punning” is to evaluate the object based on how it is used contextually; the IRI is shared but its referent may be viewed as either a class or instance based on context. Any entity declared as a class and with an asserted object or data property is punned. Thus, objects used both as concepts (classes) and individuals (instances) are allowed and standard OWL 2 reasoners may be used against them.

Other Practical Issues

We’ve already discussed context, inference, and decidability, but I thought Igor Toujilov highlighted another important benefit in the mail thread of using class over instance declarations in a knowledge graph. The example he provided was based on drug development [11]:

However from my point of view (software engineering), many modern drugs are developed as a specialisation of existing drugs, i.e. by bringing new features to existing drugs. So, some new drug can be considered as a subclass of an existing drug. This is similar to object-orientated design in software: to bring new features, establish a subclass and implement it.

For example, methylphenidate can be considered as a superclass of Ritalin. If an earlier version of your ontology represents methylphenidate as an individual, then it would be difficult to represent Ritalin in later versions without breaking backward compatibility with existing interoperable applications.

This example shows that the preferable approach in ontology development is: use classes instead of individuals, if there is any chance you would need subclasses in the future.

Since knowledge is constantly dynamic and growing, it would seem prudent advice to allow for expansion of the things in your knowledge graph. Classes are the better choice in this instance (pun intended).

Like any language, there is a trade-off in OWL 2 between expressivity and reasoning efficiency [12]. Some prefer a less-constrained RDF and RDFS construct for their knowledge graphs. This approach allows virtually any statement to be asserted and is a least-common denominator for dealing with data encountered in the wild. However, one loses the punning and decidability advantages of OWL 2, and has a less-powerful framework for staging training sets and corpora for machine learning, another key motivation for our own knowledge graphs.

One could also choose a more powerful modeling language such as Datalog or Common Logic to gain the advantages of OWL 2, plus more. We have nothing critical to say about making such a choice. For our use cases, though, we do like the broader use and tools afforded by the use of OWL 2 and other W3C standards. Finding your own ‘sweet spot’ means understanding some of these knowledge representation trade-offs in context with your anticipated applications.

[1] [protege-user] mailing list at https://mailman.stanford.edu/mailman/listinfo/protege-user. [2] Protégé user email list, ‘How to Relate Different Classes’, https://mailman.stanford.edu/pipermail/protege-user/2019-November/010890.html, Nov 9, 2019. [3] https://mailman.stanford.edu/pipermail/protege-user/2019-November/010906.html, Nov 10, 2019. [4] Bergman, M. K. Information, Knowledge, Representation. in A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce (ed. Bergman, M. K.) 15–42 (Springer International Publishing, 2018). doi:10.1007/978-3-319-98092-8_2. [5] Bergman, M. K. A KR Terminology. in A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce (ed. Bergman, M. K.) 129–149 (Springer International Publishing, 2018). doi:10.1007/978-3-319-98092-8_7. [6] Peirce actually spelled it “semeiosis.” While it is true that other philosophers such as Ferdinand de Saussure also employed the shorter term “semiosis.” I also use this more common term due to greater familiarity. [7] Bergman, M. K. Metamodeling in Domain Ontologies. AI3:::Adaptive Information http://www.mkbergman.com/913/metamodeling-in-domain-ontologies/ (2010). [8] See, for example, my four-part series on description logics, beginning with Bergman, M. K. Making Linked Data Reasonable using Description Logics, Part 1, AI3:::Adaptive Information http://www.mkbergman.com/474/making-linked-data-reasonable-using-description-logics-part-1/ (2009). [9] See Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter Patel-Schneider and Ulrike Sattler, 2008. “OWL2: The Next Step for OWL,” see http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2008/CHMP+08.pdf; and also see the OWL 2 Quick Reference Guide by the W3C, which provides a brief guide to the constructs of OWL 2, noting the changes from OWL 1. [10] “Punning” was introduced in OWL 2 and enables the same IRI to be used as a name for both a class and an individual. However, the direct model-theoretic semantics of OWL 2 DL accommodates this by understanding the class Truck and the individual Truck as two different views on the same IRI, i.e., they are interpreted semantically as if they were distinct. See further Pascal Hitzler et al., eds., 2009. OWL 2 Web Ontology Language Primer, a W3C Recommendation, 27 October 2009; see http://www.w3.org/TR/owl2-primer/. [11] https://mailman.stanford.edu/pipermail/protege-user/2019-November/010930.html, Nov 11, 2019. [12] OWL has historically been described as trying to find the proper tradeoff between expressive power and efficient reasoning support. See, for example, Grigoris Antoniou and Frank van Harmelen, 2003. “Web Ontology Language: OWL,” in S. Staab and R. Studer, eds., Handbook on Ontologies in Information Systems, Springer-Verlag, pp. 76-92. See http://www.few.vu.nl/~frankh/postscript/OntoHandbook03OWL.pdf.

KBpedia Continues Quality Improvements

AI3:::Adaptive Information (Mike Bergman) - Wed, 12/04/2019 - 18:54
Version 2.20 of the Knowledge Graph Now Prepped for Release on Public Repositories

Fred Giasson and I, as co-editors, are pleased to announce today the release of version 2.20 of the open-source KBpedia system. KBpedia is a knowledge graph that provides an overlay for interoperating and conducting machine learning across its constituent public knowledge bases of Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, and OpenCyc. KBpedia contains more than 53,000 reference concepts and their mappings to these knowledge bases, structured into a logically consistent knowledge graph that may be reasoned over and manipulated. KBpedia acts as a computable scaffolding over these broad knowledge bases.

We are preparing to register KBpedia on many public repository sites, and we wanted to make sure quality was a high as possible as we begin this process. Since KBpedia is a system built from many constituent knowledge bases, duplicates and inconsistencies can arise when combining them. The rationale for this release was to conduct a comprehensive manual review to identify and remove most of these issues.

We made about 10,000 changes in this newest release. The major changes we made to KBpedia resulting from this inspection include:

  • Removal of about 2,000 reference concepts (RCs) and their mappings and definitions pertaining to individual plant and animal species, which was an imbalance in relation to the other generic RCs in the system;
  • Manual inspection and fixes to the 70 or so typologies (for instance, Animals or Facilities) that are used to cluster the RCs into logical groupings;
  • Removal of references to UMBEL, one of KBpedia’s earlier constituent knowledge bases, due to retirement of the UMBEL system;
  • Fixes due to user comments and suggestions since the prior release of version 2.10 in April 2019; and
  • Adding some select new RCs in order to improve the connectivity and fill gaps with the earlier version.

Without a doubt this is now the cleanest and highest quality release for the knowledge graph. We are now in position to extend the system to new mappings, which will be the focus of future releases. (Expect the next after the first of the year.) The number and structure of KBpedia’s typologies remain unchanged from prior versions. The number of RCs now stands at 53,465, smaller than the 55,301 reference concepts in the prior version.

Besides combining the six major public knowledge bases of Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, and OpenCyc, KBpedia includes mappings to more than a score of additional leading vocabularies. The entire KBpedia structure is computable, meaning it can be reasoned over and logically sliced-and-diced to produce training sets and reference standards for machine learning and data interoperability. KBpedia provides a coherent overlay for retrieving and organizing Wikipedia or Wikidata content. KBpedia greatly reduces the time and effort traditionally required for knowledge-based artificial intelligence (KBAI) tasks. KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is sponsored by Cognonto Corporation.

The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia splits between entities and concepts, on the one hand, and splits in predicates (or relations) based on attributes, external relations, and pointers or indexes, all informed by Charles Peirce‘s prescient theories of knowledge representation. Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files. KBpedia distributions provide the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record. Such reachthroughs are straightforward to construct.) See further the Github site for further downloads. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Syndicate content