
Shaping Wikipedia into a Computable Knowledge Base

AI3:::Adaptive Information (Mike Bergman) - Tue, 03/31/2015 - 21:03
Part of a Vision for Information Interoperability on the Web

Wikipedia is arguably the most important information source yet invented for natural language processing (NLP) and artificial intelligence, in addition to its role as humanity’s largest encyclopedia. Wikipedia is the principal information source for such prominent services as IBM’s Watson [1], Freebase [2], the Google Knowledge Graph [3], Apple’s Siri [4], YAGO [5], and DBpedia [6], the core reference structure for linked open data [7]. Wikipedia information has assumed a prominent role in NLP applications in word sense disambiguation, named entity recognition, co-reference resolution, and multi-lingual alignments; in information retrieval in query expansion, multi-lingual retrieval, question answering, entity ranking, text categorization, and topic indexing; and in semantic applications in topic extraction, relation extraction, entity extraction, entity typing, semantic relatedness, and ontology building [8].

The massive size of Wikipedia — with more than 26 million articles across 250 different language versions [9,10] — makes it a rich resource for reference entities and concepts. Structural features of Wikipedia that help to inform the understanding of relationships and connections include articles (and their embedded entities and concepts), article abstracts, article titles, infoboxes, redirects, internal and external links, editing histories, categories (in part), discussion pages, disambiguation pages, images and associated metadata, templates, tables, and pages of lists, not to mention Wikipedia as a whole being used as a corpus or graph [11]. It is no wonder that Wikipedia is referenced in about 665,000 academic articles and books [12]. And all of this occurs in a phenomenon that is not yet 15 years old!

Wikipedia is unparalleled as a source for mining structure, concepts and entities. But, and here is the challenge, Wikipedia is never itself used as a computable knowledge base. It is a resource for other knowledge systems, but not a coherent knowledge base unto itself. Wikipedia feeds other useful knowledge bases, but does not play those roles in its own right. Why this is so and how it can be remedied is the subject of this article.

Three Basic Problems

Wikipedia has been cited for three weaknesses relevant to its role as a knowledge base. The first is that its coverage is imbalanced. Various studies have evaluated the scope of Wikipedia [13, 14, 15, among many] and have found areas of popular culture such as games, movies, music and actors to be over-represented, while areas of philosophy, technology, academics and history are under-represented. While perhaps still true in terms of absolute numbers of articles, the actual domain coverage has been improving in recent years.

The second Wikipedia problem is incompleteness. Wikipedia tends to be spotty in providing complete and equal representation when populating certain categories (or classes) with articles (instances). It also tends to be incomplete in how well embedded or structured various articles may be. One example of the representation problem is in commerce and the coverage of companies or products. The notability criterion [16] is a tricky one here; some companies or products with seemingly equivalent notability get listed, while others do not. Another example is the kingdoms of life, where some life forms are extremely well represented, while others are not. The incompleteness of structure relates to which articles or entire categories have infoboxes, and whether those infoboxes are well populated, as well as to how category assignments are incomplete or inconsistent. The existence of “stub” articles is one piece of evidence of such incompleteness. As Wikipedia has gotten more structured and complicated, the number of active editors has declined. The growing use of bots, however, often compensates for this and in some cases brings better consistency and more equivalent treatment [17,18].

But the biggest problem of Wikipedia has been its category structure. Categories were not part of the original design, but were added to Wikipedia in 2004. Various reviewers have likened Wikipedia more to a thesaurus than to a classification scheme [19]; others note that it differs from classical knowledge organization systems in that it has no specified root or hierarchy [20]. This improved a wee bit from 2006 to 2010, when the main Wikipedia topics were organized into top-level and main topic categories [21]. Still, typical commentaries point to the fact that Wikipedia’s category structure is “noisy, ill-formed, and difficult to make sense of” [22]. Its crowdsourced nature has led to various direct and indirect cycles in portions of the category structure [23]. All of these problems lead to the inability to do traditional reasoning or inference over the Wikipedia category graph [24].

Besides these barriers to computability, the Wikipedia graph is bloated with “artificial” categories (see further below) that merely add noise when trying to understand or navigate the Wikipedia category structure. In short, while Wikipedia is a goldmine of resources and partial structure, its organization is incoherent at a global level, and it is unable to support reasoning and other tasks that might be expected from a truly functional knowledge base.

The real shame — but also the real opportunity — is that this lack of coherency makes it more difficult to validate and improve the information already in Wikipedia. So there are both external reasons (linkage) and internal reasons (improved authority) for wanting to shape Wikipedia into a true knowledge base.

Efforts to Recast Wikipedia

These faults are not unrecognized and the prospect of better leverage from Wikipedia has stimulated many efforts. Gazing inward, it is not uncommon to find efforts that attempt to clean up the existing Wikipedia structure [25], or various attempts to use the content of Wikipedia article categories [26] to re-constitute new taxonomies [27] or concept networks [28]. Clean up appears essential, and is a relative constant in other attempts to recast Wikipedia [29].

The choice of Wikipedia’s founders to make its full content available electronically for free and without restriction was a masterstroke. This has stimulated many to grab the Wikipedia content and to recast it in other ways. One of the first, and most successful, was DBpedia, with an emphasis on making (much of) Wikipedia available in RDF and linked data. DBpedia emphasized the structured content of Wikipedia’s infoboxes and eventually derived a typology of entities and their properties expressed as the DBpedia ontology [30]. It is not hyperbole to state that DBpedia nucleated the entire linked data phenomenon [7].

The key insight of YAGO [5] was the recognition that the resource richness of Wikipedia lacked a unifying structure, with WordNet chosen as the replacement organizing framework. Also, by patterned analysis of the structure of Wikipedia’s article titles, YAGO was able to infer and select many attribute relationships between entities. This enabled YAGO to posit what was, in essence, a much-expanded category structure for Wikipedia expressed as predicates. Many other efforts have also chosen WordNet as their organizing framework for Wikipedia [31,32].

Freebase [2], itself another attempt to use crowdsourcing with explicit attention to structured data, struggled in its early years until it embraced and incorporated Wikipedia. That marked the take-off point for Freebase, which was later acquired by Google to form the backbone of its knowledge graph. Freebase is now being shut down with its assets being transferred to Wikidata.

Wikidata [33] is itself an interesting case of how the Wikipedia model is being expanded. Wikidata, a sister project to Wikipedia under the Wikimedia banner, takes as its starting point the structured data about entities evident in Wikipedia infoboxes. Rather than extracting and cleaning that entity information as DBpedia does, the role of Wikidata is to be the multilingual source for all entities feeding the Wikimedia network, including Wikipedia. The approach leads to more uniformity and consistency, and provides a central Wikimedia access point for structured data. However, somewhat akin to Wikipedia, Wikidata also has struggled to find an appropriate typology (or ontology) for its millions of entities [34].

Other approaches to the Wikipedia classification challenge have been to map — or “express” — Wikipedia articles in relation to established external vocabularies or structures, such as the Library of Congress Classification [35], Library of Congress Subject Headings [23, 36], Universal Decimal Classification (UDC) [37], Cyc [38] or UMBEL [39], among others. The idea here is that accepted organizational schemes provide more coherence than the Wikipedia category structure, sometimes with additional benefits as well.

Though not complete topical recastings, certain aspects of Wikipedia have also proven useful for general knowledge acquisition. Article (concept or entity) content can inform topical tagging via explicit semantic analysis (ESA) [40], automatic topic identification [41], information extraction [42] or a myriad of other techniques.

Making a Natural Wikipedia Category Scheme

Whether the aim is to “clean” or to recast Wikipedia, taking its category structure in its existing form is problematic. Though some category cleaning takes place in some of these uses of Wikipedia, it is neither uniform nor universal. The cleaning that does take place is often limited to administrative categories (those relating to internal Wikipedia conventions or management). Yet other Wikipedia conventions (such as lists) and the proliferation of user-generated “artificial” categories actually account for the bulk of the total number of categories.

Charles S. Peirce was the first, by my reading, to look at the question of “natural classes,” which are now sometimes contraposed against what are called “artificial classes” (we tend to use the term “compound” classes instead). A “natural class” is a set with members that share the same set of attributes, though with different values (such as differences in age or hair color for humans, for example). Some of those attributes are also more essential to defining the “type” of that class (such as humans being warm-blooded, having live births and hair, and using symbolic languages). Artificial classes tend to add only one or a few shared attributes, and do not reflect the essence of the type [43].

“Compound” (or artificial) categories (such as Films directed by Pedro Almodóvar or Ambassadors of the United States to Mexico) are not “natural” categories, and including them in a logical evaluation only acts to confuse attributes from classification. To be sure, such existing categories should be decomposed into their attribute and concept components, but should not be included in constructing a schema of the domain.

“Artificial” categories may be identified in the Wikipedia category structure by both syntactical and heuristic signals. One syntactical rule is to look at the head (noun) of a category title; one heuristic signal is to select out any category whose title contains a preposition. Across all rules, “compound” categories actually account for most of what is removed in order to produce “cleaned” categories.
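As a rough illustration of such signals, here is a minimal sketch in Python. The preposition list and the crude head-noun test below are illustrative assumptions of mine, not the actual filtering rules used in the analyses cited in this article:

```python
# Minimal sketch: flag likely "artificial" (compound) Wikipedia categories
# using the syntactical and heuristic signals described above. The preposition
# list and head-noun test are illustrative assumptions only.
PREPOSITIONS = {"of", "in", "by", "from", "about", "for", "to", "at", "with"}

def is_compound_category(title: str) -> bool:
    """Heuristic: a title containing a preposition (e.g. 'Films directed by
    Pedro Almodovar') likely mixes a concept with attribute values rather
    than naming a 'natural' class."""
    return any(word in PREPOSITIONS for word in title.lower().split())

def head_noun(title: str) -> str:
    """Crude stand-in for finding the head of a category title: take the
    first word (e.g. 'Films' in 'Films directed by Pedro Almodovar')."""
    return title.split()[0]

for cat in ["Films directed by Pedro Almodóvar",
            "Ambassadors of the United States to Mexico",
            "Mammals"]:
    verdict = "compound" if is_compound_category(cat) else "clean"
    print(f"{cat} -> {verdict} | head: {head_noun(cat)}")
```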

We can combine these thoughts to show what a “cleaned” version of the Wikipedia category structure might look like. The 12/15/10 column in the table below reflects the approach used for determining the candidates for SuperTypes in the UMBEL ontology, last analyzed in 2010 [44]. The second column is from a current effort mapping Wikipedia to Cyc [45]:

                              12/15/10    3/1/15
Total Categories                100%       100%
Administrative Categories        14%        15%
Orphaned Categories              10%        20%
Working Categories               76%        66%
“Artificial” Categories          44%        34%
    Single Head                  23%        33%
    Plural Head                  24%          –
“Clean” Categories               33%        46%

Two implications can be drawn from this table. First, without cleaning, there is considerable “noise” in the Wikipedia category structure, equivalent to about half to two-thirds of all categories. Any analysis or classification that proceeds without removing these categories is fighting unnecessary noise and has likely introduced substantial assignment errors. Second, approaches, assumptions and how filters get sequenced differ between “cleaning” attempts. These differences in how Wikipedia is staged for analysis make it difficult to compare studies and approaches to one another, but they also represent areas for discussion and testing to derive best practices.

Today, in chaotic and uncoordinated ways, we see Wikipedia feeding much analysis through partial aspects of its structure and supplying many reference concepts and entities. But each analysis is done for different purposes using different bases; they are thus incompatible. Coherency, usability and insight suffer. Any prior efforts to map to or use Wikipedia categories that do not remove these artificial categories only introduce noise and are therefore likely to be in substantial error.

Benefits of a Reference Knowledge Base

If we could overcome these shortcomings by taking the steps to make Wikipedia a true reference knowledge base, what might the benefits be? Or, said another way, why should we care?

One benefit is that reference structures of any kind provide a focus, by definition, of common or canonical referents. This commonality leads to better defined, better understood and more widely used referents. Common referents become a kind of common vocabulary for the space, upon which other vocabularies and datasets can draw and to which they can refer. A common language, of sorts, can begin to emerge.

Reference structures also provide a grounding, a spoke-and-hub design [46], that leads to an efficient basis for external vocabularies and datasets to refer to one another. Of course, any direct mapping can provide a means to relate this information, but such pairwise mappings are neither scalable nor efficient. In a spoke-and-hub design, the number of mappings required grows only linearly with the number of datasets or items requiring mapping. The spoke-and-hub design, for example, is at the heart of such disciplines as master data management.
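The arithmetic behind this scaling claim, spelled out in note [46], can be stated compactly. With n sources mapped through a common hub versus mapped pairwise:

```latex
% Mappings required to connect n sources (see note [46])
\[
\underbrace{\,n - 1\,}_{\text{spoke-and-hub}}
\qquad \text{vs.} \qquad
\underbrace{\binom{n}{2} = \frac{n(n-1)}{2}}_{\text{pairwise}}
\]
% Example: for n = 10 datasets, 9 mappings suffice with a hub, versus 45 pairwise.
```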

Another benefit of common reference structures is that they provide a common target for the development of tools and best practices. These kinds of “network effects” lead to still further tooling and practices. Thus, while we see literally tens of thousands of academic papers and approaches leveraging Wikipedia in one way or another, we see little of a practice or a community that has been built around it as a knowledge base. It is as if we are still looking a bit at the shadow of Wikipedia and its possible role, a chimera for its potential as a true knowledge base.

But the ultimate benefit of Wikipedia as a reference knowledge base will reside in its computability. When we can reason over Wikipedia’s content, use it to test and analyze assertions or new facts, and apply its coherent organization to tasks such as informing how to map and interoperate data or remaking whole legacy applications such as enterprise information integration or MDM, all in cross-lingual ways, we will finally see the realization of Wikipedia’s inherent potential. And, as these latent capabilities get exploited, we will see supporting knowledge sources such as Wikidata also get pulled into the ecosystem.

Seven Requirements for a Computable Knowledge Base

So, if we buy into the benefits of a computable Wikipedia — or any other useful knowledge source for that matter — what are the guideposts for doing so? How do we assess the gaps and then fill them?

The importance of working with a “clean” version of the Wikipedia structure is obvious, yet ultimately there are higher-order requirements for what it takes, in our view, to become a “true” reference knowledge base. By our definition, such KBs have these aspects:

  • Coherent — does it hold together conceptually, logically, does it make sense? Either internally via consistency tests and such, or externally via testing against known facts and knowledge, the structure of the knowledge base should be defensible and meet the “common sense” test
  • Comprehensive — does the knowledge base have the scope of domains to which it is likely to interact? For a Web reference, the KB need not be global, but be relevant to an important domain of discourse. The biomedical domain, and its constituent and biological sub-domains, is an example. Something like Wikipedia represents a more “global” domain, and is thus central to the idea
  • Referencable — is the knowledge source authoritative? Does it use URIs for referencing its objects?
  • Open Standards — does it meet open standards? Open standards, by virtue of their decision processes, represent well-reasoned bases. Open standards are also easier to interoperate with and have more tooling available
  • Computable — the combination of the above can lead to a KB structure that supports reasoning, inference, set selection, relations, attributes, datatypes, and filtering and retrieval. These aspects make the KB “computable” [47], the threshold qualifier for a “true” knowledge base
  • Testable — but now, once the KBs are computable, they are also testable. That means the entire KB structure may be tested, verified, validated, scored, and evaluated
  • Multi-lingual — if not already multi-lingual, does it have a structure (such as ID v label-based) that supports multiple languages? Is there attention paid to encoding and transfer standards so as to promote consumption and use of the KB data? Multi-linguality may sound like icing on the cake, but it represents the next phase of bringing structure to the question of how to better identify, discern, and disambiguate information.

Wikipedia, and other publicly available knowledge sources [48], already fulfill many of these requirements. With focused attention, any current reference source should be able to be lifted to meet these seven major requirements.
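The “ID v label-based” structure mentioned in the multi-lingual requirement above can be illustrated with a minimal, hypothetical sketch. The identifier and labels below are made up for illustration (Wikidata’s language-neutral Q-identifiers follow this general pattern, though the IDs here are not Wikidata IDs):

```python
# Hypothetical illustration: entities keyed by language-neutral IDs,
# with per-language labels attached, rather than keyed by any one label.
entities = {
    "kb:0001": {"labels": {"en": "knowledge base",
                           "de": "Wissensbasis",
                           "es": "base de conocimiento"}},
}

def label(entity_id: str, lang: str, fallback: str = "en") -> str:
    """Return the label for an entity in the requested language, if present."""
    labels = entities[entity_id]["labels"]
    return labels.get(lang, labels[fallback])

print(label("kb:0001", "de"))   # Wissensbasis
```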

Outlines of a General Staging Pipeline

OK, then: what might such a KB processing (or “lifting”) approach look like?

Well, the first point is that it should be a pipeline. It is important to be able to swap in and out various options at multiple points from input to desired output. Then, because there are disparate sources and different formats to accommodate, it is also important to use canonical syntaxes and standards for expressing the products and specifications at the various steps along that pipeline.

The very notion of pipeline implies workflows, which are the actual drivers for how the pipeline should be designed. Key workflow steps include:

  • Clean the input sources
  • Express the sources in a canonical form [49]
  • Identify and extract concepts
  • Map the structure to KB concepts
  • Identify and extract entities
  • Identify and extract relations
  • Type the entities, concepts, and relations
  • Extract attributes and values for identified entities
  • Test these against the existing KB
  • Update reference structures, including placement of the new assertions, as appropriate
  • Characterize and log to files
  • Commit to the KB
  • Rinse, repeat.

Much information gets processed in these pipelines, and the underlying sources update frequently. Thus, the pipelines themselves need to be performant and based on solid code. Automation, within the demanding bounds of quality, is also an essential condition for scalability. Improving on all of this is a process, not a state.
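To make the shape of such a pipeline concrete, here is a minimal, hypothetical sketch. The stage names mirror the workflow steps listed above; the bodies are placeholders of my own, not actual production code:

```python
# Minimal sketch of a KB staging ("lifting") pipeline. Each stage is a
# swappable callable operating on a shared working set.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

def clean_sources(work: Dict) -> Dict:
    # Clean the input sources (placeholder: whitespace normalization only)
    work["sources"] = [s.strip() for s in work.get("sources", [])]
    return work

def canonicalize(work: Dict) -> Dict:
    # Express the sources in a canonical form (one normalized record per source)
    work["records"] = [{"text": s} for s in work["sources"]]
    return work

def extract_and_type(work: Dict) -> Dict:
    # Placeholder for concept, entity and relation extraction and typing
    work["assertions"] = []
    return work

def test_and_commit(work: Dict) -> Dict:
    # Placeholder for testing assertions against the existing KB, logging,
    # and committing what passes
    work["committed"] = work.get("assertions", [])
    return work

PIPELINE: List[Stage] = [clean_sources, canonicalize, extract_and_type, test_and_commit]

def run(work: Dict) -> Dict:
    for stage in PIPELINE:      # stages can be swapped in and out, per the design goal above
        work = stage(work)
    return work

print(run({"sources": ["  Example article text  "]}))
```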

Time to Make Some Sausage

Most of these observations are really not new or innovative [39,50]. Possibly what is new is to articulate the situation for major reference sources on the Web, and to then analyze and propose how to process them in the service of information interoperability.

Because, you see, we’re still at the very, very earliest phases of how the Internet is changing the abilities to gather, understand, and represent the information in our world. We’re about ready to embark on the next stage in that journey.

[1] IBM Journal of Research and Development 56(3/4), Special Issue on “This is Watson”, 2012.
[2] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[3] A. Singhal: Introducing the Knowledge Graph: Things, not Strings. Google Blog, May 16, 2012.
[4] Gruber, T. “Siri: a virtual personal assistant.” Keynote presentation at the Semantic Technologies conference (SemTech09), 2009.
[5] Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum. “Yago: a core of semantic knowledge.” In Proceedings of the 16th International Conference on World Wide Web, pp. 697-706. ACM, 2007.
[6] Sören Auer, Chris Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak and Zachary Ives, 2007. DBpedia: A nucleus for a web of open data, in Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, volume 4825 of LNCS, pages 715–728, November 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_IU_Auer.pdf.
[7] Heath, Tom, and Christian Bizer. “Linked data: Evolving the web into a global data space.” Synthesis Lectures on the Semantic Web: Theory and Technology 1, no. 1 (2011): 1-136.
[8] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp. See http://arxiv.org/ftp/arxiv/papers/0809/0809.4530.pdf.
[9] Mesgari, Mostafa, Chitu Okoli, Mohamad Mehdi, Finn Årup Nielsen, and Arto Lanamäki. “The sum of all human knowledge: A systematic review of scholarly research on the content of Wikipedia,” Journal of the Association for Information Science and Technology 66, no. 2 (2015): 219-245.
[10] However, only 1/10th of the different language Wikipedias have more than 100,000 articles; see https://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics.
[11] See the discussion of ‘structural sources’ in M.K. Bergman, 2011. “In Search of ‘Gold Standards’ for the Semantic Web,” in AI3:::Adaptive Information blog, February 28, 2011.
[12] This count is from Google Scholar, absent references in citations and patents, with the query http://scholar.google.com/scholar?as_vis=1&q=wikipedia&hl=en&as_sdt=1,16. Also see the SWEETpedia listing of 250 articles relating to this topic on this AI3:::Adaptive Information blog; I ceased updating the list about five years ago because it was growing too large to manage.
[13] Halavais, Alexander, and Derek Lackaff. “An analysis of topical coverage of Wikipedia.” Journal of Computer-Mediated Communication 13, no. 2 (2008): 429-440.
[14] Holloway, Todd, Miran Bozicevic, and Katy Börner. “Analyzing and visualizing the semantic coverage of Wikipedia and its authors.” Complexity 12, no. 3 (2007): 30-40.
[15] Samoilenko, Anna, and Taha Yasseri. “The distorted mirror of Wikipedia: a quantitative analysis of Wikipedia coverage of academics.” EPJ Data Science 3, no. 1 (2014): 1-11.
[16] See http://en.wikipedia.org/wiki/Wikipedia:Notability.
[17] Halfaker, Aaron, and John Riedl. “Bots and cyborgs: Wikipedia’s immune system.” Computer 3 (2012): 79-82.
[18] See https://en.wikipedia.org/wiki/Category:Wikipedia_bots_by_purpose.
[19] Voss, J. Collaborative Thesaurus Tagging the Wikipedia Way, 2006.
[20] Suchecki, Krzysztof, Alkim Almila Akdag Salah, Cheng Gao, and Andrea Scharnhorst. “Evolution of Wikipedia’s Category Structure.” Advances in Complex Systems 15, no. supp01 (2012).
[21] For Wikipedia’s main topics, see http://en.wikipedia.org/wiki/Category:Main_topic_classifications; for Wikipedia’s top-level categories, see http://en.wikipedia.org/wiki/Category:Fundamental_categories.
[22] Kittur, A., Chi, E. H. and Suh, B., What’s in Wikipedia? Mapping Topics and Conflict Using Socially Annotated Category Structure, in Proceedings of the 27th Annual CHI Conference on Human Factors in Computing Systems (CHI’2009), New York, USA, 2009, pp. 1509–1512.
[23] Joorabchi, Arash, and Abdulhussain E. Mahdi. “Towards linking libraries and Wikipedia: automatic subject indexing of library records with Wikipedia concepts.” Journal of Information Science 40, no. 2 (2014): 211-221.
[24] Paulheim, Heiko, and Christian Bizer. “Type inference on noisy RDF data,” in The Semantic Web – ISWC 2013, pp. 510-525. Springer Berlin Heidelberg, 2013.
[25] Maciej Janik and Krys Kochut, 2007. Wikipedia in Action: Ontological Knowledge in Text Categorization, University of Georgia, Computer Science Department Technical Report no. UGA-CS-TR-07-001. See http://lsdis.cs.uga.edu/~mjanik/UGA-CS-TR-07-001.pdf. Also see Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Mohamed Tmar, and Abdelmajid Ben Hamadou. “Wikipedia Category Graph and New Intrinsic Information Content Metric for Word Semantic Relatedness Measuring.” Computing 10, no. 13 (2012): 35-37.
[26] Vivi Nastase and Michael Strube, 2008. Decoding Wikipedia Categories for Knowledge Acquisition, in Proceedings of the AAAI08 Conference, Chicago, US, pp. 1219-1224.
[27] Simone Paolo Ponzetto and Michael Strube, 2007. Deriving a Large Scale Taxonomy from Wikipedia, in Association for the Advancement of Artificial Intelligence (AAAI 2007).
[28] Andrew Gregorowicz and Mark A. Kramer, 2006. Mining a Large-Scale Term-Concept Network from Wikipedia, Mitre Technical Report, October 2006.
[29] Wu, Fei, and Daniel S. Weld. “Autonomously semantifying Wikipedia.” In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 41-50. ACM, 2007.
[30] Bizer, Christian, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. “DBpedia – A crystallization point for the Web of Data.” Web Semantics: Science, Services and Agents on the World Wide Web 7, no. 3 (2009): 154-165.
[31] Marius Pasca, 2009. Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes over Conceptual Hierarchies, in Proceedings of the 12th Conference of the European Chapter of the ACL, pages 639–647, Athens, Greece, 30 March – 3 April 2009. See http://www.aclweb.org/anthology/E/E09/E09-1073.pdf.
[32] Fei Wu and Daniel S. Weld, 2008. Automatically Refining the Wikipedia Infobox Ontology, presented at the 17th International World Wide Web Conference (WWW 2008).
[33] Vrandečić, Denny, and Markus Krötzsch. “Wikidata: a free collaborative knowledgebase.” Communications of the ACM 57, no. 10 (2014): 78-85.
[34] From scratch, in a bit over three years, Wikidata has grown to cover about 19 million entities according to February 2015 statistics. However, there has yet to emerge an overarching typology or ontology for these entities, with the typing system that does exist growing from the bottom up. For some background, see https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Migrating_away_from_GND_main_type.
[35] There is an alternate entry point to Wikipedia provided by http://en.wikipedia.org/wiki/Library_of_Congress_Classification.
[36] Kiyota, Yoji, Hiroshi Nakagawa, Satoshi Sakai, Tatsuya Mori, and Hidetaka Masuda. “Exploitation of the Wikipedia category system for enhancing the value of LCSH.” In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 411-412. ACM, 2009.
[37] Salah, Almila Akdag, Cheng Gao, Krzysztof Suchecki, and Andrea Scharnhorst. “Need to categorize: A comparative look at the categories of the Universal Decimal Classification system and Wikipedia,” Leonardo 45, no. 1 (2012): 84-85.
[38] Pohl, Aleksander. “Classifying the Wikipedia articles into the OpenCyc taxonomy.” In Proceedings of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference, vol. 5, p. 16. 2012.
[39] Upper Mapping and Binding Exchange Layer (UMBEL) Specification, UMBEL.org, retrieved February 16, 2015.
[40] Evgeniy Gabrilovich and Shaul Markovitch, 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
[41] Hassan, Mostafa. “Automatic Document Topic Identification Using Hierarchical Ontology Extracted from Human Background Knowledge.” PhD dissertation, University of Waterloo, 2013.
[42] Fei Wu, Raphael Hoffmann and Daniel S. Weld, 2008. Information Extraction from Wikipedia: Moving Down the Long Tail, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-08), Las Vegas, NV, August 24-27, 2008, pp. 635-644. See http://www.cs.washington.edu/homes/wufei/papers/kdd08.pdf.
[43] Menno Hulswit, 1997. “Peirce’s teleological approach to natural classes,” in Transactions of the Charles S. Peirce Society (1997): 722-772. See http://repository.ubn.ru.nl/bitstream/handle/2066/29577/29577___.PDF?sequence=1.
[44] Upper Mapping and Binding Exchange Layer (UMBEL) Specification, Annex G: UMBEL SuperTypes Documentation, UMBEL.org, retrieved February 16, 2015.
[45] Aleksander Smywinski-Pohl, Krzysztof Wróbel, Michael K. Bergman and Bartosz Ziółko, 2015. “cycloped.io: An Interoperable Framework for Web Knowledge Bases,” manuscript in preparation.
[46] The main advantage of a grounding reference is that it allows a spoke-and-hub design for data mapping, which is tremendously more efficient than pairwise mappings. In a spoke-and-hub design, where the reference ontology is the common node at the hub, only n – 1 routes are necessary to connect all sources, meaning that it scales linearly with the number of sources and attributes. Without a grounding reference, these same mapping capabilities would require n(n – 1)/2 routes in a pairwise (point-to-point) approach, which scales poorly as a quadratic function. A system of ten datasets would require 9 composite mappings in the reference grounding case, but 45 in a pairwise approach. And, of course, datasets themselves contain tens to thousands of attributes, compounding the map scaling problem further.
[47] For example, WordNet is a coherent lexical ontology, but is not computable.
[48] See the knowledge bases section of M.K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” in AI3:::Adaptive Information blog, November 14, 2014.
[49] Galárraga, Luis, Geremy Heitz, Kevin Murphy, and Fabian M. Suchanek. “Canonicalizing open knowledge bases.” In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pp. 1679-1688. ACM, 2014.
[50] See, for example, Suchanek, Fabian M., and Gerhard Weikum. “Knowledge Bases in the Age of Big Data Analytics.” Proceedings of the VLDB Endowment 7, no. 13 (2014): 1713-1714.

Conceptual and Practical Distinctions in the Attributes Ontology

AI3:::Adaptive Information (Mike Bergman) - Tue, 03/03/2015 - 06:53
Reasons for the Relations and Attributes Split

Logicians going back at least as far as Charles S. Peirce [1] – and computer scientists since at least the entity-relationship (ER) model of the mid-1970s [2] – have made the relations-attributes distinction for predicates relating to data objects. There are both conceptual and practical bases for this distinction. This article elaborates upon the relations-attributes distinction in the UMBEL Attributes Ontology. I also try to define my terms precisely, because terminology is overlapping and confusing amongst competing data models and standards.

In one of its first communiques in 1999 regarding the Resource Description Framework, its sponsor, the W3C (World Wide Web Consortium), noted that the RDF data model was a member of the entity-relationship modeling family [3]. Though for its own reasons the W3C chose to label the relationships between things as “properties” in RDF [4], for practical modeling purposes the E-R model’s distinction between attributes and relations provides additional explanatory power. It also provides a more useful and tractable means for modeling the connections between things, an essential requirement for efficient data interoperability. Understanding these distinctions is an important basis for understanding the structure and design of UMBEL’s Attributes Ontology.

Recap of the UMBEL Attributes Ontology

My prior article introduced the Attributes Ontology (AO), a new module and extension about to be released for UMBEL (Upper Mapping and Binding Exchange Layer). UMBEL is both a vocabulary and an ontology reference structure for concepts. UMBEL’s role is to help match the discussion of topics and things across the Web.

The spoke-and-hub design [5] that UMBEL provides for concepts led to an obvious question of how the relationships (“properties”) of things can be similarly organized and managed. Referring again to the prior article, we put forward the following stack (as initially informed by Pietranik and Nguyen [6]) about how to look at the data interoperability space from a semantic technologies perspective:


The Ontology Stack

The prior article looked across the stack and noted the importance of relying on open standards. It noted the growing availability of public datasets and knowledge bases to inform our understanding and  structure of the various layers in this stack. In this article, however, we mostly concentrate on the Properties layer, and its split between relations and attributes.

Definitions within the Attributes Ontology

As Peirce pointed out, the meaning of the symbols and referents in our language emerges from a consensual process by which society converges on the meanings of words. This is always the tricky thing about language: we think we know what words or terms mean because of that converging process, but in our individual interpretations we may still differ slightly about what exactly those words or terms encompass and mean. Further, when one is dealing with semantics and meaning, being explicit about the terms used is all the more essential.

Thus, here are the exact definitions as used within UMBEL and the Attributes Ontology:

  • Annotation — is a note or metadata that may be assigned to (“associated with”) a subject, property, object, record or dataset; annotations are almost always values
  • Assertion – is a statement in the form of a triple (subject – property – object/value), sometimes also referred to as a “fact”
  • Attribute – an inherent characteristic of an entity that helps further describe or define that entity. Every attribute either has a value or an assignable member of a set, with such members being objects. All entities within a given entity type share the same potential attributes
  • Class — a set of one or more members that share the same potential attributes or relations; concepts and entity types are both classes [7]
  • Concept – something that is conceived in the mind that is not a nameable, discrete, tangible thing. Concepts may have metadata assigned to them, but not attributes as defined herein [19]
  • Dataset – is a combination of one or more records, transmitted as a single unit (though it may be split into parts due to size)
  • Entity — a nameable and tangible thing, which is often independent or separate; can be part of something else when an identifiable thing by name, as “lung” is to “animal body”. As used herein, distinct from concept. When assigned a proper name, also known as a named entity [8]
  • Entity Type — a collection of entities that have the same potential attributes and possible relations and are thus members of the same class
  • Individual — in RDF and Cyc [7] an “individual” is synonymous with an instance or entity; we try not to use it because of general terminological confusion [9]
  • Identifier — a unique string assigned to every object in the system. In RDF and UMBEL and the Attributes Ontology this identifier is a URI. The identifier is a reference pointer, and should not be confused with the name or defining label for the object; it is a sign pointing to the object, but not the object itself
  • Instance – is a single entity, also called the individual in RDF
  • Metadata – is “data about data”, usually in the form of descriptive annotation. Metadata provides reifying information about an object such as provenance, author, date created or used, version information and the like
  • Object — an entity, concept or property that can be referred to via an identifier of some  sort; in the case of the Attributes Ontology, an object has a URI identifier
  • Property – a relationship between a subject and a value or object, which may either be of the form of a relation, attribute or annotation
  • Record — is an instance with one or more associated attributes and values
  • Referent – the object referred to by an identifier
  • Relation – is the relationship between two different entities or concepts (or between an entity and a concept), which is often hierarchical, mereological or some other form of mutual relationship between the two things [18]
  • Subject – is either a concept, entity or property (also collectively known as a “resource” in RDF), and is the item currently under consideration or focus; it is equivalent to the linguistic subject
  • Value — a string, literal or data value that provides the numerical quantity, or quality or utility of a subject in relation to the meaning of its associated attribute; a value has no meaning or context absent its paired attribute.

To illustrate why defining terms is so important, let’s look at three of the terms above, and how they are described in other nomenclatures or terminological systems:

UMBEL/AO Terminology    Terminology Used Elsewhere [10]
entity type             concept, kind, set, collection, type, class
entity                  object, instance, exemplar, element, member, record, individual, dependent variable
attribute               property, predicate, relationship, feature, facet, dimension, characteristic, field, header, independent variables

Some Comparisons of Terminology Differences

So, please try to keep in mind the definitions listed in the bullets above when looking at the distinctions in UMBEL or the Attributes Ontology.

Pragmatic Reasons for a Properties Split

Fundamentals matter. If someone were to ask you how the world was organized, what would you say? Does it even matter?

If it does matter, as it does for many with responsibility for getting divisions or people or accounting systems or companies to work with one another — that is, data interoperability — then you have to eventually grapple with the “how the world is organized” question. In my early training in plant systematics, we learned about ‘lumpers and splitters’, taxonomists with different worldviews that emphasized either similarities or distinctions in how they cataloged the world. Other perspectives and worldviews provide similar dichotomies or spectra in how to define and characterize the world. The challenge is how to get multiple parties to buy into an approach to data interoperability that is grounded at a fundamental level all parties can accept as a foundation.

For a decade, I have believed that one part of that foundation should be grounded in RDF [11]. I like the simplicity of the RDF subject – property – object ‘triple’ statement. I like that RDF is an open standard. I like the ability of simple RDF statements to be combined into more complicated and sophisticated vocabularies and languages, and then ontologies. I like RDF’s applicability to any form of information, its expression in multiple serializations, and its ability to represent virtually any form of data in the wild. Though I think I understand the critics who want more expressiveness still, such as in concept models or higher-order logics, I think we are still working out the basics of human concepts and languages in a machine-understandable context. That is why I continue to try to work in the RDF and OWL sandbox, even though I suspect they will eventually be supplanted by more capable constructs.

Yet one of the things I don’t like about RDF is the semantics and terminology of the ‘property’ construct. I think it conflates relationships between things, which help us to organize and understand connections between objects in the world, with how we describe those things. While OWL provides some improvements in that we now can distinguish between data, object and annotation properties [12], those distinctions do not really get at the fundamental conflation of the ‘property’ construct. I further suspect that one of the reasons we have yet to see ABox or instance data-level ontologies to help the data interoperability question — what I first discussed in the introduction to the Attributes Ontology — is this very same conflation.

Fortunately, the extensibility of RDF and OWL via ontologies gives us a method for cleaving this conflation apart.

If we assume we can tease apart the fundamental nature and coverage of ‘properties’, what are the basic conceptual splits that represent this construct? From the aspect of data interoperability, I think we can see three: relations, attributes, and annotations.

The first conceptual split for ‘properties’ is relations. At the most fundamental level, we have things in the world and relationships or connections between those things. When we say things like dogs are a kind of mammal or Lassie is a dog, we are categorizing things by type. When we say that hair or toenails are parts of a mammal we are relating parts of an animal to a whole. When we say that mammals have hair but birds have feathers we are drawing distinctions between the two animal types. These kinds of statements tend to place the objects in our world in relation to one another. By so doing, we provide an organized view of the things in the world and give those very same things context. In all of these cases, our statements specify a relation between things that, combined with other relations, provide a schema or conceptualization of how things in the world relate to one another.

Some of these organizing principles are mental and intellectual constructs for how we group things together, such as dogs and people as mammals or mammals and birds as animals. Some of these organizing principles are ideas or concepts such as truth, beauty and conflict, a richness of terminology that gives us further explanatory power for how to place and give context to the things in our world. Relations between things are thus ultimately contextual in nature; they help to place our understanding of things in connection to other things. This is the portion of RDF ‘properties’ that we call relations, and they are explicitly excluded from our Attributes Ontology.

On the other hand, we look to separate the existence of some things different from other things by the nature of their characteristics, what we can observe and describe about that given thing. So, we describe shapes, sizes, weights, ages, colors and characteristics of things with increasingly nuanced vocabularies. We note that grasses have linear or simple leaves, oaks have serrated or wavy-shaped leaves, and carrots have branched or compound leaves. We distinguish hair color, eye color, place of birth, current location and a myriad of factors. Each one of these factors becomes an attribute for that object, with the specific values (simple v wavy v compound) distinguishing instances from one another. Attributes are the second conceptual split for ‘properties’.

These same distinctions were described by Chen in his E-R model, his attempt to find a common ground across the network, relational and entity set models [2]. They are represented in pictorial form in the Wikipedia entity-relationship model article as follows:

E-R Modeling Representation of Properties (figure: relation form of property; entity with attribute; relation with attribute)

Further, in a later elaboration of where his E-R modeling ideas arose, Chen was able to correlate these relationships to natural language [13], which I have updated to reflect the terminology herein:

Word Sense           AO Component
common noun          concept / entity type / entity
proper noun          entity
transitive verb      relation
intransitive verb    attribute
adjective            attribute
adverb               attribute (property)

Mapping of Word Senses to Attributes Ontology Components

Note that attributes may also apply to the relation-type of property.

The third conceptual split for ‘properties’ is annotations, or metadata or “data about data”, which can apply to anything. Annotations give us a way to describe the circumstances, provenance, conditions, contexts or observations surrounding the item at hand. Where did we discover or find it? When did we find or elaborate upon it? By whom or when was it found or elaborated? What is our commentary about it? While these are all external elaborations of the thing at hand, and not intrinsic to its nature, they are all characterizations about a given thing. In this regard, annotations have as their focus a given object, similar to what is true for attributes. As a result, we have included annotations in the Attributes Ontology as well.

Thus, with respect to RDF ‘triples’, we can now map the three parts of the assertion statement as follows:

subject                            property                object
concept / entity type / entity     relation / attribute    object / value

Mapping of an RDF ‘Triple’ to Attributes Ontology Components

This mapping sets the overall context for how the Attributes Ontology relates to the basic RDF building blocks.
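To make the mapping concrete, here is a minimal sketch in Python using the rdflib library. The ex: namespace and the specific property names are invented for illustration only; they are not UMBEL or Attributes Ontology terms. The example simply shows how the three kinds of ‘properties’ discussed above fall out of ordinary RDF triples:

```python
# Minimal sketch (illustrative namespace and property names, not the AO vocabulary)
# of the three kinds of 'properties': relations, attributes, and annotations.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")       # hypothetical namespace
g = Graph()

lassie = EX.Lassie                          # an entity
dog = EX.Dog                                # an entity type (class)

# 1. Relation: places the entity in context relative to other things
g.add((lassie, RDF.type, dog))              # "Lassie is a dog"
g.add((dog, RDFS.subClassOf, EX.Mammal))    # "dogs are a kind of mammal"

# 2. Attribute: an intrinsic characteristic of the entity (a key-value pair)
g.add((lassie, EX.hairColor, Literal("brown")))

# 3. Annotation: metadata about the record, not the thing itself
g.add((lassie, RDFS.comment, Literal("Record entered 02/16/2015")))

print(g.serialize(format="turtle"))
```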

From the Theoretical to the Pragmatic

These kinds of distinctions are not new. In philosophy, related distinctions have been drawn about intrinsic v extrinsic properties [14] or intensionality v extensionality [15]. For conceptual models with specific reference to ontologies, Wand et al [16] in 1999 were making the distinction between intrinsic properties (akin to what we term attributes herein) and mutual properties between things (what we term relations). Unfortunately, at that time, the conventions of RDF had not yet become prevalent and the idea of annotation properties had not yet emerged (from OWL). These later distinctions are important, but the Wand et al discussion still is helpful to elucidate the same pragmatic and theoretical considerations.

More recently, the DERA initiative from the University of Trento has embraced these same distinctions [17]. Unfortunately, no ontology supporting these viewpoints has yet been made public.

We are thus pretty much in virgin territory. While having a sound conceptual and theoretical basis is essential (and apparently we have one), the real reasons for carving out an attribute perspective on RDF properties are pragmatic. Since attributes are the properties of an entity, we can better interoperate entity data by concentrating on those aspects that let us match data in one set of records to similar data in totally different records. By building a new vocabulary and structure upon RDF, we can provide a more sophisticated handling of ‘properties’ than RDF or OWL alone can provide in their native forms.

Specifically, an attribute focus, expressed in an Attributes Ontology, which conforms with open standards and is designed explicitly as a reference grounding, gives us these advantages:

  1. A clear focus for organizing the attributes that describe objects, the first step in providing a reference grounding
  2. With the assistance of Cyc [7], a logical and coherent basis for organizing a resulting knowledge graph of attribute types
  3. A semantic underpinning that enables us to overcome semantic heterogeneities across all aspects of data expression
  4. Immediate access to semantic approaches for inferencing and concept (semset) matching between different data vocabularies
  5. A computable structure over which data values in external datasets can be mapped
  6. Suitability for graph analytics techniques for clustering, relating or analyzing instance data
  7. Because of its conceptual affinity with E-R modeling, a bridge to many other domains and expertise, through vehicles such as UML
  8. A logical split of properties that makes RDF a more tractable data model, including for relations [18]
  9. An attribute knowledge base for informing artificial intelligence techniques, and
  10. A semantic grounding to help us overcome the data interoperability bottlenecks.

Making these distinctions operational, in part, is the purpose of the Attributes Ontology.

As a relatively cutting-edge effort, we expect some false steps and likely hiccups as we move to put in place this reference structure. However, we think this effort to be both innovative and essential to ongoing use of semantic technologies to tackle the decades-long challenges of data interoperability.

Other Related Concepts

Besides the other links mentioned in this article, here are some additional articles on Wikipedia that provide other and varied perspectives on the concepts and terminology used herein:

The Entity Segregation

The next article in this series introducing the Attributes Ontology will discuss the related basis for segregating out ‘entities’ in UMBEL. The entities and attributes work closely with one another to aid data mapping and interoperability.

[1] For more discussion of Peirce in relation to semantic Web issues, see [9]. According to the ER model entry in Wikipedia, it is in accord with philosophic and theoretical traditions from the time of the Ancient Greek philosophers (Socrates, Plato and Aristotle) through to modern epistemology, semiotics and logic reflecting the views of Peirce, Frege, Russell and Carnap.
[2] Peter Chen, 1976. “The Entity-Relationship Model – Toward a Unified View of Data,” in ACM Transactions on Database Systems 1 (1): 9–36, March 1976.
[3] Ralph R. Swick and Henry S. Thompson, eds., 1999. The Cambridge Communiqué, a World Wide Web Consortium (W3C) Note, 7 October 1999.
[4] Patrick Hayes, 2004. “RDF Semantics,” a W3C Recommendation, February 2004. See http://www.w3.org/TR/rdf-mt/.
[5] A spoke-and-hub design (n – 1) for data mapping is tremendously more efficient than the most common approach of pairwise mappings (a quadratic function). For example, ten datasets would require 9 composite mappings in a spoke-and-hub design versus 45 in a pairwise approach. And, of course, datasets themselves contain tens to thousands of attributes, compounding the map scaling problem further.
[6] Marcin Pietranik and Ngoc Thanh Nguyen, 2011. “Attribute Mapping as a Foundation of Ontology Alignment,” in N.T. Nguyen, C.-G. Kim, and A. Janiak (eds.): ACIIDS 2011, LNAI 6591, pp. 455–465, 2011.
[7] UMBEL is based on the OpenCyc version of the Cyc knowledge base; Cyc uses the term “collections” to refer to RDF classes.
[8] You sometimes see entities defined as “self-contained”; that is not strictly followed here. What is more important is being a nameable thing with attributes.
[9] M. K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” in AI3:::Adaptive Information blog, January 12, 2012. See especially the discussion of the toucan.
[10] Some of the terminology sources are E-R, concept maps, RDF, Cyc, description logics, data modeling (of various types), OWL, etc. Most particularly, entities are sometimes elsewhere taken to include concepts; the Attributes Ontology does not.
[11] I have been but one of many arguing for the benefits of RDF. For my views, see M. K. Bergman, 2009. “Advantages and Myths of RDF,” in AI3:::Adaptive Information blog, April 8, 2009.
[12] See Pascal Hitzler, Markus Krötzsch, Bijan Parsia, Peter F. Patel-Schneider, Sebastian Rudolph, eds., 2012. OWL 2 Web Ontology Language Primer (Second Edition), a W3C Recommendation, World Wide Web Consortium, 11 December 2012, for the annotation – object – datatype property distinction.
[13] Peter Pin-Shan Chen, 1997. “English, Chinese and ER diagrams,” in Data & Knowledge Engineering 23, no. 1 (1997): 5-16.
[14] Stanford Encyclopedia of Philosophy, 2012. “Intrinsic vs. Extrinsic Properties,” online article, first published January 5, 2002; substantive revision December 23, 2012.
[15] At least for Carnap, he thought “…the full meaning of a concept is constituted by two aspects, its intension and its extension. The first part comprises the embedding of a concept in the world of concepts as a whole, i.e. the totality of all relations to other concepts. The second part establishes the referential meaning of the concept, i.e. its counterpart in the real or in a possible world.”
[16] Yair Wand, Veda C. Storey, and Ron Weber, 1999. “An Ontological Analysis of the Relationship Construct in Conceptual Modeling,” in ACM Transactions on Database Systems (TODS) 24, no. 4 (1999): 494-528, December 1999. Also see Jeffrey Parsons and Yair Wand, 2003. “Attribute-Based Semantic Reconciliation of Multiple Data Sources,” in Journal on Data Semantics, LNCS 2800.
[17] Fausto Giunchiglia and Biswanath Dutta, 2011. “DERA: A Faceted Knowledge Organization Framework,” Technical Report # DISI-11-457, University of Trento, March 2011; submitted to the International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011). DERA, in fact, stands for domain, entities, relations and attributes, but is mostly derived from the work of S.R. Ranganathan, the 20th century Indian library theorist and mathematician. DERA appears to have evolved into later projects, but the provenance is unclear.
[18] This provides a logical invitation to a similar ‘relations ontology’ for capturing the role and relationship aspects of RDF properties. Such relations stipulate topographical relationships, hierarchical relationships (subClassOf, fatherOf, daughterOf), mereological relationships (partOf, isComponent), role relationships (isBossOf, hasTeacher, isKeyInfluencer) or approximation relationships (isLike, isAbout, relatesTo).
[19] Actually, this is a bit more complex than the definition. In OWL modeling, a “concept” may also act as an instance (“individual” in RDF terminology) through what is known as metamodeling; see M. K. Bergman, 2010. “Metamodeling in Domain Ontologies,” in AI3:::Adaptive Information blog, September 20, 2010.

An UMBEL Extension for Attributes

AI3:::Adaptive Information (Mike Bergman) - Mon, 02/16/2015 - 18:43
The Attributes Ontology is Designed for Efficient Data Mapping

The semantic Web does not yet have the complete infrastructure for supporting data interoperability. Most ontology mapping or alignment efforts have focused on concepts, or the class structure of the schema. Comparatively little has been done on instance mapping or predicate (property) mapping [1]. Yet these considerations should reside at the heart of how semantic Web technologies can assist data interoperability.

We began the UMBEL (Upper Mapping and Binding Exchange Layer) vocabulary and ontology as a reference structure for concepts, a means to help match the discussion of topics and things across the Web. As such, UMBEL is part of a fairly robust library of upper ontologies that are meant to provide the grounding references for what information is about. Domains as diverse as biomedicine, banking, oil and gas, municipal governments, retail, marine organisms and the environment — among many others — have effectively leveraged upper ontologies to get diverse datasets and vocabularies to relate to one another. This is much welcomed, to be sure, and a good indicator of how semantic technologies can begin to approach getting data to interoperate.

Here is one way to look at the data interoperability space from a semantic technologies perspective (as initially informed by Pietranik and Nguyen [2]):

The Ontology Stack

The overall semantics of the structure — indeed, how the structure itself is defined — comes from which ontology languages and vocabularies are used. From an expressiveness standpoint, particularly in conceptual relations or domain schema, there are a variety of standards and specifications from which to choose [3]. We also have pretty good reference ontologies for many domains and for what are called the upper levels. We are also starting, through efforts such as Wikipedia (DBpedia and Wikidata), schema.org, Freebase and OKKAM, to get referencable datasets of entities and their attributes, sometimes organized by type.

Reference groundings for properties, on the other hand, have received virtually no attention [4]. SIO, the Semanticscience Integrated Ontology, is one attempt to provide a reference structure for properties in the science domain. The approach is exemplary, but still lacks the scope required of a general grounding vocabulary. QUDT, the Quantities, Units, Dimensions and Data Types Ontologies, provides a standard vocabulary for measurement quantities, but lacks the scope to capture non-quantitative measures for describing things. Both SIO and QUDT should inform and contribute to a still-needed broader treatment of how to describe entities. That is the purpose of the Attributes Ontology in the forthcoming new release of UMBEL.

Attributes within the Semantic Technology Stack

The properties in RDF triples (s – p – o) relate two things, the subject and object, to one another. One pragmatic way to understand properties, which are the predicates or verbs of these triple statements, is that they fall into two broad categories. The first category comprises the properties between or among different things; they are extrinsic to the subject at hand. These relations stipulate hierarchical relationships (subClassOf, fatherOf, daughterOf), mereological relationships (partOf, isComponent), role relationships (isBossOf, hasTeacher, isKeyInfluencer) or approximation relationships (isLike, isAbout, relatesTo). In these cases, both subjects and objects are concepts or identifiable things (entities).

However, the second category of properties, attribute properties, has a different nature. Attribute properties — attributes for short — are characteristics of an entity or entity type (class). They describe the entity at hand in the nature of key-value pairs. The key is the attribute, and the value is the literal value or object reference. In broad terms, attributes are the specifics of what is contained in a data record for a given instance. Multiple instances, or records, make up what is known as a dataset.

Attribute properties are intrinsic or descriptive properties. The combination of possible attributes for a given entity constitutes the intensional definition of that object. This use of the term attribute is consistent with its research sense as a descriptive characteristic of an object and with its computing sense as a factor of a given object. In this inclusive spirit of how attributes describe a given thing, we also include annotations and metadata in the attributes category of properties. All attribute properties provide a description or characteristic of the entity at hand.
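To make the distinction concrete, here is a minimal sketch in Python using the rdflib library; the ex: namespace and the entity and property names are hypothetical placeholders, not UMBEL or Attributes Ontology terms.

```python
from rdflib import Graph, Literal, Namespace

# Hypothetical example namespace; none of these terms come from UMBEL
EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)

# Relation (extrinsic) properties: they connect two identifiable things
g.add((EX.Mike, EX.fatherOf, EX.Erin))
g.add((EX.Mike, EX.isBossOf, EX.Fred))

# Attribute (intrinsic) properties: key-value descriptions of one entity;
# the value may be a literal or a reference to another object
g.add((EX.Mike, EX.hairColor, Literal("red")))   # literal value
g.add((EX.Mike, EX.occupation, Literal("CEO")))  # literal value
g.add((EX.Mike, EX.spouse, EX.Wendy))            # object value

print(g.serialize(format="turtle"))
```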

Here are some example key-value pairs about me, the entity Mike Bergman, to illustrate the diversity of how attributes may describe things:

hair : red
college : Pomona College
mood : happy
spouse : Wendy
cat : Snuffles
location : 41°41′18″N 91°35′12″W
dateEntered : 02/16/2015
country : USA
city : Iowa City, IA
occupation : CEO
avocation : flyfishing, cooking
lastBook : Zen and the Art of Motorcycle Maintenance
species : homo sapiens
sex : male
height : 6’3″
graduateSchool : Duke University
maritalStatus : married
blog : http://mkbergman.com
children : Erin, Zak
address : 380 Knowling
fullName : Michael K Bergman

Example Key-Value Attributes

The infoboxes in Wikipedia are another example of such attribute types and values. Note that the values may vary widely as to units or quantities, or may even be links to other things. Also note that it really does not matter in what order the key-value pairs are presented, and that some values refer to other objects (shown as links).

Virtually any data format or data serialization in existence can be expressed in such key-value pairs. Further, related types of entities have related attributes, such that attribute relationships are an alternative way to describe typologies. My attributes, as a human, are quite similar to attributes for other humans, and somewhat close to other mammals. But my attributes are very different than those for a worm or an automobile.

Even simple attributes can pose a challenge for mapping, absent a grounding framework. My name, for example, is Michael Kermit Bergman, which is often rendered as Michael Bergman, Mike Bergman, M K Bergman, mkbergman or Michael K Bergman; the fields that capture those variants may hold anywhere from one to four name parts, each called something different across schemas. References, rules, semsets (synonyms, jargon and aliases), and coherent organization are needed to ground all of these variants into a common form.
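As a toy illustration only (not the UMBEL mechanism itself), a small semset-style lookup in Python can ground these surface variants to one canonical record; the canonical fields and the variant list below are made-up examples.

```python
# Toy illustration of grounding name variants to a canonical form.
# The canonical record and the "semset" of variants are hypothetical examples.

CANONICAL = {"givenName": "Michael", "middleName": "Kermit", "familyName": "Bergman"}

SEMSET = {
    "michael kermit bergman",
    "michael bergman",
    "mike bergman",
    "m k bergman",
    "mkbergman",
    "michael k bergman",
}

def ground_name(raw):
    """Return the canonical record if the raw string is a known variant, else None."""
    key = " ".join(raw.lower().replace(".", " ").split())
    return CANONICAL if key in SEMSET else None

print(ground_name("Mike Bergman"))     # known variant -> canonical record
print(ground_name("M. K. Bergman"))    # punctuation stripped, still matches
print(ground_name("Someone Else"))     # unknown -> None
```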

Attribute properties may be quantitative (with a quantitative, measurable value), qualitative, or descriptive or annotative. In many cases, the actual value of an attribute is a literal or numeric value, but it may also be an object, as when the value is a member of an enumerable set or is its own defined entity. Describing something as having a color characteristic of red, for example, may result in a literal assignment of the string “red” or it may refer to another object definition where red is specified as to its chromatic properties. Further, if my mention of red occurs in the context of my own personal record (as above), then the referent is more properly something like red hair. Semantics (and, thus, context) matter in data interoperability. I will describe the rationale and importance of the relation-attribute property split more fully in a following article [5].

The purpose of semantic technologies is to overcome some 40 categories of semantic heterogeneity, as I most recently discussed in [6]. One interesting aspect is the large number of semantic differences that may be ascribed to attributes, as this table from [6] shows (see the entries typed as Attribute):

| Class | Category | Subcategory | Examples | Type [7] |
| --- | --- | --- | --- | --- |
| LANGUAGE | Encoding | Ingest Encoding Mismatch | For example, ANSI v UTF-8 | Concept |
| | | Ingest Encoding Lacking | Mis-recognition of tokens because not being parsed with the proper encoding | Concept |
| | | Query Encoding Mismatch | For example, ANSI v UTF-8 in search | Concept |
| | | Query Encoding Lacking | Mis-recognition of search tokens because not being parsed with the proper encoding | Concept |
| | Languages | Script Mismatch | Variations in how parsers handle, say, stemming, white spaces or hyphens | Concept |
| | Parsing / Morphological Analysis | Errors (many) | Arabic languages (right-to-left) v Romance languages (left-to-right) | Concept |
| | Syntactical | Errors (many) | Ambiguous sentence references, such as I’m glad I’m a man, and so is Lola (Lola by Ray Davies and the Kinks) | Concept |
| | Semantics | Errors (many) | River bank v money bank v billiards bank shot | Concept |
| CONCEPTUAL | Naming | Case Sensitivity | Uppercase v lower case v Camel case | Concept |
| | | Synonyms | United States v USA v America v Uncle Sam v Great Satan | Concept |
| | | Acronyms | United States v USA v US | Concept |
| | | Homonyms | Such as when the same name refers to more than one concept, such as Name referring to a person v Name referring to a book | Concept |
| | | Misspellings | As stated | Concept |
| | Generalization / Specialization | | When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” but the other schema has multiple elements such as “home phone,” “work phone” and “cell phone” | Concept |
| | Aggregation | Intra-aggregation | When the same population is divided differently (such as, Census v Federal regions for states, England v Great Britain v United Kingdom, or full person names v first-middle-last) | Concept |
| | | Inter-aggregation | May occur when sums or counts are included as set members | Concept |
| | Internal Path Discrepancy | | Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are different levels of remove) | Concept |
| | Missing Item | Content Discrepancy | Differences in set enumerations or including items or not (say, US territories) in a listing of US states | Concept |
| | | Missing Content | Differences in scope coverage between two or more datasets for the same concept | Concept |
| | | Attribute List Discrepancy | Differences in attribute completeness between two or more datasets | Attribute |
| | | Missing Attribute | Differences in scope coverage between two or more datasets for the same attribute | Attribute |
| | Item Equivalence | | When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state) | Concept |
| | | | When two individuals are asserted as being the same when they are actually distinct (for example, John Kennedy the president v John Kennedy the aircraft carrier) | Attribute |
| | Type Mismatch | | When the same item is characterized by different types, such as a person being typed as an animal v human being v person | Attribute |
| | Constraint Mismatch | | When attributes referring to the same thing have different cardinalities or disjointedness assertions | Attribute |
| DOMAIN | Schematic Discrepancy | Element-value to Element-label Mapping | One of four errors that may occur when attribute names (say, Hair v Fur) may refer to the same attribute, or when same attribute names (say, Hair v Hair) may refer to different attribute scopes (say, Hair v Fur), or where values for these attributes may be the same but refer to different actual attributes, or where values may differ but be for the same attribute and putative value. Many of the other semantic heterogeneities herein also contribute to schema discrepancies | Attribute |
| | | Attribute-value to Element-label Mapping | | Attribute |
| | | Element-value to Attribute-label Mapping | | Attribute |
| | | Attribute-value to Attribute-label Mapping | | Attribute |
| | Scale or Units | Measurement Type | Differences, say, in the metric v English measurement systems, or currencies | Attribute |
| | | Units | Differences, say, in meters v centimeters v millimeters | Attribute |
| | | Precision | For example, a value of 4.1 inches in one dataset v 4.106 in another dataset | Attribute |
| | Data Representation | Primitive Data Type | Confusion often arises in the use of literals v URIs v object types | Attribute |
| | | Data Format | Delimiting decimals by period v commas; various date formats; using exponents or aggregate units (such as thousands or millions) | Attribute |
| DATA | Naming | Case Sensitivity | Uppercase v lower case v Camel case | Attribute |
| | | Synonyms | For example, centimeters v cm | Attribute |
| | | Acronyms | For example, currency symbols v currency names | Attribute |
| | | Homonyms | Such as when the same name refers to more than one attribute, such as Name referring to a person v Name referring to a book | Attribute |
| | | Misspellings | As stated | Attribute |
| | ID Mismatch or Missing ID | | URIs can be a particular problem here, due to actual mismatches but also use of name spaces or not and truncated URIs | Attribute |
| | Missing Data | | A common problem, more acute with closed world approaches than with open world ones | Attribute |
| | Element Ordering | | Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ | Attribute |

Sources of Semantic Heterogeneities

We can see that attribute heterogeneities may apply to the attribute itself (the key in a key-value pair), as to what it may contain and what it may refer to, as well as to the actual values and their units and measures. These aspects are important, in that they are the very ones we mean when we talk of data.

Rationale for an Attributes Ontology

When we combine the descriptions of things, we need ways to overcome these sources of semantic heterogeneities. As with concepts, it would be extremely helpful to have a similar attributes vocabulary, and one which is organized according to some logical attribute schema. This combination of vocabulary and schema defines what constitutes an attributes ontology. It can also be a reference grounding for how to relate data from different datasets to one another. Providing this grounding is the driving rationale for UMBEL’s new Attributes Ontology.
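As a simplified sketch of what such grounding enables (the dataset names and ref: attribute labels below are hypothetical, not drawn from the actual Attributes Ontology), each source schema is mapped once to the reference hub, after which records can be compared or merged directly:

```python
# Hypothetical spoke-and-hub attribute mapping: each source schema is mapped
# once to a common reference attribute, rather than to every other schema.

REFERENCE_MAP = {
    # (source dataset, source attribute) -> reference attribute (the "hub")
    ("datasetA", "Hair"):        "ref:hairColor",
    ("datasetB", "hair_colour"): "ref:hairColor",
    ("datasetA", "Ht"):          "ref:height",
    ("datasetB", "height_cm"):   "ref:height",
}

def to_reference(dataset, record):
    """Re-key a source record onto the shared reference attributes."""
    return {REFERENCE_MAP[(dataset, k)]: v for k, v in record.items()
            if (dataset, k) in REFERENCE_MAP}

a = to_reference("datasetA", {"Hair": "red", "Ht": "190.5 cm"})
b = to_reference("datasetB", {"hair_colour": "red", "height_cm": 190.5})

print(a)  # {'ref:hairColor': 'red', 'ref:height': '190.5 cm'}
print(b)  # {'ref:hairColor': 'red', 'ref:height': 190.5}
# Once both records share reference keys, they can be compared or merged directly.
```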

Benefits

In addition to this overarching rationale in data interoperability, a reference Attributes Ontology brings with it a number of benefits:

  • More efficient basis for interoperability — the main advantage of a grounding reference is that it allows a spoke-and-hub design for data mapping, which is tremendously more efficient than pairwise mappings. In a spoke-and-hub design, where the reference ontology is the common node at the hub, only n – 1 routes are necessary to connect all sources, meaning that it scales linearly with the number of sources and attributes. Without a grounding reference, these same mapping capabilities would require n(n – 1)/2 routes in a pairwise (point-to-point) approach, which scales poorly as a quadratic function (see the short calculation after this list). A system of ten datasets would require 9 composite mappings in the reference grounding case, but 45 in a pairwise approach. And, of course, datasets themselves contain tens to thousands of attributes, compounding the map scaling problem further;
  • Higher quality mappings — a single target schema promotes schema enhancements, and toolsets can be justified to automate many processes, leading to;
  • Faster integration — these efficiencies lead to faster and more cost-effective mappings;
  • Better ability to combine data values across records — which means the approach can be seen as suitable for any content input type (structured, semi-structured or unstructured) or with any form of semantic heterogeneity;
  • Faceted browsing and querying — because the nature of the attributes and their values are mapped to a logical schema of attribute relationships, each attribute concept can be the basis of filtering and retrievals, powerfully supporting faceted browsing and querying;
  • Infer attribute properties — the logical basis of the attribute schema itself means that relationships and connections may be inferred, and semantics enable different perspectives and language to capture all aspects of the schema. This means the full capabilities of semantic search and querying can be brought to contributing data;
  • Highest common denominator — these capabilities mean that source datasets can be lifted and made consistent with a higher standard of testable values and inferences. The rich history of RDFizers points to the usefulness of RDF and related characterizations to bridge between multiple, native data formats [8]. The knowledge of the data already characterized in the system can inform the proper expression of new source data; and
  • Better data integration, interoperability — ultimately, of course, all of these factors lead to a complete approach to data interoperability, which leads to being able to finally achieve the objectives of “schema matching” or “data mapping.”
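The scaling claim in the first benefit above is easy to verify; the short calculation below contrasts the n – 1 hub routes with the n(n – 1)/2 pairwise routes.

```python
# Number of mapping routes needed to connect n datasets
def hub_routes(n):        # spoke-and-hub: each source maps once to the hub
    return n - 1

def pairwise_routes(n):   # point-to-point: every pair needs its own mapping
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, hub_routes(n), pairwise_routes(n))
# 10    9      45
# 100   99     4950
# 1000  999    499500
```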
Use Cases

These benefits can be realized in any data integration or interoperability setting. However, the benefits are particularly strong for these use cases:

  • Combining records across datasets — the sine qua non of data integration;
  • Checking validity of values — having an internal knowledge base of logic, schema, attributes listing and validated values against which to test data updates or new incoming data; and
  • Establishing an EIA or MDM capability — creating the internal infrastructure for truly responsive enterprise information integration or master data management. These are the reusable information and knowledge assets that are the grease for any data integration effort. The knowledge bases become assets in and of themselves. The budget sinkholes of most enterprise integration efforts can be turned around to become competitive assets in their own right.

As we have noted many times, these uses also benefit from the incremental and open world ability to expand the scope of the data integration at any point in time [9].

Description of the Attributes Ontology

We have recognized the importance of the attributes category going back to the first introduction of SuperTypes in UMBEL v.0.80 in 2010 [10]. We noted then that many of the concepts in UMBEL were devoted to how to describe things and the units or quantities associated with their values. We could also see the potential value in having a reference for mapping data characteristics and values.

The first creation of the Attributes SuperType — also introduced in UMBEL v.0.80 in 2010 — aggregated into one place related OpenCyc concepts regarding these descriptors. Working with this category over time surfaced, again, the underlying coherence and use of OpenCyc. We found that UMBEL (via its OpenCyc extraction) already had a strong, logical undergirding to support an organized representation of attributes. Once we understood these patterns, we were able to go back to OpenCyc and better capture other aspects of its attribute structure that we had earlier overlooked. We then added a few aggregate categories to UMBEL to provide a cleaner organization. UMBEL now understands and organizes some 2000 different descriptive attributes.

Over a period of years we researched exemplars in this area, with the limited results first mentioned above, notably QUDT and SIO, and also DERA [4]. We also enlisted input from the semantic Web mailing list, but were not able to find a suitable extant reference structure [11]. We find it perplexing that more work has not been done in this area. We do abhor a vacuum!

Nonetheless, we were able to organize the roughly 2000 attributes derived from OpenCyc into the following upper level of the Attributes Ontology structure:

  • AttributeValues — StringObject, StringDatatype_Unlimited, List_Information, FrequentlyAskedQuestionsList, MailingList, AlphabeticalList, Index_List_Information, BullettedFormat, UnitOfMeasure, UnitOfDistance, InternationalUnitOfMeasure, UnitOfMeasure_Common, NaturalLanguage, Encrypted, AuthenticationSource, Persistence, Distribution, Uniform_PersistenceDistribution, UnitOfMeasureConcept, Ratio, CollectionType, Phase, EmptyCollection, Preference, Quantity, AttachmentAttribute, WrittenInfo, StructuredInfo, VisualInfo, AudioInfo, LogicalFieldAttribute, TruthValue
  • AttributeTypes — DescriptiveAttributes, Definition_PCW, VisualPattern, SpatialThingTypeByShape, ShapeAttributes, Color, Name, Title, EnumeratedAttributes, EconomicalQuantity, DispositionalQuantity, MentalQuantity, PhysicalQuantity, Quality, SocialQuantity, MeasurableQuantity, TotallyOrderedQuantityType, QuantityType, NonAspectualQuantity, EnvironmentalQuantity, ActionAttributeLevelQuantity, EmotionalQuantityType, LocationAttributes, OrientationAttributes, GeographicalPlace, MappableAttributes, ContactLocation, PopulatedPlace, TimeAttributes, HistoricTemporalThing, Time_Quantity, EventAttributes, TimeInterval, TemporalThing, IdentificationAttributes, ContactLocation, ReferenceWork, IDString, UniqueID, SituationAttributes, Situation

Upper Structure of the Attributes Ontology

Note the structure above roughly splits into two parts. The first, AttributeValues, captures the various ways and measures that may be applied to actual values. We foresee a key mapping to QUDT in this part. The second part of the structure, AttributeTypes, organizes the nature of various attributes into similar, logical categories.

We have also added some experimental predicates to the UMBEL vocabulary for mapping domains, ranges and specific external properties to reference attributes. See the ongoing specification in the UMBEL Annex L documentation for other pertinent details.

Though the Attributes Ontology has a bit more structure, it too is a module that segregates specific attributes into its own file. About 2000 of the UMBEL reference concepts are tagged as attributes; about two-thirds of those, or 1275, are specific attributes assigned to the attributes module, for which the Attributes Ontology is the container.

To our knowledge, the Attributes Ontology (AO) will be the first publicly released attempt to provide an explicit modeling framework for data attributes and values. We expect hiccups, quite a few release iterations, and continued experimentation and improvement as we work with the system. We will retain an experimental designation for the new UMBEL properties and the Attributes Ontology itself until we gain better working comfort with the system.

The Additional UMBEL Entities Module

This new UMBEL Attributes Ontology is being accompanied by the creation of another UMBEL component, the Entities Module. This new module, designed in a similar way to the Geo Module that was released in version 1.05, tags all entities as such and places another 12,000 instances into a separate module. A hierarchy of about 15,000 entity types (and their descriptions and relationships) remains in UMBEL core.

Like the Geo Module, itself composed of entity instances, the Entities Module may be invoked or not for a given use of UMBEL. The ability to filter on entities and SuperTypes is also a powerful new feature. The fact that there is major disjointedness among the SuperTypes also adds to the power of queries and retrievals.

Thus, with the attributes module that is now part of the Attributes Ontology, there are now three separate but invokable modules in addition to the UMBEL core. The Geo, Entities or Attributes modules may be included or not in any given UMBEL deployment.

Pending Releases

After five years of sporadically intense thinking, Structured Dynamics is extremely pleased to first formally express our ideas about how to manage and model data and its attributes using the underlying machinery of semantic technologies. We welcome use and commentary on our approach and the Attributes Ontology.

We will be releasing UMBEL v.1.20 by the end of March with various improvements, including the Entities Module and Attributes Ontology noted above. We are also updating the UMBEL documentation and have added Annexes K and L that describe the Clojure-based UMBEL generation process and the specifics underlying the Attributes Ontology [12]. Shortly thereafter we expect to provide a new minor release that will provide mappings between the UMBEL Attributes Ontology and DBpedia and schema.org properties.

For the time being, we will be focused on refining our use of UMBEL for data interoperability, specifically for attributes. However, we note that the ontology structure used in this article also flags roles and relations as another possible gap. This gap is likely to be the next major focus in UMBEL’s research agenda.

[1] For example, the relative status of various ontology mapping efforts is covered, among others, in Fei Wu and Daniel S. Weld, 2008. “Automatically Refining the Wikipedia Infobox Ontology,” WWW 2008, April 21–25, 2008, Beijing, China; Lorena Otero-Cerdeira, Francisco J. Rodríguez-Martínez, and Alma Gómez-Rodríguez, 2015. “Ontology Matching: A Literature Review,” Expert Systems with Applications 42, no. 2 (2015): 949-971; and Marcin Pietranik and Ngoc Thanh Nguyen, 2011. “Attribute Mapping as a Foundation of Ontology Alignment,” N.T. Nguyen, C.-G. Kim, and A. Janiak (Eds.): ACIIDS 2011, LNAI 6591, pp. 455–465, 2011. I also discuss the relatively poor state of mapping predicates between entities in many articles. See, for example, commentary on sameAs in M.K. Bergman, 2011. “Making Connections Real,” in AI3:::Adaptive Information, January 31, 2011. See also reference [4] and the follow-on discussion in [5]. [2] The basic approach to this stack diagram was suggested by a figure in Marcin Pietranik and Ngoc Thanh Nguyen, 2011. “Attribute Mapping as a Foundation of Ontology Alignment,” N.T. Nguyen, C.-G. Kim, and A. Janiak (Eds.): ACIIDS 2011, LNAI 6591, pp. 455–465, 2011. [3] W3C standards exist for RDF, RDFS and OWL; also, Common Logic and conceptual graphs provide higher-order capabilities. We use OWL 2 in our efforts. Some rationale for this choice is provided in M.K. Bergman, 2010. “Metamodeling in Domain Ontologies,” in AI3:::Adaptive Information, September 20, 2010. [4] One relevant effort, though it has not yet posted details or an ontology, is Fausto Giunchiglia and Biswanath Dutta, 2011. “DERA: A Faceted Knowledge Organization Framework,” Technical Report # DISI-11-457, University of Trento, March 2011; submitted to the International Conference on Theory and Practice of Digital Libraries 2011 (TPDL’2011). [5] When posted, the reference to the follow-on article will be listed here. [6] First posted in M.K. Bergman, 2014. “Big Structure and Data Interoperability,” in AI3:::Adaptive Information, August 18, 2014. [7] Concept is the shorthand used for the schema or classes or TBox. Attribute is the shorthand used for instance data or entities and their ABox. I segregate class-relation properties (predicates) from instance-describing properties (attributes). [8] There are more than 100 converters of various record and data structure types to RDF. These converters — also sometimes known as translators or ‘RDFizers’ — generally take some input data records with varying formats or serializations and convert them to a form of RDF serialization (such as RDF/XML or N3), often with some ontology matching or characterizations. See this listing of known RDFizers. [9] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” in AI3:::Adaptive Information, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems. [10] See this earlier (2010) version of Annex G: UMBEL SuperTypes Documentation to the UMBEL specifications. [11] See this thread on the linked open data (LOD) mailing list from July 2014. 
[12] See further the UMBEL Annex K: UMBEL Generator and UMBEL Annex L: Attributes Ontology and Version 1.20 to the UMBEL specifications (still being completed).

Forty Seminal Distant Supervision Articles

AI3:::Adaptive Information (Mike Bergman) - Tue, 01/20/2015 - 16:26
Some Annotated References in Relation to Knowledge-based Artificial Intelligence

Distant supervision, earlier or alternatively called self-supervision or weak supervision, is a method that uses knowledge bases to automatically label entities and relations in text; the labeled text is then used to extract features and train a machine learning classifier. The knowledge bases provide coherent positive training examples and avoid the high cost and effort of manual labelling. The method is generally more effective than unsupervised learning, while requiring similarly little upfront effort. Large knowledge bases such as Wikipedia or Freebase are often used as the KB basis.
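As a minimal sketch of the core idea in Python with scikit-learn (a toy knowledge base and toy sentences, not any particular system from the papers listed below): facts from the KB are matched against raw text to label training examples automatically, and a classifier is then trained on those noisy labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy knowledge base: (entity1, entity2) -> relation label
KB = {
    ("barack obama", "honolulu"): "born_in",
    ("albert einstein", "ulm"): "born_in",
    ("barack obama", "michelle obama"): "spouse_of",
}

# Unlabeled text corpus
sentences = [
    "Barack Obama was born in Honolulu in 1961.",
    "Albert Einstein was born in the city of Ulm.",
    "Barack Obama married Michelle Obama in 1992.",
    "Paris is the capital of France.",
]

# Distant supervision step: if a sentence mentions both entities of a KB pair,
# label it with that pair's relation (an automatic but noisy labeling heuristic).
X_text, y = [], []
for sent in sentences:
    low = sent.lower()
    for (e1, e2), rel in KB.items():
        if e1 in low and e2 in low:
            # mark the entity slots and keep the surrounding words as crude features
            X_text.append(low.replace(e1, "E1").replace(e2, "E2"))
            y.append(rel)

# Train a simple classifier on the automatically labeled examples
vec = CountVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_text), y)

# Apply to a new sentence with marked entities
test = "E1 was born in E2 during the winter."
print(clf.predict(vec.transform([test]))[0])   # expected: born_in
```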

The first acknowledged use of distant supervision was by Craven and Kumlien in 1999 (#11 below, though they used the term weak supervision); the first use of the formal term distant supervision was by Mintz et al. in 2009 (#21 below). Since then, the field has been a very active area of research.

Here are forty of the more seminal papers in distant supervision, with annotated comments for many of them:

  1. Alan Akbik, Larysa Visengeriyeva, Priska Herger, Holmer Hemsen, and Alexander Löser, 2012. “Unsupervised Discovery of Relations and Discriminative Extraction Patterns,” in COLING, pp. 17-32. 2012. (Uses a method that discovers relations from unstructured text as well as finding a list of discriminative patterns for each discovered relation. An informed feature generation technique based on dependency trees can significantly improve clustering quality, as measured by the F-score. This paper uses Unsupervised Relation Extraction (URE), based on the latent relation hypothesis that states that pairs of words that co-occur in similar patterns tend to have similar relations. This paper discovers and ranks the patterns behind the relations.)
  2. Marcel Ackermann, 2010. “Distant Supervised Relation Extraction with Wikipedia and Freebase,” internal teaching paper from TU Darmstadt.
  3. Enrique Alfonseca, Katja Filippova, Jean-Yves Delort, and Guillermo Garrido, 2012. “Pattern Learning for Relation Extraction with a Hierarchical Topic Model,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pp. 54-59. Association for Computational Linguistics, 2012.
  4. Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli, 2013. “Extending the Coverage of DBpedia Properties using Distant Supervision over Wikipedia,” in NLP-DBPEDIA@ ISWC, 2013. (Does not suggest amazing results.)
  5. Isabelle Augenstein, Diana Maynard, and Fabio Ciravegna, 2014. “Distantly Supervised Web Relation Extraction for Knowledge Base Population,” in Semantic Web Journal (forthcoming). (The approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and extracting relations across sentence boundaries using unsupervised co-reference resolution methods.) (Good definitions of supervised, unsupervised, semi-supervised and distant supervised.) (This paper aims to improve the state of the art in distant supervision for Web extraction by: 1) recognising named entities across domains on heterogeneous Web pages by using Web-based heuristics; 2) reporting results for extracting relations across sentence boundaries by relaxing the distant supervision assumption and using heuristic co-reference resolution methods; 3) proposing statistical measures for increasing the precision of distantly supervised systems by filtering ambiguous training data, 4) documenting an entitycentric approach for Web relation extraction using distant supervision; and 5) evaluating distant supervision as a knowledge base population approach and evaluating the impact of our different methods on information integration.)
  6. Pedro HR Assis and Marco A. Casanova, 2014. “Distant Supervision for Relation Extraction using Ontology Class Hierarchy-Based Features,” in ESWC 2014. (Describes a multi-class classifier for relation extraction, constructed using the distant supervision approach, along with the class hierarchy of an ontology that, in conjunction with basic lexical features, improves accuracy and recall.) (Investigates how background data can be even further exploited by testing if simple statistical methods based on data already present in the knowledge base can help to filter unreliable training data.) (Uses DBpedia as source, Wikipedia as target. There is also a YouTube video that may be viewed.)
  7. Isabelle Augenstein, 2014. “Joint Information Extraction from the Web using Linked Data,” I. Augenstein’s Ph.D. proposal at the University of Sheffield.
  8. Isabelle Augenstein, 2014. “Seed Selection for Distantly Supervised Web-Based Relation Extraction,” in Proceedings of SWAIE (2014). (Provides some methods for better seed determinations; also uses LOD for some sources.)
  9. Justin Betteridge, Alan Ritter and Tom Mitchell, 2014. “Assuming Facts Are Expressed More Than Once,” in The Twenty-Seventh International Flairs Conference, 2014.
  10. R. Bunescu and R. Mooney, 2007. “Learning to Extract Relations from the Web Using Minimal Supervision,” in Annual Meeting of the Association for Computational Linguistics, 2007.
  11. Mark Craven and Johan Kumlien. 1999. “Constructing Biological Knowledge Bases by Extracting Information from Text Sources,” in ISMB, vol. 1999, pp. 77-86. 1999. (Source of weak supervision term.)
  12. Daniel Gerber and Axel-Cyrille Ngonga Ngomo, 2012. “Extracting Multilingual Natural-Language Patterns for RDF Predicates,” in Knowledge Engineering and Knowledge Management, pp. 87-96. Springer Berlin Heidelberg, 2012. (The idea behind BOA is to extract natural language patterns that represent predicates found on the Data Web from unstructured data by using background knowledge from the Data Web, specifically DBpedia. See further the code or demo.)
  13. Edouard Grave, 2014. “Weakly Supervised Named Entity Classification,” in Workshop on Automated Knowledge Base Construction (AKBC), 2014. (Uses a novel PU (positive and unlabelled) method for weakly supervised named entity classification, based on discriminative clustering.) (Uses a simple string match between the seed list of named entities and unlabeled text from the specialized domain, it is easy to obtain positive examples of named entity mentions.)
  14. Edouard Grave, 2014. “A Convex Relaxation for Weakly Supervised Relation Extraction,” in Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. (Addressed the multiple label/learning problem. Seems to outperform other state-of-the-art extractors, though the author notes in conclusion that kernel methods should also be tried. See other Graves 2014 reference.)
  15. Malcolm W. Greaves, 2014. “Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logic,” PhD dissertation, Carnegie Mellon University, 2014. (Useful literature review and pipeline is one example.)
  16. Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld, 2011. “Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 541-550. Association for Computational Linguistics, 2011. (A novel approach for multi-instance learning with overlapping relations that combines a sentence-level extraction model with a simple, corpus-level component for aggregating the individual facts.) (Uses a self-supervised, relation-specific IE system which learns 5025 relations.) (“Knowledge-based weak supervision, using structured data to heuristically label a training corpus, works towards this goal by enabling the automated learning of a potentially unbounded number of relation extractors.”) (“‘Weak’ or ‘distant’ supervision creates its own training data by heuristically matching the contents of a database to corresponding text.”) (Also introduces MultiR.)
  17. Ander Intxaurrondo, Mihai Surdeanu, Oier Lopez de Lacalle, and Eneko Agirre, 2013. “Removing Noisy Mentions for Distant Supervision,” in Procesamiento del Lenguaje Natural 51 (2013): 41-48. (Suggests filter methods to remove some noisy potential assignments.)
  18. Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S. Weld, 2014. “Type-Aware Distantly Supervised Relation Extraction with Linked Arguments,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1891–1901, October 25-29, 2014, Doha, Qatar. (Investigates four orthogonal improvements to distant supervision: 1) integrating named entity linking (NEL) and 2) coreference resolution into argument identification for training and extraction, 3) enforcing type constraints of linked arguments, and 4) partitioning the model by relation type signature.) (Enhances the MultiR basis; see http://cs.uw.edu/homes/mkoch/re for code and data.)
  19. Yang Liu, Kang Liu, Liheng Xu, and Jun Zhao, 2014. “Exploring Fine-grained Entity Type Constraints for Distantly Supervised Relation Extraction,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2107–2116, Dublin, Ireland, August 23-29 2014. (More fine-grained entities produce better matching results.)
  20. Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek, 2013. “Distant Supervision for Relation Extraction with an Incomplete Knowledge Base,” in HLT-NAACL, pp. 777-782. 2013. (Standard distant supervision does not properly account for the negative training examples.)
  21. Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, 2009. “Distant Supervision for Relation Extraction without Labeled Data,” in Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1003–1011, Suntec, Singapore, 2-7 August 2009. (Because their algorithm is supervised by a database, rather than by labeled text, it does not suffer from the problems of overfitting and domain-dependence that plague supervised systems. First use of the ‘distant supervision’ approach.)
  22. Ndapandula T. Nakashole, 2012. “Automatic Extraction of Facts, Relations, and Entities for Web-Scale Knowledge Base Population,” Ph.D. Dissertation for the University of Saarland, 2012. (Excellent overview and tutorial; introduces the tools Prospera, Patty and PEARL.)
  23. Truc-Vien T. Nguyen and Alessandro Moschitti, 2011. “End-to-end Relation Extraction Using Distant Supervision from External Semantic Repositories,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pp. 277-282. Association for Computational Linguistics, 2011. (Shows standard Wikipedia text can also be a source for relations.)
  24. Marius Paşca, 2007. “Weakly-Supervised Discovery of Named Entities Using Web Search Queries,” in Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pp. 683-690. ACM, 2007.
  25. Marius Paşca, 2009. “Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes Over Conceptual Hierarchies,” in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 639-647. Association for Computational Linguistics, 2009.
  26. Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher D. Manning, and Daniel Jurafsky, 2014. “Event Extraction Using Distant Supervision,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, 2014. (They demonstrate that the SEARN algorithm outperforms a linear-chain CRF and strong baselines with local inference.)
  27. Sebastian Riedel, Limin Yao, and Andrew McCallum, 2010. “Modeling Relations and their Mentions without Labeled Text,” in Machine Learning and Knowledge Discovery in Databases, pp. 148-163. Springer Berlin Heidelberg, 2010. (They use a factor graph to determine if the two entities are related, then apply constraint-driven semi-supervision.)
  28. Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Etzioni, 2013. “Modeling Missing Data in Distant Supervision for Information Extraction,” TACL 1 (2013): 367-378. (Addresses the question of missing data in distant supervision.) (Appears to address many of the initial MultiR issues.)
  29. Benjamin Roth and Dietrich Klakow, 2013. “Combining Generative and Discriminative Model Scores for Distant Supervision,” in EMNLP, pp. 24-29. 2013.(By combining the output of a discriminative at-least-one learner with that of a generative hierarchical topic model to reduce the noise in distant supervision data, the ranking quality of extracted facts is significantly increased and achieves state-of-the-art extraction performance in an end-to-end setting.)
  30. Benjamin Rozenfeld and Ronen Feldman, 2008. “Self-Supervised Relation Extraction from the Web,” in Knowledge and Information Systems 17.1 (2008): 17-33.
  31. Hui Shen, Mika Chen, Razvan Bunescu and Rada Mihalcea, 2014. “Wikipedia Taxonomic Relation Extraction using Wikipedia Distant Supervision.” (Negative examples based on Wikipedia revision history; perhaps problematic. Interesting recipes for sub-graph extractions. Focused on is-a relationship. See also http://florida.cs.ohio.edu/wpgraphdb/.)
  32. Stephen Soderland, Brendan Roof, Bo Qin, Shi Xu, Mausam, and Oren Etzioni, 2010. “Adapting Open Information Extraction to Domain-Specific Relations,” in AI Magazine 31, no. 3 (2010): 93-102. (A bit more popular treatment; no new ground.)
  33. Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning, 2012. “Multi-Instance Multi-Label Learning for Relation Extraction,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 455-465. Association for Computational Linguistics, 2012. (Provides means to find previously unknown relationships using a graph.)
  34. Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa, 2012. “Reducing Wrong Labels in Distant Supervision for Relation Extraction,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 721-729. Association for Computational Linguistics, 2012. (Proposes a method to reduce the incidence of false labels.)
  35. Bilyana Taneva and Gerhard Weikum, 2013. “Gem-based Entity-Knowledge Maintenance,” in Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, pp. 149-158. ACM, 2013. (Methods to create the text snippets — GEMS — that are used to train the system.)
  36. Andreas Vlachos and Stephen Clark, 2014. “Application-Driven Relation Extraction with Limited Distant Supervision,” in COLING 2014 (2014): 1. (Uses the DAgger learning algorithm.)
  37. Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman, 2013. “Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction,” in ACL (2), pp. 665-670. 2013. (Addresses the problem of false negative training examples mislabeled due to the incompleteness of knowledge bases.)
  38. Wei Xu, Ralph Grishman, and Le Zhao, 2011. “Passage Retrieval for Information Extraction using Distant Supervision,” in Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1046–1054, Chiang Mai, Thailand, November 8 – 13, 2011. (Filtering of candidate passages improves quality.)
  39. Y. Yan, N. Okazaki, Y. Matsuo, Z. Yang, M. Ishizuka, 2009. “Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009.
  40. Xingxing Zhang, Jianwen Zhang, Junyu Zeng, Jun Yan, Zheng Chen, and Zhifang Sui, 2013. “Towards Accurate Distant Supervision for Relational Facts Extraction,” in ACL (2), pp. 810-815. 2013. (Three factors on how to improve the accuracy of distant supervision.)

The Era of Openness

AI3:::Adaptive Information (Mike Bergman) - Mon, 01/12/2015 - 15:35
The Internet Has Catalyzed Trends that are Creative, Destructive and Transformative

Something very broad and profound has been happening over the recent past. It is not something that can be tied to a single year or a single event. It is also something that is quite complex in that it is a matrix of forces, some causative and some derivative, all of which tend to reinforce one another to perpetuate the trend. The trend that I am referring to is openness, and it is a force that is both creative and destructive, and one that in retrospect is also inevitable given the forces and changes underlying it.

It is hard to gauge exactly when the blossoming of openness began, but by my lights the timing corresponds to the emergence of open source and the Internet. Early bulletin board systems (BBS) often were distributed with source code, and these systems foreshadowed the growth of the Internet. While the Internet itself may be dated to ARPA efforts from 1969, it is really more the development of the Web around 1991 that signaled the real growth of the medium.

Over the past quarter century, the written use of the term “open” has increased more than 40% in frequency in comparison to terms such as “near” or “close” [1], a pretty remarkable change in usage for more-or-less common terms, as the accompanying figure shows.

Though the idea of “openness” is less common than “open”, its change in written use has been even more spectacular, with its frequency more than doubling (112%) over the past 25 years. The change in growth slope appears to coincide with the mid-1980s.

Because “openness” is more of a mindset or force — a point of view, if you will — it is not itself a discrete thing, but an idea or concept. In contemplating this world of openness, we can see quite a few separate, yet sometimes related, strands that provide the weave of the “openness” definition [2]:

  • Open source — refers to a computer program in which the source code is available to the general public for use and/or modification from its original design. Open-source code is typically a collaborative effort where programmers improve upon the source code and share the changes within the community so that other members can help improve it further
  • Open standards — are standards and protocols that are fully defined and available for use without royalties or restrictions; open standards are often developed in a public, collaborative manner that enables stakeholders to suggest and modify features, with adoption generally subject to some open governance procedures
  • Open content — is a creative work, generally based on text, that others can copy or modify; open access publications are a special form of open content that provide unrestricted online access to peer-reviewed scholarly research
  • Open data — is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control; open data is a special form of open content
  • Open knowledge — is what open data becomes when it is useful, usable and used; according to the Open Knowledge Foundation, the key features of openness are availability and access wherein the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the Internet
  • Open knowledge bases — are open knowledge packaged in knowledge-base form
  • Open access to communications — refers to non-discriminatory means of accessing communications networks; because of this openness of access, additional features might emerge, including the idea of crowdsourcing (obtaining content, services or ideas from a large group of people), with such major variants as citizen science or crowdfunding (raising funds from a large group of people)
  • Open rights — are an umbrella term to cover the ability to obtain content or data without copyright restrictions and gaining use and access to software or intellectual property via open licenses
  • Open logics — are the use of logical constructs, such as the open world assumption, which enable data and information to be added to existing systems without the need to re-architect the underlying data schema; such logics are important to knowledge management and the continuous addition of new information
  • Open architectures — are means to access existing software and platforms via such means as open APIs (application programming interfaces), open formats (published specifications for digital data) or open Web services
  • Open government — is a governing doctrine that holds citizens have the right to access the documents and proceedings of the government to allow for effective public oversight; it is generally accompanied by means for online access to government data and information
  • Open education — is an institutional practice or programmatic initiative that broadens access to the learning and training traditionally offered through formal education systems, generally to educational materials, curricula or course notes at low or no cost without copyright limitations
  • Open design — is the development of physical products, machines and systems through use of publicly shared design information, often via online collaboration
  • Open research — makes the methodology and results of research freely available via the Internet, and often invites online collaboration; if the research is scientific in nature, it is frequently referred to as open science, and
  • Open innovation — is the use and combination of open and public sources of ideas and innovations with those internal to the organization.

In looking at the factors above, we can ask two formative questions. First, is the given item primarily a causative factor for “openness” or something that has derived from a more “open” environment? And, second, does the factor have an overall high or low impact on the question of openness? Here is my own plotting of these factors against these two dimensions.

Early expressions of the “openness” idea help cause the conditions that lead to openness in other areas. As those areas also become more open, a positive reinforcement is passed back to earlier open factors, all leading to a virtuous circle of increased openness. Though perhaps not strictly “open,” other various and related factors such as the democratization of knowledge, broader access to goods and services, more competition, “long tail” access and phenomena, and, in truly open environments, more diversity and more participation, also could be plotted on this matrix.

Once viewed through the umbrella lens of “openness”, it starts to become clear that all of these various “open” aspects are totally remaking information technology and human interaction and commerce. The impacts on social norms and power and governance are just as profound. Though many innovations have uniquely shaped the course of human history — from literacy to mobility to communication to electrification or computerization — none appear to have matched the speed of penetration nor the impact of “openness”.

Separating the Chicken from the Egg

So, what is driving this phenomenon? From where did the concept of “openness” arise?

Actually, this same matrix helps us hypothesize one foundational story. Look at the question of what is causative and what might be its source. The conclusion appears to be the Internet, specifically the Web, as reinforced and enabled by open-source software.

Relatively open access to an environment of connectivity guided by standard ways to connect and contribute began to fuel still further connections and contributions. The positive values of access and connectivity via standard means, in turn, reinforced the understood value of “openness”, leading to still further connections and engagement. More openness is like the dropped sand grain that causes the entire sand dune to shift.

The Web with its open access and standards has become the magnet for open content and data, all working to promote derivative and reinforcing factors in open knowledge, education and government.

The engine of “openness” tends to reinforce the causative factors that created “openness” in the first place. More knowledge and open aspects of collaboration lead to still further content and standards that lead to further open derivatives. In this manner “openness” becomes a kind of engine that promotes further openness and innovation.

There is a kind of open logic (largely premised on the open world assumption) that lies at the heart of this engine. Since new connections and new items are constantly arising and fueling the openness engine, new understandings are constantly being bolted on to the original starting understandings. This accretive model of growth and development is similar to the layered deposition of pearls or the growth of crystals. The structures grow according to the factors governing the network effect [3], and the nature of the connected growth structures may be represented and modeled as graphs. “Openness” appears to be a natural force underlying the emerging age of graphs [4].

Openness is Both Creative and Destructive . . .

“Openness”, like the dynamism of capitalism, is both creative and destructive [5]. The effects are creative — actually transformative — because of the new means of collaboration that arise based on the new connections between new understandings or facts. “Open” graphs create entirely new understandings as well as provide a scaffolding for still further insights. The fire created from new understandings pulls in new understandings and contributions, all sucking in still more oxygen to keep the innovation cycle burning.

But the creative fire of openness is also destructive. Proprietary software, excessive software rents, silo’ed and stovepiped information stores, and much else are being consumed and destroyed in the wake of openness. Older business models — indeed, existing suppliers — are in the path of this open conflagration. Private and “closed” solutions are being swept before the openness firestorm. The massive storehouse of legacy kindling appears likely to fuel the openness flames for some time to come.

“Openness” becomes a form of adaptive life, changing the nature, value and dynamics of information and who has access to it. Though much of the old economy is — and, will be — swept away in this destructive fire, new and more fecund growth is replacing it. From the viewpoint of the practitioner on the ground, I have not seen a more fertile innovation environment in information technology in more than thirty years of experience.

. . . and Seemingly Inevitable

Once the proper conditions for “openness” were in place, it now seems inevitable that today’s open circumstances would unfold. The Internet, with its (generally) open access and standards, was a natural magnet to attract and promote open-source software and content. A hands-off, unregulated environment has allowed the Internet to innovate, grow, and adapt at an unbelievable rate. So much unconnected dry kindling exists to stoke the openness fire for some time to come.

Of course, coercive state regimes can control the Internet to varying degrees and have limited innovation in those circumstances. Also, any change to more “closed” and less “open” an Internet may also act over time to starve the openness fire. Examples of such means to slow openness include imposing Internet regulation, limiting access (technically, economically or by fiat), moving away from open standards, or limiting access to content. Any of these steps would starve the innovation fire of oxygen.

Adapting to the Era of Openness

The forces impelling openness are strong, but these observations certainly provide no proof of cause and effect. The correspondence of “openness” to the Internet and open source may simply be coincidence, though my sense suggests a causative role is likely. Further, these forces are sweeping before them much in the way of past business practices and proprietary methods.

In all of these regards “openness” is a woven cord of forces changing the very nature and scope of information available to humanity. “Openness”, which has heretofore largely lurked in the background as some unseen force, now emerges as a criterion by which to judge the wisdom of various choices. “Open” appears to contribute more and be better aligned with current forces. Business models based on proprietary methods or closed information generally are on the losing side of history.

For these forces to remain strong and continue to contribute material benefits, the Internet and its content in all manifestations needs to remain unregulated, open and generally free. The spirit of “open” remains just that, and dependent on open and equal access and rights to the Internet and content.

[1] The data is from Google book trends data based on this query (inspect the resulting page source to obtain the actual data); the years 2009 to 2014 were projected based on prior actuals to 1980; percentage term occurrences were converted to term frequencies by 1/n. [2] All links and definitions in this section were derived from Wikipedia. [3] See M.K. Bergman, 2014. “The Value of Connecting Things – Part I: A Foundation Based on the Network Effect,” AI3:::Adaptive Information blog, September 2, 2014. [4] See M.K. Bergman, 2012. “The Age of the Graph,” AI3:::Adaptive Information blog, August 12, 2012; and  John Edward Terrell, Termeh Shafie and Mark Golitko, 2014. “How Networks Are Revolutionizing Scientific (and Maybe Human) Thought,” Scientific American, December 12, 2014. [5] Creative destruction is a term from the economist Joseph Schumpeter that describes the process of industrial change from within whereby old processes are incessantly destroyed and replaced by new ones, leading to a constant change of economic firms that are winners and losers.