One of the central engines behind artificial intelligence is machine learning. ML involves various ways that data is used to train or teach machines to classify, predict or perform complicated tasks, such as I captured in an earlier diagram. The methods used in machine learning may be statistical, rule-based, or based on recognizing or discovering patterns.
The name machine learning raises the question: to learn what? In the context of images, audio, video or sensory perception, machine learning is trained for the recognition of patterns, which can be layered into learning manifolds called deep learning. In my realm (that is, knowledge bases and semantics), machine learning can be applied to topic or entity clustering or classification; entity, attribute or relation identification and extraction; disambiguation; mapping and linking multiple sources; language translation; duplicates removal; reasoning; semantic relatedness; phrase identification; recommendation systems; and question answering. Significant results can be obtained in these areas without the need for deep learning, though deep learning can be, and is being, usefully applied in areas like machine translation or artificial writing.
Machine learning can be either supervised or unsupervised. In supervised learning, positive and (often) negative training examples are presented to the learning algorithm in order to create a model to produce the desired results for the given context. No training examples are presented in unsupervised learning; rather, the model is derived from patterns discovered in the absence of training examples, sometimes described as finding hidden patterns in unlabeled data. Supervised methods are generally more accurate than unsupervised methods, and nearly universally so in the realm of content information and knowledge.
There is effort and expense associated with creating positive or negative training examples (sets). This effort ranges from training sets constructed entirely by hand, the maximum, to ones that are semi-automatic (semi-supervised) or informed by knowledge bases (weakly supervised or distantly supervised). Creation of manual training sets may consume as much as 80% of overall efforts in some cases, and is always a time-consuming task whenever employed. The accuracy of the eventual models is only as good as the trueness of the input training sets, with traditionally the best results coming from manually determined training sets; the best of those are known as “gold standards.” The field of machine learning is thus broad and multiple methods span these spectra of effort and accuracy.
The Spectrum of Machine Learning
To date, the state-of-the-art in machine learning for natural language processing and semantics, my realm, has been in distant supervision using knowledge bases like Freebase or Wikipedia to extract training sets for supervised learning. Relatively clean positive and negative training sets may be created with much reduced effort over manually created ones. This is the current “sweet spot” in the accuracy v. effort trade-off for machine learning in my realm.
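As a minimal sketch of the distant supervision idea, assume a toy knowledge base of (subject, relation, object) facts and a couple of sentences. All names and data below are hypothetical and are not drawn from Freebase or Wikipedia. Sentences that mention both entities of a known fact become positive examples; entity pairs with no such fact become negatives:

```python
from itertools import permutations

# Hypothetical toy knowledge base: (subject, relation, object) facts.
KB_FACTS = {
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
}
ENTITIES = {"Paris", "France", "Berlin", "Germany"}

def label_sentences(sentences, relation="capital_of"):
    """Label entity pairs found in each sentence using the KB instead of a human."""
    related = {(s, o) for s, r, o in KB_FACTS if r == relation}
    positives, negatives = [], []
    for sentence in sentences:
        # Naive entity spotting by substring match (an assumption for brevity).
        mentioned = sorted(e for e in ENTITIES if e in sentence)
        for pair in permutations(mentioned, 2):
            target = positives if pair in related else negatives
            target.append((sentence, pair))
    return positives, negatives

sentences = [
    "Paris is the largest city in France.",
    "Berlin hosted a delegation from France.",
]
positives, negatives = label_sentences(sentences)
```

Note that the first sentence is labeled positive for `capital_of` even though it does not actually state that relation. That kind of labeling noise is the price paid for avoiding manual annotation.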
However, as employed to date, distant supervision has mostly been a case-by-case, problem-by-problem approach, and most often applied to entity or relation extraction. Yes, knowledge bases may be inspected and manipulated to create the positive and negative training examples needed, but this effort has heretofore not been systematic in approach nor purposefully applied across a range of ML applications. How to structure and use knowledge bases across a range of machine learning applications with maximum accuracy and minimum effort, what we call knowledge supervision, is the focus of this article. The methods of knowledge supervision are how we make operational the objectives of knowledge-based artificial intelligence. This article is thus one of the foundations to my recent series on KBAI.
Features and Training Sets
Features and training sets, in relation to the specific machine learning approaches that are applied, are the major determinants to how successful the learning actually is. We already touched upon the trade-offs in effort and accuracy associated with training sets, and will provide further detail on this question below. But features also pose trade-offs and require similar skill in selection and use.
In machine learning, a feature is a measurable property of the system being analyzed. A feature is equivalent to what is known as an explanatory variable in statistics. A feature, stated another way, is a measurable aspect of the system that provides predictive value for the state of the system.
Features with high explanatory power independent of other features are favored, because each added feature adds a computational cost of some manner. Many features are correlated with one another; in these cases it is helpful to find the strongest signals and exclude the other correlates. Too many features also make tuning and refinement more difficult, what has sometimes been called the curse of dimensionality. Overfitting is also often a problem, which limits the ability of the model to generalize to other data.
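One simple way to "find the strongest signals and exclude the other correlates" is a greedy correlation filter. The sketch below is only an illustration under assumed data; the feature names, the target and the 0.9 threshold are all hypothetical, and real projects would more likely rely on an established feature selection library:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def prune_correlated(features, target, threshold=0.9):
    """Keep the strongest feature from each group of highly correlated features."""
    # Rank features by absolute correlation with the target, strongest first.
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        # Keep a feature only if it is not a near-duplicate of one already kept.
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: two height columns carry the same signal in different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 38, 44, 41, 43],
}
target = [50, 58, 69, 77, 92]
kept = prune_correlated(features, target)
```

Only one of the two height columns survives, alongside the weaker but independent `shoe_size` feature.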
Yet too few features and there is inadequate explanatory power to achieve the analysis objectives.
Though it is hard to find discussion of best practices in feature extraction, striking this balance is an art. Multiple learners might also be needed in order to capture the smallest, independent (non-correlated) feature set with the highest explanatory power.
When knowledge bases are used in distant supervision, only a portion of their structure or content is used as features. Still other distant supervision efforts may be geared to other needs and use a different set of features. Indeed, broadly considered, knowledge bases (potentially) have a rich diversity of possible features arising from:
An understanding of the feature potential of knowledge bases is the first mindset in moving toward more purposeful knowledge supervision. At Structured Dynamics we stage the structured information as RDF triples and OWL ontologies, which we can select and manipulate via APIs and SPARQL. We also stage the graph structure and text in Lucene, which gives us powerful faceted search and other advanced NLP manipulations and analyses. These same features may also be utilized to extend the feature set available from the knowledge base through such actions as new entity, attribute, or relation extractions; fine-grained entity typing; creation of word vectors or tensors; results of graph analytics; forward or backward chaining; efficient processing structures; etc.
Because all features are selectable via either structured SPARQL query or faceted search, it is also possible to more automatically extract positive and negative training sets. Attention to proper coverage and testing of disjointedness assertions is another purposeful step useful to knowledge supervision, since it aids identification of negative examples for the training.
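To illustrate how disjointedness assertions help identify negative examples, here is a minimal sketch over an in-memory stand-in for a knowledge base. The entities, types and disjointness pairs are hypothetical, and a real system would select them via SPARQL or faceted search rather than Python dictionaries:

```python
# Hypothetical toy knowledge base: entity -> asserted types.
ENTITY_TYPES = {
    "Marie_Curie": {"Person", "Scientist"},
    "Alan_Turing": {"Person", "Scientist"},
    "Mount_Everest": {"Mountain"},
    "Amazon_River": {"River"},
    "Unknown_Thing": set(),  # an incompletely typed entity
}

# Hypothetical disjointness assertions between types.
DISJOINT = {
    frozenset({"Person", "Mountain"}),
    frozenset({"Person", "River"}),
}

def training_sets(target_type):
    """Split entities into positives and provable negatives for one type."""
    positives, negatives = [], []
    for entity, types in sorted(ENTITY_TYPES.items()):
        if target_type in types:
            positives.append(entity)
        elif any(frozenset({target_type, t}) in DISJOINT for t in types):
            negatives.append(entity)
        # Otherwise the entity is not provably negative, so it joins neither set.
    return positives, negatives

positives, negatives = training_sets("Person")
```

The untyped entity is deliberately left out of both sets: without a disjointness assertion there is no basis for calling it a negative example.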
Whatever the combination of ML method, feature set, or positive or negative training sets, the ultimate precision and accuracy of knowledge supervision requires the utmost degree of true results in both positive and negative training sets. Training to inaccurate information merely perpetuates inaccurate information. As anyone who has worked extensively with source knowledge bases such as Freebase, DBpedia or Wikipedia may attest, assignment errors and incomplete typing and characterizations are all too common. Further, none provide disjointedness assertions.
Thus, the system should be self-learning with results so characterized as to be fed automatically to further testing and refinement. As better quality and more features are added to the system, we produce what we have shown before as the virtuous circle of KBAI:
Features and training sets may thus be based on the syntax, morphology, semantics (meaning of the data) or relationships (connections) of the source data in the knowledge base. Continuous testing and the application of the system’s own machine learners creates a virtuous feedback where the accuracy of the overall system is constantly and incrementally improved.
Manifest Applications for Knowledge Supervision
The artificial intelligence applications to which knowledge supervision may be applied are manifest. Here is a brief listing of some of those areas as evidenced by distant supervision applied to machine learning in academic research, or others not yet exploited:
These areas are listed in rough order from the simpler to the more complex analyses. Most distant supervision efforts to date have concentrated on information extraction, the first items shown on the list. But all of these are amenable to knowledge supervision with ML. Since 2009, many of the insights regarding these potentials have arisen from the Knowledge Base Population initiative of the Text Analysis Conference.
Mapping and linkage are essential areas on this list since they add greatly to the available feature set and provide the bases for greater information interoperability. This is the current emphasis of Structured Dynamics.
Knowledge Supervision is Purposeful and Systematic
Knowledge supervision is the purposeful structuring and use of knowledge bases to provide features and training sets for multiple kinds of machine learners, which in combination can be applied to multiple artificial intelligence outcomes. Knowledge supervision is the method by which knowledge-based artificial intelligence, or KBAI, is achieved.
None of this is free, of course. Much purposeful work is necessary to configure and stage the data structures and systems that support the broad application of knowledge supervision. And other questions and challenges related to KBAI also remain. Yet, as Pedro Domingos has stated: “And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones.”
Having the mindset and applying the methods of knowledge supervision produces an efficient, repeatable, improvable infrastructure for active learning about the enterprise’s information assets.
As noted, we are just at the beginnings of knowledge supervision, and best practices and guidelines are still in the formative stages. We also have open questions and challenges in how features can be effectively selected; how KB-trained classifiers can be adapted to the wild; how we can best select and combine existing machine learners to provide an ML infrastructure; where and how deep learning should most effectively be applied; and how other emerging insights in computational linguistics can be combined with knowledge supervision.
But we can already see that a purposeful mindset coupled with appropriate metadata and structured RDF data is a necessary grounding to the system. We can see broad patterns across the areas of information extraction involving concepts, entities, relations, attributes and events that can share infrastructure and methods. We realize that linkage and mapping are key enabling portions of the system. The need for continuous improvement and codification of self-learning are the means by which our systems will get more accurate.
So, with the what of knowledge-based artificial intelligence, we can now add some broad understandings of the how based on knowledge supervision. None of these ideas are unique or new unto themselves. But the central role of knowledge bases in KBAI and knowledge supervision represents an important advance of artificial intelligence to deal with real-world challenges in content and information.
Distant supervision was earlier or alternatively called self-supervision, indirect supervision or weakly-supervised. It is a method to use knowledge bases to label entities automatically in text, which is then used to extract features and train a machine learning classifier. The knowledge bases provide coherent positive training examples and avoid the high cost and effort of manual labelling. The method is generally more effective than unsupervised learning, though with similar reduced upfront effort. Large knowledge bases such as Wikipedia or Freebase are often used as the KB basis. The first acknowledged use of distant supervision was Craven and Kumlien in 1999 (Mark Craven and Johan Kumlien, 1999. “Constructing Biological Knowledge Bases by Extracting Information from Text Sources,” in ISMB, vol. 1999, pp. 77-86; source of weak supervision term); the first use of the formal term distant supervision was in Mintz et al. in 2009 (Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, 2009. “Distant Supervision for Relation Extraction without Labeled Data,” in Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1003–1011, Suntec, Singapore, 2-7 August 2009). Since then, the field has been a very active area of research; see next reference.
See M. K. Bergman, 2015. “Forty Seminal Distant Supervision Articles,” from AI3:::Adaptive Information blog, November 17, 2014, as supplemented by .
See M. K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” from AI3:::Adaptive Information blog, November 17, 2014.
Pedro Domingos, 2012. “A Few Useful Things to Know About Machine Learning.” Communications of the ACM 55, no. 10 (2012): 78-87.
There is a rich literature providing guidance on feature selection and feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the available features. It is also possible to apply methods, the best known and simplest being principal component analysis, among many, to reduce feature size (dimensionality) with acceptable loss in accuracy.
As a good introduction and overview, see Xiao Ling and Daniel S. Weld, 2012. “Fine-Grained Entity Recognition,” in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012. You can also search on the topic in Google Scholar.
TAC is organized by the National Institute of Standards and Technology (NIST). Initiated in 2008, TAC grew out of NIST’s Document Understanding Conference (DUC) for text summarization, and the Question Answering Track of the Text Retrieval Conference (TREC). TAC is overseen by representatives from government, industry, and academia. The Knowledge Base Population tracks of TAC were started in 2009 and continue to today.
See, for example, Percy Liang and Christopher Potts, 2015. “Bringing Machine Learning and Compositional Semantics Together.” Annu. Rev. Linguist. 1, no. 1 (2015): 355-376.
Oh, it was probably two or three years ago that one of our clients asked us to look into single-source authoring, or more broadly what has come to be known as COPE (create once, publish everywhere), as made prominent by Daniel Jacobson of NPR, now Netflix. We also looked closely at the question of formats and workflows that might increase efficiencies or lower costs in the quest to grab and publish content.
Then, of course, about the same time, it was becoming apparent that standard desktop and laptop screens were being augmented with smartphones and tablets. Smaller screen aspects require a different interface layout and interaction; but, writing for specific devices was a losing proposition. Responsive Web design and grid layout templates that could bridge different device aspects have now come to the fore.
Though it has been true for some time that different publishing venues — from the Web to paper documents or PDFs — have posed a challenge, these other requirements point to a broader imperative. I have intuitively felt there is a consistent thread at the core of these emerging device, use and publishing demands, but the common element has heretofore eluded me.
For years (decades, actually) I have been focused on the idea of data interoperability. My first quest was to find a model that could integrate text stories and documents with structured data from conventional databases and spreadsheets. My next quest was to find a framework that could relate context and meaning across multiple perspectives and world views. Though it took a while, and only began to really take shape about a decade ago, I came to focus on RDF and general semantic Web principles for providing this model.
Data integration through open, semantic Web standards has been a real beacon for how I have pursued this quest. The ideal of being able to relate disparate information from multiple sources and viewpoints to each other has been a driving motivation in my professional interests. In analyzing the benefits of a more connected world of information I could see efficiencies, reduced costs, more global understandings, and insights from previously hidden connections.
Yet here is the funny thing. I began to realize that other drivers for how to improve knowledge worker efficiencies or to deploy results to different devices and venues share the same justifications as data integration. Might there not be some common bases and logic underlying the interoperability imperative? Is not data interoperability but a part of a broader mindset? Are there some universal principles to derive from an inspection of interoperability, broadly construed?
In this article I try to follow these questions to some logical ends. This investigation raises questions and tests from the global (that is, information interoperability) to the local and practical in terms of notions such as create once, use everywhere, and have it staged for relating and interoperability. I think we see that the same motivators and arguments for relating information apply to the efficient ways to organize and publish that information. I think we also see that the idea of interoperability is systemic. Fortunately, meaningful interoperability can be achieved across-the-board today with application of the right mindsets and approaches. Below, I also try to set the predicates for how these benefits might be realized by exploring some first principles of interoperability.
What is Interoperability?
So, what is interoperability and why is it important?
So-called enterprise information integration and interoperability seem to sprout from the same basic reality. Information gets created and codified across multiple organizations, formats, storage systems and locations. Each source of this information gets created with its own scope, perspective, language, characteristics and world view. Even in the same organization, information gets generated and characterized according to its local circumstances.
In the wild, and even within single organizations, information gets captured, represented, and characterized according to multiple formats and viewpoints. Without bridges between sources that make explicit the differences in format and interpretation, we end up with what — in fact — is today’s reality of information stovepipes. The reality of our digital information being in isolated silos and moats results in duplicate efforts, inefficiencies, and lost understandings. Despite all of the years and resources thrown at information generation, use and consumption, our digital assets are unexploited to a shocking extent. The overarching cause for this dereliction of fiscal stewardship is the lack of interoperability.
By the idea of interoperability we are getting at the concept of working together. Together means things are connected in some manner. Working means we can mesh the information across sources to do more things, or do them better or more cheaply. Interoperability does not necessarily imply integration, since our sources can reside in distributed locations and formats. What is important is not the physical location — or, indeed, even format and representations — but that we have bridges across sources that enable the source information to work together.
In working backwards from this observation, then, we need certain capabilities to fulfill these interoperability objectives. We need to be able to ingest multiple encodings, serializations and formats. Because we need to work with this information, and tools for doing so are diverse, we also need the ability to export information in multiple encodings, serializations and formats. Human circumstance means we need to ingest and encode this information in multiple human languages. Some of our information is more structured, and describes relationships between things or the attributes or characterizations of particular types of things. Since all of this source information has context and provenance, we need to capture these aspects as well in order to ascertain the meaning and trustworthiness of the information.
This set of requirements is a lot of work, which can most efficiently be done against one or a few canonical representations of the input information. From a data integration perspective, then, the core system to support, store and manage this information should be based on only a few central data representations and models, with many connectors for ingesting native information in the wild and tools to support the core representations:
In our approach at Structured Dynamics we have chosen the Resource Description Framework (RDF) as the structured data model at the core of the system, supported by the Lucene text engine for full-text search and efficient facet searching. Because all of the information is given unique Web identifiers (URIs), and the whole system resides on the Web accessible via the HTTP protocol, our information may reside anywhere the Internet connects.
This gives us a data model and a uniform way to represent the input data across structured, semi-structured and unstructured sources. Further, we have a structure that can capture the relations or attributes (including metadata and provenance) of the input information. However, one more step is required to achieve data interoperability: an understanding of the context and meaning of the source information.
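The hub-and-connector idea can be sketched in a few lines: several ingest connectors convert native formats into one canonical triple representation, and an exporter serializes from that same core. This is only an illustration under stated assumptions. The `http://example.org/` namespace, the function names and the sample records are hypothetical, every object is treated as a plain literal, and no escaping is performed:

```python
import csv
import io
import json

BASE = "http://example.org/"  # hypothetical URI namespace

def from_csv(text, id_column):
    """Connector: CSV rows -> (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row[id_column]
        for key, value in row.items():
            if key != id_column:
                triples.append((subject, BASE + key, value))
    return triples

def from_json(text, id_key):
    """Connector: a JSON list of objects -> triples."""
    triples = []
    for record in json.loads(text):
        subject = BASE + str(record[id_key])
        for key, value in record.items():
            if key != id_key:
                triples.append((subject, BASE + key, str(value)))
    return triples

def to_ntriples(triples):
    """Exporter: simplified N-Triples-style lines with literal objects."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)

# Two sources in different formats describe the same resource.
store = from_csv("id,name\nacme,Acme Corp\n", "id")
store += from_json('[{"id": "acme", "employees": 120}]', "id")
print(to_ntriples(store))
```

Because both connectors mint the same URI for `acme`, facts from the CSV and JSON sources end up attached to one subject, which is the practical payoff of a canonical core.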
To achieve the next layer in the data interoperability pyramid it is thus necessary to employ semantic technologies. The structure of the RDF data model has an inherent expressiveness to capture meaning and context. To this foundation we must add a coherent view of the concepts and entity types in our domain of interest, which also enables us to capture the entities within this system and their characteristics and relationships to other entities and concepts. These properties applied to the classes and instances in our domain of interest can be expressed as a knowledge graph, which provides the logical schema and inferential framework for our domain. This stack of semantic building blocks gets formally expressed as ontologies (the technical term for a working graph) that should putatively provide a coherent representation of the domain at hand.
We can visualize this semantic stack as follows:
We have been using the spoke-and-hub diagram above for data flows for some years and have used the semantic stack representation before, too. I believe in my bones the importance of data interoperability to competitive advantage for enterprises, and therefore its business worth as a focus of my company’s technology. But, once so considered, some more fundamental questions emerge. What makes data interoperability a worthwhile objective? Can an understanding of those objectives bring us more fundamental understandings of fundamental benefits? Does a grounding in more fundamental benefits suggest any change in our development priorities?
Drivers of Interoperability
I think we can boil the drivers of interoperability down to four. These are:
To be sure, data interoperability is focused on insight. But data interoperability also brings efficiency and cost reductions. As we add other aspects of interoperability (say, responsive design for mobile) we may see comparatively fewer benefits in insight, but more in efficiency, cost, and, even, capture. Anything done to increase benefits from any of these drivers contributes to the net benefits and rationale for interoperability.
Principles of Interoperability
The general goodness arising from interoperability suggests it is important to understand the first principles underlying the concept. By understanding these principles, we can also tease out the fundamental areas deserving attention and improvement in our interoperability developments and efforts. These principles help us cut through the crap in order to see what is important and deserves attention.
I think the first of the first principles for interoperability is reusability. Once we have put the effort into the creation of new valuable data or content, we want to be able to use and apply that knowledge in all applicable venues. Some of this reuse might be in chunking or splitting the source information into parts that can be used and deployed for many purposes. Some of this reuse might be in repurposing the source data and content for different presentations, expressions or devices. These considerations imply the importance of storing, characterizing, structuring and retrieving information in one or a few canonical ways.
Interoperable content and forms should also aspire to an ideal of “onceness”. The ideal is that the efforts to gather, create or analyze information be done as few times as possible. This ideal clearly ties into the principle of reusability because that must be in place to minimize duplication and overlooking what exists. The reason to focus on onceness is that it forces an explication of the workflows and bottlenecks inherent to our current work practices. These are critical areas to attack since, unattended, such inefficiencies provide the “death by a thousand cuts” to interoperability. Onceness is at the center of such compelling ideas as COPE and the role of APIs in a flexible architecture (see below) to promote interoperability.
A respect for workflows is also a first principle, expressed in two different ways. The first way is that existing workflows cannot be unduly disrupted when introducing interoperability improvements. While workflows can be improved or streamlined over time (and should), initial introduction and acceptance of new tools and practices must fit with existing ways of doing tasks in order to see adoption. Jarring changes to existing work practices are mostly resisted. The second way that workflows are a first principle is in the importance of being aware of, explicitly modeling, and then codifying how we do tasks. This becomes the “language” of our work, and helps define the tooling points or points of interaction as we merge activities from multiple disciplines in our domain. These workflow understandings also help us identify useful points for APIs in our overall interoperability architecture.
These considerations provide the rationale for assigning metadata that characterize our information objects and structure, based on controlled vocabularies and relationships as established by domain and administrative ontologies. In the broadest interoperability perspective, these vocabularies and the tagging of information objects with them are a first principle for ensuring how we can find and transition states of information. These vocabularies need not be complex or elaborate, but they need to be constant and consistent across the entire content lifecycle. There are backbone aspects to these vocabularies that capture the overall information workflow, as well as very specific steps for individual tasks. As a complement to such administrative ontologies, domain ontologies provide the context and meaning (semantics) for what our information is about.
The common grounding of data model and semantics means we can connect our sources of information. The properties that define the relationships between things determine the structure of our knowledge graph. Seeking commonalities for how our information sources relate to one another helps provide a coherent graph for drawing inferences. How we describe our entities with attributes provides a second type of property. Attribute profiles are also a good signal for testing entity relatedness. Properties — either relations or attributes — provide another filter to draw insight from available information.
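As one illustration of attribute profiles as a relatedness signal, the sketch below compares the sets of attributes used to describe each entity with Jaccard similarity. The measure is my choice for the example rather than something prescribed above, and the entity profiles are hypothetical:

```python
def jaccard(a, b):
    """Share of attributes two entities have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical attribute profiles: which attributes describe each entity.
PROFILES = {
    "Seattle": {"population", "mayor", "area", "elevation"},
    "Portland": {"population", "mayor", "area", "timezone"},
    "Mozart": {"birthDate", "deathDate", "occupation"},
}

city_score = jaccard(PROFILES["Seattle"], PROFILES["Portland"])
mixed_score = jaccard(PROFILES["Seattle"], PROFILES["Mozart"])
```

The two cities share three of five distinct attributes, while the city and the composer share none, so the profile overlap alone already separates likely same-type entities from unrelated ones.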
If the above sounds like a dynamic and fluid environment, you would be right. Ultimately, interoperability is a knowledge challenge in a technology environment that is rapidly changing. New facts, perspectives, devices and circumstances are constantly arising. For these very reasons an interoperability framework must embrace the open world assumption, wherein the underlying logic structure and its vocabulary and data can be grown and extended at will. We are seeing some breakaway from conventional closed-world thinking of relational databases with NoSQL and graph databases, but a coherent logic based on description logics, such as is found with open standard semantic technologies like RDF and OWL and SPARQL, is even more responsive.
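The practical difference between the two assumptions can be shown with a deliberately simplified sketch. The fact store and function names are hypothetical, and real description logic reasoners do far more than this lookup. Under a closed world, a missing fact is false; under an open world, it is merely unknown until stated or explicitly negated:

```python
# Hypothetical fact store of (subject, predicate, object) statements.
FACTS = {("Alice", "worksFor", "Acme")}

def closed_world(fact):
    """Closed world: anything not stated is treated as false."""
    return fact in FACTS

def open_world(fact, negated=frozenset()):
    """Open world: True if stated, False only if explicitly negated, else None (unknown)."""
    if fact in FACTS:
        return True
    if fact in negated:
        return False
    return None

query = ("Bob", "worksFor", "Acme")
closed_answer = closed_world(query)
open_answer = open_world(query)
```

Because the open-world answer for an unstated fact is "unknown" rather than "false", new facts about Bob can be added later without contradicting anything the system previously concluded.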
Though perhaps not quite at the level of a first principle, I also think interoperability improvements should be easy to use, easy to share, and easy to learn. Tooling is clearly implied in this, but also it is important we be able to develop a language and framing for what constitutes interoperability. We need to be able to talk about and inspect the question of interoperability in order to discover insights and gain efficiencies.
Aspects of Interoperability
The thing about interoperability is that it extends over all aspects of the information lifecycle, from capturing and creating information, to characterizing and vetting it, to analyzing it, or publishing or distributing it. Eventually, information and content already developed becomes input to new plans or requirements. These aspects extend across multiple individuals and departments and even organizations, with portions of the lifecycle governed (or not) by their own set of tools and practices. We can envision this overall interoperability workflow something like the following:
Overall, only pieces of this cycle are represented in most daily workflows. Actually, in daily work, parts of this workflow are much more detailed and involved than what this simplistic overview implies. Editorial review and approvals, or database administration and management, or citation gathering or reference checking, or data cleaning, or ontology creation and management, or ETL activities, or hundreds of other specific tasks, sit astride this general backbone.
Besides showing that interoperability is a systemic activity for any organization (or should be), we can also derive a couple of other insights from this figure. First, we can see that some form of canonical representation and management is central to interoperability. As noted, this need not be a central storage system, but can be distributed using Web identifiers (URIs) and protocols (HTTP). Second, we characterize and tag our information objects using ontologies, both from structural and administrative viewpoints, but also by domain and meaning. Characterizing our information by a common semantics of meaning enables us to combine and analyze our information.
A third insight is that a global schema specific to workflows and information interoperability is the key for linking and combining activities at any point within the cycle. A common vocabulary for stages and interoperability tasks, included as a best practice for our standard tagging efforts, provides the conventions for how batons can get passed between activities at any stage in this cycle. The challenge of making this insight operational is one more of practice and governance than of technology. Inspecting and characterizing our information workflows with a common vocabulary and understanding needs to be a purposeful activity in its own right, backed with appropriate management attention and incentives.
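A common vocabulary for workflow stages can be as small as an enumerated list plus the allowed hand-offs between stages. The sketch below is only one hypothetical way to express that; the stage names, the transition table and the `advance` helper are assumptions for illustration, not a vocabulary defined in this article:

```python
from enum import Enum

class Stage(Enum):
    """Hypothetical controlled vocabulary of workflow stages."""
    CAPTURED = "captured"
    TAGGED = "tagged"
    REVIEWED = "reviewed"
    PUBLISHED = "published"

# Which hand-offs ("baton passes") are allowed from each stage.
ALLOWED = {
    Stage.CAPTURED: {Stage.TAGGED},
    Stage.TAGGED: {Stage.REVIEWED},
    Stage.REVIEWED: {Stage.TAGGED, Stage.PUBLISHED},
    Stage.PUBLISHED: set(),
}

def advance(item, new_stage):
    """Return a copy of the item tagged with its new stage, if the hand-off is allowed."""
    current = item["stage"]
    if new_stage not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new_stage.value}")
    return {**item, "stage": new_stage}

doc = {"id": "doc-1", "stage": Stage.CAPTURED}
doc = advance(doc, Stage.TAGGED)
```

Since every tool reads and writes the same stage tags, any activity in the cycle can tell whether an information object is ready for it, which is the point of the shared vocabulary.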
A final insight is that such a perspective on interoperability is a bit of a fractal. As we get more specific in our workflows and activities, we can apply these same insights in order to help those new, more specific workflows become interoperable. We can learn where to plug into this structure. And, we can learn how our specific activities through the application of explicit metadata and tags with canonical representations can work to interact well with other aspects of the content lifecycle.
Interoperability can be achieved today with the right mindsets and approaches. Fortunately, because of the open world first principle, this challenge can be tackled in an incremental, piecemeal manner. While the overall framework provides guidance for where comprehensive efforts across the organization may go, we can also cleave off only parts of this cycle for immediate attention, following a “pay as you benefit” approach. A global schema and a consistent approach to workflows and information characterizations can help ensure the baton is properly passed as we extend our interoperability guidance to other reaches of the enterprise.

General Architecture and a Sample Path
We can provide a similar high-level view for what an enterprise information architecture supporting interoperability might look like. We can broadly layer this architecture into content acquisition, representation and repository, and content consumption:
Content of all forms (structured, semi-structured and unstructured) is brought into the system and tagged or mapped into the governing domain or administrative schema. Text content is marked up with reduced versions of HTML (such as RASH or Markdown) in order to retain the author’s voice and intent in areas such as emphasis, titles or section headers; the structure of the content is also characterized by patterned areas such as abstracts, body and references. All structured data is characterized according to the RDF data model, with vocabularies as provided by OWL in some cases.
We already have an exemplar repository in the Open Semantic Framework that shows the way (along with other possible riffs on this theme) for how just a few common representations and conventions can work to distribute both schema and information (data) across a potentially distributed network. Further, by not stopping at the water’s edge of data interoperability, we can also embrace further, structural characterization of our content. Adding this wrinkle enables us to efficiently support a variety of venues for content consumption simultaneously.
This architecture is quite consistent with what is known as WOA (for Web-oriented architecture). Like the Internet itself, WOA has the advantage of being scalable and distributed, all (mostly) based on open standards. The interfaces between architectural components are also provided through mostly RESTful application programming interfaces (APIs), which extends interoperability to outside systems and provides flexibility for swapping in new features or functionality as new components or developments arise. Under this design, all components and engines become in effect “black boxes”, with information exchange via standard vocabularies and formats using APIs as the interface for interoperability.

A Global Context for Interoperability
Though data interoperability is a large and central piece, I hope I have demonstrated that interoperability is a much broader and far-reaching concept. We can see that “global interoperability” extends into all aspects of the information lifecycle. By expanding our viewpoint of what constitutes interoperability, we have discovered some more general principles and mindsets that can promise efficiencies, lower costs and greater insights across the enterprise.
An explicit attention to workflows and common vocabularies for those flows and the information objects they govern is a key to a more general understanding of interoperability and the realization of its benefits. Putting this kind of infrastructure in place is also a prerequisite to greater tooling and automation in processing information.
We can already put in place chains of tooling and workflows governed by these common vocabularies and canonical representations to achieve a degree of this interoperability. We do not need to tackle the whole enchilada at once or mount some form of “big bang” initiative. We can start piecemeal, and expand as we benefit. The biggest gaps remain codification of workflows in relation to the overall information lifecycle, and the application of taggers to provide the workflow and structure metadata at each stage in the cycle. Again, these are not matters so much of technology or tooling, but policy and information governance.
What I have outlined here provides the basic scaffolding for how such an infrastructure to promote interoperability may evolve. We know how we do our current tasks; we need to understand and codify those workflows. Then, we need to express our processing of information at any point along the content lifecycle. A number of years back I discussed climbing the data interoperability pyramid. We have made much progress over the past five years and stand ready to take our emphasis on interoperability to the next level.
To be sure there is much additional tooling still needed, mostly in the form of mappers and taggers. But the basic principles, core concepts and backbone tools for supporting greater interoperability are known and relatively easy to put in place. Embracing the mindset and inculcating this process into our general information management routines is the next challenge. Working to obtain the ideal is doable today.

See M. K. Bergman, 2009. “Advantages and Myths of RDF,” from AI3:::Adaptive Information blog, April 8, 2009.
See M. K. Bergman, 2006. “Climbing the Data Federation Pyramid,” from AI3:::Adaptive Information blog, April 8, 2009.
See M. K. Bergman, 2005. “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” from AI3:::Adaptive Information blog, July 20, 2005.
See M. K. Bergman, 2010. “I Have Yet to Metadata I Didn’t Like,” from AI3:::Adaptive Information blog, August 16, 2010.
See M. K. Bergman, 2011. “An Ontologies Architecture for Ontology-driven Apps,” from AI3:::Adaptive Information blog, December 5, 2011.
See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” from AI3:::Adaptive Information blog, December 21, 2009.
Some sources that helped form my thoughts on the information lifecycle include Backbone Media and Piktochart.
See M. K. Bergman, 2010. “‘Pay as You Benefit’: A New Enterprise IT Strategy,” from AI3:::Adaptive Information blog, July 12, 2010.
See Silvio Peroni, 2015. “RASH: Research Articles in Simplified HTML,” March 15, 2015.
Many Markdown options exist for a reduced subset of HTML; one in this vein is Scholarly Markdown.
The Open Semantic Framework has its own Web site (http://opensemanticframework.org/), supported by a wiki of more than 500 supporting technical articles (http://wiki.opensemanticframework.org/index.php/Main_Page).
See M. K. Bergman, 2009. “A Generalized Web-oriented Architecture (WOA) for Structured Data,” from AI3:::Adaptive Information blog, May 3, 2009.
I was pleasantly surprised to discover that a diversity of my writings has been chosen for the syllabus for a text analytics course by Dr. Alianna Maren in Northwestern University’s master’s program in predictive analytics. Dr. Maren has chosen to feature some of my writings in NLP statistics, ontologies, and the open world assumption.
Dr. Maren has stated her intention is to present text analytics from a “top-down ontological perspective.” The syllabus looks very interesting.
I appreciate the recognition and wish the students and Dr. Maren a great course!
Semantics is a funny thing. All professionals come to know that communication with their peers and outside audiences requires accuracy in how to express things. Yet, even with such attentiveness, communications sometimes go awry. It turns out that background, perspective and context can all act to switch circuits at the point of communication. Despite, and probably because of, our predilection as a species to classify and describe things, all from different viewpoints, we can often exhort in earnest a thought that is communicated to others as something different from what we intended. Alas!
This reality is why, I suspect, we have embraced as a species things like dictionaries, thesauri, encyclopedias, specifications, standards, sacred tracts, and such, in order to help codify what our expressions mean in a given context. So, yes, while sometimes there is sloppiness in language and elocution, many misunderstandings between parties are also a result of difference in context and perspective.
It is important when we process information in order to identify relations or extract entities, to type them or classify them, or to fill out their attributes, that we have measures to gauge how well our algorithms and tests work, all attentive to providing adequate context and perspective. These very same measures can also tell us whether our attempts to improve them are working or not. These measures, in turn, also are the keys for establishing effective gold standards and creating positive and negative training sets for machine learning. Still, despite their importance, these measures are not always easy to explain or understand. And, truth is, sometimes these measures may also be mis-explained or mis-calculated. Aiding the understanding of important measures in improving the precision, completeness, and accuracy of communications is my purpose in this article.

Some Basic Statistics as Typically Described
The most common scoring methods for gauging the “accuracy” of natural language communications involve statistical tests based on the nomenclature of negatives and positives, true or false. It can be a bit confusing to interpret these terms, and the confusion is compounded by the kind of statistical environment at play. Let me try to first confuse, and then more simply explain these possible nuances.
Standard science is based on a branch of statistics known as statistical hypothesis testing. This is likely the statistics that you were taught in school. In hypothesis testing, we begin with a hypothesis about what might be going on with respect to a problem or issue, but for which we do not know the cause or truth. After reviewing some observations, we formulate a hypothesis that some factor A is affecting or influencing factor B. We then formulate a mirror-image null hypothesis that specifies that factor A does not affect factor B; this is what we will actually test. The null hypothesis is what we assume the world in our problem context looks like, absent our test. If the test of our formulated hypothesis does not affect that assumed distribution, then we reject our alternative (meaning our initial hypothesis fails, and we keep the null explanation).
We make assumptions from our sample about how the entire population is distributed, which enables us to choose a statistical model that captures the shape of assumed probable results for our measurement sample. These shapes or distributions may be normal (bell-shaped or Gaussian), binomial, power law, or many others. These assumptions about populations and distribution shapes then tell us what kind of statistical test(s) to perform. (Misunderstanding the true shape of the distribution of a population is one of the major sources of error in statistical analysis.) Different tests may also give us more or less statistical power to test the null hypothesis, which is that chance results will match the assumed distribution. Different tests may also give us more than one test statistic to measure variance from the null hypothesis.
We then apply our test and measure and collect our sample from the population, with random or other statistical sampling important so as not to skew results, and compare the distribution of these results to our assumed model and test statistic(s). The null hypothesis is confirmed or not by whether the shape of our sampled results matches the assumed distribution or not. The significance of the variance from the assumed shape, along with a confidence interval based on our sample size and the test at hand, provides the information necessary to either accept or reject the null hypothesis.
Rejection of the null hypothesis generally requires both significant difference from the expected shape in our sample and a high level of confidence. Absent those results, we likely need to accept the null hypothesis, thus rejecting the alternative hypothesis that some factor A is affecting or influencing factor B. Alternatively, with significant differences and a high level of confidence, we can reject the null hypothesis, thereby accepting the alternative hypothesis (our actual starting hypothesis, which prompted the null) that factor A is affecting or influencing factor B.
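As a minimal sketch of this accept-or-reject logic, the Python below runs an exact one-sided binomial test using only the standard library. The sample (16 successes in 20 trials), the null probability of 0.5 and the 0.05 significance threshold are all illustrative assumptions, not values from any study discussed here, and the binomial is only one of the distribution shapes mentioned above.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact one-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def reject_null(successes: int, trials: int,
                p_null: float = 0.5, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value falls below alpha (assumed threshold)."""
    return binomial_p_value(successes, trials, p_null) < alpha


# Hypothetical sample: 16 successes in 20 trials, null hypothesis of pure chance.
p = binomial_p_value(16, 20)
print(f"p-value = {p:.4f}; reject null: {reject_null(16, 20)}")
```

A result of 12 successes in 20 trials would not clear the same threshold, so under these assumptions we would keep the null explanation.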
This is all well and good except for the fact that either the sampling method or our test may be in error. There are two types of errors that are possible: Type I errors, where we reject a null hypothesis that is actually true (a false positive); and Type II errors, where we fail to reject a null hypothesis that is actually false (a false negative).
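We can combine all of these thoughts into what is the standard presentation for capturing these true and false, positive and negative, results:

| Judgment of Null Hypothesis (H0) | H0 is Valid/True | H0 is Invalid/False |
| --- | --- | --- |
| Reject | False Positive (Type I error) | True Positive (correct inference) |
| Fail to reject | True Negative (correct inference) | False Negative (Type II error) |

Clear as mud, huh?

Let’s Apply Some Simplifications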
Fortunately, there are a couple of ways to sharpen this standard story in the context of information retrieval (IR), natural language processing (NLP) and machine learning (ML), the domains of direct interest to us at Structured Dynamics, to make understanding all of this much simpler. Statistical tests will always involve a trade-off between the level of false positives (in which a non-match is declared to be a match) and the level of false negatives (in which an actual match is not detected). Let’s see if we can simplify our recognition and understanding of these conditions.
First, let’s start with a recent explanation from the KDNuggets Web site:
“Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong:
The use of ‘case’ and ‘predictions’ helps, but is still a bit confusing. Let’s hear another explanation from Benjamin Roth from his recently completed thesis:
“There are two error cases when extracting training data: false positive and false negative errors. A false positive match is produced if a sentence contains an entity pair for which a relation holds according to the knowledge base, but for which the sentence does not express the relation. The sentence is marked as a positive training example for the relation, however it does not contain a valid signal for it. False positives introduce errors in the training data from which the relational model is to be generalized. For most models false positive errors are the most critical error type, for qualitative and quantitative reasons, as will be explained in the following.
“A false negative error can occur if a sentence and argument pair is marked as a negative training example for a relation (the knowledge base does not contain the argument pair for that relation), but the sentence actually expresses the relation, and the knowledge base was incomplete. This type of error may negatively influence model learning by omitting potentially useful positive examples or by negatively weighting valid signals for a relation.”
In our context, we can see a couple of differences from traditional scientific hypothesis testing. First, the problems we are dealing with in IR, NLP and ML are all statistical classification problems, specifically in binary classification. For example, is a given text token an entity or not? What type amongst a discrete set is it? Does the token belong to a given classification or not? This makes it considerably easier to posit an alternative hypothesis and the shape of its distribution. What makes it binary is the decision as to whether a given result is correct or not. We now have a different set of distributions and tests from more common normal distributions.
Second, we can measure our correct ‘hits’ by applying our given tests to a “gold standard” of known results. This gold standard provides a representative sample of what our actual population looks like, one we have characterized in advance whether all results in the sample are true or not for the question at hand. Further, we can use this same gold standard over and over again to gauge improvements in our test procedures.
Combining these thoughts leads to a much simpler matrix, sometimes called a confusion matrix in this context, for laying out the true and false, positive and negative characterizations:

| Correctness | Test Assertion: Positive | Test Assertion: Negative |
| --- | --- | --- |
| True | TP | TN |
| False | FP | FN |

As we can see, ‘positive’ and ‘negative’ are simply the assertions (predictions) arising from our test algorithm of whether or not there is a match or a ‘hit’. ‘True’ and ‘false’ merely indicate whether these assertions proved to be correct or not as determined by gold standards or training sets. A false positive is a false alarm, a “crying wolf”; a false negative is a missed result. Thus, all true results are correct; all false are incorrect.

Key Information Retrieval Statistics
Armed with these four characterizations — true positive, false positive, true negative, false negative — we now have the ability to calculate some important statistical measures. Most of these IR measures also have exact analogs in standard statistics, which I also note.
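Before computing any measure, the four counts have to be tallied. The sketch below is one simple way to do that in Python, assuming the test's assertions and the gold standard's answers are available as parallel lists of booleans; the function name `tally` and the six-item sample are hypothetical.

```python
def tally(assertions, gold):
    """Count TP, FP, TN and FN from parallel lists of booleans.

    assertions[i] is what the test asserted for item i (True = positive);
    gold[i] is the gold-standard answer for the same item.
    """
    if len(assertions) != len(gold):
        raise ValueError("assertions and gold must have the same length")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for asserted, actual in zip(assertions, gold):
        if asserted:
            # A positive assertion is true only if the gold standard agrees.
            counts["TP" if actual else "FP"] += 1
        else:
            # A negative assertion is false when the gold standard has a positive.
            counts["FN" if actual else "TN"] += 1
    return counts


# Hypothetical data: six items scored against a gold standard.
assertions = [True, True, False, False, True, False]
gold = [True, False, False, True, True, False]
print(tally(assertions, gold))
```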
The first metric captures the concept of coverage. In standard statistics, this measure is called sensitivity; in IR and NLP contexts it is called recall. Basically it measures the ‘hit’ rate for identifying true positives out of all potential positives, and is also called the true positive rate, or TPR:

$$\text{recall} = \frac{TP}{TP + FN}$$
Expressed as a fraction of 1.00 or a percentage, a high recall value means the test has a high “yield” for identifying positive results.
Precision is the complementary measure to recall, in that it measures how many of the positive identifications are in fact true:

$$\text{precision} = \frac{TP}{TP + FP}$$
Precision is something, then, of a “quality” measure, also expressed as a fraction of 1.00 or a percentage. It provides a positive predictive value, as defined as the proportion of the true positives against all the positive results (both true positives and false positives).
So, we can see that recall gives us a measure as to the breadth of the hits captured, while precision is a statement of whether our hits are correct or not. We also see, as in the Roth quote above, why false positives need to be a focus of attention in test development, because they directly lower precision and efficiency of the test.
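A short Python sketch of the two measures follows. The counts are hypothetical: they take the KDNuggets scenario quoted earlier (100 actual positives, 200 picks) and assume, purely for illustration, that 80 of the picks turn out to be right. Returning 0.0 when a denominator is zero is also just a convention chosen here.

```python
def recall(tp: int, fn: int) -> float:
    """Share of all actual positives that the test found: TP / (TP + FN)."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of positive assertions that were correct: TP / (TP + FP)."""
    asserted_positives = tp + fp
    return tp / asserted_positives if asserted_positives else 0.0


# Hypothetical counts: 200 picks, 80 of them correct, 20 positives missed.
tp, fp, fn = 80, 120, 20
print(f"recall = {recall(tp, fn):.2f}, precision = {precision(tp, fp):.2f}")
```

With these assumed counts the test has good yield (it finds most of the positives) but modest quality (most of its picks are false alarms).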
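This recognition that precision and recall are complementary and linked is reflected in one of the preferred overall measures of IR and NLP statistics, the F-score, which is the weighted (beta) harmonic mean of precision and recall. The general formula for positive real β is:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

which can be expressed in terms of TP, FN and FP as:

$$F_\beta = \frac{(1 + \beta^2) \cdot TP}{(1 + \beta^2) \cdot TP + \beta^2 \cdot FN + FP}$$

In many cases, the harmonic mean is used, which means a beta of 1, which is called the F1 statistic:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$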
But F1 displays a tension. Either precision or recall may be improved to achieve an improvement in F1, but with divergent benefits or effects. What is more highly valued? Yield? Quality? These choices dictate what kinds of tests and areas of improvement need to receive focus. As a result, the weight of beta can be adjusted to favor either precision or recall. Two other commonly used F measures are the F2 measure, which weights recall higher than precision, and the F0.5 measure, which puts more emphasis on precision than recall.
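To see how the choice of beta shifts the score, here is a small Python sketch that uses the standard TP, FN and FP form of the F-score. The counts are hypothetical (a test with recall of 0.8 and precision of 0.4), and the zero-denominator fallback is an assumption of this sketch.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-score from raw counts; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical counts: recall is 0.8, precision is 0.4.
tp, fp, fn = 80, 120, 20
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(tp, fp, fn, beta):.4f}")
```

Because recall is the stronger of the two measures in this made-up case, the recall-weighted F2 comes out highest and the precision-weighted F0.5 lowest, with F1 in between.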
Another metric can factor into this equation, though accuracy is a less referenced measure in the IR and NLP realm. Accuracy is the statistical measure of how well a binary classification test correctly identifies or excludes a condition:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
An accuracy of 100% means that the measured values are exactly the same as the given values.
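A minimal Python sketch of accuracy follows, again on hypothetical counts built from the KDNuggets scenario quoted earlier (10,000 cases, 100 of them positive, 200 picks) with an assumed 80 correct picks. It also scores a degenerate test that asserts no positives at all.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all assertions, positive or negative, that were correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical: 200 picks, 80 correct, among 10,000 cases with 100 positives.
print(f"{accuracy(tp=80, fp=120, tn=9780, fn=20):.3f}")

# Degenerate test on the same population: it never asserts a positive.
print(f"{accuracy(tp=0, fp=0, tn=9900, fn=100):.3f}")
```

On these assumed numbers the do-nothing test scores slightly higher on accuracy while finding no positives, which may be one reason precision, recall and the F-score tend to be preferred when positives are rare.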
All of the measures above simply require the measurement of false and true, positive and negative, as do a variety of predictive values and likelihood ratios. Relevance, prevalence and specificity are some of the other notable measures that depend solely on these metrics in combination with total population.
By bringing in some other rather simple metrics, it is also possible to expand beyond this statistical base to cover such measures as information entropy, statistical inference, pointwise mutual information, variation of information, uncertainty coefficients, information gain, AUCs and ROCs. But we’ll leave discussion of some of those options until another day.

Bringing It All Together
Courtesy of one of the major templates in Wikipedia in the statistics domain, for which I have taken liberties, expansions and deletions, we can envision the universe of statistical measures in IR and NLP, based solely on population and positives and negatives, true and false, as being:

[Table: statistical measures by condition (as determined by the “gold standard”): total population, condition positive, condition negative, prevalence, …]
Please note that the order and location of TP, FP, FN and TN differs from my simple layout presented in the confusion matrix above. In the confusion matrix, we are gauging whether the assertion of the test is correct or not as established by the gold standard. In this current figure, we are instead using the positive or negative status of the gold standard as the organizing dimension. Use the shorthand identifiers of TP, etc., to make the cross reference between “correct” and “condition”.

Relationships to Gold Standards and Training Sets
These basic measures and understandings have two further important roles beyond informing how to improve the accuracy and performance of IR and NLP algorithms and tests. The first is gold standards. The second is training sets.
Gold standards that themselves contain false positives and false negatives, by definition, immediately introduce errors. These errors make it difficult to test and refine existing IR and NLP algorithms, because the baseline is skewed. And, because gold standards also often inform training sets, errors there propagate into errors in machine learning. It is also important to include true negatives in a gold standard, in the likely ratio expected by the overall population, so that this complement of the accuracy measurement is not overlooked.
Once a gold standard is created, you then run your current test regime against it whenever you run those same tests against unknowns. Preferably, of course, the gold standard only includes true positives and true negatives (that is, the gold standard is the basis for judging “correctness”; see confusion matrix above). In the case of running an entity recognizer, your results against the gold standard can take one of three forms: you either have open slots (no entity asserted); slots with correct entities; or slots with incorrect entities. Thus, here is how you would create the basis for your statistical scores:
As noted before, these measures are sufficient to calculate the precision, recall, F-score and accuracy statistics. Also note that the F v T and P v N correspond to the gold standard “correctness” and what is asserted by the test(s), per the confusion matrix.
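The Python below sketches one plausible way to score an entity recognizer against a gold standard slot by slot. The scoring convention is an assumption of this sketch: a matching entity is a TP, an asserted entity that differs from the gold standard (including where the gold standard has none) is an FP, an open slot where the gold standard also has none is a TN, and an open slot where the gold standard has an entity is an FN. The function name and sample entities are hypothetical.

```python
def score_slots(test_slots, gold_slots):
    """Tally TP/FP/TN/FN for entity slots; None means no entity in that slot."""
    if len(test_slots) != len(gold_slots):
        raise ValueError("test and gold standard must cover the same slots")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for asserted, expected in zip(test_slots, gold_slots):
        if asserted is None:
            # Open slot: correct only if the gold standard is also empty.
            counts["TN" if expected is None else "FN"] += 1
        elif asserted == expected:
            counts["TP"] += 1  # slot holds the correct entity
        else:
            counts["FP"] += 1  # slot holds an incorrect entity
    return counts


# Hypothetical slots: gold standard versus the recognizer's output.
gold_slots = ["Paris", None, "IBM", "Nile", None]
test_slots = ["Paris", "Texas", None, "Amazon", None]
print(score_slots(test_slots, gold_slots))
```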
We can apply this same mindset to the second additional, important role in creating and evaluating training sets. Both positive and negative training sets are recommended for machine learning. Negative training sets are often overlooked. Again, if the learning is not based on true positives and negatives, then significant error may be introduced into the learning.
Clean, vetted gold standards and training sets are thus a critical component to improving our knowledge bases going forward. The very practice of creating gold standards and training sets needs to receive as much attention as algorithm development because, without it, we are optimizing algorithms to fuzzy objectives.
The virtuous circle that occurs between more accurate standards and training sets and improved IR and ML algorithms is a central argument for knowledge-based artificial intelligence (KBAI). Continuing to iterate better knowledge bases and validation datasets is a driving factor in improving both the yield and quality from our rapidly expanding knowledge bases.

See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors.
Tilmann Bruckhaus, 2015. “How Are Precision and Recall Calculated?” from the KDNuggets Web site, retrieved May 10, 2015.
Benjamin Roth, 2014. “Effective Distant Supervision for End-To-End Knowledge Base Population Systems,” D Engineering Thesis, Saarland University; quote is on p 33.
See http://en.wikipedia.org/wiki/F1_score.
See http://en.wikipedia.org/wiki/Template:DiagnosticTesting_Diagram.
Some would also argue for adequate gold standards in the ontology realm. See Dellschaft, Klaas, and Steffen Staab. “On how to perform a gold standard based evaluation of ontology learning.” In The Semantic Web-ISWC 2006, pp. 228-241. Springer Berlin Heidelberg, 2006. For ontologies, they state it “. . . is apparent that there does not exist a canonical way of performing gold-standard based evaluations of ontology learning. Moreover, we argue in this paper that existing gold-standard based evaluations are faulty and that a well-founded evaluation model is largely missing.”
Mavlyutov et al. have posted a pre-print of their upcoming paper to be presented at ESWC at the end of the month covering the most efficient representation of URIs in information systems. All of us who do large-scale work with the semantic Web or linked data should be interested in these findings.
To my knowledge, the paper is the first one to explicitly evaluate common data structures for encoding, storing and retrieving URIs at scale. As the unique identifiers for resources, there may be millions to billions needing to be stored and retrieved from triple stores or other database backends.
The authors compared a dozen different methods for storing URIs according to the standard needs to index, insert and retrieve URIs, including encoding and decoding, at scale. Memory and operation times were measured. The methods evaluated were specific RDF systems; various hash maps; various hash tables; binary search, B+, ART (adaptive radix), and lexicographic trees; and the HAT-trie.
Different operational needs may point to different methods. However, the authors conclude that “overall, the HAT-trie appears to be a good compromise taking into account all aspects, i.e., memory consumption, loading time, and look-ups. ART also appears as an appealing structure, since it maintains the data in sorted order, which enables additional operations like range scans and prefix lookups, and since it still remains time and memory efficient.”
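To illustrate the operations being compared, here is a toy Python sketch. It is not a HAT-trie or ART, and it is not drawn from the paper: it simply pairs a hash map (for encoding URIs to integer IDs and decoding them back) with a sorted list (to show why keeping URIs in sorted order makes prefix lookups straightforward). The class name and the example.org URIs are hypothetical.

```python
from bisect import bisect_left, insort


class UriDictionary:
    """Toy URI store: hash map for encode/decode, sorted list for prefix scans."""

    def __init__(self):
        self._id_of = {}    # URI -> integer ID
        self._uri_of = []   # integer ID -> URI
        self._sorted = []   # URIs kept in lexicographic order

    def encode(self, uri: str) -> int:
        """Return the ID for a URI, assigning the next free ID on first sight."""
        if uri not in self._id_of:
            self._id_of[uri] = len(self._uri_of)
            self._uri_of.append(uri)
            insort(self._sorted, uri)
        return self._id_of[uri]

    def decode(self, uri_id: int) -> str:
        """Return the URI stored under an ID."""
        return self._uri_of[uri_id]

    def with_prefix(self, prefix: str) -> list:
        """All stored URIs starting with prefix; they are contiguous when sorted."""
        matches = []
        for uri in self._sorted[bisect_left(self._sorted, prefix):]:
            if not uri.startswith(prefix):
                break
            matches.append(uri)
        return matches


store = UriDictionary()
for uri in ("http://example.org/person/2",
            "http://example.org/place/1",
            "http://example.org/person/1"):
    store.encode(uri)
print(store.with_prefix("http://example.org/person/"))
```

A plain hash map alone would handle encode and decode but would have to scan every key to answer the prefix query, which is the kind of trade-off the authors' comparison weighs at much larger scale.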
This paper should be a useful reference for any group that needs to manage URIs at scale.

Mavlyutov, Ruslan, Marcin Wylot, and Philippe Cudre-Mauroux. “A Comparison of Data Structures to Manage URIs on the Web of Data,” accepted paper at the 12th ESWC Conference (2015), May 31-June 4, 2015, Portoroz, Slovenia.
The six months since the last major release of UMBEL (Upper Mapping and Binding Exchange Layer) have been spent in improving the coherence and broadening the usefulness for the ontology. Structured Dynamics is today releasing version 1.20 of the open source UMBEL.
UMBEL’s first purpose is to provide a general vocabulary of classes and predicates for describing domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. UMBEL’s second purpose is to provide a coherent framework of reference subjects and topics for grounding relevant Web-accessible content. UMBEL presently has about 35,000 of these reference concepts drawn from the Cyc knowledge base, split into ‘core’ and a series of optional modules, which are organized into 32 mostly disjoint SuperTypes.
The key advances in this new 1.20 version of UMBEL include refinements to the UMBEL generator, improved tests for satisfiability and coherence, and additional mappings and structure to aid UMBEL’s role as a computing overlay for existing knowledge bases, such as Wikipedia. Part of the latter advance is being aided by the new addition of an Attributes Ontology to UMBEL as described in the prior articles of An UMBEL Extension for Attributes and Conceptual and Practical Distinctions in the Attributes Ontology.

Summary of Changes
These are the principal changes between the last public release, version 1.10, and this version 1.20:
The Web and enterprises in general are characterized by growing, diverse and distributed information sources and data. Some of this information resides in structured databases; some resides in schema, standards, metadata, specifications and semi-structured sources; and some resides in general text or media where the content meaning is buried in unstructured form. Given these huge amounts of information, how can one bring together what subsets are relevant? And, then for candidate material that does appear relevant, how can it be usefully combined or related given its diversity? In short, how does one go about actually combining diverse information to make it interoperable and coherent?
UMBEL was conceived to provide a reference grounding to achieve these very aims. UMBEL’s vocabulary is designed to recognize that different sources of information have different contexts and different structures, and meaningful connections between sources are not always exact. UMBEL’s 35,000 reference concepts — drawn from the logically consistent Cyc knowledge base backed by 1000 person-years of development and testing — provide a set of fixed references by which we can orient, map and navigate external content. These UMBEL reference concepts form a knowledge graph (you can see a big graph visualization of this structure) of subject nodes that may be related to external classes and individuals (instances and named entities). Via this coherent structure, we gain some important benefits:
UMBEL is being developed and refined via large-scale use cases. A number of improvements have been brought to the system to make it more testable, manageable, and flexible.
The first improvement was to introduce the so-called SuperTypes to UMBEL. All UMBEL reference concepts are assigned to one or more of 32 SuperTypes, organized into nine dimensions (details may be found here). The four SuperTypes of Attributes, Abstract-level, Entities and Topics/Categories are designed to be fully non-disjoint, and do not participate in any disjoint assertions. The remaining 28 SuperTypes are designed to be as disjoint as possible:

| Dimension | SuperTypes |
| --- | --- |
| Natural World | Natural Phenomena, Natural Substances, Earthscape, Extraterrestrial |
| Living Things | Prokaryotes, Protists & Fungus, Plants, Animals, Diseases, Person Types |
| Human Activities | Organizations, Finance & Economy, Society, Activities |
| Time-related | Events, Time |
| Human Works | Products, Food or Drink, Drugs |
| Human Places | Facilities, Geopolitical |
| Information | Chemistry (n.o.c), Audio Info, Visual Info, Written Info, Structured Info, Notations & References, Numbers |
| Descriptive | Attributes |
| Classificatory | Abstract-level, Entities, Topics/Categories |
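As a rough illustration of how such disjointness can be tested, the Python sketch below flags any concept assigned to more than one SuperType outside the four non-disjoint ones. It makes a simplifying assumption that all of the remaining SuperTypes are pairwise disjoint, whereas the text only says they are as disjoint as possible. The function name and the sample concept assignments are hypothetical and are not taken from UMBEL itself.

```python
# The four SuperTypes that do not participate in disjoint assertions.
NON_DISJOINT = {"Attributes", "Abstract-level", "Entities", "Topics/Categories"}


def disjointness_violations(assignments):
    """Map each offending concept to its conflicting SuperTypes.

    assignments maps a concept name to the SuperTypes it is assigned to.
    Assumption: every SuperType outside NON_DISJOINT is pairwise disjoint.
    """
    violations = {}
    for concept, supertypes in assignments.items():
        disjoint_ones = set(supertypes) - NON_DISJOINT
        if len(disjoint_ones) > 1:
            violations[concept] = sorted(disjoint_ones)
    return violations


# Hypothetical assignments, for illustration only.
assignments = {
    "OakTree": ["Plants", "Entities"],
    "Aspirin": ["Drugs", "Products"],
    "Height": ["Attributes"],
}
print(disjointness_violations(assignments))
```

Only the second sample concept is flagged here, since it sits in two SuperTypes that this sketch treats as disjoint.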
To make UMBEL more tractable, we have also modularized it into ‘core’, ‘geo’, ‘entities’, and ‘attributes’ modules (the latter two modules being added in this new release). The modules can be swapped out with other external options or left out of analysis if not needed for a given domain interest. We also have formal mappings to other important external reference sets such as Wikipedia, OpenCyc, schema.org, the DBpedia ontology, GeoNames and PROTON. UMBEL’s GitHub site provides these mappings.
Beginning with version 1.10, we also added a new UMBEL generator written in Clojure that allows the entire system to be built and tested from a series of simple input files. We are now using this system aggressively to discover gaps and mis-assignments in the UMBEL structure, as well as to achieve balance in scope and coverage. The system ties into the OWL API for certain tests and capabilities (UMBEL is OWL 2-compliant).

Still a Work in Progress
Though UMBEL retains its same mission as when the system was first formulated eight years ago, we also see its role expanding. The two key areas of expansion are in UMBEL’s use to model and map instance data attributes and in acting as a computable overlay for Wikipedia (and other knowledge bases). These two areas of expansion are still a work in progress.
This UMBEL version 1.20 marks the first expression of the Attributes Ontology. While we have organized what already had existed in attribute concepts (that is, those concepts that capture the descriptive data related to how to characterize instance records), some gaps remain in both UMBEL and the source Cyc. Using the new ontology to map against the properties in the DBpedia and schema.org vocabularies is the next priority. These direct use cases are needed to ground the ontology in important, real-world information markup systems. We will also be looking at linking to an existing units and measurements ontology such as QUDT. There likely will need to be a series of releases over time to capture and test these uses.
The mapping to Wikipedia is now about 72% complete. While we are testing automated mapping mechanisms, because of its central role we also need to vet all UMBEL-Wikipedia mapping assignments. This effort is pointing out areas of UMBEL that are over-specified, under-specified, and sometimes duplicative. By placing UMBEL in an intermediate position between Cyc and Wikipedia we are finding differences and gaps on both ends, as well as gaps within UMBEL itself. Our goal is to get to a 100% coverage point with Wikipedia, and then to exercise the structure for machine learning and other tests against the KB. These efforts will enable us to enhance the semsets in UMBEL as well as to move toward multilingual versions. This effort, too, is still a work in progress.
Despite these desired enhancements, we are using all aspects of UMBEL and its mappings both to aid these expansions and to test the existing mappings and structure. These efforts are proving the virtuous circle of improvements that is at the heart of UMBEL’s purposes.

Where to Get UMBEL and Learn More
The UMBEL Web site provides various online tools and Web services for exploring and using UMBEL. The UMBEL GitHub site is where you can download the UMBEL Vocabulary or the UMBEL Reference Concept ontology, both under a Creative Commons Attribution 3.0 license. Other documents and backup are also available from that location.
Technical specifications for UMBEL and its various annexes are available from the UMBEL wiki site. You can also download a PDF version of the specifications from there. You are also welcome to participate on the UMBEL mailing list or LinkedIn group.
Wikipedia is arguably the most important information source yet invented for natural language processing (NLP) and artificial intelligence, in addition to its role as humanity’s largest encyclopedia. Wikipedia is the principal information source for such prominent services as IBM’s Watson, Freebase, the Google Knowledge Graph, Apple’s Siri, YAGO, and DBpedia, the core reference structure for linked open data. Wikipedia information has assumed a prominent role in NLP applications in word sense disambiguation, named entity recognition, co-reference resolution, and multi-lingual alignments; in information retrieval in query expansion, multi-lingual retrieval, question answering, entity ranking, text categorization, and topic indexing; and in semantic applications in topic extraction, relation extraction, entity extraction, entity typing, semantic relatedness, and ontology building.
The massive size of Wikipedia, with more than 26 million articles across 250 different language versions [9,10], makes it a rich resource for reference entities and concepts. Structural features of Wikipedia that help to inform the understanding of relationships and connections include articles (and their embedded entities and concepts), article abstracts, article titles, infoboxes, redirects, internal and external links, editing histories, categories (in part), discussion pages, disambiguation pages, images and associated metadata, templates, tables, and pages of lists, not to mention Wikipedia as a whole being used as a corpus or graph. It is no wonder that Wikipedia is referenced in about 665,000 academic articles and books. And all of this occurs in a phenomenon that is not yet 15 years old!
Wikipedia is unparalleled as a resource for mining structure, concepts and entities. But, and here is the challenge, Wikipedia is never itself used as a computable knowledge base. It is a resource for other knowledge systems, but not a coherent knowledge base unto itself. Wikipedia feeds other useful knowledge bases, but does not play those roles alone. Why this is and how it can be remedied is the subject of this article.

Three Basic Problems
Wikipedia has been cited for three weaknesses relevant to its role as a knowledge base. The first is that its coverage is imbalanced. Various studies have evaluated the scope of Wikipedia [13, 14, 15, among many] and have found areas of popular culture such as games, movies, music and actors to be over-represented, while areas such as philosophy, technology, academics and history are under-represented. While this is perhaps still true in terms of absolute numbers of articles, the actual domain coverage has been improving in recent years.
The second Wikipedia problem is incompleteness. Wikipedia tends to be spotty in providing complete and equal representation when populating certain categories (or classes) with articles (instances). It also tends to be incomplete in how well embedded or structured various articles may be. An example of the representation problem is in economy or commerce and the coverage of companies or products. The notability criterion is a tricky one here; some companies or products with seemingly equivalent notability get listed, while others do not. Another example is the kingdom of life, where some life forms are extremely well represented, while others are not. The incompleteness of structure relates to which articles or entire categories have infoboxes, or ones that are well populated, as well as to how category assignments are incomplete or inconsistent. The existence of “stub” articles is one piece of evidence for such incompleteness. As Wikipedia has gotten more structured and complicated, the number of active editors has declined. The growing use of bots, however, is often compensating for this and in some cases bringing better consistency and equivalent treatment [17,18].
But the biggest problem of Wikipedia has been its category structure. Categories were not part of the original design, but were added to Wikipedia in 2004. Various reviewers have likened Wikipedia more to a thesaurus than a classification scheme; others note that it differs from classical knowledge organization systems in that it has no specified root or hierarchy. This improved a wee bit from 2006 to 2010, when the main Wikipedia topics were organized according to top-level and main topics. Still, typical commentaries point to the fact that Wikipedia’s category structure is “noisy, ill-formed, and difficult to make sense of”. Its crowdsourced nature has led to various direct and indirect cycles in portions of the category structure. All of these problems lead to the inability to do traditional reasoning or inference over the Wikipedia category graph.
Beyond this lack of computability, the Wikipedia graph is bloated with “artificial” categories (see further below) that just add noise when trying to understand or navigate the Wikipedia category structure. In short, while Wikipedia is a goldmine of resources and partial structure, its organization is incoherent at a global level, and it is unable to support reasoning and other tasks that might be expected from a truly functional knowledge base.
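The cycles noted above are easy to illustrate. The following Python sketch, offered only as an illustration, runs a depth-first search over a toy category graph and reports one cycle if any exists. The category names, the child-to-parents dictionary layout and the function name find_cycle are all invented for the example; none of them is taken from Wikipedia or from any of the cited studies.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (start node repeated at the end), or None.

    `graph` maps each category to a list of its parent categories.
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []                              # categories on the current DFS path

    def visit(node):
        color[node] = GREY
        path.append(node)
        for parent in graph.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GREY:
                # Reached a category already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Toy (hypothetical) category graph: "Water" and "Hydrology" point at each other.
toy = {
    "Rivers": ["Bodies of water"],
    "Bodies of water": ["Hydrology"],
    "Hydrology": ["Water"],
    "Water": ["Hydrology"],
}

print(find_cycle(toy))  # ['Hydrology', 'Water', 'Hydrology']
```

A subsumption hierarchy that contains such a loop makes every category in the loop both broader and narrower than the others, which is one reason inference over the raw category graph breaks down. The recursive form shown here is fine for small graphs; a graph of Wikipedia's size would need an iterative traversal.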
The real shame, but also the real opportunity, is that this lack of coherency makes it more difficult to validate and improve the information already in Wikipedia. So, there are both external reasons of linkage and internal reasons of improved authority for which it is desirable to shape Wikipedia into a true knowledge base.

Efforts to Recast Wikipedia
These faults are not unrecognized, and the prospect of better leverage from Wikipedia has stimulated many efforts. Gazing inward, it is not uncommon to find efforts that attempt to clean up the existing Wikipedia structure, or various attempts to use the content of Wikipedia article categories to re-constitute new taxonomies or concept networks. Clean up appears essential, and is a relative constant in other attempts to recast Wikipedia.
The choice of Wikipedia’s founders to make its full content available electronically for free and without restriction was a masterstroke. This has stimulated many to grab the Wikipedia content and to recast it in other ways. One of the first, and most successful, was DBpedia, with an emphasis on making (much of) Wikipedia available in RDF and linked data. DBpedia emphasized the structured content of Wikipedia’s infoboxes and eventually derived a typology of entities and their properties expressed as the DBpedia ontology. It is not hyperbole to state that DBpedia nucleated the entire linked data phenomenon.
The key insight of YAGO was the recognition that the resource richness of Wikipedia lacked a unifying structure, with WordNet chosen as the replacement organizing framework. Also, by patterned analysis of Wikipedia’s article titles structure, YAGO was able to infer and select many attribute relationships between entities. This enabled YAGO to posit what, in essence, was a much-expanded category structure for Wikipedia expressed as predicates. Many other efforts have also chosen WordNet as their organizing framework for Wikipedia [31,32].
Freebase, itself another attempt to use crowdsourcing with explicit attention to structured data, struggled in its early years until it embraced and incorporated Wikipedia. That marked the take-off point for Freebase, which was later acquired by Google to form the backbone of its knowledge graph. Freebase is now being shut down with its assets being transferred to Wikidata.
Wikidata is itself an interesting case of how the Wikipedia model is being expanded. Wikidata, a sister project to Wikipedia under the Wikimedia banner, takes as its starting point the structured data about entities evident in Wikipedia infoboxes. Rather than extracting and cleaning that entity information as DBpedia does, the role of Wikidata is to be the multilingual source for all entities feeding the Wikimedia network, including Wikipedia. The approach leads to more uniformity and consistency, and provides a central Wikimedia access point for structured data. However, somewhat akin to Wikipedia, Wikidata also has struggled to find an appropriate typology (or ontology) for its millions of entities.
Other approaches to the Wikipedia classification challenge have been to map (or “express”) Wikipedia articles in relation to established external vocabularies or structures, such as the Library of Congress Classification, Library of Congress Subject Headings [23, 36], Universal Decimal Classification (UDC), Cyc or UMBEL, among others. The idea here is that accepted organizational schemes provide more coherence than the Wikipedia category structure, with sometimes additional benefits as well.
Though not complete topical recastings, certain aspects of Wikipedia have also proven their usefulness for general knowledge acquisition. Using article (concept or entity) content can inform topical tagging using explicit semantic analysis (ESA), automatic topic identification, information extraction or a myriad of others.

Making a Natural Wikipedia Category Scheme
Whether “cleaned” or recast, taking the existing Wikipedia structure in its existing form is problematic. Though some category cleaning sometimes takes place with some of these uses of Wikipedia, that is neither uniformly nor universally so. The cleaning that does take place is often limited to administrative categories (relating to internal Wikipedia conventions or management). However, other Wikipedia conventions (such as lists) and the proliferation of user-generated “artificial” categories actually represent the bulk of the total number of categories.
Charles S. Peirce was the first, by my reading, who looked at the question of “natural classes,” which are now sometimes contraposed against what are called “artificial classes” (we tend to use the term “compound” classes instead). A “natural class” is a set with members that share the same set of attributes, though with different values (such as differences in age or hair color for humans). Some of those attributes are also more essential to define the “type” of that class (such as humans being warm-blooded with live births and hair and use of symbolic languages). Artificial classes tend to only add one or a few shared attributes, and do not reflect the essence of the type.
“Compound” (or artificial) categories (such as Films directed by Pedro Almodóvar or Ambassadors of the United States to Mexico) are not “natural” categories, and including them in a logical evaluation only acts to confuse attributes with classification. To be sure, such existing categories should be decomposed into their attribute and concept components, but should not be included in constructing a schema of the domain.
“Artificial” categories may be identified in the Wikipedia category structure by both syntactical and heuristic signals. One syntactical rule is to look for the head of a title; one heuristic signal is to select out any category with prepositions. Across all rules, “compound” categories actually account for most of what is removed in order to produce “cleaned” categories.
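To make the preposition heuristic concrete, here is a minimal Python sketch. The preposition list and the function names (looks_compound, split_categories) are assumptions made for illustration; they are not the actual rules behind any of the cleaned category sets discussed here, and the separate check on the head of a title is not attempted.

```python
import re

# Hypothetical, deliberately short preposition list; a real filter would need tuning.
PREPOSITIONS = {"of", "in", "by", "from", "to", "at", "for", "with", "about", "on"}


def looks_compound(category_title: str) -> bool:
    """Return True if the title contains a preposition as a separate word."""
    # Split into runs of letters so that accented names stay whole words.
    words = re.findall(r"[^\W\d_]+", category_title.lower())
    return any(word in PREPOSITIONS for word in words)


def split_categories(titles):
    """Partition titles into (clean, compound) lists, preserving input order."""
    clean, compound = [], []
    for title in titles:
        (compound if looks_compound(title) else clean).append(title)
    return clean, compound


sample = [
    "Films directed by Pedro Almodóvar",
    "Ambassadors of the United States to Mexico",
    "Films",
    "Ambassadors",
]

clean, compound = split_categories(sample)
print(clean)     # ['Films', 'Ambassadors']
print(compound)  # the two compound titles from the sample
```

A filter this crude will also flag some legitimate categories whose names happen to contain a preposition, which is one reason the sequencing and tuning of such rules matters.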
We can combine these thoughts to show what a “cleaned” version of the Wikipedia category structure might look like. The 12/15/10 column in the table below reflects the approach used for determining the candidates for SuperTypes in the UMBEL ontology, last analyzed in 2010. The second column is from a current effort mapping Wikipedia to Cyc:

Columns: 12/15/10 | 3/1/15
Total Categories: 100% | 100%
Administrative Categories: 14% | 15%
Orphaned Categories: 10% | 20%
Working Categories: 76% | 66%
“Artificial” Categories: 44% | 34%
Single Head: 23% | 33%
Plural Head: 24%
“Clean” Categories: 33% | 46%
Two implications can be drawn from this table. First, without cleaning, there is considerable “noise” in the Wikipedia category structure, equivalent to about half to two-thirds of all categories. Without cleaning these categories, any analysis or classification that ensues is fighting unnecessary noise and has likely introduced substantial assignment errors. Second, approaches, assumptions and the sequencing of filters differ between “cleaning” attempts, which makes comparability a challenge but also marks areas for discussion and testing to derive best practices. Because of these differences in staging Wikipedia for analysis, one study is not necessarily relatable to others.
Today, in chaotic and uncoordinated ways, we see Wikipedia feeding much analysis through partial aspects of its structure and supplying many reference concepts and entities. But each analysis is done for different purposes using different bases; they are thus incompatible. Coherency, usability and insight suffer. Any prior efforts to map to or use Wikipedia categories that do not remove these artificial categories only introduce noise and are therefore likely to be in substantial error.

Benefits of a Reference Knowledge Base
If we could overcome these shortcomings by taking the steps to make Wikipedia a true reference knowledge base, what might the benefits be? Or, said another way, why should we care?
One benefit is that reference structures of any kind provide a focus, by definition, of common or canonical referents. This commonality leads to better defined, better understood and more widely used referents. Common referents become a kind of common vocabulary for the space, one that other vocabularies and datasets can reference. A common language, of sorts, can begin to emerge.
Reference structures also provide a grounding, a spoke-and-hub design, that leads to an efficient basis for external vocabularies and datasets to refer to one another. Of course, any direct mapping can provide a means to relate this information, but such pairwise mappings are neither scalable nor efficient. In a spoke-and-hub design, the number of mappings required grows only linearly with the number of datasets or items requiring mapping, rather than quadratically as with pairwise mappings. The spoke-and-hub design, for example, is at the heart of such disciplines as master data management.
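The difference in scaling is simple arithmetic. As a small sketch, using the n – 1 count for the hub case given in the notes to this article and the standard n(n – 1)/2 count of distinct pairs for the pairwise case (the function names are only illustrative):

```python
def hub_mappings(n: int) -> int:
    """Mappings needed when each source maps once to a shared reference hub."""
    return max(n - 1, 0)


def pairwise_mappings(n: int) -> int:
    """Mappings needed when every source is mapped directly to every other."""
    return n * (n - 1) // 2


for n in (2, 10, 100):
    print(n, hub_mappings(n), pairwise_mappings(n))
# 2 1 1
# 10 9 45
# 100 99 4950
```

At ten datasets the gap is 9 mappings against 45; at one hundred it is 99 against 4,950, before even counting the attributes within each dataset.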
Another benefit of common reference structures is that they provide a common target for the development of tools and best practices. These kinds of “network effects” lead to still further tooling and practices. Thus, while we see literally tens of thousands of academic papers and approaches leveraging Wikipedia in one way or another, we see little of a practice or a community that has been built around it as a knowledge base. It is as if we are still looking a bit at the shadow of Wikipedia and its possible role, a chimera for its potential as a true knowledge base.
But the ultimate benefit of Wikipedia as a reference knowledge base will reside in its computability. We will finally see the realization of Wikipedia’s inherent potential when we can reason over its content, use it for testing and analyzing assertions or new facts, and apply its coherent organization to such tasks as informing how to map and interoperate data or remaking whole legacy applications such as enterprise information integration or MDM, all in cross-lingual ways. And, as these latent capabilities get exploited, we will see supporting knowledge sources such as Wikidata also get pulled into the ecosystem.

Seven Requirements for a Computable Knowledge Base
So, if we buy into the benefits of a computable Wikipedia — or any other useful knowledge source for that matter — what are the guideposts for doing so? How do we assess the gaps and then fill them?
The importance of working with a “clean” version of the Wikipedia structure is obvious, yet ultimately there are higher-order requirements for what it takes, in our view, to become a “true” reference knowledge base. By our definition, such KBs have these aspects:
Wikipedia, and other publicly available knowledge sources, already fulfill many of these requirements. With focused attention, any current reference source should be able to be lifted to meet these seven major requirements.

Outlines of a General Staging Pipeline
OK, then: what might such a KB processing (or “lifting”) approach look like?
Well, the first point is that it should be a pipeline. It is important to be able to swap in and out various options at multiple points from input to desired output. Then, because there are disparate sources and different formats to accommodate, it is also important to use canonical syntaxes and standards for expressing the products and specifications at the various steps along that pipeline.
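As a rough sketch of what "swappable" means in practice, a pipeline can be treated as an ordered list of stages that all accept and return the same canonical record shape. Everything in the Python example below is hypothetical: the record layout, the two stage functions and the rule for spotting administrative categories are placeholders for illustration, not a description of any existing system.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]                       # canonical shape passed between stages
Stage = Callable[[List[Record]], List[Record]]


def strip_whitespace(records: List[Record]) -> List[Record]:
    """Normalize titles by trimming surrounding whitespace."""
    return [{**r, "title": r["title"].strip()} for r in records]


def drop_administrative(records: List[Record]) -> List[Record]:
    """Hypothetical rule: treat titles starting with 'Wikipedia' as administrative."""
    return [r for r in records if not r["title"].startswith("Wikipedia")]


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order; stages can be added, removed or reordered."""
    for stage in stages:
        records = stage(records)
    return records


raw = [{"title": "  Plants "}, {"title": "Wikipedia maintenance"}, {"title": "Animals"}]

full = run_pipeline(raw, [strip_whitespace, drop_administrative])
light = run_pipeline(raw, [strip_whitespace])   # same input, one stage swapped out

print([r["title"] for r in full])   # ['Plants', 'Animals']
print(len(light))                   # 3
```

Because every stage shares one input and output shape, a filter can be swapped or dropped without touching its neighbors, which is the property the canonical syntaxes and standards are meant to secure.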
The very notion of pipeline implies workflows, which are the actual drivers for how the pipeline should be designed. Key workflow steps include:
Much information gets processed in these pipelines, and the underlying sources update frequently. Thus, the pipelines themselves need to be performant and based on solid code. Automation, within the demanding bounds of quality, is also an essential condition to be scalable. Improving on that is a process, not a state.

Time to Make Some Sausage
Most of these observations are really not new or innovative [39,50]. Possibly what is new is to articulate the situation for major reference sources on the Web, and to then analyze and propose how to process them in the service of information interoperability.
Because, you see, we’re still at the very, very earliest phases of how the Internet is changing the abilities to gather, understand, and represent the information in our world. We’re about ready to embark on the next stage in that journey.

[1] IBM Journal of Research and Development 56(3/4), Special Issue on “This is Watson”, 2012.
[2] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[3] A. Singhal: Introducing the Knowledge Graph: Things, not Strings. Google Blog, May 16, 2012.
[4] Gruber, T. “Siri: a virtual personal assistant.” In keynote presentation at Semantic Technologies conference (SemTech09), 2009.
[5] Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum. “Yago: a core of semantic knowledge.” In Proceedings of the 16th international conference on World Wide Web, pp. 697-706. ACM, 2007.
[6] Sören Auer, Chris Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak and Zachary Ives, 2007. DBpedia: A nucleus for a web of open data, in Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, volume 4825 of LNCS, pages 715–728, November 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_IU_Auer.pdf.
[7] Heath, Tom, and Christian Bizer. “Linked data: Evolving the web into a global data space.” Synthesis lectures on the semantic web: theory and technology 1, no. 1 (2011): 1-136.
[8] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp. See http://arxiv.org/ftp/arxiv/papers/0809/0809.4530.pdf.
[9] Mesgari, Mostafa, Chitu Okoli, Mohamad Mehdi, Finn Årup Nielsen, and Arto Lanamäki. “The sum of all human knowledge: A systematic review of scholarly research on the content of Wikipedia,” Journal of the Association for Information Science and Technology 66, no. 2 (2015): 219-245.
[10] However, only 1/10th of the different language Wikipedias have more than 100,000 articles; see https://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics.
[11] See the discussion of ‘structural sources’ in M.K. Bergman, 2011. “In Search of ‘Gold Standards’ for the Semantic Web,” in AI3:::Adaptive Information blog, February 28, 2011.
[12] This count is from Google Scholar absent references in citations and patents with the query, http://scholar.google.com/scholar?as_vis=1&q=wikipedia&hl=en&as_sdt=1,16. Also, see the SWEETpedia listing 250 articles relating to this topic on this AI3:::Adaptive Information blog; I ceased updating the list about five years ago because it was growing too large to manage.
[13] Halavais, Alexander, and Derek Lackaff. “An analysis of topical coverage of Wikipedia.” Journal of Computer Mediated Communication 13, no. 2 (2008): 429-440.
[14] Holloway, Todd, Miran Bozicevic, and Katy Börner. “Analyzing and visualizing the semantic coverage of Wikipedia and its authors.” Complexity 12, no. 3 (2007): 30-40.
[15] Samoilenko, Anna, and Taha Yasseri. “The distorted mirror of Wikipedia: a quantitative analysis of Wikipedia coverage of academics.” EPJ Data Science 3, no. 1 (2014): 1-11.
[16] See http://en.wikipedia.org/wiki/Wikipedia:Notability.
[17] Halfaker, Aaron, and John Riedl. “Bots and cyborgs: Wikipedia’s immune system.” Computer 3 (2012): 79-82.
[18] See https://en.wikipedia.org/wiki/Category:Wikipedia_bots_by_purpose.
[19] Voss, J. Collaborative Thesaurus Tagging the Wikipedia Way, (2006).
[20] Suchecki, Krzysztof, Alkim Almila Akdag Salah, Cheng Gao, and Andrea Scharnhorst. “Evolution of Wikipedia’s Category Structure.” Advances in Complex Systems 15, no. supp01 (2012).
[21] For Wikipedia’s main topics, see http://en.wikipedia.org/wiki/Category:Main_topic_classifications; for Wikipedia’s top-level categories, see http://en.wikipedia.org/wiki/Category:Fundamental_categories.
[22] Kittur, A., Chi, E. H. and Suh, B., What’s in Wikipedia? Mapping Topics and Conflict Using Socially Annotated Category Structure, in Proceedings of the 27th Annual CHI Conference on Human Factors in Computing Systems (CHI’2009), New York, USA, 2009, pp. 1509–1512.
[23] Joorabchi, Arash, and Abdulhussain E. Mahdi. “Towards linking libraries and Wikipedia: automatic subject indexing of library records with Wikipedia concepts.” Journal of Information Science 40, no. 2 (2014): 211-221.
[24] Paulheim, Heiko, and Christian Bizer. “Type inference on noisy rdf data,” In The Semantic Web–ISWC 2013, pp. 510-525. Springer Berlin Heidelberg, 2013.
[25] Maciej Janik and Krys Kochut, 2007. Wikipedia in Action: Ontological Knowledge in Text Categorization, University of Georgia, Computer Science Department Technical Report no. UGA-CS-TR-07-001. See http://lsdis.cs.uga.edu/~mjanik/UGA-CS-TR-07-001.pdf. Also, see Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Mohamed Tmar, and Abdelmajid Ben Hamadou. “Wikipedia Category Graph and New Intrinsic Information Content Metric for Word Semantic Relatedness Measuring.” Computing 10, no. 13 (2012): 35-37.
[26] Vivi Nastase and Michael Strube, 2008. Decoding Wikipedia Categories for Knowledge Acquisition, in Proceedings of the AAAI08 Conference, Chicago, US, pp. 1219-1224.
[27] Simone Paolo Ponzetto and Michael Strube, 2007a. Deriving a Large Scale Taxonomy from Wikipedia, in Association for the Advancement of Artificial Intelligence (AAAI2007).
[28] Andrew Gregorowicz and Mark A. Kramer, 2006. Mining a Large-Scale Term-Concept Network from Wikipedia, Mitre Technical Report, October 2006.
[29] Wu, Fei, and Daniel S. Weld. “Autonomously semantifying wikipedia.” In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pp. 41-50. ACM, 2007.
[30] Bizer, Christian, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. “DBpedia-A crystallization point for the Web of Data.” Web Semantics: science, services and agents on the world wide web 7, no. 3 (2009): 154-165.
[31] Marius Pasca, 2009. Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes over Conceptual Hierarchies, in Proceedings of the 12th Conference of the European Chapter of the ACL, pages 639–647, Athens, Greece, 30 March – 3 April 2009. See http://www.aclweb.org/anthology/E/E09/E09-1073.pdf.
[32] Fei Wu and Daniel S. Weld, 2008. Automatically Refining the Wikipedia Infobox Ontology, presented at the 17th International World Wide Web Conference (WWW 2008).
[33] Vrandečić, Denny, and Markus Krötzsch. “Wikidata: a free collaborative knowledgebase.” Communications of the ACM 57, no. 10 (2014): 78-85.
[34] From scratch, in a bit over three years, Wikidata has grown to cover about 19 million entities according to February 2015 statistics. However, there has yet to emerge an overarching typology or ontology for these entities, with the typing system that does exist growing from the bottom up. For some background, see https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Migrating_away_from_GND_main_type.
[35] There is an alternate entry point to Wikipedia provided by http://en.wikipedia.org/wiki/Library_of_Congress_Classification.
[36] Kiyota, Yoji, Hiroshi Nakagawa, Satoshi Sakai, Tatsuya Mori, and Hidetaka Masuda. “Exploitation of the wikipedia category system for enhancing the value of LCSH.” In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, pp. 411-412. ACM, 2009.
[37] Salah, Almila Akdag, Cheng Gao, Krzysztof Suchecki, and Andrea Scharnhorst. “Need to categorize: A comparative look at the categories of universal decimal classification system and Wikipedia,” Leonardo 45, no. 1 (2012): 84-85.
[38] Pohl, Aleksander. “Classifying the Wikipedia articles into the OpenCyc taxonomy.” In Proceedings of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference, vol. 5, p. 16. 2012.
[39] Upper Mapping and Binding Exchange Layer (UMBEL) Specification, UMBEL.org, retrieved February 16, 2015.
[40] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
[41] Hassan, Mostafa. “Automatic Document Topic Identification Using Hierarchical Ontology Extracted from Human Background Knowledge.” PhD dissertation, University of Waterloo, 2013.
[42] Fei Wu, Raphael Hoffmann and Daniel S. Weld, 2008b. Information Extraction from Wikipedia: Moving Down the Long Tail, in Proceedings of the 14th ACM SigKDD International Conference on Knowledge Discovery and Data Mining (KDD-08), Las Vegas, NV, August 24-27, 2008, pp. 635-644. See http://www.cs.washington.edu/homes/wufei/papers/kdd08.pdf.
[43] Menno Hulswit, 1997. “Peirce’s teleological approach to natural classes,” in Transactions of the Charles S. Peirce Society (1997): 722-772. See http://repository.ubn.ru.nl/bitstream/handle/2066/29577/29577___.PDF?sequence=1
[44] Upper Mapping and Binding Exchange Layer (UMBEL) Specification, Annex G: UMBEL SuperTypes Documentation, UMBEL.org, retrieved February 16, 2015.
[45] Aleksander Smywinski-Pohl, Krzysztof Wróbel, Michael K. Bergman and Bartosz Ziółko, 2015. “cycloped.io: An Interoperable Framework for Web Knowledge Bases,” manuscript in preparation.
[46] The main advantage of a grounding reference is that it allows a spoke-and-hub design for data mapping, which is tremendously more efficient than pairwise mappings. In a spoke-and-hub design, where the reference ontology is the common node at the hub, only n – 1 routes are necessary to connect all sources, meaning that it scales linearly with the number of sources and attributes. Without a grounding reference, these same mapping capabilities would require n(n – 1)/2 routes in a pairwise (point-to-point) approach, which scales poorly as a quadratic function. A system of ten datasets would require 9 composite mappings in the reference grounding case, but 45 in a pairwise approach. And, of course, datasets themselves contain tens to thousands of attributes, compounding the map scaling problem further.
[47] For example, WordNet is a coherent lexical ontology, but is not computable.
[48] See the knowledge bases section of M.K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” in AI3:::Adaptive Information blog, November 14, 2014.
[49] Galárraga, Luis, Geremy Heitz, Kevin Murphy, and Fabian M. Suchanek. “Canonicalizing open knowledge bases.” In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 1679-1688. ACM, 2014.
[50] See, for example, Suchanek, Fabian M., and Gerhard Weikum. “Knowledge Bases in the Age of Big Data Analytics.” Proceedings of the VLDB Endowment 7, no. 13 (2014): 1713-1714.