A recent interview with a noted researcher, IEEE Fellow Michael I. Jordan, Pehong Chen Distinguished Professor at the University of California, Berkeley, provided a downplayed view of recent AI hype. Jordan was particularly critical of AI metaphors to real brain function and took the air out of the balloon about algorithm advances, pointing out that most current methods have roots that are decades long . In fact, the roots of knowledge-based artificial intelligence (KBAI), the subject of this article, also extend back decades.
Yet the real point, only briefly touched upon by Jordan in his lauding of Amazon’s recommendation service, is that the dynamo in recent AI progress has come from advances in the knowledge and statistical bases driving these algorithms. The improved digital knowledge bases behind KBAI have been the power behind these advances.
Knowledge bases are finally being effectively combined with AI, a dynamic synergy that is only now being recognized, let alone leveraged. As this realization increases, many forms of useful information structure in the wild will begin to be mapped to these knowledge bases, which will further extend the benefits we are are now seeing from KBAI.
Knowledge-based artificial intelligence, or KBAI, is the use of large statistical or knowledge bases to inform feature selection for machine-based learning algorithms used in AI. The use of knowledge bases to train the features of AI algorithms improves the accuracy, recall and precision of these methods. This improvement leads to perceptibly better results to information queries, including pattern recognition. Further, in a virtuous circle, KBAI techniques can also be applied to identify additional possible facts within the knowledge bases themselves, improving them further still for KBAI purposes.
It is thus, in my view, the combination of KB + AI that has led to the notable AI breakthroughs of the past, say, decade. It is in this combination that we gain the seeds for sowing AI benefits in other areas, from tagging and disambiguation to the complete integration of text with conventional data systems. And, oh, by the way, the structure of all of these systems can be made inherently multi-lingual, meaning that context and interpretation across languages can be brought to our understanding of concepts.
Structured Dynamics is working to democratize a vision of KBAI that brings its benefits to any enterprise, using the same approaches that the behemoths of the industry have used to innovate knowledge-based artificial intelligence in the first place. How and where the benefits of such KBAI may apply is the subject of this article.A Brief History of Knowledge-based Systems
Knowledge-based artificial intelligence is not a new idea. Its roots extend back perhaps to one of the first AI applications, Dendral. In 1965, nearly a half century ago, Edward Feigenbaum initiated Dendral, which became a ten-year effort to develop software to deduce the molecular structure of organic compounds using scientific instrument data. Dendral was the first expert system and used mass spectra or other experimental data together with a knowledge base of chemistry to produce a set of possible chemical structures. This set the outline for what came to be known as knowledge-based systems, which are one or more computer programs that reason and use knowledge bases to solve complex problems.
Indeed, it was in the area of expert systems that AI first came to the attention of most enterprises. According to Wikipedia,Expert systems were designed to solve complex problems by reasoning about knowledge, represented primarily as if–then rules rather than through conventional procedural code. The first expert systems were created in the 1970s and then proliferated in the 1980s. Expert systems were among the first truly successful forms of AI software.
Expert systems spawned the idea of knowledge engineers, whose role was to interview and codify the logic of the chosen experts. But, expert systems proved to be expensive to build and difficult to maintain and tune. As the influence of expert systems waned, another branch emerged, that of knowledge-based engineering and their support for CAD- and CASE-type systems. Still, overall penetration to date of most knowledge-based systems can most charitably be described as disappointing.
The specific identification of “KBAI” was (to my knowledge) first made in a Carnegie-Mellon University report to DARPA in 1975 . The source knowledge bases were broadly construed, including listings of hypotheses. The first known patent citing knowledge-based artificial intelligence is from 1992 . Within the next ten years there were dedicated graduate-level course offerings on KBAI at many universities, including at least Indiana University, SUNY Buffalo, and Georgia Tech.
In 2007, Bossé et al. devoted a chapter to KBAI in their book on information fusion, but still, at that time, the references were more generic . However, by 2013, the situation was changing fast, as this quote from Hovy et al. indicates :“Recently, however, this stalemate [the so-called ‘knowledge acquisition bottleneck] has begun to loosen up. The availability of large amounts of wide-coverage semantic knowledge, and the ability to extract it using powerful statistical methods, are enabling significant advances in applications requiring deep understanding capabilities, such as information retrieval and question-answering engines. Thus, although the well-known problems of high cost and scalability discouraged the development of knowledge-based approaches in the past, more recently the increasing availability of online collaborative knowledge resources has made it possible to tackle the knowledge acquisition bottleneck by means of massive collaboration within large online communities. This, in turn, has made these online resources one of the main driving forces behind the renaissance of knowledge-rich approaches in AI and NLP – namely, approaches that exploit large amounts of machine-readable knowledge to perform tasks requiring human intelligence.” (citations removed)
The waxing and waning of knowledge-based systems and its evolution over fifty years have led to a pretty well-defined space, even if not all component areas have achieved their commercial potential. Besides areas already mentioned, knowledge-based systems also include:
We can organize these subdomains as follows. Note particularly that the branch of KBAI (knowledge-based artificial intelligence) has two main denizens: recognized knowledge bases, such as Wikipedia, and statistical corpora. The former are familiar and evident around us; the latter are largely proprietary and not (generally) publicly accessible:
Some prominent knowledge bases and statistical corpora are identified below. Knowledge bases are coherently organized information with instance data for the concepts and relationships covered by the domain at hand, all accessible in some manner electronically. Knowledge bases can extend from the nearly global, such as Wikipedia, to very specific topic-oriented ones, such as restaurant reviews or animal guides. Some electronic knowledge bases are designed explicitly to support digital consumption, in which case they are fairly structured with defined schema and standard data formats and, increasingly, APIs. Others may be electronically accessible and highly relevant, but the data is not staged in a easily-consumable way, thereby requiring extraction and processing prior to use.
The use and role of statistical corpora is harder to discern. Statistical corpora are organized statistical relationships or rankings that facilitate the processing of (mostly) textual information. Uses can range from entity extraction to machine language translation. Extremely large sources, such as search engine indexes or massive crawls of the Web, are most often the sources for these knowledge sets. But, most are applied internally by those Web properties that control this big data.
The Web is the reason these sources — both statistical corpora and knowledge bases — have proliferated, so the major means of consuming them is via Web services with the information defined and linked to URIs.
My major thesis has been that it is the availability of electronically accessible knowledge bases, exemplified and stimulated by Wikipedia , that has been the telling factor in recent artificial intelligence advances. For example, there are at least 500 different papers that cite using Wikipedia for various natural language processing, artificial intelligence, or knowledge base purposes . These papers began to stream into conferences about 2005 to 2006, and have not abated since. In turn, the various techniques innovated for extracting more and more structure and information from Wikipedia are being applied to other semi-structured knowledge bases, resulting in a true renaissance of knowledge-based processing for AI purposes. These knowledge bases are emerging as the information substrate under many recent computational advances.Knowledge Bases in Relation to Overall Artificial Intelligence
A few months ago I pulled together a bit of an interaction diagram to show the relationships between major branches of artificial intelligence and structures arising from big data, knowledge bases, and other organizational schema for information:
What we are seeing is a system emerging whereby multiple portions of this diagram interact to produce innovations. Take, for example, Apple‘s Siri , or Google’s Google Now or the many similar systems that have emerged on smartphones. Spoken instructions are decoded to text, which is then parsed and evaluated for intent and meaning and then posed to a general knowledge base. The text results are then modulated back to speech with the answer in the smartphone’s speakers. The pattern recognition at the front and back end of this workflow has been made better though statistical datasets derived from phonemes and text. The text understanding is processed via natural language processing and semantic technologies , with the question understanding and answer formulation coming from one or more knowledge bases.
This remarkable chain of processing is now almost taken for granted, though its commercial use is less than five years old. For different purposes with different workflows we see effective question answering and diagnosis with systems like IBM’s Watson  and structured search results from Google’s Knowledge Graph . Try posing some questions to Wolfram Alpha and then stand back and be impressed with the data visualization. Behind the scenes, pattern recognition from faces to general images or thumbprints further is eroding the distinction between man and machine. Google Translate now covers language translation between 60 human languages  — and pretty damn effectively, too. All major Web players are active in these areas, from Amazon’s recommendation system  to Facebook , Microsoft , Twitter  or Baidu .
Though not universal, most all recent AI advances leveraging knowledge bases have utilized Wikipedia in one way or another. Even Freebase, the core of Google’s Knowledge Graph, did not really blossom as a separate data crowdsourcing concern until its former owner, Metaweb, decided to bring Wikipedia into its system. Many other knowledge bases, as noted below, are also derivatives or enhancements to Wikipedia in one way or another.
I believe the reasons for Wikipedia’s influence have arisen from its nearly global scope, its mix of semi-structured data and text, its nearly 200 language versions, and its completely open and accessible nature. Regardless, it is also certainly true that techniques honed with Wikipedia are now being applied to a diversity of knowledge bases. We are also seeing an appreciation start to grow in how knowledge bases can enhance the overall AI effort.Useful Statistical and Knowledge Sources
The diagram on knowledge-based systems above shows two kinds of databases contributing to KBAI: statistical corpora or databases and true knowledge bases. The statistical corpora tend to be hidden behind proprietary curtains, and also more limited in role and usefulness than general knowledge bases.Statistical Corpora
The statistical corpora or databases tend to be of a very specific nature. While lists of text corpora and many other things may contribute to this category, the ones actually in commercial use tend to be quite focused in scope, very large, and designed for bespoke functionality. A good example, and one that has been contributed for public use, is the Web 1T 5-gram data set . This data set, contributed by Google for public use in 2006, contains English word n-grams and their observed frequency counts. N-grams capture word tokens that often coincide with one another, from single words to phrases. The length of the n-grams ranges from unigrams (single words) to five-grams. The database was generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.
Another example of statistical corpora are what is used in Google’s Translate capabilities . According to Franz Josef Och, who was the lead manager at Google for its translation activities and an articulate spokesperson for statistical machine translation, a solid base for developing a usable language translation system for a new pair of languages should consist of a bilingual text corpus of more than a million words, plus two monolingual corpora each of more than a billion words. Statistical frequencies of word associations form the basis of these reference sets. Google originally seeded its first language translators with multiple language texts from the United Nations .
Such lookup or frequency tables in fact can shade into what may be termed a knowledge base as they gain more structure. NELL, for example (and see below), contains a relatively flat listing of assertions extracted from the Web for various entities; it goes beyond frequency counts or relatedness, but does not have the full structure of a general knowledge base like Wikipedia . We thus can see that statistical corpora and knowledge bases in fact reside on a continuum of structure, with no bright line to demark the two categories.
Nonetheless, most statistical corpora will never be seen publicly. Building them requires large amounts of input information. And, once built, they can offer significant commercial value to their developers to drive various machine learning systems and for general lookup.Knowledge Bases
There are literally hundreds of knowledge bases useful to artificial intelligence, most of a restricted domain nature. Listed below, partially informed by Suchanek and Weikum’s work , are some of the broadest ones available. Note that many leverage or are derivatives of or extensions to Wikipedia:
It is instructive to inspect what kinds of work or knowledge these bases are contributing to the AI enterprise. The most important contribution, in my mind, is structure. This structure can relate to the subsumption (is-a) or part of (mereology) relationships between concepts. This form of structure contributes most to understanding the taxonomy or schema of a domain; that is, its scaffolding of concepts. This structure helps orient the instance data and other external structures, generally through some form of mapping.
The next rung of contribution from these knowledge bases is in the nature of the relations between concepts and their instances. These form the predicates or nature of the relationships between things. This kind of contribution is also closely related to the attributes of the concepts and the properties of the things that populate the structure. This kind of information tends to be the kind of characteristics that one sees in a data record: a specific thing and the values for the fields by which it is described.
Another contribution from knowledge bases comes from identity and disamgibuation. Identity works in that we can point to authoritative references (with associated Web identifiers) for all of the individual things and properties in our relevant domain. We can use these identities to decide the “canonical” form, which also gives us a common reference for referring to the same things across information sources or data sets. We also gain the means for capturing the various ways that anything can be described, that is the synonyms, jargon, slang, acronyms or insults that might be associated with something. That understanding helps us identify the core item at hand. When we extend these ideas to the concepts or types that populate our relevant domain, we can also begin to establish context and other relationships to individual things. When we encounter the person “John Smith” we can use co-occurring concepts to help distinguish John Smith the plumber from John Smith the politician or John Smith the policeman. As more definition and structure is added, our ability to discriminate and disambiguate goes up.
Some of this work resides in the interface between schema (concepts) and the instances (individuals) that populate that schema, what I elsewhere have described as the work between the T-Box and A-Box of knowledge bases . In any case, with richer understandings of how we describe and discern things, we can now begin to do new work, not possible when these understandings were lacking. We can now, for example, do semantic search where we can relate multiple expressions for the same things or infer relationships or facets that either allow us to find more relevant items or better narrow our search interests.
With true knowledge bases and logical approaches for working with them and their structure, we can begin doing direct question answering. With more structure and more relationships, we can also do so in rather sophisticated ways, such as identifying items with multiple shared characteristics or within certain ranges or combinations of attributes.
Structured information and the means to query it now gives us a powerful, virtuous circle whereby our knowledge bases can drive the feature selection of AI algorithms, while those very same algorithms can help find still more features and structure in our knowledge bases. The interaction between AI and the KBs means we can add still further structure and refinement to the knowledge bases, which then makes them still better sources of features for informing the AI algorithms:
Once this threshold of feature generation is reached, we now have a virtuous dynamo for knowledge discovery and management. We can use our AI techniques to refine and improve our knowledge bases, which then makes it easier to improve our AI algorithms and incorporate still further external information. Effectively utilized KBAI thus becomes a generator of new information and structure.
This virtuous circle has not yet been widely applied beyond the early phases of, say, adding more facts to Wikipedia, as some of our examples above show. But these same basic techniques can be applied to the very infrastructural foundations of KBAI systems in such areas as data integration, mapping to new external structure and information, hypothesis testing, diagnostics and predictions, and the myriad of other uses to which AI has been hoped to contribute for decades. The virtuous circle between knowledge bases and AIs does not require us to make leaps and bounds improvements in our core AI algorithms. Rather, we need only stoke our existing AI engines with more structure and knowledge fuel in order to keep the engine chugging.
The vision of a growing nexus of KBAI should also prove that efficiencies and benefits also increase through a power function of the network effect, similar to what I earlier described in the Viking algorithm . We know how we can extract further structure and benefit from Wikipedia. We can see how such a seed catalyst can also be the means for mapping and relating more specific domain knowledge bases and structure. The beauty of this vision is that we already can see the threshold benefits from a decade of KBAI development. Each new effort — and there are many — will only act to add to these benefits, with each new increment contributing more than the increment that came before. That sounds to me like productivity, and a true basis for wealth creation. See Lee Gomes, 2014. “Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts,” in IEEE Spectrum, 20 Oct 2014. See also http://www.kdnuggets.com/2014/11/berkeley-michael-jordan-big-data-transformative-not-delusion.html.  Lee D. Erman and Victor R. Lesser, 1975. A Multi-level Organization for Problem Solving Using Many Diverse, Cooperating Sources of Knowledge, DARPA Report AD-AO12-919, Carnegie-Mellon University, Pittsburgh, Pa. 15213 , March, 1975, 24 pp. See http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA012916. The authors followed this up with a 1978 book, System engineering techniques for artificial intelligence systems. Academic Press, 1978.  See “Method for converting a programmable logic controller hardware configuration and corresponding control program for use on a first programmable logic controller to use on a second programmable logic controller,” US Patent No 5142469 A, August 25, 1992.  Éloi Bossé, Jean Roy, Steve Wark, Eds., 2007. Concepts, Models, and Tools for Information Fusion, Artech House Publishers, 2007-02-28, SKU-13/ISBN: 9781596939349. Chapter 11 is devoted to “Knowledge-Based and Artificial Intelligence Systems.”  Eduard Hovy, Roberto Navigli, and Simone Paolo Ponzetto, 2013. “Collaboratively Built Semi-Structured Content and Artificial Intelligence: The Story So Far,” Artificial Intelligence 194 (2013): 2-27. See http://wwwusers.di.uniroma1.it/~navigli/pubs/AIJ_2012_Hovy_Navigli_Ponzetto.pdf.  See M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010. The listing as of its last update included 246 articles.  See the combined references between  and ; also, see Wikipedia’s own “Wikipedia in Academic Studies.”  Wade Roush, 2010. “The Story of Siri, from Birth at SRI to Acquisition by Apple–Virtual Personal Assistants Go Mobile.” Xconomy. com 14 (2010).  Special Issue on “This is Watson“, 2012. IBM Journal of Research and Development 56(3/4), May/June 2012. See also this intro to Watson video.  Amit Singhal, 2012. ” Introducing the Knowledge Graph: Things, not Strings,” Google Blog, May 16, 2012  Thomas Schulz, 2013. “Google’s Quest to End the Language Barrier,” September 13, 2013, Spiegel Online International. See http://www.spiegel.de/international/europe/google-translate-has-ambitious-goals-for-machine-translation-a-921646.html. Also see, Alon Halevy, Peter Norvig, and Fernando Pereira, 2009. “The Unreasonable Effectiveness of Data,” in IEEE Intelligent Systems, March/April 2009, pp 8-12.  Greg Linden, Brent Smith, and Jeremy York, 2003. “Amazon. com Recommendations: Item-to-item Collaborative Filtering,” Internet Computing, IEEE 7, no. 1 (2003): 76-80.  Athima Chansanchai, 2014. “Microsoft Research Shows off Advances in Artificial Intelligence with Project Adam,” on Next at Microsoft blog, July 14, 2014. Also, see .  Jimmy Lin and Alek Kolcz, 2012. “Large-Scale Machine Learning at Twitter,” SIGMOD, May 20–24, 2012, Scottsdale, Arizona, US. See http://www.dcs.bbk.ac.uk/~DELL/teaching/cc/paper/sigmod12/p793-lin.pdf.  Robert Hof, 2014. “Interview: Inside Google Brain Founder Andrew Ng’s Plans To Transform Baidu,” Forbes Online, August 28, 2014.  Alok Prasad and Lee Feigenbaum, 2014, “How Semantic Web Tech Can Make Big Data Smarter,” in CMSwire, Oct 6, 2014.  The Web 1T 5-gram data set is available from the Linguistic Data Corporation, University of Pennsylvania.  Franz Josef Och, 2005. “Statistical Machine Translation: Foundations and Recent Advances,” The Tenth Machine Translation Summit, Phuket, Thailand, September 12, 2005.  Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell, 2010. “Toward an Architecture for Never-Ending Language Learning,” in AAAI, vol. 5, p. 3. 2010.  Fabian M. Suchanek and Gerhard Weikum, 2014. “Knowledge Bases in the Age of Big Data Analytics,” Proceedings of the VLDB Endowment 7, no. 13 (2014).  Roberto Navigli and Simone Paolo Ponzetto, 2012. “BabelNet: The Automatic Construction, Evaluation and Application of a Wide-coverage Multilingual Semantic Network,” Artificial Intelligence 193 (2012): 217-250.  Rahul Gupta, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu, 2014. “Biperpedia: An Ontology for Search Applications,” . Proceedings of the VLDB Endowment 7(7), 2014.  Robert Speer and Catherine Havasi, “Representing General Relational Knowledge in ConceptNet 5,” LREC 2012, pp. 3679-3686.  Douglas B. Lenat and R. V. Guha, 1990. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project, Addison-Wesley, 1990 ISBN 0-201-51752-3.  Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives 2007. “DBpedia: A Nucleus for a Web of Open Data,” presented at ISWC 2007, Springer Berlin Heidelberg, 2007.  Feng Niu, Ce Zhang, Christopher Ré, and Jude W. Shavlik, 2012. “DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference,” in VLDS, pp. 25-28. 2012.  Zaiqing Nie, J-R. Wen, and Wei-Ying Ma, 2012. ” Statistical Entity Extraction From the Web,” in Proceedings of the IEEE 100, no. 9 (2012): 2675-2687.  Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor, 2008. “Freebase: a Collaboratively Created Graph Database for Structuring Human Knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247-1250. ACM, 2008.  Mark Wick and Bernard Vatant, 2012. “The GeoNames Geographical Database,” available from the World Wide Web at http://geonames.org (2012).  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, 2009. ” ImageNet: A Large-scale Hierarchical Image Database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255.  Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates, 2005. “Unsupervised Named-entity Extraction from the Web: An Experimental Study,” Artificial Intelligence 165, no. 1 (2005): 91-134.  Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun and Wei Zhang, 2014. “Knowledge Vault: a Web-scale Approach to Probabilistic Knowledge Fusion,” KDD 2014.  Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu, 2012. “Probase: A Probabilistic Taxonomy for Text Understanding,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 481-492. ACM, 2012.  M.K. Bergman, 2008. “UMBEL: A Subject Concepts Reference Layer for the Web”, Slideshare, July 2008.  Denny Vrandečić and Markus Krötzsch, 2014. “Wikidata: A Free Collaborative Knowledgebase,” Communications of the ACM 57, no. 10 (2014): 78-85.  Vivi Nastase and Michael Strube. 2013. “Transforming Wikipedia into a Large Scale Multilingual Concept Network,” Artificial Intelligence 194 (2013): 62-85.  Simone Paolo Ponzetto, and Michael Strube, 2007. “Deriving a Large Scale Taxonomy from Wikipedia,” in AAAI, vol. 7, pp. 1440-1445. 2007.  Christiane Fellbaum and George Miller, eds., 1998. WordNet: An Electronic Lexical Database, MIT Press, 1998.  Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum, 2013. “YAGO2: a Spatially and Temporally Enhanced Knowledge Base from Wikipedia,” Artificial Intelligence 194 (2013): 28-61; and Fabien M. Suchanek, Gjergji Kasneci, Gerhard Weikum, 2007. “YAGO: a Core of Semantic Knowledge,” in Proceedings of the 16th international Conference on World Wide Web, pp. 697-706. ACM, 2007.  See the T-Box and A-Box discussion with particular reference to artificial intelligence on M.K. Bergman, 2014. “Big Structure and Data Interoperability,” on AI3:::Adaptive Information blog, August 18, 2014.  M.K. Bergman, 2014. “The Value of Connecting Things – Part II: The Viking Algorithm,” on AI3:::Adaptive Information blog, September 14, 2014.  Cade Metz, 2013. “Facebook Taps ‘Deep Learninig’ Giant for New AI Lab,” on Wired.com, December 9, 2013.  Santosh Divvala, Ali Farhadi and Carlos Guestrin, 2014. “Learning Everything about Anything: Webly-Supervised Visual Concept Learning,” in CVPR 2014.
Peg, the well-being indicator system for the community of Winnipeg, recently won the international Community Indicators Consortium Impact Award, presented in Washington, D.C. on Sept. 30. Peg is a joint project of the United Way of Winnipeg (UWW) and the International Institute for Sustainable Development (IISD). Our company, Structured Dynamics, was the lead developer for the project, which is also based on SD’s Open Semantic Framework (OSF) platform.
Peg is an innovative Web portal that helps identify and, on an ongoing basis, track indicators that relate to the economic, environmental, cultural and social well-being of the people of Winnipeg. Datasets may be selected and compared with a variety of charting and mapping and visualization tools at the level of the entire city, neighborhoods or communities. All told, there are now more than 60 indicators within Peg ranging from active transportation to youth unemployment rates representing thousands of individual records and entities. We last discussed Peg upon its formal release in December 2013.
Congratulations to all team members!
I have been dazzled by what Stephen Wolfram has been doing with Wolfram Alpha since Day 1. The mathematical capabilities are exceedingly impressive, the backing knowledge base is exceedingly impressive, and the visualization has also come to be exceedingly impressive. The problem: no open source or anything.
I’ve never met Wolfram and acknowledge he has seen success many times greater than most. But, putting on my VC cap, I see an old school adherence to closed and proprietary.
I would not presume to suggest where the Wolfram folks should draw these lines on open and closed, but I definitely do counsel they work hard to open up as much as they can; it is in their own self interest. The business model of today is premised on leveraging the network effect in various ways. Total proprietary is a turn off and drives the compounding basis for the network effect into the dirt.
Maybe their recent pricing initiatives with Mathematica online are a wink toward this direction. But, better still find a core of functionality and knowledge base and release it as open source. The world will beat a path to what created all of this impressive stuff in the first place.
A few weeks back I reflected on more than a decade of involvement in the semantic Web. I appreciate the many nice comments and compliments received.
In part in reaction to that post and also because it is the retrospective season, Amit Sheth has just posted his own retrospective on the semWeb going back 15 years. Amit has been one of the leaders in the space and (according to his history) the first to obtain a semantic Web patent.
Amit’s wonderful history is more global and informed than my own, and I recommend you check it out. BTW, Amit is asking for comments from those involved in the early years for any corrections or additions.
For years now I have been a huge fan of Pandora (sorry, not apparently available to all across the world due to digital rights issues). Even though Pandora’s Music Genome was not set up as a W3C-compliant ontology, in its use and application it is effectively one. What Pandora shows is that feature selection and characterization trumps language and data structure format.
Given that, I have also been a Web scientist as to how I select and promote music to meet my musical interests.
Thus, based on my own totally unscientific study, here are a few things I’ve found worth passing on. It would be bold to call them secrets; they are really more just observations:
The net result is that I now no longer vote any of my songs on any channel as up or down. Rather, I look to the purest “seeds” that capture my mode or genre preference. If my initial selection does not provide this purity, I delete it, and try to find a better true seed.
This approach has led to some awesome channels for me, that I can then combine together, depending on mood, into mixed shuffle play with randomized channel selection. I no longer vote any song up or down; I rather look for more telling seeds.
As a couple of examples, here is a channel of less-well known 60′s goldies and a jazz guitar channel that are pure, single-seed channels, that, at least for me, provide hours of consistent music in that genre. Roll your own!
We all know the slow decline and zero take-off about the semantic Web writ large, but what we probably did not know is that Big data would gobsmack the concept. Trends do not lie.
Some ten years ago the idea of the semantic Web had about a 4x advantage in Google searches compared to that for “big data”. Today, Big data searches exceed those for the semWeb by 25-fold. The cross-over point occurred in about April 2011:
Within a year, projections are that the Big data interest level will exceed the semantic Web by 35x.
These may not be entirely apples-to-apples cases, but trends don’t often lie. Play yourself with these comparisons on Google Trends.
Big Structure is built on a foundation of reference structures, with domain structures capturing the domain at hand. These represent the target foundations for mapping schema and transforming data in the wild into an operable, canonical form. Any structure, even the most lightweight of lists and metadata, can contribute to and be mapped into this model, as this wall of structure shows:
Described below are some of these structures, in rough descending order of completeness and usefulness, for making data interoperable. Please note that any of these structures might be available as linked data.Reference Structures
In both semantics and artificial intelligence — and certainly in the realm of data interoperability — there is always the problem of symbol grounding. In the conceptual realm, symbol grounding means that when we use a term or phrase we are referring to the same thing. In the data value realm, symbol grounding means that when we refer to an object or a number, we are referring to the same measure.
UMBEL is the standard reference ontology used by Structured Dynamics. It contains 28,000 concepts (classes and relationships) derived from the Cyc knowledge base. The reference concepts of UMBEL are mapped to Wikipedia, schema.org (used in Google’s knowledge graph), DBpedia ontology classes, GeoNames and PROTON. Similar reference structures are used to ground the actual data values and attributes.
Other reference structures may be used, so long as they are rather complete in scope and coherent in their relationships. Logical consistency is a key requirement for grounding.Knowledge Bases
Knowledge bases combine schema with data in a logical manner; well-constructed ones support computations, inference and reasoning. To date, the two primary knowledge bases that we use are Wikipedia and Cyc. However, many specific domain knowledge bases also exist.
Knowledge bases are important sources for symbol grounding. It addition, because of their computability, they may be used with artificial intelligence methods to both extend the knowledge base and to refine the feature estimates used in the AI algorithms.Domain Ontologies
Domain ontologies, constructed as graphs, are the principal working structures in data interoperability. Though best practices recommend they be grounded in the reference structures, the domain structures are the ones that specifically capture the concepts and data attributes of the target information domain. More effort should be focused at this level in the wall of structure than any other.
Domain structures provide unique benefits in discovery, flexible access, and information integration due to their inherent connectedness. Further, these domain structures can be layered on top of existing information assets, which means they are an enhancement and not a displacement for prior investments. And, these domain structures may be matured incrementally, which means their development is cost-effective.Mappings
Data and schema in the wild need to be mapped and transformed into these canonical structures. What is known as data wrangling is an aspect of these mappings and transformations. Mappings thus become the glue that ties native data to interoperable forms.
Mapping is the critical bridging function in data interoperability. It requires tools and background intelligence to suggest possible correspondences; how well this is done is a key to making the semi-automatic mapping process as efficient as possible. Mapping structures are the result of the final correspondences. Mapping effort is a function of the scope of Big Structure, not the volume of Big Data.Existing Structures
A broad variety of structures occur in the wild — from database schema and taxonomies to dictionaries and lists — that need to be represented in a common form and then mapped in order to support interoperability. The common representation used by Structured Dynamics is the RDF data model.Structure Scripts
Scripting and tooling are essential to help create Big Structure efficiently.Editor’s Note: We are pleased to share with you in advance some of the text from Structured Dynamics’ new Web site.
Structured Dynamics today released version 1.10 of its open source UMBEL (Upper Mapping and Binding Exchange Layer) reference concept ontology. This new release is in preparation for a series of subsequent releases planned over the next few months as additional consistency and functionality is brought to the system.
This is the first UMBEL version to be created and loaded from scratch using a new Clojure scripting framework. Fred Giasson describes these scripts and highlights some of the new functionality in a separate blog post.
Besides this new processing framework, here are the key changes at a high level made in this new version:
This new UMBEL v. 1.10 is now updated and consistent with OpenCyc (version 4.0, dated October 12, 2012), GeoNames (version 3.1, dated October 29, 2012), schema.org (version 1.9, dated August 19, 2014) and the DBpedia ontology (version 3.9, retrieved August 18, 2014). The resulting mappings now are:
These changes have also resulted in some improvements to the umbel.org Web site and its Web services, as more fully described by Fred. The updated ontologies and mappings may be found at the UMBEL GitHub site.What’s In the Pipeline
As noted, this version update is in preparation for some pending activities, which are now moving toward completion. Subsequent releases (perhaps not in the order shown) will include:
Look here and on the UMBEL mailing list for these future announcements.
In the earlier installments of this article series we first described how to estimate the value of connections amongst Big Data datasets, premised on the network effect. We then went into detail in the second part about the Viking algorithm (VKG) derived to capture the Value of Knowledge Graphs.
In this concluding part of the series we summarize the use and implications of the Viking algorithm on your Big Data planning. We do so by offering ten guidelines for how including Big Structure may be leveraged in the context of a knowledge graph. We then conclude the series with some caveats on the interpretation of these results and a discussion of possible future directions.Ten Guidelines for Big Structure
As you think through your Big Data initiatives, we recommend you keep these ten guidelines relating to data structure in mind:
Of course, mere connections for structure’s sake is silly. It is important that the structure added and connections made are correct, consistent and coherent. Even then, not all types of connections are created equal, with typing the most important, annotations the least.
The good thing is that Big Structure can be added as a slight increase over standard data wrangling efforts, and with much greater impact than standard wrangling. Further, the structures themselves, preferably guided by domain ontologies, are a means of testing these factors for subsequent structure additions. Not only does adding structure get easier with a foundation of existing structure, but it increases the value of the information by orders of magnitude.Not the Last Word
Roughly twenty years ago Metcalfe’s law triggered a gold rush in trying to achieve network effects at Internet scale. Though the algorithm proved too optimistic at larger scales, the idea of the benefit from connections was firmly established. Ten years ago it was clear that some form of diminishing returns needed to be applied to connections at scale. Zipf’s law was not a bad guess, though we have subsequently learned that more graph-centric measures are more appropriate and accurate for estimating value. Now, the Viking algorithm has emerged as the best estimator of the value of connections within Big Data.
I suspect we will see further improvements to the Viking algorithm as: 1) we come to better understand graph structures (including the effects of clusters and cliques); and 2) we learn to distinguish the different value of different types of connections . We can already see that typing and categorization have better structural effects above annotations. We further can see that the correctness of asserted “facts” is a key to realizing the multiplier benefits of connections and structures. Thus, we should see improved means for screening and testing assertions for their accuracy at scale.
At this stage, what the Viking algorithm gives us is a defensible means for assessing the value of adding structure (through connections) to our datasets. We see these multiplier effects to be huge, and to compound to even still further benefits with scale. We also see that the most developed forms of structure — namely, ontologies — bring still further benefits in inference and testable coherence. All Big Structure efforts should be aiming to express all of the structural insights for the organization and its datasets into these ontological forms .
While our current proxy for value — namely, asserted “facts” — is useful, it would also be helpful to be able to translate these “fact” assertions into a monetary value. As we move down this path we will discover, again, that not all “facts” are created equal, and some have more monetary value than others. Transitioning our estimates of value to a monetary basis will help set parameters for the cost-benefit analysis of data collection and structurizing that is the ultimate basis for planning Big Data and Big Structure initiatives.
In the end, many things need to be analyzed to understand the impacts of each connection and structure metric on the value of the resulting graph. But, what today’s current understanding of the network effect and the Viking algorithm brings us is a better means to understand and quantify the benefits of connected information. By any measure, these value benefits are multiples of what we see for unconnected data, the multiples of which grow massively with the scale of the data and their connections.
Big Structure is fertile ground for bringing in the sheaves. Let the harvesting begin. Though not further discussed here, the ontologies also provide the means for tagging (providing structure) to unstructured documents, which also brings the multiplier benefits from structure. On the retrieval side, such structure also aids faceting and filtered “slicing and dicing” of underlying datasets, thereby improving retrieval efficacy.  As one of the first approaches to capture these nuances, see Mischa Dohler, Thomas Watteyne, Fabrice Valois and Jia-Liang Lu, 2008. Kumar’s, Zipf’s and Other Laws: How to Structure a Large-Scale Wireless Network?, published in Annales des Telecommunications – Annals of Telecommunications 63, 5-6 pp. 239-251. See http://hal.archives-ouvertes.fr/docs/00/40/58/67/PDF/Large_Scale_Networks_journal_FINAL.pdf.
Yesterday in the first part of this series we raised the important question of how to value connections made between data. At the Big Data scales represented, we prepared a Basic Facts case of up to 2 billion assertions. We are using asserted “facts” as our value proxy. We’ll talk more about value and caveats in the third part of our series, tomorrow.
We saw that early estimates of network effects, such as Metcalfe’s law, overestimate value at scale. We looked at Zipf’s law as a means to capture the diminishing value of connections given the distance between facts. In today’s article we will focus on these factors of interaction and potential value in the specific context of knowledge graphs. Knowledge graphs are Big Structure representations that capture the schematics, concepts and measures in any given knowledge domain (that is, any domain of human activity).
Since I first tried to address the value of knowledge networks some five years ago , I have been disturbed about a couple of things . First, I felt that the exponential or geometric bases for estimating the value of information connections were not correct, both because they fail at scale and they don’t discriminate that some connections work and are more important, while others are trivial or don’t work. Capturing this law of diminishing value in a context that makes sense for knowledge bases was, I felt, the key to answering the value riddle.
I believe we have now, in this series, provided a compelling basis for solving that riddle, which also points the way to further improvements. This assertion is an exciting statement, in that we now may have a quantitative basis in hand for determining where and how to spend our monies for Big Data and Big Structure initiatives. Such quantitative tools are a huge boon to bring analytic rigor to the data collection and integration challenge.Adding connections (“Big Structure“) to Big Data can increase the value of enterprise information from one to three orders of magnitude; the value also scales linearly with added structure (attributes).
This article shows that adding connections (“Big Structure“) at Big Data scales can increase the value of enterprise information from one (ten) to three (thousands) orders of magnitude. The magnitude of the value scales linearly with each added structure (attribute). These value multipliers from adding Big Structure are a tremendously cost-effective addition to standard data wrangling efforts.The Value of Knowledge Graphs (VKG) Formulation
The recognition of the need for a law of diminishing returns to reflect the distance between facts or assertions is a central argument in the Briscoe-Odlyzko-Tilly formulation (see  directly, and the prior Part I discussion). Not all information is connected, and not all connected information is of equivalent worth. The implied question in these statements, however, is how to capture those differences?
The B-O-T (or sometimes, O-T) formulation does not choose a bad starting proxy for this diminishment law. Zipf’s law reflects many observed distributions in human objects, roughly equivalant to power law, Pareto (“80:20″) distributions or n log (n) diminishing returns with long-tail characteristics. Examples include Internet distributions (such as popularity of Web sites or search terms), human language distributions, income rankings, population distributions, etc. There is no question that Zipf’s law distributions are common and frequent.
The only problem with picking the Zipf’s law basis, however, is that there was absolutely no evidence that such occurred for information networks or knowledge graphs. Zipf’s law distributions tend to be statements across single types for a single attribute distribution. Graphs, we can safely say, are anything but this distribution. Connections and multiple types are the rule, not the exception.
So, maybe the B-O-T formulation was correct, and maybe it was not. There was no empirical evidence to support this assertion for knowledge graphs. And, there did not appear to be a compelling logic argument for relating Zipf’s law to graphs other than they are artifacts of human endeavor.
My discomfort in adopting this arbitrary B-O-T basis, even though solidly embedded in human experience, caused me to seek alternative ideas and explanations, but also ones that fulfilled the key structural insights of diminishing returns and non-equivalent assertions that were the focal points of B-O-T, all within a graph context.The Starting Basis
The breakthough occurred when I discovered an obscure, un-cited paper by Yaakov Stein . Stein, a network and signals processing researcher of the first rank , wrote his paper as a means to understand and quantify his experience of joining LinkedIn and expanding his network. He began without an account and documented his experience as he joined and expanded his network of contacts on LinkedIn. He charted direct links, and then meticulously looked at and recorded secondary and tertiary links.
His formulation recognized that the value to an individual user equaled raising the access to the entire network (1) for that user plus the diminishing benefit represented by the participating graph’s other participants as measured by average degree of separation (d). d is an inherent measure of the graph type.
Though his context was a social network, the basic observation obtains: relations diminish by distance within a graph, with average link distance (directly related to degree of separation) being the key relevance metric. Connected “facts” or “friends” is essentially the same thing. It is all about what is shared amongst graph nodes.
The usefulness of this approach is that it grounds the multiplier effect in an inherent characteristic of the source graph, its average degree of separation . Like Zipf’s law, the degree of separation is a distance measure, but one grounded specifically in graphs. Here is the Stein formulation:
A graph with a degree of separation of 4, then, would exhibit a network-wide power factor of 5/4 (4/4 plus 1/4).The Viking Algorithm
As applied to knowledge graphs, however, this formulation still has two problems. The first minor one is that the degree of separation parameter should be D (the average across structure) rather than d. The second substantive one is that a correction factor needs to be included that accounts for the probability that an assertion may be false. This factor, F, is 1 – the measured error rate.
The resulting algorithm we term the Value of Knowledge Graph formulation, or the VKG (Viking) algorithm. It is expressed like this:
F is meant to be analogous to F-measure, the combined precision and recall statistic for information retrieval and NLP tasks. F in the case of the Viking algorithm is also meant to be a combined statistic that represents the “accuracy” (verifiable truthfulness) of statements asserted in the graph. F is essentially an estimated value for the residual falsity for the average statement in a graph, after removal of all assertions that do not meet existing coherency, consistency or completeness tests. F is determined by sampling statements across the graph and manually testing for truthfulness (or in a logical sense, validity given the existing statements in the graph). An F of 1 signifies complete truthfulness (accuracy); an F of 0 represents complete falsity .Viking in Relation to Other Network Estimators
Now, with this explanation of basis, we can again look at the value of the Viking (VKG) algorithm in comparison to those discussed in the first part of this series. Again on a logarithmic scale, here are those results:
Excluding the exponential and geometric multipliers (namely, the “laws” of Metcalfe and Reed) in the top two curves, this shows the Viking (VKG) algorithm to have higher value than the B-O-T (O-T) algorithm, both of which are considerably higher than the Basic Facts. However, because the Figure 1 above has a logarithmic scale, these differences are harder to discern.Viking Benefits Over the Basic Facts
Now corrected with our assumed F factor, we can begin to tease out the value benefits of connecting “facts” versus the unconnected Basic Facts. As with any logarithmic function, we see that the value benefits from connections increase in a growing manner at larger scales. For example, as Figure 2 shows below, at a level of 1000 records, the benefits from connections are 7x greater than unconnected data. By the time the scale grows to 1 million or 500 million records, the value benefits of connections grows to 44x to 215x, respectively:
Benefits from connections increase as a power function at increasing scales.Setting the VKG Factor D
But the potential value of connectedness is also a function of the general degree of information separation for the given domain. We are still in the early phases of gathering statistics for such things, but the table below summarizes what is known about the “standard” level of connectivity in various domains and applications. Note, in general, most any knowledge graph would have a D factor ranging from 2 to 8:Category Degrees of Separation (D) Notes Food webs ~ 2  Genetic differences ~ 3  LinkedIn ~ 3  Twitter 3.435 – 4.67  Facebook 3.74  Potential research collaborators ~ 4  UMBEL ~ 5.2  Social networks (general) ~ 6 Mobile ad hoc networks ~ 7  Small-world networks (max) ~ 8  Table 1. Degrees of Separation for Various Knowledge Networks
More tightly linked, cohesive domains tend to have the lower degrees of separation. It is also interesting to note that some social networks, like Twitter and Facebook, are also able to lower degrees of separation (in comparison to their nominal “social network” benchmark) by virtue of the nature of their service.
As experience is gained and with more research, I expect more estimates and more refined ones. Depending on the nature of the domain at hand, it should then be possible to pick the closest analog to use in the Viking valuation algorithm. Nonetheless, we already have a range and respective values to provide meaningful value estimates today.
Using the values in Table 1, we are thus able to plot the effects (again, log scale) of these various degrees of separation in terms of the “fact” assertions that can be made for our Big Data test dataset:
At the nominal Big Data scales of 100,000 and 1,000,000 records, the value of data connections in comparison to the unconnected Basic Facts case shows these following value improvement multipliers:Domain 100,000 Records 1,000,000 Records Food webs 203x 611x Genetic differences 38x 84x Twitter 23x 46x Facebook 17x 33x Potential research collaborators 14x 26x UMBEL 8x 12x Social networks (general) 5x 8x Mobile ad hoc networks 3x 5x Table 2. Multiplier (X) Improvements by Domain from Connections Over the Basic Facts
Of course, our “Big Data Example” from Part I was silent about the exact nature of its knowledge graph. Based on empircial experience to date, the benefits from connecting data that was previously unconnected should fall somewhere within the limits of Table 2. Even at rather low scales and more loosely-connected domains, the value improvements in making connections with data is many-fold. At larger scales for tighter networks, the multipliers can become astounding.Adding Structure to the Underlying Data
Another implication that the Viking algorithm allows us to test is the comparative benefit from adding structure to our datasets. Actually, “adding structure” is not strictly correct; it is “structurizing” the data via characterizations, attributes and categorizations. Of course, not all structure is created equal. Assigning or classifying our records into types, for example, applies to all records across the datasets and provides powerful cross-record linkages. Adding annotations or metadata to single records provides much lower benefits.
When we add structure across datasets the value improvements are a linear percent, as this figure shows:
For our Big Data example, each across-dataset structure characterization adds about 25% to 30% value per structure. Adding four structural characterizations, for example, more than doubles the “facts” assertion value (~ 140%) to the datasets.Preview of Last Part
The last part forthcoming tomorrow will summarize the implications from the Viking algorithm on the role and importance of Big Structure to your organization’s Big Data efforts. Some caveats and future directions will conclude the series. The two articles written at that time were, M.K. Bergman, 2009. Structure the World, in AI3:::Adaptive Information blog, August 3, 2009, and M.K. Bergman, 2009, The Law of Linked Data, AI3:::Adaptive Information blog, October 11, 2009.  See also the then-current state of analysis by Eric Hellman, 2009. Normal and Inverse Network Effects for Linked Data, published in his blog, October 15, 2009; see http://go-to-hellman.blogspot.com/2009/10/normal-and-inverse-network-effects-for.html.  Bob Briscoe, Andrew Odlyzko, and Benjamin Tilly, 2006. “Metcalfe’s Law is Wrong,” in IEEE Spectrum, July 2006. A copy may be viewed at http://www.cse.unr.edu/~yuksem/teaching/nae/reading/2006-briscoe-metcalfes.pdf. Odlyzko, and Tilly had published an earlier version, (sometimes the approach is shown as O-T in addition to B-O-T), and the basic form of the algorithm appears in a single Odlyzko paper.  Yaakov (J) Stein, 2009. The Value of Being Linked In, on his personal Web site, April 2009; see http://www.dspcsp.com/pubs/linkedin.pdf. Note that his empirical tests suggested a degree of separation for LinkedIn of 3.  The average degree of separation is simply the graph’s average path distance – 1. For an explanation of average path distance, see .  F is a summed average value across all assertions within a knowledge graph. In information retrieval, F-measures are now being achieved that exceed 0.90 (90%). For the cases used herein, F is estimated at 0.85. Again, this parameter is measured after all standard coherency, consistency, and completeness tests are applied to the ontology. These tests routinely remove many false assertions and establish the basic integrity of the graph. This acceptance threshold is itself constantly improving as experience is gained with basic graph integrity tests. In other words, tomorrow’s thresholds will be higher than today’s.  Richard J. Williams, Eric L. Berlow, Jennifer A. Dunne, Albert-László Barabási, and Neo D. Martinez, 2002. Two Degrees of Separation in Complex Food Webs, in Proceedings of the National Academy of Sciences, 99 (20):12913-12916, September 16, 2002, doi:10.1073/pnas.192448799; see http://www.pnas.org/content/99/20/12913.full.  See http://blog.23andme.com/ancestry/three-genetic-degrees-of-separation/.  Reza Bakhshandeh, Mehdi Samadi, Zohreh Azimifar and Jonathan Schaeffer, 2011. Degrees of Separation in Social Networks, in Proceedings, The Fourth International Symposium on Combinatorial Search (SoCS-2011), 6 pp.; see http://www.aaai.org/ocs/index.php/SOCS/SOCS11/paper/viewFile/4031/4352; and Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon, 2010. What is Twitter, a Social Network or a News Media?, in Proceedings of the 19th International Conference on World Wide Web, April 26–30, 2010, Raleigh, North Carolina, pp. 591-600, ACM; see http://snap.stanford.edu/class/cs224w-readings/kwak10twitter.pdf.  Lars Backstrom et al., 2012. Four Degrees of Separation, Archiv.org, January 6, 2012; see http://arxiv.org/pdf/1111.4570.pdf.  Paweena Chaiwanarom, Ryutaro Ichise, and Chidchano Lursinsap, 2010. Finding Potential Research Collaborators in Four Degrees of Separation, pp. 399-410, in Longbing Cao, Jiang Zhong, and Yong Feng, eds., Advanced Data Mining and Applications, Springer Berlin Heidelberg, http://dx.doi.org/10.1007/978-3-642-17313-4_39.  See http://fgiasson.com/blog/index.php/2014/08/11/graph-analysis-of-a-big-structure-umbel/#Average-Path-Length-Distribution.  Maria Papadopouli and Henning Schulzrinne, 2000. Seven Degrees of Separation in Mobile ad hoc Networks, presented at Global Telecommunications Conference, 2000 (GLOBECOM’00), IEEE. Vol. 3; see http://www.huaxiaspace.net/academic/classes/wi02/cse294/20020222globecom2000.pdf.  Paolo Pin, 2006. Eight Degrees of Separation, in Nota di Lavoro, Fondazione Eni Enrico Mattei, No. 78.2006 see http://www.econstor.eu/bitstream/10419/74249/1/NDL2006-078.pdf.
The hackneyed phrase of “connect the dots” reflects our basic intuition that there is value in making connections amongst relevant data. But, what is this value? How might we quantify it? This topic, and a method and guidelines for doing so, are the subject of this article and the second and third parts to follow.
The reason it is important to quantify the value of connected information is that such an estimate helps to define what effort or cost we can justify in order to derive those connections. In Big Data, for example, we already know that 50% to 80% of the costs in assembling relevant datasets is due to data wrangling — the effort to extract, transform and clean the input data . No where, however, do we know what it is worth to go to the next step of working to connect those data.
Quantifying this understanding will thus also help determine what the value is in developing Big Structure, the approach we have been most recently discussing for how to organize and connect Big Data. Big Structure sets the schematic and data relationships for how data from disparate sources can be connected together.
About five years ago I wrote my first articles on how we might approach the quantification of these information connections . That first, cursory look was useful for bounding the problem, but no firm conclusions as to how to specifically quantify this value were proposed. Like other graphs or networks, the usefulness of the ‘network effect‘ to bound the question was clear. It has taken further research and experiences with actual linked datasets to point to how to resolve this quantification challenge.Foundations in the Network Effect
The network effect was first realized in the early days of telephone networks, where the value of the system increased as a function of more users . We have also long recognized a similar effect in connecting information together and the breaking down of information or ‘data silos‘. As the following diagram shows, unconnected data nodes or silos look like random particles caught in the chaos of Brownian motion:
As initial connections get made, bits of structure begin to emerge. But, as connections are proliferated — analogous to the network effects of connected networks — coherence and more structure emerge.
This emergence of structure is particularly evident in physical networks, such as the growth of this hypothetical telecommunications network:
This diagram, modified from Wikipedia to be a horizontal image, shows how two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections, etc. It is this very multiplier effect that has led to most of the thinking of how to quantify the network effect.
We can see an interesting parallel between telecommunications networks and knowledge graphs. In the telecommunication network, the addition of a new user (node) by definition brings with it connections. This is what is shown in Figure 2. But in information silos, the information is already there (nodes, or the left-side of Figure 1); what is missing are the connections (the right-side of Figure 1). By explicitly adding connections we can also create network effects, as others have noted .
However, once we understand these parallels, we must also recognize the differences. To properly estimate the network effects of knowledge graphs, we must be explicit about the similarities and differences with other (physical) networks .Objectives for a Knowledge Graph Formulation
Since our objective is to quantify the “value” of a knowledge graph, we must first ask what is the basis of this value. In the best of all worlds, we would know the monetary worth of information, so we could justify what to spend in order to leverage it, which of course varies wildly across bases and sources. But we don’t. We do know, however, that a knowledge graph and the information it connects to constitutes a knowledge base. In the context of a knowledge base, the measure of value is the number of “correct” facts it contains. Therefore, we will use the number of connections in the graph (equivalent to the number of triple statements) as a proxy for value, representing the asserted “facts” of the graph.
We will also seek measures of graph distance and connectedness to capture the network-like qualities of the knowledge base. The characteristics of the graph itself should be the input base upon which to estimate value.Alternative Estimates of the Network Effect
The earliest effort to estimate the value of physical networks was Sarnoff’s law, developed by David Sarnoff, for many years the leader of the Radio Corporation of America (RCA). He posited that the value of a broadcast network was directly proportional to its number of viewers (n). However, the problem with this formulation is that a broadcast network is only one way, from broadcaster to user. What of networks where there is interaction or two-way linkages? The benefits of such networks must surely be more than linear.
Once we get into interaction effects we get into multipliers. And the proper nature of those multipliers must come from the nature and extent of those interactions, as well as perhaps the nature of the network itself. I discuss below some of the more prominent candidates that have been put forward for estimating the network effect, or the value of networks.Metcalfe’s Law
Metcalfe’s law was the first direct derivation from the telecommunications model. Robert Metcalfe formulated it about 1980 in relation to Ethernet and fax machines; the “law” was then named for Metcalfe and popularized by George Gilder in 1993. The actual algorithm proposed by Metcalfe calculates the number of unique connections in a network with n nodes to be n(n − 1)/2, which is proportional to n2. This makes Metcalfe’s law a quadratic growth equation.
The law is generally simplified  to state that the value of a telecommunications network is proportional to the square of the number of users of the system (n²):
Gilder’s popularization and the early growth of the Internet made the estimation of the benefits of network effects a very timely topic. For example, as a value measure, the network effect could be used to estimate the benefits for larger and larger numbers of users. Some have even blamed Metcalfe’s law for contributing to the creation (and then bursting) of the “dot-com bubble” of the late 1990s .
Metcalfe’s law clearly showed that interaction effects between nodes could generate multipliers that scaled rapidly with increasing numbers of nodes (users).Reed’s Law
From a different perspective and with a different take, David Reed came up with a multiplier formulation that is the largest presented — anywhere. Reed’s context is social groups, and from that perspective he can envision arbitrary sized groups forming amongst any and all participants (nodes). Because of this theoretical, global scope, justified through examples such as eBay and chat rooms, Reed specifically defined group-forming networks (GFNs), as the applicable scope . The simplified formulation for Reed is:
In scope and context, Reed does not apply to knowledge graphs, and even in the areas of social groups, most researchers find the exponential implications of Reed’s law unsupportable . The next group, for example, offers direct criticism.Briscoe – Odlyzko – Tilly Formulation
Under the provocative title, “Metcalfe’s Law is Wrong,” Briscoe, Odlyzko, and Tilly challenged both the Metcalfe and Reed approaches in 2006 . Using the proxy of Internet valuation, the authors were able to show how impractical the implications of either approach were at scale. Like the bet of rice (or wheat) doubling each of the 64 squares on a chessboard bankrupting the kingdom, the exponential implications of these two “laws” can be seen to (eventually) violate common sense.
The fundamental fallacy associated with both the Metcalfe and Reed approaches is that all potential links are of equal value . But no where in the real world do we see this to be true. There must be some law of diminishing returns that must be applied to slow the unsustainable rates of exponential or (to a lesser extent) quadratic growth.
After much hand waving, the authors chose Zipf’s law as their basis for this diminishing return. The increasing “decay rate” with distance is a common distribution pattern for real-world datasets, which Zipf’s law specifically addresses, always showing power law distributions with long tails. To approximate this distribution they offered the simple n log (n) formulation of Zipf’s law .
This is a reasonable approximation, but one that is never related directly to the nature of graphs or networks. That is the source of the next layer of refinements.VKG Formulation
I will discuss this algorithm, our recommended formulation, in Part II of this series.A Big Data Example
In order to discuss further the question of value arising from network effects, we need a case study example. We also need to define “value”, which in this case study example, as also noted above, means the number of “facts” or assertions in our database . This we call the Basic Facts (“assertions”) column.
For our basic “facts”, we consider a data series that is doubling in size for each step, eventually reaching a half billion records. Each record has four attributes or characterizations, leading to a total of more than 2 billion “facts” in the database. These are the first two columns in this table:Records Basic Facts (“assertions”) Big Data Example 1 4 4 2 8 8 4 16 16 8 32 32 16 64 64 32 128 128 64 256 256 128 512 512 256 1,024 1,024 512 2,048 2,048 1,024 4,096 4,100 2,048 8,192 8,200 4,096 16,384 16,400 8,192 32,768 32,800 16,384 65,536 65,600 32,768 131,072 131,200 65,536 262,144 262,404 131,072 524,288 524,812 262,144 1,048,576 1,049,624 524,288 2,097,152 2,099,248 1,048,576 4,194,304 4,198,496 2,097,152 8,388,608 8,396,996 4,194,304 16,777,216 16,793,992 8,388,608 33,554,432 33,587,984 16,777,216 67,108,864 67,175,972 33,554,432 134,217,728 134,351,944 67,108,864 268,435,456 268,703,888 134,217,728 536,870,912 537,407,780 268,435,456 1,073,741,824 1,074,815,564 536,870,912 2,147,483,648 2,149,631,128 Table 1. Basic Facts Connections with a Big Data Example
At this point, we have no connections between records. Each record has four attributes each, in isolation. This basis is akin to the unconnected dots on the left side of Figure 1 above.
For our Big Data Example, we will posit a record matching procedure as our first task for a new Big Data initiative. The assumption is that across all records, one-in-10000 matches another record. This results in the number of assertions (“facts”) shown in the third column in the table above. The posited Big Data initiative results in a 0.10% increase in “facts”, irrespective of record scale, once the matching threshold is reached. This result is not terribly impressive, but is perhaps not too unrelated from a first foray into a Big Data project.
Note that the following charts and analyses (including in the next part tomorrow) use as their “Basic Facts” the number of “assertions”, or the middle column in the table above. Though Big Data may represent an initial 0.10% improvement over this, that is immaterial to what our Big Structure viewpoints will provide. So, our “Basic Facts” will be unconnected records.Applying Network Effects to the Basic Facts
We can now apply our various network effect estimators to this base case. And, because of the fast-compounding nature of both the Reed and Metcalfe approaches, we need to plot this out on logarithmic scale  (click to enlarge):
On a logarithmic scale, the O-T (Briscoe-Odlyzko-Tilly) and VKG formulations appear only marginally better than the Basic Facts base case, but that is only due to the swamping effects of the unrealistic growth multipliers. We’ll get into this more tomorrow.Preview of Next Part
The next part forthcoming tomorrow will use this foundation to describe the VKG algorithm, and some of implications of its characteristics, all in the context of knowledge networks or graphs. “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets,” is a quote from Steve Lohr, 2014, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” August 17, 2014, New York Times, see http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html. Also, as another example of the common 80% estimate for data preparation costs, see http://radar.oreilly.com/2013/09/data-analysis-just-one-component-of-the-data-science-workflow.html.  These two articles were, M.K. Bergman, 2009. Structure the World, in AI3:::Adaptive Information blog, August 3, 2009, and M.K. Bergman, 2009, The Law of Linked Data, AI3:::Adaptive Information blog, October 11, 2009. The same concerns I had at that time in the current state of analysis was captured by Eric Hellman, 2009. Normal and Inverse Network Effects for Linked Data, published in his blog, October 15, 2009; see http://go-to-hellman.blogspot.com/2009/10/normal-and-inverse-network-effects-for.html.  These network effect benefits were reportedly a major driver of Theodore Newton Vail‘s efforts to consolidate the thousands of initial telephone networks in the United States under the banner of the American Telephone & Telegraph (Ma Bell) company.  See, for example, James Hendler and Jennifer Golbeck, 2008. Metcalfe’s Law, Web 2.0, and the Semantic Web, in Web Semantics: Science, Services and Agents on the World Wide Web 6(1): 14-20; see http://www.cs.umd.edu/~golbeck/downloads/Web20-SW-JWS-webVersion.pdf.  Babak Hodjat and Adam Cheyer, 2003. Evolution of the Laws that Deal with the Utilization of Information Networks, in Masoud Nikravesh, Lotfi A. Zadeh and Janusz Kacprzyk, eds., Studies in Fuzziness and Soft Computing, Vol 164/2005, pp. 427-438, Springer, Berlin. See http://www.adam.cheyer.com/papers/KnowledgeNetworks_Formatted.pdf.  For a well-connected network, every node (n) connects to every other node (n-1), which gives us n*(n-1) or (n2 – n). Working this out, two nodes have two connections (2*2 – 2), three nodes have six connections (3*3 – 3) and the expression converges on the square of ‘n’ for larger values of ‘n’, e.g., (100*100 – 100) is 99% of (100*100). This convergence at larger number is the basis for the exponential simplification, 2n. Most of the other ‘laws’ stated herein are simplifications in a similar manner.  See, for example, Sara F. Peralta, 2011. Moore’s Law, Metcalfe’s Law, and the Dot Com Bubble, November 27, 2011, see https://storify.com/sarafperalta/moore-s-law-metcalfe-s-law-bubble. Also see .  David P. Reed, 1999. That Sneaky Exponential—Beyond Metcalfe’s Law to the Power of Community Building, August 27, 1999; online at http://www.reed.com/dpr/locus/gfn/reedslaw.html. For original version, see http://contextmag.com/archives/199903/digitalstrategyreedslaw.asp. Like Metcalfe, at smaller numbers the actual formula is 2n -n – 1, which rapidly converges to 2n.  However, one group has published an alternative formulation consistent with the Reed approach; see Kalevi Kilkki, and Matti Kalervo, 2004. KK-law for Group Forming Services, in XVth International Symposium on Services and Local Access, Edinburgh, March 2004. See http://kotisivukone.fi/files/50ajatelmaa.ajatukset.fi/tiedostot/Others/kilkki_kk-law.pdf.  Bob Briscoe, Andrew Odlyzko, and Benjamin Tilly, 2006. “Metcalfe’s Law is Wrong,” in IEEE Spectrum, July 2006. A copy may be viewed at http://www.cse.unr.edu/~yuksem/teaching/nae/reading/2006-briscoe-metcalfes.pdf. Odlyzko, and Tilly had published an earlier version, (sometimes the approach is shown as O-T in addition to B-O-T), and the basic form of the algorithm appears in a single Odlyzko paper.  See, for example, https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Zipf_s_law.html.  In specific terms, each “fact” in our knowledge base is an assertion, represented as an RDF triple statement. Because some of these assertions may not, in fact, be true, the use of “fact” does not imply universal truthfulness. Rather, an assertion that passes current tests for logic, coherency, consistency or completeness is what is retained, even though its truthfulness is not certain. Therefore, separate adjustment factors (parameters) need to be applied to address accuracy tests.  We adjusted the scale further to reduce the exponential absurdity of the Reed approach by manually shifting the scale downward. As a result, the Reed approach exits the chart rather quickly, heading straight up.
In the first parts of this series we introduced the idea of Big Structure, and the fact that it resides at the nexus of the semantic Web, artificial intelligence, natural language processing, knowledge bases, and Big Data. In this article, we look specifically at the work that Big Structure promotes in data interoperability as a way to clarify what the roles these various aspects play.
By its nature, data integration (the first step in data interoperability) means that data is being combined across two or more datasets. Such integration surfaces all of the myriad aspects of semantic heterogeneities, exactly the kinds of issues that the semantic Web and semantic technologies were designed to address. But resolving semantic differences can not be fulfilled by semantic technologies alone. While semantics can address the basis of differences in meaning and context, resolution of those differences or deciding between differing interpretations (that is, ambiguity) also requires many of the tools of artificial intelligence or natural language processing (NLP).
By decomposing this space into its various sources of semantic heterogeneities — as well as the work required in order to provide for such functions as search, disambiguation, mapping and transformations — we can begin to understand how all of these components can work together in order to help achieve data interoperability. This understanding, in turn, is essential to understand the stack and software architecture — and its accompanying information architecture — in order to best achieve these interoperability objectives.
So, this current article lays out this conceptual framework of components and roles. Later articles in this series will address the specific questions of software and information architectural design.Data Interoperability in Relation to Semantics
Semantic technologies give us the basis for understanding differences in meaning across sources, specifically geared to address differences in real world usage and context. These semantic tools are essential for providing common bases for relating structured data across various sources and contexts. These same semantic tools are also the basis by which we can determine what unstructured content “means”, thus providing the structured data tags that also enable us to relate documents to conventional data sources (from databases, spreadsheets, tables and the like). These semantic technologies are thus the key enablers for making information — unstructured, semi-structured and structured — understandable to both humans and machines across sources. Such understandings are then a key basis for powering the artificial intelligence applications that are now emerging to make our lives more productive and less routine.
For nearly a decade I have used an initial schema by Pluempitiwiriyawej and Hammer to elucidate the sources of possible semantic differences between content. Over the years I have added language and encoding differences to this schema. Most recently, I have updated this schema to specifically call out semantic heterogeneities due to either conceptual differences between sources (largely arising from schema differences) and value and attribute differences amongst actual data. I have further added examples for what each of these categories of semantic heterogenities means .
This table of more than 40 sources of semantic heterogeneities clearly shows the possible impediments to get data to interoperate across sources:Class Category Subcategory Examples Type   LANGUAGE Encoding Ingest Encoding Mismatch For example, ANSI v UTF-8  Concept Ingest Encoding Lacking Mis-recognition of tokens because not being parsed with the proper encoding  Concept Query Encoding Mismatch For example, ANSI v UTF-8 in search  Concept Query Encoding Lacking Mis-recognition of search tokens because not being parsed with the proper encoding  Concept Languages Script Mismatch Variations in how parsers handle, say, stemming, white spaces or hyphens Concept Parsing / Morphological Analysis Errors (many) Arabic languages (right-to-left) v Romance languages (left-to-right) Concept Syntactical Errors (many) Ambiguous sentence references, such as I’m glad I’m a man, and so is Lola (Lola by Ray Davies and the Kinks) Concept Semantics Errors (many) River bank v money bank v billiards bank shot Concept CONCEPTUAL Naming Case Sensitivity Uppercase v lower case v Camel case Concept Synonyms United States v USA v America v Uncle Sam v Great Satan Concept Acronyms United States v USA v US Concept Homonyms Such as when the same name refers to more than one concept, such as Name referring to a person v Name referring to a book Concept Misspellings As stated Concept Generalization / Specialization When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” but the other schema has multiple elements such as “home phone,” “work phone” and “cell phone” Concept Aggregation Intra-aggregation When the same population is divided differently (such as, Census v Federal regions for states, England v Great Britain v United Kingdom, or full person names v first-middle-last) Concept Inter-aggregation May occur when sums or counts are included as set members Concept Internal Path Discrepancy Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are different levels of remove) Concept Missing Item Content Discrepancy Differences in set enumerations or including items or not (say, US territories) in a listing of US states Concept Missing Content Differences in scope coverage between two or more datasets for the same concept Concept Attribute List Discrepancy Differences in attribute completeness between two or more datasets Attribute Missing Attribute Differences in scope coverage between two or more datasets for the same attribute Attribute Item Equivalence When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state) Concept When two individuals are asserted as being the same when they are actually distinct (for example, John Kennedy the president v John Kennedy the aircraft carrier) Attribute Type Mismatch When the same item is characterized by different types, such as a person being typed as an animal v human being v person Attribute Constraint Mismatch When attributes referring to the same thing have different cardinalities or disjointedness assertions Attribute DOMAIN Schematic Discrepancy Element-value to Element-label Mapping One of four errors that may occur when attribute names (say, Hair v Fur) may refer to the same attribute, or when same attribute names (say, Hair v Hair) may refer to different attribute scopes (say, Hair v Fur) or where values for these attributes may be the same but refer to different actual attributes or where values may differ but be for the same attribute and putative value.Many of the other semantic heterogeneities herein also contribute to schema discrepancies Attribute Attribute-value to Element-label Mapping Attribute Element-value to Attribute-label Mapping Attribute Attribute-value to Attribute-label Mapping Attribute Scale or Units Measurement Type Differences, say, in the metric v English measurement systems, or currencies Attribute Units Differences, say, in meters v centimeters v millimeters Attribute Precision For example, a value of 4.1 inches in one dataset v 4.106 in another dataset Attribute Data Representation Primitive Data Type Confusion often arises in the use of literals v URIs v object types Attribute Data Format Delimiting decimals by period v commas; various date formats; using exponents or aggregate units (such as thousands or millions) Attribute DATA Naming Case Sensitivity Uppercase v lower case v Camel case Attribute Synonyms For example, centimeters v cm Attribute Acronyms For example, currency symbols v currency names Attribute Homonyms Such as when the same name refers to more than one attribute, such as Name referring to a person v Name referring to a book Attribute Misspellings As stated Attribute ID Mismatch or Missing ID URIs can be a particular problem here, due to actual mismatches but also use of name spaces or not and truncated URIs Attribute Missing Data A common problem, more acute with closed world approaches than with open world ones Attribute Element Ordering Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ Attribute Sources of Semantic Heterogeneities
Ultimately, since we express all of our content and information with human language, we need to start there to understand the first sources in semantic differences. Like the differences in human language, we also have differences in world views and experience. These differences are often conceptual in nature and get at what we might call differences in real world perspectives and experiences. From there, we encounter differences in our specific realms of expertise or concern, or the applicable domain(s) for our information and knowledge. Then, lastly, we give our observations and characterizations data and values in order to specify and quantify our observations. But the attributes of data are subject to the same semantic vagaries as concepts, in addition to their own specific challenges in units and measures and how they are expressed.
From the conceptual to actual data, then, we see differences in perspective, vocabularies, measures and conventions. Only by systematically understanding these sources of heterogeneity — and then explicitly addressing them — can we begin to try to put disparate information on a common footing. Only by reconciling these differences can we begin to get data to interoperate.
Some of these differences and heterogeneities are intrinsic to the nature of the data at hand. Even for the same putative topics, data from French researchers will be expressed in a different language and with different measurements (metric) than will data from English researchers. Some of these heterogeneities also arise from the basis and connections asserted between datasets, as misuse of the sameAs predicate shows in many linked data applications .
Fortunately, in many areas we are transitioning by social convention to overcome many of these sources of semantic heterogeneity. A mere twenty years ago, our information technology systems expressed and stored data in a multitude of formats and systems. The Internet and Web protocols have done much to overcome these sources of differences, what I’ve termed elsewhere as climbing the data federation pyramid . Semantic Web approaches where data items are assigned unique URIs are another source of making integration easier. And, whether all agree from a cultural aspect if it is good, we are also seeing English become the lingua franca of research and data.
The point of the table above is not to throw up our hands and say there is just too much complexity in data integration. Rather, by systematically decomposing the sources of semantic heterogeneity, we can anticipate and accommodate those sources not yet being addressed by cultural or technological conventions. While there is a large number of categories of semantic heterogeneity, these categories are also patterned and can be anticipated and corrected. These patterned sources inform us about what kind of work must be done to overcome semantic differences where they still reside.Work Components in Data Interoperability
The description logics that underly the semantic Web already do a fair job of architecting this concept-attribute split in semantics. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts .
The semantic Web is a standards-based effort by the W3C (World Wide Web Consortium); many of its accomplishments have arisen around ontology and TBox-related efforts. Data integration has putatively been tackled from the perspective of linked data, but that methodology so far is short on attributes and property-mapping linkages between datasets and schema. There are as yet no reference vocabularies or schema for attributes . Many of the existing linked data linkages are based on erroneous owl:sameAs assertions. It is fair to say that attribute and ABox-level semantics and interoperability have received scarce attention, even though the logic underpinnings exist for progress to be made.
This lack on the attributes or ABox-side of things is a major gap in the work requirements for data interoperability, as we see from the table below. The TBox development and understanding is quite good; and, a number of reference ontologies are available upon which to ground conceptual mappings . But the ABox third is largely missing grounding references. And, the specialty work tasks, representing about the last third, are needful of better effectiveness and tooling.
For both the TBox and the ABox we are able to describe and model concepts (classes), instances (individuals), and are pretty good at being able to model relationships (predicates) between concepts and individuals. We also are able to ground concepts and their relationships through a number of reference concept ontologies . But our understanding of attributes (the descriptive properties of instances) remains poor and ungrounded. Best practices — let alone general practices — still remain to be discovered.TBox (concepts) Specialty Work Tasks ABox (data)
Across the knowledge base (that is, the combination of the TBox and the ABox), the semantic Web has improved its search capabilities by formally integrating with conventional text search engines, such as Solr. Instance and consistency checking are pretty straightforward to do, but are often neglected steps in most non-commercial semantic installations. Critical areas such as mappings, transformations and identity evaluation remain weak work areas. This figure helps show these major areas and their work splits:
As we discussed earlier on the recent and rapid advances of artificial intelligence , the combination of knowledge bases and the semantic Web with AI machine learning (ML) and NLP techniques will show rapid improvements in data interoperability. The two stumbling blocks of not having a framework and architecture for interoperability, plus the lack of attributes groundings, have been controlling. Now that these factors are known and they are being purposefully addressed, we should see rapid improvements, similar to other areas in AI.
This re-embedding of the semantic Web in artificial intelligence, coupled with the conscious attention to provide reference groundings for data interoperability, should do much to address what are current, labor-intensive stumbling blocks in the knowledge management workflow.Putting Some Grown-up Pants on the Semantic Web
The semantic Web clearly needs to play a central role in data integration and interoperability. Fortunately, like we have seen in other areas , semantic technologies lend themselves to generic functional software that can be designed for re-use in most any knowledge domain, chiefly by changing the data and ontologies guiding them. This means that reference libraries of groundings, mappings and transformations can be built over time and reused across enterprises and projects. Use of functional programming languages will also align well with the data and schema in knowledge management functions and ontologies and DSLs. These prospects parallel the emergence of knowledge-based AI (KBAI), which marries electronic Web knowledge bases with improvements in machine-learning algorithms.
The time for these initiatives is now. The complete lack of distributed data interoperability is no longer tolerable. High costs due to unacceptable manual efforts and too many failed projects plague the data interoperability efforts of the past. Data interoperability is no longer a luxury, but a necessity for enterprises needing to compete in a data-intensive environment. At scale, point-to-point integration efforts become ineffective; a form of reusable and transferable master data management (MDM) needs to emerge for the realiites of Big Data, and one that is based on the open and standard protocols of the Web.
Much tooling and better workflows and user interfaces will need to emerge. But the critical aspects are the ones we are addressing now: information and software architectures; reference groundings and attributes; and education about these very real prospects near at hand. The challenge of data interoperability in cooperation with its artificial intelligence cousin is where the semantic Web will finally put on its Big Boy pants. See Charnyote Pluempitiwiriyawej and Joachim Hammer, 2000. A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources, Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See https://cise.ufl.edu/tr/DOC/REP-2000-396.pdf. I first cited this report and extended it to cover languages (see ) in M.K. Bergman 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006. See http://www.mkbergman.com/232/sources-and-classification-of-semantic-heterogeneities/). This most recent version added the examples and expanding the listing a bit further, to where it is no longer faithful to the original 2000 paper.  Concept is the shorthand used for the schema or classes or TBox. Attribute is the shorthand used for instance data or entities and their ABox. I segregate class-relation properties (predicates) from instance-describing properties (attributes). This distinction is not use in standard TBox-ABox splits; its rationale will be described in a further article.  See M.K. Bergman, 2006. Tutorial: Internet Languages, Character Sets and Encodings, BrightPlanet Corporation Technical Documentation, March 2006, 13 pp. See http://www.mkbergman.com/wp-content/themes/ai3v2/files/2006Posts/InternationalizationTutorial060323.pdf.  See . Also the TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) to which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure.  M.K. Bergman 2009. When Linked Data Rules Fail, AI3:::Adaptive Information blog, November 16, 2009. See http://www.mkbergman.com/846/when-linked-data-rules-fail/.  M.K. Bergman 2006. Climbing the Data Federation Pyramid, AI3:::Adaptive Information blog, May 25, 2006. See http://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.  M.K. Bergman 2008. Thinking ‘Inside the Box’ with Description Logics, AI3:::Adaptive Information blog, November 10, 2008. See http://www.mkbergman.com/466/thinking-inside-the-box-with-description-logics/.  See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.  Examples of upper-level ontologies include UMBEL, the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak. See further the Wikipedia entry on upper ontologies.  M.K. Bergman 2014. Spring Dawns on Artificial Intelligence, AI3:::Adaptive Information blog, June 2, 2014. See http://www.mkbergman.com/1731/spring-dawns-on-artificial-intelligence/.  M.K. Bergman 2011. Ontology-driven Apps Using Generic Applications, AI3:::Adaptive Information blog, March 7, 2011. See http://www.mkbergman.com/948/ontology-driven-apps-using-generic-applications/.
In our recent two-part series we described a decade of experience working in the semantic Web (Part I) and our view that Big Structure, which resides at the nexus of the semantic Web, knowledge bases and artificial intelligence, was a key component of making sense of Big Data going forward (Part II). We are at a time when multiple advances are conjoining to create new opportunities and excitement.
Data without context and relationships is meaningless. The idea of Big Data is powerful, but it is often presented as either a “good thing” in and of itself, or a mantra for something that is rather undefined. There is no doubt that with the Internet and the Web we are now able to generate and access data at unprecedented scale. There is also no question that tracking mechanisms and cheap storage — and simpler, large-scale databases and Web services — mean that we can also capture data and structure of natures previously unseen. Everyone knows the remarkable growth in exabytes and more.
The prospect of data everywhere — some useful with important context and some not — has clearly captured the current discussion. Heck, if we claim Big Data, we even make more in wage or consulting charge-out fees. Who can argue with that?
Well, actually, anyone interested in meaningful data or cross-dataset interoperability can argue with that. Big Data is great, except it means little if we can not combine that data across multiple sources for potentially multiple purposes. (Remember, one of the “V’s” of Big Data is variability.) Once the question of what data means gets brought to the fore, it is now time for context and relationships. Structure in an information context means that which situates or describes data in an interpretable way. Big Data needs a Big Structure complement to make sense of it all.What is a Big Structure?
Big Structure is data relationships and context that can be combined into a coherent framework to enable dataset interoperability and understanding. By necessity, Big Structure implies that the meaning of data can be understood and its values can be brought to common bases such that analysis, testing and validation can be applied across values. Big Structure is not a monolithic thing, but the combination of multiple things that give data meaning and context. As such, Big Structure is often a re-purposing of existing structural assets, plus other special sauce, organized for the aim of data interoperability.Big Structure is data relationships and context that can be combined into a coherent framework to enable dataset interoperability and understanding.
The components of Big Structure can be identified and characterized. These components can be assessed for usefulness and authoritativeness, and then incorporated into broader structures that ultimately bring the topics of what the data is about and the values of that data into alignment. Thus, Big Structure is also a mindset and approach to selecting and combining structures such that broad dataset interoperability can be achieved.
Big Structure is actually a continuum or family of concept and data relationships, any one of which is also a contributor to helping to map and interoperate data. Ultimately, the components of Big Structure get combined into reference graph structures that place the concepts and actual data values of the Big Data into context. There are certain ways to use and organize existing structures to achieve these Big Structure objectives; some of these ways are described in this article.
Once the components of Big Structure are combined into these reference graphs we then can also use network or graph analysis to understand the relationships amongst the constituent data items. This recursive nature of graph reference structures to organize the constituent data and then to use those graphs to analyze the data is one of the hallmark characteristics of Big Structure.
Big Structure thus involves the need to identify and then organize constituent forms of structure into coherent reference frameworks. Concepts in contributing datasets are then mapped to these structures, and the attributes and values of the underlying data are also transformed into canonical representations. It is these mappings and transformations that provide the interoperability of Big Structure. Big Structure therefore continues to evolve by adding more and more reference structures, all coherently organized.Contributors to Big Structure
Big Structure is a family of canonical reference structures that help guide mapping and interoperability. The table below lists some of the possible contributors to Big Structure , roughly in descending order as to the degree of structure and its contribution to interoperability. The table provides both definitions and use descriptions for each component, plus optionally some notes regarding coverage and use:Structure Type Definition Use Note Reference ontologies Major grounding structures for orienting and interoperating concepts or data The reference concepts for orienting all data and domain information  Reference attributes Major grounding structures for interoperating data and data characterizations The reference relationships amongst data descriptions and characteristics, which also provides the means for transformations between heterogeneous representations  Data model (RDF) A self-consistent means for describing the structure of data and their relationships The “canonical” data model at the heart of the system; provides a single interoperability point; RDF is the canonical model used by Structured Dynamics for its Big Structures  Domain attributes The data descriptions and characteristics for the constituent datasets in the applicable domain(s) The reference attributes specific to the domain(s) at hand (which are generally more specific than general reference attributes) Domain ontologies The formal conceptualization of a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts The reference concepts and their relationships specific to the domain(s) at hand; generally are mapped to the reference ontologies  Concept maps A diagram that depicts suggested relationships between concepts Structurally similar to a domain ontology; a few related terms shown in Note  Schema The structure of a database that defines the objects and relationships in that database Organizing framework for relational databases (and their tables)  Mappings The process of creating data element correspondences between two distinct data models or schema Mapping predicates are used to relate concepts or attributes from two different datasets or knowledge bases to one another. Mappings are often a precursor to various transformations to bring data into a common representation  Taxonomies A particular classification of related concepts, often of a hierarchical nature Hierarchical relationships are expressed in narrower or broader terms (or subClassOf); may also be see also relationships  Facets Clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics of a class or specific subject Facets can provide alternative ways for classifying objects beyond a single taxonomy Categories Grouping objects based on similar properties A category may be viewed as equivalent to a concept  Tables A collection of related data held in a structured format, generally a two-dimensional layout of rows (records) and columns (fields) Simplest and most common data presentation format Synsets A group of data elements or terms that are considered semantically equivalent for the purposes of information retrieval Also known as a “semset” in the parlance of UMBEL Metadata Data providing information about one or more aspects of the source data, thus “data about data” It is the description of what data is about rather than the values and attributes of the actual data Thesauri A form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects A thesaurus is composed a list of words (or terms), a vocabulary for relating these words (or terms) to one another, often hierarchically, and a set of rules on how to use these aspects Gazetteers A listing of similar entity types with associated structural data (such as countries and population or standard codes) Often used in relation to people or place entity types, though any class of entities may have a gazetteer Controlled vocabularies The use of predefined, authorized terms as preselected by the sponsor to enforce consistency in terminology Applied to specific domains or sub-domains, with single controlled vocabularies per official language used Reference lists Authoritative listings of similar objects, each uniquely identified by name or code May be as simple as a comprehensive list of countries with associated ISO codes  Dictionaries A repository of information about data such as meaning, relationships to other data, origin, usage, or format In our context, can range from the meaning associated with standard word dictionaries to the more formal data dictionary Glossaries An alphabetical list of terms in a particular domain with the definitions for those terms Definition is the only structured information provided Nested lists Related concepts or entities organized by some form of hierarchical relationship (narrower, broader, subClassOf, etc.) Akin to a simple taxonomy Ordered lists A finite, ordered collection of values for a given type May also be additional information linked to the listing Clusters A set of objects grouped according to some basis of similarity (type, attributes, or characteristics) Basis for how the objects got clustered is not always obvious Unordered lists A container of similar items or entities, with no implied order or sequence Also known as a “bag” or “collection”  Values The actual data; a normal form or a type member Basic QUDT ontologies could contribute here
An alternate way to look at these contributor structures is to characterize them with respect to degree of structure and degree of contributing to interoperability:
In general, as might be expected, the greater the degree of structure, the greater its potential contribution to interoperability. The components in the upper right quadrant represent the most structured and interoperable ones. These also conform most to the use of W3C standards for the RDF data model and the OWL ontology languages. Expressions of structure are codified and standardized. Use of best practices also ensures completeness and suitability as reference groundings for interoperability.
The lower left portions of the quadrant represent the least structure and interoperability. However, as standard reference means for characterizing and describing data, even structures in this quadrant can contribute to meeting Big Structure requirements. Tagging of documents (unstructured data) occurs in this less-sophisticated lower left quadrant, but it gives equal footing to 80% of the content that generally resides in text form. (The interoperability system is further enhanced when the basis of the tags is derived from the “semsets” of the reference and domain ontologies, another example of a best practice.)
All of the listed components can thus contribute to Big Structure. However, the completeness of that structure and its usefulness for interoperability increases as one progresses along the blue arrow of the Big Structure continuum. Data interoperability arises from the continued efforts to drive Big Structure to the upper right of this quadrant. As noted, Big Structure is a mindset and process rather than some finite state. As more concepts and attributes get grounded in standard references, the degree of Big Structure (and, thus, data interoperability) continues to increase.The Foundation of Reference Groundings
In both semantics and artificial intelligence — and certainly in the realm of data interoperability — there is always the problem of symbol grounding. In the conceptual realm, symbol grounding means that when we use a term or phrase we are referring to the same thing; that is, the referent is the same. In the data value realm, symbol grounding means that when we refer to an object or a number — say, the number 4.1 — we are also referring to the same metric. 4.1 inches is not the same as 4.1 centimeters or 4.1 on the Richter scale, and object names for set member types also have the same challenges of ambiguous semantics as do all other things referred to by language.
The variability V in Big Data or the 40-some dimensions of potential semantic heterogeneity  are explicit recognitions of the symbol grounding challenge. Assuming we can determine context (itself an important consideration not further discussed here), fixity of reference is essential to these groundings. Context and groundings are the ways by which we remove ambiguity in what we measure and record.
Like dictionaries for human languages, or stars and constellations for navigators, or agreed standards in measurement, or the Greenwich meridian for timekeepers, fixed references are needed to orient and “ground” each new dataset over which we attempt to integrate. Without such fixities of reference, everything floats in reference to other things, the cursed “rubber ruler” phenomenon.
Thus, we can express our Big Structure components from a foundational perspective as well. In Structured Dynamics‘ view of the world, the foundation for data interoperability is grounded in reference structures or ontologies that provide the fixity of reference for concepts and data and their attributes. Upon these foundations are then constructed the domain views of concepts and attributes, which become the target for mapping other references and Big Structures:
The mappings, transformations and domain and reference ontologies are themselves written in the OWL languages of the W3C and the standards of the RDF data model. At this most expressive end of Big Structure, the representations are in the form of graphs. Network and graph analytics will expand still further business intelligence prospects. The use of these standards with common and testable logic is another means to ensure coherency and interoperability of the Big Structure that results.
Note a key aspect of the grounding foundation is missing: one or more reference ontologies for attributes. Though many examples exist on the concept side, little has been done to explicitly address the questions of data value interoperability. This major gap is a current emphasis of Structured Dynamics, with much that will be said over the coming weeks. Also expect an open source reference ontology for attributes in the near future.
The thing is that we are learning how to make the various parts of this interoperability stack work. We are leveraging existing structural assets of all kinds to establish the semantics and infrastructure for domain interoperability. We know how to match and map these existing structural assets to the reference frameworks that are the foundation to interoperability.A Vision of Interoperability
The real world is one of heterogeneous datasets, multiple schema and differing viewpoints. Even within single enterprises — and those which formerly expressed little need or interest to interoperate with the broader world — data integration and interoperability has been a real challenge. Big Data itself is not solving these problems. Quite the opposite. Big Data trends are turning data interoperability molehills into mountain-high competitive threats.
Like any well-built structure, data interoperability requires a solid foundation. That foundation must reside in exemplar reference ontologies upon which to ground the semantics and exchange standards for data. Using the canonical RDF data model makes this task practical. Existing information structures of various types across the enterprise and the Web all can and should play a role in establishing reference structures. The accretion of reference structures will lead to still further interoperability and the ability to incorporate more datasets. Currently expensive practices in, say, master data management (MDM) can begin to transition to a new paradigm. It is easy to envision working from a library of existing reference standards for use across enterprises. This kind of incremental expansion of interoperability leads to still more interoperable data in a virtuous cycle of innovation and lower budgets.
As our computing continues to get more virtual and cloud-like, physical and hardware and software architectures must give way to information architectures (in the true sense of interoperability). We have no choice but to treat the architecting of information as a first-order challenge. The totally cool thing about the data integration challenge is that the architecture can be readily varied and tested to achieve a working foundation. Much empirical information exists about how to do it and what to do next. The chief challenge has been to recognize that data interoperability — and its dependence on Big Structure — is a first-order concern (and opportunity). The intersection of Big Structure with Big Data, and with graph and AI algorithms, should create new approaches to chew across the data integration environment. I expect progress to be rapid. There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach. See M.K. Bergman, 2007. An Intrepid Guide to Ontologies, AI3:::Adaptive Information blog, May 16, 2007.  UMBEL and other upper level ontologies are examples here. In the case of UMBEL, that Big Structure is used as a scaffolding of reference concepts used to link external (unrelated) structures to help inter-operating data between two unrelated systems. Such a Big Structure can also be used for other tasks such as helping machine learning techniques to categorize and disambiguate pieces of data by leveraring such a structure of types.  Unfortunately, no reference structures for attributes yet exist. For a discussion of this status, see the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.  Data models encompass a rather broad span. The RDF discussion represents a more formal end of the data model spectrum, wherein there is complete logic, syntax and serialization discussions, more involved than most data models.  Domain ontologies represent the most closely-aligned view of the domain and its relationships of all of the component structures listed.  Concept maps are very closely related to ontologies, and may include topic maps, mind maps and other graph-like structures of concepts.  Schema may apply to many realms, but in the IT and software context schema mostly refers to database schema related to relational databases. These are often expresssed in UML diagrams or XML schema.  Mappings and transformatons are a huge area of diverse structure and different serializations and specifications. Fortunately, the task of mapping external structure to RDF removes the many-to-many issues with most transformation approaches.  Taxonomies mask an entire sub-categories of directories, folksonomies, subject trees, and more. The key aspect is that relevant concepts are expressed in a graph relationship manner to other concepts, often in a hierarchical fashion.  Categories also includes the general classification process.  I would consider a canonical references listing of country names and codes to be a part of Big Structure, since they act as a controlled vocabulary.  This is a key area for including unstructured documents, since tags are a primary means of adding metadata to a document. When the pool of tags is based on the governing reference and domain ontologies, then interoperability is further promoted.  M.K. Bergman, 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006.