Previous installments in this series have listed existing ontology tools, overviewed development methodologies, and proposed a new approach to building lightweight, domain ontologies [1]. For the latter to be successful, a new generation in ontology development tools is needed. This post provides an explication of the landscape under which this new generation of tools is occurring.
Ontologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data.
We are now concluding the first decade of ontology development tools, especially those geared to the semantic Web and its associated languages of RDFS and OWL. Last year we also saw the release of the major update to the OWL 2 language, with its shift to more expressiveness and a variety of profiles. The upcoming next generation of ontology tools now must also shift.
The current imperative is to shift away from ontology engineering by a priesthood to pragmatic daily use and maintenance by domain practitioners. Market growth demands simpler, task-focused tools with intuitive interfaces. For this change to occur, the general tools architecture needs to shift its center of gravity from IDEs and comprehensive toolkits to APIs and Web services. Not surprisingly, this same shift is what has been occurring across all areas of software.
Methodology Reprise: The Nature of the LandscapeIn the previous installment of this series, we presented a new methodological approach to ontology development, geared to lightweight, domain ontologies. One aspect of that design was to separate the operational workflow into two pathways:
The ontology build methodology concentrated on the upper half of this diagram (blue, with yellow lead-ins and outcomes) with the various steps overviewed in that installment [2]:
The methodology captured in this diagram embraces many different emphases from current practice: re-use of existing structure and information assets; conscious split between instance data (ABox) and the conceptual structure (TBox) [3]; incremental design; coherency and other integrity testing; and explicit feedback for scope extension and growth. The methodology also embraces some complementary utility ontologies that also reflect the design of ontology-driven apps [4].
These are notable changes in emphasis. But they are not the most important one. The most important change is the tools landscape to implement this methodology. This landscape needs to shift to pragmatic daily use and maintenance by domain practitioners. That requires simpler and more task-oriented tools. And that change in tooling needs a still more fundamental shift in tools architecture and design.
A Legacy of Excellent First Generation ToolsIn many places throughout this series I use the term “inadequate” to describe the current state of ontology development tools. This characterization is not a criticism of first-generation tools per se. Rather, it is a reflection of their inadequacy to fulfill the realities of the new tooling landscape argued in this series. The fact remains, as initial generation tools, that many of the existing tools are quite remarkable and will play central roles (mostly for the professional ontologist or developer) moving forward.
At the risk of overlooking some important players, let’s trace the (partial) legacy of some of the more pivotal tools in today’s environment.
As early as a decade ago the ontology standards languages were still in flux and the tools basis was similarly immature. Frame logic, description logics, common logic and many others were competing at that time for primacy and visibility. Most ontology tools at that time such as Protégé [5], OntoEdit [6], or OilEd [7] were based on F-logic or the predecessor to OWL, DAML+Oil. But the OWL language was under development by the W3C and in anticipation of its formal release the tools environment was also evolving to meet it. Swoop [8], for example, was one of the first dedicated OWL browsers. A Protégé plug-in for OWL was also developed by Holger Knublauch [9]. In parallel, the OWL group at the University of Manchester also introduced the OWL API [10].
With the formal release of OWL 1.0 in 2004, ontology tools continued to migrate to the language. Protégé, up through the version 3x series, became a popular open source system with many visualization and OWL-related plug-ins. Knublauch joined TopQuadrant and brought his OWL experience to TopBraid Composer, which shifted to the Eclipse IDE platform and leveraged the Jena API [9,11]. In Europe, the NeON (Networked Ontologies) project started in 2006 and by 2008 had an Eclipse-based OWL platform using the OWL API with key language processing capabilities through GATE [12].
Most recently, Protégé and NeON in open source, and TopBraid Composer on the commercial side, have likely had the largest market share of the comprehensive ontology toolkits. So far, with the release of OWL 2 in late 2009, only Protégé in version 4 and the TwoUse Toolkit have yet fully embraced all aspects of the new specification, doing so by intimately linking with the new OWL API (version 3x has full OWL 2 support) [13]. However, most leading reasoners now support OWL 2 and products such as TopBraid Composer and Ontotext’s OWLIM support OWL 2 RL as well [14].
The evolution of Protégé to version 4 (OWL 2) was led by the University of Manchester via its CO-ODE project [15], now ended, which has also been a source for most existing Protégé 4 plug-ins. (Because of the switch to OWL 2 and the OWL API most earlier plug-ins are incompatible with Protégé 4.) Manchester has also been a leading force in the development of OWL 2 and the alternative Manchester syntax.
Though only recently stable because of the formalization of OWL 2, Protégé 4 and its linkage to the new OWL API provides for a very powerful combination. With Protégé, the system has a familiar ontology editing framework and a mechanism for plug-in migration and growth. With the OWL API, there is now a common API for leading reasoners (Pellet, HermiT, FaCT++, RacerPro, etc.), a solid ontology management and annotation framework, and validators for various OWL 2 profiles (RL, EL and QL). The system is widely embraced by the biology community, probably the most active scientific field in ontologies. However, plug-in support lags the diversity of prior versions of Protégé and there does not appear to be the energy and community standing behind it as in prior years.
A Normative Tools LandscapeThese leading frameworks and toolkits have opted to be “ontology engineering” environments. Via plug-ins and complicated interfaces (tabs or Eclipse-style panes) the intent has apparently been to provide “all capabilities in one box.” The tools have been IDE-centric.
Unfortunately, one must be a combination of ontologist, developer, programmer and IDE expert in order use the tools effectively. And, as incremental capabilities get added to the systems, these also inherit the same complexity and style of the host environment. It is simply not possible to make complex environments and conventions simple.
Curiously, the existence or use of APIs have also not been adequately leveraged. The usefulness of an API means that subsets of information can be extracted and worked on in very clear and simple ways. This information can then be roundtripped without loss. An API allows a tailored subset abstraction of the underlying data model. In contrast, IDEs, such as Protégé or Eclipse, when they play a similar role, force all interfaces to share their built-in complexity.
With these thoughts in mind, then, we set out to architect a tools suite and work flow that could truly take advantage of a central API. We further wanted to isolate the pieces into distributable Web services in keeping with our standard structWSF Web services framework design.
This approach also allows us to split out simpler, focused tools that domain users and practitioners can use. And, we can do all of this while also enabling the existing professional toolsets and IDEs to also interoperate in the environment.
The resulting tools landscape is shown in the diagram below. This diagram takes the same methodology flow from Figure 1 (blue and yellow boxes) and stretches them out in a more linear fashion. Then, we embed the various tools (brown) and APIs (orange) in relation to that methodology:
This diagram is worth expanding to full size and studying in some detail. Aspects of this diagram that deserve more discussion are presented in the sections below.
OWL API as Center of GravityAs noted in the preceding methodology installment, the working ontology is the central object being managed and extended for a given deployment. Because that ontology will evolve and grow over time, it is important the complete ontology specification itself be managed by some form of version control system (green) [16]. This is the one independent tool in the landscape.
Access to and from the working ontology is mediated by the OWL API [13]. The API allows all or portions of the ontology specification to be manipulated separately, with a variety of serializations. Changes made to the ontology can also be tested for validity. Most leading reasoners can interact directly with the API. Protégé 4 also interacts directly with the API, as can various rules engines [17]. Additionally, other existing APIs, notably the Alignment API with its own mapping tools and links to other tools such as S-Match can interact with the OWL API. It is reasonable to expect more APIs to emerge over time that also interoperate [18].
The OWL API is the best current choice because of its native capabilities and because Jena does not yet support OWL 2 [11]. However, because of the basic design with structWSF (see next), it is also possible to swap out with different APIs at a later time should developments warrant.
In short, having the API play the central management role in the system means that any and all tools can be designed to interact effectively with the working ontology(ies) without any loss in information due to roundtripping.
Web Services (structWSF) as Canonical Access LayerThe same rationale that governed our development of structWSF [19] applies here: to abstract basic services and functionality through a platform-independent Web services layer. This Web services layer has canonical (standard) ways to interact with other services and is generally RESTful in design to support distributed deployments. The design conforms to proper separation of view from logic and structure. Moreover, because of the design, changes can be made on either side of the layer in terms of user interface or functionality.
Use of the structWSF layer also means that tools and functionality can be distributed anywhere on the Web. Specialized server-side functions can be supported as well as dedicated specialty hardware. Text indexing or disambiguation services can fit within this design.
The ultimate value of piggybacking on the structWSF framework is that all other extant services also become available. Thus, a wealth of converters, data managers, and semantic components (or display widgets) can be invoked depending on the needs of the specific tool.
Simpler, Task-specific ToolsThe objective, of course, of this design is to promote more and simpler tools useful to domain users. Some of these are shown under the Use & Maintain box in the diagram above; others are listed by category in the table below.
The RESTful interface and parameter calls of the structWSF layer further simplify the ontology management and annotation abstractions arising from the OWL API. The number of simple tools available to users under this design is virtually limitless. These tools are also fast to develop and test.
Combining These New Thrusts and Moving ForwardThis landscape is not yet a full reality. It is a vision of adaptive and simpler tools, working with a common API, and accessible via platform-independent Web services. It also preserves many of the existing tools and IDEs familiar to present ontology engineers.
However, pieces of this landscape do presently exist and more are on the way. The next section briefly overviews some of the major application areas where these tools might contribute.
Individual Tools within the LandscapeIf one inspects the earlier listing of 185 ontology tools it is clear that there is a diversity of tools both in terms of scope and function across the entire ontology development stack. It is also clear that nearly all of those 185 tools listed do not communicate with one another. That is a tremendous waste.
Via shared APIs and some degree of consistent design it should be possible to migrate these capabilities into a more-or-less interoperating whole. We have thus tried to categorize some important tool types and exemplar tools from that listing to show the potential that exists. (Please note that the Example Tools are links to the tools and categories from the earlier 185 tools listing.)
This correlation of types and example tools is not meant to be exhaustive nor a recommendation of specific tools. But, this tabulation is illustrative of the potential that exists to both simplify and extend tool support across the entire ontology development workflow:
Tool Type Comments Example Tools OWL API OWL API is a Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API provides links to inferencers, managers, annotators, and validators for the OWL2 profiles of RL, QL, EL OWL API Web Services Layer This layer provides a common access layer and set of protocols for almost all tools. It depends critically on linkage and communication with the OWL API structWSF Ontology Editor (IDE) There are a variety of options in this area. Generally, more complete environments (that is, IDEs) based on OWL and with links to the OWL API are preferred. Less complete editor options are listed under other categories. Note that only Protégé 4 incorporates the OWL API NeOn toolkit,The beauty of this approach is that most of the tools listed are open source and potentially amenable to the minor modifications necessary to conform with this proposed landscape.
Key Gaps in the LandscapeContrasting the normative tools landscape above with the existing listing of ontology tools points out some key gaps or areas deserving more development attention. Some of these are:
Finally, it does appear that the effort and focus behind Protégé seems to be slowing somewhat. The future has clearly shifted to OWL 2 with Protégé 4. Yet, besides the admirable CO-ODE project (now ended), tools and plug-in support seems to have slowed. Many of the admirable plug-ins for Protégé 3x do not appear to be under active development as upgrades to Protégé 4. While Protégé’s future (and similar IDEs) seems assured, its prominence possibly will (and should) be replaced by a simpler kit of tools useful to users and practitioners.
Funding and Pending Project PrioritiesFor the past few months we at Structured Dynamics have seen ontology design and management as the pending technical priorities within the semantic technology space. Now that the market no longer looks at “ontology” as a four-letter word, it is imperative to simplify the development and use of ontologies. The first generation of tools leading up to this point have been helpful to understand the semantic space; changes are now necessary to expand it.
In our first generation we have begun to understand the types and nature of needed tools. But our focus on IDEs and comprehensive toolsets belies a developer’s or technologist’s perspective. We need to now shift focus and look at tool needs from the standpoint of users and actual use of ontologies. Many players and many toolmakers and innovators will need to contribute to build this market for semantic technologies and approaches.
Fortunately, replacing an IDE focus with one based around APIs and Web services should be a fairly smooth and natural transition. If we truly desire to be market makers, we need to stand back and place ourselves into the shoes of the domain practitioners, the subject matter experts. We need to shield actual users from all of the silly technical details and complexity. And, then, let’s focus — task-by-task — on discrete items of management and use of ontologies. Growth of the semantic technology space depends on expanding our practitioner base.
For its part, Structured Dynamics is presently seeking new projects and sponsors with a commitment to these aims. Like our prior development of structWSF and semantic components, we will be looking to make simpler ontology tools a priority in the coming months. Please let me know if you want to partner with us toward this commitment.
[1] This posting is part of a current series on ontology development and tools, now permanently archived and updated on the OpenStructs TechWiki. The series began with an update of the prior Ontology Tools listing, which now contains 185 tools. It continued with a survey of ontology development methodologies. The last part presented a Lightweight, Domain Ontologies Development Methodology. This part is archived on the TechWiki as the Normative Landscape of Ontology Tools. The last installment in the series is planned to cover ontology best practices. [2] The original version, now slightly modified, was first published in M. K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, Nov. 23, 2009. [3] The TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) to which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure. These distinctions have their grounding in description logics. [4] See M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, November 23, 2009. [5] Natalya F. Noy, Michael Sintek, Stefan Decker, Monica Crubézy, Ray W. Fergerson and Mark A. Musen, 2001. “Creating Semantic Web Contents with Protégé-2000,” IEEE Intelligent Systems, vol. 16, no. 2, pp. 60-71, Mar/Apr. 2001. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.7177&rep=rep1&type=pdf. [6] York Sure, Michael Erdmann, Juergen Angele, Steffen Staab, Rudi Studer and Dirk Wenke, 2002. “OntoEdit: Collaborative Ontology Development for the Semantic Web,” in Proceedings of the International Semantic Web Conference (ISWC) (2002). See http://www.aifb.uni-karlsruhe.de/WBS/Publ/2002/2002_iswc_ontoedit.pdf. [7] Sean Bechhofer, Ian Horrocks, Carole Goble and Robert Stevens, 2001. “OilEd: a Reasonable Ontology Editor for the Semantic Web,” in Proceedings of KI2001, Joint German/Austrian conference on Artificial Intelligence. [8] Aditya Kalyanpur and James Hendler, 2004. “Swoop: Design and Architecture of a Web Ontology Browser/Editor,” Scholarly Paper for Master’s Degree in Computer Science, University of Maryland, Fall 2004. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.1779&rep=rep1&type=pdf. [9] Holger Knublauch was formerly the designer and developer of Protégé-OWL, the leading open-source ontology editor. TopBraid Composer leverages the experiences gained with Protégé and other tools into a professional ontology editor and knowledge-base framework. Composer is based on the Eclipse platform and uses Jena as its underlying API. See further http://www.topquadrant.com/composer/tbc-protege.html. [10] Sean Bechhofer, Phillip Lord and Raphael Volz, 2003. “Cooking the Semantic Web with the OWL API,” in Proceedings of the 2nd International Semantic Web Conference, ISWC, Sanibel Island, Florida, October 2003. See http://homepages.cs.ncl.ac.uk/phillip.lord/download/publications/cooking03.pdf. [11] Jena is fundamentally an RDF API. Jena’s ontology support is limited to ontology formalisms built on top of RDF. Specifically this means RDFS, the varieties of OWL, and the now-obsolete DAML+OIL. At the time of writing, no decision has yet been made about when Jena will support the new OWL 2 features. See http://jena.sourceforge.net/ontology/. [12] The NeON Toolkitis built on OntoStudio. It is based on Eclipse with support for the OWL API. A series of its key plug-ins utilize various aspects of GATE (General Architecture for Text Engineering). The four-year project began in 2006 and its first open source toolkit was released by the end of 2007. OWL features were added in 2008-09. NeON has since completed, though its toolkit and plug-ins can still be downloaded as open source. [13] OWL API is a Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API provides links to inferencers, managers, annotators, and validators for the OWL2 profiles of RL, QL, EL. Two recent papers describing the updated API are: Matthew Horridge and Sean Bechhofer, 2009. “The OWL API: A Java API for Working with OWL 2 Ontologies,” presented at OWLED 2009, 6th OWL Experienced and Directions Workshop, Chantilly, Virginia, October 2009. See http://www.webont.org/owled/2009/papers/owled2009_submission_29.pdf; and, Matthew Horridge and Sean Bechhofer, 2010. “The OWL API: A Java API for OWL Ontologies,” paper submitted to the Semantic Web Journal; see http://www.semantic-web-journal.net/sites/default/files/swj107.pdf. [14] These links show the status of TopBraid Composer and Ontotext’s OWLIM with regard to OWL 2 RL. A newer effort, based on Eclipse, with broader OWL API support is the MOST Project’s TwoUse Toolkit. In all likelihood, the number of other tools with OWL 2 support is larger than our informal survey has found. Importantly, Jena still has not upgraded to OWL 2, but its open source site suggests it may. [15] The CO-ODE project aimed to build authoring tools and infrastructure to make ontology engineering easier. It specifically supported the development and use of OWL-DL ontologies, by being heavily involved in the creation of infrastructure and plugins for the Protégé platform and OWL 2 support for the OWL API. [16] For a great discussion on the unique aspects of version control in ontologies, see T. Redmond, M. Smith, N. Drummond and T. Tudorache, 2008. “Managing Change: An Ontology Version Control System,” paper presented at OWLED 2008, Karslruhe, Germany. See http://bmir.stanford.edu/file_asset/index.php/1435/BMIR-2008-1366.pdf. [17] Birte Glimm, Matthew Horridge, Bijan Parsia and Peter F. Patel-Schneider, 2009. “A Syntax for Rules in OWL 2,” presented at the Sixth OWLED Workshop, 23-24 October 2009. See http://www.webont.org/owled/2009/papers/owled2009_submission_16.pdf. [18] The Alignment API is one of the more venerable ones in this environment. A couple of other examples include a SKOS API (Simon Jupp, Sean Bechhofer and Robert Stevens, 2009. ” A Flexible API and Editor for SKOS,” presented at the 6th Annual European Semantic Web Conference (ESWC2009); see http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-401/iswc2008pd_submission_88.pdf) and the Ontology Common API Tasks (Tomasz Adamusiak, K Joeri van der Velde, Niran Abeygunawardena, Despoina Antonakaki, Helen Parkinson and Morris A. Swertz, 2010. “OntoCAT — A Simpler Way to Access Ontology Resources,” presented at ISMB2010, 10 July 2010. See http://www.iscb.org/uploaded/css/58/17254.pdf. OntoCAT is an open source package developed to simplify the task of querying heterogeneous ontology resources. It supports NCBO BioPortal and EBI Ontology Lookup Service (OLS), as well as local OWL and OBO files). [19] The structWSF Web services framework is generally RESTful middleware that provides a bridge between existing content and structure and content management systems and available indexing engines and RDF data stores. structWSF is a platform-independent means for distributed collaboration via an innovative dataset access paradigm. It has about twenty embedded Web services. See http://openstructs.org/structwsf.Ontologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Ontologies provide a similar role for the organization of data that is provided by relational data schema. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data [1].
There are many ways to categorize ontologies. One dimension is between upper level and mid- and lower- (or domain-) level. Another is between reference or subject (domain) ontologies. Upper-level ontologies [2] tend to be encompassing, abstract and inclusive ways to split or organize all “things”. Reference ontologies tend to be cross-cutting such as ones that describe people and their interests (e.g., FOAF), reference subject concepts (e.g., UMBEL), bibliographies and citations (e.g., BIBO), projects (e.g., DOAP), simple knowledge structures (e.g., SKOS), social networks and activities (e.g., SIOC), and so forth.
The focus here is on domain ontologies, which are descriptions of particular subject or domain areas. Domain ontologies are the “world views” by which organizations, communities or enterprises describe the concepts in their domain, the relationships between those concepts, and the instances or individuals that are the actual things that populate that structure. Thus, domain ontologies are the basic bread-and-butter descriptive structures for real-world applications of ontologies.
According to Corcho et al. [3] “a domain ontology can be extracted from special purpose encyclopedias, dictionaries, nomenclatures, taxonomies, handbooks, scientific special languages (say, chemical formulas), specialized KBs, and from experts.” Another way of stating this is to say that a domain ontology — properly constructed — should also be a faithful representation of the language and relationships for those who interact with that domain. The form of the interaction can range from work to play to intellectual understanding or knowledge.
“… ontology engineering research should strive for a unified, lightweight and component-based methodological framework, principally targeted at domain experts ….” Simperl et al. [4]Another focus here is on lightweight ontologies. These are typically defined as more hierarchical or classificatory in nature. Like their better-known cousins of taxonomies, but with greater connectedness, lightweight ontologies are often designed to represent subsumption or other relationships between concepts. They have not too many or not too complicated predicates (relationships). As relationships are added and the complexities of the world get further captured, ontologies migrate from the lightweight to the “heavyweight” end of the spectrum.
The development of ontologies goes by the names of ontology engineering or ontology building, and can also be investigated under the rubric of ontology learning. For reasons as stated below, we prefer not to use the term ontology engineering, since it tends to convey a priesthood or specialized expertise in order to define or use them. As indicated, we see ontologies as being (largely) developed and maintained by the users or practitioners within a given domain. The tools and methodologies to be employed need to be geared to these same democratic (small “d”) objectives.
A Review of Prior MethodologiesFor the last twenty years there have been many methods put forward for how to develop ontologies. These methodological activities have diminished somewhat in recent years. Yet the research as separately discussed in Ontology Development Methodologies [1] seems to indicate this state of methodology development in the field:
While there is by no means unanimity in this community, some general consenses can be seen from these prior reviews, especially those that concentrate on practical or enterprise ontologies. In terms of design objectives, this general consensus suggests that ontologies should be [4]:
While laudable, and which represent design objectives to which we adhere, current ontology development methods do not meet these criteria. Furthermore, to be discussed in our next installment, there is also an inadequate slate of tools ready to support these objectives.
A Call for a New MethodologyIf you ask most knowledgeable enterprise IT executives what they understand ontologies to mean and how they are to be built, you would likely hear that ontologies are expensive, complicated and difficult to build. Reactions such as these (and not trying to set up strawmen) are a reflection of both the lack of methods to achieve the consensual objectives above and the lack of tools to do so.
The use of ontology design patterns is one helpful approach [5]. Such patterns help indicate best design practice for particular use cases and relationship patterns. However, while such patterns should be part of a general methodology, they do not themselves constitute a methodology.
Also, as Structured Dynamics has argued for some time, the future of the semantic enterprise resides in ontology-driven apps [6]. Yet, for that vision to be realized, clearly both methods and tools to build ontologies must improve. In part this series is a reflection of our commitment to plug these gaps.
What we see at present for ontology development is a highly technical, overly engineered environment. Methodologies are only sparsely or generally documented. They are not lightweight nor collaborative nor really incremental. While many tools exist, they do not interoperate and are pitched mostly at the professional ontologist, not the domain user. In order to achieve the vision of ontology-driven apps the methods to develop the fulcrum of that vision — namely, the ontologies themselves — need much additional attention. An adaptive methodology for ontology development is well past due.
Design Criteria for an Adaptive MethodologyWe can thus combine the results of prior surveys and recommendations with our own unique approach to adaptive ontologies in order to derive design criteria. We believe this adaptive approach should be:
We discuss each of these design criteria below.
While we agree with the advisability of collaboration as a design condition — and therefore also believe that tools to support this methodology must also accommodate group involvement — collaboration per se is not a design requirement. It is an implementation best practice.
Effective ontology development is as much as anything a matter of mindset. This mindset is grounded in leveraging what already exists, “paying as one benefits” through an incremental approach, and starting simple and adding complexity as understanding and experience are gained. Inherently this approach requires domain users to be the driving force in ongoing development with appropriate tools to support that emphasis. Ontologists and ontology engineering are important backstops, but not in the lead design or development roles. The net result of this mindset is to develop pragmatic ontologies that are understood — and used by — actual domain practitioners.
Lightweight and Domain-orientedBy definition the methodology should be lightweight and oriented to particular domains. Ontologies built for the pragmatic purposes of setting context and aiding interoperability tend to be lightweight with only a few predicates, such as isAbout, narrowerThan or broaderThan. But, if done properly, these lighter weight ontologies can be surprisingly powerful in discovering connections and relationships. Moreover, they are a logical and doable intermediate step on the path to more demanding semantic analysis.
ContextualContext simply means there is a reference structure for guiding the assignment of what content ‘is about’ [7]. An ontology with proper context has a balanced and complete scope of the domain at hand. It generally uses fairly simple predicates; Structured Dynamics tends to use the UMBEL vocabulary for its predicates and class definitions, and to link to existing UMBEL concepts to help ensure interoperability [8]. A good gauge for whether the context is adequate is whether there are sufficient concept definitions to disambiguate common concepts in the domain.
CoherentThe essence of coherence is that it is a state of consistent connections, a logical framework for integrating diverse elements in an intelligent way. So while context supplies a reference structure, coherence means that the structure makes sense. With relation to a content graph, this means that the right connections (edges or predicates) have been drawn between the object nodes (or content) in the graph [9].
Relating content coherently itself demands a coherent framework. At the upper reference layer this begins with UMBEL, which itself is an extraction from the vetted and coherent Cyc common sense knowledge base. However, as domain specifics get added, these details, too, must be testable against a unified framework. Logic and coherence testing are thus an essential part of the ontology development methodology.
IncrementalMuch value can be realized by starting small, being simple, and emphasizing the pragmatic. It is OK to make those connections that are doable and defensible today, while delaying until later the full scope of semantic complexities associated with complete data alignment.
An open world approach [10] provides the logical basis for incremental growth and adoption of ontologies. This is also in keeping with the continuous and incremental deployment model that Structured Dynamics has adopted from MIKE2.0 [11]. When this model is applied to the process of ontology development, the basic implementation increments appear as follows:
The first two phases are devoted to scoping and prototyping. Then, the remaining phases of creating a working ontology, testing it, maintaining it, and then revising and extending it are repeated over multiple increments. In this manner the deployment proceeds incrementally and only as learning occurs. Importantly, too, this approach also means that complexity, sophistication and scope only grows consistent with demonstrable benefits.
Re-use of StructureFundamental to the whole concept of coherence is the fact that domain experts and practitioners have been looking at the questions of relationships, structure, language and meaning for decades. Though perhaps today we now finally have a broad useful data and logic model in RDF, the fact remains that massive time and effort has already been expended to codify some of these understandings in various ways and at various levels of completeness and scope.
These are prior investments in structure that would be silly to ignore. Yet, today, most methodologies do ignore these resources. This ignorance of prior investments in information relationships is perplexing. Though unquestioned adoption of legacy structure is inappropriate to modern interoperable systems, that fact is no excuse for re-inventing prior effort and discoveries, many of which are the result of laborious consensus building or negotiations.
The most productive methodologies for modern ontology building are therefore those that re-use and reconcile prior investments in structural knowledge, not ignore them. These existing assets take the form of already proven external ontologies and internal and industry structures and vocabularies.
Separation of the ABox and TBoxNearly a year ago we undertook a major series on description logics [12], a key underpinning to Structured Dynamics’ conceptual and logic foundation to its ontology development. While we can not always adhere to strict and conforming description logics designs, our four-part series helped provide guidance for the separation of concerns and work that can also lead to more effective ontology designs [13].
Conscious separation of the so-called ABox (assertions or instance records) and TBox (conceptual structure) in ontology design provides some compelling benefits:
Maintaining identity relations and disambiguation as separate components also has the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means that work can be done by optimized search engines with built-in faceting.
Simple, Interoperable Tools SupportAn essential design criteria is to have a methodology and work flow that explicitly accounts for simple and interoperable tools. By “simple” we mean targeted, task-specific tools and functionality that is also geared to domain users and practitioners.
Of all design areas, this one is perhaps the weakest in terms of current offerings. The next installment in this series [1] will address this topic directly.
The New MethodologyArmed with these criteria, we are now ready to present the new methodology. In summary terms, we can describe the steps in the methodology as:
After the scoping and analysis phase, the effort is split into two tracks:
This split conforms to the separation of ABox and TBox noted above [15]. There are conceptual and workflow parallels between entities and data v. ontologies. However, the specific methodologies differ, and we only focus on the conceptual ontology side in the discussion below, shown as the upper part (blue) of Figure 3:
Two key aspects of the initial effort are to properly scope the size and purpose of the starting prototype and to inventory the existing assets (structure and data; internal and external) available to the project.
Re-Use StructureMost current ontology methodologies do not emphasize re-use of existing structure. Yet these resources are rich in content and meaning, and often represent years to decades of effort and expenditure in creation, assembly and consensus. Just a short list of these potential sources demonstrates the treasure trove of structure and vocabularies available for re-use: Web portals; databases; legacy schema; metadata; taxonomies; controlled vocabularies; ontologies; master data catalogs; industry standards; exchange formats, etc.
Metadata and available structure may have value no matter where or how it exists, and a fundamental aspect of the build methodology is to bring such candidate structure into a common tools environment for inspection and testing. Besides assembling and reviewing existing sources, those selected for re-use must be migrated and converted to proper ontological form (OWL in the case of those developed by Structured Dynamics). Some of these techniques have been demonstrated for prior patterns and schema [17]; in other instances various converters, RDFizers or scripts may need to be employed to effect the migration.
Many tools and options exist at this stage, even though as a formal step this conversion is often neglected.
Prototype StructureThe prototype structure is the first operating instance of the ontology. The creation of this initial structure follows quite closely the approach recommended in Ontology Development 101 [18], with some modifications to reflect current terminology:
The prototype structure is important since it communicates to the project sponsors the scope and basic operation of the starting structure. This stage often represents a decision point for proceeding; it may also trigger the next budgeting phase.
Link Reference OntologiesAn essential aspect of a build methodology is to re-use “standard” ontologies as much as possible. Core ontologies are Dublin Core, DC Terms, Event, FOAF, GeoNames, SKOS, Timeline, and UMBEL. These core ontologies have been chosen because of universality, quality, community support and other factors [19]. Though less universal, there are also a number of secondary ontologies, namely BIBO, DOAP, and SIOC that may fit within the current scope.
These are then supplemented with quality domain-specific ontologies, if such exist. Only then are new name spaces assigned for any newly generated ontology(ies).
Working OntologyThe working ontology is the first production-grade (deployable) version of the ontology. It conforms to all of the ontology building best practices and needs to be complete enough such that it can be loaded and managed in a fully conforming ontology editor or IDE [20].
By also using the OWL API, this working structure can also be the source for specialty tools and user maintenance functions, short of requiring a full-blown OWL editor. Many of these aspects are some of the poorest represented in the current tools inventory; we return to this topic in the next installment.
The working ontology is the complete, canonical form of the domain ontology(ies) [21]. These are the central structures that are the focus for ongoing maintenance and extension efforts over the ensuing phases. As such, the ontologies need to be managed by a version control system with comprehensive ontology and vocabulary management support and tools.
Testing and MappingAs new ontologies are generated, they should be tested for coherence against various reasoning, inference and other natural language processing tools. Gap testing is also used to discover key holes or missing links within the resulting ontology graph structure. Coherence testing may result in discovering missing or incorrect axioms. Gap testing helps identify internal graph nodes needed to establish the integrity or connectivity of the concept graph.
Though used for different purposes, mapping and alignment tools may also work to identify logical and other inconsistencies in definitions or labels within the graph structure. Mapping and alignment is also important in its own right in order to establish the links that help promote ontology and information interoperability.
External knowledge bases can also play essential roles in testing and mapping. Two prominent knowledge base examples are Cyc and Wikipedia, but many additional exist for any specific domain.
Use and MaintenanceOf course, the whole purpose of the development methodology is to create practical, working ontologies. Such uses include search, discovery, information federation, data interoperability, analysis and reasoning, The general purposes to which ontologies may be put are described in the Executive Intro to Ontologies [22].
However, it is also in day-to-day use of the ontology that many enhancements and improvements may be discovered. Examples include improved definitions of concepts; expansions of synonyms, aliases and jargon for concepts; better, more intuitive preferred labels; better means to disambiguate between competing meanings; missing connections or excessive connections; and splitting or consolidating of the underlying structure.
Today, such maintenance enhancements are most often not pursued because existing tools do not support such actions. Reliance on IDEs and tools geared to ontology engineering are not well suited to users and practitioners being able to note or effect such changes. Yet ongoing ontology use and adaptation clearly suggest that users should be encouraged to do so. They are the ones in the front lines of identifying and potentially recording such improvements.
ExtendOntology development is a process, not a static destination or event. This observation makes intuitive sense since we understand ontologies to be a means to capture our understanding of our domains, which is itself constantly changing due to new observations and insights. This factor alone suggests that ontology development methodologies must therefore give explicit attention to extension.
But there is another reason for this attention. Incremental, adaptive ontologies are also explicitly designed to expand their scope and coverage, bite by bite as benefits prove themselves and justify that expansion. A start small and expand strategy is of course lower risk and more affordable. But, for it to be effective, it also must be designed explicitly for extension and expansion. Ontology growth thus occurs both from learning and discovery and from expanding scope.
Versioning, version control and documentation (see below) thus assume more central importance than a more static view would suggest. The use of feedbacks and the continuous improvement design based on MIKE2.0 are therefore also central tenets of our ontology development methodology.
DocumentationThis perspective of the ontology as a way to capture the structure and relationships of a domain — which is also constantly changing and growing — carries over to the need to document the institutional memory and use of it. Both better tools — such as vocabulary management and versioning — and better work processes need to be instituted to properly capture and record use and applications of ontologies.
Some of these aspects are now handled with utilities such as OWLdoc or the TechWiki that Structured Dynamics has innovated to capture ontology knowledge bases on an ongoing basis. But these are still rudimentary steps that need to be enforced with management commitment and oversight.
One need merely begin to probe the ontology development literature to observe how sparse the pickings are. Very little information on methodologies, best practices, use cases, recipes, how to manuals, conversion and use steps and other documentation really exists at present. It is unfortunately the case that documentation even lags the inadequate state of tools development in the ontology space.
Content ProcessingOnce formalized, these constructs — the structured ontologies or the named entity dictionaries as shown in Figure 3 — are then used for processing input content. That processing can range from conversion to direct information extraction. Once extracted, the structure may be injected (via RDFa or other means) back into raw Web pages. The concepts and entities that occur within these structures help inform various tagging systems [23]. The information can also be converted and exported in various forms for direct use or for incorporation in third-party systems.
Visualization systems and specialized widgets (see next) can be driven by the structure and results sets obtained from querying the ontology structure and retrieving its related instance data. While these purposes are somewhat beyond the direct needs of the ontology development methodology, the ontology structures themselves must be designed to support these functions.
Semantic Component OntologyIn our methodology we also provide for administrative ontologies whose purpose is to relate structural understandings of the underlying data and data types with applicable end-use and visualization tools (“widgets”). Thus the structural knowledge of the domain gets combined with an understanding of data types and what kinds of visualization or presentation widgets might be invoked. The phrase ontology-driven apps results from this design.
Amongst other utility ontologies, Structured Dynamics names its major tool-driver ontology the SCO (Semantic Component Ontology). The SCO works in intimate tandem with the domain ontologies, but is constructed and designed with quite different purposes. A description of the build methodology for the SCO (or its other complementary utility ontologies) is beyond the scope of this current document.
Tooling and Best PracticesAs sprinkled throughout the above commentary, this methodology is also intimately related to tools and best practices. The next chapter in this series is devoted to and will be archived on the TechWiki as the lightweight domain ontology methodology. Best practices will be handled in a similar way for the chapter after that one and in its ontology best practices document on the TechWiki.
Time for a Leap Forward in MethodologyEarlier reviews and the information in this document suggest a real need for ontology building methodologies that are integrated, easier to use, interoperate with a richer tools set and are geared to practitioners versus priests. The good news is that there are architectures and building blocks to achieve this vision. The bad news is that the first steps on this path are only now beginning.
The next two installments in this series add further detail for why it is time — and how — we can make a leap forward in methodology. Those critical remaining pieces are in tools and best practices.
[1] This posting is part of a current series on ontology development and tools. The series began with an update of my prior Ontology Tools listing, which now contains 185 tools. It continued with a survey of ontology development methodologies. The next part in this series will address a new architecture for tooling development. The last installment in the series is planned to cover ontology best practices. This same posting is permanently archived and updated on the OpenStructs TechWiki as Lightweight, Domain Ontologies Development Methodology. [2] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus — that is, real ontos stuff) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak. For a more detailed treatment of ontology classifications, see M. K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog, May 16, 2007. [3] O. Corcho, M. Fernandez and A. Gomez-Perez, 2003. “Methodologies, Tools and Languages for Building Ontologies: Where is the Meeting Point?,” in Data & Knowledge Engineering 46, 2003. See http://www.dia.fi.upm.es/~ocorcho/documents/DKE2003_CorchoEtAl.pdf. [4] Elena Paslaru Bontas Simperl and Christoph Tempich, 2006. “Ontology Engineering: A Reality Check,” in Proceedings of the 5th International Conference on Ontologies, Databases, and Applications of Semantics ODBASE 2006, 2006. See http://ontocom.ag-nbi.de/docs/odbase2006.pdf. [5] OntologyDesignPatterns.org is a semantic Web portal dedicated to ontology design patterns (ODPs). The portal was started under the NeOn project, which still partly supports its development. [6] See M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, November 23, 2009. [7] See M.K. Bergman, 2008. “The Semantics of Context,” AI3:::Adaptive Information blog, May 6, 2008. [8] UMBEL (Upper Mapping and Binding Exchange Layer) is an ontology of about 20,000 subject concepts that acts as a reference structure for inter-relating disparate datasets. It is also a general vocabulary of classes and predicates designed for the creation of domain-specific ontologies. [9] See M.K. Bergman, 2008. “When is Content Coherent?,” AI3:::Adaptive Information blog, July 25, 2008. [10] See M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009. [11] MIKE2.0 (Method for Integrated Knowledge Environments) is an open source information development methodology championed by Bearing Point and Deloitte. Structured Dynamics has adopted the approach and has helped formulate MIKE2.0′s semantic enterprise offering. For a general intro to the approach, see further M.K. Bergman, 2010. “MIKE2.0: Open Source Information Development in the Enterprise,” AI3:::Adaptive Information blog, February 23, 2010. [12] This is our working definition for description logics: “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.” [13] See the four-part description logics series from M. K. Bergman, 2009. “Making Linked Data Reasonable using Description Logics, Part 1,” AI3:::Adaptive Information blog, Feb. 11, 2009; “Making Linked Data Reasonable using Description Logics, Part 2,” AI3:::Adaptive Information blog, Feb. 15, 2009; “Making Linked Data Reasonable using Description Logics, Part 3,” AI3:::Adaptive Information blog, Feb. 18, 2009; and “Making Linked Data Reasonable using Description Logics, Part 4,” AI3:::Adaptive Information blog, Feb. 23, 2009. [14] See Part 2 in [13]. [15] The TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) to which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure. [16] The original version, now slightly modified, was first published in M. K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, Nov. 23, 2009. [17] As some examples, see for instance: SKOS: Mark van Assem, Veronique Malais, Alistair Miles and Guus Schreiber, 2006. “A Method to Convert Thesauri to SKOS,” in The Semantic Web: Research and Applications (2006), pp. 95-109. See http://www.cs.vu.nl/~mark/papers/Assem06b.pdf for paper, also http://thesauri.cs.vu.nl/eswc06/ and http://thesauri.cs.vu.nl/; taxonomies: Fausto Giunchiglia, Maurizio Marchese and Ilya Zaihrayeu, 2006. “Encoding Classifications into Lightweight Ontologies,” presented at Proceedings of the 3rd European Semantic Web Conference (ESWC 2006), Budva. See http://www.science.unitn.it/~marchese/pdf/encoding%20classifications%20into%20lightweight%20ontologies_JoDS8.pdf; metadata: Mikael Nilsson, 2007. See http://mikaelnilsson.blogspot.com/2007/11/semanticizing-metadata-specifications.html; relational schema: see the W3C workgroup on RDB2RDF; and, of course, there are many others. [18] Natalya F. Noy and Deborah L. McGuinness, 2001. “Ontology Development 101: A Guide to Creating Your First Ontology,” Stanford University Knowledge Systems Laboratory Technical Report KSL-01-05, March 2001. See http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html. [19] The various criteria that are considered in nominating an existing ontology to “core” status is that it should be general; highly used; universal; broad committee or community support; well done and documented; and easily understood. [20] Example and comprehensive ontology editing toolkits or IDEs (integrated development environments) include NeOn toolkit, Protégé, and TopBraid Composer. A complement to these larger toolkits is the OWL API, which when used can also provide a canonical management framework for specific ontology tools and tasks. This topic is covered more in the next installment regarding the tools landscape. [21] Good ontology design, especially for larger projects, does require a degree of modularity. An architecture of multiple ontologies often work together to isolate different work tasks so as to aid better ontology management. Ontology architecture and modularization is a separate topic in its own right. [22] Originally published as M.K. Bergman, 2010. “An Executive Intro to Ontologies,” AI3:::Adaptive Information blog, August 9, 2010. This popular document has now been permanently archived on the the OpenStructs TechWiki as Intro to Ontologies. [23] Another reason for the clear distinction between ABox and TBox is their use to aid one another in disambiguation. Structured Dynamics’ scones approach (subject concepts or named entities) is designed expressly for this purpose. It is also possible to integrate these approaches with third-party tools (e.g., Calais, Expert System (Cogito), etc.) to improve unstructured content characterization. Via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of logical segmentation, we have increased the informational power of our concept graph.The development of ontologies goes by the names of ontology engineering or ontology building, and can also be investigated under the rubric of ontology learning. This paper summarizes key papers and links to this topic [18].
For the last twenty years there have been many methods put forward for how to develop ontologies. These methodological activities have actually diminished somewhat in recent years.
The main thrust of the papers listed herein is on domain ontologies, which model particular domains or topic areas. (As opposed to reference, upper or theoretical ontologies, which are more general or encompassing.) Also, little commentary is offered on any of the individual methodologies; please see the referenced papers for more details.
General SurveysOne of the first comprehensive surveys was done by Jones et al. in 1998 [1]. This study began to elucidate common stages and noted there are typically separate stages to produce first an informal description of the ontology and then its formal embodiment in an ontology language. The existence of these two descriptions is an important characteristic of many ontologies, with the informal description often carrying through to the formal description.
The next major survey was done by Corcho et al. in 2003 [2]. This built on the earlier Jones survey and added more recent methods. The survey also characterized the methods by tools and tool readiness.
More recently the work of Simperl and her colleagues has focused on empirical results of ontology costing and related topics. This series has been the richest source of methodology insight in recent years [3, 4, 5, 6]. More on this work is described below.
Though not a survey of methods, one of the more attainable descriptions of ontology building is Noy and McGuinness’ well-known Ontology Development 101 [7]. Also really helpful are Alan Rector’s various lecture slides on ontology building [8].
However, one general observation is that the pace of new methodology development seems to have waned in the past five years or so. This does not appear to be the result of an accepted methodology having emerged.
Some Specific MethodologiesSome of the leading methodologies, presented in rough order from the oldest to newest, are as follows:
Please note that many individual projects also describe their specific methodologies; these are purposefully not included. In addition, Ensan and Du look at some specific ontology frameworks (e.g., PROMPT, OntoLearn, etc.) from a domain-specific perspective [17].
Some FlowchartsHere is the general methodology as presented in the various Simperl et al. papers [c.f., Fig. 1 in 3]:
The Corcho et al. survey also presented a general view of the tools plus framework necessary for a complete ontology engineering environment [Fig. 4 from 2]:
There are more examples that show ontology development workflows. Here is one again from the Simperl et al. efforts [Fig. 2 in 5]:
However, what is most striking about the review of the literature is the paucity of methodology figures and the generality of those that do exist. From this basis, it is unclear what the degree of use is for real, actionable methods.
Best Practices ObservationsThe Simperl and Tempich paper [3], besides being a rich source of references, also provides some recommended best practices based on their comparative survey. These are:
General RecommendationsThis review has not set out to characterize specific methodologies, nor their strengths and weaknesses. Yet the research seems to indicate this state of methodology development in the field:
At the beginning of this year Structured Dynamics assembled a listing of ontology building tools at the request of a client. That listing was presented as The Sweet Compendium of Ontology Building Tools. Now, again because of some client and internal work, we have researched the space again and updated the listing [1].
All new tools are marked with <New> (new only means newly discovered; some had yet to be discovered in the prior listing). There are now a total of 185 tools in the listing, 31 of which are recently new, and 45 added at various times since the first release. <Newest> reflects updates — most from the developers themselves — since the original publication of this post.
Comprehensive Ontology ToolsThough all are not relevant, see my post from a couple of years back on large-scale RDF graph software.
At the SemTech conference earlier this summer there was a kind of vuvuzela-like buzzing in the background. And, like the World Cup games on television, in play at the same time as the conference, I found the droning to be just as irritating.
That droning was a combination of the sense of righteousness in the superiority of linked data matched with a reprise of the “chicken-and-egg” argument that plagued the early years of semantic Web advocacy [1]. I think both of these premises are misplaced. So, while I have been a fan and explicator of linked data for some time, I do not worship at its altar [2]. And, for those that do, this post argues for a greater sense of ecumenism.
My main points are not against linked data. I think it a very useful technique and good (if not best) practice in many circumstances. But my main points get at whether linked data is an objective in itself. By making it such, I argue our eye misses the ball. And, in so doing, we miss making the connection with meaningful, interoperable information, which should be our true objective. We need to look elsewhere than linked data for root causes.
Observation #1: What Problem Are We Solving?When I began this blog more than five years ago — and when I left my career in population genetics nearly three decades before that — I did so because of my belief in the value of information to confer adaptive advantage. My perspective then, and my perspective now, was that adaptive information through genetics and evolution was being uniquely supplanted within the human species. This change has occurred because humanity is able to record and carry forward all information gained in its experiences.
Adaptive innovations from writing to bulk printing to now electronic form uniquely position the human species to both record its past and anticipate its future. We no longer are limited to evolution and genetic information encoded in surviving offspring to determine what information is retained and moves forward. Now, all information can be retained. Further, we can combine and connect that information in ways that break to smithereens the biological limits of other species.
Yet, despite the electronic volumes and the potentials, chaos and isolated content silos have characterized humanity’s first half century of experience with digital information. I have spoken before about how we have been steadily climbing the data federation pyramid, with Internet technologies and the Web being prime factors for doing so. Now, with a compelling data model in RDF and standards for how we can relate any type of information meaningfully, we also have the means for making sense of it. And connecting it. And learning and adapting from it.
And, so, there is the answer to the rhetorical question: The problem we are solving is to meaningfully connect information. For, without those meaningful connections and recombinations, none of that information confers adaptive advantage.
Observation #2: The Problem is Not A Lack of Consumable DataOne of the “chicken-and-egg” premises in the linked data community is there needs to be more linked data exposed before some threshold to trigger the network effect occurs. This attitude, I suspect, is one of the reasons why hosannas are always forthcoming each time some outfit announces they have posted another chunk of triples to the Web.
Fred Giasson and I earlier tackled that issue with When Linked Data Rules Fail regarding some information published for data.gov and the New York Times. Our observations on the lack of standards for linked data quality proved to be quite controversial. Rehashing that piece is not my objective here.
What is my objective is to hammer home that we do not need linked data in order to have data available to consume. Far from it. Though linked data volumes have been growing, I actually suspect that its growth has been slower than data availability in toto. On the Web alone we have searchable deep Web databases, JSON, XML, microformats, RSS feeds, Google snippets, yada, yada, all in a veritable deluge of formats, contents and contexts. We are having a hard time inventing the next 1000-fold description beyond zettabyte and yottabyte to even describe this deluge [3].
There is absolutely no voice or observer anywhere that is saying, “We need linked data in order to have data to consume.” Quite the opposite. The reality is we are drowning in the stuff.
Furthermore, when one dissects what most of all of this data is about, it is about ways to describe things. Or, put another way, most all data is not schema nor descriptions of conceptual relationships, but making records available, with attributes and their values used to describe those records. Where is a business located? What political party does a politician belong to? How tall are you? What is the population of Hungary?
These are simple constructs with simple key-value pair ways to describe and convey them. This very simplicity is one reason why naïve data structs or simple data models like JSON or XML have proven so popular [4]. It is one of the reasons why the so-called NoSQL databases have also been growing in popularity. What we have are lots of atomic facts, located everywhere, and representable with very simple key-value structures.
While having such information available in linked data form makes it easier for agents to consume it, that extra publishing burden is by no means necessary. There are plenty of ways to consume that data — without loss of information — in non-linked data form. In fact, that is how the overwhelming percentage of such data is expressed today. This non-linked data is also often easy to understand.
What is important is that the data be available electronically with a description of what the records contain. But that hurdle is met in many, many different ways and from many, many sources without any reference whatsoever to linked data. I submit that any form of desirable data available on the Web can be readily consumed without recourse to linked data principles.
Observation #3: An Interoperable Data Model Does Not Require a Single Transmittal FormatThe real advantage of RDF is the simplicity of its data model, which can be extended and augmented to express vocabularies and relationships of any nature. As I have stated before, that makes RDF like a universal solvent for any extant data structure, form or schema.
What I find perplexing, however, is how this strength somehow gets translated into a parallel belief that such a flexible data model is also the best means for transmitting data. As noted, most transmitted data can be represented through simple key-value pairs. Sure, at some point one needs to model the structural assumptions of the data model from the supplying publisher, but that complexity need not burden the actual transmitted form. So long as schema can be captured and modeled at the receiving end, data record transmittal can be made quite a bit simpler.
Under this mindset RDF provides the internal (canonical) data model. Prior to that, format and other converters can be used to consume the source data in its native form. A generalized representation for how this can work is shown in this diagram using Structured Dynamics‘ structWSF Web services framework middleware as the mediating layer:
Of course, if the source data is already in linked data form with understood concepts, relationships and semantics, much of this conversion overhead can be bypassed. If available, that is a good thing.
But it is not a required or necessary thing. Insistence on publishing data in certain forms suffers from the same narrowness as cultural or religious zealotry. Why certain publishers or authors prefer different data formats has a diversity of answers. Reasons can range from what is tried and familiar to available toolsets or even what is trendy, as one might argue linked data is in some circles today.There are literally scores of off-the-shelf “RDFizers” for converting native and simple data structs into RDF form. New converters are readily written.
Adaptive systems, by definition, do not require wholesale changes to existing practices and do not require effort where none is warranted. By posing the challenge as a “chicken-and-egg” one where publishers themselves must undertake a change in their existing practices to conform, or else they fail the “linked data threshold”, advocates are ensuring failure. There is plenty of useful structured data to consume already.
Accessible structured data, properly characterized (see below), should be our root interest; not whether that data has been published as linked data per se.
Observation #4: A Technique Can Not Carry the Burden of Usefulness or InteroperabilityLinked data is nothing more than some techniques for publishing Web-accessible data using the RDF data model. Some have tried to use the concept of linked data as a replacement for the idea of the semantic Web, and some have recently tried to re-define linked data as not requiring RDF [5]. Yet the real issue with all of these attempts — correct or not, and a fact of linked data since first formulated by Tim Berners-Lee — is that a technique alone can not carry the burden of usefulness or interoperability.
Despite billions of triples now available, we in fact see little actual use or consumption of linked data, except in the life science domain. Indeed, a new workshop by the research community called COLD (Consuming Linked Data) has been set up for the upcoming ISWC conference to look into the very reasons why this lack of usage may be occurring [6].
It will be interesting to monitor what comes out of that workshop, but I have my own views as to what might be going on here. A number of factors, applicable frankly to any data, must be layered on top of linked data techniques in order for it to be useful:
These requirements apply to any data ranging from Census CSV files to Google search results. But because relationships can also be more readily asserted with linked data, these requirements are even greater for it.
It is not surprising that the life sciences have seen more uptake of linked data. That community has keen experience with curation, and the quality and linkages asserted there are much superior to other areas of linked data [7].
In other linked data areas, it is really in limited pockets such as FactForge from Ontotext or curated forms of Wikipedia by the likes of Freebase that we see the most use and uptake. There is no substitute for consistency and quality control.
It is really in this area of “publish it and they will come” that we see one of the threads of parochialism in the linked data community. You can publish it and they still will not come. And, like any data, they will not come because the quality is poor or the linkages are wrong.
As a technique for making data available, linked data is thus nothing more than a foot soldier in the campaign to make information meaningful. Elevating it above its pay grade sets the wrong target and causes us to lose focus for what is really important.
Observation #5: 50% of Linked Data is Missing (that is, the Linking part)There is another strange phenomenon in the linked data movement: the almost total disregard for the linking part. Sure data is getting published as triples with dereferencable URIs, but where are the links?
At most, what we are seeing is owl:sameAs assertions and a few others [8]. Not only does this miss the whole point of linked data, but one can question whether equivalence assertions are correct in many instances [9].
For a couple of years now I have been arguing that the central gap in linked data has been the absence of context and coherence. By context I mean the use of reference structures to help place and frame what content is about. By coherence I mean that those contextual references make internal and logical sense, that they represent a consistent world view. Both require a richer use of links to concepts and subjects describing the semantics of the content.
It is precisely through these kinds of links that data from disparate sources and with different frames of reference can be meaningfully related to other data. This is the essence of the semantic Web and the purported purpose of linked data. And it is exactly these areas in which linked data is presently found most lacking.
Of course, these questions are not the sole challenge of linked data. They are the essential challenge in any attempt to connect or interoperate structured data within information systems. So, while linked data is ostensibly designed from the get-go to fulfill these aims, any data that can find meaning outside of its native silo must also be placed into context in a coherent manner. The unique disappointment for much linked data is its failure to provide these contexts despite its design.
Observation #6: Pluralism is a Reality; Embrace ItYet, having said all of this, Structured Dynamics is still committed to linked data. We present our information as such, and provide great tools for producing and consuming it. We have made it one of the seven foundations to our technology stack and methodology.
But we live in a pluralistic data world. There are reasons and roles for the multitude of popular structured data formats that presently exist. This inherent diversity is a fact in any real-world data context. Thus, we have not met a form of structured data that we didn’t like, especially if it is accompanied with metadata that puts the data into coherent context. It is a major reason why we developed the irON (instance record and object notation) non-RDF vocabulary to provide a bridge from such forms to RDF. irON clearly shows that entities can be usefully described and consumed in either RDF or non-RDF serialized forms.
Attitudes that dismiss non-linked data forms or arrogantly insist that publishers adhere to linked data practices are anything but pluralistic. They are parochial and short-sighted and are contributing, in part, to keeping the semantic Web from going mainstream.
Adoption requires simplicity. The simplest way to encourage the greater interoperability of data is to leverage existing assets in their native form, with encouragement for minor enhancements to add descriptive metadata for what the content is about. Embracing such an ecumenical attitude makes all publishers potentially valuable contributors to a better information future. It will also nearly instantaneously widen the tools base available for the common objective of interoperability.
Parochialism and Root Cause AnalysisLinked data is a good thing, but not an ultimate thing. By making linked data an objective in itself we unduly raise publishing thresholds; we set our sights below the real problem to be solved; and we risk diluting the understanding of RDF from its natural role as a flexible and adaptive data model. Paradoxically, too much parochial insistence on linked data may undercut its adoption and the realization of the overall semantic objective.
Root cause analysis for what it takes to achieve meaningful, interoperable information suggests that describing source content in terms of what it is about is the pivotal factor. Moreover, those contexts should be shared to aid interoperability. Whichever organizations do an excellent job of providing context and coherent linkages will be the go-to ones for data consumers. As we have seen to date, merely publishing linked data triples does not meet this test.
I have heard some state that first you celebrate linked data and its growing quantity, and then hope that the quality improves. This sentiment holds if indeed the community moves on to the questions of quality and relevance. The time for that transition is now. And, oh, by the way, as long as we are broadening our horizons, let’s also celebrate properly characterized structured data no matter what its form. Pluralism is part of the tao to the meaning of information.
[1] See, for example, J.A. Hendler, 2008. “Web 3.0: Chicken Farms on the Semantic Web,” Computer, January 2008, pp. 106-108. See http://www.comp.leeds.ac.uk/webscience/talks/hendler_web_3.pdf. While I can buy Hendler’s arguments about commercial tool vendors holding off major investments until the market is sizable, I think we can also see via listings like Sweet Tools that a lack of tools is not in itself limiting. [2] An earlier treatment of this subject from a different perspective is M.K. Bergman, 2010. “The Bipolar Disorder of Linked Data,” AI3:::Adaptive Information blog, April 28, 2010. [3] So far only prefixes for units up to 10^24 (“yotta”) have names; for 10^27, a student campaign on Facebook is proposing “hellabyte” (North California slang for “a whole lot of”) to get adopted by science bodies. See http://scitech.blogs.cnn.com/2010/03/04/hella-proposal-facebook/. [4] One of more popular posts on this blog has been, M.K. Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” AI3:::Adaptive Information blog, January 22, 2009. [5] See, for example, the recent history on the linked data entry on Wikipedia or the assertions by Kingsley Idehen regarding entity attribute values (EAV) (see, for example, this blog post.) [6] See further the 1st International Workshop on Consuming Linked Data (COLD 2010), at the 9th International Semantic Web Conference (ISWC 2010), November 8, 2010, Shanghai, China. [7] For example, in the early years of GenBank, some claimed that annotations of gene sequences due to things like BLAST analyses may have had as high as 30% to 70% error rates due to propagation of initially mislabeled sequences. In part, the whole field of bioinformatics was formed to deal with issues of data quality and curation (in addition to analytics). [8] See, for example: Harry Halpin, 2009. “A Query-Driven Characterization of Linked Data,” paper presented at the Linked Data on the Web (LDOW) 2009 Workshop, April 20, 2009, Madrid, Spain, see http://events.linkeddata.org/ldow2009/papers/ldow2009_paper16.pdf; Prateek Jain, Pascal Hitzler, Peter Z. Yehy, Kunal Vermay and Amit P. Shet, 2010. “Linked Data is Merely More Data,” in Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness, Linked Data Meets Artificial Intelligence, Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82-86., see http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf; among others. [9] Harry Halpin and Patrick J. Hayes, 2010. “When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web,” presented at LDOW 2010, April 27th, 2010, Raleigh, North Carolina. See http://events.linkeddata.org/ldow2010/papers/ldow2010_paper09.pdf.Ontologies are the structural frameworks for organizing information on the semantic Web and within semantic enterprises. They provide unique benefits in discovery, flexible access, and information integration due to their inherent connectedness; that is, their ability to represent conceptual relationships. Ontologies can be layered on top of existing information assets, which means they are an enhancement and not a displacement for prior investments. And ontologies may be developed and matured incrementally, which means their adoption may be cost-effective as benefits become evident [1].
What Is an Ontology?Ontology may be one of the more daunting terms for those exposed for the first time to semantic technologies. Not only is the word long and without common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.
The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”
Much like taxonomies or relational database schema, ontologies work to organize information. No matter what the domain or scope, an ontology is a description of a world view. That view might be limited and miniscule, or it might be global and expansive. However, unlike those alternative hierarchical views of concepts such as taxonomies, ontologies often have a linked or networked “graph” structure. Multiple things can be related to other things, all in a potentially multi-way series of relationships.
A distinguishing characteristic of ontologies compared to conventional hierarchical structures is their degreeOntologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Ontologies thus provide a similar role for the organization of data that is provided by relational data schema. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data.
When one uses the idea of “world view” as synonomous with an ontology, it is not meant to be cosmic, but simply a way to convey how a given domain or problem area can be described. One group might choose to describe and organize, say, automobiles, by color; another might choose body styles such as pick-ups or sedans; or still another might use brands such as Honda and Ford. None of these views is inherently “right” (indeed multiples might be combined in a given ontology), but each represents a particular way — a “world view” — of looking at the domain.
Though there is much latitude in how a given domain might be described, there are both good ontology practices and bad ones. We offer some views as to what constitutes good ontology design and practice in the concluding section.
What Are Its Benefits?A good ontology offers a composite suite of benefits not available to taxonomies, relational database schema, or other standard ways to structure information. Among these benefits are:
The relationship structure underlying an ontology provides an excellent vehicle for discovery and linkages. “Swimming through” this relationship graph is the basis of the Concept Explorer (also known as the Relation Browser) and similar widgets.
The most prevalent use of ontologies at present is in semantic search. Semantic search has benefits over conventional search in terms of being able to make inferences and matches not available to standard keyword retrieval.
The relationship structure also is a powerful and more general and more nuanced way to organize information. Concepts can relate to other concepts through a richness of vocabulary. Such predicates might capture subsumption, precedence, parts of relationships (mereology), preferences, or importances along virtually any metric. This richness of expression and relationships can also be built incrementally over time, allowing ontologies to grow and develop in sophistication and use as desired.
The pinnacle application for ontologies, therefore, is as coherent reference structures whose purpose is to help map and integrate other structures and information. Given the huge heterogeneity of information both within and without organizations, the use of ontologies as integration frameworks will likely emerge as their most valuable use.
What Makes for a Good Ontology?Good ontology practice has aspects both in terms of scope and in terms of construction.
Scope ConsiderationsHere are some scoping and design questions that we believe should be answered in the positive in order for an ontology to meet good practice standards:
If these questions can be answered affirmatively, then we would deem the ontology ready for production-grade use.
Fundamental to the whole concept of coherence is the fact that experts and practitioners within domains have been looking at the questions of relationships, structure, language and meaning for decades. Though perhaps today we now finally have a broad useful data and logic model in RDF, the fact remains that massive time and effort has already been expended to codify some of these understandings in various ways and at various levels of completeness and scope. Good practice also means, therefore, that maximum leverage is made to springboard ontologies from existing structural and vocabulary assets.
And, because good ontologies also embrace the open world approach, working toward these desired end states can also be incremental. Thus, in the face of common budget or deadline constraints, it is possible initially to scope domains as smaller or to provide less coverage in depth or to use a small set of predicates, all the while still achieving productive use of the ontology. Then, over time, the scope can be expanded incrementally.
Construction ConsiderationsTo achieve their purposes, ontologies must be both human-readable and machine-processable. Also, because they represent conceptual structures, they must be built with a certain composition.
Good ontologies therefore are constructed such that they have:
In the case of ontology-driven applications using adaptive ontologies, there are also additional instructions contained in the system (often via administrative ontologies) that tell the system which types of widgets need to be invoked for different data types and attributes. This is different than the standard conceptual schema, but is nonetheless essential to how such applications are designed.
[1] This posting was at the request of a couple of Structured Dynamics‘ customers that desired a way to describe ontologies to non-technical management. For a more in depth treatment, see M.K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog, May 16, 2007.Today, Structured Dynamics is pleased to make its Citizen Dan application available for public viewing, play and downloading for the first time.
Citizen Dan is a free, open source system available to any community and its citizens to measure and track indicators of local well being. It can be branded and themed for local needs. It is under active development by Structured Dynamics with support from a number of innovative cities.
Citizen Dan is an exemplar instance of Structured Dynamics’ open semantic framework (OSF), a generalized framework for deploying semantic platforms for any domain. By changing its guiding ontologies and source content and data, what appears for Citizen Dan can be adopted for virtually any subject area.
As configured, the Citizen Dan OSF instance is a:
Citizen Dan’s information sources may include Census data, the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics.
Text and narratives and the concepts and entities they describe are integrally linked into the system via information extraction and tagging. All ingested information, whether structured or text sources, with their semantics, can be exported in multiple formats. A standard organizing schema, also open source and extensible or modifiable by all users, is provided via the optional MUNI ontology (with vocabulary details in development here), being developed expressly for Citizen Dan and its community indicator system purposes.
All of the community information contained within a Citizen Dan instance is available as linked data.
Overview of FeaturesHere are the main components or widgets to this Citizen Dan demo:
Integration of text and stories via subjectA number of other tools are available to admins in the actual appliance, but are not exposed in the demo:
In addition, it is not possible in the demo to save persistent dashboard views or submit stories or documents for tagging, nor to register as a user or view the admin portions of the Drupal instance.
Powerful and persistent “slicing-and-dicing”The sample data and content in the demo is for the Iowa City (IA) metropolitan statistical area. This area embraces two counties (Johnson and Washington) and the census tracts and townships that comprise them, and about two dozen cities. Two of the notable cities are Iowa City itself, home of the University of Iowa, and Coralville, where Structured Dynamics, the developer of Citizen Dan and the open semantic framework (OSF), is headquartered.
The text content on this site is drawn from Wikipedia articles dealing with this area. About 30 stories are included.
The data content on the site is drawn from US Census Bureau data. Shape files for the various geographic areas were obtained from here, and the actual datasets by geographic area can be obtained from here.
The Workbench is the admin tool toCitizen Dan is an exemplar instance of Structured Dynamics’ open semantic framework (OSF), a generalized framework for deploying semantic platforms for specific domains.
OSF is a combination of a layered architecture and modular software. Most of the individual open source software products developed by Structured Dynamics and available on the OpenStructs site are components within the open semantic framework. These include:
The software that makes up the Citizen Dan appliance is one of the four legs that provide a stable, open source solution. These four legs are software, structure, methods and documentation. When all four are provided, we can term this a total open solution.
For Citizen Dan, the complements to this software are:
In its entirety, the total open solution amounts to a form of capacity building for the enterprise.
Admins have a wealth of tools for dataset andInherent in the design and architecture of Citizen Dan is the potential for each instance (single installation) to act as a node in a distributed network of nodes across the Web. Via the structWSF Web service endpoints and appropriate dataset permissions, it is possible for any city in the Citizen Dan network to share (or not) any or all of its data with other cities.
This collaboration aspect has been “baked into the cake” from Day One. The system also supports differential access, rights and roles by dataset and Web service. Thus, city staffs across multiple communities could share data differently than what is provided to the general public.
Since all data management aspects of each Citizen Dan instance is also oriented around datasets, expansion to a network mode is quite straightforward.
Citizen Dan is hosted in Drupal, with rich portal,The Citizen Dan appliance is based on the Drupal content management system, which means any community can easily theme or add to the functionality of the system with any of the available 6500 open source modules that extend the basic Drupal functionality.
All other components, including the multiple third-party ones, are also open source.
To install Citizen Dan for your own use, you need to:
(Note: there will also be some more updates in August, including the MUNI release.)
For questions and additional info, please consult the TechWiki or the OpenStructs community site.
Finally, please contact us if you’d like to learn more about the project, investigate funding or sponsorship opportunities, or contribute to development. We’d welcome your involvement!
A few weeks back I completed a three-part introductory series to what Structured Dynamics calls a ‘total open solution‘. A total open solution as we defined it is comprised of software, structure, methods and documentation. When provided in toto, these components provide all of the necessary parts for an organization to adopt new open source solutions on its own (or with the choice of its own consultants and contractors). A total open solution fulfills SD’s mantra that, “We’re successful when we’re not needed.”
Two of the four legs to this total open solution are provided by documentation and methods. These two parts can be seen as a knowledge base that instructs users on how to select, install, maintain and manage the solution at hand.
Today, SD is releasing publicly for the first time two complementary knowledge bases for these purposes: TechWiki, which is the technical and software documentation complement, in this case based around SD’s Open Semantic Framework and its associated open source software projects; and DocWiki, the process methodology and project management complement that extends this basis, in this case based around the Citizen Dan local community open data appliance.
All of the software supporting these initiatives is open source. And, all of the content in the knowledge bases is freely available under a Creative Commons 3.0 license with attribution.
Mindset and ObjectivesIn setting out the design of these knowledge bases, our mindset was to enable single-point authoring of document content, while promoting easy collaboration and rollback of versions. Thus, the design objectives became:
Assuming these objectives could be met, we then had three other objectives on our wish list:
Our initial investigations looked at conventional content and document management systems, matched with version control systems or SVNs. Somewhat surprisingly, though, we found the Mediawiki platform to fulfill all of our objectives. Mediawiki, as detailed below, has evolved to become a very mature and capable documentation platform.
While most of us know Mediawiki as a kind of organic authoring and content platform — as it is used on Wikipedia and many other leading wikis — we also found it perfect for our specific knowledge base purposes. To our knowledge, no one has yet set up and deployed Mediawiki in the specific pre-packaged knowledge base manner as described herein.
TechWiki v DocWikiTechWiki is a Mediawiki instance designed to support the collaborative creation of technical knowledge bases. The TechWiki design is specifically geared to produce high-quality, comprehensive technical documentation associated with the OpenStructs open source software. This knowledge base is meant to be the go-to source for any and all documentation for the codes, and includes information regarding:
As of today, TechWiki contains 187 articles under 56 categories, with a further 293 images. The knowledge base is growing daily.
DocWiki is a sibling Mediawiki instance that contains all TechWiki material, but has a broader purpose. Its role is to be a complete knowledge base for a given installation of an Open Semantic Framework (in the current case, Citizen Dan). As such, it needs to include much of the technical information in the TechWiki, but also extends that in the following areas:
The methodology portions of the DocWiki are drawn from the broader MIKE2.0 (Method for Integrated Knowledge Environments) approach. I have previously written about this open source methodology championed by Bearing Point and Deloitte.
As of today, DocWiki contains 357 articles and 394 structured tasks in 70 activity areas under 77 categories. Another 115 images support this content. This knowledge base, too, is growing daily.
Both of these knowledge bases are open source and may be exported and installed locally. Then, users may revise and modify and extend that pre-packaged information in any way they see fit.
Basic Wiki OverviewThe basic design of these systems is geared to collaboration and embeds what we think are really responsive work flows. These extend from supporting initial idea noodling to full-blown public documentation. The inherent design of the system also supports single-source publishing and book or PDF creation from the material that is there. Here is the basic overview of the design:
(click for full size)
Mediawiki provides the standard authoring and collaboration environment. There are a choice of editing methods. As content is created, it is organized in a standard way and stored in the knowledge base. The Mediawiki API supports the export of information in either XHTML or XML, which in turn allows the information to be used in external apps (including other Mediawiki instances) or for various single-source publication purposes. The Collection extension is one means by which PDFs or even entire books (that is, multi-page documents with potentially chapters, etc.) may be created. Use of a well-designed CSS ensures that outputs can be readily styled and themed for different purposes or audiences.
As wikis designed from the get-go to be reusable, and then downloaded and installed locally, it is important that we maintain quality and consistency across content. (After download, users are free to do with it as they wish, but it is important the initial database be clean and coherent.) The overall interaction with the content thus occurs via one of three levels: 1) simple reading, which is publicly available without limitation to any visitor, including source inspection and export; 2) editing and authoring, which is limited to approved contributors; and 3) draft authoring and noodling, which is limited to the group in #2 but for which the in-progress content is not publicly viewable. Built-in access rights in the system enable these distinctions.
Features and BenefitsBesides meeting all of the objectives noted at the opening of this post, these wikis (knowledge bases) also have these specific features:
Many of these features come from the standard extensions in the TechWiki/DocWiki packages.
The net benefits from this design are easily shared and modified knowledge bases that users and organizations may either contribute to for the broader benefit of the OpenStructs community, or download and install with simple modifications for local use and extension. There is actually no new software in this approach, just proper attention to packaging, design, standardization and workflow.
A Smooth WorkflowVia the sharing of extensions, categories and CSS, it is quite easy to have multiple instances or authoring environments in this design. For Structured Dynamics, that begins with our own internal wiki. Many notes are taken and collected there, some of a proprietary nature and the majority not intended or suitable for seeing public release.
Content that has developed to the point of release, however, can be simply tagged using conventions in the workflow. Then, with a single Export command, the relevant content is then sent to an XML file. (This document can itself be edited, such as for example changing all ‘TechWiki’ references to something like ‘My Content Site’; see further here.)
Depending on the nature of the content, this exported content may then be imported with a single Import command to either the TechWiki or DocWiki sites. (Note: Import does require admin rights.) A simple migration may also occur from the TechWiki to the DocWiki. Also, of course, initial authoring may begin at any of the sites, with collaborators an explicit feature of the TechWiki or DocWiki versions.
Any DocWiki can also be specifically configured for different domains and instance types. In terms of our current example, we are using Citizen Dan, but that could be any such Open Semantic Framework instance type:
(click for full size)
Under this design, then, the workflow suggests that technical content authoring and revision take place within the TechWiki, process and methodology revision in the DocWiki. Moreover, most DocWikis are likely to be installed locally, such that once installed, their own content would likely morph into local methods and steps.
So long as page titles are kept the same, newer information can be updated on any target wiki at any time. Prior versions are kept in the version history and can be reinstated. Alternatively, if local content is clearly diverging yet updates of initial source material is still desired, the local content need only be saved under a new title to preserve it from import overwrites.
Where Is It Going from Here?We are really excited by this design and have already seen benefits in our own internal work and documentation. We see, for example, easier management of documentation and content, permanent (canonical) URLs for specific content items, and greater consistency and common language across all projects and documentation. Also, when all documentation is consolidated into one point with a coherent organizational and category structure, documentation gaps and inconsistencies also become apparent and can readily be fixed.
Now, with the release of these systems to the OpenStructs (Open Semantic Framework) and Citizen Dan communities, we hope to see broader contributions and expansion of the content. We encourage you to check on these two sites periodically to see how the content volume continues to grow! And, we welcome all project contributors to join in and help expand these knowledge bases!
We think this general design and approach — especially in relation to a total open solution mindset — has much to recommend it for other open source projects. We think these systems, now that we have designed and worked out the workflows, are amazingly simple to set up and maintain. We welcome other projects to adopt this approach for their own. Let us know if we can be of assistance, and we welcome ideas for improvement!
Like the seminal linked data publication by PricewaterhouseCoopers of about a year ago (see “PWC Dedicates Quarterly Technology Forecast to Linked Data“, May 29, 2009), a video released by Cisco yesterday is another signal of the emergence of the semantic enterprise.
The Cisco tech brief on The Semantic Enterprise is a quite accessible — but a bit eerie — seven-minute introduction. The video was prepared by Cisco’s Internet Business Solutions Group (IBSG), with Shaun Kirby, its Director of Innovations Architectures, as the narrator:
YouTube: http://www.youtube.com/watch?v=3lUzs2I8BKIWell, as for being eerie, when the video first came up, I thought I was looking at an advanced, next generation avatar, perhaps a reincarnation of Douglas Adams’ Hyperland. Maybe this semantic stuff was closer at hand than we thought!
But, as it turned out, that first blush was only a reaction to how the video was shot. As it gets rolling, the Cisco video is extremely well done and informative. It is a great intro for sharing with management when contemplating your own moves to becoming a semantic enterprise.
I suggest you first view — and then bookmark — this one.
OK. So, you’re looking at your garage … or your bedroom closet … or your office and its files. They are a mess, and you can’t find anything and you can’t stuff anything more into the nooks, cubbies, crannies or cabinets. What do you do?
Well, when you finally get fed up and have a rainy day or some other excuse, you tackle the mess. Maybe you grab a big mug of coffee to prepare for the pending battle. Maybe you strip down to comfort clothes. Then, if you’re like me, you begin to organize stuff into piles. Labeled piles and throwaway piles and any other piles that can provide a means to start bringing order to the chaos.
In the semantic Web world, there is a phrase coined by Jim Hendler that captures this approach: A little semantics goes a long way [1]. A little semantics, just like your labeled piles, helps to bring order to information chaos.
Mind you, this is not fancy or expensive stuff. In the case of my office, it is colored sheets of paper labeled with Magic Markers as “Taxes” or “Internal” or “Blog Posts” or whatever. Then, I begin sifting and distributing. In the case of the semantic world, these are classifying things into like categories and simply relating them to other categories with simple relationships, such as “is Part Of” or “is Narrower Than”.
Of course, I could have approached my mess in a different way. I could have hired an efficiency expert to come in, interview me and all of my employees and colleagues, gotten a written analysis and report, and then committed to a multi-week project to completely store and place every single last piece of paper in my office or organize every rake and set of abandoned golf clubs in my garage. When done, I would have shelled out much money and I suspect still not have been able to find anything.
Sort of sounds like the traditional way IT does its business, doesn’t it? To clean up their information messes, enterprises need to find a better strategy.
I’m not too long from having returned from the SemTech conference, which overall was quite an excellent show. But despite its emphasis on semantic technologies and their usefulness to businesses and enterprises, I found one critical theme unspoken: the ability of semantic approaches to change how enterprise IT actually does business. New ways have got to be found to clean up the many and growing information piles emerging all around us.
The Changing Nature of ITIT is — and has been — going through a fundamental set of changes for decades. In the last decade, these changes have led to lowered relative spending, a shift in spending priorities toward services, less innovation, and less productivity. Some data and observations by researchers and analysts document these trends.
The following chart, using US Bureau of Economic Analysis data [2], shows the clear 50-year trend in declining hardware costs for enterprises, mostly resulting from the observation known as Moore’s Law. These massive hardware cost reductions (logarithmic scale) have also resulted in lower prices for IT as a whole. In 2008, for example, total relative IT prices were about two-thirds what they were a mere decade earlier:
Source: M.K. Bergman and Bureau of Economic Analysis [2] (click for full size)
In contrast, relative prices for software and services have remained remarkably flat over this entire period, including for the past decade. This is somewhat surprising given the emergence of packaged software and more recently open source. However, relative percentage expenditures for custom software and software developed in-house have also remained strong over the past decade [3].
The mid- to late-1990s represented the high-water mark on many bases for enterprise IT, expenditures and vendors. Roughly in 1997 or so, the number of public enterprise software vendors peaked as did venture funding [4] and relative expenditures for IT in relation to GDP. There was a major uptick in relation to preparing for Y2K and a major downtick due to the dot-com bubble, and then of course the past two years or so have seen a global economic downturn. But, as the figure below shows (red), the long-term trend tends to suggest a relative plateau for IT expenditures in relation to GDP somewhat around 2000:
Source: M.K. Bergman and Bureau of Economic Analysis [2] (click for full size)
Yet, like the first chart, software seems to be bucking this trend (blue lines above). Though perhaps the rate of growth in expenditures for software is slowing a bit, it is still on a growth upslope, especially in relation to overall IT expenditures. The next chart, in fact, specifically compares software expenditures to total IT expenditures. Software expenditures are some 40% higher in relation to total IT than they were a mere decade ago:
Source: M.K. Bergman and Bureau of Economic Analysis [2] (click for full size)
The mix of these software expenditures is also changing in major ways while stagnating in others.
The changing aspect is coming about from the shift of expenditures from license and maintenance fees to services. A number of software vendors began to see revenues from services overcome that from licensing in the 1990s. By the early 2000s, this was true for the enterprise software sector as a whole [4]. Today, service revenues account for 70% or so of aggregate sector revenues. Combined with the emergence of open source and other alternatives such as software as a service (SaaS), I think it fair to say that the era of proprietary software with exceedingly high margins from monopoly rents is over [5].
The stagnating aspect occurs in how the software expenditures are applied. According to Gartner, in the US, more than 70% of IT expenditures are devoted to simply running existing systems, with only about 11% of budgets devoted to innovation; other parts of the world spend nearly double on innovation and much lower for operations [6]. This relative lack of support for innovation and high percentages for running existing systems has held true for about a decade. Meanwhile, IT’s contribution to US productivity has been declining since 2001 [7].
What is the Cause for IT’s Ills?Last year, PricewaterhouseCoopers published a major report with the provocative title, “Why Isn’t IT Spending Creating More Value?” [7]. The 42-page report covered many of the aspects above. Among other factors, the PWC authors speculated that:
As consumption of IT increases and as technologies change and advance, businesses have been left to cobble together disparate software and hardware systems and tools. The end result? Unchecked IT spending, unneeded complexity, redundant systems, underutilized hardware and data centers, the need for expensive IT security, and, inevitably, diminishing returns from IT. In short, low levels of IT productivity create conditions for an IT cost crisis. [7]I suppose one could add to this litany other factors such as the growth and emergence of the Internet, sector consolidations through mergers and acquisitions, the rise of open source and alternatives such as SaaS, etc.
But which of these are causes? Which are symptoms? And which might only be consequences or coincident?
To be sure, all recognize the explosion of digital data and information, with sources and formats springing up faster than Whack-a-Mole. It is such an evident and ubiquitous phenomenon that pointing to it as a cause appears on the face of it quite obvious. Also obvious is that these new sources carry with them a diversity of systems and tools. While not categorically stated as such, it appears that PWC fingers the difficulties of “cobbling” these systems together as the root cause for low productivity and thus the IT cost crisis.
I agree totally that these are symptoms of what we see in IT’s current circumstance. I would even say these factors are a proximate cause to these ills. But I disagree they are the root cause. To discover that root, I believe, we must look deeper to mindset and assumptions.
Closed World Mindset as the Root CauseThere are some phenomena that are so obvious that they are easily missed. Not seeing your fingertip six inches between your eyes is one of these. We aren’t used to focusing on things so near at hand.
So, let’s look for a moment at the closed world assumption (CWA), a key underpinning to most standard relational data systems and enterprise schema and logics. CWA is the logic assumption that what is not currently known to be true, is false. If CWA is not directly familiar to you that is understandable; it is an implied assumption of these systems and logics. As such, it is not often inspected directly and therefore not often questioned [8].
With regard to standard IT systems, the closed world assumption has two important aspects:
On the face of them, these assumptions seem tame enough. And, indeed, there are some enterprise data systems that absolutely rely on them for efficient processing and completion times, such as most transaction systems. CWA is absolutely the appropriate design for such applications.
However, for knowledge management or representation applications — that is, applications which involve combining or using heterogeneous data or information from multiple data sources, which are exactly the same sources requiring information “cobbling” noted above by PWC — there are two very critical implications of the closed-world assumption (CWA):
The net effect, which I have argued before, most notably in a major piece about the open world assumption [11], is that typical projects with a knowledge management aspect have become costly, take very long to complete, often fail, and require much planning and coordination. These facts have been true for three decades as enterprises have attempted to extract knowledge from their electronic information using closed world approaches based on relational systems. And, as recognized by PWC, these problems are only getting worse with growth in diversity and scope of systems.
The implications of closed world v. open world approaches are absolutely at the root of the causes leading to declining productivity, low innovation, significant failures and increasing costs — all exacerbated with more data and more systems — now characterizing traditional enterprise IT. Moreover, it is not a problem for open world systems to link to and incorporate closed world approaches. With open world, there is no need for Hobson’s choices. Unfortunately, such is not true when one begins with a closed world premise.
Incremental is Good: Pay as You GoAs best as I can tell, Alon Halevy was the first to use the phrase “pay as you go” in 2006 to describe the incremental aspect of the open world approach in relation to the semantic Web [12]. The “pay as you go” phrase had been applied earlier to data management and storage and had also been used to describe phone calling plans.
Incremental concepts and “agility” have been popular topics for the past five to ten years in IT, most often related to software development. And, while “incremental” sounds good in relation to enterprise projects, especially of a knowledge management or information integration/federation nature, the actual methodologies put forward were anything but incremental in their conceptual underpinnings.
Unfortunately, the “pay as you go” phrase has (and still is) largely confined to incremental, open world approaches involving the semantic Web. How this approach might apply and benefit enterprises has yet to be articulated. Nonetheless, I like the phrase, and I think it evokes the right mindset. In fact, I think with linked data and many other aspects of the current semantic Web we are seeing such approaches come to fruition. Inch-by-inch, brick-by-brick, data on the Web is getting exposed and interlinked. “Pay as you go” is incremental, and that is good.
Purposeful is Better: Pay as You BenefitYet the idea of “pay as you benefit” is more purposeful, able to be planned and implemented, and founded on standard enterprise cost-benefit principles. I think it is a better (and more nuanced) expression of the “pay as you go” mindset in an enterprise setting. What it means is you can start small and be incomplete. You can target any domain or department or scope that is most useful and illustrative for your organization. You can deploy your first stand-ups as proofs-of-concept or sandboxes. And, you can build on each prior step with each subsequent one.
One of the reasons we (Structured Dynamics) embraced the MIKE2.0 methodology [13] was its inherent incremental character. (Government deployments often call them “spirals”.) In general, the five phases of MIKE2.0 can be represented as follows:
(click for full size)
It is specifically during the fifth phase, testing and improvement, that quantitative and qualitative benefits from the current increment are calculated and documented. This evolving methodology is where the enterprise can assess the results of its prior investment and scope and budget for the next one. These can be quick, rapid increments, or more involved ones, depending on the schedule, prior results and risk profile of the enterprise (or department) at that time.
Much is made of “incremental” or “agile” deployments within enterprises, but the nature of the traditional data system (and its closed world assumption) can act to undermine these laudable steps. The inherent nature of an open world approach, matched with methodologies and best practices, can work wonderfully with KM-related projects.
Quite Simply a Different Way to Do BusinessWe see in our current IT circumstances a number of embedded practices and assumptions. We have been assuming control and completeness — the closed world opposite to the open world approach. We have thus embraced and promoted “global” or enterprise-wide solutions: be they desktop operating systems or browsers or expensive enterprise-level proprietary software solutions. This scope leads to immense hurdle rates and risks: we better get our choices right up front, because if we don’t, the department or enterprise are at risk. We have an inward focus about our own resources, our own networks, our own systems. Meanwhile, when we look outward, we wonder how all of these new Web companies can grow and expand so rapidly in comparison to us.
Clearly, we are seeing shifts to more services than products, more open source, more outsourcing, and more software as a service. Yet, because of the legacy of decades-long commitments from prior IT investment and the failures of many hyped “solutions” such as ERP or BI or data warehousing or a dozen others, we also see a decline and a reluctance for IT to embrace new and transforming approaches. Our prior choices were practically tantamount to “betting the enterprise.” What if our new approaches fail as so many of their predecessors did? In a demanding, competitive environment can we afford to make such wrong choices again with such immense implications?
Yet, now that information technology is a given, it only seems natural that its role becomes an integral part of the enterprise, and not a special function. Like procurement, IT has matured to become a support function. Businesses should not succeed or fail based on the types of pencils and paper stock they use; so should they not depend on the software support choices that IT makes. Enterprises are now past the need to get “computerized”; they are thoroughly so. But our understanding of IT’s role and position has not evolved with its own success.
The first whiffs of these challenges to IT’s initial hegemony came from the departmental introduction of PCs and local networks in the early 1980s. It has continued with desktop software, spreadsheets and Web portals and sites. Large, mature companies awoke in horror in the last decade to discover they had hundreds — sometimes thousands — of Web sites and content dissemination points over which IT had little or no control. Such is the nature of entropy, and it is a fact for any organization of any size.
So, now, with strategies such as “pay as you benefit,” there is no longer an excuse not to innovate. There is not a justification to put off testing and discovering benefits that the open world and semantic approaches can bring to your organization. There is now a basis to make the case and set the affordable budgets within desirable timelines for becoming a semantic enterprise.
Mindsets and expectations do require some adjustment. For example, not everything will be known or modeled in early phases. But, is that also not true in any “real” real world? We’re not talking high-throughput transaction systems here, but beginning to pull together and link the information that is important to your organization strategically.
Remember the intro statement that “a little semantics goes a long way”? Well, that truth — and it is true — when combined with incremental deployment firmly tied to demonstrable results, promises quite simply a different way to do business. Never before have enterprises had working and winnable approaches such as this to test and innovate and learn and discover. Jump on in; the water is clear and warm.
And, oh, as to that mess in your closet or garage? Well, if you adhere to CWA, you will need to define a place for everything to go before you can start cleaning things up. I say: forget those false hurdles. If you’d really want to make a dent in the mess, grab a broom and start cleaning.
[1] Jim Hendler, “a little semantics goes a long way.” See http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html. [2] All starting data is for the United States only and comes from the U.S. Bureau of Economic Analysis, U.S. Department of Commerce. The data tables were downloaded from the BEA Web site at http://www.bea.gov/national/nipaweb/SelectTable.asp. GDP data is from Section 1; enterprise private investment data from Section 5. For reasons as described in the text, all relative BEA numbers were re-adjusted from a 2005 baseline to 1997 based on absolute figures. Software figures and expenditures include packaged software, custom software and software developed in-house, but excludes software bundled or included within hardware. [3] Data not shown; see the “Software Investment and Prices, by Type” data on the BEA Web page http://www.bea.gov/national/info_comm_tech.htm. [4] Michael A. Cusumano, 2008. “The Changing Software Business: Moving from Products to Services,” Massachusetts Institute of Technology, in Computer, Vol 41 (1): 20-27, January 2008. See http://www.iae.univ-lille1.fr/SitesProjets/bmcommunity/Research/cusumano.pdf. This shift has occurred despite the recognition that potential gross margins from software packages can exceed 90% due to zero costs of reproduction. As Cusumano notes in a rule, “99 percent of zero is zero: The great profit opportunity from software products becomes theoretical and not practical” if not sold. Also, another interesting observation made by Cusumano is that in the shift to services vendors with both low percentages and high percentages of services, or what he calls the “sweet spots”, show higher contributions to profitability than vendors in the middle. He posits that low percentage vendors are getting mostly profitable maintenance fees, while those above 60% in services show profitability due to learning more replicable and systematic processes and approaches for service delivery. [5] While we may occasionally see some vendors successfully buck this trend, I suspect these will only occur for established vendors with established platform advantages or for isolated applications where the innovating vendors have a significant first-mover advantage. [6] Garnter calls the innovation category “transform”; see Gartner, Incorporated, 2009. “IT Software and Services, 2007-2010,” see http://www.slideshare.net/rsink/gartner-report-it-spending-2010. Also, see Jed Rubin and Howard Rubin, 2006. “Worldwide IT Benchmark Service New Trends & Findings for 2007: Strategic Performance Management and Measurement,” from Gartner Consulting Worldwide IT Benchmark Service; see http://www.gartner.com/teleconferences/attributes/attr_161183_115.pdf. [7] PricewaterhouseCoopers, 2009. “Why Isn’t IT Spending Creating More Value?”, see http://www.pwc.com/en_US/us/increasing-it-effectiveness/assets/it_spending_creating_value.pdf. [8] Though relational database systems did not begin with an understanding of CWA, but rather Edgar Codd’s 12 rules, the understandings of these were formulated later by Raymond Reiter. Reiter first described the basis of CWA in 1978, and then provided an axiomatization of relational databases and their deductive generalizations and basis in CWA in 1984; see http://prism.cs.umd.edu/papers/Min02:reiter_memoriam/memoriam-tplp.pdf. [9] Relational database systems also assume unique names for objects, which, while not perhaps the best design for federated systems, can be overcome in other ways. [10] For semantics-related projects there is a corollary problem to the use of CWA which is the need for upfront agreement on what all predicates “mean”, which is difficult if not impossible in reality when different perspectives are the explicit purpose for the integration. [11] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems. [12] This was also the first instance (I believe) of Alon coining the “dataspace” term. First use of the “pay as you go” phrase was, Alon Halevy, Michael Franklin, and David Maier, 2006. “Principles of Dataspace Systems,” in Proceedings of ACM Symposium on Principles of Database Systems, pp: 1-9. See also the slides accompanying that talk, Alon Halevy, 2006. “Principles of Dataspace Systems (PODS),” June 26, 2006; see http://www.cs.washington.edu/homes/alon/files/pods06-keynote.ppt, 2006. More explicitly the next year see Jayant Madhavan, Shirley Cohen, Xin (Luna) Dong, Alon Y. Halevy, Shawn R. Jeffery, David Ko, and Cong Yu, 2007. “Web-scale Data Integration: You Can Afford to Pay as You Go.” in 3rd Conf. on Innovative Data Systems Research (CIDR), pp 342-350, see http://research.yahoo.com/files/paygo.pdf. The term has been picked up by many others, notably Rada Chirkova, Dongfeng Cheny, Fereidoon Sadriz and Timo J. Salo, 2007. “Pay-As-You-Go Information Integration: The Semantic Model Approach,” see ftp://ftp.csc.ncsu.edu/pub/tech/2007/TR-2007-30.pdf; and most recently papers by Gerhard Weikum on RDF-3X; see http://domino.mpi-inf.mpg.de/internet/reports.nsf/c125634c000710cec125613300585c64/70e8f906d8090e6bc125757f00448ec9!OpenDocument&ExpandSection=-1. [13] See M.K. Bergman, 2010. “MIKE2.0: Open Source Information Development in the Enterprise,” AI3 Blog posting, February 23, 2010; and M.K. Bergman, 2010. “Open SEAS: A Framework to Transition to a Semantic Enterprise,” AI3 Blog posting, March 1, 2010.Yesterday Fred Giasson announced the release of code associated with Structured Dynamics‘ open source semantics components (also called sComponents). A semantic component is an ontology-driven component, or widget, based on Flex. Such a component takes record descriptions, ontologies and target attributes/types as inputs and then outputs some (possibly interactive) visualizations of the records.
Though not all layers are by any means complete, from an architectural standpoint the release of these semantic components provides the last and missing layer to complete our open semantic framework. Completing this layer now also enables Structured Dynamics to rationalize its open source Web sites and various groups and mailing lists associated with them.
The OSF “Semantic Muffin”We first announced the open semantic framework — or OSF — a couple of weeks back. Refer to that original post for more description of the general design [1]. However, we can show this framework with the semantic components layer as illustrated by what some have called the “semantic muffin”:
(click for full size)
The OSF stack consists of these layers, moving from existing assets upward through increasing semantics and usability:
Not all of these layers are required in a given deployment and their adoption need not be sequential or absolutely depend on prior layers. Nonetheless, they do layer and interact with one another in the general manner shown.
The Semantics Components LayerCurrent semantic components, or widgets, include: filter; tabular templates (similar to infoboxes); maps; bar, pie or linear charts; relationship (concept) browser; story and text annotator and viewer; workbench for creating structured views; and dashboard for presenting pre-defined views and component arrangements. These are generic tools that respond to the structures and data fed to them, adaptable to any domain without modification.
Though Fred’s post goes into more detail — with subsequent posts to get into the technical nuances of the semantic components — the main idea of these components is shown by the diagram below.
These various semantic components get embedded in a layout canvas for the Web page. By interacting with the various components, new queries are generated (most often as SPARQL queries) to the various structWSF Web services endpoints. The result of these requests is to generate a structured results set, which includes various types and attributes.
An internal ontology that embodies the desired behavior and display options (SCO, the Semantic Component Ontology) is matched with these types and attributes to generate the formal instructions to the semantic components. These instructions are presented via the sControl component, that determines which widgets (individual components, with multiples possible depending on the inputs) need to be invoked and displayed on the layout canvas. Here is a picture of the general workflow:
(click for full size)
New interactions with the resulting displays and components cause the iteration path to be generated anew, again starting a new cycle of queries and results sets. As these pathways and associated display components get created, they can be named and made persistent for later re-use or within dashboard invocations.
Consolidating and Rationalizing Web Sites and Mailing ListsAs the release of the semantic components drew near, it was apparent that releases of previous layers had led to some fragmentation of Web sites and mailing lists. The umbrella nature of the open semantic framework enabled us to consolidate and rationalize these resources.
Our first change was to consolidate all OSF-related material under the existing OpenStructs.org Web site. It already contained the links and background material to structWSF and irON. To that, we added the conStruct and OSF material as well. This consolidation also allowed us to retire the previous conStruct Web site as well, which now re-directs to OpenStructs.
We also had fragmentation in user groups and mailing lists. Besides shared materials, these had many shared members. The Google groups for irON, structWSF and conStruct were thus archived and re-directed to the new Open Semantic Framework Google group and mailing list. Personal notices of the change and invites have been issued to all members of the earlier groups. For those interested in development work and interchange with other developers on any of these OSF layers, please now direct your membership and attention to the OSF group.
There has also been a revigoration of the developers’ community Web site at http://community.openstructs.org/. It remains the location for all central developer resources, including bug and issue tracking and links to SVNs.
Actual code SVN repositories are unchanged. These code repositories may be found at:
We hope you find these consolidations helpful. And, of course, we welcome new participants and contributors!
[1] An alternative view of this layer diagram is shown by the general Structured Dynamics product stack and architecture.As of yesterday, the readership on this AI3 blog passed 3000 daily for the first time. It has been steadily inching upward, and finally passed that minor milestone. Thank you!
I’ve been writing this blog for five years now, with some 400 total posts, or about 1.5 blog posts per week. I know my style is toward longer articles and less frequent posting, most often of a fairly detailed or technical nature. And, while I have a Twitter account, I do not bleat. My style is for more meaty discussions. Perhaps it belies my age.
The real growth in this blog, however, has come about with my conscious attempt to write for the enterprise audience. RDF, the semantic enterprise, linked data and ontologies need a bridge from the technical community to the one of practitioners. Much progress and uptake has been occurring with these business and government audiences.
At the recent SemTech meeting, I was taken aside by many individuals noting my blog posts and thanking me for the thought and effort behind them. Thank you for noticing, and reading, and you are welcome. We need more translation of semantic topics and technologies to pragmatic terms.
If you have been following the standard W3C and SemWeb mailing lists recently, you will have noticed an anxiety and a continuation of the fractious nature of this “community”. In part this comes about because there are efforts afoot to revisit the RDF specs. But, mostly, I think, it is the ongoing nature of many in this group to snatch defeat from the jaws of victory. The search by some for perfection and insistence on parochial needs and preferences can give a pettiness to this “community” that is unbecoming.
Many of us have abandoned those forums for those reasons. As for myself, I will continue to evangelize to the buying market and keep the gaze pointing outward. There is a wealth of need for tools, techniques, methods, documentation, structures, and narratives. Thanks to all of you, the readership of this blog, for continuing to affirm this value.
So, in the great scheme of things, the readership of this blog is quite small in comparison with the big boys. On the other hand, very few individuals have higher numbers, and all of this for a fairly esoteric area. I think this proves there is a market and a need out there for semantic solutions.
Thanks again! And, for those in the United States, have a most enjoyable 4th of July holiday!
Structured Dynamics has been in a fervent — and, we believe, fruitful — design phase for the past 18 months. All of the working parts related to how to embrace becoming a semantic enterprise have now been defined and designed. Actual tools and components accompany many of these parts and have been deployed.
Recently, I have been speaking and blogging much about rationale, process, mindset and approach for how to bring semantics into the organization. But, prior to now, we have not spoken much about the overall design behind our approach. Today, as we complete our design phase and introduce our first exemplar instance of it — Citizen Dan [1] — we are finally in a position to describe this overall approach.
We term our approach the open semantic framework, also OSF. The open semantic framework is a combination of a layered architecture and modular software. The open semantic framework represents the software component of the four-component total open solution, recently described in a three part series. I return to this topic in the conclusion of this post.
Revisiting Design ObjectivesOver the past nine months, I have been focusing my writing largely on the semantic enterprise, with more specificity regarding our Open SEAS (Semantic Enterprise Adoption and Solutions) initiative. In bits and pieces, these writings have tended to reflect a number of objectives:
To date, the result of these design objectives is perhaps best captured in my Seven Pillars of the Open Semantic Enterprise posting, as well as our general discussions regarding adaptive ontologies. Yet, still, these writings have been somewhat piecemeal. What this document attempts to do is to place all of these perspectives into a single, coherent whole.
The Incremental Layers of the Open Semantic FrameworkStructured Dynamics has been a strong advocate for layered architectures, with clear APIs between layers as appropriate. But these layers are not “laminates” that completely cover the layer below, nor are they all needed or necessary. Depending on the circumstance, some layers are unneeded or superfluous. Layers may be added or not incrementally.
In this manner, then, the open semantic framework is perhaps more akin to a pearl, than to a laminate or cocoon. Each subsequent layer does not “embed” the layer prior to it, and some layers actually may inter-operate with multiple layers below or above it (this is notably true for the “ontologies” layer, which has interactions up and down the stack).
Nonetheless, we can envision this pearl of the open semantic framework and its layers as follows:
(click for full size)
Others have termed this the “semantic muffin” or even “semantic muppet” or “semantic blob”. Whatever (hehe). The real idea is that layers may accrete (as in the growth of a pearl) and occur over time and be uneven. Each layer, though, does have a role to play (though it may not be needed in a given deployment), and does act to augment existing information assets in the transition to a semantic framework. Beginning at the core, each of these layers — with external references as appropriate for more details — is described below.
Existing Assets LayerThe open semantic framework is premised on leveraging existing information assets. Sure, once the framework is in place, new information can be brought into it in a more direct, semantic manner. But, the real thrust and benefit of this framework is to provide an incremental pathway for finally inter-operating and federating prior decades of data, structure and information assets.
These information assets may reside inside or outside the enterprise. They may (and DO!) exist in many formats and are described by many schema. They may come from internal transaction systems or warehouses, or may exist external on the Web or at supplier or partner sites. These information assets may span from conventional databases and relational data systems to XML interchange standards, Web pages and standard internal text or documents. In short, there is NO information asset that is not amenable to be included in this framework.
The Information Transformation (scones/irON) LayerThe information transformation layer provides either: 1) extraction of concepts and entities as structured metadata from source text or documents; or 2) conversion of existing data assets to interoperable form. As implemented by Structured Dynamics, the extractions are conducted by either scones (Subject Concept or Named EntitieS) or third-party utilities, and the conversions occur via irON (instance record Object Notation) or third-party “RDFizers“.
Depending on the source, the net result of the transformation is to produce interoperable data and information that can be ingested and used by other layers in the framework.
Though not strictly analogous, this layer bears some resemblance to the ETL (extract, transfer, load) utilities used in many enterprise information integration applications. Unlike those conventional systems, this information transformation layer also may capture and represent some of the source schema.
In all cases, however, these transformations are relatively simple and get parsed against the available structure (the ontologies, schema and entity reference lists) in the system to generate the semantic metadata (tags).
At this point, the extracted structure is generally at the level of instance records, or the ABox, with simple assertions of attribute-value pairs for specific records [2]. Little schema transformation or mapping occurs at this layer (if such is needed, that occurs at the structWSF layer; see next). Actual federation or interoperation occurs at later layers based on the TBox structures [2].
This modular portion of the framework is explicitly designed with APIs to allow third-party tools to be plugged in and substituted.
The structWSF LayerThe major workhorse of the open semantic framework is the structWSF (Web services framework) layer. structWSF is the most complicated of the OSF layers and has many supporting software packages and capabilities. The structWSF layer provides the standard, common interface (”canonical”) layer by which existing information assets get represented and presented to the outside world and to other layers in the OSF stack.
structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies; see below).
The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The current structWSF framework comes packaged with a baseline set of about twenty Web services in CRUD, browse, search and export and import. All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML. An internal representation, structXML [3], is used for internal communications across all structWSF Web services and with other layers.
structWSF has a central service that governs access rights and permissions. These rights occur at the level of the dataset, which gives immense flexibility to how data may be accessed, read, modified, created or deleted (or not). Datasets within a given structWSF instance may be accessed directly via API or via SPARQL queries to the instance’s endpoint. Depending on rights and query, results sets may be returned from a given structWSF instance in an infinite variety of ways.
This latter capability is the essential interface for subsequent layers in the open semantic framework stack. Depending on those subsequent components, pre-staged data and results sets may be returned for an essentially limitless variety of purposes.
Each structWSF instance also has a unique Web address that enables one or a multitude of instances to communicate and share with one another. This simple, but elegant, method enables structWSF instances to participate or not in potentially global or restricted local networks and collaboration environments. This is currently the largest untapped potential of structWSF with respect to its existing deployments.
The Semantic Components LayerThe newest layer in the stack is the semantic components layer. This layer takes results sets — most often generated by a specific query or data slice request — from one or more structWSF instances and then presents that information via a variety of data visualization or data presentation widgets (what we specifically call ‘semantic components‘ due to their design [4]). The operation and sensitivity of these display components are themselves driven by a presentation and data analysis (including statistics) ontology.
Current display widgets include: filter; tabular templates (similar to infoboxes); maps; bar, pie or linear charts; relationship (concept) browser; story and text annotator and viewer; workbench for creating structured views; and dashboard for presenting pre-defined views and component arrangements. These are generic tools that respond to the structures and data fed to them, adaptable without modification to any domain.
As presently implemented by Structured Dynamics, this layer consists either of Flex data visualization components or structured data display templates based on Smarty. The inherent design allows for updates to other bases (such as HTML5). The layer may also be swapped out or substituted with third-party capabilities.
The strength and power of this system is governed by its own ontology, the Semantic Component Ontology (SCO) (see next).
This is an extremely flexible layer in the open semantic framework stack. Expect an ongoing series of explanatory blog posts and online resources in the upcoming weeks to explain this innovative capability.
The Ontologies LayerThe ontologies layer actually refers to all structured assets driving the system. As such, this layer might be considered the “brain” (though rather simply specified!) of the open semantic framework.
At a true schema or TBox level [2], the ontologies layer represents the concept and relationships of the domain at hand. This layer also hosts the specific local entities and prominent things (people, places, events, etc.) useful for extracting local and domain-specific relevance. However, those views are also supplemented with some administrative ontologies (two examples are SCO and irON) that guide how the user interfaces or widgets in the system should behave.
The concept level represents the “world view” of the specific instantiation of the open semantic framework at hand. This conceptual (TBox) view provides the structural organization of information, inferencing capabilities, and navigation, faceting and explorer structure. The entity (ABox) view provides tagging for prominent individuals and instances important to the domain at hand, and guides the structure behind data visualizations of attribute or indicator data.
The administrative level uses simple roles and relationships for attributes and indicators to inform the framework as to how and with what widget to display information. For example, a “type” of information that is geographically related can be instructed to use the map component as an option for display. Whether some information is used for totals, comparison purposes, or other specifications useful to data visualization and graphing may also be specified.
The language and relationships (predicates or properties) of these administrative ontologies are simple and straightforward. It is, for example, relatively easy to define data display functions at the broad dataset and attributes level. Simple determinations drive how results sets and their associated results types may be displayed, no matter what datasets or slices may be generated as a result of the queries or requests fed to the system.
The structure in these layers can be replaced by other structures for other instantiations and circumstances. Indeed, all other layers in the open semantic framework can remain relatively fixed while tailoring the instance to new domains solely via this layer. The ontologies layer is what gives any given instantiation of OSF — such as Citizen Dan — its unique focus and scope.
The Content Management System (conStruct) LayerThe thinnest layer (that is, least substantial with respect to this framework) is the content management system (CMS) layer. In its current form, the open semantic framework uses the Drupal CMS via our conStruct plug-in modules. The design of the framework, however, has explicitly accommodated the possibility that other CMSs may substitute for this role.
The CMS layer is optional if structWSF endpoints are sufficient or if simple Web pages hosting semantic components are deemed as adequate. Very small organizations or deployments may reasonably choose to have no CMS layer at all.
However, for most sites or portals with more than a few active users, it is desirable to have broad flexibility in theming (”skinning”), user rights and permissions, or other functionality. These are the roles of the CMS layer. Drupal, for example, is presently supported by more than 4500 third-party modules in every conceivable function, from polling to blogs and rating systems and bulletin boards.
For such generalized portals or collaboration environments, it makes sense to adopt and install a flexible CMS system, such as Drupal. Much of the user experience and functional environment can be provided through such means.
The open semantic framework is thus designed to reside easily in a CMS while also providing the hooks to take advantage of the generalized user rights and functionality of the CMS. In this manner, the open semantic framework is able to stay focused on its structured data and interoperability purposes, while still gaining the advantages of rich-featured content management systems.
The OSF is a Web-oriented ArchitectureWith its inherent open-world orientation [5] and distributed and collaborative potential, the open semantic framework was designed from the outset to be Web-capable and Web-oriented:
(click for full size)
A Web-oriented architecture (WOA) has a number of understood requirements, to which the open semantic framework adheres. Specifically, these design considerations support the framework as being part of WOA:
Citizen Dan is our first exemplar instance of this open semantic framework. The details page for the project goes into some of Citizen Dan’s functionality and capabilities.
Citizen Dan is specifically geared to local governments and localities, with an emphasis on community indicator systems (CIS). CIS have become a popular way of measuring and tracking measures of local economic and social well-being; they are closely related to sustainability and how to measure it as used in many economic and environmental domains.
However, in the context of this post, what is really interesting about Citizen Dan is that its semantic framework is a completely open and generic one. The same set of tools and capabilities described on its details page can be applied to any domain that needs to manage and understand information in its own domain. This includes from unstructured text or documents to conventional structured databases.
What changes from domain to domain are the data structures (the ontologies, schema and entity reference lists; see above) that are fed to this open semantic framework. By swapping out new structures, what can be called Citizen Dan in one instance can morph to become Curriculum Carla in say, the education instance or Doctor Doolittle in the veterinary science instance [6].
We can illustrate these multiple instances as follows:
(click for full size)
What this figure illustrates is that even a branded expression of the framework — such as Citizen Dan — is merely an instance of that framework. And, actually, when expressed in such a packaged manner, we can more accurately call the standard and bundled suite of generic functions and accompanying structure of Citizen Dan as an instantiation of the open semantic framework:
in·stan·ti·ate \in-ˈstan(t)-shē-āt\ (transitive verb) is to:in·stan·ti·a·tion \in-‘stan(t)-shē-ā-shən\ (noun) [7]
By replacing the structure bases, and by tailoring the function suite appropriate to a given market and use, we can create many instantiations of the open semantic framework for different domains and markets. In this manner, Citizen Dan can be seen as an early exemplar of the framework, but not as a definer and limiter to it.
OSF is the Software Leg to a ‘Total Open Solution‘So far, this discussion has focused solely on considerations of software and architecture. While we see the power of the open semantic framework, highly useful in itself, this is inadequate alone to achieve acceptance and success in the enterprise (as we noted in our most recent posts). The very forces that are compelling enterprises to look at new options, are also the same ones that pose difficult hurdle rates for acceptance of open source.
To address this issue, we have developed a four-legged foundation to what we termed the total open solution. The solution involves software, structure, documentation and methods (or best practices). Each of these connect and relate to the other foundations.
The open semantic framework is clearly the software (and architecture) leg to this foundation. Again, however, what is interesting is that the mere swapping out of the structure can also make the system relatively ready for other domains.
We see these relationships in the following diagram, that also shows that the DocWiki portions of the solution embody the documentation (aside from code-level comments) and methods legs of the foundation:
(click for full size)
Differences between domains may also lead to differences as to which components are included or not in that domain’s desired instantiation.
The hugely important implied point, however, from the diagram above, is to show how nearly universal the content and methods in the DocWiki may be to other domains. Because the deltas between domains largely result from structure and what specific functional components are included or not, it becomes clear that most documentation and practices shared with the DocWiki will be applicable across domains. Sure, the use cases and some of the specific terminology may change, but we can also now see a high degree of re-usability of documentation and knowledge base across markets. This realization makes the usefulness and leverage of the DocWiki even higher.
A Common Language and Framework for Moving ForwardDeveloping “common language” by which to describe and convey things — especially new things like semantics that also have strong technical aspects — is tough, very tough. We are only now beginning on this process; we look to many in the community and elsewhere to help define informative and evocative terminology.
Per the original design objectives above, Structured Dynamics has approached the challenge of the semantic enterprise in what we think is both a pragmatic and a new way. The insistence on preserving and respecting existing information assets, matched with the opportunities and different mindsets arising from an open-world approach [5], have necessitated thinking through new designs and developing new concepts. Any time such new thinking and concepts occurs, new language and new metaphors must accompany it.
While certainly there are components and various software packages that populate and comprise an open semantic framework, the framework is also just as importantly a world view or way to think about information, information development, and its architecture. For example, a pivotal concept is that an open semantic framework is built around generic tools responsive to the information structures fed to them. This realization shifts the locus of emphasis from software development per se to creating, managing and adapting data and information structures. While this democratizes the information development process and is more inclusive of all knowledge workers, it also imposes needs for new toolsets and business processes. We are only at the nascent stages of understanding and learning about these differences.
Similarly, a development approach that is inherently incremental and leverages (rather than replaces or displaces) existing information assets means IT projects need to be considered in a new light. Small projects with more emphasis on tangible and demonstrable benefits will alter budgets, lower risks, and place a need for quicker turnaround. Like the architecture of the open semantic framework itself, projects based on OSF are also more distributed, decentralized and modular.
With such decentralization also comes the need for mechanisms and systems to overcome vendor “lock-in” and proprietary systems. A key thrust in support of what we have called the total open solution and its mixture of documentation and methods to accompany software and structure is specifically targeted at this issue. Tools and means for collaboration and concurrent contributions are another possible answer. Prior software practices in agile development and version control will see extensions to all manner of information development across the enterprise.
We are proud of our design work and proof-testing with clients over the past 18 months. We believe the open semantic framework and its implications to be a fundamental shift in how organizations need to think about their information development, existing information assets, and IT budgets and processes. We know widescale adoption is not yet at hand — enterprises are justifiably conservative when it comes to new thinking. But, given global competition and tight pocketbooks, the open semantic framework is a formulation to which enterprises and governments should pay very close attention.
[1] Citizen Dan is an open source system for aggregating different indicator data concerning local, community well-being. Information sources may include the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics. See current details with screenshots. [2] Structured Dynamics’ best practices approach makes explicit splits between the “ABox” (for instance data) and “TBox” (for ontology schema) in accordance with our working definition for description logics, a fundamental underpinning for how we use RDF: “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.” [3] A subsequent post will document this rather straightforward XML schema. [4] Contact Structured Dynamics for a early sneak peek. The Citizen Dan application will be publicly released as an online sandbox and demo by the end of summer 2010. [5] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Anothe way to say is it that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems. [6] Of course, things are always not so simple as this. The CMS layer gives the open semantic framework the ready ability to change themes and layouts (”skins), not to mention the breadth and specifics of what ancillary site functionality might be provided. Moreover, the module basis of the open semantic framework also means that entire clusters of functionality might be dropped from a given instantiation (or added to it!) without violating or negating this framework. [7] Dictionary references are from Merriam-Webster and Wikitionary.
For a dozen years, my career has been centered on Internet search, dynamic content and the deep Web. For the past few years, I have been somewhat obsessed by two topics.
The first topic, a conviction really, is that implicit structure needs to be extracted from Web content to enable it to be disambiguated, organized, shared and re-purposed. The second topic, more an open question as a former academic married to a professor, is what might replace editorial selections and peer review to establish the authoritativeness of content. These topics naturally steer one to the semantic Web.
A Millennial PerspectiveThe semantic Web, by whatever name it comes to be called, is an inevitability. History tells us that as information content grows, so do the mechanisms for organizing and managing it. Over human history, innovations such as writing systems, alphabetization, pagination, tables of contents, indexes, concordances, reference look-ups, classification systems, tables, figures, and statistics have emerged in parallel with content growth [19].
When the Lycos search engine, one of the first profitable Internet ventures, was publicly released in 1994, it indexed a mere 54,000 pages [1]. When Google wowed us with its page-ranking algorithm in 1998, it soon replaced my then favorite search engine, AltaVista. Now, tens of billions of indexed documents later, I often find Google’s results to be overwhelming dross — unfortunately true again for all of the major search engines. Faceted browsing, vertical search, and Web 2.0’s tagging and folksonomies demonstrate humanity’s natural penchant to fight this entropy, efforts that will next continue with the semantic Web and then mechanisms unforeseen to manage the chaos of burgeoning content.
An awful lot of hot air has been expelled over the false dichotomy of whether the semantic Web will fail or is on the verge of nirvana. Arguments extend from the epistemological versus ontological (classically defined) to Web 3.0 versus SemWeb or Web services (WS*) versus REST (Representational State Transfer). My RSS feed reader points to at least one such dust up every week.
Some set the difficulties of resolving semantic heterogeneities as absolutes, leading to an illogical and false rejection of semantic Web objectives. In contrast, some advocates set equally divisive arguments for semantic Web purity by insisting on formal ontologies and descriptive logics. Meanwhile, studied leaks about “stealth” semantic Web ventures mean you should grab your wallet while simultaneously shaking your head.
A Decades-Long PerspectiveMy mental image of the semantic Web is a road from here to some achievable destination — say, Detroit. Parts of the road are well paved; indeed, portions are already superhighways with controlled on-ramps and off-ramps. Other portions are two lanes, some with way too many traffic lights and some with dangerous intersections. A few small portions remain unpaved gravel and rough going.
Wreck in Nebraska during the 1919 Transcontinental Motor Convoy
A lack of perspective makes things appear either too close or too far away. The automobile isn’t yet a century old as a mass-produced item. It wasn’t until 1919 that the US Army Transcontinental Motor Convoy made the first automobile trip across the United States.
The 3,200 mile route roughly followed today’s Lincoln Highway, US 30, from Washington, D.C. to San Francisco. The convoy took 62 days and 250 recorded accidents to complete the trip (see figure), half on dirt roads at an average speed of 6 miles per hour. A tank officer on that trip later observed Germany’s autobahns during World War II. When he subsequently became President Dwight D. Eisenhower, he proposed and then signed the Interstate Highway Act.
That was 50 years ago. Today, the US is crisscrossed with 50,000 miles of interstates, which have completely remade the nation’s economy and culture [2].
Today’s PerspectiveLike the interstate system in its early years, today’s semantic Web lets you link together a complete trip, but the going isn’t as smooth or as fast as it could be. Nevertheless, making the trip is doable and keeps improving day by day, month by month.
My view of what’s required to smooth the road begins with extracting structure and meaningful information according to understandable schema from mostly uncharacterized content. Then we store the now-structured content as RDF triples that can be further managed and manipulated at scale. By necessity, the journey embraces tools and requirements that, individually, might not constitute semantic Web technology as some strictly define it. These tools and requirements are nonetheless integral to reaching the destination. We are well into that journey’s first leg, what I and others are calling the structured Web.
For the past six months or so I have been researching and assembling as many semantic Web and related tools as I can find [3]. That Sweet Tools listing now exceeds 500 tools [4] (with its presentation using the nifty lightweight Exhibit publication system from MIT’s Simile program [5]). I’ve come to understand the importance of many ancillary tool sets to the entire semantic Web highway, such as natural language processing and information extraction. I’ve also found new categories of pragmatic tools that embody semantic Web and data mediation processes but don’t label themselves as such.
In its entirety, the Sweet Tools listing provides a pretty good picture of the semantic Web’s state. It’s a surprisingly robust picture — though with some notable potholes — and includes impressive open source options in all categories. Content publishing, indexing, and retrieval at massive scales are largely solved problems. We also have the infrastructure, languages, and (yes!) standards for tying this content together meaningfully at the data and object levels.
I also think a degree of consensus has emerged on RDF as the canonical data model for semantic information. RDF triple stores are rapidly improving toward industrial strength, and RESTful designs enable massive scalability, as terabyte- and petabyte-scale full-text indexes prove.
Powerful and flexible middleware options, such as those from OpenLink [6], can transform and integrate diverse file formats with a variety of back ends. The World Wide Web Consortium’s GRDDL standard [7] and related tools, plus various “RDF-izers” from Massachusetts Institute of Technology and elsewhere [8], largely provide the conversion infrastructure for getting Web data into that canonical RDF form. Sure, some of these converters are still research-grade, but getting them to operational capabilities at scale now appears trivial.
Things start getting shakier when trying to structure information into a semantic formalism. Controlled vocabularies and ontologies range broadly and remain a contentious area. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language [9] or microformats [10]. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) [11] and the still greater formalism of OWL’s various dialects [12].
If we compare the semantic Web to the US interstate highway system, we’re still in the early stages of a journey that will remake our economy and culture. Many potholes on the road to the semantic Web exist. One ready task is to transform existing structure to RDF. Another priority is to refine tools to extract structure and meaningful information from uncharacterized content.Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications.
All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it into canonical RDF form, we can then move on to needed developments in semantic mediation, some of the roughest road on the journey.
Potholes on the Semantic HighwaySemantic mediation requires appropriate structured content. Many potholes on the road to the semantic Web exist because the content lacks structured markup; others arise because existing structure requires transformation. We need improved ways to address both problems. We also need more intuitive means for applying schema to structure. Some have referred to these issues as “who pays the tax.”
Recent experience with social software and collaboration proves that a portion of the Internet user community is willing to tag and characterize content. Furthermore, we can readily leverage that resulting structure, and free riders are welcomed. The real pothole is the lack of easy — even fun — data extractors and “structurizers.” But we’re tantalizingly close.
Tools such as Solvent and Sifter from MIT’s Simile program [13] and Marmite from Carnegie Mellon University [14] are showing the way to match DOM (document object model) inspectors with automated structure extractors. DBpedia, the alpha version of Freebase, and System One now provide large-scale, open Web data sets in RDF [15], including all of Wikipedia. Browser extensions such as Zotero [16] are showing how to integrate structure management into acceptable user interfaces, as are services such as Zoominfo [17]. Yet we still lack easy means to design the differing structures suitable for a plenitude of destinations.
Amazingly, a compelling road map for how all these pieces could truly fit together is also incomplete. How do we actually get from here to Detroit? Within specific components, architectural understandings are sometimes OK (although documentation is usually awful for open source projects, as most of the current tools are). Until our community better documents that vision, attracting new contributors will be needlessly slower, thus delaying the benefits of network effects.
So, let’s create a road map and get on with paving the gaps and filling the potholes. It’s not a matter of standards or technology — we have those in abundance. Let’s stop the silly squabbles and commit to the journey in earnest. The structured Web’s ability to reach Hyperland [18], Douglas Adam’s prescient 1990 forecast of the semantic Web, now looks to be no further away than Detroit.
This Friday brown bag leftover was first placed into the AI3 refrigerator about three years ago on May 3, 2007. The piece was my answer to a request by Jim Hendler to pen some thoughts on the semantic Web, based on I believe what he thought might be a pragmatic perspective combining Internet business with Web science. The formal piece appeared as a guest editorial in the May/June 2007 issue of IEEE Intelligent Systems. What appears above is unaltered from my original posting (aside from some minor formatting clean-up and — sorry to say — some of the projects are now defunct). [1] Chris Sherman, “Happy Birthday, Lycos!,” Search Engine Watch, August 14, 2002. See http://searchenginewatch.com/showPage.html?page=2160551. [2] David A. Pfeiffer, “Ike’s Interstates at 50: Anniversary of the Highway System Recalls Eisenhower’s Role as Catalyst,” Prologue Magazine, National Archives, Summer 2006, Vol. 38, No. 2. See: http://www.archives.gov/publications/prologue/2006/summer/interstates.html. [3] The mention of specific tool names is meant to be illustrative and not necessarily a recommendation. [4] Sweet Tools (SemWeb) listing; see http://www.mkbergman.com/new-version-sweet-tools-sem-web/ . [5] See http://simile.mit.edu/exhibit/. [6] OpenLink Software’s Virtuoso and Data Spaces products; see http://www.openlinksw.com/. [7] W3C’s Gleaning Resource Descriptions from Dialects of Languages (GRDDL, pronounced “griddle”). See http://www.w3.org/2004/01/rdxh/spec. [8] See http://simile.mit.edu/wiki/RDFizers. [9] Outline Processor Markup Language (OPML); see http://www.opml.org/. [10] Microformats; see http://microformats.org/. [11] DOAP (Description of a Project), FOAF (Friend of a Friend), SIOC (Semantically-Interlinked Online Communities) and SKOS (Simple Knowledge Organizing System). [12] W3C’s Web Ontology Language (OWL). See http://www.w3.org/TR/owl-features/. [13] Solvent (http://simile.mit.edu/wiki/Solvent) and Sifter (http://simile.mit.edu/wiki/Sifter) are from MIT’s Simile program. [14] Marmite (http://www.cs.cmu.edu/~jasonh/projects/marmite/) is from Carnegie Mellon University. [15] DBpedia (http://dbpedia.org/docs/) and Freebase (in alpha, by invitation only at http://www.freebase.com/) are two of the first large-scale open datasets on the Web; Wikipedia has also been converted to RDF by System One (http://labs.systemone.at/wikipedia3). [16] Zotero is produced by George Mason University’s Center for History and New Media; see http://www.zotero.org. [17] ZoomInfo (http://www.zoominfo.com/) provides online structured search of companies and people, plus broader services to enterprises. [18] The late Douglas Adams, of Doctor Who and A Hitchhiker’s Guide to the Galaxy fame, produced a TV program for BBC2 presaging the Internet called Hyperland. This 50-min video can be seen in five parts via YouTube at Part 1 of 5, 2 of 5, 3 of 5, 4 of 5 and 5 of 5. [19] Since I first wrote this piece, I have systematized these developments in my Timeline of Information History.In the first part to this series, we began with the argument that open source software alone was not sufficient to meet the required acceptance factors in the enterprise. As a guiding way to create the right mindset around these issues we shared the saying that we have adopted at Structured Dynamics that, “We’re successful when we are not needed.”
In the second part of this series we described the four legs of a stable, open source solution. These four legs are software, structure, methods and documentation. When all four are provided, we termed this a total open solution.
Now, in this third and concluding part to our series, we introduce the open source documentation and methodology system called ‘DocWiki’. It complements the base open source software, in the process completing the conditions for a total open solution.
Though we call this system ‘DocWiki’, it is not meant to be a brand or particular product description for what Structured Dynamics is offering. Rather, ‘DocWiki’ is merely a placeholder name for a generic, open source system and knowledge base that can be downloaded, installed, branded, modified and extended in whatever way the user sees fit. ‘DocWiki’ is a baseline documentation and methodology “starter kit” that can be dressed up in new clothes or packaged and named in whatever manner best suited to a given deployment.
In describing the major components of this ‘DocWiki’ system we will again use our Citizen Dan initiative [1] as we did in Part 2. This gives us a real use case, though the same approach is applicable to any open source information management initiative by enterprises.
We call the specific version of the ‘DocWiki’ used in the case of Citizen Dan the ‘CIS DocWiki‘ (for community indicator systems), specific to the domain and local government focus of Citizen Dan. Similarly, the structured vocabulary and ontology that guides the system is the MUNI ontology. For other information development initiatives, the specific content of these components would be swapped out for ones appropriate to that initiative.
Overview of the ‘DocWiki’ SystemA number of desires and objectives intersected to guide the design of the ‘DocWiki’ system. We wanted:
In first formulating this design, our assumption was the major building blocks would be an open source document management system linked with some form of version control. Though we think such a formulation could work OK, our exposure to the MIKE2.0 methodology actually caused us to re-look at and re-think a wiki-based approach. Ultimately the trump card that decided the design for us was familiarity and ease-of-use.
The resulting architecture of the full ‘DocWiki’ system is shown below:
(click for full size)
What is cool about this design is that a single software download install with a few extensions (Mediawiki, the Wikipedia software, plus some standard extensions and judicious use of Semantic Mediawiki) and a single loadable database are all that is required to transfer and install the ‘DocWiki’ system.
To better describe this system, we will focus on three major interconnecting pieces in this architectural diagram: the knowledge base; the vocabulary and structure (ontology); and the authoring and publishing system (wiki).
The ‘DocWiki’ Knowledge BaseThe pre-loaded content for the ‘DocWiki’ system comes from its knowledge base. This is provided as a text-exported MySQL database that can be modified en masse before loading (such as substituting ‘YourName’ for ‘DocWiki’). The exemplar upon which this knowledge base is modeled is the MIKE2.0 framework.
MIKE2.0 (Method for an Integrated Knowledge Environment ) provides a comprehensive methodology that can be applied across a number of different projects within the information management space. MIKE2.0 provides an organized way to describe the why, when, how and who of information management projects. Via standard templates and structures, MIKE2.0 provides a consistent basis to describe and manage these projects, and in a way that helps promote their interoperability and consistency across the enterprise.
MIKE2.0 has a generalized methodology and set of templates applicable to initiatives, the phases, activities and tasks to undertake them, and supporting assets. Supporting assets can range from glossaries and definition of terms and concepts to very specific technical documents or background material. The entire system is logical and applies a consistent design and organizational structure and categories.
For our purposes, we wanted a complete, turnkey content knowledge base. This meant that we needed to accommodate all forms of project management and guidance, ranging from specific “how-to” and technical discussions to the entire suite of background and supporting material. The scope of this knowledge content is defined as what a new person assigned a lead or implementation responsibility would need to read or master.
As a destination site MIKE2.0 is quite broad: it embraces the ability to model virtually any information management initiative. This makes MIKE2.0 an invaluable source of structure and methodology guidance, but also results in it being quite limited in the specific how-tos associated with any given initiative. I have earlier spoken about the structure of MIKE2.0 and in particular its applicability to the semantic enterprise.
The strength of MIKE2.0, however, is that its structure can be grabbed and quickly applied to form an organizational and structural basis for filling out the knowledge base for any specific information development initiative. And, that is exactly what we did with the ‘CIS DocWiki.’
MIKE2.0 hosts and maintains its project-related structure in Mediawiki (with some extensions). Combined with its templates, this provides a rapid-start baseline for beginning to tailor and flesh out the specific details for a given information management initiative. Thus, after copying broad aspects of the MIKE2.0 system into the incipient ‘DocWiki’, it was relatively straightforward to let the existing structure and templates of MIKE2.0 guide next steps.
As of today’s date, the ‘CIS DocWiki’ contains about 300 substantive articles, a complete activity and tasking structure, and various re-usable templates based on Semantic Mediawiki for structured and consistent access and retrieval. New tasks and structure can be readily added to the system. Existing structure or content can be deleted or marked as archive for non-display. We are still gathering all requisite content pieces, and anticipate by first public release that the baseline knowledge base will include 2x to 3x the scale of its current content.
For new ‘CIS DocWiki’ (or Citizen Dan-based) deployments, this means the knowledge base can be completely modified and extended for local circumstances. The set-up of the Mediawiki instance is separate from the loading or modification of the knowledge base, which means the look-and-feel of the entire system, not to mention user rights and permissions, can also be readily tailored for local requirements.
The core content of the ‘CIS DocWiki’ and its basis in a set structure and methodology (derived from MIKE2.0) means that the knowledge base is also adaptable for other broader information development areas, especially in the semantic enterprise or semantic government arenas. Thus, while Structured Dynamics is first releasing the ‘CIS DocWiki’ in the context of Citizen Dan and semantic government, we also are developing a parallel instance for the Open SEAS approach to the semantic enterprise.
The approach taken here is somewhat different than the standard wiki use. As experts, we are basically sole authoring (with contributions from selected collaborators and our clients) the starting basis of the knowledge base. Unlike many wikis, this enables us to be quite consistent in content, style, and organization. Such an approach allows us to present a coherent and complete starting content and methodology foundation. However, once delivered and installed for a given deployment, its users are then free to extend and change this knowledge foundation in the standard wiki manner. Whether those subsequent extensions are free-form or more tightly controlled and managed is the choice of the new deployment’s administrators.
The Supporting MUNI StructureStrictly speaking, the vocabularies and structures (including, of course, ontologies) that drive our semantic government or semantic enterprise offerings are also part of the knowledge base. And, in fact, many of these aspects, especially related to the actual operating of the instances, are included as part of the standard knowledge base.
However, the applicable domain ontology itself is separately maintained. Descriptions of how to use and modify such ontologies are part of the general ‘DocWiki’ knowledge base, but the ontology is not. This arm’s length-separation is done to acknowledge that the ontology has independent use and value apart from the knowledge base or the software (Citizen Dan, in this case) that is the focus of it.
In the Citizen Dan instance, this structure is the MUNI ontology. MUNI is a general local government domain ontology that can find use in a broad array of circumstances, using or not Citizen Dan. Thus, like other ontologies developed and maintained by Structured Dynamics, such as BIBO (the Bibliographic Ontology), the ontology itself and its documentation, discussion forums and use cases are maintained separately.
The first release of MUNI is still under development and will be released this summer.
The Wiki/Publication Portion of ‘DocWiki’The software framework that hosts and manages all of this content is the Mediawiki software, originally developed for Wikipedia. This framework is supported by a number of standard extensions packaged with the ‘DocWiki’ distribution. One of the more notable extensions is Semantic Mediawiki. Mediawiki also is the wiki framework underlying MIKE2.0, so content sharing between the systems is straightforward.
The Collaborative Wiki PortionThe first use of the ‘DocWiki’ is to add new content to the knowledge base and to modify or extend what is provided in the baseline. For straight authoring, ‘DocWiki’ offers the standard wikitext basis for content entry and editing, as well as the WikED enhanced editor and the FCKEditor WYSIWYG rich-text editor. Each of these may be turned on or off at will.
All of the baseline content is fully organized and categorized via a standard structure. Pre-existing templates aid in entering new content in specific areas consistently or in providing standard administrative ways of tagging content for completeness or need for editorial attention. Tasks and concepts, in particular, follow set ways of entry and description. These set templates, some forms-based and some derived from Semantic Mediawiki, are also tied into automatic internal scripts for listing and organizing various items. So long as new material is entered properly, it will be reflected in various stats and listings. Unlike sole reliance on Semantic Mediawiki, the ‘DocWiki’ approach is a mix of standard wiki categories and semantic types. Both are used for effective organization of the knowledge base.
Besides the knowledge base of domain content and “how-to”, the system also comes pre-packaged with many wiki “how-to” and best practices guidance for using the system effectively and consistently. Of course, a given deployment may or may not enforce all of these practices. A poorly administered instance, for example, could degenerate fairly quickly and lose the native structure and organization of the baseline system.
As with standard wikis, there is a history of prior page revisions that gives the system rollback and version control. Mediawiki has a pretty good user access and permissions framework ranging from access, reading, editing and to uploads.
Besides the standard and required extensions, ‘DocWiki’ also comes packaged with the necessary settings and configuration files to operate “out-of-the-box” in its designed baseline mode. Of course, these settings, too, can be changed and modified by site administrators, and ‘DocWiki’ also includes guidance on how to do that.
The Publication PortionA little known but highly useful part of the Mediawiki API allows direct export of XHTML content [2]. Then, with minor XSLT conversion templates, it is possible to strip out wiki-specific conventions (such as the editing of individual sections) or to create straight XML versions. When this is combined with the use of internal ‘DocWiki’ CSS style sheets that impose some clean and semantic style identifiers, a common canonical output basis for content is possible.
From that point, a given deployment may use its own CSS styles to theme output content. Output Web pages (XHTML) or XML files then can be processed using existing and accurate utilities to produce PDF or *.doc documents. Then, with systems such as OpenOffice, an even wider variety of document formats can be produced. These facilities mean that the ‘DocWiki’ can also act as a single-source publishing environment.
In its initial release, re-purposing ‘DocWiki’ content into other presentations (for example, combining sections from multiple pages into a new document as opposed to re-using existing pages as is) will require creating new wiki pages and then cutting-and-pasting the desired content. However, it should also be noted that both DocBook and DITA have been applied to Mediawiki installations [3]. It should be possible to enable a more flexible re-purposing framework for ‘DocWiki’ moving into the future. When Available
The ‘CIS DocWiki’ is meant to accompany the first release of Citizen Dan, likely by the end of summer. The MUNI ontology will also be released roughly at the same time. At release, the ‘CIS DocWiki’ is anticipated to have on the order of 500-800 baseline content and “how to” articles.
Depending on time availability and other commitments, Structured Dynamics will also be using this information to build a semantic government composite offering to MIKE2.0. We will be contributing this new offering for free, similar to what we have done earlier for a semantic enterprise offering.
Subsequent to those events, we will then be modifying the ‘CIS DocWiki’ for the semantic enterprise domain. Much of the necessary content will have already been assembled for the ‘CIS DocWiki’.
Conclusions and ApplicabilityParadoxically, while developing such knowledge bases and systems such as ‘DocWiki’ appears to be extra work, from our standpoint as developers it is useful and efficient. Structured Dynamics already researches and assembles much material and tries to “document as it goes.” Having the ‘DocWiki’ framework not only provides a consistent and coherent way to organize that information, but it also helps to point out essential gaps in our offerings.
The ‘DocWiki’ delivers the methods, documentation and portions of the structure to a total open solution. The ‘DocWiki’ is the primary means — along with software development and accompanying code-level and API documentation, of course — for us to fulfill our mantra that “We’re successful when we are not needed.” As we pointed out in Part 1 of this series, we really think such an attitude is ultimately a self-interested one. The better we can address the acceptance factors in the enterprise for our offerings, the more opportunities we will gain.
We would like to think that other enlightened open source software developers, especially those in the semantic space but certainly not limited to them, will see the wisdom of this four-legged foundation to total open solutions. Up until now, pragmatic guidance for what it takes to create a complete open source offering to businesses and enterprises has been lacking.
The tools, methods, and workflows all exist for making total open solutions real today. All of the pieces are themselves open source. There are many useful guides for best practices across the pipeline. It is just that — prior to this — no one apparently took the time to assemble and articulate them. We think this three-part series and some of the “how to” guidance in the ‘DocWiki’ system can help fix this oversight.
Ultimately, with wider adoption by developers, goaded in part by demands of the marketplace for them, we would hope that additional innovations and ideas may be forthcoming to improve the industry’s ability to offer total open source solutions. Adding just a small bit of attentive effort to how we organize and package what we know is but a small price to pay for greater acceptance and success.
[1] Citizen Dan is an open source system for aggregating different indicator data concerning local, community well-being. Information sources may include the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics. See current details with screenshots. [2] Clean XHTML can be generated directly from the Mediawiki API. This can be done directly via URL with the action=render command. See for example: http://www.mediawiki.org/wiki/API:Parsing_wikitext. [3] For example, there are a number of paths to migrate from HTML or XHTML to DocBook; see http://wiki.docbook.org/topic/Html2DocBook. But, there is a specific project that also goes directly from Mediawiki; see http://code.google.com/p/gwtwiki/wiki/Mediawiki2Docbook.In the first part to this series, we put forward the argument that incomplete provision of important support factors was limiting the adoption of open source software in the enterprise. We can liken the absence of these factors to having a chair with one or more absent or broken legs.
This second part of the series goes into the four legs of a stable, open source solution. These four legs are software, structure, methods and documentation. When all four are provided, we can term this a total open solution.
These considerations are not simply a matter of idle curiosity. New approaches and new methods are required for enterprises to modernize their IT systems while adding new capabilities and preserving sunk assets. Extending and modernizing existing IT is often not in the self-interests of the original supplying vendors. And enterprises are well aware that IT commitments can extend for decades.
While the benefits and capabilities of open source software become apparent by the day, rates of open source software adoption lag in enterprises. We have seen entire Internet-based businesses arise and get huge in just a few short years. But it is the rare existing enterprise that has committed to and embraced similar Web-oriented architectures and IT strategies [1].
The enterprise IT ecosystem is evolving to become an unhealthy one. New software vendors have generally abandoned enterprises as a market. Much more action takes place with consumer apps and Internet plays, often premised on ad-based revenues or buzz and traffic as attractors for acquisition. Existing middle-tier enterprise vendors are themselves being gobbled up and disappearing. I’m sure all observers would agree that IT software and services are increasingly dominated by a shrinking slate of vendors. I suspect most observers — myself included — would argue that enterprise-based IT innovation is also on the wane.
The argument posed in the first part of this series is that such atrophy should not be unexpected. The current state of open source software is not addressing the realities of enterprise IT needs.
And that is where the other legs of the total open solution come in. In their entirety, they amount to a form of capacity building for the enterprise [2]. It is not simply enough to put forward buzzwords matched with open source software packages. Exciting innovations in social networks, collaboration, semantic enterprise, mobile apps, REST, Web-oriented architectures, information extraction, linked data and a hundred others are being validated on the Internet. But until the full spectrum of success and adoption factors gets addressed, enterprises will not embrace these new innovations as central to their business.
As we describe these four legs to the total open solution, we will sometimes point to our Citizen Dan initiative [3]. That is not because of some universal applicability of the system to the enterprise; indeed Citizen Dan is mostly targeted to local communities and municipalities. But, Citizen Dan does represent the first instance known to us where each of these total open solution success factors is being explicitly recognized and developed. We think the approach has some transferability to the broader enterprise.
Let’s now discuss these four legs in turn.
Leg One: SoftwareOf course, the genesis of this series is grounded in open source software and what it needs to do in order to find broader enterprise acceptance. Clearly that is the first leg amongst the four to be discussed. We also have acknowledged that, generally, best-of-breed open source software is also better documented at the code level, and has documented APIs. We will return to this topic under Leg Four below.
Open source software useful to the enterprise is often a combination of individual open source packages. Some successful vendors of open source to the enterprise in fact began as packagers and documenters of multiple packages. Red Hat for Linux or Alfresco in document management or Pentaho in business intelligence come to mind, as examples.
In the case of Citizen Dan, here are the open source packages presently contained in its offering: Linux (Ubuntu), Apache, MySQL, PHP (these comprising the LAMP stack), Drupal, a variety of third-party Drupal modules, Virtuoso, Solr, ARC2, Smarty, Yahoo UI, TinyMCE, Axiis, Flex, ClearMaps, irON, conStruct, structWSF, and some others. Such combinations of packages are not unusual in open source settings, since new value-add typically comes from extensions to existing systems or unique ways to combine or package them. For example, the installation guide for structWSF alone is quite comprehensive with multiple configuration and test scripts.
Thus, besides direct software, it is also critical that configuration, settings, installation guidance and the like be addressed to enable relatively straightforward set-up. This is an area of frequent weakness. Targeting it directly is a not-so-secret factor for how some vendors have begun to achieve some success with the enterprise market.
Leg Two: StructureAll software works on data. While some data is unstructured (such as plain text) and some is semi-structured (such as HTML or Web pages that mixes markup with text), the objective of information extraction or natural language processing is to extract the “structure” from such sources. Once extracted, such structure can interoperate on a common footing with the structured data common to standard databases.
Thus, we use “structure” to denote the concepts and their relationships (the “schema” or “ontology”) and the indicators and data (attributes and values) to describe them, and the “entities” (distinct individuals or nameable instances) that populate them. In other words, “structure” refers to all of the schema (concepts + relationships) + data + attributes + indicators + records that make up the information upon which software can operate.
Structure exists in many forms and serializations. Generally, software represents its internal information in one or a few canonical storage and manipulation formats, though that same software may also be able to import (ingest) or export its information and data in many different external formats.
In our semantic enterprise work, especially with its premise in ontology-driven applications using adaptive ontologies, structure is an absolutely essential construct. But, frankly, no information technology system exists that does not also depend on structure to a more or less greater extent.
The interplay between software and structure is one source of expertise that vendors guard closely and use to competitive advantage. In years past, proprietary software could partially hide the bases for performance or algorithmic advantages. Expert knowledge and intimate familiarity with these systems was the other bases to keep these advantages closely held.
It is perhaps not too surprising given this history, then, that the software industry really has very little emphasis or discussion on the interaction between software and structure. But, if software is being brought in as open source, where is the accompanying expertise or guidance for how data structure can be used to gain full advantage? The same acquired knowledge that, say, accompanied the growth of relational databases in such areas as schema development, materialized views or (de)normalization now needs to be made explicit and exposed for all sorts of open source systems.
In the realm of the semantic enterprise we are seeing attempts at this via open source ontologies and greater emphasis on APIs and documentation of same. Citizen Dan, for example, will be first publicly released with an accompanying MUNI ontology as a reference schema and starting point. Descriptions and methods for how to obtain indicator data and relevant attribute and entity information for the domain will also accompany it.
As open source software continues to emphasize semantics and interoperability, exemplar structures and best practices will need to be an essential part of the technology transfer. Just as the “secrets” of much software began to be opened up via open source, so too must the locked-up expertise of experts and practitioners in how to effectively structure data be exposed.
Leg Three: MethodsThe need for structure explication and guidance is but one unique slice of a much broader need to expose methods and best practices surrounding a given information management initiative. The reason that any open source software might be adopted in the first place is based on the hope for some improved information management process.
Recently I have been touting MIKE2.0, the first open source, replicable and extensible framework for organizing and managing information in the enterprise. MIKE2.0 (Method for an Integrated Knowledge Environment ) provides a comprehensive methodology that can be applied across a number of different projects within the information management space. It can be applied to any type of information development.
MIKE2.0 provides an organized way to describe the why, when, how and who of information management projects. Via standard templates and structures, MIKE2.0 provides a consistent basis to describe and manage these projects, and in a way that helps promote their interoperability and consistency across the enterprise.
MIKE2.0 and its forthcoming extensions, one of which we have developed for the semantic enterprise and are now extending into the semantic government in the context of Citizen Dan, are exciting because they provide a systematic approach and guidance for how (and for what!) to document new projects and initiatives. What MIKE2.0 represents is the first time that the embedded, proprietary expertise of traditional IT consultants has been exposed for broader use and extension.
The real premise behind any approach like MIKE2.0 or variants is to codify the expertise and knowledge that was previously locked up by experts and practitioners. The framework in MIKE2.0 provides a structure by which knowledge bases of background information can be assembled to accompany an open source project. This structure extends from initial evaluation and design all the way through operation and end of life.
The ‘CIS DocWiki’ that is being developed to accompany Citizen Dan is such an example of a MIKE2.0-informed knowledge base. At present, the CIS DocWiki has more than 300 specific articles useful to community indicator systems for local governments, and a complete deployment and maintenance methodology. By public release, it will likely be 2-3 times that size. All of this will be downloadable and installable as a wiki, and as open source content, ready for branding and modification for any local circumstance. CIS DocWiki is a natural methods and documentation complement to the Citizen Dan software and its MUNI structure. Release is scheduled for summer.
As we will focus on in Part 3 of this series, we are combining a MIKE2.0 organizational approach with a documentation and single-source publication platform to fulfill the method and documentary aspects of projects. It was really through the advantages gained by the combination of these pieces that we began to see the inadequacy of many current open source projects for the enterprise.
Leg Four: DocumentationThis series began in part with a recognition that superior open source projects are often the better documented ones. But, even there, documentation is often restricted to code-level documentation or perhaps APIs.
As the material above suggests, documentation needs to extend well beyond software. We need documentation of structure, methods, best practices, use cases, background information, deployment and management, and changing needs over the lifetime of the system. And, as we have also seen in Part 1, the lifetime of that system might be measured in decades.
Documentation is no equal to paid partners and their expertise. But, documentation can be cheaper, and if that documentation is sufficient, might be a means for changing the equation in how IT projects are solicited, acquired and managed.
Today, enterprises appear to be stuck between two difficult choices: 1) the traditional vendor lock-in approach with high costs and low innovation; or 2) open source with minimal documentation and vendor knowledge and little assurance of support longevity.
These trade-offs look pretty unpalatable.
Documentation alone, even as extended into the other legs of the solution, is not prima facie going to be a deal maker. But, its absence, I submit, is a deal breaker. Just as open source itself has taken some years to build basic comfort in the enterprise, so too a concerted attack on all acceptance factors may be necessary before actual wide adoption occurs.
The ‘CIS DocWiki’ platform noted for Citizen Dan we hope will be an exemplar for this combination of documentation and methodology. It is a single-source publishing platform that allows the entire knowledge base behind a given IT initiative to be used for collaboration, operational, training or collateral purposes. And all of this is based on open source software.
Software vendors need to recognize these documentation factors and build their ventures for success. Yes, writing code and producing software is a lot more fun and rewarding than (yeech) documentation. But, unless our current generation of vendors that is committed to open source and its benefits takes its markets seriously — and thus commits to the serious efforts these markets demand — we will continue to see minimal uptake of open source in the enterprise.
An Interacting Whole Greater than the Sum of its PartsEach of these four legs of a total open solution can interact with and reinforce the other parts. Once one begins to see the problem of open source adoption in the enterprise as a holistic one, a new systems-level perspective emerges.
Enterprises know full well that software is only one means to address an information management problem, and only a first step at that. Traditional vendors to the enterprise also understand this, which is why through their embedded systems and built-up expertise they have been able to perpetuate what often amounts to a monopoly position.
Pressures are building for a earthquake in the IT landscape. Enterprises are on an anvil of global competition and limited resources. Existing IT systems are not up to the task but too expensive and embedded to abandon. Traditional vendors have near monopoly positions and little incentive to innovate. New software vendors don’t have the expertise and gravitas to handle enterprise-scale challenges. Meanwhile, the rest of the globe is leapfrogging embedded systems with agile, Web-based systems.
The true innovation that is occurring is all based around open source, nurtured by the global computing platform of the Internet, and fueled by countless individuals able to compete on downward-spiraling cost bases. But on so many levels, open source as presently constituted, either fails or poses too many risks to the commercial enterprise.
The Internet itself was the basis of a paradigm shift, but I think we are only now seeing its manifestation at the enterprise level. We are also now seeing global reordering and changes of the economic order. How will companies respond? How will their IT systems adapt? And what will new vendors need to do and recognize in order to thrive in this changing environment?
I’m not sure I have found the language or rhetoric to convey what I see coming, and coming soon. I know open source is part of it; I know enterprises need it; and I know what is presently being offered does not meet the test.
As I noted in our first part, the mantra that we use in Structured Dynamics to express this challenge is, “We’re Successful When We’re Not Needed“. I think the essence behind this statement is that premises of dependency or proprietary advantage will not survive the jet streams of change that are blowing away the old order.
Sound like too much hyperbole? Actually, my own gut feeling is that it is not nearly enough.
In any case, windy rhetoric always falls short if there is not some actionable next steps. In these first two parts of this series, I have tried to present the ingredients that need to go into the cake. In the third part I try to offer a new, and complementary, open source means for bringing stability to the foundation.
In all cases, though, I think these challenges are permanent ones and do not lend themselves to facile solutions. Four legs, or seven foundations, or twelve steps are all just too simplistic for dealing with the global and complex tsunamis blowing away the old order.
One really does not need to lick a finger to sense the direction of these winds of change. It is coming, and coming hard, and all of it is from the direction of open source. What enterprises do, and what the vendors who want to serve them do, is perhaps less clear. I think open source offers a way out of the box in which enterprise IT is currently stuck. But, at present, I also think that most open source options do not have the necessary legs to stand on.
[1] One notable exception to this are the consumer-facing aspects of some businesses, such as automobiles or personal care or fashion products. These businesses are leading the way into some of the “build your own” or “design your own” uses of modern Web technology. [2] In the 1970s the major term for this approach was “technology transfer.” [3] Citizen Dan is an open source system for aggregating different indicator data concerning local, community well-being. Information sources may include the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics. See current details with screenshots.Structured Dynamics today updated its image with a new logo and new color schemes on its Web pages and collateral. Other upgrades to various SD product logos and other adjustments are also being made.
Fred Giasson and I formed the company rapidly back in November 2008. We had other fish to fry, namely starting work with customers coming out of the gate, and (literally) grabbed a toss-off logo that had been laying in the drawer to start the company. That worked well in the early days, but we increasingly felt our image looked tired and distinctly “non-dynamic.”
So we commissioned a competition a few weeks back and left the next steps to the professionals. The winning design is shown above. We had many good options to choose from, and we will be working with some of the other finalist designers for some of our other product designs. The first in that series is the Citizen Dan logo.
So, with growth and presence it feels good to now have a professional look as well. We’re proud that we continue to be able to fully self-fund the company and look to walk arm-in-arm with this logo for quite some time to come!