A New Methodology for Building Lightweight, Domain Ontologies

AI3:::Adaptive Information (Mike Bergman) - Wed, 09/01/2010 - 05:10
Bringing Ontology Development and Maintenance to the Mainstream

Ontologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Ontologies play a role in the organization of data similar to that played by relational data schema. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data [1].

There are many ways to categorize ontologies. One dimension is between upper level and mid- and lower- (or domain-) level. Another is between reference and subject (domain) ontologies. Upper-level ontologies [2] tend to be encompassing, abstract and inclusive ways to split or organize all “things”. Reference ontologies tend to be cross-cutting, such as ones that describe people and their interests (e.g., FOAF), reference subject concepts (e.g., UMBEL), bibliographies and citations (e.g., BIBO), projects (e.g., DOAP), simple knowledge structures (e.g., SKOS), social networks and activities (e.g., SIOC), and so forth.

The focus here is on domain ontologies, which are descriptions of particular subject or domain areas. Domain ontologies are the “world views” by which organizations, communities or enterprises describe the concepts in their domain, the relationships between those concepts, and the instances or individuals that are the actual things that populate that structure. Thus, domain ontologies are the basic bread-and-butter descriptive structures for real-world applications of ontologies.

According to Corcho et al. [3] “a domain ontology can be extracted from special purpose encyclopedias, dictionaries, nomenclatures, taxonomies, handbooks, scientific special languages (say, chemical formulas), specialized KBs, and from experts.” Another way of stating this is to say that a domain ontology — properly constructed — should also be a faithful representation of the language and relationships for those who interact with that domain. The form of the interaction can range from work to play to intellectual understanding or knowledge.

“… ontology engineering research should strive for a unified, lightweight and component-based methodological framework, principally targeted at domain experts ….”

Simperl et al. [4]

Another focus here is on lightweight ontologies. These are typically defined as more hierarchical or classificatory in nature. Like their better-known cousins, taxonomies, but with greater connectedness, lightweight ontologies are often designed to represent subsumption or other relationships between concepts. They have relatively few predicates (relationships), and those they have are not too complicated. As relationships are added and more of the complexities of the world are captured, ontologies migrate from the lightweight to the “heavyweight” end of the spectrum.

The development of ontologies goes by the names of ontology engineering or ontology building, and can also be investigated under the rubric of ontology learning. For reasons stated below, we prefer not to use the term ontology engineering, since it tends to imply that a priesthood or specialized expertise is needed to define or use ontologies. As indicated, we see ontologies as being (largely) developed and maintained by the users or practitioners within a given domain. The tools and methodologies to be employed need to be geared to these same democratic (small “d”) objectives.

A Review of Prior Methodologies

For the last twenty years there have been many methods put forward for how to develop ontologies. These methodological activities have diminished somewhat in recent years. Yet the research as separately discussed in Ontology Development Methodologies [1] seems to indicate this state of methodology development in the field:

  • Very few uniquely different methods exist, and those that do are relatively older in nature
  • The methods tend to either cluster into incremental, iterative ones or those more oriented to comprehensive approaches
  • There is a general logical sharing of steps across most methodologies from assessment to deployment and testing and refinement
  • Actual specifics and flowcharts are quite limited; with the exception of the UML-based systems, most appear not to meet enterprise standards
  • The supporting toolsets are not discussed much, and most of the examples, if given at all, are based solely on a single or governing tool. Tool integration and interoperability are almost non-existent in the narratives, and
  • Development methodologies do not appear to be an active area of recent research.

While there is by no means unanimity in this community, some general consensus can be seen from these prior reviews, especially those that concentrate on practical or enterprise ontologies. In terms of design objectives, this general consensus suggests that ontologies should be [4]:

  • Collaborative
  • Lightweight
  • Domain-oriented (subject matter and expertise)
  • Integrated, and
  • Incremental.

These objectives are laudable, and they are ones to which we adhere, but current ontology development methods do not meet them. Furthermore, as will be discussed in our next installment, there is also an inadequate slate of tools ready to support these objectives.

A Call for a New Methodology

If you ask most knowledgeable enterprise IT executives what they understand ontologies to mean and how they are to be built, you would likely hear that ontologies are expensive, complicated and difficult to build. Reactions such as these (and not trying to set up strawmen) are a reflection of both the lack of methods to achieve the consensual objectives above and the lack of tools to do so.

The use of ontology design patterns is one helpful approach [5]. Such patterns help indicate best design practice for particular use cases and relationship patterns. However, while such patterns should be part of a general methodology, they do not themselves constitute a methodology.

Also, as Structured Dynamics has argued for some time, the future of the semantic enterprise resides in ontology-driven apps [6]. Yet, for that vision to be realized, clearly both methods and tools to build ontologies must improve. In part this series is a reflection of our commitment to plug these gaps.

What we see at present for ontology development is a highly technical, overly engineered environment. Methodologies are only sparsely or generally documented. They are not lightweight nor collaborative nor really incremental. While many tools exist, they do not interoperate and are pitched mostly at the professional ontologist, not the domain user. In order to achieve the vision of ontology-driven apps the methods to develop the fulcrum of that vision — namely, the ontologies themselves — need much additional attention. An adaptive methodology for ontology development is well past due.

Design Criteria for an Adaptive Methodology

We can thus combine the results of prior surveys and recommendations with our own unique approach to adaptive ontologies in order to derive design criteria. We believe this adaptive approach should be:

  • Lightweight and domain-oriented
  • Contextual
  • Coherent
  • Incremental
  • Based on re-use of structure
  • Built on separation of the ABox and TBox (separate work), and
  • Simpler, with interoperable tool designs.

We discuss each of these design criteria below.

While we agree with the advisability of collaboration as a design condition — and therefore also believe that tools to support this methodology must also accommodate group involvement — collaboration per se is not a design requirement. It is an implementation best practice.

Effective ontology development is as much as anything a matter of mindset. This mindset is grounded in leveraging what already exists, “paying as one benefits” through an incremental approach, and starting simple and adding complexity as understanding and experience are gained. Inherently this approach requires domain users to be the driving force in ongoing development with appropriate tools to support that emphasis. Ontologists and ontology engineering are important backstops, but not in the lead design or development roles. The net result of this mindset is to develop pragmatic ontologies that are understood — and used by — actual domain practitioners.

Lightweight and Domain-oriented

By definition the methodology should be lightweight and oriented to particular domains. Ontologies built for the pragmatic purposes of setting context and aiding interoperability tend to be lightweight with only a few predicates, such as isAbout, narrowerThan or broaderThan. But, if done properly, these lighter weight ontologies can be surprisingly powerful in discovering connections and relationships. Moreover, they are a logical and doable intermediate step on the path to more demanding semantic analysis.
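To make the point about a handful of predicates concrete, here is a minimal sketch in Python. The three predicate names come from the text above; the concepts, document identifiers and function names are hypothetical, and a real system would use an RDF store rather than a list of tuples.

```python
from collections import defaultdict

# Hypothetical concepts and documents; only the predicate names come from the text.
TRIPLES = [
    ("Sedan", "narrowerThan", "Car"),
    ("Car", "narrowerThan", "Vehicle"),
    ("Truck", "narrowerThan", "Vehicle"),
    ("doc-1", "isAbout", "Sedan"),
    ("doc-2", "isAbout", "Truck"),
]


def broader_index(triples):
    """Map each concept to the set of concepts it is directly narrower than."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "narrowerThan":
            index[subject].add(obj)
        elif predicate == "broaderThan":
            # broaderThan is treated as the inverse of narrowerThan
            index[obj].add(subject)
    return index


def ancestors(concept, index):
    """Return every concept reachable by following narrowerThan upward."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in index.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def documents_about(concept, triples):
    """Find documents about a concept or about any concept narrower than it."""
    index = broader_index(triples)
    hits = set()
    for subject, predicate, obj in triples:
        if predicate == "isAbout" and (obj == concept or concept in ancestors(obj, index)):
            hits.add(subject)
    return sorted(hits)


print(documents_about("Vehicle", TRIPLES))
print(documents_about("Car", TRIPLES))
```

Even with so few predicates, a query for the broad concept finds content that was only ever tagged with narrower ones, which is the kind of connection the lightweight approach is meant to surface.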

Contextual

Context simply means there is a reference structure for guiding the assignment of what content ‘is about’ [7]. An ontology with proper context has a balanced and complete scope of the domain at hand. It generally uses fairly simple predicates; Structured Dynamics tends to use the UMBEL vocabulary for its predicates and class definitions, and to link to existing UMBEL concepts to help ensure interoperability [8]. A good gauge for whether the context is adequate is whether there are sufficient concept definitions to disambiguate common concepts in the domain.

Coherent

The essence of coherence is that it is a state of consistent connections, a logical framework for integrating diverse elements in an intelligent way. So while context supplies a reference structure, coherence means that the structure makes sense. With relation to a content graph, this means that the right connections (edges or predicates) have been drawn between the object nodes (or content) in the graph [9].

Relating content coherently itself demands a coherent framework. At the upper reference layer this begins with UMBEL, which itself is an extraction from the vetted and coherent Cyc common sense knowledge base. However, as domain specifics get added, these details, too, must be testable against a unified framework. Logic and coherence testing are thus an essential part of the ontology development methodology.

Incremental

Much value can be realized by starting small, being simple, and emphasizing the pragmatic. It is OK to make those connections that are doable and defensible today, while delaying until later the full scope of semantic complexities associated with complete data alignment.

An open world approach [10] provides the logical basis for incremental growth and adoption of ontologies. This is also in keeping with the continuous and incremental deployment model that Structured Dynamics has adopted from MIKE2.0 [11]. When this model is applied to the process of ontology development, the basic implementation increments appear as follows:


Figure 1. A Phased, Incremental Approach to Ontology Development

The first two phases are devoted to scoping and prototyping. Then, the remaining phases of creating a working ontology, testing it, maintaining it, and then revising and extending it are repeated over multiple increments. In this manner the deployment proceeds incrementally and only as learning occurs. Importantly, this approach also means that complexity, sophistication and scope grow only in step with demonstrable benefits.
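The open world basis for incremental growth can be illustrated with a small sketch. It is a toy under stated assumptions, not a reasoner: the class names, the three-valued `Answer` type and the example facts are all hypothetical, and the sketch does not check for a statement that is both asserted and denied.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


class OpenWorldStore:
    """Toy fact store: absence of a statement means 'unknown', not 'false'."""

    def __init__(self):
        self._asserted = set()  # statements known to hold
        self._denied = set()    # statements explicitly known not to hold

    def assert_fact(self, subject, predicate, obj):
        self._asserted.add((subject, predicate, obj))

    def deny_fact(self, subject, predicate, obj):
        self._denied.add((subject, predicate, obj))

    def ask(self, subject, predicate, obj):
        statement = (subject, predicate, obj)
        if statement in self._asserted:
            return Answer.TRUE
        if statement in self._denied:
            return Answer.FALSE
        return Answer.UNKNOWN


def closed_world_ask(store, subject, predicate, obj):
    """Closed-world reading, for contrast: anything not asserted is false."""
    if store.ask(subject, predicate, obj) is Answer.TRUE:
        return Answer.TRUE
    return Answer.FALSE


# First increment: only a location is known.
store = OpenWorldStore()
store.assert_fact("Acme", "locatedIn", "Iowa")
print(store.ask("Acme", "supplies", "Widgets"))
print(closed_world_ask(store, "Acme", "supplies", "Widgets"))

# Later increment: new knowledge is added without revising earlier answers.
store.assert_fact("Acme", "supplies", "Widgets")
print(store.ask("Acme", "supplies", "Widgets"))
print(store.ask("Acme", "locatedIn", "Iowa"))
```

Under the open world reading, a later increment only moves answers from unknown to known; nothing concluded in an earlier increment has to be retracted because the structure was incomplete at the time.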

Re-use of Structure

Fundamental to the whole concept of coherence is the fact that domain experts and practitioners have been looking at the questions of relationships, structure, language and meaning for decades. Though perhaps today we now finally have a broad useful data and logic model in RDF, the fact remains that massive time and effort has already been expended to codify some of these understandings in various ways and at various levels of completeness and scope.

These are prior investments in structure that would be silly to ignore. Yet, today, most methodologies do ignore these resources. This ignorance of prior investments in information relationships is perplexing. Though unquestioned adoption of legacy structure is inappropriate to modern interoperable systems, that fact is no excuse for re-inventing prior effort and discoveries, many of which are the result of laborious consensus building or negotiations.

The most productive methodologies for modern ontology building are therefore those that re-use and reconcile prior investments in structural knowledge, not ignore them. These existing assets take the form of already proven external ontologies and internal and industry structures and vocabularies.

Separation of the ABox and TBox

Nearly a year ago we undertook a major series on description logics [12], a key underpinning of the conceptual and logical foundation for Structured Dynamics’ ontology development. While we cannot always adhere to strict and conforming description logics designs, our four-part series helped provide guidance for the separation of concerns and work that can also lead to more effective ontology designs [13].

Conscious separation of the so-called ABox (assertions or instance records) and TBox (conceptual structure) in ontology design provides some compelling benefits:

  • Easier ingest and incorporation of external instance data, including conversion from multiple formats and serializations
  • Faster and more efficient inferencing and analysis and use of the conceptual structure (TBox)
  • Easier federation and incorporation of distributed data stores (instance records), and
  • Better segregation of specialized work to the ABox, TBox and specialty work modules, as this figure shows [14]:

Figure 2. Separation of the TBox and ABox [14]
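As a rough illustration of the separation, the sketch below keeps the conceptual structure and the instance records in two independent Python structures and joins them only at query time. All class names, record fields and identifiers are hypothetical, and the sketch assumes a single-parent, acyclic class hierarchy.

```python
# TBox: conceptual structure only (hypothetical classes), child -> parent.
TBOX_SUBCLASS_OF = {
    "Winery": "Business",
    "Restaurant": "Business",
    "Business": "Organization",
}

# ABox: instance records only, each typed against a TBox class.
ABOX_RECORDS = [
    {"id": "inst:1", "type": "Winery", "name": "Example Cellars"},
    {"id": "inst:2", "type": "Restaurant", "name": "Example Bistro"},
    {"id": "inst:3", "type": "Organization", "name": "Example Guild"},
]


def superclasses(cls, tbox):
    """Walk the TBox upward; no instance data is touched."""
    chain = []
    while cls in tbox:
        cls = tbox[cls]
        chain.append(cls)
    return chain


def instances_of(cls, abox, tbox):
    """Join step: records whose asserted type is cls or a subclass of cls."""
    return [
        record["id"]
        for record in abox
        if record["type"] == cls or cls in superclasses(record["type"], tbox)
    ]


print(superclasses("Winery", TBOX_SUBCLASS_OF))
print(instances_of("Business", ABOX_RECORDS, TBOX_SUBCLASS_OF))

# New instance data from another source can be appended without editing the TBox.
more_records = ABOX_RECORDS + [{"id": "inst:4", "type": "Winery", "name": "Another Cellar"}]
print(instances_of("Business", more_records, TBOX_SUBCLASS_OF))
```

Because the hierarchy walk never touches instance records, reasoning over the TBox stays small and fast, while instance data can be ingested or federated from other stores independently.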

Maintaining identity relations and disambiguation as separate components also has the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means that work can be done by optimized search engines with built-in faceting.

Simple, Interoperable Tools Support

An essential design criterion is to have a methodology and work flow that explicitly accounts for simple and interoperable tools. By “simple” we mean targeted, task-specific tools and functionality that are also geared to domain users and practitioners.

Of all design areas, this one is perhaps the weakest in terms of current offerings. The next installment in this series [1] will address this topic directly.

The New Methodology

Armed with these criteria, we are now ready to present the new methodology. In summary terms, we can describe the steps in the methodology as:

  1. Scope, analyze, then leverage existing assets
  2. Prototype structure
  3. Pivot on the working ontology
  4. Test
  5. Use and maintain
  6. Extend working ontology and repeat.

Two Parallel Tracks

After the scoping and analysis phase, the effort is split into two tracks:

  • Instances, and their descriptive characteristics, and
  • Conceptual relationships, or ontologies.

This split conforms to the separation of ABox and TBox noted above [15]. There are conceptual and workflow parallels between entities and data versus ontologies. However, the specific methodologies differ, and we focus only on the conceptual ontology side in the discussion below, shown as the upper part (blue) of Figure 3:


Figure 3. Flowchart of Ontology Development Methodology [16]

Two key aspects of the initial effort are to properly scope the size and purpose of the starting prototype and to inventory the existing assets (structure and data; internal and external) available to the project.

Re-Use Structure

Most current ontology methodologies do not emphasize re-use of existing structure. Yet these resources are rich in content and meaning, and often represent years to decades of effort and expenditure in creation, assembly and consensus. Just a short list of these potential sources demonstrates the treasure trove of structure and vocabularies available for re-use: Web portals; databases; legacy schema; metadata; taxonomies; controlled vocabularies; ontologies; master data catalogs; industry standards; exchange formats, etc.

Metadata and available structure may have value no matter where or how it exists, and a fundamental aspect of the build methodology is to bring such candidate structure into a common tools environment for inspection and testing. Besides assembling and reviewing existing sources, those selected for re-use must be migrated and converted to proper ontological form (OWL in the case of those developed by Structured Dynamics). Some of these techniques have been demonstrated for prior patterns and schema [17]; in other instances various converters, RDFizers or scripts may need to be employed to effect the migration.

Many tools and options exist at this stage, even though as a formal step this conversion is often neglected.
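As one hedged example of such a conversion script, the sketch below turns a nested legacy taxonomy into SKOS statements serialized as Turtle text. SKOS is used here only because it is simple to show; the text above targets OWL, and the taxonomy, the `http://example.org/concept/` namespace and the function names are hypothetical. Labels are assumed to contain no quotation marks, and no RDF library is used.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
BASE_NS = "http://example.org/concept/"  # hypothetical namespace

# Hypothetical legacy taxonomy expressed as nested dictionaries.
LEGACY_TAXONOMY = {
    "Beverages": {
        "Wine": {"Red Wine": {}, "White Wine": {}},
        "Beer": {},
    }
}


def slug(label):
    """Turn a label into a local name usable in a prefixed name."""
    return "".join(part.capitalize() for part in label.split())


def to_skos_turtle(taxonomy):
    """Emit one skos:Concept per taxonomy node, linked upward with skos:broader."""
    lines = [
        f"@prefix skos: <{SKOS_NS}> .",
        f"@prefix ex: <{BASE_NS}> .",
        "",
    ]

    def walk(subtree, parent):
        for label, children in subtree.items():
            statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
            if parent is not None:
                statements.append(f"skos:broader ex:{slug(parent)}")
            lines.append(f"ex:{slug(label)} " + " ;\n    ".join(statements) + " .")
            walk(children, label)

    walk(taxonomy, None)
    return "\n".join(lines)


print(to_skos_turtle(LEGACY_TAXONOMY))
```

The output can then be loaded into the common tools environment for inspection and testing alongside other candidate structure.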

Prototype Structure

The prototype structure is the first operating instance of the ontology. The creation of this initial structure follows quite closely the approach recommended in Ontology Development 101 [18], with some modifications to reflect current terminology:

  1. Determine the domain and scope of the ontology
  2. Consider reusing existing ontologies
  3. Enumerate important terms in the ontology
  4. Define the classes and the class hierarchy
  5. Define the properties of classes
  6. Create instances

The prototype structure is important since it communicates to the project sponsors the scope and basic operation of the starting structure. This stage often represents a decision point for proceeding; it may also trigger the next budgeting phase.

Link Reference Ontologies

An essential aspect of a build methodology is to re-use “standard” ontologies as much as possible. Core ontologies are Dublin Core, DC Terms, Event, FOAF, GeoNames, SKOS, Timeline, and UMBEL. These core ontologies have been chosen because of universality, quality, community support and other factors [19]. Though less universal, there are also a number of secondary ontologies, namely BIBO, DOAP, and SIOC that may fit within the current scope.

These are then supplemented with quality domain-specific ontologies, if such exist. Only then are new namespaces assigned for any newly generated ontology(ies).
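The ordering described here, reference vocabularies first and new namespaces last, can be sketched as a simple prefix lookup. The four namespace URIs shown are the published ones for Dublin Core, DC Terms, FOAF and SKOS; the other vocabularies named above are omitted for brevity, and the `mydomain` namespace and the `expand` helper are hypothetical.

```python
# Namespace URIs for a few of the reference vocabularies named above.
REFERENCE_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Hypothetical namespace, used only for terms that no reference ontology covers.
LOCAL_PREFIXES = {"mydomain": "http://example.org/ontology/mydomain#"}


def expand(curie, *prefix_maps):
    """Expand 'prefix:local' to a full URI using the first map that knows the prefix."""
    prefix, _, local = curie.partition(":")
    for prefix_map in prefix_maps:
        if prefix in prefix_map:
            return prefix_map[prefix] + local
    raise KeyError(f"unknown prefix: {prefix}")


# Reference vocabularies are consulted before the local namespace.
print(expand("foaf:Person", REFERENCE_PREFIXES, LOCAL_PREFIXES))
print(expand("mydomain:TastingRoom", REFERENCE_PREFIXES, LOCAL_PREFIXES))
```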

Working Ontology

The working ontology is the first production-grade (deployable) version of the ontology. It conforms to all of the ontology building best practices and needs to be complete enough such that it can be loaded and managed in a fully conforming ontology editor or IDE [20].

By using the OWL API, this working structure can also be the source for specialty tools and user maintenance functions, short of requiring a full-blown OWL editor. Many of these aspects are among the most poorly represented in the current tools inventory; we return to this topic in the next installment.

The working ontology is the complete, canonical form of the domain ontology(ies) [21]. These are the central structures that are the focus for ongoing maintenance and extension efforts over the ensuing phases. As such, the ontologies need to be managed by a version control system with comprehensive ontology and vocabulary management support and tools.

Testing and Mapping

As new ontologies are generated, they should be tested for coherence against various reasoning, inference and other natural language processing tools. Gap testing is also used to discover key holes or missing links within the resulting ontology graph structure. Coherence testing may result in discovering missing or incorrect axioms. Gap testing helps identify internal graph nodes needed to establish the integrity or connectivity of the concept graph.
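A minimal sketch of what gap testing and a very basic coherence check might look like on a concept graph follows. It stands in for, and is far simpler than, the reasoners and inference tools referred to above. The concepts, edges and function names are hypothetical, and treating a subclass cycle as a problem is an assumption that suits lightweight hierarchies, where a cycle often signals a modeling error.

```python
from collections import defaultdict

# Hypothetical declared concepts and subclass edges (child, parent).
CONCEPTS = {"Thing", "Vehicle", "Car", "Sedan", "Boat", "Engine"}
SUBCLASS_EDGES = [
    ("Vehicle", "Thing"),
    ("Car", "Vehicle"),
    ("Sedan", "Car"),
    ("Boat", "Watercraft"),  # parent was never declared
]
ROOT = "Thing"


def find_gaps(concepts, edges, root):
    """Report undeclared parents and declared concepts not connected to the root."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    undeclared = {parent for _, parent in edges if parent not in concepts}
    reachable, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in reachable:
                reachable.add(child)
                stack.append(child)
    orphans = concepts - reachable
    return undeclared, orphans


def find_cycle(edges):
    """Return True if the subclass edges contain a cycle."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node already on the current path
        visiting.add(node)
        found = any(visit(parent) for parent in parents.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(parents))


undeclared, orphans = find_gaps(CONCEPTS, SUBCLASS_EDGES, ROOT)
print("undeclared parents:", sorted(undeclared))
print("not connected to root:", sorted(orphans))
print("cycle present:", find_cycle(SUBCLASS_EDGES))
```

Here the undeclared parent points to a missing internal node, and the unconnected concepts point to missing links; both are the kinds of holes gap testing is meant to expose before the ontology is put to use.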

Though used for different purposes, mapping and alignment tools may also work to identify logical and other inconsistencies in definitions or labels within the graph structure. Mapping and alignment is also important in its own right in order to establish the links that help promote ontology and information interoperability.

External knowledge bases can also play essential roles in testing and mapping. Two prominent knowledge base examples are Cyc and Wikipedia, but many others exist for any specific domain.

Use and Maintenance

Of course, the whole purpose of the development methodology is to create practical, working ontologies. Such uses include search, discovery, information federation, data interoperability, analysis and reasoning. The general purposes to which ontologies may be put are described in the Executive Intro to Ontologies [22].

However, it is also in day-to-day use of the ontology that many enhancements and improvements may be discovered. Examples include improved definitions of concepts; expansions of synonyms, aliases and jargon for concepts; better, more intuitive preferred labels; better means to disambiguate between competing meanings; missing connections or excessive connections; and splitting or consolidating of the underlying structure.

Today, such maintenance enhancements are most often not pursued because existing tools do not support such actions. IDEs and tools geared to ontology engineering are not well suited to letting users and practitioners note or effect such changes. Yet ongoing ontology use and adaptation clearly suggest that users should be encouraged to do so. They are the ones on the front lines of identifying and potentially recording such improvements.

Extend

Ontology development is a process, not a static destination or event. This observation makes intuitive sense since we understand ontologies to be a means to capture our understanding of our domains, which is itself constantly changing due to new observations and insights. This factor alone suggests that ontology development methodologies must therefore give explicit attention to extension.

But there is another reason for this attention. Incremental, adaptive ontologies are also explicitly designed to expand their scope and coverage, bite by bite as benefits prove themselves and justify that expansion. A start small and expand strategy is of course lower risk and more affordable. But, for it to be effective, it also must be designed explicitly for extension and expansion. Ontology growth thus occurs both from learning and discovery and from expanding scope.

Versioning, version control and documentation (see below) thus assume more central importance than a more static view would suggest. The use of feedbacks and the continuous improvement design based on MIKE2.0 are therefore also central tenets of our ontology development methodology.

Documentation

This perspective of the ontology as a way to capture the structure and relationships of a domain — which is also constantly changing and growing — carries over to the need to document the institutional memory and use of it. Both better tools — such as vocabulary management and versioning — and better work processes need to be instituted to properly capture and record use and applications of ontologies.

Some of these aspects are now handled with utilities such as OWLdoc or the TechWiki that Structured Dynamics has innovated to capture ontology knowledge bases on an ongoing basis. But these are still rudimentary steps that need to be enforced with management commitment and oversight.

One need merely begin to probe the ontology development literature to observe how sparse the pickings are. Very little information on methodologies, best practices, use cases, recipes, how to manuals, conversion and use steps and other documentation really exists at present. It is unfortunately the case that documentation even lags the inadequate state of tools development in the ontology space.

Content Processing

Once formalized, these constructs — the structured ontologies or the named entity dictionaries as shown in Figure 3 — are then used for processing input content. That processing can range from conversion to direct information extraction. Once extracted, the structure may be injected (via RDFa or other means) back into raw Web pages. The concepts and entities that occur within these structures help inform various tagging systems [23]. The information can also be converted and exported in various forms for direct use or for incorporation in third-party systems.
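A simple way to picture this processing step is a dictionary-based tagger that looks up both concept labels and named entity labels in input text. This is only a sketch: the two dictionaries, the `ex:` identifiers and the `tag_content` function are hypothetical, and real information extraction would go well beyond whole-word matching.

```python
import re

# Hypothetical dictionaries: concept labels (TBox) and named entities (ABox).
CONCEPT_LABELS = {"wine": "ex:Wine", "vineyard": "ex:Vineyard"}
ENTITY_LABELS = {
    "napa valley": "ex:NapaValley",
    "example cellars": "ex:ExampleCellars",
}


def tag_content(text, concept_labels, entity_labels):
    """Return sorted (label, identifier, kind) matches found in the text."""
    lowered = text.lower()
    matches = []
    for kind, labels in (("concept", concept_labels), ("entity", entity_labels)):
        for label, identifier in labels.items():
            # Whole-word match so that "wine" does not match "swine".
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                matches.append((label, identifier, kind))
    return sorted(matches)


sample = "Example Cellars grows wine grapes in Napa Valley."
for label, identifier, kind in tag_content(sample, CONCEPT_LABELS, ENTITY_LABELS):
    print(kind, label, identifier)
```

Keeping concept matches and entity matches as separate kinds preserves the option of comparing the two assessments when disambiguating content.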

Visualization systems and specialized widgets (see next) can be driven by the structure and results sets obtained from querying the ontology structure and retrieving its related instance data. While these purposes are somewhat beyond the direct needs of the ontology development methodology, the ontology structures themselves must be designed to support these functions.

Semantic Component Ontology

In our methodology we also provide for administrative ontologies whose purpose is to relate structural understandings of the underlying data and data types with applicable end-use and visualization tools (“widgets”). Thus the structural knowledge of the domain gets combined with an understanding of data types and what kinds of visualization or presentation widgets might be invoked. The phrase ontology-driven apps results from this design.

Amongst other utility ontologies, Structured Dynamics names its major tool-driver ontology the SCO (Semantic Component Ontology). The SCO works in intimate tandem with the domain ontologies, but is constructed and designed with quite different purposes. A description of the build methodology for the SCO (or its other complementary utility ontologies) is beyond the scope of this current document.

Tooling and Best Practices

As sprinkled throughout the above commentary, this methodology is also intimately related to tools and best practices. The next chapter in this series is devoted to tools; this methodology itself will be archived on the TechWiki as the lightweight domain ontology methodology. Best practices will be handled in a similar way in the chapter after that one and in its ontology best practices document on the TechWiki.

Time for a Leap Forward in Methodology

Earlier reviews and the information in this document suggest a real need for ontology building methodologies that are integrated, easier to use, interoperate with a richer tools set and are geared to practitioners versus priests. The good news is that there are architectures and building blocks to achieve this vision. The bad news is that the first steps on this path are only now beginning.

The next two installments in this series add further detail for why it is time — and how — we can make a leap forward in methodology. Those critical remaining pieces are in tools and best practices.

[1] This posting is part of a current series on ontology development and tools. The series began with an update of my prior Ontology Tools listing, which now contains 185 tools. It continued with a survey of ontology development methodologies. The next part in this series will address a new architecture for tooling development. The last installment in the series is planned to cover ontology best practices. This same posting is permanently archived and updated on the OpenStructs TechWiki as Lightweight, Domain Ontologies Development Methodology.

[2] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus; that is, real ontos stuff) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak. For a more detailed treatment of ontology classifications, see M. K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog, May 16, 2007.

[3] O. Corcho, M. Fernandez and A. Gomez-Perez, 2003. “Methodologies, Tools and Languages for Building Ontologies: Where is the Meeting Point?,” in Data & Knowledge Engineering 46, 2003. See http://www.dia.fi.upm.es/~ocorcho/documents/DKE2003_CorchoEtAl.pdf.

[4] Elena Paslaru Bontas Simperl and Christoph Tempich, 2006. “Ontology Engineering: A Reality Check,” in Proceedings of the 5th International Conference on Ontologies, Databases, and Applications of Semantics ODBASE 2006, 2006. See http://ontocom.ag-nbi.de/docs/odbase2006.pdf.

[5] OntologyDesignPatterns.org is a semantic Web portal dedicated to ontology design patterns (ODPs). The portal was started under the NeOn project, which still partly supports its development.

[6] See M. K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, November 23, 2009.

[7] See M. K. Bergman, 2008. “The Semantics of Context,” AI3:::Adaptive Information blog, May 6, 2008.

[8] UMBEL (Upper Mapping and Binding Exchange Layer) is an ontology of about 20,000 subject concepts that acts as a reference structure for inter-relating disparate datasets. It is also a general vocabulary of classes and predicates designed for the creation of domain-specific ontologies.

[9] See M. K. Bergman, 2008. “When is Content Coherent?,” AI3:::Adaptive Information blog, July 25, 2008.

[10] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009.

[11] MIKE2.0 (Method for Integrated Knowledge Environments) is an open source information development methodology championed by Bearing Point and Deloitte. Structured Dynamics has adopted the approach and has helped formulate MIKE2.0’s semantic enterprise offering. For a general intro to the approach, see further M. K. Bergman, 2010. “MIKE2.0: Open Source Information Development in the Enterprise,” AI3:::Adaptive Information blog, February 23, 2010.

[12] This is our working definition for description logics: “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”

[13] See the four-part description logics series from M. K. Bergman, 2009. “Making Linked Data Reasonable using Description Logics, Part 1,” AI3:::Adaptive Information blog, February 11, 2009; “Making Linked Data Reasonable using Description Logics, Part 2,” AI3:::Adaptive Information blog, February 15, 2009; “Making Linked Data Reasonable using Description Logics, Part 3,” AI3:::Adaptive Information blog, February 18, 2009; and “Making Linked Data Reasonable using Description Logics, Part 4,” AI3:::Adaptive Information blog, February 23, 2009.

[14] See Part 2 in [13].

[15] The TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) of which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure.

[16] The original version, now slightly modified, was first published in M. K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, November 23, 2009.

[17] As some examples, see for instance: SKOS: Mark van Assem, Veronique Malais, Alistair Miles and Guus Schreiber, 2006. “A Method to Convert Thesauri to SKOS,” in The Semantic Web: Research and Applications (2006), pp. 95-109. See http://www.cs.vu.nl/~mark/papers/Assem06b.pdf for paper, also http://thesauri.cs.vu.nl/eswc06/ and http://thesauri.cs.vu.nl/; taxonomies: Fausto Giunchiglia, Maurizio Marchese and Ilya Zaihrayeu, 2006. “Encoding Classifications into Lightweight Ontologies,” presented at Proceedings of the 3rd European Semantic Web Conference (ESWC 2006), Budva. See http://www.science.unitn.it/~marchese/pdf/encoding%20classifications%20into%20lightweight%20ontologies_JoDS8.pdf; metadata: Mikael Nilsson, 2007. See http://mikaelnilsson.blogspot.com/2007/11/semanticizing-metadata-specifications.html; relational schema: see the W3C workgroup on RDB2RDF; and, of course, there are many others.

[18] Natalya F. Noy and Deborah L. McGuinness, 2001. “Ontology Development 101: A Guide to Creating Your First Ontology,” Stanford University Knowledge Systems Laboratory Technical Report KSL-01-05, March 2001. See http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html.

[19] The criteria considered in nominating an existing ontology to “core” status are that it should be general; highly used; universal; broadly supported by a committee or community; well done and documented; and easily understood.

[20] Examples of comprehensive ontology editing toolkits or IDEs (integrated development environments) include NeOn toolkit, Protégé, and TopBraid Composer. A complement to these larger toolkits is the OWL API, which when used can also provide a canonical management framework for specific ontology tools and tasks. This topic is covered more in the next installment regarding the tools landscape.

[21] Good ontology design, especially for larger projects, does require a degree of modularity. In such an architecture, multiple ontologies often work together to isolate different work tasks so as to aid better ontology management. Ontology architecture and modularization is a separate topic in its own right.

[22] Originally published as M. K. Bergman, 2010. “An Executive Intro to Ontologies,” AI3:::Adaptive Information blog, August 9, 2010. This popular document has now been permanently archived on the OpenStructs TechWiki as Intro to Ontologies.

[23] Another reason for the clear distinction between ABox and TBox is their use to aid one another in disambiguation. Structured Dynamics’ scones approach (subject concepts or named entities) is designed expressly for this purpose. It is also possible to integrate these approaches with third-party tools (e.g., Calais, Expert System (Cogito), etc.) to improve unstructured content characterization. Via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of logical segmentation, we have increased the informational power of our concept graph.

A Brief Survey of Ontology Development Methodologies

AI3:::Adaptive Information (Mike Bergman) - Mon, 08/30/2010 - 05:53
The Recent Pace of Ontology Development Appears to Have Waned

The development of ontologies goes by the names of ontology engineering or ontology building, and can also be investigated under the rubric of ontology learning. This post summarizes key papers and links on this topic [18].

Over the last twenty years many methods have been put forward for how to develop ontologies. This methodological activity has diminished somewhat in recent years.

The main thrust of the papers listed herein is on domain ontologies, which model particular domains or topic areas, as opposed to reference, upper or theoretical ontologies, which are more general or encompassing. Also, little commentary is offered on any of the individual methodologies; please see the referenced papers for more details.

General Surveys

One of the first comprehensive surveys was done by Jones et al. in 1998 [1]. This study began to elucidate common stages and noted there are typically separate stages to produce first an informal description of the ontology and then its formal embodiment in an ontology language. The existence of these two descriptions is an important characteristic of many ontologies, with the informal description often carrying through to the formal description.

The next major survey was done by Corcho et al. in 2003 [2]. This built on the earlier Jones survey and added more recent methods. The survey also characterized the methods by tools and tool readiness.

More recently the work of Simperl and her colleagues has focused on empirical results of ontology costing and related topics. This series has been the richest source of methodology insight in recent years [3, 4, 5, 6]. More on this work is described below.

Though not a survey of methods, one of the more approachable descriptions of ontology building is Noy and McGuinness’ well-known Ontology Development 101 [7]. Also helpful are Alan Rector’s various lecture slides on ontology building [8].

However, one general observation is that the pace of new methodology development seems to have waned in the past five years or so. This does not appear to be the result of an accepted methodology having emerged.

Some Specific Methodologies

Some of the leading methodologies, presented in rough order from the oldest to newest, are as follows:

  • Cyc – this oldest of knowledge bases and ontologies has been mapped to many separate ontologies. See the separate document on the Cyc mapping methodology for an overview of this approach [9]
  • TOVE (Toronto Virtual Enterprise) – a first-order logic approach to representing activities, states, time, resources, and cost in an enterprise integration architecture [10]
  • IDEF5 (Integrated Definition for Ontology Description Capture Method) – part of a broader set of methodologies developed by Knowledge Based Systems, Inc. [11]
  • ONIONS (ONtologic Integration Of Naive Sources) – a set of methods especially geared to integrating multiple information sources [12], with a particular emphasis on domain ontologies
  • COINS (COntext INterchange System) – a long-running series of efforts from MIT’s Sloan School of Management [13]
  • METHONTOLOGY – one of the better known ontology building methodologies, though it has few known uses [14]
  • OTK (On-To-Knowledge) – a methodology that came from the major EU effort at the beginning of the last decade; it is a common sense approach reflected in many ways in other methodologies [15]
  • UPON (Unified Process for ONtologies) – a UML-based approach that is based on use cases, and is incremental and iterative [16].

Please note that many individual projects also describe their specific methodologies; these are purposefully not included. In addition, Ensan and Du look at some specific ontology frameworks (e.g., PROMPT, OntoLearn, etc.) from a domain-specific perspective [17].

Some Flowcharts

Here is the general methodology as presented in the various Simperl et al. papers [cf. Fig. 1 in 3]:

The Corcho et al. survey also presented a general view of the tools plus framework necessary for a complete ontology engineering environment [Fig. 4 from 2]:

There are more examples that show ontology development workflows. Here is one again from the Simperl et al. efforts [Fig. 2 in 5]:

However, what is most striking about the review of the literature is the paucity of methodology figures and the generality of those that do exist. On this basis, it is unclear how widely real, actionable methods are actually used.

Best Practices Observations

The Simperl and Tempich paper [3], besides being a rich source of references, also provides some recommended best practices based on their comparative survey. These are:

General Recommendations
  • Enforce dissemination, e.g., publish more best practices
  • Define selection criteria for methodologies
  • Define a unified methodology following a method engineering approach
  • Support decision for the appropriate formality level given a specific use case
Process Recommendations
  • Define selection criteria for different knowledge acquisition (KA) techniques
  • Introduce process description for the application of different KA techniques
  • Improve documentation of existing ontologies
  • Improve ontology location facilities
  • Build robust translators between formalisms
  • Build modular ontologies
  • Define metrics for ontology evaluation
  • Offer user oriented process descriptions for ontology evaluation
Organizational Recommendations
  • Provide ontology engineering activity descriptions using domain-specific terminology
  • Improve consensus making process support
Technological Recommendations
  • Provide tools to extract ontologies from structured data sources
  • Build lightweight ontology engineering environments
  • Improve the quality of tools for domain analysis, ontology evaluation, documentation
  • Include methodological support in ontology editors
  • Build tools supporting collaborative ontology engineering.

Summary of Observations

This review has not set out to characterize specific methodologies, nor their strengths and weaknesses. Yet the research seems to indicate this state of methodology development in the field:

  • Very few discrete methods exist, and those that do are relatively older in nature
  • The methods tend to cluster into either incremental, iterative ones or more comprehensive approaches
  • There is a general logical sharing of steps across most methodologies from assessment to deployment and testing and refinement
  • Actual specifics and flowcharts are quite limited; with the exception of the UML-based systems, most appear not to meet enterprise standards
  • The supporting toolsets are not discussed much, and most of the examples are based solely on a governing tool. Tool integration and interoperability are almost non-existent in terms of the narratives
  • This does not appear to be a very active area of current research.

[1] D.M. Jones, T.J.M. Bench-Capon and P.R.S. Visser, 1998. “Methodologies for Ontology Development,” in Proceedings of the IT and KNOWS Conference of the 15th IFIP World Computer Congress, 1998. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.2437&rep=rep1&type=pdf.
[2] O. Corcho, M. Fernandez and A. Gomez-Perez, 2003. “Methodologies, Tools and Languages for Building Ontologies: Where is the Meeting Point?,” in Data & Knowledge Engineering 46, 2003. See http://www.dia.fi.upm.es/~ocorcho/documents/DKE2003_CorchoEtAl.pdf.
[3] Elena Paslaru Bontas Simperl and Christoph Tempich, 2006. “Ontology Engineering: A Reality Check,” in Proceedings of the 5th International Conference on Ontologies, Databases, and Applications of Semantics ODBASE2006, 2006. See http://citeseerx.ist.psu.edu/icons/pdf.gif;jsessionid=DE3414C0282C76F0EA787A06039941D2.
[4] Elena Paslaru Bontas Simperl, Christoph Tempich, and York Sure, 2006. “ONTOCOM: A Cost Estimation Model for Ontology Engineering,” presented at ISWC 2006; see http://ontocom.ag-nbi.de/docs/iswc2006.pdf.
[5] Elena Simperl, Christoph Tempich and Denny Vrandečić, 2008. “A Methodology for Ontology Learning,” in Frontiers in Artificial Intelligence and Applications 167 from the Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp. 225-249, 2008. See http://wtlab.um.ac.ir/parameters/wtlab/filemanager/resources/Ontology%20Learning/ONTOLOGY%20LEARNING%20AND%20POPULATION%20BRIDGING%20THE%20GAP%20BETWEEN%20TEXT%20AND%20KNOWLEDGE.pdf#page=241.
[6] Elena Simperl, Malgorzata Mochol and Tobias Burger, 2010. “Achieving Maturity: the State of Practice in Ontology Engineering in 2009,” in International Journal of Computer Science and Applications, 7(1), pp. 45 – 65, 2010. See http://www.tmrfindia.org/ijcsa/v7i13.pdf.
[7] Natalya F. Noy and Deborah L. McGuinness, 2001. “Ontology Development 101: A Guide to Creating Your First Ontology,” Stanford University Knowledge Systems Laboratory Technical Report KSL-01-05, March 2001. See http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html.
[8] See http://www.cs.man.ac.uk/~rector/modules/CS646/Lecture-Handouts/Lect-2-Ontology-building-2007.pdf; http://www.cs.man.ac.uk/~rector/modules/CS646/Lecture-Handouts/Lect-2-Ontology-building-2007.ppt; or http://www.cs.man.ac.uk/~rector/modules/CS646/Lecture-Handouts/Ontology-bulding-2005-Lect-5.ppt.
[9] Stephen L. Reed and Douglas B. Lenat, 2002. “Mapping Ontologies into Cyc,” paper presented at AAAI 2002 Conference Workshop on Ontologies For The Semantic Web, Edmonton, Canada, July 2002. See http://www.cyc.com/doc/white_papers/mapping-ontologies-into-cyc_v31.pdf. Also, as presented by Doug Foxvog, “Ontology Mapping with Cyc,” at WSMO, June 14, 2004; see www.wsmo.org/wsml/papers/presentations/Ontology%20Mapping%20at%20Cycorp.ppt. Also, see Matthew E. Taylor, Cynthia Matuszek, Bryan Klimt, and Michael Witbrock, 2007. “Autonomous Classification of Knowledge into an Ontology,” in The 20th International FLAIRS Conference (FLAIRS), Key West, Florida, May 2007. See http://www.cyc.com/doc/white_papers/FLAIRS07-AutoClassificationIntoAnOntology.pdf.
[10] M. Gruninger and M.S. Fox, 1994. “The Design and Evaluation of Ontologies for Enterprise Engineering,” Workshop on Implemented Ontologies, European Conference on Artificial Intelligence 1994, Amsterdam, NL. See http://stl.mie.utoronto.ca/publications/gruninger-onto-ecai94.pdf.
[11] KBSI, 1994. “The IDEF5 Ontology Description Capture Method Overview,” Knowledge Based Systems, Inc. (KBSI) Report, Texas. The report describes the stages of: 1) organizing and scoping; 2) data collection; 3) data analysis; 4) initial ontology development; and 5) ontology refinement and validation. See http://en.wikipedia.org/wiki/IDEF5.
[12] A. Gangemi, G. Steve and F. Giacomelli, 1996. “ONIONS: An Ontological Methodology for Taxonomic Knowledge Integration,” ECAI-96 Workshop on Ontological Engineering, Budapest, August 13th. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3972&rep=rep1&type=pdf.
[13] The COINS approach was developed by Madnick et al. over the past two decades or so at the MIT Sloan School of Management. See further http://web.mit.edu/smadnick/www/wp/CISL-Sloan%20WP%20spreadsheet.htm for a listing of papers from this program; some are use cases, and some are architecture-related. For the most detailed treatment, see Aykut Firat, 2003. “Information Integration Using Contextual Knowledge and Ontology Merging,” Ph.D. Thesis for the Sloan School of Management, MIT, 151 pp. See http://www.mit.edu/~bgrosof/paps/phd-thesis-aykut-firat.pdf.
[14] M. Fernandez, A. Gomez-Perez and N. Juristo, 1997. “METHONTOLOGY: From Ontological Art Towards Ontological Engineering,” AAAI-97 Spring Symposium on Ontological Engineering, Stanford University, March 24-26th, 1997.
[15] York Sure, Christoph Tempich and Denny Vrandecic, 2006. “Ontology Engineering Methodologies,” in Semantic Web Technologies: Trends and Research in Ontology-based Systems, pp. 171-187, Wiley. The general phases of the approach are: 1) feasibility study; 2) kickoff; 3) refinement; 4) evaluation; and 5) application and evolution.
[16] A. De Nicola, M. Missikoff, R. Navigli, 2009. “A Software Engineering Approach to Ontology Building,” Information Systems, 34(2), Elsevier, 2009, pp. 258-275.
[17] Faezeh Ensan and Weichang Du, 2007. “Towards Domain-Centric Ontology Development and Maintenance Frameworks”; see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.8915&rep=rep1&type=pdf.
[18] This document is permanently archived on the OpenStructs TechWiki. This document is part of a current series on ontology development and tools to be completed over the coming weeks.

Listing of 185 Ontology Building Tools

AI3:::Adaptive Information (Mike Bergman) - Mon, 08/23/2010 - 05:28

Earlier Listing is Expanded by More than 30%

At the beginning of this year Structured Dynamics assembled a listing of ontology building tools at the request of a client. That listing was presented as The Sweet Compendium of Ontology Building Tools. Now, again because of some client and internal work, we have researched the space again and updated the listing [1].

All new tools are marked with <New> (new only means newly discovered; some already existed but were missed in the prior listing). There are now a total of 185 tools in the listing, 31 of which are recent additions, and 45 of which have been added at various times since the first release. <Newest> reflects updates (most from the developers themselves) since the original publication of this post.

Comprehensive Ontology Tools
  • Altova SemanticWorks is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design. No open source version available
  • Amine is a rather comprehensive, open source platform for the development of intelligent and multi-agent systems written in Java. As one of its components, it has an ontology GUI with text- and tree-based editing modes, with some graph visualization
  • The Apelon DTS (Distributed Terminology System) is an integrated set of open source components that provides comprehensive terminology services in distributed application environments. DTS supports national and international data standards, which are a necessary foundation for comparable and interoperable health information, as well as local vocabularies. Typical applications for DTS include clinical data entry, administrative review, problem-list and code-set management, guideline creation, decision support and information retrieval. Though not strictly an ontology management system, Apelon DTS has plug-ins that provide visualization of concept graphs and related functionality that make it close to a complete solution
  • DOME is a programmable XML editor which is being used in a knowledge extraction role to transform Web pages into RDF, and available as Eclipse plug-ins. DOME stands for DERI Ontology Management Environment
  • FlexViz is a Flex-based, Protégé-like client-side ontology creation, management and viewing tool; very impressive. The code is distributed from Sourceforge; there is a nice online demo available; there is a nice explanatory paper on the system, and the developer, Chris Callendar, has a useful blog with Flex development tips
  • <Newest> ITM supports the management of complex knowledge structures (metadata repositories, terminologies, thesauri, taxonomies, ontologies, and knowledge bases) throughout their lifecycle, from authoring to delivery. ITM can also manage alignments between multiple knowledge structures, such as thesauri or ontologies, via the integration of INRIA’s Alignment API. Commercial; from Mondeca
  • Knoodl facilitates community-oriented development of OWL based ontologies and RDF knowledge bases. It also serves as a semantic technology platform, offering a Java service-based interface or a SPARQL-based interface so that communities can build their own semantic applications using their ontologies and knowledgebases. It is hosted in the Amazon EC2 cloud and is available for free; private versions may also be obtained. See especially the screencast for a quick introduction
  • The NeOn toolkit is a state-of-the-art, open source multi-platform ontology engineering environment, which provides comprehensive support for the ontology engineering life-cycle. The v2.3.0 toolkit is based on the Eclipse platform, a leading development environment, and provides an extensive set of plug-ins covering a variety of ontology engineering activities. You can add these plug-ins or get a current listing from the built-in updating mechanism
  • ontopia is a relatively complete suite of tools for building, maintaining, and deploying Topic Maps-based applications; open source, and written in Java. Could not find online demos, but there are screenshots and there is visualization of topic relationships
  • Protégé is a free, open source visual ontology editor and knowledge-base framework. The Protégé platform supports two main ways of modeling ontologies via the Protégé-Frames and Protégé-OWL editors. Protégé ontologies can be exported into a variety of formats including RDF(S), OWL, and XML Schema. There are a large number of third-party plugins that extend the platform’s functionality
    • Protégé Plugin Library – frequently consult this page to review new additions to the Protégé editor; presently there are dozens of specific plugins, most related to the semantic Web and most open source
    • Collaborative Protégé is a plug-in extension of the existing Protégé system that supports collaborative ontology editing. In addition to the common ontology editing operations, it enables annotation of both ontology components and ontology changes. It supports the searching and filtering of user annotations, also known as notes, based on different criteria. There is also an online demo
    • <New>Web Protégé is an online version of Protégé attempting to capture all of the native functionality; still under development
  • <New>Sigma is an open source knowledge engineering environment that includes ontology mapping, theorem proving, language generation in multiple languages, browsing, OWL read/write, and analysis. It includes the Suggested Upper Merged Ontology (SUMO), a comprehensive formal ontology. It’s under active development and use
  • TopBraid Composer is an enterprise-class modeling environment for developing Semantic Web ontologies and building semantic applications. Fully compliant with W3C standards, Composer offers comprehensive support for developing, managing and testing configurations of knowledge models and their instance knowledge bases. It is based on the Eclipse IDE. There is a free version (after registration) for small ontologies
  • <New>TwoUse Toolkit is an implementation of current OMG and W3C standards for developing ontology-based software models and model-based OWL2 ontologies, largely based around UML. There are a variety of tools, including graphics editors, with more to come
  • <New>Wandora is a topic maps engine written in Java with support for both in-memory topic maps and persisting topic maps in MySQL and SQL Server. It also contains an editor and a publishing system, and has support for automatic classification. It can read OBO, RDF(S), and many other formats, and can export topic maps to various graph formats. There is also a web-based topic maps browser, and graphical visualization.
Not Apparently in Active Use
  • Adaptiva is a user-centred ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
  • Exteca is an ontology-based technology written in Java for high-quality knowledge management and document categorisation, including entity extraction. Though code is still available, no updates have been provided since 2006. It can be used in conjunction with search engines
  • IODT is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontology Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
  • KAON is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
  • Ontolingua provides a distributed collaborative environment to browse, create, edit, modify, and use ontologies. The server supports over 150 active users, some of whom have provided us with descriptions of their projects. Provided as an online service; software availability not known.
Vocabulary Prompting Tools
  • AlchemyAPI from Orchestr8 provides an API based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages
  • BooWa is a set expander for any language (formerly known as SEALS); developed by RC Wang of Carnegie Mellon
  • Google Keywords allows you to enter a few descriptive words or phrases or a site URL to generate keyword ideas
  • Google Sets for automatically creating sets of items from a few examples
  • Open Calais is a free, limited API Web service to automatically attach semantic metadata to content, based on either entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), or events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID)
  • Query-by-document from BlogScope has a nice phrase extraction service, with a choice of ranking methods. Can also be used in a Firefox plug-in (not tested with 3.5+)
  • SemanticHacker (from Textwise) is an API that does a number of different things, including categorization, search, etc. By using ‘concept tags’, the API can be leveraged to generate metadata or tags for content
  • TagFinder is a Web service that automatically extracts tags from a piece of text. The tags are chosen based on both statistical and linguistic analysis of the original text
  • Tagthe.net has a demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations
  • TermExtractor extracts the terminology consensually referred to in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of “syntactically plausible” terms (e.g. compounds, adjective-nouns, etc.)
  • TermFinder compares the frequency of words in a given document against a generic corpus of 100 million words per language, using Poisson statistics, Maximum Likelihood Estimation and Inverse Document Frequency; available for English, French and Italian
  • TerMine is an online and batch term extractor that emphasizes part of speech (POS) and n-gram (phrase extraction). TerMine is the terminological management system with the C-Value term extraction and AcroMine acronym recognition integrated
  • Topia term extractor is a part-of-speech and frequency based term extraction tool implemented in Python. Here is a term extraction demo based on this tool
  • Topicalizer is a service which automatically analyses a document specified by a URL or a plain text regarding its word, phrase and text structure. It provides a variety of useful information on a given text including the following: Word, sentence and paragraph count, collocations, syllable structure, lexical density, keywords, readability and a short abstract on what the given text is about
  • TrMExtractor does glossary extraction on pure text files for either English or Hungarian
  • Wikify! is a system to automatically “wikify” a text by adding Wikipedia-like tags throughout the document. The system extracts keywords and then disambiguates and matches them to their corresponding Wikipedia definition
  • Yahoo! Placemaker is a freely available geoparsing Web service. It helps developers make their applications location-aware by identifying places in unstructured and atomic content – feeds, web pages, news, status updates – and returning geographic metadata for geographic indexing and markup
  • Yahoo! Term Extraction Service is an API to Yahoo’s term extraction service, as well as many other APIs and services in a variety of languages and for a variety of tasks; good general resource. The service has been reported to be shut down numerous times, but apparently is kept alive due to popular demand.
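
Several of the tools above rank candidate terms by comparing word frequencies in a document against a generic background corpus. The following is only a minimal sketch of that general idea, not the method of any listed tool: the `tokenize` and `score_terms` names, the smoothed inverse document frequency formula, and the tiny reference corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_terms(document, reference_docs):
    """Rank words in `document` by term frequency times a smoothed IDF.

    `reference_docs` is a hypothetical stand-in for a generic background
    corpus; real tools use far larger corpora and richer statistics.
    """
    doc_counts = Counter(tokenize(document))
    total = sum(doc_counts.values())
    n_docs = len(reference_docs)
    ref_sets = [set(tokenize(d)) for d in reference_docs]
    scores = {}
    for word, count in doc_counts.items():
        # Number of reference documents that contain the word.
        df = sum(1 for words in ref_sets if word in words)
        # Smoothed IDF (an assumed formula): rarer in the reference
        # corpus means a higher weight.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / total) * idf
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    reference = ["the cat sat", "the dog ran", "the editor"]
    for word, score in score_terms("ontology ontology editor the the the", reference):
        print(f"{word}\t{score:.3f}")
```

Words that are frequent in the document but absent from the reference corpus rise to the top, which is the property that makes such scores useful for prompting candidate vocabulary.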
Initial Ontology Development
  • COE (CmapTools Ontology Editor) is a specialized version of the CmapTools from IHMC. COE, like its CmapTools parent, is based on the idea of concept maps. A concept map is a graph diagram that shows the relationships among concepts. Concepts are connected with labeled arrows, with the relations manifesting in a downward-branching hierarchical structure. COE is an integrated suite of software tools for constructing, sharing and viewing OWL encoded ontologies based on these constructs
  • Conzilla2 is a second generation concept browser and knowledge management tool with many purposes. It can be used as a visual designer and manager of RDF classes and ontologies, since its native storage is in RDF. It also has an online collaboration server [apparently last updated in 2008]
  • http://diagramic.com/ has an online Flex network graph demo, which also has a neat facility for quick entry and visualization of relationships; mostly small scale; pretty cool. There does not appear to be code available anywhere
  • <New>DL-Learner is a tool for learning OWL class expressions from examples and background knowledge. It extends Inductive Logic Programming (ILP) to Description Logics and the Semantic Web. DL-Learner now has a flexible, component-based design that makes it easy to extend with new learning algorithms, learning problems, reasoners, and supported background knowledge sources. SPARQL endpoints are a newly supported type of knowledge source from which DL-Learner can extract knowledge fragments, which enables learning classes even on large knowledge sources like DBpedia. It also includes an OWL API reasoner interface and a Web service interface
  • DogmaModeler is a free and open source ontology modeling tool based on ORM. The philosophy of DogmaModeler is to enable non-IT experts to model ontologies with little or no involvement of an ontology engineer; the project is quite old, but the software is still available and it may provide some insight into naive ontology development
  • Erca is a framework that eases the use of Formal and Relational Concept Analysis, a neat clustering technique. Though not strictly an ontology tool, Erca could be used in a workflow that imports formal contexts from CSV files and then computes the concept lattice of those contexts, which can be exported as dot graphs (or in JPG, PNG, EPS and SVG formats). Erca is provided as an Eclipse plug-in
  • GraphMind is a mindmap editor for Drupal. It has the basic mindmap features and some Drupal specific enhancements. There is a quick screencast about what GraphMind looks like and what it does. The Flex source is also available from Github
  • <New>H-Maps is a commercial suite of tools for building topic maps applications, consisting of a topic maps engine and server, a mapping framework for converting from legacy data, and a navigator for visualizing data. It is typically used in bioinformatics (drug discovery and research, toxicological studies, etc), engineering (support and expert systems), and for integration of heterogeneous data. It supports the XTM 1.0 and TMAPI 1.0 specifications
  • irON supports authoring in spreadsheets, via its notation and specification. Spreadsheets can be used for initial authoring, especially if the irON guidelines are followed. See further this case study of Sweet Tools in a spreadsheet using irON (commON)
  • <New>JXML2OWL API is a library for mapping XML schemas to OWL Ontologies on the Java platform. It creates an XSLT which transforms instances of the XML schema into instances of the OWL ontology. JXML2OWL Mapper is a GUI application using the JXML2OWL API
  • MindRaider is a Semantic Web outliner. It aims to connect the tradition of outline editors with emerging technologies. MindRaider’s mission is to organize not only the content of your hard drive but also your cognitive base and social relationships in a way that enables quick navigation, concise representation and inferencing
  • <New>Neologism is a simple web-based RDF Schema vocabulary editor and publishing system. Use it to create RDF classes and properties, which are needed to publish data on the Semantic Web. Its main goal is to dramatically reduce the time required to create, publish and modify vocabularies for the Semantic Web. It is written in PHP and built on the Drupal platform. Neologism is currently in alpha
  • <New>OCS – Ontology Creation System is software to develop ontologies in a cooperative way with a graphical interface
  • RDF123 is an application and web service for converting data in simple spreadsheets to an RDF graph. Users control how the spreadsheet’s data is converted to RDF by constructing a graphical RDF123 template that specifies how each row in the spreadsheet is converted as well as metadata for the spreadsheet and its RDF translation
  • <New>ROC (Rapid Ontology Construction) is a tool that allows domain experts to quickly build a basic vocabulary for their domain, re-using existing terminology whenever possible. The ROC tool asks the domain expert for a set of keywords that are ‘core’ terms of the domain, and then queries remote sources for concepts matching those terms. These are then presented to the user, who can select terms from the list, find relations to other terms, and expand the set of terms and relations, iteratively. The resulting vocabulary (or ‘proto-ontology’, basically a SKOS-like thesaurus) can be used as is, or can be used as input for a knowledge engineer to base a more comprehensive domain ontology on. The interface is “triples-oriented,” not graphical.
  • Topincs is Topic Map authoring software that allows groups to share their knowledge over the web. It makes use of a variety of modern technologies. The most important are Topic Maps, REST and Ajax. It consists of three components: the Wiki, the Editor, and the Server. The server requires AMP; the Editor and Wiki are based on browser plug-ins.
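To make the spreadsheet-to-RDF idea behind tools such as RDF123 and irON a little more concrete, here is a minimal Python sketch of converting each row of a simple sheet into triples. It is not RDF123's template mechanism nor irON's notation; the base URI, the column-to-property TEMPLATE, the sample sheet and the function names are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical base namespace and column-to-property template (assumptions,
# not taken from RDF123, irON, or any real project vocabulary).
BASE = "http://example.org/tools/"
TEMPLATE = {
    "name": "http://www.w3.org/2000/01/rdf-schema#label",
    "category": "http://example.org/vocab#category",
}


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text, id_column="id"):
    """Turn each spreadsheet row into one N-Triples line per mapped column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"
        for column, predicate in TEMPLATE.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty cells rather than emit empty literals
                triples.append(
                    f'{subject} <{predicate}> "{escape_literal(value)}" .'
                )
    return triples


# A made-up two-row sheet; the second row has no category.
SHEET = "id,name,category\nrdf123,RDF123,converter\nvine,Vine,\n"

for line in rows_to_ntriples(SHEET):
    print(line)
```

The point of the sketch is only that a small, explicit mapping from columns to properties is enough to bootstrap instance data from a spreadsheet; a real tool would also handle datatypes, links between rows, and metadata about the sheet itself.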
Ontology Editing
  • First, see all of the Comprehensive Tools and Ontology Development listings above
  • Anzo for Excel includes an (RDFS and OWL-based) ontology editor that can be used directly within Excel. In addition to that, Anzo for Excel includes the capability to automatically generate an ontology from existing spreadsheet data, which is very useful for quick bootstrapping of an ontology
  • <New>ATop is a topic map browser and editor written in Java and supports the XTM 1.0 specification; project has not been updated since 2008
  • Hozo is an ontology visualization and development tool that brings version control constructs to group ontology development; limited to a prototype, with no online demo
  • Lexaurus Editor is for off-line creation and editing of vocabularies, taxonomies and thesauri. It supports import and export in Zthes and SKOS XML formats, and allows hierarchical / poly-hierarchical structures to be loaded for editing, or even multiple vocabularies to be loaded simultaneously, so that terms from one taxonomy can be re-used in another, using drag and drop. Not available in open source
  • Model Futures OWL Editor combines simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports. The editor is tree-based and has a “navigator” tool for traversing property and class-instance relationships. It can import XMI (the interchange format for UML) and Thesaurus Descriptor (BT-NT XML), and EXPRESS XML files. It can export to MS Word.
  • <New>OBO-Edit is an open source ontology editor written in Java. OBO-Edit is optimized for the OBO biological ontology file format. It features an easy to use editing interface, a simple but fast reasoner, and powerful search capabilities
  • <New>Onotoa is an Eclipse-based ontology editor for topic maps. It has a graphical UML-like interface, an export function for the current TMCL-draft and an XTM export
  • OntoTrack is a browsing and editing ontology authoring tool for OWL Lite. It combines a sophisticated graphical layout with mouse enabled editing features optimized for efficient navigation and manipulation of large ontologies
  • OWLViz is an attractive visual editor for OWL and is available as a Protégé plug-in
  • PoolParty is a triple store-based thesaurus management environment which uses SKOS and text extraction for tag recommendations. See further this manual, which describes more fully the system’s functionality. Also, there is a PoolParty Web service that enables a Zthes thesaurus in XML format to be uploaded and converted to SKOS (via skos:Concepts)
  • SKOSEd is a plugin for Protege 4 that allows you to create and edit thesauri (or similar artefacts) represented in the Simple Knowledge Organisation System (SKOS).
  • TemaTres is a Web application to manage controlled vocabularies, taxonomies and thesauri. The vocabularies may be exported in Zthes, SKOS, TopicMap, etc.
  • ThManager is a tool for creating and visualizing SKOS RDF vocabularies. ThManager facilitates the management of thesauri and other types of controlled vocabularies, such as taxonomies or classification schemes
  • Vitro is a general-purpose web-based ontology and instance editor with customizable public browsing. Vitro is a Java web application that runs in a Tomcat servlet container. With Vitro, you can: 1) create or load ontologies in OWL format; 2) edit instances and relationships; 3) build a public web site to display your data; and 4) search your data with Lucene. Still in somewhat early phases, with no online demos and with minimal interfaces.
  • <New>Vocab Editor is an RDF/OWL/SKOS vocabulary-diagram editor. It has both client- (Javascript) and server-side (Python) implementations. It is open source with a demo. There is a blog (Spanish) and online sample vocabulary app editor.
Not Apparently in Active Use
  • Omnigator is a form-based manipulation tool centered on Topic Maps, though it enables the loading and navigation of any conforming topic map in XTM, HyTM, LTM or RDF formats. There is a free evaluation version.
  • OntoGen is a semi-automatic and data-driven ontology editor focusing on editing of topic ontologies (a set of topics connected with different types of relations). The system combines text-mining techniques with an efficient user interface. It requires .Net.
  • OntoLight is a set of software modules for: transforming raw ontology data for several ontologies from their specific formats into a unifying light-weight ontology format, grounding the ontology and storing it into grounded ontology format, populating grounded ontologies with new instance data, and creating mappings between grounded ontologies; includes Cyc. Download no longer available. See http://analytics.ijs.si/~blazf/papers/Context_SiKDD07.pdf and http://www.neon-project.org/web-content/index.php?option=com_weblinks&task=view&catid=17&id=52 or http://www.neon-project.org/web-content/index.php?option=com_weblinks&catid=21&Itemid=73
  • OWL-S-editor is an editor for the development of services in OWL-S, with graphical, WSDL and import/export support
  • ReTAX+ is an aide to help a taxonomist create a consistent taxonomy and in particular provides suggestions as to where a new entity could be placed in the taxonomy whilst retaining the integrity of the revised taxonomy (c.f., problems in ontology modelling)
  • SWOOP is a lightweight ontology editor. (Swoop is no longer under active development at mindswap. Continuing development can be found on SWOOP’s Google Code homepage at http://code.google.com/p/swoop/)
  • WebOnto supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.
Ontology Mapping
  • <New>The Alignment API is an API and implementation for expressing and sharing ontology alignments. The set of correspondences between entities (e.g., classes, objects, properties) in ontologies is called an alignment. The API provides a format for expressing alignments in a uniform way. The goal of this format is to be able to share on the web the available alignments. The format is expressed in RDF, so it is freely extensible. The Alignment API itself is a Java description of tools for accessing the common format. It defines four main interfaces (Alignment, Cell, Relation and Evaluator).
  • COMA++ is a schema and ontology matching tool with a comprehensive infrastructure. Its graphical interface supports a variety of interactions
  • ConcepTool is a system to model, analyse, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
  • <New>MapOnto is a research project aiming at discovering semantic mappings between different data models, e.g., database schemas, conceptual schemas, and ontologies. So far, it has developed tools for discovering semantic mappings between database schemas and ontologies as well as between different database schemas. The Protege plug-in is still available, but appears to be for older versions
  • MatchIT automates and facilitates schema matching and semantic mapping between different Web vocabularies. MatchIT runs as a stand-alone or plug-in Eclipse application and can be integrated with popular third party applications. MatchIT uses Adaptive Lexicon™ as an ontology-driven dictionary and thesaurus of English language terminology to quantify and rank the semantic similarity of concepts. It apparently is not available in open source
  • myOntology is used to produce the theoretical foundations, and deployable technology for the Wiki-based, collaborative and community-driven development and maintenance of ontologies, instance data and mappings
  • OLA/OLA2 (OWL-Lite Alignment) matches ontologies written in OWL. It relies on a similarity combining all the knowledge used in entity descriptions. It also deals with one-to-many relationships and circularity in entity descriptions through a fixpoint algorithm
  • Potluck is a Web-based user interface that lets casual users—those without programming skills and data modeling expertise—mash up data themselves. Potluck is novel in its use of drag and drop for merging fields, its integration and extension of the faceted browsing paradigm for focusing on subsets of data to align, and its application of simultaneous editing for cleaning up data syntactically. Potluck also lets the user construct rich visualizations of data in-place as the user aligns and cleans up the data.
  • PRIOR+ is a generic and automatic ontology mapping tool, based on propagation theory, information retrieval technique and artificial intelligence model. The approach utilizes both linguistic and structural information of ontologies, and measures the profile similarity and structure similarity of different elements of ontologies in a vector space model (VSM).
  • <New>S-Match takes any two tree-like structures (such as database schemas, classifications, lightweight ontologies) and returns a set of correspondences between those tree nodes which semantically correspond to one another.
  • Vine is a tool that allows users to perform fast mappings of terms across ontologies. It performs smart searches, can search using regular expressions, requires a minimum number of clicks to perform mappings, can be plugged into arbitrary mapping framework, is non-intrusive with mappings stored in an external file, has export to text files, and adds metadata to any mapping. See also http://sourceforge.net/projects/vine/.
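Many of the mapping tools in this list start from some measure of similarity between entity labels. As a rough illustration only, the following Python sketch pairs labels from two vocabularies by string similarity and records each correspondence as a small "cell" (two entities, a relation, a confidence). It borrows the word Cell from the Alignment API description above but is not that API, nor the method of any tool listed; the class, function names, threshold and sample labels are assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Cell:
    """One correspondence: two entities, a relation, and a confidence."""
    entity1: str
    entity2: str
    relation: str
    measure: float


def label_similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(labels1, labels2, threshold=0.8):
    """Pair each label in labels1 with its best match in labels2.

    Only matches scoring at or above the (arbitrary) threshold are kept.
    """
    cells = []
    if not labels2:
        return cells
    for a in labels1:
        best = max(labels2, key=lambda b: label_similarity(a, b))
        score = label_similarity(a, best)
        if score >= threshold:
            cells.append(Cell(a, best, "=", round(score, 2)))
    return cells


# Hypothetical class labels from two small vocabularies.
cells = align(
    ["Person", "Organisation", "Document"],
    ["person", "Organization", "Publication"],
)
for cell in cells:
    print(cell)
```

A purely lexical matcher like this misses synonyms ("Document" and "Publication" are left unmatched here) and ignores structure, which is exactly the gap that the linguistic, structural and propagation-based strategies described in this section try to close.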
Not Apparently in Active Use
  • ASMOV (Automated Semantic Mapping of Ontologies with Validation) is an automatic ontology matching tool which has been designed in order to facilitate the integration of heterogeneous systems, using their data source ontologies
  • Chimaera is a software system that supports users in creating and maintaining distributed ontologies on the web. Two major functions it supports are merging multiple ontologies together and diagnosing individual or multiple ontologies
  • CMS (CROSI Mapping System) is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
  • ConRef is a service discovery system which uses ontology mapping techniques to support different user vocabularies
  • DRAGO reasons across multiple distributed ontologies interrelated by pairwise semantic mappings, with a vision of peer-to-peer mapping of many distributed ontologies on the Web. It is implemented as an extension to an open source Pellet OWL Reasoner
  • Falcon-AO (Finding, aligning and learning ontologies) is an automatic ontology matching tool that includes the three elementary matchers of String, V-Doc and GMO. In addition, it integrates a partitioner PBM to cope with large-scale ontologies
  • FOAM is the Framework for ontology alignment and mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
  • hMAFRA (Harmonize Mapping Framework) is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
  • IF-Map is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
  • LILY is a system matching heterogeneous ontologies. LILY extracts a semantic subgraph for each entity, then it uses both linguistic and structural information in semantic subgraphs to generate initial alignments. The system is presently in a demo version only
  • MAFRA Toolkit – the Ontology MApping FRAmework Toolkit allows users to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
  • OntoEngine is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target”
  • OWLS-MX is a hybrid semantic Web service matchmaker. OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
  • RiMOM (Risk Minimization based Ontology Mapping) integrates different alignment strategies: edit-distance based strategy, vector-similarity based strategy, path-similarity based strategy, background-knowledge based strategy, and three similarity-propagation based strategies
  • semMF is a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties
  • Snoggle is a graphical, SWRL-based ontology mapper. Snoggle attempts to solve the ontology mapping problem by providing a graphical user interface (similar to that of Microsoft Visio) to guide the process of ontology vocabulary alignment. In Snoggle, user-defined mappings can be serialized into rules, which are expressed using SWRL
  • Terminator is a tool for creating term to ontology resource mappings (documentation in Finnish).
Ontology Visualization/Analysis

Though not all are relevant, see my post from a couple of years back on large-scale RDF graph software.

  • Social network graphing tools (many covered elsewhere)
  • Cytoscape is a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data; I have also written specifically about Cytoscape’s use in UMBEL
    • RDFScape is a project that brings Semantic Web “features” to the popular Systems Biology software Cytoscape
    • NetworkAnalyzer performs analysis of biological networks and calculates network topology parameters including the diameter of a network, the average number of neighbors, and the number of connected pairs of nodes. It also computes the distributions of more complex network parameters such as node degrees, average clustering coefficients, topological coefficients, and shortest path lengths. It displays the results in diagrams, which can be saved as images or text files; used by SD
  • Graphl is a tool for collaborative editing and visualisation of graphs, representing relationships between resources or concepts of the real world. Graphl may be thought of as a visual wiki, a place where everybody can contribute to a shared repository of knowledge
  • <New>Graphviz is open source graph visualization software. It has several main graph layout programs. It also has web and interactive graphical interfaces, and auxiliary tools, libraries, and language bindings.
  • <New>GrOWL is an ontology visualizer and editor. The layout of the GrOWL graph can be defined automatically or loaded from a separate style sheet. GrOWL implements configurable filters that can transform the display by simplifying it, hiding concepts and relationships that have no descriptions associated, or perform more complex translations. Concepts can be stored in ontologies with extensive annotations to provide documentation. GrOWL shows these annotations as tooltips, and supports complex HTML and links within them. The GrOWL browser can be used inside a web browser or as a stand-alone application. When used inside a browser, it supports Javascript interaction so that it can be used as a concept chooser with implementation-defined operations.
  • igraph is a free software package for creating and manipulating undirected and directed graphs
  • Network Workbench is very complex and comprehensive; a Swiss Army knife
  • NetworkX – Python; very clean
  • <New>OntoGraf, a Protege 4 plug-in, gives support for interactively navigating the relationships of your OWL ontologies. Various layouts are supported for automatically organizing the structure of your ontology. Different relationships are supported: subclass, individual, domain/range object properties, and equivalence. Relationships and node types can be filtered.
  • <New>OWL2Prefuse is a Java package which creates Prefuse graphs and trees from OWL files (and Jena OntModels). It takes care of converting the OWL data structure to the Prefuse data structure. This makes it easy for developers to use Prefuse graphs and trees in their Semantic Web applications.
  • <New>RDF Gravity is a tool for visualising RDF/OWL Graphs/ ontologies. RDF Gravity is implemented by using the JUNG Graph API and Jena semantic web toolkit. Its main features are:
    • Graph Visualization
    • Global and Local Filters (enabling specific views on a graph)
    • Full text Search
    • Generating views from RDQL Queries
    • Visualising multiple RDF files
  • <Newest> SKOS Reader is a SKOS browser and an HTML renderer of SKOS thesauri and terminologies that can display a SKOS file hierarchically, alphabetically, or permuted. Commercial; from Mondeca
  • Stanford Network Analysis Package (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes
  • Social Networks Visualizer (SocNetV) is a flexible and user-friendly tool for the analysis and visualization of Social Networks. It lets you construct networks (mathematical graphs) with a few clicks on a virtual canvas or load networks of various formats (GraphViz, GraphML, Adjacency, Pajek, UCINET, etc) and modify them to suit your needs. SocNetV also offers a built-in web crawler, allowing you to automatically create networks from all links found in a given initial URL
  • Tulip may be incredibly strong
  • Springgraph component for Flex
  • VizierFX is a Flex library for drawing network graphs. The graphs are laid out using GraphViz on the server side, then passed to VizierFX to perform the rendering. The library also provides the ability to run ActionScript code in response to events on the graph, such as mousing over a node or clicking on it.
  • <New>VUE (Visual Understanding Environment) is an open source project focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE provides a flexible visual environment for structuring, presenting, and sharing digital information.
  • <New>yEd is a diagram editor that can be used to quickly and effectively generate high-quality drawings of diagrams. It can support OWL imports.
  • <New>ZGRViewer is a graph visualizer implemented in Java and based upon the Zoomable Visual Transformation Machine. It is specifically aimed at displaying graphs expressed using the DOT language from AT&T GraphViz and processed by programs dot, neato or others such as twopi. ZGRViewer is designed to handle large graphs, and offers a zoomable user interface (ZUI), which enables smooth zooming and easy navigation in the visualized structure.
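Several of the tools above report network topology parameters; the NetworkAnalyzer entry, for example, mentions the diameter of a network, the average number of neighbors, and the number of connected pairs of nodes. The following is a minimal pure-Python sketch of those three measures for a small undirected graph. It does not reflect how NetworkAnalyzer or any other listed tool computes them; the function names and the toy class hierarchy are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build undirected adjacency sets from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, start):
    """Breadth-first search: hop counts from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def topology_summary(graph):
    """Return diameter, average neighbors, and connected unordered pairs.

    The diameter here is the longest finite shortest path, so unreachable
    pairs are simply ignored rather than treated as infinite.
    """
    diameter = 0
    ordered_pairs = 0
    for node in graph:
        dist = distances_from(graph, node)
        ordered_pairs += len(dist) - 1  # reachable nodes other than itself
        diameter = max(diameter, max(dist.values()))
    avg_neighbors = sum(len(n) for n in graph.values()) / len(graph)
    return {
        "diameter": diameter,
        "avg_neighbors": avg_neighbors,
        "connected_pairs": ordered_pairs // 2,  # each pair was counted twice
    }


# A made-up class hierarchy treated as an undirected graph.
EDGES = [
    ("Thing", "Agent"),
    ("Agent", "Person"),
    ("Agent", "Organization"),
    ("Thing", "Document"),
]
summary = topology_summary(build_graph(EDGES))
print(summary)
```

Running one breadth-first search per node is fine for ontology-sized graphs of a few thousand nodes; for the very large networks that some of the packages above target, dedicated libraries are the more sensible choice.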
Miscellaneous Ontology Tools
  • Apolda (Automated Processing of Ontologies with Lexical Denotations for Annotation) is a plugin (processing resource) for GATE (http://gate.ac.uk/). The Apolda processing resource (PR) annotates a document like a gazetteer, but takes the terms from an (OWL) ontology rather than from a list
  • <Newest>CA Manager supports customized workflows for semantic annotation of content. Commercial; from Mondeca
  • <New>Gloze is a XML to RDF, RDF to XML, and XSD to OWL mapping tool based on Jena; see also http://jena.hpl.hp.com/juc2006/proceedings/battle/paper.pdf . See also http://jena.sourceforge.net/contrib/contributions.html
  • <New>Hoolet is an implementation of an OWL-DL reasoner that uses a first order prover. The ontology is translated to a collection of axioms (in an obvious way based on the OWL semantics) and this collection of axioms is then given to a first order prover for consistency checking.
  • LexiLink is a tool for building, curating and managing multiple lexicons and ontologies in one enterprise-wide Web-based application. The core of the technology is based on RDF and OWL
  • mopy is the Music Ontology Python library, designed to provide easy to use python bindings for ontology terms for the creation and manipulation of music ontology data. mopy can handle information from several ontologies, including the Music Ontology, full FOAF vocab, and the timeline and chord ontologies
  • OBDA (Ontology Based Data Access) is a plugin for Protégé aimed to be a full-fledged OBDA ontology and component editor. It provides data source and mapping editors, as well as querying facilities that, in sum, allow you to design and test every aspect of an OBDA system. It supports relational data sources (RDBMS) and GLAV-like mappings. In its current beta form, it requires Protege 3.3.1, a reasoner implementing the OBDA extensions to DIG 1.1 (e.g., the DIG server for QuOnto) and Jena 2.5.5
  • <New>oBrowse is a web-based ontology browser developed in Java. oBrowse parses the OWL files of an ontology and displays the ontology in a tree view. Protege-API and JSF are used in development
  • OntoComP is a Protégé 4 plugin for completing OWL ontologies. It enables the user to check whether an OWL ontology contains “all relevant information” about the application domain, and extend the ontology appropriately if this is not the case
  • Ontology Browser is a browser created as part of the CO-ODE (http://www.co-ode.org/) project; rather simple interface and use
  • Ontology Metrics is a web-based tool that displays statistics about a given ontology, including the expressivity of the language it is written in
  • <New>OntoLT aims at a more direct connection between ontology engineering and linguistic analysis. OntoLT is a Protégé plug-in, with which concepts (Protégé classes) and relations (Protégé slots) can be extracted automatically from linguistically annotated text collections. It provides mapping rules, defined by use of a precondition language that allow for a mapping between linguistic entities in text and class/slot candidates in Protégé. Only available for older Protégé versions
  • OntoSpec is a SWI-Prolog module, aiming at automatically generating XHTML specification from RDF-Schema or OWL ontologies
  • OWL API is a Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API is focused towards OWL Lite and OWL DL and offers an interface to inference engines and validation functionality
  • OWL Module Extractor is a Web service that extracts a module for a given set of terms from an ontology. It is based on an implementation of locality-based modules that is part of the OWL API.
  • OWL Syntax Converter is an online tool for converting ontologies between different formats, including several OWL syntaxes, RDF/XML, KRSS
  • OWL Verbalizer is an on-line tool that verbalizes OWL ontologies in (controlled) English
  • OwlSight is an OWL ontology browser that runs in any modern web browser; it’s developed with Google Web Toolkit and uses Gwt-Ext, as well as OWL-API. OwlSight is the client component and uses Pellet as its OWL reasoner
  • Pellint is an open source lint tool for Pellet which flags and (optionally) repairs modeling constructs that are known to cause performance problems. Pellint recognizes several patterns at both the axiom and ontology level.
  • PROMPT is a tab plug-in for Protégé for managing multiple ontologies by comparing versions of the same ontology, moving frames between included and including projects, merging two ontologies into one, or extracting a part of an ontology
  • <New>ReDeFer is a compendium of RDF-aware utilities organised in a set of packages: RDF2HTML+RDFa: render a piece of RDF/XML as HTML+RDFa; XSD2OWL: transform an XML Schema into an OWL Ontology; CS2OWL: transform a MPEG-7 Classification Scheme into an OWL Ontology; XML2RDF: transform a piece of XML into RDF; and RDF2SVG: render a piece of RDF/XML as a SVG showing the corresponding graph
  • SegmentationApp is a Java application that segments a given ontology according to the approach described in “Web Ontology Segmentation: Analysis, Classification and Use” (http://www.co-ode.org/resources/papers/seidenberg-www2006.pdf)
  • SETH is a software effort to deeply integrate Python with Web Ontology Language (OWL-DL dialect). The idea is to import ontologies directly into the programming context so that its classes are usable alongside standard Python classes
  • SKOS2GenTax is an online tool that converts hierarchical classifications available in the W3C SKOS (Simple Knowledge Organization Systems) format into RDF-S or OWL ontologies
  • SpecGen (v5) is an ontology specification generator tool. It’s written in Python using Redland RDF library and licensed under the MIT license
  • Text2Onto is a framework for ontology learning from textual resources that extends and re-engineers an earlier framework developed by the same group (TextToOnto). Text2Onto offers three main features: it represents the learned knowledge at a metalevel by instantiating the modelling primitives of a Probabilistic Ontology Model (POM), thus remaining independent from a specific target language while allowing the translation of the instantiated primitives
  • Thea is a Prolog library for generating and manipulating OWL (Web Ontology Language) content. Thea OWL parser uses SWI-Prolog’s Semantic Web library for parsing RDF/XML serialisations of OWL documents into RDF triples and then it builds a representation of the OWL ontology
  • TONES Ontology Repository is primarily designed to be a central location for ontologies that might be of use to tools developers for testing purposes; it is part of the TONES project
  • Visual Ontology Manager (VOM) is a family of tools that enables UML-based visual construction of component-based ontologies for use in collaborative applications and interoperability solutions.
  • Web Ontology Manager is a lightweight, Web-based tool using J2EE for managing ontologies expressed in Web Ontology Language (OWL). It enables developers to browse or search the ontologies registered with the system by class or property names. In addition, they can submit a new ontology file
  • RDF evoc (external vocabulary importer) is a module for Drupal that caches any external RDF vocabulary and provides properties to be mapped to CCK fields, node title and body. This module requires the RDF and the SPARQL modules.
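A number of the utilities above convert one representation into another, such as SKOS classifications into RDF-S or OWL (SKOS2GenTax) or XML Schema into OWL (ReDeFer, Gloze). As a hedged illustration of what such a conversion involves, the Python sketch below rewrites a handful of SKOS triples into RDFS class statements. It is not the algorithm of SKOS2GenTax or of any other listed tool, and reading skos:broader as rdfs:subClassOf is a modelling assumption that does not hold for every thesaurus. The function name and the example URIs are hypothetical; only the three namespace URIs are standard.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def skos_to_rdfs(triples):
    """Map SKOS concept triples to RDFS class triples.

    Assumption: every skos:broader link is a genuine subclass relation.
    Triples with other predicates are dropped in this sketch.
    """
    out = []
    for s, p, o in triples:
        if p == RDF + "type" and o == SKOS + "Concept":
            out.append((s, RDF + "type", RDFS + "Class"))
        elif p == SKOS + "broader":
            out.append((s, RDFS + "subClassOf", o))
        elif p == SKOS + "narrower":
            # narrower is the inverse direction, so swap subject and object
            out.append((o, RDFS + "subClassOf", s))
        elif p == SKOS + "prefLabel":
            out.append((s, RDFS + "label", o))
    return out


# Hypothetical example data under a made-up namespace.
EX = "http://example.org/scheme#"
source = [
    (EX + "Editor", RDF + "type", SKOS + "Concept"),
    (EX + "Editor", SKOS + "prefLabel", "Ontology editor"),
    (EX + "Editor", SKOS + "broader", EX + "Tool"),
    (EX + "Tool", SKOS + "narrower", EX + "Mapper"),
]
for triple in skos_to_rdfs(source):
    print(triple)
```

In practice the hard part is deciding which broader/narrower links really are subclass relations, which is a judgment call that such tools typically leave, at least in part, to the person doing the conversion.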
Not Apparently in Active Use
  • ActiveOntology is a library, written in Ruby, for easy manipulation of RDF and RDF-Schema models, thru a dynamic DSL based on Ruby idiom
  • Almo is an ontology-based workflow engine in Java supporting the ARTEMIS project; part of the OntoWare initiative
  • ClassAKT is a text classification web service for classifying documents according to the ACM Computing Classification System
  • Elmo provides a simple API to access ontology oriented data inside a Sesame RDF repository. The domain model is simplified into independent concerns that are composed together for multi-dimensional, inter-operating, or integrated applications
  • ExtrAKT is a tool for extracting ontologies from Prolog knowledge bases.
  • F-Life is a tool for analysing and maintaining life-cycle patterns in ontology development.
  • Foxtrot is a recommender system which represents user profiles in ontological terms, allowing inference, bootstrapping and profile visualization.
  • HyperDAML creates an HTML representation of OWL content to enable hyperlinking to specific objects, properties, etc.
  • LinKFactory is an ontology management tool that provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies.
  • LSW – the Lisp semantic Web toolkit enables OWL ontologies to be visualized. It was written by Alan Ruttenberg
  • OntoClassify is a system for scalable classification of text into large topic ontologies, currently including DMoz and Inspec. The system is available as a Web service. The software runs under the Windows platform.
  • Ontodella is a Prolog HTTP server for category projection and semantic linking
  • OntoWeaver is an ontology-based approach to Web sites, which provides high level support for web site design and development
  • OWLLib is a PHP library for accessing OWL files. OWL is a W3C standard for storing semantic information
  • pOWL is a Semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
  • ROWL is the Rule Extension of OWL; it is from the Mobile Commerce Lab in the School of Computer Science at Carnegie Mellon University
  • Semantic Net Generator is a utility for generating Topic Maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
  • SMORE is OWL markup for HTML pages. SMORE integrates the SWOOP ontology browser, providing a clear and consistent way to find and view Classes and Properties, complete with search functionality
  • SOBOLEO is a system for Web-based collaboration to create SKOS taxonomies and ontologies and to annotate various Web resources using them
  • SOFA is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages; from java.dev
  • WebScripter is a tool that enables ordinary users to easily and quickly assemble reports extracting and fusing information from multiple, heterogeneous DAMLized Web sources.
[1] This listing is maintained on a permanent basis on the OpenStructs TechWiki.

I Have Yet to Metadata I Didn’t Like

AI3:::Adaptive Information (Mike Bergman) - Mon, 08/16/2010 - 05:58

Contrasted with Some Observations on Linked Data

At the SemTech conference earlier this summer there was a kind of vuvuzela-like buzzing in the background. And, like the World Cup games on television, in play at the same time as the conference, I found the droning to be just as irritating.

That droning was a combination of the sense of righteousness in the superiority of linked data matched with a reprise of the “chicken-and-egg” argument that plagued the early years of semantic Web advocacy [1]. I think both of these premises are misplaced. So, while I have been a fan and explicator of linked data for some time, I do not worship at its altar [2]. And, for those that do, this post argues for a greater sense of ecumenism.

My main points are not against linked data. I think it a very useful technique and good (if not best) practice in many circumstances. But my main points get at whether linked data is an objective in itself. By making it such, I argue our eye misses the ball. And, in so doing, we miss making the connection with meaningful, interoperable information, which should be our true objective. We need to look elsewhere than linked data for root causes.

Observation #1: What Problem Are We Solving?

When I began this blog more than five years ago — and when I left my career in population genetics nearly three decades before that — I did so because of my belief in the value of information to confer adaptive advantage. My perspective then, and my perspective now, was that adaptive information through genetics and evolution was being uniquely supplanted within the human species. This change has occurred because humanity is able to record and carry forward all information gained in its experiences.

Adaptive innovations from writing to bulk printing to now electronic form uniquely position the human species to both record its past and anticipate its future. We no longer are limited to evolution and genetic information encoded in surviving offspring to determine what information is retained and moves forward. Now, all information can be retained. Further, we can combine and connect that information in ways that break to smithereens the biological limits of other species.

Yet, despite the electronic volumes and the potentials, chaos and isolated content silos have characterized humanity’s first half century of experience with digital information. I have spoken before about how we have been steadily climbing the data federation pyramid, with Internet technologies and the Web being prime factors for doing so. Now, with a compelling data model in RDF and standards for how we can relate any type of information meaningfully, we also have the means for making sense of it. And connecting it. And learning and adapting from it.

And, so, there is the answer to the rhetorical question: The problem we are solving is to meaningfully connect information. For, without those meaningful connections and recombinations, none of that information confers adaptive advantage.

Observation #2: The Problem is Not A Lack of Consumable Data

One of the “chicken-and-egg” premises in the linked data community is there needs to be more linked data exposed before some threshold to trigger the network effect occurs. This attitude, I suspect, is one of the reasons why hosannas are always forthcoming each time some outfit announces they have posted another chunk of triples to the Web.

Fred Giasson and I earlier tackled that issue with When Linked Data Rules Fail regarding some information published for data.gov and the New York Times. Our observations on the lack of standards for linked data quality proved to be quite controversial. Rehashing that piece is not my objective here.

What is my objective is to hammer home that we do not need linked data in order to have data available to consume. Far from it. Though linked data volumes have been growing, I actually suspect that its growth has been slower than data availability in toto. On the Web alone we have searchable deep Web databases, JSON, XML, microformats, RSS feeds, Google snippets, yada, yada, all in a veritable deluge of formats, contents and contexts. We are having a hard time inventing the next 1000-fold description beyond zettabyte and yottabyte to even describe this deluge [3].

There is absolutely no voice or observer anywhere that is saying, “We need linked data in order to have data to consume.” Quite the opposite. The reality is we are drowning in the stuff.

Furthermore, when one dissects what most of this data is about, it is about ways to describe things. Or, put another way, most data is neither schema nor descriptions of conceptual relationships; it is records made available, with attributes and their values used to describe those records. Where is a business located? What political party does a politician belong to? How tall are you? What is the population of Hungary?

These are simple constructs with simple key-value pair ways to describe and convey them. This very simplicity is one reason why naïve data structs or simple data models like JSON or XML have proven so popular [4]. It is one of the reasons why the so-called NoSQL databases have also been growing in popularity. What we have are lots of atomic facts, located everywhere, and representable with very simple key-value structures.
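To make the point concrete, here is a minimal sketch in Python of such atomic facts held as plain key-value records. The record identifiers, field names and values are all invented for illustration; nothing here comes from a real dataset.

```python
import json

# Hypothetical records: each one is just a flat set of attribute-value pairs.
records = [
    {"id": "business-001", "type": "Business", "name": "Example Bakery", "city": "Iowa City"},
    {"id": "person-001", "type": "Person", "name": "Jane Doe", "height_cm": 170},
]


def lookup(records, record_id, attribute):
    """Return the value of one attribute for one record, or None if absent."""
    for record in records:
        if record["id"] == record_id:
            return record.get(attribute)
    return None


# The same records serialize to JSON and back with no change in structure.
as_json = json.dumps(records, indent=2)
roundtrip = json.loads(as_json)
```

A question such as "Where is a business located?" then reduces to `lookup(records, "business-001", "city")`, with no schema or conceptual model needed on the publishing side.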

While having such information available in linked data form makes it easier for agents to consume it, that extra publishing burden is by no means necessary. There are plenty of ways to consume that data — without loss of information — in non-linked data form. In fact, that is how the overwhelming percentage of such data is expressed today. This non-linked data is also often easy to understand.

What is important is that the data be available electronically with a description of what the records contain. But that hurdle is met in many, many different ways and from many, many sources without any reference whatsoever to linked data. I submit that any form of desirable data available on the Web can be readily consumed without recourse to linked data principles.

Observation #3: An Interoperable Data Model Does Not Require a Single Transmittal Format

The real advantage of RDF is the simplicity of its data model, which can be extended and augmented to express vocabularies and relationships of any nature. As I have stated before, that makes RDF like a universal solvent for any extant data structure, form or schema.

What I find perplexing, however, is how this strength somehow gets translated into a parallel belief that such a flexible data model is also the best means for transmitting data. As noted, most transmitted data can be represented through simple key-value pairs. Sure, at some point one needs to model the structural assumptions of the data model from the supplying publisher, but that complexity need not burden the actual transmitted form. So long as schema can be captured and modeled at the receiving end, data record transmittal can be made quite a bit simpler.

Under this mindset RDF provides the internal (canonical) data model. Prior to that, format and other converters can be used to consume the source data in its native form. A generalized representation for how this can work is shown in this diagram using Structured Dynamics’ structWSF Web services framework middleware as the mediating layer:

Of course, if the source data is already in linked data form with understood concepts, relationships and semantics, much of this conversion overhead can be bypassed. If available, that is a good thing.

But it is not a required or necessary thing. Insistence on publishing data in certain forms suffers from the same narrowness as cultural or religious zealotry. Why certain publishers or authors prefer different data formats has a diversity of answers. Reasons can range from what is tried and familiar to available toolsets or even what is trendy, as one might argue linked data is in some circles today. There are literally scores of off-the-shelf “RDFizers” for converting native and simple data structs into RDF form. New converters are readily written.
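As a rough sketch of how small such a converter can be, the following Python function turns one flat key-value record into N-Triples lines on the receiving end. The `http://example.org/` namespace, the `rdfize` name and the URI layout are assumptions made for this illustration; it is not how structWSF or any particular RDFizer works, and it emits every value as a plain string literal for simplicity.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def escape_literal(value):
    """Escape a value for use inside a double-quoted N-Triples literal."""
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    return text.replace("\n", "\\n").replace("\r", "\\r")


def rdfize(record, id_key="id"):
    """Convert one flat key-value record into a list of N-Triples lines."""
    subject = f"<{BASE}resource/{quote(str(record[id_key]))}>"
    triples = []
    for key, value in record.items():
        if key == id_key:
            continue  # the identifier becomes the subject, not a property
        predicate = f"<{BASE}property/{quote(key)}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples
```

The publisher keeps sending simple records; only the consumer needs to know how keys map to predicates.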

Adaptive systems, by definition, do not require wholesale changes to existing practices and do not require effort where none is warranted. By posing the challenge as a “chicken-and-egg” one where publishers themselves must undertake a change in their existing practices to conform, or else they fail the “linked data threshold”, advocates are ensuring failure. There is plenty of useful structured data to consume already.

Accessible structured data, properly characterized (see below), should be our root interest; not whether that data has been published as linked data per se.

Observation #4: A Technique Can Not Carry the Burden of Usefulness or Interoperability

Linked data is nothing more than some techniques for publishing Web-accessible data using the RDF data model. Some have tried to use the concept of linked data as a replacement for the idea of the semantic Web, and some have recently tried to re-define linked data as not requiring RDF [5]. Yet the real issue with all of these attempts — correct or not, and a fact of linked data since first formulated by Tim Berners-Lee — is that a technique alone can not carry the burden of usefulness or interoperability.

Despite billions of triples now available, we in fact see little actual use or consumption of linked data, except in the life science domain. Indeed, a new workshop by the research community called COLD (Consuming Linked Data) has been set up for the upcoming ISWC conference to look into the very reasons why this lack of usage may be occurring [6].

It will be interesting to monitor what comes out of that workshop, but I have my own views as to what might be going on here. A number of factors, applicable frankly to any data, must be layered on top of linked data techniques in order for it to be useful:

  • Context and coherence (see below)
  • Curation and quality control (where provenance is used as the proxy), and
  • Up-to-date and timely.

These requirements apply to any data ranging from Census CSV files to Google search results. But because relationships can also be more readily asserted with linked data, these requirements are even greater for it.

It is not surprising that the life sciences have seen more uptake of linked data. That community has keen experience with curation, and the quality and linkages asserted there are much superior to other areas of linked data [7].

In other linked data areas, it is really in limited pockets such as FactForge from Ontotext or curated forms of Wikipedia by the likes of Freebase that we see the most use and uptake. There is no substitute for consistency and quality control.

It is really in this area of “publish it and they will come” that we see one of the threads of parochialism in the linked data community. You can publish it and they still will not come. And, like any data, they will not come because the quality is poor or the linkages are wrong.

As a technique for making data available, linked data is thus nothing more than a foot soldier in the campaign to make information meaningful. Elevating it above its pay grade sets the wrong target and causes us to lose focus for what is really important.

Observation #5: 50% of Linked Data is Missing (that is, the Linking part)

There is another strange phenomenon in the linked data movement: the almost total disregard for the linking part. Sure, data is getting published as triples with dereferenceable URIs, but where are the links?

At most, what we are seeing is owl:sameAs assertions and a few others [8]. Not only does this miss the whole point of linked data, but one can question whether equivalence assertions are correct in many instances [9].

For a couple of years now I have been arguing that the central gap in linked data has been the absence of context and coherence. By context I mean the use of reference structures to help place and frame what content is about. By coherence I mean that those contextual references make internal and logical sense, that they represent a consistent world view. Both require a richer use of links to concepts and subjects describing the semantics of the content.

It is precisely through these kinds of links that data from disparate sources and with different frames of reference can be meaningfully related to other data. This is the essence of the semantic Web and the purported purpose of linked data. And it is exactly these areas in which linked data is presently found most lacking.
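A small Python sketch may help show why links to shared reference concepts matter. The triples, the source prefixes and the `ex:isAbout` predicate below are hypothetical; the point is only that records from different sources become relatable once they point at the same concept, which an equivalence link alone does not provide.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples drawn from two sources.
triples = [
    ("srcA:rec1", "ex:isAbout", "ref:Housing"),
    ("srcA:rec2", "ex:isAbout", "ref:Transportation"),
    ("srcB:item9", "ex:isAbout", "ref:Housing"),
    ("srcB:item9", "owl:sameAs", "srcC:thing4"),
]


def group_by_concept(triples, about_predicate="ex:isAbout"):
    """Group subjects by the reference concept they are linked to."""
    groups = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == about_predicate:
            groups[obj].add(subject)
    return dict(groups)


related = group_by_concept(triples)
```

Here `related["ref:Housing"]` brings together one record from each source, while the lone `owl:sameAs` assertion says nothing about what either record is about.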

Of course, these questions are not the sole challenge of linked data. They are the essential challenge in any attempt to connect or interoperate structured data within information systems. So, while linked data is ostensibly designed from the get-go to fulfill these aims, any data that can find meaning outside of its native silo must also be placed into context in a coherent manner. The unique disappointment for much linked data is its failure to provide these contexts despite its design.

Observation #6: Pluralism is a Reality; Embrace It

Yet, having said all of this, Structured Dynamics is still committed to linked data. We present our information as such, and provide great tools for producing and consuming it. We have made it one of the seven foundations to our technology stack and methodology.

But we live in a pluralistic data world. There are reasons and roles for the multitude of popular structured data formats that presently exist. This inherent diversity is a fact in any real-world data context. Thus, we have not met a form of structured data that we didn’t like, especially if it is accompanied with metadata that puts the data into coherent context. It is a major reason why we developed the irON (instance record and object notation) non-RDF vocabulary to provide a bridge from such forms to RDF. irON clearly shows that entities can be usefully described and consumed in either RDF or non-RDF serialized forms.

Attitudes that dismiss non-linked data forms or arrogantly insist that publishers adhere to linked data practices are anything but pluralistic. They are parochial and short-sighted and are contributing, in part, to keeping the semantic Web from going mainstream.

Adoption requires simplicity. The simplest way to encourage the greater interoperability of data is to leverage existing assets in their native form, with encouragement for minor enhancements to add descriptive metadata for what the content is about. Embracing such an ecumenical attitude makes all publishers potentially valuable contributors to a better information future. It will also nearly instantaneously widen the tools base available for the common objective of interoperability.

Parochialism and Root Cause Analysis

Linked data is a good thing, but not an ultimate thing. By making linked data an objective in itself we unduly raise publishing thresholds; we set our sights below the real problem to be solved; and we risk diluting the understanding of RDF from its natural role as a flexible and adaptive data model. Paradoxically, too much parochial insistence on linked data may undercut its adoption and the realization of the overall semantic objective.

Root cause analysis for what it takes to achieve meaningful, interoperable information suggests that describing source content in terms of what it is about is the pivotal factor. Moreover, those contexts should be shared to aid interoperability. Whichever organizations do an excellent job of providing context and coherent linkages will be the go-to ones for data consumers. As we have seen to date, merely publishing linked data triples does not meet this test.

I have heard some state that first you celebrate linked data and its growing quantity, and then hope that the quality improves. This sentiment holds if indeed the community moves on to the questions of quality and relevance. The time for that transition is now. And, oh, by the way, as long as we are broadening our horizons, let’s also celebrate properly characterized structured data no matter what its form. Pluralism is part of the tao to the meaning of information.

[1] See, for example, J.A. Hendler, 2008. “Web 3.0: Chicken Farms on the Semantic Web,” Computer, January 2008, pp. 106-108. See http://www.comp.leeds.ac.uk/webscience/talks/hendler_web_3.pdf. While I can buy Hendler’s arguments about commercial tool vendors holding off major investments until the market is sizable, I think we can also see via listings like Sweet Tools that a lack of tools is not in itself limiting.

[2] An earlier treatment of this subject from a different perspective is M.K. Bergman, 2010. “The Bipolar Disorder of Linked Data,” AI3:::Adaptive Information blog, April 28, 2010.

[3] So far only prefixes for units up to 10^24 (“yotta”) have names; for 10^27, a student campaign on Facebook is proposing that “hellabyte” (Northern California slang for “a whole lot of”) be adopted by science bodies. See http://scitech.blogs.cnn.com/2010/03/04/hella-proposal-facebook/.

[4] One of the more popular posts on this blog has been M.K. Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” AI3:::Adaptive Information blog, January 22, 2009.

[5] See, for example, the recent history on the linked data entry on Wikipedia or the assertions by Kingsley Idehen regarding entity attribute values (EAV) (see, for example, this blog post).

[6] See further the 1st International Workshop on Consuming Linked Data (COLD 2010), at the 9th International Semantic Web Conference (ISWC 2010), November 8, 2010, Shanghai, China.

[7] For example, in the early years of GenBank, some claimed that annotations of gene sequences due to things like BLAST analyses may have had as high as 30% to 70% error rates due to propagation of initially mislabeled sequences. In part, the whole field of bioinformatics was formed to deal with issues of data quality and curation (in addition to analytics).

[8] See, for example: Harry Halpin, 2009. “A Query-Driven Characterization of Linked Data,” paper presented at the Linked Data on the Web (LDOW) 2009 Workshop, April 20, 2009, Madrid, Spain, see http://events.linkeddata.org/ldow2009/papers/ldow2009_paper16.pdf; Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma and Amit P. Sheth, 2010. “Linked Data is Merely More Data,” in Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness, Linked Data Meets Artificial Intelligence, Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82-86, see http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf; among others.

[9] Harry Halpin and Patrick J. Hayes, 2010. “When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web,” presented at LDOW 2010, April 27th, 2010, Raleigh, North Carolina. See http://events.linkeddata.org/ldow2010/papers/ldow2010_paper09.pdf.

An Executive Intro to Ontologies

AI3:::Adaptive Information (Mike Bergman) - Mon, 08/09/2010 - 05:53

Ontologies are the structural frameworks for organizing information on the semantic Web and within semantic enterprises. They provide unique benefits in discovery, flexible access, and information integration due to their inherent connectedness; that is, their ability to represent conceptual relationships. Ontologies can be layered on top of existing information assets, which means they are an enhancement and not a displacement for prior investments. And ontologies may be developed and matured incrementally, which means their adoption may be cost-effective as benefits become evident [1].

What Is an Ontology?

Ontology may be one of the more daunting terms for those exposed for the first time to semantic technologies. Not only is the word long and without common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”

Much like taxonomies or relational database schema, ontologies work to organize information. No matter what the domain or scope, an ontology is a description of a world view. That view might be limited and minuscule, or it might be global and expansive. However, unlike those alternative hierarchical views of concepts such as taxonomies, ontologies often have a linked or networked “graph” structure. Multiple things can be related to other things, all in a potentially multi-way series of relationships.

A distinguishing characteristic of ontologies compared to conventional hierarchical structures is their degree of connectedness, their ability to model coherent, linked relationships.

Ontologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Ontologies thus provide a similar role for the organization of data that is provided by relational data schema. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data.

When one uses the idea of “world view” as synonymous with an ontology, it is not meant to be cosmic, but simply a way to convey how a given domain or problem area can be described. One group might choose to describe and organize, say, automobiles, by color; another might choose body styles such as pick-ups or sedans; or still another might use brands such as Honda and Ford. None of these views is inherently “right” (indeed multiples might be combined in a given ontology), but each represents a particular way, a “world view”, of looking at the domain.

Though there is much latitude in how a given domain might be described, there are both good ontology practices and bad ones. We offer some views as to what constitutes good ontology design and practice in the concluding section.

What Are Its Benefits?

A good ontology offers a composite suite of benefits not available to taxonomies, relational database schema, or other standard ways to structure information. Among these benefits are:

  • Coherent navigation by enabling the movement from concept to concept in the ontology structure
  • Flexible entry points because any specific perspective in the ontology can be traced and related to all of its associated concepts; there is no set structure or manner for interacting with the ontology
  • Connections that highlight related information and aid and prompt discovery without requiring prior knowledge of the domain or its terminology
  • Ability to represent any form of information, including unstructured (say, documents or text), semi-structured (say, XML or Web pages) and structured (say, conventional databases) data
  • Inferencing, whereby specifying one concept (say, mammals) also implies a related concept (say, that mammals are a kind of animal)
  • Concept matching, which means that even though we may describe things somewhat differently, we can still match to the same idea (such as glad or happy both referring to the concept of a pleasant state of mind)
  • Integration of external content, which follows from the proper matching and mapping of these concepts
  • A framework for disambiguation by nature of the matching and analysis of concepts and instances in the ontology graph, and
  • Reasoning, which is the ability to use the coherence and structure itself to inform questions of relatedness or to answer questions.
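As a rough illustration of the inferencing and concept matching benefits listed above, consider this Python sketch. The concept names, the single-parent subclass table and the label sets are invented for the example; a real ontology would express them in a language such as OWL and allow multiple parents.

```python
# Hypothetical subclass assertions: child concept -> parent concept.
SUBCLASS_OF = {"Dog": "Mammal", "Mammal": "Animal"}

# Hypothetical alternate labels that all refer to one concept.
LABELS = {"PleasantStateOfMind": {"glad", "happy"}}


def ancestors(concept):
    """Infer every broader concept by following subclass links upward."""
    result = []
    seen = set()
    while concept in SUBCLASS_OF and concept not in seen:
        seen.add(concept)  # guards against an accidental cycle
        concept = SUBCLASS_OF[concept]
        result.append(concept)
    return result


def match_concept(term):
    """Map a surface term to the concept it refers to, or None if unknown."""
    normalized = term.strip().lower()
    for concept, labels in LABELS.items():
        if normalized in labels:
            return concept
    return None
```

Stating only that something is a `Mammal` is enough for `ancestors` to conclude it is also an `Animal`, and both "glad" and "happy" resolve to the same concept through `match_concept`.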

How Are Ontologies Used?

The relationship structure underlying an ontology provides an excellent vehicle for discovery and linkages. “Swimming through” this relationship graph is the basis of the Concept Explorer (also known as the Relation Browser) and similar widgets.
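The idea of “swimming through” a relationship graph can be sketched in a few lines of Python. The edges, relation names and function names below are hypothetical and say nothing about how the Concept Explorer itself is implemented; the sketch only shows re-centering on a concept and listing what connects to it.

```python
from collections import defaultdict

# Hypothetical concept graph as (concept, relation, concept) edges.
edges = [
    ("Automobile", "hasBodyStyle", "Sedan"),
    ("Automobile", "hasBodyStyle", "Pick-up"),
    ("Automobile", "hasBrand", "Honda"),
    ("Automobile", "hasBrand", "Ford"),
    ("Honda", "makes", "Sedan"),
]


def build_index(edges):
    """Index each concept by its connections, in both directions."""
    index = defaultdict(set)
    for source, relation, target in edges:
        index[source].add((relation, target))
        index[target].add((relation, source))
    return index


def center_on(index, concept):
    """Return the sorted concepts directly connected to the given concept."""
    return sorted({other for _, other in index.get(concept, ())})


index = build_index(edges)
```

Calling `center_on(index, "Automobile")` lists its four neighbors; choosing one of them, say `"Sedan"`, and calling `center_on` again moves the focus one step through the graph.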

The most prevalent use of ontologies at present is in semantic search. Semantic search has benefits over conventional search in terms of being able to make inferences and matches not available to standard keyword retrieval.

The relationship structure also is a powerful, more general and more nuanced way to organize information. Concepts can relate to other concepts through a richness of vocabulary. Such predicates might capture subsumption, precedence, part-of relationships (mereology), preferences, or importance along virtually any metric. This richness of expression and relationships can also be built incrementally over time, allowing ontologies to grow and develop in sophistication and use as desired.

The pinnacle application for ontologies, therefore, is as coherent reference structures whose purpose is to help map and integrate other structures and information. Given the huge heterogeneity of information both within and without organizations, the use of ontologies as integration frameworks will likely emerge as their most valuable use.

What Makes for a Good Ontology?

Good ontology practice has aspects both in terms of scope and in terms of construction.

Scope Considerations

Here are some scoping and design questions that we believe should be answered in the positive in order for an ontology to meet good practice standards:

  • Does the ontology provide balanced coverage of the subject domain? This question gets at the issue of properly scoping and bounding the subject coverage of the ontology. It also means that the breadth and depth of the coverage is roughly equivalent across its scope
  • Does the ontology embed its domain coverage into a proper context? A major strength of ontologies is their potential ability to interoperate with other ontologies. Re-using existing and well-accepted vocabularies and including concepts in the subject ontology that aid such connections is good practice. The ontology should also have sufficient reference structure for guiding the assignment of what content “is about”
  • Are the relationships in the ontology coherent? The essence of coherence is that it is a state of logical, consistent connections, a logical framework for integrating diverse elements in an intelligent way. So while context supplies a reference structure, coherence means that the structure makes sense. Is the hip bone connected to the thigh bone, or is the skeleton incorrect?
  • Has the ontology been well constructed according to good practice? See next.

If these questions can be answered affirmatively, then we would deem the ontology ready for production-grade use.

Fundamental to the whole concept of coherence is the fact that experts and practitioners within domains have been looking at the questions of relationships, structure, language and meaning for decades. Though perhaps today we now finally have a broadly useful data and logic model in RDF, the fact remains that massive time and effort have already been expended to codify some of these understandings in various ways and at various levels of completeness and scope. Good practice also means, therefore, that maximum leverage is made to springboard ontologies from existing structural and vocabulary assets.

And, because good ontologies also embrace the open world approach, working toward these desired end states can also be incremental. Thus, in the face of common budget or deadline constraints, it is possible initially to scope domains as smaller or to provide less coverage in depth or to use a small set of predicates, all the while still achieving productive use of the ontology. Then, over time, the scope can be expanded incrementally.

Construction Considerations

To achieve their purposes, ontologies must be both human-readable and machine-processable. Also, because they represent conceptual structures, they must be built with a certain composition.

Good ontologies therefore are constructed such that they have:

  • Concept definitions – the matching and alignment of things is done on the basis of concepts (not simply labels) which means each concept must be defined
  • A preferred label that is used for human readable purposes and in user interfaces
  • A “semset” – which means a series of alternate labels and terms to describe the concept. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept
  • Clearly defined relationships (also known as properties, attributes, or predicates) for relating two things to one another
  • All of which is written in a machine-processable language such as OWL or RDF Schema (among others).
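The construction items above can be sketched as a small Python data structure that is written out in a machine-processable form. This sketch assumes the SKOS vocabulary (`skos:prefLabel`, `skos:altLabel`, `skos:definition`) serialized as Turtle, which is one possible choice alongside the OWL and RDF Schema options named in the list; the `Concept` class, the example URI and the labels are hypothetical, and label text is assumed to contain no double quotes.

```python
from dataclasses import dataclass, field

PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


@dataclass
class Concept:
    uri: str
    pref_label: str  # the label shown to people and in user interfaces
    definition: str  # what the concept means, independent of its labels
    alt_labels: list = field(default_factory=list)  # the "semset"


def to_turtle(concept):
    """Serialize one concept as a Turtle snippet using SKOS terms."""
    pairs = [
        ("a", "skos:Concept"),
        ("skos:prefLabel", f'"{concept.pref_label}"@en'),
        ("skos:definition", f'"{concept.definition}"@en'),
    ]
    pairs += [("skos:altLabel", f'"{label}"@en') for label in concept.alt_labels]
    body = " ;\n    ".join(f"{predicate} {obj}" for predicate, obj in pairs)
    return f"{PREFIX}\n<{concept.uri}>\n    {body} .\n"


happy = Concept(
    uri="http://example.org/concept/PleasantStateOfMind",
    pref_label="pleasant state of mind",
    definition="A mental state experienced as agreeable.",
    alt_labels=["glad", "happy"],
)
turtle = to_turtle(happy)
```

Relationships between concepts would be added as further predicate-object pairs on the same subject.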

In the case of ontology-driven applications using adaptive ontologies, there are also additional instructions contained in the system (often via administrative ontologies) that tell the system which types of widgets need to be invoked for different data types and attributes. This is different from the standard conceptual schema, but is nonetheless essential to how such applications are designed.
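One way to picture such instructions is as a lookup from data type to widget. The Python sketch below is purely illustrative: the data type names, widget names and table are assumptions, and a real system would hold this mapping in an administrative ontology rather than in code.

```python
# Hypothetical mapping from a data type to the widget that displays it.
WIDGETS_BY_DATATYPE = {
    "geo": "MapViewer",
    "text": "StoryViewer",
    "numeric": "ChartingWidget",
}


def widgets_for(attribute_types):
    """Return the distinct widgets needed for a record, in first-seen order."""
    chosen = []
    for data_type in attribute_types:
        widget = WIDGETS_BY_DATATYPE.get(data_type)
        if widget is not None and widget not in chosen:
            chosen.append(widget)
    return chosen
```

A record carrying numeric and geographic attributes would thus invoke a chart and a map, while unknown data types are simply skipped.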

[1] This posting was at the request of a couple of Structured Dynamics’ customers that desired a way to describe ontologies to non-technical management. For a more in-depth treatment, see M.K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog, May 16, 2007.

Citizen Dan Goes Live; Available for Download

AI3:::Adaptive Information (Mike Bergman) - Mon, 08/02/2010 - 23:20

Discover and Play with this Demo of the Open Semantic Framework

Today, Structured Dynamics is pleased to make its Citizen Dan application available for public viewing, play and downloading for the first time.

Citizen Dan is a free, open source system available to any community and its citizens to measure and track indicators of local well being. It can be branded and themed for local needs. It is under active development by Structured Dynamics with support from a number of innovative cities.

Citizen Dan is an exemplar instance of Structured Dynamics’ open semantic framework (OSF), a generalized framework for deploying semantic platforms for any domain. By changing its guiding ontologies and source content and data, what appears for Citizen Dan can be adapted for virtually any subject area.

As configured, the Citizen Dan OSF instance is a:

  • Appliance for slicing-and-dicing and analyzing data specific to local community indicators
  • Framework for dynamically navigating, interacting with, or browsing data and concepts
  • Means to visualize local data over time or by neighborhood
  • Meeting place for the public to upload and share local data and information
  • Web data portal that can be individually tailored by any local community
  • Potential node in a global network of communities across which to compare indicators of community well-being.

[Figure: Unique Concept Explorer for dynamic discovery and navigation]

Citizen Dan’s information sources may include Census data, the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics.

Text and narratives and the concepts and entities they describe are integrally linked into the system via information extraction and tagging. All ingested information, whether structured or text sources, with their semantics, can be exported in multiple formats. A standard organizing schema, also open source and extensible or modifiable by all users, is provided via the optional MUNI ontology (with vocabulary details in development here), being developed expressly for Citizen Dan and its community indicator system purposes.

All of the community information contained within a Citizen Dan instance is available as linked data.

Overview of Features

Here are the main components or widgets to this Citizen Dan demo:

Integration of text and stories via subject
concept or named entities (scones) tagging
  • Concept Explorer — this Flex widget (also called the Relation Browser) is a dynamic navigator of the concept space (ontology) that is used to organize the content on the instance. Clicking on a bubble causes it to assume the central position in the diagram, with all of its connecting concepts shown. Clicking on a branch concept then causes that new node to assume the central position, enabling one to “swim through” the overall concept graph. For this instance of Citizen Dan, the MUNI ontology is used; a diagram shows the full graph of the MUNI structure. See further the concept explorer’s technical documentation
  • Story Viewer — any type of text content (such as stories, blog posts, news articles, local government reports, city council minutes, etc.) can be submitted to the system. This content is then tagged using the scones system (subject concepts or named entities), which then provides the basis for linking the content with concepts and other data. The story viewer is a Flex widget that highlights these tags in the content and allows searches for related content based on selected tags. See further the story viewer’s technical documentation
  • Map Viewer — the map viewer is a Flex widget that presents layered views of different geographic areas. The title bar of the viewer allows different layers to be turned on and off. Clicking on various geographic areas can invoke specific data and dashboard views. See further the map viewer’s technical documentation

Mapping with data highlights for all
neighborhood and census tract data
  • Charting Widgets — the system provides a variety of charting options for numeric data, including pie, line and bar charts. These can be called directly or sprinkled amongst other widgets based on a dashboard specification (see below)
  • Filter Component — the filter, or browse, component provides the ability to slice-and-dice the information space by a choice of dataset, type of data or data attribute. These slices then become filter selections which can be persisted across various visualizations or exports. See further the browse component’s technical documentation
  • Search Component — this component provides full-text, faceted search across all content in the system; it may be used in conjunction with the filtering above to restrict the search space to the current slice. See further the search tool’s technical documentation
  • Dashboard Viewer — a dashboard is a particular layout of one or more visualization widgets and a set (or not) of content filtering conditions to be displayed on a canvas. Dashboard views are created in the workbench (see next) and given a persistent name for invoking and use at any other location in the application

A variety of charts and graphs
for all numerical data
  • Workbench — this rather complex component is generally intended to be limited to site administrators. Via the workbench, records, datasets and attributes may be selected, and then particular views or widgets obtained. When no selections are made in the left-hand panel, all are selected by default. Then, in the records viewer (middle upper), either records or attributes are selected. For each attribute (column), a new display widget appears. All display widgets interact (a selection in one is reflected in the others). The data type or attribute selected determines which widgets are available to display it; sometimes there are several, which can be chosen via the lower left dropdown list in any given display panel. These various display widgets may then be selected for a nameable layout as a persistent dashboard view (functionality not shown in this public demo)
  • Exporter — the exporter component appears in multiple locations across the appliance, either as a tab option (e.g., Filter component) or as a dropdown list to the lower right of many screens. A growing variety of export formats is available. When it appears as a dropdown list, the export is limited to the currently active slice. When invoked via tab, more export selection options are available. See further the technical documentation for this component

Dashboard provides indicator comparisons
across time and areas
Limitations of the Online Demo

A number of other tools are available to admins in the actual appliance, but are not exposed in the demo:

  • Importer — like the exporter, there are a variety of formats supported for ingesting data or content into the system. Prominent ones include spreadsheets (CSV), XML and JSON. The irON notation is especially well suited for dataset staging for ingest. At import time, datasets can also be appended or merged. See further the technical documentation for this component
  • Dataset Submission and Management — new datasets can be defined, updated, deleted, appended and granted various access rights and permissions, including to the granularity of individual components or tools. For example, see further this technical documentation
  • Records Manager — every dataset can have its records managed via so-called CRUD rights. Depending on the dataset permissions, a given user may or may not see these tools. See further the technical documentation for each of these create, read, update and delete tools.

In addition, it is not possible in the demo to save persistent dashboard views or submit stories or documents for tagging, nor to register as a user or view the admin portions of the Drupal instance.

Powerful and persistent “slicing-and-dicing”
across all datasets and data structure
Sample Data and Content in the Demo

The sample data and content in the demo are for the Iowa City (IA) metropolitan statistical area. This area embraces two counties (Johnson and Washington) and the census tracts and townships that comprise them, and about two dozen cities. Two of the notable cities are Iowa City itself, home of the University of Iowa, and Coralville, where Structured Dynamics, the developer of Citizen Dan and the open semantic framework (OSF), is headquartered.

The text content on this site is drawn from Wikipedia articles dealing with this area. About 30 stories are included.

The data content on the site is drawn from US Census Bureau data. Shape files for the various geographic areas were obtained from here, and the actual datasets by geographic area can be obtained from here.

The Workbench is the admin tool to
name and create new Dashboard views
An Instance of the Open Semantic Framework

Citizen Dan is an exemplar instance of Structured Dynamics’ open semantic framework (OSF), a generalized framework for deploying semantic platforms for specific domains.

OSF is a combination of a layered architecture and modular software. Most of the individual open source software products developed by Structured Dynamics and available on the OpenStructs site are components within the open semantic framework. These include:

Any data “slice” can be imported or exported
as structured data (e.g., RDF, XML, JSON, CSV)
A Part of the ‘Total Open Solution’

The software that makes up the Citizen Dan appliance is one of the four legs that provide a stable, open source solution. These four legs are software, structure, methods and documentation. When all four are provided, we can term this a total open solution.

For Citizen Dan, the complements to this software are:

  • MUNI ontology, which provides the structure specification upon which the software runs, and
  • DocWiki (with its TechWiki subset of technical documentation) that provides the accompanying knowledge base of methods, best practices and other guidance.

In its entirety, the total open solution amounts to a form of capacity building for the enterprise.

Admins have a wealth of tools for dataset and
records CRUD and management.
The Potential for a Citizen Dan Network

Inherent in the design and architecture of Citizen Dan is the potential for each instance (single installation) to act as a node in a distributed network of nodes across the Web. Via the structWSF Web service endpoints and appropriate dataset permissions, it is possible for any city in the Citizen Dan network to share (or not) any or all of its data with other cities.

This collaboration aspect has been “baked into the cake” from Day One. The system also supports differential access, rights and roles by dataset and Web service. Thus, city staffs across multiple communities could share data differently than what is provided to the general public.

Since all data management aspects of each Citizen Dan instance are also oriented around datasets, expansion to a network mode is quite straightforward.

Citizen Dan is hosted in Drupal, with rich portal,
theming and 6500 add-ons available
How to Get the System

The Citizen Dan appliance is based on the Drupal content management system, which means any community can easily theme or add to the functionality of the system with any of the available 6500 open source modules that extend the basic Drupal functionality.

All other components, including the multiple third-party ones, are also open source.

To install Citizen Dan for your own use, you need to:

  1. Download and install all of the software components. You may also want to check out the OSF discussion forum for tips and ideas about alternative configuration options
  2. Install a baseline vocabulary. In the case of Citizen Dan, this is the MUNI ontology. The public release of MUNI is imminent. Please contact the project if you need an early copy
  3. Install your own datasets. You may want to inspect the sample Citizen Dan datasets and learn more about the irON notation, especially its commON (spreadsheet) use case.

(Note: there will also be some more updates in August, including the MUNI release.)

For questions and additional info, please consult the TechWiki or the OpenStructs community site.

Finally, please contact us if you’d like to learn more about the project, investigate funding or sponsorship opportunities, or contribute to development. We’d welcome your involvement!

Using Wikis as Pre-Packaged Knowledge Bases

AI3:::Adaptive Information (Mike Bergman) - Mon, 07/26/2010 - 05:31
TechWiki DocWiki While Also Discovering Hidden Publication and Collaboration Potentials

A few weeks back I completed a three-part introductory series to what Structured Dynamics calls a ‘total open solution’. A total open solution as we defined it comprises software, structure, methods and documentation. When provided in toto, these components provide all of the necessary parts for an organization to adopt new open source solutions on its own (or with the choice of its own consultants and contractors). A total open solution fulfills SD’s mantra that, “We’re successful when we’re not needed.”

Two of the four legs to this total open solution are provided by documentation and methods. These two parts can be seen as a knowledge base that instructs users on how to select, install, maintain and manage the solution at hand.

Today, SD is releasing publicly for the first time two complementary knowledge bases for these purposes: TechWiki, which is the technical and software documentation complement, in this case based around SD’s Open Semantic Framework and its associated open source software projects; and DocWiki, the process methodology and project management complement that extends this basis, in this case based around the Citizen Dan local community open data appliance.

All of the software supporting these initiatives is open source. And, all of the content in the knowledge bases is freely available under a Creative Commons 3.0 license with attribution.

Mindset and Objectives

In setting out the design of these knowledge bases, our mindset was to enable single-point authoring of document content, while promoting easy collaboration and rollback of versions. Thus, the design objectives became:

  • A full document management system
  • Multiple author support
  • Authors to document in a single, canonical form
  • Collaboration support
  • Mixing-and-matching of content from multiple pages and articles to re-purpose for different documents, and
  • Excellent version/revision control.

Assuming these objectives could be met, we then had three other objectives on our wish list:

  • Single source publishing: publish in multiple formats (HTML, PDF, doc, csv, RTF?)
  • Separate theming of output products for different users, preferably using CSS, and
  • Single-click export of the existing knowledge base, followed by easy user modification.

Our initial investigations looked at conventional content and document management systems, matched with version control systems or SVNs. Somewhat surprisingly, though, we found the Mediawiki platform to fulfill all of our objectives. Mediawiki, as detailed below, has evolved to become a very mature and capable documentation platform.

While most of us know Mediawiki as a kind of organic authoring and content platform — as it is used on Wikipedia and many other leading wikis — we also found it perfect for our specific knowledge base purposes. To our knowledge, no one has yet set up and deployed Mediawiki in the specific pre-packaged knowledge base manner as described herein.

TechWiki v DocWiki

TechWiki is a Mediawiki instance designed to support the collaborative creation of technical knowledge bases. The TechWiki design is specifically geared to produce high-quality, comprehensive technical documentation associated with the OpenStructs open source software. This knowledge base is meant to be the go-to source for any and all documentation for the codes, and includes information regarding:

  • Coding and code development
  • Systems configurations and architectures
  • Installation
  • Set-up and maintenance
  • Best practices in these areas
  • Technical background information, and
  • Links to external resources.

As of today, TechWiki contains 187 articles under 56 categories, with a further 293 images. The knowledge base is growing daily.

DocWiki is a sibling Mediawiki instance that contains all TechWiki material, but has a broader purpose. Its role is to be a complete knowledge base for a given installation of an Open Semantic Framework (in the current case, Citizen Dan). As such, it needs to include much of the technical information in the TechWiki, but also extends that in the following areas:

  • Relation and discussion of the approach vis-à-vis other information development initiatives
  • Use of a common information management framework and vocabulary (MIKE2.0)
  • A five-phased, incremental approach to deployment and use
  • Specific tasks, activities and phases under which this deployment takes place, including staff roles, governance and outcome measurement
  • Supporting background material useful for executive management and outside audiences.

The methodology portions of the DocWiki are drawn from the broader MIKE2.0 (Method for Integrated Knowledge Environments) approach. I have previously written about this open source methodology championed by BearingPoint and Deloitte.

As of today, DocWiki contains 357 articles and 394 structured tasks in 70 activity areas under 77 categories. Another 115 images support this content. This knowledge base, too, is growing daily.

Both of these knowledge bases are open source and may be exported and installed locally. Then, users may revise and modify and extend that pre-packaged information in any way they see fit.

Basic Wiki Overview

The basic design of these systems is geared to collaboration and embeds what we think are really responsive work flows. These extend from supporting initial idea noodling to full-blown public documentation. The inherent design of the system also supports single-source publishing and book or PDF creation from the material that is there. Here is the basic overview of the design:

(click for full size)

Mediawiki provides the standard authoring and collaboration environment. There is a choice of editing methods. As content is created, it is organized in a standard way and stored in the knowledge base. The Mediawiki API supports the export of information in either XHTML or XML, which in turn allows the information to be used in external apps (including other Mediawiki instances) or for various single-source publication purposes. The Collection extension is one means by which PDFs or even entire books (that is, multi-page documents, potentially with chapters, etc.) may be created. Use of a well-designed CSS ensures that outputs can be readily styled and themed for different purposes or audiences.
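
As a minimal sketch of the export path just described, the following Python builds a request URL that asks the Mediawiki API for the XML export of a few pages. The host name and page titles are hypothetical placeholders. The `export` and `exportnowrap` parameters are those of the `action=query` module in current Mediawiki releases, which is an assumption here and may differ from the version behind these wikis.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the api.php URL of your own wiki.
API_URL = "https://wiki.example.org/api.php"


def build_export_url(titles, api_url=API_URL):
    """Return a URL that requests the XML export of the given page titles."""
    params = {
        "action": "query",
        "titles": "|".join(titles),  # the API separates multiple titles with "|"
        "export": 1,                 # include an XML dump of the selected pages
        "exportnowrap": 1,           # return the bare export XML, unwrapped
    }
    return api_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical page titles, for illustration only.
    print(build_export_url(["Installation", "Best Practices"]))
```

Fetching that URL (for instance with `urllib.request.urlopen`) should return an XML document that can be saved and later imported into another Mediawiki instance.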

Because these wikis are designed from the get-go to be reusable, and then downloaded and installed locally, it is important that we maintain quality and consistency across content. (After download, users are free to do with it as they wish, but it is important that the initial database be clean and coherent.) The overall interaction with the content thus occurs at one of three levels: 1) simple reading, which is publicly available without limitation to any visitor, including source inspection and export; 2) editing and authoring, which is limited to approved contributors; and 3) draft authoring and noodling, which is limited to the group in #2 but for which the in-progress content is not publicly viewable. Built-in access rights in the system enable these distinctions.

Features and Benefits

Besides meeting all of the objectives noted at the opening of this post, these wikis (knowledge bases) also have these specific features:

  • Relatively complete (and growing) knowledge base content
  • Book, PDF, or XHTML publishing
  • Single-click exports and imports
  • Easy branding and modification of the knowledge bases for local use (via the XML export files)
  • Pre-designed, standard categorization systems for easy content migration
  • Written guidance on use and best practices
  • Ability to keep content in-development “hidden” from public viewing
  • Controlled, assisted means for assigning categories to content
  • Direct incorporation of external content
  • Efficient multi-category search and filtering
  • Choice of regular wikitext, WikED or rich-text editing
  • Standard embeddable CSS objects
  • Semantic and readily themed CSS for local use and for specialty publications
  • Standard templates
  • Sharable and editable images (SVG inclusion in process)
  • Code highlighting capabilities (GeSHi, for TechWiki)
  • Pre-designed systems for roles, tasks and activities (DocWiki)
  • Semantic Mediawiki support and forms (DocWiki)
  • Guided navigation and context (DocWiki).

Many of these features come from the standard extensions in the TechWiki/DocWiki packages.

The net benefits from this design are easily shared and modified knowledge bases that users and organizations may either contribute to for the broader benefit of the OpenStructs community, or download and install with simple modifications for local use and extension. There is actually no new software in this approach, just proper attention to packaging, design, standardization and workflow.

A Smooth Workflow

Via the sharing of extensions, categories and CSS, it is quite easy to have multiple instances or authoring environments in this design. For Structured Dynamics, that begins with our own internal wiki. Many notes are taken and collected there, some of a proprietary nature and the majority not intended or suitable for public release.

Content that has developed to the point of release, however, can be simply tagged using conventions in the workflow. Then, with a single Export command, the relevant content is sent to an XML file. (This document can itself be edited, for example by changing all ‘TechWiki’ references to something like ‘My Content Site’; see further here.)
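
The kind of edit mentioned in that parenthetical can be scripted. The following is a minimal sketch, assuming the export is a UTF-8 XML file on disk; the function names and file paths are hypothetical, and only the ‘TechWiki’ and ‘My Content Site’ strings come from the text above.

```python
from pathlib import Path


def rebrand_export(xml_text, old="TechWiki", new="My Content Site"):
    """Replace every occurrence of the old site name in exported wiki XML."""
    # Plain string replacement also rewrites page titles and links that
    # contain the old name, so review the result before importing it.
    return xml_text.replace(old, new)


def rebrand_file(src, dst, old="TechWiki", new="My Content Site"):
    """Read an export file, rebrand it, and write the result to a new file."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(rebrand_export(text, old, new), encoding="utf-8")
```

Writing to a separate destination file keeps the original export intact, so the replacement can be redone with different wording if needed.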

Depending on the nature of the content, this exported content may then be imported with a single Import command to either the TechWiki or DocWiki sites. (Note: Import does require admin rights.) A simple migration may also occur from the TechWiki to the DocWiki. Also, of course, initial authoring may begin at any of the sites, with collaborators an explicit feature of the TechWiki or DocWiki versions.

Any DocWiki can also be specifically configured for different domains and instance types. In terms of our current example, we are using Citizen Dan, but that could be any such Open Semantic Framework instance type:

(click for full size)

Under this design, then, the workflow suggests that technical content authoring and revision take place within the TechWiki, and process and methodology revision in the DocWiki. Moreover, most DocWikis are likely to be installed locally, such that once installed, their own content would likely morph into local methods and steps.

So long as page titles are kept the same, newer information can be updated on any target wiki at any time. Prior versions are kept in the version history and can be reinstated. Alternatively, if local content is clearly diverging yet updates of initial source material are still desired, the local content need only be saved under a new title to preserve it from import overwrites.
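
The title-based update rule can be illustrated with a toy model. This is only a sketch of the behavior described above, not Mediawiki's actual import code: a wiki is modeled as a dictionary from page title to revision history, and all names and page contents are hypothetical.

```python
def import_pages(wiki, incoming):
    """Import pages by title; an existing title gains a new revision."""
    for title, text in incoming.items():
        wiki.setdefault(title, []).append(text)


def current_text(wiki, title):
    """Return the newest revision stored under a title."""
    return wiki[title][-1]


def preserve_local(wiki, title, new_title):
    """Copy the current text to a new title so later imports leave it alone."""
    wiki[new_title] = [current_text(wiki, title)]


if __name__ == "__main__":
    wiki = {"Installation": ["local edits"]}
    preserve_local(wiki, "Installation", "Installation (local)")
    import_pages(wiki, {"Installation": "upstream update"})
    print(current_text(wiki, "Installation"))
    print(current_text(wiki, "Installation (local)"))
```

In this model the import appends rather than replaces, which mirrors the point that prior versions remain in the history and can be reinstated.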

Where Is It Going from Here?

We are really excited by this design and have already seen benefits in our own internal work and documentation. We see, for example, easier management of documentation and content, permanent (canonical) URLs for specific content items, and greater consistency and common language across all projects and documentation. Also, when all documentation is consolidated into one point with a coherent organizational and category structure, documentation gaps and inconsistencies become apparent and can readily be fixed.

Now, with the release of these systems to the OpenStructs (Open Semantic Framework) and Citizen Dan communities, we hope to see broader contributions and expansion of the content. We encourage you to check on these two sites periodically to see how the content volume continues to grow! And, we welcome all project contributors to join in and help expand these knowledge bases!

We think this general design and approach — especially in relation to a total open solution mindset — has much to recommend it for other open source projects. We think these systems, now that we have designed and worked out the workflows, are amazingly simple to set up and maintain. We welcome other projects to adopt this approach for their own. Let us know if we can be of assistance, and we welcome ideas for improvement!

Another Milestone in Semantic Enterprise Awareness

AI3:::Adaptive Information (Mike Bergman) - Thu, 07/15/2010 - 20:06
Cisco Video is a Good Starting Intro for Management

Like the seminal linked data publication by PricewaterhouseCoopers of about a year ago (see PWC Dedicates Quarterly Technology Forecast to Linked Data, May 29, 2009), a video released by Cisco yesterday is another signal of the emergence of the semantic enterprise.

The Cisco tech brief on The Semantic Enterprise is a quite accessible — but a bit eerie — seven-minute introduction.  The video was prepared by Cisco’s Internet Business Solutions Group (IBSG), with Shaun Kirby, its Director of Innovations Architectures, as the narrator:

YouTube: http://www.youtube.com/watch?v=3lUzs2I8BKI

Well, as for being eerie, when the video first came up, I thought I was looking at an advanced, next generation avatar, perhaps a reincarnation of Douglas Adams’ Hyperland. Maybe this semantic stuff was closer at hand than we thought!

But, as it turned out, that first blush was only a reaction to how the video was shot. As it gets rolling, the Cisco video is extremely well done and informative. It is a great intro for sharing with management when contemplating your own moves to becoming a semantic enterprise.

I suggest you first view — and then bookmark — this one.

‘Pay as You Benefit’: A New Enterprise IT Strategy

AI3:::Adaptive Information (Mike Bergman) - Tue, 07/13/2010 - 03:57
Using Incremental, Low-risk Semantic and Open World Approaches

OK. So, you’re looking at your garage … or your bedroom closet … or your office and its files. They are a mess, and you can’t find anything and you can’t stuff anything more into the nooks, cubbies, crannies or cabinets. What do you do?

Well, when you finally get fed up and have a rainy day or some other excuse, you tackle the mess. Maybe you grab a big mug of coffee to prepare for the pending battle. Maybe you strip down to comfort clothes. Then, if you’re like me, you begin to organize stuff into piles. Labeled piles and throwaway piles and any other piles that can provide a means to start bringing order to the chaos.

In the semantic Web world, there is a phrase coined by Jim Hendler that captures this approach: A little semantics goes a long way [1]. A little semantics, just like your labeled piles, helps to bring order to information chaos.

Mind you, this is not fancy or expensive stuff. In the case of my office, it is colored sheets of paper labeled with Magic Markers as “Taxes” or “Internal” or “Blog Posts” or whatever. Then, I begin sifting and distributing. In the case of the semantic world, these are classifying things into like categories and simply relating them to other categories with simple relationships, such as “is Part Of” or “is Narrower Than”.
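
To make the labeled-piles idea concrete, here is a small sketch of categories joined by simple relationships. The statements, the camel-case relationship names and the helper function are all illustrative assumptions; only the “Taxes”, “Internal” and “Blog Posts” labels come from the office example above.

```python
# (subject, relationship, object) statements; all entries are illustrative.
STATEMENTS = [
    ("Taxes", "isPartOf", "Internal"),
    ("Blog Posts", "isNarrowerThan", "Writing"),
    ("Writing", "isNarrowerThan", "Publications"),
]


def broader_categories(category, statements=STATEMENTS):
    """Follow "isNarrowerThan" links upward and return every broader category."""
    found = []
    frontier = [category]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in statements:
            if subject == current and relation == "isNarrowerThan" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found
```

Even with three statements, a question such as "what broader piles does Blog Posts belong to?" can be answered by following links, without any further modeling effort.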

Of course, I could have approached my mess in a different way. I could have hired an efficiency expert to come in, interview me and all of my employees and colleagues, gotten a written analysis and report, and then committed to a multi-week project to completely store and place every single last piece of paper in my office or organize every rake and set of abandoned golf clubs in my garage. When done, I would have shelled out much money and I suspect still not have been able to find anything.

Sort of sounds like the traditional way IT does its business, doesn’t it? To clean up their information messes, enterprises need to find a better strategy.

I returned not long ago from the SemTech conference, which overall was quite an excellent show. But despite its emphasis on semantic technologies and their usefulness to businesses and enterprises, I found one critical theme unspoken: the ability of semantic approaches to change how enterprise IT actually does business. New ways have got to be found to clean up the many and growing information piles emerging all around us.

The Changing Nature of IT

IT is — and has been — going through a fundamental set of changes for decades. In the last decade, these changes have led to lowered relative spending, a shift in spending priorities toward services, less innovation, and less productivity. Some data and observations by researchers and analysts document these trends.

The following chart, using US Bureau of Economic Analysis data [2], shows the clear 50-year trend in declining hardware costs for enterprises, mostly resulting from the observation known as Moore’s Law. These massive hardware cost reductions (logarithmic scale) have also resulted in lower prices for IT as a whole. In 2008, for example, total relative IT prices were about two-thirds what they were a mere decade earlier:

Source: M.K. Bergman and Bureau of Economic Analysis [2] (click for full size)

In contrast, relative prices for software and services have remained remarkably flat over this entire period, including for the past decade. This is somewhat surprising given the emergence of packaged software and more recently open source. However, relative percentage expenditures for custom software and software developed in-house have also remained strong over the past decade [3].

The mid- to late-1990s represented the high-water mark on many bases for enterprise IT, expenditures and vendors. Roughly in 1997 or so, the number of public enterprise software vendors peaked as did venture funding [4] and relative expenditures for IT in relation to GDP. There was a major uptick in relation to preparing for Y2K and a major downtick due to the dot-com bubble, and then of course the past two years or so have seen a global economic downturn. But, as the figure below shows (red), the long-term trend tends to suggest a relative plateau for IT expenditures in relation to GDP somewhat around 2000:

Source: M.K. Bergman and Bureau of Economic Analysis [2] (click for full size)

Yet, like the first chart, software seems to be bucking this trend (blue lines above). Though perhaps the rate of growth in expenditures for software is slowing a bit, it is still on a growth upslope, especially in relation to overall IT expenditures. The next chart, in fact, specifically compares software expenditures to total IT expenditures. Software expenditures are some 40% higher in relation to total IT than they were a mere decade ago:

Source: M.K. Bergman and Bureau of Economic Analysis [2] (click for full size)

The mix of these software expenditures is also changing in major ways while stagnating in others.

The changing aspect is coming about from the shift of expenditures from license and maintenance fees to services. A number of software vendors began to see revenues from services overcome that from licensing in the 1990s. By the early 2000s, this was true for the enterprise software sector as a whole [4]. Today, service revenues account for 70% or so of aggregate sector revenues. Combined with the emergence of open source and other alternatives such as software as a service (SaaS), I think it fair to say that the era of proprietary software with exceedingly high margins from monopoly rents is over [5].

The stagnating aspect occurs in how the software expenditures are applied. According to Gartner, in the US, more than 70% of IT expenditures are devoted to simply running existing systems, with only about 11% of budgets devoted to innovation; other parts of the world spend nearly double on innovation and much less on operations [6]. This relative lack of support for innovation and high percentages for running existing systems has held true for about a decade. Meanwhile, IT’s contribution to US productivity has been declining since 2001 [7].

What is the Cause for IT’s Ills?

Last year, PricewaterhouseCoopers published a major report with the provocative title, “Why Isn’t IT Spending Creating More Value?” [7]. The 42-page report covered many of the aspects above. Among other factors, the PWC authors speculated that:

As consumption of IT increases and as technologies change and advance, businesses have been left to cobble together disparate software and hardware systems and tools. The end result? Unchecked IT spending, unneeded complexity, redundant systems, underutilized hardware and data centers, the need for expensive IT security, and, inevitably, diminishing returns from IT. In short, low levels of IT productivity create conditions for an IT cost crisis. [7]

I suppose one could add to this litany other factors such as the growth and emergence of the Internet, sector consolidations through mergers and acquisitions, the rise of open source and alternatives such as SaaS, etc.

But which of these are causes? Which are symptoms? And which might only be consequences or coincident?

To be sure, all recognize the explosion of digital data and information, with sources and formats springing up faster than Whack-a-Mole. It is such an evident and ubiquitous phenomenon that pointing to it as a cause appears on the face of it quite obvious. Also obvious is that these new sources carry with them a diversity of systems and tools. While not categorically stated as such, it appears that PWC fingers the difficulties of “cobbling” these systems together as the root cause for low productivity and thus the IT cost crisis.

I agree totally that these are symptoms of what we see in IT’s current circumstance. I would even say these factors are a proximate cause to these ills. But I disagree they are the root cause. To discover that root, I believe, we must look deeper to mindset and assumptions.

Closed World Mindset as the Root Cause

There are some phenomena that are so obvious that they are easily missed. Not seeing your fingertip held six inches in front of your eyes is one of these. We aren’t used to focusing on things so near at hand.

So, let’s look for a moment at the closed world assumption (CWA), a key underpinning to most standard relational data systems and enterprise schema and logics. CWA is the logical assumption that what is not currently known to be true is false. If CWA is not directly familiar to you, that is understandable; it is an implied assumption of these systems and logics. As such, it is not often inspected directly and therefore not often questioned [8].

With regard to standard IT systems, the closed world assumption has two important aspects:

  1. The assumption is that the information domain at hand is complete [9], and
  2. The related negation as failure, which assumes every predicate to be false that cannot be proved to be true.
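The difference between these two aspects and an open world reading can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact base, the `Truth` enum, and both function names are hypothetical and are not drawn from any particular database or reasoner.

```python
from enum import Enum


class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# A tiny, made-up fact base of (subject, predicate, object) assertions.
FACTS = {
    ("alice", "works_in", "sales"),
    ("bob", "works_in", "engineering"),
}


def ask_closed_world(fact, facts=FACTS):
    """Negation as failure: whatever cannot be found in the facts is false."""
    return Truth.TRUE if fact in facts else Truth.FALSE


def ask_open_world(fact, facts=FACTS, known_false=frozenset()):
    """Open world: a missing fact is simply not known, unless explicitly denied."""
    if fact in facts:
        return Truth.TRUE
    if fact in known_false:
        return Truth.FALSE
    return Truth.UNKNOWN


question = ("carol", "works_in", "sales")
closed_answer = ask_closed_world(question)  # carol is absent, so the answer is FALSE
open_answer = ask_open_world(question)      # carol is absent, so the answer is UNKNOWN
```

Under the closed world reading, the fact base must be complete before the answer about carol can be trusted; under the open world reading, the same fact base can be used as-is and filled in later.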

On the face of them, these assumptions seem tame enough. And, indeed, there are some enterprise data systems that absolutely rely on them for efficient processing and completion times, such as most transaction systems. CWA is absolutely the appropriate design for such applications.

However, for knowledge management or representation applications — that is, applications which involve combining or using heterogeneous data or information from multiple data sources, which are exactly the same sources requiring information “cobbling” noted above by PWC — there are two very critical implications of the closed-world assumption (CWA):

  1. Efforts or projects cannot be undertaken incrementally; if done in pieces, each piece must be complete and consistent, which is expensive to scope and do, and
  2. To be consistent and explicit, the predicates (properties or relationships) must also be complex to model the “reality” of the system, which is also expensive to scope and do [10].

The net effect, which I have argued before, most notably in a major piece about the open world assumption [11], is that typical projects with a knowledge management aspect have become costly, take very long to complete, often fail, and require much planning and coordination. These facts have been true for three decades as enterprises have attempted to extract knowledge from their electronic information using closed world approaches based on relational systems. And, as recognized by PWC, these problems are only getting worse with growth in diversity and scope of systems.

The implications of closed world v. open world approaches are absolutely at the root of the causes leading to declining productivity, low innovation, significant failures and increasing costs — all exacerbated with more data and more systems — now characterizing traditional enterprise IT. Moreover, it is not a problem for open world systems to link to and incorporate closed world approaches. With open world, there is no need for Hobson’s choices. Unfortunately, such is not true when one begins with a closed world premise.

Incremental is Good: Pay as You Go

As best as I can tell, Alon Halevy was the first to use the phrase “pay as you go” in 2006 to describe the incremental aspect of the open world approach in relation to the semantic Web [12]. The “pay as you go” phrase had been applied earlier to data management and storage and had also been used to describe phone calling plans.

Incremental concepts and “agility” have been popular topics for the past five to ten years in IT, most often related to software development. And, while “incremental” sounds good in relation to enterprise projects, especially of a knowledge management or information integration/federation nature, the actual methodologies put forward were anything but incremental in their conceptual underpinnings.

Unfortunately, the “pay as you go” phrase has been (and still is) largely confined to incremental, open world approaches involving the semantic Web. How this approach might apply and benefit enterprises has yet to be articulated. Nonetheless, I like the phrase, and I think it evokes the right mindset. In fact, I think with linked data and many other aspects of the current semantic Web we are seeing such approaches come to fruition. Inch-by-inch, brick-by-brick, data on the Web is getting exposed and interlinked. “Pay as you go” is incremental, and that is good.

Purposeful is Better: Pay as You Benefit

Yet the idea of “pay as you benefit” is more purposeful, able to be planned and implemented, and founded on standard enterprise cost-benefit principles. I think it is a better (and more nuanced) expression of the “pay as you go” mindset in an enterprise setting. What it means is you can start small and be incomplete. You can target any domain or department or scope that is most useful and illustrative for your organization. You can deploy your first stand-ups as proofs-of-concept or sandboxes. And, you can build on each prior step with each subsequent one.

One of the reasons we (Structured Dynamics) embraced the MIKE2.0 methodology [13] was its inherent incremental character. (Government deployments often call them “spirals”.) In general, the five phases of MIKE2.0 can be represented as follows:

[Figure: the five phases of MIKE2.0]

It is specifically during the fifth phase, testing and improvement, that quantitative and qualitative benefits from the current increment are calculated and documented. This evolving methodology is where the enterprise can assess the results of its prior investment and scope and budget for the next one. These can be quick, rapid increments, or more involved ones, depending on the schedule, prior results and risk profile of the enterprise (or department) at that time.

Much is made of “incremental” or “agile” deployments within enterprises, but the nature of the traditional data system (and its closed world assumption) can act to undermine these laudable steps. The inherent nature of an open world approach, matched with methodologies and best practices, can work wonderfully with KM-related projects.

Quite Simply a Different Way to Do Business

We see in our current IT circumstances a number of embedded practices and assumptions. We have been assuming control and completeness, the closed world opposite of the open world approach. We have thus embraced and promoted “global” or enterprise-wide solutions: be they desktop operating systems or browsers or expensive enterprise-level proprietary software solutions. This scope leads to immense hurdle rates and risks: we had better get our choices right up front, because if we don’t, the department or enterprise is at risk. We have an inward focus on our own resources, our own networks, our own systems. Meanwhile, when we look outward, we wonder how all of these new Web companies can grow and expand so rapidly in comparison to us.

Clearly, we are seeing shifts to more services than products, more open source, more outsourcing, and more software as a service. Yet, because of the legacy of decades-long commitments from prior IT investment and the failures of many hyped “solutions” such as ERP or BI or data warehousing or a dozen others, we also see a decline and a reluctance for IT to embrace new and transforming approaches. Our prior choices were practically tantamount to “betting the enterprise.” What if our new approaches fail as so many of their predecessors did? In a demanding, competitive environment can we afford to make such wrong choices again with such immense implications?

Yet, now that information technology is a given, it only seems natural that its role becomes an integral part of the enterprise, and not a special function. Like procurement, IT has matured to become a support function. Businesses should not succeed or fail based on the types of pencils and paper stock they use; nor should they succeed or fail based on the software support choices that IT makes. Enterprises are now past the need to get “computerized”; they are thoroughly so. But our understanding of IT’s role and position has not evolved with its own success.

The first whiffs of these challenges to IT’s initial hegemony came from the departmental introduction of PCs and local networks in the early 1980s. It has continued with desktop software, spreadsheets and Web portals and sites. Large, mature companies awoke in horror in the last decade to discover they had hundreds — sometimes thousands — of Web sites and content dissemination points over which IT had little or no control. Such is the nature of entropy, and it is a fact for any organization of any size.

So, now, with strategies such as “pay as you benefit,” there is no longer an excuse not to innovate. There is not a justification to put off testing and discovering benefits that the open world and semantic approaches can bring to your organization. There is now a basis to make the case and set the affordable budgets within desirable timelines for becoming a semantic enterprise.

Mindsets and expectations do require some adjustment. For example, not everything will be known or modeled in early phases. But, is that also not true in any “real” real world? We’re not talking high-throughput transaction systems here, but beginning to pull together and link the information that is important to your organization strategically.

Remember the intro statement that “a little semantics goes a long way”? Well, that truth — and it is true — when combined with incremental deployment firmly tied to demonstrable results, promises quite simply a different way to do business. Never before have enterprises had working and winnable approaches such as this to test and innovate and learn and discover. Jump on in; the water is clear and warm.

And, oh, as to that mess in your closet or garage? Well, if you adhere to CWA, you will need to define a place for everything to go before you can start cleaning things up. I say: forget those false hurdles. If you really want to make a dent in the mess, grab a broom and start cleaning.

[1] Jim Hendler, “a little semantics goes a long way.” See http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html.
[2] All starting data is for the United States only and comes from the U.S. Bureau of Economic Analysis, U.S. Department of Commerce. The data tables were downloaded from the BEA Web site at http://www.bea.gov/national/nipaweb/SelectTable.asp. GDP data is from Section 1; enterprise private investment data from Section 5. For reasons as described in the text, all relative BEA numbers were re-adjusted from a 2005 baseline to 1997 based on absolute figures. Software figures and expenditures include packaged software, custom software and software developed in-house, but exclude software bundled or included within hardware.
[3] Data not shown; see the “Software Investment and Prices, by Type” data on the BEA Web page http://www.bea.gov/national/info_comm_tech.htm.
[4] Michael A. Cusumano, 2008. “The Changing Software Business: Moving from Products to Services,” Massachusetts Institute of Technology, in Computer, Vol 41 (1): 20-27, January 2008. See http://www.iae.univ-lille1.fr/SitesProjets/bmcommunity/Research/cusumano.pdf. This shift has occurred despite the recognition that potential gross margins from software packages can exceed 90% due to zero costs of reproduction. As Cusumano notes in a rule, “99 percent of zero is zero: The great profit opportunity from software products becomes theoretical and not practical” if not sold. Also, another interesting observation made by Cusumano is that, in the shift to services, vendors with both low percentages and high percentages of services, or what he calls the “sweet spots”, show higher contributions to profitability than vendors in the middle. He posits that low percentage vendors are getting mostly profitable maintenance fees, while those above 60% in services show profitability due to learning more replicable and systematic processes and approaches for service delivery.
[5] While we may occasionally see some vendors successfully buck this trend, I suspect these will only occur for established vendors with established platform advantages or for isolated applications where the innovating vendors have a significant first-mover advantage.
[6] Gartner calls the innovation category “transform”; see Gartner, Incorporated, 2009. “IT Software and Services, 2007-2010,” see http://www.slideshare.net/rsink/gartner-report-it-spending-2010. Also, see Jed Rubin and Howard Rubin, 2006. “Worldwide IT Benchmark Service New Trends & Findings for 2007: Strategic Performance Management and Measurement,” from Gartner Consulting Worldwide IT Benchmark Service; see http://www.gartner.com/teleconferences/attributes/attr_161183_115.pdf.
[7] PricewaterhouseCoopers, 2009. “Why Isn’t IT Spending Creating More Value?”, see http://www.pwc.com/en_US/us/increasing-it-effectiveness/assets/it_spending_creating_value.pdf.
[8] Though relational database systems did not begin with an understanding of CWA, but rather Edgar Codd’s 12 rules, the understandings of these were formulated later by Raymond Reiter. Reiter first described the basis of CWA in 1978, and then provided an axiomatization of relational databases and their deductive generalizations and basis in CWA in 1984; see http://prism.cs.umd.edu/papers/Min02:reiter_memoriam/memoriam-tplp.pdf.
[9] Relational database systems also assume unique names for objects, which, while not perhaps the best design for federated systems, can be overcome in other ways.
[10] For semantics-related projects there is a corollary problem to the use of CWA which is the need for upfront agreement on what all predicates “mean”, which is difficult if not impossible in reality when different perspectives are the explicit purpose for the integration.
[11] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009.
The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
[12] This was also the first instance (I believe) of Alon coining the “dataspace” term. First use of the “pay as you go” phrase was, Alon Halevy, Michael Franklin, and David Maier, 2006. “Principles of Dataspace Systems,” in Proceedings of ACM Symposium on Principles of Database Systems, pp: 1-9. See also the slides accompanying that talk, Alon Halevy, 2006. “Principles of Dataspace Systems (PODS),” June 26, 2006; see http://www.cs.washington.edu/homes/alon/files/pods06-keynote.ppt, 2006. More explicitly the next year see Jayant Madhavan, Shirley Cohen, Xin (Luna) Dong, Alon Y. Halevy, Shawn R. Jeffery, David Ko, and Cong Yu, 2007. “Web-scale Data Integration: You Can Afford to Pay as You Go.” in 3rd Conf. on Innovative Data Systems Research (CIDR), pp 342-350, see http://research.yahoo.com/files/paygo.pdf. The term has been picked up by many others, notably Rada Chirkova, Dongfeng Chen, Fereidoon Sadri and Timo J. Salo, 2007. “Pay-As-You-Go Information Integration: The Semantic Model Approach,” see ftp://ftp.csc.ncsu.edu/pub/tech/2007/TR-2007-30.pdf; and most recently papers by Gerhard Weikum on RDF-3X; see http://domino.mpi-inf.mpg.de/internet/reports.nsf/c125634c000710cec125613300585c64/70e8f906d8090e6bc125757f00448ec9!OpenDocument&ExpandSection=-1.
[13] See M.K. Bergman, 2010. “MIKE2.0: Open Source Information Development in the Enterprise,” AI3 Blog posting, February 23, 2010; and M.K. Bergman, 2010. “Open SEAS: A Framework to Transition to a Semantic Enterprise,” AI3 Blog posting, March 1, 2010.

Consolidating a Coherent Message with OSF

AI3:::Adaptive Information (Mike Bergman) - Tue, 07/06/2010 - 06:52

Release of Semantic Components Adds Final Layer, Leads to Streamlined Sites

Yesterday Fred Giasson announced the release of code associated with Structured Dynamics’ open source semantic components (also called sComponents). A semantic component is an ontology-driven component, or widget, based on Flex. Such a component takes record descriptions, ontologies and target attributes/types as inputs and then outputs some (possibly interactive) visualizations of the records.

Though not all layers are by any means complete, from an architectural standpoint the release of these semantic components provides the last and missing layer to complete our open semantic framework. Completing this layer now also enables Structured Dynamics to rationalize its open source Web sites and various groups and mailing lists associated with them.

The OSF “Semantic Muffin”

We first announced the open semantic framework — or OSF — a couple of weeks back. Refer to that original post for more description of the general design [1]. However, we can show this framework with the semantic components layer as illustrated by what some have called the “semantic muffin”:

[Figure: the OSF “semantic muffin” layer diagram, with the semantic components layer highlighted]

The OSF stack consists of these layers, moving from existing assets upward through increasing semantics and usability:

  • Existing assets — any and all existing information and data assets, ranging from unstructured to structured. Preserving and leveraging those assets is a key premise
  • scones / irON — this layer is for general conversion of non-RDF data and data schema to RDF (via irON or RDFizers) or for information extraction of subject concepts or named entities (scones)
  • structWSF — is the pivotal Web services framework layer, and provides the standard, common interface by which existing information assets get represented and presented to the outside world and to other layers in the OSF stack
  • Semantic components — the highlighted layer in the “semantic muffin”; in essence, this is the visualization and data interaction layer in the OSF stack; see more below
  • Ontologies — are the layer containing the structured assets “driving” the system; this includes the concepts and relationships of the domain at hand, and administrative ontologies that guide how the user interfaces or widgets in the system should behave, and
  • conStruct — is the content management system (CMS) layer based on Drupal and the thinnest layer with respect to OSF; this optional layer provides the theming, user rights and permissions, or other functionality drawn from Drupal’s 6500 third-party modules.

Not all of these layers are required in a given deployment and their adoption need not be sequential or absolutely depend on prior layers. Nonetheless, they do layer and interact with one another in the general manner shown.

The Semantic Components Layer

Current semantic components, or widgets, include: filter; tabular templates (similar to infoboxes); maps; bar, pie or linear charts; relationship (concept) browser; story and text annotator and viewer; workbench for creating structured views; and dashboard for presenting pre-defined views and component arrangements. These are generic tools that respond to the structures and data fed to them, adaptable to any domain without modification.

Though Fred’s post goes into more detail — with subsequent posts to get into the technical nuances of the semantic components — the main idea of these components is shown by the diagram below.

These various semantic components get embedded in a layout canvas for the Web page. By interacting with the various components, new queries are generated (most often as SPARQL queries) to the various structWSF Web services endpoints. The result of these requests is to generate a structured results set, which includes various types and attributes.

An internal ontology that embodies the desired behavior and display options (SCO, the Semantic Component Ontology) is matched with these types and attributes to generate the formal instructions to the semantic components. These instructions are presented via the sControl component, which determines which widgets (individual components, with multiples possible depending on the inputs) need to be invoked and displayed on the layout canvas. Here is a picture of the general workflow:

[Figure: general workflow of the semantic components]

New interactions with the resulting displays and components cause the iteration path to be generated anew, again starting a new cycle of queries and results sets. As these pathways and associated display components get created, they can be named and made persistent for later re-use or within dashboard invocations.
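The matching step in this workflow can be sketched roughly in Python. This is a simplified stand-in and not the actual sControl or SCO logic: the `DISPLAY_RULES` table, the `select_widgets` function, the attribute names and the sample records are all hypothetical, and a real deployment would read its display rules from the ontology rather than from a hard-coded dictionary.

```python
# Hypothetical display rules standing in for the Semantic Component Ontology (SCO):
# each widget lists the attributes a record must carry before the widget applies.
DISPLAY_RULES = {
    "map": {"lat", "long"},
    "bar_chart": {"population"},
    "tabular_template": {"label"},
}


def select_widgets(resultset, rules=DISPLAY_RULES):
    """Return, for each applicable widget, the records it is able to display."""
    selected = {}
    for record in resultset:
        attributes = set(record)
        for widget, required in rules.items():
            # A widget applies when every attribute it requires is present.
            if required <= attributes:
                selected.setdefault(widget, []).append(record)
    return selected


# Made-up results set, as might come back from a query to a Web service endpoint.
resultset = [
    {"type": "City", "label": "Sample City", "lat": 10.0, "long": 20.0},
    {"type": "Neighborhood", "label": "Sample District", "population": 2500},
]
widgets = select_widgets(resultset)
```

The point of the design is that the widgets themselves stay generic: changing what gets displayed means changing the rules and the data, not the component code.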

Consolidating and Rationalizing Web Sites and Mailing Lists

As the release of the semantic components drew near, it was apparent that releases of previous layers had led to some fragmentation of Web sites and mailing lists. The umbrella nature of the open semantic framework enabled us to consolidate and rationalize these resources.

Our first change was to consolidate all OSF-related material under the existing OpenStructs.org Web site. It already contained the links and background material to structWSF and irON. To that, we added the conStruct and OSF material. This consolidation also allowed us to retire the previous conStruct Web site, which now re-directs to OpenStructs.

We also had fragmentation in user groups and mailing lists. Besides shared materials, these had many shared members. The Google groups for irON, structWSF and conStruct were thus archived and re-directed to the new Open Semantic Framework Google group and mailing list. Personal notices of the change and invites have been issued to all members of the earlier groups. For those interested in development work and interchange with other developers on any of these OSF layers, please now direct your membership and attention to the OSF group.

There has also been a reinvigoration of the developers’ community Web site at http://community.openstructs.org/. It remains the location for all central developer resources, including bug and issue tracking and links to SVNs.

Actual code SVN repositories are unchanged. These code repositories may be found at:

We hope you find these consolidations helpful. And, of course, we welcome new participants and contributors!

[1] An alternative view of this layer diagram is shown by the general Structured Dynamics product stack and architecture.

A Personal Thanks

AI3:::Adaptive Information (Mike Bergman) - Fri, 07/02/2010 - 15:44
As of July 1, Daily Readership Passed 3000

As of yesterday, the readership on this AI3 blog passed 3000 daily for the first time. It has been steadily inching upward, and finally passed that minor milestone. Thank you!

I’ve been writing this blog for five years now, with some 400 total posts, or about 1.5 blog posts per week. I know my style is toward longer articles and less frequent posting, most often of a fairly detailed or technical nature. And, while I have a Twitter account, I do not bleat. My style is for more meaty discussions. Perhaps it betrays my age.

The real growth in this blog, however, has come about with my conscious attempt to write for the enterprise audience. RDF, the semantic enterprise, linked data and ontologies need a bridge from the technical community to the one of practitioners. Much progress and uptake has been occurring with these business and government audiences.

At the recent SemTech meeting, I was taken aside by many individuals noting my blog posts and thanking me for the thought and effort behind them. Thank you for noticing, and reading, and you are welcome. We need more translation of semantic topics and technologies to pragmatic terms.

If you have been following the standard W3C and SemWeb mailing lists recently, you will have noticed an anxiety and a continuation of the fractious nature of this “community”. In part this comes about because there are efforts afoot to revisit the RDF specs. But, mostly, I think, it is the ongoing nature of many in this group to snatch defeat from the jaws of victory. The search by some for perfection and insistence on parochial needs and preferences can give a pettiness to this “community” that is unbecoming.

Many of us have abandoned those forums for those reasons. As for myself, I will continue to evangelize to the buying market and keep the gaze pointing outward. There is a wealth of need for tools, techniques, methods, documentation, structures, and narratives. Thanks to all of you, the readership of this blog, for continuing to affirm this value.

So, in the great scheme of things, the readership of this blog is quite small in comparison with the big boys. On the other hand, very few individuals have higher numbers, and all of this for a fairly esoteric area. I think this proves there is a market and a need out there for semantic solutions.

Thanks again! And, for those in the United States, have a most enjoyable 4th of July holiday!

Domain-specific Instantiations Based on the Open Semantic Framework

AI3:::Adaptive Information (Mike Bergman) - Thu, 06/17/2010 - 16:56

Structured Dynamics Completes Design Phase; Citizen Dan First Exemplar

Structured Dynamics has been in a fervent — and, we believe, fruitful — design phase for the past 18 months. All of the working parts related to how to embrace becoming a semantic enterprise have now been defined and designed. Actual tools and components accompany many of these parts and have been deployed.

Recently, I have been speaking and blogging much about rationale, process, mindset and approach for how to bring semantics into the organization. But, prior to now, we have not spoken much about the overall design behind our approach. Today, as we complete our design phase and introduce our first exemplar instance of it — Citizen Dan [1] — we are finally in a position to describe this overall approach.

We term our approach the open semantic framework, or OSF. The open semantic framework is a combination of a layered architecture and modular software. It represents the software component of the four-component total open solution, recently described in a three-part series. I return to this topic in the conclusion of this post.

Revisiting Design Objectives

Over the past nine months, I have been focusing my writing largely on the semantic enterprise, with more specificity regarding our Open SEAS (Semantic Enterprise Adoption and Solutions) initiative. In bits and pieces, these writings have tended to reflect a number of objectives:

  • Leverage existing information assets (data + structure) as much as possible
  • Develop incrementally, and validate and justify as you go
  • Emphasize, where possible, open standards and open software
  • Employ Web-oriented architectures
  • Adopt an open-world approach that acknowledges that information is most often incomplete; the approach is a key enabler for incremental deployments
  • Use URIs as object identifiers, and use linked data where practical
  • Embrace any data format found in the wild, but use RDF as the ultimate integration data model
  • Design architectures and APIs that avoid “lock-in” and support multiple tools options across the stack
  • Provide systems and capabilities that put all information sources — text, media, semi-structured and conventional databases — on an equal footing
  • Promote designs that bring the ability to create useful results into the hands of users and decisionmakers; relegate IT to a support role.

To date, the result of these design objectives is perhaps best captured in my Seven Pillars of the Open Semantic Enterprise posting, as well as our general discussions regarding adaptive ontologies. Yet, still, these writings have been somewhat piecemeal. What this document attempts to do is to place all of these perspectives into a single, coherent whole.

The Incremental Layers of the Open Semantic Framework

Structured Dynamics has been a strong advocate for layered architectures, with clear APIs between layers as appropriate. But these layers are not “laminates” that completely cover the layer below, nor are they all needed or necessary. Depending on the circumstance, some layers are unneeded or superfluous. Layers may be added incrementally, or not at all.

In this manner, then, the open semantic framework is perhaps more akin to a pearl, than to a laminate or cocoon. Each subsequent layer does not “embed” the layer prior to it, and some layers actually may inter-operate with multiple layers below or above it (this is notably true for the “ontologies” layer, which has interactions up and down the stack).

Nonetheless, we can envision this pearl of the open semantic framework and its layers as follows:

[Figure: the layers of the open semantic framework]

Others have termed this the “semantic muffin” or even “semantic muppet” or “semantic blob”. Whatever (hehe). The real idea is that layers may accrete (as in the growth of a pearl) and occur over time and be uneven. Each layer, though, does have a role to play (though it may not be needed in a given deployment), and does act to augment existing information assets in the transition to a semantic framework. Beginning at the core, each of these layers — with external references as appropriate for more details — is described below.

Existing Assets Layer

The open semantic framework is premised on leveraging existing information assets. Sure, once the framework is in place, new information can be brought into it in a more direct, semantic manner. But, the real thrust and benefit of this framework is to provide an incremental pathway for finally inter-operating and federating prior decades of data, structure and information assets.

These information assets may reside inside or outside the enterprise. They may (and DO!) exist in many formats and are described by many schema. They may come from internal transaction systems or warehouses, or may exist externally on the Web or at supplier or partner sites. These information assets may span from conventional databases and relational data systems to XML interchange standards, Web pages and standard internal text or documents. In short, there is NO information asset that is not amenable to inclusion in this framework.

The Information Transformation (scones/irON) Layer

The information transformation layer provides either: 1) extraction of concepts and entities as structured metadata from source text or documents; or 2) conversion of existing data assets to interoperable form. As implemented by Structured Dynamics, the extractions are conducted by either scones (Subject Concept or Named EntitieS) or third-party utilities, and the conversions occur via irON (instance record Object Notation) or third-party “RDFizers”.

Depending on the source, the net result of the transformation is to produce interoperable data and information that can be ingested and used by other layers in the framework.

Though not strictly analogous, this layer bears some resemblance to the ETL (extract, transform, load) utilities used in many enterprise information integration applications. Unlike those conventional systems, this information transformation layer also may capture and represent some of the source schema.

In all cases, however, these transformations are relatively simple and get parsed against the available structure (the ontologies, schema and entity reference lists) in the system to generate the semantic metadata (tags).

At this point, the extracted structure is generally at the level of instance records, or the ABox, with simple assertions of attribute-value pairs for specific records [2]. Little schema transformation or mapping occurs at this layer (if such is needed, that occurs at the structWSF layer; see next). Actual federation or interoperation occurs at later layers based on the TBox structures [2].
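As a rough illustration of what "simple assertions of attribute-value pairs" and tagging against an entity reference list might look like, here is a minimal Python sketch. It is not scones or irON code: the function names, the example.org base URI and the sample record are all made up for illustration, and the tagging is a naive substring match rather than a real extractor.

```python
def record_to_assertions(record_id, record, base_uri="http://example.org/record/"):
    """Flatten one source record into (subject, attribute, value) assertions."""
    subject = base_uri + str(record_id)  # hypothetical URI pattern
    assertions = []
    for attribute, value in record.items():
        if value is None or value == "":
            continue  # skip empty cells rather than asserting an empty value
        assertions.append((subject, attribute, value))
    return assertions


def tag_text(text, entity_reference_list):
    """Return the reference-list entities mentioned in the text (naive match)."""
    lowered = text.lower()
    return [entity for entity in entity_reference_list if entity.lower() in lowered]


# Made-up sample data, for illustration only
record = {"name": "Springfield", "population": 1000, "mayor": ""}
assertions = record_to_assertions("springfield", record)
tags = tag_text("The Springfield council met on Tuesday.", ["Springfield", "Shelbyville"])
```

Note that nothing here maps or reconciles schema; the record's own attribute names pass straight through, which is the point of keeping this layer at the instance level.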

This modular portion of the framework is explicitly designed with APIs to allow third-party tools to be plugged in and substituted.

The structWSF Layer

The major workhorse of the open semantic framework is the structWSF (Web services framework) layer. structWSF is the most complicated of the OSF layers and has many supporting software packages and capabilities. The structWSF layer provides the standard, common interface (“canonical”) layer by which existing information assets get represented and presented to the outside world and to other layers in the OSF stack.

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies; see below).

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The current structWSF framework comes packaged with a baseline set of about twenty Web services covering CRUD, browse, search, and export and import. All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML. An internal representation, structXML [3], is used for internal communications across all structWSF Web services and with other layers.

structWSF has a central service that governs access rights and permissions. These rights occur at the level of the dataset, which gives immense flexibility to how data may be accessed, read, modified, created or deleted (or not). Datasets within a given structWSF instance may be accessed directly via API or via SPARQL queries to the instance’s endpoint. Depending on rights and query, results sets may be returned from a given structWSF instance in an infinite variety of ways.

This latter capability is the essential interface for subsequent layers in the open semantic framework stack. Depending on those subsequent components, pre-staged data and results sets may be returned for an essentially limitless variety of purposes.
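The idea of rights held at the level of the dataset can be sketched in a few lines of Python. This is only a toy model under our own assumptions, not the structWSF access service: the class name, the group names and the dataset URI are hypothetical.

```python
from dataclasses import dataclass, field

CRUD_ACTIONS = ("create", "read", "update", "delete")


@dataclass
class DatasetAccess:
    """Toy model: permissions are recorded per dataset, not per record."""
    dataset: str
    # maps a user or group name to the set of actions it may perform
    grants: dict = field(default_factory=dict)

    def allow(self, who, *actions):
        unknown = set(actions) - set(CRUD_ACTIONS)
        if unknown:
            raise ValueError(f"unknown actions: {sorted(unknown)}")
        self.grants.setdefault(who, set()).update(actions)

    def can(self, who, action):
        return action in self.grants.get(who, set())


# Hypothetical dataset URI and groups
access = DatasetAccess("http://example.org/datasets/indicators/")
access.allow("public", "read")
access.allow("editors", "create", "read", "update")
```

With this shape, a query against an instance would only ever touch the datasets for which `can(who, "read")` holds, which is one way to picture how rights and query together determine the results set.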

Each structWSF instance also has a unique Web address that enables one or a multitude of instances to communicate and share with one another. This simple, but elegant, method enables structWSF instances to participate or not in potentially global or restricted local networks and collaboration environments. This is currently the largest untapped potential of structWSF with respect to its existing deployments.

The Semantic Components Layer

The newest layer in the stack is the semantic components layer. This layer takes results sets (most often generated by a specific query or data slice request) from one or more structWSF instances and then presents that information via a variety of data visualization or data presentation widgets (what we specifically call ‘semantic components’ due to their design [4]). The operation and sensitivity of these display components are themselves driven by a presentation and data analysis (including statistics) ontology.

Current display widgets include: filter; tabular templates (similar to infoboxes); maps; bar, pie or linear charts; relationship (concept) browser; story and text annotator and viewer; workbench for creating structured views; and dashboard for presenting pre-defined views and component arrangements. These are generic tools that respond to the structures and data fed to them, adaptable without modification to any domain.

As presently implemented by Structured Dynamics, this layer consists either of Flex data visualization components or structured data display templates based on Smarty. The inherent design allows for updates to other bases (such as HTML5). The layer may also be swapped out or substituted with third-party capabilities.

The strength and power of this system is governed by its own ontology, the Semantic Component Ontology (SCO) (see next).

This is an extremely flexible layer in the open semantic framework stack. Expect an ongoing series of explanatory blog posts and online resources in the upcoming weeks to explain this innovative capability.

The Ontologies Layer

The ontologies layer actually refers to all structured assets driving the system. As such, this layer might be considered the “brain” (though rather simply specified!) of the open semantic framework.

At a true schema or TBox level [2], the ontologies layer represents the concepts and relationships of the domain at hand. This layer also hosts the specific local entities and prominent things (people, places, events, etc.) useful for extracting local and domain-specific relevance. However, those views are also supplemented with some administrative ontologies (two examples are SCO and irON) that guide how the user interfaces or widgets in the system should behave.

The concept level represents the “world view” of the specific instantiation of the open semantic framework at hand. This conceptual (TBox) view provides the structural organization of information, inferencing capabilities, and navigation, faceting and explorer structure. The entity (ABox) view provides tagging for prominent individuals and instances important to the domain at hand, and guides the structure behind data visualizations of attribute or indicator data.

The administrative level uses simple roles and relationships for attributes and indicators to inform the framework as to how and with what widget to display information. For example, a “type” of information that is geographically related can be instructed to use the map component as an option for display. Whether some information is used for totals, comparison purposes, or other specifications useful to data visualization and graphing may also be specified.

The language and relationships (predicates or properties) of these administrative ontologies are simple and straightforward. It is, for example, relatively easy to define data display functions at the broad dataset and attributes level. Simple determinations drive how results sets and their associated results types may be displayed, no matter what datasets or slices may be generated as a result of the queries or requests fed to the system.
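To make the administrative level a little more concrete, the following Python sketch shows the general idea of simple determinations driving display: each attribute is given a kind, and each kind names the components that may show it. The rule table and attribute names are invented for this example and are not taken from the SCO itself.

```python
# Hypothetical administrative rules: attribute kind -> usable display components
DISPLAY_RULES = {
    "geographic": ["map", "table"],
    "numeric": ["bar chart", "pie chart", "table"],
    "text": ["text viewer"],
}


def components_for(attributes):
    """Return the display components usable for a results set.

    `attributes` maps attribute names to their kinds; unknown kinds
    fall back to a plain table.
    """
    chosen = []
    for kind in attributes.values():
        for component in DISPLAY_RULES.get(kind, ["table"]):
            if component not in chosen:
                chosen.append(component)
    return chosen


# A results set with one geographic and one numeric attribute
results_attributes = {"location": "geographic", "median_income": "numeric"}
options = components_for(results_attributes)
```

Because the rules live in data rather than in the widgets, swapping the rule table is enough to change how a new domain's results are displayed.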

The structure in these layers can be replaced by other structures for other instantiations and circumstances. Indeed, all other layers in the open semantic framework can remain relatively fixed while tailoring the instance to new domains solely via this layer. The ontologies layer is what gives any given instantiation of OSF — such as Citizen Dan — its unique focus and scope.

The Content Management System (conStruct) Layer

The thinnest layer (that is, least substantial with respect to this framework) is the content management system (CMS) layer. In its current form, the open semantic framework uses the Drupal CMS via our conStruct plug-in modules. The design of the framework, however, has explicitly accommodated the possibility that other CMSs may substitute for this role.

The CMS layer is optional if structWSF endpoints are sufficient or if simple Web pages hosting semantic components are deemed as adequate. Very small organizations or deployments may reasonably choose to have no CMS layer at all.

However, for most sites or portals with more than a few active users, it is desirable to have broad flexibility in theming (“skinning”), user rights and permissions, or other functionality. These are the roles of the CMS layer. Drupal, for example, is presently supported by more than 4500 third-party modules in every conceivable function, from polling to blogs and rating systems and bulletin boards.

For such generalized portals or collaboration environments, it makes sense to adopt and install a flexible CMS system, such as Drupal. Much of the user experience and functional environment can be provided through such means.

The open semantic framework is thus designed to reside easily in a CMS while also providing the hooks to take advantage of the generalized user rights and functionality of the CMS. In this manner, the open semantic framework is able to stay focused on its structured data and interoperability purposes, while still gaining the advantages of rich-featured content management systems.

The OSF is a Web-oriented Architecture

With its inherent open-world orientation [5] and distributed and collaborative potential, the open semantic framework was designed from the outset to be Web-capable and Web-oriented:

(click for full size)

A Web-oriented architecture (WOA) has a number of understood requirements, to which the open semantic framework adheres. Specifically, these design considerations support the framework as being part of WOA:

  • Data and objects are all identified with Web addresses (URIs)
  • Data is generally exposed (and universally available) as linked data
  • SPARQL endpoints and APIs are generally RESTful in design
  • The overall architecture is modular, with inherent decentralized and distributed aspects
  • All display and visualization aspects are cross-browser ready and capable.

OSF is the Basis for Domain-specific Instantiations

Citizen Dan is our first exemplar instance of this open semantic framework. The details page for the project goes into some of Citizen Dan’s functionality and capabilities.

Citizen Dan is specifically geared to local governments and localities, with an emphasis on community indicator systems (CIS). CIS have become a popular way of measuring and tracking measures of local economic and social well-being; they are closely related to sustainability and how to measure it as used in many economic and environmental domains.

However, in the context of this post, what is really interesting about Citizen Dan is that its semantic framework is a completely open and generic one. The same set of tools and capabilities described on its details page can be applied to any domain that needs to manage and understand its own information. This includes everything from unstructured text or documents to conventional structured databases.

What changes from domain to domain are the data structures (the ontologies, schema and entity reference lists; see above) that are fed to this open semantic framework. By swapping in new structures, what can be called Citizen Dan in one instance can morph to become Curriculum Carla in, say, the education instance or Doctor Doolittle in the veterinary science instance [6].

We can illustrate these multiple instances as follows:

(click for full size)

What this figure illustrates is that even a branded expression of the framework, such as Citizen Dan, is merely an instance of that framework. And, actually, when expressed in such a packaged manner, we can more accurately call the standard and bundled suite of generic functions and accompanying structure of Citizen Dan an instantiation of the open semantic framework:

in·stan·ti·ate \in-ˈstan(t)-shē-āt\ (transitive verb) is to:

  1. (transitive) to represent an abstract concept by a concrete instance
  2. (transitive, object orientated computing) to create an object (an instance) of a specific class

in·stan·ti·a·tion \in-stan(t)-shē-ā-shən\ (noun) [7]

By replacing the structure bases, and by tailoring the function suite appropriate to a given market and use, we can create many instantiations of the open semantic framework for different domains and markets. In this manner, Citizen Dan can be seen as an early exemplar of the framework, but not as a definer and limiter to it.

OSF is the Software Leg to a ‘Total Open Solution’

So far, this discussion has focused solely on considerations of software and architecture. While we see the open semantic framework as powerful and highly useful in itself, it alone is inadequate to achieve acceptance and success in the enterprise (as we noted in our most recent posts). The very forces that are compelling enterprises to look at new options are also the ones that pose difficult hurdle rates for acceptance of open source.

To address this issue, we have developed a four-legged foundation to what we have termed the total open solution. The solution involves software, structure, documentation and methods (or best practices). Each of these connects and relates to the other foundations.

The open semantic framework is clearly the software (and architecture) leg to this foundation. Again, however, what is interesting is that the mere swapping out of the structure can also make the system relatively ready for other domains.

We see these relationships in the following diagram, which also shows that the DocWiki portions of the solution embody the documentation (aside from code-level comments) and methods legs of the foundation:

(click for full size)

Differences between domains may also lead to differences as to which components are included or not in that domain’s desired instantiation.

The hugely important implied point, however, from the diagram above, is to show how nearly universal the content and methods in the DocWiki may be to other domains. Because the deltas between domains largely result from structure and what specific functional components are included or not, it becomes clear that most documentation and practices shared with the DocWiki will be applicable across domains. Sure, the use cases and some of the specific terminology may change, but we can also now see a high degree of re-usability of documentation and knowledge base across markets. This realization makes the usefulness and leverage of the DocWiki even higher.

A Common Language and Framework for Moving Forward

Developing “common language” by which to describe and convey things — especially new things like semantics that also have strong technical aspects — is tough, very tough. We are only now beginning on this process; we look to many in the community and elsewhere to help define informative and evocative terminology.

Per the original design objectives above, Structured Dynamics has approached the challenge of the semantic enterprise in what we think is both a pragmatic and a new way. The insistence on preserving and respecting existing information assets, matched with the opportunities and different mindsets arising from an open-world approach [5], has necessitated thinking through new designs and developing new concepts. Any time such new thinking and concepts occur, new language and new metaphors must accompany them.

While certainly there are components and various software packages that populate and comprise an open semantic framework, the framework is also just as importantly a world view or way to think about information, information development, and its architecture. For example, a pivotal concept is that an open semantic framework is built around generic tools responsive to the information structures fed to them. This realization shifts the locus of emphasis from software development per se to creating, managing and adapting data and information structures. While this democratizes the information development process and is more inclusive of all knowledge workers, it also imposes needs for new toolsets and business processes. We are only at the nascent stages of understanding and learning about these differences.

Similarly, a development approach that is inherently incremental and leverages (rather than replaces or displaces) existing information assets means IT projects need to be considered in a new light. Small projects with more emphasis on tangible and demonstrable benefits will alter budgets, lower risks, and place a need for quicker turnaround. Like the architecture of the open semantic framework itself, projects based on OSF are also more distributed, decentralized and modular.

With such decentralization also comes the need for mechanisms and systems to overcome vendor “lock-in” and proprietary systems. A key thrust in support of what we have called the total open solution and its mixture of documentation and methods to accompany software and structure is specifically targeted at this issue. Tools and means for collaboration and concurrent contributions are another possible answer. Prior software practices in agile development and version control will see extensions to all manner of information development across the enterprise.

We are proud of our design work and proof-testing with clients over the past 18 months. We believe the open semantic framework and its implications to be a fundamental shift in how organizations need to think about their information development, existing information assets, and IT budgets and processes. We know widescale adoption is not yet at hand — enterprises are justifiably conservative when it comes to new thinking. But, given global competition and tight pocketbooks, the open semantic framework is a formulation to which enterprises and governments should pay very close attention.

[1] Citizen Dan is an open source system for aggregating different indicator data concerning local, community well-being. Information sources may include the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics. See current details with screenshots.

[2] Structured Dynamics’ best practices approach makes explicit splits between the “ABox” (for instance data) and “TBox” (for ontology schema) in accordance with our working definition for description logics, a fundamental underpinning for how we use RDF:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”

[3] A subsequent post will document this rather straightforward XML schema.

[4] Contact Structured Dynamics for an early sneak peek. The Citizen Dan application will be publicly released as an online sandbox and demo by the end of summer 2010.

[5] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.

[6] Of course, things are not always so simple as this. The CMS layer gives the open semantic framework the ready ability to change themes and layouts (“skins”), not to mention the breadth and specifics of what ancillary site functionality might be provided. Moreover, the module basis of the open semantic framework also means that entire clusters of functionality might be dropped from a given instantiation (or added to it!) without violating or negating this framework.

[7] Dictionary references are from Merriam-Webster and Wiktionary.

Why Carry the Cost of Linked Data?

Tom Heath's Displacement Activities - Wed, 06/16/2010 - 12:33

In his ongoing series of niggles about Linked Data, Rob McKinnon claims that “mandating RDF [for publication of government data] may be premature and costly”. The claim is made in reference to Francis Maude’s parliamentary answer to a question from Tom Watson. Personally I see nothing in the statement from Francis Maude that implies the mandating of RDF or Linked Data, only that “Where possible we will use recognised open standards including Linked Data standards”. Note the “where possible”. However, that’s not the point of this post.

There’s nothing premature about publishing government data as Linked Data – it’s happening on a large scale in the UK, US and elsewhere. Where I do agree with Rob (perhaps for the first time) is that it comes at a cost. However, this isn’t the interesting question, as the same applies to any investment in a nation’s infrastructure. The interesting questions are who bears that cost, and who benefits?

Let’s make a direct comparison between publishing a data set in raw CSV format (probably exported from a database or spreadsheet) and making the extra effort to publish it in RDF according to the Linked Data principles.

Assuming that your spreadsheet doesn’t contain formulas or merged cells that would make the data irregularly shaped, or that you can create a nice database view that denormalises your relational database tables into one, then the cost of publishing data in CSV basically amounts to running the appropriate export of the data and hosting the static file somewhere on the Web. Dead cheap, right?

Oh wait, you’ll need to write some documentation explaining what each of the columns in the CSV file means, and what types of data people should expect to find in each of these. You’ll also need to create and maintain some kind of directory so people can discover your data in the crazy haystack that is the Web. Not quite so cheap after all.

So what are the comparable processes and costs in the RDF and Linked Data scenario? One option is to use a tool like D2R Server to expose data from your relational database to the Web as RDF, but let’s stick with the CSV example to demonstrate the lo-fi approach.

This is not the place to reproduce an entire guide to publishing Linked Data, but in a nutshell, you’ll need to decide on the format of the URIs you’ll assign to the things described in your data set, select one or more RDF schemata with which to describe your data (analogous to defining what the columns in your CSV file mean and how their contents relate to each other), and then write some code to convert the data in your CSV file to RDF, according to your URI format and the chosen schemata. Last of all, for it to be proper Linked Data, you’ll need to find a related Linked Data set on the Web and create some RDF that links (some of) the things in your data set to things in the other. Just as with conventional Web sites, if people find your data useful or interesting they’ll create some RDF that links the things in their data to the things in yours, gradually creating an unbounded Web of data.
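To give a feel for the lo-fi approach, here is a minimal Python sketch of those steps using only the standard library. It is an illustration under stated assumptions rather than a recommended converter: the example.org URI pattern and vocabulary namespace are placeholders, the column names are assumed to be safe to use in URIs after percent-encoding, every value is emitted as a plain string literal, and the outgoing link uses owl:sameAs simply as one common choice of linking property. The output is N-Triples.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/id/school/"   # hypothetical URI pattern for things
VOCAB = "http://example.org/schema/"     # hypothetical vocabulary namespace
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text, key_column, links=None):
    """Convert CSV rows to N-Triples, one subject URI per row.

    `links` optionally maps a row key to the URI of the same thing
    in another data set.
    """
    links = links or {}
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column]
        subject = f"<{BASE}{quote(key, safe='')}>"
        for column, value in row.items():
            if column is None or column == key_column or not value:
                continue  # skip the key itself, overflow cells and empty cells
            predicate = f"<{VOCAB}{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
        if key in links:
            lines.append(f"{subject} <{OWL_SAME_AS}> <{links[key]}> .")
    return "\n".join(lines)


# Made-up sample data and a made-up link target
sample = "id,name,pupils\n101,Hilltop Primary,240\n"
ntriples = csv_to_ntriples(
    sample, "id", {"101": "http://example.com/other-dataset/school/101"}
)
print(ntriples)
```

A real conversion would normally reuse existing schemata for the predicates and type the numeric values, but even this much shows where the one-off effort goes: choosing the URIs, choosing the vocabulary, and adding the links.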

Clearly these extra steps come at a cost compared to publishing raw CSV files. So why bear these costs?

There are two main reasons: discoverability and reusability.

Anyone (deliberately) publishing data on the Web presumably does so because they want other people to be able to find and reuse that data. The beauty of Linked Data is that discoverability is baked in to the combination of RDF and the Linked Data principles. Incoming links to an RDF data set put that data set “into the Web” and outgoing links increase the interconnectivity further.

Yes, you can create an HTML link to a CSV file, but you can’t link to specific things described in the data or say how they relate to each other. Linked Data enables this. Yes, you can publish some documentation alongside a CSV file explaining what each of the columns means, but that description can’t be interlinked with the data itself, making it self-describing. Linked Data does this. Yes, you can include URIs in the data itself, but CSV provides no mechanism for indicating that the content of a particular cell is a link to be followed. Linked Data does this. Yes, you can create directories or catalogues that describe the data sets available from a particular publisher, but this doesn’t scale to the Web. Remember what the arrival of Google did to the Yahoo! directory? What we need is a mechanism that supports arbitrary discovery of data sets by bots roaming the Web and building searchable indices of the data they find. Linked Data enables this.

Assuming that a particular data set has been discovered, what is the cost of any one party using that data in a new application? Perhaps this application only needs one data set, in which case all the developer must do is read the documentation to understand the structure of the data and get on with writing code. A much more likely scenario is that the application requires integration of two or more data sets. If each of these data sets is just a CSV file then every application developer must incur the cost of integrating them, i.e. linking together the elements common to both data sets, and must do this for every new data set they want to use in their application. In this scenario the integration cost of using these data sets is proportional to their use. There are no economies of scale. It always costs the same amount, to every consumer.

Not so with Linked Data, which enables the data publisher to identify links between their data and third party data sets, and make these links available to every consumer of that data set by publishing them as RDF along with the data itself. Yes, there is a one-off cost to the publisher in creating the links that are most likely to be useful to data consumers, but that’s a one-off. It doesn’t increase every time a developer uses the data set, and each developer doesn’t have to pay that cost for each data set they use.
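The economics can be put in back-of-the-envelope form. The Python sketch below is a deliberate simplification: it assumes that creating one set of links costs the same no matter who does the work, and the figures in the usage lines are invented purely to show the shape of the comparison.

```python
def total_linking_cost(consumers, cost_per_link_set, links_published):
    """Total cost, across everyone, of linking one pair of data sets.

    If the publisher ships the links as RDF, the cost is paid once.
    Otherwise every consumer repeats the same integration work.
    """
    if links_published:
        return cost_per_link_set
    return consumers * cost_per_link_set


# Invented figures: 50 developers, 10 units of effort per set of links
csv_total = total_linking_cost(50, 10, links_published=False)
linked_data_total = total_linking_cost(50, 10, links_published=True)
```

Under these assumptions the CSV total grows linearly with the number of consumers, while the Linked Data total stays flat, which is the economy of scale the argument rests on.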

If data publishers are seriously interested in promoting the use of their data then this is a cost worth bearing. Why constantly reinvent the wheel by creating new sets of links for every application that uses a certain combination of data sets? Certainly as a UK taxpayer, I would rather the UK Government made this one-off investment in publishing and linking RDF data, thereby lowering the cost for everyone that wanted to use them. This is the way to build a vibrant economy around open data.


Brown Bag Lunch: Structure Paves the Way to the Semantic Web

AI3:::Adaptive Information (Mike Bergman) - Fri, 06/11/2010 - 05:55
How Shall We Measure Progress Over the Past Three Years?


For a dozen years, my career has been centered on Internet search, dynamic content and the deep Web. For the past few years, I have been somewhat obsessed by two topics.

The first topic, a conviction really, is that implicit structure needs to be extracted from Web content to enable it to be disambiguated, organized, shared and re-purposed. The second topic, more an open question as a former academic married to a professor, is what might replace editorial selections and peer review to establish the authoritativeness of content. These topics naturally steer one to the semantic Web.

A Millennial Perspective

The semantic Web, by whatever name it comes to be called, is an inevitability. History tells us that as information content grows, so do the mechanisms for organizing and managing it. Over human history, innovations such as writing systems, alphabetization, pagination, tables of contents, indexes, concordances, reference look-ups, classification systems, tables, figures, and statistics have emerged in parallel with content growth [19].

When the Lycos search engine, one of the first profitable Internet ventures, was publicly released in 1994, it indexed a mere 54,000 pages [1]. When Google wowed us with its page-ranking algorithm in 1998, it soon replaced my then favorite search engine, AltaVista. Now, tens of billions of indexed documents later, I often find Google’s results to be overwhelming dross — unfortunately true again for all of the major search engines. Faceted browsing, vertical search, and Web 2.0’s tagging and folksonomies demonstrate humanity’s natural penchant to fight this entropy, efforts that will next continue with the semantic Web and then mechanisms unforeseen to manage the chaos of burgeoning content.

An awful lot of hot air has been expelled over the false dichotomy of whether the semantic Web will fail or is on the verge of nirvana. Arguments extend from the epistemological versus ontological (classically defined) to Web 3.0 versus SemWeb or Web services (WS*) versus REST (Representational State Transfer). My RSS feed reader points to at least one such dust up every week.

Some set the difficulties of resolving semantic heterogeneities as absolutes, leading to an illogical and false rejection of semantic Web objectives. In contrast, some advocates set equally divisive arguments for semantic Web purity by insisting on formal ontologies and description logics. Meanwhile, studied leaks about “stealth” semantic Web ventures mean you should grab your wallet while simultaneously shaking your head.

A Decades-Long Perspective

My mental image of the semantic Web is a road from here to some achievable destination — say, Detroit. Parts of the road are well paved; indeed, portions are already superhighways with controlled on-ramps and off-ramps. Other portions are two lanes, some with way too many traffic lights and some with dangerous intersections. A few small portions remain unpaved gravel and rough going.

Wreck in Nebraska during the 1919 Transcontinental Motor Convoy

A lack of perspective makes things appear either too close or too far away. The automobile isn’t yet a century old as a mass-produced item. It wasn’t until 1919 that the US Army sent its first Transcontinental Motor Convoy across the United States.

The 3,200 mile route roughly followed today’s Lincoln Highway, US 30, from Washington, D.C. to San Francisco. The convoy took 62 days and 250 recorded accidents to complete the trip (see figure), half on dirt roads at an average speed of 6 miles per hour. A tank officer on that trip later observed Germany’s autobahns during World War II. When he subsequently became President Dwight D. Eisenhower, he proposed and then signed the Interstate Highway Act.

That was 50 years ago. Today, the US is crisscrossed with 50,000 miles of interstates, which have completely remade the nation’s economy and culture [2].

Today’s Perspective

Like the interstate system in its early years, today’s semantic Web lets you link together a complete trip, but the going isn’t as smooth or as fast as it could be. Nevertheless, making the trip is doable and keeps improving day by day, month by month.

My view of what’s required to smooth the road begins with extracting structure and meaningful information according to understandable schema from mostly uncharacterized content. Then we store the now-structured content as RDF triples that can be further managed and manipulated at scale. By necessity, the journey embraces tools and requirements that, individually, might not constitute semantic Web technology as some strictly define it. These tools and requirements are nonetheless integral to reaching the destination. We are well into that journey’s first leg, what I and others are calling the structured Web.

For the past six months or so I have been researching and assembling as many semantic Web and related tools as I can find [3]. That Sweet Tools listing now exceeds 500 tools [4] (with its presentation using the nifty lightweight Exhibit publication system from MIT’s Simile program [5]). I’ve come to understand the importance of many ancillary tool sets to the entire semantic Web highway, such as natural language processing and information extraction. I’ve also found new categories of pragmatic tools that embody semantic Web and data mediation processes but don’t label themselves as such.

In its entirety, the Sweet Tools listing provides a pretty good picture of the semantic Web’s state. It’s a surprisingly robust picture — though with some notable potholes — and includes impressive open source options in all categories. Content publishing, indexing, and retrieval at massive scales are largely solved problems. We also have the infrastructure, languages, and (yes!) standards for tying this content together meaningfully at the data and object levels.

I also think a degree of consensus has emerged on RDF as the canonical data model for semantic information. RDF triple stores are rapidly improving toward industrial strength, and RESTful designs enable massive scalability, as terabyte- and petabyte-scale full-text indexes prove.

Powerful and flexible middleware options, such as those from OpenLink [6], can transform and integrate diverse file formats with a variety of back ends. The World Wide Web Consortium’s GRDDL standard [7] and related tools, plus various “RDF-izers” from Massachusetts Institute of Technology and elsewhere [8], largely provide the conversion infrastructure for getting Web data into that canonical RDF form. Sure, some of these converters are still research-grade, but getting them to operational capabilities at scale now appears trivial.

Things start getting shakier when trying to structure information into a semantic formalism. Controlled vocabularies and ontologies range broadly and remain a contentious area. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language [9] or microformats [10]. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) [11] and the still greater formalism of OWL’s various dialects [12].

If we compare the semantic Web to the US interstate highway system, we’re still in the early stages of a journey that will remake our economy and culture. Many potholes on the road to the semantic Web exist. One ready task is to transform existing structure to RDF. Another priority is to refine tools to extract structure and meaningful information from uncharacterized content.

Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications.

All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it into canonical RDF form, the sooner we can move on to needed developments in semantic mediation, some of the roughest road on the journey.

Potholes on the Semantic Highway

Semantic mediation requires appropriate structured content. Many potholes on the road to the semantic Web exist because the content lacks structured markup; others arise because existing structure requires transformation. We need improved ways to address both problems. We also need more intuitive means for applying schema to structure. Some have referred to these issues as “who pays the tax.”

Recent experience with social software and collaboration proves that a portion of the Internet user community is willing to tag and characterize content. Furthermore, we can readily leverage that resulting structure, and free riders are welcomed. The real pothole is the lack of easy — even fun — data extractors and “structurizers.” But we’re tantalizingly close.

Tools such as Solvent and Sifter from MIT’s Simile program [13] and Marmite from Carnegie Mellon University [14] are showing the way to match DOM (document object model) inspectors with automated structure extractors. DBpedia, the alpha version of Freebase, and System One now provide large-scale, open Web data sets in RDF [15], including all of Wikipedia. Browser extensions such as Zotero [16] are showing how to integrate structure management into acceptable user interfaces, as are services such as Zoominfo [17]. Yet we still lack easy means to design the differing structures suitable for a plenitude of destinations.

Amazingly, a compelling road map for how all these pieces could truly fit together is also incomplete. How do we actually get from here to Detroit? Within specific components, architectural understandings are sometimes OK (although documentation is usually awful for open source projects, as most of the current tools are). Until our community better documents that vision, attracting new contributors will be needlessly slower, thus delaying the benefits of network effects.

So, let’s create a road map and get on with paving the gaps and filling the potholes. It’s not a matter of standards or technology; we have those in abundance. Let’s stop the silly squabbles and commit to the journey in earnest. The structured Web’s ability to reach Hyperland [18], Douglas Adams’s prescient 1990 forecast of the semantic Web, now looks to be no further away than Detroit.

This Friday brown bag leftover was first placed into the AI3 refrigerator about three years ago on May 3, 2007. The piece was my answer to a request by Jim Hendler to pen some thoughts on the semantic Web, based, I believe, on what he thought might be a pragmatic perspective combining Internet business with Web science. The formal piece appeared as a guest editorial in the May/June 2007 issue of IEEE Intelligent Systems. What appears above is unaltered from my original posting (aside from some minor formatting clean-up; sorry to say, some of the projects are now defunct).

[1] Chris Sherman, “Happy Birthday, Lycos!,” Search Engine Watch, August 14, 2002. See http://searchenginewatch.com/showPage.html?page=2160551.
[2] David A. Pfeiffer, “Ike’s Interstates at 50: Anniversary of the Highway System Recalls Eisenhower’s Role as Catalyst,” Prologue Magazine, National Archives, Summer 2006, Vol. 38, No. 2. See http://www.archives.gov/publications/prologue/2006/summer/interstates.html.
[3] The mention of specific tool names is meant to be illustrative and not necessarily a recommendation.
[4] Sweet Tools (SemWeb) listing; see http://www.mkbergman.com/new-version-sweet-tools-sem-web/.
[5] See http://simile.mit.edu/exhibit/.
[6] OpenLink Software’s Virtuoso and Data Spaces products; see http://www.openlinksw.com/.
[7] W3C’s Gleaning Resource Descriptions from Dialects of Languages (GRDDL, pronounced “griddle”). See http://www.w3.org/2004/01/rdxh/spec.
[8] See http://simile.mit.edu/wiki/RDFizers.
[9] Outline Processor Markup Language (OPML); see http://www.opml.org/.
[10] Microformats; see http://microformats.org/.
[11] DOAP (Description of a Project), FOAF (Friend of a Friend), SIOC (Semantically-Interlinked Online Communities) and SKOS (Simple Knowledge Organizing System).
[12] W3C’s Web Ontology Language (OWL). See http://www.w3.org/TR/owl-features/.
[13] Solvent (http://simile.mit.edu/wiki/Solvent) and Sifter (http://simile.mit.edu/wiki/Sifter) are from MIT’s Simile program.
[14] Marmite (http://www.cs.cmu.edu/~jasonh/projects/marmite/) is from Carnegie Mellon University.
[15] DBpedia (http://dbpedia.org/docs/) and Freebase (in alpha, by invitation only at http://www.freebase.com/) are two of the first large-scale open datasets on the Web; Wikipedia has also been converted to RDF by System One (http://labs.systemone.at/wikipedia3).
[16] Zotero is produced by George Mason University’s Center for History and New Media; see http://www.zotero.org.
[17] ZoomInfo (http://www.zoominfo.com/) provides online structured search of companies and people, plus broader services to enterprises.
[18] The late Douglas Adams, of Doctor Who and The Hitchhiker’s Guide to the Galaxy fame, produced a TV program for BBC2 presaging the Internet called Hyperland. This 50-min video can be seen in five parts via YouTube at Part 1 of 5, 2 of 5, 3 of 5, 4 of 5 and 5 of 5.
[19] Since I first wrote this piece, I have systematized these developments in my Timeline of Information History.

Listening to the Enterprise: Total Open Solutions, Part 3

AI3:::Adaptive Information (Mike Bergman) - Tue, 06/01/2010 - 03:08

Introducing the Open Source ‘DocWiki’ System

In the first part to this series, we began with the argument that open source software alone was not sufficient to meet the required acceptance factors in the enterprise. As a guiding way to create the right mindset around these issues we shared the saying that we have adopted at Structured Dynamics that, “We’re successful when we are not needed.”

In the second part of this series we described the four legs of a stable, open source solution. These four legs are software, structure, methods and documentation. When all four are provided, we termed this a total open solution.

Now, in this third and concluding part to our series, we introduce the open source documentation and methodology system called ‘DocWiki’. It complements the base open source software, in the process completing the conditions for a total open solution.

Though we call this system ‘DocWiki’, it is not meant to be a brand or particular product description for what Structured Dynamics is offering. Rather, ‘DocWiki’ is merely a placeholder name for a generic, open source system and knowledge base that can be downloaded, installed, branded, modified and extended in whatever way the user sees fit. ‘DocWiki’ is a baseline documentation and methodology “starter kit” that can be dressed up in new clothes or packaged and named in whatever manner is best suited to a given deployment.

In describing the major components of this ‘DocWiki’ system we will again use our Citizen Dan initiative [1] as we did in Part 2. This gives us a real use case, though the same approach is applicable to any open source information management initiative by enterprises.

We call the specific version of the ‘DocWiki’ used in the case of Citizen Dan the ‘CIS DocWiki’ (for community indicator systems), specific to the domain and local government focus of Citizen Dan. Similarly, the structured vocabulary and ontology that guides the system is the MUNI ontology. For other information development initiatives, the specific content of these components would be swapped out for ones appropriate to that initiative.

Overview of the ‘DocWiki’ System

A number of desires and objectives intersected to guide the design of the ‘DocWiki’ system. We wanted:

  • A consolidated knowledge base with complete, turnkey implementation content
  • A collaborative document authoring system with authoring tools comfortable to most knowledge workers
  • A version control system to enable rollbacks and restoration of prior official versions
  • A system that would enable and facilitate the collection and import of relevant content; in our own case, that included widely distributed internal content in many forms and locations plus relevant external content (such as defined items in Wikipedia)
  • A document management framework that would allow existing content to be mixed, combined and re-purposed for different uses, from training to marketing collateral
  • A single source publishing system that would allow content to be published as paper documents, PDFs, Web pages and the like
  • A system that could be easily themed, skinned and branded, tailored for any given deployment or circumstance, and
  • A system built entirely from open source components and with content that had no restrictions on use or re-use.

In first formulating this design, our assumption was the major building blocks would be an open source document management system linked with some form of version control. Though we think such a formulation could work OK, our exposure to the MIKE2.0 methodology actually caused us to re-look at and re-think a wiki-based approach. Ultimately the trump card that decided the design for us was familiarity and ease-of-use.

The resulting architecture of the full ‘DocWiki’ system is shown below:

[Figure: architecture of the full ‘DocWiki’ system]

What is cool about this design is that a single software installation with a few extensions (Mediawiki, the Wikipedia software, plus some standard extensions and judicious use of Semantic Mediawiki) and a single loadable database are all that is required to transfer and install the ‘DocWiki’ system.

To better describe this system, we will focus on three major interconnecting pieces in this architectural diagram: the knowledge base; the vocabulary and structure (ontology); and the authoring and publishing system (wiki).

The ‘DocWiki’ Knowledge Base

The pre-loaded content for the ‘DocWiki’ system comes from its knowledge base. This is provided as a text-exported MySQL database that can be modified en masse before loading (such as substituting ‘YourName’ for ‘DocWiki’). The exemplar upon which this knowledge base is modeled is the MIKE2.0 framework.
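The en masse substitution described above can be sketched as a plain text replacement over the exported SQL file before it is loaded. This is only a minimal sketch: the function names and file paths are hypothetical, and it assumes the dump is ordinary UTF-8 text in which the placeholder name appears literally. Any binary or length-prefixed fields in a real dump would need separate handling.

```python
def rebrand_dump(sql_text, old_name="DocWiki", new_name="YourName"):
    """Replace every literal occurrence of old_name in a text SQL dump.

    Returns the modified text and the number of substitutions made,
    so the result can be sanity-checked before loading.
    """
    count = sql_text.count(old_name)
    return sql_text.replace(old_name, new_name), count


def rebrand_file(src_path, dst_path, old_name="DocWiki", new_name="YourName"):
    """Read a dump from src_path, rebrand it, and write it to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        text, count = rebrand_dump(src.read(), old_name, new_name)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return count


if __name__ == "__main__":
    sample = "INSERT INTO page VALUES (1, 'DocWiki_Home', 'Welcome to DocWiki');"
    print(rebrand_dump(sample))
```

Writing to a new file rather than editing in place keeps the original export available if the substitution turns out to be too broad.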

MIKE2.0 (Method for an Integrated Knowledge Environment) provides a comprehensive methodology that can be applied across a number of different projects within the information management space. MIKE2.0 provides an organized way to describe the why, when, how and who of information management projects. Via standard templates and structures, MIKE2.0 provides a consistent basis to describe and manage these projects, and in a way that helps promote their interoperability and consistency across the enterprise.

MIKE2.0 has a generalized methodology and set of templates applicable to initiatives, the phases, activities and tasks to undertake them, and supporting assets. Supporting assets can range from glossaries and definition of terms and concepts to very specific technical documents or background material. The entire system is logical and applies a consistent design and organizational structure and categories.

For our purposes, we wanted a complete, turnkey content knowledge base. This meant that we needed to accommodate all forms of project management and guidance, ranging from specific “how-to” and technical discussions to the entire suite of background and supporting material. The scope of this knowledge content is defined as what a new person assigned a lead or implementation responsibility would need to read or master.

As a destination site MIKE2.0 is quite broad: it embraces the ability to model virtually any information management initiative. This makes MIKE2.0 an invaluable source of structure and methodology guidance, but also results in it being quite limited in the specific how-tos associated with any given initiative. I have earlier spoken about the structure of MIKE2.0 and in particular its applicability to the semantic enterprise.

The strength of MIKE2.0, however, is that its structure can be grabbed and quickly applied to form an organizational and structural basis for filling out the knowledge base for any specific information development initiative. And, that is exactly what we did with the ‘CIS DocWiki.’

MIKE2.0 hosts and maintains its project-related structure in Mediawiki (with some extensions). Combined with its templates, this provides a rapid-start baseline for beginning to tailor and flesh out the specific details for a given information management initiative. Thus, after copying broad aspects of the MIKE2.0 system into the incipient ‘DocWiki’, it was relatively straightforward to let the existing structure and templates of MIKE2.0 guide next steps.

As of today’s date, the ‘CIS DocWiki’ contains about 300 substantive articles, a complete activity and tasking structure, and various re-usable templates based on Semantic Mediawiki for structured and consistent access and retrieval. New tasks and structure can be readily added to the system. Existing structure or content can be deleted or marked as archive for non-display. We are still gathering all requisite content pieces, and anticipate by first public release that the baseline knowledge base will include 2x to 3x the scale of its current content.

For new ‘CIS DocWiki’ (or Citizen Dan-based) deployments, this means the knowledge base can be completely modified and extended for local circumstances. The set-up of the Mediawiki instance is separate from the loading or modification of the knowledge base, which means the look-and-feel of the entire system, not to mention user rights and permissions, can also be readily tailored for local requirements.

The core content of the ‘CIS DocWiki’ and its basis in a set structure and methodology (derived from MIKE2.0) means that the knowledge base is also adaptable for other broader information development areas, especially in the semantic enterprise or semantic government arenas. Thus, while Structured Dynamics is first releasing the ‘CIS DocWiki’ in the context of Citizen Dan and semantic government, we also are developing a parallel instance for the Open SEAS approach to the semantic enterprise.

The approach taken here is somewhat different than the standard wiki use. As experts, we are basically sole authoring (with contributions from selected collaborators and our clients) the starting basis of the knowledge base. Unlike many wikis, this enables us to be quite consistent in content, style, and organization. Such an approach allows us to present a coherent and complete starting content and methodology foundation. However, once delivered and installed for a given deployment, its users are then free to extend and change this knowledge foundation in the standard wiki manner. Whether those subsequent extensions are free-form or more tightly controlled and managed is the choice of the new deployment’s administrators.

The Supporting MUNI Structure

Strictly speaking, the vocabularies and structures (including, of course, ontologies) that drive our semantic government or semantic enterprise offerings are also part of the knowledge base.  And, in fact, many of these aspects, especially related to the actual operating of the instances, are included as part of the standard knowledge base.

However, the applicable domain ontology itself is separately maintained. Descriptions of how to use and modify such ontologies are part of the general ‘DocWiki’ knowledge base, but the ontology is not. This arm’s length-separation is done to acknowledge that the ontology has independent use and value apart from the knowledge base or the software (Citizen Dan, in this case) that is the focus of it.

In the Citizen Dan instance, this structure is the MUNI ontology. MUNI is a general local government domain ontology that can find use in a broad array of circumstances, with or without Citizen Dan. Thus, like other ontologies developed and maintained by Structured Dynamics, such as BIBO (the Bibliographic Ontology), the ontology itself and its documentation, discussion forums and use cases are maintained separately.

The first release of MUNI is still under development and will be released this summer.

The Wiki/Publication Portion of ‘DocWiki’

The software framework that hosts and manages all of this content is the Mediawiki software, originally developed for Wikipedia. This framework is supported by a number of standard extensions packaged with the ‘DocWiki’ distribution. One of the more notable extensions is Semantic Mediawiki. Mediawiki also is the wiki framework underlying MIKE2.0, so content sharing between the systems is straightforward.

The Collaborative Wiki Portion

The first use of the ‘DocWiki’ is to add new content to the knowledge base and to modify or extend what is provided in the baseline. For straight authoring, ‘DocWiki’ offers the standard wikitext basis for content entry and editing, as well as the WikED enhanced editor and the FCKEditor WYSIWYG rich-text editor. Each of these may be turned on or off at will.

All of the baseline content is fully organized and categorized via a standard structure. Pre-existing templates aid in entering new content in specific areas consistently or in providing standard administrative ways of tagging content for completeness or need for editorial attention. Tasks and concepts, in particular, follow set ways of entry and description. These set templates, some forms-based and some derived from Semantic Mediawiki, are also tied into automatic internal scripts for listing and organizing various items. So long as new material is entered properly, it will be reflected in various stats and listings. Unlike sole reliance on Semantic Mediawiki, the ‘DocWiki’ approach is a mix of standard wiki categories and semantic types. Both are used for effective organization of the knowledge base.

Besides the knowledge base of domain content and “how-to”, the system also comes pre-packaged with many wiki “how-to” and best practices guidance for using the system effectively and consistently. Of course, a given deployment may or may not enforce all of these practices. A poorly administered instance, for example, could degenerate fairly quickly and lose the native structure and organization of the baseline system.

As with standard wikis, there is a history of prior page revisions that gives the system rollback and version control. Mediawiki has a pretty good user access and permissions framework, ranging from access and reading to editing and uploads.

Besides the standard and required extensions, ‘DocWiki’ also comes packaged with the necessary settings and configuration files to operate “out-of-the-box” in its designed baseline mode. Of course, these settings, too, can be changed and modified by site administrators, and ‘DocWiki’ also includes guidance on how to do that.

The Publication Portion

A little known but highly useful part of the Mediawiki API allows direct export of XHTML content [2]. Then, with minor XSLT conversion templates, it is possible to strip out wiki-specific conventions (such as the editing of individual sections) or to create straight XML versions. When this is combined with the use of internal ‘DocWiki’ CSS style sheets that impose some clean and semantic style identifiers, a common canonical output basis for content is possible.
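A minimal sketch of that export step follows. It builds a page URL that uses the action=render command noted in [2], then strips section edit links from the returned XHTML. Several things here are assumptions made for illustration: the wiki address is hypothetical, a regular expression stands in for the XSLT templates mentioned above, and the edit links are assumed to be simple `<span class="editsection">` elements with no nested spans.

```python
import re
from urllib.parse import urlencode

# Assumed markup for section edit links; adjust to what the wiki actually emits.
EDIT_LINK = re.compile(r'<span class="editsection">.*?</span>', re.DOTALL)


def render_url(index_php, title):
    """Build a URL that asks Mediawiki for the bare rendered page body."""
    params = {"title": title.replace(" ", "_"), "action": "render"}
    return index_php + "?" + urlencode(params)


def strip_edit_links(xhtml):
    """Remove section edit links so the output carries no wiki-specific UI."""
    return EDIT_LINK.sub("", xhtml)


if __name__ == "__main__":
    # Hypothetical wiki location, for illustration only.
    print(render_url("http://wiki.example.org/index.php", "Main Page"))
    page = '<h2><span class="editsection">[<a href="/x">edit</a>]</span> Intro</h2>'
    print(strip_edit_links(page))
```

Fetching the URL and applying deployment-specific CSS are left out; a regular expression is fragile against markup changes, which is one reason an XSLT or parser-based step is the sturdier choice in practice.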

From that point, a given deployment may use its own CSS styles to theme output content. Output Web pages (XHTML) or XML files then can be processed using existing and accurate utilities to produce PDF or *.doc documents. Then, with systems such as OpenOffice, an even wider variety of document formats can be produced. These facilities mean that the ‘DocWiki’ can also act as a single-source publishing environment.

In its initial release, re-purposing ‘DocWiki’ content into other presentations (for example, combining sections from multiple pages into a new document as opposed to re-using existing pages as is) will require creating new wiki pages and then cutting-and-pasting the desired content. However, it should also be noted that both DocBook and DITA have been applied to Mediawiki installations [3]. It should be possible to enable a more flexible re-purposing framework for ‘DocWiki’ moving into the future.

When Available

The ‘CIS DocWiki’ is meant to accompany the first release of Citizen Dan, likely by the end of summer. The MUNI ontology will also be released roughly at the same time. At release, the ‘CIS DocWiki’ is anticipated to have on the order of 500-800 baseline content and “how to” articles.

Depending on time availability and other commitments, Structured Dynamics will also be using this information to build a semantic government composite offering to MIKE2.0. We will be contributing this new offering for free, similar to what we have done earlier for a semantic enterprise offering.

Subsequent to those events, we will then be modifying the ‘CIS DocWiki’ for the semantic enterprise domain. Much of the necessary content will have already been assembled for the ‘CIS DocWiki’.

Conclusions and Applicability

Paradoxically, while developing such knowledge bases and systems such as ‘DocWiki’ appears to be extra work, from our standpoint as developers it is useful and efficient. Structured Dynamics already researches and assembles much material and tries to “document as it goes.” Having the ‘DocWiki’ framework not only provides a consistent and coherent way to organize that information, but it also helps to point out essential gaps in our offerings.

The ‘DocWiki’ delivers the methods, documentation and portions of the structure to a total open solution. The ‘DocWiki’ is the primary means — along with software development and accompanying code-level and API documentation, of course — for us to fulfill our mantra that “We’re successful when we are not needed.” As we pointed out in Part 1 of this series, we really think such an attitude is ultimately a self-interested one. The better we can address the acceptance factors in the enterprise for our offerings, the more opportunities we will gain.

We would like to think that other enlightened open source software developers, especially those in the semantic space but certainly not limited to them, will see the wisdom of this four-legged foundation to total open solutions. Up until now, pragmatic guidance for what it takes to create a complete open source offering to businesses and enterprises has been lacking.

The tools, methods, and workflows all exist for making total open solutions real today. All of the pieces are themselves open source. There are many useful guides for best practices across the pipeline. It is just that — prior to this — no one apparently took the time to assemble and articulate them. We think this three-part series and some of the “how to” guidance in the ‘DocWiki’ system can help fix this oversight.

Ultimately, with wider adoption by developers, goaded in part by demands of the marketplace for them, we would hope that additional innovations and ideas may be forthcoming to improve the industry’s ability to offer total open source solutions. Adding just a small bit of attentive effort to how we organize and package what we know is but a small price to pay for greater acceptance and success.

[1] Citizen Dan is an open source system for aggregating different indicator data concerning local, community well-being. Information sources may include the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics. See current details with screenshots.
[2] Clean XHTML can be generated directly from the Mediawiki API. This can be done directly via URL with the action=render command. See for example: http://www.mediawiki.org/wiki/API:Parsing_wikitext.
[3] For example, there are a number of paths to migrate from HTML or XHTML to DocBook; see http://wiki.docbook.org/topic/Html2DocBook. But, there is a specific project that also goes directly from Mediawiki; see http://code.google.com/p/gwtwiki/wiki/Mediawiki2Docbook.

Listening to the Enterprise: Total Open Solutions, Part 2

AI3:::Adaptive Information (Mike Bergman) - Tue, 05/25/2010 - 14:24

The Four Legs to a Stable Open Source Solution

In the first part to this series, we put forward the argument that incomplete provision of important support factors was limiting the adoption of open source software in the enterprise. We can liken the absence of these factors to having a chair with one or more absent or broken legs.

This second part of the series goes into the four legs of a stable, open source solution. These four legs are software, structure, methods and documentation. When all four are provided, we can term this a total open solution.

These considerations are not simply a matter of idle curiosity. New approaches and new methods are required for enterprises to modernize their IT systems while adding new capabilities and preserving sunk assets. Extending and modernizing existing IT is often not in the self-interests of the original supplying vendors. And enterprises are well aware that IT commitments can extend for decades.

While the benefits and capabilities of open source software become apparent by the day, rates of open source software adoption lag in enterprises. We have seen entire Internet-based businesses arise and get huge in just a few short years. But it is the rare existing enterprise that has committed to and embraced similar Web-oriented architectures and IT strategies [1].

The enterprise IT ecosystem is evolving to become an unhealthy one. New software vendors have generally abandoned enterprises as a market. Much more action takes place with consumer apps and Internet plays, often premised on ad-based revenues or buzz and traffic as attractors for acquisition. Existing middle-tier enterprise vendors are themselves being gobbled up and disappearing.  I’m sure all observers would agree that IT software and services are increasingly dominated by a shrinking slate of vendors. I suspect most observers — myself included — would argue that enterprise-based IT innovation is also on the wane.

The argument posed in the first part of this series is that such atrophy should not be unexpected. The current state of open source software is not addressing the realities of enterprise IT needs.

And that is where the other legs of the total open solution come in. In their entirety, they amount to a form of capacity building for the enterprise [2]. It is not simply enough to put forward buzzwords matched with open source software packages. Exciting innovations in social networks, collaboration, semantic enterprise, mobile apps, REST, Web-oriented architectures, information extraction, linked data and a hundred others are being validated on the Internet. But until the full spectrum of success and adoption factors gets addressed, enterprises will not embrace these new innovations as central to their business.

As we describe these four legs to the total open solution, we will sometimes point to our Citizen Dan initiative [3]. That is not because of some universal applicability of the system to the enterprise; indeed Citizen Dan is mostly targeted to local communities and municipalities. But, Citizen Dan does represent the first instance known to us where each of these total open solution success factors is being explicitly recognized and developed. We think the approach has some transferability to the broader enterprise.

Let’s now discuss these four legs in turn.

Leg One: Software

Of course, the genesis of this series is grounded in open source software and what it needs to do in order to find broader enterprise acceptance. Clearly that is the first leg amongst the four to be discussed. We also have acknowledged that, generally, best-of-breed open source software is also better documented at the code level, and has documented APIs. We will return to this topic under Leg Four below.

Open source software useful to the enterprise is often a combination of individual open source packages. Some successful vendors of open source to the enterprise in fact began as packagers and documenters of multiple packages. Red Hat for Linux, Alfresco in document management and Pentaho in business intelligence come to mind as examples.

In the case of Citizen Dan, here are the open source packages presently contained in its offering: Linux (Ubuntu), Apache, MySQL, PHP (these comprising the LAMP stack), Drupal, a variety of third-party Drupal modules, Virtuoso, Solr, ARC2, Smarty, Yahoo UI, TinyMCE, Axiis, Flex, ClearMaps, irON, conStruct, structWSF, and some others. Such combinations of packages are not unusual in open source settings, since new value-add typically comes from extensions to existing systems or unique ways to combine or package them. For example, the installation guide for structWSF alone is quite comprehensive with multiple configuration and test scripts.

Thus, besides direct software, it is also critical that configuration, settings, installation guidance and the like be addressed to enable relatively straightforward set-up. This is an area of frequent weakness. Targeting it directly is a not-so-secret factor for how some vendors have begun to achieve some success with the enterprise market.

Leg Two: Structure

All software works on data. While some data is unstructured (such as plain text) and some is semi-structured (such as HTML or Web pages that mix markup with text), the objective of information extraction or natural language processing is to extract the “structure” from such sources. Once extracted, such structure can interoperate on a common footing with the structured data common to standard databases.

Thus, we use “structure” to denote the concepts and their relationships (the “schema” or “ontology”) and the indicators and data (attributes and values) to describe them, and the “entities” (distinct individuals or nameable instances) that populate them. In other words, “structure” refers to all of the schema (concepts + relationships) + data + attributes + indicators + records that make up the information upon which software can operate.
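As a rough illustration of this definition, the pieces of “structure” can be sketched as a handful of plain data types. This is only a minimal sketch under stated assumptions: the class names, field names and the sample municipal records are hypothetical and invented for illustration. They are not drawn from the MUNI ontology, structWSF or any other package named here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A class of things in the domain (part of the schema)."""
    name: str


@dataclass(frozen=True)
class Relationship:
    """A schema-level link between two concepts."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Entity:
    """A nameable instance of a concept, with attribute/value data."""
    identifier: str
    concept: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Structure:
    """Schema (concepts + relationships) plus the records that populate it."""
    concepts: list = field(default_factory=list)
    relationships: list = field(default_factory=list)
    entities: list = field(default_factory=list)

    def entities_of(self, concept_name):
        """Return all entities that are instances of the named concept."""
        return [e for e in self.entities if e.concept == concept_name]


# Hypothetical sample data, made up purely for illustration.
example = Structure(
    concepts=[Concept("Municipality"), Concept("Neighborhood")],
    relationships=[Relationship("Neighborhood", "partOf", "Municipality")],
    entities=[
        Entity("example-town", "Municipality", {"population": 1000}),
        Entity("north-side", "Neighborhood", {"partOf": "example-town"}),
    ],
)
```

In this sketch the schema and the records live side by side in one object, which mirrors the point that software operates on all of these pieces together rather than on the schema or the data alone.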

Structure exists in many forms and serializations. Generally, software represents its internal information in one or a few canonical storage and manipulation formats, though that same software may also be able to import (ingest) or export its information and data in many different external formats.

In our semantic enterprise work, especially with its premise in ontology-driven applications using adaptive ontologies, structure is an absolutely essential construct. But, frankly, no information technology system exists that does not also depend on structure to a greater or lesser extent.

The interplay between software and structure is one source of expertise that vendors guard closely and use to competitive advantage. In years past, proprietary software could partially hide the bases for performance or algorithmic advantages. Expert knowledge and intimate familiarity with these systems were the other basis for keeping these advantages closely held.

It is perhaps not too surprising given this history, then, that the software industry really has very little emphasis or discussion on the interaction between software and structure. But, if software is being brought in as open source, where is the accompanying expertise or guidance for how data structure can be used to gain full advantage? The same acquired knowledge that, say, accompanied the growth of relational databases in such areas as schema development, materialized views or (de)normalization now needs to be made explicit and exposed for all sorts of open source systems.

In the realm of the semantic enterprise we are seeing attempts at this via open source ontologies and greater emphasis on APIs and documentation of same. Citizen Dan, for example, will be first publicly released with an accompanying MUNI ontology as a reference schema and starting point. Descriptions and methods for how to obtain indicator data and relevant attribute and entity information for the domain will also accompany it.

As open source software continues to emphasize semantics and interoperability, exemplar structures and best practices will need to be an essential part of the technology transfer. Just as the “secrets” of much software began to be opened up via open source, so too must the locked-up expertise of experts and practitioners in how to effectively structure data be exposed.

Leg Three: Methods

The need for structure explication and guidance is but one unique slice of a much broader need to expose methods and best practices surrounding a given information management initiative. The reason that any open source software might be adopted in the first place is based on the hope for some improved information management process.

Recently I have been touting MIKE2.0, the first open source, replicable and extensible framework for organizing and managing information in the enterprise. MIKE2.0 (Method for an Integrated Knowledge Environment) provides a comprehensive methodology that can be applied across a number of different projects within the information management space. It can be applied to any type of information development.

MIKE2.0 provides an organized way to describe the why, when, how and who of information management projects. Via standard templates and structures, MIKE2.0 provides a consistent basis to describe and manage these projects, and in a way that helps promote their interoperability and consistency across the enterprise.

MIKE2.0 and its forthcoming extensions, one of which we have developed for the semantic enterprise and are now extending into the semantic government in the context of Citizen Dan, are exciting because they provide a systematic approach and guidance for how (and for what!) to document new projects and initiatives. What MIKE2.0 represents is the first time that the embedded, proprietary expertise of traditional IT consultants has been exposed for broader use and extension.

The real premise behind any approach like MIKE2.0 or variants is to codify the expertise and knowledge that was previously locked up by experts and practitioners. The framework in MIKE2.0 provides a structure by which knowledge bases of background information can be assembled to accompany an open source project. This structure extends from initial evaluation and design all the way through operation and end of life.

The ‘CIS DocWiki’ that is being developed to accompany Citizen Dan is such an example of a MIKE2.0-informed knowledge base. At present, the CIS DocWiki has more than 300 specific articles useful to community indicator systems for local governments, and a complete deployment and maintenance methodology. By public release, it will likely be 2-3 times that size. All of this will be downloadable and installable as a wiki, and as open source content, ready for branding and modification for any local circumstance. CIS DocWiki is a natural methods and documentation complement to the Citizen Dan software and its MUNI structure. Release is scheduled for summer.

As we will focus on in Part 3 of this series, we are combining a MIKE2.0 organizational approach with a documentation and single-source publication platform to fulfill the method and documentary aspects of projects. It was really through the advantages gained by the combination of these pieces that we began to see the inadequacy of many current open source projects for the enterprise.

Leg Four: Documentation

This series began in part with a recognition that superior open source projects are often the better documented ones. But, even there, documentation is often restricted to code-level documentation or perhaps APIs.

As the material above suggests, documentation needs to extend well beyond software. We need documentation of structure, methods, best practices, use cases, background information, deployment and management, and changing needs over the lifetime of the system. And, as we have also seen in Part 1, the lifetime of that system might be measured in decades.

Documentation is no equal to paid partners and their expertise. But, documentation can be cheaper, and if that documentation is sufficient, might be a means for changing the equation in how IT projects are solicited, acquired and managed.

Today, enterprises appear to be stuck between two difficult choices: 1) the traditional vendor lock-in approach with high costs and low innovation; or 2) open source with minimal documentation and vendor knowledge and little assurance of support longevity.

These trade-offs look pretty unpalatable.

Documentation alone, even as extended into the other legs of the solution, is not prima facie going to be a deal maker. But, its absence, I submit, is a deal breaker. Just as open source itself has taken some years to build basic comfort in the enterprise, so too a concerted attack on all acceptance factors may be necessary before actual wide adoption occurs.

We hope the ‘CIS DocWiki’ platform noted for Citizen Dan will be an exemplar for this combination of documentation and methodology. It is a single-source publishing platform that allows the entire knowledge base behind a given IT initiative to be used for collaboration, operational, training or collateral purposes. And all of this is based on open source software.

Software vendors need to recognize these documentation factors and build their ventures for success. Yes, writing code and producing software is a lot more fun and rewarding than (yeech) documentation. But, unless our current generation of vendors that is committed to open source and its benefits takes its markets seriously — and thus commits to the serious efforts these markets demand — we will continue to see minimal uptake of open source in the enterprise.

An Interacting Whole Greater than the Sum of its Parts

Each of these four legs of a total open solution can interact with and reinforce the other parts. Once one begins to see the problem of open source adoption in the enterprise as a holistic one, a new systems-level perspective emerges.

Enterprises know full well that software is only one means to address an information management problem, and only a first step at that. Traditional vendors to the enterprise also understand this, which is why through their embedded systems and built-up expertise they have been able to perpetuate what often amounts to a monopoly position.

Pressures are building for an earthquake in the IT landscape. Enterprises are on an anvil of global competition and limited resources. Existing IT systems are not up to the task but too expensive and embedded to abandon. Traditional vendors have near monopoly positions and little incentive to innovate. New software vendors don’t have the expertise and gravitas to handle enterprise-scale challenges. Meanwhile, the rest of the globe is leapfrogging embedded systems with agile, Web-based systems.

The true innovation that is occurring is all based around open source, nurtured by the global computing platform of the Internet, and fueled by countless individuals able to compete on downward-spiraling cost bases. But on so many levels, open source as presently constituted either fails or poses too many risks to the commercial enterprise.

The Internet itself was the basis of a paradigm shift, but I think we are only now seeing its manifestation at the enterprise level. We are also now seeing global reordering and changes of the economic order. How will companies respond? How will their IT systems adapt? And what will new vendors need to do and recognize in order to thrive in this changing environment?

I’m not sure I have found the language or rhetoric to convey what I see coming, and coming soon. I know open source is part of it; I know enterprises need it; and I know what is presently being offered does not meet the test.

As I noted in our first part, the mantra that we use in Structured Dynamics to express this challenge is, “We’re Successful When We’re Not Needed”. I think the essence behind this statement is that premises of dependency or proprietary advantage will not survive the jet streams of change that are blowing away the old order.

Sound like too much hyperbole? Actually, my own gut feeling is that it is not nearly enough.

In any case, windy rhetoric always falls short if there are no actionable next steps. In these first two parts of this series, I have tried to present the ingredients that need to go into the cake. In the third part I try to offer a new, and complementary, open source means for bringing stability to the foundation.

In all cases, though, I think these challenges are permanent ones and do not lend themselves to facile solutions. Four legs, or seven foundations, or twelve steps are all just too simplistic for dealing with the global and complex tsunamis blowing away the old order.

One really does not need to lick a finger to sense the direction of these winds of change. It is coming, and coming hard, and all of it is from the direction of open source. What enterprises do, and what the vendors who want to serve them do, is perhaps less clear. I think open source offers a way out of the box in which enterprise IT is currently stuck. But, at present, I also think that most open source options do not have the necessary legs to stand on.

[1] One notable exception to this is the consumer-facing aspects of some businesses, such as automobiles or personal care or fashion products. These businesses are leading the way into some of the “build your own” or “design your own” uses of modern Web technology.

[2] In the 1970s the major term for this approach was “technology transfer.”

[3] Citizen Dan is an open source system for aggregating different indicator data concerning local, community well-being. Information sources may include the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics. See current details with screenshots.

SD Gets New Logo, Look

AI3:::Adaptive Information (Mike Bergman) - Thu, 05/20/2010 - 20:03
Growth Demanded a Professional Upgrade

Structured Dynamics today updated its image with a new logo and new color schemes on its Web pages and collateral. Other upgrades to various SD product logos and other adjustments are also being made.

Fred Giasson and I formed the company rapidly back in November 2008. We had other fish to fry, namely starting work with customers coming out of the gate, and (literally) grabbed a toss-off logo that had been lying in the drawer to start the company. That worked well in the early days, but we increasingly felt our image looked tired and distinctly “non-dynamic.”

So we commissioned a competition a few weeks back and left the next steps to the professionals. The winning design is shown above. We had many good options to choose from, and we will be working with some of the other finalist designers for some of our other product designs. The first in that series is the Citizen Dan logo.

So, with growth and presence it feels good to now have a professional look as well. We’re proud that we continue to be able to fully self-fund the company and look to walk arm-in-arm with this logo for quite some time to come!
