Today, Structured Dynamics is pleased to release Open SEAS, its methodology for Semantic Enterprise Adoption and Solutions. At the same time, we are donating the framework to the open source MIKE2.0 Method for an Integrated Knowledge Environment project.
Open SEAS provides a framework for the enterprise to establish a coherent, consistent and interoperable layer across its information assets. It is compliant with the MIKE2.0 Semantic Enterprise Solution Offering.
Open SEAS has been developed for enterprises desiring to initiate or extend their involvement with semantic technologies. It is inherently incremental, low-cost and low-risk.
Donation and Relation to MIKE2.0
Concurrent with this release, Structured Dynamics is also donating the methodology and all of its related intellectual assets to the MIKE2.0 project. Under Creative Commons license and MIKE2.0’s content governance policies, the community’s current 2000+ members are now free to expand and use the Open SEAS methodology in any manner they see fit.
Last week, I began to introduce MIKE2.0 and its methodology to the readers of this blog. MIKE2.0 provides a complete delivery environment and methodology for information management projects in the enterprise. Solutions, from the specific to the composite, are described and packaged with respect to plans, management communications, products (open source and proprietary), activities, benchmarks, and deliverables. Delivery is accomplished over multiple increments, split into five phases from definition and planning to deployment. The assets associated with this framework are based first on templates and guidelines that can be applied to any information management area. The framework allows for multiple projects to be combined and inter-related, all under a common methodology. More information and a good entry point are provided on the What is MIKE2.0? page on the project’s main Web site.
MIKE2.0 presently has some 800 resources across about 40 solution areas. With Structured Dynamics’ donation, there are now about 40 resources related to the semantic enterprise, many of them major, accompanied by many images and figures. This contribution makes the Semantic Enterprise Solution Offering instantly one of the more complete within MIKE2.0. As noted below, this contribution is also just a beginning of our commitment.
Basic Overview of Open SEAS
The Open SEAS framework is Structured Dynamics’ specific implementation framework for MIKE2.0’s Semantic Enterprise Solution Offering. This section gives an overview of some of Open SEAS’ key facets.
A Grounding in the Open World Approach
Many enterprise information systems, particularly relational ones, embody a closed world assumption that holds that any statement that is not known to be true is false. This premise works well where there is complete coverage of specific items, such as the enumeration of all customers or all products.
Yet, in most areas of the real (“open”) world there is no guarantee or likelihood of complete coverage. Under an open world assumption the lack of a given assertion or fact does not imply whether that possible assertion is true or false: it simply is not known. An open world assumption is one of the key factors that defines the open Semantic Enterprise Offering and enables it to be deployed incrementally. It is also the basis for enabling linkage to external (often incomplete) datasets.
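To make the contrast concrete, here is a minimal sketch in Python. It is not part of Open SEAS or MIKE2.0; the fact base, the names in it and the two helper functions are all made up for illustration. The same lookup is answered under each assumption.

```python
from enum import Enum


class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical fact base: statements that have been asserted as true.
KNOWN_FACTS = {
    ("acme", "is_customer_of", "widgetco"),
    ("globex", "is_customer_of", "widgetco"),
}


def closed_world(fact, facts=KNOWN_FACTS):
    """Closed world: a statement that is not asserted is taken to be false."""
    return Truth.TRUE if fact in facts else Truth.FALSE


def open_world(fact, facts=KNOWN_FACTS):
    """Open world: a statement that is not asserted is simply not known."""
    return Truth.TRUE if fact in facts else Truth.UNKNOWN


query = ("initech", "is_customer_of", "widgetco")
print(closed_world(query))  # Truth.FALSE
print(open_world(query))    # Truth.UNKNOWN
```

The closed-world answer is the right one when the customer list is known to be complete. The open-world answer is the safer one when the list may have gaps, as is usual with external data.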
Fortunately, there is no requirement for enterprises to make some philosophical commitment to either closed- or open-world systems or reasoning. It is perfectly acceptable to combine traditional closed-world relational systems with open-world reasoning. It is also not necessary to make any choices or trade-offs about using public v. private data or combinations thereof. All combinations are acceptable when the basis for integration is an open-world one.
Open SEAS is grounded in this “open” style. It can be employed in virtually any enterprise circumstance and at any scope, and expanded in a similar way as budget and needs allow.
Other Basic Pillars to the Framework
Open SEAS is based on seven pillars, which themselves inform the basis for the MIKE2.0 Guiding Principles for the Open Semantic Enterprise. These principles cover data model, architecture, deployment practices and approach for how an enterprise can begin and then extend its use of semantics for information interoperability.
Important aspects are linked data or Web-oriented architecture, but it is really the unique combination of open-world approach and the RDF data model and its semantic power that provide the distinctive differences for Open SEAS. An exciting prospect — but still in its early stages of discovery and implementation — is the role of adaptive ontologies to power ontology-driven applications. These prospects, if fully realized, could totally remake how knowledge workers interact and specify the applications that manage their information environment.
Embracing the Layered Semantic Enterprise Architecture
Open SEAS also fully embraces the Layered Semantic Enterprise Architecture of MIKE2.0’s Semantic Enterprise Offering. This architecture acts as a subsequent set of functions or middleware with respect to MIKE2.0’s standard SAFE Architecture. Most of the existing SAFE architecture resides in the Existing Assets layer. The specific aspects of Open SEAS reside in the layers above, namely the Access/Conversion, Ontologies and Applications layers.
Using (Mostly) Open Source to Fill Gaps in the Technology Stack
Stitching together this interoperability layer above existing information and infrastructure assets requires many diverse tools and products, and there still are gaps. The layer figure below shows the semantic enterprise architecture overlaid with some representative open source projects and tools that plug some of those gaps.
Open SEAS also maintains a comprehensive roster of open source and proprietary tools in all aspects of semantic technology, ranging from data storage and converters, to Web services and middleware, and then to ultimate user applications. A database of nearly 1,000 tools in all areas is maintained for potential applicability to the methodology.
The inherently incremental nature of the Open SEAS framework encourages experimentation, affordable deployments, and experience gathering. Because the systems and deployments put into place with this framework are based on the open world approach and use the extensible RDF data model, expansions in scope, sophistication or domain can be incorporated at any time without adverse effects on existing assets or systems or prior Open SEAS deployments.
Quick and (virtually) risk-free increments mean that adopting semantic approaches in the enterprise can be accelerated (or not) based on empirical benefits and available budgets.
An Emphasis on Learning
The Open SEAS framework is built on a solid foundation, but it is also one that is incomplete. Deployments of semantic technologies and approaches are still quite early in the enterprise, whether measured in numbers, scope or depth. In order for the framework (and the practice of semantic adoption in general) to continue to expand and be relevant in the enterprise, active learning and documentation are essential. One of the reasons for the affiliation of Open SEAS with MIKE2.0 is to leverage these strong roots in methodological learning.
Where Do We Go From Here?
By their nature, Open SEAS and its parent Semantic Enterprise Solution Offering touch most offerings within the MIKE2.0 framework. There is much to be done to integrate the semantic enterprise perspective into these other possibilities, plus much that needs to be learned and documented for the offering itself. The concept of the semantic enterprise, after all, is relatively new with few prominent case studies.
As the offering points out, there are some dozens of additional necessary resources that are available and ready to be packaged and moved into the MIKE2.0 framework. These efforts are a priority, and will continue over the coming weeks.
But, more importantly, beyond that, the experience and practitioner base needs to grow. Much is unknown regarding key aspects of the offering:
Despite these questions, emergence is the way complex systems arise out of a multiplicity of relatively simple interactions, exhibiting new and unforeseen properties in the process. RDF is an emergent model. It begins as simple “fact” statements expressed as triples, which may then be combined and expanded into ever-more complex structures and stories. As an internal, canonical data model, RDF has advantages for information federation and development over any other approach. It can represent, describe, combine, extend and adapt data and their organizational schema flexibly and at will. Applications built upon RDF can explore and analyze in ways not easily available with other models.
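As a rough illustration of how simple triples combine, the sketch below models each “fact” as a plain Python tuple of subject, predicate and object. It does not use an actual RDF library, and the `ex:` names and both small datasets are invented for the example.

```python
# Each "fact" is a (subject, predicate, object) triple; all names are made up.
crm = {
    ("ex:acme", "rdf:type", "ex:Customer"),
    ("ex:acme", "ex:locatedIn", "ex:Iowa"),
}

# A second, independently created dataset that mentions some of the same things.
external = {
    ("ex:acme", "ex:industry", "ex:Manufacturing"),
    ("ex:Iowa", "rdf:type", "ex:State"),
}

# Combining the two is a set union: no schema migration and no
# change to the facts that were already there.
merged = crm | external


def describe(subject, graph):
    """Return every (predicate, object) pair known about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


for predicate, obj in describe("ex:acme", merged):
    print(predicate, obj)
```

The point of the sketch is only that new statements, including ones using predicates the original dataset never anticipated, can be added without touching what already exists.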
Combined with an open-world approach, new information can be brought in and incorporated to the framework step-by-step. Perhaps the greatest promise in an ongoing transition to become a semantic enterprise is how an inherently incremental and building-block approach might alter prior practices and risks across the entire information management spectrum.
We invite you to join us and to contribute to this effort. I encourage you to join MIKE2.0 if you have not already done so, and check out announcements on this blog for ongoing developments.
Enterprises are hungry for guidance and assistance in learning how to embrace semantics and semantic technologies in their organization. Because of our services and products and my blog writings, we field many inquiries at Structured Dynamics about best practices and methods for transitioning to a semantic enterprise.
Until the middle of last year, we had been mostly focused on software development projects and our middleware efforts via things like conStruct, structWSF, irON and UMBEL. While we also were helping in early engagement and assessment efforts, it was becoming clear that more formalized (and documented!) methods and techniques were warranted. We needed concrete next steps to offer the organization once they became intrigued and then excited about what it might mean to become a semantic enterprise.
For decades, of course, various management and IT consultancies have focused on helping enterprises adopt new work methods and information management approaches and technologies. These practices have resulted in a wealth of knowledge and methods, all attuned to enterprise needs and culture. Unfortunately, these methods have also been highly proprietary and hidden behind case studies and engagements often purposely kept from public view.
So, in parallel with formulating and documenting our own approaches, some of which are quite new and unique to the semantic space (with its open world flavor as we practice it), we also have been active students of what others have done and written about information management assessment and change in the enterprise. Despite the hundreds of management books published each year and the deluge of articles and pundits, there are surprisingly few “meaty” sources of actual methods and templates around which to build concrete assessment and adoption methods.
The challenge here is not to present simply a few ideas or to spin some writings (or a full book!) around them. Rather, we need the templates, checklists, guidances, tools listings, frameworks, methods, test harnesses, codified approaches, scheduling and budgeting constructs, and so forth that take initial excitement and ideas to prototyping and then deployment. These methodological assets take tens to hundreds of person-years to develop. They must also embody the philosophies and approaches consistent with our views and innovations.
Customers like to see the methods and deliverables that assessment and planning efforts can bring to them. But traditional consultancies have been naturally reluctant to share these intellectual assets with the marketplace — unless for a fee. Like many growing small companies before us, Structured Dynamics was thus embarking on systematically building its own assets up, as engagements and time allowed.
Welcome to MIKE2.0 and A Bit of History
I first heard of MIKE2.0 from Alan Morrison of PriceWaterhouseCoopers’ Center for Technology and Innovation and from Steve Ardire, a senior advisor to SD. My first reaction was pretty negative, both because I couldn’t believe anyone would name a methodology after me (hehe) and because I have been pretty cool to the proliferation of version numbers for things other than software or standards.
However, through Alan and Steve’s good offices we were then introduced to two of the leaders of MIKE2.0, Sean McClowry of PWC and then Rob Hillard of Deloitte. Along with BearingPoint, the original initiator and contributor to MIKE2.0, these three organizations and their key principals provide much of the organizational horsepower and resource support to MIKE2.0.
Based on the fantastic support of the community and the resources of MIKE2.0 itself (see concluding section on Why We Like the Framework), we began digging deeper into the MIKE2.0 Web site and its methodology and resources. For the reasons summarized in this article, we were amazed with the scope and completeness of the framework, and very comfortable with its approach to developing working deployments consistent with our own philosophy of incremental expansion and learning.
Method for an Integrated Knowledge Environment (MIKE2.0) is an open source delivery framework for enterprise information management. It provides a comprehensive methodology (747 significant articles so far) that can be applied across a number of different projects within the information management space. While initially focused around structured data, the goal of MIKE2.0 is to provide a comprehensive methodology for any type of information development.
Information development is an approach organizations can apply to treat information as a strategic asset through their complete supply chain: from how it is created, accessed, presented and used in decision-making to how it is shared, kept secure, stored and destroyed. Information development is a key concept of the MIKE2.0 methodology and a central tenet of its philosophy:
The concept of Information Development is based on the premise that due to its complexity, we currently lack the methods, technologies and skills to solve our information management challenges. Many of the techniques in use today are relatively immature and fragmented and the problems keep getting more difficult to solve. This is one of the reasons we see so many problems today and why organizations that manage information well are so successful.
MIKE2.0 is not a framework for general transactional or operational purposes regarding data or records in the enterprise. (Though it does support functions related to analyzing that information.) Rather, MIKE2.0 is geared to the knowledge management or information management environment, with a clear emphasis on enterprise-wide issues, information integration and collaboration.
The MIKE2.0 methodology was initially created by a team from BearingPoint, a leading management and technology consultancy. The project started as “MIKE2”, an internal approach to aid enterprises to improve their information management. The MIKE2 initiative was started in early 2005 and the methodology was brought through a number of release cycles until it reached a mature state in late 2005. “MIKE2.0” involved taking this approach and making it open source and more collaborative. Much of the content of the MIKE2.0 methodology was made available to the open source community in late December 2006. The actual MIKE2.0 Web site and release occurred in 2007.
Anyone can join MIKE2.0, which adheres to an open source and Creative Commons model. Governance of MIKE2.0 is based on a meritocracy model, similar to the principles followed by the Apache Software Foundation.
There is much additional background on MIKE2.0. Also, for an explanation of the rationale for the framework, see the MIKE2.0 article, A New Model for the Enterprise.
A Surprisingly Robust and Complete Framework
MIKE2.0 provides a complete delivery framework for information management projects in the enterprise. The assets associated with this framework first are based on templates and guidelines that can be applied to any information management area. This is a key source of our interest in the framework.
But, there is also real content behind these templates. There is a slate of “solution offerings” geared to most areas of enterprise information management. There are “solution capabilities” that describe the tools and templates by which these solutions need to be specified, planned and tracked. There are frameworks for relating specific vendor and open source tools to each offering. And, there are general strategic and other guidances for how to communicate the current state of the discipline as well as its possible future states.
The next diagram captures some of these major elements:
Perhaps the most important aspect of this framework, however, is the solid guidance it provides for how entirely new solution areas (the semantic enterprise, for example, in Structured Dynamics’ own case) can be expressed and “codified” in ways meaningful to enterprise customers. These frameworks provide a common competency across all areas of enterprise interest in information development and management. For a relatively new and small vendor such as us, this framework provides a credible meeting ground with the market.
A Phased and Incremental Approach to Information Development
The fundamental approach to a MIKE2.0 offering is staged and incremental. This is very much in keeping with Structured Dynamics’ own philosophy, which, more importantly, also is consonant with the phased adoption and expansion of open semantic technologies within the enterprise.
Under the MIKE2.0 framework, the first two phases relate to strategy and assessment. The next three phases (of the five standard ones) produce the first meaningful implementation of the offering. Depending on the maturity of the offering, that may range from a prototype to broader deployment. Thereafter, scale-out and expansion occur via a series of potential increments:
The incremental aspects of the later three phases are not dissimilar from “spiral” deployments common to some government procurements. The truth remains, however, that actual experience is quite limited in later increments, and whether these methodologies can hold over long periods of time is unknown. Despite this caution, most failures occur in the earliest phases of a project. MIKE2.0 has strong framework support in these early phases.
A Broad Spectrum of Capabilities, Assets and Solutions
MIKE2.0 “solutions” are presented as offerings from single ones to a variety of clusters or groupings. These types reflect the real circumstances of applications and deployments at either the departmental or enterprise level. They may range from systematic to those that address specific business and technology problems. Tools and solutions may be work process, human, or technological, proprietary or open.
An overarching purpose of the MIKE2.0 methodology is to couch these variations into a consistent and holistic framework that allows individual or multiple pieces to be combined and inter-related. This consistency is a key to the core objective of information management interoperability across whatever solution profile the enterprise may choose to adopt.
This objective is best expressed via the Overall Implementation Guide. Thus, while detailed aspects of MIKE2.0’s solution offerings may encompass very specific techniques, design patterns and process steps, in combination these pieces can be combined into meaningful wholes.
This spectrum of solution possibilities is organized according to:
These groupings are shown in the diagram below, with the “core” and composite groupings shown in the middle:
These central core and composite groupings, of course, are comprised of more focused and specific solutions. While it is really not the purpose of this piece to describe any of these MIKE2.0 specifics in detail, the next diagram helps illustrate the scope and breadth of the current framework.
Here are the 30 or so individual “core” solution offerings:
These are also accompanied by 8 or so cross-cutting “composite” solutions that reach across many of the core aspects.
Whether core or composite, there is a patterned set of resources, guidances and templates that accompany each solution. The MIKE2.0 Web site and resources are generally organized around these various core or composite solutions.
Why We Like the Framework
MIKE2.0 is a project that walks its talk. Here are some of the reasons why we like the framework and how it is managed, and why we plan to be active participants as it moves forward:
We invite you to learn more about MIKE2.0 and join with us in helping it to continue to grow and mature.
And, oh, as to that aversion to the MIKE2.0 name? Well, with our recent addition of Citizen DAN, it is apparent we are adopting as many boys as we can. Welcome to the family, MIKE2.0!
It is gratifying to see the emergence of the term semantic enterprise, with much increased attention and commentary. But, similar to different styles and patterns in software programming, there is not a single (nor best, depending on circumstance) way to approach becoming a semantic enterprise.
In this piece I contrast two styles. The more traditional and familiar one is comprehensive, complete and “engineered” in its approach. The second, and emerging style, is more adaptive and incremental. While Structured Dynamics is a proponent and thought leader for the adaptive style, the use and applicability of either approach is really a function of objectives and circumstances. The choice of approach depends on use case, and should not be a dogmatic one.
Any time a contrast is posed, one should be on guard about setting up a rhetorical strawman. There may perhaps be a bit of this flavor in this article; if so, it is unintended. It is probably best to realize that there is a gradient — or spectrum — of possible approaches between these contrasting styles. The real message is to understand these differences such that you can comfortably place your own organization at the right points along this spectrum.
A Spectrum of Advantages and Differences
The general idea of semantics in the enterprise precedes the use of the term, having been somewhat captured before by the ideas of enterprise application integration, enterprise information integration and other concepts even related to data federation and data warehousing stretching back to the 1980s. However, as a specific label, we can look back to the first mentions in the late 1990s and more concerted attention beginning from about 2002 or so onward [1]. As another indicator, since 2005 the Semantic Technology Conference has given specific prominence to the enterprise [2].
Throughout this period, the emphasis from academic papers, many vendors, and most pundits [3] has been on things like automated reasoning, machine-aided decision making, aspects of artificial intelligence, and so forth. The general tone is often framed as “revolution” or “massive changes” or something “entirely new.” If you are a consultant or software/implementation vendor, especially where VC money is backing the venture with hopes for big returns and home runs, it may make cynical sense to sell such large and costly change.
I believe there are circumstances where the Semantic Enterprise writ this large may make sense and be financially justified. But, this kind of “big change” view has also seen relatively few visible (or successful) deployments. It has colored what it means to be a semantic enterprise. And, I believe, it has weakened market credibility by perhaps overpromising and underdelivering. The conventional view of what it is to be a semantic enterprise deserves to be balanced.
So, as we balance this understanding of the semantic enterprise to one that is more nuanced, we can contrast the characteristics of the two apposite styles as follows:
[Table contrasting the characteristics of the two styles]
Note we have labeled the conventional approach as the “comprehensive, engineering” style; its contrast, and the one we position more closely to, is the “adaptive, incremental” style.
[Others have posited contrasting styles, most often as "top down" v. "bottom up." However, in one interpretation of that distinction, "top down" means a layer on top of the existing Web [8]. On the other hand, “top down” is more often understood in the sense of a “comprehensive, engineered” view, consistent with my own understanding [9]. Yet no matter which characterization, neither captures what I feel to be the more important considerations of mindset, logic and premise.]
Though the table above contrasts many points, I think there are two main distinctions to the adaptive approach. First, it firmly embraces the open world assumption. OWA is key to an incremental, “learn as you go” deployment that is also well suited to incorporation of external information. The second main distinction is to leverage and build from existing assets.
A Spectrum of Applications
Yet as noted in the opening, which of these approaches makes better sense depends on circumstance. One aspect of circumstance is available budget and deployment times for pilots or proofs-of-concept. Another aspect, of course, is the planned use or application for the deployment.
These are by no means hard distinctions, but in general we can see these contrasting approaches applying to the following uses:
[Table contrasting the applications and uses of the two styles]
A critical distinction is the nature of the enterprise itself. “External-facing” enterprises or functions that want or need to incorporate much external information (say, marketing or competitive intelligence) are advised to look closely at the adaptive approach. Organizations that have more complete control over their circumstances should perhaps focus on the conventional approach.
Adoption Thresholds and Risks
In previous writings I have pointed to the manifest benefits that can accrue to the semantic enterprise [see, esp. 10]. But we also have witnessed nearly a decade of promotion for semantics in the enterprise, with perhaps a lack of progress in some areas or unmet promises in others. These raise questions and skepticism of the real eventual costs and benefits.
I believe some of this skepticism is inherent with anything new — the general IT fatigue from what the current “next great thing” might be. But I also believe that some of this skepticism results from an approach to semantics in the enterprise that is both lengthy to deploy and high cost.
The key advantage of the adaptive, incremental approach is that the whole IT game in the enterprise can change. An open world approach enables adoption as it proves itself and as budgets allow. Commitments made under this approach have, in essence, permanent value. Past fears and concerns about making “wrong” bets no longer apply. With learning, targets can be re-adjusted, structure re-defined and applications re-focused, all as new discoveries and broadening scope dictate.
This does not make the adaptive approach better than the conventional one. But, it does make it less risky and, well, more adaptive.
[1] For example, the earliest Google mentions on “semantic enterprise” date to about 1998 or 1999. In 2002, the University of Georgia and Amit Sheth offered the first known academic course on the Semantic Enterprise; see http://lsdis.cs.uga.edu/SemanticEnterprise/.
[2] See the conference guide for the Semantic Technology Conference 2005. The sixth one, the 2010 Semantic Technology Conference, is upcoming on June 21-25 in San Francisco.
[3] See, for example, Mitchell Ummell, ed., 2009. “The Rise of the Semantic Enterprise,” special dedicated edition of the Cutter IT Journal, Vol. 22(9), 40 pp., September 2009. See http://www.cutter.com/offers/semanticenterprise.html (after filling out contact form). Partially in response to this conventional view, I wrote [10]. In that article I offered as a working definition that “a semantic enterprise is one that adopts the languages and standards of the semantic Web . . . and applies them to the issues of information interoperability, preferably using the best practices of linked data.” That happens to be Structured Dynamics’ preferred definition, though as this posting indicates, there is a spectrum of definitions of the term.
[4] See M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009.
[5] See for example RDF, RDFS, OWL, SKOS and SPARQL and others.
[6] Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which also means that Web architectures can also be readily employed.
[7] We use a basis in description logics for defining the roles and splits in schema and instances. As we define it: “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[8] One article that got quite a bit of play a few years back was A. Iskold, 2007. “Top Down: A New Approach to the Semantic Web,” in ReadWrite Web, Sept. 20, 2007. The problem with this terminology is that it offers a completely different sense of “top down” to traditional uses. In Iskold’s argument, his “top down” is a layering on top of the existing Web.
[9] The more traditional view of “top down” with respect to the semantic Web is in relation to how the system is constructed. This is reflected well in a presentation from the NSF Workshop on DB & IS Research for Semantic Web and Enterprises, April 3, 2002, entitled “The ‘Emergent, Semantic Web: Top Down Design or Bottom Up Consensus?”. Under this view, top down is design and committee-driven; bottom up is more decentralized and based on social processes, which is more akin to Iskold’s “top down.”
[10] M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise,” AI3:::Adaptive Information blog, Sept. 28, 2009.
As we see more collaboration forums emerge, one question that naturally arises is the joint authoring or editing of images. This is particularly important as “official” slide decks or presentations come to the fore.
There are perhaps many different ways to skin this cat. In this article, I describe how to do so using the free, open source SVG editing program, Inkscape.
Why Inkscape?
Like many of you, I have been creating and editing images for years. I am by no means a graphics artist, but images and diagrams have been essential for communicating my work.
Until a few years back, I was totally a bitmap man. I used Paint Shop Pro (bought by Corel in 2004 and getting long in the tooth) and did a lot of copying and pasting.
I switched to Inkscape about two years ago for the following reasons:
Once you have a working image in Inkscape, make sure all collaborators have a copy of the software. Then:
Of course, it is more often the case that not all collaborators may have a copy of Inkscape or that the image began in the SVG format.
The image below began as a Windows Powerpoint clip art file, which has then gone through some modifications. Note the bearded guy’s hand holding the paper is out of registry (because I screwed up in earlier editing, though I can easily fix it because it is a vector image!). Also note we have the border from Inkscape as suggested above. This file, BTW, is people.png, and was created as a PNG after a screen capture from Inkscape:
When beginning in Powerpoint or as clip art, files in the format of Windows metafile (*.wmf) or extended WMF (*.emf) work well. (For example, you can download and play with the native Inkscape format of people.svg, or the people.wmf or people.emf versions of the image above.) If you already have images in a Powerpoint presentation, save in one of these two formats, with (*.emf) preferred. (EMF is generally better for text.)
You can open or load these files directly into Inkscape. Generally, they will come in as a group of vectors; to edit the pieces, you should “ungroup.”
After editing per the instructions in the previous section, if you need to re-insert back into Powerpoint, please use the *.emf format (and make sure you do not save text as paths).
For example, see the following PNG graphic taken from an Inkscape file (figure_text.svg):
We can save it as an EMF (figure_textpath.emf) to a Powerpoint, with the option of converting text to paths:
Or, we can save it as an EMF (figure_text.emf) to a Powerpoint, only this time not converting text to paths and then “ungrouping” once in Powerpoint:
Note the latter option, text not as path, is the far superior one. However, also note that borders are added to the figures and vertical text is rotated 90° back to horizontal. Nonetheless, the figure is fully editable, including text. Also, if the original Inkscape figures are constructed with lines of the same color as fills, the border conversion also works well.
Frankly, especially with text, because there can be orientation and other changes going from Inkscape to Powerpoint, I recommend using Inkscape and its native SVG for all early modifications and to keep a canonical copy of your images. Then, prior to completion of the deck, save as EMF for import into Powerpoint and then clean up. If changes later need to be made to the graphic, I recommend doing so in Inkscape and then re-importing.
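As a rough illustration of the last step in that workflow, here is a minimal Python sketch that builds (but does not run) a command line for exporting the canonical SVG to EMF. The function name `emf_export_command` is my own, and the flags shown are the ones I understand Inkscape 1.x to accept; older releases used different options, so check `inkscape --help` on your own installation before relying on it.

```python
from pathlib import Path


def emf_export_command(svg_path, text_to_path=False, inkscape="inkscape"):
    """Build (but do not run) an Inkscape command that exports an SVG to EMF.

    Assumption: the flags below are those of the Inkscape 1.x command line.
    """
    svg = Path(svg_path)
    emf = svg.with_suffix(".emf")  # e.g. figure_text.svg -> figure_text.emf
    cmd = [inkscape, "--export-type=emf", f"--export-filename={emf}"]
    if text_to_path:
        # Converting text to paths means the text can no longer be edited
        # as text once the figure is in the slide deck.
        cmd.append("--export-text-to-path")
    cmd.append(str(svg))
    return cmd


# Keep text editable (the option recommended above).
print(" ".join(emf_export_command("figure_text.svg")))
```

The list can be handed to `subprocess.run` once you have confirmed the flags for your version. Leaving `text_to_path` at its default is intended to match the recommendation above of not saving text as paths.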
Other Alternatives
I should note there is an option, as well, in Inkscape to convert raster images to vector ones (use Path -> Trace bitmap … and invoke the multiple scans with colors). This is doable, but involves quite a bit of image copying, manipulation and color separation to achieve workable results. For more detail, see Inkscape’s documentation on tracing or, more fully, this reference dealing with color.
Of course, there are likely many other ways to approach these issues of collaboration and sharing. I will leave it to others to suggest and explain those options.
I’m not in the least bit excited about the iPad, and it seems I’m not alone. The mood seems to have changed since before the launch, with countless tech journalists previously falling over themselves to declare tablets the next big thing. (Thankfully Rory Cellan-Jones from the BBC was more measured, focusing on personal projectors as a more exciting development). The mood since is considerably more downbeat, and I think more realistic.
I may be missing some crucial usage context that reveals the killer characteristics of the iPad, but I’ve tried really hard and still nothing. There are many obvious practical issues with the device:
The only scenarios I can conjure up where I could imagine using the device are:
Neither of these, nor even both together, is very compelling. TVs are getting good for viewing photos (by including an SD card slot, for example), and rumours of the death of paper are greatly exaggerated.
Perhaps the most annoying thing about the scenarios used to promote the device is the one about the San Francisco to Tokyo flight, watching video all the way without running out of battery. Any airline with planes worth boarding has personal video screens. I don’t want to bring my own. I’d rather use the space to carry a decent pair of noise-canceling headphones, which I’m sure increase my enjoyment of onboard media far more than a little bit of extra screen real estate. The development I want to see is not a new device that I have to prop on the flimsy airline table, hold tight when we hit some turbulence, and stow away when my food arrives, but the capability to connect my own device to the in-built screen via USB or Bluetooth. Even a bare USB port with power but no connectivity would be a start, allowing me to run low-powered devices (that I already own) during long flights.
OK, so the flight reference is just a touchstone for how long the device can run without mains power, but I think it demonstrates a lack of grounding of the device in realistic scenarios.
Any new device has to have two key characteristics these days for me to get excited: interoperability and convergence. The iPad seems to have very little of either. You could argue that it offers some convergence between smartphones and e-readers, but that’s about as exciting as convergence between a smartphone and a wall clock.
I’m left wondering what the iPad is competing against. I’m guessing it’s paper, whether that’s in the form of a book, brochure, newspaper, restaurant menu or whatever. Unfortunately for Apple, paper is pretty well suited to each of these, especially when you introduce bath water, the risk of theft, or just ketchup, into the equation. Perhaps this is the end of electronic picture frames as a dedicated device? Probably about time. Maybe the iPad will make an excellent Spotify console for the living room. Who knows? Whatever happens I can’t see this becoming a mass-market product worthy of even a fraction of the hype.
Where I wish that Apple had expended their creative talent was in addressing the power issue. Not in making sure I could watch 10 hours of back to back video, but in enabling me to spend that energy in whatever way I choose, powering whichever device I choose. It drives me crazy that I carry several batteries around, and short of running my phone off my laptop via USB there is no interoperability between these power sources. If Apple could produce a universal power supply that was sleek, sexy, efficient and interoperable, then I would be interested. Sadly this doesn’t seem to be the way.
One of the main areas of interest here at the LiDRC is linked government data. In this post, DERI researcher Evangelos Kalampokis surveys recent activity in this area from around the globe.
Well, for another client and another purpose, I was goaded into screening my Sweet Tools listing of semantic Web and -related tools and to assemble stuff from every other nook and cranny I could find. The net result is this enclosed listing of some 140 or so tools — most open source — related to semantic Web ontology building in one way or another.
Ever since I wrote my Intrepid Guide to Ontologies nearly three years ago (and one of the more popular articles of this site, though it is now perhaps a bit long in the tooth), I have been intrigued with how these semantic structures are built and maintained. That interest, in no small measure, is why I continue to maintain the Sweet Tools listing.
As far as I know, the following is the largest and most comprehensive listing of ontology building tools available. I broadly interpret the classification of ‘ontology building’; I include, for example, vocabulary extraction and prompting tools, as well as ontology visualization and mapping.
There are some 140 tools, of which perhaps 90 or so are still in active use. (Given the scope, not every tool could be inspected in detail. Some listed as being perhaps inactive may not be so, and others not in that category perhaps should be.) Of the entire roster of tools, somewhere on the order of 12 to 20 are quite impressive and deserving of local installation, test runs, and close inspection.
There are relatively few tools useful to non-specialists (or useful to engaging knowledgeable publics in the ontology-building exercise). There appear to be key gaps in the entire workflow from domain scoping and initial ontology definition and vocabulary candidates, to longer-term maintenance and revision. For example, spreadsheets would appear to be a possible useful first step in any workflow process (which is why irON is listed), but the spreadsheet tool per se is not listed herein (nor are text editors).
I surely have missed some tools and likely improperly assigned others. Please drop me an email or comment on this post with any revisions or suggestions.
Some Worth A Closer Look
In my own view, there are some tools that definitely deserve a closer look. My favorite candidates, for very different reasons and for very different places in the workflow, are (in no particular order): Apelon DTS, irON, FlexViz, Knoodl, Protégé, diagramic.com, BooWa, COE, ontopia, Anzo, PoolParty, Vine (and voc2rdf), Erca, Graphl, and GrOWL. Each one of these links is more fully described below. Also, all tools in the Vocabulary Prompting Tools category (which also includes extraction) are worth reviewing since all or nearly all have online demos.
Other tools may also be deserving, depending on use case. Some of the more specific analysis and conversion tools, for example, are in the Miscellaneous category.
Also, some purists may quibble with why some tools are listed here (such as inclusion of some stuff related to Topic Maps). Well, my answer to that is there are no real complete solutions, and whatever we can pragmatically do today requires glueing together many disparate parts.
Comprehensive Ontology Tools
Though not all are relevant, see my post from a couple of years back on large-scale RDF graph software.
If you are like me, you like to clear the decks before the start of major new projects. In Structured Dynamics‘ case, we actually have multiple new initiatives getting underway, so the deck clearing has been especially focused this time.
As a result, we have updated Sweet Tools, AI3’s listing of semantic Web and -related tools, with the addition of some 30 new tools, updates to others, and deletions of five expired entries. The dataset now lists 835 tools. And, as before, there is also now a new structured data view via conStruct (pick the Sweet Tools dataset).
We have also updated SWEETpedia, a listing of 246 research articles that use Wikipedia in one way or another to do semantic-Web related research. Some 20 new papers were added to this update.
Please use the comments section on this post to suggest new tools or new research articles for inclusion in future updates.
I just came across a VC blog pondering the value to a start-up of operating in “Stealth Mode” or not. I’ve amusingly come to the conclusion that all of this — particularly the “stealth” giveaway — is so much marketing hype. When a start-up claims they’re coming out of stealth mode, grab your wallet.
The most interesting and telling example I have of this is Rearden Commerce, which was announced in a breathy cover story in InfoWorld in February 2005 about the company and its founder/CEO Patrick Grady. The company has an obvious “in” with the magazine; in 2001 InfoWorld also carried a similar piece on the predecessor company to Rearden, Talaris Corporation.
According to a recent Business Week article, Rearden Commerce and its predecessors reaching back to an earlier company called Gazoo founded in 1999 have raised $67 million in venture capital. While it is laudable the founder has reportedly put his own money into the venture, this venture through its massive funding and high-water mark of 80 employees or so hardly qualifies as “stealth.”
As early as 2001 with the same technology and business model, this same firm was pushing the “stealth” moniker. According to an October 2001 press release:
“The company, under its stealth name Gazoo, was selected by Red Herring magazine as one of its ‘Ten to Watch’ in 2001.” [emphasis added]
Even today, though no longer using the active name Talaris Corporation, it has close to 115,000 citations on Yahoo! Notable VCs such as Charter Ventures, Foundation Capital, JAFCo and Empire Capital have backed it through its multiple incubations.
Holmes Report, a marketing company, provides some insight into how the earlier Talaris was spun in 2001:
“The goal of the Talaris launch was to gain mindshare among key business and IT trade press and position Talaris as a ‘different kind of start-up’ with a multi-tiered business model, seasoned executive team and tested product offering.”
[Hmmm; grind me a pound!]
The Holmes Report documents the analyst firms and leading journals and newspapers to which it made outreach. Actually, this outreach is pretty impressive. Good companies do the same all of the time and that is to be lauded. What is to be questioned, however, is how many “stealths” a cat can have. Methinks this one is one too many.
“Stealth” thus appears to be code for an existing company of some duration that has had disappointing traction and now has new financing, a new name, new positioning, or all of the above. So, interested in a start-up that just came out of stealth mode? Let me humbly suggest standard due diligence.
This Friday brown bag leftover was first placed into the AI3 refrigerator on October 13, 2005. No changes have been made to the original posting, except the [grinding] bit. However, as of last year, Rearden had upped its VC funding to $240 million (can we spell multiple ?). Today, it is now focused on the travel industry. Fly me to the moon!
The beginning of a new year and a new decade is a perfect opportunity to take stock of how the world is changing and how we can change with it. Over the past year I have been writing on many foundational topics relevant to the use of semantic technologies in enterprises.
In this post I bring those threads together to present a unified view of these foundations — some seven pillars — to the open semantic enterprise.
By open semantic enterprise we mean an organization that uses the languages and standards of the semantic Web, including RDF, RDFS, OWL, SPARQL and others to integrate existing information assets, using the best practices of linked data and the open world assumption, and targeting knowledge management applications. It does so using some or all of the seven foundational pieces (“pillars”) noted herein.
The foundational approaches to the open semantic enterprise do not necessarily mean open data nor open source (though they are suitable for these purposes with many open source tools available [3]). The techniques can equivalently be applied to internal, closed, proprietary data and structures. The techniques can themselves be used as a basis for bringing external information into the enterprise. ‘Open’ is in reference to the critical use of the open world assumption.
These practices do not require replacing current systems and assets; they can be applied equally to public or proprietary information; and they can be tested and deployed incrementally at low risk and cost. The very foundations of the practice encourage a learn-as-you-go approach and active and agile adaptation. While embracing the open semantic enterprise can lead to quite disruptive benefits and changes, it can be accomplished as such with minimal disruption in itself. This is its most compelling aspect.
Like any change in practice or learning, embracing the open semantic enterprise is fundamentally a people process. This is the pivotal piece to the puzzle, but also the one that does not lend itself to ready formula about pillars or best practices. Leadership and vision are necessary to begin the process. People are the fuel for impelling it. So, we’ll take this fuel as a given below, and concentrate instead on the mechanics and techniques by which this vision can be achieved. In this sense, then, there are really eight pillars to the open semantic enterprise, with people residing at the apex.
This article is synthetic, with links to (largely) my preparatory blog postings and topics that preceded it. Assuming you are interested in becoming one of those leaders who wants to bring the benefits of an open semantic enterprise to your organization, I encourage you to follow the reference links for more background and detail.
A Review of the Benefits
OK, so what’s the big deal about an open semantic enterprise and why should my organization care?
We should first be clear that the natural scope of the open semantic enterprise is in knowledge management and representation [1]. Suitable applications include data federation, data warehousing, search, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth [2]. In the knowledge domain, the benefits for embracing the open semantic enterprise can be summarized as greater insight with lower risk, lower cost, faster deployment, and more agile responsiveness.
The intersection of knowledge domain, semantic technologies and the approaches herein means it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.
There is absolutely no need to abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives.
Embracing the pillars of the open semantic enterprise brings these knowledge management benefits:
Moreover, by building on successful Web architectures, we can also put in place loosely coupled, distributed systems that can grow and interoperate in a decentralized manner. These also happen to be perfect architectures for flexible collaboration systems and networks.
These benefits arise both from individual pillars in the open semantic enterprise foundation, as well as in the interactions between them. Let’s now re-introduce these seven pillars.
Pillar #1: The RDF Data Model
As I stated on the occasion of the 10th birthday of the Resource Description Framework data model, I believe RDF is the single most important foundation to the open semantic enterprise [4]. RDF can be applied equally to all structured, semi-structured and unstructured content. By defining new types and predicates, it is possible to create more expressive vocabularies within RDF. This expressiveness enables RDF to define controlled vocabularies with exact semantics. These features make RDF a powerful data model and language for data federation and interoperability across disparate datasets.
Via various processors or extractors, RDF can capture and convey the metadata or information in unstructured (say, text), semi-structured (say, HTML documents) or structured sources (say, standard databases). This makes RDF almost a “universal solvent” for representing data structure.
Because of this universality, there are now more than 150 off-the-shelf ‘RDFizers’ for converting various non-RDF notations (data formats and serializations) to RDF [5]. Because of its diversity of serializations and simple data model, it is also easy to create new converters. Once in a common RDF representation, it is easy to incorporate new datasets or new attributes. It is also easy to aggregate disparate data sources as if they came from a single source. This enables meaningful compositions of data from different applications regardless of format or serialization.
What this practically means is that the integration layer can be based on RDF, but that all source data and schema can still reside in their native forms [6]. If it is easier or more convenient to author, transfer or represent data in non-RDF forms, great [7]. RDF is only necessary at the point of federation, and not all knowledge workers need be versed in the framework.
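To make the federation point concrete, here is a minimal sketch in plain Python (no RDF library) that treats each fact as a subject-predicate-object triple and merges two differently shaped sources into one set. The `http://example.org/` namespace, the helper names and the sample records are all hypothetical; a real deployment would use an RDF toolkit and properly defined vocabularies.

```python
# Hypothetical namespace; any URI scheme the enterprise controls would do.
BASE = "http://example.org/"


def row_to_triples(subject_id, row):
    """Turn one flat record (say, a database or spreadsheet row) into triples."""
    subject = BASE + "id/" + subject_id
    return {(subject, BASE + "prop/" + key, value) for key, value in row.items()}


def describe(graph, subject):
    """Return every property/value pair known for a subject, from any source."""
    return sorted((p, o) for s, p, o in graph if s == subject)


# Source A: a row from a relational "customers" table.
crm_row = {"name": "Acme Corp", "city": "Iowa City"}
# Source B: a record from a separate billing spreadsheet, with a new attribute.
billing_row = {"name": "Acme Corp", "credit_limit": 50000}

# Federation is just set union: duplicate facts collapse, new attributes appear.
graph = row_to_triples("acme", crm_row) | row_to_triples("acme", billing_row)

for prop, value in describe(graph, BASE + "id/acme"):
    print(prop, "->", value)
```

Note that neither source had to change its native form: the conversion happens only at the point of federation, and adding the `credit_limit` attribute required no schema migration.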
Pillar #2: Linked Data Techniques
Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which also means that Web architectures can be readily employed (see Pillar #5 below).
Linked data is applicable to public or enterprise data, open or proprietary. It is really straightforward to employ. Structured Dynamics has published a useful FAQ on linked data.
Additional linked data best practices relate to how to characterize and classify data, especially in the use of predicates with the proper semantics for establishing the degree of relatedness for linked data items from disparate sources.
Linked data has been a frequent topic of this blog, including how adding linkages creates value for existing data, with a four-part series about a year ago on linked data best practices [8]. As advocated by Structured Dynamics, our linked data best practices are geared to data interconnections, interrelationships and context that is equally useful to both humans and machine agents.
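As a small sketch of two of these practices, the Python below mints HTTP URIs for data objects and picks a linking predicate according to how strongly two items appear to be related. The predicate URIs are the standard OWL, SKOS and RDFS ones; the `mint_uri` and `link_predicate` helpers, the `example.org` URIs and the numeric thresholds are purely illustrative assumptions, not a recommended calibration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def mint_uri(base, dataset, local_id):
    """Name a data object with an HTTP URI (hypothetical naming scheme)."""
    return f"{base.rstrip('/')}/{dataset}/{local_id}"


def link_predicate(confidence):
    """Choose a predicate whose semantics match the degree of relatedness.

    The thresholds are arbitrary placeholders for illustration only.
    """
    if confidence >= 0.99:
        return OWL_SAME_AS       # asserts the two URIs name the same thing
    if confidence >= 0.80:
        return SKOS_CLOSE_MATCH  # similar enough for many uses, not identical
    return RDFS_SEE_ALSO         # merely points to related information


ours = mint_uri("http://example.org/", "customers", "acme")
theirs = "http://data.example.com/company/acme-corp"  # hypothetical external URI
triple = (ours, link_predicate(0.85), theirs)
print(triple)
```

The point of the graded choice is that an overly strong assertion such as `owl:sameAs` carries identity semantics that weaker matches do not warrant.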
Pillar #3: Adaptive Ontologies
Ontologies are the guiding structures for how information is interrelated and made coherent using RDF and its related schema and ontology vocabularies, RDFS and OWL [10]. Thousands of off-the-shelf ontologies exist (a minority of which are suitable for re-use), and new ones appropriate to any domain or scope at hand can be readily constructed.
In standard form, semantic Web ontologies may range from the small and simple to the large and complex, and may perform the roles of defining relationships among concepts, integrating instance data, orienting to other knowledge and domains, or mapping to other schema [11]. These are explicit uses in the way that we construct ontologies; we also believe it is important to keep concept definitions and relationships expressed separately from instance data and their attributes [9].
But, in addition to these standard roles, we also look to ontologies to stand on their own as guiding structures for ontology-driven applications (see next pillar). With a relatively few minor and new best practices, ontologies can take on the double role of informing user interfaces in addition to standard information integration.
In this vein we term our structures adaptive ontologies [11,12,13]. Some of the user interface considerations that can be driven by adaptive ontologies include: attribute labels and tooltips; navigation and browsing structures and trees; menu structures; auto-completion of entered data; contextual dropdown list choices; spell checkers; online help systems; etc. Put another way, what makes an ontology adaptive is to supplement the standard machine-readable purpose of ontologies to add human-readable labels, synonyms, definitions and the like.
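Here is a minimal sketch of how human-readable annotations in an ontology might drive two of those interface features, auto-completion and tooltips. The dictionary layout, the `ex:` identifiers and the function names are assumptions made for illustration; an actual adaptive ontology would carry these annotations in RDFS/OWL or SKOS properties.

```python
# A toy "adaptive ontology": each concept carries human-readable annotations.
ONTOLOGY = {
    "ex:Organization": {
        "prefLabel": "Organization",
        "altLabels": ["Company", "Firm"],
        "definition": "A group of people with a shared purpose and structure.",
    },
    "ex:Person": {
        "prefLabel": "Person",
        "altLabels": ["Individual", "Contact"],
        "definition": "A human being.",
    },
    "ex:Contract": {
        "prefLabel": "Contract",
        "altLabels": ["Agreement"],
        "definition": "A binding arrangement between two or more parties.",
    },
}


def autocomplete(prefix, ontology):
    """Suggest preferred labels whose label or any synonym starts with prefix."""
    prefix = prefix.lower()
    hits = set()
    for concept in ontology.values():
        labels = [concept["prefLabel"]] + concept["altLabels"]
        if any(label.lower().startswith(prefix) for label in labels):
            hits.add(concept["prefLabel"])
    return sorted(hits)


def tooltip(concept_id, ontology):
    """Build tooltip text from the concept's label and definition."""
    concept = ontology[concept_id]
    return f"{concept['prefLabel']}: {concept['definition']}"


print(autocomplete("co", ONTOLOGY))
print(tooltip("ex:Person", ONTOLOGY))
```

Because synonyms live in the ontology rather than in the code, a knowledge worker can improve the suggestions simply by editing labels.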
A neat trick occurs with this slight expansion of roles. The knowledge management effort can now shift to the actual description, nature and relationships of the information environment. In other words, ontologies themselves become the focus of effort and development. The KM problem no longer needs to be abstracted to the IT department or third-party software. The actual concepts, terminology and relations that comprise coherent ontologies now become the explicit focus of KM activities.
Any existing structure (or multiples thereof) can become a starting basis for these ontologies and their vocabularies, from spreadsheets to naïve data structures and lists and taxonomies. So, while producing an operating ontology that meets the best practice thresholds noted herein has certain requirements, kicking off or contributing to this process poses few technical or technology demands.
The skills needed to create these adaptive ontologies are logic, coherent thinking and domain knowledge. That is, any subject matter expert or knowledge worker likely has the necessary skills to contribute to useful ontology development and refinement. With adaptive ontologies powering ontology-driven apps (see next), we thus see a shift in roles and responsibilities away from IT to the knowledge workers themselves. This shift acts to democratize the knowledge management function and flatten the organization.
Pillar #4: Ontology-driven Applications
The complement to adaptive ontologies is ontology-driven applications. By definition, ontology-driven apps are modular, generic software applications designed to operate in accordance with the specifications contained in an adaptive ontology. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies, as supplemented by the human and user interface roles noted above [11,12,13].
Ontology-driven apps fulfill specific generic tasks. Examples of current ontology-driven apps include imports and exports in various formats, dataset creation and management, data record creation and management, reporting, browsing, searching, data visualization, user access rights and permissions, and similar. These applications provide their specific functionality in response to the specifications in the ontologies fed to them.
The applications are designed more similarly to widgets or API-based frameworks than to the dedicated software of the past, though the dedicated functionality (e.g., graphing, reporting, etc.) is obviously quite similar. The major change in these ontology-driven apps is to accommodate a relatively common abstraction layer that responds to the structure and conventions of the guiding ontologies. The major advantage is that single generic applications can supply shared functionality based on any properly constructed adaptive ontology.
This design thus limits software brittleness and maximizes software re-use. Moreover, as noted above, it shifts the locus of effort from software development and maintenance to the creation and modification of knowledge structures. The KM emphasis can shift from programming and software to logic and terminology [12].
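The idea of one generic application serving any properly constructed ontology can be sketched in a few lines of Python. Here a single reporting function knows nothing about people or products; it only follows whatever attribute specifications it is given. The ontology structure and all names are hypothetical simplifications.

```python
def render_record(record, ontology):
    """Generic report: display a record using labels and order from the ontology."""
    lines = [ontology["title"]]
    for attr, spec in ontology["attributes"].items():
        if attr in record:
            lines.append(f"  {spec['label']}: {record[attr]}{spec.get('unit', '')}")
    return "\n".join(lines)


# Two unrelated domains, described only by data, not by code.
people_ontology = {
    "title": "Person",
    "attributes": {
        "name": {"label": "Full name"},
        "dept": {"label": "Department"},
    },
}
product_ontology = {
    "title": "Product",
    "attributes": {
        "sku": {"label": "Stock code"},
        "weight": {"label": "Weight", "unit": " kg"},
    },
}

print(render_record({"name": "Ada Lovelace", "dept": "R&D"}, people_ontology))
print(render_record({"sku": "X-100", "weight": 2.5}, product_ontology))
```

Supporting a third domain means writing a new ontology, not new software, which is the shift in effort described above.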
Pillar #5: A Web-oriented Architecture
A Web-oriented architecture (WOA) is a subset of the service-oriented architectural (SOA) style, wherein discrete functions are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner. WOA uses the representational state transfer (REST) style. REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it [14].
REST and WOA stand in contrast to earlier Web service styles that are often known by the WS-* acronym (such as WSDL, etc.). WOA has proven itself to be highly scalable and robust for decentralized users since all messages and interactions are self-contained.
Enterprises have much to learn from the Web’s success. WOA has a simple design with REST and idempotent operations, simple messaging, distributed and modular services, and simple interfaces. It has a natural synergy with linked data via the use of URI identifiers and the HTTP transport protocol. As we see with the explosion of searchable dynamic databases exposed via the Web, so too can we envision the same architecture and design providing a distributed framework for data federation. Our daily experience with browser access of the Web shows how incredibly diverse and distributed systems can meaningfully interoperate [15].
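To show what simple interfaces and idempotent operations look like in practice, here is a toy, in-memory sketch of a REST-style resource handler in Python. It involves no network or real HTTP server; the `ResourceStore` class, the URI and the choice of status codes are illustrative assumptions only.

```python
class ResourceStore:
    """A toy in-memory stand-in for resources addressed by URI."""

    def __init__(self):
        self._resources = {}

    def handle(self, method, uri, body=None):
        """Apply a uniform method to a URI; return (status_code, representation)."""
        if method == "GET":
            if uri in self._resources:
                return 200, self._resources[uri]
            return 404, None
        if method == "PUT":
            created = uri not in self._resources
            self._resources[uri] = body  # full replacement, so repeats are harmless
            return (201 if created else 200), body
        if method == "DELETE":
            self._resources.pop(uri, None)  # deleting twice leaves the same state
            return 204, None
        return 405, None  # method not supported in this sketch


store = ResourceStore()
uri = "/datasets/sweet-tools"  # hypothetical resource URI
print(store.handle("PUT", uri, {"tools": 835}))
print(store.handle("PUT", uri, {"tools": 835}))  # repeated PUT: same end state
print(store.handle("GET", uri))
```

Each request carries everything needed to process it, so any number of clients can retry a PUT or DELETE safely without coordinating with one another.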
This same architecture has worked beautifully in linking documents; it is now pointing the way to linking data; and we are seeing but the first phases of linking people and groups together via meaningful collaboration. While generally based on only the most rudimentary basis of connections, today’s social networking platforms are changing the nature of contacts and interaction.
The foundations herein provide a basis for marrying data and documents in a design geared from the ground up for collaboration. These capabilities are proven and deployable today. The only unclear aspects will be the scale and nature of the benefits [16].
Pillar #6: An Incremental, Layered Approach
To this point, you’ll note that we have been speaking in what are essentially “layers”. We began with existing assets, both internal and external, in many diverse formats. These are then converted or transformed into RDF-capable forms. These various sources are then exposed via a WOA Web services layer for distributed and loosely-coupled access. Then, we integrate and federate this information via adaptive ontologies, which then can be searched, inspected and managed via ontology-driven apps. We have presented this layered architecture before [13], and have also expressed this design in relation to current Structured Dynamics’ products [17].
A slight update of this layered view is presented below, made even more general for the purposes of this foundational discussion:
Semantic technology does not change or alter the fact that most activities of the enterprise are transactional, communicative or documentary in nature. Structured, relational data systems for transactions or records are proven, performant and understood. On its very face, it should be clear that the meaning of these activities — their semantics, if you will — is by nature an augmentation or added layer to how to conduct the activities themselves.
This simple truth affirms that semantic technologies are not a starting basis, then, for these activities, but a way of expressing and interoperating their outcomes. Sure, some semantic understanding and common vocabularies at the front end can help bring consistency and a common language to an enterprise’s activities. This is good practice, and the more that can be done within reason while not stifling innovation, all the better. But we all know that the budget department and function has its own way of doing things separate from sales or R&D. And that is perfectly OK and natural.
Clearly, then, an obvious benefit to the semantic enterprise is to federate across existing data silos. This should be an objective of the first semantic “layer”, and to do so in a way that leverages existing information already in hand. This approach is inherently incremental; if done right, it is also low cost and low risk.
Pillar #7: The Open World Mindset
As these pillars took shape in our thinking and arguments over the past year, an elusive piece seemed always to be missing. It was like having one of those meaningful dreams, and then waking up in the morning racking your memory trying to recall that essential, missing insight.
As I most recently wrote [1], that missing piece for this story is the open world assumption (OWA). I argue that this somewhat obscure concept holds within it the key as to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.
Enterprises have been captive to the mindset of traditional relational data management and its (most often unstated) closed world assumption (CWA). Given the success of relational systems for transaction and operational systems — applications for which they are still clearly superior — it is understandable and not surprising that this same mindset has seemed logical for knowledge management problems as well. But knowledge and KM are by their nature incomplete, changing and uncertain. A closed-world mindset carries with it certainty and logic implications not supportable by real circumstances.
This is not an esoteric point, but a fundamental one. How one thinks about the world and evaluates it is pivotal to what can be learned and how and with what information. Transactions require completeness and performance; insight requires drawing connections in the face of incompleteness or unknowns.
The absolute applicability of the semantic Web stack to an open-world circumstance is the elephant in the room [1]. By itself, the open world mindset provides no assurance of gaining insight or wisdom. But, absent it, we place thresholds on information and understanding that may neither be affordable nor achievable with traditional, closed-world approaches.
And, by either serendipity or some cosmic beauty, the open world mindset also enables incremental development, testing and refinement. Even if my basic argument of the open world advantage for knowledge management purposes is wrong, we can test that premise at low cost and risk. So, within available budget, pick a doable proof-of-concept, and decide for yourself.
The Foundations for the Open Semantic Enterprise
The seven pillars above are not magic bullets and each is likely not absolutely essential. But, based on today’s understandings and with still-emerging use cases being developed, we can see our open semantic enterprise as resulting from the interplay of these seven factors:
Thirty years of disappointing knowledge management projects and much wasted money and effort demand that better ways be found. On the other hand, until recently, too much of the semantic Web discussion has been either revolutionary (“change everything!!”) or argued from pie-in-the-sky bases. Something needs to give.
Our work over the past few years, and especially as focused in the last 12 months, tells us that meaningful semantic Web initiatives can be mounted in the enterprise with potentially huge benefits, all at manageable risks and costs. These seven pillars point the way to how this might happen. What is now required is that eighth pillar: you.
[1] See, M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room”, AI3:::Adaptive Information blog, December 21, 2009.
[2] In most instances, semantic technologies are poorly suited to transactional or operational applications. Also, there are instances in modeling specific closed-world domains where ontologies can be quite useful, such as in aerospace, petrochemicals, engineering, etc., where the scope of the domain can be precisely bounded and defined. Such efforts tend to be high cost with lengthy lead times. There are vendors who support efforts in these areas, though my company, Structured Dynamics, does not. Our focus and the more generally suitable case for semantic technologies we believe is in knowledge representation and management.
[3] The standard Sweet Tools listing on my AI3:::Adaptive Information blog contains more than 800 semantic Web and related tools, most of which are open source, which can be inspected via filtered and faceted search.
[4] See, M.K. Bergman, 2009. “Advantages and Myths of RDF”, AI3:::Adaptive Information blog, April 8, 2009.
[5] For example, see this listing of more than 150 specific format options available as open source. These converters can also work directly with major application APIs.
[6] For an expansion on RDF as a canonical data model, see further M.K. Bergman, 2009. “Structure the World”, AI3:::Adaptive Information blog, August 3, 2009.
[7] For example, for dataset authoring, Structured Dynamics has developed irON, an instance record and object notation that can be serialized as JSON (called irJSON), XML (called irXML) or comma-separated values (or CSV comma-delimited files, called commON). The purpose of these notations is to provide easier authoring environments and scripting support to RDF-ready datasets. The advantage is to shield users from the nuances of RDF. The design of commON is especially geared to using spreadsheets as authoring environments for instance record tables or simple outline structures. See further the irON specification.
[8] For a general listing of linked data articles, please see that category on this AI3:::Adaptive Information blog. Specific articles of interest include the four-part series on “Making Linked Data Reasonable Using Description Logics” [9] (February 11, February 15, February 18 and February 23, 2009) and the “The Law of Linked Data” (October 11, 2009).
[9] Our best practices approach makes explicit splits between the “ABox” (for instance data) and “TBox” (for ontology schema) in accordance with our working definition for description logics, a fundamental underpinning for how we use RDF: “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[10] Those unfamiliar with the term ontology might be interested in my first introduction to the subject: M.K. Bergman, 2007. “An Intrepid Guide to Ontologies”, AI3:::Adaptive Information blog, May 16, 2007.
[11] See M.K. Bergman, 2009. “Ontologies as the ‘Engine’ for Data-Driven Applications”, AI3:::Adaptive Information blog, June 10, 2009. This is the most detailed explanation, but the specific term adaptive ontology was not yet used. The first dedicated focus on adaptive ontologies was in “Confronting Misconceptions with Adaptive Ontologies” (August 17, 2009). See also [12] and [13].
[12] See, M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies”, AI3:::Adaptive Information blog, November 23, 2009.
[13] See, M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise”, AI3:::Adaptive Information blog, September 28, 2009.
[14] See, M.K. Bergman, 2009. “A General Web-oriented Architecture (WOA) for Structured Data”, AI3:::Adaptive Information blog, May 3, 2009. Also, see the related WOA category for other articles in this area.
[15] See, M.K. Bergman, 2008. “WOA: A New Enterprise Partner for Linked Data”, AI3:::Adaptive Information blog, October 12, 2008.
[16] See, M.K. Bergman, 2009. “structWSF: A Framework for Collaboration Networks”, AI3:::Adaptive Information blog, July 2, 2009.
[17] See http://structureddynamics.com/products.html for a general descriptive illustration of Structured Dynamics’ product stack. There is also a longer slideshow, with particular reference to slide #37.

Structured Dynamics and its Citizen DAN project have been selected as one of the finalists to proceed with a formal proposal for the 2010 $5 million Knight News Challenge. The proposal extends SD’s basic structWSF and conStruct Drupal frameworks to provide a data appliance and network (DAN) to support citizen journalists with data and analysis at the local, community level.
Thanks to all of you who submitted votes in support of the earlier draft proposal. The News Challenge received 2,489 proposals for the 2010 contest, according to Jose Zamora, journalism program associate at the Knight Foundation. According to the Nieman Journalism Lab, Zamora said 65 percent of proposals came through the closed category and 35 percent were open.
The next-round full proposals are due by January 31. Eventual winners are slated to be announced around mid-June 2010.
According to iProspect, about 56 percent of users use search engines every day, based on a population of which more than 70 percent use the Internet more than 10 hours per week.[1] The average knowledge worker spends 2.3 hrs per day — or about 25% of work time — searching for critical job information.[2] IDC estimates that enterprises employing 1,000 knowledge workers may waste well over $6 million per year each in searching for information that does not exist, failing to find information that does, or recreating information that could have been found but was not.[3]
Vendors and customers often use time savings by knowledge workers as a key rationale for justifying a document or content initiative. This comes about because many studies over the years have noted that white collar employees spend a consistent 20% to 25% of their time seeking information. The premise is that more effective search will save time and drop these percentages. For example, EDS has suggested that improvements of 50 percent in the time spent searching for data can be achieved through improved consolidation and access to data.[4]
Using these premises, consultants often calculate that every 1% reduction in the total work time devoted to search works out illustratively on a fully burdened basis as a big cost savings benefit:
$50,000 (base salary) * 1.8 (burden rate) * 1.0% = $900 per employee per year
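The consultant's arithmetic can be reproduced in a few lines of Python. This is only a sketch: the function name is my own, and the 1,000-person headcount is an illustrative assumption borrowed from the IDC example above, not a figure from any of the cited consultants.

```python
def search_time_savings(base_salary, burden_rate, reduction_pct):
    """Annual cost attributed to `reduction_pct` percent of one employee's work time."""
    return base_salary * burden_rate * reduction_pct / 100.0


# Figures from the formula above: $50,000 salary, 1.8 burden rate, 1% reduction.
per_employee = search_time_savings(50_000, 1.8, 1.0)

# Hypothetical headcount, used only to show how the claim scales up.
headcount = 1_000
firm_total = per_employee * headcount

print(f"${per_employee:,.0f} per employee")
print(f"${firm_total:,.0f} for {headcount:,} employees")
```

The multiplication is trivially correct. The weak point is the premise that the 1% is actually freed up rather than reallocated to more searching.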
Beware such facile analysis!
The fact that many studies over the years have noted white collar employees spend a consistent 20% to 25% of their time devoted to search suggests it is the “satisficing” allocation of time to information search. (In other words, knowledge workers are willing to devote a quarter of their time to finding relevant information; the remainder for analysis and documentation.)
Thus, while better tools to aid better discovery may lead to finding better information and making better decisions more productively — an important justification in itself — there may not result a strict time or labor savings from more efficient search.[5] Be careful of justifying project expenditures based on “time savings” related to search. Search is likely to remain the “25% solution.” The more relevant question is whether the time that is spent on search produces better information or not.
This Friday brown bag leftover was first placed into the AI3 refrigerator on September 14, 2005. No changes have been made to the original posting.
[1] iProspect Corporation, iProspect Search Engine User Attitudes, April/May 2004, 28 pp. See http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf.
[2] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.
[3] C. Sherman and S. Feldman, “The High Cost of Not Finding Information,” International Data Corporation Report #29127, 11 pp., April 2003.
[4] M. Doyle, S. Garmon, and T. Hoglund, “Make Your Portal Deliver: Building the Business Case and Maximizing Returns,” EDS White Paper, 10 pp., 2003.
[5] M.E.D. Koenig, “Time Saved – a Misleading Justification for KM,” KMWorld Magazine, Vol 11, Issue 5, May 2002. See http://www.kmworld.com/publications/magazine/index.cfm.

In speaking of the semantic Web, it is not infrequent that the open world assumption (OWA) gets mentioned. What this post argues is that this somewhat obscure concept may hold within it the key as to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.
This is a fairly bold assertion. In order to support it, we first need to look to the logic and mindset assumptions associated with traditional relational data management and the semantic Web. We then need to look to the nature of knowledge itself and its relation to data federation. It is in this intersection that the key to decades of faulty premises may reside.
The main argument is that the closed world assumption (CWA) and its prevalent mindset in traditional database systems have hindered the ability of enterprises and the vendors that support them to adopt incremental, low-risk means to knowledge systems and management. CWA, in turn, has led to over-engineered schema, too-complicated architectures and massive specification efforts that have led to high deployment costs, blown schedules and brittleness.
The good news is that abandoning these failed practices and embracing the open world approach can be done immediately based on existing assets. Simply shifting from the closed world to open world premise can, I argue, improve the odds for enterprise IT success in these areas.
It is time to meet the elephant in the room.
Scope and Some Root Causes of Enterprise IT Failures
It is, of course, a bit of editorial hyperbole to label most enterprise initiatives in business intelligence and knowledge management as being failures over the past few decades. And, insofar as failures have occurred, I also do not believe they are the result of vendor greed or cynicism, or IT management mistakes or incompetence. Rather, I believe the fault resides in the attempt to pound a square peg (relational model) into a round hole (knowledge representation).
The scope of these failures is not known. We have seen anecdotal claims of trillions of dollars in annual losses due to IT project failures worldwide; failure rates for major IT projects in the 65% to 80% ranges; and analysis of waste and failures in individual firms that are fairly eye-popping [1]. The real point of this post is not to try to quantify these problems. However, in my many years within IT it has been a common perception and concern that many, if not most, large-scale information technology deployments have disappointed in one way or another.
These disappointments range from cost overruns, to late delivery, to unmet objectives, or to low user acceptance. Many initiatives are simply cancelled before any such metrics can be documented. Whatever the absolute quantification, I think most experienced IT managers and executives would agree that these failures and disappointments have been all too commonplace.
“Business Intelligence projects are famous for low success rates, high costs and time overruns. The economics of BI are visibly broken, and have been for years. Yet BI remains the #1 technology priority according to Gartner.” [2]
Why might this be?
I truly believe the reasons for these disappointments do not reside in bad faith or incompetence. The potential importance of IT knowledge projects to improve competitive position, lower costs, or aid innovation for new markets is understood by all. Dilbert aside, I find it simply incomprehensible that disappointments or failures are rooted in these causes.
Rather, I suspect the root cause resides in the success of the relational model in the enterprise.
As transaction systems and for modeling narrowly bounded and structured domains (such as products, inventory or customer lists), the relational model and its proven and optimized RDBMSs and SQL query language have been resounding successes. It is natural to take a successful approach and try to extend it to other areas.
However, beginning with data warehouses in the 1980s, business intelligence (BI) systems in the 1990s, and the general issue of most enterprise information being bound up in documents for decades, the application of the relational model to these areas has been disappointing.
The reasons for this do not reside in areas such as storage or hardware; these areas have seen remarkable improvements over the decades. Rather, the problem resides in the nature of the relational model itself, and its lack of suitability to knowledge-based problems.
Technical Aspects of OWA, Broadly Defined
I have noted the importance of the open world assumption to the semantic enterprise in many of my more recent posts [3,4]. But I, like many others, often refer to the open world assumption with facile summaries, such as that a lack of information does not imply the missing information to be false. Yet to fully understand the implications of OWA and many of its associated assumptions, it is necessary to delve deeper.
I am using here a shorthand that poses the closed world assumption (CWA) vs. the open world assumption (OWA). Actually, the data models behind these approaches (Datalog or non-monotonic logic in the case of CWA; monotonic in the case of OWA [5]; OWA is also firmly grounded in description logics [4]) tend to be coupled with a few other assumptions. I use the shorthand of relational approach vs. (open) semantic Web approach to contrast these two models.
There are instances where the relational model can embrace the open world assumption (for example, the null in SQL) and there are instances where semantic Web approaches can be closed world (as with frame logic or Prolog or other special considerations; see conclusion). But, as generally applied and as generally understood, this contrast between typical relational practice and the semantic Web (based on RDF and OWL) tends to hold.
From a theoretical standpoint, I have found the treatment of Patel-Schneider and Horrocks [6] to be most useful in comparing these approaches. However, the Description Logics Handbook and some other varied sources are also helpful [7,5]. Many of the technical aspects summarized in the table below are from these sources; I refer you to these sources for more informed technical discussions:
Relational Approach: Closed World Assumption (CWA)
That which is not known to be true is presumed to be false; it needs to be explicitly stated as true. Negation as failure (NAF) is a related assumption, since it assumes as false every predicate that cannot be proven to be true. Under CWA, any statement not known to be true is false. Everything is prohibited until it is permitted.
(Open) Semantic Web Approach: Open World Assumption (OWA)
The lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Everything is permitted until it is prohibited.

Relational Approach: Unique Name Assumption (UNA)
The unique name assumption (UNA) holds that different names always refer to different entities in the world.
(Open) Semantic Web Approach: Duplicate Labels Allowed
OWL allows different synonym labels to be used for the same object; same names may refer to different objects. Identity assertions must be explicitly stated.

Relational Approach: Complete Information
The data system at hand is assumed to be complete. (Missing information is often handled via the null statement in SQL, but that has been controversial and contentious in its own right.) This is also known as the domain-closure assumption.
(Open) Semantic Web Approach: Incomplete Information
A central tenet of OWA is that information is incomplete. A corollary is that the attributes of specific objects or instances may also be incomplete or partially known.

Relational Approach: Single Schema (one world)
A single schema is necessary to define the scope and interpretation of the world (domain at hand).
(Open) Semantic Web Approach: Many World Interpretations
Schema and data instance assertions are kept separate. Multiple interpretations (worlds) for the same data are possible.

Relational Approach: Integrity Constraints
Integrity constraints prevent “incorrect” values from being asserted in the relational model. They are useful for validation/parsing/data input and are related to the single model that contains only the facts asserted. Strict cardinality is used for checking validation.
(Open) Semantic Web Approach: Logical Axioms (restrictions)
Logical axioms provide restrictions through property domains and ranges. Everything can be true unless proven otherwise, and multiple possible models can satisfy the axioms. This provides more powerful inferencing, though it can also be unintuitive at times. Cardinality and range restrictions exhibit different behavior for objects (inferred) or datatypes.

Relational Approach: Non-monotonic Logic
The set of conclusions warranted on the basis of a given knowledge base does not increase (in fact, it likely shrinks) with the size of the knowledge base [5].
(Open) Semantic Web Approach: Monotonic Logic
The hypotheses of any derived fact may be freely extended with additional assumptions. Additional assertions narrow the set of possible interpretations (worlds) but cannot invalidate an existing entailment. A new piece of knowledge cannot reduce what is known [5]. New knowledge can arise through inference.

Relational Approach: Fixed and Brittle
Changing the schema requires re-architecting the database; not inherently extensible.
(Open) Semantic Web Approach: Reusable and Extensible
Designed from the ground up to reuse existing ontologies (axioms) and to be extensible. Database design and management can be more agile, with schema evolving incrementally.

Relational Approach: Flat Structure; Strong Typing
Information organized into flat tables; linkages and connections between tables based on foreign keys or joins. Strong data typing orientation.
(Open) Semantic Web Approach: Graph Structure; Open Typing
Inherent graph structure, supportive of linkage and connectivity analysis. Datatypes are inherently loose, though axioms can add strong types. Datatypes treated in the same way as classes, and datatype values are treated in the same way as individual identifiers (i.e., a data value is treated as referring to an object).

Relational Approach: Querying and Tooling
SQL and query optimizers well developed. Tooling well developed. Disjunction not supported; negation must be accommodated through approaches such as NAF. Sums and counts are easier due to unique name premise. Answer closure (one answer passable to a next calculation) is easier than OWA. Most tools are not suitable for any arbitrary schema.
(Open) Semantic Web Approach: Querying and Tooling
SPARQL and emerging rule languages used for querying; performance at scale and with broad distribution a concern. Queries require contextual information for proper set selection. Negation and disjunction are allowed and are powerful constructs. Tools generally less developed. Exciting opportunities for ontology-driven applications working against a small set of generic tools.
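The first row of this comparison, CWA versus OWA, is easiest to see in how a query for a missing fact is answered. The following Python sketch is a toy of my own making, not how any RDBMS or reasoner is actually implemented. The fact base, the names and the two `ask_*` functions are all illustrative assumptions.

```python
# Known positive facts as (subject, predicate, object); all names are illustrative.
facts = {
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
}


def ask_closed_world(fact):
    """CWA with negation as failure: anything not asserted is taken to be false."""
    return fact in facts


def ask_open_world(fact, known_false=frozenset()):
    """OWA: answer True, False, or None (unknown).

    A fact is False only when it has been explicitly asserted to be false.
    """
    if fact in facts:
        return True
    if fact in known_false:
        return False
    return None  # not known either way


query = ("carol", "worksFor", "acme")
print(ask_closed_world(query))  # False: carol is not listed, so she is not an employee
print(ask_open_world(query))    # None: carol's employment is simply not known
```

The closed-world answer is what one wants from a payroll system. The open-world answer is what one wants when the employee list is only one of several partial sources.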
In well-characterized or self-contained domains (seats on a plane, books in a library, customers of a company, products sold via distribution channels), the traditional relational model works well. A closed-world assumption is performant for transaction operations with easier data validation. The number of negative facts about a given domain is typically much greater than the number of the positive ones. So, in many bounded applications, the number of negative facts is so large that their explicit representation can become practically impossible [7]. In such cases, it is simpler and shorter to state known “true” statements than to enumerate all “false” conditions.
However, the relational model is a paradigm where the information must be complete and it must be described by a single schema. Traditional databases require an agreement on a schema, which must be made before data can be stored and queried. The relational model assumes that the only objects and relationships that exist in the domain are those that are explicitly represented in the database, and that names uniquely identify objects in this domain. The result of these assumptions is that there is a single (canonical) model for relational systems where objects and relationships are in a one-to-one correspondence with the data in the database [6].
This makes CWA and its related assumptions a very poor choice when attempting to combine information from multiple sources, to deal with uncertainty or incompleteness in the world, or to try to integrate internal, proprietary information with external data.
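To make the multiple-source problem concrete, here is a small Python sketch of counting suppliers across two lists, first under the unique name assumption and then with explicit identity assertions (similar in spirit to owl:sameAs). The supplier names, the `same_as` list and both functions are hypothetical and only meant to illustrate the idea.

```python
# Two sources describing suppliers; all identifiers are hypothetical.
source_a = {"ACME Corp", "Globex"}
source_b = {"Acme Corporation", "Initech"}

# Explicit identity assertions, analogous in spirit to owl:sameAs statements.
same_as = [("ACME Corp", "Acme Corporation")]


def count_under_una(*sources):
    """UNA: every distinct name is assumed to be a distinct entity."""
    return len(set().union(*sources))


def count_with_identity(sources, same_as):
    """No UNA: merge names explicitly asserted to denote the same entity."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for names in sources:
        for name in names:
            find(name)
    for a, b in same_as:
        parent[find(a)] = find(b)
    return len({find(name) for name in list(parent)})


print(count_under_una(source_a, source_b))                  # 4
print(count_with_identity([source_a, source_b], same_as))   # 3
```

Note that without the unique name assumption the second count is only an upper bound: further identity assertions arriving later could merge more names.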
The process of describing an open, semantic Web “world” can proceed incrementally, sequentially asserting new statements or conditions. The schema in the open semantic Web — the ontology — consists of sets of statements (called axioms) that describe characteristics that must be satisfied by the ontology designer’s idea of “reasonable” states of the world. Formally, such statements correspond to logical sentences, and an ontology corresponds to a logical theory [6].
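This incremental style can be sketched with a toy forward-chaining routine in Python. It is a deliberately tiny stand-in for a real reasoner, and the two rules, the predicate names and the sample statements are my own assumptions. The point it illustrates is monotonicity: each new statement can add entailments but never retracts earlier ones.

```python
def entailments(statements):
    """Derive facts from (s, p, o) statements using two simple rules:

    1. subClassOf is transitive.
    2. If x has type C and C is a subClassOf D, then x has type D.
    """
    derived = set(statements)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in derived:
            for (s2, p2, o2) in derived:
                if p2 == "subClassOf" and o == s2 and p in ("subClassOf", "type"):
                    new.add((s, p, o2))
        if not new <= derived:
            derived |= new
            changed = True
    return derived


# Assert statements one step at a time, as in an incrementally built ontology.
world = {("Supplier", "subClassOf", "Organization")}
step1 = entailments(world)

world |= {("acme", "type", "Supplier")}
step2 = entailments(world)

world |= {("Organization", "subClassOf", "Agent")}
step3 = entailments(world)

# Monotonic: everything entailed earlier is still entailed later.
print(step1 <= step2 <= step3)  # True
```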
Irregularity and incompleteness are toxic to relational model design. In the open semantic Web, data that is structured differently can still be stored together via RDF triple statements (subject – predicate – object). For example, OWA allows suppliers without cities and names to be stored alongside suppliers with that information. Information can be combined about similar objects or individuals even though they have different or non-overlapping attributes. Duplicate checking now occurs based on the logic of the system and not unique name evaluations. Data validation in OWA systems can become both more complicated (via testing against restriction statements) and partially easier (via inference).
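The supplier example can be sketched with plain Python tuples standing in for RDF triples. This uses no RDF library, and the identifiers, property names and values are all made up for illustration.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple; all values are illustrative.
internal = {
    ("supplier:1", "name", "Acme"),
    ("supplier:1", "city", "Iowa City"),
    ("supplier:2", "name", "Globex"),       # no city known
    ("supplier:3", "rating", "preferred"),  # no name or city known
}
external = {
    ("supplier:2", "city", "Springfield"),
    ("supplier:3", "name", "Initech"),
}

# Combining sources is a set union; no shared schema has to be agreed first.
graph = internal | external


def describe(graph, subject):
    """Collect whatever is currently known about one subject."""
    props = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            props[p].add(o)
    return dict(props)


print(describe(graph, "supplier:3"))
```

Records with missing or non-overlapping attributes sit side by side, and a later source can fill gaps without any change to what is already stored.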
It is interesting to note that the theoretical underpinnings of CWA by Reiter [8] began to be understood about the same time (1978) that data federation and knowledge representation (KR) activities also began to come to the fore. CWA and later work on (for example) default reasoning [5] appear to have informed early work in description logics and its alternative OWA approach. This heavily influenced the development of the semantic Web languages RDF and OWL. However, the early path toward KM work based on the relational model also appears to have been set in this timeframe.
We are still reaping the whirlwind from this unfortunate early choice of the relational model for KR, KM and BI purposes. Moreover, though there is quite a bit of theoretical and logical discussion of the alternative OWA and CWA data models, there are surprisingly few discussions of what the implications of these models are to the enterprise. (That is, the elephant in the room.) The next two sections tackle this gap.
The Knowledge Management Argument for OWA
The above should make clear that the relational model and CWA are appropriate for defined and bounded systems. However, many of the new knowledge economy challenges are anything but defined and bounded. These applications all reside in the broad category of knowledge management (KM), and include such applications as data federation, data warehousing, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth.
Let’s look at the characteristics of such knowledge systems and why they are more appropriately modeled through the open world assumption (OWA) rather than the relational model and CWA:
To be sure, there are many circumstances where large stores of instance data and their analysis are necessary for knowledge purposes. In these cases, hybrid CWA-OWA systems (see conclusion) may make sense.
But, as these points emphasize, the general assembly and organization of knowledge is open world in nature. Trying to fit KM and related applications into the straitjacket of the relational model is folly. The relational model and CWA for KM is the elephant in the room. Three decades of failures and disappointments affirm this fact.
The Business Argument for OWA
Besides the native match of knowledge systems with OWA, there are sound business arguments for embracing the (open) semantic enterprise as well. These arguments can be summarized as lower risk, lower cost, faster deployment, and more agile responsiveness. What is there not to love?
It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.
Open world does not necessarily mean open data and it does not mean open source. Open world is simply a way to think about the information we have and how we act on it. OWA technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise. An open world assumption merely asserts that we never have all necessary information and lacking that information does not itself lead to any conclusions.
Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.
We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever. We are merely using the techniques of the (open) semantic Web as the data model to organize our information assets at hand. These assets need not themselves be represented in the native RDF or OWL languages.
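As a rough illustration of keeping instance data in a familiar format while treating triples as the organizing model behind the scenes, the following Python sketch flattens a JSON record into subject, predicate, object statements. It is not irJSON or any Structured Dynamics converter. The `json_to_triples` function, the `id` key convention and the sample record are assumptions made for the example.

```python
import json


def json_to_triples(record_json, id_key="id"):
    """Flatten one flat JSON record into (subject, predicate, object) triples.

    The value under `id_key` becomes the subject; every other key becomes a
    predicate. List values yield one triple per item.
    """
    record = json.loads(record_json)
    subject = record[id_key]
    triples = []
    for key, value in record.items():
        if key == id_key:
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((subject, key, item))
    return triples


# Hypothetical record as a user might author it, with no RDF syntax in sight.
doc = '{"id": "supplier:1", "name": "Acme", "city": ["Iowa City", "Des Moines"]}'

for triple in json_to_triples(doc):
    print(triple)
```

A real converter would also map keys to shared vocabulary terms and handle nested objects, but the user-facing record stays as plain JSON either way.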
Thus, open world frameworks provide some incredibly important benefits for knowledge management applications in the enterprise:
One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains. But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.
In most real world circumstances, there is much we don’t know and we interact in complex and external environments. Knowledge management inherently occupies this space. Ultimately, data interoperability implies a global context. Open world is the proper logic premise for these circumstances. Via the OWA framework, we can readily change and grow our conceptual understanding and coverage of the world, including incorporation of external ontologies and data. Since this can easily co-exist with underlying closed-world data, the semantic enterprise can readily bridge both worlds.
So, we can now define the open semantic enterprise as one that embraces OWA for its knowledge management applications and engages in rapid and low-risk testing of incremental learning. The open world assumption is the proper framework to reverse decades of failure and disappointment for knowledge projects in the enterprise.
Some Open Questions about OWA
In our own discussions about ABox – TBox splits [10], we have, in essence, supported a hybrid OWA-CWA argument for the enterprise. It is beyond the scope of this current piece to describe these approaches in detail, but some of the options include local CWA, the addition of rule languages and constraints to basic OWA, use of the new OWL 2, TopQuadrant’s SPIN notation, and others [11]. I will address some of these in a later post.
There are also questions about performance and scalability with open semantic technologies. Here, too, progress is rapid, with billion triple thresholds rapidly falling with daily reports of better performance [12]. Fortunately, the incremental approach that we advocate herein dovetails well with these rapid developments. There should be no arguing the benefits of a successful incremental project in a smaller domain, perhaps repeated across multiple domains, in comparison to large, costly initiatives that never produce (even though their underlying technologies are performant).
There are also architecture issues inherent in these OWA designs. In one of our next posts, we return to the topic of Web-oriented architecture and its role in support of these OWA knowledge management initiatives.
In the end, there is no substitute for doing and learning. KM based on OWA for the open semantic enterprise can be started today, in a focused manner with tangible benefits and outcomes, at low cost and risk. Let’s push the elephant out of the room and let the learning and doing begin.
[1] For example, see Roger Sessions, 2009. Cost of IT Failure, September 28, 2009. This analysis suggests failure rates of 65% with a total estimated worldwide cost of $6.2 trillion in 2009. Commenters have raised questions as to what constitutes failure and have questioned some of the analysis assumptions. Nonetheless, even with over-estimates, the scale of the numbers is alarming; see Jorge Dominguez, 2009. The CHAOS Report 2009 on IT Project Failure, June 16, 2009, which indicates combined failure and challenge rates for IT projects have ranged from 65% to 84% over the period 1994 to 2009; see Dan Galorath, 2008. Software Project Failure Costs Billions; Better Estimation & Planning Can Help, June 7, 2008. In this report, Galorath compares and combines many of the available IT failure studies and summarizes that 3 of 5 IT projects do not do what they were supposed to for the expected costs, with 49% showing budget overruns, 47% showing higher than expected maintenance costs, and 41% failing to deliver expected business value; the anecdotal failure rate for years for IT projects has been claimed as 80%, with business intelligence and data warehousing particularly failure-prone areas; in 2001, a study by Mark N. Frolick and Keith Lindsey, Critical Factors for Data Warehouse Failures, for the Data Warehousing Institute noted conventional wisdom says the failure rate of data warehousing projects is 70 to 80 percent, with a then-recent study in the insurance industry finding a 90-percent failure rate. This report is useful for combining many historical studies.
[2] According to this article, by Antone Gonsalves, Poor Use Of Data Integration Tools Can Waste $500,000 Annually: Gartner (April 27, 2009), which reports on a recent Gartner Report, large global 2000 companies, using several data integration tools with overlapping features, can reduce costs by more than $500,000 annually by eliminating redundant software and leveraging a shared services model. In a further report by Roman Stanek, Business Intelligence Projects are Famous for Low Success Rates, High Costs and Time Overruns (April 25, 2009), Gartner is talking about a dirty little secret in the world of data integration, the fact that the data integration technology in place is based on generations of data integration technology being layered in the enterprise over the years. Thus, technology that was purchased to solve data integration problems, and reduce costs, is actually making the data integration problem more complex and no longer cost efficient.
[3] Here are some of my earlier postings dealing in some degree with OWA: Ontology-driven Applications Using Adaptive Ontologies, November 23, 2009; Fresh Perspectives on the Semantic Enterprise, September 28, 2009; Confronting Misconceptions with Adaptive Ontologies, August 17, 2009; Advantages and Myths of RDF, April 8, 2009; Making Linked Data Reasonable using Description Logics, Part 2, February 15, 2009, which specifically relates OWA to the ABox and TBox [4]; and, The Role of UMBEL: Stuck in the Middle with You . . ., May 11, 2008.
[4] We use the reference to “ABox” and “TBox” in accordance with our working definition for description logics: “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[5] A model theory is a formal semantic theory which relates expressions to interpretations. A “model” refers to a given logical “interpretation” or “world”. (See, for example, the discussion of interpretation in Patrick Hayes, ed., 2004. RDF Semantics – W3C Recommendation, 10 February 2004.) The logic or inference system of classical model theory is monotonic. That is, it has the behavior that if S entails E then (S + T) entails E. In other words, adding information to some prior conditions or assertions cannot invalidate a valid entailment. The basic intuition of model-theoretic semantics is that asserting a statement makes a claim about the world: it is another way of saying that the world is, in fact, so arranged as to be an interpretation which makes the statement true. An assertion amounts to stating a constraint on the possible ways the world might be. In comparison, a non-monotonic logic system may include default reasoning, where one assumes a ‘normal’ general truth unless it is contradicted by more particular information (birds normally fly, but penguins don’t fly); negation-by-failure, commonly assumed in logic programming systems, where one concludes, from a failure to prove a proposition, that the proposition is false; and implicit closed-world assumptions, often assumed in database applications, where one concludes from a lack of information about an entity in some corpus that the information is false (e.g., that if someone is not listed in an employee database, that he or she is not an employee.) See further, Non-monotonic Logic from the Stanford Encyclopedia of Philosophy.
[6] Peter F. Patel-Schneider and Ian Horrocks, 2006. 
“Position Paper: A Comparison of Two Modelling Paradigms in the Semantic Web,” in WWW2006, May 22–26, 2006, Edinburgh, UK. See http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2006/PaHo06a.pdf. [7] Other resources include: Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter Patel-Schneider, eds., 2003. The Description Logic Handbook: Theory, Implementation and Applications, Cambridge University Press, 2003. Online access to much of the book is available at http://www.inf.unibz.it/~franconi/dl/course/; see esp. Chapters 1, 2, 4 and 16, which relate to this topic; Jos de Bruijn, Axel Polleres, Ruben Lara and Dieter Fensel, 2005. OWL DL vs. OWL Flight: Conceptual Modeling and Reasoning for the Semantic Web, in Proceedings of the Ninth World Wide Web Conference, Japan, May 2005. This paper argues against the use of description logics for the semantic Web; Andrew Newman, 2007. A Relational View of the Semantic Web, March 14, 2007; Hai Wang, 2006. Frames and OWL Side by Side, presented at the 9th International Protégé Conference, July 23-26, 2006, Stanford, CA; Nick Drummond and Rob Shearer, 2006. The Open World Assumption, Powerpoint presentation at The Chris Date Seminar: The Closed World of Databases Meets the Open World of the Semantic Web, e-Science Institute, Edinburgh, Scotland, 12 October 2006; Yulia Levin, 2008. Closed World Reasoning, presentation at Non-classical Logics and Applications Seminar – Winter 2008, Tel Aviv University; and Pat Hayes, 2001. “Why must the web be monotonic?”, email thread at http://lists.w3.org/Archives/Public/www-rdf-logic/2001Jul/0067.html. [8] Raymond Reiter, 1978. “On Closed World Data Bases”, in Logic and Data Bases, H. Gallaire and J. Minker, eds., New York: Plenum Press, 55-76; see also, Raymond Reiter, 1980. “A Logic for Default Reasoning,” Artificial Intelligence, 13:81-132. [9] See this Google search on ontology-driven applications. [10] See this Google search on ABox-TBox articles. 
[11] See, as examples: J. Heflin and H. Munoz-Avila, 2002. LCW-Based Agent Planning for the Semantic Web, in AAAI ‘02 Workshop on Ontologies and the Semantic Web, AAAI Press, pp. 63–70. See http://www.cse.lehigh.edu/~heflin/pubs/lcw-aaai02.pdf (one of the first local CWA suggestions in specific regard to the semantic Web); K. Golden, O. Etzioni and D. Weld, D. 1994. Omnipresence Without Omniscience: Efficient Sensor Managment for Planning, in Proceedings of AAAI-94 (one of the first to propose LCWA in general); Evren Sirin, Michael Smith and Evan Wallace, 2008. Integrity constraints: Opening, Closing Worlds — On Integrity Constraints, presented at OWL: Experiences and Directions (OWLED 2008), Fifth International Workshop, Karlsruhe, Germany, October 26-27, 2008; Timothy L. Hinrichs, Jui-Yi Kao and Michael R. Genesereth, 2009. Inconsistency-tolerant Reasoning with Classical Logic and Large Databases, in Proceedings of the Eighth Symposium on Abstraction, Reformulation, and Approximation (SARA2009), July 2009; S. Gómez, C. Chesñevar and G. Simari 2008. An Argumentative Approach to Reasoning with Inconsistent Ontologies, in Proceedings of the KR Workshop on Knowledge Representation and Ontologies (KROW 2008), Conferences in Research and Practice in Information Technology, Vol. 90, pp. 11-20. Eds. T.Meyer, M. Orgun. Australian Computer Society, Sidney, Australia, July 2008. Holger Knoblauch, The Object-Oriented Semantic Web with SPIN, Sunday, January 18, 2009, that discusses the SPIN (SPARQL Inferencing Notation) Modeling Vocabulary, which is a light-weight collection of RDF properties and classes to support the use of SPARQL to specify rules and logical constraints. 
[12] For example, BigOWLIM can perform reasoning against 12 billion explicit statements and loads about 12,000 statements per second on a standard server; see http://www.ontotext.com/owlim/benchmarking/lubm.html; also, see Orri Erling’s blog regarding performance of the Virtuoso RDF triple store (http://www.openlinksw.com/weblog/oerling/). In any case, these performance benchmarks continue to rise steadily and indicate the performance of RDF as an ontology integration layer.
Sweet Tools, AI3’s listing of semantic Web and -related tools, now has a total of 810 tools listed, a significant expansion from the last update. With the retirement of 19 prior tools, this new listing adds 93 new tools, or about 13% of the 736 listed in the previous version, for a net gain of 74.
The Sweet Tools dataset is also now showing the way to a couple of exciting innovations: new generic ontology-driven applications for structured data; and, tools for authoring structured data via spreadsheets.
Summary of Major Changes
So, here is the summary of major changes in this new listing:
To see the major Sweet Tools page for this updated listing in its existing format, filter on ‘New’ under New or Existing? to see the recent additions. Alternatively, you can also see this same filtering using the conStruct structured data view by searching on the Status attribute using the value ‘New’; see example here.
See the new Sweet Tools structured data display at conStruct!
Structured Data via conStruct
Though still formative, the most exciting change with the Sweet Tools listing is this new presentation via conStruct. It is a structured data Web services framework with a UI, all offered as a set of modules to Drupal. To kick the tires with this new system, you may want to look at:
BTW, there are some helpful documentation pages that show how all of these various tools work and more, such as, for example, Browse. (Also, BTW, as a demo user, you also are not seeing all of the write and update tools, either; again, see the documentation.)
The essential underlying basis to conStruct is the structWSF Web services framework. There are still some aspects to this system that we feel are incomplete and we are working on. Some of these things include dropdown selections (controlled vocabulary selects); easier template creation; and intuitive template re-use. Nonetheless, these additions will come quickly, and what is here is already a great demonstration of how structured data can drive generic tools and interfaces.
The case study of how this system was constructed from a spreadsheet input using the irON vocabulary is described in an earlier post.
Updated Statistics
The updated Sweet Tools listing now includes nearly 50 different tool categories. The most prevalent categories are browser tools (RDF, OWL), information extraction, parsers or converters, composite application frameworks and general ontology tools. Each accounts for more than 8% — or more than 50 tools — of the total. This breakdown is as follows:
There are no real discernible trends in application tool categories over the past couple of years.
As for the languages these applications are written in, that has stayed pretty steady, too. Java is still the leading language at about 46%, though its share has been trending very slightly downward over the past three years or so. PHP has increased a bit. The current splits are:
Prior Updates
Background on prior listings and earlier statistics may be found on these previous posts:
With interim updates periodically over that period.
Structured Dynamics is one of the 320+ submitters (and counting!) to the 2010 $5 million Knight News Challenge. Our proposal is to extend our basic structWSF and conStruct Drupal frameworks to provide a data appliance and network (DAN) to support citizen journalists with data and analysis at the local, community level.
We invite you to look at our application and to provide comments or your rating of the application. The deadline for comments is tomorrow, and we will incorporate any appropriate last-minute suggestions. You can find our submission at:
Citizen DAN Proposal (or, search on ‘citizen dan’)
Please note: you must be signed in via a short submission to vote or comment on the application (or others shown in the listings).
The Citizen DAN Idea
Citizen DAN is meant to be a complete, open source framework for promoting citizen journalism. It is a:
Good decisions and good journalism require good starting information. Citizen DAN is a framework to provide access for any citizen to learn and compare local statistics and data with other similar communities. This helps to promote the grist for citizen journalism, as well as to provide a vehicle for discovery and learning across the community.
Citizen DAN will come pre-packaged with all necessary deployment components and documentation, including local data from government sources. It will include facilities for direct upload of additional local data in formats from spreadsheets to standard databases; many standard converters are included with the basic package.
Citizen DAN may be implemented either by local governments or by community advocacy groups. When deployed, using its clear documentation, sponsors may choose whether or what portions of local data are exposed to the broader Citizen DAN network. Data exposed on the network is automatically available to any other network community for comparison and analysis purposes.
(You may want to see our separate description — structWSF: A Framework for Collaboration Networks — of how this framework can lead to collaboration through widely distributed community nodes.)
The complete data appliance and network (DAN) is multi-lingual. If funded, this project will be tested and deployed in at least two prominent cities; one in Canada (French and English), and one in the United States (English and Spanish).
We think Citizen DAN is an exciting new prospect for local communities to share and use local data. Your support can help make this app available to any community for free.
And, in any case, do check out the other fine submissions to the challenge.
OK, you’ve been reading the literature and perhaps have attended a conference or two. You have heard a lot about semantic technologies, but have some real questions and concerns:
Such questions — and more — are not infrequent when organizations first contemplate making the transition to become a semantic enterprise.
Overview
The diagram below shows a general workflow for migrating existing instance data into the semantic enterprise. The diagram is broken down into three parts. The first part is to characterize and stage existing data and information into the underlying structured data framework. This is what SD (that is, my firm, Structured Dynamics) does as data architects using our particular approach to adaptive ontologies. I’ll touch on this again in a moment.
Jumping to the right-hand side of the diagram is the access and display part. It is here that developers or users can make selections from dropdown lists and so forth to define the “slices” of diced results sets they wish to display. The results of those interactions are structured data results sets that are pre-staged to “drive” various applications and displays [1,2]. These same capabilities can also be embedded into standard Web end user applications, such as content management systems.
The third and middle part of the diagram is the critical part, the pivot point. It is the interface layer between the structured data on the left and the display and presentation of that data on the right. As provided by SD, this abstraction layer is the structWSF Web services framework that “bridges” between the black box of what happens with RDF and semantic Web structured data characterizations on the left in order to feed, or “drive”, useful services and functions on the right.
We call this general design and architecture “ontology-driven applications”. The bulk of this posting explains each of these three parts in a bit more detail, organized from left-to-right by these Parts 1 to 3.
Our approach relies on what we call “adaptive ontologies”. These ontologies set the structural basis for all subsequent data display, analysis, inferencing, entailments, and the like. We call them “adaptive” because we embrace a set of unique best practices. These practices enable the ontologies to do the double-duty of first structuring data and then driving generic applications by properly informing user interfaces, dropdown lists, menus and the like.
This structuring results in faceting key important dimensions and attributes of available content. Structured data gets organized. Unstructured data (text) gets tagged via this structure and integrated with it.
As Structured Dynamics’ general product schema makes clear (see the diagram at [3]), our approach leverages existing assets as much as possible. Often, this means leaving most existing data structures in place. These existing assets are staged and converted in two complementary manners that largely correspond to the conceptual ABox (instance) and TBox (concepts and schema) split central to description logics and pivotal to SD’s methodology [4].
Whether transitioning small chunks or big chunks, this staging of existing data in Part 1 results in an RDF-accessible characterization of the starting content. Instances and their attributes are represented via a common notation, generally based on irON (instance record and Object Notation) [5], which is an extensible notation and vocabulary for capturing the data characterizations, attributes and metadata of the candidate instance data (“records” in RDBMS parlance). These instances may either be internal or proprietary records, or instance data on the Web or in the public domain. By properly matching same or similar instances to one another, any source of instance characterization can be meaningfully combined.
This instance notation is extremely lightweight, and really is merely an RDF representation of data characterizations. In the characterizations to this point there is not yet any “world view” involved: we are simply describing instances and their attributes in a manner akin to key-value pairs. The process to this point is entirely descriptive.
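To make the "akin to key-value pairs" point concrete, here is a minimal Python sketch of flattening one instance record into subject-predicate-object statements. It is an illustration only, not irON syntax: the record, its attribute names and the `tool:1` identifier are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def record_to_triples(record_id: str, attributes: Dict[str, object]) -> List[Triple]:
    """Flatten one instance record into (subject, predicate, object) triples.

    Purely descriptive: no schema or "world view" is applied at this stage.
    """
    triples: List[Triple] = []
    for key, value in attributes.items():
        # A multi-valued attribute yields one triple per value.
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((record_id, key, str(item)))
    return triples


# Hypothetical record, loosely modeled on a tools listing entry.
record = {"type": "Tool", "name": "Example Browser", "language": ["Java", "PHP"]}
triples = record_to_triples("tool:1", record)
for triple in triples:
    print(triple)
```

Each attribute simply becomes a statement about the instance; how those attributes relate to concepts is left to the terminological level.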
However, these instance characteristics do contain within them the semantics as to how to describe these attributes (your “glad” is my “happy”), as well as potentially a schematic or conceptual view of how these instances relate to one another and to the broader world. Instance characterizations provide the building blocks that are then related and made semantically whole via a second “terminological” level.
These terminological, or conceptual, relationships (the TBox [4]) reside at a different level from simply describing things. Rather, these schemas — what in this context are best known as ontologies — provide a precise language and means for describing conceptual relationships. If these structural relationships are done well, they are coherent: the hip bone is connected to the thigh bone and not to the ear. Coherence is a matter of a consistent world view that “hangs together” when analyzed via powerful logical techniques available via description logics and other broader mechanisms of the semantic enterprise.
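The split between instance assertions and conceptual structure can be sketched in a few lines of Python. This is a toy under stated assumptions: the concept names, the `TBOX` and `ABOX` dictionaries and both helper functions are hypothetical, and real description logic reasoners do far more than follow subclass links.

```python
from typing import Dict, Set

# TBox: concept -> direct parent concepts (hypothetical toy hierarchy).
TBOX: Dict[str, Set[str]] = {
    "Femur": {"Bone"},
    "Pelvis": {"Bone"},
    "Bone": {"BodyPart"},
    "Ear": {"BodyPart"},
}

# ABox: instance -> asserted concept.
ABOX: Dict[str, str] = {"patient1_left_femur": "Femur"}


def ancestors(concept: str, tbox: Dict[str, Set[str]]) -> Set[str]:
    """Collect every concept reachable by following parent links."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in tbox.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_instance_of(instance: str, concept: str,
                   abox: Dict[str, str], tbox: Dict[str, Set[str]]) -> bool:
    """True if the instance's asserted concept is, or specializes, `concept`."""
    asserted = abox[instance]
    return concept == asserted or concept in ancestors(asserted, tbox)


print(is_instance_of("patient1_left_femur", "BodyPart", ABOX, TBOX))
print(is_instance_of("patient1_left_femur", "Ear", ABOX, TBOX))
```

The instance data only says that something is a femur; the conceptual level is what lets the system conclude it is also a bone and a body part, and not an ear.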
Thus, as we transition from the existing, the operational workflow splits the input data stream into two pathways:
A sequential flow of these steps and splits is provided by this diagram below that shows: 1) the conceptual structure of the concept ontology; as 2) matched with the instances and their descriptive attributes that populate that schema.
A key point is that — while a proper starting ontology is essential to our process and proofs-of-concept — it can be grown and scaled incrementally. We leverage as much existing starting structure as possible and can readily bound the scope to meet budget and delivery imperatives.
The concepts and entities that occur within these structures help inform our fairly simple tagging system, scones [3]. (There are also benefits from “triangulating” between entity or instance identification and concept identification that help inform disambiguation nearly for free; see further [6].) It is also possible to integrate these initial proof-of-concept approaches with third-party tools (e.g., Calais, Expert System (Cogito), etc.) to improve unstructured content characterization.
These approaches are pretty straightforward for any organization wanting to test the idea of becoming a semantic enterprise. Real benefits — such as concept retrievals overcoming the limitations of standard keyword search — can be demonstrated from even small starting ontologies and structures. Given the inherent connectedness of the data, it is possible to expand the scope and usefulness of the information incrementally within fixed and manageable budgets.
Part 2: structWSF: A Web-oriented Services API and Framework
A pivotal part of SD’s infrastructure software is structWSF [7], our platform-independent Web services middleware. structWSF is an abstraction layer that provides the APIs, search endpoints, and specific Web services for accessing, querying or getting results sets from the underlying structured data and ontologies.
structWSF has a standard set of access and retrieval services including browse, full-text search, CRUD, direct record retrievals, and the like. It is embedded within an access and permissions service that acts at the level of registered datasets. Then, based on the requested protocol, structWSF returns the filtered results set. These results sets can be delivered as XML, JSON, or any of the other formats already available [7]. They can readily and dynamically populate HTML pages and forms in any deployment framework. For specific purposes, these results sets can also be returned as pre-staged, properly formatted results streams for driving specific applications.
As an API, the structWSF Web services can be interacted with and driven via standard HTTP requests. Alternatively, these requests can come from simple to complicated Web apps that create the API queries based on user interface choices such as selections from dropdown lists or clicking on various listed options. An interactive demo of this approach is shown by SD’s conStruct application [8], though even simpler Web pages or forms may drive the query interface.
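As a rough sketch of how user interface choices might be turned into an HTTP query, the Python below assembles a request URL from a dictionary of selections. The endpoint URL and the parameter names (`query`, `dataset`, `format`) are assumptions made for illustration; they are not taken from the structWSF documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url: str, selections: dict) -> str:
    """Turn UI selections (e.g., dropdown choices) into a GET request URL.

    Empty selections are dropped so that unset controls add no parameter.
    """
    params = {key: value for key, value in selections.items()
              if value not in (None, "")}
    return f"{base_url}?{urlencode(sorted(params.items()))}"


# Hypothetical endpoint and parameter names, for illustration only.
url = build_query_url(
    "http://example.org/ws/search/",
    {"query": "ontology editor", "dataset": "sweet-tools", "format": ""},
)
print(url)
print(parse_qs(urlsplit(url).query))
```

The same function serves a hand-written script or a form handler, which is the sense in which simple pages and richer Web apps can drive the same interface.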
Queries and requests to structWSF may also include a parameter for results sets to be returned in particular formats. SD’s irON protocol [5] supports requests or results in CSV, XML or JSON, in addition to other flavors including multiple serializations of RDF.
In this manner, only a simple converter need be added to the structWSF Web services stack in order to “drive” a new application with a particularly formatted results set stream.
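One way to picture the "simple converter" idea is a small registry in which each output format is a function from a results set to a string. This Python sketch is a generic pattern under our own assumptions, not structWSF's actual implementation; the names `CONVERTERS`, `converter` and `render` are hypothetical.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, str]
CONVERTERS: Dict[str, Callable[[List[Record]], str]] = {}


def converter(name: str):
    """Decorator that registers a results-set converter under a format name."""
    def register(func):
        CONVERTERS[name] = func
        return func
    return register


@converter("json")
def to_json(records: List[Record]) -> str:
    return json.dumps(records, sort_keys=True)


@converter("csv")
def to_csv(records: List[Record]) -> str:
    # Use the union of all keys as columns, in a stable order.
    fields = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def render(records: List[Record], fmt: str) -> str:
    """Serialize a results set in the requested format."""
    if fmt not in CONVERTERS:
        raise ValueError(f"no converter registered for format {fmt!r}")
    return CONVERTERS[fmt](records)


results = [{"name": "A", "language": "Java"}]
print(render(results, "csv"))
print(render(results, "json"))
```

Supporting a new target application then amounts to registering one more function; the query side and the other converters are untouched.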
structWSF thus acts as a single, uniform Web interface to all of the “black box” nuances of the structured data system organized by the adaptive ontologies. Further, virtually any data structure may be ingested and converted from external sources via an import service and made part of the underlying canonical structure, making the framework perfect for data federation [9]. Lastly, the dataset nature of the framework, and its neutrality to underlying data stores or content management systems, also makes structWSF an excellent framework for one or many nodes to share information and collaborate across the Web [10].
The following diagram shows how a diverse, Web-based network, involving a diversity of Web portals and data gateways and hubs, can work via the structWSF framework to establish a complete collaboration network. Via datasets and differential access rights and permissions, virtually any combination of potential interactions can be supported:
These potentials are really fundamentally new, and we ourselves are still trying to find the language and analogies to best explain them. structWSF was initially designed as a platform-independent layer between the structured data representation of existing assets and the ontology-driven applications that interact with them. We are now finding that deployment in a broader Web-based context provides additional exciting prospects for integrating various regional offices or to enable direct collaboration with customers, partners or suppliers.
Part 3: Ontology-driven Applications
The basic design of structWSF is to provide a middleware layer that fulfills one or more of these broad user interaction modes:
SD has developed generic applications in these areas (with many more possible), the operations of which are guided by the instructions and nature of the underlying data that feeds them. We have proven it is possible to adopt data characterization practices within those ontologies so as to stage or “drive” such generic applications.
In the case of a standard structured data display (say, a simple table like a Wikipedia infobox), such generic design includes templates tailored to various instance types (say, locational information presented on a map versus people information warranting an image and vital statistics). Alternatively, in the generic design for some specialized application (say, Adobe Flash), the information output of the results set may need to contain certain formats and attributes.
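Template selection by instance type can be sketched as a simple lookup. In the Python below, the type names, widget names and field lists are all hypothetical; the point is only that the data's declared type, rather than custom code per dataset, decides how a record is displayed.

```python
from typing import Dict

# Hypothetical mapping from instance type to a display template.
TEMPLATES: Dict[str, Dict[str, object]] = {
    "Place": {"widget": "map", "fields": ["name", "latitude", "longitude"]},
    "Person": {"widget": "profile", "fields": ["name", "image", "born"]},
}
# Fallback: a plain table showing every attribute.
DEFAULT_TEMPLATE: Dict[str, object] = {"widget": "table", "fields": []}


def select_template(record: Dict[str, str]) -> Dict[str, object]:
    """Pick the template registered for the record's type, else the default."""
    return TEMPLATES.get(record.get("type", ""), DEFAULT_TEMPLATE)


def display_plan(record: Dict[str, str]) -> Dict[str, object]:
    """Return the widget to use and the attribute values it should show."""
    template = select_template(record)
    fields = template["fields"] or sorted(k for k in record if k != "type")
    values = {field: record[field] for field in fields if field in record}
    return {"widget": template["widget"], "values": values}


place = {"type": "Place", "name": "Iowa City", "latitude": "41.66",
         "longitude": "-91.53", "population": "67862"}
print(display_plan(place))
print(display_plan({"type": "Tool", "name": "Example Browser"}))
```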
SD’s “ontology-driven apps”, then, are really informed structured results sets that are outputted in a form suitable to various intended applications. This output form can include a variety of serializations, formats or metadata. This flexibility of output that is tailored to and responsive to particular generic applications is what makes our ontologies “adaptive”.
Expressed in this manner, “ontology-driven apps” seem neither remarkably profound nor clever. They are simply attentive to their intended uses.
Using this structure, then, it is possible to either “drive” queries and results sets selections via direct HTTP request or via simple dropdown selections on HTML forms (that is, from right to left as shown on the first diagram). Similarly, it is possible with a single parameter change to drive either a visualization app or a structured table template from the equivalent query request (that is, from left to right on the first diagram).
“Ontology-driven apps” through SD’s architecture design thus provide two profound benefits. First, the entire system can be driven via simple selections or interactions without the need for any programming or technical expertise. And, second, simple additions of new and minor output converters can work to power entirely new applications available to the system. If, say, Adobe graphics applications need to change tomorrow for Microsoft Silverlight, that switch is easy and can be made transparent to the designer.
The Complete Picture: Embrace the Open World
The ability to develop these systems incrementally and the ability to integrate with external, public data is fundamentally dependent on the open world assumption. The open world assumption is a different logic premise than what many enterprises are used to; relational database systems, for example, embrace the alternate closed world premise.
Open world does not necessarily mean open data and it does not mean open source. Open world is merely a way to think about the information we have and how we act on it. An open world assumption accepts that we never have all necessary information and lacking that information does not itself lead to any conclusions.
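The difference between the two premises can be shown with a toy query in Python. This is a deliberately simplified sketch: facts are plain tuples, the function names are hypothetical, and `None` stands in for "unknown".

```python
from typing import Optional, Set, Tuple

Fact = Tuple[str, str, str]


def holds_closed_world(fact: Fact, known_true: Set[Fact]) -> bool:
    """Closed world: anything not recorded as true is taken to be false."""
    return fact in known_true


def holds_open_world(fact: Fact, known_true: Set[Fact],
                     known_false: Set[Fact]) -> Optional[bool]:
    """Open world: a missing fact is unknown (None), not false."""
    if fact in known_true:
        return True
    if fact in known_false:
        return False
    return None


known_true = {("alice", "worksFor", "acme")}
known_false = {("carol", "worksFor", "acme")}
question = ("bob", "worksFor", "acme")

print(holds_closed_world(question, known_true))
print(holds_open_world(question, known_true, known_false))
```

Under the closed world premise Bob is declared not to be an employee; under the open world premise the system simply has no answer yet, and new information can be added later without contradicting anything already concluded.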
Some enterprise circumstances – say a complete enumeration of customers or products or even controlled engineering or design environments — may warrant a closed world approach. In those circumstances, the domain of inquiry is well bounded and we can get relatively complete information about it. Engineering an oil drilling platform or launching the Space Shuttle in fact demands that.
But, in most real world circumstances, there is much we don’t know and we interact in complex and external environments. Open world is the proper logic premise for these circumstances. These circumstances also happen to be the very ones in which most knowledge workers and analysts operate.
Open world frameworks provide some incredibly important benefits if the circumstances of their use apply:
One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains. But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.
Summary
So, let’s return to the rhetorical questions that began this posting.
It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with focus on early, high-value applications and domains.
Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.
We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever.
We also see these technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise.
Without a doubt, some of the early years in describing semantic technologies were burdened with some unfortunate bad information and lack of sophistication. Today’s semantic Web is nimble, agile, and ready to be deployed immediately at low cost and risk. So, jump on in! We think you’ll find the water to be just fine.
This post is Part V of an occasional AI3 series on ontology best practices.
[1] These selections and requests need not occur only via user interfaces or HTML forms, but also programmatically via API or direct Web services calls.
[2] There are two main classes of visualizations possible with our systems: 1) navigations or explorers of the concept space, which is a particularly open challenge for large, graph-based knowledge bases (see, for example, our Subject Concept Explorer using the UMBEL Financial Account concept, and click on the bubbles); or 2) conventional data visualizations or graphics or mappings of instance data. Both are shown as workflow boxes on the first diagram above.
[3] See http://structureddynamics.com/products.html for a general descriptive illustration of Structured Dynamics’ product stack. There is also a longer slideshow, from which this diagram is drawn as slide #37.
[4] We use the reference to ABox and TBox in accordance with our working definition for description logics: “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[5] For the specification and a use case of irON using the CSV (commON) serialization, see http://openstructs.org/iron.
[6] Via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of these logical segmentations, we also have multiple “clusters” (that is, either the concept, type, superType or dimension) upon which to do our disambiguation evaluations, either between concepts and entities or within the various concept clusters. We can do so via either multiple semantic vectors (for statistical-based methods) or multiple features (for machine learning methods). In other words, because of logical segmentation, we have increased the informational power of our concept graph. See further http://www.mkbergman.com/759/supertypes-and-logical-segmentation-of-instances/.
[7] See http://openstructs.org/structwsf/architecture; also, the available export formats are shown at http://constructscs.com/documentation/instructions/export.
[8] There is an online demo of conStruct using the Sweet Tools database of semantic Web and -related tools at http://constructscs.com/conStruct/browse/; for background on this use case, see http://www.mkbergman.com/845/a-most-un-common-way-to-author-datasets/.
[9] See, for example, http://www.mkbergman.com/496/structwsf-a-framework-for-data-mixing/.
[10] See, for example, http://www.mkbergman.com/497/structwsf-a-framework-for-collaboration-networks/.
Chris Gutteridge asked this question about semantically enabling conference Web sites, which is a subject close to my heart. It’s hard to give a meaningful response in 140 characters, so I decided to get some headline thoughts down for posterity. If you want a fuller account of some first-hand experiences, then the following papers are a good place to start:
1. Exploit Existing Workflows
Conferences are incredibly data-rich, but much of this richness is bound up in systems for e.g. paper submission, delegate registration, and scheduling, that aren’t native to the Semantic Web. Recognise this in advance and plan for how you intend to get the data from these systems out into the Web. The good news is that scripts now exist to handle dumps from submission systems such as EasyChair, but you may need to ensure that the conference instance of these systems is configured correctly for your needs. For example, getting dumps from these systems often comes at a price, and if you’re using one instance per track rather than the multi-track options, you may be in for a shock when you ask for the dumps. Speak to the Programme Chairs about this as soon as possible.
In my experience, delegate registration opens months in advance of a conference and often uses a proprietary, one-off system. As early as possible make contact with the person who will be developing and/or running this system, and agree how the registration system can be extended to collect data about the delegates and their affiliations, for example. Obviously there needs to be an opt-in process before this data is published on the public Web.
Collecting these types of data from existing workflows is so monumentally easier than asking people to submit it later through some dedicated means. With this in mind, have modest expectations (in terms of degree of participation) for any system you hope to deploy for people to use before, during and after the conference, whether this is a personalised schedule planner, paper annotation system or rating system for local restaurants. People have massive demands on their time always, and especially at a conference, so any system that isn’t already part of a workflow they are engaged with is likely to get limited uptake.
2. Publish Data Early then Incrementally Improve
Perhaps your goal in publishing RDF data about your conference is simply to do the right thing by eating your own dog food and providing an archival record of the event in machine-readable form. This is fine, but ideally you want people to use the published data before and during the event, not just afterwards. In an ideal world, people will use the data you publish as a foundation for demos of their applications and services at the conference, as a means to enhance the event and also to promote their own work. To maximise the chances of this happening you need to make it clear in advance that you will be publishing this data, and give an indication of what the scope of this will be. The RDF available from previous events in the ESWC and ISWC series can give an impression of the shape of the data you will publish (assuming you follow the same modelling patterns), but get samples out early and basic structures in place so people have the chance to prepare. Better to incrementally enhance something than save it all up for a big bang just one week before the conference.
3. Attend to the Details
Many of the recent ESWC and ISWC events have done a great job of publishing conference data, and have certainly streamlined the process considerably. However, along the way we’ve lost (or failed to attend to) some of the small but significant facts that relate to a conference, such as the location, venue, sponsors and keynote speakers. This stuff matters, and is the kind of data that probably doesn’t get recorded elsewhere. Obviously publishing data about the conference papers is important, but from an archival point of view this information is at least recorded by the publishers of the proceedings. The more tacit, historical knowledge about a conference series may be of great interest in the future, but is at risk of slipping away.
4. Piggy-back on Existing Infrastructure
As I discovered while coordinating the Semantic Web Technologies for ESWC2006, deploying event-specific services is simply making a rod for your own back. Who is going to ensure these stay alive after the event is over and everyone moves on to the next thing? The answer is probably no-one. The domain registration will lapse, the server will get hacked or develop a fault, the person who once knew why that site mattered will take a job elsewhere, and the data will disappear in the process. Therefore it’s critical that every event uses infrastructure that is already embedded in everyday usage and also/therefore has a future. The best example of this is data.semanticweb.org, the de facto home for Linked Data from Web-related events. This service has support from SWSA, and enough buy-in from the community, to minimise the risk that it will ever go away. By all means host the data on the conference Web site if you must, but don’t dream of not mirroring it at data.semanticweb.org, with owl:sameAs links to equivalent URIs in that namespace for all entities in your data set.
5. Put Your Data in the Web
Remember that while putting your data on the Web for others to use is a great start, it’s going to be of greatest use to people if it’s also *in* the Web. This is a frequently overlooked distinction, but it really matters. No one in their right mind would dream of having a Web site with no incoming or outgoing links, and the same applies to data. Wherever possible the entities in your data set need to be linked to related entities in other data sets. This could be as simple as linking the conference venue to the town in which it is located, where the URI for the town comes from Geonames. Linking in this way ensures that consumers of the data can discover related information, and avoids you having to publish redundant information that already exists somewhere else on the Web. The really great news is that data.semanticweb.org already provides URIs for many people who have published in the Semantic Web field, and (aside from some complexities with special characters in names) linking to these really can be achieved in one line of code. When it’s this easy there really are no excuses.
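To make the linking step concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that person URIs at data.semanticweb.org follow a lower-cased, hyphenated name pattern, and it uses a hypothetical local namespace (example.org) and a placeholder Geonames identifier; check the real URIs before publishing anything.

```python
from urllib.parse import quote

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_BASED_NEAR = "http://xmlns.com/foaf/0.1/based_near"


def person_uri(full_name: str) -> str:
    """Build a person URI; the hyphenated slug pattern is an assumption."""
    slug = "-".join(full_name.lower().split())
    # Percent-encode anything outside the unreserved set (accents, etc.).
    return "http://data.semanticweb.org/person/" + quote(slug)


def geonames_uri(geoname_id: int) -> str:
    """Geonames identifies places by a numeric id in the URI path."""
    return f"http://sws.geonames.org/{geoname_id}/"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one URI-to-URI statement as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical local URIs for a delegate and the venue.
local_person = "http://example.org/conf/person/jane-doe"
local_venue = "http://example.org/conf/venue/main-hall"

print(triple(local_person, OWL_SAME_AS, person_uri("Jane Doe")))
# 0 is a placeholder id, not a real Geonames entry.
print(triple(local_venue, FOAF_BASED_NEAR, geonames_uri(0)))
```

The person link really is one line once the URI pattern is known; names with special characters are where the extra care goes.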
Conclusions
Reading the above points back before I hit publish, I realise they focus on Semantic Web-enabling the conference as a whole, rather than specifically the conference Web site, which was the focus of Chris’s original question. I think we know a decent amount about publishing Linked Data on the Web, so hopefully these tips usefully address the more process-oriented than technical aspects.
When I say, “shot”, what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term “bank”? Do you now think of someone being shot in an armed robbery of a local bank or similar?
And, now, what if I add a reference to say, The Hustler, or Minnesota Fats, or “Fast Eddie” Felson? Do you now see the connection to a pressure-packed banked pool shot in some smoky bar room?
As humans we need context to make connections and remove ambiguity. For machines, with their limited reasoning and inference engines, context and accurate connections are even more important.
Over the past few weeks we have seen announcements of two large and high-visibility linked data projects: One, a first release of references for articles concerning about 5,000 people from the New York Times at data.nytimes.com; and Two, a massive exposure of 5 billion triples from data.gov datasets provided by the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI).
On various grounds from licensing to data characterization and to creating linked data for its own sake, some prominent commentators have weighed in on what is good and what is not so good with these datasets. One of us, Mike, commented about a week ago that “we have now moved beyond ‘proof of concept’ to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.”
Reactions to that posting and continued discussion on various mailing lists warrant a more precise dissection of what is wrong and still needs to be done with these datasets [1].
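Berners-Lee’s Four Linked Data “Rules”
It is useful, then, to return to first principles, namely the original four “rules” posed by Tim Berners-Lee in his design note on linked data [2]:
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information
4. Include links to other URIs so that they can discover more things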
The first two rules are definitional to the idea of linked data. They cement the basis of linked data in the Web, and are not at issue with either of the two linked data projects that are the subject of this posting.
However, it is the lack of specifics and guidance in the last two rules where the breakdowns occur. Both the NYT and the RPI datasets suffer from a lack of “providing useful information” (Rule #3). And, the nature of the links in Rule #4 is a real problem for the NYT dataset.
What Constitutes “Useful Information”?
The Wikipedia entry on linked data expands on “useful information” by augmenting the original rule with the parenthetical clause, “(i.e., a structured description — metadata).” But even that expansion is insufficient.
Fundamentally, what are we talking about with linked data? Well, we are talking about instances that are characterized by one or more attributes. Those instances exist within contexts of various natures. And, those contexts may relate to other existing contexts.
We can break this problem description down into three parts:
At minimum, ANY dataset exposed as linked data needs to be described by a vocabulary. Both the NYT and RPI datasets fail on this score, as we elaborate below. Better practice is to also provide a schema of relationships in which to embed each instance record. And, best practice is to also map those structures to external schema.
Lacking this “useful information”, especially a defining vocabulary, we cannot begin to understand whether our instances deal with drinks, bank robberies or pool shots. This lack, in essence, makes the information worthless, even though available via URL.
The data.gov (RPI) Case
With the support of NSF and various grant funding, RPI has set up the Data-Gov Wiki [3], which is in the process of converting the datasets on data.gov to RDF, placing them into a semantic wiki to enable comment and annotation, and providing that data as RSS feeds. Other demos are also being placed on the site.
As of the date of this posting, the site had a catalog of 116 datasets from the 800 or so available on data.gov, leading to these statistics:
We’ll take one of these datasets, #319, and look a bit closer at it:
Wiki: Dataset 319
Title: Consumer Expenditure Survey
Agency: Department of Labor
Name: LABOR-STAT
data.gov Link: http://www.data.gov/details/319
No. of Properties: 22
No. of Triples: 1,583,236
RDF File: http://data-gov.tw.rpi.edu/raw/319/index.rdf

This report was picked solely because it had a small number of attributes (properties), and is thus easier to screen capture. The summary report on the wiki is shown by this page:
So, we see that this specific dataset contains about 22 of the nearly 8,000 attributes across all datasets.
When we click on one of these attribute names, we are then taken to a specific wiki page that only reiterates its label. There is no definition or explanation.
When we inspect this page further we see, other than the broad characterization of the dataset itself (the bulk of the page), 22 undefined attributes at the bottom with labels such as item code, periodicity code, seasonal, and the like. These attributes are the real structural basis for the data in this dataset.
But, what does all of this mean???
To gain a clue, now let’s go to the source data.gov site for this dataset (#319). Here is how that report looks:
Contained within this report we see a listing for additional metadata. This link tells us about the various data fields contained in this dataset; we see many of these attributes are “codes” to various data categories.
Probing further into the dataset’s technical documentation, we see that there is indeed a rich structure underneath this report, again provided via various code lookups. There are codes for geography, seasonality (adjusted or not), consumer demographic profiles and a variety of consumption categories. (See, for example, the link to this glossary page.) These are the keys to understanding the actual values within this dataset.
For example, one major dimension of the data is captured by the attribute item_code. The survey breaks down consumption expenditures within the broad categories of Food, Housing, Apparel and Services, Transportation, Health Care, Entertainment, and Other. Within a category, there is also a rich structural breakdown. For example, expenditures for Bakery Products within Food is given a code of FHC2.
But, nowhere are these codes defined or unlocked in the RDF datasets. This absence is true for virtually all of the datasets exposed on this wiki.
So, for literally billions of triples, and 8,000 attributes, we have ABSOLUTELY NO INFORMATION ABOUT WHAT THE DATA CONTAINS OTHER THAN A PROPERTY LABEL. There is much, much rich value here in data.gov, but all of it remains locked up and hidden.
The sad truth about this data release is that it provides absolutely no value in its current form. We lack the keys to unlock the value.
To be sure, early essential spade work has been done here to begin putting in place the conversion infrastructure for moving text files, spreadsheets and the like to an RDF form. This is yeoman work important to ultimate access. But, until a vocabulary is published that defines the attributes and their codes so we can unlock this value, it will remain hidden. And until a schema of some nature is also published that connects attributes and relations across datasets, the further value from connecting the dots will remain hidden as well.
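As a rough illustration of how little it takes to start such a vocabulary, the Python sketch below emits N-Triples that define one attribute (item_code) and one of its codes (FHC2, Bakery Products), both taken from the survey documentation discussed above. The namespace under example.org, the wording of the comment text, and the choice of RDFS and SKOS terms are our assumptions for the sketch, not anything the Data-Gov Wiki publishes.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
SKOS_NOTATION = "http://www.w3.org/2004/02/skos/core#notation"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Hypothetical namespace for the vocabulary of dataset #319.
BASE = "http://example.org/dataset319/"


def uri(value: str) -> str:
    return f"<{value}>"


def lit(text: str) -> str:
    """Quote a plain literal, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def define_attribute(name: str, label: str, comment: str) -> list:
    """Describe a dataset attribute as an rdf:Property with label and comment."""
    s = uri(BASE + "property/" + name)
    return [
        f"{s} {uri(RDF_TYPE)} {uri(RDF_PROPERTY)} .",
        f"{s} {uri(RDFS_LABEL)} {lit(label)} .",
        f"{s} {uri(RDFS_COMMENT)} {lit(comment)} .",
    ]


def define_code(code: str, meaning: str) -> list:
    """Describe one code value as a skos:Concept carrying its notation."""
    s = uri(BASE + "code/" + code)
    return [
        f"{s} {uri(RDF_TYPE)} {uri(SKOS_CONCEPT)} .",
        f"{s} {uri(SKOS_NOTATION)} {lit(code)} .",
        f"{s} {uri(SKOS_PREF_LABEL)} {lit(meaning)} .",
    ]


vocabulary = define_attribute(
    "item_code",
    "item code",
    "Consumption expenditure category for the survey item.",  # our wording
) + define_code("FHC2", "Bakery Products")

print("\n".join(vocabulary))
```

Repeating this for each attribute and code list in the technical documentation is tedious, but it is exactly the “useful information” that is missing.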
These datasets may meet the partial conditions of providing clickable URLs, but the crucial “useful information” as to what any of this data means is absent.
Every single dataset on data.gov has supporting references to text files, PDFs, Web pages or the like that describe the nature of the data within each dataset. Until that information is exposed and made usable, we have no linked data.
Until ontologies get created from these technical documents, the value of these data instances remains locked up, and no value can be created from having these datasets expressed in RDF.
The devil lies in the details. The essential hard work has not yet begun.
The NYT Case
Though at a much smaller scale with many fewer attributes, the NYT dataset suffers from the same failing: it too lacks a vocabulary.
So, let’s take the case of one of the lead actors in The Hustler, Paul Newman, who played the role of “Fast Eddie” Felson. Here is the NYT record for the “person” Paul Newman (which they also refer to as http://data.nytimes.com/newman_paul_per). Note the header title of Newman, Paul:
Click on any of the internal labels used by the NYT for its own attributes (such as nyt:first_use), and you will be given this message:
“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”
We again have no idea what is meant by all of this data except for the labels used for its attributes. In this case for nyt:first_use we have a value of “2001-03-18”.
Hello? What? What is a “first use” for a “Paul Newman” of “2001-03-18”???
The NYT put the cart before the horse: even if minimal, they should have released their ontology first — or at least at the same time — as they released their data instances. (See further this discussion about how an ontology creation workflow can be incremental by starting simple and then upgrading as needed.)
Links to Other Things
Since there really are no links to other things on the Data-Gov Wiki, our focus in this section continues with the NYT dataset using our same example.
We now are in the territory of the fourth “rule” of linked data: 4. Include links to other URIs so that they can discover more things.
This will seem a bit basic at first, but before we can talk about linking to other things, we first need to understand and define the starting “thing” to which we are linking.
What is a “Newman, Paul” Thing?
Of course, without its own vocabulary, we are left to deduce what this thing “Newman, Paul” is that is shown in the previous screen shot. Our first clue comes from the statement that it is of rdf:type SKOS concept. By looking to the SKOS vocabulary, we see that concept is a class and is defined as:
A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this definition is meant to be suggestive, rather than restrictive. The notion of a SKOS concept is useful when describing the conceptual or intellectual structure of a knowledge organization system, and when referring to specific ideas or meanings established within a KOS.
We also see that this instance is given a foaf:primaryTopic of Paul Newman.
So, we can deduce so far that this instance is about the concept or idea of Paul Newman. Now, looking to the attributes of this instance — that is the defining properties provided by the NYT — we see the properties of nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage. Completing our deductions, and in the absence of its own vocabulary, we can now define this concept instance somewhat as follows:
New York Times articles in the period 2001 to 2009 having as their primary topic the actor Paul Newman
(BTW, across all records in this dataset, we could see what the earliest first use was to better deduce the time period over which these articles have been assembled, but that has not been done.)
We also would re-title this instance more akin to “2001-2009 NYT Articles with a Primary Topic of Paul Newman” or some such and use URIs more akin to this usage.
sameAs Woes
Thus, in order to make links or connections with other data, it is essential to understand what the nature is of the subject “thing” at hand. There is much confusion about actual “things” and the references to “things” and what is the nature of a “thing” within the literature and on mailing lists.
Our belief and usage in matters of the semantic Web is that all “things” we deal with are a reference to whatever the “true”, actual thing is. The question then becomes: What is the nature (or scope) of this referent?
There are actually quite easy ways to determine this nature. First, look to one or more instance examples of the “thing” being referred to. In our case above, we have the “Newman, Paul” instance record. Then, look to the properties (or attributes) the publisher of that record has used to describe that thing. Again, in the case above, we have nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage.
Clearly, this instance record — that is, its nature — deals with articles or groups of articles. The relation to Paul Newman occurs as a basis of the primary topic of these articles, and not a person basis for which to describe the instance. If the nature of the instance was indeed the person Paul Newman, then the attributes of the record would more properly be related to “person” properties such as age, sex, birthdate, death date, marital status, etc.
This confusion by NYT as to the nature of the “things” they are describing then leads to some very serious errors. By confusing the topic (Paul Newman) of a record with the nature of that record (articles about topics), NYT next misuses one of the most powerful semantic Web predicates available, owl:sameAs.
By asserting in the “Newman, Paul” record that the instance has a sameAs relationship with external records in Freebase and DBpedia, the NYT both entails that properties from any of the associated records are shared and infers a chain of other types to describe the record. More precisely, the NYT is asserting that the “things” referred to by these instances are identical resources.
Thus, by the sameAs statements in the “Newman, Paul” record, the NYT is also asserting that that record is an instance of all these things [5]:
Furthermore, because of its strong, reciprocal entailments, the owl:sameAs assertion would also now entail that the person Paul Newman has the nyt:first_use and nyt:last_use attributes, clearly illogical for a “person” thing.
This connection is clearly wrong in both directions. Articles are not persons and don’t have marital status; and persons do not have first_uses. By misapplying this sameAs linkage relationship, we have screwed things up in every which way. And the error began with misunderstanding what kinds of “things” our data is about.
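The two-way nature of the error is easy to reproduce. The Python sketch below is a toy illustration, not a real OWL reasoner: it treats owl:sameAs as symmetric and transitive and copies every statement about one resource to all of its aliases (subject position only). The nyt:first_use value comes from the record above; the rdf:type foaf:Person statement on the DBpedia side is assumed for the example.

```python
SAME_AS = "owl:sameAs"


def entail_same_as(triples):
    """Return the input triples plus those implied by owl:sameAs aliases."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Group resources that are declared identical.
    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)

    # Every statement about one alias holds for all the others.
    inferred = set(triples)
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        for alias in groups.get(find(s), {s}):
            inferred.add((alias, p, o))
    return inferred


published = [
    ("nyt:newman_paul_per", "nyt:first_use", "2001-03-18"),
    ("nyt:newman_paul_per", SAME_AS, "dbpedia:Paul_Newman"),
    ("dbpedia:Paul_Newman", "rdf:type", "foaf:Person"),  # assumed for the example
]

for statement in sorted(entail_same_as(published) - set(published)):
    print(statement)
```

The two new statements are the problem in miniature: the person acquires a nyt:first_use, and the collection of articles becomes a foaf:Person.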
Some Options
However, there are solutions. First, the sameAs assertions, at least involving these external resources, should be dropped.
Second, if linkages are still desired, a vocabulary such as UMBEL [4] could be used to make an assertion between such a concept, and these other related resources. So, even though these resources are not the same, they are closely related. The UMBEL ontology helps us to define this kind of relation between related, but non-identical, resources.
Instead of using the owl:sameAs property, we would suggest the usage of umbel:linksEntity, which links a skos:Concept to related named entity resources. Additionally, Freebase, which also currently asserts a sameAs relationship to the NYT resource, could use the umbel:isAbout relationship to assert that their resource “is about” a certain concept, which is the one defined by the NYT.
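In practice the substitution is mechanical. The following Python sketch rewrites sameAs statements that touch the NYT concept: outgoing links become umbel:linksEntity and incoming links become umbel:isAbout. The UMBEL namespace URI shown is an assumption on our part, and the Freebase URI is a stand-in under example.org rather than the real identifier.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
UMBEL = "http://umbel.org/umbel#"  # assumed namespace for the UMBEL vocabulary


def relink(triples, concept_uri):
    """Replace owl:sameAs around a concept with the weaker UMBEL relations."""
    rewritten = []
    for s, p, o in triples:
        if p == OWL_SAME_AS and s == concept_uri:
            # The concept points at a related named entity.
            rewritten.append((s, UMBEL + "linksEntity", o))
        elif p == OWL_SAME_AS and o == concept_uri:
            # An external resource points back at the concept.
            rewritten.append((s, UMBEL + "isAbout", o))
        else:
            rewritten.append((s, p, o))
    return rewritten


NYT_CONCEPT = "http://data.nytimes.com/newman_paul_per"
FREEBASE_STAND_IN = "http://example.org/freebase/paul-newman"  # hypothetical

before = [
    (NYT_CONCEPT, OWL_SAME_AS, "http://dbpedia.org/resource/Paul_Newman"),
    (FREEBASE_STAND_IN, OWL_SAME_AS, NYT_CONCEPT),
]

for s, p, o in relink(before, NYT_CONCEPT):
    print(f"<{s}> <{p}> <{o}> .")
```

Because neither relation carries the reciprocal identity semantics of sameAs, the person record no longer inherits article properties, and vice versa.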
Alternatively, still other external vocabularies that more precisely capture the intent of the NYT publishers could be found, or the NYT editors could define their own properties specifically addressing their unique linkage interests.
Other Minor Issues
As a couple of additional, minor suggestions for the NYT dataset, we would suggest:
There are very valuable benefits from entailment, inference and logic to be gained from linking resources. However, if the nature of the “things” being linked — or the properties that define these linkages — are incorrect, then very wrong logical implications result. Great care and understanding should be applied to linkage assertions.
In the End, the Challenge is Not Linked Data, but Connected Data
Our critical comments are not meant to be disrespectful and are not being picky. The NYT and TWC are prominent institutions for which we should expect leadership on these issues. Our criticisms (and we believe those of others) are also not an expression of a “trough of disillusionment” as some have been pointing out.
This posting has been jointly authored by Mike Bergman and Fred Giasson and simultaneously published on both of their blogs, hoping to draw more attention to the need for better practices in publishing linked data.
This posting is about poor practices, pure and simple. The time to correct them is now. If asked, we would be pleased to help either institution establish exemplar practices. This is not automatic, and it is not always easy. The data.gov datasets, in particular, will require much time and effort to get right. There is much documentation that needs to be transitioned and expressed in semantic Web formats.
In a broader sense, we also seem to lack a definition of best practices related to vocabularies, schema and mappings. The Berners-Lee rules are imprecise and insufficient as is. Prior best guidance documents tend to be more how to publish and make URIs linkable, than to properly characterize, describe and connect the data.
Perhaps, in part, this is a bit of a semantics issue. The challenge is not the mechanics of linking data, but the meaning and basis for connecting that data. Connections require logic and rationality sufficient to reliably inform inference and rule-based engines. It also needs to pass the sniff test as we “follow our nose” by clicking the links exposed by the data.
It is exciting to see high-quality content such as from national governments and major publishers like the New York Times begin to be exposed as linked data. When this content finally gets embedded into usable contexts, we should see manifest uses and benefits emerge. We hope both institutions take our criticisms in that spirit.
[1] The NYT has been updated with improvements and they fixed multiple issues from the first release. The problems listed herein, however, still pertain after these improvements.
[2] Tim Berners-Lee, 2006. Linked Data (Design Issues), first posted on 2006-07-27; last updated on 2009-06-18. See http://www.w3.org/DesignIssues/LinkedData.html. Berners-Lee refers to the steps above as “rules,” but he elaborates they are expectations of behavior. Most later citations refer to these as “principles.”
[3] Li Ding, Dominic DiFranzo, Sarah Magidson, Deborah L. McGuinness and Jim Hendler, 2009. Data-GovWiki: Towards Linked Government Data. See http://www.cs.vu.nl/~pmika/swc/documents/Data-gov%20Wiki-data-gov-wiki-v1.pdf.
[4] UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology structure in development for relating Web content and data to a standard set of subject concepts. Its purpose has resulted in its creation of an associated vocabulary geared to both class-instance and reciprocal relationships, as well as partial or likelihood relationships. See http://umbel.org/technical_documentation.html#vocabulary.
[5] We’d like to thank Denny Vrandecic (see comments) for pointing out an imprecision in our original wording. This phrase was originally stated as, “Thus, by the sameAs statements in the ‘Newman, Paul’ record, the NYT is also asserting that that record is the same as these other things.”

In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nick because of my aggressive use of spreadsheets for financial models, competitors tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.
Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide [1], making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.
Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.
How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) [2]. I recently published a case study [3] that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools [4] to aid users in understanding and using the commON notation. I summarize portions of that study herein.
This is the second article of a two-part series related to the recent Sweet Tools update.
Background on Sweet Tools and irON
The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and -related spaces. I began publishing it and others found it of value so I continued to develop it.
As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset, with which many other people desired to use and interact. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.
It was natural for me to maintain this listing as a spreadsheet, which was also reinforced when I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet about three years back. Here is a partial view of this spreadsheet as I maintain it locally:
When we began to develop irON in earnest as a simple (“naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV [5], option should join the other two serializations under consideration, XML and JSON. CSV, though less expressive and capable as a data format than the other serializations, still has an attribute-value pair (also known as key-value pairs and many other variants [6]) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.
As a dataset very familiar to us as irON’s editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.
The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.
Dataset Authoring in a Spreadsheet
A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that these full functional capabilities of the spreadsheet are available during authoring or later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.
So, some of the reasons for small dataset authoring in a spreadsheet include:
The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see [7].
Because you can create multiple worksheets within a spreadsheet, it is not necessary to modify existing worksheets or tabs. Rather, if you are reluctant or cannot change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.
To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.
In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically ‘escaped’ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (‘\’ is the standard). However, you probably should not change these defaults for most conditions.
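The same behaviour can be seen outside a spreadsheet. In the Python sketch below, the standard csv module stands in for the spreadsheet export: it wraps a cell containing commas in double quotes and doubles any embedded double quotes. The split_list helper is a hypothetical illustration of how a pipe-delimited list with a backslash escape might be read back; how the actual commON parser treats escapes is assumed here, not documented.

```python
import csv
import io


def to_csv_row(cells):
    """Write one row the way a typical CSV export would, quoting as needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(cells)
    return buffer.getvalue()


def split_list(value, delimiter="|", escape="\\"):
    """Split a list-valued cell, honouring an escape before the delimiter."""
    items, current, i = [], [], 0
    while i < len(value):
        ch = value[i]
        if ch == escape and i + 1 < len(value):
            current.append(value[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == delimiter:
            items.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    items.append("".join(current))
    return items


row = ["tool-1", 'A "semantic" tool, with commas', "Java|Python"]
print(to_csv_row(row), end="")
print(split_list("Java|Python"))
print(split_list("a\\|b|c"))
```

The cell with commas and quotes comes out as a single quoted field, so the pipe-delimited list in the last column is untouched by the CSV layer and can be split afterwards.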
The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.
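If an exported CSV file is not already UTF-8, it can be converted before submission. The sketch below is one minimal way to do that in Python; the function names are illustrative, and the source encoding is something you must know or determine yourself, since it cannot be reliably guessed from the bytes alone.

```python
def to_utf8_bytes(raw, source_encoding):
    """Decode bytes from a known source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_file_to_utf8(source_path, target_path, source_encoding):
    """Rewrite a CSV export as UTF-8 (paths and encoding supplied by caller)."""
    with open(source_path, "rb") as source:
        data = to_utf8_bytes(source.read(), source_encoding)
    with open(target_path, "wb") as target:
        target.write(data)
```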
In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (‘&&’), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).
In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.
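To make the module convention concrete, here is a minimal Python sketch that groups the rows of a CSV file under the &&keyword that precedes them. It is not the commON parser; split_modules is a hypothetical name, and the sketch assumes the module keyword always sits in the first column of its own row.

```python
import csv
import io


def split_modules(csv_text):
    """Group CSV rows under the &&module keyword that precedes them."""
    modules = {}
    current = None
    for row in csv.reader(io.StringIO(csv_text)):
        if not any(cell.strip() for cell in row):
            continue  # skip blank rows
        first = row[0].strip()
        if first.startswith("&&"):
            current = first[2:]  # e.g. "recordList", "dataset", "linkage"
            modules.setdefault(current, [])
        elif current is not None:
            modules[current].append(row)
    return modules
```

Because rows are keyed by module name, the same function works whether the modules arrive in one CSV file or in several files processed one after another.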
The RecordList Module
The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one or more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attributeName convention in the columns for the first row after the &&recordList keyword is encountered).
The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.
At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:
Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:
Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).
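The single row style described above can be read with very little code. The following Python sketch turns a &&recordList block into a list of dictionaries, one per record. It is an illustration only: parse_record_list is a hypothetical function, the sample attribute names other than &id and &description are invented, and the stacked style is deliberately not handled.

```python
import csv
import io


def parse_record_list(csv_text):
    """Parse a single-row-style &&recordList block into dictionaries."""
    rows = [
        row for row in csv.reader(io.StringIO(csv_text))
        if any(cell.strip() for cell in row)
    ]
    if len(rows) < 2 or rows[0][0].strip() != "&&recordList":
        raise ValueError("expected &&recordList followed by an attribute row")
    # The row after the keyword names the attributes, e.g. &id, &description.
    header = [cell.strip().lstrip("&") for cell in rows[1]]
    records = []
    for row in rows[2:]:
        record = {
            name: value.strip()
            for name, value in zip(header, row)
            if name and value.strip()
        }
        records.append(record)
    return records
```

Note that the CSV reader removes the quoting around a description containing commas, so the record holds the original cell text.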
The Dataset Module
The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset [8]. Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:
The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.
Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted [8]. By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:
Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.
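A linkage listing of this kind can be read into a simple lookup table. The Python sketch below is a guess at one workable layout, not the commON implementation: it assumes a row starting with &attributeList or &typeList opens a section, that each following row holds a name in the first column and its &mapTo target in the second, and that no &prefixList is in play, so every target must be a full URL. The function names and example.org URLs are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse


def is_full_url(value):
    """Return True when value looks like an absolute http(s) URL."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def parse_linkage(csv_text):
    """Collect attribute and type mappings from a &&linkage block."""
    mappings = {"attributeList": {}, "typeList": {}}
    section = None
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        if not any(cells) or cells[0] == "&&linkage":
            continue
        key = cells[0].lstrip("&")
        if cells[0].startswith("&") and key in mappings:
            section = key  # start of an attribute or type listing
        elif section is not None and len(cells) > 1 and cells[1]:
            if not is_full_url(cells[1]):
                raise ValueError(f"&mapTo value is not a full URL: {cells[1]}")
            mappings[section][cells[0]] = cells[1]
    return mappings
```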
The Schema (structure) Module
In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets [2]. Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.
A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future [8].
Saving and Importing
If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its current reference spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.
My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listings for Sweet Tools, its filename could be something like: swt_20091110_records.csv.
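The naming convention is easy to automate. This small Python sketch builds a filename in the appname_date_modulename.csv form; common_filename is a hypothetical helper, not part of any commON tooling.

```python
from datetime import date


def common_filename(appname, when, modulename):
    """Build an appname_YYYYMMDD_modulename.csv filename."""
    return f"{appname}_{when.strftime('%Y%m%d')}_{modulename}.csv"
```

For example, common_filename("swt", date(2009, 11, 10), "records") produces swt_20091110_records.csv. Because the date is zero-padded and ordered year, month, day, a plain alphabetical sort in a file manager also sorts the versions chronologically.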
Once saved, these files are now ready to be imported into a structWSF [9] instance, which is where the CSV parsing and conversion to interoperable RDF occurs [8]. In this case study, we used the Drupal-based conStruct SCS system [10]. conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.
Using the Dataset
We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) [10].
Introduction to the App
The screen capture below shows a couple of aspects of the system:
One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!
Some Sample Uses
Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:
Note, if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.
Exporting in Alternative Formats
Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations [8]:
As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.
The Formal Case Study
The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation [3].
[1] In 2003, Microsoft estimated its worldwide users of the Excel spreadsheet, which then had about a 90% market share globally, at 400 million. Others at that time estimated unauthorized use to perhaps double that amount. There has been significant growth since then, and online spreadsheets such as Google Docs and Zoho have also grown wildly. This surely puts spreadsheet users globally into the 1 billion range.
[2] See Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009. See http://openstructs.org/iron/iron-specification. Also see the irON Web site, Google discussion group, and code distribution site.
[3] Michael Bergman, 2009. Annex: A commON Case Study using Sweet Tools, Supplementary Documentation, prepared by Structured Dynamics LLC, November 10, 2009. See http://openstructs.org/iron/common-swt-annex. It may also be downloaded in PDF.
[4] See Michael K. Bergman’s AI3:::Adaptive Information blog, Sweet Tools (Sem Web). In addition, the commON version of Sweet Tools is available at the conStruct site.
[5] The CSV mime type is defined in Common Format and MIME Type for Comma-Separated Values (CSV) Files [RFC 4180]. A useful overview of the CSV format is provided by The Comma Separated Value (CSV) File Format. Also, see that author’s related CTX reference for a discussion of how schema and structure can be added to the basic CSV framework; see http://www.creativyst.com/Doc/Std/ctx/ctx.htm, especially the section on the comma-delimited version (http://www.creativyst.com/Doc/Std/ctx/ctx.htm#CTC).
[6] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table. Attribute-values can also be presented as pairs in a form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.
[7] See especially SUB-PART 3: commON PROFILE in Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.
[8] As of the date of this case study, some of the processing steps in the commON pipeline are manual. For example, the parser creates an intermediate N3 file that is actually submitted to the structWSF. Within a week or two of publication, these capabilities should be available as a direct import to a structWSF instance. However, there is one exception to this: the specification for the schema structure. That module has been prototyped, but will not be released with the first commON upgrade. That enhancement is likely a few weeks off from the date of this posting. Please check the irON or structWSF discussion groups for announcements.
[9] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access (including large instance record stores in existing relational databases), it is also a framework for Web-wide deployments and interoperability.
[10] conStruct SCS is a structured content system built on the Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces. It is based on RDF and SD’s structWSF platform-independent Web services framework [9]. In addition to user access control and management and a general user interface, conStruct provides Drupal-level CRUD, data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF.
[11] More Web services are being added to structWSF on a fairly constant basis, and the existing ones have been through a number of upgrades.