Features are the inputs to machine learners. The outputs of machine learners are predictions of outcomes, based on an inferred (or “learned”) model or function. In image recognition, as an example, the inputs are the characteristics of pixels and those adjacent to them; the output may be a prediction there is an image representation of “cat”. In NLP, as another case, the input might be the text, title and URL of emails; the output may be a prediction of “spam”. If we treat all ML learners as black boxes, features are what is fed to the box, and predicted labels or structures are what comes out.
As I recently argued, the importance of features has been overlooked in comparison to the choice of machine learners or how to lower the costs and efforts of labeling and creating training sets and standards. But the complete picture also needs to include feature extraction, feature selection, and feature engineering.
A recent review paper helps redress this imbalance. Feature Selection: A Data Perspective , surveys and provides a comprehensive and well-organized overview of recent advances in feature selection research. According to the authors, Li et al., “the objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and helping prepare, clean, and understand data.” The practical 73-page review is accompanied by an open-source feature selection library that consists of most of the popular feature selection algorithms covered in the review, and a comprehensive performance analysis of the methods and their results.
The first nine pages of the review are devoted to a broad, accessible overview. The intro provides a clear explanation of features and their role in feature selection. It also explains why the high-dimensionality of features is a challenge in its own right.
The bulk of the document is devoted to a discussion of the various methods used in feature selection, organized according to:
Each of the methods is characterized as to whether it is applicable to supervised or unsupervised learning. While I have used a different classification of the feature space, that does not affect the usefulness of Li et al.’s  approach. Also, in keeping with a review article, there are more than 11 pages of references containing nearly 150 citations.
The combined review nature of the paper also means that various methods have been reduced to a common symbol set, which is a handy way to relate available features to multiple learners. This common treatment enables the authors to create the open source repository, scikit-feast, written in Python and available from Github, that provides a library of 25 of the methods covered. A separate Web site presents some test datasets and performance results. Here is one example of many of the available results:
This paper deserves a permanent place on anyone’s resource shelf who has a serious interest in machine learning. I would not be surprised to see the authors’ organizational structure of feature selection methods become a standard. It is always a pleasure to encounter papers that are well-written, understandable and comprehensive. Great job! Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, Huan Liu, 2016. “Feature Selection: A Data Perspective,” arXiv:1601.07996, 29 Jan 2016.
I have to admit the first I heard the title of Pedro Domingos‘ recent book, The Master Algorithm, I was off-put, similar to the way I react negatively to the singularity made famous by Ray Kurzweil. I don’t tend to buy into single answers or theories of everything.
But as a recent talk by Domingos at Google shows, he has much more insight to share about the roots and “tribes” associated with machine learning. If you are new to ML and want to learn more about the big picture underlying its main approaches and tenets, the hour spent watching this video will prove valuable:
The strength of the talk is to describe what Domingos calls the five “tribes” underlying machine learning and the lead researchers, premises and approaches underlying each:
You can also see the slides here to Domingos’ talk.
As Domingos emphasizes, each of these approaches has its applications, strengths and weaknesses. He posits there are shared aspects and generalities underlying all of these methods that can help point the way to perhaps more universal approaches, the master algorithm.
I have argued elsewhere about the importance of knowledge bases to recent AI breakthroughs more than algorithms, but ultimately, of course, specific calculation methods need to underpin any learning approach. Though I’m not convinced there is a “master” algorithm, there is also great value in understanding the premises and mindsets behind these main approaches to machine learning.
In Part I of this series we pointed to the importance of large electronic knowledge bases (Big Data) in the recent advances in AI for knowledge- and text-based systems. Amongst other factors such as speedy GPUs and algorithm advances, we noted that electronic knowledge bases are perhaps the most important factor in the resurgence of artificial intelligence.
But the real question is not what the most important factor in the resurgence of AI may be — though the answer to that points to vetted, reference training sources. The real question is: How can we systematize this understanding by improving the usefulness of knowledge bases to support AI machine learning? Knowing that knowledge bases are important is not enough. If we can better understand what promotes — and what hinders — KBs for machine learning, perhaps we can design KB approaches that are even quicker and more effective for AI. In short, what should we be coding to promote knowledge-based artificial intelligence (KBAI)?Why is There Not a Systematic Approach to Using KBs in AI?
To probe this question, let’s take the case of Wikipedia, the most important of the knowledge bases for AI machine learning purposes. According to Google Scholar, there have been more than 15,000 research articles relating Wikipedia to machine learning or artificial intelligence . Growth in articles noting the use of Wikipedia for AI has been particularly strong in the past five years .
But two things are remarkable about this use of Wikipedia. First, virtually every one of these papers is a one-off. Each project stages and uses Wikipedia in its own way and with its own methodology and focus. Second, Wikipedia is able to make its contributions despite the fact there are many weaknesses and gaps within the knowledge base itself. Clearly, despite weaknesses, the availability of such large-scale knowledge in electronic form still provides significant advantages to AI and machine learning in the realm of natural language applications and text understanding. How much better might our learners be if we fed them more complete and coherent information?
Readers of this blog will be familiar with my periodic criticisms of Wikipedia as a structured knowledge resource . These criticisms are not related to the general scope and coverage of Wikipedia, which, overall, is remarkable and unprecedented in human endeavor. Rather, the criticisms relate to the use of Wikipedia as is for knowledge representation purposes. To recap, here are some of the weaknesses of Wikipedia as a knowledge resource for AI:
As the tremendous use of Wikipedia for research shows, none of these weaknesses is fatal, and none alone has prevented meaningful use of the knowledge base. Further, there is much active research in areas such as knowledge base population  that promise to aid solutions to some of these weaknesses. The recognition of the success of knowledge bases to train AI machine learners is also now increasing awareness that KB design for AI purposes is a worthwhile research topic in its own right. Much is happening leveraging AI in bot designs for both Wikipedia and Wikidata. A better understanding of how to test and ensure coherency, matched with a knowledge graph for inferencing and logic, should help promote better and faster AI learners. The same techniques for testing consistency and coherence may be applied to mapping external KBs into such a reference structure.
Thus, the real question again is: How can we systematize the usefulness of knowledge bases to support AI machine learning? Simply by asking this question we can alter our mindset to discover readily available ways to improve knowledge bases for KBAI purposes.Working Backwards from the Needs of Machine Learners
The best perspective to take on how to optimize knowledge bases for artificial intelligence derives from the needs of the machine learners. Not all individual learners have all of these needs, but from the perspective of a “platform” or “factory” for machine learners, the knowledge base and supporting structures should consider all of these factors:
Thus, an effective knowledge base to support KBAI should have a number of desirable aspects. It should have maximum structure and exposed features. It should be organized by a coherent knowledge graph, which can be effectively sliced and reasoned over. It must be testable via logic and consistency and performance tests, such that training and reference sets may be vetted and refined. And it must have repeatable and verifiable scripts for updating the underlying KB as it changes and to generate new, working feature sets. Moreover, means for generally managing and manipulating the knowledge base and knowledge graph are important. These desirable aspects constitute a triad of required functionality.Guidance for a Research Agenda
Achieving a purposeful knowledge base for KBAI uses is neither a stretch nor technically risky. We see the broad outlines of the requirements in the discussion above.
Two-thirds of the triad are relatively straightforward. First, creating a platform for managing the KBs is a fairly standard requirement for knowledge and semantics purposes; many platforms presently exist that can access and manage knowledge graphs, knowledge bases, and instance data at the necessary scales. Second, the build and testing scripts do require systematic attention, but these are also not difficult tasks and are quite common in many settings. It is true that build and testing scripts can often prove brittle, so care needs to be placed into their design to facilitate maintainability. Fortunately, these are matters mostly of proper focus and good practice, and not conceptually difficult.
The major challenges reside in the third leg of the triad, namely in the twin needs to map the knowledge base into a coherent knowledge graph and into an underlying speculative grammar that is logically computable to support the expression of both feature and training sets. A variety of upper ontologies and lexical frameworks such as WordNet have been tried as the guiding graph structures for knowledge bases . To my knowledge, none of these options has been scrutinized with the specific requirements of KBAI support in mind. With respect to the other twin need, that of a speculative grammar, our research to date  points to the importance of segregating the KB information into topics (concepts), relations and relation types, entities and entity types, and attributes and attribute types. The possible further distinctions, however, into possibly roles, annotations (metadata), events and expressions of mereology still require further research. The role, placement and use of rules also remain to be determined.You can see Part I of
Anyone beyond a certain age may recall the waning and waxing of the idea of AI, artificial intelligence. In fact, the periodic dismal prospects and poor reputation of artificial intelligence have been severe enough at times so as to warrant its own label: the “AI winters.” Clearly, today, we are in a resurgence of AI. But why is this so? Is the newly re-found popularity of AI merely a change in fashion, or is it due to more fundamental factors? And if it is more fundamental, what might those factors be that have led to this resurgence?
We only need to look at the world around us to see that the resurgence in AI is due to real developments, not a mere change in fashion. From virtual assistants that we can instruct or question by voice command to self-driving cars and face recognition, many mundane or tedious tasks of the past are being innovated away. The breakthroughs are real and seemingly at an increasing pace.
As to the reasons behind this resurgence, more than a year ago, the technology futurist Kevin Kelly got it mostly right when he posited these three breakthroughs :1. Cheap parallel computation Thinking is an inherently parallel process, billions of neurons firing simultaneously to create synchronous waves of cortical computation. To build a neural network—the primary architecture of AI software—also requires many different processes to take place simultaneously. Each node of a neural network loosely imitates a neuron in the brain—mutually interacting with its neighbors to make sense of the signals it receives. To recognize a spoken word, a program must be able to hear all the phonemes in relation to one another; to identify an image, it needs to see every pixel in the context of the pixels around it—both deeply parallel tasks. But until recently, the typical computer processor could only ping one thing at a time. . . . That began to change more than a decade ago, when a new kind of chip, called a graphics processing unit, or GPU, was devised for the intensely visual—and parallel—demands of videogames . . . . 2. Big Data Every intelligence has to be taught. A human brain, which is genetically primed to categorize things, still needs to see a dozen examples before it can distinguish between cats and dogs. That’s even more true for artificial minds. Even the best-programmed computer has to play at least a thousand games of chess before it gets good. Part of the AI breakthrough lies in the incredible avalanche of collected data about our world, which provides the schooling that AIs need. Massive databases, self-tracking, web cookies, online footprints, terabytes of storage, decades of search results, Wikipedia, and the entire digital universe became the teachers making AI smart. 3. Better algorithms Digital neural nets were invented in the 1950s, but it took decades for computer scientists to learn how to tame the astronomically huge combinatorial relationships between a million—or 100 million—neurons. The key was to organize neural nets into stacked layers. Take the relatively simple task of recognizing that a face is a face. When a group of bits in a neural net are found to trigger a pattern—the image of an eye, for instance—that result is moved up to another level in the neural net for further parsing. The next level might group two eyes together and pass that meaningful chunk onto another level of hierarchical structure that associates it with the pattern of a nose. It can take many millions of these nodes (each one producing a calculation feeding others around it), stacked up to 15 levels high, to recognize a human face. . . .
To these factors I would add a fourth: 4. Distributed architectures (beginning with MapReduce) and new performant datastores (NoSQL, graph DBs, and triplestores). These new technologies, plus some rediscovered, gave us the confidence to tackle larger and larger reference datasets, while also helping us innovate high-performance data representation structures, such as graphs, lists, key-value pairs, feature vectors and finite state transducers. In any case, Kelly also notes the interconnection amongst these factors in the cloud, itself a more general enabling factor. I suppose, too, one could add open source to the mix as another factor.
Still, even though these factors have all contributed, I have argued in my series on knowledge-based artificial intelligence (KBAI) the role of electronic data sets (Big Data) as the most important enabling factor . These reference datasets may range from images for image recognition (such as ImageNet) to statistical compilations from text (such as N-grams or co-occurrences) to more formal representations (such as ontologies or knowledge bases). Knowledge graphs and knowledge bases are the key enablers for AI in the realm of knowledge management and representation.
Some also tout algorithms as the most important source of AI innovation, but Alexander Wissner-Gross in the Edge online magazine comes down squarely on the side of data in AI as the most interesting news in recent science :. . . perhaps many major AI breakthroughs have actually been constrained by the availability of high-quality training datasets, and not by algorithmic advances. For example, in 1994 the achievement of human-level spontaneous speech recognition relied on a variant of a hidden Markov model algorithm initially published ten years earlier, but used a dataset of spoken Wall Street Journal articles and other texts made available only three years earlier. In 1997, when IBM’s Deep Blue defeated Garry Kasparov to become the world’s top chess player, its core NegaScout planning algorithm was fourteen years old, whereas its key dataset of 700,000 Grandmaster chess games (known as the “The Extended Book”) was only six years old. In 2005, Google software achieved breakthrough performance at Arabic- and Chinese-to-English translation based on a variant of a statistical machine translation algorithm published seventeen years earlier, but used a dataset with more than 1.8 trillion tokens from Google Web and News pages gathered the same year. In 2011, IBM’s Watson became the world Jeopardy! champion using a variant of the mixture-of-experts algorithm published twenty years earlier, but utilized a dataset of 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg updated one year prior. In 2014, Google’s GoogLeNet software achieved near-human performance at object classification using a variant of the convolutional neural network algorithm proposed twenty-five years earlier, but was trained on the ImageNet corpus of approximately 1.5 million labeled images and 1,000 object categories first made available only four years earlier. Finally, in 2015, Google DeepMind announced its software had achieved human parity in playing twenty-nine Atari games by learning general control from video using a variant of the Q-learning algorithm published twenty-three years earlier, but the variant was trained on the Arcade Learning Environment dataset of over fifty Atari games made available only two years earlier. Examining these advances collectively, the average elapsed time between key algorithm proposals and corresponding advances was about eighteen years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than three years, or about six times faster, suggesting that datasets might have been limiting factors in the advances.
Seeing these correlations only affirms the importance of looking at knowledge bases from the specific lens of how they may best support training AI machine learners. We see the correlation; it is now time to optimize the expression of these KB potentials. We need to organize the KBs via coherent knowledge graphs and express the KBs in types, entities, attributes and relations representing their inherent, latent knowledge structure. Properly expressed KBs can support creating positive and negative training sets, promote feature set generation and expression, and create reference standards for testing AI learners and model parameters.
Past AI winters arose from lofty claims that were not then realized. Perhaps today’s claims may meet a similar fate.
Yet somehow I don’t think so. The truth is, today, we are seeing rapid progress in AI tasks of increasing usefulness and value all around us. The benefits from what will continue to be seen as ubiquitous AI should now ensure an economic and innovation engine behind AI for many years to come. One way that the AI engine will continue to be fueled is through a systematic understanding of how knowledge bases and their features can work hand in hand with machine learning to more effectively automate and meet our needs.You can see Part II of
The two most labor-intensive steps in machine learning for natural language are: 1) feature engineering; and 2) labeling of training sets. Supervised machine learning uses these training sets where every point is an input-output pair, mapping an input, which is a feature, to an output, which is the label. The machine learning consists of inferring (“learning”) a function that maps between these inputs and outputs with acceptable predictive power. This learned function can then be applied to previously unseen inputs in order to predict the output label. The technique is particularly suited to problems of regression or of classification.
It is not surprising that the two most labor-intensive tasks in machine learning involve determining the suitable inputs (features) and correctly labeling the output training labels. Elsewhere in this series I discuss training sets and labeling in detail. For this current article, we focus on features.
“Features” are perhaps the least discussed aspect of machine learning. References are made to how to select them; how to construct, extract or learn them; or how even to overall engineer them. But little guidance is provided as to what features exactly are. There really is no listing or inventory for what “features” might even be considered in the various aspects of natural language or text understanding. In part, I think, because we do not have this concrete feel for features, we also don’t tend to understand how to maximize and systematize the role of features in support of our learning tasks. This aspect provides a compelling rationale for the advantages of properly constructed knowledge bases in support of artificial intelligence, what we have been terming as KBAI in this series.
So, before we can understand how to best leverage features in our KBAI efforts, we need to first define and name the feature space. That effort, in turn, enables us to provide a bit of an inventory for what features might contribute to natural language or knowledge base learning. We then organize that inventory a bit to point out the structural and conceptual relationships among these features, which enables us to provide a lightweight taxonomy for the space.
Since many of these features have not been named or exposed before, we conclude the article with some discussion about what next-generation learners may gain by working against this structure. Of course, since much of this thinking is incipient, there are certainly forks and deadends in what may unfold ahead, but there also will likely be unforeseen expansions and opportunities as well. A systematic view of machine learning in relation to knowledge and human language features — coupled with large-scale knowledge bases such as Wikipedia and Wikidata — can lead to faster and cheaper learners across a very broad range of NLP tasks .What is a Feature?
A “feature is an individual measurable property of a phenomenon being observed” . It is an input to a machine learner, an explanatory variable, sometimes in the form of a function. Features are sometimes equated to attributes, but this is not strictly true, since a feature may be a combination of other features, or a statistical calculation, or an abstraction of other inputs. In any case, a feature must be expressed as a numeric value (including Boolean) upon which the machine learner can calculate its predictions. Machine learner predictions of the output can only be based on these numeric features, though they can be subject to rules and weights depending on the type of learner.
The importance of features and the fact they may be extracted or constructed from other inputs is emphasized in this quote from Pedro Domingos :“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. . . . Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and ‘black art’ are as important as the technical stuff.”
Many experienced ML researchers make similar reference to the art or black art of features. In broad strokes, a feature may be a surface form, like terms or syntax or structure (such as hierarchy or connections); it may be derived (such as statistical, frequency, weighted or based on the ML model used); it may be semantic (in terms of meanings or relations); or it may be latent, as either something hidden or abstracted from feature layers below it. Unsupervised learning or deep learning features arise from the latent form.
For a given NLP problem domain, features can number into the millions or more. Concept classification, for example, could use features corresponding to all of the unique words or phrases in that domain. Relations between concepts could also be as numerous. To assign a value to such “high-dimensional” features, some form of vector relationship is calculated over, say, all of the terms in the space so that each term can be represented numerically . Because learners may learn over multiple feature types, the potential combinations to be evaluated for the ML learner can literally be astronomical. This combinatorial problem has been known for decades, and has been termed the curse of dimensionality for more than 50 years .
Of course, just because a feature exists says nothing about whether it is a piece of information that might be useful for ML predictions or not. Features may thus prove to be one of four kinds: 1) strongly relevant; 2) weakly relevant; 3) irrelevant; or 4) redundant . Strongly relevant features should always be considered; weakly relevant may sometimes be combined to improve the overall relevancy. All irrelevant or redundant features should be removed from consideration. Generally, the fewer the features the better, so long as the features used are strongly relevant and orthogonal (that is, they capture different aspects of the prediction space).A (Partial) Inventory and Taxonomy of Natural Language and KB Features
To make this discussion more tangible, we have assembled a taxonomy of feature types in the context of natural language and knowledge bases. This inventory is drawn from the limited literature on feature engineering and selection in the context of KBAI from the perspectives of ML learning in general [7, 8, 9], ML learning ontologies [10, 11, 12] and knowledge bases [13, 14, 15, 16, 17]. This listing is only partial, but does provide an inventory of more than 200 feature types applicable to natural language.
We have organized this inventory into eight (8) main areas, shown in non-italicized Bold, which tend to cluster into these four groupings:
Here is the taxonomy, again organized according to those same eight main areas:Lexical Corpus Phrases Averages Counts N-grams Weights Words Averages Counts Cut-offs (top N) Dictionaries Named entities Stemming Stoplists Terms Weights Syntactical Anaphora Cases Complements (argument) Co-references Decorations Dependency grammar Head (linguistic) Distances Gender Moods Paragraphs Parts of speech (POS) Patterns Plurality Phrases Sentences Tenses Word order Statistical Articles Vectors Information-theoretic Entropy Mutual information Meta-features Correlations Eigenvalues Kurtosis Sample measures Accuracy F-1 Precision Relevance Skewness Vectors Weights Phrases Document frequencies Frequencies (corpus) Ranks Vectors Words Document frequencies Frequencies (corpus) Ranks String similarity Vectors Cosine measures Feature vectors Structural Documents Node types Depth Leaf Document parts Abstract Authors Body Captions Dates Headers Images Infoboxes Links Lists Metadata Templates Title Topics Captions Disambiguation pages Discussion pages Authors Body Dates Links Topics Formats Graphs (and ontologies) Acyclic Concepts Centrality Relatedness Directed Metrics (counts, averages, min/max) Attributes Axioms Children Classes Depth Individuals Parents Sub-graphs Headers Content Section hierarchy Infoboxes Attributes Missing attributes Missing values Templates Values Language versions Definitions Entities Labels Links Synsets Links Category Incoming Linked data Outgoing See also Lists Ordered Unordered Media Audio Images Video Metadata Authorship Dates Descriptions Formats Provenance Pagination Patterns Dependency patterns Surface patterns Regular expressions Revisions Authorship Dates Structure Document parts Captions Headers Infoboxes Links Lists Metadata Templates Titles Versions Source forms Advertisements Blog posts Documents Research articles Technical documents Emails Microblogs (tweets) News Technical Web pages Templates Titles Trees Breadth measures Counts Depth measures Web pages Advertisements Body Footer Header Images Lists Menus Metadata Tables Semantics [most also subject to Syntactical and Statistical features above) Annotations Alternative labels Notes Preferred labels Associations Association rules Co-occurrences See also Attribute Types Attributes Cardinality Descriptive Qualifiers Quantifiers Many Values Datatypes Many Categories Eponymous pages Concepts Definitions Grouped concepts (topics) Hypernyms Hypernym-based feature vectors Hyponyms Meanings Synsets Acronyms Epithets Jargon Misspellings Nicknames Pseudonyms Redirects Synonyms Entity Types Entities Events Locations General semantic feature vectors Relation Types Binary Identity Logical conjunctions Conjunctive Disjunctive Mereology (part of) Relations Domain Range Similarity Roles Voice Active/passive Gender Mood Sentiment Style Viewpoint (World view) Natural Languages Morphology Nouns Syntax Verbs Word order Latent Autoencoders Many; dependent on method Features Many; dependent on method Hidden Many; dependent on method Kernels Many; dependent on method Model-based Decision tree Tree measures Dimensionality Feature characteristics Datatypes Max Mean Min Number Outliers Standard deviation Functions Factor graphs Functors Mappings Landmarking Learner accuracy Method measures Error rates Table 1. A (Partial) Taxonomy of Machine Learning Features
This compilation exceeds any prior listings. In fact, most of the feature types shown have never been applied to NLP machine learning tasks. We now turn the discussion to why this is.Mindset and Knowledge Bases
When one sees the breadth of impressive knowledge discovery tasks utilizing large-scale knowledge bases , exemplified by hundreds of research papers regarding NLP tasks utilizing Wikipedia , it is but a small stretch to envision a coherent knowledge base leveraging this content for the express purpose of making text-based machine learning systematic and less expensive. Expressed as an objective function, we now have clear guidance for how to re-organize and -express the source content information (Wikipedia, among others) to better support a ML learning factory. The idea of this and how it is driving Structured Dynamics‘ contracts and research is the mindset.
Rather than the singleton efforts to leverage knowledge bases for background knowledge, as has been the case to date, we can re-structure the knowledge source content underneath a coherent knowledge graph. This re-organization makes the entire knowledge structure computable and amenable to machine learning. It also enables the same learning capability to be turned upon itself (see image here), thereby working to improve the coverage and accuracy of the starting KB, all in a virtuous circle. Because of the mindset, we also can now look at the native structure of the KBs and work to expose still more features, providing still further grist to the next generation ML learners. Fully 50% of the features listed in the inventory in Table 1 above arise from these unique KB aspects, especially in the areas of Semantics and Structural, including graph relationships.
Many, if not most, of these new feature possibilities may prove redundant or only somewhat relevant. Not all features may ever prove useful, though some not generally used in many broader learners, such as case, may be effectively employed for named entity or specialty extractions, such as for copyrights or unique IDs or data types. Because many of these KB features cover quite orthogonal aspects of the source knowledge bases, the likelihood of finding new, strongly relevant features is high. Further, except for the Latent and Model-based areas, each of these feature types may be used singly or in combination to create coherent slices for both positive and negative training sets, helping to reduce the effort for labor-intensive labeling as well. By extension, these capabilities can also be applied to more effectively bootstrap the creation of gold standards, useful when parameters are being tested for the various machine learners.
Though the literature most often points to classification as the primary use of knowledge bases as background knowledge supporting machine learners, in fact many NLP tasks may leverage KBs. Here is but a brief listing of application areas for KBAI:
Surely other applications will emerge as this more systematic KBAI approach to machine learning evolves over the coming years.Feature Engineering is an Essential Component
As noted, this richness of feature types leads to the combinatorial problem of too many features. Feature engineering is important both to help find the features of strongest relevance while reducing the feature space dimensionality in order to speed the ML learning times.
Initial feature engineering tasks should be to transform input data, regularize them if need be, and to create numeric vectors for new ones. These are basically preparation tasks to convert the source or target data to forms amenable to machine learning. This staging now enables us to discover the most relevant (“strong”) features for the given ML method under investigation.
In a KB context, specific learning tasks as outlined in Table 2 are often highly patterned. The most effective features for training, say, an entity recognizer, will only involve a limited number of strongly relevant feature types. Moreover, the relevant feature types applicable to a given entity type should mostly apply to other entity types, even though the specific weights and individually important features will differ. This patterned aspect means that once a given ML learner is trained for a given entity type, its relevant feature types should be approximately applicable to other related entity types. The lengthy process of initial feature selection can be reduced as training proceeds for similar types. It appears that combinations of feature types, specific ML learners and methods to create training sets and gold standards may be discovered for entire classes of learning tasks. These combinations can be discovered, tested and repeated for new specific tasks within a given application cluster.
Probably the most time-consuming and demanding aspect of these patterned approaches resides in feature selection and feature extraction.
Feature selection is the process of finding a subset of the available feature types that provide the highest predictive value while not overfitting . Feature selection is typically split into three main approaches [6, 21, 22]:
For high-dimensional features, such as terms and term vectors, one may apply stoplists or cut-offs (only considering the top N most frequent terms, for example) to reduce dimensionality. Part of the “art” portion resides in knowing which feature candidates may warrant formal selection or not; this learning can be codified and reused for similar applications. Extractions and some unsupervised learning tests may also be applied at this point in order to discover additional “strong” features.
Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. Functions create new features in the form of Latent variables, which are not directly observable. Also, because these are statistically derived values, many input features are reduced to the synthetic measure, which naturally causes a reduction in dimensionality. Advantages from a reduction in dimensionality include:
On the other hand, the latent features are abstractions, and so not easily understood as the literal.
In deep learning, multiple layers of these latent features are generated as the system learns. But latent passes may also be combined with observable features, which is one way that evaluations of what a document means can be applied across multiple input forms of the content.
Of course, it is also possible to combine the predictions from multiple ML methods, which then also raises the questions of ensemble scoring. Surely we will also see these more systematic approaches to machine learning themselves be subject to self-learning (that is, metalearning), such that the overall learning process can proceed in a more automated way.Considerations for a Feature Science
In supervised learning, it is clear that more time and attention has been given to the labeling of the data, what the desired output of the model should be. Much less time and attention has been devoted to features, the input side of the equation. As a result, much needs to be done. The purposeful use of knowledge bases and structuring them properly is one of the ways progress will be made.
But progress also requires some answers to some basic questions. A scientific approach to the feature space would likely need to consider, among other objectives:
Some tools and utilities would also help to promote this progress. Some of these capabilities include:
The object of these efforts is to systematize how knowledge bases, combined with machine learners, can speed the deployment and lower the cost of creating tailored artificial intelligence applications of natural language for specific domains. This installment in our KBAI series has focused on the role and importance of features. There is an abundance of opportunity in this area, and an abundance of work required, but little systematization.
The good news is that platforms are possible that can build, manage, and grow the knowledge bases and knowledge graphs supporting machine learning. Machine learners can be applied in a pipeline manner to these KBs, including orchestrating the data flows in generating and testing features, running and testing learners, creating positive and negative training sets, and establishing gold standards. The heart of the platform must be an appropriately structured knowledge base organized according to a coherent knowledge graph; this is the present focus of Structured Dynamics’ efforts.
In the real world, engagements always demand unique scope and unique use cases. Platforms should be engineered that enable ready access, extensions, configurations, and learners. It is important to structure the KBs such that slices and modules can be specified, and all surface attributes may be selected and queried. Mapping to external schema is also essential. Background knowledge from a coherent knowledge base is the way to fuel this. Features apply to any form of machine learning, including for things like image, speech and pattern recognition. However, this article is limited to the context of natural language, unstructured data and knowledge bases.  See the feature entry from Wikipedia, which itself is based upon Christoper Bishop, 2006. Pattern Recognition and Machine Learning. Berlin: Springer. ISBN 0-387-31073-8.  Pedro Domingos, 2012. “A Few Useful Things to Know About Machine Learning,” Communications of the ACM 55, no. 10 (2012): 78-87.  For example, in the term or phrase space, the vectors might be constructed from counts, frequencies, cosine relationships between representative documents, distance functions between terms, etc.  Richard Ernest Bellman, 1957. Dynamic Programming, Rand Corporation, Princeton University Press, ISBN 978-0-691-07951-6, as republished as Richard Ernest Bellman, 2003. Dynamic Programming, Courier Dover Publications, ISBN 978-0-486-42809-3.  Isabell Guyon and André Elisseeff, 2006. “An Introduction to Feature Extraction,” in Guyon, Isabelle, Steve Gunn, Masoud Nikravesh, and Lofti A. Zadeh, eds. Feature Extraction: Foundations and Applications, pp. 1-25. Springer Berlin Heidelberg, 2006.  Haussler, David, 1999. Convolution Kernels on Discrete Structures, Vol. 646. Technical Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz, 38 pp., July 8, 1999.  Reif, Matthias, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel, 2014. “Automatic Classifier Selection for Non-experts,” Pattern Analysis and Applications 17, no. 1 (2014): 83-96.  Tang, Jiliang, Salem Alelyani, and Huan Liu, 2014. “Feature Selection for Classification: A Review.” Data Classification: Algorithms and Applications (2014): 37  Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis, 2011 “Ontology-based Meta-mining of Knowledge Discovery Workflows,” in Meta-Learning in Computational Intelligence, pp. 273-315. Springer Berlin Heidelberg, 2011.  Panče Panov, Larisa Soldatova, and Sašo Džeroski, 2014. “Ontology of Core Data Mining Entities,” Data Mining and Knowledge Discovery 28, no. 5-6 (2014): 1222-1265.  See the general KBAI category entries on M.K. Bergman, AI3:::Adaptive Information blog, various dates.  Ivo Anastacio, Bruno Martins and Pavel Calado, 2011. “Supervised Learning for Linking Named Entities to Knowledge Base Entries,” in Proceedings of the Text Analysis Conference (TAC2011).  Weiwei Cheng, Gjergji Kasneci, Thore Graepel, David Stern, and Ralf Herbrich, 2011. “Automated Feature Generation from Structured Knowledge,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 1395-1404. ACM, 2011.  Lan Huang, David Milne, Eibe Frank, and Ian H. Witten, 2012. “Learning a Concept‐based Document Similarity Measure.” Journal of the American Society for Information Science and Technology 63, no. 8 (2012): 1593-1608.  Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp.  Hui Shen, Mika Chen, Razvan Bunescu, and Rada Mihalcea, 2012. “Wikipedia Taxonomic Relation Extraction using Wikipedia Distant Supervision,” Ann Arbor 1001: 48109.  Conventional knowledge bases have also been supplemented with massive-scale statistical bases, most often created from major search engine indexes; see the section on ‘Statistical Corpora’ in M.K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” AI3:::Adaptive Information blog, November 14, 2014.  See M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010. The listing as of its last update included 246 articles; also, see Wikipedia’s own “Wikipedia in Academic Studies.”  Overfitting is where a statistical model, such as a machine learner, describes random error or noise instead of the underlying relationship. It is particularly a problem in high-dimensional spaces, a common outcome of employing too many features.  George H. John, Ron Kohavi, and Karl Pfleger, 1994. “Irrelevant features and the subset selection problem.” In Machine Learning: Proceedings of the Eleventh International Conference, pp. 121-129. 1994.  See especially slide #11 in Zdeněk Žabokrtský, 2015. “Feature Engineering in Machine Learning,” Machine Learning Methods course, Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic.  If constructed properly, deep learning models can be effective feature extractors over high-dimensional data; see Geoffrey E. Hinton, 2009. “Deep Belief Networks,” Scholarpedia 4 (5): 5947, which references an earlier paper, Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation 18, no. 7 (2006): 1527-1554.
Here is an excerpt from yesterday’s post by MN Gordon* dealing with the familiar concept to computer scientists of ‘garbage in, garbage out’ (GIGO) as applied to monetary policy:
“The Fed believes that by fixing the price of money artificially low, they’ll increase something they call ‘aggregate demand.’ The thesis is that cheap credit will compel individuals and businesses to borrow more and consume more. Before you know it, the good times will be here again. Profits will increase. Jobs will be created. Wages will rise. A new cycle of expansion will take root. Sounds great, doesn’t it?
“In practice, however, the results are destructive. While cheap credit may have a stimulative influence on an economy with moderate debt levels, once an economy has reached total debt saturation, where the economy can no longer support its debt overhang, the cheap credit trick no longer works to stimulate the economy. Like applying additional fertilizer to an already overstimulated crop field, the marginal return of each unit of additional credit in terms of new growth diminishes to nothing. In fact, the additional credit, and its counterpart debt, actually strangles future growth.
“The experience following the Great Recession is that the abundance of cheap credit floods not into the economy, but into asset prices…grossly distorting them in the process. The simple fact is solving the problem of too much debt by pushing more debt doesn’t solve the problem at all. It makes it worse.
“. . . . The point is the radical monetary policy interventions being employed by the Fed to somehow improve the economy are being guided by garbage.”
I don’t often comment on economic or political matters, but I have deep near-term concerns about debt and monetary policy.* M.N. Gordon, 2015. “Garbage In Garbage Out Economics,” from Economic Prism blog, November 3, 2015.
Inherent in Structured Dynamics‘ discussions about knowledge-based artificial intelligence (KBAI) have been some embedded premises. Some of my prior articles — and future ones to come — elaborate more fully on one or more of these points:
Precise definitions for all of the italicized terms are provided in the related glossary.