Searching through all your content is fine – until you get a mountain of it with similar content, differentiated only by context. Then you’ll need to understand the meaning within the content. In this post I discuss how to do this using semantic techniques…
Organisations today have realised that for certain applications it is useful to have a consolidated search approach over several catalogues. This is most often the case when customers can interact with several parts of the company – sales, billing, service, delivery, fraud checks.
This approach is commonly called Enterprise Search, or Search and Discovery, where your content across several repositories is indexed in a separate search engine. Typically this indexing occurs some time after the content is added. In addition, it is not possible for a search engine to understand the full capabilities of every content system. This means complex mappings are needed between content, metadata and security. In some cases, these may have to be retrofitted with custom code, as the systems do not support a common vocabulary around these aspects of information management.
We are all used to content search, so much so that for today’s teenagers a search bar with a common (‘Google-like’) grammar is simply expected. This simple yet powerful interface allows us to search for content (typically web pages and documents) that contains all the words or phrases we need. Often this is broadened by the use of a thesaurus and word stemming (‘plays’ and ‘played’ both stem to the verb ‘play’), and combined with some form of weighting based on relative frequency within each unit of content.
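As a rough illustration (not any particular engine’s implementation), here is a minimal Python sketch of stemming and term-frequency counting. It assumes NLTK is installed; the Porter stemmer is just one possible choice.

```python
# A minimal sketch of word stemming and term-frequency counting.
# Assumes NLTK is installed; the Porter stemmer is one possible choice.
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def term_frequencies(text):
    """Stem each word and count how often each stem occurs."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(stemmer.stem(w) for w in words)

doc = "Adam plays with cheese. Adam played with cheese yesterday."
print(term_frequencies(doc))
# 'plays' and 'played' both collapse to the stem 'play', so 'play' is counted twice
```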
Other techniques are also applied. Metadata is extracted or inferred – author, date created, date modified, security classification, Dublin Core descriptive data. Classification tools can be used (either at the content store or at the search indexing stage) to perform entity extraction (Cheese is a foodstuff) and enrichment (Sheffield is a place with these geospatial co-ordinates). This provides a much richer description of what is being searched for, over and above simple word terms.
Using these techniques, additional search functionality can be provided. Search for all shops visible on a map using a bounding box, radius or polygon geospatial search. Return only documents where certain words appear within 6 words of each other. Weight some terms as more important than others, or mark them as optional.
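The proximity idea can be sketched very simply. The toy Python function below (purely illustrative, not how a real engine works against its index) checks whether two terms occur within a given number of words of each other:

```python
# A toy proximity check: do two terms occur within n words of each other?
# Real engines answer this from an index (a Lucene-style proximity query)
# rather than scanning the text, but the idea is the same.
def within_n_words(text, term_a, term_b, n=6):
    words = [w.strip(".,!?").lower() for w in text.split()]
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= n for a in positions_a for b in positions_b)

print(within_n_words("Adam really likes a strong cheddar cheese", "adam", "cheese"))  # True
```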
These techniques are provided by many of the Enterprise-class search engines out there today, and even Open Source tools like Lucene and Solr are catching up. They have provided access to information where before we had to rely on Information and Library Services staff to manually classify incoming documents, as they did back in the paper-bound days of yore.
Content search only gets you so far though.
Implications of Big Data
As we all know from using web search, a content search will only get you so far. What happens when your search terms return 10 000 000 documents and you need to triage which to look at first? Humans are very good at this, but even so we still struggle to find exactly what we need from the likes of Google.
The problem often lies in determining both the reliability of a resource and the context around the usage of the terms we are searching for. Reliability is ascertained by ranking based on the number of links going to and from a website or page, and by determining how often your terms appear amongst the other terms in a document. The key calculation here is Term Frequency / Inverse Document Frequency (TF-IDF).
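For the curious, here is a minimal Python sketch of that TF-IDF calculation over a tiny, made-up corpus; real engines refine this with smoothing and normalisation:

```python
# A minimal sketch of the TF-IDF calculation over a tiny, made-up corpus.
import math

documents = [
    "adam likes cheese",
    "adam fowler wrote this post",
    "the quick brown fox jumped",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)                       # frequency within this document
    containing = sum(1 for d in docs if term in d.split())    # documents containing the term
    idf = math.log(len(docs) / containing)                    # rarer across the corpus = higher weight
    return tf * idf

print(tf_idf("cheese", documents[0], documents))   # ~0.37: 'cheese' is rare in this corpus
print(tf_idf("adam", documents[0], documents))     # ~0.14: 'adam' appears in more documents
```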
A semantic arms race between search engine vendors and search engine optimisation techniques (not to mention purveyors of cheap online pharmaceuticals and lonely hearts adverts) has led to the gaming of term frequency. Often an email will slip through the various spam filters, or into the comments section of a WordPress blog, because the terms look valid and the sentence structure, while odd, is not one the spam algorithm is used to – so it gets through. This causes web page indexers to index the links alongside the terms mentioned, leading those results to turn up more often.
This problem is not limited to just spamming – it is also linked to volume. The larger the data set to search, or the larger the set of potentially interesting terms, the more likely it is that your results are spurious. We all know that for a result set containing 18 000 000 documents, we probably don’t want to go beyond page 8 – because those pages will likely be irrelevant. Indeed, for terms used in multiple contexts this can happen on the first page.
The more data we have, and the more sophisticated our query criteria, the more human effort will be required to triage even the most ‘relevant’ results from searching content alone.
Searching for meaning, not just content
The key to winning this war on volume, I believe, is in understanding meaning. The search for meaning – applying Semantic methods – can drastically improve your chances of finding relevant results first time, and reduce (or even eliminate) false positives.
Consider this example. I have written a blog post saying that Adam likes Cheese. (It’s funny ‘cos it’s true). Perhaps at some point I’ve also written a blog post about working in the Cheese Cake Factory (alongside Penny, and avoiding Sheldon when he visits, of course).
A content search would return the first post because the words ‘Adam’ and ‘Cheese’ both occur. The second post, though, only contains the word ‘Cheese’. The second document was written by ‘afowler’, which is a username belonging to the user whose name is ‘Adam Fowler’. Even so, it’s not as relevant a result. The second document may still appear lower down the results if the search algorithm is doing an ‘or’ search rather than an ‘and’ search and ranking by score.
In the context of the magical Interwebs, the problem becomes ever more acute. How many documents out there mention Adam and Cheese??? … I bet you just Googled that didn’t you!?! … Well I did – the answer is: 49 700 000. Hey, Adams like Cheese… or do they!?!
On my first page of results I find 4 cheese manufacturers, one pub review (excellent!), three English Literature links about ‘I Am the Cheese’, and a Twitter feed. But how many Adams like Cheese!?! Hardly an authoritative way to find answers.
What is the answer to the ultimate question?
*moves hands like Wallace* Cheeeeeeeese.
The problem here is exactly the same as that within any organisation. Organisations own a vast array of data. They even model relationships either explicitly (e.g. Adam has many interests) or implicitly (Adam exists in the patient database, therefore he is a patient).
Even within the same organisation there is no authoritative definition of all information and its interconnections. Such a definition – an Ontology – can be very useful in defining possible relationship sets, attributes, and equivalence with other well-known terms (Adam the Patient in my hospital, internally, is the same as a Person in the very public ‘Friend of a Friend’ (FOAF) specification).
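To make that concrete, here is a small sketch using the Python rdflib library (my choice purely for illustration) of mapping a hypothetical internal ‘Patient’ term onto the public FOAF vocabulary; the hospital namespace and URIs are made up:

```python
# A small sketch, using rdflib, of mapping an internal 'Patient' term onto
# the public FOAF vocabulary. The hospital namespace and URIs are invented.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS, FOAF

HOSP = Namespace("http://hospital.example.org/ontology#")   # hypothetical internal vocabulary

g = Graph()
# Every Patient in the internal ontology is also a FOAF Person
g.add((HOSP.Patient, RDFS.subClassOf, FOAF.Person))

# Adam, a patient, described with both vocabularies
adam = URIRef("http://hospital.example.org/patient/adam-fowler")
g.add((adam, RDF.type, HOSP.Patient))
g.add((adam, FOAF.name, Literal("Adam Fowler")))

print(g.serialize(format="turtle"))
```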
Where organisations struggle is in both the creation of a master ontology, and in taking data and describing it with these terms to make it actionable information. There are (relatively cheap) consultancies who can define ontologies. This is the easy bit. How do you take 2 billion documents across 39 applications, 46 databases, file systems and ECM systems and get to a point where the meaning is understood, globally?
Search vendors will try to sell you a classification add-on that specifically checks for these relationships at index time and stores them as yet another piece of search metadata. What happens if your ontology changes? You guessed it: you have to redefine all your indexes and perform a full re-index operation across all your content. Not nice.
A better way is to think of the problem as an evolutionary one, not a revolutionary one. By that I mean you should try to paint a picture over time. Extract entities, facts, and relationships as they become apparent. Store why each decision was taken, and what information it was based on. Over time you will build a web of meaning across your entire information domain. You can even define new entities and relationships on the fly, and process content to add these as small facts, rather than rip and replace your entire facts database.
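By way of illustration only, a fact record in such an evolutionary store might carry its provenance alongside the assertion itself. The field names below are mine, not any particular product’s schema:

```python
# An illustrative fact record: each assertion carries the document it came
# from, why it was made, and how confident we are. Field names are made up.
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str        # e.g. "person:adam-fowler"
    predicate: str      # e.g. "likes"
    obj: str            # e.g. "foodstuff:cheese"
    source_doc: str     # the document this fact was extracted from
    reason: str         # why the extractor made this decision
    confidence: float   # how sure we are, 0.0 to 1.0

facts = [
    Fact("person:adam-fowler", "likes", "foodstuff:cheese",
         source_doc="blog/adam-likes-cheese.html",
         reason="entities co-occur in the same sentence with the verb 'likes'",
         confidence=0.9),
]

# New facts are simply appended as they become apparent; no full re-index
# of everything already stored is required.
```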
This evolutionary way of working is the essence of a Semantic Web approach.
How does it work?
This, as I’ve mentioned above, is the difficult bit. How do you go from a document to a set of facts? Firstly, you can put your existing classifiers to great use. You probably have these and don’t realise it. Every time you store information as a ‘Person’, ‘Patient’ or ‘Organisation’ you are identifying entities. In the NoSQL database world, you perform Entity Extraction – perhaps replacing the name ‘Adam Fowler’ with the entity <person-name>Adam Fowler</person-name>. Or storing this in its own ‘column’, if you must use a structured repository (Shocker!)
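As a toy illustration (real classifiers use dictionaries, rules and statistical models), the following sketch assumes a known list of person names and simply wraps each recognised name in the <person-name> tag mentioned above:

```python
# A toy entity extractor: wraps recognised names (from an assumed lookup
# list) in the <person-name> tag. Real classifiers are far more sophisticated.
import re

KNOWN_PEOPLE = ["Adam Fowler", "Penny", "Sheldon"]   # assumed lookup list

def tag_people(text):
    for name in KNOWN_PEOPLE:
        text = re.sub(re.escape(name), f"<person-name>{name}</person-name>", text)
    return text

print(tag_people("Adam Fowler worked at the Cheese Cake Factory alongside Penny."))
```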
Once you have these entities you can create a simple algorithm to determine their relevancy to each other. Perhaps within a single document you identify all entities within the same paragraph. You know how these relate to your Ontology – e.g. a <person-name> relates to the #name property of an entity of type FOAF Person. You do the same thing for Organisation names. This means you can reasonably suggest relationships, knowing what the maximum set of possible relationships between these two entity types is.
Examples here include member_of and funded_by, at least as far as the FOAF specification goes. You then have two options as far as accepting these suggestions goes: ask for human intervention, or use some form of natural language processing to determine an acceptable percentage chance of the relationship being true.
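A sketch of that suggestion step might look like the following. The allowed relationship table, entity types and naive confidence scoring are all illustrative assumptions standing in for a proper ontology lookup and NLP model:

```python
# A sketch of suggesting relationships between entities found in the same
# paragraph, constrained by the ontology's allowed relationships and accepted
# only above a confidence threshold. The table and scoring are illustrative.
ALLOWED_RELATIONSHIPS = {
    ("Person", "Organisation"): ["member_of", "funded_by"],
    ("Person", "Foodstuff"): ["likes"],
}

def suggest_relationships(entities, threshold=0.7):
    """entities: list of (name, type, confidence) tuples found in one paragraph."""
    suggestions = []
    for name_a, type_a, conf_a in entities:
        for name_b, type_b, conf_b in entities:
            for rel in ALLOWED_RELATIONSHIPS.get((type_a, type_b), []):
                confidence = conf_a * conf_b   # naive stand-in for real language processing
                if confidence >= threshold:
                    suggestions.append((name_a, rel, name_b, confidence))
    return suggestions

paragraph_entities = [
    ("Adam Fowler", "Person", 0.95),
    ("MarkLogic", "Organisation", 0.9),
]
print(suggest_relationships(paragraph_entities))
# Suggests ('Adam Fowler', 'member_of', 'MarkLogic', ...) and
# ('Adam Fowler', 'funded_by', 'MarkLogic', ...) for a human or NLP step to accept or reject.
```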
Organisations are already doing this. If you look at BBC Sport, journalists enter the text of their article and are presented with a list of relevant teams, players, and terms to link that story to. This allows the story to appear automatically anywhere on the website where it is relevant, rather than relying on a human to understand the correct website filing mechanism and required tags.
What the future holds
This is currently still a publish-time activity based on known facts at the time of publication. The ultimate application, I believe, is in doing this on the fly, whilst searching and exploring content – not just facts.
Consider a classification tool that not only identified entities and relationships based on pre-programmed rules, but also used existing facts and content in the system to do this more accurately. Your classifier would become specialised in your own organisation’s content, taxonomy and terminology, recognising a higher number of entities and relationships, and doing so more accurately over time.
This evolutionary approach matches very well with modern database and relationship modelling, which is dynamic. In the NoSQL world we call this schema-less – you can store any content as-is prior to doing any data modelling. With the advent of triple stores you can now do the same with relationships – developing ontologies and classifiers over time.
The real advantage is to the end user. Rather than searching for Adam and Cheese and manually trying to work out the statistics behind the results, you can instead search for all People called Adam who like a Foodstuff in the category Cheese. You can then rapidly know the number of Adams that like Cheese – rather than the number of documents that mention Adam and Cheese.
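To show the difference between a fact-level query and a document-level one, here is a small sketch using rdflib and SPARQL; the example.org vocabulary is invented for illustration:

```python
# A sketch of the fact-level query: count People called Adam who like a
# Foodstuff of type Cheese. The example.org vocabulary is made up.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/food#")

g = Graph()
adam = EX["adam-fowler"]
g.add((adam, RDF.type, FOAF.Person))
g.add((adam, FOAF.name, Literal("Adam Fowler")))
g.add((adam, EX.likes, EX.cheddar))
g.add((EX.cheddar, RDF.type, EX.Cheese))

query = """
SELECT (COUNT(DISTINCT ?person) AS ?adams) WHERE {
    ?person a foaf:Person ;
            foaf:name ?name ;
            ex:likes ?food .
    ?food a ex:Cheese .
    FILTER(CONTAINS(?name, "Adam"))
}
"""
for row in g.query(query, initNs={"foaf": FOAF, "ex": EX}):
    print(row.adams)   # the number of Adams who like Cheese, not the number of documents
```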
If you wanted to make absolutely sure, and you were also tagging where the facts were inferred from, you could provide links to the documents – or even the exact occurrence of an entity within a document – and show the language associated with these facts. Even more interestingly, as the number of documents mentioning ‘cheese’ and ‘adam’ increases, a smaller proportion of them would be returned in your query result set. This is because the number of distinct topics being discussed, and of entities related to the words ‘cheese’ and ‘adam’, would increase.
So discovering an <company>Adams Cheeses</company> would increase the number of content matches for Adam and Cheese, but reduce the proportion of documents mentioning the person Adam and the foodstuff Cheese.
Semantic understanding extracted from content has the potential to drastically cull the number of false positives in search, even with the onslaught of ‘Big Data’ volumes at internet scale.
Perhaps an even more subtle, and powerful, application of Content, Search and Semantic techniques is in the area of information exploration. We are used to collecting relevant data and then analysing it. We know what analysis we can run because we collected the data with that in mind. With an entire internet of data, though, this is not possible.
People are trying to solve this with large Hadoop analysis clusters. I believe they will fail, though, when they are not operating over a known, finite set of information. With Hadoop you cannot explore data interactively, only apply algorithms over raw data. This makes it impossible to discover new information within existing data. An exploratory approach is required.
NoSQL tools like MarkLogic allow you to store data as-is, just like Hadoop, but additionally allow you to explore that information interactively through sophisticated query interfaces. These e-discovery applications allow normal users, rather than ‘Data Scientists’, to discover new facts about their content. You could even add functionality to allow users to highlight content and classify on the fly, identifying new people, organisations, places, topics.
An end user could use a range of visual tools to explore this information. They might start with a content search to get into the right informational ballpark, then restrict results by the types of entities identified. They could see the content results next to a set of facts implied by, or extracted from, the content. They could add their own facts. They could switch to a diagram showing circular nodes as entities, and lines between them as relationships, hiding those relationships they were not interested in, perhaps using relevant data points to highlight entities they may wish to explore further.
Imagine a diagram of people who are members of an organisation, with the number of tweets they send represented by the size of their circle, and the number of messages to and from other individuals in their group represented by the thickness of a line. These visual cues help an information analyst walk through the world of content, meaning, and relationships to discover new facts. Given the right tools, they may describe a subset of this network diagram as a named group in its own right, with properties of its own.
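For illustration, the graph behind such a diagram could be modelled with the Python networkx library; the people and numbers below are made up, and a visualisation layer would map the attributes to circle size and line thickness:

```python
# A sketch, using networkx, of the graph behind such a diagram:
# node attributes carry tweet counts, edge attributes carry message volumes.
import networkx as nx

G = nx.Graph()
G.add_node("Adam", tweets=120)
G.add_node("Penny", tweets=45)
G.add_node("Sheldon", tweets=300)

G.add_edge("Adam", "Penny", messages=14)
G.add_edge("Penny", "Sheldon", messages=3)

# A visualisation layer would map 'tweets' to circle size and 'messages'
# to line thickness; here we just read the attributes back.
for person, data in G.nodes(data=True):
    print(person, data["tweets"])
for a, b, data in G.edges(data=True):
    print(a, b, data["messages"])
```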
For web-scale information problems, or those in complex areas such as fraud, patient records, or published research/information, understanding both content and meaning will be key to successfully managing the onslaught of data that information workers will have to deal with. Semantic technologies and techniques, with the modelling of ontologies and a crossover of content and fact search and exploration interfaces, will power new applications and drive discovery, turning data into actionable information.
The technology needed exists today, but is fragmented across different problem domains and vendors. To solve the above problems, and the issue of publishing Linked (Open) Data, you will need content stores, search engines and triple (fact) stores. You will also need a strong set of application services that can combine these functions without requiring re-wiring with every new project.
MarkLogic Server version 7, which is scheduled for release at the end of summer, will for the first time provide a Triple Store within the same product as our (only existing) Enterprise NoSQL database, sporting horizontal scaling to the petabyte scale, government-grade security, a sophisticated search engine including geospatial functionality, and a range of unified application APIs and rapid prototyping user interface wizards.
For the first time, organisations will be able to rapidly configure and deploy mission-critical applications that understand both the meaning and the content of unstructured and structured information – and to do this with an evolutionary approach in months, rather than a revolutionary ‘big bang’ approach taking years of expensive design and professional services.
Perhaps with MarkLogic 7, then, we’ll be able to find out how many Adams like Cheese.