Thoughts on NoSQL & Big Data Architecture
I recently had a webpage forwarded to me by someone at work. It’s a very complex diagram of a ‘typical’ Big Data architecture. It also contains a couple of NoSQL databases. I decided to critique it from a pure NoSQL standpoint. If we in the computer industry are doing our jobs right, the diagram should become simpler once we use the correct products and approaches. I’ll detail my thoughts in this article…
Here’s the original diagram posted on http://nosql.mypopescu.com :-
Areas to look at
There are several parts to this diagram. I’ve broken them down in to the jobs that I’m inferring the diagram’s creators are using the various components for. (They’re not all mentioned on the diagram.)
- Storage of different data types: structured/relational, semi-structured (marketing/campaign/mobile/web logs), unstructured (files), and log files (I’ll deal with these separately from semi-structured)
- Loading data in to a variety of data bases
- Mining data for extra insights
- High speed analytics
- Data warehousing for reporting, and reporting over the live system
- Batch analysis (Hadoop and ecosystem)
- Log file management (web, operational)
- Transactional data store
- Web caching
Storage of different data types
This is the main reason why even today there are hundreds of choices of database. How you manage data depends very much on its data type, or structure. The operations you wish to do over that data are also determined by its own nature. It’s important to get this part right.
Most data can be broken down in to three areas, as in this diagram – structured/relational, semi-structured (container format known, content not) and unstructured (could be anything). Structured/relational data is traditionally stored as tables and relationships in an RDBMS. Unstructured data can be stored on a file system, with something managing the metadata, e.g. BCS (Basic Content Services – SharePoint), ECM (Enterprise Content Management), WCM (Web Content Management), or a document-orientated NoSQL database (Mongo, MarkLogic et al). The last type is semi-structured, which gets you in to the realm of XML and JSON, which NoSQL databases can manage well.
Which one is correct for you? Annoyingly, it depends. I would say though that each type – RDBMS / NoSQL / ECM – can be used to hold all of the above types. The question is, what facilities do you need in your store in order to be able to query, access, modify and manage the lifecycle of this information?
Many organisations have tried to, for example, shred an XML structure in to a relational system. When this needs reading, they re-build the XML document. This works fine when you have a small amount of XML that you’re just storing and retrieving. You just store it as an XML column, and have other primitive typed columns for index information. This may seem like a reasonable approach.
Storing semi-structured and unstructured in an RDBMS
The problem comes when you want to query across the database for a document that matches three fields – all in XML, not in an index column. How do you even express that in the relational query layer? You basically have to hand crank a stored procedure for each combination of query terms. What if they’re built interactively by a user in an application? Also, with very large XML documents, or systems with high transaction speeds (think financial services trade stores with 50 million trades per day, most in the first and last 30 minutes of trading), the shredding / rebuilding code will start to struggle. The situation becomes even harder if the XML source changes schema versions.
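To make the ‘three fields, all inside the XML’ problem concrete, here is a minimal sketch in plain Python. The trade documents and element names (trader, currency, desk) are invented for illustration, and the linear scan stands in for what a native XML database would answer from its indexes. The point is only that the query is expressed against the document itself, so an ad hoc combination of terms, or a changed schema version, needs no new stored procedure.

```python
import xml.etree.ElementTree as ET

# Hypothetical trade documents; the element names are illustrative only.
TRADES = [
    "<trade><trader>alice</trader><currency>GBP</currency><desk>rates</desk></trade>",
    "<trade><trader>bob</trader><currency>USD</currency><desk>rates</desk></trade>",
    # A later 'schema version' nests the trader one level deeper.
    "<trade version='2'><party><trader>alice</trader></party>"
    "<currency>GBP</currency><desk>rates</desk></trade>",
]


def matches(doc, criteria):
    """True if every named element occurs anywhere in doc with the given text."""
    return all(
        any(el.text == value for el in doc.iter(name))
        for name, value in criteria.items()
    )


def search(docs, criteria):
    """Return the positions of the documents that satisfy all criteria."""
    return [i for i, xml in enumerate(docs) if matches(ET.fromstring(xml), criteria)]


# The criteria could just as easily be built interactively by a user.
hits = search(TRADES, {"trader": "alice", "currency": "GBP", "desk": "rates"})
print(hits)  # [0, 2]
```

Both the flat document and the nested ‘version 2’ document match, because the search looks for the element wherever it sits rather than at a fixed column.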
The more volume, velocity and variety your XML has, the worse a fit relational databases are. You need MarkLogic, frankly, to deal with this. It’s the only Enterprise NoSQL database that handles XML natively.
Any system can store files + metadata. When I was in FileNet (an awesome ECM product, now owned by IBM, but don’t hold that against it!) we ran tests and found that documents under 10KB could be stored fine in a blob column in an RDBMS. It was only once you got above this size that storing on disc becomes quicker. In MarkLogic we used to recommend (memory testing here…) 2MB as the limit. Now though we have both a fast data directory option and a large binary option, so MarkLogic will handle this performance logic for you.
RDBMS vs. triple stores
Where it gets really, really interesting these days is in the Relational data world. Clearly structured data can be easily modelled in XML (or JSON for that matter). The tricky part is how to model the relationships? This has always been a struggle for an aggregate NoSQL database (i.e. document database, like MarkLogic, Mongo). Many people have used NoSQL for their data, then used a Graph store like OWLIM for their relationships (as well as other semantic ‘facts’). Examples include the BBC’s Digital Semantic Publishing (DSP) platform they use for Sport on their website. This knows, for example, that a particular player plays for a particular team – so David Beckham’s fashion launch article will show up even if you search for his team. This gives a user a holistic view of the data around, not just inside, their query.
Can a Graph/Triple store manage any relationship an RDBMS could? That is an emphatic ‘Yes’. They can also tell you the nature of that relationship – plays_for, parent_of, knows, member, friend_of. They use a subject-predicate-object triple (hence triple store) to hold this information. E.g. ‘David Beckham’ ‘makes’ ‘underpants’. The ‘object’ of a triple can be either a reference (as an IRI) or a primitive typed fact (Adam likes ‘cheese’). What is particularly interesting about triple stores, and the SPARQL query language in particular, is that you can do things at query time like: ‘Give me all players in the premier league whose fathers were also football players’.
If you tried to model that as a relational data set you’d quickly hit an issue – if you had a table named ‘person’, then the parent_of relationship points to the same table. Building this in SQL is both a pain and typically executes poorly. The more levels you can theoretically go down, the worse it gets for an RDBMS. These highly interrelated ‘tables’ or ‘objects’, where the relationship itself has a nature to it (parent_of), are not a great fit for an RDBMS.
A triple store can store all the relationships an RDBMS can. An RDBMS cannot handle the dynamism of the relationships a triple store can, and cannot handle the variety of queries it can handle either.
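As a rough illustration of the subject-predicate-object idea, here is a toy in-memory triple set in plain Python. Every name and predicate is made up, and a real triple store would answer this with SPARQL over indexes rather than nested loops. It shows the ‘players whose fathers were also footballers’ query, plus a walk up the has_father relationship to any depth, which is the self-referencing case that is painful in SQL.

```python
# Triples are (subject, predicate, object). All people here are fictional.
TRIPLES = {
    ("frank_jr", "plays_in", "premier_league"),
    ("frank_jr", "has_father", "frank_sr"),
    ("frank_sr", "occupation", "footballer"),
    ("frank_sr", "has_father", "albert"),
    ("jamie", "plays_in", "premier_league"),
    ("jamie", "has_father", "harry"),
    ("harry", "occupation", "footballer"),
    ("david", "plays_in", "premier_league"),
    ("david", "has_father", "ted"),
    ("ted", "occupation", "gas_fitter"),
}


def match(s=None, p=None, o=None):
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for t in TRIPLES:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def players_with_footballer_fathers(league):
    """Players in the league whose father has occupation 'footballer'."""
    result = set()
    for player, _, _ in match(p="plays_in", o=league):
        for _, _, father in match(s=player, p="has_father"):
            if any(match(s=father, p="occupation", o="footballer")):
                result.add(player)
    return sorted(result)


def paternal_line(person):
    """Follow has_father upwards for as many levels as the data holds."""
    line, frontier = [], [person]
    while frontier:
        current = frontier.pop()
        for _, _, father in match(s=current, p="has_father"):
            if father not in line:
                line.append(father)
                frontier.append(father)
    return line


print(players_with_footballer_fathers("premier_league"))  # ['frank_jr', 'jamie']
print(paternal_line("frank_jr"))  # ['frank_sr', 'albert']
```

Adding a new kind of relationship is just adding triples with a new predicate; nothing about the ‘schema’ has to change first.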
Conclusion on data storage
Your data store will need to understand at a granular level the data you are storing in order to be able to support all the likely queries over that data, and to do this in a performant manner. A NoSQL aggregate database, like MarkLogic, provides you with great flexibility and high performance for all these data types.
If you need to store relationships then a triple store is a great way to do it. Happily, in the upcoming version 7 of MarkLogic, we will ship a triple store inside our NoSQL database! Yes that’s right, a horizontally scaling, ACID compliant, triple store along with our ACID compliant XML/JSON/text/binary database, with search engine built in (which I’ll come on to later).
The only reason to not use an Enterprise class NoSQL database these days (MarkLogic is the only true one) seems to be that the applications don’t run on top of it. CRM systems are typically built on RDBMS for example. This will change over time, I am sure, once these platform vendors see the advantage of using NoSQL too.
With MarkLogic being an Enterprise class product with ACID transaction support, then feel free to scrub off the ‘SQL’ boxes from the above diagram. 8o)
Loading data / ETL
The perennial problem – getting data in to a database! Or worse, a variety of data in to a variety of databases. I don’t think (unfortunately) that ETL tools are going away any time soon. With the churn of NoSQL replacing RDBMS for a lot of tasks, this problem will if anything become more acute. How do I move data accurately and speedily from an AS/400 to MarkLogic? Or from MySQL to MongoDB? Or from several places, finding the relationships between them, then storing in a single RDBMS? It’s a complex problem. I’d love ETL to go away purely because of the expense, but I think it’s here to stay now.
As for NoSQL databases, many come with tools or services offerings to help you migrate your data. If you’re doing a direct swap then you’ll probably be fine with these options. So in my example above of a poorly performing RDBMS solution, moving to an aggregate NoSQL database should be well catered for.
There is an interesting box at the bottom of the above diagram called ‘data mining’. It includes the words ‘Ratings, scores, weights’. I take this to mean ‘look at the new data, inside its content, and figure out the relevancy of this data to my application/users’.
In MarkLogic we’d do this through several approaches. Firstly, everything is part of the Universal Index – so the document structure, elements, attributes, values, words, stems, phrases – are all indexed as soon as you add a document. Secondly, you may wish to use the Content Processing Framework or some simpler triggers to perform Entity Extraction. What this actually means is putting a tag around a person’s name, such as <personsname>Adam Fowler</personsname>. This identifies this section of text as something important to the application.
What we can do next is Entity Enrichment. So say we saw ‘London, UK’. Rather than just put <placename>London, UK</placename> we can actually put <placename lat="51.5" lon="-0.13">London, UK</placename> – i.e. we’re adding information. Here we’re adding a point. This enables us to draw a polygon (for example) on a map and say ‘what place names are here?’. Our geospatial indexes in MarkLogic will use the lon/lat attributes to find this document, and London, as the place name within that polygon.
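Here is a small sketch of why the enrichment pays off, written in plain Python rather than anything MarkLogic-specific. The document, the approximate coordinates and the polygon are all assumptions made up for the example, and the ray-casting test is a textbook stand-in for a real geospatial index. Once the point is on the tag, ‘what place names are inside this polygon?’ becomes a simple question to ask.

```python
import xml.etree.ElementTree as ET

# An enriched document; coordinates are approximate and illustrative.
DOC = (
    "<article>Launch held in "
    '<placename lat="51.5" lon="-0.13">London, UK</placename> and '
    '<placename lat="48.86" lon="2.35">Paris, France</placename>.'
    "</article>"
)

# A rough box as (lat, lon) vertices; hypothetical, not a real boundary.
SOUTH_EAST_ENGLAND = [(51.0, -1.0), (51.0, 1.0), (52.0, 1.0), (52.0, -1.0)]


def point_in_polygon(lat, lon, polygon):
    """Ray-casting containment test for a point against a simple polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


def places_within(xml_text, polygon):
    """Return the text of every placename whose point lies in the polygon."""
    root = ET.fromstring(xml_text)
    return [
        el.text
        for el in root.iter("placename")
        if point_in_polygon(float(el.get("lat")), float(el.get("lon")), polygon)
    ]


print(places_within(DOC, SOUTH_EAST_ENGLAND))  # ['London, UK']
```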
What you can also do in MarkLogic when you add a document is alter a property called relevancy. This will then affect the score weighting of that document in search engine results.
MarkLogic understands the structure of your content, and because of this makes it easy to perform this data mining for search / reporting summary work. I’ll talk more about complex analytics next. You can also plug in external engines from our partners like Temis and Smartlogic. These provide you with accurate ontologies and classifications for your business domain, so look at those as adjuncts to MarkLogic.
Once you’ve found these solutions to work for you (should be a good 80% of you), then scratch ‘data mining’ off your diagram and replace with ‘MarkLogic enrichment services’.
High speed analytics
You see ‘Vertica’ in the above diagram, as well as ‘analytics DB’. For those reading from an Open Source NoSQL background, Vertica is an HP product which is a high speed, column-oriented, scale-out relational database system designed for complex analytic queries. I come across it occasionally due to its use in the City of London in the financial services markets. These financial services guys don’t do ‘low speed’. If they could buy a system that showed them what would happen in the next 30 seconds then they’d spend millions, believe me.
Vertica though is fundamentally a relational system, and so falls down when the source data is not tables, rows and columns. This is also true of other open source ‘Analytics DB’ options. In memory column stores and key-value stores are very fast little systems, but at the end of the day if your source data is – as above – a variety of things which may change quickly (especially if pulling in web sourced material), then this may give you the same headaches you’ve traditionally had with RDBMS.
Instead, ask yourself ‘What analytic functions do I need? Is this over live data? Do I want to perform the same functions over a variety of data types (structured/XML/etc)?’. If you answer ‘Yes’ to any of these questions, then I’d recommend you look at the functions within your NoSQL database around analytics. They may be named ‘aggregation functions’ instead, so watch out for that.
As an example, MarkLogic ships with a whole variety of analytic functions. We can do this at high speed as our range indexes are stored effectively as in memory column stores. This makes analytic functions running against them really fast. You can compose these functions together using our search technology, so it’s easy to say ‘Find all people who play premier league football, and give me the average (mean) age by league’.
Of course, you’re reading this thinking ‘a mean average? Is he kidding!?! We do REALLY heavy stuff!’. Well I’m not kidding, I just don’t have a very good statistical imagination! Our analytics functions are pluggable at runtime. You can write your own User Defined Function (UDF) in C/C++ and plug that library in to MarkLogic. It can be as complicated as you like. This will then be pushed to all nodes for you. You can then refer to it in your queries exactly as you would any built-in. (How do you think we managed to build the things in!?!)
The advantage of using MarkLogic is that the same, single database can be used for heavy analytics workloads as well as operational loads. The same structure is used – you’re just composing queries over indexes – so there’s no horrid data modelling hell to manage up front. You should definitely consider MarkLogic for this workload. (Other NoSQL databases with analytical functions are available. </bbc-style-disclaimer> )
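The two ideas above, aggregating over column-like indexes and plugging in your own function, can be sketched in a few lines of Python. This is not MarkLogic’s API: the column lists, the aggregate_by helper and the spread function are hypothetical, and real UDFs are written in C/C++ as described. It is only meant to show the shape of ‘select with a search, then aggregate over aligned columns’.

```python
from collections import defaultdict
from statistics import mean

# Column-style layout: one list per indexed field, aligned by document position.
SPORT = ["football", "football", "football", "cricket", "football"]
LEAGUE = ["premier", "premier", "championship", "premier", "championship"]
AGE = [24, 31, 27, 29, 22]


def select(column, value):
    """The 'search' step: positions of documents whose field equals value."""
    return [i for i, v in enumerate(column) if v == value]


def aggregate_by(group_col, value_col, selected, func=mean):
    """Group the selected positions by one column and aggregate another.

    func is the pluggable part: any callable taking a list of values.
    """
    groups = defaultdict(list)
    for i in selected:
        groups[group_col[i]].append(value_col[i])
    return {key: func(values) for key, values in groups.items()}


def spread(values):
    """A user-supplied function: the range between oldest and youngest."""
    return max(values) - min(values)


footballers = select(SPORT, "football")
print(aggregate_by(LEAGUE, AGE, footballers))
# {'premier': 27.5, 'championship': 24.5}
print(aggregate_by(LEAGUE, AGE, footballers, func=spread))
# {'premier': 7, 'championship': 5}
```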
If you’re the 95% of people for whom MarkLogic analytical functions and UDFs will do the job, then scratch off both ‘vertica’ and ‘analytics db’ from the above diagram.
Reporting / data warehousing
Data warehousing is an adjunct problem to the above. In the RDBMS world this was necessary because the reporting query required a different structure to be highly performant. Hence ‘OLAP’ on the above. In MarkLogic, again, we use the range indexes on the live system (so no data lag at all!) to do reporting over. We have an ODBC server so you can use your current favourite reporting tool (Tableau, Cognos, SPSS, et al) to query it using, shocker of shockers, SQL! Yes, you heard it, SQL. You can even do joins.
I know, too good to be true, right? Well no actually. So there. Be told. MarkLogic allows you to define a relational ‘view’ as a set of range indexes. It’s basically a co-occurrence of in memory range indexes. So it’s nice and fast, even over the live operational system.
If you want a live view of what your system is doing then you should look at MarkLogic in particular for this. This is why MarkLogic is getting traction in Financial Services as a trade store, because it can provide a live view of your risk position. This should help you spot rogue traders and high risk patterns as they happen, not 24 hours later, shutting the barn door after the horse (trader) has bolted (with billions of pounds).
So, remove the OLAP engine and warehouse DB from the above diagram.
Batch analysis – aka Hadoop
Ah Hadoop. Great tool. Helps you analyse a large wealth of data. It is not, however, a panacea for World peace. People are starting to realise that, yes, you could do everything in Hadoop – but why on Earth would you?
Hadoop has a great file system in HDFS. It is not, however, high speed for live query loads. If you take a look at HBase for example – you can use a NoSQL database over HDFS. Great. But it’ll only ever run as fast as your HDFS file system, which is not exactly lightning, let’s face it. If you have batch jobs to do then fine, but I would argue Hadoop is not suited to a live, transactional, operational database system.
I’ve known clients try to effectively build MarkLogic on top of Hadoop in Java and waste a solid year’s worth of productivity, finding in the end that they couldn’t make it work. Don’t make the same mistake, and look through the hype. Have a really deep read of real world implementations before you go for this approach for an *operational* store.
Hadoop is well suited to a tiered storage model though. So use a high speed NoSQL database for your operational data, and archive it off to Hadoop when you need to. After all, it makes storage really cheap. This is a fantastic option for a whole range of problems.
I bet you’re thinking though, what if I want an operational query that includes archived data? Or what if I want a batch Hadoop job to view operational data?
There is hope for you, young Padawan. MarkLogic (and others, it must be said) come with a Hadoop connector. So you can ingest information in to Hadoop and it ends up in MarkLogic. So if you have an ETL tool for Hadoop, that will help you. You can also then write Hadoop jobs that query data in MarkLogic. This gives you the best of both worlds. (It also makes these queries faster – as MarkLogic will use its indexes to speed up the queries).
What you will also get in MarkLogic 7 at the end of summer 2013 – which I think is unique – is the administration functionality and automation necessary to manage this tiered storage and archival functionality really well. What you also get is the ability for MarkLogic to run on HDFS, just like HBase. This means you can create a ‘super database’ consisting of your normal, operational MarkLogic database and the HDFS stored archive database. If you want to run a fast query over live data, just talk to the MarkLogic native DB. If you want to query both sets, query the super database. Easy!
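The tiered model is easy to picture with a toy version. In the Python sketch below, Store, SuperDatabase and archive are invented names, the documents are plain dicts, and nothing here reflects how MarkLogic or HDFS actually implement tiering. It only shows the routing decision: query the live store when you want speed, or the combined view when you also need the archive.

```python
class Store:
    """A stand-in for one database; documents are plain dicts."""

    def __init__(self, name, docs=None):
        self.name = name
        self.docs = list(docs or [])

    def query(self, predicate):
        return [d for d in self.docs if predicate(d)]


class SuperDatabase:
    """A read-only view that fans one query out over several stores."""

    def __init__(self, *stores):
        self.stores = stores

    def query(self, predicate):
        return [d for store in self.stores for d in store.query(predicate)]


def archive(live, cold, is_old):
    """Move documents satisfying is_old from the live store to the archive."""
    moving = [d for d in live.docs if is_old(d)]
    live.docs = [d for d in live.docs if not is_old(d)]
    cold.docs.extend(moving)
    return len(moving)


def is_rates(doc):
    return doc["desk"] == "rates"


live = Store("operational", [
    {"id": 1, "year": 2011, "desk": "rates"},
    {"id": 2, "year": 2013, "desk": "rates"},
    {"id": 3, "year": 2013, "desk": "fx"},
])
cold = Store("hdfs-archive")
archive(live, cold, lambda d: d["year"] < 2013)
both = SuperDatabase(live, cold)

print([d["id"] for d in live.query(is_rates)])  # [2]
print([d["id"] for d in both.query(is_rates)])  # [2, 1]
```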
This approach is superior to a ‘just HDFS’ model. It gives you the advantages of both, with the disadvantages of neither. You can use Hadoop ecosystem tools as well, but you’ll be using them appropriately. You also don’t have to burn cycles building your own product, effectively a database, over Hadoop to do it.
Replace ‘HBase’ with ‘MarkLogic’. You know it makes sense. 8o)
Log file management
Log file management is a hot topic right now, mainly because it was left to sysadmins and perl for so long. Splunk is the poster child of this particular revolution, with support for index, search, enrichment, monitor & alerting, report & analyse functions. But do you need yet another system for this?
I’ll never forget what Justin Makeig, one of our product managers, said recently – “I just built splunk [in MarkLogic] in a weekend”. Which is perfectly true. You just need a trigger to split out the log file in to its constituent parts. You may not even need that. Just put in a simple trigger to identify date, system name, action class, and ‘details’. Then use MarkLogic’s indexes, and the universal index with word stemming for the text, and you’re away.
We support persisting a search as a document, and thus alerting. So when a new document (log file row) comes in that matches, an alert fires. This can be a system action (Hit the panic button) or a notification to a user.
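To show how little is involved in the ‘split the row, then alert on a saved search’ idea, here is a sketch in plain Python. The log format, the SavedSearch class and the notify callback are assumptions for illustration only, and the substring test is a crude stand-in for stemmed full text search. It is not how MarkLogic triggers or alerting are configured.

```python
import re
from dataclasses import dataclass, field

# Assumed log format: "<date> <system> <action> <free text details>".
LINE = re.compile(r"^(?P<date>\S+) (?P<system>\S+) (?P<action>\S+) (?P<details>.*)$")


def parse(line):
    """Split one log row in to its constituent parts, or None if malformed."""
    m = LINE.match(line)
    return m.groupdict() if m else None


@dataclass
class SavedSearch:
    """A persisted query: exact field matches plus a word in the details."""
    name: str
    contains: str
    criteria: dict = field(default_factory=dict)

    def matches(self, row):
        fields_ok = all(row.get(k) == v for k, v in self.criteria.items())
        return fields_ok and self.contains.lower() in row["details"].lower()


def ingest(lines, searches, notify):
    """Parse each incoming row and fire notify for every saved search it matches."""
    for line in lines:
        row = parse(line)
        if row is None:
            continue
        for search in searches:
            if search.matches(row):
                notify(search.name, row)


alerts = []
ingest(
    [
        "2013-06-01T09:00:00 web01 LOGIN user adam logged in",
        "2013-06-01T09:00:05 db02 ERROR Disk failure on /dev/sda",
        "2013-06-01T09:00:09 web01 ERROR timeout talking to db02",
    ],
    [SavedSearch("disk-errors", "disk", {"action": "ERROR"})],
    lambda name, row: alerts.append((name, row["system"])),
)
print(alerts)  # [('disk-errors', 'db02')]
```

The notify callback is where a system action (hit the panic button) or a user notification would hang.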
With our ODBC server and support for SQL you can use your existing toolset to do the reporting very quickly. You can even use our SQL compliant MATCH statement to perform a full text query within your SQL relational query, getting the best of both worlds.
So, if you have other needs and not just log file management, consider MarkLogic. Other Open Source tools are available, but typically they’ll be missing at least one part of the puzzle (usually search or alerting or enrichment)
Transactional data store
NoSQL databases can’t do real transactions, right? Wrong. Well, sort of… There’s a lot of FUD around from Open Source NoSQL outfits around transactions. Most say ‘You don’t need ACID compliance! CAP theory rocks!’ which we all know is a crock. They also all have ACID compliance on their roadmaps, just not in their products today. Those that say they are ACID compliant mean only within the confines of a single document. Personally, I don’t count that as good transactional support.
One of the main reasons we can say ‘MarkLogic is the ONLY Enterprise NoSQL database’ is because of our transactional support. We’ve had it for 10 years. Yes, 10 years. Most of these Open Source projects are less than 6 years old. They’re going to take a good 4 years before they have decent transactional support. This is why ACID compliant ‘Enterprise’ class databases like Oracle and Microsoft SQL Server are still used as transactional stores.
If the only reason you’re still using an RDBMS is Enterprise functionality, and in particular transactions, then you should strongly consider MarkLogic. We’ve been doing this for years, on mission critical (not just science project) solutions for our customers.
So another reason to remove the ‘transactional’ SQL block from your diagram.
Web caching
NoSQL databases are often used as ‘dumb’ caches thanks to their speed. Typically this is to front a less-responsive underlying WCM or RDBMS system. The second reason, which is very important, is to ensure that if a DDoS attack happens the underlying operations can continue, even if web operations are disrupted. This is certainly the BBC’s approach for Sport. Many Financial Services institutions are also using Oracle Coherence and other tools for this, not because of DDoS, but because of the sheer volumes being handled in financial markets. (Although all the ones I’ve talked to are now looking to replace Coherence due to cost from Oracle.)
If all your data is in a NoSQL solution though, then have a look to see if that system has any query volume limiting capabilities. In MarkLogic you can set up a secondary query app server. You could use this for ‘public’ queries, with a strict control over query time limits and volume. This would have much the same effect as a separate caching layer for application data.
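As a rough idea of what ‘strict control over volume’ means for a public query endpoint, here is a fixed-window limiter in plain Python. QueryLimiter and its parameters are hypothetical and have nothing to do with how a MarkLogic app server is actually configured. The clock is injectable purely so the behaviour can be shown without waiting.

```python
import time


class QueryLimiter:
    """Allow at most max_per_window queries in each window of window_seconds."""

    def __init__(self, max_per_window, window_seconds, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.max_per_window:
            self.count += 1
            return True
        return False


# A fake clock makes the example deterministic.
fake_now = [0.0]
limiter = QueryLimiter(max_per_window=2, window_seconds=1.0, clock=lambda: fake_now[0])

first_window = [limiter.allow() for _ in range(3)]
fake_now[0] = 1.0  # one second later, a new window begins
next_window = limiter.allow()

print(first_window)  # [True, True, False]
print(next_window)   # True
```

Rejected public queries never reach the store, so internal operations keep their capacity even when the public side is being hammered.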
So get rid of this box and just put ‘query app server with limiting’.
Search
Search is unfortunately often an afterthought in architecture diagrams. Just look how small and jammed in the Search box is at the top of the above diagram. Yet search is how most user queries, and a lot of complex system queries, are executed. This is especially true of an aggregate (document) NoSQL database. Most people think of ‘text queries’ when they think of ‘search’, but it’s much larger than that. You may need polygon point containment geospatial queries, or aggregation functions over search. You may want to save a complex search as an alert for notification purposes.
Search is fundamental to systems design in the 21st century. People are used to a simple search bar with built in grammar, faceting, paging, sorting and relevancy. They don’t want complex sets of screens where they have to key in exact information. They want quick and easy access to information. They also want often to see the landscape around an exact query. This is the ‘give me news for all players, as well as the team I’ve searched for’ problem, as mentioned above.
Search is fundamental. Unfortunately, Search engines create their own indexes, have their own crawling time delays, and so are often not suitable if you need to know *exactly* what the current state of play is. MarkLogic’s indexes are re-used for Search. You don’t have to set it up separately, it Just Works. Also, as soon as a transaction is committed you can be sure the indexes are up to date. A truly ACID compliant search solution is of paramount importance when you absolutely have to know what you know.
So replace ‘Search’ with ‘MarkLogic’ too.
By the end of this article you should realise that your NoSQL tool can be used for a lot of the diagram. This simplifies your architecture a lot, and makes it cheaper to install and operate for the long term. I’d definitely recommend you ask the questions above of your NoSQL tool/vendor to see if they can be used, and save you some money.
One product is preferable to 10 after all. If you absolutely need *all* of the functionality I’ve described, then you are in the realms of MarkLogic. This isn’t just a partisan comment on my part, but rather rooted in the fact that MarkLogic as a product is 12 years old, and has been on the open market for 10 years, and so has a lot of the required functionality built in. We have references in Defence, Intel, Financial Services, Media and Publishing – amongst others – so the likelihood is that we’ve come across the types of problems you’re facing before.
Feel free to check out MarkLogic on our website, or email me at adam dot fowler at marklogic dot com. I welcome comments, even somewhat hostile ones! Please comment below.