NoSQL, huh, what is it good for?…

…Actually quite a lot really. Say it again, y’all! In this post I try to demystify NoSQL for the Relational DB crowd / average human, and give some real world examples of how NoSQL can help.

OK I promise no more blog-singing. I get asked the above question quite a lot by people who are unfamiliar with the technology, but who are looking at new things to try and solve some of their existing issues. As I’ve previously talked about, the real reasons for looking at NoSQL databases as a potential solution may be quite different from what people think they are. According to that blog post’s research, most people on the Interwebs use it to make development of applications easier for their particular set of problems. In this post I talk about some of those problems, and more generally about why you might want to consider NoSQL databases in the future.

Firstly, it’s important to realise that NoSQL databases are a wide range of different beasts. It’s not a term that conjures up how things work. When you hear relational DB you think tables and relationships, primary and foreign keys. When you hear NoSQL there is no one single architecture to consider.

What are the types of NoSQL database available, and why do I care?

NoSQL is probably too high level a term. There are a variety of different databases out there. Just look at this tube map post for NoSQL. Below are the most common categories I’ve come across, in order of the most popular in conversation:-

Aggregate (document) databases – store application information in logical aggregates. Think an Order from a website. In relational this would be a lot of tables, requiring a lot of joins to pull back an aggregate. Examples include MarkLogic (XML, JSON, text, binary, XQuery), MongoDB (JSON/BSON)
Graph stores – only slightly behind document databases in chatter in the UK Public Sector – used to describe relationships between data, and to make these graphs of relationships queryable at high speed. They typically ingest/export RDF data too. OWLIM is the only one I ever hear about, but some on the interwebs talk of neo4j too.
Columnar databases – RDBMS store data as rows. These store data as columns. This makes aggregation fast. You still have object-to-table modelling and mapping to do though
Key-value stores – Massive hash table for fast lookups. It is still hard to compose complex objects into them though.
Big Table clones – Used to store massive 2D tables that don’t have joins. Clones of Google’s Big Table. Typically used for log file analysis or the like. Examples include HBase, Cassandra, Hypertable
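
To make the aggregate idea a bit more concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the table names, field names and the URI are all invented, SQLite stands in for "a relational database", and a plain dict stands in for a document store. It is not how MarkLogic, MongoDB or any other product is actually implemented. The same order is stored twice, once across three tables and once as a single JSON document.

```python
import json
import sqlite3

# Relational version: one order spread over three tables (made-up schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_lines (order_id INTEGER, line_no INTEGER,
                              product TEXT, qty INTEGER);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (100, 1);
    INSERT INTO order_lines VALUES (100, 1, 'Widget', 2);
    INSERT INTO order_lines VALUES (100, 2, 'Gadget', 1);
    """
)


def load_order_relational(order_id):
    """Rebuild the order aggregate by joining the three tables."""
    rows = conn.execute(
        """
        SELECT c.name, l.product, l.qty
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        JOIN order_lines l ON l.order_id = o.id
        WHERE o.id = ?
        ORDER BY l.line_no
        """,
        (order_id,),
    ).fetchall()
    return {
        "id": order_id,
        "customer": rows[0][0],
        "lines": [{"product": product, "qty": qty} for _, product, qty in rows],
    }


# Document version: the whole order is one JSON document under one URI.
# The dict is a stand-in for a document store, not a real client library.
doc_store = {
    "/orders/100.json": json.dumps(
        {
            "id": 100,
            "customer": "Ada",
            "lines": [
                {"product": "Widget", "qty": 2},
                {"product": "Gadget", "qty": 1},
            ],
        }
    )
}


def load_order_document(uri):
    """Fetch the order aggregate with a single lookup, no joins."""
    return json.loads(doc_store[uri])
```

Both functions return the same order. The difference is that the relational version has to reassemble it on every read, while the document version stores it in the shape the application already uses.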

Which type is best?

Standard software sales answer – it depends! Naturally I think MarkLogic’s database is by far the best choice – based on knowing it best and talking to our customers – but I get paid by them so why do you care, right? It really depends on your situation though. People do love their OWLIM, and use it for real world tasks. I tend to view columnar, key-value and big table databases as very specific to the domain problem at hand. Aggregate databases can be general use for a lot of problems, much like Relational databases are, so I’ll concentrate on those for the rest of this post.

How does a NoSQL database differ from relational?

There are a few common architectural patterns that you tend to find in NoSQL databases:-

In memory caches/indexes, and commits – you need to avoid disc to increase speed. Many NoSQL databases commit to memory rather than wait for the commit to flush to disc. This allows massive throughput. Having a cache for reads at this level, or perhaps just for the most often used content and indexes, is also advantageous
Horizontally scalable on commodity hardware – rather than buy a big box for your database, why not scale out cheaply and quickly? Horizontal scaling is used in many NoSQL databases. This comes with its own challenges – including fulfilling queries across storage nodes. Some like MarkLogic do this on the fly transparently, others like MongoDB use sharding – storing particular datasets on particular boxes.
MVCC – Multi Version Concurrency Control – the practice of writing a new document, always, and marking the old one as obsolete rather than updating the old one on disc. Makes for fast commits, but does require periodic merging to remove old versions from disc. Great for very fast writes.
Data format orientated – relational systems are very generic. You can gain efficiencies by designing a database at a slightly more specific level. E.g. storing documents, or relationships, or hashes. NoSQL databases are designed at this level, which probably explains why there’s over 200 of them!
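
Of the patterns above, MVCC is probably the easiest to show in a few lines. The following is a toy Python sketch of the idea only: every write appends a new version and stamps the old one as obsolete, reads can ask for the state as of an earlier timestamp, and a merge step later throws away the obsolete versions. The class and method names are invented for this example, and real databases do all of this on disc with far more care about concurrency.

```python
import itertools


class VersionedStore:
    """Toy MVCC-style store: writes never overwrite existing content."""

    def __init__(self):
        self._versions = []                # append-only list of versions
        self._clock = itertools.count(1)   # simple increasing timestamps

    def write(self, uri, content):
        """Append a new version and mark the previous live one obsolete."""
        ts = next(self._clock)
        for version in self._versions:
            if version["uri"] == uri and version["deleted_at"] is None:
                version["deleted_at"] = ts  # content itself is left untouched
        self._versions.append(
            {"uri": uri, "content": content, "created_at": ts, "deleted_at": None}
        )
        return ts

    def read(self, uri, as_of=None):
        """Return the content visible at timestamp as_of (latest if None)."""
        for version in self._versions:
            if version["uri"] != uri:
                continue
            if as_of is None:
                if version["deleted_at"] is None:
                    return version["content"]
            elif version["created_at"] <= as_of and (
                version["deleted_at"] is None or version["deleted_at"] > as_of
            ):
                return version["content"]
        return None

    def merge(self):
        """Remove obsolete versions; returns how many were dropped."""
        before = len(self._versions)
        self._versions = [v for v in self._versions if v["deleted_at"] is None]
        return before - len(self._versions)
```

The write path only ever appends, which is where the fast commits come from. The cost shows up in merge(), which is the periodic clean-up mentioned above, and once it has run the older point-in-time reads are no longer available.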

These basically boil down to the core problems of Big Data – Variety, Velocity, Volume and Complexity – which is why you see a lot of overlap in messaging between NoSQL and Big Data solutions. If you have a Big Data problem and need real time access or very fast live querying, then use a NoSQL database. If you’re doing batch analysis on high volumes, use Hadoop. If you need both – then use something like MarkLogic that can natively handle HDFS data alongside Hadoop – Real time your Hadoop.

What can’t NoSQL do?

This section is for the relational guys and gals who are by this point saying “But my RDBMS can do that!… (kind of)”. There’s a lot of FUD around about what NoSQL can’t do. It’s worth bearing in mind that back-in-the-day the likes of Oracle couldn’t do a lot of these things either. They slowly matured their product over the years. This is what NoSQL databases are in the process of doing now. It took Oracle 7 versions to sort out ACID transactions by all accounts (I’m honestly too young to remember that far back!)

Here’s a few common complaints, and if/why I think they’re wrong. This comes from my previous post on Why use NoSQL, and why not?:-

ACID transactions / data loss / consistency – As I’ve already said, Oracle took several versions to get this right. MarkLogic already has. Most major NoSQL databases cry about CAP theorem and how ACID isn’t required if you’re careful, but an interesting stat is just how many of them have ACID transactions on their road map! This is a major block to adoption in the Enterprise, so they’ll have to support ACID transactions eventually.
Maturity – Most NoSQL vendors are relatively new. MarkLogic has been around for 11 years, and has many Enterprise customers. Very few others have Enterprise wide use, or have signed large deals. They tend to be services companies not Enterprise software outfits. They tend to be deployed in niche areas of large companies. (Internally we call this a ‘science project’ deployment!) Have a mission critical NoSQL installation? Do comment and tell me what it is, and which NoSQL database you use.
SQL – I actually really like SQL. It’s not SQL’s fault that someone has normalised their data to the Nth degree, forcing a nightmare world of joins on the SQL developer. SQL is complementary to a NoSQL database (despite the name). We released an ODBC SQL connector in MarkLogic 6. There are two projects in progress to add SQL over Hadoop. (I think they won’t be as scalable as a NoSQL database over Hadoop, but still). This is mainly for BI tools that don’t want to write a plugin for every NoSQL database out there. In MarkLogic you configure SQL ‘views’ over your document range indexes. Very nifty and fast way to get data.
BI – as above. Also, in the relational world you tend to have a separate BI database in an OLAP structure. This tends to be 24 hours out of date. This is now too slow. E.g. for trade stores where banks (and regulators) want a live view of their risk exposure. NoSQL databases with an SQL connector allow you to do this with a single instance, for both in flight transactions and BI reporting.
Tools availability – This is really an expression of how new these things are. There’s no NoSQL Toad for example. Give it time, and check out your vendor’s tool set. You’ve probably never used all Oracle’s tools, so I wouldn’t be too hung up on this.
Can’t do search – Very true on average, but neither can Oracle. MarkLogic’s raison d’etre is to merge the search engine with the database. We’re very, very, good at this. This is why the Intel agencies love us. No gluing together a separate search engine, with separate indexes, that aren’t real time, with your database.
Referential integrity – in the document space we tend to talk about URIs rather than primary/foreign keys. This is probably a fair cop, although I would argue that with transactions and sensible business level code this is not a practical issue on most systems.
Denormalisation / duplicate data – true, aggregate databases sometimes have a need for a ‘meta document’ to make searching very fast if multiple aggregate documents have fields you want to query on. Personally, I think the advantages far outweigh the costs. If you absolutely need fast query speed over a lot of aggregates, then use a meta-document. This is no different in my mind than creating a cached view in an RDBMS.
Support / expertise availability – again, a feature of NoSQL being relatively new. There are experts out there. Typically these days though they may call themselves ‘data scientists’ rather than DB admins, so watch out for that.

Most NoSQL databases will evolve in the next 3 years to be suitable for mission critical loads. Those that don’t will probably be condemned to history as people move towards those that do. Just look at the number of changes in MarkLogic 6 to see the rate of change in these products.

Real world use cases

The below are examples that couldn’t be done in relational databases. Before people respond, please bear in mind that this is on the back of customers saying RDBMS couldn’t do it, because they’ve tried it. In their estimation it leaves something to be desired. This may be down to cost and practicality rather than a question of ‘can it be done?’

“I’ve got a crap load of XML data flying at me! It’s coming in a variety of schemas and versions, and from external organisations. I can’t tell them all to use the same standard, it would take years to get agreement!”
You definitely need a schema-less database, so NoSQL should be looked at. If your input is XML, and output is XML, why not use an XML database like MarkLogic? Similarly for JSON. (Bear in mind MarkLogic transparently handles JSON in its REST API in version 6). MongoDB is common for JSON storage/retrieval. *resisting partisan comment*. The Centers for Medicare & Medicaid Services use MarkLogic for this.

“I’m pulling in a tonne of different types of data, with new types coming all the time. I’ve identified 2000 fields, but for each individual entry (aka row/document) they only use on average 20 fields”
Relational systems are known to struggle when you’ve got sparsely populated tables. It’s also a hassle to do the up front field identification. Handling queries if a field does not exist can be a pain. A schema-less database with fields that amalgamate element indexes (E.g. one called ‘Full name’, another called ‘Addressee’) should be considered. This leads you back to NoSQL. Think of all communication data across all formats and you get the idea.
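
As a rough illustration of that ‘field’ idea, here is a small Python sketch. Everything in it is assumed for the example: the alias table, the sample documents and the function names are invented, and a real database would answer this from indexes rather than by scanning. The point is only that each document carries the few fields it actually has, and one logical field can be backed by several differently named elements.

```python
# Hypothetical alias table: one logical field backed by several element names.
FIELD_ALIASES = {"person_name": ["Full name", "Addressee"]}

# Sparse documents: each one only holds the handful of fields it uses.
documents = [
    {"type": "letter", "Addressee": "Jane Smith", "postcode": "SW1A 1AA"},
    {"type": "form", "Full name": "Jane Smith", "date_of_birth": "1980-01-01"},
    {"type": "email", "subject": "Hello"},  # no name element at all
]


def field_values(doc, field):
    """Collect the values of a logical field, skipping absent elements."""
    names = FIELD_ALIASES.get(field, [field])
    return [doc[name] for name in names if name in doc]


def search(docs, field, value):
    """Return every document whose logical field contains the value."""
    return [doc for doc in docs if value in field_values(doc, field)]
```

Searching on person_name for ‘Jane Smith’ finds both the letter and the form, even though they name the element differently, and the email is simply skipped rather than needing a NULL column or a special case.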

“I have all these unstructured documents. I want to search them like I do my relational data. I don’t really want 50 different databases, one for each type of data”

How to select a NoSQL database?

If you think a NoSQL database may be what you need then you need to watch out for some common pitfalls / considerations. These may or may not be an issue for you / your vendor, but well worth asking anyway:-

  • My system is mission critical. A down day is a death day. How is my data protected from the app server going down, or a database cluster node going down? How do I recover?
  • I have some very extensive requirements. What do I do if I need a new feature, or an urgent bug fix? Can you fix it yourselves, or do you have to wait for a part time Open Source developer to fix it? What’s the support level agreement?
  • How many customers do you already have like me? At my scale?
  • NoSQL databases typically want to store as much data as possible in RAM, and need 3x data size for storage (to enable indexes, and merges to occur). Some thrash if the RAM isn’t as big as the data. (People have noted MongoDB doesn’t like it when the data is larger than the RAM – it’s memory mapped from disc)
  • Consider the types of queries you’re writing up front, not just the desired data model. Be flexible in your data model to get the most out of your data access. Allow your vendors to recommend a data model given the input data, output (result) requirements, and searches you want to run
  • Consider if you want full text or geospatial search – if so, this drastically reduces your options. At least for a single vendor solution. *coff* call MarkLogic *coff*

In summary

I hope this has been useful for all. Naturally there’s a lot of MarkLogic specific information in the above, but it’s what I know best, and a lot of my customers read this blog so they’ll want to know. Feel free to add any facts about your favourite NoSQL database to the comments section. Please keep it to facts rather than religious wars though!

Feel free to ask me any questions about your UK & Ireland NoSQL needs via email. I promise no hard sell! Email me at adam dot fowler at marklogic dot com.

11 comments

  1. One of the best overviews I’ve read so far, thanks for this. I still get lost in the ACID vs CAP discussion; any further reading you can point me to would be much appreciated.

  2. With a fleet of 5000 vehicles, with every vehicle transmitting its geo-location every minute to my server, would it make sense to use NoSQL to store the location update messages or should I use Relational DB?
    The location info will contain only the vehicle ID, lat, lon and time. My objective is to handle a huge number of location updates in a short duration.

    1. Whichever approach you take, you’ll need to ensure that it can handle rapid updates. An in memory database may work well. As your ‘schema’ is fixed with just 4 items, NoSQL vs RDBMS isn’t a huge issue. You’ll need something capable of geospatial search though, so watch out for that. Many databases only do simple radius or bounding box search, whereas solutions like MarkLogic include polygon search too. E.g. ‘show me all vehicles that were in Greater London at 1700 on Tuesday’. If you’re storing historical data, something capable of handling bitemporality (tracking both when a position was valid and when it was recorded) will be needed. You may find query performance and features are your determining factors.
