Last night I attended a Webstep seminar on polyglot persistence. Multiple presenters, all with developer/architect backgrounds, enlightened the audience with an intro to key terms like Database Thaw, the CAP theorem and the BASE acronym (as opposed to the ACID acronym), covered different types of NoSQL databases, and showed some cool demos as well. It was a great night of learning!
So, how do you go about actually deciding what technology to use? People tend to use the tools they already know. And this approach, for the most part, results in a solution persisting all data to an RDBMS. Not that there's anything wrong with that. However, the ongoing change in how data is generated and used forces the industry to step out of the box and look to new or renewed ways of data storage and retrieval.
Every solution and every project is different, so "It depends" will apply in all aspects of this process. Even so, in my opinion, the data itself must guide the way when deciding what technology to use. Know your data! Know your data! Know your data! There are some other considerations as well, but the key here is the characteristics of the data. I will list the different aspects alphabetically, as their priority depends on the project in question. I advise you to use the MoSCoW method to prioritize between the different aspects, and to get the primary stakeholders' approval before you continue with the decision process.
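As a rough illustration of the MoSCoW step, the ratings could be captured and sorted like this. It is only a sketch: the aspect labels and the ratings are made up for the example, and a real project would fill them in together with the stakeholders.

```python
from enum import IntEnum


class Priority(IntEnum):
    """MoSCoW categories, ordered from most to least important."""
    MUST = 0
    SHOULD = 1
    COULD = 2
    WONT = 3


def prioritize(ratings):
    """Return aspect names ordered by MoSCoW priority, then alphabetically."""
    return sorted(ratings, key=lambda aspect: (ratings[aspect], aspect))


# Hypothetical ratings for an imaginary project.
ratings = {
    "Volume": Priority.SHOULD,
    "Atomicity": Priority.MUST,
    "Security": Priority.MUST,
    "Usage": Priority.COULD,
}

ordered = prioritize(ratings)
print(ordered)
```

The sorted list is something you can put in front of the stakeholders before any technology is discussed.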
The main data characteristics
to consider when deciding on a persistence technology:
Do you need transaction control on the database level? Is it acceptable for every item to be transmitted one by one, like tweets or items in a list/cart/bookshelf, or must the items be transmitted together because they depend on one another, like payment and shipment info when shopping online?
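To make the payment/shipment case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the place_order helper are my own assumptions for illustration; the point is only that both inserts succeed together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("CREATE TABLE shipment (order_id INTEGER PRIMARY KEY, address TEXT NOT NULL)")


def place_order(conn, order_id, amount, address):
    """Store payment and shipment info as one unit of work (hypothetical helper)."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises an exception.
        with conn:
            conn.execute("INSERT INTO payment VALUES (?, ?)", (order_id, amount))
            conn.execute("INSERT INTO shipment VALUES (?, ?)", (order_id, address))
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_order(conn, 1, 99.0, "Main Street 1")
# The missing address violates NOT NULL, so the payment insert is rolled back too.
failed = place_order(conn, 2, 50.0, None)
payment_count = conn.execute("SELECT COUNT(*) FROM payment").fetchone()[0]
```

If tweets or shopping list items were stored instead, each insert could be committed on its own and this kind of control would matter much less.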
How strict is the data model? Do you need constraints or other rules on the database level to be applied at all times? Is there room for delay or a softer approach where consistency is reached over time?
When disaster strikes, how important is it to recover all the data? Is some data loss acceptable, only minimal data loss, or no data loss at all? Disaster recovery applies to
both external disasters like hardware failure and internal disasters like
Is the data self-explanatory, or will a metadata model be required in the context of the physical model?
How important is the data on the record level? Do you need to be sure that all data is persisted? Is it acceptable if a record is lost, like a status update, or must the item be stored, like the creation of a new social security number?
When will the data
be used? What are the availability requirements? Should the data be
available through defined contracts, or will ad-hoc querying be important as
Do you need
concurrency control on the database level? Will dirty reads cause problems, or
will inserts of new records while someone else is operating on the entire data set
compromise quality? Will one user be affected by what a different user sees or does?
Must the data be
stored for a certain amount of time or be kept in a given state, etc.? Will the
data need to be randomized or anonymized before being made available through
development and testing environments?
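As one possible way to mask data before it reaches a test environment, here is a small sketch based on keyed hashing from Python's standard library. The pseudonymize helper, the key and the sample row are all hypothetical. Note that this is pseudonymization rather than full anonymization, since anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a sensitive value with a keyed SHA-256 digest (hex string)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"replace-with-a-real-secret"  # placeholder, not a real key
production_row = {"name": "Ola Nordmann", "city": "Oslo"}

# Copy the row and mask only the sensitive column.
test_row = {**production_row, "name": pseudonymize(production_row["name"], key)}
```

The same input always gives the same digest, so joins between tables still work in the test data.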
How will the data be
used? Is there ad hoc aggregation and analysis going on? Should read
performance meet defined contracts or ad hoc querying? Will the data be updated
in batches or row by row?
Is the data
sensitive? Do you need role-based access control to manage access? Is
it important to control access rights on the database object level? Do you
need to encrypt some or all data?
How much data do you
have now? How much data is generated in a week, month or year? What is the life
expectancy of the solution? Will the data be archived or deleted along the way?
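The volume questions above boil down to simple arithmetic. Here is a back-of-the-envelope sketch, assuming linear growth and a fixed share of new data that is kept rather than archived or deleted. The function name and all the numbers are invented for the example.

```python
def projected_volume_gb(current_gb, weekly_growth_gb, years, retained_fraction=1.0):
    """Estimate total volume after a number of years, assuming linear growth.

    retained_fraction is the share of newly generated data that is kept
    (1.0 means nothing is archived or deleted along the way).
    """
    weeks = years * 52
    return current_gb + weekly_growth_gb * weeks * retained_fraction


# 200 GB today, 5 GB of new data per week, 5 year life expectancy.
keep_everything = projected_volume_gb(200, 5, 5)
keep_half = projected_volume_gb(200, 5, 5, retained_fraction=0.5)
print(keep_everything, keep_half)
```

Real growth is rarely linear, so treat the result as a starting point for the discussion rather than a forecast.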
What is the main
purpose of the data? Will the data serve a LOB application, a reporting
solution or any other type of solution? Is there O/R mapping involved? Will the
data be visualized as isolated items, in a table, in a chart, as free text or
other? What about availability through free text and "advanced"
Other aspects to consider (IMHO secondary to the data characteristics):
Is it important to
minimize the total database administration and maintenance related to the
What's the budget?
This applies to explicit financial costs related to licensing, hardware, consultant
hours etc. But do not forget to consider the hours spent by in-house staff, training hours, the general
job satisfaction in the team and other aspects.
What is the
development environment like, and will the tools in question play well with what
you already have? How well do the tools in question adapt to the development process, like agile methods, test-driven development etc.?
Is there staff available
with knowledge of the tools in question, in-house or in consultancies? Is
there an informal knowledge base available through online resources or an
Are you friends with
the IT department, the legal department, the CFO or others who can influence the final
Will you need
support directly from the vendor? How often and at what times? Will you provide bug reports when bugs are found in the product?
When are the stakeholders expecting the first delivery or the final delivery? Is there enough time to buy, install and implement the tools in question?
All products have bugs, but is it important to
have the product kept up to date through frequent patches and updates?
I'm sure there are
other aspects as well. Unfortunately, I don't know the different available technologies and tools well enough to list their compliance with the different aspects. That might
be the topic for some posts in the future!
Thank you for
reading, and feel free to drop me a line regarding this post and the topic in general.