Friday, October 11, 2013

Polyglot persistence - let the data lead the way

Last night I attended a Webstep [web|twitter] seminar on polyglot persistence. Multiple presenters, all with developer/architect backgrounds, introduced the audience to key terms like Database Thaw, the CAP theorem, the BASE acronym (as opposed to the ACID acronym) and the different types of NoSQL databases, and they showed some cool demos as well. It was a great night of learning!

So, how do you actually go about deciding what technology to use? People tend to use the tools they already know, and for the most part this approach results in a solution that persists all data to an RDBMS. Not that there's anything wrong with that. However, the ongoing change in how data is generated and used forces the industry to step out of the box and look to new or renewed ways of storing and retrieving data.

Every solution and every project is different, so the phrase "It depends" will apply to all aspects of this process. Even so, in my opinion, the data itself must guide the way when deciding what technology to use. Know your data! Know your data! Know your data! There are some other considerations as well, but the key here is the characteristics of the data. I will list the different aspects alphabetically, as their priority depends on the project in question. I advise you to use the MoSCoW method to prioritize between the different aspects, and to get the primary stakeholders' approval before you continue with the decision process.
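To make the MoSCoW step a bit more concrete, here is a minimal sketch of how the aspects could be rated and sorted. The ratings are made up for an imaginary project, and the names `MoSCoW`, `ratings` and `prioritize` are my own illustration, not part of any tool.

```python
from enum import IntEnum


class MoSCoW(IntEnum):
    """MoSCoW categories, ordered from most to least important."""
    MUST = 0
    SHOULD = 1
    COULD = 2
    WONT = 3


# Hypothetical ratings for one project; the aspect names come from this post.
ratings = {
    "Disaster recovery": MoSCoW.MUST,
    "High availability": MoSCoW.SHOULD,
    "Legal responsibility": MoSCoW.MUST,
    "Upgrades and patches": MoSCoW.COULD,
}


def prioritize(ratings):
    """Return aspect names ordered by MoSCoW category, then alphabetically."""
    return sorted(ratings, key=lambda name: (ratings[name], name))


ordered = prioritize(ratings)
for name in ordered:
    print(f"{ratings[name].name:<6} {name}")
```

The sorted list is what you would take to the stakeholders for approval before looking at any specific product.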

The main data characteristics to consider when deciding on a persistence technology:

Do you need transaction control on the database level? Is it acceptable for every item to be transmitted one by one, like tweets or items in a list/cart/bookshelf, or must the items be transmitted together because they depend on one another, like payment and shipment info when shopping online?
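As a small illustration of the payment and shipment case, here is a sketch using Python's built-in `sqlite3` module. The table layout and the `place_order` function are assumptions made up for this example. The point is only that both rows are stored together or not at all.

```python
import sqlite3


def place_order(conn, payment, shipment):
    """Store payment and shipment info together, or neither of them."""
    try:
        # The connection as a context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "INSERT INTO payment (order_id, amount) VALUES (?, ?)", payment
            )
            conn.execute(
                "INSERT INTO shipment (order_id, address) VALUES (?, ?)", shipment
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical schema, kept in memory for the sake of the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("CREATE TABLE shipment (order_id INTEGER PRIMARY KEY, address TEXT NOT NULL)")

ok = place_order(conn, (1, 99.0), (1, "Main Street 1"))
# A missing address violates NOT NULL, so the payment row is rolled back too.
failed = place_order(conn, (2, 49.0), (2, None))
payment_count = conn.execute("SELECT COUNT(*) FROM payment").fetchone()[0]
```

If your data looks more like tweets, where each item stands on its own, this kind of guarantee may matter far less.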

How strict is the data model? Do you need constraints or other rules on the database level to be applied at all times? Is there room for delay or a softer approach where consistency will happen over time?

Disaster recovery
When disaster strikes, how important is it to recover all the data? Is some data loss acceptable, only minimal data loss, or no data loss at all? Disaster recovery applies to both external disasters like hardware failure and internal disasters like corrupted data.

Is the data self-explanatory, or will a metadata model be required in addition to the physical model?

How important is the data on the record level? Do you need to be sure that all data is persisted? Is it acceptable if a record is lost, like a status update, or must every item be stored, like the creation of a new social security number?

High availability
When will the data be used? What are the availability requirements? Should the data be available through defined contracts, or will ad hoc querying be important as well?

Do you need concurrency control on the database level? Will dirty reads cause problems, or will inserts of new records while someone else is operating on the entire data set compromise quality? Will one user be affected by what a different user sees or does?

Legal responsibility
Must the data be stored for a certain amount of time or be kept in a given state? Will the data need to be randomized or anonymized before being made available in development and testing environments?
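One possible way to prepare data for a test environment is to replace identifying values with keyed hashes. The sketch below uses Python's standard `hmac` and `hashlib` modules; the key and the `pseudonymize` function are hypothetical. Note that this is pseudonymization rather than full anonymization, so whether it is good enough is a question for your legal department.

```python
import hashlib
import hmac

# Hypothetical key; in practice it must be kept out of the test environment.
SECRET_KEY = b"replace-me"


def pseudonymize(value, key=SECRET_KEY):
    """Replace a value with a keyed SHA-256 hash (same input gives same output)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


masked = pseudonymize("jane.doe@example.com")
```

Because the same input always gives the same output, relations between records survive the masking, which is usually what you want for testing.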

How will the data be used? Is there ad hoc aggregation and analysis going on? Should read performance be tuned for defined contracts or for ad hoc querying? Will the data be updated in batches or row by row?

Is the data sensitive? Do you need role-based access control? Is it important to control access rights on the database object level? Do you need to encrypt some or all of the data?

How much data do you have now? How much data is generated in a week, month or year? What is the life expectancy of the solution? Will the data be archived or deleted along the way?
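The questions above can be turned into a rough back-of-the-envelope estimate. The sketch below assumes linear growth and no archiving or deletion, which is a simplification, and the numbers are invented for the example.

```python
WEEKS_PER_YEAR = 52


def projected_size_gb(current_gb, weekly_growth_gb, years):
    """Linear projection of data volume, assuming nothing is archived or deleted."""
    return current_gb + weekly_growth_gb * WEEKS_PER_YEAR * years


# Hypothetical figures: 200 GB today, 5 GB of new data per week, 5 year lifetime.
size = projected_size_gb(200, 5, 5)
print(f"Expected volume after 5 years: {size} GB")
```

Even a crude number like this helps you see whether you are sizing for gigabytes or terabytes over the lifetime of the solution.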

What is the main purpose of the data? Will the data serve a LOB application, a reporting solution or some other type of solution? Is there O/R mapping involved? Will the data be visualized as isolated items, in a table, in a chart, as free text or in some other way? What about availability through free text and "advanced" searching?

Other aspects to consider (IMHO secondary to the data characteristics):

Administration and maintenance
Is it important to minimize the total database administration and maintenance related to the solution?

What's the budget? This applies to explicit financial costs related to licensing, hardware, consultant hours etc. But do not forget to consider the hours spent by in-house staff and on training, as well as the general job satisfaction in the team and other aspects.

Development process
What is the development environment like, and will the tools in question play well with what you already have? How well do the tools in question adapt to the development process, like agile methods, test-driven development etc.?

Knowledge and expertise
Is there staff available, in-house or in consultancies, with knowledge of the tools in question? Is there an informal knowledge base available through online resources or an active community?

Are you friends with the IT department, the legal department, the CFO or others who can influence the final decision?

Will you need support directly from the vendor? How often and at what times? Will you provide bug reports when bugs are found in the product?

When are the stakeholders expecting the first delivery or the final delivery? Is there enough time to buy, install and implement the tools in question? 

Upgrades and patches
All products have bugs, but is it important to have the product kept up to date through frequent patches and updates?

I'm sure there are other aspects as well. Unfortunately, I don't know the different available technologies and tools well enough to list their compliance with the different aspects. That might be the topic for some posts in the future!

Thank you for reading, and feel free to drop me a line regarding this post and the topic in general.

Have a nice day!