I fix!

Last night I attended a Webstep [web|twitter] seminar on polyglot persistence. Multiple presenters, all with developer/architect backgrounds, introduced the audience to key terms like Database Thaw, the CAP theorem, the BASE acronym (as opposed to the ACID acronym) and the different types of NoSQL databases, and they showed some cool demos as well. It was a great night of learning!

So, how do you proceed to actually make a decision on what technology to use? People tend to use the tools they already know, and this approach, for the most part, results in a solution persisting all data to an RDBMS. Not that there's anything wrong with that. However, the ongoing change in how data is generated and used forces the industry to step out of the box and look to new or renewed ways of data storage and retrieval.

Every solution and every project is different, so the term "It depends" will apply in all aspects of this process. Even so, in my opinion, the data itself must guide the way when deciding what technology to use. Know your data! Know your data! Know your data! There are some other considerations as well, but the key here is the characteristics of the data. I will list the different aspects alphabetically, as their priority depends on the project in question. I advise you to use the MoSCoW method to prioritize between the different aspects, and to get the primary stakeholders' approval before you continue with the decision process.
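
As a rough illustration of the MoSCoW step, here is a small Python sketch. The ratings are made up for an imaginary project, and the `prioritize` helper is my own naming, not part of any tool; the only point is that each aspect gets one of the four MoSCoW labels and the list is then ordered by label.

```python
# The four MoSCoW categories, from highest to lowest priority.
MOSCOW_ORDER = ["Must", "Should", "Could", "Won't"]

# Hypothetical ratings for an imaginary project (assumption, not from the seminar).
ratings = {
    "Atomicity": "Must",
    "Security": "Must",
    "Performance": "Should",
    "Size": "Could",
    "Documentation": "Won't",
    "Durability": "Should",
}

def prioritize(ratings):
    """Return the aspects ordered by MoSCoW category, then alphabetically."""
    return sorted(ratings, key=lambda aspect: (MOSCOW_ORDER.index(ratings[aspect]), aspect))

ranked = prioritize(ratings)
for aspect in ranked:
    print(f"{ratings[aspect]:<7} {aspect}")
```

The ordered list is what I would bring to the primary stakeholders for approval.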

The main data characteristics to consider when deciding on a persistence technology:

Atomicity

Do you need transaction control on the database level? Is it acceptable for every item to be transmitted one by one, like tweets or items in a list/cart/bookshelf, or must the items be transmitted together because they depend on one another, like payment and shipment info when shopping online?
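
To make the payment and shipment case concrete, here is a minimal sketch using the sqlite3 module from the Python standard library. The table names and the `place_order` helper are made up for illustration. Both rows are stored together or not at all, which is the behaviour to look for if atomicity is a must.

```python
import sqlite3

def place_order(conn, order_id, amount, address):
    """Store payment and shipment info as one unit (hypothetical schema)."""
    try:
        # Used as a context manager, the connection commits when the block
        # succeeds and rolls back when an exception is raised inside it.
        with conn:
            conn.execute(
                "INSERT INTO payment (order_id, amount) VALUES (?, ?)",
                (order_id, amount),
            )
            conn.execute(
                "INSERT INTO shipment (order_id, address) VALUES (?, ?)",
                (order_id, address),
            )
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("CREATE TABLE shipment (order_id INTEGER PRIMARY KEY, address TEXT NOT NULL)")

ok = place_order(conn, 1, 99.0, "Oslo")
# The missing address violates NOT NULL on shipment, so the payment row
# inserted just before it is rolled back as well.
failed = place_order(conn, 2, 49.0, None)

payment_rows = conn.execute("SELECT COUNT(*) FROM payment").fetchone()[0]
shipment_rows = conn.execute("SELECT COUNT(*) FROM shipment").fetchone()[0]
```

After the failed second order, both tables still hold exactly one row each. If items like tweets can arrive one by one, this kind of guarantee may simply not be needed.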

Consistency

How strict is the data model? Do you need constraints or other rules on the database level to be applied at all times? Is there room for delay or a softer approach where consistency will happen over time?

Disaster recovery

When disaster strikes, how important is it to recover all the data? Is some data loss acceptable, only minimal data loss, or no data loss at all? Disaster recovery applies to both external disasters like hardware failure and internal disasters like corrupted data.

Documentation

Are the data self-explanatory, or will a metadata model be required in the context of the physical model?

Durability

How important is the data on the record level? Do you need to be sure that all data are persisted? Is it acceptable if a record is lost, like a status update, or must the item be stored, like the creation of a new social security number?

High availability

When will the data be used? What are the availability requirements? Should the data be available through defined contracts, or will ad hoc querying be important as well?

Isolation

Do you need concurrency control on the database level? Will dirty reads cause problems, or will inserts of new records when someone else operates on the entire data set compromise quality? Will one user be affected by what a different user sees or does?

Legal responsibility

Must the data be stored a certain amount of time or be kept in any given state, etc.? Will the data need to be randomized or anonymized before being made available through development and testing environments?
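
One possible way to mask data before it reaches development and test environments is sketched below, using only the Python standard library. The key, the field names and both helpers are assumptions for illustration. Note that a keyed hash like this is pseudonymization rather than full anonymization, so whether it satisfies your legal requirements is something the legal department must decide.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real one must be kept out of source control.
SECRET_KEY = b"replace-me"

def pseudonymize(value, key=SECRET_KEY):
    """Replace a sensitive value with a stable token derived from a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

def mask_record(record, sensitive_fields):
    """Return a copy of the record with the sensitive fields replaced by tokens."""
    return {
        field: pseudonymize(str(value)) if field in sensitive_fields else value
        for field, value in record.items()
    }

production_row = {"name": "Kari Nordmann", "city": "Oslo", "orders": 3}
test_row = mask_record(production_row, {"name"})
```

The same input always gives the same token, so relationships between records survive the masking, while the original name does not appear in the test data.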

Performance

How will the data be used? Is there ad hoc aggregation and analysis going on? Should read performance meet defined contracts or ad hoc querying? Will the data be updated in batches or row by row?

Security

Is the data sensitive? Do you need role-based access control to manage access? Is it important to control access rights on the database object level? Do you need to encrypt some or all data?

Size

How much data do you have now? How much data is generated in a week, month or year? What is the life expectancy of the solution? Will the data be archived or deleted along the way?
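
A back-of-the-envelope projection can be enough to answer these questions early on. The sketch below assumes plain linear growth and no archiving or deletion, and all the numbers are invented for illustration.

```python
def projected_size_gb(current_gb, weekly_growth_gb, years):
    """Linear projection: today's size plus constant weekly growth, nothing deleted."""
    weeks = 52 * years
    return current_gb + weekly_growth_gb * weeks

# Invented example: 50 GB today, 2 GB of new data per week, 5-year life expectancy.
estimate = projected_size_gb(current_gb=50, weekly_growth_gb=2, years=5)
print(f"Estimated size after 5 years: {estimate} GB")
```

With these numbers the estimate is 50 + 2 * 52 * 5 = 570 GB. Archiving or deleting data along the way would of course lower it.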

Usage

What is the main purpose of the data? Will the data serve a LOB application, a reporting solution or any other type of solution? Is there O/R mapping involved? Will the data be visualized as isolated items, in a table, in a chart, as free text or in some other way? What about availability through free text and "advanced" searching?

Other aspects to consider (IMHO secondary to the data characteristics):

Administration and maintenance

Is it important to minimize the total database administration and maintenance related to the solution?

Cost

What's the budget? This applies to explicit financial costs related to licensing, hardware, consultant hours etc. But do not forget to consider the hours spent by in-house staff and training hours, as well as the general job satisfaction in the team and other aspects.

Development process

What is the development environment like, and will the tools in question play well with what you already have? How well do the tools in question adapt to the development process, like agile methods, test-driven development etc.?

Knowledge and expertise

Is there staff available, in-house or in consultancies, with knowledge of the tools in question? Is there an informal knowledge base available through online resources or an active community?

Stakeholders

Are you friends with the IT department, the legal department, the CFO or others who can influence the final decision?

Support

Will you need support directly from the vendor? How often and at what times? Will you provide bug reports when bugs are found in the product?

Time

When are the stakeholders expecting the first delivery or the final delivery? Is there enough time to buy, install and implement the tools in question?

Upgrades and patches

All products have bugs, but is it important to have the product kept up to date through frequent patches and updates?

I'm sure there are other aspects as well. Unfortunately, I don't know the different available technologies and tools well enough to list their compliance with the different aspects. That might be the topic for some posts in the future!

Thank you for reading, and feel free to drop me a line regarding this post and the topic in general.

Have a nice day!

Paul Randal [blog] and Kimberly Tripp [blog] from SQLSkills are teaching a 5-day Internals and Performance class in Dallas, February 21-25. They have challenged their followers to argue their way to Dallas, so here goes.

More by coincidence than anything else, I attended Kimberly Tripp's preconference at TechEd Europe back in 2006, and that day turned out to be one of the most significant days of my career. The knowledge and inspiration that sparkled from the stage just blew me away, and I've not looked back since. Unfortunately, I've not had the opportunity to attend either Paul's or Kimberly's classes or sessions since the 2006 TechEd Europe, until now that is...

There are multiple reasons why I really, really want to attend the Internals and Performance class, and they are mostly founded in how I hope to make use of the knowledge when I get back. The reasons range from personal goals via community goals to explicit work goals, all of which mean a great deal to me.

For me personally: It's always better to learn from the best, and I can't think of anyone else I'd rather see live than Kimberly Tripp and Paul Randal (except for Pink Floyd, but that's just another cup of tea). In addition to great teachers, the class will also consist of a group of highly motivated students who I'm sure will aim to contribute to the class in a most constructive manner. The long-lasting motivation and inspiration that classes like this create are invaluable, and the great fun I'll have while being there is like birthdays and Christmas at once!

For the community: Norway is located in a far corner of the world, and the offline SQL Server community here is, at least across companies, nearly non-existent. The Norwegian SQL Server user group is hibernating, and I would like to wake it. One of my goals would be to strengthen my knowledge base and get the confidence to start a SQL Server meetup in Oslo. In this context, attending classes with the best of the best will hopefully, in addition to helping me attain new knowledge, also give me confidence in what I already know.

At work: I see all training and knowledge elevation as tools for making better decisions in my work day, helping customers focus on their enterprise rather than the supporting infrastructure. In this respect, learning is a continuous activity. But as time is the most valuable asset I have, small classes focused on highly relevant topics are a lot more effective than learning "the google way" or reading a book.

And out of curiosity: One of my preferred sources of knowledge these days is the "SQL Server 2008 Internals" book by Kalen Delaney featuring among others both Kimberly Tripp and Paul Randal. Having the opportunity to learn from the author of chapter 11 (and the technology it describes) would be just thrilling :-)

Friday, October 11, 2013

Polyglot persistence - let the data lead the way

Tuesday, January 25, 2011

Challenge accepted

Monday, January 24, 2011

There is a first time for everything...