Book - Designing Data-Intensive Applications

Link to the book.

The book starts by describing very simple key-value database systems, then gradually introduces more concepts and features such as indexes, joins, SQL (Structured Query Language), MapReduce, isolation, consensus and total order broadcast, eventually winding up with data streaming systems. At each stage, the author explains the problems that such systems solve and the trade-offs that they make. The book does not dive too deeply into the algorithms that such systems use, although it does describe some key ones. The examples are typically based on existing data products, which is somewhat useful for discovery and comparison if you are designing a data system. The earlier parts of the book, which cover database internals, are interesting but at first seem somewhat unrelated to the later parts; in the later parts, however, the author draws parallels between the internal workings of a database and data streaming systems. In particular, he predicts an 'unbundling' of database features like replication and secondary indexes, allowing greater flexibility and the integration of systems that are best suited to particular jobs.

The author is keen to point out that although more complex data systems can offer horizontal scalability, this benefit should be weighed against the additional complexity they bring to your system. I agree with this approach.

Overall, I found the book useful. It covered some concepts I had only a passing familiarity with, such as MapReduce, and it really made me think about the problems we might face in our systems at work and their potential solutions. I would recommend it to most software engineers, in particular those who might be designing systems that work with data that won't fit on a single computer.