Designing Data-Intensive Applications

Technical Books
My notes on Designing Data-Intensive Applications by Martin Kleppmann.
Author

Tyler Hillery

Published

November 18, 2023


I am going to try something new while reading this book. After every reading session I will journal any thoughts and ideas I have, unfiltered and unedited. This book comes highly regarded and I want to be able to capture my experience during my first read through.

TODO

These notes are a work in progress and may have several typos!

Reading Session 01

  • date: Tue 10/10/2023
  • time: 7:05PM-7:15PM (10 min)
  • chapter: Preface
  • pages: 5-18 (15 pages)

The preface sets the stage for the book, describing the scope, who should read it, and the outline of the chapters. Based on the preface alone I have a feeling I am going to really enjoy it. The preface emphasizes that the focus of the book is on the principles of designing data-intensive applications. We also first learn what a data-intensive application is:

We call an application data-intensive if the data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive, where CPU cycles are the bottleneck.

I personally identified with “the people who have a natural curiosity for the way things work” as my reasoning for reading this book. My reasoning goes along with this point as well:

You will, however, develop a good intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.

For me it’s nice to have a high-level mental model of how some of these systems work so I can mentally connect the dots, even though I may not be working on the underlying systems directly. Although one day I would like to 😉

I am looking forward to reading the rest of the book!


Reading Session 02

  • date: Wed 10/11/2023
  • time: 4:00AM-4:45AM (45 min)
  • chapter: Ch 01 - Reliability
  • pages: 18-31 (13 pages)

Notes

I like the concept of data systems; even today we are still seeing the lines get blurred further and further between messaging systems and databases. For example, Redpanda recently announced that they plan on adding Iceberg support for their tiered storage. This would mean any query engine that can read the Iceberg table format should be able to query the data. Does this make Redpanda a database? I think so.

Many of us may be data system designers without even realizing it. Figure 1.1 does a great job showing how one application could be interacting with an in-memory cache, a primary db, a full-text index, and a message queue all at once, all glued together through the application code with the implementation details hidden from clients. I think this book is more widely applicable than people realize.

Three key terms:

Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

TODO

Look into the Netflix Chaos Monkey (something about purposely introducing faults to test the system)

TODO

Look up RAID configuration, I remember learning about this in my computer org. and arch. class

Interesting to hear how handling reliability is now less focused on making the hardware more redundant and instead on having the software handle it. This was largely driven by cloud platforms like AWS, which focus more on flexibility and elasticity than on single-machine reliability.

The author gave examples of software faults and how they can lie dormant for a long time until they are triggered by an unusual set of circumstances. It reminded me of a bug I encountered yesterday in our dim_calendar table, which has a column called is_holiday. My assumption when using this table was that when is_holiday was true, the stock markets would also be closed. Well, in the case of 10/09/2023, Columbus Day is a federal holiday but the markets are still open. Luckily I had some dbt tests that notified me something was wrong with some downstream dependencies of this table. I was able to quickly search through the DAG and noticed that the is_holiday column was based on a Python UDF that used `from pandas.tseries.holiday import USFederalHolidayCalendar`. I was able to modify the function to only use holidays observed by the stock market and open a PR to fix this.
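To make that concrete, here's a minimal sketch of the kind of fix I mean, using pandas' holiday calendar machinery. The class name and the exact rule set are my own rough illustration of NYSE-style holidays, not the actual UDF from that PR:

```python
import pandas as pd
from pandas.tseries.holiday import (
    AbstractHolidayCalendar,
    GoodFriday,
    Holiday,
    USLaborDay,
    USMartinLutherKingJr,
    USMemorialDay,
    USPresidentsDay,
    USThanksgivingDay,
    nearest_workday,
)

class USMarketHolidayCalendar(AbstractHolidayCalendar):
    """Roughly NYSE-style holidays: unlike USFederalHolidayCalendar, this
    omits Columbus Day and Veterans Day and adds Good Friday."""
    rules = [
        Holiday("New Year's Day", month=1, day=1, observance=nearest_workday),
        USMartinLutherKingJr,
        USPresidentsDay,
        GoodFriday,
        USMemorialDay,
        Holiday("Independence Day", month=7, day=4, observance=nearest_workday),
        USLaborDay,
        USThanksgivingDay,
        Holiday("Christmas Day", month=12, day=25, observance=nearest_workday),
    ]

def is_market_holiday(dates: pd.DatetimeIndex) -> pd.Series:
    # Flag each date that falls on a market holiday within the given range.
    holidays = USMarketHolidayCalendar().holidays(dates.min(), dates.max())
    return pd.Series(dates.isin(holidays), index=dates)

# Columbus Day 2023-10-09 is a federal holiday, but not a market holiday.
print(is_market_holiday(pd.to_datetime(["2023-10-09", "2023-11-23"])))
```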

Reading session summary

My first thought is this book is going to take forever if I continue taking these detailed notes and journaling my thoughts as I read. I expect other sessions to be less detailed, but I am going to try to stick with it. It was nice to learn more about what reliability means and how to handle it. I believe this is an area that I need to work on. Most of the work I do is more reactive (testing) to faults, which can cause me to go into a panic trying to debug what is wrong. It would be better if I could be more proactive. It reminds me of a pattern used in data engineering called Write-Audit-Publish (WAP); wow, this acronym is forever ruined by Cardi B. In essence WAP involves:

  1. Write the data to a staging area (a non-production env so the data is isolated)
  2. Audit the data to validate it and solve any quality issues (NULL values, duplicates, etc.)
  3. Publish the data to production

Table formats like Iceberg even have special functionality to help implement this pattern, like the older WAP.id approach or the newer branching-based WAP.branch. More about this can be found in Streamlining Data Quality in Apache Iceberg with write-audit-publish & branching.

This pattern reminds me of version control, where step one is like creating your own local branch and step two is the CI/CD process that runs tests against your changes. Once those pass, step three happens, where changes are deployed to production.
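Here's a minimal, self-contained sketch of the WAP pattern described above, using plain pandas DataFrames as stand-ins for the staging and production tables. The table names and the specific audit checks are made up for illustration:

```python
import pandas as pd

prod_orders = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 25.0]})

def write_audit_publish(new_batch: pd.DataFrame) -> pd.DataFrame:
    # 1. Write: land the new data in an isolated "staging" copy.
    staging = new_batch.copy()

    # 2. Audit: validate before anything touches production.
    problems = []
    if staging["order_id"].isna().any():
        problems.append("NULL order_id values")
    if staging["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if problems:
        raise ValueError(f"Audit failed, refusing to publish: {problems}")

    # 3. Publish: only audited data reaches production.
    return pd.concat([prod_orders, staging], ignore_index=True)

good_batch = pd.DataFrame({"order_id": [3, 4], "amount": [7.5, 12.0]})
prod_orders = write_audit_publish(good_batch)   # publishes

bad_batch = pd.DataFrame({"order_id": [5, 5], "amount": [1.0, 2.0]})
try:
    prod_orders = write_audit_publish(bad_batch)
except ValueError as e:
    print(e)                                    # the audit blocks the publish
```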


Reading Session 03

  • date: Thu 10/12/2023
  • time: 6:45PM-7:00PM (15 min)
  • chapter: Ch 01 - Scalability
  • pages: 31-39 (8 pages)

Notes

I like how the author phrases it: it’s not good enough to just say “this system scales”; we have to ask “what are our options for coping with growth?”

Load can be described with load parameters:

  • requests per second to a web server
  • ratio of reads to writes to a db
  • number of simultaneously active users in a chat room
  • hit rate on a cache

Interesting to hear that Twitter’s scaling challenge is due to fan-out where each user follows many people and each user is followed by many people.

Love the inside joke on just setting up my twttr.

Throughput: The number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.

Response time: What the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays.

Latency: Duration that a request is waiting to be handled

Response time can vary greatly, so it’s best to measure it as a distribution of values.

Another good tidbit: the mean is not a good measure of “typical” response time, because it doesn’t tell you how many users actually experienced the delay. It’s better to use percentiles, starting with the median, aka the 50th percentile.
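A quick illustration with made-up numbers of why the median and high percentiles describe “typical” and tail response times better than the mean:

```python
import numpy as np

response_times_ms = np.array([12, 15, 14, 13, 16, 15, 14, 13, 900, 15])  # one slow outlier

print(f"mean: {response_times_ms.mean():.0f} ms")                      # dragged up by the outlier
print(f"p50 (median): {np.percentile(response_times_ms, 50):.0f} ms")  # what a typical user sees
print(f"p99: {np.percentile(response_times_ms, 99):.0f} ms")           # the tail latency
```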

Reading Session Summary

I’ll be honest, I didn’t want to read today, but I currently have my longest reading streak in the Kindle app (6 days) and I didn’t want to break it. My original plan was to read just one page, but it’s surprising how once you start something you don’t want to stop. The author poses so many great questions to ask when it comes to the scalability of systems and really starts to pin down key terms I have heard of but could never define or put into words if you asked me, like what load or load parameters are.


Reading Session 04

  • date: Fri 10/13/2023
  • time: 4:05AM-4:20AM (15 min)
  • chapter: Ch 01 - Scalability
  • pages: 39-46 (7 pages)

Reading Session Summary

A couple of things came to mind as I finished up the scalability portion. I like how the author started by first describing load and metrics for measuring performance before jumping straight into “how to maintain good performance even when load parameters increase by some amount?” I believe this is a thoughtful exercise to do before trying to optimize for scalability because it forces you to think about what it is you are actually trying to optimize for. This deep thinking will help avoid the pitfalls of optimizing for the wrong thing; the author even calls out “An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare — the load parameters.”

I am familiar with some of the common techniques mentioned in this chapter for dealing with scalability, such as vertical and horizontal scaling and shared-nothing architecture. The author makes the correct call that distributed data systems have become the default choice. In fact, I think people today choose distributed systems far too quickly as the default choice when tackling data problems. I recently came across a post on r/dataengineering where an engineer was helping out a local floral shop and set them up with BigQuery, Mage, dbt, etc. It wasn’t clear to me what size of data they were dealing with, but the technology choices they recommended just seemed absurd. Just thinking about my local floral shop being set up with this tech: who the heck is going to maintain it? Also, the problem didn’t really call for all these tools; they simply needed a more automated way of cross-referencing data from SQL Server with manually downloaded sales information. This sounds like something a spreadsheet could do. The pendulum has shifted too far toward making distributed systems the default choice. In recent times, with the rise of tools like DuckDB, I believe people are going to second-guess themselves before instinctively setting up a Spark cluster to do some simple ETL/ELT work.


Reading Session 05

  • date: Sat 10/14/2023
  • time: 4:05AM-4:20AM (15 min)
  • chapter: Ch 01 - Maintainability & Ch 02 - Data Models
  • pages: 39-46 (7 pages)

Notes

3 design principles of Maintainability

Simplicity: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.

  • This can be tough, finding the right level of abstraction. The author actually goes into detail about abstraction and relates it to how high-level programming languages abstract away having to write machine code. There is some foreshadowing by the author implying we will learn some techniques for finding the right layer of abstraction.

Operability: Make it easy for operations teams to keep the system running smoothly.

Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.

Don’t create big balls of mud.

Reading Session Summary

Finishing up chapter 1, I learned what it means to create maintainable software. I have personally been there, working on some code and thinking to myself “who wrote this awful code?” only to check the commit history and see it was me… Refactoring code honestly feels like the majority of the work I have done in my career; no matter what company I worked for or what role I was in, I was taking old code and making it better or migrating it to another platform. These personal experiences just reiterate how crucial it is to create maintainable systems, and I liked how the author broke maintainability down into 3 design principles. Overall, Chapter 1 did a great job of laying out the foundation and principles that go into thinking about data-intensive applications.

When I read the title of Chapter 2, Data Models and Query Languages, I got excited. This is a particular interest of mine. The author makes a bold claim about data models:

Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.

I have long had this belief that all software can be distilled down to taking data in, manipulating it, and producing data as a result. I am something of a self-proclaimed data extremist. This is why I am so fascinated by databases and data systems as a whole. Now, I come from an analytics background, and what a lot of people think I am describing here is a data pipeline, but I also think this pattern is just as applicable to all domains in tech: front-end engineering, back-end engineering, data engineering, systems engineering. At the end of the day, no matter what they are building, everyone is taking data in, doing something with it, and producing some sort of output, which is also data. A front-end engineer gets JSON data from an API, transforms it into HTML (which is textual data), and renders it on the client. It’s the data models and the ways we represent and interact with the data that separate these engineers into these domains.

I like the way the author phrases how applications are really built by layering one data model on top of another, which can really be thought of as abstraction layers IMO. I think it’s important, at whatever layer you are operating at, to try learning the skills and tools commonly used one layer up and one layer down, as it will make you that much more knowledgeable about the layer you normally work in. This is part of the reason why I am learning about lower-level systems and how databases actually work. As an analytics engineer, my main responsibility is to figure out how to best model our data to enable downstream use cases like machine learning and reporting, commonly implemented using SQL. Understanding how databases work will give me better insight into writing more optimized and performant queries, even if I never actually have to build a database.

It was interesting to hear that databases back in the day made application developers think about the internal representation of the data in the database. Could you imagine actually having to worry about how the data is physically stored on disk, as opposed to just writing CREATE TABLE and INSERT statements? Wow, what a time to be alive.

Also I never knew of the terms: business data processing, network model and hierarchical model. Just goes to show how the relational model has stood the test of time & appears to have made some of these other models obsolete.


Reading Session 06

  • date: Sun 10/15/2023
  • time: 3:20PM-3:55PM (35 min)
  • chapter: Ch 02 Data Models & Query Languages
  • pages: 61-121 (60 pages)

Notes

One comparison that I really liked was:

Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.

TODO

Look into Google’s Spanner database (how does it offer the same locality properties in a relational model?)

An imperative language tells the computer to perform certain operations in order.

A declarative language describes the result you want.

Declarative languages are better suited for parallel execution and hide implementation details.
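A small illustration of the distinction, using Python for the imperative side and SQL (via the standard library’s sqlite3) for the declarative side; the animals table is just a toy example:

```python
import sqlite3

animals = [("shark", "fish"), ("lion", "mammal"), ("tuna", "fish")]

# Imperative: spell out *how* to walk the list and build the result, step by step.
fish = []
for name, family in animals:
    if family == "fish":
        fish.append(name)

# Declarative: describe *what* result you want; the query planner decides how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)", animals)
fish_sql = [row[0] for row in conn.execute("SELECT name FROM animals WHERE family = 'fish'")]

print(fish)      # ['shark', 'tuna']
print(fish_sql)  # ['shark', 'tuna']
```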

Reading Session Summary

I am most familiar with the relational model, so it was interesting to learn about other data models and NoSQL systems. I still believe the relational model is the best, and it continues to improve as support for things like JSON gets better. The author also mentions this point in the book: relational model systems and non-relational model systems are converging in terms of capabilities.


Reading Session 07

  • date: Mon 10/16/2023
  • time: 12:45PM-1:45PM (60 min)
  • chapter: Ch 03 Storage and Retrieval
  • pages: 121-169 (48 pages)

Notes

One of the main reasons I am reading this book:

You’re probably not going to implement your own storage engine from scratch, but you do need to select a storage engine that is appropriate for your application, from the many that are available. In order to tune a storage engine well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.

I never realized how powerful an append-only log is.
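A toy sketch of the append-only log idea from this chapter: every write is appended to a file, and an in-memory hash index maps each key to the byte offset of its latest value. The class and file name are my own invention for illustration, not the book’s exact example:

```python
class AppendOnlyKV:
    """Toy key-value store: writes append to a log file; an in-memory hash
    index maps each key to the byte offset of its most recent record."""

    def __init__(self, path: str = "data.log"):
        self.path = path
        self.index: dict[str, int] = {}
        open(self.path, "ab").close()  # ensure the log file exists

    def set(self, key: str, value: str) -> None:
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.tell()          # append-only: new records always go at the end
            f.write(record)
        self.index[key] = offset

    def get(self, key: str) -> str | None:
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            _, value = f.readline().decode().rstrip("\n").split(",", 1)
            return value

db = AppendOnlyKV()
db.set("42", "san_francisco")
db.set("42", "new_york")   # the old value stays in the log; the index points to the latest
print(db.get("42"))        # -> new_york
```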

Not many people realize this about columnar stores

Note that it wouldn’t make sense to sort each column independently, because then we would no longer know which items in the columns belong to the same row. We can only reconstruct a row because we know that the kth item in one column belongs to the same row as the kth item in another column.

Rather, the data needs to be sorted as an entire row at a time, even though it is stored by column.

Redshift allows you to define the sort order by creating a sort key. This can help speed up queries by pruning data that does not need to be read, based on the min and max values stored in zone maps. It works very similarly to an index. You commonly want to sort on a field used in WHERE clauses; a good example would be a date column.
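A tiny illustration of the zone-map idea: keep min/max stats per block of the sort-key column and skip blocks whose range can’t satisfy the WHERE clause. The block layout here is made up, not Redshift’s actual storage format:

```python
from dataclasses import dataclass

@dataclass
class Block:
    min_date: str
    max_date: str
    rows: list          # stand-in for the actual column data

blocks = [
    Block("2023-01-01", "2023-03-31", ["..."]),
    Block("2023-04-01", "2023-06-30", ["..."]),
    Block("2023-07-01", "2023-09-30", ["..."]),
]

def blocks_to_scan(blocks: list, lo: str, hi: str) -> list:
    # Only read blocks whose [min, max] range overlaps the filter range.
    return [b for b in blocks if not (b.max_date < lo or b.min_date > hi)]

# WHERE trade_date BETWEEN '2023-05-01' AND '2023-05-31' -> only 1 of 3 blocks read
print(len(blocks_to_scan(blocks, "2023-05-01", "2023-05-31")))
```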

Interesting to learn about the idea of having several different sort orders and letting the database figure out the best one. I wonder if that’s what these “Real-time OLAP DBs” like Pinot do.

Reading Session Summary

I need to study my data structures and algorithms, because I would be lying if I said I completely understood everything in this chapter so far. I think at a high level I get the differences between LSM-trees and B-trees, but nothing deeper than that.

Then the chapter started talking about OLAP and data warehousing, which is my 🍞 and 🧈.


Reading Session 08

  • date: Tue 10/17/2023
  • time: 5:50AM-6:00AM (10 min)
  • chapter: Ch 03 Storage and Retrieval & Ch 04 Encoding and Evolution
  • pages: 169-184 (15 pages)

Summary

Finished up chapter 03 and started ch 04. I am looking forward to learning about REST and RPC communication protocols, as I frequently come across these terms but don’t have much knowledge about them.


Reading Session 09

  • date: Wed 10/18/2023
  • time: 8:00AM-8:05AM (5 min)
  • chapter: Ch 04 Encoding and Evolution
  • pages: 184-188 (4 pages)

Notes

Encoding and serialization are the same thing.

Summary

I had never heard of some of these language-specific formats, like pickle for Python, for encoding in-memory objects into byte sequences. Briefly touched on other formats like XML, CSV, and JSON, with JSON being the most popular. I wonder if later in this chapter we will get to learn about gRPC.
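For reference, here’s what that language-specific encoding looks like in Python with pickle (the object is just an example):

```python
import pickle

user = {"name": "Martin", "interests": ["databases", "distributed systems"]}

encoded: bytes = pickle.dumps(user)   # in-memory object -> byte sequence
decoded = pickle.loads(encoded)       # byte sequence -> in-memory object
assert decoded == user
```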


Reading Session 10

  • date: Thu 10/19/2023
  • time: 9:00PM-9:03PM (3 min)
  • chapter: Ch 04 Encoding and Evolution
  • pages: 190-191 (2 pages)

Summary

Read two pages to keep the streak alive. Learned a little about binary encoding.

Reading Session 11

  • date: Fri 10/20/2023
  • time: 10:00AM-10:05AM (5 min)
  • chapter: Ch 04 Encoding and Evolution
  • pages: 191-197 (6 pages)

Summary

Another short reading session where I learned a little about Thrift and Protocol Buffers. This is an interesting topic to me because I’ve heard of a Thrift server before when I worked on Spark for a little bit. I wonder if these two are the same thing. I also know Protocol Buffers are used in gRPC. It will be nice to understand what these technologies actually are.


Reading Session 12

  • date: Sat 10/21/2023
  • time: 4:20PM-4:40PM (20 min)
  • chapter: Ch 04 Encoding and Evolution
  • pages: 197-209 (12 pages)

Notes

Protocol Buffers do not have a list or array datatype, but instead have a repeated marker for fields. Thrift has a dedicated list datatype that does not allow the same evolution from single-valued to multi-valued as Protocol Buffers does, but it has the advantage of supporting nested lists.

TODO

Follow up to check if Avro is used for schema definition in Apache Kafka topics.

JSON, CSV, and XML are textual formats, while Protocol Buffers, Thrift, and Avro are binary formats.

Interesting to learn that Thrift and Protocol Buffers rely on code generation, which is helpful for statically typed languages. The author points out that for dynamically typed languages there is not much benefit in code generation since there is no compile-time type checker. I wonder, now that Python has type hints, if there could be an advantage. I also think it could be helpful for generating schemas for libraries like Pydantic, which help with runtime data validation.
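A quick sketch of the runtime validation idea I mean, assuming Pydantic is installed (the User model is made up for illustration):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str

print(User(id="1", name="Ada"))       # coerced and validated at runtime

try:
    User(id="not-a-number", name="Ada")
except ValidationError as e:
    print(e)                          # schema violation caught at runtime, not compile time
```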

Summary

This part of the book reminds me of a project called Recap, which provides the ability to transpile schema definitions between various formats. Before reading this chapter I wasn’t aware of all the different schema formats, so I am now seeing how someone could benefit from this tool.


Reading Session 13

  • date: Sun 10/22/2023
  • time: 1:32PM-1:50PM (18 min)
  • chapter: Ch 04 Encoding and Evolution
  • pages: 209-234 (25 pages)

Notes

REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP… An API designed according to the principles of REST is called RESTful.

SOAP is an XML-based protocol for making network API requests…aims to be independent from HTTP and avoids using most HTTP features.

The main methods discussed were:

  • Databases: writing encodes the data and reading decodes it

  • RPC and REST APIs: the client encodes a request, the server decodes it and encodes a response, and the client decodes the response.

    TODO

    When is it best to use gRPC vs REST APIs? My understanding was gRPC is used when you want to invoke a function from another service as if it were a function in your own service. I recall, though, a customer of ours requesting that data be delivered via gRPC as opposed to a RESTful API. To me that wouldn’t make sense.

  • Asynchronous message passing: nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient.

Summary

Finished chapter 04, covering the many different ways data flows between services. So far this book feels like Lego blocks, with each chapter putting together the foundational pieces common distributed systems are built upon. I really enjoy this approach because it’s too common nowadays to accept the norm for what it is without truly understanding the many layers of abstraction we are building upon.


Reading Session 14

  • date: Mon 10/23/2023
  • time: 6:38PM-6:50PM (12 min)
  • chapter: Ch 05 Replication
  • pages: 234-247 (13 pages)

Summary

Starting to learn about distributed systems and how data can be replicated across nodes.


Reading Session 15

  • date: Tue 10/24/2023
  • time: 5:10AM-5:32AM (22 min)
  • chapter: Ch 05 Replication
  • pages: 247-267 (20 pages)

Summary

Wow, I had no idea about all the complexities one has to deal with when moving from a single-node system to a distributed system.


Reading Session 16

  • date: Tue 10/24/2023
  • time: 2:40PM-3:00PM (20 min)
  • chapter: Ch 05 Replication
  • pages: 267-291 (24 pages)

Notes

3 methods for replicating changes between nodes

  • single-leader
  • multi-leader
  • leaderless

Summary

When you start using replication methods that involve writes to multiple nodes it introduces conflicts. The best way to deal with conflicts is to avoid them 🙂


Reading Session 17

  • date: Wed 10/25/2023
  • time: 7:11PM-7:14PM (3 min)
  • chapter: Ch 05 Replication
  • pages: 291-294 (3 pages)

Summary

Handling conflicts can be hard because writes are concurrent, so there is no natural order to them. An arbitrary order needs to be applied to determine the last write when using the last-write-wins approach for conflict resolution.


Reading Session 18

  • date: Thu 10/26/2023
  • time: 4:15AM-4:30AM (15 min)
  • chapter: Ch 05 Replication
  • pages: 294-314 (20 pages)

Notes

For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.

Summary

I’ll be honest there were parts of this chapter where a lot of the information went over my head 😅. It’s good for me to be generally aware of some of these concepts but it will definitely require me to build something where I implement them for the information to truly settle in.


Reading Session 19

  • date: Fri 10/27/2023
  • time: 7:00PM-7:05PM (5 min)
  • chapter: Ch 06 Partitioning
  • pages: 314-319 (5 pages)

Notes

If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed.

A partition with disproportionately high load is called a hot spot

Summary

The beginning of the chapter reminds me of some of the query optimization work I do in my current role. At Nasdaq we use Redshift as our data warehouse. A couple of things you can define for your tables are a distribution key and a sort key. The distribution key tells the leader node which worker node to send the data to. Each worker node stores the data in blocks, and zone maps store the min and max values of the sort key for each block, allowing the worker node to prune blocks without having to read the data.

The idea behind the distribution key is to co-locate joins to minimize the impact of the redistribution step. Make sure to pick a key that evenly distributes the data across nodes. This can be challenging at Nasdaq because some stocks trade a lot more than others and can skew the data.

TODO

If I specify a DISTKEY, how will Redshift still try to ensure the data is evenly distributed across nodes? For example, let’s say there is a dataset with 3 symbols and we have two worker nodes. One symbol makes up 50% of the data and the other two make up the other 50%. Will Redshift know to distribute the symbol with more data on one node and the other two symbols on the other node?


Reading Session 20

  • date: Sat 10/28/2023
  • time: 6:31PM-6:34PM (3 min)
  • chapter: Ch 06 Partitioning
  • pages: 319-321 (2 pages)

Summary

Learned about some hashing techniques to help evenly distribute data across nodes to reduce the risk of a hot spot.
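A small sketch of the hash-partitioning idea: hashing the key spreads skewed keys more evenly across partitions than range-partitioning on the raw key would. The partition count and symbols are made up:

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Stable hash of the key, reduced modulo the partition count.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["AAPL", "MSFT", "NVDA", "TSLA", "AMZN"]:
    print(key, "-> partition", partition_for(key))
```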

Reading Session 21

  • date: Sun 10/29/2023
  • time: 7:17PM-7:22PM (5 min)
  • chapter: Ch 06 Partitioning
  • pages: 321-327 (6 pages)

Summary

Secondary indexes can introduce additional complexity when partitioning data and usually result in having to query all the partitions. This approach is called scatter/gather.


Reading Session 22

  • date: Mon 10/30/2023
  • time: 4:17AM-4:38AM (21 min)
  • chapter: Ch 06 Partitioning
  • pages: 327-347 (20 pages)

Notes

I have heard of ZooKeeper before (mainly from Kafka) but never knew what role it had in the Kafka ecosystem. I now know that ZooKeeper can be used to keep track of cluster metadata and maintain the mapping of partitions to nodes. It’s often used by a “routing tier,” which is a partition-aware load balancer. The routing tier can query ZooKeeper to determine where to route requests.

Summary

Wrapped up chapter 06 by learning more about how partitions can be rebalanced to different nodes. As a result, we also need techniques to keep track of partitions and the nodes they are assigned to so queries can be efficiently routed to the right node.


Reading Session 23

  • date: Tue 10/31/2023
  • time: 7:15PM-7:18PM (3 min)
  • chapter: Ch 07 Transactions
  • pages: 347-349 (2 pages)

Notes

A transaction is a way for an application to group several reads and writes together into a logical unit…Transactions are not a law of nature; they were created with a purpose, namely to simplify the programming model for applications accessing a database. By using transactions, the application is free to ignore certain potential error scenarios and concurrency issues, because the database takes care of them instead (we call these safety guarantees).

Summary

Started the chapter on transactions. I learned about some of the high-level advantages that transactions provide. I’ll be the first to admit that I take transactions for granted. It’ll be nice to learn more about the lower-level details of transactions, the advantages they provide, and the limitations they may impose. I wonder in what scenarios you could get away with less strict transactions to improve the performance of the database while sacrificing some of the safety guarantees.


Reading Session 24

  • date: Wed 11/01/2023
  • time: 6:42PM-7:02PM (20 min)
  • chapter: Ch 07 Transactions
  • pages: 349-367 (18 pages)

Notes

ACID: The safety guarantees provided by transactions are often described by the well-known acronym ACID, which stands for Atomicity, Consistency, Isolation, and Durability.

Interesting to hear there are various degrees of ACID; I always thought if a db claimed it had ACID transactions, it meant the same safety guarantees.

atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed—for example, a process crashes, a network connection is interrupted, a disk becomes full, or some integrity constraint is violated. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.

My mental model of atomic: if the full transaction doesn’t complete, nothing completes. I also like the author calling it abortability.
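A small demonstration of that abortability using SQLite through Python’s standard library; the accounts table is made up for illustration. If any statement in the transaction fails, none of its writes survive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")  # violates CHECK
except sqlite3.IntegrityError:
    pass

# Bob's credit was rolled back too: the transaction either fully commits or fully aborts.
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100, 'bob': 0}
```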

consistency refers to an application-specific notion of the database being in a “good state.”

Surprising to hear that this part of ACID is actually a property of the application.

Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes. The classic database textbooks formalize isolation as serializability, which means that each transaction can pretend that it is the only transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently [10]… Concurrently running transactions shouldn’t interfere with each other. For example, if one transaction makes several writes, then another transaction should see either all or none of those writes, but not some subset.

Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.

The most basic level of transaction isolation is read committed. It makes two guarantees:

  1. When reading from the database, you will only see data that has been committed (no dirty reads).

  2. When writing to the database, you will only overwrite data that has been committed (no dirty writes).

Summary

My takeaway from this part of the book is: don’t take ACID at face value. ACID means different things between databases, and even consistency falls partly on the application, not just the database. It’ll be interesting to learn more about how databases handle isolation in detail, as that appears to be the most complicated part the db has to handle for ACID transactions.


Reading Session 25

  • date: Thu 11/02/2023
  • time: 8:00PM-8:01PM (1 min)
  • chapter: Ch 07 Transactions
  • pages: 367-368 (1 page)

Summary

Read one page to keep my kindle reading streak alive; we are at a record high of 28 days 🔥


Reading Session 26

  • date: Fri 11/03/2023
  • time: 7:40PM-8:00PM (20 min)
  • chapter: Ch 07 Transactions
  • pages: 368-389 (21 pages)

Summary

Isolation is hard.


Reading Session 27

  • date: Sat 11/04/2023
  • time: 1:16PM-1:41PM (25 min)
  • chapter: Ch 07 Transactions
  • pages: 389-426 (37 pages)

Summary

Finished the section on transactions. I will probably need to revisit this chapter again to fully digest the information. To summarize, transactions deal with a lot of complexity so your application doesn’t have to. Most of the complexity appears to come from concurrency, and by that I mean what the database should do when there are multiple requests to read or write happening at the same time.


Reading Session 28

  • date: Sun 11/05/2023
  • time: 8:18PM-8:19PM (1 min)
  • chapter: Ch 08 The Trouble with Distributed Systems
  • pages: 426-427 (1 page)

Summary

Read one page again to keep that streak alive.


Reading Session 29

  • date: Mon 11/06/2023
  • time: 5:18PM-5:19PM (1 min)
  • chapter: Ch 08 The Trouble with Distributed Systems
  • pages: 427-428 (1 page)

Summary

Honeymoon starts tomorrow, so the last couple of days I haven’t had as much time to read as I would like.


Reading Session 30

  • date: Tue 11/07/2023
  • time: 5:49AM-5:57AM (8 min)
  • chapter: Ch 08 The Trouble with Distributed Systems
  • pages: 428-441 (13 pages)

Summary

Computer networking is very important and who knew how hard it would be to detect a partial failure in a distributed system.


Reading Session 31

  • date: Wed 11/08/2023
  • time: 8:20PM-8:41PM (21 min)
  • chapter: Ch 08 The Trouble with Distributed Systems
  • pages: 441-461 (20 pages)

Summary

I am shocked that any distributed systems work with how many problems and edge cases they have to deal with such as not being able to rely on the clocks within each node.


Reading Session 32

  • date: Thu 11/09/2023
  • time: 7:47AM-8:51AM (64 min)
  • chapter: Ch 09 Consistency and Consensus
  • pages: 461-530 (69 pages)

Summary

Starting to learn more about the algorithms that help us deal with many of the challenges that distributed systems face.


Reading Session 33

  • date: Fri 11/10/2023
  • time: 10:55AM-10:58AM (3 min)
  • chapter: Ch 09 Consistency and Consensus
  • pages: 530-534 (4 pages)

Summary

Concurrency is hard and that is why many systems will impose an order on the events.


Reading Session 34

  • date: Sat 11/11/2023
  • time: 3:40PM-3:45PM (5 min)
  • chapter: Ch 09 Consistency and Consensus
  • pages: 534-543 (9 pages)

Summary

Retroactively adding these reading sessions, so I don’t really have any thoughts or ideas to share about these specific pages.


Reading Session 35

  • date: Sun 11/12/2023
  • time: 5:26AM-5:36AM (10 min)
  • chapter: Ch 09 Consistency and Consensus
  • pages: 543-551 (6 pages)

Summary

Retroactively adding these reading sessions, so I don’t really have any thoughts or ideas to share about these specific pages.


Reading Session 36

  • date: Mon 11/13/2023
  • time: 6:25AM-7:45AM (90 min)
  • chapter: Ch 10 Batch Processing
  • pages: 551-647 (96 pages)

Summary

Retroactively adding these reading sessions, so I don’t really have any thoughts or ideas to share about these specific pages.


Reading Session 37

  • date: Tue 11/14/2023
  • time: 6:30PM-6:35PM (5 min)
  • chapter: Ch 10 Batch Processing
  • pages: 647-652 (5 pages)

Summary

Retroactively adding these reading sessions, so I don’t really have any thoughts or ideas to share about these specific pages.


Reading Session 38

  • date: Wed 11/15/2023
  • time: 7:30PM-7:33PM (3 min)
  • chapter: Ch 10 Batch Processing
  • pages: 652-654 (2 pages)

Summary

Retroactively adding these reading sessions, so I don’t really have any thoughts or ideas to share about these specific pages.


Reading Session 39

  • date: Thu 11/16/2023
  • time: 8:33PM-8:34PM (1 min)
  • chapter: Ch 10 Batch Processing
  • pages: 652-653 (1 page)

Summary

The kindle streak is alive! Almost forgot to read today 😅


Reading Session 40

  • date: Fri 11/17/2023
  • time: 3:30PM-3:48PM (18 min)
  • chapter: Ch 10 Batch Processing
  • pages: 653-682 (29 pages)

Summary

Finished the batch processing chapter. The author correctly predicted the convergence of MPP databases and batch processing frameworks. Batch processing frameworks like MapReduce were a lot more flexible in the sense that MapReduce jobs could call arbitrary code, while MPP databases usually used SQL. Now we see frameworks that were initially built on MapReduce ideas, like Spark (I don’t think it uses MapReduce anymore), with its new Photon engine for running SQL workloads. You also have MPP databases that now offer the ability to query files directly on S3 versus having to store the data in their proprietary storage format.


Reading Session 41

  • date: Sat 11/18/2023
  • time: 12:00PM-2:40PM (160 min)
  • chapter: Ch 11 Stream Processing & Ch 12 The Future of Data Systems
  • pages: 682-850 (172 pages)

Summary

Who knew the append-only log could be so powerful. I like the idea of changing the way we think about building apps from stateless to stateful so we can create a complete end-to-end dataflow without the need for polling.

There is something about event-driven architecture, data streaming, immutable data, functional programming, and append-only logs that really resonates with my way of thinking. It’s an area that I really want to explore more and specialize in.

Recap

Just a mental recap for me of all the different topics and chapters of this book

  • Part 1: Foundations of Data Systems
    • Reliable, Scalable, and Maintainable Applications
    • Data Models and Query Languages
    • Storage and Retrieval
    • Encoding and Evolution
  • Part 2: Distributed Data
    • Replication
    • Partitioning
    • Transactions
    • The Trouble with Distributed Systems
    • Consistency and Consensus
  • Part 3: Derived Data
    • Batch Processing
    • Stream Processing
    • The Future of Data Systems
    • Summary