/images/head.jpg

FoundationDB: A Distributed Unbundled Transactional Key Value Store

Key and value sizes are limited to 10 KB and 100 KB respectively for better performance. Transaction size is limited to 10 MB

Transaction processing

Optimistic Concurrency Control + MVCC

A client transaction starts by contacting one of the Proxies to obtain a read version (i.e., a timestamp). The Proxy then asks the Sequencer for a read version that is guaranteed to be no less than any previously issued transaction commit version, and this read version is sent back to the client. Then the client may issue multiple reads to StorageServers and obtain values at that specific read version. Client writes are buffered locally without contacting the cluster. At commit time, the client sends the transaction data, including the read and write sets (i.e., key ranges), to one of the Proxies and waits for a commit or abort response from the Proxy. If the transaction cannot commit, the client may choose to restart the transaction from the beginning again.

HDFS Namenode RPC Request Execution Time Breakdown

1
2
3
 -----                                                enqueueTime  queueingTime      Processing/ResponseTime/HandlerTime 
|50090| <- Listener --> pendingConnections <- Reader1 -----------> CallQueue <----- handler (processes and sends response)
 -----              \-> pendingConnections <- Reader2

A main listener thread is accepting new connections from clients and put connections into pendingConnections queue of a Reader thread. A Reader thread detects any ready connection, reads the request and puts the call into CallQueue. This put() operation is blocking and is accounted as enqueueTime. The time a call stays in CallQueue is queueingTime. When a handler is available, it will pick a call from CallQueue and process the call, from which we can derive the processing/response/handlerTime.

Java ExecutorService

Recently, I was working on HDFS-17030 and used ExecutorService for multi-threading execution. We hit a non-intuitive `bug` of how JVM/garbage collector works. The issue is as following.

We created an executorService with 1 core thread in a class. We did not set allowCoreThreadTimeOut to true (default is false). So, that core thread will be kept running, even when the main thread exits! Then, the JVM process won’t exit, because there is still one thread running! Even when we shut down the executorService in .close() method of that class, the same issue persists. The issue is .close() method of a Closable object is called by Java garbage collector only when it is reclaiming that object. This does not happen immediately when there is no reference to an object. After an object becomes orphan (no more reference to it), it will be added to a queue. The garbage collector will look into the queue and determine which objects to reclaim based on its own algorithm. It is undeterministic/out of control from programmers on when the garbage collector will reclaim an object.

Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Read/write performance for cloud object store.

Each read operation usually incurs at least 5–10 ms of base latency, and can then read data at roughly 50–100 MB/s, so an operation needs to read at least several hundred kilobytes to achieve at least half the peak throughput for sequential reads, and multiple megabytes to approach the peak throughput.

The VM types most frequently used for analytics on AWS have at least 10 Gbps network bandwidth, so they need to run 8–10 reads in parallel to fully utilize this bandwidth.

The CacheLib Caching Engine: Design and Experiences at Scale

CacheLib is a general-purpose caching engine, designed based on experiences with a range of caching use cases at Facebook, that facilitates the easy development and maintenance of caches.

CacheLib was first deployed at Facebook in 2017 and today powers over 70 services including CDN, storage, and application-data caches.

All of these systems process millions of queries per second, cache working sets large enough to require using both flash and DRAM for caching, and must tolerate frequent restarts due to application updates, which are common in the Facebook production environment.

Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

Hundreds of thousands of customers rely on DynamoDB for its fundamental properties

In 2021, during the 66-hour Amazon Prime Day shopping event, Amazon systems - including Alexa, the Amazon.com sites, and Amazon fulfillment centers, made trillions of API calls to DynamoDB, peaking at 89.2 million requests per second, while experiencing high availability with single-digit millisecond performance.

For DynamoDB customers, consistent performance at any scale is often more important than median request service times because unexpectedly high latency requests can amplify through higher layers of applications that depend on DynamoDB and lead to a bad customer experience.