/images/head.jpg

Book Review: High Output Management

I finished reading Andrew Grove’s book on ``High Output Management’’ and really liked it, especially when reading the second half. It explains a few concepts, such as how a person’s needs may change and what is controlling a person’s behavior. It also contains quite a few practical advices on how to handle issues that could happen in a workplace. I highly recommend everyone to take a look at this book.

Key Value Workloads

YCSB

Workload Description Example Operation breakdown
A Update heavy workload session store recording recent actions read: 50%, update: 50%
B read mostly workload photo tagging; add a tag is an update, but most operations are to read tags read: 95%, update: 5%
C read only user profile cache, where profiles are constructed elsewhere )e.g., Hadoop) read: 100%
D read latest workload user status updates; people want to read the latest read: 95%, insert 5%
E short ranges threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id) scan: 95% (maxscanlength=100), insert: 5
F read-modify-write workload user database, where user records are read and modified by the user or to record user activity. read: 50%, readmodifywrite: 50%

RocksDB Workloads in Facebook paper link

UDB

  • Social graph data stored in MyRocks
  • Get: 75%, Put: 20%
  • 75+% of KV-pairs are Put only once.
  • Most of the start-keys of Iterators are used only once. The scan length of more than 60% of the Iterators is only 1 across all CFs.

ZippyDB

  • Distributed KV store, mapping some metadata of an object to the object address in an object storage system
  • 78% Get, 13% Put, 6% Delete, and 3% Iterator
  • For about 80% of the KV-pairs, Get requests only occur once. A very small portion of KV-pairs have very large read counts over the 24-hour period. 1% of the KV-pairs show more than 100 Get requests, and the Gets to these KV-pairs are about 50% of the total Gets that show strong localities. About 73% of the KV-pairs are Put only once, and fewer than 0.001% of the KV-pairs are Put more than 10 times. Put does not have as clear a locality as Get does.

UP2X

  • statistic counters of user activities for AI/ML prediction and interference
  • merge (read-modify-write): 92.53%, 7.46% Get
  • Merge and Get have wide distributions of access counts. Most KV pairs are Put only once.

UDB Request Distribution

../UDB-requests.png

Key Value Systems

  • LSM-tree survey
  • WiscKey: key/value separation
  • HashKV: store cold keys separately.
  • LSM-trie
  • PebblesDB
  • KVell
  • Skip-Tree: put kv items directly to non-adjacent layers (skipping some intermediate layers) to reduce write amplification from level-to-level data compaction.
  • TRIAD: separate hot keys from cold keys. hot keys are stored in memtable for as long as possible. Use multiple memtables and flush them at once to disk.
  • Pipelined Compaction Procedure: A compaction contains three phases: read phase, merge-sort phase, and write phase. Pipeline these three phases, to overlap CPU and disk IO.
  • cLSM: LSM-tree for multi-core processors. It exploits multiprocessor-friendly data structures and non-blocking synchronization. cLSM supports a rich API, including consistent snapshot scans and general non-blocking read-modify-write operations.
  • Towards Accurate and Fast Evaluation of Multi-Stage Log-Structured Designs
  • Dostoevsky: evaluated against well-tuned LSM-trees. Extends the design space of LSM-trees with a new merge policy by combining leveling and tiering. very useful for certain workloads that require efficient writes, point lookups, and long range queries with less emphasis on short range queries.
  • Mutant: decide which type of cloud disks to place SStables, based on access pattern.
  • diff-index: use LSM-tree as secondary indexes.

Idea Repository

List of ideas to explore.

  • Interaction between a filesystem and a computational SSD that does transparent compression.

    • Problem: How to manage SSD space in this case? What API should be use?
    • Can use SSDSim to explore the design space.
  • Key prefix compression in k/v store

    • Problem: in levelDB, it uses fixed-size key prefix compression. Can we do better than this?
    • In Rockset/Foundationdb, secondary indexes are stored as <key, null> key/value pairs. This creates an opportunity that significant storage saving can be achieved if we can do a better key compression.
  • Zoned namespace SSDs

Use Prefix Operators (++i) vs. Postfix Operators (i++)

The ``C++ Primer’’ book has a very interesting recommendation to use postfix operators, only when necessary (on page 148 for the 5th version).

The authors shared the following in their book.

The prefix version avoids unnecessary work. It increments the value and returns the incremented version. The postfix operator must store the original value so that it can return the unincremented value as its result."

For ints and pointers, the compiler can optimize away this extra work.

Crash Recovery

The following content is selected from the Database Management Systems book, chapter 18.

ARIES

Three main principles lie behind the ARIES recovery algorithm:

  1. Write-Ahead Logging: Any change to a database object is first recorded in the log; the record in the log must be written to stable storage before the change to the database object is written to disk.
  2. Repeating History During Redo: On restart following a crash, ARIES retraces all actions of the DBMS before the crash and brings the system back to the exact state that it was in at the time of the crash. Then, it undoes the actions of transactions still active at the time of the crash. (effectively aborting them).
  3. Logging Changes During Undo: Changes mada to the database while undoing a transaction are logged to ensure such an action is not repeated in the event of repeated (failures causing) restarts.

Every log record is given a unique id called the log sequence number (LSN). For recovery purposes, every page in the database contains the LSN of the most recent log record that describes a change to this page. This LSN is called the pageLSN. Every log record has certain fields: prevLSN, transID, and type. The set of all log records for a given transaction is maintained as a linked list going back in time, using the prevLSN field; this list must be updated whenever a log record is added. The transID field is the id of the transaction generating the log record, and the type field obviously indicates the type of the log record.