Apache HBase and Apache Phoenix, more on block encoding and compression
By Lars Hofhansl

2 1/2 years ago I blogged about HBase Compression vs Block Encoding. Things have moved on since then. The Hadoop and HBase communities added new compression schemes like zSTD, and new block encoders like ROW_INDEX_V1. zSTD promises to be fast and yield great compression ratios. The ROW_INDEX_V1 encoder allows fast seeks into an HFile block by storing the offsets of all Cells in that block, so a Cell can be found by binary search instead of the usual linear search.

So let's do some tests. Like in the previous blog I'm setting up a single-node HBase, on a single-node HDFS, with a single-node ZK. (But note that these are different machines, so do not compare these numbers to the ones in that two-and-a-half-year-old post.)

CREATE TABLE (pk INTEGER PRIMARY KEY, v1 FLOAT, v2 FLOAT, v3 INTEGER) DISABLE_WAL=true, SALT_BUCKETS=8

Then I loaded 2^22 = 4194304 rows. Columns v1 and v2 are random values from [0,1).

Let's look at the size of the data. Remember this is just a single-machine test, so treat it as a qualitative result; we verified elsewhere that this scales when adding machines.

88.5MB   ROW_INDEX_V1 + zSTD
112.1MB  ROW_INDEX_V1 + GZ
184.0MB  ROW_INDEX_V1 + SNAPPY
200.0MB  FAST_DIFF
556.9MB  NONE
572.9MB  ROW_INDEX_V1

So far, nothing surprising: ROW_INDEX_V1 adds a bit more data (~3% for these small Cells), zSTD offers the best compression, followed by GZ and SNAPPY, and uncompressed, unencoded HBase bloats the data a lot.
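To make this concrete, a variant like "ROW_INDEX_V1 + zSTD" can be declared roughly as follows. This is a sketch rather than the exact DDL behind the measurements: DATA_BLOCK_ENCODING and COMPRESSION are standard HBase column-family properties that Phoenix passes through to HBase, the table name test_table is only illustrative, and ZSTD is only available when the underlying HBase/Hadoop build ships the zstd codec.

-- Sketch only: the table name is illustrative; DATA_BLOCK_ENCODING and
-- COMPRESSION are HBase column-family properties passed through by Phoenix.
-- ZSTD requires an HBase/Hadoop build with the zstd codec available.
CREATE TABLE test_table (
    pk INTEGER PRIMARY KEY,
    v1 FLOAT,
    v2 FLOAT,
    v3 INTEGER)
    DISABLE_WAL=true,
    SALT_BUCKETS=8,
    DATA_BLOCK_ENCODING='ROW_INDEX_V1',
    COMPRESSION='ZSTD';

-- The 2^22 rows were then loaded with random v1/v2 values,
-- e.g. batched upserts from a client along these lines:
UPSERT INTO test_table VALUES (1, 0.7231, 0.1842, 1);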
Full scans:

Let's first do some scanning: first from the OS buffer cache (1), then from the HBase block cache (2).

(1) SELECT /*+ NO_CACHE SERIAL */ COUNT(*) FROM
(2) SELECT /*+ SERIAL */ COUNT(*) FROM

I'm using the SERIAL hint in Phoenix in order to get consistent results independently of how Phoenix decides to parallelize the query (which is based on region sizes as well as the current stats).

zSTD's decompression rate is pretty good, closer to Snappy than to Gzip. ROW_INDEX_V1, as expected, does not help with full scans, but it also does not seem to hurt (the variance was within the noise). FAST_DIFF has the worst scan times, whether the data is in the block cache or not.

Where do we seek a lot with Phoenix?

ROW_INDEX_V1 helps with random seeks, but where do we actually do those in Phoenix? Some areas are obvious, some might surprise you:
- SELECTs on random PKs. Well, duh...
- Reverse scans
- Scans through tables with lots of deleted rows (before they are compacted away)
- Row lookups from uncovered indexes

Point Gets:

So let's start with random gets, just point queries. To avoid repeated metadata lookups I re-configured the table with UPDATE_CACHE_FREQUENCY = NEVER, and then issued the following 1000 times:

SELECT pk FROM WHERE pk =

Now we see how ROW_INDEX_V1 helps: GETs are improved by 24%, and remember that this is end-to-end (query planning/compilation, network overhead, and finally the HBase scan). We also see the relative decompression cost the schemes are adding (in the OS buffer cache case). Yet zSTD decompression seems to be on par with FAST_DIFF and almost as fast as SNAPPY, while providing vastly better compression ratios. Also remember that, by default, blocks are stored uncompressed in the block cache (so all the ROW_INDEX_V1 variants perform the same when served from the block cache).

Reverse Scans:

HBase offers reverse scans, and Phoenix will automatically make use of them in queries like this:

SELECT * FROM ORDER BY pk DESC LIMIT 100;

(Everything is in the block cache here.)

Reverse scanning involves a lot of seeking. For each Cell (i.e. each column in Phoenix) we need to seek to the previous row, then skip forward again for the columns of that row. We see that ROW_INDEX_V1 is 2.5x faster than no encoding and over 6x faster than FAST_DIFF.

99.9% of cells deleted:

When HBase deletes data, it is not actually removed immediately, but rather marked for later removal by placing tombstones. For this scenario I deleted 99.9% of the rows:

DELETE FROM WHERE v1 < 0.999

Then issued:

SELECT /*+ SERIAL */ COUNT(*) FROM

We see that ROW_INDEX_V1 helps with seeking past the delete markers. About a 24% improvement.

Now what about reverse scanning with lots of delete markers?

SELECT * FROM ORDER BY pk DESC LIMIT 100

WOW... Reverse scanning with deleted Cells is somewhat of a worst case for HBase: we need to seek backwards Cell by Cell until we find the next non-deleted one. ROW_INDEX_V1 makes a huge difference here. In fact, without it, just scanning 100 rows in reverse is almost four times slower than scanning through all 4M rows plus almost 4M delete markers. If you have lots of deletes in your data set, for example if it is a churning set, you might want to switch the encoding to ROW_INDEX_V1 (see the sketch at the end of this post).

Local Secondary (uncovered) Indexes:

Still 2^22 rows... Now we have two decisions to make: (1) how to encode/compress the main column family, and (2) how to encode/compress the local index column family. In the interest of brevity I limited this to three cases:

308.1MB FAST_DIFF, F
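For reference, creating and querying such a local, uncovered index looks roughly like the sketch below, again using the illustrative test_table from the earlier sketch: only v1 is indexed, so v2 and v3 have to be fetched from the main column family for every match.

-- Sketch: a local index on v1 only; v2/v3 are not covered by it.
CREATE LOCAL INDEX test_v1_idx ON test_table (v1);

-- A query that can use the index but still has to look up v2/v3 in the
-- main column family for every matching row (the uncovered-lookup case).
SELECT v2, v3 FROM test_table WHERE v1 < 0.001;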
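And finally, the sketch promised above for switching an existing table over to ROW_INDEX_V1. Recent Phoenix versions should let you pass HBase column-family properties through ALTER TABLE ... SET; if yours does not, the same change can be made directly in the HBase shell. Either way, HFiles already on disk only pick up the new encoding once a major compaction rewrites them.

-- Sketch: switch the (illustrative) test_table to the ROW_INDEX_V1 encoder.
-- Assumes Phoenix passes this column-family property through to HBase.
ALTER TABLE test_table SET DATA_BLOCK_ENCODING='ROW_INDEX_V1';
-- Existing data keeps its old encoding until a major compaction
-- rewrites the HFiles.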