Apache HBase and Apache Phoenix, more on block encoding and compression
By Lars Hofhansl

2 1/2 years ago I blogged about HBase Compression vs Block Encoding. Things have moved on since then. The Hadoop and HBase communities added new compression schemes like zSTD, and new block encoders like ROW_INDEX_V1. zSTD promises to be fast and yield great compression ratios. The ROW_INDEX_V1 encoder allows fast seeks into an HFile block by storing the offsets of all Cells in that block, so a Cell can be found by binary search instead of the usual linear search.

So let's do some tests. Like in the previous blog I'm setting up a single-node HBase, on a single-node HDFS, with a single-node ZK. (But note that these are different machines, so do not compare these numbers to the ones in that two-and-a-half-year-old post.)

CREATE TABLE (pk INTEGER PRIMARY KEY, v1 FLOAT, v2 FLOAT, v3 INTEGER) DISABLE_WAL=true, SALT_BUCKETS=8

Then I loaded 2^22 = 4194304 rows. Columns v1 and v2 are random values from [0,1).

Let's look at the size of the data. Remember this is just a single-machine test, so treat it as a qualitative result; we verified elsewhere that this scales when adding machines.

88.5MB   ROW_INDEX_V1 + zSTD
112.1MB  ROW_INDEX_V1 + GZ
184.0MB  ROW_INDEX_V1 + SNAPPY
200.0MB  FAST_DIFF
556.9MB  NONE
572.9MB  ROW_INDEX_V1

So far, nothing surprising: ROW_INDEX_V1 adds a bit more data (~3% for these small Cells), zSTD offers the best compression, followed by GZ and SNAPPY, and uncompressed, unencoded HBase bloats the data a lot.
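To make this concrete, a variant like "ROW_INDEX_V1 + zSTD" can be declared roughly as follows. This is a sketch rather than the exact DDL behind the measurements: DATA_BLOCK_ENCODING and COMPRESSION are standard HBase column-family properties that Phoenix passes through to HBase, the table name test_table is only illustrative, and ZSTD is only available when the underlying HBase/Hadoop build ships the zstd codec.

-- Sketch only: the table name is illustrative; DATA_BLOCK_ENCODING and
-- COMPRESSION are HBase column-family properties passed through by Phoenix.
-- ZSTD requires an HBase/Hadoop build with the zstd codec available.
CREATE TABLE test_table (
    pk INTEGER PRIMARY KEY,
    v1 FLOAT,
    v2 FLOAT,
    v3 INTEGER)
    DISABLE_WAL=true,
    SALT_BUCKETS=8,
    DATA_BLOCK_ENCODING='ROW_INDEX_V1',
    COMPRESSION='ZSTD';

-- The 2^22 rows were then loaded with random v1/v2 values,
-- e.g. batched upserts from a client along these lines:
UPSERT INTO test_table VALUES (1, 0.7231, 0.1842, 1);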
Full scans:

Let's first do some scanning: first from the OS buffer cache (1), then from the HBase block cache (2).

(1) SELECT /*+ NO_CACHE SERIAL */ COUNT(*) FROM
(2) SELECT /*+ SERIAL */ COUNT(*) FROM

I'm using the SERIAL hint in Phoenix in order to get consistent results independently of how Phoenix decides to parallelize the query (which is based on region sizes as well as the current stats).

zSTD's decompression rate is pretty good, closer to Snappy than to Gzip. ROW_INDEX_V1, as expected, does not help with full scans, but it also does not seem to hurt (the variance was within the noise). FAST_DIFF has the worst scan times, whether the data is in the block cache or not.

Where do we seek a lot with Phoenix?

ROW_INDEX_V1 helps with random seeks, but where do we actually do those in Phoenix? Some areas are obvious, some might surprise you:
- SELECTs on random PKs. Well, duh...
- Reverse scans
- Scans through tables with lots of deleted rows (before they are compacted away)
- Row lookups from uncovered indexes

Point Gets:

So let's start with random gets, just point queries. To avoid repeated metadata lookups I re-configured the table with UPDATE_CACHE_FREQUENCY = NEVER, and then issued the following 1000 times:

SELECT pk FROM WHERE pk =

Now we see how ROW_INDEX_V1 helps: GETs are improved by 24%, and remember that this is end-to-end (query planning/compilation, network overhead, and finally the HBase scan). We also see the relative decompression cost the schemes are adding (in the OS buffer cache case). Yet zSTD decompression seems to be on par with FAST_DIFF and almost as fast as SNAPPY, while providing vastly better compression ratios. Also remember that, by default, blocks are stored uncompressed in the block cache (so all the ROW_INDEX_V1 variants perform the same when served from the block cache).

Reverse Scans:

HBase offers reverse scans, and Phoenix will automatically make use of them in queries like this:

SELECT * FROM ORDER BY pk DESC LIMIT 100;

(Everything is in the block cache here.)

Reverse scanning involves a lot of seeking. For each Cell (i.e. each column in Phoenix) we need to seek to the previous row, then skip forward again for the columns of that row. We see that ROW_INDEX_V1 is 2.5x faster than no encoding and over 6x faster than FAST_DIFF.

99.9% of cells deleted:

When HBase deletes data, it is not actually removed immediately, but rather marked for later removal by placing tombstones. For this scenario I deleted 99.9% of the rows:

DELETE FROM WHERE v1 < 0.999

Then issued:

SELECT /*+ SERIAL */ COUNT(*) FROM

We see that ROW_INDEX_V1 helps with seeking past the delete markers. About a 24% improvement.

Now what about reverse scanning with lots of delete markers?

SELECT * FROM ORDER BY pk DESC LIMIT 100

WOW... Reverse scanning with deleted Cells is somewhat of a worst case for HBase: we need to seek backwards Cell by Cell until we find the next non-deleted one. ROW_INDEX_V1 makes a huge difference here. In fact, without it, just scanning 100 rows in reverse is almost four times slower than scanning through all 4M rows plus almost 4M delete markers. If you have lots of deletes in your data set, for example if it is a churning set, you might want to switch the encoding to ROW_INDEX_V1 (see the sketch at the end of this post).

Local Secondary (uncovered) Indexes:

Still 2^22 rows... Now we have two decisions to make: (1) how to encode/compress the main column family, and (2) how to encode/compress the local index column family. In the interest of brevity I limited this to three cases:

308.1MB FAST_DIFF, F
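For reference, creating and querying such a local, uncovered index looks roughly like the sketch below, again using the illustrative test_table from the earlier sketch: only v1 is indexed, so v2 and v3 have to be fetched from the main column family for every match.

-- Sketch: a local index on v1 only; v2/v3 are not covered by it.
CREATE LOCAL INDEX test_v1_idx ON test_table (v1);

-- A query that can use the index but still has to look up v2/v3 in the
-- main column family for every matching row (the uncovered-lookup case).
SELECT v2, v3 FROM test_table WHERE v1 < 0.001;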
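And finally, the sketch promised above for switching an existing table over to ROW_INDEX_V1. Recent Phoenix versions should let you pass HBase column-family properties through ALTER TABLE ... SET; if yours does not, the same change can be made directly in the HBase shell. Either way, HFiles already on disk only pick up the new encoding once a major compaction rewrites them.

-- Sketch: switch the (illustrative) test_table to the ROW_INDEX_V1 encoder.
-- Assumes Phoenix passes this column-family property through to HBase.
ALTER TABLE test_table SET DATA_BLOCK_ENCODING='ROW_INDEX_V1';
-- Existing data keeps its old encoding until a major compaction
-- rewrites the HFiles.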