[#26283] YSQL: Support yb_index_check() execution using multiple snapshots
Summary:
===Problem===
The index consistency checker (`yb_index_check()`) currently employs a BNL LEFT join (to join the index and the base relation) followed by a `count(*)` on the base relation to detect index consistency issues. The details of these operations and checks are described in the summary of D41376 / 10de037.
yb_index_check() is a slow operation. Performance experiments show that `yb_index_check()` on an index with `pg_table_size()` of 3 GB took 600 seconds in a single-region, multi-AZ, 3-node cluster. Any operation whose execution time is longer than `timestamp_history_retention_interval_sec` (default value is 15 minutes, i.e. 900 seconds) is susceptible to the `Snapshot too old` error. Extrapolating linearly from the measurement above and assuming `timestamp_history_retention_interval_sec` is set to the default, yb_index_check() is susceptible to this error on indexes with `pg_table_size()` >= 4.5 GB.
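The 4.5 GB figure can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes that runtime scales linearly with `pg_table_size()`, which the single measurement above does not prove, and the function name is made up for illustration.

```python
# Figures taken from the measurement quoted above.
MEASURED_SIZE_GB = 3
MEASURED_SECONDS = 600


def max_safe_size_gb(retention_sec):
    """Largest index size whose check finishes within retention_sec.

    Assumption: runtime grows linearly with pg_table_size().
    """
    return MEASURED_SIZE_GB * retention_sec / MEASURED_SECONDS


# Default timestamp_history_retention_interval_sec is 15 minutes.
print(max_safe_size_gb(15 * 60))  # 4.5
```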
===Solution===
To resolve this shortcoming, this revision adds support for 'multi-snapshot' execution mode for yb_index_check(). In this mode, the left subplan of the JOIN is divided into small batches. A new (latest) snapshot is picked for processing each batch. The batch size is such that its processing is guaranteed to complete within `timestamp_history_retention_interval_sec` (details below). Hence, in this execution mode, yb_index_check() is guaranteed not to run into the `Snapshot too old` error.
===Implementation details===
===== Batching =====
**Terminology**
There are two kinds of batches involved:
## BNL batch: batch inside the BNL join
## Checker batch: batch of rows of the left subplan of the JOIN that will be processed using the same snapshot during yb_index_check().
**Checker batch size**
To ensure all the rows are processed at least once, the checker batch size should be a multiple of the BNL batch size. This is because the BNL output is ordered by left relation ybctid across the batches, but not within a batch. For instance, consider the following scenario:
BNL batch size = 3
| BNL batch | output | max ybctid encountered
| 1 | ybctid1, ... | ybctid1
| 1 | ybctid3, ... | ybctid3
| 1 | ybctid2, ... | ybctid3
| 2 | ybctid6, ... | ybctid6
| 2 | ybctid4, ... | ybctid6
| 2 | ybctid5, ... | ybctid6
Note that the ybctids within a BNL batch are not ordered, but across BNL batches they are ordered. Now suppose the checker batch size = 4. The checker's first batch will finish at the 4th row, with max ybctid encountered == ybctid6. The next checker batch then starts after ybctid6, so the rows corresponding to ybctid4 and ybctid5 will never be processed.
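The scenario can be replayed with a small simulation. It is a sketch under stated assumptions, not the checker's actual code: ybctids are modelled as positive integers, each BNL batch emits its rows in reverse order as an arbitrary stand-in for "unordered within a batch", and each new checker batch resumes strictly after the maximum ybctid seen so far. Both function names are hypothetical.

```python
def bnl_output(ybctids, lower_bound, bnl_batch_size):
    """Rows emitted by the join for ybctids above lower_bound.

    Ordered across BNL batches, but (by assumption) reversed within one.
    """
    remaining = sorted(y for y in ybctids if y > lower_bound)
    rows = []
    for i in range(0, len(remaining), bnl_batch_size):
        rows.extend(reversed(remaining[i:i + bnl_batch_size]))
    return rows


def unprocessed_rows(ybctids, bnl_batch_size, checker_batch_size):
    """Return the ybctids that no checker batch ever processes."""
    processed = set()
    lower_bound = 0  # assumes all ybctids are positive integers
    while True:
        rows = bnl_output(ybctids, lower_bound, bnl_batch_size)
        if not rows:
            break
        checker_batch = rows[:checker_batch_size]
        processed.update(checker_batch)
        # The next checker batch starts after the largest ybctid seen.
        lower_bound = max(checker_batch)
    return sorted(set(ybctids) - processed)


ybctids = [1, 2, 3, 4, 5, 6]
print(unprocessed_rows(ybctids, 3, 4))  # [4, 5]
print(unprocessed_rows(ybctids, 3, 3))  # []
```

With a checker batch size of 4, the first batch ends on ybctid 6 and rows 4 and 5 are skipped. With a multiple of the BNL batch size (3 or 6), every row is covered.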
Within a checker batch, we keep processing rows in multiples of yb_bnl_batch_size as long as the elapsed time < 70% of timestamp_history_retention_interval_sec. This 70% threshold is a heuristic. The idea is to keep it close to 100% so that as many rows as possible are processed within a single batch, avoiding the overhead of creating too many batches. At the same time, keeping it too close to 100% risks running into the Snapshot too old error: if the elapsed time is marginally less than the threshold, the next set of rows is processed in the same batch, and that can push the elapsed time beyond timestamp_history_retention_interval_sec.
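The trade-off behind the threshold can be illustrated with a toy model. The sketch below assumes the elapsed time is checked after each BNL batch and that the checker batch is closed once it reaches the threshold. The per-batch durations are invented inputs, and `plan_checker_batches` is a hypothetical name, not a function from this revision.

```python
def plan_checker_batches(bnl_batch_seconds, retention_sec=900, threshold=0.7):
    """Group BNL batches into checker batches using the elapsed-time rule.

    Returns a list of (bnl_batch_indexes, elapsed_seconds) tuples.
    """
    limit = threshold * retention_sec
    plan = []
    current, elapsed = [], 0
    for index, seconds in enumerate(bnl_batch_seconds):
        current.append(index)
        elapsed += seconds
        if elapsed >= limit:
            # Threshold reached: close this checker batch; the next one
            # starts with a fresh snapshot and a fresh clock.
            plan.append((current, elapsed))
            current, elapsed = [], 0
    if current:
        plan.append((current, elapsed))
    return plan


# 70% threshold: every checker batch stays under the 900 s retention.
print(plan_checker_batches([350, 350, 350]))
# 99% threshold: 700 s is still below the limit, so a third BNL batch
# is added and the checker batch runs for 1050 s, past the retention.
print(plan_checker_batches([350, 350, 350], threshold=0.99))
```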
**Batch processing**
While processing a checker batch, we keep track of the maximum processed ybctid (of the left relation) and pass it as a lower bound when initializing the next checker batch.
**Controlling execution mode**
yb_index_check() now takes an optional bool argument `single_snapshot_mode` to control the execution mode. As the name suggests, if it is true, the execution mode is 'single snapshot' (all the rows are processed using a single snapshot); if it is false, the execution mode is 'multi-snapshot'. The default value of this argument is false, meaning yb_index_check() by default executes in multi-snapshot mode.
===== Operations =====
In the multi-snapshot mode, the count(*) on the base relation is replaced by another LEFT join between the base rel and the index rel. Both these operations serve the same purpose - to detect missing rows from the index. The details of the JOIN are as follows:
Left relation: base relation
Right relation: index relation
Join condition: baserel.computed_index_row_ybctid = indexrel.t_ybindexrowybctid.
Join type: Batched Nested Loop join
Check condition: indexrel.t_ybindexrowybctid is not null (a null value would indicate missing index rows)
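The logic of this check can be sketched outside the database as a LEFT join over in-memory rows. This only models the semantics (an unmatched base row yields a NULL on the index side). It does not reflect how the batched nested loop join is executed, and the function name and row shapes are assumptions made for illustration.

```python
def find_missing_index_rows(base_rows, index_row_ybctids):
    """Return base ybctids whose expected index row is absent.

    base_rows: list of (base_ybctid, computed_index_row_ybctid) pairs.
    index_row_ybctids: ybctids of the rows present in the index.
    """
    index_lookup = {ybctid: ybctid for ybctid in index_row_ybctids}
    missing = []
    for base_ybctid, computed_index_row_ybctid in base_rows:
        # LEFT join: an unmatched base row gets None, standing in for NULL.
        matched = index_lookup.get(computed_index_row_ybctid)
        if matched is None:
            missing.append(base_ybctid)
    return missing


base_rows = [("b1", "i1"), ("b2", "i2"), ("b3", "i3")]
print(find_missing_index_rows(base_rows, ["i1", "i3"]))  # ['b2']
```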
Jira: DB-15629
Test Plan:
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressYbIndexCheck'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressYbIndexCheckSingleSnapshot'
./yb_build.sh --gtest_filter PgYbIndexCheckTest.YbIndexCheckRepeatableRead
./yb_build.sh --gtest_filter PgYbIndexCheckTest.YbIndexCheckSnapshotTooOld
Reviewers: amartsinchyk, kramanathan
Reviewed By: amartsinchyk
Subscribers: smishra, jason, svc_phabricator, yql
Differential Revision: https://phorge.dev.yugabyte.com/D42311