feat: add support for nested loop join #164

crwen · 2024-03-15T15:32:38Z

What problem does this PR solve?

Support nested loop join, but full join is not supported.

There are still problems in explain output.

Issue link: #156

What is changed and how it works?

cross join:

=> select x.*, y.* from x cross join y;
 id | a | b | id | c | d
----+---+---+----+---+---
  0 | 1 | 2 |  0 | 1 | 5
  0 | 1 | 2 |  1 | 1 | 6
  0 | 1 | 2 |  2 | 2 | 7
  1 | 1 | 3 |  0 | 1 | 5
  1 | 1 | 3 |  1 | 1 | 6
  1 | 1 | 3 |  2 | 2 | 7
(6 rows)

left join:

=> select x.*, y.* from x left join y on a > c or b > d;
 id | a | b | id | c | d
----+---+---+----+---+---
  0 | 1 | 2 |    |   |
  1 | 1 | 3 |    |   |
(2 rows)

right join

=> select x.*, y.* from x right join y on a > c or b > d;
 id | a | b | id | c | d
----+---+---+----+---+---
    |   |   |  0 | 1 | 5
    |   |   |  1 | 1 | 6
    |   |   |  2 | 2 | 7
(3 rows)

subquery:

=> select * from x where x.a >= (select c from y); 
 id | a | b
----+---+---
  0 | 1 | 2
  0 | 1 | 2
  1 | 1 | 3
  1 | 1 | 3
(4 rows)

Code changes

Has Rust code change
Has CI related scripts change

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Note for reviewer

KKould

This is one of the most outstanding PRs I have ever received in this repository. I have given some suggestions and please add some test cases under the file slt/join.slt.

src/execution/volcano/dql/join/nested_loop_join.rs

KKould · 2024-03-15T16:52:31Z

There seems to be some problem with ci, I will try to fix it

KKould · 2024-03-15T17:09:04Z

There seems to be some problem with ci, I will try to fix it

please update the repo and try ci again

KKould · 2024-03-16T06:53:00Z

I noticed that explain still seems to be HashJoin. You can add the physical option of NestLoopJoin in src/optimizer/rule/implementation/dql/join.rs

Tips: This physical option is used for Cost calculation, but currently only supports index selection.

KKould

LGTM

KKould · 2024-03-16T07:25:07Z

src/execution/volcano/mod.rs

@@ -75,7 +77,12 @@ pub fn build_read<T: Transaction>(plan: LogicalPlan, transaction: &T) -> BoxedEx
            let right_input = childrens.pop().unwrap();
            let left_input = childrens.pop().unwrap();

-            HashJoin::from((op, left_input, right_input)).execute(transaction)
+            match &op.on {
+                JoinCondition::On { on, .. } if !on.is_empty() => {


I'm very sorry. I checked it carefully again on my computer and found that there are three slt files that need to open the comment case.

tests\slt\sql_2016\F041_08.slt

tests\slt\crdb\sqlite.slt L33

tests\slt\subquery.slt L37

At the same time, I discovered a new bug: still using HashJoin when or exists in the Join condition leads to incorrect results.
case on sqlite.slt L33

Tips: I think you can add a tag to the JoinOperator to decide whether to use NestLoopJoin or HashJoin in the Bind stage.

KKould · 2024-03-16T08:05:54Z

src/binder/select.rs

@@ -791,7 +791,12 @@ impl<'a, T: Transaction> Binder<'a, T> {
                        )?;
                    }
                    BinaryOperator::Or => {
-                        todo!("`NestLoopJoin` is not supported yet")


The result is correct when Or does not exist

let _ = fnck_sql .run("CREATE TABLE tab64784(pk INTEGER primary key, col0 INTEGER, col1 FLOAT, col2 VARCHAR, col3 INTEGER, col4 FLOAT, col5 VARCHAR)") .await?; let _ = fnck_sql .run("INSERT INTO tab64784 VALUES(0,212,202.62,'nshdy',212,208.79,'wsxfc'),(1,213,203.64,'xwfuo',213,209.26,'lyswz'),(2,214,204.82,'jnued',216,210.48,'qczzf'),(3,215,205.40,'jtijf',217,211.96,'dpugl'),(4,216,206.3,'dpdzk',219,212.43,'xfirg'),(5,218,207.43,'qpwyw',220,213.50,'fmgky'),(6,219,208.3,'uooxb',221,215.30,'xpmdy'),(7,220,209.54,'ndtbb',225,218.8,'ivqyw'),(8,221,210.65,'zjpts',226,219.82,'sezsm'),(9,222,211.57,'slaxq',227,220.91,'bdqyb');") .await?; let (schema, tuples) = fnck_sql.run("explain SELECT pk, col0 FROM tab64784 WHERE col0 IN (SELECT col3 FROM tab64784 WHERE col3 IS NULL OR (col1 < 22.54) OR col4 > 85.74) OR ((col4 IS NULL))").await?; println!("{}", create_table(&schema, &tuples)); let (schema, tuples) = fnck_sql.run("explain SELECT pk, col0 FROM tab64784 WHERE col0 IN (SELECT col3 FROM tab64784 WHERE col3 IS NULL OR (col1 < 22.54) OR col4 > 85.74)").await?; println!("{}", create_table(&schema, &tuples));

+-------------------------------------------------------------------------------------------------------------------------+ | PLAN | +=========================================================================================================================+ | Projection [tab64784.pk, tab64784.col0] [Project] | | LeftSemi Join Where ((tab64784.col0 = (tab64784.col3) as (_temp_table_1_.col3)) || tab64784.col4 is null) [HashJoin] | | Scan tab64784 -> [pk, col0, col3, col4] [SeqScan] | | Projection [tab64784.col3] [Project] | | Filter ((tab64784.col3 is null || (tab64784.col1 < 22.54)) || (tab64784.col4 > 85.74)), Is Having: false [Filter] | | Scan tab64784 -> [col1, col3, col4] [SeqScan] | +-------------------------------------------------------------------------------------------------------------------------+ +-------------------------------------------------------------------------------------------------------------------------+ | PLAN | +=========================================================================================================================+ | Projection [tab64784.pk, tab64784.col0] [Project] | | LeftSemi Join On tab64784.col0 = (tab64784.col3) as (_temp_table_1_.col3) [HashJoin] | | Scan tab64784 -> [pk, col0, col3] [SeqScan] | | Projection [tab64784.col3] [Project] | | Filter ((tab64784.col3 is null || (tab64784.col1 < 22.54)) || (tab64784.col4 > 85.74)), Is Having: false [Filter] | | Scan tab64784 -> [col1, col3, col4] [SeqScan] | +-------------------------------------------------------------------------------------------------------------------------+

crwen · 2024-03-16T11:00:45Z

I just wonder that do all the queries(including subqueries) use the same binder info ? I found something wrong with the filter info

    Binary {
        op: Or,
        left_expr: Binary {
            op: Or,
            left_expr: Binary {
                op: Eq,
                left_expr: Reference {
                    expr: ColumnRef(
                        ColumnCatalog {
                            summary: ColumnSummary {
                                name: "col0",
                                table_name: Some("tab64784"),
                            },
                        },
                    ),
                    pos: 1,
                },
                right_expr: Reference { -- should be in right table
                    expr: Alias {
                        expr: ColumnRef(
                            ColumnCatalog {
                                summary: ColumnSummary {
                                    name: "col3",
                                    table_name: Some("tab64784"),
                                },
                            },
                        ),
                        alias: Expr(
                            ColumnRef(
                                ColumnCatalog {
                                    summary: ColumnSummary {
                                        name: "col3",
                                        table_name: Some("_temp_table_1_"),
                                    },
                                },
                            ),
                        ),
                    },
                    pos: 2, -- here is left table, but should be right table
                },
            },
            right_expr: Binary {
                op: And,
                left_expr: IsNull {
                    negated: false,
                    expr: Reference { 
                        expr: ColumnRef(
                            ColumnCatalog {
                                summary: ColumnSummary {
                                    name: "col4",
                                    table_name: Some("tab64784"),
                                },
                            },
                        ),
                        pos: 3,
                    },
                },
                right_expr: Binary {
                    op: Lt,
                    left_expr: Reference {
                        expr: ColumnRef(
                            ColumnCatalog {
                                summary: ColumnSummary {
                                    name: "col3",
                                    table_name: Some("tab64784"),
                                },
                            },
                        ),
                        pos: 2, -- here is left table
                    },
                    right_expr: Constant(
                        Int32(8),
                    ),
                },
            },
        },
        right_expr: Binary { ... }
  }

KKould · 2024-03-16T11:51:47Z

I just wonder that do all the queries(including subqueries) use the same binder info ? I found something wrong with the filter info

you are right! this is only possible in In Subquery, I analyzed the reasons:
https://github.com/KipData/FnckSQL/blob/main/src/binder/select.rs#L725

match expr.unpack_alias() {
    ScalarExpression::Binary {
        left_expr,
        right_expr,
        op,
        ty,
    } => {
        match op {
            BinaryOperator::Eq => {
                match (left_expr.unpack_alias_ref(), right_expr.unpack_alias_ref()) {
                    // example: foo = bar
                    (ScalarExpression::ColumnRef(l), ScalarExpression::ColumnRef(r)) => {
                        // reorder left and right joins keys to pattern: (left, right)
                        if fn_contains(left_schema, l.summary())
                            && fn_contains(right_schema, r.summary())
                        {
                            accum.push((*left_expr, *right_expr));
                        } else if fn_contains(left_schema, r.summary())
                            && fn_contains(right_schema, l.summary())
                        {
                            accum.push((*right_expr, *left_expr));
                        } else if fn_or_contains(left_schema, right_schema, l.summary())
                            || fn_or_contains(left_schema, right_schema, r.summary())
                        {
                            accum_filter.push(ScalarExpression::Binary {
                                left_expr,
                                right_expr,
                                op,
                                ty,
                            });
                        }
                    }

match (left_expr.unpack_alias_ref(), right_expr.unpack_alias_ref()) {
This will return the field expr reference of ScalarExpression::Alias. When In Subquery, expr belongs to the same table (left table).
I will try to fix this now.

KKould · 2024-03-16T12:47:26Z

I re-project the subquery so that the column output by the subquery can retain the temp_table information and be correctly mapped to left_table and right_table.

You can move the code to your PR and try to see if it solves the problem (I tested it myself and it was fine)
This is the idea for me. If you have a better way, please try it.

pr: #168

KKould

This is a profoundly influential PR, which fixes many problems related to Subquery and Join.

Tips: I added the test file and fixed join explain display.

   Projection [a.id, a.v1, a.v2, b.id, b.v3, b.v4, b.v5] [Project]
     Cross Join Nothing [NestLoopJoin]
       Scan a -> [id, v1, v2] [SeqScan]
       Scan b -> [id, v3, v4, v5] [SeqScan]

crwen added 2 commits March 15, 2024 21:00

add support for nested loop join

495bf50

fix:keep the order of tuple and schema

2f2df06

KKould requested changes Mar 15, 2024

View reviewed changes

src/execution/volcano/dql/join/nested_loop_join.rs Outdated Show resolved Hide resolved

src/execution/volcano/dql/join/nested_loop_join.rs Outdated Show resolved Hide resolved

KKould mentioned this pull request Mar 15, 2024

config: ci rust toolchain -> nightly-2024-01-17 #165

Merged

9 tasks

KKould assigned KKould and crwen and unassigned KKould Mar 15, 2024

KKould added the enhancement New feature or request label Mar 15, 2024

KKould mentioned this pull request Mar 15, 2024

TODO: SQL 2016 #130

Open

48 tasks

fix:right join not matched case

8f21af7

KKould approved these changes Mar 16, 2024

View reviewed changes

KKould requested changes Mar 16, 2024

View reviewed changes

KKould reviewed Mar 16, 2024

View reviewed changes

fix: expression ownership in the temp table of subquery

35c0cab

KKould mentioned this pull request Mar 16, 2024

[NO MERGE]fix: expression ownership in the temp table of subquery #168

Closed

9 tasks

KKould mentioned this pull request Mar 19, 2024

Bug: Only one related transaction should be allowed to exist when making a DDL statement #171

Closed

KKould added 3 commits March 19, 2024 23:36

fix: bind error of the same fields in different tables in in subquery

6a0fcc2

fix: combine_operators.rs deletes the temp table alias of subquery

b35bf08

test: add F041_08.slt/sqlite.slt/subquery.slt

0a20b22

KKould approved these changes Mar 19, 2024

View reviewed changes

KKould merged commit f44ec4f into KipData:main Mar 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add support for nested loop join #164

feat: add support for nested loop join #164

Uh oh!

crwen commented Mar 15, 2024

Uh oh!

KKould left a comment

Uh oh!

Uh oh!

Uh oh!

KKould commented Mar 15, 2024

Uh oh!

KKould commented Mar 15, 2024

Uh oh!

KKould commented Mar 16, 2024 •

edited

Loading

Uh oh!

KKould left a comment

Uh oh!

KKould Mar 16, 2024 •

edited

Loading

Uh oh!

KKould Mar 16, 2024 •

edited

Loading

Uh oh!

crwen commented Mar 16, 2024

Uh oh!

KKould commented Mar 16, 2024

Uh oh!

KKould commented Mar 16, 2024 •

edited

Loading

Uh oh!

KKould left a comment

Uh oh!

Uh oh!

feat: add support for nested loop join #164

feat: add support for nested loop join #164

Uh oh!

Conversation

crwen commented Mar 15, 2024

What problem does this PR solve?

What is changed and how it works?

Code changes

Check List

Note for reviewer

Uh oh!

KKould left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

KKould commented Mar 15, 2024

Uh oh!

KKould commented Mar 15, 2024

Uh oh!

KKould commented Mar 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KKould left a comment

Choose a reason for hiding this comment

Uh oh!

KKould Mar 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KKould Mar 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

crwen commented Mar 16, 2024

Uh oh!

KKould commented Mar 16, 2024

Uh oh!

KKould commented Mar 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KKould left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

KKould commented Mar 16, 2024 •

edited

Loading

KKould Mar 16, 2024 •

edited

Loading

KKould Mar 16, 2024 •

edited

Loading

KKould commented Mar 16, 2024 •

edited

Loading