Skip to content

Conversation

crwen
Copy link
Member

@crwen crwen commented Mar 15, 2024

What problem does this PR solve?

Support nested loop join, but full join is not supported.

There are still problems in explain output.

Issue link: #156

What is changed and how it works?

cross join:

=> select x.*, y.* from x cross join y;
 id | a | b | id | c | d
----+---+---+----+---+---
  0 | 1 | 2 |  0 | 1 | 5
  0 | 1 | 2 |  1 | 1 | 6
  0 | 1 | 2 |  2 | 2 | 7
  1 | 1 | 3 |  0 | 1 | 5
  1 | 1 | 3 |  1 | 1 | 6
  1 | 1 | 3 |  2 | 2 | 7
(6 rows)

left join:

=> select x.*, y.* from x left join y on a > c or b > d;
 id | a | b | id | c | d
----+---+---+----+---+---
  0 | 1 | 2 |    |   |
  1 | 1 | 3 |    |   |
(2 rows)

right join

=> select x.*, y.* from x right join y on a > c or b > d;
 id | a | b | id | c | d
----+---+---+----+---+---
    |   |   |  0 | 1 | 5
    |   |   |  1 | 1 | 6
    |   |   |  2 | 2 | 7
(3 rows)

subquery:

=> select * from x where x.a >= (select c from y); 
 id | a | b
----+---+---
  0 | 1 | 2
  0 | 1 | 2
  1 | 1 | 3
  1 | 1 | 3
(4 rows)

Code changes

  • Has Rust code change
  • Has CI related scripts change

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Note for reviewer

Copy link
Member

@KKould KKould left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the most outstanding PRs I have ever received in this repository. I have given some suggestions and please add some test cases under the file slt/join.slt.

@KKould
Copy link
Member

KKould commented Mar 15, 2024

There seems to be some problem with ci, I will try to fix it

@KKould
Copy link
Member

KKould commented Mar 15, 2024

There seems to be some problem with ci, I will try to fix it

please update the repo and try ci again

@KKould KKould assigned KKould and crwen and unassigned KKould Mar 15, 2024
@KKould KKould added the enhancement New feature or request label Mar 15, 2024
@KKould KKould mentioned this pull request Mar 15, 2024
48 tasks
@KKould
Copy link
Member

KKould commented Mar 16, 2024

I noticed that explain still seems to be HashJoin. You can add the physical option of NestLoopJoin in src/optimizer/rule/implementation/dql/join.rs

Tips: This physical option is used for Cost calculation, but currently only supports index selection.

Copy link
Member

@KKould KKould left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -75,7 +77,12 @@ pub fn build_read<T: Transaction>(plan: LogicalPlan, transaction: &T) -> BoxedEx
let right_input = childrens.pop().unwrap();
let left_input = childrens.pop().unwrap();

HashJoin::from((op, left_input, right_input)).execute(transaction)
match &op.on {
JoinCondition::On { on, .. } if !on.is_empty() => {
Copy link
Member

@KKould KKould Mar 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very sorry. I checked it carefully again on my computer and found that there are three slt files that need to open the comment case.

  • tests\slt\sql_2016\F041_08.slt
  • tests\slt\crdb\sqlite.slt L33
  • tests\slt\subquery.slt L37

At the same time, I discovered a new bug: still using HashJoin when or exists in the Join condition leads to incorrect results.
case on sqlite.slt L33

Tips: I think you can add a tag to the JoinOperator to decide whether to use NestLoopJoin or HashJoin in the Bind stage.

@@ -791,7 +791,12 @@ impl<'a, T: Transaction> Binder<'a, T> {
)?;
}
BinaryOperator::Or => {
todo!("`NestLoopJoin` is not supported yet")
Copy link
Member

@KKould KKould Mar 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result is correct when Or does not exist

let _ = fnck_sql
    .run("CREATE TABLE tab64784(pk INTEGER primary key, col0 INTEGER, col1 FLOAT, col2 VARCHAR, col3 INTEGER, col4 FLOAT, col5 VARCHAR)")
    .await?;
let _ = fnck_sql
    .run("INSERT INTO tab64784 VALUES(0,212,202.62,'nshdy',212,208.79,'wsxfc'),(1,213,203.64,'xwfuo',213,209.26,'lyswz'),(2,214,204.82,'jnued',216,210.48,'qczzf'),(3,215,205.40,'jtijf',217,211.96,'dpugl'),(4,216,206.3,'dpdzk',219,212.43,'xfirg'),(5,218,207.43,'qpwyw',220,213.50,'fmgky'),(6,219,208.3,'uooxb',221,215.30,'xpmdy'),(7,220,209.54,'ndtbb',225,218.8,'ivqyw'),(8,221,210.65,'zjpts',226,219.82,'sezsm'),(9,222,211.57,'slaxq',227,220.91,'bdqyb');")
    .await?;
let (schema, tuples) = fnck_sql.run("explain SELECT pk, col0 FROM tab64784 WHERE col0 IN (SELECT col3 FROM tab64784 WHERE col3 IS NULL OR (col1 < 22.54) OR col4 > 85.74) OR ((col4 IS NULL))").await?;
println!("{}", create_table(&schema, &tuples));
let (schema, tuples) = fnck_sql.run("explain SELECT pk, col0 FROM tab64784 WHERE col0 IN (SELECT col3 FROM tab64784 WHERE col3 IS NULL OR (col1 < 22.54) OR col4 > 85.74)").await?;
println!("{}", create_table(&schema, &tuples));
+-------------------------------------------------------------------------------------------------------------------------+
| PLAN                                                                                                                    |
+=========================================================================================================================+
| Projection [tab64784.pk, tab64784.col0] [Project]                                                                       |
|   LeftSemi Join Where ((tab64784.col0 = (tab64784.col3) as (_temp_table_1_.col3)) || tab64784.col4 is null) [HashJoin]  |
|     Scan tab64784 -> [pk, col0, col3, col4] [SeqScan]                                                                   |
|     Projection [tab64784.col3] [Project]                                                                                |
|       Filter ((tab64784.col3 is null || (tab64784.col1 < 22.54)) || (tab64784.col4 > 85.74)), Is Having: false [Filter] |
|         Scan tab64784 -> [col1, col3, col4] [SeqScan]                                                                   |
+-------------------------------------------------------------------------------------------------------------------------+
+-------------------------------------------------------------------------------------------------------------------------+
| PLAN                                                                                                                    |
+=========================================================================================================================+
| Projection [tab64784.pk, tab64784.col0] [Project]                                                                       |
|   LeftSemi Join On tab64784.col0 = (tab64784.col3) as (_temp_table_1_.col3) [HashJoin]                                  |
|     Scan tab64784 -> [pk, col0, col3] [SeqScan]                                                                         |
|     Projection [tab64784.col3] [Project]                                                                                |
|       Filter ((tab64784.col3 is null || (tab64784.col1 < 22.54)) || (tab64784.col4 > 85.74)), Is Having: false [Filter] |
|         Scan tab64784 -> [col1, col3, col4] [SeqScan]                                                                   |
+-------------------------------------------------------------------------------------------------------------------------+

@crwen
Copy link
Member Author

crwen commented Mar 16, 2024

I just wonder that do all the queries(including subqueries) use the same binder info ? I found something wrong with the filter info

    Binary {
        op: Or,
        left_expr: Binary {
            op: Or,
            left_expr: Binary {
                op: Eq,
                left_expr: Reference {
                    expr: ColumnRef(
                        ColumnCatalog {
                            summary: ColumnSummary {
                                name: "col0",
                                table_name: Some("tab64784"),
                            },
                        },
                    ),
                    pos: 1,
                },
                right_expr: Reference { -- should be in right table
                    expr: Alias {
                        expr: ColumnRef(
                            ColumnCatalog {
                                summary: ColumnSummary {
                                    name: "col3",
                                    table_name: Some("tab64784"),
                                },
                            },
                        ),
                        alias: Expr(
                            ColumnRef(
                                ColumnCatalog {
                                    summary: ColumnSummary {
                                        name: "col3",
                                        table_name: Some("_temp_table_1_"),
                                    },
                                },
                            ),
                        ),
                    },
                    pos: 2, -- here is left table, but should be right table
                },
            },
            right_expr: Binary {
                op: And,
                left_expr: IsNull {
                    negated: false,
                    expr: Reference { 
                        expr: ColumnRef(
                            ColumnCatalog {
                                summary: ColumnSummary {
                                    name: "col4",
                                    table_name: Some("tab64784"),
                                },
                            },
                        ),
                        pos: 3,
                    },
                },
                right_expr: Binary {
                    op: Lt,
                    left_expr: Reference {
                        expr: ColumnRef(
                            ColumnCatalog {
                                summary: ColumnSummary {
                                    name: "col3",
                                    table_name: Some("tab64784"),
                                },
                            },
                        ),
                        pos: 2, -- here is left table
                    },
                    right_expr: Constant(
                        Int32(8),
                    ),
                },
            },
        },
        right_expr: Binary { ... }
  }

@KKould
Copy link
Member

KKould commented Mar 16, 2024

I just wonder that do all the queries(including subqueries) use the same binder info ? I found something wrong with the filter info

you are right! this is only possible in In Subquery, I analyzed the reasons:
https://github.com/KipData/FnckSQL/blob/main/src/binder/select.rs#L725

match expr.unpack_alias() {
    ScalarExpression::Binary {
        left_expr,
        right_expr,
        op,
        ty,
    } => {
        match op {
            BinaryOperator::Eq => {
                match (left_expr.unpack_alias_ref(), right_expr.unpack_alias_ref()) {
                    // example: foo = bar
                    (ScalarExpression::ColumnRef(l), ScalarExpression::ColumnRef(r)) => {
                        // reorder left and right joins keys to pattern: (left, right)
                        if fn_contains(left_schema, l.summary())
                            && fn_contains(right_schema, r.summary())
                        {
                            accum.push((*left_expr, *right_expr));
                        } else if fn_contains(left_schema, r.summary())
                            && fn_contains(right_schema, l.summary())
                        {
                            accum.push((*right_expr, *left_expr));
                        } else if fn_or_contains(left_schema, right_schema, l.summary())
                            || fn_or_contains(left_schema, right_schema, r.summary())
                        {
                            accum_filter.push(ScalarExpression::Binary {
                                left_expr,
                                right_expr,
                                op,
                                ty,
                            });
                        }
                    }

match (left_expr.unpack_alias_ref(), right_expr.unpack_alias_ref()) {
This will return the field expr reference of ScalarExpression::Alias. When In Subquery, expr belongs to the same table (left table).
I will try to fix this now.

@KKould
Copy link
Member

KKould commented Mar 16, 2024

I re-project the subquery so that the column output by the subquery can retain the temp_table information and be correctly mapped to left_table and right_table.

You can move the code to your PR and try to see if it solves the problem (I tested it myself and it was fine)
This is the idea for me. If you have a better way, please try it.

pr: #168

Copy link
Member

@KKould KKould left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a profoundly influential PR, which fixes many problems related to Subquery and Join.

Tips: I added the test file and fixed join explain display.

   Projection [a.id, a.v1, a.v2, b.id, b.v3, b.v4, b.v5] [Project]
     Cross Join Nothing [NestLoopJoin]
       Scan a -> [id, v1, v2] [SeqScan]
       Scan b -> [id, v3, v4, v5] [SeqScan]

@KKould KKould merged commit f44ec4f into KipData:main Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants