query planner guide 다운로드 - query planner guide 소스 코드 다운로드

[자리 표시 자]

소개

쿼리 플래너는 데이터베이스 쿼리 실행 계획을 생성하는 데이터베이스 관리 시스템 (DBMS)의 구성 요소입니다. 쿼리 계획은 쿼리에서 요청한 데이터를 검색하기 위해 DBMS가 수행 할 단계를 지정합니다. 쿼리 플래너의 목표는 가능한 한 효율적인 계획을 생성하는 것입니다. 즉, 데이터를 가능한 빨리 사용자에게 반환합니다.

쿼리 플래너는 복잡한 소프트웨어이며 이해하기 어려울 수 있습니다. 비용 기반 쿼리 플래너를 구현하기위한이 안내서는 프로세스에 대한 단계별 개요, 고유 한 비용 기반 쿼리 플래너를 구현하는 방법을 제공하지만 여전히 쿼리 플래너의 기본 개념을 다루는 방법을 제공합니다.

AI에 의해 작성, 인간이 편집했습니다

대상 관객

이 안내서는 다음과 같습니다.

쿼리 엔진으로 작업했던 사람
호기심 많은 사람은 자신의 물건을 만들고 싶어합니다
누가 DB를 배우고 싶어하지만 수학을 싫어합니다

목표 :

쿼리 계획의 기본을 이해할 수 있습니다
자신의 쿼리 플래너를 쓸 수 있습니다

쿼리 엔진의 기본 아키텍처

 그래프 TD
    사용자 ((사용자))
    파서 [쿼리 파서]
    플래너 [쿼리 플래너]
    집행자 [쿼리 프로세서]
    사용자 -텍스트 쿼리 -> Parser
    Parser -AST -> Planner
    플래너 -물리적 계획 -> 집행자

쿼리 엔진의 기본 아키텍처는 이러한 구성 요소로 구성됩니다.

쿼리 파서 : 일반적으로 사람이 읽을 수있는 텍스트 형식 (예 : SQL)으로 사용자 쿼리 입력을 구문 분석하는 데 사용됩니다.
쿼리 플래너 : 쿼리를 실행하기위한 계획/전략을 생성하는 데 사용됩니다. 일반적으로 Query Planner는 단일 쿼리에서 생성 된 여러 계획 중 가장 좋은 계획을 선택합니다.
쿼리 프로세서 : 쿼리 플래너가 출력하는 쿼리 계획을 실행하는 데 사용됩니다.

쿼리 플래너의 유형

일반적으로 쿼리 플래너는 두 가지 유형으로 나뉩니다.

휴리스틱 플래너
비용 기반 플래너

휴리스틱 플래너는 사전 정의 된 규칙을 사용하여 쿼리 계획을 생성 한 쿼리 플래너입니다.

비용 기반 플래너는 쿼리 생성 비용을 기반으로 입력 쿼리 비용을 기준으로 최적 계획을 찾으려고 시도하는 쿼리 플래너입니다.

휴리스틱 플래너는 일반적으로 변환 된 계획이 더 좋다는 것을 알고 있다면 적용 변환 규칙을 통해 최상의 계획을 찾지만 비용 기반 플래너는 동등한 계획을 열거하여 최상의 계획을 찾아서 그들 사이에서 최상의 계획을 찾으려고 노력합니다.

비용 기반 쿼리 플래너

비용 기반 쿼리 플래너에서는 일반적으로 단계로 구성됩니다.

계획 열거
쿼리 최적화

계획 열거 단계에서 플래너는 가능한 동등한 계획을 열거 할 것입니다.

그 후, 쿼리 최적화 단계에서 플래너는 열거 된 계획 목록에서 최상의 계획을 검색합니다. 최상의 계획은 비용이 가장 낮은 계획이며 비용 모델 (또는 비용 함수)이 정의됩니다.

논리적 계획의 자연은 나무와 같은 구조를 가지고 있기 때문에 최적화/검색이 실제로 트리 검색 문제라고 생각할 수 있습니다. 그리고 여기에는 많은 트리 검색 알고리즘이 있습니다.

결정 론적 동적 프로그래밍과 같은 철저한 검색. 알고리즘은 검색 종료 조건까지 최상의 계획 검색을 수행합니다.
무작위 트리 검색과 같은 무작위 검색. 알고리즘은 검색 종료 조건까지 최상의 계획 검색을 수행합니다.

참고 : 이론적으로 모든 종류의 트리 검색 알고리즘을 사용할 수 있습니다. 그러나 실용적으로 검색 알고리즘이 복잡 할 때 검색 시간이 증가하기 때문에 가능하지 않습니다.

참고 : 검색 종료 조건은 일반적으로 다음과 같습니다.

검색 피로 (더 이상 방문 할 계획이없는 경우)
비용 임계 값 (비용이 지정된 비용 임계 값보다 낮은 계획을 찾을 때)
시간 (검색 단계가 너무 오래 실행될 때)

화산 쿼리 플래너

Volcano Query Planner (또는 Volcano Optimizer Generator)는 비용 기반 쿼리 플래너입니다.

Volcano Planner는 동적 프로그래밍 방식을 사용하여 열거 된 계획 목록에서 최상의 쿼리 계획을 찾습니다.

세부 사항 : https://ieeexplore.ieee.org/document/344061 (여기에서 논문을 설명하기에는 너무 게으르다)

다음은 훌륭한 설명입니다 : https://15721.courses.cs.cmu.edu/spring2017/slides/15-optimizer2.pdf#page=9

비용 기반 쿼리 플래너 초안

우리의 쿼리 플래너는 화산 쿼리 플래너의 기본 아이디어에 따라 비용 기반 쿼리 플래너입니다. 우리의 플래너는 2 가지 주요 단계로 구성됩니다.

탐색/검색 단계
구현/최적화 단계

 그래프 lr
    AST ((AST))
    logical_plan [계획]
    explord_plans [ "`
        계획 #1
        ...
        계획 #N
    ` "]
    구현 _plan [ "Plan #X (Best Plan)"]]
    AST- 논리 계획 -> logical_plan으로 변환
    logical_plan- 탐색 단계 -> explord_plans
    explord_plans- 최적화 단계 -> 구현 _plan
    Linkstyle 1,2 색상 : 주황색, 스트로크 : 오렌지, 뇌졸중 범위 : 5px

어휘

논리적 계획

논리 계획은 쿼리를 실행하는 데 필요한 변환 단계의 추상화를 보유하는 데이터 일프 구조입니다.

다음은 논리적 계획의 예입니다.

 그래프 TD
    1 [ "Project TBL1.ID, TBL1.Field1, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"];
    2 [ "조인"];
    3 [ "스캔 TBL1"];
    4 [ "조인"];
    5 [ "스캔 tbl2"];
    6 [ "스캔 tbl3"];
    1-> 2;
    2-> 3;
    2-> 4;
    4-> 5;
    4-> 6;

물리적 계획

논리 계획은 추상화 만 보유하지만 물리적 계획은 구현 세부 사항을 보유하는 데이터 일프 구조입니다. 각 논리적 계획에는 여러 가지 물리적 계획이 있습니다. 예를 들어, 논리적 조인에는 해시 조인, 합병 조인, 브로드 캐스트 조인 등과 같은 많은 물리적 계획이있을 수 있습니다.

동등한 그룹

동등한 그룹은 동등한 표현식의 그룹입니다 (각 표현마다 논리적 계획이 논리적으로 동등합니다)

예를 들어

 그래프 TD
    하위 그래프 그룹#8
        expr#8 [ "스캔 TBL2 (Field1, Field2, Id)"]
    끝
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#11
        Expr#11 [ "조인"]
    끝
    Expr#11-> 그룹#7
    Expr#11-> 그룹#10
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#7
        expr#7 [ "스캔 tbl1 (id, field1)"]
    끝
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#10
        expr#10 [ "조인"]
    끝
    Expr#10-> 그룹#8
    Expr#10-> 그룹#9
    하위 그래프 그룹#9
        expr#9 [ "스캔 tbl3 (id, field2)"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#6
        Expr#12 [ "Project TBL1.ID, TBL1.Field1, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#12-> 그룹#11
    Expr#6-> 그룹#5

여기서 Group#6 에 2 개의 동등한 표현식이있는 것을 볼 수 있습니다. 둘 다 동일한 쿼리를 나타냅니다 (하나는 테이블에서 스캔하고 프로젝트에서 스캔하고 하나는 스캔 노드를 아래로 밀어 내고 있습니다).

변환 규칙

변환 규칙은 하나의 논리적 계획에서 다른 논리적 동등한 논리적 계획으로 전환하는 규칙입니다.

예를 들어, 계획 :

 그래프 TD
    1 [ "Project TBL1.ID, TBL1.Field1, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"];
    2 [ "조인"];
    3 [ "스캔 TBL1"];
    4 [ "조인"];
    5 [ "스캔 tbl2"];
    6 [ "스캔 tbl3"];
    1-> 2;
    2-> 3;
    2-> 4;
    4-> 5;
    4-> 6;

프로젝션 푸시 다운 변환을 적용하면 다음으로 변환됩니다.

 그래프 TD
    1 [ "Project *. *"];
    2 [ "조인"];
    3 [ "scan tbl1 (id, field1)"];
    4 [ "조인"];
    5 [ "스캔 TBL2 (Field1, Field2)];
    6 [ "scan tbl3 (id, field2, field2)"];
    1-> 2;
    2-> 3;
    2-> 4;
    4-> 5;
    4-> 6;

변환 규칙은 테이블 스키마, 데이터 통계 등과 같은 논리적 특성/속성에 영향을 줄 수 있습니다.

구현 규칙

구현 규칙은 논리적 계획이 주어진 물리적 계획을 반환하는 규칙입니다.

구현 규칙은 데이터 레이아웃 (정렬 여부) 등과 같은 물리적 특성/속성에 영향을 줄 수 있습니다.

탐사 단계

탐사 단계에서 플래너는 변환 규칙을 적용하여 동등한 논리 계획을 생성합니다.

예를 들어, 계획 :

 그래프 TD
    1326583549 [ "프로젝트 tbl1.id, tbl1.field1, tbl2.id, tbl2.field1, tbl2.field2, tbl3.id, tbl3.field2, tbl3.field2"];
    -425111028 [ "조인"];
    -349388609 [ "스캔 TBL1"];
    1343755644 [ "조인"];
    -1043437086 [ "스캔 TBL2"];
    -1402686787 [ "스캔 TBL3"];
    1326583549-> -425111028;
    -425111028-> -349388609;
    -425111028-> 1343755644;
    1343755644-> -1043437086;
    1343755644-> -1402686787;

변환 규칙을 적용한 후 다음 그래프가 나타납니다.

 그래프 TD
    하위 그래프 그룹#8
        expr#8 [ "스캔 tbl2 (id, field1, field2)"]
    끝
    하위 그래프 그룹#11
        Expr#11 [ "조인"]
        expr#14 [ "조인"]
    끝
    Expr#11-> 그룹#7
    Expr#11-> 그룹#10
    Expr#14-> 그룹#8
    Expr#14-> 그룹#12
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
        expr#16 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    Expr#16-> 그룹#2
    Expr#16-> 그룹#13
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#13
        Expr#15 [ "조인"]
    끝
    Expr#15-> 그룹#1
    Expr#15-> 그룹#3
    하위 그래프 그룹#7
        expr#7 [ "스캔 tbl1 (id, field1)"]
    끝
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#10
        expr#10 [ "조인"]
    끝
    Expr#10-> 그룹#8
    Expr#10-> 그룹#9
    하위 그래프 그룹#9
        expr#9 [ "스캔 tbl3 (id, field2)"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#12
        expr#13 [ "조인"]
    끝
    Expr#13-> 그룹#7
    Expr#13-> 그룹#9
    하위 그래프 그룹#6
        Expr#12 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#12-> 그룹#11
    Expr#6-> 그룹#5

여기서는 투영 푸시 다운 규칙과 Join REARDER 규칙이 적용되는 것을 볼 수 있습니다.

최적화 단계

최적화 단계는 탐사 단계에서 확장 된 트리를 가로 지르고 쿼리에 가장 적합한 계획을 찾는 것입니다.

이 "실제로"는 트리 검색 최적화이므로 상상할 수있는 트리 검색 알고리즘을 사용할 수 있습니다 (그러나 올바른지 확인해야합니다).

다음은 최적화 단계 후 생성 된 물리적 계획의 예입니다.

 그래프 TD
    그룹#6 [ ""
    그룹 #6
선택 : Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2
연산자 : 프로젝트 요기
비용 : 비용 (CPU = 641400.00, mem = 1020400012.00, time = 10000000.00)
"]
그룹#6-> 그룹#11
그룹#11 [ ""
그룹 #11
선택 : 가입
연산자 : 해시 노피퍼
비용 : 비용 (CPU = 641400.00, mem = 1020400012.00, time = 10000000.00)
"]
그룹#11-> 그룹#7
그룹#11-> 그룹#10
그룹#7 [ ""
그룹 #7
선택 : 스캔 TBL1 (ID, Field1)
연산자 : NormalsCanoperator
비용 : 비용 (CPU = 400.00, MEM = 400000.00, 시간 = 1000.00)
"]
그룹#10 [ ""
그룹 #10
선택 : 가입
연산자 : 병합 조정기
특성 : 분류
비용 : 비용 (CPU = 640000.00, mem = 20000012.00, 시간 = 1100000.00)
"]
그룹#10-> 그룹#8
그룹#10-> 그룹#9
그룹#8 [ ""
그룹 #8
선택된 : 스캔 tbl2 (id, field1, field2)
연산자 : NormalsCanoperator
특성 : 분류
비용 : 비용 (CPU = 600000.00, mem = 12.00, time = 10000000.00)
"]
그룹#9 [ "
그룹 #9
선택 : 스캔 tbl3 (id, field2)
연산자 : NormalsCanoperator
특성 : 분류
비용 : 비용 (CPU = 40000.00, mem = 200000000.00, 시간 = 100000.00)
"]

생성 된 계획은 선택된 논리 계획, 예상 비용 및 물리적 운영자를 보여주었습니다.

최적화/검색 종료

우리의 플래너는 최상의 계획을 찾기 위해 피로 검색을 수행합니다.

코드로 다이빙

플래너의 코드가 크기 때문에 단계별 가이드를 작성하지는 않지만 대신 코드의 모든 부분을 설명하겠습니다.

쿼리 언어

여기서 우리는이 튜토리얼을 철저히 사용한 쿼리 언어를 정의 할 것입니다.

 SELECT emp . id ,
       emp . code ,
       dept . dept_name ,
       emp_info . name ,
       emp_info . origin
FROM emp
         JOIN dept ON emp . id = dept . emp_id
         JOIN emp_info ON dept . emp_id = emp_info . id

우리가 구현할 쿼리 언어는 SQL과 같은 언어입니다. 그러나 단순성을 위해 기능과 구문을 제한합니다.

언어는 형태로 나타납니다

 SELECT tbl . field , [...]
FROM tbl JOIN [...]

SELECT and JOIN 지원할뿐만 아니라 Select 명령문의 필드는 완전히 자격을 갖추어야합니다 ( table.field 형태). 다른 모든 기능은 지원되지 않습니다.

AST

먼저, 우리는 언어에 대한 AST를 정의해야합니다. AST (또는 Abstract Syntax Tree)는 텍스트의 구문 구조를 나타내는 데 사용되는 트리입니다.

우리의 언어는 매우 간단하기 때문에 여러 줄의 코드에서 AST 구조를 정의 할 수 있습니다.

 sealed trait Identifier

case class TableID ( id : String ) extends Identifier

case class FieldID ( table : TableID , id : String ) extends Identifier

sealed trait Statement

case class Table ( table : TableID ) extends Statement

case class Join ( left : Statement , right : Statement , on : Seq [( FieldID , FieldID )]) extends Statement

case class Select ( fields : Seq [ FieldID ], from : Statement ) extends Statement

예를 들어, 쿼리

 SELECT tbl1 . id ,
       tbl1 . field1 ,
       tbl2 . id ,
       tbl2 . field1 ,
       tbl2 . field2 ,
       tbl3 . id ,
       tbl3 . field2 ,
       tbl3 . field2
FROM tbl1
         JOIN tbl2 ON tbl1 . id = tbl2 . id
         JOIN tbl3 ON tbl2 . id = tbl3 . id

로 표현할 수 있습니다

 Select (
  Seq (
    FieldID ( TableID ( " tbl1 " ), " id " ),
    FieldID ( TableID ( " tbl1 " ), " field1 " ),
    FieldID ( TableID ( " tbl2 " ), " id " ),
    FieldID ( TableID ( " tbl2 " ), " field1 " ),
    FieldID ( TableID ( " tbl2 " ), " field2 " ),
    FieldID ( TableID ( " tbl3 " ), " id " ),
    FieldID ( TableID ( " tbl3 " ), " field2 " ),
    FieldID ( TableID ( " tbl3 " ), " field2 " )
  ),
  Join (
    Table ( TableID ( " tbl1 " )),
    Join (
      Table ( TableID ( " tbl2 " )),
      Table ( TableID ( " tbl3 " )),
      Seq (
        FieldID ( TableID ( " tbl2 " ), " id " ) -> FieldID ( TableID ( " tbl3 " ), " id " )
      )
    ),
    Seq (
      FieldID ( TableID ( " tbl1 " ), " id " ) -> FieldID ( TableID ( " tbl2 " ), " id " )
    )
  )
)

간단한 쿼리 파서

AST 구조를 정의한 후 텍스트 쿼리를 AST 양식으로 변환하는 데 사용되는 쿼리 파서를 작성해야합니다.

이 안내서는 구현에 Scala를 사용하기 때문에 쿼리 파서를 만들기 위해 Scala-Parser-Combinator를 선택합니다.

쿼리 파서 클래스 :

 object QueryParser extends ParserWithCtx [ QueryExecutionContext , Statement ] with RegexParsers {

  override def parse ( in : String )( implicit ctx : QueryExecutionContext ) : Either [ Throwable , Statement ] = {
    Try (parseAll(statement, in) match {
      case Success (result, _) => Right (result)
      case NoSuccess (msg, _) => Left ( new Exception (msg))
    }) match {
      case util. Failure (ex) => Left (ex)
      case util. Success (value) => value
    }
  }

  private def select : Parser [ Select ] = ??? // we will implement it in later section

  private def statement : Parser [ Statement ] = select
}

그런 다음 일부 구문 분석 규칙을 정의하십시오.

 // common
private def str : Parser [ String ] = """ [a-zA-Z0-9_]+ """ .r
private def fqdnStr : Parser [ String ] = """ [a-zA-Z0-9_]+.[a-zA-Z0-9_]+ """ .r

// identifier
private def tableId : Parser [ TableID ] = str ^^ (s => TableID (s))

private def fieldId : Parser [ FieldID ] = fqdnStr ^^ { s =>
  val identifiers = s.split( '.' )
  if (identifiers.length != 2 ) {
    throw new Exception ( " should never happen " )
  } else {
    val table = identifiers.head
    val field = identifiers( 1 )
    FieldID ( TableID (table), field)
  }
}

다음은 식별자를 구문 분석하는 데 사용되는 두 가지 규칙입니다 : TableID 및 FieldID .

테이블 ID (또는 테이블 이름)에는 일반적으로 문자, 숫자 및 밑줄 ( _ ) 만 포함되므로 간단한 Regex [a-zA-Z0-9_]+ 사용하여 테이블 이름을 식별합니다.

반면, 우리 언어의 필드 ID (Field Qualifier)는 완전히 자격을 갖춘 필드 이름입니다. 일반적으로 table.field 형태이며 필드 이름에는 일반적으로 문자, 숫자 및 밑줄 만 포함하므로 Regex [a-zA-Z0-9_]+.[a-zA-Z0-9_]+ 사용하여 필드 이름을 파서에 사용합니다.

식별자를 구문 분석하기위한 규칙을 정의한 후 이제 규칙을 정의하여 쿼리 문을 정의 할 수 있습니다.

 // statement
private def table : Parser [ Table ] = tableId ^^ (t => Table (t))
private def subQuery : Parser [ Statement ] = " ( " ~> select <~ " ) "

table 규칙은 간단한 규칙입니다. tableId 규칙에서 구문 분석 된 TableID 사용하여 Table 노드를 만듭니다.

subQuery 는 하위 쿼리를 구문 분석하는 규칙입니다. SQL에서는 다음과 같은 쿼리를 쓸 수 있습니다.

 SELECT a
FROM ( SELECT b FROM c) d

SELECT b FROM c 는 위의 명령문의 하위 쿼리입니다. 여기서 간단한 쿼리 언어에서, 우리는 진술이 한 쌍의 괄호 ( () )로 둘러싸인 경우 하위 쿼리임을 나타냅니다. 우리의 언어는 선정 된 진술 만 가지고 있으므로, 우리는 구문 분석 규칙을 다음과 같이 쓸 수 있습니다.

 def subQuery : Parser [ Statement ] = " ( " ~> select <~ " ) "

이제 우리는 select 문에 대한 구문 분석 규칙을 정의 할 것입니다.

 private def fromSource : Parser [ Statement ] = table ||| subQuery

private def select : Parser [ Select ] =
  " SELECT " ~ rep1sep(fieldId, " , " ) ~ " FROM " ~ fromSource ~ rep(
    " JOIN " ~ fromSource ~ " ON " ~ rep1(fieldId ~ " = " ~ fieldId)
  ) ^^ {
    case _ ~ fields ~ _ ~ src ~ joins =>
      val p = if (joins.nonEmpty) {
        def chain ( left : Statement , right : Seq [( Statement , Seq [( FieldID , FieldID )])]) : Join = {
          if (right.isEmpty) {
            throw new Exception ( " should never happen " )
          } else if (right.length == 1 ) {
            val next = right.head
            Join (left, next._1, next._2)
          } else {
            val next = right.head
            Join (left, chain(next._1, right.tail), next._2)
          }
        }

        val temp = joins.map { join =>
          val statement = join._1._1._2
          val joinOn = join._2.map(on => on._1._1 -> on._2)
          statement -> joinOn
        }
        chain(src, temp)
      } else {
        src
      }
      Select (fields, p)
  }

SQL에서는 하위 쿼리를 조인 소스로 사용할 수 있습니다. 예를 들어:

 SELECT * . *
FROM tbl1
    JOIN ( SELECT * . * FROM tbl2)
    JOIN tbl3

따라서 우리의 파서는 또한 성명서의 조인 부분에서 하위 쿼리를 구문 분석하는 규칙을 구현할 것입니다. 그래서 우리는 구문 분석 규칙을 갖습니다.

 " SELECT " ~ rep1sep(fieldId, " , " ) ~ " FROM " ~ fromSource ~ rep( " JOIN " ~ fromSource ~ " ON " ~ rep1(fieldId ~ " = " ~ fieldId)

전체 구현은 QueryParser.scala를 참조하십시오

쿼리 파서 테스트

queryparserspec.scala를 참조하십시오

논리적 계획

텍스트 쿼리에서 AST를 생성 한 후 논리 계획으로 직접 변환 할 수 있습니다.

먼저 논리적 계획의 인터페이스를 정의 할 수 있습니다.

 sealed trait LogicalPlan {
  def children () : Seq [ LogicalPlan ]
}

children 아동 논리 계획의 목록입니다. 예를 들어:

 그래프 TD
    1326583549 [ "프로젝트 tbl1.id, tbl1.field1, tbl2.id, tbl2.field1, tbl2.field2, tbl3.id, tbl3.field2, tbl3.field2"];
    -425111028 [ "조인"];
    -349388609 [ "스캔 TBL1"];
    1343755644 [ "조인"];
    -1043437086 [ "스캔 TBL2"];
    -1402686787 [ "스캔 TBL3"];
    1326583549-> -425111028;
    -425111028-> -349388609;
    -425111028-> 1343755644;
    1343755644-> -1043437086;
    1343755644-> -1402686787;

PROJECT 노드의 자식 노드는 첫 번째 JOIN 노드입니다. 첫 번째 JOIN 노드에는 2 개의 어린이가 있으며, 이는 두 번째 JOIN 노드 및 SCAN tbl1 노드입니다. 곧, ...

쿼리 언어는 간단하기 때문에 3 가지 유형의 논리 노드 만 있으면됩니다.

프로젝트 : 대수 관계에서 투영 연산자를 대표합니다
조인 : 논리적 조인을 나타냅니다
스캔 : 테이블 스캔을 나타냅니다

 case class Scan ( table : ql. TableID , projection : Seq [ String ]) extends LogicalPlan {
  override def children () : Seq [ LogicalPlan ] = Seq .empty
}

case class Project ( fields : Seq [ql. FieldID ], child : LogicalPlan ) extends LogicalPlan {
  override def children () : Seq [ LogicalPlan ] = Seq (child)
}

case class Join ( left : LogicalPlan , right : LogicalPlan , on : Seq [(ql. FieldID , ql. FieldID )]) extends LogicalPlan {
  override def children () : Seq [ LogicalPlan ] = Seq (left, right)
}

그런 다음 AST를 논리적 계획으로 변환하기 위해 기능을 작성할 수 있습니다.

 def toPlan ( node : ql. Statement ) : LogicalPlan = {
  node match {
    case ql. Table (table) => Scan (table, Seq .empty)
    case ql. Join (left, right, on) => Join (toPlan(left), toPlan(right), on)
    case ql. Select (fields, from) => Project (fields, toPlan(from))
  }
}

전체 구현은 logicalplan.scala를 참조하십시오

동등한 그룹

그룹

그룹의 클래스를 다음과 같이 정의 할 수 있습니다.

 case class Group (
                  id : Long ,
                  equivalents : mutable. HashSet [ GroupExpression ]
                ) {
  val explorationMark : ExplorationMark = new ExplorationMark
  var implementation : Option [ GroupImplementation ] = None
}

case class GroupExpression (
                            id : Long ,
                            plan : LogicalPlan ,
                            children : mutable. MutableList [ Group ]
                          ) {
  val explorationMark : ExplorationMark = new ExplorationMark
  val appliedTransformations : mutable. HashSet [ TransformationRule ] = mutable. HashSet ()
}

Group 논리적으로 동등한 계획 세트입니다.

각 GroupExpression 논리 계획 노드를 나타냅니다. 논리적 계획 노드는 하위 노드 목록 (이전 섹션에서)을 정의 했으므로 GroupExpression 논리적 계획 노드를 나타내고 Group 동등한 계획 세트를 나타내므로 GroupExpression 의 어린이는 Group 목록입니다.

예를 들어

 그래프 TD
    하위 그래프 그룹#8
        expr#8
    끝
    하위 그래프 그룹#2
        expr#2
    끝
    하위 그래프 그룹#11
        expr#11
    끝
    Expr#11-> 그룹#7
    Expr#11-> 그룹#10
    하위 그래프 그룹#5
        expr#5
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        expr#4
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#7
        expr#7
    끝
    하위 그래프 그룹#1
        expr#1
    끝
    하위 그래프 그룹#10
        expr#10
    끝
    Expr#10-> 그룹#8
    Expr#10-> 그룹#9
    하위 그래프 그룹#9
        expr#9
    끝
    하위 그래프 그룹#3
        expr#3
    끝
    하위 그래프 그룹#6
        Expr#12
        expr#6
    끝
    Expr#12-> 그룹#11
    Expr#6-> 그룹#5

여기에서 볼 수 있듯이 Group#6 에는 2 개의 동등한 표현이 있습니다 : Expr#12 및 Expr#6 , Expr#12 의 어린이는 Group#11 입니다.

참고 : 탐사 단계에서 다중 라운드 변환을 구현하므로 각 Group 및 GroupExpression 에 대해 탐색 상태를 ExplorationMark 표시가 있습니다.

 class ExplorationMark {
  private var bits : Long = 0

  def get : Long = bits

  def isExplored ( round : Int ) : Boolean = BitUtils .getBit(bits, round)

  def markExplored ( round : Int ) : Unit = bits = BitUtils .setBit(bits, round, on = true )

  def markUnexplored ( round : Int ) : Unit = bits = BitUtils .setBit(bits, round, on = false )
}

ExplorationMark 는 Bitset Wrapper 클래스 일뿐입니다. I-th 비트를 I-th 라운드를 탐색하면 0으로 표시하고 그렇지 않으면 0으로 표시됩니다.

ExplorationMark 정확한 변환을 시각화하는 데 사용될 수 있습니다. 자세한 내용은 시각화를 참조하십시오.

메모

메모는 동등한 그룹을 구성하는 데 도움이되는 많은 도우미입니다. 메모는 그룹 및 그룹 표현식을 캐시하기위한 여러 해시 맵으로 구성되어 있으며 새로운 그룹 또는 그룹 표현식을 등록하는 방법을 제공합니다.

 class Memo (
            groupIdGenerator : Generator [ Long ] = new LongGenerator ,
            groupExpressionIdGenerator : Generator [ Long ] = new LongGenerator
          ) {
  val groups : mutable. HashMap [ Long , Group ] = mutable. HashMap [ Long , Group ]()
  val parents : mutable. HashMap [ Long , Group ] = mutable. HashMap [ Long , Group ]() // lookup group from group expression ID
  val groupExpressions : mutable. HashMap [ LogicalPlan , GroupExpression ] = mutable. HashMap [ LogicalPlan , GroupExpression ]()

  def getOrCreateGroupExpression ( plan : LogicalPlan ) : GroupExpression = {
    val children = plan.children()
    val childGroups = children.map(child => getOrCreateGroup(child))
    groupExpressions.get(plan) match {
      case Some (found) => found
      case None =>
        val id = groupExpressionIdGenerator.generate()
        val children = mutable. MutableList () ++ childGroups
        val expression = GroupExpression (
          id = id,
          plan = plan,
          children = children
        )
        groupExpressions += plan -> expression
        expression
    }
  }

  def getOrCreateGroup ( plan : LogicalPlan ) : Group = {
    val exprGroup = getOrCreateGroupExpression(plan)
    val group = parents.get(exprGroup.id) match {
      case Some (group) =>
        group.equivalents += exprGroup
        group
      case None =>
        val id = groupIdGenerator.generate()
        val equivalents = mutable. HashSet () + exprGroup
        val group = Group (
          id = id,
          equivalents = equivalents
        )
        groups.put(id, group)
        group
    }
    parents += exprGroup.id -> group
    group
  }
}

전체 구현은 Memo.scala를 참조하십시오

초기화

플래너 내부의 첫 번째 단계는 초기화입니다

 그래프 lr
    쿼리 (쿼리)
    AST ((AST))
    root_plan ((rootplan))
    root_group ((rootgroup))
    query- "queryparser.parse (query)" -> ast
    ast -"logicalplan.toplan (ast)" -> root_plan
    root_plan- "memo.getorcreategroup (rootplan)" -> root_group

먼저, 쿼리는 AST로 구문 분석됩니다. 그런 다음 root plan 이라고하는 논리 계획으로 변환 한 다음 root group 이라는 root plan 에서 그룹을 초기화합니다.

 def initialize ( query : Statement )( implicit ctx : VolcanoPlannerContext ) : Unit = {
  ctx.query = query
  ctx.rootPlan = LogicalPlan .toPlan(ctx.query)
  ctx.rootGroup = ctx.memo.getOrCreateGroup(ctx.rootPlan)
  // assuming this is first the exploration round,
  // by marking the initialRound(0) as explored,
  // it will be easier to visualize the different between rounds (added nodes, add connections)
  ctx.memo.groups.values.foreach(_.explorationMark.markExplored(initialRound))
  ctx.memo.groupExpressions.values.foreach(_.explorationMark.markExplored(initialRound))
}

자세한 내용은 Volcanoplanner.scala를 참조하십시오

예를 들어 쿼리 :

 SELECT tbl1 . id ,
       tbl1 . field1 ,
       tbl2 . id ,
       tbl2 . field1 ,
       tbl2 . field2 ,
       tbl3 . id ,
       tbl3 . field2 ,
       tbl3 . field2
FROM tbl1
         JOIN tbl2 ON tbl1 . id = tbl2 . id
         JOIN tbl3 ON tbl2 . id = tbl3 . id

초기화 후 그룹은 다음과 같이 보일 것입니다.

 그래프 TD
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#6
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#6-> 그룹#5

여기서 당신은 모든 그룹에 정확히 하나의 동등한 표현이 있음을 알 수 있습니다.

탐사 단계

초기화 후, 이제는 탐색 단계가 있으며, 이는 가능한 모든 동등한 계획을 탐색 할 것입니다.

탐사 방법은 매우 간단합니다.

각 그룹에 대해 변환 규칙을 적용하여 모든 동등한 그룹 표현식을 찾고 새로운 동등한 계획을 찾을 수 없을 때까지 동등한 세트에 추가하십시오.
각 그룹 표현에 대해 모든 어린이 그룹을 탐색하십시오

변환 규칙

탐색 코드로 뛰어 들기 전에 먼저 변환 규칙에 대해 이야기하겠습니다.

변환 규칙은 규칙 조건과 일치하는 경우 논리 계획을 다른 동등한 논리 계획으로 변환하는 데 사용되는 규칙입니다.

변환 규칙의 인터페이스는 다음과 같습니다.

 trait TransformationRule {
  def `match` ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : Boolean

  def transform ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : GroupExpression
}

논리적 계획은 트리와 같은 데이터 인프라이기 때문에 변환 규칙의 match 구현은 트리에서 패턴 일치합니다.

예를 들어, 프로젝트 노드와 일치하는 데 사용되는 match 다음과 같은 일치하며 조인 및 스캔 만 포함하는 자손인지 확인하십시오.

 override def `match` ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : Boolean = {
  val plan = expression.plan
  plan match {
    case Project (_, child) => check(child)
    case _ => false
  }
}

// check if the tree only contains SCAN and JOIN nodes
private def check ( node : LogicalPlan ) : Boolean = {
  node match {
    case Scan (_, _) => true
    case Join (left, right, _) => check(left) && check(right)
    case _ => false
  }
}

이 계획은 "일치"입니다.

 그래프 TD
    하위 그래프 그룹#2
        Expr#2 [ "스캔"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#1
        expr#1 [ "스캔"]
    끝
    하위 그래프 그룹#3
        Expr#3 [ "스캔"]
    끝
    하위 그래프 그룹#6
        Expr#6 [ "프로젝트"]
    끝
    Expr#6-> 그룹#5

이 계획은 그렇지 않지만 :

 그래프 TD
    하위 그래프 그룹#2
        Expr#2 [ "스캔"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#3
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "스캔"]
    끝
    하위 그래프 그룹#7
        Expr#7 [ "프로젝트"]
    끝
    Expr#7-> 그룹#6
    하위 그래프 그룹#1
        expr#1 [ "스캔"]
    끝
    하위 그래프 그룹#3
        Expr#3 [ "프로젝트"]
    끝
    Expr#3-> 그룹#2
    하위 그래프 그룹#6
        Expr#6 [ "조인"]
    끝
    Expr#6-> 그룹#1
    Expr#6-> 그룹#5

계획 열거

앞에서 말했듯이 탐사 방법은 다음과 같습니다.

각 그룹에 대해 변환 규칙을 적용하여 모든 동등한 그룹 표현식을 찾고 새로운 동등한 계획을 찾을 수 없을 때까지 동등한 세트에 추가하십시오.
각 그룹 표현에 대해 모든 어린이 그룹을 탐색하십시오

그리고 여기에 탐색 코드가 있습니다 (아주 간단합니다).

 private def exploreGroup (
                          group : Group ,
                          rules : Seq [ TransformationRule ],
                          round : Int
                        )( implicit ctx : VolcanoPlannerContext ) : Unit = {
  while ( ! group.explorationMark.isExplored(round)) {
    group.explorationMark.markExplored(round)
    // explore all child groups
    group.equivalents.foreach { equivalent =>
      if ( ! equivalent.explorationMark.isExplored(round)) {
        equivalent.explorationMark.markExplored(round)
        equivalent.children.foreach { child =>
          exploreGroup(child, rules, round)
          if (equivalent.explorationMark.isExplored(round) && child.explorationMark.isExplored(round)) {
            equivalent.explorationMark.markExplored(round)
          } else {
            equivalent.explorationMark.markUnexplored(round)
          }
        }
      }
      // fire transformation rules to explore all the possible transformations
      rules.foreach { rule =>
        if ( ! equivalent.appliedTransformations.contains(rule) && rule.`match`(equivalent)) {
          val transformed = rule.transform(equivalent)
          if ( ! group.equivalents.contains(transformed)) {
            group.equivalents += transformed
            transformed.explorationMark.markUnexplored(round)
            group.explorationMark.markUnexplored(round)
          }
        }
      }
      if (group.explorationMark.isExplored(round) && equivalent.explorationMark.isExplored(round)) {
        group.explorationMark.markExplored(round)
      } else {
        group.explorationMark.markUnexplored(round)
      }
    }
  }
}

자세한 내용은 Volcanoplanner.scala를 참조하십시오

일부 변환 규칙을 구현하십시오

이제 일부 변환 규칙을 구현할 차례입니다

프로젝션 푸시 다운

프로젝션 푸시 다운은 프로젝션을 스토리지 레이어로 아래로 밀기 위해 사용되는 간단한 변환 규칙입니다.

예를 들어 쿼리입니다

 SELECT field1, field2
from tbl

계획이 있습니다

 그래프 lr
    프로젝트 [Project Field1, Field2]
    스캔 [TBL 스캔]
    프로젝트 -> 스캔

이 계획을 사용하면 실행할 때 스토리지 레이어 (스캔 중)의 행이 완전히 가져오고 불필요한 필드가 삭제됩니다 (프로젝트). 불필요한 데이터는 여전히 스캔 노드에서 프로젝트 노드로 이동해야하므로 여기에는 약간의 노력이 있습니다.

스토리지 레이어에게 필요한 필드 만 가져 오는 것만으로 말함으로써 더 좋게 만들 수 있습니다. 이제 계획은 다음으로 전환 될 것입니다.

 그래프 lr
    프로젝트 [Project Field1, Field2]
    스캔 [ "스캔 TBL (Field1, Field2)]]
    프로젝트 -> 스캔

코드로 가자 :

 override def `match` ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : Boolean = {
  val plan = expression.plan
  plan match {
    case Project (_, child) => check(child)
    case _ => false
  }
}

// check if the tree only contains SCAN and JOIN nodes
private def check ( node : LogicalPlan ) : Boolean = {
  node match {
    case Scan (_, _) => true
    case Join (left, right, _) => check(left) && check(right)
    case _ => false
  }
}

여기서 프로젝션 푸시 다운 규칙은 프로젝트 노드 일 때 계획과 일치하며 모든 자손은 스캔 및 결합 노드 전용입니다.

참고 : 실제로 실제 프로젝션 푸시 다운 일치는 더 복잡하지만 단순성을 위해서는 스캔 및 결합 자손이있는 프로젝트 노드입니다.

변환 코드는 다음과 같습니다.

 override def transform ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : GroupExpression = {
  val plan = expression.plan. asInstanceOf [ Project ]
  val pushDownProjection = mutable. ListBuffer [ FieldID ]()
  extractProjections(plan, pushDownProjection)
  val newPlan = Project (plan.fields, pushDown(pushDownProjection.distinct, plan.child))
  ctx.memo.getOrCreateGroupExpression(newPlan)
}

private def extractProjections ( node : LogicalPlan , buffer : mutable. ListBuffer [ FieldID ]) : Unit = {
  node match {
    case Scan (_, _) => () : Unit
    case Project (fields, parent) =>
      buffer ++= fields
      extractProjections(parent, buffer)
    case Join (left, right, on) =>
      buffer ++= on.map(_._1) ++ on.map(_._2)
      extractProjections(left, buffer)
      extractProjections(right, buffer)
  }
}

private def pushDown ( pushDownProjection : Seq [ FieldID ], node : LogicalPlan ) : LogicalPlan = {
  node match {
    case Scan (table, tableProjection) =>
      val filteredPushDownProjection = pushDownProjection.filter(_.table == table).map(_.id)
      val updatedProjection =
        if (filteredPushDownProjection.contains( " * " ) || filteredPushDownProjection.contains( " *.* " )) {
          Seq .empty
        } else {
          (tableProjection ++ filteredPushDownProjection).distinct
        }
      Scan (table, updatedProjection)
    case Join (left, right, on) => Join (pushDown(pushDownProjection, left), pushDown(pushDownProjection, right), on)
    case _ => throw new Exception ( " should never happen " )
  }
}

변환 코드는 먼저 루트 프로젝트 노드에서 모든 예측을 찾은 다음 그 아래의 모든 스캔 노드로 밀어 넣습니다.

예를 들어, 우리의 규칙을 시각화합니다

 그래프 TD
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#6
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#6-> 그룹#5

프로젝션 푸시 다운 변환을 적용한 후, 투영과 함께 새로운 동등한 계획이 스캔 작업으로 추진됩니다 (새로운 계획은 주황색 국경 노드가있는 트리입니다).

 그래프 TD
    하위 그래프 그룹#8
        expr#8 [ "스캔 tbl2 (id, field1, field2)"]
    끝
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#11
        Expr#11 [ "조인"]
    끝
    Expr#11-> 그룹#7
    Expr#11-> 그룹#10
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#7
        expr#7 [ "스캔 tbl1 (id, field1)"]
    끝
    하위 그래프 그룹#10
        expr#10 [ "조인"]
    끝
    Expr#10-> 그룹#8
    Expr#10-> 그룹#9
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#9
        expr#9 [ "스캔 tbl3 (id, field2)"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#6
        Expr#12 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#12-> 그룹#11
    Expr#6-> 그룹#5
    스타일 expr#12 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#8 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#10 스트로크 폭 : 4px, 스트로크 : 오렌지
    스타일 expr#9 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#11 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#7 스트로크 볼 : 4px, 스트로크 : 오렌지
    LinkStyle 0 스트로크 폭 : 4px, 스트로크 : 오렌지
    Linkstyle 1 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 6 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 7 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 8 스트로크 폭 : 4px, 스트로크 : 오렌지

전체 구현은 projectionpushdown.scala를 참조하십시오

재주문에 가입하십시오

Join REARDER는 또한 쿼리 플래너 세계에서 가장 인정받는 변환 중 하나입니다. 우리의 플래너는 또한 재주문 변환 규칙을 구현할 것입니다.

실제 세계의 Joint Resorder는 구현하기 쉬운 부분이 아닙니다. 따라서 여기에서 간단하고 찢어지는 버전의 Join REARDER 규칙을 구현합니다.

첫째, 규칙 match :

 // check if the tree only contains SCAN and JOIN nodes, and also extract all SCAN nodes and JOIN conditions
private def checkAndExtract (
                             node : LogicalPlan ,
                             buffer : mutable. ListBuffer [ Scan ],
                             joinCondBuffer : mutable. ListBuffer [(ql. FieldID , ql. FieldID )]
                           ) : Boolean = {
  node match {
    case node @ Scan (_, _) =>
      buffer += node
      true
    case Join (left, right, on) =>
      joinCondBuffer ++= on
      checkAndExtract(left, buffer, joinCondBuffer) && checkAndExtract(right, buffer, joinCondBuffer)
    case _ => false
  }
}

private def buildInterchangeableJoinCond ( conditions : Seq [(ql. FieldID , ql. FieldID )]) : Seq [ Seq [ql. FieldID ]] = {
  val buffer = mutable. ListBuffer [mutable. Set [ql. FieldID ]]()
  conditions.foreach { cond =>
    val set = buffer.find { set =>
      set.contains(cond._1) || set.contains(cond._2)
    } match {
      case Some (set) => set
      case None =>
        val set = mutable. Set [ql. FieldID ]()
        buffer += set
        set
    }
    set += cond._1
    set += cond._2
  }
  buffer.map(_.toSeq)
}

override def `match` ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : Boolean = {
  val plan = expression.plan
  plan match {
    case node @ Join (_, _, _) =>
      val buffer = mutable. ListBuffer [ Scan ]()
      val joinCondBuffer = mutable. ListBuffer [(ql. FieldID , ql. FieldID )]()
      if (checkAndExtract(node, buffer, joinCondBuffer)) {
        // only match if the join is 3 tables join
        if (buffer.size == 3 ) {
          var check = true
          val interChangeableCond = buildInterchangeableJoinCond(joinCondBuffer)
          interChangeableCond.foreach { c =>
            check &= c.size == 3
          }
          check
        } else {
          false
        }
      } else {
        false
      }
    case _ => false
  }
}

우리의 규칙은 3 방향 조인과 일치하는 경우에만 일치합니다 (관련 테이블의 수는 3이어야하고 결합 조건은 tbl1.field1 = tbl2.field2 = tbl3.field3 와 같은 3 방향이어야합니다).

예를 들어,

tbl1
    JOIN tbl2 ON tbl1 . field1 = tbl2 . field2
    JOIN tbl3 ON tbl1 . field1 = tbl3 . field3

여기의 조인 명령문은 3 방향 조인이기 때문에 "일치"됩니다 ( tbl1 , tbl2 , tbl3 간의 조인이며 조건은 tbl1.field1 = tbl2.field2 = tbl3.field3 )입니다.

다음은 변환 코드입니다.

 override def transform ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : GroupExpression = {
  val plan = expression.plan. asInstanceOf [ Join ]
  val buffer = mutable. ListBuffer [ Scan ]()
  val joinCondBuffer = mutable. ListBuffer [(ql. FieldID , ql. FieldID )]()
  checkAndExtract(plan, buffer, joinCondBuffer)
  val interChangeableCond = buildInterchangeableJoinCond(joinCondBuffer)
  //
  val scans = buffer.toList
  implicit val ord : Ordering [ Scan ] = new Ordering [ Scan ] {
    override def compare ( x : Scan , y : Scan ) : Int = {
      val xStats = ctx.statsProvider.tableStats(x.table.id)
      val yStats = ctx.statsProvider.tableStats(y.table.id)
      xStats.estimatedTableSize.compareTo(yStats.estimatedTableSize)
    }
  }

  def getJoinCond ( left : Scan , right : Scan ) : Seq [(ql. FieldID , ql. FieldID )] = {
    val leftFields = interChangeableCond.flatMap { c =>
      c.filter(p => p.table == left.table)
    }
    val rightFields = interChangeableCond.flatMap { c =>
      c.filter(p => p.table == right.table)
    }
    if (leftFields.length != rightFields.length) {
      throw new Exception ( s " leftFields.length( ${leftFields.length} ) != rightFields.length( ${rightFields.length} ) " )
    } else {
      leftFields zip rightFields
    }
  }

  val sorted = scans.sorted
  val newPlan = Join (
    sorted( 0 ),
    Join (
      sorted( 1 ),
      sorted( 2 ),
      getJoinCond(sorted( 1 ), sorted( 2 ))
    ),
    getJoinCond(sorted( 0 ), sorted( 1 ))
  )
  ctx.memo.getOrCreateGroupExpression(newPlan)
}

여기에서 변환 코드는 예상 크기로 테이블을 재정렬합니다.

예를 들어, 300b, 100b, 200b의 추정 크기 인 3 개의 테이블 a, b, c가있는 경우, 조인트 문 A JOIN B JOIN C 가있는 경우 B JOIN C JOIN A 로 변환됩니다.

참고 : 이 코드에서는 테이블 통계를 사용하여 계획을 혁신 할 힌트를 제공했습니다. 실제적으로, 플래너는 모든 종류의 통계를 사용하여 테이블 크기, 행 크기, 널 카운트, 히스토그램 등과 같은 변환을 돕습니다.

예를 들어, 우리의 규칙을 시각화합니다

 그래프 TD
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#6
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#6-> 그룹#5

재주문 변환에 가입 한 후 결과

 그래프 TD
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
        Expr#8 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    Expr#8-> 그룹#2
    Expr#8-> 그룹#7
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#7
        Expr#7 [ "조인"]
    끝
    Expr#7-> 그룹#1
    Expr#7-> 그룹#3
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#6
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#6-> 그룹#5
    스타일 expr#8 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#7 스트로크 볼 : 4px, 스트로크 : 오렌지
    LinkStyle 2 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 6 스트로크 폭 : 4px, 스트로크 : 오렌지
    Linkstyle 3 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 7 스트로크 폭 : 4px, 스트로크 : 오렌지

tbl2 JOIN tbl1 JOIN tbl3 tbl1 JOIN tbl2 JOIN tbl3 변환에 의해 생성됩니다 (새로 추가 된 노드와 가장자리는 주황색 선으로 표시됨).

전체 구현은 x3tablejoinreorderbysize.scala를 참조하십시오

모든 변환을 정리합니다

이제 우리는 변형 규칙을 한 곳에 넣을 수 있습니다.

 private val transformationRules : Seq [ Seq [ TransformationRule ]] = Seq (
  Seq ( new ProjectionPushDown ),
  Seq ( new X3TableJoinReorderBySize )
)

동등한 그룹을 탐색하기 위해 실행하십시오

 for (r <- transformationRules.indices) {
  exploreGroup(ctx.rootGroup, transformationRules(r), r + 1 )
}

예를 들어, 계획

 그래프 TD
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#6
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#6-> 그룹#5

탐색 후이 그래프가 발생합니다

 그래프 TD
    하위 그래프 그룹#8
        expr#8 [ "스캔 tbl2 (id, field1, field2)"]
    끝
    하위 그래프 그룹#11
        Expr#11 [ "조인"]
        expr#14 [ "조인"]
    끝
    Expr#11-> 그룹#7
    Expr#11-> 그룹#10
    Expr#14-> 그룹#8
    Expr#14-> 그룹#12
    하위 그래프 그룹#2
        Expr#2 [ "스캔 TBL2"]
    끝
    하위 그래프 그룹#5
        Expr#5 [ "조인"]
        expr#16 [ "조인"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    Expr#16-> 그룹#2
    Expr#16-> 그룹#13
    하위 그래프 그룹#4
        Expr#4 [ "조인"]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#13
        Expr#15 [ "조인"]
    끝
    Expr#15-> 그룹#1
    Expr#15-> 그룹#3
    하위 그래프 그룹#7
        expr#7 [ "스캔 tbl1 (id, field1)"]
    끝
    하위 그래프 그룹#1
        expr#1 [ "스캔 TBL1"]
    끝
    하위 그래프 그룹#10
        expr#10 [ "조인"]
    끝
    Expr#10-> 그룹#8
    Expr#10-> 그룹#9
    하위 그래프 그룹#9
        expr#9 [ "스캔 tbl3 (id, field2)"]
    끝
    하위 그래프 그룹#3
        expr#3 [ "스캔 TBL3"]
    끝
    하위 그래프 그룹#12
        expr#13 [ "조인"]
    끝
    Expr#13-> 그룹#7
    Expr#13-> 그룹#9
    하위 그래프 그룹#6
        Expr#12 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
        Expr#6 [ "Project TBL1.ID, TBL1.Field1, TBL2.ID, TBL2.Field1, TBL2.Field2, TBL3.ID, TBL3.Field2, TBL3.Field2"]
    끝
    Expr#12-> 그룹#11
    Expr#6-> 그룹#5
    스타일 expr#12 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#8 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#10 스트로크 폭 : 4px, 스트로크 : 오렌지
    스타일 expr#13 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#14 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#11 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#9 스트로크 윈스 : 4px, 스트로크 : 오렌지
    스타일 expr#15 스트로크 폭 : 4px, 스트로크 : 오렌지
    스타일 expr#7 스트로크 볼 : 4px, 스트로크 : 오렌지
    스타일 expr#16 스트로크 윈스 : 4px, 스트로크 : 오렌지
    LinkStyle 0 스트로크 폭 : 4px, 스트로크 : 오렌지
    Linkstyle 15 스트로크 윈스 : 4px, 스트로크 : 오렌지
    Linkstyle 12 스트로크 폭 : 4px, 스트로크 : 오렌지
    Linkstyle 1 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 16 스트로크 윈스 : 4px, 스트로크 : 오렌지
    Linkstyle 13 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 2 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 6 스트로크 폭 : 4px, 스트로크 : 오렌지
    Linkstyle 3 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 10 스트로크 폭 : 4px, 스트로크 : 오렌지
    LinkStyle 7 스트로크 폭 : 4px, 스트로크 : 오렌지
    Linkstyle 14 스트로크 폭 : 4px, 스트로크 : 오렌지
    Linkstyle 11 스트로크 폭 : 4px, 스트로크 : 오렌지

자세한 내용은 Volcanoplanner.scala를 참조하십시오

최적화 단계

탐사 단계 후에, 우리는 이제 가능한 모든 계획을 포함하는 완전히 확장 된 트리를 가지고 있습니다. 이제 최적화 단계입니다.

이 단계에서는 루트 그룹에 가장 적합한 계획을 찾을 수 있습니다. 최적화 프로세스는 다음과 같이 설명됩니다.

각 그룹마다 최저 비용으로 표현하는 그룹을 선택하여 최상의 구현을 찾을 수 있습니다.
각 그룹 표현식에 대해 먼저 논리 계획에서 물리적 구현을 열거합니다. 그런 다음 각 물리적 구현에 대해 아동 그룹 비용을 사용하여 비용을 계산합니다.

여기 예입니다

 그래프 TD
    하위 그래프 그룹#2 [ "그룹#2 (비용 = 1)"]]]
        Expr#2 [ "Expr#2 (비용 = 1)"]]
    끝
    하위 그래프 그룹#5 [ "그룹#5 (비용 = 3)"]]
        expr#5 [ "expr#5 (cost = max (3,2) = 3"]
    끝
    Expr#5-> 그룹#1
    Expr#5-> 그룹#4
    하위 그래프 그룹#4 [ "그룹#4 (비용 = 2)"]]]
        expr#4 [ "expr#4 (cost = max (1,2) = 2)"]]
        expr#7 [ "expr#7 (비용 = 1+2 = 3)"]]]
    끝
    Expr#4-> 그룹#2
    Expr#4-> 그룹#3
    하위 그래프 그룹#1 [ "그룹#1 (비용 = 3)"]]
        expr#1 [ "expr#1 (비용 = 3)"]]
    끝
    하위 그래프 그룹#3 [ "그룹#3 (비용 = 2)"]]
        expr#3 [ "expr#3 (비용 = 2)"]
    끝
    하위 그래프 그룹#6 [ "그룹#6 (비용 = 4.5)"]]
        expr#6 [ "expr#6 (비용 = 3*1.5 = 4.5)"]]
    끝
    Expr#6-> 그룹#5
    하위 그래프 그룹#8 [ "그룹#8 (비용 = 1)"]]
        expr#8 [ "expr#8 (비용 = 1)"]]
    끝
    하위 그래프 그룹#9 [ "그룹#9 (비용 = 2)"]]]
        expr#9 [ "expr#9 (비용 = 2)"]
    끝
    Expr#7-> 그룹#8
    Expr#7-> 그룹#9

예를 들어, Expr#4 비용은 max 기능을 사용하여 아동 그룹 비용 ( Group#2 및 Group#3 )에 의해 계산됩니다. 또 다른 예는 Group#4 이며, 그 비용은 동등한 표현식의 비용 사이의 최소값을 계산하여 계산됩니다.

물리적 계획

최적화 단계의 목표는 탐색 된 그룹 표현식을 고려할 때 최고의 물리적 계획을 생성하는 것입니다. 물리적 계획을 다음과 같이 정의 할 수 있습니다.

 sealed trait PhysicalPlan {
  def operator () : Operator

  def children () : Seq [ PhysicalPlan ]

  def cost () : Cost

  def estimations () : Estimations

  def traits () : Set [ String ]
}

operator 실제 운영자로, 계획을 실행하는 데 사용 된 물리적 운영자입니다. 이는 나중에 계획을 다룰 것입니다. 그런 다음 children 아동 계획 노드 목록이며 비용 계산 과정에 참여하는 데 익숙합니다. 세 번째 속성은 cost , cost 은 객체 보유 비용 정보 (예 : CPU 비용, 메모리 비용, IO 비용 등)입니다. estimations 계획에 대한 추정 통계를 보유하는 속성 (행 카운트, 행 크기 등)이며 비용 계산에도 참여하고 있습니다. 마지막으로, traits 물리적 특성 세트로, 물리적 계획 생성 프로세스에 영향을 미치는 구현 규칙에 영향을 미칩니다.

다음으로 물리적 노드 클래스를 구현할 수 있습니다.

 case class Scan (
                 operator : Operator ,
                 cost : Cost ,
                 estimations : Estimations ,
                 traits : Set [ String ] = Set .empty
               ) extends PhysicalPlan {
  override def children () : Seq [ PhysicalPlan ] = Seq .empty // scan do not receive any child
}

case class Project (
                    operator : Operator ,
                    child : PhysicalPlan ,
                    cost : Cost ,
                    estimations : Estimations ,
                    traits : Set [ String ] = Set .empty
                  ) extends PhysicalPlan {
  override def children () : Seq [ PhysicalPlan ] = Seq (child)
}

case class Join (
                 operator : Operator ,
                 leftChild : PhysicalPlan ,
                 rightChild : PhysicalPlan ,
                 cost : Cost ,
                 estimations : Estimations ,
                 traits : Set [ String ] = Set .empty
               ) extends PhysicalPlan {
  override def children () : Seq [ PhysicalPlan ] = Seq (leftChild, rightChild)
}

전체 구현은 PhysicalPlan.scala를 참조하십시오

구현 규칙

최적화 단계에서 가장 먼저, 즉 구현 규칙을 구현해야합니다. 구현 규칙은 논리 계획에서 물리적 계획을 실행하지 않고 물리적 계획으로 전환하는 규칙입니다.

우리는 플래너에서 물리적 계획을 직접 실행하지 않기 때문에 대신 물리적 계획 빌더를 반환 할 것이며 각 노드의 비용 기능을 사용자 정의하는 것이 더 쉽습니다.

구현 규칙의 인터페이스는 다음과 같습니다.

 trait PhysicalPlanBuilder {
  def build ( children : Seq [ PhysicalPlan ]) : Option [ PhysicalPlan ]
}

trait ImplementationRule {
  def physicalPlanBuilders ( expression : GroupExpression )( implicit ctx : VolcanoPlannerContext ) : Seq [ PhysicalPlanBuilder ]
}

여기서 PhysicalPlanBuilder 는 아동의 신체 계획을 고려할 때 물리적 계획을 구축하는 데 사용되는 인터페이스입니다.

예를 들어 논리적 조인에는 2 개의 물리적 구현이 해시 조인 및 병합 조인입니다.

 그래프 TD
    어린이#1 [ "어린이#1"]
    어린이#2 [ "어린이#2"]
    어린이#3 [ "어린이#3"]
    어린이#4 [ "어린이#4"]
    hash_join [ "`해시 조인 
    비용 = F (비용 (자식#1), 비용 (자식#2))
    ` "]
    merge_join [ "`합병 조인
    비용 = g (비용 (자식#3), 비용 (자식#4))
    ` "]
    Hash_Join-> Child#1
    Hash_Join-> Child#2
    merge_join-> Child#3
    merge_join-> Child#4

the HASH JOIN cost is using function f() to calculate cost, and MERGE JOIN is using function g() to calculate cost, both are using its children as function parameters. So it's easier to code if we're returning just the phyiscal plan builder from the implementation rule instead of the physical plan.

Finding the best plan

As we've said before, the optimization process is described as following:

For each group, we will find the best implementation by choosing the group expressing with the lowest cost
For each group expression, first we will enumerate the physical implementations from the logical plan. Then for each physical implementation, we will calculate its cost using its child group costs.

And here is the code:

 private def implementGroup ( group : Group , combinedRule : ImplementationRule )(
  implicit ctx : VolcanoPlannerContext
) : GroupImplementation = {
  group.implementation match {
    case Some (implementation) => implementation
    case None =>
      var bestImplementation = Option .empty[ GroupImplementation ]
      group.equivalents.foreach { equivalent =>
        val physicalPlanBuilders = combinedRule.physicalPlanBuilders(equivalent)
        val childPhysicalPlans = equivalent.children.map { child =>
          val childImplementation = implementGroup(child, combinedRule)
          child.implementation = Option (childImplementation)
          childImplementation.physicalPlan
        }
        // calculate the implementation, and update the best cost for group
        physicalPlanBuilders.flatMap(_.build(childPhysicalPlans)).foreach { physicalPlan =>
          val cost = physicalPlan.cost()
          bestImplementation match {
            case Some (currentBest) =>
              if (ctx.costModel.isBetter(currentBest.cost, cost)) {
                bestImplementation = Option (
                  GroupImplementation (
                    physicalPlan = physicalPlan,
                    cost = cost,
                    selectedEquivalentExpression = equivalent
                  )
                )
              }
            case None =>
              bestImplementation = Option (
                GroupImplementation (
                  physicalPlan = physicalPlan,
                  cost = cost,
                  selectedEquivalentExpression = equivalent
                )
              )
          }
        }
      }
      bestImplementation.get
  }
}

This code is an exhaustive search code, which is using recursive function to traverse all nodes. At each node (group), the function is called once to get its best plan while also calculate the optimal cost of that group.

Finally, the best plan for our query is the best plan of the root group

 val implementationRules = new ImplementationRule {

  override def physicalPlanBuilders (
                                     expression : GroupExpression
                                   )( implicit ctx : VolcanoPlannerContext ) : Seq [ PhysicalPlanBuilder ] = {
    expression.plan match {
      case node @ Scan (_, _) => implement. Scan (node)
      case node @ Project (_, _) => implement. Project (node)
      case node @ Join (_, _, _) => implement. Join (node)
    }
  }
}

ctx.rootGroup.implementation = Option (implementGroup(ctx.rootGroup, implementationRules))

See VolcanoPlanner.scala for full implementation

Here is an example of the plan after optimization, it's shown the selected logical node, the selected physical operator, and the estimated cost

 그래프 TD
    Group#6["
    Group #6
Selected: PROJECT tbl1.id, tbl1.field1, tbl2.id, tbl2.field1, tbl2.field2, tbl3.id, tbl3.field2, tbl3.field2
Operator: ProjectOperator
Cost: Cost(cpu=641400.00, mem=1020400012.00, time=1000000.00)
"]
Group#6 --> Group#11
Group#11["
Group #11
Selected: JOIN
Operator: HashJoinOperator
Cost: Cost(cpu=641400.00, mem=1020400012.00, time=1000000.00)
"]
Group#11 --> Group#7
Group#11 --> Group#10
Group#7["
Group #7
Selected: SCAN tbl1 (id, field1)
Operator: NormalScanOperator
Cost: Cost(cpu=400.00, mem=400000.00, time=1000.00)
"]
Group#10["
Group #10
Selected: JOIN
Operator: MergeJoinOperator
Traits: SORTED
Cost: Cost(cpu=640000.00, mem=20000012.00, time=1100000.00)
"]
Group#10 --> Group#8
Group#10 --> Group#9
Group#8["
Group #8
Selected: SCAN tbl2 (id, field1, field2)
Operator: NormalScanOperator
Traits: SORTED
Cost: Cost(cpu=600000.00, mem=12.00, time=1000000.00)
"]
Group#9["
Group #9
Selected: SCAN tbl3 (id, field2)
Operator: NormalScanOperator
Traits: SORTED
Cost: Cost(cpu=40000.00, mem=20000000.00, time=100000.00)
"]

Implement the implementation rules

Next, we will implement some implementation rules.

프로젝트

The first, easiest one is the implementation rule of logical PROJECT

 object Project {

  def apply ( node : logicalplan. Project )( implicit ctx : VolcanoPlannerContext ) : Seq [ PhysicalPlanBuilder ] = {
    Seq (
      new ProjectionImpl (node.fields)
    )
  }
}

class ProjectionImpl ( projection : Seq [ql. FieldID ]) extends PhysicalPlanBuilder {

  override def build ( children : Seq [ PhysicalPlan ]) : Option [ PhysicalPlan ] = {
    val child = children.head
    val selfCost = Cost (
      estimatedCpuCost = 0 ,
      estimatedMemoryCost = 0 ,
      estimatedTimeCost = 0
    ) // assuming the cost of projection is 0
    val cost = Cost (
      estimatedCpuCost = selfCost.estimatedCpuCost + child.cost().estimatedCpuCost,
      estimatedMemoryCost = selfCost.estimatedMemoryCost + child.cost().estimatedMemoryCost,
      estimatedTimeCost = selfCost.estimatedTimeCost + child.cost().estimatedTimeCost
    )
    val estimations = Estimations (
      estimatedLoopIterations = child.estimations().estimatedLoopIterations,
      estimatedRowSize = child.estimations().estimatedRowSize // just guessing the value
    )
    Some (
      Project (
        operator = ProjectOperator (projection, child.operator()),
        child = child,
        cost = cost,
        estimations = estimations,
        traits = child.traits()
      )
    )
  }
}

The implementation rule for logical PROJECT, is returning one physical plan builder ProjectionImpl . ProjectionImpl cost calculation is simple, it just inherits the cost from the child node (because the projection is actually not doing any intensive operation). Beside that, it also updates the estimation (in this code, estimation is also inherit from the child node)

See Project.scala for full implementation

가입하다

Writing implementation rule for logical JOIN is way harder than PROJECTION.

One first reason is, a logical JOIN has many physical implementation, such as HASH JOIN, MERGE JOIN, BROADCAST JOIN, etc.

The second reason is, estimating cost for physical JOIN is also hard, because it depends on lots of factors such as row count, row size, data histogram, indexes, data layout, etc.

So, to keep everything simple in this guide, I will only implement 2 physical JOIN: HASH JOIN and MERGE JOIN. The cost estimation functions are fictional (just to show how it works, I'm not trying to correct it). And in the MERGE JOIN, all data is assuming to be sorted by join key.

Here is the code:

 object Join {

  def apply ( node : logicalplan. Join )( implicit ctx : VolcanoPlannerContext ) : Seq [ PhysicalPlanBuilder ] = {
    val leftFields = node.on.map(_._1).map(f => s " ${f.table.id} . ${f.id} " )
    val rightFields = node.on.map(_._2).map(f => s " ${f.table.id} . ${f.id} " )
    Seq (
      new HashJoinImpl (leftFields, rightFields),
      new MergeJoinImpl (leftFields, rightFields)
    )
  }
}

The HASH JOIN:

 class HashJoinImpl ( leftFields : Seq [ String ], rightFields : Seq [ String ]) extends PhysicalPlanBuilder {

  private def viewSize ( plan : PhysicalPlan ) : Long = {
    plan.estimations().estimatedLoopIterations * plan.estimations().estimatedRowSize
  }

  // noinspection ZeroIndexToHead,DuplicatedCode
  override def build ( children : Seq [ PhysicalPlan ]) : Option [ PhysicalPlan ] = {
    // reorder the child nodes, the left child is the child with smaller view size (smaller than the right child if we're store all of them in memory)
    val (leftChild, rightChild) = if (viewSize(children( 0 )) < viewSize(children( 1 ))) {
      (children( 0 ), children( 1 ))
    } else {
      (children( 1 ), children( 0 ))
    }
    val estimatedLoopIterations = Math .max(
      leftChild.estimations().estimatedLoopIterations,
      rightChild.estimations().estimatedLoopIterations
    ) // just guessing the value
    val estimatedOutRowSize = leftChild.estimations().estimatedRowSize + rightChild.estimations().estimatedRowSize
    val selfCost = Cost (
      estimatedCpuCost = leftChild.estimations().estimatedLoopIterations, // cost to hash all record from the smaller view
      estimatedMemoryCost = viewSize(leftChild), // hash the smaller view, we need to hold the hash table in memory
      estimatedTimeCost = rightChild.estimations().estimatedLoopIterations
    )
    val childCosts = Cost (
      estimatedCpuCost = leftChild.cost().estimatedCpuCost + rightChild.cost().estimatedCpuCost,
      estimatedMemoryCost = leftChild.cost().estimatedMemoryCost + rightChild.cost().estimatedMemoryCost,
      estimatedTimeCost = 0
    )
    val estimations = Estimations (
      estimatedLoopIterations = estimatedLoopIterations,
      estimatedRowSize = estimatedOutRowSize
    )
    val cost = Cost (
      estimatedCpuCost = selfCost.estimatedCpuCost + childCosts.estimatedCpuCost,
      estimatedMemoryCost = selfCost.estimatedMemoryCost + childCosts.estimatedMemoryCost,
      estimatedTimeCost = selfCost.estimatedTimeCost + childCosts.estimatedTimeCost
    )
    Some (
      Join (
        operator = HashJoinOperator (
          leftChild.operator(),
          rightChild.operator(),
          leftFields,
          rightFields
        ),
        leftChild = leftChild,
        rightChild = rightChild,
        cost = cost,
        estimations = estimations,
        traits = Set .empty // don't inherit trait from children since we're hash join
      )
    )
  }
}

We can see that the cost function of HASH JOIN is composed of its children costs and estimations

 val selfCost = Cost (
  estimatedCpuCost = leftChild.estimations().estimatedLoopIterations, // cost to hash all record from the smaller view
  estimatedMemoryCost = viewSize(leftChild), // hash the smaller view, we need to hold the hash table in memory
  estimatedTimeCost = rightChild.estimations().estimatedLoopIterations
)

val childCosts = Cost (
  estimatedCpuCost = leftChild.cost().estimatedCpuCost + rightChild.cost().estimatedCpuCost,
  estimatedMemoryCost = leftChild.cost().estimatedMemoryCost + rightChild.cost().estimatedMemoryCost,
  estimatedTimeCost = 0
)

val estimations = Estimations (
  estimatedLoopIterations = estimatedLoopIterations,
  estimatedRowSize = estimatedOutRowSize
)

val cost = Cost (
  estimatedCpuCost = selfCost.estimatedCpuCost + childCosts.estimatedCpuCost,
  estimatedMemoryCost = selfCost.estimatedMemoryCost + childCosts.estimatedMemoryCost,
  estimatedTimeCost = selfCost.estimatedTimeCost + childCosts.estimatedTimeCost
)

Next, the MERGE JOIN:

 class MergeJoinImpl ( leftFields : Seq [ String ], rightFields : Seq [ String ]) extends PhysicalPlanBuilder {

  // noinspection ZeroIndexToHead,DuplicatedCode
  override def build ( children : Seq [ PhysicalPlan ]) : Option [ PhysicalPlan ] = {
    val (leftChild, rightChild) = (children( 0 ), children( 1 ))
    if (leftChild.traits().contains( " SORTED " ) && rightChild.traits().contains( " SORTED " )) {
      val estimatedTotalRowCount =
        leftChild.estimations().estimatedLoopIterations +
          rightChild.estimations().estimatedLoopIterations
      val estimatedLoopIterations = Math .max(
        leftChild.estimations().estimatedLoopIterations,
        rightChild.estimations().estimatedLoopIterations
      ) // just guessing the value
      val estimatedOutRowSize = leftChild.estimations().estimatedRowSize + rightChild.estimations().estimatedRowSize
      val selfCost = Cost (
        estimatedCpuCost = 0 , // no additional cpu cost, just scan from child iterator
        estimatedMemoryCost = 0 , // no additional memory cost
        estimatedTimeCost = estimatedTotalRowCount
      )
      val childCosts = Cost (
        estimatedCpuCost = leftChild.cost().estimatedCpuCost + rightChild.cost().estimatedCpuCost,
        estimatedMemoryCost = leftChild.cost().estimatedMemoryCost + rightChild.cost().estimatedMemoryCost,
        estimatedTimeCost = 0
      )
      val estimations = Estimations (
        estimatedLoopIterations = estimatedLoopIterations,
        estimatedRowSize = estimatedOutRowSize
      )
      val cost = Cost (
        estimatedCpuCost = selfCost.estimatedCpuCost + childCosts.estimatedCpuCost,
        estimatedMemoryCost = selfCost.estimatedMemoryCost + childCosts.estimatedMemoryCost,
        estimatedTimeCost = selfCost.estimatedTimeCost + childCosts.estimatedTimeCost
      )
      Some (
        Join (
          operator = MergeJoinOperator (
            leftChild.operator(),
            rightChild.operator(),
            leftFields,
            rightFields
          ),
          leftChild = leftChild,
          rightChild = rightChild,
          cost = cost,
          estimations = estimations,
          traits = leftChild.traits() ++ rightChild.traits()
        )
      )
    } else {
      None
    }
  }
}

Same with HASH JOIN, MERGE JOIN also uses its children costs and estimations to calculate its cost, but with different formulla:

 val selfCost = Cost (
  estimatedCpuCost = 0 , // no additional cpu cost, just scan from child iterator
  estimatedMemoryCost = 0 , // no additional memory cost
  estimatedTimeCost = estimatedTotalRowCount
)

val childCosts = Cost (
  estimatedCpuCost = leftChild.cost().estimatedCpuCost + rightChild.cost().estimatedCpuCost,
  estimatedMemoryCost = leftChild.cost().estimatedMemoryCost + rightChild.cost().estimatedMemoryCost,
  estimatedTimeCost = 0
)

val estimations = Estimations (
  estimatedLoopIterations = estimatedLoopIterations,
  estimatedRowSize = estimatedOutRowSize
)

val cost = Cost (
  estimatedCpuCost = selfCost.estimatedCpuCost + childCosts.estimatedCpuCost,
  estimatedMemoryCost = selfCost.estimatedMemoryCost + childCosts.estimatedMemoryCost,
  estimatedTimeCost = selfCost.estimatedTimeCost + childCosts.estimatedTimeCost
)

See HashJoinImpl.scala and MergeJoinImpl.scala for full implementation

기타

You can see other rules and physical plan builders here:

구현하다
빌더

Putting all pieces together

Now, after done implementing the implementation rules, now we can find our best plan. Let's start over from the user query

 SELECT tbl1 . id ,
       tbl1 . field1 ,
       tbl2 . id ,
       tbl2 . field1 ,
       tbl2 . field2 ,
       tbl3 . id ,
       tbl3 . field2 ,
       tbl3 . field2
FROM tbl1
         JOIN tbl2 ON tbl1 . id = tbl2 . id
         JOIN tbl3 ON tbl2 . id = tbl3 . id

will be converted to the logical plan

 그래프 TD
    1326583549["PROJECT tbl1.id, tbl1.field1, tbl2.id, tbl2.field1, tbl2.field2, tbl3.id, tbl3.field2, tbl3.field2"];
    -425111028["JOIN"];
    -349388609["SCAN tbl1"];
    1343755644["JOIN"];
    -1043437086["SCAN tbl2"];
    -1402686787["SCAN tbl3"];
    1326583549 --> -425111028;
    -425111028 --> -349388609;
    -425111028 --> 1343755644;
    1343755644 --> -1043437086;
    1343755644 --> -1402686787;

After exploration phase, it will generate lots of equivalent plans

 그래프 TD
    subgraph Group#8
        Expr#8["SCAN tbl2 (id, field1, field2)"]
    끝
    subgraph Group#11
        Expr#11["JOIN"]
        Expr#14["JOIN"]
    끝
    Expr#11 --> Group#7
    Expr#11 --> Group#10
    Expr#14 --> Group#8
    Expr#14 --> Group#12
    subgraph Group#2
        Expr#2["SCAN tbl2"]
    끝
    subgraph Group#5
        Expr#5["JOIN"]
        Expr#16["JOIN"]
    끝
    Expr#5 --> Group#1
    Expr#5 --> Group#4
    Expr#16 --> Group#2
    Expr#16 --> Group#13
    subgraph Group#4
        Expr#4["JOIN"]
    끝
    Expr#4 --> Group#2
    Expr#4 --> Group#3
    subgraph Group#13
        Expr#15["JOIN"]
    끝
    Expr#15 --> Group#1
    Expr#15 --> Group#3
    subgraph Group#7
        Expr#7["SCAN tbl1 (id, field1)"]
    끝
    subgraph Group#1
        Expr#1["SCAN tbl1"]
    끝
    subgraph Group#10
        Expr#10["JOIN"]
    끝
    Expr#10 --> Group#8
    Expr#10 --> Group#9
    subgraph Group#9
        Expr#9["SCAN tbl3 (id, field2)"]
    끝
    subgraph Group#3
        Expr#3["SCAN tbl3"]
    끝
    subgraph Group#12
        Expr#13["JOIN"]
    끝
    Expr#13 --> Group#7
    Expr#13 --> Group#9
    subgraph Group#6
        Expr#12["PROJECT tbl1.id, tbl1.field1, tbl2.id, tbl2.field1, tbl2.field2, tbl3.id, tbl3.field2, tbl3.field2"]
        Expr#6["PROJECT tbl1.id, tbl1.field1, tbl2.id, tbl2.field1, tbl2.field2, tbl3.id, tbl3.field2, tbl3.field2"]
    끝
    Expr#12 --> Group#11
    Expr#6 --> Group#5
    style Expr#12 stroke-width: 4px, stroke: orange
    style Expr#8 stroke-width: 4px, stroke: orange
    style Expr#10 stroke-width: 4px, stroke: orange
    style Expr#13 stroke-width: 4px, stroke: orange
    style Expr#14 stroke-width: 4px, stroke: orange
    style Expr#11 stroke-width: 4px, stroke: orange
    style Expr#9 stroke-width: 4px, stroke: orange
    style Expr#15 stroke-width: 4px, stroke: orange
    style Expr#7 stroke-width: 4px, stroke: orange
    style Expr#16 stroke-width: 4px, stroke: orange
    linkStyle 0 stroke-width: 4px, stroke: orange
    linkStyle 15 stroke-width: 4px, stroke: orange
    linkStyle 12 stroke-width: 4px, stroke: orange
    linkStyle 1 stroke-width: 4px, stroke: orange
    linkStyle 16 stroke-width: 4px, stroke: orange
    linkStyle 13 stroke-width: 4px, stroke: orange
    linkStyle 2 stroke-width: 4px, stroke: orange
    linkStyle 6 stroke-width: 4px, stroke: orange
    linkStyle 3 stroke-width: 4px, stroke: orange
    linkStyle 10 stroke-width: 4px, stroke: orange
    linkStyle 7 stroke-width: 4px, stroke: orange
    linkStyle 14 stroke-width: 4px, stroke: orange
    linkStyle 11 stroke-width: 4px, stroke: orange

And the at optimization phase, a final best plan is chose

 그래프 TD
    Group#6["
    Group #6
Selected: PROJECT tbl1.id, tbl1.field1, tbl2.id, tbl2.field1, tbl2.field2, tbl3.id, tbl3.field2, tbl3.field2
Operator: ProjectOperator
Cost: Cost(cpu=641400.00, mem=1020400012.00, time=1000000.00)
"]
Group#6 --> Group#11
Group#11["
Group #11
Selected: JOIN
Operator: HashJoinOperator
Cost: Cost(cpu=641400.00, mem=1020400012.00, time=1000000.00)
"]
Group#11 --> Group#7
Group#11 --> Group#10
Group#7["
Group #7
Selected: SCAN tbl1 (id, field1)
Operator: NormalScanOperator
Cost: Cost(cpu=400.00, mem=400000.00, time=1000.00)
"]
Group#10["
Group #10
Selected: JOIN
Operator: MergeJoinOperator
Traits: SORTED
Cost: Cost(cpu=640000.00, mem=20000012.00, time=1100000.00)
"]
Group#10 --> Group#8
Group#10 --> Group#9
Group#8["
Group #8
Selected: SCAN tbl2 (id, field1, field2)
Operator: NormalScanOperator
Traits: SORTED
Cost: Cost(cpu=600000.00, mem=12.00, time=1000000.00)
"]
Group#9["
Group #9
Selected: SCAN tbl3 (id, field2)
Operator: NormalScanOperator
Traits: SORTED
Cost: Cost(cpu=40000.00, mem=20000000.00, time=100000.00)
"]

Bonus: query execution

Now we've done building a functional query planner which can optimize the query from user, but our query plan could not run by itself. So it's the reason why now we will implement the query processor to test out our query plan.

Basically the query process receive input from the query planner, and execute them

 graph LR
    plan(("Physical Plan"))
    storage[("Storage Layer")]
    processor["Query Processor"]
    plan -- execute --> processor
    storage -- fetch --> processor

Volcano/Iterator model

Volcano/iterator model is the query processing model that is widely used in many DBMS. It is a pipeline architecture, which means that the data is processed in stages, with each stage passing the output of the previous stage to the next stage.

Each stage in the pipeline is represented by an operator. Operators are functions that perform a specific operation on the data, such as selecting rows, filtering rows, or aggregating rows.

Usually, operator can be formed directly from the query plan. For example, the query

 SELECT field_1
FROM tbl
WHERE field = 1

will have the plan

 그래프 TD
  project["PROJECT: field_1"]
  scan["SCAN: tbl"]
  filter["FILTER: field = 1"]
  project --> scan
  filter --> project

will create a chain of operators like this:

 scan = {
    next() // fetch next row from table "tbl"
}

project = {
    next() = {
        next_row = scan.next() // fetch next row from scan operator
        projected = next_row["field_1"]
        return projected
    }
}

filter = {
    next() = {
        next_row = {}
        do {
            next_row = project.next() // fetch next row from project operator
        } while (next_row["field"] != 1)
        return next_row
    }
}

results = []
while (row = filter.next()) {
    results.append(row)
}

notes : this pseudo code did not handle for end of result stream

The operators

The basic interface of an operator is described as following:

 trait Operator {
  def next () : Option [ Seq [ Any ]]
}

See Operator.scala for full implementation of all operators

Testing a simple query

Let's define a query

 SELECT emp . id ,
       emp . code ,
       dept . dept_name ,
       emp_info . name ,
       emp_info . origin
FROM emp
         JOIN dept ON emp . id = dept . emp_id
         JOIN emp_info ON dept . emp_id = emp_info . id

with some data and stats

 val table1 : Datasource = Datasource (
  table = " emp " ,
  catalog = TableCatalog (
    Seq (
      " id " -> classOf [ String ],
      " code " -> classOf [ String ]
    ),
    metadata = Map ( " sorted " -> " true " ) // assumes rows are already sorted by id
  ),
  rows = Seq (
    Seq ( " 1 " , " Emp A " ),
    Seq ( " 2 " , " Emp B " ),
    Seq ( " 3 " , " Emp C " )
  ),
  stats = TableStats (
    estimatedRowCount = 3 ,
    avgColumnSize = Map ( " id " -> 10 , " code " -> 32 )
  )
)

val table2 : Datasource = Datasource (
  table = " dept " ,
  catalog = TableCatalog (
    Seq (
      " emp_id " -> classOf [ String ],
      " dept_name " -> classOf [ String ]
    ),
    metadata = Map ( " sorted " -> " true " ) // assumes rows are already sorted by emp_id (this is just a fake trait to demonstrate how trait works)
  ),
  rows = Seq (
    Seq ( " 1 " , " Dept 1 " ),
    Seq ( " 1 " , " Dept 2 " ),
    Seq ( " 2 " , " Dept 3 " ),
    Seq ( " 3 " , " Dept 3 " )
  ),
  stats = TableStats (
    estimatedRowCount = 4 ,
    avgColumnSize = Map ( " emp_id " -> 10 , " dept_name " -> 255 )
  )
)

val table3 : Datasource = Datasource (
  table = " emp_info " ,
  catalog = TableCatalog (
    Seq (
      " id " -> classOf [ String ],
      " name " -> classOf [ String ],
      " origin " -> classOf [ String ]
    ),
    metadata = Map ( " sorted " -> " true " ) // assumes rows are already sorted by id (this is just a fake trait to demonstrate how trait works)
  ),
  rows = Seq (
    Seq ( " 1 " , " AAAAA " , " Country A " ),
    Seq ( " 2 " , " BBBBB " , " Country A " ),
    Seq ( " 3 " , " CCCCC " , " Country B " )
  ),
  stats = TableStats (
    estimatedRowCount = 3 ,
    avgColumnSize = Map ( " id " -> 10 , " name " -> 255 , " origin " -> 255 )
  )
)

The cost model is optimized for CPU

 val costModel : CostModel = ( currentCost : Cost , newCost : Cost ) => {
  currentCost.estimatedCpuCost > newCost.estimatedCpuCost
}

Now, executing the query by running this code:

 val planner = new VolcanoPlanner
QueryParser .parse(query) match {
  case Left (err) => throw err
  case Right (parsed) =>
    val operator = planner.getPlan(parsed)
    val result = Utils .execute(operator)
    // print result
    println(result._1.mkString( " , " ))
    result._2.foreach(row => println(row.mkString( " , " )))
}

it will print:

 emp.id,emp.code,dept.dept_name,emp_info.name,emp_info.origin
1,Emp A,Dept 1,AAAAA,Country A
1,Emp A,Dept 2,AAAAA,Country A
2,Emp B,Dept 3,BBBBB,Country A
3,Emp C,Dept 3,CCCCC,Country B

Voila, We've done building a fully functional query planner and query engine :). You can start writing one for your own, good luck

See Demo.scala for full demo code

감사해요

Thanks for reading this, this guide is quite long, and not fully correct, but I've tried my best to write it as understandably as possible ?

확장하다