Seek and then move: Interpretation of the mystery of line number estimation and path generation in the generation of data warehouse plans

Seek and then move: Interpretation of the mystery of line number estimation and path generation in the generation of data warehouse plans

Anything must be planned and deployed, and prepared, so as to be conducive to the success of this matter, and we must not act recklessly. Similarly, the GaussDB (DWS) query statement will be executed according to the predetermined plan. In the case of a given hardware environment, the execution speed depends on the quality of the plan. So how is the plan of a query statement formulated? This article will Interpret the mystery of line number estimation and path generation in plan generation for everyone.

This article is shared from HUAWEI CLOUD Community "The Secret of GaussDB (DWS) Project Generation Principle (1)" , the original author: Jugg.

GaussDB (DWS) optimizer has two plan generation methods, one is dynamic programming, and the other is genetic algorithm. The former is the most used method and is also the focus of this series of articles. Generally speaking, after a SQL statement generates a specific structure of the query tree (QueryTree) through the syntax tree (ParseTree), it starts from the QueryTree and then enters the core part of the plan generation. There are some key steps:

1. Set the initial degree of parallelism (Dop)

2. Query Rewrite

3. Estimate the number of rows in the base table

4. Estimated association table (JoinRel)

5. Path generation, generate optimal Path

6. Create a Plan node for execution by the optimal Path

7. Adjust the optimal parallelism

This article mainly focuses on 3, 4, and 5. These steps have a relatively large impact on the generation of a plan. They mainly involve line number estimation, path selection method, and cost estimation (or cost estimation). Cost estimation is the basis for path selection. The child corresponds to a set of models and is a relatively independent part, which will be explained in subsequent articles. Plan Hint will intersperse the generation of interference plans in 3, 4, 5 and many other steps. For a detailed introduction, readers can refer to the blog post: GaussDB (DWS) Performance Tuning Series Implementation Part 6: Eighteen Martial Arts Plan Hint Application .

First look at a simple query statement:

select count (*) from t1 join t2 on t1.c2 = t2.c2 and t1.c1> 100 and (t1.c3 is not null or t2.c3 is not null); duplicated code

The execution plan given by the GaussDB (DWS) optimizer is as follows:

postgres=# explain verbose select count(*) from t1 join t2 on t1.c2 = t2.c2 and t1.c1> 100 and (t1.c3 is not null or t2.c3 is not null); QUERY PLAN -------------------------------------------------- -------------------------------------------------- ---------- id | operation | E-rows | E-distinct | E-memory | E-width | E-costs ----+--------------------------------------------- -----+--------+------------+----------+---------+- -------- 1 | -> Aggregate | 1 | | | 8 | 111.23 2 | -> Streaming (type: GATHER) | 4 | | | 8 | 111.23 3 | -> Aggregate | 4 | | 1MB | 8 | 101.23 4 | -> Hash Join (5,7) | 3838 | | 1MB | 0 | 98.82 5 | -> Streaming(type: REDISTRIBUTE) | 1799 | 112 | 2MB | 10 | 46.38 6 | -> Seq Scan on test.t1 | 1799 | | 1MB | 10 | 9.25 7 | -> Hash | 1001 | 25 | 16MB | 8 | 32.95 8 | -> Streaming(type: REDISTRIBUTE) | 1001 | | 2MB | 8 | 32.95 9 | -> Seq Scan on test.t2 | 1001 | | 1MB | 8 | 4.50 Predicate Information (identified by plan id) -------------------------------------------------- --------------- 4 --Hash Join (5,7) Hash Cond: (t1.c2 = t2.c2) Join Filter: ((t1.c3 IS NOT NULL) OR (t2.c3 IS NOT NULL)) 6 --Seq Scan on test.t1 Filter: (t1.c1> 100) Copy code

Usually the Plan of a query statement starts from the base table. In this example, the base table t1 has multiple filter conditions. From the plan, some conditions are pushed down to the base table, and some conditions are not pushed down, so its rows How is the number estimated? We first start with the estimation of the number of rows in the base table.

1. Estimation of the number of rows in the base table

If there is no filter condition on the base table or the filter condition cannot be pushed down to the base table, the estimated number of rows in the base table is the number of rows displayed in the statistics, and no special processing is required. This section considers the filtering conditions pushed down to the base table, which can be divided into single-column and multiple-column cases.

1. The idea of single-column filter condition estimation

The estimation of the number of rows in the base table currently mainly relies on statistical information. The statistical information is some statistical average information about the sample data of the table that is triggered by Analyze to be generated before the plan. For example, some statistical information of the t1 table is as follows:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct, avg_width, most_common_vals, most_common_freqs from pg_stats where tablename ='t1'; tablename | attname | null_frac | n_distinct | n_dndistinct | avg_width | most_common_vals | most_common_freqs -----------+---------+-----------+------------+--- -----------+-----------+------------------+------- ------------ t1 | c1 | 0 | -.5 | -.5 | 4 | | t1 | c2 | 0 | -.25 | -.431535 | 4 | | t1 | c3 | .5 | 1 | 1 | 6 | {gauss} | {.5} t1 | c4 | .5 | 1 | 1 | 8 | {gaussdb} | {.5} (4 rows) Copy code

The meaning of each field is as follows:

  • null_frac: the proportion of null values
  • n_distinct: global distinct value, value rule: when a positive number represents the distinct value, when a negative number, its absolute value represents the ratio of the distinct value to the number of rows
  • n_dndistinct: the distinct value on DN1, the value rule is similar to n_distinct
  • avg_width: the average width of the field
  • most_common_vals: list of high frequency values
  • most_common_freqs: A list of the proportions of high frequency values, corresponding to most_common_vals

The specific data distribution can be roughly judged from the above statistical information, such as the t1.c1 column, the average width is 4, the average repeatability of each data is 2, and there is no null value, and no value is significantly higher than the other The value, that is, most_common_vals (MCV for short) is empty. This can also be understood as the data is basically evenly distributed. For these evenly distributed data, a certain amount of buckets are allocated, and the data is divided by the equal height, and each bucket is recorded The boundary, commonly known as histogram (Histogram), means that there is an equal amount of data in each bucket.

With this basic information, the number of rows in the base table can be roughly estimated. For example, the filter condition "t1.c1>100" on the t1 table, combined with the uniform distribution characteristics of the t1.c1 column and the specific data distribution:

postgres=# select histogram_bounds from pg_stats where tablename ='t1' and attname ='c1'; histogram_bounds -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ----------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ------ {1,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,390,400,410,420,430,440,450,460,470,610 0,630,640,650,660,670,680,690,700,710,720,730,740,750,760,770,780,790,800,810,820,830,840,850,860,870,880,890,900,910,920,930,940,950,960,970,980,990,1000} (1 row) Copy code

It can be seen that the data in column t1.c1 is distributed between 1 and 1000, and the amount of data contained in each of the two boundaries is roughly the same (here is the statistical boundary based on sample statistics), first find 100 in this histogram Approximate location, here it is the boundary of a bucket (sometimes inside the bucket), then the proportion of data with t1.c1>100 is approximately the proportion of the number of buckets after the boundary 100, and the proportion here is also called It is the selection rate, that is, the proportion of the selected data after passing this condition, so the number of rows after filtering by "t1.c >100" can be estimated.

The above is the basic idea of estimating the number of rows in the base table. normally

There are statistics:

1. Equivalence conditions 1) Compare MCV, if the filter conditions are met, the selection rate (ie most_common_freqs) is accumulated; 2) For Histogram data, the selection rate is roughly estimated according to the number of distinct values;

2. Range conditions 1) Compare the MCV data, if the filter conditions are met, the selection rate is accumulated; 2) For the Histogram data, estimate the selection rate according to the boundary position;

3. Unequal value conditions: can be transformed into equal value conditions estimation

No statistics:

1. Equivalent conditions: For example, the filter condition is: "substr(c3, 1, 5) ='gauss'", the c3 column has statistical information, but substr(c3, 1, 5) has no statistical information. How to estimate this condition selection rate? A simple idea is that if the distinct value of substr(c3, 1, 5) is known, you can roughly assume that the repeatability of each distinct value is the same, so the selection rate can also be estimated; in GaussDB (DWS), You can turn on the expression distinct value estimation function by setting cost_model_version=1;

2. Range conditions: At this time, only knowing the distinct value of substr(c3, 1, 5) cannot predict the selection rate. For expressions that cannot be estimated, you can set the corresponding distinct value through qual_num_distinct;

3. Unequal value conditions: can be transformed into equal value conditions estimation

2. Multi-column filter condition estimation idea

For example, the t1 table has two filter conditions: t1.c1 = 100 and t1.c3 ='gauss', then how to estimate the comprehensive selection rate of the two columns? In GaussDB (DWS), there are two general methods:

Only a single column of statistics

In this case, first calculate the selection rate of each filter condition according to a single column of statistical information, and then select a method to combine these selection rates. The selection method can be specified by setting cost_param. Why do we need to choose a combination method? Because in the actual model, there is a certain correlation between columns. In some scenarios, the correlation is relatively strong, and some scenes are relatively weak. The strength of the correlation determines the final number of rows.

For the meaning and usage introduction of this parameter, please refer to: GaussDB (DWS) Performance Tuning Series Practical Part 5: Path Intervention of Eighteen Martial Arts .

There are multiple columns of combined statistics

If the combined statistical information of the filtered combination column has been collected, the optimizer will first use the combined statistical information to estimate the number of rows. The basic idea of the estimation is the same as that of a single column, that is, the combination of multiple columns is regarded as a "single column", and then more Column statistics to estimate.

For example, multi-column statistical information: ((c1, c2, c4)), ((c1, c2)), double brackets indicate a group of multi-column statistical information:

1. If the conditions are: c1 = 7 and c2 = 3 and c4 = 5, use ((c1, c2, c4))

2. If the conditions are: c1 = 7 and c2 = 3, use ((c1, c2))

3. If the conditions are: c1 = 7 and c2 = 3 and c5 = 6, then use ((c1, c2))

The general principle of multi-column condition matching multi-column statistical information is:

1. The column combination of multiple columns of statistical information needs to be included by the column combination of the filter condition;

2. Among all the multi-column statistical information that meets "Condition 1", select the multi-column statistical information that "has the largest intersection with the column combination of the filter condition".

For filter conditions that cannot match multiple columns of statistical information, use single-column statistical information for estimation.

3. Points worth noting

  • Currently, when using multi-column statistical information, range conditions are not supported; if there are multiple sets of multi-column conditions, the selection rate of each group of multi-column conditions is multiplied as the overall selection rate.

  • The single-column condition estimation and multi-column condition estimation mentioned above are applicable to only one column in each filter condition. If a filter condition is a combination of multiple columns, such as "t1.c1 <t1.c2", then generally It is impossible to estimate single-column statistical information, because single-column statistical information is independent of each other, and it is impossible to determine whether two independent statistical data come from one row. The current multi-column statistical information mechanism does not support scenarios where the filter conditions on the base table involve multiple columns.

  • Filter conditions that cannot be pushed down to the base table are not included in the consideration of the number of base table rows, such as the above: t1.c3 is not null or t2.c3 is not null, this condition is generally called JoinFilter, and JoinRel will be created Estimated at the time.

  • If no statistics are available, then the default selection rate is given.

2. JoinRel row number estimation

After estimating the number of rows in the base table, you can enter the table association stage. Then to associate two tables, you need some information, such as the number of rows in the base table, the number of rows after the association, and the choice of the way of association (also called the choice of Path, please see the next section), and then choose the least cost in these ways , Also called the best path. For the estimation of correlation conditions, there are also single conditions and multiple conditions. The optimizer needs to calculate the comprehensive selection rate of all Join conditions and JoinFilter, and then give the estimated number of rows. First look at how the selection rate of a single correlation condition is estimated.

1. A group of Join condition estimation ideas

Similar to estimating the number of rows by base table filter conditions, it also uses statistical information to estimate. For example, the association condition in the above SQL example: t1.c2 = t2.c2, first look at the statistical information of t1.c2:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct, avg_width, most_common_vals, most_common_freqs from pg_stats where tablename ='t1' and attname ='c2'; tablename | attname | null_frac | n_distinct | n_dndistinct | avg_width | most_common_vals | most_common_freqs -----------+---------+-----------+------------+--- -----------+-----------+------------------+------- ------------ t1 | c2 | 0 | -.25 | -.431535 | 4 | | (1 row) Copy code

The t1.c2 column has no MCV value. On average, each distinct value is repeated about 4 times and is evenly distributed. Because the data retained in the Histogram is only the boundary of the bucket, not the actual data (repetitive collection of statistics, these boundaries may be There are changes), then it is not practical to compare the boundary value with t2.c2, which may cause a relatively large error. At this time, we firmly believe that: "The related columns and columns have the same meaning, and the data overlap as much as possible. " That is to say, if the t1.c2 column has 500 distinct values, the t2.c2 column has 100. A distinct value, then these 100 and 500 will overlap 100, that is, the smaller the distinct value will all appear in the table with the larger distinct value. Although this assumption is a bit harsh, it is more consistent with the actual situation in many cases. Going back to this example, according to the statistics, n_distinct shows a negative value representing the proportion, and the estimated number of rows in the t1 table is 2000:

postgres=# select reltuples from pg_class where relname ='t1'; reltuples ----------- 2000 (1 row) Copy code

So, the distinct of t1.c2 is 0.25 * 2000 = 500. Similarly, according to statistics, the distinct of t2.c2 is 100:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct from pg_stats where tablename ='t2' and attname ='c2'; tablename | attname | null_frac | n_distinct | n_dndistinct -----------+---------+-----------+------------+--- ----------- t2 | c2 | 0 | 100 | -.39834 (1 row) Copy code

So, can the distinct value of t1.c2 be 500 directly? The answer is no. Because there is a filter condition "t1.c1 >100" on the base table t1, the current association occurs after the filter condition of the base table, the estimated distinct should be the number of distinct after the filter condition, not the original table How many. At this time, various hypothetical models can be used for estimation, such as a few simple models: Poisson model (assuming that t1.c1 and t1.c2 are very weakly correlated) or a fully correlated model (assuming that t1.c1 and t1.c2 are completely correlated) ), the values obtained by different models will be different. In this example, the selection rate of "t1.c1> 100" is 8.995000e-01, and the distinct values obtained by different models will be different, as follows:

a. Poisson model (weak correlation model): 500 * (1.0 -exp(-2000 * 8.995000e-01/500)) = 486

b. Fully related model: 500 * 8.995000e-01 = 450

c. Completely irrelevant model: 500 * (1.0-pow(1.0-8.995000e-01, 2000/500)) = 499.9, the model can be obtained by probabilistic methods, and interested readers can try to derive it by themselves

d. The actual filtered distinct: 500, that is, the c2 and c1 columns are irrelevant

postgres=# select count(distinct c2) from t1 where c1> 100; count ------- 500 (1 row) Copy code

Estimating the distinct value of t1.c2 after filtering, then the selection rate of "t1.c2 =t2.c2" can be estimated: 1/distinct.

The above is the case where there is no MCV in any table. If t1.c2 and t2.c2 have MCV, then compare their MCV first, because the values in MCV have a clear proportion, just accumulate the matching results directly. Then match the values in the Histogram.

2. Multi-group Join condition estimation idea

When the table association contains multiple Join conditions, it is similar to the base table filter condition estimation. There are also two ways of thinking. 1. try multiple columns of statistical information to estimate the selection rate. When multi-column statistical information cannot be used, single-column statistical information is used to calculate the selection rate of each Join condition according to the above method. Then the method of combination selection rate is also controlled by the parameter cost_param. For details, please refer to GaussDB (DWS) Performance Tuning Series Practical Part 5: Path Intervention of Eighteen Martial Arts .

In addition, the following is a selection rate estimation method for special circumstances:

  • If the Join column is an expression and there is no statistical information, the optimizer will try to estimate the distinct value, and then estimate it without MCV;

  • Left Join/RightJoin needs special consideration of the following characteristics of filling in the space on one side and full output on the other side, and the above model can be modified appropriately;

  • If the correlation condition is a comparison of range classes, such as "t1.c2 <t2.c2", the default selection rate is currently given: 1/3;

3. The estimation idea of JoinFilter

When two tables are associated, if there are some filter conditions that cannot be pushed down on the base table, they will generally become JoinFilter, that is, these conditions are filtered during the Join process, so JoinFilter will affect the number of rows in JoinRel, but not Affect the number of rows scanned up by the base table. Strictly speaking, if JoinRel is regarded as an intermediate table, then these JoinFilters are the filter conditions of this intermediate table, but JoinRel has not been generated yet, and there is no row number and statistical information, so it cannot be accurately estimated. However, a simple approximate method is to still use the base table to roughly estimate the selection rate of this JoinFilter, and then put it in the final row count estimation of JoinRel.

3. path generation

With the foreshadowing of the number of rows estimated in the previous two sections, you can enter the process of path generation. What is path generation? It is known that there are many ways to associate tables (such as NestLoop, HashJoin), and GaussDB (DWS) tables are distributed and stored in the cluster, so there may be multiple ways to associate the two tables, and our goal That is, starting from these given base tables, going through some operations (filter conditions, association methods and conditions, aggregation, etc.) as required, combining with each other, step by step, and finally get the result we want. This is like starting from the base table to find the best path so that we can get the fastest result. This is our goal. In this section, we introduce the generation of Join Path and Aggregate Path.

1. Join Path generation

The basic idea of GaussDB (DWS) optimizer selection is dynamic programming. As the name suggests, from a certain starting state, through solving the optimal solution of the intermediate state, gradually evolve forward, and finally get the global optimal plan. Then in dynamic programming, there is always a variable that drives the evolution of the process. Here, this variable is the number of tables. In this section, we take the following SQL as an example to explain:

select count (*) from t1, t2 where t1.c2 = t2.c2 and t1.c1 <800 and exists (select c1 from t3 where t1.c1 = t3.c2 and t3.c1> 100); duplicated code

In this SQL statement, there are three base tables t1, t2, t3. The distribution keys of the three tables are all column c1, and there are two association conditions:

1. t1.c2 = t2.c2, t1 and t2 are related

2. t1.c1 = t3.c2, t1 and t3 are related

In order to cooperate with the analysis, we combine the log to help everyone understand, set the following parameters, and then execute the statement:

set logging_module='on(opt_join)'; set log_min_messages=debug2; Copy code

The first step, how to obtain data t1 and t2

First of all, how to get the data of t1 and t2, such as Seq Scan, Index Scan, etc. Since we did not create Index in this example, the only choice is Seq Scan. The log snippet shows:

Let's remember these three sets of Path names: path_list, cheapest_startup_path, cheapest_total_path, the latter two correspond to the local optimal solution of dynamic programming, here is a set of sets, collectively referred to as the optimal path, which is also the search space for the next step. The path_list stores a set of valuable candidate Paths in the current Rel set (the pruned Path will not be placed here), cheapest_startup_path represents the path with the lowest startup cost in the path_list, and cheapest_total_path represents the group with the smallest total cost in the path_list Path (here, a set of optimal Paths that may have multiple dimensions are used). The t2 table is similar to the t3 table, and the optimal path is a Seq Scan. With the Scan optimal path of all base tables, the associated path can be selected below.

The second step is to solve the optimal path associated with (t1, t2)

The distribution keys of the two tables t1 and t2 are both the c1 column, but the Join column is both the c2 column, then the theoretical path is: (placed on the right means as an inner table)

1. Broadcast(t1)join t2

2. t1 joinBroadcast(t2)

3. Broadcast(t2)join t1

4. t2 joinBroadcast(t1)

5. Redistribute(t1) join Redistribute(t2)

6. Redistribute(t2) join Redistribute(t1)

Then each path can be matched with different Join methods (NestLoop, HashJoin, MergeJoin), a total of 18 related paths, the optimizer needs to select the optimal path among these paths, and the basis of the selection is the cost of the path (Cost). The optimizer will assign a cost to each operator. For example, Seq Scan, Redistribute, HashJoin all have a cost, and the cost is related to the data size, data characteristics, system resources, etc., on how to estimate the cost, the subsequent article will analyze it, this section Just focus on how to choose the path based on these costs. Since the cost is proportional to the execution time, the goal of the optimizer is to choose the least costly plan, so the path selection is the same. The path cost comparison idea is roughly like this. For a new Path generated, compare the new Path with the path in the path_list one by one. If the total_cost is very similar, compare the startup cost. If it is similar, then keep the Path to the path_list; If the total_cost of the new path is larger, but the startup_cost is much smaller, the Path is retained. The specific comparison process is omitted here, and the comparison result of Path is directly given:

It can be seen that the path with the smallest total cost is the path with redistribution on both sides and t1 as the inner table.

The third step is to solve the optimal path associated with (t1, t3)

The association condition of the t1 and t3 tables is: t1.c1 = t3.c2, because the Join column of t1 is the distribution key c1 column, so there is no need to add Redistribute to the t1 table; because the Join method of t1 and t3 is Semi Join, the external table cannot Broadcast, otherwise, duplicate results may be generated; there is another type of Unique Path option (ie t3 table deduplication), then the available candidate paths are roughly as follows:

1. t1 semijoin Redistribute(t3)

2. Redistribute(t3) semiright semi join t1

3. t1 joinUnique(Redistribute(t3))

4. Unique(Redistribute(t3))join t1

Since only one side needs to be redistributed and can be redistributed, broadcast is not selected, because the cost of Broadcast is generally higher than redistribution when the amount of data is the same, and it is pruned in advance. Taking the Join method into consideration, the optimizer gives the final choice:

The optimal plan at this time is to choose the path of the internal table Unique Path, that is, the t3 table first removes the duplicates, and then goes through the Inner Join process.

The fourth step is to solve the optimal path associated with (t1, t2, t3)

With the preparation of the previous two steps, the idea of association of the three tables is similar. The form is to decompose into two tables and associate first, and then associate with the third table. In actual operation, all the two tables are directly related. JoinRel, and then add another table one by one, try to associate, the selection method is as follows:

  • JoinRel(t1,t2) join t3:

  • (t1, t2)->(cheapest_startup_path + cheapest_total_path) joint3->(cheapest_startup_path + cheapest_total_path)

  • JoinRel(t1,t3) join t2:

  • (t1, t3)->(cheapest_startup_path + cheapest_total_path) joint2->(cheapest_startup_path + cheapest_total_path)

  • JoinRel(t2, t3)join t1: Since there is no (t2, t3) association, this situation does not exist

Each time a pair of Paths of the inner and outer tables are used for Join, it will also determine whether redistribution is needed and whether it can be deduplicated, and the association method is selected. For example, when JoinRel(t1, t2) joins t3, it will also try to deduplicate the t3 table. Path, because the essence of this Join is still Semi Join. The following figure shows part of the valuable candidate paths generated during the selection process (limited by space, only a part of it is intercepted):

Among these paths, the optimizer selected the following optimal path:

Compared with the actual execution plan, the two are the same (compared to the "E-costs" of the 4th layer HashJoin are the same):

From this process, it can be roughly felt that path_list may have some expansion. If there are too many paths in path_list, it may cause multiple cheapest_total_paths, and the next level of search space will also become very large, which will eventually lead to Increased time-consuming plan generation. Regarding the generation of Join Path, the following points are explained:

1. When selecting a Join path, the cost will be calculated in two stages, initial and final costs. The initial cost quickly estimates the cost of building the hash table, calculating the hash value, and placing the disk. When the initial cost is greater than a path in the path_list Prune the path in advance;

2. cheapest_total_path has multiple reasons: mainly considering that in multiple dimensions, paths with very similar costs may be the best choice for the next level of dynamic planning, leaving only one may not get the overall optimal plan;

3. cheapest_startup_path records the one with the lowest startup cost, which also reserves another dimension. When the query statement requires few results, there is a Path with a small startup cost, but the total cost may be relatively large, this Path may be Will be the first choice;

4. Due to pruning, in some cases, a Path may be pruned in advance, or the Path may not be selected as cheapest_total_path or cheapest_startup_path, and this Path is part of the theoretically optimal plan, which will lead to the final The plan is not optimal, and the probability of this scenario is generally not high. If you encounter this situation, you can try to use Plan Hint for tuning;

5. Path generation is closely related to the size of the cluster, system resources, statistics, and cost estimation. For example, the number of cluster DNs affects the tilt of the distribution and the amount of data of a single DN, and the system memory affects the disk cost. Statistics are OK. The first-hand data for the estimation of the number and distinct value, and the Cost estimation model is the key factor for selection and elimination in the entire plan generation. The inaccurate estimation of the number of rows for each JoinRel may affect the final plan. Therefore, for the same SQL statement, in different clusters or different statistics in the same cluster, the plan may be different. If there are some changes in the path, you can locate the problem by analyzing the Performance information and logs. For a detailed explanation of Performance, please refer to the blog post: GaussDB(DWS ) Detailed explanation of the explain performance .

6. If the Random Plan mode is set, the cheapest_startup_path and cheapest_total_path of each layer of dynamic planning are randomly selected from the path_list to ensure randomness.

2. Generation of Aggregate Path

Generally speaking, Aggregate Path is generated after the Path associated with the table is generated, and there are three main steps (Aggregate of Unique Path is completed when Join Path is generated, but there will be these three steps): First estimate Aggregate the number of rows of the result, and then select the path method, and finally create the optimal Aggregate Path. The former relies on statistical information and the Cost estimation model, while the latter depends on the former s estimation results, cluster size and system resources. The estimation of the number of aggregate rows is mainly based on the combination of the distinct values of the aggregate columns. We focus on the estimation of the number of aggregate rows and the selection of the optimal Aggregate Path.

2.1 Aggregate row count estimation

Take the following SQL as an example:

select t1.c2, t2.c2, count (* ) cnt from t1, t2 where t1.c2 = t2.c2 and t1.c1 <500 group by t1.c2, t2.c2; duplicated code

The statement first associates the two tables with a filter condition on the base table, and then obtains the GROUP BY result of the two columns. There are two aggregate columns here, t1.c2 and t2.c2. Take a look at the original information given in the system table:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct from pg_stats where (tablename ='t1' or tablename ='t2') and attname ='c2'; tablename | attname | null_frac | n_distinct | n_dndistinct -----------+---------+-----------+------------+--- ----------- t1 | c2 | 0 | -.25 | -.431535 t2 | c2 | 0 | 100 | -.39834 (2 rows) Copy code

Statistics show that the original distinct values of t1.c2 and t2.c2 are -0.25 and 100, respectively. The absolute value of -0.25 is 0.25 * 2000 = 500. Should their combined distinct value be at least 500? The answer is no. Because Aggregate aggregates the results of JoinRel(t1, t2), and the statistical information in the system table is the original information (without any filtering). At this time, both Join conditions and filter conditions need to be taken into consideration. How to consider? First look at the filter condition "t1.c1<500" may filter out a part of t1.c2, then there will be a selection rate (we call it FilterRatio at this time), and then the Join condition "t1.c2 = t2.c2" is also There will be a selection rate (we call it JoinRatio at this time), both of which are a number between [0, 1], so when estimating the distinct of t1.c2, the influence of these two Ratios must be considered . If the Poisson model is selected between different columns, and the fully correlated model is used between the same columns, the distinct of t1.c2 is approximately like this:

distinct(t1.c2) = Poisson(d0, ratio1, nRows) * ratio2

Where d0 represents the original distinct in the base table, ratio1 represents the Ratio using the Poisson model, ratio2 represents the Ratio using the fully correlated model, and nRows is the number of rows in the base table. If you need to locate and analyze the problem, these Ratios can be consulted from the log, and run the SQL statement after the following settings:

set logging_module='on(opt_card)'; set log_min_messages=debug3; Copy code

In this example, we can see two Ratios on the t1 table from the log:

Looking at t2.c2, the original distinct of this column is 100, and from the log above, we can see that the data in the t2 table is all matched (without Ratio), then the distinct of t2.c2 after Join is also 100. At this time, t1.c2 and t2.c2 cannot be directly combined, because "t1.c2 =t2.c2" implies that the values of these two columns are the same, that is, they are equivalent, so you only need to consider Min(distinct( t1.c2), distinct(t2.c2)). The following figure shows the actual and estimated number of rows given by Performance:

postgres=# explain performance select t1.c2, t2.c2, count(*) cnt from t1, t2 where t1.c2 = t2.c2 and t1.c1 <500 group by t1.c2, t2.c2; QUERY PLAN -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ------------ id | operation | A-time | A-rows | E-rows | E-distinct | Peak Memory | E-memory | A-width | E-width | E-costs ----+--------------------------------------------- --+------------------+--------+--------+---------- --+----------------+----------+---------+--------- +--------- 1 | -> Streaming (type: GATHER) | 48.500 | 99 | 100 | | 80KB | | | 16 | 89.29 2 | -> HashAggregate | [38.286, 40.353] | 99 | 100 | | [28KB, 31KB] | 16MB | [24,24] | 16 | 79.29 3 | -> Hash Join (4,6) | [37.793, 39.920] | 1980 | 2132 | | [6KB, 6KB] | 1MB | | 8 | 75.04 4 | -> Streaming(type: REDISTRIBUTE) | [0.247, 0.549] | 1001 | 1001 | 25 | [53KB, 53KB] | 2MB | | 4 | 32.95 5 | -> Seq Scan on test.t2 | [0.157, 0.293] | 1001 | 1001 | | [12KB, 12KB] | 1MB | | 4 | 4.50 6 | -> Hash | [36.764, 38.997] | 998 | 1000 | 62 | [291KB, 291KB] | 16MB | [20,20] | 4 | 29.88 7 | -> Streaming(type: REDISTRIBUTE) | [36.220, 38.431] | 998 | 999 | | [53KB, 61KB] | 2MB | | 4 | 29.88 8 | -> Seq Scan on test.t1 | [0.413, 0.433] | 998 | 999 | | | [14KB, 14KB] | 1MB | | 4 | 9.25 Copy code

2.2 Aggregrate Path generation

With the number of aggregation rows, the aggregation method can be flexibly selected according to the resource situation. There are three main methods of Aggregate:

1. Aggregate +Gather (+ Aggregate)

2. Redistribute +Aggregate (+Gather)

3. Aggregate +Redistribute + Aggregate (+Gather)

The indication in brackets may not have this step, depending on the specific situation. These aggregation methods can be understood as whether to choose both Redistribute or Broadcast when two tables are associated. After the optimizer obtains the final number of aggregated rows, it will try each aggregation method, calculate the corresponding cost, select the optimal method, and finally generate a path. When there are two layers of Aggregate, the last layer is the number of final aggregate rows, and the number of aggregate rows in the first layer is calculated based on the Poisson model. The Aggregate method is selected by the optimizer according to the cost by default, and the user can also specify it through the parameter best_agg_plan. The general scope of application of the three types of aggregation methods is as follows:

  • The first type, the number of rows after direct aggregation is not too large, usually DN aggregation, CN collection, and sometimes CN needs to be aggregated twice.
  • The second type requires redistribution and the number of rows is not significantly reduced after direct aggregation
  • The third type requires redistribution and the number of rows decreases significantly after direct aggregation. After redistribution, the number of rows can be reduced again, usually DN aggregation, redistribution, and re-aggregation, commonly known as double-layer Aggregate

4. concluding remarks

This article focuses on the core steps of plan generation, from estimating the number of rows, to the generation of the Join Path, and then to the generation of the Aggregate Path, and introduces the basic principles of the simplest process. The actual processing method is far more complicated than the description, and there are many situations to consider, such as how to combine the optimal selection rates of multiple groups, how to choose the distribution key, how to deal with the tilt, how much memory is used, and so on. Weighing the entire plan generation process, sometimes you have to give something away, so that you can gain, and sometimes a little disadvantage of the plan can be ignored or compensated by other capabilities. For example, after the SMP is turned on, the parallel effect will dilute some of the defects in the plan. . All in all, plan generation is a complex and meticulous work. To generate a global optimal plan requires continuous discovery and optimization. In follow-up blog posts, we will continue to explore the secrets of plan generation.

Click to follow to learn about Huawei Cloud's fresh technology for the first time~