Skip to content

MDEV-39499 Updates to derived-with-keys, window functions determining#5252

Open
mariadb-RexJohnston wants to merge 1 commit into
11.4from
11.4-MDEV-39499
Open

MDEV-39499 Updates to derived-with-keys, window functions determining#5252
mariadb-RexJohnston wants to merge 1 commit into
11.4from
11.4-MDEV-39499

Conversation

@mariadb-RexJohnston

Copy link
Copy Markdown
Member

..records per key

Enabling derived keys optimization for derived.col = const pushed conditions.

Estimating records per key in derived key for the optimizer based on form and/or size of components of a derived table.

Consider any derived table of the form

SELECT ..., ROW_NUMBER () OVER (PARTITION BY c1,c2 order by ...) FROM t1, t2, t3 ...
WHERE ...

If the optimizer generates a key on this derived table because of a constraint being pushed into it, it currently will not consider key components of the form col = const

We lift this constraint and add code to TABLE::add_tmp_key to search for a window function ROW_NUMBER(). From the partition list c1, c2 we can in infer an estimate of the number of rows we expect to see for each key value. In the absence of EITS, we still have an number of expected records in our referred to table and will fall back to using that as a worst case scenario.

As a consequence of forcing best_access_path to now consider a generated index's records per key on any type of derived table, we see a number of optimization changes, for example a range optimizer might now be considered best and might return IMPOSSIBLE_RANGE (seen in rowid_filter_myisam.result).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cardinality estimation for derived tables with window functions (specifically ROW_NUMBER) by adding opt_window_function_cardinality.cc and integrating it into temporary key creation. Feedback on these changes highlights critical compilation errors due to type mismatches between ulong* and rec_per_key_t* (which is double*) in the new estimation function. Additionally, several high-severity issues were identified, including potential null pointer dereferences when resolving fields and a potential buffer overflow in TABLE::add_tmp_key if key_parts exceeds MAX_REF_PARTS.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread sql/opt_window_function_cardinality.h
Comment thread sql/opt_window_function_cardinality.cc Outdated
Comment thread sql/table.cc
Comment thread sql/opt_window_function_cardinality.cc
..records per key

Enabling derived keys optimization for derived.col = const pushed
conditions.

Estimating records per key in derived key for the optimizer
based on form and/or size of components of a derived table.

Consider any derived table of the form

SELECT ..., ROW_NUMBER ()  OVER (PARTITION BY c1,c2 order by ...)
FROM t1, t2, t3 ...
WHERE ...

If the optimizer generates a key on this derived table because of
a constraint being pushed into it, it currently will not consider key
components of the form  col = const

We lift this constraint and add code to TABLE::add_tmp_key to search
for a window function ROW_NUMBER().  From the partition list c1, c2 we
can in infer an estimate of the number of rows we expect to see for
each key value.  In the absence of EITS, we still have an number of
expected records in our referred to table and will fall back to using
that as a worst case scenario.

As a consequence of forcing best_access_path to now consider a generated
index's records per key on any type of derived table, we see a number of
optimization changes, for example a range optimizer might now be
considered best and might return IMPOSSIBLE_RANGE (seen in
rowid_filter_myisam.result).
@mariadb-RexJohnston

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to estimate the cardinality of derived tables with window functions (specifically ROW_NUMBER() OVER (PARTITION BY ...)), allowing the optimizer to better determine records per key. It adds opt_window_function_cardinality.cc and opt_window_function_cardinality.h, integrates them into the build files and TABLE::add_tmp_key, and includes extensive test coverage. The review feedback highlights several critical improvements: 1) In handle_single_part_rownumber, HA_POS_ERROR must be handled safely to prevent massive values from casting, and multi-column partition estimates should be capped by the table's total records. 2) In est_derived_window_fn_cardinality, comparing result_field directly with the key field is a much more robust heuristic than using select list indices, and rec_per_key_t should be used instead of ulong for type safety. 3) A defensive assertion should be added in TABLE::add_tmp_key to prevent potential stack buffer overflows.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +40 to +77
static ha_rows handle_single_part_rownumber( Item_window_func *item_row_number)
{
if (item_row_number->window_spec->partition_list &&
item_row_number->window_spec->partition_list->elements)
{
double records= 1;
ORDER *part_list= item_row_number->window_spec->partition_list->first;
for (; part_list; part_list= part_list->next)
{
if ((*part_list->item)->type() == Item::FIELD_ITEM)
{
Item_field *item_field= ((Item_field*)(*part_list->item));
DBUG_ASSERT(item_field);
Field *field= item_field->field;
DBUG_ASSERT(field);
/*
use EITS records per key if we have them,
otherwise
use number of records in the table (as a worst case scenario)
*/
if (field->read_stats)
{
// records*= field->read_stats->get_avg_frequency();
double freq= field->read_stats->get_avg_frequency();
records*= freq > 0 ? field->table->used_stat_records / freq
: (double)field->table->used_stat_records;
}
else
records*= (double)field->table->used_stat_records;
}
}
return records < 1.0 ? (ha_rows)1 : (ha_rows)records;
}
else
{
return (ha_rows)1;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of handle_single_part_rownumber has two significant issues:

  1. Handling of HA_POS_ERROR: If table->used_stat_records is HA_POS_ERROR (indicating unknown/uninitialized statistics), casting it to double will result in a huge value (approx 1.8e19), which will completely distort the cardinality estimation and potentially cause overflows.
  2. Exaggerated Estimates for Multi-column Partitions: When partitioning by multiple columns of the same table, multiplying their distinct values can easily exceed the total number of records in the table. The accumulated distinct values for each table should be capped by that table's total records.

We can resolve both issues by tracking distinct values per table and capping them, while safely falling back to transient engine statistics or a default value if used_stat_records is HA_POS_ERROR.

static ha_rows handle_single_part_rownumber(Item_window_func *item_row_number)
{
  if (item_row_number->window_spec->partition_list &&
      item_row_number->window_spec->partition_list->elements)
  {
    double records= 1;
    const uint MAX_TABLES= 16;
    TABLE *tables[MAX_TABLES];
    double table_distinct[MAX_TABLES];
    uint table_count= 0;

    ORDER *part_list= item_row_number->window_spec->partition_list->first;
    for (; part_list; part_list= part_list->next)
    {
      if ((*part_list->item)->type() == Item::FIELD_ITEM)
      {
        Item_field *item_field= ((Item_field*)(*part_list->item));
        DBUG_ASSERT(item_field);
        Field *field= item_field->field;
        DBUG_ASSERT(field);
        TABLE *table= field->table;

        ha_rows table_records= table->used_stat_records;
        if (table_records == HA_POS_ERROR)
        {
          table_records= table->file->stats.records;
          if (table_records == HA_POS_ERROR)
            table_records= 1000; // Fallback default
        }

        double diff_values= 1.0;
        if (field->read_stats)
        {
          double freq= field->read_stats->get_avg_frequency();
          diff_values= freq > 0 ? (double)table_records / freq
                                : (double)table_records;
        }
        else
        {
          diff_values= (double)table_records;
        }

        uint idx= 0;
        for (; idx < table_count; idx++)
        {
          if (tables[idx] == table)
            break;
        }
        if (idx < MAX_TABLES)
        {
          if (idx == table_count)
          {
            tables[idx]= table;
            table_distinct[idx]= diff_values;
            table_count++;
          }
          else
          {
            table_distinct[idx]*= diff_values;
          }
          if (table_distinct[idx] > (double)table_records)
            table_distinct[idx]= (double)table_records;
        }
      }
    }

    for (uint i= 0; i < table_count; i++)
    {
      records*= table_distinct[i];
    }

    return records < 1.0 ? (ha_rows)1 : (ha_rows)records;
  }
  else
  {
    return (ha_rows)1;
  }
}

Comment on lines +111 to +150
bool est_derived_window_fn_cardinality(st_select_lex* derived,
ulong *out_records,
Field **reg_fields,
uint key_parts)
{
uint fc= 0;

while (fc < key_parts)
{
uint ic= 0;
if (!reg_fields[fc])
{
fc++;
continue;
}
uint field_index= reg_fields[fc]->field_index;
List_iterator_fast<Item> li(*derived->get_item_list());
Item *item;

// find our window function in the item list
while ((item= li++))
{
// if the single part key is on our window function
if ((ic == field_index) && item->type() == Item::WINDOW_FUNC_ITEM)
{
Item_window_func *itm_row_num= (Item_window_func*)item;

if (itm_row_num->window_func()->sum_func() == Item_sum::ROW_NUMBER_FUNC)
{
*out_records= (ulong)handle_single_part_rownumber(itm_row_num);
return true;
}
}
ic++;
}
fc++;
}

return false;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of est_derived_window_fn_cardinality relies on comparing the select list index ic with the temporary table's field_index. This index-based heuristic is fragile and can break if fields are optimized away, reordered, or if hidden fields are present in the temporary table.

Instead, we can directly compare the result_field of the Item with the key field reg_fields[fc]. This is 100% robust and standard in the optimizer. Additionally, we should use rec_per_key_t instead of ulong for out_records to ensure type safety and future-proofing.

bool est_derived_window_fn_cardinality(st_select_lex* derived,
                                       rec_per_key_t *out_records,
                                       Field **reg_fields,
                                       uint key_parts)
{
  uint fc= 0;

  while (fc < key_parts)
  {
    if (!reg_fields[fc])
    {
      fc++;
      continue;
    }
    List_iterator_fast<Item> li(*derived->get_item_list());
    Item *item;

    // find our window function in the item list
    while ((item= li++))
    {
      // Compare the result_field of the item directly with the key field
      if (item->result_field == reg_fields[fc] && item->type() == Item::WINDOW_FUNC_ITEM)
      {
        Item_window_func *itm_row_num= (Item_window_func*)item;

        if (itm_row_num->window_func()->sum_func() == Item_sum::ROW_NUMBER_FUNC)
        {
          *out_records= (rec_per_key_t)handle_single_part_rownumber(itm_row_num);
          return true;
        }
      }
    }
    fc++;
  }

  return false;
}

Comment on lines +20 to +23
bool est_derived_window_fn_cardinality(st_select_lex* derived,
ulong *out_records,
Field **reg_fields,
uint key_parts);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the function signature to use rec_per_key_t instead of ulong for the out_records parameter to maintain type safety and consistency with the optimizer's rec_per_key array.

Suggested change
bool est_derived_window_fn_cardinality(st_select_lex* derived,
ulong *out_records,
Field **reg_fields,
uint key_parts);
bool est_derived_window_fn_cardinality(st_select_lex* derived,
rec_per_key_t *out_records,
Field **reg_fields,
uint key_parts);

Comment thread sql/table.cc
Comment on lines 8640 to 8645
char buf[NAME_CHAR_LEN];
KEY *keyinfo= key_info + key;
KEY_PART_INFO *key_part_info;
Field **reg_field;
Field *reg_field[MAX_REF_PARTS];
uint i;
bool key_start= TRUE;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since reg_field is now declared as a fixed-size stack array of size MAX_REF_PARTS, we should add a defensive assertion DBUG_ASSERT(key_parts <= MAX_REF_PARTS); at the beginning of TABLE::add_tmp_key to prevent any potential stack buffer overflow if key_parts exceeds MAX_REF_PARTS.

  char buf[NAME_CHAR_LEN];
  KEY *keyinfo= key_info + key;
  KEY_PART_INFO *key_part_info;
  Field *reg_field[MAX_REF_PARTS];
  uint i;
  bool key_start= TRUE;
  DBUG_ASSERT(key_parts <= MAX_REF_PARTS);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

2 participants