MySQL 记录、页、索引的数据结构简析_Mysql

引言

本文在介绍 MySQL 内存中记录、页、索引、游标的数据结构的基础上，通过简单分析插入操作过程中行格式的转换介绍了不同数据结构的关系，其中不涉及加锁相关逻辑。

原理

记录

概念

InnoDB 存储引擎基于记录（record）存储，表明记录是根据行（row）格式存储。

MySQL 中有行格式有以下三种存储方式：

Server 层的格式，这种格式与存储引擎没关系，适用于所有存储引擎，这种格式是操作数据必然要经过的一种格式，也是 Row 模式下 binlog 存储所使用的一种格式；
索引元组格式，这种格式是 InnoDB 存储引擎在存取记录时一种记录格式的中间状态，它要么是从 Server 层转换而来，要是么转换为 Server 层的格式。索引元组格式与索引一一对应，是内存中一种用来存储索引中所有列数据的数据结构，对应逻辑记录（tuple）；
物理存储记录格式，这种格式是一条记录在物理页面中的存储格式，也就是 Compact 格式。这种格式与索引元组格式一一对应，对应物理记录（record）。

因此：

存数据时，发生以下转换：Server 层的格式 -> 索引元组格式 -> 物理存储记录格式；
取数据时，发生以下转换：物理存储记录格式 -> 索引元组格式 -> Server 层的格式。

个人理解存取数据时完成二进制到“字符串”的相互转换（忽略其他数据类型），相当于客户端与服务端交互时的编码与解码。

数据结构

内存中逻辑记录与其中的字段分别对应数据结构 dtuple_t 与 dfield_t。

结构体定义如下所示。

/** Structure for an SQL data tuple of fields (logical record) */
struct dtuple_t {
 ulint  n_fields; /*!< number of fields in dtuple */
 ulint  n_fields_cmp; /*!< number of fields which should
     be used in comparison services
     of rem0cmp.*; the index search
     is performed by comparing only these
     fields, others are ignored; the
     default value in dtuple creation is
     the same value as n_fields */
 dfield_t* fields;  /*!< fields */
 UT_LIST_NODE_T(dtuple_t) tuple_list;
     /*!< data tuples can be linked into a
     list using this field */
};

其中：

ulint n_fields，表示列的数量；
dfield_t* fields，表示记录中的列，存储指向每一列真实数据的指针；
UT_LIST_NODE_T(dtuple_t) tuple_list，连接多个 dtuple_t，原因是每个索引对应一个 dtuple_t，比如 INSERT SQL 需要向主键索引和二级索引中都插入时。

数据结构 dfield_t 中保存列的信息。

/** Structure for an SQL data field */
struct dfield_t{
 void*  data; /*!< pointer to data */
 unsigned ext:1; /*!< TRUE=externally stored, FALSE=local */
 unsigned len; /*!< data length; UNIV_SQL_NULL if SQL null */
 dtype_t  type; /*!< type of data */
};

其中：

data，表示真实列数据的指针；
ext:1，如果是大记录（blob），则在外部页存储；
len，表示列数据的长度；
type，表示列数据的类型。

物理记录由 rec_t 数据结构表示，字节类型。

/* We define the physical record simply as an array of bytes */
typedef byte rec_t;

页

概念

页和记录类似同样可以分为以下两种形式：

物理页（block），表示存储在外部存储设备上，一般是持久的。block 是文件系统中一次 IO 的大小；
内存页（page），表示存储在缓冲池（buffer pool）中，当物理页与内存页不同时表明是脏页。

页分多种类型，比如 index page、undo log page 等，本文关注的是 index page。

由于 InnoDB 是索引组织树，索引即数据，因此可以将索引页理解为数据页。

InnoDB 中页是一个无序堆，因此页中的记录无序存放，记录间通过 record header 中的 next record 组成单向链表。

数据结构

buffer pool 中与页相关的有两个数据结构，包括 buf_block_t 和 buf_page_t 。

buf_block_t 是数据页控制体的一种。

/** The buffer control block structure */

struct buf_block_t{

 buf_page_t page;  /*!< page information; this must
     be the first field, so that
     buf_pool->page_hash can point
     to buf_page_t or buf_block_t */
 byte*  frame;
 BPageLock lock;  /*!< read-write lock of the buffer frame */

其中：

block->buf_page_t，数据页控制块指针，要求是 buf_block_t 的第一个成员，便于 buf_block_t 随时强制类型转换为 buf_page_t；
block->frame，数据页指针，指向页对应在缓冲池中的内存地址，保存页的真实内容；
block->lock，读写锁用于保护 page 的内容。

buf_page_t 称为数据页控制体。

/** The common buffer control block structure
for compressed and uncompressed frames */

class buf_page_t {
public:
  
  page_id_t id; // page id
  buf_page_t* hash;  /*!< node used in chaining to
     buf_pool->page_hash or
     buf_pool->zip_hash */
  lsn_t  newest_modification; // 当前Page最新修改lsn
  lsn_t  oldest_modification; // 当前Page最老修改lsn，即第一条修改lsn
  
};

其中：

oldest_modification，最初修改该页 redo log record 的 end LSN（或者是 mtr end LSN），oldest_modification 不为 0 表示是脏页；
newest_modification，最近修改该页 redo log record 的 end LSN（或者是 mtr end LSN）。

索引

概念

索引是基于以空间换时间的思想用于加速查询的数据结构。

InnoDB 中索引基于 B+ 树实现，因此每个索引都是一棵 B+ 树。索引用于组织页，页用于组织行记录。

数据结构

每个 B+ 树索引在内存中对应数据结构 dict_index_t。

/** Data structure for an index.  Most fields will be
initialized to 0, NULL or FALSE in dict_mem_index_create(). */
struct dict_index_t{
 index_id_t id; /*!< id of the index */
 mem_heap_t* heap; /*!< memory heap */
 id_name_t name; /*!< index name */
 const char* table_name;/*!< table name */
 dict_table_t* table; /*!< back pointer to table */
  unsigned space:32;
    /*!< space where the index tree is placed */
 unsigned page:32;/*!< index tree root page number */
  unsigned allow_duplicates:1;
      /*!< if true, allow duplicate values
      even if index is created with unique
      constraint */
  unsigned n_uniq:10;/*!< number of fields from the beginning
    which are enough to determine an index
    entry uniquely */
 unsigned n_def:10;/*!< number of fields defined so far */
 unsigned n_fields:10;/*!< number of fields in the index */
 unsigned n_nullable:10;/*!< number of nullable fields */
  dict_field_t* fields; /*!< array of field descriptions */
};

其中：

index->table，指向当前索引对应的表；
index->n_fields，表示当前索引中包含的列数；
index->page，表示当前索引 root 页的编号（偏移量）；
index->fields，表示当前索引列的描述信息。

通过 B+ 树索引进行查找（lookup）是最为常见的操作，具体通过游标实现。

游标

概念

游标是一个逻辑概念，无论是写入还是查询，都需要在 B+ 树上遍历至某个位置，具体到某个 block 上的某个 record。

InnoDB 中通过 cursor search 实现 B+ 树的查找，每个 open 一个 cursor 都会开启一个从 B+ 树 root 结点搜索到指定层级的 record 的搜索过程。

cursor 分为以下两种：

B-tree cursor
page cursor，表示指向记录所在位置的游标。

page cursor 在定位（lookup）到记录后，再通过查询模式（search mode）再进行向前或向后的记录扫描（scan）。

比如对于一个主键的范围查询，首先需要定位第一条记录，然后进行记录的顺序扫描。

InnoDB 中定义了四种查询模式：

PAGE_CUR_G: > 查询，查询第一个大于 dtuple 的 rec_t
PAGE_CUR_GE: >=，> 查询，查询第一个大于等于 dtuple 的 rec_t
PAGE_CUR_L: < 查询，查询最后一个小于 dtuple 的 rec_t
PAGE_CUR_LE: <= 查询，查询最后一个小于等于 dtuple 的 rec_t

插入操作的 search_mode 默认是 PAGE_CUR_LE，即插在最后一个小于等于该 dtuple 的 rec_t 后。

为了提高查询效率，InnoDB 做了以下优化：

record buffer，也就是预读，连续读取多条记录；
persistent cursor（持久化游标，简称 pcur），用于保存每次查询到的记录，待查询下一条记录时，首先恢复上一次查询的记录，然后再获取下一条记录；
Adaptive Hash Index（自适应哈希索引，简称 AHI），根据热点行记录进行索引，但是如果记录发生变化，也需要维护自适应哈希索引。

数据结构

B-tree cursor 对应 btr_cur_t 结构体。

/** The tree cursor: the definition appears here only for the compiler
to know struct size! */
struct btr_cur_t {
 btr_cur_t() { memset(this, 0, sizeof(*this)); }

 dict_index_t* index;  /*!< index where positioned */
 page_cur_t page_cur; /*!< page cursor */
 purge_node_t* purge_node; /*!< purge node, for BTR_DELETE */
 buf_block_t* left_block; /*!< this field is used to store
     a pointer to the left neighbor
     page, in the cases
     BTR_SEARCH_PREV and
     BTR_MODIFY_PREV */
 /*------------------------------*/
 que_thr_t* thr;  /*!< this field is only used
     when btr_cur_search_to_nth_level
     is called for an index entry
     insertion: the calling query
     thread is passed here to be
     used in the insert buffer */
 /*------------------------------*/

};

其中：

page_cur_t page_cur，保存 page cursor 。

page cursor 对应 page_cur_t 结构体。

/** Index page cursor */

struct page_cur_t{
 const dict_index_t* index;
 rec_t*  rec; /*!< pointer to a record on page */
 ulint*  offsets;
 buf_block_t* block; /*!< pointer to the block containing rec */
};

其中：

rec_t* rec，表示游标查询得到的记录；
buf_block_t* block，表示表示游标查询得到的记录所在块信息。

persistent cursor 对应 btr_pcur_t 数据结构。

/* The persistent B-tree cursor structure. This is used mainly for SQL
selects, updates, and deletes. */

struct btr_pcur_t{
 btr_pcur_t() { memset(this, 0, sizeof(*this)); }

 /** a B-tree cursor */
 btr_cur_t btr_cur;
  
  rec_t*  old_rec;
   
  /** Return the index of this persistent cursor */
 dict_index_t* index() const { return(btr_cur.index); }
};

其中：

rec_t* old_rec，持久化游标中保存查询得到的记录。

下面以 insert 语句的执行为例，分析不同数据结构之间的关系。

insert 语句

build_template

从 server 层调用存储引擎的接口 handler 开始，具体是 ha_innobase::write_row 函数。

// storage/innbase/handler/ha_innodb.cc

/* Stores a row in an InnoDB database, to the table specified in this handle. */
int
ha_innobase::write_row(
/*===================*/
 uchar* record) /*!< in: a row in MySQL format */
{
  /* Step-4: Prepare INSERT graph that will be executed for actual INSERT
 (This is a one time operation) */
 if (m_prebuilt->mysql_template == NULL
     || m_prebuilt->template_type != ROW_MYSQL_WHOLE_ROW) {

  /* Build the template used in converting quickly between
  the two database formats */

  build_template(true);
 }
  
  /* Step-5: Execute insert graph that will result in actual insert. */
 // 插入数据
 error = row_insert_for_mysql((byte*) record, m_prebuilt);
}

其中：

入参 uchar* record 指向 table->record[0]，其中不包含表列的类型、列的长度、列数等信息，完全是客户端插入的数据；

unsigned char 数据类型经常被用于处理原始字节数据，比如在进行二进制文件读写、网络通信、处理图像数据、编码转换或者其他需要以字节为单位操作的场景中。利用 unsigned char 可以确保不会有负值出现，这在与二进制数据和位操作打交道时非常有用。

调用 build_template 函数初始化 m_prebuilt 变量；
调用 row_insert_for_mysql 函数插入数据，入参中将 record 指针从 uchar 转换成 byte 类型。

typedef unsigned char uchar; /* Short for unsigned char */

/* Note that inside MySQL 'byte' is defined as char on Linux! */
#define byte   unsigned char

m_prebuilt 变量是 row_prebuilt_t 结构体指针类型，其中部分成员如下所示，包括表、索引、游标等各种信息。

/** Save CPU time with prebuilt/cached data structures */
row_prebuilt_t*  m_prebuilt;

/** A struct for (sometimes lazily) prebuilt structures in an Innobase table
handle used within MySQL; these are used to save CPU time. */

struct row_prebuilt_t {
 dict_table_t* table;  /*!< Innobase table handle */
 dict_index_t* index;  /*!< current index for a search, if any */
 trx_t*  trx;  /*!< current transaction handle */
  unsigned n_template:10; /*!< number of elements in the
        template */
  ins_node_t* ins_node; /*!< Innobase SQL insert node
        used to perform inserts
        to the table */
  byte*  ins_upd_rec_buff;/*!< buffer for storing data converted
        to the Innobase format from the MySQL
        format */
  mysql_row_templ_t* mysql_template;/*!< template used to transform
        rows fast between MySQL and Innobase
        formats; memory for this template
        is not allocated from 'heap' */
  btr_pcur_t* pcur;  /*!< persistent cursor used in selects
        and updates */
};

其中：

ins_node_t* ins_node，该结构体很重要，下文中将介绍；
btr_pcur_t* pcur，持久化游标用于保存当前查询的记录；
mysql_row_templ_t* mysql_template，辅助结构，主要保存列的元数据信息，用于加快行记录格式的转换。

row_prebuilt_t 结构体与 build_template 函数的作用参考 chatgpt：

row_prebuilt_t:row_prebuilt_t结构体是 InnoDB 存储引擎用来预存储有关某个表操作所需的信息的数据结构。这个结构体包括了众多和查询执行相关的字段，如指向 InnoDB 内部诸如表描述（dict_table_t）、索引描述（dict_index_t）的指针；控制当前操作的游标状态；查询的模板信息；用于读写操作的缓冲区；锁定信息等。简言之，row_prebuilt_t是一个操作上下文，它保存了InnoDB在执行诸如插入、更新、删除、查询等操作时所需要的所有上下文信息。
build_template:build_template函数的主要工作是为 SQL 语句的执行准备模板数据，这些数据用于确定当从存储引擎获取数据时应该返回哪些列，以及如何快速地从 InnoDB 的内部格式转换为 MySQL 格式。例如，如果一个 SELECT 语句只查询了表中的一些列，则build_template将生成相应的模板以确保只有这些被查询的列的数据被读取和转换。这个过程包括决定哪些列可以跳过、哪些列需要转换等。这样就可以避免不必要的数据转换和传输，提高查询效率。
在 InnoDB 存储引擎的设计中，row_prebuilt_t和build_template都是实现 SQL 层与存储引擎层之间高效数据交互的重要组成部分。一方面，row_prebuilt_t作为一个操作上下文，保存了完成某个表操作所需的所有信息；另一方面，build_template则通过预先构建模板来优化数据列的读取和转换过程。

byte -> dtuple_t* row

row_insert_for_mysql 函数中调用 row_insert_for_mysql_using_ins_graph 函数，其中将 Server 层的记录格式转换为 InnoDB 的记录格式，具体是从 byte 转换为 dtuple_t。

// storage/innobase/row/row0mysql.cc

static
dberr_t
row_insert_for_mysql_using_ins_graph(
 const byte* mysql_rec,   /* row in the MySQL format */
 row_prebuilt_t* prebuilt)  /* prebuilt struct in MySQL handle */
{
  que_thr_t* thr;
  ins_node_t* node  = prebuilt->ins_node;
  
  // 主要构造用于执行插入操作的 2 个对象：
 // 1. ins_node_t 对象，保存在 prebuilt->ins_node 中
 // 2. que_fork_t 对象，保存在 prebuilt->ins_graph 中
 row_get_prebuilt_insert_row(prebuilt);
 node = prebuilt->ins_node;

 // 把 server 层的记录格式转换为 InnoDB 的记录格式
 row_mysql_convert_row_to_innobase(node->row, prebuilt, mysql_rec,
       &blob_heap);
  
  thr->run_node = node;
 thr->prev_node = node;
  
  // 执行插入操作，插入记录到主键索引、二级索引（包含唯一索引、非唯一索引）
 /*插入记录*/
 row_ins_step(thr);
}

其中：

调用 row_get_prebuilt_insert_row 函数给 m_prebuilt->ins_node 成员赋初值，如 dtuple_t；
调用 row_mysql_convert_row_to_innobase 函数转换行记录格式；
调用 row_ins_step 函数插入记录，其中入参 que_thr_t* thr，thr->run_node = node = prebuilt->ins_node，具体 que_thr_t 结构体暂不介绍。

这里又出现了一个重要的结构体 ins_node_t，该结构体部分成员如下所示。

/* Insert node structure */

struct ins_node_t{
 dtuple_t* row; /*!< row to insert */
 dict_table_t* table; /*!< table where to insert */
 sel_node_t* select; /*!< select in searched insert */
 que_node_t* values_list;/* list of expressions to evaluate and
    insert in an INS_VALUES insert */
 ulint  state; /*!< node execution state */
 dict_index_t* index; /*!< NULL, or the next index where the index
    entry should be inserted */
 dtuple_t* entry; /*!< NULL, or entry to insert in the index;
    after a successful insert of the entry,
    this should be reset to NULL */
}

其中：

dtuple_t* row，其中保存完整的行数据；
dtuple_t* entry，给每个 index 创建一个 dtuple_t，用于保存索引中的数据，比如二级索引中只需要保存索引列和主键列，因此可以理解为针对特定索引优化后的列版本。

row_mysql_convert_row_to_innobase 函数中遍历索引的所有列（prebuilt->n_template）进行赋值。

static
void
row_mysql_convert_row_to_innobase(
/*==============================*/
 dtuple_t* row,  /*!< in/out: Innobase row where the
     field type information is already
     copied there! */
 row_prebuilt_t* prebuilt, /*!< in: prebuilt struct where template
     must be of type ROW_MYSQL_WHOLE_ROW */
 const byte* mysql_rec, /*!< in: row in the MySQL format; */
 mem_heap_t** blob_heap) /*!< in: FIX_ME, remove this after
     server fixes its issue */
{
 const mysql_row_templ_t*templ;
  
  for (i = 0; i < prebuilt->n_template; i++) {

  templ = prebuilt->mysql_template + i;
    
    // 获取 field
    dfield = dtuple_get_nth_field(row, n_col);
    
    // 格式转换，从 rec 写入 dtuple_t
  row_mysql_store_col_in_innobase_format(
      /* dfield_t* dfield, call to set field */
   dfield,
      /* byte* buf, buffer for a converted integer value */
   prebuilt->ins_upd_rec_buff + templ->mysql_col_offset,
      /* ibool row_format_col, TRUE if the mysql_data is from a MySQL row, FALSE if from a MySQL key value; */
   TRUE, 
      /* byte* mysql_data, MySQL row format data */
   mysql_rec + templ->mysql_col_offset,
      /* ulint col_len, MySQL column length */
   templ->mysql_col_len,        
      /*!< ulint comp, nonzero=compact format */
   dict_table_is_comp(prebuilt->table));
}

其中：

调用 row_mysql_store_col_in_innobase_format 函数将列值保存在 dtuple_t::dfield_t 中。

const byte* ptr = mysql_data;
 const dtype_t* dtype;
 ulint  type;

 dtype = dfield_get_type(dfield);

 type = dtype->mtype;
  
  // 计算 col_len
  ...
  
  dfield_set_data(dfield, ptr, col_len);

其中：

获取 dtype、计算 col_len
调用 dfield_set_data 函数保存数据完成行记录格式的转换

/* Sets pointer to the data and length in a field. */
UNIV_INLINE
void
dfield_set_data(
/*============*/
 dfield_t* field, /*!< in: field */
 const void* data, /*!< in: data */
 ulint  len) /*!< in: length or UNIV_SQL_NULL */
{
 field->data = (void*) data;
 field->ext = 0;
 field->len = static_cast<unsigned int>(len);
}

其中：

向 dfield_t 结构体中保存字段数据、字段长度

到这里完整的行数据索引元组格式 dtuple_t* row dtuple_t 准备好了，但由于 InnoDB 是索引组织树，因此还需要将数据保存到索引元组格式 dtuple_t* entry 中。

dtuple_t* row -> dtuple_t* entry

row_ins_step 函数中创建 ins_node_t 并调用 row_ins 函数。

// 创建 ins_node_t
 node = static_cast<ins_node_t*>(thr->run_node); 

 // 给表加上意向锁
  err = lock_table(0, node->table, LOCK_IX, thr);
  
  /* DO THE CHECKS OF THE CONSISTENCY CONSTRAINTS HERE */
 // 调用 row_ins() 插入记录到主键索引、二级索引
 err = row_ins(node, thr);

row_ins 函数中遍历表的每个 index，插入 index entry。

// 遍历所有索引，向每个索引中插入记录
 while (node->index != NULL) {
    
    /* 向索引中插入记录 */
    err = row_ins_index_entry_step(node, thr);

  // 插入记录到主键索引或二级索引成功后获取下一个索引
    // node->index、entry 指向表中的下一个索引
  node->index = dict_table_get_next_index(node->index);
  node->entry = UT_LIST_GET_NEXT(tuple_list, node->entry);

其中：

循环调用 row_ins_index_entry_step 函数插入记录，其中入参 node 就是 row_mysql_convert_row_to_innobase 函数转换后的 row_prebuilt_t->ins_node_t；
每次插入索引时更新 ins_node_t->index 与 ins_node_t->entry，因此行数据对应的所有索引复用同一个 ins_node_t 变量，其中 ins_node_t->row 保持不变。

row_ins_index_entry_step 函数中向指定索引中插入记录。

// storage/innobase/row/row0insert.cc

/* Inserts a single index entry to the table. */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
dberr_t
row_ins_index_entry_step(
/*=====================*/
 ins_node_t* node, /*!< in: row insert node */
 que_thr_t* thr) /*!< in: query thread */
{
  /*给索引项赋值*/
 err = row_ins_index_entry_set_vals(node->index, node->entry, node->row);

 /*插入索引项*/
 err = row_ins_index_entry(node->index, node->entry, thr);
}

其中：

调用 row_ins_index_entry_set_vals 函数遍历索引的每个字段并赋值，其中将 innobase format field 的值赋给对应的 index entry field。

/** Sets the values of the dtuple fields in entry from the values of appropriate columns in row. */
dberr_t
row_ins_index_entry_set_vals(
 const dict_index_t* index,
 dtuple_t*  entry,
 const dtuple_t*  row)
{
 n_fields = dtuple_get_n_fields(entry);

 for (i = 0; i < n_fields + num_v; i++) {
  dfield_t* field;
  const dfield_t* row_field;
    ulint  len;
    
    // 获取 field
    field = dtuple_get_nth_field(entry, i);
    row_field = dtuple_get_nth_field(
    row, ind_field->col->ind);
    len = dfield_get_len(row_field);
    
    // 写入 field
  dfield_set_data(field, dfield_get_data(row_field), len);
}

其中：

与 row 相同，循环调用 dfield_set_data 方法给索引 entry 中的每个字段 field 赋值。

到这里每个索引的数据也准备好了，但是还不知道数据插入的位置，理论上是插在最后一个小于等于该 dtuple 的 rec_t 后。

cursor

row_ins_index_entry 函数中以插入主键索引为例。

row_ins_clust_index_entry_low 函数中以乐观插入为例。

// storage/innbase/row/row0ins.cc

dberr_t
row_ins_clust_index_entry_low(
/*==========================*/
 ulint  flags, /*!< in: undo logging and locking flags */
 ulint  mode, /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE,
    depending on whether we wish optimistic or
    pessimistic descent down the index tree */
 dict_index_t* index, /*!< in: clustered index */
 ulint  n_uniq, /*!< in: 0 or index->n_uniq */
 dtuple_t* entry, /*!< in/out: index entry to insert */
 ulint  n_ext, /*!< in: number of externally stored columns */
 que_thr_t* thr, /*!< in: query thread */
 bool  dup_chk_only)
    /*!< in: if true, just do duplicate check
    and return. don't execute actual insert. */
{
  btr_pcur_t pcur;
  btr_cur_t* cursor;
  
  // offsets
  ulint           offsets_[REC_OFFS_NORMAL_SIZE];
 ulint*          offsets         = offsets_;
  
  /* Note that we use PAGE_CUR_LE as the search mode, because then
 the function will return in both low_match and up_match of the
 cursor sensible values */
 // 插入操作的 search_mode 默认是 PAGE_CUR_LE，即插在最后一个小于等于该 dtuple 的 rec_t 后
 // btr_pcur_open 函数内部调用 btr_cur_search_to_nth_level 函数
 btr_pcur_open(index, entry, PAGE_CUR_LE, mode, &pcur, &mtr);
  // 从 btr_pcur_t 中获取 btr_cur_t（tree cursor）
 cursor = btr_pcur_get_btr_cur(&pcur);
 cursor->thr = thr;
  
  rec_t* insert_rec;
  
  /* 乐观插入 */
  err = btr_cur_optimistic_insert(
    flags, cursor, &offsets, &offsets_heap,
    entry, &insert_rec, &big_rec,
    n_ext, thr, &mtr);
}

其中：

入参中包括 dict_index_t* index 与 dtuple_t* entry，分别对应 ins_node_t->index 与 ins_node_t->entry，其中 entry 中保存索引的数据；
调用 btr_pcur_open 函数将 cursor 移动到索引上待插入的位置，也就是定位页与行，注意入参 mode = PAGE_CUR_LE；
调用 btr_cur_optimistic_insert 函数乐观插入。

btr_pcur_open 函数中调用 btr_cur_search_to_nth_level 函数用于自顶向下查找整个 B-tree。

page_cursor = btr_cur_get_page_cur(cursor);
 
 // 初始的，获得索引的根节点（space_id，page_no）
 const ulint  space = dict_index_get_space(index);
 const page_size_t page_size(dict_table_page_size(index->table));

 /* Start with the root page. */
 // 从 dict_index_t 元信息中拿到 root page 在物理文件的 page no（默认是 4）
 page_id_t  page_id(space, dict_index_get_page(index));
  
  // 从 buffer_pool 中根据 page no 拿 page（buf_page_get_gen），buffer pool 模块会根据 page 是否被缓存来决定是从内存中还是磁盘中读取，并根据加锁策略对 page 加锁
 block = buf_page_get_gen(page_id, page_size, rw_latch, guess,
     buf_mode, file, line, mtr);
 tree_blocks[n_blocks] = block;
  
  /* Search for complete index fields. */
  up_bytes = low_bytes = 0;
  // 对 page 内部的 record list 进行二分搜索 (page_cur_search_with_match)
  // page_cur_search_with_match：在一个数据页内“二分查找”（使用数据页中的 directory slot），定位到 record
  page_cur_search_with_match(
    block, index, tuple, page_mode, &up_match,
    &low_match, page_cursor,
    need_path ? cursor->rtr_info : NULL);

其中：

调用 buf_page_get_gen 函数根据 page no 获取 page（block），如果 page 在 buffer poo 中就直接读取，否则从文件读取到 buffer pool 中；
调用 page_cur_search_with_match 函数在 page 内部首先通过 Page Directory 二分查找确定 slot，然后线性查找确定行 record，其中 mode = PAGE_CUR_LE 。

/* Perform binary search until the lower and upper limit directory
 slots come to the distance 1 of each other */
  
  // 在索引页内查找对于指定的 tuple，满足某种条件（依赖于传入的 mode，比如 PAGE_CUR_L）的 record
 // PAGE_CUR_G（>），PAGE_CUR_GE（>=），PAGE_CUR_L（<），PAGE_CUR_LE（<=）
 // 1. 二分查找
 // 在稀疏的 Page Directory 内使用二分查找
 low = 0;
 up = page_dir_get_n_slots(page) - 1;
  
  /* Perform linear search until the upper and lower records come to
 distance 1 of each other. */

 // 二分查找结束后，low / up是临近的两个 slot，这两个 slot 指向的 record 记为
  // low_rec 和 up_rec，满足：low_rec <= tuple <= up_rec，切记 tuple 为待插入的（逻辑）记录

  // 2. 线性查找，渐渐夹逼
  // 在两个相邻的 directory 内，进行线性查找。线性查找的实现即不断"增大 low"，"减小 up"

其中：

二分查找与线性查找过程中都需要比较大小，具体调用 cmp_dtuple_rec_with_match_low 函数比较物理记录与逻辑记录的大小。

// storage/innobase/rem/rem0cmp.cc

/** Compare a data tuple to a physical record.
@param[in] dtuple data tuple
@param[in] rec B-tree record
@param[in] offsets rec_get_offsets(rec)
@param[in] n_cmp number of fields to compare
@param[in,out] matched_fields number of completely matched fields
@return the comparison result of dtuple and rec
@retval 0 if dtuple is equal to rec
@retval negative if dtuple is less than rec
@retval positive if dtuple is greater than rec */
int
cmp_dtuple_rec_with_match_low(
 const dtuple_t* dtuple, 
 const rec_t* rec,
 const ulint* offsets,
 ulint  n_cmp,
 ulint*  matched_fields)
{
  /* Match fields in a loop */
  // 遍历记录的每个字段
 for (; cur_field < n_cmp; cur_field++) {
    
    // 获取逻辑记录
    const dfield_t* dtuple_field
   = dtuple_get_nth_field(dtuple, cur_field);
    
    // 将逻辑记录转换为物理记录
    const byte* dtuple_b_ptr
   = static_cast<const byte*>(
    dfield_get_data(dtuple_field));
    
    // 根据 offset 获取物理记录
  rec_b_ptr = rec_get_nth_field(rec, offsets, cur_field,
           &rec_f_len);

  // 比较大小
  ret = cmp_data(type->mtype, type->prtype,
          dtuple_b_ptr, dtuple_f_len,
          rec_b_ptr, rec_f_len);
  if (ret) {
   goto order_resolved;
  }
    
  }
}

其中：

入参中包括逻辑记录 dtuple、物理记录 rec、记录每个 field 偏移量的数组 offsets；
出参表示大小关系；
函数内部遍历记录的每个字段，先将逻辑记录转换为物理记录，然后与根据 offset 获取到的物理记录比较大小。

到这里就定位到了数据插入的位置，开始真正的数据插入。

dtuple_t* entry -> byte

调用 btr_cur_optimistic_insert 函数乐观插入。

page_cur_t* page_cursor;
 buf_block_t* block;
 
 // 通过 cursor 获取 block
 block = btr_cur_get_block(cursor);
 // buf_block_t::frame 中保存 page
 page = buf_block_get_frame(block);
 index = cursor->index;

 // btr_cur_t->page_cursor
 page_cursor = btr_cur_get_page_cur(cursor);

 /* Check locks and write to the undo log, if specified */
  // 真正插入 entry 前，会检查事务锁和记录 undo
  // 其中修改 inherit 值，如果下一条记录上没有锁，就不需要锁分裂
  err = btr_cur_ins_lock_and_undo(flags, cursor, entry,
          thr, mtr, &inherit);

  if (err != DB_SUCCESS) {
    goto fail_err;
  }

  // 插入数据
  *rec = page_cur_tuple_insert(
    page_cursor, entry, index, offsets, heap,
    n_ext, mtr);

其中：

buf_block_t::frame 是记录要插入的数据页，page_cursor 是记录要插入的位置；
插入数据前调用 btr_cur_ins_lock_and_undo 函数检查是否有与插入意向锁冲突的锁、是否需要发生锁分裂；
调用 page_cur_tuple_insert 函数插入数据，逻辑记录保存在 entry 中。

page_cur_tuple_insert 函数中调用 rec_convert_dtuple_to_rec_new 函数将逻辑记录转换成物理记录。

/*********************************************************//**
Builds a new-style physical record out of a data tuple and
stores it beginning from the start of the given buffer.
@return pointer to the origin of physical record */
static
rec_t*
rec_convert_dtuple_to_rec_new(
/*==========================*/
 byte*   buf, /*!< in: start address of the physical record */
 const dict_index_t* index, /*!< in: record descriptor */
 const dtuple_t*  dtuple) /*!< in: data tuple */
{
 ulint extra_size;
 ulint status;
 rec_t* rec;

 // 计算记录头大小，记录头大小用 extra_size 表示
 rec_get_converted_size_comp(
  index, status, dtuple->fields, dtuple->n_fields, &extra_size);
 // 因为 buf 是用来存储整个记录的开始位置的，这里的 buf + extra_size 表示存储的第一个列的位置，即 rec 所指的位置
 rec = buf + extra_size;

 // 真正转换格式
 rec_convert_dtuple_to_rec_comp(
  rec, index, dtuple->fields, dtuple->n_fields, NULL,
  status, false);
}

其中：

调用 rec_convert_dtuple_to_rec_comp 函数将 tuple 逻辑记录转换成物理记录。

rec_convert_dtuple_to_rec_comp 函数中将每个列的值，以及 null 和 len 的信息存储到对应的位置。

/* Store the data and the offsets */

 // 将每个列的值，以及 null 和 len 的信息存储到对应的位置
 for (i = 0; i < n_fields; i++) {
  const dict_field_t* ifield;
  dict_col_t*  col = NULL;
    
    // 获取每个 field 的值
  field = &fields[i];
    
    // 计算 null 信息，因为这个标志是通过位来存储的，所以对每一个字节都需要做位处理
    // 计算列的大小和存储其长度字节数动态匹配的位置，比如判断变长列的长度占用1个字节或2个字节
    
    // 将元组列信息，写入到 compact 记录的对应列中，len为其对应的存储长度
  memcpy(end, dfield_get_data(field), len);
  }

最终调用 page_cur_insert_rec_low 函数将物理记录写入文件。

// 真正插入记录
  rec = page_cur_insert_rec_low(cursor->rec,
              index, rec, *offsets, mtr);

到这里就完成了一条数据的插入。

总结

insert 过程中用到了多个数据结构进行数据的传递。

其中：

row_prebuilt_t结构体中保存有关某个表操作相关信息，其中包括ins_node_t；
ins_node_t结构体中保存插入操作相关的相关信息，其中包括dtuple_t；
dtuple_t结构体中保存逻辑记录，也就是索引元组，其中包括dfield_t；
dfield_t结构体中保存字段信息，dfield_t::data中保存真实列数据的指针；
btr_pcur_t结构体中包括btr_cur_t；
btr_cur_t结构体中包括page_cur_t；
page_cur_t结构体中保存查询得到的记录 rec，其中包括buf_block_t；
buf_block_t结构体中保存数据页指针 frame，其中包括buf_page_t。

insert 过程中行记录格式发生了多次转换。

其中：

首先从 byte 转换成 dtuple_t* row，其中保存完整的行数据；
然后将 dtuple_t* row 转换成 dtuple_t* entry，其中保存每个索引的列数据；
最后将 dtuple_t* entry 转换成 byte，用于写入文件。

插入流程的具体函数堆栈可以参考文章【InnoDB --insert 操作分析】。

插入流程可以简化为下图。

其中：

首先进行数据行格式转换
然后定位要插入的位置
最后进行插入操作

结论

MySQL 中有行格式有三种存储方式，包括 Server 层的格式、索引元组格式（逻辑记录，tuple）、物理存储记录格式（record）。

类似的，数据页分为两种形式，包括物理页（block）、内存页（page）。

通过 B+ 树索引进行查找（lookup）是最为常见的操作，具体通过游标实现。游标分为三种类型，包括 B-tree cursor、page cursor、persistent cursor。

存取数据时需要进行行格式的转换，原因是 IO 时使用二进制，内存操作时使用逻辑记录。

以插入数据为例，主要流程为：

首先进行数据行格式转换
然后定位要插入的位置，基于 cursor 实现
最后进行插入操作

其中数据格式的转换包括：

首先从 byte 转换成 dtuple_t* row，其中保存完整的行数据；
然后将 dtuple_t* row 转换成 dtuple_t* entry，其中保存每个索引的列数据；
最后将 dtuple_t* entry 转换成 byte，用于写入文件。

数据格式转换过程中涉及较多数据结构，其中：

dtuple_t结构体中保存逻辑记录，也就是索引元组，其中包括dfield_t；
dfield_t结构体中保存字段信息，dfield_t::data中保存真实列数据的指针；
page_cur_t结构体中保存查询得到的记录 rec，其中包括buf_block_t；
buf_block_t结构体中保存数据页指针 frame，其中包括buf_page_t。

待办

row format
cursor
block

参考教程

《MySQL 内核 InnoDB 存储引擎》
《MySQL 运维内参》
InnoDB --insert 操作分析

https://www.zzvips.com/uploads/allimg/1pnvx5pwc5y data-id="u738a58b-BL4zUZma">

InnoDB：B-tree index（1）

https://www.zzvips.com/uploads/allimg/bj15cy0aear data-id="u738a58b-5nHlKP7r">

InnoDB：B-tree index（2）

https://www.zzvips.com/uploads/allimg/j1hbhsykkcm data-id="u738a58b-SctsL6lt">

InnoDB：Buffer Pool

https://www.zzvips.com/uploads/allimg/qdtitkijpih data-id="u738a58b-RG0RHA51">

MySQL · 源码分析 · 一条insert语句的执行过程

MySQL · 内核分析 · InnoDB主键约束和唯一约束的实现分析

http://mysql.taobao.org/monthly/2021/04/05/

MySQL 记录、页、索引的数据结构简析

引言

原理

记录

概念

数据结构

页

概念

数据结构

索引

概念

数据结构

游标

概念

数据结构

insert 语句

build_template

byte -> dtuple_t* row

dtuple_t* row -> dtuple_t* entry

cursor

dtuple_t* entry -> byte

总结

结论

待办

参考教程

延伸 · 阅读