编写Postgres扩展之三:调试
在上一篇关于编写Postgres Extensions的文章中,我们从头开始创建了一个新的数据类型base36。 然而,我们留下了一个导致我们的服务器崩溃的严重bug。
现在让我们用调试器来查找这个错误并完成测试工具。
我们根据这些系列中关于编写PostgreSQL扩展的内容创建了一个专用的github repo。上一篇文章中的代码可以在分支部分ii上找到,今天的更改在分支部分iii上。
Bug
首先让我们复现这个bug。
test=# CREATE EXTENSION base36;
test=# CREATE TABLE base36_test(val base36);
test=#  EXPLAIN SELECT * FROM base36_test where '3c'::base36 > val;
server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 680,225 ms
!>
我们当然不希望这种情况发生在我们的生产数据库上,所以让我们找出问题所在。我们只编写了两个相对简单的C函数base36_out和base36_in。如果我们假设我们并不比PostgreSQL核心团队的人更聪明 - 至少对我个人来说这是一个合理的假设 - 那么bug一定在其中一个中。
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(base36_in);
Datum
base36_in(PG_FUNCTION_ARGS)
{
    long result;
    char *str = PG_GETARG_CSTRING(0);
    result = strtol(str, NULL, 36);
    PG_RETURN_INT32((int32)result);
}
PG_FUNCTION_INFO_V1(base36_out);
Datum
base36_out(PG_FUNCTION_ARGS)
{
    int32 arg = PG_GETARG_INT32(0);
    if (arg < 0)
        ereport(ERROR,
            (
             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("negative values are not allowed"),
             errdetail("value %d is negative", arg),
             errhint("make it positive")
            )
        );
    char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
    /* max 6 char + '\0' */
    char *buffer        = palloc(7 * sizeof(char));
    unsigned int offset = 7 * sizeof(char);
    buffer[--offset]    = '\0';
    do {
        buffer[--offset] = base36[arg % 36];
    } while (arg /= 36);
    PG_RETURN_CSTRING(&buffer[offset]);
}
设置调试环境
为了使用想LLDB这样的调试器,你需要开启调试符号编译Postgresql。下面的简短指南适应于安装了Postgresql和Homebrew的MacOS。通过Xcode使用LLDB。
首先,我们先关闭所有正在运行的Postgresql实例-你不想搞乱你现有的数据库或其他工作:)
$ cd /usr/local/opt/postgresql
$ launchctl unload homebrew.mxcl.postgresql.plist
# Double check it’s not running:
$ psql some_db
psql: could not connect to server: No such file or directory
  Is the server running locally and accepting
  connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
然后,我们通过运行下面的脚本下载postrgesql源码。
$ cd ~
$ curl https://ftp.postgresql.org/pub/source/v9.4.4/postgresql-9.4.4.tar.bz2 | bzip2 -d | tar x
$ cd postgresql-9.4.4
并启用调试选项进行构建。
$  ./configure --enable-cassert --enable-debug CFLAGS="-ggdb"
$ make
$ sudo make install
我们将跳过Postgres文档推荐的adduser命令。相反,我将使用自己的用户帐户运行Postgres,以便更轻松地进行调试。
$ sudo chown $(whoami) /usr/local/pgsql
然后初始化数据目录
$ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
接着启动服务器
$ /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /usr/local/pgsql/data/postmaster.log start
将新安装的pgsql/bin 路径添加到path环境变量中
$ export PATH=/usr/local/pgsql/bin:$PATH
安装扩展(由于上面的export,这次使用新安装的pgxn)
$ make && make install
现在我们可以创建test数据库了
$ /usr/local/pgsql/bin/createdb test
然后连接到这个数据库
$ /usr/local/pgsql/bin/psql test
检查是否正常
test=# CREATE EXTENSION base36 ;
CREATE EXTENSION
test=# CREATE TABLE base36_test(val base36);
CREATE TABLE
test=# INSERT INTO base36_test VALUES ('123'), ('3c'), ('5A'), ('zZz');
INSERT 0 4
test=# EXPLAIN SELECT * FROM base36_test where val='3c';
server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
调试
现在我们已经配置好了调试环境,让我们开始实际追踪问题。首先,我们看一下日志文件。也就是我们用-l指定的文件。根据上面的步骤,文件路径是/usr/local/pgsql/data/postmaster.log。
TRAP: FailedAssertion("!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", File: "mcxt.c", Line: 699)
LOG:  server process (PID 6515) was terminated by signal 6: Abort trap
DETAIL:  Failed process was running: EXPLAIN SELECT * FROM base36_test where val='3c';
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2015-10-09 15:11:18 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/22D0868
LOG:  record with zero length at 0/2359140
LOG:  redo done at 0/2359110
LOG:  last completed transaction was at log time 2015-10-09 15:12:01.344859+02
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
重新连接到数据库并找出当前数据库会话的pid
test=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
           6644
(1 row)
将LLDB与pid连接(在另一个终端中)
lldb -p 6644
在psql会话中运行之前失败命令
EXPLAIN SELECT * FROM base36_test where val='3c';
继续LLDB
(lldb) c
Process 6644 resuming
Process 6644 stopped
* thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff906d2286 <+10>: jae    0x7fff906d2290            ; <+20>
    0x7fff906d2288 <+12>: movq   %rax, %rdi
    0x7fff906d228b <+15>: jmp    0x7fff906cdc53            ; cerror_nocancel
    0x7fff906d2290 <+20>: retq
从LLDB中获取到一个回溯
(lldb) bt
* thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff910f39f9 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff848f19b3 libsystem_c.dylib`abort + 129
    frame #3: 0x0000000108328549 postgres`ExceptionalCondition(conditionName="!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", errorType="FailedAssertion", fileName="mcxt.c", lineNumber=699) + 137 at assert.c:54
    frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
    frame #5: 0x00000001082ab9e3 postgres`get_const_expr(constval=0x00007ff9e1806708, context=0x00007fff57e824c8, showtype=0) + 707 at ruleutils.c:8002
    frame #6: 0x00000001082a5f79 postgres`get_rule_expr(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01') + 281 at ruleutils.c:6647
    frame #7: 0x00000001082acf22 postgres`get_rule_expr_paren(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01', parentNode=0x00007ff9e1806788) + 146 at ruleutils.c:6600
...more
我们得到了什么?异常在mcxt.c:699定义的pfree抛出。pfree在ruleutils.c:8002中被get_const_expr调用。如果我们去调用堆栈四次。 我们最终到了这里:
(lldb) up
frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
   696     * allocated chunk.
   697     */
   698    Assert(pointer != NULL);
-> 699    Assert(pointer == (void *) MAXALIGN(pointer));
   700
   701    /*
   702     * OK, it's probably safe to look at the chunk header.
让我们看看mcxt.c中的源码
文件名mcxt.c:699
/*
 * pfree
 *    Release an allocated chunk.
 */
void
pfree(void *pointer)
{
  MemoryContext context;
  /*
   * Try to detect bogus pointers handed to us, poorly though we can.
   * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
   * allocated chunk.
   */
  Assert(pointer != NULL);
  Assert(pointer == (void *) MAXALIGN(pointer));
  /*
   * OK, it's probably safe to look at the chunk header.
   */
  context = ((StandardChunkHeader *)
         ((char *) pointer - STANDARDCHUNKHEADERSIZE))->context;
  AssertArg(MemoryContextIsValid(context));
  (*context->methods->free_p) (context, pointer);
  VALGRIND_MEMPOOL_FREE(context, pointer);
}
Postgres使用pfree从当前内存上下文释放内存。不知何故,我们把内存搞砸了。
我们看一下指针的内容
(lldb) print (char *)pointer
(char *) $0 = 0x00007ff9e1813674 "3c"
这确实是我们的搜索条件3c。所以我们这里做错了什么?正如第一篇文章提到的,pfree和palloc是Postgres中free和malloc的对等体,用于在当前内存上下文中安全地分配和释放内存。不知何故我们搞错了。在base36_out中我们使用
char *buffer = palloc0(7 * sizeof(char));
来分配了7个字节的内存。最终我们偏移4返回一个指针
PG_RETURN_CSTRING(&buffer[offset]);
mcxt.c:699中的断言如下:
Assert(pointer == (void *) MAXALIGN(pointer));
确保要释放的数据正确对齐。这里的条件是
(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))
当指针以8字节的倍数开始时读取?由于返回的地址与分配的地址不同,因此pfree会因指针没有对齐而报错。
让我们来修复他!
文件名:base36.c
PG_FUNCTION_INFO_V1(base36_out);
Datum
base36_out(PG_FUNCTION_ARGS)
{
    int32 arg = PG_GETARG_INT32(0);
    if (arg < 0)
        ereport(ERROR,
            (
             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("negative values are not allowed"),
             errdetail("value %d is negative", arg),
             errhint("make it positive")
            )
        );
    char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
    /* max 6 char + '\0' */
    char buffer[7];
    unsigned int offset = sizeof(buffer);
    buffer[--offset] = '\0';
    do {
        buffer[--offset] = base36[arg % 36];
    } while (arg /= 36);
    PG_RETURN_CSTRING(pstrdup(&buffer[offset]));
}
现在我们从堆栈中分配缓冲区(第18行),最后我们使用pstrdup复制字符串到新分配的内存(第26行)。这种实现更接近 - 几乎相当于维基百科。
您可能已经猜到pstrdup是Postgres对应的strdup。它通过palloc安全地从当前内存上下文中获取内存,并在事务结束时自动释放。
类型转换
现在我们可以为我们的类型输入和输出数据了。从其他类型转换或者转换为其他类型都行。
文件名:base36–0.0.1.sql
-- type and operators omitted
CREATE CAST (integer as base36) WITHOUT FUNCTION AS IMPLICIT;
CREATE CAST (base36 as integer) WITHOUT FUNCTION AS IMPLICIT;
这是相对容易的。 由于整数和base36是二进制可强制的(即二进制内部表示相同),转换可以免费完成(WITHOUT FUNCTION)。我们还将此转换标记为隐式转换,从而告诉postgres它可以在适当的时候自动执行转换。例如,考虑这个查询:
test=# SELECT 10::integer + '5a'::base36;
 ?column?
----------
      200
(1 row)
没有定义整数+ base36运算符,但是通过将base36隐式转换为integer Postgres可以使用integer + integer运算符并将结果转换为integer。然而,隐式转换应该小心定义,因为某些操作的结果可能是可疑的。对于上面的操作,用户不知道结果是整数还是base36,因此可能会误解它。如果我们后来决定添加一个返回base36的运算符integer + base36,查询将完全中断。
更令人困惑的可能是这个查询结果
test=# SELECT -50::base36;
 ?column?
----------
      -50
(1 row)
虽然我们不允许负值,但我们在这里得到了一个,这怎么可能呢?内部Postgres执行以下操作:
test=# SELECT -(50::base36)::integer;
 ?column?
----------
      -50
我们可以也应该避免这种令人困惑的行为。一种选择是向base36输出添加前缀(就像十六进制或八进制数一样),或者将责任交给用户,只允许显式强制转换。
澄清事情的另一个选择是标记zhuanhuan AS ASSIGNMENT.
只有当您为base36类型指定一个整数时,才会自动执行该转换,反之亦然。这通常适用于INSERT或UPDATE语句。让我们试试这个
文件名:base36–0.0.1.sql
-- type and operators omitted
CREATE CAST (integer as base36) WITHOUT FUNCTION AS ASSIGNMENT;
CREATE CAST (base36 as integer) WITHOUT FUNCTION AS ASSIGNMENT;
填充我们的表格
test=# INSERT INTO base36_test SELECT i FROM generate_series(1,100) i;
INSERT 0 100
SELECT * FROM base36_test ORDER BY val LIMIT 12;
 val
-----
 1
 2
 3
 4
 5
 6
 7
 8
 9
 a
 b
 c
(12 rows)
更多内容
You have seen how important it is to test everything, not only to find bugs that in the worst case might crash the server, but also to specify the expected output from certain operations such as casts. In the next post we’ll elaborate on that creating a full-coverage test suite.
 
  还真有人点开啊🤣随意随意😂

 
                
            
         
         浙公网安备 33010602011771号
浙公网安备 33010602011771号