A Brief Look at Pickle Deserialization

Pickle Introduction

pickle has several protocol versions. For compatibility, data written with an older protocol can still be parsed by newer versions of Python.
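As a quick check of that compatibility claim, the sketch below pickles one value with every protocol the running interpreter supports and loads each result back. The sample value and variable names are made up for illustration.

```python
import pickle

value = [1, "two", 3.0]
roundtrip = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(value, protocol=proto)
    # loads() takes no protocol argument: the format is read from the data itself
    roundtrip[proto] = pickle.loads(data)

print(sorted(roundtrip))  # the protocol numbers that were tried
```

Every entry in `roundtrip` should compare equal to `value`, whichever protocol produced the bytes.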

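v0 is the original pickle protocol. Its opcodes are all printable characters, which makes it the most readable, so we start with v0. The details of the protocol can be found in the source code of pickletools.

pickle maintains two data structures: a stack and a memo.

  • Every operation works on the stack. When unpickling finishes, the stack should hold exactly one object: the result of unpickling, which is what pickle.loads returns.
  • The memo is just storage for objects. Think of it as an array or list that is read and written by index.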

basic type

For example, serialize an int and look at the bytes:

a = 1
data = pickle.dumps(a, protocol=0)
print(data)
# b'I1\n.'

We can use pickletools to inspect the serialized data more directly.
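A minimal sketch of that inspection, using `pickletools.dis` on the bytes from the previous example. The listing is captured in a `StringIO` buffer here only so it can be reused; calling `pickletools.dis(data)` directly prints to stdout.

```python
import io
import pickle
import pickletools

data = pickle.dumps(1, protocol=0)

buf = io.StringIO()
pickletools.dis(data, out=buf)  # write the disassembly into buf
listing = buf.getvalue()
print(listing)  # one line per opcode: an INT followed by a STOP
```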

Here the . opcode marks the end of the pickle.

I(name='STOP',
  code='.',
  arg=None,
  stack_before=[anyobject],
  stack_after=[],
  proto=0,
  doc="""Stop the unpickling machine.

  Every pickle ends with this opcode.  The object at the top of the stack
  is popped, and that's the result of unpickling.  The stack should be
  empty then.
  """),

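The I opcode pushes an int onto the stack.

When pickle loads these bytes, it first pushes 1 onto the stack, then reaches . , stops, and returns the 1 that is the only item on the stack.

What happens if more than one item is on the stack when the pickle ends?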

data = b'I1\nI2\n.'
print(pickle.loads(data))
# 2

It still returns normally, and the result is the object on top of the stack.

pickletools, however, reports this as an error:

data = b'I1\nI2\n.'
pickletools.dis(data)
'''
    0: I    INT        1
    3: I    INT        2
    6: .    STOP
highest protocol among opcodes = 0
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    pickletools.dis(data)
  File "/usr/lib/python3.8/pickletools.py", line 2547, in dis
    raise ValueError("stack not empty after STOP: %r" % stack)
ValueError: stack not empty after STOP: [int_or_bool]
'''

And what if the stack is empty when the pickle ends?

pickletools naturally rejects it:

data = b'.'
pickletools.dis(data)
'''
    0: .    STOP
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    pickletools.dis(data)
  File "/usr/lib/python3.8/pickletools.py", line 2535, in dis
    raise ValueError("tries to pop %d items from stack with "
ValueError: tries to pop 1 items from stack with only 0 items
'''

pickle.loads rejects it as well:

data = b'.'
print(pickle.loads(data))
'''
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(pickle.loads(data))
_pickle.UnpicklingError: unpickling stack underflow
'''

What about other basic types?

The pickletools source lists how each type of data is represented:

I(name='INT',
  code='I',
  arg=decimalnl_short,
  stack_before=[],
  stack_after=[pyinteger_or_bool],
  proto=0,
  doc="""Push an integer or bool.

  The argument is a newline-terminated decimal literal string.

  The intent may have been that this always fit in a short Python int,
  but INT can be generated in pickles written on a 64-bit box that
  require a Python long on a 32-bit box.  The difference between this
  and LONG then is that INT skips a trailing 'L', and produces a short
  int whenever possible.

  Another difference is due to that, when bool was introduced as a
  distinct type in 2.3, builtin names True and False were also added to
  2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
  True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
  Leading zeroes are never produced for a genuine integer.  The 2.3
  (and later) unpicklers special-case these and return bool instead;
  earlier unpicklers ignore the leading "0" and return the int.
  """),

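The I opcode is followed by a decimal number terminated by a newline, and represents an integer. As the doc above notes, I01 and I00 are special-cased and load as True and False.

The L opcode is similar to I, but the number ends with L.

The N opcode represents None.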

data = b'N.'
print(pickle.loads(data))
# None

The S opcode pushes a string; the value must be wrapped in quotes.

The V opcode pushes a unicode string. Unlike S, the value is not quoted.

The F opcode pushes a float.
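The opcodes listed above can be tried by hand. The following sketch loads one hand-written v0 pickle per opcode; the sample values are arbitrary.

```python
import pickle

samples = {
    "LONG": b"L123L\n.",         # decimal digits followed by a trailing L
    "STRING": b"S'abc'\n.",      # quoted
    "UNICODE": b"Vabc\n.",       # not quoted
    "FLOAT": b"F1.5\n.",
    "INT as bool": b"I01\n.",    # special-cased: loads as True
}
loaded = {name: pickle.loads(raw) for name, raw in samples.items()}
for name, value in loaded.items():
    print(f"{name:12} -> {value!r}")
```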


list

What about composite data types?

Start with list:

I(name='LIST',
  code='l',
  arg=None,
  stack_before=[markobject, stackslice],
  stack_after=[pylist],
  proto=0,
  doc="""Build a list out of the topmost stack slice, after markobject.

  All the stack entries following the topmost markobject are placed into
  a single Python list, which single list object replaces all of the
  stack from the topmost markobject onward.  For example,

  Stack before: ... markobject 1 2 3 'abc'
  Stack after:  ... [1, 2, 3, 'abc']
  """),

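The l opcode builds a list: it pops every item above the topmost markobject (along with the markobject itself), collects the items into a list, and pushes that list back. The elements keep the order in which they were pushed.

markobject is simply a marker on the stack that other opcodes use to delimit their operands.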

I(name='MARK',
  code='(',
  arg=None,
  stack_before=[],
  stack_after=[markobject],
  proto=0,
  doc="""Push markobject onto the stack.

  markobject is a unique object, used by other opcodes to identify a
  region of the stack containing a variable number of objects for them
  to work on.  See markobject.doc for more detail.
  """),

For example, to build [1, 2, 3], push a mark, push 1, 2 and 3 in order, then apply l:

data = b'(I1\nI2\nI3\nl.'
print(pickle.loads(data))
# [1, 2, 3]

Here is what pickletools shows:

data = b'(I1\nI2\nI3\nl.'
pickletools.dis(data)
'''
    0: (    MARK
    1: I        INT        1
    4: I        INT        2
    7: I        INT        3
   10: l        LIST       (MARK at 0)
   11: .    STOP
highest protocol among opcodes = 0
'''

If we serialize a list directly, the resulting bytes are not necessarily laid out this way, but loading them gives the same result:

l = [1, 2, 3]
data = pickle.dumps(l, protocol=0)

print(data)
# b'(lp0\nI1\naI2\naI3\na.'

pickletools.dis(data)
'''
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: I    INT        1
    8: a    APPEND
    9: I    INT        2
   12: a    APPEND
   13: I    INT        3
   16: a    APPEND
   17: .    STOP
highest protocol among opcodes = 0
'''

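Several other opcodes come up when working with lists and the stack.

The a opcode pops the top of the stack and appends it to the list just below it.

The 0 opcode pops the top element off the stack.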

data = b'I1\nI2\nI3\n0.'
print(pickle.loads(data))
# 2

The 2 opcode duplicates the top of the stack: it pushes a second reference to the same object, not a copy of it.
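The a (APPEND) and 2 (DUP) opcodes described above can be exercised with a few hand-written pickles. This is only a sketch; the payloads are made up for illustration.

```python
import pickle

# MARK, LIST -> [], then push 1 and APPEND it to the list below it
appended = pickle.loads(b"(lI1\na.")

# MARK, push 7, DUP the top of the stack, then LIST collects both items
duplicated = pickle.loads(b"(I7\n2l.")

# DUP pushes the same object again: the outer list ends up holding
# two references to one inner list
pair = pickle.loads(b"((l2l.")

print(appended, duplicated, pair[0] is pair[1])
```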

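The g opcode reads an element from the memo and pushes it onto the stack.

The p opcode stores the top element of the stack in the memo (without popping it).

Memo operations are index-based: the opcode is followed by a decimal index and a newline.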

data = b'I1\np2\n0g2\n.'
print(pickle.loads(data))
# 1

pickletools.dis(data)
'''
    0: I    INT        1
    3: p    PUT        2
    6: 0    POP
    7: g    GET        2
   10: .    STOP
highest protocol among opcodes = 0
'''
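To make the stack and memo model concrete, here is a toy interpreter for the handful of v0 opcodes covered so far (I, N, (, l, a, 0, 2, p, g and .). It is a simplified sketch, not how pickle is actually implemented: the name `toy_loads` and the sentinel `MARK` object are assumptions for illustration, and details such as the I01/I00 bool special case are ignored.

```python
import pickle

def toy_loads(data: bytes):
    """Interpret a tiny subset of pickle protocol 0."""
    stack = []          # operand stack
    memo = {}           # index -> object
    MARK = object()     # sentinel standing in for markobject
    pos = 0

    def read_line() -> bytes:
        # Read a newline-terminated argument and advance past it
        nonlocal pos
        end = data.index(b"\n", pos)
        line = data[pos:end]
        pos = end + 1
        return line

    while True:
        op = data[pos:pos + 1]
        pos += 1
        if op == b"I":
            stack.append(int(read_line()))
        elif op == b"N":
            stack.append(None)
        elif op == b"(":
            stack.append(MARK)
        elif op == b"l":
            # Collect everything above the topmost mark into a list
            i = max(k for k, v in enumerate(stack) if v is MARK)
            items = stack[i + 1:]
            del stack[i:]
            stack.append(items)
        elif op == b"a":
            value = stack.pop()
            stack[-1].append(value)
        elif op == b"0":
            stack.pop()
        elif op == b"2":
            stack.append(stack[-1])
        elif op == b"p":
            memo[int(read_line())] = stack[-1]   # store without popping
        elif op == b"g":
            stack.append(memo[int(read_line())])
        elif op == b".":
            return stack.pop()                   # the result is the top of the stack
        else:
            raise ValueError(f"unsupported opcode {op!r}")

for sample in (b"(I1\nI2\nI3\nl.", b"(lp0\nI1\naI2\naI3\na.", b"I1\np2\n0g2\n."):
    print(sample, toy_loads(sample), pickle.loads(sample))
```

For these samples the toy interpreter and `pickle.loads` should agree, which is a reasonable sanity check of the model described in the text.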

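pickle also supports other data structures such as dict and set, handled much like list. (In v0, dict has the d opcode and tuple has t; set only gets dedicated opcodes in protocol 4.)

Note that when building a dict, the items on the stack must come in pairs (key, then value).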

data = b'(Vfoo\nI123\nVbar\nI456\nd.'
print(pickle.loads(data))
# {'foo': 123, 'bar': 456}

pickletools.dis(data)
'''
    0: (    MARK
    1: V        UNICODE    'foo'
    6: I        INT        123
   11: V        UNICODE    'bar'
   16: I        INT        456
   21: d        DICT       (MARK at 0)
   22: .    STOP
highest protocol among opcodes = 0
'''
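Besides building a dict in one go with d, a dict can be filled one pair at a time with the s (SETITEM) opcode, and t (TUPLE) works like l but produces a tuple. A small sketch with made-up payloads:

```python
import pickle

# MARK, DICT -> {}, then push a key and a value and SETITEM them into the dict
built_with_setitem = pickle.loads(b"(dVfoo\nI123\ns.")

# MARK, two ints, then TUPLE collects everything above the mark
built_tuple = pickle.loads(b"(I1\nI2\nt.")

print(built_with_setitem, built_tuple)
```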

class & object

Next, classes and objects.

There are several ways to obtain an object inside a pickle.

class Test:
    def __init__(self, foo, bar) -> None:
        self.foo = foo
        self.bar = bar

data = b'(i__main__\nTest\n.'
print(pickle.loads(data))
# <__main__.Test object at 0x7f975a1cacd0>

pickletools.dis(data)
'''
    0: (    MARK
    1: i        INST       '__main__ Test' (MARK at 0)
   16: .    STOP
highest protocol among opcodes = 0
'''

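The i opcode takes two newline-terminated arguments, a module name and a class name. It pops everything from the top of the stack down to the mark into a tuple, then builds an instance of that class.

  • If the tuple is empty and the class has no __getinitargs__, the instance is created without calling __init__. Python 2 did this by instantiating a dummy old-style class and rebinding its __class__ to the target class; Python 3 calls cls.__new__(cls) instead. Either way, accessing a member that is only set in __init__ fails with a "has no attribute" error.
  • Otherwise the tuple is passed as the arguments to __init__, in the order the items were pushed.

class Test: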
    def __init__(self, foo, bar) -> None:
        self.foo = foo
        self.bar = bar

data = b'(I123\nI456\ni__main__\nTest\n.'
print(pickle.loads(data).bar)
# 456

pickletools.dis(data)
'''
    0: (    MARK
    1: I        INT        123
    6: I        INT        456
   11: i        INST       '__main__ Test' (MARK at 0)
   26: .    STOP
highest protocol among opcodes = 0
'''
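The two INST cases can be compared side by side. In this sketch the class is registered under a throwaway module name, `pickle_demo` (a hypothetical name chosen here), so the payload resolves regardless of how the script is launched; the examples in the text use `__main__` instead.

```python
import pickle
import sys
import types

class Test:
    def __init__(self, foo=None, bar=None):
        self.foo = foo
        self.bar = bar

# Make the class importable as pickle_demo.Test (hypothetical module name)
demo = types.ModuleType("pickle_demo")
demo.Test = Test
sys.modules["pickle_demo"] = demo

# Empty argument tuple: the instance is created without running __init__
no_args = pickle.loads(b"(ipickle_demo\nTest\n.")

# Non-empty tuple: the items are passed to the class as positional arguments
with_args = pickle.loads(b"(I123\nI456\nipickle_demo\nTest\n.")

print(hasattr(no_args, "foo"))          # False, because __init__ never ran
print(with_args.foo, with_args.bar)
```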

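The b opcode applies state to an object, either by calling __setstate__ or by updating its __dict__.

  • Which one is used depends on whether the object has __setstate__.
  • For the __dict__ update path, the top of the stack must be a dict holding the new attributes.

class Test: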
    def __init__(self) -> None:
        self.foo = 123
        self.bar = 456

data = b'(i__main__\nTest\n(Vfoo\nI111\nVbar\nI222\ndb.'
print(pickle.loads(data).foo)
# 111

pickletools.dis(data)
'''
    0: (    MARK
    1: i        INST       '__main__ Test' (MARK at 0)
   16: (    MARK
   17: V        UNICODE    'foo'
   22: I        INT        111
   27: V        UNICODE    'bar'
   32: I        INT        222
   37: d        DICT       (MARK at 16)
   38: b    BUILD
   39: .    STOP
highest protocol among opcodes = 0
'''
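The example above takes the __dict__ update path. The sketch below shows the other branch, where the class defines __setstate__ and BUILD hands the state to it instead. The class name `WithSetstate` and the module name `pickle_demo` are hypothetical, introduced only for this illustration.

```python
import pickle
import sys
import types

class WithSetstate:
    def __setstate__(self, state):
        # Called by BUILD in place of __dict__.update(state)
        self.received = state

# Make the class importable as pickle_demo.WithSetstate (hypothetical module name)
demo = types.ModuleType("pickle_demo")
demo.WithSetstate = WithSetstate
sys.modules["pickle_demo"] = demo

obj = pickle.loads(b"(ipickle_demo\nWithSetstate\n(Vfoo\nI111\ndb.")
print(obj.received, hasattr(obj, "foo"))
```

Because __setstate__ decides what to do with the state, `foo` is not set as an attribute here; the dict is only stored in `received`.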

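The R opcode is how the result of __reduce__ is replayed at load time: it calls a callable.

  • R needs two items on the stack: a callable and a tuple. The tuple supplies the arguments to the callable.
  • Both are popped, and the return value of the callable is pushed back onto the stack.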
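To see where R comes from on the pickling side, the sketch below defines a class whose __reduce__ returns a (callable, args) pair and lists the opcodes that protocol 0 emits for it. The class name `LazyPower` is made up for this example.

```python
import pickle
import pickletools

class LazyPower:
    def __reduce__(self):
        # (callable, args): unpickling will evaluate pow(2, 10)
        return (pow, (2, 10))

data = pickle.dumps(LazyPower(), protocol=0)

# genops yields (opcode, argument, position) for each opcode in the pickle
opcodes = [op.name for op, arg, pos in pickletools.genops(data)]
print(opcodes)             # includes GLOBAL for the callable and REDUCE for the call
print(pickle.loads(data))  # the return value of the callable, not a LazyPower
```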

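R can be used to call copy_reg._reconstructor to build an object. (copy_reg and __builtin__ are the Python 2 names; when loading protocol 0 to 2 data with the default fix_imports=True, Python 3 maps them to copyreg and builtins.)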

class Test:
    def __init__(self) -> None:
        self.foo = 123
        self.bar = 456

data = b'ccopy_reg\n_reconstructor\n(c__main__\nTest\nc__builtin__\nobject\nNtR.'
print(pickle.loads(data))
# <__main__.Test object at 0x7f2a23326cd0>

pickletools.dis(data)
'''
    0: c    GLOBAL     'copy_reg _reconstructor'
   25: (    MARK
   26: c        GLOBAL     '__main__ Test'
   41: c        GLOBAL     '__builtin__ object'
   61: N        NONE
   62: t        TUPLE      (MARK at 25)
   63: R    REDUCE
   64: .    STOP
highest protocol among opcodes = 0
'''

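So the essence of the R opcode is to call a function and push its return value back onto the stack. If the data being unpickled is attacker-controlled, this can be abused in many ways, for example to execute code: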

data = b'c__builtin__\nprint\n(I123\ntR.'
print(pickle.loads(data))
'''
123
None
'''
pickletools.dis(data)
'''
    0: c    GLOBAL     '__builtin__ print'
   19: (    MARK
   20: I        INT        123
   25: t        TUPLE      (MARK at 19)
   26: R    REDUCE
   27: .    STOP
highest protocol among opcodes = 0
'''
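Every c (GLOBAL) lookup goes through `Unpickler.find_class`, so that method is the natural place to observe or restrict what a pickle may call. The sketch below follows the approach described in the standard library documentation; the allow-list contents and the names `AllowListUnpickler` and `restricted_loads` are assumptions for illustration, not a complete defence.

```python
import io
import pickle

# Hypothetical allow-list for this demo: only builtins.len may be resolved
ALLOWED = {("builtins", "len")}

class AllowListUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every GLOBAL opcode with the names as written in the pickle
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return AllowListUnpickler(io.BytesIO(data)).load()

# len([1, 2]) is on the allow-list, so the REDUCE goes through
print(restricted_loads(b"cbuiltins\nlen\n((I1\nI2\nltR."))

# The print payload from the text is rejected before anything is called
try:
    restricted_loads(b"c__builtin__\nprint\n(I123\ntR.")
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```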