Sunday, November 23, 2014

PyLongObject representation in memory and restoring long integer value from memory

I've tried to understand how CPython's "long" value is stored in memory. To do this I used the WinDbg.
I've used the next long value for check:
0xaaaabbbbccccddddeeeeffff00001111 in hexadecimal, or 226855257439031502727993705501399453969L in decimal format.
So on Windows x64, I've the next memory representation of PyLongObject:

Using bytes format:
00000000`0034ec60 02 00 00 00 00 00 00 00 e0 65 29 1e 00 00 00 00 
00000000`0034ec70 05 00 00 00 00 00 00 00 11 11 00 00 fc ff bb 3b 
00000000`0034ec80 de dd cd 0c f3 ee ae 2a aa 00 00 00 00 f0 ad ba 

Or using WinDbg's pointer and symbol memory display format:
00000000`0034ec60 0000000000000002 
00000000`0034ec68 000000001e2965e0 python27!PyLong_Type
00000000`0034ec70 0000000000000005 
00000000`0034ec78 3bbbfffc00001111 
00000000`0034ec80 2aaeeef30ccdddde 
00000000`0034ec88 baadf000000000aa 

On x86 version of Windows the memory representation is the next:
022bb300 00000002 
022bb304 1e1f25e0 python27!PyLong_Type
022bb308 00000009 
022bb30c 00001111 
022bb310 77777ffc
022bb314 199b5dde 
022bb318 555d6ef3 
022bb31c baad00aa 

Knowing the structure of PyLongObject:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
which expands to:
struct _longobject {
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
    Py_ssize_t ob_size;
    digit ob_digit[1];
};

As on x86 version:
typedef unsigned short digit;
#define PyLong_SHIFT 15

and on x64 version:
typedef PY_UINT32_T digit;
#define PyLong_SHIFT 30

According to the structure, let's consider the contents of the memory:
0000000000000002 - ob_refcnt
000000001e2965e0 - ob_type
It is easy to check the type value in Python interpreter:
>>> hex(id(long))
'0x1e2965e0L'
0000000000000005 - ob_size. Number of digit objects (for x64 one digit is uint32).
3bbbfffc00001111 - ob_digit
2aaeeef30ccdddde - ob_digit
baadf000000000aa - ob_digit
Here the value baadf000 (or baad for alignment on the above x86 example) is a value of uninitialized allocated heap memory.

Now the main question is how to restore the original long integer value from memory. Successfully I've found the answer in CPython sources. 
Long integer representation.
The absolute value of a number is equal to
SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
Negative numbers are represented with ob_size < 0;
zero is represented by ob_size == 0.

Knowing that I've wrote a function to restore long value. Here is script with hardcoded bytes from the memory above:
import struct
import platform

is_x64 = platform.architecture()[0] == '64bit'

if is_x64:
    SHIFT = 30
    digit_size = struct.calcsize('I')
else:
    SHIFT = 15
    digit_size = struct.calcsize('H')

def restore_python_long():
    """
    Restore python long integer value from memory
    """
    # hard coded data bytes from memory
    if is_x64:
        ob_size_data_str = '05 00 00 00 00 00 00 00'
        ob_digit_data_str = '11 11 00 00 fc ff bb 3b de dd cd 0c f3 ee ae 2a aa 00 00 00'
    else:
        ob_size_data_str = '09 00 00 00'
        ob_digit_data_str = '11 11 00 00 fc 7f 77 77 de 5d 9b 19 f3 6e 5d 55 aa 00'
    
    ob_size_data = ob_size_data_str.replace(' ', '').decode('hex')
    ob_digit_data = ob_digit_data_str.replace(' ', '').decode('hex')

    # get ob_size
    if is_x64:
        ob_size = struct.unpack('q', ob_size_data)[0]
    else:
        ob_size = struct.unpack('i', ob_size_data)[0]

    if ob_size == 0:
        return 0L

    # get digits
    digits = []
    for i in xrange(0, abs(ob_size)):
        digit_value = ob_digit_data[i * digit_size: i * digit_size + digit_size]
        if is_x64:
            digits.append(struct.unpack('I', digit_value)[0])
        else:
            digits.append(struct.unpack('H', digit_value)[0])

    # restore long
    value = 0L
    for i in xrange(0, abs(ob_size)):
        value += digits[i] * 2 ** (SHIFT * i)

    if ob_size < 0:
        value = -value
    return value

value = restore_python_long()
print hex(value)
print value

And the result is the same as input:
0xaaaabbbbccccddddeeeeffff00001111L
226855257439031502727993705501399453969

No comments:

Post a Comment