[personal profile] omnifarious

I have been working on a serialization framework I'm happy with for Python. I want to be able to describe CAKE protocol messages clearly and succinctly. This will make it easier to tweak the messages without having to rip apart difficult to understand code. It will also make it easier to understand if I drop the project again and then come back to it years later, or if (by some miracle) someone else decides to help me with it.

Here is what I've come up with as the interface, along with one implementation fo that interface for a simple type:

class Serializer(object):
    """This is class is an abstract base class.  Derived classes, when
    instantiated, create objects that can serialize other objects of a
    particular type to a sequence of bytes, or alternately deserialize
    a sequence of bytes into an object of a particular type."""

    __slots__ = ('__weakref__',)

    def __init__(self):
        super(Serializer, self).__init__()

    def serialize(self, val):
        """x.serialize(value) -> b'serialized value'

        This is implemented in terms of serialize_iter by default.

        It is suggested that derived classes only implement serialize
        or serialize_iter and implement one in terms of the other."""
        if self.__class__ is Serializer:
            raise NotImplentedError("This is an abstract class.")
        return b''.join(x for x in self.serialize_iter(val))

    def serialize_iter(self, val):
        """x.serialize_iter(value) -> an iterator over the bytes
        sequences making p the seralized version of value."""
        if self.__class__ is Serializer:
            raise NotImplentedError("This is an abstract class.")
        return iter((self.serialize(val),))

    def deserialize(self, data, memo=None):
        """x.deserialize(data, [memo]) ->
        (value of the appropriate type, memoryview(remaining_data))

        data must be of type 'bytes', or 'memoryview'.  The memo must
        be a value extracted from a previous NotEnoughDataError.

        It is undefined what happens if you use memo and do not pass
        the same data (plus some possible extra data on the end) into
        deserialize that you originally passed in when you got the
        NotEnoughDataError you extracted the memo from.

        May raise a ParseError if there is a problem with the data.
        If the failure was because the parser ran out of data before
        parsing was finished, this is required to be a
        NotEnoughDataError."""
        return self._deserialize(data if not isinstance(data, bytes) \
                                     else memoryview(data),
                                 memo)

    def _deserialize(self, memview, memo=None):
        """x._deserialize(memoryview) ->
        (value of the appropriate type, memoryview(remaining_data))

        Exactly like deserialize, except a memoryview object is
        required.  deserialize is implemented in terms of
        _deserialize.  Derived classes are expected to override
        _deserialize."""
        raise NotImplentedError("This is an abstract class.")


class SmallInt(Serializer): """This class is for integers that are 8, 16, 32, or 64 bits long. They may be signed or unsigned. No other sizes are supported. >>> s = SmallInt(2, True) Traceback (most recent call last): ... ValueError: size is 2, must be 8, 16, 32 or 64 >>> s = SmallInt(8, True) >>> b = list(s.serialize_iter(5)) >>> b == [b'\\x05'] True >>> o = s.deserialize(b''.join(b)) >>> o = (o[0], o[1].tobytes()) >>> o == (5, b'') True >>> o = s.deserialize(b''.join(b) + b'z') >>> o = (o[0], o[1].tobytes()) >>> o == (5, b'z') True >>> s = SmallInt(8, True) >>> b = s.serialize(-5) >>> b == b'\\xfb' True >>> s = SmallInt(8, True) >>> s = s.serialize(128) Traceback (most recent call last): ... ValueError: 128 is out of range for an signed 8 bit integer >>> s = SmallInt(64, False) >>> b = s.serialize(2**64-1) >>> b == b'\\xff\\xff\\xff\\xff\\xff\\xff\\xff\\xff' True >>> s = SmallInt(64, True) >>> b = s.serialize(-2**63) >>> b == b'\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00' True """ _formats = dict(( ((8, True), '>b'), ((8, False), '>B'), ((16, True), '>h'), ((16, False), '>H'), ((32, True), '>i'), ((32, False), '>I'), ((64, True), '>q'), ((64, False), '>Q') )) __slots__ = ('_size', '_signed', '_low', '_high', '_format') def __init__(self, size, signed): if size not in (8, 16, 32, 64): raise ValueError("size is %d, must be 8, 16, 32 or 64" % (size,)) self._size = size self._signed = bool(signed) self._format = self._formats[(size, signed)] def serialize(self, value): if not isinstance(value, (int, long)): raise TypeError("%r must be an int or long" % (value,)) value = int(value) try: ret = _struct.pack(self._format, value) except _struct.error: raise ValueError("%d is out of range for an %ssigned %d bit " "integer" % (value, ("un" if not self._signed else ""), self._size)) return ret def _deserialize(self, memview, memo=None): numbytes = self._size // 8 if len(memview) < numbytes: raise _NotEnoughDataError((self._size // 8) - len(memview)) else: data = memview[0:numbytes].tobytes() remaining = memview[numbytes:] try: result = _struct.unpack(self._format, data)[0] return result, remaining except _struct.error as err: raise ParseErrror(err)

There is also a CompoundNumbered type for representing tuples. This allows you to represent structured messages with multiple fields. Here is example of how you might represent CAKE new session messages:

cake_newsess_v2 = _serial.CompoundNumbered(
    _serial.Count(), # Version
    _serial.Count(), # Type
    _serial.KeyName(), # Destination key
    _serial.KeyName(), # Source key
    _serial.SmallInt(64, False), # Session serial #
    _serial.CountDelimitedByteString(), # Encryption header
    _serial.CountDelimitedByteString(), # Signature.
    _serial.FixedLengthByteString(32) # Header HMAC
)

There is a problem though. The signature and header HMAC are supposed to be encrypted, but the deserializer can't know the key to use until it's decrypted the encryption header. This means that later parts of the deserialization process need to know about things from previous parts.

I have a way for the deserialization process to save state. This is used so that if deserialization throws a NotEnoughDataError because not enough data is available, the exception may have a memo field. This memo field can then be passed in again to resume close to where deserialization stopped. (Though now I'm sort of wondering if I shouldn't do something generator based instead...)

But this mechanism does not allow state to be passed forward from a previous deserializer to a new one. And this applies the other way around too. When serializing there is stuff that's not really a part of the data being serialized (like the current HMAC or encryption state) that needs to be known by serializer in order to serialize properly.

I'm thinking of adding an optional context parameter to the serialization and deserialization functions that's just an empty dictionary into which this sort of state can be stuffed. But this seems really messy. Can anybody think of any better ways to do this that are fairly general?

Simplification by reusing encoding protocols

Date: 2011-02-04 07:01 pm (UTC)
From: [identity profile] https://www.google.com/accounts/o8/id?id=AItOawnVzkBT9W2c7MFFW7BFyL36YPol8F2Dy-0
The CAKE protocol looks interesting but it mixes two concepts: structured data encoding, and security encapsulation of message passing.

IMHO, if you were enforcing and reusing off the shelf encoding like thrift or protobuf, it would make the design and code much simpler to parse.

The following will probably look naive as I haven't looked at CAKE implementation code, maybe I haven't scoped the whole problem.

Once you've done that, it would be merely:
raw_data = pipe.read()
# unpack_header is a thrift/protobuf raw reader.
hdr = unpack_header(raw_data)

# hdr.firsthmac and hdr.messages are still encrypted at this point
# hdr.messages is a list of message as defined at http://www.cakem.net/v2/sessions.html#repeated

# Grab valid key, verify signature. Throws on invalid signature, etc.
process(hdr)

# hdr.firsthmac is now decrypted and verified

if hdr.message_type == 1:  # session continuation
  for message in hdr.messages:
    # unpack_data is whatever decoding the user wants to use.
    # in practice, it's probably better to not even have this function here and
    # just yield the raw buffer.
    data = unpack_data(decrypt_and_verify(hdr.key, message.padding))
    yield data

Both thrift and protobuf handle a lots of the problems for you like: futureproofing the protocol message format, efficient encoding & decoding of native types line int&string, multi-language support, golden message definition in a single file, etc.

Profile

Lover of ideas

November 2012

S M T W T F S
    123
45678910
11121314151617
18 192021222324
252627282930 

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Oct. 24th, 2014 01:28 pm
Powered by Dreamwidth Studios