Conversion of string to bytearray in Python


I have a string variable in this format: var = "\x14\x12\x13\x11\x69"

My challenge is to convert this string into a bytearray to get exactly: bytearray(b'\x14\x12\x13\x11\x69')

expected_bytearry = bytearray(var, 'utf-8')

If I print(expected_bytearry), I get this result:

bytearray(b'\\x14\\x12\\x13\\x11\\x69'), which contains double backslashes.

Another approach which I tried was expected_bytearryy = bytearray.fromhex(var.encode().hex())

with that, **I get bytearray(b'\x14\x12\x13\x11i')** **instead of bytearray(b'\\x14\\x12\\x13\\x11\\x69')**

I also tried:

b = bytearray(ord(c) for c in var)
a = bytearray(var, 'utf-8')

Both of them produced the same result, with double backslashes.

Any help would be appreciated.

Thanks


There is 1 answer

chrslg On

That's unicode_escape encoding. So you need to decode with this encoding.

But unfortunately, you are in the worst situation: decoding is done on byte strings, and you have a string. And it produces a string, and you want a bytearray.

Hence this rather convoluted solution:

var.encode('ascii').decode('unicode_escape').encode('latin1')
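To make the one-liner concrete, here is a minimal sketch applied to the string from the question. It assumes that `var` really contains literal backslashes (written here as a raw string), and the helper name `escaped_text_to_bytearray` is made up for illustration. The chain itself returns `bytes`, so the sketch wraps it in `bytearray()`.

```python
# Assumption: the variable holds literal backslash characters, as in the question.
var = r"\x14\x12\x13\x11\x69"


def escaped_text_to_bytearray(text: str) -> bytearray:
    """Hypothetical helper: turn backslash-x escapes written in `text` into real bytes."""
    # str -> bytes (ASCII) -> str (escapes interpreted) -> bytes (one byte per char)
    raw = text.encode("ascii").decode("unicode_escape").encode("latin1")
    return bytearray(raw)


result = escaped_text_to_bytearray(var)
print(result)        # bytearray(b'\x14\x12\x13\x11i')
print(list(result))  # [20, 18, 19, 17, 105]
```

The last byte is displayed as `i` only because 0x69 is the ASCII code of the letter i. It is the same value as `\x69`, so the result compares equal to `bytearray(b'\x14\x12\x13\x11\x69')`.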

What sends people's minds into a loop with this kind of thing is the fact that Python also uses escape sequences when interacting with human coders: when reading a string or bytestring literal in the code, and when printing a bytestring to the console.

What is important to understand is that in the bytestring b'\x14\x13' there are no backslashes and no x, only 2 bytes: 20 and 19 (14 and 13 in hex). Likewise in the string '\x14\x13': it is just a string made of the 2 characters whose codes are 20 and 19 (Python strings are Unicode, and those codes fall in the "ascii" part of Unicode, so they are the non-printable ASCII characters 20 and 19).

But in your string you do have literal backslashes and x, like in the string r'\x14\x13', which Python prints as '\\x14\\x13' (because it uses escape encoding for display). But you have only 2 backslashes, not 4 (it is only when printing with escape encoding that they appear doubled).
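A quick way to convince yourself of the difference is to compare the two kinds of string directly. This is only an illustrative sketch; the variable names `escaped` and `literal` are made up for the example.

```python
escaped = "\x14\x13"    # escapes interpreted by Python: 2 characters
literal = r"\x14\x13"   # raw string: 8 characters, with real backslashes

print(len(escaped), [ord(c) for c in escaped])
print(len(literal), literal.count("\\"))

print(repr(literal))  # repr doubles each backslash for display
print(literal)        # plain print shows the 8 actual characters

print(list(b"\x14\x13"))  # the bytestring really holds just two numbers
```

The doubled backslashes only show up in the `repr` form. The count of actual backslash characters in `literal` is 2.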

So what I do here, is start from a string var=r'\x14\x13' (shown here in its raw string form), that is a string made of 8 literal characters backslash, x, 1, 4, backslash, x, 1, 3.

I encode those characters in ASCII. Pragmatically the choice hardly matters: these are only digits, letters and backslashes, which encode to the same single bytes in any ASCII-compatible encoding (no machine less than 40 years old or so would have a default encoding where they differ). But strictly speaking, it is ASCII. So I get a bytestring of 8 bytes, because each of those 8 chars has a 1-byte encoding in ASCII: 92, 120, 49, 52, 92, 120, 49, 51. Python prints it as b'\\x14\\x13', but again, ignore that. It is just confusing, because Python's REPL is already doing part of the job we are doing. Keep in mind that strings are sequences of characters, and bytestrings are sequences of numbers between 0 and 255. Contrary to languages like C, those are not the same thing at all: a character is not a number.
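The encoding step can be checked by listing the byte values. A small sketch, using the same two-escape example string:

```python
var = r"\x14\x13"              # 8 literal characters
encoded = var.encode("ascii")  # 8 bytes, one per character

print(type(encoded).__name__, len(encoded))
print(list(encoded))  # [92, 120, 49, 52, 92, 120, 49, 51]
```

Iterating over a bytestring yields integers, which is why `list(encoded)` shows numbers rather than characters.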

Now I can use unicode-escape encoding to decode those 8 bytes into chars. In unicode-escape, the bytes 92 (ASCII for backslash) and 120 (ASCII for x) followed by exactly 2 hexadecimal digits in ASCII mean the character whose Unicode code is given by those 2 digits. (Longer codes use other prefixes: backslash u takes 4 hex digits and backslash U takes 8.)

So, now we have a string (not a bytestring) made of only 2 characters: the one whose Unicode code is 20 and the one whose Unicode code is 19.
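The decoding step, sketched on the same 8 bytes. The extra lines are illustrative additions: they show how many hex digits each escape prefix consumes in Python's `unicode_escape` codec.

```python
encoded = rb"\x14\x13"                      # the 8 ASCII bytes from the previous step
decoded = encoded.decode("unicode_escape")  # bytes -> str, escapes interpreted

print(type(decoded).__name__, len(decoded))
print([ord(c) for c in decoded])  # [20, 19]

# \x consumes exactly two hex digits; a third digit is an ordinary character.
three_digits = rb"\x141".decode("unicode_escape")
print(len(three_digits), [ord(c) for c in three_digits])

# \u consumes four hex digits.
four_digits = rb"\u00e9".decode("unicode_escape")
print(len(four_digits), hex(ord(four_digits)))
```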

And if we want those codes as bytes, well, we need to encode them back. There is a potentially problematic case here: codes of 128 and above. Those are encoded as two bytes in UTF-8. But since the first 256 codes of Unicode are the same as in latin1, encoding them back in latin1 creates one byte whose value is the Unicode code. For example, if your string were r'\xc3\xa9', then r'\xc3\xa9'.encode('ascii').decode('unicode-escape') would give the two chars 'Ã©', whose codes are 0xc3 and 0xa9 (not to be confused with the single char 'é', whose code is 0xe9 and which is represented in UTF-8 by the 2 bytes 0xc3 0xa9). You don't want to encode them in UTF-8, since that would give two bytes each (0xc3 0x83, 0xc2 0xa9). And you can't encode them in ASCII, since they need 8 bits. But encoding them as latin1 gives 2 bytes whose values are the Unicode codes 0xc3 and 0xa9.
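The difference between the three encodings on that r'\xc3\xa9' example can be seen with a short sketch (variable names are only for illustration):

```python
text = r"\xc3\xa9".encode("ascii").decode("unicode_escape")
print(len(text), [hex(ord(c)) for c in text])  # 2 ['0xc3', '0xa9']

as_latin1 = text.encode("latin1")
as_utf8 = text.encode("utf-8")
print(list(as_latin1))  # [195, 169]            one byte per character
print(list(as_utf8))    # [195, 131, 194, 169]  two bytes per character

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False    # codes above 127 do not exist in ASCII
print(ascii_ok)         # False
```

Only the latin1 result has the byte values 0xc3 and 0xa9 that were written in the original string.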

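Hence r'\x14\x13'.encode('ascii').decode('unicode-escape').encode('latin1'), which gives the 2 bytes 20 and 19. Wrap the result in bytearray(...) if you need a bytearray rather than bytes.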