How to convert a binary in surrogate pairs to unicode?

121 views Asked by At

Anyone can help figure out a surrogate pairs problem?

The source binary is #{EDA0BDEDB883}(encoded by Hessian/Java), how to decode it to or "^(1F603)"? I have checked UTF-16 wiki, but it only telled me a half story.

My problem is how to convert #{EDA0BDEDB883} to \ud83d and \ude03?

My aim is to rewrite the python3 program to Rebol or Red-lang,just parsing binary data, without any libs.

This is how python3 do it:

def _decode_surrogate_pair(c1, c2):
    """
    Python 3 no longer decodes surrogate pairs for us; we have to do it
    ourselves.
    """
    # print(c1.encode('utf-8')) # \ud83d
    # print(c2.encode('utf-8')) # \ude03
    if not('\uD800' <= c1 <= '\uDBFF') or not ('\uDC00' <= c2 <= '\uDFFF'):
        raise Exception("Invalid UTF-16 surrogate pair")
    code = 0x10000
    code += (ord(c1) & 0x03FF) << 10
    code += (ord(c2) & 0x03FF)
    return chr(code)

def _decode_byte_array(bytes):
    s = ''
    while(len(bytes)):
        b, bytes = bytes[0], bytes[1:]
   
        c = b.decode('utf-8', 'surrogatepass')
        if '\uD800' <= c <= '\uDBFF':
            b, bytes = bytes[0], bytes[1:]
            c2 = b.decode('utf-8', 'surrogatepass')
            c = _decode_surrogate_pair(c, c2)
        s += c
    return s

bytes = [b'\xed\xa0\xbd', b'\xed\xb8\x83']

print(_decode_byte_array(bytes))

public static void main(String[] args) throws Exception {
    // ""
    // "\uD83D\uDE03"
    final byte[] bytes1 = "".getBytes(StandardCharsets.UTF_16);
    // [-2, -1, -40, 61, -34, 3]
    // #{FFFFFFFE}  #{FFFFFFFF} #{FFFFFFD8} #{0000003D}  #{FFFFFFDE}  #{00000003}
    System.out.println(Arrays.toString(bytes1));
}
0

There are 0 answers