Perl issue when encoding mysql data from UTF-8 to UCS-2 for SMPP

I am trying to fetch UTF-8 accented characters ("é", "ê") from MySQL and convert them to UCS-2 before sending them over SMPP. The data is stored with the utf8_general_ci collation, and I run the following when opening the DB connection:

$dbh->{'mysql_enable_utf8'}=1;
$dbh->do("set NAMES 'utf8'");

If I test the sending part by hard-coding the string value with "é" "ê" and using data_coding=8, it goes through perfectly. However, if I comment out the hard-coded assignment (the first line of the snippet) and just use what comes from the DB, it fails. If I send the characters from the DB with data_coding=3, it also works fine, but then the "ê" does not appear, which is also expected. Here is what I use:

$fred = 'éêcole';   # If I comment out this line, the SMPP call fails
$fred = decode('utf-8', $fred);
$fred = encode('UCS-2', $fred);

$resp_pdu = $short_smpp->submit_sm(
        source_addr_ton => 0x00,
        source_addr_npi => 0x01,
        source_addr => $didnb,
        dest_addr_ton => 0x01,
        dest_addr_npi => 0x01,
        destination_addr => $number,
        data_coding => 0x08,
        short_message => $fred
) or do {
        Log("ERROR: submit_sm indicated error: " . $resp_pdu->explain_status());
        $success = 0;
};

The relevant values of the data_coding field are the following (from "Meaning of "data_coding" field in SMPP"):

00000000 (0) - usually GSM7
00000011 (3) for standard ISO-8859-1
00001000 (8) for the universal character set -- de facto UTF-16
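
As a rough illustration of the list above, the following sketch maps two of the data_coding values to the byte encoding the payload would need. The %ENCODING_FOR table and encode_for_data_coding() are hypothetical names, not part of any SMPP library. The sketch assumes the input is already decoded text (a string of Unicode code points), and GSM7 (0) is left out to keep it short.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical lookup table: data_coding value => Encode name for the payload.
my %ENCODING_FOR = (
    3 => 'ISO-8859-1',   # data_coding 0x03
    8 => 'UCS-2BE',      # data_coding 0x08
);

# Hypothetical helper: turn decoded text into the bytes for short_message.
sub encode_for_data_coding {
    my ($text, $data_coding) = @_;
    my $name = $ENCODING_FOR{$data_coding}
        or die sprintf "unsupported data_coding 0x%02X\n", $data_coding;
    return encode($name, $text);
}

my $text = "\xE9\xEA";   # "éê" as decoded text

printf "%vX\n", encode_for_data_coding($text, 0x03);   # E9.EA
printf "%vX\n", encode_for_data_coding($text, 0x08);   # 0.E9.0.EA
```

With data_coding 0x08 each character in the Basic Multilingual Plane takes two bytes, so the payload is twice as long as the text.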

The SMPP provider's documentation also mentions that special characters should be handled via UCS-2: https://community.sinch.com/t5/SMS-365-enterprise-service/Handling-Special-Characters/ta-p/1137

How should I prepare the data that is coming out of the DB to make this SMPP call work?

I am using Perl v5.10.1

Thanks !

There are 2 answers

ikegami (best answer)

$dbh->{'mysql_enable_utf8'} = 1; is used to decode the values returned from the database, causing queries to return decoded text (strings of Unicode Code Points). It makes no sense to decode such a string. Go straight to the encode.

use Encode qw( encode );

my $s_ucp = "\xE9\xEA\x63\x6F\x6C\x65";  # éêcole
# -or-
# use utf8;                              # Script is encoded using UTF-8.
# my $s_ucp = "éêcole";

printf "%vX\n", $s_ucp;                  # E9.EA.63.6F.6C.65

my $s_ucs2be = encode('UCS-2', $s_ucp);  # Encode treats 'UCS-2' as UCS-2BE

printf "%vX\n", $s_ucs2be;               # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
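
To make the difference between the two inputs in the question concrete, here is a minimal sketch contrasting them. It assumes the hard-coded literal lives in a UTF-8 source file without "use utf8" (so it holds UTF-8 bytes) and that the DB value arrives as decoded text because mysql_enable_utf8 is set. The helper names ucs2_from_bytes() and ucs2_from_text() are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Case 1: UTF-8 *bytes*, as a literal in a UTF-8 source file without "use utf8".
my $utf8_bytes = "\xC3\xA9\xC3\xAA\x63\x6F\x6C\x65";   # bytes of "éêcole"

# Case 2: decoded *text*, assumed to be what the DB handle returns
# when mysql_enable_utf8 is set.
my $db_text = "\xE9\xEA\x63\x6F\x6C\x65";              # code points of "éêcole"

# Bytes must be decoded once before they can be re-encoded.
sub ucs2_from_bytes {
    my ($bytes) = @_;
    return encode('UCS-2BE', decode('UTF-8', $bytes));
}

# Text is already decoded, so it is encoded directly.
sub ucs2_from_text {
    my ($text) = @_;
    return encode('UCS-2BE', $text);
}

my $from_bytes = ucs2_from_bytes($utf8_bytes);
my $from_text  = ucs2_from_text($db_text);

printf "%vX\n", $from_bytes;   # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
printf "%vX\n", $from_text;    # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
```

Both paths end in the same UCS-2BE bytes; the only difference is whether a decode step is needed first, which depends on whether the input is bytes or text.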
Rick James

SET NAMES says the encoding you have/want in the client. That is, regardless of the encoding in the table, MySQL will convert it to whatever SET NAMES says during a SELECT.

So, feed what comes from the SELECT directly to SMPP. (It won't be readable by most other clients.)

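SET NAMES ucs2

(Caveat: the MySQL manual lists ucs2 among the character sets that cannot be used as a client character set, so the server is expected to reject this statement.)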

(The collation is irrelevant to the encoding.)

You could ask the SELECT to convert with something like

CONVERT(col_name, CHAR UNICODE)

https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html