I'm getting errors when certain characters are being added to a table... even when the column is has utf8mb4 character set. For example:
SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xE0' for column 'surname'
The data in question is: SEGAL‡ (note the double dagger)
Is this character beyond even 4 byte UTF8 or is the collation causing the issue? Or is it something else?
Screenshot showing character set and collation of the column:
It's a Laravel 8 app and the MySQL connection is configured to the following:
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',
Looking at the CSV file in PHPStorm, non-ASCII characters are displayed as �. I've tried explicitly setting the file encoding to UTF-8 in PHPStorm (with and without BOM).
If I open the CSV in Excel then the non-ASCII characters display correctly. Confused.
Update
Examining the CSV in a HEX editor shows that a character like ä is stored as a single byte (8A). When this CSV is opened in Excel it correctly shows ä, but in everything else it shows �.
I don't know what character encoding Excel is using, as this character should be typically stored as E4 when using a single byte, or C3 A4 in UTF-8.

Double dagger is hex E2 80 A1 (only 3 bytes) when encoded in
CHARACTER SETutf8 or utf8mb4. It is also available in latin1 as hex 87.C3A4 [ä]LATIN SMALL LETTER A WITH DIAERESIS (in utf8/utf8mb4) or E4 in latin1.
Please provide
SHOW VARIABLES LIKE 'char%';See "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored for the likely cause.
The
CHARACTER SETis the "encoding" and is relevant; theCOLLATIONis how characters compare (eg case-folding) and is not relevant.Nothing in the Question hints of a BOM (http://mysql.rjweb.org/doc.php/charcoll#bom_byte_order_mark) EF BB BF at the beginning of some UTF-8-encoded files.
Do not use ucs-2 (UTF-16) unless you happen to have a file encoded that way. Still, I would declare the tables as utf8, not ucs-2.
More
I plugged
8AE0E8into a handy script (that doesCONVERT(CONVERT(UNHEX('8ae0e8') USING %s) USING utf8mb4)and got:It looks like
macromanis the charset. I think that is specific to Apple.