Edit: I found a solution, but not sure that it is the definitive answer. Information at the end of the post.
On AWS RDS, I have created a read replica from a master database running 5.7.44 (extended support opted-in).
Thereafter, on the replica, the database, table, column, view, triggers and procedures have been changed (painstakingly) to use character set utf8mb4 and collation utf8mb4_unicode_ci. It was a process of rinse-and-repeat until the upgrade checker passed. At the last stage, the replica was happily replicating from its source, which still had a mix of character sets and collations (mostly utf8 with utf8_unicode_ci). Source and replica both on 5.7.44.
In a test and UAT environment, the next steps were followed without hinderance:
- Upgrade the replica's Engine Version to 8.0.28,
- At a sufficient time, promote the replica and create a new read replica from the 8.0.28 instance.
That is a happy scenario that worked well.
However, following the same procedure in the production environment, all steps went smooth up until the completion of updating character set and collations on all aspects of the replica database (database, table, columns, views, triggers, routines which include procedures and functions).
As soon as the replica instance's Engine Version got upgraded to 8.0.28, replication started breaking with:
Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'ANONYMOUS' at master log ...
Further inspection yielded:
Worker 1 failed executing transaction 'ANONYMOUS' at master log mysql-bin-changelog.009309, end_log_pos 8019869; Column 4 of table '{TABLE NAME}' cannot be converted from type 'char(3(bytes))' to type 'char(4(bytes) utf8mb4)', Error_code: MY-013146
I have tried to use the RDS-provided stored procedure mysql.rds_skip_repl_error and it appears that the error alternates between two tables (I have not exhaustively skipped all the errors).
The closest to more information on the problem is a lengthy thread of a MySQL Bug #83461 replication breakage noted from 5.7.10 to 8.0.0. The bug has been around for a while and I did not have the issue on low-traffic test and UAT environments, but it definitely proves to be happening on production.
It's just strange that it was happy in a 5.7.x-to-5.7.x configuration along with mixed source and target character set and collation.
I'm a bit stumped right now, and am trying to find ways around this. The problem is that the preparation of a replica to the point just before the major version upgrade takes a bucketload of time, and simply starting over is not a great experience.
Would my only solution be to halt all traffic at an appropriate time and then proceed with major engine version upgrades?
Possible Solution:
A 2015 post from MySQL Bugs suggested a solution that appears to be working in this scenario:
--slave-type-conversions=ALL_NON_LOSSY, which translates to the RDS Parameter Group parameter slave_type_conversion with value ALL_NON_LOSSY.
Replication resumed after this.
As per that bug report, the off-by-one error is also still present as experienced in 8.0.28.