ABCD Global

March 25, 2023 at 3:20 pm #5193

Keymaster

Thanks for the clarification. It does indeed explain how certain characters that take one byte in a single-byte encoding (like the old DOS CP437 or ISO-8859-1) are split into more than one byte (usually two) in a multi-byte encoding such as UTF-8. Going back to single-byte then no longer seems possible, since the first byte of the pair is read as a character in its own right and displayed without taking the following byte into account. This happens despite the logic built into the multi-byte construction of characters in Unicode.
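To make the byte-splitting concrete, here is a small Python sketch (Python is only my choice for illustration, it is not part of CISIS). It takes "é" as an arbitrary sample character and uses ISO-8859-1 as the single-byte reader; the same effect occurs with CP437, only with different wrong characters.

```python
# Sample character chosen for illustration: U+00E9 (e with acute accent).
text = "\u00e9"

# In a single-byte encoding the character is exactly one byte.
single = text.encode("latin-1")   # ISO-8859-1
# In UTF-8 the same character becomes a pair of bytes.
multi = text.encode("utf-8")

# A single-byte reader treats each byte of the pair as its own character,
# so one accented letter shows up as two unrelated characters.
misread = multi.decode("latin-1")

print(single, multi, len(misread))
```

The single character ends up as two bytes in UTF-8, and a single-byte reader displays those two bytes as two separate characters instead of one diacritic.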

A conversion based on the pairs of values listed above was easy to create and apply, but unfortunately it did not yield the correct diacritics.

However, I remained puzzled by the fact that in the given database (BSHMD) some diacritics at the beginning of the MST file (viewed as a text file) were shown correctly with CP437, which raised the hope that they could still be recovered. Further analysis led to the conclusion that these belonged to logically deleted MFNs, which still remained in that old MST file from the era BEFORE the diacritics conversion (or corruption…) occurred. That explains why any processing with mx, or with other CISIS tools like i2id and crunchmf, made the correctly displayed diacritics disappear: those MFNs were no longer in use, and only the logically existing records were output.

So there is a lesson to be learned here: no matter how powerful CISIS is with its gizmo conversion, and however many encoding-conversion gizmos are available, this should not give us a false sense of safety under all possible conditions when it comes to encoding… If some conversions are indeed irreversible, we need to be warned.
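As a rough illustration of the difference between a reversible and an irreversible conversion, here is a minimal Python sketch. It is not what CISIS does internally; the sample character and the two decoding paths are my own assumptions, chosen only to show that a misreading which keeps every byte can be undone, while one that substitutes a replacement character cannot.

```python
# Sample character chosen for illustration: U+00E9 (e with acute accent).
original = "\u00e9"
utf8_bytes = original.encode("utf-8")

# Reversible case: ISO-8859-1 maps every byte value to some character,
# so the original bytes can be rebuilt from the misread text.
misread = utf8_bytes.decode("latin-1")
recovered = misread.encode("latin-1").decode("utf-8")

# Irreversible case: undecodable bytes are swapped for U+FFFD,
# so the original byte values are gone for good.
lossy = utf8_bytes.decode("ascii", errors="replace")

print(recovered == original, lossy == original)
```

A round trip like this on a handful of records with known diacritics, run before converting a whole database, is one way to find out in advance whether a given conversion can still be undone.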