Diacritics in ABCD


    #5173
    Egbert de Smet (Keymaster)

    Many issues related to diacritics (e.g. ë, è, é, ...) have been reported in the past. ABCD, as a member of the ISIS family, deals relatively often with 'international' (also non-Anglo-Saxon) content, and converting from one encoding (or codepage) to another is needed quite regularly, e.g. when moving from CDS/ISIS for DOS (CP437) to WinISIS (CP850 or ANSI) and on to more current encodings such as ISO-8859-1 or UTF-8 as used in browsers.
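
    To make the codepage issue concrete: the very same byte in an MST record represents different characters depending on which codepage the viewer assumes. A minimal Python sketch (purely illustrative, not part of ABCD or CISIS) shows this for the byte 0x82, which is 'é' in CP437 and CP850 but something else entirely in ANSI (CP1252) or ISO-8859-1:

    # Illustration only: the same raw byte interpreted under different codepages.
    raw = b"\x82"                 # one byte as it might sit in an old MST file
    print(raw.decode("cp437"))    # 'é'  (CDS/ISIS for DOS)
    print(raw.decode("cp850"))    # 'é'  (WinISIS DOS codepage)
    print(raw.decode("cp1252"))   # '‚'  (single low quotation mark in ANSI/Windows)
    print(raw.decode("latin-1"))  # an invisible ISO-8859-1 control character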

     

    CISIS provides 'gizmo' tables (actually ISIS databases with fields v1 and v2, defining respectively the 'from' and 'to' values) to convert from one encoding to another. So in principle all the tools are available to deal with diacritics.
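
    As an illustration of what such a from/to table accomplishes, the 'g437ansi' gizmo essentially re-maps every CP437 byte to its ANSI (CP1252) equivalent. A rough Python sketch of that principle (only an illustration of the mapping, not of how CISIS actually applies a gizmo):

    # Sketch of what a CP437 -> ANSI conversion amounts to; the gizmo stores
    # the equivalent from/to pairs in fields v1 and v2 of an ISIS database.
    def cp437_to_ansi(data: bytes) -> bytes:
        return data.decode("cp437").encode("cp1252", errors="replace")

    print(cp437_to_ansi(b"caf\x82"))  # b'caf\xe9' -> reads as 'café' in ANSI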

     

    Still, we ran into a weird issue:

    • an MST file opened with just a text viewer, using its 'CP437' codepage, shows the diacritics correctly
    • in ABCD, however, the diacritics are wrong
    • applying the gizmo 'g437ansi' (from CP437 to ANSI), which in principle should be the solution, does not solve the problem: the diacritics remain wrong.

     

    Further testing and experimenting shows that ANY action with mx will change the diacritics in the MST, so that the correct diacritics can no longer be obtained even with CP437! E.g. just creating a copy of the database with the mx utility (both in Windows and Linux) means that the MST of the resulting copy can no longer be viewed correctly, even with CP437. Printing the records to an ISO2709 file: same result. Dumping to a text file: same result.

     

    Even an old, pre-ABCD version of mx was used to test this, but the result remains the same: any processing with mx loses the diacritics (no codepage shows them correctly anymore).

    Does anybody see what is happening here? How can the correct diacritics be preserved? They are clearly there, as can be seen when viewing the MST with CP437, but they show up wrongly in any other process or environment (ABCD, ISO2709, ...).

     

    Illustration: viewing the same sample text with diacritics under CP437, before (the original) and after (a copy of the original, created with mx):

     

  • #5192

    Egbert,

    I received the files of this database for testing and was able to analyse them more calmly.

    I converted the database to text and tried, with Sublime Text and VSCode, to identify the correct encoding, but from what I could see the damage is practically irreversible.

    To fix it, the best chance is to convert the database to text and use “Replace” to batch-correct the errors:

     

    ./i2id database > new_database.txt
    
    

    This link explains a possible cause: https://php-de.github.io/jumpto/utf-8/

    This site (https://dencode.com/) tries to find the correct encoding of a text; I tried it on some database snippets, but with no satisfactory result.

     

    My hypothesis is that this database was converted to UTF-8 but then used as ISO-8859-1, because according to what I read in a forum this error happens when a file is stored in one encoding but displayed as another, causing the characters to get mixed up.
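
    That hypothesis is easy to reproduce: when text is first encoded as UTF-8 and the resulting bytes are then read back as ISO-8859-1/CP1252, each accented character turns into the two-character sequences listed in the table below. A small Python sketch of the effect (illustration only):

    # Mojibake: UTF-8 bytes displayed as if they were CP1252 / ISO-8859-1.
    original = "É, Ç, é, ó"
    garbled = original.encode("utf-8").decode("cp1252")
    print(garbled)  # 'Ã‰, Ã‡, Ã©, Ã³'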

     

    If there were a cleaner version of this database, that could help, but right now I only see the replacement approach as a solution.

    There are even tables for this, but it is a lot of work:

    Ã‰ -> É
    â€œ -> "
    â€ -> "
    Ã‡ -> Ç
    Ãƒ -> Ã
    Ã€ -> À
    Ãº -> ú
    â€¢ -> -
    Ã˜ -> Ø
    Ãµ -> õ
    Ã­ -> í
    Ã¢ -> â
    Ã£ -> ã
    Ãª -> ê
    Ã¡ -> á
    Ã© -> é
    Ã³ -> ó
    â€” -> -
    Ã§ -> ç
    Âª -> ª
    Âº -> º
    Ã  -> à
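
    Going through the text dump produced by i2id, pairs like the ones above can be applied as plain string replacements. A rough Python sketch (the file names and the chosen encodings are only examples, not a fixed recipe):

    # Batch-replace mojibake pairs in the i2id text dump.
    # Assumption: the dump is readable as CP1252 text; file names are examples.
    pairs = {"Ã‰": "É", "Ã‡": "Ç", "Ã©": "é", "Ã³": "ó", "Ã§": "ç"}  # extend with the full table

    with open("new_database.txt", encoding="cp1252", errors="replace") as f:
        text = f.read()
    for wrong, right in pairs.items():
        text = text.replace(wrong, right)
    with open("new_database_fixed.txt", "w", encoding="utf-8") as f:
        f.write(text)
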
    #5193
    Egbert de Smet (Keymaster)

    Thanks for the clarification. It indeed explains how characters from a one-byte encoding (like the old DOS/CP437 or ISO-8859-1) are split into more than one (usually 2) bytes in a multi-byte encoding such as UTF-8. Going back to single-byte then seemingly is no longer possible, since the first byte of the pair is seen as 'the next character' and rendered without taking the following byte into account, despite the logic built into the multi-byte construction of characters in Unicode.
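
    As long as all the UTF-8 bytes survive, the mix-up can in principle still be undone by re-encoding the mis-read text and decoding it again as UTF-8; it becomes truly irreversible once bytes that have no meaning in the single-byte codepage (e.g. 0x9D, part of the right double quotation mark) are dropped or replaced along the way. A hedged Python illustration of both cases (the mechanism only, not the exact path the data took through mx):

    # Case 1: all bytes survive, so the mix-up is still reversible.
    garbled = "É".encode("utf-8").decode("cp1252")            # 'Ã‰'
    print(garbled.encode("cp1252").decode("utf-8"))           # 'É' again

    # Case 2: byte 0x9D (from '"' = E2 80 9D) has no CP1252 meaning and is
    # silently dropped; afterwards the original can no longer be reconstructed.
    damaged = "\u201d".encode("utf-8").decode("cp1252", errors="ignore")  # 'â€'
    try:
        damaged.encode("cp1252").decode("utf-8")
    except UnicodeDecodeError as err:
        print("irreversible:", err)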

    A conversion with the pairs of values listed above was easy to create and apply, but unfortunately it did not yield the correct diacritics.

    However, I kept being puzzled by the fact that in the given database (BSHMD) some diacritics at the beginning of the MST file (viewed as a text file) were shown correctly with CP437, leading to the hope that they could still be recovered. Further analysis led to the conclusion that these were logically deleted MFNs, which still remained in that old MST file from the era BEFORE the diacritics conversion (or corruption...) occurred. That explains why any action with mx, or with other CISIS tools like i2id and crunchmf, made the correctly displayed diacritics disappear: those MFNs were no longer in use, and only the logically existing records were carried over.

    So this is a lesson to be learned: no matter how powerful CISIS is with its gizmo conversion, and however many encoding-conversion gizmos are available, this should not give us a false feeling of being safe under all possible conditions when it comes to encoding... If some conversions are indeed irreversible, we have to be warned.
