Forum Replies Created
-
AuthorReplies
-
Egbert de Smet
Keymaster::Thanks for the clarification. It indeed explains how certain characters in a one-byte encoding (like the old DOS/CP437 or ISO-8859-1) can be split into more than one byte (usually 2) in a multi-byte encoding such as UTF-8. Going back to single-byte then seemingly is no longer possible, since the first byte of the pair will be seen as ‘the next character’ and displayed without taking the following byte into account. This happens despite the logic built into the multi-byte construction of characters in Unicode.
A conversion with the above-listed pairs of values was easy to create and apply but, unfortunately, did not yield the correct diacritics.
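To make the byte-splitting concrete, here is a minimal Python sketch (not a CISIS gizmo). It assumes the text was written as UTF-8 and then read back as CP437. The helper names `misread_utf8_as` and `repair` are hypothetical and only serve the illustration. It suggests that the pairs can be put back together only as long as both bytes of each pair survive untouched.

```python
# Sketch: one accented character becomes two bytes in UTF-8, and a
# single-byte reader (CP437 here) shows those bytes as two characters.

def misread_utf8_as(text: str, single_byte_codec: str) -> str:
    """Encode text as UTF-8, then decode the bytes with a single-byte codec."""
    return text.encode("utf-8").decode(single_byte_codec)


def repair(garbled_text: str, single_byte_codec: str) -> str:
    """Undo the misreading; works only if every byte of each pair survived."""
    return garbled_text.encode(single_byte_codec).decode("utf-8")


original = "é"
utf8_bytes = original.encode("utf-8")          # two bytes: 0xC3 0xA9
garbled = misread_utf8_as(original, "cp437")   # two box-drawing/symbol characters
restored = repair(garbled, "cp437")            # the original character again

# Assumed damage scenario: the second byte of the pair was replaced by "?"
# somewhere along the way. The pair can then no longer be reassembled.
damaged = garbled[0] + "?"
try:
    repair(damaged, "cp437")
    recoverable = True
except UnicodeDecodeError:
    recoverable = False
```

In this sketch `restored` equals the original character, while the damaged variant raises a decoding error, which is one way a conversion can turn out to be irreversible in practice.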
However, I remained puzzled by the fact that in the given database (BSHMD) some diacritics at the beginning of the MST file (viewed as a text file) were shown correctly with CP437, which raised the hope that they could still be recovered. Further analysis led to the conclusion that these were logically deleted MFNs, which still remained in that old MST file from the era BEFORE the diacritics conversion (or corruption…) occurred. That explains why any action with mx, or with other CISIS tools like i2id and crunchmf, made the correctly displayed diacritics disappear: these MFNs were no longer in use, and only the logically existing ones were shown.
So this is a lesson to be learned: no matter how powerful CISIS is with its gizmo-conversion, and however many encoding-conversion gizmos are available, this should not give us a false feeling of being safe under all possible conditions when it comes to encoding… If some conversions are indeed irreversible, we have to be warned.
Egbert de Smet
Keymaster::An example of the ‘nested’ replace() function, which I used in an application to keep double quotes " and slashes / out of the index for titles (v245^a in MARC):
replace(replace(v245^a,'/',''),'"','')
If you put this in the FST, you won’t change anything in the data themselves, only in the search keys. The same applies if you put this in the autoridades.pft: it will only change how the names are formatted in the picklists.
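For readers less familiar with the ISIS Formatting Language, the nested call can be read inside-out: the inner replace() strips the slash, the outer one strips the double quote. The following Python sketch mirrors that logic only; it is not how CISIS evaluates the format, and the function name `title_index_key` and the sample title are made up for illustration.

```python
def title_index_key(title: str) -> str:
    """Mimic the nested replace(): inner call first, then the outer one."""
    without_slash = title.replace("/", "")   # inner replace: drop slashes
    return without_slash.replace('"', "")    # outer replace: drop double quotes


# Hypothetical stored title; the stored value itself is never modified.
stored_title = 'The "new" library /'
key = title_index_key(stored_title)
```

As with the FST, only the derived key changes here; `stored_title` keeps its quotes and slash.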
Egbert de Smet
Keymaster::Did you try a replace() function in the autoridades.pft? In principle you can use it to omit the weird characters, like the single apostrophe or double quotes (that is, to replace a character by nothing). I remember having done that for other purposes (e.g. a cleaner Inverted File) and it worked fine.
Egbert de Smet
Keymaster::I also failed to reproduce the problem on my installation (Linux Mint): I can continue browsing index terms beyond the 3rd click without problems.
If the problem is with the data, we will need (a subset of) your data for further testing, e.g. an ISO 2709 export of a larger set of records (to obtain enough index terms to browse and to make hitting the weird characters more likely).
To me it indeed seems likely that strings containing single apostrophes ‘ or similar characters create problems, as the scripts themselves also quote the strings. A possible solution would be to use the replace() function of the ISIS Formatting Language to remove them before ‘producing’ them in the interface (or in the authority list).
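The suspected mechanism can be illustrated with a small Python sketch. It assumes, purely for illustration, that a script wraps each index term in single quotes inside a generated call. The call name `pick` and the helpers `build_call` and `strip_quotes` are hypothetical and do not come from the actual scripts.

```python
def build_call(term: str) -> str:
    """Hypothetical generated snippet: the term is wrapped in single quotes."""
    return f"pick('{term}')"


def strip_quotes(term: str) -> str:
    """Remove apostrophes and double quotes before the term is emitted."""
    return term.replace("'", "").replace('"', "")


raw_term = "O'Brien, Flann"                  # made-up index term
broken = build_call(raw_term)                # quotes no longer balanced
safe = build_call(strip_quotes(raw_term))    # quotes balanced again
```

In `broken` the apostrophe ends the quoted string early, leaving an odd number of quote characters; stripping it first keeps the generated string well-formed, at the cost of a slightly altered display form.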