Character Sets

What is a Character Set?

Every letter, digit and special character that we see on screen or in print has to be encoded as a numerical value for storage and electronic transmission. The mapping between characters and numbers is called a character set.

Initially, only small character sets with a rather limited number of characters were developed, e.g. ASCII and ANSI. DOS, Windows, Linux and Macintosh computers did not use identical character sets, but the sets differ mainly in the language-specific characters, such as umlauts.
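As a small illustration of these differences, the following Python sketch encodes the same characters under several character sets. The codec names are Python's own and are assumptions chosen for this example (cp437 as one common DOS code page, cp1252 as the Western Windows "ANSI" set, mac_roman for the classic Macintosh); GenJ does not prescribe them.

```python
# Codecs assumed for illustration; names are Python's, not GEDCOM's.
CODECS = ("cp437", "cp1252", "latin-1", "mac_roman", "utf-8")


def encode_everywhere(text):
    """Return {codec name: hex string of the encoded bytes}."""
    return {codec: text.encode(codec).hex() for codec in CODECS}


for char in ("A", "ä"):
    print(char, encode_everywhere(char))
```

The plain letter A is stored as the same byte in every set, while the umlaut ä gets different values in the DOS, Windows and Macintosh sets.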

What is required by the GEDCOM Standard?

Genealogical data may contain characters of any language. GEDCOM 5.5 allowed only ANSEL (very uncommon elsewhere), ASCII and UNICODE; GEDCOM 5.5.1 additionally allows UTF-8.
The IBMPC character set is explicitly not allowed, as it cannot be interpreted properly without knowing which code page the sender was using.
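The code page problem can be sketched in a few lines of Python. The example assumes a sender who used code page 850 and a receiver who guesses either 850 or 437; both code pages and the sample name are chosen only for illustration.

```python
# Bytes as a sender using code page 850 would write them (assumed example).
raw = "Søren".encode("cp850")

as_cp850 = raw.decode("cp850")  # receiver guesses the sender's code page
as_cp437 = raw.decode("cp437")  # receiver guesses code page 437 instead

print(raw.hex())
print(as_cp850)
print(as_cp437)
```

The bytes are identical in both cases; only the assumed code page differs, and with the wrong guess the ø turns into a ¢.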

However, most genealogy programs also support the Windows and Mac character sets.

So that the reading program knows how to interpret the data bytes, GEDCOM requires the line 1 CHAR <CHARACTER_SET> in the file header.
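A reader might look up the declared character set roughly as follows. This is a minimal sketch, not GenJ's implementation: the function name declared_charset and the sample header are hypothetical, and the lines are assumed to be already available as text.

```python
def declared_charset(header_lines):
    """Return the value of the level-1 CHAR line in the HEAD record, or None."""
    for line in header_lines:
        parts = line.strip().split(None, 2)
        # A level-0 line other than HEAD starts the next record: header is over.
        if len(parts) >= 2 and parts[0] == "0" and parts[1] != "HEAD":
            break
        if len(parts) == 3 and parts[0] == "1" and parts[1] == "CHAR":
            return parts[2].strip()
    return None


# Hypothetical sample file, for illustration only.
sample = [
    "0 HEAD",
    "1 SOUR GENJ",
    "1 CHAR UTF-8",
    "0 @I1@ INDI",
    "1 NAME Hans /Müller/",
    "0 TRLR",
]
print(declared_charset(sample))
```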

What should we use?

Unicode has meanwhile become the computing industry standard for the consistent encoding, representation and handling of text in most of the world's writing systems. That is how we should normally save our files.

When saving a file for another genealogy program, we should make sure to choose a character set that the receiving program understands.
For this reason GenJ can still save as ANSEL, ANSI, ASCII, LATIN1, UTF-8 or UNICODE. The character set is selected in the 'Save as …' dialog under 'Encoding'.
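Whether a smaller character set can hold the data at all depends on the names in the file. The following Python sketch checks this for a piece of text. The mapping from the encoding names to Python codecs is an assumption made for this example (in particular ANSI is taken to mean cp1252), and ANSEL is left out because Python's standard library has no codec for it.

```python
# Assumed mapping from encoding names to Python codecs (illustration only).
CANDIDATES = {
    "ASCII": "ascii",
    "LATIN1": "latin-1",
    "ANSI": "cp1252",
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",
}


def usable_encodings(text):
    """Return the encoding names that can represent every character of text."""
    usable = []
    for name, codec in CANDIDATES.items():
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this set
        usable.append(name)
    return usable


for name in ("Smith", "Müller", "Łukasz"):
    print(name, usable_encodings(name))
```

Under these assumptions only the Unicode encodings can store all three names, which is one reason to prefer them whenever the receiving program accepts them.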

When GenJ opens a file

When opening a GEDCOM file, GenJ even tries to interpret files that do not conform to the GEDCOM standard. When it encounters IBMPC, it assumes ISO-8859-1 (Latin1).

When a file shows incorrect characters, you may try changing the value of the CHAR tag in the file header with a plain text editor and then reading the file into GenJ again.
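Why a different CHAR value can repair the display is easy to see in a Python sketch. The table below is a hypothetical mapping from CHAR values to Python codecs, not GenJ's actual code; only the IBMPC entry mirrors the Latin1 fallback described above, and ANSEL is omitted because Python's standard library has no codec for it.

```python
# Hypothetical CHAR-to-codec table (illustration only).
CODEC_FOR_CHAR = {
    "ASCII": "ascii",
    "ANSI": "cp1252",     # assumption: Western Windows code page
    "LATIN1": "latin-1",
    "IBMPC": "latin-1",   # fallback as described in the text above
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",
}


def decode_with_char(raw, char_value):
    """Decode raw file bytes according to the given CHAR value."""
    codec = CODEC_FOR_CHAR.get(char_value.upper())
    if codec is None:
        raise ValueError("unsupported CHAR value: " + char_value)
    return raw.decode(codec)


# The same bytes, read with two different CHAR values.
raw = "1 NAME Hans /Müller/".encode("utf-8")
print(decode_with_char(raw, "LATIN1"))
print(decode_with_char(raw, "UTF-8"))
```

The bytes do not change between the two calls. Read as LATIN1 the name shows up as MÃ¼ller, read as UTF-8 it is correct, which is the effect of editing the CHAR line and reading the file again.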

When GenJ saves a file

GenJ actually saves a UNICODE file as UTF-16 and, if requested, marks this with a BOM (Byte Order Mark). The BOM indicates the byte order of a UTF-16 or UTF-32 encoded file; in a UTF-8 file it merely identifies the encoding. Any application that supports Unicode encodings should be able to read it.
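How a reading program can recognise a BOM is sketched below in Python, using the BOM constants from the standard codecs module. The function name sniff_bom is hypothetical, and the sketch says nothing about how GenJ itself detects the BOM.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM starts with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw):
    """Return (codec name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


# A little-endian UTF-16 file with BOM, built by hand for the example.
data = codecs.BOM_UTF16_LE + "0 HEAD".encode("utf-16-le")
codec, skip = sniff_bom(data)
print(codec, data[skip:].decode(codec))
```

A file without a BOM yields (None, 0); the reader then has to rely on the CHAR line or on a default.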

Writing the BOM is controlled in the 'Save as …' dialog (see Saving, Closing and Backup of the File). The option should be checked only if the receiving program understands and needs it.

en/manual/character_sets.txt · Last modified: 2011/01/05 06:23 by kpschubert