NetHack in other languages

From NetHackWiki

NetHack's text output is in English. Although the program's structure does not easily lend itself to localization because English morphology and syntax are hard-wired into the source code on all levels, several localization projects currently exist.

German

Tony Crawford and Karl Breuer have completed a German-localized version called NetzHack (note the 'z'), which runs on GNU/Linux, *BSD, and OS X (console and X11), and on Win32 (console and Windows graphics). Early versions up to NetzHack v1, released in 2017, were based on NetHack 3.4.3; NetzHack v2.0, released in December 2021, is based on NetHack 3.6. Source and binaries are available here.

A different German translation attempt by Patric Mueller called NetHack-De was released as a playable, although incomplete, alpha release on 11 October 2007. The latest release (2012) includes source code, a Debian package and a graphical Windows binary.

Japanese

JNetHack by Issei Numata has been in existence for several years. The older version (based on NetHack 3.2.2) is here. In addition, beta versions of JSLASH'EM are also available now.

Sourceforge.jp also carries JSlash'em, JSporkHack and JUnNetHack as well as a Japanese NetHack Resources Project.

NetHack brass can be compiled as an English or Japanese version.

Spanish

Ray Chason has published Internationalized NetHack as a work in progress. It presently supports English and Spanish, and will eventually supersede Spanish NetHack.

Korean

Several starts have been made to produce a Korean version of NetHack:

  • nethack-ko. The last update was on May 29th, 2007.
  • Another Korean translation is in progress, based on jnethack, at KRNethack.

Korean articles can also be accessed on NetHackWiki via the /ko subpages.

Chinese

On January 28th 2009 a Chinese translation called nethack-cn was begun on Google Code, but the last update was on June 25th 2009.

Simplified Chinese-language articles can also be accessed on NetHackWiki via the /zh-CN subpages.

Incomplete or stalled translations

A SourceForge project for a French translation called nethack-fr was registered on August 6th 2009. The last update was on August 15th 2014.

There is a French translation of the guidebook and some spoilers.

A GitHub project for an Italian translation called nethack-it had its first commit on December 4th 2009 and its last commit on December 7th 2009.

Internationalization

Ray Chason has launched the NetHack-i18n project, also called Internationalized NetHack, which is aimed at adapting NetHack for easier translation to other languages. The last activity was November 2016.

Current NetHack localization strategies

The existing NetHack localization projects differ in their approaches to the task.

The problem

Because NetHack has output text in the form of string literals scattered throughout the code, the customary approach is for the translator to go through the source code and substitute translations for the string literals. What complicates this process is the fact that many messages are composed of elements that can vary with the runtime context. For example, an output statement like "the dagger hits your little dog" would be generated by a line of code more or less like this:

pline("%s hits %s.", objectname, monstername);

where the variables "objectname" and "monstername" may be singular or plural, masculine or feminine, and may be introduced by "a" or "the", or sometimes "your". The words to be inserted must be formed appropriately before the output function call.

At various points in the program, NetHack's output messages vary with second and third person verb forms, singular and plural verb forms, and noun inflections by case, gender, and number.

In English, this is easy: word forms do not change with grammatical gender or case, and most nouns change from singular to plural simply by the addition of a trailing 's'. There is only one form of the definite article ("the"), and there are two forms of the indefinite article ("a" and "an") which are grammatically equivalent. In other languages, morphology can be much more complex: Spanish, for example, has four forms of the definite article, depending on whether a noun is singular or plural, masculine or feminine; German has six, depending on number, gender, and case.

Furthermore, some languages have mandatory contractions (Spanish contracts the preposition and article "a"+"el" into "al"; French contracts the preposition and article "de"+"le" into "du", etc.).
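The article selection and contraction rules described above can be sketched in a few lines of C. This is only an illustration, not code from any of the translation projects: the enum names and the functions es_definite_article and es_prep_phrase are invented for the example.

```c
#include <stdio.h>
#include <string.h>

enum gender { MASCULINE, FEMININE };
enum number { SINGULAR, PLURAL };

/* The four Spanish definite articles, indexed by gender and number. */
static const char *es_definite_article(enum gender g, enum number n)
{
    static const char *const forms[2][2] = {
        { "el", "los" }, /* masculine: singular, plural */
        { "la", "las" }, /* feminine:  singular, plural */
    };
    return forms[g][n];
}

/* Builds "<prep> <article> <noun>" in buf, applying the mandatory
 * contractions a + el -> al and de + el -> del. Returns buf. */
char *es_prep_phrase(char *buf, size_t buflen, const char *prep,
                     const char *noun, enum gender g, enum number n)
{
    const char *art = es_definite_article(g, n);

    if (strcmp(art, "el") == 0 && strcmp(prep, "a") == 0)
        snprintf(buf, buflen, "al %s", noun);
    else if (strcmp(art, "el") == 0 && strcmp(prep, "de") == 0)
        snprintf(buf, buflen, "del %s", noun);
    else
        snprintf(buf, buflen, "%s %s %s", prep, art, noun);
    return buf;
}
```

Calling es_prep_phrase with "a", "orco", MASCULINE and SINGULAR yields "al orco", while the plural "orcos" yields "a los orcos". Note that the caller has to know the noun's gender and number, which the English code never tracks.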

Some examples of word and sentence morphology in Spanish:

  • "¡Idefix golpea al orco!" (subject and object are both nouns)
  • "¡Idefix lo golpea!" (object is a pronoun, and goes before the verb)
  • "¡Golpeas al orco!" (subject is a pronoun ("tú") and is omitted; verb changes to second person singular)
  • "¡Lo golpeas!" (both modifications apply)

(As it happens, monsters in NetHack always act or are acted upon singly, not collectively, which simplifies matters sometimes. On the other hand, stackable objects can be singular or plural, and a pair of gloves or shoes is, in the game's logic, a single object, but may call for a plural verb form. Furthermore, many objects are named differently at different times – by name, by description, or by class – and so an object name or a pronoun that replaces "it" can vary even for the same object.)

Word order can also change depending on certain conditions, such as whether the subject is a common noun, a proper noun or a pronoun. The message generation routine must also provide sentence capitalization (in languages that require it) after such rules have been applied.

Original NetHack contains a few functions to modify linguistic elements for output, such as vtense() and makeplural() in objnam.c, and s_suffix() in hacklib.c. But since English is not a highly inflected language, even these do not actually operate on grammatical categories, but tend to manipulate words by superficial characteristics: an(), for example, chooses between the indefinite article forms "a" and "an" merely on the basis of the following word's first letter, and has no concept even of subject or object case. NetHack's function the() prefixes a definite article to any noun, but it becomes useless in German, for example, because the form of the definite article depends on the noun's gender and number, and on the grammatical case in which it is used.
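To make the point about superficial characteristics concrete, here is a deliberately simplified stand-in for an indefinite-article helper. It is not NetHack's actual code (the real function may handle more cases), and the names article_for and an_sketch are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Chooses the article purely from the first letter of the next word;
 * no gender, number or case information is consulted. */
static const char *article_for(const char *word)
{
    return (word[0] != '\0' && strchr("aeiouAEIOU", word[0])) ? "an" : "a";
}

/* Writes "<article> <word>" into buf and returns buf. */
char *an_sketch(char *buf, size_t buflen, const char *word)
{
    snprintf(buf, buflen, "%s %s", article_for(word), word);
    return buf;
}
```

A helper like this needs nothing but the spelling of the word, which is exactly why it cannot be reused for a language whose articles depend on gender or case.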

These technical and grammatical problems are all in addition to the fundamental problems inherent in any translation. NetHack in particular is famous for the humor it incorporates, much of which depends on English wordplay (jokes about pit vipers in pits, for example), idiomatic expressions ("everything but the kitchen sink"), and American cultural references ("core dumped", Keystone Kops, ...). The stock in trade of a translator is to achieve an equivalent tone and mood in the target language. For NetHack, that means translating wordplay where possible, compensating for untranslatable puns with new target-language jokes as the opportunity arises, and generally choosing similarly humorous (or menacing or archaicizing) wording in the target language in keeping with the spirit of the original game. Conceivably, references to the target culture could be added in analogy to the original game's references – a German version of the Castle level might contain some allusion to the Kafka story, for example.

Localization approaches

NetHack-i18n

Internationalized NetHack aims to systematize the process of string replacement using Gettext together with a scriptable printf-like system to handle the grammar bits.

Gettext's grammar support is minimal: it handles plural forms but little else. NetHack-i18n needs such things as support for changing word order and noun cases, and encodes them in two ways:

  • by extending the printf-like syntax to include formatters such as %3${g/handsome/beautiful}, where the number after the % is a parameter number (this is a POSIX extension to printf) and the part between the braces is interpreted by a Ruby script; and
  • by defining "joining rules" at the start and end of each substitution, to handle mandatory contractions and such rules as "a/an".

For example, the output statement in mthrowu.c#line227,

pline("%s is blinded by %s.", Monnam(mtmp), the(xname(otmp)));

becomes in NetHack-i18n:

pline(NHFormat(T_("%1${Nt$} is blinded by %2${nt}.")) << mtmp << otmp);

This is C++ rather than C, and the NHFormat class overloads the << operator and the cast to std::string to make this work; it's rather similar to Boost Format. "%1${Nt$}" means substitute the first parameter, and use a locale-specific formatting with "Nt$" to indicate the specific formatting.
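The parameter numbers rely on the same idea as the POSIX numbered-argument extension to printf, which can be tried in plain C on a POSIX system. The sketch below only shows that part; it does not implement the brace formatters, and the wrapper name format_two is invented here.

```c
#include <stdio.h>

/* Formats two string arguments with a caller-supplied format.
 * The format may use POSIX numbered parameters (%1$s, %2$s), so a
 * translated string can put the arguments in a different order.
 * Returns the length reported by snprintf. */
int format_two(char *buf, size_t buflen, const char *fmt,
               const char *first, const char *second)
{
    return snprintf(buf, buflen, fmt, first, second);
}
```

With the arguments "Idefix" and "the orc", the format "%1$s hits %2$s." gives "Idefix hits the orc.", while "%2$s is hit by %1$s." gives "the orc is hit by Idefix." without any change to the calling code.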

The code for the English locale interprets "Nt$" with a monster parameter as follows:

  • Initial "n" means the name of the monster;
  • The "n" is made capital, to indicate the output should be capitalized;
  • "t" means prefix "the" if appropriate; and
  • "$" means show the saddle if the monster does not have a name. ("s" would mean "always show the saddle.")

With an object parameter, initial "n" means show the name, and "t" again means use "the" if appropriate.
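As an illustration of how such a format code decomposes, the following C sketch turns a code like "Nt$" into a set of flags. In NetHack-i18n the interpretation is done by the Ruby script, so this is only a model of the rules listed above; struct name_spec and parse_name_spec are hypothetical names.

```c
#include <ctype.h>
#include <stdbool.h>

struct name_spec {
    bool name;              /* 'n' or 'N': show the name */
    bool capitalize;        /* 'N': capitalize the output */
    bool the;               /* 't': prefix "the" if appropriate */
    bool saddle_if_unnamed; /* '$': show the saddle if the monster is unnamed */
    bool saddle_always;     /* 's': always show the saddle */
};

/* Parses a format code such as "Nt$" into its individual flags. */
struct name_spec parse_name_spec(const char *spec)
{
    struct name_spec out = { false, false, false, false, false };

    if (tolower((unsigned char)spec[0]) == 'n') {
        out.name = true;
        out.capitalize = (spec[0] == 'N');
        spec++;
    }
    for (; *spec != '\0'; spec++) {
        switch (*spec) {
        case 't': out.the = true; break;
        case '$': out.saddle_if_unnamed = true; break;
        case 's': out.saddle_always = true; break;
        default: break; /* unknown letters are ignored in this sketch */
        }
    }
    return out;
}
```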

T_() consults the message catalog, which uses the gettext syntax, but does not support plurals. The message catalog for the Spanish locale has this entry:

msgid "%1${Nt$} is blinded by %2${nt}."
msgstr "%1${:es_intrans,Nl$,es} cegad%1${oa} por %2${nl}."

Note that the first parameter is substituted twice. This is permitted, and indeed very frequent. The substitutions are as follows:

  • %1${:es_intrans,Nl$,es}: Both the English and the Spanish locales adopt the convention that a format string beginning with a colon names a method in the Ruby code. Thus ":es_intrans,Nl$,es" invokes a method called es_intrans. (The name is a misnomer: you use :es_trans if the direct object is a monster, and :es_intrans otherwise.) The commas (any non-alphanumeric character may be used) delimit parameters to es_intrans. "Nl$" is the formatter for the monster, with "l" indicating the definite article, and "es" is the verb. If the monster cannot be seen, the format routine returns "él" or "Él", and es_intrans omits it and capitalizes the verb if appropriate. (This pattern is overkill for the particular case, as the message does not appear if the monster isn't visible, but it frequently appears elsewhere.)
  • %1${oa}: "oa" means substitute "o" if the parameter is a masculine noun, or "a" if feminine. There are several other such substitutions, and they may be used with strings or objects – or the hero ("¡Destruid a %0${el} ladr%0${ón}, mi%1${p} mascota%1${p}!").
  • %2${nl}: Show the name of the object with the definite article.
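The gender-dependent ending can be modelled in C as follows. This is a sketch of the idea behind "oa" only, assuming a two-letter code; it does not cover the other substitutions, and gender_ending and blinded_es are names invented for the example.

```c
#include <stdio.h>
#include <string.h>

enum gender { MASCULINE, FEMININE };

/* Picks the ending from a two-letter code such as "oa": the first
 * letter for a masculine noun, the second for a feminine one.
 * Returns '\0' if the code is not exactly two letters long. */
char gender_ending(const char *code, enum gender g)
{
    if (strlen(code) != 2)
        return '\0';
    return code[g == FEMININE ? 1 : 0];
}

/* Builds the Spanish "is blinded by" message for an already formatted
 * subject and object. Returns buf. */
char *blinded_es(char *buf, size_t buflen, const char *subject,
                 enum gender g, const char *object)
{
    snprintf(buf, buflen, "%s es cegad%c por %s.", subject,
             gender_ending("oa", g), object);
    return buf;
}
```

For a feminine subject such as "La ninfa" the participle becomes "cegada"; for a masculine one such as "El orco" it becomes "cegado".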

Spanish NetHack

Spanish NetHack handles grammar rules by coding special routines to handle them, much as the unpatched NetHack does. For example, the output statement in mthrowu.c#line227,

pline("%s is blinded by %s.", Monnam(mtmp), the(xname(otmp)));

becomes in Spanish NetHack:

pline("%s es cegad%c por %s.", Monnam(mtmp),
    mon_gender(mtmp)? 'a' : 'o', the(xname(otmp)));

Monnam, the, and xname retain their names from the original code, though "the" in fact uses the appropriate Spanish article. mon_gender returns nonzero if the monster's name is a feminine noun.

NetzHack

NetzHack began with the idea that the developers just wanted to translate, not to rewrite the program. Or, in other words: NetHack is a prime example of how you don't code for localization, and trying to fix that was pretty near hopeless. So the localization strategy was as follows:

  • Translate string literals in the source code
  • Create a new data type, usage_t, to contain the usage information of a context in which a noun, adjective or pronoun appears: number (singular or plural), case (nominative, genitive, dative or accusative), gender (masculine, feminine or neuter), and determiner (the, a/an, this, your, or none).
  • Write a new module, german.c, with the functions necessary to inflect German nouns and adjectives for a specified usage, and add a dictionary, nouns_de.h, which associates each German noun in the game with a reference to its declension paradigm.
  • Replace functions that produce an object or monster name, such as doname in objnam.c or mon_nam in do_name.c, with expanded versions that take a usage_t argument and inflect the output noun phrase for the usage indicated.
  • Write human-readable macros in a new header, german.h, to call those functions with specific values of the usage parameters, then use these macros as drop-in replacements for the original functions to provide German grammar throughout the code. For example, the output statement in line 227 of mthrowu.c,

pline("%s is blinded by %s.", Monnam(mtmp), the(xname(otmp)));

becomes in NetzHack:

pline("%s wird von %s geblendet.", Monnam_nomsing(mtmp), the_xname_dat(otmp));

Monnam_nomsing and the_xname_dat are macros that call German grammar-sensitive versions of mon_nam in do_name.c and xname in objnam.c, passing them the appropriate usage parameters for this message. The macro definitions (in german.h) look like this:

#define Monnam_nomsing(m) Monnamg((m), (usage_t){SINGULAR, GENDER_UNKNOWN, CASE_NOMINATIVE, ARTICLE_DEFINITE})
...
#define the_xname_dat(o) xnameg((o), (usage_t){(o)->quan>1L?PLURAL:SINGULAR, GENDER_UNKNOWN, CASE_DATIVE, ARTICLE_DEFINITE})

The replacement functions, with names ending in 'g' for German, take the same arguments as the original naming functions (in this case, a pointer to a monster or object structure), plus a usage argument that specifies number, gender, case and determiner. In our example, the noun phrase that designates the monster must be in the nominative case, singular, and capitalized; the noun phrase for the thrown object must be in the dative case and have a definite article. The grammatical gender depends on the exact word that ends up being used to designate the monster or object, so it is indicated as "unknown" in these function calls. (Actually, NetzHack also hijacks the names of the original functions in extern.h to make them point to the nominative-singular macros, so that the original Monnam(mtmp) call above doesn't really need to be edited at all.) Since the determiner is a necessary part of the usage parameter – that is, it influences the form of any adjective preceding the noun – the nested call the(xname(...)), a frequent occurrence in NetHack, is always replaced (as in the example) with a single function call via one of the macros the_xname_{nom, gen, dat, acc} (for nominative, genitive, dative or accusative case).
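Why the usage has to carry number, gender and case together becomes clear when one tries to pick a German definite article. The sketch below is not NetzHack's german.c: the type usage_sketch_t, its field layout and the GENDER_MASCULINE/FEMININE/NEUTER constants are assumptions made for this example, it omits the determiner field, and it assumes the gender has already been resolved from the dictionary.

```c
#include <string.h>

enum number_t { SINGULAR, PLURAL };
enum gender_t { GENDER_MASCULINE, GENDER_FEMININE, GENDER_NEUTER };
enum case_t { CASE_NOMINATIVE, CASE_GENITIVE, CASE_DATIVE, CASE_ACCUSATIVE };

typedef struct {
    enum number_t number;
    enum gender_t gender;
    enum case_t gcase;
} usage_sketch_t;

/* Returns the German definite article for the given usage. */
const char *german_definite_article(usage_sketch_t u)
{
    static const char *const singular[4][3] = {
        /* masc.  fem.   neut. */
        { "der", "die", "das" }, /* nominative */
        { "des", "der", "des" }, /* genitive */
        { "dem", "der", "dem" }, /* dative */
        { "den", "die", "das" }, /* accusative */
    };
    static const char *const plural[4] = { "die", "der", "den", "die" };

    return u.number == PLURAL ? plural[u.gcase]
                              : singular[u.gcase][u.gender];
}

/* Counts how many distinct article forms the table produces. */
int count_distinct_articles(void)
{
    const char *seen[16];
    int nseen = 0;

    for (int c = CASE_NOMINATIVE; c <= CASE_ACCUSATIVE; c++) {
        for (int n = SINGULAR; n <= PLURAL; n++) {
            for (int g = GENDER_MASCULINE; g <= GENDER_NEUTER; g++) {
                usage_sketch_t u = { (enum number_t)n, (enum gender_t)g,
                                     (enum case_t)c };
                const char *art = german_definite_article(u);
                int known = 0;

                for (int i = 0; i < nseen; i++)
                    if (strcmp(seen[i], art) == 0)
                        known = 1;
                if (!known)
                    seen[nseen++] = art;
            }
        }
    }
    return nseen;
}
```

A dative singular masculine usage yields "dem", and count_distinct_articles() returns 6, matching the six forms of the German definite article mentioned earlier.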

The frequent dictionary look-ups to determine the necessary declension pattern for each monster or object noun used might be a drawback if computing power had not grown tremendously since NetHack was young. NetzHack caches recent look-ups, though, which is especially helpful since nouns are often repeated in output in a given game context. There are 1796 nouns in the dictionary.
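The effect of caching recent look-ups can be shown with a toy dictionary. Everything here is hypothetical: the three entries, the paradigm numbers, the cache size and the names lookup_noun and dictionary_scans are placeholders, and NetzHack's actual data structures may look quite different.

```c
#include <stddef.h>
#include <string.h>

struct noun_entry {
    const char *noun;
    int paradigm; /* placeholder reference to a declension paradigm */
};

static const struct noun_entry dictionary[] = {
    { "Dolch", 1 }, { "Hund", 1 }, { "Katze", 2 },
};

#define CACHE_SIZE 4
static const struct noun_entry *cache[CACHE_SIZE];
static size_t cache_next;

/* Number of times the full dictionary had to be searched. */
unsigned long dictionary_scans;

/* Looks up a noun, consulting a small cache of recent hits first.
 * Returns NULL if the noun is not in the dictionary. */
const struct noun_entry *lookup_noun(const char *noun)
{
    size_t i;

    for (i = 0; i < CACHE_SIZE; i++)
        if (cache[i] != NULL && strcmp(cache[i]->noun, noun) == 0)
            return cache[i];

    dictionary_scans++;
    for (i = 0; i < sizeof dictionary / sizeof dictionary[0]; i++) {
        if (strcmp(dictionary[i].noun, noun) == 0) {
            cache[cache_next] = &dictionary[i]; /* overwrite oldest slot */
            cache_next = (cache_next + 1) % CACHE_SIZE;
            return &dictionary[i];
        }
    }
    return NULL;
}
```

Looking up the same noun twice in a row searches the dictionary only once, which is the common case when a message about one object is followed by several more about the same object.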

NetzHack, like the original, is written entirely in C.

The minimal-effort strategy does not bring the game any closer to UTF-8 compatibility; however, since the changes from the original program structure are limited, there might be hope of patching in a future UTF-8 port of NetHack without too much adaptation. In fact, since recent versions of Windows have at last followed other operating systems in supporting UTF-8, the adaptation will probably consist of undoing the changes required to support German letters in ISO 8859-x encoding.

Monster and object names

The English names of monsters and objects are string literals in monst.c and objects.c. The NetHack build process compiles and invokes the utility makedefs to convert these names into preprocessor symbols, contained in the files include/pm.h and include/onames.h. The program then identifies objects and monsters by the numeric constants associated with those preprocessor symbols. The problem for translation is therefore that changing the names in monst.c and objects.c would change the preprocessor symbols, and almost every other part of NetHack would then have to be edited accordingly.
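The dependency between names and symbols can be illustrated with a simplified version of the conversion. This is a sketch of the general idea (uppercase the name and replace other characters with underscores), not the makedefs source; symbol_from_name is a hypothetical helper and "perrito" is just an example translation.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Derives a preprocessor-style symbol from a display name:
 * "little dog" with prefix "PM_" becomes "PM_LITTLE_DOG".
 * Only the part after the prefix is transformed. Returns buf. */
char *symbol_from_name(char *buf, size_t buflen, const char *prefix,
                       const char *name)
{
    size_t len, i;

    if (buflen == 0)
        return buf;
    snprintf(buf, buflen, "%s%s", prefix, name);
    len = strlen(buf);
    for (i = strlen(prefix); i < len; i++) {
        unsigned char c = (unsigned char)buf[i];
        buf[i] = isalnum(c) ? (char)toupper(c) : '_';
    }
    return buf;
}
```

If the string in the monster table were changed from "little dog" to "perrito", the generated symbol would change from PM_LITTLE_DOG to PM_PERRITO, and every source file that refers to PM_LITTLE_DOG would stop compiling.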

Spanish NetHack and NetHack-De solve this problem by replacing each string in monst.c and objects.c with a preprocessor symbol, and providing new headers to substitute either the original English or translated names for these symbols. In this way, distinct versions of objects.o and monst.o are built with the names in English and in the target language.
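In outline, the symbol-substitution approach might look like the following. The macro names (NAMES_ES, NAM_LITTLE_DOG, NAM_DAGGER), the struct and the Spanish strings are all invented for this sketch; the real projects keep the two sets of definitions in separate headers and use their own naming.

```c
/* Hypothetical header contents: a build selects exactly one set. */
#ifdef NAMES_ES
#define NAM_LITTLE_DOG "perrito"
#define NAM_DAGGER     "daga"
#else
#define NAM_LITTLE_DOG "little dog"
#define NAM_DAGGER     "dagger"
#endif

/* The tables refer to the symbols, never to a literal name. */
struct kind {
    const char *name;
};

const struct kind monster_kinds[] = { { NAM_LITTLE_DOG } };
const struct kind object_kinds[]  = { { NAM_DAGGER } };
```

Compiling the tables without NAMES_ES gives the English names that makedefs needs for generating pm.h and onames.h; compiling them again with NAMES_ES defined gives the object files that are linked into the translated game.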

NetzHack, on the other hand, adds an element to the object and monster data types, struct obj and struct mon, so that each kind of monster and object has both its translated German name and, invisibly to the user, its original English name too. Thus pm.h and onames.h are generated using the original names as before.

NetHack-i18n, because it has Gettext available, leaves the monster and object tables in English and converts them at run time.

Another approach might be to bite the bullet and replace the preprocessor symbols in pm.h and onames.h with their translated versions. No known translation takes this approach.

Input parsing

The largest problem here is support for wishes. Every translation must rewrite the readobjnam function to parse an object name according to the rules of the target language.

NetHack-i18n first removes the dungeon feature wishes, replacing them with a new extended command, called "dfeature" in the English locale; and then splits the rest into a parser, which is placed in the Ruby script, and a rule-enforcer, which remains in the core code.

Character sets

ASCII is inadequate for most languages other than English. All translations use a larger character set for messages. Case mappings and fuzzy matches for wishes and other inputs must take the character set into account; if the Spanish-language user wishes for "cota de escamas de dragon gris", he should get a gray dragon scale mail, even though the correct spelling is "dragón".
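An accent-tolerant comparison of this kind could be written as below for ISO 8859-1 input. It is a sketch, not code from Spanish NetHack: the function names are made up, only accented vowels and ü are folded, and ñ is deliberately left distinct from n.

```c
#include <ctype.h>

/* Maps accented ISO 8859-1 vowels to their plain lowercase form and
 * lowercases ASCII letters; all other bytes are returned unchanged. */
static unsigned char fold_latin1(unsigned char c)
{
    switch (c) {
    case 0xC1: case 0xE1: return 'a'; /* Á á */
    case 0xC9: case 0xE9: return 'e'; /* É é */
    case 0xCD: case 0xED: return 'i'; /* Í í */
    case 0xD3: case 0xF3: return 'o'; /* Ó ó */
    case 0xDA: case 0xFA:             /* Ú ú */
    case 0xDC: case 0xFC: return 'u'; /* Ü ü */
    default:
        return c < 0x80 ? (unsigned char)tolower(c) : c;
    }
}

/* Returns 1 if the two ISO 8859-1 strings are equal when case and
 * accents on vowels are ignored, 0 otherwise. */
int fuzzy_equal_latin1(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (fold_latin1((unsigned char)*a) != fold_latin1((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* true only if both strings ended together */
}
```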

JNetHack uses EUC-JP, with tests in the code to detect if the source has been converted to Shift-JIS; EUC-JP is adapted for Unix-like environments, and Shift-JIS for Microsoft Windows.

Spanish NetHack encodes all messages in ISO-8859-1, while leaving the map symbols in code page 437. Reduced IBMgraphics modes are available for users who do not have code page 437 configured. Slight hackery is needed to support the different character sets, because map symbols can appear outside the map in three places:

As NetHack-i18n is meant to be language-neutral, it uses Unicode throughout. Any user input is encoded in Unicode, and user interfaces are expected to support it. The TTY interface is abandoned in favor of a modified Curses interface, and the Curses library must support wide characters.

NetHack-De encodes all messages in ISO 8859-1. As a result, IBMgraphics doesn't work (because it uses a different character set), although DECgraphics does. User wishes are normalized before being parsed so that the user can enter wishes in any charset: to wish for "Rüstung" ("armor"), for example, the user may type "ruestung" in ASCII (the German letter ü originated as a combination of 'u' and 'e', hence "ue" is a conventional alternative where ü is not available), or "Rüstung" in ISO-8859-1, or "RÃ¼stung" in UTF-8 (the two UTF-8 bytes of ü as they appear when read as ISO 8859-1). (This feature is part of a preliminary UTF-8 support: a UTF-8 capable terminal would show "Rüstung", but be unable to display umlauts in the rest of NetHack-De's ISO 8859-1-encoded messages.)
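One way to implement such a normalization is to reduce every spelling to the ASCII digraph form before comparing. The sketch below takes that route as an assumption; it is not NetHack-De's code, and normalize_wish and wish_matches are hypothetical names. It treats the byte 0xC3 followed by a continuation byte as UTF-8, which misreads the rare ISO 8859-1 sequence of Ã followed by a control or symbol character.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASCII replacement for an ISO 8859-1 German letter, or NULL. */
static const char *ascii_for_latin1(unsigned char c)
{
    switch (c) {
    case 0xC4: case 0xE4: return "ae"; /* Ä ä */
    case 0xD6: case 0xF6: return "oe"; /* Ö ö */
    case 0xDC: case 0xFC: return "ue"; /* Ü ü */
    case 0xDF:            return "ss"; /* ß */
    default:              return NULL;
    }
}

/* Copies in to out, lowercased, with umlauts and ß written as ASCII
 * digraphs. Accepts ASCII, ISO 8859-1 and UTF-8 input. Returns out. */
char *normalize_wish(char *out, size_t outlen, const char *in)
{
    size_t o = 0;

    if (outlen == 0)
        return out;
    while (*in != '\0') {
        unsigned char c = (unsigned char)*in++;
        const char *rep;

        /* UTF-8 encodes U+00C0..U+00FF as 0xC3 plus one more byte. */
        if (c == 0xC3 && ((unsigned char)*in & 0xC0) == 0x80)
            c = (unsigned char)(((unsigned char)*in++ & 0x3F) | 0xC0);

        rep = ascii_for_latin1(c);
        if (rep != NULL) {
            for (; *rep != '\0' && o + 1 < outlen; rep++)
                out[o++] = *rep;
        } else if (o + 1 < outlen) {
            out[o++] = (char)tolower(c);
        }
    }
    out[o] = '\0';
    return out;
}

/* Returns 1 if the typed wish and the object name normalize equally. */
int wish_matches(const char *typed, const char *name)
{
    char a[128], b[128];

    return strcmp(normalize_wish(a, sizeof a, typed),
                  normalize_wish(b, sizeof b, name)) == 0;
}
```

Normalizing towards "ue" rather than towards "ü" avoids damaging words such as "Feuer", in which "ue" is not an umlaut.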

NetzHack is also in ISO-8859-x. The MS Windows console version actually uses two charsets (or "code pages" in Microspeak): the dungeon map is drawn in the system's default code page, while the Windows 1252 code page, containing the German characters ÄÖÜäöüß, is used for text messages.

In recent updates of Windows 10, Microsoft has finally begun to offer some support for UTF-8 output, which has been standard for some time now in the Unix world. Future versions of NetHack in other languages can therefore be expected to use UTF-8 encoding.

UTF-8, the ISO-8859-x encodings, and Shift-JIS are all supersets of the 7-bit ASCII used by original NetHack, but they are mutually incompatible above the ASCII range. Only UTF-8 is designed to encode the writing systems of all languages in one scheme. Hence a fundamental port to UTF-8 encoding might facilitate other translation projects.
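The relationship between the encodings can be seen in a small ISO 8859-1 to UTF-8 converter: ASCII bytes pass through unchanged, while every byte from 0x80 upward becomes a two-byte sequence. This is a generic sketch, not taken from any NetHack variant, and latin1_to_utf8 is a name chosen for the example.

```c
#include <stddef.h>

/* Converts an ISO 8859-1 string to UTF-8. Stops early if out is too
 * small; the result is always NUL-terminated when outlen > 0.
 * Returns the number of bytes written, not counting the NUL. */
size_t latin1_to_utf8(char *out, size_t outlen, const char *in)
{
    size_t o = 0;

    if (outlen == 0)
        return 0;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;

        if (c < 0x80) {
            /* ASCII: identical in both encodings. */
            if (o + 1 >= outlen)
                break;
            out[o++] = (char)c;
        } else {
            /* U+0080..U+00FF: lead byte 0xC2 or 0xC3 plus one
             * continuation byte carrying the low six bits. */
            if (o + 2 >= outlen)
                break;
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

The seven-byte ISO 8859-1 string for "Rüstung" becomes eight bytes in UTF-8, with ü (0xFC) encoded as 0xC3 0xBC, whereas a pure ASCII string such as "dagger" is copied byte for byte.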