Difference between revisions of "NetHack in other languages"

From NetHackWiki
(Localization approaches: Etc.)
m (Monster and object names: Promoted up one outline level)

Revision as of 23:01, 1 March 2013

NetHack's text output is in English. Although the program's structure does not easily lend itself to localization (since morphological features of English are hard-wired into the source code on all levels), several localization projects currently exist.

German

Tony Crawford and Karl Breuer have developed a German localized version called NetzHack (note the 'z'), which runs on Linux, *BSD, and OS X (console and X11), and on Win32 (console and Windows graphics). Source and binaries are available.

A different German translation attempt by Patric Mueller called NetHack-De was released as a playable, although incomplete, alpha release on 11 October 2007. The latest release includes source code, a Debian package and a graphical Windows binary.

Japanese

The Japanese version JNetHack by Issei Numata has been in existence for several years. For those who don't read Japanese, there's some outdated information in English at jnethack.org.

Sourceforge.jp also carries JSlash'em, JSporkHack and JUnNetHack, as well as a Japanese NetHack Resources Project.

NetHack brass can be compiled as an English or Japanese version.

Spanish

Ray Chason has published Internationalized NetHack as a work in progress. It presently supports English and Spanish, and will eventually supersede Spanish NetHack.

Incomplete or stalled translations

On January 28th 2009 a Chinese translation called nethack-cn was begun on Google Code but the last update was on June 25th 2009.

A SourceForge project for a French translation called nethack-fr was registered on August 6th 2009. The last update was on October 29th 2009. There is a French translation of the guidebook and some spoilers.

The first commit of a GitHub project for an Italian translation called nethack-it was on December 4th 2009. The last commit so far was on January 27th 2010.

A SourceForge project for a Portuguese translation of NetHack and Slash'Em was registered on May 3rd 2004 but this was also the only activity on that project.

Internationalization

Ray Chason has launched the NetHack-i18n project, also called Internationalized NetHack, which is aimed at adapting NetHack for easier translation to other languages.

Current NetHack localization strategies

The problem

Because NetHack has output text in the form of string literals scattered throughout the code, the customary approach is for the translator to go through the source code and substitute translations for the string literals. What complicates this process is the fact that many messages are composed of elements that can vary with the runtime context. For example, in an output statement like this:

pline("%s hits %s.", objectname, monstername);

the variables "objectname" and "monstername" may be singular or plural, masculine or feminine, and may be introduced by "a" or "the". The words to be inserted must be formed appropriately before the output function call.

At various points in the program, NetHack's output messages vary with second and third person verb forms, singular and plural verb forms, and noun inflections by case, gender, and number.

In English, this is easy: word forms do not change with grammatical gender or case, and most nouns simply change from singular to plural by the addition of a trailing 's'. There is only one form of the definite article ("the"), and two forms of the indefinite article ("a" and "an") which are grammatically equivalent. In other languages, morphology can be much more complex: Spanish has four forms of the definite article, depending on whether a noun is singular or plural, masculine or feminine; German has six.

(As it happens, monsters in NetHack always act or are acted upon singly, not collectively, which simplifies matters sometimes. On the other hand, objects are named differently at different times -- by name, by description, or by class -- and so an object name or a pronoun that replaces "it" can vary even for the same object.)

Word order can also change depending on certain conditions, such as whether the subject is a common noun, a proper noun or a pronoun.

Furthermore, some languages have mandatory contractions (Spanish contracts the preposition and article "a"+"el" into "al"; French contracts the preposition and article "de"+"le" into "du", etc.).
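A mandatory contraction of this kind can be sketched in a few lines of C. The helper name join_prep is invented for illustration and is not taken from any of the projects discussed here; it handles only the Spanish pairs "a"+"el" and "de"+"el".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: join a Spanish preposition and a noun phrase,
 * applying the mandatory contractions a+el -> al and de+el -> del.
 * Writes the result into buf (of size n) and returns buf. */
char *
join_prep(char *buf, size_t n, const char *prep, const char *phrase)
{
    /* The contraction applies only when the phrase begins with the
     * article "el" as a whole word. */
    int starts_with_el = strncmp(phrase, "el ", 3) == 0;

    if (starts_with_el && strcmp(prep, "a") == 0)
        snprintf(buf, n, "al %s", phrase + 3);
    else if (starts_with_el && strcmp(prep, "de") == 0)
        snprintf(buf, n, "del %s", phrase + 3);
    else
        snprintf(buf, n, "%s %s", prep, phrase);
    return buf;
}
```

Called with "a" and "el orco" this yields "al orco", while "a" and "la espada" is left uncontracted. A real translation would also have to decide where in the message pipeline such a rule runs, since the article is often produced by a different function than the preposition.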

Some examples of word and sentence morphology in Spanish:

  • "¡Idefix golpea al orco!" (subject and object are both nouns)
  • "¡Idefix lo golpea!" (object is a pronoun, and goes before the verb)
  • "¡Golpeas al orco!" (subject is a pronoun ("tú") and is omitted; verb changes to second person singular)
  • "¡Lo golpeas!" (both modifications apply)

The message generation must also correctly capitalize after such rules are applied.

Original NetHack contains a few functions to modify linguistic elements for output, such as makeplural(), s_suffix(), and vtense(). But since English is not a highly inflected language, even these do not actually operate on grammatical categories, but tend to manipulate words by superficial characteristics: an() for example chooses between the indefinite article forms "a" and "an" merely on the basis of the following word's first letter.
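To make the "superficial characteristics" point concrete, here is a deliberately simplified sketch of a first-letter article chooser. It is not NetHack's actual an(); the name simple_an and its buffer-based signature are assumptions made for this example.

```c
#include <stdio.h>
#include <string.h>

/* Simplified sketch, not NetHack's actual an(): choose "a" or "an"
 * by looking only at the first letter of the following word. */
char *
simple_an(char *buf, size_t n, const char *word)
{
    /* Guard against the empty string: strchr() would otherwise
     * match the terminating NUL of "aeiou". */
    int vowel = word[0] != '\0' && strchr("aeiou", word[0]) != NULL;

    snprintf(buf, n, "%s %s", vowel ? "an" : "a", word);
    return buf;
}
```

This gives "an orc" and "a kobold", but also "an unicorn": a rule that looks at spelling rather than pronunciation or grammar needs a list of exceptions even in English, and it has nothing to offer a language in which the article depends on gender or case.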

Localization approaches

NetHack-i18n

Internationalized NetHack aims to systematize the process of string replacement using Gettext together with a scriptable printf-like system to handle the grammar bits.

Gettext's grammar support is minimal: it supports plurals.

Spanish NetHack

Spanish NetHack handles these rules with special-purpose routines, much as the unpatched NetHack does. NetHack-i18n encodes such rules in two ways:

  • by extending the printf-like syntax to include formatters such as %3${g/handsome/beautiful}, where the number after the % is a parameter number (this is a POSIX extension to printf) and the part between the braces is interpreted by a Ruby script; and
  • by defining "joining rules" at the start and end of each substitution, to handle mandatory contractions and such rules as "a/an".
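The parameter-number part of that syntax can be tried with the standard library alone. The sketch below shows only the POSIX numbered-argument feature (%1$s, %2$s); the brace-delimited part that NetHack-i18n hands to its script is not reproduced here, and the function names are invented for the example.

```c
#include <stdio.h>

/* Template in the original argument order: attacker first. */
int
fmt_active(char *buf, size_t n, const char *attacker, const char *target)
{
    return snprintf(buf, n, "%1$s hits %2$s.", attacker, target);
}

/* Same arguments, passed in the same order, but the template refers
 * to them by number and so can place the target first. */
int
fmt_reordered(char *buf, size_t n, const char *attacker, const char *target)
{
    return snprintf(buf, n, "%2$s is hit by %1$s.", attacker, target);
}
```

Because the call site passes the arguments in a fixed order and only the template changes, a translator can reorder the sentence without touching the C code. Numbered arguments are a POSIX feature rather than ISO C, so this assumes a POSIX-conforming C library.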

NetzHack

NetzHack's development began with the idea that we just wanted to translate, not to rewrite the program, or in other words: NetHack is a prime example of how you don't code for localization, and trying to fix that was pretty near hopeless. So the localization strategy was as follows:

  • Translate string literals in the source code
  • Create a new data type, usage_t, to contain the usage information of each context in which a noun, adjective or pronoun might appear: number, case, gender, and determiner.
  • Write a new module, german.c, with the functions necessary to inflect German nouns and adjectives for a specified usage, and add a dictionary, nouns_de.h, which associates each German noun with a reference to its declension paradigm.
  • Replace functions that produce an object or monster name, such as do_name() or monnam(), with expanded versions that take a usage_t argument.
  • Write human-readable macros in a new header, german.h, to call those functions with specific values of the usage parameters, then apply the macros as drop-in replacements for the original functions to provide German grammar throughout the code. For example, the output statement at line 227 of mthrowu.c,
pline("%s is blinded by %s.", Monnam(mtmp), the(xname(otmp)));

becomes in NetzHack:

pline("%s wird von %s geblendet.", Monnam_nomsing(mtmp), the_xname_dat(otmp));

Monnam_nomsing and the_xname_dat are macros that call German grammar-sensitive versions of mon_nam() and xname(), passing them the appropriate usage parameters for this message. The noun phrase that designates the monster must be in the nominative case, singular, and capitalized; the noun phrase for the thrown object must be in the dative case and have a definite article. (Actually, NetzHack also hijacks the names of the original functions in extern.h to make them point to the nominative-singular macros, so that the Monnam(mtmp) call above doesn't need to be edited at all.)
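The usage-record idea can be illustrated with a small sketch. The field layout of usage_t, the enum names, and the macros the_nomsing and the_datsing below are guesses made for this example, not NetzHack's actual definitions; the determiner field is left out for brevity, and only the definite article is inflected, not the noun itself.

```c
/* Hypothetical layout: the article names usage_t but not its fields. */
enum gender { MASC, FEM, NEUT };
enum gcase { NOM, GEN, DAT, ACC };
enum number { SING, PLUR };

typedef struct {
    enum gcase gcase;
    enum number number;
    enum gender gender;
} usage_t;

/* German definite article for a given case, number and gender. */
const char *
definite_article(usage_t u)
{
    static const char *const sing[3][4] = {
        { "der", "des", "dem", "den" }, /* masculine */
        { "die", "der", "der", "die" }, /* feminine */
        { "das", "des", "dem", "das" }, /* neuter */
    };
    static const char *const plur[4] = { "die", "der", "den", "die" };

    return u.number == PLUR ? plur[u.gcase] : sing[u.gender][u.gcase];
}

/* Readable wrappers that fix the usage for one kind of context. */
#define the_nomsing(g) definite_article((usage_t){ NOM, SING, (g) })
#define the_datsing(g) definite_article((usage_t){ DAT, SING, (g) })
```

With these definitions the_datsing(MASC) evaluates to "dem" and the_nomsing(FEM) to "die". The point of the macro layer is that a call site states which grammatical context it is in, while the table lookup stays in one place.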

The frequent dictionary look-ups to determine the necessary declension pattern for each given noun might be a drawback if computing power had not grown tremendously since NetHack was young. Caching of look-ups helps, though, especially since nouns are often repeated in output in a given game context.

NetzHack, like the original, is written entirely in C.

The minimal-effort strategy does not bring the game any closer to UTF-8 compatibility; however, since the changes from the original program structure are limited, there might be hope of patching in a future UTF-8 port of NetHack without too much adaptation.

Monster and object names

The English names of monsters and objects are string literals in monst.c and objects.c. The NetHack build process compiles and invokes the utility makedefs to convert these names into preprocessor symbols, contained in the files include/pm.h and include/onames.h. The program then identifies objects and monsters by the numeric constants associated with those preprocessor symbols. The problem for translation is therefore that changing the names in monst.c and objects.c would change the preprocessor symbols, and almost every other part of NetHack would then have to be edited accordingly.
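The dependency between names and symbols can be seen in a rough sketch of the conversion. This is not the actual makedefs code; name_to_symbol is a hypothetical function that upper-cases a name and replaces other characters with underscores, in the style of a symbol such as PM_GIANT_ANT.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch: build a symbol from a prefix and a display
 * name. Returns buf, or NULL if the buffer of size n is too small. */
char *
name_to_symbol(char *buf, size_t n, const char *prefix, const char *name)
{
    size_t i, plen = strlen(prefix);

    if (plen + strlen(name) + 1 > n)
        return NULL; /* buffer too small */
    memcpy(buf, prefix, plen);
    for (i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char) name[i];

        /* Letters and digits are upper-cased; anything else
         * (spaces, hyphens, apostrophes) becomes an underscore. */
        buf[plen + i] = isalnum(c) ? (char) toupper(c) : '_';
    }
    buf[plen + i] = '\0';
    return buf;
}
```

Feeding it "giant ant" gives PM_GIANT_ANT, while a translated name such as "Riesenameise" gives PM_RIESENAMEISE, so every reference to the old symbol elsewhere in the source would stop compiling.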

Spanish NetHack and NetHack-De solve this problem by replacing each string in monst.c and objects.c with a preprocessor symbol, and providing new headers to substitute either the original English or translated names for these symbols. In this way, distinct versions of objects.o and monst.o are built with the names in English and in the target language.
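A minimal sketch of this header-substitution idea follows. The symbol names (LANG_ES, NAM_GIANT_ANT, NAM_ORC), the Spanish strings, and the stand-in struct are all invented for illustration; the real projects use their own names and the full NetHack data structures.

```c
/* Hypothetical header contents: one set of names per language.
 * Compiling with -DLANG_ES selects the translated set. */
#ifdef LANG_ES
#define NAM_GIANT_ANT "hormiga gigante"
#define NAM_ORC "orco"
#else
#define NAM_GIANT_ANT "giant ant"
#define NAM_ORC "orc"
#endif

/* Stand-in for the monster table: the table refers to the symbols,
 * never to a string literal directly. */
struct monster_sketch {
    const char *mname;
};

const struct monster_sketch mons_sketch[] = {
    { NAM_GIANT_ANT },
    { NAM_ORC },
};
```

The same table source is compiled twice: once with the English definitions, so that the symbol-generating step sees the original names, and once with the translated definitions for the object file that is linked into the game.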

NetzHack, on the other hand, adds an element to the object and monster data types, struct obj and struct mon, so that each kind of monster and object has both its translated German name and, invisibly to the user, its original English name too. Thus pm.h and onames.h are generated using the original names as before.

NetHack-i18n, because it has Gettext available, leaves the monster and object tables in English and converts them at run time. Another approach might be to bite the bullet and replace the preprocessor symbols in pm.h and onames.h with their translated versions. No known translation takes this approach.

Input parsing

The largest problem here is support for wishes. Every translation must rewrite the readobjnam function to parse an object name according to the rules of the target language.

NetHack-i18n first removes the dungeon feature wishes, replacing them with a new extended command, called "dfeature" in the English locale; and then splits the rest into a parser, which is placed in the Ruby script, and a rule-enforcer, which remains in the core code.

Character sets

ASCII is inadequate for most languages other than English. All translations use a larger character set for messages. Case mappings and fuzzy matches for wishes and other inputs must take the character set into account; if the user wishes for "cota de escamas de dragon gris", he should get a gray dragon scale mail, even though the correct spelling is "dragón".
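An accent-insensitive comparison of this kind might look like the following sketch, which assumes ISO-8859-1 input and folds only the accented vowels used in Spanish. The function names fold and fuzzy_equal are hypothetical and do not come from any of the translations.

```c
#include <ctype.h>

/* Map an ISO-8859-1 byte to a lower-case, unaccented ASCII letter
 * where one exists; other bytes are lower-cased or left unchanged. */
static int
fold(unsigned char c)
{
    switch (c) {
    case 0xC1: case 0xE1: return 'a'; /* A, a with acute */
    case 0xC9: case 0xE9: return 'e'; /* E, e with acute */
    case 0xCD: case 0xED: return 'i'; /* I, i with acute */
    case 0xD3: case 0xF3: return 'o'; /* O, o with acute */
    case 0xDA: case 0xFA:             /* U, u with acute */
    case 0xDC: case 0xFC: return 'u'; /* U, u with diaeresis */
    default: return tolower(c);
    }
}

/* Return nonzero if the two strings match after folding. */
int
fuzzy_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (fold((unsigned char) *a) != fold((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* both strings must end together */
}
```

With this, "dragon gris" matches "drag\xf3n gris" (the ISO-8859-1 bytes for "dragón gris"). A multi-byte encoding such as UTF-8 would need the folding to operate on decoded characters rather than single bytes.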

JNetHack uses EUC-JP, with tests in the code to detect if the source has been converted to Shift-JIS; EUC-JP is adapted for Unix-like environments, and Shift-JIS for Microsoft Windows.

Spanish NetHack encodes all messages in ISO-8859-1, while leaving the map symbols in code page 437. Reduced IBMgraphics modes are available for users who do not have code page 437 configured. Slight hackery is needed to support the different character sets, because map symbols can appear outside the map in three places:

As NetHack-i18n is meant to be language-neutral, it uses Unicode throughout. Any user input is encoded in Unicode, and user interfaces are expected to support it. The TTY interface is abandoned in favor of a modified Curses interface, and the Curses library must support wide characters.

NetHack-De encodes all messages in ISO 8859-1. As a result, IBMgraphics doesn't work (it uses a different character set), although DECgraphics does. User wishes are normalized before being parsed, so that the user can enter a wish (for example, armor) in ASCII as "ruestung" (umlauts written out according to the German transcription rules), in ISO-8859-1 as "Rüstung", or in UTF-8 as "RÃ¼stung". The UTF-8 form is part of the preliminary UTF-8 support: a UTF-8 capable terminal would show "Rüstung", but the rest of NetHack-De's messages would have broken umlauts.

NetzHack is also in ISO-8859-x. The MS Windows console version actually uses two charsets (or "code pages" in Microspeak): the dungeon map is drawn in the system's default code page, while the Windows 1252 code page, containing the German characters ÄÖÜäöüß, is used for text messages.