Wikipedia

Extended Unix Code

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.

The most commonly used EUC codes are variable-length encodings with a character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII) taking one byte, and a character belonging to a 94x94 coded character set (such as GB 2312) represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code, whereas a single character in EUC-TW can take up to four bytes.

Modern applications are more likely to use UTF-8, which supports all of the characters of the EUC codes and more, and is generally more portable, with fewer vendor deviations and errors. EUC is, however, still widely used, especially EUC-KR in South Korea.

Encoding structure

 
[Figure: Relationship between packed EUC and other 8-bit ISO 2022 profiles]

The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets which can be represented with a sequence of the 94 7-bit bytes 0x21–7E, or alternatively 0xA1–FE if an eighth bit is available. This allows for sets of 94 graphical characters, or 8,836 (94²) characters, or 830,584 (94³) characters. Although initially 0x20 and 0x7F were always the space and delete character and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes.

EUC is a family of 8-bit profiles of ISO/IEC 2022, as opposed to 7-bit profiles such as ISO-2022-JP. As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as US-ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) and invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared).[1] If US-ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from US-ASCII is that 0x5C (backslash in US-ASCII) is often used to represent a Yen sign in EUC-JP (see below) and a won sign in EUC-KR.

The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code.[1]
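The arithmetic described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function names `kuten_to_euc` and `seven_bit_to_euc` are hypothetical, and the row and cell used in the example (4 and 2, assumed here to be the JIS X 0208 position of あ) serve only as a check against Python's built-in `euc_jp` codec.

```python
def kuten_to_euc(row: int, cell: int) -> bytes:
    """Return the two GR bytes for a position (row, cell) in a 94x94 set."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # Adding 0x20 to a kuten number gives the 7-bit (GL) byte;
    # setting the high bit adds another 0x80, i.e. 0xA0 (160) in total.
    return bytes([row + 0xA0, cell + 0xA0])


def seven_bit_to_euc(code: bytes) -> bytes:
    """Set the most significant bit of every 7-bit coding byte."""
    return bytes(b | 0x80 for b in code)


# Row 4, cell 2 of JIS X 0208 (assumed to be HIRAGANA LETTER A).
print(kuten_to_euc(4, 2).hex())
print(seven_bit_to_euc(b"\x24\x22").hex())
print("あ".encode("euc_jp").hex())
```

All three lines should print the same two bytes, since the kuten form, the 7-bit form and the EUC form differ only by a constant offset per byte.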

The EUC code itself does not make use of the announcement and designation sequences from ISO 2022.[1] However, the code specification is equivalent to the following sequence of four ISO 2022 announcement sequences, with meanings breaking down as follows.[1]

Individual sequence    Hexadecimal    Feature of EUC denoted
ESC SP C               1B 20 43       ISO-8 (8-bit, G0 in GL, G1 in GR)
ESC SP Z               1B 20 5A       G2 accessed using SS2
ESC SP [               1B 20 5B       G3 accessed using SS3
ESC SP \               1B 20 5C       Single-shifts invoke over GR

Fixed-length format

 
[Figure: Layout of the fixed-length format for Japanese]

The ISO-2022-based variable-length encoding described above is sometimes referred to as the EUC packed format, which is the encoding format usually labelled as EUC. However, internal processing of EUC data may make use of a fixed-length transformation format called the EUC complete two-byte format. This represents:[2]

  • Code set 0 as two bytes in the range 0x21–0x7E (except that the first may be 0x00).
  • Code set 1 as two bytes in the range 0xA0–0xFF (except that the first may be 0x80).
  • Code set 2 as a byte in the range 0x21–0x7E (or 0x00) followed by a byte in the range 0xA0–0xFF.
  • Code set 3 as a byte in the range 0xA0–0xFF (or 0x80) followed by a byte in the range 0x21–0x7E.

Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.[2] These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange.
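As a rough sketch of the mapping listed above, the following Python function converts packed EUC into the complete two-byte format. It assumes the Japanese layout, in which code sets 0 and 2 hold single-byte characters and code sets 1 and 3 hold two-byte characters. The function name `packed_to_two_byte` is hypothetical, and the input is assumed to be well-formed packed EUC-JP apart from a basic truncation check.

```python
def packed_to_two_byte(data: bytes) -> bytes:
    """Convert packed EUC-JP bytes to the complete two-byte format (sketch)."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Code set 0: one byte, padded on the left with 0x00.
            unit, step = bytes([0x00, lead]), 1
        elif lead == 0x8E:
            # Code set 2: drop SS2, pad the single GR byte with 0x00.
            unit, step = bytes([0x00]) + data[i + 1:i + 2], 2
        elif lead == 0x8F:
            # Code set 3: drop SS3, keep the first byte in GR and
            # clear the high bit of the second byte.
            pair = data[i + 1:i + 3]
            unit, step = pair[:1] + bytes(b & 0x7F for b in pair[1:]), 3
        else:
            # Code set 1: two GR bytes, unchanged.
            unit, step = data[i:i + 2], 2
        if len(unit) != 2:
            raise ValueError(f"truncated sequence at offset {i}")
        out += unit
        i += step
    return bytes(out)


packed = b"A" + b"\xa4\xa2" + b"\x8e\xb1" + b"\x8f\xb0\xa1"
print(packed_to_two_byte(packed).hex(" "))
```

Every character occupies exactly two bytes in the result, which is what makes the format convenient for internal processing: the code set can be read off from which of the two high bits are set.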

EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese".[3] Only the packed format is included in the WHATWG Encoding Standard used by HTML5.[4]

EUC-CN

EUC-CN
 
MIME / IANA: GB2312
Alias(es): csGB2312, CN-GB[5]
Language(s): Simplified Chinese, English, Russian
Standard: GB 2312 (1980)
Classification: Extended ASCII, variable-length encoding, CJK encoding, EUC
Extends: US-ASCII
Extensions: 748, GBK, GB 18030, x-mac-chinesesimp
Transforms / Encodes: GB 2312
Succeeded by: GBK, GB 18030

EUC-CN[6] is the usual encoded form of the GB 2312 standard for simplified Chinese characters. Unlike the case of Japanese JIS X 0208 and ISO-2022-JP, GB 2312 is not normally used in a 7-bit ISO 2022 code version,[a] although a variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET.

An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE.
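The byte layout can be checked with a short Python sketch. It relies on Python's built-in `gb2312` codec (which produces the EUC-CN form) and `hz` codec. The helper name `is_euc_cn` is hypothetical, and it tests structure only: it does not verify that a byte pair is actually assigned in GB 2312.

```python
def is_euc_cn(data: bytes) -> bool:
    """Structural check: ASCII bytes, or pairs of bytes in 0xA1-0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
        elif (i + 1 < len(data)
              and 0xA1 <= data[i] <= 0xFE
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True


encoded = "GB 中文".encode("gb2312")
print(encoded.hex(" "))
print(is_euc_cn(encoded))

# Clearing the high bit of each GB 2312 byte gives the 7-bit form,
# which is what HZ places between its ASCII delimiters.
seven_bit = bytes(b & 0x7F for b in "中文".encode("gb2312"))
print(seven_bit, "中文".encode("hz"))
```

The last line illustrates the relationship to HZ mentioned above: the HZ output is the same pair of codes with the high bits cleared, wrapped in the ASCII sequences `~{` and `~}`.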

Related Mainland Chinese encoding systems

748 code

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

IBM code pages 1380, 1381, 1382 and 1383

IBM code page 1381 (CCSID 1381) comprises the single-byte code page 1115 (CPGID 1115 as CCSID 1115) and the double-byte code page 1380 (CPGID 1380 as CCSID 1380),[7] which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0.[8]

IBM code page 1383 (CCSID 1383) comprises the single-byte code page 367 and the double-byte code page 1382 (CPGID 1382 as CCSID 1382),[9] which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312.[10] The alternative CCSID 5479[11] is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes the IBM-selected and user-defined characters.[12]

GBK and GB 18030

GBK is an extension to GB 2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.
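The structural difference can be observed with Python's built-in `gbk` and `gb2312` codecs. In the sketch below, the helper name `follows_euc_two_byte_rule` is hypothetical, and U+4E02 (丂) is used on the assumption that it is a GBK character with no GB 2312 code.

```python
def follows_euc_two_byte_rule(pair: bytes) -> bool:
    """True if both bytes lie in the GR range 0xA1-0xFE used by code set 1."""
    return len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)


shared = "中".encode("gbk")         # character that is also in GB 2312
gbk_only = "\u4e02".encode("gbk")   # character assumed to be absent from GB 2312

print(shared.hex(), follows_euc_two_byte_rule(shared))
print(gbk_only.hex(), follows_euc_two_byte_rule(gbk_only))

try:
    "\u4e02".encode("gb2312")
except UnicodeEncodeError:
    print("not encodable in plain EUC-CN")
```

For the GBK-only character, the lead byte falls in the C1 range and the trail byte is an ASCII byte, which is exactly the pattern that the EUC structure forbids.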

Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386.

The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB 18030 is a variable-length encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.
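A brief Python sketch, using the built-in `gb18030` codec, shows the variable length in practice. The sample characters are arbitrary choices for illustration; U+1F600 is used as an example of a code point outside the Basic Multilingual Plane.

```python
samples = ["A", "中", "\u4e02", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("gb18030")
    # Report the code point, the encoded length and the raw bytes.
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))

# Characters shared with GB 2312 keep their EUC-CN bytes.
print("中".encode("gb18030") == "中".encode("gb2312"))
```

In the four-byte form the second and fourth bytes are in the ASCII digit range (0x30–0x39), which is another reason the encoding does not fit the EUC structure.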

Mac OS Chinese Simplified

Other EUC-CN variants deviating from the EUC mechanism include the Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp).[13] It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space, the copyright sign (©), the trademark sign (™) and the ellipsis (…) respectively.[6] This differs in what is regarded as a single-byte character versus the first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes).

This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant.

Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8.[6] These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1,[6] both extensions are included by GB/T 12345 (the Traditional Chinese variant of GB 2312),[14] and both extensions are included by GB 18030 (the successor to GB 2312).[15]

EUC-JP

EUC-JP
 
MIME / IANA: EUC-JP
Alias(es): Unixized JIS (UJIS), csEUCPkdFmtJapanese
Language(s): Japanese, English, Russian
Classification: Extended ISO 646, variable-length encoding, CJK encoding, EUC
Extends: US-ASCII or ISO 646:JP
Transforms / Encodes: JIS X 0208, JIS X 0212, JIS X 0201
Succeeded by: EUC-JISx0213

EUC-JIS-2004
 
Alias(es): EUC-JISx0213
Language(s): Japanese, Ainu, English, Russian
Standard: JIS X 0213
Classification: Extended ASCII, variable-length encoding, CJK encoding, EUC
Extends: US-ASCII
Transforms / Encodes: JIS X 0213, JIS X 0201 (Kana)
Preceded by: EUC-JP

EUC-JP is a variable-length encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS.[2] About 0.1% of all web pages have used EUC-JP since August 2018,[16] while 2.5% of Japanese-language websites use this encoding[17] (less than either Shift JIS or UTF-8). IBM calls it Code page 954.[18][19] Microsoft has two code page numbers for this encoding (51932 and 20932).

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).
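The difference in trail bytes can be illustrated with Python's built-in `shift_jis` and `euc_jp` codecs. The helper name `has_ascii_range_byte` is hypothetical, and the katakana ソ is chosen because its Shift JIS trail byte is commonly cited as coinciding with the ASCII backslash.

```python
def has_ascii_range_byte(encoded: bytes) -> bool:
    """True if any byte of the encoded character is below 0x80."""
    return any(b < 0x80 for b in encoded)


ch = "ソ"
sjis = ch.encode("shift_jis")
euc = ch.encode("euc_jp")

print("Shift JIS:", sjis.hex(" "), has_ascii_range_byte(sjis))
print("EUC-JP:   ", euc.hex(" "), has_ascii_range_byte(euc))
```

Software that scans bytes for ASCII delimiters such as the backslash can therefore process EUC-JP text without decoding it, whereas the same scan may split a Shift JIS character in half.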

A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213[20] (similarly to Shift_JISx0213, its Shift_JIS-based counterpart).

Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.

Characters are encoded as follows:

  • As an EUC/ISO 2022 compliant encoding, the C0 control characters, space and DEL are represented as in ASCII.
  • A graphical character from ASCII (code set 0) is represented as its usual one-byte representation, in the range 0x21 – 0x7E. While some variants of EUC-JP encode the lower half of JIS X 0201 here, most encode ASCII,[21] including the W3C/WHATWG Encoding standard used by HTML5,[22] and so does EUC-JIS-2004.[20] While this means that 0x5C is typically mapped to Unicode as U+005C REVERSE SOLIDUS (the ASCII backslash), U+005C may be displayed as a Yen sign by certain Japanese-locale fonts, e.g. on Microsoft Windows, for compatibility with the lower half of JIS X 0201.[23][24]
  • A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE. This differs from the ISO-2022-JP representation by having the high bit set. This code set may also contain vendor extensions in some EUC-JP variants. In EUC-JIS-2004, the first plane of JIS X 0213 is encoded here, which is effectively a superset of standard JIS X 0208.[20]
  • A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second being the usual JIS X 0201 representation in the range 0xA1 – 0xDF. This set may contain IBM vendor extensions in some variants.
  • A character from JIS X 0212 (code set 3) is represented in EUC-JP by three bytes, the first being 0x8F, the following two being in the range 0xA1–0xFE, i.e. with the high bit set. In addition to standard JIS X 0212, code set 3 of some EUC-JP variants may also contain extensions in rows 83 and 84 to represent characters from IBM's Shift JIS extensions which lack standard JIS X 0212 mappings, which may be coded in either of two layouts, one defined by IBM themselves and one defined by the OSF.[25][26] In EUC-JIS-2004, the second plane of JIS X 0213 is encoded here,[20] which does not collide with the allocated rows in standard JIS X 0212.[27] Some implementations of EUC-JIS-2004, such as the one used by Python, allow both JIS X 0212 and JIS X 0213 plane 2 characters in this set.[27]
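The rules above can be sketched as a small Python routine that splits packed EUC-JP into characters and reports the code set of each. The name `split_euc_jp` is hypothetical. The routine checks byte ranges only, follows the standard ranges listed above (so it would reject vendor extensions that place code set 2 bytes above 0xDF), and does not check whether a code is actually assigned.

```python
from __future__ import annotations


def split_euc_jp(data: bytes) -> list[tuple[int, bytes]]:
    """Split packed EUC-JP into (code_set, raw_bytes) pairs."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                  # code set 0: ASCII or ISO 646:JP
            code_set, length, high = 0, 1, 0x7F
        elif lead == 0x8E:               # SS2: half-width kana
            code_set, length, high = 2, 2, 0xDF
        elif lead == 0x8F:               # SS3: JIS X 0212
            code_set, length, high = 3, 3, 0xFE
        elif 0xA1 <= lead <= 0xFE:       # code set 1: JIS X 0208
            code_set, length, high = 1, 2, 0xFE
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        unit = data[i:i + length]
        # Every byte after the first must be a GR byte within the set's range.
        if len(unit) < length or any(not 0xA1 <= b <= high for b in unit[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        units.append((code_set, unit))
        i += length
    return units


for code_set, raw in split_euc_jp("Aあｱ".encode("euc_jp") + b"\x8f\xb0\xa1"):
    print(code_set, raw.hex(" "))
```

Because the length of every character is determined by its first byte alone, the routine never needs to look backwards, unlike decoders for stateful encodings such as ISO-2022-JP.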

Related Japanese encoding methods

Vendor extensions to EUC-JP (from, for example, the Open Software Foundation, IBM or NEC) were often allocated within the individual code sets,[25][26] as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR).

However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding JIS X 0208 over GR, but do not follow the packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji.

DEC Kanji

Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-length (complete two-byte) EUC; however, code set 0 is not required to be left-padded with null bytes (similarly to the packed format).[28] JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212.[28] In the basic "DEC Kanji" encoding, only the first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1.[29]

The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets.[28] It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters.[29]

HP-16

Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which is a variant of Shift JIS. HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure:[28]

  • Lead bytes 0xA1–C2, trail bytes 0x21–7E
  • Lead bytes 0xC3–E3, trail bytes 0x21–3F
  • Lead bytes 0xC3–E1, trail bytes 0x40–64

IKIS

The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with the box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters.[28][29]

Adaptations of EUC-JP for EBCDIC

KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi,[29] with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, the sequence 0x0A 0x41 switches to single-byte mode and the sequence 0x0A 0x42 switches to double-byte mode.[b] However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space—0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters,[28] and the remainder are used for corporate-defined characters, including both kanji and non-kanji.[29]

JEF (Japanese-processing Extended Feature)[29] is an EBCDIC encoding used on Fujitsu FACOM mainframes, contrasting with FMR (a variant of Shift JIS) used on Fujitsu PCs. Like KEIS, JEF is a stateful encoding, switching to a double-byte DBCS-Host mode using shifting sequences (where 0x29 switches to single-byte mode and 0x28 switches to double-byte mode).[30] Also similarly to KEIS, JIS X 0208 codes are represented the same as in EUC-JP.[28] The lead byte range is extended back to 0x41, with 0x80–A0 designated for user definition; lead bytes 0x41–7F are assigned row numbers 101 through 163 for kuten purposes, although row 162 (lead byte 0x7E) is unused.[28][29] Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji.[29]

EUC-KR

EUC-KR
 
[Figure: EUC-KR code structure]
MIME / IANA: EUC-KR
Alias(es): Wansung, IBM-970
Language(s): Korean, English, Russian
Standard: KS X 2901 (KS C 5861)
Classification: Extended ISO 646, variable-length encoding, CJK encoding, EUC
Extends: US-ASCII or ISO 646:KR
Extensions: Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949)
Transforms / Encodes: KS X 1001
Succeeded by: Unified Hangul Code (web standards)

EUC-KR is a variable-length encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601)[31][32] and either ISO 646:KR (KS X 1003, formerly KS C 5636) or US-ASCII, depending on variant. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it as EUC-KR.

A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or US-ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).
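As an illustration, the following Python sketch labels each character of an EUC-KR byte string as G0 or G1 and, for G1, recovers the row and cell by subtracting 0xA0 from each byte. The name `describe_euc_kr` is hypothetical, and the example uses Python's built-in `euc_kr` codec.

```python
from __future__ import annotations


def describe_euc_kr(data: bytes) -> list[str]:
    """Label each character as G0 (hex byte) or G1 (row-cell)."""
    labels = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # One byte in GL (or a C0 control, space or DEL).
            labels.append(f"G0 {b:02X}")
            i += 1
        elif (i + 1 < len(data)
              and 0xA1 <= b <= 0xFE
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Two bytes in GR: subtract 0xA0 to get the row and cell.
            labels.append(f"G1 {b - 0xA0:02d}-{data[i + 1] - 0xA0:02d}")
            i += 2
        else:
            raise ValueError(f"not plain EUC-KR at offset {i}")
    return labels


print(describe_euc_kr("A가".encode("euc_kr")))
```

The `ValueError` branch is where extensions such as Unified Hangul Code would end up, since they use byte pairs outside the GR range.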

It is usually referred to as Wansung (Korean: 완성; romanized: Wanseong; lit. 'precomposed'[33]) in the Republic of Korea. IBM refers to the double-byte component as Code page 971,[34] and to EUC-KR with ASCII as Code page 970.[35][36][37] It is implemented as Code page 20949 ("Korean Wansung")[38][39] and Code page 51949 ("EUC Korean") by Microsoft.[38]

As of December 2022, less than 0.07% of all web pages globally use EUC-KR,[16] but 4.5% of South Korean web pages use EUC-KR.[40] Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but its use has been very slowly shifting to UTF-8 as it gains popularity, especially on Linux and macOS.

As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.

Related Korean encoding systems

Unified Hangul Code

A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드, Tonghabhyeong Hangeul Kodeu,[41] or 통합 완성형, Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261[42] or 1363[43] by IBM. IBM's code page 949 is a different, unrelated, EUC-KR extension.

Unified Hangul Code extends EUC-KR by using codes which do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. The W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR.[44]
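A short Python sketch using the built-in `cp949` codec (Python's name for Unified Hangul Code) shows the non-EUC codes. The helper name `is_plain_euc_pair` is hypothetical, and the syllable 똠 is used on the assumption that it is one of the syllable blocks missing from KS X 1001.

```python
def is_plain_euc_pair(pair: bytes) -> bool:
    """True if both bytes are in the GR range used by plain EUC-KR."""
    return len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)


in_ksx1001 = "한".encode("cp949")
extension = "똠".encode("cp949")   # assumed to lie outside KS X 1001

print(in_ksx1001.hex(" "), is_plain_euc_pair(in_ksx1001))
print(extension.hex(" "), is_plain_euc_pair(extension))
```

Characters already present in KS X 1001 keep their EUC-KR bytes, so existing EUC-KR text remains valid under Unified Hangul Code; only the added syllables use byte pairs outside the GR range.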

Mac OS Korean (HangulTalk)

Other encodings incorporating EUC-KR as a subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean),[13] which was used by HangulTalk (MacOS-KH), the Korean localisation of the classic Mac OS. It was developed by Elex Computer (일렉스), who were at the time the authorised distributor of Apple Macintosh computers in South Korea.[45][29]

HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within the EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylised dingbats.[29] Many of these characters do not have exact Unicode mappings, and Apple software maps these cases variously to combining sequences, to approximate mappings with an appended private-use character as a modifier for round-trip purposes, or to private-use characters.[46]

Apple also uses certain single-byte codes outside of the EUC-KR plane for additional characters: 0x80 for a required space, 0x81 for a won sign (₩), 0x82 for an en dash (–), 0x83 for a copyright sign (©), 0x84 for a wide underscore (_) and 0xFF for an ellipsis (…).[46] Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above), some are within the lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84).

EUC-KP

Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP.[47] More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code.[48]

EUC-TH

Although certain single-byte encodings such as the ISO/IEC 8859 series technically conform to the EUC structure, they are rarely labelled as EUC. However, eucTH is used on Solaris as a label for TIS-620.[49]

EUC-TW

EUC-TW is a variable-length encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94x94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Variants of Big5 are much more common than EUC-TW, although Big5 only encodes the first two planes of CNS 11643 hanzi, while UTF-8 is becoming more common.

  • As an EUC/ISO 2022 encoding, the C0 control characters, ASCII space and DEL are encoded as in ASCII.
  • A graphical character from US-ASCII (G0, code set 0) is encoded in GL as its usual single byte representation (0x21–0x7E).
  • A character from CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1–0xFE).
  • A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes:
    • The first byte is always 0x8E (Single Shift 2).
    • The second byte (0xA1–0xB0) indicates the plane, the number of which is obtained by subtracting 0xA0 from that byte.
    • The third and fourth bytes are in GR (0xA1–0xFE).

Note that plane 1 of CNS 11643 is encoded twice: once as code set 1 and again as part of code set 2.
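The byte layout described above can be sketched as follows. Python's standard library has no EUC-TW codec, so the sketch works directly from a CNS 11643 plane, row and cell. The function name `euc_tw_bytes` and its `use_code_set_1` parameter are hypothetical.

```python
def euc_tw_bytes(plane: int, row: int, cell: int,
                 use_code_set_1: bool = True) -> bytes:
    """Build the EUC-TW bytes for a CNS 11643 position (sketch)."""
    if not (1 <= plane <= 16 and 1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("plane must be 1..16; row and cell must be 1..94")
    gr_pair = bytes([row + 0xA0, cell + 0xA0])
    if plane == 1 and use_code_set_1:
        # Code set 1: plane 1 as two GR bytes.
        return gr_pair
    # Code set 2: SS2, then a plane byte (0xA0 + plane), then two GR bytes.
    return bytes([0x8E, plane + 0xA0]) + gr_pair


print(euc_tw_bytes(1, 1, 1).hex(" "))
print(euc_tw_bytes(1, 1, 1, use_code_set_1=False).hex(" "))
print(euc_tw_bytes(2, 1, 1).hex(" "))
```

The first two calls produce the two alternative encodings of the same plane 1 character, which is the duplication noted above.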


Notes

  1. ^ 7-bit ISO 2022 code versions supporting GB 2312 include ISO-2022-CN (with shift codes) and ISO-2022-JP-2 (without shift codes), both of which also support other non-ASCII sets.
  2. ^ These sequences match the hexadecimal forms shown by DEC[30] and the decimal forms (10 65 and 10 66) listed by Lunde.[28] Lunde lists the hexadecimal forms for both as 0xA0 0x42, seemingly in error.

References

  1. ^ a b c d IBM. "Character Data Representation Architecture (CDRA)". IBM. pp. 157–162.
  2. ^ a b c Lunde, Ken (2008). CJKV Information Processing: Chinese, Japanese, Korean, and Vietnamese Computing. O'Reilly. pp. 242–244. ISBN 9780596800925.
  3. ^ "Character Sets". IANA.
  4. ^ "4.2. Names and labels". Encoding Standard. WHATWG.
  5. ^ Zhu, Haifeng; Hu, Daoyuan; Wang, Zhiguan; Kao, Tien-cheu; Chang, Wen-chung; Crispin, Mark (1996). "RFC 1922: Chinese Character Encoding for Internet Messages (§ 2.1: CN-GB)". Requests for Comments. IETF. doi:10.17487/rfc1922.
  6. ^ a b c d "Map (external version) from Mac OS Chinese Simplified encoding to Unicode 3.0 and later". Apple, Inc.
  7. ^ . IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-26.
  8. ^ "IBM Simplified Chinese Graphic Character Set" (PDF). IBM. 1993. C-H 3-3220-130 1993-11.
  9. ^ . IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-28.
  10. ^ "IBM Simplified Chinese Graphic Character Set for Extended UNIX Code (EUC)" (PDF). IBM. 1994. C-H 3-3220-132 1994-06.
  11. ^ . IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-27.
  12. ^ . IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-27.
  13. ^ a b "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft.
  14. ^ Lunde, Ken (1998). Appendix F: GB/T 12345 (PDF). CJKV Information Processing. O'Reilly Media. ISBN 9781565922242.
  15. ^ Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology—Chinese coded character set.
  16. ^ a b "Historical trends in the usage of character encodings for websites". W3Techs.
  17. ^ "Distribution of Character Encodings among websites that use Japanese". w3techs.com. Retrieved 2022-12-02.
  18. ^ . Archived from the original on 2016-03-27.
  19. ^ International Components for Unicode (ICU), ibm-954_P101-2007.ucm, 2002-12-03
  20. ^ a b c d "JIS X 0213 Code Mapping Tables". x0213.org.
  21. ^ "Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)". XML Japanese Profile. W3C.
  22. ^ "EUC-JP decoder". Encoding Standard. WHATWG. "If byte is an ASCII byte, return a code point whose value is byte."
  23. ^ . Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.
  24. ^ Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".
  25. ^ a b . Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.
  26. ^ a b Lunde, Ken (13 January 2009). "Appendix J: Japanese Character Sets" (PDF). CJKV Information Processing (2nd ed.). ISBN 978-0-596-51447-1.
  27. ^ a b Chang, Hyeshik (8 December 2021). "Readme for CJKCodecs". cPython. Python Software Foundation.
  28. ^ a b c d e f g h i Lunde, Ken (13 January 2009). "Appendix F: Vendor Encoding Methods" (PDF). CJKV Information Processing (2nd ed.). ISBN 978-0-596-51447-1.
  29. ^ a b c d e f g h i j Lunde, Ken (2009). "Appendix E: Vendor Character Set Standards" (PDF). CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing (2nd ed.). Sebastopol, CA: O'Reilly. ISBN 978-0-596-51447-1.
  30. ^ a b "2: Codesets and Codeset Conversion". DIGITAL UNIX Technical Reference for Using Japanese Features. Digital Equipment Corporation, Compaq.
  31. ^ "KS X 1001:1992" (PDF).
  32. ^ "KS C 5601:1987" (PDF). 1988-10-01.
  33. ^ Lunde, Ken (2009). "Chapter 3: Character Set Standards". CJKV Information Processing. p. 146. ISBN 978-0596514471.
  34. ^ . Archived from the original on 2014-11-30. Retrieved 2021-09-03.
  35. ^ . IBM Globalization. IBM. Archived from the original on 2014-12-01.
  36. ^ "ibm-970_P110_P110-2006_U2 (alias euc-kr)". Converter Explorer - ICU Demonstration. International Components for Unicode.
  37. ^ International Components for Unicode (ICU), ibm-970_P110_P110-2006_U2.ucm, 2002-12-03
  38. ^ a b "Code Page Identifiers". Windows Dev Center. Microsoft.
  39. ^ Julliard, Alexandre. "dump_krwansung_codepage: build Korean Wansung table from the KSX1001 file". make_unicode: Generate code page .c files from ftp.unicode.org descriptions. Wine Project.
  40. ^ "Distribution of Character Encodings among websites that use .kr". w3techs.com. Retrieved 2022-12-02.
  41. ^ (in Korean). W3C. Archived from the original on 2013-05-24. Retrieved 2019-01-07.
  42. ^ In ucnv_lmb.cpp, a file originating from IBM and included in the International Components for Unicode source tree, the lead byte 0x11 is commented as referring to "Korean: ibm-1261" after the definition of ULMBCS_GRP_KO, and is mapped to the "windows-949" ICU codec in the OptGroupByteToCPName array later in the file.
  43. ^ , IBM Globalization, IBM, archived from the original on 2014-11-29
  44. ^ "5. Indexes (§ index EUC-KR)", Encoding Standard, WHATWG
  45. ^ Gil, Hojin. "HangulTalk: De facto standard Hangul environment for Mac". Guide to using Hangul on Macintosh.
  46. ^ a b Apple (2005-04-05). "Map (external version) from Mac OS Korean encoding to Unicode 3.2 and later". Unicode Consortium.
  47. ^ Kim, Kyongsok (2002-11-30). "3-way cross-reference tables - KS X 1001, KPS 9566, and UCS" (PDF). ISO/IEC JTC 1/SC 2/WG 2 N2564. [Note: updated links for tables accompanying document: [1] [2]]
  48. ^ Chung, Jaemin (2018-01-05). "Information on the most recent version of KPS 9566 (KPS 9566-2011?)" (PDF). UTC L2/18-011.
  49. ^ IBM (2001-05-07). "solaris-eucTH-2.7". icu-data. Unicode Consortium/International Components for Unicode.

External links

  • EUC-JP codeset table (minus the ASCII and halfwidth parts)
  • Code Page Identifiers
  • Manual page of EUC-JISX0213 in the Perl Encode module
  • International Register of Coded Character Sets to be Used With Escape Sequence – section 2.4 (p.14f.) with the coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
  • Chinese, Japanese, and Korean character set standards and encoding systems

