
ISO/IEC 2022

ISO/IEC 2022 Information technology—Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding. It is equivalent to the ECMA standard ECMA-35,[1][2] the ANSI standard ANSI X3.41[3] and the Japanese Industrial Standard JIS X 0202. Originating in 1971, it was most recently revised in 1994.[4]

ISO 2022
Language(s): Various.
Standard
Classification: Stateful system of encodings (with stateless pre-configured subsets)
Transforms / Encodes: US-ASCII and, depending on implementation:
Succeeded by: ISO/IEC 10646 (Unicode)
Other related encoding(s): Stateful subsets:
Pre-configured versions:

ISO 2022 specifies a general structure which character encodings can conform to, dedicating particular ranges of bytes (0x00–1F and 0x7F–9F) to be used for non-printing control codes[5] for formatting and in-band instructions (such as line breaks or formatting instructions for text terminals), rather than graphical characters. It also specifies a syntax for escape sequences, multiple-byte sequences beginning with the ESC control code, which can likewise be used for in-band instructions.[6] Specific sets of control codes and escape sequences designed to be used with ISO 2022 include ISO/IEC 6429, portions of which are implemented by ANSI.SYS and terminal emulators.

ISO 2022 itself also defines particular control codes and escape sequences which can be used for switching between different coded character sets (for example, between ASCII and the Japanese JIS X 0208) so that several sets can be used in a single document,[7] effectively combining them into a single stateful encoding (a feature less important since the advent of Unicode). It is designed to be usable in both 8-bit environments and 7-bit environments (those where only seven bits are usable in a byte, such as e-mail without 8BITMIME).[8]

Encodings and conformance

The ASCII character set supports the ISO Basic Latin alphabet (equivalent to the English alphabet), and does not provide good support for languages which use additional letters, or which use a different writing system altogether. Other writing systems with relatively few characters, such as Greek, Cyrillic, Arabic or Hebrew, as well as forms of the Latin script using diacritics or letters absent from the ISO Basic Latin alphabet, have historically been represented on personal computers with different 8-bit, single byte, extended ASCII encodings, which follow ASCII when the most significant bit is 0 (i.e. bytes 0x00–7F, when represented in hexadecimal), and include additional characters for a most significant bit of 1 (i.e. bytes 0x80–FF). Some of these, such as the ISO 8859 series, conform to ISO 2022,[9][10] while others such as DOS code page 437 do not, usually due to not reserving the bytes 0x80–9F for control codes.

Certain East Asian languages, specifically Chinese, Japanese, and Korean (collectively "CJK"), are written using far more characters than the maximum of 256 which can be represented in a single byte, and were first represented on computers with language-specific double-byte encodings or variable-width encodings; some of these (such as the Simplified Chinese encoding GB 2312) conform to ISO 2022, while others (such as the Traditional Chinese encoding Big5) do not. Control codes in ISO 2022 are always represented with a single byte, regardless of the number of bytes used for graphical characters. CJK encodings used in 7-bit environments which use ISO 2022 mechanisms to switch between character sets are often given names starting with "ISO-2022-", most notably ISO-2022-JP, although some other CJK encodings such as EUC-JP also make use of ISO 2022 mechanisms.[11][12]

Since the first 256 code points of Unicode were taken from ISO 8859-1, Unicode inherits the concept of C0 and C1 control codes from ISO 2022, although it adds other non-printing characters besides the ISO 2022 control codes. However, Unicode transformation formats such as UTF-8 generally deviate from the ISO 2022 structure in various ways, including:

  • Using 8-bit bytes, but not representing the C1 codes in their single-byte forms specified in ISO 2022 (most UTFs, one exception being the obsolete UTF-1)
  • Representing all characters, including control codes, with multiple bytes (e.g. UTF-16, UTF-32)
  • Mixing bytes with the most significant bit set and unset within the coded representation for a single code point (e.g. UTF-1, GB 18030)
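The first two deviations can be seen directly with Python's built-in codecs. This is only an illustrative sketch: the variable names are arbitrary, and ISO 8859-1 (`latin-1`) stands in here for an encoding that keeps the ISO 2022 single-byte form of the C1 controls.

```python
# U+0085 (NEL) is a C1 control; U+000A (LF) is a C0 control.
nel = "\u0085"
lf = "\n"

samples = {
    "NEL in ISO 8859-1": nel.encode("latin-1"),  # one CR-area byte
    "NEL in UTF-8": nel.encode("utf-8"),         # two bytes, not a single C1 byte
    "LF in UTF-16BE": lf.encode("utf-16-be"),    # control code widened to 2 bytes
    "LF in UTF-32BE": lf.encode("utf-32-be"),    # control code widened to 4 bytes
}

for label, data in samples.items():
    print(f"{label:>18}: {data.hex(' ')}")
```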

ISO 2022 escape sequences do, however, exist for switching to and from UTF-8 as a "coding system different from that of ISO 2022",[13] which are supported by certain terminal emulators such as xterm.[14]

Overview

Elements

ISO/IEC 2022 specifies the following:

  • An infrastructure of multiple character sets with particular structures which may be included in a single character encoding system, including multiple graphical character sets and multiple sets of both primary (C0) and secondary (C1) control codes,[15]
  • A format for encoding these sets, assuming that 8 bits are available per byte,[16]
  • A format for encoding these sets in the same encoding system when only 7 bits are available per byte,[17] and a method for transforming any conformant character data to pass through such a 7-bit environment,[8]
  • The general structure of ANSI escape codes,[6] and
  • Specific escape code formats for identifying individual character sets,[7] for announcing the use of particular encoding features or subsets,[18] and for interacting with or switching to other encoding systems.[18]

Code versions

A specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation. Although many of the mechanisms defined by the ISO/IEC 2022 standard are infrequently used, several established encodings are based on a subset of the ISO/IEC 2022 system.[19] In particular, 7-bit encoding systems using ISO/IEC 2022 mechanisms include ISO-2022-JP (or JIS encoding), which has primarily been used in Japanese-language e-mail. 8-bit encoding systems conforming to ISO/IEC 2022 include ISO/IEC 4873 (ECMA-43), which is in turn conformed to by ISO/IEC 8859,[9][10] and Extended Unix Code, which is used for East Asian languages.[11] More specialised applications of ISO 2022 include the MARC-8 encoding system used in MARC 21 library records.[3]

Designation escape sequences

The escape sequences for switching to particular character sets or encodings are registered with the ISO-IR registry (except for those set apart for private use, the meanings of which are defined by vendors, or by protocol specifications such as ARIB STD-B24) and follow the patterns defined within the standard. Character encodings making use of these escape sequences require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences.

Specific profiles such as ISO-2022-JP may impose extra conditions, such as that the current character set is reset to US-ASCII before the end of a line. Furthermore, the escape sequences declaring the national character sets may be absent if a specific ISO-2022-based encoding permits or requires this, and dictates that particular national character sets are to be used. For example, ISO-8859-1 states that no defining escape sequence is needed.
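As an illustration of such a profile, the sketch below uses the `iso2022_jp` codec from Python's standard library; the behaviour shown is that of this particular codec, not a statement about every ISO-2022-JP implementation. The encoded text switches to JIS X 0208 with ESC $ B, stays within 7-bit bytes, and returns to ASCII with ESC ( B at the end.

```python
text = "Aあ"

# Encode with Python's built-in ISO-2022-JP codec.
encoded = text.encode("iso2022_jp")
print(encoded)

# Every byte fits in 7 bits, and the stream ends back in ASCII.
seven_bit = all(b < 0x80 for b in encoded)
ends_in_ascii = encoded.endswith(b"\x1b(B")
print(seven_bit, ends_in_ascii)

# Decoding must process the escape sequences in order to recover the text.
decoded = encoded.decode("iso2022_jp")
```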

Multi-byte characters

To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that a seven-bit character representation will normally be able to represent 94 graphic (printable) characters (in addition to space and 33 control characters); if only the C0 control codes (narrowly defined) are excluded, this can be expanded to 96 characters. Using two bytes, it is thus possible to represent up to 8,836 (94×94) characters; and, using three bytes, up to 830,584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes (although EUC-TW's unregistered G2 does, as does the similarly unregistered CCCII).

For the two-byte character sets, the code point of each character is normally specified in so-called row-cell or kuten[a] form, which comprises two numbers between 1 and 94 inclusive, specifying a row[b] and cell[c] of that character within the zone. For a three-byte set, an additional plane[d] number is included at the beginning.[20] The escape sequences declare not only which character set is being used, but also whether the set is single-byte or multi-byte (although not how many bytes it uses if it is multi-byte), and whether each byte has 94 or 96 permitted values.
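A minimal sketch of the row-cell arithmetic follows. The function name `kuten_to_bytes` is hypothetical, and the offsets assume the conventional mapping of positions 1 to 94 onto bytes 0x21–0x7E when the set is invoked in GL, or 0xA1–0xFE when it is invoked in GR.

```python
def kuten_to_bytes(row: int, cell: int, gr: bool = False) -> bytes:
    """Convert a row-cell (kuten) pair of a 94x94 set to its two-byte form."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be between 1 and 94")
    base = 0xA0 if gr else 0x20  # GR form sets the high bit of each byte
    return bytes([base + row, base + cell])


# Capacity of two-byte and three-byte 94-sets, as stated in the text.
two_byte_capacity = 94 ** 2
three_byte_capacity = 94 ** 3

# Row 4, cell 2 of JIS X 0208 is the hiragana letter A.
print(kuten_to_bytes(4, 2).hex(" "))
print(kuten_to_bytes(4, 2, gr=True).hex(" "))
```

The GR form of a JIS X 0208 kuten pair is the byte pair used by EUC-JP, which gives a convenient cross-check against an existing codec.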

Code structure

Notation and nomenclature

ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated"[21] into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked"[22] to interpret bytes in the stream.

Encoding byte values ("bit combinations") are often given in column-line notation, where two decimal numbers in the range 00–15 (each corresponding to a single hexadecimal digit) are separated by a slash.[23] Hence, for instance, codes 2/0 (0x20) through 2/15 (0x2F) inclusive may be referred to as "column 02". This is the notation used in the ISO/IEC 2022 / ECMA-35 standard itself.[24] They may be described elsewhere using hexadecimal, as is often used in this article, or using the corresponding ASCII characters,[25] although the escape sequences are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
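Converting between column-line notation and byte values is a matter of splitting a byte into its two hexadecimal digits. The helper names below are illustrative only, and the output format (no leading zeros) is an assumption modelled on forms such as 2/15.

```python
def to_column_line(byte: int) -> str:
    """Format a byte value in column/line notation, e.g. 0x1B -> '1/11'."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    return f"{byte >> 4}/{byte & 0x0F}"


def from_column_line(text: str) -> int:
    """Parse column/line notation, e.g. '2/5' -> 0x25."""
    column, line = (int(part) for part in text.split("/"))
    if not (0 <= column <= 15 and 0 <= line <= 15):
        raise ValueError("column and line must be between 0 and 15")
    return (column << 4) | line


# The intermediate and final bytes "2/5 4/0" correspond to the characters "%@".
sequence = bytes(from_column_line(item) for item in "2/5 4/0".split())
print(sequence)
```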

Byte values from the 7-bit ASCII graphic range (hexadecimal 0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes (with "GL" standing for "graphics left") while bytes from the "high ASCII" range (0xA0–0xFF), if available (i.e. in an 8-bit environment), are referred to as the "GR" codes ("graphics right").[5] The terms "CL" (0x00–0x1F) and "CR" (0x80–0x9F) are defined for the control ranges, but the CL range always invokes the primary (C0) controls, whereas the CR range always either invokes the secondary (C1) controls or is unused.[5]
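The four areas can be expressed as a simple range check. This is a sketch using the ranges given above; the function name `code_area` is hypothetical.

```python
def code_area(byte: int) -> str:
    """Return the code-table area (CL, GL, CR or GR) that a byte belongs to."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    if byte <= 0x1F:
        return "CL"  # primary (C0) controls
    if byte <= 0x7F:
        return "GL"  # graphics left
    if byte <= 0x9F:
        return "CR"  # secondary (C1) controls, 8-bit environments only
    return "GR"      # graphics right, 8-bit environments only


print([code_area(b) for b in (0x1B, 0x41, 0x8E, 0xC1)])
```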

Fixed coded characters

The delete character DEL (0x7F), the escape character ESC (0x1B) and the space character SP (0x20) are designated "fixed" coded characters[26] and are always available when G0 is invoked over GL, irrespective of what character sets are designated. They may not be included in graphical character sets, although other sizes or types of whitespace character may be.[27]

General syntax of escape sequences

Sequences using the ESC (escape) character take the form ESC [I...] F, where the ESC character is followed by zero or more intermediate bytes[28] (I) from the range 0x20–0x2F, and one final byte[29] (F) from the range 0x30–0x7E.[30]

The first I byte, or absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function. In all types of escape sequences, F bytes in the range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties.[31]
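The ESC [I...] F syntax lends itself to a small scanner. The following is a sketch under the byte ranges stated above; the names `parse_escape` and `is_private_use` are hypothetical and not taken from the standard.

```python
ESC = 0x1B


def parse_escape(data: bytes, start: int = 0):
    """Parse ESC [I...] F at data[start].

    Returns (intermediate_bytes, final_byte, index_after_sequence).
    """
    if start >= len(data) or data[start] != ESC:
        raise ValueError("sequence does not start with ESC")
    i = start + 1
    # Zero or more intermediate bytes from 0x20-0x2F.
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    # Exactly one final byte from 0x30-0x7E.
    if i >= len(data) or not 0x30 <= data[i] <= 0x7E:
        raise ValueError("missing or invalid final byte")
    return data[start + 1:i], data[i], i + 1


def is_private_use(final_byte: int) -> bool:
    """Final bytes 0x30-0x3F are reserved for private use."""
    return 0x30 <= final_byte <= 0x3F


print(parse_escape(b"\x1b$)A"))
print(is_private_use(parse_escape(b"\x1b#6")[1]))
```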

Control functions from some sets may make use of further bytes following the escape sequence proper. For example, the ISO 6429 control function "Control Sequence Introducer", which can be represented using an escape sequence, is followed by zero or more bytes in the range 0x30–0x3F, then zero or more bytes in the range 0x20–0x2F, then by a single byte in the range 0x40–0x7E, the entire sequence being called a "control sequence".[32]
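A control sequence can be split into its parts in the same way. The sketch below assumes the usual representations of the Control Sequence Introducer, either the escape sequence ESC [ or the single C1 byte 0x9B; the function name is hypothetical.

```python
def parse_control_sequence(data: bytes):
    """Split a control sequence into (parameters, intermediates, final, length)."""
    if data[:2] == b"\x1b[":    # CSI as a 7-bit escape sequence
        i = 2
    elif data[:1] == b"\x9b":   # CSI as a single C1 byte
        i = 1
    else:
        raise ValueError("sequence does not start with CSI")
    first_param = i
    while i < len(data) and 0x30 <= data[i] <= 0x3F:   # parameter bytes
        i += 1
    params = data[first_param:i]
    first_inter = i
    while i < len(data) and 0x20 <= data[i] <= 0x2F:   # intermediate bytes
        i += 1
    intermediates = data[first_inter:i]
    if i >= len(data) or not 0x40 <= data[i] <= 0x7E:  # final byte
        raise ValueError("missing or invalid final byte")
    return params, intermediates, data[i], i + 1


print(parse_control_sequence(b"\x1b[1;31m"))
```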

Graphical character sets

Each of the four working sets G0 through G3 may be a 94-character set or a 94^n-character multi-byte set. Additionally, G1 through G3 may be a 96- or 96^n-character set.

In a 96- or 96^n-character set, the bytes 0x20 through 0x7F when GL-invoked, or 0xA0 through 0xFF when GR-invoked, are allocated to and may be used by the set. In a 94- or 94^n-character set, the bytes 0x20 and 0x7F are not used.[33] When a 96- or 96^n-character set is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available until a 94- or 94^n-character set (such as the G0 set) is invoked in GL.[5] 96-character sets cannot be designated to G0.
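The byte ranges available to each kind of set can be summarised in a few lines. This is an illustrative sketch; the function name `set_byte_range` is hypothetical, and the GR range for 94-sets (0xA1–0xFE) is inferred by analogy with the GL case described above.

```python
def set_byte_range(size: int, area: str) -> range:
    """Return the byte values usable by a 94- or 96-set invoked in GL or GR."""
    base = {"GL": 0x20, "GR": 0xA0}[area]
    if size == 96:
        return range(base, base + 96)      # 0x20-0x7F or 0xA0-0xFF
    if size == 94:
        return range(base + 1, base + 95)  # first and last positions unused
    raise ValueError("size must be 94 or 96")


for size in (94, 96):
    for area in ("GL", "GR"):
        r = set_byte_range(size, area)
        print(size, area, hex(r[0]), hex(r[-1]))
```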

Registration of a set as a 96-character set does not necessarily mean that the 0x20/A0 and 0x7F/FF bytes are actually assigned by the set; some examples of graphical character sets which are registered as 96-sets but do not use those bytes include the G1 set of I.S. 434,[34] the box drawing set from ISO/IEC 10367,[35] and ISO-IR-164 (a subset of the G1 set of ISO-8859-8 with only the letters, used by CCITT).[36]

Combining characters

Characters are expected to be spacing characters, not combining characters, unless specified otherwise by the graphical set in question.[37] ISO 2022 / ECMA-35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters, as well as the CSI sequence "Graphic Character Combination" (GCC)[37] (CSI 0x20 (SP) 0x5F (_)).[38]

Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43[39] and by ISO/IEC 8859,[40][41] on the basis that it leaves the graphical character repertoire undefined. ISO/IEC 4873 / ECMA-43 does, however, permit the use of the GCC function provided that the sequence of characters is kept the same and merely displayed in one space, rather than being over-stamped to form a character with a different meaning.[42]

Control character sets

Control character sets are classified as "primary" or "secondary" control code sets,[43] respectively also called "C0" and "C1" control code sets.[44]

A C0 control set must contain the ESC (escape) control character at 0x1B[45] (a C0 set containing only ESC is registered as ISO-IR-104),[46] whereas a C1 control set may not contain the escape control whatsoever.[33] Hence, they are entirely separate registrations, with a C0 set being only a C0 set and a C1 set being only a C1 set.[44]

If codes from the C0 set of ISO 6429 / ECMA-48, i.e. the ASCII control codes, appear in the C0 set, they are required to appear at their ISO 6429 / ECMA-48 locations.[45] Inclusion of transmission control characters in the C0 set, besides the ten included by ISO 6429 / ECMA-48 (namely SOH, STX, ETX, EOT, ENQ, ACK, DLE, NAK, SYN and ETB),[47] or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard.[45][33]

A C0 control set is invoked over the CL range 0x00 through 0x1F,[48] whereas a C1 control function may be invoked over the CR range 0x80 through 0x9F (in an 8-bit environment) or by using escape sequences (in a 7-bit or 8-bit environment),[43] but not both. Which style of C1 invocation is used must be specified in the definition of the code version.[49] For example, ISO/IEC 4873 specifies CR bytes for the C1 controls which it uses (SS2 and SS3).[50] If necessary, which invocation is used may be communicated using announcer sequences.

In the latter case, single control functions from the C1 control code set are invoked using "type Fe" escape sequences,[33] meaning those where the ESC control character is followed by a byte from columns 04 or 05 (that is to say, ESC 0x40 (@) through ESC 0x5F (_)).[51]
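The two representations of a C1 control differ by a fixed offset of 0x40: the CR-area bytes 0x80–0x9F correspond to the final bytes 0x40–0x5F of the type Fe escape sequences. A sketch of the conversion, with hypothetical function names:

```python
ESC = 0x1B


def c1_to_escape(c1_byte: int) -> bytes:
    """Convert an 8-bit C1 byte (0x80-0x9F) to its 7-bit ESC Fe form."""
    if not 0x80 <= c1_byte <= 0x9F:
        raise ValueError("not a C1 byte")
    return bytes([ESC, c1_byte - 0x40])


def escape_to_c1(sequence: bytes) -> int:
    """Convert an ESC Fe sequence (ESC 0x40-0x5F) to its 8-bit C1 byte."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("not a type Fe escape sequence")
    return sequence[1] + 0x40


print(c1_to_escape(0x8E))          # SS2 as an escape sequence
print(hex(escape_to_c1(b"\x1bO")))  # SS3 as a CR-area byte
```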

Other control functions

Additional control functions are assigned to "type Fs" escape sequences (in the range ESC 0x60 (`) through ESC 0x7E (~)); these have permanently assigned meanings rather than depending on the C0 or C1 designations.[51][52] Registration of control functions to type "Fs" sequences must be approved by ISO/IEC JTC 1/SC 2.[52] Other single control functions may be registered to type "3Ft" escape sequences (in the range ESC 0x23 (#) [I...] 0x40 (@) through ESC 0x23 (#) [I...] 0x7E (~)),[53] although no "3Ft" sequences are currently assigned (as of 2019).[54] Some of these are specified in ECMA-35 (ISO 2022 / ANSI X3.41), others in ECMA-48 (ISO 6429 / ANSI X3.64).[55] ECMA-48 refers to these as "independent control functions".[56]

Code   Hex    Abbr.  Name                       Effect[54]
ESC `  1B 60  DMI    Disable manual input       Disables some or all of the manual input facilities of the device.
ESC a  1B 61  INT    Interrupt                  Interrupts the current process.
ESC b  1B 62  EMI    Enable manual input        Enables the manual input facilities of the device.
ESC c  1B 63  RIS    Reset to initial state     Resets the device to its state after being powered on.[57]
ESC d  1B 64  CMD    Coding method delimiter    Used when interacting with an outer coding / representation system, see below.
ESC n  1B 6E  LS2    Locking shift two          Shift function, see below.
ESC o  1B 6F  LS3    Locking shift three        Shift function, see below.
ESC |  1B 7C  LS3R   Locking shift three right  Shift function, see below.
ESC }  1B 7D  LS2R   Locking shift two right    Shift function, see below.
ESC ~  1B 7E  LS1R   Locking shift one right    Shift function, see below.

Escape sequences of type "Fp" (ESC 0x30 (0) through ESC 0x3F (?)) or of type "3Fp" (ESC 0x23 (#) [I...] 0x30 (0) through ESC 0x23 (#) [I...] 0x3F (?)) are reserved for single private use control codes, by prior agreement between parties.[58] Several such sequences of both types are used by DEC terminals such as the VT100, and are thus supported by terminal emulators.[14]

Shift functions

By default, GL codes specify G0 characters and GR codes (where available) specify G1 characters; this may be otherwise specified by prior agreement. The set invoked over each area may also be modified with control codes referred to as shifts, as shown in the table below.[59]

An 8-bit code may have GR codes specifying G1 characters, i.e. with its corresponding 7-bit code using Shift In and Shift Out to switch between the sets (e.g. JIS X 0201),[60] although some instead have GR codes specifying G2 characters, with the corresponding 7-bit code using a single-shift code to access the second set (e.g. T.51).[61]

The codes shown in the table below are the most common encodings of these control codes, conforming to ISO/IEC 6429. The LS2, LS3, LS1R, LS2R and LS3R shifts are registered as single control functions and are always encoded as the escape sequences listed below,[54] whereas the others are part of a C0 or C1 control code set (as shown below, SI (LS0) and SO (LS1) are C0 controls and SS2 and SS3 are C1 controls), meaning that their coding and availability may vary depending on which control sets are designated: they must be present in the designated control sets if their functionality is used.[48][49] The C1 controls themselves, as mentioned above, may be represented using escape sequences or 8-bit bytes, but not both.

Alternative encodings of the single-shifts as C0 control codes are available in certain control code sets. For example, SS2 and SS3 are usually available at 0x19 and 0x1D respectively in T.51[61] and T.61.[62] This coding is currently recommended by ISO/IEC 2022 / ECMA-35 for applications requiring 7-bit single-byte representations of SS2 and SS3,[63] and may also be used for SS2 only,[64] although older code sets with SS2 at 0x1C also exist,[65][66][67] and were mentioned as such in an earlier edition of the standard.[68] The 0x8E and 0x8F coding of the single shifts as shown below is mandatory for ISO/IEC 4873 levels 2 and 3.[69]

Code                              Hex                              Abbr.    Name                          Effect
SI                                0F                               SI, LS0  Shift In, Locking shift zero  GL encodes G0 from now on[70][71]
SO                                0E                               SO, LS1  Shift Out, Locking shift one  GL encodes G1 from now on[70][71]
ESC n                             1B 6E                            LS2      Locking shift two             GL encodes G2 from now on[70][71]
ESC o                             1B 6F                            LS3      Locking shift three           GL encodes G3 from now on[70][71]
CR area: SS2; Escape code: ESC N  CR area: 8E; Escape code: 1B 4E  SS2      Single shift two              GL or GR (see below) encodes G2 for the immediately following character only[72]
CR area: SS3; Escape code: ESC O  CR area: 8F; Escape code: 1B 4F  SS3      Single shift three            GL or GR (see below) encodes G3 for the immediately following character only[72]
ESC ~                             1B 7E                            LS1R     Locking shift one right       GR encodes G1 from now on[73]
ESC }                             1B 7D                            LS2R     Locking shift two right       GR encodes G2 from now on[73]
ESC |                             1B 7C                            LS3R     Locking shift three right     GR encodes G3 from now on[73]

Although officially considered shift codes and named accordingly, single-shift codes are not always viewed as shifts,[12] and they may simply be viewed as prefix bytes (i.e. the first bytes in a multi-byte sequence),[11] since they do not require the encoder to keep the currently active set as state, unlike locking shift codes. In 8-bit environments, either GL or GR, but not both, may be used as the single-shift area. This must be specified in the definition of the code version.[72] For instance, ISO/IEC 4873 specifies GL, whereas packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area.[74][75] If necessary, which single-shift area is used may be communicated using announcer sequences.

The names "locking shift zero" (LS0) and "locking shift one" (LS1) refer to the same pair of C0 control characters (0x0F and 0x0E) as the names "shift in" (SI) and "shift out" (SO). However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments.[59]

The ISO/IEC 2022 / ECMA-35 standard permits, but discourages, invoking G1, G2 or G3 in both GL and GR simultaneously.[76]
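The difference between locking and single shifts can be illustrated with a small state machine that labels each graphic byte with the working set it is interpreted in. This is a deliberately simplified sketch, not a conforming decoder: it assumes an 8-bit environment with SS2 and SS3 coded as the CR-area bytes 0x8E and 0x8F, accepts the single-shifted byte in either GL or GR, treats every byte as a single-byte character, and ignores all escape sequences other than the locking shifts. The function name `tag_working_sets` is hypothetical.

```python
ESC, SI, SO, SS2, SS3 = 0x1B, 0x0F, 0x0E, 0x8E, 0x8F

# Final byte of ESC F locking shifts -> (area affected, working set invoked).
LOCKING_ESCAPES = {
    ord("n"): ("GL", 2), ord("o"): ("GL", 3),
    ord("~"): ("GR", 1), ord("}"): ("GR", 2), ord("|"): ("GR", 3),
}


def tag_working_sets(data: bytes):
    """Return a list of (working_set, 7-bit value) for each graphic byte."""
    invoked = {"GL": 0, "GR": 1}  # default: G0 in GL, G1 in GR
    single = None                 # pending single shift, if any
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SI:
            invoked["GL"] = 0
        elif b == SO:
            invoked["GL"] = 1
        elif b == SS2:
            single = 2
        elif b == SS3:
            single = 3
        elif b == ESC and i + 1 < len(data) and data[i + 1] in LOCKING_ESCAPES:
            area, working_set = LOCKING_ESCAPES[data[i + 1]]
            invoked[area] = working_set
            i += 1  # consume the final byte of the escape sequence
        elif 0x20 <= b <= 0x7F or b >= 0xA0:
            area = "GL" if b < 0x80 else "GR"
            working_set = single if single is not None else invoked[area]
            single = None  # a single shift affects one character only
            result.append((f"G{working_set}", b & 0x7F))
        # Any other control byte is ignored in this sketch.
        i += 1
    return result


print(tag_working_sets(b"A\x0eB\x0fC"))
print(tag_working_sets(b"\x8e\xb1\xc1"))
```

Note that the locking shifts change `invoked` permanently, whereas the single shifts only set a flag that is cleared after one character, which is why single shifts can also be viewed as prefix bytes.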

Registration of graphical and control code sets

The ISO International register of coded character sets to be used with escape sequences (ISO-IR) lists graphical character sets, control code sets, single control codes and so forth which have been registered for use with ISO/IEC 2022. The procedure for registering codes and sets with the ISO-IR registry is specified by ISO/IEC 2375. Each registration receives a unique escape sequence, and a unique registry entry number to identify it.[77][78] For example, the CCITT character set for Simplified Chinese is known as ISO-IR-165.

Registration of coded character sets with the ISO-IR registry identifies the documents specifying the character set or control function associated with an ISO/IEC 2022 non‑private-use escape sequence. This may be a standard document; however, registration does not create a new ISO standard, does not commit the ISO or IEC to adopt it as an international standard, and does not commit the ISO or IEC to add any of its characters to the Universal Coded Character Set.[79]

ISO-IR registered escape sequences are also used encapsulated in a Formal Public Identifier to identify character sets used for numeric character references in SGML (ISO 8879). For example, the string ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0 can be used to identify the International Reference Version of ISO 646-1983,[80] and the HTML 4.01 specification uses ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6 to identify Unicode.[81] The textual representation of the escape sequence, included in the third element of the FPI, will be recognised by SGML implementations for supported character sets.[80]

Character set designations

Escape sequences to designate character sets take the form ESC I [I...] F. As mentioned above, the intermediate (I) bytes are from the range 0x20–0x2F, and the final (F) byte is from the range 0x30–0x7E. The first I byte (or, for a multi-byte set, the first two) identifies the type of character set and the working set it is to be designated to, whereas the F byte (and any additional I bytes) identify the character set itself, as assigned in the ISO-IR register (or, for the private-use escape sequences, by prior agreement).

Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned.[82] At the other extreme, no multibyte 96-sets have been registered, so the sequences below are strictly theoretical.

As with other escape sequence types, the range 0x30–0x3F is reserved for private-use F bytes,[31] in this case for private-use character set definitions (which might include unregistered sets defined by protocols such as ARIB STD-B24[83] or MARC-8,[3] or vendor-specific sets such as DEC Special Graphics).[84] However, in a graphical set designation sequence, if the second I byte (for a single-byte set) or the third I byte (for a double-byte set) is 0x20 (space), the set denoted is a "dynamically redefinable character set" (DRCS) defined by prior agreement,[85] which is also considered private use.[31] A graphical set being considered a DRCS implies that it represents a font of exact glyphs, rather than a set of abstract characters.[86] The manner in which DRCS sets and associated fonts are transmitted, allocated and managed is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with F byte 0x40 (@);[87] however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as World System Teletext.[88]

There are also three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered when the contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted in place of the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set.[89]

There are additional (rarely used) features for switching control character sets, but this is a single-level lookup, in that (as noted above) the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences (as opposed to the graphical set ones) may also be used from within ISO/IEC 10646 (UCS/Unicode), in contexts where processing ANSI escape codes is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding.[90]

A table of escape sequence I bytes and the designation or other function which they perform is below.[91]

Code                   Hex                     Abbr.  Name                           Effect                                                            Example
ESC SP F               1B 20 F                 ACS    Announce code structure        Specifies code features used, e.g. working sets (see below).[92]  ESC SP L (ISO 4873 level 1)
ESC ! F                1B 21 F                 CZD    C0-designate                   F selects a C0 control character set to be used.[93]              ESC ! @ (ASCII C0 codes)
ESC " F                1B 22 F                 C1D    C1-designate                   F selects a C1 control character set to be used.[94]              ESC " C (ISO 6429 C1 codes)
ESC # F                1B 23 F                 -      (Single control function)      (Reserved for sequences for control functions, see above.)        ESC # 6 (private use: DEC Double Width Line)[95]
ESC $ F[e], ESC $ ( F  1B 24 F[e], 1B 24 28 F  GZDM4  G0-designate multibyte 94-set  F selects a 94^n-character set to be used for G0.[89]             ESC $ ( C (KS X 1001 in G0)
ESC $ ) F              1B 24 29 F              G1DM4  G1-designate multibyte 94-set  F selects a 94^n-character set to be used for G1.[89]             ESC $ ) A (GB 2312 in G1)
ESC $ * F              1B 24 2A F              G2DM4  G2-designate multibyte 94-set  F selects a 94^n-character set to be used for G2.[89]             ESC $ * B (JIS X 0208 in G2)
ESC $ + F              1B 24 2B F              G3DM4  G3-designate multibyte 94-set  F selects a 94^n-character set to be used for G3.[89]             ESC $ + D (JIS X 0212 in G3)
ESC $ , F              1B 24 2C F              -      (not used)                     (not used)[f]                                                     -
ESC $ - F              1B 24 2D F              G1DM6  G1-designate multibyte 96-set  F selects a 96^n-character set to be used for G1.[89]             ESC $ - 1 (private use)
ESC $ . F              1B 24 2E F              G2DM6  G2-designate multibyte 96-set  F selects a 96^n-character set to be used for G2.[89]             ESC $ . 2 (private use)
ESC $ / F              1B 24 2F F              G3DM6  G3-designate multibyte 96-set  F selects a 96^n-character set to be used for G3.[89]             ESC $ / 3 (private use)
ESC % F                1B 25 F                 DOCS   Designate other coding system  Switches coding system, see below.                                ESC % G (UTF-8)
ESC & F                1B 26 F                 IRR    Identify revised registration  Prefixes designation escape to denote revision.[g]                ESC & @ ESC $ B (JIS X 0208:1990 in G0)
ESC ' F                1B 27 F                 -      (not used)                     (not used)                                                        -
ESC ( F                1B 28 F                 GZD4   G0-designate 94-set            F selects a 94-character set to be used for G0.[89]               ESC ( B (ASCII in G0)
ESC ) F                1B 29 F                 G1D4   G1-designate 94-set            F selects a 94-character set to be used for G1.[89]               ESC ) I (JIS X 0201 Kana in G1)
ESC * F                1B 2A F                 G2D4   G2-designate 94-set            F selects a 94-character set to be used for G2.[89]               ESC * v (ITU T.61 RHS in G2)
ESC + F                1B 2B F                 G3D4   G3-designate 94-set            F selects a 94-character set to be used for G3.[89]               ESC + D (NATS-SEFI-ADD in G3)
ESC , F                1B 2C F                 -      (not used)                     (not used)[h]                                                     -
ESC - F                1B 2D F                 G1D6   G1-designate 96-set            F selects a 96-character set to be used for G1.[89]               ESC - A (ISO 8859-1 RHS in G1)
ESC . F                1B 2E F                 G2D6   G2-designate 96-set            F selects a 96-character set to be used for G2.[89]               ESC . B (ISO 8859-2 RHS in G2)
ESC / F                1B 2F F                 G3D6   G3-designate 96-set            F selects a 96-character set to be used for G3.[89]               ESC / b (ISO 8859-15 RHS in G3)

Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94^n-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.)
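Because the meaning of the final byte depends on the intermediate bytes, a parser for graphical set designations has to classify the intermediates first. The sketch below covers only the graphical designations (not ACS, CZD, C1D, DOCS or IRR); the function name and the shape of the returned tuple are assumptions made for illustration.

```python
ESC = 0x1B
G_FOR_94 = {0x28: 0, 0x29: 1, 0x2A: 2, 0x2B: 3}  # ( ) * +
G_FOR_96 = {0x2D: 1, 0x2E: 2, 0x2F: 3}           # - . /


def parse_designation(seq: bytes):
    """Return (working_set, set_size, is_multibyte, identifier_bytes)."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    multibyte = body[0] == 0x24  # '$' marks a multi-byte set
    if multibyte:
        body = body[1:]
        # Legacy forms ESC $ @, ESC $ A and ESC $ B designate to G0.
        if body in (b"@", b"A", b"B"):
            return ("G0", 94, True, body)
    if len(body) < 2:
        raise ValueError("sequence too short")
    selector, ident = body[0], body[1:]
    # The identifier is any further intermediate bytes plus one final byte.
    if not all(0x20 <= b <= 0x2F for b in ident[:-1]) or not 0x30 <= ident[-1] <= 0x7E:
        raise ValueError("invalid identifier bytes")
    if selector in G_FOR_94:
        return (f"G{G_FOR_94[selector]}", 94, multibyte, ident)
    if selector in G_FOR_96:
        return (f"G{G_FOR_96[selector]}", 96, multibyte, ident)
    raise ValueError("not a graphical set designation")


print(parse_designation(b"\x1b(B"))   # ASCII in G0
print(parse_designation(b"\x1b$)A"))  # GB 2312 in G1
print(parse_designation(b"\x1b-A"))   # ISO 8859-1 right-hand side in G1
```

The identifier `b"A"` appears in both the second and third results but names unrelated sets, since one is registered as a multi-byte 94-set and the other as a 96-set.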

Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).

Interaction with other coding systems

The standard also defines a way to specify coding systems that do not follow its own structure.

A sequence is also defined for returning to ISO/IEC 2022; the registrations which support this sequence as encoded in ISO/IEC 2022 comprise (as of 2019) various Videotex formats, UTF-8, and UTF-1.[99] A second I byte of 0x2F (/) is included in the designation sequences of codes which do not use that byte sequence to return to ISO 2022; they may have their own means to return to ISO 2022 (such as a different or padded sequence) or none at all.[100] All existing registrations of the latter type (as of 2019) are either transparent raw data, Unicode/UCS formats, or subsets thereof.[101]

Code       Hex         Abbr.  Name                                                            Effect
ESC % @    1B 25 40    DOCS   Designate other coding system ("standard return")               Return to ISO/IEC 2022 from another encoding.[100]
ESC % F    1B 25 F     DOCS   Designate other coding system ("with standard return")[99]      F selects an 8-bit code; use ESC % @ to return.[100]
ESC % / F  1B 25 2F F  DOCS   Designate other coding system ("without standard return")[101]  F selects an 8-bit code; there is no standard way to return.[100]
ESC d      1B 64       CMD    Coding method delimiter                                         Denotes the end of an ISO/IEC 2022 coded sequence.[102]

Of particular interest are the sequences which switch to ISO/IEC 10646 (Unicode) formats which do not follow the ISO/IEC 2022 structure. These include UTF-8 (which does not reserve the range 0x80–0x9F for control characters), its predecessor UTF-1 (which mixes GR and GL bytes in multi-byte codes), and UTF-16 and UTF-32 (which use wider coding units).[99][101]

Several codes were also registered for subsets (levels 1 and 2) of UTF-8, UTF-16 and UTF-32, as well as for three levels of UCS-2.[101] However, the only codes currently specified by ISO/IEC 10646 are the level-3 codes for UTF-8, UTF-16 and UTF-32 and the unspecified-level code for UTF-8, with the rest being listed as deprecated.[103] ISO/IEC 10646 stipulates that the big-endian formats of UTF-16 and UTF-32 are designated by their escape sequences.[104]

Unicode Format | Code(s) | Hex[103] | Deprecated codes | Deprecated hex[99][101][103]
UTF-1 | (UTF-1 not in current ISO/IEC 10646.) | | ESC % B | 1B 25 42
UTF-8 | ESC % G, ESC % / I | 1B 25 47,[13] 1B 25 2F 49[105] | ESC % / G, ESC % / H | 1B 25 2F 47, 1B 25 2F 48
UTF-16 | ESC % / L | 1B 25 2F 4C[106] | ESC % / @, ESC % / C, ESC % / E, ESC % / J, ESC % / K | 1B 25 2F 40, 1B 25 2F 43, 1B 25 2F 45, 1B 25 2F 4A, 1B 25 2F 4B
UTF-32 | ESC % / F | 1B 25 2F 46 | ESC % / A, ESC % / D | 1B 25 2F 41, 1B 25 2F 44

Of the sequences switching to UTF-8, ESC % G is the one supported by, for example, xterm.[14]

Although a variant of the standard return sequence may be used from UTF-16 and UTF-32, the bytes of the escape sequence must be padded to the size of the encoding's code unit (e.g. 001B 0025 0040 for UTF-16), so the coding of the standard return sequence does not conform exactly to ISO/IEC 2022. For this reason, the designations for UTF-16 and UTF-32 use the without-standard-return syntax.[107]
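
The padding can be illustrated by encoding the three characters of the standard return sequence with Python's built-in big-endian Unicode codecs. This is only a sketch of the byte layout; the helper name `padded_return` is an assumption made for the example.

```python
STANDARD_RETURN = "\x1b%@"  # ESC % @ as three code points

def padded_return(codec: str) -> bytes:
    """Encode ESC % @ so that each byte is padded to one code unit."""
    return STANDARD_RETURN.encode(codec)

print(padded_return("utf-16-be").hex(" "))  # 00 1b 00 25 00 40
print(padded_return("utf-32-be").hex(" "))  # 00 00 00 1b 00 00 00 25 00 00 00 40
```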

For specifying encodings by labels, the X Consortium's Compound Text format defines five private-use DOCS sequences.[108]

Code structure announcements

The sequence "announce code structure" (ESC SP (0x20) F) is used to announce a specific code structure, or a specific group of ISO 2022 facilities which are used in a particular code version. Although announcements can be combined, certain contradictory combinations (specifically, using locking shift announcements 16–23 with announcements 1, 3 and 4) are prohibited by the standard, as is using additional announcements on top of ISO/IEC 4873 level announcements 12–14[92] (which fully specify the permissible structural features). Announcement sequences are as follows:

Number Code Hex Code version feature announced[92]
1 ESC SP A 1B 20 41 G0 in GL, GR absent or unused, no locking shifts.
2 ESC SP B 1B 20 42 G0 and G1 invoked to GL by locking shifts, GR absent or unused.
3 ESC SP C 1B 20 43 G0 in GL, G1 in GR, no locking shifts, requires an 8-bit environment.
4 ESC SP D 1B 20 44 G0 in GL, G1 in GR if 8-bit, no locking shifts unless in a 7-bit environment.
5 ESC SP E 1B 20 45 Shift functions preserved during 7-bit/8-bit conversion.
6 ESC SP F 1B 20 46 C1 controls using escape sequences.
7 ESC SP G 1B 20 47 C1 controls in CR region in 8-bit environments, as escape sequences otherwise.
8 ESC SP H 1B 20 48 94-character graphical sets only.
9 ESC SP I 1B 20 49 94-character and/or 96-character graphical sets.
10 ESC SP J 1B 20 4A Uses a 7-bit code, even if an eighth bit is available for use.
11 ESC SP K 1B 20 4B Requires an 8-bit code.
12 ESC SP L 1B 20 4C Complies with ISO/IEC 4873 (ECMA-43) level 1.
13 ESC SP M 1B 20 4D Complies with ISO/IEC 4873 (ECMA-43) level 2.
14 ESC SP N 1B 20 4E Complies with ISO/IEC 4873 (ECMA-43) level 3.
16 ESC SP P 1B 20 50 SI / LS0 used.
18 ESC SP R 1B 20 52 SO / LS1 used.
19 ESC SP S 1B 20 53 LS1R used in 8-bit environments, SO used in 7-bit environments.
20 ESC SP T 1B 20 54 LS2 used.
21 ESC SP U 1B 20 55 LS2R used in 8-bit environments, LS2 used in 7-bit environments.
22 ESC SP V 1B 20 56 LS3 used.
23 ESC SP W 1B 20 57 LS3R used in 8-bit environments, LS3 used in 7-bit environments.
26 ESC SP Z 1B 20 5A SS2 used.
27 ESC SP [ 1B 20 5B SS3 used.
28 ESC SP \ 1B 20 5C Single-shifts invoke over GR.
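
In the table above, the final byte of each announcement equals 0x40 plus the announcement number. The Python sketch below uses that regularity to build announcer sequences and applies the two prohibitions described in the text. The names `announce` and `combination_allowed` are illustrative assumptions, and the checks cover only the rules stated here, not the full standard.

```python
ESC, SP = 0x1B, 0x20

# Announcement numbers that appear in the table.
DEFINED = set(range(1, 15)) | {16} | set(range(18, 24)) | {26, 27, 28}

def announce(number: int) -> bytes:
    """Return ESC SP F for a listed announcement number."""
    if number not in DEFINED:
        raise ValueError(f"announcement {number} is not listed")
    return bytes([ESC, SP, 0x40 + number])

def combination_allowed(numbers) -> bool:
    """Check a set of announcements against the two stated prohibitions."""
    numbers = set(numbers)
    # Locking-shift announcements 16-23 contradict announcements 1, 3 and 4.
    if numbers & set(range(16, 24)) and numbers & {1, 3, 4}:
        return False
    # ISO/IEC 4873 level announcements 12-14 must stand alone.
    if numbers & {12, 13, 14} and len(numbers) > 1:
        return False
    return True

print(announce(12))                          # b'\x1b L'
print(combination_allowed({3, 26, 27, 28}))  # True
print(combination_allowed({1, 18}))          # False
```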

ISO/IEC 2022 code versions

 
Various ISO 2022 and other CJK encodings supported by Mozilla Firefox as of 2004. (This support has been reduced in later versions to avoid certain cross-site scripting attacks.)

Six 7-bit ISO 2022 code versions (ISO-2022-CN, ISO-2022-CN-EXT, ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2 and ISO-2022-KR) are defined by IETF RFCs, of which ISO-2022-JP and ISO-2022-KR have been extensively used in the past.[109] A number of other variants are defined by vendors, including IBM.[110] Although UTF-8 is the preferred encoding in HTML5, legacy content in ISO-2022-JP remains sufficiently widespread that the WHATWG encoding standard retains support for it,[111] in contrast to mapping ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT[112] entirely to the replacement character,[113] due to concerns about code injection attacks such as cross-site scripting.[111][113]

8-bit code versions include Extended Unix Code.[11][12] The ISO/IEC 8859 encodings also follow ISO 2022, in a subset stipulated in ISO/IEC 4873.[9][10]

Japanese e-mail versions

ISO-2022-JP

ISO-2022-JP is a widely used encoding for Japanese, in particular in e-mail. It was introduced for use on the JUNET network and later codified in IETF RFC 1468, dated 1993.[114] It has an advantage over other encodings for Japanese in that it does not require 8-bit clean transmission. Microsoft calls it Code page 50220.[115] It starts in ASCII and includes the following escape sequences:

  • ESC ( B to switch to ASCII (1 byte per character)
  • ESC ( J to switch to JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
  • ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
  • ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)

Use of the two characters added in JIS X 0208-1990 is permitted, but without including the IRR sequence, i.e. using the same escape sequence as JIS X 0208-1983.[114] Also, because these escapes were registered before multi-byte sets could be designated to anything other than G0, the escapes for JIS X 0208 omit the second I byte, ( (0x28).[89]
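
The effect of these escape sequences can be seen by splitting an encoded stream into runs. The Python sketch below tracks the designated set using only the four escapes listed above; `split_runs` is a hypothetical helper, and it assumes well-formed input (an unknown escape raises `KeyError`). The sample bytes are produced with Python's `iso2022_jp` codec.

```python
# Escape sequence -> (set name, bytes per character)
ESCAPES = {
    b"\x1b(B": ("ASCII", 1),
    b"\x1b(J": ("JIS X 0201 Roman", 1),
    b"\x1b$@": ("JIS X 0208-1978", 2),
    b"\x1b$B": ("JIS X 0208-1983", 2),
}

def split_runs(data: bytes):
    """Split ISO-2022-JP bytes into (set name, [character codes]) runs."""
    runs = []
    name, width = "ASCII", 1  # the stream starts in ASCII
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            name, width = ESCAPES[data[i:i + 3]]  # switch the active set
            i += 3
            continue
        if not runs or runs[-1][0] != name:
            runs.append((name, []))
        runs[-1][1].append(data[i:i + width])
        i += width
    return runs

encoded = "JIS 日本語".encode("iso2022_jp")
print(encoded)
for set_name, codes in split_runs(encoded):
    print(set_name, codes)
```

The same two bytes (for example `F|`) are an ASCII pair in one state and a single kanji in another, which is why the decoder has to carry state across the whole stream.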

The RFC notes that some existing systems did not distinguish ESC ( B from ESC ( J, or did not distinguish ESC $ @ from ESC $ B, but stipulates that the escape sequences should not be changed by systems simply relaying messages such as e-mails.[114] The WHATWG Encoding Standard referenced by HTML5 handles ESC ( B and ESC ( J distinctly, but treats ESC $ @ the same as ESC $ B when decoding, and uses only ESC $ B for JIS X 0208 when encoding.[116] The RFC also notes that some past systems had made erroneous use of the sequence ESC ( H to switch away from JIS X 0208, which is actually registered for ISO-IR-11 (a Swedish variant of ISO 646 and World System Teletext).[114][i]

Versions with halfwidth katakana

Use of ESC ( I to switch to the JIS X 0201-1976 Kana set (1 byte per character) is not part of the ISO-2022-JP profile,[114] but is also sometimes used. Python allows it in a variant which it labels ISO-2022-JP-EXT (which also incorporates JIS X 0212 as described below, completing coverage of EUC-JP);[117][118] this is close in both name and structure to an encoding denoted ISO-2022-JPext by DEC, which furthermore adds a two-byte user-defined region accessed with ESC $ ( 0 to complete the coverage of Super DEC Kanji.[119] The WHATWG/HTML5 variant permits decoding JIS X 0201 katakana in ISO-2022-JP input, but converts the characters to their JIS X 0208 equivalents upon encoding.[116] Microsoft's code page for ISO-2022-JP with JIS X 0201 kana additionally permitted is Code page 50221.[115]
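
As a brief illustration of the Python variant mentioned above, the sketch below encodes two halfwidth katakana with the `iso2022_jp_ext` codec and checks for the ESC ( I designation. It reflects the behaviour expected of CPython's codec and is not a statement about other implementations.

```python
text = "ｶﾅ"  # halfwidth katakana, U+FF76 and U+FF85

encoded = text.encode("iso2022_jp_ext")
print(encoded)

# ESC ( I designates the JIS X 0201 Kana set to G0.
uses_kana_set = b"\x1b(I" in encoded
round_trips = encoded.decode("iso2022_jp_ext") == text
print(uses_kana_set, round_trips)
```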

Other, older variants known as JIS7 and JIS8 build directly on the 7-bit and 8-bit encodings defined by JIS X 0201 and allow use of JIS X 0201 kana from G1 without escape sequences, using Shift Out and Shift In or setting the eighth bit (GR-invoked), respectively.[120] They are not widely used;[120] JIS X 0208 support in extended 8-bit JIS X 0201 is more commonly achieved via Shift JIS. Microsoft's code page for JIS X 0201-based ISO 2022 with single-byte katakana via Shift Out and Shift In is Code page 50222.[115]

ISO-2022-JP-2

ISO-2022-JP-2 is a multilingual extension of ISO-2022-JP, defined in RFC 1554 (dated 1993), which permits the following escape sequences in addition to the ISO-2022-JP ones. The ISO/IEC 8859 parts are 96-character sets which cannot be designated to G0, and are accessed from G2 using the 7-bit escape sequence form of the single-shift code SS2:[121]

  • ESC $ A to switch to GB 2312-1980 (2 bytes per character)
  • ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
  • ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
  • ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character) [designated to G2]
  • ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character) [designated to G2]
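
The G2 mechanism can be sketched as follows: the ISO/IEC 8859-1 high part is designated to G2 once with ESC . A, and each character from it is then preceded by the 7-bit form of SS2 (ESC N) and sent with its eighth bit cleared. This is a simplified, hand-written Python sketch covering only ASCII and the Latin-1 high part on a single line of text; the function name is an assumption, and per-line handling of the designation is deliberately ignored.

```python
ESC = b"\x1b"
DESIGNATE_LATIN1_G2 = ESC + b".A"  # ISO/IEC 8859-1 high part to G2
SS2_7BIT = ESC + b"N"              # 7-bit escape sequence form of SS2

def encode_latin1_via_g2(text: str) -> bytes:
    """Encode ASCII plus Latin-1 high-part characters in 7 bits via G2."""
    out = bytearray()
    designated = False
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)                  # ASCII from G0
        elif 0xA0 <= cp <= 0xFF:
            if not designated:
                out += DESIGNATE_LATIN1_G2  # designate once before first use
                designated = True
            out += SS2_7BIT
            out.append(cp - 0x80)           # same position, eighth bit cleared
        else:
            raise ValueError(f"{ch!r} is outside this sketch's repertoire")
    return bytes(out)

print(encode_latin1_via_g2("café"))  # b'caf\x1b.A\x1bNi'
```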

ISO-2022-JP with the ISO-2022-JP-2 representation of JIS X 0212, but not the other extensions, was subsequently dubbed ISO-2022-JP-1 by RFC 2237, dated 1997.[122]

IBM Japanese TCP

IBM implements nine 7-bit ISO 2022 based encodings for Japanese, each using a different set of escape sequences: IBM-956, IBM-957, IBM-958, IBM-959, IBM-5052, IBM-5053, IBM-5054, IBM-5055 and ISO-2022-JP, which are collectively termed "TCP/IP Japanese coded character sets".[123] CCSID 9148 is the standard (RFC 1468) ISO-2022-JP.[124]

IBM variants of ISO-2022-JP
Code page / CCSID ACRI definition number Escape sequences for ACRI[110]
956[125] TCP-01
  • ESC ( J (JIS X 0201 Roman)
  • ESC $ ( B (JIS X 0208, 1983+, long escape sequence)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
957[126] TCP-02
  • ESC ( J (JIS X 0201 Roman)
  • ESC $ ( @ (JIS X 0208, 1978, long escape sequence)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
958[127] TCP-03
  • ESC ( B (ASCII)
  • ESC $ ( B (JIS X 0208, 1983+, long escape sequence)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
959[128] TCP-04
  • ESC ( B (ASCII)
  • ESC $ ( @ (JIS X 0208, 1978, long escape sequence)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
5052[129] TCP-05
  • ESC ( J (JIS X 0201 Roman)
  • ESC $ B (JIS X 0208, 1983+)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
5053[130] TCP-06
  • ESC ( J (JIS X 0201 Roman)
  • ESC $ @ (JIS X 0208, 1978)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
5054[131] TCP-07
  • ESC ( B (ASCII)
  • ESC $ B (JIS X 0208, 1983+)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
5055[132] TCP-08
  • ESC ( B (ASCII)
  • ESC $ @ (JIS X 0208, 1978)
  • ESC ( I (JIS X 0201 Katakana)
  • ESC $ ( D (JIS X 0212)
9148[124] TCP-16
  • ESC ( B (ASCII)
  • ESC ( J (JIS X 0201 Roman)
  • ESC $ @ (JIS X 0208, 1978)
  • ESC $ B (JIS X 0208, 1983+)

JIS X 0213

The JIS X 0213 standard, first published in 2000, defines an updated version of ISO-2022-JP, without the ISO-2022-JP-2 extensions, named ISO-2022-JP-3. The additions made by JIS X 0213 compared to the base JIS X 0208 standard resulted in a new registration being made for the extended JIS plane 1, while the new plane 2 received its own registration. The further additions to plane 1 in the 2004 edition of the standard resulted in an additional registration being added to a further revision of the profile, dubbed ISO-2022-JP-2004. In addition to the basic ISO-2022-JP designation codes, the following designations are recognized:

  • ESC ( I to switch to JIS X 0201-1976 Kana set (1 byte per character)
  • ESC $ ( O to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
  • ESC $ ( P to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
  • ESC $ ( Q to switch to JIS X 0213-2004 Plane 1 (2 bytes per character, ISO-2022-JP-2004 only)

Other 7-bit versions

ISO-2022-KR is defined in RFC 1557, dated 1993.[133] It encodes ASCII and the Korean double-byte KS X 1001-1992,[134][135] previously named KS C 5601-1987. Unlike ISO-2022-JP-2, it makes use of the Shift Out and Shift In characters to switch between them, after including ESC $ ) C once at the start of a line to designate KS X 1001 to G1.[133]

ISO-2022-CN and ISO-2022-CN-EXT are defined in RFC 1922, dated 1996. They are 7-bit encodings making use both of the Shift Out and Shift In functions (to shift between G0 and G1), and of the 7-bit escape code forms of the single-shift functions SS2 and SS3 (to access G2 and G3).[136] They support the character sets GB 2312 (for simplified Chinese) and CNS 11643 (for traditional Chinese).

The basic ISO-2022-CN profile uses ASCII as its G0 (shift in) set, and also includes GB 2312 and the first two planes of CNS 11643 (due to these two planes being sufficient to represent all traditional Chinese characters from common Big5, to which the RFC provides a correspondence in an appendix):[136]

  • ESC $ ) A to switch to GB 2312-1980 (2 bytes per character) [designated to G1]
  • ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character) [designated to G1]
  • ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character) [designated to G2]

The ISO-2022-CN-EXT profile permits the following additional sets and planes.[136]

  • ESC $ ) E to switch to ISO-IR-165 (2 bytes per character) [designated to G1]
  • ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character) [designated to G3]
  • ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character) [designated to G3]
  • ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character) [designated to G3]
  • ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character) [designated to G3]
  • ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character) [designated to G3]

The ISO-2022-CN-EXT profile further lists additional Guobiao standard graphical sets as being permitted, but conditional on their being assigned registered ISO 2022 escape sequences:[136]

  • GB 12345 in G1
  • GB 7589 or GB 13131 in G2
  • GB 7590 or GB 13132 in G3

The character after ESC (for single-byte character sets) or after ESC $ (for multi-byte character sets) specifies both the type of character set and the working set to which it is designated. In the above examples, the character ( (0x28) designates a 94-character set to G0, whereas ), * and + (0x29–0x2B) designate to G1, G2 and G3 respectively.
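
A designation sequence can therefore be decomposed by inspecting its intermediate bytes. The Python sketch below does this for the forms used in this article; `parse_designation` and its result keys are hypothetical names. The 96-set intermediates 0x2D–0x2F (of which only `.` for G2 appears in the examples above) and the three-byte multi-byte forms ESC $ @, ESC $ A and ESC $ B are included on the assumption that they follow the general designation syntax of the standard.

```python
SETS_94 = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}  # ( ) * +
SETS_96 = {0x2D: "G1", 0x2E: "G2", 0x2F: "G3"}              # - . /

def parse_designation(seq: bytes) -> dict:
    """Describe a graphical-set designation escape sequence."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("not a designation escape sequence")
    rest = seq[1:]
    multibyte = rest[0] == 0x24  # '$' marks a multi-byte set
    if multibyte:
        rest = rest[1:]
    if multibyte and len(rest) == 1 and rest in (b"@", b"A", b"B"):
        # Historical short form without a second I byte: always G0.
        return {"working_set": "G0", "size": 94,
                "multibyte": True, "final": rest.decode("ascii")}
    if len(rest) != 2:
        raise ValueError("unsupported sequence length")
    i_byte, final = rest[0], chr(rest[1])
    for size, table in ((94, SETS_94), (96, SETS_96)):
        if i_byte in table:
            return {"working_set": table[i_byte], "size": size,
                    "multibyte": multibyte, "final": final}
    raise ValueError("unknown intermediate byte")

print(parse_designation(b"\x1b(B"))   # ASCII to G0
print(parse_designation(b"\x1b$)C"))  # KS X 1001 to G1
print(parse_designation(b"\x1b.A"))   # ISO/IEC 8859-1 high part to G2
```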

ISO-2022-KR and ISO-2022-CN are used less frequently than ISO-2022-JP, and are sometimes deliberately not supported due to security concerns. Notably, the WHATWG Encoding Standard used by HTML5 maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT (as well as HZ-GB-2312) to the "replacement" decoder,[112] which maps all input to the replacement character (�), in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server.[113] Although the same security concern (allowing sequences of ASCII bytes to be interpreted differently) also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.[111]

In April 2024, a security flaw[137] was found in the implementation of ISO-2022-CN-EXT in glibc, which led to recommendations to disable the encoding entirely on Linux systems.[138]

ISO/IEC 4873

 
Relationship between ECMA-43 (ISO/IEC 4873) editions and levels, and EUC.

A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43. ISO/IEC 8859 defines 8-bit codes for ISO/IEC 4873 (or ECMA-43) level 1.[9][10]

ISO/IEC 4873 / ECMA-43 defines three levels of encoding:[139]

  • Level 1, which includes a C0 set, the ASCII G0 set, an optional C1 set and an optional single-byte (94-character or 96-character) G1 set. G0 is invoked over GL, and G1 is invoked over GR. Use of shift functions is not permitted.
  • Level 2, which includes a (94-character or 96-character) single-byte G2 and/or G3 set in addition to a mandatory G1 set. Only the single-shift functions SS2 and SS3 are permitted (i.e. locking shifts are forbidden), and they invoke over the GL region (including 0x20 and 0x7F in the case of a 96-set). SS2 and SS3 must be available in C1 at 0x8E and 0x8F respectively. This minimal required C1 set for ISO 4873 is registered as ISO-IR-105.[69]
  • Level 3, which permits the GR locking-shift functions LS1R, LS2R and LS3R in addition to the single shifts, but otherwise has the same restrictions as level 2.

Earlier editions of the standard permitted non-ASCII assignments in the G0 set, provided that the ISO/IEC 646 invariant positions were preserved, that the other positions were assigned to spacing (not combining) characters, that 0x23 was assigned to either £ or #, and that 0x24 was assigned to either $ or ¤.[140] For instance, the 8-bit encoding of JIS X 0201 is compliant with earlier editions. This was subsequently changed to fully specify the ISO/IEC 646:1991 IRV / ISO-IR No. 6 set (ASCII).[141][142][143]

The use of the ISO/IEC 646 IRV (synchronised with ASCII since 1991) at ISO/IEC 4873 Level 1 with no C1 or G1 set, i.e. using the IRV in an 8-bit environment in which shift codes are not used and the high bit is always zero, is known as ISO 4873 DV, in which DV stands for "Default Version".[144]

In cases where duplicate characters are available in different sets, the current edition of ISO/IEC 4873 / ECMA-43 only permits using these characters in the lowest numbered working set which they appear in.[145] For instance, if a character appears in both the G1 set and the G3 set, it must be used from the G1 set. However, use from other sets is noted as having been permitted in earlier editions.[143]

ISO/IEC 8859 defines complete encodings at level 1 of ISO/IEC 4873, and does not allow for use of multiple ISO/IEC 8859 parts together. It stipulates that ISO/IEC 10367 should be used instead for levels 2 and 3 of ISO/IEC 4873.[9][10] ISO/IEC 10367:1991 includes G0 and G1 sets matching those used by the first 9 parts of ISO/IEC 8859 (i.e. those which existed as of 1991, when it was published), and some supplementary sets.[146]

Character set designation escape sequences are used for identifying or switching between versions during information interchange only if required by a further protocol, in which case the standard requires an ISO/IEC 2022 announcer sequence specifying the ISO/IEC 4873 level, followed by a complete set of escapes specifying the character set designations for C0, C1, G0, G1, G2 and G3 respectively (but omitting G2 and G3 designations for level 1), with an F-byte of 0x7E denoting an empty set. Each ISO/IEC 4873 level has its own single ISO/IEC 2022 announcer sequence, which are as follows:[147]

Code Hex Announcement
ESC SP L 1B 20 4C ISO 4873 Level 1
ESC SP M 1B 20 4D ISO 4873 Level 2
ESC SP N 1B 20 4E ISO 4873 Level 3

Extended Unix Code

Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese. It is based on ISO 2022, and only character sets which conform to the ISO 2022 structure can have EUC forms. Up to four coded character sets can be represented (in G0, G1, G2 and G3). The G0 set is invoked over GL, the G1 set is invoked over GR, and the G2 and G3 sets are (if present) invoked using the single shifts SS2 and SS3, which are used as CR bytes (i.e. 0x8E and 0x8F respectively) and invoke over GR (not GL).[11] Locking shift codes are not used.[12]

The code assigned to the G0 set is ASCII, or the country's national ISO 646 character set such as KS-Roman (KS X 1003) or JIS-Roman (the lower half of JIS X 0201).[11] Hence, 0x5C (backslash in US-ASCII) is used to represent a Yen sign in some versions of EUC-JP and a Won sign in some versions of EUC-KR.

G1 is used for a 94x94 coded character set represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes (i.e. SS3 plus two bytes) whereas a single character in EUC-TW can take up to four bytes (i.e. SS2 plus three bytes).
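
Because EUC uses no locking shifts, the length of each character follows from its first byte alone. The Python sketch below illustrates this for EUC-JP, assuming (as in EUC-JP) a single-byte G2 set and a two-byte G3 set; `euc_jp_char_lengths` is a hypothetical helper, and other EUC forms such as EUC-TW would need different lengths after SS2.

```python
SS2, SS3 = 0x8E, 0x8F

def euc_jp_char_lengths(data: bytes) -> list:
    """Return the byte length of each character in an EUC-JP byte string."""
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1  # G0 over GL
        elif lead == SS2:
            n = 2  # SS2 plus one G2 byte
        elif lead == SS3:
            n = 3  # SS3 plus two G3 bytes
        elif 0xA1 <= lead <= 0xFE:
            n = 2  # two-byte G1 set over GR
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X}")
        lengths.append(n)
        i += n
    return lengths

sample = "aあｱ".encode("euc_jp")
print(sample, euc_jp_char_lengths(sample))
```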

The EUC code itself does not make use of the announcer or designation sequences from ISO 2022; however, it corresponds to the following sequence of four announcer sequences, with meanings breaking down as follows.[148]

Individual sequence Hexadecimal Feature of EUC denoted
ESC SP C 1B 20 43 ISO-8 (8-bit, G0 in GL, G1 in GR)
ESC SP Z 1B 20 5A G2 accessed using SS2
ESC SP [ 1B 20 5B G3 accessed using SS3
ESC SP \ 1B 20 5C Single-shifts invoke over GR

Compound Text (X11)

The X Consortium defined an ISO 2022 profile named Compound Text as an interchange format in 1989.[149] This uses only four control codes: HT (0x09), NL (newline, coded as LF, 0x0A), ESC (0x1B) and CSI (in its 8-bit representation 0x9B),[150] with the SDS (CSI … ]) CSI sequence being used for bidirectional text control.[151] It is an 8-bit code using G0 and G1 for GL and GR, and follows ISO-8859-1 in its initial state.[152] The following F-bytes are used:

ISO 2022 designation sequences used in X11 Compound Text[153]
Escape sequence type Final byte Graphical set
GZD4, G1D4 (for 94-character sets) B (0x42) ASCII
I (0x49) JIS X 0201 katakana
J (0x4A) JIS X 0201 Roman
G1D6 (for 96-character sets) A (0x41) ISO-8859-1 high part
B (0x42) ISO-8859-2 high part
C (0x43) ISO-8859-3 high part
D (0x44) ISO-8859-4 high part
F (0x46) ISO-8859-7 high part
G (0x47) ISO-8859-6 high part
H (0x48) ISO-8859-8 high part
L (0x4C) ISO-8859-5 high part
M (0x4D) ISO-8859-9 high part
GZDM4, G1DM4 (for 2-byte sets) A (0x41) GB 2312
B (0x42) JIS X 0208
C (0x43) KS C 5601

For specifying encodings by labels, X11 Compound Text defines five private-use DOCS sequences: ESC % / 0 (1B 25 2F 30) for variable-length encodings, and ESC % / 1 through ESC % / 4 for fixed-length encodings using one through four bytes respectively. Rather than using another escape sequence to return to ISO 2022, the two bytes following the initial escape sequence specify the remaining length in bytes, coded in base-128 using bytes 0x80–FF. The encoding label is included in ISO 8859-1 before the encoded text, and terminated with STX (0x02).[108]
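
The layout described above can be sketched in Python as follows. The helper name `extended_segment` and the label used in the example are hypothetical, and the sketch assumes that the two length bytes count everything that follows them (the label, the STX terminator and the encoded text), most significant digit first.

```python
STX = b"\x02"

def extended_segment(label: str, data: bytes, final: bytes = b"0") -> bytes:
    """Build ESC % / F, two base-128 length bytes, label, STX, then data."""
    if final not in (b"0", b"1", b"2", b"3", b"4"):
        raise ValueError("F must be one of the five private-use bytes 0-4")
    payload = label.encode("latin-1") + STX + data
    high, low = divmod(len(payload), 128)
    if high >= 128:
        raise ValueError("segment too long for a two-byte length")
    # Each length digit is offset by 0x80 so that it lies in 0x80-0xFF.
    return b"\x1b%/" + final + bytes([0x80 + high, 0x80 + low]) + payload

segment = extended_segment("example-label", b"\xa4\xa2", final=b"2")
print(segment)
```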

Comparison with other encodings

Advantages

  • As ISO/IEC 2022's entire range of graphical character encodings can be invoked over GL, the available glyphs are not significantly limited by an inability to represent GR and C1, such as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. In practice, this 7-bit compatibility matters mainly for backwards compatibility with older systems, since the vast majority of modern computers use 8-bit bytes.
  • As compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using sequence codes to switch between discrete encodings for different East Asian languages. This avoids the issues[citation needed] associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.

Disadvantages

  • Since ISO/IEC 2022 is a stateful encoding, a program cannot jump into the middle of a block of text to search, insert or delete characters. This makes manipulation of the text cumbersome and slow compared with non-stateful encodings: any jump into the middle of the text may require backing up to the previous escape sequence before the bytes that follow it can be interpreted.
  • Due to the stateful nature of ISO/IEC 2022, an identical and equivalent character may be encoded in different character sets, which may be designated to any of G0 through G3, which may be invoked using single shifts or by using locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.
  • Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 (e.g. "ISO 2022 IR 100"[154]) in addition to supporting several other encodings.[155] This type of variation makes it difficult to portably transfer text between computer systems.
  • UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022's representation of 8-bit control characters, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.
  • Because of its escape sequences, it is possible to construct attack byte sequences in which a malicious string (such as cross-site scripting) is masked until it is decoded to Unicode, which may allow it to bypass sanitisation.[156] Use of this encoding is thus treated as suspicious by malware protection suites,[157][better source needed] and 7-bit ISO 2022 data (except for ISO-2022-JP) is mapped in its entirety to the replacement character in HTML5 to prevent attacks.[112][113] Restricted ISO 2022 8-bit code versions which do not use designation escapes or locking shift codes, such as Extended Unix Code, do not share this problem.
  • Concatenation can pose issues. Profiles such as ISO-2022-JP specify that the stream starts in the ASCII state and must end in the ASCII state.[114] This is necessary to ensure that characters in concatenated ISO-2022-JP and/or ASCII streams will be interpreted in the correct set. This has the consequence that if a stream that ends in a multi-byte character is concatenated with one that starts with a multi-byte character, a pair of escape codes are generated switching to ASCII and immediately away from it. However, as stipulated in Unicode Technical Report #36 ("Unicode Security Considerations"), pairs of ISO 2022 escape sequences with no characters between them should generate a replacement character ("�") to prevent them from being used to mask malicious sequences such as cross-site scripting.[158] Implementing this measure, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected "�" characters being generated where two ISO-2022-JP streams have been concatenated.[156]
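
The concatenation issue in the last point can be reproduced with Python's `iso2022_jp` codec. In the sketch below, `has_empty_escape_pair` is a hypothetical check for two adjacent designation escapes, limited to the four escapes of the base ISO-2022-JP profile; it is not taken from any particular mail client or decoder.

```python
import re

# Any of the four base ISO-2022-JP designation escapes.
ESCAPE = rb"\x1b(?:\([BJ]|\$[@B])"
EMPTY_PAIR = re.compile(ESCAPE + ESCAPE)

def has_empty_escape_pair(data: bytes) -> bool:
    """Report whether two escapes occur with no characters between them."""
    return EMPTY_PAIR.search(data) is not None

first = "日本".encode("iso2022_jp")  # ends by returning to ASCII
second = "語".encode("iso2022_jp")   # starts by switching away from ASCII

print(has_empty_escape_pair(first))           # False
print(has_empty_escape_pair(first + second))  # True
```

Encoding the whole string in one call avoids the pair, since the encoder then stays in JIS X 0208 between the second and third characters.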

See also

Footnotes

  1. ^ Japanese: 区点, romanized: kuten; Chinese: 区位; pinyin: qūwèi; Korean: 행렬; Hanja: 行列; RR: haeng-nyeol
  2. ^ Japanese: 区, romanized: ku, lit. 'zone'; Chinese: 区; pinyin: qū; Korean: 행; Hanja: 行; RR: haeng
  3. ^ Japanese: 点, romanized: ten, lit. 'point'; Chinese: 位; pinyin: wèi; lit. 'position'; Korean: 열; Hanja: 列; RR: yeol
  4. ^ Japanese: 面, romanized: men, lit. 'face'
  5. ^ a b Specified for F bytes 0x40 (@), 0x41 (A) and 0x42 (B) only, for historical reasons.[89] Some implementations, such as the SoftBank 2G emoji encoding, use additional escapes of this form for non-ISO-2022-compliant purposes.[96]
  6. ^ Listed by MARC-8.[3] See footnote for ESC , F below for background.
  7. ^ F, adjusted to the range 1-63, indicates which (upwardly compatible) revision of the immediately-following registration is needed, so that old systems know that they are old.[97]
  8. ^ In earlier editions, 96-character sets did not exist, and the escape codes now used for 96-character sets were reserved as space for additional 94-character sets. Accordingly, the ESC 0x1B 0x2C sequence was defined in early editions of the standard as designating further 94-character sets to G0.[98] Since 96-character sets cannot be designated to G0, this first I byte is not used by the current edition of the standard. However, it is still listed by MARC-8.[3]
  9. ^ See also, for instance, Printronix (2012), OKI® Programmer's Reference Manual (PDF), p. 26 for a more recent system which uses ESC ( H to switch to ASCII from a DBCS.

References

  1. ^ ECMA-35 (1994), Brief History
  2. ^ ECMA-35 (1994), p. 51, annex D
  3. ^ a b c d e "Technique 2: Using standard alternate graphic character sets". MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Library of Congress. 2007-12-05. Archived from the original on 2020-07-22. Retrieved 2020-07-19.
  4. ^ "ECMA-35: Character code structure and extension techniques (web page)". Ecma International. Archived from the original on 2022-04-25. Retrieved 2022-04-27.
  5. ^ a b c d ECMA-35 (1994), pp. 15–16, chapter 8.1
  6. ^ a b ECMA-35 (1994), chapter 13
  7. ^ a b ECMA-35 (1994), chapters 12, 14
  8. ^ a b ECMA-35 (1994), chapter 11
  9. ^ a b c d e ISO/IEC FDIS 8859-10 (1998), p. 1, chapter 1 ("Scope")
  10. ^ a b c d e ECMA-144 (2000), p. 1, chapter 1 ("Scope")
  11. ^ a b c d e f Lunde (2008), pp. 242–245, Chapter 4 ("Encoding Methods"), section "EUC encoding"
  12. ^ a b c d Lunde (2008), pp. 253–255, Chapter 4 ("Encoding Methods"), section "EUC versus ISO-2022 encodings".
  13. ^ a b ISO-IR-196 (1996)
  14. ^ a b c Moy, Edward; Gildea, Stephen; Dickey, Thomas. "Controls beginning with ESC". XTerm Control Sequences. Archived from the original on 2019-10-10. Retrieved 2019-10-04.
  15. ^ ECMA-35 (1994), chapters 6, 7
  16. ^ ECMA-35 (1994), chapter 8
  17. ^ ECMA-35 (1994), chapter 9
  18. ^ a b ECMA-35 (1994), chapter 15
  19. ^ Lunde (2008), pp. 228–234, Chapter 4 ("Encoding Methods"), section "ISO-2022 encoding"
  20. ^ Lunde (2008), pp. 19–20, Chapter 1 ("CJKV Information Processing Overview"), section "What are Row-Cell and Plane-Row-Cell?"
  21. ^ ECMA-35 (1994), p. 4, definition 4.11
  22. ^ ECMA-35 (1994), p. 5, definition 4.18
  23. ^ See, for instance, ISO-IR-14 (1975), defining the G0 designation of the JIS X 0201 Roman set as ESC 2/8 4/10.
  24. ^ ECMA-35 (1994), p. 5, chapter 5.1
  25. ^ See, for instance, RFC 1468 (1993), defining the G0 designation of the JIS X 0201 Roman set as ESC ( J.
  26. ^ ECMA-35 (1994), p. 7, chapter 6.2
  27. ^ ECMA-35 (1994), p. 10, chapter 6.3.2
  28. ^ ECMA-35 (1994), p. 4, definition 4.17
  29. ^ ECMA-35 (1994), p. 4, definition 4.14
  30. ^ ECMA-35 (1994), p. 28, chapter 13.1
  31. ^ a b c ECMA-35 (1994), p. 33, chapter 13.3.3
  32. ^ ECMA-48 (1991), pp. 24–26, chapter 5.4
  33. ^ a b c d ECMA-35 (1994), p. 11, chapter 6.4.3
  34. ^ ISO-IR-208 (1999)
  35. ^ ISO-IR-155 (1990)
  36. ^ ISO-IR-164 (1992)
  37. ^ a b ECMA-35 (1994), p. 10, chapter 6.3.3
  38. ^ Google Inc. (2014). "ansi.go, line 134". ANSI escape sequence library for Go. Archived from the original on 2022-04-30. Retrieved 2019-09-14.
  39. ^ ECMA-43 (1991), p. 5, chapter 7 ("Specification of the characters of the 8-bit code")
  40. ^ ISO/IEC FDIS 8859-10 (1998), p. 3, chapter 6 ("Specification of the coded character set")
  41. ^ ECMA-144 (2000), p. 3, chapter 6 ("Specification of the coded character set")
  42. ^ ECMA-43 (1991), p. 19, annex C ("Composite graphic characters")
  43. ^ a b ECMA-35 (1994), p. 10, chapter 6.4.1
  44. ^ a b ECMA-35 (1994), p. 11, chapter 6.4.4
  45. ^ a b c ECMA-35 (1994), p. 11, chapter 6.4.2
  46. ^ ISO-IR-104 (1985)
  47. ^ ISO-IR-1 (1975)
  48. ^ a b ECMA-35 (1994), p. 19, chapter 8.5.1
  49. ^ a b ECMA-35 (1994), p. 19, chapter 8.5.2
  50. ^ ECMA-43 (1991), p. 8, chapter 7.6 ("C1 set")
  51. ^ a b ECMA-35 (1994), p. 29, chapter 13.2.1
  52. ^ a b ECMA-35 (1994), p. 12, chapter 6.5.1
  53. ^ ECMA-35 (1994), p. 12, chapter 6.5.2
  54. ^ a b c ISO-IR, p. 19, chapter 2.7 ("Single control functions")
  55. ^ ECMA-35 (1994), p. 12, chapter 6.5.4
  56. ^ ECMA-48 (1991), chapter 5.5
  57. ^ ISO/TC 97/SC 2 (1976-12-30). Reset to Initial State (RIS) (PDF). ITSCJ/IPSJ. ISO-IR-35.
  58. ^ ECMA-35 (1994), p. 12, chapter 6.5.3
  59. ^ a b ECMA-35 (1994), p. 14, chapter 7.3, table 2
  60. ^ ISO-IR-14 (1975)
  61. ^ a b ITU-T (1995-08-11). Recommendation T.51 (1992) Amendment 1. Archived from the original on 2020-08-02. Retrieved 2019-12-25.
  62. ^ ISO-IR-106 (1985)
  63. ^ ECMA-35 (1994), p. 15, chapter 7.3, note 23
  64. ^ ISO-IR-140 (1987)
  65. ^ ISO-IR-7 (1975)
  66. ^ ISO-IR-26 (1976)
  67. ^ ISO-IR-36 (1977)
  68. ^ ECMA-35 (1980), p. 8, chapter 5.1.7
  69. ^ a b ISO-IR-105 (1985)
  70. ^ a b c d ECMA-35 (1994), p. 17, chapter 8.3.1
  71. ^ a b c d ECMA-35 (1994), p. 23, chapter 9.3.1
  72. ^ a b c ECMA-35 (1994), p. 19, chapter 8.4
  73. ^ a b c ECMA-35 (1994), p. 17, chapter 8.3.2
  74. ^ ECMA-35 (1994), pp. 23–24, chapter 9.4
  75. ^ ECMA-35 (1994), p. 27, chapter 11.1
  76. ^ ECMA-35 (1994), p. 17, chapter 8.3.3
  77. ^ ECMA-35 (1994), p. 47, annex B
  78. ^ ISO-IR, p. 2, chapter 1 ("Introduction")
  79. ^ ISO/IEC 2375 (2003)
  80. ^ a b "Handling of the SGML declaration in SP". SP: an SGML System Conforming to International Standard ISO 8879.
  81. ^ "20: SGML Declaration of HTML 4". HTML 4.01 Specification. W3C.
  82. ^ ISO-IR, p. 10, chapter 2.2 ("94-Character graphic character set with second Intermediate byte")
  83. ^ ARIB STD-B24 (2008), p. 39, part 2, Table 7-3
  84. ^ Mascheck, Sven; Le Breton, Stefan; Hamilton, Richard L. "About the 'alternate linedrawing character set'". ~sven_mascheck/. Archived from the original on 2019-12-29. Retrieved 2020-01-08.
  85. ^ ECMA-35 (1994), p. 36, chapter 14.4
  86. ^ ECMA-35 (1994), p. 36, chapter 14.4.2, note 48
  87. ^ ECMA-35 (1994), p. 36, chapter 14.4.2, note 47
  88. ^ ETS 300 706 (1997), p. 103, chapter 14 ("Dynamically Re-definable Characters")
  89. ^ a b c d e f g h i j k l m n o p q ECMA-35 (1994), pp. 35–36, chapter 14.3.2
  90. ^ ISO/IEC 10646 (2017), pp. 19–20, chapter 12.4 ("Identification of control function set")
  91. ^ ECMA-35 (1994), p. 32, table 5
  92. ^ a b c ECMA-35 (1994), pp. 37–41, chapter 15.2
  93. ^ ECMA-35 (1994), p. 34, chapter 14.2.2
  94. ^ ECMA-35 (1994), p. 34, chapter 14.2.3
  95. ^ Digital. "DECDWL—Double-Width, Single-Height Line". VT510 Video Terminal Programmer Information. Archived from the original on 2020-08-02. Retrieved 2020-01-17.
  96. ^ Kawasaki, Yusuke (2010). "Encode::JP::Emoji::Encoding". Encode-JP-Emoji. Line 268. Archived from the original on 2022-04-30. Retrieved 2020-05-28.
  97. ^ ECMA-35 (1994), pp. 36–37, chapter 14.5
  98. ^ ECMA-35 (1980), pp. 14–15, chapter 5.3.7
  99. ^ a b c d ISO-IR, p. 20, chapter 2.8.1 ("Coding systems with Standard return")
  100. ^ a b c d ECMA-35 (1994), pp. 41–42, chapter 15.4
  101. ^ a b c d e ISO-IR, p. 21, chapter 2.8.2 ("Coding systems without Standard return")
  102. ^ ECMA-35 (1994), p. 41, chapter 15.3
  103. ^ a b c ISO/IEC 10646 (2017), p. 19, chapter 12.2 ("Identification of a UCS encoding scheme")
  104. ^ ISO/IEC 10646 (2017), pp. 18–19, chapter 12.1 ("Purpose and context of identification")
  105. ^ ISO-IR-192 (1996)
  106. ^ ISO-IR-195 (1996)
  107. ^ ISO/IEC 10646 (2017), p. 20, chapter 12.5 ("Identification of the coding system of ISO/IEC 2022")
  108. ^ a b Scheifler (1989), § Non-Standard Character Set Encodings
  109. ^ Lunde (2008), pp. 229–230, Chapter 4 ("Encoding Methods"), section "ISO-2022 encoding": "Those encodings that have been extensively used in the past, or continue to be used today for some purposes, have been highlighted."
  110. ^ a b IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2015-01-07.
  111. ^ a b c WHATWG Encoding Standard, section 2 ("Security background")
  112. ^ a b c WHATWG Encoding Standard, chapter 4.2 ("Names and labels"), anchor "replacement"
  113. ^ a b c d WHATWG Encoding Standard, section 14.1 ("replacement")
  114. ^ a b c d e f RFC 1468 (1993)
  115. ^ a b c "Code Page Identifiers". Windows Dev Center. Microsoft. from the original on 2019-06-16. Retrieved 2019-09-16.
  116. ^ a b WHATWG Encoding Standard, section 12.2 ("ISO-2022-JP")
  117. ^ Chang, Hye-Shik. "Modules/cjkcodecs/_codecs_iso2022.c, line 1122". cPython source tree. Python Software Foundation. Archived from the original on 2022-04-30. Retrieved 2019-09-15.
  118. ^ "codecs — Codec registry and base classes § Standard Encodings". Python 3.7.4 documentation. Python Software Foundation. Archived from the original on 2019-07-28. Retrieved 2019-09-16.
  119. ^ "2: Codesets and Codeset Conversion". DIGITAL UNIX Technical Reference for Using Japanese Features. Digital Equipment Corporation, Compaq.
  120. ^ a b Lunde (2008), pp. 236–238, Chapter 4 ("Encoding Methods"), section "The predecessor of ISO-2022-JP encoding—JIS encoding"
  121. ^ RFC 1554 (1993)
  122. ^ RFC 2237 (1997)
  123. ^ "PQ02042: New Function to Provide C/370 iconv() Support for Japanese ISO-2022-JP". IBM. 2021-01-19. Archived from the original on 2022-01-04. Retrieved 2022-01-04.
  124. ^ a b IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
  125. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-02.
  126. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-30.
  127. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-01.
  128. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-02.
  129. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
  130. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
  131. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
  132. ^ IBM Globalization - Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
  133. ^ a b RFC 1557 (1993)
  134. ^ "KS X 1001:1992" (PDF). Archived (PDF) from the original on 2007-09-26. Retrieved 2007-07-12.
  135. ^ ISO-IR-149 (1988)
  136. ^ a b c d RFC 1922 (1996)
  137. ^ "CVE-2024-2961".
  138. ^ "GLIBC Vulnerability on Servers Serving PHP".
  139. ^ ECMA-43 (1991), pp. 9–10, chapter 8 ("Levels")
  140. ^ ECMA-43 (1985), pp. 7–11, chapter 7.3 ("The G0 set")
  141. ^ ECMA-43 (1991), pp. 6–8, chapter 7.4 ("G0 set")
  142. ^ ECMA-43 (1991), p. 11, chapter 10.3 ("Identification of a version")
  143. ^ a b ECMA-43 (1991), p. 23, annex E ("Main differences between the second edition (1985) and the present (third) edition of this ECMA Standard")
  144. ^ IPTC (1995). The IPTC Recommended Message Format (PDF) (5th ed.). IPTC TEC 7901. Archived (PDF) from the original on 2022-01-25. Retrieved 2020-01-14.
  145. ^ ECMA-43 (1991), p. 10, chapter 9.2 ("Unique coding of characters")
  146. ^ van Wingen, Johan W (1999). "8. Code Extension, ISO 2022 and 2375, ISO 4873 and 10367". Character sets. Letters, tokens and codes. Terena. Archived from the original on 2020-08-01. Retrieved 2019-10-02.
  147. ^ ECMA-43 (1991), pp. 10–11, chapter 10 ("Identification of version and level")
  148. ^ IBM. "Character Data Representation Architecture (CDRA)". IBM. pp. 157–162. Archived from the original on 2019-06-23. Retrieved 2020-06-18.
  149. ^ Scheifler (1989)
  150. ^ Scheifler (1989), § Control Characters
  151. ^ Scheifler (1989), § Directionality
  152. ^ Scheifler (1989), § Standard Character Set Encodings
  153. ^ Scheifler (1989), § Approved Standard Encodings
  154. ^ "DICOM PS3.2 2016d - Conformance; D.6.2 Character Sets; D.6 Support of Character Sets". Archived from the original on 2020-02-16. Retrieved 2020-05-21.
  155. ^ Archived from the original on 2013-04-30. Retrieved 2009-07-25.
  156. ^ a b Sivonen, Henri (2018-12-17). "(UNSUBMITTED DRAFT) No U+FFFD Generation for Zero-Length ASCII-State Content between ISO-2022-JP Escape Sequences" (PDF). Archived (PDF) from the original on 2019-02-21. Retrieved 2019-02-21.
  157. ^ "935453 - Gather telemetry about HZ and other encodings we might try to remove". Archived from the original on 2017-05-19. Retrieved 2018-06-18.
  158. ^ Davis, Mark; Suignard, Michel (2014-09-19). "3.6.2 Some Output For All Input". Unicode Technical Report #36: Unicode Security Considerations (revision 15). Unicode Consortium. Archived from the original on 2019-02-22. Retrieved 2019-02-21.


Standards and registry indices cited

  • ARIB (2008). ARIB STD-B24: Data Coding and Transmission Specification for Digital Broadcasting (PDF) (ARIB Standard). 5.2-E1. Vol. 1. Archived (PDF) from the original on 2017-07-10. Retrieved 2017-07-10.
  • ECMA (1980). ECMA-35: Extension of the 7-bit Coded Character Set (PDF) (ECMA Standard) (2nd ed.).
  • ECMA (1994). ECMA-35: Character Code Structure and Extension Techniques (PDF) (ECMA Standard) (6th ed.).
  • ECMA (1985). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF) (ECMA Standard) (2nd ed.).
  • ECMA (1991). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF) (ECMA Standard) (3rd ed.).
  • ECMA (1991). ECMA-48: Control Functions for Coded Character Sets (PDF) (ECMA Standard) (5th ed.).
  • ECMA (2000). ECMA-144: 8-Bit Single-Byte Coded Graphic Character sets: Latin Alphabet No. 6 (PDF) (ECMA Standard) (3rd ed.).
  • European Broadcasting Union (1997). ETS 300 706: Enhanced Teletext specification (PDF) (European Telecommunications Standards). ETSI.
  • ISO/IEC JTC 1/SC 2 (2003). ISO/IEC 2375:2003: Information technology — Procedure for registration of escape sequences and coded character sets. ISO.
  • ISO/IEC JTC 1/SC 2 (1998-02-12). ISO/IEC FDIS 8859-10: Information Technology — 8-bit single-byte coded graphic character sets — Part 10: Latin alphabet No. 6 (PDF) (Final Draft International Standard).
  • ISO/IEC JTC 1/SC 2 (2017). ISO/IEC 10646: Information technology — Universal Coded Character Set (UCS) (ISO Standard) (5th ed.). ISO.
  • ISO-IR: ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences (PDF) (Registry Index). ITSCJ/IPSJ.
  • Scheifler, Robert W. (1989). Compound Text Encoding (X Consortium Standard). X Consortium.
  • van Kesteren, Anne. WHATWG Encoding Standard (WHATWG Living Standard). WHATWG.

Registered code sets cited

  • ISO/TC 97/SC 2 (1975-12-01). ISO-IR-1: The set of control characters of the ISO 646 (PDF). ITSCJ/IPSJ.
  • Sveriges Standardiseringskommission (1975-12-01). ISO-IR-7: NATS Control set for newspaper text transmission (PDF). ITSCJ/IPSJ.
  • Japanese Industrial Standards Committee (1975-12-01). ISO-IR-14: The Japanese Roman graphic set of characters (PDF). ITSCJ/IPSJ.
  • IPTC (1976-03-25). ISO-IR-26: Control set for newspaper text transmission (PDF). ITSCJ/IPSJ.
  • ISO/TC 97/SC 2 (1977-10-15). ISO-IR-36: The set of control characters of ISO 646, with IS4 replaced by Single Shift for G2 (SS2) (PDF). ITSCJ/IPSJ.
  • ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-104: Minimum C0 set for ISO 4873 (PDF). ITSCJ/IPSJ.
  • ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-105: Minimum C1 Set for ISO 4873 (PDF). ITSCJ/IPSJ.
  • ITU (1985-08-01). ISO-IR-106: Teletex Primary Set of Control Functions (PDF). ITSCJ/IPSJ.
  • Úřad pro normalizaci a měřeni (1987-07-31). ISO-IR-140: The C0 Set of Control Characters of ISO 646, with EM replaced by SS2 (PDF). ITSCJ/IPSJ.
  • Korea Bureau of Standards (1988-10-01). ISO-IR-149: Korean Graphic Character Set for Information Interchange (KS C 5601:1987) (PDF). ITSCJ/IPSJ.
  • ISO/IEC/JTC1/SC2/WG3 (1990-04-16). ISO-IR-155: Basic Box-Drawings Set (PDF). ITSCJ/IPSJ.
  • CCITT (1992-07-13). ISO-IR-164: Hebrew Supplementary Set of Graphic Characters (PDF). ITSCJ/IPSJ.
  • ECMA (1996-04-22). ISO-IR-192: UCS Transformation Format (UTF-8), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
  • ECMA (1996-04-22). ISO-IR-195: UCS Transformation Format (UTF-16), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
  • ECMA (1996-04-22). ISO-IR-196: UCS Transformation Format (UTF-8), with standard return (PDF). ITSCJ/IPSJ.
  • National Standards Authority of Ireland (1999-12-07). ISO-IR-208: Ogham coded character set for information interchange (PDF). ITSCJ/IPSJ.

Internet Requests For Comment cited

  • Murai, J.; Crispin, M.; van der Poel, E. (1993). "RFC 1468: Japanese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1468.
  • Ohta, M.; Handa, K. (1993). "RFC 1554: ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP". Requests for Comments. IETF. doi:10.17487/rfc1554.
  • Choi, U.; Chon, K.; Park, H. (1993). "RFC 1557: Korean Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1557.
  • Zhu, HF.; Hu, DY.; Wang, ZG.; Kao, TC.; Chang, WCH.; Crispin, M. (1996). "RFC 1922: Chinese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1922.
  • Tamaru, K. (1997). "RFC 2237: Japanese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc2237.

Other published works cited

Further reading

External links

  • ISO/IEC 2022:1994
  • ISO/IEC 2022:1994/Cor 1:1999
  • ECMA-35, equivalent to ISO/IEC 2022 and freely downloadable.
  • International Register of Coded Character Sets to be Used with Escape Sequences, a full list of assigned character sets and their escape sequences
  • Ken Lunde's CJK.INF: a document on encoding Chinese, Japanese, and Korean (CJK) languages, including a discussion of the various variants of ISO/IEC 2022.

2022, confused, with, 20022, information, technology, character, code, structure, extension, techniques, standard, field, character, encoding, equivalent, ecma, standard, ecma, ansi, standard, ansi, japanese, industrial, standard, 0202, originating, 1971, most. Not to be confused with ISO 20022 ISO IEC 2022 Information technology Character code structure and extension techniques is an ISO IEC standard in the field of character encoding It is equivalent to the ECMA standard ECMA 35 1 2 the ANSI standard ANSI X3 41 3 and the Japanese Industrial Standard JIS X 0202 Originating in 1971 it was most recently revised in 1994 4 ISO 2022Language s Various StandardISO IEC 2022ECMA 35ANSI X3 41JIS X 0202GB T 2311ClassificationStateful system of encodings with stateless pre configured subsets Transforms EncodesUS ASCII and depending on implementation GB 2312JIS X 0201JIS X 0208JIS X 0212JIS X 0213KS X 1001CNS 11643ISO IEC 646ISO IEC 8859 10367various othersSucceeded byISO IEC 10646 Unicode Other related encoding s Stateful subsets ISO 2022 JPISO 2022 CNISO 2022 KRCompound Text Pre configured versions ISO IEC 4873EUCvte ISO 2022 specifies a general structure which character encodings can conform to dedicating particular ranges of bytes 0x00 1F and 0x7F 9F to be used for non printing control codes 5 for formatting and in band instructions such as line breaks or formatting instructions for text terminals rather than graphical characters It also specifies a syntax for escape sequences multiple byte sequences beginning with the ESC control code which can likewise be used for in band instructions 6 Specific sets of control codes and escape sequences designed to be used with ISO 2022 include ISO IEC 6429 portions of which are implemented by ANSI SYS and terminal emulators ISO 2022 itself also defines particular control codes and escape sequences which can be used for switching between different coded character sets for example between ASCII and the Japanese JIS X 0208 so as to use 
multiple in a single document 7 effectively combining them into a single stateful encoding a feature less important since the advent of Unicode It is designed to be usable in both 8 bit environments and 7 bit environments those where only seven bits are usable in a byte such as e mail without 8BITMIME 8 Contents 1 Encodings and conformance 2 Overview 2 1 Elements 2 2 Code versions 2 3 Designation escape sequences 2 4 Multi byte characters 3 Code structure 3 1 Notation and nomenclature 3 2 Fixed coded characters 3 3 General syntax of escape sequences 3 4 Graphical character sets 3 5 Combining characters 3 6 Control character sets 3 7 Other control functions 3 8 Shift functions 3 9 Registration of graphical and control code sets 3 10 Character set designations 3 11 Interaction with other coding systems 3 12 Code structure announcements 4 ISO IEC 2022 code versions 4 1 Japanese e mail versions 4 1 1 ISO 2022 JP 4 1 2 Versions with halfwidth katakana 4 1 3 ISO 2022 JP 2 4 1 4 IBM Japanese TCP 4 1 5 JIS X 0213 4 2 Other 7 bit versions 4 3 ISO IEC 4873 4 4 Extended Unix Code 4 5 Compound Text X11 5 Comparison with other encodings 5 1 Advantages 5 2 Disadvantages 6 See also 7 Footnotes 8 References 8 1 Standards and registry indices cited 8 2 Registered code sets cited 8 3 Internet Requests For Comment cited 8 4 Other published works cited 9 Further reading 10 External linksEncodings and conformance editThe ASCII character set supports the ISO Basic Latin alphabet equivalent to the English alphabet and does not provide good support for languages which use additional letters or which use a different writing system altogether Other writing systems with relatively few characters such as Greek Cyrillic Arabic or Hebrew as well as forms of the Latin script using diacritics or letters absent from the ISO Basic Latin alphabet have historically been represented on personal computers with different 8 bit single byte extended ASCII encodings which follow ASCII when the most 
significant bit is 0 i e bytes 0x00 7F when represented in hexadecimal and include additional characters for a most significant bit of 1 i e bytes 0x80 FF Some of these such as the ISO 8859 series conform to ISO 2022 9 10 while others such as DOS code page 437 do not usually due to not reserving the bytes 0x80 9F for control codes Certain East Asian languages specifically Chinese Japanese and Korean collectively CJK are written using far more characters than the maximum of 256 which can be represented in a single byte and were first represented on computers with language specific double byte encodings or variable width encodings some of these such as the Simplified Chinese encoding GB 2312 conform to ISO 2022 while others such as the Traditional Chinese encoding Big5 do not Control codes in ISO 2022 are always represented with a single byte regardless of the number of bytes used for graphical characters CJK encodings used in 7 bit environments which use ISO 2022 mechanisms to switch between character sets are often given names starting with ISO 2022 most notably ISO 2022 JP although some other CJK encodings such as EUC JP also make use of ISO 2022 mechanisms 11 12 Since the first 256 code points of Unicode were taken from ISO 8859 1 Unicode inherits the concept of C0 and C1 control codes from ISO 2022 although it adds other non printing characters besides the ISO 2022 control codes However Unicode transformation formats such as UTF 8 generally deviate from the ISO 2022 structure in various ways including Using 8 bit bytes but not representing the C1 codes in their single byte forms specified in ISO 2022 most UTFs one exception being the obsolete UTF 1 Representing all characters including control codes with multiple bytes e g UTF 16 UTF 32 Mixing bytes with the most significant bit set and unset within the coded representation for a single code point e g UTF 1 GB 18030 ISO 2022 escape sequences do however exist for switching to and from UTF 8 as a coding system 
different from that of ISO 2022 13 which are supported by certain terminal emulators such as xterm 14 Overview editElements edit ISO IEC 2022 specifies the following An infrastructure of multiple character sets with particular structures which may be included in a single character encoding system including multiple graphical character sets and multiple sets of both primary C0 and secondary C1 control codes 15 A format for encoding these sets assuming that 8 bits are available per byte 16 A format for encoding these sets in the same encoding system when only 7 bits are available per byte 17 and a method for transforming any conformant character data to pass through such a 7 bit environment 8 The general structure of ANSI escape codes 6 and Specific escape code formats for identifying individual character sets 7 for announcing the use of particular encoding features or subsets 18 and for interacting with or switching to other encoding systems 18 Code versions edit Further information ISO IEC 2022 code versions A specific implementation does not have to implement all of the standard the conformance level and the supported character sets are defined by the implementation Although many of the mechanisms defined by the ISO IEC 2022 standard are infrequently used several established encodings are based on a subset of the ISO IEC 2022 system 19 In particular 7 bit encoding systems using ISO IEC 2022 mechanisms include ISO 2022 JP or JIS encoding which has primarily been used in Japanese language e mail 8 bit encoding systems conforming to ISO IEC 2022 include ISO IEC 4873 ECMA 43 which is in turn conformed to by ISO IEC 8859 9 10 and Extended Unix Code which is used for East Asian languages 11 More specialised applications of ISO 2022 include the MARC 8 encoding system used in MARC 21 library records 3 Designation escape sequences edit Further information Registration of graphical and control code sets and Character set designations The escape sequences for switching to 
particular character sets or encodings are registered with the ISO IR registry except for those set apart for private use the meanings of which are defined by vendors or by protocol specifications such as ARIB STD B24 and follow the patterns defined within the standard Character encodings making use of these escape sequences require data to be processed sequentially in a forward direction since the correct interpretation of the data depends on previously encountered escape sequences Specific profiles such as ISO 2022 JP may impose extra conditions such as that the current character set is reset to US ASCII before the end of a line Furthermore the escape sequences declaring the national character sets may be absent if a specific ISO 2022 based encoding permits or requires this and dictates that particular national character sets are to be used For example ISO 8859 1 states that no defining escape sequence is needed Multi byte characters edit Further information JIS X 0208 Code points and code numbers To represent large character sets ISO IEC 2022 builds on ISO IEC 646 s property that a seven bit character representation will normally be able to represent 94 graphic printable characters in addition to space and 33 control characters if only the C0 control codes narrowly defined are excluded this can be expanded to 96 characters Using two bytes it is thus possible to represent up to 8 836 94 94 characters and using three bytes up to 830 584 94 94 94 characters Though the standard defines it no registered character set uses three bytes although EUC TW s unregistered G2 does as does the similarly unregistered CCCII For the two byte character sets the code point of each character is normally specified in so called row cell or kuten a form which comprises two numbers between 1 and 94 inclusive specifying a row b and cell c of that character within the zone For a three byte set an additional plane d number is included at the beginning 20 The escape sequences do not only 
declare which character set is being used but also whether the set is single byte or multi byte although not how many bytes it uses if it is multi byte and also whether each byte has 94 or 96 permitted values Code structure editNotation and nomenclature edit ISO IEC 2022 coding specifies a two layer mapping between character codes and displayed characters Escape sequences allow any of a large registry of graphic character sets to be designated 21 into one of four working sets named G0 through G3 and shorter control sequences specify the working set that is invoked 22 to interpret bytes in the stream Encoding byte values bit combinations are often given in column line notation where two decimal numbers in the range 00 15 each corresponding to a single hexadecimal digit are separated by a slash 23 Hence for instance codes 2 0 0x20 through 2 15 0x2F inclusive may be referred to as column 02 This is the notation used in the ISO IEC 2022 ECMA 35 standard itself 24 They may be described elsewhere using hexadecimal as is often used in this article or using the corresponding ASCII characters 25 although the escape sequences are actually defined in terms of byte values and the graphic assigned to that byte value may be altered without affecting the control sequence Byte values from the 7 bit ASCII graphic range hexadecimal 0x20 0x7F being on the left side of a character code table are referred to as GL codes with GL standing for graphics left while bytes from the high ASCII range 0xA0 0xFF if available i e in an 8 bit environment are referred to as the GR codes graphics right 5 The terms CL 0x00 0x1F and CR 0x80 0x9F are defined for the control ranges but the CL range always invokes the primary C0 controls whereas the CR range always either invokes the secondary C1 controls or is unused 5 Fixed coded characters edit The delete character DEL 0x7F the escape character ESC 0x1B and the space character SP 0x20 are designated fixed coded characters 26 and are always available 
when G0 is invoked over GL irrespective of what character sets are designated They may not be included in graphical character sets although other sizes or types of whitespace character may be 27 General syntax of escape sequences edit Sequences using the ESC escape character take the form ESC var style padding right 1px I var var style padding right 1px F var where the ESC character is followed by zero or more intermediate bytes 28 I from the range 0x20 0x2F and one final byte 29 F from the range 0x30 0x7E 30 The first I byte or absence thereof determines the type of escape sequence it might for instance designate a working set or denote a single control function In all types of escape sequences F bytes in the range 0x30 0x3F are reserved for unregistered private uses defined by prior agreement between parties 31 Control functions from some sets may make use of further bytes following the escape sequence proper For example the ISO 6429 control function Control Sequence Introducer which can be represented using an escape sequence is followed by zero or more bytes in the range 0x30 0x3F then zero or more bytes in the range 0x20 0x2F then by a single byte in the range 0x40 0x7E the entire sequence being called a control sequence 32 Graphical character sets edit Each of the four working sets G0 through G3 may be a 94 character set or a 94n character multi byte set Additionally G1 through G3 may be a 96 or 96n character set In a 96 or 96n character set the bytes 0x20 through 0x7F when GL invoked or 0xA0 through 0xFF when GR invoked are allocated to and may be used by the set In a 94 or 94n character set the bytes 0x20 and 0x7F are not used 33 When a 96 or 96n character set is invoked in the GL region the space and delete characters codes 0x20 and 0x7F are not available until a 94 or 94n character set such as the G0 set is invoked in GL 5 96 character sets cannot be designated to G0 Registration of a set as a 96 character set does not necessarily mean that the 0x20 A0 
and 0x7F FF bytes are actually assigned by the set some examples of graphical character sets which are registered as 96 sets but do not use those bytes include the G1 set of I S 434 34 the box drawing set from ISO IEC 10367 35 and ISO IR 164 a subset of the G1 set of ISO 8859 8 with only the letters used by CCITT 36 Combining characters edit Characters are expected to be spacing characters not combining characters unless specified otherwise by the graphical set in question 37 ISO 2022 ECMA 35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters as well as the CSI sequence Graphic Character Combination GCC 37 CSI 0x20 SP 0x5F 38 Use of the backspace and carriage return in this manner is permitted by ISO IEC 646 but prohibited by ISO IEC 4873 ECMA 43 39 and by ISO IEC 8859 40 41 on the basis that it leaves the graphical character repertoire undefined ISO IEC 4873 ECMA 43 does however permit the use of the GCC function provided that the sequence of characters is kept the same and merely displayed in one space rather than being over stamped to form a character with a different meaning 42 Control character sets edit Control character sets are classified as primary or secondary control code sets 43 respectively also called C0 and C1 control code sets 44 A C0 control set must contain the ESC escape control character at 0x1B 45 a C0 set containing only ESC is registered as ISO IR 104 46 whereas a C1 control set may not contain the escape control whatsoever 33 Hence they are entirely separate registrations with a C0 set being only a C0 set and a C1 set being only a C1 set 44 If codes from the C0 set of ISO 6429 ECMA 48 i e the ASCII control codes appear in the C0 set they are required to appear at their ISO 6429 ECMA 48 locations 45 Inclusion of transmission control characters in the C0 set besides the ten included by ISO 6429 ECMA 48 namely SOH STX ETX EOT ENQ ACK DLE NAK SYN and ETB 47 or 
inclusion of any of those ten in the C1 set is also prohibited by the ISO IEC 2022 ECMA 35 standard 45 33 A C0 control set is invoked over the CL range 0x00 through 0x1F 48 whereas a C1 control function may be invoked over the CR range 0x80 through 0x9F in an 8 bit environment or by using escape sequences in a 7 bit or 8 bit environment 43 but not both Which style of C1 invocation is used must be specified in the definition of the code version 49 For example ISO IEC 4873 specifies CR bytes for the C1 controls which it uses SS2 and SS3 50 If necessary which invocation is used may be communicated using announcer sequences In the latter case single control functions from the C1 control code set are invoked using type Fe escape sequences 33 meaning those where the ESC control character is followed by a byte from columns 04 or 05 that is to say ESC 0x40 through ESC 0x5F 51 Other control functions edit Additional control functions are assigned to type Fs escape sequences in the range ESC 0x60 through ESC 0x7E these have permanently assigned meanings rather than depending on the C0 or C1 designations 51 52 Registration of control functions to type Fs sequences must be approved by ISO IEC JTC 1 SC 2 52 Other single control functions may be registered to type 3Ft escape sequences in the range ESC 0x23 var style padding right 1px I var 0x40 through ESC 0x23 var style padding right 1px I var 0x7E 53 although no 3Ft sequences are currently assigned as of 2019 54 Some of these are specified in ECMA 35 ISO 2022 ANSI X3 41 others in ECMA 48 ISO 6429 ANSI X3 64 55 ECMA 48 refers to these as independent control functions 56 Code Hex Abbr Name Effect 54 ESC 1B 60 DMI Disable manual input Disables some or all of the manual input facilities of the device ESC a 1B 61 INT Interrupt Interrupts the current process ESC b 1B 62 EMI Enable manual input Enables the manual input facilities of the device ESC c 1B 63 RIS Reset to initial state Resets the device to its state after being powered 
on 57 ESC d 1B 64 CMD Coding method delimiter Used when interacting with an outer coding representation system see below ESC n 1B 6E LS2 Locking shift two Shift function see below ESC o 1B 6F LS3 Locking shift three Shift function see below ESC 1B 7C LS3R Locking shift three right Shift function see below ESC 1B 7D LS2R Locking shift two right Shift function see below ESC 1B 7E LS1R Locking shift one right Shift function see below Escape sequences of type Fp ESC 0x30 0 through ESC 0x3F or of type 3Fp ESC 0x23 var style padding right 1px I var 0x30 0 through ESC 0x23 var style padding right 1px I var 0x3F are reserved for single private use control codes by prior agreement between parties 58 Several such sequences of both types are used by DEC terminals such as the VT100 and are thus supported by terminal emulators 14 Shift functions edit By default GL codes specify G0 characters and GR codes where available specify G1 characters this may be otherwise specified by prior agreement The set invoked over each area may also be modified with control codes referred to as shifts as shown in the table below 59 An 8 bit code may have GR codes specifying G1 characters i e with its corresponding 7 bit code using Shift In and Shift Out to switch between the sets e g JIS X 0201 60 although some instead have GR codes specifying G2 characters with the corresponding 7 bit code using a single shift code to access the second set e g T 51 61 The codes shown in the table below are the most common encodings of these control codes conforming to ISO IEC 6429 The LS2 LS3 LS1R LS2R and LS3R shifts are registered as single control functions and are always encoded as the escape sequences listed below 54 whereas the others are part of a C0 or C1 control code set as shown below SI LS0 and SO LS1 are C0 controls and SS2 and SS3 are C1 controls meaning that their coding and availability may vary depending on which control sets are designated they must be present in the designated control sets if 
their functionality is used.[48][49] The C1 controls themselves, as mentioned above, may be represented using escape sequences or 8-bit bytes, but not both.

Alternative encodings of the single shifts as C0 control codes are available in certain control code sets. For example, SS2 and SS3 are usually available at 0x19 and 0x1D respectively in T.51[61] and T.61.[62] This coding is currently recommended by ISO/IEC 2022 / ECMA-35 for applications requiring 7-bit single-byte representations of SS2 and SS3,[63] and may also be used for SS2 only,[64] although older code sets with SS2 at 0x1C also exist,[65][66][67] and were mentioned as such in an earlier edition of the standard.[68] The 0x8E and 0x8F coding of the single shifts, as shown below, is mandatory for ISO/IEC 4873 levels 2 and 3.[69]

| Code | Hex | Abbr. | Name | Effect |
|---|---|---|---|---|
| SI | 0F | SI, LS0 | Shift In, Locking shift zero | GL encodes G0 from now on.[70][71] |
| SO | 0E | SO, LS1 | Shift Out, Locking shift one | GL encodes G1 from now on.[70][71] |
| ESC n | 1B 6E | LS2 | Locking shift two | GL encodes G2 from now on.[70][71] |
| ESC o | 1B 6F | LS3 | Locking shift three | GL encodes G3 from now on.[70][71] |
| SS2 (CR area), ESC N (escape code) | 8E (CR area), 1B 4E (escape code) | SS2 | Single shift two | GL or GR (see below) encodes G2 for the immediately following character only.[72] |
| SS3 (CR area), ESC O (escape code) | 8F (CR area), 1B 4F (escape code) | SS3 | Single shift three | GL or GR (see below) encodes G3 for the immediately following character only.[72] |
| ESC ~ | 1B 7E | LS1R | Locking shift one right | GR encodes G1 from now on.[73] |
| ESC } | 1B 7D | LS2R | Locking shift two right | GR encodes G2 from now on.[73] |
| ESC \| | 1B 7C | LS3R | Locking shift three right | GR encodes G3 from now on.[73] |

Although officially considered shift codes and named accordingly, single-shift codes are not always viewed as shifts,[12] and they may simply be viewed as prefix bytes (i.e. the first bytes in a multi-byte sequence),[11] since they do not require the encoder to keep the currently active set as state, unlike locking shift codes. In 8-bit environments, either GL or GR, but not both, may be used as the single-shift
area. This must be specified in the definition of the code version.[72] For instance, ISO/IEC 4873 specifies GL, whereas packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area.[74][75] If necessary, which single-shift area is used may be communicated using announcer sequences.

The names "locking shift zero" (LS0) and "locking shift one" (LS1) refer to the same pair of C0 control characters (0x0F and 0x0E) as the names "shift in" (SI) and "shift out" (SO). However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments.[59]

The ISO/IEC 2022 / ECMA-35 standard permits, but discourages, invoking G1, G2 or G3 in both GL and GR simultaneously.[76]

Registration of graphical and control code sets

The ISO International register of coded character sets to be used with escape sequences (ISO-IR) lists graphical character sets, control code sets, single control codes and so forth which have been registered for use with ISO/IEC 2022. The procedure for registering codes and sets with the ISO-IR registry is specified by ISO/IEC 2375. Each registration receives a unique escape sequence and a unique registry entry number to identify it.[77][78] For example, the CCITT character set for Simplified Chinese is known as ISO-IR-165.

Registration of coded character sets with the ISO-IR registry identifies the documents specifying the character set or control function associated with an ISO/IEC 2022 non-private-use escape sequence. This may be a standard document; however, registration does not create a new ISO standard, does not commit the ISO or IEC to adopt it as an international standard, and does not commit the ISO or IEC to add any of its characters to the Universal Coded Character Set.[79]

ISO-IR registered escape sequences are also used, encapsulated in a Formal Public Identifier, to identify character sets used for numeric character references in SGML (ISO 8879). For example, the string "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0" can be used to identify the International Reference Version of ISO 646-1983,[80] and the HTML 4.01 specification uses "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6" to identify Unicode.[81] The textual representation of the escape sequence, included in the third element of the FPI, will be recognised by SGML implementations for supported character sets.[80]

Character set designations

Escape sequences to designate character sets take the form ESC I [I...] F. As mentioned above, the intermediate (I) bytes are from the range 0x20–0x2F, and the final (F) byte is from the range 0x30–0x7E. The first I byte (or, for a multi-byte set, the first two) identifies the type of character set and the working set it is to be designated to, whereas the F byte (and any additional I bytes) identify the character set itself, as assigned in the ISO-IR register (or, for the private-use escape sequences, by prior agreement).

Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned.[82] At the other extreme, no multibyte 96-sets have been registered, so the sequences below are strictly theoretical.

As with other escape sequence types, the range 0x30–0x3F is reserved for private-use F bytes,[31] in this case for private-use character set definitions (which might include unregistered sets defined by protocols such as ARIB STD-B24[83] or MARC-8,[3] or vendor-specific sets such as DEC Special Graphics).[84] However, in a graphical set designation sequence, if the second I byte (for a single-byte set) or the third I byte (for a double-byte set) is 0x20 (space), the set denoted is a "dynamically redefinable character set" (DRCS) defined by prior agreement,[85] which is also considered private use.[31] A graphical set
being considered a DRCS implies that it represents a font of exact glyphs, rather than a set of abstract characters.[86] The manner in which DRCS sets and associated fonts are transmitted, allocated and managed is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with F byte 0x40 (@);[87] however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as World System Teletext.[88]

There are also three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered when the contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted in place of the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set.[89]

There are additional (rarely used) features for switching control character sets, but this is a single-level lookup, in that (as noted above) the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences (as opposed to the graphical set ones) may also be used from within ISO/IEC 10646 (UCS/Unicode), in contexts where processing ANSI escape codes is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding.[90]

A table of escape sequence I bytes and the designation or other function which they perform is below.[91]

| Code | Hex | Abbr. | Name | Effect | Example |
|---|---|---|---|---|---|
| ESC SP F | 1B 20 F | ACS | Announce code structure | Specifies code features used, e.g. working sets (see below).[92] | ESC SP L (ISO 4873 level 1) |
| ESC ! F | 1B 21 F | CZD | C0-designate | F selects a C0 control character set to be used.[93] | ESC ! @ (ASCII C0 codes) |
| ESC " F | 1B 22 F | C1D | C1-designate | F selects a C1 control character set to be used.[94] | ESC " C (ISO 6429 C1 codes) |
| ESC # F | 1B 23 F | | (Single control function) | (Reserved for sequences for control functions, see above.) | ESC # 6 (private use: DEC Double Width Line)[95] |
| ESC $ F,[e] ESC $ ( F | 1B 24 F,[e] 1B 24 28 F | GZDM4 | G0-designate multibyte 94-set | F selects a 94^n-character set to be used for G0.[89] | ESC $ ( C (KS X 1001 in G0) |
| ESC $ ) F | 1B 24 29 F | G1DM4 | G1-designate multibyte 94-set | F selects a 94^n-character set to be used for G1.[89] | ESC $ ) A (GB 2312 in G1) |
| ESC $ * F | 1B 24 2A F | G2DM4 | G2-designate multibyte 94-set | F selects a 94^n-character set to be used for G2.[89] | ESC $ * B (JIS X 0208 in G2) |
| ESC $ + F | 1B 24 2B F | G3DM4 | G3-designate multibyte 94-set | F selects a 94^n-character set to be used for G3.[89] | ESC $ + D (JIS X 0212 in G3) |
| ESC $ , F | 1B 24 2C F | | (not used) | (not used)[f] | |
| ESC $ - F | 1B 24 2D F | G1DM6 | G1-designate multibyte 96-set | F selects a 96^n-character set to be used for G1.[89] | ESC $ - 1 (private use) |
| ESC $ . F | 1B 24 2E F | G2DM6 | G2-designate multibyte 96-set | F selects a 96^n-character set to be used for G2.[89] | ESC $ . 2 (private use) |
| ESC $ / F | 1B 24 2F F | G3DM6 | G3-designate multibyte 96-set | F selects a 96^n-character set to be used for G3.[89] | ESC $ / 3 (private use) |
| ESC % F | 1B 25 F | DOCS | Designate other coding system | Switches coding system, see below. | ESC % G (UTF-8) |
| ESC & F | 1B 26 F | IRR | Identify revised registration | Prefixes designation escape to denote revision.[g] | ESC & @ ESC $ B (JIS X 0208:1990 in G0) |
| ESC ' F | 1B 27 F | | (not used) | (not used) | |
| ESC ( F | 1B 28 F | GZD4 | G0-designate 94-set | F selects a 94-character set to be used for G0.[89] | ESC ( B (ASCII in G0) |
| ESC ) F | 1B 29 F | G1D4 | G1-designate 94-set | F selects a 94-character set to be used for G1.[89] | ESC ) I (JIS X 0201 Kana in G1) |
| ESC * F | 1B 2A F | G2D4 | G2-designate 94-set | F selects a 94-character set to be used for G2.[89] | ESC * v (ITU T.61 RHS in G2) |
| ESC + F | 1B 2B F | G3D4 | G3-designate 94-set | F selects a 94-character set to be used for G3.[89] | ESC + D (NATS-SEFI-ADD in G3) |
| ESC , F | 1B 2C F | | (not used) | (not used)[h] | |
| ESC - F | 1B 2D F | G1D6 | G1-designate 96-set | F selects a 96-character set to be used for G1.[89] | ESC - A (ISO 8859-1 RHS in G1) |
| ESC . F | 1B 2E F | G2D6 | G2-designate 96-set | F selects a 96-character set to be used for G2.[89] | ESC . B (ISO 8859-2 RHS in G2) |
| ESC / F | 1B 2F F | G3D6 | G3-designate 96-set | F selects a 96-character set to be used for G3.[89] | ESC / b (ISO 8859-15 RHS in G3) |

Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94^n-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. Indeed, without any intermediate
bytes, ESC A is a way of specifying the C1 control code 0x81.

Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).

Interaction with other coding systems

The standard also defines a way to specify coding systems that do not follow its own structure.

A sequence is also defined for returning to ISO/IEC 2022; the registrations which support this sequence as encoded in ISO/IEC 2022 comprise (as of 2019) various Videotex formats, UTF-8, and UTF-1.[99] A second I byte of 0x2F (/) is included in the designation sequences of codes which do not use that byte sequence to return to ISO 2022; they may have their own means to return to ISO 2022 (such as a different or padded sequence) or none at all.[100] All existing registrations of the latter type (as of 2019) are either transparent raw data, Unicode/UCS formats, or subsets thereof.[101]

| Code | Hex | Abbr. | Name | Effect |
|---|---|---|---|---|
| ESC % @ | 1B 25 40 | DOCS | Designate other coding system ("standard return") | Return to ISO/IEC 2022 from another encoding.[100] |
| ESC % F | 1B 25 F | DOCS | Designate other coding system ("with standard return")[99] | F selects an 8-bit code; use ESC % @ to return.[100] |
| ESC % / F | 1B 25 2F F | DOCS | Designate other coding system ("without standard return")[101] | F selects an 8-bit code; there is no standard way to return.[100] |
| ESC d | 1B 64 | CMD | Coding method delimiter | Denotes the end of an ISO/IEC 2022 coded sequence.[102] |

Of particular interest are the sequences which switch to ISO/IEC 10646 (Unicode) formats which do not follow the ISO/IEC 2022 structure. These include UTF-8, which does not reserve the range 0x80–0x9F for control characters; its predecessor UTF-1, which mixes GR and GL bytes in multi-byte codes; and UTF-16 and UTF-32, which use
wider coding units.[99][101]

Several codes were also registered for subsets (levels 1 and 2) of UTF-8, UTF-16 and UTF-32, as well as for three levels of UCS-2.[101] However, the only codes currently specified by ISO/IEC 10646 are the level-3 codes for UTF-8, UTF-16 and UTF-32 and the unspecified-level code for UTF-8, with the rest being listed as deprecated.[103] ISO/IEC 10646 stipulates that the big-endian formats of UTF-16 and UTF-32 are designated by their escape sequences.[104]

| Unicode format | Code(s) | Hex[103] | Deprecated codes | Deprecated hex[99][101][103] |
|---|---|---|---|---|
| UTF-1 | (UTF-1 not in current ISO/IEC 10646.) | | ESC % B | 1B 25 42 |
| UTF-8 | ESC % G, ESC % / I | 1B 25 47,[13] 1B 25 2F 49[105] | ESC % / G, ESC % / H | 1B 25 2F 47, 1B 25 2F 48 |
| UTF-16 | ESC % / L | 1B 25 2F 4C[106] | ESC % / @, ESC % / C, ESC % / E, ESC % / J, ESC % / K | 1B 25 2F 40, 1B 25 2F 43, 1B 25 2F 45, 1B 25 2F 4A, 1B 25 2F 4B |
| UTF-32 | ESC % / F | 1B 25 2F 46 | ESC % / A, ESC % / D | 1B 25 2F 41, 1B 25 2F 44 |

Of the sequences switching to UTF-8, ESC % G is the one supported by, for example, xterm.[14]

Although use of a variant of the standard return sequence from UTF-16 and UTF-32 is permitted, the bytes of the escape sequence must be padded to the size of the code unit of the encoding (i.e. 001B 0025 0040 for UTF-16), so the coding of the standard return sequence does not conform exactly to ISO/IEC 2022. For this reason, the designations for UTF-16 and UTF-32 use a without-standard-return syntax.[107]

For specifying encodings by labels, the X Consortium's Compound Text format defines five private-use DOCS sequences.[108]

Code structure announcements

The sequence "announce code structure" (ESC SP (0x20) F) is used to announce a specific code structure, or a specific group of ISO 2022 facilities which are used in a particular code version. Although announcements can be combined, certain contradictory combinations (specifically, using locking shift announcements 16–23 with announcements 1, 3 and 4) are prohibited by the standard, as is using additional announcements on top of ISO/IEC 4873 level announcements 12–14,[92] which
fully specify the permissible structural features. Announcement sequences are as follows:

| Number | Code | Hex | Code version feature announced[92] |
|---|---|---|---|
| 1 | ESC SP A | 1B 20 41 | G0 in GL, GR absent or unused, no locking shifts. |
| 2 | ESC SP B | 1B 20 42 | G0 and G1 invoked to GL by locking shifts, GR absent or unused. |
| 3 | ESC SP C | 1B 20 43 | G0 in GL, G1 in GR, no locking shifts, requires an 8-bit environment. |
| 4 | ESC SP D | 1B 20 44 | G0 in GL, G1 in GR if 8-bit, no locking shifts unless in a 7-bit environment. |
| 5 | ESC SP E | 1B 20 45 | Shift functions preserved during 7-bit/8-bit conversion. |
| 6 | ESC SP F | 1B 20 46 | C1 controls using escape sequences. |
| 7 | ESC SP G | 1B 20 47 | C1 controls in CR region in 8-bit environments, as escape sequences otherwise. |
| 8 | ESC SP H | 1B 20 48 | 94-character graphical sets only. |
| 9 | ESC SP I | 1B 20 49 | 94-character and/or 96-character graphical sets. |
| 10 | ESC SP J | 1B 20 4A | Uses a 7-bit code, even if an eighth bit is available for use. |
| 11 | ESC SP K | 1B 20 4B | Requires an 8-bit code. |
| 12 | ESC SP L | 1B 20 4C | Complies with ISO/IEC 4873 (ECMA-43) level 1. |
| 13 | ESC SP M | 1B 20 4D | Complies with ISO/IEC 4873 (ECMA-43) level 2. |
| 14 | ESC SP N | 1B 20 4E | Complies with ISO/IEC 4873 (ECMA-43) level 3. |
| 16 | ESC SP P | 1B 20 50 | SI / LS0 used. |
| 18 | ESC SP R | 1B 20 52 | SO / LS1 used. |
| 19 | ESC SP S | 1B 20 53 | LS1R used in 8-bit environments, SO used in 7-bit environments. |
| 20 | ESC SP T | 1B 20 54 | LS2 used. |
| 21 | ESC SP U | 1B 20 55 | LS2R used in 8-bit environments, LS2 used in 7-bit environments. |
| 22 | ESC SP V | 1B 20 56 | LS3 used. |
| 23 | ESC SP W | 1B 20 57 | LS3R used in 8-bit environments, LS3 used in 7-bit environments. |
| 26 | ESC SP Z | 1B 20 5A | SS2 used. |
| 27 | ESC SP [ | 1B 20 5B | SS3 used. |
| 28 | ESC SP \ | 1B 20 5C | Single shifts invoke over GR. |

ISO/IEC 2022 code versions

[Figure: Various ISO 2022 and other CJK encodings supported by Mozilla Firefox as of 2004. This support has been reduced in later versions to avoid certain cross-site scripting attacks.]

Six 7-bit ISO 2022 code versions (ISO-2022-CN, ISO-2022-CN-EXT, ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2 and ISO-2022-KR) are defined by IETF RFCs, of which ISO-2022-JP and ISO-2022-KR have been extensively used in the past.[109] A number of other variants are defined by vendors, including IBM.[110] Although UTF-8 is the preferred encoding in HTML5, legacy content in ISO-2022-JP remains sufficiently widespread that the WHATWG encoding standard retains support for it,[111] whereas it maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT[112] entirely to the replacement character,[113] due to concerns about code injection attacks such as cross-site scripting.[111][113]

8-bit code versions include Extended Unix Code.[11][12] The ISO/IEC 8859 encodings also follow ISO 2022, in a subset stipulated in ISO/IEC 4873.[9][10]

Japanese e-mail versions

ISO-2022-JP

ISO-2022-JP is a widely used encoding for Japanese, in particular in e-mail. It was introduced for use on the JUNET network and later codified in IETF RFC 1468, dated 1993.[114] It has an advantage over other encodings for Japanese in that it does not require 8-bit clean transmission. Microsoft calls it Code page 50220.[115] It starts in ASCII and includes the following escape sequences:

- ESC ( B to switch to ASCII (1 byte per character)
- ESC ( J to switch to JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
- ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
- ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)

Use of the two characters added in JIS X 0208-1990 is permitted, but without including the IRR sequence, i.e. using the same escape sequence as JIS X 0208-1983.[114] Also, because they were registered before it was possible to designate multi-byte sets to anything other than G0, the escapes for JIS X 0208 do not include the second I byte "(".[89]

The RFC notes that some existing systems did not distinguish ESC ( B from ESC ( J, or did not distinguish ESC $ @ from ESC $ B, but stipulates that the escape sequences should not be changed by systems simply relaying messages such as e-mails.[114] The WHATWG Encoding Standard referenced by HTML5 handles ESC ( B and ESC ( J distinctly, but treats ESC $ @ the same as ESC $ B when decoding, and uses only ESC $ B for
JIS X 0208 when encoding.[116] The RFC also notes that some past systems had made erroneous use of the sequence ESC ( H to switch away from JIS X 0208; that sequence is actually registered for ISO-IR-11 (a Swedish variant of ISO 646 and World System Teletext).[114][i]

Versions with halfwidth katakana

Use of ESC ( I to switch to the JIS X 0201-1976 Kana set (1 byte per character) is not part of the ISO-2022-JP profile,[114] but is also sometimes used. Python allows it in a variant which it labels ISO-2022-JP-EXT (which also incorporates JIS X 0212 as described below, completing coverage of EUC-JP);[117][118] this is close in both name and structure to an encoding denoted ISO-2022-JPext by DEC, which furthermore adds a two-byte user-defined region accessed with ESC $ ( 0 to complete the coverage of Super DEC Kanji.[119] The WHATWG/HTML5 variant permits decoding JIS X 0201 katakana in ISO-2022-JP input, but converts the characters to their JIS X 0208 equivalents upon encoding.[116] Microsoft's code page for ISO-2022-JP with JIS X 0201 kana additionally permitted is Code page 50221.[115]

Other, older variants known as JIS7 and JIS8 build directly on the 7-bit and 8-bit encodings defined by JIS X 0201 and allow use of JIS X 0201 kana from G1 without escape sequences, using Shift Out and Shift In or setting the eighth bit (GR-invoked), respectively.[120] They are not widely used;[120] JIS X 0208 support in extended 8-bit JIS X 0201 is more commonly achieved via Shift JIS. Microsoft's code page for JIS X 0201-based ISO 2022 with single-byte katakana via Shift Out and Shift In is Code page 50222.[115]

ISO-2022-JP-2

ISO-2022-JP-2 is a multilingual extension of ISO-2022-JP, defined in RFC 1554 (dated 1993), which permits the following escape sequences in addition to the ISO-2022-JP ones. The ISO/IEC 8859 parts are 96-character sets which cannot be designated to G0, and are accessed from G2 using the 7-bit escape sequence form of the single-shift code SS2.[121]

- ESC $ A to switch to GB 2312-1980 (2 bytes per character)
- ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
- ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
- ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character), designated to G2
- ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character), designated to G2

ISO-2022-JP with the ISO-2022-JP-2 representation of JIS X 0212, but not the other extensions, was subsequently dubbed ISO-2022-JP-1 by RFC 2237, dated 1997.[122]

IBM Japanese TCP

IBM implements nine 7-bit ISO 2022 based encodings for Japanese, each using a different set of escape sequences: IBM-956, IBM-957, IBM-958, IBM-959, IBM-5052, IBM-5053, IBM-5054, IBM-5055 and ISO-2022-JP, which are collectively termed "TCP/IP Japanese coded character sets".[123] CCSID 9148 is the standard (RFC 1468) ISO-2022-JP.[124]

IBM variants of ISO-2022-JP

| Code page / CCSID | ACRI definition number | Escape sequences for ACRI[110] |
|---|---|---|
| 956[125] | TCP-01 | ESC ( J (JIS X 0201 Roman); ESC $ ( B (JIS X 0208-1983, long escape sequence); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 957[126] | TCP-02 | ESC ( J (JIS X 0201 Roman); ESC $ ( @ (JIS X 0208-1978, long escape sequence); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 958[127] | TCP-03 | ESC ( A (ASCII); ESC $ ( B (JIS X 0208-1983, long escape sequence); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 959[128] | TCP-04 | ESC ( A (ASCII); ESC $ ( @ (JIS X 0208-1978, long escape sequence); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 5052[129] | TCP-05 | ESC ( J (JIS X 0201 Roman); ESC $ B (JIS X 0208-1983); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 5053[130] | TCP-06 | ESC ( J (JIS X 0201 Roman); ESC $ @ (JIS X 0208-1978); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 5054[131] | TCP-07 | ESC ( A (ASCII); ESC $ B (JIS X 0208-1983); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 5055[132] | TCP-08 | ESC ( A (ASCII); ESC $ @ (JIS X 0208-1978); ESC ( I (JIS X 0201 Katakana); ESC $ ( D (JIS X 0212) |
| 9148[124] | TCP-16 | ESC ( A (ASCII); ESC ( J (JIS X 0201 Roman); ESC $ @ (JIS X 0208-1978); ESC $ B (JIS X 0208-1983) |

JIS X 0213

The JIS X 0213 standard, first published in 2000, defines an updated version of ISO-2022-JP, without the ISO-2022-JP-2 extensions, named ISO-2022-JP-3. The additions made by JIS X 0213 compared to the base JIS X 0208 standard resulted in a new registration being made for the extended JIS plane 1, while the new plane 2 received its own registration. The further additions to plane 1 in the 2004 edition of the standard resulted in an additional registration being added to a further revision of the profile, dubbed ISO-2022-JP-2004. In addition to the basic ISO-2022-JP designation codes, the following designations are recognized:

- ESC ( I to switch to JIS X 0201-1976 Kana set (1 byte per character)
- ESC $ ( O to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
- ESC $ ( P to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
- ESC $ ( Q to switch to JIS X 0213-2004 Plane 1 (2 bytes per character, ISO-2022-JP-2004 only)

Other 7-bit versions

ISO-2022-KR is defined in RFC 1557, dated 1993.[133] It encodes ASCII and the Korean double-byte KS X 1001-1992,[134][135] previously named KS C 5601-1987. Unlike ISO-2022-JP-2, it makes use of the Shift Out and Shift In characters to switch between them, after including ESC $ ) C once at the start of a line to designate KS X 1001 to G1.[133]

ISO-2022-CN and ISO-2022-CN-EXT are defined in RFC 1922, dated 1996. They are 7-bit encodings making use both of the Shift Out and Shift In functions (to shift between G0 and G1), and of the 7-bit escape code forms of the single-shift functions SS2 and SS3 (to access G2 and G3).[136] They support the character sets GB 2312 (for simplified Chinese) and CNS 11643 (for traditional Chinese).

The basic ISO-2022-CN profile uses ASCII as its G0 (shift in) set, and also includes GB 2312 and the first two planes of CNS 11643, since these two planes are sufficient to represent all traditional Chinese characters from common Big5, to which the RFC provides a correspondence in an appendix:[136]

- ESC $ ) A to switch to GB 2312-1980 (2 bytes per character), designated to G1
- ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character), designated to G1
- ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character), designated to G2

The ISO-2022-CN-EXT profile permits the following additional sets and planes:[136]

- ESC $ ) E to switch to ISO-IR-165 (2 bytes per character), designated to G1
- ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character), designated to G3
- ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character), designated to G3
- ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character), designated to G3
- ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character), designated to G3
- ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character), designated to G3

The ISO-2022-CN-EXT profile further lists additional Guobiao standard graphical sets as being permitted, but conditional on their being assigned registered ISO 2022 escape sequences:[136]

- GB 12345 in G1
- GB 7589 or GB 13131 in G2
- GB 7590 or GB 13132 in G3

The character after the ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies the type of character set and the working set that it is designated to. In the above examples, the character ( (0x28) designates a 94-character set to the G0 character set, whereas ), * or + (0x29–0x2B) designates to the G1–G3 character sets.

ISO-2022-KR and ISO-2022-CN are used less frequently than ISO-2022-JP, and are sometimes deliberately not supported due to security concerns. Notably, the WHATWG Encoding Standard used by HTML5 maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT (as well as HZ-GB-2312) to the "replacement" decoder,[112] which maps all input to the replacement character (U+FFFD), in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server.[113] Although the same security concern (allowing sequences of ASCII bytes to be interpreted differently) also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.[111]

In April 2024, a security flaw[137] was found in the implementation of ISO-2022-CN-EXT in glibc, which led to
recommendations to disable the encoding entirely on Linux systems.[138]

ISO/IEC 4873

[Figure: Relationship between ECMA-43 (ISO/IEC 4873) editions and levels, and EUC.]

A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43. ISO/IEC 8859 defines 8-bit codes for ISO/IEC 4873 (or ECMA-43) level 1.[9][10]

ISO/IEC 4873 / ECMA-43 defines three levels of encoding:[139]

- Level 1, which includes a C0 set, the ASCII G0 set, an optional C1 set and an optional single-byte (94-character or 96-character) G1 set. G0 is invoked over GL, and G1 is invoked over GR. Use of shift functions is not permitted.
- Level 2, which includes a (94-character or 96-character) single-byte G2 and/or G3 set in addition to a mandatory G1 set. Only the single-shift functions SS2 and SS3 are permitted (i.e. locking shifts are forbidden), and they invoke over the GL region (including 0x20 and 0x7F in the case of a 96-set). SS2 and SS3 must be available in C1 at 0x8E and 0x8F respectively. This minimal required C1 set for ISO 4873 is registered as ISO-IR-105.[69]
- Level 3, which permits the GR locking-shift functions LS1R, LS2R and LS3R in addition to the single shifts, but otherwise has the same restrictions as level 2.

Earlier editions of the standard permitted non-ASCII assignments in the G0 set, provided that the ISO/IEC 646 invariant positions were preserved, that the other positions were assigned to spacing (not combining) characters, that 0x23 was assigned to either £ or #, and that 0x24 was assigned to either $ or ¤.[140] For instance, the 8-bit encoding of JIS X 0201 is compliant with earlier editions. This was subsequently changed to fully specify the ISO/IEC 646:1991 IRV / ISO-IR No. 6 set (ASCII).[141][142][143]

The use of the ISO/IEC 646 IRV (synchronised with ASCII since 1991) at ISO/IEC 4873 Level 1 with no C1 or G1 set, i.e. using the IRV in an 8-bit environment in which shift codes are not used and the high bit is always zero, is known as ISO 4873 DV, in which DV stands for Default
Version.[144]

In cases where duplicate characters are available in different sets, the current edition of ISO/IEC 4873 / ECMA-43 only permits using these characters in the lowest numbered working set which they appear in.[145] For instance, if a character appears in both the G1 set and the G3 set, it must be used from the G1 set. However, use from other sets is noted as having been permitted in earlier editions.[143]

ISO/IEC 8859 defines complete encodings at level 1 of ISO/IEC 4873, and does not allow for use of multiple ISO/IEC 8859 parts together. It stipulates that ISO/IEC 10367 should be used instead for levels 2 and 3 of ISO/IEC 4873.[9][10] ISO/IEC 10367:1991 includes G0 and G1 sets matching those used by the first 9 parts of ISO/IEC 8859 (i.e. those which existed as of 1991, when it was published), and some supplementary sets.[146]

Character set designation escape sequences are used for identifying or switching between versions during information interchange only if required by a further protocol, in which case the standard requires an ISO/IEC 2022 announcer sequence specifying the ISO/IEC 4873 level, followed by a complete set of escapes specifying the character set designations for C0, C1, G0, G1, G2 and G3 respectively (but omitting G2 and G3 designations for level 1), with an F byte of 0x7E denoting an empty set. Each ISO/IEC 4873 level has its own single ISO/IEC 2022 announcer sequence, as follows:[147]

| Code | Hex | Announcement |
|---|---|---|
| ESC SP L | 1B 20 4C | ISO 4873 Level 1 |
| ESC SP M | 1B 20 4D | ISO 4873 Level 2 |
| ESC SP N | 1B 20 4E | ISO 4873 Level 3 |

Extended Unix Code

Main article: Extended Unix Code

Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese. It is based on ISO 2022, and only character sets which conform to the ISO 2022 structure can have EUC forms. Up to four coded character sets can be represented (in G0, G1, G2 and G3). The G0 set is invoked over GL, the G1 set is invoked over GR, and the G2 and G3 sets are, if
present, invoked using the single shifts SS2 and SS3, which are used as CR bytes (i.e. 0x8E and 0x8F respectively) and invoke over GR (not GL).[11] Locking shift codes are not used.[12]

The code assigned to the G0 set is ASCII, or the country's national ISO 646 character set such as KS-Roman (KS X 1003) or JIS-Roman (the lower half of JIS X 0201).[11] Hence, 0x5C (backslash in US-ASCII) is used to represent a Yen sign in some versions of EUC-JP and a Won sign in some versions of EUC-KR.

G1 is used for a 94x94 coded character set represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes (i.e. SS3 plus two bytes), whereas a single character in EUC-TW can take up to four bytes (i.e. SS2 plus three bytes).

The EUC code itself does not make use of the announcer or designation sequences from ISO 2022; however, it corresponds to the following sequence of four announcer sequences, with meanings breaking down as follows.[148]

| Individual sequence | Hexadecimal | Feature of EUC denoted |
|---|---|---|
| ESC SP C | 1B 20 43 | ISO-8 (8-bit, G0 in GL, G1 in GR) |
| ESC SP Z | 1B 20 5A | G2 accessed using SS2 |
| ESC SP [ | 1B 20 5B | G3 accessed using SS3 |
| ESC SP \ | 1B 20 5C | Single shifts invoke over GR |

Compound Text (X11)

The X Consortium defined an ISO 2022 profile named Compound Text as an interchange format in 1989.[149] This uses only four control codes: HT (0x09), NL (newline, coded as LF, 0x0A), ESC (0x1B) and CSI (in its 8-bit representation 0x9B),[150] with the SDS (CSI ... ]) CSI sequence being used for bidirectional text control.[151] It is an 8-bit code using G0 and G1 for GL and GR and follows ISO 8859-1 in its initial state.[152] The following F bytes are used:

ISO 2022 designation sequences used in X11 Compound Text[153]

| Escape sequence type | Final byte | Graphical set |
|---|---|---|
| GZD4, G1D4 (for 94-character sets) | B (0x42) | ASCII |
| GZD4, G1D4 (for 94-character sets) | I (0x49) | JIS X 0201 katakana |
| GZD4, G1D4 (for 94-character sets) | J (0x4A) | JIS X 0201 Roman |
| G1D6 (for 96-character sets) | A (0x41) | ISO 8859-1 high part |
| G1D6 (for 96-character sets) | B (0x42) | ISO 8859-2 high part |
| G1D6 (for 96-character sets) | C (0x43) | ISO 8859-3 high part |
| G1D6 (for 96-character sets) | D (0x44) | ISO 8859-4 high part |
| G1D6 (for 96-character sets) | F (0x46) | ISO 8859-7 high part |
| G1D6 (for 96-character sets) | G (0x47) | ISO 8859-6 high part |
| G1D6 (for 96-character sets) | H (0x48) | ISO 8859-8 high part |
| G1D6 (for 96-character sets) | L (0x4C) | ISO 8859-5 high part |
| G1D6 (for 96-character sets) | M (0x4D) | ISO 8859-9 high part |
| GZDM4, G1DM4 (for 2-byte sets) | A (0x41) | GB 2312 |
| GZDM4, G1DM4 (for 2-byte sets) | B (0x42) | JIS X 0208 |
| GZDM4, G1DM4 (for 2-byte sets) | C (0x43) | KS C 5601 |

For specifying encodings by labels, X11 Compound Text defines five private-use DOCS sequences: ESC % / 0 (1B 25 2F 30) for variable-length encodings, and ESC % / 1 through ESC % / 4 for fixed-length encodings using one through four bytes respectively. Rather than using another escape sequence to return to ISO 2022, the two bytes following the initial escape sequence specify the remaining length in bytes, coded in base-128 using bytes 0x80–FF. The encoding label is included in ISO 8859-1 before the encoded text, and terminated with STX (0x02).[108]

Comparison with other encodings

Advantages

As ISO/IEC 2022's entire range of graphical character encodings can be invoked over GL, the available glyphs are not significantly limited by an inability to represent GR and C1, such as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. Generally, this 7-bit compatibility is not really an advantage, except for backwards compatibility with older systems. The vast majority of modern computers use 8 bits for each byte.

As compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using sequence codes to switch between discrete encodings for different East Asian languages. This avoids the issues[citation needed] associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.

Disadvantages

Since ISO/IEC 2022 is a stateful encoding, a program cannot jump into the middle of a block of text to search, insert or delete characters. This makes manipulation of the text very cumbersome and slow when compared to non-stateful encodings. Any jump into the middle of the text may require a backup to the previous escape
sequence before the bytes following the escape sequence can be interpreted.

Due to the stateful nature of ISO/IEC 2022, an identical and equivalent character may be encoded in different character sets, which may be designated to any of G0 through G3, which may be invoked using single shifts or by using locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.

Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 (e.g. "ISO 2022 IR 100"[154]) in addition to supporting several other encodings.[155] This type of variation makes it difficult to portably transfer text between computer systems.

UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022's representation of 8-bit control characters, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.

Because of its escape sequences, it is possible to construct attack byte sequences in which a malicious string (such as cross-site scripting) is masked until it is decoded to Unicode, which may allow it to bypass sanitisation.[156] Use of this encoding is thus treated as suspicious by malware protection suites,[157][better source needed] and 7-bit ISO 2022 data (except for ISO-2022-JP) is mapped in its entirety to the replacement character in HTML5 to prevent attacks.[112][113] Restricted ISO 2022 8-bit code versions which do not use designation escapes or locking shift codes, such as Extended Unix Code, do not share this problem.

Concatenation can pose issues. Profiles such as ISO-2022-JP specify that the stream starts in the ASCII state and must end in the ASCII state.[114] This is necessary to ensure that characters in concatenated ISO-2022-JP and/or ASCII streams will be interpreted in the correct set. This has the consequence that if a stream that ends in a multi-byte
character is concatenated with one that starts with a multi-byte character, a pair of escape codes is generated, switching to ASCII and immediately away from it. However, as stipulated in Unicode Technical Report #36 ("Unicode Security Considerations"), pairs of ISO 2022 escape sequences with no characters between them should generate a replacement character (U+FFFD) to prevent them from being used to mask malicious sequences such as cross-site scripting.[158] Implementing this measure, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected characters being generated where two ISO-2022-JP streams have been concatenated.[156]

See also

ISO 2709
ISO/IEC 646
ISO-IR-102
C0 and C1 control codes
CJK characters
MARC standards
Mojibake
luit
ISO/IEC JTC 1/SC 2

Footnotes

Japanese: 区点, romanized: kuten; Chinese: 区位; pinyin: quwei; Korean: 행렬; Hanja: 行列; RR: haeng-nyeol
Japanese: 区, romanized: ku, lit. "zone"; Chinese: 区; pinyin: qu; Korean: 행; Hanja: 行; RR: haeng
Japanese: 点, romanized: ten, lit. "point"; Chinese: 位; pinyin: wei, lit. "position"; Korean: 열; Hanja: 列; RR: yeol
Japanese: 面, romanized: men, lit. "face"
Specified for F bytes 0x40 (@), 0x41 (A) and 0x42 (B) only, for historical reasons.[89] Some implementations, such as the SoftBank 2G emoji encoding, use additional escapes of this form for non-ISO-2022-compliant purposes.[96]
Listed by MARC-8.[3] See footnote for ESC , F below for background.
F, adjusted to the range 1–63, indicates which (upwardly compatible) revision of the immediately following registration is needed, so that old systems know that they are old.[97]
In earlier editions, 96-character sets did not exist, and the escape codes now used for 96-character sets were reserved as space for additional 94-character sets. Accordingly, the ESC , (0x1B 0x2C) sequence was defined in early editions of the standard as designating further 94-character sets to G0.[98] Since 96-character sets cannot be designated to G0, this first I byte is not used by the current edition of the standard. However, it is still listed by
MARC-8.[3]
See also, for instance, Printronix (2012), OKI Programmer's Reference Manual (PDF), p. 26, for a more recent system which uses ESC H to switch to ASCII from a DBCS.

References

ECMA-35 (1994), Brief History
ECMA-35 (1994), p. 51, annex D
"Technique 2: Using standard alternate graphic character sets". MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Library of Congress. 2007-12-05. Archived from the original on 2020-07-22. Retrieved 2020-07-19.
"ECMA-35: Character code structure and extension techniques" (web page). Ecma International. Archived from the original on 2022-04-25. Retrieved 2022-04-27.
ECMA-35 (1994), pp. 15–16, chapter 8.1
ECMA-35 (1994), chapter 13
ECMA-35 (1994), chapters 12, 14
ECMA-35 (1994), chapter 11
ISO/IEC FDIS 8859-10 (1998), p. 1, chapter 1 ("Scope")
ECMA-144 (2000), p. 1, chapter 1 ("Scope")
Lunde (2008), pp. 242–245, "Chapter 4: Encoding Methods" (section "EUC encoding")
Lunde (2008), pp. 253–255, "Chapter 4: Encoding Methods" (section "EUC versus ISO-2022 encodings")
ISO-IR-196 (1996)
Moy, Edward; Gildea, Stephen; Dickey, Thomas. "Controls beginning with ESC". XTerm Control Sequences. Archived from the original on 2019-10-10. Retrieved 2019-10-04.
ECMA-35 (1994), chapters 6, 7
ECMA-35 (1994), chapter 8
ECMA-35 (1994), chapter 9
ECMA-35 (1994), chapter 15
Lunde (2008), pp. 228–234, "Chapter 4: Encoding Methods" (section "ISO-2022 encoding")
Lunde (2008), pp. 19–20, "Chapter 1: CJKV Information Processing Overview" (section "What are Row-Cell and Plane-Row-Cell?")
ECMA-35 (1994), p. 4, definition 4.11
ECMA-35 (1994), p. 5, definition 4.18
See, for instance, ISO-IR-14 (1975), defining the G0 designation of the JIS X 0201 Roman set as ESC 2/8 4/10.
ECMA-35 (1994), p. 5, chapter 5.1
See, for instance, RFC 1468 (1993), defining the G0 designation of the JIS X 0201 Roman set as ESC ( J.
ECMA-35 (1994), p. 7, chapter 6.2
ECMA-35 (1994), p. 10, chapter 6.3.2
ECMA-35 (1994), p. 4, definition 4.17
ECMA-35 (1994), p. 4, definition 4.14
ECMA-35 (1994), p. 28, chapter 13.1
ECMA-35 (1994), p. 33, chapter 13.3.3
ECMA-48 (1991),
pp. 24–26, chapter 5.4
ECMA-35 (1994), p. 11, chapter 6.4.3
ISO-IR-208 (1999)
ISO-IR-155 (1990)
ISO-IR-164 (1992)
ECMA-35 (1994), p. 10, chapter 6.3.3
Google Inc. (2014). "ansi.go, line 134". ANSI escape sequence library for Go. Archived from the original on 2022-04-30. Retrieved 2019-09-14.
ECMA-43 (1991), p. 5, chapter 7 ("Specification of the characters of the 8-bit code")
ISO/IEC FDIS 8859-10 (1998), p. 3, chapter 6 ("Specification of the coded character set")
ECMA-144 (2000), p. 3, chapter 6 ("Specification of the coded character set")
ECMA-43 (1991), p. 19, annex C ("Composite graphic characters")
ECMA-35 (1994), p. 10, chapter 6.4.1
ECMA-35 (1994), p. 11, chapter 6.4.4
ECMA-35 (1994), p. 11, chapter 6.4.2
ISO-IR-104 (1985)
ISO-IR-1 (1975)
ECMA-35 (1994), p. 19, chapter 8.5.1
ECMA-35 (1994), p. 19, chapter 8.5.2
ECMA-43 (1991), p. 8, chapter 7.6 ("C1 set")
ECMA-35 (1994), p. 29, chapter 13.2.1
ECMA-35 (1994), p. 12, chapter 6.5.1
ECMA-35 (1994), p. 12, chapter 6.5.2
ISO-IR, p. 19, chapter 2.7 ("Single control functions")
ECMA-35 (1994), p. 12, chapter 6.5.4
ECMA-48 (1991), chapter 5.5
ISO/TC 97/SC 2 (1976-12-30). Reset to Initial State (RIS) (PDF). ITSCJ/IPSJ. ISO-IR-35.
ECMA-35 (1994), p. 12, chapter 6.5.3
ECMA-35 (1994), p. 14, chapter 7.3, table 2
ISO-IR-14 (1975)
ITU-T (1995-08-11). Recommendation T.51 (1992) Amendment 1. Archived from the original on 2020-08-02. Retrieved 2019-12-25.
ISO-IR-106 (1985)
ECMA-35 (1994), p. 15, chapter 7.3, note 23
ISO-IR-140 (1987)
ISO-IR-7 (1975)
ISO-IR-26 (1976)
ISO-IR-36 (1977)
ECMA-35 (1980), p. 8, chapter 5.1.7
ISO-IR-105 (1985)
ECMA-35 (1994), p. 17, chapter 8.3.1
ECMA-35 (1994), p. 23, chapter 9.3.1
ECMA-35 (1994), p. 19, chapter 8.4
ECMA-35 (1994), p. 17, chapter 8.3.2
ECMA-35 (1994), pp. 23–24, chapter 9.4
ECMA-35 (1994), p. 27, chapter 11.1
ECMA-35 (1994), p. 17, chapter 8.3.3
ECMA-35 (1994), p. 47, annex B
ISO-IR, p. 2, chapter 1 ("Introduction")
ISO/IEC 2375 (2003)
"Handling of the SGML declaration in SP". SP: an SGML System Conforming to International
Standard ISO 8879.
"20: SGML Declaration of HTML 4". HTML 4.01 Specification. W3C.
ISO-IR, p. 10, chapter 2.2 ("94-Character graphic character set with second Intermediate byte")
ARIB STD B24 (2008), p. 39, part 2, Table 7-3
Mascheck, Sven; Le Breton, Stefan; Hamilton, Richard L. "About the 'alternate linedrawing character set'". sven mascheck. Archived from the original on 2019-12-29. Retrieved 2020-01-08.
ECMA-35 (1994), p. 36, chapter 14.4
ECMA-35 (1994), p. 36, chapter 14.4.2, note 48
ECMA-35 (1994), p. 36, chapter 14.4.2, note 47
ETS 300 706 (1997), p. 103, chapter 14 ("Dynamically Re-definable Characters")
ECMA-35 (1994), pp. 35–36, chapter 14.3.2
ISO/IEC 10646 (2017), pp. 19–20, chapter 12.4 ("Identification of control function set")
ECMA-35 (1994), p. 32, table 5
ECMA-35 (1994), pp. 37–41, chapter 15.2
ECMA-35 (1994), p. 34, chapter 14.2.2
ECMA-35 (1994), p. 34, chapter 14.2.3
Digital. "DECDWL: Double-Width, Single-Height Line". VT510 Video Terminal Programmer Information. Archived from the original on 2020-08-02. Retrieved 2020-01-17.
Kawasaki, Yusuke (2010). "Encode::JP::Emoji::Encoding". Encode-JP-Emoji. Line 268. Archived from the original on 2022-04-30. Retrieved 2020-05-28.
ECMA-35 (1994), pp. 36–37, chapter 14.5
ECMA-35 (1980), pp. 14–15, chapter 5.3.7
ISO-IR, p. 20, chapter 2.8.1 ("Coding systems with Standard return")
ECMA-35 (1994), pp. 41–42, chapter 15.4
ISO-IR, p. 21, chapter 2.8.2 ("Coding systems without Standard return")
ECMA-35 (1994), p. 41, chapter 15.3
ISO/IEC 10646 (2017), p. 19, chapter 12.2 ("Identification of a UCS encoding scheme")
ISO/IEC 10646 (2017), pp. 18–19, chapter 12.1 ("Purpose and context of identification")
ISO-IR-192 (1996)
ISO-IR-195 (1996)
ISO/IEC 10646 (2017), p. 20, chapter 12.5 ("Identification of the coding system of ISO/IEC 2022")
Scheifler (1989), "Non-Standard Character Set Encodings"
Lunde (2008), pp. 229–230, "Chapter 4: Encoding Methods" (section "ISO-2022 encoding"): "Those encodings that have been extensively used in the past, or continue to be used today for some purposes, have been highlighted."
"Additional Coding-related
Required Information". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2015-01-07.
WHATWG Encoding Standard, section 2 ("Security background")
WHATWG Encoding Standard, chapter 4.2 ("Names and labels"), anchor "replacement"
WHATWG Encoding Standard, section 14.1 ("replacement")
RFC 1468 (1993)
"Code Page Identifiers". Windows Dev Center. Microsoft. Archived from the original on 2019-06-16. Retrieved 2019-09-16.
WHATWG Encoding Standard, section 12.2 ("ISO-2022-JP")
Chang, Hye-Shik. "Modules/cjkcodecs/_codecs_iso2022.c, line 1122". cPython source tree. Python Software Foundation. Archived from the original on 2022-04-30. Retrieved 2019-09-15.
"codecs: Codec registry and base classes, Standard Encodings". Python 3.7.4 documentation. Python Software Foundation. Archived from the original on 2019-07-28. Retrieved 2019-09-16.
"2: Codesets and Codeset Conversion". DIGITAL UNIX Technical Reference for Using Japanese Features. Digital Equipment Corporation, Compaq. [dead link]
Lunde (2008), pp. 236–238, "Chapter 4: Encoding Methods" (section "The predecessor of ISO-2022-JP encoding: JIS encoding")
RFC 1554 (1993)
RFC 2237 (1997)
"PQ02042: New Function to Provide C/370 iconv Support for Japanese ISO-2022-JP". IBM. 2021-01-19. Archived from the original on 2022-01-04. Retrieved 2022-01-04.
"CCSID 9148". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
"CCSID 956". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-02.
"CCSID 957". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-30.
"CCSID 958". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-01.
"CCSID 959". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-12-02.
"CCSID 5052". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
"CCSID 5053". IBM Globalization: Coded Character Set Identifiers. IBM.
Archived from the original on 2014-11-29.
"CCSID 5054". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
"CCSID 5055". IBM Globalization: Coded Character Set Identifiers. IBM. Archived from the original on 2014-11-29.
RFC 1557 (1993)
"KS X 1001:1992" (PDF). Archived (PDF) from the original on 2007-09-26. Retrieved 2007-07-12.
ISO-IR-149 (1988)
RFC 1922 (1996)
"CVE-2024-2961: GLIBC Vulnerability on Servers Serving PHP".
ECMA-43 (1991), pp. 9–10, chapter 8 ("Levels")
ECMA-43 (1985), pp. 7–11, chapter 7.3 ("The G0 set")
ECMA-43 (1991), pp. 6–8, chapter 7.4 ("G0 set")
ECMA-43 (1991), p. 11, chapter 10.3 ("Identification of a version")
ECMA-43 (1991), p. 23, annex E ("Main differences between the second edition (1985) and the present (third) edition of this ECMA Standard")
IPTC (1995). The IPTC Recommended Message Format (PDF) (5th ed.). IPTC TEC 7901. Archived (PDF) from the original on 2022-01-25. Retrieved 2020-01-14.
ECMA-43 (1991), p. 10, chapter 9.2 ("Unique coding of characters")
van Wingen, Johan W. (1999). "8. Code Extension, ISO 2022 and 2375, ISO 4873 and 10367". Character sets: Letters, tokens and codes. Terena. Archived from the original on 2020-08-01. Retrieved 2019-10-02.
ECMA-43 (1991), pp. 10–11, chapter 10 ("Identification of version and level")
IBM. Character Data Representation Architecture (CDRA). IBM. pp. 157–162. Archived from the original on 2019-06-23. Retrieved 2020-06-18.
Scheifler (1989)
Scheifler (1989), "Control Characters"
Scheifler (1989), "Directionality"
Scheifler (1989), "Standard Character Set Encodings"
Scheifler (1989), "Approved Standard Encodings"
"DICOM PS3.2 2016d, Conformance: D.6.2 Character Sets; D.6 Support of Character Sets". Archived from the original on 2020-02-16. Retrieved 2020-05-21.
"DICOM ISO 2022 variation". Archived from the original on 2013-04-30. Retrieved 2009-07-25.
Sivonen, Henri (2018-12-17). "(UNSUBMITTED DRAFT) No U+FFFD Generation for Zero-Length ASCII-State Content between ISO-2022-JP Escape Sequences" (PDF). Archived (PDF) from the original on 2019-02-21. Retrieved 2019-02-21.
"935453: Gather telemetry about HZ and other
encodings we might try to remove". Archived from the original on 2017-05-19. Retrieved 2018-06-18.
Davis, Mark; Suignard, Michel (2014-09-19). "3.6.2 Some Output For All Input". Unicode Technical Report #36: Unicode Security Considerations (revision 15). Unicode Consortium. Archived from the original on 2019-02-22. Retrieved 2019-02-21.

Standards and registry indices cited

ARIB (2008). ARIB STD B24: Data Coding and Transmission Specification for Digital Broadcasting (PDF). ARIB Standard (5.2-E1 ed.). Vol. 1. Archived (PDF) from the original on 2017-07-10. Retrieved 2017-07-10.
ECMA (1980). ECMA-35: Extension of the 7-bit Coded Character Set (PDF). ECMA Standard (2nd ed.).
ECMA (1994). ECMA-35: Character Code Structure and Extension Techniques (PDF). ECMA Standard (6th ed.).
ECMA (1985). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF). ECMA Standard (2nd ed.).
ECMA (1991). ECMA-43: 8-Bit Coded Character Set Structure and Rules (PDF). ECMA Standard (3rd ed.).
ECMA (1991). ECMA-48: Control Functions for Coded Character Sets (PDF). ECMA Standard (5th ed.).
ECMA (2000). ECMA-144: 8-Bit Single-Byte Coded Graphic Character Sets: Latin Alphabet No. 6 (PDF). ECMA Standard (3rd ed.).
European Broadcasting Union (1997). ETS 300 706: Enhanced Teletext specification (PDF). European Telecommunications Standards. ETSI.
ISO/IEC JTC 1/SC 2 (2003). ISO/IEC 2375:2003: Information technology: Procedure for registration of escape sequences and coded character sets. ISO.
ISO/IEC JTC 1/SC 2 (1998-02-12). ISO/IEC FDIS 8859-10: Information Technology: 8-bit single-byte coded graphic character sets: Part 10: Latin alphabet No. 6 (PDF). Final Draft International Standard.
ISO/IEC JTC 1/SC 2 (2017). ISO/IEC 10646: Information technology: Universal Coded Character Set (UCS). ISO Standard (5th ed.). ISO.
ISO-IR. ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences (PDF). Registry Index. ITSCJ/IPSJ.
Scheifler, Robert W. (1989). Compound Text Encoding. X Consortium Standard. X Consortium.
van Kesteren, Anne. WHATWG Encoding Standard. WHATWG Living Standard. WHATWG.

Registered code sets cited

ISO/TC 97/SC 2 (1975-12-01). ISO-IR-1: The set of control characters of the ISO 646 (PDF). ITSCJ/IPSJ.
Sveriges Standardiseringskommission (1975-12-01). ISO-IR-7: NATS Control set for newspaper text transmission (PDF). ITSCJ/IPSJ.
Japanese Industrial Standards Committee (1975-12-01). ISO-IR-14: The Japanese Roman graphic set of characters (PDF). ITSCJ/IPSJ.
IPTC (1976-03-25). ISO-IR-26: Control set for newspaper text transmission (PDF). ITSCJ/IPSJ.
ISO/TC 97/SC 2 (1977-10-15). ISO-IR-36: The set of control characters of ISO 646, with IS4 replaced by Single Shift for G2 (SS2) (PDF). ITSCJ/IPSJ.
ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-104: Minimum C0 set for ISO 4873 (PDF). ITSCJ/IPSJ.
ISO/TC97/SC2/WG-7; ECMA (1985-08-01). ISO-IR-105: Minimum C1 Set for ISO 4873 (PDF). ITSCJ/IPSJ.
ITU (1985-08-01). ISO-IR-106: Teletex Primary Set of Control Functions (PDF). ITSCJ/IPSJ.
Urad pro normalizaci a mereni (1987-07-31). ISO-IR-140: The C0 Set of Control Characters of ISO 646, with EM replaced by SS2 (PDF). ITSCJ/IPSJ.
Korea Bureau of Standards (1988-10-01). ISO-IR-149: Korean Graphic Character Set for Information Interchange (KS C 5601:1987) (PDF). ITSCJ/IPSJ.
ISO/IEC JTC1/SC2/WG3 (1990-04-16). ISO-IR-155: Basic Box Drawings Set (PDF). ITSCJ/IPSJ.
CCITT (1992-07-13). ISO-IR-164: Hebrew Supplementary Set of Graphic Characters (PDF). ITSCJ/IPSJ.
ECMA (1996-04-22). ISO-IR-192: UCS Transformation Format (UTF-8), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
ECMA (1996-04-22). ISO-IR-195: UCS Transformation Format (UTF-16), implementation level 3, without standard return (PDF). ITSCJ/IPSJ.
ECMA (1996-04-22). ISO-IR-196: UCS Transformation Format (UTF-8) with standard return (PDF). ITSCJ/IPSJ.
National Standards Authority of Ireland (1999-12-07). ISO-IR-208: Ogham coded character set for information interchange (PDF). ITSCJ/IPSJ.

Internet Requests For Comment cited

Murai, J.; Crispin, M.; van der Poel, E. (1993). RFC 1468: Japanese Character Encoding for Internet Messages. Requests for Comments. IETF. doi:10.17487/rfc1468.
Ohta, M.; Handa, K. (1993). RFC 1554: ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP. Requests for Comments. IETF. doi:10.17487/rfc1554.
Choi, U.; Chon, K.; Park, H. (1993). RFC 1557: Korean Character Encoding for Internet Messages. Requests for Comments. IETF. doi:10.17487/rfc1557.
Zhu, HF.; Hu, DY.; Wang, ZG.; Kao, TC.; Chang, WCH.; Crispin, M. (1996). RFC 1922: Chinese Character Encoding for Internet Messages. Requests for Comments. IETF. doi:10.17487/rfc1922.
Tamaru, K. (1997). RFC 2237: Japanese Character Encoding for Internet Messages. Requests for Comments. IETF. doi:10.17487/rfc2237.

Other published works cited

Lunde, Ken (2008). CJKV Information Processing (2nd ed.). O'Reilly Media. ISBN 9780596514471.

Further reading

Lunde, Ken (1998). CJKV Information Processing. Cambridge, Massachusetts: O'Reilly & Associates. ISBN 1-56592-224-7.

External links

ISO/IEC 2022:1994
ISO/IEC 2022:1994/Cor 1:1999
ECMA-35, equivalent to ISO/IEC 2022 and freely downloadable
International Register of Coded Character Sets to be Used with Escape Sequences, a full list of assigned character sets and their escape sequences
History of Character Codes in North America, Europe, and East Asia (from 1999, rev. 2004)
Ken Lunde's CJK.INF: a document on encoding Chinese, Japanese, and Korean (CJK) languages,
including a discussion of the various variants of ISO/IEC 2022.
