GBK (character encoding)

GBK is an extension of the GB 2312 character set for Simplified Chinese characters, used in the People's Republic of China. It includes all unified CJK characters found in GB 13000.1-93, i.e. ISO/IEC 10646:1993, or Unicode 1.1. Since its initial release in 1993, GBK has been extended by Microsoft in Code page 936/1386, which was then extended into GBK 1.0. GBK is also the IANA-registered internet name for the Microsoft mapping,[1] which differs from other implementations primarily by the single-byte euro sign at 0x80.

Guójiā Biāozhǔn Kuòzhǎn (GBK)
MIME / IANA: GBK
Alias(es): CP936, MS936, windows-936, csGBK
Language(s): Primarily used for Simplified Chinese; also supports Traditional Chinese, Japanese, English, Russian and (partially) Greek. Web browsers decode GBK as GB 18030, which supports all languages.
Standard: GBK 1.0
Classification: Extended ASCII,[a] variable-width encoding, CJK encoding
Extends: EUC-CN
Preceded by: GB 2312
Succeeded by: GB 18030
  a. ^ Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.

GB abbreviates Guójiā Biāozhǔn, which means national standard in Chinese, while K stands for Extension (扩展 kuòzhǎn). GBK not only extended the old standard GB 2312 with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB 2312 in 1981. With the arrival of GBK, certain names with characters formerly unrepresentable, like the 镕 (róng) character in former Chinese Premier Zhu Rongji's name, are now representable.[2]
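The effect on 镕 can be checked directly. The sketch below assumes that Python's built-in `gb2312` and `gbk` codecs are reasonable stand-ins for the two character sets; the helper name `encodable` is made up for this illustration.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if every character of `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

RONG = "镕"  # the character from Zhu Rongji's name mentioned above

in_gb2312 = encodable(RONG, "gb2312")  # expected to fail: not part of GB 2312
in_gbk = encodable(RONG, "gbk")        # expected to succeed: added by GBK
print(f"GB 2312: {in_gb2312}, GBK: {in_gbk}")
```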

As of October 2022, GBK is the third-most popular encoding served from China and territories (after UTF-8 and the subset GB 2312), with 1.9% of web servers serving a page that declares GBK.[3] However, all major web browsers decode GB2312-marked documents as if they were marked GBK, except for Safari and Edge on the label GB_2312.[4] Together, GBK and GB 2312 encodings have a combined 5.5% presence in China and territories.[3] Globally, GBK accounts for less than 0.07% of all web pages and GBK+GB2312 for 0.2%.[5]

History

In 1993, the Unicode 1.1 standard was released, including 20,902 characters used in mainland China, Taiwan, Japan and Korea. Following this, China released GB 13000.1-93, the Guobiao standard equivalent of Unicode 1.1.

The GBK character set was defined in 1993 as an extension of GB 2312-80, while also including the characters of GB 13000.1-93 through the unused codepoints available in GB 2312. Hence GBK is backward compatible with GB 2312. GBK was defined in a normative annex to GB 13000.1-93.[6]

Microsoft implemented GBK in Windows 95 and Windows NT 3.51 as Code Page 936. Although GBK was never an official standard, the widespread use of Windows 95 made it the de facto standard. GBK included all the Chinese characters defined in Unicode 1.1 and GB 13000.1-93, but it arranged them in a different code table from those standards. The primary reason for GBK's existence was simply to bridge the gap between GB 2312-80 and GB 13000.1-93.

In 1995, the China National Information Technology Standardization Technical Committee set down the Chinese Internal Code Extension Specification (Chinese: 汉字内码扩展规范 (GBK); pinyin: Hànzì Nèimǎ Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a slight extension of Codepage 936. The 95 newly added characters were not found in GB 13000.1-1993, and were provisionally assigned Unicode PUA code points.[7]: 534

Microsoft later added the euro sign to Code page 936 and assigned the code 0x80 to it. This is not a valid code point in GBK 1.0.

In 2000, the GB 18030-2000 standard was released, superseding GBK 1.0 while maintaining compatibility with it. It increased the number of defined Chinese characters and extended the number of possible characters by introducing four-byte codes. The subset of GB 18030 consisting of one-byte and two-byte characters is sometimes also referred to as GBK. The mapping to Unicode has changed slightly, though, as some characters are now defined in Unicode. In the most up-to-date form of the standard, GB 18030-2005, only 24[8] characters are still mapped to the Unicode PUA (see GB 18030 § PUA).

In 2002, GBK was registered as an IANA charset; the registration uses the code page 936 mapping and the CP936/MS936 aliases, but refers to the GBK 1.0 specification.[1] A W3C technical recommendation published in 2015[9] defines a GBK encoder as a GB 18030 encoder with a single-byte euro sign and without four-byte sequences. The W3C GBK decoder has no such limitation: it decodes as GB 18030, and so covers the same range of characters as Unicode.
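The encoder rule described by the W3C text can be approximated in a few lines. This is only a sketch under stated assumptions: it leans on Python's built-in `gb18030` codec, whose mapping table may differ from the W3C index in a handful of code points, and the function name `gbk_encode_char` is hypothetical.

```python
def gbk_encode_char(ch: str) -> bytes:
    """Encode one character: GB 18030, plus single-byte euro, minus four-byte codes."""
    if ch == "\u20ac":
        # The euro sign gets the single byte 0x80 instead of a two-byte code.
        return b"\x80"
    encoded = ch.encode("gb18030")  # assumption: stands in for the W3C index
    if len(encoded) == 4:
        # Four-byte sequences exist only in GB 18030, so a GBK encoder rejects them.
        raise UnicodeEncodeError(
            "gbk", ch, 0, 1, "needs a four-byte GB 18030 sequence"
        )
    return encoded
```

With this sketch, `gbk_encode_char("€")` yields `b"\x80"`, while a character outside the two-byte space, such as an emoji, raises `UnicodeEncodeError`.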

Encoding

A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 95 characters and 33 control codes in this range.

A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–A0 except 7F for some areas and A1–FE for others.
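Because trail bytes can fall in the ASCII range, a GBK byte string has to be scanned from the start to find character boundaries. The following is a minimal sketch of such a scan using only the loose rules above (lead byte 81–FE, trail byte 40–FE except 7F); the name `split_gbk` is hypothetical, and the function does not check whether a code is actually assigned.

```python
def split_gbk(data: bytes) -> list[bytes]:
    """Split a byte string into 1- and 2-byte GBK units (structure only)."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # ASCII range: a single-byte character
            units.append(data[i:i + 1])
            i += 1
        elif 0x81 <= b <= 0xFE:
            # Lead byte of a two-byte code: the next byte must be a valid trail
            if i + 1 >= len(data):
                raise ValueError(f"truncated sequence at offset {i}")
            trail = data[i + 1]
            if not (0x40 <= trail <= 0xFE) or trail == 0x7F:
                raise ValueError(f"invalid trail byte {trail:#04x} at offset {i + 1}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # 0x80 and 0xFF are never lead bytes under these rules
            raise ValueError(f"invalid lead byte {b:#04x} at offset {i}")
    return units

units = split_gbk("GBK编码".encode("gbk"))
print([len(u) for u in units])
```

Note that the sketch rejects 0x80, so it follows GBK 1.0 rather than the Microsoft mapping with its single-byte euro sign.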

More specifically, the following ranges of bytes are defined:

GBK Encoding Ranges

                                                           Characters
Range            Byte 1   Byte 2             Code points   GB 18030       GBK 1.0   Codepage 936   GB 2312
Level GBK/1      A1–A9    A1–FE              846           718[7]: 8–10   717       715            682
Level GBK/2      B0–F7    A1–FE              6,768         6,763          6,763     6,763          6,763
Level GBK/3      81–A0    40–FE except 7F    6,080         6,080          6,080     6,080
Level GBK/4      AA–FE    40–A0 except 7F    8,160         8,160          8,160     8,080
Level GBK/5      A8–A9    40–A0 except 7F    192           166            166       153
user-defined 1[7] AA–AF   A1–FE              564
user-defined 2   F8–FE    A1–FE              658
user-defined 3   A1–A7    40–A0 except 7F    672
total:                                       23,940        21,887         21,886    21,791         7,445
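The byte ranges in the table can be written down as data and used both to classify a two-byte code and to re-derive the code point counts. This is an illustrative sketch; the names `REGIONS`, `region_size` and `classify` are hypothetical, and only the byte ranges themselves come from the table.

```python
# (name, lead-byte range, trail-byte range), copied from the table above
REGIONS = [
    ("GBK/1", (0xA1, 0xA9), (0xA1, 0xFE)),
    ("GBK/2", (0xB0, 0xF7), (0xA1, 0xFE)),
    ("GBK/3", (0x81, 0xA0), (0x40, 0xFE)),
    ("GBK/4", (0xAA, 0xFE), (0x40, 0xA0)),
    ("GBK/5", (0xA8, 0xA9), (0x40, 0xA0)),
    ("user-defined 1", (0xAA, 0xAF), (0xA1, 0xFE)),
    ("user-defined 2", (0xF8, 0xFE), (0xA1, 0xFE)),
    ("user-defined 3", (0xA1, 0xA7), (0x40, 0xA0)),
]

def region_size(lead, trail):
    """Number of code points in a region; 0x7F is never a trail byte."""
    rows = lead[1] - lead[0] + 1
    cols = trail[1] - trail[0] + 1
    if trail[0] <= 0x7F <= trail[1]:
        cols -= 1
    return rows * cols

def classify(lead_byte, trail_byte):
    """Return the name of the region containing a two-byte code, or None."""
    if trail_byte == 0x7F:
        return None
    for name, (l0, l1), (t0, t1) in REGIONS:
        if l0 <= lead_byte <= l1 and t0 <= trail_byte <= t1:
            return name
    return None

sizes = {name: region_size(lead, trail) for name, lead, trail in REGIONS}
total = sum(sizes.values())
print(sizes, total)
```

The eight regions do not overlap, so the first match in `classify` is the only match, and the sizes sum to the total in the "code points" column.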

Layout diagram

In graphical form, the following figure shows the space of all 64K possible 2-byte codes. Green and yellow areas are assigned GBK codepoints; red areas are for user-defined characters. The uncolored areas are invalid byte combinations.

[Figure: layout of the GBK two-byte code space]

Relationship to other encodings

The areas indicated in the previous section as GBK/1 and GBK/2, taken by themselves, are simply GB 2312-80 in its usual encoding, GBK/1 being the non-hanzi region and GBK/2 the hanzi region. GB 2312, or more properly the EUC-CN encoding thereof, takes a pair of bytes from the range A1–FE, like any 94² ISO-2022 character set loaded into GR. This corresponds to the lower-right quarter of the illustration above. However, GB 2312 does not assign any code points to the rows located at AA–AF and F8–FE, even though it had staked out the territory. GBK added extensions to these rows: the two gaps were filled in with user-defined areas.

More significantly, GBK extended the range of the bytes. Having two-byte characters in the ISO-2022 GR range gives a limit of 94² = 8,836 possibilities. Abandoning the ISO-2022 model of strict regions for graphics and control characters, but retaining the feature of low bytes being 1-byte characters and pairs of high bytes denoting a character, you could potentially have 128² = 16,384 positions. GBK takes part of that, extending the range from A1–FE (94 choices for each byte) to 81–FE (126 choices) for the first byte and 40–FE (191 choices) for the second byte, for a total of 24,066 positions. Since 7F is excluded as a second byte, 126 × 190 = 23,940 of these are actually usable.
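The arithmetic in this paragraph is easy to re-derive from the byte ranges. The snippet below is just a worked calculation; the variable names are chosen for this illustration.

```python
# ISO-2022 GR model: both bytes restricted to A1-FE
gr_positions = (0xFE - 0xA1 + 1) ** 2            # 94 * 94

# Any pair of bytes with the high bit set
high_pairs = 128 ** 2

# GBK: lead byte 81-FE, trail byte 40-FE
lead_choices = 0xFE - 0x81 + 1                   # 126
trail_choices = 0xFE - 0x40 + 1                  # 191, still counting 7F
gbk_positions = lead_choices * trail_choices

# 7F is never used as a trail byte, which leaves 190 choices
gbk_usable = lead_choices * (trail_choices - 1)

print(gr_positions, high_pairs, gbk_positions, gbk_usable)
```

The last figure, 23,940, matches the "code points" total in the encoding ranges table.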

Microsoft's Code Page 936 is generally thought of as being GBK.[1] However, the 95 PUA characters added in GBK 1.0 are not included in Code Page 936. Code Page 936 also has a single-byte euro sign at 0x80 which GBK 1.0 doesn't have.[10]

GBK's successor, GB 18030-2000, uses the remaining range available to the second byte (30–39) to further expand the number of possibilities while retaining GBK as a subset.
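Since 30–39 is never a valid GBK trail byte, a decoder can tell a four-byte GB 18030 code from a two-byte GBK code by inspecting the second byte. The sketch below illustrates this with Python's built-in `gb18030` codec; the helper `looks_four_byte` is hypothetical and checks only the first two bytes.

```python
def looks_four_byte(data: bytes, i: int = 0) -> bool:
    """True if the bytes at offset i start a GB 18030 four-byte code."""
    return (
        i + 1 < len(data)
        and 0x81 <= data[i] <= 0xFE        # same lead-byte range as GBK
        and 0x30 <= data[i + 1] <= 0x39    # second byte in the range GBK left unused
    )

four = "\u0080".encode("gb18030")  # a character outside GBK's two-byte space
two = "中".encode("gb18030")        # a character GBK already covers

print(four.hex(), looks_four_byte(four))
print(two.hex(), looks_four_byte(two))
```

For characters that GBK already covers, such as 中, the GB 18030 bytes are the same two bytes that GBK uses, which is what "retaining GBK as a subset" means in practice.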

References

  1. ^ a b c "Character Sets". Retrieved 3 October 2016.
  2. ^ "Code Page 936 PRC GBK XGB". Microsoft. Archived from the original on 2002-10-01. Conversion map between Codepage 936 and Unicode. GB 18030 or GBK must be selected manually in the browser to view it correctly.
  3. ^ a b "Distribution of Character Encodings among websites that use China and territories". w3techs.com. Retrieved 2022-10-25.
  4. ^ "Encoding: Summarized test results". www.w3.org. Retrieved 2019-11-15.
  5. ^ "Historical trends in the usage statistics of character encodings for websites, October 2022". w3techs.com. Retrieved 2022-10-25.
  6. ^ "18.2: Ideographic Description Characters" (PDF). The Unicode Standard. Version 15.0.0. 2022. p. 763. The Ideographic Description characters are found in GBK—an extension to GB 2312-80 that added all 20,902 Unicode Version 1.1 ideographs not already in GB 2312-80. GBK is defined as a normative annex of GB 13000.1-93.
  7. ^ a b c Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology—Chinese coded character set.
  8. ^ GB 18030-2005 Standard p.9, 79
  9. ^ "Encoding Standard # gbk-encoder". W3C. Retrieved 2016-10-02.
  10. ^ Scherer, Markus (4 January 2002). "Re: Fun with GBK & GB2312". Unicode Mail List Archive. Retrieved 4 March 2020.

External links

  • ICU's Authoritative GBK mapping - part of GB18030 data. Archived 2016-10-31 at the Wayback Machine
  • Microsoft Reference page for GBK
  • Mapping of GBK to Unicode N.B.: this is Microsoft code page 936, which contains entries for 21791 double-byte code points, 96 single-byte graphic characters, and 33 control characters. This is not exactly the same as GBK which has 21886 characters.
  • GBK Code Table N.B. This gbk-encoded page shows the available coding space totally populated except for 2 places, for a total of 32256 glyphs (32352 with the implied single-byte ASCII codes not illustrated), which is more than 23940 or 21886. Actual rendering of this table depends on your browser's GBK decoder.
