Universal Coded Character Set

The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard). It is the basis of many character encodings and continues to grow as characters from previously unrepresented writing systems are added.

Universal Coded Character Set
Alias(es): UCS, Unicode
Language(s): International
Standard: ISO/IEC 10646
Encoding formats: UTF-8, UTF-16, GB 18030 (less common: UTF-32, BOCU, SCSU, UTF-7)
Preceded by: ISO/IEC 8859, ISO/IEC 2022, various others

The UCS has over 1.1 million possible code points available for allocation, but only the first 65,536, which make up the Basic Multilingual Plane (BMP), had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2006 that all software sold in its jurisdiction would have to support GB 18030, an encoding that also covers code points outside the BMP. Software intended for sale in the PRC therefore had to move beyond the BMP.
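As a small illustration of what moving beyond the BMP means in practice, the sketch below encodes one supplementary-plane character with Python's built-in `gb18030` codec. The helper name `plane_of` and the choice of U+20000 as the sample character are assumptions made for this example only.

```python
# U+20000 is a CJK ideograph in plane 2, i.e. outside the BMP (plane 0).
ch = "\U00020000"


def plane_of(char: str) -> int:
    """Return the plane number of a single character (0 means BMP)."""
    return ord(char) >> 16


# Python ships a GB 18030 codec; supplementary characters use four bytes.
encoded = ch.encode("gb18030")
roundtrip = encoded.decode("gb18030")

print(plane_of(ch), len(encoded), roundtrip == ch)
```

A program that stores only 16-bit code values without surrogate handling cannot represent such a character at all, which is why GB 18030 support pushed software past the BMP.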

The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimise conflicts with other encoding forms.
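Unassigned code points can be observed from a program. The following is a minimal sketch using Python's standard `unicodedata` module, which reports the general category `Cn` for code points that have no character assigned. The helper name `is_unassigned` is hypothetical, and the result for any particular code point depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_unassigned(code_point: int) -> bool:
    """True if the bundled Unicode database gives the code point category 'Cn'."""
    return unicodedata.category(chr(code_point)) == "Cn"


# Version of the Unicode Character Database shipped with this interpreter.
print(unicodedata.unidata_version)

# U+0041 is LATIN CAPITAL LETTER A; U+0378 is a reserved gap inside the BMP.
print(is_unassigned(0x0041))
print(is_unassigned(0x0378))
```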

The original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows the use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology the high-half zone elements (U+D800 to U+DBFF) are called "high surrogates" and the low-half zone elements (U+DC00 to U+DFFF) are called "low surrogates".
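The pairing of high-half and low-half code values can be sketched as follows. This is an illustrative implementation of the standard UTF-16 surrogate arithmetic, not text taken from the standard; the function name `to_surrogates` is an assumption of this example.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = chr(0x1F600).encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```

Each half carries 10 bits, so a pair covers 2^20 = 1,048,576 code points, which is exactly the 16 planes above the BMP.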

Another encoding, UTF-32 (previously named UCS-4), uses four bytes (32 bits in total) to encode a single character of the codespace. UTF-32 thereby gives every code point a fixed-width binary representation for use in APIs and software applications.
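The fixed width is easy to check with Python's built-in `utf-32-be` codec. The sample string below is an arbitrary choice for this sketch.

```python
# One character each from ASCII, Latin-1, the rest of the BMP, and plane 1.
text = "A\u00e9\u4e2d\U0001F600"

utf32 = text.encode("utf-32-be")  # big-endian, no byte order mark

# Every code point occupies exactly four bytes ...
print(len(utf32) == 4 * len(text))

# ... and each four-byte unit is simply the code point number.
code_points = [
    int.from_bytes(utf32[i:i + 4], "big") for i in range(0, len(utf32), 4)
]
print([f"U+{cp:04X}" for cp in code_points])
```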

History

The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects.

This work happened independently of the development of the Unicode standard, which Xerox and Apple had been developing since 1987.

The original ISO 10646 draft differed markedly from the current standard. It defined:

  • 128 groups of
  • 256 planes of
  • 256 rows of
  • 256 cells,

for an apparent total of 2,147,483,648 characters. In practice the standard could code only 679,477,248 characters, because the policy forbade the byte values of the C0 and C1 control codes (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) in any of the four bytes specifying a group, plane, row and cell. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
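Both totals can be reproduced with a short calculation. The sketch below assumes, as the figures imply, that the group byte is limited to 0x00 to 0x7F (128 groups) while the other three bytes range over 0x00 to 0xFF, and that the control-code ranges are excluded from all four bytes. The variable names are illustrative.

```python
# Byte values forbidden by the draft: C0 and C1 control codes.
C0 = range(0x00, 0x20)
C1 = range(0x80, 0xA0)


def allowed(values: range) -> int:
    """Count byte values in `values` that are not control codes."""
    return sum(1 for b in values if b not in C0 and b not in C1)


apparent_total = 128 * 256 * 256 * 256

group_values = allowed(range(0x00, 0x80))   # 96 usable group bytes
octet_values = allowed(range(0x00, 0x100))  # 192 usable plane/row/cell bytes
usable_total = group_values * octet_values ** 3

print(f"{apparent_total:,}")
print(f"{usable_total:,}")
```

The result is 96 × 192³, which matches the 679,477,248 figure quoted above.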

One could code the characters of this primordial ISO/IEC 10646 standard in one of three ways:

  1. UCS-4, four bytes for every character, enabling the simple encoding of all characters;
  2. UCS-2, two bytes for every character, which directly encoded the first plane, 0x20 (the Basic Multilingual Plane, containing the first 36,864 code points), and reached other planes and groups by switching to them with ISO/IEC 2022 escape sequences;
  3. UTF-1, which encodes all the characters in byte sequences of varying length (1 to 5 bytes, none of which is a control code value).
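To make the first two forms concrete, the following sketch packs the draft's group, plane, row and cell numbers for the Latin capital letter A into bytes. It assumes that draft UCS-4 was simply the four numbers in that order and that draft UCS-2 kept only the row and cell bytes of the current plane; the function name `draft_ucs4` is hypothetical. The last lines check the 36,864 figure, which is 192 usable row values times 192 usable cell values.

```python
def draft_ucs4(group: int, plane: int, row: int, cell: int) -> bytes:
    """Pack a draft ISO 10646 location into four bytes (assumed byte order)."""
    return bytes([group, plane, row, cell])


# Location of 'A' in the original draft, as given in the text above.
a_ucs4 = draft_ucs4(0x20, 0x20, 0x20, 0x41)
a_ucs2 = a_ucs4[2:]  # assumed: row and cell only, within the current plane

print(a_ucs4.hex(" "))
print(a_ucs2.hex(" "))

# Usable cells in one plane when control-code byte values are excluded.
usable_byte_values = 256 - 32 - 32  # minus C0 and C1
print(usable_byte_values ** 2)
```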

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO/IEC 10646. Software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. ISO officials realised they could not continue to support the standard in its then-current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the limitation upon characters (the prohibition of control code values), thus opening code points for allocation; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.

Over time, the situation also changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports the encoding of 1,112,064 code points from 17 planes by means of the UTF-16 surrogate mechanism. For that reason, ISO/IEC 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 679 million. The UCS-4 encoding of ISO/IEC 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32, although it has almost no use outside programs' internal data.
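The 1,112,064 figure follows from the size of the 17 planes minus the code points reserved as surrogates, which cannot themselves be encoded as characters. A brief worked check, with illustrative constant names:

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000          # 65,536

# Surrogate code points U+D800..U+DFFF are reserved for UTF-16 pairing.
SURROGATES = 0xDFFF - 0xD800 + 1         # 2,048

total_code_points = PLANES * CODE_POINTS_PER_PLANE
encodable = total_code_points - SURROGATES
highest_code_point = total_code_points - 1

print(f"{total_code_points:,}")
print(f"{encodable:,}")
print(f"U+{highest_code_point:X}")
```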

Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding that was also backward-compatible with 7-bit ASCII, which came to be called UTF-8,[1] and is currently the most popular UCS encoding.
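The two properties mentioned here, mixed width and backward compatibility with 7-bit ASCII, can be observed with Python's built-in `utf-8` codec. The sample characters and the string "Plan 9" are arbitrary choices for this sketch.

```python
# One sample character for each UTF-8 sequence length.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)

# Pure ASCII text has the same bytes in UTF-8 as in ASCII.
ascii_text = "Plan 9"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)

# Bytes of multi-byte sequences always have the high bit set,
# so they can never be mistaken for ASCII characters.
high_bit_set = all(b >= 0x80 for b in "\u4e2d".encode("utf-8"))
print(high_bit_set)
```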

Differences from Unicode

ISO/IEC 10646 and Unicode have an identical repertoire and numbering: the same characters with the same numbers exist in both standards, although Unicode releases new versions and adds new characters more often. Unicode also has rules and specifications outside the scope of ISO/IEC 10646. ISO/IEC 10646 is a simple character map, an extension of previous standards like ISO/IEC 8859. In contrast, Unicode adds rules for collation, normalisation of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO/IEC 10646; Unicode must be implemented.
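Normalisation is one case where a character map alone is not enough. In the sketch below, two different code point sequences represent the same text, and only a Unicode normalisation form makes them compare equal. It uses `unicodedata.normalize` from the Python standard library; the sample strings are chosen for this example.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# As raw code point sequences the two strings differ.
print(precomposed == decomposed)

# After normalisation to the same form they are identical.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(nfc_equal, nfd_equal)
```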

To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character's default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number '8', or the vulgar fraction '¼', that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
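Several of these properties are exposed by Python's standard `unicodedata` module, which makes for a quick illustration. The characters queried below are examples picked for this sketch.

```python
import unicodedata

# Default bidirectional class: 'L' (left-to-right), 'R' (right-to-left),
# 'AL' (Arabic letter).
bidi = {
    "A": unicodedata.bidirectional("A"),
    "alef (Hebrew)": unicodedata.bidirectional("\u05d0"),
    "alef (Arabic)": unicodedata.bidirectional("\u0627"),
}
print(bidi)

# Canonical combining class: 0 for ordinary characters, non-zero for marks.
acute_class = unicodedata.combining("\u0301")  # COMBINING ACUTE ACCENT
print(acute_class)

# Numeric value carried as a character property.
eight = unicodedata.numeric("8")
quarter = unicodedata.numeric("\u00bc")  # VULGAR FRACTION ONE QUARTER
print(eight, quarter)
```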

Some applications support ISO/IEC 10646 characters but do not fully support Unicode. One such application, Xterm, can properly display all ISO/IEC 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly.

Citing the Universal Coded Character Set

ISO/IEC 10646, a general, informal citation for the ISO/IEC 10646 family of standards, is acceptable in most prose. And even though it is a separate standard, the term Unicode is used just as often, informally, when discussing the UCS. However, any normative references to the UCS as a publication should cite the year of the edition in the form ISO/IEC 10646:{year}, for example: ISO/IEC 10646:2014.

Relationship with Unicode

Since 1991, the Unicode Consortium and the ISO/IEC have developed The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard.

  • ISO/IEC 10646-1:1993 = Unicode 1.1
  • ISO/IEC 10646-1:1993 plus Amendments 5 to 7 = Unicode 2.0
  • ISO/IEC 10646-1:1993 plus Amendments 5 to 7 = Unicode 2.1 excluding Euro sign and Object Replacement Character, which are included in Amendment 18
  • ISO/IEC 10646-1:2000 = Unicode 3.0
  • ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001 = Unicode 3.1
  • ISO/IEC 10646-1:2000 plus Amendment 1 and ISO/IEC 10646-2:2001 = Unicode 3.2
  • ISO/IEC 10646:2003 = Unicode 4.0
  • ISO/IEC 10646:2003 plus Amendment 1 = Unicode 4.1
  • ISO/IEC 10646:2003 plus Amendments 1 to 2 = Unicode 5.0 excluding Devanagari letters GGA, JJA, DDDA and BBA, which are included in Amendment 3
  • ISO/IEC 10646:2003 plus Amendments 1 to 4 = Unicode 5.1
  • ISO/IEC 10646:2003 plus Amendments 1 to 6 = Unicode 5.2
  • ISO/IEC 10646:2003 plus Amendments 1 to 8 = ISO/IEC 10646:2011 = Unicode 6.0 excluding Indian rupee sign
  • ISO/IEC 10646:2012 = Unicode 6.1
  • ISO/IEC 10646:2012 = Unicode 6.2 excluding Turkish lira sign, which is included in Amendment 1
  • ISO/IEC 10646:2012 = Unicode 6.3 excluding Turkish lira sign, which is included in Amendment 1, and five bidirectional control characters (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate), which are included in Amendment 2
  • ISO/IEC 10646:2012 plus Amendments 1 and 2 = Unicode 7.0 excluding the Ruble sign
  • ISO/IEC 10646:2014 plus Amendment 1 = Unicode 8.0 excluding the Lari sign, nine CJK unified ideographs, and 41 emoji characters
  • ISO/IEC 10646:2014 plus Amendments 1 and 2 = Unicode 9.0 excluding Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols
  • ISO/IEC 10646:2017 = Unicode 10.0 excluding 285 Hentaigana characters, 3 Zanabazar Square characters, and 56 emoji symbols
  • ISO/IEC 10646:2017 plus Amendment 1 = Unicode 11.0 excluding 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters
  • ISO/IEC 10646:2017 plus Amendments 1 and 2 = Unicode 12.0 excluding 62 additional characters
  • ISO/IEC 10646:2020 = Unicode 13.0
  • ISO/IEC 10646:2021 = Unicode 14.0

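See also

Related standards:

  • ISO/IEC 646 (positions 0 to 127 are the same as in ISO/IEC 10646 and Unicode, and the numbers 646 and 10646 are similar)
  • ISO/IEC 2022, Information technology: Character code structure and extension techniques
  • ISO/IEC 6429 (C0 and C1 control codes)
  • ISO/IEC 8859 (positions 0 through 255 of UCS and Unicode are the same as in ISO/IEC 8859-1, alias ISO Latin-1)
  • ISO/IEC 14651, Information technology: International string ordering and comparison
  • ISO 15924, Codes for the representation of names of scripts (each character is associated with one of those scripts)

Other topics:

  • Comparison of Unicode encodings
  • List of XML and HTML character entity references
  • List of Unicode fonts
  • Universal Character Set characters
  • ISO/IEC JTC 1/SC 2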

References

  1. Pike, Rob (2003-04-03). "UTF-8 history". Archived from the original on 2016-05-23.

External links

  • Publicly available standards (ISO) – includes a copy of ISO/IEC 10646:2020/Amd. 1:2023(E)
  • ISO/IEC JTC1/SC2/WG2, the working group in charge of ISO 10646
  • UTF-8 and Unicode FAQ
  • SIL's freeware fonts, editors and documentation
  • Simple but pleasant UTF-8 example testing your web browser and font capabilities.
  • Character set issues for ADA 9x from October 1989, goes into some detail about the original, pre-merger DIS ISO-10646
