Wikipedia

Code point

In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. A code point usually represents a single grapheme (typically a letter, digit, punctuation mark, or whitespace) but may instead represent a symbol, a control character, or formatting.[1] The set of all possible code points within a given encoding or character set makes up that encoding's codespace.[2][3]
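As a rough illustration of the mapping between characters and numbers, the Python sketch below uses the built-in `ord` and `chr` functions together with the standard `unicodedata` module. The helper name `describe` is an assumption made for this example; the two-letter codes it prints are Unicode general categories (for instance `Lu` for an uppercase letter, `Cc` for a control character, `Cf` for a format character).

```python
import unicodedata

# ord() returns the code point of a one-character string; chr() is the inverse.
assert ord("A") == 0x41
assert chr(0x20AC) == "€"


def describe(ch: str) -> str:
    """Hypothetical helper: format a character as 'U+XXXX <general category>'."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"


print(describe("A"))       # a letter
print(describe("7"))       # a digit
print(describe("\n"))      # a control character
print(describe("\u200d"))  # ZERO WIDTH JOINER, a format character
```

The last two characters have no visible glyph of their own, which is why a code point cannot simply be equated with "a letter".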

For example, the character encoding scheme ASCII comprises 128 code points in the range 0 to 7F (hexadecimal), Extended ASCII comprises 256 code points in the range 0 to FF, and Unicode comprises 1,114,112 code points in the range 0 to 10FFFF. The Unicode code space is divided into seventeen planes (the Basic Multilingual Plane and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
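The plane arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the constant names and the `plane_of` helper are assumptions introduced for the example, and the only fact it relies on is that each plane spans 2^16 consecutive code points, so the plane number is the code point shifted right by 16 bits.

```python
PLANE_SIZE = 2 ** 16          # 65,536 code points per plane
PLANE_COUNT = 17              # plane 0 (BMP) plus 16 supplementary planes
CODESPACE_SIZE = PLANE_COUNT * PLANE_SIZE
MAX_CODE_POINT = 0x10FFFF


def plane_of(code_point: int) -> int:
    """Hypothetical helper: return the plane (0..16) containing a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    return code_point >> 16   # each plane covers 2**16 values


print(CODESPACE_SIZE)         # total number of code points
print(plane_of(ord("A")))     # a Basic Multilingual Plane character
print(plane_of(0x1F600))      # a supplementary-plane character
```

The upper bound 10FFFF is exactly one less than the total, since code points are counted from 0.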

Definition

The notion of a code point is an abstraction that serves to distinguish both:

  • the number from an encoding as a sequence of bits, and
  • the abstract character from a particular graphical representation (glyph).

These distinctions make it possible to:

  • encode a particular code space in different ways, or
  • display a character via different glyphs.
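The first distinction (the number versus its bit-level encoding) can be sketched in Python by encoding one code point in several ways. The helper name `encode_all` and the choice of U+20AC (the euro sign) are assumptions made for illustration; the codec names are the ones shipped with Python's standard library.

```python
def encode_all(code_point: int) -> dict:
    """Hypothetical helper: encode one code point under several Unicode encodings."""
    ch = chr(code_point)
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


euro = encode_all(0x20AC)
for name, raw in euro.items():
    # Same code point, different byte sequences.
    print(f"{name:10} {raw.hex()}")

# Decoding any of the byte sequences recovers the same code point.
for name, raw in euro.items():
    assert ord(raw.decode(name)) == 0x20AC
```

The second distinction (character versus glyph) is a matter of fonts and rendering and is not something the encoding layer can show.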

For Unicode, the bit sequences that encode code points are built from code units. In the UCS-4 encoding, every code point is encoded as a single 4-byte (octet) binary number, while in the UTF-8 encoding, different code points are encoded as sequences of one to four bytes, forming a self-synchronizing code. See comparison of Unicode encodings for details.

Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.
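The sketch below illustrates two properties mentioned above, under the assumption that Python's `utf-32-be` codec stands in for UCS-4: the fixed 4-byte width of UCS-4 versus the variable 1 to 4 byte width of UTF-8, and self-synchronization. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so a reader dropped at an arbitrary byte offset can skip forward to the next code point boundary. The helper names are hypothetical.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for this code point (1 to 4)."""
    return len(chr(code_point).encode("utf-8"))


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which all look like 0b10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def next_boundary(data: bytes, offset: int) -> int:
    """Return the first index >= offset at which a code point starts."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    fixed = len(chr(cp).encode("utf-32-be"))   # always 4 bytes
    print(f"U+{cp:04X}: UTF-8 {utf8_length(cp)} byte(s), UTF-32 {fixed} bytes")

data = "a€b".encode("utf-8")     # 61 | e2 82 ac | 62
start = next_boundary(data, 2)   # offset 2 is in the middle of the euro sign
print(data[start:].decode("utf-8"))
```

Resynchronizing this way loses at most the one partially read character; the rest of the stream decodes normally.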

The distinction between a code point and the corresponding abstract character is not pronounced in Unicode but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
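To make the code page point concrete, the following Python sketch decodes the same 8-bit value, E9 hexadecimal, under three legacy code pages. The choice of value and code pages, and the helper name `decode_in_code_pages`, are assumptions for this example; the codecs themselves (`cp1252`, `cp437`, `iso8859_5`) are part of Python's standard library.

```python
def decode_in_code_pages(value: int, codecs=("cp1252", "cp437", "iso8859_5")) -> dict:
    """Hypothetical helper: map one 8-bit value to a Unicode code point per code page."""
    raw = bytes([value])
    return {codec: ord(raw.decode(codec)) for codec in codecs}


result = decode_in_code_pages(0xE9)
for codec, code_point in result.items():
    # One value in a 256-entry code space, three different abstract characters.
    print(f"{codec:10} U+{code_point:04X} {chr(code_point)}")
```

In these 8-bit schemes the number alone does not identify the character; the code page must be known as well. In Unicode each code point has a single fixed meaning.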

History

The concept of a code point is part of Unicode's solution to a difficult conundrum faced by character encoding developers in the 1980s.[4] If they added more bits per character to accommodate larger character sets, the extra bits would waste then-scarce computing resources for Latin script users (who constituted the vast majority of computer users at the time), because those bits would always be zero for such users.[5] The code point avoids this problem by breaking the old idea of a direct one-to-one correspondence between characters and particular sequences of bits.

See also

  • Combining character
  • Replacement character
  • Text-based computing
  • Unicode collation algorithm

References

  1. The Unicode Standard Version 11.0 Core Specification (PDF). Unicode Consortium. 30 June 2018. p. 23. Archived from the original (PDF) on 19 September 2018. Retrieved 25 December 2018. "Format: Invisible but affects neighboring characters; includes line/paragraph separators"
  2. "Glossary". unicode.org. Retrieved 20 March 2023.
  3. The Unicode Standard Version 11.0 Core Specification (PDF). Unicode Consortium. 30 June 2018. p. 22. Archived from the original (PDF) on 19 September 2018. Retrieved 25 December 2018. "On a computer, abstract characters are encoded internally as numbers. To create a complete character encoding, it is necessary to define the list of all characters to be encoded and to establish systematic rules for how the numbers represent the characters. The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character."
  4. Constable, Peter (13 June 2001). "Understanding Unicode I". NRSI: Computers & Writing Systems. Archived from the original (html) on 16 September 2010. Retrieved 25 December 2018. "By the early 1980s, the software industry was starting to recognise the need for a solution to the problems involved with using multiple character encoding standards. Some particularly innovative work was begun at Xerox. The Xerox Star workstation used a multi-byte encoding that allowed it to support a single character set with potentially millions of characters."
  5. Davis, Mark; Whistler, Ken (23 March 2001). "Unicode Technical Standard #10: Unicode Collation Algorithm". Unicode Consortium. Archived from the original (html) on 25 August 2001. Retrieved 25 December 2018. "6.2 Large Weight Values"

External links

  • Codepoints.net, a site dedicated to all things characters, letters and Unicode
