Numeric character reference

A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSgml, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used to represent characters that cannot be encoded directly in a particular document (for example, because they fall outside the 8-bit character set being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.

Examples

In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma:

Numerical character reference of U+03A3 Σ GREEK CAPITAL LETTER SIGMA
(Note that 3A3₁₆ = 931₁₀, i.e. hexadecimal 3A3 equals decimal 931)
Unicode character Numerical base Numerical reference in markup Effect
U+03A3 Decimal &#931; Σ
U+03A3 Decimal &#0931; Σ
U+03A3 Hexadecimal &#x3A3; Σ
U+03A3 Hexadecimal &#x03A3; Σ
U+03A3 Hexadecimal &#x3a3; Σ
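
As a quick check, the following Python sketch decodes the five spellings &#931;, &#0931;, &#x3A3;, &#x03A3; and &#x3a3; and confirms that they all name the same character. It relies on the standard library's html.unescape, which follows HTML5 parsing rules rather than SGML or XML rules; the variable names are illustrative only.

```python
import html

# Five spellings of the same reference: leading zeros and the case of the
# hexadecimal digits do not change the code point being referenced.
sigma_refs = ["&#931;", "&#0931;", "&#x3A3;", "&#x03A3;", "&#x3a3;"]

decoded = [html.unescape(ref) for ref in sigma_refs]

# 3A3 in base 16 is 931 in base 10; both name code point U+03A3.
code_point = int("3A3", 16)

print(decoded, code_point)
```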

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE:

Numerical character reference of U+00C6 Æ LATIN CAPITAL LETTER AE
Unicode character Numerical base Numerical reference in markup Effect
U+00C6 Decimal &#198; Æ
U+00C6 Hexadecimal &#xC6; Æ

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin small letter sharp s (ß):

Numerical character reference of U+00DF ß LATIN SMALL LETTER SHARP S
Unicode character Numerical base Numerical reference in markup Effect
U+00DF Decimal &#223; ß
U+00DF Hexadecimal &#xDF; ß

List of numeric character references for the printable ASCII characters:

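Unicode character  Character reference (decimal)  Character reference (hexadecimal)  Effect
U+0020 &#32; &#x20; (space)
U+0021 &#33; &#x21; !
U+0022 &#34; &#x22; "
U+0023 &#35; &#x23; #
U+0024 &#36; &#x24; $
U+0025 &#37; &#x25; %
U+0026 &#38; &#x26; &
U+0027 &#39; &#x27; '
U+0028 &#40; &#x28; (
U+0029 &#41; &#x29; )
U+002A &#42; &#x2A; *
U+002B &#43; &#x2B; +
U+002C &#44; &#x2C; ,
U+002D &#45; &#x2D; -
U+002E &#46; &#x2E; .
U+002F &#47; &#x2F; /
U+0030 &#48; &#x30; 0
U+0031 &#49; &#x31; 1
U+0032 &#50; &#x32; 2
U+0033 &#51; &#x33; 3
U+0034 &#52; &#x34; 4
U+0035 &#53; &#x35; 5
U+0036 &#54; &#x36; 6
U+0037 &#55; &#x37; 7
U+0038 &#56; &#x38; 8
U+0039 &#57; &#x39; 9
U+003A &#58; &#x3A; :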
U+003B &#59; &#x3B; ;
U+003C &#60; &#x3C; <
U+003D &#61; &#x3D; =
U+003E &#62; &#x3E; >
U+003F &#63; &#x3F; ?
U+0040 &#64; &#x40; @
U+0041 &#65; &#x41; A
U+0042 &#66; &#x42; B
U+0043 &#67; &#x43; C
U+0044 &#68; &#x44; D
U+0045 &#69; &#x45; E
U+0046 &#70; &#x46; F
U+0047 &#71; &#x47; G
U+0048 &#72; &#x48; H
U+0049 &#73; &#x49; I
U+004A &#74; &#x4A; J
U+004B &#75; &#x4B; K
U+004C &#76; &#x4C; L
U+004D &#77; &#x4D; M
U+004E &#78; &#x4E; N
U+004F &#79; &#x4F; O
U+0050 &#80; &#x50; P
U+0051 &#81; &#x51; Q
U+0052 &#82; &#x52; R
U+0053 &#83; &#x53; S
U+0054 &#84; &#x54; T
U+0055 &#85; &#x55; U
U+0056 &#86; &#x56; V
U+0057 &#87; &#x57; W
U+0058 &#88; &#x58; X
U+0059 &#89; &#x59; Y
U+005A &#90; &#x5A; Z
U+005B &#91; &#x5B; [
U+005C &#92; &#x5C; \
U+005D &#93; &#x5D; ]
U+005E &#94; &#x5E; ^
U+005F &#95; &#x5F; _
U+0060 &#96; &#x60; `
U+0061 &#97; &#x61; a
U+0062 &#98; &#x62; b
U+0063 &#99; &#x63; c
U+0064 &#100; &#x64; d
U+0065 &#101; &#x65; e
U+0066 &#102; &#x66; f
U+0067 &#103; &#x67; g
U+0068 &#104; &#x68; h
U+0069 &#105; &#x69; i
U+006A &#106; &#x6A; j
U+006B &#107; &#x6B; k
U+006C &#108; &#x6C; l
U+006D &#109; &#x6D; m
U+006E &#110; &#x6E; n
U+006F &#111; &#x6F; o
U+0070 &#112; &#x70; p
U+0071 &#113; &#x71; q
U+0072 &#114; &#x72; r
U+0073 &#115; &#x73; s
U+0074 &#116; &#x74; t
U+0075 &#117; &#x75; u
U+0076 &#118; &#x76; v
U+0077 &#119; &#x77; w
U+0078 &#120; &#x78; x
U+0079 &#121; &#x79; y
U+007A &#122; &#x7A; z
U+007B &#123; &#x7B; {
U+007C &#124; &#x7C; |
U+007D &#125; &#x7D; }
U+007E &#126; &#x7E; ~
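
A listing of this kind can be generated mechanically rather than typed by hand. The sketch below is one possible way to do so in Python; the helper name ncr_row and the single-space column layout are assumptions made for illustration.

```python
def ncr_row(cp: int) -> str:
    """Format one row: code point, decimal NCR, hexadecimal NCR, character."""
    # The space character is invisible, so label it instead of printing it.
    effect = "(space)" if cp == 0x20 else chr(cp)
    return f"U+{cp:04X} &#{cp}; &#x{cp:X}; {effect}"


# Printable ASCII runs from U+0020 (space) through U+007E (tilde).
rows = [ncr_row(cp) for cp in range(0x20, 0x7F)]

for row in rows:
    print(row)
```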

Discussion

Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.

Ideally, when a document written in a markup language is encoded as a sequence of bits for storage or transmission over a network, the encoding used can represent every character in the document, if not the whole of Unicode, directly as a particular bit sequence.

Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can only represent, at most, 256 unique characters as one 8-bit byte each.

In practice, documents are rarely allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.

The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.
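
To make the escaping idea concrete, here is a minimal Python sketch that encodes a string into a limited encoding and substitutes a decimal numeric character reference for each character the encoding cannot represent. It uses the standard "xmlcharrefreplace" error handler of str.encode; the sample text is invented for illustration.

```python
text = "Σ is sigma; Æ is ash"

# ISO 8859-1 can encode Æ directly (byte 0xC6) but has no Σ,
# so only Σ is replaced by a reference.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Plain ASCII can encode neither, so both characters become references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(latin1_bytes)
print(ascii_bytes)
```

Note that the handler only escapes characters that are unencodable; it does not escape characters with syntactic meaning, such as the ampersand.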

Character references that are based on the referenced character's UCS or Unicode code point are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows:

Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:

  • one or more decimal digits zero (U+0030) through nine (U+0039); or
  • character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066);

all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.
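
The syntax described above can be sketched as a regular expression. The following Python example is an illustration only: the names NCR_PATTERN and parse_ncr are hypothetical, and the function checks the syntax alone, not whether a given markup language permits the resulting code point.

```python
import re
from typing import Optional

# "&#", then either decimal digits or "x" plus hexadecimal digits, then ";".
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def parse_ncr(ref: str) -> Optional[int]:
    """Return the code point named by ref, or None if ref does not match the syntax."""
    match = NCR_PATTERN.fullmatch(ref)
    if match is None:
        return None
    decimal_digits, hex_digits = match.groups()
    if decimal_digits is not None:
        return int(decimal_digits, 10)
    return int(hex_digits, 16)


print(parse_ncr("&#931;"), parse_ncr("&#x3a3;"), parse_ncr("&#931"))
```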

The characters that make up a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.

There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.
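
The relationship between the two kinds of reference can be illustrated with Python's standard library, whose html.entities.name2codepoint dictionary maps the HTML 4 entity names to code points. This is a sketch for illustration; the variable names are arbitrary.

```python
import html
from html.entities import name2codepoint

# Look up the code point behind the entity name "Sigma".
sigma_cp = name2codepoint["Sigma"]

# The named reference and the numeric reference decode to the same character.
named = html.unescape("&Sigma;")
numeric = html.unescape(f"&#{sigma_cp};")

print(sigma_cp, named, numeric)
```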

Restrictions

The Universal Character Set defined by ISO 10646 is the "document character set" of SGML-based languages such as HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS.

While the syntax of SGML does not prohibit references to invalid or unassigned code points, such as &#xFFFF;, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters.
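
One rough way to test whether a code point is assigned is to look at its Unicode general category, where "Cn" means unassigned (noncharacters such as U+FFFF also fall in this category). The Python sketch below does this with the standard unicodedata module. The helper name is_assigned is hypothetical, the result depends on the Unicode version bundled with the interpreter, and the check only approximates what a particular markup language might enforce.

```python
import unicodedata


def is_assigned(cp: int) -> bool:
    """True if the code point's general category is anything other than Cn."""
    return unicodedata.category(chr(cp)) != "Cn"


print(is_assigned(0x03A3))  # Greek capital letter sigma
print(is_assigned(0xFFFF))  # a noncharacter
```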

Restrictions may also apply for other reasons. For example, in HTML 4, &#12;, a reference to the non-printing "form feed" control character, is allowed because the form feed character itself is allowed. In XML 1.0, however, the form feed character cannot be used, not even by reference.[1] As another example, &#128;, a reference to another control character, is not allowed in HTML 4 and is discouraged in XML; when used in HTML, it is usually not flagged as an error by web browsers, some of which interpret it, for compatibility reasons, as a reference to the character with code value 128 in the Windows-1252 encoding. That character, "€", must be represented as &#8364; in standards-compliant HTML. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like &#65536; (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended.
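
Some of these restrictions can be observed with Python's standard library, as in the sketch below. It assumes the expat-based xml.etree.ElementTree parser, which implements a current edition of XML 1.0, and html.unescape, which follows HTML5 rules (including the Windows-1252 compatibility mapping); the helper name xml_accepts is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(ref: str) -> bool:
    """Return True if the XML parser accepts ref as element content."""
    try:
        ET.fromstring(f"<a>{ref}</a>")
        return True
    except ET.ParseError:
        return False


# Form feed (U+000C) is not a legal XML 1.0 character, even by reference.
form_feed_ok = xml_accepts("&#12;")

# U+10000 is accepted by parsers that implement current editions of XML 1.0.
supplementary_ok = xml_accepts("&#65536;")

# HTML5-style decoding maps &#128; to the Windows-1252 character at 0x80.
euro_from_128 = html.unescape("&#128;")
euro_from_8364 = html.unescape("&#8364;")

print(form_feed_ok, supplementary_ok, euro_from_128, euro_from_8364)
```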

Markup languages also place restrictions on where character references can occur.

Compatibility issues

In the initial versions of SGML and HTML, numeric character references were interpreted relative to the document's character encoding rather than Unicode. For Latin-script documents, numeric character references to characters between x80 and x9F in such documents are not correct against Unicode and must be recoded. HTML standards prior to HTML 4 supported only Western Latin-script documents; the treatment of character references above #7F may vary between applications and national conventions.

For example, as mentioned above, the correct numeric character reference for the Euro sign "€" U+20AC when using Unicode is decimal &#8364; and hexadecimal &#x20AC;. However, if using tools supporting obsolete implementations of HTML, the reference &#128; (Euro sign in the CP-1252 code page) or &#164; (Euro sign in ISO/IEC 8859-15) may work.
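
The numbers in this example can be verified with a short Python sketch, shown here as an illustration. The codec names "cp1252" and "iso-8859-15" are Python's names for these encodings.

```python
euro = "\u20ac"  # EURO SIGN, U+20AC

# Byte values of the Euro sign in two legacy 8-bit encodings.
cp1252_byte = euro.encode("cp1252")
latin9_byte = euro.encode("iso-8859-15")

# The Unicode-based references, which do not depend on the document encoding.
decimal_ref = f"&#{ord(euro)};"
hex_ref = f"&#x{ord(euro):X};"

print(cp1252_byte[0], latin9_byte[0], decimal_ref, hex_ref)
```

The legacy byte values 128 and 164 are what the obsolete references &#128; and &#164; were borrowing, whereas 8364 (hexadecimal 20AC) is the code point itself.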

As another example, if some text was originally created using the MacRoman character set, the left double quotation mark (“) will be represented with code point xD2. This will not display properly in a system expecting a document encoded as UTF-8, ISO 8859-1, or CP-1252, where this code point is occupied by the letter Ò. The correct numeric character reference for “ in HTML 4 and newer is &#x201C;, because U+201C is its UCS code. In some systems, the named character reference &ldquo; may also be available.
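
Recoding such text amounts to decoding the bytes with the encoding they were actually written in, then emitting Unicode-based references. The Python sketch below illustrates this for the single byte xD2. "mac_roman" is Python's codec name for MacRoman, and the "xmlcharrefreplace" handler produces decimal references, so the result is &#8220; (the decimal form of &#x201C;).

```python
raw = bytes([0xD2])

# Interpreted as MacRoman, the byte is the left double quotation mark.
as_mac_roman = raw.decode("mac_roman")

# Interpreted as ISO 8859-1, the same byte is the letter Ò.
as_latin1 = raw.decode("iso-8859-1")

# Recode the correctly decoded text as an ASCII-only numeric character reference.
recoded = as_mac_roman.encode("ascii", errors="xmlcharrefreplace")

print(as_mac_roman, as_latin1, recoded)
```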

See also

  • List of XML and HTML character entity references

References

  1. "HTML 5.2: 8. The HTML syntax". www.w3.org.

