Chinese character description languages

Chinese character description languages are formal languages proposed for describing Chinese (or CJK) characters accurately and completely: their components, their strokes (basic and complex), the stroke order, and the position of each element on an empty square background. They are designed to supply the structural information that a bitmap description lacks. This richer information can be used to distinguish variants of characters that Unicode and ISO/IEC 10646 unify into one code point, and to represent rare characters that do not yet have a standardized encoding in either standard. Many aim to work for both Kaishu and Song styles and to expose a character's internal structure, which makes it easier to look up a character by indexing its components and to cross-reference similar characters.

CDL

 
Figure: The cascading-components approach of CDL.

Character Description Language (CDL) is an XML-based font technology co-created by Tom Bishop and Richard Cook for Wenlin Institute, Inc. It was designed for describing any CJK character, but is suitable for describing any glyph.

This XML-based declarative language defines the stroke order of each component (a subunit of the glyph similar to a radical, but not necessarily bearing the semantic significance of a true radical), as well as assembly of previously defined components to build up ever more complex characters. Many of these components are characters in their own right, in addition to serving as building-block components.

Each glyph is laid out on a square background of 128 pixels on each side. On this background:

  1. Each of about 50 strokes can be drawn in SVG.
  2. A basic component is composed by calling several strokes. In this component, each stroke is described by its bottom-left and top-right corner. Transformations are possible (reduction, enlargement, etc.). There are more than 1,000 basic components.
  3. A character is composed by calling several components. In this character, each component is described by its bottom-left and top-right corner. In order for a component to fit into its proper portion of the Chinese character's rectangular block, a component may be transformed (e.g., horizontal or vertical reduction or enlargement) upon its use as a building-block embedded within a containing more-complex character.

Accordingly, a set of fewer than 50 strokes[1] allows one to construct a set of about 1,000 components,[2] which may in turn be embedded within the descriptions of tens of thousands of characters.[2] A change in the shape of one of the basic strokes is implicitly applied within every character that embeds that stroke. Likewise, a change to a component is implicitly applied within every character whose assembly uses that component.[2]
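The cascading placement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CDL's actual XML syntax or Wenlin's implementation: the stroke names `h` and `v`, the component names `cross` and `pair`, and the helpers `Box.place` and `flatten` are all hypothetical. It shows how a box given by its bottom-left and top-right corners on the 128-unit square can be rescaled each time a component is embedded in a larger one.

```python
from dataclasses import dataclass

GRID = 128  # side of the square background, as stated in the text


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: bottom-left (x0, y0) and top-right (x1, y1) corners."""
    x0: float
    y0: float
    x1: float
    y1: float

    def place(self, inner: "Box") -> "Box":
        """Map `inner`, expressed on the full GRID x GRID square, into this box."""
        sx = (self.x1 - self.x0) / GRID  # horizontal reduction or enlargement
        sy = (self.y1 - self.y0) / GRID  # vertical reduction or enlargement
        return Box(self.x0 + inner.x0 * sx, self.y0 + inner.y0 * sy,
                   self.x0 + inner.x1 * sx, self.y0 + inner.y1 * sy)


# Hypothetical data: two strokes and two components built from them.
STROKES = {"h", "v"}
COMPONENTS = {
    "cross": [("h", Box(0, 56, 128, 72)), ("v", Box(56, 0, 72, 128))],
    "pair": [("cross", Box(0, 0, 64, 128)), ("cross", Box(64, 0, 128, 128))],
}


def flatten(name, frame=Box(0, 0, GRID, GRID)):
    """Expand a component recursively into (stroke, absolute box) pairs."""
    if name in STROKES:
        return [(name, frame)]
    placed = []
    for child, box in COMPONENTS[name]:
        placed.extend(flatten(child, frame.place(box)))
    return placed


if __name__ == "__main__":
    for stroke, box in flatten("pair"):
        print(stroke, box)
```

Because `pair` refers to `cross` by name rather than copying its strokes, editing the definition of `cross` changes every glyph that embeds it, which is the implicit propagation the paragraph above describes.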

T. Bishop and R. Cook explain this as follows:

The stroke count of one character is generally related to the stroke counts of other characters. Most characters are built from components, and as long as the stroke counts of those components are defined, there is rarely any difficulty in adding them together to obtain the combined stroke count. Therefore, if a standard defines the strokes of a few thousand characters, it implicitly defines the strokes of many thousands of additional characters.[3]
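The additive relationship in this quotation can be illustrated with a small recursive lookup. The tables `BASE_STROKES` and `PARTS` and the function `stroke_count` are hypothetical names chosen for this sketch, and the data is a toy sample rather than anything taken from the CDL database.

```python
from functools import lru_cache

# Toy data (an assumption for illustration): stroke counts of a few base
# components, and decompositions of larger characters into components.
BASE_STROKES = {"木": 4, "日": 4, "月": 4}
PARTS = {
    "林": ("木", "木"),
    "森": ("木", "林"),
    "明": ("日", "月"),
}


@lru_cache(maxsize=None)
def stroke_count(ch: str) -> int:
    """Return the stroke count of `ch` by summing the counts of its components."""
    if ch in BASE_STROKES:
        return BASE_STROKES[ch]
    return sum(stroke_count(part) for part in PARTS[ch])


if __name__ == "__main__":
    for ch in "林森明":
        print(ch, stroke_count(ch))
```

Only three base counts are stored here, yet the counts of all characters built from them follow automatically.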

As of 2020, nearly 100,000 Chinese characters have been described via CDL.[4]

HanGlyph

HanGlyph is a character description language intended for supplying missing rare characters in documents (addressing the Chinese equivalent of the gaiji problem).[5] Documents can contain markup for missing characters, which automatically triggers the generation of small fonts that provide those characters. The language itself is a simple postfix notation describing strokes and ways to combine them. The prototype software uses Metapost to render the characters and embed them in LaTeX documents. The language was presented by Wai Wong in 1997,[6] and papers about its implementation in Metapost and LaTeX appeared at TeX user group conferences in 2003.[7][8]
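To show what a postfix notation for strokes and combinations looks like in general, here is a minimal stack-based reader. It does not reproduce HanGlyph's real vocabulary: the stroke tokens `h`, `s`, `p`, `n`, the operators `lr` and `tb`, and the function `parse_postfix` are all assumptions made for this sketch.

```python
# Hypothetical vocabulary, NOT HanGlyph's actual tokens.
STROKE_TOKENS = {"h", "s", "p", "n"}
OPERATORS = {"lr", "tb"}  # left-right and top-bottom composition


def parse_postfix(expr: str):
    """Turn a whitespace-separated postfix expression into a nested tuple."""
    stack = []
    for tok in expr.split():
        if tok in STROKE_TOKENS:
            stack.append(tok)  # operands are pushed as they appear
        elif tok in OPERATORS:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} needs two operands")
            right = stack.pop()
            left = stack.pop()
            stack.append((tok, left, right))  # the operator follows its operands
        else:
            raise ValueError(f"unknown token: {tok!r}")
    if len(stack) != 1:
        raise ValueError("expression does not reduce to a single glyph")
    return stack[0]


if __name__ == "__main__":
    print(parse_postfix("h s tb p lr"))
```

A postfix form needs no parentheses, since each operator applies to the operands already on the stack, which keeps inline markup for a missing character compact.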

Ideographic Description Sequences

Chapter 12 of the Unicode specification[9] defines a syntax for "Ideographic Description Sequences" (IDSs) intended for use in describing characters not included in the standard in terms of combinations of components that do have code points. Sixteen special characters in the range U+2FF0 to U+2FFF act as prefix operators to combine other characters or sequences to form larger characters.

Ideographic Description Characters in Unicode

Character  Code point  Full Unicode name
⿰         U+2FF0      Ideographic description character left to right
⿱         U+2FF1      Ideographic description character above to below
⿲         U+2FF2      Ideographic description character left to middle and right
⿳         U+2FF3      Ideographic description character above to middle and below
⿴         U+2FF4      Ideographic description character full surround
⿵         U+2FF5      Ideographic description character surround from above
⿶         U+2FF6      Ideographic description character surround from below
⿷         U+2FF7      Ideographic description character surround from left
⿼         U+2FFC      Ideographic description character surround from right
⿸         U+2FF8      Ideographic description character surround from upper left
⿹         U+2FF9      Ideographic description character surround from upper right
⿺         U+2FFA      Ideographic description character surround from lower left
⿽         U+2FFD      Ideographic description character surround from lower right
⿻         U+2FFB      Ideographic description character overlaid
⿾         U+2FFE      Ideographic description character horizontal reflection
⿿         U+2FFF      Ideographic description character rotation

Two related characters are encoded in other Unicode blocks. U+31EF is an ideographic description character (subtraction). U+303E IDEOGRAPHIC VARIATION INDICATOR is not officially an ideographic description character, but is sometimes used in ideographic description sequences.

Other related Ideographic Description Characters in Unicode

Character  Code point  Block                        Full Unicode name
〾         U+303E      CJK Symbols and Punctuation  Ideographic variation indicator
㇯         U+31EF      CJK Strokes                  Ideographic description character subtraction

These sequences are useful for describing to the reader a character that is not directly printable, either because it is absent from a given font or because it is absent from the Unicode standard altogether. For example, the Sawndip character 𭨡 (encoded in CJK Unified Ideographs Extension F as U+2DA21) can be described as "⿰書史". Another use is dictionary lookup, where a sequence serves as a rough input method for queries.

These sequences can be rendered either by keeping the individual characters separately or by parsing the Ideographic Description Sequence and drawing the ideograph so described.[10] They do not, by themselves, provide unambiguous rendering for all characters. For instance, the sequence ⿱十一 represents both 土 ("soil", the middle bar being narrower) and 士 ("bachelor", the middle bar being wider).
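Parsing an Ideographic Description Sequence amounts to reading a prefix expression. The sketch below is a simplified reader, not the grammar from the Unicode specification: the names `ARITY` and `parse_ids` are hypothetical, and the arities assume that the left-to-middle-and-right and above-to-middle-and-below operators take three operands, that reflection and rotation take one, and that every other operator takes two.

```python
# Operand counts per operator (an assumption of this sketch, see lead-in).
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x3000)}
ARITY["\u2FF2"] = ARITY["\u2FF3"] = 3   # left/middle/right, above/middle/below
ARITY["\u2FFE"] = ARITY["\u2FFF"] = 1   # horizontal reflection, rotation
ARITY["\u31EF"] = 2                      # subtraction


def parse_ids(seq: str):
    """Parse a prefix IDS string into a nested tuple (operator, operand, ...)."""

    def parse(i: int):
        if i >= len(seq):
            raise ValueError("sequence ends before all operands were read")
        ch = seq[i]
        count = ARITY.get(ch)
        if count is None:          # an ordinary character is a leaf
            return ch, i + 1
        operands = []
        i += 1
        for _ in range(count):     # read exactly `count` operands after the operator
            node, i = parse(i)
            operands.append(node)
        return (ch, *operands), i

    tree, end = parse(0)
    if end != len(seq):
        raise ValueError("unexpected characters after a complete sequence")
    return tree


if __name__ == "__main__":
    print(parse_ids("⿰書史"))
    print(parse_ids("⿱木⿰木木"))
```

The resulting tree records only which components combine in which arrangement. It carries no proportions, which is why a sequence such as ⿱十一 cannot by itself distinguish 土 from 士.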

Unicode's specification for these sequences is based on the characters and syntax of the earlier GBK standard. Additional symbols were later encoded to fill in the missing combinations.

The IDSgrep free software package by Matthew Skala[11][12] extends Unicode's IDS syntax to include additional features for dictionary lookup; it is capable of converting KanjiVG's database to its own extended IDS format, or of searching EIDS files generated by the related Tsukurimashou font family.
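Dictionary lookup by structure can be illustrated with a toy index. This is not IDSgrep's query language or data format: the table `IDS_TABLE` and the functions `is_operator`, `components`, and `find_by_components` are hypothetical, and the handful of decompositions is a small sample chosen for the sketch.

```python
# Toy decomposition table (an assumption for illustration only).
IDS_TABLE = {
    "好": "⿰女子",
    "明": "⿰日月",
    "林": "⿰木木",
    "森": "⿱木林",
    "𭨡": "⿰書史",
}


def is_operator(ch: str) -> bool:
    """True for characters in the Ideographic Description Characters block."""
    return "\u2FF0" <= ch <= "\u2FFF"


def components(ch: str, table: dict) -> set:
    """Collect every component of `ch`, including components of components."""
    found = set()
    for part in table.get(ch, ""):
        if is_operator(part):
            continue
        found.add(part)
        found |= components(part, table)
    return found


def find_by_components(wanted: set, table: dict) -> list:
    """Return the characters whose decomposition contains all wanted components."""
    return sorted(ch for ch in table if wanted <= components(ch, table))


if __name__ == "__main__":
    print(find_by_components({"木"}, IDS_TABLE))
    print(find_by_components({"書", "史"}, IDS_TABLE))
```

A query here ignores the layout operators entirely. A fuller tool would also match on the arrangement of components, which is the kind of feature the extended syntax mentioned above is meant to provide.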

KanjiVG

KanjiVG (Kanji Vector Graphics) is a free, Creative Commons-licensed Japanese character description language (intended to eventually expand to Chinese as well) based on the SVG vector graphics format.

SCML

In 2007, Structural Character Modeling Language (SCML) was proposed as a different kind of XML-based Chinese-character description language whose positioning, unlike that of CDL and HanGlyph, is not based on a numerical grid. The known database of characters whose strokes and components are encoded in SCML serves as a demonstration of principle only; no known effort exists to encode, say, all of Unicode's CJK characters in SCML.

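See also

  • List of Shuowen Jiezi radicals: a system of 540 components used by Xu Shen (d. 147 AD) in his Shuowen Jiezi
  • List of Kangxi radicals: a system of 214 components used by the Kangxi dictionary (1716), made under the leadership of the Kangxi Emperor
  • List of Unicode radicals: a modern, computer-based, ongoing attempt led by Unicode to create a complete and accurate list of CJK components
  • Cangjie input method
  • Radical
  • Stroke

Notes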

  1. ^ Bishop & Cook 2013-12-31:p2
  2. ^ a b c Bishop & Cook 2013-12-31:p9
  3. ^ Bishop & Cook 2003b, pp. 8–9, point no. 12
  4. ^ Wenlin Institute webpage for CDL
  5. ^ "HanGlyph". Archived from the original on 24 January 2013. Retrieved 17 February 2012.
  6. ^ Wong, Wai (April 1997). "HanGlyph – a Chinese Character Description Language" (PDF). Proceedings of the Seventeenth International Conference on Computer Processing of Oriental Languages, Hong Kong. Archived from the original (PDF) on 2021-08-23.
  7. ^ Yiu, Candy L. K.; Wai Wong (July 2003). "Chinese Character Synthesis using METAPOST" (PDF). Proceedings of the 24th Annual Meeting and Conference of the TeX User Group, Hawaii, U.S.A. Archived (PDF) from the original on 2011-07-26.
  8. ^ Wong, Wai; Candy L. K. Yiu; Kelvin C. F. Ng (June 2003). "Typesetting Rare Chinese Characters in LaTeX" (PDF). Proceedings of the 14th European TeX Conference, Brest, France. Archived (PDF) from the original on 2011-11-06.
  9. ^ https://www.unicode.org/versions/Unicode6.0.0/ch12.pdf
  10. ^ "The Unicode® Standard – Version 12.0 – Core Specification" (PDF). Unicode Consortium. March 2019. p. 26.
  11. ^ "IDSgrep".
  12. ^ Skala, Matthew (2015). "A Structural Query System for Han Characters" (PDF). International Journal of Asian Language Processing. 23 (2): 127–159. arXiv:1404.5585. Archived from the original (PDF) on 2016-03-04. Retrieved 2016-01-13.

External links

CDL language from Wenlin Institute

  • Wenlin Institute (2015), Wenlin User's Guide : Character Description Language
  • Bishop, Tom; Cook, Richard, CDL specification
  • Bishop, Tom; Cook, Richard (2003a), Character Description Language (CDL): The Set of Basic CJK Unified Stroke Types (PDF)
  • Bishop, Tom; Cook, Richard (2003b), A Specification for CDL Character Description Language (PDF)
    • 2003/12/31 correction: Bishop, Tom; Cook, Richard (2003c), Specification for CDL (PDF), archived from the original (PDF) on 2016-04-05, retrieved 2018-01-17
  • Cook, Richard (2003), Chinese Character Description Languages (PDF)
  • Bishop, Tom (2007), A character description language for CJK (PDF), Multilingual, #91, Volume 18 Issue 7, pp. 62–8
  • Digital Humanities Start-up Grant from the U.S. National Endowment for the Humanities

SCML

  • Peebles, Daniel G. (May 29, 2007), SCML: A Structural Representation for Chinese Characters, Technical Report TR2007-592 (PDF), Devin Balkcom (advisor), Dartmouth College, p. 30, archived from the original (PDF) on March 10, 2016, retrieved August 30, 2009

HanGlyph

  • HanGlyph – a Chinese Character Description Language - Presentation, archived from the original on 2013-01-25, retrieved 2007-12-11
  • HanGlyph – a Chinese Character Description Language: Reference Manual (PDF), 13 September 2003, p. 31, archived from the original (PDF) on 4 March 2016, retrieved 11 December 2007
