Round-trip format conversion

The term round-trip is used in document conversion, particularly for markup languages such as XML and SGML. A successful round-trip consists of converting a document in format A (docA) to one in format B (docB) and then back again to format A (docA′). If docA and docA′ are identical, no information has been lost and the round-trip has succeeded. More generally, the term covers converting any data representation to another and back again, including from one data structure to another.

Information loss

When a document in one format is converted to another, information is likely to be lost. For example, suppose an HTML document is saved as plain text (*.txt): all the markup (structure, formatting, superscripts, …) is lost. Compound documents frequently lose their images and other embedded objects. If the text file is then converted back to the original format, that information will necessarily be missing.
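The HTML-to-text case can be sketched with Python's standard `html.parser` module. The `TextExtractor` class, the `html_to_text` helper and the sample documents are illustrative assumptions, not part of any particular converter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the character data of an HTML document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Hypothetical 'save as plain text' step: keep the text, discard the markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


doc_a = "<p>E = mc<sup>2</sup> is <b>famous</b>.</p>"
doc_other = "<p>E = mc2 is famous.</p>"

print(html_to_text(doc_a))                             # E = mc2 is famous.
print(html_to_text(doc_a) == html_to_text(doc_other))  # True
```

Two different HTML documents produce the same text file, so no reverse conversion can tell which one it started from: the superscript and the bold markup are gone.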

A similar effect occurs with image formats. Some formats, such as JPEG, achieve compression by discarding a small amount of information. If a lossless file, such as a BMP or PNG file, is converted to JPEG and back again, the result will differ from the original (although it may be visually very similar).

Conversely, initial and final documents that are not bitwise identical have not necessarily lost information. Some formats have undefined fields, or fields whose contents have no effect on the result.
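A small sketch of this point, using JSON purely as an assumed example format (the article does not name one): whitespace and key order differ after the round-trip, yet the parsed content is the same.

```python
import json

doc_a = '{"title": "Report", "pages": 3}'

# Round-trip through an in-memory data structure and back to text,
# with a different (but equally valid) layout.
data = json.loads(doc_a)
doc_a_prime = json.dumps(data, indent=2, sort_keys=True)

print(doc_a == doc_a_prime)                          # False: the text differs
print(json.loads(doc_a) == json.loads(doc_a_prime))  # True: same information
```

Comparing parsed content rather than raw bytes is one way to avoid reporting such differences as information loss.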

Markup languages

Markup languages such as XML can, in principle, hold any information, so the process docA → docX → docA′ can be designed to avoid information loss. It is now common to convert legacy formats to XML formats because the latter offer greater interoperability and a wider set of available tools. It is thus possible, for example, to convert Word documents to an XML format and reimport them.

The XML document should contain the same information as the legacy format. An important condition is that the round-trip (legacy → XML → legacy′) should result in effectively identical documents. Because some document structures allow flexibility in content order, whitespace, case-sensitivity, etc., it is useful to have a means of canonicalizing the legacy format. The full round-trip may then be:

legacy → canonicalLegacy → XML → legacy′ → canonicalLegacy′

If canonicalLegacy = canonicalLegacy′ then the round-trip has been successful.
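The pipeline above can be sketched as follows. The "key = value" legacy format, its canonical form (lowercase keys, trimmed whitespace, sorted lines) and all function names are hypothetical, chosen only to make the comparison step concrete.

```python
import xml.etree.ElementTree as ET


def canonicalize(legacy):
    """Normalize a hypothetical 'key = value' format: lowercase keys,
    trimmed whitespace, blank lines dropped, lines sorted."""
    pairs = []
    for line in legacy.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("=")
        pairs.append((key.strip().lower(), value.strip()))
    return "\n".join(f"{k}={v}" for k, v in sorted(pairs))


def legacy_to_xml(legacy):
    """Convert canonical 'key=value' lines to a small XML document."""
    root = ET.Element("settings")
    for line in legacy.splitlines():
        key, _, value = line.partition("=")
        ET.SubElement(root, "item", name=key).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_legacy(xml_text):
    """Convert the XML back; the writer uses its own spacing around '='."""
    root = ET.fromstring(xml_text)
    return "\n".join(f"{item.get('name')} = {item.text or ''}" for item in root)


legacy = "Title =  Annual Report\nauthor=J. Doe\n"
canonical_legacy = canonicalize(legacy)
xml_doc = legacy_to_xml(canonical_legacy)
legacy_prime = xml_to_legacy(xml_doc)
canonical_legacy_prime = canonicalize(legacy_prime)

print(legacy == legacy_prime)                      # False: order and spacing differ
print(canonical_legacy == canonical_legacy_prime)  # True: the round-trip succeeded
```

The raw documents differ, but only in ways this format treats as insignificant, so the comparison is made on the canonical forms.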

Character encodings

Unicode follows a principle of round-trip compatibility with older standardized legacy encodings, so converting a document to Unicode does not lose information: it can be converted back. To achieve this, Unicode compatibility characters were introduced.
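As an illustration, the sketch below uses Shift JIS as an assumed example of a legacy encoding (the choice is ours, not the article's). Shift JIS distinguishes the fullwidth letter "Ａ" from ASCII "A"; Unicode keeps that distinction with the compatibility character U+FF21, so the bytes survive the trip. Folding compatibility characters with NFKC normalization removes the distinction and breaks the round-trip.

```python
import unicodedata

# Shift JIS bytes: fullwidth "Ａ" (0x82 0x60) followed by ASCII "A" (0x41).
legacy_bytes = b"\x82\x60A"

text = legacy_bytes.decode("shift_jis")       # "\uff21A"
round_tripped = text.encode("shift_jis")
print(round_tripped == legacy_bytes)          # True

# NFKC maps the compatibility character U+FF21 to plain "A".
folded = unicodedata.normalize("NFKC", text)  # "AA"
print(folded.encode("shift_jis") == legacy_bytes)  # False
```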

Limitation

An application can claim to round-trip and be dishonest. For example, it may save the original data from docA as a field in docX, so that the reverse transformation to docA′ simply extracts that field. While this may be needed in some cases, the idea of a round-trip conversion is to go through another format representation or data structure and back again. Under such a strategy, even a small change to the converted document means that it can no longer be converted back to the original format.
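A minimal sketch of such a converter, under stated assumptions: the `<doc>`, `<body>` and `<original>` element names, the upper-casing stand-in for a lossy conversion, and both function names are invented for illustration.

```python
import base64
import xml.etree.ElementTree as ET


def to_xml_with_stash(doc_a):
    """Convert docA to docX, but also hide a verbatim copy of docA inside it."""
    root = ET.Element("doc")
    # Stand-in for a genuinely lossy conversion: letter case is discarded.
    ET.SubElement(root, "body").text = doc_a.upper()
    stash = base64.b64encode(doc_a.encode("utf-8")).decode("ascii")
    ET.SubElement(root, "original").text = stash
    return ET.tostring(root, encoding="unicode")


def from_xml_with_stash(doc_x):
    """'Reverse' conversion that ignores <body> and just unpacks the stash."""
    root = ET.fromstring(doc_x)
    return base64.b64decode(root.findtext("original")).decode("utf-8")


doc_a = "Hello, World"
doc_x = to_xml_with_stash(doc_a)
print(from_xml_with_stash(doc_x) == doc_a)  # True, but only because of the stash

# Edit the converted content: the change never reaches docA′.
edited = doc_x.replace("HELLO", "GOODBYE")
print(from_xml_with_stash(edited))          # Hello, World
```

The round-trip test passes on an untouched docX, yet an edit made in format B cannot be carried back, which is why this approach does not demonstrate a real conversion.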

Usage

The term appears to be in common use but is not recorded in dictionaries. A typical usage occurs in a 1999 xml-dev thread, but the term is likely to have been used earlier.[1]

See also

Lossy data conversion
Mojibake

References

  1. Kesselman, Joseph "keshlam" (March 25, 1999). "Round-trip issues". XML-dev. IBM Research. Gathering and replying to several comments [including CDATA].
