MAN page from CentOS 7 catdoc-0943-0.94.3-1.4.x86_64.rpm

catdoc

Section: User Commands (1)
Updated: Version 0.94.2

NAME

catdoc - reads MS-Word file and puts its content as plain text on standard output 

SYNOPSIS

catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset] [-f output-format] file

 

DESCRIPTION

catdoc behaves much like cat(1), but it reads an MS-Word file and produces human-readable text on standard output. Optionally it can use latex(1) escape sequences for characters which have special meaning for LaTeX. It also makes some effort to recognize MS-Word tables, although it never tries to write correct headers for the LaTeX tabular environment. Additional output formats, such as HTML, can be easily defined.

catdoc doesn't attempt to extract formatting information other than tables from an MS-Word document, so different output modes mean mainly that different characters should be escaped and different ways are used to represent characters missing from the output charset. See CHARACTER SUBSTITUTION below.

catdoc uses an internal unicode(7) representation of text, so it is able to convert texts when the charset in the source document doesn't match the charset on the target system. See CHARACTER SETS below.

If no file names are supplied, catdoc processes its standard input unless it is a terminal. It is unlikely that somebody could type a Word document from the keyboard, so if catdoc is invoked without arguments and stdin is not redirected, it prints a brief usage message and exits. Processing of standard input (even among other files) can be forced by using a dash '-' as a file name.

By default, catdoc wraps lines which are more than 72 chars long and separates paragraphs by blank lines. This behavior can be turned off by the -w switch. In wide mode catdoc prints each paragraph as one long line, suitable for import into word processors which perform word wrapping.
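The invocation rules above (file arguments, '-' for standard input, -w and -m for wrapping) can be sketched as a small Python wrapper. This is only an illustration: the helper names catdoc_argv and run_catdoc are hypothetical, and the sketch assumes a catdoc binary is available on PATH.

```python
import shutil
import subprocess


def catdoc_argv(paths, wide=False, margin=None):
    """Build a catdoc argument list (hypothetical helper, not part of catdoc)."""
    argv = ["catdoc"]
    if wide:
        argv.append("-w")                # one long line per paragraph
    elif margin is not None:
        argv += ["-m", str(margin)]      # right margin, 72 by default
    # With no file names, force processing of standard input with '-'.
    argv += list(paths) if paths else ["-"]
    return argv


def run_catdoc(paths, data=None, **options):
    """Run catdoc and return its standard output as bytes."""
    if shutil.which("catdoc") is None:
        raise FileNotFoundError("catdoc not found on PATH")
    result = subprocess.run(
        catdoc_argv(paths, **options),
        input=data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

The output is returned as bytes because catdoc writes 8-bit text in the destination charset; decoding is left to the caller.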
 

 

OPTIONS

-a
Shortcut for -f ascii. Produces ASCII text as output. Separates table columns with TAB.
-b
Process broken MS-Word file. Normally, catdoc checks if the first 8 bytes of the file are the Microsoft OLE signature. If so, it processes the file; otherwise it just copies it to stdout. This is intended for using catdoc as a filter for viewing all files with the .doc extension.
-d charset
Specifies destination charset name. The charset file has the format described in CHARACTER SETS below, should have a .txt extension and should reside in the catdoc library directory (/usr/lib64/catdoc). By default, the current locale charset is used if langinfo support is compiled in.
-f format
Specifies output format as described in CHARACTER SUBSTITUTION below. catdoc comes with two output formats - ascii and tex. You can add your own if you wish.
-l
Causes catdoc to list names of available charsets to stdout and exit successfully.
-m number
Specifies right margin for text (default 72). -m 0 is equivalent to -w.
-s charset
Specifies source charset (the one used in the Word document), if the Word document doesn't contain UTF-16 text. When reading RTF documents, it is typically not necessary, because RTF documents contain an ansicpg specification. But it can be set wrong by Word (I've seen RTF documents in Russian where cp1252 was specified). In this case this option takes precedence over the charset specified in the document. But the source_charset statement in the configuration file has lower priority than the charset in the document.
-t
Shortcut for -f tex. Converts all printable chars which have special meaning for LaTeX(1) into appropriate control sequences. Separates table columns by &.
-u
Declares that the Word document contains UNICODE (UTF-16) representation of text (as some Word-97 documents do). If catdoc fails to convert a Word document correctly with the default charset, try this option.
-8
Declares that the Word document is 8-bit, just in case catdoc recognizes the file format incorrectly.
-w
Disables word wrapping. By default catdoc output is split into lines not longer than 72 characters (or the number specified by the -m option) and paragraphs are separated by a blank line. With this option each paragraph is one long line.
-x
Causes catdoc to output unknown UNICODE characters as \xNNNN, instead of question marks.
-v
Causes catdoc to print some useless information about Word document structure to stdout before the actual start of text.
-V
Outputs catdoc version.
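The signature check mentioned under -b can be illustrated in Python. This is a sketch only: the function name is hypothetical, and the eight signature bytes are taken from the OLE2 Compound File format rather than from this manual page.

```python
# Header magic of an OLE2 compound file (value from the Compound File
# Binary format, not from this manual page).
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if the first 8 bytes match the OLE signature."""
    return data[:8] == OLE_SIGNATURE


def file_looks_like_ole(path: str) -> bool:
    """Read only the first 8 bytes of a file and test them."""
    with open(path, "rb") as handle:
        return looks_like_ole(handle.read(8))
```

A viewer script could use such a check to decide whether a file with a .doc extension is worth passing to catdoc at all.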

 

CHARACTER SETS

When processing an MS-Word file catdoc uses information about two character sets, typically different - input and output. They are stored in plain text files in the catdoc data directory. Character set files should contain two whitespace-separated hexadecimal numbers - the 8-bit code in the character set and the 16-bit Unicode code. Anything from a hash mark to the end of line is ignored, as well as blank lines.
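The file format described above is simple enough to parse in a few lines. The following Python sketch is an illustration, not catdoc's own parser; the function name is hypothetical, and it assumes the hexadecimal numbers may carry an optional 0x prefix, as in the mapping tables published by the Unicode Consortium.

```python
def parse_charset(text):
    """Parse charset file contents into {8-bit code: Unicode code point}."""
    table = {}
    for line in text.splitlines():
        # Anything from a hash mark to the end of line is ignored.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue                      # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue                      # e.g. an undefined code in the table
        # int(..., 16) accepts both "80" and "0x80".
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


sample = """\
# hypothetical excerpt of a charset file
0x41    0x0041  # LATIN CAPITAL LETTER A

0x80    0x20AC  # EURO SIGN
"""
charset = parse_charset(sample)
```

For output, the same table would be inverted (Unicode code point to 8-bit code).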

The catdoc distribution includes some of these character sets. Additional character set definitions, directly usable by catdoc, can be obtained from ftp.unicode.org. Charset files have a .txt suffix, which shouldn't be specified on the command line or in configuration files.

Note that catdoc is distributed with Cyrillic charsets as default. If you are not Russian, you probably don't want that, and should reconfigure catdoc at compile time or in the runtime configuration file.

When dealing with documents with charsets other than the default, remember that Microsoft never uses ISO charsets. While letters in, say, cp1252 are at the same positions as in ISO-8859-1, some punctuation signs would be lost if you specify ISO-8859-1 as the input charset. If you use cp1252, catdoc would deal with those signs as described in CHARACTER SUBSTITUTION below.
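The difference between cp1252 and ISO-8859-1 is easy to see with Python's built-in codecs. This short demonstration is independent of catdoc; it only shows why the choice of input charset matters for punctuation.

```python
raw = b"\x93caf\xe9\x94"   # curly quotes around "café" in cp1252 bytes

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

# The letter at 0xE9 is the same in both charsets.
same_letter = as_cp1252[4] == as_latin1[4] == "\u00e9"

# In cp1252, 0x93 and 0x94 are typographic quotes; in ISO-8859-1 the
# same bytes fall into the C1 control range and carry no punctuation.
quotes_cp1252 = (as_cp1252[0], as_cp1252[-1])
quotes_latin1 = (as_latin1[0], as_latin1[-1])
```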

 

CHARACTER SUBSTITUTION

catdoc converts an MS-Word file into the following internal Unicode representation:
1. Paragraphs are separated by the ASCII Line Feed symbol (0x000A).
2. Table cells within a row are separated by the ASCII Field Separator symbol (0x001C).
3. Table rows are separated by the ASCII Record Separator (0x001E).
4. All printable characters, including whitespace, are represented with their respective UNICODE codes.

This UNICODE representation is subsequently converted into 8-bit text in the target character set using the following four-step algorithm:

1. The list of special characters is searched for the given Unicode character. If found, the appropriate multi-character sequence is output instead of the character.
2. If there is an equivalent in the target character set, it is output.
3. Otherwise, the replacement list is searched and, if there is a multi-character substitution for this UNICODE char, it is output.
4. If all of the above fails, the "Unknown char" symbol (question mark) is output.
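The four steps above can be sketched in Python as follows. This is a minimal illustration and not catdoc's actual code: the function names and the three sample tables are hypothetical, and real tables would be loaded from the charset and map files.

```python
def convert_char(code_point, special, target, replacement, unknown=b"?"):
    """Convert one Unicode code point to bytes in the target charset."""
    if code_point in special:            # step 1: special characters
        return special[code_point]
    if code_point in target:             # step 2: direct equivalent
        return bytes([target[code_point]])
    if code_point in replacement:        # step 3: replacement list
        return replacement[code_point]
    return unknown                       # step 4: unknown character


def convert_text(text, special, target, replacement, unknown=b"?"):
    """Apply convert_char to every character of a string."""
    return b"".join(
        convert_char(ord(ch), special, target, replacement, unknown)
        for ch in text
    )


# Hypothetical tables: escape '&' as for TeX, target charset is plain
# US-ASCII, and an em dash is replaced by two hyphens.
special = {0x26: b"\\&"}
target = {code: code for code in range(0x80)}
replacement = {0x2014: b"--"}

converted = convert_text("a&b\u2014\u4e2d", special, target, replacement)
```

Note that '&' is escaped even though it exists in the target charset, because the special character list is consulted first.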

The list of special characters and the list of substitutions are character set-independent, because special chars should be escaped regardless of their existence in the target character set (usually, they are parts of US-ASCII, and therefore exist in any character set) and the replacement list is searched only for those characters which are not found in the target character set.

These lists are stored in the catdoc data directory in files whose names are prefixed with the format name. These files have the following format:

Each line can be either a comment (starting with a hash mark) or contain a hexadecimal UNICODE value, separated by whitespace from the string which would be substituted for it. If the string contains no whitespace it can be used as is; otherwise it should be enclosed in single or double quotes. Usual backslash sequences like '\n' and '\t' can be used in these strings.
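A reader for this map file format might look like the Python sketch below. It is an approximation under stated assumptions: the function names are hypothetical, only a few backslash sequences are recognized, and unknown sequences are kept verbatim, which may differ from what catdoc itself does.

```python
import re

# Backslash sequences handled by this sketch (an assumed subset).
_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}


def unescape(value):
    """Expand known backslash sequences; keep unknown ones unchanged."""
    return re.sub(
        r"\\(.)",
        lambda m: _ESCAPES.get(m.group(1), m.group(0)),
        value,
    )


def parse_map(text):
    """Parse map file contents into {Unicode code point: substitution}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        parts = line.split(None, 1)       # hex code, then the string
        if len(parts) != 2:
            continue
        code, value = parts
        # Strings containing whitespace are enclosed in quotes.
        if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
            value = value[1:-1]
        mapping[int(code, 16)] = unescape(value)
    return mapping
```

Only whole-line comments are stripped here, since a hash mark may itself appear in a substitution string.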

 

RUNTIME CONFIGURATION

Upon startup catdoc reads its system-wide configuration file /etc/catdocrc and then the user-specific configuration file ${HOME}/.catdocrc.

These files can contain the following directives:

source_charset = charset-name
Sets the default source charset, which would be used if no -s option is specified. Consult the configuration of a nearby Windows workstation to find the one you need.
target_charset = charset-name

Sets the default output charset. You probably know which one you use.
charset_path = directory-list
Colon-separated list of directories, which are searched for charset files. This allows you to install additional charsets in your home directory. If the first directory component of a path is ~, it is replaced by the contents of the HOME environment variable. On the MS-DOS platform, if a directory name starts with %s, it is replaced with the directory of the executable file. An empty element in the list (i.e. two consecutive colons) is considered the current directory.
map_path = directory-list
Colon-separated list of directories, which are searched for the special character map and the replacement map. The same substitution rules as in charset_path are applied.
format = format name
Output format which would be used by default. catdoc comes with two formats - ascii and tex - but nothing prevents you from writing your own format (a set of two map files - special character map and replacement map).
unknown_char = character specification
Sets the character to output instead of an unknown Unicode character (default '?'). The character specification can have one of two forms - a character enclosed in single quotes or a hexadecimal code.
use_locale =(yes|no)
Enables or disables automatic selection of the output charset (default yes), based on system locale settings (if enabled at compile time). If automatic detection is enabled, then output charset settings in the configuration files (but not on the command line) are ignored, and the current system locale charset is used instead. There is no automatic choice of input charset based on locale language, because most modern Word files (since Word 97) are Unicode anyway.
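The substitution rules given for charset_path and map_path can be sketched in Python. The function name is hypothetical, the MS-DOS %s rule is left out, and treating "~" as special only when it is the whole first path component is this sketch's reading of the description above.

```python
import os


def expand_path_list(value, home=None):
    """Expand a colon-separated directory list as described for charset_path."""
    if home is None:
        home = os.environ.get("HOME", "")
    directories = []
    for item in value.split(":"):
        if item == "":
            directories.append(".")              # empty element: current directory
        elif item == "~" or item.startswith("~/"):
            directories.append(home + item[1:])  # leading ~ becomes $HOME
        else:
            directories.append(item)
    return directories


# Example value as it might appear after "charset_path =" in ~/.catdocrc.
search_dirs = expand_path_list("~/charsets::/usr/lib64/catdoc", home="/home/user")
```

Directories are kept in the order given, so a charset installed in the home directory is found before one in the system library directory.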

 

BUGS

Doesn't handle fast-saves properly. Prints footnotes as separate paragraphs at the end of the file, instead of producing correct LaTeX commands. Cannot distinguish between an empty table cell and the end of a table row.

 

SEE ALSO

xls2csv(1), cat(1), strings(1), utf(4), unicode(7)

 

AUTHOR

V.B.Wagner <vitusAATT45.free.net>


 


This document was created by man2html, using the manual pages.