Section: User Commands (1)
Updated: 16 SEP 2003
demoroniser - correct moronic and gratuitously incompatible HTML generated by Microsoft applications
Many slick, high profile corporate Web sites I visit seemed to exhibit terrible grammar completely inconsistent with the obvious investment in graphics and design. Apostrophes and quote marks were frequently omitted, and every couple of paragraphs words were run together which should have been separated by a punctuation mark of some kind.
This remained a mystery to me until I wanted to convert a presentation I'd developed in 1996 using Microsoft PowerPoint into a set of Web pages. A friend was kind enough to run the presentation through PowerPoint's ``Save as HTML'' feature (I have abandoned all use of Microsoft products, so I did not have a current version of PowerPoint which includes this feature). When I got the PowerPoint-generated HTML back and viewed it in my browser, I discovered that it contained precisely the same grammatical errors I'd noted on so many Web sites, and which certainly were not present in my original presentation.
A little detective work revealed that, as is usually the case when you encounter something shoddy in the vicinity of a computer, Microsoft incompetence and gratuitous incompatibility were to blame. Western language HTML documents are written in the ISO 8859-1 Latin-1 character set, with a specified set of escapes for special characters. Blithely ignoring this prescription, as usual, Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.
These characters include open and close single and double quotes, em and en dashes, an ellipsis and a variety of other things you've been dying for, such as a capital Y umlaut and a florin symbol. Well, okay, you say, if Microsoft want to have their own little incompatible character set, why not? Because it doesn't stop there--in their inimitable fashion (who would want to?)--they aggressively pollute the Web pages of unknowing and innocent victims worldwide with these characters, with the result that the owners of these pages look like semi-literate morons when their pages are viewed on non-Microsoft platforms (or on Microsoft platforms, for that matter, if the user has selected as the browser's font one of the many TrueType fonts which do not include the incompatible Microsoft characters).
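The clash between the two readings of these bytes can be seen in a few lines. This is a minimal sketch in Python (not the Perl the program itself is written in), assuming the Microsoft character set in question is the one Python exposes as the cp1252 codec; the sample bytes and the helper name is_c1_control are illustrative only.

```python
# Bytes a Microsoft application might emit for the text: "Halt," he cried
# In Windows-1252, 0x93 and 0x94 are the opening and closing double quotes.
raw = b"\x93Halt,\x94 he cried"

as_cp1252 = raw.decode("cp1252")    # the reading intended by the Microsoft application
as_latin1 = raw.decode("latin-1")   # the reading of a strict ISO 8859-1 consumer


def is_c1_control(ch):
    """True for code points in the 0x80-0x9F control range of Latin-1 and Unicode."""
    return 0x80 <= ord(ch) <= 0x9F


print(repr(as_cp1252))
print(repr(as_latin1))
print([hex(ord(ch)) for ch in as_latin1 if is_c1_control(ch)])
```

Under cp1252 the two bytes decode to U+201C and U+201D, the curly quotes; under Latin-1 the same bytes land on control characters that have no printable form.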
You see, ``state of the art'' Microsoft Office applications sport a nifty feature called ``smart quotes.'' (Rule of thumb--every time Microsoft use the word ``smart,'' be on the lookout for something dumb). This feature is on by default in both Word and PowerPoint, and can be disabled only by finding the little box buried among the dozens of bewildering option panels these products contain. If enabled, and you type the string,
    "Halt," he cried, "this is the police!"
``smart quotes'' transforms the ASCII quote characters automatically into the incompatible Microsoft opening and closing quotes. ASCII single and double quotes are similarly transformed (even though ASCII already contains apostrophe and single open quote characters), and double hyphens are replaced by the incompatible em dash symbol. What other horrors occur, I know not. If the user notices this happening at all, their reaction might be ``Thank you Billy-boy--that looks ever so much nicer,'' not knowing they've been set up to look like a moron to folks all over the world.
You see, when you export a document as text for hand-editing into HTML, or avail yourself of the ``Save as HTML'' features in newer versions of Office applications, these incompatible, Microsoft-specific characters remain in place. When viewed by a user on a non-Microsoft platform, they will not be displayed properly--most browsers seem to just drop them, as opposed to including a symbol indicating an undisplayable character. Hence, the apparently ungrammatical text, which the author of the page, editing on a Microsoft platform, will never be aware of.
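The dropped-character effect described above is easy to reproduce. The following Python sketch only imitates a browser that silently discards the characters; the function name drop_undisplayable and the sample text are hypothetical, and real browsers of the period may have behaved differently in detail.

```python
def drop_undisplayable(text):
    """Mimic a browser that discards characters in the 0x80-0x9F control range."""
    return "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)


# "it's here--now" after smart quotes: 0x92 is the apostrophe, 0x97 the em dash.
page = b"it\x92s here\x97now".decode("latin-1")

print(drop_undisplayable(page))  # its herenow
```

The apostrophe vanishes and the two words on either side of the dash are run together, which is exactly the kind of apparent grammatical error seen on the affected pages.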
Having no desire to hand-edit the HTML for a long presentation to correct a raft of Microsoft-induced incompatibilities, I wrote a Perl program, the demoroniser, to transform Microsoft's ``junk HTML'' into at least a starting point for something I'd consider presentable on my site. In addition to replacing the incompatible characters with HTML-compliant equivalents wherever possible (a few rarely-encountered characters which can't be translated result in warning messages if encountered), the following sloppy or downright wrong HTML is corrected.
- The missing semicolon at the end of numeric character escapes (=) is supplied.
- Numeric renderings of special characters (< > &) are replaced with readable equivalents.
- Unquoted <table> tags containing non-alphanumeric characters are quoted.
- PowerPoint's mis-nesting of <font> and <strong> tags is corrected.
- PowerPoint's boneheaded use of <ul> and </ul> tags to accomplish paragraph breaks is corrected and the proper <p> tags inserted.
- Missing <tr> tags in text-only slides are inserted.
- Nugatory </p> tags are removed.
- Unmatched <li> tags in headings are removed.
- Idiot ``paragraph-long lines'' are broken into something suitable for editing with a normal text editor.
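Two of the corrections above, character replacement and the missing semicolon, can be sketched as follows. This is a Python illustration, not the program's Perl source: the REPLACEMENTS table is a small hypothetical subset with replacements chosen for the example, the function names are invented, and the input is assumed to be single-byte text (not UTF-8, where bytes in this range have another meaning).

```python
import re

# Hypothetical subset of Windows-1252 bytes and plain ASCII stand-ins.
# The replacements actually used by demoroniser may differ.
REPLACEMENTS = {
    0x85: b"...",  # ellipsis
    0x91: b"`",    # opening single quote
    0x92: b"'",    # closing single quote / apostrophe
    0x93: b'"',    # opening double quote
    0x94: b'"',    # closing double quote
    0x96: b"-",    # en dash
    0x97: b"--",   # em dash
}


def replace_ms_chars(data):
    """Replace known bytes in 0x80-0x9F; leave unknown ones untouched."""
    return re.sub(
        rb"[\x80-\x9f]",
        lambda m: REPLACEMENTS.get(m.group()[0], m.group()),
        data,
    )


def add_missing_semicolons(html):
    """Append ';' to numeric character escapes that lack one."""
    return re.sub(rb"&#(\d+)(?![\d;])", rb"&#\1;", html)


sample = b"\x93Halt,\x94 he cried &#61 loudly"
print(add_missing_semicolons(replace_ms_chars(sample)))
```

A real implementation would also need to warn about the bytes it cannot translate, as the description above says the program does; this sketch simply passes them through.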
- Quiet: don't print warnings for untranslated characters.
- Print how-to-call information and a summary of options.
- Wrap output lines at column cols. By default, lines are wrapped at column 72. A cols specification of 0 disables line wrapping. demoroniser attempts to wrap lines so as to preserve their meaning. Lines are broken at white space whenever possible. If this cannot be done, a line longer than the cols specification will remain in the output HTML.
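The wrapping rule just described (break at white space, never split a word, 0 disables wrapping) can be approximated with Python's standard textwrap module. This is a rough sketch under those stated rules only: the function name wrap_line is hypothetical, and it makes no attempt at the meaning-preserving care with HTML that the program itself is said to take.

```python
import textwrap


def wrap_line(line, cols=72):
    """Break a line at white space so pieces fit in cols columns.

    cols == 0 disables wrapping. A word longer than cols is left whole,
    so the resulting piece may exceed cols.
    """
    if cols == 0:
        return [line]
    pieces = textwrap.wrap(
        line,
        width=cols,
        break_long_words=False,   # never split inside a word
        break_on_hyphens=False,   # break only at white space
    )
    return pieces or [""]


for piece in wrap_line("aaa bbb ccc", cols=7):
    print(piece)
```

With cols=7 the sample yields two pieces, "aaa bbb" and "ccc"; an unbreakable run longer than cols is emitted on a line of its own.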
demoroniser is a Perl script. In order to use it, you must have Perl installed on your system. demoroniser was developed using Perl 4.0, patch level 36.
If no outfile is specified, output is written to standard output. If no infile is specified, input is read from standard input.
John Walker
WWW: http://www.fourmilab.ch/
This program is in the public domain.
This document was created by man2html, using the manual pages.