faq2html

(Convert some particular text formats into XHTML)

SYNOPSIS

faq2html [-hluv] [-s style] [-t title] [infile [outfile]]

Yes, this is another of that odd breed of partially functional beasts, a text to XHTML converter. It is my belief that writing a general text to XHTML converter is completely impossible, on the grounds that people do too many varied things with their text to intuit document structure from it. This is therefore a converter that will translate documents written the way I write.

It may or may not work for you. The chances that it will work for you are directly proportional to how much your writing looks like mine.

Usage is simple; just give it an input file and an output file. If the output file isn't given, it will write to stdout. If the input file also isn't given, it will read from stdin.

faq2html understands digest separators (lines of exactly thirty hyphens, from the minimal digest standard) and will treat a Subject header immediately after them as a section header. Beyond that, headings must either be outdented, underlined on the following line, or in all caps to be recognized as section headers. (Outdenting means that the regular text is indented by a few spaces, but headers start in column 0, or at least in a column farther to the left than the regular text.)
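As a rough illustration of the digest rule above, here is a minimal Python sketch (not faq2html's own code). The function name is hypothetical, and skipping blank lines between the separator and the Subject header is an assumption of the sketch, not something the text above promises.

```python
import re

# A digest separator is a line of exactly thirty hyphens.
DIGEST_SEPARATOR = "-" * 30


def digest_section_titles(lines):
    """Return the Subject values that follow a digest separator."""
    titles = []
    after_separator = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == DIGEST_SEPARATOR:
            after_separator = True
            continue
        if after_separator:
            if not line.strip():
                # Assumption: blank lines may sit between separator and header.
                continue
            match = re.match(r"Subject:\s*(.*)", line)
            if match:
                titles.append(match.group(1).strip())
            after_separator = False
    return titles
```

A line of twenty-nine or thirty-one hyphens is not a separator in this sketch, so a Subject line after it is ignored.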

Section headers that begin with numbers (with any number of periods) will be given <a id> tags whose id is that number prefixed with S. As a special case of the parsing, any section with a header containing "contents" will have lines beginning with numbers turned into links to the appropriate <a id> tags in the same document. You can use this to turn the table of contents of your minimal digest format FAQ into a real table of contents with links in the HTML version.
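The anchor naming can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and dropping a trailing period from the number (so that "1." becomes S1) is an assumption of the sketch.

```python
import html
import re

# Leading section number such as "1.", "2.1" or "1.2.3.", then the title.
NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def section_anchor(heading):
    """Return the anchor id for a numbered heading, or None if unnumbered."""
    match = NUMBERED.match(heading)
    if match is None:
        return None
    return "S" + match.group(1)


def contents_link(line):
    """Turn a numbered table-of-contents line into a link to its section."""
    match = NUMBERED.match(line)
    if match is None:
        return line
    text = html.escape(line.strip())
    return f'<a href="#S{match.group(1)}">{text}</a>'
```

With these definitions, a contents line "2.1 Installing" would link to the anchor S2.1 that the heading "2.1 Installing" receives.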

Text with embedded whitespace more than a single space or a couple of spaces at a sentence boundary or after a colon (and any text with literal tabs) will be wrapped in <pre> tags. So will any indented text that doesn't look like English paragraphs. URLs surrounded by <...> or <URL:...> will be turned into links. Other URLs will not be turned into links, nor is any effort made to turn random body text into links because it happens to look like a link. I dislike link syndrome.
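The URL rule above can be sketched in Python like this. It is an approximation, not the script's real logic: the regular expression and function name are hypothetical, and dropping the surrounding angle brackets from the output is an assumption.

```python
import html
import re

# Only URLs wrapped as <...> or <URL:...> qualify; bare URLs are left alone.
BRACKETED_URL = re.compile(
    r"<(?:URL:)?([a-z][a-z0-9+.-]*://[^\s<>]+)>", re.IGNORECASE
)


def link_bracketed_urls(text):
    """Escape text for HTML and turn bracketed URLs into links."""
    parts = []
    last = 0
    for match in BRACKETED_URL.finditer(text):
        # Escape the plain text between the previous URL and this one.
        parts.append(html.escape(text[last:match.start()], quote=False))
        url = html.escape(match.group(1), quote=True)
        parts.append(f'<a href="{url}">{url}</a>')
        last = match.end()
    parts.append(html.escape(text[last:], quote=False))
    return "".join(parts)
```

A URL that merely appears in running text passes through as ordinary text, which matches the stated dislike of linking anything that looks like a link.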

Bulleted lists and numbered lists will be turned into the appropriate HTML structures. Some attempt is also made to recognize description lists, but faq2html was written by someone who writes a lot of technical documentation and therefore tends to prefer <pre>; description lists are therefore only going to work if the description titles aren't indented relative to the surrounding text.

Regular indented paragraphs or paragraphs quoted with a consistent non-alphanumeric quote character are recognized and turned into HTML block quotes.

It's worthwhile paying attention to the headers at the top of your document so that faq2html can get a few things right. If you use RCS or CVS, put the RCS Id keyword as the first line of your document; it will be stripped out of the resulting output and faq2html will use it to determine the document revision. This should be followed by regular message headers and news.answers subheaders if the document is an actual FAQ, and faq2html will use the From and Subject headers to figure out a title and headings to use.

As a special case, an HTML-title header in the subheaders will override any other title that faq2html thinks it should use for the document.

faq2html expects your document to have a centered title, and will add one from the Subject header if it doesn't find one. It will also add centered subheaders giving the author (from the From header) and the last modified time and revision (from the RCS Id string) if there are no subheadings already. If there's a subheading that contains RCS identifiers, it will be replaced by a nicely formatted heading generated from the RCS Id information in the HTML output.
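To make the RCS Id handling concrete, here is a small Python sketch that pulls the revision, author, and date out of a standard RCS/CVS Id keyword. The sample Id string, the function names, and the wording of the generated heading are all made up for illustration; the heading faq2html actually emits may look different.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# $Id: file,v revision date time author state $
# The date may use slashes (older RCS) or hyphens (newer CVS).
RCS_ID = re.compile(
    r"\$Id: (?P<file>\S+),v (?P<revision>[\d.]+) "
    r"(?P<year>\d{4})[/-](?P<month>\d{2})[/-](?P<day>\d{2}) "
    r"\S+ (?P<author>\S+) (?P<state>\S+)[^$]*\$"
)


def parse_rcs_id(line):
    """Return revision details from an RCS Id keyword, or None."""
    match = RCS_ID.search(line)
    if match is None:
        return None
    month_number = int(match.group("month"))
    if not 1 <= month_number <= 12:
        return None
    month = MONTHS[month_number - 1]
    return {
        "file": match.group("file"),
        "revision": match.group("revision"),
        "author": match.group("author"),
        "date": f"{month} {int(match.group('day'))}, {match.group('year')}",
    }


def revision_heading(info):
    """Build a subheading line; the wording here is hypothetical."""
    return f"Revision {info['revision']}, last modified {info['date']}"
```

For example, parsing "$Id: faq.txt,v 1.12 2004/03/21 08:15:30 eagle Exp $" yields revision 1.12 and the date March 21, 2004.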

Text marked as *bold* using the standard asterisk notation will be surrounded by <strong> tags, if the asterisks appear to be marking bold text rather than serving as wildcards or some other function.
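One plausible way to tell bold markers from wildcards is to require a word character just inside each asterisk and no word character just outside. The Python sketch below uses that heuristic; it is an assumption for illustration, not the test faq2html really applies, and the function name is hypothetical.

```python
import re

# An opening asterisk must be followed by a word character and a closing
# asterisk preceded by one; neither may touch a word character or another
# asterisk on its outer side.
BOLD = re.compile(r"(?<![\w*])\*(\w(?:[^*\n]*\w)?)\*(?![\w*])")


def mark_bold(text):
    """Wrap *bold* spans in <strong> tags, leaving other asterisks alone."""
    return BOLD.sub(r"<strong>\1</strong>", text)
```

Under this heuristic, shell globs such as *.txt and spaced arithmetic such as 2 * 3 * 4 are left untouched.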

faq2html produces output (at least in the absence of any lurking bugs) that complies with the XHTML 1.0 Strict standard (unless -u is given, in which case it complies with XHTML 1.0 Transitional). The input and output character set is assumed to be UTF-8.

OPTIONS

-h, --help: Print out this documentation (which is done simply by feeding the script to perldoc -t).
-l, --last-modified: Add a last modified subheading to the converted document based on the last modification timestamp of the source file. This is only done if no RCS/CVS Id string is found in the file. If there is one, it is used in preference. This option is ignored if the input is not a file.
-s style, --style=style: Insert a reference to style as a style sheet into the generated web page. Unless this argument is given, no style sheet will be referred to in the generated web page.
-t title, --title=title: Use title as the page title rather than whatever may be determined from looking at the input file.
-u, --use-value: If this option is given, faq2html includes value attributes in all <li> tags so that the item numbers will match the numbers specified in the source. This is only necessary if the item numbers must continue to increase through disconnected numbered lists, or if the lists don't count as normal. Without this option, no value attributes are given and the numbering is left up to the browser, allowing the output to validate as XHTML 1.0 Strict instead of XHTML 1.0 Transitional.
-v, --version: Print out the version of faq2html and exit.
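
The effect of -u on numbered lists can be pictured with the following Python sketch. It only illustrates the difference in output; the function name, the item pattern, and the exact markup layout are assumptions rather than a copy of what faq2html emits.

```python
import html
import re

# A numbered item: optional indentation, digits, "." or ")", then the text.
ITEM = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def render_numbered_list(lines, use_value=False):
    """Render numbered source lines as an <ol>, optionally with value attributes."""
    out = ["<ol>"]
    for line in lines:
        match = ITEM.match(line)
        if match is None:
            continue  # the sketch ignores anything that is not a list item
        number, text = match.groups()
        # With use_value, keep the source number; otherwise let the browser count.
        attr = f' value="{int(number)}"' if use_value else ""
        out.append(f"<li{attr}>{html.escape(text)}</li>")
    out.append("</ol>")
    return "\n".join(out)
```

A list that resumes at 3 after an interruption keeps its numbers only when use_value is true; otherwise a browser would renumber it from 1.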

NOTES

I wrote this program because every other text to HTML converter that I've seen made specific assumptions about the document format and wanted you to write like it wanted you to write rather than like the way you wanted to write. This program instead wants you to write like I write. Which from my perspective is an improvement.

I don't claim that this is the be-all and end-all of text to XHTML converters, as I don't believe such a beast exists. I do believe it's pretty close to being the be-all and end-all of text to XHTML converters for text that I personally have written, since I've written into it a lot of knowledge of the sorts of text formatting conventions that I use. If you happen to use the same ones, you may be delighted with this program. If you don't, you'll probably be very frustrated with it.

In any case, I took to this project the perspective that whenever there was something this program couldn't handle, I wanted to make it smarter rather than change the input. I've mostly been successful at that, so far.

CAVEATS

This program attempts to do the impossible, namely intuit structure from an unstructured markup format. To do that, it relies on a whole bunch of fussy heuristics, poorly-understood assumptions, and sheer blind luck. To fully document the boundary cases of this program would take more time and patience than I care to invest; see the source code if you're curious. This is not a predictable or easily documentable program. Instead, it attempts to do what I mean without bugging me about it.

There is therefore, at least currently, no way to control or adjust parameters in this program without editing it. I may someday add that, but I'm leery of it, since the code complexity would start increasing exponentially if I tried to let people tweak everything. I've completely given up on more than one text to HTML converter because it had more options than ls and expected you to try to figure out which ones should be used for a document yourself. That's not the way I want this program to work.

English month names are used for the last modification dates. To change the names used, see the top of the script. Similarly, faq2html always says the language of the document is English.

AUTHOR

Russ Allbery <eagle@eyrie.org>

COPYRIGHT AND LICENSE

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

Last spun 2022-12-12 from POD modified 2021-03-28