html2text(1)

NAME

       html2text - an advanced HTML-to-text converter

SYNOPSIS

       html2text -help
       html2text -version
       html2text  [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ] [
       -rcfile path ] [ -style ( compact | pretty ) ] [ -width width  ]  [  -o
       output-file ] [ -nobs ] [ input-uri ...	]
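
       The option syntax above can be sketched as a small helper that
       assembles an argument list. This is only an illustration: the function
       name build_html2text_argv and its parameters are hypothetical and not
       part of html2text; only the flags themselves come from the synopsis.

```python
from typing import List, Optional, Sequence


def build_html2text_argv(
    inputs: Sequence[str] = (),
    *,
    mode: Optional[str] = None,      # "unparse" or "check" (mutually exclusive)
    rcfile: Optional[str] = None,
    style: Optional[str] = None,     # "compact" or "pretty"
    width: Optional[int] = None,
    output: Optional[str] = None,
    nobs: bool = False,
) -> List[str]:
    """Build an argv list following the SYNOPSIS (hypothetical helper)."""
    argv = ["html2text"]
    if mode is not None:
        if mode not in ("unparse", "check"):
            raise ValueError("mode must be 'unparse' or 'check'")
        argv.append("-" + mode)
    if rcfile is not None:
        argv += ["-rcfile", rcfile]
    if style is not None:
        if style not in ("compact", "pretty"):
            raise ValueError("style must be 'compact' or 'pretty'")
        argv += ["-style", style]
    if width is not None:
        argv += ["-width", str(width)]
    if output is not None:
        argv += ["-o", output]
    if nobs:
        argv.append("-nobs")
    # With no input-uris, html2text reads standard input.
    argv.extend(inputs)
    return argv
```

       The resulting list could be handed to a process launcher such as
       subprocess.run, assuming html2text is installed and on the PATH.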

DESCRIPTION

       html2text  reads  HTML  3.2 documents from the input-uris, formats each
       into a stream of plain text characters  (ISO  8859-1)  and  writes  the
       result  to standard output (or into output-file, if the -o command line
       option is used).

       Documents that are specified by a URI that begins with "http:" (RFC
       1738) are retrieved with the Hypertext Transfer Protocol (RFC 1945).
       URIs that begin with "file:" and URIs that do not contain a colon
       specify local files. All other URIs are invalid.

       If no input-uris are specified on the command line, html2text reads
       from standard input. A dash as the input-uri is an alternate way to
       specify standard input.
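
       The input-uri rules described above can be sketched as follows. This
       is a minimal illustration of the rules as stated on this page, not
       html2text's actual code; the function name classify_input_uri and the
       returned labels are assumptions made for the example.

```python
def classify_input_uri(uri: str) -> str:
    """Classify an input-uri by the rules in DESCRIPTION (hypothetical helper)."""
    if uri == "-":
        # A dash is an alternate way to specify standard input.
        return "stdin"
    if uri.startswith("http:"):
        # Retrieved with the Hypertext Transfer Protocol.
        return "http"
    if uri.startswith("file:") or ":" not in uri:
        # "file:" URIs and URIs without a colon name local files.
        return "local"
    # All other URIs are invalid.
    raise ValueError("invalid input-uri: " + uri)
```

       Under these rules a URI such as "ftp://host/page.html" is rejected,
       and so is anything else that contains a colon but starts with neither
       "http:" nor "file:".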

       html2text understands all HTML 3.2 constructs, but can render only part
       of them due to the limitations of the text output format. However, the
       program attempts to provide good substitutes for the elements it cannot
       render. It also accepts syntactically incorrect input and attempts to
       interpret it "reasonably".

       The way in which html2text formats the HTML documents is controlled by
       formatting properties read from an RC file. html2text attempts to read
       $HOME/.html2textrc (or the file specified by the -rcfile command line
       option); if that file cannot be read, html2text attempts to read
       /etc/html2textrc. If no RC file can be read (or if the RC file does
       not override all formatting properties), then "reasonable" defaults are
       assumed. The RC file format is described in the html2textrc(5) manual
       page.
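
       The RC file lookup order described above can be sketched like this.
       It is an illustration under stated assumptions, not the program's
       implementation: the name pick_rc_file, the home parameter, and the
       injectable readable callback are hypothetical and exist only to make
       the example self-contained.

```python
import os
from typing import Callable, Optional


def pick_rc_file(
    rcfile: Optional[str] = None,
    home: Optional[str] = None,
    readable: Callable[[str], bool] = lambda p: os.access(p, os.R_OK),
) -> Optional[str]:
    """Return the RC file that would be read, or None for built-in defaults."""
    if home is None:
        home = os.environ.get("HOME", "")
    # -rcfile replaces $HOME/.html2textrc as the first candidate.
    first = rcfile if rcfile is not None else os.path.join(home, ".html2textrc")
    for candidate in (first, "/etc/html2textrc"):
        if readable(candidate):
            return candidate
    # No RC file can be read: "reasonable" defaults apply.
    return None
```

       A return value of None corresponds to the case in which only the
       built-in defaults are used.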

OPTIONS

       -help  Print command line summary and exit.

       -version
	      Print program version and exit.

       -unparse
              This option is for diagnostic purposes: instead of formatting
              the parsed document, generate HTML code that is guaranteed to
              be syntactically correct. If html2text has problems parsing a
              syntactically incorrect HTML document, this option may help you
              understand what html2text thinks the original HTML code
              means.

       -check This option is for diagnostic purposes: the HTML document is
              only parsed and not processed otherwise. In this mode of
              operation, html2text reports parse errors and scan errors,
              which it does not do in other modes of operation. Note that
              parse and scan errors are not fatal for html2text, but may
              cause misinterpretation of the HTML code and/or portions of the
              document being swallowed.

       -debug-scanner
              While scanning the HTML document, html2text reports on each
              lexical token scanned. This option is for diagnostic purposes.

       -debug-parser
              While parsing the HTML document, html2text reports on the
              tokens being shifted, rules being applied, etc. This option is
              for diagnostic purposes.

       -rcfile path
              Attempt to read the file specified by path as the RC file.

       -style ( compact | pretty )
              Style pretty changes some of the default values of the
              formatting parameters documented in html2textrc(5). To find out
              which formatting parameter defaults are changed, and how, check
              the file "pretty.style". If this option is omitted, style
              compact is assumed as the default.

       -width width
              By default, html2text formats the HTML documents for a screen
              width of 79 characters. If redirecting the output into a file,
              or if your terminal has a width other than 80 characters, or if
              you just want to get an idea of how html2text deals with large
              tables and different terminal widths, you may want to specify a
              different width.

       -o output-file
              Write the output to output-file instead of standard output. A
              dash as the output-file is an alternate way to specify
              standard output.

       -nobs  By default, html2text renders underlined letters with sequences
              like "underscore-backspace-character" and boldface letters like
              "character-backspace-character", which works fine when the
              output is piped into more(1), less(1), or similar. For other
              applications, or when redirecting the output into a file, it
              may be desirable not to render character attributes with such
              backspace sequences; this option turns them off.
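
       The backspace sequences that -nobs suppresses can be illustrated with
       a short sketch. The helper names underline, bold, and
       strip_overstrike are hypothetical; the code only demonstrates the
       two encodings described above and one way to strip them from output
       that was produced without -nobs.

```python
import re

BACKSPACE = "\b"  # ASCII 0x08


def underline(text: str) -> str:
    """Encode each character as underscore-backspace-character."""
    return "".join("_" + BACKSPACE + ch for ch in text)


def bold(text: str) -> str:
    """Encode each character as character-backspace-character."""
    return "".join(ch + BACKSPACE + ch for ch in text)


def strip_overstrike(text: str) -> str:
    """Drop every 'any character + backspace' pair, leaving plain text."""
    return re.sub(r".\x08", "", text)
```

       Pagers such as less(1) interpret these sequences as character
       attributes, whereas a plain text file would contain the raw backspace
       characters.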

FILES

       /etc/html2textrc
	      System wide parser configuration file.

       $HOME/.html2textrc
	      Personal parser configuration file, overrides  the  system  wide
	      values.

CONFORMING TO

       HTML 3.2 (HTML 3.2 Reference Specification -
       http://www.w3.org/TR/REC-html32),
       RFC 1945 (Hypertext Transfer Protocol - HTTP).

NOTES

       html2text makes a considerable effort to parse syntactically incorrect
       input, but is not always as successful as other HTML processors. If
       you are able to correct the HTML source code, you may want to use the
       -unparse or -check options to find out exactly what html2text's
       problem is.

RESTRICTIONS

       html2text provides only a basic implementation of the Hypertext
       Transfer Protocol (HTTP). It requires the complete and exactly
       matching URI to be given as argument and will not follow redirections
       (HTTP 301/307).

AUTHOR

       html2text was written up to version 1.2.2 by Arno Unkrig
       <[email protected]> for GMRS Software GmbH, Unterschleißheim.

       Current maintainer and primary download location is:
       Martin Bayer <[email protected]>
       http://userpage.fu-berlin.de/~mbayer/tools/html2text.html

SEE ALSO

       html2textrc(5), less(1), more(1)



				  2001-10-05			  html2text(1)