Linux Today: Linux News On Internet Time.
Linux .doc to Text Conversions Inadequate

Sep 23, 2010, 12:03

[ Thanks to Brendan Scott for this link. ]

"Having a look at converting doc and html to text on Linux. While these conversions are good for indexing purposes, they aren't that great for actually presenting the output.

Column Width

"The engines which perform pretty well, like wvText and lynx (via html), have the annoying 'feature' of formatting to a column width. This makes it very hard to predict which line breaks are actual paragraph breaks and which are just the pager splitting lines. The output of wvHtml can be passed through html2text -width 0 to avoid this width problem (although results are a little haphazard at times). Lynx maxes out at a width of about 990 characters, but plenty of paragraphs exceed that length. I take it back about html2text -width 0, which seems to fail more often than it succeeds. However, html2text does seem to honour large width arguments (unlike lynx, which has a magic number limiting the width): -width 20000 seems to work (although if there is a line, it inserts 20000 '=' characters)."
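The workaround described in the excerpt could be sketched roughly as follows. This is only an illustration, not the author's script: it assumes the html2text command-line tool mentioned above is on the PATH and accepts an input file after its options, and the function names `html_to_wide_text` and `unwrap_paragraphs` are hypothetical. The second function is a fallback heuristic for output that has already been wrapped to a column width, and it assumes the converter leaves a blank line between paragraphs, which the excerpt suggests is not always reliable.

```python
import re
import subprocess


def html_to_wide_text(html_path, width=20000):
    """Run html2text with a large -width so paragraphs are not wrapped.

    Assumption: html2text is installed and takes the input file as its
    last argument. The width of 20000 is the value reported in the excerpt.
    """
    result = subprocess.run(
        ["html2text", "-width", str(width), html_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def unwrap_paragraphs(text):
    """Join column-wrapped lines back into one string per paragraph.

    Heuristic only: treats one or more blank lines as a paragraph break
    and every other newline as a wrap inserted by the converter.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse internal newlines and runs of spaces to single spaces.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]
```

For example, `unwrap_paragraphs(open("out.txt").read())` would return a list with one entry per paragraph for a file produced by wvText or lynx, provided the blank-line assumption holds for that document.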
