Linux .doc to Text Conversions Inadequate
Sep 23, 2010, 12:03
[ Thanks to Brendan Scott for this link. ]
"I have been looking at converting .doc and HTML files to text on
Linux. While the converters are good for indexing purposes, they aren't that
great for actually presenting the output.

Column Width

"The engines which perform pretty well, like wvText and lynx
(via html), have the annoying 'feature' of formatting to a column
width. This makes it very hard to tell which line breaks are actual
paragraph breaks and which are just the converter wrapping lines.
The output of wvHtml can instead be passed through html2text -width 0
to avoid this width problem (although results are a little haphazard
at times). Lynx maxes out at a width of about 990 characters, but
there are plenty of paragraphs around which exceed that length. I
take it back about html2text -width 0, which seems to fail more often
than it succeeds. However, it does seem to honour large width
arguments (unlike lynx, which has a magic number limiting the width):
a width of 20000 seems to work (although if the document contains a
horizontal line, it inserts 20000 '=' characters)."
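
As a rough illustration of the workaround described in the excerpt, the
sketch below chains wvHtml and html2text with a large width and then
trims the over-long '=' rules. It assumes the html2text that accepts
-width (not the Python tool of the same name) and that wvHtml accepts
--targetdir, as the wvText wrapper script uses it. The function names,
the out.html file name and the 40-character rule length are made up for
illustration, and the conversion step has not been tested against real
documents.

```python
import re
import subprocess
import tempfile
from pathlib import Path

WIDE = 20000  # width the excerpt reports html2text as honouring


def pipeline_commands(doc_path, work_dir, width=WIDE):
    """Return two argv lists: .doc -> HTML, then HTML -> text."""
    html_path = Path(work_dir) / "out.html"  # hypothetical file name
    return [
        # Assumption: wvHtml writes out.html into --targetdir.
        ["wvHtml", f"--targetdir={work_dir}", str(doc_path), "out.html"],
        # Assumption: this is the html2text that takes "-width N".
        ["html2text", "-width", str(width), str(html_path)],
    ]


def collapse_rules(text, keep=40):
    """Shorten lines made only of a long run of '=' to `keep` characters."""
    pattern = r"(?m)^={%d,}$" % (keep + 1)
    return re.sub(pattern, "=" * keep, text)


def doc_to_text(doc_path, width=WIDE):
    """Run both tools and return the text with long '=' rules trimmed."""
    with tempfile.TemporaryDirectory() as tmp:
        to_html, to_text = pipeline_commands(doc_path, tmp, width)
        subprocess.run(to_html, check=True)
        result = subprocess.run(
            to_text, check=True, capture_output=True, text=True
        )
    return collapse_rules(result.stdout)
```

Calling doc_to_text("report.doc") would then, under those assumptions,
give text in which wrapping only occurs in paragraphs longer than the
chosen width, with each horizontal rule cut back to 40 characters.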