Linux Today: Linux News On Internet Time.

Linux .doc to Text Conversions Inadequate

Sep 23, 2010, 12:03 (0 Talkback[s])
(Other stories by Brendan Scott)


[ Thanks to Brendan Scott for this link. ]

"Having a look at converting doc and html to text on Linux. While these are good for indexing purposes, they aren't that great for actually presenting the output.

"Column Width

"The engines which perform pretty well, like wvText and lynx (via html), have the annoying 'feature' of formatting to a column width. This means that it is very hard to tell which line breaks are actual paragraph breaks and which are just the pager splitting lines. The output of wvHtml can then be transformed by html2text -width 0 to avoid this width problem (although results are a little haphazard at times). Lynx maxes out at a width of about 990 characters, but there are plenty of paragraphs around which exceed that length. I take it back about html2text -width 0, which seems to fail more often than it succeeds. However, it does seem to honour large width arguments (unlike lynx, which has a magic number limiting the width): -width 20000 seems to work (although if there is a horizontal line, it inserts 20000 '=' characters)."
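
The two problems described in the excerpt (hard-wrapped lines and very long '=' rules) can be post-processed in a few lines of Python. This is only a rough sketch, not something taken from the original post: the function names `unwrap_paragraphs` and `shorten_rules` are hypothetical, the 72-character default is arbitrary, and the code assumes that a blank line always marks a real paragraph break, which may not hold for every converter's output.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line is a genuine paragraph break and any other
    line break was introduced by column-width formatting.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def shorten_rules(text, max_len=72):
    """Cut any run of '=' longer than max_len down to max_len characters."""
    return re.sub(r"={%d,}" % (max_len + 1), "=" * max_len, text)
```

For example, `unwrap_paragraphs("first line\nsecond line\n\nnext paragraph")` gives two paragraphs, and `shorten_rules` leaves short runs of '=' untouched while trimming a 20000-character rule to 72 characters.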
