21 Jun 2012 yeupou   » (Master)

Converting PDFs to multiple HTML pages with pdftk and pdftohtml

As already stated on this blog, Bada OS is total crap. Scripting is a mess, T9 is missing from the original versions, and depending on your phone, updating is not an option (even if the phone is less than a year old). It is also absolutely worthless when it comes to reading PDFs. No matter what you do, even if you feed it a specifically cropped PDF with no margins, you’ll always end up with something not really readable: too big, too small, whatever. A pain in the ass.

I soon realized that, with such an appalling combination of software and hardware, it’s best to convert ebooks/PDFs to HTML. And since the provided HTML reader can’t remember what page you last read (not surprising) and, ahem, is unable to load a 3 MB page (low memory, it says, even though the PDF reader loads a 30 MB PDF with no issue on the exact same phone, go figure!), the HTML needs to be split into several files.

PDF is usually an output format, not a source format. While there are plenty of tools to convert to PDF, the fact is that there is no complete suite to convert from it. pdftk is powerful but not easy to handle IMHO, and the latest pdftohtml release is almost 10 years old. So I ended up writing a small wrapper (pdf2htmls.pl) around both of these tools to convert one PDF to multiple HTML files with basic indexes. It takes --input=file.pdf and (optional) --output=directory arguments. Aside from Perl, it requires the Debian packages pdftk and poppler-utils.

The indexes are über-crude. They could be improved with chapters/titles; maybe I’ll add that later.
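
To give an idea of what such a wrapper does, here is a minimal sketch in Python. It is not pdf2htmls.pl itself: the function names, the page_%04d naming and the one-HTML-file-per-PDF-page split are assumptions made for illustration. It only relies on pdftk's burst operation and pdftohtml's -noframes option; how pdftohtml treats the output name is as I understand it, so check the man page before relying on it.

```python
import html
import subprocess
import sys
from pathlib import Path


def burst_command(pdf, workdir):
    """pdftk command that splits the PDF into one single-page PDF per page."""
    return ["pdftk", str(pdf), "burst", "output",
            str(Path(workdir) / "page_%04d.pdf")]


def html_command(page_pdf):
    """pdftohtml command that converts one single-page PDF to one HTML file."""
    # -noframes: write a single HTML file instead of a frameset.
    return ["pdftohtml", "-noframes", str(page_pdf),
            str(Path(page_pdf).with_suffix(".html"))]


def build_index(page_names, title):
    """Return a crude index page: one link per converted page."""
    links = "\n".join(
        f'<li><a href="{name}">Page {number}</a></li>'
        for number, name in enumerate(page_names, start=1)
    )
    return (
        f"<html><head><title>{html.escape(title)}</title></head><body>\n"
        f"<ul>\n{links}\n</ul>\n</body></html>\n"
    )


def convert(pdf, outdir):
    """Split the PDF, convert each page, then write index.html."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_command(pdf, outdir), check=True)
    pages = sorted(outdir.glob("page_*.pdf"))
    for page in pages:
        subprocess.run(html_command(page), check=True)
        page.unlink()  # the intermediate single-page PDF is no longer needed
    names = [page.with_suffix(".html").name for page in pages]
    index = build_index(names, Path(pdf).stem)
    (outdir / "index.html").write_text(index, encoding="utf-8")


if __name__ == "__main__" and len(sys.argv) > 1:
    convert(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else "html")
```

Saved under a hypothetical name such as pdf2htmls_sketch.py, it would be run as `python pdf2htmls_sketch.py file.pdf directory`. One file per page keeps every HTML file far below the reader's memory limit, at the price of an index just as crude as the one described above.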

Syndicated 2012-06-21 12:57:03 from # cd /scratch
