
Sunday 15 January 2012

Converting PDF Files To Text Or HTML From Linux Terminal

Earlier, we saw how to merge or combine PDF files from the terminal. Now I am sharing two command-line tools for converting PDF files to text or HTML.

Poppler Utils is a package of PDF rendering and conversion tools, and it must be installed before we can convert PDF files to text or HTML. On a Debian-based distro, you can install poppler-utils with the following command. On other distros, use the corresponding package manager.

sudo apt-get install poppler-utils

Now that poppler-utils is installed, we can convert PDF files to text and HTML using the pdftotext and pdftohtml command-line tools.
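If you are not sure whether the tools ended up on your PATH, a small check like the one below can help. This is only a sketch: missing_tools is a helper name I made up for illustration, not something shipped with poppler-utils.

```shell
# missing_tools is a hypothetical helper, not part of poppler-utils.
# It prints the name of every given command that is not found on PATH
# and returns non-zero if at least one of them is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example usage:
#   missing_tools pdftotext pdftohtml || echo "Install poppler-utils first."
```

If the function prints nothing, both tools were found and you are ready to go.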

PDF to Text

To convert a PDF file to text, use the pdftotext command. The simplest form of the command is:

pdftotext file.pdf file.txt

The command can also preserve the original layout of the PDF file if you pass the -layout switch, as below:

pdftotext -layout file.pdf file.txt

Similarly, if you wish to convert only a specific range of pages, you can use the -f and -l switches to specify the first and last page to convert. In the example below, I have chosen to convert pages 4 to 8 into text.

pdftotext -f 4 -l 8 file.pdf file.txt
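If you convert page ranges often, you could wrap the command above in a small shell function that checks the page numbers before calling pdftotext. The following is just a sketch: pdf_range_to_text is a name I invented, and the checks are my own assumptions about what counts as a sensible range.

```shell
# pdf_range_to_text is a hypothetical helper, not part of poppler-utils.
# Usage: pdf_range_to_text FIRST LAST input.pdf output.txt
pdf_range_to_text() {
    first=$1 last=$2 input=$3 output=$4

    # Both page numbers must be positive integers without leading zeros.
    for n in "$first" "$last"; do
        case $n in
            ''|*[!0-9]*|0*)
                echo "page numbers must be positive integers" >&2
                return 2
                ;;
        esac
    done

    # The first page must not come after the last page.
    if [ "$first" -gt "$last" ]; then
        echo "first page must not exceed last page" >&2
        return 2
    fi

    pdftotext -f "$first" -l "$last" "$input" "$output"
}

# Example usage, equivalent to the command above:
#   pdf_range_to_text 4 8 file.pdf file.txt
```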

Check the man page of pdftotext (man pdftotext) or run pdftotext -h to explore the other options.
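The same command is easy to run over a whole folder. Here is a minimal sketch that converts every PDF file in a directory to a text file with the same base name. The function name pdf_dir_to_text is made up for this post, and I am assuming you want the -layout switch for every file.

```shell
# pdf_dir_to_text is a hypothetical helper, not part of poppler-utils.
# Usage: pdf_dir_to_text [DIRECTORY]   (defaults to the current directory)
pdf_dir_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDF files.
        [ -e "$pdf" ] || continue
        # report.pdf becomes report.txt in the same directory.
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}

# Example usage:
#   pdf_dir_to_text ~/Documents
```

The variables are quoted, so file names with spaces in them should work too.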


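PDF to HTML

To convert a PDF file to an HTML file, you can use the pdftohtml tool available in the poppler package. Before that, I will show how pdftotext itself can produce a simple HTML file: with the -htmlmeta switch, it writes the extracted text into a basic HTML page that also includes the document's meta information.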

pdftotext -f 4 -l 8 -htmlmeta file.pdf file.html

Using the pdftohtml tool is not much different from using pdftotext. The simplest form is:

pdftohtml file.pdf file.html

pdftohtml accepts the same -f and -l switches as pdftotext for specifying the page range. However, -htmlmeta and -layout are only available in pdftotext. I will let you explore the rest of the pdftohtml options yourself.
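Since both tools understand -f and -l, one page range can feed both conversions. The sketch below produces a text file and an HTML file for the same pages. pdf_range_to_both is a name I made up, and deriving the output names from the input file name is my own assumption.

```shell
# pdf_range_to_both is a hypothetical helper, not part of poppler-utils.
# Usage: pdf_range_to_both FIRST LAST input.pdf
pdf_range_to_both() {
    first=$1 last=$2 input=$3
    # Derive the output names from the input: file.pdf -> file.txt, file.html
    base=${input%.pdf}
    # Run pdftohtml only if pdftotext succeeded.
    pdftotext -f "$first" -l "$last" "$input" "$base.txt" &&
    pdftohtml -f "$first" -l "$last" "$input" "$base.html"
}

# Example usage:
#   pdf_range_to_both 4 8 file.pdf
```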

I hope this information is useful for you. :)