This is a Windows product. If you're working on a platform other than Windows, or if you're interested in a source-code license, please contact us.

PDFdeconstruct™ decomposes PDF files into XML files. The XML output includes:
- text – Unicode text with font, color and position data for each word (or each character)
- images – in PNG, TIFF or JPEG format
- vector graphics – complete path information for fills and strokes
- form fields – with field names and values
PDFdeconstruct can be used for:
- document-format conversion: convert PDF to other formats
- document analysis: examine the content of a PDF page
- complex content extraction: e.g., input to further processing based on text with position information
The PDFdeconstruct output format is described in the manual.
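Because the output schema is defined in the manual and not reproduced on this page, a first look at an output file has to avoid assuming any element names. The following is a minimal Python sketch that only counts how often each element tag occurs in an XML document. The function name `count_tags` is illustrative, and the tags in the usage comment are placeholders, not the real PDFdeconstruct schema.

```python
import sys
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_tags(xml_data):
    """Count how often each element tag occurs in an XML document.

    xml_data may be a str or bytes object holding the whole document.
    No particular schema is assumed; every element is simply tallied.
    """
    root = ET.fromstring(xml_data)
    return Counter(elem.tag for elem in root.iter())


# Usage: python count_tags.py path/to/output.xml
# With placeholder input "<a><b/><b/></a>" the result maps "a" to 1 and "b" to 2.
if __name__ == "__main__" and len(sys.argv) > 1:
    counts = count_tags(Path(sys.argv[1]).read_bytes())
    for tag, n in counts.most_common():
        print(f"{n:8d}  {tag}")
```

A tag histogram like this is a quick way to see which kinds of content a page actually contains before writing schema-specific extraction code against the format given in the manual.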
PDFdeconstruct is a cross-platform command-line tool, suitable for use on servers or for batch-mode processing. Basic usage looks like this:
pdfdeconstruct [options] PDF-file output-dir
For example:
pdfdeconstruct test.pdf testout
will create a directory called "testout", containing a "doc.xml" file along with any extracted fonts and images. Various options are available for controlling and enriching the XML output.
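For batch-mode processing, the basic invocation can be wrapped in a short script. The following Python sketch runs the command once per PDF in a folder, giving each file its own output directory. It rests on several assumptions: the `pdfdeconstruct` executable is on the `PATH`, it exits with a non-zero status on failure, and it creates the per-file output directory itself. The helper names `build_command` and `convert_all` are illustrative, and no command-line options are passed because they are not listed here.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_root):
    """Return the argument list for one run: pdfdeconstruct PDF-file output-dir.

    The output directory is named after the PDF (test.pdf -> <out_root>/test).
    """
    pdf_path = Path(pdf_path)
    out_dir = Path(out_root) / pdf_path.stem
    return ["pdfdeconstruct", str(pdf_path), str(out_dir)]


def convert_all(pdf_dir, out_root):
    """Run pdfdeconstruct on every *.pdf directly inside pdf_dir."""
    Path(out_root).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True raises CalledProcessError on a non-zero exit status
        # (assumes the tool signals failure that way).
        subprocess.run(build_command(pdf, out_root), check=True)


# Usage (hypothetical folder names):
#   convert_all("incoming", "xml-out")
```

Keeping the command construction in its own function makes it easy to add options from the manual later without touching the loop.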
Supported platforms:
- Windows
- Mac OS X
- Linux
- 32-bit and 64-bit versions available for all platforms
- other platforms: portable C++ source code is available. Contact us for details.
For conversion to plain text (instead of XML), consider XpdfText.
To arrange for an evaluation copy of PDFdeconstruct, please contact us.