Sunday, January 29, 2012

Join / Merge / Combine / Concatenate PDF pages into a single PDF

There are many free (and open source) methods available.

pdfjoin / pdfjam - command line utility. Very easy to use, but has some limitations:
  • Output pages are all the same size
  • Hyperlinks are stripped
Use ghostscript to "print" all of the files into a new PDF file. With this method, you can use other ghostscript capabilities to further modify the output (such as resampling graphics):
  •  gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=output.pdf input1.pdf input2.pdf
pdftk (PDF Toolkit) - yet more command line tools for PDF manipulation. It supports several different command line styles (examples can be seen on its web site). The simplest is:
  • pdftk input1.pdf input2.pdf input3.pdf cat output output.pdf
  • A powerful but complicated GUI for pdftk is also available.
pyPDF - a tool for the programming / Python oriented. It can split, merge, watermark, rotate, and extract information with simple Python scripting. There is a GTK-based GUI available called PDF-Shuffler.

PDF Split and Merge is a cross-platform, Java-based GUI (and command line) tool.
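
For repeated use, the ghostscript merge command above can be wrapped in a small shell function. This is only a sketch: the function name merge_pdfs and the GS variable are my own illustrative choices, not part of ghostscript. It assumes gs is on the PATH (or that GS points at the right binary).

```shell
#!/bin/sh
# Sketch: merge PDFs with ghostscript's pdfwrite device.
# merge_pdfs and GS are hypothetical names chosen for this example.
GS="${GS:-gs}"   # override if the binary has another name on your system

merge_pdfs() {
    # Usage: merge_pdfs output.pdf input1.pdf [input2.pdf ...]
    if [ "$#" -lt 2 ]; then
        echo "usage: merge_pdfs output.pdf input.pdf [more.pdf ...]" >&2
        return 2
    fi
    out=$1
    shift
    # Check every input first, so a typo does not produce a
    # silently incomplete merge.
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "merge_pdfs: no such file: $f" >&2
            return 1
        fi
    done
    # Inputs are written to the output in the order given.
    "$GS" -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
}
```

After sourcing the file, `merge_pdfs output.pdf input1.pdf input2.pdf` should behave like the one-liner, with the added check that all inputs exist.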

There are more! With so many choices, how does one choose? Whim, personal preference, scale requirements, the tools and platform on hand (ease of installation/integration), or just random chance.

Tuesday, January 17, 2012

Resampling jpegs inside a PDF (to shrink/reduce file size)

My wife is applying for various teaching jobs. Most of the positions require uploading documents to various school board applicant tracking systems. I carefully prepared the documents for her, also wanting to keep them for historical archival purposes. The uploads produced a bizarre error. After contacting the site administrator, we found out that there is a 1 megabyte file size limit on uploads, and this is what produces the cryptic error. Sigh. (To add insult to injury, after resampling the scanned PDFs, it turned out there are also crazy filename limitations... so all the nice, descriptively named files had to be renamed to eliminate various characters and stay within a certain length.) These are the gatekeepers of our children's education.

But I digress.

This info is in a lot of places on the net already, but I want to add it here for my own reference, as I've forgotten the simple command line a few times already and had to search for it again.

In short, use ghostscript:
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
The PDFSETTINGS presets equate to the following resolutions:

  • /screen ... screen-view-only quality, 72 dpi images
  • /ebook ... medium quality, 150 dpi images
  • /printer ... high quality, 300 dpi images
  • /prepress ... high quality, color preserving, 300 dpi images
  • /default ... general-purpose output, possibly at the expense of a larger file
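
When there is a hard upload limit (such as the 1 megabyte one above), one approach is to try the presets from the highest image resolution down and stop at the first result that fits. The following is a rough sketch of that idea, not a tested recipe: the function name shrink_to_limit and the GS variable are made up for this example, and the preset order is my assumption about what "best quality first" means.

```shell
#!/bin/sh
# Sketch: pick the highest-resolution PDFSETTINGS preset whose output
# fits under a byte limit. shrink_to_limit and GS are hypothetical names.
GS="${GS:-gs}"

shrink_to_limit() {
    # Usage: shrink_to_limit input.pdf output.pdf max_bytes
    src=$1
    dst=$2
    limit=$3
    # 300 dpi first, then 150 dpi, then 72 dpi.
    for preset in /printer /ebook /screen; do
        "$GS" -sDEVICE=pdfwrite -dPDFSETTINGS="$preset" \
            -dNOPAUSE -dQUIET -dBATCH \
            -sOutputFile="$dst" "$src" || return 1
        # wc -c may pad its output with spaces on some systems.
        size=$(wc -c < "$dst" | tr -d ' ')
        if [ "$size" -le "$limit" ]; then
            echo "$preset"   # report which preset was good enough
            return 0
        fi
    done
    echo "shrink_to_limit: still $size bytes even with /screen" >&2
    return 1
}
```

For a 1 megabyte limit (taking that as 1048576 bytes), the call would look like `shrink_to_limit input.pdf output.pdf 1048576`. On success the output file is left at the first preset that fit.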

Other command line arguments suggested in various places:

  • -dCompatibilityLevel=1.4 
  • -dColorImageResolution=38 -dColorImageDownsampleType=/Average -dGrayImageDownsampleType=/Average -dGrayImageResolution=38 -dMonoImageResolution=38 -dMonoImageDownsampleType=/Average -dOptimize=true -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dUseCIEColor -dColorConversionStrategy=/sRGB
  • -dMaxSubsetPct=100
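
If none of the presets gives the right size, the explicit downsampling arguments listed above can be driven by a single resolution value. Here is a minimal sketch, assuming the same resolution is wanted for color, gray, and mono images; resample_pdf and GS are names I made up for this example, and only flags already shown above are used.

```shell
#!/bin/sh
# Sketch: downsample all images in a PDF to one chosen resolution.
# resample_pdf and GS are hypothetical names for this example.
GS="${GS:-gs}"

resample_pdf() {
    # Usage: resample_pdf input.pdf output.pdf dpi
    src=$1
    dst=$2
    dpi=$3
    # Reject empty or non-numeric resolutions before calling ghostscript.
    case $dpi in
        ''|*[!0-9]*)
            echo "resample_pdf: dpi must be a whole number" >&2
            return 2
            ;;
    esac
    "$GS" -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH \
        -dDownsampleColorImages=true -dColorImageDownsampleType=/Average \
        -dColorImageResolution="$dpi" \
        -dDownsampleGrayImages=true -dGrayImageDownsampleType=/Average \
        -dGrayImageResolution="$dpi" \
        -dDownsampleMonoImages=true -dMonoImageDownsampleType=/Average \
        -dMonoImageResolution="$dpi" \
        -sOutputFile="$dst" "$src"
}
```

For example, `resample_pdf input.pdf output.pdf 100` asks for roughly 100 dpi images, somewhere between the /screen and /ebook presets.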
This page is quite good for further tips (such as recompressing with lossless compression).