Wuff

Sunday, April 11, 2010

software: slicing up PDFs

I wanted to combine all the statements that I downloaded for my audit into a single PDF and then exclude all the cover pages plus the pages of boilerplate disclaimer, "how to reconcile your account", etc. PDF is a standard page presentation format, so you would expect there to be software to do this besides paying Adobe $119 for Adobe Acrobat.

There is, but it's the usual onion: a load of crap surrounding a simple idea.
  • Googling for "split PDF" finds the usual mess of sites and shareware and paid utilities
    • So I restrict to "linux split PDF", which points me to pdftk, which Sid Steward wrote in support of his book. But installing that requires 50MB of supporting GCJ packages. It's really cool that this runs as a standalone program, but I already have a Java interpreter installed, so this approach is 20× bigger and more complicated.
      • So I google some more and find joinPDF, supposedly a simple script and a Java library written by Gerard Briscoe, but the directory to download for this is defunct.
        • There are tons of other search results for this, on Mac shareware sites (someone bundled a graphical user interface for the Mac for users who don't know how to enter command lines), but their links are broken as well. (As an aside, why can't Google be smart? If I Google "download joinPDF" and a page with that text has a broken link, then don't waste my time with that search result!! I need a decision engine, not a search engine.)
          • I finally find a web site that has the simple original joinPDF for download. Follow the README.txt's instructions to manually copy the Java library and two scripts to the right location, and I'm set!
            • It turns out the actual core of this onion is a Java library, iText, written by Bruno Lowagie, that can slice and dice PDFs: both joinPDF and the bloated pdftk simply include this library and provide a wrapper around it.

Now I enter the command line
joinPDF combined_statements.pdf checking*.pdf
and I get combined_statements.pdf! But the files use stupid date naming, so they're in the wrong order. I rename them with ISO 8601 date format file names (2007-01, 2007-02, etc.) and repeat.
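The renaming step could be scripted. Here is a minimal sketch in Python, assuming (purely for illustration, since the original file names aren't shown here) that the statements were named like checking_Jan2007.pdf. The pattern and the iso_name helper are hypothetical; adjust them to whatever naming your bank uses.

```python
import re

# Three-letter English month abbreviations mapped to month numbers.
MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def iso_name(filename):
    """Turn a hypothetical name like checking_Jan2007.pdf into checking_2007-01.pdf.

    Names that don't match the assumed pattern are returned unchanged.
    """
    m = re.fullmatch(r"(.+?)_([A-Za-z]{3})(\d{4})\.pdf", filename)
    if m is None:
        return filename
    prefix, mon, year = m.groups()
    month = MONTHS.get(mon.lower())
    if month is None:
        return filename
    return f"{prefix}_{year}-{month:02d}.pdf"

if __name__ == "__main__":
    names = ["checking_Feb2007.pdf", "checking_Jan2007.pdf"]
    # Alphabetical order of the ISO names is also chronological order,
    # which is what the shell's checking*.pdf expansion relies on.
    print(sorted(iso_name(n) for n in names))
```

To rename real files you would pass each old and new name to os.rename; the sketch only computes the names so you can check them first.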

Now I have to excise the pages I don't want. joinPDF provides another command, splitPDF, to split a PDF into individual pages, but this does not remove particular ranges of pages. (I should have used splitPDF to split each statement into _page1, _page2, etc. files, then glued a subset of these together, but that seemed to mess up the thumbnail display.) I could probably get the source code and write my own simple wrapper around the iText library for an excise command. How hard can Java programming be? But that seems silly. Surely a Portable Document Format should make it easy to cut out pages I don't want.

I bring up combined_statements.pdf in the awesome vim text editor. It understands PDF files and color-codes certain words in them: obj, /Type, Kids, stream, etc. Looks promising, but there's no obvious Start of page 39... End of page 39 to chop out. I just need a little guidance as to what these mean. Back to Google for "PDF file format". But all of the articles show graphical tools or describe the format from the bottom up instead of telling me at a high level what to look for. So I add one of the words in the file, endobj, to my Google search, and find Introduction to PDF! That's what I need!

For reference, in a particular PDF produced by printing a Quicken document in Wine...

You need to delete the page object and optionally things it references. The PDF is full of flattened objects. Each object starts with NN 0 obj, where NN is the object's number and 0 is its version (what PDF calls the generation number; 0 for most generated objects), and ends with endobj. Delete from one to the other and you've removed an object.

One object in the file is:
2 0 obj
<< /Type /Pages /Kids [ 3 0 R
4 0 R
5 0 R
...
46 0 R
] /Count 44 >>
endobj
This lists all 44 pages in the file, using their object numbers. I think they're in the order you see them, so delete the Nth line inside the brackets and the PDF will no longer have an Nth page. Done! (My PDF viewer Okular doesn't seem to mind that the /Count 44 is no longer accurate.)
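The same edit can be sketched in Python instead of done by hand in vim. This is only an illustration under the assumptions of this post: an uncompressed PDF read in as text, a single flat /Pages object, and /Count following /Kids in that object as in the sample above. The drop_page name is made up for this sketch, and it does not repair the file's cross-reference table, so it relies on a forgiving viewer just like the hand edit does.

```python
import re

def drop_page(pdf_text, n):
    """Remove the n-th (1-based) reference from the first /Kids array.

    Also rewrites the first /Count that follows the array so it matches
    the number of remaining page references.
    """
    kids = re.search(r"/Kids\s*\[(.*?)\]", pdf_text, re.S)
    if kids is None:
        raise ValueError("no /Kids array found")
    # Each entry looks like "46 0 R": object number, version, the letter R.
    refs = re.findall(r"\d+\s+\d+\s+R", kids.group(1))
    if not 1 <= n <= len(refs):
        raise IndexError("page %d is out of range" % n)
    del refs[n - 1]
    new_kids = "/Kids [ " + "\n".join(refs) + "\n]"
    head = pdf_text[:kids.start()]
    tail = pdf_text[kids.end():]
    # Assumption: the matching /Count comes after /Kids in the same object.
    tail = re.sub(r"/Count\s+\d+", "/Count %d" % len(refs), tail, count=1)
    return head + new_kids + tail

if __name__ == "__main__":
    sample = ("2 0 obj\n<< /Type /Pages /Kids [ 3 0 R\n4 0 R\n5 0 R\n"
              "] /Count 3 >>\nendobj\n")
    print(drop_page(sample, 2))
```

For a real file you would read the bytes and decode them as latin-1 so that binary stream data survives the round trip, then encode the result the same way before writing it back.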

You can go on to actually get rid of the page object you removed from the page list:
46 0 obj
<< /Type /Page /Parent 2 0 R
...
/Contents 137 0 R
is the page itself. But that page object is only 12 lines long, so where's the actual massive text block with the contents of the page? Well, any time you see NN 0 R, it's a reference to another object. Sure enough, /Contents 137 0 R points to another object with a huge stream of stuff:
137 0 obj
<< /Length 138 0 R >>
stream
q 0.240000 0 0 0.240000 0 0 cm /R0 gs 0 w 1 J ... ...
So you can delete this as well. There are more objects you don't need, but they're small enough to leave around.
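Chasing the reference and cutting out both objects can also be sketched in Python. The helper names contents_ref and delete_object are invented for this example, and it assumes what the post's sample shows: objects with version 0 that start at the beginning of a line, and stream data that never happens to contain the word endobj. Like the hand edit, it leaves the cross-reference table stale.

```python
import re

def contents_ref(pdf_text, page_num):
    """Return the object number that a page's /Contents entry points to, or None."""
    page = re.search(r"^%d 0 obj\b(.*?)\bendobj" % page_num,
                     pdf_text, re.S | re.M)
    if page is None:
        return None
    ref = re.search(r"/Contents\s+(\d+)\s+0\s+R", page.group(1))
    return int(ref.group(1)) if ref else None

def delete_object(pdf_text, num):
    """Cut everything from 'num 0 obj' through the next 'endobj' out of the text."""
    # ^ with re.M anchors at a line start, so 46 does not match inside 146.
    pattern = re.compile(r"^%d 0 obj\b.*?\bendobj[ \t]*\n?" % num, re.S | re.M)
    return pattern.sub("", pdf_text, count=1)

if __name__ == "__main__":
    sample = ("46 0 obj\n<< /Type /Page /Parent 2 0 R\n/Contents 137 0 R >>\n"
              "endobj\n"
              "137 0 obj\n<< /Length 138 0 R >>\nstream\nq 1 0 0 1 0 0 cm\n"
              "endstream\nendobj\n"
              "138 0 obj\n17\nendobj\n")
    stream_num = contents_ref(sample, 46)
    trimmed = delete_object(delete_object(sample, stream_num), 46)
    print(trimmed)
```

The small leftover object 138 0 obj in the sample stands in for the "more objects you don't need" that are small enough to leave around.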

Update: The joinPDF author's web site actually does exist and you can click through (Software > joinPDF) to his software, but incredibly, Google search results show all those broken links in preference to this! Maybe because he's using frames, but c'mon Google, be smart!


Tuesday, June 2, 2009

software: the world is flat but for my house

Google was at Maker Faire promoting SketchUp, a 3D program.

One of the things it can do is texture the surfaces of a model. Wait, Google Maps has a top-down picture of your house from satellite imagery. So draw boundary lines on the edges of your roof, then extrude vertically, then pull up the roof line, and you have a crude wooden-block house shape with your roof. Next, Google Street View may have a drive-by panorama of your house, assuming an angry luddite mob didn't block Google's camera car. So grab the street view and paste it on the front of the model. Five minutes later (assuming you've spent months or years mastering the unintuitive mysteries of a 3-D modeling program) you have a passable representation of your house. You can upload this to Google's 3-D warehouse of SketchUp designs, and you can place it in Google Earth, a more sophisticated version of Google Maps that presents landmarks and other geographic data anywhere and everywhere on earth. When people waltz around your neighborhood in Google Earth, they'll see your dollhouse.[*]
SketchUp house in Google Earth
In the screenshot, the panel below is Google Earth's in-program browser with the house model that Google's 3D ninja whipped up.

Yes, my neighbors' houses are all low-rise ranch houses sunk into the earth, and there really is a 7-meter shiny ball parked on the street!

Google is crowd-sourcing the creation of a 3-D model of the world. As builders and planners and amateurs create more 3D models, the virtual world gets fleshed out until a fly-through in Google Earth is a pretty good approximation of being there. You can see downtown and the Bay Bridge are getting filled in.
view of downtown SF
It's more evidence for my thesis that computer previsualizations of movies will be good enough to replace the filmed movie.

All of these tools and programs are free, so I don't know where Google makes money. Google is looking to get 3D into the browser, so soon you'll get all this in Google Maps; maybe Google will sell billboards in virtual earth. Or maybe they'll charge to have you socialize in it with other avatars.

[*] If you want to see my house, you've got to ask for the additional 3-D warehouse; it doesn't appear automatically. I guess that provides some protection for Google against complaints from house-proud owners that a griefer uploaded a model that makes their property look ugly, or shows a guy mooning out of a window.

An interesting question is why Google doesn't automate this. They have the overhead picture and they have the front picture, so run some AI to glue the two together so my neighbors' houses poke out of the ground to form a 3D canyon.
Road Rash screenshot
I asked Google's modeling ninja and he said the AI isn't smart enough to do it. Ten years ago MetaCreations released Canoma, which supposedly let you semi-automatically pin photographs onto 3D shapes and would guess the outlines of the building. Despite all the wonders our network of computers is producing, hard AI remains hard.
