Author Topic: PDF Indexing with PPD (PagePiece Info) (Read 3483 times)

Lars Bencze · « **on:** October 08, 2015, 07:27:37 AM »

Hi,

We have now implemented an Application that actually uses the "newest" way to index PDF files, namely using PagePiece info.
Environment is Windows 2008, DB2 9.7 and CMOD Server 9.0.0.6. TSM is on a separate W2K8 server.

We have successfully delivered several batches/files of 20000 documents to OnDemand, and so far no problems.
(Well, to be fair, it was nothing BUT trouble on CMOD 9.0.0.3, we had to install the 9.0.0.6 fix pack to remedy that.)
As this is our first document type that uses that, I would say we are still beginners, but it works like a charm!

We are currently looking into some performance issues.
1. Indexing takes a surprisingly long time. I suspect this is because the PDF is still slightly compressed, although the creator has tried to switch compression off. Also, the PDF files are "linearized" after they are created. (The performance is not bad at all, but it takes nearly 4 minutes to index where I was expecting around 0.5-1 minute.)
2. One of the main benefits of PDF indexing is the resource collection. So far, however, we have been unable to make OnDemand recognize that these are the same resources in each batch - it says "New" on every load.
At first, we had subsets of (customized) character sets/fonts made for every batch, but when we turned that off and sent the full set(s) with every batch, it still did not catch it - every file is deemed to have its own unique set of PDF resources.
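
For point 1, a rough way to check whether a delivered PDF really is uncompressed and linearized is to scan its raw bytes for the standard PDF filter names and the `/Linearized` key. This is only a heuristic sketch: the function name `scan_pdf_bytes` is made up for illustration, it is not part of CMOD, and it will miss filters that are hidden inside compressed object streams.

```python
import re

# Standard PDF stream filter names; any hit suggests some streams are still encoded.
FILTER_NAMES = [
    b"FlateDecode", b"LZWDecode", b"DCTDecode", b"CCITTFaxDecode",
    b"JBIG2Decode", b"JPXDecode", b"RunLengthDecode",
]

def scan_pdf_bytes(data):
    """Heuristic scan of raw PDF bytes (hypothetical helper, not a CMOD tool)."""
    counts = {}
    for name in FILTER_NAMES:
        hits = len(re.findall(rb"/" + name + rb"\b", data))
        if hits:
            counts[name.decode()] = hits
    return {
        # The linearization dictionary is expected near the start of the file.
        "linearized": b"/Linearized" in data[:1024],
        # Count 'stream' keywords, excluding 'endstream'.
        "stream_count": len(re.findall(rb"\bstream\r?\n", data)),
        "filters": counts,
    }

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        print(scan_pdf_bytes(f.read()))
```

An empty `filters` result alongside a non-zero `stream_count` would suggest that compression really is off for the visible streams.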

Has anyone already solved this type of tuning problem?
I would be happy to hear your info on how to tune this solution to perfection. In return, I will be happy to share our experiences of this!

I will try to attach an image to this post. The lines highlighted in blue have full font sets and "no PDF compression". The lines below that use compression and have subsetted font sets.

Justin Derrick · « **Reply #1 on:** October 08, 2015, 02:49:58 PM »

I think I understand the problem you're trying to fix. Historically, it's been solved by grouping more PDFs together in a load. In the PDF presentation that Bud Paton did for us, by binding the PDFs together into a single PDF file, he was able to show almost a 400x reduction in the storage required for a large batch of PDFs.

Check out the video presentation, and if you want the specific section where he talks about storage savings, skip to 36 minutes into the session:

http://www.odusergroup.org/forums/index.php?topic=1724.0

Hope this helps a little!

-JD.

Lars Bencze · « **Reply #2 on:** October 09, 2015, 06:01:10 AM »

Hi Justin and thanks for your reply!
I took a look at the video, and yes, that is what we are already doing!

The question in my initial post above is how to avoid getting "New" resource packages every time you send a new instance of a particular PDF file to OnDemand.
As you can see in the image I attached, it says "New" in the "Resource reuse" column for every new file we add. One would think that when sending the same type of documents, with the same font sets, same background, same images etc (i.e. the same resources), CMOD would be able to recognize the resources as "Existing".
Here is an extract from the Load logs for three consecutive loads:
ARS1142I Resource D:\OnDemandDirectories\arstmp\infoflyttut.0.villkutsk160101.paketforalltid.20151002.194133.pdf.res will be added as resource >12146-11-0<. Compression Type(Disable) Original Size(2237100) Compressed Size(2237100)
ARS1142I Resource D:\OnDemandDirectories\arstmp\infoflyttut.0.villkutsk160101.paketforalltid.20151002.200204.pdf.res will be added as resource >12147-11-0<. Compression Type(Disable) Original Size(2237100) Compressed Size(2237100)
ARS1142I Resource D:\OnDemandDirectories\arstmp\infoflyttut.0.villkutsk160101.paketforalltid.20151002.202231.pdf.res will be added as resource >12148-11-0<. Compression Type(Disable) Original Size(2237100) Compressed Size(2237100)
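
To track this across many loads, the ARS1142I lines can be pulled out of the load log and summarized. The sketch below is only modelled on the three lines quoted above; the function names are invented for illustration, and other message layouts are not handled.

```python
import re

# Pattern modelled on the ARS1142I lines quoted above (an assumption about
# the general layout; only these three samples were used to write it).
ARS1142I = re.compile(
    r"ARS1142I Resource (?P<path>.+?) will be added as resource "
    r">(?P<rid>[\d-]+)<\. Compression Type\((?P<ctype>[^)]*)\) "
    r"Original Size\((?P<orig>\d+)\) Compressed Size\((?P<comp>\d+)\)"
)

def parse_resource_lines(log_text):
    """Return one dict per ARS1142I line found in log_text."""
    rows = []
    for m in ARS1142I.finditer(log_text):
        rows.append({
            "path": m.group("path"),
            "resource_id": m.group("rid"),
            "compression": m.group("ctype"),
            "original_size": int(m.group("orig")),
            "compressed_size": int(m.group("comp")),
        })
    return rows

def summarize(rows):
    """Count loads, distinct resource IDs, and distinct original sizes."""
    return {
        "loads": len(rows),
        "distinct_resource_ids": len({r["resource_id"] for r in rows}),
        "distinct_sizes": len({r["original_size"] for r in rows}),
    }
```

If `distinct_resource_ids` equals `loads` while `distinct_sizes` stays at 1, every load stored a new resource of identical size, which is the pattern seen here.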

Even though the resource files have the exact same size for all loads, when I run a File Comparison program, they prove to be quite different indeed.
How come?
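
One way to narrow this down is to see where two same-sized .res files differ: a few small regions or the whole file. The following is a minimal sketch with made-up helper names; it reads both files fully into memory, which should be fine at around 2 MB each.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def differing_ranges(path_a, path_b, max_ranges=20):
    """Return (start, end) byte ranges, end exclusive, where the files differ.

    Only the common length is compared, and at most max_ranges are returned.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    limit = min(len(a), len(b))
    ranges = []
    start = None
    for i in range(limit):
        if a[i] != b[i]:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
            if len(ranges) >= max_ranges:
                return ranges
    if start is not None:
        ranges.append((start, limit))  # run extends to the end of the common length
    return ranges
```

Printing the bytes around each returned range for two consecutive loads would show which parts of the resource file change from batch to batch.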
So the 2nd question in my original post above is quite simply: can PDF resources be re-used, using this indexing method? If so, what do we need to change?
(As for question 1 above, we will continue experimenting with different types of "no compression" and linearization, but if anyone else has any benchmark performance measurements/timings/examples, it would be nice to compare our respective results.)
