Author Topic: Help Loading PDF Files (Read 2612 times)

tjspencer2 · « **on:** February 16, 2017, 05:07:09 PM »

I'm attempting to load 500,000 individual PDF files that a vendor has sent me, and I need to do so by COB Monday, 2/20. Each file is a fairly small PDF containing a single statement. This is of course not the best way to load files into CMOD (larger files with many statements would be optimal), but it is what it is. Is there a way to streamline the load process to get the highest load speed possible? I'm running on AIX, CMOD 9.5 MP and PDF Indexer 9.5.

Lars Bencze · « **Reply #1 on:** February 17, 2017, 07:11:03 AM »

Bud Paton of IBM has created an excellent PowerPoint presentation that describes the fastest way to load documents.
I suppose you have the metadata in a separate file?
If so, I would create a small script that builds the .ind file from that metadata. If I recall correctly, it is faster to point to each individual PDF file than to concatenate them into one big .out file and use offset + length. (Just use GROUP_OFFSET:0 and GROUP_LENGTH:0 to load the entire file.)
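A minimal sketch of such a script, in Python. The CSV layout, the field names (`account`, `stmt_date`), the file paths and the codepage value 819 are all hypothetical, and the keyword layout is written from memory of the generic index file format, so check it against the indexing reference for your CMOD level before relying on it.

```python
import csv
import io


def build_ind(rows, field_names, codepage="819"):
    """Return generic index text with one group per PDF file.

    GROUP_OFFSET:0 and GROUP_LENGTH:0 mean "load the entire file",
    as suggested in the post above.
    """
    lines = [f"CODEPAGE:{codepage}"]  # assumed codepage, adjust as needed
    for row in rows:
        for name in field_names:
            lines.append(f"GROUP_FIELD_NAME:{name}")
            lines.append(f"GROUP_FIELD_VALUE:{row[name]}")
        lines.append("GROUP_OFFSET:0")
        lines.append("GROUP_LENGTH:0")
        lines.append(f"GROUP_FILENAME:{row['pdf_path']}")
    return "\n".join(lines) + "\n"


# Hypothetical vendor metadata: one row per PDF statement.
metadata_csv = io.StringIO(
    "pdf_path,account,stmt_date\n"
    "/data/in/0001.pdf,1000123,02/16/17\n"
    "/data/in/0002.pdf,1000456,02/16/17\n"
)
rows = list(csv.DictReader(metadata_csv))
ind_text = build_ind(rows, ["account", "stmt_date"])
print(ind_text)
```

The field names must match the application group fields defined in your own system; the ones here are placeholders.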

If you have PDF Indexer fields defined in the form itself, well... that's an entirely different story.
If the PDF indexing is not fast enough, it MAY be possible to use that little tool (what's it called again, arspdump?) to convert all text data in the PDF to a TXT representation. With a lot of luck you could parse that with a script, which may or may not build a generic index file (.ind) faster than running PDF Indexer.

I suggest you make small batches of maybe 100 docs and check out which method is the fastest.
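One way to run that comparison is sketched below in Python. `load_fn` is a hypothetical placeholder for whatever actually performs the load of one batch (for example a wrapper around your load command); nothing here calls CMOD itself, and the file names are made up.

```python
import time
from itertools import islice


def batches(items, size=100):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def docs_per_hour(load_fn, batch):
    """Time one call of load_fn(batch) and convert it to documents/hour."""
    start = time.perf_counter()
    load_fn(batch)  # hypothetical loader supplied by the caller
    elapsed = time.perf_counter() - start
    return len(batch) / elapsed * 3600 if elapsed > 0 else float("inf")


# Hypothetical file names; 250 files split into batches of 100.
pdfs = [f"stmt_{i:06d}.pdf" for i in range(250)]
sizes = [len(b) for b in batches(pdfs)]
print(sizes)
```

Running `docs_per_hour` once per candidate method on the same batch gives directly comparable rates.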
(I always ask the guys that create the PDF files to create an .ind file as well, while they're at it... Or I tell them to include the data as PPD Page-Piece Info)

tjspencer2 · « **Reply #2 on:** February 21, 2017, 08:25:47 PM »

Thanks for the reply and guidance.

Yes, we received the files as individual PDF statement files, with one statement per PDF file.

We're able to load about 3,000/hour on an AIX 7.1 box running CMOD 9.5 and DB2 10.5.
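For scale, a quick back-of-the-envelope calculation in Python, assuming the rate stays constant at 3,000 per hour for all 500,000 files and the load runs without interruption:

```python
def hours_to_load(total_docs, docs_per_hour):
    """Hours needed to load total_docs at a constant rate."""
    return total_docs / docs_per_hour


hours = hours_to_load(500_000, 3_000)
days = hours / 24
print(f"{hours:.1f} hours, about {days:.1f} days")
# prints "166.7 hours, about 6.9 days"
```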

It's not great, but at this point, I'm going to let it run its course.

Maciej Mieczakowski · « **Reply #3 on:** March 08, 2017, 01:08:22 AM »

My approach for very fast loading is to create one index file that contains the indexes for thousands of documents, each pointing to a separate file. You can concatenate all the single index files into one with small changes (there is no need for the codepage definition, for instance). As Lars mentioned, use GROUP_OFFSET:0 and GROUP_LENGTH:0, since you don't want to divide the input data into smaller pieces, and then simply run that index file with an empty trigger .ARD file. The index file holds the relative information about where the input data is, along with all the indexes for it.
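The concatenation step could look roughly like the Python sketch below. It takes one possible reading of the codepage remark above: keep the first `CODEPAGE:` line and drop the repeats. That reading, the field name and the sample paths are assumptions, not something confirmed in this thread.

```python
def merge_ind(parts):
    """Merge several generic index texts into one.

    Keeps only the first CODEPAGE line (an assumed interpretation)
    and skips blank lines; all other lines are kept in order.
    """
    merged = []
    seen_codepage = False
    for text in parts:
        for line in text.splitlines():
            if not line.strip():
                continue
            if line.startswith("CODEPAGE:"):
                if seen_codepage:
                    continue
                seen_codepage = True
            merged.append(line)
    return "\n".join(merged) + "\n"


# Two hypothetical single-document index files.
part_a = (
    "CODEPAGE:819\n"
    "GROUP_FIELD_NAME:account\nGROUP_FIELD_VALUE:1000123\n"
    "GROUP_OFFSET:0\nGROUP_LENGTH:0\nGROUP_FILENAME:/data/in/0001.pdf\n"
)
part_b = (
    "CODEPAGE:819\n"
    "GROUP_FIELD_NAME:account\nGROUP_FIELD_VALUE:1000456\n"
    "GROUP_OFFSET:0\nGROUP_LENGTH:0\nGROUP_FILENAME:/data/in/0002.pdf\n"
)
merged_text = merge_ind([part_a, part_b])
print(merged_text)
```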

I have loaded as many as 50,000 documents within a few minutes with this approach. I have not tested how much data could be loaded in an hour, but it is certainly many times more than with the ordinary arsload process.

zancanaro · « **Reply #4 on:** March 14, 2017, 09:42:43 AM »

As Maciej wrote, we use this same approach to load between 35,000 and 65,000 documents with a single pair of files (i.e. one .idx and one .ARD).
The pair of result files is created by a java script; as Maciej mentioned before, you can just reference the documents one by one with GROUP_LENGTH + GROUP_OFFSET.
For 65,000 aggregated files, it takes about 4-6 minutes per load ID.

OnDemand User Group
