OnDemand User Group

Support Forums => MP Server => Topic started by: Steve Bechtolt on August 14, 2018, 11:25:21 AM

Title: PDF Indexer with large PDF files
Post by: Steve Bechtolt on August 14, 2018, 11:25:21 AM
I was just wondering what the largest PDF file, in terms of number of pages, is that anyone has processed,
and how long it takes to index these large PDF files.
Title: Re: PDF Indexer with large PDF files
Post by: Lars Bencze on August 23, 2018, 08:54:44 AM
I see that you have not yet received an answer to this, so I'll give you a partial one.

When I look at one of the OD systems I manage, we use "PagePiece Info" PDF indexing a lot for the bigger loads. We also try to divide big loads into batches of 20,000 or 50,000 documents each.
I'm sorry, but we do not store Page Count for our batches. A reasonable estimate would be somewhere around 1.5 to 2.5 pages per document on average.
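
To make the batching and the page estimate concrete, here is a minimal Python sketch. It is not part of OnDemand or its PDF indexer: the function names and the list of document names are hypothetical, and the pages-per-document bounds are just the rough estimate given above.

```python
from typing import Iterator, List, Sequence, Tuple


def split_into_batches(
    documents: Sequence[str], batch_size: int = 20_000
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size documents.

    `documents` is a hypothetical list of document names; how the load is
    actually split depends on how it is produced.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(documents), batch_size):
        yield list(documents[start:start + batch_size])


def estimate_page_range(
    doc_count: int, low: float = 1.5, high: float = 2.5
) -> Tuple[float, float]:
    """Return (lowest, highest) estimated page count for doc_count documents.

    The default bounds are the rough per-document averages quoted in the
    post, not measured values.
    """
    return doc_count * low, doc_count * high


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(45_000)]
    for number, batch in enumerate(split_into_batches(docs), start=1):
        low, high = estimate_page_range(len(batch))
        print(f"batch {number}: {len(batch)} documents, "
              f"roughly {low:.0f} to {high:.0f} pages")
```

With these assumptions, a 20,000-document batch would hold somewhere between 30,000 and 50,000 pages.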

The time needed to index these files depends on a lot of things. Here are some:
the type of PDF indexing you use :) , how many fields you have defined, how complex the structure of the PDF file is (including compression, which should be avoided), the number of CPUs and their speed, the amount of RAM available, how fast your disks are (to read the large files), and so on.

With the fairly small OD servers we have, and around a dozen fields defined, the indexer seems to handle about 100 documents per second. So a 20,000-document batch takes about 200 seconds to index.
I am pretty sure that you can achieve much faster indexing rates with better hardware.
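
The arithmetic behind that figure can be written as a small Python sketch. The function name is hypothetical, and the default rate of 100 documents per second is only what we observe on our own servers, so treat the result as a rough estimate rather than a benchmark.

```python
def estimate_index_seconds(doc_count: int, docs_per_second: float = 100.0) -> float:
    """Estimate indexing time in seconds for a batch of doc_count documents.

    Assumes a constant indexing rate; the default is the roughly observed
    rate from the post, and real rates vary with hardware, field count,
    and PDF complexity.
    """
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return doc_count / docs_per_second


if __name__ == "__main__":
    for batch_size in (20_000, 50_000):
        seconds = estimate_index_seconds(batch_size)
        print(f"{batch_size} documents: about {seconds:.0f} seconds")
```

Plugging in your own measured rate in place of the default gives a first guess for your hardware.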

I hope this gives you a hint, and that someone else can give you a better answer.