OnDemand User Group
Support Forums => z/OS Server => Topic started by: Corinne on October 04, 2010, 01:56:19 PM
-
Does anyone have experience creating PDF output from Exstream to load into OnDemand?
We are using Exstream to create the PDF output and the index file, but cannot figure out how to produce the PDF in the correct ("stacked") format to work with OnDemand. Can anyone help?
Thank you.
Corinne
-
I'm not familiar with 'Exstream', but I think you're talking about loading PDF files using the Generic Indexer.
You don't necessarily have to 'stack' these PDFs into a single file -- the Generic Indexer will let you specify an individual file name for each PDF, and concatenate them into 'objects' at load time.
Can you give us more information about what is wrong with the way Exstream is producing files now?
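
A minimal sketch of the "one file name per PDF" approach, assuming the Generic Indexer parameter file uses the CODEPAGE, GROUP_FIELD_NAME, GROUP_FIELD_VALUE, GROUP_OFFSET, GROUP_LENGTH and GROUP_FILENAME keywords, and that offset 0 with length 0 means "the whole file". Please check both assumptions against the indexing reference for your CMOD level. The function name, the field names and the paths are hypothetical.

```python
def build_generic_index(documents, codepage=819):
    """Build Generic Indexer parameter text, one group per PDF file.

    documents: list of (index_fields, pdf_path) pairs, where index_fields
    is a dict mapping an application group field name to its value.
    """
    lines = [f"CODEPAGE:{codepage}"]
    for fields, pdf_path in documents:
        for name, value in fields.items():
            lines.append(f"GROUP_FIELD_NAME:{name}")
            lines.append(f"GROUP_FIELD_VALUE:{value}")
        # Assumption: offset 0 and length 0 select the entire file.
        lines.append("GROUP_OFFSET:0")
        lines.append("GROUP_LENGTH:0")
        lines.append(f"GROUP_FILENAME:{pdf_path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical index values and file names, for illustration only.
    docs = [
        ({"acctnum": "1001", "stmtdate": "10/04/10"}, "/tmp/stmt1001.pdf"),
        ({"acctnum": "1002", "stmtdate": "10/04/10"}, "/tmp/stmt1002.pdf"),
    ]
    print(build_generic_index(docs), end="")
```

With roughly 6000 documents this produces one group per PDF, so each PDF has to exist as its own valid file before the load.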
-
Yes, we are using the Generic Indexer.
Exstream can produce the index file in the format that the Generic Indexer expects, but the PDF output is simply one PDF. Therefore, when we load the files and try to view them in OnDemand, we get an error (when pulling up one individual indexed document) telling us that the PDF is not properly formatted - which makes sense.
How would I go about specifying an individual file name for each PDF, and concatenate them into 'objects' at load time? We are looking at potentially 6000 indexed documents...
Thanks!
-
Got it. It concatenates all the docs together into one PDF. I'm fairly certain that if you can't break this file up into individual PDFs, you'll have to use the PDF Indexer. It's been years since I've played with it, so I'll let someone else fill in the blanks about it.
-
Hello Corinne,
I am not sure I understand exactly what Exstream is doing, but maybe you can clarify my understanding, and then maybe I can help you!
My understanding so far is the following:
Understanding 1)
You have, let's say, 10 PDFs, and Exstream just appends each PDF file to the previous one. Really a plain file concatenation.
And it creates a Generic Indexer index file to archive this big stacked PDF into OnDemand.
So it means that if you cut the PDF with the values of GROUP_OFFSET and GROUP_LENGTH, every piece of the file is a valid PDF.
OR
Understanding 2)
Exstream produces one valid PDF which contains all the documents you want to archive, and then Exstream creates a Generic Indexer index file, with some OFFSET and LENGTH values...
And then if you use the GROUP_OFFSET and GROUP_LENGTH to cut the PDF into smaller files, you don't get valid PDFs, because you are cutting inside the big PDF's container.
From your explanation, I suppose that "Understanding 1)" is the correct answer... but are you sure that the GROUP_OFFSET and GROUP_LENGTH are correctly calculated? Because one small error, and then your pointer in OnDemand will retrieve the wrong part of the file and the output will be a corrupted PDF file.
If indeed it is "Understanding 2)", then you can forget the Generic Indexer, because that doesn't work at all. And as Justin suggested, you must use the PDF Indexer for it, but the PDF Indexer has many limitations...
To know if it's 1) or 2), can you cut the "stacked" PDF you receive from Exstream into pieces using the values in the index file (look at GROUP_OFFSET and GROUP_LENGTH)?
Cheers,
Alessandro
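
The cutting test described above can be sketched roughly as follows. This is only a sketch, assuming the index file carries GROUP_OFFSET and GROUP_LENGTH lines in "KEYWORD:value" form with byte offsets counted from 0, and that a length of 0 means "to the end of the file". The validity check is a crude heuristic (a "%PDF-" header at the start and a "%%EOF" marker near the end), not a real PDF parser, and the function names are hypothetical.

```python
def parse_groups(index_text):
    """Return (offset, length) pairs from GROUP_OFFSET / GROUP_LENGTH lines."""
    groups = []
    offset = None
    for line in index_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().upper()
        if key == "GROUP_OFFSET":
            offset = int(value)
        elif key == "GROUP_LENGTH" and offset is not None:
            groups.append((offset, int(value)))
            offset = None
    return groups


def check_slices(data, groups):
    """For each (offset, length), report whether the slice looks like a PDF."""
    results = []
    for offset, length in groups:
        # Assumption: length 0 means "from offset to the end of the file".
        end = len(data) if length == 0 else offset + length
        piece = data[offset:end]
        complete = len(piece) == end - offset  # False if the slice runs past EOF
        has_header = piece.startswith(b"%PDF-")
        has_trailer = b"%%EOF" in piece[-1024:]
        results.append(complete and has_header and has_trailer)
    return results


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file names): python check.py stacked.pdf stacked.ind
    with open(sys.argv[1], "rb") as f:
        data = f.read()
    with open(sys.argv[2], "r") as f:
        groups = parse_groups(f.read())
    for (offset, length), ok in zip(groups, check_slices(data, groups)):
        print(offset, length, "looks like a PDF" if ok else "NOT a standalone PDF")
```

If every slice passes, the file is probably of the "Understanding 1)" kind. If only the first slice has a header, it is more likely one big PDF.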
-
Hi Alessandro...
I think the situation that Corinne is trying to describe is:
File.pdf ( Pg1 Pg2 Pg3 ... )
... where the individual pages are documents that should be able to be retrieved separately.
And *not* a series of complete PDF files, simply concatenated into a single file:
AllPDFs.out ( [Pg1.pdf] [Pg2.pdf] [Pg3.pdf] [...] )
...which is why the Generic Indexer won't work.
Corinne -- can you clarify this for us?
-JD.
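
One quick way to tell the two layouts apart, offered only as a rough heuristic: count how many times the "%PDF-" header marker appears in the raw file. A single multi-page PDF normally has one header, while a file made of concatenated PDFs has one per document. The marker could in principle also appear inside uncompressed content, so treat the result as a hint. The function names are hypothetical.

```python
def count_pdf_headers(data):
    """Count occurrences of the '%PDF-' header marker in raw bytes."""
    return data.count(b"%PDF-")


def looks_stacked(data):
    """Heuristic: more than one header suggests concatenated PDFs."""
    return count_pdf_headers(data) > 1


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        raw = f.read()
    print("headers found:", count_pdf_headers(raw))
    print("looks stacked:", looks_stacked(raw))
```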
-
Hi Justin,
Well, apparently I need some training from you on writing clear messages :-D Because that is exactly what I wanted to say, but... with 100X more words than you!!! ::)
I need holidays!!!
Cheers,
Alessandro
-
And what was the final result? How is this done? I will need to do the same thing with 40,000 - 70,000 documents daily.
If using the PDF Indexer with an external index file, then the name of each PDF file must be listed with each document's indexes. My IBM OnDemand Sales Support described four scenarios based on our taking AFP to PDF.
To use the Generic Indexer with PDF files requires knowing the coordinate locations of the text to be extracted, or have Tag Logical Elements in the AFP document which identify the field values to be used.
-
Not sure if this thread is still open or not. We load AFP to OD so I will give you the procedure for that and it may work. I know this is an OD blog, so please forgive me as I speak "exstream".
First - you will have to create x number of variables in your Exstream application, equal to the number of index values in your OnDemand folder (e.g. acctnum, stmtdate, etc.). Next, add x search keys to your Exstream environment section, pointing them to your x variables. Now add the x new search keys to your application and be sure to set the placement as "before each customer". This should add your index information to your PDF file. You will then have to change your OnDemand application "Indexer Information" tab to look for your variable names at the locations where they appear in the output file. Hope that helps.
-
My Exstream support is too busy with other issues, so meanwhile I did get a solution working with IBM.
The Exstream team gave me a test PDF many months ago, that I was unable to use for some time, mainly since I thought I needed something else to go with it. :(
To not lose too much time, I also posed the problem to IBM. IBM finally told me I could nevertheless make the indexer type = PDF, and identify upper-left and lower-right coordinates for each field.
It took some work, but I was able to get the values, which turned out to be decimal values in hundredths of an inch, and load and index the PDF sample file. I can't give more details, since I did this almost a month ago. I just needed rectangular coordinates that were VERY tight to the print fields on a normal size page. I'm sure I worked on the triggers first, or one of them, and when ARSLOAD indicated I had a good value, I worked each successive field very closely, too, until they were all done.
----
COORDINATES=IN
INDEXSTARTBY=1
TRIGGER1=ul(7.63,0.24),lr(8.03,0.45),*,'Page 1'
TRIGGER2=ul(6.40,8.95),lr(6.86,9.13),0,'AMOUNT'
FIELD1=ul(4.51,0.67),lr(6.28,0.88),0,(TRIGGER=1,BASE=0)
FIELD2=ul(1.02,9.36),lr(3.50,9.57),0,(TRIGGER=1,BASE=0)
FIELD3=ul(1.02,9.51),lr(3.50,9.72),0,(TRIGGER=1,BASE=0)
FIELD4=ul(0.66,1.35),lr(1.34,1.55),0,(TRIGGER=1,BASE=0)
FIELD5=ul(7.10,8.93),lr(8.15,9.14),0,(TRIGGER=2,BASE=0)
INDEX1='ACCOUNT NUMBER',FIELD1,(TYPE=GROUP)/* ACCOUNT NUMBER */
INDEX2='NAME',FIELD2,(TYPE=GROUP)/* NAME */
INDEX3='ADDRESS',FIELD3,(TYPE=GROUP)/* ADDRESS */
INDEX4='CAN NO',FIELD4,(TYPE=GROUP)/* CAN NO */
INDEX5='AMOUNT DUE',FIELD5,(TYPE=GROUP)/* AMOUNT DUE */
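
Since the rectangles have to be very tight, a small sanity check on the parameter file can help before running a load. The sketch below only parses the `ul(h,v),lr(h,v)` part of TRIGGERn and FIELDn lines as they appear above and reports each rectangle's width and height in inches. It is not an official tool, and the function name is hypothetical.

```python
import re
from decimal import Decimal

# Matches the name and the ul(...)/lr(...) pair at the start of a
# TRIGGERn= or FIELDn= line; the rest of the line is ignored.
RECT_RE = re.compile(
    r"^(TRIGGER\d+|FIELD\d+)=ul\(([\d.]+),([\d.]+)\),lr\(([\d.]+),([\d.]+)\)"
)


def rectangle_sizes(parameter_text):
    """Return {name: (width, height)} in inches for each rectangle found."""
    sizes = {}
    for line in parameter_text.splitlines():
        match = RECT_RE.match(line.strip())
        if not match:
            continue
        name = match.group(1)
        ul_h, ul_v, lr_h, lr_v = (Decimal(v) for v in match.groups()[1:])
        sizes[name] = (lr_h - ul_h, lr_v - ul_v)
    return sizes


if __name__ == "__main__":
    sample = "\n".join([
        "TRIGGER1=ul(7.63,0.24),lr(8.03,0.45),*,'Page 1'",
        "FIELD1=ul(4.51,0.67),lr(6.28,0.88),0,(TRIGGER=1,BASE=0)",
    ])
    for name, (width, height) in rectangle_sizes(sample).items():
        flag = "" if width > 0 and height > 0 else "  <-- check corners"
        print(f"{name}: {width} x {height} in{flag}")
```

A zero or negative width or height would point to swapped or mistyped corners.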
-
To define these co-ordinates (UL and LR), did you use the graphical indexer?
I think that's the easiest way to define the co-ordinates.
Just draw rectangular boxes around the fields and CMOD generates the co-ordinates for you.
-
I tried setting the Indexer to "PDF" and opening a sample PDF file using Parameter Source "Sample Data", and got an error: "Adobe Acrobat (AcroExch.App rc=-2147221005) could not be loaded."
-
:(
I get the same error by trying to bring up the Report Wizard.
So I solved the loading of the PDF in spite of the missing pieces.
-
::) I found my coordinates identifier. I used ARSPDUMP, which identified the coordinates of every text string in the PDF file. Hundreds of them. I then needed to test and identify the right ones to use for triggers and fields for the Indexer Parameters. I also tried adjusting the values slightly, which never worked. :P
My sample ARSPDUMP code is below:
-----------------------------------
//PDFDUMP EXEC PGM=ARSPDUMP,REGION=0M,
// PARM='/-f //DD:INDD -o //DD:OUT'
//STEPLIB DD DISP=SHR,DSN=SYS2.CMOD.SARSLOAD
//ADOBERES DD DISP=SHR,DSN=SYS2.CMOD.USERPARM(ADOBERES)
//ADOBEFNT DD DISP=SHR,DSN=FFF.OD840.ADOBEFNT.WORK
//TEMPATTR DD DISP=SHR,DSN=SYS2.CMOD.ADOBEPDF.TEMPATTR
//INDD DD DISP=SHR,DSN=SMPE.OD840.BILL.FINAL.PDF2
//OUT DD SYSOUT=*
//SYSTMP01 DD UNIT=SYSDA,DSN=&&SYSTM1,DISP=(NEW,PASS),
// SPACE=(CYL,(6,6))
//SYSTERM DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
-------------------------
sample output:
============
Place
ul.h = 5.37 ul.v = 8.58 lr.h = 5.66 lr.v = 8.79
your
ul.h = 5.67 ul.v = 8.58 lr.h = 5.91 lr.v = 8.79
payment
ul.h = 5.92 ul.v = 8.58 lr.h = 6.37 lr.v = 8.79
stub
ul.h = 6.38 ul.v = 8.58 lr.h = 6.61 lr.v = 8.79
in
ul.h = 6.62 ul.v = 8.58 lr.h = 6.73 lr.v = 8.79
the
ul.h = 6.74 ul.v = 8.58 lr.h = 6.92 lr.v = 8.79
provided
ul.h = 6.93 ul.v = 8.58 lr.h = 7.38 lr.v = 8.79
envelope
ul.h = 5.37 ul.v = 8.73 lr.h = 5.83 lr.v = 8.94
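
With hundreds of strings in the dump, a small script can pick out candidates. The sketch below assumes the ARSPDUMP output looks exactly like the sample above (a text line followed by a line of `ul.h = ... ul.v = ... lr.h = ... lr.v = ...`) and turns a matching string into the `ul(...),lr(...)` form used in the indexer parameters. The function names are hypothetical, and the rest of the TRIGGER or FIELD line still has to be written by hand.

```python
import re

COORD_RE = re.compile(
    r"ul\.h\s*=\s*([\d.]+)\s+ul\.v\s*=\s*([\d.]+)\s+"
    r"lr\.h\s*=\s*([\d.]+)\s+lr\.v\s*=\s*([\d.]+)"
)


def parse_arspdump(text):
    """Return a list of (string, (ul_h, ul_v, lr_h, lr_v)) from dump text."""
    entries = []
    previous = None
    for line in text.splitlines():
        line = line.strip()
        match = COORD_RE.fullmatch(line)
        if match:
            # A coordinate line belongs to the text line just before it.
            if previous is not None:
                entries.append((previous, match.groups()))
            previous = None
        elif line:
            previous = line
    return entries


def rectangle(coords):
    """Format coordinates the way the indexer parameters above write them."""
    ul_h, ul_v, lr_h, lr_v = coords
    return f"ul({ul_h},{ul_v}),lr({lr_h},{lr_v})"


def find_rectangles(text, wanted):
    """Return the rectangle of every dumped string equal to 'wanted'."""
    return [rectangle(c) for s, c in parse_arspdump(text) if s == wanted]


if __name__ == "__main__":
    dump = "\n".join([
        "payment",
        "ul.h = 5.92 ul.v = 8.58 lr.h = 6.37 lr.v = 8.79",
        "stub",
        "ul.h = 6.38 ul.v = 8.58 lr.h = 6.61 lr.v = 8.79",
    ])
    print(find_rectangles(dump, "payment"))
```

The coordinates are kept as text rather than converted to numbers, so they are copied into the parameter file exactly as ARSPDUMP reported them.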
-
"Adobe Acrobat (AcroExch.App rc=-2147221005) could not be loaded."
"Unable to initialize document."
http://www-01.ibm.com/support/docview.wss?uid=swg21211278
When I use the administration client to define my index parameters through the graphical interface, I receive message:
ACROEXCH.APP -2147221005
http://www-01.ibm.com/support/docview.wss?uid=swg21141770
-
Now this is a silly question that I am asking.
Is Adobe Acrobat installed or not?
-
No, not installed.
But based on my brief experience using output from ARSPDUMP, and the description of using the Wizard with Acrobat, there could be a clash. The documentation says make the box around the trigger or field as large as possible. ARSPDUMP gives coordinates at decimal hundredths of an inch as tight as possible around the text. When I tried using wider coordinates ala what the documentation suggested, the text was not found. So using the wizard with Acrobat, one might have to try over and over again to get valid coordinates to index the text. But using coordinates from ARSPDUMP, you have the coordinates nailed the first time. Plus if you need to change triggers or fields, no need to go through the guesswork again, since ARSPDUMP provides output on every word of every page in a test file.
Based on this experience, Acrobat might be nice to have to stick with the wizard, but unnecessary.
-
Hello Ed,
you need to have at least the Acrobat Standard version installed.
You cannot use it with Acrobat Reader.
And if you have it installed, you must be sure that it was installed before the CMOD Client.
If it was installed after, you'll need to copy arspdf32.api from <Home Directory for CMOD Client>\PDF
into the plug-ins directory of your Acrobat Standard installation.
Cheers,
Alessandro
-
So, I want to be sure I understand the indexing approach taken here by LWagner.
- You ran ARSPDUMP against a PDF file - created by Exstream as described earlier
- You copied the desired coordinates into a "keyboard input" indexing parameters file just like you would for a linedata file, or an AFP file.
- You specified PDF as the input data type/load method
- You loaded with arsload
- You did not use the graphical indexer on the file
- This was not a "stacked" PDF file (multiple separate PDFs in a single file) but a single PDF file with multiple customer statements in it.
Did I get that right? Please correct where I'm wrong.
Thanks.
-
This WAS a stacked PDF. A test file with five embedded PDFs in one PDF file.
I also learned that if ARSPDUMP does not break on something, then you can't force a break. I had a string of three codes together, with no break. I could not force the string to be taken in pieces by specifying intermediate stop and start points along the string. I tried catching a different "break" by moving the coordinates in steps of 1/100 of an inch, and found no break.
So ARSPDUMP will tell you your absolute break strings.
-
Hi
If by stacked PDFs you mean a file made of PDFs concatenated into one file, we have such a solution.
Dialogue also creates an index file with a pointer to the start of each new PDF in the concatenated file (and the size of each document in number of bytes).
This is solved on the Dialogue side with the use of DDA connectors and Java code, developed with help from the company Emitto, which was the Dialogue representative in the Nordic countries. The people involved now work at Avantias.
HON
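
The offset bookkeeping for such a concatenated file is simple enough to sketch. This is not the Dialogue/DDA solution described above, only an illustration of the idea under the assumption that each document is already a complete PDF: append the documents one after another and record, for each, the byte offset where it starts and its size in bytes. The function name is hypothetical.

```python
def stack_pdfs(pdf_blobs):
    """Concatenate complete PDF byte strings.

    Returns (stacked_bytes, extents) where extents is a list of
    (offset, length) pairs, one per input document, in input order.
    """
    stacked = bytearray()
    extents = []
    for blob in pdf_blobs:
        # The next document starts where the stacked file currently ends.
        extents.append((len(stacked), len(blob)))
        stacked.extend(blob)
    return bytes(stacked), extents


if __name__ == "__main__":
    # Tiny stand-ins for real PDF files, for illustration only.
    docs = [b"%PDF-1.4\nA\n%%EOF\n", b"%PDF-1.4\nBB\n%%EOF\n"]
    stacked, extents = stack_pdfs(docs)
    for offset, length in extents:
        print(f"offset={offset} length={length}")
```

Each (offset, length) pair is what an index file for the stacked file would need to carry for that document.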
-
PDF load performance in CMOD 8.4.0 is limited. A 500-document PDF loads; a 2000-document PDF fails. Each averaged about 2.6 pages per document, about 30 MB each on input, and about 320 MB on output.
CMOD 8.5 has better PDF load performance, and is able to use a common PDF resource for the elemental PDFs, so the file size does not grow.
But as we understand it, PDF load performance is still about 10 times faster on a Windows object server than in z/OS. We are going with a small server farm to upload our PDFs.