Author Topic: PDF indexer and the dash character  (Read 4007 times)

bwissink

  • Guest
PDF indexer and the dash character
« on: July 12, 2013, 02:24:36 PM »
We are running OnDemand 8.4 and using the PDF indexer.  We just ran into a situation where the last character of a field is a dash.  Is there something special about the dash character being the last character in a field?  It looks like the indexer is appending the next text it finds to the end of that field, which makes the field too long for the loader, so the load fails.  To get an idea of what is happening we ran the PDF document through ARSPDUMP, and below is an example of the output:

Sponsor:                                                                         
ul.h = 1.16 ul.v = 1.97 lr.h = 1.64 lr.v = 2.12 
                                                 
COLLEGE                                         
ul.h = 1.72 ul.v = 1.97 lr.h = 2.17 lr.v = 2.12 
                                                 
CHARLESTON                                       
ul.h = 2.34 ul.v = 1.97 lr.h = 3.00 lr.v = 2.12 
                                                 
Sponsor                                         
ul.h = 0.86 ul.v = 2.09 lr.h = 1.31 lr.v = 2.25 
                                                 
Ref#:                                           
ul.h = 1.32 ul.v = 2.09 lr.h = 1.64 lr.v = 2.25 
                                                 
520877ISU                                       
ul.h = 1.72 ul.v = 2.10 lr.h = 2.25 lr.v = 2.25 
                                                 
This becomes generic index records
GROUP_FIELD_NAME:SPNSR                                 
GROUP_FIELD_VALUE:COLLEGE CHARLESTON
GROUP_FIELD_NAME:SPNSR_REF                             
GROUP_FIELD_VALUE:520877

                               
Below is what happens when there is a dash at the end.                                           
                                                 
Sponsor:                                         
ul.h = 1.16 ul.v = 1.97 lr.h = 1.64 lr.v = 2.12 
                                                 
BROOKHAVEN                                       
ul.h = 1.72 ul.v = 1.97 lr.h = 2.40 lr.v = 2.12 
                                                 
ASSOCIATES-Sponsor                               
ul.h = 2.86 ul.v = 1.97 lr.h = 3.51 lr.v = 2.12 
ul.h = 0.86 ul.v = 2.09 lr.h = 1.31 lr.v = 2.25 
                                                 
Ref#:                                           
ul.h = 1.32 ul.v = 2.09 lr.h = 1.64 lr.v = 2.25 
                                                 
183423                                           
ul.h = 1.72 ul.v = 2.10 lr.h = 2.08 lr.v = 2.25 

This becomes generic index records
GROUP_FIELD_NAME:SPNSR                                 
GROUP_FIELD_VALUE:BROOKHAVEN SCIENCE ASSOCIATES-Sponsor
GROUP_FIELD_NAME:SPNSR_REF                             
GROUP_FIELD_VALUE:149282                               

Has anyone seen this before?  How do we fix it?
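
For anyone who wants to hunt for other affected words in a large dump: in the sample above, the merged word "ASSOCIATES-Sponsor" is the only entry followed by two coordinate lines instead of one. A minimal Python sketch that flags such entries is below. It assumes the dump looks exactly like the excerpt pasted in this post (one word per line, followed by its "ul.h = ... lr.v = ..." lines); real ARSPDUMP output may be laid out differently, and the function name find_merged_words is made up for this illustration.

```python
import re

# Matches one bounding-box line as it appears in the excerpt pasted above.
# Assumption: real ARSPDUMP output may use a different layout.
COORD_RE = re.compile(
    r"ul\.h = ([\d.]+) ul\.v = ([\d.]+) lr\.h = ([\d.]+) lr\.v = ([\d.]+)"
)


def find_merged_words(dump_text):
    """Return (word, boxes) pairs for words followed by more than one box."""
    entries = []
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = COORD_RE.fullmatch(line)
        if match:
            # A coordinate line belongs to the most recent word line.
            if current is not None:
                current[1].append(tuple(float(g) for g in match.groups()))
        else:
            # Any other non-blank line is treated as a word.
            current = (line, [])
            entries.append(current)
    return [(word, boxes) for word, boxes in entries if len(boxes) > 1]


sample = """
BROOKHAVEN
ul.h = 1.72 ul.v = 1.97 lr.h = 2.40 lr.v = 2.12

ASSOCIATES-Sponsor
ul.h = 2.86 ul.v = 1.97 lr.h = 3.51 lr.v = 2.12
ul.h = 0.86 ul.v = 2.09 lr.h = 1.31 lr.v = 2.25

Ref#:
ul.h = 1.32 ul.v = 2.09 lr.h = 1.64 lr.v = 2.25
"""

for word, boxes in find_merged_words(sample):
    print(word, boxes)
```

This is only a diagnostic aid: it shows where two words were glued together, it does not change what the indexer extracts.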

Alessandro Perucchi

  • Global Moderator
  • Hero Member
Re: PDF indexer and the dash character
« Reply #1 on: July 16, 2013, 12:34:48 AM »
Hello bwissink,

I don't know what to say, except: which version exactly do you have? You say 8.4; are you on the latest fix pack of that version (8.4.0.3)? Did you try upgrading to 8.4.1 (8.4.1.9)?

Anyway, as you probably know, CMOD 8.4 is no longer supported by IBM; support ended on 30.09.2012 (Unix, Windows) and 30.04.2013 (z/OS).

So my best suggestion would be to consider an upgrade from your version to V8.5.0.7 or V9.0.0.2.

Then test whether your problem is still there (I would suggest doing this first in a dev or test environment).

V8.5 and V9.0 have a completely rewritten PDF indexer, so the little glitches that may have happened before might be solved now. And it should be faster.

Of course, if somebody has an idea to help you, then great, otherwise you might consider my suggestion.

Sincerely yours,
Alessandro

Ed_Arnold

  • Hero Member
Re: PDF indexer and the dash character
« Reply #2 on: July 16, 2013, 10:50:03 AM »
Brad - you might want to take a look at PK65917.

ERROR DESCRIPTION:
Using the PDF indexer, the next word is being extracted and used in the indexed field if the indexed field contains blank spaces after a hyphen. For example, if the indexed field is "A-       ", the PDF indexer is extracting instead "A-<next word>".

The PDF library OnDemand uses during PDF indexing ignores the hyphen when returning a word. This causes the next word to be concatenated to the previous word.
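
Until an upgrade is possible, one way to catch this before the load fails is to scan the generic index file for values that exceed the field length defined in the application group. Here is a rough Python sketch. It assumes the index file pairs GROUP_FIELD_NAME: and GROUP_FIELD_VALUE: lines as shown in the first post; the field lengths (30 and 10) and the function name oversized_values are hypothetical and need to be replaced with your own definitions.

```python
NAME_TAG = "GROUP_FIELD_NAME:"
VALUE_TAG = "GROUP_FIELD_VALUE:"


def oversized_values(index_text, max_lengths):
    """Return (field, value, length) for values longer than the field allows."""
    flagged = []
    name = None
    for raw in index_text.splitlines():
        line = raw.strip()
        if line.startswith(NAME_TAG):
            name = line[len(NAME_TAG):].strip()
        elif line.startswith(VALUE_TAG) and name is not None:
            value = line[len(VALUE_TAG):].strip()
            limit = max_lengths.get(name)
            if limit is not None and len(value) > limit:
                flagged.append((name, value, len(value)))
            name = None  # each value line closes the preceding name line
    return flagged


# Hypothetical field lengths; use the ones from your application group.
limits = {"SPNSR": 30, "SPNSR_REF": 10}

index_sample = """
GROUP_FIELD_NAME:SPNSR
GROUP_FIELD_VALUE:BROOKHAVEN SCIENCE ASSOCIATES-Sponsor
GROUP_FIELD_NAME:SPNSR_REF
GROUP_FIELD_VALUE:149282
"""

for field, value, length in oversized_values(index_sample, limits):
    print(f"{field}: {length} characters (limit {limits[field]}): {value}")
```

This only reports the oversized records; whether to trim the trailing word or hold the document back for review is a decision for your own load process.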
                             

Ed

Ed_Arnold

  • Hero Member
Re: PDF indexer and the dash character
« Reply #3 on: July 16, 2013, 11:00:05 AM »
Brad - I have good news and bad news.

The good news is that the problem is fixed.

The bad news is not at V8.4.x.

The rest of the story:

This is a bug in the Adobe libraries, triggered by a hyphen at the end of the line.  The problem is fixed in V8.5, which uses a newer version of the Adobe API.  The newer version of the libraries is not available for 8.4.1 or earlier.


Ed

LWagner

  • Guest
Re: PDF indexer and the dash character
« Reply #4 on: July 17, 2013, 10:03:49 AM »
You might want to consider generating white text on a white background (hidden data) in the report for specific index values.  We have been doing that with bills, and now with report PDFs. We call this scanline data.

LWagner

  • Guest
Re: PDF indexer and the dash character
« Reply #5 on: October 24, 2013, 10:45:49 AM »
The PDF Indexer does not yet break on characters of your choice. Even in CMOD 8.5.0.8, it breaks words on a space and a colon, but a colon leaves little space to actually detect the break.  For reliable text breaks, I recommend having at least one space between words. Also, do not place text lines too close together or use too small a font.  If you can't read the text of the printed document, the OCR features of the PDF Indexer may not be able to read it reliably either.

An APAR was just created today, APAR PI04732, to deal with the problem of the PDF Indexer not always being able to break data on coordinate values supplied by arspdump.