Author Topic: PDF indexer and the dash character (Read 4007 times)

bwissink · « **on:** July 12, 2013, 02:24:36 PM »

We are running OnDemand 8.4 and using the PDF indexer. We just ran into a situation where the last character of a field is a dash. Is there something special about a dash being the last character in a field? It looks like the indexer is appending the next text it finds to the end of that field, which makes the field too long for the loader, so the load fails. To get an idea of what is happening, we ran the PDF document through ARSPDUMP; below is an example of the output:

Sponsor:
ul.h = 1.16 ul.v = 1.97 lr.h = 1.64 lr.v = 2.12

COLLEGE
ul.h = 1.72 ul.v = 1.97 lr.h = 2.17 lr.v = 2.12

CHARLESTON
ul.h = 2.34 ul.v = 1.97 lr.h = 3.00 lr.v = 2.12

Sponsor
ul.h = 0.86 ul.v = 2.09 lr.h = 1.31 lr.v = 2.25

Ref#:
ul.h = 1.32 ul.v = 2.09 lr.h = 1.64 lr.v = 2.25

520877ISU
ul.h = 1.72 ul.v = 2.10 lr.h = 2.25 lr.v = 2.25

This becomes generic index records
GROUP_FIELD_NAME:SPNSR
GROUP_FIELD_VALUE:COLLEGE CHARLESTON
GROUP_FIELD_NAME:SPNSR_REF
GROUP_FIELD_VALUE:520877

Below is what happens when there is a dash at the end.

Sponsor:
ul.h = 1.16 ul.v = 1.97 lr.h = 1.64 lr.v = 2.12

BROOKHAVEN
ul.h = 1.72 ul.v = 1.97 lr.h = 2.40 lr.v = 2.12

ASSOCIATES-Sponsor
ul.h = 2.86 ul.v = 1.97 lr.h = 3.51 lr.v = 2.12
ul.h = 0.86 ul.v = 2.09 lr.h = 1.31 lr.v = 2.25

Ref#:
ul.h = 1.32 ul.v = 2.09 lr.h = 1.64 lr.v = 2.25

183423
ul.h = 1.72 ul.v = 2.10 lr.h = 2.08 lr.v = 2.25

This becomes generic index records
GROUP_FIELD_NAME:SPNSR
GROUP_FIELD_VALUE:BROOKHAVEN SCIENCE ASSOCIATES-Sponsor
GROUP_FIELD_NAME:SPNSR_REF
GROUP_FIELD_VALUE:149282

Has anyone seen this before? How do we fix it?
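One way to spot affected documents before loading is to scan the ARSPDUMP output for words that carry more than one bounding box, as `ASSOCIATES-Sponsor` does above. The following is only a rough sketch: it assumes the dump looks exactly like the excerpt pasted in this post (a word on one line, followed by one or more `ul.h = ... lr.v = ...` lines), and the function names `parse_dump` and `merged_words` are made up for illustration.

```python
import re

# Matches one bounding-box line as it appears in the excerpt above.
# Assumption: the real arspdump layout matches what was pasted in this post.
BOX = re.compile(
    r"ul\.h = ([\d.]+) ul\.v = ([\d.]+) lr\.h = ([\d.]+) lr\.v = ([\d.]+)"
)

def parse_dump(text):
    """Return a list of (word, [boxes]) pairs from arspdump-style text."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = BOX.fullmatch(line)
        if m:
            # A box line belongs to the most recent word line.
            if words:
                words[-1][1].append(tuple(float(g) for g in m.groups()))
        else:
            words.append((line, []))
    return words

def merged_words(words):
    """Words with more than one box were probably joined by the indexer."""
    return [word for word, boxes in words if len(boxes) > 1]

sample = """\
BROOKHAVEN
ul.h = 1.72 ul.v = 1.97 lr.h = 2.40 lr.v = 2.12

ASSOCIATES-Sponsor
ul.h = 2.86 ul.v = 1.97 lr.h = 3.51 lr.v = 2.12
ul.h = 0.86 ul.v = 2.09 lr.h = 1.31 lr.v = 2.25

Ref#:
ul.h = 1.32 ul.v = 2.09 lr.h = 1.64 lr.v = 2.25
"""

print(merged_words(parse_dump(sample)))  # ['ASSOCIATES-Sponsor']
```

This only detects the symptom; it does not change what the indexer extracts.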

Alessandro Perucchi · « **Reply #1 on:** July 16, 2013, 12:34:48 AM »

Hello bwissink,

I don't have a direct answer, but which version exactly do you have? You say 8.4; are you on the latest fix pack of that version (8.4.0.3)? Did you try to upgrade to 8.4.1 (8.4.1.9)?

Anyway, as you probably know, CMOD 8.4 is no longer supported by IBM, since 30.09.2012 (Unix, Windows) and 30.04.2013 (z/OS).

So my best suggestion would be to consider an upgrade from your version to V8.5.0.7 or V9.0.0.2.

Then test whether your problem is still there (I would suggest doing this first in a dev or test environment).

V8.5 and V9.0 have a completely rewritten PDF indexer, so the little glitches that might have happened before may be solved now. It should also be faster.

Of course, if somebody has an idea to help you, then great, otherwise you might consider my suggestion.

Sincerely yours,
Alessandro

Ed_Arnold · « **Reply #2 on:** July 16, 2013, 10:50:03 AM »

Brad - you might want to take a look at PK65917.

ERROR DESCRIPTION:
Using the PDF indexer, the next word is being extracted and used
in the indexed field if the indexed field contains blank spaces
after a hyphen. For example, if the indexed field is "A-   ",
the PDF indexer is extracting instead, "A-<next word>".

The PDF library OnDemand uses during PDF indexing ignores the
hyphen when returning a word. This causes the next word to be
concatenated to the previous word.

Ed

Ed_Arnold · « **Reply #3 on:** July 16, 2013, 11:00:05 AM »

Brad - I have good news and bad news.

The good news is that the problem is fixed.

The bad news is that it is not fixed at V8.4.x.

The rest of the story:

This is a bug in the Adobe libraries. It is caused by the hyphen at the end of the line. The problem is fixed in V8.5, which uses a newer version of the Adobe API. The newer version of the libraries is not available for 8.4.1 or earlier.

Ed
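For anyone who has to stay on V8.4.x for a while, one possible stopgap is to clean the generic index file between indexing and loading. The sketch below is an assumption-laden illustration, not an IBM-provided tool: it only understands the `GROUP_FIELD_NAME:` and `GROUP_FIELD_VALUE:` lines shown in the first post, the label list (`["Sponsor"]`) and the length limit (30 for `SPNSR`) are hypothetical values you would replace with your own, and the function names are invented here.

```python
def trim_after_trailing_hyphen(value, next_labels):
    """Cut off a known label that was glued on directly after a hyphen."""
    for label in next_labels:
        if value.endswith("-" + label):
            return value[: -len(label)]
    return value

def clean_index_lines(lines, next_labels, max_len):
    """Yield index lines with glued labels removed; warn on over-long values.

    `lines` are expected without trailing newlines. Lines other than
    GROUP_FIELD_NAME / GROUP_FIELD_VALUE are passed through unchanged.
    """
    field = None
    for line in lines:
        if line.startswith("GROUP_FIELD_NAME:"):
            field = line.split(":", 1)[1]
        elif line.startswith("GROUP_FIELD_VALUE:"):
            value = trim_after_trailing_hyphen(line.split(":", 1)[1], next_labels)
            if field in max_len and len(value) > max_len[field]:
                print(f"warning: {field} is {len(value)} chars, limit {max_len[field]}")
            line = "GROUP_FIELD_VALUE:" + value
        yield line

records = [
    "GROUP_FIELD_NAME:SPNSR",
    "GROUP_FIELD_VALUE:BROOKHAVEN SCIENCE ASSOCIATES-Sponsor",
    "GROUP_FIELD_NAME:SPNSR_REF",
    "GROUP_FIELD_VALUE:149282",
]
# "Sponsor" is the label that follows the field on the page in this thread;
# the limit of 30 is a made-up example, not the real application group setting.
cleaned = list(clean_index_lines(records, ["Sponsor"], {"SPNSR": 30}))
print(cleaned[1])  # GROUP_FIELD_VALUE:BROOKHAVEN SCIENCE ASSOCIATES-
```

Be aware that a value which legitimately ends in `-Sponsor` would also be trimmed, so this is only safe when the glued label cannot occur in real data.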

LWagner · « **Reply #4 on:** July 17, 2013, 10:03:49 AM »

You might want to consider generating white text on white background (hidden data) in the report for specific index values. We have been doing that with bills, and now report PDFs. We call this scanline data.

LWagner · « **Reply #5 on:** October 24, 2013, 10:45:49 AM »

The PDF Indexer does not yet break on characters of your choice. Even in CMOD 8.5.0.8, it breaks words only on a space and a colon, and the colon leaves little space to actually detect the break. For reliable text breaks, I recommend you have at least one space between words. Also, do not put text lines too close together, and do not use too small a font. If you can't read the text of the printed document, the OCR features of the PDF Indexer may not be able to read it reliably either.

An APAR was just created today, APAR PI04732, to deal with the problem of the PDF Indexer not always being able to break data on coordinate values supplied by arspdump.
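The spacing recommendation above can be checked roughly from arspdump coordinates. The sketch below takes the word boxes from the first dump in this thread and reports neighbouring words on the same line whose horizontal gap is very small. The thresholds `min_gap` and `line_tol` are arbitrary illustrative values in whatever units arspdump reports, and it assumes the dump lists words in reading order; neither is documented behaviour.

```python
# (word, ul.h, ul.v, lr.h, lr.v) copied from the first dump in this thread.
boxes = [
    ("Sponsor:", 1.16, 1.97, 1.64, 2.12),
    ("COLLEGE", 1.72, 1.97, 2.17, 2.12),
    ("CHARLESTON", 2.34, 1.97, 3.00, 2.12),
    ("Sponsor", 0.86, 2.09, 1.31, 2.25),
    ("Ref#:", 1.32, 2.09, 1.64, 2.25),
    ("520877ISU", 1.72, 2.10, 2.25, 2.25),
]

def tight_gaps(boxes, min_gap=0.05, line_tol=0.02):
    """Return (left, right, gap) for neighbours on one line closer than min_gap.

    Two words count as being on the same line when their ul.v values differ
    by at most line_tol. Both thresholds are assumptions for illustration.
    """
    found = []
    for left, right in zip(boxes, boxes[1:]):
        same_line = abs(left[2] - right[2]) <= line_tol
        gap = round(right[1] - left[3], 2)  # right word's ul.h minus left word's lr.h
        if same_line and gap < min_gap:
            found.append((left[0], right[0], gap))
    return found

print(tight_gaps(boxes))  # [('Sponsor', 'Ref#:', 0.01)]
```

Here only the label pair `Sponsor` / `Ref#:` is flagged; the gaps before the field values themselves are wider.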

OnDemand User Group
