Author Topic: Indexing files containing DBCS (Double-Byte Character Set) (Read 2868 times)

frasert · « **on:** January 20, 2011, 02:05:17 PM »

Has anyone successfully processed files (PDF or otherwise) containing DBCS?

We have a PDF that contains Chinese characters, but the PDF graphical indexer fails with a server error.
arspdump is also unable to process the file:

$ arspdump -f file.pdf -p 1 | head -40
file.pdf
Number of Pages = 4

WordFinder version: 3

------------- Page 1 -------------

?
ul.h = 0.49 ul.v = 0.18 lr.h = 0.61 lr.v = 0.37

?
ul.h = 0.59 ul.v = 0.18 lr.h = 0.65 lr.v = 0.37

....

Justin Derrick · « **Reply #1 on:** January 21, 2011, 01:55:26 AM »

Even if you get past that issue, I don't think you'll be able to store those double-byte index values inside your database without switching it to UTF-16. That's something I run into constantly during migrations. Non-ASCII characters get converted to double-byte (or longer) strings, so they no longer fit inside the columns defined in databases with 8-bit code pages.
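
To illustrate the sizing problem, here is a minimal Python sketch. It only shows how many bytes a short Chinese string needs in a few encodings; the 10-byte column width and the sample values are assumptions made up for the example, not taken from any real CMOD application group or database definition.

```python
# Illustrative only: COLUMN_BYTES and the sample strings are hypothetical.
COLUMN_BYTES = 10  # stands in for an index column sized in bytes, e.g. CHAR(10)


def encoded_sizes(value: str) -> dict[str, int]:
    """Return the byte length of `value` in a few common encodings."""
    encodings = ("utf-8", "utf-16-le", "gb2312")
    return {enc: len(value.encode(enc)) for enc in encodings}


def fits(value: str, encoding: str, column_bytes: int = COLUMN_BYTES) -> bool:
    """True if `value`, encoded with `encoding`, fits in `column_bytes` bytes."""
    return len(value.encode(encoding)) <= column_bytes


if __name__ == "__main__":
    sample = "北京上海广州"  # six Chinese characters
    print(encoded_sizes(sample))
    print(fits(sample, "utf-8"))    # six characters, but more than 10 bytes
    print(fits("ABCDEF", "utf-8"))  # six ASCII characters fit easily
```

The point is that a value can be well under the column's length in characters and still overflow it in bytes, which is likely what shows up as truncation or load errors during a migration.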

The icing on the cake is that I have no idea how this would affect searching. (Can a Windows client or web browser properly convey double-byte characters and make it all the way through CMOD down to the database for a query?)

Hopefully some of our users from Europe will be able to help with this.

-JD.
