Topic: Indexing files containing DBCS (Double-Byte Character Set)

frasert

  • Guest
Indexing files containing DBCS (Double-Byte Character Set)
« on: January 20, 2011, 02:05:17 PM »
Has anyone successfully processed files (PDF or otherwise) containing DBCS?

We have a PDF that contains Chinese characters, but the PDF graphical indexer fails with a server error.
arspdump is also unable to process the file:

$ arspdump -f file.pdf -p 1 | head -40
file.pdf
Number of Pages = 4

WordFinder version: 3

------------- Page 1 -------------

?
ul.h = 0.49     ul.v = 0.18     lr.h = 0.61     lr.v = 0.37

?
ul.h = 0.59     ul.v = 0.18     lr.h = 0.65     lr.v = 0.37

....

Justin Derrick

  • IBM Content Manager OnDemand Consultant
Re: Indexing files containing DBCS (Double-Byte Character Set)
« Reply #1 on: January 21, 2011, 01:55:26 AM »
Even if you get past that issue, I don't think you'll be able to store those double-byte index values inside your database without switching it to a Unicode code page (UTF-8 or UTF-16).  That's something I run into constantly during migrations.  Non-ASCII characters get converted to multi-byte strings, meaning they won't fit inside the columns defined in databases with 8-bit codepages.
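
To make the column-width problem concrete, here is a minimal Python sketch (not CMOD or DB2 code) that counts how many bytes a value takes in a few encodings and checks it against a byte-limited column. The helper names `encoded_length` and `fits_column` and the sample value are hypothetical, chosen only for illustration; how your database actually measures column width depends on its code page and column type.

```python
def encoded_length(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


def fits_column(text: str, column_bytes: int, encoding: str) -> bool:
    """Check whether `text` fits a column limited to `column_bytes` bytes.

    Returns False if the text is too long, or if the encoding cannot
    represent the text at all (for example Chinese in an 8-bit code page).
    """
    try:
        return encoded_length(text, encoding) <= column_bytes
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    value = "中文"  # hypothetical two-character index value
    for enc in ("utf-8", "utf-16-le", "gbk"):
        print(enc, encoded_length(value, enc), "bytes")
    # A 2-byte column sized for two single-byte characters is too small.
    print(fits_column(value, 2, "utf-8"))
    # An 8-bit code page such as Latin-1 cannot represent the value at all.
    print(fits_column(value, 2, "latin-1"))
```

The point of the sketch is that a field sized by character count under a single-byte code page can need two to three times as many bytes once the values contain Chinese text.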

The icing on the cake is that I have no idea how this would affect searching.  (Can a Windows client or web browser properly convey double-byte characters and make it all the way through CMOD down to the database for a query?)
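
For the browser half of that question, this is roughly what a UTF-8 round trip of a search value through a URL looks like, sketched with Python's standard `urllib.parse`. It says nothing about what the CMOD clients or the server actually do with the bytes; the function names `encode_query_value` and `decode_query_value` are hypothetical, and UTF-8 on both ends is an assumption.

```python
from urllib.parse import quote, unquote


def encode_query_value(value: str) -> str:
    """Percent-encode a search value as UTF-8, as a browser typically would."""
    return quote(value, safe="", encoding="utf-8")


def decode_query_value(encoded: str) -> str:
    """Decode a percent-encoded value, assuming the sender used UTF-8."""
    return unquote(encoded, encoding="utf-8")


if __name__ == "__main__":
    original = "中文"  # hypothetical search term
    wire = encode_query_value(original)
    print(wire)
    # The round trip only preserves the value if both ends agree on UTF-8.
    print(decode_query_value(wire) == original)
```

If any layer in between decodes those bytes with a different code page, the value that reaches the database will no longer match what was indexed.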

Hopefully some of our users from Europe will be able to help with this.

-JD.
