Topic: Help needed: reverse-engineer on-disk cache format in MP V9.5

Mattbianco

Background: cache-only setup, database backup failure, cache filesystem backups (with TSM) complete.
Documents stored "as is" with the generic indexer only, so no separation of resources and document data.

I need to recover some documents from an AG on which arsmaint -c and arsmaint -d were run while the segment date was incorrectly set to 1970-01-01 on some documents.

I've restored the affected DOC files from backup (into another folder), both the 1136FAA1 (OD77-compressed metadata) and 1136FAAA (OD77-compressed documents).
I've run arsadmin decompress on the restored files, and have noticed that the ...FAA1 files contain the document metadata, and the ...FAAA etc files contain the documents themselves.

So far, I've noticed that the first line (sometimes the first few lines) begins with "<" and ends with ">", with tab-separated AG field names in between.
Then follow the metadata lines, one per document. Each line starts with the values of the AG fields, in the same order as in the <>-enclosed header, followed by some CMOD-specific fields that can look like this:
1136FAAA   0   21945   0   106262   U   O   0   1   0

The first field is obviously the name of the file containing the documents, and the second and third (0 and 21945) are the byte offset and length of the document, after decompression, in the document data file.
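
For anyone who wants to script this, here is a minimal Python sketch that splits off the CMOD-specific tail of a metadata line. It assumes the tail always has 10 whitespace-separated fields, as in the sample line above; the names parse_cmod_tail and CmodTail are made up for illustration, and the fields that are not yet understood are kept as plain strings.

```python
from typing import NamedTuple, Tuple

# Assumption: the CMOD-specific tail always has 10 fields, as in the sample line.
CMOD_TAIL_FIELDS = 10


class CmodTail(NamedTuple):
    data_file: str             # name of the file containing the documents
    doc_offset: int            # byte offset of the document after decompression
    doc_length: int            # byte length of the document after decompression
    unknown: Tuple[str, ...]   # remaining fields, kept verbatim


def parse_cmod_tail(line: str) -> CmodTail:
    """Return the CMOD-specific fields found at the end of one metadata line."""
    tail = line.split()[-CMOD_TAIL_FIELDS:]
    if len(tail) != CMOD_TAIL_FIELDS:
        raise ValueError(f"expected at least {CMOD_TAIL_FIELDS} fields: {line!r}")
    return CmodTail(tail[0], int(tail[1]), int(tail[2]), tuple(tail[3:]))


record = parse_cmod_tail("1136FAAA   0   21945   0   106262   U   O   0   1   0")
print(record.data_file, record.doc_offset, record.doc_length)
```

Taking the last 10 fields rather than the first means the AG field values at the start of the line can contain spaces without confusing the parser.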

But, what are the other fields? 0, 106262, U, O, 0, 1, 0 ?

In this example file, the first 29 documents make perfect sense. Here is the CMOD-data for documents 27 - 32 in the decompressed 1136FAA1 file:

1136FAAA  572939     21888            0      106262   U   O   0   1   0
1136FAAA  594827     21887            0      106262   U   O   0   1   0
1136FAAA  616714     21971            0      106262   U   O   0   1   0
1136FAAA       0     21894       106262      104751   U   O   0   1   0
1136FAAA   21894     22109       106262      104751   U   O   0   1   0
1136FAAA   44003     22005       106262      104751   U   O   0   1   0


After decompressing the entire 1136FAAA file, it is exactly 616714 + 21971 bytes long, so document 29 ends exactly at the end of the file. From document 30 on, the offset counter drops back to zero and the second pair of "counters" increases, and I don't understand where to find these remaining documents.
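
The pattern can be checked mechanically. The sketch below takes the four numeric columns of the six sample lines above and reports where a document no longer starts where the previous one ended; find_offset_resets is a hypothetical helper name, not anything that ships with CMOD.

```python
# (doc_offset, doc_length, third_column, fourth_column) for documents 27 - 32
records = [
    (572939, 21888, 0, 106262),
    (594827, 21887, 0, 106262),
    (616714, 21971, 0, 106262),
    (0, 21894, 106262, 104751),
    (21894, 22109, 106262, 104751),
    (44003, 22005, 106262, 104751),
]


def find_offset_resets(rows):
    """Return indexes of rows that do not start where the previous row ended."""
    resets = []
    for i in range(1, len(rows)):
        prev_offset, prev_length = rows[i - 1][0], rows[i - 1][1]
        if prev_offset + prev_length != rows[i][0]:
            resets.append(i)
    return resets


for i in find_offset_resets(records):
    end_of_previous = records[i - 1][0] + records[i - 1][1]
    print(f"row {i}: previous document ends at {end_of_previous}, "
          f"next starts at {records[i][0]}, "
          f"columns 3/4 change to {records[i][2]}/{records[i][3]}")
```

In this sample the only break is at the fourth row, which is also the only place where the third and fourth columns change.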

Does anyone here know what the 0 / 106262 / 104751 in the columns after the first offset+length pairs mean?
Do you think there could be a way to salvage the remaining documents from the cache backups, without using the database?

Thanks!
Matt

Mattbianco

Re: Help needed: reverse-engineer on-disk cache format in MP V9.5
« Reply #1 on: March 11, 2024, 03:47:54 AM »
Okay...

I think I figured it out now. The second pair, offset (106262) and length (104751), is the location of a compressed "block" within the data file as it is stored on disk, before decompression. The first pair, offset (0) and length (21894), is the location of the actual document within that block once the block has been decompressed:

1136FAAA       0     21894       106262      104751   U   O   0   1   0
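
If that reading is right, recovering a document comes down to two byte-range cuts with a decompression step in between. Here is a rough Python sketch of the two cuts only. The decompression itself is still left to arsadmin decompress and is not reimplemented here, and cut_range is a made-up helper name.

```python
def cut_range(data: bytes, offset: int, length: int) -> bytes:
    """Return data[offset:offset + length], refusing ranges outside the data."""
    if offset < 0 or length < 0 or offset + length > len(data):
        raise ValueError(
            f"range {offset}+{length} is outside the {len(data)} available bytes"
        )
    return data[offset:offset + length]


def cut_compressed_block(on_disk: bytes, block_offset: int, block_length: int) -> bytes:
    """Step 1: cut the still-compressed block out of the on-disk data file."""
    return cut_range(on_disk, block_offset, block_length)


def cut_document(decompressed_block: bytes, doc_offset: int, doc_length: int) -> bytes:
    """Step 2: cut one document out of a block that has already been decompressed."""
    return cut_range(decompressed_block, doc_offset, doc_length)
```

For the sample line above, step 1 would use 106262 and 104751 on the restored 1136FAAA file, and step 2 would use 0 and 21894 on the decompressed result. The bounds check is deliberate: a range that does not fit is a strong hint that the block was taken from the wrong place.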

I fooled myself by running arsadmin decompress without an offset and length: the first section decompressed without issues, so I assumed it was the whole file.
I didn't realize that these PDF documents could be compressed so efficiently that 211013 bytes would decompress into two files of 638685 + 637376 bytes.
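
As a quick sanity check on those numbers, assuming the two blocks sit back to back in the on-disk file (which is what the offsets and lengths suggest):

```python
# Compressed blocks as (offset, length) in the on-disk file
blocks = [(0, 106262), (106262, 104751)]
# Size of each block after decompression
decompressed_sizes = [638685, 637376]

# The second block starts exactly where the first one ends
assert blocks[0][0] + blocks[0][1] == blocks[1][0]

compressed_total = sum(length for _, length in blocks)
decompressed_total = sum(decompressed_sizes)
ratio = decompressed_total / compressed_total

print(compressed_total, decompressed_total, round(ratio, 2))
```

The two block lengths add up to the 211013 bytes mentioned above, and the data grows roughly six-fold when decompressed.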

Once again, CMOD and its storage structures impress me! There is indeed some truth to the "things were better back in the day" phrases.
A more recently built solution would probably have been very hard to recover from in a situation where the database with the metadata was lost. Placing the small metadata on disk in the cache file systems is a real life saver!
« Last Edit: March 11, 2024, 05:08:04 AM by Mattbianco »