OnDemand User Group

Support Forums => MP Server => Topic started by: Lars Bencze on January 02, 2018, 02:07:37 AM

Title: Alternative delete methods
Post by: Lars Bencze on January 02, 2018, 02:07:37 AM
With the new GDPR regulation coming into effect by May 2018, many companies need better ways to delete their documents, especially from OnDemand.
Most regulatory agencies don't accept the "lazy delete" done with arsdoc delete and by the ODWEK API, where only the database record pointing to the document is removed, while the document data itself is left intact on disk (and other storage).
(Most of you guys here on the forum would succeed in restoring such a "deleted" document, if you had access to the database and the files on disk.)

Are there any other options?
If we do an export + reload (after removing the document that is to be deleted) and then unload the original load, a new (minor) problem appears: the segmentation order is disrupted. Example: Say that, for a given Application Group, you have 100 or more segment tables, created sequentially by daily printouts.
Then you unload and reload a batch, which for this example is 5 years old. When you reload it, it ends up in the CURRENT segment table, and the START_DT (Start date) column for the current table will be set to a much earlier date. If you repeat this, the segmentation will eventually become really messed up and searches will be slower, since a lot of (unnecessary) tables will be searched.
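
The effect can be shown with a small model. This is only a sketch of the idea, assuming that a search skips a segment table when its date range cannot overlap the query; the `Segment` class, the table names and the dates are made up for illustration and are not CMOD structures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Segment:
    name: str    # hypothetical table name, not a real CMOD identifier
    start: date  # earliest document date in the segment
    stop: date   # latest document date in the segment

    def add(self, doc_date: date) -> None:
        # Loading a document widens the segment's date range if needed.
        self.start = min(self.start, doc_date)
        self.stop = max(self.stop, doc_date)


def segments_to_search(segments, lo: date, hi: date):
    # A segment can be skipped only if its range cannot overlap the query.
    return [s.name for s in segments if s.start <= hi and s.stop >= lo]


segments = [
    Segment("seg_2013", date(2013, 1, 1), date(2013, 12, 31)),
    Segment("seg_2017", date(2017, 1, 1), date(2017, 12, 31)),
    Segment("seg_current", date(2018, 1, 1), date(2018, 1, 2)),
]

query = (date(2015, 1, 1), date(2015, 12, 31))
before = segments_to_search(segments, *query)

# Reload a 5-year-old batch: it lands in the current (open) segment.
segments[-1].add(date(2013, 3, 15))
after = segments_to_search(segments, *query)

print(before)
print(after)
```

In this model, a query for 2015 touches no table before the reload and has to scan the current table afterwards, even though that table holds no 2015 documents. Repeating the reload for many old batches widens more and more tables in the same way.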

Are there any better methods out there to delete data from the FAA* files?
Has anyone been bold/crazy enough to investigate a solution which overwrites part of the data file on disk? (NOT recommended!)
Can you "re-open" a segment table for writing, temporarily? (As far as I know, you can only close a table and that automatically creates a new one. Of course, you could close the current table, reload the old data into a new table, and then close that table too. But that would create a whole lot of new tables over time.)
Can you forcefully move a batch of documents from one segment table to another?
Any other solution?

Please share your thoughts and solutions here. Also, if you happen to know that IBM has a solution for this up its sleeve, I'd like to know.
Title: Re: Alternative delete methods
Post by: Justin Derrick on January 02, 2018, 01:24:30 PM
Enhanced Retention Management.  It reloads-under-the-covers, and keeps table segmentation intact.  This works for *most* situations, except where the back-end storage is WORM, since the media needs to be destroyed in order to be considered 'deleted'.

Happy New Year!

-JD.
Title: Re: Alternative delete methods
Post by: Nolan on January 02, 2018, 04:39:11 PM
I don't believe Enhanced Retention Management is a solution to the real-world problem. Enhanced Retention Management is good for situations where you need a legal hold on a few documents and 90% or more of the documents in the load will follow the regular delete cycle. I think IBM needs to try again and create a supported solution that deletes a document from the repository without unloading/reloading and ties into a Records Management system.

Note that Enhanced Retention Management is only the OnDemand side; you still need to write/build/customize integration to manage the hold on the documents or, worse, leave it up to the users!

Title: Re: Alternative delete methods
Post by: Justin Derrick on January 02, 2018, 06:16:38 PM
The snag here is how CMOD compresses and bundles objects together, and I'm fairly certain that having the ability to 'wipe' individual documents would break the compression / bundling.

Maybe add it as an enhancement thread if we can't find a good solution here.

I guess the big question is, what's the range of dates in the documents you're loading all at once?  If a batch of documents is within, say, a 7 or 30 day window, I don't see this as a big issue, as documents will expire and be deleted in short order after their official expiration date.  If we're talking about one-off deletions, like deleting an individual document, then maybe there needs to be a way to specify that an object is replaced and re-written without that index record -- but the problem is that your 'archive' system is suddenly editable -- and that would likely threaten the credibility of a document produced from that system.

However, if the document is a 'bad' one from a batch of loaded files, then that's a data quality issue that should be addressed at the source.

It's a very big question, and something I've thought about for years but never really talked about before now...  :)

-JD.
Title: Re: Alternative delete methods
Post by: Nolan on January 03, 2018, 08:22:34 AM
Agreed, the compression and bundling do create a serious challenge to this problem/solution. I think a hybrid solution which "might" appease the auditors would be to make it impossible to recreate a deleted record in the table after the lazy delete has been issued: use a key GUID to validate all the indexed rows, and when a "lazy" delete is done, rebuild all the remaining keys so that going back is impossible.



Title: Re: Alternative delete methods
Post by: Justin Derrick on January 03, 2018, 08:51:29 AM
Actually, this provides for an interesting solution.  CMOD v10.1 supports encryption.  It would be nice to provide an encryption key for each individual document.  Yes, it would slaughter performance, but it would make each individual document irrecoverable when the row is deleted.  It would also be possible to detect any tampering with an individual load by hashing all of the keys together and storing that hash in the arsload table...

In this scenario, an individual document row could be deleted in the database, eliminating the key, making the document irrecoverable.  A change to an existing document on disk would be impossible without the key from the database.  Any change to the file on disk would break the encryption.  Any row deleted would cause a 'verification' of the load to fail, since the missing key would break the hash in the arsload table.
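
A rough sketch of the key-hashing part of this idea follows. It is not a CMOD feature: the per-document keys, the `load_digest` helper and the `doc-N` row IDs are all assumptions made for illustration, and the encryption step itself is left out (only the key bookkeeping is modelled).

```python
import hashlib
import secrets


def load_digest(keys: dict) -> str:
    # Hash every (doc_id, key) pair in a fixed order so the digest is reproducible.
    h = hashlib.sha256()
    for doc_id in sorted(keys):
        h.update(doc_id.encode())
        h.update(keys[doc_id])
    return h.hexdigest()


# One random 256-bit key per document row (hypothetical index rows).
keys = {f"doc-{n}": secrets.token_bytes(32) for n in range(3)}

# Stored once at load time, e.g. alongside the load record.
stored_digest = load_digest(keys)
verified_at_load = load_digest(keys) == stored_digest

# "Deleting" a document removes its row and therefore its key.
del keys["doc-1"]
verified_after_delete = load_digest(keys) == stored_digest

print(verified_at_load)
print(verified_after_delete)
```

Verification passes right after the load and fails once a row (and its key) is gone, which matches the behaviour described above. It also suggests that an authorized delete would need some way to recompute and re-store the digest, otherwise every legitimate deletion would look like tampering.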

Anyone care to expand on that idea?

-JD.
Title: Re: Alternative delete methods
Post by: Nolan on January 03, 2018, 09:51:25 AM
That is along the lines of what I was thinking.  Now to get IBM to build it  :D

Title: Re: Alternative delete methods
Post by: Lars Bencze on January 08, 2018, 04:06:33 AM
Hi, very interesting thoughts.
According to another source I have (I have not verified this yet due to a lack of time), the ERM does NOT keep the segmentation intact.
Do you have a source where I can verify that this is indeed the case?
From my tests with ERM, it does not delete or reload Jack. Unless you run arsmaint -D 100, but that is to my understanding not part of ERM but of base CMOD.
(Running "arsmaint -D 100 ... -G AppGroup" is another thing I have not verified yet. During my last attempt, it seemed to try to reload every single LoadID in the Application Group - NOT an option, as you understand... :) )
Title: Re: Alternative delete methods
Post by: Justin Derrick on January 08, 2018, 07:21:25 AM
The info came from IBM in a presentation.  I'll try and get a written source for you.  :)

Edit:
Here's the IBM CMOD Enhanced Retention Management documentation:  https://www.ibm.com/support/knowledgecenter/SSEPCD_10.1.0/com.ibm.ondemand.erm.doc/doder200.htm -- and no, it doesn't contain any evidence of my assertion about how it works.  I've sent a note to the developer for confirmation.

-JD.
Title: Re: Alternative delete methods
Post by: Nolan on January 08, 2018, 10:12:37 AM
Lars, perhaps you need to adjust your settings for it to unload jack :)

From the document shared by Ed/Justin.

To help you control how often Content Manager OnDemand reloads a load, include the -D flag when you run the arsmaint and arsadmin unload commands as part of your expiration process. The -D flag indicates that Content Manager OnDemand should reload a load when the number of documents with a hold in an application group changes by a specified percentage from the previous time the application group was loaded.
When Content Manager OnDemand needs to reload an application group, it does the following tasks:

a. Extracts all the documents that have holds applied and their related index data.
b. Loads all the held documents and their related index data into a new load.
c. Deletes the original load (all files from cache and the index data from the OnDemand databases).
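
Steps a to c can be modelled in a few lines. This is only a toy model of the documented sequence, not how CMOD implements it; the `Archive` and `Doc` classes and the `LOAD-n` ID format are invented for the example, and the -D percentage check is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    held: bool = False


@dataclass
class Archive:
    loads: dict = field(default_factory=dict)  # load_id -> list of Doc
    next_id: int = 1

    def store(self, docs) -> str:
        load_id = f"LOAD-{self.next_id}"  # hypothetical ID format
        self.next_id += 1
        self.loads[load_id] = list(docs)
        return load_id

    def reload_held(self, load_id: str) -> str:
        held = [d for d in self.loads[load_id] if d.held]  # step a: extract held docs
        new_id = self.store(held)                          # step b: load them as a new load
        del self.loads[load_id]                            # step c: delete the original load
        return new_id


archive = Archive()
old = archive.store([Doc("A"), Doc("B", held=True), Doc("C")])
new = archive.reload_held(old)

print(sorted(archive.loads))
print([d.doc_id for d in archive.loads[new]])
```

After the reload only the new load remains, and it contains only the held document. Everything that was not on hold disappears together with the original load.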
Title: Re: Alternative delete methods
Post by: Justin Derrick on January 08, 2018, 02:22:52 PM
Okay, confirmed by the developer -- it doesn't work the way I thought; loads go into the open database segment.  I foresee an Enhancement Request in my future.  :)

-JD.
Title: Re: Alternative delete methods
Post by: Lars Bencze on January 09, 2018, 07:31:55 AM
Thank you both Justin and Nolan for your help.
Yes, I read the same documentation and noted that it could be reloading it into the old table, but it was unspecified.

PS: I will go searching for the "jack" setting when I have some time to spare.... ;) Or maybe we will write an add-on for that too.
Title: Re: Alternative delete methods
Post by: Stephen McNulty on January 23, 2018, 07:32:33 AM
> Actually, this provides for an interesting solution.  CMOD v10.1 supports encryption.  It would be nice to provide an encryption key for each individual document.  Yes, it would slaughter performance, but it would make each individual document irrecoverable when the row is deleted.  It would also be possible to detect any tampering with an individual load by hashing all of the keys together and storing that hash in the arsload table...
>
> In this scenario, an individual document row could be deleted in the database, eliminating the key, making the document irrecoverable.  A change to an existing document on disk would be impossible without the key from the database.  Any change to the file on disk would break the encryption.  Any row deleted would cause a 'verification' of the load to fail, since the missing key would break the hash in the arsload table.
>
> Anyone care to expand on that idea?
>
> -JD.

Perhaps, along this line of thinking: during the arsdoc delete, we know the object location, offset and length of the compressed document within the storage object, so we could overwrite the bytes.
Title: Re: Alternative delete methods
Post by: Justin Derrick on January 23, 2018, 08:05:42 AM
> Perhaps, along this line of thinking: during the arsdoc delete, we know the object location, offset and length of the compressed document within the storage object, so we could overwrite the bytes.

I think this breaks the compression in the file.  I'm under the impression that there are compressed blocks aggregated into stored objects, and inside those compressed blocks are multiple documents, so blanking out one would break the rest of the objects in that compressed block.
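
The general effect is easy to reproduce with plain zlib. This is a generic illustration, not CMOD's actual storage format: the five "documents", the single compressed block, and the 16-byte overwrite in the middle of it are all assumptions made for the example.

```python
import zlib

# Five made-up "documents" bundled into one compressed block.
docs = [
    "".join(f"doc{d} line{n} amount {(d * 131 + n * 17) % 9973}\n" for n in range(200)).encode()
    for d in range(5)
]
bundle = b"".join(docs)
block = zlib.compress(bundle)

# Blank out 16 bytes in the middle of the compressed block.
start = len(block) // 2
damaged = block[:start] + b"\x00" * 16 + block[start + 16:]


def readable(data: bytes) -> bool:
    # True only if the block still decompresses to the original bundle.
    try:
        return zlib.decompress(data) == bundle
    except zlib.error:
        return False


print(readable(block))
print(readable(damaged))
```

The untouched block reads back fine, while the block with a few bytes blanked out no longer yields the original bundle, so every document in it is affected and not just the one that was meant to be wiped.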

-JD.