OnDemand User Group

Support Forums => MP Server => Topic started by: tjspencer2 on November 08, 2020, 08:42:36 AM

Title: Unified Content and Access integration Layer (UCAIL)
Post by: tjspencer2 on November 08, 2020, 08:42:36 AM
So there are vendors in the marketplace selling products that build a unified content access and integration layer. These products abstract the metadata properties of your content repositories (FileNet, CMOD, SharePoint, et al.), including pointers to documents/reports, into an "access and integration layer", typically storing this metadata in a NoSQL database.  This lets web services run enterprise search and "get" calls against that layer in lieu of writing API interfaces to every individual back-office content store.  Assume for the sake of conversation that all noted so far is possible. When it comes to CMOD, is there a way to get to a single "Unique ID" for a specific CMOD document or report?  Given that CMOD is all about where something is within an input file, it's much different than a GUID (globally unique ID) that is stored in a content repository like FileNet or SharePoint.  Just wanted to see if anyone had attempted to do this and what they had come up with as their solution.
Title: Re: Unified Content and Access integration Layer (UCAIL)
Post by: Justin Derrick on November 08, 2020, 02:31:46 PM
What's the end goal / big picture?  Search federation or enterprise search?  This sounds a lot like CMIS, which seemed to fizzle out after a few years of hype.

As for accessing a document, there really isn't a way.  You could store the object name, plus the byte offset and length, but it would still be compressed in a proprietary fashion, so you'd still have to call CMOD anyway.

You absolutely *can* use globally unique document IDs or document hashes as metadata for searching for a *specific* document, but that actually just adds to the overhead; it doesn't reduce it.

Title: Re: Unified Content and Access integration Layer (UCAIL)
Post by: tjspencer2 on November 13, 2020, 06:13:28 AM
It's conceptually very much like CMIS but reportedly benchmarks faster.  We've done some custom solutions like this where we extract load metadata (query strings and load IDs) so that web service calls can do a "get list" against SQL in lieu of ODWEK - VERY fast.  Then our web service ONLY calls ODWEK when the user wants to "get" the document/statement, at which point it executes the stored query string via ODWEK.
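
A minimal sketch of that split, for illustration only. It assumes the extracted load metadata sits in a SQL table (SQLite here, just to keep it self-contained), and that the actual retrieval is passed in as a callable. The table layout, the function names, and the `retrieve` callback are all hypothetical; the callback stands in for whatever ODWEK call the web service really makes, which is not shown here.

```python
import sqlite3
from typing import Callable, List, Tuple


def create_index(conn: sqlite3.Connection) -> None:
    """Create the metadata table that stands in for the SQL-side index."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS statement_index (
            entry_id       INTEGER PRIMARY KEY,
            customer_id    TEXT NOT NULL,
            statement_date TEXT NOT NULL,
            load_id        TEXT NOT NULL,
            query_string   TEXT NOT NULL
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_statement_customer "
        "ON statement_index (customer_id)"
    )


def add_entry(conn: sqlite3.Connection, customer_id: str, statement_date: str,
              load_id: str, query_string: str) -> int:
    """Store one statement's extracted metadata; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO statement_index "
        "(customer_id, statement_date, load_id, query_string) "
        "VALUES (?, ?, ?, ?)",
        (customer_id, statement_date, load_id, query_string),
    )
    conn.commit()
    return cur.lastrowid


def get_list(conn: sqlite3.Connection, customer_id: str) -> List[Tuple[int, str, str]]:
    """The "get list" call: answered from SQL only, no repository call."""
    return conn.execute(
        "SELECT entry_id, statement_date, load_id FROM statement_index "
        "WHERE customer_id = ? ORDER BY statement_date",
        (customer_id,),
    ).fetchall()


def get_document(conn: sqlite3.Connection, entry_id: int,
                 retrieve: Callable[[str], bytes]) -> bytes:
    """The "get" call: look up the stored query string, then hand it to
    the retrieval callback (a placeholder for the real ODWEK request)."""
    row = conn.execute(
        "SELECT query_string FROM statement_index WHERE entry_id = ?",
        (entry_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no index entry {entry_id}")
    return retrieve(row[0])
```

The point of the shape is that listing never touches the repository; only `get_document` does, and only for the one entry the user picked.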

The big picture is "enterprise search", and my question is: does CMOD provide a way to identify a document ID (a GUID, if you will) so that it can somehow be extracted and stored along with other metadata?  Or do we have to go with an approach like the one I outlined above, where we extract the load ID and store a query string to be executed for "gets"?
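
If the second approach is the one used, the index layer can still hand out a stable ID of its own. The sketch below derives a deterministic UUID (version 5) from the server name, load ID, and stored query string. This is an assumption for illustration, not a CMOD feature: the ID only has meaning inside the index layer, and a "get" still has to go through the stored query string. The namespace value and the function name are hypothetical.

```python
import json
import uuid

# Hypothetical namespace for this index layer; any fixed UUID would do,
# as long as it never changes once IDs have been issued.
INDEX_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "content-index.example.invalid")


def surrogate_id(server: str, load_id: str, query_string: str) -> uuid.UUID:
    """Return the same UUID every time for the same (server, load ID, query).

    The parts are JSON-encoded as a list so that different splits of the
    same characters across the three fields cannot collide.
    """
    key = json.dumps([server, load_id, query_string], separators=(",", ":"))
    return uuid.uuid5(INDEX_NAMESPACE, key)
```

Because the ID is computed rather than assigned, re-extracting the same load produces the same IDs, so the index can be rebuilt without breaking links that other systems have stored.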

And my CMOD context is PDF statements - while many PDF files only had one statement in them, most were large PDF files - hundreds of megs - that contained tens of thousands of contiguous PDF statements all in the same file.

Title: Re: Unified Content and Access integration Layer (UCAIL)
Post by: Justin Derrick on November 13, 2020, 06:37:11 AM
There *used* to be a DocID that you could store, but it was a (severe) security problem.  DocIDs are now encrypted, and are only valid for a particular user on a particular server, and I believe for a limited amount of time.

You could enable GUIDs or Document Hashes in CMOD, but then you'd still have to do a search beforehand.

I'm actually curious to know where the performance problem is.  I have customers doing high-hundreds-of-millions of queries a day, along with low-hundreds-of-millions of retrievals per day, with *zero* performance concerns except the occasional 'malicious query' (like, "all customers with 'S' in their name").  There's a LOT of tuning that can be done in CMOD.

Title: Re: Unified Content and Access integration Layer (UCAIL)
Post by: tjspencer2 on November 20, 2020, 05:01:16 AM
So there are a few things going on there, Justin.

First - Check Images - Our primary use case is PDF check statements with lots of images - as I understand it, CMOD stores check images separately from text and effectively reassembles them on the ODWEK "get".  Because we have some statements with thousands of pages of images, retrieving and reassembling those statements can completely compromise our performance.

Second - Bad Design - We have a variety of statements stored across a variety of AGs and Folders, AND the same fields are named differently across AGs/Folders (synonyms), which makes pulling everything for a customer through the API problematic.  In short, this is somewhat our own fault.  We had several cooks in the kitchen at one time before my team took over.
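
One common way to cope with the synonym problem in an abstraction layer is a per-application-group mapping from local field names to one canonical name. A rough sketch follows; the application group names and field names are invented for illustration and do not come from any real system.

```python
from typing import Dict, Mapping

# Hypothetical synonym table: application group -> {local name: canonical name}.
FIELD_SYNONYMS: Dict[str, Dict[str, str]] = {
    "CHK_STMT": {"AcctNum": "account_number", "StmtDate": "statement_date"},
    "SAV_STMT": {"ACCOUNT_NO": "account_number", "STMT_DT": "statement_date"},
}


def normalize(app_group: str, fields: Mapping[str, str]) -> Dict[str, str]:
    """Rename one hit's fields to their canonical names.

    Fields with no mapping, and application groups with no entry in the
    table, pass through unchanged.
    """
    mapping = FIELD_SYNONYMS.get(app_group, {})
    return {mapping.get(name, name): value for name, value in fields.items()}
```

With every hit normalized on the way into the index, a single query on `account_number` can cover all statement types for a customer, regardless of what each AG called the field.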

Third - [ODWEK] API Performance/Expertise - Our developers don't want to have to learn every application's API and instead argue for an abstraction layer.  Also, searching ODWEK across a slew of CMOD Folders/AGs to retrieve ALL of a customer's statement types is pretty painful to them, and they argue it doesn't perform nearly as well as an abstraction model does.

Hopefully this helps a bit.