A GPT-powered document search is unrealistic for mass search – do this instead

- Mar 27, 2024
- 7 min read
By Jason Ly.
Time spent searching for documents eats into the workday. Having watched generative AI boost efficiency across many areas, innovation directors are asking whether the same technology can streamline enterprise search and quickly surface hidden documents.
Although generative AI confers many benefits, deploying LLMs to search an entire large database (called a “mass search”) is not their intended purpose.
In this article, we explain why the state of most firms’ file storage drives, and the formatting of their files, leaves them unprepared for a GPT-powered search. We also consider the limitations of the technique of ‘embedding’.
We then give our recommendation: use LLMs at smaller scales, such as for individual clients, projects, or departments, where new databases have a better chance of success than historical backlogs.
Searching for files is a time-consuming task
Content findability issues can bottleneck a workflow, and the majority of employees consider searching for files to be a “top-three problem”. [1]
Time is expended scouring cluttered email inboxes and shared drives for misplaced invoices, cryptically named meeting notes, and scattered presentations.
A fundamental problem is that a search system cannot always match a query to a file’s content when the vocabulary differs. In other words, systems are unforgiving towards users who cannot recall the exact words used in the file.
Proposed LLM-based solutions promise to get around this problem through natural language processing (NLP).
Corporations want to streamline document search with generative AI
Generative AI has been demonstrated to boost productivity and efficiency; Harvard and BCG found that consultants using GPT-4 finished 12% more tasks and completed them 25% more quickly. [2] Firms are consequently keen to extend its benefits to document search.
Firms now wonder: “Can I hook up my GenAI solution to all my files, and have a unified knowledge management system that is ChatGPT-searchable?”
In theory, a user would input a prompt into an LLM-powered document search system such as “find me that spreadsheet from late last year with the archived accounts of client X that includes revenue for the Y department”. It would then analyse the names, content, and categorisations of every document, and swiftly retrieve the correct file.
If only it were so easy!
Mass enterprise search is not a straightforward task, even for the smartest LLM
Deploying LLMs to search an entire large database containing millions of documents of every variety is not their intended purpose. While it is technically possible, the current practical limitations and user experience fall far short of expectations.
The aforementioned dream GPT tool faces four main hurdles:
Potentially massive costs
Diminishing returns with increasing data volume
Disorderly drives with abandoned, badly versioned documents
Challenges with security compliance
Most of these challenges extend beyond technological solutions, necessitating agile business decision-making and potentially major shifts in company culture around how documents are created and managed in the first place.
Most drives are unprepared for an LLM search
When users cannot find a file, often an LLM is none the wiser. The typical corporation’s drive has not been kept clean, organised, indexed, and formatted throughout its history in a way that LLM-powered search systems can easily navigate.
A drive typically has:
Duplicate versions
Outdated, incomplete, and redundant data
Untagged documents
Inconsistent formatting and naming of documents
Formatting that is unreadable by LLMs
Duplicate versions alone raise several issues:
How do we consistently determine the right version: the latest, the version last edited by the most senior person, or the version last edited by a different, specific team member?
How do we handle branches?
How do we treat the copies created because we commonly use other people’s documents as templates?
How do we know a document was even completed and not written off?
Managing a well-curated set of documents has been an objective for every business in the last five years. And yet, we’ve not met a single employee from a company of over 100 employees who doesn’t consider their knowledge management a mess.
Utilising LLMs for document search has high hidden costs
All this irrelevant data strains a search system, which must check the relevance of each document on every query. Scanning an individual document with an LLM costs merely a hundredth of a penny, but the total expense accumulates quickly with each search and becomes prohibitive in large document collections.
For example, when a method rigorously scans tens of thousands of documents, the costs can skyrocket. Consider a scenario with 1,000 users conducting 100 searches daily. Given the need for multiple attempts to locate precise information, this equates to a staggering £1,000 in search expenses per day.
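To see how the figure compounds, here is a back-of-envelope sketch. The per-search cost is an assumption chosen to match the scenario above, not a measured price.

```python
# Back-of-envelope cost model for the scenario above.
# The per-search figure is an illustrative assumption, not a measured price.
users = 1_000
searches_per_user_per_day = 100
cost_per_search_gbp = 0.01  # assume each search works out to roughly a penny

daily_cost = users * searches_per_user_per_day * cost_per_search_gbp
print(f"Daily search spend: £{daily_cost:,.0f}")  # £1,000 per day
```

Note that even this assumes heavy pre-filtering: naively scanning tens of thousands of documents per query at a hundredth of a penny each would multiply the bill a hundredfold.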
Law firms, intrigued by the potential, initially approached us to advise them on searching millions of documents. At such a scale, however, relying solely on an LLM is impractical.
Instead, LLMs should be leveraged as a "final mile" solution, akin to how one might use an expensive Uber ride for the final leg of a journey after travelling most of the distance via cheaper, albeit less comfortable, trains or buses.
Embedding techniques confer limited benefits
As such, cheaper and more traditional search options, like keywords or embeddings, are necessary to perform the first layer of filtering.
Embeddings are a core technology for bridging the symbolic nature of language with the numeric representations that computers require. An embedding search works like a keyword search, except that related words and phrases are also treated as similar.
The aim is to match search queries with the language used in the file. This is not always easy: machines still struggle to comprehend distant synonyms or loosely descriptive phrases from users.
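As a minimal sketch of the idea, the snippet below ranks documents by semantic similarity to a query, assuming the open-source sentence-transformers library (the model name and sample texts are illustrative):

```python
# Rank documents by embedding similarity to a query.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

query = "archived accounts with revenue for the marketing department"
documents = [
    "FY2023 client accounts, including departmental revenue breakdown",
    "Meeting notes: quarterly planning offsite",
    "Invoice template for new SaaS engagements",
]

# Encode the query and documents into vectors, then rank by cosine similarity.
query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode(documents, convert_to_tensor=True)
scores = util.cos_sim(query_vec, doc_vecs)[0]

# In a "final mile" pipeline, only the top few results would then be
# passed to an LLM for deeper comprehension.
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```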
However, embeddings are only effective on documents that contain large amounts of plain text. They do not work for consulting decks produced in PowerPoint, which are full of diagrams, images, and other visual data.
Even with text, these searches are only about as good as a Google search (Google has been using embeddings for years). The LLM level of comprehension is applied only to the top ten results retrieved by the embedding search. This is great when you’re trawling the internet for SEO-optimised content that is deliberately made to be easily found.
However, the documents in your company cloud are likely not SEO-optimised. While you may get some good results, don’t expect anywhere near ChatGPT-quality output.
Make your documents easy to index and easy for machines to understand
Documents should be formatted in a search-friendly way so that systems can find them. LLMs can help with this at the point of content creation (see the sketch after this list). This can be done by:
Formatting every document consistently
Writing a summary of each document and what it is useful for
The whole company agreeing to only put clean, searchable, high-quality documents into the searchable system
Ensuring proper version management techniques (as opposed to simply including “V2” in the name)
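As a sketch of the second point, an LLM can draft the summary and tags at the moment a document is saved. This assumes the official openai Python client; the model name and prompt are illustrative, not a prescribed setup.

```python
# Sketch: generate a search-friendly summary and tags at creation time.
# Assumes: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def summarise_for_search(document_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Summarise this document in 2-3 sentences and "
                        "propose five metadata tags for enterprise search."},
            {"role": "user", "content": document_text},
        ],
    )
    return response.choices[0].message.content

# The resulting summary and tags would be stored alongside the file
# in the document management system's index.
```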
To drive an initiative of this scale, the entire company has to adopt this methodology in a behavioural shift that demands strong top-down leadership.
A notable example of a culture change on this scale is Jeff Bezos’ influence on reshaping meeting culture at Amazon. Bezos swapped PowerPoint presentations for structured multi-page memos, read at the start of each meeting and followed by a discussion.
As a side note, these prepared Amazon documents may be well-suited to advanced search capabilities!
If you’re in a young company or department and can drive this sort of initiative, then LLM-powered enterprise search may work well for you. It may well be worth implementing these changes and best practices in your team now, in order to benefit from the capability later. The best time to start was five years ago; the next best time is today.
Realistically, however, most larger companies are unlikely to be able to execute this type of work culture shift.
A GPT document search must follow information security rules
Another potential blocker to putting LLMs in front of your document system is the challenge of complying with infosec best practices, in particular maintaining existing file permissions.
That is, if one user couldn’t access a document before, they shouldn’t be able to access an LLM’s summary of the content via your enterprise search tool.
A classic horror story would be an employee asking a chatbot “what is our HR strategy for the upcoming restructuring?” and the chatbot actually answering.
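The guardrail is conceptually simple: filter candidate documents by the querying user’s existing permissions before anything reaches the LLM. A minimal sketch, with a hypothetical document model standing in for a real DMS lookup:

```python
# Sketch: enforce existing file permissions before any LLM sees a document.
# The Document model and group logic are hypothetical placeholders for
# whatever the underlying DMS (e.g. SharePoint, iManage) exposes.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    content: str
    allowed_groups: set[str]  # groups permitted to read this file

def permitted(user_groups: set[str], doc: Document) -> bool:
    # A user may see a result (or an LLM summary of it) only if they
    # already hold read access in the underlying DMS.
    return bool(user_groups & doc.allowed_groups)

def filter_for_user(user_groups: set[str],
                    candidates: list[Document]) -> list[Document]:
    return [d for d in candidates if permitted(user_groups, d)]

# Only the filtered list is ever passed to the LLM, so the chatbot
# cannot leak restricted content such as an HR strategy document.
```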
DMSs such as SharePoint and iManage handle file permissions differently, and every company configures them differently; further complexity arises when group-level permissions are used to share documents among different teams.
This challenge, compounded by encryption in transit, data sovereignty, and maintaining geographical restrictions, makes infosec for such a system a monumental endeavour.
LLMs can succeed in smaller-scale searches of new, indexed, compatibly formatted databases
So, all doom and gloom so far? Where can we see the light?
Each firm starts tens to hundreds of new projects daily that require minimal information from historical documents. Employees typically take templates they know are good, adapt them, and put them into a project folder.
This project folder, over the course of its lifetime, likely comes to consist of 100-1,000 curated, genuinely useful documents that your team regularly uses: precedents, templates, advice notes, and so on. We recommend developing solutions that serve project-specific requests which can be answered from the documents in the respective project folder. For example, “Create a new SaaS contract for this client based on their usual terms”.
The system can be built to ensure that these documents are stored and formatted in a way conducive to ChatGPT-like interfaces. When team members add a document to this system, they know to add only useful documents of pre-defined quality, with meta tags. Having this data in one place, and appropriately structured, significantly reduces search times.
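At this scale the whole pipeline becomes simple. Below is a sketch of a project-scoped request, in which the document schema, tag filter, and model name are all illustrative assumptions:

```python
# Sketch: answer a request using only a curated project folder.
# Assumes the official openai Python client; schema and names are illustrative.
from openai import OpenAI

client = OpenAI()

def answer_project_request(request: str,
                           project_docs: list[dict],
                           tag: str) -> str:
    # project_docs entries look like:
    # {"name": "...", "tags": ["contract", "SaaS"], "text": "..."}
    # With only 100-1,000 curated documents, simple tag filtering keeps
    # the context small enough to pass straight to the model.
    relevant = [d for d in project_docs if tag in d["tags"]]
    context = "\n\n".join(f"## {d['name']}\n{d['text']}" for d in relevant[:5])
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Answer using only the provided project documents."},
            {"role": "user", "content": f"{context}\n\nRequest: {request}"},
        ],
    )
    return response.choices[0].message.content
```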
Specialised solutions tailored to the legal industry are emerging to address unique needs. These solutions consider specific document types like emails, contracts, or engagement letters, integrating them into sophisticated prompt-architected pipelines. This approach is particularly effective for automating routine documents and processes in law firms.
Key takeaways
LLMs are unsuitable for streamlining mass enterprise document search over entire DMSs.
Nevertheless, LLMs can make a significant difference to your workflow when applied at a smaller scale, namely to a specified dataset. This can be done via one of two avenues. The first is establishing organised project folders for storing important information, ensuring documents are well-structured, labelled, and tagged so that tools like Microsoft Copilot can search them efficiently. While effective for general use cases, these tools may be less dependable for automating repetitive processes performed on a regular basis.
The second avenue involves developing prompt-architected solutions that perform a specific task, extracting particular information from distinct document types such as contracts, emails, letters, and forms. These solutions are geared towards generating specific outcomes, such as supporting particular categories of legal matters.
These solutions can be developed with AI partners like Springbok AI.
Notes
[1] Elastic (2021), “Welcome to a New State of Find: Unified Search for Finding Workplace Content” (report).
[2] BCG (2023), “How People Can Create—and Destroy—Value with Generative AI” (report).

About the Author
Jason Ly is the Cofounder and Chief of Product and Solutions at Springbok AI, leading the technical arm of the business. Jason led the technical implementation of Dentons’ internal instance: fleetAI. Jason holds an MPhil in Mathematics from the University of Warwick.


