How to query the ChatGPT API about large documents

Question

I want to pass a large document (or many documents) to the ChatGPT API and then ask it questions about the content. However, the token limit makes it impossible to do that directly: the documents can contain millions of words, while the limit is much lower (usually a few thousand words). So, how can you query ChatGPT about personal documents, books, PDFs, website docs, etc.?

Answer

The most common strategy for working around the token limit is to follow these steps:

1. Split the document into chunks of text (e.g. 4k tokens each, or smaller if the embedding model or the final prompt requires it). You can simply split on the sections of the document or use a more advanced technique, like a sliding window with overlapping chunks.
2. Compute an embedding for each chunk: pass the chunk of text to an embedding model, which returns a vector (an array of numbers) that captures the meaning of the text, so that chunks about similar topics produce similar vectors.
3. Store the embeddings in a vector database, like Pinecone, together with a reference to the chunk they came from.
4. When a user asks a question, convert the question into an embedding using the same model that you used for the chunks.
5. Query the vector database for the embeddings most similar to the question’s embedding (i.e. the vectors that are nearest to it in the multidimensional space).
6. If the chunks you found are too long to fit in the prompt together, summarize them using an LLM.
7. Pass the relevant text to the ChatGPT API together with the question (e.g. “Answer this question {question} based on the following text: {text}”).
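
To make the flow more concrete, here is a minimal Python sketch of the chunking, embedding, search, and prompt-building steps above. It is only an illustration under some loud assumptions: `toy_embed` is a made-up bag-of-words stand-in for a real embedding model, a plain Python list plays the role of the vector database, and all function names are hypothetical. In a real setup you would swap in an actual embedding API and a vector store such as Pinecone, and send the final prompt to the ChatGPT API.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Sliding window over words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: stand-in for a real embedding model.

    Returns a sparse word-count vector, so it only matches shared words,
    not meaning. A real model returns a dense vector of floats.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """ASSUMPTION: an in-memory list standing in for a vector database."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def search(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Embed the question the same way as the chunks and return the k nearest chunks."""
    question_vector = toy_embed(question)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine the question and the retrieved text into one prompt string."""
    text = "\n\n".join(chunks)
    return f"Answer this question: {question}\nbased on the following text:\n{text}"


if __name__ == "__main__":
    passages = [
        "Cats sleep most of the day and hunt at night.",
        "The Eiffel Tower is located in Paris, France.",
        "Python lists are mutable sequences.",
    ]
    index = build_index(passages)
    question = "Where is the Eiffel Tower located?"
    print(build_prompt(question, search(index, question, k=1)))
```

The string returned by `build_prompt` is what you would send as the user message to the ChatGPT API. The optional summarization step is left out here; it would sit between `search` and `build_prompt` when the retrieved chunks do not fit in the prompt together.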