Question
I need to summarize a long text (e.g. a book or a document) using the OpenAI API. However, the API has a context limit (e.g. 4k tokens) that is far smaller than an entire book. Is there a way to summarize a long text that exceeds the token limit?
Answer
You can split the text into chunks, summarize each chunk separately, and then concatenate the partial summaries to get a summary of the whole book. If the result is still too long, repeat the same process on it. For better results, break the text at natural boundaries: for example, avoid splitting in the middle of a word or paragraph.
Here’s an algorithm you can use (a code sketch follows the list):
- Split the text into sentences (you can split on newlines or use a more elaborate regex).
- For each sentence, if adding it to the current chunk would exceed the maximum length (e.g. 4k tokens), start a new chunk; otherwise, append the sentence to the current chunk.
- Send each chunk to the LLM for summarization, using a prompt like: “Write a detailed summary of the following: {text}”
- Repeat the steps above recursively until you get a summary of the desired length (i.e. use the same algorithm to summarize the concatenated summaries, then the summary of the summaries, and so on).
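Here is a minimal Python sketch of that algorithm using the official `openai` package (v1+ client). Everything here is illustrative: the model name, the character-based length budget (a rough proxy for a 4k-token limit; a real implementation would count tokens with `tiktoken`), and the regex sentence splitter are all assumptions you should adapt.

```python
import re

from openai import OpenAI  # assumes the official `openai` Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Rough stand-in for a 4k-token budget (~3 chars/token); a real
# implementation would count tokens with tiktoken instead.
MAX_CHUNK_CHARS = 12_000


def split_into_sentences(text: str) -> list[str]:
    # Naive splitter: break on sentence-ending punctuation followed by
    # whitespace. A library splitter (e.g. nltk) handles edge cases better.
    return re.split(r"(?<=[.!?])\s+", text)


def build_chunks(sentences: list[str], max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    # Greedily pack sentences into chunks; start a new chunk when the
    # next sentence would push the current one past the budget.
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def summarize_chunk(chunk: str, model: str = "gpt-4o-mini") -> str:
    # The model name is an example; substitute whichever model you use.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"Write a detailed summary of the following:\n\n{chunk}",
            }
        ],
    )
    return response.choices[0].message.content


def summarize(text: str, target_chars: int = 4_000) -> str:
    # Recursively summarize until the result fits the target length.
    while len(text) > target_chars:
        chunks = build_chunks(split_into_sentences(text))
        text = "\n\n".join(summarize_chunk(chunk) for chunk in chunks)
        if len(chunks) == 1:
            # A single chunk was already summarized once; stop here to
            # avoid looping forever if it cannot be compressed further.
            break
    return text
```

Each pass through the `while` loop is one level of the recursion described above: chunk, summarize each chunk, join the partial summaries, and feed the result back in until it fits.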