To avoid exceeding token limits when chaining LLM calls in n8n, you need to implement strategies like breaking text into smaller chunks, using summarization techniques, filtering data before processing, and leveraging efficient prompt engineering. These approaches help manage token consumption while maintaining the effectiveness of your workflow.
A Comprehensive Guide to Managing Token Limits When Chaining LLM Calls in n8n
When working with Large Language Models (LLMs) in n8n workflows, you may encounter token limit constraints, especially when chaining multiple LLM calls together. This comprehensive guide provides detailed strategies to effectively manage these limitations while maintaining the functionality of your workflows.
Step 1: Understanding Token Limits in LLMs
Before diving into solutions, it's essential to understand what tokens are and why they matter. Tokens are the sub-word units an LLM reads and writes; in English, one token corresponds to roughly four characters or about three-quarters of a word. Every model has a fixed context window that must hold both your prompt and the model's response, so when you chain calls and feed one output into the next prompt, token usage compounds quickly and can push a request over the limit.
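As a quick illustration, a Function node like the sketch below estimates token counts with the same 4-characters-per-token heuristic used throughout this guide; the 8,192-token limit is just an example value, not a recommendation for any particular model.
// Function node: rough token estimate for the incoming text
// The 8,192-token limit below is an example value - check your model's actual context window.
const text = items[0].json.text || '';
const estimatedTokens = Math.ceil(text.length / 4); // ~4 characters per token in English (rough heuristic)
const modelContextLimit = 8192; // example limit, adjust to your model

return [{
  json: {
    estimatedTokens,
    modelContextLimit,
    fitsInContext: estimatedTokens < modelContextLimit
  }
}];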
Step 2: Setting Up the n8n Environment for LLM Integration
Before implementing token management strategies, ensure your n8n environment is properly configured:
For OpenAI integration, add your API key in the n8n credentials section:
// This is done in the n8n UI, not in code
// Navigate to Settings > Credentials > Add new credential
// Select "OpenAI API" and input your API key
Step 3: Implementing Text Chunking for Large Documents
When dealing with large text inputs, breaking them into manageable chunks is essential:
// Using Function node to split text into chunks
const inputText = items[0].json.text;
const maxChunkSize = 4000; // Characters (roughly 1,000 tokens at ~4 characters per token)
const overlap = 200; // Overlap between chunks for context
// Simple character-based chunking (approximate)
function splitIntoChunks(text, maxSize, overlap) {
  const chunks = [];
  let startPos = 0;
  while (startPos < text.length) {
    const endPos = Math.min(startPos + maxSize, text.length);
    chunks.push(text.substring(startPos, endPos));
    if (endPos >= text.length) break; // last chunk reached - avoids looping forever on the tail
    startPos = endPos - overlap; // step back so consecutive chunks overlap for context
  }
  return chunks;
}
const textChunks = splitIntoChunks(inputText, maxChunkSize, overlap);
// Return chunks for processing in subsequent nodes
return textChunks.map(chunk => ({
json: {
chunkText: chunk
}
}));
For more advanced chunking based on semantic meaning:
// Using Function node to split text by paragraphs
function splitByParagraphs(text, maxTokens) {
const paragraphs = text.split('\n\n');
const chunks = [];
let currentChunk = '';
for (const paragraph of paragraphs) {
// Rough approximation of tokens (4 chars ≈ 1 token)
const paragraphTokens = paragraph.length / 4;
const currentChunkTokens = currentChunk.length / 4;
if (currentChunkTokens + paragraphTokens > maxTokens && currentChunk !== '') {
chunks.push(currentChunk);
currentChunk = paragraph;
} else {
currentChunk = currentChunk ? `${currentChunk}\n\n${paragraph}` : paragraph;
}
}
if (currentChunk) {
chunks.push(currentChunk);
}
return chunks;
}
const inputText = items[0].json.text;
const chunks = splitByParagraphs(inputText, 3800); // Conservative token limit
return chunks.map(chunk => ({
json: { chunkText: chunk }
}));
Step 4: Implementing Progressive Summarization
Instead of sending the entire output of one LLM call to the next, implement progressive summarization:
// Step 1: Process chunks individually with LLM
// This happens in OpenAI node configured to process each chunk
// The prompt for this node might be:
// "Analyze the following text and extract key information: {{$json.chunkText}}"
// Step 2: Use Function node to combine summaries
const summaries = items.map(item => item.json.openAiResponse);
const combinedSummary = summaries.join('\n\n');
// Step 3: Feed the combined summary to another LLM call for final processing
return [{ json: { combinedSummary } }];
For a more structured approach with intermediate summarization:
// Function node to implement a map-reduce pattern
// This assumes previous nodes have created chunks and processed them
// Step 1: Map - Process each chunk (done in previous OpenAI node)
// items now contain processed chunks
// Step 2: Reduce - Combine processed chunks in batches
const processedChunks = items.map(item => item.json.chunkAnalysis);
const batchSize = 3; // Number of summaries to combine at once
const batches = [];
for (let i = 0; i < processedChunks.length; i += batchSize) {
batches.push(processedChunks.slice(i, i + batchSize).join('\n\n'));
}
// Return batches for intermediate summarization
return batches.map(batch => ({
json: { batchText: batch }
}));
// Step 3: Process each batch with a summarization prompt (in next OpenAI node)
// Step 4: Final combination and processing (in subsequent nodes)
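As a rough sketch of that final combination step, the Function node below merges the intermediate batch summaries into a single input for the last LLM call; the batchSummary field name is an assumption about how the preceding OpenAI node labels its output.
// Function node: merge intermediate batch summaries for the final LLM call
// Assumes each incoming item carries its summary in `batchSummary` - adjust to your node's output field.
const batchSummaries = items.map(item => item.json.batchSummary || '');

return [{
  json: {
    finalInput: batchSummaries.join('\n\n'),
    batchCount: batchSummaries.length
  }
}];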
Step 5: Implementing Efficient Prompt Engineering
Optimize your prompts to reduce token usage:
// Instead of verbose prompts like:
const inefficientPrompt = `
Please analyze the following text in great detail. Consider all possible interpretations,
explore multiple perspectives, and provide an extensive analysis covering all aspects of
the content. Be thorough and leave no stone unturned: ${text}
`;
// Use concise, focused prompts:
const efficientPrompt = `Analyze concisely:\n${text}\nExtract: key points, entities, sentiment.`;
// In n8n, this would be set in the OpenAI node's "Prompt" field
For templated efficient prompts:
// Function node to generate efficient prompts
function createPrompt(text, task, options = {}) {
  // options.language and options.entities customize the translate/extract templates
  const promptTemplates = {
    summarize: `Summarize briefly:\n${text}`,
    analyze: `Analyze:\n${text}\nExtract: main points, sentiment.`,
    translate: `Translate to ${options.language || 'English'}:\n${text}`,
    extract: `Extract ${options.entities || 'key entities'} from:\n${text}`
  };
  return promptTemplates[task] || `Process:\n${text}`;
}
const inputText = items[0].json.text;
const task = items[0].json.task || 'summarize';
return [{
json: {
prompt: createPrompt(inputText, task)
}
}];
Step 6: Implementing Data Filtering and Pre-processing
Filter irrelevant data before sending it to the LLM:
// Function node to filter and preprocess data
function preprocessText(text) {
// Remove boilerplate content
let processed = text.replace(/Disclaimer:.+?(?=\n\n)/gs, '');
// Remove redundant whitespace
processed = processed.replace(/\s+/g, ' ');
// Remove irrelevant sections
processed = processed.replace(/References:[\s\S]+$/, '');
return processed.trim();
}
const inputText = items[0].json.text;
const processedText = preprocessText(inputText);
return [{
json: {
originalLength: inputText.length,
processedLength: processedText.length,
processedText
}
}];
For more advanced filtering with keywords:
// Function node for keyword-based relevance filtering
function filterByRelevance(text, keywords) {
const paragraphs = text.split('\n\n');
const relevantParagraphs = paragraphs.filter(para => {
// Check if paragraph contains any of the keywords
return keywords.some(keyword =>
para.toLowerCase().includes(keyword.toLowerCase())
);
});
return relevantParagraphs.join('\n\n');
}
const inputText = items[0].json.text;
const keywords = items[0].json.keywords || ['important', 'critical', 'key', 'main'];
const filteredText = filterByRelevance(inputText, keywords);
return [{
json: {
filteredText,
reductionPercentage: Math.round((1 - filteredText.length / inputText.length) * 100)
}
}];
Step 7: Implementing Stateful Processing with n8n
Use n8n's capabilities to maintain state across multiple LLM calls:
// Function node to implement stateful processing
// This example shows how to process a document in chunks while maintaining context
// Initialize or retrieve state (workflow static data persists across executions
// of active workflows; note it is not saved during manual test runs)
const staticData = $getWorkflowStaticData('global');
const workflowStateName = 'documentProcessingState';
let state = staticData[workflowStateName] || {
  processedChunks: 0,
  currentSummary: '',
  remainingText: items[0].json.text
};
// Process next chunk
const chunkSize = 3000; // Characters, not tokens (approximate)
const currentChunk = state.remainingText.substring(0, chunkSize);
const remainingText = state.remainingText.substring(chunkSize);
// Update state for next iteration
state = {
processedChunks: state.processedChunks + 1,
currentSummary: state.currentSummary, // Will be updated after LLM processing
remainingText
};
// Save state for the next iteration
staticData[workflowStateName] = state;
// Return current chunk for processing
return [{
json: {
chunk: currentChunk,
chunkNumber: state.processedChunks,
hasMoreChunks: remainingText.length > 0,
previousSummary: state.currentSummary
}
}];
// Note: After LLM processing, update the summary in another Function node
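A minimal sketch of that follow-up Function node might look like this, assuming the chunk summary arrives in an openAiResponse field (adjust to whatever your OpenAI node actually outputs):
// Function node (after the OpenAI node): fold the latest chunk summary into the stored state
const staticData = $getWorkflowStaticData('global');
const state = staticData.documentProcessingState || { currentSummary: '', remainingText: '' };

const latestSummary = items[0].json.openAiResponse || '';
state.currentSummary = state.currentSummary
  ? `${state.currentSummary}\n\n${latestSummary}`
  : latestSummary;

staticData.documentProcessingState = state;

return [{
  json: {
    currentSummary: state.currentSummary,
    hasMoreChunks: (state.remainingText || '').length > 0
  }
}];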
Step 8: Implementing Context Windows for Long-Running Conversations
Manage conversation history intelligently to stay within token limits:
// Function node to maintain a sliding context window
function manageConversationHistory(newMessage, history = [], maxTokens = 3000) {
// Add new message to history
const updatedHistory = [...history, newMessage];
// Calculate approximate token count (4 chars ≈ 1 token)
let tokenCount = updatedHistory.reduce((count, msg) =>
count + Math.ceil(JSON.stringify(msg).length / 4), 0);
// Remove oldest messages until under token limit
while (tokenCount > maxTokens && updatedHistory.length > 1) {
updatedHistory.shift(); // Remove oldest message
tokenCount = updatedHistory.reduce((count, msg) =>
count + Math.ceil(JSON.stringify(msg).length / 4), 0);
}
return updatedHistory;
}
// Get current conversation state
const currentMessage = {
role: "user",
content: items[0].json.userMessage
};
const conversationHistory = items[0].json.conversationHistory || [];
const updatedHistory = manageConversationHistory(currentMessage, conversationHistory);
return [{
json: {
conversationHistory: updatedHistory,
messagesForLLM: updatedHistory,
approximateTokens: Math.ceil(JSON.stringify(updatedHistory).length / 4)
}
}];
For even more efficient context management:
// Advanced context management with summarization
function manageConversationContext(newMessage, history = [], maxContextTokens = 3000) {
// Add new message
let context = [...history, newMessage];
// Calculate tokens (approximation)
const getTokenCount = text => Math.ceil(JSON.stringify(text).length / 4);
let totalTokens = getTokenCount(context);
// If within limit, return as is
if (totalTokens <= maxContextTokens) {
return {
context,
needsSummarization: false
};
}
// If exceeding limit, create a summary of older messages
const recentMessages = context.slice(-3); // Keep most recent messages intact
const olderMessages = context.slice(0, -3);
return {
context: [
{
role: "system",
content: `Previous conversation summary: The conversation discussed ${olderMessages.map(m => m.content).join(', ')}`
},
...recentMessages
],
needsSummarization: true,
originalHistory: context
};
}
const userMessage = {
role: "user",
content: items[0].json.message
};
const history = items[0].json.history || [];
const contextResult = manageConversationContext(userMessage, history);
return [{
json: {
...contextResult,
forLLM: contextResult.context
}
}];
Step 9: Using Streaming for Progressive Processing
Implement streaming to process data as it becomes available:
// This would be implemented across multiple nodes
// Function node to prepare for streaming
function prepareStreamingProcess(text, chunkSize = 1000) {
// Split text into manageable chunks
const chunks = [];
for (let i = 0; i < text.length; i += chunkSize) {
chunks.push(text.substring(i, i + chunkSize));
}
return {
totalChunks: chunks.length,
chunks: chunks,
processedChunks: 0,
results: []
};
}
const inputText = items[0].json.text;
const streamingData = prepareStreamingProcess(inputText);
// Store in workflow static data for state management across iterations
const staticData = $getWorkflowStaticData('global');
staticData.streamingState = streamingData;
// Return first chunk for processing
return [{
json: {
currentChunk: streamingData.chunks[0],
chunkNumber: 1,
totalChunks: streamingData.totalChunks
}
}];
// Note: Subsequent nodes would process each chunk and update the state
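One way those subsequent nodes could work is sketched below: after each OpenAI call, a Function node records the result and hands out the next chunk, and an IF node checking done decides whether to loop back for another pass. The openAiResponse field name is an assumption about the preceding node's output.
// Function node (after the OpenAI node in the loop): record the result and advance to the next chunk
const staticData = $getWorkflowStaticData('global');
const state = staticData.streamingState;

state.results.push(items[0].json.openAiResponse || '');
state.processedChunks += 1;

const done = state.processedChunks >= state.totalChunks;
staticData.streamingState = state;

return [{
  json: {
    done,
    currentChunk: done ? null : state.chunks[state.processedChunks],
    chunkNumber: state.processedChunks + 1,
    totalChunks: state.totalChunks,
    resultsSoFar: state.results
  }
}];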
Step 10: Using Model Switching for Efficiency
Switch between different models based on the complexity and token requirements:
// Function node to determine optimal model
function selectOptimalModel(text, task) {
// Estimate token count (4 chars ≈ 1 token)
const estimatedTokens = Math.ceil(text.length / 4);
// Define task complexity levels
const taskComplexity = {
summarize: 'low',
translate: 'low',
analyze: 'medium',
create: 'high',
research: 'high'
};
const complexity = taskComplexity[task] || 'medium';
// Select model based on estimated tokens and complexity
if (estimatedTokens < 2000 && complexity === 'low') {
return 'gpt-3.5-turbo'; // Faster, cheaper, smaller context
} else if (estimatedTokens < 6000 && complexity !== 'high') {
return 'gpt-3.5-turbo-16k'; // Medium context
} else {
return 'gpt-4'; // Largest context, most capable
}
}
const inputText = items[0].json.text;
const task = items[0].json.task || 'summarize';
const selectedModel = selectOptimalModel(inputText, task);
return [{
json: {
text: inputText,
task,
estimatedTokens: Math.ceil(inputText.length / 4),
selectedModel,
modelParameters: {
temperature: task === 'create' ? 0.7 : 0.2,
maxTokens: task === 'summarize' ? 300 : 1000
}
}
}];
Step 11: Implementing Caching for Repeated Queries
Cache LLM responses to avoid redundant API calls:
// Function node to implement caching
function generateCacheKey(prompt, model) {
// Create a deterministic hash of the prompt and model
const str = `${prompt}|${model}`;
let hash = 0;
for (let i = 0; i < str.length; i++) {
const char = str.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash = hash & hash; // Convert to 32-bit integer
}
return hash.toString();
}
// Get cache from workflow static data or initialize
const staticData = $getWorkflowStaticData('global');
const responseCache = staticData.llmResponseCache || {};
const prompt = items[0].json.prompt;
const model = items[0].json.model || 'gpt-3.5-turbo';
const cacheKey = generateCacheKey(prompt, model);
// Check if we have a cached response
if (responseCache[cacheKey]) {
return [{
json: {
response: responseCache[cacheKey].response,
fromCache: true,
cacheKey,
cachedAt: responseCache[cacheKey].timestamp
}
}];
}
// No cache hit, prepare for LLM call
return [{
json: {
prompt,
model,
cacheKey,
fromCache: false
}
}];
// Note: After the LLM response, update the cache in a follow-up
// Function node (a sketch of that node follows below).
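A minimal sketch of that cache-update Function node, assuming the LLM output is passed through as llmResponse and the key as cacheKey:
// Function node (after the OpenAI node): store the fresh response in the cache
const staticData = $getWorkflowStaticData('global');
const responseCache = staticData.llmResponseCache || {};

responseCache[items[0].json.cacheKey] = {
  response: items[0].json.llmResponse,
  timestamp: Date.now()
};

staticData.llmResponseCache = responseCache;

return [{
  json: {
    response: items[0].json.llmResponse,
    fromCache: false,
    cached: true
  }
}];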
Step 12: Creating a Complete n8n Workflow
Let's build a complete n8n workflow that implements these techniques:
// This represents the sequence of nodes in an n8n workflow
// Note: This is pseudocode showing the flow, not actual code to paste
// 1. Start Node: HTTP Request or Manual Trigger
// Receives the initial document or text
// 2. Function Node: "Prepare Processing"
function prepareProcessing(items) {
const inputText = items[0].json.text;
const task = items[0].json.task || 'analyze';
// Estimate tokens
const estimatedTokens = Math.ceil(inputText.length / 4);
// Determine processing approach
const needsChunking = estimatedTokens > 3000;
if (needsChunking) {
// Split into chunks with 200 token overlap
const chunkSize = 3000;
const overlap = 200;
const chunks = [];
for (let i = 0; i < inputText.length; i += (chunkSize - overlap) * 4) {
chunks.push(inputText.substring(i, i + chunkSize * 4));
}
return chunks.map((chunk, index) => ({
json: {
text: chunk,
chunkIndex: index,
totalChunks: chunks.length,
task
}
}));
}
// No chunking needed
return [{
json: {
text: inputText,
task,
processingType: 'direct'
}
}];
}
// 3. Switch Node: Based on processingType
// If "direct" -> Single LLM Call Node
// If chunked -> Loop Through Chunks
// 4A. For Direct Processing:
// OpenAI Node: Process the entire text
// Prompt: Generate a {{$json.task}} of the following text: {{$json.text}}
// 4B. For Chunked Processing:
// Loop Start Node: Loop through chunks
// OpenAI Node: Process each chunk
// Prompt: {{$json.task}} the following text (part {{$json.chunkIndex + 1}} of {{$json.totalChunks}}): {{$json.text}}
// Loop End Node: Collect all processed chunks
// 5. Function Node: "Combine Results" (for chunked processing)
function combineResults(items) {
// Sort chunks by index
items.sort((a, b) => a.json.chunkIndex - b.json.chunkIndex);
// Combine all responses
const combinedText = items.map(item => item.json.openAiResponse).join('\n\n');
return [{
json: {
combinedText,
processingType: 'combined_chunks'
}
}];
}
// 6. OpenAI Node: "Final Processing" (for chunked processing)
// Prompt: Synthesize these {{$json.totalChunks}} summaries into a coherent {{$json.task}}: {{$json.combinedText}}
// 7. Function Node: "Prepare Output"
function prepareOutput(items) {
const result = items[0].json;
return [{
json: {
originalTextLength: items[0].json.text ? items[0].json.text.length : 'N/A',
processingType: result.processingType,
result: result.openAiResponse || result.combinedText,
taskType: result.task
}
}];
}
// 8. End Node: Return final result
Step 13: Monitoring and Debugging Token Usage
Implement monitoring to track token usage:
// Function node to estimate and log token usage
function trackTokenUsage(items) {
// Simple token estimation (4 chars ≈ 1 token)
function estimateTokens(text) {
return Math.ceil((text || '').length / 4);
}
const promptText = items[0].json.prompt || '';
const responseText = items[0].json.response || '';
const promptTokens = estimateTokens(promptText);
const responseTokens = estimateTokens(responseText);
const totalTokens = promptTokens + responseTokens;
// Get existing usage log from workflow static data or initialize
const staticData = $getWorkflowStaticData('global');
const usageLog = staticData.tokenUsageLog || [];
// Add current usage
usageLog.push({
timestamp: new Date().toISOString(),
promptTokens,
responseTokens,
totalTokens,
model: items[0].json.model || 'unknown'
});
// Save updated log
staticData.tokenUsageLog = usageLog;
// Calculate cumulative usage
const totalUsage = usageLog.reduce((sum, entry) => sum + entry.totalTokens, 0);
return [{
json: {
...items[0].json,
tokenUsage: {
prompt: promptTokens,
response: responseTokens,
total: totalTokens
},
cumulativeTokenUsage: totalUsage,
tokenUsageHistory: usageLog
}
}];
}

return trackTokenUsage(items);
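The character-based estimate above is only approximate. When the raw API response is available (for example, when calling OpenAI through an HTTP Request node), the response body's usage object reports exact counts, so you can prefer it over the heuristic. The sketch below assumes that full response JSON has been passed through as a hypothetical apiResponse field.
// Function node: prefer exact token counts from the API's `usage` object when present
const apiResponse = items[0].json.apiResponse; // hypothetical pass-through of the raw chat-completion response

const exactUsage = apiResponse && apiResponse.usage
  ? {
      prompt: apiResponse.usage.prompt_tokens,
      response: apiResponse.usage.completion_tokens,
      total: apiResponse.usage.total_tokens
    }
  : null; // fall back to the heuristic estimate above when no usage data is available

return [{
  json: {
    ...items[0].json,
    exactTokenUsage: exactUsage
  }
}];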
Step 14: Handling Errors and Fallback Strategies
Implement error handling for token limit issues:
// Function node for error handling and fallback
function handleLLMErrors(items) {
// Check if there was an error
const hasError = items[0].json.error;
if (!hasError) {
// No error, pass through the data
return items;
}
const error = items[0].json.error;
const errorMessage = typeof error === 'string' ? error : (error.message || JSON.stringify(error));
const isTokenLimitError = errorMessage.includes('maximum context length') ||
errorMessage.includes('token limit') ||
errorMessage.includes('context window');
if (isTokenLimitError) {
// Get the original text
const originalText = items[0].json.text;
// Implement fallback strategy - aggressive summarization first
if (originalText && originalText.length > 1000) {
// Extractive summarization as fallback
const sentences = originalText.match(/[^.!?]+[.!?]+/g) || [];
const reducedText = sentences
.filter((_, i) => i % 3 === 0) // Take every third sentence
.join(' ');
return [{
json: {
...items[0].json,
text: reducedText,
originalText,
fallbackStrategy: 'extractive_summarization',
error: null
}
}];
}
// If text is already small, try a smaller model
return [{
json: {
...items[0].json,
model: 'gpt-3.5-turbo', // Fallback to smaller model
fallbackStrategy: 'model_downgrade',
error: null
}
}];
}
// For other types of errors, just pass through
return items;
}

return handleLLMErrors(items);
Step 15: Optimizing for Cost Efficiency
Balance token usage with cost considerations:
// Function node to optimize for cost efficiency
function optimizeForCost(items) {
const text = items[0].json.text;
const task = items[0].json.task;
// Model pricing (approximate cost per 1K tokens)
const modelCosts = {
'gpt-3.5-turbo': 0.002,
'gpt-3.5-turbo-16k': 0.004,
'gpt-4': 0.06,
'gpt-4-32k': 0.12
};
// Estimate tokens
const estimatedTokens = Math.ceil(text.length / 4);
// Simple task complexity estimation
const taskComplexity = {
'summarize': 1,
'extract': 1,
'translate': 1,
'analyze': 2,
'create': 3,
'research': 3
};
const complexity = taskComplexity[task] || 2;
// Decision matrix
let selectedModel = 'gpt-3.5-turbo'; // Default
if (complexity === 1) {
// Simple tasks
if (estimatedTokens > 8000) {
selectedModel = 'gpt-3.5-turbo-16k';
}
} else if (complexity === 2) {
// Medium complexity
if (estimatedTokens > 8000) {
selectedModel = 'gpt-3.5-turbo-16k';
} else if (estimatedTokens < 2000 && task === 'analyze') {
selectedModel = 'gpt-4'; // Use better model for analysis if affordable
}
} else {
// High complexity
if (estimatedTokens > 8000) {
selectedModel = 'gpt-4-32k';
} else {
selectedModel = 'gpt-4';
}
}
// Estimate cost
const estimatedCost = (estimatedTokens / 1000) * modelCosts[selectedModel];
return [{
json: {
...items[0].json,
selectedModel,
estimatedTokens,
estimatedCost: `$${estimatedCost.toFixed(4)}`,
taskComplexity: complexity
}
}];
}

return optimizeForCost(items);
Conclusion
Managing token limits when chaining LLM calls in n8n requires a combination of techniques including text chunking, progressive summarization, efficient prompt engineering, data filtering, and context management. By implementing these strategies, you can create robust workflows that handle large volumes of text while staying within token limits.
Remember that different approaches may be more suitable for different use cases. For document processing, chunking and summarization work well. For conversational applications, context window management is crucial. Always monitor token usage to optimize both performance and costs.
By applying these techniques and best practices, you can build sophisticated n8n workflows that make the most of LLM capabilities while effectively managing their constraints.