To avoid exceeding token limits when chaining LLM calls in n8n, you need to implement strategies like breaking text into smaller chunks, using summarization techniques, filtering data before processing, and leveraging efficient prompt engineering. These approaches help manage token consumption while maintaining the effectiveness of your workflow.
A Comprehensive Guide to Managing Token Limits When Chaining LLM Calls in n8n
When working with Large Language Models (LLMs) in n8n workflows, you may encounter token limit constraints, especially when chaining multiple LLM calls together. This comprehensive guide provides detailed strategies to effectively manage these limitations while maintaining the functionality of your workflows.
Step 1: Understanding Token Limits in LLMs
Before diving into solutions, it's essential to understand what tokens are and why they matter. Tokens are the sub-word units an LLM reads and writes; in English, one token corresponds to roughly four characters or about three-quarters of a word. Every model has a fixed context window that must hold both your prompt and the model's response, so when you chain calls and feed one output into the next prompt, token usage compounds quickly and can push a request over the limit.
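As a quick illustration, a Function node like the sketch below estimates token counts with the same 4-characters-per-token heuristic used throughout this guide; the 8,192-token limit is just an example value, not a recommendation for any particular model.
// Function node: rough token estimate for the incoming text
// The 8,192-token limit below is an example value - check your model's actual context window.
const text = items[0].json.text || '';
const estimatedTokens = Math.ceil(text.length / 4); // ~4 characters per token in English (rough heuristic)
const modelContextLimit = 8192; // example limit, adjust to your model

return [{
  json: {
    estimatedTokens,
    modelContextLimit,
    fitsInContext: estimatedTokens < modelContextLimit
  }
}];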
Step 2: Setting Up the n8n Environment for LLM Integration
Before implementing token management strategies, ensure your n8n environment is properly configured:
For OpenAI integration, add your API key in the n8n credentials section:
// This is done in the n8n UI, not in code
// Navigate to Settings > Credentials > Add new credential
// Select "OpenAI API" and input your API key
Step 3: Implementing Text Chunking for Large Documents
When dealing with large text inputs, breaking them into manageable chunks is essential:
// Using Function node to split text into chunks
const inputText = items[0].json.text;
const maxChunkSize = 4000; // Characters (roughly 1,000 tokens at ~4 characters per token)
const overlap = 200; // Overlap between chunks for context
// Simple character-based chunking (approximate)
function splitIntoChunks(text, maxSize, overlap) {
  const chunks = [];
  let startPos = 0;
  while (startPos < text.length) {
    const endPos = Math.min(startPos + maxSize, text.length);
    chunks.push(text.substring(startPos, endPos));
    if (endPos >= text.length) break; // last chunk reached - avoids looping forever on the tail
    startPos = endPos - overlap; // step back so consecutive chunks overlap for context
  }
  return chunks;
}
const textChunks = splitIntoChunks(inputText, maxChunkSize, overlap);
// Return chunks for processing in subsequent nodes
return textChunks.map(chunk => ({
json: {
chunkText: chunk
}
}));
For more advanced chunking based on semantic meaning:
// Using Function node to split text by paragraphs
function splitByParagraphs(text, maxTokens) {
const paragraphs = text.split('\n\n');
const chunks = [];
let currentChunk = '';
for (const paragraph of paragraphs) {
// Rough approximation of tokens (4 chars ≈ 1 token)
const paragraphTokens = paragraph.length / 4;
const currentChunkTokens = currentChunk.length / 4;
if (currentChunkTokens + paragraphTokens > maxTokens && currentChunk !== '') {
chunks.push(currentChunk);
currentChunk = paragraph;
} else {
currentChunk = currentChunk ? `${currentChunk}\n\n${paragraph}` : paragraph;
}
}
if (currentChunk) {
chunks.push(currentChunk);
}
return chunks;
}
const inputText = items[0].json.text;
const chunks = splitByParagraphs(inputText, 3800); // Conservative token limit
return chunks.map(chunk => ({
json: { chunkText: chunk }
}));
Step 4: Implementing Progressive Summarization
Instead of sending the entire output of one LLM call to the next, implement progressive summarization:
// Step 1: Process chunks individually with LLM
// This happens in OpenAI node configured to process each chunk
// The prompt for this node might be:
// "Analyze the following text and extract key information: {{$json.chunkText}}"
// Step 2: Use Function node to combine summaries
const summaries = items.map(item => item.json.openAiResponse);
const combinedSummary = summaries.join('\n\n');
// Step 3: Feed the combined summary to another LLM call for final processing
return [{ json: { combinedSummary } }];
For a more structured approach with intermediate summarization:
// Function node to implement a map-reduce pattern
// This assumes previous nodes have created chunks and processed them
// Step 1: Map - Process each chunk (done in previous OpenAI node)
// items now contain processed chunks
// Step 2: Reduce - Combine processed chunks in batches
const processedChunks = items.map(item => item.json.chunkAnalysis);
const batchSize = 3; // Number of summaries to combine at once
const batches = [];
for (let i = 0; i < processedChunks.length; i += batchSize) {
batches.push(processedChunks.slice(i, i + batchSize).join('\n\n'));
}
// Return batches for intermediate summarization
return batches.map(batch => ({
json: { batchText: batch }
}));
// Step 3: Process each batch with a summarization prompt (in next OpenAI node)
// Step 4: Final combination and processing (in subsequent nodes)
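As a rough sketch of that final combination step, the Function node below merges the intermediate batch summaries into a single input for the last LLM call; the batchSummary field name is an assumption about how the preceding OpenAI node labels its output.
// Function node: merge intermediate batch summaries for the final LLM call
// Assumes each incoming item carries its summary in `batchSummary` - adjust to your node's output field.
const batchSummaries = items.map(item => item.json.batchSummary || '');

return [{
  json: {
    finalInput: batchSummaries.join('\n\n'),
    batchCount: batchSummaries.length
  }
}];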
Step 5: Implementing Efficient Prompt Engineering
Optimize your prompts to reduce token usage:
// Instead of verbose prompts like:
const inefficientPrompt = `
Please analyze the following text in great detail. Consider all possible interpretations,
explore multiple perspectives, and provide an extensive analysis covering all aspects of
the content. Be thorough and leave no stone unturned: ${text}
`;
// Use concise, focused prompts:
const efficientPrompt = `Analyze concisely:\n${text}\nExtract: key points, entities, sentiment.`;
// In n8n, this would be set in the OpenAI node's "Prompt" field
For templated efficient prompts:
// Function node to generate efficient prompts
function createPrompt(text, task, options = {}) {
  // options.language and options.entities customize the translate/extract templates
  const promptTemplates = {
    summarize: `Summarize briefly:\n${text}`,
    analyze: `Analyze:\n${text}\nExtract: main points, sentiment.`,
    translate: `Translate to ${options.language || 'English'}:\n${text}`,
    extract: `Extract ${options.entities || 'key entities'} from:\n${text}`
  };
  return promptTemplates[task] || `Process:\n${text}`;
}
const inputText = items[0].json.text;
const task = items[0].json.task || 'summarize';
return [{
json: {
prompt: createPrompt(inputText, task)
}
}];
Step 6: Implementing Data Filtering and Pre-processing
Filter irrelevant data before sending it to the LLM:
// Function node to filter and preprocess data
function preprocessText(text) {
// Remove boilerplate content
let processed = text.replace(/Disclaimer:.+?(?=\n\n)/gs, '');
// Remove redundant whitespace
processed = processed.replace(/\s+/g, ' ');
// Remove irrelevant sections
processed = processed.replace(/References:[\s\S]+$/, '');
return processed.trim();
}
const inputText = items[0].json.text;
const processedText = preprocessText(inputText);
return [{
json: {
originalLength: inputText.length,
processedLength: processedText.length,
processedText
}
}];
For more advanced filtering with keywords:
// Function node for keyword-based relevance filtering
function filterByRelevance(text, keywords) {
const paragraphs = text.split('\n\n');
const relevantParagraphs = paragraphs.filter(para => {
// Check if paragraph contains any of the keywords
return keywords.some(keyword =>
para.toLowerCase().includes(keyword.toLowerCase())
);
});
return relevantParagraphs.join('\n\n');
}
const inputText = items[0].json.text;
const keywords = items[0].json.keywords || ['important', 'critical', 'key', 'main'];
const filteredText = filterByRelevance(inputText, keywords);
return [{
json: {
filteredText,
reductionPercentage: Math.round((1 - filteredText.length / inputText.length) * 100)
}
}];
Step 7: Implementing Stateful Processing with n8n
Use n8n's capabilities to maintain state across multiple LLM calls:
// Function node to implement stateful processing
// This example shows how to process a document in chunks while maintaining context
// Initialize or retrieve state (workflow static data persists across executions
// of active workflows; note it is not saved during manual test runs)
const staticData = $getWorkflowStaticData('global');
const workflowStateName = 'documentProcessingState';
let state = staticData[workflowStateName] || {
  processedChunks: 0,
  currentSummary: '',
  remainingText: items[0].json.text
};
// Process next chunk
const chunkSize = 3000; // Characters, not tokens (approximate)
const currentChunk = state.remainingText.substring(0, chunkSize);
const remainingText = state.remainingText.substring(chunkSize);
// Update state for next iteration
state = {
processedChunks: state.processedChunks + 1,
currentSummary: state.currentSummary, // Will be updated after LLM processing
remainingText
};
// Save state for the next iteration
staticData[workflowStateName] = state;
// Return current chunk for processing
return [{
json: {
chunk: currentChunk,
chunkNumber: state.processedChunks,
hasMoreChunks: remainingText.length > 0,
previousSummary: state.currentSummary
}
}];
// Note: After LLM processing, update the summary in another Function node
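A minimal sketch of that follow-up Function node might look like this, assuming the chunk summary arrives in an openAiResponse field (adjust to whatever your OpenAI node actually outputs):
// Function node (after the OpenAI node): fold the latest chunk summary into the stored state
const staticData = $getWorkflowStaticData('global');
const state = staticData.documentProcessingState || { currentSummary: '', remainingText: '' };

const latestSummary = items[0].json.openAiResponse || '';
state.currentSummary = state.currentSummary
  ? `${state.currentSummary}\n\n${latestSummary}`
  : latestSummary;

staticData.documentProcessingState = state;

return [{
  json: {
    currentSummary: state.currentSummary,
    hasMoreChunks: (state.remainingText || '').length > 0
  }
}];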
Step 8: Implementing Context Windows for Long-Running Conversations
Manage conversation history intelligently to stay within token limits:
// Function node to maintain a sliding context window
function manageConversationHistory(newMessage, history = [], maxTokens = 3000) {
// Add new message to history
const updatedHistory = [...history, newMessage];
// Calculate approximate token count (4 chars ≈ 1 token)
let tokenCount = updatedHistory.reduce((count, msg) =>
count + Math.ceil(JSON.stringify(msg).length / 4), 0);
// Remove oldest messages until under token limit
while (tokenCount > maxTokens && updatedHistory.length > 1) {
updatedHistory.shift(); // Remove oldest message
tokenCount = updatedHistory.reduce((count, msg) =>
count + Math.ceil(JSON.stringify(msg).length / 4), 0);
}
return updatedHistory;
}
// Get current conversation state
const currentMessage = {
role: "user",
content: items[0].json.userMessage
};
const conversationHistory = items[0].json.conversationHistory || [];
const updatedHistory = manageConversationHistory(currentMessage, conversationHistory);
return [{
json: {
conversationHistory: updatedHistory,
messagesForLLM: updatedHistory,
approximateTokens: Math.ceil(JSON.stringify(updatedHistory).length / 4)
}
}];
For even more efficient context management:
// Advanced context management with summarization
function manageConversationContext(newMessage, history = [], maxContextTokens = 3000) {
// Add new message
let context = [...history, newMessage];
// Calculate tokens (approximation)
const getTokenCount = text => Math.ceil(JSON.stringify(text).length / 4);
let totalTokens = getTokenCount(context);
// If within limit, return as is
if (totalTokens <= maxContextTokens) {
return {
context,
needsSummarization: false
};
}
// If exceeding limit, create a summary of older messages
const recentMessages = context.slice(-3); // Keep most recent messages intact
const olderMessages = context.slice(0, -3);
return {
context: [
{
role: "system",
content: `Previous conversation summary: The conversation discussed ${olderMessages.map(m => m.content).join(', ')}`
},
...recentMessages
],
needsSummarization: true,
originalHistory: context
};
}
const userMessage = {
role: "user",
content: items[0].json.message
};
const history = items[0].json.history || [];
const contextResult = manageConversationContext(userMessage, history);
return [{
json: {
...contextResult,
forLLM: contextResult.context
}
}];
Step 9: Using Streaming for Progressive Processing
Implement streaming to process data as it becomes available:
// This would be implemented across multiple nodes
// Function node to prepare for streaming
function prepareStreamingProcess(text, chunkSize = 1000) {
// Split text into manageable chunks
const chunks = [];
for (let i = 0; i < text.length; i += chunkSize) {
chunks.push(text.substring(i, i + chunkSize));
}
return {
totalChunks: chunks.length,
chunks: chunks,
processedChunks: 0,
results: []
};
}
const inputText = items[0].json.text;
const streamingData = prepareStreamingProcess(inputText);
// Store in workflow static data for state management across iterations
const staticData = $getWorkflowStaticData('global');
staticData.streamingState = streamingData;
// Return first chunk for processing
return [{
json: {
currentChunk: streamingData.chunks[0],
chunkNumber: 1,
totalChunks: streamingData.totalChunks
}
}];
// Note: Subsequent nodes would process each chunk and update the state
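One way those subsequent nodes could work is sketched below: after each OpenAI call, a Function node records the result and hands out the next chunk, and an IF node checking done decides whether to loop back for another pass. The openAiResponse field name is an assumption about the preceding node's output.
// Function node (after the OpenAI node in the loop): record the result and advance to the next chunk
const staticData = $getWorkflowStaticData('global');
const state = staticData.streamingState;

state.results.push(items[0].json.openAiResponse || '');
state.processedChunks += 1;

const done = state.processedChunks >= state.totalChunks;
staticData.streamingState = state;

return [{
  json: {
    done,
    currentChunk: done ? null : state.chunks[state.processedChunks],
    chunkNumber: state.processedChunks + 1,
    totalChunks: state.totalChunks,
    resultsSoFar: state.results
  }
}];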
Step 10: Using Model Switching for Efficiency
Switch between different models based on the complexity and token requirements:
// Function node to determine optimal model
function selectOptimalModel(text, task) {
// Estimate token count (4 chars ≈ 1 token)
const estimatedTokens = Math.ceil(text.length / 4);
// Define task complexity levels
const taskComplexity = {
summarize: 'low',
translate: 'low',
analyze: 'medium',
create: 'high',
research: 'high'
};
const complexity = taskComplexity[task] || 'medium';
// Select model based on estimated tokens and complexity
if (estimatedTokens < 2000 && complexity === 'low') {
return 'gpt-3.5-turbo'; // Faster, cheaper, smaller context
} else if (estimatedTokens < 6000 && complexity !== 'high') {
return 'gpt-3.5-turbo-16k'; // Medium context
} else {
return 'gpt-4'; // Largest context, most capable
}
}
const inputText = items[0].json.text;
const task = items[0].json.task || 'summarize';
const selectedModel = selectOptimalModel(inputText, task);
return [{
json: {
text: inputText,
task,
estimatedTokens: Math.ceil(inputText.length / 4),
selectedModel,
modelParameters: {
temperature: task === 'create' ? 0.7 : 0.2,
maxTokens: task === 'summarize' ? 300 : 1000
}
}
}];
Step 11: Implementing Caching for Repeated Queries
Cache LLM responses to avoid redundant API calls:
// Function node to implement caching
function generateCacheKey(prompt, model) {
// Create a deterministic hash of the prompt and model
const str = `${prompt}|${model}`;
let hash = 0;
for (let i = 0; i < str.length; i++) {
const char = str.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash = hash & hash; // Convert to 32-bit integer
}
return hash.toString();
}
// Get cache from workflow static data or initialize
const staticData = $getWorkflowStaticData('global');
const responseCache = staticData.llmResponseCache || {};
const prompt = items[0].json.prompt;
const model = items[0].json.model || 'gpt-3.5-turbo';
const cacheKey = generateCacheKey(prompt, model);
// Check if we have a cached response
if (responseCache[cacheKey]) {
return [{
json: {
response: responseCache[cacheKey].response,
fromCache: true,
cacheKey,
cachedAt: responseCache[cacheKey].timestamp
}
}];
}
// No cache hit, prepare for LLM call
return [{
json: {
prompt,
model,
cacheKey,
fromCache: false
}
}];
// Note: After the LLM response, update the cache in a follow-up
// Function node (a sketch of that node follows below).
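A minimal sketch of that cache-update Function node, assuming the LLM output is passed through as llmResponse and the key as cacheKey:
// Function node (after the OpenAI node): store the fresh response in the cache
const staticData = $getWorkflowStaticData('global');
const responseCache = staticData.llmResponseCache || {};

responseCache[items[0].json.cacheKey] = {
  response: items[0].json.llmResponse,
  timestamp: Date.now()
};

staticData.llmResponseCache = responseCache;

return [{
  json: {
    response: items[0].json.llmResponse,
    fromCache: false,
    cached: true
  }
}];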
Step 12: Creating a Complete n8n Workflow
Let's build a complete n8n workflow that implements these techniques:
// This represents the sequence of nodes in an n8n workflow
// Note: This is pseudocode showing the flow, not actual code to paste
// 1. Start Node: HTTP Request or Manual Trigger
// Receives the initial document or text
// 2. Function Node: "Prepare Processing"
function prepareProcessing(items) {
const inputText = items[0].json.text;
const task = items[0].json.task || 'analyze';
// Estimate tokens
const estimatedTokens = Math.ceil(inputText.length / 4);
// Determine processing approach
const needsChunking = estimatedTokens > 3000;
if (needsChunking) {
// Split into chunks with 200 token overlap
const chunkSize = 3000;
const overlap = 200;
const chunks = [];
for (let i = 0; i < inputText.length; i += (chunkSize - overlap) * 4) {
chunks.push(inputText.substring(i, i + chunkSize * 4));
}
return chunks.map((chunk, index) => ({
json: {
text: chunk,
chunkIndex: index,
totalChunks: chunks.length,
task
}
}));
}
// No chunking needed
return [{
json: {
text: inputText,
task,
processingType: 'direct'
}
}];
}
// 3. Switch Node: Based on processingType
// If "direct" -> Single LLM Call Node
// If chunked -> Loop Through Chunks
// 4A. For Direct Processing:
// OpenAI Node: Process the entire text
// Prompt: Generate a {{$json.task}} of the following text: {{$json.text}}
// 4B. For Chunked Processing:
// Loop Start Node: Loop through chunks
// OpenAI Node: Process each chunk
// Prompt: {{$json.task}} the following text (part {{$json.chunkIndex + 1}} of {{$json.totalChunks}}): {{$json.text}}
// Loop End Node: Collect all processed chunks
// 5. Function Node: "Combine Results" (for chunked processing)
function combineResults(items) {
// Sort chunks by index
items.sort((a, b) => a.json.chunkIndex - b.json.chunkIndex);
// Combine all responses
const combinedText = items.map(item => item.json.openAiResponse).join('\n\n');
return [{
json: {
combinedText,
processingType: 'combined_chunks'
}
}];
}
// 6. OpenAI Node: "Final Processing" (for chunked processing)
// Prompt: Synthesize these {{$json.totalChunks}} summaries into a coherent {{$json.task}}: {{$json.combinedText}}
// 7. Function Node: "Prepare Output"
function prepareOutput(items) {
const result = items[0].json;
return [{
json: {
originalTextLength: items[0].json.text ? items[0].json.text.length : 'N/A',
processingType: result.processingType,
result: result.openAiResponse || result.combinedText,
taskType: result.task
}
}];
}
// 8. End Node: Return final result
Step 13: Monitoring and Debugging Token Usage
Implement monitoring to track token usage:
// Function node to estimate and log token usage
function trackTokenUsage(items) {
// Simple token estimation (4 chars ≈ 1 token)
function estimateTokens(text) {
return Math.ceil((text || '').length / 4);
}
const promptText = items[0].json.prompt || '';
const responseText = items[0].json.response || '';
const promptTokens = estimateTokens(promptText);
const responseTokens = estimateTokens(responseText);
const totalTokens = promptTokens + responseTokens;
// Get existing usage log from workflow static data or initialize
const staticData = $getWorkflowStaticData('global');
const usageLog = staticData.tokenUsageLog || [];
// Add current usage
usageLog.push({
timestamp: new Date().toISOString(),
promptTokens,
responseTokens,
totalTokens,
model: items[0].json.model || 'unknown'
});
// Save updated log
staticData.tokenUsageLog = usageLog;
// Calculate cumulative usage
const totalUsage = usageLog.reduce((sum, entry) => sum + entry.totalTokens, 0);
return [{
json: {
...items[0].json,
tokenUsage: {
prompt: promptTokens,
response: responseTokens,
total: totalTokens
},
cumulativeTokenUsage: totalUsage,
tokenUsageHistory: usageLog
}
}];
}

return trackTokenUsage(items);
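The character-based estimate above is only approximate. When the raw API response is available (for example, when calling OpenAI through an HTTP Request node), the response body's usage object reports exact counts, so you can prefer it over the heuristic. The sketch below assumes that full response JSON has been passed through as a hypothetical apiResponse field.
// Function node: prefer exact token counts from the API's `usage` object when present
const apiResponse = items[0].json.apiResponse; // hypothetical pass-through of the raw chat-completion response

const exactUsage = apiResponse && apiResponse.usage
  ? {
      prompt: apiResponse.usage.prompt_tokens,
      response: apiResponse.usage.completion_tokens,
      total: apiResponse.usage.total_tokens
    }
  : null; // fall back to the heuristic estimate above when no usage data is available

return [{
  json: {
    ...items[0].json,
    exactTokenUsage: exactUsage
  }
}];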
Step 14: Handling Errors and Fallback Strategies
Implement error handling for token limit issues:
// Function node for error handling and fallback
function handleLLMErrors(items) {
// Check if there was an error
const hasError = items[0].json.error;
if (!hasError) {
// No error, pass through the data
return items;
}
const error = items[0].json.error;
const errorMessage = typeof error === 'string' ? error : (error.message || JSON.stringify(error));
const isTokenLimitError = errorMessage.includes('maximum context length') ||
errorMessage.includes('token limit') ||
errorMessage.includes('context window');
if (isTokenLimitError) {
// Get the original text
const originalText = items[0].json.text;
// Implement fallback strategy - aggressive summarization first
if (originalText && originalText.length > 1000) {
// Extractive summarization as fallback
const sentences = originalText.match(/[^.!?]+[.!?]+/g) || [];
const reducedText = sentences
.filter((_, i) => i % 3 === 0) // Take every third sentence
.join(' ');
return [{
json: {
...items[0].json,
text: reducedText,
originalText,
fallbackStrategy: 'extractive_summarization',
error: null
}
}];
}
// If text is already small, try a smaller model
return [{
json: {
...items[0].json,
model: 'gpt-3.5-turbo', // Fallback to smaller model
fallbackStrategy: 'model_downgrade',
error: null
}
}];
}
// For other types of errors, just pass through
return items;
}

return handleLLMErrors(items);
Step 15: Optimizing for Cost Efficiency
Balance token usage with cost considerations:
// Function node to optimize for cost efficiency
function optimizeForCost(items) {
const text = items[0].json.text;
const task = items[0].json.task;
// Model pricing (approximate cost per 1K tokens)
const modelCosts = {
'gpt-3.5-turbo': 0.002,
'gpt-3.5-turbo-16k': 0.004,
'gpt-4': 0.06,
'gpt-4-32k': 0.12
};
// Estimate tokens
const estimatedTokens = Math.ceil(text.length / 4);
// Simple task complexity estimation
const taskComplexity = {
'summarize': 1,
'extract': 1,
'translate': 1,
'analyze': 2,
'create': 3,
'research': 3
};
const complexity = taskComplexity[task] || 2;
// Decision matrix
let selectedModel = 'gpt-3.5-turbo'; // Default
if (complexity === 1) {
// Simple tasks
if (estimatedTokens > 8000) {
selectedModel = 'gpt-3.5-turbo-16k';
}
} else if (complexity === 2) {
// Medium complexity
if (estimatedTokens > 8000) {
selectedModel = 'gpt-3.5-turbo-16k';
} else if (estimatedTokens < 2000 && task === 'analyze') {
selectedModel = 'gpt-4'; // Use better model for analysis if affordable
}
} else {
// High complexity
if (estimatedTokens > 8000) {
selectedModel = 'gpt-4-32k';
} else {
selectedModel = 'gpt-4';
}
}
// Estimate cost
const estimatedCost = (estimatedTokens / 1000) * modelCosts[selectedModel];
return [{
json: {
...items[0].json,
selectedModel,
estimatedTokens,
estimatedCost: `$${estimatedCost.toFixed(4)}`,
taskComplexity: complexity
}
}];
}

return optimizeForCost(items);
Conclusion
Managing token limits when chaining LLM calls in n8n requires a combination of techniques including text chunking, progressive summarization, efficient prompt engineering, data filtering, and context management. By implementing these strategies, you can create robust workflows that handle large volumes of text while staying within token limits.
Remember that different approaches may be more suitable for different use cases. For document processing, chunking and summarization work well. For conversational applications, context window management is crucial. Always monitor token usage to optimize both performance and costs.
By applying these techniques and best practices, you can build sophisticated n8n workflows that make the most of LLM capabilities while effectively managing their constraints.