
How to Integrate Machine Learning for Voice Recognition in FlutterFlow

Advanced ML voice recognition in FlutterFlow goes beyond basic speech-to-text. Use a Custom Action to record audio, then route it through a Cloud Function to Azure Speaker Recognition for speaker identification or Hume AI for emotion detection. For on-device custom voice commands, deploy a TensorFlow Lite model via the tflite_flutter package in a Custom Widget for real-time inference without internet.

What you'll learn

  • How to record audio and send it to Azure Speaker Recognition for voice biometric authentication
  • How to detect speaker emotion using the Hume AI audio API from a Cloud Function
  • How to deploy a TensorFlow Lite model on-device for custom wake words and voice commands
  • How to integrate ElevenLabs for text-to-speech with a cloned or custom AI voice
Advanced · 9 min read · 2-4 hours · FlutterFlow Pro+ (Custom Code required) · March 2026 · RapidDev Engineering Team

ML voice recognition: speaker ID, emotion, and custom commands

Basic speech-to-text converts audio to text. Advanced ML voice recognition identifies WHO is speaking (speaker verification), HOW they feel (emotion detection), and recognizes CUSTOM commands (not just transcription). This tutorial covers four patterns: enrolling and verifying users by voice using Azure's Speaker Recognition API, detecting emotional state from audio using Hume AI, building an on-device voice command classifier with TensorFlow Lite via tflite_flutter, and generating custom-voice audio via ElevenLabs. All cloud API calls are routed through Cloud Functions to keep API keys off the client device.

Prerequisites

  • FlutterFlow Pro plan or higher (Custom Code panel required)
  • Firebase project connected to your FlutterFlow app
  • An Azure account with Cognitive Services Speaker Recognition resource created (free tier available)
  • Basic familiarity with Custom Actions and Cloud Functions in FlutterFlow

Step-by-step guide

1

Add the audio recording package and record a voice sample

Go to Custom Code → Pubspec Dependencies and add record: ^5.0.0. This package handles microphone permission, audio capture, and file output. Create a Custom Action named recordVoiceSample. Import the package and implement the following flow: check microphone permission with the recorder's hasPermission() method (which prompts the user if permission has not yet been granted), create a recorder instance, start recording to a temp file with AudioEncoder.wav, wait 5 seconds, stop recording, then return the file path as a string. In your FlutterFlow page, add a Record Voice button that calls this action and stores the returned file path in a Page State variable named voiceFilePath.

Expected result: Tapping the Record button captures 5 seconds of audio and stores the file path in Page State.

2

Set up Azure Speaker Recognition for voice enrollment and verification

In the Azure Portal, create a Speech service resource and copy the Subscription Key and Region. In FlutterFlow Secrets (Settings → Secrets), add AZURE_SPEECH_KEY, and set AZURE_SPEECH_KEY and AZURE_REGION as environment variables for your Cloud Functions. Create a Cloud Function named enrollSpeaker that accepts userId and audioBase64. Inside the function: POST to https://{region}.api.cognitive.microsoft.com/speaker/verification/v2.0/text-independent/profiles with the Ocp-Apim-Subscription-Key header to create a speaker profile, then POST the decoded audio to /profiles/{profileId}/enrollments. Save the profileId to Firestore at users/{userId}.speakerProfileId. Create a second Cloud Function named verifySpeaker that takes userId and audioBase64, fetches the profileId from Firestore, and calls /profiles/{profileId}/verify with the audio; it returns JSON with result (Accept/Reject) and score (0.0-1.0). In FlutterFlow API Manager, create API calls to both Cloud Functions.

functions/index.js
// Cloud Function: verifySpeaker
const functions = require('firebase-functions');
const admin = require('firebase-admin');
const axios = require('axios');

exports.verifySpeaker = functions.https.onRequest(async (req, res) => {
  const { userId, audioBase64 } = req.body;
  const azureKey = process.env.AZURE_SPEECH_KEY;
  const region = process.env.AZURE_REGION;

  // Get stored profile ID
  const userDoc = await admin.firestore().doc(`users/${userId}`).get();
  const profileId = userDoc.data()?.speakerProfileId;
  if (!profileId) {
    res.status(404).json({ error: 'No speaker profile enrolled for this user' });
    return;
  }

  // Convert base64 to binary buffer
  const audioBuffer = Buffer.from(audioBase64, 'base64');

  try {
    // Verify against enrolled voice
    const response = await axios.post(
      `https://${region}.api.cognitive.microsoft.com/speaker/verification/v2.0/text-independent/profiles/${profileId}/verify`,
      audioBuffer,
      {
        headers: {
          'Ocp-Apim-Subscription-Key': azureKey,
          'Content-Type': 'audio/wav'
        }
      }
    );

    res.json({
      result: response.data.result,
      score: response.data.score
    });
  } catch (err) {
    res.status(502).json({ error: `Azure verification failed: ${err.message}` });
  }
});

Expected result: Calling verifySpeaker with a recorded audio sample returns Accept or Reject with a confidence score.
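For symmetry with verifySpeaker above, here is a minimal sketch of the enrollSpeaker function described in this step. It assumes Node 18+ (for the global fetch); the helper names, the injected db handle (an admin.firestore() instance), and the en-US locale are illustrative choices, not requirements of the Azure API.

```javascript
// Build the Azure Speaker Recognition v2.0 endpoints (pure, easy to test).
function azureProfileUrl(region) {
  return `https://${region}.api.cognitive.microsoft.com` +
         `/speaker/verification/v2.0/text-independent/profiles`;
}

function azureEnrollUrl(region, profileId) {
  return `${azureProfileUrl(region)}/${profileId}/enrollments`;
}

// Sketch of the enrollment flow: create profile, upload audio, persist ID.
async function enrollSpeaker({ userId, audioBase64, region, azureKey, db }) {
  const auth = { 'Ocp-Apim-Subscription-Key': azureKey };

  // 1. Create a speaker profile (illustrative locale).
  const profileRes = await fetch(azureProfileUrl(region), {
    method: 'POST',
    headers: { ...auth, 'Content-Type': 'application/json' },
    body: JSON.stringify({ locale: 'en-US' }),
  });
  const { profileId } = await profileRes.json();

  // 2. Upload the enrollment audio as binary WAV.
  await fetch(azureEnrollUrl(region, profileId), {
    method: 'POST',
    headers: { ...auth, 'Content-Type': 'audio/wav' },
    body: Buffer.from(audioBase64, 'base64'),
  });

  // 3. Persist the profile ID for later verification.
  await db.doc(`users/${userId}`)
          .set({ speakerProfileId: profileId }, { merge: true });
  return profileId;
}
```

Wrap this body in a functions.https.onRequest handler, as in the verifySpeaker listing, to expose it to FlutterFlow's API Manager.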

3

Detect speaker emotion using Hume AI audio API

Create a Hume AI account at hume.ai and get your API key. Add HUME_API_KEY to FlutterFlow Secrets. Create a Cloud Function named detectEmotion that accepts audioBase64. The function POSTs to https://api.hume.ai/v0/batch/jobs with header X-Hume-Api-Key and a multipart form containing the audio file. Hume processes asynchronously — poll GET /v0/batch/jobs/{jobId}/predictions until status is COMPLETED. The response includes emotion scores for prosody (speech rhythm and tone): an array of {name, score} objects sorted by score descending. Common emotions returned: Joy, Sadness, Anger, Fear, Surprise, Disgust, Neutral. In FlutterFlow, display these as a Container with a Column of emotion bars: Text (emotion name) + LinearProgressIndicator (score 0.0-1.0).

Expected result: After recording, the app displays a ranked list of detected emotions with confidence percentages.
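Reducing the prosody payload to the bar-list data takes only a few lines. This sketch assumes the flat array of {name, score} objects described above; in the full Hume response this array is nested inside the predictions structure, so extract it first.

```javascript
// Pick the top-N emotions from a Hume-style prosody prediction and
// convert scores to percentages for the LinearProgressIndicator labels.
function topEmotions(emotions, n = 3) {
  return [...emotions]
    .sort((a, b) => b.score - a.score)   // highest score first
    .slice(0, n)
    .map((e) => ({ name: e.name, percent: Math.round(e.score * 100) }));
}

// Example payload in the {name, score} shape described above.
const sample = [
  { name: 'Joy', score: 0.72 },
  { name: 'Surprise', score: 0.18 },
  { name: 'Neutral', score: 0.45 },
];
// topEmotions(sample, 2) → [{Joy, 72}, {Neutral, 45}]
```

The returned list maps directly onto the Column of emotion bars: name into the Text widget, percent / 100 into the progress indicator.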

4

Deploy a TFLite custom voice command model on-device

Add tflite_flutter: ^0.10.4 to Pubspec Dependencies. Obtain or train a TFLite model for your custom commands (.tflite file) and place it in your project's assets folder by adding it in FlutterFlow → Custom Code → Assets. Create a Custom Widget named VoiceCommandListener. In the widget, load the model: Interpreter.fromAsset('assets/voice_commands.tflite'). Record short audio clips (1 second), convert to mel spectrogram float array (40 features × 98 time steps), run inference: interpreter.run(inputBuffer, outputBuffer). outputBuffer contains probabilities for each command class. Pick the argmax class if confidence > 0.85. Map class indices to command strings: {0: 'start', 1: 'stop', 2: 'help', 3: 'background'}. Trigger FlutterFlow actions via callbacks on recognized commands. For production apps, RapidDev has implemented this exact on-device voice command pipeline across multiple FlutterFlow projects requiring specialized audio preprocessing.

Expected result: The Custom Widget recognizes custom voice commands on-device with no internet connection required.
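The argmax-with-threshold rule from this step can be isolated as standalone logic (shown in JavaScript for readability; inside the Custom Widget the same few lines would be Dart operating on the interpreter's output buffer).

```javascript
// Class-index-to-command mapping from the step above.
const COMMANDS = { 0: 'start', 1: 'stop', 2: 'help', 3: 'background' };

// Map a probability vector to a command string, or null if no command
// clears the confidence threshold (or the winner is the background class).
function decodeCommand(probs, threshold = 0.85) {
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;   // argmax
  }
  if (probs[best] < threshold || COMMANDS[best] === 'background') return null;
  return COMMANDS[best];
}

// decodeCommand([0.02, 0.91, 0.04, 0.03]) → 'stop'
// decodeCommand([0.3, 0.3, 0.2, 0.2])     → null (below threshold)
```

Returning null for both low-confidence and background predictions keeps the widget from firing callbacks on ambient speech.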

5

Add ElevenLabs text-to-speech with a custom AI voice

Sign up at ElevenLabs.io and get your API key. Add ELEVENLABS_API_KEY to FlutterFlow Secrets. Create a Cloud Function named synthesizeSpeech that accepts text and voiceId. POST to https://api.elevenlabs.io/v1/text-to-speech/{voiceId} with header xi-api-key and body {text, model_id: 'eleven_multilingual_v2', voice_settings: {stability: 0.5, similarity_boost: 0.75}}. The response is binary MP3 audio. Upload the audio to Firebase Storage and return the download URL. In FlutterFlow, add an API Call to your synthesizeSpeech Cloud Function. Use a Custom Action to play the returned audio URL with the audioplayers package (add audioplayers: ^6.0.0 to Pubspec Dependencies).

Expected result: Your app generates and plays natural-sounding speech in a cloned or preset AI voice.
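Inside the synthesizeSpeech Cloud Function, the ElevenLabs request can be assembled as below. buildTtsRequest is an illustrative helper, and the voice_settings values are the ones suggested in this step.

```javascript
// Assemble the ElevenLabs text-to-speech request described above.
function buildTtsRequest(text, voiceId, apiKey) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    headers: {
      'xi-api-key': apiKey,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text,
      model_id: 'eleven_multilingual_v2',
      voice_settings: { stability: 0.5, similarity_boost: 0.75 },
    }),
  };
}

// In the Cloud Function (Node 18+), the response body is binary MP3:
//   const r = buildTtsRequest(text, voiceId, process.env.ELEVENLABS_API_KEY);
//   const res = await fetch(r.url, { method: 'POST', headers: r.headers, body: r.body });
//   const mp3 = Buffer.from(await res.arrayBuffer());
```

Upload the resulting buffer to Firebase Storage and return the download URL, as described in the step.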

Complete working example

voice_recognition_setup.dart
// Custom Action: recordAndVerifySpeaker
// Pubspec: record: ^5.0.0
import 'dart:convert';
import 'dart:io';
import 'package:record/record.dart';
import 'package:path_provider/path_provider.dart';

Future<Map<String, dynamic>> recordAndVerifySpeaker(String userId) async {
  final audioRecorder = AudioRecorder();

  // Check microphone permission
  final hasPermission = await audioRecorder.hasPermission();
  if (!hasPermission) {
    return {'error': 'Microphone permission denied'};
  }

  // Get temp directory for file
  final dir = await getTemporaryDirectory();
  final path = '${dir.path}/voice_sample.wav';

  // Start recording
  await audioRecorder.start(
    const RecordConfig(encoder: AudioEncoder.wav),
    path: path,
  );

  // Record for 5 seconds
  await Future.delayed(const Duration(seconds: 5));
  await audioRecorder.stop();

  // Read file and encode to base64
  final audioFile = File(path);
  final audioBytes = await audioFile.readAsBytes();
  final audioBase64 = base64Encode(audioBytes);

  // Return base64 for Cloud Function call
  return {
    'audioBase64': audioBase64,
    'userId': userId,
  };
}

Common mistakes

Mistake: Training a custom TFLite voice model on clean studio audio, then deploying to real-world environments.

Why it's a problem: The model has never seen background noise, so accuracy collapses the moment users speak in a cafe, car, or street.

How to avoid: Augment training data with background noise samples (cafes, streets, offices), normalize audio volume, and vary recording quality. Use TensorFlow's tf.data augmentation pipeline to artificially add noise, reverb, and pitch variation to a smaller clean dataset.

Mistake: Calling cloud ML voice APIs synchronously on the main thread of a Flutter widget.

Why it's a problem: Each API round trip takes seconds, and blocking the UI thread freezes the app and drops frames while the user waits.

How to avoid: Always use async/await in Custom Actions and show a CircularProgressIndicator or animated waveform widget while processing. Use Isolates for heavy on-device model inference to keep the UI thread free.

Mistake: Storing raw audio recordings in Firestore documents.

Why it's a problem: Firestore documents are capped at 1 MiB, and base64 audio bloats reads, writes, and bandwidth costs.

How to avoid: Always upload audio files to Firebase Storage. Store only the download URL string in Firestore. Delete temporary audio files from the device after processing.

Best practices

  • Always request microphone permission with a clear explanation dialog before recording — 'We need microphone access to verify your identity by voice'
  • Add a visual recording indicator (animated microphone icon, waveform) so users know when the app is capturing audio
  • Store speaker profile IDs in Firestore, not on the device — enables cross-device voice authentication from day one
  • Cache emotion detection results in Firestore to avoid re-processing identical recordings and reduce API costs
  • Set a minimum confidence threshold for voice verification (score > 0.7) before accepting authentication — balance security with user experience
  • Always delete temporary WAV files from device storage after uploading to Cloud Functions to avoid accumulating audio files on the user's phone
  • Test custom TFLite models on physical devices of varying hardware generations — older devices have slower inference and may not meet real-time requirements

Still stuck?

Copy one of these prompts to get a personalized, step-by-step explanation.

ChatGPT Prompt

I am building a FlutterFlow app that needs voice biometric authentication using Azure Speaker Recognition. Write me a Firebase Cloud Function in Node.js that: (1) accepts a POST request with userId and audioBase64 fields, (2) fetches the user's stored Azure speaker profileId from Firestore users/{userId}, (3) calls the Azure Speaker Recognition verification endpoint with the audio, (4) returns the verification result and score. Include proper error handling for missing profiles and Azure API errors.

FlutterFlow Prompt

Add a voice enrollment screen to my FlutterFlow app. The screen should have a Record Sample button that records 5 seconds of audio using the record package, a progress indicator during recording, and an Enroll Voice button that sends the recording to my enrollSpeaker Cloud Function. Show a success message with a green checkmark when enrollment completes. Store the result in App State isVoiceEnrolled.

Frequently asked questions

What is the difference between voice-to-text and voice recognition ML?

Voice-to-text (speech-to-text) converts spoken words to a text transcript — it identifies WHAT was said. ML voice recognition goes further: speaker identification determines WHO is speaking, emotion detection determines HOW the person feels, and custom command classification recognizes specific phrases independent of transcription. This tutorial covers the advanced ML layer, not basic transcription.

How accurate is Azure Speaker Recognition for voice authentication?

Azure's text-independent speaker verification achieves Equal Error Rate (EER) of approximately 2-5% on high-quality audio. In real-world mobile conditions with background noise, EER rises to 5-10%. For security-critical apps, set the acceptance threshold higher (score > 0.85) and require a fallback authentication method (PIN or biometric). The free Azure tier allows 10,000 speaker profile operations per month.

Can I run voice recognition entirely on-device without sending audio to a server?

Yes, for custom command recognition. Using tflite_flutter, you can run a TensorFlow Lite model locally with no network call. For speaker identification and emotion detection, mature on-device options for audio are scarce (Google's ML Kit covers visual emotion via face detection, not voice), so cloud APIs remain significantly more accurate. On-device TFLite models are best for fixed vocabulary command sets of 10-50 words.

How do I handle background noise in voice recording for ML models?

Use the record package's noise suppression settings where available, and apply audio pre-processing before sending to ML APIs: normalize volume (RMS normalization), apply a high-pass filter to remove low-frequency noise, and trim silence from the start and end. For Azure Speaker Recognition, the API handles some noise internally. For custom TFLite models, train with augmented noisy data as described in the tutorial.
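The RMS normalization mentioned in this answer can be sketched in a few lines. This assumes samples have already been decoded from WAV to floats in [-1, 1]; a real preprocessing pass would also need WAV parsing, the high-pass filter, and silence trimming.

```javascript
// Normalize a PCM sample buffer (floats in [-1, 1]) to a target RMS level
// before sending audio to an ML API. targetRms = 0.1 is an example value.
function rmsNormalize(samples, targetRms = 0.1) {
  const sumSq = samples.reduce((acc, s) => acc + s * s, 0);
  const rms = Math.sqrt(sumSq / samples.length);
  if (rms === 0) return samples.slice();   // pure silence: nothing to scale
  const gain = targetRms / rms;
  // Clip to [-1, 1] so boosted peaks cannot overflow the sample range.
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

Applying a fixed target level means quiet and loud recordings reach the ML API at comparable volume, which is exactly what both Azure verification and custom TFLite models benefit from.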

Is voice biometric authentication GDPR and privacy compliant?

Voice biometric data is classified as sensitive biometric data under GDPR, CCPA, and Illinois BIPA. You must: (1) obtain explicit consent before enrollment, (2) allow users to delete their voice profile at any time, (3) disclose voice data use in your privacy policy, (4) store voice profiles on servers within the user's jurisdiction when possible. Azure Speaker Recognition stores profiles in the Azure region you select — choose EU regions for EU users.

What if I need help implementing custom voice ML features in my app?

Advanced voice ML pipelines involving custom TFLite model deployment, audio preprocessing, and speaker verification require specialized expertise. RapidDev has implemented voice recognition systems in FlutterFlow apps and can handle the full pipeline from model integration to Cloud Function setup and privacy compliance.

How much does Azure Speaker Recognition cost for a production app?

Azure Speaker Recognition pricing is $1 per 1,000 API transactions on the standard tier. A free tier allows 10,000 transactions per month. For a typical app where each user enrolls once (3-5 transactions) and verifies on each login, 10,000 free transactions supports roughly 2,000 users per month. For apps with more users, budget approximately $10-50 per month for typical authentication workloads.
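The rough arithmetic behind the 2,000-users figure, assuming about 5 transactions per enrollment (the top of the 3-5 range above) and the $1 per 1,000 standard-tier price:

```javascript
// Monthly Azure transaction count: enrollments cost ~5 transactions each,
// each login costs one verification transaction.
function monthlyTransactions(newEnrollments, logins, txPerEnroll = 5) {
  return newEnrollments * txPerEnroll + logins;
}

// Cost after the 10,000-transaction free tier at $1 per 1,000 transactions.
function monthlyCostUsd(transactions, freeTier = 10000, pricePer1000 = 1.0) {
  const billable = Math.max(0, transactions - freeTier);
  return (billable / 1000) * pricePer1000;
}

// 2,000 enrollments with no logins exactly fills the free tier:
// monthlyTransactions(2000, 0) → 10000, monthlyCostUsd(10000) → 0
```

At 60,000 transactions per month (for example, 2,000 enrollments plus 50,000 logins), the bill is $50, the top of the budget range quoted above.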

RapidDev

Talk to an Expert

Our team has built 600+ apps. Get personalized help with your project.

Book a free consultation


We put the rapid in RapidDev

Need a dedicated strategic tech and growth partner? Discover what RapidDev can do for your business! Book a call with our team to schedule a free, no-obligation consultation. We'll discuss your project and provide a custom quote at no cost.