Created on: 2025-11-04; Revised on: 2025-11-04

Whisper - Automatic Speech Recognition

Introduction

Whisper is software created by OpenAI that automatically recognizes speech in audio and video files. It generates several output files: JSON, SRT, TXT, TSV, and VTT.

A few words may need correcting, but the quality is generally excellent when a suitable model is used. The medium model works particularly well for French.

Whisper can be used in two ways: without installation, through Docker, or installed locally in a Python virtual environment (venv).

Official links:

Using Whisper with Docker (no installation)

Docker is the easiest way to run Whisper without installing it locally. A pre-configured Docker image is available on GitHub.

Model Selection

The medium model is recommended for French. The smaller models (small, base, tiny) give less satisfactory results in French, although they may be acceptable for English.

The large model is more accurate, but it realistically needs a GPU to run at a practical speed.

Note: GPU configuration with Docker can be complex, especially under WSL (Windows Subsystem for Linux). Nvidia GPUs are simpler to configure than AMD ones.

With GPU

# Mount the current directory and the model cache, then transcribe on the GPU.
docker run -ti --rm \
    --gpus all \
    -v "$(pwd)":/workspace \
    -v ~/.cache/whisper:/root/.cache/whisper \
    foilen/fdi-openai-whisper:latest \
    whisper --model medium audio.mp3

Without GPU (CPU only)

The CPU handles the medium model well, although more slowly than a GPU.

# Same command without --gpus: transcription runs on the CPU.
docker run -ti --rm \
    -v "$(pwd)":/workspace \
    -v ~/.cache/whisper:/root/.cache/whisper \
    foilen/fdi-openai-whisper:latest \
    whisper --model medium audio.mp3

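Parameter explanation:

  1. -ti: Attaches an interactive terminal so that the progress output is displayed
  2. --rm: Removes the container once the command finishes
  3. --gpus all: Gives the container access to all GPUs (GPU variant only)
  4. -v (first, to /workspace): Mounts the current directory into the container so that Whisper can read audio.mp3 and write its output files there
  5. -v (second, to /root/.cache/whisper): Keeps downloaded models on the host so that they are not downloaded again on the next run
  6. foilen/fdi-openai-whisper:latest: The Docker image to run
  7. whisper --model medium audio.mp3: The command executed inside the container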
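The two Docker commands above differ only by the --gpus option, so a script can assemble either one from the same pieces. The following is a minimal sketch, not part of the image or of Whisper: build_docker_command is a hypothetical helper, and the image name and mount points are simply copied from the commands above.

```python
import os
import shlex

# Image name and mount points are copied from the docker commands above.
IMAGE = "foilen/fdi-openai-whisper:latest"


def build_docker_command(audio_file, model="medium", use_gpu=False):
    """Return the docker argument list for one transcription run (hypothetical helper)."""
    cache_dir = os.path.expanduser("~/.cache/whisper")
    command = ["docker", "run", "-ti", "--rm"]
    if use_gpu:
        # Expose all GPUs to the container.
        command += ["--gpus", "all"]
    command += [
        "-v", f"{os.getcwd()}:/workspace",
        "-v", f"{cache_dir}:/root/.cache/whisper",
        IMAGE,
        "whisper", "--model", model, audio_file,
    ]
    return command


if __name__ == "__main__":
    # Print a copy-pasteable command line without running it.
    print(shlex.join(build_docker_command("audio.mp3", use_gpu=False)))
```

Passing the returned list to subprocess.run() would execute the command; the sketch only prints it so that it can be checked first.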

Installing for a user in a Python venv

If you prefer a local installation without Docker, you can install Whisper in a Python virtual environment (venv). This method avoids permission issues and dependency conflicts.

Why a venv? On recent Ubuntu releases, the Python installed via APT is marked as externally managed (PEP 668), so a global pip install is refused with an externally-managed-environment error. A Python virtual environment (venv) solves this problem by creating an isolated installation with its own dependencies and versions.

Installation

Create and activate the virtual environment, then install Whisper:

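# Create the virtual environment and activate it in the current shell
VENV_DIR="$HOME/.whisper-venv"
python3 -m venv "$VENV_DIR"
source "$VENV_DIR/bin/activate"

# Upgrade pip inside the venv, then install Whisper
pip install --upgrade pip

pip install -U openai-whisper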

Important note: The installation pulls in PyTorch, which on Linux brings the Nvidia dependencies with it (even if you don’t have an Nvidia GPU). This can take up a lot of disk space.

Important: source activates the environment in your current shell; it does not start a new one. Running exit therefore closes your terminal, not just the venv. To leave the venv and keep the terminal open, run deactivate.
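Because activation only modifies the current shell, it is easy to forget it in a new terminal. A script that depends on Whisper can check for this itself. The sketch below uses the standard comparison between sys.prefix and sys.base_prefix; in_virtualenv is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points to the venv and sys.base_prefix
    # to the Python installation the venv was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run 'source ~/.whisper-venv/bin/activate' first.")
```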

Usage

Each time you want to use Whisper, first activate the virtual environment:

source ~/.whisper-venv/bin/activate

whisper --model medium audio.mp3

Advantage: If you have already downloaded the model with Docker (in ~/.cache/whisper), Whisper will automatically use this cache and won’t re-download the model.
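To see whether a model is already in the shared cache before starting a long run, a script can look for its file. This is a minimal sketch under one assumption: that the cached file is named after the model with a .pt extension (for example medium.pt). cached_model_path is a hypothetical helper, not a Whisper function.

```python
from pathlib import Path


def cached_model_path(model="medium", cache_dir=None):
    """Return the cached model file, or None if it has not been downloaded.

    Assumes the cache stores each model as <model>.pt.
    """
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "whisper"
    candidate = Path(cache_dir) / f"{model}.pt"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    path = cached_model_path("medium")
    print(path if path else "medium model not cached yet; the first run will download it")
```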

Generated output files

When Whisper processes an audio or video file, it generates multiple files in different formats, each with a specific use:

JSON File

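Contains the full transcription details: the complete text, the detected language, and a list of segments, each with its start and end times (in seconds), its text, its token IDs, and confidence-related values such as avg_logprob and no_speech_prob.

Usage: Ideal for creating software that analyzes text, associates images with specific moments, or sends content to an AI for analysis.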
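As a rough illustration of this kind of processing, the sketch below finds the text spoken during a given time window. The sample data is invented and reduced to the text, language, and segments fields (real files carry more fields per segment); segments_between is a hypothetical helper.

```python
import json

# Invented sample shaped like Whisper's JSON output, reduced to a few fields.
SAMPLE = """
{
  "text": " Bonjour tout le monde. Merci beaucoup.",
  "language": "fr",
  "segments": [
    {"id": 0, "start": 0.0, "end": 2.4, "text": " Bonjour tout le monde."},
    {"id": 1, "start": 2.4, "end": 4.1, "text": " Merci beaucoup."}
  ]
}
"""


def segments_between(result, start, end):
    """Return the text of every segment overlapping the window [start, end), in seconds."""
    return [
        segment["text"].strip()
        for segment in result["segments"]
        if segment["start"] < end and segment["end"] > start
    ]


result = json.loads(SAMPLE)
print(result["language"])
print(segments_between(result, 3.0, 5.0))
```

For a real file, replace json.loads(SAMPLE) with json.load() on the opened file, for example audio.json for an input named audio.mp3.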

SRT File

Standard subtitle format compatible with VLC and other video players.

Usage: Allows displaying subtitles directly in video players. Can be uploaded to YouTube to facilitate automatic translation into other languages.
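For readers who want to produce or adjust SRT files themselves, the format is simple: a cue number, a timing line in the form HH:MM:SS,mmm --> HH:MM:SS,mmm, the text, and a blank line. The sketch below shows one way to build it from (start, end, text) tuples; srt_timestamp and to_srt are hypothetical helpers, not Whisper functions.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SRT text from (start, end, text) tuples, numbering cues from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 2.4, " Bonjour tout le monde."), (2.4, 4.1, " Merci beaucoup.")]))
```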

TXT File

Plain text of the transcription, one segment per line (typically a sentence or phrase), without timestamps.

Usage: Simple content reading, copy-paste to other applications, text analysis.

TSV File

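“Tab-Separated Values” format - similar to CSV but with tabs instead of commas. Each row holds the start and end times of a segment in milliseconds, followed by its text.

Usage: Easier to process than JSON for some systems, and commas in the text cause none of the quoting problems they would in a CSV.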
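Reading the TSV takes only a few lines with Python's csv module. The sketch below assumes the layout Whisper writes (a start, end, text header, with times as integer milliseconds); the sample content is invented, and read_tsv is a hypothetical helper.

```python
import csv
import io

# Invented sample: header row, then start and end in milliseconds and the text.
SAMPLE_TSV = (
    "start\tend\ttext\n"
    "0\t2400\tBonjour, tout le monde.\n"
    "2400\t4100\tMerci beaucoup.\n"
)


def read_tsv(stream):
    """Return a list of segments with start and end converted to seconds."""
    # QUOTE_NONE keeps quotation marks in the text as ordinary characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "start": int(row["start"]) / 1000,
            "end": int(row["end"]) / 1000,
            "text": row["text"],
        }
        for row in reader
    ]


rows = read_tsv(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["start"], row["end"], row["text"])
```

For a real file, pass open("audio.tsv", encoding="utf-8", newline="") instead of the io.StringIO sample. Note that the comma in the first row needs no special handling.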

VTT File

WebVTT format for web subtitles.

Usage: Subtitles for web and HTML5 videos.
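WebVTT is close to SRT: the file starts with a WEBVTT header line, and timestamps use a dot rather than a comma before the milliseconds. Whisper already writes both files, so the conversion below is only an illustration of the difference; srt_to_vtt is a hypothetical helper that handles plain SRT timing lines and nothing more.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT: add the header and use dots in the timestamps."""
    body = _TIMING.sub(r"\1.\2 --> \3.\4", srt_text.strip())
    return "WEBVTT\n\n" + body + "\n"


if __name__ == "__main__":
    sample = "1\n00:00:00,000 --> 00:00:02,400\nBonjour, tout le monde.\n"
    print(srt_to_vtt(sample))
```

Only the timing lines are rewritten, so commas inside the subtitle text are left untouched.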

Execution Process

When Whisper runs:

  1. Language detection: Analyzes the first 30 seconds of audio to detect the language automatically (unless one is specified with --language)
  2. Progressive processing: Prints each transcribed segment with its timestamps as it is produced
  3. File generation: All output files are written together at the end of processing
  4. Performance: Uses CPU and memory (and the GPU when the container is started with --gpus all)

Note: Without a GPU, you will see a warning that FP16 (16-bit floating point) is not supported on CPU and that FP32 is used instead. This is normal and does not prevent proper operation.
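If the FP16 message is unwanted, for example in logs, FP32 can be requested explicitly on CPU. The sketch below assembles the corresponding command line, assuming the --device and --fp16 options of the whisper command; whisper_cli_args is a hypothetical helper and the has_gpu flag is supplied by the caller rather than detected.

```python
import shlex


def whisper_cli_args(audio_file, model="medium", has_gpu=False):
    """Return a whisper command line; on CPU, request FP32 explicitly."""
    args = ["whisper", "--model", model]
    if not has_gpu:
        # FP16 is not supported on CPU, so ask for FP32 up front.
        args += ["--device", "cpu", "--fp16", "False"]
    args.append(audio_file)
    return args


if __name__ == "__main__":
    # Print the command for a CPU-only machine without running it.
    print(shlex.join(whisper_cli_args("audio.mp3", has_gpu=False)))
```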