StreamTranscriber API

Transcribe an audio stream (e.g. microphone input) to text using the whisper.cpp speech-to-text implementation. This is experimental and does not work in Firefox, because sample rate conversion with AudioContext is not supported there. Also, wasm is far too slow for real-time transcription.

StreamTranscriber(options)

Creates a new StreamTranscriber instance.

Syntax

const streamTranscriber = new StreamTranscriber(options);

Parameters

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| options | StreamTranscriberOptions | | |
| options.createModule | (moduleArg = {}) => Promise&lt;any&gt; | | Exported createModule() function from @transcribe/shout. |
| options.model | string \| File | | Whisper.cpp model file in ggml format. Will call fetch() if a string is given, otherwise uses the provided file. |
| options.workerPath | string | Directory of shout.wasm.js | Path to the shout.wasm.worker.mjs file. Defaults to the directory where shout.wasm.js is located. |
| options.audioWorkletsPath | string | ${currentUrl}/audio-worklets | Path to the vad.js and buffer.js files. Defaults to the audio-worklets/ directory where StreamTranscriber.js is located. |
| options.onReady | () => void | () => {} | Called after init. |
| options.onStreamStatus | (status: StreamStatus) => void | () => {} | Called when the stream status changes. StreamStatus: "loading" \| "waiting" \| "processing" \| "stopped" |
| options.onSegment | (segment: TranscribeResultSegment) => void | () => {} | Called when a new transcribed segment is ready. |

Returns

A new StreamTranscriber instance.
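As a sketch, a minimal setup might look like the following. The import paths and the model path are assumptions for illustration; createModule is the export from @transcribe/shout described above:

```javascript
// Assumed import locations; adjust to your setup.
import { StreamTranscriber } from "@transcribe/transcriber";
import createModule from "@transcribe/shout";

const streamTranscriber = new StreamTranscriber({
  createModule,
  model: "/models/ggml-tiny.bin", // placeholder path to a ggml model file
  onReady: () => console.log("transcriber ready"),
  onSegment: (segment) => console.log("segment:", segment),
});
```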

init()

Loads the model and the audio worklets, and creates a new wasm instance. Must be called before start().

Syntax

await streamTranscriber.init();

Returns

A promise that resolves to void (Promise&lt;void&gt;).

start(options?)

Starts a new stream transcriber (technically, a loop in wasm space that waits for audio input). Must be called before transcribe().

Syntax

await streamTranscriber.start(options);

Parameters

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| options | object | | |
| options.lang | string | "auto" | Language code of the audio language (e.g. "en"). |
| options.threads | number | this.maxThreads | Number of threads to use. Defaults to the maximum available. |
| options.translate | boolean | false | Translate the result to English. |
| options.suppress_non_speech | boolean | false | If true, the transcriber will try to suppress non-speech segments. |
| options.max_tokens | number | 16 | Maximum number of tokens in a single segment; see whisper.cpp. |
| options.audio_ctx | number | 512 | Audio context buffer size in samples; see whisper.cpp. |

Returns

A promise that resolves to void (Promise&lt;void&gt;).
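For example, starting the wasm loop for English audio with non-speech suppression (values shown are illustrative):

```javascript
// Assumes streamTranscriber was created and init() has resolved.
await streamTranscriber.start({
  lang: "en",
  translate: false,
  suppress_non_speech: true,
});
```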

stop()

Stops the wasm loop that waits for audio input.

Syntax

await streamTranscriber.stop();

Returns

A promise that resolves to void (Promise&lt;void&gt;).

transcribe(stream, options?)

Transcribes the audio signal from stream. Wasm calls the onSegment(result) callback once the transcription is ready.

The function starts buffering the audio when speech is detected. The buffer is then sent to wasm when silence is detected or maxRecordMs is exceeded.

Syntax

await streamTranscriber.transcribe(stream, options);

Parameters

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| stream | MediaStream | | Audio stream to transcribe, e.g. microphone input. |
| options | object | | |
| options.preRecordsMs | number | 200 | Audio in ms to include before detected speech, because voice activity detection needs some time to trigger. |
| options.maxRecordMs | number | 5000 | If the buffer reaches this length it is flushed to wasm, even during speech. |
| options.minSilenceMs | number | 500 | Minimum silence in ms before transcribe is called. |
| options.onVoiceActivity | (active: boolean) => void | | Called when there's a change in voice activity. |

Returns

A promise that resolves to void (Promise&lt;void&gt;).
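Putting it together, a sketch of feeding microphone input to a started transcriber (getUserMedia requires a secure context and user permission):

```javascript
// Assumes streamTranscriber has been init()ed and start()ed.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

await streamTranscriber.transcribe(stream, {
  preRecordsMs: 200,
  maxRecordMs: 5000,
  minSilenceMs: 500,
  onVoiceActivity: (active) => console.log("speaking:", active),
});
// Transcribed segments arrive via the onSegment callback
// passed to the StreamTranscriber constructor.
```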

destroy()

Destroys the wasm instance and frees wasm memory.

Syntax

streamTranscriber.destroy();
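A typical teardown, stopping the wasm loop before freeing its memory (sketch; assumes a running streamTranscriber instance):

```javascript
// Stop the loop waiting for audio input, then free wasm memory.
await streamTranscriber.stop();
streamTranscriber.destroy();
```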