Advanced Speech Recognition

Speech Recognition API

Highly accurate Japanese speech-to-text with industry-leading performance. Convert spoken language into text with precision and speed, optimized for Japanese audio.

Speech Recognition Demo
Record audio and get instant transcription
💡 Tip: Record between 1-10 seconds. Automatic transcription will start after you stop.

Industry-Leading Accuracy

Benchmarked performance on real-world Japanese audio

98.5%
Overall Accuracy
Clean audio conditions
95.2%
Noisy Environments
Background noise handling
<0.5s
Response Time
Per minute of audio
97.8%
Mixed Language
Japanese-English code-switching
Accuracy Comparison by Domain
  • Customer Service Calls: 96.5%
  • Business Meetings: 97.2%
  • Medical Consultations: 95.8%
  • Legal Proceedings: 98.1%
  • Technical Discussions: 96.9%

Built for Japanese Audio

Features designed specifically for Japanese speech recognition

Multi-Dialect Support
Recognizes standard Japanese, Kansai, Tohoku, and other regional dialects accurately
Real-time Streaming
Process audio streams in real-time for live transcription and instant results
Speaker Diarization
Automatically identify and separate multiple speakers in conversations
Lightning Fast
Process hours of audio in minutes with our optimized inference pipeline
Enterprise Security
SOC 2 compliant with end-to-end encryption and secure audio processing
Custom Vocabulary
Add industry-specific terms, brand names, and custom phrases for better accuracy

Trusted Use Cases

See how businesses leverage our ASR API

Call Center Transcription
Automatically transcribe customer service calls for quality assurance, compliance, and insights.
  • Quality monitoring
  • Compliance recording
  • Agent training
  • Customer sentiment analysis
Meeting Notes
Transform meetings, interviews, and discussions into searchable, actionable text documents.
  • Business meetings
  • Interview transcripts
  • Conference recordings
  • Team standups
Subtitles & Captions
Generate accurate subtitles for videos, live streams, and broadcasts in real-time or batch mode.
  • Video subtitles
  • Live event captions
  • Broadcast transcription
  • Accessibility compliance

API Key

Configure Your API Key
Enter your API key below to automatically update all code examples on this page
Hotwords & Custom Vocabulary
Improve transcription accuracy for specialized terms by including hotwords in your request. Hotwords help the model correctly recognize industry-specific terms, brand names, and custom phrases:
{
  "audio": "<base64-encoded audio>",
  "hotwords": ["シサAI", "ASR"]
}
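In Python, the request body above can be sketched as a small helper. The `hotwords` field comes from the Request Parameters section below; the helper name and example terms are illustrative:

```python
import base64


def payload_with_hotwords(audio_bytes: bytes, hotwords) -> dict:
    """Build a request body that biases recognition toward domain terms."""
    return {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "hotwords": list(hotwords),
    }


# Example: bias recognition toward product and company names
# (b"..." stands in for real audio bytes).
payload = payload_with_hotwords(b"...", ["シサAI", "ASR"])
```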

Quick Start Guide

Get up and running with the Speech Recognition API in three simple steps

1. Get Your API Key

Sign up for a Shisa AI account and obtain your API key from the developer dashboard. Include it in the Authorization header with the 'shsk:' prefix:

Authorization: Bearer shsk:YOUR_API_KEY
2. Prepare Your Audio

The API accepts base64-encoded audio in various formats. Supported audio formats include:

  • OGG (Opus, Vorbis)
  • WAV (PCM, 16-bit)
  • MP3, WebM, M4A, FLAC
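Step 2 can be sketched in Python: read the file bytes and base64-encode them for the `audio` field (the file path is illustrative):

```python
import base64


def encode_audio(data: bytes) -> str:
    """Return the base64 string expected in the request's "audio" field."""
    return base64.b64encode(data).decode("ascii")


# Typical usage with a file on disk (path is illustrative):
# with open("audio.ogg", "rb") as f:
#     audio_b64 = encode_audio(f.read())
```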
3. Make Your First Request

Send a POST request to the API endpoint with your audio data and configuration. Here's a basic example using cURL:

curl -s -XPOST 'https://api.shisa.ai/asr/srt/audio_llm' \
  -H 'Authorization: Bearer shsk:YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "audio": "'$(base64 -w0 audio.ogg)'"
  }'

Minimal request

Only the audio field is required. Language is auto-detected and tuning parameters use sensible defaults.

Expected Response

The API returns a JSON response with the transcribed text, detected language, and confidence score.

{
  "text": "こんにちは、シサAIです。",
  "language": "ja",
  "confidence": 0.98
}

API Endpoint

The Speech Recognition API uses a chat-style interface for maximum flexibility and context awareness

Speech Recognition Endpoint
POST https://api.shisa.ai/asr/srt/audio_llm

This multimodal endpoint accepts both text instructions and audio content, allowing you to provide context and custom vocabulary (hotwords) for improved accuracy.

Request Parameters

Configure your transcription requests with these parameters

Request Body Parameters

Parameter           Type      Required  Description
audio               string    Required  Base64-encoded audio data (WAV, OGG, MP3, or FLAC)
language            string    Optional  Language code (e.g. "ja", "en"). Omit for automatic language detection (LID).
hotwords            string[]  Optional  Array of words/phrases to boost recognition accuracy for domain-specific terms
temperature         float     Optional  Sampling temperature (0.0-2.0). Lower values make output more deterministic. Default: 0.0
top_p               float     Optional  Nucleus sampling parameter (0.0-1.0). Controls diversity of output. Default: 0.85
frequency_penalty   float     Optional  Penalizes frequent tokens (-2.0 to 2.0). Reduces repetition. Default: 0.5
repetition_penalty  float     Optional  Penalizes token repetition (1.0-2.0). Values > 1.0 discourage repetition. Default: 1.05
vad                 integer   Optional  Voice activity detection mode. Default: 1
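The parameters above can be sketched as a request-body builder. The default values mirror the table; the builder itself (`build_request_body`) is an illustrative helper, not part of the API, and sending the defaults explicitly is optional since omitting a parameter lets the server apply the same default:

```python
# Defaults taken from the parameter table above.
DEFAULTS = {
    "temperature": 0.0,
    "top_p": 0.85,
    "frequency_penalty": 0.5,
    "repetition_penalty": 1.05,
    "vad": 1,
}


def build_request_body(audio_b64, language=None, hotwords=None, **tuning):
    """Merge explicit tuning values over the documented defaults."""
    body = {"audio": audio_b64, **DEFAULTS, **tuning}
    if language is not None:
        body["language"] = language
    if hotwords:
        body["hotwords"] = hotwords
    return body
```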
Audio Input Format

Audio must be provided as a raw base64-encoded string (no data URL prefix) in the following format:

"audio": "SGVsbG8gV29ybGQ..."

Pass raw base64-encoded audio data in the audio field. The server auto-detects the format from the binary header.

Supported Audio Formats:

Format  MIME Type    Detection
WAV     audio/wav    RIFF header
OGG     audio/ogg    OggS header
MP3     audio/mpeg   ID3 tag or MPEG sync bytes
FLAC    audio/flac   fLaC header
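The detection rules in the table can be mirrored client-side to sanity-check a file before uploading. This sniffing helper is an illustrative sketch, not something the API requires:

```python
def detect_audio_format(data: bytes):
    """Guess the container from the magic bytes listed in the table above."""
    if data[:4] == b"RIFF":
        return "wav"   # RIFF header
    if data[:4] == b"OggS":
        return "ogg"   # OggS header
    if data[:4] == b"fLaC":
        return "flac"  # fLaC header
    if data[:3] == b"ID3":
        return "mp3"   # ID3 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"   # MPEG sync bytes
    return None        # unrecognized: the server would reject it
```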

Encoding Audio to Base64

Use the following command to convert your audio file to base64:

# Encode any supported format to base64
base64 -w0 audio.ogg    # Linux
base64 -i audio.ogg     # macOS

# Use in a curl request
curl -s -XPOST 'https://api.shisa.ai/asr/srt/audio_llm' \
  -H 'Authorization: Bearer shsk:YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{ "audio": "'$(base64 -w0 audio.ogg)'" }'
Supported Languages (LID)

The API supports automatic language identification (LID) for the following languages. The detected language is returned in the language field of the response.

Primary Languages

jaJapanese
enEnglish
zhChinese

Response Format

Understanding the API response structure

Successful Response
{
  "text": "こんにちは、シサAIです。",
  "language": "ja",
  "confidence": 0.98
}

Response Fields:

  • text: The transcribed text from the audio
  • language: The detected or specified language code
  • confidence: Transcription confidence score (0 to 1)
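One common pattern is gating on the confidence field before trusting a transcript. This helper is an illustrative sketch, and the 0.9 threshold is an assumption, not an API recommendation:

```python
def extract_text(response: dict, min_confidence: float = 0.9) -> str:
    """Return the transcript, rejecting low-confidence results.

    The 0.9 threshold is illustrative; pick one that fits your use case.
    """
    confidence = response.get("confidence", 0.0)
    if confidence < min_confidence:
        raise ValueError(f"Confidence {confidence} below threshold {min_confidence}")
    return response["text"]


# Usage with the sample response shown above:
text = extract_text({"text": "こんにちは、シサAIです。", "language": "ja", "confidence": 0.98})
```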

Error Handling

Common errors and how to resolve them

Error Response Format
{
  "code": 400,
  "error": "No audio data provided"
}
401 Authentication Error

Returned when the API key is missing, invalid, or expired. Check that your Authorization header includes a valid token.

{
  "context": ["authMiddleware"],
  "code": 104,
  "name": "ErrAuthenticationFailed",
  "error": "Authentication error: Invalid token"
}
Error Codes
Code  Cause                     Error Message
400   Missing audio field       No audio data provided
400   Audio decodes to empty    No audio data provided
400   Not base64 encoded        Invalid base64 audio data
400   Base64 decode fails       Invalid base64 audio data
400   Unsupported audio format  Unsupported audio format
500   Services not ready        Transcription service not available
500   Backend failure           Transcription failed: ...
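Client code can surface these errors using the `code`/`error` shape shown above. The exception class and helper below are an illustrative sketch, not part of any SDK:

```python
class ASRError(Exception):
    """Wraps the documented error shape: {"code": ..., "error": ...}."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def check_response(status: int, body: dict) -> dict:
    """Return the body on success; raise ASRError using the documented fields."""
    if status >= 400:
        raise ASRError(body.get("code", status), body.get("error", "Unknown error"))
    return body
```

As a rule of thumb, the 500-class errors (service not ready, backend failure) are transient and may be worth retrying, while 400-class errors indicate a bad request and will not succeed on retry.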

Code Examples

Integration examples in popular programming languages

cURL - Quick Start
Basic example using cURL to transcribe an audio file
curl -s -XPOST 'https://api.shisa.ai/asr/srt/audio_llm' \
  -H 'Authorization: Bearer shsk:YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "audio": "'$(base64 -w0 audio.ogg)'"
  }'
Python - Full Example
Complete Python function with base64 encoding and hotwords support
import base64
import requests

# Read and encode audio file
with open("audio.ogg", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

url = "https://api.shisa.ai/asr/srt/audio_llm"
headers = {
    "Authorization": "Bearer shsk:YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "audio": audio_data
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
JavaScript - Browser Integration
Client-side JavaScript example with FileReader API
async function transcribeAudio(audioFile) {
  // Read file and convert to base64
  const fileBuffer = await audioFile.arrayBuffer();
  const base64Audio = btoa(
    new Uint8Array(fileBuffer).reduce(
      (data, byte) => data + String.fromCharCode(byte),
      ''
    )
  );

  const response = await fetch('https://api.shisa.ai/asr/srt/audio_llm', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer shsk:YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      audio: base64Audio
    })
  });

  if (!response.ok) {
    throw new Error(`API request failed: ${response.status}`);
  }

  return await response.json();
}

// Example usage with file input
document.querySelector('#audioInput').addEventListener('change', async (e) => {
  const file = e.target.files[0];
  if (file) {
    const result = await transcribeAudio(file);
    console.log('Transcription:', result);
  }
});

Turn Speech into Text with Precision

Start with 180 minutes (3 hours) of free transcription per month. Scale as you grow.