Advanced Speech Recognition

Speech Recognition API

Highly accurate Japanese speech-to-text with industry-leading performance. Convert spoken language into text with precision and speed, optimized for Japanese audio.

Speech Recognition Demo
Record audio and get instant transcription
💡 Tip: Record between 1-10 seconds. Automatic transcription will start after you stop.

Industry-Leading Accuracy

Benchmarked performance on real-world Japanese audio

98.5%
Overall Accuracy
Clean audio conditions
95.2%
Noisy Environments
Background noise handling
<0.5s
Response Time
Per minute of audio
97.8%
Mixed Language
Japanese-English code-switching
Accuracy Comparison by Domain
  • Customer Service Calls: 96.5%
  • Business Meetings: 97.2%
  • Medical Consultations: 95.8%
  • Legal Proceedings: 98.1%
  • Technical Discussions: 96.9%

Built for Japanese Audio

Features designed specifically for Japanese speech recognition

Multi-Dialect Support
Recognizes standard Japanese, Kansai, Tohoku, and other regional dialects accurately
Real-time Streaming
Process audio streams in real-time for live transcription and instant results
Speaker Diarization
Automatically identify and separate multiple speakers in conversations
Lightning Fast
Process hours of audio in minutes with our optimized inference pipeline
Enterprise Security
SOC 2 compliant with end-to-end encryption and secure audio processing
Custom Vocabulary
Add industry-specific terms, brand names, and custom phrases for better accuracy

Trusted Use Cases

See how businesses leverage our ASR API

Call Center Transcription
Automatically transcribe customer service calls for quality assurance, compliance, and insights.
  • Quality monitoring
  • Compliance recording
  • Agent training
  • Customer sentiment analysis
Meeting Notes
Transform meetings, interviews, and discussions into searchable, actionable text documents.
  • Business meetings
  • Interview transcripts
  • Conference recordings
  • Team standups
Subtitles & Captions
Generate accurate subtitles for videos, live streams, and broadcasts in real-time or batch mode.
  • Video subtitles
  • Live event captions
  • Broadcast transcription
  • Accessibility compliance

API Key

Configure Your API Key
Enter your API key below to automatically update all code examples on this page
Hotwords & Custom Vocabulary
Improve transcription accuracy for specialized terms by including hotwords in your request. Hotwords help the model correctly recognize industry-specific terms, brand names, and custom phrases:
{
  "audio": "<base64-encoded audio>",
  "hotwords": ["シサAI", "ASR"]
}
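In Python, the request body above can be sketched as a small helper. The `hotwords` field comes from the Request Parameters section below; the helper name and example terms are illustrative:

```python
import base64


def payload_with_hotwords(audio_bytes: bytes, hotwords) -> dict:
    """Build a request body that biases recognition toward domain terms."""
    return {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "hotwords": list(hotwords),
    }


# Example: bias recognition toward product and company names
# (b"..." stands in for real audio bytes).
payload = payload_with_hotwords(b"...", ["シサAI", "ASR"])
```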

Quick Start Guide

Get up and running with the Speech Recognition API in three simple steps

1. Get Your API Key

Sign up for a Shisa AI account and obtain your API key from the developer dashboard. Include it in the Authorization header with the 'shsk:' prefix:

Authorization: Bearer shsk:YOUR_API_KEY
2. Prepare Your Audio

The API accepts base64-encoded audio in various formats. Supported audio formats include:

  • OGG (Opus, Vorbis)
  • WAV (PCM, 16-bit)
  • MP3, WebM, M4A, FLAC
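Step 2 can be sketched in Python: read the file bytes and base64-encode them for the `audio` field (the file path is illustrative):

```python
import base64


def encode_audio(data: bytes) -> str:
    """Return the base64 string expected in the request's "audio" field."""
    return base64.b64encode(data).decode("ascii")


# Typical usage with a file on disk (path is illustrative):
# with open("audio.ogg", "rb") as f:
#     audio_b64 = encode_audio(f.read())
```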
3. Make Your First Request

Send a POST request to the API endpoint with your audio data and configuration. Here's a basic example using cURL:

curl -s -XPOST 'https://api.shisa.ai/asr/srt/audio_llm' \
  -H 'Authorization: Bearer shsk:YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "audio": "'$(base64 -w0 audio.ogg)'"
  }'

Minimal request

Only the audio field is required. Language is auto-detected and tuning parameters use sensible defaults.

Expected Response

The API returns a JSON response with the transcribed text, detected language, and confidence score.

{
  "text": "こんにちは、シサAIです。",
  "language": "ja",
  "confidence": 0.98
}

API Endpoint

The Speech Recognition API uses a chat-style interface for maximum flexibility and context awareness

Speech Recognition Endpoint
POST https://api.shisa.ai/asr/srt/audio_llm

This multimodal endpoint accepts both text instructions and audio content, allowing you to provide context and custom vocabulary (hotwords) for improved accuracy.

Request Parameters

Configure your transcription requests with these parameters

Request Body Parameters

Parameter           Type      Required  Description
audio               string    Required  Base64-encoded audio data (WAV, OGG, MP3, or FLAC)
language            string    Optional  Language code (e.g. "ja", "en"). Omit for automatic language detection (LID).
hotwords            string[]  Optional  Array of words/phrases to boost recognition accuracy for domain-specific terms
temperature         float     Optional  Sampling temperature (0.0-2.0). Lower values make output more deterministic. Default: 0.0
top_p               float     Optional  Nucleus sampling parameter (0.0-1.0). Controls diversity of output. Default: 0.85
frequency_penalty   float     Optional  Penalizes frequent tokens (-2.0 to 2.0). Reduces repetition. Default: 0.5
repetition_penalty  float     Optional  Penalizes token repetition (1.0-2.0). Values > 1.0 discourage repetition. Default: 1.05
vad                 integer   Optional  Voice activity detection mode. Default: 1
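The parameters above can be sketched as a request-body builder. The default values mirror the table; the builder itself (`build_request_body`) is an illustrative helper, not part of the API, and sending the defaults explicitly is optional since omitting a parameter lets the server apply the same default:

```python
# Defaults taken from the parameter table above.
DEFAULTS = {
    "temperature": 0.0,
    "top_p": 0.85,
    "frequency_penalty": 0.5,
    "repetition_penalty": 1.05,
    "vad": 1,
}


def build_request_body(audio_b64, language=None, hotwords=None, **tuning):
    """Merge explicit tuning values over the documented defaults."""
    body = {"audio": audio_b64, **DEFAULTS, **tuning}
    if language is not None:
        body["language"] = language
    if hotwords:
        body["hotwords"] = hotwords
    return body
```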
Audio Input Format

Audio must be provided as a raw base64-encoded string (no data URL prefix) in the following format:

"audio": "SGVsbG8gV29ybGQ..."

Pass raw base64-encoded audio data in the audio field. The server auto-detects the format from the binary header.

Supported Audio Formats:

Format  MIME Type    Detection
WAV     audio/wav    RIFF header
OGG     audio/ogg    OggS header
MP3     audio/mpeg   ID3 tag or MPEG sync bytes
FLAC    audio/flac   fLaC header
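The detection rules in the table can be mirrored client-side to sanity-check a file before uploading. This sniffing helper is an illustrative sketch, not something the API requires:

```python
def detect_audio_format(data: bytes):
    """Guess the container from the magic bytes listed in the table above."""
    if data[:4] == b"RIFF":
        return "wav"   # RIFF header
    if data[:4] == b"OggS":
        return "ogg"   # OggS header
    if data[:4] == b"fLaC":
        return "flac"  # fLaC header
    if data[:3] == b"ID3":
        return "mp3"   # ID3 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"   # MPEG sync bytes
    return None        # unrecognized: the server would reject it
```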

Encoding Audio to Base64

Use the following command to convert your audio file to base64:

# Encode any supported format to base64
base64 -w0 audio.ogg    # Linux
base64 -i audio.ogg     # macOS

# Use in a curl request
curl -s -XPOST 'https://api.shisa.ai/asr/srt/audio_llm' \
  -H 'Authorization: Bearer shsk:YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{ "audio": "'$(base64 -w0 audio.ogg)'" }'
Supported Languages (LID)

The API supports automatic language identification (LID) for the following languages. The detected language is returned in the language field of the response.

Primary Languages

jaJapanese
enEnglish
zhChinese

Response Format

Understanding the API response structure

Successful Response
{
  "text": "こんにちは、シサAIです。",
  "language": "ja",
  "confidence": 0.98
}

Response Fields:

  • text: The transcribed text from the audio
  • language: The detected or specified language code
  • confidence: Transcription confidence score (0 to 1)
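One common pattern is gating on the confidence field before trusting a transcript. This helper is an illustrative sketch, and the 0.9 threshold is an assumption, not an API recommendation:

```python
def extract_text(response: dict, min_confidence: float = 0.9) -> str:
    """Return the transcript, rejecting low-confidence results.

    The 0.9 threshold is illustrative; pick one that fits your use case.
    """
    confidence = response.get("confidence", 0.0)
    if confidence < min_confidence:
        raise ValueError(f"Confidence {confidence} below threshold {min_confidence}")
    return response["text"]


# Usage with the sample response shown above:
text = extract_text({"text": "こんにちは、シサAIです。", "language": "ja", "confidence": 0.98})
```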

Error Handling

Common errors and how to resolve them

Error Response Format
{
  "code": 400,
  "error": "No audio data provided"
}
401 Authentication Error

Returned when the API key is missing, invalid, or expired. Check that your Authorization header includes a valid token.

{
  "context": ["authMiddleware"],
  "code": 104,
  "name": "ErrAuthenticationFailed",
  "error": "Authentication error: Invalid token"
}
Error Codes
Code  Cause                     Error Message
400   Missing audio field       No audio data provided
400   Audio decodes to empty    No audio data provided
400   Not base64 encoded        Invalid base64 audio data
400   Base64 decode fails       Invalid base64 audio data
400   Unsupported audio format  Unsupported audio format
500   Services not ready        Transcription service not available
500   Backend failure           Transcription failed: ...
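Client code can surface these errors using the `code`/`error` shape shown above. The exception class and helper below are an illustrative sketch, not part of any SDK:

```python
class ASRError(Exception):
    """Wraps the documented error shape: {"code": ..., "error": ...}."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def check_response(status: int, body: dict) -> dict:
    """Return the body on success; raise ASRError using the documented fields."""
    if status >= 400:
        raise ASRError(body.get("code", status), body.get("error", "Unknown error"))
    return body
```

As a rule of thumb, the 500-class errors (service not ready, backend failure) are transient and may be worth retrying, while 400-class errors indicate a bad request and will not succeed on retry.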

Code Examples

Integration examples in popular programming languages

cURL - Quick Start
Basic example using cURL to transcribe an audio file
curl -s -XPOST 'https://api.shisa.ai/asr/srt/audio_llm' \
  -H 'Authorization: Bearer shsk:YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "audio": "'$(base64 -w0 audio.ogg)'"
  }'
Python - Full Example
Complete Python function with base64 encoding and hotwords support
import base64
import requests

# Read and encode audio file
with open("audio.ogg", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

url = "https://api.shisa.ai/asr/srt/audio_llm"
headers = {
    "Authorization": "Bearer shsk:YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "audio": audio_data
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
JavaScript - Browser Integration
Client-side JavaScript example with FileReader API
async function transcribeAudio(audioFile) {
  // Read file and convert to base64
  const fileBuffer = await audioFile.arrayBuffer();
  const base64Audio = btoa(
    new Uint8Array(fileBuffer).reduce(
      (data, byte) => data + String.fromCharCode(byte),
      ''
    )
  );

  const response = await fetch('https://api.shisa.ai/asr/srt/audio_llm', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer shsk:YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      audio: base64Audio
    })
  });

  if (!response.ok) {
    throw new Error(`API request failed: ${response.status}`);
  }

  return await response.json();
}

// Example usage with file input
document.querySelector('#audioInput').addEventListener('change', async (e) => {
  const file = e.target.files[0];
  if (file) {
    const result = await transcribeAudio(file);
    console.log('Transcription:', result);
  }
});

Turn Speech into Text with Precision

Start with 180 minutes (3 hours) of free transcription per month. Scale as you grow.