How I ended up building and maintaining a Python package with 1M+ downloads
KochiFOSS Meetup
April 26, 2025
$whoami
ML Engineer @ Sarvam
Volunteer @ Swathanthra Malayalam Computing
Loves walking and likes to participate in marathons
Bird watching is my hobby (PS: Sarvam models are named after birds because of me)
I spoke at this same meetup in 2023, at the KeyValue office
Who I Am: A Snapshot
FOSS & Python
Active contributor with a passion for open source software.
Machine Learning
Engineer focusing on deep learning and fast.ai frameworks.
Community & Travel
Engaged in Malayalam computing and explored 10 Indian cities.
Speaker & Volunteer
Presented at PyCon India and contributes to AI4Bharat initiatives.
Why did I create a Python package?
Identify a Problem
  • In my previous company I was benchmarking various ASR providers.
  • Malayalam ASR Benchmarking
I knew how to build a Python package
I learned nbdev, built by Jeremy Howard, Hamel Husain, Wasim Lorgat, and others, during the fast.ai course in 2022.
Frustration led me to publish it as a Python package while I was working on the Malayalam ASR benchmarking project and giving talks on OpenAI Whisper and how well it can be fine-tuned.
Make something you want
What is nbdev?
  • Create delightful software with Jupyter notebooks.
  • Build a Python package with proper documentation from Jupyter notebooks, then publish it easily to PyPI, GitHub, and the Anaconda ecosystem.
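As a rough sketch of what this looks like in practice, the block below writes out the cells of a tiny nbdev notebook as plain Python. The `#| default_exp` and `#| export` comments are nbdev directives for choosing the target module and marking a cell for export; the module name `core` and the function `say_hello` are made up for illustration.

```python
# --- cell 1 ---
#| default_exp core
# Tells nbdev which module the exported cells are written to.
# The module name `core` is a hypothetical choice for this sketch.

# --- cell 2 ---
#| export
def say_hello(name: str) -> str:
    "Return a greeting; the docstring is picked up for the generated docs."
    return f"Hello, {name}!"

# --- cell 3 ---
# A cell without a directive stays in the notebook, where it serves as
# both a documentation example and a test.
assert say_hello("KochiFOSS") == "Hello, KochiFOSS!"
```

Running `nbdev_export` in the project then writes the exported cells into the package's module.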
How to evaluate ASR providers?
An ASR system is evaluated by comparing its output against the ground-truth transcript. Two common metrics are:
  • Word Error Rate (WER)
  • Character Error Rate (CER)
Example of ASR evaluation
Ground Truth = I am at Kochi FOSS today. I am presenting a talk today, at Key Value Systems along with Andrew from Hoppscotch and Renjith from Wikidata.
ASR output = I am at Kochi Foods today. I am presenting a talk today at Key Value Systems along with Andrew from Hope's Coach and Ranjit from Wikidata.
WER without normalization = 0.2
CER without normalization = 0.08759
WER with normalization = 0.2
CER with normalization = 0.067164
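To make the unnormalized numbers concrete, here is a minimal sketch of WER and CER as edit distance divided by reference length. It uses a hand-written Levenshtein distance rather than any particular evaluation library, so the helper names (`edit_distance`, `wer`, `cer`) are assumptions of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character-level edits divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


ground_truth = ("I am at Kochi FOSS today. I am presenting a talk today, at Key Value "
                "Systems along with Andrew from Hoppscotch and Renjith from Wikidata.")
asr_output = ("I am at Kochi Foods today. I am presenting a talk today at Key Value "
              "Systems along with Andrew from Hope's Coach and Ranjit from Wikidata.")

# 5 word-level edits over 25 reference words gives a WER of 0.2.
print(f"WER without normalization: {wer(ground_truth, asr_output):.2f}")
print(f"CER without normalization: {cer(ground_truth, asr_output):.5f}")
```

Note that "today," versus "today" counts as a full word error here, which is exactly the kind of formatting difference a normalizer is meant to remove before scoring.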
What is Whisper, and what is its normalizer?
  • OpenAI open-sourced Whisper on September 21, 2022, releasing the inference code and pre-trained model weights.
  • The Whisper normalizer is a text normalization tool and algorithm used in OpenAI's Whisper automatic speech recognition (ASR) system. Its main purpose is to standardize transcribed text so that formatting differences (such as punctuation, capitalization, or whitespace) do not unfairly penalize evaluation metrics like Word Error Rate (WER) and Character Error Rate (CER). Normalization makes transcriptions easier to compare by ensuring that only genuine transcription errors are counted, not superficial formatting differences.
  • Explain EnglishNormalizer
  • Explain BasicTextNormalizer
Hello world to Whisper Normalizer
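To give a feel for what normalization does before scoring, here is a minimal stand-in written in plain Python. It is not the package's implementation: `simple_normalize` is a hypothetical helper that only lowercases, drops ASCII punctuation, and collapses whitespace, whereas BasicTextNormalizer and EnglishTextNormalizer handle more cases.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    return " ".join(text.split())


print(simple_normalize("I am presenting a talk today, at Key Value Systems."))
# i am presenting a talk today at key value systems
```

After this step, "today," and "today" compare as equal, so only genuine recognition errors remain when WER and CER are computed.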
Why did the package get popular?
  • The whole field of voice agents, and speech in general, exploded from 2023-2024 onwards.
  • Speech-to-text, text-to-speech, and speech-to-speech models keep getting better.
  • SEO: my Python package shows up when you search Google or Perplexity for "whisper normalizer".

Perplexity AI

What is whisper normalizer and how to use it

The Whisper normalizer is a text normalization tool and algorithm used in OpenAI's Whisper automatic speech recognition (ASR) system. Its main purpose is to...

Current Monthly stats
Best Monthly Downloads
Getting even more downloads than nbdev
Identifying a big issue with Malayalam ASR benchmarking
1
Kavya noticed a big bug
BasicTextNormalizer in whisper_normalizer was removing vowel signs from Malayalam text
2
Kavya and I tweeted
We informed the community, through a blog post Kavya wrote and through tweets, that the normalizers used in Meta's ASR paper, by Assembly.ai, by OpenAI, and others are wrong
4
Both of us are working on fixing the issues
  • Fixed the problem by using normalizers written by Anoop Kunchukuttan and AI4Bharat
  • There are now normalizers such as MalayalamNormalizer and HindiNormalizer, covering 9 Indian languages
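A minimal sketch of how this kind of bug can arise, assuming the normalizer replaces every character in the Unicode mark, symbol, and punctuation categories with a space. The helper name `strip_marks_symbols_punct` is made up for this sketch and is not the package's API.

```python
import unicodedata


def strip_marks_symbols_punct(text: str) -> str:
    """Replace every character in Unicode categories M*, S*, P* with a space."""
    return "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in text
    )


word = "മലയാളം"  # "Malayalam" written in Malayalam script
for ch in word:
    # Dependent vowel signs such as U+0D3E fall in a mark category (M*),
    # not a letter category, so a punctuation-style rule catches them too.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")

damaged = strip_marks_symbols_punct(word)
print(repr(damaged))  # the vowel signs are gone, so the word is no longer the same
```

For English this rule is harmless because letters carry no combining signs, but in Malayalam and other Indic scripts it changes the words themselves, which distorts WER and CER.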
Paper accepted at EMNLP 2024
Crafting a User-Friendly Python package
Simplify Usage
  • A proper GitHub README
  • Simple name
Comprehensive Documentation
  • Properly document usage
  • Accept contributions
Examples & Tutorials
  • Made YouTube videos
Maintenance required for a Python package
1
Fix Bugs Quickly
Prioritize issues reported by users for reliability.
2
Add Features
Implement enhancements based on community feedback, such as MalayalamNormalizer and updates to the English normalizer.
3
Update Dependencies
Keep libraries current to ensure compatibility and security.
4
Monitor Usage
Track downloads and feedback to inform future development.

linkedin

Last year March, I created a python package called whisper_normalizer… | Kurian Benoy

Last year March, I created a python package called whisper_normalizer package with nbdev. I realized it was not possible to use the normalization algorithms in OpenAI Whisper paper either without downloading OpenAI whisper repo or using transformers package. So I created this package to use in my malayalam ASR benchmarking project. It's incredible to see its adoption grow, with over 1000 monthly downloads via PyPI. Thrilled to witness it being used in open-source projects like Amazon OD3 an

linkedin

Release v0.1.0 · kurianbenoy/whisper_normalizer | Kurian Benoy

🚀  Weekend Release Alert! Just shipped whisper_normalizer v0.1.0 💻 ✨ What's new: 1. Support for converting arabic numbers to Indic script in word format (36 -> छत्तीस, 90->തൊണ്ണൂറ്) 2. Removed network call from EnglishTextNormalizer module Check it out:

🚨 The exclusive story I promised to share today
kurianbenoy
kurianbenoy2
kurianbenoy
Thank you
Make something you want
Ensure quality with a good README and documentation
Engage the Community, identify pain points
Iterate Continuously
Improve based on feedback and usage.