RHYME CTRL is a fully local system that takes a raw MP3 of a rap verse and transforms it into a synchronized, rhyme-colored lyric video by combining Whisper-based word alignment, a phoneme-tail rhyme analysis engine, locality-aware clustering, and a custom cinematic HTML renderer. The pipeline extracts phonemes, identifies stressed-vowel rhyme tails, builds time-accurate word timelines, groups rhyme families based on acoustic similarity and temporal proximity, and then renders the entire verse as a frame-perfect visual sequence. A headless browser drives the renderer at 30fps while ffmpeg assembles the final MP4, producing a video that precisely reflects the verse’s structure, flow, and internal rhyme patterns.
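To give a rough idea of what "locality-aware clustering" can mean here, this is a minimal sketch and not the project's actual code. It assumes each word already has a start time and a rhyme key (for example its stressed vowel). The `Word` class, the `cluster_by_locality` function, the `max_gap` threshold, and the timestamps in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # start time in seconds (e.g. from word alignment)
    key: str      # rhyme key, e.g. the stressed vowel "OW" (assumed precomputed)


def cluster_by_locality(words, max_gap=4.0):
    """Group words that share a rhyme key and occur close together in time.

    A word joins the most recent family with the same key only if it starts
    within `max_gap` seconds of that family's last word; otherwise it opens
    a new family. Families with a single word are dropped.
    """
    families = []
    latest = {}  # rhyme key -> index of the most recent family for that key
    for w in sorted(words, key=lambda w: w.start):
        idx = latest.get(w.key)
        if idx is not None and w.start - families[idx][-1].start <= max_gap:
            families[idx].append(w)
        else:
            families.append([w])
            latest[w.key] = len(families) - 1
    return [f for f in families if len(f) > 1]
```

With made-up timestamps, `cluster_by_locality([Word("jokers", 0.5, "OW"), Word("style", 2.0, "AY"), Word("going", 3.0, "OW"), Word("smile", 4.5, "AY"), Word("go", 20.0, "OW")])` puts "jokers" with "going" and "style" with "smile", while the late "go" stays out because it is too far away in time.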
I was inspired by this YouTube channel which used to do rhyme breakdowns but hasn't posted in 3 years:
https://www.youtube.com/watch?v=ouW9xezYVCY&list=RDouW9xezYV...
With just a dictionary, it's easy to see that words like "foolishly" and "buffoonishly" rhyme. In the context of rap, it's a lot harder to see that "jokers" and "going" rhyme back to back, with the emphasis on the "jo" and "go" sounds, as in the following lyrics:
A lot of jokers out running in place, chasing the style
Be a lot going on beneath the empty smile
In the rhyme scheme, "jo" and "go" rhyme just as "style" and "smile" do.
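The difference between these two kinds of rhyme can be sketched with phonemes. This is a minimal illustration, not the project's engine. The pronunciations below are hand-written in CMUdict-style ARPAbet (a trailing `1` marks primary stress) and are an assumption for the example, as are the names `PHONES`, `rhyme_tail`, and `rhyme_kind`.

```python
# Hand-written ARPAbet pronunciations (CMUdict style), for illustration only.
PHONES = {
    "jokers": ["JH", "OW1", "K", "ER0", "Z"],
    "going": ["G", "OW1", "IH0", "NG"],
    "style": ["S", "T", "AY1", "L"],
    "smile": ["S", "M", "AY1", "L"],
}


def rhyme_tail(phones):
    """Return the phonemes from the last primary-stressed vowel to the end,
    with stress digits removed. Raises ValueError if no phoneme is stressed."""
    idx = max(i for i, p in enumerate(phones) if p.endswith("1"))
    return [p.rstrip("012") for p in phones[idx:]]


def rhyme_kind(a, b):
    """Classify a word pair as 'perfect' (same tail), 'vowel' (same stressed
    vowel only), or 'none'."""
    tail_a, tail_b = rhyme_tail(PHONES[a]), rhyme_tail(PHONES[b])
    if tail_a == tail_b:
        return "perfect"
    if tail_a[0] == tail_b[0]:
        return "vowel"
    return "none"


print(rhyme_kind("style", "smile"))   # perfect
print(rhyme_kind("jokers", "going"))  # vowel
print(rhyme_kind("jokers", "smile"))  # none
```

"style" and "smile" share the whole tail (AY L), which a dictionary lookup would catch. "jokers" and "going" share only the stressed OW, which is why a check on the full ending alone would miss them.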
This gets extra tricky because rappers use mispronunciation and emphasis to rhyme words that, strictly speaking, don't rhyme, as when Eminem rhymed in an interview:
I put my orange 4 inch
Door hinge in storage
And ate porridge
With George
There's a lot that can be improved here, and the sample video I generated breaks down toward the end because of compounding line-spacing issues, but I definitely learned a lot.