frontpage.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
161•theblazehen•2d ago•45 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
674•klaussilveira•14h ago•202 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
950•xnx•20h ago•552 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
123•matheusalmeida•2d ago•33 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
22•kaonwarb•3d ago•19 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
58•videotopia•4d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
232•isitcontent•14h ago•25 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
225•dmpetrov•15h ago•118 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
332•vecti•16h ago•144 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
494•todsacerdoti•22h ago•243 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
383•ostacke•20h ago•95 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
360•aktau•21h ago•182 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
289•eljojo•17h ago•175 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
413•lstoll•21h ago•279 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
32•jesperordrup•4h ago•16 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
20•bikenaga•3d ago•8 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
63•kmm•5d ago•7 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
91•quibono•4d ago•21 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
258•i5heu•17h ago•196 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
32•romes•4d ago•3 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
44•helloplanets•4d ago•42 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
60•gfortaine•12h ago•26 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1070•cdrnsf•1d ago•446 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
36•gmays•9h ago•12 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
16•speckx•3d ago•6 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
150•vmatsiiako•19h ago•70 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
288•surprisetalk•3d ago•43 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
150•SerCe•10h ago•142 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
186•limoce•3d ago•100 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
73•phreda4•14h ago•14 comments

Chaining FFmpeg with a Browser Agent

https://100x.bot/a/chaining-ffmpeg-with-browser-agent
108•shardullavekar•3mo ago

Comments

sylware•3mo ago
HTML <video> or <audio> element with "Streaming" URLs passed to the media player (or internally in the web browser for the big ones).
utopiah•3mo ago
Have to admit, ffmpeg syntax is not trivial... but the project is also 24 years old and is basically the de facto industry standard. If you believe you will still be editing videos in 20 years with the CLI (or any other tool or programming language wrapping it), then it's probably worth a few hours learning how it actually works.
shardullavekar•3mo ago
True, companies like Descript, Veed, or Kapwing exist because non-coders find this syntax intimidating. Plus, a CLI tool sits outside the usual workflow. We wanted to change that.
petetnt•3mo ago
Don't "no coders" find the concepts described in this article imdimitating?

The article states that the task it describes "Takes about ~20-30 mins. The cognitive load is high....", while its literal first step, Googling "ffmpeg combine static image and audio", gives you the exact command you need to run, from a known source (superuser.com, sourced from the ffmpeg wiki).

Anyone even slightly familiar with ffmpeg should be able to produce the same result in minutes. For someone who doesn't understand what ffmpeg is, the article means absolutely nothing. How does a "non-coder" understand what an "agent in a sandboxed container" is?
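
For reference, the answer that search turns up boils down to a single command along these lines, shown here as a Python arg list in the style used later in the thread; file names are placeholders and exact flags vary a bit between ffmpeg versions:

    import subprocess

    # loop a still image for the duration of the audio track and mux them
    subprocess.run([
        "ffmpeg",
        "-loop", "1", "-i", "cover.png",      # the static image
        "-i", "audio.mp3",                    # the audio track
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac", "-b:a", "192k",
        "-pix_fmt", "yuv420p",
        "-shortest",                          # stop when the audio ends
        "out.mp4",
    ], check=True)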

shardullavekar•3mo ago
We took a basic example and described it (will try adding a complex one).

We have our designer/intern in mind, who creates shorts, adds subtitles, crops them, and merges in the generated audio. He is aware of ffmpeg and prefers using a SaaS UI on top of it.

However, we see him hanging out on ChatGPT or Gemini all the time. He is literally the non-coder we have in mind.

We just combined his "type what you want" habit with ffmpeg workflows.

EraYaN•3mo ago
Wouldn't that intern just use an NLE (be it Premiere, DaVinci Resolve, etc.) anyway? If you need to style subtitles and edit shorts and video content, you'll need a proper editor anyway.
shardullavekar•3mo ago
His workflow:

1. Download a larger video from S3.
2. Use an NLE to cut it into shorts (crop, resize, subtitles, etc.).
3. Upload the shorts to YouTube, Instagram, TikTok.

He does use DaVinci Resolve, but only for step 2.

NLEs leave ffmpeg as a standalone, if easy-to-use, tool.

Not denying that the major heavy lifting is done by the NLE. We go a step further and make it embeddable in a larger workflow.

artpar•3mo ago
I think that applies to almost every tool you want to use with an LLM: ideally the user should already know the tool, so mistakes by the LLM can be caught before they happen.

Here, exposing ffmpeg as "just another capability" allows it to be stitched into larger workflows.

jack_pp•3mo ago
I agree. I suggest using this instead: https://github.com/kkroening/ffmpeg-python . While not perfect, once you figure it out it is far easier to use, and you can wrap more complicated workflows and reuse them later.
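
For anyone curious, a minimal sketch of what that wrapper looks like, assuming kkroening's ffmpeg-python package and placeholder file names:

    import ffmpeg  # pip install ffmpeg-python (kkroening's wrapper)

    # downscale the video stream while passing the audio through untouched;
    # .run() assembles the filtergraph and shells out to the ffmpeg binary
    src = ffmpeg.input("in.mp4")
    video = src.video.filter("scale", 1280, -1)   # width 1280, keep aspect ratio
    ffmpeg.output(video, src.audio, "out.mp4").overwrite_output().run()
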
poly2it•3mo ago
Kkroening's wrapper has been inactive for some time. I suggest using https://github.com/jonghwanhyeon/python-ffmpeg instead. It has proper async support and a better API.
jack_pp•3mo ago
Thing is, if you want to use LLMs for mockups, you've got to use the old one.
jack_pp•3mo ago
Scratch that, I thought it was a different version. The one you linked has no support for filtergraphs, so it isn't even comparable to the old one.
esperent•3mo ago
The syntax isn't too bad. The problem is that I have to use it a couple of times a year, on average. So every time I've forgotten and have to relearn. This doesn't happen with GUIs nearly as much, and it's why I prefer them over CLI tools for anything that I don't do at least once every week or two.
skydhash•3mo ago
That’s why you write scripts, or put a couple snippets in your notes.
esperent•3mo ago
I do have snippets in my notes. The problem is that nearly every time I use it, I need to do something different than the previous time.
bigiain•3mo ago
I started making sure my notes included the search keywords/keyphrases/terminology I used and any webpages or documentation that I successfully used to come up with the solution (and, where relevant, any search queries or misunderstandings on my part that led to dead ends).

This hasn't solved the problem of sometimes needing to do new things, but it at least gives me a map to remind me of the parts of the rabbithole I've explored before.

Sean-Der•3mo ago
My question/curiosity is: why do so many people use ffmpeg (frustrated by the syntax) when GStreamer is available?

With `gst-launch-1.0 filesrc ! qtdemux ! matroskamux ! filesink ...`, people would be less frustrated, maybe?

People would also learn a little more and be less frustrated when conversations about containers/codecs/colorspaces etc. come up. Each has a dedicated element, and you can better understand its I/O.
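
For comparison, an untested sketch of that kind of pipeline driven from Python through the GStreamer bindings (element names are the standard ones; file names are placeholders):

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    # remux an MP4's H.264 track into MKV, roughly the pipeline sketched above
    pipeline = Gst.parse_launch(
        "filesrc location=in.mp4 ! qtdemux ! h264parse ! matroskamux "
        "! filesink location=out.mkv"
    )
    pipeline.set_state(Gst.State.PLAYING)
    # block until end-of-stream or error, then shut down
    pipeline.get_bus().timed_pop_filtered(
        Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR
    )
    pipeline.set_state(Gst.State.NULL)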

artpar•3mo ago
I did not know GStreamer WASM also exists; I'll check it out.
goeiedaggoeie•3mo ago
Still has a way to go, but very exciting.
throwaway2046•3mo ago
I haven't tried GStreamer but I found FFmpeg to be extremely easy to compile as both a command line tool and library, plus it can do so much out of the box even without external libraries being present. It's already used in pretty much everything and does the job so it never occurred to me (or others) to look for alternatives.
thisislife2•3mo ago
Handbrake ( https://handbrake.fr/ ) and AviDemux ( https://www.avidemux.org/ ) are what the average user needs. Subler ( https://subler.org/ ) is also a good macOS app for muxing and tagging.
javier2•3mo ago
ffmpeg is pretty complicated, but at least it actually works.
somat•3mo ago
The thing that helped me get over the ffmpeg hump, where you go from copying Stack Overflow answers to actually sort of understanding what you are doing, is the fairly recent "include external file" syntax. On the surface it is such a minor thing, but mentally it let me turn what was a confusing mess into a programming language. There are a couple of ways to invoke it, but the one I used was to load the whole filter as an argument. Note the slash, it is important: "-/filter_complex filter_file"

https://ffmpeg.org/ffmpeg-filters.html#toc-Filtergraph-synta...

"A special syntax implemented in the ffmpeg CLI tool allows loading option values from files. This is done be prepending a slash ’/’ to the option name, then the supplied value is interpreted as a path from which the actual value is loaded."

For how critical that was to getting over my ffmpeg hump, I wish it was not buried halfway through the documentation, but also, I don't know where else it would go.

And just because I am very proud of my accomplishment, here is the ffmpeg side of my project: motion detection done mainly in ffmpeg. There is some Python glue logic to watch stdout for the events, but all the tricky bits are internal to ffmpeg.

The filter (comments added for the audience's understanding):

    [0:v]
    split  #split the camera feed into two parts, passthrough and motion
        [vis],
    scale=   #scale the motion feed way down, less cpu and it works better
        w=iw/4:
        h=-1,
    format= #needed because blend did not work as expected with yuv
        gbrp,
    tmix= #temporal blur to reduce artifacts
        frames=2,
    [1:v]  #the mask frame
    blend= #mask the motion feed
        all_mode=darken,
    tblend= #the actual motion detection: difference from the previous frame
        all_mode=difference,
    boxblur= #blur the hell out of it to increase the number of motion pixels
        lr=20,
    maskfun= #mask it to black and white
        low=3:
        high=3,
    negate, #make the motion pixels black
    blackframe= #puts events on stdout when too many black pixels are found
        amount=1
        [motion]; #motion output
    [vis] 
    tpad= #delay pass through so you get the start of the event when notified
        start=30
        [original]; #passthrough output
and the ffmpeg invocation:

    ff_args = [
      'ffmpeg',
      '-nostats',
      '-an',
      '-i',
      camera_loc, #a security camera
      '-i',
      'zone_all.png', # mask defining which parts of the frame are relevant for motion detection
      '-/filter_complex',
      'motion_display.filter', #the filter doing all the work
      '-map',  #sort out the outputs from the filter
      '[original]',
      '-f',
      'mpegts', #I feel a little weird using mpegts but it was the best "streaming" format of all the ones I tried
      'udp://127.0.0.1:8888',  #collect the full video from here
      '-map',
      '[motion]',
      '-f',
      'mpegts',
      'udp://127.0.0.1:8889', #collect the motion output from here, mainly for debugging
      ]
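
The Python glue described above can be quite small; a hedged sketch, assuming the blackframe log lines are readable from the process output and that `ff_args` is the list above:

    import re
    import subprocess

    # watch ffmpeg's log output for blackframe events (they normally land on
    # stderr) and turn them into simple "motion" notifications
    proc = subprocess.Popen(ff_args, stderr=subprocess.PIPE, text=True)
    event = re.compile(r"blackframe.*?frame:(\d+).*?pblack:(\d+)")
    for line in proc.stderr:
        m = event.search(line)
        if m:
            print(f"motion event: frame={m.group(1)} pblack={m.group(2)}%")
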
jack_pp•3mo ago
As someone who has used ffmpeg for 10+ years maintaining a relatively complex backend service that's basically a JSON-to-ffmpeg translator, I did not fully understand this article.

The "Before vs. after" section doesn't even seem to create the same thing: the before has no speed-up, the after does.

In the end it seems they basically created a few services ("recipes") that they can reuse to do simple stuff like a 2× speed-up or combining audio and video.
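
A toy version of one such "recipe", with a hypothetical JSON shape invented here just to make the idea concrete:

    import json
    import subprocess

    def speedup_recipe(spec_json: str) -> None:
        # hypothetical spec: {"input": "in.mp4", "speed": 2.0, "output": "out.mp4"}
        spec = json.loads(spec_json)
        speed = float(spec["speed"])
        subprocess.run([
            "ffmpeg", "-i", spec["input"],
            "-filter:v", f"setpts=PTS/{speed}",
            # older ffmpeg builds cap atempo at 2.0 per instance, so larger
            # factors may need a chain of atempo filters
            "-filter:a", f"atempo={speed}",
            spec["output"],
        ], check=True)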

shardullavekar•3mo ago
Thanks for calling it out; I will correct the before-vs-after section. But you can describe any ffmpeg capability in plain English, and the underlying ffmpeg tool call takes care of it.
jack_pp•3mo ago
I have written a lot of ffmpeg-python and plain ffmpeg commands using LLMs, and while I am amazed at how well Gemini or ChatGPT handle ffmpeg prompts, it is still not 100%, so this seems to me like a big gamble on your part. However, it might work for most users that only ask for simple things.
shardullavekar•3mo ago
Creators on 100x will create well-defined workflows that others can reuse. If a workflow is not found, the LLM creates one on the fly and saves it.
jack_pp•3mo ago
That sounds good: save the LLM-generated workflows and have them edited by more seasoned users.

Or you could go one step further and create a special workflow that lets you define some inputs and iterate with an LLM until the user gets what he wants, but for this you would need to generate outputs and have the user validate what the LLM has created before finally saving the recipe.

shardullavekar•3mo ago
That's exactly how it is implemented!
IsTom•3mo ago
> Half of scripting FFmpeg is just fighting with shell quote escaping for filter_complex.

-filter_complex_script is a thing
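
It takes the filtergraph from a file, so the shell never sees it; a quick sketch in the same arg-list style used elsewhere in the thread (file names and the [out] label are placeholders):

    import subprocess

    # graph.txt might contain, e.g.:  [0:v]scale=1280:-1[out]
    subprocess.run([
        "ffmpeg", "-i", "in.mp4",
        "-filter_complex_script", "graph.txt",
        "-map", "[out]",
        "out.mp4",
    ], check=True)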

4gotunameagain•3mo ago
This is YC propping up a startup they have backed; there isn't much substance here.
coachgodzup•3mo ago
I consider FFmpeg a great project, but I usually avoid using it directly because of its quite complex syntax. I'm reconsidering it because, coupled with an LLM, it's very straightforward and more immediate than a typical graphical editor.
orbital-decay•3mo ago
At some point the command line becomes unwieldy. FFmpeg would definitely benefit from a non-arcane DSL like AviSynth, or from a node-based UI.
skeeter2020•3mo ago
This doesn't make any sense; the Before and After examples accomplish different things. I also don't get who the target audience is: people intimidated by a CLI tool but at home with technical agents?
shardullavekar•3mo ago
People who are intimidated by a CLI tool but find tools like ChatGPT easy to use, and those who have video editing as part of a larger workflow.
sanjit•3mo ago
An aside but related?

FFmpeg has complex syntax because it’s dealing with the _complexity of video_. I agree with everyone about knowing (and helping create or contribute to) our tools.

Today I largely forget about the _legacy_ of video, the technical challenges, and how critical it was to get it right.

There are an incredible number of output formats and considerations for _current_ screens (desktop, tablet, mobile, tv, etc…). Then we have a whole other world on the creation side for capture, edit, live broadcast…

With legacy formats it used to be so complex, with standards, requirements, and evolving formats. Today, we don't even think about why 29.97 fps is still around, or about interlacing.

We have a mix of so many incredible (and sometimes frustrating) codecs, needs and final outputs, so it’s really amazing the power we have with a tool like FFmpeg… It’s daunting but really well thought out.

So just a big thanks to the FFmpeg team for all their incredible work over the years…

shardullavekar•3mo ago
No second thoughts about it; we are only making ffmpeg more accessible and embeddable.
echelon•3mo ago
> FFmpeg has complex syntax because it’s dealing with the _complexity of video_.

It's dealing with 3D data (more if you count audio or other tracks) and multi-dimensional transforms from a command line.

charcircuit•3mo ago
>FFmpeg has complex syntax because it’s dealing with the _complexity of video_

It's complexity paired with bad design, making the situation worse than it could be.

SpaceManNabs•3mo ago
I refuse to admit that ffmpeg is badly designed until I see a better design. So if you have one, I am all ears, because it would surely be very illuminating.
hexo•3mo ago
What legacy formats? What are you even talking about?
kwanbix•3mo ago
I use ChatGPT for this kind of complexity.

It works 99% of the time for my use case.

shardullavekar•3mo ago
jack_pp made a point in the comments that's worth noting.
Dachande663•3mo ago
ffmpeg is the only community where I've asked for help and been told "if you have to ask, you're too stupid to use this project". Needless to say, it was a welcoming community I continued engaging with.
pinter69•3mo ago
People in the community can be hardcore sometimes, r/ffmpeg especially. But there are online communities and information resources that help.

This is a nice resource: https://amiaopensource.github.io/ffmprovisr/

And also I've written this cheatsheet, which is designed to be used alongside an LLM: https://github.com/rendi-api/ffmpeg-cheatsheet

Let me know if you're interested in more resources

rigrassm•3mo ago
Love the cheat sheet, forked it after reading a couple sections that were instantly useful lol

Can't promise it'll be soon but I may be able to expand on a couple of your repo's "possible future topics list" items.

I've been working on a personal project involving doing object detection on multiple camera feed inputs that all have different resolutions, frame rates, and encodings and sending a single consolidated and annotated feed to a remote streaming service.

That sent me down a really interesting rabbit hole and I've got tons of notes and links along with some Gemini chats that I'm gonna go through and see if there's anything there that might be worth including.

oldgregg•3mo ago
AI is a game changer for the wildly detailed ffmpeg command line: just tell GPT what you want to do and it will spit out the ffmpeg command 10/10 times.
officeplant•3mo ago
FFmpeg continues to be the great filter of those that don't RTFM.
tartoran•3mo ago
Not really, LLMs get it quite right.
javier2•3mo ago
ffmpeg is awful, except for all the other tools, which are awfuller and don't even work.
usrxcghghj•3mo ago
Read the entire landing page. Still do not understand what 100x bot is.
artpar•3mo ago
100x.bot is primarily a browser automation engine (think iMacros), but with an LLM, all the tools for interacting with the DOM, and a better interface. There is a workflow builder, so you do not need to rely on the LLM for executing deterministic workflows.
arjie•3mo ago
I just tell Claude Code what I want to do and that it has imagemagick and ffmpeg available and it does all the work for me. Because it's got an agentic flow, it loops around, checks the output and fixes things up.

I can ask it to orient people the right way, crop to the important parts, etc. and it will figure out what "the right way", "the important parts", etc. are. Sometimes I have to give it some light hints like "extract n frames from before y to figure out things", but most of the time it just does it.

Claude Code acts like a very general purpose agent for me. About the one thing that I have to manually do that I'm annoyed by is editing 360 videos into a flow. I'd like to be able to tell Claude Code to "follow my daughter as I dunk her in the pool" and stuff like that but I have to do that myself in the GoPro editor.

harrall•3mo ago
I don’t entirely understand who this is for.

- For one-offs, you would just use a GUI.

- For regular edits where you want creative control, you would use a NLE GUI.

- For regular edits where you want consistency, you would have a limited GUI without access to ffmpeg options.

CLI/prompt-based editing for a visual medium is how a programmer might approach editing, but not how a creative would…

qmr•3mo ago
I cut a whole documentary in ffmpeg

https://youtu.be/9kaIXkImCAM

russfink•3mo ago
I use LLMs to help navigate FFmpeg. It hasn’t failed me yet.