Our sandbox with public data is free for you to try, or just reach out and ask any question!
I'm a data engineer and make decisions around what software we use for pipelines. A lot of examples for these types of tools showcase the simple case, which is a handy intro, but I'd love to see a real-world example of Bauplan scaling to interconnected pipelines!
We have people building stuff featured here (https://www.bauplanlabs.com/build-with-bauplan) as well as online (e.g. https://blog.det.life/bauplan-the-serverless-data-lakehouse-...), plus of course our examples repo on GitHub that you can check out as part of the tutorial.
Our largest client is a $5BN/year company running thousands of jobs on bauplan. If you have something in mind, you can try out the public sandbox for free and come on our Slack, and I'm happy to build something with you.
I have spent a lot of time in Jupyter notebooks for experimentation and research in a past life, and marimo's reactivity, built-in affordances for working with data (table viewer, database connections, and other interactive elements), lazy execution, and persistent caching make me far more productive when working with data, regardless of whether I am making an app-like thing.
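To make the caching point concrete, here is a minimal sketch using marimo's `mo.persistent_cache` context manager (the dataset path is hypothetical):

```python
import marimo as mo
import pandas as pd

# Variables assigned inside this block are cached to disk; on notebook
# restart marimo restores them instead of recomputing, and invalidates
# the cache when the code or its inputs change.
with mo.persistent_cache(name="expensive_load"):
    df = pd.read_parquet("data/events.parquet")  # hypothetical dataset
    daily = df.groupby("day")["value"].mean()
```

Combined with lazy execution, reopening a notebook stays cheap even when the underlying computation is not.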
But as the original developer of marimo I am obviously biased :) Thanks for using marimo!
We've done quite a lot of open source over the years, at Bauplan (you can check our GitHub) and before (you can check me ;-)), so the comment seems unfair!
We understand the importance of being clear on how the platform works, and for that we have a long series of blog posts and, if you're so inclined, quite a few peer-reviewed papers in top conferences, covering low-level memory optimizations (https://arxiv.org/abs/2504.06151), columnar caching (https://arxiv.org/abs/2411.08203), novel FaaS runtimes (https://arxiv.org/pdf/2410.17465), pipeline reproducibility (https://arxiv.org/pdf/2404.13682), and more.
We are also always happy to chat about our tech choices if you're interested.
Who comes up with these weird names for patterns? What the heck is "lake" supposed to evoke?
I've never understood why this is so hard. Every time data science gives me a notebook, it feels like I have been handed a function that says `doFeature()` that I should just put behind an endpoint called /do_feature, but it always takes forever and I'm never even able to articulate why. It feels like I'm clueless at reading code, but only this one particular kind of code.
I think it's a much better result to have the data science prototype translated into a performant production version than to take a Databricks-type approach or what Bauplan is proposing.
What that looks like is highly dependent upon the environment at hand, and letting AI take that over may be one of those “now you have 2 problems” things.
The blog post with our marimo friends is meant to showcase that you can have notebook development (if you like it) AND cloud scaling (which you need) without code changes, thanks to the fact that both marimo and Bauplan are basically just Python (maybe a small thing, but there is nothing else in the market remotely close).
On the AI part, we agree: the fact that bauplan is just Python, including data management and infra-as-code, makes it trivial for AI to build pipelines in Bauplan, which is not something that can be said about other data platforms. If you follow our blog, in a few weeks or so we are releasing a full "agentic" implementation of production ETL workloads on the Bauplan API, which you may find interesting.
"but it always takes forever and I'm never even able to articulate why." -> there are way more factors at play than DoFeatures unfortunately, see for example Table 1 (https://arxiv.org/pdf/2404.13682). Even knowing which data people have developed on is hard, which is why bauplan has git-for-data semantics built in: everyone works on production data, but safely and reliably, to avoid data skews.
Each computer is different, which is why bauplan adopts FaaS with isolated, fully containerized functions: you are always in the cloud, so there is no skew in the infra either.
The problem of "going to production" is still the biggest issue in the industry, and solving it is not a one-fix kind of thing, but unfortunately the combination of good ergonomics, new abstractions and reliable infra.
This isn’t necessarily what you want in a daily production environment, let alone a real-time environment.
If you're worried about data movement or secure deployment, none of that is an issue thanks to Iceberg and the BYOC option.
Databricks and Snowflake, just to mention two players in a similar space, are not open source: did you feel that would prevent you from adopting them as well?
Yes, absolutely. Snowflake is a modern Oracle. It may survive, but will be more of a barnacle/legacy system for big corporations. Neither is the right solution for the next generation of companies that are starting up today.
There is no "one size fits all" when it comes to building companies, and the right answers depend on many factors: it would be interesting to know your choices for example!
I do agree with you that it is important to give back to the ecosystem, but relative to our size and dollars, bauplaners have done and continue to do as much as anyone. All in all, we have shared our ideas with the community in 50+ research papers in top venues (with thousands of citations), and we have quite a few popular open source contributions, with millions of downloads and >10k GitHub stars in total (our FaaS scheduler simulator was just open-sourced with our VLDB 2025 workshop paper).
You can be a good citizen of the Python / AI / database ecosystem without doing open source as a business strategy: the reality is more nuanced I believe!
If you want to dive deeper into one-line reproducibility, you can check our SIGMOD 2024 paper: https://arxiv.org/pdf/2404.13682. Let us know what you think!
Sounds flexible, but what does that actually mean in practice? Are there guardrails to keep things interoperable?
> "engine-agnostic execution"
How does that hold up when switching between, say, pandas and Spark? Are dependencies and semantics actually preserved, or is it up to us to manually patch the gaps every time the backend shifts?
As for all other Python packages, including proprietary ones, the FaaS model lets you declare any package you want in one function (a node in the pipeline DAG) and any other in another: every function is fully isolated, so you can even selectively use pandas 1 in one node, pandas 2 in another, or update the Python interpreter only in node X.
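To make it concrete, here is a two-node sketch in the style of our examples repo (table names are made up; exact decorator signatures are in the docs):

```python
import bauplan

@bauplan.model()
@bauplan.python("3.10", pip={"pandas": "1.5.3"})
def clean_events(data=bauplan.Model("raw_events")):
    # This node runs in its own container, pinned to pandas 1.x.
    df = data.to_pandas()
    return df.dropna()

@bauplan.model()
@bauplan.python("3.11", pip={"pandas": "2.2.2"})
def daily_stats(data=bauplan.Model("clean_events")):
    # A separate, fully isolated container: pandas 2.x and a newer
    # interpreter, with no conflict with the node above.
    df = data.to_pandas()
    return df.groupby("day", as_index=False)["value"].mean()
```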
If you're interested in containerization and FaaS abstractions, this is a good deep dive: https://arxiv.org/pdf/2410.17465
If you're more the practical type, just try out a few runs in the public sandbox, which is free even though we are not GA yet.
simonw•7mo ago
Marimo is pretty new (first release January 2025) but has a high rate of improvement. It's particularly good for WebAssembly stuff - that's been one of their key features almost from the start.
My notes on it so far are here: https://simonwillison.net/tags/marimo/
akshayka•7mo ago
For those new to marimo, we have affordances for working with expensive (ML/AI/pyspark) notebooks too, including lazy execution that gives you guarantees on state without running automatically.
One small note: marimo was actually first launched publicly (on HN) in January 2024 [1]. Our first open-source release was in 2023 (a quiet soft launch). And we've been in development since 2022, in close consultation with Stanford scientists. We're used pretty broadly today :)
[1] https://news.ycombinator.com/item?id=38971966
Peritract•7mo ago
This is one of the key features of Jupyter to me; it encourages quick experimentation.
cantdutchthis•7mo ago
https://youtu.be/4fXLB5_F2rg?si=jeUj77Cte3TkQ1j-
Disclaimer: I work for marimo and I made that video, but the gamepad support is awesome and really shows the flexibility.