I don’t think so - probably more in the realm of Spark and, based on the roadmap, Airflow.
For me it would be about doing big data analytics / dashboarding / ML or DS data prep.
My understanding is that Snowflake plays a lot in the data warehouse/lakehouse space, so is more central to data ops / cataloguing / SSOT type work.
But hey that’s all first impressions from the press release.
How does billing with "Deploy on AWS" work? Do I need to bring my own AWS account, with Polars being paid for the image through AWS, or am I billed by Polars and they pass a share to AWS? In other words, do I have a contract primarily with AWS or with Polars?
import polars_cloud as pc

pc.ComputeContext(
    cpus=4,
    memory=16,
)
We are working on a minimal cluster and auto-scaling based on the query.

Ritchie, curious - you mentioned in other responses that the SQL context stuff is out of scope for now. But I thought the SQL things were basically syntactic sugar over the DataFrame API - in other words, they both "compile" down to the same thing. If true, then being able to run arbitrary SQL queries should be doable out of the box?
However, this should happen during IR resolving. E.g. the SQL should translate directly to Polars IR, and not to LazyFrames. That way we can inspect/resolve all schemas server-side.
It requires a rewrite of our SQL translation in OSS. This should not be too hard, but it is quite some work - work we will eventually get to.
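For reference, a minimal sketch of the current OSS behaviour being referred to, where SQL compiles client-side into the same lazy plan as the DataFrame API (table name and data are made up):

```
import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# SQL is translated client-side into a LazyFrame, i.e. the same logical plan
# the DataFrame API builds.
ctx = pl.SQLContext(t=lf)
out = ctx.execute("SELECT a, b FROM t WHERE a > 1")  # returns a LazyFrame

print(out.explain())  # same optimized plan as lf.filter(pl.col("a") > 1)
print(out.collect())
```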
Polars Cloud maps the Polars API/DSL to distributed compute. This is more akin to Spark's high-level DataFrame API.
With regard to implementation, we create stages that run parts of Polars IR (internal representation) on our OSS streaming engine. Those stages run on one or many workers and produce data that is shuffled between stages. The scheduler is responsible for creating the distributed query plan and for distributing the work.
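As a rough illustration of where such stage boundaries arise, using the OSS lazy API (the dataset path and query are made up; the mapping to stages in the comments is a simplification, since the actual split is up to the scheduler):

```
import polars as pl

lf = (
    pl.scan_parquet("s3://bucket/events/*.parquet")  # hypothetical dataset
    .filter(pl.col("country") == "NL")               # stage 1: scan + filter, runs per worker
    .group_by("user_id")
    .agg(pl.col("amount").sum())                     # stage 2: aggregation, after intermediate
                                                     # data is shuffled on user_id
)

print(lf.explain())  # single-node plan; the scheduler splits this IR into distributed stages
```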
Still don't get why one of the biggest players in the space, Databricks, is overinvesting in Spark. For startups, Polars or DuckDB are completely sufficient. Other companies like Palantir already support bring-your-own-compute.
Sometimes. But sometimes Python is just much easier. For example, transposing rows and columns.
- Polars (Pola.rs) - the DataFrames library that now has a cloud version
- Polar (Polar.sh) - Payments and MoR service built on top of Stripe
It's a common name.
Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what an m6.xlarge is and how it differs from a c6g.large.
Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.
Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.
If you use Athena you still have to worry about shuffling and joining; it is just hidden. It is Trino / Presto under the hood, and if you click explain you can see the execution plan, which is essentially the same as looking into the Spark UI.
Who cares about JVM versions nowadays? No one is hosting Spark themselves.
Literally every tool now supports DataFrame AND SQL APIs, and to me there is no reason to pick up SQL if you are familiar with a little bit of Python.
Yes, I did write tests and no, I did not write 1000-line SQL (or any SQL for that matter). But I could see analysts struggle and I could see other people in other orgs just firing off simple SQL queries that did the same as non-portable Python mess that we had to keep alive. (Not to mention the far superior performance of database queries.)
But I knew how this all came to be - a manager wanted to pad their resume with some big data acronyms, and as a result we spent way too much time and money migrating to an architecture that made everyone worse off.
Cluster configuration is optional; it's there if you want this control. Anyhow, this doesn't have much to do with the query API, be it SQL or DataFrame.
That said, the last Python code I wrote as a data engineer was to run tests on an SQL database, because the equivalent in SQL would have been tens of thousands of lines of wallpaper code.
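A hypothetical sketch of that kind of test harness - the table names, key columns, and database file are made up; the point is that one parametrized Python test replaces a hand-written SQL assertion per table:

```
import sqlite3
import pytest

# Hypothetical tables and their primary key columns.
TABLES_WITH_KEYS = [("orders", "order_id"), ("customers", "customer_id")]

@pytest.fixture
def conn():
    return sqlite3.connect("warehouse.db")  # assumed database file

@pytest.mark.parametrize("table,key", TABLES_WITH_KEYS)
def test_primary_key_is_unique(conn, table, key):
    dupes = conn.execute(
        f"SELECT COUNT(*) FROM "
        f"(SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1) AS d"
    ).fetchone()[0]
    assert dupes == 0
```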
But coming into such a discussion dunking on a tool cuz it’s not for the masses makes no sense.
So I'm very much advocating for people to "[u]se whatever tools work best".
(That is - now I'm doing this. In the past I taught a course on pandas data analytics and spoke at a few PyData conferences and meetups, partly about dataframes and how useful they are. So I'm very much guilty of all of the above.)
I think this part (query optimization) is in general not solved/solvable, and it is sometimes/often (depending on the domain) necessary to dig into the details to make a data transformation work.
Why is the dataframe approach getting hate when you’re talking about runtime details?
I get that folks appreciate the almost conversational feel of SQL vs. that of the dataframe API, but the other points make no difference.
If you're a competent dev/data person and are productive with the dataframe API, then yay. Also, setup and creating test data and such - it's all objects and functions after all; if anything it's better than the horribad experience of ORMs.
So I'm definitely a fan, IF you need the DataFrame API. My point was that most people don't need it and it's oftentimes standing in the way. That's all.
Yes, I know spark and scala exist. I use it. But the underlying Java engines and the tacky Python gateway impact performance and capacity usage. Having your primary processing engine in the same process compiled natively always helps.
I've been using Malloy [2], which compiles to SQL (like TypeScript compiles to JavaScript), so instead of editing a 1000-line SQL script, it's only 18 lines of Malloy.
I'd love to see a blog post comparing a pandas approach to cleaning to an SQL/Malloy approach.
[1] https://www.youtube.com/watch?v=PFUZlNQIndo
[2] https://www.malloydata.dev/
That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.
Disclaimer: I work for Polars on said query execution.
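A small sketch of what that lazy API and predicate pushdown look like in practice (the file name is hypothetical):

```
import polars as pl

lf = (
    pl.scan_parquet("events.parquet")     # hypothetical file
    .filter(pl.col("amount") > 100)
    .select("user_id", "amount")
)

# The optimized plan shows the predicate and projection pushed into the scan,
# so only the needed columns/rows are read from the file.
print(lf.explain())
df = lf.collect()
```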
I have used Polars to process 600M of XML files (with a bit of a hack), and the Polars part of the code is readable with minimal comments.
Polars has a better API than pandas; at least the intent is easier to understand. (Laziness, yay.)
This article explains it pretty well: https://dynomight.net/numpy/
The original comment I responded to was confusing Pandas with Polars, and now your blog post refers to Numpy, but Polars takes a completely different approach to dataframes/data processing than either of these tools.
Take two examples of dataframe APIs, dplyr and ibis. Both can run on a range of SQL backends because dataframe APIs are very similar to SQL DML APIs.
Moreover, the SQL translation tools for pivot_longer in R are a good illustration of the complex dynamics dataframe APIs can support - the kind of thing you'd use something like dbt to implement in your SQL models. DuckDB allows dynamic column selection in unpivot, but in some SQL dialects this is impossible. Dataframe-API-to-SQL tools (or dbt) enable it in those dialects.
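For comparison, the same kind of dynamic column selection in Polars' Python API, as a toy sketch (the data and selector are made up):

```
import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})

# A "pivot_longer": unpivot whichever columns match the selector,
# without hard-coding the list of value columns.
long = df.unpivot(index="id", on=cs.starts_with("q"))
print(long)
```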
However, before that, you need a lot of code to clean the data, and raw data does not fit well into a structured RDBMS. Here you choose to map your raw data into either a row view or a table view. You're now left with the choice of either inventing your own domain object (row view) or using a dataframe (table view).
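A toy sketch of the two choices - the record fields and the cleaning rule are made up:

```
from dataclasses import dataclass
import polars as pl

raw = [{"id": "1", "amount": "12.5"}, {"id": "2", "amount": "n/a"}]

# Row view: your own domain object per record.
@dataclass
class Payment:
    id: int
    amount: float | None

rows = [
    Payment(int(r["id"]), float(r["amount"]) if r["amount"] != "n/a" else None)
    for r in raw
]

# Table view: a dataframe with column-wise cleaning.
table = pl.DataFrame(raw).with_columns(
    pl.col("id").cast(pl.Int64),
    pl.col("amount").cast(pl.Float64, strict=False),  # "n/a" becomes null
)
```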
insert obama awards obama meme
I did it with pandas, without much experience with it and with a lot of AI help (essentially to fill in the blanks the data scientists had left, because they only had to do the calculation once).
I then created a Polars version which uses LazyFrames. It ended up being about 20x faster than the first version. I did try to do some optimizations by hand to make the execution planner work even better, which I believe paid off.
If you have to do a large non-interactive analytical calculation (i.e. not in a notebook), Polars seems to be way ahead imo!
I do wish that it was just as easy to use as a Rust library, though. The focus, however, seems to be mainly on being competitive in Python land.
It is where most of our user base is, and it is very hard for us to have a stable Rust API, as we have a lot of internal moving parts which Rust users typically want access to (as they like to be closer to the metal) but which have no stability guarantees from us.
In Python, we are able to abstract and provide a stable API.
I say this as a user of neither - just that I don’t see any inherent validity to that statement.
If you are saying Rust consumers want something lower level than you’re willing to make stable, just give them a higher level one and tell them to be happy with it because it matches your design philosophy.
```
df.with_column(
    map_multiple(
        |columns| {
            let col1 = columns[0].i32()?;
            let col2 = columns[1].str()?;
            let col3 = columns[2].f64()?;
            let out: StringChunked = col1
                .into_iter()
                .zip(col2)
                .zip(col3)
                .map(|((x1, x2), x3)| {
                    let (x1, x2, x3) = (x1?, x2?, x3?);
                    Some(func(x1, x2, x3))
                })
                .collect();
            Ok(Some(out.into_column()))
        },
        [col("a"), col("b"), col("c")],
        GetOutput::from_type(DataType::String),
    )
    .alias("new_col"),
);
```
Not much Polars can do about that in Rust; that's just what the language requires. But in Python it would look something like:
```
df.with_columns(
    pl.struct("a", "b", "c")
    .map_elements(
        lambda row: func(row["a"], row["b"], row["c"]),
        return_dtype=pl.String,
    )
    .alias("new_col")
)
```
Obviously the performance is nowhere close to comparable because you're calling a python function for each row, but this should give a sense of how much cleaner Python tends to be.
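For what it's worth, if `func` can be expressed with native expressions (assumed here to just join the three values with dashes), the per-row Python call can be avoided entirely - a sketch:

```
df.with_columns(
    pl.format("{}-{}-{}", "a", "b", "c").alias("new_col")
)
```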
I'm ignorant about the exact situation in Polars, but it seems like this is the same problem that web frameworks have to handle to enable registering arbitrary functions, and they generally do it with a FromRequest trait and macros that implement it for functions of up to N arguments. I'm curious if there were attempts that failed for something like FromDataframe to enable at least `|c: Col<i32>("a"), c2: Col<f64>("b")| {...}`.
https://github.com/tokio-rs/axum/blob/86868de80e0b3716d9ef39...
https://github.com/tokio-rs/axum/blob/86868de80e0b3716d9ef39...
1. There are no variadic functions so you need to take a tuple: `|(Col<i32>("a"), Col<f64>("b"))|`
2. Turbofish! `|(Col::<i32>("a"), Col::<f64>("b"))|`. This is already getting quite verbose.
3. This needs to be general over all expressions (such as `col("a").str.to_lowercase()`, `col("b") * 2`, etc), so while you could pass a type such as Col if it were IntoExpr, its conversion into an expression would immediately drop the generic type information because Expr doesn't store that (at least not in a generic parameter; the type of the underlying series is always discovered at runtime). So you can't really skip those `.i32()?` calls.
Polars definitely made the right choice here — if Expr had a generic parameter, then you couldn't store Expr of different output types in arrays because they wouldn't all have the same type. You'd have to use tuples, which would lead to abysmal ergonomics compared to a Vec (can't append or remove without a macro; need a macro to implement functions for tuples up to length N for some gargantuan N). In addition to the ergonomics, Rust’s monomorphization would make compile times absolutely explode if every combination of input Exprs’ dtypes required compiling a separate version of each function, such as `with_columns()`, which currently is only compiled separately for different container types.
The reason web frameworks can do this is because of `$( $ty: FromRequestParts<S> + Send, )*`. All of the tuple elements share the generic parameter `S`, which would not be the case in Polars — or, if it were, would make `map` too limited to be useful.
EDIT: never mind, see the same question elsewhere in this thread. The answer is no!
What is wrong with you DB people :))).
I basically ditched SQL for most of my analytical work because it's way easier for my juniors to understand (we're not technically a tech team), so it's a total win in my eyes.
willvarfar•2d ago
It feels like we are on the path to reinventing BigQuery.
ritchie46•2d ago
Polars Cloud will for the moment only support our DataFrame API. SQL might come later on the roadmap, but since this market is very saturated, we don't feel there is much need there.