My impression was that it's pretty easy to do straightforward things like the examples described in the article. But when you have to do complicated or unusual things with your data, I found it very frustrating to work with. Access to the underlying data was often opaque, and at times it was difficult for me to figure out what was happening under the hood.
Does anyone here know of any research areas still using R?
That's when I realised that the "modern" approach was the one taken in the article, which obviously I had not looked at.
1. Wrap the complicated bits in functions, then force it into the tidyverse model by abusing summarize and mutate.
2. Use data.table. It's very adaptable and handles arbitrary multiline expressions (returning a data.table if the last expression returns a list, otherwise returning the object as-is).
3. Use base R. It's not as bad as people make it out to be. You'll need to learn it anyway if you want to do anything beyond the basics.
So many different types of object, so many different syntaxes. The tidyverse makes sense and, sure, is elegant, but that doesn't help if your colleagues are using base R. Don't even get me started on the docs and Stack Overflow for R. I much prefer, and always will prefer, Python.
The one area where I still go back to R is proper survey work. I've looked for years and haven't found anything equivalent to the survey package for Python. I do like that R tends to start from the assumption that data is weighted.
Fortunately I don't do surveys much anymore.
The big issue isn't necessarily around the jackknife etc. (as you say, pretty trivial, and I think perhaps in statsmodels), but around regression weighting and ensuring compatibility with colleagues' work.
The R survey package, Stata and SPSS all support things like survey design in their regressions; Python does not out of the box. Even simple things like weighted frequencies end up requiring some pretty awkward code in Python.
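For example (a hypothetical sketch with made-up column names; pandas has no weighted value_counts, so you end up summing the weights by hand):

import pandas as pd

# Hypothetical respondent-level data with a design weight per row
df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "weight": [1.2, 0.8, 1.5, 1.1, 0.9],
})

# Weighted frequencies: sum the design weights within each category,
# then normalize to get weighted percentages
weighted_n = df.groupby("region")["weight"].sum()
weighted_pct = 100 * weighted_n / weighted_n.sum()
print(weighted_pct)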
I can imagine a pythonic survey package that extends pandas and statsmodels, but as you say, survey people use R and there's just not a scene for it.
Past life for me anyway, surveys are not my bag.
Then you probably need to get new colleagues.
Also, I recommend trying Ibis. It was originally created by the creator of pandas and solves so many of these issues.
It seems like Ibis uses DuckDB as its backend (by default) and has Polars support as well. Given that, maybe see if Ibis works better for you than Polars. If you very specifically need Polars, using it directly will for sure be better. DuckDB is faster than Polars and has great Polars support, so depending on how Ibis is implemented it might be "better" than Polars as a dataframe lib.
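Something like this, for instance (a rough sketch with assumed column names; ibis.memtable expressions are executed by DuckDB by default):

import ibis

# Toy stand-in for the purchases data from the article
t = ibis.memtable({
    "country": ["SE", "SE", "DK"],
    "amount": [100.0, 200.0, 50.0],
    "discount": [0.0, 10.0, 5.0],
})

t = t.mutate(total=t.amount - t.discount)  # add a derived column
result = (
    t.group_by("country")
    .aggregate(total=t.total.sum())  # per-country totals
    .execute()  # materializes the result as a pandas DataFrame
)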
Whether or not DuckDB is faster than Polars depends on the query and data size. I've spent a large portion of the last 2 years building a new execution engine for and optimizing Polars, and it shows: https://pola.rs/posts/benchmarks/.
https://pypi.org/project/narwhals/#description
I tried really hard to use Ibis, but I ran into issues where it was way easier to do some things in pandas/Polars, and I had to keep dropping out of Ibis to make them work, so I gave up on it for the time being.
What Python desperately needs is a coordinated effort for a core data science / scientific computing stack with a unified framework.
In my opinion, if it weren't for Python's extensive use in industry and its package ecosystem, Julia would be the language of choice for nearly all data science and scientific computing uses.
That's my impression as well. Going back to the topic of the original post, pandas only partially implements the idioms of the tidyverse, so you have to mix in a lot of different forms of syntax (with lambdas to boot) to get things done. Julia is much nicer, but I find myself using PythonCall more often than I'd like.
Scipy was originally supposed to provide the scientific computing stack, but then many offshoots in the direction of pandas / ibis / JAX, etc. happened. I guess that's what you get with a community-based language. MATLAB has its warts but MathWorks does manage to present a coherent stack on that end.
In fairness, if you're not touching Pandas, it's pretty good I'd say. Everything is based around numpy and scipy. Sklearn API is a bit idiosyncratic but works really nicely in practice and is extensible. JAX has an API which is 1:1 equivalent to numpy, probably with some catches but still. All the trouble starts with pandas.
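To illustrate the numpy/JAX point, a tiny sketch (one of the catches is that JAX arrays are immutable, so in-place updates need .at[...].set(...)):

import numpy as np
import jax.numpy as jnp

x = np.linspace(0.0, 1.0, 5)

# The same call works in both, with matching semantics
print(np.sum(x ** 2))
print(jnp.sum(jnp.asarray(x) ** 2))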
Pandas is pretty terrible IMO for all the reasons listed by OP and TFA - and more.
A few years ago I made a package called "redframes" that tried to "solve" all of my frustrations with pandas and make data wrangling feel more like R, while retaining all the best bits of Python...
Alas, it never really took off. For those curious: https://github.com/maxhumber/redframes
There is so much hype and luck to widespread adoption, you never know with these things.
(I've never used R myself, but certainly have some very strong opinions about Pandas after having written 3 books about it.)
import pandas as pd
purchases = pd.read_csv("https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-from-r/purchases.csv")
(purchases
.assign(country_median=lambda df:
df.groupby("country")["amount"].transform("median")
)
.query("amount <= country_median * 10")
.groupby("country")
.assign(total=lambda df: (df["amount"] - df["discount"]).sum())
)
but it seems 'DataFrameGroupBy' objects have no attribute 'assign', so it's not that simple, though with a slight re-ordering of the chain operations it works, cf. https://news.ycombinator.com/item?id=44236487
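One such re-ordering (a sketch: compute the row-level total with assign before grouping, so assign is only ever called on a plain DataFrame):

(purchases
    .assign(country_median=lambda df:
        df.groupby("country")["amount"].transform("median")
    )
    .query("amount <= country_median * 10")
    .assign(total=lambda df: df["amount"] - df["discount"])
    .groupby("country")["total"]
    .sum()
)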
It's a domain-specific language that makes pipelining a first-class citizen and compiles into various flavors of SQL... but it's also a fully fleshed-out VS Code environment that dynamically checks typing against live DB schemas and lets you represent your entire semantic layer in incredibly terse code with type hints and error bars. It's being actively developed and was created by a founder of Looker.
While it's still experimental, it's very usable, particularly if you export the compiled SQL into other BI tools, and visualization tools are being developed incredibly rapidly.
import pandas as pd
purchases = pd.read_csv("purchases.csv")
(
purchases.loc[
lambda x: x["amount"] < 10 * x.groupby("country")["amount"].transform("median")
]
.eval("total=amount-discount")
.groupby("country")["total"]
.sum()
)

(
purchases.loc[
lambda x: x["amount"] < 10 * x.groupby("country")["amount"].transform("median")
]
.assign(total=lambda df: df["amount"] - df["discount"])
.groupby("country")["total"]
.sum()
.reset_index() # to produce a DataFrame result
)

```
purchases[amount <= median(amount)*10][, .(total = sum(amount - discount)), by = .(country)][order(country)]
```

- no quotes needed
- no loc needed
- only 1 groupby needed
Feb 22, 2024, 71 points, 25 comments: https://news.ycombinator.com/item?id=39468737
Feb 20, 2024, 20 points, 6 comments: https://news.ycombinator.com/item?id=39438491
I will say that Python has the `datatable` package by H2O that is close to R's `data.table` in elegance. I wish it had the widespread adoption that Pandas enjoys.
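For those who haven't used it, a minimal sketch of its data.table-style DT[i, j, by] syntax (toy data, column names made up):

import datatable as dt
from datatable import f, by

DT = dt.Frame(
    country=["SE", "SE", "DK"],
    amount=[100.0, 200.0, 50.0],
    discount=[0.0, 10.0, 5.0],
)

# One expression: per-country totals, grouped via by()
result = DT[:, {"total": dt.sum(f.amount - f.discount)}, by("country")]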
But recently I've been working with much larger-scale data than R can handle (thanks to R's base int32 limitation) and have been needing to use Python instead.
Polars feels much more intuitive and similar to `dplyr` to me for table processing than Pandas does.
I often ask my LLM of choice to “translate this dplyr call to Polars” as I’ve been learning the Polars syntax.
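A typical translation looks something like this (a sketch with made-up data; the dplyr original is shown as a comment):

import polars as pl

purchases = pl.DataFrame({
    "country": ["SE", "SE", "DK"],
    "amount": [100.0, 200.0, 50.0],
    "discount": [0.0, 10.0, 5.0],
})

# dplyr: purchases |> group_by(country) |> summarise(total = sum(amount - discount))
result = (
    purchases
    .group_by("country")
    .agg(total=(pl.col("amount") - pl.col("discount")).sum())
)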
This is one of those decisions that I just do not understand. In your mind, why do you imagine a set of improvements won’t be made?
Otherwise, for now, working with Python and R using the reticulate package in Quarto is perfect for my needs.
If the Positron IDE could get in-line plot visualization in Quarto documents like the RStudio IDE has, I’d be the happiest camper.
The problem is not technical. Let's just leave it at that.