That the author avoided saying Python was a bad language outright speaks a great deal of its suitability. Well, that, and the majority data science in practice.
- [1] https://scicloj.github.io
It also helps that in R any function can completely change how its arguments are evaluated, allowing the tidyverse packages to do things like evaluate arguments in the context of a data frame or add a pipe operator as a new language feature. This is a very dangerous feature to put in the hands of statisticians, but it allows more syntactic innovation than is possible in Python.
Julia and Nim [1] are dynamic and static approaches (respectively) to 1 language systems. They both have both user-defined operators and macros. Personally, I find the surface syntax of Julia rather distasteful and I also don't live in PLang REPLs / emacs all day long. Of course, neither Julia nor Nim are impractical enough to make calling C/Fortran all that hard, but the communities do tend to implement in the new language without much prompting.
Often they'd be doing very simplistic querying and then manipulating via DATAstep prior to running whatever modeling and/or reporting PROCs later, rather than pushing it upstream into a far faster native database SQL pull via pass-through.
Back in 2008/2009, I saved 30h+ runtime on a regular report by refactoring everything in SQL via pass-through as opposed to the data scientists' original code that simply pulled the data down from the external source and manipulated it in DATAstep. Moving from 30h to 3m (Oracle backend) freed up an entire FTE to do more than babysit a long-running job 3x a week to multiple times per day.
The focus of SAS and R were primarily limited to data science-related fields; however, Python is a far more generic programming language, thus the number of folks exposed to it is wider and thus the hiring pool of those who come in exposed to Python is FAR LARGER than SAS/R ever were, even when SAS was actively taught/utilized in undergraduate/graduate programs.
As a hiring leader in the Data Science and Engineering space, I have extensive experience with all of these + SQL, among others. Hiring has become much easier to go cross-field/post-secondary experience and find capable folks who can hit the ground running.
In the first place data science is more a label someone put on bag full of cats, rather than a vast field covered by similarly sized boxes.
If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (polars, pandas) which are effectively programming languages for tables.
R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.
The root cause seems to be that we still haven't figured out the best language to use to manipulate tabular data yet (i.e. the way of expressing this). It feels like there's been some convergence on some common ideas. Polars is kindof similar to dplyr. But no standard, except perhaps SQL.
FWIW, I agree that Python is not great, but I think it's also true R is not great. I don't agree with the specific comparisons in the piece.
most procs use tables as both input and output, and you better hope the tables have the correct columns.
you want a loop? you either get an implicit loop over rows in a table, write something using syscalls on each row in a table, or you're writing macros (all text).
Simplifying a lot, R is heavily inspired by Scheme, with some lazy evaluation added on top. Julia is another take at the design space first explored by Dylan.
You're forgetting R's data.table, https://cran.r-project.org/web/packages/data.table/vignettes...,
which is amazing. Tibbles only wins because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.
I suspect that in the fullness of time, mainstream languages will eventually fully incorporate tabular programming in much the same way they have slowly absorbed a variety of idioms traditionally seen as part of functional programming, like map/filter/reduce on collections.
[0] https://en.wikipedia.org/wiki/Q_(programming_language_from_K...
[1] https://ryelang.org/blog/posts/comparing_tables_to_python/
Personally I've found polars has solved most of the "ugly" problems that I had with pandas. It's way faster, has an ergonomic API, seamless pandas interop and amazing support for custom extensions. We have to keep in mind Pandas is almost 20 years old now.
I will agree that Shiny is an amazing package, but I would argue it's less important now that LLMs will write most of your code.
Recently I am seeing that Python is heavily pushed for all data science related things. Sometimes objectively Python may not be the best option especially for stats. It is hard to change something after it becomes the "norm" regardless of its usability.
The author's priorities are sensible, and indeed with that set of priorities, it makes sense to end up near R. However, they're not universal among data scientists. I've been a data scientist for eight years, and have found that this kind of plotting and dataframe wrangling is only part of the work. I find there is usually also some file juggling, parsing, and what the author calls "logistics". And R is terrible at logistics. It's also bad at writing maintainable software.
If you care more about logistics and maintenance, your conclusion is pushed towards Python - which still does okay in the dataframes department. If you're ALSO frequently concerned about speed, you're pushed towards Julia.
None of these are wrong priorities. I wish Julia was better at being R, but it isn't, and it's very hard to be both R and useful for general programming.
Edit: Oh, and I should mention: I also teach and supervise students, and I KEEP seeing students use pandas to solve non-table problems, like trying to represent a graph as a dataframe. Apparently some people are heavily drawn to use dataframes for everything - if you're one of those people, reevaluate your tools, but also, R is probably for you.
BTW AI is not helping and in fact is leading to a generation of scientists who know how to write prompts, but do not understand the code those prompts generate or have the ability to peer review it.
1. Is easy to read
2. Was easy to extend in languages that people who work with scientific data happen to like.
When I did my masters we hacked around in the numpy source and contributed here and there while doing astrophysics.
Stuff existed in Java and R, but we had learned C in the first semester and python was easier to read and contrary to MATLAB numpy did not need a license.
When data science came into the picture, the field was full of physicists that had done similar things. They brought their tools as did others.
My take (and my own experience) is that python won because the rest of the team knows it. I prefer R but our web developers don't know it, and it's way better for me to write code that the rest of our team can review, extend, and maintain.
[Data Preparation] --> [Data Analysis] --> [Result Preparation]
Neither Python or R does a good job at all of these.
The original article seems to focus on challenges in using Python for data preparation/processing, mostly pointing out challenges with Pandas and "raw" Python code for data processing.
This could be solved by switching to something like duckdb and SQL to process data.
As far as data analysis, both Python and R have their own niches, depending on field. Similarly, there are other specialized languages (e.g., SAS, Matlab) that are still used for domain-specific applications.
I personally find result preparation somewhat difficult in both Python and R. Stargazer is ok for exporting regression tables but it's not really that great. Graphing is probably better in R within the ggplot universe (I'm aware of the python port).
I can't help to conclude that Python is as good as R because I still have the choice of using Pandas when I need it. What did I get wrong?
also, we didn't define "good".
Python, the language itself, might not be a great language for data science. BUT the author can use Pandas or Polars or another data-science-related library/framework in Python to get the job done that s/he was trying to write in R. I could read both her R and Pandas code snippets and understand them equally.
This article reads just like, "Hey, I'm cooking everything by making all ingredients from scratch and see how difficult it is!".
Now, is Python a SUCCESSFUL language? Very.
> Either way, I’ll not discuss it further here. I’ll also not consider proprietary languages such as Matlab or Mathematica, or fairly obscure languages lacking a wide ecosystem of useful packages, such as Octave.
I feel, to most programming folks R is in the same category. R is to them what Octave is to the author. R is nice nice, but do they really want to learn a "niche" language, even if it has better some features than Python? Is holding a whole new paradigm, syntax, library ecosystem in your head worth it?
lenerdenator•1h ago
Python doesn't need to be the best at any one thing; it just has to be serviceable for a lot of things. You can take someone who has expertise in a completely different domain in software (web dev, devops, sysadmin, etc.) and introduce them to the data science domain without making them learn an entirely new language and toolchain.
dmurray•1h ago
It's used in data science because it's used in data science.
mohaine•1h ago
Use whatever you want on your one off personal projects but use something more non-data science friendly if you ever want your model to run directly in a production workflow.
Productionizing R models is quite painful. The normal way is to just rewrite it not in R.
lenerdenator•1h ago
vkazanov•56m ago
And it got this unprecedented level of support because right from the start it made its focus clear syntax and (perceived) simplicity.
There is also a sort of cumulative effect from being nice for algorithmic work.
Guido's long-term strategy won over numerous other strong candidates for this role.