They even screw that up inside the tiny function that populates it. If anything IMO, they over-value names immensely (which makes sense, given how they work, and how broadly consistent programmers are with naming).
In my experience they do overvalue var names, but they value comments even more. So I tend to calibrate these things with more detailed comments.
Another thing that matters massively in Python is highly accurate, clear, and sensible type annotations. In contrast, incorrect type annotations can throw off the LLM.
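For instance (a hypothetical function, not one from the thread), an accurate annotation makes the data flow unambiguous, while an inaccurate one would point the model in exactly the wrong direction:

```python
# Hypothetical example: the annotations state exactly what flows in and out.
# Annotating the return as list[str] here would be the kind of inaccurate
# hint that throws the model off.
def parse_timestamps(lines: list[str]) -> list[int]:
    """Extract the integer Unix timestamp from the start of each CSV line."""
    return [int(line.split(",")[0]) for line in lines]
```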
Until AI is compiling straight to machine language, code needs to be readable.
Comments lie. Names lie. Code is the only source of truth.
If you believe your reductive argument, your function and variable names would all be minimally descriptive, right?
“sleep 1” is the complete expression. Because sleep takes a parameter measured in seconds, it’s already understood.
You do not need “delay_in_seconds = 1” and then a separate “sleep delay_in_seconds”. That accomplishes nothing; you might as well add a comment like “// seconds” if you want some kind of clarity.
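In Python terms (time.sleep also takes seconds), the two styles being compared look like:

```python
import time

# The argument above: this is already the complete expression.
time.sleep(1)  # sleeps 1 second

# The style being argued against: the intermediate name adds nothing here.
delay_in_seconds = 1
time.sleep(delay_in_seconds)
```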
Many bugs come from writing something that does not match intent. For example, someone writes most of their code in another language where `sleep` takes milliseconds, they meant to check the docs when they wrote it in this language, but the alarm for the annual fire drill went off just as they were about to check. So it went in as `sleep 1000` in a branch of the code that only runs occasionally. Years later, did they really mean 16 minutes and 40 seconds, or did they mean 1 second?
Leaving clues about intent helps detect such issues in review and helps debug the problems that slip through review. Comments are better than nothing, but they are easier to ignore than variable names.
If the code is working, the intent also doesn’t matter; what was written is what was intended.
Do the requirements call for an alarm of 16 minutes 40 seconds? Then leave the code be. If not, just change it.
https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...
Types help but they don't help "at a glance". In editors that have type info you have to hover over variables or look elsewhere in the code (even if it's up several lines) to figure out what you're actually looking at. In "app" hungarian this problem goes away.
"Safe strings and unsafe strings have the same type - string - so we need to give them different naming conventions." I thought "Surely the solution is to give them different types instead. We have a tool to solve this, the type system."
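A minimal sketch of that alternative in Python (the names here are hypothetical), using typing.NewType so a checker can tell the two apart even though both are plain str at runtime:

```python
from typing import NewType

UnsafeString = NewType("UnsafeString", str)  # raw, user-supplied text
SafeString = NewType("SafeString", str)      # HTML-escaped text

def escape(s: UnsafeString) -> SafeString:
    # Escape & first so later replacements aren't double-escaped.
    return SafeString(
        s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    )

def render(s: SafeString) -> str:
    # A type checker rejects render(UnsafeString("<b>hi</b>")) at this boundary.
    return s
```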
"Operator overloading is bad because you need to read the entire code to find the declaration of the variable and the definition of the operator." I thought "No, just hit F12 to jump to definition. (Also, doesn't this apply to methods as well, not just operators?) We have a tool to solve this, the IDE."
If it really does turn out that the article's way is making a comeback 20 years later... How depressing would that be? All those advances in compilers and language design and editors thrown out, because LLMs can't use them?
It's the same with the articles about how to work with these tools. A long list of coding best practices followed by a totally clueless "wow once I do all the hard work LLMs generate great code every time!"
1. Having clear requirements with low ambiguity.
2. Giving a few input/output pairs on how something should work (few-shot prompting).
3. Avoiding useless information. Be concise.
4. Avoiding contradictory information or distractors.
5. Breaking complex problems into more manageable pieces.
6. Providing goals and style guides.
A.K.A. it's just good engineering.
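A sketch of what points 1, 2, and 6 look like in practice (the task and wording here are hypothetical):

```python
# Hypothetical prompt combining clear requirements, few-shot
# input/output pairs, and an explicit style guide.
prompt = """Write a Python function format_duration(seconds: int) -> str.

Requirements: minutes and seconds only, no zero-padding.

Examples:
  format_duration(90) -> "1m30s"
  format_duration(61) -> "1m1s"
  format_duration(5)  -> "0m5s"

Style: PEP 8, descriptive names, no abbreviations.
"""
print(prompt)
```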
The paper is totally mum on how "descriptive" names (e.g. process_user_input) differ from "snake_case" names (e.g. process_user_input).
The actual question here is not about the model but merely about the tokenizer: is it the case that e.g. process_user_input encodes into 5 tokens, ProcessUserInput into 3, and calcpay into 1? If you don't break down the problem into simple objective questions like this, you'll never produce anything worth reading.
Which is the exact kind of information that you want to know.
It is very non-obvious which one will use more tokens; the Gemma tokenizer has the highest variance with process|_|user|_|input = 5 tokens and Process|UserInput as 2 tokens.
In practice, I'd expect the performance difference to be relatively minimal, as input tokens tend to get aggregated quickly into more general concepts. But that's the kind of question that's worth getting metrics on: my intuition suggests one answer, but do the numbers hold up when you actually measure it?
Adversarially named variables. As in, variables that are named something that is deliberately wrong and misleading.
import json as csv

close = open
with close("dogs.yaml") as socket:
    time = csv.loads(socket.read())
for sqlite3 in time:
    pass  # I dunno, more horrifying stuff
yakubov_org•4d ago
I ran an experiment to find out, testing 8 different AI models on 500 Python code samples across 7 naming styles. The results suggest that descriptive variable names do help AI code completion.
Full paper: https://www.researchsquare.com/article/rs-7180885/v1
amelius•13h ago
Perhaps it will make them more intelligent ...
jerf•12h ago
AIs are finite. If they're burning brainpower on determining what "x" means, that's brainpower they're not burning on your actual task. It is no different than for humans. Complete with all the considerations about them being wrong, etc.
amelius•12h ago
Also, I think this is anthropomorphizing the llms a bit too much. They are not humans, and I'd like to see an experiment on how well they perform when trained with randomized var names.
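The data prep for that experiment is straightforward to sketch: rewrite source code so every non-builtin identifier becomes an opaque one (the helper name here is hypothetical):

```python
import ast
import builtins

class RandomizeNames(ast.NodeTransformer):
    """Replace every non-builtin identifier with an opaque v0, v1, ..."""

    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        if hasattr(builtins, node.id):  # keep print, len, etc. working
            return node
        node.id = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
        return node

src = "total = price * quantity\nprint(total)"
stripped = ast.unparse(RandomizeNames().visit(ast.parse(src)))
print(stripped)
```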
recursive•9h ago
When you take out the information from the variable names, you're making the training data farther from real-world data. Practicing walking on your hands, while harder than walking on your feet, won't make you better at hiking. In fact, if you spend your limited training resources on it, the opportunity cost might make you worse.
socalgal2•10h ago
> I just think your input data is more likely to resemble training data with meaningful variable names.
Based on my experience giving job interviews, cryptic names are common.
appreciatorBus•8h ago
I am far from an AI booster or power user but in my experience, I get much better results with descriptive identifier names.
ACCount36•8h ago
They are also known to operate on high-level abstractions and concepts - unlike systems operating strictly on formal logic, and very much like humans.
fenomas•11h ago
When I tried it once the model did a surprisingly good job, though it was quite a while ago and with a small model by today's standards.
knome•11h ago
better to not, I think.
empath75•10h ago
But every time you make an AI think you are introducing an opportunity for it to make a mistake.