Anyone have similar stories? Curious about cases where knowing your domain beat throwing compute at the problem.
I needed to test pumping water through a special tube, but didn’t have access to a pump. I spent days searching how to rig a pump to this thing.
Then I remembered I could just hang a bucket of water up high to generate enough head pressure. Free instant solution!
You can't be "agile" with them; you need to design your data storage up front. Like a system design interview :).
Just use Postgres (or friends) until you are webscale. Unless you really have a problem amenable to key/value storage.
Interesting to hear now that the opinion is the opposite.
On another note, I recently wrote a large single-page app that is just a collection of functions organized by page section, following a nearly flat TypeScript interface. It's stupid simple to follow in the code and loads in as little as an eighth of a second. Of course, that didn't stop HN users from crying like children because I avoided their favorite framework.
This gave a great impression of an intelligent adversary with very minimal code and low CPU overhead.
Unless enemies have entirely non-functional pathing. Then it's just funny.
My “dumb” solution is a little Ansible job that just runs a git pull on the server. It gets the new code and I’m done. The job also has an option to set everything up, so if the server is wiped out for some reason I can be back up and running in a couple minutes by running the job with a different flag.
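Roughly, the shape of such a job might look like this (a hypothetical sketch; the repo URL, paths, and flag name are made up):

```yaml
# deploy.yml - run normally to deploy; pass -e do_setup=true after a wipe
- hosts: app
  tasks:
    - name: Full provisioning (packages, services) when requested
      include_tasks: setup.yml
      when: do_setup | default(false)

    - name: Deploy latest code with a plain git pull
      ansible.builtin.git:
        repo: "https://example.com/me/app.git"
        dest: /srv/app
        version: main
```

The `ansible.builtin.git` module is idempotent, so re-running the job is always safe.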
I suspect that locating the referenced comment would require a semantic search system that incorporates "fancy models with complex decision boundaries". A human applying simple heuristics could use that system to find the comment.
In the "Dictionary of Heuristic" chapter, Polya's "How to Solve It" says this: *The feeling that harmonious simple order cannot be deceitful guides the discoverer both in the mathematical and in the other sciences, and is expressed by the Latin saying simplex sigillum veri (simplicity is the seal of truth).*
There are "smarter" solutions like radix tries, hash tables, or even skip lists, but for any design choice, you also have to examine the tradeoffs. A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.
I guess the moral of the story is to just examine all your options during the design stage. Machine learning solutions are just that, another tool in the toolbox. If another simpler and often cheaper solution gets the job done without all of that fuss, you should consider using it, especially if it ends up being more reliable.
To figure that out, I remember searching for articles on how to implement inverted indices. Once I had a list of candidate strategies and data structures, I used Wikipedia supplemented by some textbooks like Skiena's [2] and occasionally some (somewhat outdated) information from NIST [3]. I found Wikipedia quite detailed for all of the data structures for this problem, so it was pretty easy to compare the tradeoffs between different design choices here. I originally wanted to implement the inverted index as a hash table but decided to use a trie because it makes wildcard search easier to implement.
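As a toy illustration of the inverted index idea (not the project's actual code, and the document text below is made up):

```python
from collections import defaultdict

# Toy inverted index: map each term to the set of document ids that
# contain it; an AND query is just an intersection of postings sets.
docs = {
    1: "simple solutions often win",
    2: "simple data structures",
    3: "machine learning solutions",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    # Intersect the postings set of every query term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

Keying the term dictionary with a trie instead of a hash table is what makes prefix/wildcard queries like `sol*` cheap, which is the tradeoff described above.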
After I developed most of the backend, I looked for books on "information retrieval" in general. I found a history book (Bourne and Hahn 2003) on the development of these kinds of search systems [4]. I read some portions of this book, and that helped confirm many of the design choices that I made. I was actually just doing what people traditionally did when they first built these systems in the 1960s and 1970s, albeit with more modern tools and much more information on hand.
The harder part of this project for me was writing the interpreter. I actually found YouTube videos on how to write recursive descent parsers to be the most helpful there, particularly this one [5]. Textbooks were too theoretical and not concrete enough, though Crafting Interpreters was sometimes helpful [6].
[1] https://en.wikipedia.org/wiki/Inverted_index
[2] https://doi.org/10.1007/978-3-030-54256-6
[3] https://xlinux.nist.gov/dads/
[4] https://doi.org/10.7551/mitpress/3543.001.0001
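For anyone curious, the overall shape of a recursive descent parser is one function per grammar rule. A generic arithmetic-expression toy (not the project's actual interpreter):

```python
import re

# Grammar, one method per rule:
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'
class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|[+\-*/()]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.eat()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        if self.peek() == "(":
            self.eat()
            value = self.expr()
            self.eat()  # consume the closing ")"
            return value
        return int(self.eat())
```

Each `while` loop gives left associativity, and precedence falls out of which rule calls which.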
> There are "smarter" solutions like... hash tables.... A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.
Strangely, my own software-related answer is the opposite for the same reason.
I was implementing something for which I wanted to approximate a https://en.wikipedia.org/wiki/Shortest_common_supersequence , and my research at the time led me to a trie-based approach. But I was working in Python, and didn't want to actually define a node class and all the logic to build the trie, so I bodged it together with a dict (i.e., a hash table).
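The dict-as-trie bodge might look something like this (a minimal sketch, not the original code; the `"$"` sentinel is an assumption that it never appears in the input):

```python
# Each trie node is a plain dict mapping a character to a child node;
# a sentinel key marks end-of-word, so no node class is needed.
END = "$"  # assumption: "$" never occurs in the stored strings

def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})  # create child lazily
    node[END] = True

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ["cat", "car", "card"]:
    insert(trie, w)
```

`setdefault` does all the node-creation work that a hand-rolled node class would otherwise need.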
And then ChatGPT comes back with the appropriate command and runs it.
It underperformed banning the word "password" from a Google Form.
So that's what they went with.
This generalises to a few situations where going faster just doesn't matter. For example, for many CLI tools it matters whether they finish in 1s or 10s. But once you get to 10ms vs 100ms, you can ask, "is anyone ever likely to run this in a loop on a massive amount of data?" And if the answer is yes, "should they write their own optimised version then?"
The (deliberately) very limited analytics software I wrote for my personal website[0] could have used a database, but I didn't want to add a dependency to what was a very simple project, so I hacked up an in-memory data structure that periodically dumps itself to disk as a JSON file. This gives persistence across reboots, and at a pinch I can just edit the file with a text editor.
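The dump-to-JSON approach might be sketched like this (a hypothetical minimal version, not the actual software; file name and fields are made up):

```python
import atexit
import json
import threading

class Store:
    """Tiny in-memory hit counter that snapshots itself to a JSON file."""

    def __init__(self, path="analytics.json"):
        self.path = path
        self.lock = threading.Lock()
        try:
            with open(path) as f:        # reload state: persistence across reboots
                self.data = json.load(f)
        except FileNotFoundError:
            self.data = {}
        atexit.register(self.flush)      # best-effort flush on shutdown

    def hit(self, page):
        with self.lock:
            self.data[page] = self.data.get(page, 0) + 1

    def flush(self):
        # Called periodically (e.g. from a timer) and at exit.
        with self.lock:
            with open(self.path, "w") as f:
                json.dump(self.data, f, indent=2)  # indent keeps it hand-editable
```

The `indent=2` is what makes the "edit the file with a text editor" escape hatch pleasant.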
Game design is filled with "stupid" ideas that work well. I wrote a text-based game[1] that includes Trek-style starship combat. I played around with a bunch of different ideas for enemy AI before just reverting to a simple action drawn off the top of a small deck. It's a very easy system to balance and expand, and just as fun for the player.
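A minimal sketch of the deck idea (action names are made up): weighting is done by how many copies of a card go into the deck, which is what makes it so easy to balance and expand.

```python
import random

class ActionDeck:
    """Enemy 'AI' as a small action deck: draw off the top, reshuffle when empty."""

    def __init__(self, cards, rng=random):
        self.cards = list(cards)  # the full deck definition
        self.rng = rng
        self.pile = []            # current shuffled draw pile

    def draw(self):
        if not self.pile:
            self.pile = list(self.cards)
            self.rng.shuffle(self.pile)
        return self.pile.pop()

# Two "fire" cards make firing twice as likely as evading or repairing.
enemy = ActionDeck(["fire", "fire", "evade", "repair"])
```

Because the pile is exhausted before reshuffling, the enemy can never repeat one action forever, which reads as intent to the player.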
What I had overlooked was that journeys on that particular website were fairly constrained by design, i.e., if you landed on the home page, did a bunch of stuff, put product X in the cart - there was pretty much one sequence of pages (or in the worst case, a small handful) that you'd traverse for the journey. Which means the bag-of-words (BoW) representation was more or less as expressive as the sequence model; certain pages showing up in the BoW vector corresponded to a single sequence (mostly). But the DT could learn faster with less data.
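A sketch of why BoW sufficed (page names are made up): with constrained journeys, the multiset of pages visited pins down the sequence almost uniquely, so a fixed-length count vector carries nearly the same signal as the ordered list.

```python
# Fixed page vocabulary -> fixed-length count vector, order discarded.
PAGES = ["home", "search", "product", "cart", "checkout"]

def bag_of_words(journey):
    """journey: ordered list of page names -> unordered BoW count vector."""
    return [journey.count(p) for p in PAGES]
```

Any model consuming this vector, a decision tree included, sees identical features for every ordering of the same pages, which is exactly the information the constrained site design made redundant.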
- Buying a bigger server is almost always better than a distributed system.
- A few lines of bash can often wipe out hundreds of lines of Python.
My group (and some others) had to design a device to transport an egg from one side of a very simple "obstacle course" to the other, with the aid of beacons (to indicate the egg location and target, each along opposite ends) and light sensors. There was basically a single obstacle, a barrier running most of the way across the middle. The field was fairly small, I think 4 metres long by 3 metres wide.
The other teams followed tutorials, created beacons that emitted high-frequency light pulses and circuitry to filter out 60Hz ambient light and detect the pulse; various robots (I think at least one repurposed a remote-control car) and feedback control to steer them toward the beacons, etc. There were a few different microcontrollers on offer to us for this task, and groups generally had three people: someone responsible for the mechanical parts, someone doing circuitry, and someone doing assembly programming.
My group was just the two of us.
I designed extenders for the central barrier, a carriage to straddle the barrier, and a see-saw the length of the field. The machine would find the egg, scoop it into one end, tilt the see-saw (the other person's innovation: by releasing a stop allowing the counterweighted far side to fall), find the target and release the scoop on the other end. Our light sensors were pointed directly at the ceiling (the source of the "noise"), and put through a simple RC circuit to see that light as more or less constant. Our "beacons" were pieces of construction paper used to block the light physically. All controlled by a 3-bit finite state machine implemented directly in TTL/CMOS (I forget which).
And it worked in testing (praise for my partner; I would never have gotten the mechanics robust enough), but on presentation day the real barrier (made sloppily out of wood) was noticeably wider than specified and the carriage didn't fit on it.
As I recall, in later years the obstacle course was made considerably more complex, ruling out solutions like mine entirely. (There were other projects to choose from, for my year and later years, that as far as I know didn't require modification.)
Given there were about a billion IG profiles total at the time, I just replaced the entire setup with a single Go script that iterated from 1 to a billion and tried to scrape every id in between. That gave us 10k requests per second on a single machine, which was more than enough.
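The structure of that sweep is just a bounded worker pool walking the numeric id space. A sketch of the same idea in Python (the original was Go, and the fetch below is a placeholder, since the real endpoint isn't shown):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(user_id):
    # Placeholder: a real version would issue an HTTP GET against the
    # (hypothetical) profile endpoint and return the response body.
    return f"profile-{user_id}"

def sweep(start, stop, workers=64):
    # No crawl frontier, no dedup, no link extraction: just iterate the
    # id space directly and fan requests out over a fixed worker pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(fetch_profile, range(start, stop))

results = list(sweep(1, 101))
```

The throughput knob is just `workers`; there is no queue or coordination layer to tune, which is the whole point.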
----
For storage, people often overcomplicate things. Maybe you do need RAID 5 in a NAS, etc. But maybe what you need is a simple server with a single disk and an offsite backup that rsyncs every night. That RAID 5 doesn't stop 'rm -rf' from destroying everything.
For databases, people often shove a database into an app or product much too early. The rule of thumb that I use is that you should switch to a database (from flat files) when you would have to implement foreign keys, or when data won't fit in memory anymore and memory-mapped files aren't sufficient. Using a database before that just complicates your data model, and introducing an ORM too early seriously complicates your code.
For algorithms, there are an awful lot of O(n log n) solutions deployed for problems with small n. An O(n) solution is often faster to write, and still solves the problem. O(n) is often actually faster when things fit in L1 or L2 cache.
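A toy illustration of the small-n point: the "smart" sort-based route does more work than a single linear pass, and the pass is also the cache-friendly option.

```python
data = [5, 3, 8, 1]

# "Smart": sort, then take the last element - O(n log n)
largest_sorted = sorted(data)[-1]

# Simple: one linear scan - O(n), and a single sequential pass over memory
largest_scan = max(data)
```

For a handful of elements both are instant, but the scan is less code and never slower.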
For software architecture, we often forget that the client has CPU and storage (and network) that we can use. Even if you don't trust the client, you can sign a cache entry to be saved on the client, and let the client forward it later. This greatly reduces the need for consistency on the backend. If you don't trust the client to compute, you can have the server compute a spot check at lower resolution, a subset, etc.
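One way to sketch the signed-cache-entry idea (the token format and secret here are made up; a real deployment would also want an expiry field inside the payload):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # assumption: known only to backend nodes

def issue(entry):
    # Serialize deterministically, then attach an HMAC tag so the
    # untrusted client can store the entry and replay it later.
    payload = json.dumps(entry, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.b64encode(payload).decode() + "." + tag

def verify(token):
    # Recompute the tag; any tampering with the payload invalidates it.
    payload_b64, tag = token.rsplit(".", 1)
    payload = base64.b64decode(payload_b64)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if hmac.compare_digest(tag, expected):
        return json.loads(payload)
    return None
```

The server only has to keep the secret, not the cache entries themselves, which is where the consistency burden disappears.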