> The output of a blue team is only as strong as its weakest link: a security system that consists of a strong component and a weak component (e.g., a house with a securely locked door, but an open window) will be insecure
Hum, no? With an open window you can go through the whole house. With an XSS vulnerability you cannot do the same amount of damage as with a SQL injection. This is why security issues have levels of severity.
This particular comment feels more like an over-concentration on trivialities rather than refutation or critique of opinion.
The "house analogy" can also support cases where the potential damage is not the same, e.g. if the open window has bars a robber might grab some stuff within reach but not be able to enter.
In this case you are criticizing an analogy meant to convey understanding of "weakest link" for not also imparting an understanding of "levels of severity".
In that respect, the attack and defense sides are not hugely different. The main difference is that many attackers are shielded from the consequences of their mistakes, whereas corporate defenders mostly aren't. But you also have the advantage of playing on your home turf, while the attackers are comparatively in the dark. If you squander that... yeah, things get rough.
You can add layers of high-quality, simple systems to increase your overall security exponentially; think using a VPN behind Tor, etc.
It could also mean that attacks against it are high value (because of high distribution).
Point is, license isn’t a great security parameter in and of itself IMO.
For example, our development teams are using modern, stable libraries in current versions, have systems like Sonar and Snyk around (blocking pipelines for many of them), and images are scanned before deployment.
I can assume this layer to be well-secured to the best of their ability. It is most likely difficult to find an exploit here.
But once I step a layer downwards, I have to ask myself: Alright, what happens IF a container gets popped and an attacker can run code in there? Some data will be exfiltrated and accessible, sure, but this application server should not be able to access more than the data it needs to access to function. The data of a different application should stay inaccessible.
As a physical example: a guest in a hotel room should only have access to their own fuse box at most, not the fuse box of their neighbours. A normal person (aka not a youtuber with big eyebrows) wouldn't mess with it anyway, but even if they start messing around, they should not be able to mess with their neighbour's.
And this continues: What if the database is not configured correctly to isolate access? We have, for example, isolated certain critical application databases into separate database clusters - lateral movement within a database cluster requires some configuration errors, but lateral movement onto a different database cluster requires a lot more effort. And we could go even further: currently we have one production cluster, but we could split that into multiple production clusters which share zero trust between them. That puts up an even bigger hurdle, yet another boundary an attacker has to overcome.
In cybersecurity, there is no reason the opponent cannot attack as well. So the fact that my red team is attacking is not a reason that I do not need defense, because my opponent can also attack.
Well, to be fair, you added some words that are not there in the post
> The output of a blue team is only as strong as its weakest link: a security system that consists of a strong component and a weak component [...] will be insecure (and in fact worse, because the strong component may convey a false sense of security).
You added "defense efforts". But that doesn't invalidate the claim in the article, in fact it builds upon it.
What Terence is saying is true, factually correct. It's a golden rule in security. That is why your "efforts" should focus on overlaying different methods, strategies and measures. You build layers upon layers, so that if one weak link gets broken there are other things in place to detect, limit and fix the damage. But it's still true that often the weakest link will be an "in".
Take the recent example of Cognizant help desk people resetting passwords for their clients without any check whatsoever. The clients had "proper security", with VPNs and 2FA, and so on. But the recovery mechanism was outsourced to a helpdesk that turned out to be the weakest link. The attackers (allegedly) simply called, asked for credentials, and got them. That was the weakest link, and that got broken. According to their complaint, the attackers then gained access to internal systems, and managed to gather enough data to call the helpdesk again and reset the 2FA for an "IT security" account (different from the first one). And that worked as well. They say they detected the attackers in 3 hours and terminated their access, but that's "detection, mitigation", not "prevention". The attackers were already in, rummaging through their systems.
The fact that they had VPNs and 2FA gave them "a false sense of security", while their weakest link was "account recovery". (Terence is right). The fact that they had more internal layers, that detected the 2nd account access and removed it after ~3 hours is what you are saying (and you're right) that defense in depth also works.
So both are right.
In recent years the infosec world has moved from selling "prevention" to promoting "mitigation". Because it became apparent that there are some things you simply can't prevent. You then focus on mitigating the risk, limiting the surfaces, lowering trust wherever you can, treating everything as ephemeral, and so on.
I also see this in a lot of undergrads I work with. The top 10% is even better with LLMs; they know much more and they are more productive. But the rest have just resorted to turning in clear slop with no care. I still have not read a good solution on how to incentivize/restrict the use of LLMs correctly, in academia or at work. Which I suspect is just the old reality that quality work is not desired by the vast majority, and LLMs are just magnifying this.
This is interesting, I'm noticing something similar (even taking LLMs out of the equation). I don't teach, but I've been coaching students for math competitions, and I feel like there's a pattern where the top few% is significantly stronger than, say, 10 years ago, but the median is weaker. Not sure why, or whether this is even real to begin with.
Oh look, it's on the Wikipedia page: https://en.wikipedia.org/wiki/RSA_cryptosystem
Yay blue/red teams in math!
Any dimension of LLM training and inference can be thought of as a tradeoff that makes it better for some tasks, and worse for others. Maybe in some scenarios a heavily quantized model that returns a result in 10ms is more useful than one that returns a result in 200ms.
It's the same reason I find asking many people for opinions useful - I take every answer and try to fit it into my world model and see what sticks. The point that many miss is that each individual's verifier model is actually accurate enough that external generator models can afford to have high error rates.
I have not yet completely explored how the internal "fitting" mechanism works but to give an example: I read many anecdotes from Reddit, fully knowing that many are astroturfed, some flat out wrong. But I still have tricks to identify what can be accurate, which I probably do subconsciously.
In reality: answers don't exist in a randomly uniform space. "Truth" always has some structure and it is this structure (that we all individually understand a small part of) that helps us tune our verifier model.
It is useful to think of how LLMs would work at varying levels of accuracy: for example, a spectrum from gibberish, to GPT o3, to ground truth. Gibberish is so inaccurate that even an extremely accurate internal verifier model may not make it useful. But o3 is accurate enough that, combined with my internal verifier model, it is generally useful.
Our internal verifier model is fuzzy but in this example I think it is pretty much always accurate.
Also, I can ask it to do security reviews on the system it's made, and it works with its same characteristic fervor.
I love Tao's observation, but I disagree, at least for the domains I'm allowing LLMs to create for, that they should not play both teams.
I have not seen any independent claim that generative "AI" makes programs safer or that generating supervising features as you suggest works.
For auditing "AI" I have seen one claim (not independent or using a public methodology) that auditing "AI" rakes in bug bounties.
In addition, the end goal of red and purple teams is, at the end of the day, to help the blue team remedy the issues discovered.
For an old example that predates LLMs, see the four color theorem.
The asymmetry is:
An attacker only has to be right ONCE, and he wins
Conversely, the defender only has to be wrong once, and he loses.
So the conclusion is:
Defenders/creators are using LLMs to pump out crappy code, and not testing enough, or relying on the LLM to test itself.
Some attackers might be too dismissive of LLMs, and could accelerate their work by using them to try more things
The comment was related to these stories:
How I Use AI (11 months ago) - https://news.ycombinator.com/item?id=41150317
Carlini has the fairly rare job of being an attacker: Why I Attack - https://nicholas.carlini.com/writing/2024/why-i-attack.html
https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...
> Many of the proposed use cases for AI tools try to place such tools in the "blue team" category, such as creating code...
> However, in view of the unreliability and opacity of such tools, it may be better to put them to work on the "red team", critiquing the output of blue team human experts but not directly replacing that output...
The red team is only essential if you're a coward who isn't willing to take a few risks for increased profit. Why bother testing and securing when you can boost your quarterly bonus by just... not doing that?
I suspect that Terence Tao's experience leans heavily towards high-profile, risk-averse institutions. People don't call in one of the greatest living mathematicians to check their work when they're just trying to duct-tape a new interface on top of a line-of-business app that hasn't seen much real investment since the late 90s. Conversely, the people who are writing cutting-edge algorithms for new network protocols and filesystems are hopefully not trying to churn out code as fast and cheap as possible by copy-pasting snippets to and from random chatbots.
There are a lot of people who are already cutting corners on programmer salaries, accruing invisible tech debt minute by minute. They're not trying to add AI tools to create a missing red team, they're trying to reduce headcount on the only team they have, which is the blue team (which is actually just one overworked IT guy in over his head).
In this case, all the companies who are doing what you describe are themselves the red team. They are the unreliable, additive, distributed players in an ecosystem where the companies themselves are disposable. The blue team is the blue team by virtue of incentives: they are the organization where proper functioning of their role requires that all the parts are reliable and work well together, and if the individual people fulfilling those roles do not have those qualities, they will fail and be replaced by people who do.
You say "just" as though this is a failure of the system, but this is the system working as designed. Economies of scale are half the reason to bother with large-scale enterprise, so they inevitably consolidate to the point of monopoly, so disrupting that monopoly by force to keep the market aligned is an ongoing and never-ending process that you should expect to need to do on a regular basis.
Is Pirate Software catching strays from Terence Tao now?
I'm optimistic about AI-powered infra & monitoring tools. When I have a long dump of system logs that I don't understand, LLMs help immensely. But then it's my job to finalize the analysis and make sure whatever debugging comes next is a good use of time. So not quite red team/blue team in that case either.
Business also has a “blue team” (those industries that the rest of the economy is built upon - electricity, oil, telecommunications, software, banking; possibly not coincidentally, “blue chips”) and a “red team” (industries that are additive to consumer welfare, but not crucial if any one of them goes down. Restaurants, specialty retail, luxuries, tourism, etc.)
It is almost always better, economically, to be on the blue team. That's because the blue team needs to ensure they do everything right (low supply) but has a lot of red-team customers they support (high demand). The red team, however, is additive: each additional red-team firm improves the quality of the overall ecosystem, but they aren't strictly necessary for the success of the ecosystem as a whole. You can kinda see this even in the examples of Tao's post: software engineers get paid more than QA, proof-creation is widely seen as harder and more economically valuable than proof-checking, etc.
If you’re Sam Altman and have to raise capital to train these LLMs, you have to hype them as blue team, because investors won’t fund them as red team. That filters down into the whole media narrative around the technology. So even though the technology itself may be most useful on the red team, the companies building it will never push that use, because if they admit that, they’re admitting that investors will never make back their money. (Which is obvious to a lot of people without a dog in the fight, but these people stay on the sidelines and don’t make multi-billion dollar investments into AI.)
The same dynamic seems to have happened to Google Glass, VR, and wearables. These are useful red-team technologies in niche markets, but they aren't huge new platforms and they will never make trillions like the web or mobile dev did. As a result, they've been left to languish because capital owners can't justify spending huge sums on them.
And there are a host of teams working on the "red team" side of LLMs right now, using them for autonomous testing. Basically, instead of trying to figure out all the things that can go wrong and writing tests, you let the AI explore the space of all possible failures, and then write those tests.
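As a sketch of what "explore the failure space, then keep the failures as tests" looks like at its most low-tech: plain random search in Go, where `Slugify` and its invariant are invented for illustration, and the AI variant just gets smarter about choosing the inputs and invariants.

```go
package slug

import (
	"math/rand"
	"strings"
	"testing"
)

// Slugify is a stand-in system-under-test for this sketch.
func Slugify(s string) string {
	return strings.Map(func(r rune) rune {
		if r == ' ' || r == '/' || r == '\\' {
			return '-'
		}
		return r
	}, s)
}

// Explore the input space and assert an invariant, instead of trying to
// enumerate individual failure cases by hand.
func TestSlugifyInvariant(t *testing.T) {
	rng := rand.New(rand.NewSource(1)) // fixed seed: reproducible exploration
	alphabet := []rune("ab /\\-_")
	for i := 0; i < 10000; i++ {
		var b strings.Builder
		for j := 0; j < 16; j++ {
			b.WriteRune(alphabet[rng.Intn(len(alphabet))])
		}
		in := b.String()
		if out := Slugify(in); strings.ContainsAny(out, " /\\") {
			t.Fatalf("invariant violated: Slugify(%q) = %q", in, out)
		}
	}
}

// Any counterexample the exploration turns up gets frozen as a plain test.
func TestSlugifyRegression(t *testing.T) {
	if got, want := Slugify("a b/c"), "a-b-c"; got != want {
		t.Fatalf("Slugify(%q) = %q, want %q", "a b/c", got, want)
	}
}
```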
_alternator_•5h ago
Having LLMs fix bugs or add features is more fraught, since they are prone to cheating or writing non-robust code (e.g. special code paths to pass tests without solving the actual problem).
skdidjdndh•5h ago
Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.
Having worked on legacy codebases, some of the hardest problems are determining “why is this broken test here that appears to test a behavior we don’t support”. Do we have a bug? Or do we have a bad test? On the other end, when there are tests for scenarios we don’t actually care about it’s impossible to determine if that test is meaningful or was added because “it’s testing the code as written”.
yojo•5h ago
Anticipating future responses: yes, a robust test harness allows you to make changes fearlessly. But most big test suites I've seen are less "harness" and more "straitjacket"
andrepd•4h ago
yojo•3h ago
1) your test is specific to the implementation at the time of writing, not the business logic you mean to enforce.
2) your test has non-deterministic behavior (more common in end-to-end tests) that cause it to fail some small percentage of the time on repeated runs.
At the extreme, these types of tests degenerate your suite into a "change detector," where any modification to the code-base is guaranteed to make one or more tests fail.
They slow you down because every code change also requires an equal or larger investment debugging the test suite, even if nothing actually "broke" from a functional perspective.
Using LLMs to litter your code-base with low-quality tests will not end well.
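To make the first failure mode concrete, here's a rough Go sketch (everything in it is invented) of a change-detector test next to the behavioral test you'd actually want:

```go
package receipt

import (
	"strconv"
	"strings"
	"testing"
)

// FormatReceipt is a made-up function standing in for real business logic.
func FormatReceipt(items []string, totalCents int) string {
	return "Items: " + strings.Join(items, ", ") + "\nTotal: $" + strconv.Itoa(totalCents/100)
}

// Change detector: pins the exact output string, so any cosmetic tweak
// (reordering, different currency formatting) fails the test even though
// nothing a customer cares about broke.
func TestFormatReceiptExact(t *testing.T) {
	got := FormatReceipt([]string{"apple", "pear"}, 300)
	if got != "Items: apple, pear\nTotal: $3" {
		t.Fatalf("unexpected output: %q", got)
	}
}

// Behavioral test: asserts only the properties the business actually needs,
// namely that every item and the total appear somewhere on the receipt.
func TestFormatReceiptContents(t *testing.T) {
	got := FormatReceipt([]string{"apple", "pear"}, 300)
	for _, want := range []string{"apple", "pear", "$3"} {
		if !strings.Contains(got, want) {
			t.Fatalf("receipt %q missing %q", got, want)
		}
	}
}
```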
winstonewert•3h ago
threatofrain•3h ago
Asking specs to truly match the business before we begin using them as tests would handcuff test people in the same way we're saying that tests have the potential to handcuff app and business logic people — as opposed to empowering them. So I wouldn't blame people for writing specs that only match the code implementation at that time. It's hard to engage in prophecy.
marcosdumay•3h ago
WTF are you doing writing specs based on implementation? If you already have the implementation, what are you using the specs for? Or, if you want to apply this directly to tests, if you are already assuming the program is correct, what are you trying to test?
Are you talking about rewriting applications?
nyrikki•2h ago
TDD at its core is defining expected inputs and mapping those to expected outputs at the unit of work level, e.g. function, class etc.
While UAT and the domain inform what those input→output pairs are, avoiding the urge to write a broader spec than that is what many people struggle with when learning TDD.
Avoiding writing behavior or acceptance tests, and focusing on unit-of-implementation tests, is the whole point.
But it is challenging for many to get that to click. It should help you find ambiguous requirements, not develop a spec.
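A minimal Go sketch of what that looks like at the unit-of-work level (function and numbers made up):

```go
package pricing

import "testing"

// DiscountedPrice is a hypothetical unit of work: the test below pins
// expected outputs for expected inputs, nothing more.
func DiscountedPrice(priceCents, quantity int) int {
	if quantity >= 10 {
		return priceCents * quantity * 90 / 100 // 10% bulk discount
	}
	return priceCents * quantity
}

func TestDiscountedPrice(t *testing.T) {
	cases := []struct {
		name                 string
		priceCents, quantity int
		want                 int
	}{
		{"no discount below threshold", 100, 9, 900},
		{"discount at threshold", 100, 10, 900},
		{"discount above threshold", 100, 20, 1800},
	}
	for _, c := range cases {
		if got := DiscountedPrice(c.priceCents, c.quantity); got != c.want {
			t.Errorf("%s: DiscountedPrice(%d, %d) = %d, want %d",
				c.name, c.priceCents, c.quantity, got, c.want)
		}
	}
}
```

The first two rows happening to produce the same total is exactly the kind of thing that surfaces an ambiguous boundary requirement to go ask about, rather than something to spec around.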
MoreQARespect•1h ago
I'm weirded out by your comment. Writing tests that couple to low-level implementation details was something I thought most people did accidentally before giving up on TDD, not intentionally.
jrockway•2h ago
If you can't tell if a test is there to preserve existing happenstance behavior, or if it's there to preserve an important behavior, you're slowed way down. Every red test when you add a new feature is a blocker. If the tests are red because you broke something important, great. You saved weeks! If the tests are red because the test was testing something that doesn't matter, not so great. Your afternoon was wasted on a distraction. You can't know in advance whether something is a distraction, so this type of test is a real productivity landmine.
Here's a concrete, if contrived, example. You have a test that starts your app up in a local webserver, and requests /foo, expecting to get the contents of /foo/index.html. One day, you upgrade your web framework, and it has decided to return a 302 Moved redirect to /foo/index.html, so that URLs are always canonical now. Your test fails with "incorrect status code; got 302, want 200". So now what? Do you not apply the version upgrade? Do you rewrite the test to check for a 302 instead of a 200? Do you adjust the test HTTP client to follow redirects silently? The problem here is that you checked for something you didn't care about, the HTTP status, instead of only checking for what you cared about, that "GET /foo" gets you some text you're looking for. In a world where you let the HTTP client follow redirects, like human-piloted HTTP clients, and only checked for what you cared about, you wouldn't have had to debug this to apply the web framework security update. But since you tightened down the screws constraining your application as tightly as possible, you're here debugging this instead of doing something fun.
(The fun doubles when you have to run every test for every commit before merging, and this one failure happened 45 minutes in. Goodbye, the rest of your day!)
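Roughly, in Go httptest terms (handler and path invented for illustration; a ServeMux hands back a 301 rather than the 302 above, but the effect is the same):

```go
package app

import (
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
)

// newHandler stands in for the application after the framework upgrade:
// a ServeMux registered on "/foo/" answers "/foo" with a redirect.
func newHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/foo/", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "welcome to foo")
	})
	return mux
}

// Over-constrained: a non-redirect-following client pins the status code.
// This test fails after the upgrade (it sees the 301) even though users
// still get the same page. That failure is the wasted afternoon.
func TestFooStatus(t *testing.T) {
	srv := httptest.NewServer(newHandler())
	defer srv.Close()
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse // do not follow redirects
		},
	}
	resp, err := client.Get(srv.URL + "/foo")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("got status %d, want 200", resp.StatusCode)
	}
}

// Behavioral: follow redirects like a browser and only check that the
// content we actually care about is served. This survives the upgrade.
func TestFooContent(t *testing.T) {
	srv := httptest.NewServer(newHandler())
	defer srv.Close()
	resp, err := srv.Client().Get(srv.URL + "/foo") // default client follows redirects
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		t.Fatal(err)
	}
	if !strings.Contains(string(body), "welcome to foo") {
		t.Fatalf("body %q does not contain expected text", body)
	}
}
```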
ch33zer•3h ago
jrockway•2h ago
Change detector tests add to the noise here. No, this wasn't a feature customers care about, some AI added a test to make sure foo.go line 42 contained less than 80 characters.
groestl•1h ago
In some cases (e.g. in our case) long standing bugs become part of the API that customers rely on.
strbean•1h ago
Obligatory: https://xkcd.com/1172/
giaour•15m ago
PeeMcGee•1h ago
ozgrakkurt•5h ago
manmal•4h ago
raddan•4h ago
ozgrakkurt•4h ago
For example you can make it generate queries and data for a database and generate a list of operations and timings for the operations.
Then you can mix assertions into the test so you make sure everything is going as expected.
This is very useful because there can be many combinations of inputs and timings etc. and it tests basically everything for you without you needing to write a million unit tests
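A rough Go sketch of the shape of this, using a toy in-memory store and a reference model instead of a real database (all names invented):

```go
package kvrand

import (
	"math/rand"
	"testing"
)

// Store is a toy stand-in for the system being exercised.
type Store struct{ m map[string]string }

func NewStore() *Store          { return &Store{m: map[string]string{}} }
func (s *Store) Put(k, v string) { s.m[k] = v }
func (s *Store) Del(k string)    { delete(s.m, k) }
func (s *Store) Get(k string) (string, bool) {
	v, ok := s.m[k]
	return v, ok
}

// Generate a random sequence of operations and mix the assertions into it:
// after every read, the store must agree with a trivially correct model.
func TestRandomOperations(t *testing.T) {
	rng := rand.New(rand.NewSource(42)) // fixed seed so failures reproduce
	store, model := NewStore(), map[string]string{}
	keys := []string{"a", "b", "c", "d"}

	for i := 0; i < 10000; i++ {
		k := keys[rng.Intn(len(keys))]
		switch rng.Intn(3) {
		case 0: // put
			v := keys[rng.Intn(len(keys))]
			store.Put(k, v)
			model[k] = v
		case 1: // delete
			store.Del(k)
			delete(model, k)
		case 2: // read, with the assertion mixed into the sequence
			got, ok := store.Get(k)
			want, wantOK := model[k]
			if got != want || ok != wantOK {
				t.Fatalf("step %d: Get(%q) = %q,%v; want %q,%v", i, k, got, ok, want, wantOK)
			}
		}
	}
}
```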
cookiengineer•4h ago
Basically the whole world of bugs introduced by someone being too clever a C/C++ coder. You can battle-test parsers quite nicely with fuzzers, because parsers often have multiple states and tend to assume naively structured input data.
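Not C/C++, but Go has fuzzing built in since 1.18 and the shape is the same: seed a corpus, let the fuzzer mutate inputs, and assert invariants rather than exact outputs. A minimal sketch with a made-up `ParseHeader`:

```go
package parser

import (
	"strings"
	"testing"
)

// ParseHeader is a toy stand-in for the kind of hand-rolled parser that
// fuzzers are good at breaking.
func ParseHeader(line string) (key, value string, ok bool) {
	i := strings.IndexByte(line, ':')
	if i < 0 {
		return "", "", false
	}
	return strings.TrimSpace(line[:i]), strings.TrimSpace(line[i+1:]), true
}

// Run with: go test -fuzz=FuzzParseHeader
func FuzzParseHeader(f *testing.F) {
	f.Add("Content-Type: text/html") // seed corpus
	f.Add("weird::value")
	f.Fuzz(func(t *testing.T, line string) {
		key, value, ok := ParseHeader(line)
		if !ok {
			return
		}
		// Invariant: the key never contains the separator.
		if strings.Contains(key, ":") {
			t.Fatalf("key %q still contains ':' (input %q)", key, line)
		}
		// Invariant: re-serializing and re-parsing is stable.
		k2, v2, ok2 := ParseHeader(key + ": " + value)
		if !ok2 || k2 != key || v2 != value {
			t.Fatalf("round trip changed %q/%q into %q/%q", key, value, k2, v2)
		}
	})
}
```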
bicx•4h ago
_alternator_•4h ago
wagwang•4h ago
manmal•4h ago
Tests poke and prod at the SUT with a stick, and the SUT's behaviour is observed. The truth lives in the code, the documentation, and, unfortunately, in the heads of the dev team. I think this distinction is quite important, because this question:
> Do we have a bug? Or do we have a bad test?
cannot be answered by looking at the test + the implementation. The spec or people have to be consulted when in doubt.
andruby•4h ago
Is it "System Under Test"? (That's Claude.ai's guess)
dfabulich•4h ago
card_zero•4h ago
9rx•4h ago
The tests are your spec. They exist precisely to document what the program is supposed to do for other humans, with the secondary benefit of also telling a machine what the program is supposed to do, allowing implementations to automatically validate themselves against the spec. If you find yourself writing specs and tests as independent things, that's how you end up with bad, brittle tests that make development a nightmare — or you simply like pointless busywork, I suppose.
But, yes, you may still have to consult a human if there is reason to believe the spec isn't accurate.
munificent•3h ago
For all real-world software, a test suite tests a number of points in the space of possible inputs and we hope that those points generalize to pinning down the overall behavior of the implementation.
But there's no guarantee of that generalization. An implementation that fails a test is guaranteed to not implement the spec, but an implementation that passes all of the tests is not guaranteed to implement it.
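A deliberately silly Go sketch of that gap, with a made-up Abs:

```go
package absdemo

import "testing"

// Spec: Abs returns the absolute value of x.
// This implementation passes the two sampled points below but is wrong
// almost everywhere else; passing tests only rules implementations out.
func Abs(x int) int {
	if x == -3 {
		return 3
	}
	return x
}

func TestAbs(t *testing.T) {
	if Abs(-3) != 3 || Abs(5) != 5 {
		t.Fatal("Abs violates the spec at a sampled point")
	}
}
```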
9rx•3h ago
They are for the human, which is the intended recipient.
Given infinite time the machine would also be able to validate against the complete specification, but, of course, we normally cut things short because we want to release the software in a reasonable amount of time. But, as before, that this ability exists at all is merely a secondary benefit.
godelski•2h ago
Tests are an *approximation* of your spec.
Tests are a description, and like all descriptions are noisy. The thing is it is very very difficult to know if your tests have complete coverage. It's very hard to know if your description is correct.
How often do you figure out something you didn't realize previously? How often do you not realize something and it's instead pointed out by your peers? How often do you realize something after your peers say something that sparks an idea?
Do you think that those events are over? No more things to be found? I know I'm not that smart because if I was I would have gotten it all right from the get go.
There are, of course, formal proofs but even they aren't invulnerable to these issues. And these aren't commonly used in practice and at that point we're back to programming/math, so I'm not sure we should go down that route.
9rx•1h ago
As is a spec. "Description" is literally found in the dictionary definition. Which stands to reason as tests are merely a way to write a spec. They are the same thing.
> The thing is it is very very difficult to know if your tests have complete coverage.
There is no way to avoid that, though. Like you point out, not even formal proofs, the closest speccing methodology we know of to try and avoid this, is immune.
> Tests are an approximation of your spec.
Specs are an approximation of what you actually want, sure, but that does not change that tests are the spec. There are other ways to write a spec, of course, but if you went down that road you wouldn't also have tests. That would be not only pointless, but a nightmare due to not having a single source of truth which causes all kinds of social (and sometimes technical) problems.
godelski•1h ago
The point of saying this is to ensure you don't fall prey to fooling yourself. You're the easiest person for you to fool, after all. You should always carry some doubt. Not so much it is debilitating, but enough to keep you from being too arrogant. You need to constantly check that your documentation is aligned to your specs and that your specs are aligned to your goals. If you cannot see how these are different things then it's impossible to check your alignment and you've fooled yourself.
9rx•1h ago
Documentation, tests, and specs are all ultimately different words for the same thing.
You do have to check that your implementation and documentation/spec/tests are aligned, which can be a lot of work if you do so by hand, but that's why we invented automatic methods. Formal verification is theoretically best (that we know of) at this, but a huge pain in the ass for humans to write, so that is why virtually everyone has adopted tests instead. It is a reasonable tradeoff between comfort in writing documentation while still providing sufficient automatic guarantees that the documentation is true.
> If you cannot see how these are different things
If you see them as different things, you are either pointlessly repeating yourself over and over or inventing information that is, at best, worthless (but often actively harmful).
Kinrany•3h ago
It's easy to see them as four cache layers, but empirically it's almost never the case that the correct thing to do when they disagree is to blindly purge and recreate levels that are farther from the "truth" (even ignoring the cost of doing that).
Instead, it's always an ad-hoc reasoning exercise in looking at all four of them, deciding what the correct answer is, and updating some or all of them.
jgalt212•4h ago
I hear you on this, but you can still use them, so long as these tests are not commingled with the tests generated by subject-matter experts. I'd treat them almost as fuzzers.
SamuelAdams•1h ago
djeastm•30m ago
The dreaded "Added tests" commit...
Pxtl•47m ago
Because somebody complained when that behavior we don't support was broken, so the bug-that-wasn't-really-a-bug was fixed and a test was created to prevent regression.
Imho, the mistake was in documentation: the test should have comments explaining why it was created.
Just as true for tests as for the actual business logic code:
The code can only describe the what and the how. It's up to comments to describe the why.
mvieira38•4h ago
_alternator_•4h ago
johnisgood•2h ago
fpoling•2h ago
torginus•2h ago
That's why they came up with the Arrange-Act-Assert pattern.
My favorite kind of unit test nowadays is when you store known input-output pairs and validate the code on them. It's easy to test corner cases and see that the output works as desired.
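Something like this, as a Go sketch (function and stored pairs invented for illustration; the Arrange-Act-Assert steps are marked in comments):

```go
package normalize

import (
	"strings"
	"testing"
)

// CollapseSpaces is a made-up example function: it trims the input and
// collapses internal runs of whitespace to single spaces.
func CollapseSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func TestCollapseSpaces(t *testing.T) {
	// Arrange: the stored input-output pairs, including easy-to-forget corner cases.
	pairs := map[string]string{
		"hello   world":           "hello world",
		"  padded  ":              "padded",
		"\ttabs\nand\n\nnewlines": "tabs and newlines",
		"":                        "",
		"   ":                     "",
	}
	for in, want := range pairs {
		// Act.
		got := CollapseSpaces(in)
		// Assert.
		if got != want {
			t.Errorf("CollapseSpaces(%q) = %q, want %q", in, got, want)
		}
	}
}
```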
01HNNWZ0MV43FF•10m ago