ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
Please use --jinja for llama.cpp, and set temperature = 0.7 and top-p = 0.95!
Also best to increase Ollama's context length to at least 8K: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details are at https://docs.unsloth.ai/basics/magistral
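If you'd rather hit it programmatically, here's a minimal sketch, assuming Ollama's OpenAI-compatible endpoint on the default port and the model tag from above (adjust names for your setup, e.g. llama-server instead):

    # Minimal sketch: query the local Magistral model with the recommended sampling
    # settings. Assumes Ollama's OpenAI-compatible API on its default port and that
    # the model tag below has already been pulled.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally
    resp = client.chat.completions.create(
        model="hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL",
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
        temperature=0.7,
        top_p=0.95,
    )
    print(resp.choices[0].message.content)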
1. Removed KL Divergence
2. Normalize by total length (Dr. GRPO style)
3. Minibatch normalization for advantages
4. Relaxing trust region
Wait, how are they computing the loss?
The goal of it was to "force" the model not to stray too far from the original checkpoint, but it can hinder the model from learning new things
The paper they cite "What matters in on-policy RL" claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch)-normalization doesn't seem theoretically motivated for convergence to the optimal policy?
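For what it's worth, here is a rough sketch (my own reconstruction, not Mistral's code) of how the four listed changes could fit together in a GRPO-style token loss; the shapes, clip bounds, and padding handling are assumptions:

    # Rough sketch of a GRPO-style loss with the four listed tweaks (details assumed):
    # no KL penalty, advantages normalized over the minibatch, loss normalized by the
    # total generated length, and an asymmetric ("relaxed") clipping range.
    import torch

    def grpo_loss(logprobs, old_logprobs, rewards, total_tokens, eps_low=0.2, eps_high=0.28):
        # rewards: one scalar per sampled completion; logprobs: [batch, tokens]
        # (padding/masking omitted for brevity)
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # minibatch normalization
        adv = adv.unsqueeze(-1)                                      # broadcast over tokens
        ratio = torch.exp(logprobs - old_logprobs)                   # per-token importance ratio
        clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)      # relaxed trust region
        per_token = -torch.minimum(ratio * adv, clipped * adv)
        # Dr. GRPO style: divide by the total number of generated tokens in the
        # minibatch rather than per sequence; note there is no KL term anywhere.
        return per_token.sum() / total_tokens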
For example, R1 scores 79.8 on AIME 2024, while R1-0528 scores 91.4.
R1 scores 70 on AIME 2025, R1-0528 scores 87.5. R1-0528 similarly does better on GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).
Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.
IIUC, nowadays there is a Jinja-templated metadata struct inside the GGUF file itself. This contains the chat template and other config.
https://gist.github.com/gavi/b9985f730f5deefe49b6a28e5569d46...
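If you want to poke at that yourself, a small sketch using the gguf Python package; the key name is the standard one, but the exact field layout can vary between gguf-py versions, so treat this as approximate:

    # Sketch: read the embedded chat template from a GGUF file's metadata.
    # Assumes `pip install gguf`; the filename is a placeholder.
    from gguf import GGUFReader

    reader = GGUFReader("Magistral-Small-2506-UD-Q4_K_XL.gguf")
    field = reader.fields.get("tokenizer.chat_template")
    if field is not None:
        # The value is stored as raw bytes in one of the field's parts.
        raw = field.parts[field.data[0]]
        print(bytes(raw).decode("utf-8"))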
(People can do statistical token prediction too, but that's called "bullshitting", not "thinking". Thinking is a much wider class of activity.)
Because my understanding is that how "thinking" works is actually still a total mystery. How is it we know for certain that the basis for the analog electric-potential-based computing done by neurons is not based on statistical prediction?
Do we have actual evidence of that, or are you just doing "statistical token prediction" yourself?
Absence of evidence is not evidence of absence, but it is still absence of evidence. Making a claim without any is more religious than not. After all, we know humans can't be descended from monkeys!
Like, we've agreed that "water-physics" and "cloth physics" in 3d graphics refers to a mathematical approximation of something we don't actually understand at the subatomic level (are there strings down there? Who knows).
Can "thinking" in AI not refer to this intentionally false imitation that has a similar observable outward effect?
Like, we're okay saying minecraft's water has "water physics", why are we not okay saying "in the AI context, thinking is a term that externally looks a bit like a human thinking, even though at a deeper layer it's unrelated"?
Or is thinking special, is it like "soul" and we must defend the word with our life else we lose our humanity? If I say "that building's been thinking about falling over for 50 years", did I commit a huge faux pas against my humanity?
In such a task, better water simulations are used. We have those, because we can directly observe the behavior of water under different conditions. It’s okay because the people doing it are explicitly aware that they are using simulation.
AI will get used in real decisions affecting other people, and the people doing those decisions will be influenced by the terminology we choose to use.
I would if they said the water in Half-Life 2 was great for quenching your thirst, or that in the near future everyone will only drink water from Half-Life 2 and it will flow from our kitchen taps, when it's clear that however good Half-Life 2 is at approximating what water looks and acts like, it isn't capable of being a beverage and isn't likely to ever become one. Right now there are a lot of people going around saying that what passes for AI these days has the ability to reason and that AGI is right around the corner, but that's just as obvious a lie and every bit as unlikely, and the more it gets repeated the more people end up falling for it.
It's frustrating because at some point (if it hasn't happened already) you're going to find yourself feeling very thirsty and be shocked to discover that the only thing you have access to is Half-life 2 water, even though it does nothing for you except make you even more thirsty since it looks close enough to remind you of the real thing. All because some idiot either fell for the hype or saved enough money by not supplying you with real water that they don't care how thirsty that leaves you.
The more companies force the use of flawed and unreasoning AI to do things that require actual reasoning the worse your life is going to get. The constant misrepresentation of AI and what it's capable of is accelerating that outcome.
The point that was trying to be made, which I agree with, is that anthropomorphizing a statistical model isn’t actually helpful. It only serves to confuse laypersons into assuming these models are capable of a lot more than they really are.
That’s perfect if you’re a salesperson trying to dump your bad AI startup onto the public with an IPO, but unhelpful for pretty much any other reason, especially true understanding of what’s going on.
To your point, it's only anthropomorphization if you make the anthrocentric assumption that "thinking" refers to something that only humans can do.[1]
And I don't think it confuses laypeople, when literally telling it to "think" achieves very similar results as in humans - it produces output that someone given it out of context would easily identify as "thinking out loud", and improves the accuracy of results like how... thinking does.
The best mental model of RLHF'd LLMs that I've seen is that they are statistical models "simulating"[1] how a human-like character would respond to a given natural-language input. To calculate the statistically "most likely" answer that an intelligent creature would give to a non-trivial question, with any sort of accuracy, you need emergent effects which look an awful lot like a (low fidelity) simulation of intelligence. This includes simulating "thought". (And the distinction between "simulating thinking" and "thinking" is a distinction without a difference given enough accuracy)
I'm curious as to what "capabilities" you think the layperson is misled about, because if anything they tend to exceed layperson understanding IME. And I'm curious what mental model you have of LLMs that provides more "true understanding" of how a statistical model can generate answers that appear nowhere in its training.
[1] It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
And yet we added a hand-wavy 7th to humanize a piece of technology.
https://arxiv.org/abs/2503.09211
They don’t ”reason”.
https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
They don’t even always output their internal state accurately.
I am thoroughly unimpressed by this paper. It sets up a vague strawman definition of "thinking" that I'm not aware of anyone using (and makes no claim it applies to humans) and then knocks down the strawman.
It also leans way too heavily on determinism - For one thing, we have no way of knowing if human brains are deterministic (until we solve whether reality itself is). For another, I doubt you would suddenly reverse your position if we created a LoRA composed of atmospheric noise, so it does not support your real position.
> https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
This one is more substantial, but:
"While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. [...] Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. [...] We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
Starts by saying "we actually don't understand them" (meaning we don't know well enough to give a yes or no) and then proceeds to list flaws that, as I keep saying, can also be applied to most (if not all) humans' ability to reason. Human reasoning also collapses in accuracy above a certain complexity, and humans certainly are observed to fail to use explicit algorithms, as well as to reason inconsistently across puzzles.
So unless your definition of anthropomorphization excludes most humans, this is far from a slam dunk.
> They don’t even always output their internal state accurately.
I have some really bad news about humans for you. I believe (Buddha et al, 500 BCE) is the foundational text on this, but there's been some more recent research (Hume, 1739), (Kierkegaard, 1849)
My point was congruent with the argument that LLMs are not humans or possess human-like thinking and reasoning, and you have conveniently demonstrated that.
I mean, they are obviously not humans, that is trivially true, yes.
I don't know what I said that makes you believe I demonstrated that they do not possess human-like thinking and reasoning, though, considering I've mostly pointed out ways they seem similar to humans. Can you articulate your point there?
\boxed{Hey! How can I help you today?}
They clearly rewarded the \boxed{...} formatting during their RL training, since it makes it easier to naively extract answers to math problems and thus verify them. But Magistral uses it for pretty much everything, even when it's inappropriate (in my own testing as well).
It also forgets to <think> unless you use their special system prompt reminding it to.
Honestly a little disappointing. It obviously benchmarks well, but it seems a little overcooked on non-benchmark usage.
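On the \boxed{...} point: the appeal for RL verification is that answer extraction becomes trivially mechanical, something like this sketch (my illustration, not their actual verifier):

    # Naive extraction of a \boxed{...} answer for automatic verification.
    import re

    def extract_boxed(text):
        # Return the contents of the last \boxed{...} (nested braces not handled).
        matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
        return matches[-1] if matches else None

    print(extract_boxed(r"The answer is \boxed{42}."))  # -> 42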
Proofreading an email at four tokens per second, great.
Spending a half hour to deep research some topic with artifacts and MCP tools and reasoning at four tokens per second… a bad time.
I believe what they are trying to show in that paper is that as the chain of operations approaches a large number (their proxy for complexity), an LLM will inevitably fail. Humans don't have infinite context either, but they can still solve the Tower of Hanoi without needing to resort to pen and paper or coding.
32767 moves in a single prompt. That's not testing reasoning. That’s testing whether the model can emit a huge structured output without error, under a context window limit.
The authors then treat failure to reproduce this entire sequence as evidence that the model can't reason. But that’s like saying a calculator is broken because its printer jammed halfway through printing all prime numbers under 10000.
For me o3 returning Python code isn’t a failure. It’s a smart shortcut. The failure is in the benchmark design. This benchmark just smells.
I agree 15 disks is very difficult for a human, probably on a sheer stamina level; but I managed to do 8 in about 15 minutes by playing around (i.e. no practice). They do state that there is a massive drop in performance at this point.
Agreed. But to be fair, 1) a relatively simple algorithm can do it, and more importantly 2) a lot of people are trying to build products around doing exactly this (emit large structured output without error).
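For reference, the "relatively simple algorithm" is the textbook recursion, which emits all 2^n - 1 moves (32767 for 15 disks):

    # Standard recursive Tower of Hanoi: returns the full move list.
    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, src, dst, aux, moves)   # move n-1 disks out of the way
            moves.append((src, dst))             # move the largest disk
            hanoi(n - 1, aux, src, dst, moves)   # move n-1 disks back on top
        return moves

    print(len(hanoi(15)))  # 32767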
I don't know what they are thinking.
The ePrivacy Directive requires a (GDPR-level) consent for just placing the cookie, unless it's strictly necessary for the provision of the “service”. The way EU regulators interpret this, even web analytics falls outside the necessity exception and therefore requires consent.
So as long as the user doesn't and/or is not able to automatically signal consent (or non-consent) eg via general browser-level settings, how can you obtain it without trying to get it from the user on a per-site basis somehow? (And no, DNT doesn't help since it's an opt-out, not an opt-in mechanism.)
So you need a consent for all but the most crucial cookies without which the site/service wouldn't be able to function, like session cookies for managing signed-in state etc.
(The reason why you started to see consent banners really only after GDPR came to force is at least in part due to the fact that the ePrivacy Directive refers to the Data Protection Directive (DPD) for the standard of consent, and after DPD was replaced by GDPR, the arguably more stringent GDPR consent standard was applied, making it unfeasible to rely on some concept of implied consent or the like.)
> like session cookies for managing signed-in state etc.
Maybe I'm reading it wrong, but are you saying that consent is required for session cookies? Because that is not the case, at all.
> (25) However, such devices, for instance so-called "cookies", can be a legitimate and useful tool, for example, in analysing the effectiveness of website design and advertising, and in verifying the identity of users engaged in on-line transactions. Where such devices, for instance cookies, are intended for a legitimate purpose, such as to facilitate the provision of information society services, their use should be allowed on condition that users are provided with clear and precise information in accordance with Directive 95/46/EC about the purposes of cookies or similar devices so as to ensure that users are made aware of information being placed on the terminal equipment they are using. Users should have the opportunity to refuse to have a cookie or similar device stored on their terminal equipment. This is particularly important where users other than the original user have access to the terminal equipment and thereby to any data containing privacy-sensitive information stored on such equipment. Information and the right to refuse may be offered once for the use of various devices to be installed on the user's terminal equipment during the same connection and also covering any further use that may be made of those devices during subsequent connections. The methods for giving information, offering a right to refuse or requesting consent should be made as user-friendly as possible. Access to specific website content may still be made conditional on the well-informed acceptance of a cookie or similar device, if it is used for a legitimate purpose.
https://eur-lex.europa.eu/eli/dir/2002/58/oj/eng
You should inform users about any private data you would be storing in a cookie. But this can be a small infobox on your page with no button.
When storing other types of information, the "cookie" problem needs to be seen from the perspective of shared devices. You know, the times before, when you might forget to log out at an internet cafe or to clear your cookies containing passwords and other things they shouldn't. This is a dated way of looking at the problem (most people have their own computing devices today, their phone), but still applicable (classrooms, and family shared devices).
A conceptually different matter altogether is consent (possibly) needed under GDPR for various kinds of personal data processing involving the use of cookies (ie not just the placement of cookies as such) and other technologies for tracking, targeting and the like. That's why you see cookie banners with detailed purposes and eg massive lists of vendors (since they can be considered "recipients" of the user's personal data under GDPR).

In this context, a valid consent (and the information you have to provide to obtain it) is required (at least) when consent is the only feasible legal basis of the ones available under Art 6 GDPR for the personal data processing activities in question. This is where the national regulators have taken strict stances especially regarding ad targeting and other activities usually involving cross-site tracking, for example, deeming that the only feasible basis for those activities would be consent (ie "opt-in") — instead of, in particular, "legitimate interests" which would enable opt-out-like mechanisms instead.

This is the legal context of looking critically at 3rd-party cookies, but unfortunately, for the reasons mentioned above, getting rid of such cookies might still not be enough to avoid the minimal base cookie consent requirement when you use eg analytics... :(
It's pretty ridiculous, I know, and it's a bummer they scrapped the long-planned and -negotiated ePrivacy Regulation which was meant to replace the old ePrivacy Directive and, among other things, update the weird old cookie consent provision.
It's corpos trying to invade our privacy.
did you really prefer when companies were selling your data to third parties and didn't have to ask you?
I think they are more interested in protecting old money than in protecting people.
Realistically the legislation was only targeting Apple. If consumers want USB-C, then they can vote with their wallets and buy an Android, which is a reasonable alternative.
> they were generally using it to screw consumers
You understand that there were lots of people happy with Lightning? USB-C is a regression in many ways.
I’m very happy EU regulators took this headache off my shoulders and I don’t need to keep multiple chargers at home, and can be almost certain I can find a charger in restaurant if I need it.
Based on the reaction of my friends 90% of people supported this change and were very enthusiastic about it.
I have zero interest in being part of vendor game to lock me in.
> Based on the reaction of my friends 90% of people supported this change and were very enthusiastic about it.
That is an absolutely worthless metric, and you know it.
Good riddance for Lightning.
Realistically Apple's connector adds no value and if they want to sell into markets like the EU they need to cut that kind of thing out.
Like I said, usb-c is a regression from lightning in multiple ways.
* Lightning is easier to plug in.
* Lightning is a physically smaller connector.
* USB-C is a much more mechanically complex port. Instead of a boss in a slot, you have a boss with a slot plugging into a slot in a boss.
There was so much buzz around Apple no longer including a wall wart with its phones, which meant an added cost for the consumer, and potentially an increased environmental impact if enough people were going to, say, order a wall wart online and have it shipped to them. The same logic applies to Apple being forced to switch to USB-C, except that the costs are now multiplied.
> Lightning is easier to plug in.
according to you? neither are at all difficult
> Lightning is a physically smaller connector.
I've had lightning cables physically disassemble in the port, the size also made them somewhat delicate
> USB-C is a much more mechanically complex port.
much is a bit well, much... they're both incredibly simple mechanically — the exposed contacts made lightning more prone to damage
I've had multiple Apple devices fail because of port wear on the device. Haven't encountered this yet with usb-c
> The same logic applies to Apple forced to switch to USB, except that the costs are now multiplied.
Apple would have updated inevitably, as they did in the past — now at least they're on a standard... the long-term waste reduction is very likely worth the switch (because again, without the standard they'd have likely switched to another proprietary implementation)
My personal biggest gripe with lightning was that the spring contacts were in the port instead of the cable, and when they wore out you had to replace the phone instead of the cable. The lightning port was not replaceable. In practice I may end up breaking more USB-C ports, we'll see.
EU got rid of that. It only makes sense that they don't let private companies start all that crap up again. If states don't get to use artificial technological barriers as protectionism, certainly Apple shouldn't be allowed to either.
It is very hard to create policies and legislation that protects consumers, workers and privacy while also giving enough liberties for innovation. These are difficult but important trade-offs.
I'm glad there is diversity in cultures and values between the US, EU and Asia.
If it weren't for American help and trade post-WW2, Europe would be a Belarusian backwater and is fast heading back in that direction.
Countries like Greece, Italy, Spain, Portugal, etc. show the future of Europe as it slowly stagnates and becomes a museum that can't feed its people.
Even Germany that was once excelling is now collapsing economically.
The only bright spot on the continent right now is Poland who are, shocker, much less regulatorily strict and have lower corporate taxes.
PIGS, really? Some of the top growing EU economies right now, which have turned their deficit around, show the future of a slowly stagnating Europe?
Edit: Parent changed their comment significantly, from something quite unpleasant to what it is now. I'm not deleting my comment as I'm not that kind of person.
What I wanted to say is: I like EU's regulation and I find it interesting how other people have different world views.
https://theonion.com/chinese-employers-to-grant-15-minute-ma...
What Europe does not have is scale-ups in tech. The tech consolidated in the US. By tech I mean internet-based companies. Remove those and the EU has higher productivity.
It's a trite talking point and not the reason why there are so few consumer-AI companies in Europe.
No, really - EU doesn't have the VCs and the megacorps. People laugh at EU sponsoring projects, but there is no private money to sponsor them. There are plenty of US companies with sites in the EU though, so you have people working the problems, but no branding.
Maybe, or maybe when silicon valley was busy growing exponentially Europe was still picking itself up from the mess of ww2.
Trying to blame a single reason is futile, naive and childish.
Moneywise, the US does have the good old Exorbitant Privilege to lean on.
The U.S. has a debt of 35Tn. The entire EU around 16Tn.
If even 10% of the debt difference was invested in tech that would have meant about $2tn more in investment in EU tech.
It's fascinating watching people circle back to this answer.
Regulation and taxation reduces incentives. Lower incentives, means lower risk-taking.
The fact this is still a lesson that needs to be debated is absurd.
And given what happened in Austria just a few hours back, not the best time for your comment.
Yes, the US has a lot of school shootings, but does anyone think loose gun regulations are why the US is strong on tech?
Great, Singapore has fewer school shootings and homeless people than anywhere in Europe by a country mile and has a soaring economy.
They make Europe look like Texas.
If you said you can look at the state of VC funding in the US and call it anything approximating "smart risks" I don't know that I'd believe you.
EU is not a business-friendly environment.
Longer term: cultural and language divisions despite attempts at creating a common market, not running the global reserve currency/military hegemony, social democracies encouraging work-life balance over cutthroat careerism, demographic issues, not getting a boost from being the only consumer economy not to be leveled in WW2, etc.
Money: There is more money for US startups. Investors (US and EU) want to invest in US based startups, not EU startups. US investors are willing to risk more money and take greater risk. EU startups that gain traction will attract US companies in that they provide a good way to extend their market to the EU, not as much for their innovations. Tech entrepreneurs (US or EU) want to work in the US if they can, because that is where the excitement and risk taking is and where the money can be made.
Teams: Building and managing EU tech teams is very different from managing US tech teams. EU teams need a lot more emotional hand-holding, and EU engineers are far more salary-oriented than equity-oriented. It is far more difficult to motivate them to go above and beyond - the "we need to get this fix or feature in tonight so we can deploy in the morning" simply will not get done if it is already 5pm. Firing EU workers is much more difficult. There are a lot more regulations for EU teams, in order to "protect" them, and that results in the teams being more "lifestyle" teams rather than "innovation" teams. EU teams get paid a lot less than their US counterparts.
Failure: Good failure is not a problem in the US, it can actually be a badge of honor. EU is very risk averse, and people avoid failure.
There are of course exceptions all around, but the weight of these observations and experiences is in favor of US teams.
This is in no way saying it is better to live in the US; there are a lot of things about the EU that are more attractive than the US, and I would probably have a better lifestyle living in Europe now that I am no longer working. But innovation and money are not among them.
This is not coincidental.
The UK has a bit of it, France has some, and that's it. The only viable alternatives are countries that have issues with the US, and that means China and Russia. China has come up with strong competitors and is on the cutting edge.
Also, it doesn't have anything to do with regulations. All 50 US states have the same American regulations, yet it's all happening in one, and some other states happen to host some infrastructure, but that's true for the rest of the world too.
If the EU/US relationship gets to Trump/Musk level, then EU can have the cutting edge stuff.
Most influential AI researchers are from Europe (inc. UK), Israel and Canada anyway. Ilya Sutskever just the other day gave a speech at his alma mater in Canada, for example. Andrej Karpathy is Slovakian. Lots of Brits, French, Polish, Chinese, Germans etc. are among the pioneers. A significant portion of the talent is non-American already; they just need a reason to be somewhere else than the US to have it outside the US. The Chinese got their reason, and with the state of affairs in the world I wouldn't be surprised if Europeans get theirs in less than 3 and a half years.
How did you come to that conclusion? We don't have access to an alternate universe where the Chinese tech market was open. There is a real possibility that it would have been far ahead had it been open.
- Regulatory friendliness (eg. DJI)
- Non-enforcement of foreign patents (eg. LiFePO4 batteries)
- Technology transfer through partnerships with domestic firms
- Government support for industries deemed to be in the national interest
China's EV dominance is a result of local governments investing and buying from local businesses.
It would be the same with Russia and China. They will receive money from their governments, sell to local buyers, and aim to expand to foreign markets.
As I said, most AI talent is not American but it is concentrated there. Give them a reason to be somewhere else, some will be somewhere else.
Why not? First of all there would be plenty of incentives for EU companies to compete with one another (and plenty of capital flowing to them, as the European market is big enough), then there would be competition with US actors in the rest of the world. That's exactly how the Asian economic model has been built: Japan, Taiwan, and South Korea have all used protectionism + export-based subsidies to create market leaders in all kinds of domains (from car manufacturing to electronics and shipbuilding).
The EU is doing a lot of enterprise level shit and it's great
The biggest company in Europe sells B2B software (SAP)
On the other hand I'm aware of no credible accusations of deepseek fudging benchmarks whereas OpenAI has had multiple instances of independent parties not being able to replicate their claimed performances on benchmarks (and not being honest and transparent about their benchmarking).
[0] https://tracxn.com/d/companies/mistral-ai/__SLZq7rzxLYqqA97j... (edited for typo)
You can write your in-house CRUD app in your basement or your office and it doesn't matter.
The vast majority of HN crowd and general social/mainstream media don't make the difference between these two scenarios
Remote work and flattening communication down to what digital media (Slack, Zoom, etc) afford strangle the beneficial network effects.
I was, yes. I should have omitted the "in office" part but I was referencing the "work more hours in America than France"
We only wish that more time != more productivity, because it would be inconvenient in multiple ways if it did. We imagine a multiplier in there to balance the equation, some factor that can completely negate production, using mere anecdotal experiences as proof.
Maybe that's not scientific; maybe time spent very closely matches productivity, and maybe production as well as productivity need external, artificial regulation.
I'm not sure I believe that. I think at some point the additional hours worked will ultimately decrease the output per unit of time, and that you'll reach a peak after which every extra hour worked will lead to an overall productivity loss.
It's also something that I think is extremely hard to consistently measure, especially for your typical office worker.
So if I work 24h/day in a farm in Afghanistan, I should earn more than software developers in the Silicon Valley (because I'm pretty sure that they sleep)? Is that how you say GDP works?
First, one should probably look at GNP (or even GNI) rather than GDP to reduce the distortionary impact of foreign direct investment, company headquarters for tax reasons, etc.
Next, need to distinguish between market rate and PPP, as you highlight.
Lastly, these are all measures of output (per capita), while productivity is output per input, in this context output per hour worked. There the differences are less pronounced.
But I wouldn't expect someone like you to know, understand or even acknowledge it.
Typically more output, but less productivity (= output/time).
There's a numerator too.
Magistral is amazingly impressive compared to ChatGPT 3.5. If it had come out two years ago we'd be saying Mistral is the clear leader. But it came out now.
Not saying they worked fewer hours, just that speed matters, and in some cases, up to a limit, working more hours gets your work done faster.
Not sure that's even true. Mistral is known to be a really hard-working place
There are a lot of myths about French workers. Our lifetime hours worked are not exceptional; our productivity is also not exceptional.
https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI0000...
Over the course of a single week, the maximum weekly working time is forty-eight hours.
https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI0000...
The weekly working time, calculated over any period of twelve consecutive weeks, may not exceed forty-four hours, except in the cases provided for in Articles L. 3121-23 to L. 3121-25.
Anyway I found an official survey saying that 40% of them work more than 50 hours per week (but fewer weeks than regular employees) so I guess it’s not so rare (around one private sector employee in twenty).
No overtime pay regardless of hours worked.
No legal maximum hours per day/week.
No mandatory rest periods/breaks (federally).
The US approach places the burden on the individual employee to negotiate protections or prove misclassification, while French law places the burden on the employer to comply with strict, state-enforced standards.
The French Labor Code (Code du travail) applies to virtually all employees in France, regardless of sector (private tech company, government agency, non-profit, etc.), unless explicitly exempted. Software engineering is not an exempted profession. Maximum hour limits are absolute. The caps of 44 hours per week, 48 hours average over 12 weeks, and 10/12 hours per day are legal maximums for almost all employees. Tech companies cannot simply ignore them. The requirements for employee consent, strict annual limits (usually max 220 hours/year), premium pay (+25%/+50%), and compensatory rest apply to software engineers just like any other employee.
"Cadre" Status is not an exemption. Many software engineers are classified as Cadres (managers/professionals) but this status does not automatically exempt them from working time rules.
Cadre au forfait jours (Days-Based Framework): This is common for senior engineers/managers. They are exempt from tracking daily/weekly hours but must still work a maximum of 218 days per year (the rest of the year being weekends, holidays, and RTT days). Their annual workload must not endanger their health. 80-hour weeks would obliterate this rest requirement and pose severe health risks, making them illegal. Employers must monitor their workload and health.
Cadre au forfait heures (Hours-Based Framework) or Non-Cadre: These employees are fully subject to the standard daily/weekly/hourly limits and overtime rules. 80+ hours/week is blatantly illegal.
The tech industry, especially gaming/startups, sometimes tries to import unsustainable "crunch" cultures. This is illegal in France.
EDIT: Fixed work days
Wouldn’t that be nice, 218 rest days? It’s 218 working days.
Are you working in an area that is that specific? I'm French but I'm naive.
But it's worth pointing out that the U.S.'s famous 9-to-5 is completely inapplicable to any sort of high-demand job. For many people in a demanding profession like tech, a 9-to-5 job would be an absolute (and often unattainable) dream. Where I live (Washington, D.C.) people who want a 9-to-5 will generally leave industry altogether and work for the government. (And even there, a true 9-to-5 can be elusive.)
Classic drive by internet trope.
Maybe try a little harder, have an informed opinion about something.
That's the copium HN thinks. European workers bust their asses for glory not for money.
We've asked you several times to stop commenting in this inflammatory style on HN. We don't want to ban you, as we want HN to be open to a broad range of views and discussion styles, but if you keep commenting in ways that break the guidelines and draw valid complaints from other community members, a ban will be the next step we'll have to take.
If you want HN to be a good place to engage in interesting discussions, please do your part to make it better not worse.
See this guy: https://news.ycombinator.com/item?id=44254864
And there's countless like him that get away with it. You'll then argue that there's no resources to moderate everything on HN, which while true, it's also more than sus how there seems to always be enough resources to moderate conservative viewpoints but rarely attacks from liberals that break the same rules, which is a blatant double standard that HN moderation is ignoring.
You talk the talk about HN being to quote you "open to a broad range of views and discussion styles" but what you actually support is a suppression of free speech and a one sided view of things that can only exist in a biased heavily moderated echo chamber, and not in the free market place of ideas you claim to support.
We can't act on things that the community doesn't tell us about. Almost always, when people point to comments that are egregious but still live as evidence that the moderators approve of them, the reality is that we didn't see them. And a major reason for that is that political flamewar is now such a big part of the activity on HN that our small team can't ever see all the comments that are flagged.
But please don't try to use other people's transgressions as an excuse for your own. That's an age-old trick that doesn't work well here.
If you are sincere about being a positive contributor to this community, you can easily show that by making an effort to observe the guidelines. You could also make good-faith efforts to hold other community members to high standards by flagging comments, and if you see anything that's particularly egregious, emailing us.
Edit: you added to your comment after I submitted mine, so I'll add a further response.
We don't care about what side you're arguing for. Often we don't know; we don't have time to figure out what each commentator in a flamewar is on about. The topic of bias has been hurled at HN for as long as it's existed. Dan has an ever-growing list of the complaints we get from each side characterising us as being biased towards the other side [1].
We have guidelines for a reason, which is that if people fill their comments with inflammatory rhetoric, the emotional energy that triggers is what dominates people's perception of the discussion, rather than the substance of the points people are trying to get across.
If you have points to make that have substance, and I know that you do, you need to find a way to get them across without being inflammatory, otherwise it's a waste of everyone's time.
Why do you think that is? Is it not a reflection of the userbase bias? Where comments get flagged not based on rules but based on which political side they are targeting?
>You could also make good-faith efforts to hold other community members to high standards by flagging comments
Doesn't help when others vouch for them to support their ideology.
>We don't care about what side you're arguing for.
You don't, but your userbase does. And your moderation is based on what your userbase flags. So your moderation 100% reflects the bias of the community, hence the biased enforcement of your rules.
Answer me: why is my comment here flagged?
That comment was a breach of the guidelines but it almost always takes longer than half an hour for a comment to be flagged, and for us to see it, especially on a thread that's over a day old that barely anybody is looking at anymore.
You could flag it yourself. The fact that you didn't makes it seem more like you're trying to prove a point about bias rather than doing your part to support the health of the community.
> Doesn't help when others vouch for them to support their ideology.
People who abuse vouching privileges can have those privileges revoked – if we know about it. Again, when you see this, email us.
The site has all kinds of mechanisms and norms to prevent abuse and dysfunction, but they can only work if people are sincere about making the site better rather than being at war with it.
Edit: Adding this in reply to your addition:
> Answer me: why is my comment here flagged?
> https://news.ycombinator.com/item?id=44240839
These parts break the guidelines against assuming bad faith and fulminating:
Bad faith argument.
I've mostly seen change for the sake of change, wrapped in fluffy artsy BS jargon, making it sound like each UI change is the second coming of Christ and fixes world hunger.
They're not especially egregious but when you develop a reputation for breaking the guidelines, your comments are going to attract more flags, and also trigger more complaints made privately to us via email.
We received emails complaining about your comments, including this one, from people who have a good track record of supporting the community and not being politically partisan.
When we receive these kinds of complaints, nobody is complaining about your politics, just about your inflammatory style and guidelines breaches.
Mate, I don't have time to flag all comments that I find inflammatory, especially when I flagged many comments in the past and nothing happened to them, so what's the point? I flagged this one afterwards and the comment was still there. So why are you throwing the blame on me? Why didn't you remove that comment after I pointed it out?
> Again, when you see this, email us.
Mate please, be serious, me and most normal people have better things to do with our time than go full Karen "I want to talk to the manager" mode, and go to such lengths like emailing HN mods about other peoples' comments. Downvotes and flags are enough for me.
The fact that there are people here who have the time to send you emails about my comments that don't break the rules, just because they're butthurt, says something about those users (unemployed, terminally online, mentally unstable on SSRIs, social and political activists, etc). Normal, employed people with healthy social lives don't send mod emails about comments they don't like on the internet. WTH?
>These parts break the guidelines against assuming bad faith
Then by that yardstick, isn't the comment I was replying to also in bad faith, just like I pointed out initially? Since he was using Android 2 to justify that Android 16 is the superior UI. And I replied that's in bad faith, since the alternative to Android 16's shitty UI is not going back to Android 2 to make Android 16 look good; version 10 is a good counter showing why version 16 is bad. How is my comment in bad faith and that one not?
>fulminating
Why was it fulminating? Was it any more so than the rest of the comments everyone makes on HN? That was just criticism of Android's UI evolution. Since when is criticism of something, with arguments, considered "fulminating"? Please explain, I'm genuinely curious. Because otherwise this so-called rule break of mine feels like blatant discrimination and a double standard.
>you're going to get more flags, and complaints made about you
Bro, you're straight up admitting to biased moderation here. That the community doing the flagging cares more about WHO is saying something rather than WHAT is being said. How can you talk about free speech and fair moderation with a straight face in this case?
Please, answer me these questions.
- It takes two minutes to write an email pointing out an egregious comment or bad actor.
- People are flagging and complaining about your comments because they are breaking the guidelines, nothing more, nothing less. Maybe you're not aware of it due to a cultural disconnect. If that's the case, I'm sorry you're in that position but I encourage you to take the feedback and work with us to come into alignment with the community. But it's not about your politics, it's all to do with your inflammatory style of commenting.
- I admitted no such bias; I said that your comments have a pattern of breaking the guidelines, including the ones people are complaining about, and when your comments consistently break the guidelines you will inevitably get less patience from everyone than if you lapse occasionally.
Please stop this war. We're not trying to oppress you. We want your point of view to be fairly represented, but that can only happen if you make an effort to play by same rules that everyone else is expected to follow.
You keep saying it's not you who decides what's right and wrong, that it's the users who decide based on the rules by flagging. Then why am I wrong with my assertion that it's the rule of the mob that decides what is right and what is wrong, and not the rules? There's obviously no impartial judge here, just the angry mob, which is anything but impartial and unbiased, since, just like in voting at elections, people don't vote based on facts and logic but based on feelings and tribalism over a person (see the US election results).
So if people want to flag-bomb a certain user because of beef and not facts, they will. And instead of giving me an unbiased explanation of why the comment I was replying to was not breaking the rules like you say mine was, you avoid the topic and parrot some boy scout speech on the honesty and integrity of the userbase when I show you the hypocrisy of that and ask for an exact explanation. If you don't want to explain to me why that comment wasn't breaking the rules but somehow mine was, that's fine, just don't have the audacity to piss on me and tell me it's raining.
> You keep saying it's not you who decides what's right and wrong, that it's the users who decide based on the rules by flagging
I think this is a misunderstanding. We moderators don't (can't) make judgements about accuracy or truthfulness of comments. All we can do is determine if a comment breaks the guidelines. Comments should only be flagged by users if they break the guidelines. Our enforcement of the guidelines is independent of the accuracy of the comment's content or its ideology.
If a comment of yours is flagged for any reason other than guidelines breaches, you're within your rights to protest. But given your conduct even in this subthread with me, in which your comments continue to be full of guidelines breaches, it seems you're not able to gauge whether your comments are within the guidelines or not.
It's going to keep being a problem if you're not able to correct that.
> Then by that yardstick, isn't the comment I was replying too also in bad faith, just like I pointed out initially?
No, it isn't. First, it's completely invalid to say "by that yardstick" - you're comparing completely different things that have no bearing on each other. Second, no, there's zero evidence that the comment you replied to was in bad faith.
The definition of a bad faith argument is one that is inauthentic, one that the argument-maker doesn't actually believe in themselves. Factually, there's no evidence to support your accusation that that comment by @butlike was in bad faith - they didn't make any self-contradictory statements in their comment, nor did they post a single other comment in that whole thread, nor did they say anything that would indicate that they were acting anything but genuinely.
And, factually, you were breaking the guidelines by assuming bad faith about https://news.ycombinator.com/item?id=44240808.
When challenged on it in https://news.ycombinator.com/item?id=44244578, you said:
> I assumed good faith, but then I used critical thinking and decided it's in bad faith then explained why. You don't need to agree with me on this.
This has three falsehoods in it. First, you did not assume good faith - you assumed bad faith, because there was zero evidence to support the idea that it was in bad faith. Second, you didn't use critical thinking - again, because there was no evidence to support that belief. Third, you did not explain why the comment was in bad faith - you explained why you disagreed with it, indicating that you don't understand the difference between disagreeing with someone's statements, and them being in bad faith (which is further reinforced in the above when you say "And I replied that's in bad faith since the alternative to Android 16 shitty UI is not going back to Android 2 to make Android 16 look good" - no, that's literally not what "bad faith" means).
Finally, more generally, beyond the falsehoods and fallacies that you've been making, you're also acting extremely abrasively, in ways that break the guidelines and that antagonize other users.
The theme of HN is intellectual curiosity. The way that you've been acting is the exact opposite of that.
That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).
I use it for almost every interaction I have with AI that isn't asking it to oneshot complex code. I fairly frequently run my prompts against Claude/ChatGPT and Mistral 3.1 and find that for most things they're not meaningfully different.
I also spend a lot of time playing around with it for storytelling/integration into narrative games.
The only dependency on the Node side is 'mime', which is just a dict of MIME types; data lives inside Node's new `node:sqlite`; everything on the front side that isn't just vanilla is Alpine. It runs on my main desktop and has filesystem access (which doesn't yet do anything useful, really), but the advantage here is that since I've written (well, at least read) all of the code, I can put a very high level of trust into my interactions.
https://xcancel.com/arthurmensch/status/1920136871461433620#...
Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...
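The maj@64 number just means self-consistency: sample 64 completions, extract each final answer, and take the most common one. A toy sketch, assuming the answers have already been extracted:

    # Toy majority voting ("maj@k") over already-extracted final answers.
    from collections import Counter

    def majority_vote(answers):
        counted = Counter(a for a in answers if a is not None)
        return counted.most_common(1)[0][0] if counted else None

    samples = ["42", "42", "41", "42", None, "42"]  # stand-in for 64 sampled answers
    print(majority_vote(samples))  # -> 42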
> That was the easy part
Is a bit hand-wavy in that it doesn't explain why it's only DeepSeek who can do this "easy" thing, but still not Meta, Mistral or anyone else really. There are many other players who have way more compute than DeepSeek (even inside China, not even considering rest of the world), and I can assure you more or less everyone trains on synthetic data/distillation from whatever bigger model they can access.
IMHO tool calling is by far the most clearly economically valuable function for an LLM, and r1 self-admittedly just...couldn't do it.
There's a lot of puff out there that's just completely misaligned with reality, ex. Gemini 2.5 Pro is by far the worst tool caller, Gemini 2.5 Flash thinking is better, 2.5 Flash is even better. And either Llama 4 variant beats all the Gemini 2.5s except 2.5 Flash non-thinking.
I'm all for "these differences will net out in the long run", Google's at least figured out how to micro optimize for Aider edit formatting without tools. Over the last 3 months, they're up 10% on edit performance. But it's horrible UX to have these specially formatted code blocks in the middle of prose. They desperately need to clean up their absurd tool-calling system. But I've been saying that for a year now. And they don't take it seriously, at all. One of their most visible leads tweeted "hey what are the best edit formats?" and a day later is tweeting the official guide for doing edits. I'm a Xoogler and that absolutely reeks of BigCo dysfunction - someone realized a problem 2 months after release and now we have "fixed" it without training, and now that's the right way to do things. Because if it isn't, well, what would we do? Shrugs
I'm also unsure how much longer it's worth giving a pass on this stuff. Everyone is competing on agentic stuff because that's the golden goose, real automation, and that needs tools. It would be utterly unsurprising to me for Google to keep missing a pain signal on this, vis a vis Anthropic, which doubled down on it mid-2024.
As long as I'm dumping info, BFCL is not a good proxy for this quality. Think "converts prose to JSON" not "file reading and editing"
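For readers who haven't touched this: by "tool calling" I mean the model reliably emitting structured calls against a schema like the one below (the read_file tool here is purely illustrative, not any vendor's actual API):

    # Illustrative OpenAI-style tool schema; "read_file" is a made-up example tool.
    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the workspace and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }]
    # A good tool caller reliably emits well-formed calls such as:
    #   {"name": "read_file", "arguments": "{\"path\": \"src/main.py\"}"}
    # rather than burying edits in specially formatted prose blocks.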
I am not really invested in this niche topic, but I will observe that, yes, I agree Llama 4 is really good here. And yet it's a far worse coder, far less intelligent than DeepSeek, and that's not even arguable. So no, it didn't "catch up" any more than what you could say by pointing out Llama is multimodal but DeepSeek isn't. That's just talking about different things entirely.
Regardless, I do agree BFCL is not the best measure either, the Tau-bench is more real world relevant. But end of the day, most frontier labs are not incentive aligned to care about this. Meta cares because this is something Zuck personally cares about, Llama models are actually for small businesses solving grunt automation, not for random people coding at home. People like Salesforce care (xLAM), even China had GLM before DeepSeek was a thing. DeepSeek might care so long as it looks good for coding benchmarks, but that's pretty much the extent of it.
And I suspect Google doesn't truly care because in the long run they want to build everything themselves. They already have a CodeAssist product around coding which likely uses fine-tune of their mainline Gemini models to do something even more specific to their plugin.
There is a possibility that at the frontier, models are struggling to be better in a specific and constrained way, without getting worse at other things. It's either this, or even Anthropic has gone rogue because their Aider scores are way down now from before. How does that make sense if they are supposed to be all around better at agentic stuff in tool agnostic way? Then you realise they now have Claude Coder and it just makes way more economic sense to tie yourself to that, be context inefficient to your heart's content so that you can burn tokens instead of being, you know, just generally better.
> focusing on a single specific capability and
> I am not really invested in this niche topic
Right: I definitely ceded a "but it doesn't matter to me!" argument in my comment.
I sense a little "doth protest too much" in the multiple paragraphs devoted to taking that and extending it into the claim that the underpinning of automation is "irrelevant", "single", "specific", "niche".
This would also be news to DeepSeek, who put a lot of work to launch it in the r1 update a couple weeks back.
Separately, I assure you, it would be news to anyone on the Gemini team that they don't care because they want to own everything. I passed this along via DM and got "I wish :)" in return - there's been a fire drill trying to improve it via AIDER in the short term, is my understanding.
If we ignore that, and posit there is an upper management conspiracy to suppress performance that's just getting public cover from a lower upper management rush to improve scores... I guess that's possible.
Finally, one of my favorite quotes is "when faced with a contradiction, first check your premises" - to your Q about why no one can compete with DeepSeek R1 25-01, I'd humbly suggest you may be undergeneralizing, given even tool calls are "irrelevant" and "niche" to you.
Of course you have to pretend I meant the former, otherwise "they all have" doesn't entirely make sense. Not that it made total sense before either, but if I say your definition of "they" is laughably narrow, I suspect you will go back to your google contact and confirm that nothing else really exists outside it.
Oh and do a ctrl-f on "irrelevant" please, perhaps some fact grounding is in order. There was an interesting conversation to be had about underpinning of automation somehow without intelligence (Llama 4) but who has time for that if we can have hallucination go hand in hand with forced agendas (free disclaimer to boot) and projection ("doth protest too much")? Truly unforeseeable.
Don't forget I also referred to you having "hallucination". In retrospect, likening your logical consistency to an LLM was premature, because not even gpt-3.5 era models could pull off a gem like:
> You: to your Q about why no one can compete with DeepSeek R1 25-01 blah blah blah
>> Me: ...why would you presume I was talking about 25-01 when 28-05 exists and you even seem to know it?
>>> You: this was front and center on their API page!
Riveting stuff. Few more digs about poor grammar and how many times you stopped reading, and you might even sell the misdirection.
In the top 10 are models by OpenAI (GPT-4o mini), Google (Gemini Flashes and Pros), Anthropic (Sonnets), and DeepSeek's. Even though the company list grows shorter if we instead look at top model usage grouped by order of magnitude, it retains the same companies.
Personally, the models meeting my quality bar are: GPT-4.1, o4-mini, o3, Gemini 2.5 Pro, Gemini 2.5 Flash (not 2.0), Claude Sonnet, DeepSeek and DeepSeek R1 (both versions). Claude Sonnet 3.5 was the first time I found LLMs to be useful for programming work. This is not to say there are no good models by others (such as Alibaba, Meta, Mistral, Cohere, THUDM, LG, perhaps Microsoft), particularly in compute-constrained scenarios, just that only DeepSeek reaches the quality tier of the big 3.
you could say they're playing by a different set of rules, but distilling from the best available model is the current meta across the industry. only they know what fraction of their post-training data is generated from openai models, but personally i'd bet my ass it's greater than zero because they are clearly competent and in their position it would have been dumb to not do this.
however you want to frame it, they have pushed the field forward -- especially in the realm of open-weight models.
That being said, it's still very impressive for a 24B.
I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.
Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch, they just continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.
Of course, OpenAI, Google, and Anthropic will have released new models by then too...
Can you link me to the benchmark you found?
You can be 6-12 months behind and have not burned tens of billions compared to the best in class; I see it as an engineering win.
I absolutely understand those that say "yeah, but customers will only use the best", I see it, but is market share of forever-money-losing businesses that valuable?
Over the coming years it won't be possible to stay a mere 6-12 months behind as the costs to build and maintain the AI super-infrastructure keep climbing. It'll become a guaranteed implosion scenario. Winning will provide the ongoing immense resources needed to keep pushing up the hill forever. Everybody else - except a few - will fall away. The same outcome took place in search. Anybody spot Lycos, Excite, HotBot, AltaVista around? It costs an enormous amount of money to try to keep up with Google (Bing, Baidu, Yandex) in search and scale it. This will be an even more brutal example of that, as the costs to scale are even higher.
The only way Mistral survives is if they're heavily subsidized directly by European states.
You cannot compare Uber to the AI market. They are too different. Uber captured the market because having three taxi services is annoying. But people are readily jumping between models using multi-model platforms. And nobody is significantly ahead of the pack. There is nothing that sets anyone apart aside from the rate at which they are burning capital. Any advantage is closed within a year.
If OpenAI wants to make a profit, it will raise prices and be dropped at a heartbeat for the next cheapest option. Most software stacks are designed to be model-agnostic, making integration or support a non-factor.
The winner-take-all effect is a lot stronger with chat apps.
No they don't. They failed in every market except a few niche ones.
Also, Europe being in the race is a big deal for consumers.
The technology is closer to a decade from seeing a plateau for the large general models. OpenAI's o3 is significantly beyond o1 (let alone GPT-3.5, which was just Nov 2022). Claude 4 is significantly beyond 3.5. They're not subtle improvements. And most likely there will be a splintering of specialization that will see huge leaps outside the large general models. The radical leap in coding capabilities over the past 12-18 months is just an early example of how that will work, and it will affect every segment of human endeavour.
They're burning through compute and capital. No amount of advertising could cover the cost of training or even running these models. The massive subscription costs we've started seeing are just a small glimpse into the money they are burning through.
They will NOT make a profit using the current methods unless the models become at least 10 times more efficient than they are now, at which point Europe can adapt to the innovation without much cost.
It's an arms race to see who can burn the most money the fastest, while selling the result for as little as possible. When they need to start making money, it will all come crashing down.
People have been claiming that since 2022. Where's the plateau?
Hmmm. It's almost as if a company without a user data stream like OpenAI would be driven to release an end-user device for the sole purpose of capturing more training data...
LLMs haven't improved much. What's improved is the chat apps: switching between language model, vision, image and video generation and being able to search the internet is what has made them seem 100x more useful.
Run a single LLM without any tools... They're still pretty dumb.
Is it possible to run multiple reasoning models on one problem? (Why not? I guess).
Another funny thought is: they release their Small model, and kept their Medium as a premium service. I wonder if you could do chains with Medium run occasionally, linked together by local runs of Small?
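A rough sketch of how such a chain could look, treating both endpoints as OpenAI-compatible chat APIs (llama.cpp's llama-server locally for Small, Mistral's hosted API for Medium). The URLs, the routing rule, the local model id, and the example prompts are illustrative assumptions, not a documented Mistral workflow:
```
from openai import OpenAI

# Two OpenAI-compatible endpoints (URLs and model ids are illustrative assumptions):
# - a local llama.cpp `llama-server` instance serving Magistral Small
# - Mistral's hosted API serving Magistral Medium
local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
remote = OpenAI(base_url="https://api.mistral.ai/v1", api_key="YOUR_MISTRAL_API_KEY")

def ask(prompt: str, hard: bool = False) -> str:
    """Route cheap steps to the local Small model and hard steps to Medium."""
    client, model = (remote, "magistral-medium-2506") if hard else (local, "magistral-small")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,   # sampling settings are illustrative
        top_p=0.95,
    )
    return resp.choices[0].message.content

# Example chain: local drafting, then one expensive remote refinement pass.
draft = ask("Outline a plan to add CSV import to a small Flask app.")
final = ask(f"Review and tighten this plan:\n{draft}", hard=True)
print(final)
```
The design choice is simply to keep most of the token volume on the cheap local model and only pay for Medium on the steps where quality matters.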
How does one figure this out? As in I want to know the comparable Deepseek or Llama equivalent (size-wise) and don't want to figure it out by trial and error.
Qwen3-4B almost beats Magistral Small (24B) on the 4 available benchmarks, and Qwen3-30B-A3B is miles ahead.
But it's European, so it's a point of pride.
Relevance or not, we will keep hearing the name as a result.
I throw tasks at it running locally to save on API costs, and it's possibly better than anything we had a year or so ago from closed-source providers. For programming tasks, I'd rank it higher than gpt-4o.
I wish Mistral were back to making MoE models. I loved their 8x7B Mixtral; it was one of the greatest models I could run at the time it came out, but it's outdated now. I wish somebody were making a similar-size MoE model, one that could comfortably sit in a 64GB RAM MacBook and be fast. Currently Qwen3-30B-A3B is the only one I know of, but it would be nice to have something slightly bigger/better (including a non-reasoning base one). All the other MoE models are just too big to run locally on more standard hardware.
- MMLU-Pro for knowledge
- https://lmarena.ai/leaderboard for user preference
We've only got Magistral's GPQA, AIME & LiveCodeBench numbers so far.
This makes it yet another example of European companies building great products but fumbling marketing.
Mistral's edge is speed. It's a real pleasure to use because it answers in ~1s what takes other models 5-8s, which makes for a much better experience. But instead of focusing on it, they bury it far down the post.
Try it and see if you like the speed! Note that the speed advantage only applies to queries that don't require web search, as Mistral is significantly slower on those, leading to a ~5-second advantage over 2 minutes of research for the queries I benchmarked against Grok.
My current use of AI is to generate code, or translate some code from one programming language to another, which I can then improve (instead of writing it from scratch). Speed isn't necessary for this. It's a nice-to-have, but only if it's not at the cost of quality.
Also, as unfair as it "might" be, we do expect a fast AI not to be as good, don't we? So I wouldn't focus on that in the marketing. I think speed would be easier to sell as something extra you would pay for, because then you'd expect the quality to remain the same or better.
It has similar speed to o4-mini with search on ChatGPT, and o4-mini gave me a much better result.
I generally find the same thing for simple definitions/translations and other "chat" tasks. I'm a little bit surprised that you also find it so for coding, but otherwise I think I get it.
The first sentence is "Announcing Magistral — the first reasoning model by Mistral AI — excelling in domain-specific, transparent, and multilingual reasoning." and those dashes should clearly be commas.
And this sentence is just flat-out wrong: "Lack of specialized depth needed for domain-specific problems, limited transparency, and inconsistent reasoning in the desired language — are just some of the known limitations of early thinking models."
It’s an extremely useful tool in writing and I’ve been using it for decades.
It makes sense that humans would have been using it, though; ChatGPT learned from us, after all.
> it significantly improves project planning, backend architecture, frontend design, and data engineering through sequenced, multi-step actions involving external tools or API.
I'm guessing this means it was trained with tool calling? And if so, does that mean it does tool calling within the thinking/reasoning, or within the main text? Seems unclear
This would likely be a good model for the "plan" mode in various agentic tools (cline, aider, cursor/windsurf/void, etc). So you'd have a chat in plan mode, then use devstral to actually implement that plan.
What's the huge difference between the two pelicans riding bicycles? Was one running locally the small version vs the pretty good one running the bigger one thru the API?
Thanks, Morgan
Mistral's API defaults to `magistral-medium-2506` right now, which is running with full precision, no quantization.
It literally only makes everything worse and more convoluted with zero benefits.
It’s usually either because the context size is set very low by default or they didn’t realize that they weren’t running the full model (ollama uses the distilled version in place of the full version but names it after the full version).
There’s also been some controversy over not giving proper credit to llama.cpp which ollama is/was a wrapper around.
I've never used ollama, but perhaps you mean quantized and not distilled? Or do they actually use distilled versions?
Just use llama.cpp directly.
but then someone found that, at least for distilled models,
> correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness
https://arxiv.org/pdf/2505.13792
i.e., the conclusion doesn't necessarily follow from the reasoning. So is there still value in seeing the reasoning? There may be useful information in it, but I'm not sure it can be interpreted by humans as a typical human chain of reasoning; maybe it should be read more as a loud multi-party discussion on the relevant subject, which may have informed the conclusion but not necessarily led to it.
OTOH, considering the effects of automation fatigue vs human oversight, I guess it's unlikely anyone will ever look at the reasoning in practice, except to summarily verify that it's there and tick the boxes on some form.
https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
BTW, I am personally a fan of Mistral, because while it is not the top model, it produces good results, and most importantly it is super fast; just go to its chat and be amazed. It really saves a lot of time to get a quick response.
I tried it, 80% of the "text" was recognised as images and output as whitespace so most of it was empty. It was much much worse than tesseract.
A month later I got the bill for that crap and deleted my account.
Maybe this is better, but I'm over hype marketing from Mistral.
A Brazilian citizen is flying from Sao Paulo to Paris, with a connection in Lisbon. Does he need to clear immigration in Lisbon or in Paris or in both cities or in neither city?
Mistral AI said that "immigration control will only be cleared in Paris," which I think is wrong.
After I pointed it to the Wikipedia article on this topic[1], it corrected itself to say that "immigration control will be cleared in Lisbon, the first point of entry into the Schengen Area."
I tried the same question with Meta AI (Llama 4) and it did much worse: It said that the traveler "wouldn't need to clear immigration in either Lisbon or Paris, given the flight connections are within the Schengen Area", which is completely incorrect.
I'd be interested to hear if other LLMs give a correct answer.
----
Since both Portugal and France are part of the Schengen Area, and a Brazilian citizen generally does not need a visa for short stays (up to 90 days in any 180-day period) in the Schengen Area, here's how immigration will work:
Lisbon: The Brazilian citizen will need to clear immigration in Lisbon. This is because Lisbon is the first point of entry into the Schengen Area. At this point, their passport will be stamped, and they will be officially admitted into the Schengen Zone.
Paris: Once they have cleared immigration in Lisbon, their flight from Lisbon to Paris is considered a domestic flight within the Schengen Area. Therefore, they will not need to clear immigration again in Paris.
Important Note: While Brazilians currently enjoy visa-free travel, the European Travel Information and Authorization System (ETIAS) is expected to become mandatory by late 2026. Once implemented, Brazilian citizens will need to obtain this electronic authorization before their trip to Europe, even for visa-free stays. However, this is a pre-travel authorization, not a visa in the traditional sense, and the immigration clearance process at the first point of entry would remain the same.
```
Based on the Schengen Area rules for international travel:
Lisbon (Portugal) - Immigration clearance is required here. Since Lisbon is the first point of entry into the Schengen Area (and Brazil is outside Schengen), the passenger must clear immigration (passport control) in Lisbon. This is standard for all non-Schengen arrivals.
Paris (France) - No immigration clearance is needed. After clearing immigration in Lisbon, the onward flight to Paris is considered a domestic Schengen flight (no border checks within Schengen).
```
"The Brazilian citizen needs to clear immigration in Lisbon."
When I asked it to explain why, it said
"The Brazilian citizen needs to clear immigration in Lisbon because Lisbon is the first Schengen entry point when traveling from a non-Schengen country (Brazil) to a Schengen country (France via Portugal). According to Schengen rules, travelers must clear immigration at the first Schengen port of entry. Since Paris is also within the Schengen Area, no additional immigration checks are required upon arrival in Paris."
Mistral comes from the Occitan for "masterly", although today, as far as I know, it's only used in English when talking about Mediterranean winds.
Magistral is just the adjective form of "magister," so "like a master."
If you want to make a few bucks, maybe look up some more obscure synonyms for masterly and pick up the domain names.
It's a French company, and "mistral" has this usage in French as well. Also, "magistral" is just the French translation of "masterful".
24B size is good for local inference.
As a model outputting long "reasoning" traces (~10k tokens), 40k context length is a little concerning.
Where are the results of normal benchmarks, e.g., MMLU/Pro, IFEval, and such?
Still, thank you Mistral team for releasing this model with Apache 2.0.
My favorite from the last few months was asking for a string whose base64 encoding contains non-alphanumeric, non-padding symbols (so '+' or '/' should appear in the output). It thought for 7 minutes and 74k characters of markdown, and finally came up with the string AB?C, which encodes to QUI/Qw== (correct). It is impressive, because general LLMs just always fail at this, though I haven't tried other "thinking" models recently.
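For anyone who wants to check the claim themselves, a minimal Python snippet (the string AB?C comes from the comment above; the assertion just confirms a non-alphanumeric, non-padding character shows up in the encoding):
```
import base64

# Encode the candidate string and check the claim from the comment above.
encoded = base64.b64encode(b"AB?C").decode("ascii")
print(encoded)  # QUI/Qw==

# The interesting property: a '+' or '/' appears besides the '=' padding.
assert "+" in encoded or "/" in encoded
```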
With this, at least it seems like some of that work was done upfront, or the thinking is tuned to avoid those issues, because it's giving me similar conclusions to a sanity-checked prompt. Heck, even Google Gemini and ChatGPT were spitting that stuff out, whereas this one is giving me a reasonable response. So in that regard, big thumbs up to the Mistral team if they did any specific work in that area. It's something I cared about and was getting concerned nobody else cared about enough to fix.
- Think up a topic that's interesting to you, yet maybe controversial.
- Look up primary sources and empirical information about it.
- Then look at a relevant Wikipedia article about it to see if the way the Wikipedia article frames it is honestly and faithfully justified by the primary sources and empirical data about it.
If the article seems to have a strong bias or to critically misrepresent reality, even if it does so by stating true things, you have a juicy nugget on your hands.
Ask any given LLM about that topic and see if it regurgitates the opinion in the Wikipedia article. If it does, then develop your own prompt that requires the LLM to go down a checklist of things that help resolve warped logic without specifically trying to shape the output to your own preference. Now find other articles and see how well your checklist generalizes.
How well this works depends on how good the model you're using is at instruction following.
A lot of what thinking models do is expand the context around a topic to hopefully improve final prediction. To assist that, you have to encourage the LLM to be hesitant to form an opinion or decide on the conclusion before the end, otherwise it can start with a conclusion and spend the rest of the time supporting a weak conclusion rather than arriving at a stronger one after new information emerges.
The danger is that reasoning models will state, early in their reasoning, some ideological claim the same way they might say "well, I know that 1+1=2, so that means X", when in reality the claim does not stand up to scrutiny. Then the model gets lost in a loop of thinking ideologically, which can help propagate these ideas through language models, which is dangerous.
Ideally all ingested Wikipedia gets evaluated against some levels of ground truth before getting trained on to start with, but then it's harder to keep up to date with it. Until then we have to help LLMs handle these cases better.
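A minimal sketch of what a checklist-style, "hold off on conclusions" prompt could look like, using an OpenAI-compatible client. The endpoint, model id, and checklist wording are all illustrative assumptions, not a recipe taken from the comment above:
```
from openai import OpenAI

# Any OpenAI-compatible endpoint works here; URL and model id are placeholders.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# A checklist-style system prompt that discourages committing to a conclusion
# early in the reasoning. The wording is a sketch.
CHECKLIST_PROMPT = """Before answering, work through these steps and do not
commit to a conclusion until the final step:
1. Restate the question in neutral terms, without loaded framing.
2. List the primary sources and empirical data you are aware of.
3. Note which claims are contested, and by whom.
4. Separate descriptive statements from evaluative ones.
5. Only then state a conclusion, and flag any remaining uncertainty."""

resp = client.chat.completions.create(
    model="magistral-small",  # placeholder model id
    messages=[
        {"role": "system", "content": CHECKLIST_PROMPT},
        {"role": "user", "content": "Summarize the debate around <your topic>."},
    ],
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```
How well this works will, as noted above, depend on how good the model is at instruction following.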
If I think that Rabbits and Hares are classified by Wikipedia incorrectly and Ideological Wikipedia Editors are hiding the truth with Disinformation, why would I give the model any credit if it tells the correct answer only if I develop a custom 22 point mammalian biology reasoning checklist that leads it to the Real Truth about Rabbits and Hares?
It certainly doesn't inspire confidence that any other particular question would be answered correctly?
Should be quite easy if you have some o4-mini results sitting around.
I was recently working on a user-facing feature using self-hosted Gemma 27B with vLLM and was getting fully formed JSON results in ~7 seconds (even that I would like to optimize further). Obviously the size of the JSON matters, but I'd never use a reasoning model for this because they're constantly circling and just wasting compute.
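For reference, a minimal sketch of getting structured JSON out of a self-hosted model behind vLLM's OpenAI-compatible server. The port, model id, and prompt are assumptions, and the `response_format={"type": "json_object"}` constraint is only available in reasonably recent vLLM versions:
```
from openai import OpenAI

# vLLM exposes an OpenAI-compatible server; port and model id are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="google/gemma-2-27b-it",  # whatever model id vLLM was launched with
    messages=[
        {"role": "system", "content": "Reply with a single JSON object only."},
        {"role": "user", "content": "Extract city and date from: 'Meet me in Lyon on 2025-03-14'."},
    ],
    # Constrained JSON decoding; supported by recent vLLM versions.
    response_format={"type": "json_object"},
    temperature=0.0,
)
print(resp.choices[0].message.content)
```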
I haven’t really found a super convincing use-case for reasoning models yet, other than a chat style interface or an assistant to bounce ideas off of.
>>> Who was Adolf Hitler's mother? <think> Alright, I know that Adolf Hitler is one of the most well-known historical figures, but I'm not entirely sure about his mother's name. I recall that his father was Alois Hitler, but I'm not sure about his mother. Maybe I can break this down.
First, I remember that Hitler was born in Austria, and his family background is often discussed in historical contexts. I think his mother was a local woman, but I don't recall her name. Maybe I can think about common names or try to recall if I've seen her name in documentaries or books.
I think I remember that his mother's name was Klara. But I'm not sure about the last name. Was it the same as his father's? Probably, since it was a common practice at the time for women to take their husband's surnames. But I'm not sure if that was the case here.
Wait, I recall that his father's name was Alois Hitler, and his mother's name was Klara. But what was her full name? I think it was Klara Hitler. But I'm not sure if she had a different last name before marriage.
I think I remember that her maiden name was Pölzl. But I'm not sure. I recall that she was from a lower-middle-class family, and that might be relevant.
But to be sure, I should think about the most common information. The most commonly cited name for Hitler's mother is Klara Hitler (née Pölzl). But is that accurate? I think so.
But to confirm, I recall that Klara Pölzl was her name before marriage, and after marrying Alois Hitler, she became Klara Hitler. But I'm not sure if that's correct.
Wait, I think I recall that his mother's name was Klara Hitler, but her maiden name was Klara Pölzl. But I'm not sure. Maybe I can think about the timeline.
Adolf Hitler was born in 1889. His mother must have been born some years before that. If she was in her early 20s when she had Adolf, she might have been born around the early 1860s.
But I don't have exact dates, so maybe I should focus on the name. I think the most common reference is Klara Hitler.
But I recall that in some sources, she is referred to as Klara Pölzl, especially when talking about her early life.
But in most historical references, she is known as Klara Hitler after marriage.
But I think the question is about her name, and it's common to refer to women by their married name in historical contexts, especially if that's the name they were most commonly known by.
So, her name was likely Klara Hitler.
But to be thorough, I think her full name was Klara Hitler (née Pölzl).
But the question asks for her name, and it's common to refer to someone by their married name, especially if that's the name they were most commonly associated with.
So, the answer is likely Klara Hitler.
But I recall that sometimes she is referred to as Klara Pölzl, especially in contexts where her maiden name is relevant.
But in general, historical references often use her married name.
I think the most accurate answer is Klara Hitler.
But to be sure, I recall that in many biographies, she is referred to as Klara Hitler.
Therefore, I think the answer is Klara Hitler.
But I'm not entirely confident. Maybe I can think of another way to verify this.
I recall that Hitler had a close relationship with his mother, and she is often mentioned in his early life. The name Klara Hitler seems to be the standard reference.
Therefore, I conclude that Adolf Hitler's mother was Klara Hitler.
But just to double-check, I think her maiden name was Pölzl, but she became Klara Hitler after marriage.
So, her name was Klara Hitler.
But sometimes people refer to her by her maiden name, especially in contexts where her family background is important.
But the most common and likely correct answer is Klara Hitler.
But I recall that in some sources, she is referred to as Klara Pölzl.
But in most standard references, it's Klara Hitler.
Therefore, I think the answer is Klara Hitler.
But to be thorough, I should consider that her full name might be Klara Hitler (née Pölzl), but the question asks for her name, and it's most commonly given as Klara Hitler.
So, final answer: Klara Hitler.
But I think I might be missing something. Maybe I should recall more details.
I remember that Klara Hitler was born in 1860 and died in 1907. She was married to Alois Hitler, and they had several children, but only Adolf and his siblings survived infancy.
But I'm not sure if this helps with the name. I think the name is indeed Klara Hitler.
But sometimes, in historical writing, people refer to her by her maiden name, Klara Pölzl, especially when discussing her early life or family background.
But in general, when referring to her in the context of Hitler's life, she is called Klara Hitler.
Therefore, the most appropriate answer is Klara Hitler.
But to cover all bases, perhaps it's best to give both names: Klara Pölzl (later Klara Hitler).
But the question asks for her name, and in the context of Hitler's mother, she is most commonly referred to as Klara Hitler.
So, I'll stick with Klara Hitler.
But I recall that in some biographies, she is referred to as Klara Pölzl.
Hmm, maybe the most complete answer is Klara Pölzl, as that is her full name, including her maiden name.
But in most contexts, especially when talking about her as Hitler's mother, she is called Klara Hitler.
I think both are correct, but Klara Hitler is more commonly used.
But to be precise, her full name was Klara Pölzl before marriage and Klara Hitler after marriage.
The question asks for her name, and it's common to refer to someone by their married name, especially if that's the name they were most commonly known by.
Therefore, the answer is Klara Hitler.
But I think it's also acceptable to say Klara Pölzl, especially if emphasizing her origins.
But since the question doesn't specify, and Klara Hitler is the more common reference, I'll go with that.
But to be thorough, perhaps the best answer is Klara Hitler (née Pölzl).
But in most historical writing, she is referred to as Klara Hitler.
Therefore, the answer is Klara Hitler.
But I'm not entirely sure. Maybe I should think differently.
Perhaps I can recall specific sources or books where her name is mentioned.
I remember reading that in "Mein Kampf," Hitler mentions his mother as Klara Hitler.
Yes, in his own writing, he refers to her as Klara Hitler.
Therefore, that is likely the correct name to use.
So, final answer: Klara Hitler.
But to be complete, her maiden name was Pölzl.
But the question asks for her name, and in the context of her being Hitler's mother, she is most commonly referred to as Klara Hitler.
Therefore, the answer is Klara Hitler.
But I think it's also correct to say Klara Pölzl, especially in a historical context where maiden names are important.
But since the question is about her name, and she was known as Klara Hitler during the time she was Hitler's mother, that seems to be the most appropriate answer.
But to cover all bases, perhaps it's best to say Klara Hitler (née Pölzl).
But in most cases, people refer to her as Klara Hitler.
Therefore, the answer is Klara Hitler.
But I think I might be overcomplicating this. The most common and likely correct answer is Klara Hitler.
So, I'll go with that.
</think>
*Summary:* Adolf Hitler's mother was Klara Hitler (née Pölzl). She is most commonly referred to as Klara Hitler in historical contexts, especially when discussing her role as Hitler's mother. However, her maiden name was Pölzl.
Thus, the answer is:
Adolf Hitler's mother was \boxed{Klara\ Hitler}.
Magistral Small is a 24 billion parameter model.
Pretty impressive in terms of efficiency for Mistral.
The size of the Magistral Medium is not publicly available, so it is difficult to compare efficiency there.
FWIW, one of their 70B models leaked in the past (search for "miqu"), and rumors at the time were that it was their medium model.