
Talkie: a 13B vintage language model from 1930

https://talkie-lm.com/introducing-talkie
268•jekude•8h ago•76 comments

Pgrx: Build Postgres Extensions with Rust

https://github.com/pgcentralfoundation/pgrx
41•luu•2d ago•1 comment

Microsoft and OpenAI end their exclusive and revenue-sharing deal

https://www.bloomberg.com/news/articles/2026-04-27/microsoft-to-stop-sharing-revenue-with-main-ai...
830•helsinkiandrew•16h ago•710 comments

Mo RAM, Mo Problems (2025)

https://fabiensanglard.net/curse/
89•blfr•2d ago•7 comments

LingBot-Map: Streaming 3D reconstruction with geometric context transformer

https://technology.robbyant.com/lingbot-map
17•nateb2022•2h ago•1 comment

Ted Nyman – High Performance Git

https://gitperf.com/
82•gnabgib•5h ago•12 comments

Vibe Coding Will Break Your Company

https://www.forbes.com/sites/jasonwingard/2026/04/23/vibe-coding-will-break-your-company/
26•sminchev•40m ago•9 comments

Three men are facing charges in Toronto SMS Blaster arrests

https://www.tps.ca/media-centre/stories/unprecedented-sms-blaster-arrests/
138•gnabgib•9h ago•64 comments

Integrated by Design

https://vivianvoss.net/blog/integrated-by-design-launch
91•vermaden•7h ago•36 comments

4TB of voice samples just stolen from 40k AI contractors at Mercor

https://app.oravys.com/blog/mercor-breach-2026
498•Oravys•20h ago•176 comments

How I learned what a decoupling capacitor is for, the hard way

https://nbelakovski.substack.com/p/how-i-learned-what-a-decoupling-capacitor
63•actinium226•2d ago•20 comments

Men who stare at walls

https://www.alexselimov.com/posts/men_who_stare_at_walls/
519•aselimov3•19h ago•228 comments

The quiet resurgence of RF engineering

https://atempleton.bearblog.dev/quiet-resurgence-of-rf-engineering/
178•merlinq•2d ago•92 comments

Is my blue your blue?

https://ismy.blue/
441•theogravity•9h ago•304 comments

Easyduino: Open Source PCB Devboards for KiCad

https://github.com/Hanqaqa/Easyduino
195•Hanqaqa•12h ago•31 comments

Meetings are forcing functions

https://www.mooreds.com/wordpress/archives/3734
98•zdw•2d ago•46 comments

Show HN: AgentSwift – Open-source iOS builder agent

https://github.com/hpennington/agentswift
30•hpen•5h ago•7 comments

Networking changes coming in macOS 27

https://eclecticlight.co/2026/04/23/networking-changes-coming-in-macos-27/
222•pvtmert•14h ago•192 comments

Lessons from building multiplayer browsers

https://www.alejandro.pe/writing/sail-muddy-lessons
40•alejandrohacks•1d ago•14 comments

The woes of sanitizing SVGs

https://muffin.ink/blog/scratch-svg-sanitization/
204•varun_ch•14h ago•84 comments

Radar Laboratory – Interactive Radar Phenomenology

https://radarlaboratory.com/
48•jonbaer•2d ago•2 comments

Fully Featured Audio DSP Firmware for the Raspberry Pi Pico

https://github.com/WeebLabs/DSPi
277•BoingBoomTschak•2d ago•77 comments

Spanish archaeologists discover trove of ancient shipwrecks in Bay of Gibraltar

https://www.theguardian.com/science/2026/apr/15/hidden-treasures-spanish-archaeologists-discover-...
90•1659447091•2d ago•20 comments

FDA approves first gene therapy for treatment of genetic hearing loss

https://www.fda.gov/news-events/press-announcements/fda-approves-first-ever-gene-therapy-treatmen...
231•JeanKage•20h ago•86 comments

Pgbackrest is no longer being maintained

https://github.com/pgbackrest/pgbackrest
410•c0l0•19h ago•218 comments

GitHub Copilot is moving to usage-based billing

https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/
615•frizlab•14h ago•450 comments

China blocks Meta's acquisition of AI startup Manus

https://www.cnbc.com/2026/04/27/meta-manus-china-blocks-acquisition-ai-startup.html
353•yakkomajuri•18h ago•249 comments

“Why not just use Lean?”

https://lawrencecpaulson.github.io//2026/04/23/Why_not_Lean.html
272•ibobev•15h ago•188 comments

Super ZSNES – GPU Powered SNES Emulator

https://zsnes.com/
272•haunter•12h ago•78 comments

Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

https://github.com/dirac-run/dirac
329•GodelNumbering•17h ago•120 comments

The best – but not good – way to limit string length

https://adam-p.ca/blog/2025/04/string-length/
49•adam-p•12mo ago

Comments

adam-p•12mo ago
@dang Can the title be changed? It should be "The best – but not good – way to limit string length". Thanks.
dang•12mo ago
Fixed!
neuroelectron•12mo ago
This is why my website is going to be ASCII only.
poincaredisk•12mo ago
Which is a reasonable and clean solution - I love simplicity of ASCII like every programmer does.

Except ASCII is not enough to represent my language, or even my name. Unicode is complex, but I'm glad it's here. I'm old enough to remember the absolute nightmare that was multi-language support before Unicode and now the problem of encodings is... almost solved.

fsckboy•12mo ago
>ASCII is not enough to represent my language, or even my name.

Hebrew and Arabic don't include vowels. While you think that writing your language needs vowels, we can tell from the existence of Hebrew and Arabic that you are probably wrong. It would take some getting used to, but just like that "scramble the letters in the middle of words, you can still read":

https://www.sciencealert.com/word-jumble-meme-first-last-let...

>Aocdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

your language, too, is redundant and could be modified to be simpler to write.

I'm not asking you to write your language with no vowels, I'm simply saying you could reduce to ASCII, get used to it, and civilization could move on. Stop clinging to the past, you are holding up the flying cars.

smrq•12mo ago
English itself lost some lovely letters because of the printing press (RIP, þ), so I suppose simplifying writing systems in the name of technological simplicity isn't unprecedented.
cjs_ac•12mo ago
Once again, I request that 95% of the world's population change the way it does almost everything, so that I can simplify my code.

Thank you for writing this comment; it's cleared up some self-esteem issues I've been having about whether I'm clever enough to start my own company.

fsckboy•12mo ago
pot. kettle. unicode itself was the 95% change request, and this particular discussion is sparked by anguish about that change and people such as yourself who want to discuss their anguish about the change.

and you simply ignored the points that I went to the trouble to write down, and rather than considering them or thinking about them, you just started screaming "status quo status quo"

saagarjha•12mo ago
It's hilarious that the guy telling people to go back to ASCII is the one saying "stop clinging to the past".
fsckboy•12mo ago
whoosh
StefanBatory•12mo ago
No, you can't.

Robię Ci łaskę vs robię ci laskę is a very big difference.

fsckboy•12mo ago
you let your fingers hit the keyboard before thinking at all.

in english, we have laws that sanction the selling of street drugs, and other laws that sanction funding for women's sports. in the first case, sanction means "forbid", and in the second case it means "encourage". although these usages are opposite in meaning, the words are used on a daily basis and nobody gets confused because context is everything.

Robię Ci łaskę could mean "badass" and robię ci laskę could mean "bad ass": if you read "robie ci laske" in ASCII (hey, i'm thinking that rhymes) nobody (except you) would be confused by that, it's not how functioning brains work.

i provided enough evidence in my original comment that you should have been able to realize that i was already talking about the issue you are pointing out so to rebut what i suggest you need to account for what i said and not argue against a strawman's tabula rasa

neuroelectron•12mo ago
What would be neat is an ASCII-like byte encoding that simplified foreign languages down to basically ASCII on the data side, then decoded it for display. It would only support a subset, unfortunately, but it would eliminate all these edge cases and move them away from the logic and database layers.
just6979•12mo ago
Nah, you gotta go at least one further back and use EBCDIC. Or go all the way to a BCD and you get to save more bits (only need 6) and can avoid dealing with case-sensitivity as well (only uppercase latin letters)!
o11c•12mo ago
Note that normalization involves rearranging combining characters of different combining classes:

  > Array.from("\u{10FFff}\u0300\u0327".normalize('NFC')).map(x=>x.codePointAt().toString(16))
  [ '10ffff', '327', '300' ]
If a precombined character exists, the relevant accent will be pulled into the base regardless of where it is in the sequence. Note also that normalization can change the visual length (see below) under some circumstances.

The article is somewhat wrong when it says Unicode may "change character normalization rules"; new combining characters may be added (which affects the class sort above) but new precombined ones cannot.
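The "pulled into the base" behavior is easy to see in the same JavaScript as the snippet above (a minimal illustration, not part of the original comment):

```javascript
const s = "A\u030A";          // LATIN CAPITAL LETTER A + COMBINING RING ABOVE
const n = s.normalize("NFC"); // "Å", the precombined U+00C5
s.length;                     // 2
n.length;                     // 1, so counts taken before and after NFC disagree
```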

---

There's one important notion of "length" that this doesn't cover: how wide is this on the screen?

For variable-width fonts of course this is very difficult. For monospace fonts, there are several steps for the least-bad answer:

* Zeroth, if you have reason to believe a later stage has a limit on the number of combining characters or will normalize, do the normalization yourself if that won't ruin your other concerns. (TODO - since there are some precomposed characters with multiple accents, can this actually make things worse?)

* First, deal with whitespace. Do you collapse space? What forms of line separator do you accept? How far apart are tab stops?

* Second, deal with any nonprintable/control/format characters (including spaces you don't recognize), e.g. escaping them or replacing them by their printable form but adding the "inverted" attribute.

* Third, deal with any leading (meaning, immediately after a nonprintable or a line separator) combining characters by synthesizing an NBSP (which is not a space), which has length 1. Likewise, synthesize missing Hangul fillers anywhere in the line.

* Now, iterate through the codepoints, checking their EastAsianWidth (note that you can usually have a table combining this lookup with the earlier stages): -1 for a control character, 0 for a combining character (unless dealing with a system that's too dumb to strip them), 1 or 2 for normal characters.

* Any codepoints that are Ambiguous or in one of the Private Use Areas should be counted both ways (you want to produce two separate counts). Any combining characters that are enclosing should be treated as ambiguous (unless the base was already wide). Likewise for the Korean Hangul LVT sequences, you should produce a range of lengths (since in practice, whether they will combine depends on whether the font includes that exact sequence).

* If you encounter any ZWJ sequences, regardless of whether or not they correspond to a known emoji, count them both ways (min length being the max of any single component, max length as counted all separately).

* Flag characters are evil, since they violate Unicode's random-access rule. Count them both as if they would render separately and if they would render as a flag.

* TODO what about Ideographic Description Characters?

* Finally, hard-code any exceptions you encounter in the wild, e.g. there are some Arabic codepoints that are really supposed to be more than 2 columns.

For the purpose of layout, you should mostly work based on the largest possible count. But if the smallest possible count is different, you need to use some sort of absolute positioning so you don't mess up the user's terminal.
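The counting loop in the middle of this list can be sketched in JavaScript, the document's own language. This is a sketch only: the Wide ranges below are a small illustrative subset of the real EastAsianWidth table, and combining marks are detected with a Unicode property regex rather than a proper lookup table.

```javascript
// Minimal sketch of the width-counting step only (no whitespace,
// normalization, or ambiguous-width handling from the other steps).
function monospaceWidth(str) {
  let width = 0;
  for (const ch of str) {                  // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x20 || (cp >= 0x7f && cp < 0xa0)) continue; // controls: handled in earlier steps
    if (/[\p{Mn}\p{Me}]/u.test(ch)) continue;             // combining marks: width 0
    // A few Wide/Fullwidth ranges; the real table has many more entries.
    const wide =
      (cp >= 0x1100 && cp <= 0x115f) ||    // Hangul Jamo
      (cp >= 0x2e80 && cp <= 0xa4cf) ||    // CJK blocks
      (cp >= 0xac00 && cp <= 0xd7a3) ||    // Hangul syllables
      (cp >= 0xf900 && cp <= 0xfaff) ||    // CJK Compatibility Ideographs
      (cp >= 0xff00 && cp <= 0xff60) ||    // Fullwidth forms
      (cp >= 0x20000 && cp <= 0x3fffd);    // CJK extensions
    width += wide ? 2 : 1;
  }
  return width;
}

monospaceWidth("abc");     // 3
monospaceWidth("日本語");   // 6
monospaceWidth("e\u0301"); // 1: base letter plus combining acute
```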

adam-p•12mo ago
> The article is somewhat wrong when it says Unicode may "change character normalization rules"; new combining characters may be added (which affects the class sort above) but new precombined ones cannot.

That's fair. I updated the wording in the post.

Thanks for the display info. It's cool and horrible and out of scope for my post.

DemocracyFTW2•12mo ago
> * TODO what about Ideographic Description Characters?

I've never encountered them other than rendered with widths like any other CJK character, i.e. with (nominally) double width. There may be software that makes an effort to render IDSes (Ideographic Description Sequences) as existing or generated ideographs (or whacha may call those), but I have yet to see one. There may however, and IMO more likely, be situations where you want to grant the user an input of exactly one, or up to a certain number of CJK characters e.g. for the purpose of searching and grant them the ability to use IDSes for unencoded characters or incompletely known characters. But in that case you're clearly leaving the boundaries of what is Unicode and enter into the grammar of your search engine's customized search strings. Meaning that you probably don't need to handle IDC separately at all other than treating them like any other fullwidth CJK codepoint.

bsder•12mo ago
TIL: In worst case, "20 UTF-8 bytes" == "1 Hindi character"

Going to have to remember that.

saagarjha•12mo ago
You can go way beyond that, although at some point I think it's unlikely that the character is something that is semantically valid.
Retr0id•12mo ago
> The byte size allowed would need to be about 100x the length limit. That’s… kind of a lot?

Would it need to be, though? ~10x ought to be enough for any realistic string that wasn't especially crafted to be annoying.

aidenn0•12mo ago
They show a single Hindi character that is 15 bytes in UTF-8. That's enough over 10 that it would be believable that Hindi words could get uncomfortably close to the 10x limit.
Retr0id•12mo ago
A single hindi character, yes. But they also mention that only ~25% of hindi characters use combining marks.
saagarjha•12mo ago
Most of them are vowels. They're pretty common. (Also, I feel like you of all people would understand the issues with "only 25% of the time this happens, therefore surprising behavior at the edges is unlikely to happen".)
Retr0id•12mo ago
That's why you have a limit on both.
chrismorgan•12mo ago
Triple conjuncts are very uncommon in Indic scripts, though there are a few in common use: stri, for example, is a single-syllable word that means woman or wife in many languages. Pick your Indic script, and that’ll be LETTER SA, SIGN VIRAMA, LETTER TA, SIGN VIRAMA, LETTER RA, VOWEL SIGN I. Most Indic syllables/grapheme clusters are a single consonant and a single vowel sign, if not the inherent vowel -a. Conjuncts use their script’s SIGN VIRAMA to suppress the inherent vowel and normally join the next consonant graphically (an orthographic choice rarely broken, a little like ß being ss in German).

I’m not so confident about Hindi, though 25% seems very low if we’re talking frequency; in Telugu writing, far more syllables than that specify a vowel sign and thus take at least two Unicode scalar values to represent.

My feeling (as a white fellow moved to India, with well above average knowledge of Indian languages and Unicode for a place like HN, but not yet fluent in any Indian language) is that some four-bytes-per-code-point script might conceivably get realistic existing texts above an average of 10 bytes per syllable for at least twenty syllables, and that most Indic languages could sustain it indefinitely in specific deliberate styles of writing.

adam-p•12mo ago
Valid question, and I think you're right in the abstract and most of the time. But I also think you end up with a mismatch.

What's the concrete spec for the limit if you've only got 10x storage per grapheme cluster?

Probably you end up providing the limit in bytes. That's fine, but then it's not the "hybrid counting" approach anymore.

jasonthorsness•12mo ago
Huh, apparently HTML input attributes like maxlength don't try anything fancy and just count UTF-16 code units, the same as JavaScript strings (I guess it makes sense...). With the prevalence of emoji this seems like it might not do the right thing.

https://html.spec.whatwg.org/multipage/input.html#attr-input...
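The mismatch is visible directly in JavaScript. The byte count below assumes Node's Buffer; in a browser, new Blob([s]).size gives the same number:

```javascript
const s = "💩";               // U+1F4A9: one grapheme, one code point
s.length;                     // 2: UTF-16 code units, which is what maxlength counts
[...s].length;                // 1: code points
Buffer.byteLength(s, "utf8"); // 4: UTF-8 bytes (Node)
```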

wavemode•12mo ago
In the age of unicode (and modern computing in general), all of this is more headache than it's worth. What is actually important is that you limit the size of an HTTP request to your server (perhaps making some exceptions for file upload endpoints). As long as the user's form entries fit within that, let them do what they want.
adam-p•12mo ago
If you can get away with that, that's great. But I feel like there are still plenty of cases where you want to limit the lengths of particular fields (and communicate to the user which lengths were exceeded).
gcau•12mo ago
I don't think it's practical or useful to just say "limit the size of entire requests" and ignore all the real-world reasons you'd want to actually validate/check data before putting it in your database. That logic is how we get bugs and security holes. This person's write-up gives specific and detailed information that's genuinely useful.
HeyImAlex•12mo ago
Thank you for writing this! It’s something I’ve always wanted a comprehensive guide on, now I have something to point to.
aidenn0•12mo ago
This doesn't seem to cover truncation, but rather acceptance/rejection. If you are given something with "too many" codepoints, but need to use it anyways it seems like it would make sense to truncate it on a grapheme cluster boundary.
adam-p•12mo ago
I don't get into truncation much, but I do mention the risk of:

a) failing to truncate on a code point sequence boundary (a bug React Native iOS used to have)[1], and

b) failing to truncate on a grapheme cluster boundary (a bug React Native Android seems to still have)[2]

[1]: https://adam-p.ca/blog/2025/04/string-length/#utf-16-code-un...

[2]: https://adam-p.ca/blog/2025/04/string-length/#unicode-code-p...
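Both failure modes are reproducible in plain JavaScript; 👍🏽 here is U+1F44D plus the U+1F3FD skin-tone modifier (an illustration, not the React Native code itself):

```javascript
const s = "👍🏽";              // 2 code points, 4 UTF-16 code units, 1 grapheme cluster
s.slice(0, 1);                 // "\uD83D": a lone surrogate, bug (a)
[...s].slice(0, 1).join("");   // "👍" without the modifier, bug (b)
```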

adam-p•12mo ago
I added a section with brief discussion of rejection, truncation, and the perils therein.

https://adam-p.ca/blog/2025/04/string-length/#what-to-do-whe...

aidenn0•12mo ago
Thanks!
jerf•12mo ago
I had this problem recently, logging email subjects into something that has a defined byte-size limit. I went for iterating on graphemes and fitting as many complete graphemes into the bytes as I could, then stopping. The idea is: don't show broken graphemes, and fit as much as I can.

This approach probably solves most programmer problems with length. However, if this has to be surfaced to an end user who is not intimately familiar with the nature of Unicode encodings, which is, you know, basically everybody, it may be difficult to explain what the limits actually mean in any sensible way. About all you can do is give vague hints about it being nearly too long, and avoid being precise enough for there to be a problem. There doesn't seem to be a perfect solution here: the intrinsic problem is that there's no easy way to explain the lengths of these things to end users, and no reason to ever expect them to understand it.
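A sketch of that approach, assuming Node's Buffer for byte counts and Intl.Segmenter (Node 16+) for grapheme iteration; the function name and locale are illustrative:

```javascript
// Pack whole grapheme clusters into a byte budget; stop before the
// first cluster that would overflow, so no cluster is ever split.
function truncateToBytes(str, maxBytes, locale = "en") {
  const seg = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let out = "";
  let used = 0;
  for (const { segment } of seg.segment(str)) {
    const bytes = Buffer.byteLength(segment, "utf8");
    if (used + bytes > maxBytes) break;
    out += segment;
    used += bytes;
  }
  return out;
}

truncateToBytes("h\u00e9llo", 3); // "hé": the 2-byte é is kept whole or dropped whole
```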

wpollock•12mo ago
Best advice I've heard is to never use the character type in your programming language. Instead, store characters in strings. An array of strings can be used as a string of characters. In this approach, characters become opaque blobs of bytes. This makes it easy to get the two numbers you care about: length in characters and size in bytes.

There is some overhead to this, so it may be a technique more suited to backends. Normalization, sanitization, and validation steps are best performed in the frontend.

Also worth knowing is the ICU library, which is often the easiest way to work with Unicode consistently regardless of programming language.

Finally, punycode is a standard way to represent arbitrary Unicode strings as ASCII. It's reversible too (and built into every web browser). You can do size limits on the punycode representation.

BTW, you shouldn't store passwords in strings in the first place. Many programming languages have an alternative to hold secrets in memory safely.

wild_egg•12mo ago
> validation steps are best performed in the frontend.

I'm really hoping we have very different definitions of "frontend"

wpollock•12mo ago
I meant the web server, not in the end user's browser! (So by backend, I meant the application and data layers.)
frizlab•12mo ago
Swift’s Character type represents an extended grapheme cluster, which is the correct thing to do.
fsckboy•12mo ago
>length in characters and size in bytes

you change the word you use as if those words have inherent meanings that we can draw upon. they don't.

it would be more clear to write "length in characters and length in bytes"

[linguistically speaking, words don't carry meanings, it is us who ascribe meaning to words. we use words to say what we want to say, but words don't limit us in what we can say]

wpollock•12mo ago
You are correct. It's just that I am loquacious by nature and often use a plethora of words when a paucity would better and more succinctly convey meaning precisely.

My bad!

saagarjha•12mo ago
This is generally a bad idea, even if you ignore the obvious overhead from doing so. At some point you are going to create a "real" string out of the thing you have, and it is not going to behave like you expect if you just blindly use the array's properties to compute them. Nor will they really have well defined semantics unless you are careful about what the "characters" you're storing in strings are.
chrismorgan•12mo ago
Another problem is line breaks. Have a <textarea>? Line breaks are counted as \n on the client (affecting maxlength attribute and JavaScript calculations using textarea.value.length), but submitted as \r\n. This has bitten me on “2000 character maximum” feedback forms at least twice: client says it’s fine, server says it’s too long, and promptly throws everything away.
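The gap is easy to demonstrate; the replace below mirrors the CRLF normalization that form submission performs:

```javascript
const value = "line1\nline2";        // what textarea.value contains client-side
value.length;                         // 11: what maxlength and JS validation count
value.replace(/\n/g, "\r\n").length;  // 12: what the server receives
```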