frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Do you have a mathematically attractive face?

https://www.doimog.com
1•a_n•2m ago•1 comments

Code only says what it does

https://brooker.co.za/blog/2020/06/23/code.html
1•logicprog•8m ago•0 comments

The success of 'natural language programming'

https://brooker.co.za/blog/2025/12/16/natural-language.html
1•logicprog•8m ago•0 comments

The Scriptovision Super Micro Script video titler is almost a home computer

http://oldvcr.blogspot.com/2026/02/the-scriptovision-super-micro-script.html
2•todsacerdoti•8m ago•0 comments

Discovering the "original" iPhone from 1995 [video]

https://www.youtube.com/watch?v=7cip9w-UxIc
1•fortran77•10m ago•0 comments

Psychometric Comparability of LLM-Based Digital Twins

https://arxiv.org/abs/2601.14264
1•PaulHoule•11m ago•0 comments

SidePop – track revenue, costs, and overall business health in one place

https://www.sidepop.io
1•ecaglar•14m ago•1 comments

The Other Markov's Inequality

https://www.ethanepperly.com/index.php/2026/01/16/the-other-markovs-inequality/
1•tzury•15m ago•0 comments

The Cascading Effects of Repackaged APIs [pdf]

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6055034
1•Tejas_dmg•17m ago•0 comments

Lightweight and extensible compatibility layer between dataframe libraries

https://narwhals-dev.github.io/narwhals/
1•kermatt•20m ago•0 comments

Haskell for all: Beyond agentic coding

https://haskellforall.com/2026/02/beyond-agentic-coding
2•RebelPotato•23m ago•0 comments

Dorsey's Block cutting up to 10% of staff

https://www.reuters.com/business/dorseys-block-cutting-up-10-staff-bloomberg-news-reports-2026-02...
2•dev_tty01•26m ago•0 comments

Show HN: Freenet Lives – Real-Time Decentralized Apps at Scale [video]

https://www.youtube.com/watch?v=3SxNBz1VTE0
1•sanity•27m ago•1 comments

In the AI age, 'slow and steady' doesn't win

https://www.semafor.com/article/01/30/2026/in-the-ai-age-slow-and-steady-is-on-the-outs
1•mooreds•35m ago•1 comments

Administration won't let student deported to Honduras return

https://www.reuters.com/world/us/trump-administration-wont-let-student-deported-honduras-return-2...
1•petethomas•35m ago•0 comments

How were the NIST ECDSA curve parameters generated? (2023)

https://saweis.net/posts/nist-curve-seed-origins.html
2•mooreds•36m ago•0 comments

AI, networks and Mechanical Turks (2025)

https://www.ben-evans.com/benedictevans/2025/11/23/ai-networks-and-mechanical-turks
1•mooreds•36m ago•0 comments

Goto Considered Awesome [video]

https://www.youtube.com/watch?v=1UKVEUGEk6Y
1•linkdd•38m ago•0 comments

Show HN: I Built a Free AI LinkedIn Carousel Generator

https://carousel-ai.intellisell.ai/
1•troyethaniel•40m ago•0 comments

Implementing Auto Tiling with Just 5 Tiles

https://www.kyledunbar.dev/2026/02/05/Implementing-auto-tiling-with-just-5-tiles.html
1•todsacerdoti•41m ago•0 comments

Open Challange (Get all Universities involved

https://x.com/i/grok/share/3513b9001b8445e49e4795c93bcb1855
1•rwilliamspbgops•42m ago•0 comments

Apple Tried to Tamper Proof AirTag 2 Speakers – I Broke It [video]

https://www.youtube.com/watch?v=QLK6ixQpQsQ
2•gnabgib•44m ago•0 comments

Show HN: Isolating AI-generated code from human code | Vibe as a Code

https://www.npmjs.com/package/@gace/vaac
1•bstrama•45m ago•0 comments

Show HN: More beautiful and usable Hacker News

https://twitter.com/shivamhwp/status/2020125417995436090
3•shivamhwp•45m ago•0 comments

Toledo Derailment Rescue [video]

https://www.youtube.com/watch?v=wPHh5yHxkfU
1•samsolomon•47m ago•0 comments

War Department Cuts Ties with Harvard University

https://www.war.gov/News/News-Stories/Article/Article/4399812/war-department-cuts-ties-with-harva...
9•geox•51m ago•1 comments

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

https://github.com/localgpt-app/localgpt
5•yi_wang•52m ago•0 comments

A Bid-Based NFT Advertising Grid

https://bidsabillion.com/
1•chainbuilder•56m ago•1 comments

AI readability score for your documentation

https://docsalot.dev/tools/docsagent-score
1•fazkan•1h ago•0 comments

NASA Study: Non-Biologic Processes Don't Explain Mars Organics

https://science.nasa.gov/blogs/science-news/2026/02/06/nasa-study-non-biologic-processes-dont-ful...
3•bediger4000•1h ago•2 comments
Open in hackernews

Unstructured Document Ingestion Pipeline

1•moaffaneh•4w ago
Hi all, I am designing an AWS-based unstructured document ingestion platform (PDF/DOCX/PPTX/XLSX) for large-scale enterprise repositories, using vision-language models to normalize pages into layout-aware markdown and then building search/RAG indexes or extract structured data.

For those who have built something similar recently, what approach did you use to preserve document structure reliably in the normalized markdown (headings, reading order, nested tables, page boundaries), especially when documents are messy or scanned? Did you do page-level extraction only, or did you use overlapping windows / multi-page context to handle tables and sections spanning pages?

On the indexing side, do you store only chunks + embeddings, or do you also persist richer metadata per chunk (page ranges, heading hierarchy, has_table/contains_image flags, extraction confidence/quality notes, source pointers) and if so, what proved most valuable? How does that help in the agent retrieval process?

What prompt patterns worked best for layout-heavy pages (multi-column text, complex tables, footnotes, repeated headers/footers), and what failed in practice?

How did you evaluate extraction quality at scale beyond spot checks (golden sets, automatic heuristics, diffing across runs/models, table-structure metrics)?

Any lessons learned, anti-patterns, or “if I did it again” recommendations would be very helpful.