Unstructured Document Ingestion Pipeline

1•moaffaneh•10h ago
Hi all, I am designing an AWS-based ingestion platform for unstructured documents (PDF/DOCX/PPTX/XLSX) in large-scale enterprise repositories. It uses vision-language models to normalize pages into layout-aware markdown, then either builds search/RAG indexes or extracts structured data.

For those who have built something similar recently, what approach did you use to preserve document structure reliably in the normalized markdown (headings, reading order, nested tables, page boundaries), especially when documents are messy or scanned? Did you do page-level extraction only, or did you use overlapping windows / multi-page context to handle tables and sections spanning pages?
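
To make the "overlapping windows" option concrete, here is a minimal sketch of what I mean. The function name `page_windows` and the `size`/`overlap` defaults are my own placeholders, not something from a library, and the pages could be images, text, or page numbers.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def page_windows(pages: Sequence[T], size: int = 3, overlap: int = 1) -> Iterator[List[T]]:
    """Yield consecutive windows of up to `size` pages that share `overlap` pages.

    Hypothetical helper: each window would be sent to the model as one request,
    so a table that spans a page break appears whole in at least one window
    (as long as it is not longer than the window itself).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < len(pages):
        yield list(pages[start:start + size])
        # Stop once the current window already reaches the last page.
        if start + size >= len(pages):
            break
        start += step


if __name__ == "__main__":
    for window in page_windows([1, 2, 3, 4, 5, 6], size=3, overlap=1):
        print(window)
```

The obvious cost is that the shared pages get extracted twice, so the duplicated output has to be merged or deduplicated afterwards, which is part of what I am asking about.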

On the indexing side, do you store only chunks + embeddings, or do you also persist richer metadata per chunk (page ranges, heading hierarchy, has_table/contains_image flags, extraction confidence/quality notes, source pointers)? If so, which fields proved most valuable, and how do they help during agent retrieval?
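
For reference, this is roughly the per-chunk record I have in mind. It is only a sketch: the field names mirror the list above, but the class `ChunkRecord`, the helper `filter_chunks`, and the idea of pre-filtering on metadata before vector ranking are my own assumptions, not an established schema.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class ChunkRecord:
    """Hypothetical per-chunk metadata stored next to the embedding."""
    chunk_id: str
    text: str
    source_uri: str                 # pointer back to the original file
    page_start: int
    page_end: int
    heading_path: List[str] = field(default_factory=list)  # e.g. ["3 Results", "3.2 Costs"]
    has_table: bool = False
    contains_image: bool = False
    extraction_confidence: Optional[float] = None  # None means "not scored"
    quality_notes: str = ""


def filter_chunks(
    chunks: Iterable[ChunkRecord],
    *,
    min_confidence: float = 0.0,
    require_table: bool = False,
) -> List[ChunkRecord]:
    """Keep chunks that pass simple metadata checks before any vector ranking."""
    kept = []
    for chunk in chunks:
        score = chunk.extraction_confidence
        if score is not None and score < min_confidence:
            continue  # drop chunks with a known low extraction score
        if require_table and not chunk.has_table:
            continue
        kept.append(chunk)
    return kept
```

My guess is that an agent could use flags like `has_table` to narrow the candidate set for questions about figures or tables, and use `page_start`/`page_end` plus `source_uri` to cite sources, but I would like to hear whether that holds up in practice.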

What prompt patterns worked best for layout-heavy pages (multi-column text, complex tables, footnotes, repeated headers/footers), and what failed in practice?

How did you evaluate extraction quality at scale beyond spot checks (golden sets, automatic heuristics, diffing across runs/models, table-structure metrics)?
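
As an example of the kind of automatic checks I mean, here is a small sketch using only the Python standard library. `flag_unstable_pages` diffs the markdown of two runs page by page, and `table_rows_consistent` is a crude table-structure heuristic. Both names and the 0.9 threshold are placeholders I made up, not a recommended metric.

```python
import difflib
from typing import Dict, List, Tuple


def page_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two extractions of the same page."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def flag_unstable_pages(
    run_a: Dict[int, str], run_b: Dict[int, str], threshold: float = 0.9
) -> List[Tuple[int, float]]:
    """Return (page, score) for pages whose output differs a lot between runs.

    A page missing from one run is compared against an empty string.
    """
    flagged = []
    for page in sorted(set(run_a) | set(run_b)):
        score = page_similarity(run_a.get(page, ""), run_b.get(page, ""))
        if score < threshold:
            flagged.append((page, score))
    return flagged


def table_rows_consistent(markdown: str) -> bool:
    """Check that all pipe-table rows in the text have the same column count.

    Crude assumption: at most one table in the text and no escaped pipes.
    """
    rows = [line.strip() for line in markdown.splitlines() if line.strip().startswith("|")]
    widths = {row.strip("|").count("|") + 1 for row in rows}
    return len(widths) <= 1
```

Pages flagged by checks like these could then be routed to manual review instead of sampling at random, though I do not know how well such heuristics correlate with real extraction errors.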

Any lessons learned, anti-patterns, or “if I did it again” recommendations would be very helpful.