frontpage.

Made with ♥ by @iamnishanth

Open Source @Github


Show HN: Routed Attention – 75-99% savings by routing between O(N) and O(N²)

https://zenodo.org/records/18518956
1•MikeBee•34s ago•0 comments

We didn't ask for this internet – Ezra Klein show [video]

https://www.youtube.com/shorts/ve02F0gyfjY
1•softwaredoug•1m ago•0 comments

The AI Talent War Is for Plumbers and Electricians

https://www.wired.com/story/why-there-arent-enough-electricians-and-plumbers-to-build-ai-data-cen...
1•geox•4m ago•0 comments

Show HN: MimiClaw, OpenClaw (Clawdbot) on $5 Chips

https://github.com/memovai/mimiclaw
1•ssslvky1•4m ago•0 comments

How I Maintain My Blog in the Age of Agents

https://www.jerpint.io/blog/2026-02-07-how-i-maintain-my-blog-in-the-age-of-agents/
1•jerpint•4m ago•0 comments

The Fall of the Nerds

https://www.noahpinion.blog/p/the-fall-of-the-nerds
1•otoolep•6m ago•0 comments

I'm 15 and built a free tool for reading Greek/Latin texts. Would love feedback

https://the-lexicon-project.netlify.app/
1•breadwithjam•9m ago•1 comment

How close is AI to taking my job?

https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job
1•cjbarber•9m ago•0 comments

You are the reason I am not reviewing this PR

https://github.com/NixOS/nixpkgs/pull/479442
2•midzer•11m ago•1 comment

Show HN: FamilyMemories.video – Turn static old photos into 5s AI videos

https://familymemories.video
1•tareq_•13m ago•0 comments

How Meta Made Linux a Planet-Scale Load Balancer

https://softwarefrontier.substack.com/p/how-meta-turned-the-linux-kernel
1•CortexFlow•13m ago•0 comments

A Turing Test for AI Coding

https://t-cadet.github.io/programming-wisdom/#2026-02-06-a-turing-test-for-ai-coding
2•phi-system•13m ago•0 comments

How to Identify and Eliminate Unused AWS Resources

https://medium.com/@vkelk/how-to-identify-and-eliminate-unused-aws-resources-b0e2040b4de8
2•vkelk•14m ago•0 comments

A2CDVI – HDMI output from the Apple IIc's digital video output connector

https://github.com/MrTechGadget/A2C_DVI_SMD
2•mmoogle•14m ago•0 comments

CLI for Common Playwright Actions

https://github.com/microsoft/playwright-cli
3•saikatsg•15m ago•0 comments

Would you use an e-commerce platform that shares transaction fees with users?

https://moondala.one/
1•HamoodBahzar•17m ago•1 comment

Show HN: SafeClaw – a way to manage multiple Claude Code instances in containers

https://github.com/ykdojo/safeclaw
2•ykdojo•20m ago•0 comments

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+

https://huggingface.co/blog/huggingface/one-year-since-the-deepseek-moment-blog-3
3•gmays•20m ago•0 comments

The Evolution of the Interface

https://www.asktog.com/columns/038MacUITrends.html
2•dhruv3006•22m ago•1 comment

Azure: Virtual network routing appliance overview

https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-routing-appliance-overview
2•mariuz•22m ago•0 comments

Seedance2 – multi-shot AI video generation

https://www.genstory.app/story-template/seedance2-ai-story-generator
2•RyanMu•26m ago•1 comment

Πfs – The Data-Free Filesystem

https://github.com/philipl/pifs
2•ravenical•29m ago•0 comments

Go-busybox: A sandboxable port of busybox for AI agents

https://github.com/rcarmo/go-busybox
3•rcarmo•30m ago•0 comments

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery [pdf]

https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf
2•gmays•31m ago•0 comments

xAI Merger Poses Bigger Threat to OpenAI, Anthropic

https://www.bloomberg.com/news/newsletters/2026-02-03/musk-s-xai-merger-poses-bigger-threat-to-op...
2•andsoitis•31m ago•0 comments

Atlas Airborne (Boston Dynamics and RAI Institute) [video]

https://www.youtube.com/watch?v=UNorxwlZlFk
2•lysace•32m ago•0 comments

Zen Tools

http://postmake.io/zen-list
2•Malfunction92•34m ago•0 comments

Is the Detachment in the Room? – Agents, Cruelty, and Empathy

https://hailey.at/posts/3mear2n7v3k2r
2•carnevalem•35m ago•1 comment

The purpose of Continuous Integration is to fail

https://blog.nix-ci.com/post/2026-02-05_the-purpose-of-ci-is-to-fail
1•zdw•37m ago•0 comments

Apfelstrudel: Live coding music environment with AI agent chat

https://github.com/rcarmo/apfelstrudel
2•rcarmo•38m ago•0 comments

Microformats – building blocks for data-rich web pages

https://microformats.org
86•surprisetalk•4mo ago

Comments

alganet•4mo ago
This is dead. If you need something similar, you're better off looking at RDFa instead.
turnsout•4mo ago
In an era of LLMs, do you think RDFa still has a place?
jgalt212•4mo ago
Only if you think the transition to a zero-click Internet is happening too slowly.
turnsout•4mo ago
Is zero click a real problem for actual site owners, or is it just affecting SEO job security?
robertlagrant•4mo ago
I imagine it is if the content you paid to create and host is only monetised by a search engine.
jerven•4mo ago
RDFa/Microdata is more interesting for people who sell objects rather than content. E.g. marking up that a page is about a kitchen cabinet that is 60cm wide and white might lead to more sales in the long run, since people who are looking for 60cm-wide cabinets might get to your page instead of one about a 36-inch-wide cabinet.
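Concretely, marking up such a product page with schema.org microdata might look like this. A sketch only: the product, values, and markup choices are invented for illustration; `Product.width` is expressed here as a `QuantitativeValue` with the UN/CEFACT unit code for centimetres:

```html
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Kitchen wall cabinet</span>
  <span itemprop="color">White</span>
  <div itemprop="width" itemscope itemtype="https://schema.org/QuantitativeValue">
    <span itemprop="value">60</span>
    <span itemprop="unitCode">CMT</span> <!-- UN/CEFACT code for centimetres -->
  </div>
</div>
```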
robertlagrant•4mo ago
That doesn't sound relevant to zero click.
seec•4mo ago
That's an oddly specific search, and even Google doesn't have any kind of tool for queries like that. What is more likely is that you'll find companies specialized in selling cabinets, and they'll have a browser/search to restrict choice by given dimensions. There is not much benefit for them in exposing all that data to various search engines; best-case scenario, they end up competing with a bunch of other brands on a generic search-engine page where they have absolutely no control over how things are presented, etc.

And even before thinking about that, you can actually put the dimensions in a description, which some do (like Ikea) and Google is definitely able to pick up on that, no RDFa was ever needed. As far as I can tell, LLMs can work that out just fine as well.

The problem with the metadata discussion is that if the metadata is actually useful, there is no reason it isn't useful to humans as well; so instead of trying to make the human work for the machine, it is much better to make the machine understand humans.

turnsout•4mo ago
It sounds harsh, but maybe that was never a good business model in the first place. And I fully realize that this includes most news sites. In order for the web to grow, I think we need to figure out some way to get past banner ads as the only way to make money. It's been 30 years.
robertlagrant•4mo ago
It would be good to know why it was never a good business model.
jgalt212•4mo ago
That's a terribly naive question.
pixelat3d•4mo ago
It is VERY real, sadly
alganet•4mo ago
Whatever relevance it has, it's more than microformats.

I personally think semweb-related technologies could play a significant and productive role in synthetic data generation, but that's a whole different conversation that is beyond the current era of LLMs.

zozbot234•4mo ago
You can use JSON-LD to cleanly embed RDF into LLM-friendly JSON. Not sure if there's a direct Markdown equivalent, but the Turtle plain-text syntax for RDF is very simple, and modern LLMs should be able to cope with it just fine.
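For instance, a single RDF statement can be written both in Turtle and in JSON-LD. The sketch below uses only the Python standard library; every URL and name in it is made up for illustration:

```python
import json

# One RDF triple -- "this page's author is Jane Doe" -- written two ways.
# The example.org URLs and "Jane Doe" are invented for illustration.

# Turtle serialization: subject, predicate, object, terminated by a dot.
turtle = """
@prefix schema: <https://schema.org/> .
<https://example.org/post/1> schema:author "Jane Doe" .
"""
print(turtle)

# The same statement as JSON-LD: @context maps short keys to IRIs,
# @id names the subject, and ordinary keys carry the predicates.
doc = {
    "@context": {"author": "https://schema.org/author"},
    "@id": "https://example.org/post/1",
    "author": "Jane Doe",
}
json_ld = json.dumps(doc, indent=2)
print(json_ld)
```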
alganet•4mo ago
The purpose of RDF is to enable reasoning ("I'm sharing this with you, and I'm also sharing how to reason about this thing I shared with you, so we can all reason about it the same way").

If you show me an LLM that can take in serialized RDF and perform reasoning on it, I will be surprised. Once the LLM takes in the RDF serialization, it is a dead end for that knowledge: in principle, you can't rely on anything the LLM does with it.

In a world of LLMs, it makes much more sense to put the semweb technologies alongside the training step instead. You create ontologies and generate text from triples, then feed the generated text as training for the model. This is good because you can tweak and transform the triples-to-text mechanism in all sorts of ways (you can tune the data while retaining its meaning).

It doesn't make much sense to do it now, but if (or when?) training data becomes scarce, converting triples to text might be a viable approach for synthetic data, much more stable than having models themselves generate text.

zozbot234•4mo ago
> If you show me an LLM that can take in serialized RDF and perform reasoning on it, I will be surprised

It's easy, you just ask the LLM to convert your question to a SPARQL query, then you sanity-check it and run it on your dataset. The RDF input step is just so that the LLM knows what your schema looks like in the first place.
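The sort of query that step might produce, for a question like "which people were born in Berlin?", could look like this. A sketch only, against the public DBpedia endpoint, using DBpedia's usual dbo:/dbr: prefixes:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?person WHERE {
  ?person a dbo:Person ;
          dbo:birthPlace dbr:Berlin .
}
LIMIT 10
```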

alganet•4mo ago
I don't understand how this can work.

Can you make me a quick demonstration using a publicly available model and the dbpedia SPARQL endpoint?

https://dbpedia.org/sparql

seec•4mo ago
You got downvoted, but I also think it's largely a pointless "technology" today. LLMs have been able to find stuff and excavate meaning without needing any kind of specialized tags or schema, so it really is a crutch at best.

And there is the fact that people/orgs lie or maintain things poorly, so you cannot trust any open tag/schema any more than the data it is supposed to describe. You end up in a conundrum where your data may be sketchy, but on top of that the schema/tags may be even worse. At the end of the day, if you have to trust one thing, it's got to be the data that is readily visible to everyone, because it's more likely to be accurate, and that's what's actually relevant.

To top all of that, tagging your stuff with RDFa just makes it easier for Google and others to parse your pages and exploit the data/information without ever sending anyone to your site. If you are Wikipedia, that's mostly fine, but almost anyone else would benefit from receiving the traffic for a chance at a value conversion/transaction.

All those metadata things are really an idealized academic endeavor; they may make sense for a few noncommercial projects, but the reality is that most websites need to find a way to pay the bills, and making it easier for others to exploit your work is largely self-defeating.

And yes, LLMs didn't even need any of that to vacuum up most of the web and still derive meaning, so at the end of the day it's mostly a waste of time...

alganet•3mo ago
I think he got downvoted because I predicted his answer by saying "if you need something like this..." and he wasn't able to take the hint that I was already expecting some LLM preaching here.
tempfile•4mo ago
I could never understand why microformats were so popular among fediverse people when XML is right there. Seems like unnecessary fragmentation.
phpnode•4mo ago
If there’s one thing semantic web folks like it’s fragmentation
zerkten•4mo ago
I don't think anyone desires fragmentation. It's just the reality of the space. People were exploring options but didn't have support from the key stakeholders who were the browser makers (IE was at its peak) and Google. Firefox and WHATWG advanced some of the ideas in time.

People always mention RDF when the semantic web comes up. It's really important to understand where W3C was in the early-2000s and that RDF was driven by those with an academic bent. No one working with microformats was interested in anything beyond the RDF basics because they were too impractical for use by web devs. Part of this was complexity (OWL, anyone?), but the main part was browser and tool support.

zozbot234•4mo ago
> People always mention RDF when the semantic web comes up.

There's nothing wrong with RDF itself, the modern plain-text and JSON serializations are very simple and elegant. Even things like OWL are being reworked now with efforts like SHACL and ShEx (see e.g. https://arxiv.org/abs/2108.06096 for a description of how these relate to the more logical/formal, OWL-centered point of view).

Telemakhos•4mo ago
Google.

For a while, Google gave you good-boy points for including microformats, and they still offer tests and validators [0] to tell you what the crawlers get out of your page. Supposedly microformats would not just give you a better SEO ranking but also help Google connect people (like the fediverse) to accounts, so that you could surface things relevant to a person by searching for the person.

[0] https://developers.google.com/search/docs/appearance/structu...

zerkten•4mo ago
If you go back to the time when they were invented, many semantic elements, like article or footer, didn't exist in HTML. People tried to find conventions and efforts like microformats were an attempt to standardize those when the best solution (updating the HTML standard) was difficult. In terms of timing, it's worth looking at the arc of Firefox, WHATWG, the advent of Safari and Chrome, and table use for layout.

Google was a driver in practice. Accessibility and better web experiences were important to those involved. The reality was that people interested in this area were at the bleeding edge. Many people still held onto tables for site layout and Flash was still a default option for some in the period when microformats emerged.

Telemakhos•4mo ago
ARIA and accessibility microformats were separate from the ones the fediverse was excited about (and thus the ones the GP was talking about): things like hCard for identifying people, places, and things. Accessibility is useful to many people, but hCard et al. were probably never really useful to anybody other than Google. Still, many of us back then were obsessive-compulsive about using them in the hope that one day computers would be better able to understand authoritative information about identities and relationships between identities. I still have microdata on my personal web page.
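The class-based vocabulary is simple enough that a toy extractor fits in a few lines. This sketch uses only the standard library; the sample HTML is invented, and limiting it to p-* properties is my simplification, so it is nowhere near a full mf2 parser:

```python
from html.parser import HTMLParser

# Toy microformats2 h-card extractor: collect the text of elements
# whose class list contains a p-* property name. Enough to show how
# the class-based vocabulary works; not a real mf2 parser.

HTML = """
<div class="h-card">
  <a class="p-name u-url" href="https://example.org">Jane Doe</a>
  <span class="p-org">Example Org</span>
</div>
"""

class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.props = {}   # property name -> first text value seen
        self._stack = []  # p-* properties of currently open elements

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        self._stack.append([c for c in classes if c.startswith("p-")])

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to every open element that declared a p-* property.
        for props in self._stack:
            for name in props:
                self.props.setdefault(name, text)

parser = HCardParser()
parser.feed(HTML)
print(parser.props)  # {'p-name': 'Jane Doe', 'p-org': 'Example Org'}
```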
paulbjensen•4mo ago
I remember this from way back in the 00's as part of the Web 2.0/semantic web era. I was a big fan of Dan Cederholm's design work (https://simplebits.com - he did the design for the site).

I do like the principle of trying to use semantic html, but I don't think that things like this ever got the kind of mass adoption that would give them staying power. Still a nice nostalgia trip to see it here.

kk3•4mo ago
Hah I just had to chime in here, blast from the past for me as well! Dan Cederholm was my guiding light when I was learning CSS and semantic HTML. I used to View Source and study the code behind SimpleBits :)

Agree with the mass adoption part. Back then I was so hopeful for what the web might become... the days before social media blew up.

SigmundurM•4mo ago
The IndieWeb people use Microformats[1] extensively for things like Webmention[2] and such. Seems quite neat, though maybe I'd prefer the tags to be data attributes instead of classes.

[1]: https://indieweb.org/microformats

[2]: https://indieweb.org/Webmention

nikolay•4mo ago
Great idea, always used the markup, but is it used anywhere outside of some extremely niche services?
pornel•4mo ago
W3C was way too optimistic about XML namespaces leading to the creation of infinitely extensible vocabularies (XHTML2 was DOA, and even XHTML1 couldn't break past the tagsoup-compatible minimum).

This was the alternative – simpler, focused, fully IE-compatible.

W3C tried proper Semantic Web again with RDF, RDFa, and JSON-LD. HTML5 tried Microdata, a compromise between the extensibility of RDF and the simplicity of Microformats, but nothing really took off.

Eventually HTML5 gave up on it and took the position that invisible metadata should be avoided. Page authors (outside the bunch who have Valid XHTML buttons on their pages) tend to implement and maintain only the minimum needed for human visitors, so on the Web invisible markup has a systemic disadvantage: it rarely exists at all, and when it does it can be invalid, out of date, or, most often, SEO spam.

zozbot234•4mo ago
Schema.org metadata (using microdata, RDFa or JSON-LD) is quite common actually, search engines rely on it for "rich" SERP features. With LLMs being able to sanity-check the metadata for basic consistency with the page contents, SEO spam will ultimately be at a disadvantage. It just becomes easier and cheaper to penalize/ignore spam while still rewarding sites that include accurate data.

The schema.org vocab is being actively maintained; the latest major version came out last March, with the latest minor release in September.