I am an ML engineer and a full-stack software engineer. For the past few weekends, I have been working on a pipeline to solve a PropTech problem: turning messy, highly occluded 2D floor plans into clean, structured data for 3D extrusion. I originally built and demoed it for a firm that was hiring for this kind of role.
The Problem: If you try to use standard object detection (bounding boxes) or basic OCR (I tested Qwen and DeepSeek) on architectural plans, it fails instantly. Walls intersect, and door swings and dimension lines heavily occlude the actual structures.
The Stack & Architecture: I built an instance segmentation pipeline that relies strictly on pixel-perfect masking to pull the geometry.
The Backbone: Swin Transformer + Detectron2.
Training: the model was trained on 1024×1024 images on an RTX 4090.
Inference: runs on CPU in under 10 seconds (a minimal inference sketch follows this list).
Output: Clean JSON
Demo Performance: 67.1% AP50 for instance segmentation masks, and a 38.2% AP across the strict 0.50:0.95 IoU thresholds.
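To make that concrete, here is a minimal sketch of the inference step with Detectron2. The config path, weights path, score threshold, and image path are placeholders, and registering the Swin backbone on top of Detectron2's stock configs is assumed to have happened elsewhere.

```python
# Hedged inference sketch (paths and threshold are illustrative, not the real setup).
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Assumes a project config that wires in the Swin backbone (registration not shown here).
cfg.merge_from_file("configs/floorplan_swin_mask_rcnn.yaml")   # hypothetical config
cfg.MODEL.WEIGHTS = "weights/model_final.pth"                  # hypothetical weights path
cfg.MODEL.DEVICE = "cpu"                                       # CPU inference, as in the demo
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5                    # illustrative threshold

predictor = DefaultPredictor(cfg)

image = cv2.imread("plans/sample_plan.png")                    # 1024x1024 plan image
outputs = predictor(image)
instances = outputs["instances"].to("cpu")

masks = instances.pred_masks.numpy()      # (N, H, W) boolean instance masks
classes = instances.pred_classes.numpy()  # e.g. wall / door / window class ids
scores = instances.scores.numpy()
```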
Vector Clean-up (The JSON Payload): 3D engines don't want pixel masks; they want math. The pipeline passes the raw predictions through Shapely to run boolean unions on intersecting walls, outputting clean, mathematically sound 2D polygons in a structured JSON payload.
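A sketch of what that clean-up step can look like, assuming the `masks` and `classes` arrays from the inference sketch above; the `WALL_ID` class id and the JSON field names are illustrative, not the actual schema.

```python
# Raster masks -> polygons -> boolean union of intersecting walls -> JSON payload.
import json
import cv2
import numpy as np
from shapely.geometry import Polygon
from shapely.ops import unary_union

def mask_to_polygons(mask: np.ndarray) -> list:
    """Trace the outer contours of a boolean mask into Shapely polygons."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    polys = []
    for c in contours:
        if len(c) >= 3:
            poly = Polygon(c.reshape(-1, 2))
            if poly.is_valid and poly.area > 1.0:   # drop degenerate specks
                polys.append(poly)
    return polys

WALL_ID = 0  # illustrative class id for "wall"
wall_polys = []
for mask, cls in zip(masks, classes):
    if cls == WALL_ID:
        wall_polys.extend(mask_to_polygons(mask))

# Boolean union merges overlapping/intersecting wall segments into clean 2D shapes.
merged = unary_union(wall_polys)
merged = [merged] if merged.geom_type == "Polygon" else list(merged.geoms)

# Export exterior rings only (interior rings omitted for brevity in this sketch).
payload = {
    "walls": [{"exterior": [list(pt) for pt in p.exterior.coords]} for p in merged]
}
print(json.dumps(payload, indent=2))
```

In practice you would also simplify the rings and convert pixel coordinates to real-world units before handing the payload to a 3D engine.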
Why I'm posting: A lot of virtual staging and architectural startups have beautiful Three.js rendering engines, but they still rely on human data entry to get the base data. I built this specifically as an extraction engine to sit underneath those UIs.
If you are in PropTech, or you are building a product that could benefit from embedding this model, I would love to chat.