Show HN: CLI tool for detecting non-exact code duplication with embedding models

89•rkochanowski•1d ago

Comments

rkochanowski•1d ago

I built Slopo to solve one specific problem: finding similar code that is hardest to detect by other tools, coding AI agents, and humans.

It finds similar-looking code with embeddings. This detects more than just copy-paste clones or even clones with minor changes. Similar code is often not a clone to refactor, and this is a trade-off. Initial results need to be verified, but coding agents can do this quickly. Example prompts are available on https://slopo.dev

Additionally, similar code distant in the codebase is ranked higher to focus on less obvious duplication.

The results differ a lot depending on the codebase. I noticed that sometimes most of the detected duplicates are false positives, but the remaining ones are strong candidates to refactor or even bugs. Sometimes it reveals much more real duplication.

realxrobau•1d ago

If it did PHP I would love to run it over WordPress. What would it take to add that?

rkochanowski•1d ago

PHP support can be easily added, I will release a new version soon.

raro11•1d ago

Thank you

rkochanowski•3h ago

PHP was added in the latest release.

klibertp•1d ago

Correct me if I'm wrong, but looking at [1] it seems to be specifically using function definitions (I'm guessing this works with functions, methods, and lambdas (the "<unknown>" part)?) as units of repetition. If yes, that's fine, but I would seriously consider adding some settings to allow the user to control that granularity. Sometimes, the repeated code is a conditional branch within larger functions (i.e., "every else:" or "every except Ex:" looks the same). If the functions are large enough, the dissimilarity of the rest of the body would (probably?) cause such things to be missed.

I would also consider analyzing comments (especially docstrings in Python), perhaps as a separate pass with different scoring. If I read the code correctly, you're currently just stripping them, which is the right thing to do when looking for code duplication, but duplicated docstrings are also often a signal that something is wrong in the codebase. The "different scoring" is because we expect docstrings to be structured similarly (at least more than normal code), so some tweaking would be needed.

Finally: very nice project, congrats! :)

[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...

rkochanowski•21h ago

Currently, only whole functions (including function-like constructs, depending on the language) are treated as a unit.

Skipping the extraction of conditional branches was my decision to not overcomplicate the first versions, which were intended to validate the idea. I will add this in future versions because I agree it's needed for large functions.

I don't think it needs configurable granularity. In the current version, there is an analogous mechanism: when functions are nested, both outer and inner are embedded separately. When both are similar to each other, this pair is excluded. Inner or outer functions can appear in results depending on similarity to other units.
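A minimal sketch of the nested-unit rule described above, assuming Python's built-in `ast` module as a stand-in parser. The names `extract_units` and `is_nested_pair` are hypothetical and not taken from Slopo; the line-span containment check is only one possible way to detect nesting.

```python
import ast

def extract_units(source):
    """Return (name, start_line, end_line) for every function, nested ones included."""
    units = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            units.append((node.name, node.lineno, node.end_lineno))
    return units

def is_nested_pair(a, b):
    """True when the line span of one unit fully contains the other (assumed rule)."""
    (_, a_start, a_end), (_, b_start, b_end) = a, b
    return (a_start <= b_start and b_end <= a_end) or (
        b_start <= a_start and a_end <= b_end
    )

SOURCE = '''
def outer(items):
    def inner(x):
        return x * 2
    return [inner(i) for i in items]

def other(items):
    return [i * 2 for i in items]
'''

units = extract_units(SOURCE)

# Every unit is compared with every other unit, except outer/inner pairs.
candidate_pairs = [
    (a[0], b[0])
    for index, a in enumerate(units)
    for b in units[index + 1:]
    if not is_nested_pair(a, b)
]
print(candidate_pairs)
```

Here `outer` and `inner` are both kept as units and can each be matched against `other`, but the `outer`/`inner` pair itself is never reported.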

Regarding comments, they are removed and I will think about handling them. The challenge is not with extraction, but with how to present this in a report. This may be a nice addition because coding agents often add comments.

Thanks for the feedback.

nttylock•22h ago

The false positive rate you're describing matches what we see running similarity detection on generated text instead of code: cosine similarity alone flags a lot of same-topic pairs that aren't actually duplicates. What helped was combining the embedding score with a structural signal (AST edit distance for code, overlapping headings and citations for text) so no single metric makes the call. Also worth surfacing the raw similarity score in the CLI output instead of just a binary duplicate flag, since people will want to tune the threshold per codebase.

rkochanowski•20h ago

My solution for false positives is simpler:

1. The tool uses only cosine similarity plus a boost that depends on distance in the codebase.

2. Classification with an LLM. This can be done by the coding agent already used with the project, which gives better results than integrating this pass into the tool. LLMs used for coding are pretty good.

I assumed that this is not a problem I need to solve inside the tool. I'm aware this is not deterministic, but this is by design.

Regarding information about raw similarity: currently, the score (raw similarity + boost) is visible in the report, so this value can be configured based on data. The raw similarity threshold can also be configured, but it's not displayed. I will think about how to handle this.
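A rough sketch of "raw similarity + boost", under stated assumptions: the boost formula, the `path_distance` measure, and all parameter names and default values here are hypothetical illustrations, not Slopo's actual implementation.

```python
import math
from pathlib import PurePosixPath

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def path_distance(path_a, path_b):
    """Count the path components the two files do not share (assumed measure)."""
    a, b = PurePosixPath(path_a).parts, PurePosixPath(path_b).parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

def score(u, v, path_a, path_b, min_similarity=0.8, boost_per_step=0.01, max_boost=0.05):
    """Return raw similarity plus a capped distance boost, or None below the threshold."""
    raw = cosine(u, v)
    if raw < min_similarity:
        return None  # the raw threshold is applied before any boost
    boost = min(max_boost, boost_per_step * path_distance(path_a, path_b))
    return raw + boost

u, v = [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]
near = score(u, v, "src/api/users.py", "src/api/orders.py")
far = score(u, v, "src/api/users.py", "tools/export/report.py")
print(near, far)
```

With the same raw similarity, the pair that lives further apart in the tree gets the higher final score, which is the ranking effect described in the post.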

AlexeyBelov•4h ago

You're replying to an LLM bot.

jadbox•22h ago

What a clever little tool. This is exactly the kind of pragmatic AI tool I want to see more of: linux-y single purpose tools!

NYCHMPAI•1d ago

This is a great use case for embeddings. Code deduplication across distant modules is notoriously hard for traditional AST-based tools.

How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? If a function is too long or too short, it can drastically skew the embedding similarity.

rkochanowski•23h ago

Generally, I chunk by function/method (not by whole class), but different languages have specific concepts and features. Nested code units, anonymous functions, lambdas, closures are extracted as separate chunks.

The chunk size has an allowed range, and chunks outside it are simply ignored.

- Upper limit is hardcoded with a body size of 10k chars

- Lower limit is configurable with a default of 10 AST nodes inside the body

The chunking strategy is something that can be improved in future versions.
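The size limits above could look roughly like this. It is a sketch that uses Python's `ast` module in place of tree-sitter, so the node counts will differ from what the tool sees. The constant and function names are hypothetical; only the two limit values come from the comment.

```python
import ast

MAX_BODY_CHARS = 10_000  # upper limit mentioned above
MIN_BODY_NODES = 10      # default lower limit mentioned above

def body_stats(source, func):
    """Return (character count, AST node count) for a function body."""
    lines = source.splitlines(keepends=True)
    first, last = func.body[0], func.body[-1]
    body_text = "".join(lines[first.lineno - 1:last.end_lineno])
    node_count = sum(1 for stmt in func.body for _ in ast.walk(stmt))
    return len(body_text), node_count

def keep(source, func):
    """True when the body is neither too long nor too trivial to embed."""
    chars, nodes = body_stats(source, func)
    return chars <= MAX_BODY_CHARS and nodes >= MIN_BODY_NODES

SOURCE = '''
def tiny():
    return 1

def useful(values):
    total = 0
    for value in values:
        if value > 0:
            total += value
    return total
'''

tree = ast.parse(SOURCE)
kept = [
    node.name
    for node in tree.body
    if isinstance(node, ast.FunctionDef) and keep(SOURCE, node)
]
print(kept)  # ['useful']
```

The one-line `tiny` function falls below the node limit and is skipped, so trivial getters and wrappers never reach the embedding step.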

murats•1d ago

Nice idea. I can see this being useful before refactors, especially when the duplication is semantic rather than copy paste.

hdz•1d ago

Very nice. I can imagine putting this into a pre push hook to keep things clean after an initial sweep.

philajan•1d ago

This is neat. Have you noticed any difference in duplicate detection between strongly typed and loosely typed languages / code bases?

rkochanowski•1d ago

No. It depends mostly on general code quality and architecture. Some implementations require more code similarity by design. Some languages, like Java, may tend to have more duplication, but that's only a theoretical guess. It also depends on what kind of software is developed in which language.

If you are interested in data, you can check my article. The analysis was done with a previous version of this tool, in which exact-copy duplicates were excluded from the analysis. https://rkochanowski.com/article/analysis-code-duplication/

SpyCoder77•1d ago

I think that this is pretty cool, but is there any reason why we would want to remove similar/possible duplicate code?

Zopieux•1d ago

Have you written software before?

SpyCoder77•1h ago

I seem to have misunderstood. I thought that it was talking about stuff like similar repos, but now I realize that this is most likely talking about a singular codebase

rufius•1d ago

(without sarcasm) Is this a serious question?

If so - maintainability, testability. This is old software engineering best practice at this point.

You shouldn’t hyper optimize for deduplication, but it’s usually worth considering. Fewer places to fix issues or improve as well.

klibertp•1d ago

I tend to follow the "rule of 3": a second similar implementation is OK, introducing the third triggers a refactor. As with everything, this isn't dogma, and sometimes the second implementation is already too much, while at other times you get tens of similar code sections (in codegen, repeating patterns with almost no changes is a virtue). But it's a good rule of thumb.

On testability: two implementations can be tested against each other, leading to greater coverage with less test code. It doesn't work that way for 3+ implementations, which is another reason not to have that many.

BrandiATMuhkuh•1d ago

What a simple and smart idea. Wonderful

forhadahmed•1d ago

self plug (for similar tool): https://github.com/forhadahmed/refactor

rohanat•1d ago

have you considered a deterministic tier before the embedding pass? I feel that approach can be more efficient.

rkochanowski•1d ago

There are good mature tools for deterministic duplication detection and I intentionally focused on embedding-based to fill this gap (I didn't find other tools using this approach).

If by "more efficient" you mean to avoid embedding of the same code multiple times, this optimization is already implemented internally.

vander_elst•23h ago

We did this using ASTs. You can go quite far without embeddings, and the result is easier to debug and to follow.

vander_elst•1d ago

I implemented this for a large monorepo last year. It runs as an analysis during code review and shows snippets that are possibly similar to the code under review. It was a very nice project. It also lets you see the most common constructs across the repo for the different languages. This could also help to check whether some code has been copied, e.g. from open source projects.

supriyo-biswas•23h ago

Cool project, I've been meaning to do this myself at work for a codebase, and it's nice to see that this exists now.

Does the project simply compute embeddings for every function unit and cluster them, or do we also mean-pool significant dependencies of a function? In other words, given the function

    def a():
      b()
      c()
      d()

Do we also embed b, c, and d as well and combine them somehow in the embedding of a?
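To make the question concrete, here is a minimal sketch of the mean-pooling idea. Everything in it is hypothetical: the toy two-dimensional embeddings, the `calls` map, and the `callee_weight` parameter are illustrations only and are not taken from any existing tool.

```python
def mean_pool(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def combine(own, callee_vectors, callee_weight=0.25):
    """Blend a function's own embedding with the mean of its callees' embeddings."""
    if not callee_vectors:
        return list(own)
    pooled = mean_pool(callee_vectors)
    return [
        (1 - callee_weight) * o + callee_weight * p
        for o, p in zip(own, pooled)
    ]

# Toy embeddings; a real model would return hundreds of dimensions.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.0, 1.0],
    "d": [0.0, 1.0],
}
calls = {"a": ["b", "c", "d"]}  # a() calls b(), c() and d()

combined_a = combine(embeddings["a"], [embeddings[name] for name in calls["a"]])
print(combined_a)  # [0.75, 0.25]
```

The combined vector for `a` is pulled towards what its callees do, so two thin wrappers around the same helpers would end up closer together than their bodies alone suggest.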

klibertp•23h ago

It looks like it works only on function bodies[1]. I'm not sure I understand why you would want to look at invoked callables code, though. Calling the same set of helper functions is already flagged; repeated code in helpers is flagged as well when those helpers are analyzed. Do you have a specific example where you'd like a function flagged as a duplicate based on the code it calls out to?

[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...

rkochanowski•22h ago

Based on your example, there is only a single function a(), which is embedded. The rest is just code, and dependencies are not resolved. Did you think about adding this feature in your tool?

janalsncm•22h ago

Did you benchmark it against simpler methods like BM25?

rkochanowski•21h ago

I just focused on embeddings without comparing them to deterministic solutions.

But I plan to do my own analysis of different embedding models in the context of code similarity detection. Including BM25 in the comparison is a very good idea.

noashavit•21h ago

looks super useful- thanks for sharing!

romanoonhn•19h ago

Looks very cool! I'd be very interested in applying this to my Elixir projects. What does it take to add proper support for a new language?

rkochanowski•8h ago

It can be added with https://pypi.org/project/tree-sitter-elixir/ similar to other languages. I will add this and plan to release a new version today.

rkochanowski•3h ago

Elixir was added in the latest release.

mempko•18h ago

Nice, what's the chunking level? I would want sub function, logical blocks, etc

rkochanowski•8h ago

Function level only. I will add more granular chunking in upcoming versions.

msephton•17h ago

How does it compare to jscpd? https://github.com/kucherenko/jscpd

rkochanowski•8h ago

It's the opposite.

jscpd is advertised as "Copy/paste detector", Slopo is advertised as "non-exact code duplication".

Slopo also detects copy/pasted code, but this is not the main goal and the report focuses more on similar code units.

Bibabomas•10h ago

Have you compared this to https://github.com/MinishLab/semhash (or considered using that for the deduplication backend)?

rkochanowski•9h ago

Finding similar code is something different than deduplication, even when the final goal looks similar.

The deduplication backend is the easiest part of the tool and it doesn't need any additional libraries. Just calculate embeddings and find close pairs. The complexity is in everything around it.

Using local models is worth considering, and the tool already uses the LiteLLM wrapper, which allows configuring different models, including local ones. I left this part to the user.

Bibabomas•6h ago

I don't understand what you mean with your first sentence. Both SemHash and Slopo are deduplication libraries right? Regardless, finding similar code is the core functionality that enables (semantic) deduplication.

I also don't think the backend is the "easiest part" here, a lot of the scalability lives there, which is important for monorepos, or cases where you want to deduplicate across projects. For example, from looking at the implementation, you use exact brute-force similarity search (comparing every item to every other item) which is an O(n^2) operation. It also allocates large dense similarity blocks in memory, so memory use won’t scale well either.

rkochanowski•2h ago

Slopo doesn't deduplicate. It only reports similar code units, which in most cases are not duplicates. This needs to be cleaned up by the user, which is fast and easy with coding agents.

Embedding calculation is also outsourced externally: it's only a simple API call with a LiteLLM wrapper.

"Brute-force similarity search" and O(n^2) may sound scary, but it works fine and is not a bottleneck. For large projects, other parts are much slower and have room for improvement. [1] is the implementation you probably saw. It uses NumPy, spreads work across all CPU cores, and splits the computation into blocks (block_size = 1000). For larger sets, not all vectors are loaded into memory at once. To be honest, I didn't measure actual memory usage. I only tested this on large repos, so I'm aware of the bottlenecks.

From my perspective, this is the easiest part. Code extraction, chunking, applying the boost, clustering, generating the report, and designing everything as a single user-friendly tool are the real challenge. Architectural and product decisions are more difficult than implementation and solving performance issues.

[1] https://github.com/rafal-qa/slopo/blob/v0.3.0/src/slopo/anal...
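For readers following the memory discussion, this is a simplified sketch of blocked brute-force cosine search with NumPy. It is not the linked implementation: the function name and threshold are assumptions, it runs in a single process, and it keeps the full embedding matrix in memory. Only the similarity matrix is computed block by block.

```python
import numpy as np

def similar_pairs(embeddings, threshold=0.9, block_size=1000):
    """Yield (i, j, similarity) for i < j with cosine similarity >= threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # normalize rows once
    n = unit.shape[0]
    for row_start in range(0, n, block_size):
        rows = unit[row_start:row_start + block_size]
        # Start at the diagonal block: blocks below it would repeat the same pairs.
        for col_start in range(row_start, n, block_size):
            cols = unit[col_start:col_start + block_size]
            block = rows @ cols.T  # at most block_size x block_size values
            for bi, bj in np.argwhere(block >= threshold):
                i, j = row_start + int(bi), col_start + int(bj)
                if i < j:  # skip self-matches and mirrored pairs
                    yield i, j, float(block[bi, bj])

vectors = np.array([
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
    [0.1, 0.99],
])

pairs = sorted((i, j) for i, j, _ in similar_pairs(vectors, block_size=2))
print(pairs)  # [(0, 1), (2, 3)]
```

The number of comparisons is still O(n^2), but peak memory for the similarity values is bounded by `block_size` squared instead of n squared.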
