How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? If a function is too long or too short, it can drastically skew the embedding similarity.
The chunk size has an allowed range, and units outside it are simply ignored:
- Upper limit is hardcoded with a body size of 10k chars
- Lower limit is configurable with a default of 10 AST nodes inside the body
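As a rough illustration of the two limits above, here is a minimal sketch that uses Python's built-in ast module as a stand-in parser. The function names, the node-counting rule, and the use of ast itself are assumptions for illustration only, not the tool's actual code.

```python
import ast

MAX_BODY_CHARS = 10_000  # upper limit mentioned above (hardcoded)
MIN_BODY_NODES = 10      # lower limit default mentioned above (configurable)


def body_text(source_lines, func):
    """Return the source text spanned by the function body (hypothetical helper)."""
    first, last = func.body[0], func.body[-1]
    return "\n".join(source_lines[first.lineno - 1:last.end_lineno])


def body_node_count(func):
    """Count AST nodes inside the function body (assumed counting rule)."""
    return sum(1 for stmt in func.body for _ in ast.walk(stmt))


def select_units(source, min_nodes=MIN_BODY_NODES):
    """Return names of functions whose body falls inside the allowed range."""
    lines = source.splitlines()
    kept = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if len(body_text(lines, node)) > MAX_BODY_CHARS:
            continue  # too long: ignored
        if body_node_count(node) < min_nodes:
            continue  # too short: ignored
        kept.append(node.name)
    return kept
```

Units that fail either check are dropped before any embedding is computed, so they never show up as candidates.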
The chunking strategy is something that can be improved in future versions.
If you are interested in data, you can check my article. The analysis was done with a previous version of this tool, in which exact-copy duplicates were excluded. https://rkochanowski.com/article/analysis-code-duplication/
If so - maintainability, testability. This is old software engineering best practice at this point.
You shouldn’t hyper-optimize for deduplication, but it’s usually worth considering. It also means fewer places to fix issues or make improvements.
On testability: two implementations can be tested against each other, leading to greater coverage with less test code. It doesn't work that way for 3+ implementations, which is another reason not to have that many.
If by "more efficient" you mean avoiding embedding the same code multiple times, this optimization is already implemented internally.
Does the project simply compute embeddings for every function unit and cluster them, or do we also mean-pool significant dependencies of a function? In other words, given the function
    def a():
        b()
        c()
        d()
do we also embed b, c, and d and somehow combine them into the embedding of a?
[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...
But I plan to do my own analysis of different embedding models in the context of code similarity detection. Including BM25 in the comparison is a very good idea.
jscpd is advertised as "Copy/paste detector", Slopo is advertised as "non-exact code duplication".
Slopo also detects copy/pasted code, but this is not the main goal and the report focuses more on similar code units.
The deduplication backend is the easiest part of the tool and it doesn't need any additional libraries. Just calculate embeddings and find close pairs. The complexity is in everything around it.
Using local models is worth considering, and the tool already uses the LiteLLM wrapper, which allows configuring different models, including local ones. I left this part for the user.
I also don't think the backend is the "easiest part" here, a lot of the scalability lives there, which is important for monorepos, or cases where you want to deduplicate across projects. For example, from looking at the implementation, you use exact brute-force similarity search (comparing every item to every other item) which is an O(n^2) operation. It also allocates large dense similarity blocks in memory, so memory use won’t scale well either.
Embedding calculation is also outsourced: it's only a simple API call through the LiteLLM wrapper.
"Brute-force similarity search" and O(n^2) may sound scary, but it works fine and is not a bottleneck. For large projects, other parts are much slower and have room for improvement. [1] is the implementation you probably saw. It uses NumPy, spreads work across all CPU cores, and splits the work into blocks (block_size = 1000). For larger sets, not all vectors are loaded into memory at once. To be honest, I didn't measure actual memory usage. I just tested this on large repos, so I'm aware of the bottlenecks.
From my perspective, this is the easiest part. Code extraction, chunking, applying boost, clustering, generating report and designing everything as a single user-friendly tool is a real challenge. Architectural and product decisions are more difficult than implementation and solving performance issues.
[1] https://github.com/rafal-qa/slopo/blob/v0.3.0/src/slopo/anal...
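For readers who want a picture of the block-wise brute-force search described above, here is a minimal NumPy sketch. It is not the code at [1]: the function name close_pairs and its parameters are hypothetical, it keeps all vectors in memory, and it does not spread work across CPU cores. It only shows how splitting into blocks keeps each similarity matrix at block_size x n instead of n x n.

```python
import numpy as np


def close_pairs(vectors, threshold, block_size=1000):
    """Yield (i, j, similarity) for i < j with cosine similarity >= threshold."""
    # Normalize rows once so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    n = unit.shape[0]
    for start in range(0, n, block_size):
        block = unit[start:start + block_size]
        # Only one (block_size x n) similarity block exists at a time.
        sims = block @ unit.T
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), int(c)
            if i < j:  # skip self-pairs and mirrored duplicates
                yield i, j, float(sims[r, c])
```

The total work is still O(n^2) comparisons; the blocking only bounds the size of each intermediate matrix.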
While testing this tool, one detected duplication turned out to be an interesting use case. Permission check logic was duplicated in distant places in the codebase. The code was similar but not identical, and the logic was not the same: one version had stricter checks. I analyzed this with the coding agent, and we found out that both versions are used for the same thing, which means that in some cases validation is insufficient. With a single validation place, this bug could have been prevented or easily detected.
rkochanowski•1d ago
It finds similar-looking code with embeddings. This detects more than just copy-paste clones or even clones with minor changes. Similar code is often not a clone to refactor, and this is a trade-off. Initial results need to be verified, but coding agents can do this quickly. Example prompts are available on https://slopo.dev
Additionally, similar code distant in the codebase is ranked higher to focus on less obvious duplication.
The results differ a lot depending on the codebase. I noticed that sometimes most of the detected duplicates are false positives, but the remaining ones are strong candidates to refactor or even bugs. Sometimes it reveals much more real duplication.
realxrobau•1d ago
rkochanowski•1d ago
raro11•1d ago
rkochanowski•3h ago
klibertp•1d ago
I would also consider analyzing comments (especially docstrings in Python), perhaps as a separate pass with different scoring. If I read the code correctly, you're currently just stripping them, which is the right thing to do when looking for code duplication, but duplicated docstrings are also often a signal that something is wrong in the codebase. The "different scoring" is because we expect docstrings to be structured similarly (at least more so than normal code), so some tweaking would be needed.
Finally: very nice project, congrats! :)
[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...
rkochanowski•21h ago
Skipping the extraction of conditional branches was my decision to avoid overcomplicating the first versions, which were intended to validate the idea. I will add this in future versions because I agree it's needed for large functions.
I don't think it needs configurable granularity. In the current version, there is an analogous mechanism: when functions are nested, both outer and inner are embedded separately. When both are similar to each other, this pair is excluded. Inner or outer functions can appear in results depending on similarity to other units.
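The nested-function rule above can be sketched roughly as follows. The Unit structure and both function names are hypothetical; the sketch only assumes that each embedded unit is known by its file and line span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A code unit located by file and line span (hypothetical structure)."""
    path: str
    start: int
    end: int


def is_nested(a, b):
    """True when one unit lies entirely inside the other in the same file."""
    if a.path != b.path:
        return False
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    return a_contains_b or b_contains_a


def drop_nested_pairs(pairs):
    """Remove similar pairs that are just an outer function and its own inner function."""
    return [(a, b, score) for a, b, score in pairs if not is_nested(a, b)]
```

Both the outer and the inner unit stay available for matching against other units; only the pair formed by the two of them is dropped.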
Regarding comments, they are removed and I will think about handling them. The challenge is not with extraction, but with how to present this in a report. This may be a nice addition because coding agents often add comments.
Thanks for the feedback.
nttylock•22h ago
rkochanowski•20h ago
1. The tool uses only cosine similarity plus boost depending on distance in the codebase.
2. Classification with an LLM. This can be done by the coding agent already used with the project, which gives better results than integrating this pass into the tool. LLMs used for coding are pretty good.
I assumed that this is not a problem I need to solve inside the tool. I'm aware this is not deterministic, but this is by design.
Regarding information about raw similarity: currently, the score (raw similarity + boost) is visible in the report, so this value can be configured based on data. The raw similarity threshold can also be configured, but it's not displayed. I will think about how to handle this.
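To make "raw similarity + boost" concrete, here is a minimal sketch. The distance measure (directory hops between the two files), the step and max_boost values, and the function names are all assumptions made for illustration; the thread does not state how the tool actually computes the boost.

```python
from pathlib import PurePosixPath


def path_distance(path_a, path_b):
    """Count directory hops between two files (hypothetical distance measure)."""
    dirs_a = PurePosixPath(path_a).parent.parts
    dirs_b = PurePosixPath(path_b).parent.parts
    shared = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        shared += 1
    return (len(dirs_a) - shared) + (len(dirs_b) - shared)


def boosted_score(raw_similarity, path_a, path_b, step=0.01, max_boost=0.05):
    """Add a capped boost that grows with distance in the codebase (assumed formula)."""
    boost = min(max_boost, step * path_distance(path_a, path_b))
    return raw_similarity + boost
```

With a formula of this shape, two pairs with the same cosine similarity are ranked differently: the pair whose files sit further apart gets the higher score.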
AlexeyBelov•4h ago
jadbox•22h ago