I built Skyulf because I kept encountering two specific problems that existing tools (like MLflow or standard Scikit-learn pipelines) didn't quite solve for me: silent data leakage and monolithic pickles.
## The Problems
1. Data Leakage is Silent: You compute mean imputation on the full dataset, then split. Your model looks great in dev but fails in production. It happens to the best of us (a concrete demonstration follows this list).
2. Deployment Hell (The Pickle Problem): Standard pipelines pickle everything (data schema, logic, and third-party library versions) into one opaque blob. To run a simple inference, you need the same heavy environment you used for training.
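To make problem 1 concrete, here is a tiny self-contained demonstration (plain NumPy, nothing Skyulf-specific) of how a statistic computed before the split leaks test information into training:

```python
import numpy as np

# Toy feature with a missing value; rows 0-2 are "train", row 3 is "test".
age = np.array([20.0, 30.0, np.nan, 90.0])
train = age[:3]

# WRONG: the imputation value is computed on the full dataset before splitting.
leaky_mean = np.nanmean(age)    # ~46.7 -- includes the 90.0 from the future test row

# RIGHT: the imputation value is computed on the training split only.
clean_mean = np.nanmean(train)  # 25.0 -- the test row is never seen

print(leaky_mean, clean_mean)   # the two imputed values differ materially
```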
## The Solution: Distinct Calculator & Applier
Skyulf enforces a strict separation of concerns using a Calculator / Applier pattern (inspired by modern engine design).
1. Calculator (Fit): Consumes data (`X`, `y`), learns the state (means, vocabularies, coefficients), and outputs a lightweight, JSON-serializable Artifact.
2. Applier (Predict): A pure function. Consumes the Artifact + new data -> output.
Why this matters: You can train on a massive GPU cluster, save just the lightweight JSON artifacts (the state), and run the Applier on a tiny CPU instance. The Applier is stateless. (A simplified sketch of the pattern follows the config example below.)
3. Structural Leakage Prevention: We use a `SplitDataset` abstraction. Transformers receive train/test/val as a single object but are forced, by construction, to compute statistics on `.train` only.
```python
from skyulf import SkyulfPipeline

# df is assumed to be an already-loaded DataFrame with "age", "income",
# and "target" columns.
config = {
    "preprocessing": [
        # Split happens FIRST. Leakage is structurally impossible.
        {"name": "split", "transformer": "TrainTestSplitter", "params": {"test_size": 0.2}},
        {"name": "impute_age", "transformer": "SimpleImputer", "params": {"columns": ["age"], "strategy": "mean"}},
        {"name": "scale_income", "transformer": "StandardScaler", "params": {"columns": ["income"]}},
    ],
    "modeling": {"type": "random_forest_classifier", "params": {"n_estimators": 100}},
}

pipeline = SkyulfPipeline(config)
pipeline.fit(df, target_column="target")
pipeline.save("model.pkl")
```
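Conceptually, the Calculator/Applier split plus the `SplitDataset` guard boil down to something like the following simplified sketch. To be clear, this is an illustration of the idea, not the actual skyulf-core API; a mean imputer stands in for any transformer.

```python
import json
from dataclasses import dataclass

import polars as pl

# A SplitDataset-style container (illustrative): transformers see one object,
# but the Calculator only ever reads statistics off the .train split.
@dataclass
class SplitDataset:
    train: pl.DataFrame
    test: pl.DataFrame

# --- Calculator (fit): consumes data, learns state, emits a JSON artifact ---
def fit_mean_imputer(ds: SplitDataset, column: str) -> dict:
    # Statistics come from ds.train only; the test split is never touched.
    return {
        "transformer": "SimpleImputer",
        "column": column,
        "strategy": "mean",
        "value": ds.train[column].mean(),
    }

# --- Applier (predict): a pure, stateless function: artifact + data -> output ---
def apply_mean_imputer(artifact: dict, data: pl.DataFrame) -> pl.DataFrame:
    # No learning, no hidden state: everything it needs is in the artifact.
    return data.with_columns(
        pl.col(artifact["column"]).fill_null(artifact["value"])
    )

ds = SplitDataset(
    train=pl.DataFrame({"age": [20.0, 30.0, None]}),
    test=pl.DataFrame({"age": [None, 40.0]}),
)
artifact = fit_mean_imputer(ds, "age")
print(json.dumps(artifact))                   # lightweight, JSON-serializable state
print(apply_mean_imputer(artifact, ds.test))  # the same pure function works on any new data
```

Because the artifact is plain JSON, the Applier side carries no dependency on the training environment at all.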
## Features
1. Polars-First (~3.5x Faster): We migrated the core engine from Pandas to Polars. Lazy evaluation means we can scan CSV/Parquet files of any size instantly for EDA.
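For context on what lazy evaluation buys you here, this is plain Polars, not Skyulf-specific: `scan_csv` builds a query plan without touching the file, so only the columns and aggregates a profile actually needs ever get materialized.

```python
import polars as pl

# scan_csv returns a LazyFrame: nothing is read from disk yet.
lazy = pl.scan_csv("data.csv")

# Only the referenced columns are read; aggregations are pushed into the scan.
stats = lazy.select(
    pl.col("income").mean().alias("income_mean"),
    pl.col("age").null_count().alias("age_missing"),
).collect()  # execution happens here, once, against the optimized plan

print(stats)
```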
2. One-Liner EDA: Generates a comprehensive profile (quality, outliers, VIF, causal graphs) in seconds.
```python
import polars as pl

from skyulf.profiling.analyzer import EDAAnalyzer
from skyulf.profiling.visualizer import EDAVisualizer

df = pl.read_csv("data.csv")
profile = EDAAnalyzer(df).analyze(target_col="churn")

viz = EDAVisualizer(profile, df)
viz.summary()  # Terminal dashboard
viz.plot()     # Matplotlib distributions & correlations
```
3. Visual ML Canvas (Local-First): A React-based drag-and-drop UI (running locally via FastAPI) that lets you visually debug pipelines. You can click any node to see data stats at that exact point in the pipeline.
## Why Another Tool?
- vs MLflow: We focus on the construction and execution of the pipeline itself, not just tracking metrics.
- vs Scikit-learn Pipelines: We separate state (Artifacts) from logic (Appliers) and enforce leakage checks.
- vs Cloud Platforms: Skyulf is self-hosted. Your data never leaves your machine.
## Current Status
The library `skyulf-core` is stable on PyPI. The visual platform is functional but still being polished.
I'm a solo dev building this in public, and I'd love your feedback. If you find this interesting, a star on GitHub would mean a lot! I'm also looking for contributors: if you're into Python, React, or MLOps, check out the issues.
---
*Links*:
- Repo: https://github.com/flyingriverhorse/Skyulf
- PyPI: https://pypi.org/project/skyulf-core
- Docs: https://www.skyulf.com