How I Structure a Fraud Detection Side Project So Iteration Stays Cheap

Mar 20, 2026

When people talk about machine learning side projects, they usually focus on the model choice, the leaderboard score, or the notebook. That is understandable, but it misses what I find most useful from an engineering perspective.

In this fraud detection project, the interesting part was not just training an XGBoost or LightGBM classifier. The interesting part was building the project so that feature engineering, balancing strategies, and model comparisons could evolve without the codebase collapsing into a pile of scripts.

That is the kind of work I care about most: taking an exploratory problem and giving it a shape that remains cheap to change.

The Short Answer

If I had to summarize the approach in one sentence, it would be this: build the experiment flow like a small production system, not like a disposable notebook.

In practice, that meant a few concrete decisions:

  • a dedicated dataset pipeline responsible for loading, cleaning, feature application, and train/test splitting
  • explicit feature objects that can be added or removed without touching the rest of the flow
  • balancing isolated behind its own protocol
  • model runners that share a stable interface
  • repeatable evaluation with deterministic splits and automated tests

This is not a large system, but those boundaries matter. They are what make iteration sustainable.
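The boundaries above can be made concrete as small interfaces. The sketch below uses Python protocols; the names and signatures are illustrative, not the repository's actual code.

```python
from typing import Any, Protocol, Tuple, runtime_checkable

# Hypothetical interfaces for the boundaries listed above; the real
# repository may shape these differently.

@runtime_checkable
class Feature(Protocol):
    """One engineered feature; adds derived columns to a frame."""
    def apply(self, df: Any) -> Any: ...

@runtime_checkable
class Balancer(Protocol):
    """Resampling strategy, isolated so it can be swapped out."""
    def fit_resample(self, X: Any, y: Any) -> Tuple[Any, Any]: ...

@runtime_checkable
class ModelRunner(Protocol):
    """One model family; trains and evaluates against a prepared split."""
    def run(self, split: Any) -> Any: ...
```

The point is not the typing machinery. It is that each concern has exactly one place to live.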

The Problem I Wanted to Avoid

Most ML projects start reasonably and then degrade quickly.

A first notebook becomes three notebooks. A preprocessing step gets duplicated in two places. A balancing trick is applied inline inside model code. Someone adds a new feature and forgets how it affects the split. Evaluation logic drifts between experiments. The project still “works”, but each new change gets more expensive.

I wanted to avoid that pattern from the beginning.

Even in a personal project, I prefer a structure where the main entry point stays trivial and the real logic is pushed into explicit components. In this repository, src/main.py is intentionally tiny. It just creates a models pipeline and runs evaluation. That only works because the interesting decisions live somewhere better. If you want to inspect the implementation directly, the code is available on GitHub.
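To make "intentionally tiny" concrete, here is a self-contained sketch of what an entry point like that can look like. The real `create_models_pipeline()` lives elsewhere in the project, so a stub stands in for it here, and `RunResult` is an illustrative name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunResult:
    model: str
    pr_auc: float

def create_models_pipeline():
    """Stand-in factory: the real one wires the dataset pipeline and models."""
    class Pipeline:
        def evaluate(self):
            return [RunResult("xgboost", 0.84), RunResult("lightgbm", 0.83)]
    return Pipeline()

def main() -> None:
    # The entry point only creates the pipeline and reports results.
    for result in create_models_pipeline().evaluate():
        print(f"{result.model}: PR-AUC={result.pr_auc:.4f}")

if __name__ == "__main__":
    main()
```

Everything interesting happens behind the factory, which is exactly the point.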

How I Structured the Pipeline

The core of the project is a DatasetPipelineBuilder that assembles the dataset flow in a predictable order:

  1. load the source dataset
  2. clean the raw frame
  3. apply registered features
  4. build a dataset pipeline object
  5. split, balance, and hand the result to the model runner
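Stripped to its essentials, a builder like that can be sketched as follows. The class and method names are illustrative, and plain callables stand in for the real loader, cleaner, and feature objects.

```python
class DatasetPipeline:
    """Owns the prepared frame; split/balance logic would live here."""
    def __init__(self, df):
        self.df = df

class DatasetPipelineBuilder:
    """Assembles load -> clean -> features -> pipeline in a fixed order."""
    def __init__(self, loader, cleaner, features):
        self._loader = loader
        self._cleaner = cleaner
        self._features = features

    def build(self) -> DatasetPipeline:
        df = self._loader()              # 1. load the source dataset
        df = self._cleaner(df)           # 2. clean the raw frame
        for feature in self._features:   # 3. apply registered features
            df = feature(df)
        return DatasetPipeline(df)       # 4. wrap in a pipeline object
```

The fixed ordering is the contract: features never see an uncleaned frame, and models never see a frame that skipped a feature.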

Viewed from the actual code path, the flow looks like this:

```mermaid
flowchart TD
    A["src/main.py"] --> B["create_models_pipeline()"]
    B --> C["ModelsPipeline"]
    B --> D["DatasetPipelineBuilder.default()"]

    D --> E["Load creditcard.csv.zip"]
    E --> F["DatasetCleaner.clean()"]
    F --> G["Load features dynamically from src/features"]
    G --> H["Apply feature objects one by one"]
    H --> I["DatasetPipeline(df)"]

    C --> J["Run XGBoostModel"]
    C --> K["Run LightGBMModel"]

    J --> L["dataset_pipeline.split(use_cache=True)"]
    K --> L

    L --> M["Stratified train/test split"]
    M --> N["OversamplingBalancer.fit_resample()"]
    N --> O["One-hot encode amountBin"]
    O --> P["SMOTETomek with adaptive k_neighbors"]
    P --> Q["Drop amountBin from train/test matrices"]

    Q --> R["Scale + fit classifier"]
    R --> S["Predict + PR-AUC + classification report"]
    S --> T["EvaluationResult"]
```

That separation is simple, but it pays off immediately.

The dataset pipeline owns the data lifecycle. The model classes do not need to know where the data came from, how features were derived, or which balancing strategy was used. They receive a stable interface and focus on training and evaluation.

On top of that, a ModelsPipeline can execute multiple models against the same prepared dataset flow. In this case the comparison is between XGBoost and LightGBM, but the structure is open enough that adding another model is straightforward.

This is one of those cases where a small amount of upfront design removes a lot of future friction.

Feature Engineering as a First-Class Module

The part I like most in this project is the feature loading strategy.

Features are not hidden inside one monolithic transformation function. They are implemented as separate units under src/features, and the pipeline loads them dynamically through a registration mechanism. That gives me a very practical workflow:

  • create a new feature in its own file
  • expose it through register()
  • let the builder pick it up automatically
  • validate it with focused tests

That is a much better setup than editing a single growing block of feature logic every time a new idea appears.

Some of the features are deliberately simple, like hour-of-day or log-transformed amount. Others are more contextual and much more interesting for fraud detection:

  • transaction counts over rolling 1h, 6h, and 24h windows
  • amount aggregations over the same windows
  • ratio features comparing the current amount against recent local averages
  • percentile-style signals for transaction magnitude

Those features matter because fraud detection is rarely about an isolated transaction. It is usually about behavior in context. A transaction amount that looks normal globally may look suspicious relative to the activity of the last hour on the same timeline.

I also liked using Polars here for the temporal feature work. Windowed aggregations and column-oriented transformations are a good fit for this kind of feature engineering.

Class Imbalance Was Treated as a System Concern

Fraud detection is an imbalanced classification problem. That is obvious, but the implementation detail matters.

Instead of sprinkling balancing logic across the training code, I kept it behind a dedicated balancer abstraction. The current implementation uses SMOTETomek, adapts k_neighbors to the minority class size, and encodes the categorical amountBin column before resampling.

That decision matters for two reasons.

First, it keeps the model runners clean. XGBoost and LightGBM can focus on model-specific concerns such as hyperparameters, probability prediction, and evaluation.

Second, it makes balancing replaceable. If later I want to compare plain SMOTE, undersampling, or no balancing at all, I do not need to rewrite the pipeline around the model code.

That kind of replaceability is often treated as overengineering in side projects. I disagree. If the goal is learning, then reducing the cost of comparison is one of the highest leverage choices you can make.

Evaluation Was Kept Repeatable

Another detail I care about is keeping evaluation deterministic enough to be trustworthy.

The project fixes the random state, uses stratified train/test splits, caches prepared datasets, and wraps the evaluation result in a typed domain object instead of passing around anonymous dictionaries.
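A typed result object for that purpose can be as small as a frozen dataclass. Field names here are illustrative of the idea, not the repository's exact `EvaluationResult`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationResult:
    """Immutable record of one model run, safe to compare across runs."""
    model_name: str
    pr_auc: float
    report: str
    random_state: int = 42  # recorded so runs remain comparable

    def improved_over(self, other: "EvaluationResult") -> bool:
        return self.pr_auc > other.pr_auc
```

Compared with passing dictionaries around, this makes missing fields a construction-time error instead of a KeyError three experiments later.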

This is not glamorous work, but it is the difference between “I think this experiment improved the model” and “I can compare runs without wondering whether the pipeline changed underneath me.”

The repository also includes unit, integration, and acceptance tests. That is important because once a project starts accumulating engineered features, silent regressions become much more likely than syntax errors.

A few examples of what is verified:

  • dynamic feature loading still finds the expected registered features
  • pipeline splits remain deterministic
  • balancing produces the expected class distribution
  • transformed training and test frames preserve a compatible shape

That test coverage is part of the design, not an afterthought.
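As one example of the determinism check above, a test can simply run the split twice and demand identical partitions. This sketch uses scikit-learn's `train_test_split` directly; the project's own split helper would be exercised the same way.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split(X, y, random_state=42):
    """Stratified 80/20 split with a fixed seed."""
    return train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )

def test_split_is_deterministic():
    X = np.arange(100).reshape(50, 2)
    y = np.array([0] * 45 + [1] * 5)  # imbalanced, like fraud labels
    first = split(X, y)
    second = split(X, y)
    # The same random_state must yield byte-identical partitions.
    for a, b in zip(first, second):
        assert np.array_equal(a, b)
```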

The Result That Actually Matters

Yes, the model performance improved over iterations. The README shows the PR-AUC moving from roughly 0.8345 to 0.8403 for the strongest XGBoost iteration.

That is useful, but it is not the main reason I would show this project to another engineer.

What I would show is that the codebase supports the work behind that improvement:

  • adding temporal features without rewriting the full pipeline
  • comparing model families with a stable execution path
  • isolating balancing from training concerns
  • keeping evaluation explicit and testable

That is much closer to how ML engineering works in real teams. Raw model quality matters, but the ability to iterate safely matters just as much.

Trade-Offs and What I Would Refine Next

The design is intentionally pragmatic, not perfect.

There are still trade-offs here:

  • dynamic loading is flexible, but it can hide coupling if feature dependencies grow unchecked
  • caching speeds up iteration, but cached artifacts need discipline if the feature set changes
  • a project like this still sits in the middle ground between experimentation and full production hardening
  • the current structure is good for comparative iteration, but it would need stronger experiment tracking if many parameter sweeps were added

If I kept extending it, the next things I would likely add are explicit experiment metadata, cleaner separation for model configuration, and stronger reporting around feature contribution and threshold selection.

Why This Kind of Project Is Worth Building

What I value in projects like this is not the chance to say I trained a fraud model. It is the chance to demonstrate engineering judgment under a realistic constraint: the problem starts exploratory, but the code still needs to remain understandable after the fifth iteration.

That is usually where software engineering begins to matter more than the initial idea.

It also supports a workflow I find unusually valuable in side projects: execute, leave it alone, and come back later without paying a large re-entry cost. I built this roughly eight months ago, and that matters to me because I may not touch it again for a few more months. The structure makes that acceptable. The entry point is small, the pipeline is explicit, and the responsibilities are separated cleanly enough that future me will not need to reverse-engineer a notebook jungle just to remember how the experiment works.

I apply the same principle in larger systems too: make the workflow explicit, keep boundaries honest, and optimize for change instead of just first execution. It is also the same reason I care about turning personal tooling into proper systems, as I described in My Second Brain for AI: MCP, Obsidian, and Personal Knowledge Retrieval.

Conclusion

The model matters. The features matter. The PR-AUC matters.

But if I had to choose one thing to optimize for, it would still be this: make the next change cheaper than the previous one.