In a project sponsored by the Purdue School of Industrial Engineering, I worked on a team of five to develop a model for detecting cyberattacks in sensor data from automated vehicles. Our goals were minimal runtime and at least 90% accuracy on test data. We processed a large dataset and built a model that met our expectations.
The dataset consists of multiple CSV files, each containing localization sensor features. Each file represents vehicle trajectory data over time. The dataset includes:
Across eight routes, between 200 and 300 attack trajectories were generated from their respective ground truths. The dataset was derived from the Complex Urban Dataset.
Dataset structure visualization:
To simplify the dataset, we collapsed each CSV file into a single row of per-feature mean values, reducing the dataset from over a million entries to 2,546 rows. We then combined the attack and ground truth rows, labeled them, and used XGBoost's feature importance scores to retain only eight key features:
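The aggregation and labeling step described above can be sketched roughly as follows. This is a minimal illustration, not our exact pipeline: the column names (`feat_a`, `feat_b`, `feat_c`), the synthetic values, and the 1 = attack / 0 = ground truth labeling convention are all assumptions made for the example, and in-memory frames stand in for the CSV files.

```python
import numpy as np
import pandas as pd


def summarize_trajectory(trajectory: pd.DataFrame, label: int) -> pd.Series:
    """Collapse one trajectory (many time steps) into one row of per-feature means."""
    row = trajectory.mean(numeric_only=True)
    row["label"] = label  # assumed convention: 1 = attack, 0 = ground truth
    return row


# Synthetic stand-ins for two trajectory files; real code would call pd.read_csv(path).
rng = np.random.default_rng(0)
columns = ["feat_a", "feat_b", "feat_c"]  # hypothetical feature names
ground_truth = pd.DataFrame(rng.normal(0.0, 1.0, size=(500, 3)), columns=columns)
attack = pd.DataFrame(rng.normal(0.5, 1.0, size=(500, 3)), columns=columns)

# One row per trajectory: 1,000 time-step entries become 2 labeled rows.
dataset = pd.DataFrame(
    [summarize_trajectory(ground_truth, 0), summarize_trajectory(attack, 1)]
).reset_index(drop=True)
dataset["label"] = dataset["label"].astype(int)

print(dataset.shape)  # (2, 4)
```

Taking the mean discards the time ordering within a trajectory, which is what makes the dataset so much smaller; the trade-off is that any attack signature visible only in the sequence of readings is no longer available to the model.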
Final dataset statistical summary:
We tested logistic regression, support vector machines, and XGBoost. Initially, XGBoost yielded unexpectedly high test metrics, leading us to investigate class imbalance.
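To illustrate why suspiciously high metrics pointed us toward class imbalance, here is a small sketch with made-up class counts (95 attack samples and 5 ground truth samples; these are not our real counts). A degenerate model that always predicts "attack" scores well on accuracy while never recognizing a ground truth trajectory.

```python
# Hypothetical class counts chosen only to illustrate the effect.
y_true = [1] * 95 + [0] * 5  # 1 = attack, 0 = ground truth
y_pred = [1] * 100           # degenerate model: always predicts "attack"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of samples whose true class is `cls` that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    total = sum(1 for t in y_true if t == cls)
    return hits / total


print(accuracy)                   # 0.95
print(recall(y_true, y_pred, 1))  # 1.0
print(recall(y_true, y_pred, 0))  # 0.0
```

Checking recall per class, rather than accuracy alone, is one way to expose this kind of problem.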
To address this, we bootstrapped additional ground truth samples to balance the two classes. After balancing, model performance improved.
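A minimal sketch of bootstrapping the minority class, assuming a pandas DataFrame with a `label` column where 0 marks ground truth. The helper name `bootstrap_balance`, the toy values, and the choice to match the majority count exactly are illustrative assumptions rather than a record of our exact procedure.

```python
import pandas as pd


def bootstrap_balance(df, label_col="label", minority=0, seed=0):
    """Resample minority rows with replacement until both classes are the same size."""
    minority_rows = df[df[label_col] == minority]
    majority_rows = df[df[label_col] != minority]
    n_extra = len(majority_rows) - len(minority_rows)
    if n_extra <= 0:
        return df.copy()  # already balanced (or minority is not actually smaller)
    extra = minority_rows.sample(n=n_extra, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)


# Toy data: 2 ground truth rows (label 0) and 6 attack rows (label 1).
toy = pd.DataFrame({
    "feat_a": [0.1, 0.2, 0.9, 1.1, 1.0, 0.8, 1.2, 0.95],
    "label": [0, 0, 1, 1, 1, 1, 1, 1],
})
balanced = bootstrap_balance(toy)
print(balanced["label"].value_counts().to_dict())
```

Because bootstrapped rows are exact copies of existing ones, it is generally safer to resample only the training split, so that duplicates of the same row do not land in both training and test data.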
Model performance after balancing:
The optimized XGBoost model successfully identified anomalies, with every metric above 0.9, which met our performance goal. The most impactful features in identifying cyberattacks were:
Potential improvements include expanding the dataset with more real-world ground truth trajectories, which would reduce reliance on bootstrapping for class balancing.
This project reinforced my skills in data cleaning, modeling, and handling unstructured datasets. I also gained experience in directing team efforts, designing modeling approaches, and using XGBoost for classification.