Anomaly Detection in Vehicle Localization Sensor Data

Project Overview

In a project sponsored by the Purdue School of Industrial Engineering, I worked on a team of five to develop a model for detecting cyberattacks in sensor data from automated vehicles. Our goals were minimal runtime and at least 90% accuracy on held-out test data. We processed a large dataset and built a model that met both goals.

Initial Dataset Description

The dataset consists of multiple CSV files, each containing localization sensor features for one vehicle trajectory over time. Across eight routes, between 200 and 300 attack trajectories were generated from their respective ground truths. The dataset was derived from the Complex Urban Dataset.

Dataset structure visualization:

Dataset directory structure

Dataset Attributes (28 features)

  1. time: Timestamp in seconds (one row per reading).
  2. x, y, z: Longitude, latitude (radians), and height (meters).
  3. utm_x, utm_y: Coordinates in UTM format.
  4. ve, vn, vu: Velocity in east, north, and up directions (m/s).
  5. pitch, roll, yaw: Vehicle orientation (Euler angles in radians).
  6. qbn_0, qbn_1, qbn_2, qbn_3: Orientation represented as a quaternion.
  7. init_align: Initial vehicle heading.
  8. Standard deviations: Position, velocity, and orientation (x_sd, y_sd, yaw_sd, etc.).
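
Since the dataset carries both a quaternion (qbn_0 through qbn_3) and Euler angles (pitch, roll, yaw) for the same orientation, one representation can be sanity-checked against the other. A minimal numpy sketch, assuming a scalar-first (w, x, y, z) quaternion ordering — the actual component ordering of qbn_0..qbn_3 is an assumption here:

```python
import numpy as np

def quat_to_euler(w, x, y, z):
    """Convert a unit quaternion (scalar-first) to roll, pitch, yaw in radians."""
    roll = np.arctan2(2 * (w * x + y * z), 1 - 2 * (x**2 + y**2))
    # Clip guards against floating-point drift just outside [-1, 1].
    pitch = np.arcsin(np.clip(2 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2 * (w * z + x * y), 1 - 2 * (y**2 + z**2))
    return roll, pitch, yaw

# A pure 90-degree yaw rotation: w = cos(45 deg), z = sin(45 deg)
r, p, yw = quat_to_euler(np.sqrt(0.5), 0.0, 0.0, np.sqrt(0.5))
```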

Data Preparation

To simplify the dataset, we aggregated each CSV using per-column mean values, reducing over a million entries to 2,546 rows. We combined attack and ground truth data, labeled each row, and used XGBoost's feature importance to retain only eight key features:

  1. vx_sd, z_sd: Standard deviations of x and z positions.
  2. ve, vu: Velocity in the east and up directions.
  3. pitch, roll, yaw: Euler angle-based vehicle orientation.
  4. qbn_3: Quaternion component representing vehicle orientation.
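
The aggregate-label-rank pipeline above can be sketched roughly as follows. This is a toy reconstruction on synthetic stand-in trajectories, and it uses scikit-learn's GradientBoostingClassifier in place of XGBoost purely to keep the example self-contained; the project itself used XGBoost's importance ranking:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def summarize_trajectory(df: pd.DataFrame) -> pd.Series:
    # Collapse one trajectory CSV into a single row of per-feature means.
    return df.mean()

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the per-trajectory CSVs (ground truth vs. attack).
truth = [pd.DataFrame(rng.normal(0.0, 1.0, (50, 4)), columns=list("abcd")) for _ in range(30)]
attack = [pd.DataFrame(rng.normal(0.5, 1.0, (50, 4)), columns=list("abcd")) for _ in range(30)]

rows = [summarize_trajectory(d) for d in truth + attack]
X = pd.DataFrame(rows)
y = np.array([0] * len(truth) + [1] * len(attack))  # 1 = attack

# Rank features by importance and keep the top k (the project kept eight).
model = GradientBoostingClassifier(random_state=0).fit(X, y)
ranking = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])
top_features = [name for name, _ in ranking[:2]]
```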

Final dataset statistical summary:

Final dataset statistics

Modeling

We tested logistic regression, support vector machines, and XGBoost. Initially, XGBoost yielded unexpectedly high test metrics, leading us to investigate class imbalance.
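
A hedged sketch of this kind of model comparison, on synthetic data, with scikit-learn's gradient-boosted trees standing in for XGBoost so the example stays self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the aggregated, labeled trajectory table.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "boosted_trees": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each candidate model.
scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in candidates.items()}
```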

Class distribution before balancing

To address this, we bootstrapped additional ground truth samples to balance the classes. With the balanced dataset, model performance improved.
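
The bootstrapping step can be sketched with scikit-learn's resample; the class sizes below (8 ground-truth trajectories versus 250 attack trajectories) are illustrative stand-ins, not the project's exact counts:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_truth = rng.normal(size=(8, 3))     # minority class: ground-truth trajectories
X_attack = rng.normal(size=(250, 3))  # majority class: attack trajectories

# Resample the minority class with replacement until the classes match in size.
X_truth_boot = resample(X_truth, replace=True, n_samples=len(X_attack), random_state=0)

X_bal = np.vstack([X_truth_boot, X_attack])
y_bal = np.concatenate([np.zeros(len(X_truth_boot)), np.ones(len(X_attack))])
```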

Class distribution after balancing

Model performance after balancing:

Updated model performance metrics

Findings and Performance Evaluation

Final model metrics

Final model learning curves

The optimized XGBoost model successfully identified anomalies, with all metrics above 0.9, meeting our performance goal. XGBoost's importance ranking also indicated which features were most impactful in identifying cyberattacks.
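
Metrics of this kind can be computed with scikit-learn; the labels below are made up purely for illustration (1 = attack):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical held-out labels and model predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
```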

Future Work

Potential improvements include expanding the dataset with more real-world ground truth trajectories, which would reduce reliance on bootstrapping for class balancing.

Project Takeaways

This project reinforced my skills in data cleaning, modeling, and handling unstructured datasets. I also gained experience in directing team efforts, designing modeling approaches, and using XGBoost for classification.