Volunteer Analyst @ Modern Military Association of America

Overview

During my time as a volunteer analyst for the Modern Military Association of America, I completed an analysis of the organization's case report data. This involved data cleaning, transformation, visualization, and reporting. The goal was to help the organization quantify progress toward its goals and set new, realistic ones.

Data Quality Assessment and Cleaning

The data used in the analysis was a CSV file in which each entry represented one respondent's answers entered into a Google Sheet. There were around 1,000 responses covering the year before the analysis. These responses were manually logged in Excel and exported as a CSV. The shape and content of the data were not easy to work with initially, with many issues stemming from the form allowing free-text responses. Steps were taken to alleviate this. Problems encountered:

Across eight routes, between 200 and 300 attack trajectories were generated per route from the corresponding ground truth trajectory. The dataset was derived from the Complex Urban Dataset.

Dataset structure visualization:

Dataset directory structure

Dataset Attributes (28 features)

  1. time: Timestamp (seconds per row).
  2. x, y, z: Longitude, latitude (radians), and height (meters).
  3. utm_x, utm_y: Coordinates in UTM format.
  4. ve, vn, vu: Velocity in east, north, and up directions (m/s).
  5. pitch, roll, yaw: Vehicle orientation (Euler angles in radians).
  6. qbn_0, qbn_1, qbn_2, qbn_3: Orientation represented as a quaternion.
  7. init_align: Initial vehicle heading.
  8. Standard deviations: Position, velocity, and orientation (x_sd, y_sd, yaw_sd, etc.).

Data Preparation

To simplify the dataset, we aggregated each CSV using mean values, reducing the dataset from over a million entries to 2,546 rows. We combined attack and ground truth data, labeled them, and used XGBoost's feature importance to retain only eight key features:

  1. vx_sd, z_sd: Standard deviations of x and z positions.
  2. ve, vu: Velocity in the east and up directions.
  3. pitch, roll, yaw: Euler angle-based vehicle orientation.
  4. qbn_3: Quaternion component representing vehicle orientation.
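
The mean-aggregation step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function name aggregate_csv, the 0/1 label encoding, and the toy CSV values are all assumptions made for the example.

```python
import csv
import io
from statistics import fmean


def aggregate_csv(text, label):
    """Collapse one trajectory CSV into a single row of column means."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())
    # One mean per column, so each trajectory becomes one row.
    summary = {col: fmean(float(r[col]) for r in rows) for col in columns}
    # Assumed encoding for this sketch: 0 = ground truth, 1 = attack.
    summary["label"] = label
    return summary


# Toy stand-ins for two trajectory files (values are made up).
ground_truth_csv = "ve,vu,yaw\n1.0,0.0,0.10\n3.0,0.2,0.30\n"
attack_csv = "ve,vu,yaw\n2.0,0.4,0.50\n4.0,0.6,0.70\n"

# Combine the labeled, aggregated rows into one dataset.
dataset = [
    aggregate_csv(ground_truth_csv, label=0),
    aggregate_csv(attack_csv, label=1),
]
```

Each input file contributes exactly one row to the combined dataset, which is how a collection of long trajectories can shrink to a few thousand rows.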

Final dataset statistical summary:

Final dataset statistics

Modeling

We tested logistic regression, support vector machines, and XGBoost. Initially, XGBoost yielded unexpectedly high test metrics, leading us to investigate class imbalance.
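
One quick way to see why suspiciously high metrics can point to class imbalance is to compare them against a majority-class baseline. The sketch below is illustrative only: the helper majority_baseline_accuracy and the 9-to-1 toy label split are assumptions, not the project's real class counts.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical, heavily imbalanced labels (not the project's real counts).
labels = ["attack"] * 9 + ["ground_truth"] * 1

class_counts = Counter(labels)
baseline = majority_baseline_accuracy(labels)
print(class_counts)
print(baseline)
```

If a trained model's accuracy is close to this baseline, the score mostly reflects the class distribution rather than anything the model has learned.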

Class distribution before balancing

To address this, we bootstrapped additional ground truth samples, that is, resampled the existing ground truth rows with replacement. After balancing, the models trained on the dataset showed improved performance.
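
A minimal sketch of bootstrap balancing, assuming each sample is a dictionary with a "label" key; the function name bootstrap_balance, the seed, and the toy rows are hypothetical and only illustrate the idea of resampling the minority class with replacement.

```python
import random


def bootstrap_balance(rows, label_key="label", seed=0):
    """Oversample smaller classes with replacement until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra rows with replacement to fill the gap (none for the largest class).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy data: six attack rows and two ground truth rows (values are made up).
rows = [{"label": "attack", "ve": float(i)} for i in range(6)]
rows += [{"label": "ground_truth", "ve": float(i)} for i in range(2)]

balanced = bootstrap_balance(rows)
```

In practice it is usually safer to resample only the training split, so that duplicated rows do not end up in both the training and test sets.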

Class distribution after balancing

Model performance after balancing:

Updated model performance metrics

Findings and Performance Evaluation

Final model metrics

Final model learning curves

The optimized XGBoost model successfully identified anomalies, with all metrics above 0.9, which met our performance goal. The most impactful features in identifying cyberattacks were:

Future Work

Potential improvements include expanding the dataset with more real-world ground truth trajectories, which would reduce reliance on bootstrapping for class balancing.

Project Takeaways

This project reinforced my skills in data cleaning, modeling, and handling unstructured datasets. I also gained experience in directing team efforts, designing modeling approaches, and using XGBoost for classification.