During my time as a volunteer analyst for the Modern Military Association of America, I completed an analysis of the organization's case report data. This involved data cleaning, transformation, visualization, and reporting. The goal was to help the organization quantify its progress toward existing goals and set new, realistic ones.
The data used in the analysis was a CSV in which each entry represented one respondent's answers, given to fill out a Google Sheet. There were around 1,000 responses covering the year preceding the analysis. These responses were manually logged into Excel and exported as a CSV. The shape and content of the data weren't easy to work with initially, with many issues stemming from the form allowing free-text responses. Steps were taken to alleviate this. Problems encountered:
Across eight routes, between 200 and 300 attack trajectories were generated from their respective ground truths. The dataset was derived from the Complex Urban Dataset.
Dataset structure visualization:
To simplify the dataset, we aggregated each CSV using mean values, reducing the dataset from over a million entries to 2,546 rows. We combined attack and ground truth data, labeled them, and used XGBoost's feature importance to retain only eight key features:
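The mean aggregation and labeling step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the column names (`speed`, `heading`), the sample values, and the helper `aggregate_csv` are all hypothetical, and the XGBoost feature-importance step is not shown.

```python
import csv
import io
from statistics import mean


def aggregate_csv(text, label):
    """Collapse one trajectory CSV into a single row of column means plus a class label."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Average every column over all rows of the file (assumes all columns are numeric).
    summary = {col: mean(float(r[col]) for r in rows) for col in rows[0]}
    summary["label"] = label
    return summary


# Hypothetical sample files: column names and values are made up for illustration.
ground_truth_csv = "speed,heading\n10.0,90.0\n12.0,92.0\n"
attack_csv = "speed,heading\n10.0,90.0\n30.0,140.0\n"

# Combine both sources into one labeled dataset, one row per original file.
dataset = [
    aggregate_csv(ground_truth_csv, label=0),  # 0 = ground truth
    aggregate_csv(attack_csv, label=1),        # 1 = attack
]

for row in dataset:
    print(row)
```

Each input file becomes exactly one row, which is how a collection of large per-trajectory files can shrink to a few thousand rows in total.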
Final dataset statistical summary:
We tested logistic regression, support vector machines, and XGBoost. Initially, XGBoost yielded unexpectedly high test metrics, leading us to investigate class imbalance.
To address this, we bootstrapped additional ground truth samples. After balancing, model performance improved.
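Bootstrapping a minority class generally means resampling its rows with replacement until the class counts match. The sketch below shows one way to do that with the standard library. The function `bootstrap_balance`, the toy rows, and the fixed seed are assumptions for illustration rather than the project's implementation.

```python
import random
from collections import Counter


def bootstrap_balance(rows, label_key="label", seed=0):
    """Oversample each smaller class with replacement until it matches the largest class."""
    rng = random.Random(seed)  # fixed seed so the resampling is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw the missing rows at random, with replacement, from the same class.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Made-up toy data: 2 ground truth rows (label 0) versus 8 attack rows (label 1).
rows = [{"speed": 11.0, "label": 0}] * 2 + [{"speed": 20.0, "label": 1}] * 8

balanced = bootstrap_balance(rows)
print(Counter(r["label"] for r in balanced))
```

A design choice to consider with this approach: resampling only the training split, after the train/test split, keeps duplicated rows out of the test set so the reported metrics are not inflated.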
Model performance after balancing:
The optimized XGBoost model successfully identified anomalies, with every metric above 0.9, which met our performance goal. The most impactful features in identifying cyberattacks were:
Potential improvements include expanding the dataset with more real-world ground truth trajectories, which would reduce reliance on bootstrapping for class balancing.
This project reinforced my skills in data cleaning, modeling, and handling unstructured datasets. I also gained experience in directing team efforts, designing modeling approaches, and using XGBoost for classification.