Astrophysical Object

The Deep Sky Collective - Carl Björk

Astrophysics Object Classification

A common challenge in astrophysics is identifying what type of object is being observed (everything tends to look the same from millions of light-years away). This project builds four models and then compares their relative performance, benefits, and costs.

Data Collection

Data was queried from the Sloan Digital Sky Survey using SQL. Features were selected to include the class labels as well as the intensity of each object in multiple wavelengths (an astrophysics analog of color).

Data Cleaning

Data was cleaned and inspected so that there were no NaNs, invalid inputs, blanks, or inconsistencies in formatting. Data was then standardized so that it could be used with the classification algorithms.
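The cleaning and standardization steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`u`, `g`, `r`, `i`, `z`, `class`), the sample values, and the helper `clean_and_standardize` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw table: five brightness columns plus a class label.
# The values are made up and include a NaN, an infinity, and messy labels.
raw = pd.DataFrame({
    "u": [19.4, 18.2, np.nan, 20.1, 17.9],
    "g": [18.1, 17.0, 19.3, np.inf, 16.8],
    "r": [17.5, 16.6, 18.9, 19.0, 16.3],
    "i": [17.2, 16.4, 18.7, 18.8, 16.1],
    "z": [17.0, 16.3, 18.6, 18.7, 16.0],
    "class": ["GALAXY", " star", "QSO", "GALAXY", "STAR "],
})
feature_cols = ["u", "g", "r", "i", "z"]


def clean_and_standardize(df, feature_cols, label_col="class"):
    """Drop unusable rows, normalize label formatting, and scale the features."""
    out = df.copy()
    # Remove variations in label formatting (stray whitespace, mixed case).
    out[label_col] = out[label_col].str.strip().str.upper()
    # Treat infinities as missing values, then drop incomplete rows.
    out[feature_cols] = out[feature_cols].replace([np.inf, -np.inf], np.nan)
    out = out.dropna(subset=feature_cols + [label_col]).reset_index(drop=True)
    # Standardize each feature to zero mean and unit variance.
    scaler = StandardScaler()
    out[feature_cols] = scaler.fit_transform(out[feature_cols])
    return out, scaler


clean, scaler = clean_and_standardize(raw, feature_cols)
print(clean)
```

The fitted `scaler` is returned so that the same transformation can be reused on new observations. In a stricter pipeline the scaler would be fit on the training split only.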

Preprocessing

We plot the correlation matrix to see how strongly the variables are related. They are strongly correlated with one another, so linear regression would not be an appropriate algorithm to use (among other reasons).
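A correlation check of this kind might look like the sketch below. The data here is synthetic (five hypothetical bands that share a common brightness component), and the helper `correlated_pairs` and its 0.9 threshold are illustrative choices, not taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey features: every band shares one
# underlying brightness value plus a little independent noise.
rng = np.random.default_rng(0)
base = rng.normal(18.0, 1.0, size=500)
bands = pd.DataFrame({
    band: base + rng.normal(0.0, 0.2, size=500)
    for band in ["u", "g", "r", "i", "z"]
})

# Pearson correlation between every pair of columns.
corr = bands.corr()


def correlated_pairs(corr, threshold=0.9):
    """List each pair of columns whose absolute correlation meets the threshold."""
    cols = list(corr.columns)
    pairs = []
    for idx, a in enumerate(cols):
        for b in cols[idx + 1:]:
            if abs(corr.loc[a, b]) >= threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs


print(corr.round(2))
print(correlated_pairs(corr))
```

The same `corr` table can be passed to a heatmap plotting function to produce the kind of graph described above.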

One thing that is striking about this dataset is that the categories are not evenly balanced. QSOs (Quasi-Stellar Objects, or quasars) make up a very small portion of the data.

Generally, for classification, you want all classes to have approximately the same representation in the dataset. To achieve this, we have three options: rewrite the SQL query to return equal numbers of each class (or drop the QSO class altogether), undersample, or oversample. For this project, we choose to oversample; the results are summarized in the graphs below.

Essentially, what this tells us is that we have oversampled the minority classes until each class has the same representation in the dataset. This prevents biases that favor the majority class. The data is then split into training and test groups, and we are ready to move to the next step.
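The oversample-then-split sequence described above can be sketched as follows. The text does not say which oversampling method was used, so this example assumes plain random oversampling with replacement. The class counts, the `feature` column, and the helper `oversample_to_balance` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: QSOs are deliberately rare.
rng = np.random.default_rng(0)
labels = ["GALAXY"] * 600 + ["STAR"] * 350 + ["QSO"] * 50
df = pd.DataFrame({
    "feature": rng.normal(size=len(labels)),
    "class": labels,
})


def oversample_to_balance(df, label_col="class", random_state=0):
    """Randomly duplicate minority-class rows until every class matches the largest."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        shortfall = target - len(group)
        if shortfall > 0:
            # Keep every original row and add random copies to fill the gap.
            extra = group.sample(n=shortfall, replace=True, random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    balanced = pd.concat(parts)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1.0, random_state=random_state).reset_index(drop=True)


balanced = oversample_to_balance(df)
print(balanced["class"].value_counts())

# Split after oversampling, following the order described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    balanced[["feature"]],
    balanced["class"],
    test_size=0.2,
    stratify=balanced["class"],
    random_state=0,
)
```

Because the split happens after oversampling, copies of the same object can land in both groups. Splitting first and oversampling only the training group is a common alternative.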

Model Construction

For this project, we will train four models, then compare their costs (mainly training time) and benefits (accuracy).

The four models that we are going to look at are CatBoost, Random Forest, k-nearest neighbors (KNN), and a stacking model. We train the four models and record their accuracy as well as the time it took to train them.
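A comparison loop of this kind could be sketched as below. Several things here are assumptions rather than the project's setup: the data is synthetic, scikit-learn's `GradientBoostingClassifier` stands in for CatBoost so the example needs only one library, and the stacking model's composition (Random Forest and KNN feeding a logistic regression) is a guess, since the text does not describe it.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for the galaxy/star/QSO data.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=4, n_redundant=1,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    # Stand-in for CatBoost (assumption): another gradient-boosted tree model.
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    # Assumed stacking layout: two base learners and a logistic regression on top.
    "Stacking": StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    ),
}


def fit_and_time(models, X_train, y_train, X_test, y_test):
    """Train each model, recording wall-clock training time and test accuracy."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        results[name] = {
            "accuracy": model.score(X_test, y_test),
            "train_seconds": elapsed,
        }
    return results


results = fit_and_time(models, X_train, y_train, X_test, y_test)
for name, row in results.items():
    print(f"{name:18s} accuracy={row['accuracy']:.3f} time={row['train_seconds']:.3f}s")
```

Timings from a loop like this depend on hardware and dataset size, so they are best read as relative costs rather than absolute ones.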

From this readout of our models, their accuracy, and their training time, the best model depends on which metric we prioritize. For example, if we were concerned only with accuracy, we could choose the CatBoost model because it scored the best, at the expense of taking the longest to train. If we were instead concerned about scaling to a larger training dataset and lacked the resources for a model with an expensive training process, we might choose the Random Forest classifier, since it trained in only a fraction of a second. Before deciding which model is best, we should look at the confusion matrices, which show how each predicted class compares to the actual class.

(The graphs are CatBoost, Random Forest, KNN, and Stacking, going left to right, top to bottom.)

What these tell us is that certain models are biased towards certain classes, and we can use this to evaluate how well each model performs for a given task. In the original dataset, QSOs were relatively rare. Suppose the goal of the classifier is to catch as many QSOs as possible. In that case, the best-performing algorithm is the Random Forest, because it had the lowest training time as well as the lowest rate of missed QSOs, which we value more than galaxies and stars. It is worth noting that prioritizing QSOs has led us to choose the algorithm with the lowest overall accuracy, but since we are less concerned about stars and galaxies, overall accuracy is less important to us.
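The trade-off between overall accuracy and missed QSOs can be illustrated with a small sketch. The two prediction lists below are invented for the example and do not come from the project's models. The helper `qso_report` is likewise hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["GALAXY", "QSO", "STAR"]

# Made-up ground truth: 5 galaxies, 4 QSOs, 5 stars.
y_true = ["GALAXY"] * 5 + ["QSO"] * 4 + ["STAR"] * 5

# Hypothetical model A: accurate overall, but labels half of the QSOs as galaxies.
y_pred_a = ["GALAXY"] * 5 + ["QSO", "QSO", "GALAXY", "GALAXY"] + ["STAR"] * 5

# Hypothetical model B: catches every QSO, at the cost of more false QSO calls.
y_pred_b = (
    ["GALAXY", "GALAXY", "GALAXY", "QSO", "QSO"]
    + ["QSO"] * 4
    + ["STAR", "STAR", "STAR", "QSO", "STAR"]
)


def qso_report(y_true, y_pred, labels=LABELS):
    """Return the confusion matrix, overall accuracy, and QSO recall."""
    # Rows are true classes, columns are predicted classes, in `labels` order.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    accuracy = float(np.trace(cm) / cm.sum())
    qso = labels.index("QSO")
    # Recall: the fraction of true QSOs that were predicted as QSOs.
    qso_recall = float(cm[qso, qso] / cm[qso].sum())
    return cm, accuracy, qso_recall


cm_a, acc_a, recall_a = qso_report(y_true, y_pred_a)
cm_b, acc_b, recall_b = qso_report(y_true, y_pred_b)
print(cm_a, acc_a, recall_a)
print(cm_b, acc_b, recall_b)
```

In this toy case model B would be preferred when missed QSOs are the main concern, even though model A has the higher overall accuracy.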

For more information, please check out my GitHub or contact me!

Repository

GitHub

Email

brady.kalei@gmail.com