Exploring Machine Learning with the Kepler Telescope
By: Luke Rivers (Project Lead), Ashley Lu, Ben Brill, Hyerin Lee
Introduction
We have all wondered at one point or another whether there is life out there in the universe, or if we are alone. The Kepler mission is a step toward answering that question. The Kepler telescope was launched in 2009 [1] with the goal of identifying exoplanets orbiting other stars, some of which could lie in what is referred to as a ‘habitable zone’, where temperatures are similar to Earth’s and a planet could therefore have liquid water and potentially support life.
The Kepler telescope follows a heliocentric orbit, meaning that instead of orbiting Earth it orbits the Sun. Unlike most other astronomical telescopes, Kepler stays focused on one patch of sky for years on end and has a wide field of view of around 10 degrees by 10 degrees, comparable to the amount of night sky covered by your hand at arm’s length. That area contains an estimated 6.5 million stars, of which roughly 1–10% are likely to host planets that orbit in a plane where they can be detected by the telescope [2]. In the nine years of its operation, Kepler identified over 2,600 planets outside our solar system. So how does the telescope identify planets?
- https://www.jpl.nasa.gov/missions/kepler
- https://exoplanets.nasa.gov/resources/1013/overview-of-kepler-mission/
How the Telescope Works
Imagine that you’re watching a movie from a projector. When a person passes in front of the projector, they cast a shadow and block part of the light coming from it. From the audience’s perspective, besides being annoying, the person’s silhouette makes it obvious that part of the projector’s light has been blocked. This is very similar to what the Kepler telescope does: it spots exoplanets orbiting other stars by detecting the subtle drop in the amount of light from a star when a planet passes in front of it, a phenomenon called a transit [1].
The change in the star’s light level can be plotted on a graph, which shows a regular dip corresponding to a transiting planet. This graph is called the light curve, and it tells us a lot about the planet. When the light curve is plotted over a long period of time, we can measure the time between successive transits to determine the orbital period of the planet, which is the time it takes to complete one orbit around the star. We can also measure the transit depth, which is how much the planet dims the star’s light. Transit depth can be used to estimate the size of the planet: the deeper the transit, the larger the planet, since a bigger object blocks more light. The curve also tells us the transit duration, which is how long the planet takes to cross the face of the star, shown by the width of the dip in the light curve. Transit duration gives information about the distance of the planet from the star: a planet orbiting at a greater distance moves more slowly, so its transit lasts longer [2].
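As a rough worked example of the depth-to-size relation: for a central transit, the fractional dip in brightness is approximately the squared ratio of the planet’s radius to the star’s radius. The sketch below uses hypothetical numbers and ignores real-world effects such as limb darkening and grazing transits.

```python
import math

# Hypothetical measurements, not from a real light curve.
transit_depth = 0.0001       # fractional dip in brightness (100 parts per million)
stellar_radius_km = 696_340  # radius of a Sun-like star

# depth ≈ (R_planet / R_star)^2  =>  R_planet = R_star * sqrt(depth)
planet_radius_km = stellar_radius_km * math.sqrt(transit_depth)
print(f"Estimated planet radius: {planet_radius_km:.0f} km")
# ~6,963 km, slightly larger than Earth (6,371 km)
```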
- https://ca.pbslearningmedia.org/resource/nvap-sci-keplerwork/how-does-kepler-work/
- https://lsintspl3.wgbh.org/en-us/lesson/buac18-il-transitexoplanets/
Background on the Dataset
The dataset we are using for this project is highly complex and contains well over 30 features, but to give a basic overview, there are five important components:
- The identification of each object serves as the index for the dataset.
- The label is koi_pdisposition, which tells us whether a given object is a candidate exoplanet or a false positive.
- A subset of the features describes the transit properties of the object (such as orbital period, star-planet distance, etc.).
- A subset of the features describes the Threshold Crossing Event (TCE) of the object (such as transit depth, number of planets in the system, etc.).
- A subset of the features describes the stellar parameters (such as surface temperature, radius, mass, age, etc.).
The correlations of a subset of the features with the label are shown below. These features were determined to be the most important through the feature-importance analysis described in the next section.
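As a minimal sketch of how these correlations could be computed with pandas (the file name is hypothetical, and the column names assume the standard KOI cumulative table from the NASA Exoplanet Archive):

```python
import pandas as pd

# Hypothetical export of the KOI cumulative table from the NASA Exoplanet Archive.
df = pd.read_csv("cumulative.csv", index_col="kepoi_name")

# Encode the label: CANDIDATE -> 1, FALSE POSITIVE -> 0.
df["label"] = (df["koi_pdisposition"] == "CANDIDATE").astype(int)

# Correlation of a few transit and stellar features with the label.
features = ["koi_period", "koi_duration", "koi_depth", "koi_prad", "koi_steff"]
print(df[features + ["label"]].corr()["label"].sort_values())
```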
Decision Tree and Feature Importance
To aid in filtering down the over 50 columns of data present within the dataset, we used the feature importance attribute of the Random Forest Classifier. A random forest consists of some number n of decision trees, each with its own nodes and branches. At every split in every tree, a measure of ‘impurity’ is calculated (for example, the Gini impurity), which is the probability that a randomly chosen sample at that node would be misclassified if it were labeled at random according to the node’s class distribution. A feature’s importance is the average decrease in impurity produced by splits on that feature across all trees, so the features that reduce impurity the most are the most important.
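A minimal sketch of this step with scikit-learn, continuing from the loading sketch above (the hyperparameters are illustrative, not necessarily the ones we used):

```python
from sklearn.ensemble import RandomForestClassifier

# Keep only rows without missing values in the selected features.
X = df[features].dropna()
y = df.loc[X.index, "label"]

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# feature_importances_ holds the mean decrease in impurity per feature,
# averaged over all trees in the forest.
ranked = sorted(zip(X.columns, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```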
Our original top features had relatively high importance scores, but we decided to remove them and rerun the feature-importance analysis, as they could potentially be leaking label information to the classifier. The most important variable originally, koi_score, is essentially the probability that the candidate is an exoplanet. The flag and error variables likewise provided more information about the Kepler pipeline’s own predictions than about the planet’s actual features.
Eliminating those variables produced the feature importances shown above, which we then used to build our own classification models.
KNN
K Nearest Neighbors (KNN) is a model that classifies a data point based on the labels of its k nearest neighbors, for some positive integer k. The algorithm generally performs the following three steps (sketched in code after the list):
- Calculate the distances between the data point you want to classify and the other data points.
- Find the k nearest neighbors based on which data points are closest to the data point you want to classify.
- Determine the label of the data point you want to classify from the most frequent label among those k nearest neighbors [1].
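A minimal NumPy sketch of these three steps (the function name and inputs are illustrative; in practice a library implementation such as scikit-learn’s KNeighborsClassifier does the same thing far more efficiently):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Classify one point by majority vote of its k nearest neighbors."""
    # Step 1: Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```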
We want to classify whether Kepler Objects of Interest are candidate exoplanets or false positives. Although KNN is considered a simple classification model, it can still yield very good results. Using our data, we tested the KNN model with k values from 1 to 20 and found that the maximum accuracy on the test data occurs at k = 13, with a value of 0.9289.
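A sweep like this could be reproduced along the following lines with scikit-learn, continuing from the feature matrix X and labels y above (the scaling and split settings here are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# KNN is distance-based, so features should be on comparable scales.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42)

# Try k from 1 to 20 and record the test accuracy for each.
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # in our experiments this peaked at k = 13
```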
This is a fairly high accuracy: using KNN, we are able to classify false positives and candidates correctly almost 93% of the time.
The confusion matrix illustrates the results of the KNN model in more detail. We can see that for KNN both the false positive rate and the false negative rate are low.
We also checked the performance of the KNN model using K-Fold Cross Validation. This is a technique that shuffles your data randomly and splits it into k groups (a different k from the one in KNN). Each group is used once as a test set to evaluate the performance of your model, while all of the other groups are used as the training set. For each iteration, the accuracy of the model is measured and retained, and the process is complete when all k groups have been used once as a test set [2].
(Figure: K-fold cross validation, image from https://commons.wikimedia.org/wiki/File:K-fold_cross_validation_EN.svg)
K-Fold Cross Validation is useful for obtaining a more reliable measure of model performance. For the KNN model, we obtained an accuracy of 0.9231 after performing K-Fold Cross Validation with k = 5, which shows that the KNN model’s performance is consistently strong.
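A sketch of this check with scikit-learn, again reusing the scaled features from above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation of the k = 13 model on the full scaled dataset.
model = KNeighborsClassifier(n_neighbors=13)
fold_scores = cross_val_score(model, X_scaled, y, cv=5)
print(fold_scores)         # one accuracy per fold
print(fold_scores.mean())  # we obtained roughly 0.9231
```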
- https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
- https://machinelearningmastery.com/k-fold-cross-validation/
Naive Bayes
The Naive Bayes model works by applying Bayes’ Theorem, which allows us to calculate the posterior probability based on the data attributes. The posterior probability, denoted by P(c|x), is the probability of an item being in a class given its attributes [1].
Bayes’ Theorem states that P(c|x) = P(x|c) · P(c) / P(x), where c stands for class and x stands for attributes. P(c|x) is the posterior probability, P(x|c) is the probability of an attribute given a class, P(c) is the prior probability of the class, and P(x) is the probability of the attribute.
For each item in the dataset, the posterior probability is calculated for each class, and the class with the highest posterior probability becomes the outcome of the prediction [1].
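A minimal sketch using scikit-learn’s Gaussian Naive Bayes, which adds the further assumption that each feature is normally distributed within each class (reusing the train/test split from the KNN section):

```python
from sklearn.naive_bayes import GaussianNB

# Fit class priors P(c) and per-class feature distributions P(x|c).
nb = GaussianNB().fit(X_train, y_train)

# predict_proba returns the posterior P(c|x) for each class;
# predict simply picks the class with the highest posterior.
print(nb.predict_proba(X_test[:3]))
print(nb.predict(X_test[:3]))
```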
The result of applying the Naive Bayes model to the dataset is shown in the confusion matrix below.
The overall accuracy of this model was 70.7%, the lowest among the four models we tried (XGBoost, KNN, Naive Bayes, and SVM). From the confusion matrix, we can see that the model produced more false positives than false negatives, meaning there were more cases where what the model predicted to be an exoplanet was not actually an exoplanet.
SVM
SVM stands for Support Vector Machine. It works by fitting a hyperplane that separates the classes in the dataset. Take the two-dimensional example below: the different classes are represented by the different shapes, and the decision boundary, in this case, is a line (in higher dimensions it is called a hyperplane). The distance between the hyperplane and the nearest points from each class (those points are called the support vectors) is measured, and the fit maximizes that distance to produce a large margin.
This method can also be applied to data that is less obviously separable, such as the data shown below. A technique known as the kernel trick creates a new dimension as a function of the existing features (shown by z). The data can then be separated more easily in the higher dimension using the same method described above.
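A minimal sketch with scikit-learn, reusing the earlier split (the kernel and regularization settings here are illustrative):

```python
from sklearn.svm import SVC

# A linear kernel fits a flat hyperplane; the RBF kernel implicitly maps the
# data into a higher-dimensional space where the classes separate more easily.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(svm.score(X_test, y_test))
```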
Applying the SVM model to classify exoplanet candidates yielded an accuracy of around 81.4%. The confusion matrix illustrating the classifications the model made on the test data is shown below.
Conclusion
After analyzing the data using simple statistics, we took a closer look at the feature importances using random forests. Through this analysis, we were able to recognize features that leaked label information and whose inclusion in our models would essentially be ‘cheating’. Once we had chosen the most important features that did not leak label information, we fed them into several machine learning models to compare their performance.
The best performing model was KNN (k nearest neighbors) with an accuracy of around 92% using 13 neighbors. This model was also shown to be quite robust under 5-fold cross-validation, in which its performance was consistently high. The next strongest model was the SVM (support vector machine), with an accuracy of 81%. This type of model works best when data is linearly separable, whereas KNN is more flexible, which is likely why it performed better. Lastly, the Naive Bayes model had the weakest performance, with an accuracy of around 71%. This poor performance is likely because the model’s underlying assumption that the features are independent is strongly violated in this dataset.
Overall, this project has shown that machine learning performance can vary greatly on the same dataset depending on the type of model chosen. This reinforces the importance of understanding your data as well as the models you are employing so that the right fit can be found. With accuracies varying by more than 20 percentage points across models, a small amount of research can go a long way toward improving the effectiveness of your predictions in a binary classification problem such as the Kepler telescope’s identification of exoplanets.