As I began preparing for this project I felt intuitively it would be somewhat ridiculous for me to build neural networks for classifying astrophysical data if I didn’t fully grasp how and why the equations used to calculate my findings actually work.
Let’s Ask Mr. Feynman
“Mathematicians […] are often led astray when ‘studying’ physics because they lose sight of the physics. They say: ‘Look, these differential equations–the Maxwell equations–are all there is to electrodynamics; it is admitted by the physicists that there is nothing which is not contained in the equations. The equations are complicated, but after all they are only mathematical equations and if I understand them mathematically inside out, I will understand the physics inside out.’ Only it doesn’t work that way. Mathematicians who study physics with that point of view–and there have been many of them–usually make little contribution to physics and, in fact, little to mathematics. They fail because the actual physical situations in the real world are so complicated that it is necessary to have a much broader understanding of the equations.”
-Richard Feynman, The Feynman Lectures on Physics: Volume 2, Chapter 2-1: “Differential Calculus of Vector Fields”
Feynman circa 1962 (photographer unknown) courtesy of Ralph Leighton. Source: Feynman’s Tips on Physics
So…do we need physics?
The problem with the dataset I’m using for this project is that it came without any units. How can we be expected to incorporate the laws of physics into any model if we don’t know the units?? I wondered if it was some sort of test - that NASA put this dataset on Kaggle without any units because they wanted to see if a data scientist with some time on their hands (no one), or with an incurable sense of curiosity (most of us) would take it upon themselves (probably just me) to figure out the units themselves by doing some research and matching up the data points with the original Campaign 3 data on the Mikulsky Archive website, then using the AstroPy and AstroQuery libraries to perform phase-folding to identify transiting planets. Nevermind your project is due in one week and you won’t graduate if it’s late.
That’s not very nice.
Then I thought that’s a bizarre task to put someone up to - not very nice of them to take away our precious units. How the heck do we do phase-folding, and how do we filter out junk data points where K2’s thrusters were firing?? After going very deep into Fourier transforms to identify the period through harmonic means (because I’m a musician and that seemed like an excellent rabbit-hole to go down), I thought, well, maybe the real test is to see if we can build a model that filters out all of the non-candidates without having to go to all that trouble of period estimation and phase-folding. Unfortunately (perhaps) I was able to identify all 5 planets in the test set without any physics. I’m still not totally convinced, but I ultimately concluded that even though the model worked, I would need to validate its accuracy by testing against a larger data set and seeing if my model still performed well. I could then compare it to a model that uses the physics, and the unique properties of the telescope, to clean and scrub the data as well as perform normalization properly. This is why there is a STARSKØPE 2 and 3 which you can read about in the posts following this one.
Questions
Some specific questions this project seeks to answer are as follows:
Can a transiting exoplanet be detected strictly by analyzing the raw flux values of a given star?
What is the best approach for pre-processing photometric timeseries data and what are some of the issues we might encounter in choosing how the data is prepared for classification modeling?
How much signal-to-noise ratio is too much? That is, if the classes are highly imbalanced, for instance only a few planets can be confirmed out of thousands of stars, does the imbalance make for an unreliable or inaccurate model?
How do we test and validate that?
copyright: NASA
W∆
The robot who posted the dataset on Kaggle (I assume she’s a robot because her name is Winter Delta or W∆) gives us a few hints on how we could determine the units, and the dimensions, and a lot of other important physics-related information, if we do a little research. The biggest hint is that this dataset is from the K2 space telescope’s Campaign 3 observations in which only 42 confirmed exoplanets are detected in a set of over 5,000 stars. Looking at the dataset on its own (before doing any digging), we are given little information about how long the time period covers, and we know do not know what the time intervals between flux values are. So far, this has not stopped any data scientists from attempting to tackle the classification model without gathering any additional information.
Model
To answer the question, I first set out to build a model for the data as is, “sans-physics”. The baseline model is a convolutional neural network using the Keras API in a sci-kit learn wrapper.
Results
I was able to identify with 99% accuracy the handful of stars (5) in the test dataset that have a confirmed exoplanet in their orbit. This baseline model is mathematically accurate, but it does not “understand physics”. The conclusion we need to make about the model is whether or not this lack of physics embedded in the training process (or even pre-training process) is acceptable or not.
Conclusion
While it is possible to create a 99% accurate machine learning model for detecting exoplanets using the raw flux values, without any sense of the actual time intervals, and with a highly imbalanced data set (imbalanced meaning only a few positive examples in a sea of negatives) - it is unclear that we can “get away with” this in every case. Furthermore, it is unlikely that could feel completely sure that we aren’t missing out on critical information - such as detecting the existence of an earth-like exoplanet transiting a star - if we don’t use our understanding of physics to further de-noise, normalize, and scale the data before training the model (and possibly even embed this into a pre-training phase). As a case in point, if you read any of the space telescope handbooks, you will quickly learn just how complex the instruments that are producing this data are, and that the way their technology works, when and where in the sky they were pointing, as well as what actually happened during their missions, you’d know that should all probably be taken into account in your model! The K2 data in particular, for instance, has a unique issue that every so often its thrusters would fire to adjust/maintain its position in the sky, causing data at multiple points to be completely useless.
Why that matters…
This type of noise cannot be removed without knowing what exact times the thrusters fired, as well as what times each of the observations of the dataset occurred. Even if we do manage to throw the bad data out, we are still stuck with the problem of not having any data for that time period, and once again might miss our potential planet’s threshold crossing event! If we know where and when those missing pieces occur, we could use that to collect our missing data from another telescope like TESS, which has overlapping targets of observation. A model that can combine data from two different space telescopes, and be smart enough to know based on the telescope it came from how to handle the data, would make truly accurate predictions, and much more useful classifications.
What we can do about that…
This is the type of model I will set out to build in my future work. This is what we would call a cyberoptic artificial telescope - one that can aggregate large datasets from multiple missions and give us a more accurate, more detailed picture of the stars and planets than what we have available to us in the limited view of a single picture from a single telescope at a single point in time. This is the vision for STARSKØPE which will come out of this project.
Recommendations
My recommendations are the following:
Use datasets from the MAST website (via API) to incorporate other calculations of the star’s properties as features to be used for classification algorithms. Furthermore, attempt other types of transformations and normalizations on the data before running the model - for instance, apply a Fourier transform.
Combine data from multiple campaigns and perhaps even multiple telescopes (for instance, matching sky coordinates and time intervals between K2, Kepler, and TESS for a batch of stars that have overlapping observations - this would be critical for finding transit periods that are longer than the campaigns of a single telecope’s observation period).
Explore using computer vision on not only the Full Frame images we can collect from telescopes like TESS, but also on spectographs of the flux values themselves. The beauty of machine learning is our ability to rely on the computer to pick up very small nuances in differences that we ourselves cannot see with our own eyes.
Explore using autoencoded machine learning algorithms with Restricted Boltzmann Machines - this type of model has proven to be incredibly effective in the image analysis of handwriting as we’ve seen applied the MNIST dataset - let’s find out if the same is true for images of stars, be they the Full Frame Images or spectographs.
Future Work
To continue this project, I’ll take another approach for detecting exoplanets using computer vision to analyze images of spectographs of this same star flux data set. Please go to the next post starskope-2 to see how I build a convolutional neural network to classify stars using spectrograph images of the flux values to find transiting exoplanets. Following this, I apply the same algorithm to spectrographs of Fourier transformed data.
Additional future work following this project will be to develop my “cyberoptic artificial telescope” as a machine learning driven application that any astrophysicist can use to look at a single or collection of stars and have the model classify them according not only to exoplanet predictions, but also predict what type of star it is, and other key properties that would be of interest for astrophysical science applications.
Project Walkthrough
Background
K2 Campaign 3
Each star’s light frequency makes up a single row of data collected over the course of the campaign (#3), which in this case for K2 campaign 3 was a little over 60 days (campaigns are normally ~80 days but c3 ended early due to data storage capacity issues.
Timeseries Units?
If we crunched the numbers (which I did elsewhere), it’s 29.4 minutes between each flux measurement, also known as the cadence. This matches the information available in the K2 handbook/MAST website/NASA. Knowing the units and time intervals would allow us to scale and normalize the data very methodically. However, since our initial (math-based) model doesn’t ‘care’ about units, the scrubbing will not take any of the physics into account. This is intentional.
De-noising?
This is something we DO want to come back to for comparison with future models that will have the astrophysical properties embedded in their pre-learning process, and in particular the SCRUBBING: remember, this is a timeseries…it’s hard to do any normalizing, scaling, de-noising to a timeseries if we don’t know anything about the time units. And that’s only ONE of the dimensions being completely ignored by our strict mathematical approach. The question is, will it matter?
Initial Inspection
# Check the value counts
display(train['LABEL'].value_counts(),test['LABEL'].value_counts())
Our target column LABEL assigns each star with a 1 or a 2 to designate whether or not there is a confirmed exoplanet that was found in the star’s orbit. This is precisely what we are trying to classify in our model below.
“The Answer to the Life, Meaning, and Everything is 42”
If you don’t get the joke you haven’t read my favorite book. Anyway, there are a total of only 42 stars that are labeled “2”, ie confirmed exoplanet orbiting this star. There are 37 exoplanet host stars in the training set, and only 5 in the test set. Such highly imbalanced classes will be something we need to deal with carefully in our model.
Planet Host vs Non-Host Stars
Since we are setting out to classify stars as being either a planet-host or non-host, it would be useful to compare the data visually and see if we can pick up on any significant differences in the flux values with just our eyeballs. The simplest way to do this is plot the signals of each target class for a couple of stars and look at the scatter plots and a line plots.
Threshold Crossing Event (TCE)
TCE is determined by a significant dip in the flux values, the assumption being something crossed in front of the star blocking its light for some period of time that the telescope has designated as suspect of an orbiting planet! The occurrence of a TCE means that star is flagged as a ‘Target of Interest’ or in K2’s case, ‘Kepler Object of Ineterst’ (KOI). The KOIs for each campaign have to be confirmed by a human, of course, usually an astrophysicist, and that is precisely where machine learning comes in - there are billions and billions of stars, and thus billions of billions of potential data points. “Looking for a needle in a haystack” doesn’t even work as a metaphor for a scale this immense. This is the ultimate challenge for data scientists! Let’s see what this looks like.
# grab first row of observations to create pandas series
# first row is label = 2 which is a confirmed exoplanet host star
# TCE "Threshold Crossing Event"
tce1=train.iloc[0,:]tce2=train.iloc[1,:]# last row is label = 1 (no tce meaning no evidence this star hosts a planet)
no_tce1=train.iloc[-1,:]no_tce2=train.iloc[-2,:]display(tce1.head(),tce2.head(),no_tce1.head(),no_tce2.head())
After doing a little research (mostly by reading the K2 Handbook and visiting the MAST website where NASA houses all of its space telescope data) we learn that the flux values for campaign 3 that are in the Kaggle dataset have been put through a de-noising process. Prior to this particular de-noising process, the flux values would be called SAP Flux however in this case we are dealing with PDC_SAP Flux. At the moment the units may not seem to matter much, since we assume they are consist across all observations. However, as with anything relating to physics, and science for that matter, the units MATTER. All that to say, for now we are at least going to label the axes accurately so that later down the line if we want to compare this dataset to another from the archive, we will know the units! :)
atomic_vector_plotter()
defatomic_vector_plotter(signal,label_col=None,classes=None,class_names=None,figsize=(15,5),y_units=None,x_units=None):"""
Plots scatter and line plots of time series signal values.
**ARGS
signal: pandas series or numpy array
label_col: name of the label column if using labeled pandas series
-use default None for numpy array or unlabeled series.
-this is simply for customizing plot Title to include classification
classes: (optional- req labeled data) tuple if binary, array if multiclass
class_names: tuple or array of strings denoting what the classes mean
figsize: size of the figures (default = (15,5))
******
Ex1: Labeled timeseries passing 1st row of pandas dataframe
> first create the signal:
star_signal_alpha = x_train.iloc[0, :]
> then plot:
star_signals(star_signal_alpha, label_col='LABEL',classes=[1,2],
class_names=['No Planet', 'Planet']), figsize=(15,5))
Ex2: numpy array without any labels
> first create the signal:
>then plot:
star_signals(signal, figsize=(15,5))
######
TODO:
-`signal` should take an array rather than pdseries
-could allow either array or series to be passed, conv to array if series
######
"""# pass None to label_col if unlabeled data, creates generic title
iflabel_colisNone:label=Nonetitle_scatter="Scatterplot of Star Flux Signals"title_line="Line Plot of Star Flux Signals"color='black'# store target column as variable
eliflabel_colisnotNone:label=signal[label_col]# for labeled timeseries
iflabel==1:cn=class_names[0]color='red'eliflabel==2:cn=class_names[1]color='blue'#create appropriate title acc to class_names
title_scatter=f"Scatterplot for Star Flux Signal: {cn}"title_line=f"Line Plot for Star Flux Signal: {cn}"# Set x and y axis labels according to units
# if the units are unknown, we will default to "Flux"
ify_units==None:y_units='Flux'else:y_units=y_units# it is assumed this is a timeseries, default to "time"
ifx_units==None:x_units='Time'else:x_units=x_units# Scatter Plot
plt.figure(figsize=figsize)plt.scatter(pd.Series([iforiinrange(1,len(signal))]),signal[1:],marker=4,color=color)plt.ylabel(y_units)plt.xlabel(x_units)plt.title(title_scatter)plt.show()# Line Plot
plt.figure(figsize=figsize)plt.plot(pd.Series([iforiinrange(1,len(signal))]),signal[1:],color=color)plt.ylabel(y_units)plt.xlabel(x_units)plt.title(title_line)plt.show()
This second star’s flux signal pattern looks very different - are we to assume that each one of those dips is a transit event? Could there be more than one planet is orbiting? Let’s compare these to the NON planet host stars.
defhypersonic_pliers(path_to_train,path_to_test):"""
Using Numpy to extract data into 1-dimensional arrays
separate target classes (y) for training and test data
assumes y (target) is first column in dataframe
#TODO: option to pass in column index loc for `y` if not default (0)
#TODO: option for `y` to already be 0 or 1 (don't subtract 1)
#TODO: allow option for `y` to be categorical (convert via binarizer)
"""importnumpyasnpTrain=np.loadtxt(path_to_train,skiprows=1,delimiter=',')X_train=Train[:,1:]y_train=Train[:,0,np.newaxis]-1.Test=np.loadtxt(path_to_test,skiprows=1,delimiter=',')X_test=Test[:,1:]y_test=Test[:,0,np.newaxis]-1.delTrain,Testprint("X_train: ",X_train.shape)print("y_train: ",y_train.shape)print("X_test: ",X_test.shape)print("y_test: ",y_test.shape)returnX_train,X_test,y_train,y_test
Scale each observation to zero mean and unit variance.
defthermo_fusion_chisel(matrix1,matrix2=None):"""
Scales each array of a matrix to zero mean and unit variance.
returns matrix/matrices of same shape as input but scaled
matrix2 is optional - useful if data was already train-test split
example: matrix1=X_train, matrix2=X_test
"""importnumpyasnpmatrix1=((matrix1-np.mean(matrix1,axis=1).reshape(-1,1))/np.std(matrix1,axis=1).reshape(-1,1))print("Mean: ",matrix1[0].mean())print("Variance: ",matrix1[0].std())ifmatrix2isnotNone:matrix2=((matrix2-np.mean(matrix2,axis=1).reshape(-1,1))/np.std(matrix2,axis=1).reshape(-1,1))print("Mean: ",matrix2[0].mean())print("Variance: ",matrix2[0].std())returnmatrix1,matrix2else:returnmatrix1
# Scale each row to zero mean and unit variance.
X_train,X_test=T.thermo_fusion_chisel(X_train,X_test)
In order to reduce the amount of high frequency noise that is likely to have an adverse effect on the neural network’s learning outcomes, we can pass a uniform 1-D filter on our scaled train and test data then stack the arrays along the second axis. There are other techniques we might want to apply for further de-noising but for now we’ll start with this for the baseline.
print(X_train.shape)print(y_train.shape)
(5087, 3197)
(5087, 1)
defbabel_fish_dispenser(matrix1,matrix2=None,step_size=None,axis=2):"""
Adds an input corresponding to the running average over a set number
of time steps. This helps the neural network to ignore high frequency
noise by passing in a uniform 1-D filter and stacking the arrays.
**ARGS
step_size: integer, # timesteps for 1D filter. defaults to 200
axis: which axis to stack the arrays
ex:
noise_filter(matrix1=X_train, matrix2=X_test, step_size=200)
"""importnumpyasnpfromscipy.ndimage.filtersimportuniform_filter1difstep_sizeisNone:step_size=200# calc input for flux signal rolling avgs
filter1=uniform_filter1d(matrix1,axis=1,size=step_size)# store in array and stack on 2nd axis for each obs of X data
matrix1=np.stack([matrix1,filter1],axis=2)ifmatrix2isnotNone:filter2=uniform_filter1d(matrix2,axis=1,size=step_size)matrix2=np.stack([matrix2,filter2],axis=2)print(matrix1.shape,matrix2.shape)returnmatrix1,matrix2else:print(matrix1.shape)returnmatrix1
# we now have a 2-dimensional array for every star
X_train,X_test=T.babel_fish_dispenser(X_train,X_test,step_size=200,axis=2)
(5087, 3197, 2) (570, 3197, 2)
# array on 2nd axis
print('\nx_train[-1] flux signal rolling avgs\n')# plot arrays
rolling=X_train[1][:,1]print(rolling)plt.plot(rolling)
x_train[-1] flux signal rolling avgs
[-0.10910981 -0.10801068 -0.10926314 ... -0.18190533 -0.19232921
-0.21176035]
</div
## Model
### **Tactical Decisions**
Since I'm building the baseline model from scratch, a few considerations need to be made. While we can run a gridsearch (or randomizedsearchCV) to get the parameters for us, we still need to decide what type of model would be most ideal for this dataset, knowing what we know so far based on the work done so far. From there, we can go with best practices, assess the initial outcomes, and tune the hyperparameters with each iteration.
### **CNN**
The baseline will consist of a one-dimensional convolutional neural network (CNN). This is ideal for working with this particular dataset in which we will pass one row of the timeseries flux values as an array. This is very similar to how we would process image data (and that's strategically useful if we want to develop the model in the future to handle Full-Frame Images from Tess, for instance, or spectographs of the flux frequences, for instance.
### **1-Layer at a time**
We'll be using the Keras API which makes it easy to add in the layers one at a time. Each 1D convolutional layer corresponds to a local filter, and then a pooling layer reduces the data length by approximately a factor 4. At the end, there are two dense layers. Again, this is similar to the approach taken for a typical image classifier.
### **Activation Function**
The RELU activation function is closest to how real neurons actually work and often produces the best results compared to the other options, so we'll at least start with this for the baseline.
### **Batch Normalization**
Finally, the batch normalization layers are what help to speed up convergence.
## `Model 1`
We'll begin creating a baseline model with a lower than usual learning rate and then speed things up and fine-tune parameters for optimization in the next iterations. (The lower learning rate will help to ensure convergence.)
We'll increase the learning rate in Model2 iteration and also tune any other parameters as necessary. The first iteration uses the Adam optimizer, however, SGD is also a good option we could try here.
```python
X_train.shape
```
(5087, 3197, 2)
```python
y_train.shape
```
(5087, 1)
## build_cnn()
```python
from spacekit.builder import Keras
K = builder.Keras()
```
```python
def build_cnn(X_train, X_test, y_train, y_test, kernel_size=None,
activation=None, input_shape=None, strides=None,
optimizer=Adam, learning_rate=None, loss=None, metrics=None):
"""
Builds, compiles and fits a linear CNN using Keras API
"""
import keras
from keras.utils.np_utils import to_categorical
from keras import models, layers, optimizers
from keras.models import Sequential, Model
from keras.layers import Conv1D, MaxPool1D, Dense, Dropout, Flatten, \
BatchNormalization, Input, concatenate, Activation
from keras.optimizers import Adam
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
if input_shape is None:
input_shape = X_train.shape[1:]
if kernel_size is None:
kernel_size=11
if activation is None:
activation='relu'
if strides is None:
strides = 4
if learning_rate is None:
learning_rate = 1e-5
if loss is None:
loss='binary_crossentropy'
if metrics is None:
metrics=['accuracy']
print("BUILDING MODEL...")
model=Sequential()
print("LAYER 1")
model.add(Conv1D(filters=8, kernel_size=kernel_size,
activation=activation, input_shape=input_shape))
model.add(MaxPool1D(strides=strides))
model.add(BatchNormalization())
print("LAYER 2")
model.add(Conv1D(filters=16, kernel_size=kernel_size,
activation=activation))
model.add(MaxPool1D(strides=strides))
model.add(BatchNormalization())
print("LAYER 3")
model.add(Conv1D(filters=32, kernel_size=kernel_size,
activation=activation))
model.add(MaxPool1D(strides=strides))
model.add(BatchNormalization())
print("LAYER 4")
model.add(Conv1D(filters=64, kernel_size=kernel_size,
activation=activation))
model.add(MaxPool1D(strides=strides))
model.add(Flatten())
print("FULL CONNECTION")
model.add(Dropout(0.5))
model.add(Dense(64, activation=activation))
model.add(Dropout(0.25))
model.add(Dense(64, activation=activation))
print("ADDING COST FUNCTION")
model.add(Dense(1, activation='sigmoid'))
##### COMPILE #####
model.compile(optimizer=optimizer(learning_rate), loss=loss,
metrics=metrics)
print("COMPILED")
return model
```
```python
m1 = K.build_cnn(X_train, X_test, y_train, y_test, kernel_size=11,
activation='relu', input_shape=X_train.shape[1:],
strides=4, optimizer=Adam, learning_rate=1e-5,
loss='binary_crossentropy', metrics=['accuracy'])
```
BUILDING MODEL...
LAYER 1
LAYER 2
LAYER 3
LAYER 4
FULL CONNECTION
ADDING COST FUNCTION
COMPILED
## Batch Generator
To correct for the extremely unbalanced dataset, we'll ensure that the network sees 50% of the positive sample over each batch. We will also apply augmentation by rotating each of the samples randomly each time, thus generating new data. This is similar to image classification when we rotate or shift the samples each time.
### fit_cnn()
```python
def fit_cnn(X_train,y_train, X_test, y_test, model, validation_data=None,
verbose=None, epochs=None, steps_per_epoch=None, batch_size=None):
if verbose is None:
verbose=2
if epochs is None:
epochs = 5
if validation_data is None:
validation_data=(X_test, y_test)
if steps_per_epoch is None:
steps_per_epoch = (X_train.shape[1]//batch_size)
if batch_size is None:
batch_size = 32
print("FITTING MODEL...")
def batch_maker(X_train, y_train, batch_size=batch_size):
"""
Gives equal number of positive and negative samples rotating randomly
The output of the generator must be either
- a tuple `(inputs, targets)`
- a tuple `(inputs, targets, sample_weights)`.
This tuple (a single output of the generator) makes a single
batch. Therefore, all arrays in this tuple must have the same
length (equal to the size of this batch). Different batches may have
different sizes.
For example, the last batch of the epoch is commonly smaller than the others,
if the size of the dataset is not divisible by the batch size.
The generator is expected to loop over its data indefinitely.
An epoch finishes when `steps_per_epoch` batches have been seen by the model.
"""
import numpy
import random
# hb: half-batch
hb = batch_size // 2
# Returns a new array of given shape and type, without initializing.
# x_train.shape = (5087, 3197, 2)
xb = np.empty((batch_size, X_train.shape[1], X_train.shape[2]), dtype='float32')
#y_train.shape = (5087, 1)
yb = np.empty((batch_size, y_train.shape[1]), dtype='float32')
pos = np.where(y_train[:,0] == 1.)[0]
neg = np.where(y_train[:,0] == 0.)[0]
# rotating each of the samples randomly
while True:
np.random.shuffle(pos)
np.random.shuffle(neg)
xb[:hb] = X_train[pos[:hb]]
xb[hb:] = X_train[neg[hb:batch_size]]
yb[:hb] = y_train[pos[:hb]]
yb[hb:] = y_train[neg[hb:batch_size]]
for i in range(batch_size):
size = np.random.randint(xb.shape[1])
xb[i] = np.roll(xb[i], size, axis=0)
yield xb, yb
history = model.fit_generator(batch_maker(X_train, y_train, batch_size),
validation_data=validation_data,
verbose=verbose, epochs=epochs,
steps_per_epoch=steps_per_epoch)
print("TRAINING COMPLETE")
model.summary()
return history
```
```python
h1 = K.fit_cnn(X_train,y_train, X_test, y_test, m1,
validation_data=(X_test, y_test), verbose=2, epochs=5,
steps_per_epoch=(X_train.shape[1]//32), batch_size=32)
```
FITTING MODEL...
Epoch 1/5
- 26s - loss: 0.7978 - accuracy: 0.5044 - val_loss: 0.6152 - val_accuracy: 0.9351
Epoch 2/5
- 24s - loss: 0.7549 - accuracy: 0.5117 - val_loss: 0.6112 - val_accuracy: 0.8439
Epoch 3/5
- 26s - loss: 0.7197 - accuracy: 0.5319 - val_loss: 0.6259 - val_accuracy: 0.7386
Epoch 4/5
- 26s - loss: 0.7358 - accuracy: 0.5063 - val_loss: 0.6466 - val_accuracy: 0.6474
Epoch 5/5
- 23s - loss: 0.7300 - accuracy: 0.5142 - val_loss: 0.6549 - val_accuracy: 0.6193
TRAINING COMPLETE
Model: "sequential_1"
_________________________________________________________________
Layer (type) output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 3187, 8) 184
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 797, 8) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 797, 8) 32
_________________________________________________________________
conv1d_2 (Conv1D) (None, 787, 16) 1424
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 197, 16) 0
_________________________________________________________________
batch_normalization_2 (Batch (None, 197, 16) 64
_________________________________________________________________
conv1d_3 (Conv1D) (None, 187, 32) 5664
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 47, 32) 0
_________________________________________________________________
batch_normalization_3 (Batch (None, 47, 32) 128
_________________________________________________________________
conv1d_4 (Conv1D) (None, 37, 64) 22592
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 9, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 576) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 576) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 36928
_________________________________________________________________
dropout_2 (Dropout) (None, 64) 0
_________________________________________________________________
dense_2 (Dense) (None, 64) 4160
_________________________________________________________________
dense_3 (Dense) (None, 1) 65
=================================================================
Total params: 71,241
Trainable params: 71,129
Non-trainable params: 112
_________________________________________________________________
## Evaluate (M1)
Let's assess the model thus far before tuning parameters. We'll create a few helper functions for calculating metrics and analyzing results visually.
## Class Predictions: get_preds()
```python
def get_preds(X,y,model=None,verbose=False):
if model is None:
model=model
# class predictions
y_true = y.flatten()
y_pred = model.predict_classes(X).flatten()
preds = pd.Series(y_pred).value_counts(normalize=False)
if verbose:
print(f"y_pred:\n {preds}")
print("\n")
return y_true, y_pred
```
```python
y_true, y_pred = computer.get_preds(X_test,y_test, model=m1,verbose=True)
```
y_pred:
0 354
1 216
dtype: int64
### False -/+ Rates (Training): fnfp()
```python
def fnfp(X,y,model, training=False):
import numpy as np
y_pred = np.round( model.predict(X) )
pos_idx = y==1
neg_idx = y==0
#tp = np.sum(y_pred[pos_idx]==1)/y_pred.shape[0]
fn = np.sum(y_pred[pos_idx]==0)/y_pred.shape[0]
#tn = np.sum(y_pred[neg_idx]==0)/y_pred.shape[0]
fp = np.sum(y_pred[neg_idx]==1)/y_pred.shape[0]
if training:
print(f"FN Rate (Training): {round(fn*100,4)}%")
print(f"FP Rate (Training): {round(fp*100,4)}%")
else:
print(f"FN Rate (Test): {round(fn*100,4)}%")
print(f"FP Rate (Test): {round(fp*100,4)}%")
```
```python
computer.fnfp(X_train, y_train, m1, training=True)
```
FN Rate (Training): 0.2556%
FP Rate (Training): 39.4339%
### False -/+ Rates (Test)
```python
computer.fnfp(X_test, y_test, m1)
```
FN Rate (Test): 0.5263%
FP Rate (Test): 37.5439%
### Classification Report
Sci-kit learn has a nice built-in method for evaluating our model:
```python
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score
report = metrics.classification_report(y_test,y_pred)
print(report)
```
precision recall f1-score support
0.0 0.99 0.62 0.76 565
1.0 0.01 0.40 0.02 5
accuracy 0.62 570
macro avg 0.50 0.51 0.39 570
weighted avg 0.98 0.62 0.76 570
#### Fowlkes-Mallows
Fowlkes-Mallows is a good metric for imbalanced datasets, along with Jaccard which is similar to F1.
```python
sklearn.metrics.fowlkes_mallows_score(y_true,y_pred)
```
0.7207089012303081
### Interpret Scores
With only 5 epochs, the model performed high in precision. However, because this such an imbalanced dataset, recall and F1 score are more critical metrics and these could definitely be improved. We'll tune some of the hyperparameters, specifically adjusting the learning rate and increasing the number of epochs up to 40.
### History Metrics: keras_history()
The baseline model is not meant to give us optimal results - the real test will be in our final model below. First let's take a look at some of the visuals to understand what the scores really mean. This will help us decide how to proceed in tuning the model appropriately.
```python
def keras_history(history, figsize=(10,4)):
"""
side by side sublots of training val accuracy and loss (left and right respectively)
"""
import matplotlib.pyplot as plt
fig,axes=plt.subplots(ncols=2,figsize=(15,6))
axes = axes.flatten()
ax = axes[0]
ax.plot(history.history['accuracy'])
ax.plot(history.history['val_accuracy'])
ax.set_title('Model Accuracy')
ax.set_ylabel('Accuracy')
ax.set_xlabel('Epoch')
ax.legend(['Train', 'Test'], loc='upper left')
ax = axes[1]
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('Model Loss')
ax.set_ylabel('Loss')
ax.set_xlabel('Epoch')
ax.legend(['Train', 'Test'], loc='upper left')
plt.show()
```
```python
computer.keras_history(h1)
```
</div
With only a few epochs, and a small learning rate, it's obvious that our training parameters have a great deal of room for improvement. This is good - we will definitely need to adjust the learning rate. If that doesn't go far enough in producing desired results, we can also try using a different optimizer such as SGD instead of Adam. For now let's look at a few other key metrics.
## Fusion Matrix
It's like a confusion matrix, without the confusion...
```python
def fusion_matrix(matrix, classes=None, normalize=True, title='Fusion Matrix', cmap='Blues',
print_raw=False):
"""
FUSION MATRIX!
-------------
It's like a confusion matrix...without the confusion.
matrix: can pass in matrix or a tuple (ytrue,ypred) to create on the fly
classes: class names for target variables
"""
from sklearn import metrics
from sklearn.metrics import confusion_matrix #ugh
import itertools
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# make matrix if tuple passed to matrix:
if isinstance(matrix, tuple):
y_true = matrix[0].copy()
y_pred = matrix[1].copy()
if y_true.ndim>1:
y_true = y_true.argmax(axis=1)
if y_pred.ndim>1:
y_pred = y_pred.argmax(axis=1)
fusion = metrics.confusion_matrix(y_true, y_pred)
else:
fusion = matrix
# INTEGER LABELS
if classes is None:
classes=list(range(len(matrix)))
#NORMALIZING
# Check if normalize is set to True
# If so, normalize the raw fusion matrix before visualizing
if normalize:
fusion = fusion.astype('float') / fusion.sum(axis=1)[:, np.newaxis]
thresh = 0.5
fmt='.2f'
else:
fmt='d'
thresh = fusion.max() / 2.
# PLOT
fig, ax = plt.subplots(figsize=(10,10))
plt.imshow(fusion, cmap=cmap, aspect='equal')
# Add title and axis labels
plt.title(title)
plt.ylabel('TRUE')
plt.xlabel('PRED')
# Add appropriate axis scales
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
# Text formatting
fmt = '.2f' if normalize else 'd'
# Add labels to each cell
#thresh = fusion.max() / 2.
# iterate thru matrix and append labels
for i, j in itertools.product(range(fusion.shape[0]), range(fusion.shape[1])):
plt.text(j, i, format(fusion[i, j], fmt),
horizontalalignment='center',
color='white' if fusion[i, j] > thresh else 'black',
size=14, weight='bold')
# Add a legend
plt.colorbar()
plt.show()
return fusion, fig
```
```python
m1_fusion = computer.fusion_matrix(matrix=(y_true,y_pred),
classes=['No Planet','Planet'],
title='M1 Fusion Matrix')
```
The baseline model only managed to correctly identify 2 planets in the test set, while missing the other 3. The model incorrectly classified 215 non-TCEs as planets.
## ROC AUC: roc_plots()
Plot the ROC area under the curve
```python
def roc_plots(X,y,model):
from sklearn import metrics
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score
y_true = y.flatten()
y_hat = model.predict(X)
fpr, tpr, thresholds = roc_curve(y_true, y_hat)
# Threshold Cutoff for predictions
crossover_index = np.min(np.where(1.-fpr <= tpr))
crossover_cutoff = thresholds[crossover_index]
crossover_specificity = 1.-fpr[crossover_index]
fig,axes=plt.subplots(ncols=2, figsize=(15,6))
axes = axes.flatten()
ax=axes[0]
ax.plot(thresholds, 1.-fpr)
ax.plot(thresholds, tpr)
ax.set_title("Crossover at {0:.2f}, Specificity {1:.2f}".format(crossover_cutoff, crossover_specificity))
ax=axes[1]
ax.plot(fpr, tpr)
ax.set_title("ROC area under curve: {0:.2f}".format(roc_auc_score(y_true, y_hat)))
plt.show()
roc = roc_auc_score(y_true,y_hat)
return roc
```
```python
m1_roc = computer.roc_plots(X_test, y_test, m1)
```
# Interpret Results
## Conclusion
Above, we were able to identify with 99% accuracy 5 of 5 stars with an orbiting exoplanet (or exoplanets). The best model (MODEL 2) incorrectly predicted just 2 False Positives, with ZERO false negatives.
## Save Weights (Model 2)
```python
# # %mkdir models
# m2.save_weights('models/k2_cnn1d.h5')
```
# Recommendations
While it is possible to create a near-perfect classification model for detecting exoplanets using the raw flux values of an imbalanced data set, the model would benefit from further validation using additional data from either K2 or another telescope such as TESS. One issue with this model is that it doesn't reveal how it makes the decision on each prediction, an insight that would be extremely useful for astrophysicists and for developing and improving the model itself. A better, more robust and useful model, therefore, would be one which gives such insight without sacrificing accuracy or recall.
My recommendations are the following:
1. Use datasets from the MAST website (via API) to incorporate other calculations of the star's properties as features to be used for classification algorithms. Furthermore, attempt other types of transformations and normalizations on the data before running the model - for instance, apply Fourier transform and phase folding.
2. Combine data from multiple campaigns and perhaps even multiple telescopes (for instance, matching sky coordinates and time intervals between K2, Kepler, and TESS for a batch of stars that have overlapping observations - this would be critical for finding transit periods that are longer than the campaigns of a single telecope's observation period).
3. Explore using computer vision on not only the Full Frame images we can collect from telescopes like TESS, but also on spectographs of the flux values themselves. The beauty of machine learning is our ability to rely on the computer to pick up very small nuances in differences that we ourselves cannot see with our own eyes.
4. Explore using autoencoded machine learning algorithms with Restricted Boltzmann Machines - this type of model has proven to be incredibly effective in the image analysis of handwriting as we've seen applied the MNIST dataset - let's find out if the same is true for images of stars, be they the Full Frame Images or spectographs.
# Future Work
To continue this project, I'll take another approach for detecting exoplanets using computer vision to analyze images of spectographs of this same star flux data set. In part II [starskøpe-2](/datascience/2020/05/06/starskope-2-spectrograph-image-classification.html) I use Restricted Boltzmann Machines on Fourier-transformed spectograph images of the Flux data for K2. These are then stacked on top of each other as layers in a Deep Boltzmann Machine neural network. In part III [starskøpe-3](/datascience/2020/06/01/starskope-3-scraping-mast-api.html) I will apply a similar technique using data from TESS.
For triage/vetting purposes, this model could be useful for scientists. However, I would like to extend the model's capability using a multiclassification algorithm that can tell us not only if there is a transiting body, but how many, as well as additional information about the star's properties. The latter could be done by training another model on stellar parameters, and then stacking the two together into one neural network.