
___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Random Forest - Regression

Plus: An Additional Analysis of Various Regression Methods!

The Data

We just got hired by a tunnel boring company that uses X-rays to estimate rock density. Ideally, this will allow them to switch out the boring heads on their equipment before having to mine through the rock!

They have given us some lab test results: signal strength returned to their sensors, in nHz, for various tested rock densities. You will notice an almost sine-wave-like relationship, where signal strength oscillates based on the density. The researchers are unsure why this happens, but our job is simply to model the relationship so density can be predicted from signal strength.

In [178]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [179]:
df = pd.read_csv("../DATA/rock_density_xray.csv")
In [180]:
df.head()
Out[180]:
   Rebound Signal Strength nHz  Rock Density kg/m3
0                    72.945124            2.456548
1                    14.229877            2.601719
2                    36.597334            1.967004
3                     9.578899            2.300439
4                    21.765897            2.452374
In [181]:
df.columns = ['Signal','Density']
In [182]:
plt.figure(figsize=(12,8),dpi=200)
sns.scatterplot(x='Signal',y='Density',data=df)
Out[182]:
<AxesSubplot:xlabel='Signal', ylabel='Density'>


Splitting the Data

Let's split the data so we have a test set for performance metric evaluation.

In [198]:
X = df['Signal'].values.reshape(-1,1)  # reshape to 2D: sklearn expects (n_samples, n_features)
y = df['Density']
In [199]:
from sklearn.model_selection import train_test_split
In [200]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)

Linear Regression

In [201]:
from sklearn.linear_model import LinearRegression
In [202]:
lr_model = LinearRegression()
In [203]:
lr_model.fit(X_train,y_train)
Out[203]:
LinearRegression()
In [204]:
lr_preds = lr_model.predict(X_test)
In [205]:
from sklearn.metrics import mean_squared_error
In [206]:
np.sqrt(mean_squared_error(y_test,lr_preds))
Out[206]:
0.2570051996584629
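
For comparison, we could also report the mean absolute error, which stays in the original Density units (a quick sketch; this metric wasn't part of the original run):

In [ ]:
from sklearn.metrics import mean_absolute_error

# MAE is in kg/m3 and is less sensitive to large errors than RMSE
mean_absolute_error(y_test,lr_preds)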

What does the fit look like?

In [208]:
signal_range = np.arange(0,100)
In [210]:
lr_output = lr_model.predict(signal_range.reshape(-1,1))
In [214]:
plt.figure(figsize=(12,8),dpi=200)
sns.scatterplot(x='Signal',y='Density',data=df,color='black')
plt.plot(signal_range,lr_output)
Out[214]:
[<matplotlib.lines.Line2D at 0x216f1c9c490>]

Polynomial Regression

Attempting a Polynomial Regression Model

Let's explore why a standard polynomial regression approach could be difficult to fit here. Keep in mind, we're in the fortunate situation of being able to easily visualize y versus x.

Function to Help Run Models

In [223]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
In [266]:
def run_model(model,X_train,y_train,X_test,y_test):
    
    # Fit Model
    model.fit(X_train,y_train)
    
    # Get Metrics
    
    preds = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test,preds))
    print(f'RMSE : {rmse}')
    
    # Plot results
    signal_range = np.arange(0,100)
    output = model.predict(signal_range.reshape(-1,1))
    
    
    plt.figure(figsize=(12,6),dpi=150)
    sns.scatterplot(x='Signal',y='Density',data=df,color='black')
    plt.plot(signal_range,output)
In [267]:
run_model(model,X_train,y_train,X_test,y_test)
RMSE : 0.2570051996584629

Pipeline for Poly Orders

In [268]:
from sklearn.pipeline import make_pipeline
In [269]:
from sklearn.preprocessing import PolynomialFeatures
In [270]:
pipe = make_pipeline(PolynomialFeatures(2),LinearRegression())
In [271]:
run_model(pipe,X_train,y_train,X_test,y_test)
RMSE : 0.2817309563725596

Comparing Various Polynomial Orders

In [272]:
pipe = make_pipeline(PolynomialFeatures(10),LinearRegression())
run_model(pipe,X_train,y_train,X_test,y_test)
RMSE : 0.1417947898442399
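
To actually compare several orders side by side, we can loop over degrees (a minimal sketch; the degree list here is illustrative, not from the original run):

In [ ]:
# Fit a pipeline per polynomial degree and report its test RMSE
for d in [2,4,6,8,10]:
    pipe = make_pipeline(PolynomialFeatures(d),LinearRegression())
    pipe.fit(X_train,y_train)
    rmse = np.sqrt(mean_squared_error(y_test,pipe.predict(X_test)))
    print(f'Degree {d} --> RMSE {rmse}')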

KNN Regression

In [273]:
from sklearn.neighbors import KNeighborsRegressor
In [274]:
k_values = [1,5,10]
for n in k_values:
    model = KNeighborsRegressor(n_neighbors=n)
    run_model(model,X_train,y_train,X_test,y_test)
RMSE : 0.15234870286353372
RMSE : 0.13730685016923655
RMSE : 0.13277855732740926
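
Under the hood, KNeighborsRegressor with the default uniform weights simply averages the Density of the k closest training points. A hand-rolled sketch for a single hypothetical query (the signal value 40 is made up for illustration):

In [ ]:
# Hypothetical query point: a signal reading of 40 nHz
query = np.array([[40.0]])

# With a single feature, Euclidean distance reduces to an absolute difference
dists = np.abs(X_train - query).ravel()

# Average the Density of the 10 nearest training points (matching k=10 above)
nearest = np.argsort(dists)[:10]
y_train.iloc[nearest].mean()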

Decision Tree Regression

In [275]:
from sklearn.tree import DecisionTreeRegressor
In [276]:
model = DecisionTreeRegressor()

run_model(model,X_train,y_train,X_test,y_test)
RMSE : 0.15234870286353372
In [277]:
model.get_n_leaves()
Out[277]:
270
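
A fully grown tree with 270 leaves has essentially memorized the training set. A quick sketch of a depth-limited alternative (max_depth=5 is an illustrative choice, not a tuned value):

In [ ]:
# Restrict depth to force the tree to generalize instead of memorizing
pruned_model = DecisionTreeRegressor(max_depth=5)
run_model(pruned_model,X_train,y_train,X_test,y_test)
pruned_model.get_n_leaves()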

Support Vector Regression

In [282]:
from sklearn.svm import SVR
In [283]:
from sklearn.model_selection import GridSearchCV
In [284]:
param_grid = {'C':[0.01,0.1,1,5,10,100,1000],'gamma':['auto','scale']}
svr = SVR()
In [285]:
grid = GridSearchCV(svr,param_grid)
In [286]:
run_model(grid,X_train,y_train,X_test,y_test)
RMSE : 0.12634668775105407
In [287]:
grid.best_estimator_
Out[287]:
SVR(C=1000)
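
The winning C sits at the top edge of our grid, so it may be worth widening the search around it (a sketch; these values are illustrative, and epsilon is another SVR knob worth tuning):

In [ ]:
# Re-run the search with larger C values and an epsilon sweep
param_grid = {'C':[500,1000,2000,5000],'epsilon':[0.01,0.1,0.5],'gamma':['auto','scale']}
grid = GridSearchCV(SVR(),param_grid)
run_model(grid,X_train,y_train,X_test,y_test)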

Random Forest Regression

In [288]:
from sklearn.ensemble import RandomForestRegressor
In [289]:
# help(RandomForestRegressor)
In [290]:
trees = [10,50,100]
for n in trees:
    
    model = RandomForestRegressor(n_estimators=n)
    
    run_model(model,X_train,y_train,X_test,y_test)
RMSE : 0.1417613358931285
RMSE : 0.133281449397454
RMSE : 0.13699094997283662
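
Note the RMSE values shift slightly from run to run because each forest bootstraps its own samples. Fixing random_state makes a comparison repeatable (a sketch; the seed value is arbitrary):

In [ ]:
# Same forest every run, so RMSE differences reflect n_estimators, not random noise
model = RandomForestRegressor(n_estimators=50,random_state=101)
run_model(model,X_train,y_train,X_test,y_test)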

Gradient Boosting

We will cover this in more detail in the next section.

In [291]:
from sklearn.ensemble import GradientBoostingRegressor
In [292]:
# help(GradientBoostingRegressor)
In [293]:
model = GradientBoostingRegressor()

run_model(model,X_train,y_train,X_test,y_test)
RMSE : 0.13294148649584667
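
Gradient boosting trades off learning_rate against n_estimators: a smaller learning rate usually needs more trees. A hedged tuning sketch (these values are illustrative, not tuned):

In [ ]:
# Smaller learning rate, more estimators than the defaults (0.1 and 100)
model = GradientBoostingRegressor(learning_rate=0.05,n_estimators=200)
run_model(model,X_train,y_train,X_test,y_test)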

Adaboost

In [294]:
from sklearn.ensemble import AdaBoostRegressor
In [295]:
model = AdaBoostRegressor()

run_model(model,X_train,y_train,X_test,y_test)
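
AdaBoostRegressor also exposes a loss parameter ('linear', 'square', or 'exponential') that controls how prediction errors are weighted when updating sample weights each round. A quick comparison sketch (the loop itself is illustrative):

In [ ]:
# Try each of the three built-in loss options
for loss in ['linear','square','exponential']:
    model = AdaBoostRegressor(loss=loss)
    run_model(model,X_train,y_train,X_test,y_test)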
