1.9 MiB
Random Forest - Regression¶
Plus: An Additional Analysis of Various Regression Methods!¶
The Data¶
We just got hired by a tunnel boring company which uses X-rays in an attempt to know rock density, ideally this will allow them to switch out boring heads on their equipment before having to mine through the rock!
They have given us some lab test results of signal strength returned in nHz to their sensors for various rock density types tested. You will notice it has almost a sine wave like relationship, where signal strength oscillates based off the density, the researchers are unsure why this is, but
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("../DATA/rock_density_xray.csv")
df.head()
df.columns=['Signal',"Density"]
plt.figure(figsize=(12,8),dpi=200)
sns.scatterplot(x='Signal',y='Density',data=df)
Splitting the Data¶
Let's split the data in order to be able to have a Test set for performance metric evaluation.
X = df['Signal'].values.reshape(-1,1)
y = df['Density']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
Linear Regression¶
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(X_train,y_train)
lr_preds = lr_model.predict(X_test)
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test,lr_preds))
What does the fit look like?
signal_range = np.arange(0,100)
lr_output = lr_model.predict(signal_range.reshape(-1,1))
plt.figure(figsize=(12,8),dpi=200)
sns.scatterplot(x='Signal',y='Density',data=df,color='black')
plt.plot(signal_range,lr_output)
Polynomial Regression¶
Attempting with a Polynomial Regression Model¶
Let's explore why our standard regression approach of a polynomial could be difficult to fit here, keep in mind, we're in a fortunate situation where we can easily visualize results of y vs x.
Function to Help Run Models¶
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(poly_features,y_train)
def run_model(model,X_train,y_train,X_test,y_test):
# Fit Model
model.fit(X_train,y_train)
# Get Metrics
preds = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test,preds))
print(f'RMSE : {rmse}')
# Plot results
signal_range = np.arange(0,100)
output = model.predict(signal_range.reshape(-1,1))
plt.figure(figsize=(12,6),dpi=150)
sns.scatterplot(x='Signal',y='Density',data=df,color='black')
plt.plot(signal_range,output)
run_model(model,X_train,y_train,X_test,y_test)
Pipeline for Poly Orders¶
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(PolynomialFeatures(2),LinearRegression())
run_model(pipe,X_train,y_train,X_test,y_test)
Comparing Various Polynomial Orders¶
pipe = make_pipeline(PolynomialFeatures(10),LinearRegression())
run_model(pipe,X_train,y_train,X_test,y_test)
KNN Regression¶
from sklearn.neighbors import KNeighborsRegressor
preds = {}
k_values = [1,5,10]
for n in k_values:
model = KNeighborsRegressor(n_neighbors=n)
run_model(model,X_train,y_train,X_test,y_test)
Decision Tree Regression¶
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
run_model(model,X_train,y_train,X_test,y_test)
model.get_n_leaves()
Support Vector Regression¶
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
param_grid = {'C':[0.01,0.1,1,5,10,100,1000],'gamma':['auto','scale']}
svr = SVR()
grid = GridSearchCV(svr,param_grid)
run_model(grid,X_train,y_train,X_test,y_test)
grid.best_estimator_
Random Forest Regression¶
from sklearn.ensemble import RandomForestRegressor
# help(RandomForestRegressor)
trees = [10,50,100]
for n in trees:
model = RandomForestRegressor(n_estimators=n)
run_model(model,X_train,y_train,X_test,y_test)
Gradient Boosting¶
We will cover this in more detail in next section.
from sklearn.ensemble import GradientBoostingRegressor
# help(GradientBoostingRegressor)
model = GradientBoostingRegressor()
run_model(model,X_train,y_train,X_test,y_test)
Adaboost¶
from sklearn.ensemble import AdaBoostRegressor
model = GradientBoostingRegressor()
run_model(model,X_train,y_train,X_test,y_test)