
___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Exploring Support Vector Machines

NOTE: For this example, we will explore the algorithm itself, so we'll skip scaling and also skip a train/test split and instead see how the various parameters change an SVM (the effects are easiest to visualize in classification).

Link to a great paper on SVM

  • A Tutorial on Support Vector Regression by Alex J. Smola and Bernhard Schölkopf

SVM - Classification

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Data

The data shown here simulates a medical study in which mice infected with a virus were given various doses of two medicines and then checked 2 weeks later to see if they were still infected. Given this data, our goal is to create a classification model that predicts (given the two dosage measurements) whether the mouse will still be infected with the virus.

You will notice the groups are very separable; this is on purpose, so we can explore how the various parameters of an SVM model behave.

In [2]:
df = pd.read_csv("../DATA/mouse_viral_study.csv")
In [3]:
df.head()
Out[3]:
Med_1_mL Med_2_mL Virus Present
0 6.508231 8.582531 0
1 4.126116 3.073459 1
2 6.427870 6.369758 0
3 3.672953 4.905215 1
4 1.580321 2.440562 1
In [4]:
df.columns
Out[4]:
Index(['Med_1_mL', 'Med_2_mL', 'Virus Present'], dtype='object')

Classes

In [5]:
sns.scatterplot(x='Med_1_mL',y='Med_2_mL',hue='Virus Present',
                data=df,palette='seismic')
Out[5]:
<AxesSubplot:xlabel='Med_1_mL', ylabel='Med_2_mL'>

Separating Hyperplane

Our goal with SVM is to create the best separating hyperplane. In 2 dimensions, this is simply a line.

In [6]:
sns.scatterplot(x='Med_1_mL',y='Med_2_mL',hue='Virus Present',palette='seismic',data=df)

# We want to somehow automatically create a separating hyperplane ( a line in 2D)

x = np.linspace(0,10,100)
m = -1
b = 11
y = m*x + b
plt.plot(x,y,'k')
Out[6]:
[<matplotlib.lines.Line2D at 0x1dab706f208>]

SVM - Support Vector Machine

In [7]:
from sklearn.svm import SVC # Support Vector Classifier
In [8]:
help(SVC)
Help on class SVC in module sklearn.svm._classes:

class SVC(sklearn.svm._base.BaseSVC)
 |  SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
 |  
 |  C-Support Vector Classification.
 |  
 |  The implementation is based on libsvm. The fit time scales at least
 |  quadratically with the number of samples and may be impractical
 |  beyond tens of thousands of samples. For large datasets
 |  consider using :class:`sklearn.svm.LinearSVC` or
 |  :class:`sklearn.linear_model.SGDClassifier` instead, possibly after a
 |  :class:`sklearn.kernel_approximation.Nystroem` transformer.
 |  
 |  The multiclass support is handled according to a one-vs-one scheme.
 |  
 |  For details on the precise mathematical formulation of the provided
 |  kernel functions and how `gamma`, `coef0` and `degree` affect each
 |  other, see the corresponding section in the narrative documentation:
 |  :ref:`svm_kernels`.
 |  
 |  Read more in the :ref:`User Guide <svm_classification>`.
 |  
 |  Parameters
 |  ----------
 |  C : float, default=1.0
 |      Regularization parameter. The strength of the regularization is
 |      inversely proportional to C. Must be strictly positive. The penalty
 |      is a squared l2 penalty.
 |  
 |  kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
 |      Specifies the kernel type to be used in the algorithm.
 |      It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
 |      a callable.
 |      If none is given, 'rbf' will be used. If a callable is given it is
 |      used to pre-compute the kernel matrix from data matrices; that matrix
 |      should be an array of shape ``(n_samples, n_samples)``.
 |  
 |  degree : int, default=3
 |      Degree of the polynomial kernel function ('poly').
 |      Ignored by all other kernels.
 |  
 |  gamma : {'scale', 'auto'} or float, default='scale'
 |      Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
 |  
 |      - if ``gamma='scale'`` (default) is passed then it uses
 |        1 / (n_features * X.var()) as value of gamma,
 |      - if 'auto', uses 1 / n_features.
 |  
 |      .. versionchanged:: 0.22
 |         The default value of ``gamma`` changed from 'auto' to 'scale'.
 |  
 |  coef0 : float, default=0.0
 |      Independent term in kernel function.
 |      It is only significant in 'poly' and 'sigmoid'.
 |  
 |  shrinking : bool, default=True
 |      Whether to use the shrinking heuristic.
 |      See the :ref:`User Guide <shrinking_svm>`.
 |  
 |  probability : bool, default=False
 |      Whether to enable probability estimates. This must be enabled prior
 |      to calling `fit`, will slow down that method as it internally uses
 |      5-fold cross-validation, and `predict_proba` may be inconsistent with
 |      `predict`. Read more in the :ref:`User Guide <scores_probabilities>`.
 |  
 |  tol : float, default=1e-3
 |      Tolerance for stopping criterion.
 |  
 |  cache_size : float, default=200
 |      Specify the size of the kernel cache (in MB).
 |  
 |  class_weight : dict or 'balanced', default=None
 |      Set the parameter C of class i to class_weight[i]*C for
 |      SVC. If not given, all classes are supposed to have
 |      weight one.
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``
 |  
 |  verbose : bool, default=False
 |      Enable verbose output. Note that this setting takes advantage of a
 |      per-process runtime setting in libsvm that, if enabled, may not work
 |      properly in a multithreaded context.
 |  
 |  max_iter : int, default=-1
 |      Hard limit on iterations within solver, or -1 for no limit.
 |  
 |  decision_function_shape : {'ovo', 'ovr'}, default='ovr'
 |      Whether to return a one-vs-rest ('ovr') decision function of shape
 |      (n_samples, n_classes) as all other classifiers, or the original
 |      one-vs-one ('ovo') decision function of libsvm which has shape
 |      (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one
 |      ('ovo') is always used as multi-class strategy. The parameter is
 |      ignored for binary classification.
 |  
 |      .. versionchanged:: 0.19
 |          decision_function_shape is 'ovr' by default.
 |  
 |      .. versionadded:: 0.17
 |         *decision_function_shape='ovr'* is recommended.
 |  
 |      .. versionchanged:: 0.17
 |         Deprecated *decision_function_shape='ovo' and None*.
 |  
 |  break_ties : bool, default=False
 |      If true, ``decision_function_shape='ovr'``, and number of classes > 2,
 |      :term:`predict` will break ties according to the confidence values of
 |      :term:`decision_function`; otherwise the first class among the tied
 |      classes is returned. Please note that breaking ties comes at a
 |      relatively high computational cost compared to a simple predict.
 |  
 |      .. versionadded:: 0.22
 |  
 |  random_state : int or RandomState instance, default=None
 |      Controls the pseudo random number generation for shuffling the data for
 |      probability estimates. Ignored when `probability` is False.
 |      Pass an int for reproducible output across multiple function calls.
 |      See :term:`Glossary <random_state>`.
 |  
 |  Attributes
 |  ----------
 |  support_ : ndarray of shape (n_SV,)
 |      Indices of support vectors.
 |  
 |  support_vectors_ : ndarray of shape (n_SV, n_features)
 |      Support vectors.
 |  
 |  n_support_ : ndarray of shape (n_class,), dtype=int32
 |      Number of support vectors for each class.
 |  
 |  dual_coef_ : ndarray of shape (n_class-1, n_SV)
 |      Dual coefficients of the support vector in the decision
 |      function (see :ref:`sgd_mathematical_formulation`), multiplied by
 |      their targets.
 |      For multiclass, coefficient for all 1-vs-1 classifiers.
 |      The layout of the coefficients in the multiclass case is somewhat
 |      non-trivial. See the :ref:`multi-class section of the User Guide
 |      <svm_multi_class>` for details.
 |  
 |  coef_ : ndarray of shape (n_class * (n_class-1) / 2, n_features)
 |      Weights assigned to the features (coefficients in the primal
 |      problem). This is only available in the case of a linear kernel.
 |  
 |      `coef_` is a readonly property derived from `dual_coef_` and
 |      `support_vectors_`.
 |  
 |  intercept_ : ndarray of shape (n_class * (n_class-1) / 2,)
 |      Constants in decision function.
 |  
 |  fit_status_ : int
 |      0 if correctly fitted, 1 otherwise (will raise warning)
 |  
 |  classes_ : ndarray of shape (n_classes,)
 |      The classes labels.
 |  
 |  probA_ : ndarray of shape (n_class * (n_class-1) / 2)
 |  probB_ : ndarray of shape (n_class * (n_class-1) / 2)
 |      If `probability=True`, it corresponds to the parameters learned in
 |      Platt scaling to produce probability estimates from decision values.
 |      If `probability=False`, it's an empty array. Platt scaling uses the
 |      logistic function
 |      ``1 / (1 + exp(decision_value * probA_ + probB_))``
 |      where ``probA_`` and ``probB_`` are learned from the dataset [2]_. For
 |      more information on the multiclass case and training procedure see
 |      section 8 of [1]_.
 |  
 |  class_weight_ : ndarray of shape (n_class,)
 |      Multipliers of parameter C for each class.
 |      Computed based on the ``class_weight`` parameter.
 |  
 |  shape_fit_ : tuple of int of shape (n_dimensions_of_X,)
 |      Array dimensions of training vector ``X``.
 |  
 |  Examples
 |  --------
 |  >>> import numpy as np
 |  >>> from sklearn.pipeline import make_pipeline
 |  >>> from sklearn.preprocessing import StandardScaler
 |  >>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
 |  >>> y = np.array([1, 1, 2, 2])
 |  >>> from sklearn.svm import SVC
 |  >>> clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
 |  >>> clf.fit(X, y)
 |  Pipeline(steps=[('standardscaler', StandardScaler()),
 |                  ('svc', SVC(gamma='auto'))])
 |  
 |  >>> print(clf.predict([[-0.8, -1]]))
 |  [1]
 |  
 |  See also
 |  --------
 |  SVR
 |      Support Vector Machine for Regression implemented using libsvm.
 |  
 |  LinearSVC
 |      Scalable Linear Support Vector Machine for classification
 |      implemented using liblinear. Check the See also section of
 |      LinearSVC for more comparison element.
 |  
 |  References
 |  ----------
 |  .. [1] `LIBSVM: A Library for Support Vector Machines
 |      <http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf>`_
 |  
 |  .. [2] `Platt, John (1999). "Probabilistic outputs for support vector
 |      machines and comparison to regularizedlikelihood methods."
 |      <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639>`_
 |  
 |  Method resolution order:
 |      SVC
 |      sklearn.svm._base.BaseSVC
 |      sklearn.base.ClassifierMixin
 |      sklearn.svm._base.BaseLibSVM
 |      sklearn.base.BaseEstimator
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.svm._base.BaseSVC:
 |  
 |  decision_function(self, X)
 |      Evaluates the decision function for the samples in X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like of shape (n_samples, n_features)
 |      
 |      Returns
 |      -------
 |      X : ndarray of shape (n_samples, n_classes * (n_classes-1) / 2)
 |          Returns the decision function of the sample for each class
 |          in the model.
 |          If decision_function_shape='ovr', the shape is (n_samples,
 |          n_classes).
 |      
 |      Notes
 |      -----
 |      If decision_function_shape='ovo', the function values are proportional
 |      to the distance of the samples X to the separating hyperplane. If the
 |      exact distances are required, divide the function values by the norm of
 |      the weight vector (``coef_``). See also `this question
 |      <https://stats.stackexchange.com/questions/14876/
 |      interpreting-distance-from-hyperplane-in-svm>`_ for further details.
 |      If decision_function_shape='ovr', the decision function is a monotonic
 |      transformation of ovo decision function.
 |  
 |  predict(self, X)
 |      Perform classification on samples in X.
 |      
 |      For an one-class model, +1 or -1 is returned.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features) or                 (n_samples_test, n_samples_train)
 |          For kernel="precomputed", the expected shape of X is
 |          (n_samples_test, n_samples_train).
 |      
 |      Returns
 |      -------
 |      y_pred : ndarray of shape (n_samples,)
 |          Class labels for samples in X.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.svm._base.BaseSVC:
 |  
 |  predict_log_proba
 |      Compute log probabilities of possible outcomes for samples in X.
 |      
 |      The model need to have probability information computed at training
 |      time: fit with attribute `probability` set to True.
 |      
 |      Parameters
 |      ----------
 |      X : array-like of shape (n_samples, n_features) or                 (n_samples_test, n_samples_train)
 |          For kernel="precomputed", the expected shape of X is
 |          (n_samples_test, n_samples_train).
 |      
 |      Returns
 |      -------
 |      T : ndarray of shape (n_samples, n_classes)
 |          Returns the log-probabilities of the sample for each class in
 |          the model. The columns correspond to the classes in sorted
 |          order, as they appear in the attribute :term:`classes_`.
 |      
 |      Notes
 |      -----
 |      The probability model is created using cross validation, so
 |      the results can be slightly different than those obtained by
 |      predict. Also, it will produce meaningless results on very small
 |      datasets.
 |  
 |  predict_proba
 |      Compute probabilities of possible outcomes for samples in X.
 |      
 |      The model need to have probability information computed at training
 |      time: fit with attribute `probability` set to True.
 |      
 |      Parameters
 |      ----------
 |      X : array-like of shape (n_samples, n_features)
 |          For kernel="precomputed", the expected shape of X is
 |          [n_samples_test, n_samples_train]
 |      
 |      Returns
 |      -------
 |      T : ndarray of shape (n_samples, n_classes)
 |          Returns the probability of the sample for each class in
 |          the model. The columns correspond to the classes in sorted
 |          order, as they appear in the attribute :term:`classes_`.
 |      
 |      Notes
 |      -----
 |      The probability model is created using cross validation, so
 |      the results can be slightly different than those obtained by
 |      predict. Also, it will produce meaningless results on very small
 |      datasets.
 |  
 |  probA_
 |  
 |  probB_
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Return the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like of shape (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.ClassifierMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.svm._base.BaseLibSVM:
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit the SVM model according to the given training data.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)                 or (n_samples, n_samples)
 |          Training vectors, where n_samples is the number of samples
 |          and n_features is the number of features.
 |          For kernel="precomputed", the expected shape of X is
 |          (n_samples, n_samples).
 |      
 |      y : array-like of shape (n_samples,)
 |          Target values (class labels in classification, real numbers in
 |          regression)
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Per-sample weights. Rescale C per sample. Higher weights
 |          force the classifier to put more emphasis on these points.
 |      
 |      Returns
 |      -------
 |      self : object
 |      
 |      Notes
 |      -----
 |      If X and y are not C-ordered and contiguous arrays of np.float64 and
 |      X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
 |      
 |      If X is a dense array, then the other methods will not support sparse
 |      matrices as input.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.svm._base.BaseLibSVM:
 |  
 |  coef_
 |  
 |  n_support_
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Estimator instance.

NOTE: For this example, we will explore the algorithm, so we'll skip any scaling or even a train/test split for now

In [9]:
y = df['Virus Present']
X = df.drop('Virus Present',axis=1)
In [10]:
model = SVC(kernel='linear', C=1000)
model.fit(X, y)
Out[10]:
SVC(C=1000, kernel='linear')
In [11]:
# This is imported from the supplemental .py file
# https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html
from svm_margin_plot import plot_svm_boundary
In [12]:
plot_svm_boundary(model,X,y)
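
Because the kernel here is linear, the fitted hyperplane can also be read straight off the model. The following is a small optional sketch (not an original notebook cell, but it only uses the model, df, and plotting imports defined above): it rebuilds the boundary line from coef_ and intercept_ and circles the support vectors.

# For a linear kernel, the decision boundary is w1*x1 + w2*x2 + b = 0.
# coef_ holds [w1, w2] and intercept_ holds b for this binary model.
w = model.coef_[0]
b = model.intercept_[0]

x_vals = np.linspace(0, 10, 100)
# Rearranged as x2 = -(w1/w2)*x1 - b/w2 (assumes w2 is nonzero, i.e. the line is not vertical)
boundary = -(w[0] / w[1]) * x_vals - b / w[1]

sns.scatterplot(x='Med_1_mL', y='Med_2_mL', hue='Virus Present',
                data=df, palette='seismic')
plt.plot(x_vals, boundary, 'k')
# Circle the support vectors that define the margin
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
            s=120, facecolors='none', edgecolors='k')

Compare this line with the one we guessed by hand earlier: the SVM places it automatically so that the margin to the nearest points of each class is maximized.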

Hyper Parameters

C

Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.

Note: If you are following along with the equations, specifically the value of C as described in ISLR, keep in mind that C in scikit-learn is *inversely* proportional to that value.
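
To see the effect more broadly, here is a quick optional sketch (reusing the plot_svm_boundary helper imported above) that sweeps several C values; the cell below then looks at one small value in detail.

# Smaller C -> stronger regularization -> wider, softer margin;
# larger C -> tries harder to classify every training point correctly.
for c_value in [0.01, 0.1, 1, 1000]:
    model = SVC(kernel='linear', C=c_value)
    model.fit(X, y)
    plot_svm_boundary(model, X, y)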

In [13]:
model = SVC(kernel='linear', C=0.05)
model.fit(X, y)
Out[13]:
SVC(C=0.05, kernel='linear')
In [14]:
plot_svm_boundary(model,X,y)

Kernel

Choosing a Kernel

rbf - Radial Basis Function

When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
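
As a quick illustrative sketch (again reusing plot_svm_boundary, and not part of the original notebook cells), you can vary gamma for the RBF kernel and watch the boundary go from smooth to tightly wrapped around individual points:

# Larger gamma -> each training point's influence is more local,
# so the boundary hugs the data more tightly (risking overfitting).
for g in [0.01, 1, 'scale']:
    model = SVC(kernel='rbf', C=1, gamma=g)
    model.fit(X, y)
    plot_svm_boundary(model, X, y)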

In [15]:
model = SVC(kernel='rbf', C=1)
model.fit(X, y)
plot_svm_boundary(model,X,y)
In [16]:
model = SVC(kernel='sigmoid')
model.fit(X, y)
plot_svm_boundary(model,X,y)

Degree (poly kernels only)

Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.

In [17]:
model = SVC(kernel='poly', C=1,degree=1)
model.fit(X, y)
plot_svm_boundary(model,X,y)
In [18]:
model = SVC(kernel='poly', C=1,degree=2)
model.fit(X, y)
plot_svm_boundary(model,X,y)

gamma

gamma : {'scale', 'auto'} or float, default='scale' Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.

- if ``gamma='scale'`` (default) is passed then it uses
  1 / (n_features * X.var()) as value of gamma,
- if 'auto', uses 1 / n_features.
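
As a small assumption-checking aside (not from the original notebook, and assuming the feature frame X defined above), the 'scale' value can be computed by hand and passed in explicitly:

# gamma='scale' means 1 / (n_features * X.var()), where the variance is taken
# over the whole feature array (ddof=0, as numpy does by default).
gamma_scale = 1 / (X.shape[1] * X.values.var())
print(gamma_scale)

# Passing this number explicitly should give the same boundary as gamma='scale'
model = SVC(kernel='rbf', C=1, gamma=gamma_scale)
model.fit(X, y)
plot_svm_boundary(model, X, y)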
In [19]:
model = SVC(kernel='rbf', C=1,gamma=0.01)
model.fit(X, y)
plot_svm_boundary(model,X,y)

Keep in mind that for this simple example the classes are easily separated, so almost any variation of the model can reach 100% accuracy, which makes a grid search essentially "useless" here.

In [20]:
from sklearn.model_selection import GridSearchCV
In [21]:
svm = SVC()
param_grid = {'C':[0.01,0.1,1,10],'kernel':['linear','rbf']}
grid = GridSearchCV(svm,param_grid)
In [22]:
# Note again we didn't do a train/test split
grid.fit(X,y)
Out[22]:
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.01, 0.1, 1, 10], 'kernel': ['linear', 'rbf']})
In [23]:
# 100% accuracy (as expected)
grid.best_score_
Out[23]:
1.0
In [24]:
grid.best_params_
Out[24]:
{'C': 0.01, 'kernel': 'linear'}

This is more to review the grid search process; recall that in a real situation, such as your exercise, you will perform a train/test split and report final evaluation metrics on the held-out test set.
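
For reference, here is a hedged sketch of what that fuller workflow might look like on this same data (the split proportions, random_state, and grid values are illustrative choices, not from the original lesson):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Hold out a test set before any tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=101)

# Scale inside a pipeline so the scaler is fit only on the training folds
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.01, 0.1, 1, 10],
              'svc__kernel': ['linear', 'rbf']}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# Final evaluation on the untouched test set
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))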
