
Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Principal Component Analysis¶

Imports¶

In [5]:

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Data¶

Breast cancer wisconsin (diagnostic) dataset¶

Data Set Characteristics:

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - class:
            - WDBC-Malignant
            - WDBC-Benign

:Summary Statistics:

===================================== ====== ======
                                       Min    Max
===================================== ====== ======
radius (mean):                        6.981  28.11
texture (mean):                       9.71   39.28
perimeter (mean):                     43.79  188.5
area (mean):                          143.5  2501.0
smoothness (mean):                    0.053  0.163
compactness (mean):                   0.019  0.345
concavity (mean):                     0.0    0.427
concave points (mean):                0.0    0.201
symmetry (mean):                      0.106  0.304
fractal dimension (mean):             0.05   0.097
radius (standard error):              0.112  2.873
texture (standard error):             0.36   4.885
perimeter (standard error):           0.757  21.98
area (standard error):                6.802  542.2
smoothness (standard error):          0.002  0.031
compactness (standard error):         0.002  0.135
concavity (standard error):           0.0    0.396
concave points (standard error):      0.0    0.053
symmetry (standard error):            0.008  0.079
fractal dimension (standard error):   0.001  0.03
radius (worst):                       7.93   36.04
texture (worst):                      12.02  49.54
perimeter (worst):                    50.41  251.2
area (worst):                         185.2  4254.0
smoothness (worst):                   0.071  0.223
compactness (worst):                  0.027  1.058
concavity (worst):                    0.0    1.252
concave points (worst):               0.0    0.291
symmetry (worst):                     0.156  0.664
fractal dimension (worst):            0.055  0.208
===================================== ====== ======

:Missing Attribute Values: None

:Class Distribution: 212 - Malignant, 357 - Benign

:Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

:Donor: Nick Street

:Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets. https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

In [6]:

df = pd.read_csv('../DATA/cancer_tumor_data_features.csv')

In [7]:

df.head()

Out[7]:

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 30 columns

PCA with Scikit-Learn¶

Scaling Data¶

In [8]:

from sklearn.preprocessing import StandardScaler

In [9]:

scaler = StandardScaler()

In [10]:

scaled_X = scaler.fit_transform(df)

In [11]:

scaled_X

Out[11]:

array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
         2.75062224,  1.93701461],
       [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
        -0.24388967,  0.28118999],
       [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
         1.152255  ,  0.20139121],
       ...,
       [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
        -1.10454895, -0.31840916],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528],
       [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
        -0.04813821, -0.75120669]])
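To make the scaling step concrete, the sketch below checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation. It assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) can stand in for the CSV file used above; the names `X`, `scaled`, and `manual` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn dataset stands in for the CSV features.
X = load_breast_cancer().data

scaled = StandardScaler().fit_transform(X)

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std().
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))              # same values as the manual formula
print(np.allclose(scaled.mean(axis=0), 0.0))    # every column is centered
print(np.allclose(scaled.std(axis=0), 1.0))     # every column has unit variance
```

Scaling matters here because the raw features span very different ranges (compare `mean area` with `mean smoothness`), and PCA would otherwise be dominated by the largest-valued columns.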

Scikit-Learn Implementation¶

In [12]:

from sklearn.decomposition import PCA

In [13]:

help(PCA)

Help on class PCA in module sklearn.decomposition._pca:

class PCA(sklearn.decomposition._base._BasePCA)
 |  PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
 |  
 |  Principal component analysis (PCA).
 |  
 |  Linear dimensionality reduction using Singular Value Decomposition of the
 |  data to project it to a lower dimensional space. The input data is centered
 |  but not scaled for each feature before applying the SVD.
 |  
 |  It uses the LAPACK implementation of the full SVD or a randomized truncated
 |  SVD by the method of Halko et al. 2009, depending on the shape of the input
 |  data and the number of components to extract.
 |  
 |  It can also use the scipy.sparse.linalg ARPACK implementation of the
 |  truncated SVD.
 |  
 |  Notice that this class does not support sparse input. See
 |  :class:`TruncatedSVD` for an alternative with sparse data.
 |  
 |  Read more in the :ref:`User Guide <PCA>`.
 |  
 |  Parameters
 |  ----------
 |  n_components : int, float, None or str
 |      Number of components to keep.
 |      if n_components is not set all components are kept::
 |  
 |          n_components == min(n_samples, n_features)
 |  
 |      If ``n_components == 'mle'`` and ``svd_solver == 'full'``, Minka's
 |      MLE is used to guess the dimension. Use of ``n_components == 'mle'``
 |      will interpret ``svd_solver == 'auto'`` as ``svd_solver == 'full'``.
 |  
 |      If ``0 < n_components < 1`` and ``svd_solver == 'full'``, select the
 |      number of components such that the amount of variance that needs to be
 |      explained is greater than the percentage specified by n_components.
 |  
 |      If ``svd_solver == 'arpack'``, the number of components must be
 |      strictly less than the minimum of n_features and n_samples.
 |  
 |      Hence, the None case results in::
 |  
 |          n_components == min(n_samples, n_features) - 1
 |  
 |  copy : bool, default=True
 |      If False, data passed to fit are overwritten and running
 |      fit(X).transform(X) will not yield the expected results,
 |      use fit_transform(X) instead.
 |  
 |  whiten : bool, optional (default False)
 |      When True (False by default) the `components_` vectors are multiplied
 |      by the square root of n_samples and then divided by the singular values
 |      to ensure uncorrelated outputs with unit component-wise variances.
 |  
 |      Whitening will remove some information from the transformed signal
 |      (the relative variance scales of the components) but can sometime
 |      improve the predictive accuracy of the downstream estimators by
 |      making their data respect some hard-wired assumptions.
 |  
 |  svd_solver : str {'auto', 'full', 'arpack', 'randomized'}
 |      If auto :
 |          The solver is selected by a default policy based on `X.shape` and
 |          `n_components`: if the input data is larger than 500x500 and the
 |          number of components to extract is lower than 80% of the smallest
 |          dimension of the data, then the more efficient 'randomized'
 |          method is enabled. Otherwise the exact full SVD is computed and
 |          optionally truncated afterwards.
 |      If full :
 |          run exact full SVD calling the standard LAPACK solver via
 |          `scipy.linalg.svd` and select the components by postprocessing
 |      If arpack :
 |          run SVD truncated to n_components calling ARPACK solver via
 |          `scipy.sparse.linalg.svds`. It requires strictly
 |          0 < n_components < min(X.shape)
 |      If randomized :
 |          run randomized SVD by the method of Halko et al.
 |  
 |      .. versionadded:: 0.18.0
 |  
 |  tol : float >= 0, optional (default .0)
 |      Tolerance for singular values computed by svd_solver == 'arpack'.
 |  
 |      .. versionadded:: 0.18.0
 |  
 |  iterated_power : int >= 0, or 'auto', (default 'auto')
 |      Number of iterations for the power method computed by
 |      svd_solver == 'randomized'.
 |  
 |      .. versionadded:: 0.18.0
 |  
 |  random_state : int, RandomState instance, default=None
 |      Used when ``svd_solver`` == 'arpack' or 'randomized'. Pass an int
 |      for reproducible results across multiple function calls.
 |      See :term:`Glossary <random_state>`.
 |  
 |      .. versionadded:: 0.18.0
 |  
 |  Attributes
 |  ----------
 |  components_ : array, shape (n_components, n_features)
 |      Principal axes in feature space, representing the directions of
 |      maximum variance in the data. The components are sorted by
 |      ``explained_variance_``.
 |  
 |  explained_variance_ : array, shape (n_components,)
 |      The amount of variance explained by each of the selected components.
 |  
 |      Equal to n_components largest eigenvalues
 |      of the covariance matrix of X.
 |  
 |      .. versionadded:: 0.18
 |  
 |  explained_variance_ratio_ : array, shape (n_components,)
 |      Percentage of variance explained by each of the selected components.
 |  
 |      If ``n_components`` is not set then all components are stored and the
 |      sum of the ratios is equal to 1.0.
 |  
 |  singular_values_ : array, shape (n_components,)
 |      The singular values corresponding to each of the selected components.
 |      The singular values are equal to the 2-norms of the ``n_components``
 |      variables in the lower-dimensional space.
 |  
 |      .. versionadded:: 0.19
 |  
 |  mean_ : array, shape (n_features,)
 |      Per-feature empirical mean, estimated from the training set.
 |  
 |      Equal to `X.mean(axis=0)`.
 |  
 |  n_components_ : int
 |      The estimated number of components. When n_components is set
 |      to 'mle' or a number between 0 and 1 (with svd_solver == 'full') this
 |      number is estimated from input data. Otherwise it equals the parameter
 |      n_components, or the lesser value of n_features and n_samples
 |      if n_components is None.
 |  
 |  n_features_ : int
 |      Number of features in the training data.
 |  
 |  n_samples_ : int
 |      Number of samples in the training data.
 |  
 |  noise_variance_ : float
 |      The estimated noise covariance following the Probabilistic PCA model
 |      from Tipping and Bishop 1999. See "Pattern Recognition and
 |      Machine Learning" by C. Bishop, 12.2.1 p. 574 or
 |      http://www.miketipping.com/papers/met-mppca.pdf. It is required to
 |      compute the estimated data covariance and score samples.
 |  
 |      Equal to the average of (min(n_features, n_samples) - n_components)
 |      smallest eigenvalues of the covariance matrix of X.
 |  
 |  See Also
 |  --------
 |  KernelPCA : Kernel Principal Component Analysis.
 |  SparsePCA : Sparse Principal Component Analysis.
 |  TruncatedSVD : Dimensionality reduction using truncated SVD.
 |  IncrementalPCA : Incremental Principal Component Analysis.
 |  
 |  References
 |  ----------
 |  For n_components == 'mle', this class uses the method of *Minka, T. P.
 |  "Automatic choice of dimensionality for PCA". In NIPS, pp. 598-604*
 |  
 |  Implements the probabilistic PCA model from:
 |  Tipping, M. E., and Bishop, C. M. (1999). "Probabilistic principal
 |  component analysis". Journal of the Royal Statistical Society:
 |  Series B (Statistical Methodology), 61(3), 611-622.
 |  via the score and score_samples methods.
 |  See http://www.miketipping.com/papers/met-mppca.pdf
 |  
 |  For svd_solver == 'arpack', refer to `scipy.sparse.linalg.svds`.
 |  
 |  For svd_solver == 'randomized', see:
 |  *Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).
 |  "Finding structure with randomness: Probabilistic algorithms for
 |  constructing approximate matrix decompositions".
 |  SIAM review, 53(2), 217-288.* and also
 |  *Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011).
 |  "A randomized algorithm for the decomposition of matrices".
 |  Applied and Computational Harmonic Analysis, 30(1), 47-68.*
 |  
 |  Examples
 |  --------
 |  >>> import numpy as np
 |  >>> from sklearn.decomposition import PCA
 |  >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
 |  >>> pca = PCA(n_components=2)
 |  >>> pca.fit(X)
 |  PCA(n_components=2)
 |  >>> print(pca.explained_variance_ratio_)
 |  [0.9924... 0.0075...]
 |  >>> print(pca.singular_values_)
 |  [6.30061... 0.54980...]
 |  
 |  >>> pca = PCA(n_components=2, svd_solver='full')
 |  >>> pca.fit(X)
 |  PCA(n_components=2, svd_solver='full')
 |  >>> print(pca.explained_variance_ratio_)
 |  [0.9924... 0.00755...]
 |  >>> print(pca.singular_values_)
 |  [6.30061... 0.54980...]
 |  
 |  >>> pca = PCA(n_components=1, svd_solver='arpack')
 |  >>> pca.fit(X)
 |  PCA(n_components=1, svd_solver='arpack')
 |  >>> print(pca.explained_variance_ratio_)
 |  [0.99244...]
 |  >>> print(pca.singular_values_)
 |  [6.30061...]
 |  
 |  Method resolution order:
 |      PCA
 |      sklearn.decomposition._base._BasePCA
 |      sklearn.base.TransformerMixin
 |      sklearn.base.BaseEstimator
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  fit(self, X, y=None)
 |      Fit the model with X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_samples, n_features)
 |          Training data, where n_samples is the number of samples
 |          and n_features is the number of features.
 |      
 |      y : None
 |          Ignored variable.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Returns the instance itself.
 |  
 |  fit_transform(self, X, y=None)
 |      Fit the model with X and apply the dimensionality reduction on X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_samples, n_features)
 |          Training data, where n_samples is the number of samples
 |          and n_features is the number of features.
 |      
 |      y : None
 |          Ignored variable.
 |      
 |      Returns
 |      -------
 |      X_new : array-like, shape (n_samples, n_components)
 |          Transformed values.
 |      
 |      Notes
 |      -----
 |      This method returns a Fortran-ordered array. To convert it to a
 |      C-ordered array, use 'np.ascontiguousarray'.
 |  
 |  score(self, X, y=None)
 |      Return the average log-likelihood of all samples.
 |      
 |      See. "Pattern Recognition and Machine Learning"
 |      by C. Bishop, 12.2.1 p. 574
 |      or http://www.miketipping.com/papers/met-mppca.pdf
 |      
 |      Parameters
 |      ----------
 |      X : array, shape(n_samples, n_features)
 |          The data.
 |      
 |      y : None
 |          Ignored variable.
 |      
 |      Returns
 |      -------
 |      ll : float
 |          Average log-likelihood of the samples under the current model.
 |  
 |  score_samples(self, X)
 |      Return the log-likelihood of each sample.
 |      
 |      See. "Pattern Recognition and Machine Learning"
 |      by C. Bishop, 12.2.1 p. 574
 |      or http://www.miketipping.com/papers/met-mppca.pdf
 |      
 |      Parameters
 |      ----------
 |      X : array, shape(n_samples, n_features)
 |          The data.
 |      
 |      Returns
 |      -------
 |      ll : array, shape (n_samples,)
 |          Log-likelihood of each sample under the current model.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.decomposition._base._BasePCA:
 |  
 |  get_covariance(self)
 |      Compute data covariance with the generative model.
 |      
 |      ``cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)``
 |      where S**2 contains the explained variances, and sigma2 contains the
 |      noise variances.
 |      
 |      Returns
 |      -------
 |      cov : array, shape=(n_features, n_features)
 |          Estimated covariance of data.
 |  
 |  get_precision(self)
 |      Compute data precision matrix with the generative model.
 |      
 |      Equals the inverse of the covariance but computed with
 |      the matrix inversion lemma for efficiency.
 |      
 |      Returns
 |      -------
 |      precision : array, shape=(n_features, n_features)
 |          Estimated precision of data.
 |  
 |  inverse_transform(self, X)
 |      Transform data back to its original space.
 |      
 |      In other words, return an input X_original whose transform would be X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_samples, n_components)
 |          New data, where n_samples is the number of samples
 |          and n_components is the number of components.
 |      
 |      Returns
 |      -------
 |      X_original array-like, shape (n_samples, n_features)
 |      
 |      Notes
 |      -----
 |      If whitening is enabled, inverse_transform will compute the
 |      exact inverse operation, which includes reversing whitening.
 |  
 |  transform(self, X)
 |      Apply dimensionality reduction to X.
 |      
 |      X is projected on the first principal components previously extracted
 |      from a training set.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_samples, n_features)
 |          New data, where n_samples is the number of samples
 |          and n_features is the number of features.
 |      
 |      Returns
 |      -------
 |      X_new : array-like, shape (n_samples, n_components)
 |      
 |      Examples
 |      --------
 |      
 |      >>> import numpy as np
 |      >>> from sklearn.decomposition import IncrementalPCA
 |      >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
 |      >>> ipca = IncrementalPCA(n_components=2, batch_size=3)
 |      >>> ipca.fit(X)
 |      IncrementalPCA(batch_size=3, n_components=2)
 |      >>> ipca.transform(X) # doctest: +SKIP
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.TransformerMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Estimator instance.

In [14]:

pca = PCA(n_components=2)

In [15]:

principal_components = pca.fit_transform(scaled_X)
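As the help text notes, `PCA` centers the data and applies a singular value decomposition. The following is a minimal sketch of that idea using `np.linalg.svd`, compared against scikit-learn's result. It is an illustration under stated assumptions, not scikit-learn's actual implementation: the bundled dataset stands in for the CSV, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn dataset stands in for the CSV features.
X_std = StandardScaler().fit_transform(load_breast_cancer().data)

# Reference result from scikit-learn.
pca_ref = PCA(n_components=2, svd_solver='full')
scores_ref = pca_ref.fit_transform(X_std)

# Manual version: center, decompose, keep the first two right singular vectors.
X_centered = X_std - X_std.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                    # principal axes, one per row
scores = X_centered @ components.T     # projection onto those axes

# The sign of each component is arbitrary, so compare absolute values.
print(np.allclose(np.abs(components), np.abs(pca_ref.components_)))
print(np.allclose(np.abs(scores), np.abs(scores_ref)))

# Explained variance follows from the singular values.
explained_variance = S[:2] ** 2 / (X_std.shape[0] - 1)
print(np.allclose(explained_variance, pca_ref.explained_variance_))
```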

In [16]:

plt.figure(figsize=(8,6))
plt.scatter(principal_components[:,0],principal_components[:,1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

Out[16]:

Text(0, 0.5, 'Second Principal Component')

In [17]:

from sklearn.datasets import load_breast_cancer

In [18]:

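# load_breast_cancer reads a copy bundled with scikit-learn, so no download is needed.
# The targets are assumed to be in the same row order as the CSV features loaded above.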
cancer_dictionary = load_breast_cancer()

In [19]:

cancer_dictionary.keys()

Out[19]:

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [20]:

cancer_dictionary['target']

Out[20]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

In [22]:

plt.figure(figsize=(8,6))
plt.scatter(principal_components[:,0],principal_components[:,1],c=cancer_dictionary['target'])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

Out[22]:

Text(0, 0.5, 'Second Principal Component')
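The colored scatter plot does not say which color is which class. One possible way to label the classes is to plot each one separately and add a legend, as sketched below. This assumes the bundled dataset's `target_names` (where 0 is malignant and 1 is benign); the names `cancer`, `pcs`, and `mask` are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_std = StandardScaler().fit_transform(cancer.data)
pcs = PCA(n_components=2).fit_transform(X_std)

fig, ax = plt.subplots(figsize=(8, 6))

# target_names is ['malignant', 'benign'], so label 0 = malignant, 1 = benign.
for label, name in enumerate(cancer.target_names):
    mask = cancer.target == label
    ax.scatter(pcs[mask, 0], pcs[mask, 1], label=name, alpha=0.7)

ax.set_xlabel('First Principal Component')
ax.set_ylabel('Second Principal Component')
ax.legend()
```

Even though PCA never sees the labels, the two classes separate fairly well along the first component, which is what makes this projection useful for exploration.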

Fitted Model Attributes¶

In [24]:

pca.n_components

Out[24]:

2

In [25]:

pca.components_

Out[25]:

array([[ 0.21890244,  0.10372458,  0.22753729,  0.22099499,  0.14258969,
         0.23928535,  0.25840048,  0.26085376,  0.13816696,  0.06436335,
         0.20597878,  0.01742803,  0.21132592,  0.20286964,  0.01453145,
         0.17039345,  0.15358979,  0.1834174 ,  0.04249842,  0.10256832,
         0.22799663,  0.10446933,  0.23663968,  0.22487053,  0.12795256,
         0.21009588,  0.22876753,  0.25088597,  0.12290456,  0.13178394],
       [-0.23385713, -0.05970609, -0.21518136, -0.23107671,  0.18611302,
         0.15189161,  0.06016536, -0.0347675 ,  0.19034877,  0.36657547,
        -0.10555215,  0.08997968, -0.08945723, -0.15229263,  0.20443045,
         0.2327159 ,  0.19720728,  0.13032156,  0.183848  ,  0.28009203,
        -0.21986638, -0.0454673 , -0.19987843, -0.21935186,  0.17230435,
         0.14359317,  0.09796411, -0.00825724,  0.14188335,  0.27533947]])

In this NumPy array, each row is one principal component: a principal axis in feature space that points in a direction of maximum variance in the data. Each column holds the weight that the corresponding original feature contributes to that component. The components are sorted by explained_variance_, largest first.

We can visualize this relationship with a heatmap:

In [32]:

df_comp = pd.DataFrame(pca.components_,index=['PC1','PC2'],columns=df.columns)

In [33]:

df_comp

Out[33]:

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
PC1	0.218902	0.103725	0.227537	0.220995	0.142590	0.239285	0.258400	0.260854	0.138167	0.064363	...	0.227997	0.104469	0.236640	0.224871	0.127953	0.210096	0.228768	0.250886	0.122905	0.131784
PC2	-0.233857	-0.059706	-0.215181	-0.231077	0.186113	0.151892	0.060165	-0.034768	0.190349	0.366575	...	-0.219866	-0.045467	-0.199878	-0.219352	0.172304	0.143593	0.097964	-0.008257	0.141883	0.275339

2 rows × 30 columns

In [38]:

plt.figure(figsize=(20,3),dpi=150)
sns.heatmap(df_comp,annot=True)

Out[38]:

<AxesSubplot:>
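The same weights can also be inspected numerically. The sketch below checks that the component rows are orthonormal and lists the features with the largest absolute weight on each component. It assumes the bundled dataset in place of the CSV; `pca_2`, `loadings`, and `gram` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_std = StandardScaler().fit_transform(cancer.data)
pca_2 = PCA(n_components=2).fit(X_std)

loadings = pd.DataFrame(pca_2.components_, index=['PC1', 'PC2'],
                        columns=cancer.feature_names)

# The rows are orthonormal: each has unit length and they are perpendicular.
gram = pca_2.components_ @ pca_2.components_.T
print(np.allclose(gram, np.eye(2)))

# Features with the largest absolute weight on each component.
for pc in loadings.index:
    top = loadings.loc[pc].abs().sort_values(ascending=False).head(3)
    print(pc, list(top.index))
```

Sorting by absolute value matters because a strongly negative weight is as influential as a strongly positive one.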

In [47]:

pca.explained_variance_ratio_

Out[47]:

array([0.44272026, 0.18971182])

In [48]:

np.sum(pca.explained_variance_ratio_)

Out[48]:

0.6324320765155944
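A sum of roughly 0.63 means the two components retain about 63% of the total variance. One way to see what that implies is to reconstruct the data from the two components and measure what is lost. The sketch below does this with `inverse_transform`, assuming the bundled dataset in place of the CSV; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

pca_2 = PCA(n_components=2)
scores = pca_2.fit_transform(X_std)

# Map the 2-D scores back to the 30-D feature space.
X_hat = pca_2.inverse_transform(scores)

# Fraction of total squared deviation that the reconstruction fails to capture.
lost = np.sum((X_std - X_hat) ** 2) / np.sum((X_std - X_std.mean(axis=0)) ** 2)
retained = pca_2.explained_variance_ratio_.sum()

print(retained, lost)
print(np.isclose(lost, 1.0 - retained))
```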

In [49]:

pca_30 = PCA(n_components=30)
pca_30.fit(scaled_X)

Out[49]:

PCA(n_components=30)

In [50]:

pca_30.explained_variance_ratio_

Out[50]:

array([4.42720256e-01, 1.89711820e-01, 9.39316326e-02, 6.60213492e-02,
       5.49576849e-02, 4.02452204e-02, 2.25073371e-02, 1.58872380e-02,
       1.38964937e-02, 1.16897819e-02, 9.79718988e-03, 8.70537901e-03,
       8.04524987e-03, 5.23365745e-03, 3.13783217e-03, 2.66209337e-03,
       1.97996793e-03, 1.75395945e-03, 1.64925306e-03, 1.03864675e-03,
       9.99096464e-04, 9.14646751e-04, 8.11361259e-04, 6.01833567e-04,
       5.16042379e-04, 2.72587995e-04, 2.30015463e-04, 5.29779290e-05,
       2.49601032e-05, 4.43482743e-06])

In [51]:

np.sum(pca_30.explained_variance_ratio_)

Out[51]:

1.0

In [57]:

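explained_variance = []

# Fit one PCA per component count and record the total variance ratio it retains.
# A separate name (pca_n) keeps the 2-component `pca` fitted earlier intact.
for n in range(1,30):
    pca_n = PCA(n_components=n)
    pca_n.fit(scaled_X)
    
    explained_variance.append(np.sum(pca_n.explained_variance_ratio_))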

In [60]:

plt.plot(range(1,30),explained_variance)
plt.xlabel("Number of Components")
plt.ylabel("Variance Explained");
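Refitting once per component count is not strictly necessary. With an exact solver, the curve can be read from a single full fit by taking the cumulative sum of `explained_variance_ratio_`. The sketch below also shows the float form of `n_components` described in the help text. It assumes the bundled dataset in place of the CSV, and the 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

# One fit with all components, then accumulate the ratios.
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

# Passing a float lets scikit-learn pick the component count itself.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_std)

print(n_needed, pca_95.n_components_)
```

Plotting `cumulative` against `range(1, 31)` gives the same kind of curve as the loop, with the 30-component endpoint included.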

