Tue Oct 29, 2019 3:22 pm
#https://jupyter.org/try
#Demo7 - part2
#M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University
# Supervised Learning - Random Forest
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Train a random-forest classifier on the first 10 features (the "mean"
# measurements) of the breast-cancer dataset, print a classification
# report, and rank/plot the per-feature importances.

np.random.seed(5)  # reproducible forest construction

breastCancer = datasets.load_breast_cancer()
print(list(breastCancer.target_names))  # ['malignant', 'benign']

# Only the first ten columns: the per-image "mean" measurements.
X = breastCancer.data[:, 0:10]
y = breastCancer.target

# 50/50 train/test split; fixed random_state so the report is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=42)

variableNames = breastCancer.feature_names  # fixed typo: was "varriableNames"

# A shallow forest (max_depth=2) keeps the model simple and fast to fit.
randomForestModel = RandomForestClassifier(
    n_estimators=100, max_depth=2, random_state=0)
randomForestModel.fit(X_train, y_train)

# Evaluate on the held-out half.
y_pred = randomForestModel.predict(X_test)
print(classification_report(y_test, y_pred))

# Mean-decrease-in-impurity importances; the std across the individual
# trees supplies the error bars for the plot below.
importances = randomForestModel.feature_importances_
std = np.std(
    [tree.feature_importances_ for tree in randomForestModel.estimators_],
    axis=0)
indices = np.argsort(importances)[::-1]  # most important feature first

# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

print(variableNames)
precision recall f1-score support
0 0.92 0.85 0.88 98
1 0.92 0.96 0.94 187
accuracy 0.92 285
macro avg 0.92 0.90 0.91 285
weighted avg 0.92 0.92 0.92 285
Feature ranking:
1. feature 7 (0.327613)
2. feature 6 (0.197932)
3. feature 2 (0.187159)
4. feature 0 (0.104715)
5. feature 3 (0.102147)
6. feature 5 (0.039644)
7. feature 1 (0.026285)
8. feature 9 (0.008671)
9. feature 4 (0.005309)
10. feature 8 (0.000525)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
Codemiles.com is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to Amazon.com
Powered by phpBB © phpBB Group.