On the breast cancer dataset, the code snippet below applies supervised learning with the random forest classifier. The code is divided into seven main steps. The first step loads the packages used in the rest of the snippet; for example, we use the RandomForest implementation from the sklearn package. The second step loads the dataset for this experiment. We use the publicly available Breast Cancer dataset, which has 569 records and 30 features. The target class of this dataset is a binary value representing the diagnosis result as "Malignant" or "Benign". Next is the feature selection step, which decides which of the 30 available features to use. Feature selection is a large pre-processing topic that is outside the scope of this example. To keep things simple, we choose the first two features by specifying the column range "[:, 0:2]". In more advanced examples, we would generally want to keep the features that perform best and discard the noisy ones (a small sketch of this idea follows the code listing below). Step 4 prepares the dataset split for model validation: we split the data equally using the ready-to-use function "train_test_split". Step 5 uses the training set from Step 4 to train the random forest model; training speed depends on the size of the training set and on model parameters such as the number of trees. In Step 6, we evaluate the predictive power of the trained model on the testing set. Finally, we measure the performance of the model using standard metrics such as precision, recall, and f-measure.
- Python code:
#https://jupyter.org/try
#Demo3
#M. S. Rakha, Ph.D.
#Post-Doctoral - Queen's University
# Supervised Learning - RandomForest Classification
#Step 1: Loading packages
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Random seed so results remain consistent every time you run the script.
np.random.seed(5)
#Step 2: Loading the dataset.
breastCancer = datasets.load_breast_cancer()
list(breastCancer.target_names)  # the two diagnosis classes: ['malignant', 'benign']
#Step 3: Selecting the features to use.
X = breastCancer.data[:, 0:2]  # keep only the first two of the 30 features
y = breastCancer.target        # 0 = malignant, 1 = benign
#Step 4: Splitting the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)
X_train[:, 0].size  # number of records in the training set
variableNames = breastCancer.feature_names  # names of the 30 original features
#Step 5: Fitting the Random Forest model using the training set.
randomForestModel = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
randomForestModel.fit(X_train, y_train);
#Step 6: Testing the trained model using the test dataset.
y_pred = randomForestModel.predict(X_test)
#Step 7: Printing out the accuracy measurements
print(classification_report(y_test, y_pred))
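The feature selection in Step 3 simply keeps the first two columns. As a minimal sketch of the score-based selection mentioned earlier (not part of the original demo; the choice of k=2 and the f_classif score are illustrative assumptions), sklearn's SelectKBest can keep the k highest-scoring features instead:
- Code:
# Score-based feature selection sketch: keep the k features with the highest ANOVA F-scores.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

breastCancer = load_breast_cancer()
selector = SelectKBest(score_func=f_classif, k=2)  # k=2 chosen only for illustration
X_selected = selector.fit_transform(breastCancer.data, breastCancer.target)
print(breastCancer.feature_names[selector.get_support()])  # names of the kept features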
This script uses several machine learning libraries, such as numpy, matplotlib, pandas, and sklearn, to classify breast cancer data with a Random Forest classifier. The breast cancer data is loaded from the sklearn datasets library, and only two features from the dataset are used for the analysis. The script starts with the %matplotlib inline magic command, which tells the Jupyter notebook to display any plots generated by the matplotlib library in the output cells of the notebook. The script then loads the breast cancer dataset from the sklearn datasets library, assigns it to the variable breastCancer, and uses the target_names attribute of the breastCancer object to obtain the names of the target classes.
The script then selects two features from the breast cancer dataset to use for the analysis and assigns them to the variable X. It also assigns the target labels to the variable y. The data is then split into training and testing sets using the train_test_split function from sklearn. Next, the script creates a Random Forest classifier object from the sklearn.ensemble library and fits the model to the training data using the fit() method. It then uses the predict() method to make predictions on the test dataset and assigns the result to the variable y_pred. Finally, the script uses the classification_report function from sklearn.metrics to evaluate the model's performance on the test data and prints out the report, which includes precision, recall, f1-score, and support for each label.
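Step 1 also imports confusion_matrix, although the demo never calls it. As a minimal sketch (assuming it is run after Step 6, so y_test and y_pred already exist), the same predictions can be passed to it to see the raw counts behind the report:
- Code:
# Raw counts of correct and incorrect predictions per class.
# Rows are the true classes (0 = malignant, 1 = benign), columns the predicted ones.
print(confusion_matrix(y_test, y_pred))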
For binary classification problems such as breast cancer classification, where the goal is to predict whether a patient has a malignant or benign tumor, Random Forest can be a good choice. The Random Forest algorithm is known for handling high-dimensional and complex datasets, which is the case with the breast cancer dataset and its 30 features. It also deals well with non-linear decision boundaries, a common characteristic of many datasets, including breast cancer. Additionally, Random Forest is less prone to overfitting than a single decision tree, which makes it a good choice for datasets with a high number of features. However, it is worth noting that even though Random Forest is a good algorithm, it is not the best one for every dataset and every problem. Hence, it is better to compare the performance of Random Forest with other algorithms and select the one that best suits your data and problem, as sketched below.
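As a minimal sketch of such a comparison (not part of the original demo; the competing models and the 5-fold setting are illustrative choices), sklearn's cross_val_score can score several classifiers on the same two features selected in Step 3:
- Code:
# Compare cross-validated accuracy of a few classifiers on the same X and y from Step 3.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0),
    "Decision Tree": DecisionTreeClassifier(max_depth=2, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(name, round(scores.mean(), 3))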
Below are the accuracy measurements as printed out by the "classification_report" function:
- Code:
              precision    recall  f1-score   support

           0       0.89      0.82      0.85        98
           1       0.91      0.95      0.93       187

    accuracy                           0.90       285
   macro avg       0.90      0.88      0.89       285
weighted avg       0.90      0.90      0.90       285
The "0" row is for the "Malignant" class, while the "1" row is for the "Benign" class.