This example illustrates the impact of hyperparameter optimization on the performance of a supervised learning model, in this case a random forest. The script uses the RandomizedSearchCV class from the sklearn library to perform a randomized search over the hyperparameters of a Random Forest Regressor, with the goal of finding the combination that gives the best model performance. It starts by loading the breast cancer dataset and extracting two features from the data, then splits the data into a training set and a test set with the train_test_split function. The script defines a Random Forest Regressor model and prints its current parameters with the get_params() method. Next, it defines a grid of candidate hyperparameters: the number of trees in the forest, the number of features to consider at each split, the maximum depth of the trees, the minimum number of samples required to split a node, the minimum number of samples required at each leaf node, and whether to use bootstrap samples when building the trees. These lists are collected into a dictionary called random_grid. The script then uses RandomizedSearchCV to perform a randomized search over this grid with 3-fold cross-validation, 100 different combinations of hyperparameters, and all available cores; the search object is given the base model (the Random Forest Regressor), the parameter grid, and the training data. The best_params_ attribute of the fitted search object is used to print the best parameters found during the search, and a small evaluate() helper compares a baseline forest of 10 trees against the best estimator from the search and prints the relative improvement.
In summary, this script runs a randomized hyperparameter search to find the best configuration for a Random Forest Regressor, with the goal of improving the model's performance on the breast cancer dataset. The search uses 3-fold cross-validation, 100 different combinations of hyperparameters, and all available cores, and the best combination found is printed at the end of the script. This approach can save considerable time compared to trying every possible combination of hyperparameters.
Python code:
#https://jupyter.org/try
#Demo6
#M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University
# Parameter Optimization
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
np.random.seed(5)
breastCancer = datasets.load_breast_cancer()
list(breastCancer.target_names)
#Only two features
X = breastCancer.data[:, 0:2]
y = breastCancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)
X_train[:,0].size
variableNames = breastCancer.feature_names
##Second method...
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state = 42)
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']  # 'auto' works in the older scikit-learn used here; newer releases drop it
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)
# Altogether, there are 2 * 12 * 2 * 3 * 3 * 10 = 4320 settings!
# However, the benefit of a random search is that we are not trying every combination,
# but selecting at random to sample a wide range of values.
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    # Note: test_labels contains zeros, so this MAPE term triggers the divide-by-zero
    # runtime warnings seen in the output; the value is not used below.
    mape = 100 * np.mean(errors / test_labels)
    # Round the regression outputs to 0/1 labels so a classification accuracy can be computed
    accuracy = accuracy_score(test_labels, np.round(predictions, 0).astype(int))
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))  # error is in units of the 0/1 target; the 'degrees' label does not match this dataset
    print('Accuracy = {:0.2f}%.'.format(accuracy))  # printed as a fraction (0 to 1) despite the '%' sign
    return accuracy
base_model = RandomForestRegressor(n_estimators = 10, random_state = 42)
base_model.fit(X_train, y_train)
base_accuracy = evaluate(base_model, X_test, y_test)
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)
print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))
RandomizedSearchCV is a powerful tool for hyperparameter tuning that can save a lot of time compared to a traditional grid search, which would involve trying every possible combination of hyperparameters. Instead, it samples a random subset of the possible combinations of hyperparameters, allowing it to cover a wide range of possible values quickly. The n_iter parameter in the script specifies the number of random combinations to try, and the cv parameter specifies the number of folds for k-fold cross-validation. By default, the randomized search will only use one processor, but the n_jobs parameter can be set to -1 to use all available processors to speed up the search. The best set of hyperparameters found during the search can be accessed using the best_params_ attribute of the fitted randomized search object. These hyperparameters can then be used to train the final model and evaluate its performance on the test set.
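As an illustration of that last step, the following minimal sketch (not part of the original script) rebuilds a Random Forest Regressor from rf_random.best_params_ and evaluates it with the same evaluate() helper; it assumes the variables defined in the script above (rf_random, X_train, y_train, X_test, y_test) are still in scope, and the names best_params and tuned_model are introduced here for illustration only.
Python code (sketch):
# Reuse the tuned settings found by the randomized search.
# Assumes rf_random, X_train, y_train, X_test, y_test from the script above.
best_params = rf_random.best_params_
tuned_model = RandomForestRegressor(**best_params, random_state=42)
tuned_model.fit(X_train, y_train)
tuned_accuracy = evaluate(tuned_model, X_test, y_test)
Note that rf_random.best_estimator_, which the script uses, is already refit on the full training set, so rebuilding the model from best_params_ is simply a more explicit alternative.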
It's important to note that the search process is random, so without a fixed seed the best set of hyperparameters may vary each time the script is run (here random_state=42 keeps the sampling reproducible). Increasing the n_iter parameter and using k-fold cross-validation improves the chances of finding a near-optimal set of hyperparameters. In conclusion, a randomized hyperparameter search is a powerful technique for improving the performance of machine learning models: with the RandomizedSearchCV class from the sklearn library, we can quickly and efficiently find a strong set of hyperparameters for a given model and dataset, which can lead to better model performance and increased accuracy.
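If more compute is available, the search budget can simply be increased; the sketch below (illustrative settings only, not from the original script) raises n_iter and cv while keeping random_state fixed, reusing the random_grid dictionary defined above. The name rf_random_big is introduced here for illustration.
Python code (sketch):
# A larger, still reproducible search budget (values are arbitrary).
rf_random_big = RandomizedSearchCV(estimator=RandomForestRegressor(),
                                   param_distributions=random_grid,
                                   n_iter=300,       # sample more combinations
                                   cv=5,             # more cross-validation folds
                                   random_state=42,  # keep the sampling reproducible
                                   n_jobs=-1)        # use all available cores
rf_random_big.fit(X_train, y_train)
print(rf_random_big.best_params_)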
Another essential aspect of hyperparameter tuning is the choice of evaluation metric. In this script, the evaluation metric is the accuracy score, computed after rounding the regressor's continuous outputs to 0/1 labels. It's good practice to look at several metrics for a more complete picture; for classification problems, precision, recall, F1-score, and ROC-AUC are good examples. Another thing to remember is that the script uses only two features from the dataset, which may not be enough for strong predictions. It's worth considering more features, or feature selection techniques such as univariate selection, recursive feature elimination, or Random Forest feature importance, to pick the most relevant ones; a sketch follows below.
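To make this concrete, the sketch below (not part of the original script) prints a classification report and ROC-AUC for the tuned model, then fits a Random Forest Classifier on all 30 features purely to inspect feature importances; it assumes best_random, X_test, y_test, np, and breastCancer from the script above, and rf_full is a name introduced here for illustration.
Python code (sketch):
from sklearn.metrics import classification_report, roc_auc_score

# Richer evaluation metrics for the tuned model.
pred_scores = best_random.predict(X_test)              # continuous scores in [0, 1]
pred_labels = np.round(pred_scores, 0).astype(int)     # hard 0/1 predictions, as in evaluate()
print(classification_report(y_test, pred_labels))      # precision, recall, F1 per class
print('ROC-AUC:', roc_auc_score(y_test, pred_scores))  # uses the continuous scores

# Feature importances on all 30 features (the script itself keeps only the first two).
rf_full = RandomForestClassifier(n_estimators=200, random_state=42)
rf_full.fit(breastCancer.data, breastCancer.target)
for name, importance in sorted(zip(breastCancer.feature_names, rf_full.feature_importances_),
                               key=lambda pair: pair[1], reverse=True)[:5]:
    print(name, round(importance, 3))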
Additionally, when performing hyperparameter tuning, it's essential to consider the trade-off between model complexity and performance. Increasing the model's complexity (such as the number of trees in the forest or their maximum depth) may improve the fit, but it can also lead to overfitting, which makes the model perform poorly on new, unseen data. Finally, performance can often be improved with ensemble techniques that combine several models: aggregating the predictions of multiple models can lead to better generalization and increased accuracy, as sketched below.
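As one concrete example of ensembling, the sketch below (not part of the original script) combines a Random Forest Classifier and a logistic regression in a soft-voting ensemble on the same two-feature split; the name ensemble is introduced here for illustration, and the chosen estimators and settings are arbitrary.
Python code (sketch):
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft')
ensemble.fit(X_train, y_train)
print('Ensemble accuracy:', accuracy_score(y_test, ensemble.predict(X_test)))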
Another thing to note is that the script fixes the seeds of the random number generators (np.random.seed(5) and random_state=42). This is done to ensure that the results are reproducible: when the script is run multiple times with the same seeds, the same data splits are produced, the same hyperparameter combinations are sampled, and the same model is trained. This matters when comparing different models, or when comparing a model before and after a modification such as adding features or tuning hyperparameters. Also keep in mind that the breast cancer dataset used here is relatively small, with only 569 samples and 30 features, so the results may not generalize to larger and more complex datasets. When working on a new dataset, it's important to evaluate the model on an independent validation set, held out from both training and tuning, to avoid overfitting; one possible split is sketched below.
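A simple way to get such a validation set is to split the data twice; the sketch below (not part of the original script) carves out roughly 60/20/20 train/validation/test portions, with the sizes chosen arbitrarily and the X_train2/X_val/X_test2 names introduced here for illustration.
Python code (sketch):
# First hold out a test set, then split the remainder into train and validation.
X_temp, X_test2, y_temp, y_test2 = train_test_split(X, y, test_size=0.20, random_state=42)
X_train2, X_val, y_train2, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# X_train2 is for fitting, X_val for tuning decisions, X_test2 only for the final report.
print(len(X_train2), len(X_val), len(X_test2))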
Finally, it's worth noting that the script uses only the Random Forest Regressor model. Many other models and algorithms can be applied to classification problems, such as logistic regression, support vector machines, k-nearest neighbors, and neural networks. It's important to evaluate several models and select the one that performs best on the specific dataset, for example with a quick cross-validation comparison like the one sketched below.
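The sketch below (not part of the original script) shows one quick way to do such a comparison with 5-fold cross-validation on the training split; the candidates dictionary and the chosen model settings are illustrative assumptions.
Python code (sketch):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Compare a few classifiers with the same cross-validation protocol.
candidates = {'logistic regression': LogisticRegression(max_iter=1000),
              'SVM': SVC(),
              'k-nearest neighbors': KNeighborsClassifier(),
              'random forest': RandomForestClassifier(n_estimators=200, random_state=42)}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print('{}: {:.3f} (+/- {:.3f})'.format(name, scores.mean(), scores.std()))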
The output of the script above should be close to the following:
- Output:
Parameters currently in use:
{'bootstrap': True,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 'warn',
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
{'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 20.7s
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 2.2min finished
C:\Users\m.rakha\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
DeprecationWarning)
Model Performance
Average Error: 0.1512 degrees.
Accuracy = 0.87%.
Model Performance
Average Error: 0.1635 degrees.
Accuracy = 0.90%.
Improvement of 4.05%.
C:\Users\m.rakha\Anaconda3\lib\site-packages\ipykernel_launcher.py:105: RuntimeWarning: divide by zero encountered in true_divide
C:\Users\m.rakha\Anaconda3\lib\site-packages\ipykernel_launcher.py:105: RuntimeWarning: invalid value encountered in true_divide
C:\Users\m.rakha\Anaconda3\lib\site-packages\ipykernel_launcher.py:105: RuntimeWarning: divide by zero encountered in true_divide