Machine Learning. Hyperparameter Tuning.

In machine learning, a suitable model must first be chosen to fit a particular dataset. However, choosing a suitable model is not the end of the story. Each machine learning model has what are known as hyperparameters: parameters whose values control the learning process and are typically set by the user. In this post, we’ll look at how the choice of these parameters affects the result of a particular machine learning model.

For this example, the Concrete Compressive Strength Data Set from the UCI Machine Learning Repository will be used.

Reference
https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
(Accessed on 10 Feb 2020)
“Reuse of this database is unlimited with retention of copyright notice for Prof. I-Cheng Yeh and the following published paper: I-Cheng Yeh, “Modeling of strength of high performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).”

This data set contains the following 8 feature variables:
– Cement (component 1)(kg in a m^3 mixture)
– Blast Furnace Slag (component 2)(kg in a m^3 mixture)
– Fly Ash (component 3)(kg in a m^3 mixture)
– Water (component 4)(kg in a m^3 mixture)
– Superplasticizer (component 5)(kg in a m^3 mixture)
– Coarse Aggregate (component 6)(kg in a m^3 mixture)
– Fine Aggregate (component 7)(kg in a m^3 mixture)
– Age (day)


And the following target variable:
– Concrete compressive strength(MPa, megapascals)

A quick glance at the .csv file is as follows:

A common linear regression model will first be considered with the block of code below. The code shown does not start from Line 1 or have continuous line numbering, as in previous posts, because some comments/notes in the code have been omitted for simplicity’s sake.

Line 35 imports the Pandas module for data handling. Line 37 reads the .csv file downloaded from the repository as a Pandas dataframe. Line 38 returns the size of the data as (1030 rows, 9 columns). Line 39 returns False, meaning there are no NaN values in the data; a value of zero is different from a missing NaN value.
The data is then split into X (feature variables) and y (target variable) in Lines 43 and 44.
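A minimal sketch of this loading-and-splitting step is below. Since the downloaded file is not bundled with this post, a small synthetic DataFrame with the same 1030 × 9 layout stands in for the real `pd.read_csv` call (the file name and the random values are assumptions, not the actual data):

```python
import numpy as np
import pandas as pd

# Stand-in for: df = pd.read_csv("Concrete_Data.csv")
# Synthetic frame with the same 8-feature + 1-target layout as the real data.
rng = np.random.default_rng(0)
cols = ["Cement", "Blast Furnace Slag", "Fly Ash", "Water",
        "Superplasticizer", "Coarse Aggregate", "Fine Aggregate", "Age",
        "Concrete compressive strength"]
df = pd.DataFrame(rng.random((1030, 9)), columns=cols)

print(df.shape)                   # (1030, 9), matching the real data set
print(df.isnull().values.any())   # False -> no missing NaN values

# Split into feature variables X and target variable y
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
```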

A quick data visualisation could be achieved using the matrix scatterplot function within pandas to get a sense of the data:

pd.plotting.scatter_matrix(df, alpha = 0.15, figsize = [10,10])

Axis labels and values have been omitted for clarity due to the lengthy column headings. The x-axis corresponds to the headings of columns A to I in the .csv file from left to right; the y-axis from top to bottom.

Let’s start with the simplest of regression models – linear regression. It is not expected to work well with this dataset, given the data visualised via the matrix scatterplots above.

Line 55 imports the Linear Regression function, while Line 56 imports the train_test_split function, both from the sklearn module. Lines 57 and 58 import the numpy and matplotlib modules, which will be used later.

Line 61 assigns the Linear Regression model to a variable. Line 62 then splits the data into a training set and a testing set. The test_size is set to 30% of the full data set, while the random_state value acts as a seed to make the result reproducible by fixing a particular state.
Line 63 fits a line to the training data and Line 64 uses this best-fit line to predict the testing data. Line 65 then calculates the R-squared (R^2) value for the prediction. Running this block of code returns an R^2 value of 0.5877423418118819. (Yup, not that great a prediction…)
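The fit-predict-score steps described above can be sketched as follows. A synthetic, roughly linear dataset stands in for the concrete data here, so the R^2 printed will not match the value quoted in the post:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data: 8 features, near-linear target
rng = np.random.default_rng(42)
X = rng.random((1030, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(1030)

lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

lr.fit(X_train, y_train)       # fit a line to the training data
y_pred = lr.predict(X_test)    # predict on the held-out test data
r2 = lr.score(X_test, y_test)  # R^2 of the prediction
print(r2)
```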

The problem here is that if a different random_state is used, a different result is obtained, and the range of R^2 values can be quite large, from about 0.4 to about 0.7 if you are lucky. Another question, then, is what the ‘correct’ test_size should be. It is not quite right to use this single R^2 value as a basis for comparison with other regression models.

What if this linear regression model could be run multiple times, with different values of test_size, in order to determine the average R^2 this model could provide?

Let’s use this block of code to obtain the optimal test_size for train_test_split.

Lines 73 and 74 create empty np arrays which will store the mean and standard deviation of the R^2 values obtained.
Line 76 uses a for loop to obtain results with test_size values from 0.1 to 0.9, in 0.1 increments.
Line 77 creates another empty np array to store the replicate R^2 values.

The secondary for loop in Lines 79 to 84 will execute the train_test_split function over 1000 iterations for any given test_size value. These 1000 R^2 values will be stored in the replicate_tts_m_array. The mean of these 1000 values is calculated by the code in Line 86, and the standard deviation by the code in Line 87. The mean and standard deviation are then appended to tts_mean_array and tts_stdev_array respectively. At the end of the primary for loop, there will be 9 values in each of tts_mean_array and tts_stdev_array, as shown below:
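The nested loops described above can be sketched like this. Synthetic data again stands in for the concrete dataset, and the replicate count is cut from 1000 to 50 to keep the sketch fast (the structure is the same):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.random((300, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(300)

n_reps = 50  # the post uses 1000; fewer here to keep the sketch quick
tts_mean_array = np.array([])
tts_stdev_array = np.array([])

for test_size in np.arange(0.1, 1.0, 0.1):  # 0.1 to 0.9 in 0.1 increments
    replicate_tts_m_array = np.array([])
    for _ in range(n_reps):
        # No fixed random_state: each iteration samples a different split
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size)
        r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
        replicate_tts_m_array = np.append(replicate_tts_m_array, r2)
    tts_mean_array = np.append(tts_mean_array, replicate_tts_m_array.mean())
    tts_stdev_array = np.append(tts_stdev_array, replicate_tts_m_array.std())

print(tts_mean_array)
print(tts_stdev_array)
```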

Since each execution of the code will sample a different proportion of the data set, the results for tts_mean_array and tts_stdev_array will differ slightly.

A graphical representation can be created for the tts_mean_array and tts_stdev_array results.

Line 98 creates a np array that will be used as the x-axis of the plot.
Line 100 plots a graph with two y-axes using subplots.
Lines 102 to 105 format the first graph, which shows the mean R^2 value for each test_size value.
Line 107 instantiates a second y-axis that shares the same x-axis.
Lines 109 to 111 format the second graph, which shows the standard deviation for each test_size value.
Line 113 creates a tight layout, and Line 114 displays the plot as follows:
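A sketch of this two-axis plot is below. The arrays here are illustrative placeholders standing in for the tts_mean_array and tts_stdev_array results, and a non-interactive backend is set so the sketch runs anywhere:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this when running locally
import matplotlib.pyplot as plt
import numpy as np

# Placeholder results standing in for tts_mean_array / tts_stdev_array
test_sizes = np.arange(0.1, 1.0, 0.1)
tts_mean_array = np.linspace(0.60, 0.45, 9)
tts_stdev_array = np.linspace(0.03, 0.12, 9)

fig, ax1 = plt.subplots()
ax1.plot(test_sizes, tts_mean_array, "b-o")
ax1.set_xlabel("test_size")
ax1.set_ylabel("mean R^2", color="b")

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(test_sizes, tts_stdev_array, "r-s")
ax2.set_ylabel("stdev of R^2", color="r")

fig.tight_layout()
# plt.show()  # or fig.savefig("tts_scan.png")
```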

This shows that the best mean R^2 value is about 0.60, with a standard deviation of around 0.03, if a test_size of 0.3 is used. This provides an indication of the performance of the Linear Regression model.

Let’s consider a different regression model – Random Forest Regression – and see how it performs.

Line 120 imports the RandomForestRegressor function from the sklearn package. Line 123 assigns the RandomForestRegressor model to a variable. Line 124 fits the training data as defined in Line 62 earlier, and Line 125 predicts the test data. Line 126 then calculates the R-squared (R^2) value for the prediction. Running this block of code returns an R^2 value of 0.9135701428095969. (Not surprisingly, a great improvement…)
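The same fit-predict-score pattern with the random forest can be sketched as follows, again on a synthetic stand-in for the concrete data (so the printed R^2 will differ from the value quoted above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(42)
X = rng.random((500, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rfr = RandomForestRegressor(random_state=42)  # default hyperparameters
rfr.fit(X_train, y_train)
y_pred = rfr.predict(X_test)
r2 = rfr.score(X_test, y_test)
print(r2)
```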

The RandomForestRegressor has a lot more parameters than the Linear Regression model. How, then, could the optimal set of parameters be chosen?

Line 130 imports the GridSearchCV function, which will create a combination matrix of the different parameters defined in Lines 132 to 138. Lines 140 to 143 then create this combination matrix.
Line 145 then calls the GridSearchCV function to test every combination from this matrix, using 5-fold cross-validation to fit the training data. Line 146 gives the best parameter combination as follows:
‘max_depth’:35, ‘max_features’:’auto’, ‘min_samples_leaf’:1, ‘min_samples_split’: 2, ‘n_estimators’: 250.
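A minimal sketch of this grid search is below, on synthetic stand-in data and with a deliberately tiny grid so it runs quickly; the post screens a much wider range of values, so the best parameters found here will not match those above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Deliberately tiny grid; widen each list to screen more combinations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 2],
}

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid, cv=5)  # 5-fold CV over every combination
grid.fit(X_train, y_train)
print(grid.best_params_)
```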

More parameters and/or a wider range of values could be screened, but that takes quite a bit of time (on my rather old laptop…), so this will serve as an example for now. Alternatively, consider using the RandomizedSearchCV function, which screens only a fixed number of parameter settings sampled from the specified distributions, to save some computation time.

Similar to what was done for Lines 72 to 95 above, a for loop was used to run the RandomForestRegressor with the tuned parameters over 1000 iterations. The mean R^2 value obtained was 0.9176761248352558, with a standard deviation of 0.0012886896266347866 (this block of code took quite a while on my laptop…), which is a marked improvement over the linear regression model.

Should there be a need to simplify the model by using fewer features, the RandomForestRegressor has an attribute called feature_importances_ that shows the contribution of each feature variable to the model.

Line 170 defines the plot size. Line 171 accesses the feature_importances_ attribute, using the column headers as the index. Line 173 selects the 8 largest contributors (which is essentially all the features in this dataset) and plots them as a horizontal bar graph, as shown below:
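This feature-importance plot can be sketched as follows. The synthetic target here is built mainly from Cement and Age, purely to illustrate the ranking; the real importances come from fitting the actual concrete data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this when running locally
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = ["Cement", "Blast Furnace Slag", "Fly Ash", "Water",
                 "Superplasticizer", "Coarse Aggregate", "Fine Aggregate",
                 "Age"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((300, 8)), columns=feature_names)
# Synthetic target driven mostly by Cement and Age, to mimic the real ranking
y = 2 * X["Cement"] + X["Age"] + 0.1 * rng.standard_normal(300)

rfr = RandomForestRegressor(random_state=0).fit(X, y)

plt.figure(figsize=(8, 4))
importances = pd.Series(rfr.feature_importances_, index=X.columns)
importances.nlargest(8).plot(kind="barh")  # all 8 features in this dataset
plt.tight_layout()
# plt.show()
```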

The age and cement content contribute significantly to the RFR model in this case, while the fly ash content contributes the least.

Is there any way to make the above codes more concise and elegant? Feel free to comment.
