A machine learning pipeline is the overall process from data processing through to model prediction. It can be thought of as a sequence of actions (e.g. data imputation, data scaling, encoding of categorical values, hyperparameter tuning, model selection) that can be broken down into smaller ‘components’ or functions. The user can then configure these ‘components’ to suit the intended purpose. Most GUI-based machine learning software exemplifies this idea. The pipeline can also be extended to model validation, deployment and monitoring.
This post builds on the RandomForestRegressor result obtained in the previous post, titled “Machine Learning. Hyperparameter Tuning”.
The dataset used is taken from the Concrete Compressive Strength Data Set from the UCI Machine Learning Repository.
Reference
https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
(Accessed on 10 Feb 2020)
“Reuse of this database is unlimited with retention of copyright notice for Prof. I-Cheng Yeh and the following published paper: I-Cheng Yeh, “Modeling of strength of high performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).”
A simple pipeline will be explored in this post.
Let us consider the block of code below for creating and using the pipeline. The code shown does not start from Line 1 or have continuous line numbering as in the previous post, as some comments/notes present in the code have been omitted for simplicity’s sake.

Line 35 imports the pandas module to read the data from the .csv file. Lines 36 to 40 import the required functions from the sklearn package.
Line 42 reads the .csv file as a pandas DataFrame. Lines 43 and 44 split the dataset into the feature variables and the target variable respectively.
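The code block itself is not reproduced here, but Lines 35 to 44 might look roughly like the sketch below. The file name and column names are assumptions on my part, and a small synthetic DataFrame stands in for the .csv file so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
# Imports corresponding to Lines 36 to 40 (assumed)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Line 42 would read the actual dataset, e.g.:
# df = pd.read_csv('Concrete_Data.csv')
# A synthetic stand-in is used here so the snippet is self-contained.
rng = np.random.default_rng(42)
cols = ['cement', 'slag', 'flyash', 'water', 'superplasticizer',
        'coarseagg', 'fineagg', 'age', 'strength']
df = pd.DataFrame(rng.random((100, 9)), columns=cols)

# Lines 43 and 44: split into feature variables and target variable
X = df.drop(columns=['strength'])
y = df['strength']
```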
Three ‘components’ will be used in this simple pipeline example.
As seen in the previous post, the RandomForestRegressor’s feature_importances_ attribute ranks the features of this dataset. The two least important features (fly ash content and coarse aggregate content) will be removed to simplify the model. Lines 50 to 54 use the ColumnTransformer function to remove these two columns and pass the remaining data on to the second ‘component’.
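A sketch of what Lines 50 to 54 might look like is below. The column names 'flyash' and 'coarseagg' are assumptions; ColumnTransformer’s built-in 'drop' transformer removes them, while remainder='passthrough' keeps the other columns unchanged:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer

# Stand-in feature table with the dataset's eight inputs (names assumed)
cols = ['cement', 'slag', 'flyash', 'water', 'superplasticizer',
        'coarseagg', 'fineagg', 'age']
X = pd.DataFrame(np.random.default_rng(0).random((10, 8)), columns=cols)

# First 'component': drop the two least important features,
# passing the remaining six columns through unchanged
drop_cols = ColumnTransformer(
    transformers=[('drop_unimportant', 'drop', ['flyash', 'coarseagg'])],
    remainder='passthrough',
)

X_reduced = drop_cols.fit_transform(X)
print(X_reduced.shape)  # six columns remain
```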

Line 56 is the second ‘component’, which scales the data.
Lines 58 and 59 are the last ‘component’, which calls the RandomForestRegressor function using the tuned hyperparameters.
Lines 62 to 65 then create the pipeline to be used.
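Putting the three ‘components’ together, Lines 56 to 65 might look like the sketch below. Note that the RandomForestRegressor hyperparameter values shown are placeholders, not the actual tuned values from the previous post:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# First 'component': drop the two least important features (names assumed)
col_trans = ColumnTransformer(
    transformers=[('drop_unimportant', 'drop', ['flyash', 'coarseagg'])],
    remainder='passthrough',
)

# Second 'component': scale the remaining features
scaler = StandardScaler()

# Third 'component': the regressor (hyperparameter values are placeholders)
rfr = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

# Chain the three 'components' into a single pipeline;
# calling fit/predict on the pipeline runs each step in order
pipe = Pipeline(steps=[
    ('drop_cols', col_trans),
    ('scaler', scaler),
    ('rfr', rfr),
])
```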

To use the pipeline, it is first fitted to the training data in Line 67, followed by predicting on the test data in Line 69.
Line 70 then calculates the R-squared (R^2) score of this model.
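Lines 67 to 70 then boil down to the usual fit/predict/score sequence. With a pipeline, a single fit call runs every ‘component’ in order on the training data. Synthetic stand-in data is used here (column names assumed) so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the concrete dataset (column names assumed)
rng = np.random.default_rng(42)
cols = ['cement', 'slag', 'flyash', 'water', 'superplasticizer',
        'coarseagg', 'fineagg', 'age']
X = pd.DataFrame(rng.random((200, 8)), columns=cols)
y = X['cement'] * 2 + X['water'] + rng.normal(0, 0.1, 200)

pipe = Pipeline(steps=[
    ('drop_cols', ColumnTransformer(
        [('drop_unimportant', 'drop', ['flyash', 'coarseagg'])],
        remainder='passthrough')),
    ('scaler', StandardScaler()),
    ('rfr', RandomForestRegressor(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

pipe.fit(X_train, y_train)        # Line 67: fit the pipeline to training data
y_pred = pipe.predict(X_test)     # Line 69: predict on the test data
score = r2_score(y_test, y_pred)  # Line 70: R^2 score of the model
print(score)
```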
For this particular random_state, the R^2 value obtained is 0.9269673111015994. This value is slightly lower than that obtained in the previous post, partly due to the omission of the two feature variables. Nonetheless, the model still performs well.
Is there any way to make the above code more concise and elegant? Feel free to comment.