Comparison of Predictive Modeling Concrete Compressive Strength with Machine Learning Approaches

effectiveness in concrete compressive strength prediction and underscore the importance of feature engineering in enhancing model accuracy.


Introduction
Concrete, a quintessential material in civil engineering, holds civilization's infrastructure in its robust, unyielding grip [1]-[3]. The spectrum of its applications, spanning bridges, skyscrapers, tunnels, and highways, underscores its pivotal role in development [4]-[6]. Its compressive strength is the core attribute underpinning its universal application [7]-[9]. The concrete compressive strength, a paramount indicator of the structural integrity and longevity of constructions, is a non-linear function of its age and composition [10]-[12]. This intricate dependency on various factors, including cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate, alongside the curing period, delineates the complexity of predicting concrete's compressive strength [13]. The urgency for accurate predictions stems from the escalating demand for sustainable construction practices [14]. Modern engineering projects necessitate materials that exhibit superior performance, are resource-efficient, and are environmentally friendly. Optimizing concrete mixes to achieve desired strength outcomes without excess material use becomes imperative [15]. Thus, predicting concrete strength accurately is not merely an academic exercise but a practical necessity that echoes the broader call for sustainable development in the construction industry.
Historically, the quest to predict concrete compressive strength has progressed from empirical models based on laboratory experiments to sophisticated statistical analyses [16]. Initial efforts relied heavily on trial and error, with mix designs adjusted based on iterative testing. However, this foundational approach proved time-consuming and resource-intensive [17]-[19]. The advent of computational models marked a paradigm shift, allowing for the exploration of complex relationships between mix ingredients and compressive strength [20]-[22]. The literature abounds with studies employing various statistical and machine-learning methods to model concrete strength [23]. The body of research reflects a vibrant exploration of computational approaches, from regression analyses to more complex algorithms like artificial neural networks, support vector machines, and decision trees [24]. Yet, despite these advancements, challenges persist. The inherent variability in raw materials, environmental conditions, and mixing processes introduces significant unpredictability, making model accuracy and generalizability ongoing concerns [25].
Concrete compressive strength prediction increasingly gravitates towards ensemble methods and advanced machine learning techniques [26]. By leveraging the strengths of multiple predictive models, these approaches aim to enhance accuracy and robustness [27]. Furthermore, the integration of domain knowledge into computational models represents a promising frontier, offering the potential to refine predictions by incorporating insights into the chemical and physical interactions within the concrete mix [26], [28], [29].
This research seeks to contribute to the evolving landscape of concrete strength prediction by addressing identified gaps in the literature. Specifically, while previous studies have demonstrated the efficacy of various predictive models, there remains a need for comprehensive comparative analyses that elucidate the relative performance of these models across diverse datasets [30], [31]. Moreover, the potential of feature engineering and advanced scaling techniques to improve model performance has not been fully explored. Accordingly, our study focuses on the comparative effectiveness of various machine learning models and their optimization through feature engineering and data scaling.
The research aims to evaluate the predictive accuracy of models such as linear regression, random forest, support vector regression, K-nearest neighbors, gradient boosting, and XGBoost. In doing so, it ascertains how well these models predict concrete strength and determines the impact of feature engineering and scaling techniques on their performance. This is crucial for developing more accurate and adaptable models in civil engineering, ensuring the structural integrity and sustainability of concrete constructions. The comparative results of the six predictive models will explain their relative strengths and weaknesses in estimating concrete compressive strength, thereby assisting practitioners and researchers in selecting the most suitable models for their specific needs. Additionally, the study will show how feature engineering and data scaling can improve model accuracy, providing insights that guide the enhancement of predictive methods. This dual approach advances our understanding of model capabilities in a practical engineering context and promotes the refinement of methodologies for greater generalizability and precision in material science predictions.

Research Design
This study employs a quantitative research design focusing on predicting concrete compressive strength using various statistical and machine-learning models. The quantitative approach is chosen due to the nature of the problem, which involves analyzing numerical data related to concrete mixtures and their compressive strength. The research design encompasses data collection from existing datasets, preprocessing of data, feature engineering, application of multiple predictive models, and evaluation of these models based on their performance metrics. The models included in the study are Linear Regression, Random Forest, Support Vector Regression (SVR), K-nearest neighbors (KNN), Gradient Boosting, and XGBoost. This comparative approach allows for a robust analysis of each model's efficacy in predicting the target variable.
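A comparative design of this kind can be sketched as a small model registry evaluated with cross-validated R². The sketch below is illustrative only and is not the authors' code: the synthetic data from `make_friedman1`, the default hyperparameters, the seed, and the helper names `build_models` and `compare_models` are all assumptions. XGBoost is left out so the example depends only on scikit-learn; where the `xgboost` package is installed, `xgboost.XGBRegressor` could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def build_models(seed=0):
    """Return a name -> estimator mapping (hypothetical helper).

    Scale-sensitive models get a StandardScaler inside the pipeline so the
    scaler is re-fitted on each training fold and never sees held-out data.
    """
    return {
        "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
        "SVR": make_pipeline(StandardScaler(), SVR()),
        "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
    }


def compare_models(X, y, n_splits=5, seed=0):
    """Return name -> (mean R2, std R2) over shuffled k-fold cross-validation."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in build_models(seed).items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        results[name] = (float(np.mean(scores)), float(np.std(scores)))
    return results


if __name__ == "__main__":
    # Synthetic non-linear regression data with 8 features, standing in for
    # the concrete dataset (assumption, for illustration only).
    X, y = make_friedman1(n_samples=300, n_features=8, noise=1.0, random_state=0)
    for name, (mean_r2, std_r2) in compare_models(X, y).items():
        print(f"{name:18s} R2 = {mean_r2:.3f} +/- {std_r2:.3f}")
```

Placing the scaler inside each pipeline, rather than scaling the full dataset once, is a design choice that keeps the cross-validated scores free of information from the held-out fold.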

Population and Samples
The population of interest in this study consists of various concrete mixtures, each defined by a specific combination of ingredients. The dataset can be downloaded from https://www.kaggle.com/datasets/vinayakshanawad/cement-manufacturing-concrete-dataset [32] and comprises 1030 instances of concrete mixtures, including eight quantitative input variables: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, and Age. These variables have been chosen due to their pivotal roles in influencing the concrete's compressive strength, a crucial determinant of the material's applicability in diverse construction scenarios. Cement is selected as it acts as the primary binder in concrete, directly affecting the mixture's strength and longevity. The inclusion of Blast Furnace Slag and Fly Ash as supplementary cementitious materials is justified by their ability to enhance the mechanical properties of concrete and its sustainability credentials by reducing the amount of cement required. Water, indispensable for the hydration reaction with cement, is measured precisely, as its excess can dilute the concrete mix, adversely impacting structural integrity. The superplasticizer is considered for its capacity to enhance mix workability without increasing water content, thus allowing for higher compressive strength. The choice of coarse and fine aggregate stems from their role as structural fillers, which contribute to the mix's volume, affecting its density and strength. Finally, age is accounted for due to the significant maturation of concrete strength over time, particularly in the initial curing phases, making it a critical factor in the mix's performance evaluation. The output variable, concrete compressive strength measured in MPa, serves as a quantifiable metric for assessing the efficacy of the mix design. This dataset, therefore, stands as a comprehensive representation of the broader population of concrete mixtures used in civil engineering.
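The split into eight inputs and one output can be made concrete with a small reader. This is a minimal sketch that assumes the CSV header uses the short column names listed in `FEATURES` and `TARGET`; those names, and the helper `read_concrete`, are assumptions and should be checked against the downloaded file. The example row in the usage note is invented for illustration and is not taken from the dataset.

```python
import csv

# Assumed column names (verify against the actual CSV header).
FEATURES = [
    "cement", "slag", "ash", "water",
    "superplastic", "coarseagg", "fineagg", "age",
]
TARGET = "strength"  # compressive strength in MPa


def read_concrete(lines):
    """Parse CSV text lines into (X, y).

    `lines` is any iterable of text lines, e.g. an open file object.
    X is a list of 8-element float lists, y a list of floats.
    A missing column raises KeyError, which surfaces header mismatches early.
    """
    X, y = [], []
    for row in csv.DictReader(lines):
        X.append([float(row[name]) for name in FEATURES])
        y.append(float(row[TARGET]))
    return X, y
```

With a local copy of the file (the path here is hypothetical), usage would look like `with open("concrete.csv", newline="") as f: X, y = read_concrete(f)`, after which `len(X)` can be compared against the 1030 instances reported above.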

Instruments
Several software tools and libraries are utilized for data analysis, including Python for data preprocessing and analysis, and Scikit-learn and XGBoost for implementing the machine learning models. The SHAP library is used for model interpretability, specifically for understanding the impact of each feature on the predictions of the Random Forest model.

Data Analysis
The analysis begins with an exploratory data analysis (EDA) to understand the dataset's characteristics, including the distribution of variables, presence of outliers, and potential correlations between features. Following EDA, feature engineering is applied to create new features that might improve model performance. Data scaling is performed using StandardScaler and MinMaxScaler to bring the feature values onto comparable ranges, facilitating more efficient learning by the models. For model evaluation, the study employs 5-fold cross-validation to ensure the models' performance is not dependent on a particular data partitioning. This approach provides a more generalized performance metric across different subsets of the data. The main performance metric used for model evaluation is the R-squared (R²) value, which measures the proportion of variance in the compressive strength that is predictable from the input features. The methodology culminates in the application of RandomizedSearchCV for hyperparameter tuning of the Random Forest model, aiming to optimize its performance. Finally, SHAP values are calculated for the Random Forest model to interpret the model's predictions and understand the influence of each feature on the compressive strength outcome.
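The hyperparameter tuning step can be sketched with scikit-learn's `RandomizedSearchCV`. The search space, the number of sampled configurations, the seed, and the helper name `tune_random_forest` below are assumptions chosen for illustration; the paper does not report the ranges actually searched.

```python
from scipy.stats import randint
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune_random_forest(X, y, n_iter=10, seed=0):
    """Randomized search over an assumed Random Forest search space.

    Each sampled configuration is scored by 5-fold cross-validated R2.
    Returns the fitted RandomizedSearchCV object.
    """
    param_distributions = {
        "n_estimators": randint(50, 200),      # integers in [50, 199]
        "max_depth": [None, 5, 10, 20],
        "min_samples_split": randint(2, 11),   # integers in [2, 10]
        "min_samples_leaf": randint(1, 5),     # integers in [1, 4]
        "max_features": [0.5, 0.75, 1.0],      # fraction of features per split
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=5,
        scoring="r2",
        random_state=seed,
    )
    search.fit(X, y)
    return search


if __name__ == "__main__":
    # Synthetic stand-in data (assumption); replace with the concrete features.
    X, y = make_friedman1(n_samples=300, n_features=8, noise=1.0, random_state=0)
    search = tune_random_forest(X, y)
    print(search.best_params_)
    print(round(search.best_score_, 3))
```

Randomized search samples a fixed number of configurations instead of enumerating a full grid, which keeps the cost bounded by `n_iter` times the number of folds.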

Results and Discussions
The R-squared results indicate a notable variance in model performance. XGBoost emerged as the front-runner, with an average R-squared value of 0.9178, which suggests a high predictive capability. This model's performance is particularly noteworthy compared to more traditional models, such as Linear Regression, which exhibited an average R-squared value of 0.5886, indicating a moderate fit. Gradient Boosting also performed robustly, with an R-squared value that signals strong predictive power. The performance spectrum of these models underscores the advanced capabilities of ensemble learning techniques in capturing complex, non-linear relationships within data.
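For reference, the reported averages can be read against the standard definition of the coefficient of determination, assuming the usual form computed on each held-out fold and averaged over the five folds. The symbols are introduced here for illustration: y_i is the measured strength, ŷ_i the predicted strength, ȳ the mean measured strength in the fold, and R²_k the score on fold k.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\overline{R^2} = \frac{1}{5} \sum_{k=1}^{5} R^2_k
```

Under this definition, an average of 0.9178 leaves about 1 - 0.9178 = 0.0822 of the variance in compressive strength unexplained, whereas 0.5886 leaves about 1 - 0.5886 = 0.4114 unexplained.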
A deeper exploration into model interpretability is afforded by SHAP values, which reveal the impact of each feature on the model output. For instance, age and cement content emerge as significant predictors, reflecting their crucial role in determining concrete strength. The distribution of SHAP values for these features indicates their varied influence across different observations, providing insights that could guide more nuanced concrete mix designs.
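The quantity that SHAP attributes to each feature is a Shapley value. The sketch below computes exact Shapley values by brute force for a tiny, hand-written function, purely to illustrate the idea. It is not the SHAP library's algorithm for tree models, and `toy_strength`, its coefficients, and the input values are hypothetical rather than a fitted model or real mix data.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features outside a coalition are held at their baseline value.
    The cost grows as 2**n, so this is only practical for a few features.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        point = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(point)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_strength(point):
    """Hypothetical non-linear response in (cement, water, age); not a fitted model."""
    cement, water, age = point
    return 0.1 * cement - 0.05 * water + 0.002 * cement * age


if __name__ == "__main__":
    x = [400.0, 160.0, 28.0]        # illustrative mix (made-up values)
    baseline = [250.0, 180.0, 7.0]  # illustrative reference mix (made-up values)
    phi = shapley_values(toy_strength, x, baseline)
    for name, contribution in zip(["cement", "water", "age"], phi):
        print(f"{name:7s} {contribution:+.2f}")
```

The contributions sum to the difference between the prediction at `x` and the prediction at `baseline`, which is the additivity property that SHAP summary plots rely on.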
Expanding upon these findings, the discussion integrates them into the broader landscape of civil engineering by exploring their practical implications. The significance of age and cement
As presented in Table 1, Linear Regression achieved an average R² value of approximately 0.59 with a standard deviation of 0.049, indicating a moderate fit to the data. However, the spread of residuals in the residual plot demonstrates that the model has limitations
Moreover, the comparative analysis of our results with those of previous studies elucidates a significant advancement in predictive model development. While Linear Regression has been a staple in earlier research due to its simplicity and interpretability, our findings suggest a shift towards more sophisticated models like XGBoost and Random Forest for higher accuracy in complex scenarios. This transition mirrors the evolving landscape of construction material research, where the demand for precision and efficiency necessitates a departure from traditional methodologies. Our investigation's reliance on ensemble models and the nuanced understanding facilitated by SHAP analysis not only enriches the existing body of knowledge but also offers a practical guide for future concrete mix optimizations. The clear depiction of variable impacts through SHAP values provides a roadmap for adjusting mix components to achieve desired strength levels, illustrating the efficacy of machine learning in construction material innovation [38].

Conclusion
The study's findings emphasize the complexity inherent in predicting concrete compressive strength and the potential of machine learning models to capture this complexity. The superior performance of ensemble methods, particularly XGBoost with an average R² value of 0.9178, suggests that these models are more adept at handling the nonlinearity and high dimensionality of the problem space. Additionally, the feature impact analysis via SHAP values underscores the importance of both quantitative mix components and curing time, providing valuable insights for practitioners seeking to optimize concrete formulations for desired strength outcomes. The research contributes to the field by demonstrating the applicability and effectiveness of various predictive models, and it also highlights the critical role of feature engineering and model interpretability in understanding and improving predictions. Future work could explore further optimization of model parameters, integration of additional features that may influence concrete strength, and the application of these findings in practical mix design and quality control processes.

Acknowledgements
We express our deepest gratitude to Atma Jaya Catholic University of Indonesia for the unwavering support and resources throughout this research project. Our sincere thanks also extend to the Information System Study Program, whose rigorous academic environment and collaborative ethos have facilitated our work.
in capturing the complexity of the data, as evidenced by the heteroscedastic pattern: the variance of residuals is not constant across the range of fitted values. Furthermore, Random Forest and Gradient Boosting models performed significantly better, with average R² values of approximately 0.90 and 0.89, respectively. The low standard deviation for these models suggests a consistent performance across different folds in cross-validation. By integrating multiple decision trees, these ensemble learning models effectively captured the non-linear relationships in the data. The Support Vector Regression (SVR) and K-Nearest Neighbors (KNN) models yielded average R² values of 0.61 and 0.67, respectively. While outperforming the Linear Regression model, these methods did not achieve the high predictive accuracies of the ensemble models. The SVR model's performance indicates a limited capacity to manage the multidimensional feature space, and KNN's performance suggests sensitivity to the feature space's dimensionality and the scale of the data. The standout performer was the XGBoost model, with an average R² value of approximately 0.92, which signifies a superior fit to the dataset compared to the other models. The model's ability to leverage gradient boosting and tree pruning mechanisms, along with its handling of missing values and regularization to avoid overfitting, contributed to its high performance.
Source: Author's Analysis Results (2024).
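The heteroscedastic pattern noted for Linear Regression can be checked numerically as well as visually. The sketch below is a crude split-sample comparison in the spirit of a Goldfeld-Quandt test, not a formal test and not the authors' procedure: it fits a one-variable least-squares line and compares the mean squared residual in the upper and lower halves of the fitted values. The synthetic data, the single predictor, and the helper names are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_variance_ratio(xs, ys):
    """Mean squared residual in the upper half of fitted values divided by
    that in the lower half. Values far from 1 hint at heteroscedasticity."""
    slope, intercept = fit_line(xs, ys)
    pairs = sorted(
        (slope * x + intercept, y - (slope * x + intercept))
        for x, y in zip(xs, ys)
    )
    half = len(pairs) // 2
    low = [residual for _, residual in pairs[:half]]
    high = [residual for _, residual in pairs[half:]]

    def mean_square(residuals):
        return sum(r * r for r in residuals) / len(residuals)

    return mean_square(high) / mean_square(low)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(1.0, 100.0) for _ in range(400)]
    # Synthetic target whose noise grows with x (assumption, for illustration).
    ys = [2.0 * x + rng.gauss(0.0, 0.2 * x) for x in xs]
    print(round(residual_variance_ratio(xs, ys), 2))
```

A ratio well above 1 means the residual spread grows with the fitted value, which is the situation in which the constant-variance assumption of Linear Regression is doubtful.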

Figure 1. SHAP Results.
As presented in Figure 1, the SHAP (SHapley Additive exPlanations) summary plot provides insights into the feature importance and impact on the model output. In our study, the SHAP summary plot for the Random Forest model revealed that age and cement are the most influential factors in predicting concrete compressive strength. Higher SHAP values for age suggest that longer curing times have a strong positive effect on strength, aligning with the understanding that concrete continues to gain strength over time. Similarly, higher cement content is positively correlated with strength, which is consistent with the known properties of concrete mixtures. Water, slag, and superplasticizer features also show varied impacts on the

Figure 2
Figure 2 shows that the results emphasize the capability of ensemble models and advanced tree-based algorithms in predicting complex, non-linear relationships inherent in construction materials data. The superior performance of Random Forest and XGBoost over traditional Linear Regression and simpler machine learning models underscores the importance of model selection in predictive analytics for material science. The higher standard deviation in the R² values of the Linear Regression model points to potential overfitting or underfitting issues. It could also indicate that the assumptions of Linear Regression, such as homoscedasticity and linearity, might not hold for this dataset.

Table 1. Comparison Results of Machine Learning Methods.