Random Forest and Ensemble Model - Predicting wRC+ Values

Intro and Methodology 

Baseball is a volatile sport. Consistency is what separates players who become and remain stars from everyone else, which is why we so often see a player excel in one season and fall off the face of the earth the next, or toil in irrelevance and then suddenly blossom into a star. Our research attempts to quantify which aspects of a batter's skillset best predict offensive output, including how hard they hit the ball, plate discipline, batted-ball trajectory, and contact skills.

Research Question:

Using wRC+ as our measure of a player's offensive value, which metrics best predict that value, and changes in it, from year to year? And can we use what we learn to accurately predict a player's output in the next season from their metrics in the previous one?

Data 

Our data consisted of batter stats from 2019-2024, filtered to players with a minimum of 250 plate appearances. We excluded 2020, since the shortened 60-game season would have forced us to lower the plate-appearance minimum, increasing the influence of small samples and outliers.

We chose the following metrics to conduct our analysis: 

  1. Average exit velocity

  2. Hard-hit %

  3. BB%

  4. K%

  5. ISO (isolated power)

  6. Batting average

  7. Whiff %

  8. Swing %

  9. Average launch angle

  10. Flyball %

  11. Groundball %

  12. Pull %

  13. Oppo %

Ultimately, we wanted variables that fully encompass a batter's hitting profile: plate discipline, quality of contact, swing tendencies, strikeout tendencies, and the ability to elevate the ball.

Our data preparation took a longitudinal approach, tracking individual players across multiple seasons. We created a time-based dataset where each row represented a player's performance in a specific year, then added a key feature: wRCplus_next, the player's Weighted Runs Created Plus (wRC+) in the subsequent season. The final modeling data included only seasons in which a player qualified both that year and the next, so that we weren't comparing a season to a player's next qualified season, which could come two or three years later due to injuries or other outside factors.
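A minimal sketch of that pairing step in R, assuming a hypothetical data frame `batters` with one row per player-season and illustrative column names (player_id, season, wRCplus), might look like this:

```r
library(dplyr)

# Pair each player-season with that player's wRC+ in the following qualified season.
# 'batters' and its column names are placeholders, not our actual object names.
model_df <- batters %>%
  arrange(player_id, season) %>%
  group_by(player_id) %>%
  mutate(wRCplus_next = lead(wRCplus),
         season_next  = lead(season)) %>%
  ungroup() %>%
  # Keep only seasons where the player also qualified the next year;
  # how the skipped 2020 season is handled is a separate modeling choice.
  filter(!is.na(wRCplus_next))
```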

In order to predict future wRC+ values, we initially used a random forest model.

A random forest is a machine learning algorithm used for both classification and regression tasks. It operates by constructing multiple decision trees during training and outputting the mode (for classification) or mean prediction (for regression) of the individual trees. In our case, we are conducting a regression, predicting wRC+ outcomes. When building each tree, the algorithm selects a random subset of features at each split point. The randomness also ensures that certain data features or predictors don’t overly dominate the model and the predictions. 

In ensemble machine learning, random forest models leverage two critical hyperparameters to optimize predictive performance: ntree (number of trees) and mtry (variables per split). The ntree parameter determines the total number of decision trees constructed, with each tree trained on a bootstrap sample of the data, a resampling technique that draws observations from the dataset with replacement, which helps reduce overfitting. Concurrently, mtry controls the number of predictors randomly considered at each tree split, which introduces additional randomness and keeps any single feature from dominating the model.
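As a rough illustration of where those two hyperparameters enter, a basic randomForest regression fit could look like the sketch below; the data frame name and the ntree value here are placeholders, not the tuned values we used.

```r
library(randomForest)

set.seed(42)
rf_fit <- randomForest(
  wRCplus_next ~ .,              # predict next season's wRC+ from the 13 hitting metrics
  data       = model_df,         # placeholder name for the prepared player-season data
  ntree      = 500,              # number of bootstrapped trees
  mtry       = floor(sqrt(13)),  # predictors sampled at each split (square root of 13 features)
  importance = TRUE              # track feature importance for later inspection
)
```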

Results

Feature importance, together with the randomness described above, played a key role in shaping our analysis and model, as we attempted to identify which metrics best predicted year-to-year wRC+ and next season's wRC+. Feature importance scores each predictor by how much it contributes to the model's predictions.
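With the randomForest package, for example, those scores can be read straight off a fitted model (continuing the illustrative rf_fit object from the sketch above):

```r
# %IncMSE: increase in prediction error when a predictor's values are permuted;
# IncNodePurity: total reduction in residual sum of squares from splits on that predictor.
importance(rf_fit)
varImpPlot(rf_fit)   # quick visual ranking of the predictors
```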

Our predictive model's results align closely with the characteristics of high-performing players in contemporary baseball, highlighting key metrics that define offensive excellence. The analysis revealed that exit velocity, strikeout percentage, hard-hit percentage, walk rate, and isolated power (ISO) are critical indicators of a player's offensive potential.

The model fundamentally favors a specific archetype of hitter: one who combines multiple elite skills. These players demonstrate the ability to consistently make hard contact, minimize strikeouts, generate significant power, and maintain a disciplined approach at the plate by drawing walks. 

What sets this analysis apart is not necessarily the discovery of new information, but the quantitative reinforcement of long-held baseball wisdom. The data suggests that the most impactful hitters are those who can balance power, plate discipline, and contact skills—a nuanced blend of attributes that transforms good hitters into exceptional offensive performers.

After obtaining feature importance values from our initial random forest model, we applied cross-validation. Cross-validation serves as a crucial statistical check on the reliability of our predictive approach: by splitting the dataset into multiple subsets and training the model iteratively, we can better estimate how it will perform on unseen data, reducing the risk of overfitting and capturing the inherent variability in player performance. In our 4-fold implementation, the random forest is trained four times, each time holding out a different subset of the data as the validation set. Since we are ultimately predicting future, unseen seasons, this guards against a model that works well on training data but fails on new data.

2024 Predictions

Before we began predicting 2025, we first wanted to test the model on data we already had: the 2024 season. We ran our cross-validated model using an ntree of 483 and an mtry equal to the square root of the number of predictors.
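One way to set this up with the caret package, continuing the hypothetical model_df object and assuming the 2023 rows (which are paired with 2024 wRC+) are held out as the test set, is sketched below; our exact code may differ.

```r
library(caret)

train_df  <- subset(model_df, season <  2023)  # earlier seasons used for training
test_2024 <- subset(model_df, season == 2023)  # 2023 stats paired with 2024 wRC+

ctrl <- trainControl(method = "cv", number = 4)   # 4-fold cross-validation

rf_cv <- train(
  wRCplus_next ~ ., data = train_df,
  method    = "rf",
  ntree     = 483,                                 # passed through to randomForest
  tuneGrid  = data.frame(mtry = floor(sqrt(13))),  # square root of the 13 predictors
  trControl = ctrl,
  metric    = "RMSE"
)

pred_2024 <- predict(rf_cv, newdata = test_2024)
```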

Looking at the output, the results are decidedly mixed.

Naturally, we see players like Judge, Alvarez, Betts, Ohtani, Tucker, and Soto up top. However, we see some pretty abysmal predictions, including that of Nolan Jones, who underperformed his expected wRC+ by 59. 

Looking at the chart above, the model did not predict especially well in terms of the correlation between predicted and actual wRC+. At just 0.538, there is no doubt there is plenty of room for improvement. The RMSE (root mean squared error) of 22.75 indicates that the model's predicted 2024 wRC+ values deviate from the actual values by about 22.75 points on average. While not ideal, that is reasonable given the difficulty of predicting a complex metric like wRC+ a year in advance for a large group of players. The moderately strong positive correlation suggests the model is capturing key factors that influence wRC+, but there is clear room for improvement through further refinement, feature engineering, or additional data sources. Overall, the results seem reasonable given the inherent year-to-year variability in player performance, but continued optimization of the model, and a clear understanding of its limitations, would be prudent for making the most accurate predictions possible.
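Both figures can be computed directly from the predicted and actual 2024 values (continuing the hypothetical objects from the previous sketch):

```r
actual_2024 <- test_2024$wRCplus_next

cor(pred_2024, actual_2024)              # correlation, about 0.538 in our run
sqrt(mean((pred_2024 - actual_2024)^2))  # RMSE, about 22.75 wRC+ points
```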

2025 Predictions

Using an ntree of 402 and an mtry equal to the square root of the number of predictors, here is a look at our top 25 player results. It's no surprise to see players such as Judge, Ohtani, Alvarez, and Tucker leading the way; these are players who fit the mold our feature importance outlined as most valuable.
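Generating the 2025 projections follows the same pattern: refit on all paired seasons and feed in each hitter's 2024 metrics. A sketch, again with placeholder object names such as stats_2024:

```r
rf_2025 <- randomForest(wRCplus_next ~ ., data = model_df,
                        ntree = 402, mtry = floor(sqrt(13)))

# stats_2024: 2024 metrics for hitters with 250+ PA (placeholder name)
pred_2025 <- predict(rf_2025, newdata = stats_2024)
head(sort(pred_2025, decreasing = TRUE), 25)   # top 25 projected wRC+ for 2025
```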

Ensemble Method Model

Since our random forest model didn't quite perform the way we wanted it to, we decided to take a new approach to modeling wRC+. We again included random forest in our ensemble method, but added two new models alongside it.

XGBoost (Extreme Gradient Boosting)

XGBoost builds decision trees much like a random forest does, but with a key difference: it builds them sequentially, using the results of earlier trees to keep improving as training goes on. At each step, the algorithm fits the next tree to the residual errors of the trees that came before it, a process known as gradient boosting, which helps reduce bias and prediction error. This contrasts with random forest, which relies on bagging (bootstrap aggregation): each tree is trained independently on a bootstrap sample of the data and the trees' predictions are averaged, reducing variance rather than iteratively correcting mistakes.
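A bare-bones xgboost regression in R looks like the following; the learning rate, tree depth, number of rounds, and the predictor_cols name are illustrative, not tuned values from our study.

```r
library(xgboost)

x <- as.matrix(model_df[, predictor_cols])   # predictor_cols: the 13 metric columns (placeholder)
y <- model_df$wRCplus_next

dtrain <- xgb.DMatrix(data = x, label = y)

xgb_fit <- xgb.train(
  params = list(objective = "reg:squarederror",  # squared-error regression on wRC+
                eta       = 0.05,                # learning rate applied to each boosting step
                max_depth = 4),                  # depth of each sequentially added tree
  data    = dtrain,
  nrounds = 300                                  # number of boosted trees
)
```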

Glmnet Linear Regression

Glmnet is a linear regression method that blends Lasso and Ridge regression to improve predictive modeling by adding a penalty term for regularization. The penalty, controlled by parameters alpha (α) and lambda (λ), helps manage model complexity and perform feature selection. When α = 1, the model applies Lasso regression, which can shrink some coefficients to zero, effectively selecting features. When α = 0, it uses Ridge regression, shrinking coefficients without eliminating them, which helps address multicollinearity and overfitting. The λ parameter controls the strength of regularization, with higher values increasing shrinkage and reducing complexity at the cost of bias. Glmnet refines traditional regression by prioritizing significant predictors and reducing the impact of less important ones. This is particularly useful in datasets with many variables, like baseball performance analysis, where multiple metrics influence outcomes. By combining Lasso and Ridge techniques, glmnet enhances generalization, reduces overfitting, and improves interpretability compared to standard regression methods.
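In practice this is typically fit with cv.glmnet, which chooses lambda by cross-validation; the alpha value below (an even Lasso/Ridge blend) is just an example, not necessarily the value our ensemble settled on.

```r
library(glmnet)

# x and y as defined in the xgboost sketch above
glmnet_cv <- cv.glmnet(x, y, alpha = 0.5)   # alpha = 1 is pure Lasso, alpha = 0 is pure Ridge

coef(glmnet_cv, s = "lambda.min")   # coefficients at the lambda with the lowest CV error
```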

Results

Once again we used cross-validation with our ensemble method, this time with 5 folds repeated three times. We used RMSE as our performance metric, which tells the training process to select the model that minimizes root mean squared error.
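One way to combine the three learners into a single stacked predictor is the caretEnsemble package; the sketch below mirrors the resampling scheme described above and reuses the hypothetical train_df and test_2024 objects, though our exact implementation may differ.

```r
library(caret)
library(caretEnsemble)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                     savePredictions = "final")

base_models <- caretList(
  wRCplus_next ~ ., data = train_df,
  trControl  = ctrl,
  metric     = "RMSE",
  methodList = c("rf", "xgbTree", "glmnet")   # random forest, xgboost, glmnet
)

ensemble_fit <- caretEnsemble(base_models, metric = "RMSE",
                              trControl = trainControl(method = "cv", number = 5))

pred_2024_ens <- predict(ensemble_fit, newdata = test_2024)
```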

Instantly, we saw significantly improved results when applying this model to predicting 2024 wRC+.

The correlation coefficient of 0.824 represents a strong positive relationship between predicted and actual values, substantially better than the 0.538 correlation from our initial random forest model.

The RMSE of 17.47 indicates that, on average, the model's predictions deviate from actual wRC+ values by about 17.47 points. This is a meaningful improvement from the previous RMSE of 22.75, representing approximately a 23% reduction in prediction error. This suggests that the ensemble method is doing a better job at capturing the underlying patterns in the data.
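That figure follows directly from the two RMSE values:

```r
(22.75 - 17.47) / 22.75   # ~0.232, i.e. roughly a 23% reduction in prediction error
```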

The research demonstrates the value of ensemble modeling techniques in predicting baseball player performance. While the initial random forest model showed moderate predictive power with a correlation of 0.538 and RMSE of 22.75, the enhanced ensemble approach combining random forest, xGBoost, and glmnet linear regression yielded significantly improved results. The final model achieved a much stronger correlation of 0.824 and reduced the RMSE to 17.47, representing a 23% improvement in prediction accuracy. These results validate the importance of key offensive metrics like exit velocity, strikeout percentage, hard-hit percentage, walk rate, and isolated power in forecasting player performance. The research not only quantifies what baseball experts have long understood about successful hitting profiles, but also provides a robust statistical framework for predicting future offensive production through wRC+. While no model can perfectly predict player performance given the inherent variability in baseball, this ensemble approach offers a more reliable tool for projecting offensive output and could prove valuable for teams in player evaluation and development decisions.

Next

Markov Chain Analysis of State Transitions, Player Impact