
Our Task

Every year, after Major League Baseball announces its All-Star selections, some fans are angry because they disagree with the choices. Either they think a player should have made the team but didn’t (known as getting snubbed), or they think a player made the team without deserving it. Our task is to use machine learning techniques to build a classifier that accurately predicts whether an MLB player was an All-Star in a given year based on that player's raw individual hitting statistics. We want to find out whether fan outrage is warranted and whether there are trends that explain why players are snubbed or earn undeserved All-Star appearances.

Methodology

We used a data set from journalist Sean Lahman’s baseball archive, which contains baseball statistics dating back to 1871. We filtered the data to include only position-player seasons from 1933, the year of the first All-Star Game, onward. The following attributes were used to predict whether or not a player made the All-Star team:

  1. Games Played

  2. At Bats

  3. Runs

  4. Hits

  5. Doubles

  6. Triples

  7. Runs Batted In

  8. Home Runs

  9. Stolen Bases

  10. Times Caught Stealing 

  11. Walks

  12. Strikeouts
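The year filter and attribute selection described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing: it assumes the column codes of the Lahman batting table (`G`, `AB`, `R`, `H`, `2B`, `3B`, `RBI`, `HR`, `SB`, `CS`, `BB`, `SO`), the rows in `SAMPLE_CSV` are made up, the helper name `load_feature_rows` is hypothetical, and treating blank cells as 0 is our assumption. The position-player filter is omitted because the write-up does not say how it was done.

```python
import csv
import io

# Lahman batting-table column codes for the twelve attributes listed above,
# in the same order as the list.
FEATURES = ["G", "AB", "R", "H", "2B", "3B", "RBI", "HR", "SB", "CS", "BB", "SO"]
FIRST_ALL_STAR_YEAR = 1933

# Made-up toy rows for illustration only; these are not real statistics.
SAMPLE_CSV = """playerID,yearID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO
toyplay01,1930,150,580,100,190,35,8,20,95,10,4,60,50
toyplay01,1934,148,570,95,180,30,6,25,100,8,3,70,55
toyplay02,1950,120,400,50,100,15,2,,45,3,1,30,60
"""


def load_feature_rows(csv_text, first_year=FIRST_ALL_STAR_YEAR):
    """Return (playerID, yearID, feature_vector) for seasons from first_year on."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        year = int(record["yearID"])
        if year < first_year:
            continue  # season predates the first All-Star Game
        # Assumption: a blank cell is treated as 0.
        features = [int(record[name] or 0) for name in FEATURES]
        rows.append((record["playerID"], year, features))
    return rows


rows = load_feature_rows(SAMPLE_CSV)
for player_id, year, features in rows:
    print(player_id, year, features)
```

Each surviving season becomes a twelve-element numeric vector, which is the shape a classifier such as K-Nearest Neighbors expects.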

Three machine learning algorithms (J48 Decision Tree, Multilayer Perceptron Neural Network, and IBk K-Nearest Neighbor) were run in Weka to classify All-Star players. We chose these methods because we believed they would work well with our large numerical dataset.

Testing and Training

We used 10-fold cross-validation to find the optimal K for K-Nearest Neighbors. The value of K that correctly classified the most All-Star players was 3.
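As a rough illustration of this selection step, the sketch below picks K by counting how many true All-Stars are correctly predicted across the folds. It is plain Python rather than Weka, it uses a simple interleaved fold split instead of Weka's own fold assignment, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def all_stars_found(data, k, folds=10):
    """Count true All-Stars (label 1) predicted correctly across all folds."""
    found = 0
    for fold in range(folds):
        # Every folds-th item is held out; the rest form the training set.
        test = data[fold::folds]
        train = [item for i, item in enumerate(data) if i % folds != fold]
        found += sum(
            1 for x, y in test if y == 1 and knn_predict(train, x, k) == 1
        )
    return found


def best_k(data, candidates, folds=10):
    """Return the candidate K that finds the most All-Stars (first one on ties)."""
    return max(candidates, key=lambda k: all_stars_found(data, k, folds))


# Made-up toy data: 15 non-All-Stars (label 0) and 5 All-Stars (label 1).
toy_data = [((float(i), 0.0), 0) for i in range(15)]
toy_data += [((10.0 + i, 10.0), 1) for i in range(5)]

for k in (1, 3, 9):
    print(k, all_stars_found(toy_data, k))
print("chosen K:", best_k(toy_data, [1, 3, 9]))
```

Scoring by correctly identified All-Stars rather than overall accuracy mirrors the criterion stated above; an odd K avoids tied votes in the two-class case.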

After training each classifier on the training set, we evaluated each one on the validation set. IBk K-Nearest Neighbor was the best algorithm because it correctly classified the most All-Stars on the validation set. The table below shows the accuracy and the number of correctly classified All-Stars for each model.

Although K-Nearest Neighbors did not yield the highest accuracy on the validation set, we still believe it is the best model because accuracy can be a misleading metric here. There are many more non-All-Stars than All-Stars, so a model can score a high accuracy while identifying few All-Stars; predicting All-Stars correctly is both harder for the models and more valuable for the analysis. On our test set, K-Nearest Neighbors reached 81% accuracy, with 1,865 of 2,300 players correctly classified and 195 of 479 All-Stars correctly classified.
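To make the point about accuracy concrete, the short calculation below works only from the four test-set numbers reported above. It assumes the 479 All-Stars are counted among the 2,300 test players; the remaining quantities are derived by subtraction, and the variable names are ours.

```python
# Figures reported for K-Nearest Neighbors on the test set.
total_players = 2300
total_correct = 1865
actual_all_stars = 479
all_stars_correct = 195

# Overall accuracy and the share of real All-Stars that were found (recall).
accuracy = total_correct / total_players
recall = all_stars_correct / actual_all_stars

# Derived counts (assumption: the 479 All-Stars are part of the 2,300 players).
non_all_stars = total_players - actual_all_stars
non_all_stars_correct = total_correct - all_stars_correct
predicted_but_not_selected = non_all_stars - non_all_stars_correct
selected_but_not_predicted = actual_all_stars - all_stars_correct

# Accuracy of a trivial model that never predicts "All-Star".
baseline_accuracy = non_all_stars / total_players

# Share of predicted All-Stars who really were All-Stars (precision).
precision = all_stars_correct / (all_stars_correct + predicted_but_not_selected)

print(f"accuracy           {accuracy:.3f}")
print(f"baseline accuracy  {baseline_accuracy:.3f}")
print(f"recall             {recall:.3f}")
print(f"precision          {precision:.3f}")
print("predicted but not selected:", predicted_but_not_selected)
print("selected but not predicted:", selected_but_not_predicted)
```

Under that assumption, a model that never predicts an All-Star would already score about 79% accuracy, which is why the count of correctly classified All-Stars is the more informative number for comparing models.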

Results

We looked at the average statistics of three types of players: Deserving All-Stars (All-Stars the classifier correctly predicted), Snubs (players who were predicted to be All-Stars but didn't make the team) and Undeserving All-Stars (All-Stars who weren't predicted to make the team). The figure below shows a side-by-side comparison of the three categories of players.

On average, deserving All-Stars had the best statistics, followed by snubs and then undeserving All-Stars. It came as no surprise that deserving All-Stars led: the players in this category were “no-brainer” All-Stars who often vied for more prestigious awards such as the Most Valuable Player (MVP) trophy.

Examples of Deserving All-Stars: Barry Bonds in 2001, Lou Gehrig in 1936
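The three-way comparison can be sketched as follows: each player-season is labeled by comparing the classifier's prediction with the actual selection, and statistics are then averaged within each label. This is an illustrative sketch only; the function names are hypothetical and the toy records are made up, not taken from our results.

```python
from collections import defaultdict
from statistics import mean


def category(predicted, actual):
    """Label a player-season from the prediction and the real selection."""
    if predicted and actual:
        return "Deserving All-Star"
    if predicted and not actual:
        return "Snub"
    if actual and not predicted:
        return "Undeserving All-Star"
    return None  # correctly predicted non-All-Star; not part of the comparison


def average_by_category(records, stat_names):
    """Average each named statistic within each of the three categories."""
    grouped = defaultdict(list)
    for stats, predicted, actual in records:
        label = category(predicted, actual)
        if label is not None:
            grouped[label].append(stats)
    return {
        label: {name: mean(s[name] for s in group) for name in stat_names}
        for label, group in grouped.items()
    }


# Made-up toy records: (stats, predicted_all_star, actual_all_star).
toy_records = [
    ({"H": 180, "HR": 40}, True, True),
    ({"H": 160, "HR": 30}, True, True),
    ({"H": 150, "HR": 25}, True, False),
    ({"H": 110, "HR": 12}, False, True),
    ({"H": 90, "HR": 5}, False, False),
]

averages = average_by_category(toy_records, ["H", "HR"])
for label, stats in averages.items():
    print(label, stats)
```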

It is interesting that snubs, on average, had better statistics than many players who made the All-Star team. We identified three trends that explain why these players didn't make the team.

  1. Relatively Unknown Players Without "Flashy" Stats

    • Matt Duffy in 2015

  2. Players With Much Better Second Halves

    • Chipper Jones in 1999

  3. Unexplainable Anomalies

    • Hank Greenberg in 1935

    • Fans should be angry at these!

We also found trends that explained why some undeserving players made the All-Star team.

  1. Iconic Players Past Their Prime

    • Joe DiMaggio and Carl Yastrzemski at the end of their careers

  2. Players With Much Better First Halves

    • Kosuke Fukudome in 2008

  3. Bad Teams' Representative

    • Ken Harvey in 2004

    • MLB rule that each team needs an All-Star

These trends help us understand why otherwise confusing selections occur, and they explain why some baseball fans get mad every year after the All-Star selections are announced.

 
