Classifying NCAA Basketball Players

Jaden
5 min read · Feb 16, 2021
Logo from https://freebiesupply.com/logos/ncaa-logo/

Every kid who has picked up a basketball has dreamt of making it pro. Hearing the commissioner call your name and shaking his hand while the world watches marks a lifelong accomplishment. This event is the NBA draft, where 30 teams select from a pool of eligible players. For those selected, it is a dream come true; those selecting hope for an addition that could change the course of their franchise for the better. The draft can have a critical impact on a team’s success, and in a landscape where play style is changing rapidly, acquiring good players is more important than ever. In my attempt to simplify this problem just a tiny bit, I classified former collegiate athletes as drafted or undrafted: those who had their names called were drafted, and those who didn’t were undrafted. I approached this from an analytical perspective, using machine learning models that predict solely from players’ career statistics in college games.

Data Collection

The data collection process was quite simple. Using the SportsReferences Database, I collected career statistics for over 30 thousand players from 2000 to 2018, including career Points, Rebounds, Assists, and more. I also incorporated a Kaggle dataset of every player drafted since 1947, which I used to create my target variable.
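To make the target variable concrete, here is a minimal sketch of the labeling step. The rows, the field names, and the `add_draft_label` helper are all made up for illustration; the real datasets have their own schemas, and this is not the exact code from my notebook.

```python
# Toy rows standing in for the scraped college stats (names and numbers are made up)
college_stats = [
    {"player": "Player A", "points": 15.2, "rebounds": 6.3, "assists": 2.1},
    {"player": "Player B", "points": 8.1, "rebounds": 2.0, "assists": 4.5},
    {"player": "Player C", "points": 11.4, "rebounds": 4.8, "assists": 1.0},
]
# Toy stand-in for the draft history dataset
draft_history = [{"player": "Player A"}, {"player": "Player C"}]


def add_draft_label(stats, drafted):
    """Return copies of the stat rows with a binary 'drafted' target (1 = name found in the draft list)."""
    drafted_names = {row["player"].strip().lower() for row in drafted}
    return [
        {**row, "drafted": int(row["player"].strip().lower() in drafted_names)}
        for row in stats
    ]


labeled = add_draft_label(college_stats, draft_history)
print([(row["player"], row["drafted"]) for row in labeled])
```

One caveat with this approach: matching on names alone can mislabel players who share a name, so a real pipeline would likely need a stronger key, such as name plus college or draft year.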

Data Exploration

In recent years, we have witnessed a shift in the way NBA games are played, from big, dominant Centers playing under the rim to a Guard-dominated league that lives beyond the arc. Even without looking at stats, we can tell teams are shooting far more threes than ever before. Centers who can shoot the three are praised for their unique skillset, and those who can’t are out of a job. It’s no surprise that since 2012 Guards and Forwards have been 3x more likely to be drafted than Centers.

Given the size of the pool of Centers, we might say this isn’t really a concern, but digging deeper we begin to see a trend. Grouping the picks by year, we see that at the beginning of the decade the drafted positions were fairly balanced, followed by a big drop-off.
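The grouping itself is simple to reproduce. The sketch below shows one way to compute each position’s share of picks per year; the records are invented purely to illustrate the calculation and are not the actual draft data, and `position_share_by_year` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Made-up draft records for illustration only: (draft_year, position)
picks = [
    (2010, "G"), (2010, "F"), (2010, "C"), (2010, "C"),
    (2018, "G"), (2018, "G"), (2018, "F"), (2018, "C"),
]


def position_share_by_year(records):
    """Map each year to the fraction of that year's picks at each position."""
    counts = defaultdict(Counter)
    for year, position in records:
        counts[year][position] += 1
    return {
        year: {pos: n / sum(c.values()) for pos, n in c.items()}
        for year, c in counts.items()
    }


shares = position_share_by_year(picks)
for year in sorted(shares):
    print(year, shares[year])
```

Using shares rather than raw counts keeps the years comparable even if the number of labeled picks differs from one draft class to the next.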

One of the biggest reasons is a stylistic change in the way the game is played. As an NBA fan growing up in the early 2000s, I watched players like Shaquille O'Neal, Tim Duncan, and Ben Wallace dominate the league with their size. The game was played inside out, where the best shot a team could take was the one closest to the basket. Nowadays, analytics has given players the green light to shoot from further and further out. Since 2012, the league has averaged record-breaking attempts and makes for 3-pointers each year.

Fun Fact: As an incentive to increase scoring and neutralize the physicality, the NBA shortened the 3-point line ahead of the ‘94–95 season. This led to a huge spike in 3-point field goals, but the line was reverted in ’97 to the standard 23 ft 9 in when scoring averages didn’t rise as expected.

Support Vector Machine

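import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import ConfusionMatrixDisplay

# Fit an SVM with default hyperparameters (RBF kernel)
clf = SVC(random_state=4)
clf.fit(X_train, y_train)

# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator replaces it
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, display_labels=['Undrafted', 'Drafted'])
plt.xlabel('Pred')
plt.ylabel('Actual');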

TP : 211 Correctly predicted as Drafted
TN : 185 Correctly predicted as Undrafted
FP : 68 Incorrectly predicted as Drafted
FN : 39 Incorrectly predicted as Undrafted

Accuracy : 78.7%
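The accuracy figure follows directly from the four counts above. As a quick check, the sketch below recomputes it and also derives precision and recall for the Drafted class. The `summarize` helper is my own name for illustration, not something from scikit-learn.

```python
def summarize(tp, tn, fp, fn):
    """Derive basic metrics from confusion-matrix counts (positive class = Drafted)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all players classified correctly
        "precision": tp / (tp + fp),    # of those predicted Drafted, how many were drafted
        "recall": tp / (tp + fn),       # of those actually drafted, how many were found
    }


# Counts reported for the default SVC
baseline = summarize(tp=211, tn=185, fp=68, fn=39)
for name, value in baseline.items():
    print(f"{name}: {value:.1%}")
```

With these counts it prints 78.7% accuracy, 75.6% precision, and 84.4% recall, so the default model finds most drafted players but is somewhat generous in predicting Drafted.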

Support Vector Machines are known to be pretty good straight out of the box. Our model gave us an accuracy of 78.7%, but we can do better. Using Grid Search, we can tune the parameters to find the ‘C’ and ‘gamma’ that maximize accuracy.

from sklearn.model_selection import GridSearchCV

param_grid = [{'C': [0.5, 10, 100, 1000],
               'gamma': [1, 0.1, 0.001, 0.0001]}]
optimal_params = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
# output: {'C': 1000, 'gamma': 0.1}

Using the newly optimized parameters, our results improved slightly:

TP : 207 Correctly predicted as Drafted
TN : 193 Correctly predicted as Undrafted
FP : 60 Incorrectly predicted as Drafted
FN : 43 Incorrectly predicted as Undrafted

Accuracy : 79.5%
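To see where the gain comes from, it helps to compare the two confusion matrices cell by cell. This is a small sketch using only the counts reported in this post; the dictionary layout and helper functions are my own.

```python
# Counts reported above for the default and the grid-searched SVC
default = {"TP": 211, "TN": 185, "FP": 68, "FN": 39}
tuned = {"TP": 207, "TN": 193, "FP": 60, "FN": 43}


def accuracy(m):
    """Correct predictions divided by all predictions."""
    return (m["TP"] + m["TN"]) / sum(m.values())


def precision(m):
    """Share of predicted Drafted players who were actually drafted."""
    return m["TP"] / (m["TP"] + m["FP"])


# Change in each cell after tuning (positive = more players in that cell)
delta = {cell: tuned[cell] - default[cell] for cell in default}
print(delta)
print(f"accuracy:  {accuracy(default):.1%} -> {accuracy(tuned):.1%}")
print(f"precision: {precision(default):.1%} -> {precision(tuned):.1%}")
```

The tuned model gives up four true positives in exchange for eight fewer false positives, so it is a bit more conservative about predicting Drafted. Precision rises from about 75.6% to 77.5%, and the net gain of four correct predictions out of 503 accounts for the move from 78.7% to 79.5% accuracy.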

For the source code and other models, check out my GitHub.

Conclusion

In a landscape where the style of play is evolving at such a rapid pace, the ability to understand the players of the upcoming generation becomes integral. The barrier is the NBA Draft, where capable players are ushered into the next stage of competition. Separating the capable from the incapable is a big decision for those who run multi-million dollar teams, and it can be a huge deciding factor in winning a championship. A lot of further research is needed before the models can be deemed useful or actionable, but I think this is a great place to learn and understand the problem. With that being said, my attempts at classifying drafted versus undrafted players had an accuracy between 70–80% depending on the model. The most accurate was the Support Vector Machine at 80%, with Logistic Regression following at 79%. Even at 80% accuracy, there are potential avenues for improvement. In conclusion, out of the 3 models shown, the best model for our problem is SVM, because we are aiming for a high True Positive count and a low False Positive count.
