Olympics Data Analysis for Triathlon Sport
7 min readMay 21, 2023
Overview
This analysis aims to determine if there are patterns in the characteristics of Olympic medal victors across categories such as nationality, age, height, etc. I utilized statistical analysis to determine whether or not any of these factors were significantly correlated with Olympic medalists and to identify the most predictive characteristics of Olympic success.
Data Details
Dataset from Kaggle: 120 years of Olympic history athletes and results
data = pd.read_csv('athlete_events.csv')
data.head(5)
data.describe()
Let’s Explore the data based on the specific sport “Triathlon.”
#Triathlon sport for all countries
triathlon = data.loc[(data["Sport"] == 'Triathlon')]
total_count = triathlon.groupby("Season")[["Team", "NOC", "Event", "City", "Sport"]].nunique().reset_index()
total_count
Medal count
triathlon['Medal'].value_counts()
plt.figure(figsize=(6,4))
sns.countplot(x='Medal', data=triathlon)
plt.show()
# Medals for Top 20 countries in summer olympics
total_medals_by_country = triathlon[triathlon['Season']=='Summer'].groupby(['NOC','Medal'])['Sex'].count().reset_index()
total_medals_by_country=total_medals_by_country.pivot('NOC','Medal','Sex').fillna(0).sort_values(by='Gold',ascending=False).head(20)
#total_medals_by_country
total_medals_by_country.plot(kind='bar')
fig = plt.gcf()
fig.set_size_inches(10, 7)
plt.xticks(rotation=90)
plt.xlabel('Country')
plt.ylabel("Athelete count")
plt.title('Medals by Country- Summer Olympics ')
plt.show()
Men Medal Analysis
Men Age Analysis who won the medal
Let’s see which countries have the highest proportions of medal earners throughout the years as men.
##Ploting data for total males participated per year
plt.figure(figsize=(7,4))
plotm = sns.barplot(x="Year", y="Total Males Per Year", data=year_df)
##Ploting data for medal winners
plt.figure(figsize=(8,5))
plota = sns.barplot(x="Year", y="Medal Winners", data=year_df)
Men Analysis by Weight
mdf = triathlon[triathlon['Sex'] == 'M']
mmedals = medals[medals['Sex'] == 'M']
medalweight = mmedals['Weight'].value_counts()
weighttotal = mdf['Weight'].value_counts()
weightdt = pd.DataFrame(weighttotal)
weightdt.reset_index()
weightdt['Medal Winners'] = pd.Series(medalweight)
weightdt = weightdt.rename(index=str, columns={"Weight": "Total Males Per Weight"})
weightdt['Percentage'] = weight_df['Medal Winners'] / weight_df['Total Males Per Weight'] * 100
weightdt = weightdt.dropna(axis=0)
weightdt = weightdt.reset_index()
weightdt = weightdt.rename(index=str, columns={"index": "Weight"})
weightdt = weightdt.sort_values(by=['Medal Winners'], ascending=False)
weightdt.head()
# plot for data for percentage
plt.figure(figsize=(10,5))
plot_tm = sns.barplot(x="Weight", y="Total Males Per Weight", data=weightdt)
plt.figure(figsize=(10,5))
plot_mwm = sns.barplot(x="Weight", y="Medal Winners", data=weightdt)
plt.figure(figsize=(10,5))
plot_pctm = sns.barplot(x="Weight", y="Percentage", data=weightdt)
mxmedal_weight=medals_wgt_athlete.groupby(['Weight'])['Total Medals'].max().reset_index(name='Max Medals')
plt.figure(figsize=(10,5))
plot_t = sns.barplot(x="Weight", y= "Max Medals", data=mxmedal_weight)
mxmedal_weight.head()
Weight between 70 to 75 got the highest medals for Men.
Men’s Height Analysis
medal_ht = medals['Height'].value_counts()
ht_total = mdf['Height'].value_counts()
ht_total = pd.DataFrame(ht_total)
ht_total.reset_index()
ht_total['Medal Winners'] = pd.Series(medal_ht)
ht_total = ht_total.rename(index=str, columns={"Height": "Total Males"})
ht_total['Percentage'] = ht_total['Medal Winners'] / ht_total['Total Males'] * 100
ht_total = ht_total.dropna(axis=0)
ht_total = ht_total.reset_index()
ht_total = ht_total.rename(index=str, columns={"index": "Height"})
ht_total = ht_total.sort_values(by=['Medal Winners'], ascending=False)
print(ht_total.head())
##Ploting data for medal winners
plt.figure(figsize=(10,5))
plot_mwf = sns.barplot(x="Height", y="Medal Winners", data=ht_total)
Women Medal Analysis
# Get the counts for female medal earner
fmedal = fmedals['Sex'].value_counts()
# Get the total counts for female overall for all competitors
ftotal = fdf['Sex'].value_counts()
# Create a new dataframe with the data from the total females
fdf = pd.DataFrame(ftotal)
fdf.reset_index()
# Add the data for the winners and calculate the percentages for each female
fdf['Medal Winners'] = pd.Series(fdf)
fdf = fdf.rename(index=str, columns={"Sex": "Total Females"})
fdf['Percentage'] = fdf['Medal Winners'] / fdf['Total Females'] * 100
fdf = fdf.dropna(axis=0)
fdf.index.names = ['Sex']
fdf.head()
Women’s Age Analysis for Winning Medal
# Get the counts of each age for each female medal earner
fmedal_ages = fmedals['Age'].value_counts()
# Get the total counts for age overall for all female competitors
fage_total = f_df['Age'].value_counts()
# Create a new dataframe with the data from the total ages
fage_df = pd.DataFrame(fage_total)
fage_df.reset_index()
# Add the data for the winners and calculate the percentages for each age
fage_df['Winners'] = pd.Series(fage_df)
fage_df = fage_df.rename(index=str, columns={"Age": "Total"})
fage_df['Percentage'] = fage_df['Winners'] / fage_df['Total'] * 100
fage_df = fage_df.dropna(axis=0)
fage_df = fage_df.reset_index()
fage_df = fage_df.rename(index=str, columns={"index": "Age"})
# Convert all of the ages from floats to ints
fage_df['Age'] = pd.to_numeric(fage_df['Age'], errors='coerce')
fage_df['Age'] = fage_df['Age'].astype(int)
# Plot the data
plt.figure(figsize=(10,5))
Female weight Analysis
fmedal_wt = fmedals['Weight'].value_counts()
fwt_total = f_df['Weight'].value_counts()
fwt_df = pd.DataFrame(fwt_total)
fwt_df.reset_index()
fwt_df['Medal Winners'] = pd.Series(fwt_df)
fwt_df = fwt_df.rename(index=str, columns={"Weight": "Total Females Per Weight"})
fwt_df['Percentage'] = fwt_df['Medal Winners'] / fwt_df['Total Females Per Weight'] * 100
fwt_df = fwt_df.dropna(axis=0)
fwt_df = fwt_df.reset_index()
fwt_df = fwt_df.rename(index=str, columns={"index": "Weight"})
fwt_df = fwt_df.sort_values(by=['Medal Winners'], ascending=False)
print(fwt_df.head())
plt.figure(figsize=(10,5))
plot_tf = sns.barplot(x="Weight", y="Total Females Per Weight", data=fwt_df)
plt.figure(figsize=(10,5))
plot_mwf = sns.barplot(x="Weight", y="Medal Winners", data=fwt_df)
plt.figure(figsize=(10,5))
plot_pctf = sns.barplot(x="Weight", y="Percentage", data=fwt_df)
Women’s height analysis
fmedal_ht = fmedals['Height'].value_counts()
fht_total = f_df['Height'].value_counts()
fht_df = pd.DataFrame(fht_total)
fht_df.reset_index()
fht_df['Medal Winners'] = pd.Series(fht_df)
fht_df = f_height_df.rename(index=str, columns={"Height": "Total Females"})
fht_df['Percentage'] = fht_df['Medal Winners'] / fht_df['Total Females'] * 100
fht_df = fht_df.dropna(axis=0)
fht_df = fht_df.reset_index()
fht_df = fht_df.rename(index=str, columns={"index": "Height"})
fht_df = fht_df.sort_values(by=['Medal Winners'], ascending=False)
print(fht_df.head())
##Ploting data for medal winners
plt.figure(figsize=(10,5))
plot_mwf = sns.barplot(x="Height", y="Medal Winners", data=fht_df)
Countries WON the maximum number of Gold Medals in the last held Olympic competitions.
# Visulaizing the countries with most Gold medals
plt.figure(figsize=(12,5))
sns.barplot(x=max_year_country, y=max_year_country.index)
plt.xlabel("Top 20 Gold Medal Won Countrywise")
Data Cleaning
print("data shape",triathlon.shape)
print("\n")
print("Total null values in data:", triathlon.isnull().sum())
# Binary Encoding for Medal
triathlon['Medal'] = triathlon['Medal'].apply(lambda x: 1 if str(x) != 'nan' else 0)
# groupby for medal and gender then calculate mean for numerical features
triathlon.groupby(['Medal', 'Sex']).mean().astype(np.int)
missing_values_columns = [col for col in triathlon.columns if triathlon.isnull().sum()[col] > 0]
missing_values_columns
# function for missing values substitution
def fill_missing_values(data,missing_values_columns):
df = data.copy()
for col in missing_values_columns:
df[col] = df.groupby(['Medal', 'Sex'])[col].apply(lambda x: x.fillna(x.mean()).astype(np.int))
return df
# lets use this function to fill the missing values
triathlon = fill_missing_values(triathlon,missing_values_columns)
for column in ['Age', 'Height', 'Weight']:
triathlon[column] = triathlon.groupby(['Medal', 'Sex'])[column].apply(lambda x: x.fillna(x.mean()).astype(np.int))
# Checking null values again
print("Total missing values:", triathlon.isna().sum().sum())
Total missing values: 0
data['Sex'] = data['Sex'].replace('F', 1)
data['Sex'] = data['Sex'].replace('M', 0)
# Drop Uncessary feature columns
data = data.drop(['ID', 'Name', 'Games', 'Season'], axis=1)
Feature Selection
#Apply SelectKBest and extract top 10 features out of the data
best = SelectKBest(score_func=chi2, k=10)
fit = best.fit(X,y)
data_scores=pd.DataFrame(fit.scores_)
data_columns=pd.DataFrame(X.columns)
# Join the two dataframes
scores=pd.concat([data_columns,data_scores],axis=1)
scores.columns=['Feature','Score']
print(scores.nlargest(10,'Score'))
Feature Score
53 T_Switzerland 9.595257
105 N_SUI 9.595257
6 T_Australia 7.257331
24 T_Great Britain 7.257331
61 N_AUS 7.257331
79 N_GBR 7.257331
52 T_Sweden 2.793604
107 N_SWE 2.793604
41 T_New Zealand 2.092209
98 N_NZL 2.092209
# Select 10 features
features=scores["Feature"].tolist()[:10]
features
['Sex',
'Age',
'Height',
'Weight',
'Year',
'T_Argentina',
'T_Australia',
'T_Austria',
'T_Azerbaijan',
'T_Barbados']
#create new data frame for modeling
data=data[['Age','Sex','Height','Weight','Year','T_Argentina','T_Barbados',
'T_Australia','T_Australia','T_Australia', 'T_United States','Medal']]
data.head()
y = data['Medal']
X = data.drop(['Medal'], axis=1)
#Split data into training and testing data
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.4, random_state=1)
# Scaling data
scaler = MinMaxScaler()
train_x = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)
Model Prediction
1. Logistic Regression
algo1 = 'LogisticRegression'
lr = LogisticRegression(random_state=1, max_iter=1000)
lr.fit(train_x, train_y)
lr_predict = lr.predict(test_x)
lr_conf_matrix = confusion_matrix(test_y, lr_predict)
lr_acc_score = accuracy_score(test_y, lr_predict)
print("confusion matrix")
print(lr_conf_matrix)
print("\n")
print("Accuracy of Logistic Regression:",lr_acc_score*100,'\n')
print(classification_report(test_y,lr_predict))
Conclusion & Discussion
- From this analysis, we discover that men around the average age of 24 won more medals, whereas women won the most medals between the ages of 27 and 35.
- According to our weight analysis, men and women who weigh between 60 and 80 pounds are more likely to win medals, but this is likely because more medals are available for this weight range, and more athletes are competing in this weight range.
- The height and weight analysis was the most eye-opening. Traditionally, we presumed that heavier and taller athletes were physically more capable of winning, but our research indicates this is false.
- It appears that people of average weight and height are more successful, which makes sense given that there are more people of average height, but we were still surprised that their larger counterparts did not dominate them. There was no significant correlation between an Olympic athlete’s weight, height, and performance. The year the athlete competed had no significant effect on whether or not the athlete won a medal.
References:
- https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results
- https://www.tableau.com/data-insights/dashboard-showcase
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- https://statisticsglobe.com/dataframe-manipulation-using-pandas-python