Olympics Data Analysis for Triathlon Sport

7 min readMay 21, 2023

Overview

This analysis aims to determine if there are patterns in the characteristics of Olympic medal victors across categories such as nationality, age, height, etc. I utilized statistical analysis to determine whether or not any of these factors were significantly correlated with Olympic medalists and to identify the most predictive characteristics of Olympic success.

Data Details

Dataset from Kaggle: 120 years of Olympic history athletes and results

data = pd.read_csv('athlete_events.csv')
data.head(5)
data.describe()

Let’s Explore the data based on the specific sport “Triathlon.”

#Triathlon sport for all countries
triathlon = data.loc[(data["Sport"] == 'Triathlon')]
total_count = triathlon.groupby("Season")[["Team", "NOC", "Event", "City", "Sport"]].nunique().reset_index()
total_count

EDA Dashboard Links

Maps

Medal count

triathlon['Medal'].value_counts()
plt.figure(figsize=(6,4))
sns.countplot(x='Medal', data=triathlon)
plt.show()

# Medals for Top 20 countries in summer olympics
total_medals_by_country = triathlon[triathlon['Season']=='Summer'].groupby(['NOC','Medal'])['Sex'].count().reset_index()
total_medals_by_country=total_medals_by_country.pivot('NOC','Medal','Sex').fillna(0).sort_values(by='Gold',ascending=False).head(20)
#total_medals_by_country
total_medals_by_country.plot(kind='bar')
fig = plt.gcf()
fig.set_size_inches(10, 7)
plt.xticks(rotation=90)
plt.xlabel('Country')
plt.ylabel("Athelete count")
plt.title('Medals by Country- Summer Olympics ')
plt.show()

Men Medal Analysis

Men Age Analysis who won the medal

Let’s see which countries have the highest proportions of medal earners throughout the years as men.

##Ploting data for total males participated per year
plt.figure(figsize=(7,4))
plotm = sns.barplot(x="Year", y="Total Males Per Year", data=year_df)
##Ploting data for medal winners
plt.figure(figsize=(8,5))
plota = sns.barplot(x="Year", y="Medal Winners", data=year_df)

Men Analysis by Weight

mdf = triathlon[triathlon['Sex'] == 'M']
mmedals = medals[medals['Sex'] == 'M']
medalweight = mmedals['Weight'].value_counts()
weighttotal = mdf['Weight'].value_counts()
weightdt = pd.DataFrame(weighttotal)
weightdt.reset_index()
weightdt['Medal Winners'] = pd.Series(medalweight)
weightdt = weightdt.rename(index=str, columns={"Weight": "Total Males Per Weight"})
weightdt['Percentage'] = weight_df['Medal Winners'] / weight_df['Total Males Per Weight'] * 100
weightdt = weightdt.dropna(axis=0)
weightdt = weightdt.reset_index()
weightdt = weightdt.rename(index=str, columns={"index": "Weight"})
weightdt = weightdt.sort_values(by=['Medal Winners'], ascending=False)
weightdt.head()
# plot for data for percentage
plt.figure(figsize=(10,5))
plot_tm = sns.barplot(x="Weight", y="Total Males Per Weight", data=weightdt)

plt.figure(figsize=(10,5))
plot_mwm = sns.barplot(x="Weight", y="Medal Winners", data=weightdt)

plt.figure(figsize=(10,5))
plot_pctm = sns.barplot(x="Weight", y="Percentage", data=weightdt)

mxmedal_weight=medals_wgt_athlete.groupby(['Weight'])['Total Medals'].max().reset_index(name='Max Medals')
plt.figure(figsize=(10,5))
plot_t = sns.barplot(x="Weight", y= "Max Medals", data=mxmedal_weight)
mxmedal_weight.head()

Weight between 70 to 75 got the highest medals for Men.

Men’s Height Analysis

medal_ht = medals['Height'].value_counts()
ht_total = mdf['Height'].value_counts()
ht_total = pd.DataFrame(ht_total)
ht_total.reset_index()
ht_total['Medal Winners'] = pd.Series(medal_ht)
ht_total = ht_total.rename(index=str, columns={"Height": "Total Males"})
ht_total['Percentage'] = ht_total['Medal Winners'] / ht_total['Total Males'] * 100
ht_total = ht_total.dropna(axis=0)
ht_total = ht_total.reset_index()
ht_total = ht_total.rename(index=str, columns={"index": "Height"})
ht_total = ht_total.sort_values(by=['Medal Winners'], ascending=False)
print(ht_total.head())

##Ploting data for medal winners
plt.figure(figsize=(10,5))
plot_mwf = sns.barplot(x="Height", y="Medal Winners", data=ht_total)

Women Medal Analysis

# Get the counts for female medal earner
fmedal = fmedals['Sex'].value_counts()

# Get the total counts for female overall for all competitors
ftotal = fdf['Sex'].value_counts()

# Create a new dataframe with the data from the total females
fdf = pd.DataFrame(ftotal)
fdf.reset_index()

# Add the data for the winners and calculate the percentages for each female
fdf['Medal Winners'] = pd.Series(fdf)
fdf = fdf.rename(index=str, columns={"Sex": "Total Females"})
fdf['Percentage'] = fdf['Medal Winners'] / fdf['Total Females'] * 100
fdf = fdf.dropna(axis=0)
fdf.index.names = ['Sex']
fdf.head()

Women’s Age Analysis for Winning Medal

# Get the counts of each age for each female medal earner
fmedal_ages = fmedals['Age'].value_counts()

# Get the total counts for age overall for all female competitors
fage_total = f_df['Age'].value_counts()

# Create a new dataframe with the data from the total ages
fage_df = pd.DataFrame(fage_total)
fage_df.reset_index()

# Add the data for the winners and calculate the percentages for each age
fage_df['Winners'] = pd.Series(fage_df)
fage_df = fage_df.rename(index=str, columns={"Age": "Total"})
fage_df['Percentage'] = fage_df['Winners'] / fage_df['Total'] * 100
fage_df = fage_df.dropna(axis=0)
fage_df = fage_df.reset_index()
fage_df = fage_df.rename(index=str, columns={"index": "Age"})

# Convert all of the ages from floats to ints
fage_df['Age'] = pd.to_numeric(fage_df['Age'], errors='coerce')
fage_df['Age'] = fage_df['Age'].astype(int)

# Plot the data
plt.figure(figsize=(10,5))

Female weight Analysis

fmedal_wt = fmedals['Weight'].value_counts()
fwt_total = f_df['Weight'].value_counts()
fwt_df = pd.DataFrame(fwt_total)
fwt_df.reset_index()
fwt_df['Medal Winners'] = pd.Series(fwt_df)
fwt_df = fwt_df.rename(index=str, columns={"Weight": "Total Females Per Weight"})
fwt_df['Percentage'] = fwt_df['Medal Winners'] / fwt_df['Total Females Per Weight'] * 100
fwt_df = fwt_df.dropna(axis=0)
fwt_df = fwt_df.reset_index()
fwt_df = fwt_df.rename(index=str, columns={"index": "Weight"})
fwt_df = fwt_df.sort_values(by=['Medal Winners'], ascending=False)
print(fwt_df.head())

plt.figure(figsize=(10,5))
plot_tf = sns.barplot(x="Weight", y="Total Females Per Weight", data=fwt_df)

plt.figure(figsize=(10,5))
plot_mwf = sns.barplot(x="Weight", y="Medal Winners", data=fwt_df)

plt.figure(figsize=(10,5))
plot_pctf = sns.barplot(x="Weight", y="Percentage", data=fwt_df)

Women’s height analysis

fmedal_ht = fmedals['Height'].value_counts()
fht_total = f_df['Height'].value_counts()
fht_df = pd.DataFrame(fht_total)
fht_df.reset_index()
fht_df['Medal Winners'] = pd.Series(fht_df)
fht_df = f_height_df.rename(index=str, columns={"Height": "Total Females"})
fht_df['Percentage'] = fht_df['Medal Winners'] / fht_df['Total Females'] * 100
fht_df = fht_df.dropna(axis=0)
fht_df = fht_df.reset_index()
fht_df = fht_df.rename(index=str, columns={"index": "Height"})
fht_df = fht_df.sort_values(by=['Medal Winners'], ascending=False)
print(fht_df.head())

##Ploting data for medal winners
plt.figure(figsize=(10,5))
plot_mwf = sns.barplot(x="Height", y="Medal Winners", data=fht_df)

Countries WON the maximum number of Gold Medals in the last held Olympic competitions.

# Visulaizing the countries with most Gold medals
plt.figure(figsize=(12,5))
sns.barplot(x=max_year_country, y=max_year_country.index)
plt.xlabel("Top 20 Gold Medal Won Countrywise")

Data Cleaning

print("data shape",triathlon.shape)
print("\n")
print("Total null values in data:", triathlon.isnull().sum())
# Binary Encoding for Medal
triathlon['Medal'] = triathlon['Medal'].apply(lambda x: 1 if str(x) != 'nan' else 0)
# groupby for medal and gender then calculate mean for numerical features 
triathlon.groupby(['Medal', 'Sex']).mean().astype(np.int)

missing_values_columns = [col for col in triathlon.columns if triathlon.isnull().sum()[col] > 0]
missing_values_columns

# function for missing values substitution
def fill_missing_values(data,missing_values_columns):
    df = data.copy()
    for col in missing_values_columns:
        df[col] = df.groupby(['Medal', 'Sex'])[col].apply(lambda x: x.fillna(x.mean()).astype(np.int))
    return df

# lets use this function to fill the missing values
triathlon = fill_missing_values(triathlon,missing_values_columns)
for column in ['Age', 'Height', 'Weight']:
    triathlon[column] = triathlon.groupby(['Medal', 'Sex'])[column].apply(lambda x: x.fillna(x.mean()).astype(np.int))

# Checking null values again
print("Total missing values:", triathlon.isna().sum().sum())

Total missing values: 0

data['Sex'] = data['Sex'].replace('F', 1)
data['Sex'] = data['Sex'].replace('M', 0)
# Drop Uncessary feature columns
data = data.drop(['ID', 'Name', 'Games', 'Season'], axis=1)

Feature Selection

#Apply SelectKBest and extract top 10 features out of the data
best = SelectKBest(score_func=chi2, k=10)
fit = best.fit(X,y)
data_scores=pd.DataFrame(fit.scores_)
data_columns=pd.DataFrame(X.columns)
# Join the two dataframes
scores=pd.concat([data_columns,data_scores],axis=1)
scores.columns=['Feature','Score']
print(scores.nlargest(10,'Score'))

Feature     Score
53     T_Switzerland  9.595257
105            N_SUI  9.595257
6        T_Australia  7.257331
24   T_Great Britain  7.257331
61             N_AUS  7.257331
79             N_GBR  7.257331
52          T_Sweden  2.793604
107            N_SWE  2.793604
41     T_New Zealand  2.092209
98             N_NZL  2.092209


# Select 10 features
features=scores["Feature"].tolist()[:10]
features

['Sex',
 'Age',
 'Height',
 'Weight',
 'Year',
 'T_Argentina',
 'T_Australia',
 'T_Austria',
 'T_Azerbaijan',
 'T_Barbados']

#create new data frame for modeling
data=data[['Age','Sex','Height','Weight','Year','T_Argentina','T_Barbados',
           'T_Australia','T_Australia','T_Australia', 'T_United States','Medal']]
data.head()

y = data['Medal']
X = data.drop(['Medal'], axis=1)
#Split data into training and testing data
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.4, random_state=1)

# Scaling data 
scaler = MinMaxScaler()
train_x = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)

Model Prediction

1. Logistic Regression

algo1 = 'LogisticRegression'
lr = LogisticRegression(random_state=1, max_iter=1000)
lr.fit(train_x, train_y)
lr_predict = lr.predict(test_x)
lr_conf_matrix = confusion_matrix(test_y, lr_predict)
lr_acc_score = accuracy_score(test_y, lr_predict)
print("confusion matrix")
print(lr_conf_matrix)
print("\n")
print("Accuracy of Logistic Regression:",lr_acc_score*100,'\n')
print(classification_report(test_y,lr_predict))

Conclusion & Discussion

From this analysis, we discover that men around the average age of 24 won more medals, whereas women won the most medals between the ages of 27 and 35.
According to our weight analysis, men and women who weigh between 60 and 80 pounds are more likely to win medals, but this is likely because more medals are available for this weight range, and more athletes are competing in this weight range.
The height and weight analysis was the most eye-opening. Traditionally, we presumed that heavier and taller athletes were physically more capable of winning, but our research indicates this is false.
It appears that people of average weight and height are more successful, which makes sense given that there are more people of average height, but we were still surprised that their larger counterparts did not dominate them. There was no significant correlation between an Olympic athlete’s weight, height, and performance. The year the athlete competed had no significant effect on whether or not the athlete won a medal.

References: