I have been training to run on a 10 km distance. At some point I felt the need to record my progress, thus I bought a Garmin Forerunner 10 GPS. Since then I uploaded my running records on the Garmin connect platform. Although, the platform has nice features to visualize running data, It is not well suited for more in depth analyses such as finding the best pace on a given segment, which segment’s pace has the most influence on the total running time, or which frequency of training has the most influence on the day’s performance.
Project objectives:
I first downloaded all the data uploaded on the garmin connect web site as csv files. I collected all the files into one file named Activities.csv
.
activities = read.csv("c:/Users/mgirardot/Desktop/garmin_data/Activities.csv", stringsAsFactors = F)
head(activities)
## activite Type date Temps Distance
## 1 peggy Course à pied lun., 30 juin 2014 7:03 20:46 2,94
## 2 michael Course à pied lun., 30 juin 2014 5:28 47:08 6,88
## 3 michael Course à pied dim., 29 juin 2014 5:28 57:52 8,37
## 4 michael Course à pied sam., 28 juin 2014 5:26 48:49 6,88
## 5 peggy Course à pied ven., 27 juin 2014 6:58 15:04 2,05
## 6 michael Course à pied ven., 27 juin 2014 6:04 37:04 5,40
## Gain_altitude Perte_altitude Vitesse_moy Vitesse_max Calories X
## 1 23 19 7:04 5:42 241 NA
## 2 47 48 6:51 5:48 573 NA
## 3 61 58 6:55 4:41 693 NA
## 4 48 50 7:06 4:04 573 NA
## 5 12 15 7:20 6:10 170 NA
## 6 35 33 6:52 5:25 443 NA
This dataset is 314 records long and contain the date, running.time, distance, average.pace
and max_pace
columns.
This dataset is rather poor compared to what the GPS is actually recording. Unfortunately it is not possible to acces to the original raw data stored on the garmin servers without paying a 5000$ fee to access the API. However I have acess to the internal storage of my garmin forerunner 10 watch.
Thus I could retrieve 45 runs recorded in the last 3 months.These data are stored as FIT files (Flexible and Interoperable Data). This is a binary format that can be converted to a csv file with the FIT SDK.
> java -jar FitSDKRelease_13.10\java\FitCSVTool.jar -b 56J54309.FIT fit.csv
FIT CSV Tool - Protocol 1.0 Profile 16,10 Release
FIT binary file 56J54309.FIT decoded to fit*.csv files.
This utility produce 2 csv files: fit.csv
and fit_data.csv
fit = read.csv("C:/Users/mgirardot/Desktop/garmin_data/fit.csv", stringsAsFactors = F)
#head(fit)
fit_data = read.csv("C:/Users/mgirardot/Desktop/garmin_data/fit_data.csv", stringsAsFactors = F)
#head(fit_data)
Fit_data.csv
is properly formatted to use as a data.frame.
#select only the records and lap data (tail all but the six first row)
workout = tail(fit_data[,20:61],n = -6)
names(workout)
## [1] "record.timestamp.s."
## [2] "record.position_lat.semicircles."
## [3] "record.position_long.semicircles."
## [4] "record.distance.m."
## [5] "record.speed.m.s."
## [6] "record.enhanced_speed.m.s."
## [7] "lap.timestamp.s."
## [8] "lap.start_time"
## [9] "lap.start_position_lat.semicircles."
## [10] "lap.start_position_long.semicircles."
## [11] "lap.end_position_lat.semicircles."
## [12] "lap.end_position_long.semicircles."
## [13] "lap.total_elapsed_time.s."
## [14] "lap.total_timer_time.s."
## [15] "lap.total_distance.m."
## [16] "lap.message_index"
## [17] "lap.total_calories.kcal."
## [18] "lap.avg_speed.m.s."
## [19] "lap.event"
## [20] "lap.event_type"
## [21] "lap.lap_trigger"
## [22] "lap.sport"
## [23] "lap.enhanced_avg_speed.m.s."
## [24] "event.data"
## [25] "session.timestamp.s."
## [26] "session.start_time"
## [27] "session.start_position_lat.semicircles."
## [28] "session.start_position_long.semicircles."
## [29] "session.total_elapsed_time.s."
## [30] "session.total_timer_time.s."
## [31] "session.total_distance.m."
## [32] "session.message_index"
## [33] "session.total_calories.kcal."
## [34] "session.avg_speed.m.s."
## [35] "session.first_lap_index"
## [36] "session.num_laps"
## [37] "session.event"
## [38] "session.event_type"
## [39] "session.sport"
## [40] "session.sub_sport"
## [41] "session.trigger"
## [42] "session.enhanced_avg_speed.m.s."
The semicircle GPS coordinates are the 32 bit encoded latitude and longitude coordinates. They can be converted back using the formula: dms = semicircles * (180/2^31)
. This dataset is much more fine grained. This will be very useful for an in depth analysis.
plot(workout$record.distance.m./1000, workout$record.speed.m.s*3600/1000, type='l', col="steelblue", ylab = "Speed (km/h)", xlab = "Distance (km)")
library(RgoogleMaps)
# Storing the latitude and longitude for the center of the map in a bbox object
bb=qbbox(as.double(mean(workout$record.position_lat.semicircles.*180/2^31)),as.double(mean(workout$record.position_long.semicircles.*180/2^31)))
#size of the picture
sz=c(550,500)
# Download and save the map
myMap=GetMap.bbox(lat =bb$latR, lon=bb$lonR, destfile = 'mapTest.png', zoom=15, size=sz)
# Draw the paths as red lines on the map
PlotOnStaticMap(myMap, lat = workout$record.position_lat.semicircles.*180/2^31, lon = workout$record.position_long.semicircles.*180/2^31, size=sz, cex=.2, lwd=2, col="red", add=F, FUN=lines)
One obvious complementary data to the running records is weather data. I found one free weather API : OpenWeatherMap. I asked for a free API key. I first did some cleaning of the date field in the Activities.csv
file by splitting it into jour, mois, date2, heure, date2_heure
columns with notepad++. Then the columns were pasted into excel and the file saved in csv.
The procedure to import weather data is described in an IPython notebook.
data = read.csv("C:/Users/mgirardot/Desktop/garmin_data/Activities_format_date_weather.csv", sep="\t")
summary(data)
## X activite
## Min. : 0.00 michael :89
## 1st Qu.: 78.25 Juvignac Course à pied :86
## Median :156.50 Sans titre :59
## Mean :156.50 Malbosc :40
## 3rd Qu.:234.75 peggy :12
## Max. :313.00 michael & peggy & perrine: 6
## (Other) :22
## Type date Temps
## Course à pied :306 dim., 13 juil. 2014 6:06: 2 44:21:00: 4
## Randonnée pédestre: 8 dim., 6 juil. 2014 5:52 : 2 01:00:05: 3
## jeu., 10 juil. 2014 5:27: 2 01:01:15: 3
## jeu., 17 juil. 2014 6:13: 2 01:03:58: 3
## jeu., 3 juil. 2014 5:35 : 2 01:04:08: 3
## jeu., 3 juil. 2014 7:09 : 2 01:06:10: 3
## (Other) :302 (Other) :295
## Distance Gain_altitude Perte_altitude Vitesse_moy Vitesse_max
## 12,75 : 7 114 : 14 110 : 16 05:05 : 14 03:33 : 8
## 11,26 : 6 115 : 13 112 : 16 05:00 : 12 03:31 : 7
## 12,17 : 6 112 : 12 109 : 11 05:06 : 11 03:37 : 6
## 11,25 : 5 111 : 11 111 : 10 05:07 : 11 03:38 : 6
## 11,81 : 5 110 : 10 115 : 9 05:02 : 10 03:42 : 6
## 12,69 : 5 113 : 10 106 : 8 05:12 : 10 03:47 : 6
## (Other):280 (Other):244 (Other):244 (Other):246 (Other):275
## Calories jour mois annee
## Min. : 1.003 Min. : 1.00 juillet :71 Min. :2014
## 1st Qu.:570.250 1st Qu.: 7.00 août :40 1st Qu.:2014
## Median :847.000 Median :15.00 septembre:33 Median :2014
## Mean :716.384 Mean :15.32 juin :25 Mean :2014
## 3rd Qu.:921.750 3rd Qu.:23.00 octobre :25 3rd Qu.:2015
## Max. :984.000 Max. :31.00 mars :21 Max. :2015
## (Other) :99
## date2 heure date2_heure temp
## 12/07/2014: 6 05:49 : 16 01/07/2014 05:57: 2 Min. :264.6
## 03/07/2014: 4 05:50 : 16 02/07/2014 06:43: 2 1st Qu.:283.4
## 07/07/2014: 4 05:51 : 16 03/07/2014 05:35: 2 Median :289.7
## 09/07/2014: 4 05:53 : 13 03/07/2014 07:09: 2 Mean :288.0
## 01/07/2014: 2 05:45 : 10 04/07/2014 05:39: 2 3rd Qu.:293.1
## 02/07/2014: 2 05:48 : 9 05/07/2014 05:28: 2 Max. :302.4
## (Other) :292 (Other):234 (Other) :302 NA's :30
## humidity wind_speed wind_degree clouds_coverage
## Min. : 33.00 Min. : 0.380 Min. : 0.0 Min. : 0.00
## 1st Qu.: 57.00 1st Qu.: 1.330 1st Qu.: 59.5 1st Qu.: 0.00
## Median : 72.00 Median : 2.100 Median :297.8 Median : 8.00
## Mean : 70.56 Mean : 2.664 Mean :212.9 Mean : 31.33
## 3rd Qu.: 84.00 3rd Qu.: 3.735 3rd Qu.:327.6 3rd Qu.: 75.00
## Max. :100.00 Max. :17.000 Max. :360.0 Max. :100.00
## NA's :30 NA's :30 NA's :30 NA's :30
## rain pressure
## Min. : 0.0000 Min. : 967.1
## 1st Qu.: 0.0000 1st Qu.: 987.2
## Median : 0.5162 Median :1001.9
## Mean : 6.2618 Mean :1000.9
## 3rd Qu.: 4.0000 3rd Qu.:1015.0
## Max. :62.0000 Max. :1026.0
## NA's :226 NA's :30
#select only "Course à pied":
run_data = data[data$Type == data[1,3],]
#remove runs from peggy
run_data$activite[1]
## [1] peggy
## 17 Levels: Juvignac Course à pied Malbosc ... Sans titre
run_data_michael = run_data[run_data$activite != run_data$activite[1], ]
#parsing time
vit_moy = unclass(as.POSIXlt(strptime(run_data_michael$Vitesse_moy, format = "%M:%S")))$min*60 +unclass(as.POSIXlt(strptime(run_data_michael$Vitesse_moy, format = "%M:%S")))$sec
run_data_michael$vit_moy_sec = vit_moy
mean(vit_moy)
## [1] 322.7721
library(ggplot2)
a=ggplot(data=run_data_michael, aes(as.Date(date2,format= "%d/%m/%Y"), vit_moy_sec/60)) + geom_point()
a + xlab("Date") + ylab("Average pace (min/km)")+ geom_smooth()
b = ggplot(data=run_data_michael, aes(x = as.Date(date2, format="%d/%m/%Y")))+
geom_line(stat="density", adjust=0.7, colour="steelblue") +
xlim(as.Date("01/07/2014", format="%d/%m/%Y"), as.Date("01/09/2015", format="%d/%m/%Y"))
b+ xlab("Date") + ylab("Density")
The frequency of training has decreased at fall (injuries) and winter (bad weather).
run_data_michael$X =NULL
pairs(run_data_michael[,17:24])
The average speed (s/km) do not presents some obvious correlations with the weather data. However, some non-linear correlations may be anticipated on the vit vs. temp, wind_speed and rain plots.
glm.fit = glm(vit_moy_sec~temp+wind_speed+rain , data = run_data_michael)
summary(glm.fit)
##
## Call:
## glm(formula = vit_moy_sec ~ temp + wind_speed + rain, data = run_data_michael)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -55.49 -20.54 -10.52 11.89 75.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -326.9299 184.5636 -1.771 0.080406 .
## temp 2.3141 0.6422 3.603 0.000551 ***
## wind_speed -6.0634 2.6214 -2.313 0.023358 *
## rain -0.1973 0.2687 -0.734 0.464920
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1065.475)
##
## Null deviance: 102765 on 81 degrees of freedom
## Residual deviance: 83107 on 78 degrees of freedom
## (212 observations deleted due to missingness)
## AIC: 810.24
##
## Number of Fisher Scoring iterations: 2
The temperature and wind_speed could explain the average pace but not the rain. Maybe some interaction of the two features could give better results:
glm.fit = glm(vit_moy_sec~temp*wind_speed, data=run_data_michael)
summary(glm.fit)
##
## Call:
## glm(formula = vit_moy_sec ~ temp * wind_speed, data = run_data_michael)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -45.65 -19.59 -11.36 14.67 94.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -203.44012 117.19002 -1.736 0.0838 .
## temp 1.85549 0.41096 4.515 9.61e-06 ***
## wind_speed 9.55999 42.37632 0.226 0.8217
## temp:wind_speed -0.04232 0.14793 -0.286 0.7751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 932.6425)
##
## Null deviance: 290961 on 263 degrees of freedom
## Residual deviance: 242487 on 260 degrees of freedom
## (30 observations deleted due to missingness)
## AIC: 2560.4
##
## Number of Fisher Scoring iterations: 2
The interaction of the two terms is not significant. Some transformations:
glm.fit = glm(vit_moy_sec~temp+I(exp(temp))+wind_speed+I(exp(wind_speed)), data = run_data_michael)
summary(glm.fit)
##
## Call:
## glm(formula = vit_moy_sec ~ temp + I(exp(temp)) + wind_speed +
## I(exp(wind_speed)), data = run_data_michael)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -46.510 -19.514 -9.998 12.598 93.977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.828e+02 7.245e+01 -2.524 0.0122 *
## temp 1.788e+00 2.544e-01 7.028 1.85e-11 ***
## I(exp(temp)) 8.036e-131 1.814e-130 0.443 0.6581
## wind_speed -3.188e+00 1.187e+00 -2.686 0.0077 **
## I(exp(wind_speed)) 1.495e-06 1.461e-06 1.023 0.3074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 932.1641)
##
## Null deviance: 290961 on 263 degrees of freedom
## Residual deviance: 241430 on 259 degrees of freedom
## (30 observations deleted due to missingness)
## AIC: 2561.3
##
## Number of Fisher Scoring iterations: 2
Transformation did not improved the prediction.
c = ggplot(run_data_michael, aes(y=vit_moy_sec, x=temp)) + geom_point(shape=20, size=5)
c + ylab("Running pace (s/km)") + xlab("Temperature(°K)")
It seems that the running pace is higher for temperatures above 290°K (18°C). Is it still true when we exclude the initial training phase from july to october 2014 ?
# format the Date field to perform comparisons
run_data_michael$mois_annee = as.Date(run_data_michael$date2, format="%d/%m/%Y")
nrow(run_data_michael)
## [1] 294
nrow(run_data_michael[run_data_michael$mois_annee>as.Date("01/10/2014", format="%d/%m/%Y"),])
## [1] 189
subset_runs =run_data_michael[run_data_michael$mois_annee>as.Date("01/10/2014", format="%d/%m/%Y"),]
summary(subset_runs$mois_annee)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2014-10-02" "2014-12-12" "2015-03-10" "2015-03-11" "2015-05-28"
## Max.
## "2015-09-07"
d = ggplot(subset_runs, aes(y = vit_moy_sec, x=temp)) + geom_point(shape=20, size=5)
d + ylab("Running pace (s/km)") + xlab("Temperature(°K)")
Now that the slower runs are not included, the running pace do not seem to be correlated anymore.
glm.fit = glm(vit_moy_sec~temp+wind_speed, data = subset_runs)
summary(glm.fit)
##
## Call:
## glm(formula = vit_moy_sec ~ temp + wind_speed, data = subset_runs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.784 -5.423 -2.114 4.793 56.575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.44433 26.46790 7.762 8.51e-13 ***
## temp 0.35207 0.09367 3.759 0.000237 ***
## wind_speed -0.47671 0.38333 -1.244 0.215413
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 95.43618)
##
## Null deviance: 17016 on 166 degrees of freedom
## Residual deviance: 15652 on 164 degrees of freedom
## (22 observations deleted due to missingness)
## AIC: 1240.2
##
## Number of Fisher Scoring iterations: 2
The temp
is still a predictive feature of the average pace of the run. From the coefficients, we can see that there is a 0.35 sec increase of the pace for one kelvin degree. Interestingly, the U shape looks like a quadratic function:
glm.fit = glm(vit_moy_sec~temp+I(temp^2), data = subset_runs)
summary(glm.fit)
##
## Call:
## glm(formula = vit_moy_sec ~ temp + I(temp^2), data = subset_runs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -25.371 -5.299 -0.832 5.075 60.143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.585e+03 7.192e+02 4.984 1.57e-06 ***
## temp -2.348e+01 5.073e+00 -4.628 7.48e-06 ***
## I(temp^2) 4.196e-02 8.941e-03 4.693 5.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 84.93198)
##
## Null deviance: 17016 on 166 degrees of freedom
## Residual deviance: 13929 on 164 degrees of freedom
## (22 observations deleted due to missingness)
## AIC: 1220.7
##
## Number of Fisher Scoring iterations: 2
Ideed, the quadratic term is significant. Let’s see if a degree four polynomial could be usefull:
glm.fit = glm(vit_moy_sec~poly(temp, 4), data = subset_runs[!is.na(subset_runs$temp),])
summary(glm.fit)
##
## Call:
## glm(formula = vit_moy_sec ~ poly(temp, 4), data = subset_runs[!is.na(subset_runs$temp),
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -24.873 -5.802 -0.548 4.452 60.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 304.4132 0.7144 426.133 < 2e-16 ***
## poly(temp, 4)1 34.8906 9.2316 3.779 0.000221 ***
## poly(temp, 4)2 43.2468 9.2316 4.685 5.91e-06 ***
## poly(temp, 4)3 -6.5998 9.2316 -0.715 0.475692
## poly(temp, 4)4 -8.9051 9.2316 -0.965 0.336164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 85.22214)
##
## Null deviance: 17016 on 166 degrees of freedom
## Residual deviance: 13806 on 162 degrees of freedom
## AIC: 1223.2
##
## Number of Fisher Scoring iterations: 2
The third and fourth polynomials are not significant.
Estimate the standard error of the temp
and temp^2
coefficients using the bootstrap function:
set.seed(1)
boot.fn = function(data, index){
glm.fit = glm(vit_moy_sec~temp+I(temp^2), data = data, subset = index)
c= glm.fit$coefficients
return(c)
}
#testing the boot.fn
boot.fn(subset_runs, sample(1:nrow(subset_runs), 50))
## (Intercept) temp I(temp^2)
## 3563.96756479 -23.22853622 0.04132681
library(boot)
boot(subset_runs, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = subset_runs, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 3584.75188109 18.9646527679 7.283136e+02
## t2* -23.47784794 -0.1353171317 5.136240e+00
## t3* 0.04195909 0.0002412651 9.049289e-03
The standard errors are large for the β 1 (cv=0.21) and β 2 (cv=0.21).
dpt = data.frame(temp = seq(range(subset_runs$temp, na.rm = TRUE)[1],range(subset_runs$temp, na.rm = TRUE)[2]))
glm.fit = glm(vit_moy_sec~temp+I(temp^2), data = subset_runs)
dpt$pred = predict(glm.fit, newdata = dpt)
e = ggplot(data = subset_runs, aes(y = vit_moy_sec, x=temp)) + geom_point(colour="grey40")
e + geom_smooth(data = dpt, aes(x=temp, y=pred)) + ylab("Running pace (s/km)") + xlab("Temperature(°K)")
range(dpt$pred)
## [1] 300.5458 317.0949
This regression line shows that the optimal temperature for running is around 7°C (280°K) . However in the temperature range from -8 to 27°C the average running pace only change of 17 sec (~5%). Thus we can conclude that the temperature has a very limited impact on the running pace.
Another explanation for this small correlation between the running pace and the temperature might just be a confounding variable such as the date. Indeed the temperature is changing over the year and we have seen that some periods of the year has different running pace averages.
From the two plots above, the regression lines are not colinear, thus it is unlikely that the observed correlation between the temperature and running pace is due to similar seasonal variations.
In conclusion, weather data, and in particular, Temperature could have a small predictive power for estimating the average running pace. Thus the temperature feature could be incorporated into a larger predictive model.