
Project: Predict the best running pace and frequency of training for a 10 km run.

I have been training to run a 10 km distance. At some point I felt the need to record my progress, so I bought a Garmin Forerunner 10 GPS watch. Since then I have uploaded my running records to the Garmin Connect platform. Although the platform has nice features for visualizing running data, it is not well suited for more in-depth analyses, such as finding the best pace on a given segment, which segment's pace has the most influence on the total running time, or which frequency of training has the most influence on the day's performance.

Project objectives:

1. Collect the data

I first downloaded all the data uploaded to the Garmin Connect website as CSV files, and collected them into a single file named Activities.csv.

activities = read.csv("c:/Users/mgirardot/Desktop/garmin_data/Activities.csv", stringsAsFactors = F)
head(activities)
##   activite           Type                    date Temps Distance
## 1    peggy Course à pied lun., 30 juin 2014 7:03 20:46     2,94
## 2  michael Course à pied lun., 30 juin 2014 5:28 47:08     6,88
## 3  michael Course à pied dim., 29 juin 2014 5:28 57:52     8,37
## 4  michael Course à pied sam., 28 juin 2014 5:26 48:49     6,88
## 5    peggy Course à pied ven., 27 juin 2014 6:58 15:04     2,05
## 6  michael Course à pied ven., 27 juin 2014 6:04 37:04     5,40
##   Gain_altitude Perte_altitude Vitesse_moy Vitesse_max Calories  X
## 1            23             19        7:04        5:42      241 NA
## 2            47             48        6:51        5:48      573 NA
## 3            61             58        6:55        4:41      693 NA
## 4            48             50        7:06        4:04      573 NA
## 5            12             15        7:20        6:10      170 NA
## 6            35             33        6:52        5:25      443 NA

This dataset is 314 records long and contains the date, running time, distance, average pace and max pace columns.

This dataset is rather poor compared to what the GPS actually records. Unfortunately, it is not possible to access the original raw data stored on the Garmin servers without paying a $5,000 fee for API access. However, I do have access to the internal storage of my Garmin Forerunner 10 watch.

I could thus retrieve 45 runs recorded over the last 3 months. These data are stored as FIT (Flexible and Interoperable Data Transfer) files, a binary format that can be converted to CSV with the FIT SDK.

> java -jar FitSDKRelease_13.10\java\FitCSVTool.jar -b 56J54309.FIT fit.csv
FIT CSV Tool - Protocol 1.0 Profile 16,10 Release
FIT binary file 56J54309.FIT decoded to fit*.csv files.

This utility produces two CSV files: fit.csv and fit_data.csv.

fit = read.csv("C:/Users/mgirardot/Desktop/garmin_data/fit.csv", stringsAsFactors = F)
#head(fit)

fit_data = read.csv("C:/Users/mgirardot/Desktop/garmin_data/fit_data.csv", stringsAsFactors = F)
#head(fit_data)

fit_data.csv is properly formatted for use as a data.frame.

#select only the record and lap data (drop the first six rows)
workout = tail(fit_data[,20:61],n = -6)
names(workout)
##  [1] "record.timestamp.s."                     
##  [2] "record.position_lat.semicircles."        
##  [3] "record.position_long.semicircles."       
##  [4] "record.distance.m."                      
##  [5] "record.speed.m.s."                       
##  [6] "record.enhanced_speed.m.s."              
##  [7] "lap.timestamp.s."                        
##  [8] "lap.start_time"                          
##  [9] "lap.start_position_lat.semicircles."     
## [10] "lap.start_position_long.semicircles."    
## [11] "lap.end_position_lat.semicircles."       
## [12] "lap.end_position_long.semicircles."      
## [13] "lap.total_elapsed_time.s."               
## [14] "lap.total_timer_time.s."                 
## [15] "lap.total_distance.m."                   
## [16] "lap.message_index"                       
## [17] "lap.total_calories.kcal."                
## [18] "lap.avg_speed.m.s."                      
## [19] "lap.event"                               
## [20] "lap.event_type"                          
## [21] "lap.lap_trigger"                         
## [22] "lap.sport"                               
## [23] "lap.enhanced_avg_speed.m.s."             
## [24] "event.data"                              
## [25] "session.timestamp.s."                    
## [26] "session.start_time"                      
## [27] "session.start_position_lat.semicircles." 
## [28] "session.start_position_long.semicircles."
## [29] "session.total_elapsed_time.s."           
## [30] "session.total_timer_time.s."             
## [31] "session.total_distance.m."               
## [32] "session.message_index"                   
## [33] "session.total_calories.kcal."            
## [34] "session.avg_speed.m.s."                  
## [35] "session.first_lap_index"                 
## [36] "session.num_laps"                        
## [37] "session.event"                           
## [38] "session.event_type"                      
## [39] "session.sport"                           
## [40] "session.sub_sport"                       
## [41] "session.trigger"                         
## [42] "session.enhanced_avg_speed.m.s."

The semicircle GPS coordinates are 32-bit encoded latitude and longitude values. They can be converted back to degrees using the formula: degrees = semicircles * (180 / 2^31). This dataset is much more fine-grained, which will be very useful for an in-depth analysis.
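As a quick sanity check, the conversion can be wrapped in a small helper (the function name is my own, not part of the FIT SDK):

```r
# Convert 32-bit semicircle-encoded GPS coordinates to decimal degrees.
# A half-turn of 2^31 semicircles corresponds to 180 degrees.
semicircles_to_degrees <- function(semicircles) {
  semicircles * (180 / 2^31)
}

semicircles_to_degrees(2^31)  # 180 degrees
```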

plot(workout$record.distance.m./1000, workout$record.speed.m.s.*3600/1000, type='l', col="steelblue", ylab = "Speed (km/h)", xlab = "Distance (km)")

library(RgoogleMaps)

# Storing the latitude and longitude for the center of the map in a bbox object
bb = qbbox(mean(workout$record.position_lat.semicircles. * 180/2^31),
           mean(workout$record.position_long.semicircles. * 180/2^31))

#size of the picture
sz=c(550,500)

# Download and save the map
myMap=GetMap.bbox(lat =bb$latR, lon=bb$lonR, destfile = 'mapTest.png', zoom=15, size=sz)

# Draw the paths as red lines on the map
PlotOnStaticMap(myMap, lat = workout$record.position_lat.semicircles.*180/2^31, lon = workout$record.position_long.semicircles.*180/2^31, size=sz, cex=.2, lwd=2, col="red", add=F, FUN=lines)

2. Combine running data with other available data

An obvious complementary dataset for the running records is weather data. I found a free weather API, OpenWeatherMap, and requested a free API key. I first cleaned the date field of the Activities.csv file by splitting it into jour, mois, date2, heure and date2_heure columns with Notepad++. The columns were then pasted into Excel and the file saved as CSV.

The procedure to import weather data is described in an IPython notebook.

3. Exploratory analysis

  • What is the average running pace?
data = read.csv("C:/Users/mgirardot/Desktop/garmin_data/Activities_format_date_weather.csv", sep="\t")
summary(data)
##        X                                activite 
##  Min.   :  0.00    michael                  :89  
##  1st Qu.: 78.25    Juvignac Course à pied  :86  
##  Median :156.50    Sans titre               :59  
##  Mean   :156.50    Malbosc                  :40  
##  3rd Qu.:234.75    peggy                    :12  
##  Max.   :313.00    michael & peggy & perrine: 6  
##                   (Other)                   :22  
##                    Type                           date          Temps    
##  Course à pied      :306   dim., 13 juil. 2014 6:06:  2   44:21:00:  4  
##  Randonnée pédestre:  8   dim., 6 juil. 2014 5:52 :  2   01:00:05:  3  
##                             jeu., 10 juil. 2014 5:27:  2   01:01:15:  3  
##                             jeu., 17 juil. 2014 6:13:  2   01:03:58:  3  
##                             jeu., 3 juil. 2014 5:35 :  2   01:04:08:  3  
##                             jeu., 3 juil. 2014 7:09 :  2   01:06:10:  3  
##                             (Other)                 :302   (Other) :295  
##     Distance   Gain_altitude Perte_altitude  Vitesse_moy   Vitesse_max 
##  12,75  :  7   114    : 14   110    : 16    05:05  : 14   03:33  :  8  
##  11,26  :  6   115    : 13   112    : 16    05:00  : 12   03:31  :  7  
##  12,17  :  6   112    : 12   109    : 11    05:06  : 11   03:37  :  6  
##  11,25  :  5   111    : 11   111    : 10    05:07  : 11   03:38  :  6  
##  11,81  :  5   110    : 10   115    :  9    05:02  : 10   03:42  :  6  
##  12,69  :  5   113    : 10   106    :  8    05:12  : 10   03:47  :  6  
##  (Other):280   (Other):244   (Other):244    (Other):246   (Other):275  
##     Calories            jour              mois        annee     
##  Min.   :  1.003   Min.   : 1.00   juillet  :71   Min.   :2014  
##  1st Qu.:570.250   1st Qu.: 7.00   août    :40   1st Qu.:2014  
##  Median :847.000   Median :15.00   septembre:33   Median :2014  
##  Mean   :716.384   Mean   :15.32   juin     :25   Mean   :2014  
##  3rd Qu.:921.750   3rd Qu.:23.00   octobre  :25   3rd Qu.:2015  
##  Max.   :984.000   Max.   :31.00   mars     :21   Max.   :2015  
##                                    (Other)  :99                 
##         date2         heure               date2_heure       temp      
##  12/07/2014:  6   05:49  : 16   01/07/2014 05:57:  2   Min.   :264.6  
##  03/07/2014:  4   05:50  : 16   02/07/2014 06:43:  2   1st Qu.:283.4  
##  07/07/2014:  4   05:51  : 16   03/07/2014 05:35:  2   Median :289.7  
##  09/07/2014:  4   05:53  : 13   03/07/2014 07:09:  2   Mean   :288.0  
##  01/07/2014:  2   05:45  : 10   04/07/2014 05:39:  2   3rd Qu.:293.1  
##  02/07/2014:  2   05:48  :  9   05/07/2014 05:28:  2   Max.   :302.4  
##  (Other)   :292   (Other):234   (Other)         :302   NA's   :30     
##     humidity        wind_speed      wind_degree    clouds_coverage 
##  Min.   : 33.00   Min.   : 0.380   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.: 57.00   1st Qu.: 1.330   1st Qu.: 59.5   1st Qu.:  0.00  
##  Median : 72.00   Median : 2.100   Median :297.8   Median :  8.00  
##  Mean   : 70.56   Mean   : 2.664   Mean   :212.9   Mean   : 31.33  
##  3rd Qu.: 84.00   3rd Qu.: 3.735   3rd Qu.:327.6   3rd Qu.: 75.00  
##  Max.   :100.00   Max.   :17.000   Max.   :360.0   Max.   :100.00  
##  NA's   :30       NA's   :30       NA's   :30      NA's   :30      
##       rain            pressure     
##  Min.   : 0.0000   Min.   : 967.1  
##  1st Qu.: 0.0000   1st Qu.: 987.2  
##  Median : 0.5162   Median :1001.9  
##  Mean   : 6.2618   Mean   :1000.9  
##  3rd Qu.: 4.0000   3rd Qu.:1015.0  
##  Max.   :62.0000   Max.   :1026.0  
##  NA's   :226       NA's   :30
#select only "Course à pied":
run_data = data[data$Type == data[1,3],]


#remove runs from peggy
run_data$activite[1]
## [1]  peggy
## 17 Levels:   Juvignac Course à pied   Malbosc ...  Sans titre
run_data_michael = run_data[run_data$activite != run_data$activite[1], ]

#parse the MM:SS average pace into seconds
parsed = unclass(strptime(run_data_michael$Vitesse_moy, format = "%M:%S"))
vit_moy = parsed$min * 60 + parsed$sec

run_data_michael$vit_moy_sec = vit_moy
mean(vit_moy)
## [1] 322.7721
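The strptime-based parsing can also be written as a small standalone helper that splits the "MM:SS" strings directly (pace_to_seconds is a hypothetical name, not from the original code):

```r
# Convert "MM:SS" pace strings to a number of seconds,
# equivalent to the strptime-based parsing above.
pace_to_seconds <- function(pace) {
  vapply(strsplit(pace, ":", fixed = TRUE),
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

pace_to_seconds(c("05:05", "07:20"))  # 305 440
```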
library(ggplot2)
a=ggplot(data=run_data_michael, aes(as.Date(date2,format= "%d/%m/%Y"), vit_moy_sec/60)) + geom_point()
a + xlab("Date") + ylab("Average pace (min/km)")+ geom_smooth()

  • What is the frequency of training?
b = ggplot(data=run_data_michael, aes(x = as.Date(date2, format="%d/%m/%Y")))+ 
  geom_line(stat="density", adjust=0.7, colour="steelblue") + 
  xlim(as.Date("01/07/2014", format="%d/%m/%Y"), as.Date("01/09/2015", format="%d/%m/%Y"))

b+ xlab("Date") + ylab("Density")

The frequency of training decreased in the fall (injuries) and in the winter (bad weather).

4. Running pace predictions with the weather data

  • Is there a relationship between the average running pace and weather data?
run_data_michael$X =NULL
pairs(run_data_michael[,17:24])

The average pace (s/km) does not present any obvious correlation with the weather data. However, some non-linear relationships may be anticipated from the vit_moy_sec vs. temp, wind_speed and rain plots.

glm.fit = glm(vit_moy_sec~temp+wind_speed+rain , data = run_data_michael)
summary(glm.fit)
## 
## Call:
## glm(formula = vit_moy_sec ~ temp + wind_speed + rain, data = run_data_michael)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -55.49  -20.54  -10.52   11.89   75.20  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -326.9299   184.5636  -1.771 0.080406 .  
## temp           2.3141     0.6422   3.603 0.000551 ***
## wind_speed    -6.0634     2.6214  -2.313 0.023358 *  
## rain          -0.1973     0.2687  -0.734 0.464920    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1065.475)
## 
##     Null deviance: 102765  on 81  degrees of freedom
## Residual deviance:  83107  on 78  degrees of freedom
##   (212 observations deleted due to missingness)
## AIC: 810.24
## 
## Number of Fisher Scoring iterations: 2

The temperature and wind_speed could explain the average pace, but the rain does not. Maybe an interaction between the two significant features could give better results:

glm.fit = glm(vit_moy_sec~temp*wind_speed, data=run_data_michael)
summary(glm.fit)
## 
## Call:
## glm(formula = vit_moy_sec ~ temp * wind_speed, data = run_data_michael)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -45.65  -19.59  -11.36   14.67   94.26  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -203.44012  117.19002  -1.736   0.0838 .  
## temp               1.85549    0.41096   4.515 9.61e-06 ***
## wind_speed         9.55999   42.37632   0.226   0.8217    
## temp:wind_speed   -0.04232    0.14793  -0.286   0.7751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 932.6425)
## 
##     Null deviance: 290961  on 263  degrees of freedom
## Residual deviance: 242487  on 260  degrees of freedom
##   (30 observations deleted due to missingness)
## AIC: 2560.4
## 
## Number of Fisher Scoring iterations: 2

The interaction between the two terms is not significant. Let's try some transformations:

glm.fit = glm(vit_moy_sec~temp+I(exp(temp))+wind_speed+I(exp(wind_speed)), data = run_data_michael)
summary(glm.fit)
## 
## Call:
## glm(formula = vit_moy_sec ~ temp + I(exp(temp)) + wind_speed + 
##     I(exp(wind_speed)), data = run_data_michael)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -46.510  -19.514   -9.998   12.598   93.977  
## 
## Coefficients:
##                       Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)         -1.828e+02   7.245e+01  -2.524   0.0122 *  
## temp                 1.788e+00   2.544e-01   7.028 1.85e-11 ***
## I(exp(temp))        8.036e-131  1.814e-130   0.443   0.6581    
## wind_speed          -3.188e+00   1.187e+00  -2.686   0.0077 ** 
## I(exp(wind_speed))   1.495e-06   1.461e-06   1.023   0.3074    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 932.1641)
## 
##     Null deviance: 290961  on 263  degrees of freedom
## Residual deviance: 241430  on 259  degrees of freedom
##   (30 observations deleted due to missingness)
## AIC: 2561.3
## 
## Number of Fisher Scoring iterations: 2

The transformations did not improve the fit.

c = ggplot(run_data_michael, aes(y=vit_moy_sec, x=temp)) + geom_point(shape=20, size=5)
c + ylab("Running pace (s/km)") + xlab("Temperature(°K)")

It seems that the running pace is higher for temperatures above 290°K (17°C). Is it still true when we exclude the initial training phase from July to October 2014?

# format the Date field to perform comparisons
run_data_michael$mois_annee = as.Date(run_data_michael$date2, format="%d/%m/%Y")
nrow(run_data_michael)
## [1] 294
nrow(run_data_michael[run_data_michael$mois_annee>as.Date("01/10/2014", format="%d/%m/%Y"),])
## [1] 189
subset_runs =run_data_michael[run_data_michael$mois_annee>as.Date("01/10/2014", format="%d/%m/%Y"),]
summary(subset_runs$mois_annee)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2014-10-02" "2014-12-12" "2015-03-10" "2015-03-11" "2015-05-28" 
##         Max. 
## "2015-09-07"
d = ggplot(subset_runs, aes(y = vit_moy_sec, x=temp)) + geom_point(shape=20, size=5)
d + ylab("Running pace (s/km)") + xlab("Temperature(°K)")

Now that the slower runs are excluded, the running pace does not seem to be correlated with temperature anymore.

glm.fit = glm(vit_moy_sec~temp+wind_speed, data = subset_runs)
summary(glm.fit)
## 
## Call:
## glm(formula = vit_moy_sec ~ temp + wind_speed, data = subset_runs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -27.784   -5.423   -2.114    4.793   56.575  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.44433   26.46790   7.762 8.51e-13 ***
## temp          0.35207    0.09367   3.759 0.000237 ***
## wind_speed   -0.47671    0.38333  -1.244 0.215413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 95.43618)
## 
##     Null deviance: 17016  on 166  degrees of freedom
## Residual deviance: 15652  on 164  degrees of freedom
##   (22 observations deleted due to missingness)
## AIC: 1240.2
## 
## Number of Fisher Scoring iterations: 2

The temp is still a predictive feature of the average pace. From the coefficients, we can see a 0.35 s increase in pace per kelvin. Interestingly, the U shape of the scatter plot suggests a quadratic function:

glm.fit = glm(vit_moy_sec~temp+I(temp^2), data = subset_runs)
summary(glm.fit)
## 
## Call:
## glm(formula = vit_moy_sec ~ temp + I(temp^2), data = subset_runs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -25.371   -5.299   -0.832    5.075   60.143  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.585e+03  7.192e+02   4.984 1.57e-06 ***
## temp        -2.348e+01  5.073e+00  -4.628 7.48e-06 ***
## I(temp^2)    4.196e-02  8.941e-03   4.693 5.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 84.93198)
## 
##     Null deviance: 17016  on 166  degrees of freedom
## Residual deviance: 13929  on 164  degrees of freedom
##   (22 observations deleted due to missingness)
## AIC: 1220.7
## 
## Number of Fisher Scoring iterations: 2

Indeed, the quadratic term is significant. Let's see if a degree-four polynomial could be useful:

glm.fit = glm(vit_moy_sec~poly(temp, 4), data = subset_runs[!is.na(subset_runs$temp),])
summary(glm.fit)
## 
## Call:
## glm(formula = vit_moy_sec ~ poly(temp, 4), data = subset_runs[!is.na(subset_runs$temp), 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -24.873   -5.802   -0.548    4.452   60.416  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    304.4132     0.7144 426.133  < 2e-16 ***
## poly(temp, 4)1  34.8906     9.2316   3.779 0.000221 ***
## poly(temp, 4)2  43.2468     9.2316   4.685 5.91e-06 ***
## poly(temp, 4)3  -6.5998     9.2316  -0.715 0.475692    
## poly(temp, 4)4  -8.9051     9.2316  -0.965 0.336164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 85.22214)
## 
##     Null deviance: 17016  on 166  degrees of freedom
## Residual deviance: 13806  on 162  degrees of freedom
## AIC: 1223.2
## 
## Number of Fisher Scoring iterations: 2

The third- and fourth-degree terms are not significant.
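An alternative way to compare the nested polynomial fits is an F-test with anova(). The sketch below uses simulated data standing in for subset_runs (the simulated coefficients roughly mimic the quadratic fit reported above):

```r
# Simulated pace data: a quadratic trend in temperature plus noise,
# mimicking the quadratic fit reported above.
set.seed(1)
temp <- runif(150, min = 265, max = 300)
pace <- 3585 - 23.5 * temp + 0.042 * temp^2 + rnorm(150, sd = 9)

fit2 <- glm(pace ~ poly(temp, 2))
fit4 <- glm(pace ~ poly(temp, 4))

# The quartic cannot fit worse than the quadratic; the F-test asks
# whether the improvement is larger than expected by chance.
anova(fit2, fit4, test = "F")
```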

Let's estimate the standard errors of the temp and temp^2 coefficients with the bootstrap:

set.seed(1)

boot.fn = function(data, index){
  glm.fit = glm(vit_moy_sec~temp+I(temp^2), data = data, subset = index)

  c= glm.fit$coefficients
  return(c)
}

#testing the boot.fn
boot.fn(subset_runs, sample(1:nrow(subset_runs), 50))
##   (Intercept)          temp     I(temp^2) 
## 3563.96756479  -23.22853622    0.04132681
library(boot)
boot(subset_runs, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = subset_runs, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* 3584.75188109 18.9646527679 7.283136e+02
## t2*  -23.47784794 -0.1353171317 5.136240e+00
## t3*    0.04195909  0.0002412651 9.049289e-03

The standard errors are large relative to the estimates for both β1 (CV ≈ 0.22) and β2 (CV ≈ 0.22).
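These coefficients of variation are simply the bootstrap standard errors divided by the absolute values of the original estimates:

```r
# Coefficient of variation: std. error relative to the estimate.
cv <- function(std_error, estimate) std_error / abs(estimate)

cv(5.136240, -23.47784794)    # beta_1, about 0.22
cv(9.049289e-03, 0.04195909)  # beta_2, about 0.22
```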

dpt = data.frame(temp = seq(range(subset_runs$temp, na.rm = TRUE)[1],range(subset_runs$temp, na.rm = TRUE)[2]))
glm.fit = glm(vit_moy_sec~temp+I(temp^2), data = subset_runs)
dpt$pred = predict(glm.fit, newdata = dpt)

e = ggplot(data = subset_runs, aes(y = vit_moy_sec, x=temp)) + geom_point(colour="grey40")
e + geom_smooth(data = dpt, aes(x=temp, y=pred)) + ylab("Running pace (s/km)") + xlab("Temperature(°K)")

range(dpt$pred)
## [1] 300.5458 317.0949

This regression line shows that the optimal temperature for running is around 7°C (280°K). However, over the temperature range from -8 to 27°C, the average running pace only changes by about 17 s (~5%). We can thus conclude that temperature has a very limited impact on the running pace.
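The optimal temperature can also be read directly from the quadratic coefficients: the minimum of a + b*temp + c*temp^2 lies at temp = -b/(2c). With the fitted coefficients reported above:

```r
# Pace-minimizing temperature from the vertex of the fitted quadratic.
b  <- -23.47784794  # temp coefficient
c2 <- 0.04195909    # I(temp^2) coefficient

t_opt_kelvin <- -b / (2 * c2)
t_opt_kelvin - 273.15  # about 6.6 °C
```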

Another explanation for this small correlation between running pace and temperature might simply be a confounding variable such as the date. Indeed, the temperature changes over the year, and we have seen that some periods of the year have different average running paces.

From the two plots above, the regression lines are not collinear; thus it is unlikely that the observed correlation between temperature and running pace is due to similar seasonal variations.

In conclusion, weather data, and in particular temperature, could have a small predictive power for estimating the average running pace. The temperature feature could thus be incorporated into a larger predictive model.