Handling and visualising archive data from Strava

Jun 21, 2025

Some time last year, I deleted Strava. All I really cared about was how far I had run, for how long, and where I had been. But you can get all of this information from open source alternatives like FitoTrack without handing over all your data to a commercial enterprise, or being bombarded with ads for a paid subscription. I’ve now settled on recording the time for each run using my Casio, and then swiftly moving on with my life.

Before quitting Strava, I downloaded all my archived data. As much as I’d like to rise above obsessions over split times and average pace, I do like to look back on specific runs and check PBs periodically. When you archive Strava data, you get given your activities as individual GPX files. These files are of course pretty useless on their own, which I suppose deters many users from deleting the app. But with a fairly simple R script, you can get decent summaries of all your activities and mess around with data visualisation to your heart’s content. This is what I’ve begun to do myself, making all my data and code openly available.

The main challenge of the whole project was simply extracting the relevant information from the GPX files and compiling that information into a usable data frame. Printing the raw text contents of the files looks like absolute garbage, and since I didn’t have much experience with XML files, a lot of my explorations were trial and error. I settled on using functions from the XML package to parse the files and extract information about each activity. This included stuff like the name (e.g., “Morning run”) and type (e.g., running, cycling), but importantly, also ping-level information collected throughout the duration of the activity, like the time of day, coordinate location and elevation at one or two second intervals.

The preliminary stuff at the top of the script just loads in the packages required, lists all my GPX files in the data folder, and then imports them into R using htmlTreeParse(). If you clone/download the repository and throw your GPX files into the data folder, you should be able to replicate everything shown here for your own data.

# Load libraries.
library(pbapply)
library(XML)
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(leaflet)
library(maptiles)
library(tidyterra)
library(sf)

# Settings.
theme_set(theme_minimal())

# Create list of all the gpx files that we have.
file_names <- paste0(
  "data/",
  list.files("data", pattern = glob2rx("*.gpx"))
)

# Read them all into a list.
raw_list <- pblapply(file_names, function(x){
  htmlTreeParse(file = x, useInternalNodes = TRUE)
}
)
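To see what the parser and the XPath queries return before pointing them at a whole archive, a minimal sketch on a made-up track may help. The GPX text and every `toy_` name below are invented for illustration (they are not from my data), but the XML calls are the same ones used in the script.

```r
library(XML)

# A tiny, made-up GPX document (hypothetical values, for illustration only).
toy_gpx <- '<gpx version="1.1" creator="example">
  <trk>
    <name>Example run</name>
    <type>running</type>
    <trkseg>
      <trkpt lat="52.3600" lon="4.8600"><ele>1.0</ele><time>2021-04-18T08:00:00Z</time></trkpt>
      <trkpt lat="52.3601" lon="4.8602"><ele>1.5</ele><time>2021-04-18T08:00:01Z</time></trkpt>
      <trkpt lat="52.3602" lon="4.8604"><ele>2.0</ele><time>2021-04-18T08:00:02Z</time></trkpt>
    </trkseg>
  </trk>
</gpx>'

# asText = TRUE parses the string itself rather than treating it as a file path.
toy_doc <- htmlTreeParse(toy_gpx, asText = TRUE, useInternalNodes = TRUE)

# Activity-level values: one per track.
toy_name <- xpathSApply(toy_doc, "//trk/name", xmlValue)

# Ping-level values: one per track point.
# xmlAttrs gives a matrix with one column per point and rows "lat" and "lon".
toy_coords <- xpathSApply(toy_doc, "//trkpt", xmlAttrs)
toy_ele    <- as.numeric(xpathSApply(toy_doc, "//trkpt/ele", xmlValue))
toy_time   <- xpathSApply(toy_doc, "//trkpt/time", xmlValue)
```

The coordinates live in the attributes of each trkpt node, while elevation and time are child nodes, which is why the first uses xmlAttrs and the others use xmlValue.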

We can then execute the main parsing step: extracting the nodes we need and then sticking them together into a list of usable data frames. The final step uses st_as_sf() to convert the coordinates into an sf object, so we can easily calculate distances and create maps later on. I had around 200 activities and this took 1-2 minutes to run on a standard laptop.

# Loop over the parsed files, extracting the relevant information from each.
acts_clean <- list()

for (i in seq_along(raw_list)){
  
# Extract name.
name <- xpathSApply(doc = raw_list[[i]], path = "//trk/name", fun = xmlValue)

# Extract type.
type <- xpathSApply(doc = raw_list[[i]], path = "//trk/type", fun = xmlValue)

# Extract coords.
coords <- xpathSApply(doc = raw_list[[i]], path = "//trkpt", fun = xmlAttrs)

# Extract elevation.
elevation <- xpathSApply(doc = raw_list[[i]], path = "//trkpt/ele", fun = xmlValue)

# Extract time.
time <- xpathSApply(doc = raw_list[[i]], path = "//trkpt/time", fun = xmlValue)

# Extract information into a dataframe.
gpx_sf <- data.frame(
  act_name    = name,
  act_type    = type,
  timestamps  = time,
  lat         = coords["lat", ],
  lon         = coords["lon", ],
  ele         = as.numeric(elevation)
) %>% 
  mutate(timestamps = ymd_hms(timestamps),
         week_lub   = week(timestamps),
         year_lub   = year(timestamps)) %>% 
  st_as_sf(coords = c(x = "lon", y = "lat"), crs = 4326) 

# Insert each into the list.
acts_clean[[i]] <- gpx_sf

}

Each element of the acts_clean list is a single activity, with each row representing a single GPS ping recording the time and elevation.

We can easily bind all the data frames together. At this point, I subset my activities for running-only. You can of course keep all your different activity types, but remember that later on, you will need to group_by(act_type) to get equivalent summaries, or use some equivalent loop or facet.

# Bind together for broad summaries, then filter for runs only.
acts_sf <- bind_rows(acts_clean, .id = "act_id") %>% 
  filter(act_type == "running")
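If you do keep several activity types, a grouped summary might look something like the sketch below. The data frame and its values are made up for illustration; only the act_id, act_type and ele column names mirror the ones used in the script.

```r
library(dplyr)

# Made-up ping-level data for three activities of two types (hypothetical values).
toy_acts <- data.frame(
  act_id   = c("1", "1", "2", "2", "3", "3"),
  act_type = c("running", "running", "cycling", "cycling", "running", "running"),
  ele      = c(10, 12, 5, 9, 20, 21)
)

# One row per activity type instead of one overall summary.
toy_by_type <- toy_acts %>%
  group_by(act_type) %>%
  summarise(
    n_acts  = n_distinct(act_id),  # number of activities of this type
    max_ele = max(ele),            # highest recorded elevation
    .groups = "drop"
  )
```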

The final key data handling step before we can begin summarising activities is spatial. At the moment, the ping-level data consists of coordinate locations recorded at one or two second intervals throughout the activity. But for mapping visuals and to easily calculate distances, we need to convert these series of points to lines. I do this by combining the points of each activity in their recorded order (do_union = FALSE, so no true union is computed) and casting the output to a linestring.

# Convert coords to lines.
acts_lines_sf <- acts_sf %>% 
  group_by(act_id) %>% 
  summarize(do_union = FALSE) %>% 
  st_cast("LINESTRING") %>% 
  ungroup()
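A small sketch of why do_union = FALSE matters, using three made-up points whose longitudes are deliberately out of sorted order. The `toy_` names and coordinates are hypothetical. With do_union = FALSE the points are combined as they are, so the line visits them in the order they were recorded; a true union can reorder or de-duplicate points, which would scramble the route.

```r
library(sf)
library(dplyr)

# Three made-up pings for one activity, in recording order.
toy_pts <- data.frame(
  act_id = c("a", "a", "a"),
  lon    = c(4.86, 4.88, 4.87),
  lat    = c(52.36, 52.36, 52.37)
) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326)

# Combine the points per activity without a union, then cast to a line.
toy_line <- toy_pts %>%
  group_by(act_id) %>%
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")

# The vertices of the line, in order.
toy_vertices <- st_coordinates(toy_line)[, c("X", "Y")]
```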

We can then calculate the distance of each line using st_length(). I made this a standalone dataframe, with no spatial information, so it can be quickly joined back later on.

# Create df of the distances.
acts_dist_df <- acts_lines_sf %>% 
  mutate(total_km = round(as.numeric(st_length(.)/1000), 2)) %>% 
  as_tibble() %>% 
  select(-geometry)
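As a rough sanity check on st_length(), here is a sketch with a made-up line running one degree of latitude up the prime meridian, which should come out at roughly 111 km. The `toy_` names are hypothetical. For a longitude/latitude CRS like 4326, st_length() returns metres, hence the division by 1000.

```r
library(sf)

# A made-up line from (lon 0, lat 0) to (lon 0, lat 1).
toy_meridian <- st_sfc(
  st_linestring(rbind(c(0, 0), c(0, 1))),  # columns are lon, lat
  crs = 4326
)

# st_length() returns a units object in metres; strip the units and convert to km.
toy_km <- as.numeric(st_length(toy_meridian)) / 1000
```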

Okay, now we can actually calculate something… Here, we create usable ping-level information for each activity, including joining back the distance data.

# Ping-level data for every activity. 
pings_df <- acts_sf %>% 
  as_tibble() %>% 
  group_by(act_id) %>% 
  mutate(
    act_time   = max(timestamps) - min(timestamps),
    act_mins   = as.numeric(act_time, units = "mins"),
    ele_gain   = sum(diff(ele)[diff(ele) > 0])
  ) %>% 
  ungroup() %>% 
  # Join on the shared activity id (still a character in both tables here).
  left_join(acts_dist_df, by = "act_id") %>%
  mutate(av_km_time = act_mins/total_km,
         act_id     = as.numeric(act_id),
         ping_id    = 1:nrow(.))
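The three per-activity calculations are easier to follow on a handful of made-up numbers. This is only a sketch in base R with hypothetical values: elevation gain is the sum of the positive steps between consecutive pings, duration is the gap between the first and last timestamp, and pace is minutes divided by kilometres.

```r
# Made-up elevation readings for six consecutive pings (metres).
hill_ele   <- c(10, 12, 11, 15, 15, 13)
hill_steps <- diff(hill_ele)                      # 2 -1 4 0 -2
hill_gain  <- sum(hill_steps[hill_steps > 0])     # climbs only: 2 + 4 = 6

# Made-up first and last timestamps of an activity.
hill_times <- as.POSIXct(c("2021-04-18 08:00:00", "2021-04-18 08:30:00"), tz = "UTC")
hill_mins  <- as.numeric(max(hill_times) - min(hill_times), units = "mins")  # 30

# Pace in minutes per kilometre for a made-up 6 km distance.
hill_pace <- hill_mins / 6                        # 5
```

Note that summing every positive step also adds up GPS noise in the elevation readings, so treat the gain as indicative rather than exact.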

This allows us to make a usable descriptive summary table in one go.

# Summary table example.
sum_table_df <- pings_df %>% 
  mutate(av_km_time = round(av_km_time, 2),
         act_mins   = round(act_mins, 2),
         ele_gain   = round(ele_gain, 0),
         act_date = format(date(timestamps), "%d.%m.%y")) %>% 
  select(act_id, act_date, act_name, act_mins, total_km, ele_gain, av_km_time) %>% 
  distinct(act_id, .keep_all = TRUE) %>% 
  arrange(act_id)

This table contains pretty much all the information I would ever want from my archive data. The script should be easy to adapt to your own data if you want to add or remove anything.

act_id	act_date	act_name	act_mins	total_km	ele_gain	av_km_time
22	16.02.21	Stretford - Jackson’s Boat - Home	31.62	5.72	16	5.53
23	19.02.21	Water Park - Jackson’s Boat - Home	33.83	6.09	16	5.56
25	05.03.21	Blair Witch Vibes	28.70	5.33	18	5.38
26	10.03.21	Stretford meander	30.28	5.72	21	5.29
27	13.03.21	Misjudged bridge situation	35.18	6.74	25	5.22
29	20.03.21	Charlie Don’t Surf	34.23	6.21	17	5.51
32	24.03.21	Final run	55.65	10.04	25	5.54
33	30.03.21	Jog back from test centre	18.28	3.59	11	5.09
34	31.03.21	Misjudged bridge situation 2	81.28	11.58	9	7.02
35	04.04.21	Blimey	38.70	7.81	23	4.96
36	10.04.21	Morning run	21.53	4.37	10	4.93
37	18.04.21	First city run	29.10	5.67	10	5.13
38	22.04.21	Vondel run	25.65	5.10	9	5.03
39	29.04.21	Rembrandt run	33.52	6.85	24	4.89
40	11.05.21	Rembrandt run	23.25	4.82	20	4.82
66	03.09.21	Tide’s out legs out	37.65	6.66	12	5.65
67	11.09.21	Hilbre	43.98	7.16	37	6.14
68	19.09.21	Rembrandt run	26.12	5.51	18	4.74
69	25.09.21	Rembrandt run	39.10	7.51	26	5.21
72	30.09.21	Roman	16.53	3.58	7	4.62
74	03.10.21	PT	20.52	3.43	12	5.98
75	04.10.21	Night 5	24.23	5.05	12	4.80
76	08.10.21	Following the lights	24.43	5.15	16	4.74
77	09.10.21	PT	25.25	4.17	13	6.06
78	13.10.21	Night 5	29.10	5.72	20	5.09
79	16.10.21	Rembrandt run	27.18	5.98	20	4.55
80	18.10.21	PT	32.12	5.07	19	6.33
81	21.10.21	Evening 10	52.63	10.10	36	5.21
83	25.10.21	PT	30.67	5.06	19	6.06
84	27.10.21	PT	31.10	5.11	14	6.09
85	29.10.21	Sloterplas 10	51.30	10.25	22	5.00
86	01.11.21	Rembrandt run	41.12	8.01	24	5.13
87	04.11.21	PT	28.37	5.06	17	5.61
88	06.11.21	Porridge	87.33	16.26	32	5.37
89	10.11.21	Rembrandt recovery	23.27	4.38	14	5.31
90	17.11.21	Rembrandt run	26.73	5.43	19	4.92
93	30.11.21	Knee test	30.93	5.94	17	5.21
94	09.12.21	Kneed knees	25.55	5.06	16	5.05
95	16.12.21	Dune run	30.78	5.52	10	5.58
96	24.12.21	Explore	29.60	3.92	108	7.55
97	26.12.21	Hill climber	28.97	5.09	156	5.69
98	28.12.21	Explore turbo	22.70	3.94	111	5.76
99	02.01.22	PT	20.13	3.46	12	5.82
100	05.01.22	Rembrandt run	26.92	5.21	18	5.17
101	09.01.22	Combo	47.17	8.95	21	5.27
102	16.01.22	Sloterplas 10	53.55	10.19	23	5.26
103	20.01.22	Night run	27.38	5.12	16	5.35
104	22.01.22	Rembrandt run	24.73	5.25	19	4.71
105	26.01.22	Evening run	26.87	5.25	18	5.12
106	29.01.22	Sloterplas 10	50.10	10.16	32	4.93
107	03.02.22	Evening run	25.07	5.11	15	4.91
108	06.02.22	Rembrandt run	24.77	5.04	19	4.91
109	08.02.22	West run	40.47	8.08	27	5.01
110	12.02.22	West 10	53.22	10.48	29	5.08
111	15.02.22	Rembrandt run	22.57	5.03	18	4.49
112	23.02.22	Lunchtime run	16.00	3.22	11	4.97
113	25.02.22	Morning run	24.75	5.14	21	4.82
115	28.02.22	West run	41.42	8.03	27	5.16
116	01.03.22	Lunchtime run	14.05	3.00	10	4.68
117	12.03.22	Spring	25.32	5.16	19	4.91
119	16.03.22	Sandy	24.68	5.03	18	4.91
120	19.03.22	Rembrandt run	40.48	7.78	27	5.20
121	21.03.22	Morning run	15.33	2.85	9	5.38
124	27.03.22	Evening run	16.43	3.02	9	5.44
125	06.04.22	Back to it	33.18	6.40	19	5.18
126	15.04.22	Afternoon run	12.15	2.27	7	5.35
128	18.04.22	Sloterplas 10	51.45	10.19	27	5.05
129	20.04.22	Cake	36.10	7.09	20	5.09
130	27.04.22	Rembrandt run	25.22	5.02	18	5.02
134	08.05.22	Sunday run	20.52	4.02	14	5.10
135	09.05.22	Morning run	12.73	2.55	9	4.99
136	14.05.22	West meander	29.15	5.80	16	5.03
138	16.05.22	PT	13.20	2.36	7	5.59
139	20.05.22	Run	12.77	2.63	9	4.85
140	21.05.22	Sloterplas 10	53.88	10.08	23	5.35
141	25.05.22	Rembrandt run	27.37	5.01	18	5.46
143	29.05.22	Rembrandt run	25.25	5.25	18	4.81
144	04.06.22	West	35.17	7.12	17	4.94
145	12.06.22	Rembrandt run	21.60	4.51	15	4.79
147	10.07.22	Urban trail series	36.95	6.45	23	5.73
148	27.07.22	Heal the heel	17.55	3.40	11	5.16
150	01.08.22	Rembrandt run	24.92	4.34	14	5.74
151	05.08.22	Alright then	24.47	5.04	16	4.85
152	10.08.22	Toasty	30.18	6.03	16	5.01
153	14.08.22	West	37.05	7.22	20	5.13
154	19.08.22	Westish	25.50	5.44	14	4.69
155	22.08.22	Blimey	26.57	5.13	129	5.18
156	26.08.22	Jogaroo	15.70	3.47	13	4.52
157	01.09.22	West 10	48.28	10.12	27	4.77
158	11.09.22	Pancakes	63.33	12.86	32	4.92
159	18.09.22	Summit attempt	26.00	4.16	116	6.25
160	25.09.22	Paros 5	33.90	5.79	163	5.85
161	29.09.22	West (pt1)	8.08	1.64	4	4.93
162	29.09.22	West (pt2)	35.65	7.37	18	4.84
163	03.10.22	Rembrandt run	22.20	4.62	17	4.81
164	06.10.22	West	43.18	8.22	31	5.25
165	08.10.22	Ginger cake	85.43	16.56	47	5.16
166	13.10.22	Rembrandt run	21.02	4.34	15	4.84
167	16.10.22	Amsterdam half	106.42	21.98	70	4.84
168	30.10.22	Rembrandt run	22.60	4.46	16	5.07
169	02.11.22	West	37.88	7.81	24	4.85
170	05.11.22	Rembrandt run	24.57	5.11	20	4.81
171	09.11.22	West	31.20	7.24	25	4.31
172	11.11.22	Rembrandt run	22.20	4.48	17	4.96
173	15.11.22	West	28.17	6.09	17	4.63
174	15.11.22	West finish	5.15	1.11	5	4.64
175	19.11.22	West	31.87	7.06	19	4.51
176	22.11.22	PT	40.63	7.06	25	5.76
177	24.11.22	Rembrandt run	31.65	5.88	20	5.38
178	27.11.22	West 10ish	56.10	11.23	39	5.00
179	03.12.22	West 10ish	50.78	10.92	29	4.65
180	06.12.22	West meander	29.72	6.18	21	4.81
181	08.12.22	West / Sloterplas	38.80	8.22	25	4.72
182	13.12.22	Canal loop	28.10	5.71	19	4.92
183	17.12.22	West / Sloterplas	53.48	10.98	32	4.87
184	23.12.22	Wirral	82.30	15.64	33	5.26
185	28.12.22	Windy bastard	23.65	4.67	152	5.06
186	30.12.22	LPA (pt1)	7.35	1.33	45	5.53
187	30.12.22	LPA (pt2)	49.72	9.77	139	5.09
188	04.01.23	Rembrandt run	21.73	4.74	18	4.59
189	08.01.23	Egmond half	113.02	21.85	89	5.17
190	18.01.23	Back to it	16.02	3.44	11	4.66
191	22.01.23	Rembrandt run	28.62	4.75	16	6.02
194	04.02.23	Rembrandt run	16.45	3.99	15	4.12
195	05.02.23	PT	40.65	7.38	27	5.51
196	08.02.23	Rembrandt run	18.65	4.23	13	4.41
197	12.02.23	Rembrandt run	12.08	2.95	11	4.10
198	07.03.23	Rembrandt run return	17.60	3.92	16	4.49
199	11.03.23	Rembrandt run	29.93	6.36	20	4.71
200	16.03.23	Westish	25.85	5.87	16	4.40
201	19.03.23	Urban pickle trail series	101.98	14.03	41	7.27
202	23.03.23	Bats	14.32	2.38	8	6.02
203	26.03.23	Zandvoort	72.52	12.35	102	5.87
204	24.07.23	Chart	18.23	3.16	38	5.77
205	24.07.23	Chart 2	4.78	0.85	12	5.63
207	03.10.23	Come on, legs	15.75	2.85	10	5.53
208	07.10.23	Michael Ketone	16.98	3.02	10	5.62
1	16.10.23	Jacket	24.25	4.31	15	5.63
2	17.10.23	PT	14.43	2.20	7	6.56
3	21.10.23	Bugger this	15.73	3.02	10	5.21
4	25.10.23	The Heel Strike’s Back	20.13	4.10	14	4.91
5	30.10.23	Rembrandt run	21.63	4.32	15	5.01
6	04.11.23	21 bathrooms	27.33	4.56	15	5.99
7	25.11.23	Room for a small one	10.87	2.10	8	5.17
8	29.11.23	Lil one	11.18	1.77	7	6.32
9	24.12.23	Sherry	15.85	2.50	88	6.34
10	27.12.23	Audacity	30.80	4.75	50	6.48
11	30.12.23	Beachy	10.27	2.01	6	5.11
12	02.01.24	Watery bastard	26.77	4.79	11	5.59
13	04.01.24	Mr. Motivator	12.70	2.65	10	4.79
14	06.01.24	Jumbo	15.60	2.44	9	6.39
15	10.01.24	Nippy	12.52	2.61	9	4.80
16	14.01.24	No Vondelling	25.70	5.13	15	5.01
17	18.01.24	Cheese	25.58	4.38	17	5.84
18	22.01.24	Kattenlaan	21.25	4.23	12	5.02
19	24.01.24	Vondel thingy	24.72	5.08	14	4.87
20	27.01.24	Strava are bastards. Moving to FitoTrack	15.28	2.76	9	5.54

That said, the table is quite boring. The fun derived from going through GPX file hell is in the visualisation that follows. I haven’t spent too long on this yet, so there’s plenty more fun to be had. Before getting into the ggplot2 chunks, I tidy up the summary table by renaming some columns and pivoting everything to long format. The pivot makes visualisation much easier later on.

# Initial handling before visuals.
sum_visuals_df <- sum_table_df %>% 
  select(-act_date, -act_name) %>% 
  rename(`Time (mins)`   = act_mins,
         `Distance (km)` = total_km,
         `Elevation gain (metres)` = ele_gain,
         `Km pace (mins)`          = av_km_time) %>% 
  pivot_longer(cols = -act_id,
               names_to = "measure",
               values_to = "value")
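To make the reshaping concrete, here is a sketch of the same pivot on a made-up two-run table. The values and `toy_` names are hypothetical. Each run goes from one row with a column per metric to one row per metric, which is the shape facet_wrap() wants.

```r
library(tidyr)

# Made-up wide summary: one row per run, one column per metric.
toy_wide <- data.frame(
  act_id          = c(1, 2),
  `Time (mins)`   = c(30, 25),
  `Distance (km)` = c(6, 5),
  check.names     = FALSE  # keep the spaces and brackets in the column names
)

# Long format: one row per run-metric pair.
toy_long <- pivot_longer(
  toy_wide,
  cols      = -act_id,
  names_to  = "measure",
  values_to = "value"
)
```

Two runs and two metrics give four rows, with columns act_id, measure and value.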

Now we can easily create some visual summaries of distributions across different metrics.

# Histograms.
ggplot(data = sum_visuals_df) +
  geom_histogram(mapping = aes(x = value), bins = 20, fill = "#fc4c02") +
  facet_wrap(~measure, scales = "free", ncol = 4) +
  labs(y = NULL, x = NULL) +
  theme(
    axis.text.y = element_blank()
  )

Or plot the individual points.

# Scatter plot of individual runs.
ggplot(data = sum_visuals_df) +
  geom_jitter(mapping = aes(x = value, y = 0),
               colour = "#fc4c02", alpha = 0.5) +
  facet_wrap(~measure, scales = "free", nrow = 4) +
  labs(y = NULL, x = NULL) +
  theme(
    axis.text.y = element_blank(),
    panel.grid.major.y = element_blank()
  )

We can also select individual runs to visualise things like elevation. Here, I choose a run manually by name because it was a good example of a hilly one. Be careful doing this if you tend to use the same name for different runs. You can always filter on the numeric act_id variable instead.

# Elevation profile for a single run.
pings_df %>% 
  filter(act_name == "Hill climber ") %>% # Name has to be distinct!
  ggplot(data = .) +
  geom_ribbon(mapping = aes(x = ping_id, ymin = min(ele)*0.5, ymax = ele, group = 1),
              fill = "#fc4c02", linewidth = 1) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank()
  ) +
  labs(y = "Elevation (metres)", x = NULL)

To make some decent maps, we need to convert the point-level pings to lines, just as we did before calculating distances. I do this for every activity in one go using group_by(), combining each activity’s points in their recorded order (do_union = FALSE) and then ensuring the output is treated as a line.

# Single activity map.
# First, create the linestrings from the points.
acts_line_sf <- acts_sf %>% 
  group_by(act_id) %>% 
  summarize(do_union = FALSE) %>% 
  st_cast("LINESTRING") %>% 
  ungroup()

While not necessary, to give our activity maps some geographic context, I obtain CARTO base maps for each activity. You could do this for every activity in one go, but for now I just do it for a single example.

# Single activity selection for examples.
act_i <- 1

# Second, obtain the CARTO base map tiles for a single activity.
osm_posit <- get_tiles(
  filter(acts_line_sf, act_id == act_i),
  provider = "CartoDB.Positron",
  crop = FALSE, zoom = 15
  )

Then we can get mapping. I do a small wrangle beforehand, first to subset the lines for my example activity, and then to join back the activity information. The latter step means I can make a joint graphic of both the map and the activity information (e.g., distance, pace) at some point later on, if wanted.

# First we subset to get the label and keep the other data.
act1_sf <- acts_line_sf %>% 
  filter(act_id == act_i) %>% 
  mutate(act_id = as.numeric(act_id)) %>% 
  left_join(sum_table_df, by = "act_id") # get info back.

# Map it out.
ggplot(data = act1_sf) +
  geom_spatraster_rgb(data = osm_posit) +
  geom_sf(colour = "#fc4c02", linewidth = 1) +
  theme_void()

We can also, with a little effort, create an interactive map using leaflet. Here, I just use the basic OpenStreetMap layer, but you can add other (prettier) layers and an interactive legend with a bit more fiddling. You can view a more elaborate example here.

# Interactive map for single activity.
leaflet() %>%
  addTiles() %>%
  addPolylines(data = act1_sf,
               color = "#fc4c02")
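As a sketch of the "bit more fiddling", the version below adds a second base layer and a layer control to switch between them. The coordinates, group names and `toy_` names are made up for illustration; with real data you would pass act1_sf to addPolylines() as above.

```r
library(leaflet)

# Made-up route coordinates (hypothetical values).
toy_lon <- c(4.86, 4.88, 4.87)
toy_lat <- c(52.36, 52.36, 52.37)

toy_map <- leaflet() %>%
  # Two alternative base layers, each assigned to a named group.
  addProviderTiles(providers$CartoDB.Positron, group = "Positron") %>%
  addTiles(group = "OpenStreetMap") %>%
  # The route itself, as an overlay that can be toggled on and off.
  addPolylines(lng = toy_lon, lat = toy_lat,
               color = "#fc4c02", group = "Route") %>%
  # Control widget: radio buttons for base layers, a checkbox for the overlay.
  addLayersControl(
    baseGroups    = c("Positron", "OpenStreetMap"),
    overlayGroups = "Route"
  )
```

Printing toy_map in an interactive session (or an R Markdown/Quarto document) renders the widget.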

For now, that’s that! Please feel free to make use of the code for your own archived data, and if you have any suggestions or make any further progress, feel free to get in touch.

strava GPX r spatial

Samuel Langton

I am a former researcher and current open science enthusiast. I’ve recently finished a project on promoting and developing open science practices and computational reproducibility at Amsterdam UMC.
