With people being urged to stay at home during the Covid-19 outbreak, gyms closed, and social distancing measures in place, it’s safe to say that fitness and workout trends have changed along with other aspects of daily life. Using data from Twitter, we set out to investigate the popularity of certain fitness-related keywords, before and during the UK lockdown that started on the 23rd of March 2020, to see if we can uncover Londoners’ exercise trends.
The dataset consists of Tweets that include the word “london”, either as a hashtag or as a standalone word, from the 1st of January 2020 to the 11th of May 2020. The lockdown is still not over as this article is written. The Tweets were gathered using Twint, a Python-based scraper tool for Twitter.
The resulting dataset consists of roughly 50k Tweets per day, and takes up roughly 1 GB of storage space. We are not talking big data here, but it should be enough to uncover some interesting trends.
For reference, the Twint configuration:
twint -s london -o london_tweets.txt --since "2020-01-01 00:00:00"
Note: links to the dataset and a GitHub repo can be found at the end of the article.
With the lockdown in place and people urged to stay at home - only allowed out for necessities and one daily exercise - we should be able to spot some trends before and after the lockdown, as people alter their habits and, hopefully, broadcast the changes to Twitter and the world.
We should expect to see non-gym-dependent fitness such as running and yoga gain popularity, and gym-dependent exercise such as swimming decrease. For example, I had to swap my local gym and pool for weights at home and an amazing cork yoga mat by Viking Panda - though I did not take to Twitter to broadcast the habit change.
Furthermore, online classes for various kinds of exercise have skyrocketed in popularity, and we should be able to spot some trends for, say, #onlineclasses or #workyoga.
Let’s define an initial list of words to investigate. In this case, we will look for each word in the entire tweet text, and not just among the hashtags.
The data analysis is done in Python, mainly with Pandas dataframes, through a few functions. The first step is to convert the text file of tweets into something that is easier to work with.
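As a rough sketch, such a list could look like the one below. It is assembled from the keywords mentioned throughout this article, so both the exact contents and the variable name `keywords` are assumptions rather than the original list.

```python
# Hypothetical keyword list, assembled from words mentioned in this article.
keywords = [
    "lockdown",
    "yoga",
    "run",
    "gym",
    "swimming",
    "homegym",
    "onlineclasses",
    "onlinefitness",
    "onlineyoga",
    "workyoga",
]

# Lower-case everything once so that later matching can ignore case.
keywords = [word.lower() for word in keywords]
```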
Each line in the text file constitutes one Tweet, and each tweet is split into a space-separated format:
“tweet_id date time timezone username tweet_text”
1257763147626287112 2020-05-05 21:04:37 BST <username> I don’t doubt that. Still good going for a starter. They way you’re going you can join @username in the London Marathon
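This format makes it easy for us to tidy up: we can simply iterate through each line and split on the space character - “ “. Since the tweet text itself contains spaces, only the first five fields should be split off, leaving the rest of the line as the text.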
The following code reads the text file into memory, and then iterates through each tweet and tidies it up into a nice list of structured dictionaries.
Additionally, it checks that the date field conforms to the date format %Y-%m-%d. If it doesn’t, an exception is raised and the data point is omitted. This step is necessary to remove lines where the timestamp is missing from the tweet.
Note that the date is essentially represented twice in the dictionary: there is one entry called “day” and another called “timestamp”. They carry much the same information, but keeping a separate “day” entry makes it easier to group the data later.
Further, one can argue that a sequential for loop is a slow way to go, but the dataset is quite small, and the time saved by writing a simple for loop far outweighs the time it would take to speed up the code - total runtime is just a few minutes.
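The parsing step described above could be sketched roughly as follows. This is a minimal reconstruction, not the original code: the function names, the dictionary keys other than “day” and “timestamp”, and the lower-casing of the text are assumptions, and two sample lines stand in for the real file.

```python
from datetime import datetime


def parse_tweet_line(line):
    """Turn one raw line into a dictionary, or return None if it is malformed."""
    # Split off the first five fields only; the remainder is the tweet text.
    parts = line.rstrip("\n").split(" ", 5)
    if len(parts) < 6:
        return None
    tweet_id, day, clock, timezone, username, text = parts
    try:
        # Raises ValueError when the second field is not a %Y-%m-%d date.
        datetime.strptime(day, "%Y-%m-%d")
    except ValueError:
        return None
    return {
        "id": tweet_id,
        "day": day,
        "timestamp": f"{day} {clock}",
        "timezone": timezone,
        "username": username,
        "text": text.lower(),  # assumption: text is lower-cased for matching
    }


def parse_tweets(lines):
    """Parse an iterable of raw lines and drop the ones that fail the date check."""
    parsed = (parse_tweet_line(line) for line in lines)
    return [tweet for tweet in parsed if tweet is not None]


# Two sample lines instead of the real file: one valid, one without a date.
sample_lines = [
    "1257763147626287112 2020-05-05 21:04:37 BST <username> Still good going for a starter.",
    "a broken line with no id or timestamp at all",
]
tweet_list = parse_tweets(sample_lines)
```

With the real data, the same function can be fed an open file handle for `london_tweets.txt` instead of the sample list, since a file object iterates over its lines.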
Let’s check how many tweets we have in total, after some were omitted due to data errors.
print("Number of tweets",len(tweet_list))
The output shows we are dealing with roughly 5.5 million Tweets.
Working with a list of dictionaries is not ideal for data analysis. Instead, we’ll opt for Pandas.
Passing pandas a list of dictionaries yields a nicely formatted dataframe.
df = pd.DataFrame(new_tweet_list)
Additionally, convert the timestamp column to a pandas datetime object. This is useful for later processing.
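A minimal sketch of that conversion, using `pd.to_datetime` on a one-row stand-in for the real dataframe (the sample row is made up for illustration):

```python
import pandas as pd

# One-row stand-in for the real dataframe.
df = pd.DataFrame([{"day": "2020-05-05", "timestamp": "2020-05-05 21:04:43"}])

# Parse the timestamp strings into pandas datetime values.
df["timestamp"] = pd.to_datetime(df["timestamp"])
```

Once the column holds datetimes, accessors such as `df["timestamp"].dt.hour` become available.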
Let’s inspect the first five rows:
| 0 | 1257763174629195779 | 2020-05-05 | 2020-05-05 21:04:43 | BST | <some_user> | we are 2 weeks behind them, we had plenty of w... |
| 1 | 1257763166018138113 | 2020-05-05 | 2020-05-05 21:04:41 | BST | <some_user> | never imagined #london could look like this! ... |
| 2 | 1257763164592226306 | 2020-05-05 | 2020-05-05 21:04:41 | BST | <some_user> | gang of london 8/10\n |
| 3 | 1257763163426172930 | 2020-05-05 | 2020-05-05 21:04:40 | BST | <some_user> | not london or manchester for a change 😭 this @... |
| 4 | 1257763159823220744 | 2020-05-05 | 2020-05-05 21:04:40 | BST | <some_user> | i'll name that chair in one, 'broken chair' by... |
We have now converted the initial text file into a nice structured dataframe held in memory. Usernames are removed for anonymisation purposes.
The next step is to define some code to count tweets with a keyword such as ‘yoga’, and group the count per day.
The full tweet text now sits in a single column, which allows us to use string matching across all the tweets, something that is super quick with pandas.
The following two functions search the text of each tweet for a keyword. Although the variable name is ‘hashtag’, it will also pick up ordinary words. The naming is more to stay true to the nature of Twitter.
The next function uses the function defined above, and counts the tweets containing the supplied word - or hashtag - per day.
Then comes some housekeeping to get nice column names, before we merge the data into one formatted dataframe.
‘total’ is the total number of tweets for that day.
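The two functions described above might look roughly like the sketch below. The function names and the `hashtag` argument are taken from the calls shown in this article. The function bodies, the reliance on a module-level `df`, and the `<word>_percentage` column name are assumptions, and a four-row toy dataframe stands in for the real data.

```python
import pandas as pd

# Toy stand-in for the main dataframe (made-up rows).
df = pd.DataFrame({
    "day": ["2020-03-22", "2020-03-23", "2020-03-23", "2020-03-24"],
    "text": [
        "morning yoga in london",
        "lockdown starts in london",
        "no gym during lockdown",
        "london is quiet",
    ],
})


def get_tweets_from_hashtag(hashtag):
    """Return the rows whose text contains the keyword, ignoring case."""
    mask = df["text"].str.contains(hashtag, case=False, regex=False)
    return df[mask]


def get_num_tweets_per_day(hashtag):
    """Count matching and total tweets per day, plus the daily percentage."""
    total = df.groupby("day").size().rename("total")
    matches = get_tweets_from_hashtag(hashtag).groupby("day").size().rename(hashtag)
    # Outer-align on the day index; days without a match become 0.
    per_day = pd.concat([matches, total], axis=1).fillna(0)
    per_day[hashtag] = per_day[hashtag].astype(int)
    per_day[f"{hashtag}_percentage"] = 100 * per_day[hashtag] / per_day["total"]
    return per_day.sort_index()
```

Passing `regex=False` treats the keyword as a literal substring, which avoids surprises if a keyword ever contains characters with a special meaning in regular expressions.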
Let’s begin with the word “lockdown” to see if we can spot a trend.
tweets_per_day_df = get_num_tweets_per_day("lockdown")
This table clearly shows that there is a huge spike after the lockdown was imposed on Londoners. At the beginning of the year only a handful of tweets contained the word lockdown, and as we get closer to the UK lockdown the number jumps into the thousands.
For example, on the 19th of March almost 10% of the Tweets that day mentioned the lockdown.
Let’s create a simple plotting function, to see trends visually.
This function takes the main dataframe and a word, and returns a time series for that word.
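A minimal sketch of such a plotting function, assuming matplotlib as the plotting backend. The name `plot_keyword` and the toy rows are hypothetical, not the original code.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_keyword(data, word):
    """Plot and return the number of tweets per day that contain `word`."""
    mask = data["text"].str.contains(word, case=False, regex=False)
    per_day = data[mask].groupby("day").size()
    # A datetime index gives a properly scaled time axis.
    per_day.index = pd.to_datetime(per_day.index)
    ax = per_day.plot(figsize=(10, 4), marker="o")
    ax.set_title(f"Tweets per day containing '{word}'")
    ax.set_xlabel("day")
    ax.set_ylabel("number of tweets")
    return per_day


# Toy stand-in for the main dataframe (made-up rows).
tweets = pd.DataFrame({
    "day": ["2020-03-22", "2020-03-23", "2020-03-23", "2020-03-24"],
    "text": [
        "london before the lockdown",
        "lockdown in london",
        "first day of lockdown",
        "quiet london streets",
    ],
})
lockdown_series = plot_keyword(tweets, "lockdown")
```

In a notebook the figure renders inline. In a script, a call to `plt.show()` after the function displays it.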
With the working code, and confirmation that there are trends to find in the dataset, let’s dig into the list above, and look at some trends for fitness. Instead of inspecting everything individually, we will construct a new dataframe with a column for each of the variables, and the accompanying percentage per day.
The following snippet will loop over a list of keywords using a lovely for loop, and then merge the results into a new dataframe based on the index. The result is a new dataframe where we can easily see trends.
We are expecting to see a rise in popularity for running and yoga as the lockdown begins. Running could also see a rise in popularity as spring and the lockdown coincide - but we will ignore this assumption for now.
Let’s remind ourselves of the list of words defined above.
Searching for ‘gym’ will also yield results from ‘homegym’ as we are relying upon str.contains(), but we will simply ignore that in this instance.
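The loop-and-merge step can be sketched as below. The name `get_multiple_hashtag_count` appears in this article, but its signature, the helper `count_per_day`, the column names, and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def count_per_day(data, word):
    """Number of tweets per day whose text contains `word`."""
    mask = data["text"].str.contains(word, case=False, regex=False)
    return data[mask].groupby("day").size().rename(word)


def get_multiple_hashtag_count(data, hashtags):
    """One row per day: total tweets, plus a count and percentage per keyword."""
    result = data.groupby("day").size().rename("total").to_frame()
    for word in hashtags:
        counts = count_per_day(data, word).to_frame()
        # Merge on the day index; days without a match are filled with 0.
        result = result.merge(counts, how="left", left_index=True, right_index=True)
        result[word] = result[word].fillna(0).astype(int)
        result[f"{word}_percentage"] = 100 * result[word] / result["total"]
    return result


# Toy stand-in for the main dataframe (made-up rows).
tweets = pd.DataFrame({
    "day": ["2020-03-22", "2020-03-22", "2020-03-23", "2020-03-23"],
    "text": [
        "off to the gym in london",
        "yoga by the thames in london",
        "built a #homegym in london",
        "london is in lockdown",
    ],
})
trends = get_multiple_hashtag_count(tweets, ["gym", "yoga"])
```

In the toy data, the ‘#homegym’ tweet is counted under ‘gym’, which illustrates the substring behaviour of `str.contains()`.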
Running this code results in a nice dataframe, with some surprises. The code will also generate percentages for each property, but the percentage columns have been omitted below to make the table easier to read.
Let’s plot the table and look at the data visually. This time, we want to plot more than one property at a time, so let’s modify our plotting function a little bit.
This function takes a dataframe from get_multiple_hashtag_count, and returns a nice visual for the selected properties.
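One way to sketch such a multi-column plotting function with matplotlib is shown below. The name `plot_hashtag_trends`, the vertical marker for the start of the lockdown, and the numbers in the toy dataframe are all assumptions for illustration, not results from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_hashtag_trends(counts_df, columns, lockdown_day="2020-03-23"):
    """Draw one line per selected column and mark the start of the lockdown."""
    days = pd.to_datetime(counts_df.index)
    fig, ax = plt.subplots(figsize=(12, 5))
    for column in columns:
        ax.plot(days, counts_df[column], label=column)
    # Dashed vertical line on the day the UK lockdown started.
    ax.axvline(pd.Timestamp(lockdown_day), color="grey", linestyle="--",
               label="lockdown starts")
    ax.set_xlabel("day")
    ax.set_ylabel("number of tweets")
    ax.legend()
    fig.autofmt_xdate()
    return ax


# Made-up counts, indexed by day like the per-day dataframe in the article.
counts_df = pd.DataFrame(
    {"yoga": [25, 30, 22], "gym": [40, 95, 35]},
    index=["2020-03-22", "2020-03-23", "2020-03-24"],
)
ax = plot_hashtag_trends(counts_df, ["yoga", "gym"])
```

Passing the percentage columns instead of the raw counts puts keywords with very different volumes on a more comparable scale.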
With the plotting function defined, let’s send in the dataframe with a few selected columns.
We can see some visual trends here.
There is a data error on 2020-04-18, where all properties dip. Inspecting the total column in the table confirms this. It is likely due to a web-scraping error for that day.
The difference in range between the variables makes the plot difficult to interpret. So let’s break them down a little bit and plot only ‘yoga’, ‘run’, and ‘gym’.
There are some clear trends for yoga and gym, while run pretty much looks like a random walk with no clear trend.
There is a spike around the time the lockdown was put in place, but there are similar, and even bigger, spikes elsewhere in the time series.
There are no clear trends for run in this plot. There might be hidden textual differences, but they are not visible in a frequency plot.
So let’s remove run, and plot only gym and yoga.
There is no clear sign that Tweets mentioning yoga increased after the lockdown.
The number of Tweets with yoga hovers around 20-30 pretty much throughout, except for one huge spike on the 1st of April - let’s investigate this spike more closely by looking at some tweets from that day.
tmp_df = get_tweets_from_hashtag(hashtag="yoga")
Manually inspecting the dataframe clearly shows that almost all the tweets contain the same text with small variations, leading to different Instagram links with the same photo.
The photo asks for contact details in return for a free book, a phishing attempt by a bot network.
Let’s have a look at the number of unique users for that day.
This returns 212, out of a total of 642 Tweets that day, which means that each bot posted variations of the Tweet roughly three times.
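The unique-user check could be done along these lines. It is a sketch on made-up rows with hypothetical usernames. Because the published CSV has the usernames removed, this step only works on the raw scraped data.

```python
import pandas as pd

# Made-up rows; the usernames are hypothetical.
yoga_df = pd.DataFrame({
    "day": ["2020-04-01", "2020-04-01", "2020-04-01", "2020-04-01", "2020-04-02"],
    "username": ["bot_a", "bot_a", "bot_b", "bot_b", "human_c"],
    "text": ["free yoga book"] * 4 + ["yoga in the park"],
})

# Keep only the day of the spike, then count distinct accounts.
spike_df = yoga_df[yoga_df["day"] == "2020-04-01"]
unique_users = spike_df["username"].nunique()
tweets_per_user = len(spike_df) / unique_users
print(unique_users, round(tweets_per_user, 2))
```

With the figures reported in the article, the same ratio is 642 / 212, or about 3.03 Tweets per account.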
Most of the Tweets contain ‘RealGod’, and if we remove all Tweets containing that phrase we’re left with 18 Tweets.
drilled_down_list = [x for x in tmp_df.text.values if "RealGod".lower() not in x.lower()]
The spike on the 1st of April had nothing to do with either the lockdown or the virus; it was just a good ol’ bot network.
With the mystery of the yoga spike out of the way, let’s have a look at the gym property, which reveals some interesting trends at first sight.
We can see a big spike of gym-related tweets around the lockdown. Immediately after, the number of Tweets appears to normalise back to previous levels, and then dips somewhat lower than before the lockdown.
The difference is minor, but there is a clear visual dip in the number of Tweets containing gym after the lockdown was imposed.
Revisiting our table, the other properties we had selected either don’t change much before and after the lockdown, or don’t yield many tweets.
For example, barely anyone talks about “onlinefitness” or “onlineyoga”, and “workyoga” must be a word only used by me.
I also tried the same word combinations (bigrams) separated by space instead of a joined hashtag format - the result was pretty much the same as their joined cousins.
The dataset consists of Tweets with the keyword london, and reveals spikes in the use of certain words as the lockdown was imposed. The hypothesis did not fully prove true for all the keywords, but some words such as run and gym, not to mention lockdown, clearly spiked around the time the lockdown was imposed.
The word lockdown went from barely ever being used to making up a considerable percentage of all Tweets containing london. Interestingly, words such as yoga did not considerably rise in popularity after the lockdown - even though it’s an easily accessible form of home workout.
This little exercise is just the tip of the iceberg of the information hidden in the dataset, and I encourage further analysis of the data. Some ideas follow below.
The top words before and after the lockdown most likely look very different, particularly when paired with keywords and other n-grams. For example, the word closed is probably non-existent together with gym before the lockdown was imposed.
Particularly on the Tweets before and after the lockdown, and in conjunction with keywords such as run, or gym. Maybe even on the words where the count stayed consistent such as yoga, there could be interesting patterns hidden in the underlying text.
Running a similar analysis on data from another social media platform such as Instagram or TikTok might reveal completely different results. I theorise that people would rather take to Instagram than Twitter to share about life in lockdown, particularly lockdown fitness.
The code is gathered into a notebook on GitHub.
The dataset is available as a tidy csv file through this link.
Note that the data has been lightly anonymised, by removing the usernames in the username column, and then exported using Pandas.
Therefore the data can be read directly to a dataframe using pd.read_csv(), and the data loading steps outlined in the article can be skipped.