Quantifying Londoners' change in fitness habits during lockdown


With people being urged to stay at home during the Covid-19 outbreak, gyms closed, and social distancing measures in place, it’s safe to say that fitness and workout trends have changed along with other aspects of daily life. Using data from Twitter, we set out to investigate the popularity of certain fitness-related keywords, before and during the UK lockdown that started on the 23rd of March 2020, to see if we can uncover Londoners' exercise trends.

Dataset

The dataset consists of Tweets that include the word “london”, either as a hashtag or a standalone word, from the 1st of January 2020 to the 11th of May 2020. The lockdown is still not over as this article is being written. The Tweets were gathered using Twint, a Python-based scraper tool for Twitter.

The resulting dataset consists of roughly 50k Tweets per day, and takes up roughly 1 GB of storage space. We are not talking big data here, but it should be enough to uncover some interesting trends.

For reference, the Twint configuration:

twint -s london -o london_tweets.txt --since "2020-01-01 00:00:00"
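Twint can also be driven from Python instead of the command line. A roughly equivalent configuration would look like the following sketch – the attribute names follow Twint's Python API at the time of writing, so double-check them against the version you have installed:

import twint

# roughly equivalent to the CLI invocation above
c = twint.Config()
c.Search = "london"
c.Since = "2020-01-01 00:00:00"
c.Output = "london_tweets.txt"
twint.run.Search(c)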

Note: a link to the dataset, and a GitHub repo, can be found at the end of the article.

Hypothesis

With the lockdown in place, and people urged to stay at home – only allowed out for necessities and one daily exercise – we should be able to spot some trends before and after the lockdown, as people alter their habits and, hopefully, broadcast the changes to Twitter and the world.

We should expect to see non-gym-dependent fitness such as running and yoga gain popularity, and gym-dependent exercises such as swimming decrease. For example, I had to swap out going to my local gym and pool for weights at home and an amazing cork yoga mat by Viking Panda – though I did not take to Twitter to broadcast the habit change.

Further, online classes for various exercises have skyrocketed in popularity, and we should be able to spot some trends on, for example, #onlineclasses or #workyoga.

Let’s define an initial list of words to investigate. In this case, we will look for the word in the entire tweet text, and not just among the hashtags:

  • yoga
  • run
  • homegym
  • gym
  • swim
  • walk
  • onlinefitness
  • onlineclasses
  • workyoga

Analysis

The data analysis is done with Python, mainly Pandas dataframes, through a few functions. The first step is to convert the text file of tweets into something we can work with more easily.

Each line in the text file constitutes one Tweet, and each tweet is split into a space-separated format:

“tweet_id date time timezone username tweet_text”

For example:

1257763147626287112 2020-05-05 21:04:37 BST <username> I don’t doubt that. Still good going for a starter. They way you’re going you can join @username in the London Marathon

This format makes it easy for us to tidy up: we can simply iterate through each line and split by space – “ “.

The following code will read in the text file into memory, and then iterate through each tweet and tidy it up into a nice list of structured dictionaries.

Additionally, it will check that the text conforms with the date format %Y-%m-%d; if it doesn’t, an exception will occur and the datapoint is omitted. This step is necessary to remove errors where the timestamp is missing from the tweet.

Note that the date is represented twice in the dictionary: there is an entry called “day” and another called “timestamp”. They carry essentially the same information, but keeping a separate “day” entry makes it easier to group the data later.

Further, one could argue that a sequential for loop is a slow way to go, but the dataset is quite small and the time saved by just writing the for loop vastly outweighs spending more time on speeding up the code – the total runtime is just a few minutes.

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
import numpy as np
import nltk
from datetime import datetime
import matplotlib.pyplot as plt

tweet_path = "path to tweets file"

# read the text file into a variable
with open(tweet_path) as f:
    tweet_list = list(f)

# converts each tweet line into a nice dict format
# returns None for malformed lines so they can be omitted
def line_to_json(text_line):

    tweet_dict = {}
    try:
        text = text_line.split(" ")
        # validation only: raises ValueError if the date is missing or malformed
        datetime.strptime(text[1], "%Y-%m-%d")
        tweet_dict['id'] = text[0]
        tweet_dict['day'] = text[1]
        tweet_dict['timestamp'] = text[1]+" " + text[2]
        tweet_dict["timezone"] = text[3]
        tweet_dict["username"] = text[4]
        # everything from the sixth field onwards is the tweet text itself
        tweet_dict['text'] = " ".join(text[5:]).lower()
    except Exception as e:
        print(e)
        print(text_line," --> ",text)
        return None
    return tweet_dict

new_tweet_list = [line_to_json(tweet) for tweet in tweet_list]
new_tweet_list = [t for t in new_tweet_list if t is not None]  # drop the omitted datapoints

Let’s check how many tweets we have in total, after some were omitted due to data errors.

print("Number of tweets",len(tweet_list))

The output shows we are dealing with roughly 5.5 million Tweets.

Working with a list of dictionaries is not ideal for data analysis. Instead, we’ll opt for Pandas.

Serving pandas a list of dictionaries yields a nicely formatted dataframe.

df = pd.DataFrame(new_tweet_list)
df['timestamp'] = pd.to_datetime(df['timestamp'],errors="coerce")

Additionally, convert the timestamp column to a pandas datetime object. This is useful for later processing.
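Any timestamp that fails to parse becomes NaT thanks to errors="coerce"; if we want to be strict, we could drop those rows as well – a minimal sketch:

# drop rows whose timestamp could not be parsed (NaT after to_datetime)
df = df.dropna(subset=['timestamp'])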

Let's inspect the first 5 rows:

df.head()
  | id | day | timestamp | timezone | username | text
0 | 1257763174629195779 | 2020-05-05 | 2020-05-05 21:04:43 | BST | <some_user> | we are 2 weeks behind them, we had plenty of w…
1 | 1257763166018138113 | 2020-05-05 | 2020-05-05 21:04:41 | BST | <some_user> | never imagined #london could look like this!  …
2 | 1257763164592226306 | 2020-05-05 | 2020-05-05 21:04:41 | BST | <some_user> | gang of london 8/10\n
3 | 1257763163426172930 | 2020-05-05 | 2020-05-05 21:04:40 | BST | <some_user> | not london or manchester for a change ???? this @…
4 | 1257763159823220744 | 2020-05-05 | 2020-05-05 21:04:40 | BST | <some_user> | i’ll name that chair in one, ‘broken chair’ by…

We have now converted the initial text file into a nicely structured dataframe held in memory. Usernames are removed for anonymisation purposes.

The next step is to define some code to count tweets with a keyword such as ‘yoga’, and group the count per day.

The data structure is now a column with the full tweet text, which allows us to use str.contains across the tweets – something that is super quick with pandas.

The following two functions search the text of each tweet for a keyword. Although the variable name is ‘hashtag’, they will also pick up plain words – the naming is just to stay true to the nature of Twitter.

def get_tweets_from_hashtag(df,hashtag="london"):
    return df[df['text'].str.contains(hashtag, na = False)].copy()
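As a quick sanity check, we can pull out all tweets mentioning a given word. Note that the tweet text was lowercased during loading, so lowercase keywords match everything:

# example usage: all tweets whose text contains "yoga"
yoga_df = get_tweets_from_hashtag(df=df, hashtag="yoga")
print(len(yoga_df))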

The next function uses the function defined above to find the count of the supplied word – or hashtag – grouped by day.

Then some throat-clearing to get nice column names, before we merge the data into one formatted dataframe.

‘total’ is the total number of tweets for that day.

def get_num_tweets_per_day(df,hashtag='london'):
    tmp_df_tweets = get_tweets_from_hashtag(df=df,hashtag=hashtag)
    tmp_df = tmp_df_tweets.groupby('day').count()
    full_df = df.groupby(by=df['day']).count()[["text"]]
    full_df.columns = ["total"]
    tmp_df = pd.concat([tmp_df, full_df], axis=1, join='inner')
    tmp_df = tmp_df[['text','total']]
    tmp_df["hashtag"] = hashtag
    tmp_df.columns=["count","total","hashtag"]
    tmp_df["percentage"] = (tmp_df["count"]/tmp_df["total"])*100
    return tmp_df

Let’s begin with the word “lockdown” to see if we can spot a trend.

tweets_per_day_df = get_num_tweets_per_day(df=df, hashtag="lockdown")
display(tweets_per_day_df)
day | count | total | hashtag | percentage
2020-01-02 | 1 | 35925 | lockdown | 0.002784
2020-01-03 | 2 | 34768 | lockdown | 0.005752
2020-01-04 | 2 | 33258 | lockdown | 0.006014
2020-01-07 | 1 | 37702 | lockdown | 0.002652
2020-01-10 | 2 | 41478 | lockdown | 0.004822
2020-01-11 | 1 | 37931 | lockdown | 0.002636
2020-01-12 | 2 | 36150 | lockdown | 0.005533
2020-01-14 | 1 | 54522 | lockdown | 0.001834
2020-01-15 | 3 | 44526 | lockdown | 0.006738
2020-01-18 | 1 | 37863 | lockdown | 0.002641
2020-01-20 | 2 | 46204 | lockdown | 0.004329
2020-01-21 | 1 | 20590 | lockdown | 0.004857
2020-01-22 | 6 | 55993 | lockdown | 0.010716
2020-01-23 | 37 | 43522 | lockdown | 0.085014
2020-01-24 | 6 | 44963 | lockdown | 0.013344
2020-01-25 | 12 | 36972 | lockdown | 0.032457
2020-01-26 | 9 | 33502 | lockdown | 0.026864
2020-01-27 | 2 | 40581 | lockdown | 0.004928
2020-01-28 | 5 | 44980 | lockdown | 0.011116
2020-01-29 | 5 | 44676 | lockdown | 0.011192
2020-01-30 | 2 | 45479 | lockdown | 0.004398
2020-01-31 | 10 | 53012 | lockdown | 0.018864
2020-02-01 | 2 | 40796 | lockdown | 0.004902
2020-02-02 | 5 | 49968 | lockdown | 0.010006
2020-02-03 | 12 | 47899 | lockdown | 0.025053
2020-02-04 | 31 | 48453 | lockdown | 0.063980
2020-02-05 | 9 | 44750 | lockdown | 0.020112
2020-02-06 | 6 | 45029 | lockdown | 0.013325
2020-02-07 | 15 | 41916 | lockdown | 0.035786
2020-02-08 | 2 | 40969 | lockdown | 0.004882
2020-02-09 | 2 | 43127 | lockdown | 0.004637
2020-02-10 | 3 | 46258 | lockdown | 0.006485
2020-02-11 | 10 | 48414 | lockdown | 0.020655
2020-02-12 | 6 | 47543 | lockdown | 0.012620
2020-02-13 | 3 | 48097 | lockdown | 0.006237
2020-02-14 | 40 | 44128 | lockdown | 0.090645
2020-02-15 | 31 | 38185 | lockdown | 0.081184
2020-02-16 | 12 | 35623 | lockdown | 0.033686
2020-02-17 | 8 | 45291 | lockdown | 0.017664
2020-02-18 | 2 | 51086 | lockdown | 0.003915
2020-02-19 | 5 | 49695 | lockdown | 0.010061
2020-02-20 | 2 | 50623 | lockdown | 0.003951
2020-02-21 | 2 | 46649 | lockdown | 0.004287
2020-02-22 | 4 | 39368 | lockdown | 0.010161
2020-02-23 | 7 | 36217 | lockdown | 0.019328
2020-02-24 | 18 | 42468 | lockdown | 0.042385
2020-02-25 | 15 | 51938 | lockdown | 0.028881
2020-02-26 | 85 | 47929 | lockdown | 0.177346
2020-02-27 | 19 | 50424 | lockdown | 0.037680
2020-02-28 | 25 | 48212 | lockdown | 0.051854
2020-02-29 | 20 | 39109 | lockdown | 0.051139
2020-03-01 | 12 | 35150 | lockdown | 0.034139
2020-03-02 | 13 | 42806 | lockdown | 0.030370
2020-03-03 | 14 | 45469 | lockdown | 0.030790
2020-03-04 | 15 | 45012 | lockdown | 0.033324
2020-03-05 | 31 | 45894 | lockdown | 0.067547
2020-03-06 | 16 | 45723 | lockdown | 0.034993
2020-03-07 | 8 | 35760 | lockdown | 0.022371
2020-03-08 | 32 | 27887 | lockdown | 0.114749
2020-03-09 | 82 | 44109 | lockdown | 0.185903
2020-03-10 | 88 | 30472 | lockdown | 0.288790
2020-03-11 | 139 | 44787 | lockdown | 0.310358
2020-03-12 | 202 | 48203 | lockdown | 0.419061
2020-03-13 | 97 | 28867 | lockdown | 0.336024
2020-03-14 | 162 | 38135 | lockdown | 0.424807
2020-03-15 | 194 | 35067 | lockdown | 0.553227
2020-03-16 | 339 | 41553 | lockdown | 0.815826
2020-03-17 | 504 | 41576 | lockdown | 1.212238
2020-03-18 | 3131 | 46894 | lockdown | 6.676760
2020-03-19 | 5245 | 54895 | lockdown | 9.554604
2020-03-20 | 1875 | 50476 | lockdown | 3.714637
2020-03-21 | 1031 | 39634 | lockdown | 2.601302
2020-03-22 | 1726 | 42984 | lockdown | 4.015448
2020-03-23 | 2299 | 46133 | lockdown | 4.983418
2020-03-24 | 3160 | 50735 | lockdown | 6.228442
2020-03-25 | 2162 | 49367 | lockdown | 4.379444
2020-03-26 | 1396 | 46149 | lockdown | 3.024984
2020-03-27 | 1274 | 42760 | lockdown | 2.979420
2020-03-28 | 1115 | 36239 | lockdown | 3.076796
2020-03-29 | 1008 | 34524 | lockdown | 2.919708
2020-03-30 | 1000 | 38198 | lockdown | 2.617938
2020-03-31 | 999 | 40899 | lockdown | 2.442603
2020-04-01 | 978 | 43197 | lockdown | 2.264046
2020-04-02 | 834 | 40153 | lockdown | 2.077055
2020-04-03 | 1072 | 40862 | lockdown | 2.623464
2020-04-04 | 1120 | 37275 | lockdown | 3.004695
2020-04-05 | 1920 | 45255 | lockdown | 4.242625
2020-04-06 | 1079 | 39683 | lockdown | 2.719048
2020-04-07 | 957 | 37633 | lockdown | 2.542981
2020-04-08 | 1381 | 40707 | lockdown | 3.392537
2020-04-09 | 1084 | 38047 | lockdown | 2.849108
2020-04-10 | 1591 | 36717 | lockdown | 4.333143
2020-04-11 | 1084 | 29361 | lockdown | 3.691972
2020-04-12 | 1081 | 37506 | lockdown | 2.882206
2020-04-13 | 1146 | 34913 | lockdown | 3.282445
2020-04-14 | 1124 | 33985 | lockdown | 3.307341
2020-04-15 | 933 | 33247 | lockdown | 2.806268
2020-04-16 | 1181 | 36639 | lockdown | 3.223341
2020-04-17 | 2063 | 47554 | lockdown | 4.338226
2020-04-18 | 125 | 4513 | lockdown | 2.769776
2020-04-19 | 1046 | 34463 | lockdown | 3.035139
2020-04-20 | 987 | 36569 | lockdown | 2.699007
2020-04-21 | 1013 | 47667 | lockdown | 2.125160
2020-04-22 | 1049 | 39516 | lockdown | 2.654621
2020-04-23 | 1198 | 39434 | lockdown | 3.037988
2020-04-24 | 1546 | 39748 | lockdown | 3.889504
2020-04-25 | 1286 | 34765 | lockdown | 3.699123
2020-04-26 | 1311 | 37445 | lockdown | 3.501135
2020-04-27 | 1289 | 37816 | lockdown | 3.408610
2020-04-28 | 1124 | 39129 | lockdown | 2.872550
2020-04-29 | 899 | 38327 | lockdown | 2.345605
2020-04-30 | 1201 | 42076 | lockdown | 2.854359
2020-05-01 | 1171 | 40938 | lockdown | 2.860423
2020-05-02 | 1713 | 38234 | lockdown | 4.480305
2020-05-03 | 1316 | 39731 | lockdown | 3.312275
2020-05-04 | 1279 | 40254 | lockdown | 3.177324
2020-05-05 | 1323 | 42485 | lockdown | 3.114040
2020-05-06 | 1748 | 42784 | lockdown | 4.085639
2020-05-07 | 2036 | 46743 | lockdown | 4.355732
2020-05-08 | 1476 | 46876 | lockdown | 3.148733
2020-05-09 | 1880 | 41986 | lockdown | 4.477683
2020-05-10 | 1577 | 42227 | lockdown | 3.734577
2020-05-11 | 1523 | 39188 | lockdown | 3.886394


This table clearly shows a huge spike after the lockdown was imposed on Londoners. At the beginning of the year only a handful of tweets contained the word lockdown, and as we get closer to the UK lockdown the number jumps into the thousands.

For example, on the 19th of March almost 10% of the Tweets that day mentioned the lockdown.
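Since the dataframe returned by get_num_tweets_per_day is indexed by day, we can pull out that exact row to double-check:

# 5245 of 54895 tweets on the peak day, roughly 9.6%
print(tweets_per_day_df.loc["2020-03-19"])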

Let’s create a simple plotting function, to see trends visually.

def make_time_series_plot(df,hashtag="lockdown"):
    # count tweets per day for the supplied word
    tweets_per_day_df = get_num_tweets_per_day(df=df,hashtag=hashtag)

    x = tweets_per_day_df.index.values
    y = tweets_per_day_df["count"].values

    fig = plt.figure(figsize=(20,10))
    ax1 = fig.add_subplot(111)
    ax1.plot(x,y)
    ax1.set_xticklabels(x)
    plt.xticks(rotation=90)
    ax1.legend([hashtag])
    plt.xlabel('DATE')
    plt.ylabel('Number of Tweets')
    plt.title('Number of Tweets containing the word '+hashtag)
    n = 7  # Keeps every 7th label
    [l.set_visible(False) for (i,l) in enumerate(ax1.xaxis.get_ticklabels()) if i % n != 0]

    plt.rcParams.update({'font.size': 20})
    plt.show()
    
make_time_series_plot(df=df,hashtag="lockdown")

This function takes the main dataframe and a word, and plots a time series of the daily counts for that word.

With the code working, and confirmation that there are trends to find in the dataset, let’s dig into the list above and look at some fitness trends. Instead of inspecting every word individually, we will construct a new dataframe with a column for each keyword, plus the accompanying percentage per day.

The following snippet will loop over a list of keywords using a lovely for loop, and then merge the results into a new dataframe based on the index. The result is a new dataframe where we can easily see trends.

def get_multiple_hashtag_count(df,key_word_list = []):
    df_list = []

    for i, word in enumerate(key_word_list):
        tweets_per_day_df = get_num_tweets_per_day(df=df,hashtag=word)
        tweets_per_day_df.rename(columns = {'count':word, 'percentage': word+"_pct"}, inplace = True)

        column_list = [word,word+"_pct"]
        if i == 0:
            column_list = ['total']+column_list # we only need the total once, and we want it first

        df_list.append(tweets_per_day_df[column_list])

    merged_df = pd.concat(df_list, axis=1)
    
    return merged_df

key_word_list = ["yoga","lockdown", "run", "gym", "homegym", "walk", "swim", "onlinefitness", "onlineclasses", "workyoga"]

merged_df = get_multiple_hashtag_count(df=df,key_word_list = key_word_list)

We expect to see a rise in popularity for running and yoga as the lockdown begins. Running could also see a rise in popularity as spring and the lockdown coincide – but we will ignore this assumption for now.

Let’s remind ourselves of the defined list of words from above:

  • yoga
  • run
  • homegym
  • gym
  • swim
  • walk
  • onlinefitness
  • onlineclasses
  • workyoga

Searching for ‘gym’ will also yield results from ‘homegym’ as we are relying upon str.contains(), but we will simply ignore this overlap in this instance.
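If we wanted to exclude such substring matches, str.contains() also accepts a regular expression, so a word-boundary pattern would do the trick – a minimal sketch:

# match "gym" only as a whole word, so "homegym" is no longer counted
gym_only_df = df[df['text'].str.contains(r'\bgym\b', regex=True, na=False)]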

Running this code results in a nice dataframe, with some surprises. The code also generates percentages for each keyword, but the percentage columns have been omitted below to make the table easier to interpret.

date | total | lockdown | yoga | run | gym | walk | swim | homegym | onlinefitness | onlineclasses | workyoga

[Table: daily counts for each keyword from 2020-01-01 to 2020-05-11, one row per day with the columns above, and NaN where a keyword did not occur that day. The full table can be regenerated with the code above.]


Let’s plot the table and look at the data visually. This time, we want to plot more than one property at a time, so let’s modify our plotting function a little bit.

This function takes a dataframe from get_multiple_hashtag_count, and returns a nice visual for the selected properties.

from matplotlib.colors import LinearSegmentedColormap # needed for the custom colour map below

def make_multi_property_plots(df,columns=[]):
    if len(columns) == 0:
        to_plot_df = df.copy()
    else:
        to_plot_df = df[columns].copy()
    # creating a x-list from the strings, before converting the datatype to datetime
    # This will be used as x labels later
    x = to_plot_df.index
    to_plot_df.index = pd.to_datetime(to_plot_df.index)

    # defining a list of easy to read colours, source: https://gist.github.com/tsherwen/268a3f2a4b638de299dabe0375970041
    CB_color_cycle = ['#377eb8', '#ff7f00', '#4daf4a',
                      '#f781bf', '#a65628', '#984ea3',
                      '#999999', '#e41a1c', '#dede00']
    cmap = LinearSegmentedColormap.from_list('mycmap', CB_color_cycle)


    fig = plt.figure(figsize=(20,10))
    ax1 = fig.add_subplot(111)
    

    to_plot_df.plot(ax=ax1,cmap=cmap,x_compat=True)
    ax1.set_xticks(x)
    ax1.set_xticklabels(x)
    plt.xticks(rotation=90)
    plt.xlabel('DATE')
    plt.ylabel('Number of Tweets')
    plt.title('Number of Tweets per day containing key word(s) x')
    n = 7 # Showing every n tick, source: https://stackoverflow.com/questions/20337664/cleanest-way-to-hide-every-nth-tick-label-in-matplotlib-colorbar
    [l.set_visible(False) for (i,l) in enumerate(ax1.xaxis.get_ticklabels()) if i % n != 0]

    plt.rcParams.update({'font.size': 20})
    plt.show()

With the plotting function defined, let's send in the dataframe with a few selected columns.

make_multi_property_plots(df=merged_df,columns=['lockdown','yoga','run','gym','walk','swim','homegym'])
Time series for the number of Tweets for different words.

We can see some visual trends here.

There is a data error on 2020-04-18 where all properties dip down; inspecting the total column in the table confirms this. It is likely due to a web-scraping error for that day.

The fluctuation in range between the different variables makes the plot difficult to interpret, so let's break it down a little and plot only ‘yoga’, ‘run’, and ‘gym’.

make_multi_property_plots(df=merged_df,columns=['yoga','run','gym'])

There are some clear trends for yoga and gym, while run pretty much looks like a random walk with no clear trend.

There is a spike in run around the time the lockdown was put in place, but similar and even bigger spikes appear elsewhere in the time series.

There are no clear trends for run we can see from this plot; there might be hidden textual differences, but those are not visible in a frequency plot.

So let’s remove run, and plot only gym and yoga.

make_multi_property_plots(df=merged_df,columns=['yoga','gym'])

There is no clear trend that Tweets with yoga increase after the lockdown.

The number of Tweets with yoga hovers around 20-30 pretty much throughout, except for one huge spike on the 1st of April – let’s investigate this spike more closely by looking at some tweets from that day.

tmp_df = get_tweets_from_hashtag(df=df, hashtag="yoga")
tmp_df = tmp_df[tmp_df.day == "2020-04-01"]

Manually inspecting the dataframe clearly shows that almost all the tweets contain the same text with small variations, leading to different Instagram links with the same photo.

The photo asks for contact details in return for a free book, a phishing attempt by a bot network.

Let’s have a look at the number of unique users for that day.

len(tmp_df.username.unique())

This returns 212, out of a total of 642 Tweets that day – which means each bot posted variations of the Tweet roughly three times.

Most of the Tweets contain ‘RealGod’, and if we remove all Tweets containing that phrase we’re left with 18 Tweets.

drilled_down_list = [x for x in tmp_df.text.values if "RealGod".lower() not in x.lower()]
print(len(drilled_down_list))
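The same filter can also be expressed directly in pandas with a negated str.contains(), keeping the result as a dataframe:

# pandas equivalent: keep only tweets that do not mention the bot phrase
drilled_down_df = tmp_df[~tmp_df['text'].str.contains("realgod", na=False)]
print(len(drilled_down_df))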

The spike on April the 1st had nothing to do with either the lockdown or the virus – this was just a good ol’ bot network.

With the mystery of the yoga spike solved, let’s have a look at the gym property, which reveals some interesting trends at first sight.

We can see a big spike of gym-related tweets around the lockdown; immediately after, the number of Tweets appears to normalise back to previous levels, and then dips somewhat below pre-lockdown levels.

The difference is minor, but there is a clear visual dip in the number of Tweets containing gym after the lockdown was imposed.

Revisiting our table, the other properties we had selected either don’t change much before and after the lockdown, or don’t yield many tweets.

For example, barely anyone talks about “onlinefitness” or “onlineclasses”, and “workyoga” must be a word only used by me.

I also tried the same word combinations (bigrams) separated by space instead of a joined hashtag format – the result was pretty much the same as their joined cousins.

Summary and further work

The dataset consists of Tweets with the keyword london, and reveals spikes in the use of certain words as the lockdown was imposed. The hypothesis did not fully hold for all the keywords, but some words, such as run and gym – and not least lockdown – clearly spiked around the time the lockdown was imposed.

The word lockdown went from barely ever being used to making up a considerable percentage of all Tweets containing london. Interestingly, words such as yoga did not rise considerably in popularity after the lockdown – even though it’s an easily accessible form of home workout.

This little exercise is just the tip of the iceberg of the information hidden in the dataset, and I encourage further analysis of the data. Some ideas follow below.

Word frequency analysis, with n-grams

The top words before and after the lockdown most likely look very different, particularly if paired with keywords and other n-grams. For example, the word closed is probably non-existent together with gym before the lockdown was imposed.
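As a starting point, here is a minimal sketch of such a bigram comparison, reusing the functions above and taking the 23rd of March as the cut-off; the choice of the gym keyword is just an example:

from collections import Counter
from nltk import ngrams

def top_bigrams(texts, n=10):
    # count the most common word pairs across a collection of tweets
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), 2))
    return counts.most_common(n)

gym_df = get_tweets_from_hashtag(df=df, hashtag="gym")
before_df = gym_df[gym_df['day'] < "2020-03-23"]
after_df = gym_df[gym_df['day'] >= "2020-03-23"]

print(top_bigrams(before_df['text']))
print(top_bigrams(after_df['text']))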

Sentiment analysis

Particularly on the Tweets before and after the lockdown, and in conjunction with keywords such as run or gym. Maybe even on words where the count stayed consistent, such as yoga – there could be interesting patterns hidden in the underlying text.
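One way to test this would be NLTK's VADER analyser – a minimal sketch, assuming the vader_lexicon resource is available:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-off download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

gym_df = get_tweets_from_hashtag(df=df, hashtag="gym")
# compound score ranges from -1 (most negative) to 1 (most positive)
gym_df['sentiment'] = gym_df['text'].apply(lambda t: sia.polarity_scores(t)['compound'])
print(gym_df.groupby('day')['sentiment'].mean())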

Swapping out Twitter

Running a similar analysis on data from another social network, such as Instagram or TikTok, might reveal completely different results. I theorise that people would rather take to Instagram than Twitter to share about life in lockdown, particularly when it comes to lockdown fitness.

Accessing the data and the code

The code is gathered into a notebook on GitHub.

The dataset is available as a tidy csv file through this link.
Note that the data has been lightly anonymised by removing the usernames in the username column, and then exported using Pandas.

Therefore the data can be read directly into a dataframe using pd.read_csv(), and the data loading steps outlined in the article can be skipped.
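For example, a minimal sketch – the filename here is hypothetical, so use whatever the downloaded file is called:

import pandas as pd

# hypothetical filename for the downloaded dataset
df = pd.read_csv("london_tweets.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'], errors="coerce")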

By Christopher Ottesen

Chris is a data scientist based in London, UK.
