Quantifying Londoners change in fitness habits during lockdown

With people being urged to stay at home during the Covid-19 outbreak, gyms closed, and social distancing measures in place, it’s safe to say that fitness and workout trends have changed along with other aspects of daily life. Using data from Twitter, we set out to investigate the popularity of certain fitness related keywords, before and during the UK lockdown that started on the 23rd of March 2020, to see if we can uncover Londoners exercise trends.

Dataset

The dataset consists of Tweets that include the word “london”, either as hashtags or standalone words, starting from 1st of January 2020 to the 11th of May 2020. The lockdown is still not over as this article is written. The Tweets were gathered using Twint, a Python-based scraper tool for Twitter.

The resulting dataset consists of roughly 50k Tweets per day, and takes up roughly 1gb of storage space. We are not taking big data here, but it should be enough to uncover some interesting trends.

For reference, the Twint configuration:

twint -s london -o london_tweets.txt --since "2020-01-01 00:00:00"

Note, link to the dataset, and a Github repo can be found at the end of the article.

Hypothesis

With the lockdown, and people urged to stay at home – only allowed out for necessities and one daily exercise, we should be able to spot some trends before after the lockdown. When people alter their habits, and hopefully broadcast their changes to Twitter, and the world.

We should expect to see trends where non-gym dependent fitness such as running, and yoga gain popularity, and gym dependent exercises such as swimming decreases. For example, I had to swap out going to my local gym and pool, with weights at home, and an amazing cork yoga mat by Viking Panda – though I did not take to Twitter to broadcast the habit change.

Further online classes of various exercises have also skyrocketed in popularity, and we should be able to spot some trends on for example #onlineclasses or #workyoga.

Let’s define an initial list of words to investigate, in this case, we will look for the word in the entire tweet text, and not just among the hashtags.

yoga
run
homegym
gym
swim
walk
onlinefitness
onlineclasses
workyoga

Analysis

The data analysis is done with Python, and mainly Pandas dataframe, through a few functions. The first step is to convert the text file of tweets into something we can easier work with.

Each line in the text file constitutes one Tweet, and each tweet is split into a space-separated format:

“tweet_id date time timezone username tweet_text”

For example:

1257763147626287112 2020-05-05 21:04:37 BST <username> I don’t doubt that. Still good going for a starter. They way you’re going you can join @username in the London Marathon

This format makes it easy for us to tidy up, we can simply iterate through each line and split by space – “ “.

The following code will read in the text file into memory, and then iterate through each tweet and tidy it up into a nice list of structured dictionaries.

Additionally, it will do a check on the date to check that the text conforms with the date dateformat %Y-%m-%d, if it doesn’t conform with the dateformat an exception will occur and the datapoint is omitted. This step is necessary to remove errors where the timestamp is missing from the tweet.

Note that date is essentially represented twice in the dictionary, we have an entry called “day” and another entry called “timestamp”, and they are essentially the same, but keeping a separate “day” entry makes it easier to group the data later.

Further, we can argue that a sequential for loop is a slow way to go, but the dataset is quite small and the time saved for writing the for loop vastly outperforms spending more time on speeding up the code – total runtime is just a few minutes.

import pandas as pd
pd.set_option('max_columns', None)
pd.set_option('display.max_rows', 500)
import numpy as np
import nltk
from datetime import datetime
import matplotlib.pyplot as plt

tweet_path = "path to tweets file"

# read the text file into a variable
with open(tweet_path) as f:
    tweet_list = list(f)

# converts each tweet into a nice dict format
# appends the dicts to a list
def line_to_json(text_line):
    
    tweet_dict = {}
    try:
        text = text_line.split(" ")
        day_format = datetime.strptime(text[1], "%Y-%m-%d")
        tweet_dict['id'] = text[0]
        tweet_dict['day'] = text[1]
        tweet_dict['timestamp'] = text[1]+" " + text[2]
        tweet_dict["timezone"] = text[3]
        tweet_dict["username"] = text[4]
        tweet_dict['text'] = " ".join(text[5:]).lower()
    except Exception as e:
        print(e)
        print(text_line," --> ",text)
    return tweet_dict

new_tweet_list = [line_to_json(text) for tweet in tweet_list]

Let’s check how many tweets we have in total, after some were omitted due to data errors.

print("Number of tweets",len(tweet_list))

Output shows we are dealing with a roughly 5.5 million Tweets.

Working with a list of dictionaries is not ideal for data analysis. Instead, we’ll opt for Pandas.

Serving pandas a list of dictionaries yields in a nicely formated dataframe.

df = pd.DataFrame(new_tweet_list)
df['timestamp'] = pd.to_datetime(df['timestamp'],errors="coerce")

Additionally, convert the timestamp column to a pandas datetime object. This is useful for later processing.

Lets inspect the first 5 rows

df.head()

	id	day	timestamp	timezone	username	text
0	1257763174629195779	2020-05-05	2020-05-05 21:04:43	BST	<some_user>	we are 2 weeks behind them, we had plenty of w…
1	1257763166018138113	2020-05-05	2020-05-05 21:04:41	BST	<some_user>	never imagined #london could look like this! …
2	1257763164592226306	2020-05-05	2020-05-05 21:04:41	BST	some_user	gang of london 8/10\n
3	1257763163426172930	2020-05-05	2020-05-05 21:04:40	BST	<some_user>	not london or manchester for a change ???? this @…
4	1257763159823220744	2020-05-05	2020-05-05 21:04:40	BST	<some_user>	i’ll name that chair in one, ‘broken chair’ by…

We have now converted the inital text file, into a nice strucutred dataframe held in memory. Usernames are removed for anonymisation purposes.

The next step is to define some code to count tweets with a keyword such as ‘yoga’, and group the count per day.

The data structure is now a column of the full tweet text, which allow us to use string contains across the tweets, something that is super quick with pandas.

The following two functions will search through the string of each tweet containing a keyword. Although the variable name is ‘hashtag’, it will also pickup words. The naming is more to stay true to the nature of Twitter.

def get_tweets_from_hashtag(df,hashtag="london"):
    return df[df['text'].str.contains(hashtag, na = False)].copy()

The next function uses the function defined above, but groups the data per day, and finds the count of the supplied word – or hashtag – grouped by day.

Then some throat-clearing to get nice column names, before we merge the data into one formatted dataframe.

‘total’ is the total number of tweets for that day.

def get_num_tweets_per_day(df,hashtag='london'):
    tmp_df_tweets = get_tweets_from_hashtag(df=df,hashtag=hashtag)
    tmp_df = tmp_df_tweets.groupby('day').count()
    full_df = df.groupby(by=df['day']).count()[["text"]]
    full_df.columns = ["total"]
    tmp_df = pd.concat([tmp_df, full_df], axis=1, join='inner')
    tmp_df = tmp_df[['text','total']]
    tmp_df["hashtag"] = hashtag
    tmp_df.columns=["count","total","hashtag"]
    tmp_df["percentage"] = (tmp_df["count"]/tmp_df["total"])*100
    return tmp_df

Let’s begin with the word “lockdown” to see if we can spot a trend.

tweets_per_day_df = get_num_tweets_per_day("lockdown")
display(tweets_per_day_df)

day	count	total	hashtag	percentage
2020-01-02	1	35925	lockdown	0.002784
2020-01-03	2	34768	lockdown	0.005752
2020-01-04	2	33258	lockdown	0.006014
2020-01-07	1	37702	lockdown	0.002652
2020-01-10	2	41478	lockdown	0.004822
2020-01-11	1	37931	lockdown	0.002636
2020-01-12	2	36150	lockdown	0.005533
2020-01-14	1	54522	lockdown	0.001834
2020-01-15	3	44526	lockdown	0.006738
2020-01-18	1	37863	lockdown	0.002641
2020-01-20	2	46204	lockdown	0.004329
2020-01-21	1	20590	lockdown	0.004857
2020-01-22	6	55993	lockdown	0.010716
2020-01-23	37	43522	lockdown	0.085014
2020-01-24	6	44963	lockdown	0.013344
2020-01-25	12	36972	lockdown	0.032457
2020-01-26	9	33502	lockdown	0.026864
2020-01-27	2	40581	lockdown	0.004928
2020-01-28	5	44980	lockdown	0.011116
2020-01-29	5	44676	lockdown	0.011192
2020-01-30	2	45479	lockdown	0.004398
2020-01-31	10	53012	lockdown	0.018864
2020-02-01	2	40796	lockdown	0.004902
2020-02-02	5	49968	lockdown	0.010006
2020-02-03	12	47899	lockdown	0.025053
2020-02-04	31	48453	lockdown	0.063980
2020-02-05	9	44750	lockdown	0.020112
2020-02-06	6	45029	lockdown	0.013325
2020-02-07	15	41916	lockdown	0.035786
2020-02-08	2	40969	lockdown	0.004882
2020-02-09	2	43127	lockdown	0.004637
2020-02-10	3	46258	lockdown	0.006485
2020-02-11	10	48414	lockdown	0.020655
2020-02-12	6	47543	lockdown	0.012620
2020-02-13	3	48097	lockdown	0.006237
2020-02-14	40	44128	lockdown	0.090645
2020-02-15	31	38185	lockdown	0.081184
2020-02-16	12	35623	lockdown	0.033686
2020-02-17	8	45291	lockdown	0.017664
2020-02-18	2	51086	lockdown	0.003915
2020-02-19	5	49695	lockdown	0.010061
2020-02-20	2	50623	lockdown	0.003951
2020-02-21	2	46649	lockdown	0.004287
2020-02-22	4	39368	lockdown	0.010161
2020-02-23	7	36217	lockdown	0.019328
2020-02-24	18	42468	lockdown	0.042385
2020-02-25	15	51938	lockdown	0.028881
2020-02-26	85	47929	lockdown	0.177346
2020-02-27	19	50424	lockdown	0.037680
2020-02-28	25	48212	lockdown	0.051854
2020-02-29	20	39109	lockdown	0.051139
2020-03-01	12	35150	lockdown	0.034139
2020-03-02	13	42806	lockdown	0.030370
2020-03-03	14	45469	lockdown	0.030790
2020-03-04	15	45012	lockdown	0.033324
2020-03-05	31	45894	lockdown	0.067547
2020-03-06	16	45723	lockdown	0.034993
2020-03-07	8	35760	lockdown	0.022371
2020-03-08	32	27887	lockdown	0.114749
2020-03-09	82	44109	lockdown	0.185903
2020-03-10	88	30472	lockdown	0.288790
2020-03-11	139	44787	lockdown	0.310358
2020-03-12	202	48203	lockdown	0.419061
2020-03-13	97	28867	lockdown	0.336024
2020-03-14	162	38135	lockdown	0.424807
2020-03-15	194	35067	lockdown	0.553227
2020-03-16	339	41553	lockdown	0.815826
2020-03-17	504	41576	lockdown	1.212238
2020-03-18	3131	46894	lockdown	6.676760
2020-03-19	5245	54895	lockdown	9.554604
2020-03-20	1875	50476	lockdown	3.714637
2020-03-21	1031	39634	lockdown	2.601302
2020-03-22	1726	42984	lockdown	4.015448
2020-03-23	2299	46133	lockdown	4.983418
2020-03-24	3160	50735	lockdown	6.228442
2020-03-25	2162	49367	lockdown	4.379444
2020-03-26	1396	46149	lockdown	3.024984
2020-03-27	1274	42760	lockdown	2.979420
2020-03-28	1115	36239	lockdown	3.076796
2020-03-29	1008	34524	lockdown	2.919708
2020-03-30	1000	38198	lockdown	2.617938
2020-03-31	999	40899	lockdown	2.442603
2020-04-01	978	43197	lockdown	2.264046
2020-04-02	834	40153	lockdown	2.077055
2020-04-03	1072	40862	lockdown	2.623464
2020-04-04	1120	37275	lockdown	3.004695
2020-04-05	1920	45255	lockdown	4.242625
2020-04-06	1079	39683	lockdown	2.719048
2020-04-07	957	37633	lockdown	2.542981
2020-04-08	1381	40707	lockdown	3.392537
2020-04-09	1084	38047	lockdown	2.849108
2020-04-10	1591	36717	lockdown	4.333143
2020-04-11	1084	29361	lockdown	3.691972
2020-04-12	1081	37506	lockdown	2.882206
2020-04-13	1146	34913	lockdown	3.282445
2020-04-14	1124	33985	lockdown	3.307341
2020-04-15	933	33247	lockdown	2.806268
2020-04-16	1181	36639	lockdown	3.223341
2020-04-17	2063	47554	lockdown	4.338226
2020-04-18	125	4513	lockdown	2.769776
2020-04-19	1046	34463	lockdown	3.035139
2020-04-20	987	36569	lockdown	2.699007
2020-04-21	1013	47667	lockdown	2.125160
2020-04-22	1049	39516	lockdown	2.654621
2020-04-23	1198	39434	lockdown	3.037988
2020-04-24	1546	39748	lockdown	3.889504
2020-04-25	1286	34765	lockdown	3.699123
2020-04-26	1311	37445	lockdown	3.501135
2020-04-27	1289	37816	lockdown	3.408610
2020-04-28	1124	39129	lockdown	2.872550
2020-04-29	899	38327	lockdown	2.345605
2020-04-30	1201	42076	lockdown	2.854359
2020-05-01	1171	40938	lockdown	2.860423
2020-05-02	1713	38234	lockdown	4.480305
2020-05-03	1316	39731	lockdown	3.312275
2020-05-04	1279	40254	lockdown	3.177324
2020-05-05	1323	42485	lockdown	3.114040
2020-05-06	1748	42784	lockdown	4.085639
2020-05-07	2036	46743	lockdown	4.355732
2020-05-08	1476	46876	lockdown	3.148733
2020-05-09	1880	41986	lockdown	4.477683
2020-05-10	1577	42227	lockdown	3.734577
2020-05-11	1523	39188	lockdown	3.886394

Expand/Collapse

This table clearly shows that there is a huge spike after the lockdown was imposed on Londoners. In the beginning of the year only a handful of tweets contained the word lockdown, and as we get closer to the UK lockdown the number jumps into the thousands.

For example on the 19th of March almost 10% of the Tweets that day mentioned the lockdown.

Let’s create a simple plotting function, to see trends visually.

def make_time_series_plot(df,hashtag="lockdown"):
    tweets_per_day_df = get_num_tweets_per_day(df=df,hashtag="lockdown")

    x = tweets_per_day_df.index.values
    y = tweets_per_day_df["count"].values

    fig = plt.figure(figsize=(20,10))
    ax1 = fig.add_subplot(111)
    ax1.plot(x,y)
    ax1.set_xticklabels(x)
    plt.xticks(rotation=90)
    ax1.legend([hashtag])
    plt.xlabel('DATE')
    plt.ylabel('Number of Tweets')
    plt.title('Number of Tweets containing the word '+hashtag)
    n = 7  # Keeps every 7th label
    [l.set_visible(False) for (i,l) in enumerate(ax1.xaxis.get_ticklabels()) if i % n != 0]

    plt.rcParams.update({'font.size': 20})
    plt.show()
    
make_time_series_plot(df=df,hashtag="lockdown")

This function takes the main dataframe, and a word, and returns a timeseries based on the words.

With the working code, and confirmation that there are trends to find in the dataset, let’s dig into the list above, and look at some trends for fitness. Instead of inspecting everything individually, we will construct a new dataframe with a column for each of the variables, and the accompanying percentage per day.

The following snippet will loop over a list of keywords using a lovely for loop, and then merge the results into a new dataframe based on the index. The result is a new dataframe where we can easily see trends.

def get_multiple_hashtag_count(df,key_word_list = []):
    df_list = []

    i = 0
    for word in key_word_list:
        tweets_per_day_df = get_num_tweets_per_day(df=df,hashtag=word)
        tweets_per_day_df.rename(columns = {'count':word, 'percentage': word+"_pct"}, inplace = True)

        column_list = [word,word+"_pct"]
        if i == 0:
            column_list = ['total']+column_list # we only need this once, and we want total to be first

        df_list.append(tweets_per_day_df[column_list])
        i+=1

    merged_df = pd.concat(df_list, axis=1)
    
    return merged_df

key_word_list = ["yoga","lockdown", "run", "gym", "homegym", "walk", "swim", "onlinefitness", "onlineclasses", "workyoga"]

merged_df = get_multiple_hashtag_count(df=df,key_word_list = key_word_list)

We are expecting to see a rise in popularity for running and yoga as the lockdown begins. Running could also see a rise in popularity as spring and the lockdown coincides – but we will ignore this assumption for now.

Let’s remind ourself of the defined list of words from above

yoga
run
homegym
gym
swim
walk
onlinefitness
onlineclasses
workyoga

Searching for ‘gym’ will also yield results from ‘homegym’ as we are relying upon str.contains(), but we will simply ignore any potential seasonal trends in this instance.

Running this code results in a nice dataframe, with some surprises. The code will also generate percentages for each property, but the percentage columns have been omitted below to make interpretability of the table easier.

date	total	lockdown	yoga	run	gym	walk	swim	homegym	onlinefitness	onlineclasses	workyoga
2020-01-01	38527	NaN	10	572	37	353	30	NaN	NaN	NaN	NaN
2020-01-02	35925	1.0	14	606	58	320	27	NaN	NaN	NaN	NaN
2020-01-03	34768	2.0	16	608	57	378	26	NaN	NaN	NaN	NaN
2020-01-04	33258	2.0	17	558	39	481	28	NaN	NaN	NaN	NaN
2020-01-05	34525	NaN	12	633	40	356	31	NaN	NaN	NaN	NaN
2020-01-06	37123	NaN	17	673	57	385	30	NaN	NaN	NaN	NaN
2020-01-07	37702	1.0	28	716	71	372	30	NaN	NaN	NaN	NaN
2020-01-08	44104	NaN	30	774	85	422	34	NaN	NaN	1.0	NaN
2020-01-09	40348	NaN	12	742	59	409	37	NaN	NaN	NaN	NaN
2020-01-10	41478	2.0	24	842	58	414	23	NaN	NaN	NaN	NaN
2020-01-11	37931	1.0	19	712	42	369	27	NaN	NaN	NaN	NaN
2020-01-12	36150	2.0	19	719	47	396	33	NaN	NaN	NaN	NaN
2020-01-13	49961	NaN	23	801	64	436	38	NaN	NaN	NaN	NaN
2020-01-14	54522	1.0	20	1040	69	429	38	NaN	NaN	1.0	NaN
2020-01-15	44526	3.0	28	892	64	378	42	NaN	NaN	NaN	NaN
2020-01-16	42320	NaN	23	794	67	380	38	NaN	NaN	NaN	NaN
2020-01-17	42778	NaN	21	887	60	413	38	NaN	NaN	NaN	NaN
2020-01-18	37863	1.0	22	705	42	412	40	NaN	NaN	NaN	NaN
2020-01-19	35382	NaN	21	647	42	442	38	NaN	NaN	NaN	NaN
2020-01-20	46204	2.0	20	858	69	448	46	NaN	NaN	NaN	NaN
2020-01-21	20590	1.0	7	312	22	196	21	NaN	NaN	NaN	NaN
2020-01-22	55993	6.0	25	797	78	425	37	NaN	NaN	NaN	NaN
2020-01-23	43522	37.0	27	753	70	400	25	NaN	NaN	NaN	NaN
2020-01-24	44963	6.0	23	787	53	446	37	1.0	NaN	NaN	NaN
2020-01-25	36972	12.0	20	671	52	370	34	1.0	NaN	NaN	NaN
2020-01-26	33502	9.0	19	735	41	340	30	NaN	NaN	NaN	NaN
2020-01-27	40581	2.0	25	726	55	309	32	NaN	NaN	NaN	NaN
2020-01-28	44980	5.0	14	888	72	403	28	NaN	NaN	NaN	NaN
2020-01-29	44676	5.0	27	902	75	411	44	NaN	NaN	NaN	NaN
2020-01-30	45479	2.0	24	992	80	418	43	NaN	NaN	NaN	NaN
2020-01-31	53012	10.0	18	920	50	475	29	NaN	NaN	NaN	NaN
2020-02-01	40796	2.0	15	746	39	389	33	NaN	NaN	NaN	NaN
2020-02-02	49968	5.0	21	903	40	456	28	NaN	NaN	NaN	NaN
2020-02-03	47899	12.0	27	851	45	485	43	NaN	NaN	NaN	NaN
2020-02-04	48453	31.0	33	907	81	492	28	NaN	NaN	NaN	NaN
2020-02-05	44750	9.0	25	807	66	443	39	NaN	NaN	NaN	NaN
2020-02-06	45029	6.0	15	776	62	447	34	NaN	NaN	NaN	NaN
2020-02-07	41916	15.0	26	816	51	480	50	NaN	NaN	NaN	NaN
2020-02-08	40969	2.0	24	910	47	424	27	NaN	NaN	NaN	NaN
2020-02-09	43127	2.0	32	1495	53	379	28	NaN	NaN	NaN	NaN
2020-02-10	46258	3.0	29	1158	64	393	30	NaN	NaN	NaN	NaN
2020-02-11	48414	10.0	20	1034	41	462	47	NaN	NaN	NaN	NaN
2020-02-12	47543	6.0	26	914	73	448	47	NaN	NaN	NaN	NaN
2020-02-13	48097	3.0	24	873	67	448	24	18.0	NaN	NaN	NaN
2020-02-14	44128	40.0	23	806	100	430	32	19.0	NaN	NaN	NaN
2020-02-15	38185	31.0	17	742	41	425	33	NaN	NaN	NaN	NaN
2020-02-16	35623	12.0	19	916	51	336	29	1.0	NaN	NaN	NaN
2020-02-17	45291	8.0	23	1019	80	500	36	NaN	NaN	NaN	NaN
2020-02-18	51086	2.0	26	1086	92	504	28	11.0	NaN	NaN	NaN
2020-02-19	49695	5.0	32	1065	84	477	34	8.0	NaN	NaN	NaN
2020-02-20	50623	2.0	25	941	67	446	44	6.0	NaN	NaN	NaN
2020-02-21	46649	2.0	27	807	87	546	31	9.0	NaN	NaN	NaN
2020-02-22	39368	4.0	24	708	47	378	31	5.0	NaN	NaN	NaN
2020-02-23	36217	7.0	21	732	69	397	33	6.0	NaN	NaN	NaN
2020-02-24	42468	18.0	23	875	69	389	65	NaN	NaN	NaN	NaN
2020-02-25	51938	15.0	10	1006	72	400	41	3.0	NaN	NaN	NaN
2020-02-26	47929	85.0	25	863	68	433	32	3.0	NaN	NaN	NaN
2020-02-27	50424	19.0	21	1442	86	500	33	28.0	NaN	NaN	NaN
2020-02-28	48212	25.0	23	1000	73	440	36	1.0	NaN	NaN	NaN
2020-02-29	39109	20.0	21	773	56	393	30	5.0	1.0	NaN	NaN
2020-03-01	35150	12.0	23	821	55	353	46	9.0	NaN	NaN	NaN
2020-03-02	42806	13.0	37	833	57	424	39	2.0	NaN	NaN	NaN
2020-03-03	45469	14.0	24	828	54	488	41	NaN	NaN	NaN	NaN
2020-03-04	45012	15.0	21	807	76	422	40	NaN	NaN	NaN	NaN
2020-03-05	45894	31.0	24	876	72	347	40	4.0	NaN	NaN	NaN
2020-03-06	45723	16.0	25	795	52	459	38	2.0	NaN	NaN	NaN
2020-03-07	35760	8.0	19	741	61	387	27	5.0	NaN	NaN	NaN
2020-03-08	27887	32.0	20	593	49	307	23	7.0	NaN	NaN	NaN
2020-03-09	44109	82.0	15	860	46	411	31	1.0	NaN	NaN	NaN
2020-03-10	30472	88.0	23	633	47	298	26	1.0	NaN	NaN	NaN
2020-03-11	44787	139.0	30	889	70	391	36	NaN	NaN	NaN	NaN
2020-03-12	48203	202.0	21	822	95	449	44	NaN	NaN	NaN	NaN
2020-03-13	28867	97.0	14	798	52	271	33	NaN	NaN	NaN	NaN
2020-03-14	38135	162.0	23	774	43	432	89	4.0	NaN	NaN	NaN
2020-03-15	35067	194.0	13	659	62	427	48	NaN	NaN	NaN	NaN
2020-03-16	41553	339.0	25	689	78	481	36	NaN	NaN	NaN	NaN
2020-03-17	41576	504.0	21	809	77	398	32	NaN	NaN	1.0	NaN
2020-03-18	46894	3131.0	33	786	63	477	31	NaN	NaN	1.0	NaN
2020-03-19	54895	5245.0	25	974	130	529	26	1.0	NaN	2.0	NaN
2020-03-20	50476	1875.0	19	1101	478	519	32	NaN	NaN	2.0	NaN
2020-03-21	39634	1031.0	11	772	89	527	30	2.0	NaN	1.0	NaN
2020-03-22	42984	1726.0	27	768	65	685	22	NaN	NaN	1.0	NaN
2020-03-23	46133	2299.0	41	880	61	595	23	NaN	NaN	3.0	NaN
2020-03-24	50735	3160.0	33	1225	78	653	26	NaN	NaN	1.0	NaN
2020-03-25	49367	2162.0	40	942	53	528	29	NaN	1.0	NaN	NaN
2020-03-26	46149	1396.0	26	834	51	548	21	1.0	NaN	2.0	NaN
2020-03-27	42760	1274.0	23	712	54	551	23	NaN	NaN	NaN	NaN
2020-03-28	36239	1115.0	20	619	23	445	11	NaN	1.0	NaN	NaN
2020-03-29	34524	1008.0	18	677	32	485	22	NaN	NaN	NaN	NaN
2020-03-30	38198	1000.0	25	618	24	405	19	NaN	NaN	3.0	NaN
2020-03-31	40899	999.0	21	641	43	474	17	1.0	NaN	1.0	NaN
2020-04-01	43197	978.0	646	756	38	456	14	NaN	NaN	1.0	NaN
2020-04-02	40153	834.0	49	919	37	449	16	NaN	NaN	1.0	NaN
2020-04-03	40862	1072.0	25	646	33	419	27	1.0	NaN	1.0	NaN
2020-04-04	37275	1120.0	27	615	23	526	20	1.0	NaN	NaN	NaN
2020-04-05	45255	1920.0	34	915	37	919	30	1.0	NaN	NaN	NaN
2020-04-06	39683	1079.0	28	663	38	452	17	NaN	NaN	2.0	NaN
2020-04-07	37633	957.0	27	619	33	437	18	1.0	1.0	NaN	NaN
2020-04-08	40707	1381.0	27	755	35	410	20	1.0	NaN	3.0	NaN
2020-04-09	38047	1084.0	31	633	44	437	18	1.0	NaN	5.0	NaN
2020-04-10	36717	1591.0	33	547	23	422	33	NaN	NaN	1.0	NaN
2020-04-11	29361	1084.0	36	489	23	433	12	NaN	NaN	2.0	NaN
2020-04-12	37506	1081.0	25	544	36	430	22	NaN	NaN	1.0	NaN
2020-04-13	34913	1146.0	26	524	41	419	27	NaN	NaN	1.0	NaN
2020-04-14	33985	1124.0	23	560	35	392	22	NaN	NaN	NaN	NaN
2020-04-15	33247	933.0	22	563	37	399	30	NaN	NaN	3.0	NaN
2020-04-16	36639	1181.0	13	600	45	452	29	1.0	NaN	7.0	NaN
2020-04-17	47554	2063.0	27	723	64	504	32	1.0	2.0	12.0	NaN
2020-04-18	4513	125.0	1	52	4	59	4	NaN	NaN	NaN	NaN
2020-04-19	34463	1046.0	14	642	44	427	23	1.0	NaN	NaN	NaN
2020-04-20	36569	987.0	21	612	33	381	31	NaN	NaN	3.0	NaN
2020-04-21	47667	1013.0	25	714	37	365	26	NaN	1.0	NaN	NaN
2020-04-22	39516	1049.0	22	816	41	472	29	NaN	NaN	1.0	NaN
2020-04-23	39434	1198.0	33	671	34	431	26	NaN	1.0	4.0	NaN
2020-04-24	39748	1546.0	19	801	33	551	28	NaN	NaN	1.0	NaN
2020-04-25	34765	1286.0	22	814	33	485	30	1.0	NaN	NaN	NaN
2020-04-26	37445	1311.0	17	1737	49	600	29	NaN	NaN	NaN	NaN
2020-04-27	37816	1289.0	13	717	36	410	28	NaN	NaN	1.0	NaN
2020-04-28	39129	1124.0	22	616	44	392	30	NaN	NaN	2.0	NaN
2020-04-29	38327	899.0	23	545	38	358	20	NaN	1.0	2.0	NaN
2020-04-30	42076	1201.0	25	605	44	400	23	1.0	NaN	4.0	NaN
2020-05-01	40938	1171.0	15	628	38	518	21	NaN	NaN	1.0	NaN
2020-05-02	38234	1713.0	19	551	24	448	29	NaN	NaN	2.0	NaN
2020-05-03	39731	1316.0	10	547	38	413	22	NaN	NaN	NaN	NaN
2020-05-04	40254	1279.0	25	553	41	469	24	NaN	NaN	1.0	NaN
2020-05-05	42485	1323.0	15	606	31	499	23	NaN	NaN	1.0	NaN
2020-05-06	42784	1748.0	7	784	40	599	19	1.0	2.0	NaN	NaN
2020-05-07	46743	2036.0	23	693	36	533	22	NaN	NaN	11.0	NaN
2020-05-08	46876	1476.0	18	634	38	537	37	NaN	NaN	3.0	NaN
2020-05-09	41986	1880.0	19	605	59	801	47	NaN	NaN	NaN	NaN
2020-05-10	42227	1577.0	20	603	47	870	34	NaN	NaN	NaN	NaN
2020-05-11	39188	1523.0	22	631	35	757	39	NaN	NaN	1.0	NaN

Expand/Collapse

Let’s plot the table and look at the data visually. This time, we want to plot more than one property at the time, so lets modify our plotting function a litle bit.

This function takes a dataframe from get_multiple_hashtag_count , and returns a nice visual for the selected properties.

def make_multi_property_plots(df,columns=[]):
    if len(columns) == 0:
        to_plot_df = df.copy()
    else:
        to_plot_df = df[columns].copy()
    # creating a x-list from the strings, before converting the datatype to datetime
    # This will be used as x labels later
    x = to_plot_df.index
    to_plot_df.index = pd.to_datetime(to_plot_df.index)

    # defining a list of easy to read colours, source: https://gist.github.com/tsherwen/268a3f2a4b638de299dabe0375970041
    CB_color_cycle = ['#377eb8', '#ff7f00', '#4daf4a',
                      '#f781bf', '#a65628', '#984ea3',
                      '#999999', '#e41a1c', '#dede00']
    cmap = LinearSegmentedColormap.from_list('mycmap', CB_color_cycle)


    fig = plt.figure(figsize=(20,10))
    ax1 = fig.add_subplot(111)
    

    to_plot_df.plot(ax=ax1,cmap=cmap,x_compat=True)
    ax1.set_xticks(x)
    ax1.set_xticklabels(x)
    plt.xticks(rotation=90)
    plt.xlabel('DATE')
    plt.ylabel('Number of Tweets')
    plt.title('Number of Tweets per day containing key word(s) x')
    n = 7 # Showing every n tick, source: https://stackoverflow.com/questions/20337664/cleanest-way-to-hide-every-nth-tick-label-in-matplotlib-colorbar
    [l.set_visible(False) for (i,l) in enumerate(ax1.xaxis.get_ticklabels()) if i % n != 0]

    plt.rcParams.update({'font.size': 20})
    plt.show()

With the plotting function defined, lets send in the dataframe with a few selected columns.

make_multi_property_plots(df=new_merged_df,columns=['lockdown','yoga','run','gym','walk' ,'swim','homegym'])

upload successful — Time series for the number of Tweets for different words.

We can see some visual trends here,

There is a data error on the 2020-04-18 where all properties dip down. Inspecting the total column in the table also confirms this. This is likely due to a web-scraping error for that day.

The fluctuation in the range between the different variables makes it difficult to interpret the plot. So let’s break them down a little bit and plot only ‘yoga’, ‘run’, and ‘gym’.

make_multi_property_plots(df=new_merged_df,columns=['yoga','run','gym'])

There are some clear trends on yoga, and gym, while run pretty much looks like a Random Walk where there is no clear trend.

There is a spike around the time the lockdown was put in place, but there are similar, and even bigger spikes also other places in the time series.

There is no clear trends we can see from this plot, there might be hidden textual differences, but that is not visible based on this frequency plot.

So let’s remove run, and plot only gym and yoga.

make_multi_property_plots(df=new_merged_df,columns=['yoga','gym'])

There is no clear trend that Tweets with yoga increase after the lockdown.

The number of Tweets with Yoga hovers around 20-30 pretty much throughout, except from one huge spike on 1st of April – let’s investigate this spike closer, by looking at some tweets from that day.

tmp_df = get_tweets_from_hashtag(hashtag="yoga")
tmp_df = tmp_df[tmp_df.day == "2020-04-01"]

Manually inspecting the dataframe clearly shows that almost all the tweets contain the same text with small variations, that leads to different Instagram links with the same photo.

The photo asks for contact details in return for a free book, a phishing attempt by a bot network.

Let’s have a look at the number of unique users for that day.

len(tmp_df.username.unique())

This returns 212, of a total of 642 Tweets that day. Which means that each bot has posted variations of the Tweet roughly three times each.

Most of the Tweets contain ‘RealGod’, and if we remove all Tweets containing that phrase we’re left with 18 Tweets.

drilled_down_list = [x for x in tmp_df.text.values if "RealGod".lower() not in x.lower()]
print(len(drilled_down_list))

The spike for April the 1st had nothing to do with either lockdown or virus, this was just a god ‘ol bot network.

With the mystery of the yoga spike out of mind, let’s have a look at the gym property, which reveals some interesting trends at first sight.

We can see a big spike of gym-related tweets around the lockdown, and immediately after, the number of Tweets appear to normalise back to previous levels, and then dip down somewhat more compared to before the lockdown.

The difference is minor, but there is a clear visual dip in the number of Tweets containing gym after the lockdown was imposed.

Revisiting our table, the other properties we had selected either don’t change much before and after the lockdown, or don’t yield many tweets.

For example, barely anyone talks about “onlinefitness” or “onlineyoga”, and “workyoga” must be a word only used by me.

I also tried the same word combinations (bigrams) separated by space instead of a joined hashtag format – the result was pretty much the same as their joined cousins.

Summary and further work

The dataset consists of Tweets with the keyword london, and reveals spikes in the use of certain words as the lockdown was imposed. The hypothesis did not fully prove true for all the keywords, but some words such as run and gym, and least to mention lockdown clearly spiked around the time the lockdown was imposed.

The word lockdown went from barely ever used, to be a considerable percentage of all Tweets containing london. Interestingly, words such as yoga did not considerably rise in popularity after the lockdown – even though its an easily accessible form for home workout.

This little exercise is just the tip of the iceberg of the information hidden in the dataset, and I encourage further analysis of the data. Some ideas follow below.

Word Freqency analysis, with n-grams

The top words before and after the lockdown are most likely looking very different, and partially if paired with keywords, and other n-grams. For example, the word closed is probably non-existent together with gym before the lockdown was imposed.

Sentiment analysis

Particularly on the Tweets before and after the lockdown, and in conjunction with keywords such as run, or gym. Maybe even on the words where the count stayed consistent such as yoga, there could be interesting patterns hidden in the underlying text.

Swapping out Twitter

Running a similar analysis on data from another social media such as Instagram, or TikTok might reveal completely different results. I theorise that people would rather take to Instagram, than Twitter to share about life in lockdown, and particularily related to lockdown-fitness.

Accessing the data and the code

The code is gathered into a notebook on Github

The dataset is available as a tidy csv file through this link.
Note that the data has been lightly anonymised, by removing the usernames in the username column, and then exported using Pandas.

Therefore the data can be read directly to a dataframe using pd.read_csv(), and the data loading steps outlined in the article can be skipped.

Auto Amazon Links: No products found.

Quantifying Londoners change in fitness habits during lockdown

Dataset

Hypothesis

Analysis

Summary and further work

Word Freqency analysis, with n-grams

Sentiment analysis

Swapping out Twitter

Accessing the data and the code

Leave a Reply Cancel reply

Search

About

Archive

Categories

Recent Posts

Tags

Social Icons

Gallery