Fitbit has improved its online data export tool to allow exporting all of your accumulated data, a big step up from the previous export tool, which was limited to averages over a fixed period of time.
With devices that are meant to be worn 24/7 to track everything from sleep and steps to heart rate, the data quickly piles up. This tutorial presents a Python script to convert the exported data into an easily usable format that can be analysed further in Python or opened in other tools such as Excel.
Note: To access the Fitbit API or find other useful health-related data scripts, head over to the Open_Health_Data_Analysis project on GitHub.
The new export tool produces a huge number of .json files, where each file roughly corresponds to one day. This makes further data analysis finicky, especially with a couple of years' worth of data. To get around this, we will merge the JSON files into a more manageable combined .csv file. The code for this tutorial is available in the accompanying GitHub repository, and I encourage pull requests to enhance the script.
1 - Export data from Fitbit
2 - Load the .csv files with Pandas
3 - Merge the individual measures such as steps, heart rate, etc. into one Pandas dataframe
4 - Save the Dataframe to a .csv file
5 - Simple Exploratory Data Analysis
Head over to Fitbit’s data export page and request a data export.
This takes a couple of minutes, as Fitbit has to ‘prepare’ the file. Depending on how long you have had a Fitbit, the download itself can also take some time.
The downloaded files are in a folder named with your first and last name. Diving into this folder, we see ‘Challenges’, ‘user-profile’, and ‘user-site-export’; it’s the latter folder the main data is stored in.
The main data folder contains a range of files with measures of different values, but in this tutorial, we will limit ourselves to files for altitude, calories, heart rate, and steps. The approach outlined in this tutorial will also work on the remaining files.
The filenames start with the type of data, such as calories, followed by the date of the measurement, for example calories-2014-06-10.
The altitude files contain fields for time (‘dateTime’) and ‘value’. What this value refers to is a bit vague, but my closest educated guess is a number of stairs, based on the information available in the Fitbit app. An entry looks like this (the value shown is illustrative):

{
  "dateTime" : "07/13/17 17:21:00",
  "value" : "10"
}
The calories files are very similar to the altitude files, but here the value is a decimal number, which makes sense when calculating calories. An entry looks like this (the value shown is illustrative):

{
  "dateTime" : "06/10/14 00:00:00",
  "value" : "1.19"
}
The steps files are similar to the other files; they contain a timestamp as well as a value, which refers to steps.
One thing to note here is that Fitbit reports back the average steps per minute, so all consecutive values come in one-minute intervals. An entry looks like this (the value shown is illustrative):

{
  "dateTime" : "11/21/17 00:00:00",
  "value" : "0"
}
The heart rate files are a bit different from the files described so far. Like the others, they contain a timestamp, but the value holds two sub-values: the raw heart rate reading denoted by ‘bpm’, and the confidence of the reading. The confidence ranges from 0 to 3, where 3 means the device is very confident in the accuracy of the measured heart rate. An entry looks like this (the numbers shown are illustrative):

{
  "dateTime" : "05/08/18 23:00:02",
  "value" : {
    "bpm" : 74,
    "confidence" : 2
  }
}
Inspecting the files, we clearly see a pattern. The plan is to turn each unique ‘value’ field into a column whose header states what the value actually denotes; for example, the steps value will get the column name ‘steps’.
In addition, we will use the timestamp as our merging key and make sure the values follow each other consecutively.
We are going to need a few libraries to get this working. Pandas and glob are installed by default if you installed Python through the Anaconda distribution. Pandas-profiling isn’t required to merge the data files, but we will use it to do a simple exploratory data analysis and create a simple report of the dataset. This will help us understand how the values are distributed and see, for example, how many values are missing. Install it by running ‘pip install pandas-profiling’.
import pandas as pd
import glob
import pandas_profiling
Since we are dealing with a lot of files, we want to load them all automatically. Fitbit has put everything in one folder, and to work with the right files, we will scan the folder using ‘glob’ and create lists of, for example, all the heart rate files. This is done by combining the folder path with the beginning of the file name, so to get all the heart rate files, we match on the ‘heart_rate’ prefix.
We will do this manually for each set of files we want to use, in the following way.
## Creating lists of all the respective files in the directory
glob will open up the folder and create a Python list of the paths to these files. Make sure that your data path is set correctly, as this depends on where you saved your export.
The star means that we want all files whose names start with, for example, ‘heart_rate’.
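A sketch of those lists, assuming the export was unzipped next to the script so the data folder is ‘user-site-export’ (adjust the path to your setup; the exact list variable names for calories and altitude are my own):

```python
import glob

# Path assumes the 'user-site-export' folder sits next to this script;
# adjust it to wherever you extracted the Fitbit archive.
heart_rate_file_list = glob.glob('user-site-export/heart_rate-*.json')
steps_file_list = glob.glob('user-site-export/steps-*.json')
calories_file_list = glob.glob('user-site-export/calories-*.json')
altitude_file_list = glob.glob('user-site-export/altitude-*.json')
```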
With a set of lists with paths to the files we want to open, the next step is to define a few functions to read and handle the data.
def get_json_to_df(file_list):
The get_json_to_df function takes a list of file paths, reads each .json file into a pandas dataframe, and collects the dataframes in a list. The dataframes are then concatenated into one dataframe and returned.
The make_new_df_value function is a helper used with lambda expressions. Some of the .json files contain ‘sub-JSON’ that Pandas is not able to split into separate columns; for example, the heart rate files store the ‘bpm’ and ‘confidence’ values as a sub-JSON under ‘value’, and these end up in the dataframe as dictionary values under the joint column ‘value’. This function combats that problem and converts the dictionary values into separate columns. Since the data is not perfect, the lookup has to go through a try/except; otherwise the lambda expression will fail when it reaches readings with errors (which sadly happens).
The merge_dataframes function is self-explanatory: it takes two dataframes and merges them with an outer join on the dateTime columns. The outer join ensures that all values are preserved, since the timestamps of the readings don’t always match up; for example, the averaged steps timestamps will only match heart rates measured on the exact minute. This function will be used at the end to merge all the individual dataframes into one master dataframe.
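Based on the descriptions above, the three functions might look like this (a sketch, not necessarily the exact original implementation):

```python
import pandas as pd

def get_json_to_df(file_list):
    # Read every .json file into a dataframe, then concatenate them all.
    df_list = [pd.read_json(file) for file in file_list]
    return pd.concat(df_list, ignore_index=True)

def make_new_df_value(value, key):
    # Pull one key out of a dict stored in the 'value' column.
    # The try/except guards against malformed readings.
    try:
        return value[key]
    except (TypeError, KeyError):
        return None

def merge_dataframes(df_left, df_right):
    # Outer join on dateTime so readings with non-matching timestamps survive.
    return pd.merge(df_left, df_right, on='dateTime', how='outer')
```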
Let’s put this into code: we read all the JSON files for each measure and do some data cleaning, such as extracting values from dictionaries and renaming the column headers. Note that this can take a while, as each reading has to be processed, and if you have had a Fitbit for a couple of years, there will be a lot of readings to process.
heart_rate_df = get_json_to_df(file_list = heart_rate_file_list).reset_index()
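To show the cleaning step in a self-contained way, here is the heart rate transformation on a tiny in-memory sample (the rows are made up; in the real script the dataframe comes from get_json_to_df and the splitting uses the make_new_df_value helper described above):

```python
import pandas as pd

# Made-up sample mimicking a freshly loaded heart rate dataframe.
heart_rate_df = pd.DataFrame({
    'dateTime': ['05/08/18 23:00:02', '05/08/18 23:00:07'],
    'value': [{'bpm': 74, 'confidence': 2}, {'bpm': 75, 'confidence': 3}],
})

# Split the dict-valued 'value' column into separate columns.
heart_rate_df['bpm'] = heart_rate_df['value'].apply(
    lambda v: v.get('bpm') if isinstance(v, dict) else None)
heart_rate_df['confidence'] = heart_rate_df['value'].apply(
    lambda v: v.get('confidence') if isinstance(v, dict) else None)
heart_rate_df = heart_rate_df.drop('value', axis=1)

# For single-valued measures such as steps, renaming the column is enough.
steps_df = pd.DataFrame({'dateTime': ['11/21/17 00:00:00'], 'value': ['0']})
steps_df = steps_df.rename(columns={'value': 'steps'})
```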
We are almost done now; in fact, you can skip the merging step and do data analysis on, or save to .csv, the individual dataframes.
To create one big master file, all the individual files are merged by using the function defined previously.
merged = merge_dataframes(heart_rate_df,steps_df)
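Chaining the merges pair by pair works, but the same outer joins can also be collapsed with functools.reduce. Here is a self-contained sketch with made-up stand-in dataframes (the output filename is my own choice):

```python
from functools import reduce
import pandas as pd

# Made-up stand-ins for the cleaned dataframes built earlier.
heart_rate_df = pd.DataFrame({'dateTime': ['t1', 't2'], 'bpm': [74, 75]})
steps_df = pd.DataFrame({'dateTime': ['t1'], 'steps': [12]})
calories_df = pd.DataFrame({'dateTime': ['t2'], 'calories': [1.19]})

# Outer-join every dataframe on dateTime, one after the other.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='dateTime', how='outer'),
    [heart_rate_df, steps_df, calories_df])

merged.to_csv('fitbit_data.csv', index=False)  # filename is an assumption
```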
Finally, let’s inspect the data by creating a visual data report with pandas-profiling.
With this, we have turned the messy .json files into a data format that is easier to work with. The same steps can be used to extract and clean more of the provided .json files by inspecting their JSON format and creating a file list. Despite a few extra steps, intraday time series readings are now more accessible than ever before, opening up a lot of interesting data analysis without going through complicated API authentication.
Again, the code is available in the accompanying GitHub repository, and pull requests to enhance the script are encouraged. Here are a few ideas:
Adding more JSON files to the script
This tutorial only covered a few of the exported measurements, and it is just a matter of a few lines of code to integrate more readings into the cleaning script.
Improve the loading speed
Currently it takes quite a bit of time to load all the JSON files. Since each file carries its own timestamps, the files can be read independently, and the load process can be moved from a slow sequential loop into a quicker parallel function.
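One way to parallelise the loading, sketched with a thread pool (threads overlap the file I/O; the function name is my own):

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def load_files_parallel(file_list, workers=8):
    # Read the .json files concurrently and concatenate the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(pd.read_json, file_list))
    return pd.concat(frames, ignore_index=True)
```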
Smarter merging of the timestamps
Currently the dataframes are naively joined on the timestamp column, but a better approach would be to join, for example, steps and altitude to the nearest dateTime in the other column, to avoid adding unnecessary length to the file and creating more missing values.
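Pandas ships a function for exactly this kind of nearest-timestamp join, pd.merge_asof. A sketch on made-up readings (both frames must be sorted on dateTime):

```python
import pandas as pd

# Made-up readings: steps on the minute, altitude a few seconds off.
steps = pd.DataFrame({
    'dateTime': pd.to_datetime(['2017-11-21 00:01:00', '2017-11-21 00:02:00']),
    'steps': [12, 30],
})
altitude = pd.DataFrame({
    'dateTime': pd.to_datetime(['2017-11-21 00:01:03', '2017-11-21 00:02:02']),
    'altitude': [1, 2],
})

# Match each steps row to the altitude row with the nearest timestamp
# instead of requiring an exact match.
nearest = pd.merge_asof(steps, altitude, on='dateTime', direction='nearest')
```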
Filling in missing values
Currently no missing values are filled in, but this can easily be done with smart, column-specific functions. For example, heart rate can be filled in with a rolling median filter, while a missing value in steps means there was no movement, so those values should be replaced with 0.
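A sketch of such column-specific filling on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up merged data with gaps.
merged = pd.DataFrame({
    'bpm': [74, np.nan, 76, np.nan, 75],
    'steps': [12, np.nan, 0, 5, np.nan],
})

# Steps: a missing reading means no movement, so fill with 0.
merged['steps'] = merged['steps'].fillna(0)

# Heart rate: fill gaps with a centred rolling median of the neighbours.
merged['bpm'] = merged['bpm'].fillna(
    merged['bpm'].rolling(window=3, min_periods=1, center=True).median())
```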
The ultimate goal of this project is to be able to download the data, have Python automatically clean it, and generate helpful reports.
This can be simple exploratory data analysis, or some more clever machine learning on the parameters - with so much data collected, the only limit is our imagination!