I have used the Python data manipulation package Pandas daily and professionally for the last six years, but I’ve never actually done any official training or read books on using Pandas. My Pandas fluency has come slowly by solving daily problems and overcoming obstacles through looking at the Pandas documentation and reading helpful posts on Stack Overflow.
Therefore, I was intrigued to read and review the book The Pandas Workshop by Saikat Basak et al. when their publisher, Packt, reached out. I was given a free copy in exchange for a review, but neither the authors nor Packt had any influence over this review or a chance to read it before it was published.
The Pandas Workshop, or to use its full title – The Pandas Workshop: A comprehensive guide to using Python for data analysis with real-world case studies – aims to help coders become data fluent and use data for analysis. I started the reading journey by examining the table of contents to pick out chapters that would be particularly interesting to me and get an overview of what the book would cover.
The first two chapters are a general introduction to Pandas and the data structures driving Pandas. From chapter three onwards, the juicy bits start flowing. These chapters cover reading and parsing popular data sources such as CSV, Excel, JSON, databases, and web tables.
Then they move into different data types, how to handle missing values, and how to optimize memory usage. The book then moves into practical data selection, data manipulation, and data visualization and inspection.
The book’s last section covers data modelling and machine learning, heavily focused on regression and forecasting. The latter is a breath of fresh air in the Python book world, where I often find authors mainly focusing on classification problems when introducing machine learning.
The example scenarios are heavily geared towards trading and problems faced in a financial environment. Trading happens to be the industry I work in, and much of the work there involves reading data from CSV files, Excel sheets, and various databases, applying multiple statistical analyses to the data, and solving forecasting problems with regression analysis.
Reading through the book, I was amazed by how relevant every section is to my day-to-day job. The book is clearly written by someone who has worked in the industry and knows what problems and demands a Pandas practitioner faces daily. The book covers all the basics of reading data. I was glad to find an in-depth description of common pitfalls in time series analysis and data leakage, along with example implementations and explanations of many commonly used plots such as line charts, Q-Q plots, and bar graphs. The authors even included a whole chapter on dealing with time series, and anyone who has worked with datetimes knows how difficult that is. I would have liked that particular chapter to delve deeper into different date formats, such as working with data from, e.g., the US, where the month and day are reversed compared to most other countries. Different time and date formats have caused me a lot of headaches when working with data from teams spread around the globe.
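To illustrate the month/day headache mentioned above, here is a minimal sketch of my own (it is not taken from the book). The sample strings are made up; the point is only that the same text parses to different dates depending on the format you state, so it seems safer to spell the format out than to let Pandas guess.

```python
import pandas as pd

# Made-up date strings that are valid in both US and day-first notation.
raw = pd.Series(["03/04/2023", "11/12/2023"])

# US convention: month first, then day.
us_dates = pd.to_datetime(raw, format="%m/%d/%Y")

# Convention used in most other countries: day first, then month.
dayfirst_dates = pd.to_datetime(raw, format="%d/%m/%Y")

print(us_dates.dt.strftime("%Y-%m-%d").tolist())        # ['2023-03-04', '2023-11-12']
print(dayfirst_dates.dt.strftime("%Y-%m-%d").tolist())  # ['2023-04-03', '2023-12-11']
```

Neither call raises an error, which is exactly why this kind of mix-up can go unnoticed when data arrives from teams in different countries.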
Overall, the book is everything needed for someone to get up to speed with data analysis. Anyone who wants to break into the financial world will be productive from day one after reading and understanding the scenarios given in this book. This book should be a gift to graduates joining any trading or financial analysis team. I recommend the book to anyone new to Pandas and data analysis. More seasoned Pandas practitioners, however, will most likely already have working experience with most of the content and examples, and are better off sticking with the official documentation or looking for a book targeting the advanced user.
Appendix – some formatting feedback
I read the first edition of the book, and there are a few minor annoyances I’d like to see fixed in the next iteration of the book.
- The authors often use inplace=True instead of returning a new Pandas object. The Pandas maintainers no longer recommend this, and it is generally considered bad coding practice. The preferred way is to return and assign a new Pandas object. For example:
df.dropna(axis='index', how='all', inplace=True)
Should be written as
df = df.dropna(axis='index', how='all')
- The authors are inconsistent about whether they use print() or just write the variable name to show output. The examples were written in a notebook, where a bare variable name at the end of a cell displays its value, but in a book meant for beginners, I’d like to see consistent, explicit use of print() or display() to make the transition from notebooks to scripts smoother.
- There is often too much happening in one line where the authors have tried to condense code. With the line breaks, paragraph breaks, and slashes in the book, the code often becomes very difficult to read. It would be better to perform the operations over several lines of code. Splitting the code increases the number of lines but makes it much easier to read.
For example, they have put long SQL statements inside the function call that runs the query, where they could have created a variable for the SQL query, making the code more readable and easier to print in the limited space of a book. Splitting up the code is also better coding practice when the user later wants to wrap the code in a function to avoid repetition.
- The last point is probably due to formatting, but in the PDF version I had, there was a confusing space in underscore-delimited variables. So “some_variable” became “some _ variable”.
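To make the points about long SQL statements and explicit print() concrete, here is a rough sketch of the style I have in mind. It is my own illustration, not code from the book: the trades table, its columns, the values, and the load_notional helper are all hypothetical, and I use an in-memory SQLite database so the snippet runs on its own.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database with a made-up "trades" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAA", 10.0, 5), ("BBB", 20.0, 2), ("AAA", 11.0, 3)],
)
conn.commit()

# Keep the SQL in its own variable instead of inlining it in the call.
notional_query = """
    SELECT symbol, SUM(price * quantity) AS notional
    FROM trades
    GROUP BY symbol
    ORDER BY symbol
"""


def load_notional(connection):
    """Hypothetical helper: run the query and return a DataFrame."""
    return pd.read_sql_query(notional_query, connection)


notional = load_notional(conn)

# Explicit print() so the snippet behaves the same in a script and a notebook.
print(notional)

conn.close()
```

With the query in its own variable, the call to pd.read_sql_query fits on one short line, and wrapping it in a function later is a small step.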
Link to the book on Amazon (Not an affiliate link, but a link for Packt to track clicks)