Product Development

Behind the Scenes of Piktochart’s Data Team – How to Effectively Communicate Data

Flavien Lambert

Data Scientist | Piktochart

This is the first article in our brand new ‘Product Development’ category. In the coming months, our technical team members will share tips and how-tos about product development, processes, data crunching and more. Stay tuned, and welcome to the first article!

It’s nearly impossible to miss the omnipresence of data science in recent years, with news headlines about breakthroughs in deep learning. Machines can now learn to repaint videos in the style of Van Gogh or beat world-class Go champions. There are countless best-practice articles about building your first logistic regression, playing with random forests or training neural networks.

As a consequence, there is abundant literature on why every company should rely on data, build predictive models or leverage Artificial Intelligence to get a competitive business edge.

But from a private-sector perspective, I’ve found only a few discussions in the data science community of the last and perhaps most important step: presenting research findings.

First, some context on what we do as Piktochart’s data analysis team. While we occasionally dabble in building mathematical models for prediction, our primary focus has been statistical analysis of our current business situation and presenting our findings in a meaningful format to our colleagues. Visualization of statistics has a long history and clearly defined best practices. With the expansion of libraries like D3, data visualization – in particular interactive visualization – has gained a lot of traction. For example, (data-)journalism now makes extensive use of it and can be both a reference and a source of inspiration.

Our biggest challenge remains disseminating research results to a broad spectrum of teammates who come from diverse backgrounds and cultures, have different interests, and are not necessarily trained in statistics or data visualization.

Therefore, I’d like to describe the different paths that my colleague Wei Jian and I have taken to communicate data and help our teammates in their jobs.

For the spreadsheet lovers: Google APIs

It’s no secret that spreadsheets are one of the basic tools used in data analysis. The situation isn’t any different at Piktochart and our colleagues from the marketing team extensively use them. More importantly, they’re used to this format. Given the frequent collaboration between the analytics and marketing teams, we naturally opted to share our data via spreadsheets.

However, spreadsheets are definitely not our preferred tool. On a daily basis, we extract data from diverse sources, from SQL tables to unstructured logs such as JSON-formatted blobs. We need a versatile tool to cope with these different tasks, one that offers efficient libraries for manipulating data. That’s why we ultimately settled on Python, which turned out to be an excellent choice because of its pandas library.

When I joined Piktochart, the analysis was already being performed with the help of pandas, but we lacked automation and populated the spreadsheets by hand.

Every week, colleagues would ask us for the latest numbers on user signups or paying subscribers so that they could present them in meetings. It was tedious and highly inefficient, not to mention extremely boring for us. We decided to organize a single point of entry for them to find relevant data, a spreadsheet that would be regularly updated.

Thanks to the rich ecosystem surrounding the Python programming language, libraries for manipulating Google Spreadsheets were readily available. We chose to try the gspread module, a convenient wrapper around the Google Spreadsheets API.

For example, the following basic function inserts a list of values into a given row:
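A minimal sketch of such a function, assuming an already-authenticated gspread `Worksheet` object (obtained, for example, via `gspread.authorize()` and `open()`); the names and example values are illustrative:

```python
def insert_values(worksheet, values, index):
    """Insert `values` as a new row at 1-based position `index`,
    pushing the existing rows below it down by one.

    `worksheet` is a gspread Worksheet, e.g. obtained with
    gspread.authorize(credentials).open("KPIs").sheet1
    """
    worksheet.insert_row(values, index)

# Example: record last week's signups at the top of the data block.
# insert_values(sheet, ["2017-05-01", 1520], 2)
```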

One advantage of this method is that rows can be inserted without modifying the rest of the spreadsheet. To make a spreadsheet clearer, one may want – for example – to highlight cells with color, emphasize rows or columns, or add comments. Inserting rows this way ensures that such formatting will not be wiped out by an update.

We found one drawback with this method though. To properly insert a row, the library needs to rebuild part of the spreadsheet by downloading all the rows below the inserted one. This can be a bit lengthy, so it’s better to crop all unnecessary columns and rows from the original spreadsheet (a blank spreadsheet starts with 1,000 rows and 26 columns).

In conjunction with gspread, we also make use of the full Google Drive Python API. Following the documentation, we set up a few scripts for uploading or deleting files – not just spreadsheets – in the company’s Google Drive.

With just a few lines of code, we’re able to automate repetitive tasks for tracking various experiments and deepen the analysis by merging several data sources.

As an example, we’ve experimented with automating the tracking of our product demonstrations, which aim to increase signup numbers. We started by extracting subscription data from our WordPress platform and activity data from the database supporting our application, and consolidating them into a file in table format. Next, we used the following code to upload the file to Google Drive.
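A sketch of such an upload against the Google Drive v3 API (via the `google-api-python-client` package); the function names, file names and the `drive` service object are illustrative assumptions:

```python
def spreadsheet_metadata(name):
    """File metadata whose mimeType triggers CSV-to-Sheet conversion
    on upload."""
    return {"name": name,
            "mimeType": "application/vnd.google-apps.spreadsheet"}

def upload_as_spreadsheet(drive, csv_path, name):
    """Upload a local CSV; Drive converts it to a native Google Sheet.

    `drive` is a service object built with
    googleapiclient.discovery.build("drive", "v3", credentials=creds).
    """
    # Imported lazily here so the sketch reads top to bottom; in a real
    # script this import would sit at module level.
    from googleapiclient.http import MediaFileUpload
    media = MediaFileUpload(csv_path, mimetype="text/csv")
    return drive.files().create(body=spreadsheet_metadata(name),
                                media_body=media,
                                fields="id").execute()
```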

By imposing mimeType="application/vnd.google-apps.spreadsheet" in the file metadata, a simple CSV file – mimetype="text/csv" – is automatically converted to a spreadsheet when uploaded to Google Drive. By default, no file in Google Drive is overwritten, so if you want a single file to be updated, you’ll need to delete the existing version first (this can be done automatically with the API).
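Such a delete-before-upload step could look like this, again against the Drive v3 API (the query string and field names follow that API’s conventions; the helper name is ours):

```python
def delete_matching(drive, name):
    """Delete every non-trashed Drive file with the given name, so that
    a subsequent upload effectively acts as an overwrite."""
    query = "name = '{}' and trashed = false".format(name)
    listing = drive.files().list(q=query, fields="files(id)").execute()
    for item in listing.get("files", []):
        drive.files().delete(fileId=item["id"]).execute()
```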

As a side note, it isn’t straightforward to customize cells – for example, by choosing specific fonts, layouts or colors – through the Google API. So by design, the spreadsheets created with the above code are very basic. We could have used gspread again for this particular job but, since the data set was pretty small (a few thousand lines), it was more convenient to write the full table from scratch rather than update specific cells.

Finally, continued optimization and improvements mean that experimental data needs to be regularly crunched and evaluated. Achieving this is a matter of running our scripts via cron jobs, on servers that are dedicated to performing data analysis. Our fellow colleagues can now read – and manipulate – updated reports on a weekly basis.
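A weekly refresh like this amounts to a crontab entry along these lines (the schedule and paths are hypothetical):

```shell
# Rebuild the shared KPI spreadsheet every Monday at 06:00
0 6 * * 1 /usr/bin/python /home/data/update_kpi_spreadsheet.py
```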

For the lazy ones: Slack

Automating spreadsheets was a necessary step; getting our colleagues interested in data was another challenge. We noticed that engagement was lacking, despite the availability of data at their fingertips. Of course, not everyone needs to keep an eye on the full scale of company performance, but we thought that circulating a few key numbers was necessary.

In other words, if people don’t dig into data, data should come to them.

Since we already use Slack for internal communication, we decided to highlight important metrics in the Announcements channel. To this end, we leveraged slackclient, a Python library wrapped around the Slack API. With only a few lines of code, inspired by this article, we could directly share numbers with our coworkers while keeping records in the aforementioned spreadsheets.
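A minimal sketch of such a post, assuming a client built from a bot token with the slackclient library of that era (`slackclient.SlackClient(SLACK_BOT_TOKEN)`); the channel name and message formatting are illustrative:

```python
def format_metric(label, value):
    """One human-friendly line, e.g. '*New signups*: 1,234'."""
    return "*{}*: {:,}".format(label, value)

def post_metric(client, channel, label, value):
    """Post a formatted metric to a Slack channel.

    `client` is e.g. slackclient.SlackClient(SLACK_BOT_TOKEN);
    api_call() wraps the chat.postMessage Web API method.
    """
    client.api_call("chat.postMessage",
                    channel=channel,
                    text=format_metric(label, value))
```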

The following image is an example of the output for our new signups.

Python on Slack

Enhance data literacy: Jupyter notebooks

Because of my teaching experience, I’ve always found it crucial to explain my work methodology when sharing findings or statistical results. It helps in building people’s statistical and visual literacy by making them understand the processes of extracting data and its limitations. It empowers them to have a critical perspective on the methodology and the results.

However, explaining takes time; time that may not have been allocated, such as during a presentation. We also refrain from putting explanations in spreadsheet comments, to avoid mixing methodology and data in space-constrained columns and rows. We recently came across a solution that seems to fit our needs.

On a daily basis, we use Jupyter notebooks to perform analysis. For the uninitiated, these offer a powerful and convenient interface, based on web technologies, that keeps code, comments and visualizations in one place. As an alternative to the browser interface, one can use the desktop app nteract. Notebooks are an incredible tool for sharing and are now widely used in education. Examples are available on nbviewer.

An interesting feature of notebooks is their versatility. They are an excellent way to discuss code with fellow analysts or developers, and they can also be shaped for a different audience. Notebooks can indeed be transformed straightforwardly into HTML pages through:
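the standard nbconvert command line (the notebook name is illustrative):

```shell
jupyter nbconvert --to html analysis.ipynb
```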

This command produces an HTML page with basic styling, containing everything from code and outputs to tables, comments and charts.

In order to make it suitable for my colleagues, I tweaked the result by using templates through
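a command along these lines, passing a custom template file (the notebook name is illustrative):

```shell
jupyter nbconvert --to html --template extended-basic.tpl analysis.ipynb
```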

which leads to this kind of final report:

how we communicate data

And this one:

how we communicate data at Piktochart

A few words about the choices behind this template, extended-basic.tpl. All code cells have been removed, since the purpose is not the Python machinery but the methodology itself. And since design is in our DNA at Piktochart, the template tries to follow some solid guidelines:

  • the text width is limited to roughly 75 characters;
  • tables are inspired by this article for column name placement, number styles and formatting;
  • charts are based on guidelines from the Sunlight Foundation that I discovered through Andy Kirk’s blog with some tweaks from this article.

While we use matplotlib and seaborn for static charts, producing HTML reports makes it possible to fully exploit interactive ones. I am now using plot.ly’s open-source Python library, which has embedded functions for zooming, data selection, and annotations when hovering over data points.

All of these allow our teammates to read the steps of an analysis and explore the charts to better grasp the content.

On top of the interactivity and the clarity brought by HTML reports, it’s also possible to execute the notebooks automatically thanks to the nbconvert library as shown by the following:
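A sketch of that step, combining nbformat with nbconvert’s ExecutePreprocessor and HTMLExporter; the function name, file paths and timeout are illustrative:

```python
def execute_and_export(notebook_path, html_path, timeout=600):
    """Run every cell of a notebook, then write the result as HTML.

    Imports are local so the sketch reads top to bottom; in a real
    script they would sit at module level.
    """
    import nbformat
    from nbconvert import HTMLExporter
    from nbconvert.preprocessors import ExecutePreprocessor

    nb = nbformat.read(notebook_path, as_version=4)
    # Execute all cells in place, in the current working directory.
    ExecutePreprocessor(timeout=timeout).preprocess(
        nb, {"metadata": {"path": "."}})
    body, _resources = HTMLExporter().from_notebook_node(nb)
    with open(html_path, "w") as f:
        f.write(body)

# Example: execute_and_export("analysis.ipynb", "analysis.html")
```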

We used the previous code in conjunction with cron jobs to produce updated HTML reports. The files are dumped into an Amazon S3 bucket that is queried by a server which lists the available reports.

Final remarks

This list of solutions is, of course, not exhaustive. There’s a tremendous number of platforms – plot.ly or Mode, for example – that offer SQL-based reporting capabilities and support Jupyter notebooks and online dashboards. Microsoft Azure, Google Cloud and Amazon AWS have also developed similar features on their respective cloud platforms. Dashboards can nevertheless be easily set up from notebooks through the Layout extension or by using widgets.

We chose to develop our very basic solutions to allow our teammates to keep using familiar tools but we’re also considering SaaS solutions to improve our current process of sharing data analysis.

How do you communicate data within your company? Do you have any processes for sharing analysis? Let us know in the comments below!
