This is the first article from our brand new ‘product development’ category. In the coming months, our technical team members will be sharing tips and how-tos about product development, processes, data crunching and more. Stay tuned and welcome to the first article!
It’s nearly impossible to miss the omnipresence of data science in recent years, with news headlines about breakthroughs in deep learning. Machines can now learn to repaint videos in the style of Van Gogh or beat world-class Go champions. There are countless best-practice articles about building your first logistic regression, playing with random forests or training neural networks.
As a consequence, there is abundant literature on why every company should rely on data, build predictive models or leverage Artificial Intelligence to get a competitive business edge.
But from a private sector perspective, I’ve found few discussions – within the data science community – of the last and perhaps most important step: presenting research findings.
First, some context on what we do as Piktochart’s data analysis team. While we occasionally dabble in building mathematical models for prediction, our primary focus has been the statistical analysis of our current business situation and presenting our findings in a meaningful format to our colleagues. Visualization of statistics has a long history and clearly defined best practices. With the expansion of libraries like D3, data visualization – interactive visualization in particular – has gained a lot of traction. For example, (data-)journalism now makes extensive use of it and can be both a reference and a source of inspiration.
Our biggest challenge remains disseminating research results to a broad spectrum of teammates who come from diverse backgrounds and cultures, have different interests, and are not necessarily trained in statistics or data visualization.
Therefore, I’d like to describe the different paths that my colleague Wei Jian and I have taken to communicate data and help our teammates in their jobs.
For the spreadsheet lovers: Google APIs
It’s no secret that spreadsheets are one of the basic tools used in data analysis. The situation isn’t any different at Piktochart and our colleagues from the marketing team extensively use them. More importantly, they’re used to this format. Given the frequent collaboration between the analytics and marketing teams, we naturally opted to share our data via spreadsheets.
However, spreadsheets are definitely not our preferred tool. On a daily basis, we extract data from diverse sources, from SQL tables to unstructured logs such as JSON-formatted blobs. We need a versatile tool to cope with these different tasks, one which offers efficient libraries for manipulating data. That’s why we ultimately settled on Python, which turned out to be an excellent choice because of its pandas library.
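To give an idea of that versatility, here is a minimal sketch of the kind of extraction we run. The connection string, table and file names are illustrative, not our actual setup:

import pandas as pd
import sqlalchemy

# Hypothetical connection string and table name, for illustration only.
engine = sqlalchemy.create_engine("postgresql://user:password@host/analytics")
signups = pd.read_sql("SELECT signup_date, plan FROM signups", engine,
                      parse_dates=["signup_date"])

# JSON-formatted event logs (one JSON object per line) load just as easily.
events = pd.read_json("events.json", lines=True)

# From here, pandas handles joining, aggregating and reshaping both sources.
weekly = signups.groupby(pd.Grouper(key="signup_date", freq="W")).size()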
When I joined Piktochart, the analysis was already being performed with the help of pandas, but we lacked automation and populated the spreadsheets by hand.
Every week, colleagues would ask us for the latest numbers on user signups or paying subscribers so that they could present them in meetings. It was tedious and highly inefficient, not to mention extremely boring for us. We decided to organize a single point of entry for them to find relevant data, a spreadsheet that would be regularly updated.
Thanks to the rich ecosystem surrounding the Python programming language, libraries for manipulating Google Spreadsheets were readily available. We chose to try the gspread module, a convenient wrapper around the Google Spreadsheets API.
For example, the following basic function inserts a list of values into a given row:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

def insert_row_to_ss(row, credentials_file, ws_title, ss_title, values):
    """
    This function inserts a list of values into a row of a given spreadsheet.
    Args:
        row (int): row number
        credentials_file (str): path to the credentials (see Step 1 of
            https://developers.google.com/sheets/api/quickstart/python)
        ws_title, ss_title (str): worksheet and spreadsheet names
        values (list): values to be inserted
    """
    scope = ["https://spreadsheets.google.com/feeds"]
    credentials = ServiceAccountCredentials.from_json_keyfile_name(credentials_file, scope)
    client = gspread.authorize(credentials)
    # Open the spreadsheet by title, then insert the values into the worksheet.
    spreadsheet = client.open(ss_title)
    spreadsheet.worksheet(title=ws_title).insert_row(values, index=row)
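For instance, a call along these lines (the file and sheet names are illustrative) would write a header row at the top of a metrics sheet:

# Hypothetical names, for illustration only.
insert_row_to_ss(
    row=1,
    credentials_file="credentials.json",
    ws_title="Signups",
    ss_title="Weekly metrics",
    values=["Week", "Signups", "Paying subscribers"],
)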
One advantage of this method is that rows can be inserted without modifying the rest of the spreadsheet. This matters because, to make a spreadsheet clearer, one may want – for example – to highlight cells by color, emphasize rows or columns, or introduce comments. Inserting rows through the API ensures that this manual formatting will not be wiped out by an update.
We found one drawback with this method though. To properly insert a row, the library needs to rebuild part of the spreadsheet by downloading all the rows below the inserted one. This can be slow, so it’s better to crop all unnecessary columns and rows from the original spreadsheet (a blank spreadsheet starts with 1,000 rows and 26 columns).
In conjunction with gspread, we also make use of the full Google Drive Python API. Following the documentation, we set up a few scripts for uploading or deleting files – not just spreadsheets – in the company’s Google Drive.
With just a few lines of code, we’re able to automate repetitive tasks for tracking various experiments and deepen the analysis by merging several data sources.
As an example, we’ve experimented with our product demonstrations to increase signup numbers, and automated the tracking of those experiments. We started by extracting subscription data from our WordPress platform and activity data from the database supporting our application, then consolidated them into a single file in table format. Next, we used the following code to upload the file to Google Drive.
import apiclient.http

# service (Google Drive API service object):
# see https://developers.google.com/drive/v3/web/quickstart/python
file_metadata = dict(
    name="your_file_name",
    parents=["Google_Drive_folder_id"],
    mimeType="application/vnd.google-apps.spreadsheet")
media = apiclient.http.MediaFileUpload(
    path_to_CSV_file,
    mimetype="text/csv",
    resumable=True)
created_file = service.files().create(
    body=file_metadata, media_body=media, fields="id").execute()
By imposing mimeType="application/vnd.google-apps.spreadsheet" in the file metadata, a simple CSV file – mimetype="text/csv" – is automatically converted to a spreadsheet when dropped into Google Drive. By default, no file in Google Drive is overwritten, so if you want a single file to be kept up to date, you’ll need to delete the existing version first (this can be done automatically with the API).
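A minimal sketch of that delete-then-upload step might look like this (the file name and the service object are assumptions carried over from the snippet above, not our exact script):

# Find any existing file with the same name and delete it,
# so the fresh upload effectively replaces it.
response = service.files().list(
    q="name = 'your_file_name'", fields="files(id)").execute()
for f in response.get("files", []):
    service.files().delete(fileId=f["id"]).execute()
# ...then run the upload code shown above.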
As a side note, it isn’t obvious how to customize cells – for example, by choosing specific fonts, layouts or colors – within the Google API. So by design, the spreadsheets created through the above code are very basic. We could have used gspread again for this particular job but, since the data set was pretty small (a few thousand lines), it was more convenient to write the full table from scratch rather than updating specific cells.
Finally, continuous optimization and improvement mean that experimental data needs to be crunched and evaluated regularly. Achieving this is a matter of running our scripts via cron jobs on servers dedicated to data analysis. Our colleagues can now read – and manipulate – updated reports on a weekly basis.
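The scheduling side is nothing more exotic than a crontab entry along these lines (the script path and schedule are illustrative, not our actual configuration):

# Refresh the shared spreadsheets every Monday at 06:00.
0 6 * * 1 /usr/bin/python3 /opt/analytics/update_spreadsheets.py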
For the lazy ones: Slack
Automating spreadsheets was a necessary step; getting our colleagues interested in the data was another challenge. We noticed that engagement was lacking, despite the data being available at their fingertips. Of course, not everyone needs to keep an eye on the full scale of company performance, but we thought that circulating a few key numbers was necessary.
In other words, if people don’t dig into data, data should come to them.
Since we already use Slack for internal communication, we decided to highlight important metrics in the Announcements channel. To this end, we leveraged slackclient, a Python library wrapped around the Slack API. With only a few lines of code, inspired by this article, we could share numbers directly with our coworkers while keeping records in the aforementioned spreadsheets.
import os
import slackclient

def slack_message(text, channel="@flavien", botname="statistics", icon=":python:"):
    """
    This function sends a message to Slack.
    Args:
        text (str): text of the message
        channel (str): the channel/user to whom the message is sent (default: @flavien)
        botname (str): the name of the bot (default: statistics)
        icon (str): the bot icon (default: :python:)
    """
    # The Slack token is kept in an environment variable for security purposes.
    slack_token = os.environ["SLACK_TOKEN"]
    slackclient.SlackClient(slack_token).api_call(
        "chat.postMessage",
        channel=channel,
        text=text,
        username=botname,
        icon_emoji=icon
    )
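A typical weekly call might then look like the following sketch (the channel name and the number are placeholders; in practice, the figures come from our analysis scripts):

# Hypothetical example: post the latest signup count to a shared channel.
new_signups = 1234  # placeholder; computed by our pipeline in practice
slack_message(
    "We welcomed {} new signups this week!".format(new_signups),
    channel="#announcements",
)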
The following image is an example of the output for our new signups.
Enhance data literacy: Jupyter notebooks
Because of my teaching experience, I’ve always found it crucial to explain my methodology when sharing findings or statistical results. It helps build people’s statistical and visual literacy by showing them how the data was extracted and what its limitations are. It also empowers them to take a critical perspective on the methodology and the results.
However, explaining takes time; time that may not have been allocated, such as during a presentation. We also refrain from putting explanations into spreadsheet comments, to avoid mixing methodology and data in space-constrained columns and rows. We recently came across a solution that seems to fit our needs.
On a daily basis, we use Jupyter notebooks to perform analysis. For the uninitiated, these offer a powerful and convenient web-based interface that keeps code, comments and visualizations in one place. As an alternative to the browser interface, one can use the desktop app nteract. Notebooks are an incredible tool for sharing and are now widely used in education. Examples are available on nbviewer.
An interesting feature of notebooks is their versatility. They are an excellent way to discuss code with fellow analysts or developers, but one can also shape them for a different audience. Notebooks can indeed be transformed straightforwardly into HTML pages through:
jupyter nbconvert your_notebook.ipynb --to html
This command produces a basic HTML page containing everything: code, outputs, tables, comments and charts.
In order to make it suitable for my colleagues, I tweaked the result by using templates through
jupyter nbconvert your_notebook.ipynb --to html --template extended-basic.tpl
which leads to this kind of final report:
And this one:
A few words about the choices behind this template, extended-basic.tpl. All code cells have been removed, since the purpose is not the Python machinery but the methodology itself. And since design is part of our DNA at Piktochart, the template tries to follow some solid guidelines (a stripped-down sketch of the template follows this list):
- the text width is limited to roughly 75 characters;
- tables are inspired by this article for column name placement, number styles and formatting;
- charts are based on guidelines from the Sunlight Foundation that I discovered through Andy Kirk’s blog with some tweaks from this article.
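To illustrate the kind of customization involved, here is a minimal sketch of such a template. It is not our actual extended-basic.tpl (which also handles the styling above), but it shows the core idea: extend nbconvert’s built-in basic template and suppress the code cells.

{% extends 'basic.tpl' %}

{# Hide all code input cells: readers should see the methodology
   and the results, not the Python machinery. #}
{% block input_group %}
{% endblock input_group %}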
While we use matplotlib and seaborn for static charts, producing HTML reports makes it possible to fully exploit interactive ones. I am now using plot.ly’s open-source Python library, which has built-in functions for zooming, selecting data, and showing annotations when hovering over data points.
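To give a flavor of what this looks like in practice, here is a minimal sketch (the metric and the numbers are placeholders, not real figures):

import plotly.offline as py
import plotly.graph_objs as go

# Hypothetical weekly signup counts, for illustration only.
weeks = ["Week 1", "Week 2", "Week 3", "Week 4"]
signups = [120, 150, 170, 160]

trace = go.Scatter(x=weeks, y=signups, mode="lines+markers", name="signups")
# Writes a standalone HTML file with zoom and hover tooltips built in.
py.plot([trace], filename="signups.html", auto_open=False)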
All of these allow our teammates to read the steps of an analysis and explore the charts to better grasp the content.
On top of the interactivity and clarity brought by HTML reports, it’s also possible to execute the notebooks automatically thanks to the nbconvert library, as shown by the following:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
from nbconvert import HTMLExporter
import os
import codecs
import jinja2

def nb_to_html(notebook_path, template_path="basic", version=4, timeout=3600, kernel="python3"):
    """
    This function executes a Jupyter notebook and creates the related HTML file.
    Args:
        notebook_path (str): notebook path
        template_path (str): template path
        version (int): version of the notebook format
        timeout (float): maximum time spent per cell
        kernel (str): name of the kernel used to execute the notebook
    """
    if os.path.isfile(notebook_path):
        # Read the notebook and execute all of its cells.
        with open(notebook_path, "r") as f:
            nb = nbformat.read(f, as_version=version)
        ep = ExecutePreprocessor(timeout=timeout, kernel_name=kernel)
        ep.preprocess(nb, {"metadata": {"path": "."}})
        notebook_folder, notebook = os.path.split(notebook_path)
        # Save the executed notebook to a temporary location.
        with open("/tmp/" + notebook, "wt") as f:
            nbformat.write(nb, f)
        # Export to HTML, using a custom template if one is provided.
        html_exporter = HTMLExporter()
        if os.path.isfile(template_path):
            template_folder, template = os.path.split(template_path)
            loader = jinja2.FileSystemLoader(template_folder)
            html_exporter.extra_loaders = [loader]
            html_exporter.template_file = template.replace(".tpl", "")
        with open("/tmp/" + notebook, mode="r") as f:
            notebook_exe = nbformat.reads("".join(f.readlines()), as_version=version)
        (body, _) = html_exporter.from_notebook_node(notebook_exe)
        # Write the HTML report next to the original notebook.
        codecs.open(notebook_folder + "/" + notebook.replace(".ipynb", ".html"),
                    "w", encoding="utf-8").write(body)
    else:
        print("{} does not exist.".format(notebook_path))
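A scheduled job then only needs a call along these lines (the paths are illustrative):

# Execute the notebook and regenerate its HTML report in one call.
nb_to_html("reports/weekly_report.ipynb",
           template_path="templates/extended-basic.tpl")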
We used the previous code in conjunction with cron jobs to produce updated HTML reports. The files are dumped into an Amazon S3 bucket that is queried by a server which lists the available reports.
Final remarks
This list of solutions is, of course, not exhaustive. There is a tremendous number of platforms, like plot.ly or Mode, which offer SQL-based reporting and support Jupyter notebooks and online dashboards. Microsoft Azure, Google Cloud and Amazon AWS have also developed similar features on their respective cloud platforms. Dashboards can nevertheless be easily set up from notebooks through the Layout extension or by using widgets.
We chose to develop our own, admittedly basic, solutions to allow our teammates to keep using familiar tools, but we’re also considering SaaS solutions to improve our current process of sharing data analysis.
How do you communicate data within your company? Do you have any processes for sharing analysis? Let us know in the comments below!