By Rebecca Bilbro
The potential to derive insight from data is substantially amplified when the developers and users of data analytics tools work toward a common goal. As data scientists, computer programmers, designers, and DevOps specialists came together to share ideas at two recent conferences, PyData Carolinas and PyData DC, it was clear that the field of data science is opening up important new conversations.
PyData Carolinas – September 14-16, 2016
While the data science communities in Washington, DC, New York, and California are the largest in the U.S., a number of small but rapidly growing data science enclaves are developing all across the country. One such place is Research Triangle Park in North Carolina, where IBM recently hosted PyData Carolinas. Organized by the nonprofit NumFOCUS, the PyData conference series provides a unique opportunity for presenters and attendees to explore the triumphs and potential pitfalls of using Python to transform raw data into insights, products, and applications that empower data-driven decision making.
One of the key highlights of PyData Carolinas was Sarah Bird’s presentation on the power of visualization to explain data and make information intuitive. In her talk, Bird used the Python library Bokeh to recreate Hans Rosling’s famous “Health and Wealth of Nations” animation, an illustration of how our assumptions about ‘first world’ and ‘third world’ countries can betray us.
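A static Rosling-style bubble chart of this kind can be sketched in a few lines of Bokeh. This is a minimal illustration only, and the country figures below are made-up placeholder values, not real statistics:

```python
# Hypothetical sketch of a "Health and Wealth of Nations" style bubble chart in Bokeh.
# All data values are illustrative placeholders, not real country statistics.
from bokeh.plotting import figure

countries = ["Country A", "Country B", "Country C", "Country D"]
gdp_per_capita = [1_000, 5_000, 20_000, 45_000]   # USD (toy values)
life_expectancy = [55, 65, 75, 81]                # years (toy values)
bubble_size = [20, 35, 15, 25]                    # stands in for population

p = figure(
    title="Health and Wealth of Nations (toy data)",
    x_axis_type="log",  # log scale on wealth, as in Rosling's original chart
    x_axis_label="GDP per capita (USD)",
    y_axis_label="Life expectancy (years)",
)
p.scatter(gdp_per_capita, life_expectancy, size=bubble_size, fill_alpha=0.6)
# show(p) would render the interactive plot in a browser;
# Bird's animated version steps this chart through time, year by year.
```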
Two of the major themes at PyData Carolinas were the power of building custom open source tools in Python and establishing repeatable workflows that enable data scientists to do more robust analysis and modeling.
Sarah Bird talks about the power of telling stories with data.
Suggested Talks from PyData Carolinas:
- “Stemgraphic” by Francois Dion presented a new visualization tool for big data.
- “Transforming Data to Unlock Its Latent Value” by Tony Ojeda argued for a repeatable workflow for data transformation.
- “Dynamics in Graph Analysis” by Benjamin Bengfort offered a novel approach to representing time in graph structures to capture additional insight.
- Thomas Caswell’s presentation on Matplotlib 2.0 discussed the base visualization library in Python, on top of which many developers are working to build the next generation of data viz tools.
- Marshall Wang’s presentation showed how to build an artificially intelligent video game entirely in Python.
PyData DC – October 7-9, 2016
At the PyData DC conference hosted by Capital One a few weeks later, many similar themes emerged about the intersection of development and data science. Peter Wang of Continuum Analytics presented an excellent keynote on the role of the data scientist in relation to other key players in the data ecosystem: the business analyst, the developer, the data engineer, and the DevOps specialist.
Peter Wang on why data science is a team sport.
One uniquely “inside the Beltway” theme at PyData DC concerned the incredible potential of open data and the challenges of working with it. In “Forecasting critical food violations,” Nicole Donnelly revealed the host of data quality issues she encountered while using open health inspection data, but also illustrated how important it is for data scientists to be willing to get their hands dirty with imperfect data. In his talk, Star Ying of the Commerce Data Service argued that making data open and making data accessible are two separate, but equally significant, tasks of Government 2.0.
Suggested Talks from PyData DC:
- “Eat Your Vegetables” by Will Voorhies argued for the importance of data security in data science.
- “How I learned to time travel” by Laura Lorenz demonstrated the value of integrating good engineering practices into the data science pipeline.
- “NoSQL doesn’t mean No Schema” by Steven Lott emphasized the need for data scientists to write clean, readable code.
- “GraphGen” by Konstantinos Xirogiannopoulos showed how his recent implementations of computer science research in relational database extraction are making graph analytics practical for a whole host of users.
One key insight from the two conferences (videos from both are available here) is that data science is thriving and developing rapidly. Moreover, data science communities in different parts of the country, while unique, are grappling with many of the same issues. Another significant takeaway is that open source development is critical to the work of data science, but it’s a two-way street: it’s important for Python developers to be in dialogue with their users (the analysts who will employ their packages to ingest, wrangle, analyze, and model data), and it’s even better when those analysts take an active role in guiding development by contributing back to open source. Data scientists and analysts need to see themselves not only as testers and users of analytics software, but also as its architects.
The crowd at the PyData DC diversity luncheon hears about how diversity can help open source developers disrupt.