People to Watch 2018
Wes McKinney
Creator of Pandas
Author of “Python for Data Analysis”
Wes McKinney is the creator of the Python pandas project and a PMC member for the Apache Arrow and Apache Parquet projects. He published the book “Python for Data Analysis” in 2012, with an updated 2nd edition released in 2017. He was the co-founder and CEO of DataPad, an analytics company later acquired by Cloudera in 2014. At Cloudera, he focused on engineering efforts to bridge the Python, Hadoop, and Spark ecosystems. He now works at Two Sigma in New York City as a software architect focused on data science tools.
Datanami: Congratulations on being named a Datanami Person to Watch in 2018! To kick things off, were you surprised at all at the huge success of the Pandas library? What were you expecting?
Wes McKinney: I was mostly relieved! When I got interested in building data tools for Python in 2008, it wasn’t obvious that the Python community would be able to develop the robust data community that we have now. Some of the biggest hurdles to community growth were basic data access and tabular data wrangling, problems that pandas made significantly easier for newcomers. I spent most of 2011 and 2012 focused (with the help of Adam Klein and Chang She) on making pandas a viable tool for real-world data analysis, and writing my book “Python for Data Analysis” in the process. The turning point for the project came around the end of 2012. In 2013, when I got busy with my company DataPad, I handed off the reins of maintenance and growth to Jeff Reback and the rest of the pandas core team, who’ve done an outstanding job keeping the project alive and healthy the last 5 years. We just recently crossed 1,000 unique contributors on GitHub, a huge milestone for any open source project.
Datanami: You’re only 32. Do you think being a bit on the younger side gives you a unique perspective into today’s data science problems?
I was fortunate to have gotten involved in data science tools when I was 22, before it was even called data science! A lot of people my age and younger have gravitated toward web technologies in the JavaScript ecosystem and newer programming languages. As time has passed, I have been doing more and more infrastructure-level systems engineering in C and C++ for data processing. I believe there are still many important problems to solve, particularly at the systems level, and as a community we need to do what we can to make the engineering side of data science more attractive and interesting to the upcoming generations of developers.
Datanami: Python has become an incredibly popular language for data science over the last few years. Do you think Python’s rapid ascent has had an impact on the field of data science, and if so, what was the impact?
Python and R in particular have had a huge impact on the field in some key ways. Because both are open source, anyone can install the software and get up and running for free. Additionally, the communities have focused on usability, education, and developer experience to enable individuals to become productive very quickly. In recent years, Python has emerged as the “user interface of choice” for many cutting-edge machine learning projects, like TensorFlow and PyTorch. Engineering teams have been successful implementing systems code in lower-level languages like C or C++ and exposing the functionality to users through Python bindings. Python’s strength as a “glue language” is probably the main reason that it developed a numerical computing community back in the 1990s, and this remains a part of its success today.
Datanami: What do you hope to see from the big data community in the coming year?
I have personally spent the last several years focused mainly on the Apache Arrow open source project, a cross-language in-memory computing and data interoperability platform. As many big data systems have grown more mature in recent years, I hope we see increased ecosystem-spanning collaborations on projects like Arrow to help with platform interoperability and architectural simplification. I believe that this “defragmentation,” so to speak, will make the whole ecosystem more productive and successful using open source big data technologies.
Datanami: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?
In the late 1990s, I helped operate a “speed run” competition website for the video game GoldenEye 007. I guess you could say it was my first experience with managing a distributed online community, and it was surprisingly good preparation for some of the challenges that come with open source community development.