Data Driven #25 - The Humanity of Data
Tuesday night I had the pleasure of attending episode #25 of Data Driven NYC, a monthly conversation on big data hosted and deftly moderated by Matt Turck. We heard from a stellar lineup of speakers from widely varied data science disciplines:
- Jake Klamka, Insight Data Science - Insight prepares Ph.D.’s from data-intensive disciplines ranging from astrophysics to social science for careers in data science. The program is a response to the growing number of Ph.D.’s chasing a slow-growing number of academic positions amid explosive industry demand for data scientists. The six-week, San Francisco-based program adapts fellows to the toolset (migrating from MATLAB and flat files to Python/R and databases), culture, and pacing of data science in industry before guiding them through a portfolio project, company visits, and eventual job placement. Fellows begin their industry careers with a wider network in data science than some of their supervisors. Klamka stressed deliverables and the ability to communicate results as the differentiators between great research scientists and effective data scientists.
- Jesse St. Charles, Knewton - Knewton provides an online educational platform, currently used mostly in higher education, that adapts the curriculum and learning path to the individual student. In contrast to the Industrial-age “one size fits all” model, Knewton’s platform decomposes the curriculum into a dependency graph of bite-size modules and collects data on students’ progress through the material to tailor the sequencing of instruction to the student’s abilities (a minimal sketch of the idea follows this list). St. Charles noted the imperfect nature of some of the heuristics the Knewton system relies on - is the student puzzling over a difficult concept for 15 minutes, or have they stepped away to make a PB&J sandwich? This highlighted a concept St. Charles raised that struck a chord with many attendees - “data empathy,” the understanding of the provenance, context, and imperfections of raw data.
- Sean Gourley, Quid - Gourley is the presenter of one of my favorite TED talks of all time and co-founder/CTO of Quid, which produces software to help humans explore and make sense of large, complex data sets. He revisited the limits on the speed of human judgment (~650ms) in comparison to the speed at which high-frequency trading algorithms operate in electronic financial markets (microsecond scale). Gourley presented charts illustrating the scale of algorithmic activity and its stunning share of overall trading activity, but he also highlighted some of the algorithms’ shortcomings - they haven’t been “trained” to trade around major news releases like Fed announcements and simply withdraw from the market, and they are easily duped by spurious information like last year’s falsified tweet from the hacked Associated Press account claiming President Obama had been injured in an attack on the White House, which sent markets plummeting only to fully recover once the error was revealed. Gourley also noted the use of technology to try to shape the narrative, showing data on the volume of propaganda published to Twitter and suggesting that algorithms might soon be writing the news consumed by humans.
- Riley Newman, AirBnB - Turck spoke with Newman in a fireside chat format, discussing the role of data science in the rapid growth of AirBnB, which creates a two-sided market between seekers and providers of temporary residential space. Newman started at the company when there were nine other employees and, as people do in those situations, took on a variety of roles. He now oversees a team of 20 data scientists out of the company’s global headcount of 800. Newman noted that simple approaches are often the most effective (in contrast to machine learning addicts who want to “random forest everything”) but stressed the importance of increasing sophistication in making AirBnB better at matching buyers and sellers - initial versions of their search algorithm ranked results by distance from the city center, pointing many seekers of lodging to SF’s Tenderloin district, suboptimal for the faint of heart. Speaking to the difficulties Jake Klamka noted in identifying data science talent, Newman said AirBnB’s interview process includes a one-day, eight-hour trial run: candidates are asked a question in the morning, given a workstation and the necessary data, and present their data-driven answer at the end of the day - trial and error trumping predictive analytics.
- Chris Volinsky, AT&T - After an entertaining introduction to AT&T’s concrete, windowless, nuke-proof, zombie-proof(!!) Manhattan headquarters, Volinsky walked the audience through some fascinating studies using cellular network usage data to analyze and solve urban planning problems, centered on Morristown, NJ. After acknowledging the privacy concerns of working with the data, Volinsky showed how the data tells the rich human story of the city’s inhabitants in the aggregate - the spike in SMS activity on a cell antenna near a high school as students reach out at the start of the day, at lunch, and at the end of the day; the spike in voice traffic at the same antenna as students dial their parents for pick-up; the all-too-familiar patterns in voice and SMS traffic on an antenna positioned near a popular bar. Volinsky also showed how the data could be used to determine where Morristown’s workforce commuted in from (some as far as Brooklyn!) and what routes they took to get there. AT&T’s access to this unique dataset affords all kinds of interesting studies, but for privacy reasons they are unable to open the data to the wider community for further study. Volinsky discussed plans to publish a “synthetic” dataset that mimics the statistical profile of the real data without compromising the privacy of the very real people whose footprints are captured there (a toy sketch of the idea also follows this list).
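For the curious, here is a minimal sketch in Python of what sequencing over a curriculum dependency graph might look like, as mentioned in the Knewton item above. The module names, prerequisites, and mastery data are all invented for illustration - Knewton’s actual models are surely far richer than a simple prerequisite check.

```python
# Hypothetical curriculum: each module maps to its prerequisite modules.
CURRICULUM = {
    "counting":       set(),
    "addition":       {"counting"},
    "subtraction":    {"counting"},
    "multiplication": {"addition"},
    "division":       {"multiplication", "subtraction"},
}

def recommend_next(mastered):
    """Return the 'frontier' of the graph: modules the student has not
    yet mastered but whose prerequisites are all satisfied."""
    return sorted(
        module
        for module, prereqs in CURRICULUM.items()
        if module not in mastered and prereqs <= mastered
    )

# A student who has mastered counting and addition is ready for
# multiplication and subtraction, but not yet for division.
print(recommend_next({"counting", "addition"}))
# ['multiplication', 'subtraction']
```

The tailoring St. Charles described comes from continually updating the mastered set (and, in practice, much richer proficiency estimates) as progress data streams in.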
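And to make the synthetic-dataset idea from the AT&T item concrete, here is a toy sketch of the general approach: fit aggregate statistics to the real records, then publish samples drawn from the fitted model rather than the records themselves. The hourly rates below are invented, Volinsky did not describe AT&T’s actual method, and a real release would need formal privacy guarantees, not just aggregation.

```python
import random

# Hypothetical fitted statistic: mean SMS volume per hour at a single
# antenna, echoing the school-day spikes Volinsky described.
HOURLY_SMS_RATE = {8: 420.0, 12: 510.0, 15: 480.0, 20: 60.0}

def synthetic_day(rates, seed=None):
    """Sample one synthetic day of per-hour SMS counts. The aggregate
    shape survives; no individual message or subscriber does."""
    rng = random.Random(seed)
    # Normal approximation to Poisson counts: mean = rate,
    # std dev = sqrt(rate) - adequate for large rates in a sketch.
    return {
        hour: max(0, round(rng.gauss(rate, rate ** 0.5)))
        for hour, rate in rates.items()
    }

print(synthetic_day(HOURLY_SMS_RATE, seed=42))
```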
If I were to extract a theme touched on by all the speakers, it would be the emphasis on big data as a strikingly human concern. The talks did not dwell on the mathematical, technological, or commercial aspects of big data, the industry surrounding it, or the industries using it to drive growth. Rather, humans create data and make decisions based on data, and while we obtain significant leverage from the technological tools we use to manage our data, making sense of that data remains a distinctly human skill and a growing part of human dialogue.
Chris Volinsky from AT&T explored data as the residue of human activity, most of it routine but some of it very precious and intimate. Riley Newman and AirBnB showed the trust we now place in algorithms and the data we feed them as they reshape decisions as simple as choosing which neighborhood to crash in for the night. Jesse St. Charles of Knewton touched on how we can use these tools to revamp as fundamental a human tradition as education - the machine learns from us, and we learn from the machine. He also raised data empathy, the judgment outside the data that shapes how we derive meaning from it. Sean Gourley of Quid showed the consequences of the absence of data empathy in systems like HFT algorithms and Twitter propaganda bots, which can far outpace humans on volume and velocity but (so far) lack the adaptive judgment to come to the table with a viewpoint. The human element doesn’t just play a role in using big data as input - big data output requires the human touch as well. Jake Klamka and the work of Insight Data Science show that it is not sufficient to be a superb data technician - driving the dialogue requires superb data communication.