“What I Wish They Knew”: 5 Answers from a Government Data Scientist

One of the goals of this blog is to bridge the worlds of tech and government: I believe we can do so much more by working together, yet we often don’t understand each other deeply enough to begin. I will be starting the “What I Wish They Knew” series, which features people who are familiar with both these communities.

The first person to kick off this series is none other than my husband, Kenneth, who worked in Singapore’s Government Data Science unit before pursuing a PhD in Statistics at Stanford. By the way, you can find out more about Singapore’s Government Data Science unit on their blog: https://blog.gds-gov.tech/ – highly recommended.

Kenneth at The Hive, where the Government Digital Services team is located.
1. How did you get into Government Data Science?

I’ve had a strong interest in mathematics and related quantitative fields like statistics and computer science for as long as I can remember, and studied math at Princeton as an undergrad. [NB: Kenneth’s mathematics blog can be found here.]

I began my career in the Singapore public service, hoping to give back to society in the small ways I could. While I was at the Ministries of Defence and Environment, most of my work did not involve any advanced math or data analysis.

I missed the intellectual challenge of quantitative thinking, and started taking online courses on the side. Andrew Ng’s “Machine Learning” course on Coursera first piqued my interest in data science. I learnt how we can use a small toolbox of algorithms to extract a whole lot of information from data, and I thought to myself, “How cool would it be if I could use some of these techniques in my work?”

Fortunately, a unit in Government was being set up to do just that: use state-of-the-art data analysis techniques to inform policy decision-making. I joined the Government Data Science unit in 2015 as a consultant. My work experience in policy and my quantitative undergraduate training put me in a rare position to understand the mindsets of both the policy officer and the data scientist. As such, I felt that I was an effective translator between the two parties.

2. How about an example – what is one meaningful thing you did in Government Data Science?

An agency was very concerned about congestion during peak hours at the checkpoint between Singapore and Malaysia. Understanding cargo traffic patterns could help them design policies to reduce congestion. All they had was tens of millions of “permit data” entries, each of which captured the time that the cargo truck carrying the permit passed through a checkpoint, the industry code, and the value of the goods carried. I worked with the agency to define useful problem statements to shape the direction of analysis. One example: what are the top 5 industries moving cargo during peak hours? (Policy interventions would be done at the industry level.)

Next, since each truck could hold multiple permits and the number of trucks was what we cared about, we went through a (non-trivial) process of turning “permit data” into “truck data”. We were able to identify the top industries moving cargo during peak hours, and further narrowed this group to those who were moving cargo on the busiest roads. We were also able to develop hypotheses on what influenced industry behaviour.
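The permit-to-truck step above can be sketched in plain Python. The record fields and values here are made up for illustration (the actual permit schema isn’t described in detail): the idea is simply to collapse multiple permits per truck crossing into one record, then count peak-hour crossings by industry.

```python
from collections import Counter

# Hypothetical permit records: several permits can share one truck crossing.
permits = [
    {"truck_id": "T1", "crossing_time": "07:45", "industry": "Electronics"},
    {"truck_id": "T1", "crossing_time": "07:45", "industry": "Electronics"},
    {"truck_id": "T2", "crossing_time": "08:10", "industry": "Food"},
    {"truck_id": "T3", "crossing_time": "14:30", "industry": "Electronics"},
    {"truck_id": "T4", "crossing_time": "07:55", "industry": "Chemicals"},
]

# Step 1: collapse permits into one record per truck crossing.
trucks = {}
for p in permits:
    key = (p["truck_id"], p["crossing_time"])
    trucks[key] = p["industry"]  # assume one dominant industry per truck

# Step 2: keep only peak-hour crossings (say, 07:00-09:00) and count by industry.
def is_peak(t):
    return "07:00" <= t < "09:00"

peak_counts = Counter(ind for (tid, t), ind in trucks.items() if is_peak(t))
top_industries = peak_counts.most_common(5)
print(top_industries)
```

In practice the real process was far messier (hence “non-trivial”): deciding how to attribute a truck carrying permits from multiple industries, for instance, is exactly the kind of judgment call that needs the domain expert.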

After completing the analysis, I shaped the narrative of our presentation in a way that delivered impactful policy insights, ruthlessly cutting down on unnecessary details. This is often the most painful part of the process for a data scientist – it’s so tempting to want to show ALL the great analysis we did.

After dozens of hours of work, there was nothing more satisfying than seeing the audience’s facial expressions saying, in effect, “before I was blind, now I see”! They never had a picture of the cargo traffic patterns until our analysis was done, and could now act upon it to improve congestion.

3. What is one thing you wish non-data scientists knew about working with data scientists?

That good data analysis requires significant collaboration between the data scientist and you, the domain expert.

Some people view the relationship between the domain expert and the data scientist as follows:

  1. Domain expert gives data scientist a bunch of Excel files.
  2. Data scientist crunches the numbers and churns out a report or presentation 3 months later. After all, the data scientist knows everything about data and that’s what we are paying them to do, right?

Nothing could be further from the truth! Domain expertise can speed up the data analysis process tremendously and direct it meaningfully, resulting in greater value from the project. Let me give two examples of this.

First, explaining the data to the data scientist, down to what each column means and how the data was collected, will save him/her much second-guessing angst. For example, if the patient check-out time was 18:00:00, does it mean that the person checked out at 6pm, or does it mean that the clinic closed at 6pm, and so everyone who hadn’t checked out yet was given a standard check-out time? Explaining the data will also give the data scientist a better sense of which variables are of greater importance and deserve more attention. In the example above, what does “checking-out” mean anyway, and is it significant?
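A simple sanity check a data scientist might run when facing the check-out example above: if one exact timestamp dominates the column, it is probably a default value rather than a real event time. The data and the 30% threshold here are made up for illustration.

```python
from collections import Counter

# Hypothetical check-out timestamps from a clinic dataset.
checkouts = ["14:32:05", "18:00:00", "18:00:00", "16:10:44", "18:00:00",
             "18:00:00", "15:05:12", "18:00:00", "18:00:00", "17:59:59"]

# If a single exact timestamp dominates, flag it as a likely default value
# (e.g. "clinic closed at 6pm") rather than a genuine check-out time.
counts = Counter(checkouts)
value, n = counts.most_common(1)[0]
share = n / len(checkouts)
if share > 0.3:  # arbitrary threshold for illustration
    print(f"Suspicious default: {value} accounts for {share:.0%} of records")
```

But a check like this only raises the question; it takes the domain expert to answer it.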

Second, domain experts can pick up on insights that would escape data scientists. For example, a data scientist finds that Chromosome 21 seems to have an impact on a health outcome. Is that expected? Does it confirm some of the other hunches that we have? Or is it something completely unexpected, that suggests that the model is wrong? These are questions that a data scientist is unlikely to have any intuition about. However, with feedback from the domain expert, the data scientist can quickly decide to pursue or drop lines of inquiry.

4. As a policy-maker, what is one thing you wish more data scientists paid attention to?

That data analysis is not for the data scientist, but for the policy-maker (or client). As such, good data analysis always puts findings and insights in the proper context.

Consider the sentence: “Our prediction model for which patients will be re-admitted over the next 6 months is 34.56% more accurate than the existing model.” Upon seeing this sentence, several questions come to mind:

  • What do you mean by accuracy? Is the measure of accuracy that you are using appropriate? (See the Wikipedia page on the confusion matrix, https://en.wikipedia.org/wiki/Confusion_matrix, for a whole zoo of accuracy measures.)
  • 34.56% seems overly precise: can we really compare performance down to 2 decimal places? (35% might be better.)
  • Does 35% more accuracy translate to a meaningful difference? For example, will this allow us to tailor our services better to 100,000 patients, or 10 patients? (If possible, relate the finding to something the policy-maker cares about, like dollars or man-hours saved.)
  • Is this even a meaningful thing to predict? (Hopefully, the domain expert would have said so; see the answer to question 2.)
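The first bullet deserves a concrete illustration. Here is a toy sketch, with made-up numbers, of why the choice of accuracy measure matters: on imbalanced data, a model that never predicts a re-admission can still post an impressive raw accuracy.

```python
# Toy illustration: suppose 5% of patients are re-admitted. A model that
# always predicts "no re-admission" scores 95% accuracy yet catches no one.
actual    = [1] * 5 + [0] * 95   # 1 = re-admitted
predicted = [0] * 100            # trivial "never re-admitted" model

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_pos = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_pos / sum(actual)

print(accuracy)  # 0.95 -- looks impressive
print(recall)    # 0.0  -- but it never flags a single re-admission
```

So “34.56% more accurate” is meaningless until we know which measure was used, and whether it is the one the policy-maker actually cares about.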

Good data analysis also provides enough detail to illuminate, but not so much that it confuses. For example, saying nothing about the modeling process could make me wonder whether you did your homework in choosing the most appropriate model (and whether I should have spent all that money hiring you). At the other extreme, I will not appreciate going through slide after slide of raw SAS/R output.

At this point I cannot overstate the importance of appropriate data visualisation. These visualisations have to be thought through: good charts clarify, poor ones confuse. Unfortunately, it’s a lot easier to make the latter. (See this for examples not to follow.)

Truly, a (well-designed) picture is worth a thousand words


5. What is your hope for the field of data science?

The economist Ronald Coase famously said “Torture the data, and it will confess to anything.” In an era of subjective reporting and “fake news”, this concern is more pertinent than ever.

My hope is that the general population will have enough statistical knowledge so that they can call a bluff when they see one, and demand quantitative evidence for decisions their leaders make. To this end, I hope to see reforms in statistics education at the high-school level so that it becomes a subject that people feel is relevant and interesting, rather than abstract and theoretical (which is often how it is taught today).

Thanks for reading! Know anyone who should be featured in this series? Do let me know at karentay@gmail.com.
