I’m excited to profile Annalyn Ng, a self-taught data scientist and #womanintech, who is pushing for the adoption of data science in the public service. She currently works at the Ministry of Defence (Singapore), where she analyses data to identify predictors for personnel performance in military vocations. Originally a psychology and economics major, she first learnt about data science in a statistics class, and has been addicted ever since.
She co-authors a blog, algobeans.com, that teaches data science in layman’s terms, and has recently published a book, Numsense! Data Science for the Layman, which is used as reference material at Stanford and Cambridge.
In this article, she outlines the challenges and solutions to enabling data science in the public service, and ideas about how to build these capabilities individually and in your own organization. All opinions here are her own.
Introducing Data Science in the Public Service: Challenges and Solutions
My plea for wider application of data science is a personal one. My mum passed away due to a misdiagnosis, when doctors administered the wrong medication while stalling the treatment she required. Then, I wondered: if we can teach machines to play games like Go and StarCraft, can we invest as much to teach machines how to save lives? While we’ve had breakthroughs, such as in automated interpretation of medical image scans, similar success for general diagnosis seems lacking.
Many people regard data science as a craft that is exclusive to tech companies. Let’s dispel this myth. The fact is, wherever there is data, there is potential for data science. If fashion retailers can use purchase history to recommend products and predict trends, we can easily apply the same methods to past medical data to recommend treatments and predict diagnoses.
Although data science is a profit driver in the private sector, its use is still relatively immature in the public service. Healthcare analytics is one specialised domain with untapped potential, but data science can also be applied in mainstay departments like policy (e.g. analysing public feedback), finance (e.g. flagging fraudulent transactions), and human resources (e.g. personnel deployment).
So, what’s stopping us?
There are two parts to data science: 1) data collection, and 2) data analysis, each with its own unique set of challenges to overcome:
Getting data is often the hardest part of any data science effort. As public data is sensitive, infrastructure is needed to collect data systematically and securely. To reach deeper insights, data from different agencies and ministries needs to be merged, and this process usually raises questions about confidentiality.
Hence, data collection requires collaboration across agencies. Mutual trust must be built to ensure that useful data is exchanged so that insights can be uncovered. Ownership and maintenance of IT infrastructure should be established, and stress tests conducted regularly to ensure data security. We rely on senior management to set this stage, before public servants can take their cue to play their part.
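To make the merging step more concrete, here is a minimal sketch of one possible way two agencies could link records without exchanging raw identifiers. It is an illustration under stated assumptions, not a description of any actual government system: the function names, the field names, and the idea of a secret key agreed between the agencies are all hypothetical. Keyed hashing on its own does not settle the confidentiality questions raised above, so treat it as a starting point for discussion.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a raw identifier with a keyed hash (HMAC-SHA256) token."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def merge_on_token(records_a, records_b, id_field, secret_key):
    """Join two lists of dict records on a pseudonymised identifier.

    Hypothetical helper: in practice each agency would tokenise its own
    records before sharing; both steps are shown here for illustration.
    Only records present in both lists are kept, and the raw identifier
    is dropped from the output.
    """
    index_b = {}
    for record in records_b:
        token = pseudonymise(record[id_field], secret_key)
        index_b[token] = {k: v for k, v in record.items() if k != id_field}

    merged = []
    for record in records_a:
        token = pseudonymise(record[id_field], secret_key)
        if token in index_b:
            row = {"token": token}
            row.update({k: v for k, v in record.items() if k != id_field})
            row.update(index_b[token])
            merged.append(row)
    return merged


# Made-up example data, for illustration only.
health = [{"person_id": "A001", "condition": "asthma"},
          {"person_id": "A002", "condition": "none"}]
housing = [{"person_id": "A001", "district": "north"}]

linked = merge_on_token(health, housing, "person_id", b"hypothetical-shared-key")
```

In this example only the record for the person who appears in both datasets survives the join, and the merged row carries a token rather than the original identifier.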
Once we have data, we need to analyse it. Skilled data scientists are required for this role, but talented ones might be enticed away by private companies, while those committed to staying might not be given the support to learn, resulting in a lack of expertise.
However, expertise can be developed. It is a misconception that data science is solely quantitative. Data literacy can be divided into two levels: 1) knowing how data analysis works, and 2) executing the actual analysis.
The first level is basic knowledge of how algorithms work and the assumptions they make. This does not involve much math, and thus should be made accessible to everyone.
Algorithms are increasingly being automated, lowering the bar to allow people with non-technical backgrounds to do basic data exploration through apps and dashboards. As data science research becomes more accessible, we need to improve data literacy among regular public servants, to ensure that conclusions made from such research are accurate.
Besides checking results for errors and assumptions, a broad understanding of data analytics can help managers to identify potential data sources, as well as to facilitate collection of data in a suitable format for analysis. In turn, analysts are likely to be appreciative of managers who provide conditions for work to be done effectively.
The second level is technical know-how of math and coding that data scientists, rather than managers, need to master. To nurture expertise, we need to build an ecosystem for experts to thrive. Many agencies have made the mistake of recruiting data scientists in isolation. Without peers who can provide feedback and healthy competition, data scientists may have fewer ideas to build on and less motivation to improve. Therefore, it is crucial to deploy data scientists in teams.
While data scientists can either be trained in school or self-taught, enlightened employers have since realised that the medium of learning is less important than the rigor and continuity in learning. Many companies, including Google and Facebook, have sought out programmers with no formal degree but nonetheless armed with a solid portfolio of coding projects.
Regardless of our current level of expertise, data science is an evolving field, so a data scientist’s learning journey never ceases as they seek to add new techniques to their toolbox through constant reading and practice.
So, how do we start learning?
Traditional classroom training is growing obsolete: it is costly, time-consuming, and possibly ineffective, since participants are likely to forget technical details without constant review. Moreover, data science is a fast-moving field, and any one-off training is unlikely to suffice for public servants whom we wish to groom as experts.
As a data science convert myself (having majored in psychology and economics), I have a few alternatives to suggest:
Enrol in massive open online courses (MOOCs), which are video courses available for free or priced within $20. Examples of established course platforms include Coursera, Udacity and Udemy. Participants can choose courses based on reviews, and good instructors are also prompt in answering questions on forums. With courses spanning a range of difficulty levels, both beginners and experts can find content suited to their needs. Moreover, as course videos are usually made available for a lifetime, participants can review them whenever they need to.
Learning is not just about sponging up knowledge, because knowledge is easily forgotten without practice. Therefore, to apply what I learn, I’d usually pair my learning with relevant projects. Managers can also encourage a proactive learning culture, such as allowing staff to reserve time for research and experimenting with new data science methods.
After mastering new techniques, I’d share what I learn with others because teaching reinforces learning. Writing blog articles is a convenient way to do this. To engage a non-technical audience, I’d leave out the math and jargon, and instead focus on intuitive explanations and visuals. I eventually compiled the tutorials into a book, Numsense! Data Science for the Layman, which, I’m ecstatic (!) to share, has since been chosen by top universities like Cambridge and Stanford as a reference text. Nevertheless, simply keeping a blog can be gratifying, knowing that your tutorials can benefit a global audience.
As for colleagues just starting out in data science, I frequently encourage the recruitment of interns with a statistics or computer science background to help with relevant projects. This is a win-win arrangement: supervisors get to learn more techniques, while interns get to appreciate data science applications in the public sector. To ensure accuracy of results, projects can be vetted by trained colleagues.
Finally, there are opportunities for everyone, regardless of expertise, to get together to share ideas. Data science meetup groups are common in major cities, often featuring a range of speakers from different industries, and attracting large audiences interested in learning and networking.
So, where do we go from here?
Learning data science is just a means to an end. In public service, the end goal would be to use data science to improve lives.
A predictive algorithm to diagnose heart disease would be useless if we cannot pack it into a fast and intuitive interface that any doctor can use. To build products incorporating data science, we need to plug data scientists into interdisciplinary teams of engineers and designers. Here, good communication is essential to facilitate teamwork, as well as to convince end users of product benefits.
In implementing a data science product, we also need to validate it regularly, to ensure that it remains effective over time. This is not as straightforward as it sounds. Take, for example, an algorithm that predicts whether a person requires medical treatment for a latent disease. To conclude that the algorithm is more accurate than doctors’ judgement, we need to compare the health outcomes of two groups—one selected by the algorithm, and the other selected by doctors. This inevitably raises ethical questions of whether we’d be denying early medical treatment to the group judged by doctors, at the possible expense of their lives. There is no perfect solution to this problem, but awareness is a good start.
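The statistical side of that comparison can be sketched in a few lines. The example below is a minimal illustration rather than a validated clinical method: it applies a standard two-proportion z-test to the rate of good outcomes in each group. The function name and the counts are hypothetical, and the test assumes that the two groups are otherwise comparable (for example, randomly assigned), which is precisely where the ethical difficulty arises.

```python
import math


def two_proportion_z_test(good_a, n_a, good_b, n_b):
    """Compare the rate of good outcomes between two groups.

    Returns (difference in rates, z statistic, two-sided p-value),
    using the pooled-variance two-proportion z-test.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("Both groups must contain at least one person.")

    rate_a = good_a / n_a
    rate_b = good_b / n_b
    pooled = (good_a + good_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    if std_err == 0:
        # All outcomes are identical, so there is no evidence of a difference.
        return rate_a - rate_b, 0.0, 1.0

    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a - rate_b, z, p_value


# Hypothetical counts: good outcomes among people selected by the
# algorithm versus people selected by doctors.
diff, z, p_value = two_proportion_z_test(good_a=90, n_a=100, good_b=70, n_b=100)
```

A small p-value would suggest that the gap in outcomes is unlikely to be due to chance alone, but it says nothing about whether the groups were comparable to begin with, so the design of the comparison matters as much as the arithmetic.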
Apart from conducting data science within the government, we can also consider publishing non-sensitive data, to put public service into the hands of the public. Open satellite imagery, for example, has enabled community involvement in humanitarian search efforts for missing Malaysia Airlines flight MH370, as well as detection of illegal forest fires in Indonesia. Pollutants from forest fires can be a regional health hazard, and boycotting culpable companies has been a way for the public to fight back. Crowdsourcing has emerged as a check and balance to ensure that corporations and governments maintain social responsibility.
With more data available and data literacy improving, the potential for data science to improve the lives of citizens has never been greater. Whether we can successfully introduce data science in the public service will depend on how ready we are to tackle its accompanying challenges.
Thanks, Annalyn! We can’t wait to see what you get up to next.