R Programming for Public Policy Analysis

Early in 2019 I posted a short ‘listicle’ with some of the key reasons I think Python and/or R should become essential tools in a modern policy analyst’s toolkit.

The full article is here, but the headline points were: R's use across disciplines fits well with multidisciplinary policy analysis teams; written code provides greater reproducibility and transparency; and there are practical advantages to automating repetitive bits of policy analysis (such as reporting results across multiple scenarios).

While the article didn't end in me getting a book deal, it did result in a surprising number of messages from people who are just as passionate as I am about the potential of R in the public policy world. At the same time, a number of people asked me to put my money where my mouth is and show them how it's useful by teaching them.

So after being offered a space by the Microsoft Reactor in Sydney, I took up the challenge, throwing together a course based on what I thought would be most useful from my experience as a consultant/economist/policy analyst.

Running for an hour a week over four weeks, it covered the basics of automating tasks, undertaking exploratory analysis, visualizing data and generating summary statistics in the context of answering questions as a policy advisor.

The course went well. So well, in fact, that the most common request from participants in course evaluations was for future courses to be longer. I also found:

  • The grammar of the Tidyverse made learning the basics much faster: I learned base R via an online series of courses and managed to learn the core principles of the Tidyverse in a little more than a day. For policy analysts/consultants it also made more sense, thanks to Tidyverse’s more intuitive grammar.
  • People didn’t need a background in statistics to be able to quickly pick up the basics: the course was equivalent to a little over a full day of material and covered a lot of ground but everyone kept up. From past courses I’ve seen this isn’t guaranteed, so it was a pleasant surprise.
  • Practice was preferred to theory: I wasn’t a straight A student, so I get this. But everything was picked up quicker if it was made relevant to the daily lives of participants rather than being draped in a purely theoretical framework.
  • Pipes are confusing: This is contentious, but I remember feeling this way when I first started to learn R. I love pipes now, but people in my sessions preferred nested formulas, and trying to introduce pipes so early was just distracting (see the sketch after this list).
  • People loved data viz with ggplot: However, this was more because of ggplot's ability to quickly segment and visualize data (such as by applying facets to demographic classifications) than the quality of what it could produce. This makes sense, given a large part of a policy analyst's work is exploratory analysis used to inform written recommendations, rather than output to be presented.
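
Since those last two points are easier to show than tell, here's a toy sketch of the nested-versus-piped contrast and the kind of one-line faceting participants loved. The data is made up for illustration:

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for the kind of survey extract we used in class
survey <- data.frame(
  age_group = rep(c("18-34", "35-54", "55+"), each = 20),
  income    = runif(60, 30000, 90000)
)

# Nested style (what participants preferred): read from the inside out
summarise(group_by(filter(survey, income > 40000), age_group),
          mean_income = mean(income))

# Piped style: the same logic, read top to bottom
survey %>%
  filter(income > 40000) %>%
  group_by(age_group) %>%
  summarise(mean_income = mean(income))

# And the faceting participants loved: one extra line splits the
# plot by a demographic classification
ggplot(survey, aes(x = income)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ age_group)
```

Both dplyr calls produce the same summary table; the only difference is whether you read the logic inside-out or top-to-bottom.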

So where to from here? Well, outside of shamelessly rebranding my 2019 article for 2020, I've been convinced to develop a longer and more widely accessible online version of the free course, to satisfy demand from those who wanted to join but couldn't due to time constraints or being in the wrong city/country.

That's also the second reason I wanted to write this up: if you're a fellow R/Python programmer in the policy/consulting space, I'd love to hear your thoughts about what you think is useful. So if that's you, feel free to drop me a line via LinkedIn, Twitter or the contact form here.

And if you or someone you know is interested in signing up for the first run of the online crash course in R, you can do so via program4policy.com

7 Reasons for policy professionals to get into R programming in 2019

Note: A version of this article was also published via LinkedIn here and on Medium here. 

With the rise of ‘Big Data’, ‘Machine Learning’ and the ‘Data Scientist’ has come an explosion in the popularity of using open-source programming tools for data analysis.

This article provides a short summary of some of the evidence of these tools overtaking commercial alternatives and why, if you work with data, adding an open programming language, like R or Python, to your professional repertoire is likely to be a worthwhile career investment for 2019 and beyond.

Like most faithful public policy wonks, I’ve spent more hours than I can count dragging numbers across a screen to understand, analyse or predict whatever segment of the world I have data on.

Exploring where the money was flowing in the world’s youngest democracy; analysing which government program was delivering the biggest impact; or predicting which roads were likely to disappear first as a result of climate change.

New policy questions, new approaches to answer them and a fresh set of data.

Yet, every silver-lining has a cloud. And in my experience with data it’s often the need to scale a new learning curve to adhere to legacy systems and fulfil an organizational fetish for using their statistical software of choice.

Excel, SAS, SPSS, Eviews, Minitab, Stata and the list goes on.

Which is why I've decided this article needed to be written.

Not only am I tired of talking to fellow analytical wonks about why they're limiting themselves by working with data only in spreadsheets, but there are also distinct professional advantages to unshackling yourself from the tyranny of proprietary tools:

1. Open-Source Statistics is Becoming the Global Standard

Firstly, if you haven't been watching, the world is increasingly full of data. So much data that the world is chasing after nerds to analyse it. As a result, demand for a 'jack of all trades' data person, or "data scientist", has been outstripping demand for the more vanilla-flavoured 'statistician':

% Job Advertisements with term “data scientist” vs. “statistician”

(Credit: Bob Muenchen – r4stats.com)

And although you might not have aspirations to work in what the Harvard Business Review called the 'Sexiest Job of the 21st Century', the data gold rush has had implications far beyond the sex appeal of nerds.

For one, online communities like Stackoverflow, Kaggle and Data for Democracy have flourished, providing practical avenues for learning how to do some science with data and driving demand for tools that make applying this science accessible to everyone, like R and Python.

So much so, that some of the best evidence suggests not only that demand for quants with R and Python skills is booming, but that the practical use of open-source statistical tools like R and Python is starting to eclipse their proprietary relatives:

Statistical software by Google Scholar Hits:

(More credit to Bob Muenchen – r4stats.com)

Of course, I’m not here to conclusively make the point that a particular piece of software is a ‘silver bullet’. Only that something has happened in the world of data that the quantitatively inclined shouldn’t ignore: Not only are R and Python becoming programming languages for the masses, but they’re increasingly proving themselves as powerful complements to more traditional business analysis tools like Excel and SAS.

2. R is for Renaissance Wo(Man)

For those watching the news, you’ll no doubt have heard of the great battle being waged between the R and Python languages that has tragically left the internet strewn with the blood of programmers and their pocket protectors.

But I'm going to goosestep right over that issue, as in my opinion much of what I say about R is increasingly applicable to Python.

For those of you unfamiliar with R, in essence it’s a programming language made to use computers to do stuff with numbers.

Enter "10*10" and it will tell you "100".

Enter "print('Sup?')" and the computer will speak to you like that kid loitering on your lawn.

Developed around 25 years ago, R was built on a simple idea: create a more open and extensible programming language for statisticians. Something that gave you greater power and flexibility than a 'point and click' interface, but was quicker than punch cards or manually keying in 1s and 0s to tell the computer what to do.

The result: R, a free statistical tool whose sustained growth has made it one of the most flexible statistical tools in existence.

So much growth, in fact, that in 2014 enough new functionality was added to R by the community that "R added more functions/procs than the SAS Institute has written in its entire history." And while it's not the quantity of your software packages that counts, the speed of development is impressive and a good indication of the likely future trajectory of R's functionality. Particularly as many heavy hitters, including the likes of Microsoft, IBM and Google, are already using R and making their own contributions to the ecosystem:

Using R for Analytics – Get in Before George Clooney Does:

Image source. Also, see here

Not only that, but with much of this growth being driven by user contributions, it's also a great reminder of the active and supportive community you have access to as an R or Python user, making it easier to get help, access free resources and find example code to steal... I mean, base your analysis on.

3. R is Data and Discipline Agnostic

(Source: xkcd)

One of the first things that motivated me to learn R was the observation that many of the most interesting questions I encountered went unanswered because they crossed disciplines, involved obscure analytical techniques, or were locked away in a long-forgotten format. It therefore seemed logical that if I could become a data analytics "MacGyver", I'd have greater opportunities to work on interesting problems.

Which is how I discovered R. You see, as somebody who is interested in almost everything, R's adoption across such a diverse range of fields made it nearly impossible to overlook. With extensions freely available to work with a wide variety of data formats (proprietary or otherwise) and to apply a range of nerdy methods, R made a lot of sense.
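
To make that concrete, here's a sketch of a single R session pulling in data from several formats that would otherwise each demand their own proprietary tool. The file names are hypothetical; both packages are free:

```r
library(readxl)  # Excel workbooks
library(haven)   # SAS, SPSS and Stata files

# One session, many sources (hypothetical file names)
survey    <- read_excel("survey.xlsx")       # Excel
spending  <- read_sas("spending.sas7bdat")   # SAS
census    <- read_dta("census.dta")          # Stata
attitudes <- read_sav("attitudes.sav")       # SPSS
```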

I think it was Richard Branson who once said, "If somebody offers you a problem but you are not sure you can do it, say yes. R probably has a package for it" (!). Whatever the discipline or the data, R (and increasingly Python) has you covered.

Yet there is perhaps a subtler reason adopting R made sense: by being 'discipline agnostic', it's well-suited to multidisciplinary teams, applied multipotentialites and anyone uncertain about exactly where their career might take them.

4. R Helps Avoid Fitting the Problem to the Tool

As an economist, I love a good echo chamber. Not only does everybody speak my language and get my jokes, but my diagnosis of the problem is always spot-on. Unfortunately, thanks to the errors of others, I'm aware that such cosy teams of specialists aren't always a good idea: homogeneous specialist teams risk developing solutions that aren't fit for purpose, by defining a problem too narrowly and misunderstanding the scope of the system it's embedded in.

(Source: chainsawsuit.com)

While good organizations are doing their best to address this, creating multidisciplinary teams with more diverse networks can be a useful means of protecting against these risks while also driving better performance. Which of course stands as another useful advantage of using a general statistical tool with a diverse user base like R: you can collaborate more fluidly across disciplines while being better able to pick the right technique for your problem, reducing the risk that everything looks like a nail merely because you have a hammer.

5. Programming Encourages Reproducibility

Yet programming languages also hold an additional advantage over more typical 'point and click' interfaces for conducting analysis: transparency and reproducibility.

For instance, because software like R encourages you to write down each step in your analysis, your work is more likely to be 'reproducible' than had it been done using a more traditional 'point and click' solution. Recording each step needed to achieve the final result makes it easier for your colleagues to understand what the hell you're doing and increases the likelihood you'll be able to reproduce the results when you need to (or somebody else will).
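
In practice, that written record can be as simple as a short script. A minimal sketch, with hypothetical file and column names:

```r
# Every step from raw file to final number is written down, so a
# colleague (or future you) can re-run the analysis end to end.
results <- read.csv("program_results.csv")     # 1. load the raw data
results <- subset(results, !is.na(impact))     # 2. document each cleaning step
mean_impact <- tapply(results$impact,          # 3. summarise impact by scenario
                      results$scenario, mean)
print(mean_impact)                             # 4. report the result
```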

In addition to being practically useful for tracing your journey down the data-analysis maze, for analytical teams this can also encourage collaboration by allowing others to more easily understand your work and replicate your results. It assists with organizational knowledge retention and provides an incentive for accurate analysis, by often making it easier to spot errors before they impact your work or soil your reputation.

Finally, while the use of scripting isn't unique to open-source programming languages, by being free, R and Python come with an additional advantage: should you decide to release your analysis, the potential audience is likely to be greater and more diverse than had it been written using proprietary software. Which is why, in a world of the "Open Government Partnership", open-source programming languages make a lot of sense, easing the transition towards governments publicly releasing their policy models.

6. R Helps Make Bytes Beautiful  

As data-driven-everything becomes all the rage, making data pretty is becoming an increasingly important skill. R is great at this, with virtually unlimited options for unleashing your creativity on the world and communicating your results to the masses. Bar graphs, scatter diagrams, histograms and heat maps. Easy.

Just not pie graphs. They’re terrible.  

But R’s visualization tools don’t finish at your desk, with the ‘Shiny’ package allowing you to take your pie graphs to the bigtime by publishing interactive dashboards for the web. Boss asking you to redo a graph 20 times each day? Outsource your work to the web by automating it through a dashboard and send them a link while you sip cocktails at the beach.
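
For the curious, the dashboard version of that 'redo this graph 20 times' job can be surprisingly small. A minimal sketch, using toy data in place of real results:

```r
library(shiny)
library(ggplot2)

# Toy data standing in for real policy analysis outputs
policy_results <- data.frame(
  scenario = rep(c("baseline", "reform"), each = 3),
  region   = rep(c("North", "South", "East"), times = 2),
  impact   = c(1.2, 0.8, 1.5, 2.1, 1.1, 1.9)
)

ui <- fluidPage(
  titlePanel("Program impact by scenario"),
  selectInput("scenario", "Scenario:",
              choices = unique(policy_results$scenario)),
  plotOutput("impact_plot")
)

server <- function(input, output) {
  output$impact_plot <- renderPlot({
    # Redraws automatically whenever the selected scenario changes
    ggplot(subset(policy_results, scenario == input$scenario),
           aes(x = region, y = impact)) +
      geom_col()
  })
}

shinyApp(ui, server)  # run locally, or publish to the web (e.g. shinyapps.io)
```

Run it locally and it opens in a browser; publish it and your boss can fiddle with the dropdown while you're at the beach.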

7. R and Python are free, but the Cost of Ignoring the Trend Towards Open-Source Statistics Won’t Be

Finally, R and Python are free, meaning you can not only install them wherever you want, but also take them with you throughout your career:

  • Statistics lecturers prescribing you textbooks that are trying to get you hooked on expensive software that likely won’t exist when you graduate?  Tell them it’s not 1999 and send them a link to this.
  • Working for a not-for-profit organization that needs statistical software but can’t afford the costs of proprietary software? Install R and show them how to install Swirl’s free interactive lessons.
  • Want to install it at home? No problem. You can even give a copy to your cat.
  • Got a promotion and been gifted a new team of statisticians? Swap the Christmas bonuses for the gift that keeps on giving: R!

But I'm not here to tell you R (or Python) is perfect. After all, there are good reasons some companies are reluctant to switch their analysis to R or Python. Nor am I interested in convincing you that they can, or should, replace every proprietary tool you're using; I'm an avid spreadsheeter myself, and programs like Excel have distinct advantages.

Rather, I’d like to suggest that for all the immediate costs involved in learning an open-source programming language, whether it be R or Python, the long-term benefits are more than likely to surpass them.


Not only that, but as a new generation of data scientists continue to push for the use of open-source tools, it’s reasonable to expect R and Python will become as pervasive a business tool as the spreadsheet and as important to your career as laughing at your boss’ terrible jokes.  

Interested in learning R? Check out this link here for a range of free resources.

You can also read my review of the online specialization I took to scale the R learning curve here.

A Review of Johns Hopkins University's Online Data Science Specialization (Coursera.org)

So, for all those loyal subscribers out there (hey mum!), you might wonder what the hell happened to my constant stream of insightful, relevant and handsome blog posts.

Well, I’d have to say you’re thinking about me a little too much – I’d suggest committing yourself to a hobby like me.

Perhaps regular blogging?

Surviving Nay Pyi Taw

In truth, Myanmar has also kept me pretty busy. That is, until recently, when I was handed a steaming pile of free time as a result of moving to the traffic-free social desert that is Nay Pyi Taw, Myanmar.

And how would any sane person use this time?

Well, you'd be best to ask them. As for me, I decided to enroll in a six-month dose of data science administered by Johns Hopkins University (JHU).

So consider this your warning, as that’s where this blog is going.

But to make escape easy, here's a link to YouTube trending, and for those of you with a short attention span I've also created a TLDR (short) version at the end.

Cyanide and Happiness

Rationalizing Self-Harm

That’s right. I know what you’re thinking – why would anyone volunteer to learn about data?!

Well, you see, I was a curious child and time has turned me into a curious man-child. As a result, a surprising amount of my career has been defined by being asked difficult questions by difficult people.

While this has meant that I’ve been able to do a lot of interesting work, it has also made it increasingly apparent that many interesting problems go unsolved because people aren’t sure how to approach data.

Dilbert Comics

Which is where this niche blog post begins, as it was from this observation that I decided it would make sense to arm myself with a statistical tool that:

  1. Is capable – allowing it to be applied to a range of data-related geekery;
  2. Is portable and cheap – allowing it to be easily adopted regardless of an organization’s size and financial resources;
  3. Can work with data in a variety of formats – making it easier to transport analysis to/from a wide variety of sources; and
  4. Is useful across disciplines – making it suited to work across fields and in multidisciplinary teams/organizations.

In essence, I was looking for the ‘spork’ of data science software. Which is apt because like a spork, R can do a lot of things but is a little awkward and unwieldy.

However, unlike a spork R is popular.

So popular, that it’s a global standard in the data world. But not so popular that you’re going to get invited to more parties :(.

Which brings me to another disadvantage of R: it's known for being a little unwieldy.

Basic Analysis of Workshop Data

Don’t get me wrong, I’m not claiming your learning experience will see as many deaths as the figure above. But it’s best to go in expecting that learning R is more like walking on lego than cake.

Which is why I chose to do JHU's Data Science specialization: if I'm going to be walking on lego, I'd prefer to do it quickly and with more decorum than a monkey with a typewriter.

So, the choice was made and a high standard set: Don’t be a monkey.

An Overview

Now, for those of you with a short attention span, remember I've included a short summary at the end of this post. But in essence, JHU's Data Science specialization is made up of nine courses which can be roughly divided into two main 'flavors':

  • The basics of working with R – such as writing scripts, using GitHub, importing/exploring data, and generating statistics; and
  • The actual reason we want to work with R – such as creating interactive visualizations on the web, creating catterplots, running regression models and encouraging your computer to become sentient via machine learning.

After completing these nine courses, students then have the option to complete a final 'Capstone' course, which is meant to provide an opportunity to apply their skills to a real-world problem.

setwd("internet")

So in the spirit of [insert closest holiday here], and to spoil the ending, let me just say that completing the specialization was worthwhile. It covers a range of useful topics, is delivered by world-class lecturers and forces you to apply what you learn. The course also fulfilled my embarrassing desire to apply some science to data, which is essentially the only way to learn R: via R-ing (?).

For instance, the quizzes and programming assignments give you messy data and complicated problems, and ask you to use R to present your analysis in a digestible format. As a result, if you legitimately complete the courses you'll come out having learned a lot.

Although it's hard to compare online courses with those offered by a traditional university, I'd say JHU's Data Science specialization is something close to a four-course graduate certificate. This is based on the level it's pitched at, the workload, and the fact that the entire specialization took me a little over six months despite a background in statistics (although your experience may vary).

It's also relatively inexpensive when compared with the more traditional alternatives, at a little under $300 USD, or around five percent of the cost of a comparable program.

This is Fine

Yet all is not well in the world of the JHU Data Science series.

Gunshow comics.

You see, although I’m glad I did the course, it was not without shortcomings.

Firstly, I was originally attracted to the course because it appeared to cover an impressive array of topics. Yet each course was only a month long, which meant subjects often had to sacrifice depth and/or place unrealistic learning outcomes on students. Unfortunately, the JHU Data Science Specialization often chose both, skimming through essential topics and then grading students on them.

Take the Statistical Inference course, which tries to quickly illustrate how to respect the rules of the God of numbers and explain why we care about infinity, even when we're unlikely to get there anytime soon. While interesting, a frolic through the message boards made it pretty clear that 'vomiting equations onto a PowerPoint presentation' wasn't a particularly effective teaching approach.

A similar story could also be told of the regression course, which gently introduces learners to the concept of linear regression before abruptly lobbing a grenade of generalized linear models, probits, logits and something to do with a hockey stick.

This I found particularly unfortunate, as regression analysis is useful for so many types of analysis. It's also conceptually useful, reminding budding statisticians that there isn't usually a 'silver bullet' explanation for what's driving something, and that your conclusion usually relies as heavily on statistical assumptions as it does on the data.

More generally, when you’re applying statistics in the real-world, abstract concepts aren’t particularly helpful until you’ve internalized them – something I suspect for most mortals would require more time than the course allowed.

The Sound of One-Hand Clapping

War and Peas

Of course, whether the course included other mortals is an open question, with discussion boards mainly filled with generic 'please mark my assignment' requests from past sessions of the course. Although this might be a natural consequence of the field not attracting social superstars (myself excluded, of course…), human-to-human interaction was low even for a mixed-gender game of Dungeons and Dragons.

Relative to other online courses I've done, this led to a much poorer learning experience: you couldn't rely on the hive-mind when you had a problem, and you didn't get the benefit of seeing how others were applying what they learned outside of the course.

Assessment Structure

Given online courses can have thousands of students, quizzes and ‘peer-graded’ assignments tend to be the backbone of the assessment structure in the world of MOOCs. In JHU’s case, online quizzes were typically run each week while peer-grading (where students mark each other) was used for major projects.

For those unfamiliar with 'peer-grading': basically, you submit your assignment, mark five of your peers, and receive a grade based on the most common score given by the five students who marked your submission. Generally it can work quite well, and I'm a big fan: you see how others have approached a problem, get a sense of where you stand relative to your peers and hopefully receive useful feedback to improve your work.

Alas, in the JHU specialization it wasn’t always done well, with much of the feedback I received being minimal. Although I suspect this is in part due to me having attained perfection, I’d also say that this is a result of:

  • The courses being run within a short timeframe – discouraging students from assigning more time and thought to marking;
  • The marking criteria sometimes not providing much scope to differentiate adequate assignments from the exceptional;
  • The age of the course meaning that the internet is now awash with past assignments, making plagiarism easier for the lazy; and
  • The system not encouraging quality feedback – such as by rewarding those who give good feedback by assigning them markers who are likely to give good feedback in return.

PHDComics.com

The Capstone

Finally, there’s the final project or ‘Capstone’ which was described as “a project drawn from real-world problems and will be conducted with industry, government, and academic partners.”

I of course assume that was a typo, as a more apt description would have been: "A project randomly drawn from a real-world problem largely unsuited to the R language, principally unrelated to the other courses in the specialization and unlikely to be useful at any point in the near future."

In the words of one reviewer: "Of all the offerings in the specialization, this one felt like it was thrown together in less than an hour." And while this might seem unfair, the thought definitely crossed my mind as I was cobbling together an interactive predictive-text application that is unlikely to be useful to anyone unless they're looking to generate gibberish.

A disappointing end given the effort that was required to get there.

Two parts contentment. One part complaining.

But again, the specialization wasn’t all bad. Far from it.

For instance, while the regression modelling, machine learning and statistical inference courses could definitely be better structured and longer, my experience is that teaching these topics is harder than learning them. I also imagine this is all the more difficult when you’re teaching a classroom of 100+ whiny nerds.

I’d also say that some of the potentially boring topics were well done.

For instance, although both 'Getting and Cleaning Data' and 'Exploratory Data Analysis' could have been more tightly focused, I came out of both with a much better appreciation of what's possible. They also reminded me why I signed up in the first place by demonstrating why R is so useful.

Finally, while the final lecture for 'Reproducible Research' appeared to be from a different subject altogether, the course was one of my favorites. Partly because it explained what the hell the 'knit' button in RStudio does, but also because it covered the how and why of making research reproducible in R, something that is rarely achieved in economics.

While at first glance this might appear a solely academic issue, as an applied economist I can see many times during my career when these tools would have been tremendously useful for building reproducibility and transparency into my team's work (there's a minimal sketch after this list), as it:

  1. Makes collaboration easier;
  2. Allows the analysis to be quickly repeated with new assumptions and/or data; and
  3. Provides a more reliable way of recording what was done for archival purposes.
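
To make that concrete, here's a minimal sketch of the kind of R Markdown document that 'knit' button builds a report from; the file and column names are hypothetical:

````markdown
---
title: "Program impact by scenario"
output: html_document
---

```{r load-data}
# 'results.csv' is a hypothetical file of model outputs
results <- read.csv("results.csv")
```

The figure below is rebuilt from the raw data every time the document
is knitted, so the prose and the numbers can never drift apart.

```{r impact-plot}
boxplot(impact ~ scenario, data = results)
```
````

Re-running the whole analysis with new assumptions or data is then literally one click of 'knit'.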

While this might still sound somewhat abstract, in the world of economic policy it’s not uncommon to be asked to repeat several iterations of politically sensitive analysis in a short-time frame.

Get it right and you can keep your job.

Make a mistake and you might just make history.

Summary()

So what would I say to someone thinking of making the arduous journey to complete the specialization?

Well, firstly although parts of the specialization were disappointing, it’s a great overall program and I’ve learned much of what I was hoping to. I understand R, have a better sense for when meaningful insights can be gained from data outside of economics and have a better feel for how analysis can be made interactive and accessible to a wider audience.

The course is also a bargain, costing less than five percent of the tuition of a comparable 6-month course at University.

XKCD

Of course, it also seems that the JHU Data Science specialization has been largely abandoned, while the world of online data science courses has become more competitive, with Harvard, Berkeley, Microsoft and the University of Michigan all providing their own data science specializations in both R and Python.

As such, while I'm glad I endured the 10-course JHU data science extravaganza, if I were doing it now I'd be inclined to go with one of the competitors.

This is partly because I'd bet on the competing options having learned from the strengths of the JHU course while dropping its weaknesses, but also because, unless JHU updates the specialization, its prestige and its power to signal a recipient's determination will diminish over time.

Of course, in the world of online learning it doesn't have to be all or nothing. Pick one, pick two, or decide to prioritize your social life by picking none of them; whatever you choose, it's a great time to conquer your fear of data.

I’m looking at you Darren.

 

TLDR Version:

Good course, glad I did it, but I would recommend checking out the alternatives from Harvard, Berkeley, Microsoft and the University of Michigan.

The Good:

  1. Getting and Cleaning Data – Learn how to get data into R and make it useful for analysis.
  2. Exploratory Data Analysis – Make graphs with different plotting systems and be given a brief and unsatisfying crash course on PCA.
  3. Reproducible Research – Learn what the 'create R Markdown document' option means in RStudio and the philosophy of reproducible research.
  4. Regression Models – Be gently introduced to linear regression through a series of intuitive lectures before being rushed through the more complicated logistic and Poisson regression in the final week.
  5. Practical Machine Learning – Learn the basics of machine learning models.
  6. Developing Data Products – Briefly learn about some of the coolest parts of R such as creating interactive dashboards for the web.

The Bad

  1. R Programming – Learn the essentials of R and lose sleep while writing functions needed to complete the assignments. Bonus: Watch a large proportion of the class drop out.
  2. Statistical Inference – Be quickly rushed through essential statistical concepts with insufficient explanation. Bonus: Watch a large proportion of the class drop out.
  3. Capstone Project – Be given minimal instructions about solving a problem which will likely be useful for 0.5 per cent of R users during their career.

The Neutral

  1. The Data Scientist's Toolbox – Install R and RStudio, and set up a GitHub account.