Getting started in data science: One journalist’s journey

Let me say up front what you will not get in this blog post: a step-by-step guide to using whatever data tool you want. Using spreadsheets is beyond the scope of a blog post. Using R (and, I’d guess, using Python) is beyond the scope of the classes I’ve taken that purport to teach me how to use R.

What you will get is one person’s take on how to dip your toes into an ocean. I hope you’ll be able to get some advice on how to go from the occasional spreadsheet user to Data Journalism Deity and perhaps get some idea of where to go next. If this is the first thing you read about data science or data journalism, fine — I’m assuming no prior knowledge.

First: Reconsider what you mean by “education.” Seriously.

Remember back in college when you knew a bunch of annoying dudes who had figured they just needed to make a lot of great contacts in school, and the classes themselves were secondary? While you busted your butt studying and working, they were shaking hands and drinking beer?

I’m not going to say they were right. But they were on the right track. Here’s why:

Working with data is less about learning how to do it and more about learning where to ask.

Don’t believe me? Ask Vik Paruchuri, who made the liberal arts-to-data leap himself and has this to say about it:

[Image: data-problem]

Check out his whole video. It’s 32 minutes, but you can skip the first couple of minutes because he took it from a Google Hangout and spent the first bit of it waiting around. (Expert on data science but doesn’t edit video in YouTube? Knowledge is specialized.)

He devoted a year or so to learning data science. But he also just jumped in. He started doing projects (because you learn by doing in this field) and going to meetups, all before he knew much code.

On the other hand, here’s how *I* did it:

I signed up at Coursera, an online-learning hub, for a nine-course series offered by Johns Hopkins University. I figured I would plow through the courses and get a spiffy certificate at the end, proving to myself and everyone else that I know my way around R (the en-vogue data programming language today) and everything else in the world of data.

Around the 13:30 mark of Paruchuri’s video, he says MOOCs (like Coursera’s content) are not the best way to learn. But by the time I watched his video, I had already gone past the “no refund” part of the Hopkins specialization. Oops.

That’s not to say I’ve wasted my time and money. Check that: I do think I’ve wasted quite a bit of time trying to pass quizzes that I really didn’t need to pass.

In retrospect, I wish I had known there are two ways to approach the Hopkins specialization:

  1. Make this your life, as if you were a full-time student, particularly if you don’t have a ton of prior programming experience or stats background. (The Pascal I learned in college and the JavaScript I learned 20 years ago weren’t enough. In stats, I’m comfortable talking about medians, means and even standard deviations, but I have little idea what a “linear regression” even means.) You’ll finish up with a certificate that might get you employed somewhere.
  2. Browse. Learn what you want. Attempt a few quizzes, but feel free to bail.

The seductive part of data science is that it seems so accessible. It seems like everyone’s doing it, from political bloggers breaking down government data to 14-year-old fantasy football wizards. But in reality, they’re just doing a small part of data science. When you start digging around and finding powerful data applications, you’ll find they’ve been developed by people with “PhD” in their LinkedIn profiles, not “BA in philosophy and music.”

Consider a music analogy. As Radiohead sang, anyone can play guitar. It might be a high school kid figuring out Rush songs (like me, many years ago) or your friend’s dad who suddenly whips out an old acoustic and plays Classical Gas. But how many people do you know who can sight-read just about anything on piano? Or teach band in an elementary school, helping kids learn every woodwind and brass instrument?

You don’t go to Berklee for four years to learn how to play Purple Haze or even to write your own guitar riffs. So why would you work your way through everything in the Hopkins data specialization to learn a few tools to use in journalism?

The funny thing here: The quizzes in the Hopkins sequence helped teach me that lesson and the importance of knowing where to look for the answers. Those quizzes — at least, once you get past the simple multiple-choice stuff in the intro class — are programming assignments. And the classes don’t teach you how to do them.

Kind of a weird way to approach teaching, isn’t it? And very frustrating if you, like me, don’t know what you’re getting into.

To pass the quizzes, you have to look around the Web for help. You may quickly find that the regulars at Stack Overflow, an impressive online forum for sharing programming tips, are getting sick of answering questions from people who are stuck on the Hopkins programming assignments. But you can often find a couple of things that help.

The course itself has an online forum that substitutes for the interaction you’d have with the teacher if you were taking this class in person. But the people there can only give you general tips, not answers. You click an honor-code pledge with every submission, just like we did at Athens Academy. (All together now: “I have neither given nor received any aid on this work, nor have I observed any infraction of our Honor System.” One kid made a rubber stamp with all those words to speed along his test-taking.)

The forum is manned by mentors who have survived the class already. And the general message is to get used to “hacking.” Get out on Stack Overflow and other sites, then figure it out. Because that’s what you’ll be doing in the real world.

“Sure,” you may say, “but what am I paying for?” You’re really paying for the lectures, a nifty set of online tutorials, and a basic intro to some of the tools you need, like RStudio (a bit like Notepad with a whole lot of tools to help with your code) and GitHub (a sharing site). And if you have hours upon hours (other students have reported spending months on quizzes with an estimated time of “30 minutes” or so), you may be able to plow your way through and get the specialization.

At some point — and I’m writing this so you’ll do it before you take the course rather than partway into it like I did — you have to stop and ask what you really want to accomplish. Even if you want a full-time data job, there are so many different ones. Data scientist? Data engineer? Data journalist?

You’re probably better off playing around with online data tools first, and then signing up for a course. That’s true whether you’re just looking to supplement your knowledge and skillset (like me) or going to become a Full-Time Data Science Person (like Paruchuri).

One example: Paruchuri says 90 percent of the work is data “cleaning” (if you’ve ever seen a spreadsheet in which some entries say “Miscellaneous” and some say “Misc,” you get the idea). You could use R for that. It’s powerful. Or you could use OpenRefine, a tool formerly known as Google Refine. Knowing a bit of programming logic may help with that, but it’s not as intense as learning complex operations in R.
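For the curious, here’s a rough sketch of what that kind of cleaning looks like in code. It’s in plain Python rather than R or OpenRefine, and everything in it is made up for illustration: the sample labels, the `CANONICAL` lookup and the `clean_label` helper are my own assumptions, not anything from Paruchuri’s video or the Hopkins courses.

```python
# Hypothetical spreadsheet column with inconsistent category labels.
raw_categories = ["Miscellaneous", "Misc", "misc.", " Travel ", "travel", "Food"]

# Hand-built lookup of known variants -> preferred label (an assumption for this example).
CANONICAL = {
    "misc": "Miscellaneous",
    "miscellaneous": "Miscellaneous",
}


def clean_label(label):
    """Trim whitespace, drop trailing periods, ignore case, then map known variants."""
    key = label.strip().rstrip(".").lower()
    # If the label isn't a known variant, fall back to a title-cased version of it.
    return CANONICAL.get(key, key.title())


cleaned = [clean_label(label) for label in raw_categories]
print(cleaned)
```

The idea is simply to boil every entry down to a normalized key and then map the keys you know about to one preferred spelling. Real data will have messier variants than this, which is roughly why the cleaning eats so much of the work.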

So now that I’ve spent four months learning what I can, I’ve managed to define my goals.

First, what do I want to do? 

  1. Find an efficient way to do Olympic medal projections. I’ve used spreadsheets and a few formulas to track past results and make projections, but it’s safe to say I spent far too much time gathering and processing data.
  2. Learn enough to try other projects on my own, such as a survey of North American curling clubs.
  3. Learn enough to tell a potential part-time or full-time employer that I might not be a full-fledged data scientist, but I know the tools and have a good sense of what’s feasible.

Now bear in mind everything else I want to do in the next 2-3 years:

  1. Continue writing epic soccer pieces and other content for The Guardian.
  2. Finish retooling parts of my unpublished MMA book into a series of posts at Bloody Elbow.
  3. Finish retooling the other parts of that book into a small self-published book.
  4. Write another book on youth soccer.
  5. Write a bit more for FourFourTwo and OZY.
  6. Maybe find a steady outlet for Olympic-sports content (which could include a lot of data work).
  7. Maybe start working for a nonprofit (maybe even with data).
  8. Maybe even start the definitive book (or multimedia project) on creativity.

I’m not including high priorities like “be a good parent” or even low but unavoidable priorities like “mow the danged lawn.”

So from a data perspective, here’s what I should be able to do:

  1. Understand what I’m looking at when I check Kaggle, which turns data-science sharing into fun things like a March Madness contest.
  2. Navigate GitHub.
  3. Use OpenRefine and any other good web tools I can find.
  4. Scrape data from reputable sources.
  5. Present the output in some coherent and engaging form.

I’ll pick my way through the rest of the Hopkins courses. I’ve also enrolled in a cost-friendly course at Udemy, which I started taking so I could figure out enough to pass the R programming course at Hopkins. (I passed two. The rest? You may consider me an auditor.)

And then I’ll just explore, like I did when I was figuring out Rush songs on my guitar. (Hmmm … can I process songs in R?)
