Agile Data Science: Review and Thoughts

Recently, I read the book Agile Data Science by Russell Jurney. The book covers data science and how the author applies an agile workflow and powerful tooling to accomplish tasks. While I found the book interesting, and would recommend it as a good introduction, I have some issues with the book that I’d like to discuss. I’d like to go over the book and the tools briefly, if only to save my thoughts for later.

A quick note: data science is actively being defined by the web community as the process of analyzing large data sets with statistics and other approaches. That definition is ongoing and changing all the time. Big Data is the term that the industry seems to be using for such large datasets. You’ll also see the terms machine learning, analytics, and recommender systems mentioned: these are all various sub-topics that I won’t cover in depth here.

The book centers on Hadoop, which it drives by writing and running Apache Pig scripts. Pig lets you write workflows in a high-level scripting language that may compose many Hadoop jobs into one system. With Pig, you need not worry about the specifics of what each Hadoop job is doing when you write a script.

Hadoop is patterned after Google’s MapReduce paper. Google had large clusters of computers and large data sets that it wanted to process on those clusters. What they came up with was a simple idea: Write a single program that would specify a map function to run across tuples of all the input data. Add a reduce function that compiles that output down into the expected format. MapReduce coordinates deploying the program to each worker machine, divvying up the input data across the different machines, gathering up the results, and handling things like restarts after failures. This was a huge success inside Google, and Hadoop implements that architecture with improvements.
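To make the map and reduce roles concrete, here is a minimal single-process sketch in Python of the classic word-count job. It is only an illustration under stated assumptions: the names map_words, reduce_counts, and run_mapreduce are hypothetical and are not part of Hadoop, Pig, or the book’s code, and the in-memory grouping stands in for the distribution, shuffling, and failure handling that a real cluster would provide.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: collapse all values emitted for one key into a total."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Illustrative input only; a real job would read from distributed storage.
word_counts = run_mapreduce(
    ["the quick brown fox", "the lazy dog", "The end"],
    map_words,
    reduce_counts,
)
print(word_counts)
```

A real job would run many copies of the map function on different machines, but the shape of the two functions stays about the same.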

It should be noted that this MapReduce architecture is essentially batch-processing for large amounts of data. The same system would have a hard time with continuous streams of data.

Hadoop is, unfortunately, my first stumbling block with learning to process big data.

Configuring and running Hadoop is not easy. I have far more experience as a developer than as a sysadmin (or, in today’s terms, a devops engineer). There is more than one “distribution” of Hadoop, and more than one versioning scheme among them. This means that understanding what’s available, how to configure it, and whether search results are relevant to you is quite hard for the inexperienced. Imagine the confusion of trying to install a Debian Linux distro and only being able to find instructions for Red Hat Linux; further, imagine not being able to tell what the problem was when it wouldn’t boot and printed a Debian-specific error.

It seems like Hadoop is designed to be run by someone whose full-time job is to configure and maintain that cluster. That person will need to have enough experience with all the different choices to have an opinion on them. For a developer wanting to run things locally before committing to configuring (and paying for!) a full cluster out on AWS, it was daunting.

Luckily for me, Charles Flynn has created a neat repo on GitHub at charlesflynn/agiledata. It builds a local development VM for the Agile Data Science book, with all the dependencies installed and the book’s code in the right place to run. With that VM, I was able to get up and running quickly, and I found it useful not to have to sink any more time into configuring Hadoop. I’d like to give another shout-out to Charles for this great resource and the work done to make sure it works.

The book has the reader work with email data: your Gmail inbox pulled locally for analyzing. I thought this was neat, in itself. Many data science tools use free datasets; as a result, working with those datasets may not be the most interesting problem space to you. But insights about your own communication and how others communicate with you are something you might find more interesting.

After explaining Hadoop, Pig, and a few other tools, the rest of the book follows a fairly lightweight “recipe” format. Each chapter explains the goal and how it fits in an “agile data science” workflow. Then, some code is presented, and then we see what kinds of results we can take from that step. Once this pattern is set up, the book moves fairly quickly through some rather interesting data wrangling. By the end, the reader has built several data analysis scripts and a simple web app put together with MongoDB, Python Flask, and D3.js graphs to display all the results.

At times, though, the quick recipe format seemed to explain too little. There was little explanation of how Pig script syntax worked or how to understand what was going on under the covers. What this book is not: an exhaustive guide to how to write Pig scripts, how to pick approaches to analyzing a dataset, or how to compose these systems in production in the wild. Also missing was any mention of performance tuning or what other algorithms might be considered.

That seems like an awful lot to be missing, but for this book those would have been diversions that bogged it down.

To the author’s credit, I finished the book, and finished it far faster than I expected I would. I came away having done almost all of the book’s examples (helped a great deal by the excellent virtual machine repo from Charles Flynn mentioned above). And I had a deeper understanding of and respect for tools that I’d never used before.

Final thoughts

When it comes down to it, I wouldn’t recommend Agile Data Science to read on its own. I’d recommend that you use it as a quick introductory book to build familiarity and confidence, so that you can dive into a deeper resource afterwards. I’d also recommend it if you’re a developer who isn’t going to be doing data science as your full-time job but is curious about the tools and practices.

What I’m doing next

Almost immediately after finishing this book, I attended an event at a nearby college to talk about Apache Storm. Our company blog covered the event if you’re curious.

Storm is a tool that came out of Twitter for processing streams of big data. If you think about it, Twitter has one of the biggest streaming data sets ever. They need to use that streaming data for everything from recommendations to analytics to top tweet/hashtag rankings.

After attending the event and having run a word-counting topology (Storm’s term for a workflow that may contain many data-processing jobs) out on a cluster, I began to see the potential of using Storm.
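For a rough feel of what such a topology does, the sketch below chains Python generators into a streaming word count. This is not Storm’s API: sentence_spout, split_bolt, and count_bolt are hypothetical names that borrow Storm’s spout and bolt vocabulary, and everything runs in one process. The point is only that counts are updated as each sentence arrives rather than after a whole batch has been read.

```python
from collections import Counter


def sentence_spout(sentences):
    """Stand-in for a spout: hands out one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(sentences):
    """Stand-in for a bolt: splits each incoming sentence into words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def count_bolt(words):
    """Stand-in for a bolt: keeps running totals, emits each updated count."""
    totals = Counter()
    for word in words:
        totals[word] += 1
        yield word, totals[word]


def word_count_topology(sentences):
    """Wire the pieces together: spout -> split -> count."""
    return count_bolt(split_bolt(sentence_spout(sentences)))


# Illustrative input only; a real stream would never end.
updates = list(word_count_topology(["storm counts words", "storm keeps counting"]))
for word, running_total in updates:
    print(word, running_total)
```

Because every stage is a generator, nothing waits for the input to finish, which is the property that separates this style from the batch jobs described earlier.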

Plus, Storm is far friendlier to local development on a laptop. You can run it with a simple command-line tool or even from inside your Java or Clojure code. Or, perhaps most simply, from inside the Clojure REPL.

The other plus here is that Storm is mostly written in Clojure and has a full Clojure API. Combined with a few other Clojure tools that I prefer, like Datomic, Ring, and C2, I can see a toolset similar to that used in Agile Data Science. This toolset has the benefit of using the same language for everything. And, Clojure is already well-suited for data manipulation and processing.

So I began to rewrite the examples in Agile Data Science in Clojure. I am hoping to make enough progress to begin posting some of the code with explanations in blog format. Stay tuned for that.


A Theory of Compound Intelligence Gain

Note that this is probably not enough to call a theory. It’s an idea, at most.

I’m currently reading the book Race Against the Machine, which describes how increasing levels of automation by technology are related to capital and labor. But this post isn’t about that book. It simply triggered me to think about my motivations for my current side projects, and how I might explain to others why exactly I think that my current side projects are so important.

While Race Against the Machine describes technological progress as a force that leaves behind skilled workers who no longer have relevant skills, my thinking is on intelligence augmentation, and how I can use my own knowledge and programming skills to build tools that increase my own effectiveness and ability to perform my job. Namely, how can I write software that improves my cognition and memory such that I am better at writing software, and gain other benefits from having increased cognition and memory?

Douglas Engelbart wrote extensively about augmenting intelligence, primarily with improving workflows and then with computer software. I’ve previously quoted him on this blog. I feel that part of that quote bears repeating here:

By “augmenting human intellect” we mean increasing the capability of a man to approach a complex problem situation, to gain comprehension to suit his particular needs, and to derive solutions to problems.

Of course, Engelbart was writing about this in 1962 – well before every home had a personal computer and everyone had a powerful supercomputer in their pocket. For a modern overview of Engelbart’s framework, see The Design of Artifacts for Augmenting Intellect.

My earliest encounters with concepts of intelligence augmentation most likely come from science fiction. One character that has inspired a lot of my work (and that I’ve probably told you a lot about if we’ve discussed this project in person) is Manfred Macx from Charles Stross’s Accelerando. Macx is described in the early parts of the book as having a wearable computer that acts as his exocortex. The idea of an exocortex is that some part of his memory, thinking, and information processing lives outside of his head and on the wearable computer. Similarly, the exocortex can help act as a gate to his attention, which is one of our limited resources.

If you think about it, just as we are all cyborgs now by virtue of the technology we use every day, we are also all on our way to having exocortexes. Many of us use Gmail filters to protect our attention spans from email we receive but don’t always need to read. Or we use Google search to add on to our existing memory, perhaps to remember some long-forgotten fact that we only have an inkling of.

I’ve had Manfred Macx’s exocortex (and other flavors of science fiction’s wearable computers and augmented intelligences) kicking around in my head for years. Gmail tells me that I was trying to plan the architecture for such a thing as far back as 2006. It’s taken a lot of thinking and further learning in my career to even get to the point where I felt ready to tackle such a project.

What I am setting out to build is an exocortex of my own design, under my own control. Not something that is handed to me by Google in bits and pieces. And to do so, it turns out, requires a lot of research and learning. There’s tons of research on the topics of proactive autonomous agents, text classification, and wearable computing that I have been reading up on. Just to build the first phase of my project, I have been learning all of the following:

  • core.logic (which is based on Prolog, so I’m learning some Prolog now, too)
  • core.async (Clojure’s implementation of C.A.R. Hoare’s Communicating Sequential Processes, which is also how Go’s goroutines work)
  • Cascalog and Hadoop, to do my distributed computing tasks
  • Datomic & Datalog (a subset of Prolog for querying Datomic), to store knowledge in a historical fashion that makes sense for a persistent, lifelong knowledge system
  • Topic clustering, text classification, and other natural language processing approaches
  • Data mining, and in particular, streaming data mining of large datasets on Hadoop clusters, by reading the Stanford textbook Mining of Massive Datasets
  • Generally learning Clojure and ClojureScript better
  • and probably more that I am forgetting to mention

Of course, if I look at that list, I can be fairly certain that this project is already paying off. These are all things that I had very little experience with before, and very little reason to dig into so deeply. Not represented here are the 40 or so academic papers that I identified as important, and seriously set out to read and take notes on. Again, I am probably learning these topics more deeply than I otherwise would have.

Which brings me to this theory, the idea of this post: That by even beginning to work on this problem, I’m seeing some gains, and that any tools I can build that give me further gains will only compound the impact and effectiveness. Improving cognition and learning compounds to allow further gains in cognition and learning.

There’s some idea in the artificial intelligence community that we don’t need the first general artificial intelligence to be built as a super-intelligence; we need only build an artificial intelligence that is capable of improving itself (or a new generation of artificial intelligence). As each generation improves, such intelligences could become unfathomably intelligent. But all it takes is that first seed AI that can improve the next.

So for improving our own human intelligences, we may not need to build a single device up-front that makes us massively intelligent. We only need take measures to improve our current knowledge and cognition, to build tools that will help us improve further, and continue down this path. It will definitely not be the exponential gains predicted for AI, and may not be even linear – that is, the gains in cognition from building further tools and learning more may plateau. But there will be improvements.

For that reason, I’m not setting out to build Manfred Macx’s exocortex from the beginning. Instead, I have been building what I describe as an “Instapaper clone for doing research”: a tool that, if it improves my existing ability to research and learn new topics, could pay off in helping me to build the next phase of my projects.

Of course, at the same time, I have an eye towards using the foundation of this tool as the datastore and relevance-finding tool for the overall project. Such a tool can automatically go and find related content – either things I have read, or simply crawl related content on the web. Eventually, this tool will also ingest all of the information I interact with on a daily basis: every website I browse, every email I receive, every book that I read. A searchable, tagged, annotatable reference with full metadata for each document as an external long-term memory. But this is all a topic for another post.

This, in concert with what current research tells us is effective (improved nutrition and supplementation, exercise, meditation, and N-back training), may just be my ticket to higher levels of human intelligence. But for now, I just want the early-adopter edge. I want to see how far I can push myself on my own skills. Some large corporation may be able to field hundreds of developers to create a consumer product for the public that benefits everyone in similar ways, but I might be able to do this for myself years ahead of that. And wouldn’t that be cool?

And this is where I call it a theory: it could very well be that there’s no such thing as compounding interest on intelligence. Only time and my own experiences with this project will tell me.

If you’ve made it this far and you’re interested in this kind of stuff (intelligence augmentation, wearable computing, autonomous proactive agents, and so on), get in touch. There doesn’t seem to be much of an online community around these topics, and I’d like to start creating one for discussion and for organizing open source projects.


An (Unscientific) Study in Behavior Change With Software

Forming habits is hard. There’s been tons of research on what practices help form new habits successfully. And there has been research on what software can do to help form new habits. It’s not enough to simply send daily reminders or keep track of the goals in a visible place. For software to help us form new habits successfully, we must look to the current research for clues as to how habits are formed.

Over the past year or so, I’ve been trying to adopt the habit of taking Vitamin D every morning. I’ve been largely successful, which I think is partly due to the software I used. I use Lift on my iPhone, which sends me emails every morning as a reminder. The app itself has check-ins for each habit, progress charts, and social features. Most mornings, I wake up, swipe away the reminder email, and take my morning antihistamine and a Vitamin D. [1] Like I said, I’ve been mostly successful, and at this point, I’ve taken Vitamin D for 442 days in a row. [2] Granted, taking a vitamin every morning is only a small change, but it is one that I wanted to accomplish and did. Small successes add up to bigger successes, and this gives me confidence that if I set out to make a bigger change in my life, I have a toolset that will help me to accomplish that goal.

So what does the research say helps us form successful habits? The Fogg Method [3] is one of the more well-known systems, and suggests that a way to be successful is to:

  1. Select the right target behavior.
  2. Make the target behavior easy to do.
  3. Ensure a trigger will prompt the behavior.

So what does each of these steps tell us?

The Right Target Behavior

It’s hard to be successful in picking up a habit that you don’t already want to accomplish. Some things you may already want to do include things like learning a language, eating a specific diet, or flossing your teeth. It goes without saying that things you’d rather not do are going to be harder to implement.

But there’s another factor in play here that I think determines the right target behavior: simplicity. That is, is the habit a simple task to accomplish, or is it something complex and unmanageable? Can you perform one simple task per day and call it “done”, or is it more complicated as to whether it is “done” or not each day? The simple “done” state seems really important, and so it is good to focus on using this technique for binary actions: either you did them today, or you didn’t. Things that must be done with complicated schedules, every other day, or once a week, will be much harder to establish as habits.

Easy to do

One reason we want a simple target behavior is so that it is easy for us to add to our schedule. You may have a goal of exercising more. But “exercising more” doesn’t have a binary action associated with it; for example: what is “more”? Instead, you might say, “I want to exercise 45 minutes per day.” And that would be a much better goal. But if exercising means you have to drive to the gym, and the gym is out of your way each day, it might be very unlikely that you will do it. This is not a simple target behavior.

If you do have some goal that may not be simple to implement at first — say, the example of having to drive out of your way to the gym — instead try to find a simpler version of the habit that you can adopt first. You may decide instead to just do some bodyweight exercises before you leave for work each morning. Decide on the exercises and write them down. Either you did them or you didn’t. Later on, you can modify this existing habit to be more exercise, but for now, focus on what you can reasonably adopt as a simple habit.

The other concern in implementing an “easy” habit is how much time the new habit will take. In the above example, the initial goal was something like 45 minutes per day. Eventually, you could probably find when you exercise best and are least likely to schedule appointments (say, early morning or late at night), and actually implement that goal. But early on, it’s going to be hard to change your schedule for your new habit. I ran into this frequently while trying to find time after 5PM but before dinner to practice guitar. It didn’t help that after-work and dinnertime are frequently scheduled as social events, and that I have a habit of staying at the office past 5; all these added up to very little success in trying to spend 45 minutes to an hour practicing guitar at home after work.


Triggers

The last step is quite important. While you might think of triggers as things like alerts on your phone or daily emails from a service like Lift, I didn’t find those kinds of prompts very effective in helping me adopt a habit.

To be more likely to perform some task on any given day, look at the habits you already have. I’ve been taking an antihistamine every morning since I was about 12; this has been a constant in my life and part of my routine for a very long time. Since I already have this daily habit, I added taking Vitamin D every morning to that habit. Other habits with no daily routine to hinge off of, like practicing guitar, were much harder to make stick.

Flossing is an easy addition to brushing your teeth every night. For me, it just took making it simpler (finding a brand of flossers I liked rather than wrangling loose floss) and doing it enough times before it stuck, too.

What didn’t work for me?

As noted above, despite a couple of attempts to really make daily guitar practice stick, I’ve never been able to tackle that habit. There were no good triggers that I could attach the practice to, and I frequently didn’t have time for what I was trying to accomplish. If I were to go back to trying to focus on guitar, I’d probably start with much less time commitment, and schedule it some time when I’m very likely to be home and have 5-10 minutes, like early in the morning before work. Whether or not guitar practice is effective with my first cup of coffee would have to be tested, of course.

What I’ve found is that I’m partially motivated by progress bars and graphs, though, and so I will make time in my day for easy-to-accomplish things. So when I can, I will try to squeeze in some mundane activity I’m tracking in Lift, like washing the dishes. [4]

The social component of Lift, on the other hand, doesn’t really help me any. For others, it might be a good motivator. In cycling, I have several local friends, including one local cyclist who is quite prolific and who frequently rides 10x as much as I do in a given week. We all use Strava to track our cycling, and the social component alerts me to new rides that the prolific cyclist has done. Seeing that cyclist’s rides helps remind me to get out and enjoy more cycling, as well as sets up a nice carrot-on-a-stick for me to ride more to “catch up.” In that case, the social features definitely help me to perform an action more, but I wouldn’t really call cycling a habit as much as my transportation and leisure-time hobby that I can do whenever I have time.

Activities with no simple binary action and no triggers, such as creative acts, are especially hard to form into habits. I have tracked writing blog posts in Lift for some time, but since I only write blog posts when the mood strikes me, it is hardly a daily goal, and it would be difficult for me to implement the above steps to form an actual habit of blogging on a daily basis.

Final thoughts

Notice that most of the guidelines above have very little to do with software? Software itself can’t convince you to go to the gym or make you more likely to floss. But it can provide some prompts and some encouragement, and that might be enough to get you over the hurdle of adopting a new habit.

As with anything, you are an individual and your mileage may vary. Experiment, use an app like Lift or something else you prefer to track your progress, and see where it takes you.

There are more resources out there to help understand forming new habits, self-control, and behavior change, but I feel like this is the baseline one needs to know to be more successful in implementing behavior change. Some references of note that I have been consuming:

  1. Designing for Behavior Change. Stephen Wendel. 2013.
  2. The Sugary Secret of Self-Control. New York Times. September 2, 2011.
  3. The Healthy Programmer. Joe Kutner. 2013.

If you’re interested in some research, the above book (Designing for Behavior Change) is a good reference, as well as these papers:

  1. ReflectOns : mental prostheses for self-reflection (hardware and software solutions)
  2. Behavior Wizard: A Method for Matching Target Behaviors with Solutions

[1] Yes, I’m aware of the fact that taking vitamins with an antihistamine decreases the effectiveness of the antihistamine. I’ll cover why I take them at the same time in this post.

[2] You can view my progress on Lift on my public profile. Notice there are quite a few habits I’ve tried to form with Lift in the past that didn’t quite work.

[3] As described by BJ Fogg in the preface to Designing for Behavior Change.

[4] We don’t have a dishwasher in our current apartment, and I both dislike dirty dishes and dislike washing dishes by hand.


In the Year 2100

Recently I was asked by a coworker to write up some ideas for where our company would be in the future. Not just next year, or in 5 years, but where we saw the company in the year 2100. For my other coworkers, the year 2100 probably represents far enough out to ensure they stop thinking in the constraints of right now. But for someone who has read as much science fiction as I have, I saw a wide gulf of time.

So I hit up Wolfram Alpha, asking myself what technology might look like in the year 2100.

If Moore’s Law holds, then processors could contain 3.5 x 10^22 transistors, roughly 350 times more transistors than the number of grains of sand on the planet. (Thanks, Wolfram Alpha!) This also represents roughly 2.9 x 10^18 MIPS per chip, which is roughly a factor of ten more processing power in each chip than all 7 billion human brains on the planet, combined. That kind of computing power is almost unimaginable to me now.

While I don’t have a religious belief in the Singularity, I do think that we can’t really predict what it’ll mean to have so much computing power available to us. Or what technology, society, or people will look like by then. Of course, someone’s gotta write the software to make that hardware useful…

Note: It could be that I messed up these numbers a bit; I went off the current transistor count and MIPS for the Intel i7-4770k processor, since I recently started putting together a server with one. And the numbers are extrapolated out quite a bit. If you’ve got corrections to these numbers, hit me up on Twitter to let me know!
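For anyone who wants to check the arithmetic, here is a small Python sketch of the kind of extrapolation involved. The inputs are assumptions rather than the exact values used above: roughly 1.4 billion transistors for the i7-4770K, a starting year of 2013, and one doubling every two years. The function name extrapolate_transistors is made up for this example. With those inputs the result comes out at roughly 1.7 x 10^22, the same order of magnitude as the transistor figure quoted earlier; a different baseline or doubling period shifts the answer considerably.

```python
def extrapolate_transistors(base_count, base_year, target_year, doubling_years=2.0):
    """Project a transistor count forward, assuming steady doubling."""
    doublings = (target_year - base_year) / doubling_years
    return base_count * 2 ** doublings


# Assumed inputs, not taken from the post: approximate i7-4770K
# transistor count and the year that chip was released.
BASE_TRANSISTORS = 1.4e9
BASE_YEAR = 2013

estimate_2100 = extrapolate_transistors(BASE_TRANSISTORS, BASE_YEAR, 2100)
print(f"Estimated transistors per chip in 2100: {estimate_2100:.1e}")
```

Changing doubling_years to 1.5 (the other commonly quoted version of Moore’s Law) raises the estimate by several orders of magnitude, which shows how sensitive an 87-year projection is to that single parameter.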


Indent and Colorize HTML Strings in Pry

(This post is part of my blog archiving project. This post appeared on Coderwall on November 14, 2013.)

Note: I have converted the inline code to Gists for better readability.

An issue I run into frequently while testing with tools like Capybara by dropping into Pry is that the last response for a page is a single string containing the HTML that was rendered. Those strings have lost their indentation, which makes it really hard to see the content of the page, or whatever you care about.

For example, a simple login page might look like:

Wouldn’t it be great if Pry could re-indent and colorize that string of HTML for you? Well, I put together a quick little Pry command that does. Throw this into your ~/.pryrc:

Originally, I had tried to use the html5 fork of the tidy command, but that tool changes the HTML as it parses it and spits out a bunch of warnings. So instead, I have this Pry command use Nokogiri when it is available. The command should warn you if you try to use it without Nokogiri available. What is output should be very close to the original rendered HTML, just cleaned up and re-indented.

So what does it look like in action?

(Imagine that Pry has colorized this output, too, through the excellent CodeRay tool.)

I’d love to hear from you if you find this useful! Or even if you don’t find it useful, but have some suggestions to improve it. Thanks!
