Clojure Data Science: Ingesting Your Gmail Inbox


This is Part 1 of a series of blog posts inspired by the exercises from Agile Data Science, redone in Clojure. You may be interested in my review of the book.


For this blog post series, we are going to use your Gmail inbox as a dataset for an exploration of data science practices. Namely, we will use your email for machine learning and natural language processing applications. Email makes interesting data to process:

  • it has lots of metadata that we can use as features [1]
  • we can model the relationships of senders and receivers as a graph
  • each message has a body of text associated with it that we can analyze
  • gaining insights from our personal communication is far more interesting than using an open data set!

Note: This is not an intro-to-Clojure blog post. If you need a tutorial that starts with the basics, I recommend the Clojure from the ground up blog post series by Aphyr. It does an excellent job at introducing concepts in Clojure.

In this post, I follow my typical Clojure workflow: I open a REPL and begin exploring the problem space. I look at individual pieces of data and start transforming them. When I write some functionality that I like for one piece of data, I try to extract it into the source code as a function that can work for any data our project may see. In this way, we can build up the project to contain the functions that are necessary to get to our goal.

So what is our goal for this blog post? Well, we want to fetch all emails from our Gmail inbox. We want to get metadata for each email, including things like who sent it and when it was sent. Then, we want to save the messages into a database so we can do further processing in later posts.

Starting off, make a new basic Clojure project with lein. I’ve named my project autodjinn after AUTODIN, one of the first email networks. You can refer to the repo and clone it to follow along. At the beginning of each subsequent post, I’ll provide a SHA that you can reset the code to. Feel free to name your project whatever you want; just be sure to pay attention to the changes in filenames and namespaces as we go along!

Create the project and enter it:
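If you’re starting from scratch, the standard lein commands look something like this (the project name is up to you):

```
$ lein new autodjinn
$ cd autodjinn
```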

To import our Gmail data, we will use a Clojure library called clojure-mail. Clojure-mail is still under active development and is likely to change. For this blog post, we’ll be using version 0.1.6 to ensure compatibility between the code in this post and the library.

Edit project.clj to contain your information and add the [clojure-mail "0.1.6"] dependency:
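Here’s a sketch of what the relevant parts of project.clj might look like; your metadata and the Clojure version that lein generated for you will differ:

```clojure
(defproject autodjinn "0.1.0-SNAPSHOT"
  :description "Ingesting and exploring my Gmail inbox"
  :url "https://github.com/yourname/autodjinn"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [clojure-mail "0.1.6"]])
```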

We’ll start by working in src/autodjinn/core.clj and later move the functionality out into a script for our email import task. Open up the file in your favorite editor and launch a REPL.

In your REPL, (use 'autodjinn.core) and verify it worked by running (foo "MYNAME"). You should see “MYNAME Hello, World!” printed out. Feel free to remove the (defn foo…) in core.clj now. We will not need it.

You may want to use something like Emacs’ cider or LightTable’s InstaREPL as your REPL environment. But you can use the regular Clojure REPL to build this project, as well. If you are not working with a REPL integrated with your editor, you will need to run (use 'autodjinn.core :reload) to force a reload of the code each time you save.

Connecting to Gmail

Our first goal is to connect to our inbox and verify that we can read email from it. To do that, we’re going to need to use our Gmail address and password — which we don’t want to put into our source files. It’s bad practice to put a password or a private key into a source file or check it into our repo! Just don’t do it!

Instead, we will use a nice library called nomad to load a config file containing our email address and password. We will add the config file to .gitignore so that it is never saved into our code.

Add the line [jarohen/nomad "0.6.3"] to your project.clj dependencies before moving on, and run lein deps in a console to pull in the dependency.

Back in our core.clj add the require statements for clojure-mail and nomad to your ns macro like this:
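Something along these lines should work; I’m assuming clojure-mail.core is the library’s main namespace here:

```clojure
(ns autodjinn.core
  (:require [clojure-mail.core :refer :all]
            [nomad :refer [defconfig]]
            [clojure.java.io :as io]))
```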

Then create a new file in resources/config/autodjinn-config.edn. It should look like this, with your email address and password filled in:
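A minimal config file looks like this:

```clojure
{:gmail-username "you@gmail.com"
 :gmail-password "your-gmail-password"}
```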

Now open up your .gitignore file and add the following line to it:
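```
resources/config/autodjinn-config.edn
```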

Following nomad’s README, we need to load our config file and pull out our gmail-username and gmail-password keys. We add the following to core.clj after the ns macro:
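With nomad’s defconfig macro, that looks something like this:

```clojure
(defconfig autodjinn-config (io/resource "config/autodjinn-config.edn"))

(def gmail-username (get (autodjinn-config) :gmail-username))
(def gmail-password (get (autodjinn-config) :gmail-password))
```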

The get function used here is a safe lookup for maps: it returns nil if nothing is found for the key. Back in our REPL, we can see this in action with some quick experimentation:
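For example:

```clojure
(get {:a 1 :b 2} :a)     ;; => 1
(get {:a 1 :b 2} :bogus) ;; => nil
```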

We could also use the shorter (:keyname mymap) syntax here, since keywords are invocable functions that look themselves up in a map. But the get function reads better than (:gmail-username (autodjinn-config)) in my opinion.

In your REPL, you should now be able to get the values for gmail-username and gmail-password:
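Something like the following, with your own values in place of mine:

```clojure
autodjinn.core/gmail-username
;; => "you@gmail.com"

autodjinn.core/gmail-password
;; => "your-gmail-password"
```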

Note that since I’m in the user namespace here, I had to qualify the vars with their autodjinn.core namespace. If this is confusing, you might want to read up on namespaces in Clojure before moving on. (See also: the ‘Namespaces’ section in Clojure from the ground up: logistics.)

clojure-mail requires us to open a connection to Gmail with the gen-store function (src). We then pass that connection around to various functions to interact with our inbox. Define a var called my-store in your core.clj that does this with our email address and password:
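Assuming gen-store takes the address and password as its two arguments, the def is a one-liner:

```clojure
(def my-store (gen-store gmail-username gmail-password))
```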

Make sure the (def my-store… above has been run in your REPL and then take a look at our open connection:
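In the REPL, that looks roughly like this (the printed form will vary):

```clojure
autodjinn.core/my-store
;; => #<IMAPSSLStore imaps://you%40gmail.com@imap.gmail.com>

(type autodjinn.core/my-store)
;; => com.sun.mail.imap.IMAPSSLStore
```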

The type of my-store should be an IMAPSSLStore as above. If it didn’t work, you’ll see a string error message when you try to define my-store.

Your inbox as a list

Now we’ll use our REPL to build up a function that will eventually import all of our email. To start, we can use the inbox function (src) from clojure-mail to get a seq of messages in our inbox. Note that since it is a seq and inboxes can be very large, we limit it with the take function.
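Assuming inbox accepts the store we just created, that looks something like:

```clojure
(take 5 (inbox my-store))
```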

If everything is working, you should see a list of the IMAPMessages returned by the last line in your REPL.

What if, instead, we wanted to loop over many messages and print out their subjects? We can pull in the message namespace (src) from clojure-mail, which gives us convenience functions for getting at message data.
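In the REPL (or in the ns macro), pull it in with an alias; I’ll assume the alias message from here on:

```clojure
(require '[clojure-mail.message :as message])
```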

You’ll have to be careful running this next line — on a large inbox it’ll print out the subject of everything in your inbox! If you have a lot of messages, consider wrapping the call to inbox in a take as above.
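Assuming the message namespace exposes a subject function, the loop is a short doseq:

```clojure
(doseq [msg (inbox my-store)]
  (println (message/subject msg)))
```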

Those are the subject lines of the 4 messages in the inbox of my test account, so I know that this is working. Save our doseq line into a function called ingest-inbox; we’ll come back to it later:
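A first version, living in core.clj (remember to add [clojure-mail.message :as message] to the ns :require form):

```clojure
(defn ingest-inbox []
  (doseq [msg (inbox my-store)]
    (println (message/subject msg))))
```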

Examining messages

Before we move on, let’s take a look at an individual message and what we can get out of it from the message namespace.
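Grab the first message and poke at it in the REPL; the output below is illustrative:

```clojure
(def my-msg (first (inbox my-store)))

(message/id my-msg)
;; => "<CAJiAYR90LbbN6k8tVXuhQc8f6bZoK647ycdc7mxF5mVEaoLKHw@mail.gmail.com>"

(message/message-body my-msg)
;; => a list of maps, one per content type (more on this below)
```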

From this, we can see a few things:

  • The ID returned by message/id looks like a good candidate for a unique ID for each message when we store the messages. But we might want to strip off those angle brackets first.
  • The message/message-body function doesn’t return a string of the body. Instead, it returns a list of maps which contains the text/plain form of the body and the text/html form. We will have to extract each from the map so that we can use the plaintext version for things like language processing. We’ll also keep the HTML version in case we need it later.
  • If you started digging into the message namespace’s source, you may have noticed that we don’t have functions for getting the date sent or date received for a message. Nor can we get a list of addresses CCed or BCCed on the message. We’ll have to write those functions ourselves.

Cleaning up the IDs

Let’s focus on writing a function to clean up the ID returned by the message/id function. Recall that such IDs look like <CAJiAYR90LbbN6k8tVXuhQc8f6bZoK647ycdc7mxF5mVEaoLKHw@mail.gmail.com>

The clojure.string namespace provides a replace function which does simple replacement on a string. We can use it like this:
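For example, replacing each bracket individually:

```clojure
(clojure.string/replace (message/id my-msg) "<" "")
;; => "CAJiAYR90LbbN6k8tVXuhQc8f6bZoK647ycdc7mxF5mVEaoLKHw@mail.gmail.com>"

(clojure.string/replace (message/id my-msg) ">" "")
;; => "<CAJiAYR90LbbN6k8tVXuhQc8f6bZoK647ycdc7mxF5mVEaoLKHw@mail.gmail.com"
```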

That works for replacing a single angle bracket. But remember that data structures are immutable in Clojure, including strings: replacing the first angle bracket didn’t change the original string, so the second call still operated on the original, bracketed ID. We need something that allows us to build up an intermediate value and pass it to the next function. For that, we will use the thread-first macro: ->. It is easiest to show the macro in use, with comments showing what the intermediate values would be at each step:
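```clojure
(-> "<CAJiAYR90LbbN6k8tVXuhQc8f6bZoK647ycdc7mxF5mVEaoLKHw@mail.gmail.com>"
    (clojure.string/replace "<" "")  ;; => "CAJiAYR90...@mail.gmail.com>"
    (clojure.string/replace ">" "")) ;; => "CAJiAYR90...@mail.gmail.com"
```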

It is called the thread-first macro because it threads each result through as the first argument to the next function. In this case, clojure.string/replace’s first argument is the string to replace on, so each successive return value gets passed along to the next replace call.

Now that we’ve figured out how to clean up that ID, we will create a function to clean up any ID we pass it:
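I’ll call it clean-message-id (the name is my choice):

```clojure
(defn clean-message-id
  "Strip the angle brackets from a raw message ID."
  [message-id]
  (-> message-id
      (clojure.string/replace "<" "")
      (clojure.string/replace ">" "")))
```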

Extracting the message bodies

Recall the message/message-body call above:
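The returned value is shaped roughly like this; the :content-type key is what the rest of this section relies on, and I’m assuming the body text lives under a :body key:

```clojure
(message/message-body my-msg)
;; => ({:content-type "TEXT/PLAIN; charset=utf-8"
;;      :body "Hi there ..."}
;;     {:content-type "TEXT/HTML; charset=utf-8"
;;      :body "<div dir=\"ltr\">Hi there ...</div>"})
```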

Ideally, we want to write a function that can get the text/plain body out of this value, and another function that can get the text/html body out. Notice that the :content-type values aren’t so simple that we can just select the item in the list whose content type equals text/plain. We will need our function to ignore the additional information in the :content-type value, which includes things like string encodings.

Let’s look at just the first map in the list returned by message/message-body:
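```clojure
(first (message/message-body my-msg))
;; => {:content-type "TEXT/PLAIN; charset=utf-8" :body "Hi there ..."}
```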

If we build a predicate function that can detect when the :content-type key is the type we want, we can use it in a filter function to choose the correct type of body in our functions.

Notice that TEXT/PLAIN and TEXT/HTML are always separated from the rest of the content-type by a semicolon, and that the type always appears first. You’d have to look at a few messages from your own inbox to confirm this, but I’ve already done that legwork and can assure you it holds.

Then, an easy way to get at the part of the content-type we want would be to split on the semicolon and take the first element returned:
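```clojure
(first (clojure.string/split "TEXT/PLAIN; charset=utf-8" #";"))
;; => "TEXT/PLAIN"
```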

This leads us to a function to first clean up the content-type string, and then our predicate function to detect if it is the one we want:
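Here’s one way to write those, with names of my own choosing:

```clojure
(defn clean-content-type
  "Keep only the part of the content-type before the first semicolon."
  [content-type]
  (first (clojure.string/split content-type #";")))

(defn is-content-type?
  "True when a body map's cleaned content-type matches the one we want."
  [content-type body-map]
  (= content-type (clean-content-type (:content-type body-map))))
```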

To finish off our work on the message bodies, we want to filter the list returned by message/message-body:
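Using the predicate with filter in the REPL:

```clojure
(filter (partial is-content-type? "TEXT/PLAIN")
        (message/message-body my-msg))
;; => ({:content-type "TEXT/PLAIN; charset=utf-8" :body "Hi there ..."})
```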

And turn it into a function that works for any message bodies list:
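Again, the names here are my own:

```clojure
(defn get-body-of-type
  "Pull the body string of the given content-type out of a message-body list."
  [bodies content-type]
  (:body (first (filter (partial is-content-type? content-type) bodies))))

(defn get-plaintext-body [bodies]
  (get-body-of-type bodies "TEXT/PLAIN"))

(defn get-html-body [bodies]
  (get-body-of-type bodies "TEXT/HTML"))
```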

Note that we’ve also used this function to create two convenience functions, one for extracting plaintext bodies and one for extracting HTML bodies. By keeping functions simple and small, we can build up useful functions for our project rather than try to plan it all out ahead of time.

Getting more information out of the IMAPMessages

As noted above, we will need to write a few more functions to get the fields of the IMAPMessages that we cannot get through this version of clojure-mail. Recall that we want to get CC list, BCC list, date sent, and date received values. To do that, we will use Java interop functionality. It’s really not as bad as it sounds. Remember that the IMAPMessages we see are Java instances of the IMAPMessage class. Calling a method on an instance is accomplished by using a dot before the method name, with the method in the function position, such as: (.javaMethod some-java-instance)

To start, we can look at clojure-mail’s project.clj and see that it depends on javax.mail. The next step is to find the documentation for the Java implementation of javax.mail.Message, which lives here.

In the REPL, we can try some of the Java interop on our my-msg:
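The dates, for example (your timestamps will differ):

```clojure
(.getSentDate my-msg)
;; => #inst "2014-04-02T13:20:55.000-00:00"

(.getReceivedDate my-msg)
;; => #inst "2014-04-02T13:20:57.000-00:00"
```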

The datetimes for each message are automatically turned into Clojure instants for us, which is convenient. If we dig into how the clojure-mail.message/to function [src] works, we see that it is using the .getRecipients method. .getRecipients takes the message and a constant of a RecipientType. For our purposes, we want the javax.mail.Message$RecipientType/CC and javax.mail.Message$RecipientType/BCC recipients:
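In the REPL, that looks something like this (the addresses are made up):

```clojure
(.getRecipients my-msg javax.mail.Message$RecipientType/CC)
;; => an array of address objects, or nil when the message has no CC list

(map str (.getRecipients my-msg javax.mail.Message$RecipientType/CC))
;; => ("friend@example.com" "coworker@example.com")
```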

The last line maps the str function across each element returned, so that we get the string representation of the email addresses. That way, our database can just store the strings.

As before, now that we know how to use these methods in the REPL, we write functions in core.clj to take advantage of our newfound knowledge:
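The function names below are my own:

```clojure
(defn cc-list
  "The CCed addresses of a message, as strings."
  [msg]
  (map str (.getRecipients msg javax.mail.Message$RecipientType/CC)))

(defn bcc-list
  "The BCCed addresses of a message, as strings."
  [msg]
  (map str (.getRecipients msg javax.mail.Message$RecipientType/BCC)))

(defn date-sent [msg]
  (.getSentDate msg))

(defn date-received [msg]
  (.getReceivedDate msg))
```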

In the REPL, it should now be possible to get a nice map representation of all the fields on the message we care about:
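A sketch of such a function; I’m assuming clojure-mail’s message namespace also has from and to helpers, and the keyword names are my own:

```clojure
(defn message->map
  "Gather everything we care about from an IMAPMessage into a plain map."
  [msg]
  {:uid           (clean-message-id (message/id msg))
   :subject       (message/subject msg)
   :from          (str (message/from msg))
   :to            (str (message/to msg))
   :cc            (cc-list msg)
   :bcc           (bcc-list msg)
   :date-sent     (date-sent msg)
   :date-received (date-received msg)
   :body-plain    (get-plaintext-body (message/message-body msg))
   :body-html     (get-html-body (message/message-body msg))})

;; in the REPL:
(autodjinn.core/message->map my-msg)
```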

Congrats on making it this far. We’ve used quite a few neat little features of Clojure and the libraries we’re building this project with to get here.

The last step we’ll go through in this post is to get these messages into a database.

Enter Datomic, the immutable datastore

Datomic is a great database layer built on Clojure that gives us a database value representing immutable data. New transactions on the database create new database values. It fits very well with Clojure’s own concept of state and identity because it was designed by the same folks as Clojure. Plus, Datomic is meant to grow and scale in modern environments like AWS, with many backend datastore options to run it on.

There are some important reasons why you might choose Datomic as your database for a data science / machine learning application:

  • There are various storage backends, so you can grow from tens of thousands of rows in PostgreSQL on a developer’s laptop to millions of records (or more) in Riak or DynamoDB on AWS. That is, it has a good migration path from small datasets to big data through the Datomic import/export process
  • The concept of time associated with each value in Datomic means that we can query for historical data to compare against
  • Datomic has a lightweight schema compared to a relational database like PostgreSQL. Schemas are just data! When we begin computing new values from our dataset, we can add new types of entities easily at the same time.
  • Datomic’s schemas allow us to treat it as a key-value store, relational database, or even build a graph store on top of it, if we need to

Note: I won’t go through setting up an entire Datomic installation here. It’s worth reading up on the docs and the rationale behind Datomic’s design.

You can get the Datomic free build if you like, but you will be limited to in-memory stores. It is unlikely that your Gmail inbox will fit into memory on your dev machine. Instead, I recommend signing up for the free Datomic Pro Starter Edition. (The free Starter Edition is fine because you will not be using this project in a commercial capacity.) Once you have Datomic Pro downloaded and installed in your local Maven, I recommend using the PostgreSQL storage adapter locally with memcached. Follow the guides for configuring storage on the Datomic Storage page.

Add the correct line to your project.clj dependencies for the version of Datomic you’ll be using (mine was [com.datomic/datomic-pro "0.9.4384"] which might be a bit out of date and likely won’t match yours.) Now we can start using Datomic in our core.clj and our REPL.

The first thing we need is the URI where the Datomic database lives. When we start up the Datomic transactor, you will see a DB URI that looks something like datomic:sql://DBNAMEHERE?jdbc:postgresql://localhost:5432/datomic?user=datomic&password=datomic in the output. Grab that URI and add it to our resources/config/autodjinn-config.edn:
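The config file grows a new key (I’m calling it :db-uri, and I’ve used autodjinn as the database name):

```clojure
{:gmail-username "you@gmail.com"
 :gmail-password "your-gmail-password"
 :db-uri "datomic:sql://autodjinn?jdbc:postgresql://localhost:5432/datomic?user=datomic&password=datomic"}
```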

Back at the top of core.clj, save that value to a var as we did with gmail-username and gmail-password:
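```clojure
(def db-uri (get (autodjinn-config) :db-uri))
```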

And then in the REPL:
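Roughly the following:

```clojure
(require '[datomic.api :as d])

(d/create-database autodjinn.core/db-uri)
;; => true the first time, false on later runs

(def db-connection (d/connect autodjinn.core/db-uri))
```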

Note that according to the datomic clojure docs for the create-database function, it returns true if the database was created, and false if it already exists. So running create-database every time we run our script is safe, since it won’t destroy data.

If the above work in the REPL doesn’t work, it is likely your code is unable to talk to your running Datomic, or your Datomic transactor is not configured correctly. Diagnose it by Googling and reading the docs until you get it working, then move on.

Calling (d/db db-connection) gives us the current value of our database. In most cases, we just want the most current value, so we can write a convenience function, new-db-val, that always gets us the current (and possibly different) database value. But there are cases where we want to coordinate several queries and use the same database value for each. In those cases, we won’t use the function to get the latest database value; instead, we’ll pass a single database value to the queries so that they all query against the same state.

In our core.clj, we can add the code we need to create the database, get our connection, and the convenience new-db-val function:
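Something like this, assuming [datomic.api :as d] has been added to the ns :require form:

```clojure
(d/create-database db-uri)

(def db-connection (d/connect db-uri))

(defn new-db-val
  "Return the most current database value from our connection."
  []
  (d/db db-connection))
```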

Next, we need to tell Datomic about the schema of our data. Schemas are just data that you run as a transaction on the database. Reading up on the Schema page of the Datomic docs might be helpful to understand what’s going on here. The short version is that we define each attribute of an email and set up its properties. The collection of all attributes together will constitute a mail entity, so we namespace all the attributes under the :mail/ namespace.
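Here’s a sketch of the schema var, showing a few representative attributes; the full version has one map per :mail/ attribute we want to store:

```clojure
(def mail-schema
  [{:db/id (d/tempid :db.part/db)
    :db/ident :mail/uid
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one
    :db/doc "The cleaned-up unique message ID"
    :db.install/_attribute :db.part/db}

   {:db/id (d/tempid :db.part/db)
    :db/ident :mail/subject
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one
    :db.install/_attribute :db.part/db}

   {:db/id (d/tempid :db.part/db)
    :db/ident :mail/cc
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/many
    :db.install/_attribute :db.part/db}

   ;; ...plus similar maps for :mail/from, :mail/to, :mail/bcc,
   ;; :mail/body-plain, :mail/body-html (strings) and
   ;; :mail/date-sent, :mail/date-received (:db.type/instant)
   ])
```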

We add that var def to our core.clj because it is, after all, just data. We may choose later to move it to its own edn file, but for now, it can live in our source code. Next, we want to apply this schema to our database with a transaction. That looks like this:
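```clojure
@(d/transact db-connection mail-schema)
```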

Now we put that transaction in a convenience function in core.clj that we’ll run every time we run this file. The function will ensure that our database has ‘converged’ to this schema. Running a transaction will create a new database value, but it will not blow away any data already in the database, no matter how many times we run it: it simply tries to update the existing attributes, and nothing in the attributes themselves needs to change. It is far more work to retract (delete) data in Datomic than it is to add or update it. This gives us much more safety when working with data, without worrying that we will destroy it, and it encourages a REPL-based exploration of the data and its history.
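A minimal version of that convenience function:

```clojure
(defn update-schema []
  @(d/transact db-connection mail-schema))

(update-schema)
```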

Now that our mail entities are defined in Datomic, we can try a query to find all the entity-IDs where any :mail/uid value is present. Read up on the Query page of the Datomic docs to dig into querying deeper. You might also be interested in the excellent Learn Datalog Today website to learn more about querying Datomic with Datalog.
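The query, run against the current database value:

```clojure
(d/q '[:find ?eid
       :where [?eid :mail/uid]]
     (new-db-val))
;; => #{}
```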

Since we have no mail entities in our database, Datomic returns an empty set. So now we reach the end of our task: we can ingest some emails and save them in our database! Return to the ingest-inbox function that we left before. Here’s what the updated version will look like:
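A sketch, using the helpers we built above; the nil/empty filtering is there because Datomic will not accept nil values in a transaction:

```clojure
(defn ingest-inbox []
  (doseq [msg (inbox my-store)]
    (let [m      (message->map msg)
          entity {:db/id              (d/tempid :db.part/user)
                  :mail/uid           (:uid m)
                  :mail/subject       (:subject m)
                  :mail/from          (:from m)
                  :mail/to            (:to m)
                  :mail/cc            (:cc m)
                  :mail/bcc           (:bcc m)
                  :mail/date-sent     (:date-sent m)
                  :mail/date-received (:date-received m)
                  :mail/body-plain    (:body-plain m)
                  :mail/body-html     (:body-html m)}]
      ;; drop nil values and empty collections before transacting
      @(d/transact db-connection
                   [(into {} (remove (fn [[_ v]]
                                       (or (nil? v)
                                           (and (coll? v) (empty? v))))
                                     entity))]))))
```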

We use the @-sign before the (d/transact…) call because Datomic normally returns a promise of the completed transaction. However, we want to force Datomic to complete each transaction before moving on by deref-ing it with the @-sign. Per the Clojure docs: “Calls to deref/@ prior to delivery will block.”

If you run this function in your REPL, you should see it start to ingest your email from Gmail!

Note that this could take a long time if you’ve chosen to import a really large Gmail inbox! You might want to stop the import at some point; in most REPLs, Ctrl-c will stop the running function.

If we query for our entity-IDs again, as above, we should see some values returned!

What does one of those database entities look like when we run it through Datomic’s entity and touch functions to instantiate all its attributes?
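Something like this; the entity ID and values shown are illustrative:

```clojure
(def an-eid
  (ffirst (d/q '[:find ?eid :where [?eid :mail/uid]]
               (new-db-val))))

(d/touch (d/entity (new-db-val) an-eid))
;; => {:db/id 17592186045423
;;     :mail/uid "CAJiAYR90LbbN6k8tVXuhQc8f6bZoK647ycdc7mxF5mVEaoLKHw@mail.gmail.com"
;;     :mail/subject "Hello world"
;;     :mail/from "someone@example.com"
;;     ...}
```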

Wrapping up

That’s it for this blog post. It took a little setup, but we were able to build up a working Gmail import tool with help from our REPL and some nice Clojure libraries.

Next time, we’ll be looking at doing some basic querying of the data, including getting a count of the number of times each email address has sent you an email.

Comments? Questions? Feel free to contact me at contact@mattgauger.com. I’d love to hear from you.


1 In this case, machine learning features, which are the input variables for our learning tasks. Not software features that a client might ask us to implement. See: Feature learning - Wikipedia, the free encyclopedia.

.

A Quick Dashboard in Hoplon & Castra

Note: I began writing a much longer blog post that went into a ton of detail about how to build an app dashboard that used Hoplon and Castra. The kind of dashboard that just consumes JSON API endpoints from another app or other data sources. Such dashboards update on the fly in the browser. Many apps these days need a dashboard like this to monitor stats: worker job queues, database size, average response times, etc.

Rather than that long blog post, I wanted to simply show the steps I would take to build such a dashboard with Hoplon and Castra. I won’t go into detail here or explain either Hoplon or Castra — go read on your own first, and also look into boot, the build tool this uses.

If you want to follow along, I’ve provided a repo. The README has instructions for getting set up. Assuming you have boot installed, you can just run boot gleam-app to get started.

So here’s how I’d build up a dashboard, in several iterations:

Static data in the browser:

First, we get some data into the HTML using Hoplon cells:
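A rough sketch of what src/index.html.hl could look like at this stage; the markup and the articles data here are made up, and the exact forms depend on the Hoplon version in the repo:

```clojure
(page "index.html")

(defc articles [{:title "First article"  :views 120}
                {:title "Second article" :views 87}])

(html
  (head (title "Gleam"))
  (body
    (h1 "Dashboard")
    (ul
      (loop-tpl :bindings [article articles]
        (let [title (cell= (:title article))
              views (cell= (:views article))]
          (li (text "~{title}: ~{views} views")))))))
```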

You’ll want to git reset --hard 69b070 to get to this point.

Move the data to ClojureScript:

In src/cljs/gleam/rpc.cljs:
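Roughly, a cell holding the same data; depending on the Javelin version in the repo, the namespace may be javelin.core instead of tailrecursion.javelin:

```clojure
(ns gleam.rpc
  (:require-macros [tailrecursion.javelin :refer [defc]])
  (:require [tailrecursion.javelin]))

(defc articles [{:title "First article"  :views 120}
                {:title "Second article" :views 87}])
```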

And take out the (def articles…) from index.html.hl. After boot recompiles everything, you should still see the data in the page.

To get to this point, you can run git reset --hard d63f299.

Move the data to the server side

Change src/cljs/gleam/rpc.cljs again, this time to make a remote call for data:
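A sketch using Castra’s mkremote; the namespaces and cell names are assumptions:

```clojure
(ns gleam.rpc
  (:require-macros [tailrecursion.javelin :refer [defc]])
  (:require [tailrecursion.javelin]
            [tailrecursion.castra :refer [mkremote]]))

(defc articles [])
(defc error nil)
(defc loading [])

;; get-articles fires an RPC to the server and swaps the result into
;; the articles cell when the response arrives
(def get-articles
  (mkremote 'gleam.api.gleam/articles articles error loading))
```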

On the backend, we need something like this in src/castra/gleam/api/gleam.clj:
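A minimal sketch, returning the same static data via Castra’s defrpc (the namespace names are assumptions):

```clojure
(ns gleam.api.gleam
  (:require [tailrecursion.castra :refer [defrpc]]))

(defrpc articles []
  [{:title "First article"  :views 120}
   {:title "Second article" :views 87}])
```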

The Hoplon HTML file changes in the script tag at the top to use the new ClojureScript remote call and start up the polling:
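Conceptually, the top of index.html.hl now requires the rpc namespace and kicks off polling; this is a sketch, and the polling interval is arbitrary:

```clojure
(page "index.html"
  (:require [gleam.rpc :as rpc]))

;; fetch once on load, then poll the backend every few seconds
(rpc/get-articles)
(js/setInterval rpc/get-articles 3000)
```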

To get to this point in the example repo, you can do git reset --hard 0bad1e5.

Real time data

The last step that I will show is to verify that we are in fact getting regular updates of data from the back end.

Change your Castra Clojure file to look like this:
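For example, returning data that changes on every call so the page visibly updates (again a sketch):

```clojure
(ns gleam.api.gleam
  (:require [tailrecursion.castra :refer [defrpc]]))

(defrpc articles []
  [{:title "First article"  :views (rand-int 1000)}
   {:title "Second article" :views (rand-int 1000)}
   {:title (str "Generated at " (java.util.Date.)) :views 0}])
```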

To get to this point, you can do a git reset --hard f19325.

Talking to a remote service

The last step here is left as an exercise for the reader. You can imagine replacing the articles function in src/castra/gleam/api/gleam.clj with something that polls a remote JSON API for data. Or you could look at my social news app gnar for inspiration on using a Postgres database for data.

I hope to finish up a post with full explanations soon. Castra is relatively new, and it’s worth explaining how some of the pieces fit together. My explanation should include more complicated interactions, like user authentication. I will be publishing that blog post after I get back from ClojureWest next week!

Let me know what you thought of this post by shooting me an email. I’d love to hear from you.

.

Agile Data Science: Review and Thoughts

Agile Data Science cover

Recently, I read the book Agile Data Science by Russell Jurney. The book covers data science and how the author applies an agile workflow and powerful tooling to accomplish tasks. While I found the book interesting, and would recommend it as a good introduction, I have some issues with the book that I’d like to discuss. I’d like to go over the book and the tools briefly, if only to save my thoughts for later.

A quick note: data science is actively being defined by the web community as the process of analyzing large data sets with statistics and other approaches. That definition is ongoing and changing all the time. Big Data is the term that the industry seems to be using for such large datasets. You’ll also see the terms machine learning, analytics, and recommender systems mentioned: these are all various sub-topics that I won’t cover in depth here.

The book centers around the use of Hadoop. In turn, Hadoop is commanded by writing and running Apache Pig scripts in the book. Pig allows you to write workflows in a high-level scripting language that may compose many Hadoop jobs into one system. With Pig, you need not worry about the specifics of what each Hadoop job is doing when you write a Pig script.

Hadoop is patterned after Google’s MapReduce paper. Google had large clusters of computers and large data sets that it wanted to process on those clusters. What they came up with was a simple idea: Write a single program that would specify a map function to run across tuples of all the input data. Add a reduce function that compiles that output down into the expected format. MapReduce coordinates deploying the program to each worker machine, divvying up the input data across the different machines, gathering up the results, and handling things like restarts after failures. This was a huge success inside Google, and Hadoop implements that architecture with improvements.

It should be noted that this MapReduce architecture is essentially batch-processing for large amounts of data. The same system would have a hard time with continuous streams of data.

Hadoop is, unfortunately, my first stumbling block with learning to process big data.

Configuring and running Hadoop is not easy. I have far more experience as a developer than as a sysadmin (or, in today’s term, devops engineer). There exists more than one “distribution” of Hadoop and more than one versioning scheme between those. This means that understanding what’s available, how to configure it, and whether search results are relevant to you is quite hard for the inexperienced. Imagine the confusion of trying to install a Debian Linux distro and only being able to find instructions for Red Hat Linux; further, not being able to tell what the problem was when it wouldn’t boot and printed a Debian-specific error.

It seems like Hadoop is designed to be run by someone whose full-time job is to configure and maintain that cluster. That person will need to have enough experience with all the different choices to have an opinion on them. For a developer wanting to run things locally before committing to configuring (and paying for!) a full cluster out on AWS, it was daunting.

Luckily for me, Charles Flynn has created a neat repo on Github at charlesflynn/agiledata. It builds a local development VM for the Agile Data Science book, with all the dependencies installed and the book’s code in the right place to run. With that project, I was able to get up and running quickly, and it was a relief not to have to sink any more time into configuring Hadoop. I’d like to give another shout-out to Charles for this great resource and the work done to make sure it works.

The book has the reader work with email data: your Gmail inbox, pulled locally for analysis. I thought this was neat in itself. Many data science books and tutorials use free, open datasets; as a result, the problem space may not be the most interesting one to you. But insights about your own communication, and how others communicate with you, are something you might find more interesting.

After explaining Hadoop, Pig, and a few other tools, the rest of the book follows a fairly lightweight “recipe” format. Each chapter explains the goal and how it fits in an “agile data science” workflow. Then, some code is presented, and then we see what kinds of results we can take from that step. Once this pattern is set up, the book moves fairly quickly through some rather interesting data wrangling. By the end, the reader has built several data analysis scripts and a simple web app put together with MongoDB, Python Flask, and D3.js graphs to display all the results.

At times, though, the quick recipe format seemed to explain too little. There was little explanation of how Pig script syntax worked or how to understand what was going on under the covers. What this book is not: an exhaustive guide to writing Pig scripts, to picking approaches for analyzing a dataset, or to composing these systems in production in the wild. Also missing was any mention of performance tuning or of what other algorithms might be considered.

That seems like an awful lot to leave out, but covering it all would have bogged the book down in diversions.

To the author’s credit, I finished the book, and finished it far faster than I expected I would. I came away having done almost all of the book’s examples (helped a great deal by the excellent virtual machine repo from Charles Flynn mentioned above). And I had a deeper understanding of, and respect for, tools that I’d never used before.

Final thoughts

When it comes down to it, I wouldn’t recommend reading Agile Data Science on its own. I’d recommend using it as a quick introductory book to build familiarity and confidence, so that you can dive into a deeper resource afterwards. It’s also a good read if you’re a developer who isn’t going to do data science as a full-time job but is curious about the tools and practices.

What I’m doing next

Almost immediately after finishing this book, I attended an event at a nearby college to talk about Apache Storm. Our company blog covered the event if you’re curious.

Storm is a tool that came out of Twitter for processing streams of big data. If you think about it, Twitter has one of the biggest streaming data sets ever. They need to use that streaming data for everything from recommendations to analytics to top tweet/hashtag rankings.

After attending the event and having run a word-counting topology (Storm’s term for a workflow that may contain many data-processing jobs) out on a cluster, I began to see the potential of using Storm.

Plus, Storm is far friendlier to local development on a laptop. You can run it with a simple command line tool, from inside your Java or Clojure code, or, perhaps most simply, from inside the Clojure REPL.

The other plus here is that Storm is mostly written in Clojure and has a full Clojure API. Combined with a few other Clojure tools that I prefer, like Datomic, Ring, and C2, I can see a toolset similar to that used in Agile Data Science. This toolset has the benefit of using the same language for everything. And, Clojure is already well-suited for data manipulation and processing.

So I began to rewrite the examples in Agile Data Science in Clojure. I am hoping to make enough progress to begin posting some of the code with explanations in blog format. Stay tuned for that.

.

A Theory of Compound Intelligence Gain

Note that this is probably not enough to call a theory. It’s an idea, at most.

I’m currently reading the book Race Against the Machine, which describes how increasing levels of automation by technology are related to capital and labor. But this post isn’t about that book. It simply triggered me to think about my motivations for my current side projects, and how I might explain to others why exactly I think that my current side projects are so important.

While Race Against the Machine describes technological progress as a force that leaves behind skilled workers who no longer have relevant skills, my thinking is on intelligence augmentation, and how I can use my own knowledge and programming skills to build tools that increase my own effectiveness and ability to perform my job. Namely, how can I write software that improves my cognition and memory such that I am better at writing software, and gain other benefits from having increased cognition and memory?

Douglas Engelbart wrote extensively about augmenting intelligence, primarily with improving workflows and then with computer software. I’ve previously quoted him on this blog. I feel that part of that quote bears repeating here:

By “augmenting human intellect” we mean increasing the capability of a man to approach a complex problem situation, to gain comprehension to suit his particular needs, and to derive solutions to problems.

Of course, Engelbart was writing about this in 1962 – well before every home had a personal computer and everyone had a powerful supercomputer in their pocket. For a modern overview of Engelbart’s framework, see The Design of Artifacts for Augmenting Intellect.

My earliest encounters with concepts of intelligence augmentation most likely come from science fiction. One character that has inspired a lot of my work (and that I’ve probably told you a lot about if we’ve discussed this project in person) is Manfred Macx from Charles Stross’s Accelerando. Macx is described in the early parts of the book as having a wearable computer that acts as his exocortex. The idea of an exocortex being that some part of his memory, thinking, and information processing lives outside of his head and on the wearable computer. Similarly, the exocortex can help act as a gate to his attention, which is one of our limited resources.

If you think about it, just as we are all cyborgs now by virtue of the technology we use every day, we are also all on our way to having exocortexes. Many of us use Gmail filters to protect our attention spans from email we receive but don’t always need to read. Or we use Google search to add on to our existing memory, perhaps to remember some long-forgotten fact that we only have an inkling of.

I’ve had Manfred Macx’s exocortex (and other flavors of science fiction’s wearable computers and augmented intelligences) kicking around in my head for years. Gmail tells me that I was trying to plan the architecture for such a thing as far back as 2006. It’s taken a lot of thinking and further learning in my career to even get to the point where I felt ready to tackle such a project.

What I am setting out to build is an exocortex of my own design, under my own control. Not something that is handed to me by Google in bits and pieces. And to do so, it turns out, requires a lot of research and learning. There’s tons of research on the topics of proactive autonomous agents, text classification, and wearable computing that I have been reading up on. Just to build the first phase of my project, I have been learning all of the following:

  • core.logic (which is based on Prolog, so I’m learning some Prolog now, too)
  • core.async (Clojure’s implementation of C.A.R. Hoare’s Communicating Sequential Processes, which is also how Go’s goroutines work)
  • Cascalog and Hadoop, to do my distributed computing tasks
  • Datomic & Datalog (a subset of Prolog for querying Datomic), to store knowledge in a historical fashion that makes sense for a persistent, lifelong knowledge system
  • Topic clustering, text classification, and other natural language processing approaches
  • Data mining, and in particular, streaming data mining of large datasets on Hadoop clusters, by reading the Stanford textbook Mining of Massive Datasets
  • Generally learning Clojure and ClojureScript better
  • and probably more that I am forgetting to mention

Of course, if I look at that list, I can be fairly certain that this project is already paying off. These are all things that I had very little experience with before, and very little reason to dig into so deeply. Not represented here are the 40 or so academic papers that I identified as important and seriously set out to read and take notes on – again, probably learning these topics more deeply than I otherwise would have.

Which brings me to this theory, the idea of this post: That by even beginning to work on this problem, I’m seeing some gains, and that any tools I can build that give me further gains will only compound the impact and effectiveness. Improving cognition and learning compounds to allow further gains in cognition and learning.

There’s an idea in the artificial intelligence community that we don’t need the first general artificial intelligence to be built as a super-intelligence; we need only build an artificial intelligence that is capable of improving itself (or a new generation of artificial intelligence). As each generation improves, such intelligences could become unfathomably intelligent. But all it takes is that first seed AI that can improve the next.

So for improving our own human intelligences, we may not need to build a single device up-front that makes us massively intelligent. We only need take measures to improve our current knowledge and cognition, to build tools that will help us improve further, and continue down this path. It will definitely not be the exponential gains predicted for AI, and may not be even linear – that is, the gains in cognition from building further tools and learning more may plateau. But there will be improvements.

For that reason, I’m not setting out to build Manfred Macx’s exocortex from the beginning. Instead, I have been building what I describe as an “Instapaper clone for doing research” – a tool that, if it improves my existing ability to research and learn new topics, could pay off in helping me to build the next phase of my projects.

Of course, at the same time, I have an eye towards using the foundation of this tool as the datastore and relevance-finding tool for the overall project. Such a tool can automatically go and find related content – either things I have read, or simply crawl related content on the web. Eventually, this tool will also ingest all of the information I interact with on a daily basis: every website I browse, every email I receive, every book that I read. A searchable, tagged, annotatable reference with full metadata for each document as an external long-term memory. But this is all a topic for another post.

This, in concert with what current research tells us is effective: improved nutrition and supplementation, exercise, meditation, and N-back training, may just be my ticket to higher levels of human intelligence. But for now, I just want the early-adopter edge. I want to see how far I can push myself on my own skills. Some large corporation may be able to field hundreds of developers to create a consumer product for the public that benefits everyone in similar ways – but I might be able to do this for myself years ahead of that. And wouldn’t that be cool?

And this is where I call it a theory: it could very well be that there’s no such thing as compounding interest on intelligence. Only time and my own experiences with this project will tell me.

If you’ve made it this far and you’re interested in this kind of stuff, that is: intelligence augmentation, wearable computing, autonomous proactive agents, etc., get in touch. There doesn’t seem to be much of an online community around these topics, and I’d like to start creating one for discussion and organizing open source projects around these topics.

.

An (Unscientific) Study in Behavior Change With Software

Forming habits is hard. There’s been tons of research on what practices help form new habits successfully. And there has been research on what software can do to help form new habits. It’s not enough to simply send daily reminders or keep track of the goals in a visible place. For software to help us form new habits successfully, we must look to the current research for clues as to how habits are formed.

Over the past year or so, I’ve been trying to adopt a habit of taking Vitamin D every morning. I’ve been largely successful, which I think is partly due to the software I used. I use Lift on my iPhone, which sends me emails every morning as a reminder. The app itself has checkins for each habit, progress charts, and social features. Most mornings, I wake up, swipe away the reminder email, and take my morning antihistamine and a Vitamin D. [1] Like I said, I’ve been mostly successful, and at this point, I’ve taken Vitamin D for 442 days in a row. [2] Granted, taking a vitamin every morning is only a small change, but it is one that I wanted to accomplish and did. Small successes add up to bigger successes, and this gives me confidence that if I set out to make a bigger change in my life, I have a toolset that will help me to accomplish that goal.

So what does the research say helps us form successful habits? The Fogg Method [3] is one of the more well-known systems, and suggests that a way to be successful is to:

  1. Select the right target behavior.
  2. Make the target behavior easy to do.
  3. Ensure a trigger will prompt the behavior.

So what do each of these steps tell us?

The Right Target Behavior

It’s hard to be successful in picking up a habit that you don’t already want to accomplish. Some things you may already want to do include things like learning a language, eating a specific diet, or flossing your teeth. It goes without saying that things you’d rather not do are going to be harder to implement.

But there’s another factor in play here that I think determines the right target behavior: simplicity. That is, is the habit a simple task to accomplish, or is it something complex and unmanageable? Can you perform one simple task per day and call it “done”, or is it more complicated as to whether it is “done” or not each day? The simple “done” state seems really important, and so it is good to focus on using this technique for binary actions: either you did them today, or you didn’t. Things that must be done with complicated schedules, every other day, or once a week, will be much harder to establish as habits.

Easy to do

One reason we want a simple target behavior is so that it is easy for us to add to our schedule. You may have a goal of exercising more. But “exercising more” doesn’t have a binary action associated with it; for example: what is “more”? Instead, you might say, “I want to exercise 45 minutes per day.” And that would be a much better goal. But if exercising means you have to drive to the gym, and the gym is out of your way each day, it might be very unlikely that you will do it. This is not a simple target behavior.

If you do have some goal that may not be simple to implement at first — say, the example of having to drive out of your way to the gym — instead try to find a simpler version of the habit that you can adopt first. You may decide instead to just do some bodyweight exercises before you leave for work each morning. Decide on the exercises and write them down. Either you did them or you didn’t. Later on, you can modify this existing habit to be more exercise, but for now, focus on what you can reasonably adopt as a simple habit.

The other concern in implementing an “easy” habit is how much time the new habit will take. In the above example, the initial goal was something like 45 minutes per day. Eventually, you could probably find when you exercise best and are least likely to schedule appointments (say, early morning or late at night), and actually implement that goal. But early on, it’s going to be hard to change your schedule for your new habit. I ran into this frequently while trying to find time after 5PM but before dinner to practice guitar. It didn’t help that after-work and dinnertime are frequently scheduled as social events, and that I have a habit of staying at the office past 5; all these added up to very little success in trying to spend 45 minutes to an hour practicing guitar at home after work.

Triggers

The last step is quite important. While you might think of triggers as things like alerts on your phone or daily emails from a service like Lift, I didn’t find those kinds of prompts very effective in helping me adopt a habit.

To be more likely to perform some task on any given day, look at the habits you already have. I’ve been taking an antihistamine every morning since I was about 12; this has been a constant in my life and part of my routine for a very long time. Since I already have this daily habit, I added taking Vitamin D every morning to that habit. Other habits with no daily routine to hinge off of, like practicing guitar, were much harder to make stick.

Flossing is an easy addition to brushing your teeth every night, and just took enough of me making it simpler (finding a brand of flossers I liked rather than wrangling loose floss) and doing it enough times before it stuck, too.

What didn’t work for me?

As noted above, despite a couple attempts to really make daily guitar practice stick, I’ve never been able to tackle that habit. There were no good triggers that I could add the event on to, and I frequently didn’t have time for what I was trying to accomplish. If I were to go back to trying to focus on guitar, I’d probably start with much less time commitment, and schedule it some time when I’m very likely to be home and have 5-10 minutes, like early in the morning before work. Whether or not guitar practice is effective with my first cup of coffee would have to be tested, of course.

What I’ve found is that I’m partially motivated by progress bars and graphs, though, and so I will make time in my day for easy-to-accomplish things. So when I can, I will try to squeeze in some mundane activity I’m tracking in Lift, like washing the dishes. [4]

The social component of Lift, on the other hand, doesn’t really help me any. For others, it might be a good motivator. In cycling, I have several local friends, including one local cyclist who is quite prolific and who frequently rides 10x as much as I do in a given week. We all use Strava to track our cycling, and the social component alerts me to new rides that the prolific cyclist has done. Seeing that cyclist’s rides helps remind me to get out and enjoy more cycling, as well as sets up a nice carrot-on-a-stick for me to ride more to “catch up.” In that case, the social features definitely help me to perform an action more, but I wouldn’t really call cycling a habit as much as my transportation and leisure-time hobby that I can do whenever I have time.

Habits with no simple binary action and no triggers, such as creative acts, are especially hard to form as habits. I have tracked writing blog posts in Lift for some time, but since I only write blog posts when the mood strikes me, it is hardly a daily goal, and it would be difficult for me to implement the above steps to form an actual habit of blogging on a daily basis.

Final thoughts

Notice that most of the guidelines above have very little to do with software? Software itself can’t convince you to go to the gym or make you more likely to floss. But it can provide some prompts and some encouragement, and that might be enough to get you over the hurdle of adopting a new habit.

As with anything, you are an individual and your mileage may vary. Experiment, use an app like Lift or something else you prefer to track your progress, and see where it takes you.

There are more resources out there to help you understand forming new habits, self-control, and behavior change, but I feel like this is the baseline one needs to know to be more successful in implementing behavior change. Some references of note that I have been consuming:

  1. Designing for Behavior Change. Stephen Wendel. 2013.
  2. The Sugary Secret of Self-Control. New York Times. September 2, 2011.
  3. The Healthy Programmer. Joe Kutner. 2013.

If you’re interested in some research, the above book (Designing for Behavior Change) is a good reference, as well as these papers:

  1. ReflectOns : mental prostheses for self-reflection (hardware and software solutions)
  2. Behavior Wizard: A Method for Matching Target Behaviors with Solutions

1 Yes, I’m aware of the fact that taking vitamins with an antihistamine decreases the effectiveness of the antihistamine. I’ll cover why I take them at the same time in this post.

2 You can view my progress on Lift on my public profile. Notice there’s quite a few habits I’ve tried to form with Lift in the past that didn’t quite work.

3 As described by BJ Fogg in the preface to Designing for Behavior Change.

4 We don’t have a dishwasher in our current apartment, and I both dislike dirty dishes and dislike washing dishes by hand.

.