Clojure Data Science: Refactoring and Cleanup

This is Part 2 of a series of blog posts called Clojure Data Science. Check out the previous post if you missed it.

Welcome to the second post in this series. If you followed along in the last post, your code should be ready to use in this post. If not, or if you need to go back to a known working state, you can clone the autodjinn repo and git checkout v0.1.0.

I started out writing this post to develop simple functionality on our inbox data. Finishing the post was taking longer than I was expecting, so I split the post in half in the interest of posting this sooner.

In this post, we need to create an email ingestion script that we can run repeatedly with lein. And we need to talk about refactoring our code out into maintainable namespaces.

So make sure your Datomic transactor is running and launch a REPL, because it is time to give our code a makeover.

A Gmail ingestion script

Because Clojure sits on the JVM, it shares some similarities with Java. One of these is the special purpose of a -main function. You can think of this as the main method in a Java class. The -main function in a Clojure namespace will be run when a tool like lein tries to “run” the namespace. That sounds like exactly what we want to do with our Gmail import functionality, so we will add a -main function that calls our ingest-inbox function. To get started, we will only have it print us a message.
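A minimal sketch of such a placeholder, assuming it is added to the existing autodjinn.core namespace:

```clojure
(defn -main
  "Entry point that lein invokes when it is asked to run this namespace."
  [& args]
  ;; Placeholder until the real ingestion steps are wired in.
  (println "Hello world!"))
```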

You can then run this by invoking lein run -m autodjinn.core. You should see Hello world! if everything worked. You may notice that the process doesn’t seem to quit after it prints the hello world message – this seems to be a problem with Leiningen. To ensure that our process ends when the script is done, we can add a (System/exit 0) line to the end of our -main function. On *nix systems, an exit code of 0 means a successful exit, and a nonzero exit code means something went wrong. Knowing this, we can take advantage of exit codes in the future to signal that an error occurred in our script. But for now, we will have the script end by returning 0 to indicate a successful exit.
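To make the exit-code idea concrete, here is one way it might look once we do want to signal errors. This is only a sketch: run-with-exit-code is a hypothetical helper that is not part of the project, and mapping every exception to 1 is an assumption made for illustration.

```clojure
(defn run-with-exit-code
  "Hypothetical helper: calls the zero-argument function `task` and
  returns 0 on success, or 1 if the task throws."
  [task]
  (try
    (task)
    0
    (catch Exception e
      ;; Report the failure on stderr so it does not mix with normal output.
      (binding [*out* *err*]
        (println "Error:" (.getMessage e)))
      1)))

(defn -main
  [& args]
  ;; System/exit ends the JVM with the given code, so the process quits.
  (System/exit (run-with-exit-code #(println "Hello world!"))))
```

Keeping the exit-code decision in a plain function means it can be tried in the REPL without killing the REPL process, which calling -main directly would do.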

Think back to what we did to ingest email in our REPL in the last post. We had to connect to the database, run the data schema transaction, and then we were able to run ingest-inbox to pull in our email.

The following function will do the same thing. Remember that things like trying to create an existing database or performing a schema update against the same schema in Datomic should be harmless. It will add a new transaction ID, but it will not modify or destroy data. Putting together all the steps we need to run, we get a -main function that looks like this:

Refactoring namespaces

With Clojure, one must walk a fine line between putting all of your functions into one big file, and having too many namespaces. One big file quickly grows unmaintainable and gains too many responsibilities.

But having too many namespaces can also be a problem. It may create strange cyclic dependency errors. Or you may find that with many separate namespaces, you have to require many namespaces to get anything done.

To avoid this, I start with most code in one namespace, and then look for common functionality to extract to a new namespace. Good candidates to extract are those that all talk about the same business logic or business domain. You may notice that the responsibility for one group of functions is different than the rest of the functions. That is a good candidate for a new namespace. Looking at responsibilities can be a good way to determine where to break apart functions into namespaces.

In this project, we can identify two responsibilities that currently live in our autodjinn.core namespace. The first is working with the database. The second is ingesting Gmail messages. As our project grows, we will not want the code for ingesting Gmail messages to live in autodjinn.core. With that in mind, let’s create a new file called src/autodjinn/gmail_ingestion.clj and move over the vars and functions that we think should live there. That file should look like this:

Be sure to remove the functions and vars that we moved to this file from the autodjinn.core namespace. Note that we moved the -main function here, too, so that we can now run lein run -m

You may also notice that we still had to require the datomic.api namespace here to be able to perform a transaction. Our autodjinn.core namespace already handles database interaction, though. So let’s write a create-mail function in core.clj and call it in our new namespace:

And in gmail_ingestion.clj we change ingest-inbox to use the new function. While we’re at it, we’ll break out a convenience function to prepare the attr map for Datomic:

If we run our lein run -m command, we should see that the code is still working.

Don’t forget to remove the datomic.api requirement in the gmail-ingestion namespace! Now we only need to require Datomic in the autodjinn.core namespace.

There’s one more low-hanging fruit that we can refactor about this code before moving on. The config file is loaded and used in both namespaces. We already require everything from autodjinn.core into So we can safely change a few lines to use the config in gmail_ingestion.clj and stop requiring nomad in two places:

And in core.clj:

Running lein run -m one more time, we should see that our changes did not break the system. The config is now only loaded once, and we use it everywhere.

That’s it! We’ve taken care of some low-hanging fruit and are ready to implement some new functionality. If you want to compare what you’ve done with my version, you can run git diff v0.1.1 on the autodjinn repo.

Please let me know what you think of these posts by sending me an email. I’d love to hear from you!


Clojure Data Science: Ingesting Your Gmail Inbox

This is Part 1 of a series of blog posts inspired by the exercises from Agile Data Science with Clojure. You may be interested in my review of the book.

For this blog post series, we are going to use your Gmail inbox as a dataset for an exploration of data science practices. Namely, we will use your email for machine learning and natural language processing applications. Email makes interesting data to process:

  • it has lots of metadata that we can use as features [1]
  • we can model the relationships of senders and receivers as a graph
  • each message has a body of text associated with it that we can analyze
  • gaining insights from our personal communication is far more interesting than using an open data set!

Note: This is not an intro-to-Clojure blog post. If you need a tutorial that starts with the basics, I recommend the Clojure from the ground up blog post series by Aphyr. It does an excellent job at introducing concepts in Clojure.

In this post, I follow my typical Clojure workflow: I open a REPL and begin exploring the problem space. I look at individual pieces of data and start transforming them. When I write some functionality that I like for one piece of data, I try to extract it into the source code as a function that can work for any data our project may see. In this way, we can build up the project to contain the functions that are necessary to get to our goal.

So what is our goal for this blog post? Well, we want to fetch all emails from our Gmail inbox. We want to get metadata for each email, including things like who sent it and when it was sent. Then, we want to save the messages into a database so we can do further processing in later posts.

Starting off, make a new basic Clojure project with lein. I’ve named my project autodjinn after AUTODIN, one of the first email networks. You can use the repo to refer to and to clone to follow along. At the beginning of each subsequent post, I’ll provide a SHA that you can reset the code to. Feel free to name your project whatever you want; just be sure to pay attention to the changes in filenames and namespaces as we go along!

Create the project and enter it:

To import our Gmail data, we will use a Clojure library called clojure-mail. Clojure-mail is still under active development and is likely to change. For this blog post, we’ll be using version 0.1.6 to ensure compatibility between the code in this post and the library.

Edit project.clj to contain your information and add the [clojure-mail "0.1.6"] dependency:

We’ll start by working in src/autodjinn/core.clj and later move the functionality out into a script for our email import task. Open up the file in your favorite editor and launch a REPL.

In your REPL, (use 'autodjinn.core) and verify it worked by running (foo "MYNAME"). You should see “MYNAME Hello, World!” printed out. Feel free to remove the (defn foo…) in core.clj now. We will not need it.

You may want to use something like Emacs’ cider or LightTable’s InstaREPL as your REPL environment. But you can use the regular Clojure REPL to build this project, as well. If you are not working with a REPL integrated to your editor, you will need to run (use 'autodjinn.core :reload) to force a reload of the code each time you save.

Connecting to Gmail

Our first goal is to connect to our inbox and verify that we can read email from it. To do that, we’re going to need to use our Gmail address and password — which we don’t want to put into our source files. It’s bad practice to put a password or a private key into a source file or check it into our repo! Just don’t do it!

Instead, we will use a nice library called nomad to load a config file containing our email address and password. We will add the config file to .gitignore so that it is never saved into our code.

Add the line [jarohen/nomad "0.6.3"] to your project.clj dependencies before moving on, and run lein deps in a console to pull in the dependency.

Back in our core.clj add the require statements for clojure-mail and nomad to your ns macro like this:

Then create a new file in resources/config/autodjinn-config.edn. It should look like this, with your email address and password filled in:

Now open up your .gitignore file and add the following line to it:

Following nomad’s README, we need to load our config file and pull out our gmail-username and gmail-password keys. We add the following to core.clj after the ns macro:

Using the get function here is a safe lookup for maps that returns nil if nothing is found for the key. Back in our REPL, we can see this in action with some quick experimentation:
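A quick illustration of that behaviour, using a made-up map in place of the real config (sample-config and its values are placeholders, not the project's config):

```clojure
;; A stand-in for the loaded config map; the values are made up.
(def sample-config {:gmail-username "me@example.com"
                    :gmail-password "not-a-real-password"})

(get sample-config :gmail-username)     ;; => "me@example.com"
(get sample-config :no-such-key)        ;; => nil
(get sample-config :no-such-key "n/a")  ;; => "n/a" (optional default)
(get nil :gmail-username)               ;; => nil, even with no map at all
```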

We could also use the shorter (:keyname mymap) syntax here, since keywords are invocable as functions that look themselves up in a map. But the get function reads better than (:gmail-username (autodjinn-config)) in my opinion.

In your REPL, you should now be able to get the values for gmail-username and gmail-password:

Note that since I’m in the user namespace here, I had to qualify the vars with their autodjinn.core namespace. If this is confusing, you might want to read up on namespaces in Clojure before moving on. (See also: the ‘Namespaces’ section in Clojure from the ground up: logistics.)

clojure-mail requires us to open a connection to Gmail with the gen-store function (src). We then pass that connection around to various functions to interact with our inbox. Define a var called my-store in your core.clj that does this with our email address and password:

Make sure the (def my-store… above has been run in your REPL and then take a look at our open connection:

The type of my-store should be an IMAPSSLStore as above. If it didn’t work, you’ll see a string error message when you try to define my-store.

Your inbox as a list

Now we’ll use our REPL to build up a function that will eventually import all of our email. To start, we can use the inbox function (src) from clojure-mail to get a seq of messages in our inbox. Note that since it is a seq and inboxes can be very large, we limit it with the take function.
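The effect of take on a large seq can be seen without touching Gmail at all. In this small sketch, fake-inbox is a made-up stand-in for the seq that the inbox function returns:

```clojure
;; Stand-in for a very large inbox: an infinite lazy seq of fake labels.
(def fake-inbox (map #(str "message-" %) (range)))

;; take only asks for the first three items, so this returns immediately
;; even though the underlying seq never ends.
(take 3 fake-inbox)
;; => ("message-0" "message-1" "message-2")
```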

If everything is working, you should see a list of the IMAPMessages returned by the last line in your REPL.

What if, instead, we wanted to loop over many messages and print out their subjects? We can pull in the message namespace (src) from clojure-mail, which gives us convenience functions for getting at message data.

You’ll have to be careful running this next line — on a large inbox it’ll print out the subject of everything in your inbox! If you have a lot of messages, consider wrapping the call to inbox in a take as above.

Those are the subject lines of the 4 messages in the inbox of my test account, so I know that this is working. Save our doseq line into a function called ingest-inbox; we’ll come back to it later:

Examining messages

Before we move on, let’s take a look at an individual message and what we can get out of it from the message namespace.

From this, we can see a few things:

  • The ID returned by message/id looks like a good candidate to get good unique IDs for each message when we store the messages. But we might want to strip off those angle brackets first.
  • The message/message-body function doesn’t return a string of the body. Instead, it returns a list of maps which contains the text/plain form of the body and the text/html form. We will have to extract each from the map so that we can use the plaintext version for things like language processing. We’ll also keep the HTML version in case we need it later.
  • If you started digging in to the message namespace’s source you may have noticed that we don’t have functions for getting date sent or date received for a message. Nor can we get a list of addresses CCed or BCCed for the message. We’ll have to write those functions ourselves.

Cleaning up the IDs

Let’s focus on writing a function to clean up the ID returned by the message/id function. Recall that such IDs look like <>

The clojure.string namespace provides a replace function which does simple replacement on a string. We can use it like this:
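Here is a sketch of that experiment. The ID is a hypothetical placeholder; a real one would come from message/id.

```clojure
(require '[clojure.string :as string])

;; Hypothetical message ID, used only for illustration.
(def raw-id "<abc123@mail.example.com>")

(string/replace raw-id "<" "")  ;; => "abc123@mail.example.com>"
(string/replace raw-id ">" "")  ;; => "<abc123@mail.example.com"
raw-id                          ;; => "<abc123@mail.example.com>" (unchanged)
```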

That worked for replacing the angle brackets for the original string. But remember that data structures are immutable in Clojure, including strings. Replacing the first angle bracket didn’t change the original string when we tried to replace the other angle bracket. We need something that allows us to build up an intermediate value and pass it to the next function. For that, we will use the thread-first macro: ->. It is easiest if I show the macro in use with some comments showing what the intermediate values would be at each step:
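A sketch of the macro at work on a hypothetical ID, with the intermediate value noted at each step (cleaned-id is just a name for the result):

```clojure
(require '[clojure.string :as string])

(def cleaned-id
  (-> "<abc123@mail.example.com>"  ;; hypothetical starting value
      (string/replace "<" "")      ;; "abc123@mail.example.com>"
      (string/replace ">" "")))    ;; "abc123@mail.example.com"

;; The macro expands to the nested form we would otherwise write by hand:
(string/replace (string/replace "<abc123@mail.example.com>" "<" "") ">" "")
```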

It is called the thread-first macro because it threads through the first argument to each function. In this case, clojure.string/replace’s first argument is the string to replace on. So each successively returned value gets passed to the next function above.

Now that we’ve figured out how to clean up that ID, we will create a function to clean up any ID we pass it:
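One way such a function might look. The name remove-angle-brackets is my own choice for this sketch and may not match the name used in the repo:

```clojure
(require '[clojure.string :as string])

(defn remove-angle-brackets
  "Strips every < and > from a message ID string."
  [id]
  (-> id
      (string/replace "<" "")
      (string/replace ">" "")))

(remove-angle-brackets "<abc123@mail.example.com>")
;; => "abc123@mail.example.com"
```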

Extracting the message bodies

Recall the message/message-body call above:

Ideally, we want to write a function that can get the text/plain body out of this value, and another function that can get the text/html body out. Notice that the :content-type values aren’t quite so simple as just selecting the item in the list where the string text/plain appears. We will need our function to ignore the additional information in the :content-type value, which includes things like string encodings.

Let’s look at just the first map in the list returned by message/message-body:

If we build a predicate function that can detect when the :content-type key is the type we want, we can use it in a filter function to choose the correct type of body in our functions.

Notice that TEXT/PLAIN and TEXT/HTML are always separated from the rest of the content-type by a semicolon, and it always appears first. You’d have to look at a few messages from your inbox to arrive at the same conclusion, but I’ve already done the work and can assure you that the previous statement is true.

Then, an easy way to get at the part of the content-type we want would be to split on the semicolon and take the first element returned:
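In the REPL, that might look like the following. The content-type string is illustrative; the extra parameters on your own messages will vary.

```clojure
(require '[clojure.string :as string])

;; Illustrative content-type value, not taken from a real message.
(def content-type "TEXT/PLAIN; charset=utf-8; format=flowed")

(string/split content-type #";")
;; => ["TEXT/PLAIN" " charset=utf-8" " format=flowed"]

(first (string/split content-type #";"))
;; => "TEXT/PLAIN"
```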

This leads us to a function to first clean up the content-type string, and then our predicate function to detect if it is the one we want:
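A sketch of that pair of functions. The names clean-content-type and is-content-type? are assumptions made for this example, and the sample maps carry only the :content-type key:

```clojure
(require '[clojure.string :as string])

(defn clean-content-type
  "Returns the part of a content-type string before the first semicolon."
  [content-type]
  (first (string/split content-type #";")))

(defn is-content-type?
  "True when the body map's :content-type matches requested-type."
  [body requested-type]
  (= requested-type (clean-content-type (:content-type body))))

(is-content-type? {:content-type "TEXT/PLAIN; charset=utf-8"} "TEXT/PLAIN") ;; => true
(is-content-type? {:content-type "TEXT/HTML; charset=utf-8"} "TEXT/PLAIN")  ;; => false
```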

To finish off our work on the message bodies, we want to filter the list returned by message/message-body:

And turn it into a function that works for any message bodies list:
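Roughly, the result could look like this. It is a sketch under a few assumptions: the function names are mine, each map is assumed to keep its text under a :body key, and the predicate is repeated in compact form so the snippet runs on its own.

```clojure
(require '[clojure.string :as string])

(defn content-type-matches?
  "True when the body map's :content-type starts with the wanted type."
  [body wanted]
  (= wanted (first (string/split (:content-type body) #";"))))

(defn find-body-of-type
  "Returns the :body of the first map in bodies with the wanted type,
  or nil when there is none."
  [bodies wanted]
  (:body (first (filter #(content-type-matches? % wanted) bodies))))

(defn get-text-body [bodies] (find-body-of-type bodies "TEXT/PLAIN"))
(defn get-html-body [bodies] (find-body-of-type bodies "TEXT/HTML"))

;; Made-up data in the shape described above.
(def sample-bodies
  [{:content-type "TEXT/PLAIN; charset=utf-8" :body "Hello"}
   {:content-type "TEXT/HTML; charset=utf-8"  :body "<p>Hello</p>"}])

(get-text-body sample-bodies) ;; => "Hello"
(get-html-body sample-bodies) ;; => "<p>Hello</p>"
```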

Note that we’ve also used this function to create two convenience functions, one for extracting plaintext bodies and one for extracting HTML bodies. By keeping functions simple and small, we can build up useful functions for our project rather than try to plan it all out ahead of time.

Getting more information out of the IMAPMessages

As noted above, we will need to write a few more functions to get the fields of the IMAPMessages that we cannot get through this version of clojure-mail. Recall that we want to get CC list, BCC list, date sent, and date received values. To do that, we will use Java interop functionality. It’s really not as bad as it sounds. Remember that the IMAPMessages we see are Java instances of the IMAPMessage class. Calling a method on an instance is accomplished by using a dot before the method name, with the method in the function position, such as: (.javaMethod some-java-instance)
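If the interop syntax is new to you, it can be tried on ordinary JDK objects before touching any mail classes. A small sketch using only classes that ship with Java:

```clojure
;; An ordinary Java object: a java.util.Date at the Unix epoch.
(def epoch (java.util.Date. 0))

(.getTime epoch)             ;; => 0
(.toUpperCase "text/plain")  ;; => "TEXT/PLAIN"
(.length "inbox")            ;; => 5
```

The same shape, method first and instance second, applies when the instance is an IMAPMessage.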

To start, we can look at clojure-mail’s project.clj and see that it depends on javax.mail. The next step is to find the documentation for the Java implementation of javax.mail.Message, which lives here.

In the REPL, we can try some of the Java interop on our my-msg:

The datetimes for each message are automatically turned into Clojure instants for us, which is convenient. If we dig into how the clojure-mail.message/to function [src] works, we see that it is using the .getRecipients method. .getRecipients takes the message and a constant of a RecipientType. For our purposes, we want the javax.mail.Message$RecipientType/CC and javax.mail.Message$RecipientType/BCC recipients:

The last line maps the str function across each element returned, so that we get the string representation of the email addresses. That way, our database can just store the strings.

As before, now that we know how to use these methods in the REPL, we write functions in core.clj to take advantage of our newfound knowledge:

In the REPL, it should now be possible to get a nice map representation of all the fields on the message we care about:

Congrats on making it this far. We’ve used quite a few neat little features of Clojure and the libraries we’re building this project with to get here.

The last step we’ll go through in this post is to get these messages into a database.

Enter Datomic, the immutable datastore

Datomic is a great database layer built on Clojure that gives us a database value representing immutable data. New transactions on the database create new database values. It fits very well with Clojure’s own concept of state and identity because it was designed by the same folks as Clojure. Plus, Datomic is meant to grow and scale in modern environments like AWS, with many backend datastore options to run it on.

There are some important reasons why you might choose Datomic as your database for a data science / machine learning application:

  • There are various storage backends, so you can grow from tens of thousands of rows in PostgreSQL on a developer’s laptop to millions of records (or more) in Riak or DynamoDB on AWS. That is, it has a good migration path from small datasets to big data through the Datomic import/export process
  • The concept of time associated with each value in Datomic means that we can query for historical data to compare against
  • Datomic has a lightweight schema compared to a relational database like PostgreSQL. Schemas are just data! When we begin computing new values from our dataset, we can add new types of entities easily at the same time.
  • Datomic’s schemas allow us to treat it as a key-value store, relational database, or even build a graph store on top of it, if we need to

Note: I won’t go through setting up an entire Datomic installation here. It’s worth reading up on the docs and the rationale behind Datomic’s design.

You can get the Datomic free build if you like, but you will be limited to in-memory stores. It is unlikely that your Gmail inbox will fit into memory on your dev machine. Instead, I recommend signing up for the free Datomic Pro Starter Edition. (The free Starter Edition is fine because you will not be using this project in a commercial capacity.) Once you have Datomic Pro downloaded and installed in your local Maven, I recommend using the PostgreSQL storage adapter locally with memcached. Follow the guides for configuring storage on the Datomic Storage page.

Add the correct line to your project.clj dependencies for the version of Datomic you’ll be using (mine was [com.datomic/datomic-pro "0.9.4384"] which might be a bit out of date and likely won’t match yours.) Now we can start using Datomic in our core.clj and our REPL.

The first thing we need is the URI where the Datomic database lives. When we start up the Datomic transactor, you will see a DB URI that looks something like datomic:sql://DBNAMEHERE?jdbc:postgresql://localhost:5432/datomic?user=datomic&password=datomic in the output. Grab that URI and add it to our resources/config/autodjinn-config.edn:

Back at the top of core.clj, save that value to a var as we did with gmail-username and gmail-password:

And then in the REPL:

Note that according to the datomic clojure docs for the create-database function, it returns true if the database was created, and false if it already exists. So running create-database every time we run our script is safe, since it won’t destroy data.

If the above work in the REPL doesn’t work, it is likely your code is unable to talk to your running Datomic, or your Datomic transactor is not configured correctly. Diagnose it with Googling and reading the docs until you get it to work, then move on.

Calling (d/db db-connection) gives us the current value of our database. In most cases, we want to just get the most current value. So, we can write a convenience function new-db-val to always get us the current (and possibly different) database value. But there are cases where we want to coordinate several queries and use the same database value for each. In those cases, we won’t use the function to get the latest database value, but rather pass the same database value to each query so that they all run against the same state.

In our core.clj, we can add the code we need to create the database, get our connection, and the convenience new-db-val function:

Next, we need to tell Datomic about the schema of our data. Schemas are just data that you run as a transaction on the database. Reading up on the Schema page of the Datomic docs might be helpful to understand what’s going on here. The short version is that we define each attribute of an email and set up its properties. The collection of all attributes together will constitute a mail entity, so we namespace all the attributes under the :mail/ namespace.

We add that var def to our core.clj because it is, after all, just data. We may choose later to move it to its own edn file, but for now, it can live in our source code. Next, we want to apply this schema to our database with a transaction. That looks like this:

Now we put that transaction in a convenience function in core.clj that we’ll run every time we run this file. The function will ensure that our database is ‘converged’ to this schema. Running a transaction will create a new database value. But it will not blow away any data that we had in the database by running this transaction many times. It will simply try to update the existing attributes, and nothing in the attributes themselves needs to change. It is far more work to retract (delete) data in Datomic than it is to add or update it. This leads to much more safety around working with data without worrying that we will destroy data, and it encourages a REPL-based exploration of the data and its history.

Now that our mail entities are defined in Datomic, we can try a query to find all the entity-IDs where any :mail/uid value is present. Read up on the Query page of the Datomic docs to dig into querying deeper. You might also be interested in the excellent Learn Datalog Today website to learn more about querying Datomic with Datalog.

Since we have no mail entities in our database, Datomic returns an empty set. So now we reach the end of our task: We can ingest some emails and save them in our database! Return to the ingest-inbox function that we left before. Here’s what the updated version will look like:

We use the @-sign before the (d/transact…) call because Datomic normally returns a promise of the completed transaction. However, we want to force Datomic to complete each transaction before moving on by deref-ing it with the @-sign. Per the Clojure docs: “Calls to deref/@ prior to delivery will block.”
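The blocking behaviour of @ can be tried without Datomic. A minimal sketch with a plain Clojure promise (the name answer is made up for illustration):

```clojure
(def answer (promise))

;; Nothing has been delivered yet, so a plain @answer would block here.
;; The timeout arity of deref shows that without hanging the REPL.
(deref answer 100 :not-yet)  ;; => :not-yet

(deliver answer 42)

;; Once a value is delivered, deref returns it immediately.
@answer                      ;; => 42
```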

If you run this function in your REPL, you should see it start to ingest your email from Gmail!

Note that this could take a long time if you’ve chosen to import a really large Gmail inbox! You might want to stop the import at some point; in most REPLs Ctrl-c will stop the running function.

If we query for our entity-IDs again, as above, we should see some values returned!

What does one of those database entities look like when we run it through Datomic’s entity and touch functions to instantiate all its attributes?

Wrapping up

That’s it for this blog post. It took a little setup, but we were able to build up a working Gmail import tool with help from our REPL and some nice Clojure libraries.

Next time, we’ll be looking at doing some basic querying of the data, including getting a count of the number of times each email address has sent you an email.

Comments? Questions? Feel free to contact me. I’d love to hear from you.

1 In this case, machine learning features, which are the input variables for our learning tasks. Not software features that a client might ask us to implement. See: Feature learning - Wikipedia, the free encyclopedia.


A Quick Dashboard in Hoplon & Castra

Note: I began writing a much longer blog post that went into a ton of detail about how to build an app dashboard that used Hoplon and Castra. The kind of dashboard that just consumes JSON API endpoints from another app or other data sources. Such dashboards update on the fly in the browser. Many apps these days need a dashboard like this to monitor stats: worker job queues, database size, average response times, etc.

Rather than that long blog post, I wanted to simply show the steps I would take to build such a dashboard with Hoplon and Castra. I won’t go into detail here or explain either Hoplon or Castra — go read on your own first, and also look into boot, the build tool this uses.

If you want to follow along, I’ve provided a repo. The README has instructions for getting set up. Assuming you have boot installed, you can just run boot gleam-app to get started.

So here’s how I’d build up a dashboard, in several iterations:

Static data in the browser:

First, we get some data into the HTML using Hoplon cells:

You’ll want to git reset --hard 69b070 to get to this point.

Move the data to ClojureScript:

In src/cljs/gleam/rpc.cljs:

And take out the (def articles…) from index.html.hl. After boot recompiles everything, you should still see the data in the page.

To get to this point, you can run git reset --hard d63f299.

Move the data to the server side

Change src/cljs/gleam/rpc.cljs again, this time to make a remote call for data:

On the backend, we need something like this in src/castra/gleam/api/gleam.clj:

The Hoplon HTML file changes in the script tag at the top to use the new ClojureScript remote call and start up the polling:

To get to this point in the example repo, you can do git reset --hard 0bad1e5.

Real time data

The last step that I will show is to verify that we are in fact getting regular updates of data from the back end.

Change your Castra Clojure file to look like this:

To get to this point, you can do a git reset --hard f19325

Talking to a remote service.

The last step here is left as an exercise for the reader. You can imagine replacing the articles function in src/castra/gleam/api/gleam.clj with something that polls a remote JSON API for data. Or you could look at my social news app gnar for inspiration on using a Postgres database for data.

I hope to finish up a post with full explanations soon. Castra is relatively new, and it’s worth explaining how some of the pieces fit together. My explanation should include more complicated interactions, like user authentication. I will be publishing that blog post after I get back from ClojureWest next week!

Let me know what you thought of this post by shooting me an email. I’d love to hear from you.


Agile Data Science: Review and Thoughts


Recently, I read the book Agile Data Science by Russell Jurney. The book covers data science and how the author applies an agile workflow and powerful tooling to accomplish tasks. While I found the book interesting, and would recommend it as a good introduction, I have some issues with the book that I’d like to discuss. I’d like to go over the book and the tools briefly, if only to save my thoughts for later.

A quick note: data science is actively being defined by the web community as the process of analyzing large data sets with statistics and other approaches. That definition is ongoing and changing all the time. Big Data is the term that the industry seems to be using for such large datasets. You’ll also see the terms machine learning, analytics, and recommender systems mentioned: these are all various sub-topics that I won’t cover in depth here.

The book centers around the use of Hadoop. In turn, Hadoop is commanded by writing and running Apache Pig scripts in the book. Pig allows you to write workflows in a high-level scripting language that may compose many Hadoop jobs into one system. With Pig, you need not worry about the specifics of what each Hadoop job is doing when you write a Pig script.

Hadoop is patterned after Google’s MapReduce paper. Google had large clusters of computers and large data sets that it wanted to process on those clusters. What they came up with was a simple idea: Write a single program that would specify a map function to run across tuples of all the input data. Add a reduce function that compiles that output down into the expected format. MapReduce coordinates deploying the program to each worker machine, divvying up the input data across the different machines, gathering up the results, and handling things like restarts after failures. This was a huge success inside Google, and Hadoop implements that architecture with improvements.
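The shape of such a program can be illustrated on a single machine with a word count. This sketch only shows the map and reduce halves; it does none of the distribution or failure handling that MapReduce and Hadoop provide, and the names map-step and reduce-step are made up for the example.

```clojure
(require '[clojure.string :as string])

;; Made-up input records.
(def documents ["to be or not to be" "to do is to be"])

(defn map-step
  "Map step: turns one input record into [word 1] tuples."
  [document]
  (for [word (string/split document #"\s+")]
    [word 1]))

(defn reduce-step
  "Reduce step: folds the emitted tuples into a total per word."
  [tuples]
  (reduce (fn [totals [word n]]
            (assoc totals word (+ n (get totals word 0))))
          {}
          tuples))

(reduce-step (mapcat map-step documents))
;; => a map with "to" 4, "be" 3, and 1 each for "or", "not", "do", "is"
```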

It should be noted that this MapReduce architecture is essentially batch-processing for large amounts of data. The same system would have a hard time with continuous streams of data.

Hadoop is, unfortunately, my first stumbling block with learning to process big data.

Configuring and running Hadoop is not easy. I have far more experience as a developer than a sysadmin (or today’s term: devops engineer). There exists more than one “distribution” of Hadoop and more than one versioning scheme between those. This means that understanding what’s available, how to configure it, and whether search results are relevant to you is quite hard for the inexperienced. Imagine the confusion of trying to install a Debian Linux distro and only being able to find instructions for Red Hat Linux; further, not being able to tell what the problem was when it wouldn’t boot and printed a Debian-specific error.

It seems like Hadoop is designed to be run by someone whose full-time job is to configure and maintain that cluster. That person will need to have enough experience with all the different choices to have an opinion on them. For a developer wanting to run things locally before committing to configuring (and paying for!) a full cluster out on AWS, it was daunting.

Luckily for me, Charles Flynn has created a neat repo on GitHub at charlesflynn/agiledata. It builds a local development VM for the Agile Data Science book, with all the dependencies installed and the book’s code in the right place to run. With that repo, I was able to get up and running quickly, and I found it useful not to have to sink any more time into configuring Hadoop. I’d like to give another shout-out to Charles for this great resource and the work done to make sure it works.

The book has the reader work with email data: your Gmail inbox pulled locally for analyzing. I thought this was neat in itself. Many data science tools use free datasets; as a result, working with those datasets may not be the most interesting problem space to you. But insights about your own communication and how others communicate with you are something you might find more interesting.

After explaining Hadoop, Pig, and a few other tools, the rest of the book follows a fairly lightweight “recipe” format. Each chapter explains the goal and how it fits in an “agile data science” workflow. Then, some code is presented, and then we see what kinds of results we can take from that step. Once this pattern is set up, the book moves fairly quickly through some rather interesting data wrangling. By the end, the reader has built several data analysis scripts and a simple web app put together with MongoDB, Python Flask, and D3.js graphs to display all the results.

At times, though, the quick recipe format seemed to explain too little. There was little explanation of how Pig script syntax worked or how to understand what was going on under the covers. What this book is not: an exhaustive guide to how to write Pig scripts, how to pick approaches to analyzing a dataset, or how to compose these systems in production in the wild. Also missing was any mention of performance tuning or what other algorithms might be considered.

That seems like an awful lot to be missing, but for this book those would have been diversions that bogged it down.

To the author’s credit, I finished the book, and finished it far faster than I expected I would. I came away having done almost all of the book’s examples (helped a great deal by the excellent virtual machine repo from Charles Flynn mentioned above). And I had a deeper understanding of and respect for tools that I’d never used before.

Final thoughts

When it comes down to it, I wouldn’t recommend Agile Data Science to read on its own. I’d recommend that you use it as a quick introductory book to build familiarity and confidence, so that you can dive into a deeper resource afterwards. It would also be a good read if you’re a developer who isn’t going to be doing data science as your full-time job but is curious about the tools and practices.

What I’m doing next

Almost immediately after finishing this book, I attended an event at a nearby college to talk about Apache Storm. Our company blog covered the event if you’re curious.

Storm is a tool that came out of Twitter for processing streams of big data. If you think about it, Twitter has one of the biggest streaming data sets ever. They need to use that streaming data for everything from recommendations to analytics to top tweet/hashtag rankings.

After attending the event and having run a word-counting topology (Storm’s term for a workflow that may contain many data-processing jobs) out on a cluster, I began to see the potential of using Storm.

Plus, Storm is far friendlier to local development on a laptop. You can run it with a simple command line tool or even from inside your Java or Clojure code. Or, perhaps most simply, from inside the Clojure REPL.

The other plus here is that Storm is mostly written in Clojure and has a full Clojure API. Combined with a few other Clojure tools that I prefer, like Datomic, Ring, and C2, I can see a toolset similar to that used in Agile Data Science. This toolset has the benefit of using the same language for everything. And, Clojure is already well-suited for data manipulation and processing.

So I began to rewrite the examples in Agile Data Science in Clojure. I am hoping to make enough progress to begin posting some of the code with explanations in blog format. Stay tuned for that.


A Theory of Compound Intelligence Gain

Note that this is probably not enough to call a theory. It’s an idea, at most.

I’m currently reading the book Race Against the Machine, which describes how increasing levels of automation by technology are related to capital and labor. But this post isn’t about that book. It simply triggered me to think about my motivations for my current side projects, and how I might explain to others why exactly I think that my current side projects are so important.

While Race Against the Machine describes technological progress as a force that leaves behind skilled workers who no longer have relevant skills, my thinking is on intelligence augmentation, and how I can use my own knowledge and programming skills to build tools that increase my own effectiveness and ability to perform my job. Namely, how can I write software that improves my cognition and memory such that I am better at writing software, and gain other benefits from having increased cognition and memory?

Douglas Engelbart wrote extensively about augmenting intelligence, primarily with improving workflows and then with computer software. I’ve previously quoted him on this blog. I feel that part of that quote bears repeating here:

By “augmenting human intellect” we mean increasing the capability of a man to approach a complex problem situation, to gain comprehension to suit his particular needs, and to derive solutions to problems.

Of course, Engelbart was writing about this in 1962 – well before every home had a personal computer and everyone had a powerful supercomputer in their pocket. For a modern overview of Engelbart’s framework, see The Design of Artifacts for Augmenting Intellect.

My earliest encounters with concepts of intelligence augmentation most likely come from science fiction. One character that has inspired a lot of my work (and that I’ve probably told you a lot about if we’ve discussed this project in person) is Manfred Macx from Charles Stross’s Accelerando. Macx is described in the early parts of the book as having a wearable computer that acts as his exocortex. The idea of an exocortex is that some part of his memory, thinking, and information processing lives outside of his head and on the wearable computer. Similarly, the exocortex can help act as a gate to his attention, which is one of our limited resources.

If you think about it, just as we are all cyborgs now by virtue of the technology we use every day, we are also all on our way to having exocortexes. Many of us use Gmail filters to protect our attention spans from email we receive but don’t always need to read. Or we use Google search to add on to our existing memory, perhaps to remember some long-forgotten fact that we only have an inkling of.

I’ve had Manfred Macx’s exocortex (and other flavors of science fiction’s wearable computers and augmented intelligences) kicking around in my head for years. Gmail tells me that I was trying to plan the architecture for such a thing as far back as 2006. It’s taken a lot of thinking and further learning in my career to even get to the point where I felt ready to tackle such a project.

What I am setting out to build is an exocortex of my own design, under my own control. Not something that is handed to me by Google in bits and pieces. And to do so, it turns out, requires a lot of research and learning. There’s tons of research on the topics of proactive autonomous agents, text classification, and wearable computing that I have been reading up on. Just to build the first phase of my project, I have been learning all of the following:

  • core.logic (which is based on Prolog, so I’m learning some Prolog now, too)
  • core.async (Clojure’s implementation of C.A.R. Hoare’s Communicating Sequential Processes, the same model behind Go’s goroutines and channels)
  • Cascalog and Hadoop, to do my distributed computing tasks
  • Datomic & Datalog (a subset of Prolog for querying Datomic), to store knowledge in a historical fashion that makes sense for a persistent, lifelong knowledge system
  • Topic clustering, text classification, and other natural language processing approaches
  • Data mining, and in particular, streaming data mining of large datasets on Hadoop clusters, by reading the Stanford textbook Mining of Massive Datasets
  • Generally learning Clojure and ClojureScript better
  • and probably more that I am forgetting to mention

Of course, if I look at that list, I can be fairly certain that this project is already paying off. These are all things that I had very little experience with before, and very little reason to dig into so deeply. Not represented here are the 40 or so academic papers that I identified as important, and seriously set out to read and take notes on – again, probably learning these topics more deeply than I otherwise would have.

Which brings me to this theory, the idea of this post: That by even beginning to work on this problem, I’m seeing some gains, and that any tools I can build that give me further gains will only compound the impact and effectiveness. Improving cognition and learning compounds to allow further gains in cognition and learning.

There’s some idea in the artificial intelligence community that we don’t need the first general artificial intelligence to be built as a super-intelligence; we need only build an artificial intelligence that is capable of improving itself (or a new generation of artificial intelligence). As each generation improves, such intelligences could become unfathomably intelligent. But all it takes is that first seed AI that can improve the next.

So for improving our own human intelligences, we may not need to build a single device up-front that makes us massively intelligent. We only need take measures to improve our current knowledge and cognition, to build tools that will help us improve further, and continue down this path. It will definitely not be the exponential gains predicted for AI, and may not be even linear – that is, the gains in cognition from building further tools and learning more may plateau. But there will be improvements.

For that reason, I’m not setting out to build Manfred Macx’s exocortex from the beginning. Instead, I have been building what I describe as an “Instapaper clone for doing research” – a tool that, if it improves my existing ability to research and learn new topics, could pay off in helping me to build the next phase of my projects.

Of course, at the same time, I have an eye towards using the foundation of this tool as the datastore and relevance-finding tool for the overall project. Such a tool can automatically go and find related content – either things I have read, or simply crawl related content on the web. Eventually, this tool will also ingest all of the information I interact with on a daily basis: every website I browse, every email I receive, every book that I read. A searchable, tagged, annotatable reference with full metadata for each document as an external long-term memory. But this is all a topic for another post.

This, in concert with what current research tells us is effective (improved nutrition and supplementation, exercise, meditation, and N-back training), may just be my ticket to higher levels of human intelligence. But for now, I just want the early-adopter edge. I want to see how far I can push myself on my own skills. Some large corporation may be able to field hundreds of developers to create a consumer product for the public that benefits everyone in similar ways – but I might be able to do this for myself years ahead of that. And wouldn’t that be cool?

And this is where I call it a theory: it could very well be that there’s no such thing as compounding interest on intelligence. Only time and my own experiences with this project will tell me.

If you’ve made it this far and you’re interested in this kind of stuff (intelligence augmentation, wearable computing, autonomous proactive agents, etc.), get in touch. There doesn’t seem to be much of an online community around these topics, and I’d like to start creating one for discussion and for organizing open source projects.
