Clojure Code Quality Tools

I work with many programming languages on a daily basis. As a polyglot programmer, I’ve come to appreciate tools that help me follow best practices. For JavaScript, there’s the excellent jshint. When I need to verify some XML, there’s xmllint. In a Ruby on Rails project, I can count on the rails_best_practices gem. For Ruby smells, I reach for rubocop. There are tools like SimpleCov to measure test coverage on my Ruby projects. cane helps me enforce limits on line length, method complexity, and more in my Ruby code. Syntastic brings real syntax checking to vim for many languages. Every day, more open source tools are introduced that help me improve the quality of the software I write.

It follows that when I write Clojure code, I want nice tooling to help me manage code quality, namespace management, and out-of-date dependencies. What tools do I use on a day-to-day basis for this? In this post, I’ll show 5 tools that I use in my workflow every day on Clojure projects, and also provide some other tools for further exploration. Most of these tools exist as plugins to the excellent Leiningen tool for Clojure.

lein deps :tree

In the past, lein deps was a command that downloaded the correct versions of your project’s dependencies. Running lein deps is no longer necessary, as each lein command now checks for dependencies before executing. But deps provides an interesting variant for our uses: lein deps :tree.

The :tree keyword at the end instructs lein to print out your project’s dependencies as a tree. This itself is a good visualization, but not what we’re looking for. The tree command will first print out any dependencies-of-dependencies which have conflicts with other dependencies. For example, here’s what lein deps :tree says for one of my projects:

As you can see, the tool suggests dependencies that request conflicting versions, and how we can modify our project.clj file to resolve those conflicting versions by excluding one or the other. This isn’t always very useful, but when you run into issues because two different Clojure libraries require two wildly different joda-time versions (a situation I have run into before), it will be good to know what dependencies are causing that issue and how you might go about resolving it.

Note that this functionality disappeared in Leiningen 2.4.3 but is back in 2.5.0, so make sure you run lein upgrade!

lein-ancient

This plugin to lein exists simply to check your project for outdated dependencies. Without lein-ancient, I’d be unable to keep up with some of the faster-moving libraries in the Java and Clojure world.
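If you haven’t used it before, the plugin entry in your ~/.lein/profiles.clj might look something like this (a sketch; the version shown is only an example, so check Clojars for the latest):

```clojure
;; ~/.lein/profiles.clj
{:user {:plugins [[lein-ancient "0.5.5"]]}}
```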

After adding ancient to your ~/.lein/profiles.clj, running the lein ancient command yields output on the same project as before:

Whoops! Looks like I haven’t been keeping up to date with my dependencies. lein ancient makes checking for new dependency versions easy. Further, thanks to the ubiquity of semantic versioning in Clojure projects, it is usually quite safe to bump the patch versions (the 0.0.x component) of dependencies.

You can also use lein-ancient to find outdated lein plugins in your ~/.lein/profiles.clj file. Just run it with the profiles argument:

lein kibit

As we gain experience and confidence in a programming language, we begin to talk about whether we’re writing idiomatic code. I’d argue that idiomatic code is code that accomplishes a goal with proper use of language features, in a way that other developers familiar with that language would understand. A simpler way to say it might be: idiomatic code uses the community-accepted best practices of how to do something.

Clojure’s design seeks to solve some problems found in older Lisps, as well as add niceties like complementary predicate functions. A good example of these convenient complements is the pair if and if-not. Clojure also contains several simplifications for common usage. For example, when you don’t need an else clause on an if, you can use the when macro.
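For instance, these two hypothetical forms (logged-in? and render-dashboard are made-up names) are equivalent, but the second is the more idiomatic choice when there’s no else branch:

```clojure
;; an if with no else branch...
(if (logged-in? user)
  (render-dashboard user))

;; ...reads better as a when
(when (logged-in? user)
  (render-dashboard user))
```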

Wouldn’t it be great if there was someone who was well-versed in Clojure idioms pairing with you and offering suggestions? That’s exactly what kibit does.

Running against a project I’d set up to contain some smells, lein kibit found:

These kinds of small improvements are all over our Clojure projects. They’re not show-stopper bugs, but they’re small places for improvement.

Kibit’s suggestions are almost always logically equivalent to the original code. Still, I always do some smoke-testing to ensure the code still works after applying a suggestion, and it generally does. The problems I most often fix with Kibit include replacing if forms that have no else branch with the when macro, simplifying checks for empty seqs, and cleaning up nil checks.

You can point lein kibit at a specific namespace by appending the path, like this: lein kibit src/foo/bar.clj

Kibit catches many cases where there is a more-idiomatic way to express what you are trying to do. I recommend running it often. In fact, it’s possible to use kibit in your emacs buffers if you want it to be that much more convenient and real-time.

Eastwood

For linting Clojure code, there’s Eastwood. It is similar in spirit to Kibit, but it catches a different set of issues. Built on two interesting Clojure projects, tools.analyzer and tools.analyzer.jvm, Eastwood performs a powerful examination of your code inside the JVM. It is worth highlighting that since Eastwood loads your code to analyze it, it might trigger any side effects that happen at load time: writing files, modifying databases, etc. Note that it only loads the code; it does not call your functions.

After adding eastwood to your lein profiles.clj, simply run: lein eastwood and you will see output like:

That’s a lot of problems for a simple file! Notice how one mistake got caught for two reasons: A misplaced docstring (placed after the arguments vector) becomes just a string in the function body that will be thrown away.
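As a rough illustration (add-two is a made-up function, not the one from my test file), a misplaced docstring looks like this:

```clojure
;; the "docstring" after the argument vector is just an expression in the
;; function body that gets evaluated and thrown away:
(defn add-two
  [x]
  "Adds two to x."
  (+ x 2))

;; what was intended:
(defn add-two
  "Adds two to x."
  [x]
  (+ x 2))
```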

Another nice catch that Eastwood provides is detecting the redefinition of the var qux in the file.

But Eastwood covers a lot more cases than just vars being def’d more than once. See the full list to find out what else it does. There are a few linters that are disabled by default, but they might make sense to enable for your project.

Frequently running lint tools can help prevent subtle problems that come from code that looks correct but contains some small error. Eastwood is less concerned with style than tools like JSHint are, but we have other tools that cover stylistic concerns.

lein bikeshed

This is a relative newcomer to my own tool set. lein bikeshed has features related to the low-hanging fruit in our Clojure code: lines longer than 80 characters, blank lines at ends of files, and more. It will also tell you what percentage of functions have docstrings. Like other tools mentioned here, it is a lein plugin that you add to your profiles.clj.

A run of lein bikeshed on its own source (which purposefully includes some code designed to fail) looks like this:

Bikeshed might give a lot of output for your existing projects, but the warnings are worth investigating and addressing. And if the long-lines warning doesn’t matter to you, you can always silence it with the -m command line argument.

Tying it all together with a Lein alias

Wouldn’t it be great to run all these tools frequently, so that you can check for as many problems as possible? Well, you can, with a lein alias. (The lein wiki documents aliases in the lein sample.project.clj.)

In ~/.lein/profiles.clj, inside your :user map, add the line:
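A sketch of what that entry might look like; it chains tasks with lein’s do task, and the exact set of tasks is up to you:

```clojure
;; goes inside the :user map of ~/.lein/profiles.clj
:aliases {"omni" ["do" ["deps" ":tree"] ["ancient"] ["kibit"] ["bikeshed"]]}
```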

Now, when you want to run all these tools at once on a project, you simply invoke lein omni. I use this alias on all my Clojure(Script) projects. I have grown accustomed to seeing the kinds of output that a clean Clojure project will have.

It’s worth noting that I don’t run Eastwood unless it is necessary for the project. When it is necessary, I override the alias in the project’s project.clj to run Eastwood as well.

This command can take some time to complete, but with an alias we’re only spinning up lein once.

And a bash alias

The output of lein omni can be long, which can either result in a lot of scrolling or neglecting to run the command due to the inconvenience. To help manage the length of the output, I’ve created a bash alias that runs the plugins and pipes them to less.

My personal bash alias also runs midje at the end. You can choose whether to run the tests for your own alias. That’s just my personal preference.

Note that, just like running the lein alias above, this may take a bit of time. Since we’re piping it to less, it might take a while before less receives output. While it is still running, output will periodically show up at the bottom of the less buffer. You can use both Emacs’ and vim’s movement commands in less to advance the buffer. I find less to be more manageable for scrolling through output than switching to tmux’s history scrolling mode.

Managing your namespaces: lein slamhound

Namespace management often becomes an issue on nontrivial Clojure projects. Actively developing a project means managing the functions we pull in from other namespaces and from libraries. These require statements often get out of date: they’re missing namespaces that are needed, or they contain requires for functions that are no longer used in the current code.

slamhound is a tool that can help to manage dependencies in your namespaces. It knows how to require and import Clojure and Java dependencies, and can remove stale requires that are no longer necessary. Slamhound can often fix missing requires for functions that it can resolve.

Note: slamhound rewrites the namespace macros in your project’s .clj files! I recommend only running it on code that’s committed to git (or whatever you use as a VCS) so that you can review and rollback any changes it makes.

The most basic way to use slamhound is to add it to your ~/.lein/profiles.clj as a dependency. Then add this alias:
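Following the slamhound README, the profiles.clj additions might look something like this (a sketch; the version shown is only an example):

```clojure
;; ~/.lein/profiles.clj
{:user {:dependencies [[slamhound "1.5.5"]]
        :aliases      {"slamhound" ["run" "-m" "slam.hound"]}}}
```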

Now you can use slamhound on a project by running lein slamhound in the project’s directory. There’s also REPL and Emacs support, which you can learn more about in the slamhound README.

Measuring test coverage with cloverage

It is often claimed that less unit testing is necessary in Clojure because Clojure is functional and makes use of immutable data structures. And it is true that with functional programming, most tests are simple: given some input, the output should be a certain value.

Some would even argue that Clojure functions should be well-factored enough into simple functions that the behavior of the function is apparent and requires no tests. Still others maintain that developing in the REPL is as good as writing unit tests, since functions are constantly evaluated and integrated with this style of development.

That said, there’s still mutable Java code to interop with, there’s still the necessary evil of functions with side effects, and we might want to check the structure of the data we’re producing in our functions rather than the value of it. For all those reasons and to check that I don’t introduce regressions, I tend to write unit tests in Clojure.

This blog post isn’t a platform to argue for or against testing Clojure. But when you do test, you may wonder how to tell how much test coverage your test suite has. How do we know at a glance what percentage of our namespaces is being tested? And how do we find lines that are never being exercised in our tests? After all, we can’t improve what we don’t measure.

That’s where cloverage comes in. Cloverage is another lein plugin, so it gets added to ~/.lein/profiles.clj like the others. Then run lein cloverage in your project; it will run the test suite and generate a coverage report.

The coverage report appears in target/coverage as HTML files, broken down by namespace.

You can still use Cloverage even if you don’t use clojure.test. I use midje in most of my tests. To use Cloverage in those situations, wrap your tests in a deftest.

Since deftest takes a hyphenated Clojure symbol as its name, and Midje facts take a string, I’ve come to use deftest to group related tests together. Usually this means naming the group of tests after the function I’m testing, and then naming the Midje facts after the situation each fact exercises. This makes sense to me because it fits well with the hierarchy of RSpec unit tests in Ruby.

Here’s an example of using this approach:
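Here is a minimal sketch of that pattern; the namespace and the add function are made up for illustration:

```clojure
(ns myapp.core-test
  (:require [clojure.test :refer [deftest]]
            [midje.sweet :refer :all]
            [myapp.core :refer [add]]))

;; deftest groups related facts; each fact names the situation it exercises
(deftest add-tests
  (fact "adds two positive numbers"
    (add 1 2) => 3)
  (fact "adds negative numbers"
    (add -1 -2) => -3))
```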

Cloverage also outputs a coverage.txt file that might be useful for use with services like Coveralls. I haven’t used this, so I can’t comment on its usefulness.

If you’re using speclj for your tests, you might run into some issues getting Cloverage to play nice. I don’t use speclj often, so when I couldn’t get it to work with Cloverage, I didn’t pursue the issue.

Final Thoughts

In this post, I covered 5 tools to add to your workflow all the time, and some others that might be useful in certain cases. I’m sure there are more useful tools out there that I don’t know about, and I’d love to hear about them.

I’m also thinking about writing some posts about other development tools that I use, particularly how I use midje to test, and how you can benchmark code with perforate. If you’re interested in those topics, get in touch and let me know.

Have fun and enjoy your cleaner codebase with these tools in your tool belt!


Interested in commenting or contacting me? Send an email to contact@mattgauger.com. Thanks!


Atreus: My Custom Keyboard

Last year I wrote about building chording keyboards and USB foot pedals. At the time, using the Teensy microcontroller as a USB HID device was possible, but it still required a lot of research. There was no good central resource for knowledge about building keyboards. Since then, the Ergo Dox keyboard was released as open source and got quite popular. This seems to have opened the door for many to get into building keyboards.

My friend Ian ordered an Ergo Dox on the Massdrop crowdfunding campaign, after I suggested that I’d teach him to solder and we’d assemble it together. Finding time to get together and build it took almost a year, but we’ve started meeting up weekly to assemble the Ergo Dox. Building his keyboard has been a lot of fun, and inspired me to work on my own keyboard projects again.

Almost exactly a month ago, I started working on building my own keyboard. I wanted to build a keyboard from scratch that could replace my daily-driver keyboard, a PFU Happy Hacking Lite, so it had to be smaller than most tenkeyless keyboards. The Ergo Dox’s columnar layout was always intriguing, but I wasn’t sure that I needed all those keys. (Normal keyboards stagger the keys of each row, which is a holdover from preventing mechanical typewriters from jamming. Columnar layouts assign a column of keys to each finger.)

Through Geekhack, I found the Atreus, a keyboard designed by Phil Hagelberg (better known as technomancy online.) The Atreus is open source (hardware, firmware), and has gone through several revisions at this point. My keyboard is done now, and I wanted to share it.


The original Atreus was constructed out of layers of laser-cut acrylic. Since then, some folks on the Geekhack thread have redesigned the laser-cut design to be cut out of a sheet of birch plywood on Ponoko. Ponoko is a great service: you upload a file and choose materials and size. The Ponoko website keeps you updated on your project’s status as they check your design, pick materials, and so on. Later, your laser-cut project arrives in the mail. I highly recommend Ponoko’s service if you need laser cutting and can’t get it done at a local makerspace.


I finished the birch ply with semi-gloss marine polyurethane. The polyurethane should give it a durable finish, and it added a nice amber tint to the wood. The downside is that more than a week after the final coat went on, the poly is still off-gassing some headache-inducing fumes.

After applying the finish, I hot-glued the switches in and soldered it together. There’s no PCB with this design, just point-to-point wiring from the switches and components to a central Teensy. I used Cherry MX Clear switches for the majority of the keys because they feel the closest to my Happy Hacking’s Topre switches. The modifiers are Cherry MX Blacks.

Assembling the Atreus with point-to-point soldering wasn’t too bad, but I’ve had a lot of experience soldering. I’ve no doubt that the construction will be durable and reliable, but a PCB might make it easier to assemble for beginners. There’s some talk on Geekhack about using the One Hand PCB as a circuit board for an Atreus-like keyboard.

The rest of my images from the build are on Flickr in this album.

After hours of soldering, the moment of truth came: I plugged in the Teensy, uploaded the firmware, and typed some keys. It worked! I felt relieved that the keyboard worked on the first try. Because I had checked for continuity and shorts throughout the soldering process, I can be confident that my Atreus won’t have any issues with ghosting or glitches. The finished keyboard feels really solid; maybe more so than some plastic keyboards I’ve typed on before.

Because the Atreus uses a columnar layout, I’m not planning to use it with a QWERTY layout. So, I decided to learn Dvorak. I’ve been practicing on the home row on dvorak.nl, which is a great website for learning Dvorak in your browser. The neat thing about that typing tutor is that you don’t have to commit to changing any key layouts at the OS-level. I’ve got the default QWERTY layout on my Atreus now, but will be switching to a hardware-native Dvorak layout soon.

Since the Atreus uses a Teensy as its brain, it can be reconfigured easily by uploading a new firmware. Keyboard layouts for the Atreus start as a JSON file, and then an emacs function can be invoked to compile and upload the firmware to the board. The same JSON file can also be used to generate an HTML table of the layout with Org Mode in emacs. More information can be found on the firmware project repo.

What next?

I haven’t worked on my chording keyboard in a long time. I’m happy to see that things like the tmk firmware will now make that project much easier. With my new knowledge and the many open source projects now available, I’m going to restart work on that project.

Further, I’ve been playing with Matt Adereth’s dactyl to design chording keyboard layouts. Dactyl allows me to write Clojure code and output it in a format that OpenSCAD can generate a 3D model with. OpenSCAD can export the files to the formats that 3D printers use. 3D printing has a lot of promise for iteratively prototyping unique ergonomic peripherals, and I intend to try out several ideas for one-hand / chording keyboards.

If you’re interested in building your own keyboard, I would recommend the Ergo Dox, especially if you can get the kit that Massdrop produced, because the circuit boards are well-made. Otherwise, spend some time on the Geekhack & Deskthority forums, read some wiki pages, and test some keyboards. And if you’re interested in building the Atreus, join the discussion! Everyone in that thread has been very helpful. This project wouldn’t have been possible without their answers and advice.


Interested in commenting or contacting me? Send an email to contact@mattgauger.com. Thanks!


Housekeeping: Imported Coderwall Protips

As part of my continuing effort to archive content I’ve created to this blog, I’ve migrated all of my Coderwall protips.

Here’s a quick list of the posts:


Clojure Data Science: Refactoring and Cleanup


This is Part 2 of a series of blog posts called Clojure Data Science. Check out the previous post if you missed it.


Welcome to the second post in this series. If you followed along in the last post, your code should be ready to use in this post. If not, or if you need to go back to known working state, you can clone the autodjinn repo and git checkout v0.1.0.

I started out writing this post to develop simple functionality on our inbox data. Finishing the post was taking longer than I was expecting, so I split the post in half in the interest of posting this sooner.

In this post, we’ll create an email ingestion script that we can run repeatedly with lein, and we’ll talk about refactoring our code into maintainable namespaces.

So make sure your Datomic transactor is running and launch a REPL, because it is time to give our code a makeover.

A Gmail ingestion script

Because Clojure sits on the JVM, it shares some similarities with Java. One of these is the special purpose of a -main function. You can think of this as the main method in a Java class. The -main function in a Clojure namespace will be run when a tool like lein tries to “run” the namespace. That sounds like exactly what we want to do with our Gmail import functionality, so we will add a -main function that calls our ingest-inbox function. To get started, we will only have it print us a message.
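A first stub might look like this in src/autodjinn/core.clj (a sketch; we’ll flesh it out below):

```clojure
(defn -main
  "Entry point for `lein run -m autodjinn.core`."
  [& args]
  (println "Hello world!"))
```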

You can then run this by invoking lein run -m autodjinn.core. You should see Hello world! if everything worked. You may notice that the process doesn’t seem to quit after it prints the hello world message; this seems to be a problem with Leiningen. To ensure that our process ends when the script is done, we can add a (System/exit 0) line to the end of our -main function so that the process quits normally. On *nix systems, a 0 exit code means success, and a nonzero exit code means something went wrong. Knowing this, we can take advantage of exit codes in the future to signal that an error occurred in our script. But for now, we will have the script end by returning 0 to indicate a successful exit.

Think back to what we did to ingest email in our REPL in the last post. We had to connect to the database, run the data schema transaction, and then we were able to run ingest-inbox to pull in our email.

The following function will do the same thing. Remember that things like trying to create an existing database or performing a schema update against the same schema in Datomic should be harmless. It will add a new transaction ID, but it will not modify or destroy data. Putting together all the steps we need to run, we get a -main function that looks like this:
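Here is a sketch of that function; create-db and update-schema are placeholder names for whatever your project called its database-creation and schema-transaction helpers from the last post:

```clojure
(defn -main
  [& args]
  (create-db)      ; create the Datomic database if it doesn't exist yet
  (update-schema)  ; transact the mail schema (safe to re-run)
  (ingest-inbox)   ; pull mail from Gmail into Datomic
  (System/exit 0)) ; exit cleanly so lein run doesn't hang
```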

Refactoring namespaces

With Clojure, one must walk a fine line between putting all of your functions into one big file, and having too many namespaces. One big file quickly grows unmaintainable and gains too many responsibilities.

But having too many namespaces can also be a problem. It may create strange cyclic dependency errors. Or you may find that with many separate namespaces, you have to require many namespaces to get anything done.

To avoid this, I start with most code in one namespace, and then look for common functionality to extract into a new namespace. Good candidates to extract are functions that all deal with the same business logic or business domain. You may also notice that one group of functions has a different responsibility from the rest; that group is a good candidate for a new namespace. Looking at responsibilities is a good way to determine where to break functions apart into namespaces.

In this project, we can identify two responsibilities that currently live in our autodjinn.core namespace. The first is working with the database. The second is ingesting Gmail messages. As our project grows, we will not want the code for ingesting Gmail messages to live in autodjinn.core. With that in mind, let’s create a new file called src/autodjinn/gmail_ingestion.clj and move over the vars and functions that we think should live there. That file should look like this:
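A rough sketch of how the new namespace might open; the exact requires depend on which vars and functions you move over (at this point it still pulls in datomic.api, which we address below):

```clojure
(ns autodjinn.gmail-ingestion
  (:require [autodjinn.core :refer :all]
            [clojure-mail.core :refer :all]
            [clojure-mail.message :as message]
            [datomic.api :as d]))

;; my-store, ingest-inbox, and -main from core.clj move down here
```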

Be sure to remove the functions and vars that we moved to this file from the autodjinn.core namespace. Note that we moved the -main function here, too, so that we can now run lein run -m autodjinn.gmail-ingestion.

You may also notice that we still had to require the datomic.api namespace here to be able to perform a transaction. Our autodjinn.core namespace already handles database interaction, though. So let’s write a create-mail function in core.clj and call it in our new namespace:
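A sketch of create-mail in core.clj, assuming db-connection is the connection var from the previous post and attrs is a map of :mail/* attributes:

```clojure
(defn create-mail
  "Transacts a map of :mail/* attributes as a new entity."
  [attrs]
  @(d/transact db-connection
               [(merge {:db/id (d/tempid :db.part/user)} attrs)]))
```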

And in gmail_ingestion.clj we change ingest-inbox to use the new function. While we’re at it, we’ll break out a convenience function to prepare the attr map for Datomic:

If we run our lein run -m autodjinn.gmail-ingestion command, we should see that the code is still working.

Don’t forget to remove the datomic.api requirement in gmail-ingestion namespace! Now we only need to require Datomic in the autodjinn.core namespace.

There’s one more low-hanging fruit that we can refactor about this code before moving on. The config file is loaded and used in both namespaces. We already require everything from autodjinn.core into autodjinn.gmail-ingestion. So we can safely change a few lines to use the config in gmail_ingestion.clj and stop requiring nomad in two places:

And in core.clj:

Running lein run -m autodjinn.gmail-ingestion one more time, we should see that our changes did not break the system. The config is now only loaded once, and we use it everywhere.

That’s it! We’ve taken care of some low-hanging fruit and are ready to implement some new functionality. If you want to compare what you’ve done with my version, you can run git diff v0.1.1 on the autodjinn repo.

Please let me know what you think of these posts by sending me an email at contact@mattgauger.com. I’d love to hear from you!


Clojure Data Science: Ingesting Your Gmail Inbox


This is Part 1 of a series of blog posts inspired by the exercises from Agile Data Science with Clojure. You may be interested in my review of the book.


For this blog post series, we are going to use your Gmail inbox as a dataset for an exploration of data science practices. Namely, we will use your email for machine learning and natural language processing applications. Email makes interesting data to process:

  • it has lots of metadata that we can use as features [1]
  • we can model the relationships of senders and receivers as a graph
  • each message has a body of text associated with it that we can analyze
  • gaining insights from our personal communication is far more interesting than using an open data set!

Note: This is not an intro-to-Clojure blog post. If you need a tutorial that starts with the basics, I recommend the Clojure from the ground up blog post series by Aphyr. It does an excellent job at introducing concepts in Clojure.

In this post, I follow my typical Clojure workflow: I open a REPL and begin exploring the problem space. I look at individual pieces of data and start transforming them. When I write some functionality that I like for one piece of data, I try to extract it into the source code as a function that can work for any data our project may see. In this way, we can build up the project to contain the functions that are necessary to get to our goal.

So what is our goal for this blog post? Well, we want to fetch all emails from our Gmail inbox. We want to get metadata for each email, including things like who sent it and when it was sent. Then, we want to save the messages into a database so we can do further processing in later posts.

Starting off, make a new basic Clojure project with lein. I’ve named my project autodjinn after AUTODIN, one of the first email networks. You can refer to the repo and clone it to follow along. At the beginning of each subsequent post, I’ll provide a SHA that you can reset the code to. Feel free to name your project whatever you want; just be sure to pay attention to the changes in filenames and namespaces as we go along!

Create the project and enter it:

To import our Gmail data, we will use a Clojure library called clojure-mail. Clojure-mail is still under active development and is likely to change. For this blog post, we’ll be using version 0.1.6 to ensure compatibility between the code in this post and the library.

Edit project.clj to contain your information and add the [clojure-mail "0.1.6"] dependency:
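A sketch of the resulting project.clj (your description, metadata, and Clojure version will differ):

```clojure
(defproject autodjinn "0.1.0-SNAPSHOT"
  :description "Ingesting and analyzing a Gmail inbox"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-mail "0.1.6"]])
```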

We’ll start by working in src/autodjinn/core.clj and later move the functionality out into a script for our email import task. Open up the file in your favorite editor and launch a REPL.

In your REPL, (use 'autodjinn.core) and verify it worked by running (foo "MYNAME"). You should see “MYNAME Hello, World!” printed out. Feel free to remove the (defn foo…) in core.clj now. We will not need it.

You may want to use something like Emacs’ cider or LightTable’s InstaREPL as your REPL environment. But you can use the regular Clojure REPL to build this project, as well. If you are not working with a REPL integrated to your editor, you will need to run (use 'autodjinn.core :reload) to force a reload of the code each time you save.

Connecting to Gmail

Our first goal is to connect to our inbox and verify that we can read email from it. To do that, we’re going to need to use our Gmail address and password — which we don’t want to put into our source files. It’s bad practice to put a password or a private key into a source file or check it into our repo! Just don’t do it!

Instead, we will use a nice library called nomad to load a config file containing our email address and password. We will add the config file to .gitignore so that it is never saved into our code.

Add the line [jarohen/nomad "0.6.3"] to your project.clj dependencies before moving on, and run lein deps in a console to pull in the dependency.

Back in our core.clj add the require statements for clojure-mail and nomad to your ns macro like this:
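A sketch of the ns form; the clojure-mail namespaces shown are the ones we use later in this post, and nomad’s defconfig is what we’ll use to load the config:

```clojure
(ns autodjinn.core
  (:require [clojure-mail.core :refer :all]
            [clojure-mail.message :as message]
            [clojure.java.io :as io]
            [nomad :refer [defconfig]]))
```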

Then create a new file in resources/config/autodjinn-config.edn. It should look like this, with your email address and password filled in:
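Something like this, using the keys we look up below:

```clojure
;; resources/config/autodjinn-config.edn
{:gmail-username "you@gmail.com"
 :gmail-password "your-gmail-password"}
```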

Now open up your .gitignore file and add the path to that config file (resources/config/autodjinn-config.edn) so that it is never committed.

Following nomad’s README, we need to load our config file and pull out our gmail-username and gmail-password keys. We add the following to core.clj after the ns macro:
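A sketch of those additions, following the defconfig pattern from nomad’s README:

```clojure
(defconfig autodjinn-config (io/resource "config/autodjinn-config.edn"))

(def gmail-username (get (autodjinn-config) :gmail-username))
(def gmail-password (get (autodjinn-config) :gmail-password))
```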

Using the get function here is a safe lookup for maps that returns nil if nothing is found for the key. Back in our REPL, we can see this in action with some quick experimentation:

We could also use the shorter (:keyname mymap) syntax here, since symbols are an invocable function that looks up a key in a map. But the get function reads better than (:gmail-username (autodjinn-config)) in my opinion.

In your REPL, you should now be able to get the values for gmail-username and gmail-password:

Note that since I’m in the user namespace here, I had to qualify the vars with their autodjinn.core namespace. If this is confusing, you might want to read up on namespaces in Clojure before moving on. (See also: the ‘Namespaces’ section in Clojure from the ground up: logistics.)

clojure-mail requires us to open a connection to Gmail with the gen-store function (src). We then pass that connection around to various functions to interact with our inbox. Define a var called my-store in your core.clj that does this with our email address and password:
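A sketch, assuming gen-store takes the address and password in that order:

```clojure
(def my-store (gen-store gmail-username gmail-password))
```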

Make sure the (def my-store… above has been run in your REPL and then take a look at our open connection:

The type of my-store should be an IMAPSSLStore as above. If it didn’t work, you’ll see a string error message when you try to define my-store.

Your inbox as a list

Now we’ll use our REPL to build up a function that will eventually import all of our email. To start, we can use the inbox function (src) from clojure-mail to get a seq of messages in our inbox. Note that since it is a seq and inboxes can be very large, we limit it with the take function.

If everything is working, you should see a list of the IMAPMessages returned by the last line in your REPL.

What if, instead, we wanted to loop over many messages and print out their subjects? We can pull in the message namespace (src) from clojure-mail, which gives us convenience functions for getting at message data.

You’ll have to be careful running this next line — on a large inbox it’ll print out the subject of everything in your inbox! If you have a lot of messages, consider wrapping the call to inbox in a take as above.
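The line in question looks something like this (a sketch, assuming message/subject as the subject accessor and that inbox takes the store):

```clojure
;; prints the subject of every message in the inbox -- wrap the inbox call in
;; (take n ...) if your inbox is large
(doseq [msg (inbox my-store)]
  (println (message/subject msg)))
```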

Those are the subject lines of the 4 messages in the inbox of my test account, so I know that this is working. Save our doseq line into a function called ingest-inbox; we’ll come back to it later:
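A sketch of that first version:

```clojure
(defn ingest-inbox []
  (doseq [msg (inbox my-store)]
    (println (message/subject msg))))
```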

Examining messages

Before we move on, let’s take a look at an individual message and what we can get out of it from the message namespace.

From this, we can see a few things:

  • The ID returned by message/id looks like a good candidate for a unique ID for each message when we store it. But we might want to strip off those angle brackets first.
  • The message/message-body function doesn’t return a string of the body. Instead, it returns a list of maps which contains the text/plain form of the body and the text/html form. We will have to extract each from the map so that we can use the plaintext version for things like language processing. We’ll also keep the HTML version in case we need it later.
  • If you started digging in to the message namespace’s source you may have noticed that we don’t have functions for getting date sent or date received for a message. Nor can we get a list of addresses CCed or BCCed for the message. We’ll have to write those functions ourselves.

Cleaning up the IDs

Let’s focus on writing a function to clean up the ID returned by the message/id function. Recall that such IDs look like <CAJiAYR90LbbN6k8tVXuhQc8f6bZoK647ycdc7mxF5mVEaoLKHw@mail.gmail.com>

The clojure.string namespace provides a replace function which does simple replacement on a string. We can use it like this:
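For example, using a shortened, made-up ID:

```clojure
(clojure.string/replace "<example-id@mail.gmail.com>" "<" "")
;;=> "example-id@mail.gmail.com>"

(clojure.string/replace "<example-id@mail.gmail.com>" ">" "")
;;=> "<example-id@mail.gmail.com"
```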

Each call replaced one of the angle brackets. But remember that data structures are immutable in Clojure, including strings: the first replace returned a new string and left the original untouched, so the result of the second call (which also ran against the original string) still contains the leading angle bracket. We need something that allows us to build up an intermediate value and pass it along to the next function. For that, we will use the thread-first macro: ->. It is easiest if I show the macro in use, with comments showing what the intermediate value is at each step:
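A sketch with the same made-up ID as above:

```clojure
(-> "<example-id@mail.gmail.com>"
    (clojure.string/replace "<" "")   ; => "example-id@mail.gmail.com>"
    (clojure.string/replace ">" ""))  ; => "example-id@mail.gmail.com"
```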

It is called the thread-first macro because it threads the value through as the first argument to each function. In this case, clojure.string/replace’s first argument is the string to operate on, so each successive return value gets passed along as the first argument to the next call.

Now that we’ve figured out how to clean up that ID, we will create a function to clean up any ID we pass it:
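A sketch of that helper (the name clean-message-id is my placeholder; the project may call it something else):

```clojure
(defn clean-message-id
  "Strips the angle brackets from a Gmail message ID."
  [id]
  (-> id
      (clojure.string/replace "<" "")
      (clojure.string/replace ">" "")))
```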

Extracting the message bodies

Recall the message/message-body call above:

Ideally, we want to write a function that can get the text/plain body out of this value, and another function that can get the text/html body out. Notice that the :content-type values aren’t so simple that we can just select the item in the list whose value is the string text/plain. We will need our function to ignore the additional information in the :content-type value, which includes things like string encodings.

Let’s look at just the first map in the list returned by message/message-body:

If we build a predicate function that can detect when the :content-type key is the type we want, we can use it in a filter function to choose the correct type of body in our functions.

Notice that TEXT/PLAIN and TEXT/HTML are always separated from the rest of the content-type by a semicolon, and the type always appears first. You’d have to look at a few messages from your own inbox to arrive at the same conclusion, but I’ve already done that work and can assure you it holds.

Then, an easy way to get at the part of the content-type we want is to split on the semicolon and take the first element returned:
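For example:

```clojure
(first (clojure.string/split "TEXT/PLAIN; charset=utf-8" #";"))
;;=> "TEXT/PLAIN"
```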

This leads us to a function to first clean up the content-type string, and then our predicate function to detect if it is the one we want:
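A sketch of the pair; the names are placeholders, and the predicate assumes the body maps use a :content-type key as shown above:

```clojure
(defn clean-content-type
  "Drops the charset and other parameters, leaving e.g. \"TEXT/PLAIN\"."
  [content-type]
  (first (clojure.string/split content-type #";")))

(defn is-content-type?
  "True when a body map's :content-type matches the wanted type."
  [wanted body-map]
  (= wanted (clean-content-type (:content-type body-map))))
```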

To finish off our work on the message bodies, we want to filter the list returned by message/message-body:
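Using the predicate from above against a single message (my-msg here stands for the message we’ve been inspecting):

```clojure
(filter (partial is-content-type? "TEXT/PLAIN")
        (message/message-body my-msg))
```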

And turn it into a function that works for any message bodies list:
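A sketch of the general function and the two convenience functions built on it; it assumes each body map keeps its text under a :body key, which may differ in your clojure-mail version:

```clojure
(defn get-body-of-type
  "Returns the body text of the first part matching the given content type."
  [wanted bodies]
  (:body (first (filter (partial is-content-type? wanted) bodies))))

(def get-plaintext-body (partial get-body-of-type "TEXT/PLAIN"))
(def get-html-body      (partial get-body-of-type "TEXT/HTML"))
```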

Note that we’ve also used this function to create two convenience functions, one for extracting plaintext bodies and one for extracting HTML bodies. By keeping functions simple and small, we can build up useful functions for our project rather than try to plan it all out ahead of time.

Getting more information out of the IMAPMessages

As noted above, we will need to write a few more functions to get the fields of the IMAPMessages that we cannot get through this version of clojure-mail. Recall that we want to get CC list, BCC list, date sent, and date received values. To do that, we will use Java interop functionality. It’s really not as bad as it sounds. Remember that the IMAPMessages we see are Java instances of the IMAPMessage class. Calling a method on an instance is accomplished by using a dot before the method name, with the method in the function position, such as: (.javaMethod some-java-instance)

To start, we can look at clojure-mail’s project.clj and see that it depends on javax.mail. The next step is to find the documentation for the Java implementation of javax.mail.Message, which lives here.

In the REPL, we can try some of the Java interop on our my-msg:
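For example, the sent and received dates come straight from the javax.mail.Message methods:

```clojure
(.getSentDate my-msg)     ; when the message was sent
(.getReceivedDate my-msg) ; when the message arrived
```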

The datetimes for each message are automatically turned into Clojure instants for us, which is convenient. If we dig into how the clojure-mail.message/to function [src] works, we see that it is using the .getRecipients method, which is called on the message and takes a RecipientType constant. For our purposes, we want the javax.mail.Message$RecipientType/CC and javax.mail.Message$RecipientType/BCC recipients:
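In the REPL, that looks something like this:

```clojure
(.getRecipients my-msg javax.mail.Message$RecipientType/CC)
(.getRecipients my-msg javax.mail.Message$RecipientType/BCC)
(map str (.getRecipients my-msg javax.mail.Message$RecipientType/BCC))
```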

The last line maps the str function across each element returned, so that we get the string representation of the email addresses. That way, our database can just store the strings.

As before, now that we know how to use these methods in the REPL, we write functions in core.clj to take advantage of our newfound knowledge:
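Sketches of those helpers (the names are placeholders):

```clojure
(defn cc-list [msg]
  (map str (.getRecipients msg javax.mail.Message$RecipientType/CC)))

(defn bcc-list [msg]
  (map str (.getRecipients msg javax.mail.Message$RecipientType/BCC)))

(defn date-sent [msg]
  (.getSentDate msg))

(defn date-received [msg]
  (.getReceivedDate msg))
```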

In the REPL, it should now be possible to get a nice map representation of all the fields on the message we care about:

Congrats on making it this far. We’ve used quite a few neat little features of Clojure and the libraries we’re building this project with to get here.

The last step we’ll go through in this post is to get these messages into a database.

Enter Datomic, the immutable datastore

Datomic is a great database layer built on Clojure that gives us a database value representing immutable data. New transactions on the database create new database values. It fits very well with Clojure’s own concept of state and identity because it was designed by the same folks as Clojure. Plus, Datomic is meant to grow and scale in modern environments like AWS, with many backend datastore options to run it on.

There’s some important reasons why you might choose Datomic as your database for a data science / machine learning application:

  • There are various storage backends, so you can grow from tens of thousands of rows in PostgreSQL on a developer’s laptop to millions of records (or more) in Riak or DynamoDB on AWS. That is, it has a good migration path from small datasets to big data through the Datomic import/export process
  • The concept of time associated with each value in Datomic means that we can query for historical data to compare against
  • Datomic has a lightweight schema compared to a relational database like PostgreSQL. Schemas are just data! When we begin computing new values from our dataset, we can add new types of entities easily at the same time.
  • Datomic’s schemas allow us to treat it as a key-value store, relational database, or even build a graph store on top of it, if we need to

Note: I won’t go through setting up an entire Datomic installation here. It’s worth reading up on the docs and the rationale behind Datomic’s design.

You can get the Datomic free build if you like, but you will be limited to in-memory stores. It is unlikely that your Gmail inbox will fit into memory on your dev machine. Instead, I recommend signing up for the free Datomic Pro Starter Edition. (The free Starter Edition is fine because you will not be using this project in a commercial capacity.) Once you have Datomic Pro downloaded and installed in your local Maven, I recommend using the PostgreSQL storage adapter locally with memcached. Follow the guides for configuring storage on the Datomic Storage page.

Add the correct line to your project.clj dependencies for the version of Datomic you’ll be using (mine was [com.datomic/datomic-pro "0.9.4384"] which might be a bit out of date and likely won’t match yours.) Now we can start using Datomic in our core.clj and our REPL.

The first thing we need is the URI where the Datomic database lives. When we start up the Datomic transactor, you will see a DB URI that looks something like datomic:sql://DBNAMEHERE?jdbc:postgresql://localhost:5432/datomic?user=datomic&password=datomic in the output. Grab that URI and add it to our resources/config/autodjinn-config.edn:

Back at the top of core.clj, save that value to a var as we did with gmail-username and gmail-password:
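A sketch, assuming the URI went into the config file under a :db-uri key:

```clojure
(def db-uri (get (autodjinn-config) :db-uri))
```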

And then in the REPL:

Note that according to the datomic clojure docs for the create-database function, it returns true if the database was created, and false if it already exists. So running create-database every time we run our script is safe, since it won’t destroy data.

If the REPL work above doesn’t succeed, it is likely your code is unable to talk to your running Datomic, or your Datomic transactor is not configured correctly. Diagnose it by Googling and reading the docs until you get it to work, then move on.

Calling (d/db db-connection) gives us the current value of our database. In most cases, we just want the most current value, so we can write a convenience function new-db-val that always gets us the current (and possibly different) database value. But there are cases where we want to coordinate several queries against the same database value. In those cases, we won’t use the function to get the latest database value, but rather pass one database value to all of the queries so that they all query against the same state.

In our core.clj, we can add the code we need to create the database, get our connection, and the convenience new-db-val function:
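A sketch of those three pieces, assuming datomic.api is required as d:

```clojure
(d/create-database db-uri)

(def db-connection (d/connect db-uri))

(defn new-db-val [] (d/db db-connection))
```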

Next, we need to tell Datomic about the schema of our data. Schemas are just data that you run as a transaction on the database. Reading up on the Schema page of the Datomic docs might be helpful to understand what’s going on here. The short version is that we define each attribute of an email and set up its properties. The collection of all attributes together will constitute a mail entity, so we namespace all the attributes under the :mail/ namespace.
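An abbreviated sketch of the schema var; the real thing defines more :mail/* attributes (bodies, to/from/cc/bcc lists, dates) in the same shape:

```clojure
(def mail-schema
  [{:db/id                 (d/tempid :db.part/db)
    :db/ident              :mail/uid
    :db/valueType          :db.type/string
    :db/cardinality        :db.cardinality/one
    :db/unique             :db.unique/identity
    :db.install/_attribute :db.part/db}
   {:db/id                 (d/tempid :db.part/db)
    :db/ident              :mail/subject
    :db/valueType          :db.type/string
    :db/cardinality        :db.cardinality/one
    :db.install/_attribute :db.part/db}])
```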

We add that var def to our core.clj because it is, after all, just data. We may choose later to move it to its own edn file, but for now, it can live in our source code. Next, we want to apply this schema to our database with a transaction. That looks like this:
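For example:

```clojure
@(d/transact db-connection mail-schema)
```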

Now we put that transaction in a convenience function in core.clj that we’ll run every time we run this file. The function will ensure that our database has ‘converged’ to this schema. Running the transaction creates a new database value, but running it many times will not blow away any data already in the database: it simply tries to update the existing attributes, and nothing in the attributes themselves needs to change. This makes it much safer to work with data without worrying that we will destroy it, and it encourages a REPL-based exploration of the data and its history.

Now that our mail entities are defined in Datomic, we can try a query to find all the entity-IDs where any :mail/uid value is present. Read up on the Query page of the Datomic docs to dig into querying deeper. You might also be interested in the excellent Learn Datalog Today website to learn more about querying Datomic with Datalog.
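A sketch of that query:

```clojure
(d/q '[:find ?eid
       :where [?eid :mail/uid]]
     (new-db-val))
```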

Since we have no mail entities in our database, Datomic returns an empty set. So now we reach the end of our task: we can ingest some emails and save them in our database! Return to the ingest-inbox function that we left before. Here’s what the updated version will look like:
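A rough sketch of the shape it takes; only a few of the :mail/* attributes are shown, and the helper names match the sketches earlier in this post:

```clojure
(defn ingest-inbox []
  (doseq [msg (inbox my-store)]
    @(d/transact db-connection
                 [{:db/id        (d/tempid :db.part/user)
                   :mail/uid     (clean-message-id (message/id msg))
                   :mail/subject (message/subject msg)
                   ;; ...plus the rest of the :mail/* attributes
                   }])))
```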

We use the @-sign before the (d/transact…) call because Datomic normally returns a promise of the completed transaction. However, we want to force Datomic to complete each transaction before moving on by deref-ing it with the @-sign. Per the Clojure docs: “Calls to deref/@ prior to delivery will block.”

If you run this function in your REPL, you should see it start to ingest your email from Gmail!

Note that this could take a long time if you’ve chosen to import a really large Gmail inbox! You might want to stop the import at some point; in most REPLs Ctrl-c will stop the running function.

If we query for our entity-IDs again, as above, we should see some values returned!

What does one of those database entities look like when we run it through Datomic’s entity and touch functions to instantiate all its attributes?
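Something like this, reusing the query above to grab an entity id:

```clojure
(let [eid (ffirst (d/q '[:find ?eid :where [?eid :mail/uid]]
                       (new-db-val)))]
  (d/touch (d/entity (new-db-val) eid)))
```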

Wrapping up

That’s it for this blog post. It took a little setup, but we were able to build up a working Gmail import tool with help from our REPL and some nice Clojure libraries.

Next time, we’ll be looking at doing some basic querying of the data, including getting a count of the number of times each email address has sent you an email.

Comments? Questions? Feel free to contact me at contact@mattgauger.com. I’d love to hear from you.


1 In this case, machine learning features, which are the input variables for our learning tasks, not software features that a client might ask us to implement. See: Feature learning - Wikipedia, the free encyclopedia.
