Clojure Data Science: Sent Counts and Aggregates


This is Part 3 of a series of blog posts called Clojure Data Science. Check out the previous post if you missed it.


For this post, we want to generate some summaries of our data by doing aggregate queries. We won’t yet be pulling tools like Apache Storm into the mix, since we can accomplish this with Datomic queries alone. We will also talk about the trade-offs of running aggregate queries on large datasets and devise a way to save our computed results back to Datomic.

Updating dependencies

It has been some time since we worked on autodjinn. Libraries move fast in the Clojure ecosystem, and we want to make sure that we’re developing against the most recent versions of each dependency. Before we begin making changes, let’s update everything. If you have already read my [Clojure Code Quality Tools](/blog/2014/09/15/clojure-code-quality-tools/) post, you’ll be familiar with the lein ancient plugin.

Below is output when I run lein ancient on the last post’s finished git tag, v0.1.1. To go back to that state, you can run git checkout v0.1.1 on the autodjinn repo.

It looks like our nomad dependency is out of date. Update the version number in project.clj to 0.7.0 and run lein ancient again to verify that it worked.

If you take a look at project.clj yourself, you may notice that our project is still on Clojure 1.5.1. lein ancient doesn’t check the version of Clojure we’re specifying; it assumes you have a good reason for picking that version. In our case, we’d like to be on the latest stable Clojure, version 1.6.0. Update the version of Clojure in project.clj and then run your REPL. There should be no issues with using the functionality we built in previous posts. If there are, carefully read the error messages and try to find a solution before moving on.

To save on the hassle of upgrading, I have created a tag for the project after upgrading Clojure and nomad. To go to that tag in your local copy of the repo, run git checkout v0.1.2.

Datomic query refresher

If you remember back to the first post, we wrapped up by querying for entity IDs and then using Datomic’s built-in entity and touch functions to instantiate each message with all of its attributes. We had to do this because the query itself only returned a set of entity IDs:
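
Reconstructed here as a sketch (conn stands in for the Datomic connection from our core namespace, db is a database value from it, and the entity IDs are illustrative):

(def db (d/db conn))

(d/q '[:find ?eid
       :where [?eid :mail/uid _]]
     db)
;; => #{[17592186045418] [17592186045421] ...}

;; realize each message from its entity ID:
(map #(d/touch (d/entity db (first %)))
     (d/q '[:find ?eid :where [?eid :mail/uid _]] db))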

Note that the Datomic query is made up of several parts:

  • The :find clause says what will be returned. In this case, it is the ?eid variable for each record we matched in the rest of the query.
  • The :where clause gives a condition to match. In this case, we want all ?eid where the entity has a :mail/uid fact, but we don’t care about the :mail/uid fact’s value, so we give it a wildcard with the underscore (_).

We could pass in the :mail/uid we care about, and only get one message’s entity-ID back.
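
A sketch, with a placeholder UID value:

(d/q '[:find ?eid
       :in $ ?uid
       :where [?eid :mail/uid ?uid]]
     db
     "<some-message-uid>")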

Notice how the ?uid variable gets passed in with the :in clause, as the third argument to d/q?

Or we could change the query to match on other attributes:
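
For instance, matching on the subject (a sketch; it assumes the :mail/subject attribute from our ingestion schema, and the subject string is a placeholder):

(d/q '[:find ?eid
       :in $ ?subject
       :where [?eid :mail/subject ?subject]]
     db
     "Lunch on Friday?")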

In all these cases, we’d still get the entity IDs back because the :find clause tells Datomic to return ?eid. Typically, we pass around entity IDs and lazy-load any facts (attributes) that we need off that entity.

But, we could just as easily return other attributes from an entity as part of a query. Let’s ask for the recipients of all the emails in our system:
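
A sketch, with illustrative addresses in the result:

(d/q '[:find ?to
       :where [_ :mail/to ?to]]
     db)
;; => #{["alice@example.com"] ["bob@example.com"] ...}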

While it is less common to return only the value of an entity’s attribute, being able to do so will allow us to build more functionality on top of our email abstraction later.

One last thing. Take a look at the return of that query above. Remember that the results returned by a Datomic query are a set. In Clojure, sets are a collection of unique values. So we’re seeing the unique list of addresses that are in the To: field in our data. What we’re not seeing is duplicate recipient addresses. To be able to count the number of times an email address received a message, we’ll need a list with non-unique members.

Datomic creates a unique set for the values returned by a query. This is generally a great thing, since it gets around some of the issues that one can run into with JOINing in SQL. But in this case, it is not ideal for what we want to accomplish. We could try to get around the uniqueness constraint on output by returning vectors of the entity ID and the ?to address, and then mapping across the result to pull out the second item:
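
That workaround would look something like this sketch:

(map second
     (d/q '[:find ?eid ?to
            :where [?eid :mail/to ?to]]
          db))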

There’s a simpler way that we can use in the Datomic query. By keeping it inside Datomic, we can later combine this approach with more-complex queries. We can tell the Datomic query to look at other attributes when considering what the unique key is by passing the query a :with clause. By changing our query slightly to include a :with clause, we end up with the full list of recipients in our datastore:
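
A sketch of the :with version (note that the result is now a bag rather than a set, so duplicates survive):

(d/q '[:find ?to
       :with ?eid
       :where [?eid :mail/to ?to]]
     db)
;; => [["alice@example.com"] ["alice@example.com"] ["bob@example.com"] ...]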

At this point, it might be a good idea to review Datomic’s querying guide. We’ll be using some of the advanced querying features found in the later sections of that guide, most notably aggregate functions.

Sent Counts

For this feature, we want to find all the pairs of from-to addresses for each email in our datastore, and then sum up the counts for each pair. We will save all these sent counts into a new entity type in Datomic. This will allow us to ask Datomic questions like who sends you the most email, and who you send the most email to.

We start by building up the query in our REPL. Let’s start with a simpler query, to count how many emails have been sent to each email address in our data store. Note that this isn’t sufficient to answer the question above, since we won’t know who those emails came from; they could have been sent by us or by someone else, or they could have been sent to us. Later, we’ll make it work with from-to pairs that allow us to know things like who is sending email to us.

A simple way to do this would be to wrap our previous query in the frequencies function that clojure.core provides. frequencies returns a map from each distinct item in a collection to the number of times it appears.
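
Sketched with illustrative counts:

(frequencies
 (map first
      (d/q '[:find ?to
             :with ?eid
             :where [?eid :mail/to ?to]]
           db)))
;; => {"alice@example.com" 42, "bob@example.com" 7, ...}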

However, we want to perform the same sort of thing in Datomic itself. To do that, we’re going to need to know about aggregate functions. Aggregate functions operate over the intermediate results of a Datomic query. Datomic provides functions like max, min, sum, count, rand (for getting a random value out of the query results), and more. With aggregates, we need to be sure to use a :with clause to ensure we aggregate over all our values.

Looking at that short list of aggregate functions I’ve named, we can see that we probably want to use the count function to count the occurrence of each email address in a to field in our data. To see how aggregates work, I’ve come up with a simpler example. The only new thing to know is that Datomic’s Datalog implementation can query across Clojure collections as easily as it can against a database value, so I’ve given a simple vector-of-vectors here to describe data in the form

[database-id person-name]

When the query looks at records in the data, our :where clause binds an id and a name based on their positions in each vector. Here’s the example, reconstructed as a sketch:
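
(d/q '[:find ?name (count ?name)
       :with ?id
       :where [?id ?name]]
     [[1 "Jon"]
      [2 "Jon"]
      [3 "Bob"]
      [4 "Chris"]])
;; => [["Jon" 2] ["Bob" 1] ["Chris" 1]]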

Let’s review what happened there. Before the count aggregate function was applied, our results looked like this:

[["Jon"] ["Jon"] ["Bob"] ["Chris"]]

So the count function just counts across the values of the variable it is passed (in our case, ?name), and by pairing it with the original ?name value, we get each name and the number of times it appears in our dataset.

It makes sense that we can do the same thing with our recipient email addresses from the previous query. Combining our previous queries with the count aggregate function, we get:
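
Reconstructed as a sketch:

(d/q '[:find ?to (count ?to)
       :with ?eid
       :where [?eid :mail/to ?to]]
     db)
;; => [["alice@example.com" 42] ["bob@example.com" 7] ...]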

That looks like the same kind of data we were getting with the use of the frequencies function before! So now we know how to use a Datomic aggregate function to count results in our queries.

What’s next? Well, what we really want is to get results that are of the form

[from-address to-address]

and count those tuples. That way, we can differentiate between email sent to us versus email we’ve sent to others, etc. And eventually, we’d like to save those queries off as functions that we can call to compute the counts from other places in our project.

We can’t pass a tuple like [from-address to-address] to the count aggregate function in one query. The way around this is to write two queries. The inner query will return the tuples, and the outer query will return the tuple and a count of the tuple in the output data. Since the queries run on the peer, we don’t really have to worry about whether it is one query or two, just that it returns the correct data at the end.

So what would the inner query look like? Remember that the outer query will still need a field to pass to the :with clause, so we’ll probably want to pass through the entity ID.
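
Something like this sketch:

(d/q '[:find ?eid ?from ?to
       :where [?eid :mail/from ?from]
              [?eid :mail/to ?to]]
     db)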

Those tuples will be used by our outer query. However, we also need a combined value for the count to operate on. For that, we can throw in a function call in the :where clause and give it a binding at the end for Datomic to use for that new value. In this case, I’ll combine the ?from and ?to values into a PersistentVector that the count aggregate function can use. The combined query ends up looking like this:
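
Reconstructed as a sketch, with the inner query feeding the outer one:

(d/q '[:find ?tuple (count ?tuple)
       :with ?eid
       :where [?eid ?from ?to]
              [(vector ?from ?to) ?tuple]]
     (d/q '[:find ?eid ?from ?to
            :where [?eid :mail/from ?from]
                   [?eid :mail/to ?to]]
          db))
;; => [[["me@example.com" "alice@example.com"] 12] ...]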

And the output is as we expect.

Reusable functions

The next step is to turn the query above into various functions we can use to query for from-to counts later. In our data, we don’t just have recipients in the To: field; we also have CC and BCC recipients. Those fields will need their own variations of the query function, but since they share so much functionality, we will compose our functions in a way that avoids duplicate code.

In general, when I write query functions for Datomic, I use multiple arities to always allow a database value to be passed to the query function. This can be useful, for example, when we want to query against previous (historical) values of the database, or when we want to work with a particular database value across multiple queries, to ensure our data is consistent and doesn’t change between queries.

Such a query function typically looks like this:
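
A sketch, using the to-addresses query from above (conn again stands in for the Datomic connection def’d in our core namespace, and the function name is illustrative):

(defn all-to-addresses
  ([] (all-to-addresses (d/db conn)))
  ([db]
   (d/q '[:find ?to
          :with ?eid
          :where [?eid :mail/to ?to]]
        db)))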

By taking advantage of multiple arities, we can default to not having to pass a database value into the function. But in the cases where we do need to ensure a particular database version is used, we can do that. This is a very powerful idiom that I’ve learned since I began to use Datomic, and I suggest you structure all your query functions similarly.

Now, let’s take that function that only queries for :mail/to addresses and make it more generic, with specific wrapper functions for each case where we’d want to use it:
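
Reconstructed as a sketch (the function names are illustrative):

(defn- sent-counts-for-attr
  ([attr] (sent-counts-for-attr (d/db conn) attr))
  ([db attr]
   (d/q '[:find ?tuple (count ?tuple)
          :with ?eid
          :where [?eid ?from ?to]
                 [(vector ?from ?to) ?tuple]]
        (d/q '[:find ?eid ?from ?to
               :in $ ?attr
               :where [?eid :mail/from ?from]
                      [?eid ?attr ?to]]
             db attr))))

(defn to-sent-counts
  ([] (sent-counts-for-attr :mail/to))
  ([db] (sent-counts-for-attr db :mail/to)))

(defn cc-sent-counts
  ([] (sent-counts-for-attr :mail/cc))
  ([db] (sent-counts-for-attr db :mail/cc)))

(defn bcc-sent-counts
  ([] (sent-counts-for-attr :mail/bcc))
  ([db] (sent-counts-for-attr db :mail/bcc)))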

Note that we had to change the inner query to take the attr we want to query on as a variable; this is the proper way to pass a piece of data into a query we want to run. The $ that comes first in the :in clause tells Datomic to use the second d/q argument as our dataset (the db value we pass in), and the ?attr tells it to bind the third d/q argument as the variable ?attr.

While the three wrapper functions are similar, the shared query logic lives in one place, keeping the code DRY. (DRY is an acronym for Don’t Repeat Yourself.) In the long run, less code should mean fewer bugs and the ability to fix problems in one place.

Building complex systems by composing functions is one of the features of Clojure that I enjoy the most! And notice how we got to these finished query functions by building up functionality in our REPL: another aspect of writing systems in Clojure that I appreciate.

Querying against large data sets

Right now, our functions calculate the sent counts across all messages every time they’re called. This is fine for the small sample dataset I’ve been working with locally, but if it were to run against the 35K+ messages that are in my Gmail inbox alone (not to mention all the labels and other places my email lives…) it would take a very long time. With even bigger datasets, we can run into an additional problem: the results may not fit into memory.

When building systems with datasets big enough that they don’t fit into memory, or that may take too much time to compute to be practical, there are two general approaches that we will explore. The first is storing results as data (known as memoizing or caching the results), and the other is breaking up the work to run on distributed systems like Hadoop or Apache Storm.

For this data, we only want to avoid redoing the calculation every time we want to know the sent counts. Currently, the data in our system changes infrequently, and it’s likely that we could tell the system to recompute sent counts only after ingesting new data from Gmail. For these reasons, a reasonable solution is to store the computed sent counts back into Datomic.

A new entity type to store our results

For all three query functions we wrote, each result is of the form:

[from-address to-address count]

Let’s add to the Datomic schema in our core.clj file to create a new :sent-count entity type with these three attributes. Note that sent counts don’t really have a unique identifier of their own; it is the combination of from -> to addresses that uniquely identifies them. However, we will leave the from and to addresses as separate fields so it is easy to use them in queries.

Add the following maps to the schema-txn vector:
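
A sketch of those schema maps, in the classic Datomic schema style we’ve used so far (the attribute names follow the prose above; :sent-count/count is my choice of name for the count field):

{:db/id (d/tempid :db.part/db)
 :db/ident :sent-count/from
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db.install/_attribute :db.part/db}
{:db/id (d/tempid :db.part/db)
 :db/ident :sent-count/to
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db.install/_attribute :db.part/db}
{:db/id (d/tempid :db.part/db)
 :db/ident :sent-count/count
 :db/valueType :db.type/long
 :db/cardinality :db.cardinality/one
 :db.install/_attribute :db.part/db}
{:db/id (d/tempid :db.part/db)
 :db/ident :sent-count/type
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/one
 :db.install/_attribute :db.part/db}

;; and the enum values that :sent-count/type can point at:
{:db/id (d/tempid :db.part/user) :db/ident :sent-count.type/to}
{:db/id (d/tempid :db.part/user) :db/ident :sent-count.type/cc}
{:db/id (d/tempid :db.part/user) :db/ident :sent-count.type/bcc}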

You’ll have to call the update-schema function in your REPL to run the schema transaction.

Something that’s worth calling out is that we’re using a Datomic schema valueType that we haven’t seen yet in this project: db.type/ref. In most cases, you’d want to use the ref type to associate with other entities in Datomic. But we can also use it to associate with a given list of facts. Here, we give the ref type an enum of the possible values that :sent-count/type can have: to, cc, and bcc. By adding this type field to our new entities, we can either choose to look at sent counts for only one type of address, or we can sum up all the counts for a given from-to pair and get the total counts for the system.

Our next job is to add some functions to create the initial sent counts data, as well as to query for it. To keep things clean, I created a sent-counts namespace for these functions to live in. I’ve provided it below with minimal explanation, since it should look very similar to what we’ve already done.

/src/autodjinn/sent_counts.clj
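
A sketch of that namespace (reconstructed; it assumes conn and the sent-count query functions like to-sent-counts come in from autodjinn.core via :refer :all):

(ns autodjinn.sent-counts
  (:require [autodjinn.core :refer :all]
            [datomic.api :as d]))

(defn sent-counts-for-type
  ([type-kw] (sent-counts-for-type (d/db conn) type-kw))
  ([db type-kw]
   (d/q '[:find ?from ?to ?count
          :in $ ?type
          :where [?e :sent-count/type ?type]
                 [?e :sent-count/from ?from]
                 [?e :sent-count/to ?to]
                 [?e :sent-count/count ?count]]
        db type-kw)))

(defn- transact-counts [type-kw counts]
  ;; each item in counts is a [[from to] n] tuple from our query functions
  (doseq [[[from to] n] counts]
    @(d/transact conn
                 [{:db/id (d/tempid :db.part/user)
                   :sent-count/from  from
                   :sent-count/to    to
                   :sent-count/count n
                   :sent-count/type  type-kw}])))

(defn create-sent-counts []
  (transact-counts :sent-count.type/to  (to-sent-counts))
  (transact-counts :sent-count.type/cc  (cc-sent-counts))
  (transact-counts :sent-count.type/bcc (bcc-sent-counts)))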

After adding in the sent_counts.clj file, running:

(sent-counts/create-sent-counts)

will populate your datastore with the sent counts computed by the functions we created earlier.

Note: The sent counts don’t have any sort of unique key on them, so if you run create-sent-counts multiple times, you’ll get duplicate results. We’ll handle that another time when we need to update our data.

Wrapping up

We’ve covered a lot of material on querying Datomic. In particular, we used aggregate functions to get the counts and sums of records in our data store. Because we don’t want to run the queries all the time, we created a new entity type to store our sent counts and saved our data into it. With query functions like those found in the sent-counts namespace, we can start to ask our data questions like “In the dataset, what address was sent the most email?”

If you want to compare what you’ve done with my version, you can run git diff v0.1.3 on the autodjinn repo.

Please let me know what you think of these posts by sending me an email at contact@mattgauger.com. I’d love to hear from you!


Clojure Code Quality Tools

I work with many programming languages on a daily basis. As a polyglot programmer, I’ve come to appreciate tools that help me follow best practices. For JavaScript, there’s the excellent jshint. When I need to verify some XML, there’s xmllint. In a Ruby on Rails project, I can count on the rails_best_practices gem. For Ruby smells, I reach for rubocop. There are tools like SimpleCov to measure test coverage on my Ruby projects. cane helps me enforce line length, method complexity, and more in my Ruby code. Syntastic brings real syntax checking to vim for many languages. Every day, more open source tools are introduced that help me improve the quality of the software I write.

It follows that when I write Clojure code, I want nice tooling to help me manage code quality, namespace management, and out-of-date dependencies. What tools do I use on a day-to-day basis for this? In this post, I’ll show 5 tools that I use in my workflow every day on Clojure projects, and also provide some other tools for further exploration. Most of these tools exist as plugins to the excellent Leiningen tool for Clojure.

lein deps :tree

In the past, lein deps was a command that downloaded the correct versions of your project’s dependencies. Running lein deps is no longer necessary, as each lein command now checks for dependencies before executing. But deps provides an interesting variant for our uses: lein deps :tree.

The :tree keyword at the end instructs lein to print out your project’s dependencies as a tree. The tree itself is a good visualization, but not what we’re looking for here. The tree command will first print out any dependencies-of-dependencies that have conflicts with other dependencies. For example, here’s what lein deps :tree says for one of my projects:

As you can see, the tool identifies dependencies that request conflicting versions, and suggests how we can modify our project.clj file to resolve those conflicts by excluding one version or the other. This isn’t always very useful, but when you run into issues because two different Clojure libraries require two wildly different joda-time versions (a situation I have run into before), it is good to know which dependencies are causing the issue and how you might go about resolving it.

Note that this functionality disappeared in Leiningen 2.4.3 but is back in 2.5.0, so make sure you run lein upgrade!

lein-ancient

This plugin to lein exists simply to check your project for outdated dependencies. Without lein-ancient, I’d be unable to keep up with some of the faster-moving libraries in the Java and Clojure world.

After adding ancient to your ~/.lein/profiles.clj, running the lein ancient command yields output on the same project as before:

Whoops! Looks like I haven’t been keeping up to date with my dependencies. lein ancient makes checking for new dependency versions easy. Further, thanks to the ubiquity of semantic versioning in Clojure projects, it is usually quite safe to bump the minor versions (0.0.x) of dependencies.

You can also use lein-ancient to find outdated lein plugins in your ~/.lein/profiles.clj file. Just run it with the profiles argument:
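
That looks like:

lein ancient profiles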

lein kibit

As we gain experience and confidence in a programming language, we begin to talk about whether we’re writing idiomatic code. I’d argue that idiomatic code is code that accomplishes a goal with proper use of language features, in a way that other developers familiar with that language would understand. A simpler way to say it might be: idiomatic code uses the community-accepted best practices of how to do something.

Clojure’s design seeks to solve some problems found in older Lisps, as well as add in niceties like complementary predicate functions. A good example of these convenient complementary functions is the pair if and if-not. Clojure also contains several cases of simplification for common usage. For example, when you don’t need an else clause on an if, you can use the when macro.

Wouldn’t it be great if there was someone who was well-versed in Clojure idioms pairing with you and offering suggestions? That’s exactly what kibit does.

Running against a project I’d set up to contain some smells, lein kibit found:

These kinds of small improvements are all over our Clojure projects. They’re not show-stopper bugs, but they’re small places for improvement.

Kibit’s suggestions are almost always logically equivalent to the original code. Still, I always do some smoke-testing to ensure the code still works after applying a suggestion, and it generally does. Problems I frequently fix with Kibit include if statements that should be the when macro, clumsy checks for empty seqs, and nil checks that can be simplified.

You can point lein kibit at a specific namespace by appending the path, like this: lein kibit src/foo/bar.clj

Kibit catches many cases where there is a more-idiomatic way to express what you are trying to do. I recommend running it often. In fact, it’s possible to use kibit in your emacs buffers if you want it to be that much more convenient and real-time.

Eastwood

For linting Clojure code, there’s Eastwood. It is similar in functionality to Kibit, but it will catch different issues. Built on two interesting Clojure projects, tools.analyzer and tools.analyzer.jvm, Eastwood performs a powerful examination of your code inside the JVM. It is worth highlighting that since Eastwood loads your code to analyze it, it might trigger any side effects that happen when your code loads: writing files, modifying databases, etc. Note that it only loads the code; it does not execute it.

After adding eastwood to your lein profiles.clj, simply run: lein eastwood and you will see output like:

That’s a lot of problems for a simple file! Notice how one mistake got caught for two reasons: a misplaced docstring (placed after the arguments vector) becomes just a string in the function body that will be thrown away.

Another nice catch that Eastwood provides is detecting the redefinition of the var qux in the file.

But Eastwood covers a lot more cases than just vars being def’d more than once. See the full list to find out what else it does. There are a few linters that are disabled by default, but they might make sense to enable for your project.

Frequently running lint tools can help prevent subtle problems that come from code that looks correct but contains some small error. Eastwood is less concerned with style than tools like JSHint are, but we have other tools that cover stylistic concerns.

lein bikeshed

This is a relative newcomer to my own tool set. lein bikeshed has features related to the low-hanging fruit in our Clojure code: lines longer than 80 characters, blank lines at ends of files, and more. It will also tell you what percentage of functions have docstrings. Like other tools mentioned here, it is a lein plugin that you add to your profiles.clj.

A run of lein bikeshed on its own source (which purposefully includes some code designed to fail) looks like this:

Bikeshed might give a lot of output for your existing projects, but the warnings are worth investigating and addressing. You can always silence the long-lines warning with the -m command-line argument if it doesn’t matter to you.

Tying it all together with a Lein alias

Wouldn’t it be great to run all these tools frequently, so that you can check for as many problems as possible? Well, you can, with a lein alias. (The lein wiki documents aliases in the lein sample.project.clj.)

In ~/.lein/profiles.clj, inside your :user map, add the line:
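
Something like this sketch (pick whichever of the plugins you use; I’ve left eastwood out, as noted below):

:aliases {"omni" ["do" ["deps" ":tree"] ["ancient"] ["kibit"] ["bikeshed"]]}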

Now, when you want to run all these tools at once on a project, you simply invoke lein omni. I use this alias on all my Clojure(Script) projects. I have grown accustomed to seeing the kinds of output that a clean Clojure project will have.

It’s worth noting that I don’t run Eastwood unless it is necessary for the project. When it is necessary, I override the alias in the project’s project.clj to run Eastwood as well.

This command can take some time to complete, but with an alias we’re only spinning up lein once.

And a bash alias

The output of lein omni can be long, which can either result in a lot of scrolling or neglecting to run the command due to the inconvenience. To help manage the length of the output, I’ve created a bash alias that runs the plugins and pipes them to less.
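
A sketch (the alias name and exact command list are to taste):

alias lomni='(lein omni; lein midje) 2>&1 | less'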

My personal bash alias also runs midje at the end. You can choose whether to run the tests for your own alias. That’s just my personal preference.

Note that just like running the lein alias above, this may take a bit of time. Since we’re piping it to less, it might take a while before less receives output. While it is still running, output will periodically show up at the bottom of the less buffer. You can use both Emacs’s and vim’s movement commands in less to advance the buffer. I find less more manageable for scrolling through output than switching to tmux’s history scrolling mode.

Managing your namespaces: lein slamhound

Namespace management often becomes an issue on nontrivial Clojure projects. Actively developing a project means managing the functions we pull in from other namespaces and from libraries. These require statements can often get out of date: they may be missing namespaces that are needed, or contain requires for functions that are no longer used in the current code.

slamhound is a tool that can help to manage dependencies in your namespaces. It knows how to require and import Clojure and Java dependencies, and can remove stale requires that are no longer necessary. Slamhound can often fix missing requires for functions that it can resolve.

Note: slamhound rewrites the namespace macros in your project’s .clj files! I recommend only running it on code that’s committed to git (or whatever you use as a VCS) so that you can review and rollback any changes it makes.

The most basic way to use slamhound is to add it to your ~/.lein/profiles.clj as a dependency. Then add this alias:
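
Per the slamhound README, the dependency goes in :dependencies and the alias looks like this (the version number here is illustrative; check the README for the current one):

;; in the :user profile map of ~/.lein/profiles.clj
:dependencies [[slamhound "1.5.5"]]
:aliases      {"slamhound" ["run" "-m" "slam.hound"]}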

Now you can use slamhound on a project by running lein slamhound in the project’s directory. There’s also REPL and Emacs support, which you can learn more about in the slamhound README.

Measuring test coverage with cloverage

It is often claimed that less unit testing is necessary in Clojure because Clojure is functional and makes use of immutable data structures. And it is true that with functional programming, most tests are simple: given some input, the output should be a certain value.

Some would even argue that Clojure functions should be well-factored enough into simple functions that the behavior of the function is apparent and requires no tests. Still others maintain that developing in the REPL is as good as writing unit tests, since functions are constantly evaluated and integrated with this style of development.

That said, there’s still mutable Java code to interop with, there’s still the necessary evil of functions with side effects, and we might want to check the structure of the data we’re producing in our functions rather than the value of it. For all those reasons and to check that I don’t introduce regressions, I tend to write unit tests in Clojure.

This blog post isn’t a platform to argue for or against testing Clojure. But when you do test, you may wonder how to tell how much test coverage your test suite has. How do we know at a glance what percentage of our namespaces is being tested? And how do we find lines that are never being exercised in our tests? After all, we can’t improve what we don’t measure.

That’s where cloverage comes in. Cloverage is another lein plugin, so it gets added to ~/.lein/profiles.clj like the others. Then run lein cloverage in your project; it will run the test suite and generate a coverage report.

The coverage report appears in target/coverage as HTML files, broken down by namespace.

You can still use Cloverage even if you don’t use clojure.test. I use midje in most of my tests. To use Cloverage in those situations, wrap your tests in a deftest.

Since deftest takes a hyphenated Clojure symbol as its identifier, and Midje facts take a string, I’ve come to use the deftest to group related tests together. Usually this means naming the group of tests after the function I’m testing. Then I name Midje facts after the situation that the fact exercises. This makes sense to me because it fits well with the hierarchy of rspec unit tests in Ruby.

Here’s an example of using this approach:
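
A sketch with a hypothetical namespace and trivial facts, just to show the shape:

(ns example.core-test
  (:require [clojure.test :refer [deftest]]
            [midje.sweet :refer :all]))

;; the deftest groups related facts, named after the function under test
(deftest addition-tests
  (fact "adds two positive numbers"
    (+ 1 2) => 3)
  (fact "adds negative numbers"
    (+ -1 -2) => -3))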

Cloverage also outputs a coverage.txt file that might be useful with services like Coveralls. I haven’t used this, so I can’t comment on its usefulness.

If you’re using speclj for your tests, you might run into some issues getting Cloverage to play nice. I don’t use speclj often, so when I couldn’t get it to work with Cloverage, I didn’t pursue the issue.

Final Thoughts

In this post, I covered 5 tools to add to your workflow all the time, and some others that might be useful in certain cases. I’m sure there are more useful tools out there that I don’t know about, and I’d love to hear about them.

I’m also thinking about writing some posts about other development tools that I use, particularly how I use midje to test, and how you can benchmark code with perforate. If you’re interested in those topics, get in touch and let me know.

Have fun and enjoy your cleaner codebase with these tools in your tool belt!


Interested in commenting or contacting me? Send an email to contact@mattgauger.com. Thanks!


Atreus: My Custom Keyboard

Last year I wrote about building chording keyboards and USB foot pedals. At the time, using the Teensy microcontroller as a USB HID device was possible, but it still required a lot of research. There was no good central resource for knowledge about building keyboards. Since then, the Ergo Dox keyboard was released as open source and got quite popular. This seems to have opened the door for many to get into building keyboards.

My friend Ian ordered an Ergo Dox through the Massdrop crowdfunding campaign, after I suggested that I’d teach him to solder and we’d assemble it together. Finding time to get together and build it took almost a year, but we’ve started meeting up weekly to assemble the Ergo Dox. Building his keyboard has been a lot of fun, and it inspired me to work on my own keyboard projects again.

Almost exactly a month ago, I started working on building my own keyboard. I wanted to build a keyboard from scratch that could replace my daily-driver keyboard, a PFU Happy Hacking Lite, so it had to be smaller than most tenkeyless keyboards. The Ergo Dox’s columnar layout was always intriguing, but I wasn’t sure that I needed all those keys. (Normal keyboards stagger the keys of each row, which is a holdover from preventing mechanical typewriters from jamming. Columnar layouts assign a column of keys to each finger.)

Through Geekhack, I found the Atreus, a keyboard designed by Phil Hagelberg (better known online as technomancy). The Atreus is open source (hardware, firmware), and has gone through several revisions at this point. My keyboard is done now, and I wanted to share it.


The original Atreus was constructed out of layers of laser-cut acrylic. Since then, some folks on the Geekhack thread have redesigned the laser-cut design to be cut out of a sheet of birch plywood on Ponoko. Ponoko is great: you upload a file and choose materials and size. The Ponoko website keeps you updated on your project’s status as they check your design, pick materials, and so on. Later, your laser-cut project arrives in the mail. I highly recommend Ponoko’s service if you need laser cutting and can’t get it done at a local makerspace.


I finished the birch ply with semi-gloss marine polyurethane. The polyurethane should give it a durable finish, and it added a nice amber tint to the wood. The downside is that more than a week after the final coat went on, the poly is still off-gassing some headache-inducing fumes.

After applying the finish, I hot-glued the switches in and soldered it together. There’s no PCB with this design, just point-to-point wiring from the components to a central Teensy. I used Cherry MX Clear switches for the majority of the keys because they feel the closest to my Happy Hacking’s Topre switches. The modifiers are Cherry MX Blacks.

Assembling the Atreus with point-to-point soldering wasn’t too bad, but I’ve had a lot of experience soldering. I’ve no doubt that the construction will be durable and reliable, but a PCB might make it easier for beginners to assemble. There’s some talk on Geekhack about using the One Hand PCB as a circuit board for an Atreus-like keyboard.

The rest of my images from the build are on Flickr in this album.

After hours of soldering, the moment of truth came: I plugged in the Teensy, uploaded the firmware, and typed some keys. It worked! I felt relieved that the keyboard worked on the first try. Because I had checked for continuity and shorts throughout the soldering process, I can be confident that my Atreus won’t have any issues with ghosting or glitches. The finished keyboard feels really solid; maybe more so than some plastic keyboards I’ve typed on before.

Because the Atreus uses a columnar layout, I’m not planning to use it with a QWERTY layout. So, I decided to learn Dvorak. I’ve been practicing on the home row on dvorak.nl, which is a great website for learning Dvorak in your browser. The neat thing about that typing tutor is that you don’t have to commit to changing any key layouts at the OS-level. I’ve got the default QWERTY layout on my Atreus now, but will be switching to a hardware-native Dvorak layout soon.

Since the Atreus uses a Teensy as its brain, it can be reconfigured easily by uploading a new firmware. Keyboard layouts for the Atreus start as a JSON file, and then an emacs function can be invoked to compile and upload the firmware to the board. The same JSON file can also be used to generate an HTML table of the layout with Org Mode in emacs. More information can be found on the firmware project repo.

What next?

I haven’t worked on my chording keyboard in a long time. I’m happy to see that things like the tmk firmware will now make that project much easier. With my new knowledge and the many open source projects now available, I’m going to restart work on that project.

Further, I’ve been playing with Matt Adereth’s dactyl to design chording keyboard layouts. Dactyl allows me to write Clojure code and output a format from which OpenSCAD can generate a 3D model. OpenSCAD can then export the files to the formats that 3D printers use. 3D printing has a lot of promise for iteratively prototyping unique ergonomic peripherals, and I intend to try out several ideas for one-hand / chording keyboards.

If you’re interested in building your own keyboard, I would recommend the Ergo Dox, especially if you can get the kit that Massdrop produced, because the circuit boards are well-made. Otherwise, spend some time on the Geekhack & Deskthority forums, read some wiki pages, and test some keyboards. And if you’re interested in building the Atreus, join the discussion! Everyone in that thread has been very helpful. This project wouldn’t have been possible without their answers and advice.


Interested in commenting or contacting me? Send an email to contact@mattgauger.com. Thanks!


Housekeeping: Imported Coderwall Protips

As part of my continuing effort to archive content I’ve created to this blog, I’ve migrated all of my Coderwall protips.

Here’s a quick list of the posts:


Clojure Data Science: Refactoring and Cleanup


This is Part 2 of a series of blog posts called Clojure Data Science. Check out the previous post if you missed it.


Welcome to the second post in this series. If you followed along in the last post, your code should be ready to use in this post. If not, or if you need to get back to a known working state, you can clone the autodjinn repo and git checkout v0.1.0.

I started out writing this post to develop simple functionality on our inbox data. Finishing the post was taking longer than I was expecting, so I split the post in half in the interest of posting this sooner.

In this post, we’ll create an email ingestion script that we can run repeatedly with lein, and we’ll talk about refactoring our code into maintainable namespaces.

So make sure your Datomic transactor is running and launch a REPL, because it is time to give our code a makeover.

A Gmail ingestion script

Because Clojure sits on the JVM, it shares some similarities with Java. One of these is the special purpose of a -main function. You can think of this as the main method in a Java class. The -main function in a Clojure namespace will be run when a tool like lein tries to “run” the namespace. That sounds like exactly what we want to do with our Gmail import functionality, so we will add a -main function that calls our ingest-inbox function. To get started, we will only have it print us a message.
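
Something like:

(defn -main []
  (println "Hello world!"))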

You can then run this by invoking lein run -m autodjinn.core. You should see Hello world! if everything worked. You may notice that the process doesn’t seem to quit after it prints the hello world message; this seems to be a problem with Leiningen. To ensure that our process ends when the script is done, we can add a (System/exit 0) line to the end of our -main function. On *nix systems, a 0 exit code means a successful exit, and a nonzero exit code means something went wrong. Knowing this, we can take advantage of exit codes in the future to signal that an error occurred in our script. But for now, we will have the script end by returning 0 to indicate a successful exit.

Think back to what we did to ingest email in our REPL in the last post. We had to connect to the database, run the data schema transaction, and then we were able to run ingest-inbox to pull in our email.

The following function will do the same thing. Remember that things like trying to create an existing database or performing a schema update against the same schema in Datomic should be harmless. It will add a new transaction ID, but it will not modify or destroy data. Putting together all the steps we need to run, we get a -main function that looks like this:
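
A sketch (create-database stands in for however the first post created the database; update-schema and ingest-inbox are the functions from the earlier posts):

(defn -main []
  ;; all of these are safe to re-run: creating an existing database and
  ;; re-transacting the same schema are harmless in Datomic
  (create-database)
  (update-schema)
  (ingest-inbox)
  (System/exit 0))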

Refactoring namespaces

With Clojure, you must walk a fine line between putting all of your functions into one big file and splitting them across too many namespaces. One big file quickly grows unmaintainable and gains too many responsibilities.

But having too many namespaces can also be a problem: it may create strange cyclic dependency errors, and you may find that you have to require many namespaces to get anything done.

To avoid this, I start with most code in one namespace, and then look for common functionality to extract into a new namespace. Good candidates are groups of functions that all talk about the same business logic or business domain, or whose responsibility is clearly different from the rest of the namespace. Looking at responsibilities is a good way to decide where to break functions out into namespaces.

In this project, we can identify two responsibilities that currently live in our autodjinn.core namespace. The first is working with the database. The second is ingesting Gmail messages. As our project grows, we will not want the code for ingesting Gmail messages to live in autodjinn.core. With that in mind, let’s create a new file called src/autodjinn/gmail_ingestion.clj and move over the vars and functions that we think should live there. That file should look like this:
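
A structural sketch of the new namespace (bodies elided; at this stage ingest-inbox still calls datomic.api directly, which is why it is required here):

(ns autodjinn.gmail-ingestion
  (:require [autodjinn.core :refer :all]
            [datomic.api :as d]))

;; ...the Gmail credential vars and ingest-inbox move here from core...

(defn -main []
  (create-database)
  (update-schema)
  (ingest-inbox)
  (System/exit 0))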

Be sure to remove the functions and vars that we moved to this file from the autodjinn.core namespace. Note that we moved the -main function here, too, so that we can now run lein run -m autodjinn.gmail-ingestion.

You may also notice that we still had to require the datomic.api namespace here to be able to perform a transaction. Our autodjinn.core namespace already handles database interaction, though. So let’s write a create-mail function in core.clj and call it in our new namespace:
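
A sketch of create-mail in core.clj (conn stands in for our connection):

(defn create-mail [attrs]
  @(d/transact conn
               [(merge {:db/id (d/tempid :db.part/user)} attrs)]))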

And in gmail_ingestion.clj we change ingest-inbox to use the new function. While we’re at it, we’ll break out a convenience function to prepare the attr map for Datomic:
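
Sketched in gmail_ingestion.clj (fetch-inbox stands in for however we fetched messages in the first post, and the keys on each message map are illustrative):

(defn mail->attr-map [msg]
  {:mail/uid     (:uid msg)
   :mail/from    (:from msg)
   :mail/to      (:to msg)
   :mail/subject (:subject msg)})

(defn ingest-inbox []
  (doseq [msg (fetch-inbox)]
    (create-mail (mail->attr-map msg))))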

If we run our lein run -m autodjinn.gmail-ingestion command, we should see that the code is still working.

Don’t forget to remove the datomic.api requirement in the gmail-ingestion namespace! Now we only need to require Datomic in the autodjinn.core namespace.

There’s one more piece of low-hanging fruit we can refactor before moving on. The config file is loaded and used in both namespaces. We already require everything from autodjinn.core into autodjinn.gmail-ingestion, so we can safely change a few lines to use the config in gmail_ingestion.clj and stop requiring nomad in two places:
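
In gmail_ingestion.clj, something like this sketch (the key names follow the nomad config from the first post and are illustrative; nomad’s defconfig defines config as a function we call to get the map):

(def gmail-username (get (config) :gmail-username))
(def gmail-password (get (config) :gmail-password))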

And in core.clj:
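
A sketch of the core.clj side, where nomad is now required in only one place (the resource path is illustrative):

(ns autodjinn.core
  (:require [datomic.api :as d]
            [nomad :refer [defconfig]]
            [clojure.java.io :as io]))

(defconfig config (io/resource "config/autodjinn-config.edn"))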

Running lein run -m autodjinn.gmail-ingestion one more time, we should see that our changes did not break the system. The config is now only loaded once, and we use it everywhere.

That’s it! We’ve taken care of some low-hanging fruit and are ready to implement some new functionality. If you want to compare what you’ve done with my version, you can run git diff v0.1.1 on the autodjinn repo.

Please let me know what you think of these posts by sending me an email at contact@mattgauger.com. I’d love to hear from you!
