Why Build Intelligence Augmentation Tools?

In a past blog post I talked about the concept of intelligence augmentation. The idea of building software to augment intelligence has been around for some time. That post covers its history more than this one will.

I’ve noticed that software developers I know (myself included) will have a thought: imagine a tool that allows flexible note-taking, archives and indexes your documents and email, and enables hyperlinking to any content in that index. The tool would have some sort of AI agent architecture on top of it offering improved searching, finding related content automatically, and otherwise assisting you in thinking and researching. Variations on the tool might include better filtering of incoming information, or improving your own ability to learn new things.

Such thoughts tend to lead people to start designing architectures and picking programming languages to implement it in. Or they might start designing a UI. Or they fall down the rabbit hole of knowledge systems, machine learning, and natural language processing.

Indeed, such a note-capturing system is the source of inspiration for loper-os, a more-perfect Lisp machine project envisioned and taken up by Stanislav Datskovskiy. (Although he is still working on the hardware on which to write loper-os and thus run his thought-capturing system. Again, rabbit holes.)

I’ve had discussions with at least a dozen other people about how they would build such a virtual assistant. Clearly, there’s some tooling lacking here that a lot of people have thought about and feel a need for.

Why is building a virtual assistant such a tempting thought for software developers? Likely because they experience technology as constant change, and their work and communication revolve around it.

The tools and services we use now (read: Twitter, Facebook, email, and so on) only compound the information overload that comes from trying to stay on top of your email and calendar, or trying to keep up with everyone on social media. Suffice it to say, the people who use technology the most may feel this pain the most.

Let’s take a step back and think about why we would want to augment our own intelligence. And in particular, I’m going to focus on building software here. We could also have discussions around using smart drugs (nootropics), or of using genetics and medicine. Or we might discuss building hardware such as brainports or Elon Musk’s neural lace, but those are out of scope for this article and for my expertise. Software I know.

In 1997, Garry Kasparov lost to Deep Blue at chess. This was the first time a computer defeated a reigning world champion in a match. After that point, advances in computing power meant that off-the-shelf chess software running on a modern laptop could play as well as Deep Blue. With the search space of chess now within reach of ordinary CPUs, no human can hope to beat the best computer at chess again.

Yet, Kasparov noticed something. If you combine software with a human player, and let the human use the computer software to explore the results of a particular move before making it, that team plays better than man or computer alone. They call these man-machine hybrids “centaurs.” Kasparov called this game Advanced Chess, and an offshoot called freestyle chess has emerged with teams of humans and computers on each side.

To bring it back to our terminology, a centaur composed of a human operating a computer is an augmented human, and the chess software is an intelligence augmentation tool. Now, chess and its rules are nowhere near as complex as writing a more compelling document, or pulling disparate academic papers and original research together into a new thesis. We do not yet have the tools to turn a regular researcher into a super-researcher simply by giving them software to consult.

The complexity of problems that we need to solve is ever-increasing. This was the main reason that Engelbart cited in 1962 for exploring augmenting human intellect:

Man’s population and gross product are increasing at a considerable rate, but the complexity of his problems grows still faster, and the urgency with which solutions must be found becomes steadily greater in response to the increased rate of activity and the increasingly global nature of that activity. Augmenting man’s intellect, in the sense defined above, would warrant full pursuit by an enlightened society if there could be shown a reasonable approach and some plausible benefits.

The benefits of being able to create a super-researcher or a super-productive professional should be obvious. There are likely aspects of your job or your hobbies that you can imagine being aided by better software.

In particular, when we augment the brain, we will look at it like another piece of technology:

  • improving short and long term memory recall (storage)
  • improving the number of different ideas we can hold in our heads at once (RAM)
  • improving the speed, focus, and association-making aspects of our thinking (CPU)

What kinds of features might we want to see in these tools? A possible, but not exhaustive, list might include:

  • filtering the noise of your email inbox, news sources (and fake news), social media, and more
  • proactively providing related content and automatically categorizing content
  • visualizing and summarizing information so that you can work with it more effectively
  • helping us to remember everything we’ve ever seen, heard, or said in the real world
  • helping us remember names and otherwise augment our social ability
  • optimizing our time, schedule, and work load (as SRI’s CALO focused on)
  • optimizing our health (as an outgrowth of the quantified self movement)

So we, as software developers and tool-makers, see a need and want to build tools. We know what kinds of tools we might create. But why else might we, as humans, want to build these tools?

Productivity

If we extrapolate from the Advanced Chess centaurs, then augmented humans will be better than unaugmented workers. In some cases, augmented humans will be able to get more work done than AI/ML tools. In a time when automation continues to threaten our jobs, we might yet find meaningful work for a longer period of time if we can join with machine learning technologies to become augmented humans. In the short term, being more productive is more likely to earn you raises and advancement. You may have better choices about what to work on, or how you work with it.

A worker becomes more valuable as their ability to solve more problems and get more done increases. A more productive work force can help us to have a healthier economy and to smooth the transition to a fully-automated world. Once we reach that level of automation, we can hope to find post-scarcity, an end of wage labor, and the ability to fill our time with leisure.

The downside to augmenting for productivity is that if there are barriers to entry, such as cost, then only the rich can afford such tools. The average worker won’t benefit from expensive productivity improvements. Worse, we may not see any benefit at all for those who aren’t working in professional roles now. Those with jobs threatened by automation – truck drivers, factory workers, and so on – may be hardest hit by expensive or unavailable augmentation tools. We should focus on helping those workers train for their next career, and on helping those entering the workforce keep up.

Education and better learning

The education system cannot move fast enough for the rapid pace of change. Those applying to college now should be looking at what jobs will be available in 5 years when they choose a major; picking a path that may be automated away in that time is a bad decision. But the existing college system does not tell students to avoid jobs that may soon be irrelevant.

If you are likely to change careers in your lifetime, does it make sense to pursue a particular degree? (At least, in cases where it is not required for certification/practice, such as law or medicine.) Or should you optimize for a lifetime of learning?

The cost of higher education is high, and most students take on loans to complete their degrees. Is this cost worth it? Do they learn enough in a degree to pay back the loans later? Do they retain enough information from that learning period to later use it in their job?

And if technology is moving so fast that colleges can’t keep up, will workers be able to juggle learning new advances while working full time?

We need software that accelerates learning and increases retention. More learning in shorter periods of time can help students – and workers in the workforce – to keep up. Better retention means better job performance and success.

There’s been a lot of research into learning on multiple fronts. I recommend the book A Mind for Numbers and its related MOOC Learning How to Learn to find out more about this topic and how to put that research into practice.

There is a long list of startups now offering online courses, nano-degrees, and certificates of study. Khan Academy presents videos covering everything from elementary school subjects through graduate admissions tests. But putting a class online as a screencast does not turn these courses into an intelligence augmentation tool.

There’s the entire internet and all of Wikipedia available whenever we use a search engine. Ebooks give us access to a shelf of books without needing expensive and wasteful physical copies (especially when it comes to textbooks). There are note-taking applications and word processors with spell-checkers. There are flashcard apps, such as SuperMemo, that use spaced repetition to help with memorization and language learning. So why aren’t these enough to enable people to learn better?

The difference between existing tools and what we need from educational augmentation tools is personalization to the learner and optimization of the learning itself. Existing tools are inert and require the learner to expend all of the energy and thinking to use them. Intelligence augmentation tools could be proactive learning tools that do more than provide the content to learn. They could bring the right content to the user at the right time (as in spaced repetition). They could structure the learning for the individual, rather than teaching to the widest range of students as we do now. They could let the learner explore at their own pace and go off on tangents into related topics. Last, these tools could keep track of the learner’s current level and their mastery of each topic.

All of these features are lacking in current tools. The area seems ripe for change and improvement.

More free time

Increased productivity gives us more benefits than getting more done at work. We should have the ability to work less if we are more productive. This could free up more time for leisure, hobbies, entertainment, and further higher education goals.

Since there’s an association (at least in the USA) between labor and success in life, it might be hard to convince ourselves that working less per week is a positive. Yet I see the rise of full-time travelers, many of whom work as contractors for less than 40 hours per week, as a sign that this will become socially acceptable. These digital nomads might have the most success with intelligence augmentation tools and the work of the future (as Charles Stross’s Manfred Macx did).

As we approach a post-scarcity society (and hopefully we do), work for pay will have less importance. Being able to spend time with friends and family, be entertained, and pursue intellectual interests will all become more acceptable ways to spend our time.

Solving complex problems

As noted by Engelbart, our world is increasingly complex. One way that academia has dealt with this is increasing specialization: a PhD may have deep expertise in only a narrow subject. Outside of academia, we have the forces of globalization and technological progress to contend with. Narrow specialization may not always work. Our world changes rapidly and has difficult multi-disciplinary problems to solve. Global warming, food, clean water, and eradicating disease are all tough problems. How can we go about solving them?

In the space of all possible intelligences, imagine intelligence drawn on a curve. At the top of that scale is the most intelligent human (Einstein, Newton, or whoever you wish). The rest of us are somewhere in the middle of the scale, and at the lower end are the animals. Of course, the graph extends much farther up and to the right than the smartest human so far. We just haven’t seen those intelligences yet.

In that space of possible minds, there exists a new category. The augmented humans, or centaurs, have their own range of intelligences. Augmentation could allow regular humans to reach a range that includes Einstein-level intelligence. For specific topics or skills, centaurs could rank much farther up the curve than the smartest humans. We’ve already seen this with freestyle chess: the best human chess players, unaugmented, are no match for a team of computers and humans working together. Augmented humans with good software will be able to surpass the smartest natural humans in multiple respects, and we can give that augmentation software to far more people.

Intelligence augmentation tools won’t just let us do more work faster; they will unlock the ability to understand and solve problems that we previously could not. This will power new forms of technology and science well beyond what we’ve accomplished today.

In conclusion

I’m excited to be thinking and writing about these topics again. There’s potential to start building some of these tools today – in particular, personalized tools built as experiments. These early tools will be like the tracking done by the quantified self movement: separate data points from individuals with little overlap. But cross-pollination of ideas and techniques will be possible, and conversations around these experiments will be important for developing the technology further.

From these personal experiments, we can learn to build generic tools for everybody. Those generic solutions will allow the creation of freely available software. Open source should help with concerns about access being limited to the wealthy and the privileged.

We’ll still be on our own to deal with the ethics involved in augmentation, which I did not touch on in this article.

If you find these topics interesting and would like to discuss them, I invite you to join me over on the new Intelligence Augmentation BBS.

.

My Current Setup: Habits Tracking

In the past, I blogged about how I used Lift.do (now coach.me) to prompt habit formation. Learning how to form new habits is one of the key tools for focusing on your growth and your ability to learn more. You might recall from that previous post that I refer to the Fogg method for behavior change. The three steps are:

  1. Select the right target behavior.
  2. Make the target behavior easy to do.
  3. Ensure a trigger will prompt the behavior.

So why did I stop using Lift.do? In short, because Reminders.app for iOS and MacOS got better.

Reminders.app example

You may recall that Lift.do emailed me every morning a reminder for one habit I was trying to form: take a Vitamin D.

It’s possible that I could have had Lift.do email about other habits at different times. It could have sent me a digest email of all my habits for the day, each morning. But, sending more emails would quickly overwhelm my inbox and help turn it into the dreaded TODO list.

The prompt or trigger to perform a habit should happen as close to the right moment as possible. With recurring reminders in Reminders.app, I now get notifications for each habit. I can easily tweak the schedule to fit where I think the habit will best fit in.

The notifications show up on both my laptop and my phone, and since I have my phone on me at almost all times, it can bug me wherever I am. Completing a task on my phone or my laptop syncs to the other device thanks to iCloud.

As a result, I’m no longer just tracking the one Vitamin D habit. (In fact, I don’t track Vitamin D at all anymore. It has become a real habit and doesn’t need a prompt every morning.) I track 14 habits that happen daily, and an additional 3 that happen every other day or on a custom schedule.

The ease of these tasks varies. Some of them I would do anyway, or have already formed habits around; having the record makes it easier to remember, 8 hours later, whether I really did them.

Some of these tasks are harder or take more motivation to do each day. I have habits that I’m forming as part of long-term learning goals. The easiness of these goals comes from having broken them down into tasks that I absolutely can do every day. There’s power in small amounts of consistent progress every day.

I’d recommend that you think about setting the goal for a new habit to the smallest thing you could do every day. For example: practice for just 5 minutes. Allow yourself to take more time if you have it. If you get to the end of the day, and you’re being honest with yourself, can you fit in 5 minutes of practice before turning in? You will make a lot of progress this way.

What am I lacking with this setup?

Reminders.app doesn’t really keep track of the history of checked-off items. It only knows whether I missed a deadline (and how long ago that was) and what’s due today. Items checked off yesterday wait until they’re due again today, and then notify me. I don’t have Lift.do’s graphs or Seinfeld’s “streaks calendar,” but this seems OK. I have a general idea of how well I’ve been doing lately. For longer-term goals, like fitness, I use other apps that have their own streak calendars.

Despite saying in my previous post on habits that I’ve found progress bars and charts to be motivating, I don’t find that I need those now. Having a list in Reminders.app and knowing whether I’ve done today’s items is enough.

A protip: you can view all of your scheduled reminders in Reminders.app in a separate list called Scheduled. I keep this list up in Reminders.app most of the time, and use a separate list as a more general, on-the-fly TODO list. I tend to use Emacs’s excellent Org mode for my real TODO lists, among many other things. (Org mode does a lot. You should check it out!)

Final thoughts

This approach triggers habits much more consistently and improves the likelihood that I can get through so many small tasks in a given day. It doesn’t require additional software or notebooks in my process. For something as simple as habit tracking, I’m not too worried that I’m locked into an Apple app on Apple hardware. This list could be moved to something else with little effort, but convenience wins right now.

Overall, this process works for me. And that might be the most important point.

.

Mining for Computation on the Beach

The introduction to Writing GNU Emacs Extensions introduces Emacs by talking about plumbers. “Plumbers?” you might think. What it wants us to consider is whether plumbers make their own tools.

Plumbers buy pipes and fittings in standardized sizes. They depend on the International Building Code and local building codes to tell them what is safe and necessary. They use tools made for the tasks they’ll likely need to do. As the book says, “a plumber doesn’t tinker with his wrench.”

I imagine that there are plumbers who do build their own little jigs to hold pipes together, or to trace a pattern before cutting a hole. These solutions come from experience with past work and knowledge of the current problem. The book might fall a little flat here – plumbers can solve problems by making things. But what about their standard tools?

Again, from the book, “the plumber would tinker with their tools, if they knew how.”

Are most plumbers like the programmer that uses Emacs? Likely not. Because, the book says, Emacs is a tool that programmers can use to build tools. Emacs is a kind of ouroboros: software that builds software, software that can change itself.

A better example of makers making tools is the machinist in their machine shop. Machinists make jigs and holders all the time. They must, or they’d never be able to clamp an odd-shaped piece to work on it. But the abilities of a machinist go beyond jigs and holders.

The tools in a machine shop are sufficient to create more tools. With a lathe and a vertical mill, one could create the hard parts of any of the machine tools in the shop. Granted, most machinists would not consider creating another vertical mill from scratch; the labor involved suggests buying one from a manufacturer, who builds tools at scale and sells them at a reasonable cost relative to the labor of making a mill yourself. But the ability of a machinist to recreate everything from scratch is there, if need be.

Software is much cheaper to build. The ease of modifying Emacs leads people to build all sorts of tools with it, and those tools go well beyond editing source files: there are IRC clients, web browsers, and more. There’s even a system called TRAMP that lets you edit remote files over FTP, SFTP, and other protocols, and makes them all appear to be local buffers in Emacs.

If the machinist wanted to recreate their machine shop from scratch, what kind of reference would they need for that?

There exists a shelf full of books that sets out to create a machine shop from scratch – the Gingery book series, Build Your Own Metal Working Shop From Scrap. These colorful books haunt the bookcases of makerspaces and home shops. The first book starts with sand on the beach and charcoal made from trees inland; with this start, you can cast aluminum and zinc parts. Take some scrap and some common parts, and the second book will teach you how to make a metal lathe. This continues for seven books, until you have built your own home machine shop from scrap and sand on the beach. With these books, you are able to build more tools and replace anything worn out or broken.

Having these books on your shelf might be interesting if you like to make physical things. Owning them might be part of your expanded understanding of how the world works.

And of course, these books might be a good idea to have on your shelf for any zombie apocalypse that might befall us. (Whether you’ll need a machine shop in the apocalypse is a worthwhile question. Antibiotics and growing food might be more important. As an aside, what other references would we want on our shelf to recreate civilization?)

For computation, there are several books that set out to teach from the level of “sand on the beach.” What might that look like?

Start with relays, simple electromagnetic switches. A relay can only be on or off. The binary state means that we can represent numbers by counting in binary. We can treat on or off like true and false. With enough switches, we can create logic gates for all the types of logic operations. Then, we can combine these gates into more complex mechanisms like adders or a full 8-bit CPU. Lastly, we can begin to understand how the instruction set works in a CPU to create software. That’s the approach in Code: The Hidden Language of Computer Hardware and Software. I recommend this book highly. The difficulty curve of each chapter is just right to keep you engaged through the whole book. I only wish I’d read it a decade earlier!

Nand2Tetris is a free book and online course. This approach uses hardware emulation rather than relays as a building block. The book starts with basic gates and works up to a “general-purpose computer system.” Or rather, something we can build a Tetris clone on. Following along in the hardware emulation software is a good way to grok the details. Hardware emulation avoids most of the frustrations of my undergraduate digital logic course. Namely, having to build a test harness to ensure the quad NAND gates weren’t faulty.
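
To make the gates-up progression concrete, here is a tiny sketch in Clojure (the language used elsewhere on this blog) deriving a few gates and a half adder from a single NAND primitive; the names are mine, not from either book.

(defn nand-gate [a b] (not (and a b)))            ;; the single primitive
(defn not-gate  [a]   (nand-gate a a))
(defn and-gate  [a b] (not-gate (nand-gate a b)))
(defn or-gate   [a b] (nand-gate (not-gate a) (not-gate b)))
(defn xor-gate  [a b] (and-gate (or-gate a b) (nand-gate a b)))

;; A half adder: sum is XOR, carry is AND. Chain these into full adders
;; and you are on the way to the ALU both books describe.
(defn half-adder [a b]
  {:sum (xor-gate a b) :carry (and-gate a b)})

(half-adder true true) ;; => {:sum false, :carry true}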

Computer science is about the theory of computation. If you’d like to learn about that, then Understanding Computation by Tom Stuart is next on the list. This book uses Ruby, which makes it more approachable than more academic books. This book helps to fill in some of the gaps in my understanding from a career in mostly-web programming.

That covers the hardware and the computational theory that go into programming. But what about programming itself? What if we wanted to start with the basics? To answer this, we’d want books that teach us how to go from assembly to something higher-level. We’d want to know what issues one might face when stepping up a level from machine code.

I’m afraid I don’t have much to suggest beyond the classics on programming here. Stick with favorites like The C Programming Language by Kernighan and Ritchie, and you can’t go wrong.

If you want an alternative world view on solving problems, read Programming A Problem Oriented Language: Forth - how the internals work by Charles H. Moore and wrap your head around Forth.

Another book in this vein might be Build Your Own Lisp if you’ve never completed such a feat. Or look to mal and its wealth of implementations to understand how to build a Lisp. Implementing lambda calculus interpreters and Lisp-like languages is a good pastime, and one that I’d like to practice more.

At the level of operating systems, we find more valuable resources. Lions’ Commentary on UNIX provides the UNIX source code with commentary. Suppressed by AT&T long ago for revealing their trade secrets, it’s now easy to find a copy on Amazon.

Imagine starting from scratch and creating an operating system. And, creating a language to go with that operating system. That’s the path taken in Project Oberon by Niklaus Wirth. This book will help you to think about different facets of a problem and how to solve it from all sides. You might want to abandon what you take for granted in computing. It’s an alternative-computing rabbit hole that will make you wonder why current computing is so mundane. (This, along with learning about Lisp machines, might make you interested in reinventing the wheel. Fair warning.)

At this point, we’re diverging from covering the basics and into realms that I enjoy thinking about. If you’ve come this far and you really must push your understanding past traditional Turing machines, then I have one book to recommend to you. It’s much more expensive than when I bought it, and is only available used. The Architecture of Symbolic Computers by Kogge is a tome on symbolic and logical computing. If you’ve taken the rabbit hole of Oberon and have made time to learn about Lisp machines, this book is a real treat. Symbolic and logic computing are part of a complete understanding of computation.

Stepping back from our tangent, you might ask, “What does this all have to do with Emacs?” Well, I’d put Writing GNU Emacs Extensions on the list as the book about building tools. It won’t cover the other tools you’ll likely need in computing: how to build a compiler, how to write Makefiles, and so on. But if you want to build tools, it is good to have a deep understanding of a tool for building tools. Emacs is a good platform for tinkering, and it can be the workshop from which your other tools emerge. Learning Emacs, and how to build things in it, has been well worth the time I’ve invested.

Even now, hackers are rebuilding Emacs in Rust with a project called remacs. The ouroboros Emacs is helping to rebuild itself in a new language.

You need to know your tools, and know where they came from, to know them well. This list is a good start on a deep knowledge of computing. Books help us to understand what came before, and to think about where we can go.

I’ve set up an apocalyptic-computing bookshelf on Goodreads to track these books. The name suggests that this is the list of books I’d bring with me to rebuild society, should we need it. With a list like this, we can go “mining for computation on the beach” and hope to know enough to start from scratch.

What books would be on your apocalyptic-computing shelf? Let me know, or set up your own Goodreads shelf and send it to me!

Notes

  1. If you want to learn Emacs itself before you dive into Writing GNU Emacs Extensions, then I recommend Mastering Emacs and the Using Emacs video series.
  2. Org mode is perhaps the most important tool I’ve learned in Emacs, and now powers large parts of my life.
.

Clojure Data Science: Sent Counts and Aggregates


This is Part 3 of a series of blog posts called Clojure Data Science. Check out the previous post if you missed it.


For this post, we want to generate some summaries of our data by running aggregate queries. We won’t yet be pulling tools like Apache Storm into the mix, since we can accomplish this with Datomic queries alone. We will also talk about the trade-offs of running aggregate queries on large datasets, and devise a way to save our results back to Datomic.

Updating dependencies

It has been some time since we worked on autodjinn. Libraries move fast in the Clojure ecosystem, and we want to make sure that we’re developing against the most recent version of each dependency. Before we begin making changes, let’s update everything. If you have already read my Clojure Code Quality Tools post, you’ll be familiar with the lein-ancient plugin.

Below is output when I run lein ancient on the last post’s finished git tag, v0.1.1. To go back to that state, you can run git checkout v0.1.1 on the autodjinn repo.

It looks like our nomad dependency is out of date. Update the version number in project.clj to 0.7.0 and run lein ancient again to verify that it worked.

If you take a look at project.clj yourself, you may notice that our project is still on Clojure 1.5.1. lein ancient doesn’t look at the version of Clojure we’re specifying; it assumes you have a good reason for pinning the version you picked. In our case, we’d like to be on the latest stable Clojure, version 1.6.0. Update the version of Clojure in project.clj and then run your REPL. There should be no issues with the functionality we created in previous posts. If there are, carefully read the error messages and try to find a solution before moving on.

To save on the hassle of upgrading, I have created a tag for the project after upgrading Clojure and nomad. To go to that tag in your local copy of the repo, run git checkout v0.1.2.

Datomic query refresher

If you remember back to the first post, we wrapped up by querying for entity IDs and then using Datomic’s built-in entity and touch functions to instantiate each message with all of its attributes. We had to do this because the query itself only returned a set of entity IDs:
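
As a refresher, that query looked something like the sketch below (assuming datomic.api is required as d, and that db is a database value obtained from the project’s connection, as in the earlier posts):

(def results
  (d/q '[:find ?eid
         :where [?eid :mail/uid _]]
       db))
;; => #{[17592186045418] [17592186045419] ...}

;; Each message entity is then instantiated from its ID:
(map (fn [[eid]] (d/touch (d/entity db eid))) results)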

Note that the Datomic query is made up of several parts:

  • The :find clause says what will be returned. In this case, it is the ?eid variable for each record we matched in the rest of the query.
  • The :where clause gives a condition to match. In this case, we want all ?eid where the entity has a :mail/uid fact, but we don’t care about the :mail/uid fact’s value, so we give it a wildcard with the underscore (_).

We could pass in the :mail/uid we care about, and only get one message’s entity-ID back.
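
A sketch of that parameterized query (the uid string here is only a placeholder):

(d/q '[:find ?eid
       :in $ ?uid
       :where [?eid :mail/uid ?uid]]
     db
     "some-message-uid")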

Notice how the ?uid variable gets passed in with the :in clause, as the third argument to d/q?

Or we could change the query to match on other attributes:
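
For example, matching on a subject line instead (assuming a :mail/subject attribute from the earlier schema):

(d/q '[:find ?eid
       :in $ ?subject
       :where [?eid :mail/subject ?subject]]
     db
     "Hello world")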

In all these cases, we’d still get the entity IDs back because the :find clause tells Datomic to return ?eid. Typically, we pass around entity IDs and lazy-load any facts (attributes) that we need off that entity.

But, we could just as easily return other attributes from an entity as part of a query. Let’s ask for the recipients of all the emails in our system:
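
A minimal sketch: return the value of the :mail/to attribute rather than the entity ID.

(d/q '[:find ?to
       :where [_ :mail/to ?to]]
     db)
;; => #{["alice@example.com"] ["bob@example.com"] ...}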

While it is less common to return only the value of an entity’s attribute, being able to do so will allow us to build more functionality on top of our email abstraction later.

One last thing. Take a look at the return of that query above. Remember that the results returned by a Datomic query are a set. In Clojure, sets are a collection of unique values. So we’re seeing the unique list of addresses that are in the To: field in our data. What we’re not seeing is duplicate recipient addresses. To be able to count the number of times an email address received a message, we’ll need a list with non-unique members.

Datomic creates a unique set for the values returned by a query. This is generally a great thing, since it gets around some of the issues that one can run into with JOINing in SQL. But in this case, it is not ideal for what we want to accomplish. We could try to get around the uniqueness constraint on output by returning vectors of the entity ID and the ?to address, and then mapping across the result to pull out the second item:
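
That workaround would look roughly like this:

(map second
     (d/q '[:find ?eid ?to
            :where [?eid :mail/to ?to]]
          db))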

There’s a simpler way that we can use in the Datomic query. By keeping it inside Datomic, we can later combine this approach with more-complex queries. We can tell the Datomic query to look at other attributes when considering what the unique key is by passing the query a :with clause. By changing our query slightly to include a :with clause, we end up with the full list of recipients in our datastore:
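
A sketch of the query with a :with clause; ?eid stays in the result basis, so duplicate addresses survive, but only ?to is returned:

(d/q '[:find ?to
       :with ?eid
       :where [?eid :mail/to ?to]]
     db)
;; => [["alice@example.com"] ["alice@example.com"] ["bob@example.com"] ...]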

At this point, it might be a good idea to review Datomic’s querying guide. We’ll be using some of the advanced querying features found in the later sections of that guide, most notably aggregate functions.

Sent Counts

For this feature, we want to find all the pairs of from-to addresses for each email in our datastore, and then sum up the counts for each pair. We will save all these sent counts into a new entity type in Datomic. This will allow us to ask Datomic questions like who sends you the most email, and who you send the most email to.

We start by building up the query in our REPL. Let’s start with a simpler query, to count how many emails have been sent to each email address in our data store. Note that this isn’t sufficient to answer the question above, since we won’t know who those emails came from; they could have been sent by us or by someone else, or they could have been sent to us. Later, we’ll make it work with from-to pairs that allow us to know things like who is sending email to us.

A simple way to do this would be to wrap our previous query in the frequencies function that clojure.core provides. frequencies returns a map of the items in a Clojure collection to their counts.
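
A sketch of that approach, building on the :with query above:

(frequencies
 (map first
      (d/q '[:find ?to
             :with ?eid
             :where [?eid :mail/to ?to]]
           db)))
;; => {"alice@example.com" 12, "bob@example.com" 3, ...}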

However, we want to perform the same sort of thing in Datomic itself. To do that, we’re going to need to know about aggregate functions. Aggregate functions operate over the intermediate results of a Datomic query. Datomic provides functions like max, min, sum, count, rand (for getting a random value out of the query results), and more. With aggregates, we need to be sure to use a :with clause to ensure we aggregate over all our values.

Looking at that short list of aggregate functions, we can see that we probably want to use the count function to count the occurrence of each email address in a To: field in our data. To see how aggregates work, I’ve come up with a simpler example. (The only new thing to know is that Datomic’s Datalog implementation can query across Clojure collections as easily as it can against a database value, so I’ve given a simple vector-of-vectors here to describe data in the form

[database-id person-name]

When the query looks at records in the data, our :where clause binds an id and a name to each record based on position in the vector.)
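
A hedged reconstruction of that in-memory example (the IDs and names are made up to match the intermediate results shown below):

(d/q '[:find ?name (count ?name)
       :with ?id
       :where [?id ?name]]
     [[1 "Jon"]
      [2 "Jon"]
      [3 "Bob"]
      [4 "Chris"]])
;; => (["Jon" 2] ["Bob" 1] ["Chris" 1])   ; order may vary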

Let’s review what happened there. Before the count aggregate function was applied, our results looked like this:

[["Jon"] ["Jon"] ["Bob"] ["Chris"]]

So the count function just counts across the values of the variable it is passed (in our case, ?name), and by pairing it with the original ?name value, we get each name and the number of times it appears in our dataset.

It makes sense that we can do the same thing with our recipient email addresses from the previous query. Combining our previous queries with the count aggregate function, we get:
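
A sketch of that combined query:

(d/q '[:find ?to (count ?to)
       :with ?eid
       :where [?eid :mail/to ?to]]
     db)
;; => (["alice@example.com" 12] ["bob@example.com" 3] ...)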

That looks like the same kind of data we were getting with the use of the frequencies function before! So now we know how to use a Datomic aggregate function to count results in our queries.

What’s next? Well, what we really want is to get results that are of the form

[from-address to-address]

and count those tuples. That way, we can differentiate between email sent to us versus email we’ve sent to others, etc. And eventually, we’d like to save those queries off as functions that we can call to compute the counts from other places in our project.

We can’t pass a tuple like [from-address to-address] to the count aggregate function in one query. The way around this is to write two queries. The inner query will return the tuples, and the outer query will return the tuple and a count of the tuple in the output data. Since the queries run on the peer, we don’t really have to worry about whether it is one query or two, just that it returns the correct data at the end.

So what would the inner query look like? Remember that the outer query will still need a field to pass to the :with clause, so we’ll probably want to pass through the entity ID.
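
A sketch of such an inner query, producing one [?eid ?from ?to] tuple per recipient (assuming a :mail/from attribute from the earlier schema):

(d/q '[:find ?eid ?from ?to
       :where [?eid :mail/from ?from]
              [?eid :mail/to ?to]]
     db)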

Those tuples will be used by our outer query. However, we also need a combined value for the count to operate on. For that, we can throw in a function call in the :where clause and give it a binding at the end for Datomic to use for that new value. In this case, I’ll combine the ?from and ?to values into a PersistentVector that the count aggregate function can use. The combined query ends up looking like this:
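
A hedged reconstruction of the nested query: the inner query produces the tuples, and the outer query builds a [from to] vector with clojure.core/vector and counts it per from/to pair.

(d/q '[:find ?from ?to (count ?from-to)
       :with ?eid
       :where [?eid ?from ?to]
              [(vector ?from ?to) ?from-to]]
     (d/q '[:find ?eid ?from ?to
            :where [?eid :mail/from ?from]
                   [?eid :mail/to ?to]]
          db))
;; => (["me@example.com" "alice@example.com" 12] ...)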

And the output is as we expect.

Reusable functions

The next step is to turn the query above into various functions we can use to query for from-to counts later. In our data, we don’t just have recipients in the To: field, we also have CC and BCC recipients. Those fields will need their own variations of the query function, but since they will share so much functionality, we will try to compose our functions in such a way that we avoid duplicate code.

In general, when I write query functions for Datomic, I use multiple arities to always allow a database value to be passed to the query function. This can be useful, for example, when we want to query against previous (historical) values of the database, or when we want to work with a particular database value across multiple queries, to ensure our data is consistent and doesn’t change between queries.

Such a query function typically looks like this:
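
A sketch of the idiom, using the to-address count query from above; db-uri stands in for however the project obtains its connection settings.

(defn to-sent-counts
  ;; db-uri: assumed var holding the Datomic connection URI (from earlier posts)
  ([] (to-sent-counts (d/db (d/connect db-uri))))
  ([db]
   (d/q '[:find ?from ?to (count ?from-to)
          :with ?eid
          :where [?eid ?from ?to]
                 [(vector ?from ?to) ?from-to]]
        (d/q '[:find ?eid ?from ?to
               :where [?eid :mail/from ?from]
                      [?eid :mail/to ?to]]
             db))))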

By taking advantage of multiple arities, we can default to not having to pass a database value into the function. But in the cases where we do need to ensure a particular database version is used, we can do that. This is a very powerful idiom that I’ve learned since I began to use Datomic, and I suggest you structure all your query functions similarly.

Now, let’s take that function that only queries for :mail/to addresses and make it more generic, with specific wrapper functions for each case where we’d want to use it:
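
One way to factor this (function names are mine; the repo’s actual code may differ): a generic helper takes the recipient attribute as a query input, and thin wrappers cover the To, CC, and BCC cases.

(defn sent-counts-for-attr
  ([attr] (sent-counts-for-attr (d/db (d/connect db-uri)) attr))
  ([db attr]
   (d/q '[:find ?from ?to (count ?from-to)
          :with ?eid
          :where [?eid ?from ?to]
                 [(vector ?from ?to) ?from-to]]
        (d/q '[:find ?eid ?from ?to
               :in $ ?attr
               :where [?eid :mail/from ?from]
                      [?eid ?attr ?to]]
             db attr))))

(defn to-counts
  ([]   (sent-counts-for-attr :mail/to))
  ([db] (sent-counts-for-attr db :mail/to)))

(defn cc-counts
  ([]   (sent-counts-for-attr :mail/cc))
  ([db] (sent-counts-for-attr db :mail/cc)))

(defn bcc-counts
  ([]   (sent-counts-for-attr :mail/bcc))
  ([db] (sent-counts-for-attr db :mail/bcc)))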

Note that we had to change the inner query to take the attr we want to query on as a variable; this is the proper way to pass a piece of data into a query we want to run. The $ that comes first in the :in clause tells Datomic to use the second d/q argument as our dataset (the db value we pass in), and the ?attr tells it to bind the third d/q argument as the variable ?attr.

While the three variant functions are similar, the shared core keeps the code DRY. (DRY is an acronym for Don’t Repeat Yourself.) In the long run, less code should mean fewer bugs and the ability to fix problems in one place.

Building complex systems by composing functions is one of the features of Clojure that I enjoy the most! And notice how we got to these finished query functions by building up functionality in our REPL: another aspect of writing systems in Clojure that I appreciate.

Querying against large data sets

Right now, our functions calculate the sent counts across all messages every time they’re called. This is fine for the small sample dataset I’ve been working with locally, but if it were to run against the 35K+ messages that are in my Gmail inbox alone (not to mention all the labels and other places my email lives…) it would take a very long time. With even bigger datasets, we can run into an additional problem: the results may not fit into memory.

When building systems with datasets big enough that they don’t fit into memory, or that may take too much time to compute to be practical, there are two general approaches that we will explore. The first is storing results as data (known as memoizing or caching the results), and the other is breaking up the work to run on distributed systems like Hadoop or Apache Storm.

For this data, we only want to avoid redoing the calculation every time we want to know the sent counts. Currently, the data in our system changes infrequently, and we could likely tell the system to recompute sent counts only after ingesting new data from Gmail. For these reasons, a reasonable solution is to store the computed sent counts back into Datomic.

A new entity type to store our results

For all three query functions we wrote, each result is of the form:

[from-address to-address count]

Let’s add to the Datomic schema in our core.clj file to create a new :sent-count entity type with these three attributes. Note that sent counts don’t really have a unique identifier of their own; it is the combination of from -> to addresses that uniquely identifies them. However, we will leave the from and to addresses as separate fields so it is easy to use them in queries.

Add the following maps to the schema-txn vector:
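
A hedged sketch of those schema maps (the attribute and enum names follow the prose here; the exact definitions in the repo may differ):

{:db/id                 #db/id[:db.part/db]
 :db/ident              :sent-count/from
 :db/valueType          :db.type/string
 :db/cardinality        :db.cardinality/one
 :db.install/_attribute :db.part/db}

{:db/id                 #db/id[:db.part/db]
 :db/ident              :sent-count/to
 :db/valueType          :db.type/string
 :db/cardinality        :db.cardinality/one
 :db.install/_attribute :db.part/db}

{:db/id                 #db/id[:db.part/db]
 :db/ident              :sent-count/count
 :db/valueType          :db.type/long
 :db/cardinality        :db.cardinality/one
 :db.install/_attribute :db.part/db}

{:db/id                 #db/id[:db.part/db]
 :db/ident              :sent-count/type
 :db/valueType          :db.type/ref
 :db/cardinality        :db.cardinality/one
 :db.install/_attribute :db.part/db}

;; Enum values that :sent-count/type can point at:
{:db/id #db/id[:db.part/user] :db/ident :sent-count.type/to}
{:db/id #db/id[:db.part/user] :db/ident :sent-count.type/cc}
{:db/id #db/id[:db.part/user] :db/ident :sent-count.type/bcc}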

You’ll have to call the update-schema function in your REPL to run the schema transaction.

Something that’s worth calling out is that we’re using a Datomic schema valueType we haven’t seen yet in this project: db.type/ref. In most cases, you’d use the ref type to associate one entity with other entities in Datomic, but we can also use it to point at a fixed set of enum values. Here, we give :sent-count/type an enum of its possible values: to, cc, and bcc. By adding this type field to our new entities, we can either look at sent counts for only one type of address, or sum up all the counts for a given from-to pair and get the total counts for the system.

Our next job is to add some functions to create the initial sent counts data, as well as to query for it. To keep things clean, I created a sent-counts namespace for these functions to live in. I’ve provided it below with minimal explanation, since it should look very familiar to what we’ve already done.

/src/autodjinn/sent_counts.clj
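
Here is a hedged sketch of what such a namespace could contain; the actual file in the autodjinn repo is the authoritative version, and names like core/db-uri and the *-counts query functions are assumptions carried over from the sketches above.

(ns autodjinn.sent-counts
  (:require [autodjinn.core :as core]
            [datomic.api :as d]))

(defn- counts->txn
  "Turn [from to count] tuples into transaction maps for one count type."
  [type counts]
  (for [[from to n] counts]
    {:db/id            (d/tempid :db.part/user)
     :sent-count/from  from
     :sent-count/to    to
     :sent-count/count n
     :sent-count/type  type}))

(defn create-sent-counts
  "Compute and store sent counts for To, CC, and BCC recipients."
  []
  (let [conn (d/connect core/db-uri)
        db   (d/db conn)]
    @(d/transact conn
                 (concat (counts->txn :sent-count.type/to  (core/to-counts db))
                         (counts->txn :sent-count.type/cc  (core/cc-counts db))
                         (counts->txn :sent-count.type/bcc (core/bcc-counts db))))))

(defn all-sent-counts
  "Query the stored counts back out as [from to count type] tuples."
  ([] (all-sent-counts (d/db (d/connect core/db-uri))))
  ([db]
   (d/q '[:find ?from ?to ?count ?type
          :where [?e :sent-count/from  ?from]
                 [?e :sent-count/to    ?to]
                 [?e :sent-count/count ?count]
                 [?e :sent-count/type  ?t]
                 [?t :db/ident         ?type]]
        db)))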

After adding in the sent_counts.clj file, running:

(sent-counts/create-sent-counts)

will populate your datastore with the sent counts computed with functions we created earlier.

Note: The sent counts don’t have any sort of unique key on them, so if you run create-sent-counts multiple times, you’ll get duplicate results. We’ll handle that another time when we need to update our data.

Wrapping up

We’ve covered a lot of material on querying Datomic. In particular, we used aggregate functions to get the counts and sums of records in our data store. Because we don’t want to run the queries all the time, we created a new entity type to store our sent counts and saved our data into it. With query functions like those found in the sent-counts namespace, we can start to ask our data questions like “In the dataset, what address was sent the most email?”

If you want to compare what you’ve done with my version, you can run git diff v0.1.3 on the autodjinn repo.

Please let me know what you think of these posts by sending me an email at contact@mattgauger.com. I’d love to hear from you!

.

Clojure Code Quality Tools

I work with many programming languages on a daily basis. As a polyglot programmer, I’ve come to appreciate tools that help me follow best practices. For JavaScript, there’s the excellent jshint. When I need to verify some XML, there’s xmllint. In a Ruby on Rails project, I can count on the rails_best_practices gem. For Ruby smells, I reach for rubocop. There are tools like SimpleCov to measure test coverage on my Ruby projects. cane helps me to enforce line length, method complexity, and more in my Ruby code. Syntastic brings real syntax checking to vim for many languages. Every day, more open source tools are introduced that help me to improve the quality of the software that I write.

It follows that when I write Clojure code, I want nice tooling to help me manage code quality, namespace management, and out-of-date dependencies. What tools do I use on a day-to-day basis for this? In this post, I’ll show 5 tools that I use in my workflow every day on Clojure projects, and also provide some other tools for further exploration. Most of these tools exist as plugins to the excellent Leiningen tool for Clojure.

lein deps :tree

In the past, lein deps was a command that downloaded the correct versions of your project’s dependencies. Running lein deps is no longer necessary, as each lein command now checks for dependencies before executing. But deps provides an interesting variant for our uses: lein deps :tree.

The :tree keyword at the end instructs lein to print out your project’s dependencies as a tree. This itself is a good visualization, but not what we’re looking for. The tree command will first print out any dependencies-of-dependencies which have conflicts with other dependencies. For example, here’s what lein deps :tree says for one of my projects:

As you can see, the tool suggests dependencies that request conflicting versions, and how we can modify our project.clj file to resolve those conflicting versions by excluding one or the other. This isn’t always very useful, but when you run into issues because two different Clojure libraries require two wildly different joda-time versions (a situation I have run into before), it will be good to know what dependencies are causing that issue and how you might go about resolving it.
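
For instance, a hypothetical exclusion in project.clj might look like this (the library names here are illustrative, not from my project):

:dependencies [[clj-time "0.8.0"]
               [some-other-lib "1.2.3" :exclusions [joda-time]]]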

Note that this functionality disappeared in Leiningen 2.4.3 but is back in 2.5.0, so make sure you run lein upgrade!

lein-ancient

This plugin to lein exists simply to check your project for outdated dependencies. Without lein-ancient, I’d be unable to keep up with some of the faster-moving libraries in the Java and Clojure world.

After adding ancient to your ~/.lein/profiles.clj, running the lein ancient command yields output on the same project as before:

Whoops! Looks like I haven’t been keeping up to date with my dependencies. lein ancient makes checking for new dependency versions easy. Further, thanks to the ubiquity of semantic versioning in Clojure projects, it is usually quite safe to bump the patch version (0.0.x) of a dependency.

You can also use lein-ancient to find outdated lein plugins in your ~/.lein/profiles.clj file. Just run it with the profiles argument:

lein kibit

As we gain experience and confidence in a programming language, we begin to talk about whether we’re writing idiomatic code. I’d argue that idiomatic code is code that accomplishes a goal with proper use of language features, in a way that other developers familiar with that language would understand. A simpler way to say it might be: idiomatic code uses the community-accepted best practices of how to do something.

Clojure’s design seeks to solve some problems found in older Lisps, as well as add niceties like complementary predicates and conditionals. A good example of these convenient complements is if and if-not. Clojure also simplifies several common usages; for example, when you don’t need an else clause on an if, you can use the when macro.
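
This is exactly the kind of rewrite kibit nudges you toward; a made-up example (logged-in? and render-dashboard are hypothetical functions):

;; A one-armed if...
(if (logged-in? user)
  (render-dashboard user)
  nil)

;; ...reads better as when:
(when (logged-in? user)
  (render-dashboard user))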

Wouldn’t it be great if there was someone who was well-versed in Clojure idioms pairing with you and offering suggestions? That’s exactly what kibit does.

Running against a project I’d set up to contain some smells, lein kibit found:

These kinds of small improvements are all over our Clojure projects. They’re not show-stopper bugs, but they’re small places for improvement.

Kibit’s suggestions are almost always logically equivalent to the original code. Still, I always do some smoke-testing to ensure the code still works after applying Kibit’s suggestion, and it generally does. Problems I frequently fix with Kibit are replacing if statements with the when macro, simplifying checks for empty seqs, and simplifying nil checks.

You can point lein kibit at a specific namespace by appending the path, like this: lein kibit src/foo/bar.clj

Kibit catches many cases where there is a more-idiomatic way to express what you are trying to do. I recommend running it often. In fact, it’s possible to use kibit in your emacs buffers if you want it to be that much more convenient and real-time.

Eastwood

For linting Clojure code, there’s Eastwood. It is similar in functionality to Kibit, but it will catch different issues. Built on two interesting Clojure projects, tools.analyzer and tools.analyzer.jvm, Eastwood does a powerful examination of your code inside the JVM. It is worth highlighting that since Eastwood loads your code to analyze it, it might trigger any side effects that happen when your code loads: writing files, modifying databases, and so on. Note that it only loads the code; it does not execute it.

After adding eastwood to your lein profiles.clj, simply run: lein eastwood and you will see output like:

That’s a lot of problems for a simple file! Notice how one mistake got caught for two reasons: A misplaced docstring (placed after the arguments vector) becomes just a string in the function body that will be thrown away.

Another nice catch that Eastwood provides is detecting the redefinition of the var qux in the file.

But Eastwood covers a lot more cases than just vars being def’d more than once. See the full list to find out what else it does. A few linters are disabled by default, but they might make sense to enable for your project.

Frequently running lint tools can help prevent subtle problems that come from code that looks correct but contains some small error. Eastwood is less concerned with style than tools like JSHint are, but we have other tools that cover stylistic concerns.

lein bikeshed

This is a relative newcomer to my own tool set. lein bikeshed has features related to the low-hanging fruit in our Clojure code: lines longer than 80 characters, blank lines at ends of files, and more. It will also tell you what percentage of functions have docstrings. Like other tools mentioned here, it is a lein plugin that you add to your profiles.clj.

A run of lein bikeshed on its own source (which purposefully includes some code designed to fail) looks like this:

Bikeshed might give a lot of output for your existing projects, but the warnings are worth investigating and addressing. You can always silence the long-lines warning if it doesn’t matter to you with the -m command line argument.

Tying it all together with a Lein alias

Wouldn’t it be great to run all these tools frequently, so that you can check for as many problems as possible? Well, you can, with a lein alias. (The lein wiki documents aliases in the lein sample.project.clj.)

In ~/.lein/profiles.clj, inside your :user map, add the line:
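
Something like the following sketch; the exact set of tasks is up to you (I leave Eastwood out of mine, for the reason noted below):

:aliases {"omni" ["do"
                  ["deps" ":tree"]
                  ["ancient"]
                  ["kibit"]
                  ["bikeshed"]]}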

Now, when you want to run all these tools at once on a project, you simply invoke lein omni. I use this alias on all my Clojure(Script) projects. I have grown accustomed to seeing the kinds of output that a clean Clojure project will have.

It’s worth noting that I don’t run Eastwood unless it is necessary for the project. When it is necessary, I override the alias in the project’s project.clj to run Eastwood as well.

This command can take some time to complete, but with an alias we’re only spinning up lein once.

And a bash alias

The output of lein omni can be long, which can mean either a lot of scrolling or neglecting to run the command because of the inconvenience. To help manage the length of the output, I’ve created a bash alias that runs the plugins and pipes their output to less.

My personal bash alias also runs midje at the end. You can choose whether to run the tests for your own alias. That’s just my personal preference.

Note that, just like running the lein alias above, this may take a bit of time. Since we’re piping to less, it might take a while before less receives output; while the command is still running, output will periodically show up at the bottom of the less buffer. You can use both Emacs’s and vim’s movement commands in less to advance the buffer. I find less more manageable for scrolling through output than switching to tmux’s history scrolling mode.

Managing your namespaces: lein slamhound

Namespace management often becomes an issue on nontrivial Clojure projects. Actively developing a project means managing the functions we pull in from other namespaces and from libraries. These require statements can often get out of date: they may be missing namespaces that are needed, or contain requires for old functions that are no longer used in the current code.

slamhound is a tool that can help to manage dependencies in your namespaces. It knows how to require and import Clojure and Java dependencies, and can remove stale requires that are no longer necessary. Slamhound can often fix missing requires for functions that it can resolve.

Note: slamhound rewrites the namespace macros in your project’s .clj files! I recommend only running it on code that’s committed to git (or whatever you use as a VCS) so that you can review and rollback any changes it makes.

The most basic way to use slamhound is to add it to your ~/.lein/profiles.clj as a dependency. Then add this alias:
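
The alias documented in the slamhound README looks like this (add it alongside the dependency; the version number here is only illustrative):

:dependencies [[slamhound "1.5.5"]]
:aliases      {"slamhound" ["run" "-m" "slam.hound"]}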

Now you can use slamhound on a project by running lein slamhound in the project’s directory. There’s also REPL and Emacs support, which you can learn more about in the slamhound README.

Measuring test coverage with cloverage

It is often claimed that less unit testing is necessary in Clojure because Clojure is functional and makes use of immutable data structures. And it is true that with functional programming, most tests are simple: given some input, the output should be a certain value.

Some would even argue that Clojure functions should be well-factored enough into simple functions that the behavior of the function is apparent and requires no tests. Still others maintain that developing in the REPL is as good as writing unit tests, since functions are constantly evaluated and integrated with this style of development.

That said, there’s still mutable Java code to interop with, there’s still the necessary evil of functions with side effects, and we might want to check the structure of the data we’re producing in our functions rather than the value of it. For all those reasons and to check that I don’t introduce regressions, I tend to write unit tests in Clojure.

This blog post isn’t a platform to argue for or against testing Clojure. But when you do test, you may wonder how to tell how much test coverage your test suite has. How do we know at a glance what percentage of our namespaces is being tested? And how do we find lines that are never being exercised in our tests? After all, we can’t improve what we don’t measure.

That’s where cloverage comes in. Cloverage is another lein plugin, so it gets added to ~/.lein/profiles.clj like the others. Then run lein cloverage in your project; it will run the test suite and generate a coverage report.

The coverage report appears in target/coverage as HTML files, broken down by namespace.

You can still use Cloverage even if you don’t use clojure.test. I use midje in most of my tests. To use Cloverage in those situations, wrap your tests in a deftest.

Since deftest takes a hyphenated Clojure symbol as its identifier, and Midje facts take a string, I’ve come to use deftest to group related tests together. Usually this means naming the group of tests after the function I’m testing, and naming each Midje fact after the situation that it exercises. This makes sense to me because it fits well with the hierarchy of RSpec unit tests in Ruby.

Here’s an example of using this approach:
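
A hedged example of the approach (the namespace and the function under test are made up for illustration):

;; myproject.parser/parse-line is a hypothetical function under test.
(ns myproject.parser-test
  (:require [clojure.test :refer [deftest]]
            [midje.sweet :refer :all]
            [myproject.parser :refer [parse-line]]))

(deftest parse-line-tests
  (fact "returns nil for a blank line"
    (parse-line "") => nil)
  (fact "splits a line on commas"
    (parse-line "a,b,c") => ["a" "b" "c"]))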

Cloverage also outputs a coverage.txt file that might be useful for use with services like Coveralls. I haven’t used this, so I can’t comment on its usefulness.

If you’re using speclj for your tests, you might run into some issues getting Cloverage to play nice. I don’t use speclj often, so when I couldn’t get it to work with Cloverage, I didn’t pursue the issue.

Final Thoughts

In this post, I covered 5 tools to add to your workflow, along with some others that might be useful in certain cases. I’m sure there are more useful tools out there that I don’t know about, and I’d love to hear about them.

I’m also thinking about writing some posts about other development tools that I use, particularly how I use midje to test, and how you can benchmark code with perforate. If you’re interested in those topics, get in touch and let me know.

Have fun and enjoy your cleaner codebase with these tools in your tool belt!


Interested in commenting or contacting me? Send an email to contact@mattgauger.com. Thanks!

.