Clojure Data Science: Refactoring and Cleanup
This is Part 2 of a series of blog posts called Clojure Data Science. Check out the previous post if you missed it.
Welcome to the second post in this series. If you followed along in the last post, your code should be ready to use in this post. If not, or if you need to go back to a known working state, you can clone the autodjinn repo and git checkout v0.1.0.
I started out writing this post to develop simple functionality on our inbox data. Finishing the post was taking longer than I was expecting, so I split the post in half in the interest of posting this sooner.
In this post, we need to create an email ingestion script that we can run repeatedly with lein. We also need to talk about refactoring our code out into maintainable namespaces.
So make sure your Datomic transactor is running and launch a REPL, because it is time to give our code a makeover.
A Gmail ingestion script
Because Clojure sits on the JVM, it shares some similarities with Java. One of these is the special purpose of a -main function. You can think of this as the main method in a Java class. The -main function in a Clojure namespace will be run when a tool like lein tries to “run” the namespace. That sounds like exactly what we want to do with our Gmail import functionality, so we will add a -main function that calls our ingest-inbox function. To get started, we will only have it print us a message.
;; in core.clj:
(defn -main []
  (println "Hello world!"))
You can then run this by invoking lein run -m autodjinn.core. You should see Hello world! if everything worked. You may notice that the process doesn’t seem to quit after it prints the hello world message; this seems to be a problem with Leiningen. To ensure that our process ends when the script is done, we can add a (System/exit 0) line to the end of our -main function. On *nix systems, an exit code of 0 means a successful exit, and a nonzero exit code means something went wrong. Knowing this, we can use exit codes in the future to signal that an error occurred in our script. But for now, we will have the script end by returning 0 to indicate a successful exit.
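To make the exit-code idea concrete, here is a minimal sketch of how a script could report failure to the shell. The names exit-code-for and risky-step are hypothetical and not part of autodjinn; the only assumption is that a failing step throws an exception.

```clojure
(defn exit-code-for
  "Runs the zero-argument function f. Returns 0 if it completes,
  or 1 if it throws, after printing the error message to stderr."
  [f]
  (try
    (f)
    0
    (catch Exception e
      (binding [*out* *err*]
        (println "Error:" (.getMessage e)))
      1)))

(defn risky-step
  "Hypothetical stand-in for real work such as an inbox import."
  []
  (println "Doing some work"))

(defn -main []
  ;; Hand the resulting code to the shell: 0 on success, 1 on failure.
  (System/exit (exit-code-for risky-step)))
```

After a script like this finishes, running echo $? in a *nix shell shows the code the process returned.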
Think back to what we did to ingest email in our REPL in the last post. We had to connect to the database, run the data schema transaction, and then we were able to run ingest-inbox to pull in our email.
The following function will do the same thing. Remember that in Datomic, things like trying to create a database that already exists or re-running a schema update against the same schema should be harmless. It will add a new transaction ID, but it will not modify or destroy data. Putting together all the steps we need to run, we get a -main function that looks like this:
(defn -main
  "Perform a Gmail ingestion"
  []
  (println "Gmail ingestion starting up")
  (println "Attempting to update the schema")
  (update-schema)
  (println "Beginning email ingestion")
  (ingest-inbox)
  (println "Done ingesting")
  (System/exit 0))
Refactoring namespaces
With Clojure, you must walk a fine line between putting all of your functions into one big file and having too many namespaces. One big file quickly grows unmaintainable and gains too many responsibilities.
But having too many namespaces can also be a problem. It may create strange cyclic dependency errors. Or you may find that with many separate namespaces, you have to require many namespaces to get anything done.
To avoid this, I start with most code in one namespace, and then look for common functionality to extract to a new namespace. Good candidates to extract are groups of functions that all deal with the same business logic or business domain. If you notice that the responsibility of one group of functions is different from that of the rest, that group is a good candidate for a new namespace. Looking at responsibilities is a good way to decide where to break functions apart into namespaces.
In this project, we can identify two responsibilities that currently live in our autodjinn.core namespace. The first is working with the database. The second is ingesting Gmail messages. As our project grows, we will not want the code for ingesting Gmail messages to live in autodjinn.core. With that in mind, let’s create a new file called src/autodjinn/gmail_ingestion.clj and move over the vars and functions that we think should live there. That file should look like this:
(ns autodjinn.gmail-ingestion
  (:require [autodjinn.core :refer :all]
            [clojure-mail.core :refer :all]
            [clojure-mail.message :as message :refer [read-message]]
            [nomad :refer [defconfig]]
            [clojure.java.io :as io]
            ;; required explicitly because the helpers below call clojure.string functions
            [clojure.string]
            [datomic.api :as d]))
(defconfig mail-config (io/resource "config/autodjinn-config.edn"))
(def gmail-username (get (mail-config) :gmail-username))
(def gmail-password (get (mail-config) :gmail-password))
(defn get-sent-date
  "Returns an instant for the date sent"
  [msg]
  (.getSentDate msg))
(defn get-received-date
  "Returns an instant for the date received"
  [msg]
  (.getReceivedDate msg))
(defn cc-list
  "Returns a sequence of CC-ed recipients"
  [msg]
  (map str
       (.getRecipients msg javax.mail.Message$RecipientType/CC)))
(defn bcc-list
  "Returns a sequence of BCC-ed recipients"
  [msg]
  (map str
       (.getRecipients msg javax.mail.Message$RecipientType/BCC)))
(defn simple-content-type [full-content-type]
  (-> full-content-type
      (clojure.string/split #"[;]")
      (first)
      (clojure.string/lower-case)))
(defn is-content-type? [body requested-type]
  (= (simple-content-type (:content-type body))
     requested-type))
(defn find-body-of-type [bodies type]
  (:body (first (filter #(is-content-type? %1 type) bodies))))
(defn get-text-body [msg]
  (find-body-of-type (message/message-body msg) "text/plain"))
(defn get-html-body [msg]
  (find-body-of-type (message/message-body msg) "text/html"))
(defn remove-angle-brackets
  [string]
  (-> string
      (clojure.string/replace ">" "")
      (clojure.string/replace "<" "")))
(def my-store (gen-store gmail-username gmail-password))
(defn ingest-inbox []
  (doseq [msg (inbox my-store)]
    (println (message/subject msg))
    @(d/transact db-connection [{:db/id (d/tempid "db.part/user")
                                 :mail/uid (remove-angle-brackets (message/id msg))
                                 :mail/from (message/from msg)
                                 :mail/to (message/to msg)
                                 :mail/cc (cc-list msg)
                                 :mail/bcc (bcc-list msg)
                                 :mail/subject (message/subject msg)
                                 :mail/date-sent (get-sent-date msg)
                                 :mail/date-received (get-received-date msg)
                                 :mail/text-body (get-text-body msg)
                                 :mail/html-body (get-html-body msg)}])))
(defn -main
  "Perform a Gmail ingestion"
  []
  (println "Gmail ingestion starting up")
  (println "Attempting to update the schema")
  (update-schema)
  (println "Beginning email ingestion")
  (ingest-inbox)
  (println "Done ingesting")
  (System/exit 0))
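The body-selection helpers in gmail_ingestion.clj are pure functions, so they are easy to try at the REPL without a mail connection. The sketch below repeats three of them and runs them on hand-written sample data. It assumes, as the listing implies, that each body is a map with :content-type and :body keys; sample-bodies and its values are made up for illustration.

```clojure
(require '[clojure.string :as string])

(defn simple-content-type
  "Strips parameters such as charset and lower-cases the MIME type."
  [full-content-type]
  (-> full-content-type
      (string/split #"[;]")
      (first)
      (string/lower-case)))

(defn is-content-type? [body requested-type]
  (= (simple-content-type (:content-type body))
     requested-type))

(defn find-body-of-type
  "Returns the :body of the first part whose type matches, or nil."
  [bodies type]
  (:body (first (filter #(is-content-type? % type) bodies))))

;; Hypothetical sample data standing in for a parsed multipart message.
(def sample-bodies
  [{:content-type "TEXT/PLAIN; charset=utf-8" :body "plain version"}
   {:content-type "TEXT/HTML; charset=utf-8"  :body "html version"}])

(simple-content-type "TEXT/PLAIN; charset=utf-8") ;; => "text/plain"
(find-body-of-type sample-bodies "text/html")     ;; => "html version"
(find-body-of-type sample-bodies "image/png")     ;; => nil
```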
Be sure to remove from the autodjinn.core namespace the functions and vars that we moved to gmail_ingestion.clj. Note that we moved the -main function there, too, so that we can now run lein run -m autodjinn.gmail-ingestion.
You may also notice that we still had to require the datomic.api namespace in the new file to be able to perform a transaction. Our autodjinn.core namespace already handles database interaction, though. So let’s write a create-mail function in core.clj and call it in our new namespace:
(defn create-mail [attrs]
  (d/transact db-connection
              [(merge {:db/id (d/tempid "db.part/user")}
                      attrs)]))
And in gmail_ingestion.clj we change ingest-inbox to use the new function. While we’re at it, we’ll break out a convenience function to prepare the attr map for Datomic:
(defn db-attrs-for [msg]
  {:mail/uid (remove-angle-brackets (message/id msg))
   :mail/from (message/from msg)
   :mail/to (message/to msg)
   :mail/cc (cc-list msg)
   :mail/bcc (bcc-list msg)
   :mail/subject (message/subject msg)
   :mail/date-sent (get-sent-date msg)
   :mail/date-received (get-received-date msg)
   :mail/text-body (get-text-body msg)
   :mail/html-body (get-html-body msg)})
(defn ingest-inbox []
  (doseq [msg (inbox my-store)]
    (println (message/subject msg))
    @(create-mail (db-attrs-for msg))))
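One payoff of separating the attr-building step from the storage step is that the ingestion loop no longer needs to know how records get stored. The sketch below illustrates that idea with stand-ins only: saved, fake-attrs-for, ingest, and fake-inbox are hypothetical names, messages are plain maps rather than clojure-mail messages, and an atom takes the place of Datomic. It also passes the storage function in as an argument, which is a variation on the code in this post rather than what autodjinn does.

```clojure
;; Hypothetical stand-in for the database: an atom holding saved attr maps.
(def saved (atom []))

(defn fake-attrs-for
  "Builds an attribute map from a message map, in the spirit of db-attrs-for."
  [msg]
  {:mail/uid     (:id msg)
   :mail/subject (:subject msg)})

(defn ingest
  "Passes each message's attrs to create-fn. Because the storage function
  is a parameter, the same loop works with a real database or a test stub."
  [create-fn msgs]
  (doseq [msg msgs]
    (create-fn (fake-attrs-for msg))))

;; Made-up messages standing in for an inbox.
(def fake-inbox
  [{:id "1@example.com" :subject "Hello"}
   {:id "2@example.com" :subject "Re: Hello"}])

;; Store into the atom instead of transacting against Datomic.
(ingest #(swap! saved conj %) fake-inbox)
```

With this shape, checking the ingestion logic only requires inspecting the atom afterwards, with no transactor running.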
If we run our lein run -m autodjinn.gmail-ingestion command, we should see that the code is still working.
Don’t forget to remove the datomic.api requirement from the autodjinn.gmail-ingestion namespace! Now we only need to require Datomic in the autodjinn.core namespace.
There’s one more piece of low-hanging fruit that we can refactor in this code before moving on. The config file is loaded and used in both namespaces. We already require everything from autodjinn.core into autodjinn.gmail-ingestion. So we can safely change a few lines in gmail_ingestion.clj to reuse that config and stop requiring nomad in two places:
(ns autodjinn.gmail-ingestion
  (:require [autodjinn.core :refer :all]
            [clojure-mail.core :refer :all]
            [clojure-mail.message :as message :refer [read-message]]
            [clojure.java.io :as io]
            ;; required explicitly because the helpers call clojure.string functions
            [clojure.string]))
(def mail-config autodjinn.core/config)
(def gmail-username (get (mail-config) :gmail-username))
(def gmail-password (get (mail-config) :gmail-password))
And in core.clj:
(defconfig config (io/resource "config/autodjinn-config.edn"))
(def db-uri (get (config) :db-uri))
Running lein run -m autodjinn.gmail-ingestion one more time, we should see that our changes did not break the system. The config is now only loaded once, and we use it everywhere.
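The load-once idea does not depend on nomad. As a rough illustration in plain Clojure, the sketch below memoizes a loader so the EDN is parsed a single time no matter how many vars ask for it. Everything here is hypothetical: load-count, the inline EDN string, and its values stand in for the real config file, and this is not a description of how nomad's defconfig is implemented.

```clojure
(require '[clojure.edn :as edn])

;; Counts how many times the config is actually parsed.
(def load-count (atom 0))

(def config
  ;; memoize caches the result of the first call, so later calls
  ;; return the same map without parsing again.
  (memoize
    (fn []
      (swap! load-count inc)
      ;; Made-up values; a real project would read these from a file.
      (edn/read-string
        "{:db-uri \"datomic:free://localhost:4334/example\"
          :gmail-username \"someone@example.com\"}"))))

;; Two separate lookups, as core.clj and gmail_ingestion.clj would do.
(def db-uri (get (config) :db-uri))
(def gmail-username (get (config) :gmail-username))
```

Both vars are filled from the same cached map, and load-count stays at 1.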
That’s it! We’ve taken care of some low-hanging fruit and are ready to implement some new functionality. If you want to compare what you’ve done with my version, you can run git diff v0.1.1 on the autodjinn repo.
Please let me know what you think of these posts by sending me an email at contact@mattgauger.com. I’d love to hear from you!