Datomic Performance with Clause Ordering

Published: August 13, 2014


When optimizing Datomic queries, it’s best to put the most restrictive clauses first. That way, the query engine narrows the set of candidate datoms as early as possible and has less work to do in later clauses.

For example, say we have 100 cities, each with a daily temperature reading for every day of the last year: 36,500 temperature entries in total. If we want to find the temperature for Denver on July 31, which query would be faster?

Version 1:

[:find ?temp
 :in $ ?city ?date
 :where
 [?temp :temperature/city ?city]
 [?temp :temperature/date ?date]]

Version 2:

[:find ?temp
 :in $ ?city ?date
 :where
 [?temp :temperature/date ?date]
 [?temp :temperature/city ?city]]

Version 1 will filter 36,500 temperatures → 365 temperatures → 1 temperature.

Version 2 will filter 36,500 temperatures → 100 temperatures → 1 temperature.

So, per the guidance in the Datomic docs, Version 2 should perform better.

Let’s Get Scientific

Instead of just speculating, how about I run some timings and find out for sure?
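Before the timings, here’s a minimal sketch of the setup I’m assuming: an in-memory Datomic database with the two attributes the queries use, plus a seeding helper. The uri, cities, seed-days!, and db names are illustrative, not part of Datomic.

(require '[datomic.api :as d]
         '[clj-time.core :as t]
         '[clj-time.coerce :as c])

;; In-memory database.
(def uri "datomic:mem://temperatures")
(d/create-database uri)
(def conn (d/connect uri))

;; Schema: a city name and a reading date per temperature entity.
@(d/transact conn
   [{:db/id (d/tempid :db.part/db)
     :db/ident :temperature/city
     :db/valueType :db.type/string
     :db/cardinality :db.cardinality/one
     :db.install/_attribute :db.part/db}
    {:db/id (d/tempid :db.part/db)
     :db/ident :temperature/date
     :db/valueType :db.type/instant
     :db/cardinality :db.cardinality/one
     :db.install/_attribute :db.part/db}])

;; Denver plus 99 made-up cities.
(def cities (cons "Denver" (map #(str "City-" %) (range 99))))

(defn seed-days!
  "Transacts one reading per city per day, for n-days starting at start."
  [start n-days]
  (doseq [day (range n-days)
          :let [date (c/to-date (t/plus start (t/days day)))]]
    @(d/transact conn
       (for [city cities]
         {:db/id (d/tempid :db.part/user)
          :temperature/city city
          :temperature/date date}))))

;; One year of data ending on July 31, 2014: 100 x 365 = 36,500 entries.
(seed-days! (t/date-time 2013 8 1) 365)

;; The (db) call in the queries below grabs the current database value.
(defn db [] (d/db conn))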

(defn version-1 [& {city :for date :on}]
  (datomic.api/q '[:find ?temp
                   :in $ ?city ?date
                   :where
                   [?temp :temperature/city ?city]
                   [?temp :temperature/date ?date]]
                 (db) city date))

(defn version-2 [& {city :for date :on}]
  (datomic.api/q '[:find ?temp
                   :in $ ?city ?date
                   :where
                   [?temp :temperature/date ?date]
                   [?temp :temperature/city ?city]]
                 (db) city date))

(defn sample-1 [n-times]
  (dotimes [n n-times]
    (time (version-1 :for "Denver" :on (clj-time.coerce/to-date
                                         (clj-time.core/date-time 2014 7 31))))))

(defn sample-2 [n-times]
  (dotimes [n n-times]
    (time (version-2 :for "Denver" :on (clj-time.coerce/to-date
                                         (clj-time.core/date-time 2014 7 31))))))

(sample-1 100)
;;=> "Elapsed time: 10.949 msecs"
;;=> "Elapsed time: 9.007 msecs"
;;=> "Elapsed time: 9.8 msecs"
...

(sample-2 100)
;;=> "Elapsed time: 6.772 msecs"
;;=> "Elapsed time: 7.103 msecs"
;;=> "Elapsed time: 8.745 msecs"
...

It turns out, for a year’s worth of data for 100 cities, Version 1 averaged 8.91 msecs and Version 2 averaged 7.47 msecs. A measly 1.4 msecs, or about a 16% difference.
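The averages above were eyeballed from the printed times. A small helper along these lines (my own, and it skips JVM warm-up, so treat the numbers as rough) can compute them directly:

;; Average wall-clock msecs per call over n-times runs. Illustrative only;
;; no JIT warm-up pass, so results are approximate.
(defn avg-msecs [f n-times]
  (let [start (System/nanoTime)]
    (dotimes [_ n-times] (f))
    (/ (- (System/nanoTime) start) 1e6 n-times)))

(avg-msecs #(version-2 :for "Denver"
                       :on (clj-time.coerce/to-date
                             (clj-time.core/date-time 2014 7 31)))
           100)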

As the Dataset Grows

Let’s think about how these two queries perform as the data grows. After the second year of our application gathering daily temperature data, we would have 73,000 total temperature entries.

Version 1 will filter 73,000 temperatures → 730 temperatures → 1 temperature.

Version 2 will filter 73,000 temperatures → 100 temperatures → 1 temperature.

Even though the amount of data doubled, the number of candidate entries after the first clause in Version 2 stayed the same.

How does the increase in data affect our timings?
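With the sketch setup from earlier, adding the second year is one more call to my illustrative helper:

;; Seed the preceding year as well, doubling the dataset to 73,000 entries.
(seed-days! (t/date-time 2012 8 1) 365)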

(sample-1 100)
;;=> "Elapsed time: 42.886 msecs"
;;=> "Elapsed time: 30.389 msecs"
;;=> "Elapsed time: 43.137 msecs"
...

(sample-2 100)
;;=> "Elapsed time: 12.494 msecs"
;;=> "Elapsed time: 10.94 msecs"
;;=> "Elapsed time: 11.813 msecs"
...

Over two years of data, Version 1 averaged 18.87 msecs while Version 2 averaged 12.53 msecs, about 66% of Version 1’s running time. This goes to show that as your data grows, performance adjustments like these become increasingly important.

Caveat

For small datasets, say within the first week, here’s how the two versions perform.

Version 1 will filter 700 temperatures → 7 temperatures → 1 temperature.

Version 2 will filter 700 temperatures → 100 temperatures → 1 temperature.

Wait, what?

This happens because the number of cities is greater than the number of dates collected so far. In this example the difference is small and may amount to only fractions of a millisecond, but it serves as a reminder that your data is different from my data, and your results may vary.
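If you want to reproduce this small-dataset case with the sketch setup from earlier, recreate the database and schema, then seed only a week:

;; Against a fresh database (recreate the in-memory db and schema as in
;; the setup sketch), seed just the week ending July 31: 700 entries.
(seed-days! (t/date-time 2014 7 25) 7)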

