Visualising GCSE statistics using Datomic and Quil
If you are interested in Clojure, you’ve probably heard about Datomic. Datomic is a database that is not only suitable for storing data, but also for performing complicated queries, combining data and extracting information.
Another interesting Clojure project is Quil, which aims to bring the abstractions from the Processing framework into Clojure-land. Quil defines a drawing loop and gives access to functions for drawing on a canvas. Most Quil applications I have seen so far have been generative art, but it is equally well suited to more complex data visualisation.
In this post I will load GCSE results data from 2011 and 2012 into an in-memory Datomic database and visualise it using Quil. The post walks through most steps of the process. The full listing can be found on GitHub.
Loading GCSE data
I’m interested in comparing the GCSE results from 2011 and 2012, grouped by gender, subject and result. In particular, I want to find the subjects in which students’ performance has decreased.
The dataset I’ve used can be downloaded from here. An HTML view of the data can be found here. The data has been altered slightly – the third row contained a wrongly formatted number. I have also removed the last six lines, which contain totals.
We are going to use Datomic and Quil, so we start by including both libraries in the project.
Converting the CSV file into a sequence of columns is quite easy.
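As a rough illustration of that step, here is a minimal sketch that uses only clojure.string. The function names and the sample text are made up for this example, and the parser assumes that no field contains a quoted comma, which may not hold for the real file.

```clojure
(require '[clojure.string :as str])

(defn parse-csv
  "Splits CSV text into rows of trimmed cells.
  Naive: assumes no quoted fields containing commas."
  [text]
  (->> (str/split-lines text)
       (remove str/blank?)
       (map (fn [line] (mapv str/trim (str/split line #","))))))

(defn columns
  "Transposes rows into columns. Assumes every row has the same length."
  [rows]
  (apply map vector rows))

;; Hypothetical sample, not the real dataset.
(def sample
  "Subject,Gender,Sat,C or above\nMaths,Male,100,60.5\nArt,Male,40,70.0")

(def rows (parse-csv sample))
```

Calling `(columns rows)` then gives one vector per column, which is convenient when only a single column, such as the C-or-above figures, is needed.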
To get the data into Datomic, we first need to define a schema. We are only going to track grades of C or above, as they are normally needed for gaining access to further education and jobs. Therefore, the schema will only contain data from that column.
Each line of the data is loaded in a separate transaction. Here we use datomic.api/tempid rather than the reader macro for creating ids (#db/id): the reader macro would return the same id on every iteration of the loop, so it cannot be used.
We need to transact the schema and the data into a database. As we do not need any persistence, we can use an in-memory database.
We can now query Datomic. For example, we can retrieve every distinct subject.
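The Datomic steps described above can be put together roughly as follows. This is a sketch under assumptions rather than the original listing: the attribute names (`:gcse/subject` and friends), the database name and the sample rows are all invented for illustration, and it needs the Datomic peer library on the classpath.

```clojure
(require '[datomic.api :as d])

;; Hypothetical attribute names; the schema used in the post may differ.
(defn attribute [ident value-type]
  {:db/id (d/tempid :db.part/db)
   :db/ident ident
   :db/valueType value-type
   :db/cardinality :db.cardinality/one
   :db.install/_attribute :db.part/db})

(def schema
  [(attribute :gcse/subject :db.type/string)
   (attribute :gcse/gender :db.type/string)
   (attribute :gcse/year :db.type/long)
   (attribute :gcse/c-or-above :db.type/double)])

;; Made-up rows for illustration only, not real results.
(def sample-rows
  [["Maths" "Male" 2011 60.5]
   ["Maths" "Male" 2012 58.0]
   ["Art" "Male" 2012 70.0]])

(defn row->tx
  "Transaction data for one row. d/tempid is called at run time,
  so every row gets a fresh temporary id."
  [[subject gender year pct]]
  [{:db/id (d/tempid :db.part/user)
    :gcse/subject subject
    :gcse/gender gender
    :gcse/year year
    :gcse/c-or-above pct}])

;; In-memory database: nothing is persisted.
(def uri "datomic:mem://gcse-sketch")
(d/create-database uri)
(def conn (d/connect uri))

;; Schema first, then one transaction per row.
@(d/transact conn schema)
(doseq [row sample-rows]
  @(d/transact conn (row->tx row)))

(defn all-subjects
  "Returns the set of distinct subject names in the database value."
  [db]
  (set (map first (d/q '[:find ?subject
                         :where [_ :gcse/subject ?subject]]
                       db))))
```

Evaluating `(all-subjects (d/db conn))` runs the query against the current database value.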
Visualising the data using Quil
In Quil, we define a function for setting up the viewport and a function for drawing on the canvas. Let’s start by just painting the background white.
To see the result, we need to define a sketch.
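A minimal sketch of these three pieces might look like the following. The title, canvas size and frame rate are arbitrary choices for this example, and `start-sketch` is a hypothetical wrapper so that no window opens until it is called.

```clojure
(require '[quil.core :as q])

(defn setup
  "Runs once when the sketch starts."
  []
  (q/frame-rate 30))

(defn draw
  "Runs on every frame. For now it only paints the background white."
  []
  (q/background 255))

(defn start-sketch
  "Opens the sketch window. Size and title are arbitrary."
  []
  (q/sketch :title "GCSE results"
            :size [600 600]
            :setup setup
            ;; Wrapping draw in a fn looks the var up on every frame,
            ;; so re-evaluating draw at the REPL takes effect at once.
            :draw (fn [] (draw))))
```

Calling `(start-sketch)` from a REPL opens the window. Quil also provides the `defsketch` macro, which defines and starts a named sketch in one step.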
If you connect to nREPL and recompile the code, the running sketch updates to reflect your changes. I relied on that while developing the visualisation.
We are going to plot a 2D point for each subject. The first coordinate is the percentage of students achieving C or above in that subject in 2011. The second coordinate is the same figure for 2012. All points above the line y = x are subjects in which students did better in 2012 than in 2011, and all points below it are subjects in which they did worse. We are going to stick to “Male” pupils for now.
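Mapping a data point to the canvas could be done along these lines. This is only a sketch: it assumes both coordinates are percentages between 0 and 100 (as the hover text later in the post suggests), and the canvas size, margin and function names are hypothetical. Processing's y axis grows downwards, so the second coordinate is flipped.

```clojure
(def canvas-size 600) ; assumed square canvas, in pixels
(def margin 50)       ; assumed blank border, in pixels

(defn scale
  "Maps a percentage (0-100) onto the drawable span in pixels."
  [pct]
  (+ margin (* (/ pct 100.0) (- canvas-size (* 2 margin)))))

(defn point->screen
  "Converts [pct-2011 pct-2012] into [x y] pixel coordinates.
  The y value is flipped because screen y grows downwards."
  [[pct-2011 pct-2012]]
  [(scale pct-2011)
   (- canvas-size (scale pct-2012))])
```

With this mapping, points on the line y = x lie on the diagonal running from the bottom-left to the top-right corner of the drawable area.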
It can be quite difficult to tell which points are above and which are below y = x. Let’s colour the subjects in which students are doing better blue, and the ones in which they are doing worse red.
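The colouring rule boils down to comparing the two coordinates. A small sketch, where the function name and the grey colour for unchanged subjects are my own additions:

```clojure
(defn point-colour
  "Returns an [r g b] colour for a data point [x y]:
  blue above the line y = x, red below it, grey exactly on it."
  [[x y]]
  (cond
    (> y x) [0 0 255]
    (< y x) [255 0 0]
    :else   [128 128 128]))
```

Inside a Quil draw function the result could be applied with something like `(apply q/fill (point-colour p))` before drawing the dot.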
In our data, the number of students who sat each subject varies considerably. Let’s scale the dots according to the number of pupils who sat each subject. The area of each dot is going to be proportional to the number of students.
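For the area to be proportional to the number of students, the diameter has to grow with the square root of that number: the area is π(d/2)², so d = k·√n gives an area of πk²n/4. A sketch, with a hypothetical scaling factor `k` that would need tuning for the real data:

```clojure
(defn diameter
  "Diameter of a dot whose area is proportional to n students.
  k is an arbitrary scaling factor in pixels."
  [n k]
  (* k (Math/sqrt n)))
```

A subject with four times as many students therefore gets a dot with twice the diameter and four times the area.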
The image reveals that quite a lot of students seem to have done worse in three different subjects. However, we cannot tell which subjects they are. Let’s highlight the subject our mouse cursor is closest to, and print the name of the subject and the percentage of students achieving C or above for 2012 and 2011.
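Finding the subject closest to the cursor is a nearest-neighbour search over the plotted points. A minimal sketch, assuming each plotted subject is a map with a `:pos` key holding its pixel coordinates (a made-up representation for this example):

```clojure
(defn dist-sq
  "Squared Euclidean distance between two [x y] points."
  [[x1 y1] [x2 y2]]
  (let [dx (- x1 x2)
        dy (- y1 y2)]
    (+ (* dx dx) (* dy dy))))

(defn nearest
  "Returns the entry whose :pos is closest to mouse, or nil if
  there are no entries."
  [mouse entries]
  (when (seq entries)
    (apply min-key (fn [entry] (dist-sq mouse (:pos entry))) entries)))
```

In a Quil draw function the cursor position is available as `[(q/mouse-x) (q/mouse-y)]`. Comparing squared distances avoids a square root and picks the same point.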
Alright, so now we can see that it’s Science, Maths and English that have fallen. This observation matches the criticism levelled at the new marking scheme in 2012.
This is only data for boys. In the final version I have added axes and a keypress listener for changing gender. The full listing can be found on GitHub.
As we can see, girls are doing slightly better than boys in most subjects, while the mixed group evens out the differences.
Another interesting observation is that students generally have very good marks in Chemistry, Biology and Physics. This might be because these three subjects are usually combined in GCSE Science, and only the most able students are likely to take them as three separate GCSEs. Grammar schools and selective private schools in particular tend to split Science into these three subjects, which means the data might be slightly skewed towards these schools in those three data points. Unfortunately, the dataset does not contain information about the types of schools.
Conclusion
In this blog post I wanted to demonstrate the visualisation of GCSE data from 2011 and 2012. The resulting program is an interactive data visualisation, coming in at under 200 lines of code, including database schema, parsing and visualisation.
Datomic has shown itself to be an extremely useful tool for data analysis. Writing queries to aggregate data feels very natural. Likewise, Quil has proved capable of quite advanced visualisations with very little code.
I’m looking forward to seeing what types of data visualisations people will come up with using these powerful but very simple tools!