How might GCSE and A Level results be generated? A machine learning perspective.

Since the announcement of the cancellation of the GCSE and A Level examinations, Year 11 and 13 students have found themselves with plenty of time on their hands. Having spent mine so far pursuing an online machine learning course[1] at the suggestion of Dr Hedges, I thought it might be interesting to apply the results of these endeavours back to their cause, namely by explaining a possible method for exam grade generation. Before proceeding, I’d like to point out that this is not based on any evidence beyond my own thought experiment, and so should be treated as such.

To begin with, what is machine learning and why does it apply to this problem? Machine learning is the science of getting computers to perform tasks without being explicitly programmed, specifically tasks involving data and predictions. For example, we might train a machine learning algorithm to predict some result, e.g. banana prices, based on input data such as supply, demand and quality. In our case, machine learning is well suited because we have an end goal to be predicted (GCSE/A Level grades), input data (SATs, Mocks etc. - discussed further later), and a large number of cases to apply the algorithm to. Having lots of students means both that the time saving from using an algorithm rather than manually predicting grades is greater, and that the predicted grades should be more reliable, because they are based on a larger sample of available data.

We call this “training” data, which is a collection of “training examples”. Here, each example is one past student for whom we have a number of input variables, such as their Mock Exam results, and the ‘correct answer’ - their final grade. We can then use a “learning algorithm” on this data to work out the correlation between inputs and outputs. Having done so, it’s a simple case of feeding each current student into the produced function and recording the resulting predicted grade. Because we have the ‘correct answer’ for each of our past students, this type of learning is called “supervised learning” - we are “supervising” the algorithm as it tries to discover the relationship between mock and final grades.
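As a minimal sketch of this idea in Python - with invented banana-price numbers, and a simple least-squares line fit standing in for the learning algorithm:

```python
import numpy as np

# Invented "training data": each example pairs an input with its 'correct answer'.
supply = np.array([10.0, 8.0, 6.0, 4.0])    # feature: banana supply (tonnes)
price = np.array([0.50, 0.60, 0.75, 0.95])  # 'correct answers': observed prices

# The "learning algorithm" (here, least-squares line fitting) works out the
# correlation between input and output from the training examples.
slope, intercept = np.polyfit(supply, price, 1)

# Having learnt the relationship, predicting a new case is simple:
predicted_price = slope * 5.0 + intercept
print(round(predicted_price, 2))  # → 0.85
```

The same shape of process - training examples in, learnt relationship out, then predictions for new cases - is what we would apply to students and grades.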

Features
A key point of discussion for our problem is: on what basis should we predict students’ grades? The cornucopia of possible measures includes Mock Exam results, SAT results, effort in class (as measured by the teacher), scores in homework assignments, test results throughout the year and, for A Level students, GCSE grades. Problematically, each measure has its opponents and proponents, so selecting which ones to take into account is difficult. We could also ask whether we should include characteristics such as socio-economic background, cultural origin or gender. Plenty of students would be angered to feel constrained in their results by an immutable aspect of their nature, but studies have suggested that BAME students often outperform their mock grades by more than other groups.[2] If we want truly representative predicted grades, we should surely include all of the data we can.

In some ways, these multitudinous possibilities make the problem both more and less suitable for machine learning. The risk that a computer would unfairly disadvantage some students by focusing on specific characteristics would make it difficult to persuade the public to trust an algorithm. However, machine learning works particularly well with a large number of different input data sources (we call these “features”), and can use historical data to find all of the correlations, even those that humans couldn’t spot. This both reduces the risk of bias, which is a concern for teacher-predicted grades, and increases the likelihood that our predicted grades will be genuinely representative of trends in student performance. Machine learning can also work out exactly how much each feature should contribute to the output value, through learnt “parameters”. These are multipliers applied to the features to modify them before a result is determined. Using them means machine learning will be able to draw on all of the available student characteristics without unfairly focusing on certain ones.

In this example, I’m going to use data which I have fabricated to illustrate the concepts; the real data exists and could conceivably be used by the DfE[3] - I simply don’t have access to it. We will use SATs results (raw marks), Mock Exam results (raw marks) and a teacher effort score (out of 100) as representative features.


(Example input data “from past students” that we could use for the algorithm)
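Example data of this shape could be represented in code as follows - a sketch in Python, with every number entirely fabricated for illustration:

```python
import numpy as np

# Entirely fabricated example data for five past students.
# Columns: SATs result (raw mark), Mock Exam result (raw mark),
#          teacher effort score (out of 100).
X = np.array([
    [92.0, 68.0, 75.0],
    [78.0, 54.0, 60.0],
    [85.0, 71.0, 88.0],
    [60.0, 40.0, 55.0],
    [95.0, 80.0, 92.0],
])

# The 'correct answer' for each past student: their final raw mark.
y = np.array([72.0, 58.0, 74.0, 45.0, 85.0])

print(X.shape, y.shape)  # one row of features per training example
```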

Algorithm
Having chosen our features, we now need to decide how we are going to produce a result for each student. There are two main types of supervised learning algorithm that we could use - regression and classification. The most basic regression model is Linear Regression, and the most basic classification algorithm is Logistic Regression. We mainly use Linear Regression to determine correlations between continuous variables; this essentially amounts to drawing a line of best fit to our data without plotting the graph. In this method, each student’s mark would be predicted by reading it off the function produced by the algorithm, and these marks could then be converted into grades using the normal grade boundaries method. Raw marks are close enough to continuous that we needn’t worry that they can only take integer values, but grades are very much discrete data. If we want to predict grades directly instead of marks, we would use Logistic Regression, which is used to predict discrete values. We can think of it as plotting a graph of all of the students and then drawing the line which separates the grade 9s from the 8s, the 8s from the 7s, and so on. I will briefly discuss both methods here, and if you want to investigate in more detail you can take a look at my code or the machine learning course[1] (both are linked below).

Linear Regression - Marks
In Linear Regression, we will use the learning algorithm to produce a function, called the hypothesis function, which represents the line of best fit that I previously mentioned. This will have the form \(y = mx_1 + nx_2 + ox_3 + c\), or in machine learning notation, \(h(x) = θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_3\). All these \(θ\) characters are just numbers that our machine learning algorithm is trying to work out, while the \(x\)s are the input features (SATs results, Mock Exam results and the teacher effort score). So \(θ_1x_1\) is a real number to be determined multiplied by the SAT result, \(θ_2x_2\) is a real number to be determined multiplied by the Mock Exam result, and so on, while \(h(x)\) is the predicted final mark of the student. As you can imagine, this means that \(θ_1\), the SAT score multiplier, should be lower than \(θ_2\), because the Mock Exam result should have a greater bearing on the student’s grade than the SATs scores, which are 5 or 7 years out of date.
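A sketch of how the θ values could be found, in Python, using some fabricated student numbers and numpy’s least-squares solver (the course’s learning algorithm uses gradient descent instead, but both minimise the same prediction error):

```python
import numpy as np

# Fabricated training data: columns are SATs, Mock and effort; y is the final mark.
X = np.array([
    [92.0, 68.0, 75.0],
    [78.0, 54.0, 60.0],
    [85.0, 71.0, 88.0],
    [60.0, 40.0, 55.0],
    [95.0, 80.0, 92.0],
])
y = np.array([72.0, 58.0, 74.0, 45.0, 85.0])

# Prepend a column of ones so that θ0 plays the role of the constant c.
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares picks the θ values minimising the prediction error on the
# training examples - the answer a learning algorithm converges towards.
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# h(x) = θ0 + θ1·SATs + θ2·Mock + θ3·effort for a new (made-up) student:
new_student = np.array([1.0, 80.0, 65.0, 70.0])
predicted_mark = new_student @ theta
print(theta, predicted_mark)
```

Because the input data is made up, the particular θ values printed here say nothing about the true relationship between mocks and final grades - only the method is illustrative.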

Logistic Regression - Grades
In Logistic Regression, our hypothesis function is a little more complicated, as it involves a special function called the sigmoid function, which makes sure that the output is between 0 and 1. Hopefully you can intuitively imagine how this output would be used. For each grade boundary, from 1-2 up to 8-9, we will use Logistic Regression to draw the line that separates the students above that boundary from those below it. When we have a new student to generate a grade for, we will put their data into each of these 8 functions, and each one will tell us whether the student is above that boundary, based on whether the function outputs a number above or below 0.5. We can then zero in on the student’s correct grade. E.g.:

Boundary (above grade)    Output        Conclusion
1                         h(x) > 0.5    grade should be above 1
2                         h(x) > 0.5    grade should be above 2
3                         h(x) > 0.5    grade should be above 3
4                         h(x) > 0.5    grade should be above 4
5                         h(x) > 0.5    grade should be above 5
6                         h(x) > 0.5    grade should be above 6
7                         h(x) < 0.5    grade should be 7 or below - hence, it must be 7
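The elimination process above can be sketched in Python. The θ values here are invented toy parameters - for simplicity each classifier looks only at the mock mark, firing when it exceeds a made-up threshold for that boundary - whereas real parameters would be learnt from data:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Invented toy parameters: row k is the classifier for the boundary between
# grade k+1 and k+2. Each row holds θ0, θ1 (SATs), θ2 (Mock), θ3 (effort).
# sigmoid(θ·x) > 0.5 exactly when the mock mark exceeds a made-up threshold.
thresholds = np.arange(25.0, 105.0, 10.0)            # 25, 35, ..., 95
thetas = np.array([[-t, 0.0, 1.0, 0.0] for t in thresholds])

def predict_grade(student):
    x = np.concatenate([[1.0], student])             # prepend 1 for θ0
    grade = 1
    for theta in thetas:                             # boundaries 1-2 ... 8-9
        if sigmoid(x @ theta) > 0.5:                 # "above this boundary"
            grade += 1
        else:
            break                                    # first failed boundary
    return grade

student = np.array([80.0, 68.0, 70.0])               # SATs, Mock, effort
print(predict_grade(student))  # → 6
```

With a mock mark of 68, this made-up student clears the first five boundaries, fails the 6-7 boundary’s check, and so lands on grade 6 - the same zeroing-in logic as the table above.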

Using Logistic Regression to predict grades instead of marks speeds up the process and reduces the comparison of marks between students, which should make students happier about the results they get. Furthermore, a mark wouldn’t really make sense, as the students haven’t actually sat a test to generate it from. However, it would be much more difficult to standardise the results to make sure that the number of each grade awarded matches other years, because we cannot set grade boundaries to achieve a normal distribution. Both methods share the issue that grades could go down from the mock exams if schools used grade boundaries that differed from each other, meaning some students received a higher or lower mock grade than they should have. This is likely to frustrate students and lead to additional work in an appeals process, which is why machine learning may be problematic in practice.

Results
If you want to take a look at my results and the code used, click the following link and select the .mlx files for either the logistic or linear regression method.


Remember, however, that this is simply an illustration of the methods that could be used in these algorithms, and not an indication of the correct relationship between mock grades and final grades, because I made up the input data. 

To conclude
While machine learning seems to promise a future in which grades can be fairly and accurately generated, all of the available data can be used, and all students’ needs can be met, it must be recognised that computational algorithms cannot adjust to the multitudinous specific circumstances that could occur amongst this year’s cohort of exam candidates. If a student was ill during their mock exams, decided not to revise because they wanted to see how much they knew anyway, was particularly liked or disliked by a teacher, or their school’s mock exams were easier or harder than another school’s, machine learning would struggle. This is because it only works when the input data can be easily and fairly quantified: we can’t convert the stress a student felt from being ill during their mock exams into a number. Equally, since the time saving from machine learning comes from applying the same algorithm to all cases, it wouldn’t work if different schools submitted data from different exams. This raises the question - why not just use the input data directly, if it is already fair for everyone? As such, while machine learning is an exciting frontier, its application to predicting students’ results currently seems limited: teachers predicting grades seems like the best option for this year, and the most likely outcome of Ofqual’s decision-making process. Hopefully applying the ideas of machine learning to the one topical issue other than COVID-19 has allowed you to understand a little about how it works below the surface.

If you want to learn more about machine learning, take a look at the course linked in my sources, as I have found it to be one of the best I have tried. The level of maths involved is quite a bit higher than what I have included here, so it’s probably most suitable for those in Y12 and 13, and those Y11s who have done Further Maths.

Finally, I’d like to take the opportunity to thank Dr Hedges for recommending that I follow the path of machine learning, which has been so far interesting, enlightening and relevant, and there is still plenty more to learn.

Hamish Starling
11C

Sources & Links
[3] - A possible dataset for real-world machine learning of grades: https://www.ucl.ac.uk/ioe/news/2019/jul/linked-education-data-opens-new-research-opportunities
[4] - A good beginners’ dataset for your first ML program: https://archive.ics.uci.edu/ml/datasets/Iris