A case study in analysing results and adapting future instruction using principles of Purpose, Accuracy, Velocity and Elaboration.
Mock exams are a time of year when you can source a large amount of rich information to inform future instruction. However, reflections are often restricted to single-letter or single-number analysis, comparisons to predicted grades and then performance-management-esque meetings about how things are going to improve – again building the feeling that data is something used ‘against’ teachers, not for the benefit of the instructors. The right statistical analysis can open up lots of meaningful information, and there are things we can do when designing these assessments that help promote its identification.
This week we have our second mock exam session of the year in the Sixth Form. The post that follows specifically regards the analysis and implementation that followed December’s mock exam sitting. This is the third post in which I look at the impact of the Go North East bus strike – however, it is a broader post than that: an evaluation of approaches to using data to inform intervention and delivery following poor mock examination performance and (hopefully) the formulation of an approach that can improve the performance of young people moving forward.
I will also be mapping to the proposed ‘4 Principles for sourcing and applying data to instruction (VAPE)’ (https://themarkscheme.co.uk/data-is-not-a-dirty-word/) although with a slight reordering to PAVE.
PURPOSE: What is a mock exam for?
Trial summative assessments, such as mock exams, serve a variety of purposes.
1. Allows teachers and students the opportunity to track performance and progress over time.
2. Allows students to experience exam conditions, trial elements of structured revision and familiarise themselves with examination-style materials and the pressures associated with terminal assessment in a more robust way than general lesson time facilitates.
3. Allows institutions to gather evidence and proof of ‘regular way of working’ for the application of access arrangements for exams through the JCQ.
4. Provides a monitoring and reporting point for institutions. This can inform predicted grades, reports to stakeholders etc.
5. Allows an institution to attempt to take a snapshot of current performance – in order to infer potential end-of-course outcomes and divert resources towards intervention.
Often these purposes stand in contrast to each other – and for many an institution, being able to attach a grade or level of performance is viewed as the end goal of mock examinations.
With regards to the construction of an assessment, Evidence Based Education’s fantastic “Four Pillars of Assessment” document (https://evidencebased.education/wp-content/uploads/sites/3/4-pillars-of-great-assessment.pdf) puts “Purpose” as the first pillar to consider – what skills or knowledge from the curriculum are you wanting to assess? This “Purpose” can also be spun to think about what data you are actually trying to get from the assessment – what future actions or interpretations are you wanting to make following it?
My December mock exams for both Year 12 and Year 13 Mathematics centre around Pure Maths concepts. These account for two-thirds of the overall qualification. I am going to focus on Year 12 in this post, as this was the group where the October/November bus strikes had the largest impact. I thought about the data that I wanted to collect and categorised learners into three groups based on how the bus strikes impacted physical attendance in class:
1) No Physical Attendance Issues
2) Some Physical Attendance Issues (were able to attend some lessons live but predominantly attended over Teams)
3) Major/Constant Physical Attendance Issues (averaged less than 1 in-person lesson a week)
I also categorised questions and topics by what had been covered before the bus strikes and during the bus strikes. I have used the same assessment now for three academic years, so not only was I in a position to compare the students within each category, I also had national performance data from the questions’ use in terminal examinations, as well as a selection of previous cohorts who had sat this mock exam under equivalent conditions and timeframes.
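As a rough sketch of how this coding can sit in a simple script (the names and marks below are illustrative, not real student data), something along these lines is enough to support the comparisons that follow:

```python
# Illustrative sketch of the data coding described above – hypothetical values only.
# Attendance categories: 1 = no issues, 2 = some issues, 3 = major/constant issues.
# Each question is tagged by whether its topic was taught before or during the strikes.
from dataclasses import dataclass

@dataclass
class QuestionResult:
    question: str
    topic_period: str    # "pre-strike" or "strike"
    marks: int
    max_marks: int

@dataclass
class StudentRecord:
    student_id: str
    attendance_category: int   # 1, 2 or 3 as defined above
    results: list

    def percent(self, period: str) -> float:
        """Percentage scored on questions taught in one period."""
        scored = sum(r.marks for r in self.results if r.topic_period == period)
        available = sum(r.max_marks for r in self.results if r.topic_period == period)
        return 100 * scored / available if available else 0.0

# One made-up record:
example = StudentRecord("S01", 3, [
    QuestionResult("Q4 Differentiation", "strike", 3, 8),
    QuestionResult("Q2 Coordinate Geometry", "pre-strike", 6, 7),
])
print(round(example.percent("strike"), 1), round(example.percent("pre-strike"), 1))
```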
Reusing an assessment is something I highly recommend for key mock examination points. This grows the pool of comparative data that you have, and helps with purposes 4 and 5 – aiding predictions, your confidence in their accuracy, and the identification of areas of comparative concern. The assessment I use is based on 5 of the 9 specification topics of Pure Maths within AS Maths – with materials taken from two different exam papers to gain additional depth across these topic areas. This level of depth is not representative of the final exam, but it does give scope to assess further knowledge in the five areas covered, and is purposeful as a reflection point on topics covered, to judge areas for improvement and where performance has dropped from earlier in the year.
There is also an over-emphasis on AO3 modelling questions. This is due to the comparative difficulty young people find in these questions – it helps add a level of difficulty, missing here, that topics like Exponentials and Logarithms bring to the full examination – and it also lets me contrast general topic performance with modelling/application-based performance for a topic. As noted in a previous post (https://themarkscheme.co.uk/the-end-of-it/), I was able to identify significant underperformance in these questions, which then informed future delivery involving language and aiming to inspire use of the elaborated code in the cohort.
Grading is based on the grade boundaries of the exam papers the questions came from, adjusted using the ‘average marks per grade’ for each question as a proportion of the overall grade boundary. This has led to grade boundaries that are, as percentages, proportionally slightly higher than those of the exam papers themselves. This decision reduces the likelihood of over-representing ‘graded performance’; however, that graded performance is only part of the purpose of these assessments – although arguably, for the institution, the part that holds the most weight (whether rightly or wrongly, I won’t discuss that here).
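As a rough sketch of the kind of calculation involved – assuming per-question ‘average marks per grade’ figures are to hand from exam-board analytics, and with entirely invented numbers – the adjustment can be set out in a few lines:

```python
# Hedged sketch: estimate mock-exam grade boundaries from the average marks that
# students at each grade scored on the selected questions in the original sittings.
# All figures below are invented for illustration.

avg_marks_per_grade = {
    "Q3 (Paper A)": {"A": 6.1, "B": 5.2, "C": 4.1, "D": 3.0, "E": 2.1},
    "Q7 (Paper A)": {"A": 9.4, "B": 7.8, "C": 6.0, "D": 4.3, "E": 2.9},
    "Q2 (Paper B)": {"A": 4.7, "B": 4.1, "C": 3.5, "D": 2.6, "E": 1.8},
}

def mock_boundaries(per_question: dict) -> dict:
    """Sum the per-question averages to estimate a boundary for each grade."""
    grades = ["A", "B", "C", "D", "E"]
    return {g: round(sum(q[g] for q in per_question.values())) for g in grades}

print(mock_boundaries(avg_marks_per_grade))
# e.g. {'A': 20, 'B': 17, 'C': 14, 'D': 10, 'E': 7}
```

Because these are average marks rather than minimum marks, the resulting boundaries sit a little above the original papers’ percentage boundaries – which is the effect described above.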
The first stage of analysis was to compare the graded performance of the current cohort against the previous cohorts who had sat this paper at an equivalent time and under equivalent conditions.
Comparing the overall cohort, you can see a significant fall in high grades. A/B performance has fallen from around 41% to around 12%, and failing grades have increased from approximately 11% to 29%. Nationally, in summer 2023, approximately 24% of learners failed to gain a grade, but as previously stated this assessment is not set up as a direct indicator of final performance (I will discuss this more in the next section). Even so, the difference is concerning. These cumulative grades show that the bus strikes have clearly had some influence on performance, but more worrying is that performance hasn’t held up even for those unaffected by the bus strikes – which requires further investigation.
Breaking into smaller categories allows us to delve a bit further. For those affected by the bus strikes there is a much greater proportion of lower grades, the highest grade of those with ‘major attendance issues’ being a grade D. This student and I had a conversation following the feedback in which we identified that, at that point, they had technically spent more time ‘on Teams’ for class than actually ‘in class’. Using previous assessment data, this student has fallen from the top 20% of the cohort earlier in the year, and two of the U grades have also fallen from the top 50% of the cohort.
Analysing the U grades further using conditional probabilities emphasises how the bus strikes have led to reduced comparative performance:
The probability a student failed to achieve GIVEN THAT they had no physical attendance issues = 15.8%
The probability a student failed to achieve GIVEN THAT they had some physical attendance issues = 33.3%
The probability a student failed to achieve GIVEN THAT they had constant/major physical attendance issues = 66.6%
These statistics indicate a roughly doubling likelihood of ‘failing’ on the metric of the mock exam at each step, as in-person attendance reduced through the given categories.
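These conditional probabilities fall straight out of a simple tally of the cohort. A minimal sketch – with invented counts chosen only to mirror the percentages above – looks like this:

```python
# Minimal sketch: P(U grade | attendance category) from simple counts.
# The counts are invented purely to illustrate the calculation.
from collections import Counter

# (attendance_category, achieved_U) pairs for each student
students = (
    [(1, True)] * 3 + [(1, False)] * 16   # no attendance issues
    + [(2, True)] * 2 + [(2, False)] * 4  # some attendance issues
    + [(3, True)] * 2 + [(3, False)] * 1  # major/constant attendance issues
)

totals = Counter(cat for cat, _ in students)
fails = Counter(cat for cat, failed in students if failed)

for cat, label in [(1, "no issues"), (2, "some issues"), (3, "major/constant issues")]:
    p = fails[cat] / totals[cat]
    print(f"P(U | {label}) = {p:.1%}")
```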
ACCURACY: How do you know the data you get is accurate?
I’ve been throwing grades around, but one of the problems with assessments, as pointed out by EBE, centres around ‘validity’. No assessment is technically ‘valid’ or ‘invalid’ – but how you choose to use it can be. Are these grades accurate? How do I know?
Ultimately any assessment judges performance on a given day – inference from a December assessment to final performance is always going to be built around a high level of extrapolation (predicting beyond what the information collected actually tells you). Furthermore, a terminal assessment sitting is itself a ‘sample’ of the knowledge and skills that make up a specification; for these Year 12s it’s 100 marks of Pure Maths, 30 marks of Statistics and 30 marks of Mechanics – sampling knowledge to infer ability in comparison to the rest of a national cohort. At its very best a mock exam is therefore fundamentally a ‘sample’ of a ‘sample’. The graded aspect is always in some respects unreliable – however, in the context of the institution it is arguably where the largest emphasis is placed.
Accuracy of ‘grade’ is only one part of the accuracy you have to account for with regards to data. Another is the expectations of the assessment: are the questions at the appropriate standard, and are you applying mark schemes with the correct rigour? If the appropriate rigour is not there, the data collected is not fit for the purpose of this type of assessment. Using past paper and exam board materials is a key part of ensuring this. Furthermore, the identified aspect of purpose 1 – to illustrate progress over time – often calls for specific topics, content or skills within the assessment, or, as in the case of my December mock exams, a depth of topic knowledge assessed beyond what would be found in the ‘terminal sample’. This reduces ‘accuracy’ in the graded aspect, but provides richer information to allow myself and students to identify gaps in knowledge and misconceptions – so it provides a more accurate representation of knowledge, skillset and the application of both.
Therefore, you have to find a balance, as these assessments mean different things to different people, and you have to identify and articulate the limitations that come from balancing each of the purposes. They are never going to be perfect ‘samples’ of the terminal ‘sample’ – but becoming aware of the limitations is essential. This is why I am big on recommending reusing a mock assessment through the years – you build a knowledge base that accumulates for future comparison, and it begins to tell you more about performance and projections as time goes on. If students struggle with a particular problem, you can start to predict where future issues may arise in the course based on prior experience, and you can start to gauge the accuracy of the ‘predictions’ that come from the graded elements of the assessment.
For example, with this mock exam I know that historically the ‘grading’ element is quite accurate at the top end. There has only been one student who achieved a high grade (A/B) in December and didn’t in the terminal examinations. However, I know that the lower grades tend to be more inconsistent. There is often variation between D/E/U allocations and final performance – these are limitations I know about the assessment and have historical evidence to communicate.
This situation makes sense: as further breadth is introduced, learners without a firm grasp of the current material and the core knowledge that overlaps between topics will often struggle to embed the new material; alternatively, it can act as a ‘wake-up call’ to some students, leading to an increased work ethic down the line. The accuracy of the ‘grading’ is therefore questionable; however, the depth into the topics assessed gives me a greater opportunity to identify where the grasp on current material is weaker and to identify issues with core mathematical methods. The accuracy of ‘progress’ and ‘current ability’ is therefore arguably strengthened by this depth focus – and can assist in future instruction.
With regards to this mock exam, as well as identifying individual question performance, I separated topics into ‘bus strike content’ and ‘pre-bus strike content’ – again this allows for direct comparison between the category types of this cohort, as well as with the performance of previous cohorts.
The ‘NBS Median’ (Non-Bus Strike) for those unaffected is comparable to previous years – which is a positive sign. However, the other categories fall below this, which suggests ‘missing content’ is not the issue here. This is further supported by the Bus Strike averages, which have fallen across all categories by incredibly significant amounts, regardless of whether students were within the live lesson or not. This means the cause of the issues goes beyond the mere ‘being in class’/‘being out of class’.
It would be easy for me to write this off as the ‘bus strike’ lessons containing more difficult content – we did study the calculus topics of Differentiation and Integration during that time. To maths educators, on the surface this would even sound like a sensible justification: these are topics that are highly significant to the specification and often carry ‘high challenge’. However, viewing the historical data, this isn’t supported by the quantitative evidence – on average, previous cohorts achieved a higher level of success on these questions, a median of 1.42% achieved on ‘bus strike topics’ for every 1% achieved on ‘non-bus strike’ topics, whereas this improvement (or any improvement) is not replicated across any category this year.
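That historical ratio is just a pair of medians; a quick sketch of the calculation, with invented percentage scores standing in for the real cohort data:

```python
# Sketch: median performance on 'bus strike' topics per 1% achieved on
# 'non-bus strike' topics (the basis of the 1.42 comparison). Values invented.
from statistics import median

non_bus_strike_pct = [48, 55, 62, 40, 71, 58]     # % score per student
bus_strike_pct     = [58, 70, 78.5, 82, 92, 99]   # % score per student

ratio = median(bus_strike_pct) / median(non_bus_strike_pct)
print(f"Median 'bus strike' % per 1% of 'non-bus strike': {ratio:.2f}")
```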
The figures tell us a lot as well. One aspect of my initial introduction of VAPE/PAVE (https://themarkscheme.co.uk/data-is-not-a-dirty-word/) was that we need to empower educators with skills in statistical analysis – knowing what statistics can tell us. I have both median and mean calculations within my analysis. Comparing these averages for the ‘non-bus strike’ topics, the mean and median are fairly equivalent to each other (although both reduce for each category as previously discussed). This suggests comparable values without significant extreme values.
However, for the categories of “No Effect” and “Constant Effect” the mean ratio was higher than the median. This tells me that the ‘extreme values’ are the better performers. As an example, a dataset of 1, 2, 3, 4, 50 generates a median of 3 but a mean of 12 – the single high value drags the mean upwards. So not only are these averages lower than historically; students are generally performing lower, with a few higher performers pulling the mean up. This contrasts with the prior achievement ratios, where Median > Mean, indicating the ‘outlying values’ are the lower performers.
When we talk about statistical analysis it doesn’t need to be standard deviation, logit regression, modelled functions and hypothesis tests. Mean and median are basic concepts, but additional comprehension of how these values work and interact can lead to a richer interpretation to influence future decisions.
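The whole mean-versus-median check amounts to a couple of lines – a sketch using the example dataset above:

```python
# Sketch: using the gap between mean and median to read the shape of a dataset.
from statistics import mean, median

scores = [1, 2, 3, 4, 50]            # the example dataset from above
print(mean(scores), median(scores))  # prints 12 3

# Mean well above median -> a few unusually high values are pulling the mean up,
# so the 'typical' student sits nearer the lower figure.
if mean(scores) > median(scores):
    print("Right-skewed: most values low, a few high outliers")
elif mean(scores) < median(scores):
    print("Left-skewed: most values high, a few low outliers")
```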
Interestingly, this Median > Mean ratio also appears for those coded as “1” in the data – which leads us into the “velocity” aspect of the model: how do we start to use this to drive future instruction?
VELOCITY: What comes next?
Velocity is the key component of the VAPE/PAVE model. It’s all well and good to crunch numbers, and even to reach for justifications or reasons, but a strong element of “so what?!” follows – how do we adapt instruction to make a difference based on this analysis?
In this situation I noticed that results were lower for the October/November content regardless of the bus strike effect – I could suggest that not having a full cohort present reduced the impact of the ‘mock exam drive’. Further inference suggests a breakdown of classroom routines as a reason as well.
Due to the complexity of dual delivery, aspects that students had been introduced to in the first 6 weeks of term were dropped away. This could explain why year 13 were significantly less affected, as these routines had been embedded for longer, and students were in a position to self-regulate and work independently.
Elements of low-stakes formative assessment were scaled down, or were no longer immediate, due to balancing the online and in-person class simultaneously. I structured homework differently, and due to time constraints that the dual delivery exacerbated I did not run the significant in-class, exam-based progress test that I normally run at the end of November, which focuses solely on Differentiation/Integration (the only non-cumulative progress test I do in the academic year).
That progress test really shines a light on the importance of calculus (two of the ‘key 4’ topics) and directly encourages learners into a revision schedule on those topics that rolls into the mock examination. This cohort missed all of that, and the impact is visible when analysing the results from just those topics.
Regarding more general routines, it was a difficult situation and I spent time outside class checking in via Teams calls/chats, phone calls and annotating photographs of work – but I have to look at myself as having had the mindset of worrying more about students still getting ‘content’, and, without having everyone present, whether the ‘drive’ towards ‘mock season’ lacked its usual resonance in comparison to other years.
As time is finite, I perhaps neglected ‘checking in’ with the students who were present due to time spent elsewhere, and when checking in with learners not in class my focus was content ‘in the moment’ and not ‘retention’. This also potentially explains why those with ‘partial’ attendance, whilst performing lower across the whole assessment, did not comparatively see as much of a dip in performance on the bus strike lessons. These students did have some in-person routines and check-ins, and unlike those who were completely unable to attend, had extra incentive to keep on top of the work remotely, ready for returning to class.
I can’t change what has happened but I can adapt things moving forward. So the first step was to make sure routines returned to the classroom. This included the low-stakes formative assessment and activities that were part of lessons earlier in the year, significantly more direct practice for the students in class – spread out and structured – a newly introduced homework routine focussed around exam questions, and a reiteration of expectations of what makes a great A Level Maths student.
Because the extreme values were the higher percentages – suggesting lower performance as the ‘norm’ – I knew I needed to rework how fundamentals were embedded. This led to the decision to run a “differentiated progress test” in January. Differentiation as a teaching strategy has become stigmatised in recent times, but as a method adjacent to adaptive teaching it can still yield benefits when implemented correctly. In this situation I 100% own the differentiation – I am not expecting less from different students; they just have different immediate focuses and different areas to strengthen. These were differentiated resources, and they were what the students needed them to be.
After analysis and discussion of the mock papers and previous performance, including some direct 1-to-1s, learners decided on 3 topics (from 7 different groups) that they were going to focus on following the mock exam, giving justifications for their decision. Whilst I had the right to ‘veto’ a decision (if I felt the topic wasn’t appropriate), the choices were the students’ alone. The aim was to build ownership of and responsibility for their own development goals. Again, there was increased depth in the assessment of these areas, each topic representing 20 marks in a future assessment.
During this time we lost two students to withdrawal. However, the differentiated assessment approach yielded some positive results. 69% of the topics chosen yielded at least some improvement in performance from the mock examination – particularly in the aforementioned areas of calculus. The two topics that tended not to yield improvements were Circles and Further Functions – anecdotally, students said they weren’t prepared for the variety of problems in the latter (e.g. inequalities and types of graphs), having instead focussed heavily on cubics and transformations, which are only part of the topic.
Looking at the more positive outcomes indicated more than marginal gains in those areas. Of the choices where performance improved, 33% reached a level of performance at least 30 percentage points higher than the mock (e.g. from 30% to at least 60%), which, given the even further increased depth within the topics, is a large positive. Restricting to the topics where there was an improvement, 60% improved by 20 percentage points or more.
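For transparency, those figures are nothing more than percentage-point comparisons tallied per topic choice; a sketch with invented before/after scores:

```python
# Sketch: tallying percentage-point gains between the mock and the follow-up
# progress test for each chosen topic. The (mock %, follow-up %) pairs are invented.
results = [(30, 65), (45, 50), (55, 52), (40, 72), (25, 48), (60, 68), (35, 30)]

improved = [(before, after) for before, after in results if after > before]
gain_30 = sum(1 for before, after in improved if after - before >= 30)
gain_20 = sum(1 for before, after in improved if after - before >= 20)

print(f"Improved at all: {len(improved)}/{len(results)}")
print(f"Of those, improved by >= 30 percentage points: {gain_30}/{len(improved)}")
print(f"Of those, improved by >= 20 percentage points: {gain_20}/{len(improved)}")
```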
Again, this is a snapshot of a moment in time, and isn’t universal progress, but it at least shows movement in a positive direction. It’s a continued use of mock exam data beyond the initial analysis – informing future decisions and reflections – both at teacher level, and as encouragement for students to do the same.
ELABORATION: Looking at a bigger picture
Technically the elaboration aspect permeates the rest of this post, as I aimed to develop meaning and understanding from the data and statistical calculations with regards to the position of the cohort and how to use this to propel them forward. Elaboration is also the ability to illustrate the big picture to stakeholders, such as the learners themselves – the engagement in the follow-up assessment evidenced buy-in at student level. We can also look beyond the assessment itself, and in this case the bus strike framing, to other circumstances that may have had an impact.
The elephant in the room is: ‘are standards equivalent across cohorts?’ We should not treat prior attainment as a limiting factor for performance and expectation, or as a method of restricting students’ access to intervention – which is why I refrained from looking into it during the key analysis of results – but reflecting on starting points can help bring meaning to the situation the cohort has found themselves in.
Looking at prior GCSE Maths attainment, this cohort has an average prior grade of 6.7, compared to prior cohorts with an average grade of 7.6. However, given CAGs/TAGs and then the incremental step of reducing outcomes ‘back to pre-pandemic levels’, it is difficult to infer too much from this. We begin the year with a diagnostic assessment – for maths it is very much centred on algebraic fundamentals. I made changes to my diagnostic in the academic year 2022/2023, as I had taken a different approach during the years where terminal examinations were ‘cancelled’ – so I only want to compare like for like. In 2022 the average diagnostic score was 59.7%, and this year it was 58.4% – a negligible difference. Correlation and linear regression models of ‘mock % on diagnostic %’ tell an interesting story (accidentally labelled 2024 and 2023, instead of 2023 and 2022).
The correlation coefficients are indicative of a positive linear correlation between the variables in both years – a Pearson’s Product Moment Correlation Coefficient of r = 0.6986 for 2023 and r = 0.7455 for 2022. Looking just at the horizontal scale, we can see a much more significant spread in the 2023 data, with a larger proportion of the cohort on the right side of the graph, indicating a better level of diagnostic performance. With the current cohort there is a lot more congestion on the left side of the graph, around 40-55%, with few students falling in the 55-75% region that was densely populated the year before.
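For anyone wanting to run the same check on their own markbook, Pearson’s coefficient and the regression line need nothing more exotic than a few lines of code (or a spreadsheet). A sketch with invented diagnostic/mock pairs:

```python
# Sketch: Pearson's r and a least-squares line for 'mock % on diagnostic %'.
# The paired scores below are invented for illustration.
import math

diagnostic = [42, 48, 51, 55, 60, 63, 68, 72]   # diagnostic %
mock       = [20, 35, 31, 44, 52, 49, 61, 70]   # mock exam %

n = len(diagnostic)
mean_x = sum(diagnostic) / n
mean_y = sum(mock) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(diagnostic, mock))
sxx = sum((x - mean_x) ** 2 for x in diagnostic)
syy = sum((y - mean_y) ** 2 for y in mock)

r = sxy / math.sqrt(sxx * syy)   # Pearson's product moment correlation coefficient
slope = sxy / sxx                # least-squares regression gradient
intercept = mean_y - slope * mean_x

print(f"r = {r:.4f}")
print(f"mock% ~ {slope:.2f} x diagnostic% + {intercept:.2f}")
```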
Delving into the areas that made the largest differences in outcomes between the 55-75% group from 2023 and the 40-55% group this year, it was predominantly ‘quadratic equations’ and ‘laws of indices’. Whilst activities were in place early in the year to address key gaps in knowledge, these are again skills re-emphasised through the in-class retrieval that was scaled back during the period. The latter in particular acted as a barrier to accessing many of the calculus (integration/differentiation) questions, which saw significant underperformance against historical data.
There is also a re-emergence of themes: time is finite, and the unfortunate truth is that addressing one issue begets other issues that need to be addressed. In the 2022/2023 academic year, whilst the cohort’s Pure Maths and Mechanics components were strong and above the national average, the Statistics component of the course saw the cohort underperform (awkward to bring up in a post about statistical analysis). I had already factored an earlier introduction of Statistics into the year plan to combat this – a ‘long and thin’ approach to delivery from January. Not wanting to lose this, it meant balancing multiple intervention aspects with the rebuilding of routines and new Pure Maths content. Post half-term, I noticed the warning signs of the issues that plagued the calculus topics of Differentiation and Integration appearing in another ‘key 4’ topic, Trigonometric Identities and Equations, which we covered during the heart of this period in January. I’ve quickly re-emphasised and targeted this over the past couple of weeks – hoping to avoid a similar situation.
This week the students are sitting a full Paper 1 as their second mock exam. As educators we are not neutral in this process – even though there is no manipulation or editing, the choice of a specific paper is purposeful – for what it shows me, what it links back to, and what it can inform for the last push towards the summer.
This was technically the 3rd post in the series relating to the Go North East bus strikes from the end of last year. The first two parts can be found here:
PART 1 – https://themarkscheme.co.uk/bus-strike-is-over/
PART 2 – https://themarkscheme.co.uk/visions-like-snow/
The post below introduces the VAPE/PAVE framework used within this post (which is still a work in progress) – https://themarkscheme.co.uk/data-is-not-a-dirty-word/