Can computers really mark exams? Benefits of ELT automated assessments

Pearson Languages

Automated assessment, including the use of Artificial Intelligence (AI), is one of the latest education tech solutions. It speeds up exam marking, removes human bias, and is as accurate as, and at least as reliable as, human examiners. As innovations go, this one is a real game-changer for teachers and students.

However, it has understandably been met with many questions and sometimes skepticism in the ELT community – can computers really mark speaking and writing exams accurately? 

The answer is a resounding yes. Students from all parts of the world already take AI-graded tests. Versant tests, for example, provide unbiased, fair and fast automated scoring for speaking and writing exams – irrespective of where test takers live, or what their accent or gender is.

This article will explain the main processes involved in AI automated scoring and make the point that AI technologies are built on the foundations of consistent expert human judgments. So, let’s clear up the confusion around automated scoring and AI and look into how it can help teachers and students alike. 

AI versus traditional automated scoring

First of all, let’s distinguish between traditional automated scoring and AI. When we talk about automated scoring, we generally mean the scoring of multiple-choice or cloze items: you may have to reorder sentences, choose from a drop-down list or insert a missing word – that sort of thing. These question types are designed to test particular skills, and automated scoring ensures that they can be marked quickly and accurately every time.
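For the technically curious, marking objective items really is as simple as comparing each response against a fixed answer key. Here is a minimal sketch in Python – the item IDs and answers are invented for illustration:

```python
# A minimal sketch of traditional automated scoring: each response is
# compared against a fixed answer key, so marking is deterministic.
# The item IDs and answers below are invented for illustration.

ANSWER_KEY = {
    "q1": "b",       # multiple choice
    "q2": "went",    # cloze: the missing word
    "q3": "c",       # drop-down selection
}

def score_objective_items(responses: dict[str, str]) -> int:
    """Return the number of correct answers."""
    return sum(
        responses.get(item_id, "").strip().lower() == answer
        for item_id, answer in ANSWER_KEY.items()
    )

print(score_objective_items({"q1": "b", "q2": "Went", "q3": "a"}))  # 2
```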

While automatically scored items like these can be used to assess receptive skills such as listening and reading comprehension, they cannot mark the productive skills of writing and speaking. Every student's response in writing and speaking items will be different, so how can computers mark them?

This is where AI comes in. 

We hear a lot about how AI is increasingly used in areas where large amounts of unstructured data need to be processed effectively and accurately – in medical diagnostics, for example. In language testing, AI uses specialized computer software to grade written and oral tests.

How AI is used to score speaking exams

The first step is to build an acoustic model for each language that can recognize speech and convert the audio waveform into text. This technology used to be rare; now most of our smartphones can do it.
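For illustration only, here is roughly what that speech-to-text step looks like using the open-source SpeechRecognition package – a stand-in for the purpose-built acoustic models described above, with a hypothetical file name:

```python
# Illustrative only: transcribe a spoken response with the open-source
# SpeechRecognition package (pip install SpeechRecognition). A scoring
# engine would use its own acoustic models; the file name is hypothetical.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("student_response.wav") as source:
    audio = recognizer.record(source)  # read the whole audio file

# Convert the waveform to text (this call needs an internet connection).
text = recognizer.recognize_google(audio)
print(text)
```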

These acoustic models are then trained to score every single prompt or item on a test. We do this by having expert human raters score the items first, using double marking. They score hundreds of oral responses for each item, and these ‘Standards’ are then used to train the engine.
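As a rough sketch of how those double-marked ‘Standards’ might be assembled – assuming each response is scored independently by two expert raters and the averaged score becomes the training target – consider the following (all values invented):

```python
# A sketch of assembling double-marked 'Standards': each response is
# scored by two expert raters, and the averaged score becomes the
# training target for the engine. All values are invented.
from statistics import mean

double_marked = [
    # (response_id, rater_1_score, rater_2_score)
    ("resp_001", 4, 4),
    ("resp_002", 3, 4),
    ("resp_003", 5, 5),
]

standards = {
    response_id: mean([r1, r2])
    for response_id, r1, r2 in double_marked
}
print(standards)  # {'resp_001': 4, 'resp_002': 3.5, 'resp_003': 5}
```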

Next, we validate the trained engine by feeding in many more human-marked responses and checking that the machine scores correlate very highly with the human scores. If this doesn’t happen for an item, we remove it, as every item must match the standard set by human markers. We expect a correlation of between .95 and .99 – in other words, the machine’s scores track the expert human scores almost perfectly.
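A minimal sketch of that validation check, with invented scores, might look like this:

```python
# A minimal sketch of the validation step: compare machine scores with
# expert human scores for one item, and keep the item only if the
# correlation reaches the expected threshold. Scores are invented.
from scipy.stats import pearsonr

human_scores = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0]
machine_scores = [3.1, 4.4, 2.2, 4.9, 3.4, 4.1]

r, _ = pearsonr(human_scores, machine_scores)
print(f"correlation: {r:.3f}")

THRESHOLD = 0.95
if r < THRESHOLD:
    print("Item rejected: it does not match the human standard.")
else:
    print("Item retained.")
```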

This is incredibly high compared to the reliability of human-marked speaking tests. In essence, we use a group of highly expert human raters to train the AI engine, and then their standard is replicated time after time.  

How AI is used to score writing exams

Our AI writing scoring uses a technology called Latent Semantic Analysis (LSA). LSA is a natural language processing technique that can analyze and score writing based on the meaning behind words – and not just their superficial characteristics.

As with our speech recognition acoustic models, we first establish a language-specific text model. We feed a large amount of text into the system, and LSA learns the patterns of how words relate to each other and how they are used in, for example, the English language.
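A simplified sketch of the LSA idea – reducing TF-IDF vectors with truncated SVD so that texts can be compared by underlying meaning rather than exact word overlap – might look like this (the sentences are invented, and a real engine is trained on vastly more text):

```python
# A simplified LSA sketch: TF-IDF vectors are reduced with truncated
# SVD, so texts can be compared in a 'semantic' space rather than by
# exact word overlap. The sentences are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

corpus = [
    "The doctor examined the patient carefully.",
    "A physician checked the sick man with care.",
    "I enjoy playing football at the weekend.",
]

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
vectors = lsa.fit_transform(corpus)

# Compare the first sentence with the other two in the reduced space.
print(cosine_similarity(vectors[:1], vectors[1:]))
```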

Once the language model has been established, we train the engine to score every written item on a test. As with speaking items, we do this by having expert human raters score the items first, using double marking. They score many hundreds of written responses for each item, and these ‘Standards’ are then used to train the engine. We then validate the trained engine by feeding in many more human-marked responses and checking that the machine scores correlate very highly with the human scores.

The benchmark is always the expert human scores. If our AI system doesn’t closely match the scores given by human markers, we remove the item, as it is essential to match the standard set by human markers.

AI’s ability to mark multiple traits 

One of the challenges human markers face in scoring speaking and written items is assessing many traits on a single item. For example, when assessing and scoring speaking, they may need to give separate scores for content, fluency and pronunciation. 

In written responses, markers may need to score a piece of writing for vocabulary, style and grammar. In effect, they may need to mark every single item at least three times, maybe more. However, once we have trained the AI systems on every trait score in speaking and writing, they can mark items on any number of traits instantaneously – and with complete consistency.
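As a rough sketch of what multi-trait scoring looks like once models have been trained – one model per trait, applied to the same response in a single pass – consider the following; the feature extraction and models here are hypothetical stubs, not our actual engine:

```python
# A sketch of multi-trait scoring: one trained model per trait, all
# applied to the same response at once. The feature extraction and
# 'models' below are hypothetical stubs, not a real scoring engine.

def extract_features(response: str) -> list[float]:
    # Stand-in for real feature extraction (length, lexical variety...).
    words = response.split()
    return [float(len(words)), float(len(set(words)))]

TRAIT_MODELS = {
    "content":    lambda feats: min(5.0, feats[0] / 10),
    "vocabulary": lambda feats: min(5.0, feats[1] / 8),
    "grammar":    lambda feats: 4.0,  # stub score
}

def score_all_traits(response: str) -> dict[str, float]:
    """Score one response on every trait in a single pass."""
    feats = extract_features(response)
    return {trait: model(feats) for trait, model in TRAIT_MODELS.items()}

print(score_all_traits("The quick brown fox jumps over the lazy dog."))
```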

AI’s lack of bias

A fundamental premise for any test is that no advantage or disadvantage should be given to any candidate. In other words, there should be no positive or negative bias. This can be very difficult to achieve in human-marked speaking and written assessments. In fact, candidates often feel they may have received a different score if someone else had heard them or read their work.

Our AI systems remove this source of bias. We do this by ensuring that our speaking and writing engines are trained on an extensive range of human accents and writing styles.

We don’t train our engines only on ‘perfect’ native-speaker accents or writing styles; we use representative non-native samples from across the world. When we initially set up our AI systems for speaking and writing scoring, we trialed our items and trained our engines using millions of student responses, and we continue to do so as new items are developed.

The benefits of AI automated assessment

There is nothing wrong with hand-marking homework, tests and exams. In fact, it is essential for teachers to get to know their students and provide personal feedback and advice. However, manually correcting hundreds of tests, daily or weekly, is repetitive, time-consuming and not always reliable, and it takes time away from working alongside students in the classroom. The use of AI in formative and summative assessments can increase assessed practice time for students and reduce the marking load for teachers.

Language learning takes time – lots of it – to reach high levels of proficiency. The blended use of AI can:

  • address the increasing importance of formative assessment to drive personalized learning and diagnostic assessment feedback

  • allow students to practice and get instant feedback inside and outside of allocated teaching time

  • address the issue of teacher workload

  • create a virtuous combination between humans and machines, taking advantage of what humans do best and what machines do best

  • provide fair, fast and unbiased summative assessment scores in high-stakes testing

We hope this article has answered a few burning questions about how AI is used to assess speaking and writing in our language tests. Fei-Fei Li, Chief Scientist of AI at Google Cloud and Professor at Stanford University, describes AI like this:

“I often tell my students not to be misled by the name ‘artificial intelligence’ — there is nothing artificial about it; A.I. is made by humans, intended to behave [like] humans and, ultimately, to impact human lives and human society.”

AI in formative and summative assessments will never replace the role of teachers. AI will support teachers, provide endless opportunities for students to improve, and provide a solution to slow, unreliable and often unfair high-stakes assessments.

Examples of AI assessments in ELT

At Pearson, we have developed a range of assessments that use AI technology, aimed at those who need to prove their level of English for a university place, a job or a visa. AI is used to score the tests, and results are available within five days.

More blogs from Pearson

  • Planning for success with the GSE

    By Sara Davila

    The Global Scale of English (GSE) is the first truly global English language standard.

    It consists of a detailed scale of language ability and learning objectives, forming the foundations of our courses and assessments at Pearson English.

    The GSE was developed based on research involving over 6,000 language teachers worldwide. The objective was to extend the existing descriptor sets to enable the measurement of progression within a CEFR level – and to address the learning needs of a wider group of students.

    It can be used in conjunction with a current school curriculum and allows teachers to accurately measure their learners’ progress in all four skills of reading, writing, listening and speaking.

    The GSE was introduced at the American Language Institute – an English language school run by the University of Toledo in Ohio, USA – with impressive results.

    The American Language Institute

    The Institute provides English courses for students who want to improve their English and prepares them to take the International Student English exam. It offers an intensive language program consisting of 20 hours of classes every week plus 40 hours of self-study. This 60-hour week is designed to fast-track students from a lower level of English to a standard that allows them to participate successfully in college courses. Five course levels are offered, from A2+ to B2+, and class sizes average around 10 students.

    Most students at the Institute are full-time international students planning to attend the University of Toledo once their English language proficiency reaches the required standard. On average, they are between 18 and 20 years old, and enter the language program with a B1 level of English.

    A mission statement

    At the Institute, the main aim of the language courses is to help students develop their English skills to a level that will allow them to integrate successfully into the university community, not just academically but socially. In their own words: “Our ultimate goal isn’t to teach them how to take and pass language tests, but to teach them how to use English and engage themselves with the local communities.”

    So how did the GSE, in conjunction with the Versant test and other Pearson products, help to achieve this goal?

    Transitioning to an objectives-based curriculum

    First, the course coordinator Dr Ting Li adopted the GSE for a more detailed approach to the CEFR. She found that the GSE “made the CEFR more manageable because it broke out the levels and outlined CEFR goals into different categories.”

    Next, she replaced the existing course materials with NorthStar Speaking & Listening, NorthStar Reading & Writing, and Focus on Grammar. These courses covered the areas taught in the previous curriculum, as well as the three key areas of study: literacy, speaking and listening, and grammar.

    The instructors also began using Pearson English Connect, a digital platform for teachers and students. This gave them the flexibility to revise questions and reduced their administrative burden, thanks to the automatic grading feature.

    Finally, the Institute started using the Versant English placement test to decide which level students should enter when they first begin studying at the Institute.

    Key findings from the case study

    The new curriculum was a great success. Students, teachers and administrators all found that the courses and assessments, all underpinned by the GSE, made the language learning experience smoother and easier. Once students had completed the highest level of the course and achieved a 3.0 GPA, they were able to transition smoothly into their courses at the University of Toledo.

    The alignment between the NorthStar courses, the grammar study books and the Versant test was informed by the GSE. This meant students didn’t have to sit as many assessments as before, reducing time teachers had to spend setting and marking exams, and allowing them to focus more on supporting learners and the quality of their lessons.

    Dr Li highlighted the following key benefits:

    • The Global Scale of English supports the development of a standardized curriculum and a consistent framework for teaching English.
    • The average student GPA at the Institute correlated strongly with undergraduate GPA at the University of Toledo, indicating that students who do well at the Institute go on to successful academic careers.
    • There was no group difference between graduates of the Institute and the average University of Toledo student GPA, indicating that the Institute’s students perform as well as other international students admitted directly to the university.
    • There was no difference in credits earned two years into the university program between the Institute’s graduates and the general student population.

    What’s more, the Institute recently earned national accreditation, meaning that the course run by Dr Li is now nationally recognized. Using the GSE to inform the organization of the course curriculum made the accreditation process smoother and easier.

    Working as a team

    One of the main pieces of feedback from Dr Li and the Institute was how helpful they found the Pearson representatives, who offered excellent customer support and built a sense of teamwork between themselves and the school. This teamwork helped the Institute fulfill the ambition of its mission statement. It makes for an inspiring story of how one school used the GSE to transform its curriculum and helped its students improve their English and achieve their academic ambitions.

  • How the GSE helped Salem State University meet learner needs

    By Sara Davila

    Salem State University is one of the largest and most diverse public teaching universities in Massachusetts. In total, it has about 8,700 students enrolled, 37% of whom are people of color. It also educates 221 international students from 59 different countries – with China, Albania, Brazil, Morocco, Nigeria and Japan among the most represented countries on campus.

    The university runs an intensive English language program. Most students who enroll come from China, Brazil, Albania, Vietnam, and Japan. The program also has a number of part-time English language learners from the local community.

    In 2016, Associate Director Shawn Wolfe and teachers at the American Language and Culture Institute conducted a review and identified areas for growth, including establishing universal documentation for identifying learner needs, goals and progress.

    “The biggest challenge was that we needed to have a better way of placing students,” Wolfe says. “We also needed to have a way to have our curriculum, our assessment and our student learning outcomes unified.”

    The team lacked programmatic data related to learning gains and outcomes. Additionally, they realized that assessments could be used to inform students about entry requirements at the university and other programs. And that’s where the Global Scale of English (GSE) came in – as a tool that enabled the staff at the American Language and Culture Institute to personalize and diversify their English teaching program to meet learner needs.

    Cultural and linguistic diversity

    David Silva, PhD, the Provost and Academic Vice President, highlights the need for this type of personalization when it comes to education.

    “We have to be prepared for an increasing variety of learners and learning contexts. This means we have to make our learning contexts real,” he says. “We have to think about application, and we have to think about how learners will take what they learn and apply it, both in terms of so-called book smarts, but also in terms of soft skills, because they’re so important.”

    Silva makes the point that, as the world gets smaller and technology becomes a bigger part of our lives, we can be anywhere at any time, working with anyone from across the globe. “We need to be prepared,” he says, “for those cultural and linguistic differences that we’re going to face in our day-to-day jobs.”

    The ability to change and adapt

    So how does the curriculum at the American Language and Culture Institute help prepare students for the world of study and work?

    At the Institute, the general review led to the realization that the program needed to be adaptive and flexible. This would provide a balance between general English and academic preparation and would also encompass English for specific purposes (ESP).

    Wolfe says, “The GSE fit with what we were trying to do because it offers three different options: English for academic learners, English for professionals and English for adults, which is another area that we realized we needed to add to our evening program so that we can serve working adults that are English language learners in our community.”

    The English language instructors at the Institute were also impressed with the capabilities of the GSE. Joni Hagigeorges, one of the instructors, found the GSE to be an excellent tool for tracking student progress.

    “What I really like is that you can choose the skill – listening, speaking – and you’re given the can-do statements, the learning objectives that each student will need to progress to the next level,” she said.

    Wolfe also commented on the GSE Teacher Toolkit and the way that it supports assessment and planning, allowing instructors to get ideas for specific learning objectives for groups or individual students. “It’s enabled us to personalize learning, and it’s changed the way that our teachers are planning their lessons, as well as the way that they are assessing the students.”

    A curriculum that will meet learner needs

    The GSE has allowed the team at the Institute to become more responsive to changing student expectations. The alignment of placement and progress tests to the GSE has allowed instructors to have more input into the courses they are teaching.

    Elizabeth Cullen, an English language instructor at the Institute, said, “The GSE helps us assess the strengths and weaknesses of various textbooks. It has helped us develop a unified curriculum, and a unified assessment mechanism.”

    This unification means that the curriculum can easily be tweaked or redesigned quickly to meet the needs of the students. What’s more, as Elizabeth points out, the students benefit too. “The Global Scale of English provides students with a road map showing them where they are now, where they want to go and how they’re going to get there.”

    Standing out from the crowd

    In this time of global hyper-competition, the challenge for any language program is finding innovative ways to stand out from the crowd while staying true to your identity. At Salem State, the staff found that the GSE was the perfect tool for the modern, data-driven approach to education, inspiring constant inquiry, discussion and innovation. It offers students, instructors and administrators a truly global metric to set and measure goals, and go beyond the ordinary.