The earlier we all start thinking about this problem, the sooner we can start generating ideas and potential solutions.
Given the magnitude of impact generative AI is having and will have in education (and many other aspects of life), I’m working with some diligence to keep up to date with developments in the field. Recently, I noticed how a couple of the emerging capabilities of generative AI will come together in a way that will impact education far more dramatically than anyone seems to be talking about currently (if I’m missing this conversation somewhere, please help me connect to it!). But before I give away the punch line, let me share the individual pieces. Maybe you’ll see what I saw.
The Pieces
“Agents” are ways of using generative AI to interact with the world outside the large language model (LLM). Some recent examples include:
In these two examples, we see LLMs searching the web to find and read technical documentation; writing, debugging, and running computer code; playing songs on Spotify; and creating images on Midjourney. But we also see an LLM ordering food via DoorDash, booking a ride from Uber, and purchasing a plane ticket. These LLMs aren’t just writing essays or conducting mock job interviews. They’re reaching outside themselves to navigate the web, use a wide range of services, and take actions in the real world (in some cases spending real money to do so).
The rabbit r1 takes a standard approach to connecting to and using other services – you individually authenticate with each service you want the r1 to be able to access and use (e.g., you can see Jesse authenticating with Spotify around 11:31 in the video above).
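Under the hood, this API-based approach usually takes the form of tool calling (sometimes called function calling): you describe a service operation to the LLM, it returns structured arguments, and your code makes the actual authenticated API call. Here’s a minimal sketch using the OpenAI Python client; the play_song function, the model name, and the prompt are placeholders I made up for illustration, not anything rabbit has published.

```python
# A minimal sketch of the API-integration pattern, not rabbit's actual implementation.
# The play_song function and its schema are hypothetical stand-ins for a real,
# authenticated service call (e.g., to Spotify's web API).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "play_song",
        "description": "Play a song on the user's music service",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # any tool-calling-capable model
    messages=[{"role": "user", "content": "Play 'Here Comes the Sun'"}],
    tools=tools,
)

# The model replies with structured arguments; your code makes the real API call.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```

The key point is that every service has to be wired up (and authenticated) individually.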
Open Interpreter takes a radically different approach to connecting to and using other services.
Open Interpreter is a kind of integration layer that allows LLMs to take actions directly using your computer – operating the keyboard and mouse autonomously. Rather than authenticating with Spotify and operating it via an API like the r1 did, Open Interpreter would simply open the Spotify app, click in the search box, type the name of a song, hit enter, and then double-click the song title to start playing. (Open Interpreter is open source and you can check out the repo on GitHub.)
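If you want a feel for it, here’s a minimal sketch of driving Open Interpreter from Python. The model choice and the request are placeholders, and I’m writing the settings from memory, so treat this as the general shape and check the repo for the current interface.

```python
# A minimal sketch; the model name and the request are placeholders, and the
# attribute names may differ across Open Interpreter versions.
# pip install open-interpreter
from interpreter import interpreter

interpreter.llm.model = "gpt-4"  # the LLM that plans and writes the code
interpreter.auto_run = False     # keep a human in the loop: confirm before any code runs

# Open Interpreter has the model write and execute code (shell, Python, etc.)
# on your machine to carry out the request.
interpreter.chat("Open Spotify, search for 'Here Comes the Sun', and play it.")
```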
In the video below, introducing the 01 hardware device made to work with Open Interpreter, Killian says, “You can think of the 01 as a smart person in front of your computer” (3:50 mark in the video).
Introducing the 01 Developer Preview.
Order or build your own today: https://t.co/ROEcj9jVPX
The 01 Light is a portable voice interface that controls your home computer. It can see your screen, use your apps, and learn new skills.
This is only the beginning for 01— the… pic.twitter.com/J5VoWlCI5i
— Open Interpreter (@OpenInterpreter) March 21, 2024
In the 01 demo (starting about 4:10) we see Open Interpreter use Slack by pressing hotkeys on the keyboard, seeing and interpreting what’s on the screen, clicking on user interface elements, typing, and hitting enter. This is absolutely incredible. And it connects to a topic that was discussed briefly on a recent episode of the Latent Space podcast (starting around 35:11).
While many computer vision models have been trained on datasets like COCO, which is composed of photos of a wide range of objects in a wide range of contexts, the kind of computer vision that’s needed to support knowledge work is the capacity to understand PDFs, charts, graphs, screenshots, etc. And while they’re playing catch-up, the capabilities of vision models in this area are advancing quickly. As the 01 demo shows, this kind of multimodal support in LLMs is already pretty good.
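To make that concrete, here’s the general shape of handing a screenshot to a vision-capable model and asking it to read the UI. The model name, file path, and prompt are placeholders of mine; this isn’t what the 01 does internally, just the pattern.

```python
# A minimal sketch of screenshot understanding; the model, file path, and prompt
# are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What app is on screen, and which UI elements could I click?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```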
(And you likely noticed that the r1 and the 01 both include a learning function, which you can use to teach them how to perform new skills. That’s an entirely different essay.)
Now let’s add one more piece. Here’s an oldie-but-a-goldie by Ethan Mollick from over a year ago.
Well this is something else.
GPT-4 passes basically every exam. And doesn't just pass…
The Bar Exam: 90%
LSAT: 88%
GRE Quantitative: 80%, Verbal: 99%
Every AP, the SAT… pic.twitter.com/zQW3k6uM6Z
— Ethan Mollick (@emollick) March 14, 2023
And of course the capability of frontier models has only increased in the 13 months since this tweet was published. (And yes, I called it a tweet.)
Ok, those are the main pieces. Do you see what I see?
Putting the Pieces Together
As we’ve seen above, generative AI is capable of opening programs on your computer and using those programs autonomously. It can use a web browser to open webpages, navigate between them, and click on buttons, form fields, radio buttons, and other UI elements. And as we already knew, generative AI can write essays and pass a wide range of very difficult exams with flying colors. In other words,
All the technology necessary for an “AI student agent” to autonomously complete a fully asynchronous online course already exists today. I’m not talking about an “unsophisticated” kind of cheating where a student uses ChatGPT to write their history essay. I’m talking about an LLM opening the student’s web browser, logging into Canvas, navigating through the course, checking the course calendar, reading, replying to, and making posts in discussion forums, completing and submitting written assignments, taking quizzes, and doing literally everything fully autonomously – without any intervention from the learner whatsoever.
Putting these pieces together to build an AI student agent will require some technical sophistication. But in terms of overall difficulty, it feels like the kind of thing that could be done by a team of two during a weekend AI Hackathon.
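Part of why it feels hackathon-sized is that the core of such an agent is just a loop: take a screenshot, ask a vision-capable model what to do next, perform that action, and repeat. Here’s a deliberately generic sketch of that observe-decide-act pattern using a browser automation library. The URL, model name, and action schema are placeholders of mine, and there’s no error handling; this is the shape of the thing, not a working agent.

```python
# A deliberately generic sketch of the observe-decide-act loop; the URL, model,
# and action schema are placeholders, and there is no error handling.
# pip install openai playwright && playwright install chromium
import base64, json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You operate a web browser. Given a screenshot, reply with JSON: "
    '{"action": "click" | "type" | "done", "selector": "<css>", "text": "<text>"}'
)

def next_action(screenshot_png: bytes) -> dict:
    """Send the current screenshot to the model and get back one structured action."""
    b64 = base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "text", "text": "Here is the current screen. What next?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return json.loads(response.choices[0].message.content)

with sync_playwright() as p:
    page = p.chromium.launch(headless=False).new_page()
    page.goto("https://example.com")          # placeholder, not a real LMS
    for _ in range(10):                       # cap the number of steps
        action = next_action(page.screenshot())
        if action["action"] == "done":
            break
        elif action["action"] == "click":
            page.click(action["selector"])
        elif action["action"] == "type":
            page.fill(action["selector"], action.get("text", ""))
```

A real agent would obviously need memory of the task, error recovery, and a lot of glue code. But none of the individual pieces is exotic.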
I’ve experimented with setting up a toy version of an AI student agent on my laptop using Open Interpreter with GPT-4. It’s probably prohibitively expensive to do an entire course this way today – it would cost well over $100 to have an AI student agent complete a single class in this configuration, and this creates a (temporary) barrier to widespread adoption. With more time and effort, you might be able to use a cheaper model (like GPT-3.5-turbo). But either way, the price per token will keep going down in the future, so high prices are likely only a temporary barrier to the adoption of AI student agents.
Of course, the way to avoid paying for API calls altogether is to run an LLM locally. So I also tried using Open Interpreter with Mistral-7b running locally via LM Studio (and therefore costing me essentially $0 per token). This was slower and not as accurate, but with enough time and effort I think you could get an AI student agent working using a local LLM. (UPDATE: This works better with Llama3-8b running locally.) The barrier to adoption here is that, in order to use an AI student agent in this configuration, a student would have to download, install, and run a large language model on a pretty powerful laptop. But again, this barrier is also likely only temporary – the UI/UX for running local models will keep improving and computers will keep getting more powerful.
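For the curious, pointing Open Interpreter at a local server looks roughly like this. I’m writing the settings from memory, so the attribute names may differ by version; the model identifier is a placeholder and the port is LM Studio’s default.

```python
# A rough sketch of the local-model configuration; attribute names may vary by
# Open Interpreter version, and the model identifier is a placeholder.
from interpreter import interpreter

interpreter.offline = True                             # don't call hosted APIs
interpreter.llm.api_base = "http://localhost:1234/v1"  # LM Studio's OpenAI-compatible local server
interpreter.llm.api_key = "not-needed"                 # local servers ignore the key
interpreter.llm.model = "openai/local-model"           # placeholder identifier for the loaded model

interpreter.chat("Open the course calendar in my browser.")  # placeholder request
```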
With OpenAI widely rumored to be releasing updated functionality this summer specifically designed to make agents easier to create and control, and with GPT-5 rumored to be coming toward the end of this year (the CEO of OpenAI recently said that GPT-4 “kind of sucks” compared to what’s coming), the tasks of building and running this AI student agent will only get easier as time goes on. I’m not sure what the odds are that this tech exists by Fall semester of 2024, but it seems highly likely it exists by Fall 2025.
The implications for formal education are obvious, if hard to fully appreciate. But then there’s also corporate training, safety and compliance training, etc. to consider. The overwhelming majority of this kind of training is delivered fully asynchronously.
So now what?
I’ll share some early ideas in another post. I’m anxious to hear yours. We clearly have work to do.