Research Scientist

San Francisco • $150K–$220K (annualized) • Summer 2026 to start

Run the experiments and build the evaluations that improve how our production AI tutor teaches real students.

About Aristotle

At Aristotle, we are building the world's first AI tutor — giving every student access to an expert, personal guide. Aristotle provides highly personalized tutoring, complete with memory, personality, consistency, and genuine care.

The Role

Research at Aristotle works on the open problems in building a realtime AI tutor. The biggest one right now is evaluation. Tutoring is multi-turn and adaptive, there is no verifiable reward signal, and existing benchmarks test single turns in toy scenarios. We are building the evaluations that make tutoring quality measurable, including simulated students realistic enough to test tutors against, and using them to improve our tutor in production.

You will own this work end to end: forming hypotheses, running experiments on real production sessions, and shipping findings into the product. You will work directly with one of the co-founders leading research, and with data that few groups have: full multimodal tutoring sessions at scale, with voice, whiteboard state, and student history.

This starts as a summer contract, with the intent to continue if it goes well. Full-time in SF is preferred; we are flexible on hours and location for the right person. We support publishing where we can, and we will tell you the constraints up front.

Responsibilities

Scope and plan research projects with the team.
Design and validate measures of multi-turn tutoring quality, grounded in learning science and applied to real production sessions.
Build simulated students grounded in real student data, and test whether they respond to good and bad tutoring the way real students do.
Turn expert pedagogical principles into criteria that can be evaluated reliably at scale.
Train models where prompting falls short, for student simulators and beyond.
Run empirical work end to end: form hypotheses, experiment against real data, and write up findings that change what we build.

Qualifications

Graduate research experience (PhD student, recent PhD, or equivalent track record) in CS, ML, NLP, or a related field.
Hands-on experience with some of: LLM evaluation, post-training/fine-tuning, agent systems, student simulation, dataset curation.
Familiarity with education research (pedagogical evaluation, student modeling, tutoring dialogues, knowledge tracing, etc.) is a strong plus.
Clear research planning and communication, and the ability to make research legible to people outside your subfield.

You Might Be a Great Fit If You...

Have spent serious time teaching or tutoring.
Are willing to read fifty real tutoring transcripts before proposing a metric.
Are comfortable owning a problem where the right approach isn't known yet.
Treat evaluation as a research problem in its own right.
Would rather show a rough result this week than a polished one next month.
Already use coding agents heavily in your day-to-day work.

Why Now?

For decades, educators have dreamed of giving every student a personalized tutor. With LLMs, that vision is within reach. If you're on Team Human, there is no more meaningful or urgent mission to pursue.
