Published On May 13, 2024
OpenAI recently released GPT-4o, which reports significant improvements in latency and cost. Many users may wonder how to evaluate the effects of upgrading their app to GPT-4o. For example: what latency gains should I expect, and are there any material differences in app performance when I switch to the new GPT-4o model?
Decisions like this are best informed by quality evaluations. Here, we show the process of evaluating GPT-4o on an example RAG app with a 20-question eval set related to LangChain documentation. We show how regression testing in the LangSmith UI lets you quickly pinpoint examples where GPT-4o improves on or regresses from your current app.
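The core idea behind the regression view is a per-example comparison of scores between two experiments. A minimal sketch of that comparison in plain Python (using hypothetical correctness scores, not real eval data):

```python
# Hypothetical per-example correctness scores (0 or 1) from two eval runs.
# In practice these would come from your LangSmith experiments.
baseline = {"q1": 1, "q2": 0, "q3": 1, "q4": 1}   # current model
candidate = {"q1": 1, "q2": 1, "q3": 0, "q4": 1}  # GPT-4o

def regressions_and_improvements(baseline, candidate):
    """Pinpoint examples where the candidate model improves or regresses."""
    improvements = [q for q in baseline if candidate[q] > baseline[q]]
    regressions = [q for q in baseline if candidate[q] < baseline[q]]
    return improvements, regressions

improved, regressed = regressions_and_improvements(baseline, candidate)
print("improved:", improved)    # examples GPT-4o now gets right
print("regressed:", regressed)  # examples GPT-4o now gets wrong
```

The LangSmith regression testing UI surfaces exactly this kind of per-example diff, so you can inspect the individual questions behind an aggregate score change.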
GPT-4o docs:
https://openai.com/index/hello-gpt-4o/
LangSmith regression testing UI docs:
https://docs.smith.langchain.com/old/...
RAG evaluation docs:
https://docs.smith.langchain.com/old/...
Public dataset referenced in the video:
https://smith.langchain.com/public/ea...
Cookbook referenced in the video:
https://github.com/langchain-ai/langs...