How to Run DeepEval from GitHub
DeepEval is useful when you want LLM evals to run close to your codebase, especially in local development and CI. Running it from GitHub gives you access to the source repo, examples, and unreleased changes, but it also adds a few ways to make a mess: cloning the wrong repo, skipping a virtual environment, leaking API keys, or treating sample tests as production evals.
This guide walks through a clean setup for developers who want to clone DeepEval from GitHub, run a minimal eval, and avoid the common mistakes that cause flaky or unsafe LLM test runs.
1. Confirm the correct GitHub repository
The DeepEval GitHub repository is:
https://github.com/confident-ai/deepeval
Before cloning, check the repo page and confirm:
- The owner is confident-ai.
- The repo name is deepeval.
- The repository has recent commits and release tags.
- The README install commands match the package you expect.
If you are writing internal setup docs, capture a screenshot of the GitHub repo page that shows the URL, owner, repo name, branch selector, and latest commit. This prevents a common onboarding issue: someone cloning a fork, stale mirror, or unrelated package with a similar name.
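The owner and repo-name check above can also be automated. The following is a minimal sketch, not part of DeepEval: the helper name is_expected_remote is hypothetical, and it assumes you pass in the URL printed by git remote get-url origin.

```python
from urllib.parse import urlparse

# Values taken from the repository URL given in this guide.
EXPECTED_OWNER = "confident-ai"
EXPECTED_REPO = "deepeval"


def is_expected_remote(url: str) -> bool:
    """Return True if url points at github.com/confident-ai/deepeval.

    Hypothetical helper for onboarding scripts; the comparison is strict
    and case-sensitive.
    """
    # Normalize the SSH form git@github.com:owner/repo.git into an https URL.
    ssh_prefix = "git@github.com:"
    if url.startswith(ssh_prefix):
        url = "https://github.com/" + url[len(ssh_prefix):]

    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname != "github.com":
        return False

    path = parsed.path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return path.split("/") == [EXPECTED_OWNER, EXPECTED_REPO]
```

A setup script could call this on the origin URL and stop early when it returns False, which catches forks and look-alike hosts before anything is installed.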
2. Clone DeepEval locally
Use a clean workspace. Do not clone it inside another application repo unless you have a clear reason.
mkdir llm-eval-tools
cd llm-eval-tools
git clone https://github.com/confident-ai/deepeval.git
cd deepeval
If you want repeatable results, pin the repo to a tag or commit instead of running from the default branch forever.
git tag --sort=-version:refname | head
# Example pattern:
git checkout <tag-or-commit-sha>
For CI, prefer a specific commit SHA. A moving branch can change behavior without warning.
3. Create a virtual environment
Do not install directly into your global Python environment. DeepEval, model SDKs, and test dependencies can conflict with application dependencies.
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
On Windows PowerShell:
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip setuptools wheel
Confirm that your shell is using the virtual environment:
which python
python --version
The first command should print a path inside .venv.
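If you want a check that behaves the same on macOS, Linux, and Windows, you can ask the interpreter itself. This is a small sketch that assumes the environment was created with python -m venv as shown above; the function name in_virtualenv is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
```

Run it with the same python command you plan to use for installs. If it reports that no virtual environment is active, activate .venv before continuing.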
4. Install DeepEval from the cloned repo
From the root of the cloned repository, install it in editable mode:
pip install -e .
Then confirm the install works:
python -c "import deepeval; print(deepeval.__version__)"
If you want to install DeepEval directly from GitHub into a separate project, use this pattern instead:
pip install "deepeval @ git+https://github.com/confident-ai/deepeval.git@<tag-or-commit-sha>"
Pinning matters. If you install from main without a commit SHA, your eval behavior can change the next time a teammate or CI runner installs dependencies.
5. Keep API keys out of Git
Many DeepEval metrics use an LLM judge, which means you usually need a model provider API key, such as an OpenAI key. Set it through your shell, secret manager, or CI secret store. Do not paste it into test files.
export OPENAI_API_KEY="sk-..."
For Windows PowerShell:
$env:OPENAI_API_KEY="sk-..."
Add local environment files to .gitignore:
.env
.env.local
.venv/
__pycache__/
.pytest_cache/
If you already committed an API key, deleting the line is not enough. Rotate the key in the provider dashboard, then remove it from Git history if needed.
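One way to keep keys out of test files is to read them from the environment and fail fast when they are missing. The sketch below is an illustration rather than anything DeepEval provides: require_api_key and mask are hypothetical helper names, and OPENAI_API_KEY is assumed only because it is the variable used in this guide.

```python
import os
from typing import Mapping, Optional


def require_api_key(
    name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or CI secret store."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Mask a secret for logs, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling require_api_key() at the top of an eval run gives a readable error instead of an authentication failure halfway through, and mask() lets you log which key was used without printing it.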
6. Create a minimal eval file
Create a small test file outside the DeepEval repo if you are testing your own app. For a quick local check, you can create a file named test_minimal_eval.py in a scratch directory.
mkdir ../deepeval-smoke-test
cd ../deepeval-smoke-test
cat > test_minimal_eval.py <<'PY'
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    # One hard-coded input/output pair; a real suite would load many cases.
    test_case = LLMTestCase(
        input="What does PromptLayer help AI teams do?",
        actual_output=(
            "PromptLayer helps AI teams manage prompts, run evaluations, "
            "trace LLM requests, and debug production AI workflows."
        ),
    )
    # Pin the judge model and threshold explicitly instead of using defaults.
    metric = AnswerRelevancyMetric(
        threshold=0.70,
        model="gpt-4o-mini",
    )
    # Fails the test if the metric score falls below the threshold.
    assert_test(test_case, [metric])
PY
This is a smoke test. It proves DeepEval can run, call the judge model, and evaluate one response. It is not a production eval suite.
7. Run the eval
Run the file with the DeepEval test runner:
deepeval test run test_minimal_eval.py
A passing run will look similar to this:
$ deepeval test run test_minimal_eval.py
Running 1 test case...
test_answer_relevancy
Answer Relevancy
Score: 0.91
Threshold: 0.70
Status: passed
1 passed
A failing run will look similar to this:
$ deepeval test run test_minimal_eval.py
Running 1 test case...
test_answer_relevancy
Answer Relevancy
Score: 0.34
Threshold: 0.70
Status: failed
Reason: The output does not answer the input directly.
1 failed
The exact output format can vary by DeepEval version, which is another reason to pin dependencies in CI.
8. Run an intentionally failing test
Before trusting your setup, make sure failures actually fail. Change the output to something irrelevant:
actual_output="Bananas are yellow and grow in warm climates."
Run the test again:
deepeval test run test_minimal_eval.py
If this still passes, your threshold is too low, your metric is not checking what you think it checks, or your test is wired incorrectly.
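The same idea can be written as a reusable check: a relevant answer should pass and an irrelevant one should fail at the same threshold. The sketch below deliberately uses a toy word-overlap scorer so it runs offline. It is a hypothetical stand-in, not a DeepEval metric, and the names toy_overlap_score and control_pair_ok are illustrative.

```python
from typing import Callable

# Any callable that maps (question, answer) to a score between 0 and 1.
ScoreFn = Callable[[str, str], float]


def toy_overlap_score(question: str, answer: str) -> float:
    """Toy stand-in scorer: share of question words found in the answer."""
    q_words = {w.strip("?.,!").lower() for w in question.split()}
    a_words = {w.strip("?.,!").lower() for w in answer.split()}
    q_words.discard("")
    if not q_words:
        return 0.0
    return len(q_words & a_words) / len(q_words)


def control_pair_ok(
    score: ScoreFn, question: str, good: str, bad: str, threshold: float
) -> bool:
    """True only if the good answer passes and the irrelevant one fails."""
    return score(question, good) >= threshold and score(question, bad) < threshold
```

In a real suite you would replace the toy scorer with a function that wraps your actual metric. If the pair check fails at your chosen threshold, fix the threshold or the wiring before trusting any passing result.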
9. Do not confuse sample tests with production evals
The examples in an open-source repo are useful for learning the API. They are usually not a good eval plan for your product.
A production eval should include:
- Real inputs: Use support tickets, search queries, agent tasks, or anonymized production traces.
- Expected behavior: Define what good output means for each task.
- Thresholds: Set pass or fail criteria before the run.
- Stable test data: Keep a versioned dataset so prompt and model changes are comparable.
- Failure review: Store failed examples and inspect them before shipping.
For example, a customer support bot eval should not stop at “answer relevancy.” You may also need checks for policy compliance, refusal behavior, citation quality, tone, and whether the answer used the retrieved account context correctly.
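A versioned dataset can be as simple as a JSONL file checked into your repo. The sketch below shows one possible shape. The field names (case_id, input, expected_behavior, checks) are assumptions made for illustration, not a DeepEval schema.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalCase:
    case_id: str
    input: str
    expected_behavior: str
    # Names of the checks this case should be scored on, e.g. tone or policy.
    checks: List[str] = field(default_factory=list)


def load_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse one JSON object per non-empty line into EvalCase records."""
    required = {"case_id", "input", "expected_behavior"}
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        missing = required - row.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        cases.append(
            EvalCase(
                case_id=row["case_id"],
                input=row["input"],
                expected_behavior=row["expected_behavior"],
                checks=list(row.get("checks", [])),
            )
        )
    return cases
```

Keeping the expected behavior next to each input makes failure review easier, because a reviewer can see what the case was supposed to demonstrate without opening another system.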
10. Make non-determinism explicit
LLM evals can be noisy. A prompt that passes once can fail on the next run if model sampling, retrieved context, or judge behavior changes.
To reduce noise:
- Use low temperature for generated outputs when possible, such as temperature=0 or 0.1.
- Pin the judge model, such as gpt-4o-mini, instead of relying on defaults.
- Set metric thresholds, such as 0.70 or 0.80, based on real examples.
- Run each eval against a stable dataset.
- Track pass rate over time instead of judging quality from one hand-picked example.
For CI, avoid blocking deploys on a single fragile test. A better pattern is to require a minimum pass rate on a fixed dataset, such as 95 out of 100 cases passing, while still blocking on critical safety or policy failures.
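The pass-rate rule described above can be expressed in a few lines. This is a sketch under stated assumptions: CaseResult and gate are hypothetical names, and it assumes you have already collected a pass or fail result per case from your eval run.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CaseResult:
    case_id: str
    passed: bool
    critical: bool = False  # e.g. a safety or policy check


def gate(results: Iterable[CaseResult], min_pass_rate: float = 0.95) -> bool:
    """Return True if the run should be allowed to proceed."""
    results = list(results)
    if not results:
        return False  # an empty run should not count as a pass
    # Any failed critical case blocks the run regardless of pass rate.
    if any(r.critical and not r.passed for r in results):
        return False
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate >= min_pass_rate
```

With the default threshold, 95 passing cases out of 100 clears the gate, while a single failed critical case blocks it even if the other 99 pass.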
11. Pin dependencies for repeatable runs
If your team uses DeepEval in CI, capture exact versions. A simple requirements file can work for a small project:
pip freeze > requirements.txt
A better pattern is to pin the GitHub commit explicitly:
deepeval @ git+https://github.com/confident-ai/deepeval.git@<commit-sha>
If you use uv, poetry, or pip-tools, commit the lockfile. The goal is simple: the same code, same package versions, same model configuration, and same dataset should produce comparable results.
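To catch unpinned entries before they reach CI, you could lint the requirements file. The sketch below is a deliberately simplified check and not a full requirements parser: it treats a line as pinned only if it uses == or a git+https URL ending in a full 40-character commit SHA, and it skips option lines such as -r or -e. The function name unpinned_requirements is hypothetical.

```python
import re
from typing import List

# name==version, optionally with extras such as name[extra]==version
_VERSION_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^\s;]+")
# name @ git+https://host/path@<40 hex chars>
_SHA_PIN = re.compile(r"@\s*git\+https://[^@\s]+@[0-9a-f]{40}$")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned to a version or SHA."""
    unpinned = []
    for raw in text.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw.split(" #", 1)[0].strip()
        if not line or line.startswith(("#", "-")):
            continue
        if _VERSION_PIN.match(line) or _SHA_PIN.search(line):
            continue
        unpinned.append(line)
    return unpinned
```

A branch name or tag after the final @ is reported as unpinned on purpose. Tags are usually stable, but only a commit SHA cannot be moved.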
12. Add DeepEval to CI carefully
Once the local smoke test works, wire it into CI. Start with a small eval suite so developers get fast feedback.
name: deepeval
on:
  pull_request:
  workflow_dispatch:
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run DeepEval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          deepeval test run tests/evals
Keep CI keys in GitHub Actions secrets or your CI provider’s secret store. Never place them in the workflow file.
Common mistakes to avoid
Cloning the wrong repository
Confirm the repo owner and URL before installing. If your team has an internal fork, document why it exists and how often it syncs with upstream.
Skipping the virtual environment
Global installs make eval failures harder to debug. Use a virtual environment for local testing and a clean Python environment in CI.
Committing API keys
Use environment variables and secret managers. Add .env files to .gitignore. Rotate any exposed key immediately.
Treating sample tests as production evals
Sample tests teach syntax. Production evals need representative data, thresholds, versioning, and review loops.
Using non-deterministic prompts without thresholds
If your prompt, model, or retrieval layer changes between runs, raw pass or fail results become hard to interpret. Set thresholds and track pass rates on stable datasets.
Failing to pin dependencies
Installing from an unpinned GitHub branch can break CI without a code change in your app. Pin the commit SHA or release tag.
A practical workflow for AI teams
Use this order when you move from local testing to production evals:
- Clone the correct DeepEval repo.
- Create a virtual environment.
- Install DeepEval in editable mode or pin a GitHub commit.
- Run one passing smoke test.
- Run one intentionally failing test.
- Create a small dataset from real application cases.
- Add thresholds for each metric.
- Pin dependencies and model settings.
- Run evals in CI with secrets managed by the CI provider.
- Review failures before changing prompts, models, or agent logic.
This keeps DeepEval useful as an engineering tool instead of another test command that nobody trusts.
PromptLayer helps AI teams manage prompts, trace LLM requests, organize datasets, and run evaluations against real application behavior. If you are building eval workflows around DeepEval, prompts, agents, or production LLM traces, create a PromptLayer account to start tracking and improving your AI system with a reliable workflow.