
How to Run DeepEval from GitHub

Jun 06, 2026

DeepEval is useful when you want LLM evals to run close to your codebase, especially in local development and CI. Running it from GitHub gives you access to the source repo, examples, and unreleased changes, but it also adds a few ways to make a mess: cloning the wrong repo, skipping a virtual environment, leaking API keys, or treating sample tests as production evals.

This guide walks through a clean setup for developers who want to clone DeepEval from GitHub, run a minimal eval, and avoid the common mistakes that cause flaky or unsafe LLM test runs.

1. Confirm the correct GitHub repository

The DeepEval GitHub repository is:

https://github.com/confident-ai/deepeval

Before cloning, check the repo page and confirm:

  • The owner is confident-ai.
  • The repo name is deepeval.
  • The repository has recent commits and release tags.
  • The README install commands match the package you expect.

If you are writing internal setup docs, capture a screenshot of the GitHub repo page that shows the URL, owner, repo name, branch selector, and latest commit. This prevents a common onboarding issue: someone cloning a fork, stale mirror, or unrelated package with a similar name.
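The owner and repo check can also be automated after cloning. The following is a minimal sketch, not part of DeepEval: the helper names parse_github_remote and is_expected_remote are hypothetical, and it assumes the remote uses the standard GitHub HTTPS or SSH URL form.

```python
import re
from typing import Optional, Tuple

# Values taken from the repository URL given in this guide.
EXPECTED_OWNER = "confident-ai"
EXPECTED_REPO = "deepeval"

# Matches https://github.com/<owner>/<repo>[.git] and git@github.com:<owner>/<repo>[.git]
_REMOTE_PATTERN = re.compile(
    r"^(?:https://github\.com/|git@github\.com:)([^/]+)/([^/]+?)(?:\.git)?/?$"
)


def parse_github_remote(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) in lowercase, or None if the URL is not a GitHub remote."""
    match = _REMOTE_PATTERN.match(url.strip())
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


def is_expected_remote(url: str) -> bool:
    """True only when the remote points at confident-ai/deepeval."""
    return parse_github_remote(url) == (EXPECTED_OWNER, EXPECTED_REPO)
```

Pass it the output of git remote get-url origin from inside the clone. A fork or a similarly named repo returns False.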

2. Clone DeepEval locally

Use a clean workspace. Do not clone it inside another application repo unless you have a clear reason, because a nested Git repository is easy to commit or ignore by accident.

mkdir llm-eval-tools
cd llm-eval-tools

git clone https://github.com/confident-ai/deepeval.git
cd deepeval

If you want repeatable results, pin the repo to a tag or commit instead of running from the default branch forever.

git tag --sort=-version:refname | head

# Example pattern:
git checkout <tag-or-commit-sha>

For CI, prefer a specific commit SHA. A moving branch can change behavior without warning, and a tag can be re-pointed, but a commit SHA always refers to the same code.

3. Create a virtual environment

Do not install directly into your global Python environment. DeepEval, model SDKs, and test dependencies can conflict with application dependencies.

python3 -m venv .venv
source .venv/bin/activate

python -m pip install --upgrade pip setuptools wheel

On Windows PowerShell:

python -m venv .venv
.venv\Scripts\Activate.ps1

python -m pip install --upgrade pip setuptools wheel

Confirm that your shell is using the virtual environment:

which python
python --version

You should see a path inside .venv. The which command is not available in Windows PowerShell, so use Get-Command python there instead.
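A shell-independent way to make the same check is to ask the interpreter itself. This is a small sketch using only the standard library. The function name in_virtualenv is illustrative, and it assumes the environment was created with venv or a recent virtualenv.

```python
import sys


def in_virtualenv() -> bool:
    """Report whether the running interpreter belongs to a virtual environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    """Summarize the interpreter path and whether it is isolated."""
    state = "virtual environment" if in_virtualenv() else "global interpreter"
    return f"{sys.executable} ({state})"
```

Calling print(describe_interpreter()) before installing anything shows which Python will receive the packages.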

4. Install DeepEval from the cloned repo

From the root of the cloned repository, install it in editable mode:

pip install -e .

Then confirm the install works:

python -c "import deepeval; print(deepeval.__version__)"

If you want to install DeepEval directly from GitHub into a separate project, use this pattern instead:

pip install "deepeval @ git+https://github.com/confident-ai/deepeval.git@<tag-or-commit-sha>"

Pinning matters. If you install from main without a commit SHA, your eval behavior can change the next time a teammate or CI runner installs dependencies.

5. Keep API keys out of Git

Many DeepEval metrics use an LLM judge, which means you usually need a model provider API key, such as an OpenAI key. Set it through your shell, secret manager, or CI secret store. Do not paste it into test files.

export OPENAI_API_KEY="sk-..."

For Windows PowerShell:

$env:OPENAI_API_KEY="sk-..."

Add local environment files to .gitignore:

.env
.env.local
.venv/
__pycache__/
.pytest_cache/

If you already committed an API key, deleting the line is not enough. Rotate the key in the provider dashboard, then remove it from Git history if needed.
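If a helper script or test fixture needs the key, it can read it from the environment and fail early with a clear message instead of hard-coding a value. The sketch below is illustrative: require_api_key and mask are hypothetical helper names, not DeepEval APIs.

```python
import os
from typing import Mapping, Optional


def require_api_key(
    name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the key from the environment, or raise if it is missing or blank."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or configure a CI secret."
        )
    return value


def mask(secret: str) -> str:
    """Return a log-safe form that shows at most the first three characters."""
    return secret[:3] + "..." if len(secret) > 8 else "***"
```

Logging mask(require_api_key()) confirms that a key is present without printing it.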

6. Create a minimal eval file

Create a small test file outside the DeepEval repo if you are testing your own app. For a quick local check, create a file named test_minimal_eval.py in a scratch directory. Keep the test_ prefix in the file name, since the DeepEval test runner expects it.

mkdir ../deepeval-smoke-test
cd ../deepeval-smoke-test

cat > test_minimal_eval.py <<'PY'
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does PromptLayer help AI teams do?",
        actual_output=(
            "PromptLayer helps AI teams manage prompts, run evaluations, "
            "trace LLM requests, and debug production AI workflows."
        ),
    )

    metric = AnswerRelevancyMetric(
        threshold=0.70,
        model="gpt-4o-mini",
    )

    assert_test(test_case, [metric])
PY

This is a smoke test. It proves DeepEval can run, call the judge model, and evaluate one response. It is not a production eval suite.

7. Run the eval

Run the file with the DeepEval test runner:

deepeval test run test_minimal_eval.py

A passing run will look similar to this:

$ deepeval test run test_minimal_eval.py

Running 1 test case...

test_answer_relevancy
  Answer Relevancy
  Score: 0.91
  Threshold: 0.70
  Status: passed

1 passed

A failing run will look similar to this:

$ deepeval test run test_minimal_eval.py

Running 1 test case...

test_answer_relevancy
  Answer Relevancy
  Score: 0.34
  Threshold: 0.70
  Status: failed
  Reason: The output does not answer the input directly.

1 failed

The exact output format can vary by DeepEval version, which is another reason to pin dependencies in CI.

8. Run an intentionally failing test

Before trusting your setup, make sure failures actually fail. Change the output to something irrelevant:

actual_output="Bananas are yellow and grow in warm climates."

Run the test again:

deepeval test run test_minimal_eval.py

If this still passes, your threshold is too low, your metric is not checking what you think it checks, or your test is wired incorrectly.

9. Do not confuse sample tests with production evals

The examples in an open-source repo are useful for learning the API. They are usually not a good eval plan for your product.

A production eval should include:

  • Real inputs: Use support tickets, search queries, agent tasks, or anonymized production traces.
  • Expected behavior: Define what good output means for each task.
  • Thresholds: Set pass or fail criteria before the run.
  • Stable test data: Keep a versioned dataset so prompt and model changes are comparable.
  • Failure review: Store failed examples and inspect them before shipping.

For example, a customer support bot eval should not stop at “answer relevancy.” You may also need checks for policy compliance, refusal behavior, citation quality, tone, and whether the answer used the retrieved account context correctly.
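One way to keep such a dataset stable and versioned is to store cases as JSON Lines in Git and record a content fingerprint with every run. The sketch below assumes a hypothetical schema (case_id, input, expected_behavior, critical). It is independent of DeepEval's own dataset classes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class EvalCase:
    """One eval case. Field names are an assumed schema for illustration."""
    case_id: str
    input: str
    expected_behavior: str
    critical: bool = False  # marks safety or policy cases


def load_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse one JSON object per non-empty line into EvalCase records."""
    return [
        EvalCase(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def dataset_fingerprint(cases: List[EvalCase]) -> str:
    """Short SHA-256 digest of the cases, so two runs can prove they used the same data."""
    payload = json.dumps([asdict(c) for c in cases], sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

If the fingerprint changes between two runs, the dataset changed, so score differences cannot be attributed to the prompt or model alone.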

10. Make non-determinism explicit

LLM evals can be noisy. A prompt that passes once can fail on the next run if model sampling, retrieved context, or judge behavior changes.

To reduce noise:

  • Use low temperature for generated outputs when possible, such as temperature=0 or 0.1. This reduces sampling variance but does not guarantee identical outputs across runs.
  • Pin the judge model, such as gpt-4o-mini, instead of relying on defaults.
  • Set metric thresholds, such as 0.70 or 0.80, based on real examples.
  • Run each eval against a stable dataset.
  • Track pass rate over time instead of judging quality from one hand-picked example.

For CI, avoid blocking deploys on a single fragile test. A better pattern is to require a minimum pass rate on a fixed dataset, such as 95 out of 100 cases passing, while still blocking on critical safety or policy failures.
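The pass-rate pattern described above can be sketched as a small gate function. The CaseResult type and the gate function are hypothetical and not part of DeepEval. The sketch assumes you have already collected one pass or fail result per case.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one eval case. `critical` marks safety or policy checks."""
    case_id: str
    passed: bool
    critical: bool = False


def gate(results: Iterable[CaseResult], min_pass_rate: float = 0.95) -> bool:
    """Return True if the run may proceed.

    Any failed critical case blocks the run outright. Otherwise the overall
    pass rate must meet the minimum. An empty run is treated as a failure.
    """
    items = list(results)
    if not items:
        return False
    if any(r.critical and not r.passed for r in items):
        return False
    pass_rate = sum(r.passed for r in items) / len(items)
    return pass_rate >= min_pass_rate
```

With the default threshold, 95 of 100 non-critical cases passing clears the gate, while a single failed critical case blocks it regardless of the overall rate.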

11. Pin dependencies for repeatable runs

If your team uses DeepEval in CI, capture exact versions. A simple requirements file can work for a small project:

pip freeze > requirements.txt

If you install DeepEval from GitHub rather than from a package index, a better pattern is to pin the GitHub commit explicitly in the requirements file:

deepeval @ git+https://github.com/confident-ai/deepeval.git@<commit-sha>

If you use uv, poetry, or pip-tools, commit the lockfile. The goal is simple: the same code, same package versions, same model configuration, and same dataset should produce comparable results.
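A quick way to catch unpinned entries before they reach CI is to scan the requirements text. The following is a rough sketch, not a full requirements parser. It treats a line as pinned only if it uses == or a git+https reference ending in a 40-character commit SHA, and it skips option lines such as -r or -e. The function name unpinned_requirements is illustrative.

```python
import re
from typing import List

# name==version, optionally with extras such as name[extra]==version
_EXACT_PIN = re.compile(r"^[A-Za-z0-9._\-\[\],]+\s*==\s*[^\s;]+")
# name @ git+https://...@<40 hex chars>, optionally followed by a #fragment
_GIT_SHA_PIN = re.compile(r"@\s*git\+https://\S+@[0-9a-f]{40}(?:#\S*)?$")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are neither ==-pinned nor pinned to a commit SHA."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # blank lines, full-line comments, and pip options
        if _EXACT_PIN.match(line) or _GIT_SHA_PIN.search(line):
            continue
        unpinned.append(line)
    return unpinned
```

A line that installs from a branch name such as main, or uses a range such as >=, is reported so it can be replaced with an exact version or commit.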

12. Add DeepEval to CI carefully

Once the local smoke test works, wire it into CI. Start with a small eval suite so developers get fast feedback.

name: deepeval

on:
  pull_request:
  workflow_dispatch:

jobs:
  evals:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run DeepEval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          deepeval test run tests/evals

Keep CI keys in GitHub Actions secrets or your CI provider’s secret store. Never place them in the workflow file.

Common mistakes to avoid

Cloning the wrong repository

Confirm the repo owner and URL before installing. If your team has an internal fork, document why it exists and how often it syncs with upstream.

Skipping the virtual environment

Global installs make eval failures harder to debug. Use a virtual environment for local testing and a clean Python environment in CI.

Committing API keys

Use environment variables and secret managers. Add .env files to .gitignore. Rotate any exposed key immediately.

Treating sample tests as production evals

Sample tests teach syntax. Production evals need representative data, thresholds, versioning, and review loops.

Using non-deterministic prompts without thresholds

If your prompt, model, or retrieval layer changes between runs, raw pass or fail results become hard to interpret. Set thresholds and track pass rates on stable datasets.

Failing to pin dependencies

Installing from an unpinned GitHub branch can break CI without a code change in your app. Pin the commit SHA or release tag.

A practical workflow for AI teams

Use this order when you move from local testing to production evals:

  1. Clone the correct DeepEval repo.
  2. Create a virtual environment.
  3. Install DeepEval in editable mode or pin a GitHub commit.
  4. Run one passing smoke test.
  5. Run one intentionally failing test.
  6. Create a small dataset from real application cases.
  7. Add thresholds for each metric.
  8. Pin dependencies and model settings.
  9. Run evals in CI with secrets managed by the CI provider.
  10. Review failures before changing prompts, models, or agent logic.

This keeps DeepEval useful as an engineering tool instead of another test command that nobody trusts.


PromptLayer helps AI teams manage prompts, trace LLM requests, organize datasets, and run evaluations against real application behavior. If you are building eval workflows around DeepEval, prompts, agents, or production LLM traces, create a PromptLayer account to start tracking and improving your AI system with a reliable workflow.
