---
layout: 'page'
uri: '/configuration/semantic-search'
position: 2
slug: 'configuration-semantic-search'
parent: 'configuration'
navTitle: 'Semantic search'
title: 'Semantic search configuration'
description: 'OpenAI embedding variables that power the vectorize command and semantic search — API key, base URL, embedding model, and chunk size.'
---

# Semantic search configuration

Documan can embed your documentation so it can be searched by meaning, not just keywords. These variables configure the
OpenAI embeddings used by the [`vectorize`](/getting-started/commands) command and by semantic search at runtime — both
the [MCP server](/ai-integration/mcp-setup) and the [AI chat](/configuration/ai-chat) assistant rely on them.

**Important:** Any change to environment variables requires a server restart to take effect. `DOCUMAN_OPENAI_API_KEY` is
required for semantic search; the rest have sensible defaults.


## DOCUMAN_OPENAI_API_KEY

**Required for:** Semantic search (`vectorize` and runtime search)

**Default:** None

Your OpenAI API key. It is used in two places:

1. **Vectorize command**: generates embeddings for all documents (run once, then again whenever the documentation
   changes)
2. **Search at runtime**: converts each search query into an embedding to find relevant documents, for both the MCP
   server and the AI chat assistant

The key must be available both when running `vectorize` and at runtime when serving search requests.

**Example:**

```bash
DOCUMAN_OPENAI_API_KEY=sk-...
```
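Because the key is needed both when vectorizing and at runtime, it can help to fail fast when it is missing. The
following is a minimal sketch, not part of Documan: `require_openai_key` is a hypothetical helper name, and the sketch
assumes the variable is exported in the shell that launches Documan.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of Documan): succeeds only when
# DOCUMAN_OPENAI_API_KEY is set to a non-empty value.
require_openai_key() {
  if [ -z "${DOCUMAN_OPENAI_API_KEY:-}" ]; then
    echo "DOCUMAN_OPENAI_API_KEY is not set" >&2
    return 1
  fi
}
```

For example, `require_openai_key && ./documan vectorize` would skip vectorization with a clear message when the key is
absent.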


## DOCUMAN_OPENAI_BASE_URL

**Required for:** Custom or self-hosted embeddings endpoint

**Default:** `https://api.openai.com/v1`

Base URL for the embeddings API. Override it to route embedding requests through a proxy, Azure OpenAI, or a
self-hosted OpenAI-compatible embeddings server. The endpoint must implement the OpenAI `/embeddings` API.

**Example:**

```bash
DOCUMAN_OPENAI_BASE_URL=https://your-proxy.example/openai/v1
```
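To check that a custom endpoint accepts OpenAI-style embedding requests, a quick smoke test with `curl` can be useful.
This is a sketch under assumptions: the function names `embeddings_url` and `check_embeddings_endpoint` are
hypothetical, and it assumes `/embeddings` is appended directly to the base URL and that the endpoint uses bearer-token
authentication. Some providers expect different authentication headers, so adjust accordingly.

```bash
#!/usr/bin/env bash

# Build the embeddings URL from the base URL, falling back to the documented
# default. Assumption: "/embeddings" is appended to the base URL as-is.
embeddings_url() {
  local base="${DOCUMAN_OPENAI_BASE_URL:-https://api.openai.com/v1}"
  printf '%s/embeddings\n' "${base%/}"   # drop one trailing slash, if any
}

# Hypothetical smoke test: POST a tiny input and print only the HTTP status.
# A 200 suggests the endpoint accepts OpenAI-style embedding requests.
check_embeddings_endpoint() {
  curl -sS -o /dev/null -w '%{http_code}\n' "$(embeddings_url)" \
    -H "Authorization: Bearer ${DOCUMAN_OPENAI_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${DOCUMAN_OPENAI_EMBEDDING_MODEL:-text-embedding-3-small}\", \"input\": \"ping\"}"
}
```

Calling `check_embeddings_endpoint` with the same environment variables Documan will use gives an early signal before
running a full `vectorize`.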


## DOCUMAN_OPENAI_EMBEDDING_MODEL

**Required for:** Vectorize command

**Default:** `text-embedding-3-small`

The OpenAI embedding model to use for vectorization. Available options:

| Model                    | Dimensions | Notes                               |
|--------------------------|------------|-------------------------------------|
| `text-embedding-3-small` | 1536       | Recommended, best price/performance |
| `text-embedding-3-large` | 3072       | Highest quality                     |
| `text-embedding-ada-002` | 1536       | Legacy model                        |

The `text-embedding-3-small` model is very cost-effective: by OpenAI's estimate, $1 covers approximately 62,500 pages
of text (assuming roughly 800 tokens per page). See
[OpenAI Embeddings pricing](https://platform.openai.com/docs/guides/embeddings#embedding-models) for current rates.
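The pages-per-dollar figure can be reproduced with simple arithmetic. The sketch below assumes a price of $0.02 per
1M tokens and about 800 tokens per page; both numbers are assumptions that may change, so check the pricing page
before relying on them. `pages_per_dollar` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# pages_per_dollar PRICE_CENTS_PER_1M_TOKENS TOKENS_PER_PAGE
# Integer arithmetic: tokens bought by $1 (100 cents), divided by tokens per page.
pages_per_dollar() {
  local cents_per_million="$1" tokens_per_page="$2"
  local tokens_per_dollar=$(( 100 * 1000000 / cents_per_million ))
  echo $(( tokens_per_dollar / tokens_per_page ))
}

pages_per_dollar 2 800   # assumed: $0.02 per 1M tokens, ~800 tokens per page; prints 62500
```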

**Example:**

```bash
DOCUMAN_OPENAI_EMBEDDING_MODEL=text-embedding-3-large
```

**Note:** Changing the embedding model requires re-running `./documan vectorize` to regenerate all embeddings, because
vectors produced by different models are not comparable.


## DOCUMAN_CHUNK_MAX_LEN

**Required for:** Vectorize command

**Default:** `250` (minimum: 100)

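Maximum length, in characters, of each text chunk produced during vectorization. Chunks are the semantic units that
are embedded and matched by vector search, so a new value takes effect only after re-running `./documan vectorize`.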

- **Smaller chunks** = More precise search results, finer granularity
- **Larger chunks** = More context per result, but less precise matching

**Example:**

```bash
DOCUMAN_CHUNK_MAX_LEN=500
```
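To get a feel for how the chunk size affects the number of embeddings, the following sketch estimates a rough lower
bound on the chunk count for one file. It is only an approximation: `min_chunks` is a hypothetical helper, and because
Documan splits text into semantic units, the real count will usually be higher than a plain character division.

```bash
#!/usr/bin/env bash

# min_chunks FILE [MAX_LEN]
# Rough lower bound on the chunk count: ceil(characters / MAX_LEN).
# MAX_LEN defaults to DOCUMAN_CHUNK_MAX_LEN, then to the documented default of 250.
min_chunks() {
  local file="$1"
  local max_len="${2:-${DOCUMAN_CHUNK_MAX_LEN:-250}}"
  local chars=$(( $(wc -m < "$file") ))   # arithmetic strips wc's padding
  echo $(( (chars + max_len - 1) / max_len ))
}
```

For instance, `min_chunks docs/intro.md 500` (with a placeholder path) reports at least how many chunks that file
would produce at a 500-character limit.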


## Related configuration

- [General](/configuration/general) — project name, paths, port, AI discovery surfaces, and license key
- [AI chat](/configuration/ai-chat) — the Anthropic-powered chat assistant

---

[← General](/configuration/general.md) | [AI chat →](/configuration/ai-chat.md)