Building a Semantic Search API with Spring Boot and pgvector - Part 1: Architecture

The Problem with Keyword Search

Keyword search breaks more often than most engineers realize.

A few months ago, I was building an internal document management tool. Users could upload policy documents, product guides, and support articles — and search through them.

I implemented a simple keyword search, deployed it, and assumed I was done.

Then the complaints started.

One support engineer searched for "billing retries" and got zero results. The document absolutely existed. It was titled "Payment Failure Handling Policy" and covered exactly what they were looking for.

The problem wasn’t the content.

The problem was the search engine.

It was doing exactly what keyword search is designed to do: scanning documents for the exact words “billing” and “retries.”

Those words weren't in the document. So the system concluded there was no match.

Query: "billing retries"
Document: "Payment Failure Handling Policy"
Keyword search: ❌ No match — strings don't overlap
Semantic search: ✅ Strong match — meaning is the same

This is the fundamental limitation of keyword search: it compares strings, not meaning.

It treats "car" and "automobile" as completely unrelated.

It sees "help me fix this bug" and "debugging assistance" as different queries. But that’s not how people search in the real world.

People search using intent, and they rarely phrase a query the same way a document is written.
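To make the failure concrete, here is a minimal sketch of what a naive keyword matcher does (hypothetical code, not the actual implementation from the tool described above): a document matches only if every query term appears in it verbatim.

```java
import java.util.Arrays;

public class KeywordMatch {
    // Naive keyword search: the document matches only if it contains
    // every term of the query as a literal substring.
    static boolean matches(String document, String query) {
        String doc = document.toLowerCase();
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .allMatch(doc::contains);
    }

    public static void main(String[] args) {
        String doc = "Payment Failure Handling Policy";
        System.out.println(matches(doc, "billing retries")); // false: no word overlap
        System.out.println(matches(doc, "payment failure")); // true: literal overlap
    }
}
```

The document about payment failures is a perfect answer to the "billing retries" query, yet this matcher can never find it.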

Semantic search approaches the problem differently.

Instead of matching text directly, it attempts to capture the meaning behind the words. To do that, it converts text into numerical representations called embeddings. Before we build the search system itself, we first need to understand what embeddings are and why they work.

[Figure: Keyword Search vs Semantic Search]

What Are Embeddings?

Embeddings are the core idea behind semantic search. At a high level, an embedding is a numerical representation of text.

Instead of storing meaning as words, machine learning models convert text into vectors - lists of numbers that capture semantic relationships between pieces of text.

For example, a sentence like:

"How do I retry a failed payment?" => [0.023, -0.181, 0.442, ..., 0.091] 
                                        1,536 dimensions 

These numbers by themselves don't mean much to us.

What matters is how close two vectors are in the embedding space.

If two pieces of text express similar ideas, their vectors will sit close together.

If they describe completely different concepts, their vectors will be far apart.

For example, embeddings for the words "car", "automobile", and "vehicle" will appear very close together.

Meanwhile, something unrelated like "banana" will be far away from all three.

This is what allows semantic search to work.

Instead of asking:

Do these documents contain the same words?

The system asks:

Are these documents about the same idea?

That small shift fundamentally changes how search works.

It allows search engines to retrieve relevant documents even when the wording is completely different.

[Figure: Embedding Space]

In practice, modern embedding models produce vectors with hundreds or thousands of dimensions.

The model used in this project generates vectors with 1,536 dimensions, which means every piece of text becomes a point in a 1,536-dimensional space.

While we can't visualize that space directly, distance between vectors can still be measured mathematically.

That measurement is what allows us to rank documents by semantic similarity.

Measuring Semantic Similarity

Once both the query and documents are converted into embeddings, the next question becomes:

How do we compare them?

This is where vector similarity comes in.

A semantic search system measures how close two vectors are to each other in the embedding space.

If two vectors point in nearly the same direction, the underlying text likely expresses the same idea.

If the vectors point in very different directions, the concepts are probably unrelated.

One of the most common ways to measure this similarity is cosine similarity.

Cosine similarity measures the angle between two vectors.

Vectors pointing in nearly the same direction have a similarity close to 1.

Orthogonal vectors have a similarity of 0, and vectors pointing in opposite directions approach -1. For ranking, higher simply means more similar.

In practice, this allows the search system to rank documents by semantic relevance.
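As a sketch of the math, cosine similarity is the dot product of two vectors divided by the product of their lengths. The three-dimensional vectors below are toy values chosen for illustration; real embeddings have 1,536 dimensions, but the formula is unchanged.

```java
public class Cosine {
    // cosine(a, b) = (a . b) / (|a| * |b|)
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] carQuery = {0.9, 0.1, 0.2}; // toy vectors, not real embeddings
        double[] carDoc   = {0.8, 0.2, 0.1};
        double[] fruitDoc = {0.1, 0.9, 0.7};
        System.out.println(similarity(carQuery, carDoc));   // close to 1
        System.out.println(similarity(carQuery, fruitDoc)); // much lower
    }
}
```

Because the measure depends only on direction, not magnitude, two texts of very different lengths can still score as near-identical in meaning.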

Instead of returning documents that simply contain the same words, the system returns documents whose meaning is closest to the user’s query.

This is what makes semantic search so powerful.

Even if the wording is different, the search engine can still retrieve the right documents.

[Figure: Cosine Similarity]

Now let's look at how this works in the system we're building.

What We're Building

The service exposes six endpoints:

POST   /documents        — store a document and compute its embedding

GET    /documents/{id}   — retrieve a document by ID

PUT    /documents/{id}   — update a document and re-compute its embedding

DELETE /documents/{id}   — remove a document

POST   /search           — semantic search with filters and pagination

GET    /ping             — health check

When a search request arrives, the system does five things in sequence.

1. The client sends a query to the API.
2. The API converts that query into an embedding using the same model that was used to embed the stored documents.
3. PostgreSQL performs a vector similarity search with pgvector, comparing the query vector against the stored document embeddings via pgvector's vector index.
4. The database returns the documents whose vectors are closest to the query.
5. The API ranks them by similarity score and returns the results.

The key detail is step two. The query and the documents are embedded using the same model, which means they live in the same vector space.

That shared space is what makes comparison possible. A query about "billing retries" and a document about "payment failure handling" end up close together in that space, and pgvector finds that closeness in milliseconds even across thousands of documents.

[Figure: Search Execution Flow]

Because pgvector runs inside PostgreSQL, the similarity search can be combined with standard database features — filtering by metadata, pagination, and indexing — all inside a single query.

No separate vector database is required.

Here's what a search request and response look like:

{ 
  "query": "billing retries", 
  "page": 0, 
  "size": 10, 
  "minScore": 0.6, 
  "filters": { "category": "billing" } 
}  
 

And the response:

{ 
  "page": 0, 
  "size": 10, 
  "totalElements": 3, 
  "items": [ 
    { 
      "id": 1, 
      "title": "Payment Failure Handling Policy", 
      "cosineDistance": 0.12, 
      "cosineSimilarity": 0.88, 
      "score": 0.94 
    } 
  ] 
} 

Three score fields appear in every result.

cosineDistance is the raw output from pgvector — lower means more similar.

cosineSimilarity inverts that — higher means more similar.

score normalises the result to a clean [0, 1] range and is the value your application should actually use.

Set minScore: 0.7 in the request and only results with a score of 0.7 or above come back.
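A sketch of how the three values could relate, inferred from the example response above rather than copied from the repository: pgvector's cosine distance lies in [0, 2], similarity is 1 minus distance, and the score rescales similarity from [-1, 1] to [0, 1].

```java
public class Scores {
    // cosineDistance from pgvector is in [0, 2]; lower means more similar.
    static double similarity(double cosineDistance) {
        return 1.0 - cosineDistance; // in [-1, 1]; higher means more similar
    }

    // Rescale similarity from [-1, 1] to a clean [0, 1] score.
    // NOTE: this normalisation is inferred from the example response
    // (distance 0.12 -> similarity 0.88 -> score 0.94), not from the repo.
    static double score(double cosineDistance) {
        return (1.0 + similarity(cosineDistance)) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(similarity(0.12)); // approx. 0.88
        System.out.println(score(0.12));      // approx. 0.94
    }
}
```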

The filters field narrows results to documents whose metadata matches specific values. In the example above, only documents tagged category: billing are searched. The filter keys are validated at the API boundary — malformed keys are rejected before they reach the database.

The full source code is on GitHub — linked at the end of this article.

[Figure: System Architecture]

The Tech Stack and Why

The goal of this project wasn’t just to build semantic search — it was to build it using tools that many backend engineers already use in production.

Instead of introducing a completely new ecosystem, the idea was to see how far we could push a familiar stack.

Here’s what that stack looks like.

Spring Boot

Spring Boot handles the infrastructure: dependency injection, validation, exception handling, configuration management — leaving the focus on business logic. Spring Boot 3 running on Java 21 also brings virtual threads via Project Loom, which is relevant for a service making frequent I/O calls to OpenAI.

The honest reason for this choice over Quarkus or Micronaut is that Spring Boot is widely used in enterprise Java, and this service needs to be readable and maintainable by other Java developers. Familiarity is a legitimate engineering consideration.

PostgreSQL

PostgreSQL stores the documents, metadata and timestamps. The vector storage is handled by the pgvector extension, covered next.

pgvector

The question worth addressing directly: why not a dedicated vector database like Pinecone, Weaviate, or Qdrant?
For most production use cases, you don't need one.

pgvector is a PostgreSQL extension that adds a VECTOR column type and a cosine distance operator <=>.

It stores embeddings directly alongside relational data, in the same database, with the same ACID guarantees.

A document and its embedding are written in a single transaction — no synchronisation between two systems, no eventual consistency to reason about.

CREATE TABLE documents ( 
    id                   BIGSERIAL PRIMARY KEY, 
    title                TEXT NOT NULL, 
    content              TEXT NOT NULL, 
    metadata             JSONB, 
    embedding            VECTOR(1536), 
    status               TEXT NOT NULL, 
    embedding_error      TEXT, 
    embedding_updated_at TIMESTAMPTZ 
); 
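Part 2 covers the index configuration in detail, but for context, a cosine-distance IVFFlat index on that column looks roughly like this (the index name and the `lists` value here are illustrative, not taken from the repository's migrations):

```sql
-- Approximate nearest-neighbour index used by the <=> operator.
CREATE INDEX idx_documents_embedding
    ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```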

One honest caveat: pgvector works well at moderate scale (millions of documents). For billions of vectors with sub-millisecond latency requirements, a dedicated vector database makes more sense. But for the vast majority of production use cases, pgvector is the right starting point.

Reach for a specialist tool only when you've proven you need it.

OpenAI Embeddings

text-embedding-3-small converts text to 1,536-dimensional vectors. It was chosen over ada-002 and text-embedding-3-large for the balance of quality, speed, and cost. It produces embeddings that are more than good enough for document search at a fraction of the cost of the larger model.

More importantly, the OpenAI client is never imported directly into the service layer.

The OpenAI client sits behind an EmbeddingClient interface — the provider can be swapped without touching the service layer. More on this in Part 6.

Flyway

Finally, Flyway is used to manage database migrations. As the schema evolves, for example when introducing document status fields or metadata changes, Flyway ensures that database changes are applied consistently across environments.

Using migrations also makes it easier for readers of this series to reproduce the database setup.

High-Level Architecture

The service is organised into four layers. Each layer has one job and communicates only with the layer directly below it.

Controller — the HTTP boundary.

Receives requests, validates them with @Valid, delegates to the service, and returns the correct status code. No business logic lives here. A GlobalExceptionHandler sits across all controllers and ensures every error response — whether a 400, 404, or 500 — returns the same structured JSON shape.

Service — where all decisions happen.

DocumentServiceImpl orchestrates the repository and the embedding client. It controls the document lifecycle: every document is saved immediately with a PENDING status, then moves to READY once the embedding succeeds, or FAILED if OpenAI returns an error.

A failed embedding is never silent — the error message is stored in the database and the document is excluded from all search results until it's resolved.

public CreateDocumentResponse create(CreateDocumentRequest request) { 
    Document saved = saveAsPending(request);     // status = PENDING 
    embedAndPersist(saved.getId(), ...);          // status → READY or FAILED 
    return new CreateDocumentResponse(saved.getId(), DocumentStatus.READY); 
} 

Repository — Spring Data JPA handles standard CRUD.

JdbcTemplate handles vector operations. pgvector's <=> cosine distance operator and ::vector casting don't map to JPQL, so those queries are written in SQL directly. Two tools, two clearly defined responsibilities.

SELECT id, title, 
       (embedding <=> ?::vector) AS cosine_distance 
FROM documents 
WHERE status = 'READY' 
ORDER BY cosine_distance ASC 
LIMIT ?; 
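One practical detail of the `?::vector` cast: the parameter is bound as a string in pgvector's text format, `[x1,x2,...]`. A small hypothetical helper (not taken from the repository) that produces such a literal from a `float[]`:

```java
import java.util.StringJoiner;

public class VectorLiteral {
    // Format a float[] as pgvector's text representation, e.g. "[0.1,0.2,0.3]",
    // suitable for binding to a `?::vector` placeholder.
    static String toVectorLiteral(float[] embedding) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : embedding) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toVectorLiteral(new float[]{0.1f, -0.2f, 0.3f}));
        // prints [0.1,-0.2,0.3]
    }
}
```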

Embedding — OpenAiEmbeddingClient sits behind an EmbeddingClient interface.

Nothing else in the application imports the implementation directly. Swapping OpenAI for a local model means writing one new class — the service layer is untouched.

public interface EmbeddingClient { 
    float[] embed(String text); 
} 
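To make that seam concrete, here is a hypothetical deterministic stub, the kind of implementation a test profile might register instead of the OpenAI-backed client. The class name and seeding strategy are made up; the interface is repeated so the snippet compiles on its own.

```java
import java.util.Random;

interface EmbeddingClient {
    float[] embed(String text);
}

// Hypothetical deterministic stub: the same text always maps to the same
// vector, so tests can run without calling a real embedding provider.
public class FakeEmbeddingClient implements EmbeddingClient {
    private static final int DIMENSIONS = 1536;

    @Override
    public float[] embed(String text) {
        Random random = new Random(text.hashCode()); // seed derived from the text
        float[] vector = new float[DIMENSIONS];
        for (int i = 0; i < DIMENSIONS; i++) {
            vector[i] = random.nextFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] v = new FakeEmbeddingClient().embed("billing retries");
        System.out.println(v.length); // 1536
    }
}
```

Because the service layer only depends on the interface, registering this class in place of the real client requires no other changes.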

[Figure: Architecture]

The full source code, including all migrations and tests, is available on GitHub: link

Series Roadmap

This article covered the foundation — the problem semantic search solves, how embeddings work, and how the system is structured. The rest of the series builds out each layer in full.

Part 2 — The Database Layer: All three Flyway migrations in detail. The documents table structure, the IVFFlat index configuration, the JSONB metadata design, and how the schema supports the document lifecycle from day one.

Part 3 — Calling the OpenAI Embeddings API in Java Without an SDK: Building the HTTP request with plain java.net.http.HttpClient, parsing the response with Jackson, L2 normalisation, and the bugs worth knowing about before you write a single line.

Part 4 — The Full CRUD API and Service Layer: The complete DocumentServiceImpl — create, read, update, delete, and search. The QueryBuilder inner class for safe dynamic SQL. The GlobalExceptionHandler for consistent error responses across the entire API.

Part 5 — Testing Without a Real Database or API Key: Mockito, MockMvc, H2 test profiles, and the specific JdbcTemplate varargs trap that catches most developers the first time — with the exact fix.

Part 6 — Lessons Learned: 6 Bugs Found Before the Service Could Run: Six real bugs from this codebase — wrong API URL, missing annotation bracket, trailing comma in a JSON string, broken SQL subquery, silent double normalisation, and a RuntimeException returning 500 instead of 404. What each one taught me and how to avoid them.

If you found this useful, the full source code is available on GitHub and the next article in the series dives into the database layer and pgvector indexing.

See you in Part 2.