css | Business Life Coaching

Subtitle: Schema design, embedding workflows, hybrid search, and performance tradeoffs explained.

Quick Recap from Part 1

In Part 1, we introduced the MyVector plugin — a native extension that brings vector embeddings and HNSW-based approximate nearest neighbor (ANN) search into MySQL. We covered how MyVector supports scoped queries (e.g., WHERE user_id = X) to ensure that semantic search remains relevant, performant, and secure in real-world multi-tenant applications.

Now in Part 2, we move from concept to implementation:

How to store and index embeddings
How to design embedding workflows
How hybrid (vector + keyword) search works
How HNSW compares to brute-force search
How to tune for performance at scale

1. Schema Design for Vector Search

The first step is designing tables that support both structured and semantic data.

A typical schema looks like:

CREATE TABLE documents (
    id BIGINT PRIMARY KEY,
    user_id INT NOT NULL,
    title TEXT,
    body TEXT,
    embedding VECTOR(384),
    INDEX(embedding) VECTOR
);

Design tips:

Use VECTOR(n) to store dense embeddings (e.g., 384-dim for MiniLM).
Always combine vector queries with SQL filtering (WHERE user_id = …, category = …) to scope the search space.
Use TEXT or JSON fields for hybrid or metadata-driven filtering.
Consider separating raw text from embedding storage for cleaner pipelines.

2. Embedding Pipelines: Where and When to Embed

MyVector doesn’t generate embeddings — it stores and indexes them. You’ll need to decide how embeddings are generated and updated:

a. Offline (batch) embedding

Run scheduled jobs (e.g., nightly) to embed new rows.
Suitable for static content (documents, articles).
Can be run using Python + HuggingFace, OpenAI, etc.

# Python example
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["Your text goes here"])

b. Write-time embedding

Embed text when inserted via your application.
Ensures embeddings are available immediately.
Good for chat apps, support tickets, and notes.

c. Query-time embedding

Used for user search input only.
Transforms search terms into vectors (not stored).
Passed into queries like:

ORDER BY L2_DISTANCE(embedding, '[query_vector]') ASC

3. Hybrid Search: Combine Text and Semantics

Most real-world search stacks benefit from combining keyword and vector search. MyVector enables this inside a single query:

SELECT id, title
FROM documents
WHERE MATCH(title, body) AGAINST('project deadline')
  AND user_id = 42
ORDER BY L2_DISTANCE(embedding, EMBED('deadline next week')) ASC
LIMIT 5;

This lets you:

Narrow results using lexical filters
Re-rank them semantically
All in MySQL — no sync to external vector DBs

This hybrid model is ideal for support systems, chatbots, documentation search, and QA systems.

4. Brute-Force vs. HNSW Indexing in MyVector

When it comes to similarity search, how you search impacts how fast you scale.

Brute-force search

Compares the query against every row
Guarantees exact results (100% recall)
Simple but slow for >10K rows

SELECT id
FROM documents
ORDER BY COSINE_DISTANCE(embedding, '[query_vector]') ASC
LIMIT 5;

HNSW: Hierarchical Navigable Small World

Graph-based ANN algorithm used by MyVector
Fast and memory-efficient
High recall (~90–99%) with tunable parameters (ef_search, M)

CREATE INDEX idx_vec ON documents(embedding) VECTOR
  COMMENT='{"HNSW_M": 32, "HNSW_EF_CONSTRUCTION": 200}';

Comparison

Feature	Brute Force	HNSW (MyVector)
Recall	✅ 100%	🔁 ~90–99%
Latency (1M rows)	❌ 100–800ms+	✅ ~5–20ms
Indexing	❌ None	✅ Required
Filtering Support	✅ Yes	✅ Yes
Ideal Use Case	Small datasets	Production search

5. Scoped Search as a Security Boundary

Because MyVector supports native SQL filtering, you can enforce access boundaries without separate vector security layers.

Patterns:

WHERE user_id = ? → personal search
WHERE org_id = ? → tenant isolation
Use views or stored procedures to enforce access policies

You don’t need to bolt access control onto your search engine — MySQL already knows your users.

6. HNSW Tuning for Performance

MyVector lets you tune index behavior at build or runtime:

Param	Purpose	Effect
M	Graph connectivity	Higher = more accuracy + RAM
ef_search	Traversal breadth during queries	Higher = better recall, more latency
ef_construction	Index quality at build time	Affects accuracy and build cost

Example:

ALTER INDEX idx_vec SET HNSW_M = 32, HNSW_EF_SEARCH = 100;

You can also control ef_search per session or per query soon (planned feature).

TL;DR: Production Patterns with MyVector

Use VECTOR(n) columns and HNSW indexing for fast ANN search
Embed externally using HuggingFace, OpenAI, Cohere, etc.
Combine text filtering + vector ranking for hybrid search
Use SQL filtering to scope vector search for performance and privacy
Tune ef_search and M to control latency vs. accuracy

Coming Up in Part 3

In Part 3, we’ll explore real-world implementations:

Semantic search
Real-time document recall
Chat message memory + re-ranking
Integrating MyVector into RAG and AI workflows

We’ll also show query plans and explain fallbacks when HNSW is disabled or brute-force is needed.

Business Life Coaching

Non-technical blog for technical people…

Tag Archives: css

Scoped Vector Search with the MyVector Plugin for MySQL – Part II