
Scoped Vector Search with the MyVector Plugin for MySQL – Part II

Subtitle: Schema design, embedding workflows, hybrid search, and performance tradeoffs explained.



Quick Recap from Part 1

In Part 1, we introduced the MyVector plugin — a native extension that brings vector embeddings and HNSW-based approximate nearest neighbor (ANN) search into MySQL. We covered how MyVector supports scoped queries (e.g., WHERE user_id = X) to ensure that semantic search remains relevant, performant, and secure in real-world multi-tenant applications.

Now in Part 2, we move from concept to implementation:

  • How to store and index embeddings
  • How to design embedding workflows
  • How hybrid (vector + keyword) search works
  • How HNSW compares to brute-force search
  • How to tune for performance at scale

1. Schema Design for Vector Search

The first step is designing tables that support both structured and semantic data.

A typical schema looks like:

CREATE TABLE documents (
    id BIGINT PRIMARY KEY,
    user_id INT NOT NULL,
    title TEXT,
    body TEXT,
    embedding VECTOR(384),
    INDEX(embedding) VECTOR
);

Design tips:

  • Use VECTOR(n) to store dense embeddings (e.g., 384-dim for MiniLM).
  • Always combine vector queries with SQL filtering (WHERE user_id = …, category = …) to scope the search space.
  • Use TEXT or JSON fields for hybrid or metadata-driven filtering.
  • Consider separating raw text from embedding storage for cleaner pipelines.
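The queries in this post pass vectors as bracketed string literals such as '[query_vector]'. A minimal sketch of producing that literal from a Python list, with a dimension check against the VECTOR(384) column. The helper name `to_vector_literal` is hypothetical, and the JSON-style text format is an assumption based on the examples here, not a confirmed plugin format:

```python
import json
import math

EMBEDDING_DIM = 384  # matches VECTOR(384) in the documents table


def to_vector_literal(vec, dim=EMBEDDING_DIM):
    """Serialize a sequence of numbers as a '[v1, v2, ...]' string.

    Assumption: the plugin accepts a bracketed, comma-separated list
    of floats. Check the plugin documentation for the exact format.
    """
    values = [float(v) for v in vec]
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return json.dumps(values)
```

Rejecting wrong-sized or non-finite vectors in the application keeps bad rows out of the table before they ever reach the index.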

2. Embedding Pipelines: Where and When to Embed

MyVector doesn’t generate embeddings — it stores and indexes them. You’ll need to decide how embeddings are generated and updated:

a. Offline (batch) embedding

  • Run scheduled jobs (e.g., nightly) to embed new rows.
  • Suitable for static content (documents, articles).
  • Can be run using Python + HuggingFace, OpenAI, etc.
# Python example: embed text with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors
vectors = model.encode(["Your text goes here"])  # array of shape (1, 384)
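A scheduled job usually embeds many rows at once, so it helps to process them in fixed-size batches. The sketch below is one way to structure that loop. The names `batched` and `embed_pending` are hypothetical, and `embed_fn` stands in for any function that maps a list of strings to a list of vectors (for example `model.encode` from the snippet above):

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def embed_pending(rows, embed_fn, batch_size=64):
    """Embed (id, text) rows in batches and return (id, vector) pairs.

    `embed_fn` takes a list of strings and returns one vector per string.
    """
    results = []
    for chunk in batched(rows, batch_size):
        ids = [row_id for row_id, _ in chunk]
        vectors = embed_fn([text for _, text in chunk])
        results.extend(zip(ids, vectors))
    return results
```

The returned pairs can then be written back with ordinary UPDATE statements. Selecting only rows whose embedding is still NULL keeps the nightly job incremental.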

b. Write-time embedding

  • Embed text when inserted via your application.
  • Ensures embeddings are available immediately.
  • Good for chat apps, support tickets, and notes.
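A minimal sketch of write-time embedding, where the application computes the vector and inserts it in the same statement as the text. It assumes a DB-API cursor with the %s paramstyle (as in PyMySQL or mysql-connector-python) and that the plugin accepts the vector as a JSON-style string. Both the function name `insert_document` and that input format are assumptions:

```python
import json

INSERT_SQL = (
    "INSERT INTO documents (id, user_id, title, body, embedding) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def insert_document(cursor, embed_fn, doc_id, user_id, title, body):
    """Embed `body` and insert the row in one step.

    `embed_fn` maps a list of strings to a list of vectors.
    Passing the vector as a JSON-style string is an assumption
    about the plugin's input format.
    """
    vector = embed_fn([body])[0]
    literal = json.dumps([float(v) for v in vector])
    cursor.execute(INSERT_SQL, (doc_id, user_id, title, body, literal))
```

The tradeoff is that model latency is added to every write, so this pattern fits short texts better than long documents.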

c. Query-time embedding

  • Used for user search input only.
  • Transforms search terms into vectors (not stored).
  • Passed into queries like:
ORDER BY L2_DISTANCE(embedding, '[query_vector]') ASC
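The query vector has to come from the same model that produced the stored embeddings, otherwise the distances are meaningless. A sketch of building the scoped query at search time follows. `build_scoped_search` is a hypothetical helper, `embed_fn` is whatever embedding function the application uses, and the JSON-style vector string is an assumed input format:

```python
import json


def build_scoped_search(embed_fn, user_id, search_text, limit=5):
    """Return (sql, params) for a user-scoped nearest-neighbor query.

    The search text is embedded on the fly and never stored.
    """
    query_vector = embed_fn([search_text])[0]
    sql = (
        "SELECT id, title FROM documents "
        "WHERE user_id = %s "
        "ORDER BY L2_DISTANCE(embedding, %s) ASC "
        "LIMIT %s"
    )
    params = (user_id, json.dumps([float(v) for v in query_vector]), int(limit))
    return sql, params
```

Returning the statement and parameters separately keeps user input out of the SQL text, so the driver handles quoting.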

3. Hybrid Search: Combine Text and Semantics

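Most real-world search stacks benefit from combining keyword and vector search. MyVector enables this inside a single query (the MATCH ... AGAINST clause requires a FULLTEXT index on title and body, which the schema above would need to add):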

SELECT id, title
FROM documents
WHERE MATCH(title, body) AGAINST('project deadline')
  AND user_id = 42
ORDER BY L2_DISTANCE(embedding, '[query_vector]') ASC  -- embedding of 'deadline next week', computed in the application
LIMIT 5;

This lets you:

  • Narrow results using lexical filters
  • Re-rank them semantically
  • Do both in MySQL, with no sync to external vector DBs

This hybrid model is ideal for support systems, chatbots, documentation search, and QA systems.
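To make the filter-then-rank idea concrete, here is a small in-memory sketch of the same two steps. It is an illustration of the logic only, not how MySQL executes the query: substring matching stands in for full-text search, and the function names are hypothetical:

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_search(docs, keywords, query_vector, limit=5):
    """Keep docs containing any keyword, then rank by vector distance.

    `docs` is a list of dicts with 'id', 'text', and 'embedding' keys.
    Substring matching here is a stand-in for MySQL full-text search.
    """
    terms = [k.lower() for k in keywords]
    candidates = [d for d in docs if any(t in d["text"].lower() for t in terms)]
    candidates.sort(key=lambda d: l2_distance(d["embedding"], query_vector))
    return [d["id"] for d in candidates[:limit]]
```

Note that the lexical step is a hard filter: a document that is semantically close but contains none of the keywords never reaches the ranking step.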


4. Brute-Force vs. HNSW Indexing in MyVector

For similarity search, the search method determines how query latency grows as the table grows.

Brute-force search

  • Compares the query against every row
  • Guarantees exact results (100% recall)
  • Simple but slow for >10K rows
SELECT id
FROM documents
ORDER BY COSINE_DISTANCE(embedding, '[query_vector]') ASC
LIMIT 5;
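What the brute-force query does can be sketched in a few lines of plain Python: compute the distance to every row, then keep the k smallest. The work per query grows linearly with the number of rows, which is why it stops scaling. The function names below are illustrative only:

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_top_k(rows, query_vector, k=5):
    """Scan every (id, vector) row and return the k nearest ids."""
    scored = ((cosine_distance(vec, query_vector), row_id) for row_id, vec in rows)
    return [row_id for _, row_id in heapq.nsmallest(k, scored)]
```

Because every row is scored, the result is exact. That makes a scan like this a useful ground truth when evaluating an approximate index.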

HNSW: Hierarchical Navigable Small World

  • Graph-based ANN algorithm used by MyVector
  • Fast and memory-efficient
  • High recall (~90–99%) with tunable parameters (ef_search, M)
CREATE INDEX idx_vec ON documents(embedding) VECTOR
  COMMENT='{"HNSW_M": 32, "HNSW_EF_CONSTRUCTION": 200}';

Comparison

Feature | Brute Force | HNSW (MyVector)
Recall | ✅ 100% | 🔁 ~90–99%
Latency (1M rows) | ❌ 100–800 ms+ | ✅ ~5–20 ms
Indexing | ❌ None | ✅ Required
Filtering Support | ✅ Yes | ✅ Yes
Ideal Use Case | Small datasets | Production search
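Recall figures like the ones in the comparison are measured by checking how many of the exact top-k results the approximate search also returned. A minimal sketch of that calculation, with `recall_at_k` as a hypothetical helper name:

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids is empty")
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)
```

Running this over a sample of real queries, with the brute-force result as `exact_ids` and the HNSW result as `approx_ids`, gives a recall number for your own data rather than a generic range.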

5. Scoped Search as a Security Boundary

Because MyVector supports native SQL filtering, you can enforce access boundaries without separate vector security layers.

Patterns:

  • WHERE user_id = ? → personal search
  • WHERE org_id = ? → tenant isolation
  • Use views or stored procedures to enforce access policies

You don’t need to bolt access control onto your search engine — MySQL already knows your users.
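On the application side, one way to make the scope hard to forget is to bind the tenant id once and build every search through that object. This is a sketch under assumptions: the class name is hypothetical, and the `org_id` column comes from the pattern list above rather than the schema shown earlier:

```python
class TenantScopedSearch:
    """Binds a tenant id once so every query carries the filter."""

    def __init__(self, org_id):
        if org_id is None:
            raise PermissionError("search requires an authenticated tenant")
        self._org_id = org_id

    def nearest(self, vector_literal, limit=5):
        """Return (sql, params); org_id always comes from this object."""
        sql = (
            "SELECT id, title FROM documents "
            "WHERE org_id = %s "
            "ORDER BY L2_DISTANCE(embedding, %s) ASC "
            "LIMIT %s"
        )
        return sql, (self._org_id, vector_literal, int(limit))
```

Callers supply only the query vector, so there is no code path that runs a vector search without the tenant filter. A view or stored procedure can enforce the same rule inside the database.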


6. HNSW Tuning for Performance

MyVector lets you tune index behavior at build or runtime:

Param | Purpose | Effect
M | Graph connectivity | Higher = more accuracy + RAM
ef_search | Traversal breadth during queries | Higher = better recall, more latency
ef_construction | Index quality at build time | Affects accuracy and build cost

Example:

ALTER INDEX idx_vec SET HNSW_M = 32, HNSW_EF_SEARCH = 100;

Per-session and per-query control of ef_search is a planned feature.
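Since higher ef_search buys recall at the cost of latency, a practical approach is to benchmark a few values and take the smallest one that meets a recall target. The sketch below shows only the selection step. `pick_ef_search` is a hypothetical helper, and the measurements are assumed to come from your own benchmark runs:

```python
def pick_ef_search(measurements, target_recall):
    """Return the smallest ef_search whose measured recall meets the target.

    `measurements` maps ef_search -> (recall, latency_ms), gathered by
    benchmarking your own data. Returns None if no setting qualifies.
    """
    for ef in sorted(measurements):
        recall, _latency_ms = measurements[ef]
        if recall >= target_recall:
            return ef
    return None
```

If nothing meets the target, that usually points at the build-time parameters (M, ef_construction) rather than the query-time one.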


TL;DR: Production Patterns with MyVector

  • Use VECTOR(n) columns and HNSW indexing for fast ANN search
  • Embed externally using HuggingFace, OpenAI, Cohere, etc.
  • Combine text filtering + vector ranking for hybrid search
  • Use SQL filtering to scope vector search for performance and privacy
  • Tune ef_search and M to control latency vs. accuracy

Coming Up in Part 3

In Part 3, we’ll explore real-world implementations:

  • Semantic search
  • Real-time document recall
  • Chat message memory + re-ranking
  • Integrating MyVector into RAG and AI workflows

We’ll also show query plans and explain fallbacks when HNSW is disabled or brute-force is needed.


From MySQL to Oracle ACE Pro: A Milestone in My Database Journey

I’m incredibly honored to share some exciting news—I’ve been recognized as an Oracle ACE Pro by Oracle!

This recognition is deeply meaningful to me, not just as a personal milestone but as a reflection of the ongoing work I’ve poured into the database community for over three decades. It’s also a reminder of how powerful open collaboration, curiosity, and mentorship can be in shaping both a career and a community.

What Is the Oracle ACE Program?

For those unfamiliar, the Oracle ACE Program recognizes individuals who are not only technically skilled but also passionate about sharing their knowledge with the wider community. It celebrates those who contribute through blogging, speaking, writing, mentoring, and engaging in forums or user groups.

The program has multiple tiers: ACE Associate, Oracle ACE, ACE Pro, and ACE Director. Each level reflects a growing commitment to community contribution and leadership. Being named an Oracle ACE Pro places me among a diverse, global group of technologists who are actively shaping the future of Oracle technologies—and open-source ecosystems alongside them.

From MySQL to ACE: A Journey Rooted in Community

My journey with data began over three decades ago, and it’s taken me across continents, companies, and countless events. My early days were steeped in MySQL—performance tuning, operations, scaling architectures—and I quickly discovered that the greatest impact didn’t come from just solving problems, but from sharing the solutions.

Since then, my path has included global roles in consulting, support, and engineering leadership. I’ve had the opportunity to speak at international conferences, publish books like the MySQL Cookbook (4th Edition), and contribute to countless community efforts in the MySQL and open-source database ecosystems.

Recognitions such as Most Influential in the Database Community (Redgate 100) and MySQL Rockstar have meant a lot, but being named an Oracle ACE Pro is especially meaningful. It represents a bridge between the worlds of open source and enterprise and affirms that collaboration across ecosystems is not only possible but essential.

What This Recognition Means to Me

This isn’t just about a title or a badge. To me, becoming an Oracle ACE Pro is about continuing the mission—to share what I’ve learned, amplify others doing amazing work, and give back to the communities that have shaped my path.

I’ve always believed that technical excellence must go hand in hand with generosity. Whether it’s mentoring a young DBA, helping a team scale their architecture, or writing about real-world database design challenges, the point has never been visibility—it’s always been about value.

And that’s what this recognition reflects: not just what I’ve done, but what I hope to keep doing for the next generation of data professionals.

Looking Ahead

This milestone energizes me even more to keep contributing—not just within the Oracle ecosystem but across the open-source database space. I’ll continue speaking at events, writing, mentoring, and building resources that help engineers build better, faster, and more resilient systems.

I’m also excited about promoting hybrid data architectures combining MySQL, open-source, and cloud-native technologies. This is where the industry is heading, and I’m committed to helping folks navigate that evolving landscape with clarity and confidence.

Gratitude and Community

I want to thank Oracle for running a program that recognizes not only technical contributions but also community-driven spirit. And a heartfelt thank you to the MySQL community, open-source contributors, and peers I’ve had the privilege of working alongside over the years.

You’ve all helped shape my thinking, my work, and my growth. I stand on the shoulders of a global community, and this milestone belongs to all of us.

Let’s Stay Connected

If you’re building something, learning something, or just curious about databases, I’d love to hear from you. Whether it’s MySQL performance, open-source design, or data architecture strategy, reach out. Let’s keep learning, building, and sharing together.

And if you’re interested in becoming part of the Oracle ACE community, feel free to ping me. I’m always happy to share what I’ve learned and help others navigate that journey.


A Note About What’s Coming

As part of my role and responsibilities as an Oracle ACE Pro, I’ll be launching a new series of technical blog posts in the coming months. These will explore cutting-edge topics including:

  • AI/ML and LLMs (Large Language Models)
  • Vector Search and database integration
  • Real-world use cases at the intersection of AI and relational databases

These areas are rapidly evolving, and I’m excited to share practical, hands-on insights on how they tie into modern data architecture—especially within the Oracle and open-source ecosystems.

Disclaimer: The views and opinions I’ll be sharing in upcoming posts are my own and do not necessarily reflect those of Oracle or any other organization. Content will be independent, community-driven, and based on real-world experience.

Stay tuned—and if you have specific questions or topics you’d like to see covered, feel free to reach out!

Thanks for reading—and here’s to the next chapter in our database story.