Super-charged Searching (Made Simple)

The face of the IT industry has changed, and with it, so has how we think about our software, our infrastructure, orchestration - and yes - even our data. For the last few decades, database technologies primarily focused on querying data using what could be seen as "traditional" search methods, i.e., methods that are based on predefined relationships between entities, matching values or keyword association. These traditional search methods work for simple use cases, but not for situations where a more nuanced approach is required. Consider applications that work with large sets of unstructured data, applications that make use of Natural Language Processing, recommendation engines making choices about what a user might be interested in based on existing parameters, plagiarism detection software, and a myriad of other use cases.

What is vector search?

Enter vector search, which uses the power of representing data as mathematical data points in a high-dimensional vector space. If these terms are somewhat unfamiliar to you, here is the simple version. Imagine a collection of articles, like our Synatic Engineering blog. The articles all have characteristics. The topics of the articles may be different, but some articles may cover similar broader fields. Some will overlap with others in terms of content, both literally (i.e., similar words are used) and semantically (i.e., similar meaning is conveyed). Now imagine that each of these articles is interpreted and all of those characteristics and semantics are digested into a large array of numbers representing the characteristics of the article and how they score against a set of metrics.

Here is where the "high-dimensional" part (and a little abstract thinking) applies. It may be a bit difficult to imagine, but if you consider that each dimension represents a characteristic and each numerical value represents how the article “scores” against this characteristic, rationalising the concept becomes more straightforward. The how of this concept is not something that we as humans need to be concerned about, since plenty of work has been done around generating these vectors using pre-trained models, which can process bodies of text or even images into an embedding.

An embedding is a way of representing high-dimensional information in a relatively low-dimensional vector.

Once we have generated the embeddings for our dataset, we can use different algorithms to calculate any similarities within these embeddings. This calculation will output the closest matching results based on a given input.
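As a concrete (if toy-sized) illustration, here is what one such similarity calculation, cosine similarity, looks like for a few small vectors. Real embeddings have hundreds of dimensions and these three-dimensional "articles" are invented for the example, but the arithmetic is identical:

```javascript
// Cosine similarity: the dot product of two vectors divided by the product of
// their magnitudes. Returns a value between -1 and 1, where values near 1 mean
// the vectors point in nearly the same direction (i.e., are very similar).
function cosineSimilarity(a, b) {
    let dot = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Two "articles" with similar characteristic scores, and one outlier
const articleA = [0.9, 0.1, 0.8];
const articleB = [0.8, 0.2, 0.9];
const articleC = [0.1, 0.9, 0.05];

console.log(cosineSimilarity(articleA, articleB)); // close to 1 - very similar
console.log(cosineSimilarity(articleA, articleC)); // much lower - dissimilar
```

This is conceptually what happens under the hood when we query our embeddings later on, just at a far larger scale.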

Implementing a vector search by leveraging MongoDB and the Azure OpenAI service

Now for a practical example. Firstly, we need a dataset. We chose to use a set of book data, which provides a list of 16000+ book titles and, most importantly, their summaries. Perfect! Now on to the fun part. We need to build an application that will allow us to use a vector search to recommend books with similar synopses. We used Node.js, but the process is the same for any other language that has MongoDB driver support. We can install the dependencies that we need for our application from npm, i.e., @azure/openai and mongodb.

We now need to store this data, along with the embeddings generated for each book synopsis, in a MongoDB collection. The dataset is kept in a text file, so we need to do some pre-processing to read the book data and map the books to records with a title, author, date of publication and synopsis:

// process.js
const fs = require('fs');
const path = require('path');

const fileName = 'books.txt';
const filePath = path.join(__dirname, fileName);

module.exports = function () {
    try {
        const data = fs.readFileSync(filePath, 'utf8');
        // Filter out any empty lines (e.g. a trailing newline) so we don't produce junk records
        const books = data.split('\n').filter((line) => line.trim() !== '');
        return books.map((book) => {
            const fields = book.split('\t');
            return {
                // Extract only the fields we need
                title: fields[2],
                author: fields[3],
                date_of_publication: fields[4],
                synopsis: fields[6],
            };
        });
    } catch (err) {
        console.error(err);
    }
};

Generating embeddings with OpenAI

Next, we will use Azure’s OpenAI service and its text-embedding-ada-002 model, which specializes in creating embeddings from text input. We opted for this approach as Azure’s OpenAI service provides a simple, effective and affordable method of integrating a service that can be used to quickly generate embeddings on the fly into an application. If you are looking at implementing vector search yourself, and you’d prefer using a more manual approach, feel free to swap this out for another text embedding model from a platform like Hugging Face. If you would like to try text-embedding-ada-002 out for yourself, check out Azure’s guide on embeddings.

Here’s how we generated embeddings for the book dataset using text-embedding-ada-002:

// openai.js
const {OpenAIClient, AzureKeyCredential} = require('@azure/openai');
const {OPENAI_URL, OPENAI_KEY, EMBEDDINGS_MODEL} = process.env;

const openAiClient = new OpenAIClient(OPENAI_URL, new AzureKeyCredential(OPENAI_KEY));

module.exports = async function getEmbeddings(inputTextArray) {
    try {
        return (await openAiClient.getEmbeddings(EMBEDDINGS_MODEL, inputTextArray));
    } catch (err) {
        console.error(err);
    }
}

As you can see, the process is quite simple. We set up our client and then use it to generate embeddings for an array of text. Even though we use an array, there is a limit on how many tokens our model can process at the same time. So to be safe, we will only generate embeddings for one book at a time. Furthermore, some of our synopses exceed this token limit, so we will skip these records to avoid any headaches:

// saveBooks.js
const getBooks = require('./process');
const getEmbeddings = require('./openai');

async function saveBooks() {
    const books = getBooks();
    for (const book of books) {
        // Skip the book if the synopsis is too long or missing
        if (!book.synopsis || countTokens(book.synopsis) > 8191) {
            continue;
        }
        // Note that we only want to pass an array of book synopses to our model, NOT the whole book record
        const embeddingsResponse = await getEmbeddings([book.synopsis]);
        book.embedding = embeddingsResponse.data[0].embedding;
    }
}

// A rough token count: splitting on whitespace underestimates the model's
// actual tokenizer, but it is close enough for a simple safety check
function countTokens(inputString) {
    const tokens = inputString.split(/\s+/);
    const nonEmptyTokens = tokens.filter((token) => token.trim() !== '');
    return nonEmptyTokens.length;
}

saveBooks();

If you test this code out, you will notice that our book records now look something like this:

{
  title: 'Animal Farm',
  author: 'George Orwell',
  date_of_publication: '1945-08-17',
  synopsis: `<a lengthy description of the book's plot>`,
  embedding: [
     -0.004395536,   -0.02512478,   -0.020391125,   -0.008218872,
     // . . . and 1532 similar numbers
  ]
}

Our embedding vector has a length of 1536! We captured a lot of data about each book's synopsis, but keep in mind that no matter how much text we process with text-embedding-ada-002, we still obtain a vector of 1536 floating point numbers. This value is the number of dimensions that text-embedding-ada-002 uses to process semantic information from our text. A consistent vector size is needed to compare our embeddings as you will see in the next step.

Setting up a vector search in MongoDB Atlas

Once we have generated our embeddings, we simply save the book in our MongoDB books collection. At this stage, it is worth noting that MongoDB provides vector search capability as part of its Atlas Search feature set. You need a few prerequisites to leverage this capability:

  1. An Atlas cluster running an M10 machine (or higher) to house your database.
  2. An appropriate Atlas Search Index on the collection in question. The index we create should follow the schema described in the MongoDB vector search documentation.

We need to make a key decision here, i.e., what algorithm will be used to compare our vectors? Euclidean distance could be used if we wanted to find similarity in the literal text of our corpus, which is interesting, but not as well suited to our use case as cosine similarity, which surfaces results with similar semantic meaning.
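To make that difference concrete, here is a small sketch using our own toy implementations of both metrics (not library code). Two vectors pointing in exactly the same direction are a perfect match under cosine similarity, even when Euclidean distance reports them as far apart:

```javascript
// Straight-line distance between two points in n-dimensional space
function euclideanDistance(a, b) {
    return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Angle-based similarity: ignores magnitude, only compares direction
function cosineSimilarity(a, b) {
    const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
    const mag = (v) => Math.sqrt(v.reduce((sum, vi) => sum + vi * vi, 0));
    return dot / (mag(a) * mag(b));
}

// Same direction, different magnitudes
const short = [1, 2];
const long = [10, 20];

console.log(euclideanDistance(short, long)); // ≈ 20.12 - "far apart"
console.log(cosineSimilarity(short, long));  // 1 - a perfect directional match
```

This is why cosine similarity tends to suit semantic matching: it cares about the pattern of characteristic scores rather than their overall scale.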

With all that in mind, our index definition (which we will create under the name vectorSearchIndex) should look like this:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "embedding": {
        "type": "knnVector",     // Tells Atlas what type of search to use with this index
        "dimensions": 1536,      // Note that our embedding vector length is set here
        "similarity": "cosine"   // Give us that semantic similarity!
      }
    }
  }
}

We now call the relevant MongoDB operations:

// mongo.js
const {MongoClient} = require('mongodb');
const {DB_CONNECTION} = process.env; //mongodb+srv://<db-username>:<db-password>@<db-uri>.mongodb.net/?retryWrites=true&w=majority

let client;

async function getClient() {
    if (client) return client;
    try {
        client = await MongoClient.connect(DB_CONNECTION);
        return client;
    } catch (err) {
        console.error(err);
    }
}

async function storeBook(book) {
    try {
        const client = await getClient();
        await client.db('book-test').collection('books').insertOne(book);
    } catch (err) {
        console.error(err);
    }
}

async function createSearchIndex() {
    const index = {
        mappings: {
            dynamic: false,
            fields: {
                embedding: {
                    type: 'knnVector',
                    dimensions: 1536,
                    similarity: 'cosine',
                }
            }
        }
    }

    try {
        const client = await getClient();
        await client.db('book-test').collection('books').createSearchIndex({name: 'vectorSearchIndex', definition: index});
    } catch (err) {
        console.error(err);
    }
}

module.exports = {
    storeBook,
    createSearchIndex,
}

Let’s update our book saving code:

// saveBooks.js
const getBooks = require('./process');
const getEmbeddings = require('./openai');
const {storeBook, createSearchIndex} = require('./mongo');

async function saveBooks() {
    const books = getBooks();
    for (const book of books) {
        // Skip the book if the synopsis is too long or missing
        if (!book.synopsis || countTokens(book.synopsis) > 8191) {
            continue;
        }
        // Note that we only want to pass an array of book synopses to our model, NOT the whole book record
        const embeddingsResponse = await getEmbeddings([book.synopsis]);
        book.embedding = embeddingsResponse.data[0].embedding;
        await storeBook(book);
    }
    await createSearchIndex();
}

// A rough token count: splitting on whitespace underestimates the model's
// actual tokenizer, but it is close enough for a simple safety check
function countTokens(inputString) {
    const tokens = inputString.split(/\s+/);
    const nonEmptyTokens = tokens.filter((token) => token.trim() !== '');
    return nonEmptyTokens.length;
}

saveBooks();

We can now run this code, and after a short wait, all our books will be saved to MongoDB with our embedding vectors. That's all the groundwork for our application done.

Querying our dataset

The only work left to do is to write some code that can actually perform our vector search query. Let’s extend our MongoDB code with a new function that looks like this:

// mongo.js
async function vectorSearch(inputEmbeddings) {
    const pipeline = [
        {
            // $vectorSearch must be the first stage of the pipeline
            $vectorSearch: {
                queryVector: inputEmbeddings, // This is our input vector
                path: 'embedding',            // The path to the embeddings on our records
                numCandidates: 1000,          // We instruct the search algorithm to only consider the 1000 closest candidates
                limit: 5,                     // We will limit our final results to the top 5
                index: 'vectorSearchIndex',   // The name of our Atlas Search Index
            },
        },
        {
            $project: {
                // We get the title, synopsis and the special vectorSearchScore meta property
                title: 1,
                synopsis: 1,
                match_score: {
                    $meta: 'vectorSearchScore', // Special aggregation metadata property
                },
            },
        },
    ];

    try {
        const client = await getClient();
        return await client.db('book-test').collection('books').aggregate(pipeline);
    } catch (err) {
        console.error(err);
    }
}

At this point, MongoDB really starts to shine. We are able to perform our vector search as part of a standard aggregation pipeline. This ability may not seem that remarkable at first, but aggregation pipelines are one of the most flexible and powerful features MongoDB offers, and being able to tap into this feature with more diverse search capability is an exciting prospect. If we, for example, wanted to create another search type where we only considered books by a given author, we could add a filter option to our $vectorSearch stage (which must remain the first stage of the pipeline), or append a $match stage after it to trim the results. Continuing on that thought trajectory, you should begin to understand the true power of vector search within MongoDB - combining your traditional database with your vector database. All your data, in one place, with a highly dynamic method of querying it...
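As an illustrative sketch of that filtering idea, here is what an author-filtered version of our pipeline might look like, using the filter option that the $vectorSearch stage supports. Note the assumption: the author field would need to be indexed for filtering, an extra index setup step not shown in our example, and buildFilteredPipeline is a hypothetical helper, not part of our application code:

```javascript
// Hypothetical helper: builds a pipeline that restricts the vector search
// to a single author before scoring candidates. Assumes the author field
// has been indexed for filtering in Atlas.
function buildFilteredPipeline(author, inputEmbeddings) {
    return [
        {
            $vectorSearch: {
                queryVector: inputEmbeddings,
                path: 'embedding',
                numCandidates: 1000,
                limit: 5,
                index: 'vectorSearchIndex',
                filter: { author: author }, // Only consider candidates by this author
            },
        },
        {
            $project: {
                title: 1,
                synopsis: 1,
                match_score: { $meta: 'vectorSearchScore' },
            },
        },
    ];
}

console.log(JSON.stringify(buildFilteredPipeline('George Orwell', [0.1, 0.2]), null, 2));
```

The same shape extends to any other pre-filterable field, such as date_of_publication.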

There are two notes to highlight in the vectorSearch function above. Firstly, we need some background on the numCandidates property in the first stage of our aggregation pipeline. You may assume that the smaller this number is, the more accurate our search will be, but this is a misconception. While this property is used to narrow our search’s field of vision to the n closest candidates, because of MongoDB’s approximate implementation of vector search, too low a value of n can cause muddier search results. We recommend keeping this number reasonably high and adjusting the limit property accordingly. Secondly, you may have noticed the $meta property in our $project stage. This special property allows us to access metadata about our aggregation, in this case a score of how closely each book synopsis matches our query vector.

With that function implemented, the only task left is to test it out. Let’s write some code to query our books collection for some epic high fantasy reading:

// query.js
const getEmbeddings = require('./openai');
const {vectorSearch} = require('./mongo');

const query = 'A fantasy story involving epic battles and magic set in a mystical and wondrous world.';

async function findRecommendations() {
    const inputEmbeddings = await getEmbeddings([query]);
    const cursor = await vectorSearch(inputEmbeddings.data[0].embedding);
    const results = await cursor.toArray();
    console.log(results);
}

findRecommendations();

Our recommendation application outputs these results:

[
  {
    _id: new ObjectId('655b494cbee6ec2c48f4c1a7'),
    title: 'Kinsmen of the Dragon',
    synopsis: 'The novel concerns an empire of invisible wizards and adventure in the realm of Annwyn.',
    match_score: 0.9316140413284302
  },
  {
    _id: new ObjectId('655b49bdbee6ec2c48f4c21f'),
    title: 'The Temple of the Ten',
    synopsis: 'The novel adventures in the realms of Prester John.',
    match_score: 0.926908016204834
  },
  {
    _id: new ObjectId('655b4a84bee6ec2c48f4c2f4'),
    title: 'The Magician Out of Manchuria',
    synopsis: 'The novel concerns the adventures of a hero who encounters a queen with remarkable talents.',
    match_score: 0.926051139831543
  },
  {
    _id: new ObjectId('655b2aa4bee6ec2c48f4a0b3'),
    title: 'Arrowsmith',
    synopsis: 'The series is set in an alternate history Earth in which the United States of America is actually the United States of Columbia, magic is real, and the First World War is fought with and by dragons, spells, vampires and all other kinds of magical weapons and beings. The story follows the protagonist, Fletcher Arrowsmith, as he joins the war effort on the side of the Allies, gets taught the rudiments of sorcery and engages in some brutal battles with the enemy Prussians.',
    match_score: 0.9236781597137451
  },
  {
    _id: new ObjectId('655b4144bee6ec2c48f4b918'),
    title: 'The Debt Collector',
    synopsis: 'An immortal beauty makes a bargain with a dying hard edged despot. He enters her service and learns of worlds he never knew existed. She guides him through many journeys, where he encounters strange and powerful creatures. He is never sure of her motives and she is never certain he can be trusted. Together they face perils and intrigues and learn each other’s deepest secrets. This emotionally powerful story grabs your attention and never lets go. The story is written in first person as a retrospective. The setting is a mythological world. The novel is 454 pages divided into two volumes.',    
    match_score: 0.9235432147979736
  }
]

Our match_score property provides a good indication of how closely a particular record matches our query. As you can see, the records that are returned are automatically sorted in descending order of match_score. Next, let’s look for some crime novels using a more specific query. We’ll change the query in our code to: A gritty, serious, and mysterious criminal drama with themes surrounding crime and punishment, set in a big city environment, and centering around a character with a troubled past.

Our new result set now looks like:

[
  {
    _id: new ObjectId('655b4620bee6ec2c48f4be52'),
    title: 'The Smack Man',
    synopsis: "The novel focuses on super-cop Joe Ryker's attempt to stop a murderer from poisoning illegal drugs coming into New York City.",
    match_score: 0.9175916910171509
  },
  {
    _id: new ObjectId('655b23d8bee6ec2c48f49971'),
    title: 'Firewall',
    synopsis: `A series of bizarre incidents sweep across Sweden: a man dies in front of an ATM, two young women slaughter an elderly taxi driver, a murder is committed aboard a Baltic Sea ferry, and a sub-station engineer makes a gruesome discovery while investigating the cause of a nationwide power cut. As Wallander investigates, he uncovers a sinister plan to bring the Western world to its knees. The major background theme around which the action takes place is the dilemma of the Western economic system versus poverty. The criminal mastermind is a persuasive and talented IT specialist who plans to right the wrongs of the world by "deleting" vast quantities of money from multinational banks' accounts system, so bringing on a credit and financial panic. The criminals believe their intended cybercrime is justified; for them the "big picture" involves the sacrifice of the banking system in order to wipe out third world debt. At a crucial moment Wallander unwittingly manages to persuade a key accomplice that, ethically, there is in fact no "big picture," that instead we just have lives that are fragile but also "miraculous". That this major issue of our times should feature in a detective novel shows that it is not merely about detection, yet Wallander's answer just repeats the very old idea of caring for one's proximate neighbours in the here and now.`,
    match_score: 0.9174903631210327
  },
  {
    _id: new ObjectId('655b35bdbee6ec2c48f4ac93'),
    title: 'The Twenty-Seventh City',
    synopsis: 'A complex, partly satirical thriller that studies a family unraveling under intense pressure, the novel is set amidst intricate political conspiracy and financial upheaval in St. Louis, Missouri in the year 1984.',
    match_score: 0.9168357849121094
  },
  {
    _id: new ObjectId('655b3c47bee6ec2c48f4b3a1'),
    title: 'A Touch of Frost',
    synopsis: `The murder of a local drug addict, the hunt for a serial rapist, a hit-and-run involving the spoiled son of an MP, and a robbery at a strip joint all have something in common. Detective Inspector Jack Frost has been assigned with the thankless task of investigating them. Fighting the stress and ignoring his mounting pile of paperwork, Frost soon finds himself up against the various manifestations of criminality...`,
    match_score: 0.9137168526649475
  },
  {
    _id: new ObjectId('655b2928bee6ec2c48f49f1a'),
    title: 'The Yellow Feather Mystery',
    synopsis: 'In trying to trace a missing will, detectives Frank and Joe Hardy trap a dangerous criminal who is willing to risk all--including murder--for money.',
    match_score: 0.913487434387207
  }
]

Conclusion

And there you have it: database queries super-charged with the power of machine learning, made simple with some help from MongoDB. However, vector search is useful for more than just quickly trawling a library. In the world of business, this kind of technology can make vast oceans of complex data much easier to navigate and can help decipher valuable insights that traditional queries aren’t capable of providing. At Synatic, we have found good use for vector search in our DataFix solution, which helps our clients maintain highly accurate data at scale while minimizing duplicate records.

If you would like to take a look at the code used in our example, you can find it on GitHub.