jesse's big blog

How To Implement Semantic Search Using OpenAI Embeddings and Elasticsearch

In this tutorial we will embed a data set using OpenAI's embeddings API, use cosine similarity to find the entries most similar to a query, and then speed up the search by indexing the data in Elasticsearch.

First we set up a virtual environment and install the dependencies:

python3 -m venv venv
source venv/bin/activate
pip install pandas numpy openai elasticsearch

Load data using pandas

First we load a CSV into a pandas DataFrame. Any CSV with a text column will do; here we use a sample of the AG News data set:

import pandas as pd
import numpy as np

df = pd.read_csv("/Users/jesseleonard/Downloads/AG_news_samples.csv")
df = df.head(1000)
>>> df.head(5)
                                               title                                        description  label_int     label
0                                    World Briefings  BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime M...          1     World
1  Nvidia Puts a Firewall on a Motherboard (PC Wo...  PC World - Upcoming chip set will include buil...          4  Sci/Tech
2                Olympic joy in Greek, Chinese press  Newspapers in Greece reflect a mixture of exhi...          2    Sports
3                          U2 Can iPod with Pictures  SAN JOSE, Calif. -- Apple Computer (Quote, Cha...          4  Sci/Tech
4                                  The Dream Factory  Any product, any shape, any size -- manufactur...          4  Sci/Tech

Apply Embeddings by Calling the OpenAI Client

Now we use the OpenAI client to embed our data with the text-embedding-3-small model. The second-to-last line applies the get_embedding function to every row in the dataframe. This takes a while, which is why we limited the dataframe to 1,000 rows above. The last line saves the dataframe to a CSV file.

from openai import OpenAI

client = OpenAI(
  # reads the API key from the OPENAI_API_KEY environment variable
  max_retries=5,
)

def get_embedding(text: str, model="text-embedding-3-small", **kwargs):
    # replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model, **kwargs)
    print("embedding received")
    return response.data[0].embedding


df["embedding"] = df["title"].apply(lambda x: get_embedding(x))
df.to_csv("word_embeddings.csv")
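Calling the API once per title means one network round trip per row. The embeddings endpoint also accepts a list of inputs, so as a sketch we could batch the requests instead (the chunks and embed_batch helpers below are my own, not part of the OpenAI SDK):

```python
def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_batch(texts, model="text-embedding-3-small", batch_size=100):
    """Embed texts one batch at a time; returns one vector per input text."""
    vectors = []
    for batch in chunks([t.replace("\n", " ") for t in texts], batch_size):
        response = client.embeddings.create(input=batch, model=model)
        vectors.extend(d.embedding for d in response.data)
    return vectors

# usage, once the client above is configured:
# df["embedding"] = embed_batch(df["title"].tolist())
```

This cuts 1,000 requests down to 10 for our dataframe.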

Create a Search Term & Embed it

Now we want to create a search term and embed it using the same model as before.

search_term = "stock market crash"
search_term_vector = get_embedding(search_term)

Prepare the data for cosine similarity

When the embeddings are written to CSV they are stored as strings, so after reloading the file we use eval plus np.array to turn each one back into a NumPy array before computing cosine similarity.

df = pd.read_csv("word_embeddings.csv")
df["embedding"] = df["embedding"].apply(eval).apply(np.array)
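To see why the eval step is needed: to_csv writes each embedding (a Python list of floats) as its string representation, so reading the CSV back gives us strings, not arrays. A minimal round trip:

```python
import numpy as np

# what pandas writes to the CSV: the string form of the Python list
stored = str([0.1, 0.2, 0.3])

# what we get back when reading the CSV, and how we recover the array
vector = np.array(eval(stored))
print(vector.dtype, vector.shape)  # -> float64 (3,)
```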

Apply cosine similarity to the search term

This computes the cosine similarity between the search term and every row in the dataframe, stores the result in a similarities column, keeps the 20 closest matches, and writes them to a CSV:

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

df["similarities"] = df["embedding"].apply(
    lambda x: cosine_similarity(x, search_term_vector)
)
df = df.sort_values("similarities", ascending=False).head(20)
df.to_csv("similarities.csv")

Our problem is that running these calculations row by row in pandas takes a long time. We can speed up the search by indexing our data in Elasticsearch.
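As an aside, much of the slowness comes from the per-row apply; stacking the embeddings into one matrix lets NumPy score every row in a single pass. A sketch with synthetic vectors standing in for the real embeddings (cosine_similarity_matrix is my own helper):

```python
import numpy as np

def cosine_similarity_matrix(matrix, query):
    """Cosine similarity between each row of `matrix` and `query`, vectorized."""
    matrix_norms = np.linalg.norm(matrix, axis=1)
    query_norm = np.linalg.norm(query)
    return matrix @ query / (matrix_norms * query_norm)

# synthetic stand-ins for the real 1536-dimensional embeddings
rng = np.random.default_rng(0)
matrix = rng.normal(size=(1000, 1536))
query = rng.normal(size=1536)

scores = cosine_similarity_matrix(matrix, query)
top_20 = np.argsort(scores)[::-1][:20]
```

On the real dataframe the matrix would be something like np.stack(df["embedding"].to_numpy()). This helps within one process, but Elasticsearch also gives us persistence and approximate nearest-neighbor indexing.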

Set up Elasticsearch with Docker

docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.2
docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:8.12.2
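On first start the container prints a generated password for the elastic user and the CA certificate fingerprint, which we need below. If you missed them in the startup logs, they can be recovered with commands along these lines (assuming the es01 container name from above):

```shell
# reset and print the elastic user's password
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic

# copy the CA cert out of the container and print its SHA-256 fingerprint
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
openssl x509 -fingerprint -sha256 -noout -in http_ca.crt
```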

Set up the Elasticsearch client

Elasticsearch 8 serves HTTPS by default, so we connect over https and verify the server with the certificate fingerprint:

from elasticsearch import Elasticsearch

password = "elasticsearch_password"
ssl_assert_fingerprint = "fingerprint"
es = Elasticsearch(
  "https://localhost:9200",
  basic_auth=("elastic", password),
  ssl_assert_fingerprint=ssl_assert_fingerprint,
)

Create an index with a mapping

Next we create an index whose mapping matches the columns of our dataframe. The embedding field is a dense_vector with 1536 dimensions (the output size of text-embedding-3-small), using cosine similarity to match the metric we used earlier:

index_mapping = {
  "properties": {
    "title": {
      "type": "text"
    },
    "description": {
      "type": "text"
    },
    "label_int": {
      "type": "long"
    },
    "label": {
      "type": "text"
    },
    "embedding": {
      "type": "dense_vector",
      "dims": 1536,
      "index": True,
      "similarity": "cosine",
    },
  }
}

es.indices.create(index="articles", mappings=index_mapping)

Populate Elasticsearch with rows from the dataframe

# reload the full embedded data set; df currently holds only the top-20 matches
df = pd.read_csv("word_embeddings.csv")
df["embedding"] = df["embedding"].apply(eval).apply(np.array)

# convert dataframe rows to dictionaries
record_list = df.to_dict("records")

for i, record in enumerate(record_list):
  # NumPy arrays are not JSON serializable, so convert each embedding back to a list
  record["embedding"] = record["embedding"].tolist()
  try:
    es.index(index="articles", document=record, id=i)
  except Exception as e:
    print(e)

Make a search query

search_term = "stock market crash"
search_term_vector = get_embedding(search_term)

query = {
  "field": "embedding",
  "query_vector": search_term_vector,
  "k": 20,
  "num_candidates": 10000,
}

res = es.search(index="articles", knn=query, source=["title", "description"])

hits = dict(res)["hits"]["hits"]
for hit in hits:
  print(hit["_score"], hit["_source"]["title"])