OpenAI’s Latest Patents Point Straight to Semantic SEO


Join thousands of marketers to get the best search news in under 5 minutes. Get resources, tips and more with The Splash newsletter:

A question most SEOs are hearing today: How do we show up in ChatGPT? As platforms powered by large language models become a new front in the search landscape, marketers are looking for more clarity on how these models search and present information. Traditional ranking factors may play a part, but it’s clear that AI platforms approach information retrieval, scoring, and presentation differently than what we’re used to seeing over the past two decades.

When we want to understand how a technology works, we turn to patents — a practice we’ve followed since Bill Slawski became part of Go Fish Digital in 2013. OpenAI’s latest patent filings reveal the underlying retrieval mechanisms shaping AI-powered discovery. And like the Google patents we’ve dissected for years, these documents offer practical insight into how content can be seen, surfaced, and served in tomorrow’s LLM-first search engines.

Vector Embeddings: The New “Index”

OpenAI’s search patents show a shift from keyword-based retrieval to vector-based matching.

Modern search isn’t limited to keyword matching, and it hasn’t been for years. Google helped pioneer the use of vector embeddings at scale when it introduced BERT in 2018 (at the time, it outperformed the human baseline on some benchmarks by just 2%), which allowed the search engine to better understand the relationships between words in a query and the context of those words in a document.

However, many believe that BERT is layered on top of Google’s existing inverted index structure, meaning that content still needs to be discoverable through keyword-based crawling and indexing first. This is primarily because of the way people have historically performed searches. That started to change with ChatGPT and conversational research.

OpenAI’s architecture leans more heavily on vector embeddings. Instead of relying on keyword lookups, ChatGPT converts chunks of content into high-dimensional vectors and stores them in a vector database. This includes not just external sources, but also your prior conversations — enabling a form of personalized search. When you submit a new prompt, it’s embedded and compared against stored vectors using semantic similarity.
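The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not OpenAI’s implementation: the four-dimensional vectors and chunk labels are invented stand-ins for a real embedding model’s high-dimensional output.

```python
import numpy as np

# Toy embeddings standing in for a real embedding model's output.
# The labels and values here are illustrative only.
store = {
    "pricing page chunk": np.array([0.9, 0.1, 0.0, 0.1]),
    "blog post chunk":    np.array([0.1, 0.8, 0.2, 0.0]),
    "past conversation":  np.array([0.7, 0.2, 0.1, 0.3]),
}

def cosine(a, b):
    """Semantic similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, k=2):
    """Return the k stored chunks most similar to the query embedding."""
    ranked = sorted(store.items(), key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [label for label, _ in ranked[:k]]

query = np.array([0.8, 0.15, 0.05, 0.2])  # an embedded user prompt
print(retrieve(query))  # the two semantically closest chunks
```

Note that nothing in this lookup depends on shared keywords; the match is purely geometric, which is why a past conversation can be recalled even when its wording differs from the prompt.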

The system then retrieves the most relevant content, whether from public data via search engines like Microsoft’s Bing or your past interactions, and uses it to shape the response. This vector-based retrieval powers features like memory, where ChatGPT can recall preferences, past questions, or ongoing projects, even when keywords don’t match exactly.

The implications for SEO are practical. If your content isn’t written in a way that can be cleanly chunked as a passage, embedded and matched against a user’s intent, it won’t be retrieved. And if it’s not retrieved, it’s not part of the answer. This isn’t about ranking higher in a list of links; it’s about being selected for inclusion in a language model’s working memory. Optimizing for embeddings requires a different approach to structure, clarity, and coverage than optimizing for traditional search.
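Chunking is the step SEOs can most directly influence. Below is a simple sketch of passage-level chunking along paragraph boundaries; the word-count threshold is an assumption for illustration, not a figure from OpenAI’s pipeline.

```python
def chunk_text(text, max_words=120):
    """Split text into passage-sized chunks along paragraph boundaries --
    the kind of self-contained unit an embedding pipeline can index.
    The max_words threshold is illustrative, not a real system's setting."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = para.split()
        # Flush the current chunk when the next paragraph would overflow it.
        if count + len(words) > max_words and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "Heading paragraph.\n\n" + "\n\n".join(
    f"Paragraph {i} body text." for i in range(5))
print(chunk_text(doc, max_words=8))
```

The practical takeaway: content written in self-contained paragraphs survives this split with its meaning intact, while ideas smeared across long, run-on sections get cut into fragments that embed poorly.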

Multiple patents filed by OpenAI lay out how the company is training, storing, and retrieving content using vector embeddings across both public and private knowledge sources. These documents offer direct insight into how content is selected for inclusion in AI answers.

Here are two recent patents that make this approach clear:

  • US 20240249186 A1 – Systems and methods for using contrastive pre-training to generate text and code embeddings (Granted July 2024)

    Describes how OpenAI improves embedding quality using contrastive learning — a technique that teaches models to distinguish between related and unrelated content in vector space.

  • US 20250103962 A1 – Systems and methods for generating customized AI models (Published March 2025)

    Outlines how user-uploaded content or external data is embedded, stored in a vector database, and retrieved via semantic search — the same workflow powering personalized GPTs and likely future versions of ChatGPT’s search.

Next, we’ll take a closer look at each of these patents and what they tell us about why semantic search is the future of SEO.

[Image: AI search – vector embeddings training]

OpenAI Embedding Patent – Contrastive Embeddings for Text & Code

This patent shows how OpenAI builds cleaner, faster, and more semantically rich embeddings.

In July 2024, OpenAI was granted US 20240249186 A1, titled Systems and methods for using contrastive pre-training to generate text and code embeddings. This patent outlines a training technique where the model learns by comparing pairs of content: it brings semantically related examples closer together in vector space and pushes unrelated examples farther apart. For example:

  • When training on text, the system uses naturally occurring neighboring text, like consecutive sentences or adjacent paragraphs, as positive pairs.

  • When training on code, the model pairs a function’s top-level docstring with its actual implementation.

These examples help the model learn to align natural language with structured logic or related passages, a critical step in improving the quality of retrieval in real-world applications. This contrastive approach results in embeddings that are more precise, more efficient, and better suited for real-time retrieval.
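The contrastive objective can be sketched with a standard InfoNCE-style loss: each anchor should score highest against its own positive, with the other pairs in the batch serving as negatives. This is a generic illustration of contrastive learning, not code from the patent, and the vectors are random toy data.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Contrastive (InfoNCE-style) loss over a batch: each anchor should be
    most similar to its matched positive; other positives act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()        # diagonal = matched pairs

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
# Well-aligned positives (anchor plus tiny noise) vs. unrelated positives.
aligned_loss = info_nce_loss(anchors, anchors + 0.01 * rng.normal(size=(4, 8)))
random_loss = info_nce_loss(anchors, rng.normal(size=(4, 8)))
print(aligned_loss, random_loss)  # aligned pairs should score a lower loss
```

Training to minimize this loss is what pulls related passages together in vector space and pushes unrelated ones apart.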

The stated goal is to enhance the performance of systems that rely on these embeddings, including those that power AI-generated responses and code suggestions.

Why SEOs Should Care

  • The patent’s primary objective is high-performance similarity search, the same method used to find relevant content in response to a query or discussion.
  • Cleanly organized, well-structured text surrounded by relevant content leads to vector embeddings that are more accurately retrieved in tools like ChatGPT and other LLM-driven search systems.
  • If a ChatGPT search experience is built on embedding recall, this patent describes how ChatGPT creates vector embeddings of your content for retrieval.

OpenAI CustomGPT Patent – Custom AI Models & Built-in Semantic Search

This patent shows how OpenAI provides tools for CustomGPTs and enables retrieval across data sources for its custom models using vector embeddings.

In March 2025, OpenAI published US 20250103962 A1, titled Systems and methods for generating customized AI models. This patent outlines how users, or OpenAI itself, can create custom versions of GPT models that are connected to their own knowledge sources. The process begins by taking a dataset, splitting it into chunks, and embedding each chunk using OpenAI’s proprietary embedding models. These vectors are then stored in a vector database. This is primarily how OpenAI’s custom GPTs use the knowledge you provide when building them. In fact, the patent describes the process in the following way:

Semantic search goes beyond keyword search (which relies on the occurrence of specific index words in the search input) to find contextually relevant data based on the conceptual similarity of the input string. As a result, semantic searches of a knowledge base can provide more context to models. Semantic search may use a vector database, storing text chunks (derived from some documents) and their vectors (mathematical representations of the text). When querying a vector database, the search input (in vector form) is compared to all of the stored vectors, and the most similar text chunks are returned.

At runtime, the system takes a user prompt, embeds it, and performs a similarity search to identify the most relevant chunks. The top results are then stitched into the prompt, essentially serving as context, and passed to the language model to generate a grounded, often cited response.
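The runtime flow above — embed the prompt, find the nearest chunks, stitch them into context — can be sketched as follows. The `embed` function here is a crude letter-frequency stand-in for a real embedding model, used only to keep the example self-contained.

```python
import numpy as np

def embed(text):
    """Stand-in for a real embedding model: a deterministic toy vector from
    letter frequencies. Real systems call a learned embedding model instead."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v / (np.linalg.norm(v) or 1.0)

# A tiny "knowledge base" of pre-embedded chunks (illustrative content).
chunks = [
    "Our return policy allows refunds within 30 days.",
    "The warehouse ships orders every weekday morning.",
    "Refunds are issued to the original payment method.",
]
vectors = [embed(c) for c in chunks]

def build_prompt(user_prompt, k=2):
    """Embed the prompt, take the top-k most similar chunks, and stitch
    them into the context window passed to the language model."""
    q = embed(user_prompt)
    top = sorted(range(len(chunks)), key=lambda i: -float(q @ vectors[i]))[:k]
    context = "\n".join(chunks[i] for i in top)
    return f"Context:\n{context}\n\nQuestion: {user_prompt}"

print(build_prompt("How do refunds work?"))
```

The model never sees chunks that lose this similarity contest, which is the whole retrieval game from a content owner’s perspective.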

For example, if a business uploads internal documentation or product guides to a custom GPT, the model will semantically search across that private content using embeddings, not keywords. The patent makes it clear that this retrieval mechanism isn’t limited to user-uploaded files; it can also pull from other data sources, such as web pages or APIs, using the same embedding and vector search approach.

Why SEOs Should Care

  • The retrieval method described is textbook semantic search, and it works as well on public websites as it does on private data.
  • If OpenAI crawls and embeds your site content, the quality of those embeddings will determine whether your content is retrieved in a response.
  • This patent lays out the exact pipeline likely powering AI-based search: content is crawled, chunked, embedded, and stored, and only top-matching chunks make it into the model’s response window.
  • Your site’s visibility depends on whether your content structure, clarity, and topical depth align well with embedding-based retrieval.

How Go Fish Digital Uses Vector Embeddings To Improve Inclusion in ChatGPT

We use vector-based similarity scoring to quantify content relevance, not just guess at it.

At Go Fish Digital, we’ve developed proprietary tools to bring vector embeddings directly into our SEO workflow. One of them, a browser extension, mirrors how modern search systems like Google and ChatGPT evaluate content: it converts text into embeddings and measures semantic similarity rather than relying solely on keyword matching.

The extension works by:

  1. Prompting you to enter a target query
  2. Generating embeddings for each section of the page
  3. Calculating cosine similarity between the query and each content block
  4. Assigning a score from 0 to 10 to indicate how well each section aligns with the query
  5. Visualizing the results with a heat map to highlight optimization opportunities
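Steps 3 and 4 above can be sketched with a simple mapping from cosine similarity onto a 0–10 scale. The toy vectors and the linear mapping are assumptions for illustration; the extension’s actual scaling may differ.

```python
import numpy as np

def similarity_score(query_vec, section_vec):
    """Map cosine similarity (-1..1) onto a 0-10 alignment score.
    The linear mapping is an illustrative assumption."""
    cos = float(np.dot(query_vec, section_vec) /
                (np.linalg.norm(query_vec) * np.linalg.norm(section_vec)))
    return round((cos + 1) / 2 * 10, 1)

# Toy embeddings for a target query and two page sections.
query = np.array([1.0, 0.2, 0.0])
sections = {
    "intro":   np.array([0.9, 0.3, 0.1]),   # topically close to the query
    "sidebar": np.array([-0.2, 0.1, 1.0]),  # off-topic content
}
for name, vec in sections.items():
    print(name, similarity_score(query, vec))
```

Sections scoring near 10 are strong retrieval candidates for the query; low-scoring blocks are the heat map’s optimization opportunities.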

[Image: vector-embedding text for AI]

To support this same analysis at scale, we also use an internal tool called Barracuda, which allows us to chunk, embed, and score content across hundreds or thousands of pages. Barracuda applies the same embedding-based retrieval logic, but at the domain level, giving us insight into how entire sites perform semantically for a set of topics or intents.

Together, these tools help us align content with how LLM-powered search systems actually retrieve and interpret information, so we’re optimizing not just for keywords, but for retrieval visibility.

For more details, contact us to see how our tools can help you appear more often in AI platforms like ChatGPT.
