By “AI-first data storage”, I mean storing data in a way that lets an AI system use it. By “AI system”, I particularly mean LLMs (large language models). And when I say “data”, I mean text data.
For AI-first data storage, you typically use vector databases. Vector databases store and retrieve data as vectors, unlike traditional full-text search systems like Lucene, which store data as text.
There are a number of vector storage or database providers, including KDB.AI, LanceDB, Milvus, MyScale, Pinecone, Qdrant, Vectara, Weaviate, and Zilliz. These providers differ along several dimensions:
You can’t just insert text into a vector database. Rather, in order to store text data in a vector database, you have to convert your texts into vector representations. These vector representations are called “embeddings”. Generally, there are two important steps in this conversion process: (1) chunking your data and (2) creating the embeddings for the text chunks.
You could just take an entire text document and convert it into a vector representation. But this typically does not produce good results. For example, let’s say you want to build a RAG (retrieval-augmented generation) application. If you embed an entire text and a user asks a question, the entire text might be returned as relevant, even if only a small passage of the text is relevant. In theory, you could use an LLM to post-process the text and get only the relevant part. But this quickly becomes prohibitively slow and expensive.
A better way to prepare your data is to split your texts into parts, i.e. chunk them, and then create a separate embedding for each chunk.
You can chunk your data in many different ways. For example, you could simply create text chunks of 300 characters or words per chunk. But while this is simple, it could bulldoze across the structure of your text. For instance, it would not distinguish between sections, headlines, paragraphs, etc. Or think of different document types. For instance, you would probably chunk a news article differently from a slide deck or a Tweet. Or what if your text contains tables?
In general, the more you know about the structure of your texts and take this knowledge into account, the better your chunks will be because they will follow the “natural texture” of your data.
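To make the difference concrete, here is a minimal sketch of the two approaches: cutting every N characters versus packing whole paragraphs into chunks. The function names and the blank-line paragraph rule are assumptions for illustration only; real documents (slides, tables, Tweets) would need their own rules.

```python
import re


def chunk_fixed(text: str, size: int = 300) -> list[str]:
    """Naive chunking: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_paragraph(text: str, max_chars: int = 300) -> list[str]:
    """Structure-aware chunking: split on blank lines, then pack whole
    paragraphs into chunks of up to `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate  # still fits (or is the first paragraph of a chunk)
        else:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks
```

The fixed-size version can cut in the middle of a word or sentence, while the paragraph version only ever breaks at paragraph boundaries.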
When you create embeddings, you typically use a deep learning model (a type of neural network) to turn texts into vectors. This way you get a multidimensional space in which the vectors “live”. And the closer any two vectors are in that space, the more semantically similar the texts they represent.
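One common way to measure how “close” two vectors are is cosine similarity. The sketch below uses made-up three-dimensional vectors rather than output from a real embedding model, so the numbers are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only.
toy_embeddings = {
    "The cat sat on the mat.": [0.9, 0.1, 0.0],
    "A kitten is sleeping.": [0.8, 0.3, 0.1],
    "Quarterly revenue rose 4%.": [0.0, 0.2, 0.9],
}


def most_similar(query_vector: list[float], embeddings: dict[str, list[float]]) -> str:
    """Return the stored text whose vector is most similar to the query."""
    return max(embeddings, key=lambda text: cosine_similarity(query_vector, embeddings[text]))
```

In this toy data, the two animal sentences point in almost the same direction, while the finance sentence points somewhere else entirely.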
So how multidimensional should this space be, that is, how many dimensions should your vectors have? This is a design decision you will have to make. As of now, typical vector dimensions for embeddings are 384 (generated by Sentence Transformers MiniLM), 768 (by Sentence Transformers MPNet), 1,536 (by OpenAI) and 2,048 (by ResNet-50), according to this article.
While this is still a field of ongoing research, generally, the higher your vector dimension, the more precise your similarity calculations can be. But higher dimensions also increase storage and compute requirements.
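A rough back-of-the-envelope calculation shows how storage grows with dimensionality. This sketch assumes each vector component is stored as a 4-byte float and ignores index structures, metadata and compression, all of which vary by provider.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming 4-byte (float32) components."""
    return num_vectors * dimensions * bytes_per_value


# Compare the dimensionalities mentioned above for one million chunks.
for dims in (384, 768, 1536, 2048):
    gigabytes = raw_vector_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dimensions: {gigabytes:.2f} GB for 1M vectors")
```

Under these assumptions, one million 384-dimensional vectors take roughly 1.5 GB, and one million 1,536-dimensional vectors take roughly 6.1 GB, four times as much.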
Importantly, once you have decided to go with a specific vector dimensionality and model for creating the embeddings, you typically have to stick to that dimensionality and model. Otherwise you basically end up comparing apples to oranges in your database: the embeddings would be incompatible.
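One way to guard against mixing incompatible embeddings is to record the model and dimensionality once and check every insert against them. The class below is a hypothetical in-memory stand-in, not any provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class EmbeddingCollection:
    """Hypothetical collection that is pinned to one model and one dimension."""
    model_name: str
    dimension: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def add(self, doc_id: str, vector: list[float], model_name: str) -> None:
        # Vectors from different models live in different spaces,
        # even if they happen to have the same number of dimensions.
        if model_name != self.model_name:
            raise ValueError(
                f"collection uses {self.model_name!r}, got vector from {model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(vector)}"
            )
        self.vectors[doc_id] = list(vector)
```

The model name check matters as much as the length check: two models can share a dimensionality and still produce embeddings that cannot be compared.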
As mentioned above, there are a number of different vector database providers. The differences between these providers—both in system architecture and in pricing—affect several design decisions you have to make. Below are some factors you should consider.
Do you plan to add new data to your database all the time, or will your dataset be more or less constant? Different offerings are optimized for different scenarios.
Some vector database offerings are optimized for a scenario where you have lots of users who all bring their own data. The feature that supports this is often called “namespaces”. So if this is important to you, check if your database provider supports this.
You cannot always have both high data volume and fast queries. Some providers are better for high data volumes, at the expense of query response time, whereas other providers offer very high query speed but for smaller data volumes.
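Conceptually, a namespace keeps each user's vectors in a separate partition, so a query only ever searches that user's data. The toy in-memory store below illustrates the idea; the class and method names are hypothetical and do not mirror any specific provider's API.

```python
from collections import defaultdict


class NamespacedVectorStore:
    """Toy store that partitions vectors by namespace (e.g. one per user)."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, list[float]]] = defaultdict(dict)

    def upsert(self, namespace: str, doc_id: str, vector: list[float]) -> None:
        self._data[namespace][doc_id] = list(vector)

    def query(self, namespace: str, vector: list[float], top_k: int = 3) -> list[str]:
        # Only vectors inside the requested namespace are candidates.
        candidates = self._data.get(namespace, {})
        ranked = sorted(
            candidates.items(),
            key=lambda item: sum(x * y for x, y in zip(vector, item[1])),  # dot product
            reverse=True,
        )
        return [doc_id for doc_id, _ in ranked[:top_k]]
```

A query in one user's namespace never returns another user's documents, no matter how similar their vectors are.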
You can host some of the vector databases yourself. Generally, hosting yourself might be interesting if either of the following is true:
Just be aware that hosting yourself comes with a whole range of things you will need to handle yourself: security of your systems; scaling your database up or down as your data volume, number of users, etc. change; systems availability; updates (including updates to all the enabling systems and components you need); hardware (typically requires GPUs, which means it is expensive); and of course the team to do all of this.
By “traditional databases”, I mean full-text retrieval databases that are based on keywords rather than concept similarity. For example, Elasticsearch is such a database (like many others, it uses BM25 for document ranking by default).
And no, they are not obsolete. There are use cases where such databases are still very useful. For example, if you know exactly what a document has to contain in order to match your query, they typically work better than vector databases (e.g. “I am interested in texts that discuss a very specific industrial standard”).
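To show what keyword-based ranking looks like, here is a simplified sketch of BM25 scoring. It uses plain whitespace tokenization, the commonly used parameter values k1 = 1.2 and b = 0.75, and one common variant of the IDF term; production engines add analyzers, stemming and other refinements, so treat this as an illustration rather than a reproduction of any engine's scoring.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # a document without the keyword gets no credit for it
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores
```

A document that contains none of the query terms scores exactly zero here, which is why this kind of search works well when you know exactly which terms a matching document must contain.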
This is why several vector databases offer a hybrid approach where you can combine vector and keyword search. One such database is Pinecone.
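One widely used way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique only; it is not a description of how Pinecone or any other provider implements hybrid search, and the constant k = 60 is simply a conventional choice.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked highly by several retrievers float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, you could pass the ordered document IDs from a keyword search and from a vector search as the two input lists. Because RRF only looks at ranks, it avoids having to put keyword scores and similarity scores on the same scale.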
When we build custom databases for you in Spark, we typically use vector storage. And the factors above play an important role in our design decision process.
If you are interested in a custom Spark database, we’d be happy to hear from you.