LanceDB, which counts Midjourney as a customer, is building databases for multimodal AI


Chang She, previously the VP of engineering at Tubi and a Cloudera veteran, has years of experience building data tooling and infrastructure. But when She began working in the AI space, he quickly ran into problems with traditional data infrastructure — problems that prevented him from bringing AI models into production.

“Machine learning engineers and AI researchers are often stuck with a subpar development experience,” She told TechCrunch in an interview. “Data infra companies don’t really understand the problem for machine learning data at a fundamental level.”

So Chang — who’s one of the co-creators of Pandas, the wildly popular Python data science library — teamed up with software engineer Lei Xu to co-launch LanceDB.

The company develops the open source database of the same name, which is designed to support multimodal AI models: models that train on and generate images, videos and more in addition to text. Backed by Y Combinator, LanceDB this month raised $8 million in a seed funding round led by CRV, Essence VC and Swift Ventures, bringing its total raised to $11 million.

“If multimodal AI is critical to the future success of your company, you want your very expensive AI team to focus on the model and bridging the AI with business value,” Chang said. “Unfortunately, today, AI teams are spending most of their time dealing with low-level data infrastructure details. LanceDB provides the foundation AI teams need so they can be free to focus on what really matters for enterprise value and bring AI products to market much faster than otherwise possible.”

LanceDB is essentially a vector database — a database containing series of numbers (“vectors”) that encode the meaning of unstructured data (e.g. images, text and so on).
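To make the idea concrete, here is a minimal sketch of the lookup a vector database performs: rank stored vectors by how closely they point in the same direction as a query vector. It is not LanceDB's API or implementation. The three-dimensional embeddings, the `ITEMS` table and the `nearest` helper are made up for illustration, and real systems use vectors with hundreds or thousands of dimensions plus approximate indexes rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in practice a model produces these from raw data.
ITEMS = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "photo of a dog": [0.8, 0.3, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}

def nearest(query, items, k=1):
    """Return the names of the k items most similar to the query vector."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )
    return ranked[:k]

if __name__ == "__main__":
    # A query vector that sits close to the two animal photos.
    print(nearest([1.0, 0.0, 0.0], ITEMS, k=2))
```

The brute-force scan here compares the query against every stored vector, which is fine for three items but is exactly the step that dedicated vector databases replace with specialized indexes at scale.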

As my colleague Paul Sawers recently wrote, vector databases are having a moment as the AI hype cycle peaks. That’s because they’re useful for all manner of AI applications, from content recommendations in ecommerce and social media platforms to reducing hallucinations.

The vector database competition is fierce — see Qdrant, Vespa, Weaviate, Pinecone and Chroma to name a few vendors (not counting the Big Tech incumbents). So what makes LanceDB unique? Better flexibility, performance and scalability, according to Chang.

For one, Chang says, LanceDB, which is built on top of Apache Arrow, is powered by a custom data format, Lance Format, that’s optimized for multimodal AI training and analytics. Lance Format enables LanceDB to handle up to billions of vectors and petabytes of text, images and videos, and lets engineers manage the various forms of metadata associated with that data.

“Until now, there’s never been a system that can unite training, exploration, search and large-scale data processing,” Chang said. “Lance Format allows AI researchers and engineers to have a single source of truth and get lightning-fast performance across their entire AI pipeline. It’s not just about storing vectors.”

LanceDB makes money by selling fully managed versions of its open source software with added features such as hardware acceleration and governance controls — and business appears to be going strong. The company’s customer list includes text-to-image platform Midjourney, chatbot unicorn Character.ai, autonomous car startup WeRide and Airtable.

Chang insisted, though, that LanceDB’s recent VC backing wouldn’t shift its attention away from the open source project, which he says is now seeing around 600,000 downloads per month.

“We wanted to create something that would make it 10x easier for AI teams working with large-scale multimodal data,” he said. “LanceDB offers — and will continue to offer — a very rich set of ecosystem integrations to minimize adoption effort.”
