Stardog: Customer Spotlight
What was the inspiration for founding your company?
Our mission is universal self-service analytics. Everyone working at any large enterprise should be able to ask any question, subject to data governance and access control, and get a trusted, timely, and accurate answer based on public and private data. They should also be able to do this without having to learn a query language or a BI tool. Our product uses knowledge graph technology to solve the data silo, sprawl, and context problems that stand in the way. It also helps navigate the risks of generative AI, such as hallucination, that prevent organizations from leveraging LLMs for business value.
How does your company work with Databricks?
We help customers create a contextualized view of their data stored both inside and outside of Databricks. By “contextualized,” we mean conceptual relationships that tie data into a network of information that means something to business users. This “semantic layer” is a powerful part of a fabric that includes the Databricks Lakehouse. We started our partnership by getting on Partner Connect, and our relationship has expanded with the advent of generative AI. For example, Stardog Voicebox, which allows users to interact with their data through LLMs, has leveraged MosaicML’s training platform for large models. Rather than writing a SQL or a SPARQL query, customers can ask questions of their data in plain language and get plain language responses back that are informed not just by the data, but by what it all means.
Can you tell me about the model that powers these natural language queries?
We take off-the-shelf models and use the Mosaic framework to fine-tune them. We developed our own training data that is specific to graph data models and questions against graphs; that's how we augment user prompts. What we've built is essentially a knowledge engineer that can interpret a business user's intent and craft the right query to get them the answer they're looking for, or be clear, rather than creative, when it doesn't have the data. What we'll never do is compromise data security and privacy by sending data to a third party. Our customers in financial services, manufacturing, and life sciences appreciate our commitment to in-house approaches.
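The interview doesn't show what this training data looks like, but text-to-query fine-tuning sets are commonly stored as prompt/response records in JSON Lines. A hypothetical sketch, where the questions, queries, and schema terms (`:suppliesPart`, `:usedIn`, and so on) are all illustrative rather than Stardog's actual data:

```python
import json

# Hypothetical fine-tuning examples pairing a business question with the
# SPARQL query that answers it over a knowledge graph. The vocabulary is
# invented for illustration only.
examples = [
    {
        "prompt": "Which suppliers provide parts for the X100 assembly?",
        "response": (
            "SELECT ?supplier WHERE { "
            "?supplier :suppliesPart ?part . "
            "?part :usedIn :X100 }"
        ),
    },
    {
        "prompt": "How many open risk findings does each business unit have?",
        "response": (
            "SELECT ?unit (COUNT(?finding) AS ?n) WHERE { "
            "?finding :status :Open ; :belongsTo ?unit } "
            "GROUP BY ?unit"
        ),
    },
]

# Serialize to JSON Lines, a common on-disk format for fine-tuning datasets.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl.splitlines()[0])
```

Each record teaches the model one mapping from business intent to graph query, which is what lets a smaller fine-tuned model specialize in this task.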
Tell me about your journey to model development. Did you begin with RAG?
Instead of RAG, we are using LLMs to generate a structured query. We execute that query over the knowledge graph and then return the results in natural language form. For that reason, standard RAG applications don't work. What we are trying to solve here are hallucinations and traceability problems. We can trace every answer to the data point in the graph; we have that lineage. There is no hallucination because we know the data origin. We started by experimenting with OpenAI, but we quickly learned that if you have a smaller model that you fine-tune for a specific task, you can match or exceed the quality of OpenAI, and you have more control over data security and privacy.
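The loop described above — generate a structured query, execute it over the graph, verbalize the results with lineage — can be sketched as follows. The toy graph, the `generate_query` stub standing in for the fine-tuned model, and the pattern format are all simplifying assumptions, not Stardog's implementation:

```python
# Minimal sketch of the pipeline: a model drafts a structured query, the
# query runs over the knowledge graph, and the answer comes back in natural
# language together with its source triple (the lineage). All names and
# data here are illustrative.

# A toy knowledge graph as (subject, predicate, object) triples.
GRAPH = {
    ("AcmeCorp", "headquarteredIn", "Berlin"),
    ("AcmeCorp", "suppliesTo", "Globex"),
    ("Globex", "headquarteredIn", "Boston"),
}

def generate_query(question: str):
    """Stand-in for the fine-tuned LLM: maps a question to a simple
    (predicate, subject) pattern instead of full SPARQL, so the sketch
    stays self-contained and runnable."""
    if question.lower().startswith("where is"):
        subject = question.rstrip("?").split()[-1]
        return ("headquarteredIn", subject)
    raise ValueError("question not covered by this sketch")

def execute(pattern):
    """Run the pattern over the graph, keeping each matched triple so every
    answer is traceable to a data point. If nothing matches, nothing is
    returned — the system stays silent rather than hallucinating."""
    predicate, subject = pattern
    return [(obj, (s, p, obj)) for (s, p, obj) in GRAPH
            if s == subject and p == predicate]

def verbalize(question, results):
    if not results:
        return "I don't have data to answer that."
    return ", ".join(obj for obj, _ in results)

question = "Where is the headquarters of AcmeCorp?"
results = execute(generate_query(question))
print(verbalize(question, results), "<- traced to", results[0][1])
```

The key design point is that `execute` returns the source triple alongside each answer, which is what makes every response traceable to its origin in the graph.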
Before working with MosaicML, we spent a lot of time improving the quality of our training data to make the most out of it. But we faced other challenges, such as catastrophic forgetting. We would fine-tune on a task, but then the model might forget everything else it knew about how to have a conversation. I lost count of how many fine-tuning sessions we had, but we are in the hundreds at this point. We did a lot of work on Amazon EC2 machines before we got to Mosaic, but hardware limitations were slowing us down. With Mosaic, we were able to get access to more hardware.
There are certainly lots of nice things that come with Mosaic with respect to your fine-tuning API and things like that. It’s really nice and easy in the Databricks ecosystem to see the learning curve, and how things are changing over epochs during training. There are a lot of usability benefits to the Mosaic platform. To build Stardog Voicebox, we needed a way to serve our models and we didn't want to run these things ourselves. It was a big time saver for us to be able to rely on Databricks.
Can you share any final thoughts on the benefits of AI customization to Stardog?
Off-the-shelf open source models are just not that good at the kinds of tasks we need, especially writing a structured query. In ordinary text generation, a typo isn't catastrophic: a human reader still understands the meaning. In our case, a single mistyped curly bracket can break query execution.
Models need to be very sophisticated in that sense to generate these queries correctly. The margin of error is very low. Fine-tuning is not really a nice-to-have for us. It's a necessity.
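The fragility described above can be made concrete: one missing brace renders a SPARQL query unparseable, which is why generated queries are typically validated before execution. A minimal balance check, a sketch rather than Stardog's actual validator:

```python
def braces_balanced(query: str) -> bool:
    """Check that the `{`/`}` pairs in a query string are balanced.
    A real system would run a full SPARQL parser; this sketch only
    catches the curly-bracket typo mentioned above."""
    depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace with no matching open
    return depth == 0

good = "SELECT ?s WHERE { ?s ?p ?o }"
bad = "SELECT ?s WHERE { ?s ?p ?o "  # one missing brace breaks the query
print(braces_balanced(good), braces_balanced(bad))
```

A human reads past the missing brace without noticing; a query engine rejects the whole statement, which is why the margin of error for generation is so low.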
To learn more about Stardog, visit: https://www.stardog.com/platform
To find out how customers are using Stardog in applications from drug discovery to supply chain management to risk analysis, visit: https://www.stardog.com/company/customers/