MIT researchers present generative artificial intelligence for databases | MIT News


A new tool makes it easier for database users to perform complicated statistical analysis of tabular data without needing to know what’s going on behind the scenes.

GenSQL, a generative artificial intelligence system for databases, could help users make predictions, detect anomalies, guess missing values, correct errors, or generate synthetic data with just a few keystrokes.

For example, if the system were used to analyze medical data for a patient who has always had high blood pressure, it might detect a blood pressure reading that is low for that particular patient but is otherwise within the normal range.

GenSQL automatically integrates a tabular dataset and a generative probabilistic AI model, which can account for uncertainty and adjust its decision-making based on new data.

Additionally, GenSQL can be used to produce and analyze synthetic data that mimics real data in a database. This can be especially useful in situations where sensitive data, such as patient medical records, cannot be shared or where real data is scarce.

This new tool is built on SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers around the world.

“Historically, SQL taught the business world what a computer could do. They didn’t have to write custom programs, they just asked a database in a high-level language. We think that as we move from just querying data to asking questions of models and data, we’re going to need an analogous language that teaches people the meaningful questions to ask a computer that has a probabilistic model of the data,” says Vikash Mansinghka, senior author of a paper introducing GenSQL and a principal research scientist and leader of the Probabilistic Computing Project in MIT’s Department of Brain and Cognitive Sciences.

When researchers compared GenSQL to popular AI-based methods for data analysis, they found that it was not only faster, but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, so users can read and edit them.

“If we analyze the data and try to find meaningful patterns using just some simple statistical rules, we might miss important interactions. What we really want is to capture the correlations and dependencies of variables, which can be quite complicated, in a model. With GenSQL, we want to enable a large set of users to query their data and their model without having to know all the details,” adds lead author Mathieu Huot, a research scientist in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project.

They are joined on the paper by MIT graduate students Matin Ghavami and Alexander Lew; Cameron Freer, a research scientist; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, a professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.

Combining models and databases

SQL, which stands for Structured Query Language, is a programming language for storing and manipulating information in a database. In SQL, people ask questions about data using keywords, for example to sum, filter, or group database records.
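As a rough illustration of the filtering, grouping, and summing described above, here is a minimal sketch in plain SQL, run through Python's built-in `sqlite3` module. The `developers` table, its columns, and the sample rows are invented for this example and do not come from the GenSQL paper.

```python
import sqlite3

# Hypothetical sample records: (name, city, salary).
ROWS = [
    ("Ana", "Seattle", 120000.0),
    ("Ben", "Seattle", 100000.0),
    ("Chloe", "Boston", 90000.0),
    ("Dev", "Boston", 40000.0),
]


def salary_totals_by_city(rows, min_salary=0.0):
    """Filter rows by salary, group them by city, and sum salaries per city."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute(
            "CREATE TABLE developers (name TEXT, city TEXT, salary REAL)"
        )
        conn.executemany("INSERT INTO developers VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT city, COUNT(*) AS n, SUM(salary) AS total
            FROM developers
            WHERE salary >= ?      -- filtering
            GROUP BY city          -- grouping
            ORDER BY city
            """,
            (min_salary,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for city, n, total in salary_totals_by_city(ROWS, min_salary=50000.0):
        print(city, n, total)
```

A query like this summarizes the records that exist. It says nothing about what an unseen or partially observed record is likely to look like, which is the kind of question a model is meant to answer.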

However, consulting a model can provide more detailed information, as models can capture what the data means for an individual. For example, a developer wondering if she is underpaid is probably more interested in what salary data means for her individually than in trends in database records.

The researchers noted that SQL did not provide an effective way to incorporate probabilistic AI models, but at the same time, approaches using probabilistic models to make inferences did not support complex database queries.

They developed GenSQL to fill this gap, allowing someone to query both a dataset and a probabilistic model using a simple but powerful formal programming language.

A GenSQL user uploads their data and probabilistic model, which the system automatically integrates. They can then run queries on the data that also receive input from the probabilistic model running in the background. This not only allows for more complex queries, but can also provide more accurate answers.

For example, a query in GenSQL might be something like, “How likely is it that a developer in Seattle knows the Rust programming language?” If you only look at a correlation between columns in a database, you might miss subtle dependencies. Incorporating a probabilistic model can capture more complex interactions.
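The Seattle-and-Rust question is, at its core, a conditional probability. The sketch below is not GenSQL syntax and not the model GenSQL uses; it is a deliberately simple stand-in that estimates P(target | given) from records with add-one smoothing, so that small groups do not yield probabilities of exactly 0 or 1. The column names `city` and `knows_rust`, the function name, and the sample data are all assumptions made for illustration.

```python
# Hypothetical records; the columns are invented for this example.
RECORDS = [
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": False},
    {"city": "Boston", "knows_rust": False},
    {"city": "Boston", "knows_rust": False},
    {"city": "Boston", "knows_rust": False},
]


def conditional_probability(records, given, target, alpha=1.0):
    """Estimate P(target | given) with add-alpha smoothing.

    `given` and `target` map column names to required values. The outcome
    is treated as binary (target satisfied or not), hence the 2 * alpha.
    """
    matches_given = [
        r for r in records if all(r[k] == v for k, v in given.items())
    ]
    matches_both = [
        r for r in matches_given if all(r[k] == v for k, v in target.items())
    ]
    return (len(matches_both) + alpha) / (len(matches_given) + 2 * alpha)


if __name__ == "__main__":
    p_seattle = conditional_probability(
        RECORDS, {"city": "Seattle"}, {"knows_rust": True}
    )
    p_overall = conditional_probability(RECORDS, {}, {"knows_rust": True})
    print(f"P(rust | Seattle) = {p_seattle:.3f}")
    print(f"P(rust)           = {p_overall:.3f}")
```

This toy estimator conditions on a single column. The richer probabilistic models described in the article are meant to capture dependencies across many columns at once, which a per-column count like this one cannot do.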

Furthermore, the probabilistic models that GenSQL uses are auditable, so people can see what data the model is using to make decisions. In addition, these models provide calibrated uncertainty measures alongside each response.

For example, with this calibrated uncertainty, if one queried the model for the predicted outcomes of different cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL would tell the user that there is uncertainty and how uncertain it is, rather than overconfidently advocating for the wrong treatment.
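To make the idea of reporting uncertainty concrete, here is a minimal sketch using a standard Beta-Bernoulli model, which is a textbook technique and not the method GenSQL uses. It shows how the same estimator reports a much wider spread for a group with only a handful of records. The counts below are invented, and the function name is hypothetical.

```python
import math


def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return the posterior mean and standard deviation of a success rate.

    Assumes a Beta(prior_a, prior_b) prior and independent binary outcomes.
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std


if __name__ == "__main__":
    # Invented counts: a well-represented group and an underrepresented one.
    groups = {
        "well represented": (80, 20),
        "underrepresented": (4, 1),
    }
    for label, (wins, losses) in groups.items():
        mean, std = beta_posterior(wins, losses)
        print(f"{label}: estimated success rate {mean:.2f} +/- {std:.2f}")
```

The two groups have similar observed success rates, but the posterior standard deviation for the smaller group is several times larger. That gap is the signal a user would want to see before acting on the estimate.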

Faster and more accurate results

To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries in a few milliseconds while providing more accurate results.

They also applied GenSQL in two case studies: one in which the system identified mislabeled clinical trial data and the other in which it generated accurate synthetic data that captured complex relationships in genomics.

Next, the researchers want to apply GenSQL more broadly to run large-scale models of human populations. Using GenSQL, they can generate synthetic data to draw conclusions about things like health and salary, while controlling what information is used in the analysis.

They also want to make GenSQL easier to use and more powerful by adding new optimizations and automation to the system. In the long term, the researchers want to enable users to make natural language queries in GenSQL. Their goal is to develop a ChatGPT-like AI expert that one could talk to about any database and that grounds its answers in GenSQL queries.

This research is funded, in part, by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Endowment.