Companies crave fresh data to train AI models. This startup’s recipe? Data made from scratch – by AI

Tan KW

Publish date: Wed, 19 Jun 2024, 03:58 PM

Ever since OpenAI’s ChatGPT sparked the generative AI boom in 2022, it’s been clear that having the right data, and enough of it, is essential to creating an AI model that is accurate, reliable, and efficient.

The problem? The best data, particularly specialised “expert” data in specific domains like health and finance, is in short supply.

AI companies have strip-mined the Internet for fresh information, but AI models are constantly hungry - and must be fed.

San Francisco-based startup Gretel AI has long believed that the most satisfying solution is to create fake food that is just as tasty as the real thing. It helps clients such as EY, Google, and the US Department of Justice generate synthetic data - that is, artificially generated data that mimics the characteristics of real-world data.

And it’s getting easier to make it: Today, for example, Gretel announced the wide availability of a generative-AI-powered system that lets users create synthetic datasets for tabular data - think of text and number data that goes in columns and rows, like Excel spreadsheets - with just a natural language prompt like those used for ChatGPT.

Let’s say a bank wants to create a synthetic dataset that is similar to its own customer data but does not include actual individual names or information. Using Gretel’s Navigator product, the bank can prompt the system to create millions of fictional names, IDs, dates, dollar amounts, and account balances, for example, based on Gretel’s own datasets or on the bank’s proprietary data.

Gretel claims the resulting computer-generated data doesn’t infringe on customer privacy, since it does not include any real-world customer information, and that the system can generate enough of it to train a powerful, accurate model.

As data scarcity forces companies to seek other sources to build general models or fine-tune ones for specific tasks, synthetic data is having a moment in 2024, Gretel cofounder and CEO Ali Golshan told Fortune.

Golshan, who had previously cofounded two security-focused startups, pointed out that the company got its start in 2020 as a way to generate privacy-minded data (the name Gretel came from the classic story of Hansel and Gretel, who left a trail of breadcrumbs to find their way home). The company “wanted to make sure people don’t leave digital breadcrumbs behind” while offering developers a way to access useful data, particularly in highly regulated industries.

“We never really thought about the context of running out of data - that was a ChatGPT moment,” he said. But now data scarcity - as well as data privacy and security - is why companies are turning to synthetic data as an option to train AI models.

Golshan emphasises that generating synthetic data is not about spewing out high volumes of low-quality, useless data (think Reddit posts). “People think synthetic data is sort of interchangeable with fake data or junk data, that they just need more of it,” he said. “That is where you end up with these sorts of toxic dovetails and spirals of hallucinations - the quality part has to be there.”

What will drive business over the next two decades, he added, is taking large AI investments built on the back of “messy, public, privacy-riddled data” and “plugging them into our sensitive, owned, domain-specific data - that is unique and can drive models forward.”

He also pushed back on the idea that synthetic data is not “as good” as real data, and on concerns about the dangers of AI training itself on its own hallucinations or misinformation. Since the company mostly serves businesses, organisations, and governments, Gretel’s work typically starts with a seed of data a company already has - whether it is patient data, fraud data, or transaction data. “That acts as the boundaries and the gates for how we build the rest of the data,” he said.

Gretel’s latest product lets companies generate data even on topics about which they lack information. Its technology focuses on highly specific data meant to improve individual tasks within a client’s internal systems - and not produce data based on millions of pages scraped from the internet that could prove problematic.

Gretel is not alone in attempting to corner the market on generating synthetic data to train AI models. Startups like SynthLabs, Synthetaic, and Clearbox AI are all racing to provide companies with all the data they need - computer-generated, that is.

That has led Golshan and his cofounders to consider the future. He says companies will soon be able to make money by allowing others to buy synthetic data generated by models trained on that organisation’s unique datasets. Organisations that have lots of data but aren’t building AI models, for instance, could sell others access to their data to help train models that generate synthetic data.

To that end, Golshan said, Gretel’s next big move is to build a synthetic data and model exchange. “We are going to enable companies and customers to train models on their data, get mathematical guarantees that data is safe, and somebody can come and ‘subscribe’ to that model, generate data, and pay as you go,” he explained.

This, he added, will take Gretel to the next level to “become the safe interface for private data, where you remove this exploitative approach to mining and harvesting data.” It would also mean companies like Anthropic and OpenAI, which have built huge AI models on massive amounts of data, would not have to strike licensing deals with every individual company whose data they want, he said.

As for funding, Gretel has raised a total of US$68mil with its Series B back in 2021. Golshan said the startup has a lot of money left, with “about two years of runway ahead of us”. But in this “moment” for synthetic data, he says he sees an opportunity to build the next Databricks or Snowflake - two of the biggest data cloud platforms - or even OpenAI.

“We are leaning into it pretty aggressively because we’re having a ton of pull,” he said. “We envision building the next safe, high-quality data business, which, if you think about the needs, is a pretty significant opportunity.”

- NY Times
