Building the Future of Medical Imaging with Synthetic Data and Distributed Training

Written by Daniel Vallejo | Aug 19, 2025

A New Path for Medical Imaging Research

Every startup begins with a challenge worth solving. Some challenges are small and incremental; others touch entire industries. In healthcare, the problems are complex, the data is sensitive, and the impact of innovation can be life-changing. That is why the work of startups in this space feels so important.

In this episode of AWS Let's Build a Startup, Giuseppe Battista, Senior Solutions Architect at AWS, sat down with the teams from Sinkove and Cloud Combinator. Their partnership is tackling one of the toughest issues in medical research: how to train advanced AI models when access to high-quality, unbiased imaging data is limited. By combining Sinkove’s expertise in synthetic data generation with Cloud Combinator’s knowledge of distributed training on AWS, they are showing a path forward for healthcare innovation.

 

Watch the full episode here.

 

A Match Made in the Cloud

Sinkove develops advanced AI models for medical imaging. Their goal is to generate synthetic data that fills the gaps in traditional datasets. To train these models, Sinkove needed serious GPU power and reliable distributed training pipelines. That is where Cloud Combinator stepped in.

With support from AWS, the two teams worked to integrate Sinkove’s training workflows into Amazon SageMaker HyperPod, a managed service that simplifies distributed training. This partnership allowed Sinkove to focus on model innovation rather than the complexity of infrastructure.

Pedro from Sinkove explained:

“We needed to train our diffusion models repeatedly on specific datasets. It required heavy GPU resources and distributed training. AWS introduced us to Cloud Combinator, and we brought everything into SageMaker HyperPod. Now our training runs at scale, and we can deliver faster for customers.”

Read more in the Cloud Combinator case study

 

Why Distributed Training Matters

Training AI models on a laptop or a single GPU is fine for small projects, but when the models involve 3D medical imaging, the limits are reached quickly. Each image consumes huge amounts of memory; sometimes a single image can max out an entire GPU.

Anton from Cloud Combinator broke it down:

“Distributed training is like spreading the workload across many GPUs. Each instance trains on a slice of the data, and in the end the results merge into one model. It transforms your hardware into a kind of supercomputer.”

This approach not only accelerates training but also lets researchers use larger effective batch sizes, which stabilize learning and improve results. For medical imaging, where a single 3D volume can be orders of magnitude larger than a 2D image, this kind of parallelization is essential.
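
To make the idea concrete, here is a minimal data-parallel training sketch in PyTorch. It is illustrative only, not Sinkove's actual pipeline: the tiny linear model and random tensors stand in for a diffusion model and 3D medical volumes, and the script assumes a torchrun launch.

```python
# Minimal PyTorch data-parallel training sketch, illustrative only.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK; NCCL handles GPU comms.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Random tensors and a linear layer stand in for 3D volumes and a
    # diffusion model.
    data = TensorDataset(torch.randn(1024, 64), torch.randn(1024, 1))
    model = DDP(torch.nn.Linear(64, 1).cuda(), device_ids=[local_rank])

    # DistributedSampler hands each GPU a disjoint slice of the dataset.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
            loss.backward()  # DDP all-reduces gradients across GPUs here
            optimizer.step()  # every replica ends with the same weights

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The gradient all-reduce in the backward pass is what "merges the results into one model," as Anton described: every GPU sees different data but applies identical averaged updates.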

Learn more in the AWS distributed training guide

 

From Academia to Industry

One theme that emerged was the smooth transition from research to production. Many scientists are familiar with tools like Slurm, a workload manager widely used in universities. SageMaker HyperPod makes it easy to run those same workflows in the cloud.
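
For readers who have not used Slurm, a submission script is only a few lines. The sketch below is hypothetical (the job name, node counts, and train.py path are placeholders), but the same sbatch workflow carries over from a university cluster to a HyperPod cluster running Slurm.

```bash
#!/bin/bash
#SBATCH --job-name=diffusion-train   # hypothetical job name
#SBATCH --nodes=2                    # request two GPU nodes
#SBATCH --ntasks-per-node=8          # one task per GPU, for example
#SBATCH --output=logs/%x_%j.out      # per-job log file

# srun launches the training script across all allocated nodes; the
# same script runs unchanged on-premises or on a HyperPod Slurm cluster.
srun python train.py --epochs 10
```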

Try it yourself with the SageMaker HyperPod Slurm workshop

Pedro noted that his PhD experience with Slurm translated directly into how Sinkove now operates on AWS. That continuity lowers the barrier for researchers who want to scale their projects into startups.

 

The Problem with Medical Data

Why go to all this effort? Because medical data is scarce, fragmented, and often biased. Each radiology image can cost hundreds of dollars to acquire. Privacy laws restrict how data can be shared. Hospitals use different imaging protocols. And historical datasets may over-represent certain populations while ignoring others.

Pedro described the challenge clearly:

“In medical imaging, data is expensive and often biased. You need to prove to regulators like the FDA that your model works across populations. But when datasets are skewed by geography or demographics, that becomes difficult.”

Sinkove’s solution is to generate synthetic medical images: AI-generated X-rays, MRIs, and other scans that mimic real patient data while filling in missing diversity. By creating synthetic examples across demographics, imaging protocols, or disease subtypes, researchers get balanced datasets that improve downstream models.

 

Real World Impact

A striking example came from a project with Pfizer, whose drug development pipeline involved data from twelve labs. Even with standardized protocols, variations in microscopes and acquisition methods caused the models to latch onto technical differences between labs rather than the effects of the compounds themselves.

Sinkove generated synthetic images simulating each compound across multiple protocols. This helped the models learn the right features and improved results.

Other customers face similar challenges. A European company may want to deploy in the US but lack data representing American populations. Synthetic data bridges that gap and accelerates regulatory approval.

Watch the episode highlight: Transforming Clinical Research with Generative AI

 

Trust and Regulation

Of course, synthetic data must be trusted. Agencies like the MHRA in the UK and the FDA in the US are publishing guidelines on its use. Encouragingly, both regulators are experimenting with synthetic datasets themselves, signaling openness to this approach.

Pedro emphasized:

“It is a super exciting time. Regulators know collecting medical data is hard. They are incentivizing innovation in synthetic data and publishing frameworks for its safe use.”

 

The Technology Behind the Scenes

So how does this all run in practice? Cloud Combinator helped Sinkove set up a SageMaker HyperPod cluster. This includes head nodes for control, compute nodes for training, and high-speed shared storage such as Amazon FSx for Lustre.
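
As an illustration of what standing up such a cluster can look like, here is a hedged sketch using the HyperPod CreateCluster API via boto3. Every name, instance type, role ARN, and S3 path below is a placeholder, not Sinkove's actual configuration, and real setups involve lifecycle scripts and networking details omitted here.

```python
# Hypothetical HyperPod cluster definition with boto3; all values are
# placeholders invented for illustration.
import boto3

sm = boto3.client("sagemaker")

response = sm.create_cluster(
    ClusterName="imaging-training-cluster",  # placeholder name
    InstanceGroups=[
        {   # controller ("head") node that runs the scheduler
            "InstanceGroupName": "controller",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",  # setup scripts
                "OnCreate": "on_create.sh",
            },
        },
        {   # GPU workers that run the distributed training jobs
            "InstanceGroupName": "workers",
            "InstanceType": "ml.g5.8xlarge",
            "InstanceCount": 2,  # start small, scale to hundreds later
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",
                "OnCreate": "on_create.sh",
            },
        },
    ],
)
print(response["ClusterArn"])
```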

One advantage is that observability comes built in. With integration into Grafana, Sinkove can monitor GPU load, memory usage, and network performance with little setup. For startups, avoiding weeks of infrastructure work is a huge win.

Cost-wise, HyperPod is efficient: you pay only for the EC2 instances you use, starting from just two nodes and scaling to hundreds. Flexible training plans further reduce expense by optimizing GPU usage.

 

Seeing Sinkove in Action

During the episode, Pedro demonstrated Sinkove’s platform. Using their Python SDK, he generated a dataset of chest X-rays showing left lung consolidation. With just a few lines of code, researchers can spin up images tailored to their needs.
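
The SDK was shown on screen rather than documented in this recap, so the snippet below is only a hypothetical reconstruction of that "few lines of code" experience: the package name, client, methods, and every parameter are assumptions, not Sinkove's real API.

```python
# Hypothetical sketch of requesting a synthetic dataset. "sinkove",
# "Client", "generate", and all parameter names are invented for
# illustration; consult Sinkove's actual SDK documentation.
import sinkove  # hypothetical package name

client = sinkove.Client(api_key="YOUR_API_KEY")  # hypothetical client

dataset = client.generate(
    modality="chest-xray",                 # assumed parameter
    condition="left lung consolidation",   # target finding
    count=10,                              # ten unique synthetic images
    include_masks=True,                    # region masks to speed annotation
)
dataset.download("./synthetic_xrays/")
```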

The results were striking. Ten unique synthetic X-rays appeared, each showing the specified condition with diverse variations. That diversity is key to reducing bias. Sinkove even provides masks for precise region labeling, which speeds up annotation and downstream tasks.

Giuseppe summed it up perfectly:

“I never thought I could just generate X-rays on my machine. That is mind-blowing.”

See more from the community post: Sinkove on LinkedIn

 

Questions from the Community

Viewers joined in with thought-provoking questions. One asked whether it is more important to train models from scratch or to focus on combining them with techniques like retrieval-augmented generation (RAG) to build applications.

Anton explained that both paths are valuable. For specialized domains like medical imaging, training custom models is essential. For more general applications, integrating existing models through engineering techniques can be faster. It depends on what excites you and the problems you want to solve.

Another question touched on bias: what happens if overlooked populations are excluded from datasets? Pedro explained that Sinkove actively counters this by amplifying under-represented groups with synthetic data. By labeling training data with demographic information, they can generate new examples that promote fairness in downstream models.
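
One simple way to see the idea: count examples per demographic group and top up the under-represented ones with synthetic images. The sketch below is entirely illustrative; the records and the generate_synthetic() helper are stand-ins for a call to a generative model, not Sinkove's code.

```python
# Illustrative demographic balancing with synthetic data; all names and
# records here are invented placeholders.
from collections import Counter


def generate_synthetic(group: str, count: int) -> list:
    # Placeholder: in practice this would call a generative model
    # conditioned on the demographic label.
    return [{"image": f"synthetic_{group}_{i}.png", "group": group}
            for i in range(count)]


records = [
    {"image": "img_001.png", "group": "A"},
    {"image": "img_002.png", "group": "A"},
    {"image": "img_003.png", "group": "B"},
]

counts = Counter(r["group"] for r in records)
target = max(counts.values())  # top every group up to the largest one

for group, n in counts.items():
    if n < target:
        records.extend(generate_synthetic(group, target - n))

print(Counter(r["group"] for r in records))  # now balanced: A=2, B=2
```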

 

Looking Ahead

As the hour closed, both guests encouraged innovators to explore this space. Synthetic data and distributed training are unlocking opportunities in healthcare and beyond.

Pedro invited researchers to connect directly through sinkove.com/contact and explore their blog posts for deeper insights. Anton reminded viewers that Cloud Combinator hosts workshops on SageMaker HyperPod, helping teams experience the power of distributed training firsthand.

Next week, the series continues with Lovable, one of the fastest-growing startups in the world, alongside AWS Chief Evangelist Jeff Barr.

 

Final Thoughts

This episode captured what makes the startup ecosystem so exciting. A small team with a big idea, powered by cloud infrastructure, can tackle global problems like healthcare bias. Synthetic data may sound futuristic, but thanks to companies like Sinkove and partners like Cloud Combinator, it is already shaping the future of diagnostics, drug discovery, and medical AI.

The message is clear: with the right tools, collaboration, and creativity, the next wave of healthcare innovation is already here.
