AI Model Geneformer Unlocks Gene Networks Using Limited Data


Alvin Lang
Jul 15, 2024 14:38

Geneformer, an AI model developed by the Broad Institute and Harvard, uses limited data to predict gene behavior and disease mechanisms, accelerating drug discovery.


Geneformer, a powerful artificial intelligence (AI) model, has emerged as a significant tool for understanding gene network dynamics and interactions from limited data. Developed by researchers at the Broad Institute of MIT and Harvard, the model leverages transfer learning from extensive single-cell transcriptome data to make accurate predictions about gene behavior and disease mechanisms. This facilitates faster drug target discovery and deepens understanding of complex genetic networks, according to the NVIDIA Technical Blog.

A BERT-like Reference Model for Single-Cell Data

Geneformer employs a BERT-like transformer architecture, pretrained on data from approximately 30 million single-cell transcriptomes across various human tissues. Its attention mechanism focuses on the most relevant parts of the input data, enabling the model to consider relationships and dependencies between genes. During pretraining, Geneformer uses a masked language modeling technique: a portion of the gene expression data is masked, and the model learns to predict the masked genes from the surrounding context. This approach allows the model to understand complex gene interactions without the need for labeled data.
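Conceptually, the masked objective works like the sketch below. This is a simplified illustration, not BioNeMo's actual pipeline; the function names, toy vocabulary, and mask token are invented for clarity. Each cell is encoded as its genes ranked by expression, a fraction of gene tokens is hidden, and the pretraining task is to recover them from the unmasked context:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sentinel standing in for a [MASK] token; a real vocabulary holds ~25k gene IDs.
MASK_TOKEN = -1

def rank_value_encode(expression, gene_ids):
    """Represent a cell as its genes ordered by decreasing expression,
    mirroring the rank-based encoding described for Geneformer."""
    order = np.argsort(-expression)
    return gene_ids[order]

def mask_tokens(tokens, mask_fraction=0.15, rng=rng):
    """Hide a random subset of gene tokens; the model's pretraining
    objective is to predict the hidden genes from the visible context."""
    tokens = tokens.copy()
    n_mask = max(1, int(len(tokens) * mask_fraction))
    positions = rng.choice(len(tokens), size=n_mask, replace=False)
    labels = tokens[positions].copy()  # ground truth for the loss
    tokens[positions] = MASK_TOKEN
    return tokens, positions, labels

gene_ids = np.arange(20)
expression = rng.random(20)           # toy per-gene expression for one cell
cell_tokens = rank_value_encode(expression, gene_ids)
masked, positions, labels = mask_tokens(cell_tokens)
```

Because the labels come from the data itself, no manual annotation is needed, which is what makes this self-supervised setup work at the scale of tens of millions of cells.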

This architecture and training method enhance predictive accuracy across various tasks related to chromatin and gene network dynamics, even with limited data. For instance, Geneformer can reconstruct crucial gene networks in heart endothelial cells using data from only 5,000 cells, a task that previously required over 30,000 cells with state-of-the-art methods.

Enhanced Predictive Capabilities

Geneformer also demonstrates impressive accuracy in cell type classification tasks. Evaluated on a Crohn's disease small-intestine dataset, the NVIDIA BioNeMo model showed performance improvements over baseline models in accuracy and F1 score. The comparisons used a baseline Log1p PCA+RF model trained on normalized, log-transformed expression counts. Geneformer models with 10M and 106M parameters both improved cell annotation accuracy and F1 scores over this baseline.
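The Log1p PCA+RF baseline follows a standard recipe: normalize counts per cell, log1p-transform, project onto principal components, and classify with a random forest. A minimal sketch with scikit-learn on synthetic counts (the dataset, dimensions, and hyperparameters here are placeholders, not the benchmark's configuration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a cells x genes count matrix with two cell types.
n_cells, n_genes = 400, 100
labels = rng.integers(0, 2, n_cells)
counts = rng.poisson(lam=2.0, size=(n_cells, n_genes)).astype(float)
# Plant a signal: type-1 cells over-express the first 10 genes.
n_type1 = int((labels == 1).sum())
counts[labels == 1, :10] += rng.poisson(lam=5.0, size=(n_type1, 10))

# Normalize per cell, then log1p-transform -- the "Log1p" in the baseline's name.
counts = counts / counts.sum(axis=1, keepdims=True) * 1e4
X = np.log1p(counts)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

# Reduce to principal components, then classify with a random forest.
pca = PCA(n_components=20, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(pca.fit_transform(X_tr), y_tr)
pred = rf.predict(pca.transform(X_te))
acc = accuracy_score(y_te, pred)
f1 = f1_score(y_te, pred)
```

A baseline like this is strong but treats each gene independently after PCA; the reported gains from Geneformer come from the transformer modeling gene-gene context directly.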

Scalability and Advanced Features

To support the next generation of Geneformer-based models, the BioNeMo Framework has introduced two new features. First, a data loader that loads data four times faster than the published method while maintaining compatibility with the original data types. Second, Geneformer now supports tensor and pipeline parallelism, which eases memory constraints and reduces training time, making it feasible to train models with billions of parameters across multiple GPUs.
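Tensor parallelism works by splitting the weight matrices of individual layers across GPUs so that no single device has to hold a full layer. A framework-free sketch of the column-parallel case with NumPy (the "devices" here are simulated; this illustrates the idea, not BioNeMo's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def column_parallel_matmul(x, weight, n_devices):
    """Simulate column-parallel tensor parallelism: each 'device' holds a
    vertical shard of the weight matrix, computes its slice of the output,
    and the slices are concatenated (an all-gather on real hardware)."""
    shards = np.array_split(weight, n_devices, axis=1)     # one shard per device
    partial_outputs = [x @ w_shard for w_shard in shards]  # computed in parallel
    return np.concatenate(partial_outputs, axis=1)

x = rng.random((4, 8))        # batch of activations
weight = rng.random((8, 6))   # full layer weight, sharded across 3 "devices"
parallel = column_parallel_matmul(x, weight, n_devices=3)
reference = x @ weight        # single-device result for comparison
```

Each device stores only its shard (here one third of the weight), which is what lets billion-parameter models fit across multiple GPUs; pipeline parallelism complements this by assigning whole layers to different devices.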

Geneformer is part of a growing suite of accelerated single-cell and spatial omics analysis tools within the NVIDIA Clara suite. These tools can be integrated into complementary research workflows for drug discovery, exemplified by research at The Translational Genomics Research Institute (TGen). The RAPIDS suite of programming libraries, including the RAPIDS-SINGLECELL toolkit and the ScanPy library, accelerates preprocessing, visualization, clustering, trajectory inference, and differential expression testing of omics data.
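Two of those workflow steps, clustering and differential expression testing, can be sketched on the CPU with SciPy and scikit-learn. This is an illustrative stand-in with synthetic data, not the RAPIDS-SINGLECELL or ScanPy API (ScanPy, for instance, uses graph-based Leiden clustering rather than the k-means shown here):

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy log-normalized cells x genes matrix with two planted cell populations.
expr = rng.normal(0.0, 1.0, size=(200, 30))
expr[100:, :5] += 3.0   # second population over-expresses the first 5 genes

# Cluster cells, then test each gene for differential expression
# between the two clusters (Welch's t-test as a simple stand-in).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expr)
group_a, group_b = expr[clusters == 0], expr[clusters == 1]
pvalues = np.array([
    stats.ttest_ind(group_a[:, g], group_b[:, g], equal_var=False).pvalue
    for g in range(expr.shape[1])
])
significant = np.flatnonzero(pvalues < 0.01)  # candidate marker genes
```

At real dataset sizes (millions of cells), these per-gene loops are exactly the kind of work the RAPIDS GPU libraries are built to accelerate.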

A Foundation AI Model for Disease Modeling

Geneformer’s applications span molecular- to organismal-scale problems, making it a versatile tool for biological research. The model is now open source and available for research. It supports zero-shot learning, enabling it to predict data classes it hasn’t explicitly been trained on. In gene regulation research, for instance, Geneformer can be fine-tuned on datasets measuring gene expression changes in response to varying levels of transcription factors, aiding the understanding of gene regulation and potential therapeutic interventions.

Fine-tuning Geneformer on datasets capturing cell state transitions during differentiation can enable precise classification of cell states, assisting in understanding differentiation processes and development. The model can also identify cooperative interactions between transcription factors, enhancing the understanding of complex regulatory mechanisms.
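Fine-tuning for cell-state classification typically means attaching a small classification head to the pretrained encoder and training it on labeled cells. As a schematic stand-in (not Geneformer's actual fine-tuning API), the sketch below trains a logistic-regression head on hypothetical per-cell embeddings with three planted cell states:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-cell embeddings standing in for the pretrained encoder's
# output, with three well-separated cell states along a differentiation path.
n_per_state, dim = 100, 32
centers = rng.normal(0, 3, size=(3, dim))
embeddings = np.vstack([
    centers[s] + rng.normal(0, 1, size=(n_per_state, dim)) for s in range(3)
])
states = np.repeat(np.arange(3), n_per_state)

# The "fine-tuning" step in miniature: fit a lightweight classification head
# on labeled cells while the (simulated) encoder stays fixed.
X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, states, stratify=states, random_state=0
)
head = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = head.score(X_te, y_te)
```

In actual fine-tuning, gradients usually also flow into some or all encoder layers, but the structure is the same: pretrained representations plus a small task-specific head trained on modest labeled data.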

Get Started

The 6-layer (30M parameter) and 12-layer (106M parameter) models, along with fully accelerated example code for training and deployment, are available through the NVIDIA BioNeMo Framework on NVIDIA NGC.

