The last decade has seen rapid progress in our ability to measure multiple biomarkers in individual cells. This is particularly true for DNA and RNA markers, where genomics and transcriptomics have advanced dramatically. However, while DNA and RNA analysis is essential for understanding gene expression, a comprehensive single-cell analysis requires looking at other biomarkers as well, particularly proteins.
PROTEOMICS PROBLEM
Why proteomics? To put it simply, because proteins matter, and we don’t know nearly as much about proteome diversity at the single-cell level as we do about the genome and transcriptome.
Just as the transcriptome is not a simple copy of the genome, the proteome is not a direct image of the transcriptome. In cells, RNA and protein copy numbers do not always correlate. With only around 20,000 genes, a few hundred thousand mRNA molecules, and millions of protein molecules in a single mammalian cell, there is too much variation to ignore. Moreover, the same protein can exist as many proteoforms, arising from alternative splicing and gene fusions. Post-translational modifications add yet another dimension to this variation. All of this speaks to the importance of looking directly at proteome variation in individual cells.
The problem is that we don’t yet have technology that would allow us to look at the proteome at the scale available for single-cell genome or transcriptome analysis.
WHAT IS NEEDED
What would be the requirements for such a technology? Many are in fact similar to those of single-cell transcriptome analysis.
The essential requirement is sensitivity. For nucleic acids, this problem is solved by PCR, which amplifies DNA and RNA to easily detectable levels. However, there is no equivalent of PCR for proteins. Without a way to amplify the sample, it is difficult to work with proteins whose abundance in a single cell can range from over half a million copies down to fewer than ten. This technical challenge has hampered proteomics studies for a long time. Unable to amplify the target, protein analysis must instead focus on increasing the sensitivity of detection.
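To put that span in numbers (using the illustrative copy numbers above, not measured values), a quick back-of-the-envelope calculation shows the dynamic range a single-cell method has to cover:

```python
import math

# Illustrative copy numbers from the text: abundant proteins can exceed half
# a million copies per cell, rare ones number fewer than ten.
high_abundance = 500_000
low_abundance = 10

dynamic_range = high_abundance / low_abundance
print(f"dynamic range: {dynamic_range:,.0f}x")                  # 50,000x
print(f"orders of magnitude: {math.log10(dynamic_range):.1f}")  # 4.7
```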
High content is the second requirement for single-cell proteomics. A single-cell proteome can contain as many as a hundred thousand different proteoforms. It may not be necessary to capture all of that information for every proteome, but a good method should be able to capture a sufficiently large fraction of each one.
High throughput is also required: you need to be able to test many cells in a single run. This applies both to the number of cells analyzed and to spatial resolution when mapping the localization of markers in tissues and cells. The assay should not be confined to just a very small area.
Good methods will also have to be accurate.
Instead of just four variables (i.e., four bases) as is the case with nucleic
acids, protein analysis deals with twenty variables (i.e., amino acids) and a
correspondingly higher chance of error.
It is also very important to eliminate bias. Proteins vary widely in biochemical properties (e.g., charge, size, hydrophobicity) due to differences in amino acid composition and later modifications (e.g., phosphorylation, glycosylation). These variations can introduce technical bias into the analysis, leading to under- or over-representation of some species or variants.
Other considerations are not technical in nature but can be just as important for any technique that aims to have a wider impact.
a. It would need to be user-friendly. Technically complicated methods that require a high level of specialization and training have a harder time gaining wide adoption.
b. It needs to be cost-effective, both in terms of the initial equipment cost and the later cost per assay.
c. Time is the remaining requirement. Slow methods just don’t work well where high throughput is needed. Assay time adds to labor costs, and labor cost is a factor when choosing which method to use.
STATE OF THE ART
Currently, the most common approach to looking at the proteome is based on mass spectrometry (MS). It characterizes a protein by measuring the mass of either the whole protein or a mixture of its shorter peptides. Recent technical improvements in this area are aimed at single-cell mass spectrometry. For the most part, sensitivity is still in the nanogram range, but new technical improvements promise better detection. Using new and improved MS approaches, it is now possible to analyze as many as a thousand proteins from a single cell. While encouraging, this is still not quite at the throughput level needed for single-cell proteomic research. In addition, MS remains a somewhat challenging approach: not all proteins and peptides ionize and pass through a spectrometer equally well, which can introduce bias and error into the data.
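As a rough illustration of what "characterizing a protein by mass" means in the peptide case, the sketch below computes the monoisotopic mass of a peptide from the standard residue masses; the peptide itself is just an example:

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per linear peptide (terminal H and OH)

def peptide_mass(seq: str) -> float:
    """Monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

print(f"{peptide_mass('PEPTIDE'):.4f} Da")  # ~799.36 Da
```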
Another proteomics approach is based on immunoassays. Targets are detected with specific antibodies labeled with unique tags (fluorophores, enzymes, mass tags, oligo barcodes). Immunoassays have been used for many years in methods such as ELISA, immunohistochemistry, and cell sorting. Sensitivity is quite good, reaching the femtogram level thanks to various signal amplification methods. The approach has its limitations, however, specifically when it comes to multiplexing protein markers in a spatial context. It is possible to increase the content by using sequential rounds of assays with unique tags that allow each marker to be identified. This permits parallel analysis of up to a hundred markers in a spatially limited region of interest. The total assay time is long due to the sequential nature of the multiple testing rounds, and with each additional round the sample can progressively degrade. This caps the number of possible rounds and thus the total number of protein biomarkers that can be analyzed per assay.
Additionally, immunological methods depend on the availability of antibodies.
Antibodies can be costly and not always available, especially for new proteins
or specific proteoforms.
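The arithmetic behind sequential multiplexing is simple. As a sketch, assume four fluorescence channels per round and a degradation-limited cap of 25 rounds; both numbers are illustrative assumptions, not figures from any specific platform:

```python
# Each round images a few markers in parallel; rounds repeat until sample
# degradation caps the experiment. Both numbers below are assumptions.
channels_per_round = 4   # assumed fluorescence channels imaged per round
max_rounds = 25          # assumed round cap from sample degradation

total_markers = channels_per_round * max_rounds
print(f"markers per assay: {total_markers}")  # 100, the ~hundred noted above
```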
NEW METHODS
New methods are being developed to circumvent the limitations of the current ones and provide the high-content information needed to investigate the proteome at the single-cell level. They promise to detect proteins with single-molecule sensitivity and at throughput levels similar to what is now achieved in single-cell transcriptome analysis. They are based on single-molecule protein sequencing and fingerprinting. The difference between the two is rather simple. Sequencing identifies the protein of interest by determining its exact and complete amino acid sequence. Fingerprinting determines only a partial sequence and infers the identity of the protein by comparison to already known reference sequences. Either way, if you can identify individual protein molecules in a complex mixture of thousands of different proteins, you can get a good idea of how protein expression varies from cell to cell. This is also a good way to quantitate protein levels without having to resort to reference controls. Some of these new technologies are already in commercial development.
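As a minimal sketch of the fingerprinting idea, the snippet below reduces each reference protein to the same partial view an instrument might produce and identifies a molecule by lookup; the reference sequences and the set of "observable" residues are invented for illustration:

```python
# Hypothetical reference database; a real workflow would use something like
# UniProt. The two-residue "observable" set is also invented for illustration.
REFERENCES = {
    "protA": "MKCLLYTWKD",
    "protB": "MAGLLSTPED",
    "protC": "MKWYYCKKTW",
}
OBSERVABLE = set("KC")  # residues this imaginary assay can actually "see"

def fingerprint(seq: str) -> str:
    """Partial view of a sequence: observable residues kept, the rest masked."""
    return "".join(aa if aa in OBSERVABLE else "x" for aa in seq)

def identify(observed: str) -> list[str]:
    """Return every reference protein consistent with an observed pattern."""
    return [name for name, seq in REFERENCES.items()
            if fingerprint(seq) == observed]

print(identify("xKCxxxxxKx"))  # ['protA'] -- the partial view is unique here
```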
One of the companies working on bringing these technologies to market is Erysion, a startup that came out of the University of Texas in 2018. Their technology is based on what is known as fluorosequencing, essentially a modification of traditional Edman degradation sequencing. In Edman sequencing, amino acids are removed one by one from the N-terminus of the protein and characterized. In Erysion’s approach, amino acid side chains are first chemically labeled with fluorescent tags, and the labeled peptides are immobilized on the surface of an array. As each amino acid is removed from the peptide terminus, the event is detected as a drop in fluorescence output in a particular fluorescent channel. One by one, the peptides are sequenced, and by overlapping the data from many peptides, the sequences of proteins are assembled. What makes this fingerprinting rather than sequencing is that you cannot differentially label the side chains of every amino acid. Only lysine, cysteine, tryptophan, and tyrosine can be labeled this way. This leaves gaps in the sequence, and those gaps are filled by referencing the distribution of these four amino acids in known reference protein sequences. Compared to mass spec, this approach is much more sensitive: it would allow single-molecule detection from far less sample and across a much wider range of individual protein concentrations.
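A toy simulation of that readout, assuming perfect labeling and cleavage chemistry (real data are far noisier), shows how a peptide translates into per-cycle fluorescence drops; the channel names are made up:

```python
# One fluorescent channel per labelable residue type (per the text: lysine,
# cysteine, tryptophan, tyrosine); channel names are made up.
LABELED = {"K": "ch1", "C": "ch2", "W": "ch3", "Y": "ch4"}

def fluorosequence(peptide):
    """Per Edman cycle: the channel whose signal drops, or None for a dark
    (unlabeled) residue. Toy model with perfect labeling and cleavage."""
    return [LABELED.get(aa) for aa in peptide]

# Cleaving A gives no drop; cleaving K drops ch1; and so on down the peptide.
print(fluorosequence("AKCGY"))  # [None, 'ch1', 'ch2', None, 'ch4']
```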
Nautilus Biotechnology is another startup developing a new method to identify proteins. It was founded in 2016 by a group from Stanford University, and their approach is very different. They use antibody binding to determine the protein sequence. Proteins are first immobilized on an array, then repeatedly probed by antibodies, with every round of binding imaged. These antibodies do not recognize any specific protein; instead, they bind to a short epitope only three amino acids long. The antibodies are uniquely tagged so that every round of imaging adds a piece of information. This is repeated many times, and the results are digitized and analyzed to decode the proteome. This approach also has the potential to allow single-molecule identification as well as quantification across a wide dynamic range.
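As a rough sketch of how repeated rounds of short-epitope binding can narrow a molecule down to a single identity, the snippet below compares an observed binding profile against precomputed profiles for a reference set; the probes, epitopes, and protein fragments are all invented for illustration:

```python
# Hypothetical reference proteins and a panel of 3-mer binding probes.
REFERENCES = {
    "fragA": "MEEPQSDPSV",
    "fragB": "MDDDIAALVV",
    "fragC": "MRECISIHVG",
}
PROBES = ["EEP", "DDD", "ISI", "SDP"]  # each probe binds one 3-mer epitope

def binding_profile(seq: str, probes: list[str]) -> tuple[bool, ...]:
    """One bit per probe/round: does the probe's epitope occur in seq?"""
    return tuple(probe in seq for probe in probes)

# Precompute expected profiles, then decode an observed profile by lookup.
expected = {name: binding_profile(seq, PROBES) for name, seq in REFERENCES.items()}
observed = binding_profile("MEEPQSDPSV", PROBES)  # one molecule on the array
print([name for name, prof in expected.items() if prof == observed])  # ['fragA']
```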
QuantumSi was founded in 2015. Their technology is based on time-domain sequencing. Peptides are linked to the surface of a microwell array and exposed to two types of molecules: recognizers and cutters. Recognizers, labeled with fluorophores, bind specific terminal amino acids. Binding events are differentiated not so much by fluorescence color as by their timing characteristics, which differ with the type of binding interaction. Cutters remove the terminal amino acid and allow the next cycle of probing with the recognizers. Eventually, the entire peptide is analyzed and the complete protein sequence can be assembled.
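A toy model of the recognize-and-cut cycle might look like the sketch below, where an idealized, noise-free "pulse duration" lookup stands in for the real timing signatures; the durations and the set of recognizable residues are invented:

```python
# Invented pulse durations (ms) standing in for timing signatures; only some
# residues have a recognizer in this toy model.
PULSE_DURATION = {"L": 5.0, "F": 12.0, "Y": 20.0}

def classify(duration: float) -> str:
    """Call the residue whose expected pulse duration is closest."""
    return min(PULSE_DURATION, key=lambda aa: abs(PULSE_DURATION[aa] - duration))

def read_peptide(peptide: str) -> list[str]:
    """Alternate recognize (measure) / cut (remove) cycles over the peptide."""
    calls = []
    for terminal in peptide:                # the cutter exposes each residue in turn
        if terminal in PULSE_DURATION:      # a recognizer binds and pulses
            calls.append(classify(PULSE_DURATION[terminal]))
        else:
            calls.append("?")               # no recognizer binds: a gap in the read
    return calls

print(read_peptide("LFAYL"))  # ['L', 'F', '?', 'Y', 'L']
```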
Another approach is being developed by Encodia. They use a reverse-translation technology that turns peptide sequences into DNA, which can then be read by DNA sequencing. Recognition agents labeled with DNA encoding tags bind N-terminal amino acids, and those tags are used to build a DNA library for sequencing.
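A minimal sketch of this "reverse translation" idea: each recognized residue deposits a short DNA codeword, the codewords accumulate into a read, and sequencing the read recovers the peptide. The codewords below are made up:

```python
# Made-up DNA codewords, one per amino acid encoded in this toy example.
CODEWORD = {"A": "AACC", "G": "GGTT", "K": "CAGT", "S": "TTGA"}
DECODE = {dna: aa for aa, dna in CODEWORD.items()}

def encode(peptide: str) -> str:
    """Cycle over N-terminal residues, appending one DNA tag per residue."""
    return "".join(CODEWORD[aa] for aa in peptide)

def decode(read: str, k: int = 4) -> str:
    """Recover the peptide by splitting the sequencing read into codewords."""
    return "".join(DECODE[read[i:i + k]] for i in range(0, len(read), k))

read = encode("GASK")   # the DNA library insert for one peptide
print(read)             # GGTTAACCTTGACAGT
print(decode(read))     # GASK
```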
Protein identification based on sequencing would have a number of advantages over more conventional methods. The obvious ones are sensitivity and content. Single-molecule identification would allow the high-content analysis needed for the single-cell proteome. It would also allow accurate protein quantification, even with much less sample.
Of course, methods based on protein sequencing have some limitations. They do not address the post-translational modification of proteins. Alternative methods based on nanopore sequencing might be better positioned to address this, but pore-based sequencing is still in early development, with the current focus on improving the accuracy of readouts with new pore proteins.
CONCLUSION
In the next three to five years, we should expect
to see several different technologies coming to the market. They promise to
finally allow proteomics to catch up with genomics and transcriptomics. This
will be very important in many areas, from early drug development to
diagnostics.