h a l f b a k e r y
Not from concentrate.


Homopolymers begone

When DNA sequencing, add corruption to remove systematic errors
  (+7)

So when you're sequencing some genome... as you do... using one of the high-throughput third-generation sequencing technologies, one of the annoyances of the data is indels in homopolymer runs.

In essence: Both pacbio and nanopore long-sequence reads have problems in determining the exact length of a run of the same base, so the resultant sequence exhibits systematic errors which can't be fixed by higher read-depth (reading the same DNA template more times). This is true even though they're very distinct technologies, so it's probably going to be an issue until it's fixed.
Obviously, one can resolve many of these with a short read technology (e.g. illumina), since these have a different error profile. However, this may not be a panacea, since they may not be able to resolve repeats.

Now, you may say, hey, it's just one more or less base in a run of umpteen bases in a set of degenerate regions, who cares?
But I say no I'm sequencing this shit and I want it to be right, dammit.[1]

So. How do we resolve this?
The long read tech is basically trying to read off the bases in a single molecule of DNA as it goes through an enzyme or channel of some kind, somehow. The issue arises because the signal of each base in a run (of the same base) is indistinguishable. You know how if you're trying to count a set of identical things, sometimes you lose your place? Well, that.
The solution to losing your place is to add variety, so that you are able to avoid slipping position.
I propose that we add a pre-processing stage which corrupts or substitutes a fraction of the bases, such that they give a distinct signal. Obviously it's not as simple as just bollixing bases in random ways; the product will need to pass through the molecular machinery without jamming it up. This will be highly dependent on what the sequencing technology is doing.
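To make the problem concrete, here is a minimal sketch that just lists the homopolymer runs in a sequence, i.e. the places where a long-read technology is likely to lose count. The function name and the `min_len` threshold are illustrative assumptions, not part of any sequencing toolkit.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. min_len is an arbitrary threshold."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs

example = "ACG" + "T" * 7 + "GC" + "A" * 6 + "CG"
print(homopolymer_runs(example))  # [(3, 'T', 7), (12, 'A', 6)]
```

Each reported run is a candidate site for a length miscall; the pre-processing stage proposed above would aim to break these up.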

As I understand it, nanopore sequencing essentially observes the leakage of ions around the DNA as it passes through a pore. Provided the DNA can pass through the associated unwinding enzymes, perhaps any change will help.
However, one (pretty elegant, I think) approach would be to find a methylase enzyme which methylates one of the bases of a run. For genomic sequencing, at least two would be required - one for polyA or polyT, and one for polyG or polyC. Four enzymes, one for each nucleotide, would however be desirable to give complete coverage. NEB currently supplies EcoGII Methyltransferase commercially, which apparently indiscriminately methylates adenine residues (N6), so that's a start - and this would be immediately testable.
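A rough way to think about how much methylation would be needed: the sketch below marks each base of a run independently at random and asks how often the longest unmarked stretch stays short enough to count. Everything here is an assumption for illustration: the independent-marking model, the `mark_fraction` values, and the `max_countable` threshold are hypothetical, and nothing in it is measured from EcoGII or any real enzyme.

```python
import random

def longest_unmarked_stretch(run_length, mark_fraction, rng):
    """Mark each base of a homopolymer run with probability mark_fraction
    and return the longest stretch of consecutive unmarked bases."""
    longest = current = 0
    for _ in range(run_length):
        if rng.random() < mark_fraction:
            current = 0          # a marked base resets the count
        else:
            current += 1
            longest = max(longest, current)
    return longest

def fraction_resolved(run_length, mark_fraction, max_countable=5,
                      trials=10000, seed=1):
    """Fraction of simulated molecules in which no unmarked stretch
    exceeds max_countable (a hypothetical 'still countable' limit)."""
    rng = random.Random(seed)
    ok = sum(
        longest_unmarked_stretch(run_length, mark_fraction, rng) <= max_countable
        for _ in range(trials)
    )
    return ok / trials

for fraction in (0.1, 0.2, 0.4):
    print(fraction, fraction_resolved(run_length=15, mark_fraction=fraction))
```

Because each molecule gets a different random pattern, reads of the same region would carry marks in different places, which is part of why this might help where simply reading deeper does not.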

Pacbio essentially reads off individual base additions as DNA is synthesised by DNA polymerase, through detection of a side-product. That's a bit more finicky, because the inserted base needs to match specifically. One -pretty extreme- approach would be to try to /insert/ additional bases into the strand. Ideally, non-canonical bases forming a new base-pair. Being distinct isn't completely essential, though. And I note there is precedent for such an enzymatic system in the RNA insertion editing pathway of Trypanosoma brucei.
Pacbio has the option of forming a loop of DNA, the bases of which can be read in multiple cycles. This currently seems to solve everything except homopolymer run lengths over about 6. So if we could incorporate about one additional base per 5 originals, or alter about one in six so that it gives a non-canonical read, essentially all of the systematic errors would be solved.
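As an idealised sketch of the "one additional base per 5 originals" version: insert a marker after every fifth base within a run, sequence the marked strand, then strip the markers from the read. The marker symbol `X` and the perfectly regular spacing are assumptions for illustration; a real enzymatic insertion would presumably be far less tidy.

```python
from itertools import groupby

MARKER = "X"  # stand-in for a hypothetical non-canonical base

def add_spacers(seq, every=5):
    """Insert MARKER between consecutive blocks of `every` identical bases."""
    out = []
    for _, group in groupby(seq):
        run = "".join(group)
        chunks = [run[i:i + every] for i in range(0, len(run), every)]
        out.append(MARKER.join(chunks))
    return "".join(out)

def strip_spacers(seq):
    """Recover the original sequence from a marked read."""
    return seq.replace(MARKER, "")

def longest_run(seq):
    """Length of the longest homopolymer run in seq."""
    return max((len(list(g)) for _, g in groupby(seq)), default=0)

original = "GC" + "A" * 12 + "TG"
marked = add_spacers(original)
print(marked)                              # GCAAAAAXAAAAAXAATG
print(longest_run(marked))                 # 5
print(strip_spacers(marked) == original)   # True
```

With no run in the marked strand longer than 5, every stretch falls within the range that circular reads already seem to handle.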

[1] But /is/ this a massive engineering effort for mostly very little gain?
I spent a great deal of time and effort correcting 415 errors in a pacbio-generated 6 megabase microbial genome; all but one were homopolymer indels. Most other genome sequencers don't bother; probably most of those in smaller labs getting individual organisms of interest sequenced don't even realise there's an issue. However, that probably affects something like 5% of protein sequences in these genomes, so it's not insignificant.
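Why a single miscounted base matters for proteins: inside a coding region it shifts the reading frame, so every codon downstream changes. A toy illustration follows; the sequence is made up for the example and is not from the genome mentioned above.

```python
def codons(seq):
    """Split a coding sequence into complete triplets, starting at position 0.
    Any trailing one or two bases are dropped."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

true_seq  = "ATGGC" + "A" * 6 + "GTCTGATTAA"   # run of six As
miscalled = "ATGGC" + "A" * 5 + "GTCTGATTAA"   # same run read as five As

print(codons(true_seq))
# ['ATG', 'GCA', 'AAA', 'AAG', 'TCT', 'GAT', 'TAA']
print(codons(miscalled))
# ['ATG', 'GCA', 'AAA', 'AGT', 'CTG', 'ATT']
```

After the run, every triplet differs and the stop codon (TAA) is no longer in frame, so the predicted protein is wrong from that point onwards.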

***

I think Max would have liked this idea, so I dedicate it to his memory.

Loris, Sep 10 2020







       Is this a genuine idea, or is it a HOX ... ?   

       <wind blows/>   

       ......@...... @...... @ ......
8th of 7, Sep 10 2020
  

       You recognised it as a witticism ? And then added to the pun ?   

       Are you quite sure you're American ?
8th of 7, Sep 10 2020
  

       Your Homeland Security file must be interesting ... "Can read, has travelled outside home state ... dangerous subversive intellectual, to be kept under close surveillance ... "
8th of 7, Sep 10 2020
  

       Dat's der puppy ...   

       Question: Do they pay for the laundry of the white robe with the pointy hood, or do you offset it against tax as a legitimate expense, or do you just have to do it yourself ? Presumably it counts as obligatory "work uniform" ...   

       Bloodstains can be so stubborn, even with biological washing powders.
8th of 7, Sep 10 2020
  

       Fair enough. And if you want to know, that's one of the reasons for BorgCo's hostile takeover of ACME Corp. It's not something we're ashamed of or anything. Frustrated and embarrassed, yes, but not actually ashamed ...
8th of 7, Sep 10 2020
  

       (Disclaimer: this is WAAY outside my knowledge-base...)
Q1: Can you run 2 scans in parallel (at a molecularly-close proximity)? As in, they work along in lock-step.
Q2: Can you create an "artificial" DNA-like molecule that the scan CAN differentiate accurately between the "bases"?
If "yes" to these, create the "fake" with a different base precisely every 10 (or whatever) steps. So, as the "actual" molecule is read, the "fake" provides a count.
neutrinos_shadow, Sep 10 2020
  

       I believe the answers to these are:
Q1 - not with current technology
Q2 - not with current technology
  

       Nice try, though.   

       Although... short read technologies (i.e. illumina, 454) don't have this run-length issue, because they do almost exactly this - read off a lot of molecules (of the same sequence) in lockstep.
But they're short read technologies at least in part because some fraction of the population falls out of synchrony in each cycle.
  

       To be honest, I'd say it's pretty impressive that sequencing technologies work as well as they do. They're almost ridiculous.
Loris, Sep 10 2020
  

       //it's pretty impressive that sequencing technologies work as well as they do. They're almost ridiculous.//   

       Indeed, we're now in the genomic age. Not back when we first did the human genome, that's just one, and only bits of it too. Now we can actually compare and contrast, which is where all the interesting stuff will be. There must be a lot of worried criminals right now.   

       Anyhow, how about a nucleic acid guided methylase? You can choose your guide, say 5 x T, and then you'll get a methylated pulse every 5 A bases that you can use to get a handle on the length, incubate, then just melt it off before sequencing.
bs0u0155, Sep 10 2020
  

       // how about a nucleic acid guided methylase? //   

       <Obligatory gratuitous Ethyl Methane Sulphonate reference/>
8th of 7, Sep 10 2020
  

       Reading this posting and its annotations has made my head hurt and it will probably be days before I get to look up half of these words if I even remember to, and... aurgh!   

       //this is WAAY outside my knowledge-base//   

       That makes a pair of us.   

       But this is how we learn. (Well, this and the data of regrettable experience). [+]
pertinax, Sep 11 2020
  

       Sorry about that.
Would a glossary help? These are very informal definitions:
  

       base : in this context, a single constituent unit of DNA (a monomeric unit). In standard DNA this is one of four options : A, G, C or T (which stand for the chemical names). Bases can be methylated and still recognised as the same base by molecular machinery.
DNA : a polymeric molecule in which an organism's genetic information is encoded
enzyme : a protein which performs some reaction; think of it as a biological machine carrying out some function.
genome : all the genetic information of an organism
high-throughput : the ability to do something lots - often by running the process massively in parallel.
homopolymer : a polymer (or part thereof) comprising a series of identical monomeric units
illumina : a short-read sequencing technology
indel : insertion or deletion of a base (or multiple bases), either as a mutation or as a sequencing error
long-read sequencing technology : approximately, any method of reading out several kilobases of DNA sequence at a time
monomer : a small molecule which can be joined to other similar molecules to form a polymer
nanopore [sequencing]: a long-read sequencing technology
nucleotide : another word for base. (the word 'base' is often used for counting numbers of monomeric units, or in the abstract, and is not nucleic-acid-specific, while 'nucleotide' is explicitly nucleic-acid, and tends to be used to refer to what the nature of the monomer is - i.e. which of A,C,G,T it is)
nucleic acid : DNA (or RNA, or some other related molecule typically encoding genetic information)
methyl~ (methylation, methyl group etc) : A small part of a molecule, comprising a carbon and three hydrogen atoms.
methylase : an enzyme which adds a methyl group to a molecule. (also, methyltransferase; an enzyme which moves a methyl group from one place to another.)
pacbio [sequencing]: short for Pacific Biosciences sequencing; a long-read sequencing technology
PCR : Polymerase Chain Reaction; a method of generating many copies of a DNA sequence from a template molecule. Uses primers to define the ends.
polymer : a chemical made up of a chain of small units. (Some are branched, but DNA is generally not)
polymerase : an enzyme which builds a polymer out of monomeric units.
short-read sequencing technology : approximately, any method of reading out less than about a kilobase of DNA sequence at a time. Initial versions could often read only a few tens of bases, but technology has increased the lengths significantly.
primer : in this context, a short molecule (of the order of 17 bases) with a sequence making it capable of specifically binding to a DNA molecule of interest. It can then be extended by a polymerase.
sequencing : determining the order of As, Gs, Cs, and Ts in some DNA.
systematic error : an error which is inherent to a measuring process and hence can't be fixed by e.g. repeated readings and averaging.
umpteen : slang - lots; about 10 or more.
Loris, Sep 11 2020
  

       Good work, thank you [Loris].
pertinax, Sep 11 2020
  

       When you use short-read sequencing, how do you address a specific section of the longer molecule? Do you have to dissect it first, and then remember carefully where you put the severed sections?
pertinax, Sep 11 2020
  

       //When you use short-read sequencing, how do you address a specific section of the longer molecule? Do you have to dissect it first, and then remember carefully where you put the severed sections?//   

       There are ways and ways.   

       1) Old-style sequencing, where you get one sequence per reaction:
eg "Sanger sequencing". You would address a specific region of a larger DNA molecule, by designing a "primer" - a short piece of DNA (something like 17 bases long) which is complementary to (exactly matches) the region just before the part you're trying to read. This anneals (sticks to) the matching sequence in the sequencing reaction and can be extended using a polymerase.
Works well, and very useful for "finishing" - completing any awkward parts of an unfinished sequence, checking mutations and so on. [1] Nowadays reads may be over a kilobase of decent sequence.
It can be scaled up, but scale becomes an issue - the original human genome used warehouses of sequencing machines, each the size of a fridge-freezer, which would run out something like 384 independent reactions at a time, supported by an army of technicians.
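As a sketch of the primer idea: take the bases just before the region you want to read, and that short sequence will base-pair with the opposite (template) strand at exactly one place, from which the polymerase extends into the target. The function names and the 30-base example sequence are made up for illustration; real primer design also considers melting temperature, uniqueness in the genome and so on, none of which is modelled here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def design_primer(reference, target_start, length=17):
    """Primer = the `length` bases immediately before target_start on the
    strand we want to read (the default of 17 follows the text above)."""
    if target_start < length:
        raise ValueError("not enough sequence before the target for a primer")
    return reference[target_start - length:target_start]

def anneal_site(template_strand, primer):
    """Index on the template strand (written 5' to 3') where the primer
    base-pairs, or -1 if it does not bind anywhere."""
    return template_strand.find(reverse_complement(primer))

reference = "ATGCGTACCTGAGTCAAGGCTTACGATCCA"
template = reverse_complement(reference)     # the strand the primer sticks to
primer = design_primer(reference, target_start=20)
print(primer)                                # CGTACCTGAGTCAAGGC
print(anneal_site(template, primer))         # 10
```

Extension from the primer's end then reads out `reference[20:]`, the part just after where the primer sits.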
  

       2) "Shotgun sequencing"
Smash up your large DNA molecule(s) into bits, tag at least one of the ends of each (methods vary), and read along from there.
You get back a 'jigsaw puzzle' of overlapping - and possibly error-ridden - ranges which can be pieced together.
It's possible to do this with Sanger sequencing; you create a library of 'clones' - bits of your sequence held in a plasmid vector, or similar - but this is still beholden to the one sequence per reaction condition.
So the 'high throughput' methods do multiple reads in a single reaction. Methods vary enormously - 454 does its PCR in an emulsion, attaching the template to beads, illumina attaches the template strands to a glass slip, then does 'bridging PCR' to form a patchwork of attached clones, and so on.
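The 'jigsaw puzzle' step can be sketched as a toy greedy assembler: repeatedly merge the two pieces with the longest suffix-to-prefix overlap. This is a deliberately naive illustration with error-free reads from one strand; real assemblers handle sequencing errors, both strands and repeats, and the names and `min_len` value here are assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if it is shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by best overlap; returns the remaining contigs."""
    contigs = list(dict.fromkeys(reads))          # drop exact duplicates
    while True:
        # discard any piece wholly contained in another piece
        contigs = [c for c in contigs
                   if not any(c != d and c in d for d in contigs)]
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs                        # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

genome = "ATGGCGTACGTTAGCCTAGGATCCA"
reads = [genome[0:10], genome[6:16], genome[12:22], genome[17:25]]
print(greedy_assemble(reads) == [genome])   # True
```

If the genome contains a repeat longer than the reads, two different joins look equally good and this approach can mis-assemble, which is the sense in which short reads may not resolve repeats.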
  

       Note that even the long-read technologies will often be using a shotgun strategy - they're not long enough to read out an entire molecule of biologically realistic length.   

       [1] Another old technology - "Maxam–Gilbert sequencing" used terminus marking and a strand breakage strategy, but it was technically demanding and at best only gave passable data. It saw some early popularity due to parochialism and a single 'pro' vs the initial Sanger method, but the many 'cons' meant it became obsolete as Sanger-style sequencing improved.
Loris, Sep 11 2020
  

       Well I feel better now.   

       Outside of my wheelhouse as well. But I will [+] for the dedication to our dear friend who left us because he would have understood it and splained it to me when I asked. :-(
blissmiss, Sep 11 2020
  

       What [bliss] said.
8th of 7, Sep 11 2020
  

       Can't we just put a sewing machine pin in where the count was stopped, and then restart the machine later? I have a cork board if that helps...
RayfordSteele, Sep 11 2020
  

       Doesn't count slippage just mean the read head of the nano pore is not specific enough and a finer or more correctly, a more massive base disturbance in the measurement variable is needed. Can DNA conduct free electrons without breaking apart?
wjt, Sep 13 2020
  

       bliss, just ask, I'll splain it- or at least try to. I splain stuff to myself with imaginary sock puppets all the time.   

       //Doesn't count slippage just mean the read head of the nano pore is not specific enough and a finer or more correctly, a more massive base disturbance in the measurement variable is needed. Can DNA conduct free electrons without breaking apart?//   

       It's not really slippage, it's a failure to read out every base as exactly one base.
I've no idea about DNA's conductivity, but that's not what is going on.
  

       In nanopore's case, the equipment is measuring the leakage of ions past the DNA as it goes through the pore. The readout at any instant is a function of multiple bases, plus some noise. How they deconvolve that seems more involved than I want to look into in detail, but my naive model is that they're using a hidden Markov model or similar to guess the oligomer which is present in the pore, given the very high time-resolution measurement series. The DNA doesn't pass through the pore at a linear rate, and perhaps might even go backwards (because Brownian motion) so if the sequence is all the same base, i.e. a homopolymeric tract, they have to guess how long the run is based on the time it takes to get through.   
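A toy version of that last point: if each base takes a random time to pass and the signal doesn't change within the run, the only handle on run length is total time divided by the average time per base. The exponential step times and the rounding estimator below are assumptions chosen for simplicity; this is not how any real basecaller works.

```python
import random

def simulate_dwell_time(run_length, mean_step, rng):
    """Total time for run_length bases, each taking an independent
    exponentially distributed time with the given mean (an assumption)."""
    return sum(rng.expovariate(1.0 / mean_step) for _ in range(run_length))

def estimate_run_length(total_time, mean_step):
    """Naive estimate: how many average steps fit in the observed time."""
    return max(1, round(total_time / mean_step))

def fraction_correct(run_length, trials=5000, mean_step=1.0, seed=0):
    """Fraction of simulated passes in which the naive estimate is exact."""
    rng = random.Random(seed)
    hits = sum(
        estimate_run_length(simulate_dwell_time(run_length, mean_step, rng),
                            mean_step) == run_length
        for _ in range(trials)
    )
    return hits / trials

for n in (2, 5, 10, 20):
    print(n, fraction_correct(n))
```

Under this model the estimate gets less reliable as the run gets longer, since the spread in total time grows with the number of steps.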

       For pacbio on the other hand, they're detecting a fluorescent molecule created through the addition of a base to a synthesised DNA chain. DNA polymerase is in general very reliable, making very few errors in copying the chain. However, the read-out is not so accurate. The fluorescence is a different colour for each nucleotide. Each DNA strand being read is at the end of a short tunnel, and the fluorescence is only detected while the fluorescent molecule diffuses out of the tunnel because it acts as something called a zero-mode waveguide. Don't ask me how that works, I don't know. Anyway, that escape may be faster or slower than average - it's random - and maybe there's some measurement error or the fluorescence decays (because generally they do). The time between base additions is also random, because that too relies on diffusion.
It's probably relevant to mention that the raw base reads are very error-prone - something like 5-15% error is generally reported, but I don't know what the profile is on that.
So normally, this is resolved by getting multiple reads and reconciling them. This works well for the most part because the reads are very long. But it isn't so reliable for base runs.
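To illustrate why reconciling multiple reads fixes random errors but not a systematic one, the sketch below takes the most common run-length call across many simulated reads. The error probabilities are invented for the example and are not measured pacbio or nanopore rates.

```python
import random
from collections import Counter

def call_run_length(true_length, p_under, p_over, rng):
    """One read's call of a run length: one base short with probability
    p_under, one base long with probability p_over, otherwise correct."""
    r = rng.random()
    if r < p_under:
        return true_length - 1
    if r < p_under + p_over:
        return true_length + 1
    return true_length

def consensus_run_length(true_length, depth, p_under, p_over, seed=0):
    """Most common call across `depth` independent reads."""
    rng = random.Random(seed)
    calls = [call_run_length(true_length, p_under, p_over, rng)
             for _ in range(depth)]
    return Counter(calls).most_common(1)[0][0]

# Unbiased noise: the errors cancel out in the consensus.
print(consensus_run_length(8, depth=200, p_under=0.1, p_over=0.1))
# Hypothetical bias towards undercounting: more depth does not help.
print(consensus_run_length(8, depth=200, p_under=0.7, p_over=0.05))
```

In the second case the consensus settles confidently on the wrong length, which is what makes the error systematic rather than something read depth can average away.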
Loris, Sep 13 2020
  

       Why don't we yet have a proper magnet based DNA reader? Using an extremely thin probe it would physically move up and down the DNA strand a few dozen times and infer the molecule from magnetic field strength. It doesn't need to touch it, just be very close.   

       Or better yet, an RNA based device that pulls the DNA strand through and outputs the data in the form of electrical pulses.
Voice, Sep 13 2020
  

       //Why don't we yet have a proper magnet based DNA reader? Using an extremely thin probe it would physically move up and down the DNA strand a few dozen times and infer the molecule from magnetic field strength. It doesn't need to touch it, just be very close.//   

       An electron microscope approach?
Over 10 years ago there was this TV show about a mathematician helping out a detective, (called num three ers; I dunno..). And in one episode there's a virus or something, and the maths guy gets an EM image of maybe 10 or 20 bases of nucleic acid, and is like ... "Oh, I'm just trying to get a very early idea ahead of the labs".
I was... well : "Aaargh, no! That's so wrong in so many ways...".
  

       //Or better yet, an RNA based device that pulls the DNA strand through and outputs the data in the form of electrical pulses.//   

       I'd say both these suggestions were wishful thinking, but ... hell, I'd have said what is now currently existing technology was crazy magic at that point, so I wouldn't rule them out as a future option.
But making them work is probably harder than idly speculating in chat.
Loris, Sep 13 2020
  

       So, you line up the dna along a spiral track on an LP, find the finest pickup needle, and amplify the result with a warm-sounding vacuum tube...
RayfordSteele, Sep 13 2020
  

       Really, this comes down to finding a system that is part ribosomic-like stepping method and part amplification of the unique metric each base generates.   

       It won't matter that the strand acts nothing like it acts in the nucleus so long as the strand stays in sequence and gives the needed identifying signals.
wjt, Sep 14 2020
  

       //Really, this comes down to finding a system that is part ribosomic-like stepping method [...]//   

       I'm not sure you meant ribosome - they're the molecular machinery which translate RNA sequence into protein, and are just massively more complicated than DNA polymerases.
However, just in case you did mean that, it's not an idea completely without possibility. Ribosomes interpret, and step along the RNA (not DNA, but let's not worry...) three bases at a time. If it wasn't for all the horrendous complexity of setting things up, and extracting the information, it might be useful to read out multiple bases at once with some sort of ribosome-based technology. There is another sequencing technology, SOLiD, which progresses in such longer steps (not involving ribosomes)... but it's a short read tech, rather involved to use and although it did have some advantages, the platform didn't really take off.
  

       //[...] and part amplification of the unique metric each base generates.
It won't matter that the strand acts nothing like it acts in the nucleus so long as the strand stays in sequence and gives the needed identifying signals//
  

       Um, yes. This is the basis of all sequencing tech.
Loris, Sep 14 2020
  

       Q What do you call gay plastics?
A Homopolymers
xenzag, Sep 14 2020
  

       Why thank you [Loris], for the splaining. I certainly get it now...I'll gets back to you with a snappy comment in hmmm...a year or so.
blissmiss, Sep 14 2020
  

       What about a myosin-actin engine dragging the strand* to be sequenced, up through the pore against gravity or heavy weight termination. The actin filament might give a stepped frame reference. A force stress on the polymer might even exaggerate base signals.   

       * if it can be attached, is to scale and works distance-wise.
wjt, Sep 15 2020
  

       //What about a myosin-actin engine dragging the strand* to be sequenced, up through the pore against gravity. The actin filament might give a stepped frame reference.//   

       I have no idea, and only the enormous development cost and time needed to test such a thing is stopping me.   

       Well. That and I'm not sure that would be an improvement, to be honest. I don't think gravity is significant at that scale. In nanopore sequencing, I think the DNA is actively driven by an enzyme, a helicase. (I just need to go out at short notice, so can't check right now.) I kind of doubt that a muscle-based strategy would be a big win, to be honest. There are probably some issues with setting up to pull the DNA rather than push it (as nanopore does now) as well.
Loris, Sep 15 2020
  

       //the region just before the part you're trying to read//   

       Is there any meaningful difference between "just before" and "just after" in this context?
pertinax, Sep 15 2020
  

       //Is there any meaningful difference between "just before" and "just after" [the part you're trying to read] in this context?//   

       Yes.
DNA has a polarity - that is, a direction. Basically the monomers aren't symmetrical along the chain. Without going into the gory chemical details (which I'd have to look up anyway), people talk about the 5' ("five prime") and 3' ("three prime") ends. (These names relate to that chemistry.)
All known polymerases extend the chain by adding nucleotides to the 3' end. Ribosomes also process RNA in the same direction. (As an aside, DNA sequence is canonically written with the 5' end on the left, and the 3' end on the right, at least in the English speaking world - so you can read it in the same direction as the cellular machinery.)
It's probably worth mentioning that the two DNA strands of a duplex molecule are "anti-parallel", that is, they run in opposite directions.
Loris, Sep 15 2020
  

       Helicase?, I wasn't imagining the need to messily unzip the DNA, rather just topologically expose each base pairing for signal interrogation. But then I have been imagining a lot of fanciful stuff around this one.   

       Then again there aren't the physical methods or environmental variables that can manipulate DNA at base to base bond scale. Life all seems to rest on large molecular weight machinery manipulating the ladder.   

       One day soon*, protein engineers may design a large molecule that changes as it traverses the major groove, such that it communicates the sequence to our higher scale. *Might not be alive.
wjt, Sep 17 2020
  

       //Helicase?, I wasn't imagining the need to messily unzip the DNA, rather just topologically expose each base pairing for signal interrogation. But then I have been imagining a lot of fanciful stuff around this one.//   

       I think it's chemically possible to flip out a base, IIRC I read about a DNA binding protein which does that (in the process of interacting with its specific binding site). Many DNA binding proteins pattern-match their binding site from the side, impinging into the large (or, sometimes, the small) groove. But in terms of reading out the information in arbitrary sequence, I think biological processes are -without exception- all about matching up base-pairs. It's the easiest approach.   

       //Then again there aren't the physical methods or environmental variables that can manipulate DNA at base to base bond scale. Life all seems to rest on large molecular weight machinery manipulating the ladder.//   

       At the molecular scale, I think that the processes involved are really quite "messy", and a large part of the nature of biochemistry is about managing that.
When people talk about nanobots it's easy to imagine little machines which work like robots, but the reality is - that's just not feasible. Sure there are machines (scanning tunneling microscopes) which can visualise and move around individual atoms, but (a) under very specific circumstances - atoms on a flat surface in a vacuum at a few degrees above absolute zero, without vibration; and (b) sure the scanning tip is small, but the functional parts of the machine as a whole are very much larger.
  

       //One day soon*, protein engineers may design a large molecule that changes as it traverses the major groove, such that it communicates the sequence to our higher scale. *Might not be alive.//   

       Possibly, but the difficult part there is communicating the information, not crawling along the DNA. Unwinding the DNA is really not an issue, it's done routinely.
Loris, Sep 17 2020
  

       //the processes involved are really quite "messy"// Yeah, that's what I find fantastic, that such a complexity of machinery can work with Brownian motion influence. I more imagine there is an overarching, guiding dimension at work, such as all molecules are little magnets and it is that that makes the clock work, underpins protein folds and the like. To have a transported amino acid come out of the set when needed, has to be more than just random bump.   

       //communicating the information// I did imagine using an expanding wave front to magnify surface bumps. Maybe a network of printer bubble jet heads to pulse a medium fluid and create an engineered wave scope. But even to me it sounds too out there.
NcNcNH2..OcN or from the other edge cNccO..2HNcCC
wjt, Sep 19 2020
  
      
  


 
