It doesn't look like that nucleotide sequence is actually unique to both SARS-CoV-2 and Moderna's 2015/2016 patents. Just BLASTed that sequence with the "gss" database and got hits¹ for a bunch of different species of grass (plus some other hits for everything from bacteria to a species of dolphin). Tried again with the "RefSeq_Gene" database (ostensibly the human genome, AFAICT) and got a bunch of hits² for that, too.
I ain't a geneticist, so it goes without saying that I don't really know what I'm doing and I'm probably using BLAST wrong, but it seems to me like (EDIT: something very close to) CTCCTCGGCGGGCACGTAG is pretty dang common even within the human genome (let alone other species), and it doesn't seem far-fetched to me that SARS-CoV-2 might've yanked (EDIT: something very close to) that sequence from one of its hosts at some point.
EDIT: I forgot to check the completeness of the matches, and it does look like the hits tend to have one or two nucleotides that don't match. Still, it's close enough that the article doesn't seem like it presents much of a smoking gun.
Archive link from today: https://web.archive.org/web/20220114174712/https://arkmedic....
I like how Hacker News has now become a COVID-19 truther board.
Every day a random Substack blog by some slatestarcodex-like "rationalist" out of their field.
A large part of this blog post's ridiculous assertion rests either on ignorance or on the false assumption that horizontal gene transfer does not exist in Nature:
> It’s easy enough to change a single nucleotide (a single point mutation or SNP) or even insert or delete nucleotides (less common) but to insert 20 or 30 nucleotides with a code that works? Nope, that has to come from another virus or else it’s been done in a lab.
The original preprint which noted similarity of the HIV-1 sequences and those in the COVID spike protein, and on which this blog author's claims rest, was voluntarily withdrawn by its authors.
Moreover, the assertions of uniqueness have been thoroughly addressed in the published article, "HIV-1 did not contribute to the 2019-nCoV genome":
> For any virus to obtain additional insert sequences from other organisms, it requires that it has direct interactions with other organisms, most likely through homologous or non-homologous recombination. [...] On the contrary, these motifs are widely present in various mammalian cells and so it will be more likely for bat CoV viruses to gain those motifs from the genomes of their infected cells if recombination indeed occurs.
That rebuttal article clearly suggests several feasible sources of the gene in Nature. That surely is more plausible than suggesting that Moderna or some other group nefariously created the COVID genome _in vitro_ before unleashing it upon humanity.
I deeply enjoy Hacker News for its ability to surface different ideas and sources of information that I would otherwise never find, but this article is simply ridiculous.
for those with SciAm subs, this June 2020 print article is a profile of the first scientist to encounter Covid. She had spent a decade in caves around China investigating SARS and was therefore called on when early reports of Covid arose.
I read this before all the lab leak theories came out, and I'm glad I did because it makes them all seem ridiculous.
This guy appears to also believe that the omicron variant is not a natural mutation but a lab-made version as well.
BLAST does not contain all genetic material from all viruses or organisms in all of the natural world. There are countless (really, impossible to count) viruses of all kinds out there dancing around in the bodies of all kinds of creatures. There are literally hundreds of trillions of viruses in your body right now (not different kinds but individual viruses). Each of those hundreds of trillions of replications was an opportunity to mutate. Now multiply that by all of the individual animals human or otherwise who could harbor coronaviruses and you have many hundred billions of trillions of chances to generate the gene sequences in question. If I could get billions of trillions of lottery tickets, I'd win the lottery, and here it looks like SARS-CoV-2 won the "does a gene sequence in the virus exist in a patent by moderna" lottery.
Humans really are just pattern searching animals.
I get that there's a lot of fear around the idea that Covid-19 was engineered, and the potential fallout that could occur if it was proven to be fact. But I expect that almost all governments are now working under the assumption that it was, because they would be extremely foolish not to. So this is all a moot point -- at the top level, the working theory must be the lab-leak theory.
And so, fast vaccine production must be a priority. The development of drugs to treat viral infections and their manufacture must also be a priority. The ability to detect carriers and quickly shut down travel must also be a priority. And how do we enforce quarantine in populations that are resistant to it? All of these concerns are, I am certain, at the forefront in the mind of any Western government to be certain, but likely any government anywhere.
Because regardless of IF Covid-19 was created in a lab, it COULD HAVE BEEN created in a lab. And we all got caught with our pants down.
As other commenters point out, this "analysis" is a joke. An E()-value of 282 says we would expect a match of this short sequence against a completely random library of the same size 282 times. Even with BLASTp's parameter correction for short sequences, NO short sequence will generate a statistically significant score.
Let's try a more sensible approach. Let's see how often the sequence occurs in other databases. The author's 1/20^6 (1/64,000,000) probability calculation omits the fact that every database search considers many millions of possible alignments (a crude calculation would be 6 * the database length / 20^6, or 6 * 1.6E9/20^6 ~ 10000E6/64E6 = 156 times by chance), and we're shown that in a 5,000,000 protein/1.6E9 residue database, 100% matches are seen several dozens of times, as expected.
How about other protein sets? Search the human protein set (19,000 proteins) and there are no 6 amino acid 100% identical matches. The best matches are 100% of 5 residues, or 83% identity over 6 (with an E()-value of 22).
How many proteins do we need to search to find an exact full-length match? (We have dozens of matches in a 5,000,000 viral protein database, and none in 19,000.) If you search all the NCBI landmark sequences (~500,000, non-redundant), there are 3 identical matches (E()-value: ~44) in Drosophila, corn, and Strep. pneumoniae. So this sequence is not very special -- as the E()-value statistics indicate, it happens all the time by chance. (If it happens 3 times in a 500,000 sequence non-redundant database by chance, we would expect it to happen 30 to hundreds of times in a 5,000,000 redundant viral database by chance, which is what we see.)
Six amino acid non-significant matches are evidence that properly done statistics are accurate, and pretty much nothing else.
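For anyone who wants to redo the crude chance-match arithmetic from this comment, here is a minimal sketch. It assumes the 20 amino acids are equally likely and independent (a simplification), takes the factor of 6 from the comment as-is, and uses a database size of about 1.6E9 residues, which is what the "10000E6/64E6" step implies. The function name is made up for illustration and is not part of BLAST or any library.

```python
# Rough expected number of exact chance matches for a short peptide query.
# Assumption: residues are uniform and independent, which real proteins are not.

ALPHABET_SIZE = 20  # standard amino acids


def expected_chance_matches(query_len: int, db_residues: float, factor: int = 6) -> float:
    """Crude expectation: factor * database length / ALPHABET_SIZE ** query_len."""
    return factor * db_residues / ALPHABET_SIZE ** query_len


if __name__ == "__main__":
    print(f"per-alignment odds: 1 in {ALPHABET_SIZE ** 6:,}")
    # Hypothetical database size of 1.6e9 residues, as implied by the comment's numbers.
    print(f"expected chance matches: {expected_chance_matches(6, 1.6e9):.0f}")
```

With these inputs the estimate comes out at roughly 150, in line with the comment's rounded figure of 156. The point is the order of magnitude: exact 6-residue matches are expected many times over in a database that large.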
Just to be clear, this doesn't imply that Moderna released the virus. Patents are publicly visible, after all. It suggests that someone was told to insert a furin cleavage site, then went and "stack overflow copy-pasted" the Moderna furin cleavage site, and this became the coronavirus. If we believe that this is what happened, it is possible that whoever did this also looked up and copy-pasted other furin cleavage sites, and thus the purely random odds of the eye-popping "Moderna" aspect of the coincidence would be considerably less (let's say, 1 in 100 or less, depending on how much work you think the hypothetical postdoc bothered to do). Do note that the Moderna furin cleavage site itself is probably subjected to industrial selection pressures that enhance its delivery, which might also correlate to infectivity, so a non-uniform distribution of infectivity among stack-overflow-ed furin cleavage sites, one that favors the infectivity of something like the Moderna sequence, should be considered.
In my cursory encounters with numerology, I have learned that if you try enough hypotheses, something will always pop out. Could this be the case here too, or is the gun-smoke so thick that no cherry picking can explain it?
Odd to see that HN will not tolerate any evidence that the virus may have been designed in a lab.
Impressive work, but the author's logic is unsound because it silently assumes the library or database BLAST uses is complete, yet it likely represents less than 1% of all sequences in the wild. The database BLAST uses only contains sequences that have been found and reported. If every person on the planet was a genetic expert with all necessary resources to find and identify new sequences, in 1000 years we still wouldn't have found all of them. The sequence in question that only matches HIV-1 may very well match a trillion^2 more sequences not in BLAST's database. Anything like this that is engineered likely has a mountain of documentation. Documentary evidence, if found, is a smoking gun in a way the furin cleavage site sequence can't be.
This is a semi-interesting piece (because of how poorly written it is and how hilariously confident the writer is), but it is embarrassing that the person writing claims to have a PhD in at least something semi-related to molecular biology, yet is trying to convince people that we can use independent probability when dealing with protein amino acid sequences. Certain sequences are seen time and time again in nature, just like certain traits visible to the naked eye have evolved independently. More importantly, this author dismisses natural mixing of SARS and HIV-1 as "ridiculous" and provides no other explanation. It is possible that SARS-CoV-2 was created in a lab, but it is also possible that HIV-1 and SARS-CoV-2 co-infected a host.
Edit: to elaborate, certain amino acid sequences are seen again and again, and others are practically impossible to see in nature because amino acids fold into 3D proteins...and each amino acid is like a Lego piece. There is a lot more detail needed to understand fully how tertiary and quaternary protein structures form...but it is also possible to understand it decently well after an undergraduate degree in biology, chemistry, biochemistry, etc. A PhD in the field knows this inside and out. If I had to guess, the author of this post has a PhD in something like statistics or computer science, and thinks that they can apply high school level math to two fields (molecular biology and biochemistry) that they do not even have a high school level of understanding of.
To make a very stretched analogy (I am a doctoral student in the life sciences and a hobby programmer), this blog post is like saying that some Java source code is stolen because four of the class names are the same between two projects, and then using the number of possible characters in the class names and the length of the name to run some statistics. The problem being that no one has class names like "AeNOQ92bA"...in fact the vast majority of 9-character sequences will never be class names. Just like the vast majority of amino acid sequences (likely) do not exist in nature. And then you dig deeper and find out the class name is something like "MainDashboardSupervisorTree" (I do not know Java so forgive me). Then you also find out that the author of both classes...is the same guy who moved companies and likes a particular naming convention, but never meant to copy stuff word for word. Similar to how SARS-CoV-2 could have naturally incorporated HIV-1 RNA into its genome when co-infecting a host.
> *As of writing this those links are still up which at 12 months is pretty good going for any article that dares challenge the drivel propagandised by our beloved “free press [sponsored by pharma]”.
Well this is a rough start. “This information is being suppressed, which you can tell by the fact that these linked articles are still available.” The utter lack of critical thinking doesn’t make me expect much from the rest of this.
Another reason his probabilities are garbage:
Nature favors sequences that work. Try a billion variants, keep the one that actually helps. We don't see the 999,999,999 failures.
Some questions from a layperson:
1. Why do we care so much about a furin cleavage site? The author makes a big deal out of this and I'm not sure why. It seems like he specifically looked at the gene sequence at a furin cleavage site and then said: hey look, this occurs at a furin cleavage site! It must mean something! (Thanks to etaoins for finding at least some evidence that furin cleavage sites are not, in fact, unusual in Coronaviruses, although I have not read the paper: https://www.sciencedirect.com/science/article/pii/S187350612...)
2. What does HIV-1 have to do with anything? That is, if these sequences all matched virus XYZ instead, would that also be some sort of smoking gun? The top 3 sequences all seem to be very common among lots of viruses (rotavirus, measles, coronaviruses, and HIV all show up in the top 3 lists). So he seems to have found a single sequence that lots and lots of viruses have (at least variants of), and that one sequence is the first 3 rows of his table. Am I missing something?
3. If we ignore every result entered after Feb 2020, aren't we throwing out a lot of sequences that were entered specifically because people were suddenly all very interested in Coronaviruses? If he's making claims like "these don't appear in any other Coronavirus", the "except those we found after everyone started looking at Coronaviruses much more" somewhat detracts from that, doesn't it?
It feels a little bit like finding a passage "He froze, trembling with fear" and saying: the English language has 1 million words, so the probability of finding these 5 words in order is 1 in (1 million)^5. Therefore if we find it in two novels, one must have plagiarized the other. But not really though, because "he" and "with" are extremely common words, and "froze", "trembling", and "fear" are very highly correlated to each other. I think it's pretty clear his first 3 examples are all highly correlated.
His last example (and the title of this post) seems a little harder to explain to me as a layperson, but I'd note that for a 30-nucleotide sequence, there are 3 * 30 = 90 other sequences that are only a single substitution away. We again see Rotavirus on his list. How far away is that Rotavirus sequence, and do we think it prohibitively unlikely this sequence came from Rotavirus?
To clarify, this sequence has not been identified in any other virus in the wild. Additionally, it codes for the section pertaining to the furin cleavage site, which does not exist in any other observed beta coronavirus.
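A small sketch for counting how many sequences sit one substitution away from a given nucleotide sequence. Each of the L positions can change to one of 3 other bases, so there are 3 * L such neighbors (90 for a 30-mer). It only considers substitutions, not insertions or deletions, and the helper names are made up for illustration. The example query is the 19-nucleotide string quoted earlier in the thread.

```python
from typing import Iterator


def single_substitution_neighbors(seq: str, alphabet: str = "ACGT") -> Iterator[str]:
    """Yield every sequence that differs from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    query = "CTCCTCGGCGGGCACGTAG"  # 19-nt sequence quoted earlier in the thread
    neighbors = set(single_substitution_neighbors(query))
    print(len(query), "nt ->", len(neighbors), "single-substitution neighbors")
```

The `hamming` helper answers the "how far away is that other sequence" question for two aligned sequences of the same length. Real comparisons would also need to handle gaps, which this sketch does not.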
Using the number of BLAST hits as the basis of an argument is about as reliable as using the number of results when searching for a string on GitHub. Without further analysis of the specific sequence and its biological context it can be highly misleading. See a previous Twitter thread on some other purported HIV inserts: https://twitter.com/trvrb/status/1223666856923291648
There also seems to be some circular reasoning in the argument. Apparently we can ignore RaTG13 because it’s obviously synthetic, which makes SARS-CoV-2 look even more synthetic. It would be interesting to compare to the BANAL family of SARS-CoV-2-related viruses that are even more closely related to SARS-CoV-2 than RaTG13.
I’m not sure why only viral genomes were searched for the furin cleavage site sequence. Viruses famously exchange genetic material with their host organisms. The “smoking gun” sequence also appears in Mycobacterium smegmatis, for example.
This substack post is amazing work. I recommend reading the top comments on it for further discussion.
So what does this mean?
That covid was made in a lab? Or did covid and moderna independently reach the same "conclusion" through parallel evolutionary pressures (in nature and in research)?
>What this means is that there is no virus known to man that has this particular sequence in its genome prior to the discovery of SARS-Cov-2. So where on earth has it come from?
Mutation? The same way we get every new pathogen?
Previous post was taken down within a few minutes after an unhelpful poster got it flagged with an abusive post. Two good questions were asked there, so I will repost them here:
>Posted by 'version_five'
"This sounds like one of those "bible code" things. As someone who knows nothing about DNA sequences, what is the probability that two codes could be matched? (In a birthday paradox sense where you can go looking anywhere)"
>Posted by 'whoomp12342'
"isn't it well known that one of the main reasons why we got a vaccine so quickly was because we had done a ton of research when SARS originally broke out?"
I ask a question about DNA sequences for SWE interviews. It ends up providing some useful signal. Example questions I ask include: if you have a symbol table with 4 symbols, how many bits are required to uniquely identify each symbol (answer: 2 bits). OK, given this ASCII string 'CTCCTCGG', can you compress it by a few bits? (answer: without entropy coding, run length, or backrefs, you can easily compress DNA encoded in 8-bit ASCII from 8 bits/base to 2 bits/base). OK, so, how many bits does an 8-long DNA string require to encode (answer: 16 bits). Huh, that's a short.
The rest of the interview is how to move from using a hash table for counting k-mer frequencies to a vector using a minimal perfect hash, and compute k-mer frequencies quickly using a rolling hash. It's a great question because it always throws the CS people off to hear their questions asked with biological structures involved.
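Here is one way the interview answer could be sketched; this is my own reading, not necessarily the solution the commenter has in mind. It packs each base into 2 bits, so an 8-mer fits in 16 bits. It then uses the packed value directly as an index into a vector of size 4^k, which acts as a minimal perfect hash over all possible k-mers. The index is updated in a rolling fashion as the window slides. The names `CODE`, `pack`, and `kmer_counts` are made up for illustration, and the code assumes the input contains only A, C, G, and T.

```python
# 2-bit code per base (assumed mapping; any fixed assignment works).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq: str) -> int:
    """Pack a DNA string into an integer using 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]  # raises KeyError on non-ACGT input
    return value


def kmer_counts(seq: str, k: int) -> list:
    """Count k-mers in a dense vector indexed by the packed k-mer value."""
    counts = [0] * (4 ** k)
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    code = 0
    for i, base in enumerate(seq):
        # Rolling update: shift in the new base, drop the oldest one.
        code = ((code << 2) | CODE[base]) & mask
        if i >= k - 1:
            counts[code] += 1
    return counts


if __name__ == "__main__":
    print(hex(pack("CTCCTCGG")))  # 8 bases -> fits in 16 bits
    counts = kmer_counts("CTCCTCGG", 2)
    print(counts[pack("CT")])  # occurrences of the 2-mer "CT"
```

The dense vector only makes sense for small k, since it needs 4^k entries. For larger k you would go back to a hash table or build a minimal perfect hash over only the k-mers actually present.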