The E-value depends in part on the size of the sequence you're looking at, so the value will definitely differ depending on the subrange you use. This is what I was, probably poorly, explaining in the edit note lol. The subrange I used was based on a pic presented in the article, though it's probably better to look at the entire sequence, because in retrospect I'm not sure he actually meant to highlight that specific region.
Funny. When I re-did the thing the E scores were much better.
Between AY862402.1 and MN997409.1 subrange 21697:23074, using blastn, that same match has E=2E-111 with 68% identity.
But sure, that guy intentionally mixed up AY862402.1, which is a SARS spike protein plasmid (which of course has a good match with coronaviruses), and AF334399, which is the empty backbone he linked to in the blogpost.
It doesn't actually change the result, because the identity is such garbage. A good E-value is worthless if the percent identity is lower than what you're interested in. In this case it simply means that the alignment it did find, at 68% identity, is more likely to be genuine than to have occurred by chance. Cool to know, but having so few matching positions doesn't bode well for arguing that they're the same sequence, so whether or not those few matches are of good quality doesn't improve your argument.
Conversely, you could have two completely identical but short alignments e.g.:
T A T A T A
T A T A T A
And this would have a very high (AKA insignificant) E-value, because the likelihood of that happening by chance is relatively high. This is why E-value alone is not an indicator of the strength of a match. Identity % is important to frame why you should care. Hope that makes sense.
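To put rough numbers on the two points above, here's a minimal Python sketch. It uses the textbook bit-score relation E = m * n * 2^(-S'), where m and n are the query and subject lengths. Real BLAST uses effective lengths and its own scoring parameters, so these won't reproduce actual blastn output. The 12-bit score for the six-base match (roughly 2 bits per matching base) and the 30,000-base subject length are assumptions I picked for illustration, and the function names are made up.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions where the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def e_value(bit_score: float, query_len: int, subject_len: int) -> float:
    """Expected number of chance hits with at least this bit score.

    Simplified relation E = m * n * 2^(-S'); ignores BLAST's
    effective-length corrections.
    """
    return query_len * subject_len * 2.0 ** (-bit_score)


# Short but perfect alignment: 100% identity, yet E is far above 1.
identity_short = percent_identity("TATATA", "TATATA")
e_short = e_value(bit_score=12, query_len=6, subject_len=30_000)  # assumed values

# Same bit score, bigger search space: the E-value gets worse (larger).
e_small_space = e_value(bit_score=40, query_len=1_000, subject_len=30_000)
e_large_space = e_value(bit_score=40, query_len=30_000, subject_len=30_000)

print(f"identity={identity_short:.0f}%  E={e_short:.1f}")
print(f"E (small space)={e_small_space:.2e}  E (large space)={e_large_space:.2e}")
```

The short perfect match comes out with an E-value well above 1 despite 100% identity, and the same score gets a worse E-value as the search space grows, which is why the two numbers have to be read together.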