20240529 Quantifying Evolution

Course: Genoom (BMW21614)

University: Universiteit Utrecht

29th May – Quantifying molecular evolution

At this point in your studies, you have already heard a lot about evolution and how it works. A bird that evolves a new beak shape might have an advantage over a different bird in a particular area and therefore have more offspring. On a molecular level, this process happens when the DNA sequence of an organism changes from generation to generation. Consequently, the amino acid sequence of the resulting proteins changes, which leads to a different phenotype. But how do we know which sequences (and thus the organisms they occur in) are more closely related than others? Let’s look at these three sequences:

➢ Sequence 1: VEWINHVAG

➢ Sequence 2: IEWLDHSCG

➢ Sequence 3: VCPIWCVAL
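As a quick check on the three sequences listed above, percent identity can be computed by counting the positions at which two aligned sequences carry the same residue. The following is a minimal Python sketch, not part of the course's R script WC_qss; the function name percent_identity is an illustrative choice.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which both sequences have the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


seq1 = "VEWINHVAG"
seq2 = "IEWLDHSCG"
seq3 = "VCPIWCVAL"

id_12 = percent_identity(seq1, seq2)
id_13 = percent_identity(seq1, seq3)
print(f"Sequence 1 vs Sequence 2: {id_12:.0f}% identity")
print(f"Sequence 1 vs Sequence 3: {id_13:.0f}% identity")
```

Both comparisons find 4 matching positions out of 9, so identity alone cannot distinguish Sequence 2 from Sequence 3.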

Both Sequence 2 and Sequence 3 differ from Sequence 1 at 5 of 9 positions, that is, they each have 44% sequence identity with Sequence 1. Does this mean that sequences 2 and 3 are equally likely to be homologous (i.e., evolutionarily related) to Sequence 1? Not necessarily. Amino acid substitutions are not random. Because some amino acids are more similar than others, they are more frequently substituted for each other in evolution than amino acids that are less similar. These patterns can be quantified by looking at protein sequence data, allowing us to translate the biological concept of “amino acid similarity” into numbers and probabilities. Together, these probabilities of different types of mutations form part of a so-called model of evolution. Since we cannot express molecular evolution perfectly, this conceptual model is a quantitative estimate of the mutational processes that occur in evolution. In this exercise, we will build a partial model of evolution step by step.

To quantify how likely it is that one amino acid changes into another, we will look at real biological data. First, we need as many protein sequences as we can find to determine how often one specific amino acid replaces another. Some amino acids appear more frequently than others, so let us start by estimating the natural frequency of each amino acid using a large data set.

a. First, download the R script WC_qss and the multiple sequence alignment (MSA) polymerases_msa_no_gaps from Blackboard → Course Content → Bioinformatics → Werkcolleges → 29th May – Quantifying Molecular Evolution. Open the script with RStudio.

b. Run the code in the segment “Preparations”. If you get a line that says “Update all/some/none? [a/s/n]:?”, then type the letter “a” and press enter. The installation will take a few minutes. In the meantime, answer question c.

c. Databases are highly biased towards certain organisms, i.e., they contain a lot more sequences of certain organisms (human, E. coli, SARS-CoV2, ...) than others. If we want to look at all the DNA sequences ever recorded, this bias will distort the natural frequencies of amino acids, e.g., if SARS-CoV2 proteins contain more alanine than other organisms. Because there are so many SARS-CoV2 protein sequences relative to other organisms, we would inflate the frequency of alanine if we included all SARS-CoV2 protein sequences in our research. How would you combat this problem?

d. Now, work through the code under Step 1 of WC_qss. In some places you will have to add a line of code, in some places you will have to complete the code. Make sure to read the comments in the script!

e. At the end of Step 1, you will get a vector that shows you the frequency of each amino acid in the alignment. Now look up the actual frequencies of amino acids in the file aa_frequencies on Blackboard. Are they similar? Why (not)?

Now that we know the natural frequencies of the amino acids, for the next few steps we will focus on six amino acids: Aspartic Acid (D), Phenylalanine (F), Isoleucine (I), Leucine (L), Arginine (R), and Tyrosine (Y).

f. What would be the probability of F being replaced by Y if substitutions were completely random? (Hint: it is the same probability as finding F followed by Y at an exact point in an amino acid sequence. Use your amino acid frequencies from Step 1, not from aa_frequencies. If you have trouble with this calculation, look at the lecture slides from earlier today.) What would be the probability of D being replaced by R? I being replaced by L? Write the probabilities down.

g. Work through the R code under Step 2. Again, make sure to read the comments and fill in the R script. You will get a substitution frequency matrix for the six amino acids mentioned above.

h. You will notice that some amino acids replace each other more often than others, i.e., they have a higher value in the substitution frequency matrix. Does that make sense to you? Do you have an idea why that is? (Hint: look up the structure of the amino acids on the internet.)
Why do you think some amino acids hardly get replaced at all?

i. Now you know (1) the observed frequency of substitution (which is the one given by the substitution frequency matrix) and (2) the expected frequency of substitution for three amino acid pairs (F → Y, D → R, I → L). Are the observed frequencies higher or lower than the expected frequencies?

j. For the final step, we need a little math. We want to translate the observed and expected substitution frequencies into numbers that we can easily add up and subtract to calculate alignment scores. To do this, we will calculate a substitution score Sij between amino acids i and j. Sij is a log-odds score: the logarithm of the odds qij/eij, which represents how much more likely it is that two sequences are well-aligned homologs than non-homologous or unaligned sequences. Specifically, qij is the frequency with which we observe i and j aligned in the dataset of well-aligned homologs, while eij is the frequency with which we expect i and j to be aligned by random chance:

Equation 1: $S_{ij} = 2 \log_2 (q_{ij} / e_{ij})$
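The calculations in steps e, f and j can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the contents of WC_qss: the toy alignment and the observed frequency q_fy are invented values, the expected frequency follows the hint in step f (the product of the two residue frequencies), and the base-2 logarithm with a factor of 2 is the scaling implied by Equation 2. All function names are hypothetical.

```python
import math
from collections import Counter


def residue_frequencies(alignment):
    """Relative frequency of each residue across all sequences of a gap-free alignment."""
    counts = Counter("".join(alignment))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def expected_pair_frequency(freqs, i, j):
    """Chance of finding i paired with j if substitutions were random: p_i * p_j."""
    return freqs[i] * freqs[j]


def log_odds_score(q_ij, e_ij):
    """Substitution score in half-bit units: S_ij = 2 * log2(q_ij / e_ij)."""
    return 2.0 * math.log2(q_ij / e_ij)


# Toy alignment invented for illustration only (not the polymerase MSA).
toy_alignment = ["FFYL", "FYYL", "FFIL", "YFIL"]
freqs = residue_frequencies(toy_alignment)

e_fy = expected_pair_frequency(freqs, "F", "Y")
q_fy = 0.25  # hypothetical observed F-Y frequency, not a measured value
s_fy = log_odds_score(q_fy, e_fy)

print(f"p(F) = {freqs['F']:.4f}, p(Y) = {freqs['Y']:.4f}")
print(f"expected e_FY = {e_fy:.4f}, observed q_FY = {q_fy:.4f}, score S_FY = {s_fy:.2f}")
```

A positive score means the pair is observed more often than random pairing would predict, a negative score means it is observed less often, and a score of zero means observed and expected frequencies are equal.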

[...] from the alignment score using Eq. 2, which is directly derived from Eq. 1. What are the odds that Sequence 1 and Sequence 2 are well-aligned homologs?

Equation 2: $q_{ij} / e_{ij} = 2^{S_{ij} / 2}$

o. Finally, calculate the alignment score for the alignment below, assuming a model of evolution represented by the BLOSUM62 matrix. What are the odds that these two sequences are well-aligned homologs?

SKRNITKNAPNVSA

What is the point of this exercise?

The whole point of bioinformatics is that we do biological research on a computer. However, translating biological concepts into something that a computer can understand can be quite difficult. You have learned in previous courses how evolution and mutations work. In this exercise, you have started from the concept of mutations, and have reconstructed a version of the BLOSUM matrix, which is one of the most widely used substitution matrices there is.

Today, you have learned how to use biological data to make a quantitative model of evolution that can be understood by a computer and used to calculate a meaningful sequence similarity score. On the way, you have calculated the expected probability for two amino acids to be “replaced” by each other purely by chance in the course of evolution and have observed that the real substitution frequencies are different. Using the BLOSUM62 matrix, you have shown that two amino acid sequences that have the same % identity to a third sequence are not automatically equally closely related to the third sequence. Models of evolution, like the BLOSUM62 matrix, are used by sequence similarity search algorithms such as BLAST. They enable us to assess how likely it is that two sequences have the same evolutionary origin, or in other words, how likely it is that two sequences are homologous. Remember that different models of evolution, i.e., different amino acid substitution matrices, can give different similarity scores between the sequences. Homology is a concept that will be discussed in detail later in the course.
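The conclusion that equal percent identity does not imply equal relatedness can be illustrated with a short Python sketch. It sums substitution scores over an ungapped alignment and converts the total to odds with Equation 2. The dictionary below is a hand-copied subset of BLOSUM62 covering only the residue pairs needed for Sequences 1 to 3, so the values should be checked against the matrix used in the course. The function names are hypothetical.

```python
# Subset of BLOSUM62 (half-bit scores) covering only the residue pairs used below.
# Hand-copied for illustration; verify against the full matrix before relying on it.
BLOSUM62_SUBSET = {
    ("V", "I"): 3, ("E", "E"): 5, ("W", "W"): 11, ("I", "L"): 2, ("N", "D"): 1,
    ("H", "H"): 8, ("V", "S"): -2, ("A", "C"): 0, ("G", "G"): 6,
    ("V", "V"): 4, ("E", "C"): -4, ("W", "P"): -4, ("I", "I"): 4, ("N", "W"): -4,
    ("H", "C"): -3, ("A", "A"): 4, ("G", "L"): -4,
}


def pair_score(matrix, a, b):
    """Look up a symmetric substitution score, trying both orders of the pair."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(matrix, s1, s2):
    """Sum of substitution scores over an ungapped alignment of equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(matrix, a, b) for a, b in zip(s1, s2))


def odds_from_score(score):
    """Equation 2: q/e = 2 ** (S / 2), for scores in half-bit units."""
    return 2.0 ** (score / 2.0)


seq1 = "VEWINHVAG"
seq2 = "IEWLDHSCG"
seq3 = "VCPIWCVAL"

score_12 = alignment_score(BLOSUM62_SUBSET, seq1, seq2)
score_13 = alignment_score(BLOSUM62_SUBSET, seq1, seq3)
print(f"Sequence 1 vs 2: score {score_12}, odds {odds_from_score(score_12):.3g}")
print(f"Sequence 1 vs 3: score {score_13}, odds {odds_from_score(score_13):.3g}")
```

With these values, the Sequence 1 and 2 alignment scores 34 and the Sequence 1 and 3 alignment scores -3, even though both pairs share 44% identity. Because the scores are logarithms, adding them per position corresponds to multiplying the per-position odds.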
