Something for amino acid sequences. Part II.

In the previous post, we wrote a programme that identifies the positions and number of occurrences of a single amino acid (represented by a letter) within a given sequence (represented by a string of letters). Let’s modify the code so that we can search for a block of several amino acids within the sequence provided.

Let us consider a longer amino acid sequence in this example. The partial amino acid residue sequence of the pyruvate kinase type M2 protein is as follows:

MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASR
SVETLKEMIKSGMNVARLNFSHGTHEYHAETIKNVRTATESFASDPILYRPVAVAL
DTKGPEIRTGLIKGSGTAEVELKKGATL

[ Abstracted from http://www.metabolic-database.com/html/m2-pk_amino_acid_sequence.html ]

First, enter the amino acid sequence to be searched, followed by the block of amino acids to search for; in this case it is “GTA” (Glycine-Threonine-Alanine). This block occurs twice in our main sequence, starting at positions 9 and 128.

When the search was repeated with a different block, in this case “ATG”, no match was found.

The code for this programme is as follows:

Line 31 first asks the user to input the amino acid sequence, and Line 32 then asks for the block of amino acid residues of interest. Line 34 calls the function aa_finder, which performs the analysis.

Lines 1 to 28 constitute the aa_finder function. The essence of the code is as follows: the main sequence is sliced into fragments of the same length as the search term (three letters in this example). Thus the 140-letter sequence is first sliced to just its first three letters (or however many letters are in the block being searched for), and this fragment is compared with the block that the user is searching for. If they match, the position is noted. Either way, the search then moves on to the next set of three letters, starting from position 2. The process continues until the end of the main amino acid sequence is reached, with the last two searches being performed on shorter fragments.
E.g. “MSK”, “SKP”, “KPH”, “PHS”, “HSE” … “GAT”, “ATL”, “TL”, “L”
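To make the slicing idea concrete, here is a minimal sketch that lists every fragment produced for a short stretch of sequence. The function name `fragments` is my own choice for illustration and is not part of the programme discussed in this post.

```python
def fragments(sequence, block_length):
    """Slice `sequence` into one fragment per starting position.

    Fragments near the end are shorter than `block_length`, because
    slicing past the end of a Python string simply stops at the end.
    """
    return [sequence[start:start + block_length] for start in range(len(sequence))]


# The last six residues of the sequence above, sliced three letters at a time.
print(fragments("KKGATL", 3))
# ['KKG', 'KGA', 'GAT', 'ATL', 'TL', 'L']
```

The shorter fragments at the end can never equal a three-letter block, so they do no harm; they are just comparisons that always fail.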

Line 4 defines a variable representing the starting index of the fragment that will be sliced; since the first letter has an index of 0, we will begin with that.
Line 5 defines another variable which can be thought of as a ‘switch’ in this programme. The initial state is set to a Boolean value of False, to represent that a match has yet to be found.
Line 6 creates an empty list, which will be used later to store the index of the first element of the sliced fragment when a match is found.

Lines 8 to 17 use a for loop to perform the iteration in which the main sequence is sliced to the same length as the block being searched for.
Line 8 specifies that this iteration is done for every letter of the main sequence. Line 9 sets the length of the fragment to be sliced, using the length of the block that the user entered. Line 10 slices the main sequence to give the fragment that will be compared.
The if-else statements in Lines 12 to 17 specify that if the fragment under consideration matches the block the user entered, the Boolean ‘switch’ is changed to True, signifying that a match was found. Line 14 then appends the starting position of this fragment to the empty list that was created earlier. The position is stored in the numbering system that we count in, not the Pythonic way; remember that Python starts counting from 0, not 1. After that, the starting point of the next fragment to be considered is moved one position downstream. If a match is not found, Lines 16 to 17 leave the Boolean ‘switch’ unchanged and likewise move the starting point of the next fragment one position downstream.

The if-else statements in Lines 19 to 28 handle what happens after the programme has finished comparing the entire main sequence. If the Boolean ‘switch’ is still False, no match was found and the result is shown accordingly. If the Boolean ‘switch’ is True, at least one match was found; the number of elements in the list corresponds to the number of occurrences, which is shown by Line 22.
Lines 25 to 28 then print out the elements in the list, which correspond to the starting positions of the fragments where a match was found.
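The walkthrough above can be sketched roughly as follows. This is a reconstruction from the description and not the original listing, so the line numbers quoted in the text will not line up with it exactly. The variable names, the wording of the printed messages and the returned list are my own assumptions, and the sequence is hard-coded here where the programme described above asks the user to type it in.

```python
def aa_finder(sequence, block):
    """Report where `block` occurs in `sequence`, using 1-based positions."""
    start = 0        # index of the first letter of the current fragment
    found = False    # the 'switch': becomes True once any match is seen
    positions = []   # 1-based start position of every matching fragment

    for _ in sequence:                       # one fragment per letter
        length = len(block)
        fragment = sequence[start:start + length]
        if fragment == block:
            found = True
            positions.append(start + 1)      # convert 0-based index to 1-based
        start += 1                           # move one position downstream

    if not found:
        print(f"{block} was not found in the sequence.")
    else:
        print(f"{block} occurs {len(positions)} time(s), starting at:")
        for position in positions:
            print(position)
    return positions  # returned for convenience; an assumption, not from the post


pkm2 = (
    "MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASR"
    "SVETLKEMIKSGMNVARLNFSHGTHEYHAETIKNVRTATESFASDPILYRPVAVAL"
    "DTKGPEIRTGLIKGSGTAEVELKKGATL"
)

aa_finder(pkm2, "GTA")
aa_finder(pkm2, "ATG")
```

Run on the sequence above, this sketch reports positions 9 and 128 for “GTA” and no match for “ATG”, in line with the results described earlier. Since the starting point moves downstream whether or not there is a match, the sketch does that step once after the if statement where the description has it in both branches.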

Great! Now that we have gone through this simple programme, I find it a bit of a hassle to rerun the code whenever a new search is to be performed. However, with a slight modification to the existing code, a while loop can be used to repeat the inquiry process on the same main amino acid sequence. In this case, to exit the inquiry process, the user enters ‘exit’ when prompted for the amino acid block to search for.

In the modified code, Lines 32 to 38 add the while loop to the inquiry process. If ‘exit’ is entered as the amino acid block to search for, the while loop stops and the programme exits.
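A minimal sketch of such a loop is given below, under a few assumptions of mine: `find_positions` is a compact stand-in for aa_finder so that the example runs on its own, and the `read` parameter (defaulting to Python's built-in `input`) exists only so the loop can be demonstrated without typing anything.

```python
def find_positions(sequence, block):
    """Hypothetical stand-in for aa_finder: 1-based start positions of `block`."""
    return [i + 1 for i in range(len(sequence))
            if sequence[i:i + len(block)] == block]


def inquiry_loop(sequence, read=input):
    """Keep asking for a block to search for until the user enters 'exit'."""
    results = {}
    while True:
        block = read("Block to search for (or 'exit' to quit): ").strip()
        if block == "exit":
            break                      # leave the while loop and finish
        positions = find_positions(sequence, block)
        if positions:
            print(f"{block} found at position(s): {positions}")
        else:
            print(f"{block} was not found.")
        results[block] = positions
    return results


# Demonstration with scripted answers instead of keyboard input.
answers = iter(["GTA", "ATG", "exit"])
results = inquiry_loop("MSKPHSEAGTAFIQ", read=lambda prompt: next(answers))
```

For interactive use, calling `inquiry_loop(sequence)` with no second argument falls back to the keyboard prompt.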

In case you are wondering, the same approach could be applied (with some slight modification) to a paragraph of text, to find the occurrences and positions of any given word.
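As a rough illustration of that idea, the sketch below splits a paragraph into words and reports which word positions match. The function name, the decision to ignore case and surrounding punctuation, and the sample sentence are all my own assumptions.

```python
import string


def word_positions(text, word):
    """Return the 1-based word positions at which `word` occurs in `text`.

    Case and surrounding punctuation are ignored, so "The" and "the."
    both count as matches for "the".
    """
    cleaned = [w.strip(string.punctuation).lower() for w in text.split()]
    target = word.lower()
    return [i for i, w in enumerate(cleaned, start=1) if w == target]


sample = "The cat sat on the mat. The end."
print(word_positions(sample, "the"))
# [1, 5, 7]
```

Here the unit being compared is a whole word where the amino acid search used a fixed-length slice, but the bookkeeping (compare, note the position, move on) is the same.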

Is there any way to make the above code more concise and elegant? Feel free to comment.
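One possible direction, offered only as a sketch, is to let Python's built-in `str.find` do the comparison instead of slicing by hand. The function name `find_all` is hypothetical. `str.find` returns -1 when nothing is found, and its second argument sets where the search starts.

```python
def find_all(sequence, block):
    """Return the 1-based start positions of every occurrence of `block`."""
    positions = []
    index = sequence.find(block)
    while index != -1:
        positions.append(index + 1)               # 0-based index to 1-based position
        index = sequence.find(block, index + 1)   # resume one letter downstream
    return positions


print(find_all("MSKPHSEAGTAFIQ", "GTA"))
# [9]
```

Restarting the search just one letter after each hit means that overlapping occurrences are still counted, as they are in the slicing approach.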
