Sequence Editing

Organizing and Editing Sequences in Geneious

1.    Obtain sequences from DNALims.
2.    Start organizing your Geneious work environment.
a.    Working with sequences efficiently requires organization.  If you’re starting a new project you should begin by making a new folder in the local directory with the name of your project (e.g., “Sagrei_Group_Phylogeny”).  From here on, we’re going to call this folder the “project directory”.  If you’re going to be working with multiple loci, create a subfolder within your project directory for each locus (e.g., “Sagrei_Group_Phylogeny/RAG1”).
3.    Import sequences into Geneious.
a.    Create a new folder in your project directory with your last name and the date the sequences were obtained in a year-month-day format (e.g., GLOR_091203).  This is going to be a temporary holding place for your newly imported sequences.
b.    To import the sequences, highlight the folder you just created with your name and the date, go to File:Import:From File, select your sequence(s) in the browser, and click Import.
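
If you import sequences often, it can help to generate the holding folder name rather than typing it each time.  Here is a minimal Python sketch of the naming convention from step 3a; the function name is hypothetical, and it assumes you want the last name in capitals followed by a two-digit year, month, and day, as in the GLOR_091203 example.

```python
from datetime import date


def holding_folder_name(last_name: str, obtained: date) -> str:
    """Build a name such as GLOR_091203: last name, then year-month-day."""
    # %y%m%d gives a two-digit year, month, and day, so names sort by date.
    return f"{last_name.upper()}_{obtained:%y%m%d}"


print(holding_folder_name("Glor", date(2009, 12, 3)))
```

Because the year comes first, folders for the same person sort chronologically in the Geneious folder list.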
4.    Take a first look at your sequences by highlighting them all and viewing the associated set of chromatograms.  If you can’t see the chromatograms, you may need to check the toggle button next to the “Graphs” tab at the lower right side of your Geneious work environment and the button next to “Chromatogram”.
5.    Annotate problematic regions of your sequences.
a.    Highlight all of the sequences you just imported.  The simplest way to do this is to enter the folder containing the sequences, click on one of the sequences, and then press Command+A (Control+A on a PC) to select all the sequences in that folder.  If you are highlighting a lot of sequences at this point, you may see a box saying “The selected documents are very large and may take a long time to load. Click “Load Documents” if you wish to proceed.”  Click the “Load Documents” button to proceed with your annotation.  Once you’ve done this, you should be able to see all of your chromatograms.  NOTE: Bulk sequence annotation may not be possible with large samples of sequences.  If you run into memory problems when attempting to conduct these functions, you may need to do this step in a few batches.
b.    Identify sloppy ends.
i.    Go to Sequence:Trim Ends, check the box next to Annotate new trimmed regions, and leave other values at their default settings before clicking Ok.  If this operation is successful, you should see red bars under your sequences that indicate where low quality end sequence has been identified.  Don’t select the option to trim ends because you still want to have a look at these regions to see if good data can be recovered.
c.    Identify heterozygotes and sloppy base calls.
i.    Go to Sequence:Find Heterozygotes…, set the peak similarity to 50%, and check the box next to Annotate heterozygotes.  You should see a blue tab underneath putatively heterozygous positions (you may need to check the box next to Annotations to see these tabs).
ii.    Repeat the step above, but select Edit sequences to show ambiguities before clicking Ok.  This will change the basecall at positions with multiple peaks to an ambiguous base (Y, R, S, etc.).  The reason for conducting this operation twice is so that the putatively heterozygous positions are both easy to spot (via annotations) and appropriately indicated by ambiguous basecalls.
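
To make the idea behind step 5c concrete, here is a rough Python sketch of how a position with two strong peaks could be turned into an ambiguous basecall.  This is only an illustration, not how Geneious does it internally: the function name is hypothetical, and it assumes that “peak similarity” means the second-highest peak is at least 50% as tall as the highest one.  The two-base IUPAC codes themselves (R, Y, S, W, K, M) are standard.

```python
# Standard IUPAC codes for positions where two bases are both plausible.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def call_base(peaks: dict[str, float], peak_similarity: float = 0.5) -> str:
    """Call one position from the peak heights of at least two bases.

    Assumption for illustration: a position counts as heterozygous when the
    second-highest peak reaches `peak_similarity` times the highest peak.
    """
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    (top_base, top_height), (second_base, second_height) = ranked[0], ranked[1]
    if top_height > 0 and second_height / top_height >= peak_similarity:
        return IUPAC_PAIRS[frozenset((top_base, second_base))]
    return top_base


print(call_base({"A": 100, "G": 60, "C": 5, "T": 3}))  # two strong peaks
print(call_base({"A": 100, "G": 20, "C": 5, "T": 3}))  # one clear peak
```

Raising the threshold flags fewer positions; lowering it flags more, including some that are just noisy basecalls, which is why you still need to inspect the annotated positions by eye.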
6.    Sort your sequences.
a.    First sort sequences by marker.
i.    Create a new folder for each marker and drop the sequences into the appropriate folder.
b.    Now you’ll begin sorting by quality.
i.    Create four folders within each marker’s individual folder: (1) Good_Sequences(contig), (2) Good_Sequences(no_contig), (3) Poor_Sequences, and (4) Contigs.  Place sequences that are completely unreadable (or nearly so) in the Poor_Sequences folder and all other sequences (i.e., those with some readable sequence data) into the Good_Sequences(contig) folder.  You’ll use the Good_Sequences(no_contig) folder in a bit.
7.    Generate contigs from paired forward and reverse sequences.
a.    Select all of the sequences in your Good_Sequences(contig) folder.  To make contigs, click the Assembly icon.
i.    To conduct automatic assembly select the appropriate options to tell Geneious how your sequence names can be used to identify paired sequences.  You will have the option of telling Geneious where it can find the name of each individual that was sequenced in your sequence file names so that it can identify sequences that should be assembled.
ii.    Click on More options, but leave these settings at their default values for now (minimum overlap=25, overlap identity=80%, gap open penalty=18, mismatch score=-9, match score=5).  Click Ok.
b.    When this operation is completed you will be presented with a new window containing the Assembly report.  Scroll to the bottom of this report, where files that did not form contigs are listed.  Click Select all to highlight all of your sequences that did not form contigs and move these sequences into the folder called Good_Sequences(no_contig).
c.    Move the contig files in your Good_Sequences(contig) folder to the Contigs folder.
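
The name-based pairing in step 7a is easier to set up if you know in advance which reads share a sample name.  The Python sketch below shows the general idea under some loud assumptions: the file names, the “_” delimiter, and the position of the sample name are all hypothetical, and the function names are mine, not part of Geneious.  It only groups names; it says nothing about whether the reads actually overlap well enough to assemble.

```python
from collections import defaultdict


def group_reads_by_sample(file_names, delimiter="_", name_field=0):
    """Group read file names by the sample name embedded in each name."""
    groups = defaultdict(list)
    for file_name in file_names:
        # Assumed naming scheme: sample name is one delimited field.
        sample = file_name.split(delimiter)[name_field]
        groups[sample].append(file_name)
    return dict(groups)


def split_paired_and_unpaired(groups):
    """Separate samples with two or more reads from samples with one read."""
    paired = {s: reads for s, reads in groups.items() if len(reads) >= 2}
    unpaired = [reads[0] for reads in groups.values() if len(reads) == 1]
    return paired, unpaired


# Hypothetical file names, for illustration only.
reads = ["JBL100_RAG1_F.ab1", "JBL100_RAG1_R.ab1", "JBL101_RAG1_F.ab1"]
paired, unpaired = split_paired_and_unpaired(group_reads_by_sample(reads))
print(paired)
print(unpaired)
```

Reads that end up in the unpaired list are the ones you would expect to see at the bottom of the Assembly report, although reads with a partner can also fail to assemble if their overlap is too short or too divergent.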
8.    Edit contigs.
a.    Take a first look at each contig to get some idea of the overall quality of your assemblies.  A good contig will have a nice long green bar over the top of it, indicating identity between sequences obtained in each direction.  Bad contigs will have more orange, red, or white bars.
b.    Select a single contig and zoom in on it to the point that you can clearly make out individual peaks (~100%).  Adjust the chromatogram height as well, if necessary.
c.    Correct problems with individual contigs.
i.    Regions with overlapping sequences in both the forward and reverse direction and agreement between the two sequences are generally OK.  However, you should begin by scanning across your contigs to determine if any areas of concordance are problematic (i.e., based on poor basecalls in both directions).
ii.    Most of your attention will be devoted to two types of potentially problematic sequence data: (1) regions from which sequence was obtained in only a single direction (which will be indicated by a green bar) and (2) regions where sequences overlap but disagree (indicated by orange, red, or white bars).
1.    Beginning with the first type of region, carefully investigate any regions from which sequence data is available in only a single direction.  Identify any regions of this sequence from which non-ambiguous basecalls are not possible as judged by the following criteria.
a.    One reason for ambiguity involves sloppy or overlapping peaks on your electropherogram.  Most such regions should have been flagged as ambiguous by the Find Heterozygotes operation you ran previously.  Delete regions at the beginning or end of your sequence that have more ambiguous basecalls than non-ambiguous basecalls.  If regions of ambiguity exist internally, edit these regions to reflect this ambiguity using either ambiguous basecalls or Ns.
b.    A second reason for ambiguity involves indistinct peaks.  This is typical toward the end of most sequencing reactions and can make it difficult to determine how many bases are present, particularly when two or more of the same base occur sequentially.  Because this tends to occur toward the end of a read, you should simply delete regions after the point at which peaks become indistinct.
2.    Now you’re ready to focus on regions where two sequences are in conflict.  There are a few important rules of thumb when dealing with these regions.  First, always focus your attention on the higher quality sequence data.  If this sequence appears reliable, ensure that the consensus sequence reflects the high quality sequence.  Editing the lower quality sequence to match the higher quality sequence in such cases is not advised, as this can lead to fudging data.  If both sequences are low quality, examine them closely to see if reconciliation is possible.  If there is any question about this reconciliation, ensure that the consensus sequence reflects the observed ambiguity.
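
The rules of thumb for conflicting regions can be written down as a small decision procedure.  The Python sketch below is one way to express them, not a description of what Geneious does: the function name and both thresholds (`min_quality` and `min_gap`) are hypothetical, and it assumes you have a numeric quality score for each basecall, whereas in practice you will often be judging quality by eye from the chromatogram.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def consensus_base(base_f, qual_f, base_r, qual_r, min_quality=20, min_gap=10):
    """Pick a consensus base from forward and reverse calls at one position.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if base_f == base_r:
        return base_f
    calls = sorted([(base_f, qual_f), (base_r, qual_r)],
                   key=lambda call: call[1], reverse=True)
    (high_base, high_qual), (_, low_qual) = calls
    # Trust the better read only if it is reliable and clearly better.
    if high_qual >= min_quality and high_qual - low_qual >= min_gap:
        return high_base
    # Otherwise record the disagreement instead of guessing.
    return IUPAC_PAIRS.get(frozenset((base_f, base_r)), "N")


print(consensus_base("A", 50, "G", 10))  # clear winner
print(consensus_base("A", 15, "G", 12))  # both poor, keep the ambiguity
```

Note that the function only changes the consensus; the individual reads are left as they were, which matches the advice above not to edit the lower quality read to agree with the better one.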
9.    Align contigs.
a.    Use the alignment button to create an alignment of your contigs.  Ensure that you have told this alignment to maintain sequence order.  Use MUSCLE alignment with default settings.  Examine concordance among contigs.  Double-check original sequence data when contigs from related species are in conflict.
10.    Form alignment with as much data as possible.
a.    Sequence data that has not formed contigs may remain a useful contribution to your dataset. To include this data:
i.    Generate consensus sequences from your contigs and place these consensus files in a folder called Data_for_Alignment.  Place copies of all the files in your Good_Sequences(no_contig) folder in this folder as well.  Form an alignment from the consensus sequences and the individual unassembled sequence files.
ii.    Check for disagreement.
b.    Export your alignment as a NEXUS file.
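
If you have never opened a NEXUS file, it helps to know roughly what the export contains.  The Python sketch below writes a minimal NEXUS DATA block from a set of aligned sequences.  It is only meant to show the structure of the file: the function name and the taxon names are hypothetical, the output of Geneious’s own exporter may be laid out differently, and the sketch assumes taxon names contain no spaces or punctuation that would need quoting.

```python
def to_nexus(alignment: dict[str, str]) -> str:
    """Return a minimal NEXUS DATA block for aligned DNA sequences."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("All aligned sequences must have the same length")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(alignment)} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    for name, seq in alignment.items():
        # Pad names so the sequence columns line up.
        lines.append(f"  {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical taxon names and sequences, for illustration only.
print(to_nexus({"Taxon_one": "ACGT-", "Taxon_two": "ACGTT"}))
```

The length check matters because NCHAR must be the same for every row; if it fails, the sequences you passed in were not actually aligned.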

Posted by Rich Glor – Oct. 24th, 2009
