TagCleaner. For cleaner sequences.

Manual

Web version
    Necessary resources
    Upload data

Standalone version
    Necessary resources
    Available options
    Examples
    Notes

Tag sequence trimming
    Allowed mismatches
    Trimming range
    Continuous trimming
    Non-matching reads
    Concatenated reads
    Notes



Preprocessing_454_SFF_chart_pdf.gif
Preprocessing 454 data

01/16/2011
PDF, 92 KB

Preprocessing_FASTA_only_chart_pdf.gif
Preprocessing FASTA-only data

01/16/2011
PDF, 76 KB

Raw_data_processing_pdf.gif
Raw data processing

01/16/2011
PDF, 112 KB

Quality_control_with_PRINSEQ_pdf.gif
Quality control

01/16/2011
PDF, 520 KB

Data_preprocessing_pdf.gif
Data preprocessing

09/01/2011
PDF, 827 KB

Introduction

Sequencing metagenomes that were pre-amplified with primer-based methods requires the removal of the additional tag sequences that can contain deletions or insertions due to sequencing limitations. The tag (e.g. WTA primer) sequence may be unavailable or incorrectly reported in public databases. Because of the potential for downstream inaccuracies introduced by unwanted sequence contaminations, it is important to use reliable tools for pre-processing sequence data.

TagCleaner is distributed under the GNU Public License (GPL). All its source codes are freely available to both academic and commercial users. The latest version can be downloaded at the SourceForge download page.


Web version

        TOP OF PAGE

The interactive web interface of TagCleaner can be used to automatically detect and efficiently remove tag sequences from genomic and metagenomic datasets.

Necessary resources

Hardware
  - Computer connected to the Internet

Software
  - Up-to-date Web browser (Firefox, Safari, Chrome, Internet Explorer, ...)

Files
  - FASTA file with sequence data
  - FASTQ file (as alternative format to trim sequence and quality data)


Upload data to the TagCleaner web version

To upload a new dataset in FASTA or FASTQ format to TagCleaner, follow these steps:
  1. Go to http://tagcleaner.sourceforge.net
  2. Click on "Use TagCleaner" in the top menu on the right (the latest TagCleaner web version should load)
  3. Select your FASTA or FASTQ file
  4. Select trim mode (trim tag sequences from both ends, 5'-end only, or 3'-end only)
  5. Specify tag sequence for 5'-end and/or 3'-end (if available)
  6. Click "Submit"


Notes
If the tag sequences are not available or are unknown, leave the fields free and TagCleaner will try to find the tag sequence. The predicted tag sequence is provided to the user and can be modified before any further steps are performed (see below).

tag_detection.png


TagCleaner does not require the setting of filter parameters (such as maximum number of mismatches) before the data is processed. Instead, the filter parameters are set after the data is processed, which allows the user to choose parameters appropriate for their dataset and does not require them to submit and process the same data with modified parameters for several times.


Tag sequence trimming

        TOP OF PAGE

Tag sequence trimming should be performed before quality trimming and sequence dereplication. The trimming of low-quality bases at the ends might truncate the tag sequence and reduce the ability to recognize the remainder of the tag sequence. In those cases, large parts of the tag sequences might still remain for further analysis and data processing steps. Dereplication before trimming may miss duplicated sequences due to variations in the tag sequences that will be trimmed off later and would therefore require an additional dereplication step after the trimming.

The algorithm implemented in TagCleaner for the automatic detection of tag sequences assumes the randomness of a typical metagenome. Datasets that do not contain random sequences from organisms in an environment, but rather contain, for example, 16S data may cause incorrect detection of the tag sequences. However, the tag sequences will most likely be over-predicted and can be redefined by the user prior to data processing.


Maximum allowed mismatches

The independent definition of maximum allowed mismatches for the 5'- and 3'-end of the reads accounts for the differences in tag sequences due to the limitations of the sequencing method used to generate the datasets. The 3'-end will in most cases show a lower number of matching tag sequences with low number of mismatches due to incomplete or missing tags at the ends of incompletely sequenced fragments.

TagCleaner_3_chart.png


Trim tag sequence only if within

The sequence of the tag could occur not only at the sequence end, but also at any other position of the sequence. To assure that only tags are trimmed, the tag sequences can be defined to occur only at the ends allowing a certain number of variable bases. The default value for tag sequence of at least 10 bp is 1.5 times the tag sequence length.


Trim tag sequence continuously

The continuous trimming of tag sequences from the ends allows filtering of sequences mainly consisting of concatenated tag sequences.


Remove reads not matching tag sequence

This filter can be applied to remove sequences that do not contain any tag sequence (with the defined number of maximum mismatches). This feature was originally designed to separate MID tagged sequences before the MID tags were made publicly available. The feature can also be used to further investigate sequences without a matching tag sequence.


Split fragment-to-fragment concatenated reads

TagCleaner provides an additional feature of detecting and splitting fragment-to-fragment concatenations. This important pre-processing step removes tag contaminations inside the sequences, which may allow, for example, more accurate assemblies. The concatenated fragments may additionally present a source of error for annotation and taxonomic assignments, since fragments from different organisms may not be assigned correctly when concatenated.

filter_parameters.png


Notes

TagCleaner is able to detect the quasi-random 3'-end of whole transcriptome amplification (WTA) primers. The user has the option whether or not to trim this part of the tag sequence by simply adding or removing the letter N from the end of the tag sequence. However, users are advised to trim the complete tag sequence. It is important to trim the random parts in order to account for mismatch-induced mutations that often happen when primers anneal to similar (but not identical) sequences with high enough affinity for binding (see below). Therefore, one cannot be certain that this part of the tag sequence represents the actual sequence of the sample.

mismatch_induced_mutation.png


The tag sequence prediction uses filtered base frequencies instead of raw base frequencies. This allows a more accurate prediction as it accounts for incomplete and shifted tag sequences. The following example shows the raw base frequencies (top) and the filtered base frequencies (bottom) for the 3'-ends.

filtered.png


Standalone version

        TOP OF PAGE

The standalone version does not provide graphical outputs, but all the functionality you might be used to from the web version. Starting with version 0.9, the web version is using the standalone version for all the calculations in the backend. The readme file contains information on the usage of the standalone version.

Necessary resources

Hardware
  - Computer with a Linux/Unix or Mac OS X operating system

Software
  - TagCleaner standalone (available at http://deconseq.sourceforge.net)
  - Perl 5 (or higher)

Files
  - FASTA file with sequence data (and optional QUAL file to trim sequences and their quality scores)
  - FASTQ file (as alternative format)

Available options

Option/flag

Description

Default   

Range

-help or -h

Print the help message; ignore other arguments

   

-man

Print the full documentation; ignore other arguments

   

-version

Print program version; ignore other arguments

   

-verbose

Prints status and info messages during processing

   
Input options      

-fastq

Input file in FASTQ format that contains the sequence and quality data.

 

STRING

-fasta

Input file in FASTA format that contains the sequence data.

 

STRING

-qual

Input file in QUAL format that contains the quality data.

 

 

-matrix

Use union, subset or exact matrix instead of default matrix when matching tag characters to the sequence characters. The default matrix requires that the ambiguous characters of the sequence represent bases that are a subset of the ambigous characters of the tag sequence (e.g. R = AG in the sequence is a subset of D = AGT and is marked with an X in the matrix). The subset matrix requires that either the tag character or the sequence character is a subset of the other. The union matrix requires that at least one base is common in tag and sequence (e.g. R = AG and M = AC have base A in common and are marked with an X in the matrix). The exact matrix allows to trim ambiguous bases without considering the ambiguity code (e.g. N in tag only trims N, not ACGT). Optional options are either "subset", "union" or "exact".

Default matrix:
    A C G T R Y M K W S B D H V N (tag)
  A X       X   X   X     X X X X
  C   X       X X     X X   X X X
  G     X   X     X   X X X   X X
  T       X   X   X X   X X X   X
  R         X             X   X X
  Y           X         X   X   X
  M             X           X X X
  K               X     X X     X
  W                 X     X X   X
  S                   X X     X X
  B                     X       X
  D                       X     X
  H                         X   X
  V                           X X
  N                             X
  (sequence)


Subset matrix:
    A C G T R Y M K W S B D H V N (tag)
  A X       X   X   X     X X X X
  C   X       X X     X X   X X X
  G     X   X     X   X X X   X X
  T       X   X   X X   X X X   X
  R X   X   X             X   X X
  Y   X   X   X         X   X   X
  M X X         X           X X X
  K     X X       X     X X     X
  W X     X         X     X X   X
  S   X X             X X     X X
  B   X X X   X   X   X X       X
  D X   X X X     X X     X     X
  H X X   X   X X   X       X   X
  V X X X   X   X     X       X X
  N X X X X X X X X X X X X X X X
  (sequence)


Union matrix:
    A C G T R Y M K W S B D H V N (tag)
  A X       X   X   X     X X X X
  C   X       X X     X X   X X X
  G     X   X     X   X X X   X X
  T       X   X   X X   X X X   X
  R X   X   X   X X X X X X X X X
  Y   X   X   X X X X X X X X X X
  M X X     X X X   X X X X X X X
  K     X X X X   X X X X X X X X
  W X     X X X X X X   X X X X X
  S   X X   X X X X   X X X X X X
  B   X X X X X X X X X X X X X X
  D X   X X X X X X X X X X X X X
  H X X   X X X X X X X X X X X X
  V X X X   X X X X X X X X X X X
  N X X X X X X X X X X X X X X X
  (sequence)


Exact matrix:
    A C G T R Y M K W S B D H V N (tag)
  A X
  C   X
  G     X
  T       X
  R         X
  Y           X
  M             X
  K               X
  W                 X
  S                   X
  B                     X
  D                       X
  H                         X
  V                           X
  N                             X
  (sequence)

Default matrix

[subset, union]

Output options      

-out

By default, the output files are created in the same directory as the input file containing the sequence data with an additional "_tagcleaner_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option.
Example: use "file_passed" to generate the output file file_passed.fasta (fasta output) in the current directory

 

STRING

-out_format

To change the output format, use one of the following options. If not defined, the output format will be the same as the input format.
1 (FASTA only), 2 (FASTA and QUAL) or 3 (FASTQ)

 

[1,2,3]

-line_width

Sequence characters per line. Use 0 if you want each sequence in a single line. Use 80 for line breaks every 80 characters. Note that this option only applies to FASTA output files, since FASTQ files store sequences without additional line breaks.

60

INT

-stats

Prints the number of tag sequences matching for different numbers of mismatches. In combination with -split, the number of sequences with fragment-to-fragment concatenations is printed as well. Cannot be used in combination with -predict and will not perform any trimming. The output values are separated by tabs with the header line: "#Param Mismatches_or_Splits Number_of_Sequences Percentage Percentage_Sum". Cannot be used in combination with -predict and require -tag5 or -tag3.

 

 

-predict

Use this option to have TagCleaner predict the tag sequences. It will attempt to predict the tag at either or both sites, if possible. The algorithm implemented for the tag prediction assumes the randomness of a typical metagenome. Datasets that do not contain random sequences from organisms in an environment, but rather contain, for example, 16S data may cause incorrect detection of the tag sequences. However, the tag sequences will most likely be over-predicted and can be redefined by the user prior to data processing. The tag sequence prediction uses filtered base frequencies instead of raw base frequencies. This allows a more accurate prediction as it accounts for incomplete and shifted tag sequences. The output values are separated by tabs with the header line: "#Param Tag_Sequence Tag_Length Percent_Explained". If no tags are reported, then no tags could be identified in the data set. Cannot be used in combination with -tag3 or -tag5 or -stats. When using this option, no trimming will be performed.

 

 

-log

Log file to keep track of parameters, errors, etc. The log file name is optional. If no file name is given, the log file name will be "inputname.log". If the log file already exists, new content will be added to the file.

 

FILE

-info

This option will provide the trimming and splitting information in the header line after the sequence identifier. The following information is given and separated by a single space: initial length, length after trimming, 5'-end trimming position, 3'-end trimming position, number of mismatches at 5'-end, number of mismatches at 3'-end and number of sequences after splitting. In case of a splitting event, the number of mismatches at the 5'- and 3'-end will be the number of mismatches of the concatenated tags.

 

 

-nomatch

This option allows to filter sequences that do not match the tag sequence at the ends or do not contain inner tags within the maximum number of allowed mismatches. The following options allow to filter reads not matching the tag at:
(1) 5'-end, (2) 3'-end, (3) either end - requires both ends to match, (4) both ends - requires either end to match, or (5) reads splitting into two or more reads.

 

[1,2,3,4,5]

-minlen

Minimum read length of trimming and/or splitting

 

INT

-filtered

Output the sequences that would be filtered out instead of the sequences passing the filters. This includes sequences that e.g. are tag sequence repeats or do not fulfill the -nomatch option.

 

 

Trim/Split options      

-tag5

Tag sequence at 5'-end. Use option -predict if unknown.

 

STRING

-tag3

Tag sequence at 3'-end. Use option -predict if unknown.

 

STRING

-mm5

Maximum number of allowed mismatches at the 5'-end.

0

INT

-mm3

Maximum number of allowed mismatches at the 3'-end. The independent definition for the 5'- and 3'-end of the reads accounts for the differences in tag sequences due to the limitations of the sequencing method used to generate the datasets. The 3'-end will in most cases show a lower number of matching tag sequences with low number of mismatches due to incomplete or missing tags at the ends of incompletely sequenced fragments.

0

INT

-trim_within

The sequence of the tag could occur not only at the sequence end, but also at any other position of the sequence. To assure that only tags are trimmed, the tag sequences can be defined to occur only at the ends allowing a certain number of variable bases. The default value for -trim_within for a tag sequence of at least 10 bp is 1.5 times the tag sequence length.
Example: Use -tag3 NNNNNNNNNCCAAACACACCCAACACACCAC -trim_within 60 to trim ATCCATTTCCCAAACACACCCAACACACCAC AAAAAAAAAAAAAAAACAAACAACACC

 1.5x tag
 length

INT

-cont

Trim tag sequences continuously. This is helpful if you have sequence with tag sequence repeats or sequence that are concatenated tag sequences. Note that a high number of allowed mismatches and continuous trimming can cause over-trimming. Use more than 20% mismatches with continuous trimming only with caution.

 

 

-split

This feature removes tag contaminations inside the sequences and splits fragment-to-fragment concatenations into separate sequences. The optional integer value specifies the maximum number of allowed mismatches for the internal (concatenated) tag sequence(s). This feature should be used with caution for inputs with only a 5' or 3' tag sequence (likely splits too many false positive that naturally occur for single tags compared to much longer concatenated 5' and 3' tags). The number of mismatches is used as maximum value for -stats. This option will cause a decrease in speed.
sequence1-tag3-tag5-sequence2

0

INT

-split5r

This feature is similar to -split, but instead of search for 3'-5'-tag repeats, it will search for 5'-tag repeats (2 or more). This option only applies if both -tag3 and -tag5 are specified. To split at a single 5'-tag, run the program without -tag3 and with -split.
sequence1-tag5-tag5-sequence2

0

INT

-split3r

This feature is similar to -split, but instead of search for 3'-5'-tag repeats, it will search for 3'-3'-tag repeats (2 or more). This option only applies if both -tag3 and -tag5 are specified. To split at a single 3'-tag, run the program without -tag5 and with -split.
sequence1-tag3-tag3-sequence2

0

INT

-splitall

This features is for convenience only and applies the integer value to all split options (-split, -split3r and -split5r). This option only applies if both -tag3 and -tag5 are specified and will overwrite all other split options with the given integer value.
Combining all split options can split things like sequence1-tag3-tag3-tag5-tag5-tag5-sequence2 into sequence1 and sequence2.

0

INT



Examples

The following examples apply to version 0.9 and higher and assume a 64 bit architecture. If you do not use a 64 bit system, ignore the -64 option.

Verbose mode (-verbose)
The verbose mode shows you what the program is doing while it is running and also the progress of tasks that require more computation:

$ perl tagcleaner.pl -verbose -64 -fastq SRR004254.fastq -predict
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	status: 34 %


Predicting the tag sequence (-predict)
To predict the tag sequences, you can simply use the -predict option:

$ perl tagcleaner.pl -verbose -64 -fastq SRR004254.fastq -predict
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done
#Param	Tag_Sequence	Tag_Length	Percentage_Explained
tag3	NNNNNNNNNNCCAAACACACCCAACACA	28	49.52
tag5	TGTGTTGGGTGTGTTTGGNNNNNNNNNN	28	71.56

Using the non-verbose mode:

$ perl tagcleaner.pl -64 -fastq SRR004254.fastq -predict
#Param	Tag_Sequence	Tag_Length	Percentage_Explained
tag3	NNNNNNNNNNCCAAACACACCCAACACA	28	49.52
tag5	TGTGTTGGGTGTGTTTGGNNNNNNNNNN	28	71.56

Note that the comment lines in the outputs will start with the pound sign (#) that can be eaily parsed by other programs. All values in a row are separated by tabs.
Using -predict will not trim any input sequences and can only be used to predict the possible tag sequence(s).


Checking for tag sequences (-predict)
The percentage values given in the column called "Percentage_Explained" (see example above) can be used to estimate the probability of a tag being real. If the percentage is high, there is likely a tag. If the percentage is low (5-10%), then there is likely no tag.


Using the stats option (-stats)
This options can be used to get the number of sequences that match the tag sequence(s) with different numbers of allowed mismatches.

$ perl tagcleaner.pl -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN -stats
#Param	Number_of_Mismatches_or_Splits	Number_of_Sequences	Percentage	Percentage_Sum
tag3	0	7167	46.85	46.85
tag3	1	316	2.07	48.91
tag3	2	696	4.55	53.46
tag3	3	348	2.27	55.74
tag3	4	556	3.63	59.37
tag3	5	227	1.48	60.86
tag3	6	255	1.67	62.52
tag3	7	431	2.82	65.34
tag3	8	1031	6.74	72.08
tag3	9	1595	10.43	82.51
tag3	10	1538	10.05	92.56
tag3	11	622	4.07	96.63
tag3	12	227	1.48	98.11
tag3	13	99	0.65	98.76
tag3	14	25	0.16	98.92
tag3	15	20	0.13	99.05
tag3	16	34	0.22	99.27
tag3	17	34	0.22	99.50
tag3	18	77	0.50	100.00
tag5	0	10752	70.28	70.28
tag5	1	533	3.48	73.77
tag5	2	462	3.02	76.79
tag5	3	492	3.22	80.00
tag5	4	258	1.69	81.69
tag5	5	103	0.67	82.36
tag5	6	104	0.68	83.04
tag5	7	231	1.51	84.55
tag5	8	371	2.43	86.98
tag5	9	628	4.11	91.08
tag5	10	775	5.07	96.15
tag5	11	325	2.12	98.27
tag5	12	180	1.18	99.45
tag5	13	25	0.16	99.61
tag5	14	8	0.05	99.67
tag5	15	16	0.10	99.77
tag5	16	5	0.03	99.80
tag5	17	8	0.05	99.86
tag5	18	22	0.14	100.00

Note that the tag is shorter than 32 bp and therefore we do not need the -64 option, which is only required for longer tag sequences using more than the default 32 bit. The printed table tells us how many sequences match the tag3 or tag5 sequence with how many mismatches. The slash (/) allows to continue commands over multiple lines and is used here to keep the lines short enough for easy display.


Trimming tags (-tag5, -tag3, -mm5 and -mm3)
When you perform the tag trimming, the -log option will be useful to keep track of the processing of your data.

$ perl tagcleaner.pl -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN \
-mm5 3 -mm3 3 -verbose -log
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done
Input/Output stats:
	Input sequences: 15,298
	Input bases: 3,122,243
	Input mean length: 204.09
	Output sequences: 15,298
	Output bases: 2,538,321
	Output mean length: 165.93

The verbose mode will provide some basic statistics of the input and output data. Since we did not provide a filename for the output file and log file, it will be automatically generated. The output file in this example is "SRR004254_tagcleaner_t5gF.fastq" and the log file "SRR004254.fastq.log". The number of input and output sequences is identical, as we did not filter the trimmed sequences by matching tags or sequence length. The log file contains all the processing information and comes in handy if you perform multiple trimming or preprocessing steps. The log file for this example contains the following content:

[tagcleaner-standalone-0.9] [02/09/2011 15:55:00] Executing TagCleaner with command: "perl tagcleaner.pl -fastq SRR004254.fastq -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN -tag3 NNNNNNNNNNCCAAACACACCCAACACA -mm5 3 -mm3 3 -verbose -log"
[tagcleaner-standalone-0.9] [02/09/2011 15:55:00] Write results to file: "SRR004254_tagcleaner_Tnqj.fastq"
[tagcleaner-standalone-0.9] [02/09/2011 15:55:00] Parse and process input data: "SRR004254.fastq"
[tagcleaner-standalone-0.9] [02/09/2011 15:55:05] Finished processing input data: "SRR004254.fastq"
[tagcleaner-standalone-0.9] [02/09/2011 15:55:05] Input sequences: 15,298
[tagcleaner-standalone-0.9] [02/09/2011 15:55:05] Input bases: 3,122,243
[tagcleaner-standalone-0.9] [02/09/2011 15:55:05] Input mean length: 204.09
[tagcleaner-standalone-0.9] [02/09/2011 15:55:05] Output sequences: 15,298
[tagcleaner-standalone-0.9] [02/09/2011 15:55:05] Output bases: 2,538,321
[tagcleaner-standalone-0.9] [02/09/2011 15:55:05] Output mean length: 165.93


Splitting fragment-to-fragment concatenations (-split)
The splitting option can be used to split fragment-to-fragment concatenations at 3'-5' concatenated tag sequences:

$ perl tagcleaner.pl -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN \
-mm5 3 -mm3 3 -split 5 -verbose
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done
Input/Output stats:
	Input sequences: 15,298
	Input bases: 3,122,243
	Input mean length: 204.09
	Output sequences: 15,695
	Output bases: 2,515,773
	Output mean length: 160.29


Note the difference in number of input and output sequences due to the splitting of concatenated fragments into separate sequences allowing 5 mismatches for the concatenated tags.


Splitting more complex concatenations (-split5r and -split3r)
Real example sequence with 3'-5', 5'-5' and 5'-5' concatenated tags:
ACGGGCTTAGTGATCCGGTGGTTCCGCATGGAAGGGCCATCGCTCAACGGATAAAAGCTACCCCGGGGATAACAGGCTTA TCTCCCCCAAACCCCAAACACACCCAACACAGTGTTGGGTGTGTTTGGTGTGTGGTGGTTGTGTTGGGTGTGTTTGGTTT TGGTTGGGTGTGTTGGGTGTGTTTGGGGTTGGTGGGCTGTGTTGGGTGTGTTTGGTGGTGTGTGGGTTGTGTTGGGTGTG TTTGGGTATGCGTGGTAGGAGAGCGTTCTAACAGCGTTGAAGTCAGACCGGAAGGACT

To split fragments at more complex tag sequence concatenations such as in this worse case scenario "sequence1-tag3-tag5-tag5-tag5-sequence2-tag5-tag5-tag5-sequence3-tag3-tag3-sequence4", you would need to use additional options for splitting at tag sequence repeats:

$ perl tagcleaner.pl -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN \
-mm5 3 -mm3 3 -split 5 -split5r 5 -split3r 5 -verbose
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done
Input/Output stats:
	Input sequences: 15,298
	Input bases: 3,122,243
	Input mean length: 204.09
	Output sequences: 16,479
	Output bases: 2,454,259
	Output mean length: 148.93

Note the even higher number of output sequences suggesting that the data set contained fragment-to-fragment concatenated sequences that could be identified and split at tag sequence repeats.


Continuous trimming (-cont)
This option is useful to trim tag sequence repeats at the ends, as shown in the example below:
TGTTGGGTGTGTTTGGTGTGTGGTGGTTGTGTTGGGTGTGTTTGGTTTTGGTTGGTGTGTTGGGTGTGTTTGGGGTTGGT GGGCTAGGAGAGCGTTCTAACAGCGTTGAAGTCAGACCGGAAGGACTACGGGCTTAGTGATCCGGTGGTTCCGCATGGAA GGGCCATCGCTCAACGGATAAAAGCTACCCCGGGGATAACAGGCTTATCTCCCCCAAACCCCAAACACACCCAACACA

$ perl tagcleaner.pl -verbose -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN -mm5 3 -mm3 3 -cont
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done
Input/Output stats:
	Input sequences: 15,298
	Input bases: 3,122,243
	Input mean length: 204.09
	Output sequences: 15,280
	Output bases: 2,386,486
	Output mean length: 156.18

Note the lower number of sequence that suggest that the data set contained sequences that were pure tag sequence repeats.


Trim tag sequences further from the ends (-trim_within)
This option is useful to trim tags that are not occoring right at the ends, as shown in the following example:
AGCGTAGTCGATGGGAAACAGGTTAATATTCCTGTACTTCTGGTTACTGCGATGGAGGGACGGAGAAGGCTAGGCCAGCT TGGCATTGGTTGTCCAAGTTTAAGGTGGTAGGCTGAAATCTTAGGTAAATCCGGGGTTTCAAGGCCGAGAGCTGATGACG AGTCGTCTTTTAGATGACGAAGTGGTTGATAGACCATGCTTCCAAGAAAAGCTTCTAAGCTTCAGGTAACCACGAAACCA AACACACCCAACACACCAACAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAACAA

The 3'-tag can be found within 70 bp from the end and therefore, can be trimmed by using the -trim_within option with parameter 70.

$ perl tagcleaner.pl -verbose -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN -mm5 3 -mm3 3 \
-trim_within 70


Filtering short reads (-minlen)
This option is useful to filter reads that are very short after trimming tag sequences and/or splitting.

$ perl tagcleaner.pl -verbose -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN -mm5 3 -mm3 3 \
-minlen 10


Suggested options for trimming and splitting
This are the options that would be used in a good use case (the parameters are up to the specific use and should not be blindly copied without thinking about what they do):

$ perl tagcleaner.pl -verbose -fastq SRR004254.fastq \
-tag3 NNNNNNNNNNCCAAACACACCCAACACA -tag5 TGTGTTGGGTGTGTTTGGNNNNNNNNNN -mm5 3 -mm3 3 \
-splitall 5 -minlen 10 -cont -log SRR004254_preprocessing.log -out SRR004254_trimmed
Estimate size of input data for status report (this might take a while for large files)
	done
Parse and process input data
	done
Input/Output stats:
	Input sequences: 15,298
	Input bases: 3,122,243
	Input mean length: 204.09
	Output sequences: 15,662
	Output bases: 2,354,964
	Output mean length: 150.36

This will provide you the basic statistics of the data processing using the verbose mode and writes all the processing steps to the file "SRR004254_preprocessing.log" by using the -log option and the trimmed and split data to file "SRR004254_trimmed.fastq" by using the -out option:

$ ls
SRR004254.fastq
SRR004254_preprocessing.log
SRR004254_trimmed.fastq
tagcleaner.pl


Notes

The following is a collection of random notes that might help you when using TagCleaner to process your data.

How to handle cases where tags are not completely sequenced

Depending on the length of your tag sequence, you have different options on how to deal with incomplete tag sequences during removal. If your tag sequences are long enough (e.g. 15+bp), then you could perform the search for end tags with the last 7-10 bases and specify that they have to be found only at the end (e.g. first/last 15 bases). If you have short primer sequences, then I can only suggest to slightly increase the number of mismatches.

An example command for a tag with the sequence ACCGTGACGTCTGAGACT (18 bases) could be:

perl tagcleaner.pl -log file.log -verbose -fastq file.fastq -tag3 ACCGTGAC -mm3 1 -trim_within 18 -out file_tagcleaned

For this approach, a low number of allowed mismatches should be defined to reduce the number of false positive trimmed tag sequences. Also, it might be useful to run TagCleaner first with the full tag sequence and a higher number of allowed mismatches and then again with the partial sequence and low number of allowed mismatches.


How to trim ambiguous bases without considering the ambiguity code

If you want to trim NNA from the 3'-end, but do not want to trim AAA or ATA, then you should use the matrix exact.

perl tagcleaner.pl -log file.log -verbose -fasta file.fasta -tag3 NNA -mm3 0 -trim_within 3 -matrix exact -out file_tagcleaned

Allowing no mismatches and only checking at the last three positions will only trim NNA from the sequences and nothing else. If you also want to trim sequences with NNAA (and NNAC, NNAG, NNAT and NNAN), then you only have to modify the -trim_within parameter.

perl tagcleaner.pl -log file.log -verbose -fasta file.fasta -tag3 NNA -mm3 0 -trim_within 4 -matrix exact -out file_tagcleaned