Section: Misc. Reference Manual Pages (February 3rd, 2005)


SPATT - Statistics for Patterns, based on simple models. 




spatt (statistics for patterns) is a suite of programs designed to compute p-values for pattern occurrences in a text. Assuming the text is generated according to a Markov model, the p-value of a given observation is the probability of that observation occurring under the model. The lower the p-value, the more unlikely the observation. For example, these tools can be used to find patterns with unusual behaviour in DNA sequences.

Let N(P) denote the number of occurrences of a pattern P in a given sequence. If we consider the sequence to be random (according to a model of our choice), N(P) becomes a random variable and we can associate p-values with observations using the following statistic:

S = - log10[ P( N(P) > Nobs(P) ) ] when P is seen more than expected


S = +log10[ P( N(P) < Nobs(P) ) ] when P is seen less than expected

For example, S=+3.23 means the pattern is over-represented (seen more than expected) with a p-value of 10^-3.23 = 5.888e-4. S=-12.67 means the pattern is under-represented (seen less than expected) with a p-value of 10^-12.67 = 2.138e-13.
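
The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name statistic_to_pvalue is hypothetical and is not part of spatt.

```python
def statistic_to_pvalue(s):
    """Turn a spatt-style statistic S into (direction, p-value).

    The sign of S gives the direction and |S| is -log10 of the p-value.
    """
    direction = "over" if s >= 0 else "under"
    return direction, 10.0 ** (-abs(s))


for s in (3.23, -12.67):
    direction, pvalue = statistic_to_pvalue(s)
    print("S=%+.2f: %s-represented, p-value %.3e" % (s, direction, pvalue))
```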



Several programs are provided, based on different statistical methods. They all share the same syntax and options (although a few options are specific to a given program):

sspatt (simple statistics for patterns) computes p-values using a binomial approximation. This approximation is known to be theoretically incorrect but is in practice a very fast and reliable heuristic.

gspatt (gaussian statistics for patterns) computes the expectation and variance of pattern counts and derives a p-value approximation from them.

ldspatt (large deviations statistics for patterns) is based on large deviations theory. The computed p-values are especially reliable for the smallest ones, but they are asymptotic and so must be used with care on short sequences (say, fewer than 10000 letters).

xspatt (exact statistics for patterns) uses exact computations with arbitrary-precision methods to give high-quality p-values. Memory requirements grow linearly with sequence length, and time complexity is proportional to both the sequence length and the number of occurrences of the pattern. Therefore, this method should be used only on short sequences (say, fewer than 10000 letters).

cpspatt (compound Poisson statistics for patterns) uses the Chen-Stein method to approximate N(P) with a geometric Poisson distribution. Thanks to a convenient recurrence, p-value computations are linear (and not quadratic) in the observed number of occurrences. In the case of non-overlapping patterns, these approximations fall back to the simple Poisson approximation, which is very close to the binomial statistics implemented in sspatt.
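
To give a feel for the simplest of these methods, here is a minimal Python sketch of a binomial tail statistic in the spirit of sspatt. It is not sspatt's actual code. It assumes N(P) follows a Binomial(n, p) distribution, with n candidate positions and a per-position occurrence probability p strictly between 0 and 1. The function names and the use of the inclusive tails P(N >= Nobs) and P(N <= Nobs) are choices made for this sketch only.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log10_sum(log_terms):
    """log10 of sum(exp(t) for t in log_terms), computed stably."""
    top = max(log_terms)
    total = sum(math.exp(t - top) for t in log_terms)
    return (top + math.log(total)) / math.log(10)


def binomial_statistic(n_obs, n, p):
    """Signed log10 statistic: positive if over-represented, else negative."""
    if n_obs >= n * p:
        # Upper tail P(N >= n_obs), summed term by term in log space.
        tail = [log_binom_pmf(k, n, p) for k in range(n_obs, n + 1)]
        return -log10_sum(tail)
    # Lower tail P(N <= n_obs).
    tail = [log_binom_pmf(k, n, p) for k in range(0, n_obs + 1)]
    return log10_sum(tail)


print(binomial_statistic(6, 20, 0.1))   # 6 occurrences where 2 are expected
```

Summing the tail term by term keeps the sketch short but is only practical for modest n. Working in log space avoids underflow when the p-value is extremely small.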



FASTA sequence
The name of the file containing a sequence in FASTA format.



-h, --help
displays this message.
-v, --version
displays version number.
-q, --quiet
quiet output (same effect as --debug-level -1).
-V, --Verbose
verbose output (same effect as --debug-level 1).
-d, --debug
debug output (same effect as --debug-level 2).
--debug-level <level>
set the verbose level:
 -1 (quiet),
 0 (normal),
 1 (verbose),
 2 and more (debug).
-a, --alphabet=descriptor
define the alphabet to use (e.g. -a "acgt:tgca" for the standard DNA alphabet).
-M, --Markov-file=filename
destination file for the Markov model parameters.
-S, --Stationnary-file=filename
destination file for the stationary distribution.
-U, --Use-markov-file=filename
file containing the Markov model parameters.
-C, --Count-file=filename
file containing word counts.
-P, --Pattern-file=filename
file containing patterns.
-l, --length=max
length of the longest word counted.
-p, --pattern=descriptor
define a pattern (can be used several times).
-b, --both-strands
count patterns on both strands.
-m, --markov=order
order of the Markov model.
--all-words
process all words of the given length.
-n, --normalize
statistics are normalized by the length of the sequence.
--max-pvalue <threshold>
only statistics whose p-value is lower than the threshold are output.



Alphabet descriptor

This allows you to specify the alphabet used when reading sequences and patterns. First, each letter of the alphabet must be given as an ordered sequence; each letter must be either a single character or a list of characters between two brackets. Then, optionally, a complementary alphabet can be specified after a ":". Only one character per letter should be used in this case.

By default, white space (blanks, tabulations, carriage returns, ...) is ignored and alphabets are considered case insensitive unless two different cases have been used in the alphabet descriptor.

Each time an invalid character is found in a sequence, it is treated as an interruption and therefore has the same effect as splitting the sequence into two separate pieces.
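
The descriptor syntax can be made concrete with a small Python sketch. It is only an illustration of the rules stated in this section, not spatt's parser; the names parse_alphabet and is_case_sensitive are hypothetical.

```python
def parse_alphabet(descriptor):
    """Parse a descriptor such as "acgt:tgca" or "[ag][ct]".

    Returns (letters, complement): letters is a list with one string of
    accepted characters per alphabet letter; complement maps each letter
    to its complementary character, or is None without a ":" part.
    """
    main, _, comp = descriptor.partition(":")
    letters, i = [], 0
    while i < len(main):
        if main[i] == "[":
            end = main.index("]", i)        # ValueError if unbalanced
            letters.append(main[i + 1:end])
            i = end + 1
        else:
            letters.append(main[i])
            i += 1
    complement = None
    if comp:
        # A complement requires exactly one character per letter.
        if len(comp) != len(letters) or any(len(g) != 1 for g in letters):
            raise ValueError("complement needs one character per letter")
        complement = dict(zip(letters, comp))
    return letters, complement


def is_case_sensitive(descriptor):
    """True when both lower and upper case appear in the descriptor."""
    return (any(c.islower() for c in descriptor)
            and any(c.isupper() for c in descriptor))


print(parse_alphabet("acgt:tgca"))
print(parse_alphabet("[ag][ct]"))
print(is_case_sensitive("acgtN"))
```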

Here are some examples of valid descriptors:

"acgt:tgca" for the standard DNA alphabet (this is also the default alphabet)

"[ag][ct]" for the purine-pyrimidine alphabet

"abcdefghijklmnopqrstuvwxyz" for the standard latin alphabet

"acgtN" for a custom case sensitive DNA alphabet including the letter N

"[ARNDCE][QGHI][LKM][FPST][WYV]" for amino-acid alphabet divided in five groups

"ARNDCEQGHILKMFPSTWYV" for the full amino-acid alphabet

Pattern descriptor

A pattern descriptor should use only valid characters from the specified alphabet, as well as "." and "[" or "]". These extra characters make it possible to describe positions where the pattern is degenerate: "." means any character, and the brackets specify a list of allowed characters. The character "|" can also be used as a separator between several patterns in the same descriptor.
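
The expansion rule can be sketched in Python as follows. This is an illustrative reading of the rules in this section rather than spatt's implementation. The function name expand_pattern is hypothetical, and the alphabet defaults to the four DNA letters.

```python
from itertools import product


def expand_pattern(descriptor, alphabet="acgt"):
    """Expand a pattern descriptor into the list of words it stands for."""
    words = []
    for part in descriptor.split("|"):       # "|" separates patterns
        if not part:
            continue
        choices, i = [], 0
        while i < len(part):
            ch = part[i]
            if ch == "[":                     # bracketed list of characters
                end = part.index("]", i)
                choices.append(part[i + 1:end])
                i = end + 1
            else:                             # "." stands for any character
                choices.append(alphabet if ch == "." else ch)
                i += 1
        words.extend("".join(w) for w in product(*choices))
    return words


print(expand_pattern("gctgg[tc]gg"))
print(expand_pattern("agct|tgcat"))
```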

Here are a few examples:

"gctgg[tc]gg" expands to {gctggtgg,gctggcgg}

"[at]atg.a" expands to {aatgaa,aatgca,aatgga,aatgta,tatgaa,tatgca,tatgga,tatgta}

"agct|tgcat" expands to {agct,tgcat}

FASTA sequence

All sequences must be in FASTA format, which is very simple: a text file containing one or more sequences, each starting with a title line whose first character is a ">". No more than 100 columns should be used in the file (additional characters will simply be ignored).
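
A reader for this layout can be sketched in a few lines of Python. The sketch is illustrative only: read_fasta is a hypothetical name, the 100-column limit is not enforced, and no alphabet checking is done.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of FASTA lines."""
    title, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):              # a title line opens a sequence
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append("".join(line.split()))   # drop white space
    if title is not None:
        yield title, "".join(chunks)


sample = ["> sequence 1", "atcgtagctagc", "atcgatcggtagaa"]
for title, sequence in read_fasta(sample):
    print(title, len(sequence))
```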

Here is a simple example:

> sequence 1
atcgtagctagcatcgatcggtagaa
> sequence 2
atatattagctaatagatcgatcgaatatatag

Both strands

This option completes the patterns with their reverse complements. This way, occurrences of the pattern on both strands are taken into account.

Some examples:

-p gctggtgg -b gives the same result as -p gctggtgg|ccaccagc

-p tc.a -b gives the same result as -p tc.a|
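
How a pattern is completed with its reverse complement can be sketched in Python. The sketch assumes the default DNA alphabet "acgt:tgca"; the names reverse_complement and both_strands are hypothetical and not part of spatt.

```python
# Complement table for the default DNA alphabet "acgt:tgca". Brackets are
# swapped so that a bracketed list is still well formed after reversal.
COMPLEMENT = str.maketrans("acgt[]", "tgca][")


def reverse_complement(pattern):
    """Reverse complement of a single pattern ("." is left unchanged)."""
    return pattern.translate(COMPLEMENT)[::-1]


def both_strands(descriptor):
    """Append the reverse complement of every pattern in the descriptor."""
    parts = [p for p in descriptor.split("|") if p]
    return "|".join(parts + [reverse_complement(p) for p in parts])


print(both_strands("gctggtgg"))
print(reverse_complement("tc.a"))
```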

Markov model order

This determines the order of the Markov model used to model the random sequence. By default, its parameters are estimated on the observed sequence by maximum likelihood. Alternatively, the parameters can be provided by a Markov file through the --Use-markov-file option.

Order 0 corresponds to the independent model, and order -1 to the independent and uniform model (no parameters).
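
Maximum likelihood estimation of such a model amounts to counting words of length order + 1 and normalizing each context row. The Python sketch below illustrates this under simplifying assumptions: the function name estimate_markov is hypothetical, the sequence is a plain string over the alphabet, and a context never seen in the sequence is given a uniform row, which is a choice of this sketch and not necessarily what spatt does.

```python
from collections import Counter
from itertools import product


def estimate_markov(sequence, order, alphabet="acgt"):
    """Return {context: [P(letter | context) for letter in alphabet]}."""
    k = len(alphabet)
    if order < 0:                  # order -1: independent and uniform model
        return {"": [1.0 / k] * k}
    # Count every (overlapping) word of length order + 1.
    counts = Counter(sequence[i:i + order + 1]
                     for i in range(len(sequence) - order))
    model = {}
    for ctx in map("".join, product(alphabet, repeat=order)):
        row = [counts[ctx + a] for a in alphabet]
        total = sum(row)
        # Assumption of this sketch: unseen contexts get a uniform row.
        model[ctx] = [c / total if total else 1.0 / k for c in row]
    return model


print(estimate_markov("aacg", 0))
print(estimate_markov("acgtacgt", 1)["a"])
```

With order 0 the only context is the empty string, so the model reduces to the letter frequencies of the sequence.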

Markov file

A Markov file contains the parameters of a Markov model. It is a simple text file where lines starting with a "#" are treated as comments. For an order m model on an alphabet of size k, the file must contain k^m non-comment lines, each containing k real numbers. Each line corresponds to a value of the last m letters (in alphabet order) and each column to the following letter.

Such a file is either produced as an output of the program (--Markov-file option) or used as input parameters (--Use-markov-file option).
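
The layout described above can be illustrated with a small Python sketch that writes and reads such a file as text. The function names are hypothetical, and the use of white space as the column separator and of six decimal places are assumptions of this sketch; the exact formatting used by spatt is not specified here.

```python
from itertools import product


def format_markov(model, order, alphabet="acgt"):
    """Render a {context: row} model as Markov file text."""
    lines = ["# order %d Markov model on alphabet %s" % (order, alphabet)]
    # One line per context, contexts taken in alphabet order.
    for ctx in map("".join, product(alphabet, repeat=order)):
        lines.append(" ".join("%.6f" % p for p in model[ctx]))
    return "\n".join(lines) + "\n"


def parse_markov(text, order, alphabet="acgt"):
    """Read Markov file text back into a {context: row} dictionary."""
    k = len(alphabet)
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    if len(rows) != k ** order or any(len(r) != k for r in rows):
        raise ValueError("expected %d lines of %d values" % (k ** order, k))
    contexts = map("".join, product(alphabet, repeat=order))
    return dict(zip(contexts, rows))


model = {"a": [0.1, 0.2, 0.3, 0.4], "c": [0.25, 0.25, 0.25, 0.25],
         "g": [0.7, 0.1, 0.1, 0.1], "t": [0.0, 0.0, 0.5, 0.5]}
print(format_markov(model, 1))
```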

Stationary file

A stationary file contains the stationary distribution of a Markov model. It is a simple text file where lines starting with a "#" are treated as comments. The file contains k^m lines of one real number each for an order m model on an alphabet of size k. Such a file can be produced through the --Stationnary-file option.

Pattern file

A pattern file contains a list of pattern descriptors. It is a simple text file where lines starting with a "#" are treated as comments. Each non-comment line is treated as a pattern descriptor.

Count file

A count file contains the number of occurrences of every word of a given length L. It is a simple text file where lines starting with a "#" are treated as comments. Each non-comment line gives a number of occurrences. All words of length L are listed this way in alphabetical order.
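
The ordering of a count file can be illustrated with the following Python sketch, which enumerates all words of length L in alphabet order and counts their (overlapping) occurrences. The function name count_words is hypothetical, and the comment line written at the top is only an example.

```python
from itertools import product


def count_words(sequence, length, alphabet="acgt"):
    """Return [(word, count)] for every word of `length`, in alphabet order."""
    counts = {"".join(w): 0 for w in product(alphabet, repeat=length)}
    for i in range(len(sequence) - length + 1):
        word = sequence[i:i + length]
        if word in counts:           # skip words with invalid characters
            counts[word] += 1
    return list(counts.items())


table = count_words("aacgaac", 2)
print("# counts of all words of length 2")
print("\n".join(str(count) for _, count in table))
```

Because the words are implied by their position in the file, only the counts need to be written, one per line.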

Statistic Normalization

Statistics usually grow in magnitude with the length of the sequence considered. This is well known in large deviations theory and can be corrected simply by using the normalization provided by the --normalize option:

normalizedS = - 1/n log10[ P( N(P) > Nobs(P) ) ] when P is seen more than expected


normalizedS = + 1/n log10[ P( N(P) < Nobs(P) ) ] when P is seen less than expected
 where n is the length of the sequence.

max-pvalue filtering (sspatt only)

Filters the output by producing only the statistics whose p-values are lower than the specified threshold. Please note that all statistics are computed in any case.



Count all occurrences of two-letter words in the DNA sequence contained in the seq.fasta file:

sspatt --all-words -l 2 seq.fasta

Compute the statistics of all two-letter words in the sequence contained in seq.fasta, according to an order 0 Markov model. Store the parameters of the model in the file markov:

sspatt --all-words -l 2 seq.fasta -m 0 -M markov

A simple word:

$ sspatt ecoli.fasta -p gctggtgg -m 1
gctggtgg        499     70.10   +240.760141

An order 1 Markov model is estimated on the sequence ecoli.fasta and the statistic in log scale (default) is output for the pattern gctggtgg. We observe 499 occurrences where 70.10 are expected, so the pattern is over-represented with a p-value around 1e-240.

A more complex pattern:

$ sspatt ecoli.fasta -p g.tggtgg -m 0
g.tggtgg        1043    294.63  +249.394597

Statistics for the pattern {gatggtgg,gctggtgg,ggtggtgg,gttggtgg} are computed for an order 0 Markov model. We observe 1043 occurrences where only 294.63 are expected. The pattern is over-represented with a p-value around 1e-249.

All words of a given length:

$ sspatt ecoli.fasta -l 4 -m 2 --all-words
aaaa    35124   35104.19        +0.338697
aaac    25253   26618.98        -16.893008
aaag    22788   20425.36        +59.038597
aaat    25736   26752.44        -9.738282
aaca    21870   18864.77        +100.972531
aacc    20444   24098.25        -128.653351
aacg    24404   23571.88        +7.477807
(...)

Statistics for all words of length 4 are computed for an order 2 Markov model.

A very long pattern:


Computes the statistic for the (very) long pattern specified on the amino-acid alphabet. As this alphabet has a high cardinality (20), the counted words must be shorter than the default length (hence the -l 3 parameter). One occurrence is observed where 2e-45 are expected, resulting in an over-representation with a p-value around 1e-44.



Grégory Nuel and Mark Hoebeke



