# SSPATT

Section: Misc. Reference Manual Pages (February 3rd, 2005)

## NAME

SPATT - Statistics for Patterns, based on simple models.

## SYNOPSIS

**[s|g|ld|x|cp]spatt** [-hVqvdb] [--debug-level=*<level>*] [-a *descriptor*] [FASTA *sequence*] [-M *<filename>*] [-S *<filename>*] [-U *<filename>*] [-C *<filename>*] [-P *<filename>*] [-l *<max>*] [-p *<threshold>*]

## GENERAL DESCRIPTION

**spatt (statistics for patterns)** is a suite of programs designed to compute p-values for pattern occurrences in a text. Assuming the text is generated according to a Markov model, the p-value of a given observation is the probability of seeing it occur under that model. The lower the p-value, the more unlikely the observation. For example, these tools can be used to find patterns with unusual behaviour in DNA sequences.

Let N(P) denote the number of occurrences of a pattern P in a given sequence. If we consider the sequence to be random (according to a model of our choice), N(P) becomes a random variable and we can associate p-values with observations using the following statistic:

S = - log10[ P( N(P) > Nobs(P) ) ] when P is seen more than expected

and

S = +log10[ P( N(P) < Nobs(P) ) ] when P is seen less than expected

For example, S=+3.23 means the pattern is over-represented (seen more than expected) with a p-value of 10^-3.23 = 5.888e-4. S=-12.67 means the pattern is under-represented (seen less than expected) with a p-value of 10^-12.67 = 2.138e-13.
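The conversion between the statistic S and the p-value can be sketched in a few lines of Python. This is only an illustration of the sign convention described above; the function names are hypothetical and are not part of spatt.

```python
import math


def statistic_to_pvalue(s):
    """Return (direction, p-value) for a statistic S in log10 scale.

    A positive S means over-representation, a negative S means
    under-representation; in both cases the p-value is 10^-|S|.
    """
    direction = "over" if s > 0 else "under"
    return direction, 10.0 ** (-abs(s))


def pvalue_to_statistic(pvalue, over_represented):
    """Inverse conversion: sign the magnitude -log10(p) by the direction."""
    magnitude = -math.log10(pvalue)
    return magnitude if over_represented else -magnitude


for s in (3.23, -12.67):
    direction, p = statistic_to_pvalue(s)
    print(f"S={s:+.2f}: {direction}-represented, p-value {p:.3e}")
```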

## AVAILABLE PROGRAMS

Several programs are provided, based on different statistical methods. They all share the same syntax and options (although a few options are specific to a given program):

**sspatt (simple statistics for patterns)** computes p-values using a binomial approximation. This approximation is known to be inexact, but it is in practice a very fast and reliable heuristic.

**gspatt (gaussian statistics for patterns)** computes the expectation and variance of pattern counts and derives a p-value approximation from them.

**ldspatt (large deviations statistics for patterns)** is based on large deviations theory. The computed p-values are especially reliable when they are very small, but they are asymptotic and so must be used with care on short sequences (say, fewer than 10000 letters).

**xspatt (exact statistics for patterns)** uses exact computations with arbitrary-precision methods to give high-quality p-values. Memory requirements grow linearly with sequence length, and time complexity is proportional both to the sequence length and to the number of occurrences of the pattern. Therefore, this method should be used only on short sequences (say, fewer than 10000 letters).

**cpspatt (compound Poisson statistics for patterns)** uses the Chen-Stein method to approximate N(P) with a geometric Poisson distribution. Thanks to a recurrence relation, p-value computations are linear (and not quadratic) in the observed number of occurrences. For non-overlapping patterns, these approximations fall back to the simple Poisson approximation, which is very close to the binomial statistics implemented in sspatt.

## ARGUMENTS

**FASTA sequence** - The name of the file containing a sequence in FASTA format.

## OPTIONS

- **-h, --help** - displays this message.
- **-v, --version** - displays version number.
- **-q, --quiet** - quiet output (same effect as --debug-level -1).
- **-V, --Verbose** - verbose output (same effect as --debug-level 1).
- **-d, --debug** - debug output (same effect as --debug-level 2).
- **--debug-level=level** - set the verbose level: -1 (quiet), 0 (normal), 1 (verbose), 2 and more (debug).
- **-a, --alphabet=descriptor** - define the alphabet to use (ex: -a "acgt:tgca" for the standard DNA alphabet).
- **-M, --Markov-file=filename** - destination file for the Markov model parameters.
- **-S, --Stationnary-file=filename** - destination file for the stationary distribution.
- **-U, --Use-markov-file=filename** - file containing the Markov model parameters.
- **-C, --Count-file=filename** - file containing word counts.
- **-P, --Pattern-file=filename** - file containing patterns.
- **-l, --length=max** - length of the longest word counted.
- **-p, --pattern=descriptor** - define a pattern (may be given several times).
- **-b, --both-strands** - count patterns on both strands.
- **-m, --markov=order** - order of the Markov model.
- **--all-words** - process all words of the given length.
- **-n, --normalize** - statistics are normalized by the length of the sequence.
- **--max-pvalue <threshold>** - only statistics corresponding to p-values lower than the threshold are produced.

## MORE DETAILS ON OPTIONS

**Alphabet descriptor**

This specifies the alphabet to use when reading sequences and patterns. First, the letters of the alphabet must be given as an ordered sequence; each letter must be either a single character or a list of characters between square brackets. Then, optionally, a complementary alphabet can be specified after a ":". Only one character per letter should be used in the complementary alphabet.

By default, white space (blanks, tabs, carriage returns, ...) is ignored and alphabets are not case sensitive unless two different cases are used in the alphabet descriptor.

Each time an invalid character is found in a sequence, it is treated as an interruption and therefore has the same effect as splitting the sequence into two separate pieces.

Here are some examples of valid descriptors:

- "acgt:tgca" for the standard DNA alphabet (this is also the default alphabet)

- "[ag][ct]" for the purine-pyrimidine alphabet

- "abcdefghijklmnopqrstuvwxyz" for the standard latin alphabet

- "acgtN" for a custom case sensitive DNA alphabet including the letter N

- "[ARNDCE][QGHI][LKM][FPST][WYV]" for amino-acid alphabet divided in five groups

- "ARNDCEQGHILKMFPSTWYV" for the full amino-acid alphabet
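To make the descriptor syntax concrete, here is a minimal Python sketch that splits a descriptor into its letters, its optional complementary alphabet, and a case-sensitivity flag. It is an illustration of the rules stated above, not spatt's own parser: the function name `parse_alphabet` is hypothetical, and pairing each letter (by its first character) with one complementary character is an assumption.

```python
def parse_alphabet(descriptor):
    """Split an alphabet descriptor into (groups, complement, case_sensitive).

    groups: one string per letter of the alphabet; a bracketed list such as
            "[ag]" becomes the single letter "ag" (both characters are merged).
    complement: dict mapping the first character of each letter to its
            complementary character, or None when no ":" part is given.
    case_sensitive: True only when both cases appear in the descriptor.
    """
    letters_part, _, complement_part = descriptor.partition(":")
    groups = []
    i = 0
    while i < len(letters_part):
        if letters_part[i] == "[":
            end = letters_part.index("]", i)  # raises ValueError if unclosed
            groups.append(letters_part[i + 1:end])
            i = end + 1
        else:
            groups.append(letters_part[i])
            i += 1
    complement = None
    if complement_part:
        complement = dict(zip((g[0] for g in groups), complement_part))
    case_sensitive = (any(c.islower() for c in letters_part)
                      and any(c.isupper() for c in letters_part))
    return groups, complement, case_sensitive


print(parse_alphabet("acgt:tgca"))
print(parse_alphabet("[ag][ct]"))
print(parse_alphabet("acgtN"))
```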

**Pattern descriptor**

A pattern descriptor should use only valid characters from the specified alphabet, as well as "." and "[" or "]". These extra characters mark positions where the pattern may be degenerate: "." means any character, and the brackets enclose a list of allowed characters. The character "|" can also be used as a separator between several patterns in the same descriptor.

Here are a few examples:

- "gctgg[tc]gg" expands to {gctggtgg,gctggcgg}

- "[at]atg.a" expands to {aatgaa,aatgca,aatgga,aatgta,tatgaa,tatgca,tatgga,tatgta}

- "agct|tgcat" expands to {agct,tgcat}
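The expansion rules illustrated by these examples can be sketched in Python as follows. This is a minimal illustration, assuming the default DNA alphabet "acgt" for the "." wildcard; the function name `expand_pattern` is hypothetical and does not correspond to anything in spatt.

```python
from itertools import product


def expand_pattern(descriptor, alphabet="acgt"):
    """Expand a pattern descriptor into the sorted list of words it denotes."""
    words = set()
    for pattern in descriptor.split("|"):      # "|" separates several patterns
        choices = []                           # allowed characters per position
        i = 0
        while i < len(pattern):
            ch = pattern[i]
            if ch == "[":                      # bracketed list of characters
                end = pattern.index("]", i)    # raises ValueError if unclosed
                choices.append(pattern[i + 1:end])
                i = end + 1
            elif ch == ".":                    # any character of the alphabet
                choices.append(alphabet)
                i += 1
            else:                              # plain character
                choices.append(ch)
                i += 1
        words.update("".join(letters) for letters in product(*choices))
    return sorted(words)


print(expand_pattern("gctgg[tc]gg"))
print(expand_pattern("[at]atg.a"))
print(expand_pattern("agct|tgcat"))
```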

**FASTA sequence**

All sequences must be in FASTA format, which is very simple: a text file containing one or more sequences, each starting with a title line whose first character is ">". No more than 100 columns should be used in the file (characters beyond that are simply ignored).

Here is a simple example:

    > sequence 1
    atcgtagctagcatcgatcggtagaa
    > sequence 2
    atatattagctaatagatcgatcgaatatatag

**Both strands**

This option completes the patterns with their reverse complements. This way, occurrences of the pattern on both strands are taken into account.

Some examples:

- -p gctggtgg -b gives the same result as -p gctggtgg|ccaccagc

- -p tc.a -b gives the same result as -p tc.a|t.ga
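The equivalence shown in these two examples can be checked with a short Python sketch. It assumes an alphabet descriptor with a complementary part (here the default "acgt:tgca"); the function names are hypothetical and only illustrate the reverse-complement operation, not spatt's internals.

```python
def reverse_complement(pattern, alphabet="acgt:tgca"):
    """Reverse-complement a single pattern descriptor (no "|" inside)."""
    letters, complements = alphabet.split(":")
    table = dict(zip(letters, complements))
    swapped = {"[": "]", "]": "["}       # brackets change side when reversed
    out = []
    for ch in reversed(pattern):
        if ch in swapped:
            out.append(swapped[ch])
        else:
            out.append(table.get(ch, ch))  # "." is left unchanged
    return "".join(out)


def both_strands(pattern, alphabet="acgt:tgca"):
    """Descriptor matching the pattern on either strand."""
    return pattern + "|" + reverse_complement(pattern, alphabet)


print(both_strands("gctggtgg"))
print(both_strands("tc.a"))
```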

**Markov model order**

This sets the order of the Markov model used to model the random sequence. By default, its parameters are estimated from the observed sequence by maximum likelihood. Alternatively, the parameters can be provided in a Markov file through the --Use-markov-file option.

Order 0 corresponds to the independent model, and order -1 to the independent and uniform model (no parameters).

**Markov file**

A Markov file contains the parameters of a Markov model. It is a simple text file in which lines starting with "#" are treated as comments. For an order m model on an alphabet of size k, the file must contain k^m non-comment lines, each containing k real numbers. Each line corresponds to one value of the preceding m letters (in alphabet order), and each column to the letter that follows.

Such a file is either produced as output by the program (--Markov-file option) or used as input parameters (--Use-markov-file option).
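As an illustration of this layout, the following Python sketch estimates order-m transition probabilities by maximum likelihood and prints them as k^m lines of k numbers. It is a sketch under stated assumptions: the function names are hypothetical, the number formatting and the comment line are arbitrary choices, unseen contexts are given a uniform row, and the input is assumed to contain only alphabet characters. spatt's actual output may differ in all of these respects.

```python
from collections import Counter
from itertools import product


def estimate_markov(sequence, alphabet="acgt", order=1):
    """Maximum-likelihood transition rows of an order-m model (m >= 0)."""
    # Count every word of length m + 1 (overlapping windows).
    counts = Counter(sequence[i:i + order + 1]
                     for i in range(len(sequence) - order))
    rows = []
    for context in product(alphabet, repeat=order):  # contexts in alphabet order
        prefix = "".join(context)
        row = [counts[prefix + letter] for letter in alphabet]
        total = sum(row)
        # Assumption of this sketch: a context never seen gets a uniform row.
        rows.append([c / total if total else 1 / len(alphabet) for c in row])
    return rows


def format_markov_file(rows, comment="Markov model parameters"):
    """One comment line, then one line of k probabilities per context."""
    lines = ["# " + comment]
    lines += [" ".join(f"{p:.6f}" for p in row) for row in rows]
    return "\n".join(lines) + "\n"


rows = estimate_markov("acgtacgtacgt", order=1)
print(format_markov_file(rows), end="")
```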

**Stationary file**

A stationary file contains the stationary distribution of a Markov model. It is a simple text file in which lines starting with "#" are treated as comments. For an order m model on an alphabet of size k, the file contains k^m lines of one real number each. Such a file can be produced through the --Stationnary-file option.

**Pattern file**

A pattern file contains a list of pattern descriptors. It is a simple text file in which lines starting with "#" are treated as comments. Each non-comment line is treated as a pattern descriptor.

**Count file**

A count file contains the number of occurrences of every word of a given length L. It is a simple text file in which lines starting with "#" are treated as comments. Each non-comment line gives one number of occurrences. All words of length L are listed this way, in alphabetical order.
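A minimal Python sketch of how such a file could be produced is given below. It assumes that "alphabetical order" means the order of the letters in the alphabet descriptor, that overlapping occurrences are counted, and that windows containing a character outside the alphabet are skipped; the function names and the comment line are hypothetical.

```python
from itertools import product


def count_words(sequence, length, alphabet="acgt"):
    """Count overlapping occurrences of every word of the given length."""
    # Pre-fill with zeros so that absent words still get a line, in order.
    counts = {"".join(w): 0 for w in product(alphabet, repeat=length)}
    for i in range(len(sequence) - length + 1):
        word = sequence[i:i + length]
        if word in counts:  # windows with other characters are skipped
            counts[word] += 1
    return counts


def format_count_file(counts, length):
    """One comment line, then one count per line in word order."""
    lines = [f"# counts of all words of length {length}"]
    lines += [str(n) for n in counts.values()]
    return "\n".join(lines) + "\n"


counts = count_words("acgtacgt", 2)
print(format_count_file(counts, 2), end="")
```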

**Statistic Normalization**

Statistics usually grow in magnitude with the length of the sequence considered. This is well known in large deviations theory and can be corrected by applying the normalization provided by the --normalize option:

normalizedS = - 1/n log10[ P( N(P) > Nobs(P) ) ] when P is seen more than expected

and

normalizedS = + 1/n log10[ P( N(P) < Nobs(P) ) ] when P is seen less than expected

where n is the length of the sequence.

**max-pvalue filtering** (sspatt only)

Filters the output so that only statistics whose p-values are lower than the specified threshold are produced. Note that the statistics are computed in all cases.

## EXAMPLES

Count all occurrences of two-letter words in the DNA sequence contained in the *seq.fasta* file:

- sspatt --all-words -l 2 seq.fasta

Compute the statistics of all two-letter words in the sequence contained in *seq.fasta*, according to an order 0 Markov model. Store the parameters of the model in the file *markov*:

- sspatt --all-words -l 2 seq.fasta -m 0 -M markov

A simple word:

    $ sspatt ecoli.fasta -p gctggtgg -m 1
    gctggtgg 499 70.10 +240.760141

An order 1 Markov model is estimated on the sequence in ecoli.fasta, and the statistic in log scale (the default) is output for the pattern gctggtgg. We observe 499 occurrences and expect 70.10, so the pattern is over-represented with a p-value around 1e-240.

A more complex pattern:

    $ sspatt ecoli.fasta -p g.tggtgg -m 0
    g.tggtgg 1043 294.63 +249.394597

Statistics for the pattern {gatggtgg,gctggtgg,ggtggtgg,gttggtgg} are computed for an order 0 Markov model. We observe 1043 occurrences while expecting only 294.63. The pattern is over-represented with a p-value around 1e-249.

All words of a given length:

    $ sspatt ecoli.fasta -l 4 -m 2 --all-words
    aaaa 35124 35104.19 +0.338697
    aaac 25253 26618.98 -16.893008
    aaag 22788 20425.36 +59.038597
    aaat 25736 26752.44 -9.738282
    aaca 21870 18864.77 +100.972531
    aacc 20444 24098.25 -128.653351
    aacg 24404 23571.88 +7.477807
    (...)

Statistics for all words of length 4 are computed for an order 2 Markov model.

A very long pattern:

    $ sspatt swissprot.fasta -l 3 -m 2 -a ARNDCEQGHILKMFPSTWYV \
      -p PNEKVVGIYRMTTPSVLLRDLDIIKHVLIKDFESFADRGVEF
    PNEKVVGIYRMTTPSVLLRDLDIIKHVLIKDFESFADRGVEF 1 1.998352e-45 +44.699328

Computes the statistic for the (very) long pattern given over the amino-acid alphabet. As this alphabet has a large cardinality (20), a shorter length than the default must be used for the counted words (hence the -l 3 parameter). One occurrence is observed where 2e-45 is expected, resulting in an over-representation with a p-value around 1e-44.
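The order of magnitude of the over-representation statistics above can be checked from the observed and expected counts alone. The Python sketch below uses a simple Poisson upper tail, P(N >= Nobs), as a stand-in for the binomial approximation of sspatt (the two are described above as very close, and the sequence length needed for the binomial is not given here). The function name is hypothetical, the use of ">=" is an assumption of this sketch, and it only handles the over-represented case (Nobs larger than expected); it is not sspatt's implementation.

```python
import math


def log10_poisson_upper_tail(n_obs, expected, max_terms=100000):
    """log10 of P(N >= n_obs) for N ~ Poisson(expected), in log space.

    Intended for n_obs > expected, where the tail terms decrease quickly.
    """
    # Log of the first tail term, P(N = n_obs).
    log_pmf = (-expected + n_obs * math.log(expected)
               - math.lgamma(n_obs + 1))
    # Sum the remaining terms relative to the first one:
    # 1 + e/(k+1) + e^2/((k+1)(k+2)) + ...
    ratio, total = 1.0, 1.0
    for j in range(1, max_terms):
        ratio *= expected / (n_obs + j)
        total += ratio
        if ratio < 1e-17 * total:
            break
    return (log_pmf + math.log(total)) / math.log(10)


# Observed and expected counts taken from the examples above.
for n_obs, expected in ((499, 70.10), (1, 1.998352e-45)):
    s = -log10_poisson_upper_tail(n_obs, expected)
    print(f"Nobs={n_obs} expected={expected:g} S={s:+.2f}")
```

Both values land close to the statistics reported by sspatt in the examples, which is consistent with the Poisson and binomial approximations being nearly identical for rare patterns.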

## AUTHORS

Grégory Nuel *<nuelAATTgenopole.cnrs.fr>* and Mark Hoebeke *<Mark.HoebekeAATTjouy.inra.fr>*

