SEARCH
NEW RPMS
DIRECTORIES
ABOUT
FAQ
VARIOUS
BLOG
DONATE


YUM REPOSITORY

 
 

MAN page from OpenSuSE 13.X perl-DBD-SQLite-1.54-42.1.x86_64.rpm

DBD::SQLite::Fulltext_search

Section: User Contributed Perl Documentation (3)
Updated: 2016-04-21
Index 

NAME

DBD::SQLite::Fulltext_search - Using fulltext searches with DBD::SQLite 

DESCRIPTION

 

Introduction

SQLite is bundled with an extension module called ``FTS'' for full-textindexing. Tables with this feature enabled can be efficiently queriedto find rows that contain one or more instances of some specifiedwords (also called ``tokens''), in any column, even if the table contains manylarge documents.

The first full-text search modules for SQLite were called "FTS1" and "FTS2"and are now obsolete. The latest version is "FTS4", but it shares manyfeatures with the former module "FTS3", which is why parts of the API and parts of the documentation still refer to "FTS3"; from a clientpoint of view, both can be considered largely equivalent.Detailed documentation can be foundat <http://www.sqlite.org/fts3.html>. 

Short example

Here is a very short example of using FTS :

  $dbh->do(<<"") or die DBI::errstr;  CREATE VIRTUAL TABLE fts_example USING fts4(content)    my $sth = $dbh->prepare("INSERT INTO fts_example(content) VALUES (?)");  $sth->execute($_) foreach @docs_to_insert;    my $results = $dbh->selectall_arrayref(<<"");  SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH 'foo'

The key points in this example are :

*
The syntax for creating FTS tables is

  CREATE VIRTUAL TABLE <table_name> USING fts4(<columns>)

where "<columns>" is a list of column names. Columns may betyped, but the type information is ignored. If no columnsare specified, the default is a single column named "content".In addition, FTS tables have an implicit column called "docid"(or also "rowid") for numbering the stored documents.

*
Statements for inserting, updating or deleting records use the same syntax as for regular SQLite tables.
*
Full-text searches are specified with the "MATCH" operator, and anoperand which may be a single word, a word prefix ending with '*', alist of words, a ``phrase query'' in double quotes, or a boolean combinationof the above.
*
The builtin function "snippet(...)" builds a formatted excerpt of thedocument text, where the words pertaining to the query are highlighted.

There are many more details to building and searchingFTS tables, so we strongly invite you to readthe full documentation at <http://www.sqlite.org/fts3.html>. 

QUERY SYNTAX

Here are some explanation about FTS queries, borrowed from the sqlite documentation. 

Token or token prefix queries

An FTS table may be queried for all documents that contain a specifiedterm, or for all documents that contain a term with a specifiedprefix. The query expression for a specific term is simply the termitself. The query expression used to search for a term prefix is theprefix itself with a '*' character appended to it. For example:

  -- Virtual table declaration  CREATE VIRTUAL TABLE docs USING fts3(title, body);    -- Query for all documents containing the term "linux":  SELECT * FROM docs WHERE docs MATCH 'linux';    -- Query for all documents containing a term with the prefix "lin".  SELECT * FROM docs WHERE docs MATCH 'lin*';

If a search token (on the right-hand side of the MATCH operator) begins with ``^'' then that token must be the first in its field ofthe document : so for example "^lin*" matches'linux kernel changes ...' but does not match 'new linux implementation'. 

Column specifications

Normally, a token or token prefix query is matched against the FTStable column specified as the right-hand side of the MATCHoperator. Or, if the special column with the same name as the FTStable itself is specified, against all columns. This may be overriddenby specifying a column-name followed by a ``:'' character before a basicterm query. There may be space between the ``:'' and the term to queryfor, but not between the column-name and the ``:'' character. Forexample:

  -- Query the database for documents for which the term "linux" appears in  -- the document title, and the term "problems" appears in either the title  -- or body of the document.  SELECT * FROM docs WHERE docs MATCH 'title:linux problems';  -- Query the database for documents for which the term "linux" appears in  -- the document title, and the term "driver" appears in the body of the document  -- ("driver" may also appear in the title, but this alone will not satisfy the.  -- query criteria).  SELECT * FROM docs WHERE body MATCH 'title:linux driver';
 

Phrase queries

A phrase query is a query that retrieves all documents that contain anominated set of terms or term prefixes in a specified order with nointervening tokens. Phrase queries are specified by enclosing a spaceseparated sequence of terms or term prefixes in double quotes ("). Forexample:

  -- Query for all documents that contain the phrase "linux applications".  SELECT * FROM docs WHERE docs MATCH '"linux applications"';  -- Query for all documents that contain a phrase that matches "lin* app*".   -- As well as "linux applications", this will match common phrases such   -- as "linoleum appliances" or "link apprentice".  SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
 

NEAR queries.

A NEAR query is a query that returns documents that contain a two ormore nominated terms or phrases within a specified proximity of eachother (by default with 10 or less intervening terms). A NEAR query isspecified by putting the keyword ``NEAR'' between two phrase, term orprefix queries. To specify a proximity other than the default, anoperator of the form ``NEAR/<N>'' may be used, where <N> is the maximumnumber of intervening terms allowed. For example:

  -- Virtual table declaration.  CREATE VIRTUAL TABLE docs USING fts4();  -- Virtual table data.  INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');  -- Search for a document that contains the terms "sqlite" and "database" with  -- not more than 10 intervening terms. This matches the only document in  -- table docs (since there are only six terms between "SQLite" and "database"   -- in the document).  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';  -- Search for a document that contains the terms "sqlite" and "database" with  -- not more than 6 intervening terms. This also matches the only document in  -- table docs. Note that the order in which the terms appear in the document  -- does not have to be the same as the order in which they appear in the query.  SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';  -- Search for a document that contains the terms "sqlite" and "database" with  -- not more than 5 intervening terms. This query matches no documents.  SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';  -- Search for a document that contains the phrase "ACID compliant" and the term  -- "database" with not more than 2 terms separating the two. This matches the  -- document stored in table docs.  SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';  -- Search for a document that contains the phrase "ACID compliant" and the term  -- "sqlite" with not more than 2 terms separating the two. This also matches  -- the only document stored in table docs.  SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';

More than one NEAR operator may appear in a single query. In this caseeach pair of terms or phrases separated by a NEAR operator must appearwithin the specified proximity of each other in the document. Usingthe same table and data as in the block of examples above:

  -- The following query selects documents that contains an instance of the term   -- "sqlite" separated by two or fewer terms from an instance of the term "acid",  -- which is in turn separated by two or fewer terms from an instance of the term  -- "relational".  SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';  -- This query matches no documents. There is an instance of the term "sqlite" with  -- sufficient proximity to an instance of "acid" but it is not sufficiently close  -- to an instance of the term "relational".  SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';

Phrase and NEAR queries may not span multiple columns within a row. 

Set operations

The three basic query types described above may be used to query thefull-text index for the set of documents that match the specifiedcriteria. Using the FTS query expression language it is possible toperform various set operations on the results of basic queries. Thereare currently three supported operations:
*
The AND operator determines the intersection of two sets of documents.
*
The OR operator calculates the union of two sets of documents.
*
The NOT operator may be used to compute the relative complement of oneset of documents with respect to another.

The AND, OR and NOT binary set operators must be entered using capitalletters; otherwise, they are interpreted as basic term queries insteadof set operators. Each of the two operands to an operator may be abasic FTS query, or the result of another AND, OR or NOT setoperation. Parenthesis may be used to control precedence and grouping.

The AND operator is implicit for adjacent basic queries without anyexplicit operator. For example, the query expression ``implicitoperator'' is a more succinct version of ``implicit AND operator''.

Boolean operations as just described correspond to the so-called``enhanced query syntax'' of sqlite; this is the version compiled with "DBD::SQLite", starting from version 1.31.A former version, called the ``standard query syntax'', used tosupport tokens prefixed with '+' or '-' signs (for token inclusionor exclusion); if your application needs to support this oldsyntax, use DBD::SQLite::FTS3Transitional (publishedin a separate distribution) for doing the conversion. 

TOKENIZERS

 

Concept

The behaviour of full-text indexes strongly depends on howdocuments are split into tokens; therefore FTS tabledeclarations can explicitly specify how to performtokenization:

  CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)

where "<tokenizer>" is a sequence of space-separatedwords that triggers a specific tokenizer. Tokenizers canbe SQLite builtins, written in C code, or Perl tokenizers.Both are as explained below. 

SQLite builtin tokenizers

SQLite comes with some builtin tokenizers (see<http://www.sqlite.org/fts3.html#tokenizer>) :
simple
Under the simple tokenizer, a term is a contiguous sequence ofeligible characters, where eligible characters are all alphanumericcharacters, the ``_'' character, and all characters with UTF codepointsgreater than or equal to 128. All other characters are discarded whensplitting a document into terms. They serve only to separate adjacentterms.

All uppercase characters within the ASCII range (UTF codepoints lessthan 128), are transformed to their lowercase equivalents as part ofthe tokenization process. Thus, full-text queries are case-insensitivewhen using the simple tokenizer.

porter
The porter tokenizer uses the same rules to separate the inputdocument into terms, but as well as folding all terms to lower case ituses the Porter Stemming algorithm to reduce related English languagewords to a common root.
icu
The icu tokenizer uses the ICU library to decide how toidentify word characters in different languages; however, thisrequires SQLite to be compiled with the "SQLITE_ENABLE_ICU"pre-processor symbol defined. So, to use this tokenizer, you neededit Makefile.PL to add this flag in @CC_DEFINE, and thenrecompile "DBD::SQLite"; of course, the prerequisite is to havean ICU library available on your system.
unicode61
The unicode61 tokenizer works very much like ``simple'' except that itdoes full unicode case folding according to rules in Unicode Version6.1 and it recognizes unicode space and punctuation characters anduses those to separate tokens. By contrast, the simple tokenizer onlydoes case folding of ASCII characters and only recognizes ASCII spaceand punctuation characters as token separators.

By default, ``unicode61'' also removes all diacritics from Latin scriptcharacters. This behaviour can be overridden by adding the tokenizerargument "remove_diacritics=0". For example:

  -- Create tables that remove diacritics from Latin script characters  -- as part of tokenization.  CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);  CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");  -- Create a table that does not remove diacritics from Latin script  -- characters as part of tokenization.  CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");

Additional options can customize the set of codepoints that unicode61treats as separator characters or as token characters --- see thedocumentation in <http://www.sqlite.org/fts3.html#unicode61>.

If a more complex tokenizing algorithm is required, for example toimplement stemming, discard punctuation, or to recognize compound words,use the perl tokenizer to implement your own logic, as explained below. 

Perl tokenizers

Declaring a perl tokenizer

In addition to the builtin SQLite tokenizers, "DBD::SQLite"implements a perl tokenizer, that can hook to any tokenizingalgorithm written in Perl. This is specified as follows :

  CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')

where "<perl_function>" is a fully qualified Perl function name(i.e. prefixed by the name of the package in which that function isdeclared). So for example if the function is "my_func" in the main program, write

  CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')

Writing a perl tokenizer by hand

That function should return a code reference that takes a string assingle argument, and returns an iterator (another function), whichreturns a tuple "($term, $len, $start, $end, $index)" for eachterm. Here is a simple example that tokenizes on words according tothe current perl locale

  sub locale_tokenizer {    return sub {      my $string = shift;      use locale;      my $regex      = qr/\w+/;      my $term_index = 0;      return sub { # closure        $string =~ /$regex/g or return; # either match, or no more token        my ($start, $end) = ($-[0], $+[0]);        my $len           = $end-$start;        my $term          = substr($string, $start, $len);        return ($term, $len, $start, $end, $term_index++);      }    };  }

There must be three levels of subs, in a kind of ``Russian dolls'' structure,because :

*
the external, named sub is called whenever accessing a FTS tablewith that tokenizer
*
the inner, anonymous sub is called whenever a new stringneeds to be tokenized (either for inserting new text into the table,or for analyzing a query).
*
the innermost, anonymous sub is called repeatedly for retrievingall terms within that string.

Using Search::Tokenizer

Instead of writing tokenizers by hand, you can grab one of thosealready implemented in the Search::Tokenizer module. For example,if you want ignore differences between accented characters, you canwrite :

  use Search::Tokenizer;  $dbh->do(<<"") or die DBI::errstr;  CREATE ... USING fts4(<columns>,                         tokenize=perl 'Search::Tokenizer::unaccent')

Alternatively, you can use ``new'' in Search::Tokenizer to buildyour own tokenizer. Here is an example that treats compoundwords (words with an internal dash or dot) as single tokens :

  sub my_tokenizer {    return Search::Tokenizer->new(      regex => qr{\p{Word}+(?:[-./]\p{Word}+)*},     );  }
 

Fts4aux - Direct Access to the Full-Text Index

The content of a full-text index can be accessed through thevirtual table module ``fts4aux''. For example, assuming thatour database contains a full-text indexed table named ``ft'',we can declare :

  CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)

and then query the "ft_terms" table to access thelist of terms, their frequency, etc.Examples are documented in<http://www.sqlite.org/fts3.html#fts4aux>. 

How to spare database space

By default, FTS stores a complete copy of the indexed documents,together with the fulltext index. On a large collection of documents,this can consume quite a lot of disk space. However, FTS has someoptions for compressing the documents, or even for not storing them atall --- see <http://www.sqlite.org/fts3.html#fts4_options>.

In particular, the option for contentless FTS tables only storesthe fulltext index, without the original document content. This isspecified as "content=""", like in the following example :

  CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b)

Data can be inserted into such an FTS4 table using an INSERTstatements. However, unlike ordinary FTS4 tables, the user must supplyan explicit integer docid value. For example:

  -- This statement is Ok:  INSERT INTO t1(docid, a, b) VALUES(1, 'a b c', 'd e f');  -- This statement causes an error, as no docid value has been provided:  INSERT INTO t1(a, b) VALUES('j k l', 'm n o');

Of course your application will need an algorithm for findingthe external resource corresponding to any docid stored withinSQLite.

When using placeholders, the docid must be explicitly typed to INTEGER, because this is a ``hidden column'' for which sqlite is not able to automatically infer the proper type. So the followingdoesn't work :

  my $sth = $dbh->prepare("INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)");  $sth->execute(2, 'aa', 'bb'); # constraint error

but it works with an explicitly cast :

  my $sql = "INSERT INTO t1(docid, a, b) VALUES(CAST(? AS INTEGER), ?, ?)",  my $sth = $dbh->prepare(sql);  $sth->execute(2, 'aa', 'bb');

or with an explicitly typed ``bind_param'' in DBI :

  use DBI qw/SQL_INTEGER/;  my $sql = "INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)";  my $sth = $dbh->prepare(sql);  $sth->bind_param(1, 2, SQL_INTEGER);  $sth->bind_param(2, "aa");  $sth->bind_param(3, "bb");  $sth->execute();

It is not possible to UPDATE or DELETE a row stored in a contentlessFTS4 table. Attempting to do so is an error.

Contentless FTS4 tables also support SELECT statements. However, it isan error to attempt to retrieve the value of any table column otherthan the docid column. The auxiliary function "matchinfo()" may beused, but "snippet()" and "offsets()" may not, so if suchfunctionality is needed, it has to be directly programmed within thePerl application. 

AUTHOR

Laurent Dami <damiAATTcpan.org> 

COPYRIGHT

Copyright 2014 Laurent Dami.

Some parts borrowed from the <http://sqlite.org> documentation, copyright 2014.

This documentation is in the public domain; you can redistributeit and/or modify it under the same terms as Perl itself.


 

Index

NAME
DESCRIPTION
Introduction
Short example
QUERY SYNTAX
Token or token prefix queries
Column specifications
Phrase queries
NEAR queries.
Set operations
TOKENIZERS
Concept
SQLite builtin tokenizers
Perl tokenizers
Fts4aux - Direct Access to the Full-Text Index
How to spare database space
AUTHOR
COPYRIGHT

This document was created byman2html,using the manual pages.