MAN page from Fedora 5 mnogosearch-3.2.39-1.x86_64.rpm

INDEXER.CONF

Section: mnoGoSearch reference manual (5)
Updated: 23 March 2001

NAME

indexer.conf - configuration file for indexer 

DESCRIPTION

This is the configuration file for indexer(1). The configuration file consists of commands and their arguments. All commands are case-insensitive. You can use # to comment out lines.

VARIABLES

 

Global parameters

These commands should be used only once and take global effect for the whole configuration file.
DBType type
Database type. Currently supported values are mysql, pgsql, msql, solid, mssql, oracle, ibase, sqlite. The value does not matter when native library support is used, but ODBC users must specify one of the supported values. If your database type is not supported, use unknown instead.
DBHost host
SQL host name (not required for ODBC). Default: localhost
DBName mnogosearch
SQL database name or ODBC DSN. Default: mnogosearch
DBUser foo
User name used to connect to the database. Default: no user
DBPass bar
Password used to connect to the database. Default: no password
DBMode single/multi/crc/crc-multi
SQL database word storage mode. Does not apply to the built-in database. When single is specified, all words are stored in the same table. multi means that words are stored in different tables depending on word length. multi mode is usually faster, but it requires more tables in the database. In crc mode, mnoGoSearch stores 32-bit integer word IDs calculated by the CRC32 algorithm instead of the words themselves. crc mode requires less disk space and is faster than single and multi modes. crc-multi mode shares its storage structure with crc mode, but stores words in different tables depending on word length, like multi mode. The default DBMode value is single.
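The idea behind crc mode, replacing each word with a 32-bit integer ID, can be sketched in Python. This is only an illustration: the function name word_id is hypothetical, and the lowercasing and UTF-8 encoding steps are assumptions, so the IDs produced here are not guaranteed to match what indexer actually stores.

```python
import zlib


def word_id(word):
    """Return an unsigned 32-bit CRC32 value for a word.

    Assumption: the word is lowercased and encoded as UTF-8 before hashing.
    The manual does not say how indexer normalizes words.
    """
    return zlib.crc32(word.lower().encode("utf-8")) & 0xFFFFFFFF


if __name__ == "__main__":
    for word in ("search", "Search", "indexer"):
        print(word, word_id(word))
```

Because only the integer is stored, two different words that happen to share a CRC32 value cannot be told apart in the database.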
LocalCharset charset
Defines charset for local file system. It is required if you are using 8 bit characters and is not applicable for 7 bit characters. This command is to be used once and takes global effect for the whole configuration file. Example:
LocalCharset windows-1250
CrossWords yes|no
Enables building of the CrossWords index. Crosswords are words that are used in a link pointing to the current page. The default value is no.
StopWordFile filename
This command indicates which file contains the stopword list to load. You may specify either an absolute file name or a file name relative to the mnoGoSearch /etc directory. You may use several StopWordFile commands.
MinWordLength characters
MaxWordLength characters
With these commands you can change the default length range of words stored in the database. By default mnoGoSearch stores words that are longer than 1 and shorter than 32.
Example:
MaxWordLength 35
MaxDocSize bytes
Specify the maximum size in bytes of a document that can be indexed. The default value is 1048576 (1 MB). This command takes global effect for the whole config file.
HTTPHeader header
You may add custom HTTP headers to indexer HTTP request. Do not use "If-modified-since" and "Accept-Charset" headers, since they are composed by indexer itself. "User-Agent: mnoGoSearch/version" is sent too, although you may override it. The command has global effect for the whole configuration file.
ServerTable table_name
This command works only with an SQL database and is not applicable to built-in database mode. It loads servers with all their parameters from the table table_name. For an example of such a table's structure, please refer to the file create/mysql/server.txt. You may use several arguments with this command:
ServerTable my_servers1 my_servers2 my_servers3
or just a single argument:
ServerTable server
DeleteNoServer yes|no
Use this command to specify whether to delete URLs that have no corresponding Server command. Default value is yes.

VarDir /path/to/my/var/dir
Specify a custom path to the directory where indexer stores data when used with the built-in database and in cache mode. By default the /var directory of the mnoGoSearch installation is used.

 

URL Control Configuration


  
Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
Use this command to allow URLs that match (or do not match) the given argument. The first three optional parameters describe the type of comparison. Default values are Match, NoCase, String. Use NoCase or Case to choose case-insensitive or case-sensitive comparison. Use Regex to choose regular expression comparison. Use String to choose string-with-wildcards comparison. Wildcards are * for any number of characters and ? for one character. Note that * and ? have special meaning in the String match type; please use Regex to describe documents with ? and * signs in the URL. String match is much faster than Regex, so use String where it is possible. You may use several arguments for one Allow command and use this command any number of times. It takes global effect for the config file. Note that mnoGoSearch automatically adds one Allow regex .* command after reading the config file. That command means that everything that is not disallowed is allowed.
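The String wildcard rules described for Allow (* for any number of characters, ? for exactly one) can be modelled with a short Python sketch. The helper names are hypothetical, and the sketch assumes that a pattern must match the whole URL, which the manual does not state. It is not mnoGoSearch's own matcher.

```python
import re


def wildcard_to_regex(pattern, case_sensitive=False):
    """Translate a String-style wildcard pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any number of characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    flags = re.DOTALL if case_sensitive else re.DOTALL | re.IGNORECASE
    return re.compile("".join(parts), flags)


def string_match(pattern, url, case_sensitive=False):
    """Return True if the whole URL matches the pattern (assumed anchoring)."""
    return wildcard_to_regex(pattern, case_sensitive).fullmatch(url) is not None


if __name__ == "__main__":
    print(string_match("*.html", "http://localhost/index.HTML"))
    print(string_match("*.html", "http://localhost/index.HTML", case_sensitive=True))
```

The default of case_sensitive=False mirrors the NoCase default described above.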

Disallow [Match|NoMatch] [Case|NoCase] [String|Regex] [<arg> ...]
Use this command to disallow indexing of documents with URLs that match the given argument. The meaning of the first three optional parameters is exactly the same as with the Allow command. You can use several arguments for one Disallow command. Takes global effect for the config file.
Example:
#Exclude cgi-bin and non-parsed-headers
Disallow /cgi-bin/ \.cgi /nph
#Exclude some known extensions
Disallow \.b$  \.sh$   \.md5$

Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$
#Exclude Apache directory list in different sort order
Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
#Exclude ./. and ./.. from Apache and Squid directory list
Disallow /[.]{1,2} /\%2e /\%2f
CheckOnly regexp [regexp [...] ]
Indexer will use the HEAD instead of the GET HTTP method for URLs that match regexp. This means that the file will only be checked and will not be downloaded. Useful for zip, exe, arj etc. files. One can use several arguments for one CheckOnly command. One can use this command any number of times, but not more than MAXFILTER in indexer.h. Takes global effect for the config file.
Examples:
#Use HEAD method for some known non-text extensions:
CheckOnly \.b$ \.sh$   \.md5$

CheckOnly \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
CheckOnly \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
CheckOnly \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
CheckOnly \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
CheckOnly \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$
CheckOnly \.vrml$ \.wrl$
CheckOnly \.exe$ \.cab$ \.dll$ \.bin$ \.class$
CheckOnly \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
CheckOnly \.rtf$ \.pdf$ \.cdf$ \.ps$
CheckOnly \.ai$ \.eps$ \.ppt$ \.hqx$
CheckOnly \.cpt$ \.bms$ \.oda$ \.tcl$
CheckOnly \.rpm$
HrefOnly regexp [regexp [...] ]
Indexer scans html documents that match regexp as it would scan any other URLs, except that it will not index the contents. It will add any URLs it finds in the html document to the database. Useful when indexing mailing list archives with big index pages which contain mostly URLs. One can use several arguments for one HrefOnly command. One can use this command any number of times, but not more than MAXFILTER in indexer.h. Takes global effect for the config file.
Examples:
#Scan these files for href tags only, but do not index their contents.
HrefOnly mail.*\.html$ thr.*\.html$
 

MIME types and external parsers

UseRemoteContentType yes|no
This command specifies if the indexer should get content type from HTTP server headers (yes), or from its AddType settings (no). If set to no, and the indexer could not determine content-type with its AddType settings,
SyslogFacility facility
Useful only if indexer is compiled with syslog support and if you do not like the default. The argument is the same as used in the syslog.conf file (for example: local7, daemon). For the list of possible facilities see syslog.conf(5). Takes global effect and should be used only once! Default: depends on compilation.
LogdAddr host[:port]
Use cachelogd at the given host and port, if specified. Required for cache mode only. Default values are localhost and port 7000.
FollowOutside yes|no
Allow/disallow indexer to walk outside the current server. Should be used carefully (see the MaxHops command). Default: no
Period seconds
Reindex period in seconds, 604800 = 1 week. May be used before every Server command and takes effect till the end of the config file or till the next Period command.
Tag number
Use this parameter for your own purposes, for example for grouping some servers into one group. May be used multiple times before every Server command and takes effect till the end of the config file or till the next Tag command.
MaxHops number
Maximum distance in "mouse clicks" from the start URL given in the Server command. May be used multiple times before every Server command and takes effect till the end of the config file or till the next MaxHops command. Default: 256
MaxNetErrors number
Maximum number of network errors for each server. If there are too many network errors on some server (server is down, host unreachable etc.), indexer will try not to make more than number attempts to connect to this server. May be used multiple times before the Server command and takes effect till the end of the config file or till the next MaxNetErrors command. Default: 16
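Several commands in this section take effect "till the end of the config file or till the next" command of the same name. The Python sketch below models that scoping rule for MaxHops and MaxNetErrors only. The function name and the parsing approach are hypothetical and much simpler than indexer's own parser; the defaults are the ones stated above.

```python
# Defaults as documented above for the two commands modelled here.
SCOPED_DEFAULTS = {"maxhops": "256", "maxneterrors": "16"}


def servers_with_settings(config_text):
    """Return (url, settings) pairs, one per Server command.

    Each Server command receives a snapshot of the scoped settings that are
    in effect at the point where it appears in the file.
    """
    current = dict(SCOPED_DEFAULTS)
    servers = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        parts = line.split(None, 1)
        name = parts[0].lower()           # commands are case-insensitive
        arg = parts[1].strip() if len(parts) > 1 else ""
        if name == "server":
            servers.append((arg, dict(current)))
        elif name in current:
            current[name] = arg           # stays in effect until changed again
    return servers


if __name__ == "__main__":
    sample = """
    MaxHops 2
    Server http://localhost/
    MaxHops 5
    Server http://www.yoursite.com/
    """
    for url, settings in servers_with_settings(sample):
        print(url, settings)
```

In the sample, http://localhost/ is crawled with MaxHops 2 and http://www.yoursite.com/ with MaxHops 5, while both keep the default MaxNetErrors.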
TitleWeight number
Weight of the words in <title>...</title>. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next TitleWeight command. Default: 2
BodyWeight number
Weight of the words in the <body>...</body> of html documents and in the contents of text/plain documents. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next BodyWeight command. Default: 1
DescWeight number
Weight of the words in <META NAME="Description" Content="...">. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next DescWeight command. Default: 2
KeywordWeight number
Weight of the words in <META NAME="Keywords" Content="...">. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next KeywordWeight command. Default: 2
UrlWeight number
Weight of the words in the URL of the documents. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next UrlWeight command. Default: 0
DeleteBad yes|no
DeleteBad no prevents indexer from deleting bad (not found, forbidden etc.) URLs from the database. Useful if you want to check the 'integrity' of your server(s): if you set it to no, the "bad" URLs will remain in the database. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next DeleteBad command. Default: yes
Robots yes|no
Allows/disallows using robots.txt and <META NAME="robots"> exclusions. Useful if you want to check the 'integrity' of your server(s). Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Robots command. Default: yes

Section <string> <number>
where <string> is a section name and <number> is a section ID between 0 and 255. Use 0 if you do not want to index a section. It is better to use different section IDs for different document parts. In that case, at search time you will be able to give a different weight to each part or even exclude some sections.

Index yes|no
Index no prevents indexer from storing words into the database. Useful if you want to check the 'integrity' of your server(s). Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Index command. Note: instead of Index no you can use the alternate form NoIndex. Default: yes
Follow yes|no
Allow/disallow indexer to store <a href="..."> links into the database. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Follow command. Note: instead of Follow no you can use the alternate form NoFollow. Default: yes
MaxDocSize size
This command limits the maximum document size. size is in bytes. If a document is larger than size, indexer will parse only the first size bytes of the document. Default: 1048576 (1 megabyte)
Mime <from_mime> <to_mime>[;charset] [command line [$1]]
This is used to add support for parsing documents with mime types other than text/plain and text/html. It can be done via an external parser (which should provide output in plain or html text) or just by substituting the mime type so indexer can understand it directly. <from_mime> and <to_mime> are standard mime types. <to_mime> should be either text/plain or text/html, because these are the only types that indexer understands.
We assume the external parser writes its results to stdout (if not, you have to write a little script that sends the results to stdout). The optional charset parameter is used to change the charset if needed. The command line parameter is optional. If there is no command line, the command is used only to change the mime type. The command line can also contain a $1 parameter, which stands for a temporary file name. Some parsers cannot operate on stdin, so indexer creates a temporary file for the parser and passes its name in place of $1.
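The external parser convention described above (document on stdin, or a temporary file substituted for $1, with results read from stdout) can be sketched in Python. The function run_parser is hypothetical, and splitting the command line with shlex is an assumption; indexer's real invocation logic may differ.

```python
import os
import shlex
import subprocess
import tempfile


def run_parser(command_line, document):
    """Run an external parser on document (bytes) and return its stdout.

    If command_line contains $1, the document is written to a temporary
    file and $1 is replaced by that file's name. Otherwise the document
    is passed to the parser on stdin.
    """
    if "$1" not in command_line:
        argv = shlex.split(command_line)
        done = subprocess.run(argv, input=document,
                              stdout=subprocess.PIPE, check=True)
        return done.stdout

    fd, tmp_name = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(document)
        argv = [arg.replace("$1", tmp_name)
                for arg in shlex.split(command_line)]
        done = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
        return done.stdout
    finally:
        os.unlink(tmp_name)  # always remove the temporary file
```

Whatever the parser prints on stdout is what would then be treated as text/plain or text/html, according to <to_mime>.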
CharSet charset
Useful for 8-bit character sets. WWW servers send data in different character sets. charset is the default character set of the servers in the next Server command(s). May be used before every Server command and takes effect till the end of the config file or till the next CharSet command. Currently indexer supports the Cyrillic koi8-r, cp1251, cp866, iso8859-5, x-mac-cyrillic, Arabic cp1256, Western iso-8859-1, Central Europe iso-8859-2 and cp1250 character sets. This parameter is the default character set for "bad" servers that do not send charset information in the header (just "Content-type: text/html" instead of, for example, "Content-type: text/html; charset=koi8-r") and do not send charset information in META tags.
Examples:
CharSet koi8-r
CharSet windows-1250
CharSet ISO-8859-1
ForceIISCharset1251 yes/no
This option is useful for users dealing with Cyrillic content and broken (or misconfigured?) Microsoft IIS web servers, which tend to report the charset incorrectly. This is a really dirty hack, but if this option is turned on it is assumed that all servers that are reported as 'Microsoft' or 'IIS' have content in the Windows-1251 codepage. This command should be used only once in the configuration file and takes global effect. Default: no
AuthBasic login:passwd
Use basic HTTP authorization. Can be set before every Server command and takes effect only for the next Server command.
Examples:
AuthBasic somebody:something
If you have password-protected directories, but the whole server is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/
ProxyAuthBasic login:passwd
Use HTTP proxy basic authorization. Can be used before every Server command and takes effect only for the next Server command! It should also come before the Proxy command.
Example:
ProxyAuthBasic somebody:smth
Proxy your.proxy.host[:port]
Connect via a proxy rather than directly. FTP servers can be indexed only when using a proxy. If port is not specified, it is set to the default value of 3128 (Squid). If the proxy host is not specified, a direct connection will be performed. Can be set before every Server command and takes effect till the end of the config file or till the next Proxy command.
Examples:
Proxy atoll.anywhere.com
 - proxy on atoll.anywhere.com, port 3128
Proxy lota.anywhere.com:8090
 - proxy on lota.anywhere.com, port 8090
Proxy
 - turn off proxy usage (direct connection)
Server URL
This is the main configuration command. Use it to add the start URL of a server to be indexed. You may use many Server commands in the same indexer.conf file.
Examples:
Server http://localhost/
Server http://www.yoursite.com/
Server http://www.yoursite.com/~yourname/
Server ftp://ftp.yourdomain.com/pub/
 

EXAMPLE

This is a minimal sample indexer config file
DBHost          localhost

DBName         udmsearch

DBUser         foo

DBPass         bar

Server         http://localhost/

Disallow /cgi-bin/ \.cgi /nph
Disallow \.b$  \.sh$   \.md5$

Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$
Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow /[.]{1,2} /\%2e /\%2f
 

SEE ALSO

indexer(1), syslog.conf(5)


 

Index

NAME
DESCRIPTION
VARIABLES
Global parameters
URL Control Configuration
MIME types and external parsers
EXAMPLE
SEE ALSO

This document was created by man2html, using the manual pages.