EFDB - Information on FULLTEXT search engine

Information on FULLTEXT search engine

How FULLTEXT search works and what you can expect from it

The FULLTEXT search function matches a natural language query against a text collection (which is simply the set of columns covered by a FULLTEXT index). For every row in a table it returns relevance - a similarity measure between the text in that row (in the columns that are part of the collection) and the query. The rows returned are automatically sorted with relevance decreasing. Relevance is a non-negative floating-point number. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word. Any "word" that is present in the stopword list or just too short (3 characters or less) is ignored. Every correct word in the collection and in the query is weighted, according to its significance in the query or collection. This way, a word that is present in many documents will have lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Otherwise, if the word is rare, it will receive a higher weight. The weights of the words are then combined to compute the relevance of the row. Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not reflect adequately their semantical value, and this model may sometimes produce bizarre results.

Using keywords and boolean operators

This FULLTEXT search engine supports boolean mode. This means, it is possible to use keywords in combination with boolean operators. The boolean FULLTEXT search capability supports the following operators:

+ A leading plus sign indicates that this word must be present in every row returned.
- A leading minus sign indicates that this word must not be present in any row returned. By default (when neither plus nor minus is specified) the word is optional, but the rows that contain it will be rated higher.
< > These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The < operator decreases the contribution and the > operator increases it. See the example below.
( ) Parentheses are used to group words into subexpressions.
~ A leading tilde acts as a negation operator, causing the word's contribution to the row relevance to be negative. It's useful for marking noise words. A row that contains such a word will be rated lower than others, but will not be excluded altogether, as it would be with the - operator.
* An asterisk is the truncation operator. Unlike the other operators, it should be appended to the word, not prepended.
" The phrase, that is enclosed in double quotes ", matches only rows that contain this phrase literally, as it was typed.

Examples

apple banana
         find rows that contain at least one of these words.

+apple +juice
         find rows that contain both words.

+apple macintosh
         ... word "apple", but rank it higher if it also contains "macintosh".

+apple -macintosh
         ... word "apple" but not "macintosh".

+apple +(>pie <strudel)
         ... "apple" and "pie", or "apple" and "strudel" (in any order), but rank "apple pie" higher than "apple strudel".

apple*
         ... "apple", "apples", "applesauce", "applet", ...

"some words"
         ... "some words of wisdom", but not "some noise words".

+"methane conversion" +"animal production" +methane -cattle
         ... "methane conversion" and "animal production" and methane, but not cattle.