Abstract

Searching is a key feature of applications that deal with
large amounts of data, and it has to be fast. Users should
usually also be able to search for substrings and to combine
several search strings.

Database layout
Search is based on two tables. One is a table that only
contains a list of words and a counter that records how often
each word is used. A second table contains the relationship
to the datasets that contain these words. This approach
helps to reduce redundancy.

Table: SearchWords
+-----+-------+-------+
| Id  | Word  | Count |
+-----+-------+-------+
| 1   | foo   | 3     |
+-----+-------+-------+

Table: SearchRelationship
+-----+-----------------+--------+-----------+
| Id  | SearchWords_Id  | Module | Module_Id |
+-----+-----------------+--------+-----------+
| 1   | 1               | Todo   | 2         |
+-----+-----------------+--------+-----------+
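
The two tables above could be sketched as follows. This is only an illustration in Python with SQLite, not the PHProjekt schema itself: the table and column names come from the layout above, while the column types, the constraints and the helper names create_search_tables and find_datasets are assumptions.

```python
import sqlite3

# Column names follow the tables above; types and constraints are assumed.
SCHEMA = '''
CREATE TABLE SearchWords (
    Id      INTEGER PRIMARY KEY,
    Word    TEXT NOT NULL UNIQUE,
    "Count" INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE SearchRelationship (
    Id             INTEGER PRIMARY KEY,
    SearchWords_Id INTEGER NOT NULL REFERENCES SearchWords(Id),
    Module         TEXT NOT NULL,
    Module_Id      INTEGER NOT NULL
);
'''


def create_search_tables(conn):
    """Create both search tables (hypothetical helper)."""
    conn.executescript(SCHEMA)


def find_datasets(conn, term):
    """Return (Module, Module_Id) pairs whose indexed words contain term.

    The term is matched as a substring via LIKE; '%' and '_' inside the
    term are not escaped in this sketch.
    """
    rows = conn.execute(
        "SELECT DISTINCT r.Module, r.Module_Id "
        "FROM SearchWords AS w "
        "JOIN SearchRelationship AS r ON r.SearchWords_Id = w.Id "
        "WHERE w.Word LIKE ? "
        "ORDER BY r.Module, r.Module_Id",
        ("%" + term + "%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_search_tables(conn)
    # The example rows from the tables above.
    conn.execute(
        'INSERT INTO SearchWords (Id, Word, "Count") VALUES (?, ?, ?)',
        (1, "foo", 3),
    )
    conn.execute(
        "INSERT INTO SearchRelationship (Id, SearchWords_Id, Module, Module_Id) "
        "VALUES (?, ?, ?, ?)",
        (1, 1, "Todo", 2),
    )
    print(find_datasets(conn, "oo"))
```

Because each word is stored only once in SearchWords, a substring search scans the word list once and then follows the relationship rows to the datasets.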

Application Interface
The search is divided into two parts: indexing data and the actual
search. The indexing mechanism
is provided by the PHProjekt_Search_Indexing class. It takes a
model as an argument and inserts or updates both the SearchWords
and SearchRelationship tables with the words it finds. To determine what
counts as a word, the indexing class uses heuristics. It can be enhanced by
adding further heuristics.
The Phprojekt_Search_Searching class does the actual searching and returns
a set of found entries (and maybe even their model objects).
Search results can be limited and ordered (ascending or descending) based on external criteria.
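
The split between indexing and searching can be illustrated with a small in-memory model. This is a sketch in Python, not the interface of the PHProjekt classes: the class SearchIndex, its methods and their parameters are hypothetical. It also assumes that the counter holds the number of datasets using a word and that results are ordered by module name and id.

```python
class SearchIndex:
    """In-memory stand-in for the two search tables (hypothetical)."""

    def __init__(self):
        self.counts = {}     # word -> number of datasets using it (assumed meaning)
        self.relations = {}  # word -> set of (module, module_id)

    def index(self, module, module_id, words):
        """Insert or update the index entries for one dataset."""
        key = (module, module_id)
        # Update case: drop the relations left over from an earlier indexing run.
        stale = [w for w, refs in self.relations.items() if key in refs]
        for word in stale:
            self.relations[word].discard(key)
            self.counts[word] -= 1
            if not self.relations[word]:
                del self.relations[word]
                del self.counts[word]
        # Insert case: register every distinct word once per dataset.
        for word in set(words):
            self.relations.setdefault(word, set()).add(key)
            self.counts[word] = self.counts.get(word, 0) + 1

    def search(self, term, limit=None, descending=False):
        """Return datasets with an indexed word that contains term."""
        hits = set()
        for word, refs in self.relations.items():
            if term in word:
                hits |= refs
        ordered = sorted(hits, reverse=descending)
        return ordered if limit is None else ordered[:limit]


if __name__ == "__main__":
    idx = SearchIndex()
    idx.index("Todo", 2, ["foo", "bar"])
    idx.index("Note", 1, ["foobar"])
    print(idx.search("foo"))
    print(idx.search("foo", limit=1, descending=True))
```

Re-indexing a dataset first removes its old relations, so words that no longer occur in it stop matching.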

Heuristics
To figure out which parts of a string are words, PHProjekt implements a set of
heuristics. The simplest heuristic uses stop characters: the characters between
two stop characters are taken as a word.
Stop characters are:
 - all non-word characters in PCRE

In addition, words shorter than 3 characters are rejected.
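
The stop character heuristic and the length limit can be sketched in a few lines. The example is in Python rather than PHP and is not the PHProjekt code: the function name extract_words is hypothetical. It assumes that "non-word character" means \W in its ASCII sense, which the re.ASCII flag selects, so letters, digits and the underscore count as word characters.

```python
import re

MIN_WORD_LENGTH = 3  # words shorter than this are rejected

# One or more non-word characters act as a separator between words.
STOP_CHARACTERS = re.compile(r"\W+", re.ASCII)


def extract_words(text):
    """Split text at stop characters and drop words that are too short."""
    candidates = STOP_CHARACTERS.split(text)
    return [word for word in candidates if len(word) >= MIN_WORD_LENGTH]


if __name__ == "__main__":
    print(extract_words("Fix the to-do list, ok?"))
```

For the sample string, "to", "do" and "ok" are dropped because they are shorter than three characters, leaving "Fix", "the" and "list".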