[Users] Faster filtering (sans mairix) >> Recoll search utility

Victoria S. 1 at VictoriasJourney.com
Mon Jul 8 22:32:16 CEST 2013


----- Original Message(s): -----
Date: 2013 Jul 06 (Sat) 08:02
From: Cia Watson <ciamarie at my180.net>
To: users at lists.claws-mail.org
Subject: Re: [Users] Faster filtering (sans mairix)

On Fri, 5 Jul 2013 09:55:27 -0700 "Victoria S." <1 at VictoriasJourney.com> wrote:

> Is there a way to implement faster filtering (searching) of folders/subfolders in Claws?
> I routinely keep copies of mail in my Sent folder (26.5K items) that
> permits quick access to emails-of-interest; other folders contain (e.g.)
> >30K items; all my archived mail comprises >130K items/messages.

I use Recoll for desktop search, and while I don't often search mail I find that results from my Mail folder are usually included in results for other searches I do. It's not difficult to configure: you can set it to include or ignore certain directories or subdirectories. I also have it scheduled to update the index at 10:00am every morning, rather than having it index whenever new articles (or whatever) are added. It's also possible to set it to index only when you make a manual request, but I find once a day works fine for me.

Here's information from "About recoll" on Debian Wheezy:

Recoll 1.17.3 + Xapian 1.2.12


Thank you Cia - that was a good suggestion; appreciated!

This is somewhat lengthy for a post to a forum, but I think that it will be of good use to other Claws users who are interested in using this tool (Recoll).  I provide some e-mail-related Recoll search examples, below, that will be of particular interest, plus pertinent Recoll-related information (excerpted from the Manual).


RECOLL HOME PAGE: http://www.lesbonscomptes.com/recoll/
Features: http://www.lesbonscomptes.com/recoll/features.html
User manual: http://www.lesbonscomptes.com/recoll/usermanual/index.html
FAQ / How-To: https://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos
Searching: http://www.lesbonscomptes.com/recoll/usermanual/RCL.SEARCH.html
Review: http://www.webupd8.org/2013/05/desktop-search-tool-recoll-updated-with.html


Ubuntu users can install the latest Recoll by using its official PPA:

sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install recoll




To also install the Recoll Unity Lens, use the following command:

sudo apt-get install recoll-lens


VICTORIA: To use, simply open the Dash [Option-Command, my keyboard] >> type search term(s)



2.8. Real time indexing.  Real time monitoring/indexing is performed by starting the recollindex -m command. With this option, recollindex will detach from the terminal and become a daemon, permanently monitoring file changes and updating the index.

VICTORIA: Created a 'Startup Applications' entry >> [Name:] Real time indexing >> [Command:] recollindex -m


Recoll supports wildcards (*, ?, []) as well as advanced searches such as:

"OR", "AND" operators
search for author, e.g.: author:"george orwell"
search by size, date, mime or format
search inside a folder, e.g.: dir:/home/andrei/Dropbox

more >> 3.5. The query language >> http://www.lesbonscomptes.com/recoll/usermanual/usermanual.html#RCL.SEARCH.LANG

* matches 0 or more characters;

? matches a single character;

[] allows defining sets of characters to be matched (ex: [abc] matches a single character which may be 'a' or 'b' or 'c'; [0-9] matches any digit).
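
As a rough illustration of the three wildcard forms, the sketch below uses Python's fnmatch module, whose shell-style patterns follow the same *, ? and [] conventions. It is only an analogy: it is not how Recoll expands wildcards internally, and the TERMS list is a made-up stand-in for the index term list.

```python
from fnmatch import fnmatchcase

# Hypothetical index terms, for illustration only.
TERMS = ["gene", "genes", "genetic", "general", "xapian", "mail1", "mail2", "mailx"]


def expand(pattern, terms=TERMS):
    """Return the terms matched by a shell-style wildcard pattern."""
    return [t for t in terms if fnmatchcase(t, pattern)]


print(expand("gene*"))      # * matches 0 or more characters
print(expand("gene?"))      # ? matches exactly one character
print(expand("mail[0-9]"))  # [] matches one character from the set
```

Note how gene* also pulls in "general", which is the kind of unexpected completion the caveats below warn about.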

You should be aware of a few things when using wildcards.

* Using a wildcard character at the beginning of a word can make for a slow search because Recoll will have to scan the whole index term list to find the matches. However, this is much less a problem for field searches, and queries like author:*@domain.com can sometimes be very useful.

* For Recoll version 18 only, when working with a raw index (preserving character case and diacritics), the literal part of a wildcard expression will be matched exactly for case and diacritics. This is not true any more for versions 19 and later.

* Using a * at the end of a word can produce more matches than you would think, and strange search results. You can use the term explorer tool to check what completions exist for a given term. You can also see exactly what search was performed by clicking on the link at the top of the result list. In general, for natural language terms, stem expansion will produce better results than an ending * (stem expansion is turned off when any wildcard character appears in the term).

 ** ... you can disable stem expansion for any term by capitalizing it: i.e.: a search for floor will also normally look for flooring, floored, etc., but a search for Floor will only look for floor, in any character case.

2001/ from the beginning of 2001 to the latest date in the index.

2001 the whole year of 2001

P2D/ means 2 days ago up to now if there are no documents with dates in the future.

    [VICTORIA: Essentially, 'Previous 2 Days'.  See (further below): "... 'Periods' are specified as PnYnMnD. The n numbers are the respective numbers of years, months or days, any of which may be missing. Dates are specified as YYYY-MM-DD. ..."]

/2003 all documents from 2003 or older
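
To make the manual's date forms concrete, here is a minimal Python sketch of how the date-only variants above could be turned into a (start, end) pair. It follows the manual's description, not Recoll's source code; the function names and the index_oldest / index_newest parameters are invented for the example, and period elements (PnYnMnD) are deliberately left out.

```python
import calendar
from datetime import date


def _bounds(element):
    """Return (first_day, last_day) covered by YYYY, YYYY-MM or YYYY-MM-DD."""
    parts = [int(p) for p in element.split("-")]
    year = parts[0]
    if len(parts) == 1:                      # YYYY: the whole year
        return date(year, 1, 1), date(year, 12, 31)
    month = parts[1]
    if len(parts) == 2:                      # YYYY-MM: the whole month
        last = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last)
    day = date(year, month, parts[2])        # YYYY-MM-DD: a single day
    return day, day


def date_range(spec, index_oldest, index_newest):
    """Interpret 'A/B', 'A/', '/B' or 'A' as described in the manual."""
    if "/" not in spec:
        return _bounds(spec)
    left, right = spec.split("/", 1)
    # A missing element falls back to the oldest / newest date in the index.
    start = _bounds(left)[0] if left else index_oldest
    end = _bounds(right)[1] if right else index_newest
    return start, end


oldest, newest = date(2010, 1, 5), date(2013, 7, 8)   # assumed index limits
print(date_range("2001", oldest, newest))
print(date_range("2001/", oldest, newest))
print(date_range("/2003", oldest, newest))
```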

 - - - - - - - - - - Wildcards and path filtering

Due to the way that Recoll processes wildcards inside dir path filtering clauses, they will have a multiplicative effect on the query size. A clause containing wildcards in several path elements, like, for example, dir:/home/me/*/*/docdir, will almost certainly fail if your indexed tree is of any realistic size.



• author:"john doe"
• /2007 (all documents from 2007 or older)
• dir:/path/to/dir (filters content from /path/to/dir directory)

Relative paths also make sense, for example, dir:share/doc would match either /usr/share/doc or /usr/local/share/doc

    Several dir clauses can be specified, both positive and negative. For example the following makes sense:

    dir:recoll dir:src -dir:utils -dir:common

    This would select results which have both recoll and src in the path (in any order), and which have not either utils or common.
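
The dir rules quoted above can be pictured with a small Python sketch. This is one possible reading of the manual text, not Recoll's implementation: it assumes that a relative value matches when its components appear consecutively anywhere in the file's directory path, and that an absolute value must match from the root. The function names and sample paths are made up.

```python
from pathlib import PurePosixPath


def dir_matches(value, file_path):
    """True if the dir: value names the file's directory or one of its ancestors."""
    dirs = PurePosixPath(file_path).parent.parts   # e.g. ('/', 'usr', 'share', 'doc')
    want = PurePosixPath(value).parts
    if value.startswith("/"):                      # absolute: match from the root
        return dirs[:len(want)] == want
    n = len(want)                                  # relative: match anywhere
    return any(dirs[i:i + n] == want for i in range(len(dirs) - n + 1))


def select(file_path, include=(), exclude=()):
    """All positive dir clauses must match, and no negative one may."""
    return (all(dir_matches(v, file_path) for v in include)
            and not any(dir_matches(v, file_path) for v in exclude))


print(dir_matches("share/doc", "/usr/local/share/doc/README"))
print(select("/home/me/src/recoll/index/a.cpp",
             include=["recoll", "src"], exclude=["utils", "common"]))
```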




1. Recoll v. 1.19.5 + Xapian 1.2.8
2. "message" results (only) radio button selected/checked
3. Search examples performed on /home/victoria/Mail/sent (my Claws mail 'sent' directory)

Default fields: http://www.lesbonscomptes.com/recoll/usermanual/RCL.SEARCH.LANG.html


EXAMPLE 1 (search of Claws mail 'sent' folder):

Claws: s circadian & s gene  >>  10 hits  [2010 x 2; 2012 x 1; 2013 x 7]
Recoll: dir:Mail/sent subject:circadian subject:gene >> 6 hits
Recoll: dir:Mail/sent subject:circadian subject:gene* >> 10 hits

dir:Mail/sent subject:circadian subject:gene* /2012/  >>  1 hit (as per Claws: 1 such email in 2012)
dir:Mail/sent subject:circadian subject:gene* /2013  >>  7 hits (as per Claws: 7 such emails in 2013)

NOTE 1.  Contrary to what the Recoll Manual says (quoted elsewhere in these notes), prefixing or suffixing a year with a forward slash returned messages for that year only! E.g., the search above with:

/2013  >>  7 hits
2013/  >>  7 hits
/2012  >>  1 hit
2012/  >>  1 hit
/2010  >>  2 hits
2010/  >>  2 hits

NOTE 2.  Regarding the wildcard usage above, as noted elsewhere in these notes the wildcard approach is generally not preferred.  Upon further investigation:

    [ Terms and search expansion] "... Entering a capitalized word in any search field will prevent stem expansion (no search for gardening if you enter Garden instead of garden). This is the only case where character case should make a difference for a Recoll search. You can also disable stem expansion or change the stemming language in the preferences." << Nope!

    [3.1.12. Customizing the search interface] "... Search parameters:  ... Stemming language: stemming obviously depends on the document's language. This listbox will let you choose among the stemming databases which were built during indexing (this is set in the main configuration file), or later added with recollindex -s (See the recollindex manual). Stemming languages which are dynamically added will be deleted at the next indexing pass unless they are also added in the configuration file."  << Probably: recollindex -s ...

        *** More here: http://manpages.ubuntu.com/manpages/gutsy/man1/recollindex.1.html >> "recollindex  -s will build the stem expansion database for a given language, which may or may not be part of the list  in  the  configuration file. If the language is not part of the configuration, the stem expansion database will be deleted during the next normal run. The following languages (abbreviations) are recognized: ..."

        victoria at victoria:~$ recollindex -s english

AFTER running 'recollindex -s english' (which took ~5 min. or so to execute):

Claws: s circadian & s gene  >>  10 hits  [2010 x 2; 2012 x 1; 2013 x 7]
Recoll: dir:Mail/sent subject:circadian subject:gene >> 9 hits
Recoll: dir:Mail/sent subject:circadian subject:gene* >> 10 hits


EXAMPLE 2a (search of Claws mail 'sent' folder):

Claws: s circadian & b clock  >>  41 hits  [2010 x 6; 2011 x 18; 2012 x 8; 2013 x 9]
Recoll: dir:Mail/sent subject:circadian body:clock  >>  41 hits

 - - - - - - - - - -


Claws: s circadian & b human  >>  31 hits  [2010 x 6; 2011 x 15; 2012 x 3; 2013 x 7]

PRIOR to running 'recollindex -s english' :

Recoll: dir:Mail/sent subject:circadian body:human  >>  22 hits
Recoll: dir:Mail/sent subject:circadian body:human*  >>  31 hits

AFTER running 'recollindex -s english' :

Recoll: dir:Mail/sent subject:circadian body:human  >>  31 hits

 - - - - - - - - - -


Claws: s circadian & b clock & b human  >>  29 hits  [2010 x 6; 2011 x 15; 2012 x 3; 2013 x 5]

PRIOR to running 'recollindex -s english' :

Recoll: dir:Mail/sent subject:circadian body:clock body:human  >>  20 hits
Recoll: dir:Mail/sent subject:circadian body:clock body:human*  >>  29 hits

AFTER running 'recollindex -s english' :

Recoll: dir:Mail/sent subject:circadian body:clock body:human  >>  29 hits


EXAMPLE 3 (search of Claws mail 'sent' folder, AFTER running 'recollindex -s english'):

Claws: s circadian & b clock & b human C victoria >> [C: message either to: or cc: to victoria]  >>  29 hits

Recoll: dir:Mail/sent subject:circadian body:clock body:human recipient:victoria  >>  16 hits
Recoll: dir:Mail/sent subject:circadian body:clock body:human recipient:victoria*  >>  29 hits

Here, wildcard (*) still needed.  Same results (hits) with 'to:victoria' in place of 'recipient:victoria'. [I always cc every e-mail sent to myself.]

Interestingly (I did not see 'cc:' as a search field),

Recoll: dir:Mail/sent subject:circadian body:clock body:human cc:victoria  >>  29 hits!  :-) [same with cc:victoria*]



Chapter 3. Searching  >>  3.1. Searching with the Qt graphical user interface

The recoll program provides the main user interface for searching. It is based on the Qt library.

recoll has two search modes:

    Simple search (the default, on the main screen) has a single entry field where you can enter multiple words.

    Advanced search (a panel accessed through the Tools menu or the toolbox bar icon) has multiple entry fields, which you may use to build a logical condition, with additional filtering on file type, location in the file system, modification date, and size.

In most cases, you can enter the terms as you think them, even if they contain embedded punctuation or other non-textual characters. For example, Recoll can handle things like email addresses, or arbitrary cut and paste from another text window, punctuation and all.

The main case where you should enter text differently from how it is printed is for east-asian languages (Chinese, Japanese, Korean). Words composed of single or multiple characters should be entered separated by white space in this case (they would typically be printed without white space).

3.1.1. Simple search

    Start the recoll program.

    Possibly choose a search mode: Any term, All terms, File name or Query language.

    Enter search term(s) in the text field at the top of the window.

    Click the Search button or hit the Enter key to start the search.

The initial default search mode is Query language. Without special directives, this will look for documents containing all of the search terms (the ones with more terms will get better scores), just like the All terms mode which will ignore such directives. Any term will search for documents where at least one of the terms appears.

The Query Language features are described in a separate section.

All search modes allow wildcards inside terms (*, ?, []). You may want to have a look at the section about wildcards for more information about this.

File name will specifically look for file names. The point of having a separate file name search is that wild card expansion can be performed more efficiently on a small subset of the index (allowing wild cards on the left of terms without excessive penalty). Things to know:

    White space in the entry should match white space in the file name, and is not treated specially.

    The search is insensitive to character case and accents, independently of the type of index.

    An entry without any wild card character and not capitalized will be prepended and appended with '*' (ie: etc -> *etc*, but Etc -> etc).

    If you have a big index (many files), excessively generic fragments may result in inefficient searches.
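
The automatic wrapping rule for File name searches can be written down in a few lines. This is a hedged sketch of the rule as the manual states it; the function name is invented, and lower-casing the result is an assumption based on the search being case-insensitive.

```python
WILDCARD_CHARS = set("*?[")


def filename_pattern(entry):
    """Approximate the pattern a File name search would use for an entry."""
    has_wildcard = any(ch in WILDCARD_CHARS for ch in entry)
    capitalized = entry[:1].isupper()
    pattern = entry.lower()          # assumed: case is ignored anyway
    if not has_wildcard and not capitalized:
        pattern = "*" + pattern + "*"
    return pattern


print(filename_pattern("etc"))    # wrapped with * on both sides
print(filename_pattern("Etc"))    # capitalized: left unwrapped
print(filename_pattern("*.pdf"))  # already has a wildcard: left unwrapped
```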

You can search for exact phrases (adjacent words in a given order) by enclosing the input inside double quotes. Ex: "virtual reality".

When using a stripped index, character case has no influence on search, except that you can disable stem expansion for any term by capitalizing it. Ie: a search for floor will also normally look for flooring, floored, etc., but a search for Floor will only look for floor, in any character case. Stemming can also be disabled globally in the preferences [Preferences menu >> (no stemming)]. When using a raw index, the rules are a bit more complicated.

Recoll remembers the last few searches that you performed. You can use the simple search text entry widget (a combobox) to recall them (click on the thing at the right of the text field). Please note, however, that only the search texts are remembered, not the mode (all/any/file name).

Typing Esc Space while entering a word in the simple search entry will open a window with possible completions for the word. The completions are extracted from the database.

Double-clicking on a word in the result list or a preview window will insert it into the simple search entry field.

You can cut and paste any text into an All terms or Any term search field, punctuation, newlines and all - except for wildcard characters (single ? characters are ok). Recoll will process it and produce a meaningful search. This is what most differentiates this mode from the Query Language mode, where you have to care about the syntax.

You can use the Tools → Advanced search dialog for more complex searches.

    [ ... SNIP! ... ]

3.1.6. Complex/advanced search

The advanced search dialog helps you build more complex queries without memorizing the search language constructs. It can be opened through the Tools menu or through the main toolbar.

The dialog has two tabs:

    The first tab lets you specify terms to search for, and permits specifying multiple clauses which are combined to build the search.

    The second tab lets you filter the results according to file size, date of modification, mime type, or location.

Click on the Start Search button in the advanced search dialog, or type Enter in any text field to start the search. The button in the main window always performs a simple search.

Click on the Show query details link at the top of the result page to see the query expansion.

Advanced search: the "find" tab

This part of the dialog lets you construct a query by combining multiple clauses of different types. Each entry field is configurable for the following modes:

    All terms.

    Any term.

    None of the terms.

    Phrase (exact terms in order within an adjustable window).

    Proximity (terms in any order within an adjustable window).

    Filename search.

Additional entry fields can be created by clicking the Add clause button.

When searching, the non-empty clauses will be combined either with an AND or an OR conjunction, depending on the choice made on the left (All clauses or Any clause).

Entries of all types except "Phrase" and "Near" accept a mix of single words and phrases enclosed in double quotes. Stemming and wildcard expansion will be performed as for simple search.

Phrases and Proximity searches. These two clauses work in similar ways, with the difference that proximity searches do not impose an order on the words. In both cases, an adjustable number (slack) of non-matched words may be accepted between the searched ones (use the counter on the left to adjust this count). For phrases, the default count is zero (exact match). For proximity it is ten (meaning that two search terms would be matched if found within a window of twelve words). Examples: a phrase search for quick fox with a slack of 0 will match quick fox but not quick brown fox. With a slack of 1 it will match the latter, but not fox quick. A proximity search for quick fox with the default slack will match the latter, and also a fox is a cunning and quick animal.

Advanced search: the "filter" tab

This part of the dialog has several sections which allow filtering the results of a search according to a number of criteria

    The first section allows filtering by dates of last modification. You can specify both a minimum and a maximum date. The initial values are set according to the oldest and newest documents found in the index.

    The next section allows filtering the results by file size. There are two entries for minimum and maximum size. Enter decimal numbers. You can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12 respectively.

    The next section allows filtering the results by their mime types, or mime categories (ie: media/text/message/etc.).

    You can transfer the types between two boxes, to define which will be included or excluded by the search.

    The state of the file type selection can be saved as the default (the file type filter will not be activated at program start-up, but the lists will be in the restored state).

    The bottom section allows restricting the search results to a sub-tree of the indexed area. You can use the Invert checkbox to search for files not in the sub-tree instead. If you use directory filtering often and on big subsets of the file system, you may think of setting up multiple indexes instead, as the performance may be better.

    You can use relative/partial paths for filtering. Ie, entering dirA/dirB would match either /dir1/dirA/dirB/myfile1 or /dir2/dirA/dirB/someother/myfile2.

Advanced search history

The advanced search tool memorizes the last 100 searches performed. You can walk the saved searches by using the up and down arrow keys while the keyboard focus belongs to the advanced search dialog.

The complex search history can be erased, along with the one for simple search, by selecting the File → Erase Search History menu entry.

3.1.7. The term explorer tool

Recoll automatically manages the expansion of search terms to their derivatives (ie: plural/singular, verb inflections). But there are other cases where the exact search term is not known. For example, you may not remember the exact spelling, or only know the beginning of the name.

The term explorer tool (started from the toolbar icon or from the Term explorer entry of the Tools menu) can be used to search the full index terms list. It has the following modes of operation:

Wildcard

    In this mode of operation, you can enter a search string with shell-like wildcards (*, ?, []). ie: xapi* would display all index terms beginning with xapi. (More about wildcards here).

Regular expression

    This mode will accept a regular expression as input. Example: word[0-9]+. The expression is implicitly anchored at the beginning. Ie: press will match pression but not expression. You can use .*press to match the latter, but be aware that this will cause a full index term list scan, which can be quite long.

Stem expansion

    This mode will perform the usual stem expansion normally done as part of user input processing. As such it is probably mostly useful to demonstrate the process.

Spelling/Phonetic

    In this mode, you enter the term as you think it is spelled, and Recoll will do its best to find index terms that sound like your entry. This mode uses the Aspell spelling application, which must be installed on your system for things to work (if your documents contain non-ascii characters, Recoll needs an aspell version newer than 0.60 for UTF-8 support). The language which is used to build the dictionary out of the index terms (which is done at the end of an indexing pass) is the one defined by your NLS environment. Weird things will probably happen if languages are mixed up.

Note that in cases where Recoll does not know the beginning of the string to search for (ie a wildcard expression like *coll), the expansion can take quite a long time because the full index term list will have to be processed. The expansion is currently limited at 10000 results for wildcards and regular expressions. It is possible to change the limit in the configuration file.
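
The "implicitly anchored at the beginning" behaviour of the regular expression mode is the same behaviour Python's re.match has, so it can be tried out quickly. The sketch below is only an analogy using Python's regex engine over a made-up term list; it does not call Recoll.

```python
import re

# Hypothetical index terms, for illustration only.
TERMS = ["press", "pression", "expression", "word", "word42"]


def regex_terms(pattern, terms=TERMS):
    """Return terms matching the pattern, anchored at the start of the term."""
    # re.match only succeeds if the match begins at position 0.
    return [t for t in terms if re.match(pattern, t)]


print(regex_terms("press"))       # pression matches, expression does not
print(regex_terms(".*press"))     # leading .* reaches expression as well
print(regex_terms("word[0-9]+"))  # needs at least one digit after 'word'
```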

Double-clicking on a term in the result list will insert it into the simple search entry field. You can also cut/paste between the result list and any entry field (the end of lines will be taken care of).

[ ... SNIP! ... ]



Chapter 3. Searching  >>  3.5. The query language


Here is a sample request (explained):

    author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes

This would search for all documents with John Doe appearing as a phrase in the author field (exactly what this is would depend on the document type, ie: the From: header, for an email message), and containing either beatles or lennon and either live or unplugged but not potatoes (in any part of the document).

An element is composed of an optional field specification, and a value, separated by a colon (the field separator is the last colon in the element). Example: Eugenie, author:balzac, dc:title:grandet.  The colon, if present, means "contains".
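
The "last colon" rule is easy to get wrong by splitting on the first colon instead. A minimal Python sketch of the rule, using the manual's own three examples (the function name is invented, and quoted values that themselves contain colons are not handled):

```python
def split_element(element):
    """Split a query element into (field, value) at the LAST colon."""
    field, sep, value = element.rpartition(":")
    # No colon at all means there is no field, only a value.
    return (field if sep else None), value


for element in ("Eugenie", "author:balzac", "dc:title:grandet"):
    print(element, "->", split_element(element))
```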

All elements in the search entry are normally combined with an implicit AND.  It is possible to specify that elements be OR'ed instead, as in Beatles OR Lennon.  The OR must be entered literally (capitals), and it has priority over the AND associations: word1 word2 OR word3 means word1 AND (word2 OR word3) not (word1 AND word2) OR word3.  Explicit parentheses are not supported.
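
The precedence rule can be visualised by grouping a query into an AND of OR-groups. The following Python sketch is a simplified model for plain whitespace-separated words only (no quoted phrases, no negation); the function name is made up and this is not Recoll's parser.

```python
def group_terms(query):
    """Group 'w1 w2 OR w3' as [['w1'], ['w2', 'w3']]: an AND of OR-groups."""
    groups = []
    join_next = False
    for token in query.split():
        if token == "OR":                 # must be upper case to count
            join_next = True
        elif join_next and groups:
            groups[-1].append(token)      # OR binds to the previous element
            join_next = False
        else:
            groups.append([token])        # implicit AND starts a new group
            join_next = False
    return groups


print(group_terms("word1 word2 OR word3"))
print(group_terms("Beatles OR Lennon Live OR Unplugged"))
```

A lower-case "or" is treated as an ordinary search word in this model, in line with the requirement that OR be entered in capitals.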

An element preceded by a - specifies a term that should not appear.  Pure negative queries are forbidden.

As usual, words inside quotes define a phrase (the order of words is significant), so that title:"prejudice pride" is not the same as title:prejudice title:pride, and is unlikely to find a result.

Modifiers can be set on a phrase clause, for example to specify a proximity search (unordered). See the modifier section (http://bit.ly/ZoZQEQ).

Recoll currently manages the following default fields:

    title, subject or caption are synonyms which specify data to be searched for in the document title or subject.

    author or from for searching the document's originators.

    recipient or to for searching the document's recipients.

    keyword for searching the document-specified keywords (few documents actually have any).

    filename for the document's file name.

    ext specifies the file name extension (Ex: ext:html)

The field syntax also supports a few field-like, but special, criteria:

    dir for filtering the results on file location (Ex: dir:/home/me/somedir). -dir also works to find results not in the specified directory (release >= 1.15.8). A tilde inside the value will be expanded to the home directory. Wildcards will not be expanded. You cannot use OR with dir clauses (this restriction may go away in the future).

            dir:~/Mail/sent  >>  0 hits
            dir:/home/victoria/Mail/sent  >>  'at least 16481' hits  [actual: 26,581]

            Claws (Mail/sent): s facebook  >>  148 hits
            Recoll: dir:/home/victoria/Mail/sent subject:facebook  >>  'at least 144' hits

    Relative paths also make sense, for example, dir:share/doc would match either /usr/share/doc or /usr/local/share/doc

    Several dir clauses can be specified, both positive and negative. For example the following makes sense:

    dir:recoll dir:src -dir:utils -dir:common

    This would select results which have both recoll and src in the path (in any order), and which have not either utils or common.

    Another special aspect of dir clauses is that the values in the index are not transcoded to UTF-8, and never lower-cased or unaccented, but stored as binary. This means that you need to enter the values in the exact lower or upper case, and that searches for names with diacritics may sometimes be impossible because of character set conversion issues. Non-ASCII UNIX file paths are an unending source of trouble and are best avoided.

    You need to use double-quotes around the path value if it contains space characters.

    size for filtering the results on file size. Example: size<10000. You can use <, > or = as operators. You can specify a range like the following: size>100 size<1000. The usual k/K, m/M, g/G, t/T can be used as (decimal) multipliers. Ex: size>1k to search for files bigger than 1000 bytes.

    date for searching or filtering on dates. The syntax for the argument is based on the ISO8601 standard for dates and time intervals. Only dates are supported, no times. The general syntax is 2 elements separated by a / character. Each element can be a date or a 'period' of time. 'Periods' are specified as PnYnMnD. The n numbers are the respective numbers of years, months or days, any of which may be missing. Dates are specified as YYYY-MM-DD. The days and months parts may be missing. If the / is present but an element is missing, the missing element is interpreted as the lowest or highest date in the index. Examples:

        2001-03-01/2002-05-01 the basic syntax for an interval of dates.

        2001-03-01/P1Y2M the same specified with a 'period.'
        [VICTORIA: convoluted, but in essence 'Previous 1 Year 2 Months', as defined by [start of range] 2001-03-01]

        2001/ from the beginning of 2001 to the latest date in the index.

        2001 the whole year of 2001

        P2D/ means 2 days ago up to now if there are no documents with dates in the future. [VICTORIA: i.e., Previous 2 Days]

        /2003 all documents from 2003 or older.

    'Periods' can also be specified with small letters (ie: p2y).

    mime or format for specifying the mime type. This one is quite special because you can specify several values which will be OR'ed (the normal default for the language is AND). Ex: mime:text/plain mime:text/html. Specifying an explicit boolean operator before a mime specification is not supported and will produce strange results. You can filter out certain types by using negation (-mime:some/type), and you can use wildcards in the value (mime:text/*). Note that mime is the ONLY field with an OR default. You do need to use OR with ext terms for example.

    type or rclcat for specifying the category (as in text/media/presentation/etc.). The classification of mime types in categories is defined in the Recoll configuration (mimeconf), and can be modified or extended. The default category names are those which permit filtering results in the main GUI screen. Categories are OR'ed like mime types above. This can't be negated with - either.
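
For the size criterion in the list above, the decimal multipliers work out as in this small Python sketch. It assumes whole-number values only and the three operators named in the manual; the function names are invented for the example and nothing here calls Recoll.

```python
import operator
import re

MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}
OPERATORS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def parse_size_clause(clause):
    """Turn e.g. 'size>1k' into ('>', 1000)."""
    m = re.fullmatch(r"size([<>=])(\d+)([kKmMgGtT]?)", clause)
    if m is None:
        raise ValueError(f"not a size clause: {clause!r}")
    op, number, suffix = m.groups()
    return op, int(number) * MULTIPLIERS.get(suffix.lower(), 1)


def size_ok(nbytes, query):
    """True if nbytes satisfies every size clause in the query (implicit AND)."""
    return all(OPERATORS[op](nbytes, limit)
               for op, limit in map(parse_size_clause, query.split()))


print(parse_size_clause("size>1k"))
print(size_ok(500, "size>100 size<1000"))
```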

Words inside phrases and capitalized words are not stem-expanded. Wildcards may be used anywhere inside a term. Specifying a wild-card on the left of a term can produce a very slow search (or even an incorrect one if the expansion is truncated because of excessive size).  Also see More about wildcards (http://tinyurl.com/ch5upph).

The document filters used while indexing have the possibility to create other fields with arbitrary names, and aliases may be defined in the configuration, so that the exact field search possibilities may be different for you if someone took care of the customisation.

3.5.1. Modifiers

Some characters are recognized as search modifiers when found immediately after the closing double quote of a phrase, as in "some term"modifierchars. The actual "phrase" can be a single term of course. Supported modifiers:

    l can be used to turn off stemming (mostly makes sense with p because stemming is off by default for phrases).

    o can be used to specify a "slack" for phrase and proximity searches: the number of additional terms that may be found between the specified ones. If o is followed by an integer number, this is the slack, else the default is 10.

    p can be used to turn the default phrase search into a proximity one (unordered). Example: "order any in"p

    C will turn on case sensitivity (if the index supports it).

    D will turn on diacritics sensitivity (if the index supports it).

    A weight can be specified for a query element by specifying a decimal value at the start of the modifiers. Example: "Important"2.5.
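
To show what the slack (o) and proximity (p) modifiers mean in practice, here is a rough Python model of phrase and proximity matching on a whitespace-tokenised text. It reuses the "quick fox" examples from section 3.1.6; the function is a made-up illustration of the definition (slack = number of extra words allowed between the searched ones), not how Xapian or Recoll evaluate such queries.

```python
from itertools import product


def slack_match(text, terms, slack=0, ordered=True):
    """True if all terms occur within len(terms) + slack consecutive words."""
    words = text.lower().split()
    positions = [[i for i, w in enumerate(words) if w == t.lower()]
                 for t in terms]
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue                      # one word cannot serve two terms
        if ordered and list(combo) != sorted(combo):
            continue                      # phrase search: keep term order
        span = max(combo) - min(combo) + 1
        if span - len(terms) <= slack:
            return True
    return False


terms = ["quick", "fox"]
print(slack_match("quick brown fox", terms, slack=0))             # phrase, slack 0
print(slack_match("quick brown fox", terms, slack=1))             # phrase, slack 1
print(slack_match("a fox is a cunning and quick animal", terms,
                  slack=10, ordered=False))                       # proximity
```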

