glimpse supports a large variety of patterns, including simple strings, strings with classes of characters, sets of strings, wild cards, and regular expressions. (See LIMITATIONS.)
Strings
Strings are any sequence of characters, including
the special symbols `^' for beginning of line and `$'
for end of line. The following special characters
(`$', `^', `*', `[', `|', `(', `)', `!',
and `\'), as well as the following meta characters
special to glimpse (and agrep): `;', `,',
`#', `<', `>', `-',
and `.', should be preceded by `\' if they are
to be matched as regular characters. For example, \^abc\\
corresponds to the string ^abc\, whereas ^abc corresponds
to the string abc at the beginning of a line.
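The escaping rule above can be sketched as a small helper. This is only an illustration: escape_glimpse_literal is a hypothetical name, not something shipped with glimpse or agrep, and the character set is copied from the list in this section. Quoting for the shell is a separate concern and is not handled here.

```python
# Hypothetical helper: not part of glimpse or agrep.
# The set below holds the characters this section lists as special.
GLIMPSE_SPECIAL = set("$^*[|()!\\;,#<>-.")


def escape_glimpse_literal(text: str) -> str:
    """Prefix every listed special character with a backslash."""
    return "".join("\\" + ch if ch in GLIMPSE_SPECIAL else ch for ch in text)


if __name__ == "__main__":
    # The literal string ^abc\ becomes the pattern \^abc\\
    print(escape_glimpse_literal("^abc\\"))
    # A literal dot in a file name becomes \.
    print(escape_glimpse_literal("gnu.c"))
```

The result is the pattern text that glimpse should receive; when typing it on a command line, it would normally also be wrapped in single quotes.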
Classes of characters
A list of characters inside [] (in order) corresponds
to any character from the list. For example, [a-ho-z]
is any character between a and h or between o and z.
The symbol `^' inside [] complements the list. For
example, [^i-n] denotes any character in the character
set except the characters `i' through `n'. The symbol `^' thus
has two meanings, but this is consistent with egrep.
The symbol `.' (don't care) stands for any symbol (except
for the newline symbol).
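Because these classes follow the egrep conventions, their behavior can be checked with any egrep-style engine. The sketch below uses Python's re module as a stand-in; it does not invoke glimpse, and the helper name matches_fully is an assumption made for this example.

```python
import re

# Python's re module stands in for glimpse here; glimpse itself is not run.
IN_RANGES = re.compile(r"[a-ho-z]")       # a..h or o..z
OUTSIDE_I_TO_N = re.compile(r"[^i-n]")    # anything except i..n
DONT_CARE = re.compile(r"a.c")            # `.' is any character but newline


def matches_fully(regex: "re.Pattern[str]", text: str) -> bool:
    """True when the whole of text matches the compiled pattern."""
    return regex.fullmatch(text) is not None


if __name__ == "__main__":
    for ch in "ckp":
        print(ch, matches_fully(IN_RANGES, ch), matches_fully(OUTSIDE_I_TO_N, ch))
    print(matches_fully(DONT_CARE, "abc"), matches_fully(DONT_CARE, "a\nc"))
```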
Boolean operations
Glimpse supports an `AND' operation denoted
by the symbol `;', an `OR' operation denoted by the
symbol `,',
a limited version of a `NOT' operation (starting at version 4.0B1)
denoted by the symbol `~',
or any combination.
For example, glimpse
`pizza;cheeseburger' will output all
lines containing both patterns. glimpse -F
`gnu;\.c$' `define;DEFAULT'
will output all lines containing both `define' and
`DEFAULT' (anywhere in the line, not necessarily in
order) in files whose name contains `gnu' and ends
with .c. glimpse `{political,computer};science'
will match `political science' or `science of computers'.
The NOT operation works only together with the -W option, and it
generally applies only to the whole file rather than to individual records.
It currently does not work with approximate matching.
Its output may sometimes seem counterintuitive.
Use with care.
glimpse -W 'fame;~glory' will output all lines containing 'fame'
in all files that contain 'fame' but do not contain 'glory'.
This is the most common use of NOT, and in this case it works
as expected.
glimpse -W '~{fame;glory}' will be limited to files that do
not contain both words, and will output all lines containing one
of them.
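The semantics described above can be modeled in a few lines. This is a simplified sketch, not glimpse's implementation: queries are parsed by hand into lists, matching is plain substring search, and the function names line_matches and lines_with_not are assumptions made for this example. The NOT model covers only the common 'fame;~glory' case, where the exclusion applies to the whole file.

```python
from typing import Dict, List, Tuple


def line_matches(line: str, and_groups: List[List[str]]) -> bool:
    """True when every AND group has at least one alternative in the line."""
    return all(any(alt in line for alt in group) for group in and_groups)


def lines_with_not(files: Dict[str, List[str]], wanted: str,
                   excluded: str) -> List[Tuple[str, str]]:
    """Model of 'wanted;~excluded' with -W: skip any file that contains
    the excluded word anywhere, then report lines containing the wanted word."""
    hits = []
    for name, lines in files.items():
        if any(excluded in line for line in lines):
            continue
        hits.extend((name, line) for line in lines if wanted in line)
    return hits


# Hand-parsed form of `{political,computer};science'
QUERY = [["political", "computer"], ["science"]]

if __name__ == "__main__":
    print(line_matches("political science", QUERY))
    print(line_matches("science of computers", QUERY))
    docs = {"a": ["fame at last"], "b": ["fame", "and glory"]}
    print(lines_with_not(docs, "fame", "glory"))
```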
Wild cards
The symbol `#' is used to denote a sequence of any
number (including 0) of arbitrary characters
(see LIMITATIONS).
The symbol # is equivalent to .* in egrep. In fact,
.* will work too, because it is a valid regular expression
(see below), but unless this is part of an actual regular
expression, # will work faster. (Currently glimpse
is experiencing some problems with #.)
Combination of exact and approximate matching
Any pattern inside angle brackets <> must match the text exactly, even when the rest of the match is allowed to contain errors. For example, <mathemat>ics matches mathematical with one error (replacing the last s with an a), but mathe<matics> does not match mathematical no matter how many errors are allowed. (This option is buggy at the moment.)
Regular expressions
Since the index is word based, a regular expression
must match words that appear in the index for glimpse
to find it. Glimpse first strips all non-alphabetic
characters from the regular expression and searches the
index for all remaining words. It then applies the
regular expression matching algorithm to the files
found in the index. For example, glimpse `abc.*xyz'
will search the index for all files that contain both
`abc' and `xyz', and then search directly for `abc.*xyz'
in those files. (If you use glimpse -w `abc.*xyz',
then `abcxyz' will not be found, because glimpse will
think that abc and xyz need to be matched as whole
words.) The syntax of regular expressions in glimpse
is in general the same as that for agrep. The
union operation `|', Kleene closure `*', and parentheses
() are all supported. Currently `+' is not supported.
Regular expressions are currently limited to approximately
30 characters (generally excluding meta characters).
Some options (-d, -w, -t, -x, -D, -I, -S) do not currently
work with regular expressions. The maximal number of
errors for regular expressions that use `*' or `|'
is 4.
(See LIMITATIONS.)
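The two-phase lookup described above can be sketched as follows. This is a toy model under stated assumptions: a substring test over the file text stands in for the index lookup, Python's re module stands in for agrep's matcher, and the names index_words and two_phase_search are invented for this example.

```python
import re
from typing import Dict, List


def index_words(pattern: str) -> List[str]:
    """Keep only the alphabetic runs of a pattern: 'abc.*xyz' -> ['abc', 'xyz']."""
    return re.findall(r"[A-Za-z]+", pattern)


def two_phase_search(files: Dict[str, str], pattern: str) -> List[str]:
    """Return the names of files with at least one line matching pattern."""
    words = index_words(pattern)
    regex = re.compile(pattern)
    matches = []
    for name, text in files.items():
        # Phase 1 (stand-in for the index): every word must occur somewhere.
        if not all(word in text for word in words):
            continue
        # Phase 2: run the real regular expression over each candidate line.
        if any(regex.search(line) for line in text.splitlines()):
            matches.append(name)
    return sorted(matches)


FILES = {
    "a.txt": "abcxyz",
    "b.txt": "abc only",
    "c.txt": "xyz comes before abc",
}

if __name__ == "__main__":
    print(index_words("abc.*xyz"))
    print(two_phase_search(FILES, "abc.*xyz"))
```

In this model c.txt passes the first phase because it contains both words, and is rejected only when the full expression is applied.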
glimpse -F `haystack.h$' needle
finds all occurrences of needle in all files whose name ends with haystack.h.
glimpse -2 -F html Anestesiology
outputs all occurrences of Anestesiology with two
errors in files with html somewhere in their full name.
glimpse -l -F `.c$' variablename
lists the names of all .c files that contain variablename
(the -l option lists file names rather than outputting
the matched lines).
glimpse -F `mail;1993' `windsurfing;Arizona'
finds all lines containing windsurfing and
Arizona in all files having `mail' and `1993'
somewhere in their full name.
glimpse -F mail `t.j@#uk'
finds all mail addresses (search only files with mail
somewhere in their name) from the uk, where the login
name ends with t.j, where the . stands for any one
character. (This is very useful to find a login name
of someone whose middle name you don't know.)
glimpse -F mbox -h -G . > MBOX
concatenates all files whose name matches `mbox' into
one big one.
The index of glimpse is word based. A pattern that contains more than one word cannot be found in the index. The way glimpse overcomes this weakness is by splitting any multi-word pattern into its set of words and looking for all of them in the index. For example, glimpse `linear programming' will first consult the index to find all files containing both linear and programming, and then apply agrep to find the combined pattern. This is usually an effective solution, but it can be slow when both words are very common and their combination is not.
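The cost of this strategy is easier to see with a toy inverted index. The sketch below is an assumption-laden model, not glimpse's data structure: the index maps each lower-cased word to the set of files containing it, lookups are exact-word rather than substring, and build_index, candidate_files, and phrase_search are hypothetical names.

```python
import re
from typing import Dict, List, Set


def build_index(files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every lower-cased word to the set of files that contain it."""
    index: Dict[str, Set[str]] = {}
    for name, text in files.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def candidate_files(index: Dict[str, Set[str]], phrase: str) -> List[str]:
    """Files that contain every word of the phrase, in any position."""
    sets = [index.get(word, set()) for word in phrase.lower().split()]
    return sorted(set.intersection(*sets)) if sets else []


def phrase_search(files: Dict[str, str], index: Dict[str, Set[str]],
                  phrase: str) -> List[str]:
    """Scan only the candidates for the combined pattern."""
    return [n for n in candidate_files(index, phrase) if phrase in files[n]]


FILES = {
    "a.txt": "notes on linear programming",
    "b.txt": "linear algebra and dynamic programming",
    "c.txt": "cooking",
}
INDEX = build_index(FILES)

if __name__ == "__main__":
    print(candidate_files(INDEX, "linear programming"))
    print(phrase_search(FILES, INDEX, "linear programming"))
```

Here b.txt must be scanned even though it never contains the phrase, which is the slow case described above when both words are common.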
As was mentioned in the section on PATTERNS above, some characters serve as meta characters for glimpse and need to be preceded by `\' to search for them. The most common examples are the characters `.' (which stands for a wild card), and `*' (the Kleene closure). So, "glimpse ab.de" will match abcde, but "glimpse ab\.de" will not, and "glimpse ab*de" will not match ab*de, but "glimpse ab\*de" will. The meta character - is translated automatically to a hyphen unless it appears between [] (in which case it denotes a range of characters).
The index of glimpse stores all patterns in lower case. When glimpse searches the index it first converts all patterns to lower case, finds the appropriate files, and then searches the actual files using the original patterns. So, for example, glimpse ABCXYZ will first find all files containing abcxyz in any combination of lower and upper cases, and then search these files directly, so only the right cases will be found. One problem with this approach is discovering misspellings that are caused by wrong cases. For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz (because the pattern is converted to lower case); it will find that there are matches with no errors, and will go to those files to search them directly, this time with the original upper cases. If the closest match is, say, AbcXYZ, glimpse may miss it, because it doesn't expect an error. Another problem is speed. If you search for "ATT", it will look at the index for "att". Unless you use -w to match the whole word, glimpse may have to search all files containing, for example, "Seattle", which has "att" in it.
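The "ATT" versus "Seattle" effect can be reproduced with a small model. This is only a sketch: the per-file word lists stand in for the real index, the whole_word flag imitates the effect of -w on the index lookup, and index_candidates and search are hypothetical names.

```python
import re
from typing import Dict, List

FILES = {
    "phone.txt": "call ATT today",
    "travel.txt": "flight to Seattle",
    "notes.txt": "nothing here",
}


def index_candidates(files: Dict[str, str], pattern: str,
                     whole_word: bool = False) -> List[str]:
    """Lower-case the pattern and look it up among lower-cased words."""
    needle = pattern.lower()
    found = []
    for name, text in files.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        if whole_word:
            hit = needle in words                        # like -w
        else:
            hit = any(needle in word for word in words)  # substring of a word
        if hit:
            found.append(name)
    return sorted(found)


def search(files: Dict[str, str], pattern: str,
           whole_word: bool = False) -> List[str]:
    """Scan the candidates with the original, case-sensitive pattern."""
    return [name for name in index_candidates(files, pattern, whole_word)
            if pattern in files[name]]


if __name__ == "__main__":
    print(index_candidates(FILES, "ATT"))
    print(index_candidates(FILES, "ATT", whole_word=True))
    print(search(FILES, "ATT"))
```

Both lookups give the same final answer; the difference is that without whole-word matching travel.txt has to be opened and scanned before it can be rejected.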
There is no size limit for simple patterns and simple patterns within Boolean expressions. More complicated patterns, such as regular expressions, are currently limited to approximately 30 characters. Lines are limited to 1024 characters. Records are limited to 48K, and may be truncated if they are larger than that. The limit of record length can be changed by modifying the parameter Max_record in agrep.h.
Glimpseindex does not index words longer than 64 characters.
A query that contains no alphanumeric characters is not recommended (unless glimpse is used as agrep and the file names are provided). This is an understatement.