index++(1)                  General Commands Manual                 index++(1)

NAME
       index++ - SWISH++ indexer

SYNOPSIS
       index++ [ options ] directory...  file...

DESCRIPTION
       index++  is  the  SWISH++ file indexer.  It indexes the specified files
       and files in the specified  directories;  files  in  subdirectories  of
       specified directories are also indexed by default (unless either the -r
       or --no-recurse option or the RecurseSubdirs variable is given).  A file
       is indexed only if its filename either matches one of the patterns in
       the set specified with the -e or --pattern option or the IncludeFile
       variable (unless standard input is used; see next paragraph) or is not
       in the set specified with the -E or --no-pattern option or the
       ExcludeFile variable.

       If there is a single filename of `-', the list of directories and files
       to  index is instead taken from standard input (one per line).  In this
       case, filename patterns of files to index need not be specified explic-
       itly: all files, regardless of whether they  match  a  pattern  (unless
       they  are in the set not to index specified with either the -E or --no-
       pattern option or the ExcludeFile variable), are indexed, i.e., index++
       assumes you know what you're doing when specifying  filenames  in  this
       manner.

       In  any case, care must be taken not to specify files or subdirectories
       in directories that are also specified: since  directories  are  recur-
       sively  indexed by default (unless either the -r or --no-recurse option
       or the RecurseSubdirs variable is given), explicitly specifying a  sub-
       directory  or file in a directory that is also specified will result in
       those files being indexed more than once.

   Character Mapping
       Characters in the ISO 8859-1 (Latin 1)  character  set  are  mapped  to
       their closest ASCII equivalent before further examination and indexing.
       (Individual indexing modules may also do their own character mapping.)
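
       The mapping can be approximated as follows.  This is only a sketch: it
       assumes that stripping diacritics is close enough to what index++ does,
       whereas index++ uses its own mapping table.  Characters without a
       Unicode decomposition (for example the sharp s) are simply dropped
       here, which may differ from the real behavior.

```python
import unicodedata


def to_ascii(text):
    """Approximate a Latin-1 to ASCII mapping by stripping diacritics."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops the marks and anything else non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("r\u00e9sum\u00e9 na\u00efve"))
```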

   Word Determination
       Stop words, words that occur too frequently or have no information con-
       tent,  are not indexed.  (There is a default built-in set of a few hun-
       dred such English words.)  Additionally, several heuristics are used to
       determine which words should not be indexed.

       First, a word is checked to see if it looks like an acronym.  A word is
       considered an acronym only if it starts with a capital  letter  and  is
       composed  exclusively  of capital letters, digits, and punctuation sym-
       bols, e.g., ``AT&T.''  If a word looks like an acronym, it  is  indexed
       and no further checks are done.

       Second, there are several other checks that are applied.  A word is not
       indexed if it:

       1.  Is shorter than Word_Min_Size letters.  (Default is 4.)

       2.  Contains fewer than Word_Min_Vowels vowels.  (Default is 1.)

       3.  Contains  more than Word_Max_Consec_Same of the same character con-
           secutively (not including digits).  (Default is 2.)

       4.  Contains more than  Word_Max_Consec_Consonants  consecutive  conso-
           nants.  (Default is 5.)

       5.  Contains more than Word_Max_Consec_Vowels consecutive vowels.  (De-
           fault is 4.)

       6.  Contains  more  than Word_Max_Consec_Puncts consecutive punctuation
           characters.  (Default is 1.)
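
       The checks above can be sketched as follows.  The constants carry the
       defaults quoted in the list; the function names, the choice of
       ``aeiou'' as the vowel set, and the use of ASCII punctuation are
       assumptions made for illustration, and stop-word removal is left out.

```python
import string

# Defaults quoted from the list above.
WORD_MIN_SIZE = 4
WORD_MIN_VOWELS = 1
WORD_MAX_CONSEC_SAME = 2
WORD_MAX_CONSEC_CONSONANTS = 5
WORD_MAX_CONSEC_VOWELS = 4
WORD_MAX_CONSEC_PUNCTS = 1

VOWELS = set("aeiou")            # assumption: 'y' is not a vowel
PUNCT = set(string.punctuation)  # assumption: ASCII punctuation only


def is_vowel(c):
    return c in VOWELS


def is_consonant(c):
    return c.isalpha() and c not in VOWELS


def is_punct(c):
    return c in PUNCT


def looks_like_acronym(word):
    """Starts with a capital; only capitals, digits, and punctuation."""
    if not word or not word[0].isupper():
        return False
    return all(c.isupper() or c.isdigit() or c in PUNCT for c in word)


def longest_run(word, in_class):
    """Longest run of consecutive characters for which in_class is true."""
    best = run = 0
    for c in word:
        run = run + 1 if in_class(c) else 0
        best = max(best, run)
    return best


def longest_same_run(word):
    """Longest run of one repeated character, not counting digits."""
    best = run = 0
    prev = None
    for c in word:
        if c.isdigit():
            run, prev = 0, None
            continue
        run = run + 1 if c == prev else 1
        prev = c
        best = max(best, run)
    return best


def should_index(word):
    if looks_like_acronym(word):
        return True  # acronyms skip all remaining checks
    w = word.lower()
    if len(w) < WORD_MIN_SIZE:
        return False
    if sum(1 for c in w if is_vowel(c)) < WORD_MIN_VOWELS:
        return False
    if longest_same_run(w) > WORD_MAX_CONSEC_SAME:
        return False
    if longest_run(w, is_consonant) > WORD_MAX_CONSEC_CONSONANTS:
        return False
    if longest_run(w, is_vowel) > WORD_MAX_CONSEC_VOWELS:
        return False
    if longest_run(w, is_punct) > WORD_MAX_CONSEC_PUNCTS:
        return False
    return True
```

       Under these assumptions, ``AT&T'' and ``index'' pass, while ``cat''
       (too short), ``rhythm'' (no vowels), and ``queueing'' (five consecutive
       vowels) are rejected.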

   Filters
       Via the FilterFile configuration file variable, files matching particu-
       lar patterns can be filtered prior to indexing.  Via the  FilterAttach-
       ment  configuration  file variable, e-mail attachments whose MIME types
       match particular patterns can be filtered prior to indexing.  (See FIL-
       TERS in swish++.conf(5).)

   Incremental Indexing
       In order to add words from new documents to an existing index,  either
       the entire set of documents can be reindexed or the new documents alone
       can be incrementally indexed.  In many cases, reindexing everything  is
       sufficient  since  index++  is  really fast.  For a very large document
       set, however, this may use too many resources.

       There is a pitfall for incremental indexing, however: if any of the -f,
       --word-files, -p, or --word-percent options or the WordFilesMax or
       WordPercentMax variables are used, then words that are too frequent are
       discarded.  If new documents containing very few of those words are
       added, those words might no longer be too frequent.  However, there is
       no way to get them back since they were discarded.

       The way around this problem is not to discard any words by specifying a
       maximum word percentage of 101 (see the -p or --word-percent option).
       However, because no words are discarded, the size of the index file
       will be larger, perhaps significantly so.

       In practice, the loss of words may not matter much, especially if new
       documents are very similar to old documents, so that words that were
       too frequent in the old set would also be too frequent in the new set.

       Another way around this problem is to do periodic full indexing.
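
       The pitfall can be illustrated with a small sketch.  The function name
       and the exact comparisons are assumptions: the ``at or above the
       percentage'' test is inferred from the statement that 101 keeps all
       words, and the ``more than n files'' test from the wording of the -f
       option.

```python
def too_frequent(files_with_word, total_files, files_max=None, percent_max=100):
    """Return True if a word would be discarded as too frequent."""
    if total_files == 0:
        return False
    # -f / WordFilesMax: assumed to discard once the count exceeds the limit.
    if files_max is not None and files_with_word > files_max:
        return True
    # -p / WordPercentMax: assumed to discard at or above the percentage,
    # which is why 101 can never be reached and keeps every word.
    return files_with_word * 100 >= percent_max * total_files


# A word found in 9 of 10 files is discarded with a 90% limit ...
print(too_frequent(9, 10, percent_max=90))
# ... although after 100 more files without it, it would have been kept.
print(too_frequent(9, 110, percent_max=90))
```

       Because the first index has already dropped the word, the second
       result cannot be recovered by incremental indexing alone.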

INDEXING MODULES
       index++  is written in a modular fashion where different types of files
       have different indexing modules.  Currently, there are 7 modules:  Text
       (plain text), HTML (HTML and XHTML), ID3 (ID3 tags found in MP3 files),
       LaTeX,  Mail  (RFC  822  and Usenet News), Manual (Unix manual pages in
       nroff(1) with man(7) macros), and RTF (Rich Text Format).

   Text Module
       This module simply indexes plain text files performing  character  map-
       ping and word determination as has already been described.

   HTML and XHTML Module
       Additional processing is done for HTML and XHTML files.  The additional
       processing is:

       1.  Character  and  numeric (decimal and hexadecimal) entity references
           are converted to their ASCII character equivalents  before  further
           examination  and indexing.  For example, ``r&eacute;sum&#233;'' be-
           comes ``resume'' before indexing.

       2.  If a matched set of <TITLE> ... </TITLE> tags is found  within  the
           first  TitleLines  lines of the file (default is 12), then the text
           between the tags is stored in  the  generated  index  file  as  the
           file's  title rather than the file's name.  (Every non-space white-
           space character in the title is converted to a space;  leading  and
           trailing spaces are removed.)

       3.  If  an HTML or XHTML element contains a CLASS attribute whose value
           is among the set of class names specified as  those  not  to  index
           (via  one  or more of either the -C or --no-class option or the Ex-
           cludeClass variable), then all the text up to the tag that ends the
           element will not be indexed.

           For an element that has an optional end tag, ``the  tag  that  ends
           the  element''  is either the element's end tag or a tag of another
           element that implicitly ends it; for an element that does not  have
           an  end  tag,  ``the  tag  that ends the element'' is the element's
           start tag.  (See the EXAMPLES.)

           All elements from the HTML 4.0 specification (including  deprecated
           elements),  Ruby  elements,  plus common, browser-specific elements
           are recognized; unrecognized elements are ignored.  (See the -H  or
           --dump-html option.)

       4.  If  an  HTML  or XHTML element contains a TITLE attribute, then the
           words specified as the value of the TITLE attribute are indexed.

       5.  If an AREA, IMG, or INPUT element contains an ALT  attribute,  then
           the words specified as the value of the ALT attribute are indexed.

       6.  If  a META element contains both a NAME and CONTENT attribute, then
           the words specified as the value of the CONTENT attribute  are  in-
           dexed  associated  with the meta name specified as the value of the
           NAME attribute.

           (However, if either the -A or --no-assoc-meta options or the  Asso-
           ciateMeta  variable  is  specified, then the words specified as the
           value of the CONTENT attribute are still indexed, but  not  associ-
           ated with the meta name.)

           (See  also  the  -m,  --meta,  -M, and --no-meta options or the In-
           cludeMeta or ExcludeMeta  variables.)   Meta  names  can  later  be
           queried against specifically using search++(1).

       7.  If  a  TABLE  element  contains a SUMMARY attribute, then the words
           specified as the value of the SUMMARY attribute are indexed.

       8.  If an OBJECT element contains a STANDBY attribute, then  the  words
           specified as the value of the STANDBY attribute are indexed.

       9.  All other HTML or XHTML tags and comments (anything between < and >
           characters) are discarded.

       In compliance with the HTML specification, any one of no quotes, single
       quotes,  or  double  quotes may be used to contain attribute values and
       attributes can appear in any order.  Values containing whitespace, how-
       ever, must be quoted.  The specification is vague as to whether  white-
       space surrounding the = is legal, but index++ allows it.
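
       The title handling described in item 2 can be sketched as follows.
       The function name and the regular expression are assumptions made for
       illustration; index++ has its own HTML parser.  Only the rules stated
       above are modelled: the search is limited to the first TitleLines
       lines, non-space whitespace becomes a space, and leading and trailing
       spaces are removed.

```python
import re

# Matches a <TITLE> ... </TITLE> pair in any letter case, across lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>",
                      re.IGNORECASE | re.DOTALL)


def extract_title(lines, title_lines=12):
    """Return the cleaned title, or None so the caller uses the file name."""
    head = "\n".join(lines[:title_lines])
    match = TITLE_RE.search(head)
    if match is None:
        return None
    # Convert every whitespace character other than a space to a space.
    title = re.sub(r"[^\S ]", " ", match.group(1))
    return title.strip(" ")


print(extract_title(["<html>", "<TITLE>", "  My\tPage", "</TITLE>"]))
```

       A title that starts after line TitleLines is not found, in which case
       the sketch returns None and the file's name would be used instead.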

   ID3 Module
       ID3 tags are used to store audio meta information for MP3 files (gener-
       ally).   Since  audio files contain mostly binary information, only the
       ID3 tag text fields are indexed.  ID3 tag versions 1.x and 2.x (through
       2.4) are supported (except for encrypted frames).  If a  file  contains
       both  1.x  and  2.x  tags, only the 2.x tag is indexed.  The processing
       done for files containing an ID3 tag is:

       1.  If a title field is found, then the value of the title is stored in
           the generated index file as the file's title rather than the file's
           name.  (Every non-space whitespace character in the title  is  con-
           verted to a space; leading and trailing spaces are removed.)

       2.  Words  that are the value of fields are indexed associated with the
           field name as a meta name.  (However, if either the -A or  --no-as-
           soc-meta  options  or the AssociateMeta variable is specified, then
           the words specified as the value of the field  are  still  indexed,
           but not associated with the field.)

           (See  also  the  -m,  --meta,  -M, and --no-meta options or the In-
           cludeMeta or ExcludeMeta  variables.)   Meta  names  can  later  be
           queried against specifically using search++(1).

           For  ID3v1.x,  the  recommended  fields  to  be indexed are: album,
           artist, comments, genre, and title.

           For ID3v2.2, the recommended text fields (with reassignments) to be
           indexed  are:  com=comments,  tal=album,  tcm=composer,  tco=genre,
           tcr=copyright,  ten=encoder,  txt=lyricist, tt1=content, tt2=title,
           tt3=subtitle, ipl=musicians, tot=original-title, tol=original-lyri-
           cist, toa=original-artist, tp1=artist, tp2=performers,  tp3=conduc-
           tor, tpb=publisher, txx=user, slt=lyrics, and ult=lyrics.

           For ID3v2.4, the recommended text fields (with reassignments) to be
           indexed  are: comm=comments, talb=album, tcom=composer, tcon=genre,
           tcop=copyright, tenc=encoder, text=lyricist, tipl=people, tit1=con-
           tent,   tit2=title,   tit3=subtitle,   tmcl=musicians,   tmoo=mood,
           toal=original-title,  toly=original-lyricist, tope=original-artist,
           town=owner, tpe1=artist, tpe2=performers, tpe3=conductor, tpub=pub-
           lisher, tsst=set-subtitle, txxx=user, user=terms, sylt=lyrics,  and
           uslt=lyrics.

           ID3v2.3  is  the  same  as  2.4  except replace tmcl=musicians with
           ipls=musicians.

           All text fields (with reassignments) for all versions  of  ID3  can
           (and  should)  be specified concurrently so it need not be known in
           advance which version(s) of ID3 MP3 files are encoded with.

       3.  For ID3v2.x, text fields that are compressed are uncompressed prior
           to indexing.

       4.  For ID3v2.x, Unicode text that is encoded in either UTF-8 or UTF-16
           (either big- or little-endian) is decoded prior to indexing.

   LaTeX Module
       Additional processing is done for LaTeX files.  If a \title command  is
       found  within  the  first TitleLines lines of the file (default is 12),
       then the value of the title is stored in the generated  index  file  as
       the  file's title rather than the file's name.  (Every non-space white-
       space character in the title is  converted  to  a  space;  leading  and
       trailing spaces are removed.)

   Mail Module
       Additional  processing is done for mail and news files.  The additional
       processing is:

       1.  If a Subject header is found within the first TitleLines  lines  of
           the  file  (default is 12), then the value of the subject is stored
           in the generated index file as the file's  title  rather  than  the
           file's name.  (Every non-space whitespace character in the title is
           converted to a space; leading and trailing spaces are removed.)

       2.  Words  that  are  the value of a header are indexed associated with
           the header name as a meta name.  (However,  if  either  the  -A  or
           --no-assoc-meta options or the AssociateMeta variable is specified,
           then  the  words specified as the value of the header are still in-
           dexed, but not associated with the header.)

           (See also the -m, --meta, -M, and  --no-meta  options  or  the  In-
           cludeMeta  or  ExcludeMeta  variables.)   Meta  names  can later be
           queried against specifically using search++(1).

           The recommended headers to be indexed are: Bcc, Cc, Comments,  Con-
           tent-Description,  From,  Keywords, Newsgroups, Resent-To, Subject,
           and To.

       3.  MIME attachments are indexed.

       4.  Text that is in the text/enriched  content  type  is  converted  to
           plain text prior to indexing.

       5.  Text  that  is encoded as either quoted-printable or base-64 is de-
           coded prior to indexing.

       6.  Unicode text that is encoded in either the UTF-7 or UTF-8 character
           set is decoded prior to indexing.

       7.  Text in vCards is indexed such that the values  of  types  (fields)
           are  associated  with the types as meta names.  (However, if either
           the -A or --no-assoc-meta options or the AssociateMeta variable  is
           specified, then the words specified as the value of types are still
           indexed, but not associated with the types.)

           The recommended vCard types (with reassignments) to be indexed are:
           adr=address,  categories,  class, label=address, fn=name, nickname,
           note, org, role, and title.

       Indexing mail and news files is most effective only when there  is  ex-
       actly  one  message per file.  While Usenet news files are usually this
       way, mail files are not.  Mail files, e.g., mailboxes, are usually com-
       prised of multiple messages.  Such files would need to be split up into
       files of individual messages prior to indexing since there's  no  point
       in  indexing  a single mailbox: every search result would return a rank
       of 100 for the same file.  Therefore, the splitmail++(1) utility is in-
       cluded in the SWISH++ distribution.

   Manual Module
       Additional processing is done for Unix manual page  files.   The  addi-
       tional processing is:

       1.  If a NAME section heading macro (.SH) is found within the first Ti-
           tleLines  lines  of  the file (default is 12), then the contents of
           the next line are stored in the generated index file as the  file's
           title  rather  than  the  file's name.  (Every non-space whitespace
           character in the title is converted to a space; leading and  trail-
           ing  spaces  as  well  as backslash sequences, such as \f2, are re-
           moved.)

       2.  Words that are in a section are indexed associated with the name of
           the section as a meta name.  (However, if either the -A or --no-as-
           soc-meta options or the AssociateMeta variable is  specified,  then
           the  words  in a section are still indexed, but not associated with
           the section heading.)

           Spaces in multi-word section  headings  are  converted  to  dashes,
           e.g.,  ``see also'' becomes ``see-also'' as a meta name.  (See also
           the -m, --meta, -M, and --no-meta options or the IncludeMeta or Ex-
           cludeMeta variables.)  Meta names  can  later  be  queried  against
           specifically using search++(1).

           The  recommended sections to be indexed are: AUTHOR, BUGS, CAVEATS,
           DESCRIPTION, DIAGNOSTICS, ENVIRONMENT, ERRORS, EXAMPLES,  EXIT-STA-
           TUS,  FILES, HISTORY, NAME, NOTES, OPTIONS, RETURN-VALUE, SEE-ALSO,
           SYNOPSIS, and WARNINGS.

   RTF Module
       This module simply indexes rich text format files without  all  format-
       ting commands.

OPTIONS
       Options  begin with either a `-' for short options or a ``--'' for long
       options.  Either a `-' or ``--'' by itself explicitly ends the options;
       either short or long options may be used.  Long option names may be ab-
       breviated so long as the abbreviation is unambiguous.

       For a short option that takes an argument, the argument is either taken
       to be the remaining characters of the same option, if any, or, if  not,
       is taken from the next option unless said option begins with a `-'.

       Short  options  that take no arguments can be grouped (but the last op-
       tion in the group can take an argument), e.g., -lrv4 is  equivalent  to
       -l -r -v4.

       For  a long option that takes an argument, the argument is either taken
       to be the characters after a `=', if any, or, if not, is taken from the
       next option unless said option begins with a `-'.
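
       The grouping rule for short options can be sketched as follows.  The
       two option sets are taken from the option list in this section; the
       function name is hypothetical, and the sketch only expands a single
       token rather than parsing a whole command line.

```python
# Short options that take no argument, and those that take one.
NO_ARG = set("?AHIlPrSV")
WITH_ARG = set("cCeEfFgimMpstTvW")


def expand_group(token):
    """Expand a grouped short-option token such as '-lrv4'."""
    if len(token) < 2 or not token.startswith("-") or token.startswith("--"):
        raise ValueError("not a short option: %r" % token)
    expanded = []
    body = token[1:]
    for i, ch in enumerate(body):
        if ch in NO_ARG:
            expanded.append("-" + ch)
        elif ch in WITH_ARG:
            # The remaining characters, if any, are this option's argument.
            expanded.append("-" + body[i:])
            break
        else:
            raise ValueError("unknown option: -%s" % ch)
    return expanded


print(expand_group("-lrv4"))
```

       If the last option in a group takes an argument but none is attached,
       the sketch emits the bare option and leaves it to the caller to take
       the argument from the next command-line word.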

       -?
       --help              Print the usage (``help'') message and exit.

       -A
       --no-assoc-meta     Do not associate words with meta names  during  in-
                           dexing nor store such associations in the generated
                           index  file.   This  sacrifices  meta names for de-
                           creased memory usage and index file size.

       -cf
       --config-file=f     The name of the  configuration  file,  f,  to  use.
                           (Default is swish++.conf in the current directory.)
                           A  configuration  file  is not required: if none is
                           specified and the default does not exist,  none  is
                           used;  however, if one is specified and it does not
                           exist, then this is an error.

       -Cc
       --no-class=c        For HTML or XHTML files only, a class name,  c,  of
                           an  HTML  or  XHTML element whose text is not to be
                           indexed.  Multiple -C or --no-class options may  be
                           specified.

       -em:p[,p...]
       --pattern=m:p[,p...]
                           A module name, m, and a filename pattern (or set of
                           patterns  separated  by commas), p, of files to in-
                           dex.  Case is irrelevant for the module  name,  but
                           significant  for  the  patterns.   Multiple  -e  or
                           --pattern options may be specified.

       -Ep[,p...]
       --no-pattern=p[,p...]
                           A filename pattern (or set of patterns separated by
                           commas), p, of files not to index.  Case is signif-
                           icant.  Multiple -E or --no-pattern options may  be
                           specified.

       -fn
       --word-files=n      The maximum number of files, n, a word may occur in
                           before it is discarded as being too frequent.  (De-
                           fault is infinity.)

       -Fn
       --files-reserve=n   Reserve  space  for  this  number  of  files, n, to
                           start.  More space will be allocated as  necessary,
                           but with a slight performance penalty.  (Default is
                           1000.)

       -gn
       --files-grow=n      Grow the space for the reserved number of files, n,
                           when incrementally indexing.  The number can either
                           be  an  absolute  number  of  files or a percentage
                           (when followed by a percent sign %).  Just as  with
                           the -F option, more space will be allocated as nec-
                           essary,  but  with  a  slight  performance penalty.
                           (Default is 100.)

       -H
       --dump-html         Dump the built-in set of recognized HTML and  XHTML
                           elements to standard output and exit.

       -if
       --index-file=f      The  name  of  the generated index file, f (for new
                           indexes; default is swish++.index  in  the  current
                           directory)  or the old index file when doing incre-
                           mental indexing.

       -I
       --incremental       Incrementally add the indexed files and words to an
                           existing index.  The existing index is not touched;
                           instead, a new index is created having the same
                           pathname as the existing index with ``.new''
                           appended.

       -l
       --follow-links      Follow symbolic links during indexing.  (Default is
                           not  to follow them.)  This option is not available
                           under Microsoft Windows since  it  doesn't  support
                           symbolic links.

       -mm[=n]
       --meta=m[=n]        The value of a meta name, m, for which words are to
                           be  associated  when  indexed.  Case is irrelevant.
                           Multiple -m or --meta options may be specified.

                           A meta name can be reassigned when  followed  by  a
                           new  name,  n, meaning that the name n and not m is
                           stored in the generated index file so that  queries
                           would use the new name rather than the original.

                           By  default,  words  associated with all meta names
                           are indexed.  Specifying at least one meta name via
                           this option changes that so that only the words as-
                           sociated with a member of the set of meta names ex-
                           plicitly specified via one or more -m or --meta op-
                           tions are indexed.

       -Mm
       --no-meta=m         The value of a meta name, m, for  which  words  are
                           not  to  be indexed.  Case is irrelevant.  Multiple
                           -M or --no-meta options may be specified.
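
       The interplay of the -m and -M options can be sketched as follows.
       The function names are hypothetical, and two points are assumptions
       rather than documented behavior: that an excluded name wins when a
       name is both included and excluded, and that the reassigned name is
       also treated case-insensitively.

```python
def parse_meta_option(value):
    """Split an -m argument 'm' or 'm=n' into (original, stored) names."""
    name, _, new_name = value.partition("=")
    name = name.lower()  # case is irrelevant for meta names
    return name, (new_name.lower() or name)


def stored_meta_name(name, include, exclude):
    """Return the meta name stored in the index, or None if not indexed.

    include maps original names to stored names (from -m options);
    exclude is a set of names (from -M options).
    """
    name = name.lower()
    if name in exclude:
        return None
    # Once any -m option is given, only the listed names are indexed.
    if include and name not in include:
        return None
    return include.get(name, name)


include = dict(parse_meta_option(v) for v in ["adr=address", "note"])
print(stored_meta_name("ADR", include, set()))
print(stored_meta_name("org", include, set()))
```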

       -pn
       --word-percent=n    The maximum percentage, n, of files a word may  oc-
                           cur  in  before  it  is discarded as being too fre-
                           quent.  (Default is 100.)  If you want to keep  all
                           words regardless, specify 101.

       -P
       --no-pos-data       Do not store word positions in memory during index-
                           ing  nor  in  the generated index file needed to do
                           ``near'' searches  later  during  searching.   This
                           sacrifices  ``near'' searching for decreased memory
                           usage and index file size (approximately 50%).

       -r
       --no-recurse        Do not recursively index the files  in  subdirecto-
                           ries, that is: when a directory is encountered, all
                           the files in that directory are indexed (modulo the
                           filename  patterns  specified  via  either  the -e,
                           --pattern, -E, or --no-pattern options or  the  In-
                           cludeFile or ExcludeFile variables) but subdirecto-
                           ries  encountered  are  ignored  and  therefore the
                           files contained in them are not indexed.  This  op-
                           tion is most useful when specifying the directories
                           and files to index via standard input.  (Default is
                           to index the files in subdirectories recursively.)

       -sf
       --stop-file=f       The  name of a file, f, containing the set of stop-
                           words to use instead of the built-in  set.   White-
                           space, including blank lines, and characters start-
                           ing  with  #  and continuing to the end of the line
                           (comments) are ignored.

       -S
       --dump-stop         Dump the built-in set  of  stop-words  to  standard
                           output and exit.

       -tn
       --title-lines=n     The maximum number of lines, n, into a file to look
                           at  for  a  file's title.  (Default is 12.)  Larger
                           numbers slow indexing.

       -Td
       --temp-dir=d        The path of the directory, d, to use for  temporary
                           files.  The directory must exist.  (Default is /tmp
                           for Unix or /temp for Windows.)

                           If  your  OS mounts swap space on /tmp, as indexing
                           progresses and more files get created in /tmp,  you
                           will  have  less  swap  space,  indexing  will  get
                           slower, and you may run out of memory.  If this  is
                           the  case, you should specify a directory on a real
                           filesystem, i.e., one on a physical disk.

       -vn
       --verbosity=n       The verbosity level, n, for printing additional in-
                           formation to standard output during indexing.   The
                           verbosity levels, 0-4, are:

                           0   No  output  is  generated  except  for  errors.
                               (This is the default.)
                           1   Only run statistics (elapsed  time,  number  of
                               files, word count) are printed.
                           2   Directories are printed as indexing progresses.
                           3   Directories  and files are printed with a word-
                               count for each file.
                           4   Same as 3 but also prints all  files  that  are
                               not indexed and why.

       -V
       --version           Print  the  version  number  of SWISH++ to standard
                           output and exit.

       -Wn
       --word-threshold=n  The word count past which partial indices are
                           generated and merged because the words would not
                           all fit into memory at the same time.  If you index
                           and your machine begins to swap heavily, lower this
                           value.  Only the super-user can specify a value
                           larger than the compiled-in default.

CONFIGURATION FILE
       The  following variables can be set in a configuration file.  Variables
       and command-line options can be mixed, the latter taking priority.

            AssociateMeta       Same as -A or --no-assoc-meta
            ExcludeClass        Same as -C or --no-class
            ExcludeFile         Same as -E or --no-pattern
            ExcludeMeta         Same as -M or --no-meta
            FilesGrow           Same as -g or --files-grow
            FilesReserve        Same as -F or --files-reserve
            FilterAttachment    (See FILTERS in swish++.conf(5).)
            FilterFile          (See FILTERS in swish++.conf(5).)
            FollowLinks         Same as -l or --follow-links
            IncludeFile         Same as -e or --pattern
            IncludeMeta         Same as -m or --meta
            Incremental         Same as -I or --incremental
            IndexFile           Same as -i or --index-file
            RecurseSubdirs      Same as -r or --no-recurse
            StopWordFile        Same as -s or --stop-file
            StoreWordPositions  Same as -P or --no-pos-data
            TempDirectory       Same as -T or --temp-dir
            TitleLines          Same as -t or --title-lines
            Verbosity           Same as -v or --verbosity
            WordFilesMax        Same as -f or --word-files
            WordPercentMax      Same as -p or --word-percent
            WordsNear           Same as -n or --near
            WordThreshold       Same as -W or --word-threshold

EXAMPLES
   Unix Command-Lines
       All these examples assume you change your working directory to your web
       server's document root prior to indexing.

       To index all HTML and text files on a web server:

            index++ -v3 -e 'html:*.*htm*' -e 'text:*.txt' .

       To index all files not under directories named CVS:

            find . -name CVS -prune -o -type f -a -print | index++ -e 'html:*.*htm*' -
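
       Where find(1) is not available, the same file list can be produced by
       a short script and written to the standard input of index++.  The
       following is a sketch only; the function name is hypothetical and it
       is meant to mirror the find command above, not to replace it.

```python
import os


def files_outside(root, pruned="CVS"):
    """List regular files under root, skipping anything named `pruned`."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d != pruned]
        for name in filenames:
            if name == pruned:
                continue
            path = os.path.join(dirpath, name)
            # Like "-type f": regular files only, symbolic links excluded.
            if os.path.isfile(path) and not os.path.islink(path):
                paths.append(path)
    return sorted(paths)


if __name__ == "__main__":
    # One path per line, as index++ expects on standard input.
    print("\n".join(files_outside(".")))
```

       The output of such a script would be piped to index++ with a single
       filename of `-', exactly as in the find example.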

   Windows Command-Lines
       When  using the Windows command interpreter, single quotes around file-
       name patterns don't work; you must use double quotes:

            index++ -v3 -e "html:*.*htm*" -e "text:*.txt" .

       This is a problem with Windows, not SWISH++.  (Double quotes will  also
       work under Unix.)

   Using CLASS Attributes to Index HTML Selectively
       In  an HTML or XHTML document, there may be sections that should not be
       indexed.  For example, if every page of a web site contains  a  naviga-
       tion menu such as:

            <SELECT NAME="menu">
              <OPTION>Home
              <OPTION>Automotive
              <OPTION>Clothing
              <OPTION>Hardware
            </SELECT>

       or  a  common header and footer, then, ordinarily, those words would be
       indexed for every page and therefore be discarded because they would be
       too frequent.  However, via either the -C or --no-class option  or  the
       ExcludeClass  variable,  one  or  more class names can be specified and
       then HTML or XHTML elements belonging to one of those classes will  not
       have the text up to the tag that ends them indexed.  Given a class name
       of, say, no_index, the above menu can be changed to:

            <SELECT NAME="menu" CLASS="no_index">

       and then everything up to the </SELECT> tag will not be indexed.

       For an HTML element that has an optional end tag (such as the <P>
       element), the text is not indexed up to the tag that ends the element:
       either the element's own end tag or the tag of some other element that
       implicitly ends it.  For example, in:

            <P CLASS="no_index">
            This was the poem that Alice read:
            <BLOCKQUOTE>
              <B>Jabberwocky</B><BR>
              `Twas brillig, and the slithy toves<BR>
              Did gyre and gimble in the wabe;<BR>
              All mimsy were the borogoves,<BR>
              And the mome raths outgrabe.
            </BLOCKQUOTE>

       the  <BLOCKQUOTE> tag implicitly ends the <P> element (as do all block-
       level elements) so the only text that is not indexed above  is:  ``This
       was the poem that Alice read.''

       For  an  HTML  or XHTML element that does not have an end tag, only the
       text within the start tag will not be indexed.  For example, in:

            <IMG SRC="home.gif" ALT="Home" CLASS="no_index">

       the word ``Home'' will not be indexed even though it  ordinarily  would
       have been if the CLASS attribute were not there.

   Filters
       (See Filters under EXAMPLES in swish++.conf(5).)

EXIT STATUS
       Exits with one of the values given below:

            0    Success.
            1    Error in configuration file.
            2    Error in command-line options.
            10   Unable to open temporary file.
            11   Unable to write index file.
            12   Unable to write temporary file.
            13   Root-only operation attempted.
            30   Unable to read stop-word file.
            40   Unable to read index file.
            127  Internal error.
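
       A wrapper script can branch on these values.  The following is a
       minimal sketch; the helper name describe_index_status is hypothetical
       and simply maps each status listed above to its description.

```shell
# Hypothetical helper: translate an index++ exit status into the
# description given in the list above.
describe_index_status() {
    case $1 in
        0)   echo "Success." ;;
        1)   echo "Error in configuration file." ;;
        2)   echo "Error in command-line options." ;;
        10)  echo "Unable to open temporary file." ;;
        11)  echo "Unable to write index file." ;;
        12)  echo "Unable to write temporary file." ;;
        13)  echo "Root-only operation attempted." ;;
        30)  echo "Unable to read stop-word file." ;;
        40)  echo "Unable to read index file." ;;
        127) echo "Internal error." ;;
        *)   echo "Unrecognized exit status: $1" ;;
    esac
}

# Usage: run the indexer, then report what happened.
#   index++ -e 'html:*.*htm*' .
#   describe_index_status "$?"
```

       Note that a shell also reports status 127 when the command itself
       cannot be found, so that value alone does not prove index++ ran.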

CAVEATS
       1.  Generated index files are machine-dependent (size of data types and
           byte order).

       2.  The word-determination heuristics employed are heavily geared for
           English.  Using SWISH++ as-is to index and search files in
           non-English languages is not recommended.

       3.  Unless otherwise noted above, the character encoding always used is
           ISO  8859-1  (Latin  1).  Character encodings that are specified in
           HTML or XHTML files are ignored.

       4.  An e-mail message can have both an encoding and a non-ASCII or non-
           ISO-8859-1 charset simultaneously, e.g., base64-encoded UTF-8.  (In
           practice, this particular case  should  never  happen  since  UTF-7
           should be used instead; but you get the idea.)

           However, handling both an encoding and such a charset
           simultaneously is problematic; hence, index++ handles only one of
           the two for a given e-mail message or attachment.  If a message or
           attachment has both, the encoding takes precedence.
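
       Regarding caveat 3, one possible workaround (an assumption of this
       sketch, not something index++ does itself) is to transcode UTF-8 text
       into an ISO 8859-1 copy before indexing.  The sketch uses the standard
       iconv(1) utility; the helper name to_latin1 and the file names are
       hypothetical, and iconv fails on characters that have no Latin 1
       equivalent.

```shell
# Hypothetical helper: write an ISO 8859-1 copy of a UTF-8 file.
# $1 = UTF-8 source file, $2 = destination for the Latin 1 copy.
to_latin1() {
    iconv -f UTF-8 -t ISO-8859-1 "$1" > "$2"
}

# Usage: keep the copies in their own directory and index only that.
#   mkdir -p latin1
#   to_latin1 src/notes.txt latin1/notes.txt
#   index++ -e 'text:*.txt' latin1
```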

FILES
       swish++.conf      default configuration file name
       swish++.index     default index file name

ENVIRONMENT
       TMPDIR    If set, the default path of the directory to use  for  tempo-
                 rary files.  The directory must exist.  This is superseded by
                 either the -T or --temp-dir option or the TempDirectory vari-
                 able.
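
       Since the directory must already exist, a wrapper can create it before
       indexing.  The following is a minimal sketch; the helper name
       index_with_tmpdir and the path in the usage line are hypothetical.

```shell
# Hypothetical helper: make sure the temporary directory exists, then run
# index++ with TMPDIR pointing at it.
# $1 = temporary directory, remaining arguments = passed to index++.
index_with_tmpdir() {
    scratch_dir=$1
    shift
    # index++ requires the directory to exist, so create it if needed.
    mkdir -p "$scratch_dir" || return
    # The assignment prefix sets TMPDIR only for this one command.
    TMPDIR=$scratch_dir index++ "$@"
}

# Usage:
#   index_with_tmpdir /var/tmp/swish -e 'html:*.*htm*' .
```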

SEE ALSO
       extract++(1),    find(1),    nroff(1),   search++(1),   splitmail++(1),
       swish++.conf(5), glob(7), man(7).

       Nathaniel S. Borenstein.  ``The text/enriched MIME Content-type,''
       Request for Comments 1563, Network Working Group of the Internet
       Engineering Task Force, January 1994.

       David H. Crocker.  ``Standard for the Format of ARPA Internet Text Mes-
       sages,'' Request for Comments 822, Department of  Electrical  Engineer-
       ing, University of Delaware, August 1982.

       Frank  Dawson and Tim Howes.  ``vCard MIME Directory Profile,'' Request
       for Comments 2426, Network Working Group of  the  Internet  Engineering
       Task Force, September 1998.

       Ned  Freed  and  Nathaniel S. Borenstein.  ``Multipurpose Internet Mail
       Extensions (MIME) Part One: Format of Internet  Message  Bodies,''  Re-
       quest for Comments 2045, RFC 822 Extensions Working Group of the Inter-
       net Engineering Task Force, November 1996.

       David  Goldsmith  and  Mark Davis.  ``UTF-7, a mail-safe transformation
       format of Unicode,'' Request for Comments 2152, Network  Working  Group
       of the Internet Engineering Task Force, May 1997.

       International Organization for Standardization.  ISO 8859-1:
       Information Processing -- 8-bit single-byte coded graphic character
       sets -- Part 1: Latin alphabet No. 1, 1987.

       --.  ISO 8879: Information Processing -- Text  and  Office  Systems  --
       Standard Generalized Markup Language (SGML), 1986.

       --.   ISO/IEC 9945-2: Information Technology -- Portable Operating Sys-
       tem Interface (POSIX) -- Part 2: Shell and Utilities, 1993.

       Leslie Lamport.  LaTeX: A Document Preparation System, 2nd  ed.,  Addi-
       son-Wesley, Reading, MA, 1994.

       Martin Nilsson.  ID3 tag version 2, March 1998.

       --.  ID3 tag version 2.3.0, February 1999.

       --.  ID3 tag version 2.4.0 - Main Structure, November 2002.

       --.  ID3 tag version 2.4.0 - Native Frames, November 2002.

       Steven  Pemberton,  et  al.  XHTML 1.0: The Extensible HyperText Markup
       Language, World Wide Web Consortium, January 2000.

       Dave Raggett, Arnaud Le Hors, and Ian Jacobs.  ``On SGML and HTML: SGML
       constructs used in HTML: Entities,'' HTML  4.0  Specification,  §3.2.3,
       World Wide Web Consortium, April 1998.

       --.  ``The global structure of an HTML document: The document head: The
       title  attribute,'' HTML 4.0 Specification, §7.4.3, World Wide Web Con-
       sortium, April 1998.

       --.  ``The global structure of an HTML  document:  The  document  head:
       Meta data,'' HTML 4.0 Specification, §7.4.4, World Wide Web Consortium,
       April 1998.

       --.  ``The global structure of an HTML document: The document body: El-
       ement  identifiers:  the id and class attributes,'' HTML 4.0 Specifica-
       tion, §7.5.2, World Wide Web Consortium, April 1998.

       --.  ``Tables: Elements for constructing tables: The  TABLE  element,''
       HTML 4.0 Specification, §11.2.1, World Wide Web Consortium, April 1998.

       --.  ``Objects, Images, and Applets: Generic inclusion: the OBJECT ele-
       ment,'' HTML 4.0 Specification, §13.3, World Wide Web Consortium, April
       1998.

       --.   ``Objects,  Images, and Applets: How to specify alternate text,''
       HTML 4.0 Specification, §13.8, World Wide Web Consortium, April 1998.

       --.  ``Index of Elements,'' HTML 4.0 Specification, World Wide Web Con-
       sortium, April 1998.

       Marcin Sawicki, et al.  Ruby Annotation,  World  Wide  Web  Consortium,
       April 2001.

       The  Unicode Consortium.  ``Encoding Forms,'' The Unicode Standard 3.0,
       §2.3, Addison-Wesley, 2000.

       Francois Yergeau.  ``UTF-8, a transformation format of ISO 10646,'' Re-
       quest for Comments 2279, Network Working Group of  the  Internet  Engi-
       neering Task Force, January 1998.

AUTHOR
       Paul J. Lucas <pauljlucas@mac.com>

SWISH++                         March 25, 2004                      index++(1)