DtSearch - HP-UX

 DtSearch(special file)                               DtSearch(special file)




 NAME
      DtSearch - Introduces the DtSearch text search and retrieval system

 DESCRIPTION
      DtSearch is a general purpose text search and retrieval system that
      serves as the text search engine for the DtInfo browser in the Common
      Desktop Environment (CDE). DtSearch utilizes a full text inverted
      index of natural language words and stems. Both queries and documents
      have been internationalized for CDE single- and multi-byte languages,
      with provision for the definition of custom languages. Queries are
      simple text strings that can optionally include full boolean
      specifications with a simple intuitive syntax. Results of searches can
      be ranked statistically. Document retrievals can include information
      for highlighting query words in retrieved documents.

      DtSearch consists of two major functional areas.  The first is a set
      of offline build tools that:

         +  Create searchable databases.

         +  Index the user's text files and load the resultant search
            information into the databases.

         +  Maintain the databases.

      The second functional area is an online search API. It provides a
      simple interface to the search engine to facilitate user-written
      search and retrieval programs. The API consists of a set of functions
      compiled into the library libDtSearch, with function prototypes,
      constant definitions, and data structures defined in Search.h.
      DtSearch includes a sample browser source program, dtsrtest.c, to
      demonstrate API usage.

      Information and error messages in both functional areas, including
      those appended to the online API MessageList, are generated from a
      single DtSearch Message catalog, dtsearch.cat. The source for this
      catalog is dtsearch.msg.

      Each DtSearch database is associated with a single full text inverted
      index. In addition, each database can be partitioned into logical
      subsets of documents called "keytypes" by a naming convention of the
      database keys. The search engine can open multiple databases and users
      can specify any combination of databases and keytypes for each query,
      thus providing a two tier search capability. Users can further qualify
      searches by restricting the search return list by date ranges and
      maximum number of documents returned.

      DtSearch is written in ANSI Standard and POSIX compliant C. The
      DtSearch online search API is not reentrant (not "thread-safe") and
      must therefore be directly linked into the user-written search
      program. The DtSearch API will increase the size of a browser search
      program from 100K to 200K bytes depending on which functions are
      called.

 GENERAL SPECIFICATIONS AND CONVENTIONS
    Database Names
      Databases consist of a set of binary and ASCII files whose names are
      the 1- to 8-character ASCII database name specified to the dtsrcreate
      command, a period (.), and a 1- to 3-character ASCII file name suffix.
      Executing dtsrcreate will create and initialize these files.  After
      creation, databases are always identified by the 1- to 8-character
      name string used in dtsrcreate. The database names dtsearch and
      austext are reserved and may not be specified.
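
      The naming rules above can be checked mechanically. The following
      is a minimal Python sketch, not part of DtSearch. The function
      names are hypothetical, and two details are assumptions because
      the text only says "ASCII": the allowed characters (letters,
      digits, underscore) and the case-insensitive comparison against
      the reserved names.

```python
import string

# Reserved database names, per the text above.
RESERVED_NAMES = {"dtsearch", "austext"}

# Assumption: letters, digits, and underscore only. The man page says
# "ASCII" without listing the permitted characters.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + "_")


def is_valid_database_name(name):
    """Return True if name has 1 to 8 allowed characters and is not reserved."""
    if not 1 <= len(name) <= 8:
        return False
    if any(ch not in ALLOWED_CHARS for ch in name):
        return False
    # Assumption: reserved names are matched without regard to case.
    return name.lower() not in RESERVED_NAMES


def database_file_name(name, suffix):
    """Build '<name>.<suffix>' for a 1- to 3-character suffix."""
    if not is_valid_database_name(name):
        raise ValueError(f"invalid database name: {name!r}")
    if not 1 <= len(suffix) <= 3 or any(ch not in ALLOWED_CHARS for ch in suffix):
        raise ValueError(f"invalid suffix: {suffix!r}")
    return f"{name}.{suffix}"
```

      For example, is_valid_database_name("austext") is False, while
      database_file_name("mydb", "abc") yields "mydb.abc" (both the name
      and the suffix here are made up for illustration).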

    DtSearch Languages
      Each database is associated with a single natural language. Unlike
      conventional locales, a DtSearch language includes code set
      presumptions and, most importantly, linguistic parsing and stemming
      rules to identify indexable terms in a text stream. A DtSearch
      language is specified when a database is created. Developers can also
      define custom languages with special code sets and linguistic rules.
      See "LANGUAGE PARSING AND STEMMING" below for details.

    Database Types
      The API can be used simply as a search engine, referring to documents
      only through the inverted indexes. Alternatively, a database can be
      configured to store actual document text in compressed format in a
      repository efficiently accessible to the engine. The configuration
      options that indicate these alternatives are referred to as database
      types and are specified to dtsrcreate at database creation time.

    Abstracts
      A field called the "abstract" is included in the fzk file for each
      document loaded into a database, and is included on the Results list
      for each document in a successful search. When documents are not
      stored in a repository, the abstract typically specifies a file name,
      URL, or other reference useful to the browser. It can also include
      summary information viewable by users to help them select documents
      for retrieval and display.

    Offline Build Tools
      dtsrcreate creates and initializes new databases or reinitializes
      preexisting databases. Textual data is loaded into databases by the
      execution of two programs. dtsrload creates a database object record
      for each text document, and dtsrindex creates the full text inverted
      index of words and stems for each object record. Based on unique
      database keys for each object, dtsrload and dtsrindex can also serve
      as update programs for preexisting databases.

      The input to the load and index programs is a canonical text file with
      a .fzk file name suffix. The format of fzk files is sufficiently
      simple that they can be generated manually. In addition, DtSearch
      includes a utility program, dtsrhan, which can generate a correctly
      formatted fzk file for some kinds of text documents.

      Several other utilities provided in the distribution package are
      suitable for extracting summary database information, including
      dtsrdbrec and dtsrkdump.

    Argument Conventions
      Optional command line arguments are specified with a dash (-) and
      typically a single character argument identifier. Some required
      arguments also use the dash convention. Unless specifically indicated
      otherwise, dash arguments may be specified in any order. Where values
      are used with dash arguments, they must be directly appended to the
      argument without white space.

      Optional arguments precede required arguments. Non-dash required
      arguments must usually be specified in the order indicated by the
      usage statement.
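
      The conventions above can be illustrated with a small Python
      sketch. It is not the parser used by the DtSearch tools. The
      function name and the -t and -v identifiers are hypothetical, and
      the sketch assumes every identifier is exactly one character.

```python
def split_arguments(argv):
    """Split argv into (dash arguments, required positional arguments).

    Dash arguments come first. Any value is appended directly to the
    one-character identifier, with no white space in between.
    """
    options = {}
    index = 0
    while index < len(argv) and argv[index].startswith("-") and len(argv[index]) > 1:
        identifier = argv[index][1]   # single-character identifier
        value = argv[index][2:]       # empty string when no value is attached
        options[identifier] = value
        index += 1
    # Everything after the last dash argument is positional, in order.
    return options, argv[index:]
```

      With this sketch, ["-t5", "-v", "mydb"] yields the options
      {"t": "5", "v": ""} and the positional list ["mydb"]. Writing
      "-t 5" instead leaves -t without a value and turns "5" into the
      first positional argument, which is why the value must be
      appended without white space.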

 LANGUAGE PARSING AND STEMMING
    Parsing and Stemming
      Word parsing is fundamental to DtSearch operations at both index time
      and query time. Linguistic parsing algorithms filter incoming text
      strings into sequences of word tokens for each natural language.
      Depending on the language, word tokens may also be processed into stem
      tokens. At index time each linguistic token, or term, in a document is
      stored in the inverted index. At search time queries are parsed for
      linguistic terms and used to access the documents that contain them.

      Each database is assigned its own DtSearch language identified by a
      language number at database creation time. A language number
      determines the parsing and stemming algorithms to be applied to the
      database's text and queries. Internal DtSearch algorithms are supplied
      for supported languages including several European languages and
      Japanese. In addition, a user exit mechanism permits developers to
      provide their own custom language algorithms for a database.

    Language Files
      Language algorithms often use various word lists. Typically, these
      lists are stored in language files for easy maintenance, with the type
      of list identified by the file name extension. Language files are
      opened and read into internal tables when the offline programs
      initialize or when the DtSearchInit online API function is called.
      Some language files are required and initialization will return fatal
      errors if they are missing. Some language files are optional and
      associated algorithms will be silently bypassed if they are missing.
      Files for supported languages may be edited to provide
      database-specific enhancements. At open time, database-specific
      files supersede generic language files.

    General European Parsing Rules
      The currently supported European languages are:

         Number  Language and character set
         0       English, ASCII character set
         1       English, ISO Latin-1 character set
         2       Spanish, ISO Latin-1 character set
         3       French, ISO Latin-1 character set
         4       Italian, ISO Latin-1 character set
         5       German, ISO Latin-1 character set

      If not otherwise specified, dtsrcreate will initialize databases with
      language number 0. Note that all supported European languages use a
      single-byte encoding method, with the ASCII code set as a proper
      subset.

      Parsed text, including both queries and indexed text in documents, is
      case insensitive in supported European languages.

      In supported European languages parsing is accomplished with the
      Teskey algorithm, which partitions a character set into characters
      that are always parts of words (concordable), characters that are
      never parts of words (nonconcordable), and characters that may be
      parts of words depending on context (optionally concordable).
      Typically, alphanumeric characters are concordable. Whitespace and
      most punctuation are nonconcordable. Slashes are examples of characters
      that may or may not separate words depending on context. The essence
      of the parsing algorithm is "optionally concordable characters
      preceding concordable characters are concordable; otherwise, they are
      nonconcordable". For example, UNIX directory names of the form
      /usr/local/bin would be considered just one word, but slashes in
      isolation would be discarded as nonconcordable.
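
      The rule quoted above can be sketched in Python as follows. This
      is an illustration of the three-way classification, not the
      DtSearch implementation. The character classes are assumptions
      (ASCII letters and digits concordable, slash as the only
      optionally concordable character), as is the right-to-left pass
      used to resolve runs of optional characters.

```python
CONCORDABLE, OPTIONAL, NONCONCORDABLE = range(3)


def classify(ch):
    """Assumed table: ASCII alphanumerics are concordable, '/' is optional."""
    if ch.isascii() and ch.isalnum():
        return CONCORDABLE
    if ch == "/":
        return OPTIONAL
    return NONCONCORDABLE


def teskey_words(text):
    """Split text into lowercase word tokens using the three-way rule."""
    in_word = [False] * len(text)
    next_in_word = False
    # Scan right to left so each optional character can see whether the
    # character that follows it ended up inside a word.
    for i in range(len(text) - 1, -1, -1):
        kind = classify(text[i])
        if kind == CONCORDABLE:
            in_word[i] = True
        elif kind == OPTIONAL:
            in_word[i] = next_in_word
        next_in_word = in_word[i]

    words, current = [], []
    for ch, keep in zip(text, in_word):
        if keep:
            current.append(ch.lower())  # parsing is case insensitive
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

      Under these assumptions, the input "cd /usr/local/bin / now"
      produces the words "cd", "/usr/local/bin", and "now": the path
      stays together as one word and the isolated slash is discarded.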

      The parsing algorithm does a table lookup to determine the
      concordability of characters. The tables are arrays of the characters
      for each code page supported by the algorithm. Currently 7-bit ASCII
      and ISO Latin-1 are supported.

    Words Not Indexed
      Several additional parsing rules are applied to prevent indexing
      meaningless terms. These terms include common prepositions, indefinite
      articles, and nonlinguistic text strings such as formatting tags,
      sequences of hexadecimal dump characters, list identifiers, etc.

      Tokens whose lengths are less than a minimum word size or greater than
      a maximum word size are discarded. The default minimum and maximum
      word sizes can be overridden in dtsrcreate.

      Similarly, words found in the "stop list" file for the database are
      discarded. Stop lists are external, editable language files. Each
      supported European language is provided with a default stop list.

      Words found in an "include list" file are forcibly indexed even if
      they would otherwise be discarded. Include list database files are
      optional; no defaults are provided.
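
      The size, stop list, and include list rules combine roughly as in
      the Python sketch below. The function name is hypothetical and the
      size limits are plain parameters, since the default values are not
      given here. The sketch assumes the include list overrides both the
      size limits and the stop list, and that both lists hold lowercase
      words.

```python
def is_indexed(word, min_size, max_size,
               stop_list=frozenset(), include_list=frozenset()):
    """Decide whether a parsed word would be stored in the index."""
    word = word.lower()
    # Include list words are forcibly indexed, whatever the other rules say.
    if word in include_list:
        return True
    # Tokens shorter than the minimum or longer than the maximum are dropped.
    if not min_size <= len(word) <= max_size:
        return False
    # Stop list words are dropped.
    return word not in stop_list
```

      With illustrative limits of 3 and 20, the word "the" is discarded
      when it is on the stop list, and a one-letter word such as "c" is
      indexed only if it appears on the include list.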

    Stemming
      When specified for a language, individual parsed words will be
      "conflated" or mapped into their "stem" form, a new word that
      represents the etymological root of the original word. A default null
      stemming algorithm is used for languages that are not otherwise
      provided with a supported stemmer. The null stemmer returns the
      original word as its own stem. Both words and stems are stored in the
      inverted index. API searches can be specified for either words or
      stems, but the two search methods are distinguished only when real
      stems have been stored in the inverted index.

      In the supported European languages stemming can be accomplished
      heuristically or by dictionary lookup. The heuristic algorithms
      typically remove affixes in a language-dependent way. Affix lists are
      usually stored in language files. Currently stemming is supported for
      English languages 0 and 1, Spanish language 2, French language 3,
      Italian language 4, and German language 5.
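
      As a rough illustration of the two stemming styles described
      above, the Python sketch below shows a null stemmer next to a toy
      suffix stripper. The suffix list and the minimum stem length are
      invented for the example. The supported stemmers are more
      elaborate and take their affix lists from language files.

```python
def null_stem(word):
    """Null stemmer: every word is its own stem."""
    return word


def suffix_stem(word, suffixes, min_stem=3):
    """Toy heuristic stemmer: strip the longest matching suffix.

    min_stem is a hypothetical guard that keeps very short words intact.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:len(word) - len(suffix)]
    return word


def index_terms(word, stemmer=null_stem):
    """Return the (word, stem) pair that would be stored for a parsed word."""
    return word, stemmer(word)
```

      With the invented suffix list ["ing", "es", "s"], both "searching"
      and "searches" map to the stem "search", so a stem search for one
      would also match documents containing the other. With the null
      stemmer the stored stem equals the word, and word and stem
      searches behave identically.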

    Japanese
      Two Japanese DtSearch languages (numbers 6 and 7) are supported. Both
      use the same packed, Extended UNIX Code (EUC) character set. The two
      languages differ only in the technique used to parse compound kanji
      words. All validly encoded text for supported Japanese languages
      incorporates ASCII encoding as a proper, single-byte subset. The
      supported Japanese languages use the null stemmer.

    Kanji Compounds
      Individual kanji characters are parsed as single words. In addition,
      for language number 6 all possible kanji substrings (pairs, triplets,
      etc.) found in any contiguous string of kanjis will be parsed as
      compound kanji words, up to a maximum word size of 6 kanji characters.
      For language number 7, only kanji substrings listed in the jpn.knj
      language file may be treated as compound kanji words. At offline index
      time all possible individual kanjis and kanji compounds for a language
      are stored in the inverted index. At online search time kanji
      substrings in the query are treated as single query terms and are not
      compounded further.
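
      The difference between the two Japanese languages at index time
      can be sketched in Python. The function below is illustrative
      only. It takes one contiguous run of kanji and returns the terms
      that would be indexed. It assumes the 6-character limit also
      applies to language 7, and the known_compounds set stands in for
      the contents of jpn.knj.

```python
MAX_COMPOUND = 6  # maximum compound size in kanji characters


def kanji_index_terms(run, language, known_compounds=frozenset()):
    """Return index terms for one contiguous run of kanji characters."""
    terms = list(run)  # every individual kanji is parsed as a word
    for length in range(2, min(len(run), MAX_COMPOUND) + 1):
        for start in range(len(run) - length + 1):
            candidate = run[start:start + length]
            if language == 6:
                # Language 6: every contiguous substring is a compound.
                terms.append(candidate)
            elif language == 7 and candidate in known_compounds:
                # Language 7: only listed compounds are accepted.
                terms.append(candidate)
    return terms
```

      For a three-kanji run, language 6 yields the three single kanji,
      both pairs, and the triplet. Language 7 yields the three single
      kanji plus only those substrings present in the compound list.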

    Japanese Code Sets
      The supported packed EUC character set consists of four separate
      multibyte Code Sets. Code Set 0 can be either 7-bit ASCII or 7-bit
      JIS-Roman. The first and only byte of a character in Code Set 0 is
      less than 0x80. Substrings of Code Set 0 in supported Japanese text
      are parsed into individual words with the European language parser
      described above. Minimum and maximum word sizes, stop lists, and
      include lists will be used as in European languages if provided with a
      Japanese database.

      Code Set 1 is JIS X 0208-1990. The two-byte characters in Code Set 1
      always begin with a byte greater than 0xA0 and less than 0xFF. Symbols
      and line drawing elements are not indexed. Hiragana strings are
      discarded as equivalent to stop list words. Contiguous substrings of
      katakana, Roman, Greek, or Cyrillic are parsed as single words.
      Individual kanji characters are treated as single words with
      additional kanji compounding depending on language number, as
      described above.  Characters from unassigned kuten rows are treated as
      user-defined kanji.

      Code Set 2 is halfwidth katakana. The two-byte characters in Code Set
      2 always begin with the unique byte 0x8E. Contiguous strings are
      parsed as single words.

      Code Set 3 is JIS X 0212-1990. The three-byte characters in Code Set 3
      always begin with the unique byte 0x8F. Parsing is similar to Code Set
      1: symbols and similar characters are discarded, contiguous strings of
      related foreign characters are parsed as single words, and individual
      kanji and unassigned characters are treated as single words, with
      additional kanji compounding depending on language number. Kuten row 5
      is treated as katakana; undefined rows are treated as kanji.
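
      The lead-byte rules for the four Code Sets can be sketched as a
      small Python splitter. The function name is hypothetical. The
      sketch uses only the first byte of each character, as described
      above, and does not validate the trailing bytes.

```python
def split_euc(data):
    """Split packed EUC bytes into (code_set, character_bytes) pairs."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, width = 0, 1      # ASCII or JIS-Roman
        elif lead == 0x8E:
            code_set, width = 2, 2      # halfwidth katakana
        elif lead == 0x8F:
            code_set, width = 3, 3      # JIS X 0212-1990
        elif 0xA0 < lead < 0xFF:
            code_set, width = 1, 2      # JIS X 0208-1990
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        if i + width > len(data):
            raise ValueError(f"truncated character at offset {i}")
        chars.append((code_set, data[i:i + width]))
        i += width
    return chars
```

      For example, the bytes 0x41, 0xB4 0xC1, 0x8E 0xB1, and
      0x8F 0xB0 0xA1 are split into one character each from Code Sets
      0, 1, 2, and 3 respectively.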

    Custom Languages
      All language dependent data structures and functions are referenced by
      fields in the main internal DtSearch structure for databases (DBLK).
      The same structure is used for offline build programs as well as
      online API search functions. Language processing is initialized
      database by database by an internal language loader function which
      stores values in DBLK fields. A database whose language number is not
      supported is presumed to be associated with a custom language. A
      special function, load_custom_language, is called to initialize
      language fields for custom languages. The default load_custom_language
      merely returns an error code.  However, developers can link in their
      own load_custom_language function, which will be called to initialize
      the DBLK fields needed to parse and stem one or more custom languages.
      Values required for the language fields of a DBLK are specified in
      DtSrAPI(3).
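
      The control flow of the user exit can be pictured with the Python
      analogy below. It is only a model. In DtSearch the replacement
      happens by linking a C function named load_custom_language, and
      the DBLK fields are those specified in DtSrAPI(3). The dictionary
      keys and the language number 42 used here are invented for
      illustration.

```python
SUPPORTED_LANGUAGES = range(0, 8)  # 0 to 5 European, 6 and 7 Japanese


def default_load_custom_language(dblk):
    """Default hook: no custom language available, so report failure."""
    return False


def load_language(dblk, load_custom_language=default_load_custom_language):
    """Initialize the language fields of one database block."""
    if dblk["language"] in SUPPORTED_LANGUAGES:
        # Supported languages get the internal algorithms.
        dblk["parser"] = "builtin"
        dblk["stemmer"] = "builtin"
        return True
    # Any other number is presumed to be a custom language.
    return load_custom_language(dblk)


def my_load_custom_language(dblk):
    """Example developer-supplied hook for a made-up language number."""
    if dblk["language"] != 42:
        return False
    dblk["parser"] = "custom"
    dblk["stemmer"] = "custom"
    return True
```

      Calling load_language on a database with language 42 fails with
      the default hook and succeeds once my_load_custom_language is
      supplied, which mirrors the effect of linking in a replacement
      function.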

 SEE ALSO
      dtsrcreate(1), dtsrdbrec(1), dtsrhan(1), dtsrindex(1), dtsrload(1),
      dtsrkdump(1), huffcode(1), DtSrAPI(3), dtsrfzkfiles(4),
      dtsrocffile(4), dtsrhanfile(4), dtsrlangfiles(4), dtsrdbfiles(4)


                                                Formatted:  January 24, 2005