DETAILED DESCRIPTION

One characteristic of the LSI technique is that it does not take into account term order. Each document is considered as a collection of unordered terms. It is known, however, that phrases, e.g., small groups of ordered terms, constitute an important element of semantic content. In preferred embodiments of this aspect of the present invention, the scope of processing is broadened to take into account the semantic contribution of phrases, also referred to herein as n-tuples. Most phrases of interest consist of only a few consecutive terms, typically two to four.

One method of identifying n-tuples is to consider n contiguous words in a document as an n-tuple. For example, consider the sentence "United States policies towards Cuba are changing." Automatically identifying n-tuples for n=2 from left to right would result in: "united*states", "states*policies", "policies*towards", "towards*cuba", "cuba*are", "are*changing". For n=3, the result would be: "united*states*policies", "states*policies*towards", "policies*towards*cuba", "towards*cuba*are", "cuba*are*changing". In most applications, it will not be necessary to continue beyond triplets or quadruplets of words. In some embodiments, a list of phrases maintained external to the document space is used to identify phrases.

Once phrases have been identified, preferred embodiments of the invention may proceed in at least two ways. In a first way, a single LSI space combining single terms and n-tuples is formed. In another way, separate LSI spaces are formed, each space containing one set of n-tuples, e.g., an LSI space containing triples, another containing quadruples.

In some embodiments of the invention, a subset of identified n-tuples is indexed into the LSI space along with single terms. For example, consider the sentence "United States policies towards Cuba are changing." If only "united*states" was identified as a phrase from that sentence, then there would be one occurrence each of: "united", "states", "united*states", "policies", "towards", "cuba", "are", and "changing".
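By way of illustration only, the contiguous-word method of identifying n-tuples and the indexing of a phrase subset along with single terms may be sketched as follows. The function names `tokenize`, `contiguous_ntuples`, and `index_terms` are hypothetical and do not appear in the description above. The tokenization rule (lowercasing and keeping alphanumeric runs) is an assumption, as is the use of a simple set to stand in for an externally maintained phrase list. The "*" separator follows the notation used in the examples above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens in order.

    Assumption: the description does not specify a tokenization rule.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def contiguous_ntuples(tokens, n):
    """Return every run of n consecutive tokens, joined with '*', left to right."""
    return ["*".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_terms(tokens, phrase_list, max_n=4):
    """Count single terms plus only those n-tuples that appear in phrase_list.

    phrase_list is a hypothetical stand-in for a list of phrases
    maintained external to the document space.
    """
    counts = Counter(tokens)
    for n in range(2, max_n + 1):
        for ntuple in contiguous_ntuples(tokens, n):
            if ntuple in phrase_list:
                counts[ntuple] += 1
    return counts


sentence = "United States policies towards Cuba are changing."
tokens = tokenize(sentence)

pairs = contiguous_ntuples(tokens, 2)    # n-tuples for n=2
triples = contiguous_ntuples(tokens, 3)  # n-tuples for n=3

# Only "united*states" is treated as an identified phrase here.
counts = index_terms(tokens, {"united*states"})
```

In this sketch, `pairs` and `triples` hold the n=2 and n=3 n-tuples for the example sentence, and `counts` holds one occurrence each of the seven single terms and of "united*states". The resulting counts could serve as one document's column in a combined term-and-phrase matrix; building separate spaces per n-tuple length would instead keep the counts for each n apart.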