 Information retrieval and text mining using distributed latent semantic indexing

Details
Inventors: Behrens, Clifford A.; Bassu, Devasis;
Assignee: Telcordia Technologies, Inc. (Piscataway, NJ)
Primary Examiner: Robinson; Greta
Assistant Examiner:
Attorney, Agent or Firm: Giordano; Joseph Schoneman; William A.

The use of latent semantic indexing (LSI) for information retrieval and text mining operations is adapted to work on large heterogeneous data sets by first partitioning the data set into a number of smaller partitions having similar concept domains. A similarity graph network is generated in order to expose links between concept domains, which are then exploited in determining which domains to query as well as in expanding the query vector. LSI is performed on those partitioned data sets most likely to contain information related to the user query or text mining operation. In this manner, LSI can be applied to data sets that heretofore presented scalability problems. Additionally, the computation of the singular value decomposition of the term-by-document matrix can be accomplished at various distributed computers, increasing the robustness of the retrieval and text mining system while decreasing search times.

DETAILED DESCRIPTION

Referring to FIG. 1, the inventive method of the document collection processing of the present invention is set forth.
At step 110 the method of the present invention generates a frequency count for each term in each document in the collection (or set) of documents.
The term "data objects" in this context refers to information such as documents, files, records etc.
Data objects may also be referred to herein as documents.
In an optional preprocessing step 100 the terms in each document are reduced to their canonical forms and a predetermined set of "stop" words is ignored.
Stop words are typically those words that are used as concept connectors but provide no actual content, such as "a", "are", "do", "for", etc.
The list of common stop words is well known in the art.
Suffix strippers that reduce a set of similar words to their canonical forms are also well known in the art.
Such a stripper or parser will reduce a set of words such as computed, computing and computer to a stem word "comput" thereby combining the frequency counts for such words and reducing the overall size of the set of terms.
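The optional preprocessing step 100 and the frequency count of step 110 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the stop-word list, the suffix list, and the function names `strip_suffix` and `term_frequencies` are hypothetical and chosen only for illustration, and a real system would likely use a complete stop-word list and a full stemming algorithm rather than this crude stripper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "are", "do", "for", "the", "and", "of", "to", "is", "in"}

# Hypothetical suffix list, checked in this order. A crude stand-in for a
# real suffix stripper.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(document: str) -> Counter:
    """Step 100 (drop stop words, stem) followed by step 110 (count terms)."""
    tokens = re.findall(r"[a-z]+", document.lower())
    stems = (strip_suffix(t) for t in tokens if t not in STOP_WORDS)
    return Counter(stems)


# One frequency table per document in the collection.
docs = [
    "Computing is what a computer does",
    "They computed the results for the report",
]
counts = [term_frequencies(d) for d in docs]
```

Here "computing" and "computer" in the first document both reduce to the stem "comput", so their counts are combined under a single term, as are the counts for "computed" in the second document.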
At step 120 the heterogeneous collection of data objects is partitioned by concept domain into sub-collections of like concept.
If it is known that one or more separate sub-collections within a larger collection of data are homogeneous in nature, the initial partitioning need not be done for those known homogeneous data collections.
For the initial sorting of data objects into more conceptually homogeneous sub-collections, the bisecting k-means algorithm is preferably used, applied recursively with k=2 at each stage until k clusters are obtained.
Clustering techniques have been explored in "A Comparison of Document Clustering Techniques" by M. Steinbach et al., Technical Report 00-034, Department of Computer Science and Engineering, University of Minnesota.
Although the bisecting k-means algorithm is preferred, the "standard" k-means algorithm or other types of spatial clustering algorithms may be utilized.
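The recursive splitting with k=2 described for step 120 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the patented method: the text does not say which cluster is split at each stage or how the two centers are seeded, so the sketch assumes that the cluster with the largest sum of squared errors is split and that seeding is deterministic. The toy two-dimensional points stand in for term-frequency vectors, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Vector]) -> List[float]:
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sse(points: List[Vector]) -> float:
    """Sum of squared distances from each point to the cluster centroid."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points: List[Vector], iterations: int = 20) -> Tuple[list, list]:
    """Split one cluster in two with ordinary k-means, k=2."""
    # Assumed deterministic seeding: the point farthest from the centroid,
    # then the point farthest from that first seed.
    center = mean(points)
    c0 = list(max(points, key=lambda p: sq_dist(p, center)))
    c1 = list(max(points, key=lambda p: sq_dist(p, c0)))
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for p in points:
            (left if sq_dist(p, c0) <= sq_dist(p, c1) else right).append(p)
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        new_c0, new_c1 = mean(left), mean(right)
        if new_c0 == c0 and new_c1 == c1:
            break  # assignments have stabilised
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points: List[Vector], k: int) -> List[list]:
    """Bisect repeatedly until k clusters exist or no cluster can be split."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        # Assumption: split the cluster with the largest SSE.
        target = max(splittable, key=sse)
        left, right = two_means(target)
        if not left or not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Toy data: three well-separated groups of two-dimensional points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (21, 0), (20, 1)]
clusters = bisecting_kmeans(points, 3)
```

With this toy data the first bisection separates the group near (20, 0) from the rest, and the second bisection splits the remaining six points into the groups near (0, 0) and (10, 10). Other choices of which cluster to split, such as the largest cluster, fit the same loop.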





Copyright (c)2006 Eipa-patents.org - All rights reserved