TOPIXTRACT is a language-independent keyterm extractor for documents, which I developed as part of my MSc thesis.
To this end, it represents documents using words, multi-words, or word prefixes (with a fixed length of 4 or 5 characters) as features. It then applies 24 measures to assess how important each feature is for discriminating each document.
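The three feature types can be illustrated with a minimal sketch. It is not the TOPIXTRACT implementation: the function names are hypothetical, whitespace tokenization is a simplification, multi-words are approximated by adjacent word pairs, and Tf-idf stands in for the 24 measures.

```python
import math
from collections import Counter


def features(text, kind="word", prefix_len=4):
    """Turn a text into a list of features (hypothetical helper).

    kind: "word", "multiword" (adjacent word pairs, an assumption made
    here for illustration), or "prefix" (first prefix_len characters).
    """
    tokens = text.lower().split()
    if kind == "word":
        return tokens
    if kind == "multiword":
        return [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    if kind == "prefix":
        return [token[:prefix_len] for token in tokens]
    raise ValueError(f"unknown feature kind: {kind}")


def tf_idf(docs, kind="word", prefix_len=4):
    """Score every feature of every document with a plain Tf-idf."""
    bags = [Counter(features(doc, kind, prefix_len)) for doc in docs]
    n_docs = len(bags)
    # Document frequency: number of documents containing each feature.
    doc_freq = Counter(feature for bag in bags for feature in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            feature: (count / total) * math.log(n_docs / doc_freq[feature])
            for feature, count in bag.items()
        })
    return scores
```

A feature that occurs in every document gets a score of zero, so only features that help tell documents apart are ranked as candidate keyterms.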
The results obtained can be assessed by independent evaluators, whose agreement is measured using Kappa statistics. Tf-idf and Chi-square based metrics have shown higher precision. Word prefixes were used to deal with highly inflected languages, and topic prefixes were used only as an aid for promoting words and multi-words as possible document topics.
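As an illustration of how evaluator agreement can be quantified, here is a minimal sketch of Cohen's kappa for two evaluators. The choice of this particular Kappa variant and the function name are assumptions; the text above does not say which variant was used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two evaluators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both evaluators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each evaluator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both evaluators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, with 1 meaning "good keyterm" and 0 meaning "bad keyterm", two evaluators who agree on every item get a kappa of 1, while agreement no better than chance gives a kappa of 0.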
More information can be found in the paper: Luís Teixeira, Gabriel Lopes, and Rita A. Ribeiro, “Automatic Extraction of Document Topics,” in DoCEIS’11 – 2nd Edition of the Doctoral Conference on Computing, Electrical and Industrial Systems, Costa da Caparica, Portugal, 2011, pp. 101–108.
The overwhelming amount of information in corporations can make searching and browsing for a specific topic or piece of information a very hard task. It is therefore of paramount importance to develop tools that ease the retrieval of specific information and support users' exploration of corporate intranets (composed of several hundreds of gigabytes of documents). Although the relations are not explicitly identified, many of these documents are related to one another (directly or implicitly).
This project aims not only to enable the visual representation of documents found to be related to one another, but also to explore and mine those relations.
Our major goal is the representation, intuitive navigation, and selection of these concepts. When certain relations between these concepts are particularly relevant, they may lead to a natural flow of information and, consequently, to navigation between them.
The prototype page presents the studies carried out to develop a navigation support system for exploring graphs of document correlations, using concepts from the field of weighted complex networks together with the documents' unstructured textual content.
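To make the idea of a weighted graph of document correlations concrete, here is a minimal sketch. It is not the prototype's method: the function names, the cosine similarity over word counts, and the threshold value are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(docs, threshold=0.1):
    """Build an undirected weighted graph from {name: text}.

    Two documents are linked when their similarity reaches the
    threshold (an assumed cut-off); the similarity is the edge weight.
    """
    vectors = {name: Counter(text.lower().split()) for name, text in docs.items()}
    names = sorted(vectors)
    graph = {name: {} for name in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            weight = cosine(vectors[u], vectors[v])
            if weight >= threshold:
                graph[u][v] = weight
                graph[v][u] = weight
    return graph


def strength(graph, node):
    """Node strength: the sum of the weights of the node's edges."""
    return sum(graph[node].values())
```

Node strength is a basic quantity from weighted complex networks. In a navigation aid it could, for instance, be used to highlight the documents that are most strongly tied to the rest of the collection.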
For any additional information, please don’t hesitate to contact me.