On this page I describe the TOPIXTRACT prototype in a little more detail.
The system is composed of four components: three graphical user interfaces and one relational database.
In the first module, the system administrator must configure the necessary options for the text to be processed and entered into the database.
The next image shows this module.
The fields shown allow the system administrator to set the required configuration options.
Some of these fields are described in the list below:
- “Insert Prefix Size number” – The number of leading characters of each word to be considered. (Some measures need this to work well on highly inflected languages.)
- “Insert Language Prefix” – This need arises because the same database may hold documents in several different languages, since the prototype works regardless of the language being processed.
- “Project Short Name” – There may also be several groups of documents relating to different areas, hence the ability to assign a project name to the set of documents being processed.
- There are also fields for the directories the system uses when it starts.
- One of them is “Files Folder Location” – the folder containing the source txt files (in UTF-8) that form the corpus being processed. (Note: the more documents there are, and the longer they are, the better the results will be.)
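To make these settings more concrete, here is a minimal sketch of how they could be represented. It is not the prototype's actual code: the class `ModuleConfig`, its field names, the helpers `truncate_to_prefix` and `list_corpus_files`, and the sample values are all hypothetical, chosen only to mirror the form fields.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ModuleConfig:
    """Hypothetical container mirroring the form fields of the first module."""
    prefix_size: int          # "Insert Prefix Size number"
    language_prefix: str      # "Insert Language Prefix" (illustrative value below)
    project_short_name: str   # "Project Short Name"
    files_folder: Path        # "Files Folder Location"


def truncate_to_prefix(word: str, prefix_size: int) -> str:
    """Keep only the first prefix_size characters of a word."""
    return word[:prefix_size]


def list_corpus_files(config: ModuleConfig) -> list[Path]:
    """Return the .txt files found in the configured folder, sorted by name."""
    return sorted(config.files_folder.glob("*.txt"))


config = ModuleConfig(prefix_size=5, language_prefix="pt",
                      project_short_name="demo", files_folder=Path("corpus"))

# Inflected forms of the same Portuguese verb collapse to one prefix.
forms = ["trabalhar", "trabalhou", "trabalhando"]
print([truncate_to_prefix(w, config.prefix_size) for w in forms])
```

With a prefix size of 5, the three verb forms all become `traba`, which is why a prefix setting can help with highly inflected languages.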
After these settings, the administrator has access to two buttons. One opens a console that displays a log of the operations being performed by the module.
The other, “Run”, starts the processing and the loading into the relational database.
The second module is the application where human evaluators (linguists) evaluate, in accordance with pre-established rules, the topics extracted using the various measures available. At this time the measures are based on Tf-Idf and Chi-Square.
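As an illustration of the first of these measures, the sketch below scores terms with the textbook Tf-Idf formula, tf * log(N / df). This is an assumption on my part: the exact variant used inside TOPIXTRACT is not specified on this page, and the function names `tf_idf_scores` and `top_terms` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents: list[list[str]]) -> list[dict[str, float]]:
    """Score every term of every document with tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores: dict[str, float], n: int) -> list[str]:
    """Return the n highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
scores = tf_idf_scores(docs)
print(top_terms(scores[2], 2))  # ['fell', 'market']
```

A term such as "the", which occurs in every document, gets a score of zero, while terms that are specific to one document rise to the top of its list.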
This second module has several features. One of them is that the evaluator has to log in, so that each assessment can easily be traced back to the person who made it.
The functionality of some fields is described in the following items:
- “Insert Evaluator Name” – Field where the evaluator inserts her/his identifying name.
- “Set Button” – Stores the evaluator’s identifying name in the database and activates all other fields.
- Components to deal with document information
- “Choose Language of Documents” – Where the evaluator filters the documents by their language.
- “Choose Document Project” – Because documents may be associated with specific projects, the project can be specified in this field. Otherwise all available documents in that language will appear.
- There are components in this module that are specifically tailored to deal with the terms of each document.
- For example, “Number of Terms to Get” – The evaluator chooses the number of terms to get for that particular document; the options are 25, 50 and 100.
The following images show the second module at a more advanced stage of use by an evaluator.
They show a document selected from the list, together with the list of the highest-scoring terms for that document according to the Tf-Idf measure.
Next, some more fields are described in more detail.
- The document content appears in the fields “Document treated Content” and “Document Original Content”.
- Evaluation buttons – These are used by the evaluator to classify each presented term into one of the 4 possible categories: “Good”, “Near Good”, “Bad” and “Unknown”.
- “Save Evaluation” – Allows the evaluator to save his/her evaluation to the database for later use in the third module, as shown in the next section.
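To give an idea of what saving an evaluation might involve, here is a rough sketch using Python's built-in `sqlite3` module as a stand-in for the relational database. The schema of the real database is not described on this page, so the table `evaluation`, its columns, and the function `save_evaluation` are hypothetical.

```python
import sqlite3

# The four categories offered by the evaluation buttons.
CATEGORIES = ("Good", "Near Good", "Bad", "Unknown")

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.execute("""
    CREATE TABLE evaluation (
        evaluator   TEXT    NOT NULL,
        document_id INTEGER NOT NULL,
        term        TEXT    NOT NULL,
        category    TEXT    NOT NULL,
        PRIMARY KEY (evaluator, document_id, term)
    )
""")


def save_evaluation(conn, evaluator, document_id, ratings):
    """Store one evaluator's term ratings for one document.

    ratings maps each term to one of CATEGORIES. Saving the same term
    again overwrites the earlier rating.
    """
    for category in ratings.values():
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO evaluation VALUES (?, ?, ?, ?)",
            [(evaluator, document_id, term, category)
             for term, category in ratings.items()],
        )


save_evaluation(conn, "ana", 1, {"market": "Good", "the": "Bad"})
rows = conn.execute(
    "SELECT term, category FROM evaluation ORDER BY term").fetchall()
print(rows)  # [('market', 'Good'), ('the', 'Bad')]
```

Keying each row on evaluator, document and term is one way to keep every evaluator's judgements separate, which the comparisons made in the third module would rely on.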
The third module of the system is the application that provides access to the precision and coverage results computed from the evaluators’ assessments.
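This page does not define how these figures are computed, so the following is only a sketch under an assumption of mine: precision is taken as the share of evaluated terms whose category counts as acceptable. Which categories count (here “Good” and “Near Good” by default) and the function name `precision` are hypothetical. Coverage is left out because its definition is not given here.

```python
from collections import Counter


def precision(ratings: list[str], accepted=("Good", "Near Good")) -> float:
    """Fraction of evaluated terms whose category is in accepted."""
    if not ratings:
        return 0.0
    counts = Counter(ratings)
    return sum(counts[category] for category in accepted) / len(ratings)


# One evaluator's categories for five extracted terms.
ratings = ["Good", "Good", "Near Good", "Bad", "Unknown"]
print(precision(ratings))                      # 0.6
print(precision(ratings, accepted=("Good",)))  # 0.4
```

Making the accepted categories a parameter lets the same ratings be read strictly (only “Good”) or leniently (“Good” and “Near Good”).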
It also allows access to plots of the correlation between different evaluators for the same document. Other features exist, but describing them is beyond the scope of this page.
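As a rough illustration of comparing two evaluators, the sketch below computes a Pearson correlation over the terms both of them rated. Everything specific here is an assumption: the page does not say which correlation coefficient the module uses, and the numeric mapping `SCORE`, the decision to skip “Unknown” ratings, and the function names are hypothetical.

```python
import math

# Hypothetical numeric mapping; terms rated "Unknown" are skipped.
SCORE = {"Good": 3, "Near Good": 2, "Bad": 1}


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two shared ratings are needed")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined when one evaluator's ratings are constant")
    return cov / math.sqrt(var_x * var_y)


def evaluator_correlation(a: dict[str, str], b: dict[str, str]) -> float:
    """Correlate two evaluators over the terms both rated with a scored category."""
    shared = sorted(t for t in a.keys() & b.keys()
                    if a[t] in SCORE and b[t] in SCORE)
    return pearson([SCORE[a[t]] for t in shared],
                   [SCORE[b[t]] for t in shared])


eval_a = {"market": "Good", "stock": "Good", "fell": "Bad", "the": "Bad"}
eval_b = {"market": "Good", "stock": "Near Good", "fell": "Bad", "the": "Unknown"}
print(round(evaluator_correlation(eval_a, eval_b), 3))  # 0.866
```

A value close to 1 means the two evaluators ranked the terms of the document in much the same way.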
The following image shows an overview of the third module in its initial state.
For any additional information, please don’t hesitate to contact me.