Text Classification
Text classification is the task of assigning an (electronic) document to one or more categories based on its contents. Document classification tasks fall into two kinds. In supervised document classification, some external mechanism (such as human feedback) provides the correct classification for a set of documents, and new documents are then classified automatically. In unsupervised document classification, the classification is done entirely without reference to external information.
Here we give a short example configuration for the first (supervised) setting. The example uses two predefined classes, "graphics" and "hardware". The texts for each class are stored in their own directory. The first operator, WVTool, loads the texts and transforms them into a word vector format that can be used for learning. The next operator, LibSVMLearner, learns the model, which the ModelWriter operator then writes to a file. Here is the setup:
<operator name="Root" class="Experiment">
    <!-- Load the texts (one directory per class) and convert them to word vectors -->
    <operator name="WVTool" class="WVTool">
        <list key="attributes">
        </list>
        <parameter key="default_content_language" value="english"/>
        <parameter key="inputfilter" value="TextInputFilter"/>
        <parameter key="min_chars" value="3"/>
        <list key="namespaces">
        </list>
        <parameter key="output_word_list" value="../data/training_words.list"/>
        <parameter key="stemmer" value="PorterStemmerWrapper"/>
        <list key="texts">
            <parameter key="graphics" value="../data/newsgroup/graphics"/>
            <parameter key="hardware" value="../data/newsgroup/hardware"/>
        </list>
    </operator>
    <!-- Learn a model with a linear kernel from the word vectors -->
    <operator name="LibSVMLearner" class="LibSVMLearner">
        <list key="class_weights">
        </list>
        <parameter key="kernel_type" value="linear"/>
    </operator>
    <!-- Write the learned model to a file -->
    <operator name="ModelWriter" class="ModelWriter">
        <parameter key="model_file" value="../data/training_model.mod"/>
    </operator>
</operator>
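To make the word vector step more concrete, the following is a minimal Python sketch of what such a transformation can look like. It is not the WVTool implementation. The function names `tokenize` and `build_word_vectors` and the toy documents are hypothetical. The sketch assumes that `min_chars` means the minimum token length, uses plain term counts as vector entries, and omits stemming entirely.

```python
import re
from collections import Counter


def tokenize(text, min_chars=3):
    """Lowercase the text and keep alphabetic tokens with at least min_chars letters.

    Assumption: this mirrors the min_chars parameter only loosely; no stemming is done.
    """
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_chars]


def build_word_vectors(labeled_texts, min_chars=3):
    """Turn (label, text) pairs into a sorted word list and term-count vectors."""
    counts = [(label, Counter(tokenize(text, min_chars)))
              for label, text in labeled_texts]
    # The word list fixes the meaning of each vector position.
    word_list = sorted({word for _, c in counts for word in c})
    vectors = [(label, [c[word] for word in word_list]) for label, c in counts]
    return word_list, vectors


# Hypothetical toy documents standing in for the two class directories.
examples = [
    ("graphics", "Rendering a 3D image needs a fast rendering pipeline."),
    ("hardware", "The new disk controller is a fast piece of hardware."),
]

word_list, vectors = build_word_vectors(examples)
print(word_list)
for label, vector in vectors:
    print(label, vector)
```

Each document becomes a vector with one entry per word in the shared word list, labeled with its class. A learner such as a linear SVM can then be trained on these labeled vectors. The same word list has to be reused when new texts are converted, so that the vector positions keep their meaning.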
