Text Classification and Visualization
By
Shachar
Beiser[2]
September
1999
This
is a report of a work done as ‘Laboratory in Machine Learning’ at the
Technion’s CS faculty under the supervision of Dr. Ran El-Yaniv[3].
Overview 3
Theory and Experiments: This part describes the learning algorithm used and the theoretical background. This part also describes experiments done prior to building the java interface.
This part contains the following sections:
Self-Organization Maps: The SOM algorithm is presented and the context in which it is used.
Visualization: Several methods of visualizing the documents are described. These methods will be used in the graphical viewer, which demonstrate some of the properties of SOMs.
Classification: Describes ways to classify new documents after learning the training set.
Experiment 1: Describes an experiment performed on a dataset, which consists of courses syllabi from Technion’s catalog.
Experiment 2: Describes an experiment performed on the 20 newsgroups dataset.
Application: This part describes the user interface implemented for visualizing course syllabi from five faculties in the Technion.
This part includes descriptions of the user interface and the scenarios for operating it. The applet is available at the web at the site http://www.cs.technion.ac.il/~bumbleB.
The vast increase in Internet and other textual archives calls for an efficient way to automatically categorize documents and visualize them.
Two problems that we try to solve are:
1) Classification: Given a labeled training set and a testing set, classify a document from the testing set according to the information found in the training set.
2) Visualization: Given a training set (not necessary labeled), represent information relating to the shape and position of the clusters (which represent the different classes) in a human readable form. Note that this information is more general than clustering (i.e. hard partitioning of the training set to equivalence classes).
The benefits of classification are obvious, but why do we need visualization if we have powerful classification techniques.
Imagine the following scenario: A buyer wishes to buy a book on radial basis neural networks used for agricultural robot control. Assume that such a book exists and the buyer has some knowledge of the area.
The buyer searches at the computerized catalog of a bookstore. Because the required subject is very specific, no such category exists in the bookstore. However, books in the store are categorized according to more general subjects such as neural networks, robotics, radial…
Since the labels in the catalog are not the required subject, classification will not help us (the training set is not labeled appropriately).
A way to solve this problem is send queries with the related more general fields: neural networks, robotics, control, agriculture …. This result in a mass of books – each book is given a score of the book’s similarity to each query.
A simple solution is to choose the book whose average score is the highest, but sometimes more complex methods are needed.
When deciding which book to buy domain knowledge should be inserted to the algorithm. A problem with such methods is that while our buyer knows the domain, it is difficult and time consuming to insert the buyer’s intuitive knowledge (which sometimes can’t be described explicitly by the buyer) to the algorithm.
Kohonen’s SOM can be used to visualize the documents in a human comfortable way (instead of as vectors in high dimensional vector space).
The documents (books returned by the queries) are projected to a two dimensional map. The buyer can then visualize the domain and make his pick.
For example, the buyer looks at the SOM and notices 10 books whose titles look appropriate for him. He then sees that the 10 books form 2 clusters, distant from one another. The buyer decides to buy two books that are the closest to the centroids of the two clusters.
Such a decision is done after viewing the results (for example if there were only one cluster the buyer would buy one book only).
This kind of visualization is useful when an interaction with a human expert, whose knowledge is hard to transfer to a machine in a short time, is needed.
Visualization is also useful when the human doesn’t know exactly what he wants (and therefore he can’t instruct the machine). For example, a buyer in the supermarket without a list, buying items that he did not intend to buy before entering the supermarket. The human expert needs to visualize the documents before making the decision, as the buyer in the supermarket needs to see the items on the shelves arranged with a certain order.
Classification techniques are described in the classifications section. Visualization by SOM is described in the self-organizing maps and visualization sections.
The first stage in performing classification and visualization is to represent the documents as vectors in a high dimensional space. Later the vectors can be transformed to a lower dimensional space (by a singular transformation).
Lowering the dimensionality of the documents facilitates classification in terms of speed and quality (a result known as the curse of dimensionality – Bellman 1961, summarized in Haykin 1999 page 211-212).
When performing visualization the dimensionality is lowered to 2, since this is the optimal dimension for human visualization. Each document is represented as a point in the 2D plane (on the hexagon grid) for easy viewing.
We use a representation known as bag of words: A document is converted to a vector by retaining information regarding the appearances of terms only (information about the order of the terms in the document is lost).
We use coding known as TF(term frequency): the document is
converted to a vector in which each component fits a term. The value of the
i’th component in the vector is the frequency of the appearance of terms i in
the document:
.
Note that the dimension of the vector is as large as the size of the vocabulary.
Because the input space dimension equals to the size of the vocabulary (under the bag of words assumption), we can lower the input dimensionality by reducing the vocabulary size.
2 ways of reducing the vocabulary’s size are tested (in test 2 section):
1)
Information gain (IG) measures the number of bits of
information obtained for category prediction by knowing the presence or absence
of a term in a document. The information gain of term t is defined as: ![]()
where ci is class or label i.
If a space of N dimensions is required, we disregard any terms other than the N terms with the highest information gain.
2) Distributional Clustering (DC): This method partitions the vocabulary to N equivalence classes (clusters). A new vocabulary (with N terms) is later built in which each cluster is a term.
The algorithm used is:
-
Sort
the vocabulary by information gain
-
Initialize
the N clusters as singletons with the top N words.
-
Loop
until all words have been put into one of the N clusters:
o
Merge
the two clusters which are the most similar (Equation *) , resulting in N-1 clusters
o
Create
a new cluster consisting of the next word from the sorted list, restoring the
number of clusters to N
Thus starting from clusters that contain single terms, by merging iteratively, larger clusters are built until each word is set to some cluster. The measure of similarity between clusters is based on “KL divergence to the mean” – a measure from information theory that represents the dissimilarity between two probability distribution (It actually measures the expected loss in effective compression when compressing one distribution and assume it is the other distribution).
The KL divergence to the mean between the two
clusters
and ![]()
is defined as:
where C is the class random variable and

More on KL-divergence can be found in Cover and Thomas 1991.
An extensive description of distributional clustering can be found in Pereira et al 1993, and in Baker and McCallum 1998.
A comparison between the results of dimensionality reduction with information gain and distributional clustering can be found in the Experiment 2 section.
Self
Organizing Maps
Self-Organizing Maps is a neural net based on competitive learning. The neurons are spread on a (usually 2D) grid, connected in the form of a square grid (each neuron is connected to 4 close neurons, one from each direction) or a hexagon grid (each neuron is connected to 6 close neurons).
When
a vector is presented to the net (all the neurons simultaneously), the neurons
compete among themselves and the winner “fires away”. The learning algorithm
modifies the parameters of the neurons near (in the grid coordinates) the
winning neuron, while leaving intact the parameters of all the rest.
In this architecture, the spatial location of the firing neuron becomes more important than the actual value of the firing neuron. In fact, the spatial location on the grid of the winning neuron forms a nonlinear transformation to two-dimensional space (the grid coordinates).
The basic learning algorithm is as follows:
1) Initialization: Choose small random values for
the initial weight vectors of the neurons mi(t=0) i=1…k. The
only restriction is that different values should be chosen to different
neurons.
2) Sampling. Draw a
sample x from the input space.
Similarity Matching: Find the
best matching(winning) neuron c(x) by using the inner product measure:
c(x)=argmini{<x, mi>}.
3) Updating: Adjust the synaptic weight mi
of the neurons in the neighborhood of the winner (defined by a neighborhood
function N(c(x),t)) by using the update formula:

Where h is a monotonically decreasing learning rate function.
5) Continuation. Continue with step 2 until no noticeable change in the neurons’ weights.
Note that the update rule corresponds to moving the neurons’ weights vectors toward the input vector x.
The neurons are spread on a sheet (in 2-D case), in which each neuron has a pair of x and y indices indicating its place in the sheet.
After the learning phase, when an input vector x is presented to the net and the winner neuron is chosen for it, a nonlinear transformation can be defined from the input vector x which is in Rn to the winning neuron indices (in the grid) which are in R2.
This dimensionality reduction is inherently nonlinear and may be viewed as a nonlinear generalization of principal component analysis (It can be shown that the transformation minimizes the reconstruction error – known as vector quantization error).
Some examples of maps produced are:
The neurons are plotted in the input space according to their synaptic weights, and lines connect neurons which are neighbors. Simple cases of two-dimensional lattices will be shown, as they are easy to understand.
· Two-dimensional lattice approximating uniform distribution in two-dimensional input space. The upper image is the input space and the lower image shows the neurons in the space of their synaptic weights (Both input space and the grid are two dimensional):

·

Two-dimensional lattice approximating clusters in
two-dimensional input space.
This shows how clustering can be done by self-organizing maps. For each cluster, a neuron is constructed. After the learning phase, each neuron is assigned to the cluster of the best matching neuron.
Extensive treatment of SOM and improved learning algorithms are found in Kohonen 1997.
The maps shown here are based on hexagon grid, and were generated by SOM_pack and Matlab SOM toolbox (see technical appendix).

The unified matrix (or U matrix) shows the general cluster form of the
documents. Each map area represents an area of the input space. Areas in the
map are colored according to the variability in the synaptic weights
![]()
Figure: The U matrix
of the courses of the CS and Civil Eng. Each hexagon shows a change between two
adjacent neurons. The name of the dominant faculty is shown for some areas at
the map
between two adjacent neurons (each small hexagon represents a border between two neurons and it is colored according to the variability across this border ). Areas which show high variability in synaptic weights are colored with warmer colors (high areas), and areas which little change are colored with colder colors (low areas). Thus, the clusters can be viewed as 1ow plateaus (blue color), which are surrounded by mountain ridges (red color).
In the component map, each hexagon represents a neuron. The color of the hexagon shows the value of the neuron’s synaptic weight in a component of the original vector space. In our case, we can use component maps to see how the areas in the map relate to a certain word in the vocabulary.

Figure: Component maps for 8 words with highest infogain in the CS and
CIVIL ENG. Courses. Note that the words are stemmed. Obviously, documents from
CS are located near the upper right corner and documents from CIVIL ENG are
located in the lower left corner. These maps are corresponding to the U matrix
above.
The component map shows the affinity of a neuron to the component, which in our case is a term from the vocabulary.
The density of documents in map areas can be visualized by changing each hexagon’s size to the number of documents the unit.

Figure: Component map for the stemmed term language, in which the size of each hexagon corresponds to the number of documents in that area (the number of documents for which the neuron is the best matching one)
We will now, demonstrate the use of the maps as a visualization tool for data taken from undergraduates courses syllabi taken from the faculties of Computer Science, Physics and Civil Engineering (see experiment 1 section for more details).
Consider a Physics student who wants to take some courses in Computer Science, which deals with logic. Looking at the logic component map (see image) shows three distinct areas in the map, which show high values in the component corresponding to logic.

The student then looks at the courses found in each ‘hot’ area and finds:
Left area contains:
- 236304 LOGIC FOR COMPUTER SCIENCE 2
-236331
COMPUTABILITY AND DEFINABILITY
- 234292 LOGIC FOR COMPUTER SCIENCE 1
236760- COMPUTATIONAL LEARNING THEORY
Right area contains:
236305 - LABORATORY IN LOGICAL DESIGN
234305 - LABORATORY IN LOGICAL DESIGN U
Bottom area contains:
236345 - AUTOMATIC VERIFICATION OF PROGRAMS
236512 - PROJECTS IN ADVANCED PROGRAMMING B
236519 - FOUNDATIONS OF LOGIC PROGRAMMING
Indeed the three areas represent three different topics: “mathematical logic”, “logical design” and “logic programming and verification”. The student can then choose one course from each area to get a broad education in logic. Alternatively, the student may notice that the left cluster is the closest to an area dominated by Physics and because he has background in Physics, he may prefer to select a course from the left cluster (which is the most mathematical in nature and therefore is more suited for him). Another alternative is to spot the location on the map of a course that interested him in relation to logic and choose the course closest to that course.
The key idea here is that beforehand the student does not know how he is going to choose the courses (one from each cluster; the closest course to physics; a course which is closest to a course he learned already) and only after inspecting the map, he decides on the strategy.
Search methods which lack this graphical representation may return scores of accuracy relative to logic, or a representation in a high dimensional space – which is not very helpful when the human needs to make the decision (instead of the computer making the decision – in the case of classification or clustering).
Another useful map is the U-matrix that shows the gradient of the neurons (hot areas indicate high variability):
If we have to find a series of courses that connect two points on the map, we can look at the u-matrix and choose a path that does not cross any high changing areas (search for a connecting series between the two points which does not cross ‘red mountains’).
Classification can be defined as: Given a labeled training set, label a testing set.
There are many techniques for performing classification such as: nearest neighbor, naive Bayes, feed forward neural networks, support vector machines etc. Note that classification is done after dimensionality reduction (e.g. information gain, distributional clustering) if the dimensionality is high.
We focus on classifying with Bayes and k-nearest neighbor.
Naive Bayes algorithm assumes that the appearances of a term in a document are independent of the term’s context and its location (which fits the bag of words representation).
We assume the data was generated by a parametric model, and use training data to calculate Bayes optimal estimates for the model parameters. Then equipped with these estimates we classify new test documents by using Bayes rule to turn the generative model around and calculate the probability that a class would have generated the test documents in question. We then choose the class that maximizes the probability: P(class=c| document, parameter). More on naive Bayes can be found in Baker 1998.
K nearest neighbors (knn) looks at the k nearest neighbors (in the labeled training set) of the test vector (usually using Euclidean metric) and classifies according to the most popular class among these k neighbors.
More on knn can be found in Duda and Hart 1973.
A comparison between the results of the classification with naive Bayes and k-NN (for several k values) can be found in the Experiment 1 and Experiment 2 sections.
Experiment 1 is performed on data taken from the Technion’s undergraduate course syllabi [1]. Each document is a course syllabus, usually less than 50 words long. The courses were taken from 3 faculties: Computer Science, Civil Engineering and Physics which are the classes.
The documents were converted from html to text, stemmed (an operation which cuts the ending of the words for example computer computers and computing are reduced to comput. This is done using grammatical laws). All letters other than alphabet were disregarded (for example numbers or text signs such as #,@,? etc.).
The documents were transformed to vectors using TF representation. The dimensionality was lowered by information gain and classification was tested using k-nn, naïve Bayes. The results are shown in the following graphs.
Because the dataset is relatively small (about 170 documents), no conclusions can be safely drawn from this experiment’s results (Note the noisy graphs). Nonetheless, we show the results as they might be of some interest.
The first 3 graphs shows results for k-nn and naïve Bayes classification as a function of reducing the vocabulary size by information gain. The last graph shows results for n-gram vocabulary (instead of a vocabulary of one words, use a vocabulary of ordered n-tuples). The classification and dimensionality reduction is the same as for 1-gram, except that each component in the vector represent several words and the initial vocabulary size is much larger.



Experiment 2 is performed on a data set known as ‘the 20 newsgroups’ [2]. Each document is a newsgroup letter. About 20,000 letters were taken from 20 newsgroups (about 1,000 letters from each newsgroup) of varied topics, for example: alt.atheism, comp.graphics, rec.motorcycles, talk.politics.guns, and talk.politics.mideast.
The documents were preprocessed by removing the first lines that contained information about the sender and the newsgroup (it is pointless to test classification on documents which contain the class name inside explicitly).
First, we examine the classification percentage as a function of the classification method and dimensionality. To avoid presenting many graphs, we will use more compact and clear 3-D graphs.
The first of the following graphs shows the classification percentage as an image in which the color of a pixel shows the success rate for the corresponding vocabulary size and number of nearest neighbors. The color bar at the side translates colors to success percentage.
The second graph is similar to the first one but we use bilinear interpolation to get a smooth function.
The third graph shows the success rate as a function in the 3-D plane. On the bottom, the image shows the success rates as well (the colors used are different than the previous two plots).
Two things are evident from these graphs:
1) The quality of the classification increases with the number of nearest neighbors (for the range tested: 1-70 neighbors). Note that this is true for the majority of the vocabulary sizes.
2) As for the vocabulary size, the quality of the classification is at a maximum somewhere between 10-300. For every number of neighbors, the classification success rate is at a clear maximum. This is a very important fact, as it tells us that it is better to lower the dimensionality of the input vectors by a singular transform (which results in a loss of information), but not to lower it too much.



This fact is a consequence of the curse of dimensionality (see text representation section). An intuitive explanation for this fact (Friedman 1995) is: “A function defined in high-dimensional space is likely to be much more complex than a function defined in a lower-dimensional space, and those complications are harder to discern”.
In other words, to learn well a function, we need a dense training set. A growth in the dimensionality causes an exponential growth in the volume of the space and therefore in the size of the training set needed for our dense sample. Because the training set is of fixed size, when we lower the dimensionality, our training set becomes more dense and concentrated in the vector space – which results in a better function approximation.
The following graph shows the superiority of the naïve Bayes algorithm to the 70 nearest neighbors algorithm (which is the best of the k-nn familiy). Note again, how the percentage rises to a maximum around dimensionality of 50.

The next graph compares between dimensionality reduction by information gain and by distributional clustering with respect to 1-nn classification. In Baker and McCallum 1998, it was observed that better results are achieved if first, the dimensionality is reduced to a medium level (500) by information gain, and then continue with distributional clustering.
We reduced the dimensionality to 1000 by information gain, and then compared between continuing to reduce the dimension with information gain or continue with distributional clustering. It is clear from the next graph that indeed distributional clustering performs better than information gain.
The last graph compares between representing text with term frequency (TF – see text representation section) and binary appearance (convert a text to a binary vector according to appearance of terms in the document).


The binary appearance achieved better results, but because the difference is small, no general conclusion can be drawn.
It is interesting to look at the derivative of the classification rate to the vocabulary size. It is relatively small in the range [5000-50] and has very large values in the range [1-50]. In all the graphs, the change happens around 50. This implies that the number 50 is a property of the dataset and not of the dimensionality reduction method or classification method.
This property is called intrinsic dimensionality (Fukanaga,
1982). A dataset, which lies entirely in d dimensions, is said to have an intrinsic
dimensionality of d.