Saturday, May 28, 2016

Combined Impact of Social Networking Techniques and Shannon Information Theory on Text Mining

Impact of Social Networking Techniques and Shannon Information Theory on Text Mining

Part-1: Centrality Entropy



1. Entropy 

The following equation presents the Shannon’s definition of entropy of a random variable X which can take 'n' values.

                               (1)

In the Shannon entropy equation, p(xi) represents the probability of the given symbol 'xi'. The Shannon entropy equation provides a way to estimate the average minimum number of bits needed to encode a string of symbols, based on the frequency of the symbols. The minimum average number of bits is per symbol and it can be calculated as:

                                                                  (2)
Example: Suppose we take the string: "Shannon" and calculates its entropy:
       
Distinct characters
Frequency
Probability
S
1
0.143
h
1
0.143
a
1
0.143
n
3
0.429
o
1
0.143

Shannon entropy can be calculated as follow:
H(X) = -[(0.143log20.143)+(0.143log20.143)+(0.143log20.143)+(0.429log20.429)+ (0.143log20.143)]
H(X) = -[(-0.401)+(-0.401)+(-0.401)+(-0.524)+(-0.401)]
H(X) = -[-2.128]
H(X) = 2.128
Discussion: Now, according to Eq-2, the ceiling value of 2.128 is 3. So, each symbol will be encoded by 3 bits and current string "Shannon" will require 21 bits to encode your string optimally (as, it contains total 7 characters).


2. Social Networking (Centrality Measures)

There are a lot of social networking techniques available. However, to demonstrate the combined impact of social networking techniques and Shannon information theory on text mining, I will use "closeness centrality".


2.1. Closeness-Centrality It gives the information about how close a  node is, w.r.t., the entire network [4]. see more information at:






3. Impact of Entropy on Centrality Measures 

Centrality entropy provides information on the degree of reachability for a node in the graph[5]. However, such calculations may have different impact on (1) fully connected graph and (2) partially connected graph. The following contains the impact of such calculations on both types of graphs [2, 3, 5].

Fully connected graphIn a fully connected graph the removal of any node will have the same effect on centrality entropy as when any other node is removed. All nodes are equally important for the flow of information [5]. 
Partially connected graph: In partially connected graphs, those nodes whose removal will split the
graph in two or more parts or that will reduce substantially the number of geodesic paths available to reach other nodes when removed, will have a higher impact in decreasing the total centrality entropy [5]. This  event produces the largest change in centrality entropy for the graph.


4. How to apply it in Text Mining

4.1. Representing Text as a Graph 

To apply the social networking techniques on text, we need to convert the text into graph. Instead of going into a complex steps, we can use a very simple form of word graph of text, similar to that used in Textrank [6]. This word graph of text add link between any two words in the text, if (1) they are adjacent to each other and (2) comes under nouns/adjectives parts of speech tags. The following example will demonstrate it:




NOTE: However, we can go through a more informative graph, which contains edge weight as product of (1) co-occurrence frequency of terms and (2) their semantic strength. Such, approaches, neutralize the impact of noisy words having very high occurrence frequency [1, 7]. A sample java code can be downloaded from my site

4.2. Applying Entropy of centrality Measures

For this, we use the text graph used, in previous section. Next, according to the process applied in [5], we use the following.














CalculationTo calculate the centrality entropy, we first calculate the entropy of entire graph by using the centrality score of node/edge (as per requirements), Next, we just drop the a node, and all incident edges for that node, and then  calculate the entropy of the remaining graph. The importance score for each node is the difference of entropy scores calculated before and after removing that node. By using the same process, we can collect the score of all nodes of the graph.

4.3. Result and Analysis

We, use the text given in the section 3, and extracted 10 top ranked words by using (1) Pagerank algorithm, (2) closeness centrality score and (3) current setting of entropy of centrality. We find that, top-10 extracted words, extracted by using Pagerank, contains 7 informative terms. Top-10 words extracted by using closeness centrality contains 6 informative words and Finally with the current setting, it was 8 out of 10.  

Reference:

  1. Niraj Kumar, Kannan Srinathan and Vasudeva Varma; "A Graph based Unsupervised N-gram Filtration Technique for Automatic Keyphrase Extraction"; accepted for publication in "Int. J. of Data Mining, Modelling and Management".
  2. Borgatti, S. P. (2003), The key player problem. Dynamic social network modeling and analysis: Workshop summary and papers. National Academy Press. 
  3. Everett, M. G. and Borgatti, S. P. (1999). The centrality of groups and classes. Journal of Mathematical Sociology, Vol. 23, No. 3. pp. 181-201.
  4. Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99, 7821–7826.
  5. Ortiz-Arroyo, D. (2010). Discovering sets of key players in social networks (pp. 27-47). Springer London.
  6. Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing order into texts." Association for Computational Linguistics, 2004.
  7. Niraj Kumar, Kannan Srinathan, Vasudeva Varma:A Knowledge Induced Graph-Theoretical Model for Extract and Abstract Single Document Summarization. CICLing (2) 2013: LNCS 7817, pp. 408-423