Tag Archives: Algorithm

[repost ]Algorithm Writes People’s Life Histories Using Twitter Stream

original:http://www.technologyreview.com/view/519961/algorithm-writes-peoples-life-histories-using-twitter/

If you tweet about your life, a new algorithm can identify your most significant events and assemble them into an accurate life history, say the computer scientists who built it

Twitter allows anyone to describe their life in unprecedented detail. Many accounts provide an ongoing commentary of an individual’s interests, activities and opinions.

So it’s not hard to imagine that it’s possible to reconstruct a person’s life history by analysing their Twitter stream.

But doing this automatically is trickier than it sounds. That’s because most Twitter streams contain news of important events mixed up with entirely trivial details about events of little or no significance. The difficulty is in telling these apart.

Today, Jiwei Li at Carnegie Mellon University in Pittsburgh and Claire Cardie at Cornell University in Ithaca say they’ve developed an algorithm that does this. Their new technique can create an accurate life history for any individual by mining their tweets and those of their followers. That allows them to generate an eerily accurate chronology of a person’s life-changing  events, without knowing anything about them other than their twitter handle.

The key behind this work is a technique for separating the wheat from the chaff in any twitter stream. Li and Cardie do this classifying every tweet in one of four categories. The most important tweets are those that describe important, time specific events of a personal nature.

A tweet about starting a new job would be a good example. By contrast, a tweet about a 5 kilometre run that is part of a regular exercise regime would not qualify because it happens regularly. So personal events fall into two categories–time specific and time general.

Equally, tweets about other non-personal events fall into a similar two categories–time specific and time general. A tweet about the US election would be an example of the former whereas an opinion about the summer weather would be an example of the latter.

The problem that Li and Cardie have solved is to find a way of automaticallydistinguishing tweets in the first category from the others. The solution is based on the discovery that that the pattern of tweets, retweets and replies varies for each of the categoroies they’ve defined.

For example, a tweet about starting a new job has a different pattern of responses from followers than a tweet about running or the US election or the weather. So the trick is to identify this ‘Twitter signature’ of these important personal events and then mine the twitter stream for other examples. A chronological list of these events is that person’s life history.

At least, that’s the theory. Li and Cardie test their idea by mining the streams of 20 ordinary twitter users and 20 celebrities over a 21 month period from 2011 to 2013. They then asked the ordinary users to create their own life history by manually identifying their most important tweets.  For the celebrities, Li and Cardie used Wikipedia biographies and other sources of information to create ‘gold standard’ life histories manually.

Finally, they compared  these gold standard life-histories against the ones generated by their algorithm. The results are not bad. The algorithm accurately picks out many important life events that are also identified in the gold standards. “Experiments on real Twitter data quantitatively demonstrate the effectiveness of our method,” they say.

But it is by no means perfect. For example, the technique only works with users who tweet regularly and with enough followers to allow the algorithm to spot the unique pattern of responses that identifies important tweets.

Still, that’s a significant number of people and Lie and Cardie say their technique can be broadly applied. “It can be extended to any individual, (e.g. friend, competitor or movie star), if only he or she has a twitter account,” they add.

Lie and Cardie talk about their future plans in terms of improving the accuracy of their technique. However, they do not talk about making the algorithm more widely available. If it works as well as they imply, there should be no shortage of interested parties wanting to use it.

The ability to mine the twitter firehose for the life histories of the masses will be valuable. Just who might want to use this technique and how, I’ll leave for the comments section below.

The work raises some interesting questions, not least about privacy. Would Individuals think more carefully about placing their life history in the public domain if they knew how easily it could be distilled?

The new technique means that a detailed life history will be available at the touch of a button to friends and family but also to prospective employers, business competitors, the government, the media, law enforcement agencies, stalkers and so on.

What’s clear is that social networks are an important aspect of modern life. What is not yet so clear is just how powerful and revealing they will turn out to be.

Ref:arxiv.org/abs/1309.7313 : Timeline Generation: Tracking individuals on Twitter

[repost ]Beam Search Algorithm (Draft by Andrew Jungwirth)

original:http://jhave.org/algorithms/graphs/beamsearch/beamsearch.shtml

Java-Hosted Algorithm Visualization Environment

Beam Search Algorithm (Draft by Andrew Jungwirth)

Objectives

  • To show how the Beam Search Algorithm uses a heuristic function and a given beam width in an attempt to simulate the Breadth-First Search in a memory-efficient way.

 

  • To emphasize the importance of limiting a graph-search’s memory consumption when finding solutions to large problem spaces.

 

  • To demonstrate the strengths and weaknesses of the Beam Search Algorithm.

Preparation

In order to understand this algorithm, you must be familiar with the concept of a graph as a group of nodes/vertices and edges connecting these nodes. It is also helpful to understand the how a search tree can be used to show the progress of a graph search. Additionally, knowledge of the Breadth-First Search Algorithm is required because Beam Search is a modification of this algorithm.

Beam Search Algorithm

Even though the Breadth-First Search Algorithm is guaranteed to find the shortest path from a start node to a goal node in an unweighted graph, it is infeasible to use this algorithm on large search spaces because its memory consumption is exponential. This causes the algorithm run out of main memory before a solution can be found to most large, nontrivial problems. For this reason, Beam Search was developed in an attempt to achieve the optimal solution found by the Breadth-First Search Algorithm without consuming too much memory.

In order to accomplish this goal, Beam Search utilizes a heuristic function, h, to estimate the cost to reach the goal from a given node. It also uses a beam width, B, which specifies the number of nodes that are stored at each level of the Breadth-First Search. Thus, while the Breadth-First Search stores all the frontier nodes (the nodes connected to the closing vertices) in memory, the Beam Search Algorithm only stores the B nodes with the best heuristic values at each level of the search. The idea is that the heuristic function will allow the algorithm to select nodes that will lead it to the goal node, and the beam width will cause the algorithm to store only these important nodes in memory and avoid running out of memory before finding the goal state.

Instead of the open list used by the Breadth-First Search Algorithm, the Beam Search Algorithm uses the BEAM to store the nodes that are to be expanded in the next loop of the algorithm. A hash table is used to store nodes that have been visited, similar to the closed list used in the Breadth-First Search. Beam Search initially adds the starting node to the BEAM and the hash table. Then, each time through the main loop of the algorithm, Beam Search adds all of the nodes connected to the nodes in the BEAM to its SET of successor nodes and then adds the B nodes with the best heuristic values from the SETto the BEAM and the hash table. Note that a node that is already in the hash table is not added to the BEAM because a shorter path to that node has already been found. This process continues until the goal node is found, the hash table becomes full (indicating that the memory available has been exhausted), or the BEAM is empty after the main loop has completed (indicating a dead end in the search).

The Beam Search Algorithm is shown by the pseudocode below. This pseudocode assumes that the Beam Search is used on an unweighted graph so the variable g is used to keep track of the depth of the search, which is the cost of reaching a node at that level.

Pseudocode for the Beam Search Algorithm

/* initialization */
g = 0;
hash_table = { start };
BEAM = { start };

/* main loop */
while(BEAM ≠ ∅){                             // loop until the BEAM contains no nodes
  SET = ∅;                                   // the empty set

  /* generate the SET nodes */
  for(each state in BEAM){
    for(each successor of state){
      if(successor == goal) return g + 1;
      SET = SET ∪ { successor };             // add successor to SET
    }
  }

  BEAM = ∅;                                  // the empty set
  g = g + 1;

  /* fill the BEAM for the next loop */
  while((SET ≠ ∅) AND (B > |BEAM|)){         // set is not empty and the number of nodes in BEAM is less than B
    state = successor in SET with smallest h value;
    SET = SET \ { state };                   // remove state from SET
    if(state ∉ hash_table){                  // state is not in the hash_table
      if(hash_table is full) return ∞;
      hash_table = hash_table ∪ { state };   // add state to hash_table
      BEAM = BEAM ∪ { state };               // add state to BEAM
    }
  }
}

// goal was not found, and BEAM is empty - Beam Search failed to find the goal
return ∞;

Example Trace of the Beam Search Algorithm

The following traces of the Beam Search Algorithm use two rows to represent each main loop of the algorithm’s execution. The first row of each numbered loop displays the nodes added to the SET. These nodes are ordered by their heuristic values, with alphabetical ordering used to sort nodes with identical h values. Since the SET is a mathematical set, if a node is inserted into the SET more than once from multiple parents, it only appears in the SET once. The second row of each numbered loop lists the nodes from the SET that are added to the BEAM in the second part of the main loop. Both rows also display the hash table to show its current state. Notice that the hash table has only seven slots, indicating that the memory size for this example trace is seven. A simple linear hashing scheme with key values determined by the node names’ ASCII values mod 7 is used for simplicity. In all three of these lists, nodes are listed in the format node_name(predecessor_name). The algorithm is traced four times with different values of B to demonstrate the strengths and weaknesses of the algorithm. Each trace includes a search tree that shows the BEAM at each level of the search. In the graph, the numbers under the node names are the h values for the nodes. These traces show how Beam Search attempts to find the shortest path from node I to node B in the graph shown in Figure 1. (Figure 1 is included above each trace for convenience.)

Figure 1

Trace 1, B = 1

loop
number
SET (first row per numbered loop)
BEAM (second row per numbered loop)
hash_table
BEAM = { I(null) } hash_table = { _, I(null), _, _, _, _, _ }
1 SET = { G(I), J(I), E(I), H(I) } hash_table = { _, I(null), _, _, _, _, _ }
1 BEAM = { G(I) } hash_table = { _, I(null), _, _, _, _, G(I) }
2 SET = { D(G), J(G), I(G) } hash_table = { _, I(null), _, _, _, _, G(I) }
2 BEAM = { D(G) } hash_table = { _, I(null), _, D(G), _, _, G(I) }
3 SET = { G(D) } hash_table = { _, I(null), _, D(G), _, _, G(I) }
3 BEAM = { } hash_table = { _, I(null), _, D(G), _, _, G(I) }

 

Figure 2

 

At this point, the BEAM is empty, and the Beam Search Algorithm has reached a dead-end in its search. Since the node G in the SET was already in the hash table, it could not be added to the BEAM, which left the BEAM empty. This trace illustrates the greatest weakness of the Beam Search Algorithm: An inaccurate heuristic function can lead the algorithm into a situation in which it cannot find a goal, even if a path to the goal exists. While increasing the value of B may allow Beam Search to find the goal, increasing B by too much may cause the algorithm to run out of memory before it finds the goal. For this reason, the choice of B has a large impact on Beam Search’s performance. Figure 2 shows the BEAM nodes at each level in this dead-end search.

Figure 1

Trace 2, B = 2

loop
number
SET (first row per numbered loop)
BEAM (second row per numbered loop)
hash_table
BEAM = { I(null) } hash_table = { _, I(null), _, _, _, _, _ }
1 SET = { G(I), J(I), E(I), H(I) } hash_table = { _, I(null), _, _, _, _, _ }
1 BEAM = { G(I), J(I) } hash_table = { _, I(null), J(I), _, _, _, G(I) }
2 SET = { A(J), D(G), G(J), J(G), E(J), I(G) } hash_table = { _, I(null), J(I), _, _, _, G(I) }
2 BEAM = { A(J), D(G) } hash_table = { A(J), I(null), J(I), D(G), _, _, G(I) }
3 SET = { C(A), G(D), J(A) } hash_table = { A(J), I(null), J(I), D(G), _, _, G(I) }
3 BEAM = { C(A) } hash_table = { A(J), I(null), J(I), D(G), C(A), _, G(I) }
4 SET = { B(C) [goal found – algorithm returns], A(C) } hash_table = { A(J), I(null), J(I), D(G), C(A), _, G(I) }

 

Figure 3

 

In this trace, the Beam Search Algorithm successfully found the goal via the path IJACB. Even though a solution was found, this solution is not optimal because IECB is a shorter path to the goal node. Once again, an inaccurate heuristic function reduced the effectiveness of the Beam Search Algorithm. Figure 3 shows the BEAM nodes at each level of the search. Notice that only one node appears in the BEAMat level three in the tree. This demonstrates that Beam Search may not always be able to fill the BEAM at each level in the search. In the last level of the tree, node A was first added to the SET, and then node B (the goal node) was found and caused the search to complete.

Figure 1

Trace 3, B = 3

loop
number
SET (first row per numbered loop)
BEAM (second row per numbered loop)
hash_table
BEAM = { I(null) } hash_table = { _, I(null), _, _, _, _, _ }
1 SET = { G(I), J(I), E(I), H(I) } hash_table = { _, I(null), _, _, _, _, _ }
1 BEAM = { G(I), J(I), E(I) } hash_table = { _, I(null), J(I), _, E(I), _, G(I) }
2 SET = { A(J), C(E), D(G), F(E), G(J), J(E), E(J), H(E), I(E) } hash_table = { _, I(null), J(I), _, E(I), _, G(I) }
2 BEAM = { A(J), C(E), D(G) } hash_table = { A(J), I(null), J(I), C(E), E(I), D(G), G(I) }
3 SET = { B(C) [goal found – algorithm returns], A(C), C(A), J(A) } hash_table = { A(J), I(null), J(I), C(E), E(I), D(G), G(I) }

 

Figure 4

 

With B = 3, the Beam Search Algorithm found the optimal path to the goal. However, the larger beam width caused the algorithm to fill the entire memory available for the hash table. Figure 4 shows theBEAM nodes at each level in the search. In the last level of the tree, nodes AC, and J were added to the SET, and then the goal node B was found, which caused to search to complete.

Figure 1

Trace 4, B = 4

loop
number
SET (first row per numbered loop)
BEAM (second row per numbered loop)
hash_table
BEAM = { I(null) } hash_table = { _, I(null), _, _, _, _, _ }
1 SET = { G(I), J(I), E(I), H(I) } hash_table = { _, I(null), _, _, _, _, _ }
1 BEAM = { G(I), J(I), E(I), H(I) } hash_table = { H(I), I(null), J(I), _, E(I), _, G(I) }
2 SET = { A(J), C(E), D(G), F(E), G(J), J(E), E(H), H(E), I(E) } hash_table = { H(I), I(null), J(I), _, E(I), _, G(I) }
2 BEAM = { A(J), C(E), D(G) [not enough memory – algorithm returns] } hash_table = { H(I), I(null), J(I), A(J), E(I), C(E), G(I) }

 

Figure 5

 

Using B = 4, the Beam Search Algorithm quickly ran out of memory. This shows the second major weakness of the Beam Search Algorithm: When B becomes large, the algorithm consumes memory very quickly like the Breadth-First Search Algorithm. Figure 5 shows the BEAM at each level in the search. The last level in the tree shows the progress of the search when the algorithm ran out of memory.

Efficiency/Algorithm Analysis

It is generally effective to analyze graph-search algorithms by considering four traits:

  • Completeness: A search algorithm is complete if it will find a solution (goal node) when a solution exists.

 

  • Optimality: A search algorithm is optimal if it finds the optimal solution. In the case of the Beam Search Algorithm, this means that the algorithm must find the shortest path from the start node to the goal node.

 

  • Time complexity: This is an order-of-magnitude estimate of the speed of the algorithm. The time complexity is determined by analyzing the number of nodes that are generated during the algorithm’s execution.

 

  • Space complexity: This is an order-of-magnitude estimate of the memory consumption of the algorithm. The space complexity is determined by the maximum number of nodes that must be stored at any one time during the algorithm’s execution.

Completeness

In general, the Beam Search Algorithm is not complete. This is illustrated in Trace 1 above. Even though the memory was not depleted, the algorithm failed to find the goal because it could not add any nodes to the BEAM. Thus, even given unlimited time and memory, it is possible for the Beam Search Algorithm to miss the goal node when there is a path from the start node to the goal node. A more accurate heuristic function and a larger beam width can improve Beam Search’s chances of finding the goal. However, this lack of completeness is one of the foremost weaknesses of the Beam Search Algorithm.

Optimality

Just as the Beam Search Algorithm is not complete, it is also not guaranteed to be optimal. This is shown by Trace 2 above. In this example, Beam Search found the goal node but failed to find the optimal path to the goal, even though the heuristic in Figure 1 is admissible (underestimates the cost to the goal from every node) and consistent (underestimates the cost between neighboring nodes). This happened because the beam width and an inaccurate heuristic function caused the algorithm to miss expanding the shortest path. A more precise heuristic function and a larger beam width can make Beam Search more likely to find the optimal path to the goal.

Time Complexity

The time for the Beam Search Algorithm to complete tends to depend on the accuracy of the heuristic function. An inaccurate heuristic function usually forces the algorithm to expand more nodes to find the goal and may even cause it to fail to find the goal. In the worst case, the heuristic function leads Beam Search all the way to the deepest level in the search tree. Thus, the worst case time is O(Bm), where B is the beam width, and m is the maximum depth of any path in the search tree. This time complexity is linear because the Beam Search Algorithm only expands B nodes at each level; it does not branch out more widely at each level like many search algorithms that have exponential time complexities. The speed with which this algorithm executes is one of its greatest strengths.

Space Complexity

Beam Search’s memory consumption is its most desirable trait. Because the algorithm only stores B nodes at each level in the search tree, the worst-case space complexity is O(Bm), where B is the beam width, and m is the maximum depth of any path in the search tree. This linear memory consumption allows Beam Search to probe very deeply into large search spaces and potentially find solutions that other algorithms cannot reach.

Compare with Your Textbook

Algorithms can look differently but still operate in almost the same ways. Compare the pseudocode above with the description in your textbook (if available). Then consider these questions:

  1. Does your textbook use a hash table to store the nodes that have been expanded? If not, how does it store these nodes?

 

  1. Does your textbook explain what type of structure should be used to implement the SET? If so, what structure does it use?

Exploring the Algorithm’s Dynamic Behavior

Explore the Algorithm within JHAVÉ

You can practice the Beam Search Algorithm using the algorithm visualization system JHAVÉ. If you have not used JHAVÉ before, please take the time to view the instructions on using JHAVÉ first. If your browser supports Java Webstart, you can launch a visualization of the Beam Search Algorithm directly from this link.

The Beam Search visualization has fifteen example graphs with which you can experiment. The first four examples, perfect1perfect2perfect3, and perfect4, have perfect heuristic functions that allow the algorithm to find the optimal path if it has enough memory. The next seven graphs, errant1errant2errant3errant4errant5errant6, and errant7, have inaccurate heuristic functions that can lead the algorithm to find paths that are longer than optimal if the beam width is too small. The last four graphs, end1end2end3, and end4, result in dead-end searches when using smaller beam widths. For each of these examples, you can set the values for the beam width, B, and the memory size, M, to see how different values of these parameters affect the outcome of the algorithm on the graph. Finally, the level of detail option allows you to control how the animation steps through the pseudocode. The high detail option shows how each node is added to the SET one-by-one and is useful when you are less familiar with the algorithm. The low detail option generates all the nodes in the SET in one step so that you can more easily focus on the other aspects of the algorithm.

Step through the examples in the visualization and test how the different parameters modify the results found by the Beam Search Algorithm. Answer the questions that appear during the visualization to assess your understanding of the algorithm. When you can consistently answer the questions correctly, try the exercises below.

Exercises

  1. Using the values of B = 2 and M = 10, trace the Beam Search Algorithm on the graph in Figure 6 from start node J to goal node D.

 

Figure 6

 

  1. Did your trace from above find the optimal path from node J to node D? If not, is there a value of B that will find the shortest path with M = 10?

 

  1. Is it possible for Beam Search to make a dead-end search (BEAM is empty at the beginning of the main loop) in the graph above (Figure 6) from node J to node D? If so, what value(s) of Bproduce(s) this situation?

 

  1. Modify the heuristic values in the graph above (Figure 6) so that the Beam Search Algorithm can find the optimal path from J to D with B = 1. How much memory is required to find the shortest path in this new graph with B = 1?

Designing Data Sets

Creating input for an algorithm is an effective way to demonstrate your understanding of the algorithm’s behavior.

  1. Design a graph such that Beam Search fails to find the optimal path from the start node to the goal node with
    B = 1 but finds this shortest path with B = 2.

 

  1. Construct a graph so that a Beam Search with B = 2 reaches a dead-end and fails to find the goal node. (Note: Here a dead-end search refers to the situation in which the BEAM is empty at the beginning of the main loop and does not mean the search runs out of memory.)

Modifying the Algorithm

Consider modifying the Beam Search Algorithm as described above so that the lines previously written as

if(hash_table is full) return ∞;
hash_table = hash_table ∪ { state };
BEAM = BEAM ∪ { state };

are reordered as

hash_table = hash_table ∪ { state };
BEAM = BEAM ∪ { state };
if(hash_table is full) return ∞;

Does this have any effect on the results found by the algorithm? Can you devise an example in which the first version of the algorithm finds the goal, but the modified version returns before finding the goal?

Create Your Own Visualization

Using your own source code, presentation software, or manually produced drawings, create your own visualization of the Beam Search Algorithm.

Presentation

Develop a ten-minute presentation on the Beam Search Algorithm that uses the visualization you developed above to explain the strengths and weaknesses of the Beam Search Algorithm.

[repost ]最短路算法(Shortest Paths Algorithm)

original:http://mindlee.net/2011/11/18/shortest-paths-algorithm/

帮助

+ expand source
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
#include<iostream>
#include<cstdio>
#include<cstring>
#include<cmath>
#include<algorithm>
#include<climits>
const int N = 101;
int map[N][N];
void Floyd(int n) {
    for (int k = 1; k <= n; k++) {
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                if((map[i][k] != INT_MAX) && (map[k][j] != INT_MAX)
                    &&(map[i][j] > map[i][k] + map[k][j] || map[i][j] == INT_MAX)) {
                        map[i][j] = map[i][k] + map[k][j];
                }
            }
        }
    }
}
int main() {
    int n, m;
    while (~scanf("%d%d", &n, &m), n | m) {
        for (int i = 1; i <= n; i++) {
            for (int j = i; j <= n; j++) {
                map[i][j] = map[j][i] = INT_MAX;
            }
        }
        while (m--) {
            int u, v, w;
            scanf("%d%d%d", &u, &v, &w);
            map[u][v] = map[v][u] = w;
        }
        Floyd(n);
        printf("%d\n",map[1][n]);
    }
    return 0;
}

[repost ]the median of a trillion numbers

original:http://matpalm.com/median/index.html

the question

the base algorithm

distributing

generating test data

ruby implementation

brutally short introduction to erlang

single process erlang implementation

multiple process erlang implementation

performance comparisons

running on amazon ec2

conclusion

 

code available at github.com/matpalm/median

other projects

[repost ] stemming algorithm

1) The Snowball Porter stemming algorithm

http://snowball.tartarus.org/algorithms/porter/stemmer.html

Links to resources

Snowball main page
The stemmer in Snowball
The ANSI C stemmer
— and its header
Sample English vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent
Tar-gzipped file of all of the above
A stop word list

Links to resources

Quick Introduction
An account of Snowball
How You Can Help
Snowball
the manual
how to run it
Tar gzipped files of Snowball sources
stemmers
English (porter)
English (porter2)
A note on early English
Romance stemmers:
French
Spanish
Portuguese
Italian
Romanian
Germanic stemmers
German
(German variant)
Dutch
Scandinavian stemmers
Swedish
Norwegian
Danish
Russian
Finnish
Character codes
Contributed stemmers in other programming languages
Wrappers
External Contributions
Irish and Czech
Object Pascal codegenerator for Snowball
Two stemmers for Romanian
Hungarian
Turkish
Armenian
Basque (Euskera)
Catalan
Other work
The Schinke Latin stemmer
The Lovins English stemmer
The Kraaij/Pohlmann Dutch stemmer 

2) The Porter Stemming Algorithm

http://tartarus.org/~martin/PorterStemmer/

This page was completely revised Jan 2006. The earlier edition is here.

This is the ‘official’ home page for distribution of the Porter Stemming Algorithm, written and maintained by its author, Martin Porter.

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

History

The original stemming algorithm paperwas written in 1979 in the Computer Laboratory, Cambridge (England), as part of a larger IR project, and appeared as Chapter 6 of the final project report,

C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in probabilistic information retrieval. London: British Library. (British Library Research and Development Report, no. 5587).

With van Rijsbergen’s encouragement, it was also published in,

M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3) pp 130−137.

And since then it has been reprinted in

Karen Sparck Jones and Peter Willet, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4.

The original stemmer was written in BCPL, a language once popular, but now defunct. For the first few years after 1980 it was distributed in its BCPL form, via the medium of punched paper tape. Versions in other languages soon began to appear, and by 1999 it was being widely used, quoted and adapted. Unfortunately there were numerous variations in functionality among these versions, and this web page was set up primarily to ‘put the record straight’ and establish a definitive version for distribution.

Encodings

The ANSI C version that heads the table below is exactly equivalent to the original BCPL version. The BCPL version did, however, differ in three minor points from the published algorithm and these are clearly marked in the downloadable ANSI C version. They are discussed further below.

This ANSI C version may be regarded as definitive, in that it now acts as a better definition of the algorithm than the original published paper.

Over the years, I have received many encoding from other workers, and they are also presented below. I have a reasonable confidence that all these versions are correctly encoded.

language author affiliation received notes
ANSI C me
ANSI C thread safe me
java me
Perl me
Perl Daniel van Balen Oct 1999 slightly faster?
python Vivake Gupta Jan 2001
Csharp André Hazelwood The Official Web Guide Sep 2001
Csharp .NET compliant Leif Azzopardi Univerity of Paisley, Scotland Nov 2002
Common Lisp Steven M. Haflich Franz Inc Mar 2002
Ruby Ray Pereda www.raypereda.com Jan 2003 github link
Visual Basic VB6 Navonil Mustafee Brunel University Apr 2003
Delphi Jo Rabin Apr 2004
Javascript ‘Andargor’ www.andargor.com Jul 2004 substantial revisions by
Christopher McKenzie
(See ‘Links’ below.)
Visual Basic
VB7; .NET compliant
Christos Attikos University of Piraeus, Greece Jan 2005
php Richard Heyes www.phpguru.org Feb 2005
Prolog Philip Brooks University of Georgia Oct 2005
Haskell Dmitry Antonyuk Nov 2005
T-SQL Keith Lubell www.atelierdevitraux.com May 2006
matlab Juan Carlos Lopez California Pacific Medical Center Research Institute Sep 2006
Tcl Aris Theodorakos NCSR Demokritos Nov 2006
D Daniel Truemper Humboldt-Universitaet zu Berlin May 2007
erlang (1) erlang (2) Alden Dima National Institute of Standards and Technology,
Gaithersburg, MD USA
Sep 2007
REBOL Dale K Brearcliffe Apr 2009
Scala Ken Faulkner May 2009
sas Antoine St-Pierre Business Researchers, Inc Apr 2010
plugin vim script Mitchell Bowden May 2010 github link
node.js Jed Parsons jedparsons.com May 2011 github link
Google Go Alex Gonopolskiy Oct 2011 github link
awk Gregory Grefenstette 3ds.com/exalead Jul 2012

All these encodings of the algorithm can be used free of charge for any purpose. Questions about the algorithms should be directed to their authors, and not to Martin Porter (except when he is the author).

To test the programs out, here is a sample vocabulary (0.19 megabytes), and the corresponding output.

Email any comments, suggestions, queries

Points of difference from the published algorithm

There is an extra rule in Step 2,

(m>0) logi  →  log

So archaeology is equated with archaeological etc.

The Step 2 rule

(m>0) abli  →  able

is replaced by

(m>0) bli  →  ble

So possibly is equated with possible etc.

The algorithm leaves alone strings of length 1 or 2. In any case a string of length 1 will be unchanged if passed through the algorithm, but strings of length 2 might lose a final s, so as goes to a and is to i.

These differences may have been present in the program from which the published algorithm derived. But at such a great distance from the original publication it is now difficult to say.

It must be emphasised that these differences are very small indeed compared to the variations that have been observed in other encodings of the algorithm.

Status

The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which derives from it, and which is subjected to occasional improvements. For practical work, therefore, the new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable.

Common errors

Historically, the following shortcomings have been found in other encodings of the stemming algorithm.

The algorithm clearly explains that when a set of rules of the type

(condition)S1  →  S2

are presented together, only one rule is applied, the one with the longest matching suffix S1 for the given word. This is true whether the rule succeeds or fails (i.e. whether or not S2 replaces S1). Despite this, the rules are sometimes simply applied in turn until either one of them succeeds or the list runs out.

This leads to small errors in various places, for example in the Step 4 rules

(m>1)ement  →
(m>1)ment  →
(m>1)ent  → 

to remove final ement, ment and ent.

Properly, argument stems to argument. The longest matching suffix is -ment. Then stem argu- has measure m equal to 1 and so -ment will not be removed. End of Step 4. But if the three rules are applied in turn, then for suffix -ent the stem argum- has measure m equal to 2, and -ent gets removed.

The more delicate rules are liable to misinterpretation. (Perhaps greater care was required in explaining them.) So

((m>1) and (*s or *t))ion

is taken to mean

(m>1)(s or t)ion

The former means that taking off -ion leaves a stem with measure greater than 1 ending -s or -t; the latter means that taking off -sion or -tion leaves a stem of measure greater than 1. A similar confusion tends to arise in interpreting rule 5b, to reduce final double L to single L.

Occasionally cruder errors have been seen. For example the test for Y being consonant or vowel set up the wrong way round.

It is interesting that although the published paper explains how to do the tests on the strings S1 by a program switch on the last or last but one letter, many encodings fail to use this technique, making them much slower than they need be.

FAQs (frequently asked questions)

#1. What is the licensing arrangement for this software?

This question has become very popular recently (the period 2008−2009), despite the clear statment above that ‘‘all these encodings of the algorithm can be used free of charge for any purpose.’’ The problem I think is that intellectual property has become such a major issue that some more formal statement is expected. So to restate it:

The software is completely free for any purpose, unless notes at the head of the program text indicates otherwise (which is rare). In any case, the notes about licensing are never more restrictive than the BSD License.

In every case where the software is not written by me (Martin Porter), this licensing arrangement has been endorsed by the contributor, and it is therefore unnecessary to ask the contributor again to confirm it.

I have not asked any contributors (or their employers, if they have them) for proofs that they have the right to distribute their software in this way.

(For anyone taking software from the Snowball website, the position is similar but simpler. There, all the software is issued under the BSD License, and for contributions not written by Martin Porter and Richard Boulton, we have again not asked the authors, or the authors’ employers, for proofs that they have such distribution rights.)

#2. Why is the stemmer not producing proper words?

It is often taken to be a crude error that a stemming algorithm does not leave a real word after removing the stem. But the purpose of stemming is to bring variant forms of a word together, not to map a word onto its ‘paradigm’ form.

And connected with this,

#3. Why are there errors?

The question normally comes in the form, why should word X be stemmed to x1, when one would have expected it to be stemmed to x2? It is important to remember that the stemming algorithm cannot achieve perfection. On balance it will (or may) improve IR performance, but in individual cases it may sometimes make what are, or what seem to be, errors. Of course, this is a different matter from suggesting an additional rule that might be included in the stemmer to improve its performance.

Links

I will add to this from time to time any noteworthy instructive or educational links. Suggestions welcome.

Phil Bewig of Programming Praxis has made the Porter stemmer into a programming example on his site.

Christopher McKenzie, who contributed the Javascript stemmer, demonstrates his program (and therefore demonstrates the Porter stemming algorith) at qaa.ath.cx/porter_js_demo.html

3) Solr

[repost ]Which hashing algorithm is best for uniqueness and speed?

original:http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633

I was just wondering which hashing algorithm is best for uniqueness and speed? Example (good) uses include hash dictionaries.

I know there are things like sha256 and such, but these algorithms are designed to be secure, which usually means it is slower than algorithms that are less unique. I want a hash algorithm designed to be fast, yet remain fairly unique to avoid collisions.

 

I tested some different algorithms, measuring speed and number of collisions.

I used three different key sets:

For each corpus, the number of collisions and the average time spent hashing was recorded.

I tested:

Results

Each result contains the average hash time, and the number of collisions

Hash Lowercase Random UUID Numbers ============= ============= =========== ============== Murmur 145 ns 259 ns 92 ns 6 collis 5 collis 0 collis FNV-1a 152 ns 504 ns 86 ns 4 collis 4 collis 0 collis FNV-1 184 ns 730 ns 92 ns 1 collis 5 collis 0 collis* DBJ2a 158 ns 443 ns 91 ns 5 collis 6 collis 0 collis*** DJB2 156 ns 437 ns 93 ns 7 collis 6 collis 0 collis*** SDBM 148 ns 484 ns 90 ns 4 collis 6 collis 0 collis** SuperFastHash 164 ns 344 ns 118 ns 85 collis 4 collis 18742 collis CRC32 250 ns 946 ns 130 ns 2 collis 0 collis 0 collis LoseLose 338 ns - - 215178 collis 

Notes:

Collisions actually happen?

Yes. I started writing my test program to see if hash collisions actually happen – and are not just a theoretical construct. They do indeed happen:

FNV-1 collisions

  • creamwove collides with quists

FNV-1a collisions

  • costarring collides with liquid
  • declinate collides with macallums
  • altarage collides with zinke
  • altarages collides with zinkes

Murmur2 collisions

  • cataract collides with periti
  • roquette collides with skivie
  • shawl collides with stormbound
  • dowlases collides with tramontane
  • cricketings collides with twanger
  • longans collides with whigs

DJB2 collision

  • hetairas collides with mentioner
  • heliotropes collides with neurospora
  • depravement collides with serafins
  • stylist collides with subgenera
  • joyful collides with synaphea
  • redescribed collides with urites
  • dram collides with vivency

DJB2a collions

  • haggadot collides with loathsomenesses
  • adorablenesses collides with rentability
  • playwright collides with snush
  • playwrighting collides with snushing
  • treponematoses collides with waterbeds

CRC32 collisions

  • codding collides with gnu
  • exhibiters collides with schlager

SuperFastCash collisions

  • dahabiah collides with drapability
  • encharm collides with enclave
  • grahams collides with gramary
  • …snip 79 collisions…
  • night collides with vigil
  • nights collides with vigils
  • finks collides with vinic

Randomnessification

The other subjective measure is how randomly distributed the hashes are. Mapping the resulting HashTables shows how evenly the data is distributed. All the hash functions show good distribution when mapping the table linearly:

enter image description here

Or as a Hilbert Map (XKCD is always relevant):

enter image description here

Except when hashing number strings ("1""2", …, "216553") (e.g. zip codes), where patterns begin to emerge in most of the hashing algorithms:

SDBM:

enter image description here

DJB2a:

enter image description here

FNV-1:

enter image description here

All except FNV-1a, which still look plenty random to me:

enter image description here

In fact, Murmur2 seems to have even better randomness with Numbers than FNV-1a:

enter image description here

When i look at the FNV-1a “number” map, i think i see subtle vertical patterns. With Murmur i see no patterns at all. What do you think?


The extra * in the above table denotes how bad the randomness is. With FNV-1a being the best, andDJB2x being the worst:

 Murmur2: FNV-1a: FNV-1: * DJB2: ** DJB2a: ** SDBM: *** SuperFastHash: CRC: ******************* Loselose: ************************** * ************* ****************************** ****************************************************************** 

I originally wrote this program to decide if I even had to worry about collisions: I do.

And then it turned into making sure that the hash functions were sufficiently random.

FNV-1a algorithm

The FNV1 hash comes in variants that return 32, 64, 128, 256, 512 and 1024 bit hashes.

The FNV-1a algorithm is:

hash = FNV_offset_basis for each octetOfData to be hashed hash = hash xor octetOfData hash = hash * FNV_prime return hash 

Where the constants FNV_offset_basis and FNV_prime depend on the return hash size you want:

Hash Size Prime Offset =========== =========================== ================================= 32-bit 16777619 2166136261 64-bit 1099511628211 14695981039346656037 128-bit 309485009821345068724781371 144066263297769815596495629667062367629 256-bit prime: 2^88 + 2^8 + 0x3b = 309485009821345068724781371 offset: 100029257958052580907070968620625704837092796014241193945225284501741471925557 512-bit prime: 2^344 + 2^8 + 0x57 = 35835915874844867368919076489095108449946327955754392558399825615420669938882575126094039892345713852759 offset: 9659303129496669498009435400716310466090418745672637896108374329434462657994582932197716438449813051892206539805784495328239340083876191928701583869517785 1024-bit prime: 2^680 + 2^8 + 0x8d = 5016456510113118655434598811035278955030765345404790744303017523831112055108147451509157692220295382716162651878526895249385292291816524375083746691371804094271873160484737966720260389217684476157468082573 offset: 1419779506494762106872207064140321832088062279544193396087847491461758272325229673230371772250864096521202355549365628174669108571814760471015076148029755969804077320157692458563003215304957150157403644460363550505412711285966361610267868082893823963790439336411086884584107735010676915 

See the main FNV page for details.

As a practical matter:

  • 32-bit UInt32,
  • 64-bit UInt64, and
  • 128-bit Guid can be useful

All my results are with the 32-bit variant.

FNV-1 better than FNV-1a?

No. FNV-1a is all around better. There was more collision with FNV-1a when using the english word corpus:

Hash Word Collisions ====== =============== FNV-1 1 FNV-1a 4 

Now compare lowercase and uppercase:

Hash lowercase word Collisions UPPERCASE word collisions ====== ========================= ========================= FNV-1 1 9 FNV-1a 4 11 

In this case FNV-1a isn’t “400%” worse than FN-1, only 20% worse.

I think the more important takeaway is that there’s two classes of algorithms when it comes to collisions:

  • collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM
  • collisions common: SuperFastHash, Loselose

And then there’s the how evenly distributed the hashes are:

  • outstanding distribution: Murmur2, FNV-1a, SuperFastHas
  • excellent distribution: FNV-1
  • good distribution: SDBM, DJB2, DJB2a
  • horrible distribution: Loselose

Update

Murmur? Sure, why not


Update

@whatshisname wondered how a CRC32 would perform, added numbers to the table.

CRC32 is pretty good. Few collisions, but slower, and the overhead of a 1k lookup table.

snip all erroneous stuff about CRC distribution – my bad


Up until today I was going to use FNV-1a as my de facto hash-table hashing algorithm. But now I’m switching to Murmur2:

  • Faster
  • Better randomnessification of all classes of input

And I really, really hope there’s something wrong with the SuperFastHash algorithm i found; it’s too bad to be as popular as it is.

Update: From the MurmurHash3 homepage on Google:

(1) – SuperFastHash has very poor collision properties, which have been documented elsewhere.

So I guess it’s not just me.

Update: I realized why Murmur is faster than the others. MurmurHash2 operates on 4 bytes at a time. Most algorithms are byte by byte:

for each octet in Key AddTheOctetToTheHash 

This means that as keys get longer Murmur gets its chance to shine.


Update

GUIDs are designed to be unique, not random

A timely post by Raymond Chen reiterates the fact that “random” guids are not meant to be used for their randomness. They, or a subset of them, are unsuitable as a hash key:

Even the Version 4 GUID algorithm is not guaranteed to be unpredictable, because the algorithm does not specify the quality of the random number generator. The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong.

Randomess is not the same as collision avoidance; which is why it would be a mistake to try to invent your own “hashing” algorithm by taking some subset of a “random” guid:

int HashKeyFromGuid(Guid type4uuid) { //A "4" is put somewhere in the guid. //i can't remember exactly where, but it doesn't matter for //the illustrative purposes of this pseudocode int guidVersion = ((type4uuid.D3 & 0x0f00) >> 8); Assert(guidVersion == 4); return (int)GetFirstFourBytesOfGuid(type4uuid); } 

Note: Again, i put “random guid” in quotes because it’s the “random” variant of guids. A more accurate description would be Type 4 uuid. But nobody knows what type 4, or types 1, 3 and 5 are. So it’s just easier to call them “random” guids.

[repost ]Forward algorithm

original:http://en.wikipedia.org/wiki/Forward_algorithm

Not to be confused with Forward-backward algorithm.

The forward algorithm, in the context of a hidden Markov model, is used to calculate a ‘belief state’: the probability of a state at a certain time, given the history of evidence. The process is also known as filtering. The forward algorithm is closely related to, but distinct from, the Viterbi algorithm.

For an HMM such as this one:

Temporal evolution of a hidden Markov model

this probability is written as P(x_t | y_{1:t} ). (abbreviating x(t) as x_t.) A belief state can be calculated at each time step, but doing this does not, in a strict sense, produce the most likely state sequence. but rather the most likely state at each time step, given the previous history.

Contents

Smoothing

In order to take into account future history (i.e., if one wanted to improve the estimate for past times), you can run the Backward algorithm, a complement of the Forward. This is called smoothing. Mathematically, it would be said that the forward/backward algorithm computes P(x_k | y_{1:t} ) for 0<k<t. So the use of the full F/B algorithm takes into account all evidence.

Decoding

In order to achieve the most likely sequence, the Viterbi algorithm is required. It computes the most likely state sequence given the history of observations, that is, the state sequence that maximizes P(x_{0:t}|y_{0:t}).

The difference between the state sequence that the Viterbi algorithm estimate generates and the state sequence that the Forward algorithm generates is that the Viterbi algorithm recalculates the entire sequence with each new data point whereas the Forward Algorithm only appends the new current value to the previous sequence computed.

See also

Further reading

  • Russel and Norvig’s Artificial Intelligence, a Modern Approach, starting on page 541 of the 2003 edition, provides a succinct exposition of this and related topics