Tag Archives: Algorithm

[Online Education] Design and Analysis of Algorithms I

original:http://www.algo-class.org/

About The Course

In this course you will learn several fundamental principles of algorithm design. You’ll learn the divide-and-conquer design paradigm, with applications to fast sorting, searching, and multiplication. You’ll learn several blazingly fast primitives for computing on graphs, such as how to compute connectivity information and shortest paths. Finally, we’ll study how allowing the computer to “flip coins” can lead to elegant and practical algorithms and data structures. Learn the answers to questions such as: How do data structures like heaps, hash tables, bloom filters, and balanced search trees actually work, anyway? How come QuickSort runs so fast? What can graph algorithms tell us about the structure of the Web and social networks? Did my 3rd-grade teacher explain only a suboptimal algorithm for multiplying two numbers?

Prerequisites

How to program in at least one programming language (like C, Java, or Python); familiarity with proofs, including proofs by induction and by contradiction; and some discrete probability, like how to compute the probability that a poker hand is a full house. At Stanford, a version of this course is taken by sophomore, junior, and senior-level computer science majors.

Textbooks

No books are required for the course. However, three books have significantly influenced the instructor’s presentation and can be consulted for extra details. In order of decreasing relevance to the course, they are: Kleinberg & Tardos, Algorithm Design; Dasgupta, Papadimitriou & Vazirani, Algorithms; and Cormen, Leiserson, Rivest & Stein, Introduction to Algorithms.

No special software (e.g., a development environment) is required to take this course.

The Instructor

Tim Roughgarden is an Associate Professor of Computer Science and (by courtesy) Management Science and Engineering at Stanford University, where he holds the Chambers Faculty Scholar development chair. At Stanford, he has taught the Design and Analysis of Algorithms course for the past eight years. His research concerns the theory and applications of algorithms, especially for networks, auctions and other game-theoretic applications, and data privacy. For his research, he has been awarded the ACM Grace Murray Hopper Award, the Presidential Early Career Award for Scientists and Engineers (PECASE), the Shapley Lecturership of the Game Theory Society, a Sloan Fellowship, the INFORMS Optimization Prize for Young Researchers, and the Mathematical Programming Society’s Tucker Prize.

Frequently Asked Questions

  1. When does the class start? The class will start in January 2012.
  2. What is the format of the class? The class will consist of lecture videos, broken into small chunks, usually between eight and twelve minutes each. Some of these may contain integrated quiz questions. There will also be standalone quizzes that are not part of the video lectures, and programming assignments. There will be approximately two hours’ worth of video content per week.
  3. Will the text of the lectures be available? We hope to transcribe the lectures into text to make them more accessible to those not fluent in English. Stay tuned.
  4. Do I need to watch the lectures live? No. You can watch the lectures at your leisure.
  5. Can online students ask questions and/or contact the professor? Yes, but not directly. There is a Q&A forum in which students rank questions and answers, so that the most important questions and the best answers bubble to the top. The teaching staff will monitor these forums, so that important questions not answered by other students can be addressed.
  6. Will other Stanford resources be available to online students? No.

 

[repost] Selected Programmer Interview Questions, No. 05: Find the k Smallest Elements

original:http://zhedahht.blog.163.com/blog/static/2541117420072432136859/

Problem: Given n integers as input, output the k smallest of them.

For example, given the 8 numbers 1, 2, 3, 4, 5, 6, 7, and 8, the 4 smallest are 1, 2, 3, and 4.

Analysis: The most straightforward approach is to sort the n input integers, so that the first k numbers in sorted order are the k smallest. That takes O(n log n) time, however, so let us look for a faster solution.

We can first create a data container of size k to hold the k smallest numbers. We then read the n input integers one at a time. If the container holds fewer than k numbers, we insert the new integer directly. If the container already holds k numbers, i.e., it is full, we can no longer insert; we can only replace. We find the maximum of the k numbers already in the container and compare the incoming integer against it. If the incoming value is smaller than the current maximum, we replace the maximum with it; if it is not smaller than the current maximum, it cannot be one of the k smallest integers, because the container already holds k numbers no larger than it, so we discard it.

So once the container is full, we need to do three things: first, find the maximum of the k integers; second, possibly delete that maximum from the container; and third, possibly insert a new number while keeping the container organized. If we implement the container with a binary tree, each of these steps takes O(log k) time, so handling all n input numbers costs O(n log k) in total.

We can choose among several kinds of binary trees to implement this container. Since we repeatedly need the maximum of the k integers, a max-heap comes naturally to mind: in a max-heap, the value at the root is always greater than the value of any node in its subtrees. We can therefore obtain the current maximum of the k numbers in O(1) time, while deletion and insertion each take O(log k) time.
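As a quick sketch (ours, not from the original post): the STL already provides a max-heap through std::priority_queue, so the heap-based approach can be written as follows.

```cpp
#include <cassert>
#include <queue>
#include <vector>
using namespace std;

// Sketch (not from the original post): the max-heap approach using
// std::priority_queue, which is a max-heap by default. Runs in O(n log k).
vector<int> KLeastByMaxHeap(const vector<int>& data, unsigned int k)
{
    priority_queue<int> heap;   // top() is the greatest of the values kept so far

    for (vector<int>::const_iterator it = data.begin(); it != data.end(); ++it)
    {
        if (heap.size() < k)
            heap.push(*it);                 // container not full yet
        else if (k > 0 && *it < heap.top())
        {
            heap.pop();                     // drop the current maximum
            heap.push(*it);                 // and keep the smaller value
        }
    }

    vector<int> result;
    while (!heap.empty())
    {
        result.push_back(heap.top());       // emitted greatest-first
        heap.pop();
    }
    return result;
}
```

For example, KLeastByMaxHeap on {5, 2, 8, 1, 7, 3, 6, 4} with k = 4 yields {4, 3, 2, 1}, the four smallest values in greatest-first order.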

Implementing a max-heap from scratch requires a fair amount of code. Alternatively, we can build our container on a red-black tree. A red-black tree colors its nodes red or black and uses a set of rules to keep the tree balanced, which guarantees that search, deletion, and insertion all take only O(log k). In the STL, both set and multiset are implemented on top of red-black trees, so if the interviewer does not object to using STL containers, we can simply use one. Reference code based on the STL multiset follows:

#include <set>
#include <vector>
using namespace std;

typedef multiset<int, greater<int> >  IntHeap;

///////////////////////////////////////////////////////////////////////
// find the k least numbers in a vector
///////////////////////////////////////////////////////////////////////
void FindKLeastNumbers
(
    const vector<int>& data,               // a vector of data
    IntHeap& leastNumbers,                 // the k least numbers, output
    unsigned int k
)
{
    leastNumbers.clear();

    if(k == 0 || data.size() < k)
        return;

    vector<int>::const_iterator iter = data.begin();
    for(; iter != data.end(); ++iter)
    {
        // if fewer than k numbers have been inserted into leastNumbers
        if(leastNumbers.size() < k)
            leastNumbers.insert(*iter);

        // leastNumbers contains k numbers and is full now
        else
        {
            // the first number in leastNumbers is the greatest one
            IntHeap::iterator iterGreatest = leastNumbers.begin();

            // if *iter is less than the previous greatest number
            if(*iter < *iterGreatest)
            {
                // replace the previous greatest number with *iter
                leastNumbers.erase(iterGreatest);
                leastNumbers.insert(*iter);
            }
        }
    }
}

In my English-language blog (http://codercareer.blogspot.com/2011/09/no-05-least-k-numbers.html), I also describe an O(n) algorithm; interested readers can find it there.
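For the curious, one common O(n) average-time technique is partition-based selection, which the STL exposes as std::nth_element. The sketch below is ours, and not necessarily the algorithm described in the English post:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>
using namespace std;

// Sketch: an O(n) average-time alternative based on partition-style
// selection. std::nth_element rearranges the data so that the first k
// elements are the k smallest, in no particular order. Note that it
// works on a copy here, since it reorders its input.
vector<int> KLeastBySelection(vector<int> data, unsigned int k)
{
    if (k == 0 || data.size() < k)
        return vector<int>();

    nth_element(data.begin(), data.begin() + k, data.end());
    return vector<int>(data.begin(), data.begin() + k);
}
```

Unlike the multiset version, this needs all n numbers in memory at once, whereas the container approach only ever holds k of them.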

The author, He Haitao, holds the copyright to the articles on this blog. For online reposts, please credit the source, http://zhedahht.blog.163.com/. Please contact the author before including this material in any publication. Suggestions about the solutions are welcome in the comments, or you can discuss them with me on my microblogs at http://weibo.com/zhedahht or http://t.163.com/zhedahht. Thank you.

[repost] Data Structure Visualizations

original:http://www.cs.usfca.edu/~galles/visualization/Algorithms.html

Currently, we have visualizations for the following data structures and algorithms:

  • Basics
  • Recursion
  • Indexing
  • Sorting
  • Heap-like Data Structures
  • Graph Algorithms
  • Dynamic Programming
  • Geometric Algorithms
  • Others …

    [repost] Pearson’s Correlation Coefficient, r

    Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure. Pearson’s correlation coefficient (r) is a measure of the strength of the association between the two variables.

    The first step in studying the relationship between two continuous variables is to draw a scatter plot of the variables to check for linearity. The correlation coefficient should not be calculated if the relationship is not linear. For purposes of correlation alone, it does not really matter on which axis the variables are plotted. However, conventionally, the independent (or explanatory) variable is plotted on the x-axis (horizontally) and the dependent (or response) variable is plotted on the y-axis (vertically).

    The nearer the scatter of points is to a straight line, the higher the strength of association between the variables. Also, it does not matter what measurement units are used.

    Values of Pearson’s correlation coefficient

    Pearson’s correlation coefficient (r) for continuous (interval level) data ranges from -1 to +1:

    r = -1: data lie on a perfect straight line with a negative slope
    r = 0: no linear relationship between the variables
    r = +1: data lie on a perfect straight line with a positive slope

    Positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases the other decreases, and vice versa.
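    The article does not spell out the formula itself: r is the sum of the products of the deviations from the two means, divided by the square root of the product of the two sums of squared deviations. A minimal sketch (the function name pearson is ours):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
using namespace std;

// Sketch: Pearson's r computed directly from its definition,
// r = sum((x - mean_x)(y - mean_y))
//     / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2)).
// Assumes x and y have the same, non-zero length and non-zero variance.
double pearson(const vector<double>& x, const vector<double>& y)
{
    const size_t n = x.size();
    double meanX = 0.0, meanY = 0.0;
    for (size_t i = 0; i < n; ++i)
    {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; ++i)
    {
        sxy += (x[i] - meanX) * (y[i] - meanY);
        sxx += (x[i] - meanX) * (x[i] - meanX);
        syy += (y[i] - meanY) * (y[i] - meanY);
    }
    return sxy / sqrt(sxx * syy);
}
```

    Note that both the deviations and their squares are unaffected by shifting or rescaling either variable, which is why the measurement units do not matter.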

    Example Scatterplots

    Identify the approximate value of Pearson’s correlation coefficient. [The original page presents eight interactive scatterplot charts for practice.]

     

    Tip: the square of the correlation coefficient indicates the proportion of variation in one variable ‘explained’ by the other (see Campbell & Machin, 1999 for more details).

    Statistical significance of r


    The t-test is used to establish whether the correlation coefficient is significantly different from zero, and hence whether there is evidence of an association between the two variables. The underlying assumption is that the data are a random sample from a normal distribution. If this does not hold, the conclusions may well be invalid, and it is better to use Spearman’s coefficient of rank correlation (for non-parametric variables). See Campbell & Machin (1999), appendix A12, for calculations and more discussion of this.

    It is interesting to note that with larger samples, a low strength of correlation, for example r = 0.3, can be highly statistically significant (i.e., p < 0.01). However, is this an indication of a meaningful strength of association?
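    The test statistic, which the article does not give explicitly, is t = r·sqrt(n − 2)/sqrt(1 − r²), referred to a t distribution with n − 2 degrees of freedom. A small sketch (the function name is ours) illustrates the large-sample point:

```cpp
#include <cassert>
#include <cmath>

// Sketch: the t statistic for testing whether r differs from zero,
// t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
// Assumes n > 2 and |r| < 1.
double correlation_t(double r, unsigned int n)
{
    return r * sqrt(static_cast<double>(n - 2)) / sqrt(1.0 - r * r);
}
```

    For r = 0.3 with n = 100, this gives t ≈ 3.11, which exceeds the two-sided 1% critical value of roughly 2.63 at 98 degrees of freedom, so even this modest correlation is significant at p < 0.01 in a large sample.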

    NB Just because two variables are related, it does not necessarily mean that one directly causes the other!

    Worked example

    Nine students held their breath, once after breathing normally and relaxing for one minute, and once after hyperventilating for one minute. The table indicates how long (in sec) they were able to hold their breath. Is there an association between the two variables?

    Subject      A   B   C   D   E   F   G   H   I
    Normal      56  56  65  65  50  25  87  44  35
    Hypervent   87  91  85  91  75  28 122  66  58

    [Chart: scatter plot of Normal vs. Hypervent breath-holding times]

    The chart shows the scatter plot (drawn in MS Excel) of the data, indicating the reasonableness of assuming a linear association between the variables.

    Hyperventilating times are considered to be the dependent variable, so are plotted on the vertical axis.

    Output from SPSS and Minitab is shown below:

    SPSS
    Select Analyze > Correlate > Bivariate

    [Table: SPSS correlation output]

    Minitab
    Correlations: Normal, Hypervent

    Pearson correlation of Normal and Hypervent = 0.966
    P-Value = 0.000

    In conclusion, the printouts indicate that the strength of association between the variables is very high (r = 0.966), and that the correlation coefficient is significantly different from zero (P < 0.001). We can also say that 93% (0.966²) of the variation in hyperventilating times is explained by normal breathing times.
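    As a check (ours, not part of the original printouts), recomputing r directly from the table reproduces the value:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Check (ours): recompute Pearson's r for the breath-holding data above.
double worked_example_r()
{
    const double normal[]    = {56, 56, 65, 65, 50, 25, 87, 44, 35};
    const double hypervent[] = {87, 91, 85, 91, 75, 28, 122, 66, 58};
    const size_t n = 9;

    double meanX = 0.0, meanY = 0.0;
    for (size_t i = 0; i < n; ++i)
    {
        meanX += normal[i];
        meanY += hypervent[i];
    }
    meanX /= n;
    meanY /= n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; ++i)
    {
        sxy += (normal[i] - meanX) * (hypervent[i] - meanY);
        sxx += (normal[i] - meanX) * (normal[i] - meanX);
        syy += (hypervent[i] - meanY) * (hypervent[i] - meanY);
    }
    return sxy / sqrt(sxx * syy);   // ~0.966, as in the Minitab printout
}
```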

    original:http://hsc.uwe.ac.uk/dataanalysis/quantInfAssPear.asp

    Similarity Measures for the Mahout 0.4 Hadoop-Based Recommender Job

    The Mahout 0.4 Hadoop-based recommender job takes a parameter, --similarityClassname, that specifies the name of the distributed similarity class to instantiate, or one of the predefined similarities from SimilarityType.

    The Mahout 0.4 Hadoop-based recommender job provides the following SimilarityType values:

    SIMILARITY_COOCCURRENCE
    SIMILARITY_EUCLIDEAN_DISTANCE
    SIMILARITY_LOGLIKELIHOOD
    SIMILARITY_PEARSON_CORRELATION
    SIMILARITY_TANIMOTO_COEFFICIENT
    SIMILARITY_UNCENTERED_COSINE
    SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE

    The source code for these similarity measures is here:

    http://svn.apache.org/repos/asf/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/

    You can read the code to see how each measure is implemented, or look the measures up on http://wikipedia.org.
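    To give a flavor of one of the listed measures (an illustrative sketch only, not Mahout's actual code): the Tanimoto coefficient of two sets is the size of their intersection divided by the size of their union.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <iterator>
#include <set>
using namespace std;

// Illustrative sketch (not Mahout's implementation): the Tanimoto
// coefficient of two sets of item IDs, intersection size / union size.
// SIMILARITY_TANIMOTO_COEFFICIENT applies the same idea to the entries
// that co-occur in two preference vectors.
double tanimoto(const set<long>& a, const set<long>& b)
{
    set<long> common;
    set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                     inserter(common, common.begin()));
    const double unionSize = a.size() + b.size() - common.size();
    if (unionSize == 0)
        return 0.0;                 // both sets empty: define as 0
    return common.size() / unionSize;
}
```

    For example, the sets {1, 2, 3, 4} and {3, 4, 5, 6} share 2 of 6 distinct items, giving a coefficient of 1/3.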