Yet Another Y Combinator Tutorial

After reading about the Y combinator in the book The Little Schemer, simply staring at its code becomes a great way to kill time. In the past few days, whenever I got bored (or need an excuse for procrastination), I spent hours trying to make sense out of its mysterious implementation. It’s like monad: everybody has a hard time reaching that “Eureka” moment, but once he finally groks it, he believes that he can explain it more clearly than the tutorial he read (which is often not the case).

So here I am writing yet another Y combinator tutorial. I think the reason why Y combinator is so hard to fathom is that when you read the code, you are facing its entire complexity. In fact, however, its derivation can be broken into a few fairly intuitive steps, and if each step is well-motivated, we no long need a “Eurika” to understand the whole thing.

The code snippets in this tutorial are written in Racket, but it should be trivial to translate them into other Lisp dialects or other dynamic functional languages. 1

Y? Why?

Before deriving the Y combinator, let see what it is about. Consider the following simple recursive function that returns the length of a given list:

1
2
3
4
5
6
7
(define len
  (λ (lst)
    (cond
      [(empty? lst) 0]
      [else (+ 1 (len (rest lst)))])))

(len '(1 2 3 4 5)) ; => 5

Simple enough, except that if you think about it, it’s quite weird: len is used in the process of defining len itself - that is, the identifier len is used before it has been bound to anything, since the lambda to which we intend to bind it hasn’t yet been defined at that point! We should expect an “unbound identifier” error, but somehow the interpreter helps take care of it. But what if the interpreter is not so helpful? We have been able to write anonymous functions using lambda literals; can we write recursive anonymous functions, too?

Answering this question helps us better understand the essence of the language. If recursive anonymous functions is impossible, then define should be something essential to the language; otherwise it is merely a syntactic sugar, since everything can be anonymous.

Tell Me About Ur-self

We can’t write the len function as we like to, so what shall we do? Maybe we can write some other function mk-len that constructs len for us. Let’s start by simply wrapping a lambda without arguments around len to get a new function mk-len. Then all uses of len can be substituted with (mk-len).

1
2
3
4
5
6
7
8
(define mk-len
  (λ ()
    (λ (lst)
      (cond
        [(empty? lst) 0]
        [else (+ 1 ((mk-len) (rest lst)))]))))

((mk-len) '(1 2 3 4 5)) ; => 5

We haven’t got rid of the recursion in mk-len yet. Well, to achieve the effect of recursion, mk-len needs to somehow refer to itself using some name anyway - but not the name we defined. Is there another way of binding a value to a name?

It turns out that there is: when we pass a value to a function as its argument, the value is automatically bound to the name of the corresponding parameter within the body of the function. That’s why ((λ (x) x) 123) returns 123 instead of an error “unbound identifier: x”.

So how about passing the function mk-len as a argument of itself, so that it can refer to itself using the name of that parameter? Let’s see:

1
2
3
4
5
6
7
8
(define mk-len
  (λ (self)
    (λ (lst)
      (cond
        [(empty? lst) 0]
        [else (+ 1 ((self self) (rest lst)))]))))

((mk-len mk-len) '(1 2 3 4 5)) ; => 5

With this simple trick, now mk-len can recurse without knowing its name! Let’s further extract a function U to do the self-application (mk-len mk-len):

1
2
3
4
(define U
  (λ (f) (f f)))

((U mk-len) '(1 2 3 4 5)) ; => 5

Now we no longer need to give the function mk-len a name:

1
2
3
4
5
6
((U (λ (self)
      (λ (lst)
        (cond
          [(empty? lst) 0]
          [else (+ 1 ((self self) (rest lst)))]))))
 '(1 2 3 4 5)) ; => 5

Congratulations! Now you know how to do recursion on anonymous functions!

By the way, the simple function U has a name: the U combinator. (Hold on, we’ll get to the Y soon.)

The Chicken Or The Egg Or The Fixed Point

While the U combinator allows us to write recursive anonymous functions, at the point of recursion we must always do the ugly self-application (self self). This is annoying. We would really want to write our mk-len as:

1
2
3
4
5
6
(define mk-len
  (λ (rlen)
    (λ (lst)
      (cond
        [(empty? lst) 0]
        [else (+ 1 (rlen (rest lst)))]))))

Now the problem is to find some function F that transforms mk-len into another function that behaves exactly the same as len.

First let’s take a look at this new function for a moment. It takes a function rlen as its argument and returns another function. It basically says: “give me a function rlen that can be used to compute the length of (rest lst), then I will make a function for you that can compute the length of lst”.

If lst is empty, what rlen you provide doesn’t matter, since ((mk-len rlen) empty) always returns 0 without calling rlen. But when lst is not empty, if we want ((mk-len rlen) lst) to return the length of lst, we need to provide a rlen that correctly computes the length of (rest lst). If we provide a function that can compute the length of a list with 10 elements, mk-len returns a function that can compute the length of a list with 11 elements. If rlen is able to compute the length of a list with 100 elements, (mk-len rlen) can compute the length of a list with 101 elements.

It follows that if we could provide a rlen such that (rlen (rest lst)) always correctly returns the length of (rest lst) for any non-empty lst, then ((mk-len rlen) lst) would always return the length of lst - that is, (mk-len rlen) would behave like the recursive len we defined at the beginning of this tutorial.

So what function should rlen be? Since lst can be any list, (rest lst) can also be any list (when lst is not empty). So we are basically asking for a rlen that is able to compute the length of any list - that is, rlen should also behave like len.

In other words, to transform mk-len into a function that behaves like len, we need to pass it a function rlen that behaves like len, but the only way to get such a function is by transforming mk-len. Now we seem to be stuck at this chicken-egg dilemma!

What should we do? Let’s just cheat by passing len to mk-len:

1
((mk-len len) '(1 2 3 4 5)) ; => 5

It works as we expected. Observe carefully, we can notice something curious: (mk-len len) not only behaves like len; in fact they are exactly the same function, i.e. len = (mk-len len). Therefore len is by definition the fixed point of mk-len. We can just define len in terms of mk-len following this definition, and it is equivalent to the original len.

1
2
3
4
5
(define len
  (mk-len
   (λ (lst) (len lst))))

(len '(1 2 3 4 5)) ; => 5

Now we can extract a function F that computes mk-len’s fixed point len:

1
2
3
4
5
6
7
8
(define F
  (λ (f)
    (local [(define fx
              (f (λ (x)
                   (fx x))))]
      fx)))

((F mk-len) '(1 2 3 4 5)) ; => 5

And again the function mk-len can be anonymous:

1
2
3
4
5
6
((F (λ (rlen)
      (λ (lst)
        (cond
          [(empty? lst) 0]
          [else (+ 1 (rlen (rest lst)))]))))
 '(1 2 3 4 5)) ; => 5

Congratulations (again)! Now you know how to do recursion on anonymous functions without the ugly (self self)! We are done…right?

Simply Y

Not quite, because we are still cheating: the definition of fixed point fx in F blatantly calls its own name to recurse. It seems that we are starting all over again. Are we making any progress?

In fact we have made progress: the users of our F function can now write recursive anonymous functions using a very clean syntax, and we are free to do whatever we want to eliminate recursion of fx in F, as long as its behavior doesn’t change.

What do we do? Remember how we used the simple U combinator to avoid recursion? Let’s try it on fx:

1
2
3
4
5
6
7
(define F
  (λ (f)
    (U (λ (self)
         (f (λ (x)
              ((self self) x)))))))

((F mk-len) '(1 2 3 4 5)) ; => 5

The self-application (self self) is still somewhat ugly, but it only appears once in F. Someone needs to do the dirty job, so that all functions like mk-len can be pretty. That’s life.

Now if we put the definition of U in F, and change a few names, we get:

1
2
3
4
5
6
7
(define Y
  (λ (f)
    ((λ (g) (g g))
     (λ (g)
       (f (λ (x) ((g g) x)))))))

((Y mk-len) '(1 2 3 4 5)) ; => 5

This Y is what we call: (drum roll) … the famous Y combinator discovered by Haskell Curry! There you have it. Not that hard, right?

In fact, if we define F in different ways and then use U to remove recursion, we can easily get many non-recursive functions that work just like Y. For example, because (F f) returns the fixed point of f, then by definition (F f) = (f (F f)). Let’s write the definition of F in this way2:

1
2
3
4
5
6
(define F
  (λ (f)
    (f (λ (x)
         ((F f) x)))))

((F mk-len) '(1 2 3 4 5)) ; => 5

Similarly, we can use U to remove the recursion:

1
2
3
4
5
6
7
(define F
  (U (λ (self)
       (λ (f)
         (f (λ (x)
              (((self self) f) x)))))))

((F mk-len) '(1 2 3 4 5)) ; => 5

Then, again, put in the definition of U and change some names:

1
2
3
4
5
6
7
(define Θ
  ((λ (f) (f f))
   (λ (g) (λ (f)
            (f (λ (x)
                 (((g g) f) x)))))))

((Θ mk-len) '(1 2 3 4 5)) ; => 5

This Θ is what we call: (drum roll) … the (less) famous Turing combinator discovered by Alan Turing!

I can go on and on. Such higher-order functions like Y and Θ that computes a fixed-point of other functions are called fixed-point combinators. In fact, there are infinitely many of them.

So Y is not so mysterious, and it’s not so special, either. I hope you are not too disappointed.

Wrap It Up (Pun Intended)

Now you know what Y combinator does: it computes the fixed point of another function. We can use Y to achieve anonymous recursion because a recursive function (like len) can be rewritten as the fixed-point of another function (like mk-len). So (Y mk-len) gives us len.

But there’s something more interesting. Consider some function f, in its definition it invokes another function g. If we want to control who f is talking to, we can make g an parameter. Now step back and stare at mk-len for a moment. Think about what we have done: we parameterized the point of recursion! When the normal len calls itself, it is very certain about it, and there’s nothing we can do. However, mk-len has no idea what rlen we pass to it! By writing recursion as fixed point computation, we gain the power and freedom of controlling what happens in a recursive call without modifying mk-len’s code.

Consider that we want to print some log at each recursive call. For a normal recursive function like len, there’s nothing we can do but to modify the code. If there are 10 such functions, we need to modify each of them and create a lot of mess. On the other hand, for functions like mk-len, obviously we can simply modify our fixed-point combinator:

1
2
3
4
5
6
7
(define Y
  (λ (f)
    ((λ (g) (g g))
     (λ (g)
       (f (λ (x) (begin (displayln x)
                        ((g g) x))))))))
((Y mk-len) '(1 2 3 4 5))

This prints out the following:

1
2
3
4
5
(2 3 4 5)
(3 4 5)
(4 5)
(5)
()

Now if we want to print log after each recursive call, we would need to modify the Y again. If we want to do something else, modify again…

Well, we can do better than this. We can create wrappers around functions like mk-len that controls the recursion calls in different ways. Here’s an example:

1
2
3
4
5
6
7
8
9
(define log-start-wrapper
  (λ (mk-len)
    (λ (log)
      (λ (x)
        (begin
          (printf "Start computing for: ~a~n" x)
          ((mk-len log) x))))))

((Y (log-start-wrapper mk-len)) '(1 2 3 4 5))

It prints out the following:

1
2
3
4
5
6
Start computing for: (1 2 3 4 5)
Start computing for: (2 3 4 5)
Start computing for: (3 4 5)
Start computing for: (4 5)
Start computing for: (5)
Start computing for: ()

Similarly, we can define log-start-wrapper that prints a log after each recursive call:

1
2
3
4
5
6
7
8
9
10
11
(define log-end-wrapper
  (λ (mk-len)
    (λ (log)
      (λ (x)
        (local [(define result
                  ((mk-len log) x))]
          (begin
            (printf "Result for ~a is: ~a~n" x result)
            result))))))

((Y (log-end-wrapper mk-len)) '(1 2 3 4 5))

It prints out:

1
2
3
4
5
6
Result for () is: 0
Result for (5) is: 1
Result for (4 5) is: 2
Result for (3 4 5) is: 3
Result for (2 3 4 5) is: 4
Result for (1 2 3 4 5) is: 5

We can even use several wrappers together:

1
((Y (log-start-wrapper (log-end-wrapper mk-len))) '(1 2 3 4 5))

This gives you:

1
2
3
4
5
6
7
8
9
10
11
12
Start computing for: (1 2 3 4 5)
Start computing for: (2 3 4 5)
Start computing for: (3 4 5)
Start computing for: (4 5)
Start computing for: (5)
Start computing for: ()
Result for () is: 0
Result for (5) is: 1
Result for (4 5) is: 2
Result for (3 4 5) is: 3
Result for (2 3 4 5) is: 4
Result for (1 2 3 4 5) is: 5

In this example, wrappers are like decorators for recursive function.

With such flexibility, there are actually more funky stuff we can do with wrappers. For example, we can cache the result of recursive calls to avoid redundant computation (a.k.a memoization), or modify the result (or even change the type of the result) of the recursive calls. You can read this interesting paper “That About Wraps it Up” for more information.


  1. In the cases of statically typed languages, it gets more complicated or even impossible. Let’s just ignore them in this tutorial for the sake of clarity.

  2. A subtle point here: we write (λ (x) ((F f) x)) instead of just (F f). This is because otherwise in order to pass (F f) as an argument of f, we first need to evaluate (F f), which expands to (f (F f)), to evaluate which we again need to evaluate first evaluate (F f)… The program will hang until it finally crashes from a stack overflow. Wrapping a lambda around (F f) delays its evaluation, making sure x passed to (F f) is evaluated first.

Surprises in Scala’s ‘Uniform Return Type’ Magic

There’s an increasingly heated debate about Scala’s complexity. A recent post talks about Scala’s “true complexity”: the terrifying intricacy you will encounter, if you try to extend Scala’s Collections Library with a method that achieves the “perfect” behavior.

Yet it also might be worth asking: what are the motivations of these “perfect” behaviors? Are they always desirable?

One important aspect of the “perfect” collection behavior is the magical uniform return type principle. Sloppily speaking, this means that transforming a collection should give you another collection of the same kind. For instance:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// Filtering a List gives you a filtered List
scala> List(1,2,3,4).filter(_%2==0)
res: List[Int] = List(2, 4)

// Mapping a Set gives you a mapped Set
scala> Set(1,2,3,4).map(1+)
res: scala.collection.immutable.Set[Int] = Set(2, 3, 4, 5)

// Uniform return type at both compile time and runtime
scala> val iterable : Iterable[Int] = Set(1,2,3,4)
iterable: Iterable[Int] = Set(1, 2, 3, 4)

scala> iterable.map(1+)
res: Iterable[Int] = Set(2, 3, 4, 5)

// Works for Strings, too
scala> "E is the most frequent letter in English".map(
     |    c => if(c=='e'|c=='E') '3' else c)
res: String = 3 is th3 most fr3qu3nt l3tt3r in 3nglish

Clean and consistent, can’t be more intuitive. Of course filtering a List should return a List, and mapping a Set should give you a Set. Both compile time and runtime. Works even for non-collections like Strings.

When I first saw this, I thought “yeah, cool”, and then forgot about it. I took it for granted as just another Scala magic that makes my life easier, without realizing its intricate implications. Not until recently have I realized that Scala’s collections were much smarter, or trickier, than I had expected.

The uniform return type principle is fairly straightforward for operations like filter, take, drop and slice, but situations get complicated in the cases of some other operations like map and flatMap. Consider the following:

1
2
scala> Set(1,2,3,4).map(_ % 2 == 0)
res: scala.collection.immutable.Set[Boolean] = Set(false, true)

It starts getting interesting. When mapping a Set, you should get a Set, which by definition doesn’t contain repeated values. So you get a Set[Boolean] which consists of only 2 values. You might get surprised here if you are familiar with LINQ in C#. In C#

1
new HashSet(new[] {1, 2, 3, 4}).Select(x => x%2 == 0)

gives you an IEnumerable that consists of “false, true, false, true” in order, because the HashSet is simply iterated through as a Normal IEnumerable. Some say C#/LINQ sucks, because the type information of original collection is “lost”.

In Scala, Map[K, V] is a subtype of Iterable[(K, V)], thus can be seen as a collection of key-value pairs. Now stop for a moment and think about it: what happens if you map a Map?

1
2
3
4
5
scala> Map('a'->1,'b'->2,'c'->3,'d'->4).map(t => t._2 -> t._1)
res: scala.collection.immutable.Map[Int,Char] = Map(1 -> a, 2 -> b, 3 -> c, 4 -> d)

scala> Map('a'->0,'b'->1,'c'->0,'d'->1).map(t => t._1)
res: scala.collection.immutable.Iterable[Char] = List(a, b, c, d)

Ideally, according to the uniform return type principle, mapping a Map should give you another Map. But this is only possible when the target element type is key-value pair. Otherwise, the best we can do is to treat the Map as just an Iterable of key-value pairs and return an List of the target elements.

However, there is still the problem of inconsistency. Here’s an example:

1
2
3
4
5
scala> Map(1->'a', 2->'b', 3->'c').map(_._2)
res: scala.collection.immutable.Iterable[Char] = List(a, b, c)

scala> Map(1->('a','b'), 2->('a','c'), 3->('b','c')).map(_._2)
res: scala.collection.immutable.Map[Char,Char] = Map(a -> c, b -> c)

Same operation on two Sets, but the behaviors are totally different, simply because these two Sets have different types of values? This is weird. The problem is that 2-tuples are selected to represent the key-value pairs, but are not exclusively used for this purpose. So when user map the elements to 2-tuples, there’s this unavoidable ambiguity.

It becomes even more error-prone if the actual class type of the collection is not known at compile time. We know that Map has a property “keys” that gives you an Iterable of its keys. Consider the following example:

1
2
3
4
5
scala> val m = Map(1 -> 'a', 2 -> 'b', 3 -> 'c', 4 -> 'd')
m: scala.collection.immutable.Map[Int,Char] = Map(1 -> a, 2 -> b, 3 -> c, 4 -> d)

scala> m.keys.map(_%2==0)
res: Iterable[Boolean] = Set(false, true)

Ah…the runtime type of “keys” is Set[Int], so repeated values in the returned collection are also ignored. The behavior should surprise you if you haven’t think of this. In fact, when given a Iterable without knowing its actual type, in any case the behaviors of collection operations are simply unpredictable. Scala guys can argue that this is just how it works, but this doesn’t seems to be a good thing, given that Iterables have been widely used between interfaces to pass data collections around without specifying their internal structures.

To sum up, the behavior of the map operation depends on not only the type of the collection on which it invokes, both static and dynamic, but also on the type of the mapped elements.

How is this magic implemented? If you take a look at the source code, you will probably see the most complicated collection library you’ve ever seen in your life. I’ll try to give a brief and simplified explanation. Here is the actual signature of the map method:

1
2
def map[MappedElem, That](p: Elem => MappedElem)
    (implicit bf: CanBuildFrom[This, MappedElem, That]): That

As you can see there’s an extra implicit parameter bf of type CanBuildFrom[This, MappedElem, That], which will give you a Builder[MappedElem, That] that can build a collection (of type That) from the mapped elements (of type MappedElem). In short, CanBuildFrom[This, MappedElem, That] is a factory for Builder[MappedElem, That], which itself is a factory for That. When both type parameters This and MappedElem are given, the compiler can find the best eligible for bf (according to some perplexing rules) and consequently determines the static type of That. For example:

1
2
3
4
5
// bf : CanBuildFrom[Map[_,_], (Char, Int), Map[Char, Int]]
Map(1 -> a, 2 -> b).map(t => t._2 -> t._1)

// bf: CanBuildFrom[Iterable[_], Int, Iterable[Int]]
Map(1 -> a, 2 -> b).map(_._1)

The CanBuildFrom can forward the call to the genericBuilder[MappedElem] method of the collection inferred in compile time, so that the “right” runtime type can be selected via virtual dispatch.

Now you should have noticed the conceptual and implementation complexity brought by the seemingly simple “uniform return type principle”. But why do we need this at the first place?

Let’s digress to talk about LINQ for a moment. I don’t think the argument “LINQ sucks, because the type information of original collection is lost” really stands, because I think this is exactly what it was designed to ignore. For most of the times, we often don’t care about whether the Iterable is a Seq or a Set or a Map or whatever. What we do with a collection is often just querying and consuming its content. LINQ provides a uniform interface with consistent, lazy behavior to facilitate this. When you need a collection of certain type, you explicitly convert the query result to it. When you need special behaviors of a Set or Map (Dictionary), you use their own interfaces. But queries are queries, different collections are treated equally. Such simplicity is not a weakness, but a feature.

In my point of view, LINQ and Scala’s collection operation are cosmetically similar but semantically different. LINQ queries a collection; Scala’s collection operation transforms a collection. (This may give you the clue why LINQ doesn’t have a ForEach extension method.) Scala’s functional nature necessitates the collection transformation semantic: you need to be able to deal with the value of a collection and compute a new value from it. And since Scala is also statically and strongly typed, the “uniform return type principle” is required to make the system consistent. Thus the behaviors of collection operations should of course depend on the actual collection type and the type of the target element. The operations are non-lazy (strict) by default to avoid unintended delay of the side-affects within the operations (there’s a more detailed explanation about the decision on laziness at the bottom of this page).

What if we only want to query and consume the content of the collection, like we do in LINQ? I suggest using Iterator directly or convert the collection to Stream:

1
2
3
4
scala> Set(1,2,3,4).iterator.map(_%2).foreach(print)
1010
scala> Set(1,2,3,4).toStream.map(_%2).foreach(print)
1010

What comes with the complexity is great flexibility and power. Scala is expressive and seems quite easy, but the paradox is that you need to spend much time learning to avoid misusing it. Scala is not easy, even in the case of these basic collection operations; it’s important to recognize that.

[学习笔记] Logistic 回归

本文为 Stanford 的机器学习公开课第三周的笔记,不过和上课内容的细节有些出入。

Logistic 回归解决的并不是回归问题,而是分类问题,即目标变量(target variable)的值是离散而非连续的。 这时目标变量也可称为标签(label)。如果仍用线性回归硬搞,得到的结果会非常不靠谱。

我们先考虑简单的情况:数据点只有 0 和 1 两个标签(binary classification),即 , 且大致上是线性可分的(linearly separable)。如图:

那么现在问题就是要找到一个 ,使得直线 1 能够将上面图中的 positive 和 negative 两类数据点“分开”;这样的一条直线称为决策边界(decision boundary)。 ——但具体什么叫“分开”?或者说,如果已知 ,对于一个标签未知的数据点 ,怎么判断它是 positive 还是 negative?

直观上看,在给定 的情况下, 应该满足一个0-1 分布,即

其中 应该满足:

  1. 首先
  2. ,我们可以认为 的可能性相对更大,即 , 且如果数据点离边界越远,即 越大, 也应该越大。
  3. 反之,若 ,则 ,且随 减小而减小。
  4. 时,点落在决策边界上,我们无从判断它的标签 ,因此

应该取什么样的函数呢?我们知道 logistic 函数 恰巧满足上述条件, 不妨就取它(“logistic 回归”也因此得名)2。即:

logistic 函数的图像是:

可以直观地看到它确实是满足我们要求的。这样我们就得到了:

我们记 ,表示这是我们希望预测的量,也就是模型的假设(hypothesis)3。在实际分类应用中, 时我们可以给出判断 的话就蒙一个吧。

接下来 该怎么求呢?根据log极大似然法,可知 的最优值就是 ,其中

其中第二步有点小 tricky。注意到 时值为 ,在 时值为 。这样就把 两种取值的情况合并写在一个式子里了。

接下来我们可以求出梯度 。如果我们像上篇笔记中一样定义:

可以推导得到:

接下来,根据梯度上升算法 4,我们就可以迭代计算下式来逼近

注意到这个更新式的形式和之前线性规划+最小二乘法+梯度下降得到是一样的,只是 变了。

以上就是用梯度上升做 Logistic 回归的算法了。课上还谈到了另外 3 个问题:

  1. 如果要把 Logistic 回归推广到多个分类的情况,可以利用一种叫“1 vs all”的方法。具体地说,即每次取一个分类为 positive, 其它分类都为 negative。这样针对每一个分类都学习得到一个 Logistic 回归模型。要判断一个新数据点的标签,用每一个模型都测一下, 取对应 最大的分类为预测结果。
  2. 如果数据点不是线性可分的,类似线性回归到多项式回归(polynomial regression)的推广,可以将同一属性次数不同的项看成是彼此独立的, 再和普通 Logistic 回归问题一样处理。
  3. 属性过多可能会造成模型过分复杂,导致过拟合问题。关于 overfit 和 underfit 的问题可以参见这篇文章的 3~6 段,这位学长讲得非常深入浅出了。课上提到的一种解决方法叫 regularization,即对 中希望加以限制的各项 ,在 后加上 ,使这些值过大时“惩罚”目标函数。这里 的取值也要注意, 如果取得太小解决不了过拟合问题,但取得太大又会造成 underfit。

p.s. 今天早睡的目标又达成失败了……………………

  1. 和上节线性规划中一样,每个 是一个 维向量,为书写简便我们在 个属性前再加一维

  2. 这里说“不妨”似乎有点太随意了。事实上,如果假设在给定 的情况下 服从 0-1 分布,那么 取 logistic 函数实际上可以由广义线性模式(Generalized linear model)的假设推导得出。

  3. 注意到和上节课讲的线性回归一样,我们要预测的 其实都是期望

  4. 由于这里要求的是 最大值,因此是梯度上升而非梯度下降。课程视频中取要最小化的目标函数 ,因此是梯度下降。实际是一回事。

[学习笔记] 线性回归(及梯度下降)

这个做的是 Stanford 的机器学习公开课的笔记。前两课很简单,吴大牛讲得又极清楚,我很多地方就简单记了。

线性回归解决的是一个回归问题(怎么像废话),即要预测的目标变量(target variable)是连续值。我们有一个大小为 的训练集, 其中数据点 分别对应目标变量 ;每个 是一个 维向量, 为一个数据点的属性(feature)数。(为后面书写简便我们在 个属性前再加一维 ,即 。)给定这样一个训练集,我们希望知道对于某个训练集外的 , 它对应的 是什么。

线性回归的假设(hypothesis)是,数据点的属性 和对应的目标变量 是线性关系,可以用一条直线 来拟合1,就像下面这样(图片盗自维基):

其中 也是一个 维向量,就是我们想拟合的参数,因为假如能求出 , 对于一个训练集外的点 我们就能预测它对应的 了。

为了评价我们估计的 的优劣,我们需要引入一个目标函数/费用函数(cost function)。 根据最小二乘法,我们定义下面的 为我们需要最小化的目标函数:

其中 这个常系数并非必须,只是为了计算便利加上去的。我们要求的就是

不过在继续之前我们先把 的记法再进一步 vectorize 一下。我们定义一个 设计矩阵 ,其第 行为 ,即

我们再定义

不难推出 可以写成下面的形式:

我们还可以接着求出 梯度 2

我们知道函数在极值处梯度为零,因此第一种解法就是直接求 。这样可以得到一个解析解:

注意其中 是一个 的矩阵,而矩阵求逆的的复杂度一般是 。 所以这种解法在 比较小的时候非常高效,但在 较大时就瞎了。

我们还知道 上某一点的梯度向量指向 增长最快的方向,长度为这个增长的变化率。 因此第二种解法就是让 的值向着 的方向,即 减小最快的方向迭代变化,这样就可以逼近 的极小值了。因此我们可以得到下面被称为 梯度下降的算法:

Repeat until converge { $$ \begin{eqnarray} \theta_{t+1} &:=& \theta_t - \alpha \nabla_\theta J(\theta) \ &=& \theta_t - \frac{\alpha}{m} X^{\rm T} (\vec{h_\theta} - \vec{y}) \end{eqnarray} $$ }

3

其中 称为学习速率(learning rate),意思就是……学习的速率= = 这个参数控制着每次迭代 值变化的保守或激进程度; 如果太大,一次迭代对 改变过大,导致“冲得太猛”越过了极值点没法收敛;如果太小,一次迭代对 改变太少,算法收敛太慢。 决定 的方法是……呃,就是试,“0.01 太小,0.1 太大,0.03 有点小,咦 0.06 正好”这样。

对于线性回归和梯度下降,课上还谈到了另外两个话题。一是属性缩放(feature scaling)。如果不同属性数量级差得太大, 会严重影响梯度下降的性能。这时可以对分别对每项属性做类似“减去均值,除以标准差”这样的缩放。二是 多项式回归(polynomial regression)。 属性值和目标变量值之间的关系可能不是线性的,而是多项式的关系,比如 这样。 但由于这个式子对于 仍然是线性的,所以用最小二乘法时可以直接把 看成是彼此独立的不同的属性, 然后按上面的方法如法炮制。至于多项式回归时属性的次数怎么取,应该会在讲 bias-variance trade-off 的时候讲吧,我到时候再写好了。

  1. 这里讲得有点 sloppy 了(不过你能明白我的意思就行=.=)。更严格一点说,应该是训练集中的数据满足:

    其中 表示误差;各 满足均值为零,方差相同且互不相关 (参见高斯-马尔科夫定理)。

    顺带一提,我们如果进一步假设 ,有 ,则这个模型成为 广义线性模式(Generalized linear model) 的一个特例。我们用极大log似然法

    可以推出和下面提到的最小二乘法一样的形式。——如果你不知道我上面在说什么,忽略吧,不碍得的。

  2. 推导其实挺麻烦的。教授在课上没讲,我也懒得敲了=.=

  3. 注意这里每次迭代的运算都会考虑整个训练集,这称为批量梯度下降(batch gradient descent), 当训练集很大时这可能导致很大的开销。我们也可以每次迭代只依次选取训练集中的一个实例 ,更新

    这个叫增量梯度下降(incremental gradient descent)。好处在于每次迭代开销小,因而收敛速度较快, 在处理大数据集时可能很有用。但坏处在于这算法可能永远都收敛不到极小值,而是一直围着极小值转——不过在实际应用中这个也能接受了。

[学习笔记] Probabilistic Latent Semantic Analysis (pLSA)

Probabilistic latent semantic analysis (概率潜在语义分析,pLSA) 是一种 Topic model,在99年被 Thomas Hofmann 提出。它和随后提出的 LDA 使得 Topic Model 成为了研究热点,其后的模型大都是建立在二者的基础上的。

我们有时会希望在数量庞大的文档库中自动地发现某些结构。比如我们希望在文档库发现若干个“主题”,并将每个主题用关键词的形式表现出来。 我们还希望知道每篇文章中各个主题占得比重如何,并据此判断两篇文章的相关程度。而 pLSA 就能完成这样的任务。

我之前取了 Wikinews 中的 1000 篇新闻,试着用 pLSA 在其中发现 K=15 个主题。比如一篇关于 Wikileaks 的阿萨奇被保释消息的新闻,算法以 100% 的概率把它分给了主题 9,其关键词为:

media phone hacking wikileaks assange australian stated information investigation murdoch

可以看到这个主题的发现还是非常靠谱的。又比如这条中国人民的老朋友威胁要大打打核战争的新闻。 算法把它以 97.7% 的概率分给了主题 3,2.3% 的概率分给了主题 7。主题 3 的关键词是:

south north court china military death tornado service million storm

主题 7 的关键词是:

nuclear plant power japan million carbon radiation china water minister

可以看到这条新闻和主题 3 中的“南北”、“军事”、“中国”、“死亡”这些信息联系在一起,和主题 7 中的“核”、“中国”联系在一起。 应该是因为我的数据集中与北朝鲜核问题相关的新闻只有不到 10 条,而 10 个主题的划分并不够细致,所以关于“朝核问题”或者“核武器”的这样的主题并没能被分离出来。但可以看到即使是这样结果也是很 make sense 的。

那我们就来看看 pLSA 模型是怎么回事吧。和很多模型一样,pLSA 遵从 bag-of-words 假设, 即只考虑一篇文档中单词出现的次数,忽略单词的先后次序关系,且每个单词的出现都是彼此独立的。 这样一来,我们观察到的其实就是每个单词 在每篇文档 中出现的次数 。 pLSA 还假设对于每对出现的 都对应着一个表示“主题”的隐藏变量 。 pLSA 是一个生成模型,它假设 之间的关系用贝叶斯网络表示是这样的(从 (Blei et al., 2003) 偷的图):

实心的节点 表示我们能观察到的文档和单词,空心的 表示我们观察不到的隐藏变量,用来表示隐含的主题。图中用了所谓的“盘子记法”, 即用方框表示随机变量的重复。这里方框右下角的字母 分别表示有 篇文档,第 篇文档有 个单词。每条有向边表示随机变量间的依赖关系。也就是说,pLSA 假设每对 都是由下面的过程产生的:

  1. 的先验概率选择一篇文档
  2. 选定 后,以 的概率选中主题
  3. 选中主题 后,以 的概率选中单词

而我们感兴趣的正是其中的 :利用前者我们可以知道每篇文章中各主题所占的比重, 利用后者我们则能知道各单词在各主题中出现的概率,从而进一步找出各主题的“关键词”。记 , 表示我们希望估计的模型参数。当然 不仅仅代表两个数,而是对于每对 , 我们都要希望知道 的值。也就是说,模型中共有 个参数。我们还知道:

根据最大log似然估计法,我们要求的就是

这里由于 这一项与 无关,在 中可以被直接扔掉。1因此

这里出现了 的形式,导致很难直接拿它做最大似然。但假如能观察到 ,问题就很简单了。于是我们想到根据 EM 算法 (参见我的上篇笔记),可以用下式迭代逼近

其中

在 E-step 中,我们需要求出 中除 外的其它未知量,也就是说对于每组 我们都需要求出 。 根据贝叶斯定理贝叶斯定理,我们知道:

就是上轮迭代求出的 。这样就完成了 E-step。

接下来 M-step 就是要求 了。利用基本的微积分工具 2,可以分别对每对 求出:

以上就是 pLSA 算法了。最后贴个我用 MATLAB 写的实现 3

plsa.m
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
function [p_w_z, p_z_d, Lt] = pLSA(n_dw, n_z, iter_num)
% PLSA	Fit a pLSA model on given data
%       in which n_dw(d,w) is the number of occurrence of word w
%		in document d, d, n_z is the number of topics to be discovered
%		

% pre-allocate space
[n_d, n_w] = size(n_dw); % max indices of d and w
p_z_d = rand(n_z, n_d); % p(z|d)
p_w_z = rand(n_w, n_z); % p(w|z)
n_p_z_dw = cell(n_z, 1); % n(d,w) * p(z|d,w)
for z = 1:n_z
    n_p_z_dw{z} = sprand(n_dw);
end

p_dw = sprand(n_dw); % p(d,w)
Lt = []; % log-likelihood
for i = 1:iter_num
    %disp('E-step');
    for d = 1:n_d
        for w = find(n_dw(d,:))
            for z = 1:n_z
                n_p_z_dw{z}(d,w) = p_z_d(z,d) * p_w_z(w,z) * ...
					n_dw(d,w) / p_dw(d, w);
            end
        end
    end

    %disp('M-step');
    %disp('update p(z|d)')
    concat = cat(2, n_p_z_dw{:}); % make n_p_z_dw{:}(d,:)) possible
    for d = 1:n_d
        for z = 1:n_z
            p_z_d(z,d) = sum(n_p_z_dw{z}(d,:));
        end
        p_z_d(:,d) = p_z_d(:,d) / sum(concat(d,:));
    end

    %disp('update p(w|z)')
    for z = 1:n_z
        for w = 1:n_w
            p_w_z(w,z) = sum(n_p_z_dw{z}(:,w));
        end
        p_w_z(:,z) = p_w_z(:,z) / sum(n_p_z_dw{z}(:));
    end

    % update p(d,w) and calculate likelihood
    L = 0;
    for d = 1:n_d
        for w = find(n_dw(d,:))
            p_dw(d,w) = 0;
            for z = 1:n_z
                p_dw(d,w) = p_dw(d,w) + p_w_z(w,z) * p_z_d(z,d);
            end
            L = L + n_dw(d,w) * log(p_dw(d, w));
        end
    end

    Lt = [Lt; L];
    %plot(Lt); ylim([2*median(Lt)-L-0.1 L+(L-median(Lt))/2+0.1]);
    %drawnow; pause(0.1)
end

end

第一次拿 Mablab 写程序,比较丑……4下图是 Log 似然度随迭代收敛的情况。可以看到收敛速度还是相对较快的。 而且由于是 EM 算法的缘故,Log 似然度确实是单调上升的.

最后,pLSA 的问题是在文档的层面上没有一个概率模型,每篇文档的 P(d|z) 都是需要拟合的模型参数。 这就导致参数的数目会随文档数目线性增长、不能处理训练集外的文档这样的问题。所以02 David Blei、Andrew Ng(就是正在 ml-class.org 里上公开课的那位) 和 Michael Jordan 又提出了一个更为简洁的模型:LDA。有时间的话下次再写了。


参考文献

  • (Hofmann, 1999) Hofmann, T. 1999. Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval SIGIR 99. pages, (1999), 50-57.
  • (Gildea & Hofmann, 1999) Gildea, D. and Hofmann, T. 1999. Topic-based language models using EM. Proceedings of the 6th European Conference on Speech (1999), 2167-2170.
  • (Brants, 2005) Brants, T. 2005. Test Data Likelihood for PLSA Models. Information Retrieval. (2005), 181-196.
  • (Blei et al., 2003) Blei, D.M. et al. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research. 3, 4-5 (2003), 993-1022.

  1. 这里 Hofmann 自己在 (Hofmann, 1999) 和 (Gildea & Hofmann, 1999) 中使用了不同的形式。本文和 (Gildea & Hofmann, 1999)、(Brants, 2005) 一样选择不去理会 。因为正如 (Brants, 2005) 中指出、(Blei et al., 2003) 及很多其它文献吐槽的那样,(Hofmann, 1999) 中的模型算出的 实在坑爹,当 不在训练集中时 就一律为0,没什么意义,还不如别估计它呢。另外 (Hofmann, 1999) 中额外引入了一个参数 来“解决”过度拟合问题,但 (Brants, 2005) 中指出这一问题实际并不存在,因此本文也对此忽略不提。

  2. 具体而言,这里要求的是 约束条件下的极值。根据拉格朗日乘数法,解: $$ \nabla_\theta \left( Q(\theta) + \sum_z \alpha_z (\sum_w P(w|z) -1) + \sum_d \beta_d (\sum_z P(z|d) -1) \right) = \mathbf{0} $$

  3. 吐槽:用 Matlab 做简单字符串处理怎么都那么恶心!长度不同的字符串竟然算是不同类型的!Cell array 怎么那么难用!

[学习笔记] Expectation-Maximization(EM) 算法

果然还是有必要保持记笔记的习惯呐。前两天实验室讲 pLSA 的推导,用到 EM 竟然完全记不清了,竟然还把 Jensen 不等式和 SVM 的 minmax=maxmin 什么的记混= = (这么一说 SVM 具体怎么回事也记不清了。。。)赶紧补一篇复习笔记。

Expectation-Maximization 算法是统计学中用来给带隐含变量的模型做最大似然(和最大后验概率)的一种方法。EM 的应用特别广泛,经典的比如做概率密度估计用的 Gaussian Mixture Model。这两天我或许还会写 p 的笔记放上来,也是 EM 应用的例子。

下面我会先解释 EM 算法要解决的问题,然后直接给出算法的过程,最后再说明算法的正确性。

问题

首先我们定义要解决的问题。给定一个训练集 ,我们希望拟合包含隐含变量 的模型 中的参数 。根据模型的假设,每个我们观察到的 还对应着一个我们观察不到的隐含变量 ,我们记 。做最大对数似然就是要求 的“最优值”:

其中

想用这个的形式直接求解 往往非常困难。而如果能观察到隐含变量 ,求下面的似然函数的极大值会容易许多:

问题是实际上我们没法观察到 的值,只能在给定 时求 的后验概率 1 EM 算法就可以帮我们打破这样的困境。

算法

下面给出 EM 算法的过程。其中 是第 t-1 次迭代时算出的 值; 为任意初值。

Repeat until converge {
  1. (E-step) Calculate \( P(Z|X;\theta_t) \), in order to get: $$ \begin{eqnarray} E_{Z|X;\theta_t}[L(\theta;X,Z)] &:=& E_{Z|X;\theta_t}[\log{P(X,Z;\theta)}] \ &=& \sum_Z P(Z|X;\theta_t) \log{P(X,Z;\theta)} \end{eqnarray} $$
  2. (M-step) $$\theta_{t+1} := \arg\max_\theta E_{Z|X;\theta_t}[\log{P(X,Z;\theta)}]$$
}

对,就这么短。所以我总觉得称之为 algorithm 不如称之为 method 更恰当。上面的过程在收敛后就得到了我们需要的 2

先简单说说这短短两步都做了些啥。EM 算法每次迭代都建立在上轮迭代对 的最优值的估计 上,利用它可以求出 的后验概率 ,进而求出 在分布 上的期望 。在第一节中我们提到 在未知 的情况下难以直接计算,于是 EM 算法就转而通过最大化它的期望 来逼近 的最优值,得到 。注意由于 的这个期望是在 的一个分布上求的,这样得到的表达式就只剩下 一个未知量,因而绕过了 未知的问题。而 又可以作为下轮迭代的基础,继续向最优逼近。算法中 E-step 就是在利用 求期望 ,这就是所谓“Expectation”;M-step 就是通过寻找 最大化这个期望来逼近 的最优值,这就叫“Maximization”。EM 算法因此得名。
另外,如果数据满足独立同分布的条件,分别枚举 的值可能要比枚举整个 方便些,可把 替换成:

原理

为什么这样 E一步,M一步,一步E,一步M,就能逼近极大似然?具体而言,为什么通过迭代计算 可以逼近 的最优值 ?我们稍后会看到,这是因为每次迭代得到的 一定比 更优,即算法是在对 的最优值做单调逼近。

不过首先让我们先抛开最大似然,考虑一个更一般的问题。假设有一个凹函数 ,我们想求 ,但直接求很困难。不过对于任意给定的 ,假设我们都能找到 的一个下界函数 ,满足 ——我管 这样的函数叫 的“在 处相等的下界函数”。现在考虑 ,它一定会满足:

也就是说, 一定比 更优。而接下来我们又可以用 找到一个 ,再据此算出比 还优的 。如此不断迭代,就能不断步步逼近 的最优值了。由此可见,如果对任意 都能找到 的“在 处相等的下界函数”,我们就得到了一个能单调逼近 的最优值的算法:

Repeat until converge {
  1. 找到函数 \( F(\theta) \) 的“在 \( \theta_t \) 处相等的下界函数” \( G_{\theta_t}(\theta) \)
  2. \( \theta_{t+1} := \arg\max_\theta G_{\theta_t}(\theta) \)
}

上面的算法看起来和 EM 算法的每步都分别对应——事实上也正如此。下面是从 PRML 中偷的一张图改的,展示了上述逼近的过程:

现在我们回到最大似然问题 。如果我们想套用上面的算法来逼近 的这个最优解,就需要找到对于每个 ,函数 的“在 处相等的下界函数”。该怎么找呢?让我们从 的初始形式开始推导:

又卡在这个的形式上了……我们说过麻烦在于观察不到 的值,那不妨给它任意引入一个概率分布 3,利用分子分母同乘 的小 trick,得到:

根据 Jensen 不等式 4,对于任意分布 都有:

且上面的不等式在 为常数时取等号。

于是我们就得到了 的一个下界函数。我们要想套用上面的算法,还要让这个不等式在 处取等号,这就这要求在 为常数,即 。由于 是一个概率分布,必须满足 ,所以这样的 只能是 。那我们就把 代入上式,得到:
且该不等式在 时取到等号。那么…… 就是 的“在 处相等的下界函数”——这不就是我们要找的么!于是把它塞进本节开始得到的算法里替换“”就可以用啦。也就是说,迭代计算 就可以逼近 的最优值了。而由于利用 Jensen 不等式的那一步搞掉了的形式,它算起来往往要比直接算 容易不少。

我们还可以再做几步推导,得到一个更简单的形式:

其中倒数第二步是因为 这一项与 无关,所以就直接扔掉了。这样就得到了本文第二节 EM 算法中的形式——它就是这么来的。

以上就是 EM 了。至于独立同分布的情况推导也类似。

顺带一提, 有时也比较难算。这时我们其实可以退而求其次,不要求这个期望最大化了,只要它在 处的值大于在 处的值就行了。根据上面的推导,这样也能逼近 的最优值,只是收敛速度较慢。这就是所谓 GEM (Generalized EM) 算法了。

p.s. MathJax 很神嘛。

p.p.s. 这篇笔记竟然断断续续写写改改了两天多,我对 EM 的认识也越来越清晰。“‘教’是最好的‘学’”真是一点没错。


更新历史:

  • 2011.10.12 重写了“原理”部分,把利用函数的“在 处相等的下界函数”逼近 的最优值的算法单独提到前面说,这样似乎清楚很多。
  • 2011.10.13 修正了对在利用 Jensen 不等式的那一步要取 的解释

  1. 一般可以利用贝叶斯定理

    往往是模型假设的一部分。

  2. 实际上在某些特殊情况下, 还可能收敛在局部最优点或鞍点上。这时可以多跑几次算法,每次随机不同的 ,最后取最好的结果。为简明起见,本文忽略这种情况。

  3. 为概率分布,意即需满足

  4. Jensen 不等式:

    为凸函数, 为随机变量。则

    是严格凸的,则上式取等号当前仅当 为常数。

    在这里 函数是严格凹的,所以要把上面的不等号方向调转。

Thanksgiving – George Winston

谢谢你,因为这183天中的每时每刻,因为那所有的一切。

别了。

To the Daisy

        To the Daisy
                     by W. Wordsworth

With little here to do or see
Of things that in the great world be,
Sweet Daisy! oft I talk to thee,
      For thou art worthy,
Thou unassuming commonplace
Of Nature, with that homely face,
And yet with something of a grace
      Which Love makes for thee!	 

Oft on the dappled turf at ease
I sit and play with similes,
Loose types of things through all degrees,
      Thoughts of thy raising;
And many a fond and idle name
I give to thee, for praise or blame,
As is the humour of the game,
      While I am gazing.	 

A nun demure, of lowly port;
Or sprightly maiden, of Love's court,
In thy simplicity the sport
      Of all temptations;
A queen in crown of rubies drest;
A starveling in a scanty vest;
Are all, as seems to suit thee best,
      Thy appellations.	 

A little Cyclops, with one eye
Staring to threaten and defy,
That thought comes next—and instantly
      The freak is over,
The shape will vanish, and behold!
A silver shield with boss of gold
That spreads itself, some fairy bold
      In fight to cover.	 

I see thee glittering from afar—
And then thou art a pretty star,
Not quite so fair as many are
      In heaven above thee!
Yet like a star, with glittering crest,
Self-poised in air thou seem'st to rest;—
May peace come never to his nest
      Who shall reprove thee!

Sweet Flower! for by that name at last
When all my reveries are past
I call thee, and to that cleave fast,
      Sweet silent creature!
That breath'st with me in sun and air,
Do thou, as thou art wont, repair
My heart with gladness, and a share
      Of thy meek nature!

学习笔记:概率统计与随机过程

上学期“概率统计与随机过程”课的笔记,用 XMind 做的思维导图,仅供参考。。

(需要下载才能看到全部内容。层次很深,请多用F6下钻功能)

从新浪博客搬家到WordPress

要从新浪搬到Wordpres,网上广为流传的方法是利用blogbus的博客搬家服务获得blogbus格式的xml,然后再用一个Python写的脚本把它转换成WordPress认识的格式。但是这种方法在最近新浪博客升级以后就失效了。于是自己用现学的scala写了一个小程序,搬家时能保留标签、目录、评论及评论回复这些信息。

猛击这里下载。注意,此程序仅支持2010年初的新版新浪博客,之前或之后的版本都不支持。

要运行程序,你需要确保已经安装过JRE。双击运行后显示如图界面,填入自己的博客地址(不要省略“ http:// ”),然后点击“Start”即可。这时“Start”按钮变为灰色,标题栏显示“Extracting”。等待几分钟,当标题栏显示为“Done”、“Start”按钮重新变为可用时,程序所在目录下会出现一个blog.xml文件。把这个文件直接导入WordPress就可以了。

代码也打在jar包里了,MIT协议。欢迎报告bug。

下面是废话。

恩由于新浪用了ajax,评论信息是通过xhr异步读取的,用一般的方法没法抓到。我纠结许久,最后是用了非常ad hoc的方法解决的,不知道有没有什么什么不太麻烦的通用解决方案呢。

再扯两句scala。我都想不起来当初具体是怎么想到要学scala的,也许是为了了解下函数式编程,也许只是想在jvm上有一个喜欢的语言吧——Java写起来太不爽了;Java社区的低效和保守也已经开始显出C++的影子。

scala确实是非常强大和灵活;我在见到一些颇富技巧性的hack之后都有些怀疑scala社区的风气会不会慢慢变得像C++社区一样过分热衷技巧的炫耀。不过scala的设计目标就是以较简单的语法规则获得最大的scalability,不需要通过挖掘语言规范里的犄角旮旯来实现一些必要功能,所以不会像C++一样成为一门本身已相当复杂,却还需要别人反过来教语言发明者如何使用的语言。

scala毕竟表现力比Java强太多,代码也简洁太多。比如这次我需要实现一个抛出异常后重试若干次的逻辑,只需定义一个函数:

1
2
3
4
5
6
7
def tryFor[T](times: Int)(op: => T): T = {
  if (times <= 0) throw new RuntimeException("Operation failed.")
  try { return op } catch {
    case e: Throwable => e.printStackTrace
    tryFor(times - 1)(op)
  }
}

然后这样使用:

1
val source = tryFor(5) {new Source(url)}

程序就会不断获得网页源代码,并在5次失败后抛出异常。Java实现同样的东西可不会如此优雅了。又如下面这段代码返回一篇博文xml:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
private def generateEntryXml(entry: BlogEntry) = {
  <item>
    <title>
      {entry.title}
    </title>
    <wp:post_date>
      {dateFormat.format(entry.postDate)}
    </wp:post_date>
    <category>
      {entry.category}
    </category>
    {for (tag <- entry.tags) yield <category domain="tag">{tag}</category>}
    <content:encoded>
      {xml.Unparsed(handleNewLines(entry.content))}
    </content:encoded>
    <wp:status>publish</wp:status>
    {for (comment <- entry.comments) yield generateCommentXml(comment)}
  </item>
}

注意,xml标签直接作为scala的源代码的一部分在代码中出现!虽然我觉得这样会使scala语言多出一种“特殊情况”,增加语言的复杂性,但不得不承认这样的设计确实非常优美简洁。

我比较看好scala,以后自己做跑在jvm上的东西scala应该是首选语言。推荐有兴趣的童鞋也了解一下。