After reading about the Y combinator in the book The Little Schemer, I've found that simply staring at its code is a great way to kill time. In the past few days, whenever I got bored (or needed an excuse for procrastination), I spent hours trying to make sense of its mysterious implementation. It's like monads: everybody has a hard time reaching that "Eureka" moment, but once they finally grok it, they believe they can explain it more clearly than the tutorial they read (which is often not the case).

So here I am writing yet another Y combinator tutorial. I think the reason the Y combinator is so hard to fathom is that when you read the code, you are confronted with its entire complexity at once. In fact, however, its derivation can be broken into a few fairly intuitive steps, and if each step is well-motivated, we no longer need a "Eureka" to understand the whole thing.

The code snippets in this tutorial are written in Racket, but it should be trivial to translate them into other Lisp dialects or other dynamic functional languages. 1

## Y? Why?

Before deriving the Y combinator, let's see what it is about. Consider the following simple recursive function that returns the length of a given list:
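The tutorial's own snippets are in Racket; here is the same function as a Python sketch (`length` is used to avoid shadowing Python's built-in `len`):

```python
def length(lst):
    # Recursive length: note that `length` refers to itself by name.
    if not lst:
        return 0
    return 1 + length(lst[1:])

print(length(["a", "b", "c"]))  # → 3
```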

Simple enough, except that if you think about it, it’s quite weird: len is used in the process of defining len itself - that is, the identifier len is used before it has been bound to anything, since the lambda to which we intend to bind it hasn’t yet been defined at that point! We should expect an “unbound identifier” error, but somehow the interpreter helps take care of it. But what if the interpreter is not so helpful? We have been able to write anonymous functions using lambda literals; can we write recursive anonymous functions, too?

Answering this question helps us better understand the essence of the language. If recursive anonymous functions are impossible, then define must be something essential to the language; otherwise it is merely syntactic sugar, since everything can be anonymous.

## Tell Me About Ur-self

We can’t write the len function as we like to, so what shall we do? Maybe we can write some other function mk-len that constructs len for us. Let’s start by simply wrapping a lambda without arguments around len to get a new function mk-len. Then all uses of len can be substituted with (mk-len).
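As a Python sketch of this step (the original is Racket): wrap a zero-argument function around the body, so every recursive use becomes a call `mk_len()`:

```python
def mk_len():
    # mk_len takes no arguments and builds the length function;
    # the recursion now goes through the call mk_len() instead of the name len.
    return lambda lst: 0 if not lst else 1 + mk_len()(lst[1:])

length = mk_len()
print(length([1, 2, 3]))  # → 3
```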

We haven’t got rid of the recursion in mk-len yet. Well, to achieve the effect of recursion, mk-len needs to somehow refer to itself using some name anyway - but not the name we defined. Is there another way of binding a value to a name?

It turns out that there is: when we pass a value to a function as its argument, the value is automatically bound to the name of the corresponding parameter within the body of the function. That’s why ((λ (x) x) 123) returns 123 instead of an error “unbound identifier: x”.

So how about passing the function mk-len as an argument of itself, so that it can refer to itself using the name of that parameter? Let's see:
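Sketched in Python (again, the tutorial's code is Racket):

```python
# mk_len now receives itself as the parameter `self` and recurses through it.
mk_len = lambda self: lambda lst: 0 if not lst else 1 + self(self)(lst[1:])

length = mk_len(mk_len)      # pass mk_len to itself
print(length([1, 2, 3, 4]))  # → 4
```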

With this simple trick, now mk-len can recurse without knowing its name! Let’s further extract a function U to do the self-application (mk-len mk-len):
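In Python, the extracted self-application helper looks like:

```python
U = lambda f: f(f)   # the U combinator: apply a function to itself

mk_len = lambda self: lambda lst: 0 if not lst else 1 + self(self)(lst[1:])
length = U(mk_len)
print(length([1, 2]))  # → 2
```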

Now we no longer need to give the function mk-len a name:
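A Python sketch, with no name ever given to the length-building function:

```python
U = lambda f: f(f)

# an anonymous recursive length function, applied on the spot
n = U(lambda self: lambda lst: 0 if not lst else 1 + self(self)(lst[1:]))([10, 20, 30])
print(n)  # → 3
```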

Congratulations! Now you know how to do recursion on anonymous functions!

By the way, the simple function U has a name: the U combinator. (Hold on, we’ll get to the Y soon.)

## The Chicken Or The Egg Or The Fixed Point

While the U combinator allows us to write recursive anonymous functions, at the point of recursion we must always perform the ugly self-application (self self). This is annoying. We would really like to write our mk-len as:
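That is, sketched in Python:

```python
# the recursion point is now a plain parameter `rlen`; no (self self) in sight
mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])
```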

Now the problem is to find some function F that transforms mk-len into another function that behaves exactly the same as len.

First let’s take a look at this new function for a moment. It takes a function rlen as its argument and returns another function. It basically says: “give me a function rlen that can be used to compute the length of (rest lst), then I will make a function for you that can compute the length of lst”.

If lst is empty, which rlen you provide doesn't matter, since ((mk-len rlen) empty) always returns 0 without calling rlen. But when lst is not empty, if we want ((mk-len rlen) lst) to return the length of lst, we need to provide an rlen that correctly computes the length of (rest lst). If we provide a function that can compute the length of a list with 10 elements, mk-len returns a function that can compute the length of a list with 11 elements. If rlen is able to compute the length of a list with 100 elements, (mk-len rlen) can compute the length of a list with 101 elements.

It follows that if we could provide a rlen such that (rlen (rest lst)) always correctly returns the length of (rest lst) for any non-empty lst, then ((mk-len rlen) lst) would always return the length of lst - that is, (mk-len rlen) would behave like the recursive len we defined at the beginning of this tutorial.

So what function should rlen be? Since lst can be any list, (rest lst) can also be any list (when lst is not empty). So we are basically asking for a rlen that is able to compute the length of any list - that is, rlen should also behave like len.

In other words, to transform mk-len into a function that behaves like len, we need to pass it a function rlen that behaves like len, but the only way to get such a function is by transforming mk-len. Now we seem stuck in a chicken-and-egg dilemma!

What should we do? Let’s just cheat by passing len to mk-len:
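The cheat, as a Python sketch:

```python
def length(lst):
    # the ordinary recursive version, used here only to "cheat"
    return 0 if not lst else 1 + length(lst[1:])

mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])

cheat = mk_len(length)
print(cheat([1, 2, 3]))  # → 3
```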

It works as we expected. Observing carefully, we notice something curious: (mk-len len) not only behaves like len; the two are in fact exactly the same function, i.e. len = (mk-len len). Therefore len is, by definition, the fixed point of mk-len. We can just define len in terms of mk-len following this definition, and it is equivalent to the original len.

Now we can extract a function F that computes mk-len’s fixed point len:
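A Python sketch of such an F (note that fx still recurses through its own name, which is the "cheat" addressed in the next section):

```python
mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])

def F(f):
    # fx is the fixed point of f; it refers to itself by name
    def fx(x):
        return f(fx)(x)
    return fx

length = F(mk_len)
print(length([1, 2, 3]))  # → 3
```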

And again the function mk-len can be anonymous:
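In the Python sketch:

```python
def F(f):
    def fx(x):
        return f(fx)(x)
    return fx

# mk_len inlined anonymously
print(F(lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:]))([1, 2]))  # → 2
```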

Congratulations (again)! Now you know how to do recursion on anonymous functions without the ugly (self self)! We are done…right?

## Simply Y

Not quite, because we are still cheating: the definition of fixed point fx in F blatantly calls its own name to recurse. It seems that we are starting all over again. Are we making any progress?

In fact we have made progress: users of our F function can now write recursive anonymous functions with a very clean syntax, and we are free to do whatever we want to eliminate the recursion of fx inside F, as long as its behavior doesn't change.

What do we do? Remember how we used the simple U combinator to avoid recursion? Let’s try it on fx:
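Applying the same self-application trick to fx, sketched in Python:

```python
U = lambda f: f(f)

def F(f):
    # fx no longer recurses by name: U(self) stands in for the fixed point
    return U(lambda self: f(lambda x: U(self)(x)))

mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])
print(F(mk_len)([1, 2, 3]))  # → 3
```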

The self-application (self self) is still somewhat ugly, but it only appears once in F. Someone needs to do the dirty job, so that all functions like mk-len can be pretty. That’s life.

Now if we put the definition of U in F, and change a few names, we get:
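The result, as a Python sketch:

```python
# Y = λf. (λx. f (λv. ((x x) v))) (λx. f (λv. ((x x) v)))
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

length = Y(lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:]))
print(length([1, 2, 3, 4, 5]))  # → 5
```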

This Y is what we call (drum roll) … the famous Y combinator, discovered by Haskell Curry! There you have it. Not that hard, right?

In fact, if we define F in different ways and then use U to remove recursion, we can easily get many non-recursive functions that work just like Y. For example, because (F f) returns the fixed point of f, then by definition (F f) = (f (F f)). Let’s write the definition of F in this way2:
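A Python sketch of this alternative F:

```python
def F(f):
    # (F f) = (f (F f)); the inner lambda delays evaluating F(f) (see footnote 2)
    return f(lambda x: F(f)(x))

mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])
print(F(mk_len)([1, 2]))  # → 2
```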

Similarly, we can use U to remove the recursion:
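In Python:

```python
U = lambda f: f(f)

# the same F, with its recursion on F replaced by self-application
F = U(lambda self: lambda f: f(lambda x: U(self)(f)(x)))

mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])
print(F(mk_len)([1, 2, 3]))  # → 3
```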

Then, again, put in the definition of U and change some names:
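Sketched in Python:

```python
# Θ = (λx. λf. f (λv. (((x x) f) v))) (λx. λf. f (λv. (((x x) f) v)))
Theta = (lambda x: lambda f: f(lambda v: x(x)(f)(v)))(
         lambda x: lambda f: f(lambda v: x(x)(f)(v)))

length = Theta(lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:]))
print(length([1, 2, 3]))  # → 3
```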

This Θ is what we call (drum roll) … the (less) famous Turing combinator, discovered by Alan Turing!

I can go on and on. Such higher-order functions, like Y and Θ, that compute a fixed point of other functions are called fixed-point combinators. In fact, there are infinitely many of them.

So Y is not so mysterious, and it’s not so special, either. I hope you are not too disappointed.

## Wrap It Up (Pun Intended)

Now you know what Y combinator does: it computes the fixed point of another function. We can use Y to achieve anonymous recursion because a recursive function (like len) can be rewritten as the fixed-point of another function (like mk-len). So (Y mk-len) gives us len.

But there's something more interesting. Consider some function f that, in its definition, invokes another function g. If we want to control whom f is talking to, we can make g a parameter. Now step back and stare at mk-len for a moment. Think about what we have done: we parameterized the point of recursion! When the normal len calls itself, the target is fixed, and there's nothing we can do. mk-len, however, has no idea what rlen we will pass to it! By writing recursion as fixed-point computation, we gain the power and freedom to control what happens in a recursive call without modifying mk-len's code.

Suppose we want to print some log at each recursive call. For a normal recursive function like len, there's nothing we can do but modify the code. If there are 10 such functions, we need to modify each of them, creating a lot of mess. On the other hand, for functions like mk-len, we can obviously just modify our fixed-point combinator:
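A sketch of such a modified combinator in Python (the names are mine; it logs before each recursive call):

```python
def Y_log(f):
    # same behavior as Y, but every recursive call is logged before it runs
    def wrap(x):
        def call(v):
            print("recursive call on:", v)
            return x(x)(v)
        return f(call)
    return wrap(wrap)

length = Y_log(lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:]))
print(length(["a", "b", "c"]))
# recursive call on: ['b', 'c']
# recursive call on: ['c']
# recursive call on: []
# 3
```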

This prints out the following:

Now if we want to print a log after each recursive call instead, we would need to modify the Y again. If we want to do something else, modify again…

Well, we can do better than this. We can create wrappers around functions like mk-len that control the recursive calls in different ways. Here's an example:
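A Python sketch of such a wrapper (I call it log_start_wrapper; it logs before each recursive call, and the plain Y is untouched):

```python
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

def log_start_wrapper(mk):
    # wrap mk so that every recursive call is logged before it starts
    def wrapped(recur):
        def logged(v):
            print("start:", v)
            return recur(v)
        return mk(logged)
    return wrapped

mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])
length = Y(log_start_wrapper(mk_len))
print(length([1, 2]))
# start: [2]
# start: []
# 2
```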

It prints out the following:

Similarly, we can define log-end-wrapper, which prints a log after each recursive call:
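A Python sketch of the "after" variant (again, the name log_end_wrapper is mine):

```python
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

def log_end_wrapper(mk):
    # wrap mk so that every recursive call logs its result after returning
    def wrapped(recur):
        def logged(v):
            result = recur(v)
            print("end:", v, "->", result)
            return result
        return mk(logged)
    return wrapped

mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])
length = Y(log_end_wrapper(mk_len))
print(length([1, 2]))
# end: [] -> 0
# end: [2] -> 1
# 2
```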

It prints out:

We can even use several wrappers together:
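Composing the two sketches from above:

```python
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

def log_start_wrapper(mk):
    def wrapped(recur):
        def logged(v):
            print("start:", v)
            return recur(v)
        return mk(logged)
    return wrapped

def log_end_wrapper(mk):
    def wrapped(recur):
        def logged(v):
            result = recur(v)
            print("end:", v, "->", result)
            return result
        return mk(logged)
    return wrapped

mk_len = lambda rlen: lambda lst: 0 if not lst else 1 + rlen(lst[1:])
length = Y(log_start_wrapper(log_end_wrapper(mk_len)))
print(length([1, 2]))
# start: [2]
# start: []
# end: [] -> 0
# end: [2] -> 1
# 2
```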

This gives you:

In this example, wrappers act like decorators for recursive functions.

With such flexibility, there is even more funky stuff we can do with wrappers. For example, we can cache the results of recursive calls to avoid redundant computation (a.k.a. memoization), or modify the result (or even change the type of the result) of the recursive calls. You can read the interesting paper "That About Wraps it Up" for more information.

1. In statically typed languages, things get more complicated, or even impossible. Let's just ignore them in this tutorial for the sake of clarity.

2. A subtle point here: we write (λ (x) ((F f) x)) instead of just (F f). This is because otherwise, in order to pass (F f) as an argument of f, we would first need to evaluate (F f), which expands to (f (F f)); to evaluate that, we would again need to first evaluate (F f)… The program would hang until it finally crashes with a stack overflow. Wrapping a lambda around (F f) delays its evaluation until it is actually applied to an argument.

There's an increasingly heated debate about Scala's complexity. A recent post talks about Scala's "true complexity": the terrifying intricacy you will encounter if you try to extend Scala's Collections Library with a method that achieves the "perfect" behavior.

Yet it might also be worth asking: what motivates these "perfect" behaviors? Are they always desirable?

One important aspect of the "perfect" collection behavior is the magical uniform return type principle. Loosely speaking, this means that transforming a collection should give you another collection of the same kind. For instance:
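Sketches of what this looks like in the REPL (result types as I'd expect from a Scala 2.9-era REPL; take the exact printed forms with a grain of salt):

```scala
scala> List(1, 2, 3).filter(_ > 1)
res0: List[Int] = List(2, 3)

scala> Set(1, 2, 3).map(_ + 1)
res1: scala.collection.immutable.Set[Int] = Set(2, 3, 4)

scala> "abc".map(_.toUpper)
res2: String = ABC
```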

Clean and consistent; it couldn't be more intuitive. Of course filtering a List should return a List, and mapping a Set should give you a Set, both at compile time and at runtime. It even works for non-collections like Strings.

When I first saw this, I thought "yeah, cool", and then forgot about it. I took it for granted as just another piece of Scala magic that makes my life easier, without realizing its intricate implications. Not until recently did I realize that Scala's collections were much smarter, or trickier, than I had expected.

The uniform return type principle is fairly straightforward for operations like filter, take, drop and slice, but things get complicated for some other operations like map and flatMap. Consider the following:
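For example, mapping a Set down to Booleans (a sketch of the expected REPL session):

```scala
scala> Set(1, 2, 3, 4).map(_ % 2 == 0)
res0: scala.collection.immutable.Set[Boolean] = Set(false, true)
```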

It starts getting interesting. When you map a Set, you should get a Set, which by definition doesn't contain repeated values. So you get a Set[Boolean] consisting of only two values. You might be surprised here if you are familiar with LINQ in C#. In C#

gives you an IEnumerable that consists of "false, true, false, true" in order, because the HashSet is simply iterated through as a normal IEnumerable. Some say C#/LINQ sucks because the type information of the original collection is "lost".

In Scala, Map[K, V] is a subtype of Iterable[(K, V)] and thus can be seen as a collection of key-value pairs. Now stop for a moment and think about it: what happens if you map a Map?

Ideally, according to the uniform return type principle, mapping a Map should give you another Map. But this is only possible when the target element type is a key-value pair. Otherwise, the best we can do is treat the Map as just an Iterable of key-value pairs and return a List of the target elements.
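A sketch of both cases (expected behavior; printed types may vary by version):

```scala
scala> Map(1 -> "a", 2 -> "b").map { case (k, v) => (v, k) }   // pairs: still a Map
res0: scala.collection.immutable.Map[String,Int] = Map(a -> 1, b -> 2)

scala> Map(1 -> "a", 2 -> "b").map { case (k, v) => v }        // not pairs: an Iterable
res1: scala.collection.immutable.Iterable[String] = List(a, b)
```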

However, there is still the problem of inconsistency. Here’s an example:

The same operation on two Sets, but the behaviors are totally different, simply because the two Sets contain values of different types? This is weird. The problem is that 2-tuples were chosen to represent key-value pairs, but they are not used exclusively for that purpose. So when users map elements to 2-tuples, this ambiguity is unavoidable.

It becomes even more error-prone when the actual class of the collection is not known at compile time. We know that Map has a method keys that gives you an Iterable of its keys. Consider the following example:
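A sketch of the kind of surprise involved (hypothetical session; both keys square to the same value, so the runtime Set collapses them):

```scala
scala> val ks: Iterable[Int] = Map(1 -> "a", -1 -> "b").keys
ks: Iterable[Int] = Set(1, -1)

scala> ks.map(x => x * x)   // statically an Iterable, dynamically a Set
res0: Iterable[Int] = Set(1)
```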

Ah… the runtime type of keys is Set[Int], so repeated values in the resulting collection are discarded as well. The behavior should surprise you if you haven't thought it through. In fact, when given an Iterable without knowing its actual type, the behavior of collection operations is simply unpredictable. Scala folks can argue that this is just how it works, but it doesn't seem like a good thing, given that Iterables are widely used across interfaces to pass collections around without exposing their internal structure.

To sum up, the behavior of the map operation depends not only on the type of the collection on which it is invoked, both static and dynamic, but also on the type of the mapped elements.

How is this magic implemented? If you take a look at the source code, you will probably see the most complicated collection library you've ever seen in your life. I'll try to give a brief, simplified explanation. Here is the actual signature of the map method:
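Roughly (as found in TraversableLike in Scala 2.x; the prose below uses This for Repr and MappedElem for B):

```scala
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That
```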

As you can see, there's an extra implicit parameter bf of type CanBuildFrom[This, MappedElem, That], which gives you a Builder[MappedElem, That] that can build a collection (of type That) from the mapped elements (of type MappedElem). In short, CanBuildFrom[This, MappedElem, That] is a factory for Builder[MappedElem, That], which is itself a factory for That. When both type parameters This and MappedElem are given, the compiler finds the best eligible candidate for bf (according to some perplexing rules) and consequently determines the static type of That. For example:

The CanBuildFrom can forward the call to the genericBuilder[MappedElem] method of the collection type inferred at compile time, so that the "right" runtime type can be selected via virtual dispatch.

Now you should have noticed the conceptual and implementation complexity brought about by the seemingly simple "uniform return type principle". But why do we need this in the first place?

Let's digress and talk about LINQ for a moment. I don't think the argument "LINQ sucks because the type information of the original collection is lost" really stands, because that is exactly what LINQ was designed to ignore. Most of the time, we don't care whether an Iterable is a Seq or a Set or a Map or whatever. What we do with a collection is often just querying and consuming its content. LINQ provides a uniform interface with consistent, lazy behavior to facilitate this. When you need a collection of a certain type, you explicitly convert the query result to it. When you need the special behaviors of a Set or Map (Dictionary), you use their own interfaces. But queries are queries; different collections are treated equally. Such simplicity is not a weakness but a feature.

From my point of view, LINQ and Scala's collection operations are cosmetically similar but semantically different. LINQ queries a collection; Scala's collection operations transform a collection. (This may hint at why LINQ doesn't have a ForEach extension method.) Scala's functional nature necessitates the collection-transformation semantics: you need to be able to treat a collection as a value and compute a new value from it. And since Scala is also statically and strongly typed, the "uniform return type principle" is required to keep the system consistent. Thus the behavior of collection operations should of course depend on the actual collection type and the type of the target elements. The operations are non-lazy (strict) by default to avoid unintended delays of the side effects within them (there's a more detailed explanation of the decision on laziness at the bottom of this page).

What if we only want to query and consume the content of a collection, as we do in LINQ? I suggest using an Iterator directly or converting the collection to a Stream:
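For example (a sketch):

```scala
scala> Set(1, 2, 3).toStream.map(_ % 2 == 0)      // lazy, and stays a Stream
res0: scala.collection.immutable.Stream[Boolean] = Stream(false, ?)

scala> Set(1, 2, 3).iterator.map(_ % 2 == 0)      // duplicates preserved on consumption
res1: Iterator[Boolean] = non-empty iterator
```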

What comes with the complexity is great flexibility and power. Scala is expressive and seems quite easy, but the paradox is that you need to spend a lot of time learning how to avoid misusing it. Scala is not easy, even in the case of these basic collection operations; it's important to recognize that.

Logistic regression solves not regression but classification problems, i.e. ones where the target variable takes discrete rather than continuous values. In this setting the target variable is also called the label. If you insist on brute-forcing it with linear regression, the results will be wildly unreliable.

1. First, $0 \leq \phi(\theta^{\rm T} x) \leq 1$.
2. When $\theta^{\rm T} x > 0$, we consider $y=1$ relatively more likely, i.e. $\phi(\theta^{\rm T} x) > 0.5$; moreover, the farther the data point lies from the boundary, i.e. the larger $\theta^{\rm T} x$ is, the larger $\phi(\theta^{\rm T} x)$ should be.
3. Conversely, when $\theta^{\rm T} x < 0$, we should have $\phi(\theta^{\rm T} x) < 0.5$, decreasing as $\theta^{\rm T} x$ decreases.
4. When $\theta^{\rm T} x = 0$, the point lies exactly on the decision boundary and we have no way to judge its label $y$, so $\phi(\theta^{\rm T} x) = 0.5$.

What function should $\phi$ be? We know that the logistic function happens to satisfy all the conditions above, so we may as well use it (which is where the name "logistic regression" comes from)2. That is:
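That is, $\phi(t) = \frac{1}{1+e^{-t}}$. As a quick numeric check that it satisfies the four conditions above (Python sketch):

```python
from math import exp

def phi(t):
    # the logistic (sigmoid) function: maps any real score to (0, 1)
    return 1 / (1 + exp(-t))

print(phi(0))    # 0.5: exactly on the decision boundary
print(phi(6))    # close to 1: far on the positive side
print(phi(-6))   # close to 0: far on the negative side
```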

The graph of the logistic function looks like:

1. To generalize logistic regression to multiple classes, we can use the "one vs. all" method. Concretely: each time, take one class as positive and all the other classes as negative, learning one logistic regression model per class this way. To label a new data point, test it with every model and predict the class whose $h_\theta(x)$ is largest.
2. If the data points are not linearly separable, then, analogously to generalizing linear regression to polynomial regression, we can treat terms of different degrees in the same feature as mutually independent features and then proceed as in ordinary logistic regression.
3. Too many features may make the model overly complex and cause overfitting. On overfitting and underfitting, see paragraphs 3-6 of this article, whose author explains the issue very accessibly. One remedy mentioned in the lecture is called regularization: for each component $\theta_i$ of $\theta$ we wish to constrain, add $-\lambda \theta_i^2$ to $L(\theta)$, "penalizing" the objective when these values grow too large. The choice of $\lambda > 0$ also needs care: too small and it won't cure the overfitting, but too large and it causes underfitting.

p.s. Failed to hit my early-bedtime goal yet again today……

1. As in the linear regression of the previous note, each $x^{(i)}$ is an $(n+1)$-dimensional vector; for brevity we prepend to the $n$ features an extra dimension $x^{(i)}_0=1$.

2. Saying "may as well" here is perhaps a bit too casual. In fact, if we assume that given $x$, $y$ follows a Bernoulli (0-1) distribution, then taking $\phi$ to be the logistic function can actually be derived from the assumptions of the Generalized Linear Model.

3. Note that, just as with the linear regression covered in the previous lecture, the $h_\theta(x)$ we predict is really the expectation $E[y|x;\theta]$.

4. Since what we seek here is the maximum of $L(\theta)$, this is gradient ascent rather than gradient descent. The course videos instead minimize the objective $J(\theta)=-L(\theta)$, hence gradient descent. They are really the same thing.

Repeat until converge { $$\begin{eqnarray} \theta_{t+1} &:=& \theta_t - \alpha \nabla_\theta J(\theta) \\ &=& \theta_t - \frac{\alpha}{m} X^{\rm T} (\vec{h_\theta} - \vec{y}) \end{eqnarray}$$ }
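The update above, sketched in Python for logistic regression on a toy one-feature dataset (my own data, with the intercept column $x_0 = 1$):

```python
from math import exp

X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]  # rows: (x0 = 1, x1)
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
alpha, m = 0.5, len(X)

def h(row):
    # h_theta(x) = logistic(theta^T x)
    z = sum(t * x for t, x in zip(theta, row))
    return 1 / (1 + exp(-z))

for _ in range(2000):
    # batch update: theta := theta - (alpha/m) * X^T (h_theta - y)
    errs = [h(row) - yi for row, yi in zip(X, y)]
    theta = [theta[j] - alpha / m * sum(e * row[j] for e, row in zip(errs, X))
             for j in range(len(theta))]

print(h([1.0, 0.5]) < 0.5, h([1.0, 3.5]) > 0.5)  # both should be True
```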


1. I'm being a bit sloppy here (but you get what I mean =.=). More strictly, the data in the training set satisfy:

where $\epsilon^{(i)}$ denotes the error, and the $\epsilon^{(i)}$ have zero mean, equal variances, and are mutually uncorrelated (see the Gauss-Markov theorem).

Incidentally, if we further assume $\epsilon^{(i)} \sim N(0,\sigma^2)$, so that $y^{(i)}|x^{(i)};\theta \sim N(\theta^{\rm T} x^{(i)},\sigma^2)$, then this model becomes a special case of the Generalized Linear Model. Using maximum log-likelihood,

we can derive the same form as the least-squares method mentioned below. (If you have no idea what I was just talking about, ignore it; no harm done.)

2. The derivation is actually rather tedious. The professor skipped it in class, and I'm too lazy to type it up =.=

3. Note that every iteration here considers the entire training set; this is called batch gradient descent, and it can be very expensive when the training set is large. Alternatively, each iteration can pick just one instance $x^{(i)}$ from the training set in turn and update

Probabilistic latent semantic analysis (pLSA) is a topic model proposed by Thomas Hofmann in 1999. It and the subsequently proposed LDA turned topic models into a hot research area; most later models are built on top of these two.

media phone hacking wikileaks assange australian stated information investigation murdoch

south north court china military death tornado service million storm

nuclear plant power japan million carbon radiation china water minister

1. Pick a document $d$ with prior probability $P(d)$
2. Having picked $d$, pick a topic $z$ with probability $P(z|d)$
3. Having picked topic $z$, pick a word $w$ with probability $P(w|z)$

And $P_t(z|d)$ and $P_t(w|z)$ are just the $\theta_t$ obtained in the previous iteration. This completes the E-step.

• (Hofmann, 1999) Hofmann, T. 1999. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99) (1999), 50-57.
• (Gildea & Hofmann, 1999) Gildea, D. and Hofmann, T. 1999. Topic-based language models using EM. Proceedings of the 6th European Conference on Speech (1999), 2167-2170.
• (Brants, 2005) Brants, T. 2005. Test Data Likelihood for PLSA Models. Information Retrieval. (2005), 181-196.
• (Blei et al., 2003) Blei, D.M. et al. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research. 3, 4-5 (2003), 993-1022.

1. Hofmann himself used different formulations in (Hofmann, 1999) and (Gildea & Hofmann, 1999). Like (Gildea & Hofmann, 1999) and (Brants, 2005), this post chooses to ignore $P(d)$, because, as (Brants, 2005) points out and (Blei et al., 2003) and many other papers complain, the $P(d)$ produced by the model in (Hofmann, 1999) is terrible: it is uniformly 0 whenever $d$ is not in the training set, which is meaningless, so we are better off not estimating it at all. In addition, (Hofmann, 1999) introduces an extra parameter $\beta$ to "solve" an overfitting problem, but (Brants, 2005) shows that this problem does not actually exist, so this post omits it as well.

2. Specifically, what we seek here is the extremum of $Q_t(\theta)$ under the constraints $\sum_w P(w|z) = 1$ and $\sum_z P(z|d) = 1$. By the method of Lagrange multipliers, solve: $$\nabla_\theta \left( Q_t(\theta) + \sum_z \alpha_z (\sum_w P(w|z) -1) + \sum_d \beta_d (\sum_z P(z|d) -1) \right) = \mathbf{0}$$

3. Rant: why is even simple string processing in Matlab so disgusting? Strings of different lengths actually count as different types! And cell arrays are such a pain to use!

The Expectation-Maximization algorithm is a statistical method for maximum likelihood (and maximum a posteriori) estimation in models with latent variables. EM is used extremely widely; a classic example is the Gaussian Mixture Model for density estimation. In the next day or two I may also post my pLSA notes, another example of an EM application.

## The Problem

The problem is that in practice we cannot observe the value of $z$ at all; we can only compute the posterior $P(z|x;\theta)$ of $z$ given $\theta$.1 The EM algorithm can break this deadlock for us.

## The Algorithm

Repeat until converge {
1. (E-step) Calculate $P(Z|X;\theta_t)$, in order to get: $$\begin{eqnarray} E_{Z|X;\theta_t}[L(\theta;X,Z)] &:=& E_{Z|X;\theta_t}[\log{P(X,Z;\theta)}] \\ &=& \sum_Z P(Z|X;\theta_t) \log{P(X,Z;\theta)} \end{eqnarray}$$
2. (M-step) $$\theta_{t+1} := \arg\max_\theta E_{Z|X;\theta_t}[\log{P(X,Z;\theta)}]$$
}

Let me first briefly describe what these two short steps do. Each EM iteration builds on $\theta_t$, the previous iteration's estimate of the optimal $\theta$. Using it we can compute the posterior $P(Z|X;\theta_t)$ of $Z$, and from that the expectation $E_{Z|X;\theta_t}[L(\theta;X,Z)]$ of $L(\theta;X,Z)$ under the distribution $Z \sim P(Z|X;\theta_t)$. As mentioned in the first section, $\arg\max_\theta L(\theta;X,Z)$ is hard to compute directly while $Z$ is unknown, so the EM algorithm instead approaches the optimal $\theta$ by maximizing the expectation $E_{Z|X;\theta_t}[L(\theta;X,Z)]$, obtaining $\theta_{t+1}$. Note that since this expectation of $L(\theta;X,Z)$ is taken over a distribution of $Z$, the resulting expression has $\theta$ as its only unknown, which sidesteps the problem that $z$ is unknown. $\theta_{t+1}$ in turn serves as the basis of the next iteration, moving ever closer to the optimum. The E-step uses $\theta_t$ to compute the expectation $E_{Z|X;\theta_t}[L(\theta;X,Z)]$: that is the "Expectation". The M-step approaches the optimal $\theta$ by finding the $\theta_{t+1}$ that maximizes this expectation: that is the "Maximization". Hence the algorithm's name.
Also, if the data satisfy the i.i.d. condition, enumerating each $z^{(i)}$ separately may be more convenient than enumerating the whole $Z$, and we can replace $E_{Z|X;\theta_t}[L(\theta;X,Z)]$ with:

## The Principle

Repeat until converge {
1. Find a lower-bound function $G_{\theta_t}(\theta)$ of $F(\theta)$ that is equal to it at $\theta_t$
2. $\theta_{t+1} := \arg\max_\theta G_{\theta_t}(\theta)$
}

This gives us a lower-bound function of $L(\theta;X)$. To plug it into the algorithm above, we still need the inequality to become an equality at $\theta_t$, which requires $\frac {P(X,Z;\theta)} {Q(Z)}$ to be a constant when $\theta = \theta_t$, i.e. $Q(Z) \propto P(X,Z;\theta_t)$. Since $Q(Z)$ is a probability distribution and must satisfy $\sum_Z Q(Z) = 1$, the only such $Q(Z)$ is $Q(Z) = \frac {P(X,Z;\theta_t)} {\sum_Z P(X,Z;\theta_t)} = P(Z|X;\theta_t)$. Substituting $Q(Z) = P(Z|X;\theta_t)$ into the expression above, we get:
and this inequality becomes an equality at $\theta = \theta_t$. Well then… $E_{Z|X;\theta_t}\left[ \log\frac{P(X,Z;\theta)}{P(Z|X;\theta_t)} \right]$ is a lower-bound function of $L(\theta;X)$ that equals it at $\theta_t$: exactly what we were looking for! So we can plug it into the algorithm at the beginning of this section in place of "$G_{\theta_t}(\theta)$" and use it right away. In other words, iteratively computing $\arg\max_\theta E_{Z|X;\theta_t}\left[ \log\frac{P(X,Z;\theta)}{P(Z|X;\theta_t)} \right]$ approaches the optimal $\theta$. And since the step that applied Jensen's inequality eliminated the $\log$-of-$\sum$ form, this is usually much easier to compute than $\arg\max_\theta{L(\theta;X)}$ directly.
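To make the E-step/M-step loop concrete, here is a toy EM sketch in Python (my own example, not from the lecture): a mixture of two biased coins, where each trial is 10 flips of one coin but we never observe which coin was used.

```python
from math import comb

flips = 10
heads = [9, 8, 5, 4, 7]          # observed head counts per trial (toy data)
theta = [0.6, 0.5]               # initial guesses theta_0 for the two biases

def lik(h, p):
    # binomial likelihood of h heads out of `flips` tosses with bias p
    return comb(flips, h) * p ** h * (1 - p) ** (flips - h)

for _ in range(100):
    # E-step: posterior P(z | x; theta_t) that each trial used each coin
    resp = []
    for h in heads:
        l = [lik(h, p) for p in theta]
        s = sum(l)
        resp.append([li / s for li in l])
    # M-step: maximize the expected log-likelihood, giving a weighted
    # fraction of heads attributed to each coin
    theta = [sum(r[k] * h for r, h in zip(resp, heads)) /
             sum(r[k] * flips for r in resp)
             for k in range(2)]

print(theta)  # the coin initialized higher gets pulled toward the high-head trials
```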

p.s. MathJax is pretty awesome.

p.p.s. Amazingly, this note took more than two days of on-and-off writing and rewriting, and my understanding of EM kept getting clearer along the way. "Teaching is the best way of learning" could not be more true.

• 2011.10.12 Rewrote "The Principle" section, pulling the algorithm that approaches the optimal $\theta$ via a "lower-bound function equal at $\theta_t$" out to the front on its own; this seems much clearer.
• 2011.10.13 Fixed the explanation of why the step using Jensen's inequality must take $Q(Z) = P(Z|X;\theta_t)$

1. Typically via Bayes' theorem

$P(x|z;\theta)$ and $P(z;\theta)$ are usually part of the model's assumptions.

2. In fact, in some special cases $\theta$ may converge to a local optimum or a saddle point. One can then run the algorithm several times, with a different random $\theta_0$ each time, and keep the best result. For simplicity, this post ignores that situation.

3. $Q(Z)$ being a probability distribution means it must satisfy $\sum_Z Q(Z) = 1$ and $Q(Z) \geq 0$.

4. Jensen's inequality:

Let $f$ be a convex function and $X$ a random variable. Then $$E[f(X)] \geq f(E[X]).$$

If $f$ is strictly convex, the inequality holds with equality if and only if $X$ is constant.

Here the $\log$ function is strictly concave, so the direction of the inequality above is reversed.

        To the Daisy
by W. Wordsworth

With little here to do or see
Of things that in the great world be,
Sweet Daisy! oft I talk to thee,
For thou art worthy,
Thou unassuming commonplace
Of Nature, with that homely face,
And yet with something of a grace
Which Love makes for thee!

Oft on the dappled turf at ease
I sit and play with similes,
Loose types of things through all degrees,
Thoughts of thy raising;
And many a fond and idle name
I give to thee, for praise or blame,
As is the humour of the game,
While I am gazing.

A nun demure, of lowly port;
Or sprightly maiden, of Love's court,
In thy simplicity the sport
Of all temptations;
A queen in crown of rubies drest;
A starveling in a scanty vest;
Are all, as seems to suit thee best,
Thy appellations.

A little Cyclops, with one eye
Staring to threaten and defy,
That thought comes next—and instantly
The freak is over,
The shape will vanish, and behold!
A silver shield with boss of gold
That spreads itself, some fairy bold
In fight to cover.

I see thee glittering from afar—
And then thou art a pretty star,
Not quite so fair as many are
In heaven above thee!
Yet like a star, with glittering crest,
Self-poised in air thou seem'st to rest;—
May peace come never to his nest
Who shall reprove thee!

Sweet Flower! for by that name at last
When all my reveries are past
I call thee, and to that cleave fast,
Sweet silent creature!
That breath'st with me in sun and air,
Do thou, as thou art wont, repair
My heart with gladness, and a share
Of thy meek nature!


(You need to download it to see the full content. The hierarchy runs deep; make liberal use of the F6 drill-down feature.)

Scala really is powerful and flexible. After seeing some rather tricky hacks, I do slightly worry that the Scala community's culture may gradually drift toward the C++ community's excessive enthusiasm for showing off tricks. That said, Scala's design goal is to obtain maximum scalability from relatively simple syntactic rules, without having to dig into obscure corners of the language spec to implement necessary features, so it shouldn't end up like C++: a language that is already deeply complex in itself, and where others have to turn around and teach the language's inventor how to use it.

Scala is, after all, far more expressive than Java, and the code is far more concise. For example, this time I needed logic that retries a given number of times after an exception is thrown, and all it took was defining one function:
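A sketch of what such a helper might look like (names and signature are my own, not necessarily the original code):

```scala
// Run `body`, retrying on exception until `times` attempts in total
// have been made; the last exception propagates if every attempt fails.
def retry[T](times: Int)(body: => T): T =
  try body
  catch {
    case e: Exception if times > 1 => retry(times - 1)(body)
  }

// usage: retry(3) { callFlakyService() }
```

The by-name parameter `body: => T` is what makes this pleasant: the caller just writes a block, and it is re-evaluated on each attempt.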