标签存档: Scala

Surprises in Scala’s “Uniform Return Type” Magic

There’s an increasingly heated debate about Scala’s complexity. A recent post talks about Scala’s “true complexity“: the terrifying intricacy you will encounter, if you try to extend Scala’s Collections Library with a method that achieves the “perfect” behavior.

Yet it also might be worth asking: what are the motivations of these “perfect” behaviors? Are they always desirable?

One important aspect of the “perfect” collection behavior is the magical uniform return type principle. Sloppily speaking, this means that transforming a collection should give you another collection of the same kind. For instance:

// Filtering a List gives you a filtered List
scala> List(1,2,3,4).filter(_%2==0)
res: List[Int] = List(2, 4)

// Mapping a Set gives you a mapped Set
scala> Set(1,2,3,4).map(1+)
res: scala.collection.immutable.Set[Int] = Set(2, 3, 4, 5)

// Uniform return type at both compile time and runtime
scala> val iterable : Iterable[Int] = Set(1,2,3,4)
iterable: Iterable[Int] = Set(1, 2, 3, 4)

scala> iterable.map(1+)
res: Iterable[Int] = Set(2, 3, 4, 5)

// Works for Strings, too
scala> "E is the most frequent letter in English".map(
     |    c => if(c=='e'|c=='E') '3' else c)
res: String = 3 is th3 most fr3qu3nt l3tt3r in 3nglish

Clean and consistent, can’t be more intuitive. Of course filtering a List should return a List, and mapping a Set should give you a Set. Both compile time and runtime. Works even for non-collections like Strings.

When I first saw this, I thought “yeah, cool”, and then forgot about it. I took it for granted as just another Scala magic that makes my life easier, without realizing its intricate implications. Not until recently have I realized that Scala’s collections were much smarter, or trickier, than I had expected.

The uniform return type principle is fairly straightforward for operations like filter, take, drop and slice, but situations get complicated in the cases of some other operations like map and flatMap. Consider the following:

scala> Set(1,2,3,4).map(_ % 2 == 0)
res: scala.collection.immutable.Set[Boolean] = Set(false, true)

It starts getting interesting. When mapping a Set, you should get a Set, which by definition doesn’t contain repeated values. So you get a Set[Boolean] which consists of only 2 values. You might get surprised here if you are familiar with LINQ in C#. In C#

new HashSet(new[] {1, 2, 3, 4}).Select(x => x%2 == 0)

gives you an IEnumerable that consists of “false, true, false, true” in order, because the HashSet is simply iterated through as a Normal IEnumerable. Some say C#/LINQ sucks, because the type information of original collection is “lost”.

In Scala, Map[K, V] is a subtype of Iterable[(K, V)], thus can be seen as a collection of key-value pairs. Now stop for a moment and think about it: what happens if you map a Map?

scala> Map('a'->1,'b'->2,'c'->3,'d'->4).map(t => t._2 -> t._1)
res: scala.collection.immutable.Map[Int,Char] = Map(1 -> a, 2 -> b, 3 -> c, 4 -> d)

scala> Map('a'->0,'b'->1,'c'->0,'d'->1).map(t => t._1)
res: scala.collection.immutable.Iterable[Char] = List(a, b, c, d)

Ideally, according to the uniform return type principle, mapping a map should give you another Map. But this is only possible when the target element type is key-value pair. Otherwise, the best we can do is to treat the Map as just an Iterable of key-value pairs and return an List of the target elements.

However, there is still the problem of inconsistency. Here’s an example:

scala> Map(1->'a', 2->'b', 3->'c').map(_._2)
res: scala.collection.immutable.Iterable[Char] = List(a, b, c)

scala> Map(1->('a','b'), 2->('a','c'), 3->('b','c')).map(_._2)
res: scala.collection.immutable.Map[Char,Char] = Map(a -> c, b -> c)

Same operation on two Sets, but the behaviors are totally different, simply because these two Sets have different types of values? This is weird. The problem is that 2-tuples are selected to represent the key-value pairs, but are not exclusively used for this purpose. So when user map the elements to 2-tuples, there’s this unavoidable ambiguity.

It becomes even more error-prone if the actual class type of the collection is not known at compile time. We know that Map has a property “keys” that gives you an Iterable of its keys. Consider the following example:

scala> val m = Map(1 -> 'a', 2 -> 'b', 3 -> 'c', 4 -> 'd')
m: scala.collection.immutable.Map[Int,Char] = Map(1 -> a, 2 -> b, 3 -> c, 4 -> d)

scala> m.keys.map(_%2==0)
res: Iterable[Boolean] = Set(false, true)

Ah…the runtime type of “keys” is Set[Int], so repeated values in the returned collection are also ignored. The behavior should surprise you if you haven’t think of this. In fact, when given a Iterable without knowing its actual type, in any case the behaviors of collection operations are simply unpredictable. Scala guys can argue that this is just how it works, but this doesn’t seems to be a good thing, given that Iterables have been widely used between interfaces to pass data collections around without specifying their internal structures.

To sum up, the behavior of the map operation depends on not only the type of the collection on which it invokes, both static and dynamic, but also on the type of the mapped elements.

How is this magic implemented? If you take a look at the source code, you will probably see the most complicated collection library you’ve ever seen in your life. I’ll try to give a brief and simplified explanation. Here is the actual signature of the map method:

def map[MappedElem, That](p: Elem => MappedElem)
    (implicit bf: CanBuildFrom[This, MappedElem, That]): That

As you can see there’s an extra implicit parameter “bf” of type “CanBuildFrom[This, MappedElem, That]“, which will give you a “Builder[MappedElem, That]” that can build a collection (of type “That”) from the mapped elements (of type “MappedElem”). In short, “CanBuildFrom[This, MappedElem, That]” is a factory for “Builder[MappedElem, That]“, which itself is a factory for “That”. When both type parameters “This” and “MappedElem” are given, the compiler can find the best eligible for “bf” (according to some perplexing rules) and consequently determines the static type of “That”. For example:

// bf : CanBuildFrom[Map[_,_], (Char, Int), Map[Char, Int]]
Map(1 -> ‘a’, 2 -> ‘b’).map(t => t._2 -> t._1)

// bf: CanBuildFrom[Iterable[_], Int, Iterable[Int]]
Map(1 -> ‘a’, 2 -> ‘b’).map(_._1)

The “CanBuildFrom” can forward the call to the “genericBuilder[MappedElem]” method of the collection inferred in compile time, so that the “right” runtime type can be selected via virtual dispatch.

Now you should have noticed the conceptual and implementation complexity brought by the seemingly simple “uniform return type principle”. But why do we need this at the first place?

Let’s digress to talk about LINQ for a moment. The argument “LINQ sucks, because the type information of original collection is lost” doesn’t really stand, because I think this is exactly what it was designed to ignore. For most of the times, we often don’t care about whether the Iterable is a Seq or a Set or a Map or whatever. What we do with a collection is often just querying and consuming its content. LINQ provides a uniform interface with consistent, lazy behavior to facilitate this. When you need a collection of certain type, you explicitly convert the query result to it. When you need special behaviors of a Set or Map (Dictionary), you use their own interfaces. But queries are queries, different collections are treated equally. Such simplicity is not a weakness, but a feature.

In my point of view, LINQ and Scala’s collection operation are cosmetically similar but semantically different. LINQ queries a collection; Scala’s collection operation transforms a collection. (This may give you the clue why LINQ doesn’t have a ForEach extension method.) Scala’s functional nature necessitates the collection transformation semantic: you need to be able to deal with the value of a collection and compute a new value from it. And since Scala is also statically and strongly typed, the “uniform return type principle” is required to make the system consistent. Thus the behaviors of collection operations should of course depend on the actual collection type and the type of the target element. The operations are non-lazy (strict) by default to avoid unintended delay of the side-affects within the operations (there’s a more detailed explanation about the decision on laziness at the bottom of this page).

What if we only want to query and consume the content of the collection, like we do in LINQ? I suggest using Iterator directly or convert the collection to Stream:

scala> Set(1,2,3,4).iterator.map(_%2).foreach(print)
1010
scala> Set(1,2,3,4).toStream.map(_%2).foreach(print)
1010

What comes with the complexity is great flexibility and power. Scala is expressive and seems quite easy, but the paradox is that you need to spend much time learning to avoid misusing it. Scala is not easy, even in the case of these basic collection operations; it’s important to recognize that.

sinablog2wordpress:从新浪博客搬家到WordPress

Screenshot

要从新浪搬到Wordpres,网上广为流传的方法是利用blogbus的博客搬家服务获得blogbus格式的xml,然后再用一个Python写的脚本把它转换成WordPress认识的格式。但是这种方法在最近新浪博客升级以后就失效了。于是自己用现学的scala写了一个小程序,搬家时能保留标签目录评论评论回复这些信息。

猛击[这里]下载。注意,此程序仅支持2010年初的新版新浪博客,之前或之后的版本都不支持。

要运行程序,你需要确保已经安装过JRE。双击运行后显示如图界面,填入自己的博客地址(不要省略“ http:// ”),然后点击“Start”即可。这时“Start”按钮变为灰色,标题栏显示“Extracting”。等待几分钟,当标题栏显示为“Done”、“Start”按钮重新变为可用时,程序所在目录下会出现一个blog.xml文件。把这个文件直接导入WordPress就可以了。

代码也打在jar包里了,MIT协议。欢迎报告bug。

下面是废话。

恩由于新浪用了ajax,评论信息是通过xhr异步读取的,用一般的方法没法抓到。我纠结许久,最后是用了非常ad hoc的方法解决的,不知道有没有什么什么不太麻烦的通用解决方案呢。

再扯两句scala。我都想不起来当初具体是怎么想到要学scala的,也许是为了了解下函数式编程,也许只是想在jvm上有一个喜欢的语言吧——Java写起来太不爽了;Java社区的低效和保守也已经开始显出C++的影子。

scala确实是非常强大和灵活;我在见到一些颇富技巧性的hack之后都有些怀疑scala社区的风气会不会慢慢变得像C++社区一样过分热衷技巧的炫耀。不过scala的设计目标就是以较简单的语法规则获得最大的scalability,不需要通过挖掘语言规范里的犄角旮旯来实现一些必要功能,所以不会像C++一样成为一门本身已相当复杂,却还需要别人反过来教语言发明者如何使用的语言。

scala毕竟表现力比Java强太多,代码也简洁太多。比如这次我需要实现一个抛出异常后重试若干次的逻辑,只需定义一个函数:

def tryFor[T](times: Int)(op: => T): T = {
  if (times <= 0) throw new RuntimeException("Operation failed.")
  try { return op } catch {
    case e: Throwable => e.printStackTrace
    tryFor(times - 1)(op)
  }
}

然后这样使用:

val source = tryFor(5) {new Source(url)}

程序就会不断获得网页源代码,并在5次失败后抛出异常。Java实现同样的东西可不会如此优雅了。又如下面这段代码返回一篇博文xml:

private def generateEntryXml(entry: BlogEntry) = {
  <item>
    <title>
      {entry.title}
    </title>
    <wp:post_date>
      {dateFormat.format(entry.postDate)}
    </wp:post_date>
    <category>
      {entry.category}
    </category>
    {for (tag <- entry.tags) yield <category domain="tag">{tag}</category>}
    <content:encoded>
      {xml.Unparsed(handleNewLines(entry.content))}
    </content:encoded>
    <wp:status>publish</wp:status>
    {for (comment <- entry.comments) yield generateCommentXml(comment)}
  </item>
}

注意,xml标签直接作为scala的源代码的一部分在代码中出现!虽然我觉得这样会使scala语言多出一种“特殊情况”,增加语言的复杂性,但不得不承认这样的设计确实非常优美简洁。

我比较看好scala,以后自己做跑在jvm上的东西scala应该是首选语言。推荐有兴趣的童鞋也了解一下。

标签云

豆瓣