XLA Is Fast, Huh - Vengineerの妄想(準備期間)

According to "Pushing the limits of GPU performance with XLA",
comparing the GPU on TensorFlow r1.11 without XLA against the GPU on TensorFlow r1.12 with XLA, the one with XLA is faster...

Why not compare the r1.12 GPU without XLA against the r1.12 GPU with XLA, I wonder... Strange.

Well, still, I guess this means XLA has improved a lot since r1.0... It has been a year and a half, after all.

For ResNet-50 training, it got 1.13x to 3.04x faster... That's a pretty decent speedup.

The blog post also explains why it got faster.

Quoting:
Run without XLA, the graph launches three kernels: one for the multiplication, one for the addition and one for the reduction.

However, XLA can optimize the graph so that it computes the result in a single kernel launch. 
It does this by “fusing” the addition, multiplication and reduction into a single GPU kernel. 
Moreover, this fused operation does not write out the intermediate values produced by y*z and x+y*z to memory;
instead it “streams” the results of these intermediate computations directly to their users 
while keeping them entirely in GPU registers.

In other words, the defining feature of XLA is that it merges multiple kernels into a single kernel and runs that.

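In CUDA, every kernel launch carries a non-trivial cost,
so fusing everything into one kernel cuts down on launch overhead. And, as the quote says, the intermediate values never have to be written out to memory.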
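
To get a rough feel for what fusion buys you, here is a pure-Python analogy of the x + y * z reduction from the quote. This is only a sketch: plain lists stand in for GPU buffers, each list-building step loosely stands in for one kernel launch, and the names unfused and fused are made up for illustration. It is not how XLA itself is implemented.

```python
# Conceptual analogy only: Python lists stand in for GPU buffers.

def unfused(x, y, z):
    # "Kernel" 1: multiplication, materializes a full intermediate buffer.
    yz = [b * c for b, c in zip(y, z)]
    # "Kernel" 2: addition, materializes a second intermediate buffer.
    s = [a + m for a, m in zip(x, yz)]
    # "Kernel" 3: reduction over the second buffer.
    return sum(s)

def fused(x, y, z):
    # One pass: the intermediates y*z and x+y*z only ever exist as
    # per-element temporaries, never as whole buffers.
    total = 0.0
    for a, b, c in zip(x, y, z):
        total += a + b * c
    return total

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
z = [0.5, 0.5, 0.5]
print(unfused(x, y, z), fused(x, y, z))
```

Both versions compute the same sum; the fused one just does it in a single pass with no intermediate lists, which is the idea behind the single fused GPU kernel.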

Oh, and here is the sample code from the blog (quoting):

from tensorflow.contrib.compiler import xla

def model_fn(x, y, z):
  return tf.reduce_sum(x + y * z)

def create_and_run_graph():
  with tf.Session() as sess:
    x = tf.placeholder(tf.float32, name='x')
    y = tf.placeholder(tf.float32, name='y')
    z = tf.placeholder(tf.float32, name='z')
    result = xla.compile(computation=model_fn, inputs=(x, y, z))[0]
    # `result` is a normal Tensor (albeit one that is computed by an XLA
    # compiled executable) and can be used like any other Tensor.
    result = tf.add(result, result)
    return sess.run(result, feed_dict={ ... })

So with xla.compile like this, you can now compile model_fn with XLA...

Whoa, I'll have to dig into this again...

So this kind of thing is possible now too.

Quoting:
if should_use_xla():
  result = xla.compile(model_fn, (x, y, z))[0]
else:
  result = model_fn(x, y, z)

With a check like should_use_xla(), you can switch between the two paths... Handy!
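
The blog does not show what should_use_xla() looks like, so here is a minimal sketch of one way such a switch might be written. Everything here is an assumption for illustration: the USE_XLA environment variable, the build_result helper, and the compile_fn parameter (a stand-in with the same calling shape as xla.compile in the quoted snippet, so the sketch runs without TensorFlow).

```python
import os

def should_use_xla(environ=None):
    # Hypothetical helper: treats a made-up USE_XLA variable as the switch.
    env = os.environ if environ is None else environ
    return env.get("USE_XLA", "0").strip().lower() in ("1", "true", "yes")

def build_result(model_fn, compile_fn, inputs, environ=None):
    # compile_fn is a stand-in for xla.compile: it takes the computation
    # and a tuple of inputs, and returns a list of outputs.
    if should_use_xla(environ):
        return compile_fn(model_fn, inputs)[0]
    # Fallback path: call the model function directly.
    return model_fn(*inputs)
```

In the real thing you would presumably pass xla.compile as compile_fn and the placeholders as inputs; keeping the switch in one small function makes it easy to compare both paths on the same model_fn.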

tensorflow.contrib.compiler.xla would be this file, I think...