XLA: Optimizing Compiler for Machine Learning

@Vengineer's Ramblings : Twitter
Welcome to the world of SystemVerilog. It all began with the release of SystemC v0.9.

It has been a while since my last TensorFlow XLA post.

As usual, I was browsing the TensorFlow source code when I came across this:

XLA: Optimizing Compiler for Machine Learning

The results are improvements in speed and memory usage: most internal benchmarks run ~1.15x faster after XLA is enabled. The dataset below is evaluated on a single NVidia V100 GPU:

Just enabling XLA makes things about 15% faster... that is impressive.
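As a small aside on reading that number: "~1.15x faster" is a throughput ratio, so the wall-clock time saved is somewhat less than 15%. The sketch below works through the arithmetic. The helper name runtime_fraction is made up for this illustration and is not part of TensorFlow or XLA.

```python
def runtime_fraction(speedup: float) -> float:
    """Fraction of the original runtime that remains after a given speedup.

    Hypothetical helper for illustration only.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# The quoted figure: most internal benchmarks run ~1.15x faster.
remaining = runtime_fraction(1.15)
saved = 1.0 - remaining

print(f"remaining runtime: {remaining:.3f}")  # 0.870
print(f"time saved:        {saved:.1%}")      # 13.0%
```

So a 1.15x speedup means the same work finishes in roughly 87% of the original time.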

Among the documents, there is one titled

Using XLA via tf.function and experimental_compile

which explains how to use XLA through tf.function and experimental_compile, using MNIST on TensorFlow r2.0 as the example.

As for experimental_compile, the description here reads:

experimental_compile: If false, execute the function in a regular way. The
function is optimized by some graph rewrite passes (some ops might be
clustered into a single op) and interpreted by the standard TensorFlow
executor, which dispatches op kernels one by one as they become
executable. Set it to false when directly running a multi-device
function on TPUs (e.g. two TPU cores, one TPU core and its
host CPU). If True, the function is compiled directly by XLA. XLA would
fuse all the ops and emit more efficient code to run for some devices
(e.g. TPU, XLA_GPU) and some use cases (e.g. dense tensor computation).
It requires that the whole function is compilable by XLA. If None
(default), compile the function with XLA when running on TPU and go
through the regular function execution path when running on other
devices.
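The three-way behaviour in that description (True, False, None) is easy to misread, so here is a minimal pure-Python sketch of the decision it describes. This is only a model of the quoted text, not TensorFlow's actual implementation. The function name choose_execution_path and the device strings are assumptions made for this illustration.

```python
from typing import Optional


def choose_execution_path(experimental_compile: Optional[bool], device: str) -> str:
    """Model of the quoted description; returns "xla" or "regular".

    Hypothetical helper: the name and the device strings are illustrative
    and do not come from TensorFlow.
    """
    if experimental_compile is True:
        # The whole function is compiled directly by XLA.
        return "xla"
    if experimental_compile is False:
        # Regular path: graph rewrites, then the standard executor.
        return "regular"
    # None (default): XLA only when running on TPU.
    return "xla" if device.upper() == "TPU" else "regular"


for flag in (True, False, None):
    for dev in ("TPU", "GPU"):
        print(flag, dev, "->", choose_execution_path(flag, dev))
```

The point to notice is that with the default of None, a GPU run does not go through XLA, so the flag has to be set explicitly to True to get XLA compilation there.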

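When experimental_compile is set to true, execution eventually reaches this code:

bool xla_compile_attr;
// Read the XLA-compile attribute from this node, if it carries one.
if (TryGetNodeAttr(node->attrs(), kXlaCompileAttr, &xla_compile_attr)) {
  // OR the value in, so the flag stays true once any node has set it.
  is_xla_compile_attr_true |= xla_compile_attr;
}

Here is_xla_compile_attr_true gets set, and the function is then JIT-compiled.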
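To make the accumulation pattern in that snippet concrete, here is a rough Python model of it. Several things are assumptions: the snippet does not show the surrounding loop, so iterating over a list of nodes is my guess at the context; nodes are modelled as plain dicts of attributes; and the key string XLA_COMPILE_ATTR is a hypothetical stand-in for kXlaCompileAttr, whose real value is not shown above.

```python
from typing import Any, Dict, Iterable

# Hypothetical stand-in for kXlaCompileAttr; the real string is not shown in the post.
XLA_COMPILE_ATTR = "xla_compile_attr"


def any_node_requests_xla(nodes: Iterable[Dict[str, Any]]) -> bool:
    """Return True if at least one node carries a true XLA-compile attribute."""
    is_xla_compile_attr_true = False
    for attrs in nodes:
        value = attrs.get(XLA_COMPILE_ATTR)
        # Models TryGetNodeAttr succeeding: the attribute exists and is a bool.
        if isinstance(value, bool):
            # Same OR-accumulation as the C++ snippet.
            is_xla_compile_attr_true |= value
    return is_xla_compile_attr_true


graph = [
    {"name": "matmul"},                          # attribute missing: ignored
    {"name": "relu", XLA_COMPILE_ATTR: False},   # present but false
    {"name": "softmax", XLA_COMPILE_ATTR: True}, # present and true
]
print(any_node_requests_xla(graph))
```

Because the flag is combined with OR, a later node with the attribute set to false cannot switch it back off.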

