An MLIR-based TensorFlow Runtime?

@Vengineer's Ramblings : Twitter

Welcome to the world of SystemVerilog. It all began with the release of SystemC v0.9.

I learned from this tweet that the TensorFlow runtime is being rebuilt.

Recording+slides from the #MLIR open meeting about TFRT - the new TensorFlow runtime are online! https://t.co/2l7oiP9sUs

Thanks @clattner_llvm for getting this started last year, I'm thrilled to collaborate with @mingshenghong and his team to deliver a new TensorFlow stack :)
— Mehdi Amini (@JokerEph) March 20, 2020

The slides and video from the MLIR Open Design Meeting of March 19, 2020 are posted here.

There is also a video from TensorFlow Dev Summit 2020.

www.youtube.com

The slides cover the Host Runtime and the BEF Executor (BEF: Binary Executor Format).

Page 16 has the following, so the emphasis is clearly on being low-level:

Host Program

Designed for low-level dispatch efficiency

The Graph Compiler converts the graph to MLIR and then to BEF. The BEF is first placed in storage, and the BEF Executor later reads it back from storage and executes it.

Page 22 says:

Low overhead: execute program by reading mmap’d byte array

Persistent and stable: Compile once offline, run many times online. Great for inference use-cases
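To get a feel for "compile once offline, run many times by reading an mmap'd byte array", here is a minimal Python sketch. The instruction format (one opcode byte plus a 32-bit operand), the opcodes, and the function names `compile_offline` and `run_online` are all made up for illustration. This is not the real BEF layout, just a toy stack machine that shows the idea.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy format (NOT the real BEF layout): each instruction is
# one opcode byte followed by a little-endian signed 32-bit operand.
OP_PUSH, OP_ADD, OP_MUL = 0, 1, 2
INSTR = struct.Struct("<Bi")


def compile_offline(path, program):
    """'Compile once offline': serialize (opcode, operand) pairs to a file."""
    with open(path, "wb") as f:
        for opcode, operand in program:
            f.write(INSTR.pack(opcode, operand))


def run_online(path):
    """'Run many times online': mmap the file and interpret it in place."""
    stack = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # Decode instructions straight out of the mapped bytes.
            for offset in range(0, len(buf), INSTR.size):
                opcode, operand = INSTR.unpack_from(buf, offset)
                if opcode == OP_PUSH:
                    stack.append(operand)
                elif opcode == OP_ADD:
                    b, a = stack.pop(), stack.pop()
                    stack.append(a + b)
                elif opcode == OP_MUL:
                    b, a = stack.pop(), stack.pop()
                    stack.append(a * b)
                else:
                    raise ValueError(f"unknown opcode {opcode}")
    return stack.pop()


def demo():
    # Encodes (2 + 3) * 4; the operand of ADD and MUL is unused.
    program = [(OP_PUSH, 2), (OP_PUSH, 3), (OP_ADD, 0),
               (OP_PUSH, 4), (OP_MUL, 0)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "program.bin")
        compile_offline(path, program)                 # once
        return [run_online(path) for _ in range(3)]    # many times
```

The point of the design is that the serialized file needs no parsing or rebuilding step at load time: the executor just walks the mapped bytes, which fits inference, where the same program is run over and over.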

Page 23 also lists the following:

BEFExecutor
Key concepts:

BEF: dataflow graph

Kernel: dataflow node

AsyncValue: dataflow edge
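The "kernel = dataflow node, AsyncValue = dataflow edge" picture can be sketched in a few lines of Python. The `AsyncValue` class below only borrows the name from the slide. Its methods (`and_then`, `set`, `get`) and the `schedule_kernel` helper are my own assumptions for illustration, not the TFRT API, and everything runs on a single thread.

```python
class AsyncValue:
    """Toy dataflow edge: holds a value that may not be available yet."""

    def __init__(self):
        self._ready = False
        self._value = None
        self._waiters = []

    def and_then(self, callback):
        # Run the callback now if the value is ready, otherwise queue it.
        if self._ready:
            callback()
        else:
            self._waiters.append(callback)

    def set(self, value):
        self._ready, self._value = True, value
        waiters, self._waiters = self._waiters, []
        for callback in waiters:
            callback()

    def get(self):
        if not self._ready:
            raise RuntimeError("value is not available yet")
        return self._value


def schedule_kernel(fn, inputs, trace):
    """Toy dataflow node: runs fn once every input edge has a value."""
    output = AsyncValue()
    pending = [len(inputs)]

    def fire():
        trace.append(fn.__name__)
        output.set(fn(*(v.get() for v in inputs)))

    def on_input_ready():
        pending[0] -= 1
        if pending[0] == 0:
            fire()

    if not inputs:
        fire()  # a kernel without inputs can run immediately
    for value in inputs:
        value.and_then(on_input_ready)
    return output


def add(a, b):
    return a + b


def square(a):
    return a * a


def demo():
    trace = []
    x, y = AsyncValue(), AsyncValue()
    total = schedule_kernel(add, [x, y], trace)       # waits for x and y
    result = schedule_kernel(square, [total], trace)  # waits for total
    y.set(4)                 # only one input is ready, nothing runs
    ran_early = list(trace)
    x.set(3)                 # add fires, which then makes square fire
    return ran_early, result.get(), trace
```

Nothing here decides an execution order up front. Each kernel simply fires when the last of its input edges becomes available, which is what makes the graph a dataflow graph.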

In contrast to the host side, a part titled "Device Runtime for Graph Execution" explains how the GPU runtime is handled.

Page 29 has the code for "CUDA in TFRT: Kernel Execution". It is quoted below.

// Allocate a buffer and make a GPU tensor
%b1 = crt.mem.alloc %stream %size %align
%t2 = crt.make_tensor %b1 %shape
// Launch some kernels. %t1 has some data on GPU. %ch0 is a chain
%ch1 = crt.launch %stream <sigmoid> %t1 %t2 %ch0 // t2 = sigmoid(t1)
%ch2 = crt.launch %stream <sqrt> %t2 %t2 %ch1 // t2 = sqrt(t2)
// Allocate pinned host memory, copy, and print
%hb = crt.mem.host_alloc %size %align
%ch3 = crt.mem.copy_dtoh %stream %t2 %hb %ch2
%ch4 = crt.mem.free %t2 %ch3 // can free immediately after launching
%ev = crt.event.create %flags
%ch5 = crt.event.record %stream %ev %ch4
%ch6 = crt.event.poll %ev %ch5 // %ch completes when event is reached
hex.print %hb2 %ch6

This looks pretty much like the host side of a CUDA program.

Next (pages 30 to 37) comes "TFRT End-to-End Inference Workflow".
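One thing the quoted listing does differently from plain CUDA host code is that the ordering is spelled out through the `%ch` chain values. Here is a rough Python sketch of why that matters. The `OPS` table is my own transcription of which names each line of the listing consumes, and `depends_on` and `without_chains` are hypothetical helpers, not anything from TFRT.

```python
# Result name -> names it consumes, transcribed from part of the quoted
# listing. %ch0 and %t1 come from outside the listing and have no entry.
OPS = {
    "%b1": [],                        # crt.mem.alloc
    "%t2": ["%b1"],                   # crt.make_tensor
    "%ch1": ["%t1", "%t2", "%ch0"],   # launch <sigmoid>
    "%ch2": ["%t2", "%ch1"],          # launch <sqrt>
    "%hb": [],                        # crt.mem.host_alloc
    "%ch3": ["%t2", "%hb", "%ch2"],   # crt.mem.copy_dtoh
    "%ch4": ["%t2", "%ch3"],          # crt.mem.free
}


def depends_on(ops, node, target):
    """True if `node` transitively consumes `target`."""
    stack, seen = list(ops.get(node, [])), set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(ops.get(current, []))
    return False


def without_chains(ops):
    """Same ops with every %ch operand removed (data dependencies only)."""
    return {name: [i for i in inputs if not i.startswith("%ch")]
            for name, inputs in ops.items()}


def demo():
    data_only = without_chains(OPS)
    return {
        # With chains, sqrt must wait for sigmoid, and free for the copy.
        "sqrt_after_sigmoid": depends_on(OPS, "%ch2", "%ch1"),
        "free_after_copy": depends_on(OPS, "%ch4", "%ch3"),
        # With data dependencies alone, a dataflow executor would be free
        # to reorder them, since both only read %t2.
        "sqrt_after_sigmoid_data_only": depends_on(data_only, "%ch2", "%ch1"),
        "free_after_copy_data_only": depends_on(data_only, "%ch4", "%ch3"),
    }
```

In this model the chains are the only thing that keeps the free from running ahead of the device-to-host copy, so they play the role that program order plays in ordinary host code.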

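The environment is a mainstream CPU (Xeon Gold 6154) and GPU (NVIDIA TITAN V) with

CUDA 10.1, CuDNN 7.6.4, GPU driver 430.34

In this environment, inference with ResNet-50 v1.5 (FP16 mixed precision) at batch size = 1, NHWC (1, 224, 224, 3), is 28% faster than with the current runtime (that is, the inference time drops to roughly 3/4).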
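A quick check of the arithmetic, assuming the 28% refers to a reduction in inference time (which is how I read it, given the "3/4" remark). The helper names are just for this sketch.

```python
def time_ratio(reduction_percent):
    """New inference time as a fraction of the old one."""
    return 1.0 - reduction_percent / 100.0


def speedup(reduction_percent):
    """Throughput-style speedup implied by the same reduction."""
    return 1.0 / time_ratio(reduction_percent)


ratio = time_ratio(28)   # fraction of the old time, a little under 3/4
factor = speedup(28)     # how many times faster that is
```

So a 28% shorter inference time is a bit better than 3/4 of the original, and under the same assumption it corresponds to a speedup factor of about 1.39.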

From page 38 onward, "Next Steps and Selected Challenges" includes "TFRT mobile support".

Selected opportunities and challenges

On-device compiler with small binary size

AOT and interpreter modes

Running op scheduling that balances performance and power concerns

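Looking through the TensorFlow GitHub repository, the keyword tfrt shows up here and there. One example is this file (tflite_api_dispatcher/BUILD). The files it refers to do not exist yet, but the BUILD file already lists a number of things.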

cc_library(
    name = "tflite_api_dispatcher_with_kernels",
    hdrs = ["tflite_api_dispatcher.h"],
    deps = [
        ":tflite_api_dispatcher",
        "//tensorflow/lite:framework_lib",
    ] + tflite_experimental_runtime_linkopts(
        if_eager = [
            # "//tensorflow/lite/experimental/tf_runtime/opdef:tflrt_opdefs",
            # "//tensorflow/lite/experimental/tf_runtime/tfrt_ops:tfrt_tflite_ops_alwayslink",
        ],
        if_non_eager = [
            # "//tensorflow/lite/experimental/tf_runtime/tfrt_kernels:tfrt_tflite_interpreter_alwayslink",
            # "//third_party/tf_runtime:basic_kernels_alwayslink",
        ],
    ),
)

So there is an eager mode and a graph (non-eager) mode. According to this file (tflite_api_dispatcher.h), BEFModel seems to be used only in graph mode.

#elif TFLITE_EXPERIMENTAL_RUNTIME_NON_EAGER
using Interpreter = tflrt::TfLiteInterpreterAPI;
using InterpreterBuilder = tflrt::TfLiteInterpreterBuilderAPI;
using TfLiteModel = tflrt::BEFModel;
using TfLiteVerifier = tflrt::TfLiteVerifier;