CuPy v5 to support Tensor Comprehensions - Vengineerの妄想(準備期間)

@Vengineer's ramblings : Twitter
Welcome to the world of SystemVerilog. It all began with the release of SystemC v0.9.

I attended the Chainer Roadmap Meetup: v4 to v5.

New features planned for CuPy v5, as presented in the video:

　For advanced user

　・Provide simple CUDA kernel
　・Support DLPack and Tensor Comprehensions. <= The slides say "Comprehension" (singular).
　　toDLPack() and fromDLPack()

I wondered why DLPack would be supported as well, so I looked into it.
It turns out the Tensor Comprehensions documentation has a page titled "Integrating TC with ML framework".

Its first step is listed as "DLpack support in framework".

Quote
  Step 1: DLpack support in framework
    
    In order to integrate a new framework, minimal DLPack support is needed. Two functions are needed:

    　・toDlpack: create the DLPack tensor struct from the tensor.
    　・fromDlpack: create the tensor from the DLPack struct.

So that is why DLPack support comes first: it is the prerequisite for integrating TC.
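To make the toDlpack / fromDlpack pair concrete, here is a minimal sketch in plain Python. It is not the real DLPack C ABI and not CuPy's implementation: `ToyDLTensor` and `ToyTensor` are hypothetical names invented for illustration, and the struct is reduced to a few of the fields the real DLPack tensor carries (data, shape, strides, dtype, device). The point it tries to show is that the conversion hands over a description of the same memory rather than a copy.

```python
from array import array
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ToyDLTensor:
    """Simplified stand-in for the DLPack tensor struct (not the real C ABI)."""
    data: memoryview          # view of the producer's buffer, no copy
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]  # counted in elements, not bytes
    dtype: str
    device: str


class ToyTensor:
    """Hypothetical framework tensor: a flat float32 buffer plus a shape."""

    def __init__(self, buffer: array, shape: Tuple[int, ...]):
        self.buffer = buffer
        self.shape = tuple(shape)

    def to_dlpack(self) -> ToyDLTensor:
        # Row-major strides: the last axis is contiguous.
        strides = []
        step = 1
        for dim in reversed(self.shape):
            strides.insert(0, step)
            step *= dim
        return ToyDLTensor(memoryview(self.buffer), self.shape,
                           tuple(strides), "float32", "cpu")

    @classmethod
    def from_dlpack(cls, dl: ToyDLTensor) -> "ToyTensor":
        # Wrap the same underlying buffer instead of copying it.
        return cls(dl.data.obj, dl.shape)


src = ToyTensor(array("f", [1, 2, 3, 4, 5, 6]), (2, 3))
dst = ToyTensor.from_dlpack(src.to_dlpack())
dst.buffer[0] = 42.0  # the write is visible through src: same memory
```

Because both tensors wrap one buffer, a framework and TC could exchange data this way without a copy, which is presumably why the documentation asks for DLPack support before anything else.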

The second step is:

Quote
　Step 2: Integrating TC

　　Once the DLPack support is available in the framework, integration of TC is easy. 
　　This can be achieved by writing a lightweight C++ code 
　　which uses the DLPack tensor conversion calls to convert tensors so that they can be passed to TC backend.

　　This is all that is needed for integrating an ML framework with TC. 
　　Concretely, following functions need to be defined:

　　・define: 
　　　　　This is simply a wrapper which takes the TC lang input 
　　　　　and dispatches call to the TC backend. Nothing else is needed at this step.
　　・toDlpackTensors: 
　　　　　This should take the vector of input tensors (framework) 
　　　　　and use the dlpack tensor conversions API defined by framework 
　　　　　to convert input tensors to dlpack tensors.
　　・compile: 
　　　　　This takes the dlpack tensors converted in previous step 
　　　　　and dispatches compilation call to TC backend on those input dlpack tensors.
　　・prepareOutputs: 
　　　　　TC backend send back the output tensors infor (strides, shapes, type etc.) 
　　　　　and framework should allocate the outputs storage.
　　・run: 
　　　　　This simply dispatches the output tensor pointers to the TC backend 
　　　　　and returns the outputs received.

　・define the TC (define),
　・convert the input data (toDlpackTensors),
　・compile (compile),
　・prepare the output data (prepareOutputs), and
　・run (run)
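The five functions can be sketched end to end in plain Python. This is only an illustration of the call order under loud assumptions: `FakeBackend` is a hypothetical stand-in that compiles nothing and merely adds two tensors, the "DLPack tensor" is modeled as a `(shape, flat data)` tuple, the TC source string is treated as opaque text, and `FrameworkOp` with its method names is invented here, not taken from the TC code base.

```python
from typing import Dict, List, Tuple

DlTensor = Tuple[Tuple[int, ...], List[float]]  # toy "DLPack tensor"


class FakeBackend:
    """Hypothetical stand-in for the TC backend; it only adds two tensors."""

    def define(self, lang: str) -> None:
        self.lang = lang  # a real backend would parse the TC source here

    def compile(self, inputs: List[DlTensor]) -> None:
        assert inputs[0][0] == inputs[1][0], "shapes must match"
        self.out_shape = inputs[0][0]

    def output_info(self) -> List[Tuple[int, ...]]:
        return [self.out_shape]

    def run(self, inputs: List[DlTensor], outputs: List[DlTensor]) -> None:
        a, b, out = inputs[0][1], inputs[1][1], outputs[0][1]
        for i in range(len(out)):
            out[i] = a[i] + b[i]


class FrameworkOp:
    """Framework-side glue exposing the five functions named in the quote."""

    def __init__(self, backend: FakeBackend):
        self.backend = backend

    def define(self, lang: str) -> None:
        self.backend.define(lang)

    def to_dlpack_tensors(self, tensors: List[Dict]) -> List[DlTensor]:
        return [(t["shape"], t["data"]) for t in tensors]

    def compile(self, dl_inputs: List[DlTensor]) -> None:
        self.backend.compile(dl_inputs)

    def prepare_outputs(self) -> List[DlTensor]:
        # The backend reports shapes; the framework allocates the storage.
        outputs = []
        for shape in self.backend.output_info():
            size = 1
            for dim in shape:
                size *= dim
            outputs.append((shape, [0.0] * size))
        return outputs

    def run(self, dl_inputs: List[DlTensor], outputs: List[DlTensor]) -> List[Dict]:
        self.backend.run(dl_inputs, outputs)
        return [{"shape": s, "data": d} for s, d in outputs]


op = FrameworkOp(FakeBackend())
op.define("<TC source text>")
dl_inputs = op.to_dlpack_tensors([
    {"shape": (3,), "data": [1.0, 2.0, 3.0]},
    {"shape": (3,), "data": [10.0, 20.0, 30.0]},
])
op.compile(dl_inputs)
outputs = op.prepare_outputs()
result = op.run(dl_inputs, outputs)
```

Note that the framework, not the backend, owns the output allocation. That matches the description of prepareOutputs in the quote.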

There is also sample code for ATen.

How would a CuPy function be passed to TC, I wonder?
After all, define in Tensor Comprehensions takes the TC definition as a string, like this:

Quote: https://github.com/facebookresearch/TensorComprehensions/tree/master/tensor_comprehensions

import tensor_comprehensions as tc
import torch
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
  output(i, j) +=! A(i, kk) * B(kk, j)
}
"""
# The name should match the name of the "def" in "lang"
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)
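
For readers unfamiliar with the TC notation, the matmul string above can be read as ordinary loops. The following is a pure-Python reference sketch of my reading of it, not code generated by TC: the `!` in `+=!` initializes `output` to zero before the reduction, and the ranges of `i`, `j`, and `kk` are inferred from the shapes of `A` and `B`. The name `matmul_reference` is made up for this example.

```python
def matmul_reference(A, B):
    """Loop-level reading of: output(i, j) +=! A(i, kk) * B(kk, j)."""
    M, N, K = len(A), len(B), len(B[0])
    assert all(len(row) == N for row in A), "inner dimensions must match"
    # The "!" part: start from zero rather than accumulating into old values.
    output = [[0.0] * K for _ in range(M)]
    for i in range(M):
        for j in range(K):
            for kk in range(N):  # kk appears only on the right: it is reduced
                output[i][j] += A[i][kk] * B[kk][j]
    return output


result = matmul_reference([[1.0, 2.0], [3.0, 4.0]],
                          [[5.0, 6.0], [7.0, 8.0]])
```

Since the whole computation lives in a string like this, a CuPy integration would presumably hand TC the string plus DLPack-converted arrays, rather than a Python function.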