Tiramisu, Part 2 - Vengineerの妄想


tutorial 01: A simple example of how to use Tiramisu (a simple assignment)
tutorial 02: blurxy
tutorial 03: matrix multiplication
tutorial 05: simple sequence of computations
tutorial 06: reduction example
tutorial 08: update example
tutorial 09: complicate reduction/update example

Let's look at tutorial_01: A simple example of how to use Tiramisu (a simple assignment).

int main(int argc, char **argv)
{
    // Set default tiramisu options.
    global::set_default_tiramisu_options();
    global::set_loop_iterator_type(p_int32);

First, the Tiramisu options are set to their default values.

set_default_tiramisu_options is defined as follows.

    static void set_default_tiramisu_options()
    {
        global::loop_iterator_type = p_int32;
        set_auto_data_mapping(true);

        auto location = std::getenv(NVCC_BIN_DIR_ENV_VAR);
        if (location)
            nvcc_bin_dir = location;
    }

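set_loop_iterator_type just assigns its argument to global::loop_iterator_type, as shown below, and set_default_tiramisu_options has already set that to p_int32.
So the call global::set_loop_iterator_type(p_int32); in main isn't actually needed.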

    static void set_loop_iterator_type(primitive_t t) {
        global::loop_iterator_type = t;
    }
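
To make the redundancy concrete, here is a minimal sketch with stand-ins for the two setters. The names mock_global and mock_primitive_t are hypothetical and only mirror the assignments quoted above; this is not Tiramisu's actual class.

```cpp
#include <cassert>

// Hypothetical stand-in for tiramisu::primitive_t (only two values shown).
enum mock_primitive_t { mock_p_uint8, mock_p_int32 };

// Hypothetical stand-in for tiramisu::global, reduced to the two setters.
namespace mock_global {
    mock_primitive_t loop_iterator_type = mock_p_uint8;

    void set_default_tiramisu_options() {
        loop_iterator_type = mock_p_int32;  // same assignment as in the excerpt
    }

    void set_loop_iterator_type(mock_primitive_t t) {
        loop_iterator_type = t;
    }
}

// Returns true when the explicit call after the defaults changes nothing.
bool second_call_is_redundant() {
    mock_global::set_default_tiramisu_options();
    const mock_primitive_t before = mock_global::loop_iterator_type;
    assert(before == mock_p_int32);  // the defaults already picked p_int32

    mock_global::set_loop_iterator_type(mock_p_int32);
    return before == mock_global::loop_iterator_type;
}
```

The explicit call would only matter if a type other than p_int32 were passed.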

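As for set_auto_data_mapping, the comment below says that if it is set to true before code generation,
Tiramisu updates the data mapping automatically whenever a new schedule is applied to a computation.
So this is where the automation gets switched on.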

    /**
      * If this option is set to true, Tiramisu automatically
      * modifies the computation data mapping whenever a new
      * schedule is applied to a computation.
      * If it is set to false, it is up to the user to set
      * the right data mapping before code generation.
      */
    static void set_auto_data_mapping(bool v)
    {
        global::auto_data_mapping = v;
    }

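Next comes the function declaration. A Tiramisu function is the equivalent of a function in C.
It can take inputs and outputs as arguments, and those inputs and outputs are represented as buffers.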

    // Declare a function called "function0".
    // A function in tiramisu is the equivalent of a function in C.
    // It can have input and output arguments.  These arguments are
    // represented as buffers and are declared later in the tutorial.
    function function0("function0");

Once the function is declared, the next step is Layer I.
Layer I is the Abstract Algorithm. The code below assigns the sum of 3 and 4 to e3.

    // Declare an expression that will be associated to the
    // computations.  This expression sums 3 and 4.
    expr e3 = expr(3) + expr(4);
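
As far as I can tell, e3 is a symbolic expression that is evaluated later by the generated code, not a plain integer computed on the spot. The sketch below illustrates that idea with a tiny expression tree. mock_expr and evaluate are hypothetical names for illustration and do not reflect how Tiramisu's expr class is implemented.

```cpp
#include <cassert>
#include <memory>

// Hypothetical miniature expression tree: either a constant or an addition.
struct mock_expr {
    enum kind_t { constant, add };
    kind_t kind;
    int value;
    std::shared_ptr<mock_expr> lhs, rhs;

    explicit mock_expr(int v) : kind(constant), value(v) {}
    mock_expr(std::shared_ptr<mock_expr> l, std::shared_ptr<mock_expr> r)
        : kind(add), value(0), lhs(l), rhs(r) {}
};

// Building a sum records the operands; nothing is computed yet.
mock_expr operator+(const mock_expr &a, const mock_expr &b) {
    return mock_expr(std::make_shared<mock_expr>(a),
                     std::make_shared<mock_expr>(b));
}

// Walks the tree and computes its value.
int evaluate(const mock_expr &e) {
    if (e.kind == mock_expr::constant)
        return e.value;
    assert(e.lhs && e.rhs);  // an add node always has two operands
    return evaluate(*e.lhs) + evaluate(*e.rhs);
}
```

With this mock, mock_expr(3) + mock_expr(4) yields an add node, and evaluate turns it into 7 only when asked.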

Next, we declare a computation inside the function (function0).
Computation S0 has elements of type uint8 and a 10 x 10 iteration domain.
Each element is assigned e3, that is, 7.

    // Declare a computation within function0.
    // To declare a computation, you need to provide:
    // (1) an ISL set representing the iteration space of the computation.
    // Tiramisu uses the ISL syntax to represent sets and maps.  The ISL syntax
    // is described in http://barvinok.gforge.inria.fr/barvinok.pdf (Section
    // 1.2.1 for sets and iteration domains, and 1.2.2 for maps and access
    // relations),
    // (2) a tiramisu expression: this is the expression that will be computed
    // by the computation.
    // (3) the function in which the computation will be declared.
    computation S0("[N]->{S0[i,j]: 0<=i<10 and 0<=j<10}", e3, true, p_uint8, &function0);
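
Before any schedule is applied, the iteration domain 0<=i<10 and 0<=j<10 reads like an ordinary doubly nested loop. The following is a plain C++ sketch of what S0 describes, not the code Tiramisu generates; grid_t, run_s0 and all_equal are names made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using grid_t = std::array<std::array<std::uint8_t, 10>, 10>;

// Plain-loop reading of S0: every point (i, j) of the domain gets 3 + 4.
grid_t run_s0() {
    const int e3 = 3 + 4;
    assert(e3 >= 0 && e3 <= 255);  // the value must fit the uint8 element type

    grid_t out{};
    for (int i = 0; i < 10; ++i)
        for (int j = 0; j < 10; ++j)
            out[i][j] = static_cast<std::uint8_t>(e3);
    return out;
}

// Helper: true when every element of the grid equals `expected`.
bool all_equal(const grid_t &g, std::uint8_t expected) {
    for (const auto &row : g)
        for (std::uint8_t v : row)
            if (v != expected)
                return false;
    return true;
}
```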

For debugging, print the iteration domain.

    // Dump the iteration domain of the function.
    function0.dump_iteration_domain();

Scheduling the computation.
Split the computation into 2x2 tiles and parallelize the outer tile loop (i0).
Then dump the resulting schedule.

    // Set the schedule of each computation.
    // The identity schedule means that the program order is not modified
    // (i.e. no optimization is applied).
    S0.tile(var("i"), var("j"), 2, 2, var("i0"), var("j0"), var("i1"), var("j1"));
    S0.tag_parallel_level(var("i0"));

    // Dump the schedule.
    function0.dump_schedule();
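
A 2x2 tiling of the 10 x 10 domain can be pictured as the loop nest below. This is a hand-written sketch assuming the usual meaning of tiling (i = 2*i0 + i1, j = 2*j0 + j1); the loop order Tiramisu actually emits may differ. visits_t, run_tiled_s0 and every_point_visited_once are names made up for this illustration.

```cpp
#include <array>
#include <cassert>

using visits_t = std::array<std::array<int, 10>, 10>;

// Walks the domain tile by tile and counts how often each point is visited.
visits_t run_tiled_s0() {
    visits_t visits{};
    for (int i0 = 0; i0 < 5; ++i0)              // outer tile loop, the one tagged parallel
        for (int j0 = 0; j0 < 5; ++j0)          // outer tile loop over j
            for (int i1 = 0; i1 < 2; ++i1)      // position inside the tile
                for (int j1 = 0; j1 < 2; ++j1) {
                    const int i = 2 * i0 + i1;
                    const int j = 2 * j0 + j1;
                    assert(i < 10 && j < 10);   // 10 is divisible by 2, so no partial tiles
                    ++visits[i][j];
                }
    return visits;
}

// True when the tiled order covers every point exactly once.
bool every_point_visited_once(const visits_t &v) {
    for (const auto &row : v)
        for (int count : row)
            if (count != 1)
                return false;
    return true;
}
```

The iterations of the i0 loop touch disjoint rows, which is why that loop can safely be run in parallel.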

In Layer III, we declare the buffer (buf0) that belongs to function0: element type uint8, size 10x10.

    // Create a buffer buf0.  This buffer is supposed to be allocated outside
    // the function "function0" and passed to it as an argument.
    // (actually any buffer of type a_output or a_input should be allocated
    // by the caller, in contrast to buffers of type a_temporary which are
    // allocated automatically by the Tiramisu runtime within the callee
    // and should not be passed as arguments to the function).
    buffer buf0("buf0", {tiramisu::expr(10), tiramisu::expr(10)}, p_uint8, a_output,
                &function0);

Map the output of the computation (S0) to the buffer.

    // Map the computations to a buffer (i.e. where each computation
    // should be stored in the buffer).
    // This mapping will be updated automatically when the schedule
    // is applied. To disable automatic data mapping updates use
    // global::set_auto_data_mapping(false).
    S0.set_access("{S0[i,j]->buf0[i,j]}");
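
The access relation {S0[i,j]->buf0[i,j]} says where iteration (i, j) stores its result. Here is a rough sketch of that idea as a function from iteration indices to buffer indices. To make the mapping visible it stores 10*i + j where the tutorial stores 7, and the transposed variant is a hypothetical alternative that does not appear in the tutorial. All names are made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

using index_t = std::pair<int, int>;
using buf_t = std::array<std::array<std::uint8_t, 10>, 10>;

// Corresponds to {S0[i,j]->buf0[i,j]}.
index_t identity_access(int i, int j) { return {i, j}; }

// Hypothetical alternative, in the spirit of {S0[i,j]->buf0[j,i]}.
index_t transposed_access(int i, int j) { return {j, i}; }

// Runs the 10 x 10 iteration space and stores each result where `access` says.
buf_t store_with(index_t (*access)(int, int)) {
    buf_t buf0{};
    for (int i = 0; i < 10; ++i)
        for (int j = 0; j < 10; ++j) {
            const index_t idx = access(i, j);
            assert(idx.first >= 0 && idx.first < 10);
            assert(idx.second >= 0 && idx.second < 10);
            // 10*i + j is at most 99, so it fits in uint8.
            buf0[idx.first][idx.second] = static_cast<std::uint8_t>(10 * i + j);
        }
    return buf0;
}
```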

From here on it's code generation. buf0 is set as the (output) argument of the function.

    // Add buf0 as an argument to the function.
    function0.set_arguments({&buf0});

Convert the iteration domain to the time-space domain. dump_time_processor_domain is for debugging.

    // Generate the time-processor domain of the computation
    // and dump it on stdout.
    // This is purely for debugging, the time-processor representation
    // is not used later on.
    function0.gen_time_space_domain();
    function0.dump_time_processor_domain();

Generate the AST (Abstract Syntax Tree).
After that, generate the Halide statement for the function.
function0.dump_halide_stmt() dumps the generated Halide statement for debugging.

    // Generate an AST (abstract Syntax Tree)
    function0.gen_isl_ast();

    // Generate Halide statement for the function.
    function0.gen_halide_stmt();

    // Dump the Halide stmt generated by gen_halide_stmt()
    // for the function.
    function0.dump_halide_stmt();

Finally, generate the object file for function0 (build/generated_fct_tutorial_01.o).

    // Generate an object file from the function.
    function0.gen_halide_obj("build/generated_fct_tutorial_01.o");

    return 0;
}

The code that calls into the generated object file (build/generated_fct_tutorial_01.o) is written separately.

#define NN 10
#define MM 10

int main(int, char **)
{
    Halide::Buffer<int32_t> output(NN, MM);
    init_buffer(output, (int32_t)9);

    std::cout << "Array (after initialization)" << std::endl;
    print_buffer(output);

    function0(output.raw_buffer());

    std::cout << "Array after the Halide pipeline" << std::endl;
    print_buffer(output);

    Halide::Buffer<int32_t> expected(NN, MM);
    init_buffer(expected, (int32_t)7);

    compare_buffers("tutorial_01", output, expected);

    return 0;
}

The buffer (output) is initialized to 9, and output.raw_buffer() is passed as the argument to function0.

The buffer (expected) is initialized to 7, and compare_buffers compares output against expected.
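
The check the wrapper performs can be sketched without Halide or Tiramisu as follows. mock_function0 stands in for the generated function, and make_filled_buffer and buffers_match are hypothetical substitutes for init_buffer and compare_buffers, whose real signatures are not shown in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using flat_buffer = std::vector<std::int32_t>;

// Stand-in for Halide::Buffer plus init_buffer: rows*cols elements, all `value`.
flat_buffer make_filled_buffer(int rows, int cols, std::int32_t value) {
    assert(rows > 0 && cols > 0);
    return flat_buffer(static_cast<std::size_t>(rows) * cols, value);
}

// Stand-in for the generated function0: overwrites every element with 7.
void mock_function0(flat_buffer &out) {
    for (std::int32_t &v : out)
        v = 7;
}

// Stand-in for compare_buffers: true when both buffers hold the same data.
bool buffers_match(const flat_buffer &a, const flat_buffer &b) {
    return a == b;
}

// Same flow as the wrapper: init to 9, run the function, compare against 7.
bool run_wrapper_check() {
    flat_buffer output = make_filled_buffer(10, 10, 9);
    mock_function0(output);
    const flat_buffer expected = make_filled_buffer(10, 10, 7);
    return buffers_match(output, expected);
}
```

Initializing output to 9 first matters: if the function wrote nothing, the comparison against 7 would fail.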

With the debugging parts removed, the whole thing boils down to:

int main(int argc, char **argv)
{
    // Set default tiramisu options.
    global::set_default_tiramisu_options();
    global::set_loop_iterator_type(p_int32);

    function function0("function0");

    // Layer I
    expr e3 = expr(3) + expr(4);

    // Layer II
    computation S0("[N]->{S0[i,j]: 0<=i<10 and 0<=j<10}", e3, true, p_uint8, &function0);
    S0.tile(var("i"), var("j"), 2, 2, var("i0"), var("j0"), var("i1"), var("j1"));
    S0.tag_parallel_level(var("i0"));

    // Layer III
    buffer buf0("buf0", {tiramisu::expr(10), tiramisu::expr(10)}, p_uint8, a_output,
                &function0);
    S0.set_access("{S0[i,j]->buf0[i,j]}");

    // Code generation
    function0.set_arguments({&buf0});
    function0.gen_isl_ast();
    function0.gen_halide_stmt();
    function0.gen_halide_obj("build/generated_fct_tutorial_01.o");

    return 0;
}

Basically, just like Halide, the function is generated as an object file,
and that object file is then linked in and called.