Introduction

I built Tenstorrent's TT-Budabackend on Windows 11 Pro + WSL2 + Ubuntu 22.04 LTS and ran its test code.

Cloning TT-Budabackend

I cloned the TT-Budabackend repository and built it.

git clone https://github.com/tenstorrent/tt-budabackend.git
cd tt-budabackend
git submodule update --init --recursive

Binaries under third_party are downloaded at this step, so git lfs was required.

Build

The supported environment is said to be Ubuntu 20.04 LTS, but the build also worked on 22.04 LTS.

export ARCH_NAME=grayskull

# Build backend
make -j16 build_hw

# Build a test runner
make -j16 verif/netlist_test

For now, the build succeeded up to this point.

Generating the netlist

 ./build/test/verif/graph_tests/test_graph --netlist verif/graph_tests/netlists/netlist_softmax_single_tile.yaml --silicon --mode 1

Running the command above worked, as shown below, except for the parts that need actual hardware (a Grayskull Dev Kit).

2024-02-04 12:51:06.705 | INFO     | Verif           - Deleting existing output_dir: "./tt_build/test_graph_7357905755580061414"
2024-02-04 12:51:06.710 | INFO     | Verif           - using output_dir: ./tt_build/test_graph_7357905755580061414
2024-02-04 12:51:06.724 | INFO     | Test            - Graph test started
2024-02-04 12:51:06.724 | INFO     | Test            - Generating random test seed since '--seed' is unspecified or '--seed -1' was specified.
2024-02-04 12:51:06.724 | INFO     | Verif           - Generated Test Seed = -1131874758
2024-02-04 12:51:06.724 | INFO     | Verif           - Setting Test Seed = -1131874758
2024-02-04 12:51:06.818 | INFO     | Runtime         - Running tt_runtime on host: 'LAPTOP-HGGR6RPK'
2024-02-04 12:51:06.819 | INFO     | PerfInfra       - Backend profiler is disabled
2024-02-04 12:51:06.836 | INFO     | Runtime         - No silicon devices detected.
2024-02-04 12:51:06.913 | INFO     | Runtime         - Compiling Firmware for TT device
2024-02-04 12:51:11.410 | INFO     | Test            - Backends teardown finished
2024-02-04 12:51:11.430 | INFO     | Test            - Elapsed Time: 4.699999999999999s
2024-02-04 12:51:11.459 | WARNING  | SiliconDriver   - Did not find any PCI devices matching Tenstorrent vendor_id 0x1e52
2024-02-04 12:51:11.459 | INFO     | Debuda          - Debug server ended on

The input file (netlist) is verif/graph_tests/netlists/netlist_softmax_single_tile.yaml.

devices:
  arch: [grayskull, wormhole, wormhole_b0]

queues:
  input_activation: {type: queue, input: HOST, entries: 2, grid_size: [1, 1], t: 1, mblock: [1, 1], ublock: [1, 1], df: Float16_b, target_device: 0, loc: dram, dram: [[0, 0x20000000]]}
  input_constant  : {type: queue, input: HOST, entries: 2, grid_size: [1, 1], t: 1, mblock: [1, 1], ublock: [1, 1], df: Float16_b, target_device: 0, loc: dram, dram: [[0, 0x21000000]]}
  output_softmax  : {type: queue, input: mult, entries: 2, grid_size: [1, 1], t: 1, mblock: [1, 1], ublock: [1, 1], df: Float16_b, target_device: 0, loc: dram, dram: [[0, 0x22000000]]}

graphs:
  softmax_single_tile:
    target_device: 0
    input_count: 1
    exp:   {type: exp,        grid_loc: [0, 0], grid_size: [1, 1], inputs: [input_activation   ], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b],            acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3}
    sum:   {type: matmul,     grid_loc: [0, 1], grid_size: [1, 1], inputs: [exp, input_constant], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b, Float16_b], acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3, attributes: {m_k: 1, u_kt: 1}}
    recip: {type: reciprocal, grid_loc: [0, 2], grid_size: [1, 1], inputs: [sum                ], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b],            acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3}
    mult:  {type: multiply,   grid_loc: [0, 3], grid_size: [1, 1], inputs: [exp, recip         ], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b, Float16_b], acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3}

programs:
  - run_softmax_single_tile:
    - var: [$c_one, $c_zero, $gptr_q0, $gptr_q1, $c_num_microbatches, $lptr_q0, $lptr_q1]
    - varinst: [$c_zero, set, 0]
    - varinst: [$c_one, set, 1]
    - varinst: [$c_num_microbatches, set, 1]
    - loop: $c_num_microbatches
    -   execute: {
          graph_name: softmax_single_tile,
          queue_settings: {
               input_constant:   {prologue: false, epilogue: false, zero: false, rd_ptr_local: $lptr_q1, rd_ptr_global: $gptr_q1},
               input_activation: {prologue: false, epilogue: false, zero: false, rd_ptr_local: $lptr_q0, rd_ptr_global: $gptr_q0}
          }
        }
    -   varinst: [$lptr_q0, add, $lptr_q0, $c_one]
    -   varinst: [$gptr_q0, add, $gptr_q0, $c_one]
    -   varinst: [$lptr_q1, add, $lptr_q1, $c_one]
    -   varinst: [$gptr_q1, add, $gptr_q1, $c_one]
    - endloop

test-config:
  comparison-config:
    type: AllCloseHw
    atol: 0.01
    rtol: 0.15
    check_pcc: 0.99
    check_pct: 0.80
    verbosity: Concise
  stimulus-config:
    type: Uniform
    uniform_lower_bound: -1.0
    uniform_upper_bound: 1.0
    overrides:
      input_constant: 
        type: Constant
        constant_value: 1.0

queues:
  input_activation: {type: queue, input: HOST, entries: 2, grid_size: [1, 1], t: 1, mblock: [1, 1], ublock: [1, 1], df: Float16_b, target_device: 0, loc: dram, dram: [[0, 0x20000000]]}
  input_constant  : {type: queue, input: HOST, entries: 2, grid_size: [1, 1], t: 1, mblock: [1, 1], ublock: [1, 1], df: Float16_b, target_device: 0, loc: dram, dram: [[0, 0x21000000]]}
  output_softmax  : {type: queue, input: mult, entries: 2, grid_size: [1, 1], t: 1, mblock: [1, 1], ublock: [1, 1], df: Float16_b, target_device: 0, loc: dram, dram: [[0, 0x22000000]]}

The graphs section, shown below, uses four Compute Tiles ([0, 0], [0, 1], [0, 2], [0, 3]): [0, 0] runs exp, [0, 1] runs sum, [0, 2] runs recip, and [0, 3] runs mult.

  softmax_single_tile:
    target_device: 0
    input_count: 1
    exp:   {type: exp,        grid_loc: [0, 0], grid_size: [1, 1], inputs: [input_activation   ], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b],            acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3}
    sum:   {type: matmul,     grid_loc: [0, 1], grid_size: [1, 1], inputs: [exp, input_constant], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b, Float16_b], acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3, attributes: {m_k: 1, u_kt: 1}}
    recip: {type: reciprocal, grid_loc: [0, 2], grid_size: [1, 1], inputs: [sum                ], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b],            acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3}
    mult:  {type: multiply,   grid_loc: [0, 3], grid_size: [1, 1], inputs: [exp, recip         ], t: 1, mblock: [1, 1], ublock: [1, 1], in_df: [Float16_b, Float16_b], acc_df: Float16, out_df: Float16_b, intermed_df: Float16_b, ublock_order: r, buf_size_mb: 2, math_fidelity: HiFi3}
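As a rough illustration of how these four ops combine into a softmax, here is a minimal pure-Python sketch of the same chain. It assumes that softmax is taken along each row, that the tile is square, and that input_constant is filled with 1.0 (as the stimulus-config override in the netlist suggests). The function name softmax_tile and the 2x2 example values are hypothetical, and the sketch ignores the Float16_b precision used on the device.

```python
import math

def softmax_tile(tile):
    """Row-wise softmax following the exp -> sum -> recip -> mult op chain."""
    n = len(tile[0])
    # Stand-in for input_constant: a square tile of 1.0 (assumption).
    ones = [[1.0] * n for _ in range(n)]

    # exp: element-wise exponential
    exp = [[math.exp(v) for v in row] for row in tile]

    # sum: matmul(exp, ones); every column of the result holds the row sum
    total = [[sum(exp[i][k] * ones[k][j] for k in range(n)) for j in range(n)]
             for i in range(len(exp))]

    # recip: element-wise reciprocal
    recip = [[1.0 / v for v in row] for row in total]

    # mult: element-wise product of exp and recip
    return [[e * r for e, r in zip(exp_row, recip_row)]
            for exp_row, recip_row in zip(exp, recip)]

out = softmax_tile([[0.5, -0.25], [1.0, 1.0]])
for row in out:
    print([round(v, 4) for v in row], "row sum =", round(sum(row), 4))
```

Using a matmul against an all-ones tile is what turns the row sum into a full tile, so the following reciprocal and multiply can stay purely element-wise.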

The inputs and outputs are as follows. input_activation is the input data, input_constant is a constant value, and output_softmax is the output data. input_activation and input_constant are connected to HOST: input_activation is written to DRAM at 0x20000000, and input_constant to DRAM at 0x21000000. output_softmax receives its data from mult and is mapped to DRAM at 0x22000000.
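To get a feel for the DRAM layout, the arithmetic below computes the spacing between the three queue addresses and a rough payload size per queue. The tile size (32x32 elements) and the element size (2 bytes for Float16_b) are assumptions not stated in the netlist, and any per-tile or per-queue header is ignored, so treat the payload figure as a lower-bound sketch.

```python
TILE_DIM = 32          # assumption: one tile is 32x32 elements
BYTES_PER_ELEMENT = 2  # assumption: Float16_b occupies 2 bytes

# DRAM addresses copied from the queues section of the netlist
queue_addresses = {
    "input_activation": 0x20000000,
    "input_constant": 0x21000000,
    "output_softmax": 0x22000000,
}

def payload_bytes(entries, tiles_per_entry=1):
    """Raw data size of a queue, excluding any headers (hypothetical helper)."""
    return entries * tiles_per_entry * TILE_DIM * TILE_DIM * BYTES_PER_ELEMENT

addrs = sorted(queue_addresses.values())
gaps = [b - a for a, b in zip(addrs, addrs[1:])]
print("spacing between queues:", [hex(g) for g in gaps])
print("payload per queue (entries: 2, one tile each):", payload_bytes(entries=2), "bytes")
```

Under these assumptions each queue holds only a few KiB of data, far less than the 16 MiB spacing between the addresses.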

Running the command generated the following files under ./tt_build/test_graph_7357905755580061414.

blob_init         graph_softmax_single_tile  net2pipe.log         placement_report.yaml  temporal_epoch_0
brisc             ncrisc                     netlist_queues.yaml  reports
device_desc.yaml  net2pipe.err               padding_table.yaml   runtime_data.yaml

graph_softmax_single_tile/op_info.txt contains:

op_0: exp
op_1: mult
op_2: recip
op_3: sum

The program for each op was generated in the following directories.
- op_0 : graph_softmax_single_tile/op_0
- op_1 : graph_softmax_single_tile/op_1
- op_2 : graph_softmax_single_tile/op_2
- op_3 : graph_softmax_single_tile/op_3
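The mapping from op name to output directory can be recovered from op_info.txt with a few lines of Python. This is only a sketch: parse_op_info is a hypothetical helper, and the file contents are inlined here rather than read from disk.

```python
# Contents of graph_softmax_single_tile/op_info.txt, inlined for the example
OP_INFO_TEXT = """op_0: exp
op_1: mult
op_2: recip
op_3: sum
"""

def parse_op_info(text):
    """Map each op name to its op_N identifier (hypothetical helper)."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        op_id, name = line.split(":", 1)
        mapping[name.strip()] = op_id.strip()
    return mapping

op_dirs = {name: f"graph_softmax_single_tile/{op_id}"
           for name, op_id in parse_op_info(OP_INFO_TEXT).items()}
for name, path in op_dirs.items():
    print(f"{name:5s} -> {path}")
```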

reports/op_to_pipe_map_temporal_epoch_0.yaml contains the connection information.

ops:
  sum:
    inputs:
      - name: exp
        type: op
        pipes:
          0-1-2:
            - 121000000000
      - name: input_constant
        type: queue
        pipes:
          0-1-2:
            - 122000000000
    outputs:
      - name: recip
        type: op
        pipes:
          0-1-2:
            - 120000000000
  recip:
    inputs:
      - name: sum
        type: op
        pipes:
          0-1-3:
            - 120000000000
    outputs:
      - name: mult
        type: op
        pipes:
          0-1-3:
            - 118000000000
  mult:
    inputs:
      - name: exp
        type: op
        pipes:
          0-1-4:
            - 117000000000
      - name: recip
        type: op
        pipes:
          0-1-4:
            - 118000000000
    outputs:
      - name: output_softmax
        type: queue
        pipes:
          0-1-4:
            - 119000000000
  exp:
    inputs:
      - name: input_activation
        type: queue
        pipes:
          0-1-1:
            - 116000000000
    outputs:
      - name: sum
        type: op
        pipes:
          0-1-1:
            - 121000000000
      - name: mult
        type: op
        pipes:
          0-1-1:
            - 117000000000
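One way to read this report is to check that each op-to-op link carries the same pipe ID on the producer's output side and the consumer's input side. The sketch below inlines the IDs from the report as a plain dictionary instead of parsing the YAML. The helper op_to_op_edges is hypothetical, and the interpretation that matching IDs identify one shared pipe is my reading of the data, not something the report states.

```python
# Pipe IDs copied from reports/op_to_pipe_map_temporal_epoch_0.yaml
ops = {
    "exp":   {"inputs":  {"input_activation": 116000000000},
              "outputs": {"sum": 121000000000, "mult": 117000000000}},
    "sum":   {"inputs":  {"exp": 121000000000, "input_constant": 122000000000},
              "outputs": {"recip": 120000000000}},
    "recip": {"inputs":  {"sum": 120000000000},
              "outputs": {"mult": 118000000000}},
    "mult":  {"inputs":  {"exp": 117000000000, "recip": 118000000000},
              "outputs": {"output_softmax": 119000000000}},
}

def op_to_op_edges(ops):
    """Return (producer, consumer, pipe_id, ids_match) for every op-to-op link."""
    edges = []
    for producer, info in ops.items():
        for consumer, pipe_id in info["outputs"].items():
            if consumer not in ops:
                continue  # queue endpoints have no entry of their own
            ids_match = ops[consumer]["inputs"].get(producer) == pipe_id
            edges.append((producer, consumer, pipe_id, ids_match))
    return edges

for producer, consumer, pipe_id, ids_match in op_to_op_edges(ops):
    print(f"{producer} -> {consumer}: pipe {pipe_id} (match={ids_match})")
```

The four edges it prints (exp to sum, exp to mult, sum to recip, recip to mult) are the same dataflow described by the inputs fields in the netlist.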

placement_report.yaml contains the placement information.

op_0 : [1, 1] : exp (exp)
op_3 : [2, 1] : sum (matmul)
op_2 : [3, 1] : recip (reciprocal)
op_1 : [4, 1] : mult (multiply)

softmax_single_tile:
  placement:
    - "0   3   2   1   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
    - ".   .   .   .   .   .   .   .   .   .   .   .   "
  Op_types:
    matmul:
      Num_cores: 1
      Core_utilization(%): 0.833333313
    reciprocal:
      Num_cores: 1
      Core_utilization(%): 0.833333313
    multiply:
      Num_cores: 1
      Core_utilization(%): 0.833333313
    exp:
      Num_cores: 1
      Core_utilization(%): 0.833333313
  Total_core_utilization(%): 3.33333325
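The utilization figures in the report can be reproduced with simple arithmetic, assuming the placement grid printed in the report (10 rows of 12 entries) covers all available cores. The variable names are mine.

```python
placement_rows = 10  # rows printed under "placement"
placement_cols = 12  # entries per row
total_cores = placement_rows * placement_cols

cores_per_op = 1     # Num_cores reported for each op type
num_ops = 4          # exp, matmul, reciprocal, multiply

per_op_pct = cores_per_op / total_cores * 100
total_pct = num_ops * cores_per_op / total_cores * 100

print(f"total cores: {total_cores}")
print(f"per-op utilization: {per_op_pct:.6f} %")
print(f"total utilization:  {total_pct:.6f} %")
```

The small difference from the reported 0.833333313 and 3.33333325 is consistent with the report having been computed in single precision, though that is a guess.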

Conclusion

Now that the TT-Budabackend test code builds, I can see what kinds of files get generated.

Running with a different netlist from verif/graph_tests/netlists changes the generated files in various ways.
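To try several netlists in a row, a small script can build one command line per file. The sketch below only prints the commands and reuses the flags from the run shown earlier. Whether every netlist in that directory works with these exact flags is an assumption I have not checked, and build_command is a hypothetical helper.

```python
from pathlib import Path

TEST_BINARY = "./build/test/verif/graph_tests/test_graph"
NETLIST_DIR = Path("verif/graph_tests/netlists")

def build_command(netlist):
    """Build the test_graph command line for one netlist (hypothetical helper)."""
    return [TEST_BINARY, "--netlist", str(netlist), "--silicon", "--mode", "1"]

# Prints nothing if the directory does not exist or holds no .yaml files.
for netlist in sorted(NETLIST_DIR.glob("*.yaml")):
    print(" ".join(build_command(netlist)))
```

Each printed line can be pasted into the shell from the repository root, or the list can be passed to subprocess.run to execute it directly.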

Vengineerの妄想(準備期間)
