OpenACC ディレクティブによるプログラミング

12 - 3 章 スパース行列の CSR フォーマット変換と CUDA Linear Solver との連携

1. Bi-Conjugate Gradient Stabilized(BiCGStab)with OpenACC プログラム例

 前章 12-2 では、OpenACC + cuSPARSE を利用した Bi-Conjugate Gradient Stabilized(BiCGStab)実装プログラム例を示した。このルーチンを汎用的に使用できるように、疎行列のフォーマット変換も含んだ形でプログラムを組み直した。NVIDIA CUDA 線形ライブラリでは、計算に使用するスパース行列のフォーマットが CSR タイプを基本とするため、CUDA ライブラリを使用する場合は、事前に、CSR フォーマットに変換する必要がある。この項では、cuSPARSE ライブラリを使用してデータ・フォーマットの変換を行い、デバイス側で BiCGStab 反復法で線形方程式を解く Fortran プログラムの例を示すこととする。ここで例示するソース・ファイルと Makefile 一式をアーカイブした tar ファイルを提供する。以下のファイルをダウンロードしていただきたい。

 プログラム tar ファイル:openacc_bicgstab.tar.gz (2017.11.14更新しました。プログラム修正)

 以下の例は、PGI 17.10 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。

$  pgfortran -V
pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

 また、使用した GPU は以下のとおりである。

$ pgaccelinfo

CUDA Driver Version:           9000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  384.59  Wed Jul 19 23:53:34 PDT 2017

Device Number:                 0
Device Name:                   GeForce GTX 1080 Ti
Device Revision Number:        6.1
Global Memory Size:            11714691072
Number of Multiprocessors:     28
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1657 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             5505 MHz
Memory Bus Width:              352 bits
L2 Cache Size:                 2883584 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60

COO フォーマットから CSR フォーマットへの変換

 スパース行列を表現するフォーマットは、様々提案されているが、ここでは、サンプル係数行列の流通に使用されている Matrix MarketMTX フォーマットを入力として CSR フォーマットに変換するプログラムを紹介する。サンプル係数行列を提供しているサイトとして有名なのは、他に、The SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) があり、ここでのデータを利用して線形システムのテスト実行が可能である。これらのサイトでは、MTXフォーマットのデータの提供を行っている。

 Matrix Market(MTX) フォーマットは、ヘッダー部分を除けば、非ゼロ要素の配列位置型(coordinate) で表す、いわゆる COO 形式となっている。COO フォーマットとは、(行, 列, 値) のタプルでリストを作る方法である。CUDA ライブラリ関数へ入力する係数行列のストレージ・フォーマットは CSR 形式としているため、COO 形式を CSR 形式に変換する必要がある。

提供プログラムの注意点を述べる。Matrix Market のフォーマットで「symmetric」属性を持つファイルは、対称行列の上三角行列のみ収納されている。一方、「general」属性を持つものは、「一般行列」を意味しノンゼロの配列要素が全て記述されている。mm_convert_csr.f90 プログラムは、デフォルトで「symmetric」属性を有する対称行列要素を「一般行列」に拡張するようにしている。プログラム内で、対称行列を一般行列に拡張するためのフラグ変数は extendSymMatrix であり、これにより制御している。「一般行列」に拡張する理由は、BiCGstab の前処理に不完全 LU 分解 (cusparseDcsrilu0) を施しているが、この cusparseDcsrilu0 ルーチンの入力仕様が当然ながら「一般行列」の入力を必要とするからである。

 Matrix Market サイトでは、MTXフォーマットのファイルを読み込むための fortran ルーチン(mmio.f) が提供されている。今回、tar ファイル内に含めた mmio.f は、サブルーチンの仮引数にオプショナル引数を使用できるように変更し、F90 MODULE形式にしたものとした。当該 mmio.f ルーチンは、mtx ファイルのヘッダ情報を読み、COO 形式のデータを読み込むためのルーチンとなる。

 以下に示すプログラム(mm_convert_csr.f90) は、MTX ファイルを CSR 形式のデータに変換するものである。データ形式の変換は、cuSPARSE ライブラリの関数を利用するため、デバイス側での処理を含む。このため、デバイス・データであることを指示するために OpenACC ディレクティブを利用する。以下の表は、mm_convert_csr.f90 内の主なルーチンの機能を示したものである。

module Matrix_data COO形式の配列バッファ宣言とCSR形式のデータ配列宣言、結果ベクトル配列、RHS ベクトル配列宣言
subroutine mm_matrix_read MTX形式のデータの読み込みルーチン
subroutine mm_rhs_read 線形方程式の右辺項ベクトルの読み込みルーチン
subroutine coo2csr_convert COO形式データをCSR形式データに変換するルーチン。配列の index base は 1 として変換する。cuSPARSE ライブラリを使用するため、デバイス側での処理も含むため、OpenACC 指示行を使用する。なお、このルーチンは、データ型として倍精度の対応のみ行なっており、複素数型を扱うには修正が必要である。なお、このルーチンの処理後、csrValA、csrRowPtrA、csrColIndA の CSR データの配列要素は、ホスト側にもデバイス側にも配置された状態にあなる。

 CSR フォーマットの詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。

nz 整数 行列内の非ゼロ要素の数
csrValA 配列 行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format.
csrRowPtrA 配列 配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。 この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA(0)を含みます。 一般に、csrRowPtrA(0)は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる。ここでは 1 を使用する。
csrColIndA 配列 配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列

MTXファイルから CSR 形式のデータに変換するルーチン(mm_convert_csr.f90)

! Copyright (c) 2017, Tsutomu Kato, SofTek Systems. Inc. All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!

module Matrix_data
! COO storage data and CSR data -- declare variables
  use precision
  implicit none

! Buffer for Matrix Market data format (as we say "COO" format)
    integer      :: rows, cols, nnz, nnzmax
    character*10 :: rep
    character*7  :: field
    character*19 :: symm
    integer, allocatable, dimension(:)       :: indx, jndx
    integer, allocatable, dimension(:)       :: ival
    real(fp_kind), allocatable, dimension(:) :: rval
    complex*16, allocatable, dimension(:)    :: cval

! Arrays for CUDA cuSPARSE CSR format (available for REAL data only)
    integer  :: m                            ! rows size of A matrix
    integer  :: n                            ! columns size of A matrix
    integer  :: nz                           ! nonzero element of A
    integer, allocatable, dimension(:)       :: csrRowPtrA  ! m + 1
    integer, allocatable, dimension(:)       :: csrColIndA  !
    real(fp_kind), allocatable, dimension(:) :: csrValA
!
    real(fp_kind), allocatable, dimension(:) :: x   ! initial guess & result vector
    real(fp_kind), allocatable, dimension(:) :: rhs ! right hand side vector

end module Matrix_data

module mm_coo_cuda_csr_convert
  use precision
  use Matrix_data
  use mmio
  implicit none

  contains

  subroutine matrix_printout(f, nonzero, i_ndx, j_ndx, rval1, ival1, cval1)
  implicit none
      integer                    :: i, nonzero
      character*7                                :: f
      integer                    :: i_ndx(nonzero)
      integer                    :: j_ndx(nonzero)
      integer,          optional :: ival1(nonzero)
      double precision, optional :: rval1(nonzero)
      complex*16,       optional :: cval1(nonzero)

      do i = 1, min(nonzero,10)
        if (f == "complex") then
          print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),cval1(i)
        elseif (f == "real") then
          print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),rval1(i)
        elseif (f == "integer") then
          print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),ival1(i)
        end if
      end do
      do i = max(nonzero-10, 1), nonzero
        if (f == "complex") then
          print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),cval1(i)
        elseif (f == "real") then
          print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),rval1(i)
        elseif (f == "integer") then
          print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),ival1(i)
        end if
      end do

  end subroutine matrix_printout

  subroutine csr_printout
  implicit none
    integer :: i
      print *
      print *,"== CSR data check : Start from 1  to 10 elements =="
      if (field == "complex") then
        print '(1x, a/, 7(1x,I10   ))',"csrRowPtrA=",(csrRowPtrA(i), i=1,min(m+1,10))
        print '(1x, a/, 7(1x,I10   ))',"csrColIndA=",(csrColIndA(i), i=1,min(nz,10))
        print '(1x  a/, 7(1x,2d10.4))',"csrValA  =",(csrValA(i), i=1,min(nz,10))
      elseif (field == "real") then
        print '(1x, a/, 7(1x,I10  ))',"csrRowPtrA=",(csrRowPtrA(i), i=1,min(m+1,10))
        print '(1x, a/, 7(1x,I10  ))',"csrColIndA=",(csrColIndA(i), i=1,min(nz,10))
        print '(1x, a/, 7(1x,d10.4))',"csrValA  =",(csrValA(i), i=1,min(nz,10))
      end if

      print *
      print *,"== Start from endpoint-10  to endpoint =="
      if (field == "complex") then
        print '(1x, a/, 7(1x,I10   ))',"csrRowPtrA=",(csrRowPtrA(i), i=max(m+1-10,1),m+1)
        print '(1x, a/, 7(1x,I10   ))',"csrColIndA=",(csrColIndA(i), i=max(nz-10,1),nz)
        print '(1x, a/, 7(1x,2d10.4))',"csrValA=",(csrValA(i), i=max(nz-10,1),nz)
      elseif (field == "real") then
        print '(1x, a/, 7(1x,I10  ))',"csrRowPtrA=",(csrRowPtrA(i), i=max(m+1-10,1), m+1)
        print '(1x, a/, 7(1x,I10  ))',"csrColIndA=",(csrColIndA(i), i=max(nz-10,1),nz)
        print '(1x, a/, 7(1x,d10.4))',"csrValA=",(csrValA(i), i=max(nz-10,1),nz)
      end if

  end subroutine csr_printout

  subroutine set_matrix(iunit1, extendSymMatrix, field, symm, t_nnz, t_indx, t_jndx, &
                        t_ival, t_rval, t_cval)
  use precision
  use Matrix_data

    character*7             :: field
    character*19            :: symm
    integer                 :: t_nnz
    integer ,optional       :: t_indx(t_nnz), t_jndx(t_nnz)
    integer, optional       :: t_ival(t_nnz)
    real(fp_kind), optional :: t_rval(t_nnz)
    complex*16, optional    :: t_cval(t_nnz)

    integer                 :: i, j, iunit1, count
    integer                 :: extendSymMatrix   ! 1 = extend , 0 = no


        if (symm == "symmetric" .and. extendSymMatrix == 1) then
          ! count number of non-diagonal elements
          count = 0
          do i = 1, t_nnz
            if (t_indx(i) /= t_jndx(i)) count=count+1
          end do
          allocate(indx(count + t_nnz))
          allocate(jndx(count + t_nnz))
            if (field == "real"   ) allocate(rval(count + t_nnz))
            if (field == "integer") allocate(ival(count + t_nnz))
            if (field == "complex") allocate(cval(count + t_nnz))
          ! copy the elements regular and transposed locations
          j = 1
          do i = 1, t_nnz
            indx(j) = t_indx(i)
            jndx(j) = t_jndx(i)
            if (field == "real"   ) rval(j) = t_rval(i)
            if (field == "integer") ival(j) = t_ival(i)
            if (field == "complex") cval(j) = t_cval(i)
            j = j + 1
            if (t_indx(i) /= t_jndx(i)) then
              indx(j) = t_jndx(i)
              jndx(j) = t_indx(i)
              if (field == "real"   ) rval(j) = t_rval(i)
              if (field == "integer") ival(j) = t_ival(i)
              if (field == "complex") cval(j) = t_cval(i)
              j = j + 1
            end if
          end do
          t_nnz = t_nnz + count
        else
          allocate(indx(t_nnz))
          allocate(jndx(t_nnz))
          allocate(cval(t_nnz))
          indx(:) = t_indx(:)
          jndx(:) = t_jndx(:)
          if (field == "real"   ) then
            allocate(rval(t_nnz))
            rval(:) = t_rval(:)
          end if
          if (field == "integer") then
            allocate(ival(t_nnz))
            ival(:) = t_ival(:)
          end if
          if (field == "complex") then
            allocate(cval(t_nnz))
            cval(:) = t_cval(:)
          end if
        end if

  end subroutine set_matrix

  subroutine mm_matrix_read(iunit1, extendSymMatrix)

!   Read Matrix Market format file

    implicit none
    integer, allocatable, dimension(:)       :: t_indx, t_jndx
    integer, allocatable, dimension(:)       :: t_ival
    real(fp_kind), allocatable, dimension(:) :: t_rval
    complex*16, allocatable, dimension(:)    :: t_cval
    integer      :: i, iunit1, count
    integer      :: extendSymMatrix   ! 1 = extend , 0 = none

!   Read header infos
      call mminfo(iunit1, rep, field, symm, rows, cols, nnz)

       ! small test data !
       ! !       | 1 2 0 |
       ! !   A = | 0 5 0 |   base = 1
       ! !       | 0 8 0 |
       ! rep   = "coordinate"
       ! field = "real"
       ! symm  ="general"
       ! rows = 3 ; cols = 3 ; nnz = 4
       ! indx(1:4) = (/3, 2, 1, 1/)
       ! jndx(1:4) = (/2, 2, 1, 2/)
       ! rval(1:4) =(/8.0, 5.0, 1.0, 2.0/)  ! for small test data

      nnzmax = nnz
      allocate (t_indx(nnz), t_jndx(nnz))

      if (field == "complex") then
        allocate (t_cval(nnz))
        call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, &
                    t_indx, t_jndx, cval=t_cval)
        call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, &
                    t_cval=t_cval)
      elseif (field == "real") then
        allocate (t_rval(nnz))
        call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, &
                    t_indx, t_jndx, rval=t_rval)
        call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, &
                    t_rval=t_rval)
      elseif (field == "integer") then
        allocate (t_ival(nnz))
        call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, &
                    t_indx, t_jndx, ival=t_ival)
        call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, &
                    t_ival=t_ival)
      elseif (field == "pattern") then
        print *, "Pattern field is not available." ; stop "10"
      else
        print *, "Error : data type not found." , field ; stop "10"
      end if

        deallocate(t_indx,t_jndx)
        if (field == "real"   ) deallocate(t_rval)
        if (field == "integer") deallocate(t_ival)
        if (field == "complex") deallocate(t_cval)

      print *,"Properties of Matrix_Market matrix"
      print *,"===================================="
      print '(3(1x,a,1x,a/),4(1x,a,i8/))', &
               "representation =",rep,  &
               "type           =",field,&
               "matrix type    =",symm, &
               "row (m)        =",rows, &
               "cols(n)        =",cols, &
               "Non-zero elm.  =",nnz,  &
               "extendSymMatrix=",extendSymMatrix
      if (symm == "symmetric" .and. extendSymMatrix == 1) &
        print '(/a/)', " ==== Triangular Symmetric matrix has been transformed to full matrix form ===="


      if (field == "complex") then
        call matrix_printout (field, nnz, indx, jndx, cval1=cval)
      elseif (field == "real") then
        call matrix_printout (field, nnz, indx, jndx, rval1=rval)
      elseif (field == "integer") then
        call matrix_printout (field, nnz, indx, jndx, ival1=ival)
        print *, "No available in case of field = integer in this routine"
        stop "20"
      elseif (field == "pattern") then
        print *, "No available in case of field = pattern in this routine"
        stop "30"
      end if
      close(iunit1)

  end subroutine mm_matrix_read

  subroutine mm_rhs_read(iunit2)

!   INPUT   : vector file (Matrix Market format)
!   OUTPUT  : rhs(:) -- global array

    use mmio
    implicit none
    integer      :: i, iunit2
    character*40 :: rhs_file
    integer      :: rows1, cols1, nnz1, nnzmax1
    character*10 :: rep1
    character*7  :: field1
    character*19 :: symm1
    integer, allocatable, dimension(:) :: indx1, jndx1

      call mminfo(iunit2, rep1, field1, symm1, rows1, cols1, nnz1)

      if (rows1 /= rows) then
         print *, "Error : RHS vector size does match with Matrix row size."
         stop
      end if

      print *,"Properties of Matrix_Market RHS vector"
      print *,"===================================="
      print '(3(1x,a,1x,a/),3(1x,a,i8/))', &
               "representation =",rep1,  &
               "type           =",field1,&
               "matrix type    =",symm1, &
               "row (m)        =",rows1, &
               "cols(n)        =",cols1, &
               "Non-zero elm.  =",nnz1

      nnzmax1 = nnz1
      allocate (indx1(rows1), jndx1(cols1))
      allocate (rhs(nnz1))
      call mmread(iunit2, rep1, field1, symm1, rows1, cols1, nnz1, nnzmax1, &
                  indx1, jndx1, rval=rhs)
      ! print *,"rhs=", rhs
      deallocate (indx1, jndx1)
      close(iunit2)

  end subroutine mm_rhs_read

  subroutine coo2csr_convert

!   INPUT   : Matrix COO data from Matrix Market format via mm_read routine
!   OUTPUT  : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data

    use cublas
    use cusparse
                                         !! fortran program as index_base = 1

    type(cusparseHandle)                 :: cusparse_H
    type(cublasHandle)                   :: cublas_H
    integer(8)                           :: buff_size  ! must be 8byte integer. so needs -Mlarge_arrays option
    character*1,allocatable,dimension(:) :: pBuffer
    integer,    allocatable,dimension(:) :: P
    integer                              :: status
    integer                              :: ierr_code
    integer                              :: i

      if ( rows /= cols ) then
        print *, "Error : Matrix must be square (rows = cols) in this routine."
        stop
      end if

    ! set m, n, nz
      m = rows
      n = cols    ! matrix size
      nz = nnz    ! nz == nnz
      if ( m /= n ) then
        print *, "Error : Matrix A is not square matrix ",m,"x",n
        stop
      end if

    ! set CSR storage arrays
      allocate(csrRowPtrA(m+1))
      allocate(csrColIndA(nz) )
      allocate(csrValA(nz)    )
    ! Initilize
      csrRowPtrA =0 ; csrColIndA =0 ; csrValA = 0_fp_kind

    ! index base is changed from 1 to 0 for using cusparseXcoosortByRow routine
      do i = 1, nz
          indx(i) = indx(i) - 1
          jndx(i) = jndx(i) - 1
      end do

        !! call matrix_printout (field, nnz, indx, jndx, rval1=rval)

    ! create handle for CUSPARSE
      status = cusparseCreate(cusparse_H)

    !$acc data copy(indx, jndx) copy(rval, csrValA, csrRowPtrA)

      ! calculate buffer size to conver
        !$acc host_data use_device(indx, jndx)
          status = cusparseXcoosort_bufferSizeExt(cusparse_H, m, n, nz, indx, jndx, buff_size)
        !$acc end host_data
        print *, "Xcoosort_bufferSizeExt status =", status

      ! Prepare working space
        print *, "working space (bytes) =", buff_size
        allocate(pBuffer(buff_size), stat=ierr_code)
        print *, "allocation status code for pBuffer =", ierr_code

      ! setup permutation vector P
        allocate(P(nz), stat=ierr_code)
        print *, "allocation status code for P =", ierr_code


       !$acc enter data copyin(P, pBuffer)      ! P, pBuffer enter in

      ! Creates an identity map
        !$acc host_data use_device(P)
          status = cusparseCreateIdentityPermutation(cusparse_H, nz, P)
        !$acc end host_data
        print *, "CreateIdentityPermutation status =", status

      ! sorts COO format by row
        !$acc host_data use_device(indx, jndx, P, pBuffer)
          status = cusparseXcoosortByRow(cusparse_H, m, n, nz, indx, jndx,  P, pBuffer)
        !$acc end host_data
        print *, "XcoosortByRow status =", status

      ! gather sorted COO Vals = set "csrValA"
        !$acc host_data use_device(rval, csrValA, P)
          status = cusparseDgthr(cusparse_H, nz, rval, csrValA, P, CUSPARSE_INDEX_BASE_ZERO)
        !$acc end host_data

        !$acc exit data delete(P, pBuffer)      ! P, Pbuffer deleted

      ! index base is changed from 0 to 1 for subsequent Fortran cusolver routines on device
        !$acc kernels
          do i = 1, nz
            indx(i) = indx(i) + 1
            jndx(i) = jndx(i) + 1
          end do
        !$acc end kernels

      ! COO to CSR format = set "csrRowPtrA"
        !$acc host_data use_device(indx, csrRowPtrA)
         status = cusparseXcoo2csr(cusparse_H, indx, nz, m, csrRowPtrA, CUSPARSE_INDEX_BASE_ONE)
        !$acc end host_data

    !$acc end data

        deallocate(P, pBuffer)

      ! copy jndx to csrColIndA = set "csrColIndA"
        csrColIndA = jndx

      if (field == "complex") then
        call matrix_printout (field, nz, indx, jndx, cval1=cval)
      elseif (field == "real") then
        call matrix_printout (field, nz, indx, jndx, rval1=csrValA)
      end if

      ! check CSR data
      call csr_printout

  end subroutine coo2csr_convert

end module mm_coo_cuda_csr_convert

メインプログラムと bicgstab ルーチン

 メインプログラムを以下に示す。このプログラムは、複素数の係数行列は対応していない。実数型のデータのみを扱える。最初に、Matrix Market 形式のファイル名を指定し、次に right hand side(rhs) データファイルがあれば、それを指定する。もし、存在しなければ、rhs ベクトルは全て 1.0 として計算する。入力データの読み込み後、coo2csr_convert ルーチンを call して、COO 形式データからCSR 形式データに変換する。変換後、csrRowPtrA、csrColIndA、csrValA 配列が作成され、この CSR 形式の行列を実引数として、BiCGStab ルーチンに渡し、計算が始まる。bicgstab.f90 ルーチンの内容に関しては、前章 12-2 を参照していただきたい。

main.f90

program main
  use precision
  use Matrix_data
  use mm_coo_cuda_csr_convert
  use bicg_routine
  implicit none

    integer                  :: i, j, istat
    integer,parameter        :: iunit1 = 10       ! for Matrix Market data (COO)
    integer,parameter        :: iunit2 = 11       ! for LHS, if available
    character*50             :: mm_file, rhs_file
    integer                  :: extendSymMatrix   ! 1 = symmetrize the pattern (transform from triangular to full)

    integer, parameter       :: max_iter = 10000  ! max_iteration
    real(fp_kind), parameter :: tol = 1.0d-13     ! tolerance
    integer, parameter       :: index_base = 1    ! CSR format -- one based index

! Read matrix Market data file

    extendSymMatrix = 1        !  if =1 .and. symmetric matrix=yes , transform from upper triangular to full

    print *, "File name for Matrix Market?"
    read (5, '(a)')  mm_file
    !! mm_file ="DATA/bcsstk14.mtx"
    print *, "Matrix Market file name = ",mm_file
    open (iunit1, file=mm_file, status='old')
    call mm_matrix_read(iunit1, extendSymMatrix)

! Read right hand side vectors, if available.

    print *
    print *, "File name for LHS vector? if no available, press ENTER and set as RHS=1.0."
    read (5, '(a)')  rhs_file
    if (rhs_file == "") then
       allocate (rhs(rows))
       rhs= 1.0_fp_kind
    else
      print *, "rhs_file name = ",rhs_file
      open (iunit2, file=rhs_file, status='old')
      call mm_rhs_read(iunit2)
    end if

! Convert COO to CSR sparse matrix
!   INPUT   : Matrix COO data from Matrix Market format via mm_rhs_read routine
!   OUTPUT  : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data

    call coo2csr_convert

! CSR sparse matrix data are set. The belows are arguments for BiCGStab routine.
! Arrays for CUDA cuSPARSE CSR format
!   integer       :: m                 IN    ! rows size for square matrix
!   integer       :: n                 IN    ! columns size for square matrix (n=m)
!   integer       :: nz                IN    ! nonzero element of A
!   integer       :: csrRowPtrA (m+1)  IN    ! the array of offsets corresponding to the start of each row
!   integer       :: csrColIndA (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: csrValA    (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: x(m)              INOUT ! initial guess & result
!   real(fp_kind) :: rhs(m)            IN    ! right hand side

! Print CSR Matrix data
!    print *, "csrValA", csrValA
!    print *, "csrRowPtrA",csrRowPtrA
!    print *, "csrColIndA", csrColIndA


! Initilize X
    allocate (x(n))
    x = 0.0_fp_kind      ! or Initial guess X is set

! Compute by BiCGStab method
! Preconditioned Conjugate Gradient using Incomplete-LU
! factorization with 0 fill-in and no pivoting

    print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter
    call BiCGStab (n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                   tol, max_iter, index_base)

end program main

bicgstab.f90

! Copyright (c) 2017, Tsutomu Kato, SofTek Systems. Inc. All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!

module bicg_routine
    use precision
    implicit none

  contains
!=============================================================
    subroutine BiCGStab(n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                        tol, max_iter, index_base)
!-------------------------------------------------------------
!   Bi-Conjugate Gradient Stabilized (BiCGStab)
!   A is an n × n sparse matrix defined in CSR storage format
!-------------------------------------------------------------
    use precision
    use measure_time

    use cublas_v2
    use cusparse
    implicit none

! Arrays for CUDA cuSPARSE CSR format
    integer       :: n                ! size of squre matrix
    integer       :: nz               ! nonzero element of A
    integer       :: csrRowPtrA(n+1)  ! the array of offsets corresponding to the start of each row
    integer       :: csrColIndA (nz)  ! sorted by row and by column within each row
    real(fp_kind) :: csrValA    (nz)  ! sorted by row and by column within each row
    real(fp_kind) :: x  (n)           ! initial guess & result
    real(fp_kind) :: rhs(n)           ! right hand side
    integer       :: index_base       ! = 1

    integer       :: max_iter
    real(fp_kind) :: tol

! Local variables
    type(cusparseHandle)     :: cusparse_H
    type(cublasHandle)       :: cublas_H
    type(cusparseMatDescr)   :: descrA
    type(cusparseMatDescr)   :: descrL
    type(cusparseMatDescr)   :: descrU
    type(cusparseSolveAnalysisInfo) :: infoA, infoL, infoU
    integer                  :: cusparse_status
    integer                  :: cublas_status
    integer                  :: status, ierr_code
    real(fp_kind)            :: alpha, alpham1, beta, omega, omegam1
    real(fp_kind)            :: rho, rho_p
    real(fp_kind)            :: nrmr, nrmr0, temp1, temp2, nrmCK1, nrmCK2
    real(fp_kind)            :: error, rsum, diff
    real(fp_kind), allocatable, dimension(:) :: valsILU0
    real(fp_kind), allocatable, dimension(:) :: Lov, z
    real(fp_kind), allocatable, dimension(:) :: r, r_tlda
    real(fp_kind), allocatable, dimension(:) :: p
    real(fp_kind), allocatable, dimension(:) :: q
    real(fp_kind), allocatable, dimension(:) :: t, s
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0

    integer                  :: i, j, k
    integer                  :: nprint
    integer, parameter       :: idbg = 2
    real(fp_kind),parameter  :: eps = 6.9d-310     ! machine epsilon

    allocate( r(n), p(n), r_tlda(n) )
    allocate( q(n) )
    allocate( valsILU0(nz) )
    allocate( Lov(n), z(n), t(n) , s(n) )

! Bi-Conjugate Gradient Stabilized (BiCGStab) using incomplete-LU

    print '(/1x,a)',"-----------------------------------------------"
    print '(1x,a)', " Bi-Conjugate Gradient Stabilized (BiCGStab)"
    print *,        "-----------------------------------------------"

!   print *,"rhs=",rhs
    do  i = 1, n
       r(i) = rhs(i)     ! set r = rhs, where Ax = rhs (Ax = b)
    end do

    alpha   = 1.0_fp_kind
    beta    = 0.0_fp_kind
    rho     = 0.0_fp_kind
    k = 0

! Clock
      status = init_clock
      status = start_clock

! creater handle for cuBLAS
    status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
    status = cusparseCreate(cusparse_H)

! matrix A descriptor(descrA) and properties
    status = cusparseCreateMatDescr(descrA)
    status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)  ! must be GENERAL for using csrilu0
    status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE)  ! base = 1

! initiialize the analysis info for A matrix (infoA = sparsity pattern of A)
    status = cusparseCreateSolveAnalysisInfo(infoA)

! initiialize the L U preconditioner Info
    status = cusparseCreateSolveAnalysisInfo(infoL)
    status = cusparseCreateSolveAnalysisInfo(infoU)

! Setup for Lower Matrix
    status = cusparseCreateMatDescr (descrL)
    status = cusparseSetMatType     (descrL, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
    status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT)

! Setup for Upper Matrix
    status = cusparseCreateMatDescr (descrU)
    status = cusparseSetMatType     (descrU, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER)
    status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT)

! copy original vals to new ILU0-vals
    valsILU0 = csrValA

!print *,"X=",X
!print *,"r_init=",r
!Print CSR Matrix data
!   print *,"m,n,nz",m,n,nz
!   print *, "csrValA", csrValA
!   print *, "csrRowPtrA",csrRowPtrA
!   print *, "csrColIndA", csrColIndA
!   print *,"csrValA=",(csrValA(i),i=1,100)

!$acc data copy(r, X)  copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA ) &
!$acc create(p, q, Lov, z, r_tlda ) create(t, s)

   ! === Initial guess x0 and residual r = b - Ax0 where r = b
   !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, r, r_tlda, p )
     cusparse_status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                       n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0, q is temporary
     status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1)                                  ! r = r - A x0
     ! set as r_tlda = r and p = r
     status = cublasDcopy(cublas_H, n, r, 1, r_tlda, 1)                                    ! r_tlda = r
     status = cublasDcopy(cublas_H, n, r, 1, p, 1)                                         ! p      = r
     status = cublasDnrm2(cublas_H, n, r, 1, nrmr0)                                        ! nrmr0 = ||r||2
   !$acc end host_data

    print *,"nrmr0=",nrmr0, "tol*nrmr0=",tol*nrmr0

   !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0)
    ! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoL/descrL
     cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                      n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoL)

!print*,"cusparseDcsrsv_analysis1=", cusparse_status

    ! the analysis phase of the solution of a sparse linear system for infoU/descrU
     cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                      n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoU)

    !print*,"cusparseDcsrsv_analysis2=",cusparse_status

    ! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix
    ! output : valsILU0
     cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descrA, valsILU0, &
                      csrRowPtrA, csrColIndA, infoL);
   !$acc end host_data

!$acc wait

    ! check for the solution using valsILU0 vector from Incomplete-LU factorization
    !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov)
     cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                            valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov)
     status = cublasDnrm2(cublas_H, n, Lov, 1, nrmCK1)
     cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                            valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z)  ! solve z
     status = cublasDnrm2(cublas_H, n, z, 1, nrmCK2)
    !$acc end host_data

     print*,"nrmCK1, nrmCK2 =", nrmCK1, nrmCK2
     if (nrmCK1 == 0.0  .or. nrmCK2 == 0.0 .or. nrmCK1 /= nrmCK1 .or. nrmCK2 /= nrmCK2) then
       print *, "ERROR : This matrix is not be able to solve using valsILU0 vector from Incomplete-LU factorization."
       stop
     end if

        status = end_clock
        time_tag = "The analysis and Incomplete-LU factorization phase of ILU(0) :"
        status = print_time(time_tag)
        status = start_clock

    do while (  k <= max_iter )

      k = k + 1
      rho_p = rho

   !$acc host_data use_device(r_tlda, r)
      status = cublasDdot(cublas_H, n, r_tlda, 1, r, 1, rho)               ! rho = (r_tlda, r)
   !$acc end host_data

      if ( abs(rho) < eps .or. rho == 0.0 ) then
         print *,"ERROR! rho is zero", rho, "BiCG calculation condition does not hold. STOP!"
         exit
       end if

      if ( k > 1 ) then

      if ( abs(omega) < eps .or. rho == 0.0 ) then
        print *,"ERROR! omega is zero, BiCG calculation condition does not hold. STOP!"
        exit
      end if

        beta = (rho/rho_p) * (alpha/omega)
        omegam1 = -omega

      !$acc host_data use_device(q, p, r)
        cublas_status = cublasDaxpy(cublas_H, n, omegam1, q, 1, p, 1)   ! p = p -omega * q
        cublas_status = cublasDscal(cublas_H, n, beta, p, 1)            ! p = beta * p
        cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1)       ! p = p + r
      !$acc end host_data

      end if

      ! Solve using valsILU0 vector from Incomplete-LU factorization,  Solve M z = p
      !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, p, Lov, z)
      ! Forwadaard Solve, L part for infoA(the sparsity pattern of descL)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoL, p, Lov)

      ! Back Substitution (U part for infoU = descU)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z)  ! solve z
      !$acc end host_data

      ! Compute q = A z
      !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r_tlda, z, q)
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                          n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, z, zero, q)   ! q = A z
        status = cublasDdot(cublas_H, n, r_tlda, 1, q, 1, temp1)        ! temp1 = (r_tlda, q)
      !$acc end host_data

        alpha   = rho / temp1                                           ! alpha = rho / (r_tlda, q)
        alpham1 = -alpha

      !$acc host_data use_device(q, z, X, r)
        cublas_status = cublasDaxpy(cublas_H, n, alpham1, q, 1, r, 1)   ! r = r - alpha * q
        cublas_status = cublasDaxpy(cublas_H, n, alpha,   z, 1, X, 1)   ! X = X + alpha * z
        cublas_status = cublasDnrm2(cublas_H, n, r, 1, nrmr)            ! nrmr = || r ||2
      !$acc end host_data

      ! check for convergence
        if ( nrmr < tol*nrmr0 ) exit

      ! solve M s = r
      !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, s, t)
      ! Forward Solve, L part for infoA(the sparsity pattern of descL)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov)
      ! Back Substitution (U part for infoU = descU)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, s)  ! solve s
      ! Compute t = A s
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &         ! t = A s
                          n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, s, zero, t)

        status = cublasDdot(cublas_H, n, t, 1, r, 1, temp1)             ! temp1 = (t, r)
        status = cublasDdot(cublas_H, n, t, 1, t, 1, temp2)             ! temp2 = (t, t)
      !$acc end host_data

        omega = temp1/temp2
        omegam1 = -omega

      !$acc host_data use_device(s, t, X, r)
        status = cublasDaxpy(cublas_H, n, omega, s, 1, X, 1)            ! X = X + omega * s
        status = cublasDaxpy(cublas_H, n, omegam1, t, 1, r, 1)          ! r = r - omega * t
        status = cublasDnrm2(cublas_H, n, r, 1, nrmr)                   ! nrmr = || r ||2
      !$acc end host_data

      ! check for convergence
        if ( nrmr < tol*nrmr0 ) exit

        if (idbg == 1 ) print *, "iteration=", k,  " residual =", nrmr
    end do
!$acc end data

    print *, "iteration=", k,  " residual =", nrmr
    print *, "nrmr/nrmr0 =", nrmr/nrmr0
    print *
    if ( nrmr < tol*nrmr0 ) then
      print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED."
    else
      print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is FAILED."
    end if

        status = end_clock
        time_tag = "Elapased time for BiGGstab loop with ILU(0) :"
        status = print_time(time_tag)

! check result
    error = 0.0
    do i = 1, n
      rsum = 0.0
      do  j =  csrRowPtrA(i), csrRowPtrA(i+1)-1
         rsum = rsum + csrValA(j) * X(csrColIndA(j))
      end do
      diff = abs(rsum - rhs(i))
      if (diff > error) error = diff
    end do
    print '(1x,a,d20.13/)',"error =",error

! output result
    if (idbg >= 1) then
      nprint = n
      if ( n > 10 ) nprint = 10
      do i = 1, nprint
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
      print *,""
      do i = n-nprint, n
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
    end if

    status = cublasDestroy(cublas_H)
    status = cusparseDestroy(cusparse_H)
    status = cusparseDestroyMatDescr(descrA)
    status = cusparseDestroyMatDescr(descrL)
    status = cusparseDestroyMatDescr(descrU)
    status = cusparseDestroySolveAnalysisInfo(infoA)
    status = cusparseDestroySolveAnalysisInfo(infoL)
    status = cusparseDestroySolveAnalysisInfo(infoU)
    deallocate( q, r, r_tlda, p )
    deallocate( Lov, z, valsILU0, t, s )

    end subroutine BiCGStab

End module bicg_routine

MTXファイルを使用して実行

 tar file 内に提供した Makefile の記述に沿って、コンパイル&リンクを行う。なお、この Makefile では、cuda9.0 ツールキットのライブラリを使用するように設定しているが、cuda8.0 を使用しても良い。cusparse と cublas ライブラリは、PGIコンパイラにバンドルしているルーチンを使用する形になっている。

Makefile によるコンパイル

$ make
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  precision.f90
pgf90 -c -Minfo -Kieee -O  -Mfixed mmio.f
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  mm_convert_csr.f90
matrix_printout:
     68, Loop not vectorized/parallelized: contains call
     77, Loop not vectorized/parallelized: contains call
set_matrix:
    138, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    148, Loop not vectorized/parallelized: contains call
    169, Memory copy idiom, loop replaced by call to __c_mcopy4
    170, Memory copy idiom, loop replaced by call to __c_mcopy4
    173, Memory copy idiom, loop replaced by call to __c_mcopy8
    177, Memory copy idiom, loop replaced by call to __c_mcopy4
    181, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    356, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    359, Generated an alternate version of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    369, Generating copy(csrvala(:),indx(:),rval(:),jndx(:),csrrowptra(:))
    387, Generating enter data copyin(p(:),pbuffer(:))
    406, Generating exit data delete(p(:),pbuffer(:))
    410, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        410, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    426, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  bicgstab.f90
bicgstab:
     93, Memory copy idiom, loop replaced by call to __c_mcopy8
    138, Memory copy idiom, loop replaced by call to __c_mcopy8
    149, Generating create(q(:))
         Generating copyin(valsilu0(:))
         Generating create(t(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating create(z(:))
         Generating copy(x(:),r(:))
         Generating create(s(:),r_tlda(:))
    207, Loop not vectorized/parallelized: potential early exits
    318, Generated an alternate version of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         FMA (fused multiply-add) instruction(s) generated
    330, Loop not vectorized/parallelized: contains call
    334, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  main.f90
main:
     36, Memory set idiom, loop replaced by call to __c_mset8
     68, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.o mmio.o measure_time.o mm_convert_csr.o bicgstab.o main.o -Mcuda -acc -ta=tesla,cuda8.0,cc35,cc60

 サンプル係数行列(mtxデータ)は、The SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) から入手する。NVIDIA の Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS というホワイトペーパーにテストで使用されている係数行列の名前が以下のようにリストアップされている。この名前を The SuiteSparse Matrix Collection 内の検索サイトで探すとテストデータが表示され、「Matrxi Market」形式ファイルをダウンロードすれば、mtx ファイルが入手できる。なお、係数行列の素性によっては解けない場合もある。

Table 1. Symmetric Positive Definite (s.p.d.) and Nonsymmetric Test Matrices
# Matrix Name m,n nnz s.p.d. Application
1. offshore 259,789 4,242,673 yes Geophysics:https://sparse.tamu.edu/Um/offshore
2. af_shell3 504,855 17,562,051 yes Mechanics:https://sparse.tamu.edu/Schenk_AFE/af_shell3
3. parabolic_fem 525,825 3,674,625 yes General:https://sparse.tamu.edu/Wissgott/parabolic_fem
4. apache2 715,176 4,817,870 yes Mechanics:https://sparse.tamu.edu/GHS_psdef/apache2
5. ecology2 999,999 4,995,991 yes Biology:https://sparse.tamu.edu/McRae/ecology2
6. thermal2 1,228,045 8,580,313 yes Thermal Simulation:https://sparse.tamu.edu/Schmid/thermal2
7. G3_circuit 1,585,478 7,660,826 yes Circuit Simulation:https://sparse.tamu.edu/AMD/G3_circuit
8. FEM_3D_thermal2 147,900 3,489,300 no Mechanics:https://sparse.tamu.edu/Botonakis/FEM_3D_thermal2
10. ASIC_320ks 321,671 1,316,08511 no Circuit Simulation:https://sparse.tamu.edu/Sandia/ASIC_320ks
11. cage13 445,315 7,479,343 no Biology:https://sparse.tamu.edu/vanHeukelum/cage13
12. atmosmodd 1,270,432 8,814,880 no Atmospheric Model:https://sparse.tamu.edu/Bourchtein/atmosmodd

 make run コマンドで実行が開始される。入力 mtx ファイルが求められるので、入手した mtx ファイル名を指定する。次に rhs ベクトルファイルが存在する場合は、そのファイル名を指定する。存在しない場合は、enter を押下し、実行に進む。もし、反復法の計算が不能となった場合は、cuSOLVER SP ルーチンを利用した直接解法で計算してみることもお勧めする。

実行

[kato@photon32]$ make run
===== Program run =====
a.out
File name for Matrix Market?
lap2D_5pt_n100.mtx
 Matrix Market file name = lap2D_5pt_n100.mtx
 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =   10000
 cols(n)        =   10000
 Non-zero elm.  =   49600
 extendSymMatrix=       1


 ==== Triangular Symmetric matrix has been transformed to full matrix form ====

 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            2            1   -1.000000000000000
 i,indx,jndx            3            1            2   -1.000000000000000
 i,indx,jndx            4          101            1   -1.000000000000000
 i,indx,jndx            5            1          101   -1.000000000000000
 i,indx,jndx            6            2            2    4.000000000000000
 i,indx,jndx            7            3            2   -1.000000000000000
 i,indx,jndx            8            2            3   -1.000000000000000
 i,indx,jndx            9          102            2   -1.000000000000000
 i,indx,jndx           10            2          102   -1.000000000000000
 i,indx,jndx        49590         9996         9997   -1.000000000000000
 i,indx,jndx        49591         9997         9997    4.000000000000000
 i,indx,jndx        49592         9998         9997   -1.000000000000000
 i,indx,jndx        49593         9997         9998   -1.000000000000000
 i,indx,jndx        49594         9998         9998    4.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9998         9999   -1.000000000000000
 i,indx,jndx        49597         9999         9999    4.000000000000000
 i,indx,jndx        49598        10000         9999   -1.000000000000000
 i,indx,jndx        49599         9999        10000   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000

 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.

 Xcoosort_bufferSizeExt status =            0
 working space (bytes) =                   794880
 allocation status code for pBuffer =            0
 allocation status code for P =            0
 CreateIdentityPermutation status =            0
 XcoosortByRow status =            0
 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            1            2   -1.000000000000000
 i,indx,jndx            3            1          101   -1.000000000000000
 i,indx,jndx            4            2            1   -1.000000000000000
 i,indx,jndx            5            2            2    4.000000000000000
 i,indx,jndx            6            2            3   -1.000000000000000
 i,indx,jndx            7            2          102   -1.000000000000000
 i,indx,jndx            8            3            2   -1.000000000000000
 i,indx,jndx            9            3            3    4.000000000000000
 i,indx,jndx           10            3            4   -1.000000000000000
 i,indx,jndx        49590         9998         9898   -1.000000000000000
 i,indx,jndx        49591         9998         9997   -1.000000000000000
 i,indx,jndx        49592         9998         9998    4.000000000000000
 i,indx,jndx        49593         9998         9999   -1.000000000000000
 i,indx,jndx        49594         9999         9899   -1.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9999         9999    4.000000000000000
 i,indx,jndx        49597         9999        10000   -1.000000000000000
 i,indx,jndx        49598        10000         9900   -1.000000000000000
 i,indx,jndx        49599        10000         9999   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000

 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          4          8         12         16         20         24
         28         32         36
 csrColIndA=
          1          2        101          1          2          3        102
          2          3          4
 csrValA  =
 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01
 -.1000D+01 0.4000D+01 -.1000D+01

 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
      49562      49566      49570      49574      49578      49582      49586
      49590      49594      49598      49601
 csrColIndA=
       9898       9997       9998       9999       9899       9998       9999
      10000       9900       9999      10000
 csrValA=
 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01
 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01

  tol = .1000000000D-12  max_iteration =   10000


 -----------------------------------------------
  Bi-Conjugate Gradient Stabilized (BiCGStab)
 -----------------------------------------------
 nrmr0=    100.0000000000000      tol*nrmr0=   1.0000000000000001E-011
 nrmCK1, nrmCK2 =    238.2158421093953         166.7314075873577

 The analysis and Incomplete-LU factorization phase of ILU(0) :                         0.212 (sec)

 iteration=           81  residual =   6.6059148572793421E-012
 nrmr/nrmr0 =   6.6059148572793405E-014

 Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED.

 Elapased time for BiGGstab loop with ILU(0) :                                          0.961 (sec)

 error = 0.6593836587854D-11

 x(      1)=0.275607474398E+01
 x(      2)=0.501214948795E+01
 x(      3)=0.696587678467E+01
 x(      4)=0.871021429238E+01
 x(      5)=0.102960547760E+02
 x(      6)=0.117547659017E+02
 x(      7)=0.131074411873E+02
 x(      8)=0.143691932892E+02
 x(      9)=0.155513693040E+02
 x(     10)=0.166627976257E+02

 x(   9990)=0.177105403722E+02
 x(   9991)=0.166627976257E+02
 x(   9992)=0.155513693040E+02
 x(   9993)=0.143691932892E+02
 x(   9994)=0.131074411873E+02
 x(   9995)=0.117547659017E+02
 x(   9996)=0.102960547760E+02
 x(   9997)=0.871021429238E+01
 x(   9998)=0.696587678467E+01
 x(   9999)=0.501214948795E+01
 x(  10000)=0.275607474398E+01

2. Windows 版 プログラムソース

 上記で示したプログラム一式は、Linux 上で動作させることを前提としたものである。同じプログラムを Windows 上で動作させるために Makefile を変更したものとソース・ファイル一式を以下に提供する。以下のファイルをダウンロードしていただきたい。

 プログラム tar ファイル:BiCGStab.zip

 以下の例は、PGI 17.10 バージョンの PGI コマンドプロンプト(端末)を使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。

$  pgfortran -V

pgfortran 17.10-0 64-bit target on x86-64 Windows -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

 また、使用した GPU は以下のとおりである。

$ pgaccelinfo

CUDA Driver Version:           9000

Device Number:                 0
Device Name:                   GeForce GTX 780
Device Revision Number:        3.5
Global Memory Size:            3221225472
Number of Multiprocessors:     12
Number of SP Cores:            2304
Number of DP Cores:            768
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    941 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3004 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 1572864 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc35
PGI$ make
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  precision.f90
pgf90 -c -Minfo -Kieee -O  -Mfixed mmio.f
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  mm_convert_csr.f90
matrix_printout:
     68, Loop not vectorized/parallelized: contains call
     77, Loop not vectorized/parallelized: contains call
set_matrix:
    138, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    148, Loop not vectorized/parallelized: contains call
    169, Memory copy idiom, loop replaced by call to __c_mcopy4
    170, Memory copy idiom, loop replaced by call to __c_mcopy4
    173, Memory copy idiom, loop replaced by call to __c_mcopy8
    177, Memory copy idiom, loop replaced by call to __c_mcopy4
    181, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    356, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    359, Generated 3 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    369, Generating copy(csrvala(:),indx(:),jndx(:),rval(:),csrrowptra(:))
    387, Generating enter data copyin(p(:),pbuffer(:))
    406, Generating exit data delete(pbuffer(:),p(:))
    410, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        410, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    426, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  bicgstab.f90
bicgstab:
     93, Memory copy idiom, loop replaced by call to __c_mcopy8
    138, Memory copy idiom, loop replaced by call to __c_mcopy8
    149, Generating create(q(:))
         Generating copyin(valsilu0(:))
         Generating create(t(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating create(z(:))
         Generating copy(x(:),r(:))
         Generating create(s(:),r_tlda(:))
    207, Loop not vectorized/parallelized: potential early exits
    318, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
    330, Loop not vectorized/parallelized: contains call
    334, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  main.f90
main:
     36, Memory set idiom, loop replaced by call to __c_mset8
     68, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.obj mmio.obj measure_time.obj mm_convert_csr.obj bicgstab.obj main.obj -Mcuda -acc -ta=tesla,cuda9.0,cc35,cc60  -Mcudalib=cusparse,cublas
PGI$ make run
===== Program run =====
a.out
 File name for Matrix Market?
lap2D_5pt_n100.mtx
 Matrix Market file name = lap2D_5pt_n100.mtx
 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =   10000
 cols(n)        =   10000
 Non-zero elm.  =   49600
 extendSymMatrix=       1


 ==== Triangular Symmetric matrix has been transformed to full matrix form ====

 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            2            1   -1.000000000000000
 i,indx,jndx            3            1            2   -1.000000000000000
 i,indx,jndx            4          101            1   -1.000000000000000
 i,indx,jndx            5            1          101   -1.000000000000000
 i,indx,jndx            6            2            2    4.000000000000000
 i,indx,jndx            7            3            2   -1.000000000000000
 i,indx,jndx            8            2            3   -1.000000000000000
 i,indx,jndx            9          102            2   -1.000000000000000
 i,indx,jndx           10            2          102   -1.000000000000000
 i,indx,jndx        49590         9996         9997   -1.000000000000000
 i,indx,jndx        49591         9997         9997    4.000000000000000
 i,indx,jndx        49592         9998         9997   -1.000000000000000
 i,indx,jndx        49593         9997         9998   -1.000000000000000
 i,indx,jndx        49594         9998         9998    4.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9998         9999   -1.000000000000000
 i,indx,jndx        49597         9999         9999    4.000000000000000
 i,indx,jndx        49598        10000         9999   -1.000000000000000
 i,indx,jndx        49599         9999        10000   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000

 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.

 Xcoosort_bufferSizeExt status =            0
 working space (bytes) =                   794880
 allocation status code for pBuffer =            0
 allocation status code for P =            0
 CreateIdentityPermutation status =            0
 XcoosortByRow status =            0
 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            1            2   -1.000000000000000
 i,indx,jndx            3            1          101   -1.000000000000000
 i,indx,jndx            4            2            1   -1.000000000000000
 i,indx,jndx            5            2            2    4.000000000000000
 i,indx,jndx            6            2            3   -1.000000000000000
 i,indx,jndx            7            2          102   -1.000000000000000
 i,indx,jndx            8            3            2   -1.000000000000000
 i,indx,jndx            9            3            3    4.000000000000000
 i,indx,jndx           10            3            4   -1.000000000000000
 i,indx,jndx        49590         9998         9898   -1.000000000000000
 i,indx,jndx        49591         9998         9997   -1.000000000000000
 i,indx,jndx        49592         9998         9998    4.000000000000000
 i,indx,jndx        49593         9998         9999   -1.000000000000000
 i,indx,jndx        49594         9999         9899   -1.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9999         9999    4.000000000000000
 i,indx,jndx        49597         9999        10000   -1.000000000000000
 i,indx,jndx        49598        10000         9900   -1.000000000000000
 i,indx,jndx        49599        10000         9999   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000

 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          4          8         12         16         20         24
         28         32         36
 csrColIndA=
          1          2        101          1          2          3        102
          2          3          4
 csrValA  =
 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01
 -.1000D+01 0.4000D+01 -.1000D+01

 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
      49562      49566      49570      49574      49578      49582      49586
      49590      49594      49598      49601
 csrColIndA=
       9898       9997       9998       9999       9899       9998       9999
      10000       9900       9999      10000
 csrValA=
 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01
 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01

  tol = .1000000000D-12  max_iteration =   10000


 -----------------------------------------------
  Bi-Conjugate Gradient Stabilized (BiCGStab)
 -----------------------------------------------
 nrmr0=    100.0000000000000      tol*nrmr0=   1.0000000000000001E-011
 nrmCK1, nrmCK2 =    238.2158421093953         166.7314075873577

 The analysis and Incomplete-LU factorization phase of ILU(0) :                         0.253 (sec)

 iteration=           81  residual =   6.6059148572793421E-012
 nrmr/nrmr0 =   6.6059148572793405E-014

 Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED.

 Elapased time for BiGGstab loop with ILU(0) :                                          0.593 (sec)

 error = 0.6593836587854D-11

 x(      1)=0.275607474398E+01
 x(      2)=0.501214948795E+01
 x(      3)=0.696587678467E+01
 x(      4)=0.871021429238E+01
 x(      5)=0.102960547760E+02
 x(      6)=0.117547659017E+02
 x(      7)=0.131074411873E+02
 x(      8)=0.143691932892E+02
 x(      9)=0.155513693040E+02
 x(     10)=0.166627976257E+02

 x(   9990)=0.177105403722E+02
 x(   9991)=0.166627976257E+02
 x(   9992)=0.155513693040E+02
 x(   9993)=0.143691932892E+02
 x(   9994)=0.131074411873E+02
 x(   9995)=0.117547659017E+02
 x(   9996)=0.102960547760E+02
 x(   9997)=0.871021429238E+01
 x(   9998)=0.696587678467E+01
 x(   9999)=0.501214948795E+01
 x(  10000)=0.275607474398E+01

前章へ

次章へ

OpenACCプログラミングのインデックスへ

PGI アクセラレータコンパイラサイトへ