Download kompute - kompute source code download

kompute

C/C++

v0.9.0


Kompute

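A general-purpose GPU compute framework for cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends).

Fast, mobile-enabled, asynchronous, and optimized for advanced GPU acceleration use cases.

Join the Discord & community calls | Documentation | Blog post | Examples

Kompute is backed by the Linux Foundation as a hosted project of the LF AI & Data Foundation.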

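Principles and Features

Flexible Python module with a C++ SDK for optimizations
Asynchronous and parallel processing support through GPU family queues
Mobile enabled, with examples via the Android NDK across several architectures
BYOV: bring-your-own-Vulkan design that plays nicely with existing Vulkan applications
Explicit relationships for GPU and host memory ownership and memory management
Robust codebase with 90% unit test code coverage
Advanced use cases in machine learning, mobile development, and game development
Active community with monthly calls, Discord chat, and more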

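Projects using Kompute ❤️

GPT4All - an ecosystem of open-source large language models that run locally on your CPU and nearly any GPU.
llama.cpp - a port of Facebook's LLaMA model in C/C++.
tpoisonooo/how-to-optimize-gemm - row-major matrix multiplication optimization.
vkJAX - a JAX interpreter for Vulkan.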

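Getting Started

Below you can find a GPU multiplication example using the C++ and Python Kompute interfaces.

You can join the Discord for questions and discussion, open a GitHub issue, or read the documentation.

Your First Kompute (C++)

The C++ interface provides low-level access to the native components of Kompute, enabling advanced optimizations as well as extension of components.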

#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <kompute/Kompute.hpp>

void kompute(const std::string& shader) {

    // 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    kp::Manager mgr;

    // 2. Create and initialise Kompute Tensors through manager

    // Default tensor constructor simplifies creation of float values
    auto tensorInA = mgr.tensor({ 2., 2., 2. });
    auto tensorInB = mgr.tensor({ 1., 2., 3. });
    // Explicit type constructor supports uint32, int32, double, float and bool
    auto tensorOutA = mgr.tensorT<uint32_t>({ 0, 0, 0 });
    auto tensorOutB = mgr.tensorT<uint32_t>({ 0, 0, 0 });

    std::vector<std::shared_ptr<kp::Memory>> params = { tensorInA, tensorInB, tensorOutA, tensorOutB };

    // 3. Create algorithm based on shader (supports buffers & push/spec constants)
    kp::Workgroup workgroup({ 3, 1, 1 });
    std::vector<float> specConsts({ 2 });
    std::vector<float> pushConstsA({ 2.0 });
    std::vector<float> pushConstsB({ 3.0 });

    auto algorithm = mgr.algorithm(params,
                                   // compileSource is not defined in this listing;
                                   // see the shader section of the documentation
                                   compileSource(shader),
                                   workgroup,
                                   specConsts,
                                   pushConstsA);

    // 4. Run operation synchronously using sequence
    mgr.sequence()
        ->record<kp::OpSyncDevice>(params)
        ->record<kp::OpAlgoDispatch>(algorithm) // Binds default push consts
        ->eval() // Evaluates the two recorded operations
        ->record<kp::OpAlgoDispatch>(algorithm, pushConstsB) // Overrides push consts
        ->eval(); // Evaluates only last recorded operation

    // 5. Sync results from the GPU asynchronously
    auto sq = mgr.sequence();
    sq->evalAsync<kp::OpSyncLocal>(params);

    // ... Do other work asynchronously whilst GPU finishes

    sq->evalAwait();

    // Prints the first output which is: { 4, 8, 12 }
    for (const uint32_t elem : tensorOutA->vector()) std::cout << elem << "  ";
    // Prints the second output which is: { 10, 10, 10 }
    for (const uint32_t elem : tensorOutB->vector()) std::cout << elem << "  ";

} // Manages / releases all CPU and GPU memory resources

int main() {

    // Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    // files). This shader shows some of the main components including constants, buffers, etc
    std::string shader = (R"(
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initialization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    )");

    // Run the function declared above with our raw string shader
    kompute(shader);
}

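Your First Kompute (Python)

The Python package provides a high-level interactive interface that enables experimentation while ensuring high performance and fast development workflows.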

import kp
import numpy as np

# compile_source is a util function from python/test/utils; the relative
# import only resolves when this file is run as part of that package
from .utils import compile_source

def kompute(shader):
    # 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    mgr = kp.Manager()

    # 2. Create and initialise Kompute Tensors through manager

    # Default tensor constructor simplifies creation of float values
    tensor_in_a = mgr.tensor([2, 2, 2])
    tensor_in_b = mgr.tensor([1, 2, 3])
    # Explicit type constructor supports uint32, int32, double, float and bool
    tensor_out_a = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    tensor_out_b = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    assert tensor_out_a.data_type() == kp.DataTypes.uint

    params = [tensor_in_a, tensor_in_b, tensor_out_a, tensor_out_b]

    # 3. Create algorithm based on shader (supports buffers & push/spec constants)
    workgroup = (3, 1, 1)
    spec_consts = [2]
    push_consts_a = [2]
    push_consts_b = [3]

    # See documentation shader section for compile_source
    spirv = compile_source(shader)

    algo = mgr.algorithm(params, spirv, workgroup, spec_consts, push_consts_a)

    # 4. Run operation synchronously using sequence
    (mgr.sequence()
        .record(kp.OpTensorSyncDevice(params))
        .record(kp.OpAlgoDispatch(algo))  # Binds default push consts provided
        .eval()  # evaluates the two recorded ops
        .record(kp.OpAlgoDispatch(algo, push_consts_b))  # Overrides push consts
        .eval())  # evaluates only the last recorded op

    # 5. Sync results from the GPU asynchronously
    sq = mgr.sequence()
    sq.eval_async(kp.OpTensorSyncLocal(params))

    # ... Do other work asynchronously whilst GPU finishes

    sq.eval_await()

    # Prints the first output which is: [4, 8, 12]
    print(tensor_out_a.data().tolist())
    # Prints the second output which is: [10, 10, 10]
    print(tensor_out_b.data().tolist())

if __name__ == "__main__":

    # Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    # files). This shader shows some of the main components including constants, buffers, etc
    shader = """
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initialization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    """

    kompute(shader)
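
To see where the expected outputs in the comments come from, the shader arithmetic can be replayed on the CPU. The sketch below is plain Python and does not use Kompute. `run_shader_once` is a hypothetical helper that mimics one dispatch of three invocations (workgroup (3, 1, 1) with local_size_x = 1), assuming the buffers start at the values used in the examples.

```python
def run_shader_once(in_a, in_b, out_a, out_b, const_one, push_val):
    """Emulate one dispatch of the example shader, one loop step per invocation."""
    for index in range(len(in_a)):
        out_a[index] += int(in_a[index] * in_b[index])
        out_b[index] += int(const_one * push_val)


in_a, in_b = [2.0, 2.0, 2.0], [1.0, 2.0, 3.0]
out_a, out_b = [0, 0, 0], [0, 0, 0]

# The first dispatch uses the default push constant (2.0), the second overrides
# it with 3.0. The specialization constant stays at 2.0 for both dispatches.
run_shader_once(in_a, in_b, out_a, out_b, const_one=2.0, push_val=2.0)
run_shader_once(in_a, in_b, out_a, out_b, const_one=2.0, push_val=3.0)

print(out_a)  # [4, 8, 12]
print(out_b)  # [10, 10, 10]
```

Because both output buffers accumulate with `+=`, the two dispatches add up: out_a receives the element-wise product twice, and out_b receives 2 * 2 followed by 2 * 3.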

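Interactive Notebooks and Videos

You can try the interactive Colab notebooks, which let you use a free GPU. The available examples are the following Python and C++ examples:

Try the interactive C++ Colab from the blog post	Try the interactive Python Colab from the blog post

You can also watch the following two talks, presented at the FOSDEM 2021 conference.

Both videos have timestamps that let you skip to the most relevant sections. The introduction and motivation are almost identical in both, so you can skip ahead to the more specific content.

Watch the video for C++ enthusiasts	Watch the video for Python & machine learning enthusiasts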

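Architectural Overview

The core architecture of Kompute includes the following:

Kompute Manager - base orchestrator that creates and manages the device and child components
Kompute Sequence - container of operations that can be sent to the GPU as a batch
Kompute Operation (Base) - base class from which all operations inherit
Kompute Tensor - tensor-structured data used in GPU operations
Kompute Algorithm - abstraction for the (shader) logic executed on the GPU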
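
The batching role of a sequence can be illustrated with a toy model. The sketch below is not Kompute code: `ToySequence` is a hypothetical class that only mirrors the record-then-eval behaviour described in the comments of the getting-started examples, where a second `eval()` runs only the operation recorded after the first one.

```python
class ToySequence:
    """Hypothetical stand-in for a sequence: batches callables and runs them on eval()."""

    def __init__(self):
        self._ops = []
        self._recording = False

    def record(self, op):
        # Starting a new recording after an eval() discards the previous batch.
        if not self._recording:
            self._ops = []
            self._recording = True
        self._ops.append(op)
        return self  # returning self allows chained calls

    def eval(self):
        # Run the whole batch in order, then close the recording.
        self._recording = False
        for op in self._ops:
            op()
        return self


log = []
seq = ToySequence()
seq.record(lambda: log.append("sync")).record(lambda: log.append("dispatch A")).eval()
seq.record(lambda: log.append("dispatch B")).eval()
print(log)  # ['sync', 'dispatch A', 'dispatch B']
```

In this model each operation is a plain callable. In Kompute the operations are objects derived from the base operation class, and the batch is submitted to the GPU.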

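For a full breakdown, you can read further in the C++ class reference.

Full architecture	Simplified Kompute components
(very small; check the full reference diagram in the documentation for details)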

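Asynchronous and Parallel Operations

Kompute provides the flexibility to run operations asynchronously through vk::Fence. Kompute also allows queues to be allocated explicitly, so operations can execute in parallel across queue families.

The accompanying diagram gives an intuition for how Kompute sequences can be allocated to different queues to enable parallel execution based on the hardware. You can see the hands-on example, as well as the detailed documentation page that describes how this works, using an NVIDIA 1650 as an example.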
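
The submit-then-await pattern can be sketched with the Python standard library. This is only an analogy, not the Kompute API: each single-worker executor stands in for one GPU queue, `submit` plays the role of an asynchronous evaluation, and `result()` plays the role of awaiting it. The names `queue_a`, `queue_b` and `square_all` are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def square_all(values):
    """Stand-in workload; in Kompute this would be a recorded GPU operation."""
    return [v * v for v in values]


# One single-worker executor per "queue": work submitted to the same queue runs
# in order, while work on different queues may overlap.
queue_a = ThreadPoolExecutor(max_workers=1)
queue_b = ThreadPoolExecutor(max_workers=1)

future_a = queue_a.submit(square_all, [1, 2, 3])  # returns immediately
future_b = queue_b.submit(square_all, [4, 5, 6])

# ... the host is free to do other work here ...

result_a = future_a.result()  # blocks until the work on queue_a has finished
result_b = future_b.result()
queue_a.shutdown()
queue_b.shutdown()

print(result_a, result_b)  # [1, 4, 9] [16, 25, 36]
```

On a GPU, whether two queues really run concurrently depends on the queue families the hardware exposes, which is what the documentation page mentioned above walks through.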

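Mobile Enabled

Kompute has been optimized to work in mobile environments. The build system enables dynamic loading of the Vulkan shared library for Android environments, together with a working Android NDK wrapper for the C++ headers.

For a full deep dive, you can read the blog post "Supercharging your Mobile Apps with On-Device GPU Accelerated Machine Learning".

You can also access the end-to-end example code in the repository, which can be run using Android Studio.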

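Python Package

In addition to the C++ core SDK, you can also use the Kompute Python package, which exposes the same core functionality and supports interoperability with Python objects such as lists, NumPy arrays, etc.

The only dependencies are Python 3.5+ and CMake 3.4.1+. You can install Kompute from the Python PyPI package using the following command.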

 pip install kp

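You can also install from the master branch using:

 pip install git+https://github.com/KomputeProject/kompute.git@master

For more details, you can read the Python package documentation or the Python class reference documentation.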
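
After installing, a quick way to confirm that the module is visible to the interpreter is to look it up without importing it. This is a small sketch that uses only the standard library; `kompute_available` is a hypothetical helper name, and the check only confirms that a module called `kp` is on the import path, not that a Vulkan device is usable.

```python
import importlib.util


def kompute_available():
    """Return True if a module named `kp` can be found on the import path."""
    return importlib.util.find_spec("kp") is not None


if kompute_available():
    print("kp is importable")
else:
    print("kp not found; install it with pip first")
```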

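C++ Build Overview

The build system provided uses CMake, which allows for cross-platform builds.

The top-level Makefile provides a set of optimized configurations for development as well as for the Docker image build, but you can start a build with the following command: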

   cmake -Bbuild

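You can also add Kompute to your own repository with add_subdirectory; the CMakeLists.txt file of the Android example shows how this is done.

For a more advanced overview of the build configuration, check out the Build System Deep Dive documentation.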

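Kompute Development

We appreciate PRs and issues. If you would like to contribute, try checking the "Good first issue" tag, but even using Kompute and reporting issues is a great contribution!

Contributing

Dev Dependencies

Testing
- GTest
Documentation
- Doxygen (with Dot)
- Sphinx

Development

Follows the Mozilla C++ Style Guide https://www-archive.mozilla.org/hacking/mozilla-style-guide.html
- Uses a post-commit hook to run the linter, which you can set up
- All dependencies are defined in vcpkg.json
Uses CMake as the build system, and provides a top-level Makefile with recommended commands
Uses xxd (or the xxd.exe Windows 64-bit port) to convert shader SPIR-V to header files
Uses Doxygen and Sphinx for documentation and autodocs
Uses vcpkg for finding the dependencies; it is the recommended setup for retrieving the libraries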

If you want to run with debug layers, you can add them with the KOMPUTE_ENV_DEBUG_LAYERS parameter:

 export KOMPUTE_ENV_DEBUG_LAYERS="VK_LAYER_LUNARG_api_dump"

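Updating Documentation

To update the documentation you will need to:

Run the gendoxygen target in the build system
Run the gensphynx target in the build system
Push to GitHub Pages with make push_docs_to_ghpages

Running Tests

Running the unit tests has been significantly simplified for contributors.

The tests run on the CPU and can be triggered using the act command-line interface (https://github.com/nektos/act). Once you have installed the command-line tool (and started the Docker daemon), you just have to type: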

 $ act

[Python Tests/python-tests]   Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ]   Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ]   ?  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
[Python Tests/python-tests]   ?  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
...

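The repository contains unit tests for the C++ and Python code, which can be found under the test/ and python/test folders.

The tests are currently run through CI using GitHub Actions, which uses the images found in docker-builders/.

To minimize hardware requirements, the tests can run without a GPU, directly on the CPU, using SwiftShader.

For more information on how the CI and tests are set up, see the CI, Docker and Tests section in the documentation.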

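Motivations

This project started after seeing that many new and renowned ML and DL projects, such as PyTorch, TensorFlow, Alibaba DNN and Tencent NCNN, among others, have either integrated or are looking to integrate the Vulkan SDK to add mobile (and cross-vendor) GPU support.

The Vulkan SDK offers a great low-level interface that enables highly specialized optimizations. However, this comes at the cost of highly verbose code, requiring 500-2000 lines of code before you can even begin writing application code. As a result, each of these projects has had to implement the same baseline to abstract the non-compute-related features of the Vulkan SDK. This large amount of non-standardized boilerplate can result in limited knowledge transfer, a higher chance of unique framework implementation bugs being introduced, and so on.

We are currently developing Kompute not to hide the Vulkan SDK interface (which is incredibly well designed) but to augment it with a direct focus on the Vulkan SDK's GPU computing capabilities. This article provides a high-level overview of the motivations behind Kompute, together with a set of hands-on examples that introduce both GPU computing and the core Kompute architecture.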
