Kompute (v0.9.0)

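The general-purpose GPU compute framework for cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends).

Fast, mobile-enabled, asynchronous, and optimized for advanced GPU acceleration use cases.

Join the Discord and community calls · Documentation · Blog post · Examples

Kompute is backed by the Linux Foundation as a hosted project of the LF AI & Data Foundation.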

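Principles and Features

Flexible Python module with a C++ SDK for optimizations
Asynchronous and parallel processing support through GPU family queues
Mobile enabled, with examples via the Android NDK across several architectures
BYOV: a bring-your-own-Vulkan design that plays nicely with existing Vulkan applications
Explicit relationships for GPU and host memory ownership and memory management
Robust codebase with 90% unit test code coverage
Advanced use cases in machine learning, mobile development, and game development
Active community with monthly calls, Discord chat, and more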

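Projects Using Kompute ❤️

GPT4All - an ecosystem of open-source large language models that run locally on your CPU and nearly any GPU.
llama.cpp - a port of Facebook's LLaMA model in C/C++.
tpoisonooo/how-to-optimize-gemm - row-major matrix multiplication optimization.
vkJAX - a JAX interpreter for Vulkan.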

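Getting Started

Below you can find a GPU multiplication example using the C++ and Python Kompute interfaces.

You can join the Discord for questions and discussion, open a GitHub issue, or read the documentation.

Your First Kompute (C++)

The C++ interface provides low-level access to the native components of Kompute, enabling advanced optimizations as well as extension of components.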

#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <kompute/Kompute.hpp>

void kompute(const std::string& shader) {

    // 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    kp::Manager mgr;

    // 2. Create and initialise Kompute Tensors through manager

    // Default tensor constructor simplifies creation of float values
    auto tensorInA = mgr.tensor({ 2., 2., 2. });
    auto tensorInB = mgr.tensor({ 1., 2., 3. });
    // Explicit type constructor supports uint32, int32, double, float and bool
    auto tensorOutA = mgr.tensorT<uint32_t>({ 0, 0, 0 });
    auto tensorOutB = mgr.tensorT<uint32_t>({ 0, 0, 0 });

    std::vector<std::shared_ptr<kp::Memory>> params = {tensorInA, tensorInB, tensorOutA, tensorOutB};

    // 3. Create algorithm based on shader (supports buffers & push/spec constants)
    kp::Workgroup workgroup({ 3, 1, 1 });
    std::vector<float> specConsts({ 2 });
    std::vector<float> pushConstsA({ 2.0 });
    std::vector<float> pushConstsB({ 3.0 });

    auto algorithm = mgr.algorithm(params,
                                   // See documentation shader section for compileSource
                                   compileSource(shader),
                                   workgroup,
                                   specConsts,
                                   pushConstsA);

    // 4. Run operation synchronously using sequence
    mgr.sequence()
        ->record<kp::OpSyncDevice>(params)
        ->record<kp::OpAlgoDispatch>(algorithm) // Binds default push consts
        ->eval() // Evaluates the two recorded operations
        ->record<kp::OpAlgoDispatch>(algorithm, pushConstsB) // Overrides push consts
        ->eval(); // Evaluates only last recorded operation

    // 5. Sync results from the GPU asynchronously
    auto sq = mgr.sequence();
    sq->evalAsync<kp::OpSyncLocal>(params);

    // ... Do other work asynchronously whilst GPU finishes

    sq->evalAwait();

    // Prints the first output which is: { 4, 8, 12 }
    // (the output tensors hold uint32_t values, so iterate with that type)
    for (const uint32_t& elem : tensorOutA->vector()) std::cout << elem << "  ";
    // Prints the second output which is: { 10, 10, 10 }
    for (const uint32_t& elem : tensorOutB->vector()) std::cout << elem << "  ";

} // Manages / releases all CPU and GPU memory resources

int main() {

    // Define a raw string shader (or use the Kompute tools to compile to SPIR-V / C++ header
    // files). This shader shows some of the main components, including constants, buffers, etc.
    std::string shader = (R"(
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initialization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    )");

    // Run the function declared above with our raw string shader
    kompute(shader);
}

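Your First Kompute (Python)

The Python package provides a high-level interactive interface that enables experimentation while ensuring high performance and fast development workflows.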

import kp
import numpy as np

from .utils import compile_source  # using util function from python/test/utils

def kompute(shader):
    # 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    mgr = kp.Manager()

    # 2. Create and initialise Kompute Tensors through manager

    # Default tensor constructor simplifies creation of float values
    tensor_in_a = mgr.tensor([2, 2, 2])
    tensor_in_b = mgr.tensor([1, 2, 3])
    # Explicit type constructor supports uint32, int32, double, float and bool
    tensor_out_a = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    tensor_out_b = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    assert tensor_out_a.data_type() == kp.DataTypes.uint

    params = [tensor_in_a, tensor_in_b, tensor_out_a, tensor_out_b]

    # 3. Create algorithm based on shader (supports buffers & push/spec constants)
    workgroup = (3, 1, 1)
    spec_consts = [2]
    push_consts_a = [2]
    push_consts_b = [3]

    # See documentation shader section for compile_source
    spirv = compile_source(shader)

    algo = mgr.algorithm(params, spirv, workgroup, spec_consts, push_consts_a)

    # 4. Run operation synchronously using sequence
    (mgr.sequence()
        .record(kp.OpTensorSyncDevice(params))
        .record(kp.OpAlgoDispatch(algo))  # Binds default push consts provided
        .eval()  # evaluates the two recorded ops
        .record(kp.OpAlgoDispatch(algo, push_consts_b))  # Overrides push consts
        .eval())  # evaluates only the last recorded op

    # 5. Sync results from the GPU asynchronously
    sq = mgr.sequence()
    sq.eval_async(kp.OpTensorSyncLocal(params))

    # ... Do other work asynchronously whilst GPU finishes

    sq.eval_await()

    # Prints the first output, whose values are 4, 8, 12
    print(tensor_out_a.data())
    # Prints the second output, whose values are 10, 10, 10
    print(tensor_out_b.data())

if __name__ == "__main__" :

    # Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    # files). This shader shows some of the main components including constants, buffers, etc
    shader = """
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initialization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    """

    kompute(shader)
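
To see where the expected values 4, 8, 12 and 10, 10, 10 come from, the shader arithmetic can be replayed on the CPU. The following is a minimal sketch in plain Python, not part of Kompute: `run_shader_on_cpu` is a hypothetical helper that mimics one dispatch of the example shader, and the two loop iterations correspond to the two recorded dispatches (push constant 2.0, then 3.0) with the spec constant fixed at 2.0.

```python
def run_shader_on_cpu(in_a, in_b, out_a, out_b, spec_const, push_const):
    """Emulate one dispatch of the example shader, one invocation per index."""
    for index in range(len(in_a)):
        # Mirrors: out_a[index] += uint(in_a[index] * in_b[index]);
        out_a[index] += int(in_a[index] * in_b[index])
        # Mirrors: out_b[index] += uint(const_one * push_const.val);
        out_b[index] += int(spec_const * push_const)


in_a = [2.0, 2.0, 2.0]
in_b = [1.0, 2.0, 3.0]
out_a = [0, 0, 0]
out_b = [0, 0, 0]

# First dispatch uses the default push constant (2.0); the second overrides it (3.0).
for push_const in (2.0, 3.0):
    run_shader_on_cpu(in_a, in_b, out_a, out_b, spec_const=2.0, push_const=push_const)

print(out_a)  # [4, 8, 12]
print(out_b)  # [10, 10, 10]
```

Because the output buffers are accumulated with `+=`, each dispatch adds to the previous result, which is why two dispatches double the element-wise products in the first output.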

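Interactive Notebooks & Hands-on Videos

You can try the interactive Colab notebooks, which let you use a free GPU. The available examples are the following Python and C++ examples:

Try the interactive C++ Colab from the blog post
Try the interactive Python Colab from the blog post

You can also watch the following two talks presented at the FOSDEM 2021 conference.

Both videos have timestamps that let you skip to the most relevant sections. The introduction and motivation are almost the same in both, so you can skip ahead to the more specific content.

Watch the video for C++ enthusiasts
Watch the video for Python & machine learning enthusiasts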

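Architectural Overview

The core architecture of Kompute includes the following:

Kompute Manager - base orchestrator that creates and manages the device and child components
Kompute Sequence - container of operations that can be sent to the GPU as a batch
Kompute Operation (Base) - base class from which all operations inherit
Kompute Tensor - tensor-structured data used in GPU operations
Kompute Algorithm - abstraction of the (shader) logic executed on the GPU

To see a full breakdown, you can read further in the C++ class reference.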
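
The relationship between these components can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: every class here (`ToyTensor`, `ToyOperation`, `ToySequence`, and so on) is hypothetical and runs entirely on the CPU; it mirrors the record-then-evaluate pattern from the examples above, not Kompute's actual implementation.

```python
class ToyTensor:
    """Hypothetical tensor with separate host-side and device-side copies."""
    def __init__(self, data):
        self.host = list(data)
        self.device = [0] * len(data)


class ToyOperation:
    """Base class that all toy operations inherit from."""
    def run(self):
        raise NotImplementedError


class ToySyncDevice(ToyOperation):
    """Copy host data to the (simulated) device."""
    def __init__(self, tensors):
        self.tensors = tensors

    def run(self):
        for tensor in self.tensors:
            tensor.device = list(tensor.host)


class ToySyncLocal(ToyOperation):
    """Copy (simulated) device data back to the host."""
    def __init__(self, tensors):
        self.tensors = tensors

    def run(self):
        for tensor in self.tensors:
            tensor.host = list(tensor.device)


class ToyDispatch(ToyOperation):
    """Run a kernel function over the device-side buffers."""
    def __init__(self, tensors, kernel):
        self.tensors = tensors
        self.kernel = kernel

    def run(self):
        self.kernel(*[tensor.device for tensor in self.tensors])


class ToySequence:
    """Container of operations that are evaluated together as a batch."""
    def __init__(self):
        self._ops = []

    def record(self, op):
        self._ops.append(op)
        return self

    def eval(self):
        for op in self._ops:
            op.run()
        self._ops.clear()  # assumption: a batch is consumed once evaluated
        return self


def double_in_place(buffer):
    for i in range(len(buffer)):
        buffer[i] *= 2


tensor = ToyTensor([1, 2, 3])
(ToySequence()
    .record(ToySyncDevice([tensor]))
    .record(ToyDispatch([tensor], double_in_place))
    .record(ToySyncLocal([tensor]))
    .eval())
print(tensor.host)  # [2, 4, 6]
```

The explicit sync operations reflect the principle listed earlier: host and device memory are separate, and data moves between them only when an operation says so.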

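[Figures: full architecture diagram; simplified Kompute components diagram (very small here; check the full reference diagram in the documentation for details)]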

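Asynchronous and Parallel Operations

Kompute provides the flexibility to run operations asynchronously through vk::Fence. Furthermore, Kompute enables explicit allocation of queues, which allows operations to execute in parallel across queue families.

The accompanying diagram (see the documentation) gives an intuition of how Kompute sequences can be allocated to different queues to enable parallel execution based on the hardware. You can see the hands-on example, as well as the detailed documentation page describing how this works, using an NVIDIA 1650 as an example.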
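
As a rough CPU-side analogy for the submit-then-await pattern, the sketch below uses Python threads. Everything in it is an assumption made for illustration: `ToyQueue` is a hypothetical stand-in for a GPU queue, and a single-worker thread pool plays the role of in-order execution on that queue. It does not use Vulkan or Kompute.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyQueue:
    """Hypothetical stand-in for one GPU queue: runs submitted work in order."""
    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)

    def eval_async(self, fn, *args):
        # Returns immediately with a future, much like a fence the host can wait on.
        return self._executor.submit(fn, *args)

    def shutdown(self):
        self._executor.shutdown()


def slow_square(x):
    time.sleep(0.1)  # pretend this is a long-running kernel
    return x * x


queue_a, queue_b = ToyQueue(), ToyQueue()

# Work submitted to different queues can overlap in time.
future_a = queue_a.eval_async(slow_square, 3)
future_b = queue_b.eval_async(slow_square, 4)

# ... the host is free to do other work here ...

# The "await" step: block until each queue has finished.
results = (future_a.result(), future_b.result())
queue_a.shutdown()
queue_b.shutdown()
print(results)  # (9, 16)
```

Whether two real sequences actually run in parallel depends on the queue families the hardware exposes, which is what the documentation page for the NVIDIA 1650 example walks through.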

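Mobile Enabled

Kompute has been optimized to work in mobile environments. The build system enables dynamic loading of the Vulkan shared library for Android environments, alongside a working Android NDK wrapper for the C++ headers.

For a full deep dive, you can read the blog post "Supercharging your Mobile Apps with On-Device GPU Accelerated Machine Learning".

You can also access the end-to-end example code in the repository, which can be run using Android Studio.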

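Python Package

Besides the C++ core SDK, you can also use the Python package of Kompute, which exposes the same core functionality and supports interoperability with Python objects such as lists and NumPy arrays.

The only dependencies are Python 3.5+ and CMake 3.4.1+. You can install Kompute from the Python PyPI package using the following command.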

 pip install kp

You can also install from the master branch using:

 pip install git+git://github.com/KomputeProject/kompute.git@master

For further details, you can read the Python package documentation or the Python class reference documentation.
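
After installing, a quick way to confirm that the package is visible to the current interpreter is to look up the `kp` module without importing it. This is a small sketch using only the Python standard library; `kompute_available` is a hypothetical helper name.

```python
import importlib.util


def kompute_available(module_name="kp"):
    """Return True if the named top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if kompute_available():
    print("kp is installed")
else:
    print("kp not found; try: pip install kp")
```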

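C++ Build Overview

The provided build system uses CMake, which allows for cross-platform builds.

The top-level Makefile provides a set of optimized configurations for development as well as for the Docker image build, but you can start a build with the following command: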

   cmake -Bbuild

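You can also add Kompute to your own repository with add_subdirectory; the Android example CMakeLists.txt file shows how this can be done.

For a more advanced overview of the build configuration, check out the "Build System Deep Dive" documentation.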

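Kompute Development

We appreciate PRs and issues. If you want to contribute, try checking the "Good first issue" tag, but even using Kompute and reporting issues is a great contribution!

Contributing

Dev Dependencies

Testing
- GTest
Documentation
- Doxygen (with Dot)
- Sphinx

Development

Follows the Mozilla C++ Style Guide https://www-archive.mozilla.org/hacking/mozilla-style-guide.html
- Uses a post-commit hook to run the linter; you can set it up so that it runs the linter before commit
- All dependencies are defined in vcpkg.json
Uses CMake as the build system, and provides a top-level Makefile with recommended commands
Uses xxd (or the xxd.exe Windows 64-bit port) to convert shader SPIR-V to header files
Uses Doxygen and Sphinx for documentation and autodocs
Uses vcpkg for finding the dependencies; it is the recommended setup for retrieving the libraries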
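
To make the SPIR-V-to-header step concrete, the sketch below shows roughly what an `xxd -i` style conversion produces: the shader binary rendered as a C byte array plus a length variable. It is a plain-Python approximation for illustration, not the project's build tooling; `bytes_to_c_header` and the array name `shader_spv` are hypothetical.

```python
def bytes_to_c_header(data, array_name, bytes_per_line=12):
    """Render bytes as a C array definition plus a length variable."""
    lines = []
    for offset in range(0, len(data), bytes_per_line):
        chunk = data[offset:offset + bytes_per_line]
        lines.append("  " + ", ".join(f"0x{byte:02x}" for byte in chunk))
    body = ",\n".join(lines)
    return (
        f"unsigned char {array_name}[] = {{\n{body}\n}};\n"
        f"unsigned int {array_name}_len = {len(data)};\n"
    )


# SPIR-V binaries start with the magic number 0x07230203 (little-endian bytes shown).
spirv_prefix = bytes([0x03, 0x02, 0x23, 0x07])
header = bytes_to_c_header(spirv_prefix, "shader_spv")
print(header)
```

Embedding the shader this way lets the compiled SPIR-V ship inside the binary, so no shader files need to be located at run time.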

If you want to run with debug layers, you can add them with the KOMPUTE_ENV_DEBUG_LAYERS parameter.

 export KOMPUTE_ENV_DEBUG_LAYERS="VK_LAYER_LUNARG_api_dump"

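Updating Documentation

To update the documentation, you will need to:

Run the gendoxygen target in the build system
Run the gensphynx target in the build system
Push to GitHub Pages with make push_docs_to_ghpages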

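Running Tests

Running the unit tests has been significantly simplified for contributors.

The tests run on the CPU and can be triggered using the act command-line interface (https://github.com/nektos/act). Once you have installed the command-line tool (and started the Docker daemon), you just need to type: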

 $ act

[Python Tests/python-tests]   Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ]   Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ]   ?  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
[Python Tests/python-tests]   ?  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
...

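The repository contains unit tests for the C++ and Python code, which can be found under the test/ and python/test folders.

The tests are currently run through CI using GitHub Actions. It uses the Docker images found in docker-builders/.

To minimize hardware requirements, the tests can run without a GPU, directly on the CPU using SwiftShader.

For more information on how the CI and tests are set up, you can go to the CI, Docker and Tests section of the documentation.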

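Motivations

This project started after seeing that many new and renowned ML and DL projects, such as PyTorch, TensorFlow, Alibaba DNN, and Tencent NCNN, among others, have either integrated or are looking to integrate the Vulkan SDK to add mobile (and cross-vendor) GPU support.

The Vulkan SDK offers a great low-level interface that enables highly specialized optimizations. However, it comes at the cost of highly verbose code, requiring 500-2000 lines of code just to begin writing application code. This has resulted in each of these projects having to implement the same baseline to abstract the non-compute-related features of the Vulkan SDK. This large amount of non-standardized boilerplate can result in limited knowledge transfer, a higher chance of unique framework implementation bugs being introduced, and so on.

We are currently developing Kompute not to hide the Vulkan SDK interface (as it is incredibly well designed) but to augment it with a direct focus on the Vulkan SDK's GPU computing capabilities. This article provides a high-level overview of the motivations of Kompute, together with a set of hands-on examples that introduce both GPU computing and the core Kompute architecture.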
