kompute

C/C++

v0.9.0

General purpose GPU compute framework for cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends).

Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU acceleration use cases.

Join the Discord & community calls · Blog post · Documentation · Examples

Kompute is backed by the Linux Foundation as a hosted project of the LF AI & Data Foundation.

Principles & Features

Flexible Python module with a C++ SDK for optimizations
Asynchronous and parallel processing support through GPU family queues
Mobile-enabled, with examples via the Android NDK across several architectures
BYOV: bring-your-own-Vulkan design to play nicely with existing Vulkan applications
Explicit relationships for GPU and host memory ownership and memory management
Robust codebase with 90% unit test code coverage
Advanced use cases in machine learning, mobile development and game development
Active community with monthly calls, Discord chat and more

Projects using Kompute

GPT4ALL - An ecosystem of open-source large language models that run locally on your CPU and nearly any GPU
llama.cpp - Port of Facebook's LLaMA model in C/C++
tpoisonooo/how-to-optimize-gemm - Row-major matmul optimization
vkJAX - JAX interpreter for Vulkan

Getting Started

Below you can find a GPU multiplication example using the C++ and Python Kompute interfaces.

You can join the Discord for questions and discussion, open a GitHub issue, or read the documentation.

Your First Kompute (C++)

The C++ interface provides low-level access to the native components of Kompute, enabling advanced optimizations as well as extension of components.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <kompute/Kompute.hpp>

void kompute(const std::string& shader) {

    // 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    kp::Manager mgr;

    // 2. Create and initialise Kompute Tensors through manager

    // Default tensor constructor simplifies creation of float values
    auto tensorInA = mgr.tensor({ 2., 2., 2. });
    auto tensorInB = mgr.tensor({ 1., 2., 3. });
    // Explicit type constructor supports uint32, int32, double, float and bool
    auto tensorOutA = mgr.tensorT<uint32_t>({ 0, 0, 0 });
    auto tensorOutB = mgr.tensorT<uint32_t>({ 0, 0, 0 });

    std::vector<std::shared_ptr<kp::Memory>> params = { tensorInA, tensorInB, tensorOutA, tensorOutB };

    // 3. Create algorithm based on shader (supports buffers & push/spec constants)
    kp::Workgroup workgroup({ 3, 1, 1 });
    std::vector<float> specConsts({ 2 });
    std::vector<float> pushConstsA({ 2.0 });
    std::vector<float> pushConstsB({ 3.0 });

    auto algorithm = mgr.algorithm(params,
                                   // See documentation shader section for compileSource
                                   compileSource(shader),
                                   workgroup,
                                   specConsts,
                                   pushConstsA);

    // 4. Run operation synchronously using sequence
    mgr.sequence()
        ->record<kp::OpSyncDevice>(params)
        ->record<kp::OpAlgoDispatch>(algorithm) // Binds default push consts
        ->eval() // Evaluates the two recorded operations
        ->record<kp::OpAlgoDispatch>(algorithm, pushConstsB) // Overrides push consts
        ->eval(); // Evaluates only last recorded operation

    // 5. Sync results from the GPU asynchronously
    auto sq = mgr.sequence();
    sq->evalAsync<kp::OpSyncLocal>(params);

    // ... Do other work asynchronously whilst GPU finishes

    sq->evalAwait();

    // Prints the first output which is: { 4, 8, 12 }
    for (const uint32_t& elem : tensorOutA->vector()) std::cout << elem << "  ";
    // Prints the second output which is: { 10, 10, 10 }
    for (const uint32_t& elem : tensorOutB->vector()) std::cout << elem << "  ";

} // Manages / releases all CPU and GPU memory resources

int main() {

    // Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    // files). This shader shows some of the main components including constants, buffers, etc
    std::string shader = (R"(
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initialization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    )");

    // Run the function declared above with our raw string shader
    kompute(shader);
}
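
The example dispatches a workgroup of { 3, 1, 1 } against a shader that declares local_size_x = 1, so the shader body runs three times with index 0, 1 and 2. The following plain-Python sketch (no GPU involved) enumerates the gl_GlobalInvocationID values a dispatch produces; the helper name global_invocation_ids is made up for illustration, and the listing order is arbitrary because a GPU does not guarantee any execution order.

```python
from itertools import product

def global_invocation_ids(workgroup, local_size=(1, 1, 1)):
    """Enumerate gl_GlobalInvocationID values for one dispatch.

    workgroup  -- number of workgroups per axis (x, y, z)
    local_size -- invocations per workgroup per axis (local_size_x/y/z in GLSL)
    """
    ids = []
    # Iterate z outermost and x innermost; the order is for readability only.
    for gz, gy, gx in product(*(range(n) for n in reversed(workgroup))):
        for lz, ly, lx in product(*(range(n) for n in reversed(local_size))):
            # GLSL: gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
            ids.append((gx * local_size[0] + lx,
                        gy * local_size[1] + ly,
                        gz * local_size[2] + lz))
    return ids

print(global_invocation_ids((3, 1, 1), local_size=(1, 1, 1)))
# [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
```

With three tensor elements and a local size of 1, the workgroup count along x has to be 3 for every element to be visited once.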

Your First Kompute (Python)

The Python package provides a high-level interactive interface that enables experimentation whilst ensuring high performance and fast development workflows.

import kp
import numpy as np

from .utils import compile_source  # using util function from python/test/utils

def kompute(shader):
    # 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    mgr = kp.Manager()

    # 2. Create and initialise Kompute Tensors through manager

    # Default tensor constructor simplifies creation of float values
    tensor_in_a = mgr.tensor([2, 2, 2])
    tensor_in_b = mgr.tensor([1, 2, 3])
    # Explicit type constructor supports uint32, int32, double, float and bool
    tensor_out_a = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    tensor_out_b = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    assert tensor_out_a.data_type() == kp.DataTypes.uint

    params = [tensor_in_a, tensor_in_b, tensor_out_a, tensor_out_b]

    # 3. Create algorithm based on shader (supports buffers & push/spec constants)
    workgroup = (3, 1, 1)
    spec_consts = [2]
    push_consts_a = [2]
    push_consts_b = [3]

    # See documentation shader section for compile_source
    spirv = compile_source(shader)

    algo = mgr.algorithm(params, spirv, workgroup, spec_consts, push_consts_a)

    # 4. Run operation synchronously using sequence
    (mgr.sequence()
        .record(kp.OpTensorSyncDevice(params))
        .record(kp.OpAlgoDispatch(algo))  # Binds default push consts provided
        .eval()  # evaluates the two recorded ops
        .record(kp.OpAlgoDispatch(algo, push_consts_b))  # Overrides push consts
        .eval())  # evaluates only the last recorded op

    # 5. Sync results from the GPU asynchronously
    sq = mgr.sequence()
    sq.eval_async(kp.OpTensorSyncLocal(params))

    # ... Do other work asynchronously whilst GPU finishes

    sq.eval_await()

    # Prints the first output which is: { 4, 8, 12 }
    print(tensor_out_a.data())
    # Prints the second output which is: { 10, 10, 10 }
    print(tensor_out_b.data())

if __name__ == "__main__":

    # Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    # files). This shader shows some of the main components including constants, buffers, etc
    shader = """
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initialization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    """

    kompute(shader)
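
To see where the expected outputs { 4, 8, 12 } and { 10, 10, 10 } come from, it can help to replay the shader arithmetic on the CPU. The sketch below is plain Python and does not use Kompute at all; run_shader_on_cpu is a hypothetical helper that mirrors the two += lines of the shader, and it is called twice because the example dispatches the algorithm twice (first with push constant 2, then with push constant 3).

```python
def run_shader_on_cpu(in_a, in_b, out_a, out_b, const_one, push_val):
    """Apply one dispatch of the example shader body to every index."""
    for index in range(len(in_a)):
        # out_a[index] += uint(in_a[index] * in_b[index]);
        out_a[index] += int(in_a[index] * in_b[index])
        # out_b[index] += uint(const_one * push_const.val);
        out_b[index] += int(const_one * push_val)

in_a, in_b = [2.0, 2.0, 2.0], [1.0, 2.0, 3.0]
out_a, out_b = [0, 0, 0], [0, 0, 0]
spec_const = 2.0  # value passed as the specialization constant

run_shader_on_cpu(in_a, in_b, out_a, out_b, spec_const, push_val=2.0)  # first dispatch
run_shader_on_cpu(in_a, in_b, out_a, out_b, spec_const, push_val=3.0)  # second dispatch

print(out_a)  # [4, 8, 12]
print(out_b)  # [10, 10, 10]
```

Because both outputs accumulate with +=, each dispatch adds to the previous result: out_a receives the element-wise product twice, and out_b receives 2 * 2 followed by 2 * 3.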

Interactive Notebooks & Hands-on Videos

You can try out the interactive Colab notebooks, which let you use a free GPU. The available notebooks cover the Python and C++ examples:

Try the interactive C++ Colab from the blog post	Try the interactive Python Colab from the blog post

You can also check out the following two talks presented at the FOSDEM 2021 conference.

Both videos have timestamps that let you skip to the section most relevant to you - the introduction and motivation are almost the same in both, so you can skip ahead to the more specific content.

Watch the video for C++ enthusiasts	Watch the video for Python & machine learning enthusiasts

Architectural Overview

The core architecture of Kompute includes the following:

Kompute Manager - Base orchestrator which creates and manages the device and child components
Kompute Sequence - Container of operations that can be sent to the GPU as a batch
Kompute Operation (Base) - Base class from which all operations inherit
Kompute Tensor - Tensor-structured data used in GPU operations
Kompute Algorithm - Abstraction for the (shader) logic executed on the GPU

To see a full breakdown you can read further in the C++ Class Reference.
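
As a rough mental model of the Sequence component, the following toy class batches recorded operations and runs them on eval. It is a hypothetical stand-in written for illustration, not the Kompute class: ToySequence, its log attribute and the add helper are all made up. The one behaviour it borrows from the examples is that operations recorded before an eval are not run again by the next eval.

```python
class ToySequence:
    """Hypothetical stand-in for a sequence; not the Kompute implementation."""

    def __init__(self):
        self._pending = []  # operations recorded since the last eval
        self.log = []       # names of operations that have been run

    def record(self, name, fn):
        self._pending.append((name, fn))
        return self  # returning self allows chained calls

    def eval(self):
        # Run every pending operation as one batch, then clear the batch.
        for name, fn in self._pending:
            fn()
            self.log.append(name)
        self._pending.clear()
        return self

state = {"value": 0}

def add(n):
    """Build an operation that adds n to the shared state when run."""
    return lambda: state.update(value=state["value"] + n)

seq = ToySequence()
seq.record("add 2", add(2)).record("add 3", add(3)).eval()  # runs both operations
seq.record("add 5", add(5)).eval()                          # runs only the new one

print(state["value"])  # 10
print(seq.log)         # ['add 2', 'add 3', 'add 5']
```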

Full architecture	Simplified Kompute components
(very small; check the full reference diagram in the docs for details)

Asynchronous and Parallel Operations

Kompute provides the flexibility to run operations asynchronously through vk::Fence. Kompute also enables explicit allocation of queues, which allows operations to execute in parallel across queue families.

The image below gives an intuition of how Kompute Sequences can be allocated to different queues to enable parallel execution based on the hardware. You can see the hands-on example, as well as the detailed documentation page describing how it works using an NVIDIA 1650 as an example.
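
The submit-then-await pattern described here can be illustrated without a GPU. The sketch below is only an analogy built on Python's standard concurrent.futures module: fake_gpu_job is a made-up stand-in for work submitted to a queue, pool.submit plays the role of eval_async, and result() plays the role of eval_await. It says nothing about how Kompute schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_gpu_job(n):
    """Hypothetical stand-in for GPU work submitted to one queue."""
    time.sleep(0.05)
    return n * n

with ThreadPoolExecutor(max_workers=2) as pool:  # two workers, like two queues
    # Comparable to eval_async: submit and return immediately.
    job_a = pool.submit(fake_gpu_job, 3)
    job_b = pool.submit(fake_gpu_job, 4)

    host_result = sum(range(10))  # other host work while the jobs run

    # Comparable to eval_await: block until each job has finished.
    results = [job_a.result(), job_b.result()]

print(host_result, results)  # 45 [9, 16]
```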

Mobile Enabled

Kompute has been optimized to work in mobile environments. The build system enables dynamic loading of the Vulkan shared library in Android environments, together with a working Android NDK wrapper for the C++ headers.

For a full deep dive you can read the blog post "Supercharging your Mobile Apps with On-Device GPU Accelerated Machine Learning".

You can also access the end-to-end example code in the repository, which can be run using Android Studio.

More Examples

Simple Examples

Simple multiplication example
Record batch commands with a Kompute Sequence
Run asynchronous operations
Run parallel operations across multiple GPU queues
Create your own custom Kompute operations
Implement logistic regression from scratch

End-to-end Examples

Machine learning logistic regression implementation
Parallelizing GPU-intensive workloads via multi-queue operations
Android NDK mobile Kompute ML application
Game development Kompute ML in Godot Engine

Python Package

Besides the C++ core SDK you can also use the Python package of Kompute, which exposes the same core functionality and supports interoperability with Python objects such as lists, NumPy arrays, etc.

The only dependencies are Python 3.5+ and CMake 3.4.1+. You can install Kompute from the Python PyPI package using the following command:

 pip install kp

You can also install from the master branch using:

 pip install git+https://github.com/KomputeProject/kompute.git@master

For further details you can read the Python package documentation or the Python class reference documentation.
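
After installing, a quick way to confirm that the package is visible to the interpreter you are using is to look up the kp module without importing it. This is a minimal sketch using only the standard library; kompute_available is a made-up helper name, and the check only tells you the module can be found, not that a Vulkan-capable device is present.

```python
import importlib.util

def kompute_available(module_name="kp"):
    """Return True if the named top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None

if kompute_available():
    print("kp is installed")
else:
    print("kp not found; try: pip install kp")
```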

C++ Build Overview

The build system provided uses CMake, which allows for cross-platform builds.

The top-level Makefile provides a set of optimized configurations for development as well as for building the Docker image, but you can start a build with the following command:

   cmake -Bbuild

You can also add Kompute to your own repository with add_subdirectory - the Android example CMakeLists.txt file shows how this is done.

For a more advanced overview of the build configuration, check out the Build System Deep Dive documentation.

Kompute Development

We appreciate PRs and issues. If you want to contribute, try checking the "Good first issue" tag, but even using Kompute and reporting issues is a great contribution!

Contributing

Dev Dependencies

Testing
- GTest
Documentation
- Doxygen (with Dot)
- Sphinx

Development

Follows the Mozilla C++ Style Guide https://www-archive.mozilla.org/hacking/mozilla-style-guide.html
- Uses a post-commit hook to run the linter; you can set it up
- All dependencies are defined in vcpkg.json
Uses CMake as the build system, and provides a top-level Makefile with the recommended commands
Uses xxd (or the xxd.exe Windows 64-bit port) to convert shader SPIR-V to header files
Uses Doxygen and Sphinx for documentation and autodocs
Uses vcpkg for finding the dependencies; it is the recommended setup for retrieving the libraries
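
For readers unfamiliar with the xxd step in the list above, the idea is to embed a compiled shader binary in a C/C++ header as a byte array. The sketch below imitates that kind of output in plain Python; it is an illustration only, not the project's build tooling, and to_c_header and the variable name shader_spv are hypothetical.

```python
def to_c_header(data: bytes, var_name: str, per_line: int = 12) -> str:
    """Render bytes as a C array definition, similar in spirit to `xxd -i`."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))
    body = ",\n".join(lines)
    return (f"unsigned char {var_name}[] = {{\n{body}\n}};\n"
            f"unsigned int {var_name}_len = {len(data)};\n")

# A little-endian SPIR-V module starts with the magic number 0x07230203.
header = to_c_header(bytes([0x03, 0x02, 0x23, 0x07]), "shader_spv")
print(header)
```

The generated text can be written to a .hpp file and included by the C++ code, so the shader ships inside the binary instead of being loaded from disk at runtime.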

If you want to run with debug layers you can add them with the KOMPUTE_ENV_DEBUG_LAYERS parameter as:

 export KOMPUTE_ENV_DEBUG_LAYERS="VK_LAYER_LUNARG_api_dump"
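
When working from Python rather than a shell, the same variable can be set in the process environment before Kompute is used. This is a minimal sketch; it assumes the variable is read when the Kompute manager is created, so it should be set before that point.

```python
import os

# Assumption: Kompute reads this variable when the manager is created,
# so set it before any manager is constructed in this process.
os.environ["KOMPUTE_ENV_DEBUG_LAYERS"] = "VK_LAYER_LUNARG_api_dump"

print(os.environ["KOMPUTE_ENV_DEBUG_LAYERS"])
```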

Updating documentation

To update the documentation you will need to:

Run the gendoxygen target in the build system
Run the gensphynx target in the build system
Push to GitHub Pages with make push_docs_to_ghpages

Running tests

Running the unit tests has been significantly simplified for contributors.

The tests run on the CPU and can be triggered using the act command line interface (https://github.com/nektos/act) - once you have installed the command line tool (and started the Docker daemon) you just have to type:

 $ act

[Python Tests/python-tests]   Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ]   Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ]   🐳  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
[Python Tests/python-tests]   🐳  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
...

The repository contains unit tests for the C++ and Python code, which can be found under the test/ and python/test folders.

The tests are run through the CI using GitHub Actions, which uses the images found in docker-builders/.

To minimise hardware requirements, the tests can run without a GPU, directly on the CPU, using SwiftShader.

For more information on how the CI and tests are set up, you can go to the CI, Docker and Tests section in the documentation.

Motivations

This project started after seeing that a lot of new and renowned ML & DL projects like PyTorch, TensorFlow, Alibaba DNN, Tencent NCNN - among others - have either integrated or are looking to integrate the Vulkan SDK to add GPU support.

The Vulkan SDK offers a great low-level interface that enables highly specialized optimizations - however, it comes at the cost of highly verbose code, requiring 500-2000 lines of code before you can even begin writing application code. This has resulted in each of these projects having to implement the same baseline to abstract the non-compute-related features of the Vulkan SDK. This large amount of non-standardised boilerplate can result in limited knowledge transfer, a higher chance of unique framework implementation bugs being introduced, etc.

We are currently developing Kompute not to hide the Vulkan SDK interface (as it is incredibly well designed) but to augment it with a direct focus on the Vulkan SDK's GPU computing capabilities. This article provides a high-level overview of the motivations of Kompute, together with a set of hands-on examples that introduce both GPU computing and the core Kompute architecture.
