skit
1.0.0
A set of useful tools that I use across my codebases.
Installation:
git clone git@github.com:Shamdan17/skit.git
pip install -e .
DatasetPreloader
DatasetPreloader is a wrapper around torch.utils.data.Dataset that caches the dataset to disk. Caching happens either eagerly at instantiation or lazily when a batch is requested. This avoids the overhead of loading from disk on every access, especially when a single instance is assembled from multiple files. More importantly, it lets expensive preprocessing steps run only once.
Warning: Currently only supports datasets that return tensors or tuples of tensors.
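For reference, a dataset compatible with this constraint might look like the following minimal sketch (ToyTensorDataset and its contents are illustrative, not part of skit):

import torch
from torch.utils.data import Dataset

class ToyTensorDataset(Dataset):
    """A minimal dataset that returns a tuple of tensors per instance."""
    def __init__(self, n=1000):
        self.x = torch.randn(n, 16)         # illustrative features
        self.y = torch.randint(0, 2, (n,))  # illustrative labels

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # Returns a tuple of tensors, which is the supported return type.
        return self.x[idx], self.y[idx]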
Usage:
from skit.data import DatasetPreloader

dataset = myTorchDataset()
cache_path = 'path/to/cache'

# Wrap the dataset
dataset = DatasetPreloader(
    dataset,
    cache_path=cache_path,
    wipe_cache=False,  # If the cache exists, use it; otherwise, create it. If True, delete the cache if it exists.
    lazy_loading=True,  # Load the entire dataset into memory on instantiation, or lazily when a batch is requested.
    compress=True,  # Compress the cache. This can save a lot of disk space, but can be slower to load.
    block_size=2000,  # The number of samples to store in a single folder. Avoids having too many files in a single directory, which can cause performance issues. Set to 0 to disable.
    preloading_workers=10,  # The number of workers to use when preloading the dataset. Does not affect lazy loading.
    samples_to_confirm_cache=100,  # The number of samples to check when confirming the cache. If your dataset has many instances, increase this value. Note that this check is only a heuristic and is not 100% accurate. If in doubt, wipe the cache.
)
# Access the dataset as normal

InMemoryDatasetPreloader
InMemoryDatasetPreloader is a wrapper on top of DatasetPreloader that loads the entire dataset into memory. This is useful if you have a small dataset and want to avoid the overhead of loading from disk every time. It has exactly the same API as DatasetPreloader.
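A minimal usage sketch, assuming InMemoryDatasetPreloader is importable from skit.data like DatasetPreloader and, per the note above, takes the same arguments (myTorchDataset is the user-defined dataset from the earlier example):

from torch.utils.data import DataLoader
from skit.data import InMemoryDatasetPreloader  # assumed import path, mirroring DatasetPreloader

# Same API as DatasetPreloader (per the note above)
dataset = InMemoryDatasetPreloader(
    myTorchDataset(),
    cache_path='path/to/cache',
)

# The wrapped dataset behaves like any torch Dataset, e.g. with a DataLoader
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    ...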