skit
1.0.0
A collection of useful tools that I use across my codebases.
Installation:
git clone git@github.com:Shamdan17/skit.git
pip install -e .
DatasetPreloader is a wrapper around torch.utils.data.Dataset that caches the dataset to disk. The caching happens either eagerly on instantiation or lazily as batches are requested. This avoids the overhead of loading from disk on every access, especially when a single instance is assembled from multiple files. More importantly, it lets you skip expensive preprocessing steps by performing them only once.
Warning: Currently only supports datasets that return tensors or tuples of tensors.
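For illustration, a dataset satisfying this restriction might look like the following sketch. The class name matches the placeholder instantiated in the usage example below; its contents are purely illustrative.

import torch
from torch.utils.data import Dataset

# Illustrative placeholder: a dataset whose __getitem__ returns
# a tuple of tensors, as DatasetPreloader requires.
class myTorchDataset(Dataset):
    def __init__(self, n=1000):
        self.inputs = torch.randn(n, 16)          # toy feature tensors
        self.targets = torch.randint(0, 2, (n,))  # toy integer labels

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, idx):
        # A tuple of tensors, which DatasetPreloader supports
        return self.inputs[idx], self.targets[idx]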
Usage:
from skit.data import DatasetPreloader

dataset = myTorchDataset()
cache_path = 'path/to/cache'

# Wrap the dataset
dataset = DatasetPreloader(
    dataset,
    cache_path=cache_path,
    wipe_cache=False,  # If the cache exists, use it; otherwise, create it. If True, delete the cache if it exists.
    lazy_loading=True,  # Cache the entire dataset on instantiation, or lazily as batches are requested
    compress=True,  # Compress the cache. This can save a lot of disk space, but can be slower to load.
    block_size=2000,  # The number of samples to store in a single folder. This avoids having too many files in one directory, which can cause performance issues. Set to 0 to disable.
    preloading_workers=10,  # The number of workers to use when preloading the dataset. Does not affect lazy loading.
    samples_to_confirm_cache=100,  # The number of samples to check when confirming the cache. If your dataset has many instances, increase this number. Note that this check is only a heuristic and is not 100% accurate. If in doubt, wipe the cache.
)
# Access the dataset as normal

InMemoryDatasetPreloader is a wrapper on top of DatasetPreloader that loads the entire dataset into memory. This is useful if you have a small dataset and want to avoid the overhead of loading from disk every time. It has exactly the same API as DatasetPreloader.
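A minimal sketch of how this might look, assuming InMemoryDatasetPreloader is importable from skit.data and mirrors DatasetPreloader's constructor arguments (both are assumptions, not verified here):

from skit.data import InMemoryDatasetPreloader
from torch.utils.data import DataLoader

dataset = myTorchDataset()

# Assumption: accepts the same arguments as DatasetPreloader
dataset = InMemoryDatasetPreloader(
    dataset,
    cache_path='path/to/cache',
)

# Access the dataset as normal, e.g. through a DataLoader
loader = DataLoader(dataset, batch_size=32)
for inputs, targets in loader:
    pass  # training/evaluation step goes here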