A script to move billions of files to a Snowball Edge efficiently
- 2022.01.19
- added option to bypass setting the auto-extract metadata tag
- 2021.02.20
- save filelist_dir as filelist-currentdata.gz when executing genlist
- 2021.02.20
- performance improvement of genlist; dumping the whole file list at once instead of writing it line by line
- 2021.02.20
- replacing scandir.walk with os.walk; since Python 3.5, os.walk is already implemented with scandir
- 2021.02.10
- replacing os.path with scandir.path to improve performance of file listing
- 2021.02.09
- python2 compatibility for "open(filename, encoding)"
- 2021.02.01
- modifying to support Windows
- refactoring to define variables more accurately
- 2021.01.26
- multiprocessing support for parallel uploading of tar files
- relevant parameter: max_process
- 2021.01.25
- removing the YAML feature, since it caused too much CPU consumption and poor performance
- fixing a bug that used two profiles (sbe1, default); now only the "sbe1" profile is used
- showing progress
- 2020.02.25
- changing filelist file to contain the target filename
- 2020.02.24
- fixing FIFO error
- adding example of real snowball configuration
- 2020.02.22
- limiting the number of threads
- adding multi-threading to improve performance
- adding a FIFO operation to handle big files over max_part_size
- 2020.02.19
- removing tarfiles_one_time logic
- splitting the buffer by max_part_size
- 2020.02.18:
- supporting Snowball limits:
- max_part_size: 512 MB
- min_part_size: 5 MB
- 2020.02.14:
- modifying for python3
- supporting Korean on Windows
- 2020.02.12: adding features
- gen_filelist by size
- 2020.02.10: changing filename from tar_to_s3_v7_multipart.py to snowball_uploader_8.py
- adding features to split tar files by size and count
- adding a feature to create the file list
- showing help message
snowball_uploader was developed to move large numbers of files efficiently to a Snowball or Snowball Edge, the AWS appliances for migrating petabytes of files to S3. In particular, when there are millions of small files, transferring them one by one takes far too long, which delays the project and raises the cost of renting the Snowball. Using snowball_uploader, you can shorten the transfer time: it archives files into chunks in memory, sends those large chunks, and aggregates them into several tar files.
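To make that mechanism concrete, here is a minimal sketch of the in-memory tar idea, assuming boto3 is configured; the bucket, endpoint, and file names are placeholders, not the script's actual code:

```python
import io
import tarfile

import boto3

def make_tar_in_memory(file_paths):
    """Archive many small files into a single tar held entirely in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        for path in file_paths:
            tar.add(path)  # read from disk, write into the in-memory buffer
    buf.seek(0)
    return buf

s3 = boto3.client('s3', endpoint_url='http://10.10.10.10:8080')  # Snowball endpoint
tar_buf = make_tar_in_memory(['a.txt', 'b.txt'])
# one request per tar instead of one request per small file
s3.put_object(Bucket='your-own-bucket', Key='snowball-batch-1.tar', Body=tar_buf)
```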
First, let me show you the performance results. The first Snowball result was measured while uploading each file individually while renaming it, and the second was measured while applying this script, which creates tar archives and sends them to the Snowball in memory. From the table below, you will see at least 7x better performance with the second option.
| Target | Number of files | Total capacity | NAS -> Snowball time | Snowball -> S3 time | Failed objects |
|---|---|---|---|---|---|
| 1st Snowball performance | 19,567,430 | 2,408 GB | 1 week | 113 hours | 954 |
| 2nd Snowball performance | approx. 119,577,235 | 14,708 GB | 1 week | 26 hours | 0 |
The following variables at the top of the script should be adjusted before running it:

```python
import os
import boto3

bucket_name = "your-own-bucket"
session = boto3.Session(profile_name='sbe1')
s3 = session.client('s3', endpoint_url='http://10.10.10.10:8080')
# or below
#s3 = boto3.client('s3', endpoint_url='https://s3.ap-northeast-2.amazonaws.com')
#s3 = boto3.client('s3', region_name='ap-northeast-2', endpoint_url='https://s3.ap-northeast-2.amazonaws.com', aws_access_key_id=None, aws_secret_access_key=None)
target_path = '/move/to/s3/orgin/'  ## very important!! change to your source directory
max_tarfile_size = 10 * 1024 ** 3   # 10 GB
max_part_size = 300 * 1024 ** 2     # 300 MB
min_part_size = 5 * 1024 ** 2       # 5 MB
max_process = 5                     # concurrent processes; set it to less than the number of filelist files in filelist_dir
if os.name == 'nt':
    filelist_dir = "C:/Temp/fl_logdir_dkfjpoiwqjefkdjf/"  # for Windows
else:
    filelist_dir = '/tmp/fl_logdir_dkfjpoiwqjefkdjf/'     # for Linux
```

The genlist parameter generates manifest files that map each original file to its target file. It must be executed before copying the files.
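A rough sketch of what this pass does, assuming os.walk (gen_filelist is named in the changelog, but the max_file_number argument is an assumption; the '- original: target' line format matches the fl_1.yml sample below):

```python
import os

def gen_filelist(target_path, filelist_dir, max_file_number=1000):
    """Walk target_path and dump '- original: target' manifest lines,
    split into fl_N.yml files of at most max_file_number entries each."""
    os.makedirs(filelist_dir, exist_ok=True)
    fl_index, lines = 1, []
    for root, dirs, files in os.walk(target_path):
        for name in files:
            org_file = os.path.join(root, name)
            lines.append('- %s: %s\n' % (org_file, org_file))
            if len(lines) >= max_file_number:
                with open(os.path.join(filelist_dir, 'fl_%d.yml' % fl_index), 'w') as fl:
                    fl.writelines(lines)  # dump the whole list at once, not line by line
                fl_index, lines = fl_index + 1, []
    if lines:  # flush the remainder
        with open(os.path.join(filelist_dir, 'fl_%d.yml' % fl_index), 'w') as fl:
            fl.writelines(lines)
```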
```
ec2-user> python3 snowball_uploader.py genlist
ec2-user> ls /tmp/fl_logdir_dkfjpoiwqjefkdjf/
fl_1.yml  fl_2.yml  fl_3.yml  fl_4.yml  fl_5.yml
ec2-user> cat fl_1.yml
- ./snowball_uploader_11_failed.py: ./snowball_uploader_11_failed.py
- ./success_fl_2.yaml_20200226_002049.log: ./success_fl_2.yaml_20200226_002049.log
- ./file_list.txt: ./file_list.txt
- ./snowball-fl_1-20200218_151840.tar: ./snowball-fl_1-20200218_151840.tar
- ./bytesio_test.py: ./bytesio_test.py
- ./filelist_dir1_10000.txt: ./filelist_dir1_10000.txt
- ./snowball_uploader_14_success.py: ./snowball_uploader_14_success.py
- ./error_fl_1.txt_20200225_022018.log: ./error_fl_1.txt_20200225_022018.log
- ./snowball_uploader_debug_success.py: ./snowball_uploader_debug_success.py
- ./success_fl_1.txt_20200225_022018.log: ./success_fl_1.txt_20200225_022018.log
- ./snowball_uploader_20_thread.py: ./snowball_uploader_20_thread.py
- ./success_fl_1.yml_20200229_173222.log: ./success_fl_1.yml_20200229_173222.log
- ./snowball_uploader_14_ing.py: ./snowball_uploader_14_ing.py
```

Each entry maps an original file to its target name. The target name is produced by the rename_file function; by default it keeps the original name unchanged:

```python
def rename_file(org_file):
    target_file = org_file  ## change this line to rename the target
    return target_file
```
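As a hypothetical customization (not part of the script), rename_file could re-root every target key under a prefix:

```python
def rename_file(org_file):
    # e.g. './data/a.txt' becomes 'migrated/data/a.txt' in the bucket
    target_file = 'migrated/' + org_file.lstrip('./')
    return target_file
```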
The cp_snowball parameter transfers the files to the Snowball. While the script runs, it creates two log files: success_'file_name'_'timestamp'.log and error_'file_name'_'timestamp'.log.
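A small sketch of how such timestamped log names can be built (the helper function is illustrative, inferred from the log names in the fl_1.yml listing above):

```python
from datetime import datetime

def log_filenames(filelist_name):
    # e.g. success_fl_1.yml_20200229_173222.log and error_fl_1.yml_20200229_173222.log
    ts = datetime.now().strftime('%Y%m%d_%H%M%S')
    return ('success_%s_%s.log' % (filelist_name, ts),
            'error_%s_%s.log' % (filelist_name, ts))
```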
The built-in help message summarizes both options:

```python
#print('\n')
print('genlist: ')
print('this option will generate files which contain the target file list in %s' % (filelist_dir))
#print('\n')
print('cp_snowball: ')
print('cp_snowball option will copy the files on the server to snowball efficiently')
print('the mechanism is here:')
print('1. reads the target file names from one filelist file in the filelist directory')
print('2. accumulates files up to max_part_size in memory')
print('3. if it reaches max_part_size, sends the chunk to snowball using MultiPartUpload')
print('4. while sending data chunks, threads are invoked up to max_thread')
print('5. after the send completes, a tar file is generated in snowball')
print('6. then, moves to the next filelist file recursively')
```

I am not a professional programmer, so this script may have some flaws and its error handling is very poor. It can also consume a large amount of memory if you set the parameters (max_threads, max_part_size, and max_tarfile_size) too high, which can cause the system to freeze, so test it several times with sample data first. When I used it at a customer site, it reduced the transfer time by more than a factor of 10. I hope this script helps you as well.
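Finally, for readers adapting the upload step (step 3 above), here is a minimal sketch of MultiPartUpload with the Snowball auto-extract metadata tag mentioned in the 2022.01.19 changelog entry; the helper name and chunk handling are simplified assumptions, not the script's exact code:

```python
import boto3

session = boto3.Session(profile_name='sbe1')
s3 = session.client('s3', endpoint_url='http://10.10.10.10:8080')

def upload_tar_parts(bucket, key, chunks):
    """Multipart-upload pre-accumulated in-memory chunks of a tar stream.
    The snowball-auto-extract tag asks the Snowball to unpack the tar into
    individual objects during import into S3."""
    mpu = s3.create_multipart_upload(
        Bucket=bucket, Key=key,
        Metadata={'snowball-auto-extract': 'true'})  # can be bypassed since 2022.01.19
    parts = []
    for part_no, chunk in enumerate(chunks, start=1):
        # every chunk except the last must be at least min_part_size (5 MB)
        resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                              PartNumber=part_no, Body=chunk)
        parts.append({'PartNumber': part_no, 'ETag': resp['ETag']})
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                                 MultipartUpload={'Parts': parts})
```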