Multi-threaded download of data from Google Cloud Storage with Python and the cloudvolume package
Introduction

This article describes how to download a dataset from Google Cloud Storage with Python.

The data downloaded here is a large-scale 3D volume (equivalently, a 2D image series), roughly 1 TB in total.

It is saved locally in zarr format, in which the 3D data is stored with axis order [z, y, x].

Note: this code failed when tested on Windows 10 and succeeded on Ubuntu, so for now downloading is only supported on Ubuntu!
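The [z, y, x] axis order matters because cloudvolume returns arrays indexed as [x, y, z, channel]. A minimal numpy sketch of the conversion (the shape here is made up for illustration; the same transpose appears later in fetch_in_block):

```python
import numpy as np

# CloudVolume-style array, indexed [x, y, z, channel] (illustrative shape)
raw = np.arange(4 * 3 * 2 * 1).reshape(4, 3, 2, 1)

# Drop the channel axis and reorder to zarr's [z, y, x] layout
zyx = np.transpose(raw[..., 0], [2, 1, 0])

print(zyx.shape)  # (2, 3, 4)
```

Element (z, y, x) of the result is element (x, y, z, 0) of the original, so nothing is copied out of order, only relabeled.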
Steps

Prerequisites

You need to know where the data lives on Google Cloud Storage, for example:
in_vol = "https://storage.googleapis.com/j0126-nature-methods-data/GgwKmcKgrcoNxJccKuGIzRnQqfit9hnfK1ctZzNbnuU/rawdata_realigned"

Environment
```
conda create -n cloudvol python=3.6
conda activate cloudvol
pip install numpy
pip install cloud-volume
pip install zarr daisy
```
Function

(The implementation details can be skipped.)
```python
import cloudvolume
import daisy
import json
import logging
import numpy as np
import os
import sys

logging.basicConfig(level=logging.INFO)


def world_to_vox(offset, voxel_size):
    # Convert world-unit coordinates (nm) to voxel indices
    return [int(i / j) for i, j in zip(offset, voxel_size)]


def fetch_in_block(block, voxel_size, raw_data, out_ds):
    logging.info('Fetching raw in block %s' % block.read_roi)
    voxel_size = list(voxel_size)
    block_start = list(block.write_roi.get_begin())
    block_end = list(block.write_roi.get_end())
    block_start = world_to_vox(block_start, voxel_size)
    block_end = world_to_vox(block_end, voxel_size)
    z_start, z_end = block_start[0], block_end[0]
    y_start, y_end = block_start[1], block_end[1]
    x_start, x_end = block_start[2], block_end[2]
    # CloudVolume indexes as [x, y, z, channel]; reorder to [z, y, x]
    raw = raw_data[x_start:x_end, y_start:y_end, z_start:z_end]
    raw = np.array(np.transpose(raw[..., 0], [2, 1, 0]))
    out_ds[block.write_roi] = raw


def fetch(raw_vol, voxel_size, roi_offset, roi_shape, out_file, out_ds, num_workers):
    total_roi = daisy.Roi((roi_offset), (roi_shape))
    read_roi = daisy.Roi((0, 0, 0), (20, 2304, 2304))  # [1*20=20, 256*9=2304, 256*9=2304]
    write_roi = read_roi
    logging.info('Creating out dataset...')
    raw_out = daisy.prepare_ds(
        out_file,
        out_ds,
        total_roi,
        voxel_size,
        dtype=np.uint8,  # data type of the saved dataset
        write_roi=write_roi)
    logging.info('Writing to dataset...')
    daisy.run_blockwise(
        total_roi,
        read_roi,
        write_roi,
        process_function=lambda b: fetch_in_block(b, voxel_size, raw_vol, raw_out),
        fit='shrink',
        num_workers=num_workers)
```

Example
```python
in_vol = "https://storage.googleapis.com/j0126-nature-methods-data/GgwKmcKgrcoNxJccKuGIzRnQqfit9hnfK1ctZzNbnuU/rawdata_realigned"
raw_vol = cloudvolume.CloudVolume(in_vol, bounded=True, progress=True)
print(raw_vol.info)
print(raw_vol.shape)  # rawdata_realigned: (10664, 10913, 5700, 1)

voxel_size = daisy.Coordinate((20, 9, 9))
roi_offset = [0, 0, 0]
roi_shape = [200, 45000, 45000]  # downloaded size in voxels: [200/20=10, 45000/9=5000, 45000/9=5000]

out_file = 'test.zarr'
out_ds = 'raw'
fetch(raw_vol, voxel_size, roi_offset, roi_shape, out_file, out_ds, num_workers=10)
```

Downloading the full dataset
```python
import argparse  # needed in addition to the imports above

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--num_single', type=int, default=0)
    parser.add_argument('-z', '--kz', type=int, default=0)
    args = parser.parse_args()

    in_vol = "https://storage.googleapis.com/j0126-nature-methods-data/GgwKmcKgrcoNxJccKuGIzRnQqfit9hnfK1ctZzNbnuU/rawdata_realigned"
    raw_vol = cloudvolume.CloudVolume(in_vol, bounded=True, progress=True)
    print(raw_vol.info)
    print(raw_vol.shape)  # rawdata_realigned: (10664, 10913, 5700, 1)

    voxel_size = daisy.Coordinate((20, 9, 9))
    size_x, size_y, size_z, _ = raw_vol.shape
    stride = 100
    num_z = size_z // stride
    print('the number of blocks is', num_z)

    roi_shape = [20, 98217, 95976]  # one z-slice in world units; 114000=5700*20, 98217=10913*9, 95976=10664*9

    # for kz in range(num_z):
    kz = args.kz
    out_file = 'rawdata_realigned/rawdata_realigned_%d_%d.zarr' % (kz * stride, (kz + 1) * stride)
    for k_single in range(args.num_single, stride):
        roi_offset = [(kz * stride + k_single) * 20, 0, 0]
        out_ds = 'raw_%d' % (kz * stride + k_single)
        fetch(raw_vol, voxel_size, roi_offset, roi_shape, out_file, out_ds, num_workers=20)
```
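The slab bookkeeping in the script above (a stride of 100 z-slices, a z voxel size of 20 nm) can be sanity-checked on its own. Here kz is fixed for illustration, standing in for the --kz command-line argument:

```python
stride = 100       # z-slices per output zarr file
voxel_size_z = 20  # nm per voxel along z

kz = 3  # example slab index (would come from --kz)

# Output file name and the world-unit z-offset of every slice in the slab
out_file = 'rawdata_realigned/rawdata_realigned_%d_%d.zarr' % (kz * stride, (kz + 1) * stride)
offsets = [(kz * stride + k_single) * voxel_size_z for k_single in range(stride)]

print(out_file)                 # rawdata_realigned/rawdata_realigned_300_400.zarr
print(offsets[0], offsets[-1])  # 6000 7980
```

Each slab thus covers 100 consecutive z-slices, and the --num_single flag lets a crashed run resume partway through a slab.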
Notes

The saved zarr store contains many files named *.*.*.

There is also one special file, .zarray. Briefly, it records the layout of the dataset; its first parameter is chunks.

The [50, 250, 250] in chunks is the size of the data held by each *.*.* file. Since the total volume is fixed, changing the chunks size controls how many *.*.* files are generated.
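To make the chunks/file-count relationship concrete, here is the arithmetic. One *.*.* file is written per chunk, so the count is a product of per-axis ceiling divisions (the dataset shape below is just an illustration matching the [50, 250, 250] example):

```python
import math

shape = (200, 5000, 5000)  # total dataset size in voxels (illustrative)
chunks = (50, 250, 250)    # chunk size recorded in .zarray

# One *.*.* file per chunk: multiply the per-axis chunk counts
n_files = 1
for s, c in zip(shape, chunks):
    n_files *= math.ceil(s / c)

print(n_files)  # 4 * 20 * 20 = 1600 chunk files
```

Doubling each chunk dimension would cut the file count by 8x, at the cost of larger reads when you only need a small region.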
The function above contains these two lines:

```python
read_roi = daisy.Roi((0, 0, 0), (20, 2304, 2304))  # [1*20=20, 256*9=2304, 256*9=2304]
write_roi = read_roi
```

As the name suggests, write_roi is what controls the chunk size. With the default above, the chunk size is [1, 256, 256] in voxels, i.e. each *.*.* file stores one 256x256 image.
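Dividing the write_roi extent (which is in world units) by the voxel size, axis by axis, gives that [1, 256, 256] chunk shape; a quick standalone check:

```python
voxel_size = (20, 9, 9)             # nm per voxel along z, y, x
write_roi_shape = (20, 2304, 2304)  # write_roi extent in world units (nm)

# chunk size in voxels = world extent / voxel size, per axis
chunk_voxels = [w // v for w, v in zip(write_roi_shape, voxel_size)]
print(chunk_voxels)  # [1, 256, 256]
```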
I initially tried modifying the write_roi variable, but I noticed a pattern: the resulting chunks value never exceeded 256. Puzzled, I guessed daisy had to be behind it, so I went to read its source.

The function in the source that generates chunks (daisy/datasets.py, line 182) is:
```python
import numpy as np
import daisy


def get_chunk_size_dim(b, target_chunk_size):
    best_k = None
    best_target_diff = 0
    for k in range(1, b + 1):
        if ((b // k) * k) % b == 0:
            diff = abs(b // k - target_chunk_size)
            if best_k is None or diff < best_target_diff:
                best_target_diff = diff
                best_k = k
    return b // best_k


def get_chunk_size(block_size):
    '''Get a reasonable chunk size that divides the given block size.'''
    chunk_size = daisy.Coordinate(
        get_chunk_size_dim(b, 256)
        for b in block_size)
    # logger.debug("Setting chunk size to %s", chunk_size)
    return chunk_size
```

Seeing get_chunk_size_dim, I understood why the generated chunk_size never exceeds 256: it returns what it considers an optimal result for the block size you give it (I did not dig into exactly how it chooses). The 256 here is hard-coded rather than exposed as a parameter, so it cannot be changed without editing the source. Disappointing.
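Copying just the per-dimension helper out of daisy (so no daisy install is needed) makes the behavior visible: the divisor-style condition keeps only values of b // k that exactly divide b, and the loop returns the divisor closest to the hard-coded target of 256:

```python
def get_chunk_size_dim(b, target_chunk_size):
    # b // k is a candidate chunk size; the modulo test keeps only exact
    # divisors of b, and the loop tracks the one closest to the target.
    best_k = None
    best_target_diff = 0
    for k in range(1, b + 1):
        if ((b // k) * k) % b == 0:
            diff = abs(b // k - target_chunk_size)
            if best_k is None or diff < best_target_diff:
                best_target_diff = diff
                best_k = k
    return b // best_k


print(get_chunk_size_dim(2304, 256))  # 256  (2304 = 9 * 256, so 256 divides it exactly)
print(get_chunk_size_dim(500, 256))   # 250  (the divisor of 500 closest to 256)
print(get_chunk_size_dim(20, 256))    # 20   (no divisor can exceed b itself)
```

This also explains the [50, 250, 250] chunks seen earlier: each value is the divisor of the corresponding block dimension nearest to 256, and a divisor can never be larger than min(b, roughly 2x the target).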
Summary

Reading the remote volume with cloudvolume and writing it blockwise into zarr with daisy makes it practical to download a roughly 1 TB volume in parallel slabs; just be aware that daisy caps the zarr chunk size at a hard-coded 256 per dimension.