Storage Systems
- Reference: Computer Architecture (6th Edition)
Contents
- Bus
- Disk Storage
- Use Arrays of Small Disks?
- RAID
- RAID 0: Striping
- RAID 1: Disk Mirroring/Shadowing
- RAID 2: Bit-interleaved Hamming-coded Array
- RAID 3: Bit-interleaved Parity Disk
- RAID 4: Block-interleaved Parity Disk
- RAID 5: Block-interleaved Distributed Parity
- RAID 6: Two-dimensional Parity Independent-access Disk Array
- RAID Implementation
- Storage Environment
- Direct Attached Storage (DAS)
- Network Attached Storage (NAS)
- Storage Area Network (SAN)
- Memory: main (internal) memory
- Storage Systems: external storage (persistent, non-volatile)
Bus
- I/O buses tap into the processor-memory bus via bus adaptors: the adaptors match speeds (by buffering) and provide the interface
Main components of Intel Chipset: Pentium 4
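- Northbridge (adaptor for high-speed devices): Handles memory, Graphics
- Southbridge (adaptor for low-speed devices): I/O, PCI bus, Disk controllers, USB controllers, Audio, Serial I/O, Interrupt controller, Timers
IMC (Integrated Memory Controller)
- CPUs are becoming more and more highly integrated: the Memory Controller has moved inside the CPU, and the Northbridge has disappeared. The L1 and L2 caches are integrated into each core, and the L3 cache, shared by the four cores, is also integrated into the CPU
- QPI (QuickPath Interconnect): supports multiple system bus links and replaces the front-side bus (FSB)
The next step is to integrate memory into the CPU as well…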
The move from Parallel to Serial I/O
- Parallel I/O (ISA bus, PCI, SCSI, IDE)
- Parallel bus clock rate limited by clock skew across long bus (~100MHz)
- High power to drive large number of loaded bus lines
- Central bus arbiter adds latency to each transaction, sharing limits throughput
- Expensive parallel connectors and backplanes/cables (all devices pay costs)
- Dedicated Point-to-point Serial Links (Ethernet, Infiniband, PCI Express, SATA, USB, Firewire)
- Point-to-point links run at multi-gigabit speed using advanced clock/signal encoding (requires lots of circuitry at each end)
- Lower power since only one well-behaved load
- Multiple simultaneous transfers
- Cheap cables and connectors (trade greater endpoint transistor cost for lower physical wiring cost), customize bandwidth per device using multiple links in parallel
- Example: hard disk interfaces moved from IDE (parallel) $\rightarrow$ SATA (serial)
Disk Storage
- Storage emphasizes reliability and scalability as well as cost-performance
- Which software is "king", i.e. determines which hardware features are actually used?
- Compiler for processor
- Operating System for storage
Flash: The future of disks? (solid-state drives)
- Flash drive advantages: Lower power (no moving parts), Much faster seek time, 100X IOs per second (no moving parts), Greater reliability (no moving parts), Lower noise (no moving parts) (performs well when data is not being moved)
- Flash disadvantages: Cost (20-100x disk cost/GB), Slow writes with current design (competitive with disks), limited write endurance (a location wears out after being written too many times). Endurance is not an issue for most applications, since wear leveling spreads writes across the blocks on the chip (the problem is handled in software)
Disk Figure of Merit: Areal Density
- Bits recorded along a track; Metric is Bits Per Inch (BPI)
- Number of tracks per surface; Metric is Tracks Per Inch (TPI)
- Bit density per unit area; Metric is Bits Per Square Inch: Areal Density $= \textrm{BPI} \times \textrm{TPI}$
Disk Drive Performance
- Disk Service Time: Time taken by a disk to complete an I/O request is sum of
- Seek Time, Rotational Latency, and Data Transfer Time (request size divided by the Data Transfer Rate in MB/s)
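The sum above can be sketched as a small calculation. This is a minimal sketch, not a model of any real drive: the function name and the example drive parameters are hypothetical, and it assumes the average rotational latency is half a revolution.

```python
def disk_service_time_ms(seek_ms, rpm, transfer_mb_per_s, request_kb):
    """Estimate the time for one disk I/O request, in milliseconds."""
    # Average rotational latency: assume half a revolution on average.
    rotational_ms = 0.5 * 60_000.0 / rpm
    # Transfer time: request size divided by the sustained transfer rate.
    transfer_ms = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms


# Hypothetical drive: 4 ms average seek, 7200 RPM, 100 MB/s, 4 KB request.
example_ms = disk_service_time_ms(4.0, 7200, 100.0, 4.0)
print(f"{example_ms:.2f} ms")
```

For small requests the mechanical terms (seek and rotation) dominate; the transfer term only matters for large sequential requests.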
Utilization vs. Response time
- The higher the utilization (I/O request rate), the longer the response time
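One way to see the shape of this trade-off is a simple queueing model. The sketch below assumes an M/M/1 queue (a single server with random arrivals and service times), which the notes do not state; under that assumption, response time is service time divided by (1 - utilization). The function name and the 10 ms service time are hypothetical.

```python
def mm1_response_time_ms(service_ms, utilization):
    """Mean response time of an M/M/1 queue (assumed model, not from the notes)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    # Queueing delay grows without bound as utilization approaches 1.
    return service_ms / (1.0 - utilization)


# Hypothetical 10 ms service time at three utilization levels.
response_curve = {u: mm1_response_time_ms(10.0, u) for u in (0.1, 0.5, 0.9)}
for u, t in response_curve.items():
    print(f"utilization {u:.0%}: {t:.1f} ms")
```

Under this model the curve is nearly flat at low utilization and rises sharply as the device approaches saturation.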
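Parameters describing the reliability of storage peripherals
- Reliability: the system's ability to keep delivering service continuously, starting from an initial state
- Measured by MTTF (Mean Time To Failure)
- Availability: the fraction of the interval between two consecutive starts of normal service during which the system is working normally
- Measured by $\frac{\textrm{MTTF}}{\textrm{MTTF} + \textrm{MTTR}}$, where MTTR is the Mean Time To Repair (repair $\rightarrow$ data recovery)
- $\textrm{MTTF} + \textrm{MTTR} = \textrm{MTBF}$ (Mean Time Between Failures)
- Dependability: the degree to which the service can justifiably be considered reliable
- Dependability cannot be quantified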
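The availability and MTBF formulas above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the example figures (1,000,000 hours MTTF, 24 hours MTTR) are hypothetical.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours, mttr_hours):
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


# Hypothetical device: fails once per 1,000,000 hours, takes 24 hours to repair.
example_availability = availability(1_000_000, 24)
example_mtbf = mtbf(1_000_000, 24)
print(f"availability = {example_availability:.6f}, MTBF = {example_mtbf} hours")
```

Because MTTR appears only in the denominator, shortening repair (data recovery) time raises availability even when MTTF is unchanged.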
Use Arrays of Small Disks?
Replace Small Number of Large Disks with Large Number of Small Disks!
- Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?
Array Reliability
- Reliability of $N$ disks = Reliability of 1 Disk $\div\ N$
- Arrays (without redundancy) too unreliable to be useful!
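A quick calculation shows how fast reliability falls. The sketch below assumes independent disk failures at a constant rate, so the array fails as soon as any one disk fails; the function name and the figures (50,000-hour disks, 70 of them) are illustrative assumptions.

```python
def array_mttf_hours(disk_mttf_hours, num_disks):
    """MTTF of a non-redundant array: single-disk MTTF divided by N."""
    return disk_mttf_hours / num_disks


# Hypothetical: 70 disks, each with a 50,000-hour MTTF (about 5.7 years).
example_array_mttf = array_mttf_hours(50_000, 70)
print(f"{example_array_mttf:.0f} hours, about {example_array_mttf / 24:.0f} days")
```

With these numbers the array is expected to fail in roughly a month, even though each disk lasts years on its own.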
RAID
Redundant Arrays of (Inexpensive) Disks
- Files are "striped" across multiple disks (data is stored in stripes spread over many disks)
- Redundancy yields high data availability (Disks will still fail)
- Availability: service still provided to user, even if some components failed
- Contents reconstructed from data redundantly stored in the array
- Capacity penalty to store redundant info
- Bandwidth penalty to update redundant info
RAID 0: Striping
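Data striping
- RAID 0: a non-redundant disk array; it stores no redundant information
- Data is divided into stripes, which are distributed across the disks in an interleaved fashion, forming one larger-capacity disk whose members work in parallel (Stripe0, Stripe1, … are consecutive stripes; the size of a stripe is called the stripe width)
- All disks can be read in parallel, so performance is high; but there is no redundancy, so a failure of any single disk makes the whole system unusable
- Suited to workloads that need high-bandwidth disk access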
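The interleaved placement described above can be sketched as a mapping from a logical stripe number to a disk and a position on that disk. This assumes plain round-robin placement; the function name is hypothetical.

```python
def raid0_locate(stripe_index, num_disks):
    """Map a logical stripe to (disk number, stripe slot on that disk)."""
    # Consecutive stripes go to consecutive disks, wrapping around.
    return stripe_index % num_disks, stripe_index // num_disks


# Layout of the first six stripes on a 4-disk array.
layout = [raid0_locate(i, 4) for i in range(6)]
for i, (disk, slot) in enumerate(layout):
    print(f"Stripe{i} -> disk {disk}, slot {slot}")
```

Any run of consecutive stripes lands on different disks, which is why a large sequential access can keep every disk busy at once.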
RAID 1: Disk Mirroring/Shadowing
- Each disk is fully duplicated onto its “mirror”: Very high availability can be achieved
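- Bandwidth sacrifice on write: Logical write = two physical writes (the disk and its mirror are written in parallel, and no parity has to be computed, so writes are faster than at any higher RAID level)
- Reads may be optimized: when reading from RAID 1, a disk and its mirror can work independently and simultaneously, and whichever returns the data first supplies it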
- Most expensive solution: 100% capacity overhead
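RAID 2: Bit-interleaved Hamming-coded Array
- Each data disk stores one bit of every data word, interleaved bit by bit: Disk0 holds bit 0 of all data words, Disk1 holds bit 1, and so on. A Hamming code is computed over the corresponding bits of the data disks, and the check bits are stored at the corresponding positions on several check (ECC) disks
- When data is read from the data disks, the Hamming code is read as well and used to detect and correct errors (the Hamming code can correct 1-bit errors and detect 2-bit errors)
- Several disks are needed to hold the Hamming check information; the number of redundant disks is proportional to the logarithm of the number of data disks ($\log_2 m$, where $m$ is the number of data disks)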
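The logarithmic growth can be made concrete with the standard Hamming bound: $r$ check bits can cover $m$ data bits for single-error correction when $2^r \ge m + r + 1$. The sketch below applies that bound to count check disks. The function name is hypothetical, and the `detect_double` flag (one extra overall-parity disk for double-error detection) is an assumption about how the 2-bit detection would be provided.

```python
def hamming_check_disks(data_disks, detect_double=False):
    """Smallest r with 2**r >= data_disks + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_disks + r + 1:
        r += 1
    # Assumed extension: one extra parity disk also detects double errors.
    return r + 1 if detect_double else r


check_counts = {m: hamming_check_disks(m) for m in (4, 10, 32)}
for m, r in check_counts.items():
    print(f"{m} data disks -> {r} check disks")
```

The overhead ratio shrinks as the array widens, but for small arrays it is close to the cost of mirroring.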
RAID 3: Bit-interleaved Parity Disk
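- When a disk fails, the disk controller itself can tell which disk failed, so a complex Hamming code is unnecessary; simple parity is enough
- Logically, a single high-capacity, high-transfer-rate disk: good for large transfers. It offers single-disk fault tolerance with parallel transfer (it is a fine-grained array, i.e. the stripe width is small (1 byte or 1 bit), so almost every I/O request is served by all disks in the array, which yields a very high data transfer rate)
- $1/N$ capacity cost for parity if $N$ data disks and $1$ parity disk
- Wider arrays reduce capacity costs, but decrease reliability/availability
RAID 3 read/write characteristics
- Assume 4 data disks and one redundant disk
- Reading data takes 5 disk reads in total (the 4 data disks and the redundant disk are read at the same time)
- Writing data takes 3 disk reads and 2 disk writes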
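Parity protection of this kind can be sketched with XOR: the parity disk holds the XOR of the data disks, and the contents of one failed disk are the XOR of everything that survives. This is a minimal sketch with 4 data disks as assumed above; the helper name and the sample bytes are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data disks, two bytes each (made-up contents).
data = [b"\x0f\x10", b"\xf0\x01", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# Suppose disk 2 fails: rebuild it from the other data disks plus parity.
survivors = data[:2] + data[3:] + [parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[2])
```

Reconstruction works only because the controller already knows which disk failed; parity alone cannot locate an error, which is the point made in the first bullet.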
RAID 4: Block-interleaved Parity Disk
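Inspiration for RAID 4
- In RAID 3, every disk access operates on all disks in the array. RAID 4 aims to involve fewer disks per operation, so that the array can carry out several independent disk operations in parallel
- RAID 4 stores data across the disks in a block-interleaved fashion and keeps parity on one dedicated disk (the parity disk). The redundancy cost is the same as RAID 3 (it is a coarse-grained array: interleaving and parity are computed over relatively large stripes, i.e. blocks), but data is accessed differently from RAID 3
- Small read: every block has an error detection field, so each disk can perform reads independently. This allows independent reads to different disks simultaneously (the parity disk is read only when a disk fails, to reconstruct the data)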
- To catch errors on read, rely on error detection field vs. the parity disk
- Large write: the parity must be recomputed on a write, so nearly all disks are accessed
RAID 5: Block-interleaved Distributed Parity
Inspiration for RAID 5
- Small writes (write to one disk): since P has old sum, compare old data to new data, add the difference to P
Small Write Algorithm
- 1 Logical Write = 2 Physical Reads + 2 Physical Writes
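The small-write update can be sketched with XOR parity: the "difference" between old and new data is their XOR, and applying it to the old parity gives the new parity without touching the other data disks. This is a minimal sketch; the helper names and sample bytes are hypothetical.

```python
def xor_bytes(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def small_write_parity(old_data, old_parity, new_data):
    """New parity from the two physical reads (old data, old parity)."""
    return xor_bytes(old_parity, old_data, new_data)


# One stripe of three data blocks (made-up contents) and its parity.
stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_bytes(*stripe)

# Overwrite block 1: 2 reads (old data, old parity), 2 writes (new data, new parity).
new_block = b"\xff\x00"
updated_parity = small_write_parity(stripe[1], parity, new_block)
stripe[1] = new_block
full_recompute = xor_bytes(*stripe)
print(updated_parity == full_recompute)
```

The shortcut gives the same parity as recomputing over the whole stripe, which is why the cost stays at four physical I/Os regardless of array width.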
Problems of Disk Arrays: Small Writes
- Small writes are limited by Parity Disk:
- Write to $D_0$, $D_5$ both also write to P disk (so $D_0$ and $D_5$ still cannot be written at the same time)
RAID 5: High I/O Rate Interleaved Parity
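- To remove the parity-disk bottleneck, the parity information is distributed across all disks in the array and there is no dedicated redundant disk. The parity block for each row of data blocks is staggered, rotating from disk to disk, so that parity is spread evenly over all disks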
- Independent writes possible because of interleaved parity
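The rotation of parity blocks can be sketched as a function of the row number. The exact rotation order varies between implementations; the formula below (parity starts on the last disk and moves one disk to the left each row) is one assumed layout, and the function names are hypothetical.

```python
def raid5_parity_disk(row, num_disks):
    """Disk holding the parity block of a given row (assumed rotation)."""
    return (num_disks - 1 - row) % num_disks


def raid5_row_layout(row, num_disks):
    """'P' marks the parity disk of the row, 'D' marks data disks."""
    parity = raid5_parity_disk(row, num_disks)
    return ["P" if disk == parity else "D" for disk in range(num_disks)]


# Parity placement for the first six rows of a 5-disk array.
parity_disks = [raid5_parity_disk(row, 5) for row in range(6)]
for row in range(6):
    print(row, " ".join(raid5_row_layout(row, 5)))
```

Two small writes to different rows usually update parity on different disks, so they no longer queue behind a single parity disk.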
RAID 6: Two-dimensional Parity Independent-access Disk Array
Inspiration:
- Recovering from 2 failures
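RAID 6 characteristics
- Two-dimensional parity with independent access: on top of RAID 5, a second, independent set of check information is added and stored on another check disk. A write accesses 1 data disk and 2 redundant disks, and the array tolerates two simultaneous disk failures
- Data is stored across the disks in a block-interleaved fashion, and the error detection and correction information is spread evenly over all disks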
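RAID Implementation
- Software: the array management software runs on the host
- Advantage: low cost
- Disadvantage: it consumes too much host time, and bandwidth cannot scale up
- RAID adapter card: the RAID management software is built into the firmware of an I/O controller card, so it takes no host time; typically used in workstations and PCs
- Subsystem: an open platform based on a general-purpose interface bus, usable with a variety of host platforms and network systems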
Storage Environment
Direct Attached Storage (DAS)
Direct connection
- Servers connect directly to the disk array typically via a SCSI interface.
Network Attached Storage (NAS)
A file system on the network
- The server provides services, while a separate dedicated system handles storage
- NAS Devices access the disks in an array via direct connection or through external connectivity
Storage Area Network (SAN)
A disk on the network
- Servers access the disk array through a dedicated network designated as SAN (consists of Fibre Channel switches); the network is built specifically for traffic between the storage media and the servers