Flume: cluster DataNode failures causing HDFS write failures (repost)
Source: http://www.geedoo.info/dfs-client-block-write-replace-datanode-on-failure-enable.html
Over the last few days the Hangzhou cluster has been in the middle of an upgrade transition: the workload is heavy and the cluster has only a few nodes (4 DataNodes), so it kept running into problems. That caused errors in Flume's data collection and, in turn, data loss.
When data goes missing, data collection is the first suspect, so let's start with Flume's error log:
```
Caused by: java.io.IOException: Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT. (Nodes: current=[10.0.2.163:50010, 10.0.2.164:50010], original=[10.0.2.163:50010, 10.0.2.164:50010])
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:817)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:877)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:983)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
```
The error:
Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT
From the log it seems that adding a DataNode failed and that the feature can be turned off via dfs.client.block.write.replace-datanode-on-failure.policy. But I never added any nodes, so the problem is clearly not that simple.
The official configuration documentation describes the relevant parameters as follows:
| Parameter | Default | Description |
| --- | --- | --- |
| dfs.client.block.write.replace-datanode-on-failure.enable | true | If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy |
| dfs.client.block.write.replace-datanode-on-failure.policy | DEFAULT | This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: Let r be the replication number. Let n be the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n; or (2) r is greater than n and the block is hflushed/appended. |
Source: https://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Tracing the code down into DFSClient shows that this happens on the client side while writing blocks through the pipeline, and there are two relevant parameters:
dfs.client.block.write.replace-datanode-on-failure.enable
dfs.client.block.write.replace-datanode-on-failure.policy
The former controls whether the client applies a replacement policy at all when a write fails; the default of true is fine.
The latter selects the specific replacement policy; the default is DEFAULT.
Under DEFAULT, with 3 or more replicas the client will try to replace a failed DataNode with a new one; with only 2 replicas it does not replace the DataNode and simply keeps writing with the remaining ones.
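As a rough worked example (assuming the usual replication factor r = 3, which the post does not state): once the pipeline is down to n = 2 live DataNodes, condition (1) fails because floor(3/2) = 1 < 2, but condition (2) holds because r = 3 > n = 2 and Flume's HDFS sink appends and hflushes, so the client is required to add a replacement DataNode. If no healthy node is available, the write fails with exactly the "Failed to add a datanode" error shown above.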
Since my cluster has only 4 nodes, once the load gets too high and two or more DataNodes stop responding at the same time, HDFS writes start to fail. On a small cluster like this we can simply turn the feature off.
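As a minimal sketch of the fix, assuming the Flume agent (which is the HDFS client here) picks up an hdfs-site.xml from its classpath, either of the following client-side overrides should do; the property names and values come straight from the table above:

```xml
<!-- hdfs-site.xml on the client side (e.g. the Flume agent's classpath) -->
<configuration>
  <!-- Option 1: keep the feature enabled but never try to add a replacement DataNode -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>NEVER</value>
  </property>

  <!-- Option 2: disable the datanode-replacement feature entirely -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>false</value>
  </property>
</configuration>
```

Note that both settings trade safety for availability: the client keeps writing to a shrinking pipeline instead of failing, so they are only reasonable on very small clusters, as the documentation itself suggests.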
References:
A writeup of tracking down a Hadoop DataNode error (記錄一次hadoop的datanode的報錯追查)
Where can I set dfs.client.block.write.replace-datanode-on-failure.enable?
How CDH4 and CDH3 clients handle DataNode failures differently (cdh4 vs cdh3 client處理DataNode異常的不同)
Flume JIRA: https://issues.apache.org/jira/browse/FLUME-2261