Hadoop Metadata Merge: Checkpoint Exception and How to Fix It
Over the past few days I have been watching the logs on the Standby NN. Every time an fsimage merge completed, the following exception appeared while the Standby NN notified the Active NN to download the merged fsimage:
2014-04-23 14:42:54,964 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:268)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:247)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:162)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:174)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1100(StandbyCheckpointer.java:53)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:297)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:210)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:230)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:226)
Out of habit I Googled the error, but after a long search found nothing similar, so I had to work it out myself. Analyzing the logs turned up something even stranger: the time of the last checkpoint never changed — it was always the time of the very first checkpoint after the Standby NN started, as in:
2014-04-23 14:50:54,429 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer:
Triggering checkpoint because it has been 70164 seconds since the last checkpoint,
which exceeds the configured interval 600
Could this be a Hadoop bug? Following the error message into the source code, I found after careful analysis that all of the output above comes from the StandbyCheckpointer class:
private void doWork() {
  // Reset checkpoint time so that we don't always checkpoint
  // on startup.
  lastCheckpointTime = now();
  while (shouldRun) {
    try {
      Thread.sleep(1000 * checkpointConf.getCheckPeriod());
    } catch (InterruptedException ie) {
    }
    if (!shouldRun) {
      break;
    }
    try {
      // We may have lost our ticket since last checkpoint, log in again,
      // just in case
      if (UserGroupInformation.isSecurityEnabled()) {
        UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
      }

      long now = now();
      long uncheckpointed = countUncheckpointedTxns();
      long secsSinceLast = (now - lastCheckpointTime) / 1000;

      boolean needCheckpoint = false;
      if (uncheckpointed >= checkpointConf.getTxnCount()) {
        LOG.info("Triggering checkpoint because there have been " +
            uncheckpointed + " txns since the last checkpoint, which " +
            "exceeds the configured threshold " +
            checkpointConf.getTxnCount());
        needCheckpoint = true;
      } else if (secsSinceLast >= checkpointConf.getPeriod()) {
        LOG.info("Triggering checkpoint because it has been " +
            secsSinceLast + " seconds since the last checkpoint, which " +
            "exceeds the configured interval " + checkpointConf.getPeriod());
        needCheckpoint = true;
      }

      synchronized (cancelLock) {
        if (now < preventCheckpointsUntil) {
          LOG.info("But skipping this checkpoint since we are about" +
              " to failover!");
          canceledCount++;
          continue;
        }
        assert canceler == null;
        canceler = new Canceler();
      }

      if (needCheckpoint) {
        doCheckpoint();
        lastCheckpointTime = now;
      }
    } catch (SaveNamespaceCancelledException ce) {
      LOG.info("Checkpoint was cancelled: " + ce.getMessage());
      canceledCount++;
    } catch (InterruptedException ie) {
      // Probably requested shutdown.
      continue;
    } catch (Throwable t) {
      LOG.error("Exception in doCheckpoint", t);
    } finally {
      synchronized (cancelLock) {
        canceler = null;
      }
    }
  }
}
The exception is thrown while doCheckpoint() is executing, so the statement lastCheckpointTime = now; is never reached — which explains the frozen "last checkpoint" time. But why does doCheckpoint() fail in the first place? Tracing the stack above, it comes down to this statement in TransferFsImage.doGetUrl:
if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
The connection times out before it can read a response code from the other side. My first thought was a misconfigured socket timeout somewhere in the cluster, but further analysis ruled that out. Going back to the code, I noticed the following setup just before the statement above:
if (timeout <= 0) {
  Configuration conf = new HdfsConfiguration();
  timeout = conf.getInt(DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_KEY,
      DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_DEFAULT);
}

if (timeout > 0) {
  connection.setConnectTimeout(timeout);
  connection.setReadTimeout(timeout);
}

if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
  throw new HttpGetFailedException(
      "Image transfer servlet at " + url +
      " failed with status code " + connection.getResponseCode() +
      "\nResponse message:\n" + connection.getResponseMessage(),
      connection);
}
DFS_IMAGE_TRANSFER_TIMEOUT_KEY corresponds to the dfs.image.transfer.timeout property, whose default is 10 * 60 * 1000 milliseconds (10 minutes). The property's documentation reads:
Timeout for image transfer in milliseconds. This timeout and the related dfs.image.transfer.bandwidthPerSec parameter should be configured such that normal image transfer can complete within the timeout. This timeout prevents client hangs when the sender fails during image transfer, which is particularly important during checkpointing. Note that this timeout applies to the entirety of image transfer, and is not a socket timeout.
That is where the problem lies: this parameter is tightly coupled with dfs.image.transfer.bandwidthPerSec. The Active NN must be able to finish downloading the merged fsimage from the Standby NN within dfs.image.transfer.timeout, otherwise the exception above is thrown. Then I checked my own configuration:
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>60000</value>
</property>

<property>
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>1048576</value>
</property>
A 60-second timeout with a copy rate of 1 MB per second, while the metadata on my cluster is over 800 MB — clearly the copy cannot finish within 60 seconds. After I raised dfs.image.transfer.timeout, I watched the cluster for a while: the exception never appeared again, and some earlier errors that had the same root cause went away as well.
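The mismatch is plain arithmetic. Here is a back-of-the-envelope sketch; the class and method names are made up for illustration, and only the three configuration values come from this article:

```java
// Back-of-the-envelope check: can the fsimage transfer finish within
// dfs.image.transfer.timeout at the throttled bandwidth?
public class TransferTimeoutCheck {

    /** Seconds needed to move imageBytes at bandwidthPerSec bytes per second (rounded up). */
    static long transferSeconds(long imageBytes, long bandwidthPerSec) {
        return (imageBytes + bandwidthPerSec - 1) / bandwidthPerSec;
    }

    public static void main(String[] args) {
        long imageBytes = 800L * 1024 * 1024; // ~800 MB fsimage, as on this cluster
        long bandwidth  = 1048576;            // dfs.image.transfer.bandwidthPerSec: 1 MB/s
        long timeoutSec = 60000 / 1000;       // dfs.image.transfer.timeout: 60 s

        long neededSec = transferSeconds(imageBytes, bandwidth);
        System.out.println("transfer needs ~" + neededSec
                + " s, but the timeout allows only " + timeoutSec + " s");
        // With these settings the transfer needs ~800 s against a 60 s budget.
    }
}
```

So for this image size dfs.image.transfer.timeout must comfortably exceed 800 000 ms (or dfs.image.transfer.bandwidthPerSec must be raised accordingly); the article does not record the exact value the author finally settled on.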