Fixing OCFS2 failing to auto-mount with the error o2net_connect_expired

Published: 2025/5/22


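RAC requires OCFS2 to be started first at boot. After I modified /etc/sysconfig/o2cb, only one of the two machines could mount the OCFS2 partition automatically; the other could not. Once boot had completed, however, mounting it manually worked fine.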

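I. Details
The two machines are dbsrv-1 and dbsrv-2. They use a crossover cable for the network heartbeat, and cluster.conf uses the private heartbeat IPs rather than the public IP addresses.
1. Check the o2cb status
After boot, the o2cb service has started normally and the ocfs2 module has loaded normally, but the heartbeat is Not Active:

Checking heartbeat: Not Active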
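When checking several nodes, it can help to pull just the heartbeat state out of the status output. The following is a minimal sketch: `heartbeat_state` is a hypothetical helper name, and it assumes the status line has the form shown above ("Checking ... heartbeat: <state>"), which may differ between OCFS2 tool versions.

```shell
#!/bin/sh
# heartbeat_state (hypothetical helper): read o2cb status output on stdin
# and print only the text after "heartbeat:" on the heartbeat line.
# Assumption: the line looks like "Checking heartbeat: Not Active".
heartbeat_state() {
    sed -n 's/^Checking.*heartbeat: *//p'
}

# Intended use on a cluster node:
#   /etc/init.d/o2cb status | heartbeat_state
```

Lines that do not mention the heartbeat are ignored, so the helper prints nothing if the status output has a different layout.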

2. Check /etc/fstab

# cat /etc/fstab | grep ocfs2
/dev/sdc1    /oradata    ocfs2    _netdev,datavolume,nointr 0 0

The configuration is correct.
3. Check /etc/ocfs2/cluster.conf on both machines

# more /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 172.20.3.2
        number = 0
        name = dbsrv-2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 172.20.3.1
        number = 1
        name = dbsrv-1
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

I confirmed that the file is identical on both machines.
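4. Check the system log
The errors are as follows (status -107 is ENOTCONN on Linux, which matches the later mount.ocfs2 message "Transport endpoint is not connected"):

Jul 20 19:33:18 dbsrv-2 kernel: OCFS2 1.2.3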
Jul 20 19:33:24 dbsrv-2 kernel: (4452,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_request_join:786 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_try_to_join_domain:934 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_join_domain:1186 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_register_domain:1379 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_dlm_init:2009 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_mount_volume:1062 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: ocfs2: Unmounting device (8,33) on (node 0)
Jul 20 19:33:26 dbsrv-2 mount: mount.ocfs2: Transport endpoint is not connected
Jul 20 19:33:26 dbsrv-2 mount:
Jul 20 19:33:26 dbsrv-2 netfs: Mounting other filesystems:  failed

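II. Analysis
1. Node startup order
A Google search turned up the following:

Mount triggers the heartbeat thread which triggers the o2net
to make a connection to all heartbeating nodes. If this connection
fails, the mount fails. (The larger node number initiates the connection
to the lower node number.)

This suggests that when o2cb starts, connections are made in node-number order.
In cluster.conf, node 0 is dbsrv-2 and node 1 is dbsrv-1. So when dbsrv-1 boots it can reach its local IP right away and then mounts the OCFS2 partition, whereas when dbsrv-2 boots it cannot find the peer's IP address immediately, so the mount fails.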
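To see at a glance which host holds which node number (and therefore, per the quoted note, which side initiates the connection), the node stanzas of cluster.conf can be listed in number order. This is only a sketch: `list_nodes` is a hypothetical helper, and it assumes the simple `key = value` stanza layout shown in the cluster.conf listing earlier.

```shell
#!/bin/sh
# list_nodes (hypothetical helper): read cluster.conf-style text on stdin and
# print "number name ip_address" for each node stanza, sorted by node number.
list_nodes() {
    awk -F'=' '
        # Print the stanza collected so far, but only if it was a node stanza.
        function emit() { if (in_node) print num, name, ip }
        # A line such as "node:" or "cluster:" starts a new stanza.
        /^[a-z]+:/ {
            emit()
            in_node = ($0 ~ /^node:/)
            num = name = ip = ""
            next
        }
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == "number") num = val
            else if (key == "name") name = val
            else if (key == "ip_address") ip = val
        }
        END { emit() }
    ' | sort -n
}

# Intended use on a cluster node:
#   list_nodes < /etc/ocfs2/cluster.conf
```

With the configuration from this article, the higher-numbered entry (dbsrv-1, node 1) would be the side expected to open the connection to dbsrv-2 (node 0).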
2. Trying a different HEARTBEAT_THRESHOLD
Another Google search turned up this message:

After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. After encountering this myself and having confirmed with a couple of other people in the list that it has caused problems, it seems that the default threshold of 7 is possibly too short, even in reasonably fast server-storage solutions such as an HP DL380 Packaged Cluster.

Does the OCFS2 development team also consider this to be too short, or is altering the paramater just a workaround that shouldn't be used? If this is the case then how should we approach the problem of self-fencing nodes?

Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised?

Finally, if the altering the threshold is a valid solution, could it please be added to the FAQs and the user guide so that people know to adjust it as a first step on encountering the problem, rather than having to post to the list and wait for replies.

Following this and other material found online, I changed HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb to 301. After a reboot, the log showed:

Jul 23 13:59:50 dbsrv-2 kernel: (4477,0):o2hb_check_slot:883 ERROR: Node 1 on device sdc1 has a dead count of 14000 ms, but our count is 602000 ms.
Jul 23 13:59:50 dbsrv-2 kernel: Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
Jul 23 13:59:54 dbsrv-2 kernel: OCFS2 1.2.3
Jul 23 14:00:00 dbsrv-2 kernel: (4449,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 23 14:00:00 dbsrv-2 kernel: (4475,2):dlm_request_join:786 ERROR: status = -107

The problem persisted.
※ Note

[Fencing time (seconds)] = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
(301 - 1) * 2 = 600 seconds
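The formula in the note can be checked with a one-line shell function. This is a sketch only: `fence_seconds` is a hypothetical name, and it applies the formula exactly as quoted above. As a side observation, the dead counts in the log (14000 ms and 602000 ms) equal 7 and 301 multiplied by 2000 ms, which suggests the two nodes may have been running with different thresholds at that point.

```shell
#!/bin/sh
# fence_seconds (hypothetical helper): fencing time in seconds for a given
# O2CB_HEARTBEAT_THRESHOLD, using the formula from the note:
#   (threshold - 1) * 2
fence_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

fence_seconds 7     # default threshold mentioned in the mailing-list post
fence_seconds 301   # value tried in this article
```

The two calls print 12 and 600 respectively.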

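From the above, it is clear that all the configuration is correct.
The cause of the failure is:
When the o2cb service starts at boot, for some reason the IP address that o2cb depends on cannot be reached in time; the wait exceeds the time limit and the mount fails. Once the machine has fully booted, the network is working, so mounting the OCFS2 partition manually succeeds.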

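III. Solving the problem
1. Information from Oracle Metalink

The problem here is that network layer not becoming fully functional even after /etc/init.d/network script is done executing. The proposed patch is a work around and is not fixing a problem in o2cb script.

2. Solutions

a) Make sure all configuration files are correct and identical on both nodes;
b) Make sure the clocks on the two servers do not differ too much (time synchronization can be used);
c) In the cluster.conf file used by o2cb, use the heartbeat IPs rather than the public IPs;
d) Edit the /etc/init.d/o2cb script and add a sleep delay at the very beginning, to wait for the network to come up;
e) If it still does not work, put the startup commands in /etc/rc.local:
mount -t ocfs2 -o datavolume,nointr /dev/sdc1 /oradata
/etc/init.d/init.crs start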
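As a variation on options d) and e), the fixed sleep can be replaced by a bounded wait that proceeds as soon as the peer's heartbeat IP answers. The following is a rough sketch, not the patch mentioned by Metalink: `wait_until` is a hypothetical helper, and the 60-second limit and the use of `ping` as the readiness check are assumptions.

```shell
#!/bin/sh
# wait_until (hypothetical helper): retry a command once per second until it
# succeeds, or give up after the given number of seconds.
# Usage: wait_until <timeout_seconds> <command> [args...]
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # timed out, the command never succeeded
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible use in /etc/rc.local on dbsrv-2 (peer heartbeat IP from cluster.conf;
# the 60-second limit is an arbitrary assumption):
#   if wait_until 60 ping -c 1 -W 1 172.20.3.1; then
#       mount -t ocfs2 -o datavolume,nointr /dev/sdc1 /oradata
#       /etc/init.d/init.crs start
#   fi
```

Compared with a fixed sleep, this avoids waiting longer than necessary on a fast boot and still gives up cleanly if the peer never becomes reachable.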

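IV. Known possible causes
1. Disk
For example, when iSCSI, Firewire, or similar is used for the disk enclosure, long read times can trigger a timeout and cause the problem.
2. Network
If o2cb uses the public IPs, the switch may not pass traffic promptly after the NIC driver loads (especially with Cisco switches), so IP communication fails.
If o2cb uses the heartbeat IPs, some NICs are not activated immediately after the driver loads and do not link up with the peer NIC in time, which causes the failure.
Overall, the causes are mostly hardware-related.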

Reposted from: https://www.cnblogs.com/tianlesoftware/archive/2009/11/13/3610355.html
