【Oracle RAC Fault Analysis and Handling】
Original article: "Oracle RAC Fault Analysis and Handling". Author: 蟻巡運維平臺
I. RAC Environment
RAC architecture with two nodes
Node 1
SQL> show parameter instance
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
active_instance_count                integer
cluster_database_instances           integer     2
instance_groups                      string
instance_name                        string      RACDB1
instance_number                      integer     1
instance_type                        string      RDBMS
open_links_per_instance              integer     4
parallel_instance_group              string
parallel_server_instances            integer     2
Node 2
SQL> show parameter instance
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
active_instance_count                integer
cluster_database_instances           integer     2
instance_groups                      string
instance_name                        string      RACDB2
instance_number                      integer     2
instance_type                        string      RDBMS
open_links_per_instance              integer     4
parallel_instance_group              string
parallel_server_instances            integer     2
Database version
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production
Operating system information
Node 1
[oracle@rac1 ~]$ uname -a
Linux rac1 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux
Node 2
[oracle@rac2 ~]$ uname -a
Linux rac2 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux
All RAC resource information
[oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac2
ora.....TAF.cs application    ONLINE    ONLINE    rac2
ora.RACDB.db   application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
II. Simulating a broken interconnect between the two nodes: what happens to RAC, and the full fault-localization process
In this section we simulate a failure of the RAC private (interconnect) network, locate the root cause, and then clear the fault.
1. First, confirm that the RAC is in a completely healthy state
[oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac2
ora.....TAF.cs application    ONLINE    ONLINE    rac2
ora.RACDB.db   application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
Check the CRS daemon status (CRS, CSS, EVM)
[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
Check the OCR disk status; no problems found
[oracle@rac2 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     104344
         Used space (kbytes)      :       4344
         Available space (kbytes) :     100000
         ID                       : 1752469369
         Device/File Name         : /dev/raw/raw1
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded
Check the voting disk status
[oracle@rac2 ~]$ crsctl query css votedisk
0.     0    /dev/raw/raw2               raw device 2 is the voting disk
located 1 votedisk(s).                  only one voting disk is located
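As a side note, the votedisk count can be extracted mechanically. The sketch below is my own illustration, not part of the original session: it counts configured voting disks from saved `crsctl query css votedisk` output with awk; on a live cluster you would pipe the command itself into the filter.

```shell
# Count voting disks from a saved `crsctl query css votedisk` listing.
# The sample text mirrors the output shown above.
votedisk_output='0.     0    /dev/raw/raw2
located 1 votedisk(s).'

# Each configured disk is printed as "<n>. <n> <device>", so count
# the lines that start with a number followed by a dot.
count=$(printf '%s\n' "$votedisk_output" | awk '/^[0-9]+\./ {n++} END {print n+0}')
echo "voting disks configured: $count"
```

With a single voting disk and no external mirroring, losing that device stops CSS, which is exactly why a mirror is added in section IV.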
2. Manually disable one private network interface
[oracle@rac2 ~]$ cat /etc/hosts
127.0.0.1       localhost.localdomain   localhost
::1     localhost6.localdomain6 localhost6
##Public Network - (eth0)
##Private Interconnect - (eth1)
##Public Virtual IP (VIP) addresses - (eth0)
192.168.1.101   rac1                    the RAC public addresses
192.168.1.102   rac2
192.168.2.101   rac1-priv               the RAC private (interconnect) addresses
192.168.2.102   rac2-priv
192.168.1.201   rac1-vip                the RAC virtual (VIP) addresses
192.168.1.202   rac2-vip
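Before breaking anything, it helps to have a quick reachability probe for the interconnect. This is a sketch of my own (the rac1-priv/rac2-priv names come from the /etc/hosts file above); it is demonstrated against the loopback address so it runs anywhere.

```shell
# One-shot reachability probe: a single ICMP echo with a 1-second
# timeout; exit status 0 means the peer answered.
check_link() {
    ping -c 1 -W 1 "$1" > /dev/null 2>&1
}

# On a live node you would probe the peer's private address, e.g.:
#   check_link rac2-priv || echo "interconnect to rac2 is DOWN"
check_link 127.0.0.1 && echo "loopback reachable"
```

Running this in a loop from each node gives early warning of interconnect trouble before CSS has to act on it.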
Now check which interface each IP address belongs to:
[oracle@rac2 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.102  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f187/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:360 errors:0 dropped:0 overruns:0 frame:0
          TX packets:593 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:46046 (44.9 KiB)  TX bytes:62812 (61.3 KiB)
          Interrupt:185 Base address:0x14a4
eth0:1    Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.202  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:185 Base address:0x14a4
eth1      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:91
          inet addr:192.168.2.102  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f191/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:76588 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58002 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:65185420 (62.1 MiB)  TX bytes:37988820 (36.2 MiB)
          Interrupt:193 Base address:0x1824
eth2      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:9B
          inet addr:192.168.203.129  Bcast:192.168.203.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f19b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:339 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:42206 (41.2 KiB)  TX bytes:10199 (9.9 KiB)
          Interrupt:169 Base address:0x18a4
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:99403 errors:0 dropped:0 overruns:0 frame:0
          TX packets:99403 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:18134658 (17.2 MiB)  TX bytes:18134658 (17.2 MiB)
eth0   is the RAC public interface
eth1   is the RAC private (interconnect) interface
eth0:1 is the RAC virtual (VIP) interface
To simulate a broken interconnect, we simply take down the eth1 private interface:
ifdown eth1                             disable the interface
ifup   eth1                             re-enable the interface
[oracle@rac2 ~]$ su - root              this must be done as root, otherwise you get "Users cannot control this device."
Password:
[root@rac2 ~]# ifdown eth1
I issued this command at 17:18:51; four minutes later node 2 rebooted. Do you know what just happened?
Right: this is the famous RAC split-brain scenario. When the interconnect between the nodes goes down, they can no longer share state, so a split-brain occurs. To protect data consistency, RAC must evict part of the cluster, and the evicted node is forcibly rebooted; that is exactly why node 2 restarted on its own. So why was node 2 rebooted rather than node 1?
The eviction rules are: (1) the sub-cluster with fewer nodes is evicted;
                        (2) the node with the higher node number is evicted;
                        (3) the node with the higher load is evicted.
In our case rule (2) applied. Once node 2 came back up, we logged in with our username and password.
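The three eviction rules above can be modelled as a tiny decision function. This is purely illustrative (the real arbitration happens inside CSS via the voting disk, and rule 3 involves load measurements not modelled here); it covers rules 1 and 2 for a two-way split.

```shell
# Illustrative model of CSS eviction rules 1 and 2 (not Oracle code).
# Args: size1 node1 size2 node2  (each sub-cluster's node count and
# its lowest node number after the split).
evict() {
    size1=$1; node1=$2; size2=$3; node2=$4
    if [ "$size1" -lt "$size2" ]; then
        echo "evict sub-cluster of node $node1"    # rule 1: fewer nodes
    elif [ "$size2" -lt "$size1" ]; then
        echo "evict sub-cluster of node $node2"
    elif [ "$node1" -gt "$node2" ]; then
        echo "evict sub-cluster of node $node1"    # rule 2: higher node number
    else
        echo "evict sub-cluster of node $node2"
    fi
}

evict 1 1 1 2    # equal halves, so the higher node number (node 2) goes
```

In our two-node case both sub-clusters have one node each, so the tie falls through to rule 2 and node 2 is the one rebooted, matching what we observed.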
3. Locating the root cause
(1) Check the operating-system log
[oracle@rac2 ~]$ su - root
Password:
[root@rac2 ~]# tail -30f /var/log/messages
I reran the whole scenario; the log is very verbose, so below are just the network-related warnings:
Jul 17 20:05:25 rac2 avahi-daemon[3659]: Withdrawing address record for 192.168.2.102 on eth1.
The IP address on eth1 is withdrawn; this is what led node 1 to evict node 2 and node 2 to reboot
Jul 17 20:05:25 rac2 avahi-daemon[3659]: Leaving mDNS multicast group on interface eth1.IPv4 with address 192.168.2.102.
eth1 leaves its mDNS multicast group
Jul 17 20:05:25 rac2 avahi-daemon[3659]: iface.c: interface_mdns_mcast_join() called but no local address available.
Jul 17 20:05:25 rac2 avahi-daemon[3659]: Interface eth1.IPv4 no longer relevant for mDNS.
eth1 is no longer relevant for mDNS
Jul 17 20:09:54 rac2 logger: Oracle Cluster Ready Services starting up automatically.
Oracle Clusterware starts up automatically (after the reboot)
Jul 17 20:09:59 rac2 avahi-daemon[3664]: Registering new address record for fe80::20c:29ff:fe8f:f191 on eth1.
Jul 17 20:09:59 rac2 avahi-daemon[3664]: Registering new address record for 192.168.2.102 on eth1.
New IP addresses are registered on eth1
Jul 17 20:10:17 rac2 logger: Cluster Ready Services completed waiting on dependencies.
CRS has finished waiting on its dependencies
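Sifting /var/log/messages by eye is tedious. A filter such as the following helps; the sample below inlines a few of the log lines above so the sketch is self-contained, while on the real host you would simply run `grep -c 'eth1' /var/log/messages`.

```shell
# Count interface-related events in a syslog extract.
log='Jul 17 20:05:25 rac2 avahi-daemon[3659]: Withdrawing address record for 192.168.2.102 on eth1.
Jul 17 20:09:54 rac2 logger: Oracle Cluster Ready Services starting up automatically.
Jul 17 20:09:59 rac2 avahi-daemon[3664]: Registering new address record for 192.168.2.102 on eth1.'

printf '%s\n' "$log" | grep -c 'eth1'    # prints 2: the withdraw and the re-register
```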
From the above we can already tell that an eth1 problem caused node 2 to reboot. To pin the cause down further, check the CRS daemon log:
[root@rac2 crsd]# tail -100f $ORA_CRS_HOME/log/rac2/crsd/crsd.log
Abnormal termination by CSS, ret = 8
CSS terminated abnormally
2013-07-17 20:11:18.115: [ default][1244944]0CRS Daemon Starting
2013-07-17 20:11:18.116: [ CRSMAIN][1244944]0Checking the OCR device
2013-07-17 20:11:18.303: [ CRSMAIN][1244944]0Connecting to the CSS Daemon
The CRS and CSS daemons are restarting
[root@rac2 cssd]# pwd
/u01/crs1020/log/rac2/cssd
[root@rac2 cssd]# more ocssd.log        view the cssd daemon log
[CSSD]2013-07-17 17:26:18.319 [86104976] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac2_crs))
This shows the cssd listener endpoint on node rac2, where the problem surfaced
[CSSD]2013-07-17 17:26:19.296 [75615120] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[rac1] seq[13] sync[12]
The nodes acknowledging a sync request; note the synchronization between the two nodes
All of this points to an interconnect communication problem: the two nodes could not synchronize and share state, which triggered the split-brain.
4. After the reboot, node 2 returns to a normal state on its own
[root@rac2 cssd]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.102  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f187/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:567 errors:0 dropped:0 overruns:0 frame:0
          TX packets:901 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:65402 (63.8 KiB)  TX bytes:96107 (93.8 KiB)
          Interrupt:185 Base address:0x14a4
eth0:1    Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.202  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:185 Base address:0x14a4
eth1      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:91
          inet addr:192.168.2.102  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f191/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:76659 errors:0 dropped:0 overruns:0 frame:0
          TX packets:51882 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:61625763 (58.7 MiB)  TX bytes:26779167 (25.5 MiB)
          Interrupt:193 Base address:0x1824
eth2      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:9B
          inet addr:192.168.203.129  Bcast:192.168.203.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f19b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:409 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:45226 (44.1 KiB)  TX bytes:9567 (9.3 KiB)
          Interrupt:169 Base address:0x18a4
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:49025 errors:0 dropped:0 overruns:0 frame:0
          TX packets:49025 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11292111 (10.7 MiB)  TX bytes:11292111 (10.7 MiB)
Looking at the interfaces again: the private eth1 address that had been withdrawn is back, because node 2 just rebooted. The reboot re-initializes all interfaces, so the eth1 interface we disabled was brought back up and regained its IP.
Check the CRS daemons; all are healthy:
[root@rac2 cssd]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
Check the cluster, instance, database, listener and ASM resources: all intact and fully started
[root@rac2 cssd]# crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac1
ora.....TAF.cs application    ONLINE    ONLINE    rac1
ora.RACDB.db   application    ONLINE    ONLINE    rac1
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
That completes this RAC fault analysis and resolution.
III. Simulating an unusable OCR disk: what happens to RAC, and the full fault-localization process
The OCR disk registers every RAC resource: cluster, database, instances, listeners, services, ASM, storage, network, and so on. Only resources registered in the OCR can be managed by the CRS stack; the CRS daemon manages exactly what the OCR records. In day-to-day operations the OCR contents can be lost, for example while adding or removing nodes, or while adding or removing OCR disks. Below we simulate losing the OCR contents, then locate and fix the fault.
Experiment
1. Check the OCR disk and the CRS daemons
(1) Check the OCR disk; CRS can only manage the cluster smoothly if the OCR is intact
[root@rac2 cssd]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     104344
         Used space (kbytes)      :       4344
         Available space (kbytes) :     100000
         ID                       : 1752469369
         Device/File Name         : /dev/raw/raw1            the raw device backing the OCR
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded          integrity check passed, no problems
(2) Check the CRS status
[root@rac2 cssd]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
All cluster daemons are healthy
(3) Stop the CRS daemons
[root@rac2 sysconfig]# crsctl stop crs
Stopping resources.                         stopping resources
Successfully stopped CRS resources          CRS resources stopped
Stopping CSSD.                              stopping the CSSD daemon
Shutting down CSS daemon.
Shutdown request successfully issued.       shutdown request issued successfully
[root@rac2 sysconfig]# crsctl check crs
Failure 1 contacting CSS daemon             cannot contact the CSS daemon
Cannot communicate with CRS                 cannot communicate with CRS
Cannot communicate with EVM                 cannot communicate with EVM
2. As root, export the OCR contents to create a backup
[root@rac2 sysconfig]# ocrconfig -export /home/oracle/ocr.exp
[oracle@rac2 ~]$ pwd
/home/oracle
[oracle@rac2 ~]$ ll
total 108
-rw-r--r-- 1 root   root     98074 Jul 18 11:20 ocr.exp         the OCR export file has been created
3. Restart the CRS daemons
[root@rac2 sysconfig]# crsctl start crs
Attempting to start CRS stack               trying to start the CRS stack
The CRS stack will be started shortly       the CRS stack is about to start
Check the CRS status
[root@rac2 sysconfig]# crsctl check crs     good: everything is healthy again after the restart
CSS appears healthy
CRS appears healthy
EVM appears healthy
4. Overwrite the OCR device with zeros using dd, simulating loss of its contents
[root@rac2 sysconfig]# dd if=/dev/zero of=/dev/raw/raw1 bs=1024 count=102400
102400+0 records in                         102400 records read
102400+0 records out                        102400 records written
104857600 bytes (105 MB) copied, 76.7348 seconds, 1.4 MB/s
What the command does:
dd                  copies data block by block, optionally converting it on the fly
if=/dev/zero        source file: the zero device
of=/dev/raw/raw1    destination file: the OCR device
bs=1024             block size of 1024 bytes (1 KB)
count=102400        number of blocks to copy (102400 blocks)
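If you want to see the dd pattern in action without touching a raw device, try it against a scratch file first (never against /dev/raw/raw1 outside a lab):

```shell
# Zero-fill a scratch file the same way the OCR device was wiped above,
# but against a temp file so nothing real is harmed.
scratch=$(mktemp)
dd if=/dev/zero of="$scratch" bs=1024 count=100 2>/dev/null

# 100 blocks x 1024 bytes = 102400 bytes
wc -c < "$scratch"
rm -f "$scratch"
```

The arithmetic is the same as in the real command: bs multiplied by count gives the number of bytes written, so the 1024 x 102400 run above wrote the full 100 MB OCR device with zeros.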
5. Check the OCR disk again
[root@rac2 sysconfig]# ocrcheck
PROT-601: Failed to initialize ocrcheck     ocrcheck can no longer initialize the OCR
Check the CRS status
[root@rac2 sysconfig]# crsctl check crs
Failure 1 contacting CSS daemon             cannot contact the CSS daemon
Cannot communicate with CRS                 cannot communicate with CRS
EVM appears healthy
The CRS failure is no surprise: with the registered resource information gone, there is nothing left for it to manage.
6. Restore the OCR contents with an import
[root@rac2 crs1020]# ocrconfig -import /home/oracle/ocr.exp
7. Finally, check the OCR disk again
Thankfully, everything was restored cleanly:
[root@rac2 crs1020]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     104344
         Used space (kbytes)      :       4348
         Available space (kbytes) :      99996
         ID                       :  425383787
         Device/File Name         : /dev/raw/raw1
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded
8. Check the CRS daemons
[root@rac2 crs1020]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
Very good: once the OCR was restored, the CRS daemons restarted automatically
[root@rac2 crs1020]# crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    OFFLINE
ora....DB1.srv application    ONLINE    ONLINE    rac1
ora.....TAF.cs application    ONLINE    ONLINE    rac1
ora.RACDB.db   application    ONLINE    ONLINE    rac1
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    OFFLINE
ora....C2.lsnr application    ONLINE    OFFLINE
ora.rac2.gsd   application    ONLINE    OFFLINE
ora.rac2.ons   application    ONLINE    OFFLINE
ora.rac2.vip   application    ONLINE    ONLINE    rac2
Several rac2 resources were still OFFLINE, so I restarted the CRS stack once more:
[root@rac2 init.d]# ./init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
Shutdown has begun. The daemons should exit soon.
[root@rac2 init.d]# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
[root@rac2 init.d]# ./init.crs start
Startup will be queued to init within 90 seconds.
Now everything has recovered:
[oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac2
ora.....TAF.cs application    ONLINE    ONLINE    rac2
ora.RACDB.db   application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
IV. Simulating an unusable voting disk: what happens to RAC, and the full fault-localization process
Voting disk: when a split-brain occurs at the cluster layer, the voting disk decides which node is evicted.
Control file: for a split-brain at the instance layer, the control file decides which instance is evicted.
Voting-disk redundancy options:
(1) external redundancy: the voting disk is protected by a mechanism outside Oracle, such as storage-level mirroring;
(2) Oracle's own internal redundancy: protection by adding mirrored copies of the voting disk.
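A detail worth remembering with option (2): CSS requires a node to see a strict majority of the voting disks, which is why Oracle recommends an odd count (1, 3, 5). A small sketch of the arithmetic, my own illustration:

```shell
# How many voting disks may fail while a node still sees a majority?
# majority = floor(total/2) + 1, tolerated failures = total - majority.
tolerated_failures() {
    total=$1
    majority=$(( total / 2 + 1 ))
    echo $(( total - majority ))
}

tolerated_failures 1    # prints 0: a single votedisk has no redundancy
tolerated_failures 3    # prints 1
```

So three voting disks survive the loss of any one of them, while a single unmirrored disk (our starting configuration) survives none.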
Experiment
1. Check the voting disk status
[oracle@rac1 ~]$ crsctl query css votedisk
0.     0    /dev/raw/raw2               raw device 2 is the voting disk
located 1 votedisk(s).                  only one voting disk is located
2. Stop the CRS stack
[root@rac1 sysconfig]# crsctl stop crs
Stopping resources.                         stopping resources
Successfully stopped CRS resources          CRS resources stopped
Stopping CSSD.                              stopping the CSSD daemon
Shutting down CSS daemon.
Shutdown request successfully issued.       shutdown request issued successfully
3. Add a voting disk to provide internal redundancy
crsctl add css votedisk /dev/raw/raw3 -force        add raw device 3 to the voting-disk group
Once added, Oracle copies the contents of the existing voting disk to the new one.
4. Check the voting disks again
crsctl query css votedisk
5. Start the CRS stack
[root@rac2 sysconfig]# crsctl start crs
Attempting to start CRS stack               trying to start the CRS stack
The CRS stack will be started shortly       the CRS stack is about to start
Summary: if voting disk /dev/raw/raw2 is damaged, its mirror /dev/raw/raw3 can take over, so the RAC can keep serving clients.
Source: the Internet
Reprinted from: https://blog.51cto.com/linuxzkq/1583890