‘壹’ 遇到大规模oracle坏块该怎么处理
最近一两个月,一直有场景化运维、场景化大数据分析的声音围绕在耳畔,以Gdevops全球敏捷运维峰会杭州站上新炬网络执行副总裁程永新的“ 一切没有场景驱动的运维平台建设都是假大空! ”最为振聋发聩。我们一直在谈技术,谈原理,谈内核,总以为“懂了”这些的人,就势必能广阔天地大有所为。
技术固然重要,但偏离了业务/应用场景的技术,无法呈现业务价值的技术就非常不重要。
技术也应该是场景驱动的,对于运维技术人员来说,离开场景学习的所谓高深技术,只是浪费时间。所以新同事进入一个新团队后,能使技术更好发挥作用的环境、流程的考核会占据了另外的三分之二。
今天来谈谈Oracle坏块问题。 坏块问题,相信做过两三年Oracle维护、支持的DBA都会遇到过,即使从来没遇到过的,看过Oracle 官方文档的甚至是会度娘或者谷哥的,应该也知道基本的处理手段。
这是我内部分享的一个简单思维导图,如果有遗漏,欢迎在后面评论补充。
面试的时候通常是这么问的:你了解Oracle坏块么?坏块为什么会产生?描述一下你处理过的坏块案例细节?如果你负责的好几个数据库都突然发生了坏块,你会怎么做?
通常,前面几个问题的回答都不会太差。但是最后一个问题的回答,鲜有能出众的。原因在于面太窄,思维太窄。如果这个时候你还要一个个库的去看alert日志,那么显然就走错了方向。
现实情况就是,我们的数据库可能是五六十套,或者上百套,你一套套库的去看这些日志里的报错的块,再根据块找到对象,确定是表还是索引,再去用坏块的修复手段去修复…….那么,所有人都会被你害了。
我们经历过,花了两天的时间,都没修复完一个库中几万个坏块的情况,在其他大牛还在哼哧哼哧做恢复的时候,我向领导提议启用了新的方案,在大老板没有完全失去耐性的情况下恢复了业务。
真正正确的做法是,如果确定坏块数量为数众多,赶紧停业务,切灾备,后面再补数据。 灾备是干什么吃的,养库千日,用库一时,就在这个时候了!
非常可惜的是,大多数来面试的DBA会非常纠结于用block recover,还是用dbms_repair,还是用BBED,还是……
那么,什么时候往上申报,要切灾备?
且不谈许多公司的灾备形同虚设,关键时候不敢切的事情。就算这些灾备都是实实在在可用的,恐怕也不是说切立即就能切的,切灾备涉及到应用、网络、主机、存储等多方面的调整。
那么多久应该切呢? 一般的企业从故障处理开始,预估2小时之内不能让业务恢复正常运行的,应该申报切灾备。当然,如果是金融行业,特别是证券基金行业,1分钟之内故障还没恢复,就要知会证监会,半个小时没有恢复就会受到同行业通报,所以要切就应该在这个时间之内申报。
问题又回到了原点。你得先有规划,作为企业的重要系统,你得先建设灾备环境,而且是有效的灾备,并且应该事先有一个灾备切换预案。
作为DBA来说,动作敏捷的检查数据库的情况,并及时汇报非常重要。其实这里,又想说自动运维平台了。通过简单的按钮点点,就能快速知道告警日志里的坏块涉及什么对象,是不是就好很多呢?
继续往回看,面试的问题是什么?是多个数据库同时发现大量坏块。
作为一个经验丰富的运维管理人员,第一反应应该是,为什么会同时发生呢?显然是由于外因导致的。因此做好容灾切换,业务恢复使用的第一时间,应该去看看这些数据库共同的基础是不是同样的存储、同样的存储管理软件、卷管理软件。
依我的经验,大部分多个库同时出现坏块,都坏在存储管理软件身上。
有一次是Storage Foundation做卷复制的时候出现了软件bug,IBM、Oracle、symentec等众多厂家在“问题作战室”里整整呆了一个月,各家公司二线、三线出具各种证据自己没问题,最后最后才找到蛛丝马迹,揪出来。
还有一次稍微容易些,存储软件状态恢复后坏块没有恢复,一个个系统通过fsck命令来进行的修复。你说,这是什么类型的坏块呢?
作为有经验的DBA,要解决问题,但不要急着去敲命令,站在更高点的位置来看待问题,可能会事半功倍。
很不幸的前两天,某个朋友公司核心数据库“莫名奇妙”地遇到中止了。原因不方便说,但是据说等故障恢复完之后,朋友已经抽了好几包烟了。
我们先来看看,是由于LGWR终止了数据库( 注:做了一些脱敏处理 )。
但是,重启数据库却发现数据库启动不了,发现众多数据文件发生了坏块,数据库根本不能open:
同时伴随着类似的内部错误:
怎么破?通过当天的数据库备份结合归档进行恢复,远远比去修复坏块要快。
‘贰’ Oracle数据库打不开 该怎么办我们公司的oracle数据库坏了 打不开了,该如何处理
如果自己搞不定可以找诗檀软件专业ORACLE数据库修复团队成员帮您恢复!
诗檀软件专业数据库修复团队
Oracle的损坏/坏块 主要分以下几种:
ORA-1578
ORA-8103
ORA-1410
ORA-1499
ORA-1578
ORA-81##
ORA-14##
ORA-26040
ORA-600 Errors
Block Corruption
Index Corruption
Row Corruption
UNDO Corruption
Control File
Consistent Read
Dictionary
File/RDBA/BL
Error Description Corruption related to:
ORA-1578 ORA-1578一般为Oracle检测到存在物理坏块问题,包括其检测数据块中的checksum不正确,或者tail_chk信息不正确等。 ORA-1578 is reported when a block is thought to be corrupt on read.
Block
数据块
OERR: ORA-1578 “ORACLE data block corrupted (file # %s, block # %s)” Master Note
OERR: ORA-1578 “ORACLE data block corrupted (file # %s, block # %s)”
Fractured Block explanation
Handling Oracle Block Corruptions in Oracle7/8/8i/9i/10g/11g
Diagnosing and Resolving 1578 reported on a Local Index of a Partitioned table
ORA-1410
ORA-1410错误常见于从INDEX或其他途径获得的ROWID,到数据表中查询发现没有对应的记录。
该错误可能因为数据表与其索引存在不一致,也可能是分区的数据表本身存在问题。
This error is raised when an operation refers to a ROWID in a table for which there is no such row.
The reference to a ROWID may be implicit from a WHERE CURRENT OF clause or directly from a WHERE ROWID=… clause.
ORA 1410 indicates the ROWID is for a BLOCK that is not part of this table.
Row
数据行
Understanding The ORA-1410
Summary Of Bugs Containing ORA 1410
OERR: ORA 1410 “invalid ROWID”
ORA-8103
该ORA-8103可能由多个BUG引起,例如LOB在10.2.0.4之前可能会由于BUG覆盖了另一张表的segment header,导致出现ORA-8103错误。
诊断该问题可以从数据表的segment header和data_object_id入手。
The object has been deleted by another user since the operation began.
If the error is reprocible, following may be the reasons:-
a.) The header block has an invalid block type.
b.) The data_object_id (seg/obj) stored in the block is different than the data_object_id stored in the segment header. See dba_objects.data_object_id and compare it to the decimal value stored in the block (field seg/obj).
Block
数据块
ORA-8103 Troubleshooting, Diagnostic and Solution
OERR: ORA-8103 “object no longer exists” / Troubleshooting, Diagnostic and Solution
ORA-8102 ORA-8102常见于索引键值与表上存的值不一致。 An ORA-08102 indicates that there is a mismatch between the key(s) stored in the index and the values stored in the table. What typically happens is the index is built and at some future time, some type of corruption occurs, either in the table or index, to cause the mismatch.
Index
索引
OERR ORA-8102 “index key not found, obj# %s, file %s, block %s (%s)
ORA-1499 对表和索引做交叉验证时发现问题 An error occurred when validating an index or a table using the ANALYZE command.
One or more entries does not point to the appropriate cross-reference.
Index
索引
ORA-1499. Table/Index row count mismatch
OERR: ORA-1499 table/Index Cross Reference Failure – see trace file
ORA-1498 Generally this is a result of an ANALYZE … VALIDATE … command.
This error generally manifests itself when there is inconsistency in the data/Index block. Some of the block check errors that may be found:-
a.) Row locked by a non-existent transaction
b.) The amount of space used is not equal to block size
c.) Transaction header lock count mismatch.
While support are processing the tracefile it may be worth the re-running the ANALYZE after restarting the database to help show if the corruption is consistent or if it ‘moves’.
Send the tracefile to support for analysis.
If the ANALYZE was against an index you should check the whole object. Eg: Find the tablename and execute:
ANALYZE TABLE xxx VALIDATE STRUCTURE CASCADE; Block
OERR: ORA 1498 “block check failure – see trace file”
ORA-26040 由于采用过nologging/unrecoverable选项的redo生成机制,且做过对应的recover,导致数据块中被填满了0XFF,导致报错ORA-26040。 Trying to access data in block that was loaded without redo generation using the NOLOGGING/UNRECOVERABLE option.
This Error raises always together with ORA-1578
Block
数据块
OERR ORA-26040 Data block was loaded using the NOLOGGING option
ORA-1578 / ORA-26040 Corrupt blocks by NOLOGGING – Error explanation and solution
ORA-1578 ORA-26040 in a LOB segment – Script to solve the errors
ORA-1578 ORA-26040 in 11g for DIRECT PATH with NOARCHIVELOG even if LOGGING is enabled
ORA-1578 ORA-26040 On Awr Table
Errors ORA-01578, ORA-26040 On Standby Database
Workflow Tables ORA-01578 ORACLE data block corrupted ORA-26040 Data block was loaded using the NOLOGGING option
ORA-1578, ORA-26040 Data block was loaded using the NOLOGGING option
ORA-600[12700]
从索引获得的ROWID,对应到数据表时发现不存在数据行错误。
一把是一致性度consistent read问题
Oracle is trying to access a row using its ROWID, which has been obtained from an index.
A mismatch was found between the index rowid and the data block it is pointing to. The rowid points to a non-existent row in the data block. The corruption can be in data and/or index blocks.
ORA-600 [12700] can also be reported e to a consistent read (CR) problem.
Consistent Read
一致性读
Resolving an ORA-600 [12700] error in Oracle 8 and above.
ORA-600 [12700] “Index entry Points to Missing ROWID”
ORA-600[3020] 主要问题是redo和数据块中的信息不一致 This is called a ‘STUCK RECOVERY’.
There is an inconsistency between the information stored in the redo and the information stored in a database block being recovered. Redo
ORA-600 [3020] “Stuck Recovery”
Information Required for Root Cause Analysis of ORA-600 [3020] (stuck recovery)
ORA-600[4194] 主要是redo记录与回滚rollback/undo的记录不一致 A mismatch has been detected between Redo records and rollback (Undo) records.
We are validating the Undo record number relating to the change being applied against the maximum undo record number recorded in the undo block.
This error is reported when the validation fails. Undo
ORA-600 [4194] “Undo Record Number Mismatch While Adding Undo Record”
Basic Steps to be Followed While Solving ORA-00600 [4194]/[4193] Errors Without Using Unsupported parameter
ORA-600[4193] 主要是redo记录与回滚rollback/undo的记录不一致 A mismatch has been detected between Redo records and Rollback (Undo) records.
We are validating the Undo block sequence number in the undo block against the Redo block sequence number relating to the change being applied.
This error is reported when this validation fails. Undo
ORA-600 [4193] “seq# mismatch while adding undo record”
Basic Steps to be Followed While Solving ORA-00600 [4194]/[4193] Errors Without Using Unsupported parameter
Ora-600 [4193] When Opening Or Shutting Down A Database
ORA-600 [4193] When Trying To Open The Database
ORA-600[4137] transaction id不匹配,问题可能存在与回滚段中或者对象本身存在讹误 While backing out an undo record (i.e. at the time of rollback) we found a transaction id mis-match indicating either a corruption in the rollback segment or corruption in an object which the rollback segment is trying to apply undo records on.
This would indicate a corrupted rollback segment. Undo/Redo
ORA-600 [4137] “XID in Undo and Redo Does Not Match”
ORA-600[6101] Not enough free space was found when inserting a row into an index leaf block ring the application of undo. Index
ORA-600 [6101] “insert into leaf block (undo)”
ORA-600[2103] Oracle is attempting to read or update a generic entry in the control file.
If the entry number is invalid, ORA-600 [2130] is logged. Control File
ORA-600 [2130] “Attempt to access non-existant controlfile entry”
ORA-600[4512] Oracle is checking the status of transaction locks within a block.
If the lock number is greater than the number of lock entries, ORA-600 [4512] is reported followed by a stack trace, process state and block mp.
This error possibly indicates a block corruption. Block
ORA-600 [4512] “Lock count mismatch”
ORA-600[2662] 主要是发现一个数据块的SCN甚至超过了当前SCN,常规解决途径有调整SCN等,但11.2以后Oracle公司使较多调整SCN的方法失效了 A data block SCN is ahead of the current SCN.
The ORA-600 [2662] occurs when an SCN is compared to the dependent SCN stored in a UGA variable.
If the SCN is less than the dependent SCN then we signal the ORA-600 [2662] internal error. Block
ORA-600 [2662] “Block SCN is ahead of Current SCN”
ORA 600 [2662] DURING STARTUP
ORA-600[4097] 访问一个回滚段头以便确认事务是否已提交时,发现XID有问题 We are accessing a rollback segment header to see if a transaction has been committed.
However, the xid given is in the future of the transaction table.
This could be e to a rollback segment corruption issue OR you might be hitting the following known problem. Undo