APEXtras

a development team dedicated to Oracle APEX

Archive for the ‘ORA-08103’ tag

Oracle/VMware/Hardware Week of Hell

without comments

Our development Oracle servers - and all our other servers - are Windows 2003 virtual machines running on VMware Fusion in a MacPro3.1 (and I’m well aware that that is not an Oracle-supported configuration). Last week Oracle became corrupt and we discovered that all copies and backups had some degree of corruption that could not be fixed. During reinstallation of Oracle, VMware corrupted the virtual machine several times. Finally, a hardware failure wiped out the first successful rebuild. It took the whole of last week to reinstate the development server.

The only positive aspect was that in the process of recovering from this mess I learnt quite a lot about Oracle, VMware, and the Mac; I was immensely helped by the huge amount of information available on the Internet - especially other people’s blogs.

Lessons Learnt

  • Running Oracle 10g Standard Edition on Windows Server 2003 on VMware Fusion will always risk block corruption - even when DB_BLOCK_CHECKING is set to full and even when the disk space for the Virtual Machine is pre-allocated.
  • Oracle 10g Enterprise Edition with DB_BLOCK_CHECKING set to full does not develop block corruption.
  • Make regular backups with RMAN even when you are using VMware snapshots as your backup.
  • In VMware Fusion, the greater the number of snapshots, the greater the risk of fatal corruption.

Software versions:
VMware Fusion 2.0.4 (159196)
Microsoft Windows Server 2003 Standard Edition SP 2
Oracle 10.2.0.4.0

Why run Oracle on virtualised Windows Server 2003/VMware Fusion/MacPro?

We have historical and practical reasons for running Oracle on top of VMware on a MacPro. We used to host the systems we developed for customers ourselves, and when we got out of the hosting business we still owned a computer room with several redundant rack-mount Windows servers. When we started developing in Oracle/APEX we just installed Oracle on one of our old servers. That worked well for 3 years; but then we decided to move to a new office that did not have a separate computer room. We didn’t want to share our office with a server that sounds like a vacuum cleaner - and besides, the Oracle server was now 6 years old. The quietest server-class machine turned out to be a MacPro, which led us to experiment with Windows 2003/Oracle running on VMware Fusion. We ended up with a whole rack-full of Windows servers virtualised inside a single MacPro3.1 with 2 2.8 GHz quad-core processors and 10 GB of memory. Any 4 of the servers would run simultaneously at an acceptable speed, and our virtual development Oracle server was faster than the metal original (and silent). We moved office and happily developed on this setup for the next 9 months.

Oracle error ORA-08103: object no longer exists

This error suddenly appeared every time a function or procedure from a particular long-established package was called. Trying to recompile the package produced the same error. ORA-08103 is the error you expect to see when another process has deleted the object you are in the process of accessing. In our case, I was the sole user and that could not be the problem.

Identifying Block Corruption
It didn’t take long with Google to find the more sinister cause - Oracle block corruption - eg see this Oracle forum posting which also describes how to use RMAN from the command line to populate the system view V$DATABASE_BLOCK_CORRUPTION:

RMAN TARGET=SYS
(the system prompts for the SYS password)
RMAN> SHUTDOWN IMMEDIATE
RMAN> STARTUP MOUNT
(mount the database but don't open it - only "fixed" tables/views are available)
RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;

RMAN responds with something like:
input datafile fno=00002 name=ORACLE_HOME\UNDOTBS01.DBF
input datafile fno=00003 name=ORACLE_HOME\SYSAUX01.DBF
input datafile fno=00001 name=ORACLE_HOME\SYSTEM01.DBF
etc…
with ORACLE_HOME replaced by its actual physical path
(eg D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB_1).

Then when logged in as a user with sysdba privileges (no need to open the database because V$DATABASE_BLOCK_CORRUPTION is a “fixed” view):

SELECT * FROM V$DATABASE_BLOCK_CORRUPTION;

Sure enough, I got a display of more than 20 corrupt blocks:

FILE# BLOCK#  BLOCKS  CORRUPTION_CHANGE#  CORRUPTION_TYPE
13     245811  1       0                   FRACTURED
13     245883  1       0                   FRACTURED
13     650238  1       0                   FRACTURED
13     906713  7       0                   ALL ZERO

etc…

The number in the FILE column relates to one of the numbered data files that RMAN displayed (the number following fno=).

There are usually also alert log entries detailing the corrupt blocks, and similar information is given by the command line tool DBVERIFY:
DBV FILE=’FULL_PATH_TO_THE_DATAFILE’
where FULL_PATH_TO_THE_DATAFILE represents the file name as displayed by RMAN.

Types of Block Corruption
CORRUPTION_TYPE FRACTURED and ALL ZERO are explained in this way by Oracle in the documentation for V$DATABASE_BLOCK_CORRUPTION:
FRACTURED - Block header looks reasonable, but the front and back of the block are different versions.
ALL ZERO - Block header on disk contained only zeros. The block may be valid if it was never filled and if it is in an Oracle7 file. The buffer will be reformatted to the Oracle8 standard for an empty block.

I can’t find any detailed explanation of why these problems may occur - perhaps they mean that the block was being read or written to by some other process at the same time as Oracle was writing to it?

Recovering from Block Corruption
Block corruption should be possible to recover from. Oracle 10g RMAN includes the BLOCKRECOVER command; this should recover all blocks that were marked as corrupt in V$DATABASE_BLOCK_CORRUPTION:

RMAN> BLOCKRECOVER CORRUPTION LIST;

That produced the message:
“Block media Recovery requires Enterprise Edition”.

It’s not made at all clear in Oracle’s documentation, but that’s correct - BLOCKRECOVER is available only in Enterprise Edition. Furthermore, BLOCKRECOVER can only recover blocks when there are previous versions of the block available from retained backups. The way to recover from corruption without using BLOCKRECOVER is to restore the complete file(s) with the corrupt blocks from a good backup. However, rather than taking Oracle backups with RMAN, we had been using VMWare’s snapshot capability to store regular copies of the complete virtual machine. But every snapshot, right back to the first one from 8 months ago, showed an Oracle instance with some block corruption.

Reinstalling Oracle on a Virtual Machine
Since we use Oracle’s exp utility to take daily full dumps of the development tablespace, we keep all PL/SQL code in a Subversion source control system, and we make frequent application exports from APEX, I now decided to reinstall Oracle 10g Standard Edition from scratch on a clean server. A few hours later I had a new development system running. But after a few minutes of testing the APEX applications, the old ORA-08103 error reappeared. Sure enough, RMAN again reported block corruption. I again reinstalled Oracle, and this time tested for block corruption at every stage. Corruption first appeared after creating a new development tablespace - before any data had been imported. Setting system initialization parameter DB_BLOCK_CHECKING to FULL had no effect.

As a last resort I then tried installing Oracle Enterprise Edition, reasoning that I should at least be able to recover from block corruption if I could get a non-corrupt backup - but no need; this time there was no block corruption at any stage.

I tested the Enterprise Edition installation by creating and dropping large tablespaces and by forcing automatic tablespace extension up to the 60 GByte disk limit by inserting records into a huge table - at no stage did block corruption develop when DB_BLOCK_CHECKING was set to to FULL.

There’s no obvious reason why oracle Enterprise Edition should perform differently from Standard Edition in relation to block corruption - I would expect the relevant code to be exactly the same in both. Specifically, I would expect the DB_BLOCK_CHECKING parameter when set to FULL would ensure that Standard Edition could not write corrupt blocks - however, since our configuration is unsupported, its failure to protect against corruption cannot be described as an Oracle bug. Maybe Enterprise Edition has different default settings for some crucial parameter?

VMware File Missing Error

VMware allows one to take a “snapshot” of the Virtual Machine (VM) state at any time. I had originally thought of these snapshots as similar to normal backups, and by the time I had rebuilt Oracle, applied the 10.2.0.4 patch set, imported the development tablespace, reinstalled APEX, and imported the APEX applications on the main development VM, there were about 60 stored snapshots - the whole VM package occupied more than 270 GBYTES on disk.

I now deleted an unwanted snapshot but got an error message about being unable to delete it, which I thought was a fairly trivial problem at the time. I shut down the VM and made a disk copy of the whole package. When I restarted it, VMWare complained that a needed file was missing - it named the file as the <PACKAGE_NAME>.vmdk, which is the main file representing the VM’s hard disk. When I looked inside the VM package (right click in Finder and select “Show Package Contents”) the file was indeed missing - VMWare appeared to have “lost” it. That represented more than a day’s work lost.

More Snapshots, Less Resilience
I soon discovered that my view of snapshots as extra backups was the opposite of the truth - in fact when a snapshot is taken, it creates files which must be added to all the previous snapshot files to recreate the machine state. The VM disk image is fragmented into a base disk file <PACKAGE_NAME>.vmdk and many delta files <PACKAGE_NAME>-nnnnnn.vmdk where nnnnnn is 000001, 000002 etc. Each file has a unique id number (the CID) in its header together with the CID of its parent (parentCID). The last file in the chain is
<PACKAGE_NAME>.vmdk. Entries in file vmware.log list the disk files in the order they are opened as the failed VM is started. The last one in the list is the one that should have chained to the missing file. When a snapshot is deleted, its data is merged into the main disk image file (for more detail see eg blog post VMware Snapshots: Or, VMSD’s, VMSN’s, CID chains, and what happens when you press Delete All).

Recovering from VMWare File Missing Error
I made a copy of the VM package then I followed the recovery procedure described on the DRIVE: ACTIVATED blog and the VMware Knowledge Base: I copied into the package a version of <PACKAGE_NAME>.vmdk from a backup of the same virtual machine made at the time when it was first virtualised (so now 8 months out of date). I then edited the header of the last file in the chain so that it referred to the ID of the new <PACKAGE_NAME>.vmdk using 0xED, a hex editor for the Mac. The file was huge (32 GB) but saved successfully in 40 minutes. Miraculously, this process resulted in a working VM.

I spent the next few hours deleting most of the snapshots, 10 at a time, and finally brought the size of the VM down to 68 GBytes.

Hardware Error

The final insult - I shut down the VM and copied it from the backup external disk (WD 1 TByte “My Book”) back on to the production disk… and the copy failed with an unrecoverable disk error. I had to repeat the retrieval of the VM; another day lost. We’re going to use a mirrored external disk for backup in future.

Written by Roger

June 23rd, 2009 at 5:26 pm