I am asking for the following reasons:
1. While I know T-SQL in general, I am no SQL Server admin.
2. Recently we had a SQL Server "crash" on us. The server is a VM, and what actually failed was one of its "disks", which is mapped to a LUN on our SAN. However, it appears the problem had been going on for about 2 months before the server finally crapped out. The trouble is that all of our data from early December disappeared, even from the backups, so we had to rebuild everything.
While the affected database is "production", it is a copy of data from elsewhere, so there was no major crisis other than getting it back online.
My best guess is that the disk had been "crashing" for some time, blocking writes while still allowing reads. My gut tells me SQL Server was caching data pages (I hope I have the terminology correct) in memory to compensate for the write failures and merging those cached pages with reads from disk (writes on this server occur in batches; it is mostly reads). When the cache filled up and maxed out memory, the server crashed and, when it came back up, the disk failures finally surfaced. Again, this is a guess, but a fairly educated one.
I am concerned because we have other SQL Servers on this SAN, some of them extremely critical to our business. I would like to know whether the same situation is developing elsewhere and do whatever we can to stop it. I'm trying to work with our Infrastructure team to see if they can monitor the SAN/LUNs for these conditions, but I've not had much luck with them, so I thought I'd try to tackle it from the SQL Server side.
If anyone could let me know where I might find SQL error logs that might record disk write failures and let me know what to look for in them (or queries to run on the server itself), I can put together some jobs/monitoring tools to keep this from happening again.
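To make that concrete, this is roughly the kind of check I have in mind. It is only a sketch in Python, under a few assumptions: that errors 823, 824 and 825 and the "I/O requests taking longer than 15 seconds" warning are the right things to look for (that is my reading so far, and part of what I'm hoping someone can confirm), and that the error log has already been exported as plain text lines. The function name `scan_errorlog` and the sample lines are made up for illustration; they are not from our server.

```python
import re
from collections import Counter

# Error numbers I am ASSUMING are the relevant I/O errors to watch for.
WATCHED_ERRORS = {
    823: "operating system reported an I/O error",
    824: "logical consistency-based I/O error (e.g. bad checksum, torn page)",
    825: "read succeeded only after retrying",
}

# Text of the slow-I/O warning; it is logged as a message, not as "Error: N".
SLOW_IO_TEXT = "I/O requests taking longer than 15 seconds"

# Matches the "Error: 823, Severity: 24, State: 2." style of log line.
ERROR_LINE = re.compile(r"Error:\s*(\d+),\s*Severity:\s*(\d+)")


def scan_errorlog(lines):
    """Count watched I/O errors and slow-I/O warnings in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match and int(match.group(1)) in WATCHED_ERRORS:
            counts[int(match.group(1))] += 1
        elif SLOW_IO_TEXT in line:
            counts["slow_io"] += 1
    return counts


if __name__ == "__main__":
    # Illustrative lines only, written by hand in the error log's general format.
    sample = [
        "2024-01-05 02:14:07.31 spid61      Error: 823, Severity: 24, State: 2.",
        "2024-01-05 02:15:00.10 Logon       Error: 18456, Severity: 14, State: 8.",
        "2024-01-05 02:16:42.55 spid9s      SQL Server has encountered 3 occurrence(s) "
        "of I/O requests taking longer than 15 seconds to complete on file [example.mdf].",
    ]
    for key, count in scan_errorlog(sample).items():
        label = WATCHED_ERRORS.get(key, "slow I/O warning")
        print(f"{key}: {count} ({label})")
```

The idea would be to run something like this on a schedule against each server's log and alert on any non-zero count, but I don't know whether these are the right signals or whether there is a better source to query on the server itself.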
Thanks, in advance, for your assistance!