In a virtualized infrastructure, access to shared storage is a cornerstone of performance, reliability, and disaster recovery. Storage Area Networks (SANs), particularly those built on iSCSI (Internet Small Computer System Interface) protocols, are ubiquitous in modern data centers. When things go wrong — especially with something as sensitive as a Logical Unit Number (LUN) suddenly becoming read-only — administrators scramble for answers. This article dives deep into a real-world situation where iSCSI LUNs became read-only after a host reboot, explains why this happened, and explores how target reattachment and a consistency check turned a potential disaster into a textbook recovery.
TL;DR
The issue began after a virtualization host rebooted and re-established an iSCSI connection to a shared target. Suddenly, several LUNs appeared as read-only, causing panic among system administrators. As it turned out, a misalignment between initiator states and target sessions had caused a SCSI reservation conflict. By detaching and reattaching the target, followed by a full consistency check, the VMs were successfully recovered without data loss. This event underscores the importance of proper multipathing configuration and session handling for shared storage.
The Boot-Time Surprise
It all started with a routine reboot of a VMware ESXi host. Following standard maintenance tasks, the administrator restarted the node expecting business as usual. However, as the host came back online and attempted to remount the VMFS datastores backed by iSCSI LUNs connected through its software iSCSI initiator, alarms began firing. VMs refused to boot properly, and storage logging revealed that certain iSCSI LUNs had inexplicably become read-only.
What’s peculiar is that these iSCSI LUNs were behaving completely normally before the reboot. Yet now, even though the targets were reachable and visible, the host treated the LUNs as effectively immutable. This wasn’t just a misread by the interface — these paths were truly mounted as read-only according to system logs and command-line queries.
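One quick way to confirm that the read-only state is real rather than a UI misread is to scan the kernel storage logs for the telltale messages. A minimal sketch, in which the sample log lines are illustrative stand-ins for vmkernel-style output, not captures from the actual incident:

```python
import re

# Hypothetical vmkernel-style log lines, for demonstration only.
LOG_LINES = [
    "WARNING: NMP: nmp_PathDetermineFailure: SCSI cmd RESERVATION CONFLICT",
    "ScsiDeviceIO: Cmd 0x2a to dev naa.600a0b80005ad0cf failed: write protected",
    "Vol3: device naa.600a0b80005ad0cf is read-only, mounting volume read-only",
    "iscsi_vmk: login to target succeeded",
]

# Patterns that typically indicate a read-only or reservation problem.
PATTERNS = [r"RESERVATION CONFLICT", r"read-only", r"write protected"]

def suspicious_lines(lines):
    """Return log lines matching any read-only / reservation indicator."""
    regex = re.compile("|".join(PATTERNS), re.IGNORECASE)
    return [line for line in lines if regex.search(line)]

for line in suspicious_lines(LOG_LINES):
    print(line)
```

Filtering for these markers separates genuine write-protection events from ordinary login/rescan noise in a busy log.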
Diagnosing the Problem
At first glance, the configuration seemed solid:
- The iSCSI initiator was properly configured with the correct target IP addresses.
- The storage target was online and presenting all expected LUNs.
- Multipathing policies were in place and functional, with no alerts from the SAN hardware.
Digging deeper, SCSI sense codes and syslog entries suggested something less obvious — a SCSI reservation conflict. This commonly occurs when multiple initiators (hosts) try to access the same LUN, and one of them holds an active reservation or lock. In shared infrastructures, SCSI-3 Persistent Reservations are often used to manage access coordination; however, these can occasionally stick or become orphaned during ungraceful shutdowns or reboots.
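A reservation conflict is not buried in sense data; it is reported directly in the SCSI status byte returned with the failed command (status code 0x18 per the SCSI Architecture Model). A small decoder for the common status codes makes log triage easier:

```python
# SCSI status codes as defined by the SAM (SCSI Architecture Model) standard.
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

def decode_status(status_byte):
    """Map a SCSI status byte to its standard name."""
    return SCSI_STATUS.get(status_byte, "UNKNOWN (0x%02x)" % status_byte)

# A write sent to a LUN reserved by another initiator comes back with 0x18.
print(decode_status(0x18))
```

Seeing 0x18 rather than 0x02 (CHECK CONDITION) is a strong hint that the problem is access coordination between initiators, not media or transport failure.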
When the rebooted host tried to re-establish its iSCSI sessions, it was met with resistance from the target, which still believed that another active initiator (likely the host's own previous, stale session) held control. As a safety measure, the target permitted only read-only access, locking out the very write operations needed by running virtual machines.
The Role of Multipathing and Session Persistence
Another subtle contributor to the issue was the behavior of the multipathing layer. Enterprise stacks such as Veritas Dynamic Multi-Pathing (DMP), or ESXi's Native Multipathing Plugin (NMP) with a Round-Robin path selection policy, split traffic across multiple paths. However, incorrect timeouts or stale paths can cause the system to fall back to degraded modes, leading to partial access or inconsistent device states.
If, during the reboot, the host didn't cleanly drop out of one path before re-establishing another, the storage target could interpret the old session as still valid. In that case the target may grant only restricted access to the new incoming session, leading to exactly what was observed: LUNs appearing as read-only.
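On ESXi, `esxcli storage core path list` reports a State for each path, which is where stale or dead paths show up. A sketch that parses that style of output to flag paths needing cleanup; the sample output below is abbreviated and hypothetical, so adapt the parsing to the exact format your ESXi build emits:

```python
# Abbreviated, hypothetical output in the style of
# `esxcli storage core path list` on an ESXi host.
PATH_LIST_OUTPUT = """\
naa.600a0b80005ad0cf
   Runtime Name: vmhba64:C0:T0:L0
   State: active
naa.600a0b80005ad0d1
   Runtime Name: vmhba64:C0:T1:L0
   State: dead
"""

def dead_paths(output):
    """Return runtime names of paths whose State is not 'active'."""
    flagged, runtime = [], None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Runtime Name:"):
            runtime = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and line.split(":", 1)[1].strip() != "active":
            flagged.append(runtime)
    return flagged

print(dead_paths(PATH_LIST_OUTPUT))
```

Running a check like this before and after maintenance makes it obvious when a path failed to tear down cleanly.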
The Target Reattach & Consistency Check Solution
Once the root cause was suspected to be lingering SCSI-3 reservations or a session-state mismatch, the logical solution was to completely reinitialize the initiator-target relationship. This was done in several steps:
- First, all iSCSI targets were detached from the affected host using the ESXi CLI and management console.
- Then, software initiator sessions were explicitly flushed to clear out cached paths and session metadata.
- After a brief pause, the iSCSI configuration was restored, and targets were reattached manually, allowing the host to renegotiate access terms and SCSI reservations from scratch.
- Finally, a full filesystem consistency check (on ESXi, typically done with VOMA, the vSphere On-disk Metadata Analyzer) was carried out on each LUN to ensure no damage had occurred during the read-only window.
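The detach, flush, reattach, and check sequence above can be captured as a dry-run script that prints the commands instead of executing them. The adapter and device names are placeholders, and the esxcli/voma invocations reflect common usage; verify every flag against your ESXi version before running anything for real:

```python
import shlex

ADAPTER = "vmhba64"  # placeholder software iSCSI adapter name
DEVICE = "/vmfs/devices/disks/naa.600a0b80005ad0cf:1"  # placeholder VMFS partition

def reattach_plan(adapter, device):
    """Build the ordered command list; nothing is executed here."""
    return [
        # 1. Tear down the adapter's existing iSCSI sessions.
        ["esxcli", "iscsi", "session", "remove", "--adapter", adapter],
        # 2. Rediscover targets so sessions are renegotiated from scratch.
        ["esxcli", "iscsi", "adapter", "discovery", "rediscover",
         "--adapter", adapter],
        # 3. Rescan the adapter to pick up the reattached LUNs.
        ["esxcli", "storage", "core", "adapter", "rescan", "--adapter", adapter],
        # 4. Check VMFS metadata with VOMA (the datastore must be quiesced first).
        ["voma", "-m", "vmfs", "-f", "check", "-d", device],
    ]

for cmd in reattach_plan(ADAPTER, DEVICE):
    print(shlex.join(cmd))  # print only; run each step manually after review
```

Keeping the plan as data rather than a shell one-liner makes it easy to review, log, and pause between steps during a nerve-wracking recovery.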
The results? Remarkably smooth recovery. VMs that had previously refused to boot came back online, metadata stores were intact, and no live migrations or snapshots were lost. Despite the harrowing nature of read-only storage on production VMs, there was no long-term data corruption.
Lessons Learned: Avoiding Future Incidents
This incident provided several key takeaways for system administrators and storage engineers alike:
- Always ensure clean shutdowns of storage initiators before reboots. Incomplete sessions can wreak havoc on target logic.
- Check your multipathing configuration. Use tools to simulate failover scenarios and ensure stale paths are detected and removed promptly.
- Invest in visibility tools that let you inspect SCSI reservations, LUN metadata, and session histories. These can be invaluable during deep-dive troubleshooting.
- Automate cleanup tasks post-maintenance via scripts that handle iSCSI detachments, session flushes, and reattaches to reset the playing field proactively.
Why This Matters for VM Administrators
For those managing virtual environments, especially large vSphere deployments, understanding the behavior of underlying storage mechanisms is not only useful — it’s essential. Virtual disks (.vmdk files) are heavily reliant on block-level access and stability. Any deviation, such as read-only LUNs, introduces serious risk not just to performance but to data integrity.
This case highlights a chain of relatively benign actions — a reboot, a session reattach — that spiraled into degraded storage behavior. But it also shows how awareness of iSCSI internals, alongside the courage to detach and reattach storage, can reverse the damage before it’s too late.
Final Thoughts
In dynamic infrastructures, issues like these are bound to occur. What differentiates a catastrophe from a recovery success story is preparation, insight, and responsive action. iSCSI is powerful but requires rigorous session and path management to deliver consistent uptime. Next time a reboot is scheduled, remember: initiator states matter, SCSI reservations are sticky, and with the right tools, even a read-only LUN can become writable again.
Stay vigilant and remember — storage doesn’t forget.