SAN configuration leaves plenty of room for error, and a misconfigured setting can cause a wide range of problems. Troubleshooting storage-related issues requires a thorough understanding of the SAN configuration, because even a small mistake can lead to serious data loss and significant damage to the organisation. Whatever the situation, use the following tips as a starting point before moving on to advanced troubleshooting. Other tools are available, but these basic first steps may save you time.
1) Always take a backup of Switch Configurations
Back up your switch configurations at regular intervals, so that you can revert to a previous configuration if you are unable to troubleshoot an issue. These backup files tend to be human-readable flat files, which makes them extremely useful when you need to compare a broken configuration to a previously known working one. Another option is to create a new zone configuration each time you make a change and keep the previous versions, so that you can roll back if problems appear after committing the change.
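Because the backups are plain text, comparing a broken configuration with a known working one can be scripted. The following is a minimal sketch using Python's standard difflib module; the config fragments and the function name diff_configs are made up for illustration and do not reflect any vendor's actual file format.

```python
import difflib
from typing import List


def diff_configs(known_good: str, current: str) -> List[str]:
    """Return unified-diff lines between two text config dumps."""
    return list(
        difflib.unified_diff(
            known_good.splitlines(),
            current.splitlines(),
            fromfile="known_good",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical config fragments, for illustration only.
known_good = "zone: app01_zone\n  member: 50:01:43:80:05:6c:22:ae\n"
current = "zone: app01_zone\n  member: 50:01:43:80:05:6c:22:af\n"

# Lines starting with "-" exist only in the known-good backup,
# lines starting with "+" exist only in the current config.
for line in diff_configs(known_good, current):
    print(line)
```

In practice you would read the two strings from your saved backup files rather than defining them inline.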
2) Troubleshooting Connectivity Issues
Many of the day-to-day issues that you see are connectivity issues, such as hosts not being able to see a new LUN, or not being able to see storage or tape devices on the SAN. More often than not, connectivity issues are due to misconfigured zoning. Each vendor provides different tools to configure and troubleshoot zoning, but the following common CLI commands can prove very helpful.
fcping is an FC version of the popular IP ping tool. fcping allows you to test the following:
- Whether a device (N_Port) is alive and responding to FC frames
- End-to-end connectivity between two N_Ports
- Zoning between two devices
fcping is available on most switch platforms, as well as being a CLI tool for most operating systems and some HBAs. It works by sending Extended Link Service (ELS) echo request frames to a destination, which replies with ELS echo response frames. For example:
# fcping 50:01:43:80:05:6c:22:ae
Another tool modeled on a popular IP networking tool is fctrace, which traces the route/path to an N_Port. The following is an example fctrace command:
# fctrace fcid 0xef0010 vsan 1
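The FCID passed to fctrace is a 24-bit Fibre Channel address. In a switched fabric the top byte is the switch domain ID, followed by an area byte and a port byte, which is why a changed domain ID changes the addresses of the attached devices. The sketch below splits an FCID into those three fields; the function name split_fcid is hypothetical, and how the area and port bytes map to physical ports varies by vendor.

```python
from typing import Tuple


def split_fcid(fcid: int) -> Tuple[int, int, int]:
    """Split a 24-bit FC address into (domain, area, port) bytes."""
    if not 0 <= fcid <= 0xFFFFFF:
        raise ValueError("FCID must fit in 24 bits")
    domain = (fcid >> 16) & 0xFF  # switch domain ID
    area = (fcid >> 8) & 0xFF     # area byte
    port = fcid & 0xFF            # port byte
    return domain, area, port


# The FCID from the fctrace example above.
domain, area, port = split_fcid(0xEF0010)
print(f"domain=0x{domain:02x} area=0x{area:02x} port=0x{port:02x}")
# prints: domain=0xef area=0x00 port=0x10
```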
3) Things to check while troubleshooting Zoning
- Are your aliases correct?
- If using port zoning, have your switch domain IDs changed?
- If using WWPN zoning, have any of the HBA/WWPNs been changed?
- Is your zone in the active zone set?
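Parts of this checklist can be automated once the zoning data has been exported from the switch. The following is a rough sketch only: the dictionaries for aliases and zones, the alias and zone names, the second WWPN, and the function names are all hypothetical, and real switch output would need vendor-specific parsing first. It checks whether two WWPNs share at least one zone in the active zone set, and it raises an error if a zone member is neither a known alias nor a valid WWPN.

```python
import re
from typing import Dict, Iterable, List


def normalize_wwpn(wwpn: str) -> str:
    """Return a WWPN as lowercase, colon-separated hex."""
    digits = re.sub(r"[:\-\s]", "", wwpn).lower()
    if not re.fullmatch(r"[0-9a-f]{16}", digits):
        raise ValueError(f"not a 64-bit WWPN: {wwpn!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


def common_active_zones(
    wwpn_a: str,
    wwpn_b: str,
    zones: Dict[str, List[str]],
    aliases: Dict[str, List[str]],
    active_zones: Iterable[str],
) -> List[str]:
    """Names of active zones that contain both WWPNs."""
    a, b = normalize_wwpn(wwpn_a), normalize_wwpn(wwpn_b)
    hits = []
    for name in sorted(active_zones):
        members = set()
        for member in zones.get(name, []):
            # A member is either an alias name or a raw WWPN.
            # An unknown alias fails WWPN validation and raises ValueError.
            for wwpn in aliases.get(member, [member]):
                members.add(normalize_wwpn(wwpn))
        if a in members and b in members:
            hits.append(name)
    return hits


# Hypothetical zoning data, for illustration only.
aliases = {
    "host01_hba0": ["50:01:43:80:05:6c:22:ae"],
    "array_p1": ["50:06:0e:80:10:2a:3b:00"],
}
zones = {"host01_array": ["host01_hba0", "array_p1"]}

print(common_active_zones(
    "50014380056C22AE", "50:06:0e:80:10:2a:3b:00",
    zones, aliases, active_zones={"host01_array"},
))
```

An empty result means the two ports are not zoned together in the active zone set, even if a matching zone exists in the defined configuration.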
4) Rescan the SCSI Bus if required
After making zoning changes, LUN masking changes, or any other work that changes a LUN/volume presentation to a host, you may be required to rescan the SCSI bus on that host in order to detect the new device. The following commands show how to rescan the SCSI bus on a Windows server using the diskpart tool, and then list the disks to confirm that the new device is visible:
DISKPART> rescan
DISKPART> list disk
If you know that your LUN masking and zoning are correct but the server still does not see the device, it may be necessary to reboot the host.
5) Understanding Switch Configuration Dumps
Each switch vendor also tends to have a built-in command/script that gathers configs and logs to be sent to the vendor's tech support group for analysis. The output of these commands/scripts can also be useful to you as a storage administrator. The commands differ by vendor:
Cisco – show tech-support
Brocade – supportshow or supportsave
QLogic – create support
6) Use Port Error Counters
Switch-based port error counters are an excellent way to identify physical connectivity issues such as:
- Bad cables (bent, kinked, or otherwise damaged cables)
- Bad connectors (dust on the connectors, loose connectors)
The following example shows the error counters for a physical switch port:
admin> portshow 4/15
These port counters can sometimes be misleading. It is perfectly normal to see high counts against some of the values, and it is common to see values increase when a server is rebooted and when similar changes occur. If you are not sure what to look for, check your switch documentation, but also compare the counters to some of your known good ports.
If some counters are increasing on a port that you are concerned about, but are not increasing on known good ports, then that port very likely has a problem.
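The comparison against known good ports can be sketched in a few lines, assuming you have already captured two snapshots of the counters as dictionaries. The counter names, port names, and values below are hypothetical placeholders, not real switch output, and the function name suspicious_counters is made up for this example.

```python
from typing import Dict, List

Counters = Dict[str, int]


def suspicious_counters(
    before: Dict[str, Counters],
    after: Dict[str, Counters],
    suspect: str,
    known_good: List[str],
) -> Dict[str, int]:
    """Counters that grew on `suspect` but on none of the known good ports.

    Returns a mapping of counter name to its increase on the suspect port.
    """
    result = {}
    for name, new_value in after[suspect].items():
        delta = new_value - before[suspect].get(name, 0)
        if delta <= 0:
            continue
        grew_elsewhere = any(
            after[p].get(name, 0) - before[p].get(name, 0) > 0
            for p in known_good
        )
        if not grew_elsewhere:
            result[name] = delta
    return result


# Hypothetical snapshots taken some time apart, keyed by port name.
before = {
    "0": {"enc_out": 10, "disc_c3": 5},
    "1": {"enc_out": 3, "disc_c3": 5},
}
after = {
    "0": {"enc_out": 250, "disc_c3": 9},
    "1": {"enc_out": 3, "disc_c3": 8},
}

print(suspicious_counters(before, after, suspect="0", known_good=["1"]))
# prints: {'enc_out': 240}
```

Here disc_c3 is ignored because it also increased on the known good port, while enc_out is flagged because it only increased on the suspect port.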
Other commands show similar error counters as well as port throughput. The following porterrshow command shows some encoding out (enc out) and class 3 discard (disc c3) errors on port 0. This may indicate a bad cable, a bad port, or another hardware problem.