Course Notes – 5 day class – prep this class.
Course Description
Instructor Community
Training Plan
Last Taught:
- 2024-10-14
Module 01: Introduction – 60 Minutes
VitalSource
VitalSource Support – https://support.vitalsource.com/hc/en-us
VitalSource Getting Started Video – https://www.youtube.com/watch?v=9uL5R_Tl3b0&feature=youtu.be
Slide 01-09 VMware Online Resources:
VMware vSphere Blog: https://blogs.vmware.com/vsphere/
VMware Communities: https://communities.vmware.com
VMware Support: https://www.vmware.com/support
VMware Education: https://www.vmware.com/education
VMware Certifications: https://mylearn.vmware.com/portals/certification
VMware Education and Certification Blog: https://blogs.vmware.com/education/
VMware Knowledge Base: http://kb.vmware.com
VMware Hands On Labs – https://labs.hol.vmware.com/HOL/catalogs/catalog/all
Slide 01-10: VMware Education Overview
VMware Learning Paths – https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/professional-services/vmware-learning-paths.pdf
VMware Learning Zone – https://vmwarelearningzone.vmware.com
Slide 01-11: VMware Certification Overview
VMware certifications: http://mylearn.vmware.com/portals/certification
Slide 01-12: VMware Digital Badge Overview
VMware Badges – http://www.pearsonvue.com/vmware/badging
Module 02: Introduction to Troubleshooting
Lesson 1: Introduction to Troubleshooting – ?? Minutes
Slide 02-06: About the Troubleshooting Process
The troubleshooting process begins when a user reports a problem. In this context, the user is anyone using the system, from an end user to an administrator.
An issue reported by a user might not be the actual problem. The user might be reporting symptoms of the problem.
An observed problem might be directly causing the symptoms, but typically the root cause has a more fundamental source.
Slide 02-07: Defining System Problems
A system consists of several software and hardware components. For example, an ESXi host consists of hardware components, such as CPU, memory, storage, and networking, and the hypervisor software. A virtual machine includes components such as one or more applications, a guest operating system, and virtual hardware.
A problem that occurs in a system can disrupt and negatively affect production services that were functioning normally.
This course focuses on configuration and operational issues.
Slide 02-08: Identifying the Effects of System Problems
Usability is about whether users can complete tasks and achieve goals with the given product. Usability is also about the amount of effort (often measured in time) that is required by a user to perform a task.
Accuracy is about the system’s precision and ability to show the same results under unchanged conditions.
Reliability can be defined in terms of whether a system consistently produces correct outputs for a given amount of time. Reliability is enhanced by system features that help avoid and detect problems. Reliability is often defined in business service-level agreements (SLAs) in the form of availability.
Performance is also defined in terms of an SLA. An SLA establishes performance and reliability requirements for applications. With an SLA, tracking and analysis of the achieved performance and reliability ensures that those requirements are met. A performance problem occurs when an application fails to meet its SLA. Depending on the SLA, the failure might be in the form of excessively long response times or an unacceptable length of time that the system was unavailable.
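As a rough illustration of how an availability target in an SLA translates into a downtime budget, the following Python sketch does the arithmetic. The 99.9% target, the 30-day period, and the function names are illustrative assumptions, not values from the course.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def meets_sla(observed_downtime_minutes: float, availability_percent: float,
              period_days: int = 30) -> bool:
    """Return True if the observed downtime stays within the budget."""
    budget = allowed_downtime_minutes(availability_percent, period_days)
    return observed_downtime_minutes <= budget


if __name__ == "__main__":
    # Hypothetical target: 99.9% availability measured over 30 days.
    budget = allowed_downtime_minutes(99.9)
    print(f"Downtime budget: {budget:.1f} minutes")
    print("60 minutes of downtime meets the SLA:", meets_sla(60, 99.9))
```

With these assumed numbers, the budget is about 43 minutes per period, so a single hour-long outage would already count as a performance problem under that SLA.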
Slide 02-09: Collecting Symptoms of a Problem
Collecting symptoms is the first step in troubleshooting a problem:
- Users will often report several symptoms that are correlated to a single root cause.
- Differentiating between symptoms and the root cause of a problem is imperative.
An observed problem might be directly causing the symptoms, but typically the problem has a more fundamental cause.
Problems can arise in any computing environment. Complex application behaviors, changing demands, and shared infrastructure can lead to new issues in previously stable environments.
Troubleshooting problems requires an understanding of the interactions between the software and hardware components of a computing environment. Moving to a virtualized computing environment adds new software layers and new types of interactions that must be considered when troubleshooting.
Slide 02-10: Gathering Supplemental Information
Proper troubleshooting requires starting with a broad view of the computing environment and systematically narrowing the scope of the investigation, as possible sources of problems are eliminated.
Troubleshooting efforts that start with a narrowly conceived idea of the source of a problem often get stuck in the detailed analysis of one component, when the real source of the problem is elsewhere in the infrastructure.
To isolate the source of a problem, you must adhere to a logical troubleshooting methodology that avoids preconceptions about the source of the problem.
Slide 02-11: Viewing and Interpreting Diagnostic Information
A problem generates a diagnostic message. You inspect the diagnostic message to learn more about the problem. If diagnostic information does not appear in the GUI or in an event viewer, you can review the appropriate log files for useful entries. You then use the information in the diagnostic messages to focus on the area of the system that is most likely causing the issue.
For example, a user receives an error message when powering on a virtual machine. This fault occurs when an attempted operation conflicts with a resource configuration policy. This fault may occur, for example, if a power-on operation reserves more memory than is allocated to a resource pool. Retry the operation after adjusting the resources to allow more memory.
Slide 02-12: Identifying Possible Causes and Taking Appropriate Action
In a VMware virtual environment, the root cause of a problem can occur in any one of the virtual or hardware components. Knowing where to start looking for the root cause is often not obvious. Gathering as much information as you can about the issue helps determine which component to check first.
You might take one of the following troubleshooting approaches:
- Top-down approach: Start troubleshooting in the guest operating system and work your way down the stack to the VM, to the ESXi host, and finally to the hardware.
- Bottom-up approach: Start troubleshooting at the hardware level and work your way up the stack to the ESXi host, to the VM, and finally to the guest operating system.
- Approach the cause by halves: Start troubleshooting at the middle of the stack. For example, start with the VM and test possible causes. The test results determine whether you should continue troubleshooting up the stack or down the stack.
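The halving approach can be sketched as a binary search over the stack. This is only a minimal sketch under a simplifying assumption that is not stated in the course: a fault in one layer makes that layer and every layer above it fail its check. The function name and the check callback are hypothetical.

```python
from typing import Callable, List, Optional

# Layers ordered from the bottom of the stack to the top.
STACK_BOTTOM_UP = ["hardware", "ESXi host", "VM", "guest operating system"]


def locate_fault(layers_bottom_up: List[str],
                 works_up_to: Callable[[str], bool]) -> Optional[str]:
    """Return the lowest layer whose check fails, or None if all checks pass.

    works_up_to(layer) must return True when that layer and everything
    below it behave correctly (the assumption described above).
    """
    low, high = 0, len(layers_bottom_up)
    while low < high:
        mid = (low + high) // 2
        if works_up_to(layers_bottom_up[mid]):
            low = mid + 1   # middle layer is fine: continue up the stack
        else:
            high = mid      # middle layer fails: continue down the stack
    return layers_bottom_up[low] if low < len(layers_bottom_up) else None


if __name__ == "__main__":
    # Simulated environment in which the ESXi host is the faulty layer.
    faulty_index = STACK_BOTTOM_UP.index("ESXi host")
    check = lambda layer: STACK_BOTTOM_UP.index(layer) < faulty_index
    print(locate_fault(STACK_BOTTOM_UP, check))
```

With four layers, the first check lands on the VM, which matches the starting point suggested in the halving approach.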
Slide 02-13: Determining the Root Cause
General virtual infrastructure knowledge and knowledge of your specific system configuration are helpful in identifying possible causes. You prioritize the list of possible causes, ordering them from most probable to least probable. Then, you test each possible cause to determine the most likely cause of the problem, called the root cause.
In the example, the problem is that a virtual machine has stopped responding. In a nonresponsive system, the operating system seems to be paralyzed and no error messages appear. However, the operating system is still running. Such problems might require guidance from documents such as VMware knowledge base articles. For example, to troubleshoot a VM that has stopped responding, see VMware knowledge base article 1007819 at https://kb.vmware.com/kb/1007819.
For this problem, you might take a top-down approach. You start with the operations performed on the VM, check the VM configuration, and then check for sufficient resources on the host where the VM is located.
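The prioritize-then-test loop described for this slide might be sketched as follows. The class, the probability estimates, and the three sample causes are hypothetical placeholders; in practice each test is a manual check or a command that you run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PossibleCause:
    description: str
    probability: float        # your own estimate, 0.0 to 1.0
    test: Callable[[], bool]  # returns True if this cause is confirmed


def find_root_cause(causes: List[PossibleCause]) -> Optional[PossibleCause]:
    """Test causes from most probable to least probable; stop at the first hit."""
    for cause in sorted(causes, key=lambda c: c.probability, reverse=True):
        if cause.test():
            return cause
    return None


if __name__ == "__main__":
    # Hypothetical candidates for a VM that has stopped responding.
    candidates = [
        PossibleCause("VM configuration problem", 0.3, lambda: False),
        PossibleCause("Insufficient resources on the host", 0.2, lambda: True),
        PossibleCause("Recent operation performed on the VM", 0.5, lambda: False),
    ]
    result = find_root_cause(candidates)
    print(result.description if result else "No cause confirmed; widen the search")
```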
Slide 02-14: Resolving the Problem
To resolve the problem, you identify possible solutions to the problem and implement a solution.
In determining the best solution, you assess the impact that the issue has on normal operations. For example, if the problem causes business-critical applications to be inaccessible, then the impact of the problem is high and immediate resolution is necessary.
When identifying possible solutions, you might decide to first implement a short-term fix so that systems can be brought back online quickly. Before implementing the short-term solution, you document all changes that you make to the system from the time the problem occurred. You also back up your log files from the time that the problem occurred. Some short-term solutions can be destructive and truncate important log information that is necessary for additional assistance.
Eventually, you want to implement a more permanent, long-term solution to prevent the problem from occurring again.
Slide 02-15: Scenario: Defining the Problem
In the example described on the slide, you use the troubleshooting methodology to diagnose a vSphere vMotion migration problem. You use the vSphere Client to perform a vSphere vMotion migration, but the migration fails with an error message.
At this point, you cannot tell whether the problem is specific to vSphere vMotion or is in the underlying infrastructure, such as storage or networking.
To pinpoint the problem area, you gather information about the problem, starting with any diagnostic messages displayed in the vSphere Client.
Slide 02-16: Scenario: Identifying Possible Causes
vSphere performs many compatibility checks before the migration is initiated. Checking configuration items beforehand helps eliminate possible causes of problems, such as vSphere vMotion not being configured or incompatible CPUs.
Slide 02-17: Scenario: Gathering Information
vSphere Client shows the following error messages for the failed vSphere vMotion migration task:
- A general system error occurred: The vSphere vMotion migration failed because the ESXi hosts were not able to connect over the vSphere vMotion network. You check the vSphere vMotion network settings and physical network configuration.
- vSphere vMotion migration failed to create a connection with remote host 172.20.12.52: The ESXi hosts failed to connect over the vSphere vMotion network.
- Migration failed to connect to remote host 172.20.12.52 from host 172.20.14.51: Timeout. The IP addresses refer to the vSphere vMotion VMkernel interfaces on the remote host (ESXi02) and the local host (ESXi01).
The vSphere vMotion migration failed because the destination host did not receive data from the source host on the vSphere vMotion network. You verify that your vSphere vMotion network settings and physical network configuration are correct.
The first error message in the stack is helpful and tells you to check the vSphere vMotion network settings and physical network configuration. Not all error messages are useful or relevant.
Slide 02-18: Scenario: Determining the Root Cause
First, you use the ping command to test the network connectivity between the hosts. For example, you ping ESXi02 from ESXi01.
If the ping command fails, you investigate why. For example, the ping might fail because of a network misconfiguration or faulty physical hardware. You make a change to your environment and try the ping again.
After the ping is successful, you test the vSphere vMotion migration. If the migration is successful, you have identified the root cause of the problem. If the migration is not successful, you test the next possible cause on the list.
If the ping command is successful, network connectivity exists between the two hosts.
You test the VMkernel interface connectivity. You use the ping command for this test too. From one host, you run the ping command, pointing to the VMkernel interface that you want to check on the target host. For example, you run the ping command to ping the vSphere vMotion VMkernel interface on ESXi02 (172.20.13.52) from ESXi01.
If the ping command fails, you investigate why. You verify that the VMkernel interface is configured correctly. You make the changes to your environment and try the ping command again.
When the ping command is successful, you test the vSphere vMotion migration again. If the migration is successful, you have identified the root cause of the problem. If the migration is not successful, you must investigate further to find the root cause.
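The two ping tests in this scenario could be prepared as command lines like the ones built below. This is a sketch, not the course's procedure: the notes name only the ping command, and using vmkping with the -I option to send the request from a specific VMkernel interface is an addition here. The interface name vmk1 and the packet count are assumptions. The code only builds the commands; you run them in vSphere ESXi Shell on the source host.

```python
import shlex
from typing import List, Optional


def build_ping_command(target: str, vmk_interface: Optional[str] = None,
                       count: int = 3) -> List[str]:
    """Return the argument list for a connectivity test from an ESXi host.

    Without vmk_interface, a plain host-to-host ping is built.
    With vmk_interface, vmkping sends the request out of that VMkernel interface.
    """
    if vmk_interface is None:
        return ["ping", "-c", str(count), target]
    return ["vmkping", "-I", vmk_interface, "-c", str(count), target]


if __name__ == "__main__":
    # Step 1: host-to-host test ("ESXi02" stands in for the host's address).
    print(shlex.join(build_ping_command("ESXi02")))
    # Step 2: VMkernel interface test; vmk1 is an assumed interface name.
    print(shlex.join(build_ping_command("172.20.13.52", vmk_interface="vmk1")))
```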
Slide 02-19: Scenario: Resolving the Problem
After you identify the root cause, you identify possible solutions to fix the problem. The impact (high, medium, or low) that the problem has on normal operations determines how quickly the solution should be implemented.
Finally, you determine the appropriate type of solution for this issue. If a long-term solution requires significant downtime or potentially has a major system impact, you might implement a short-term solution temporarily.
When you implement a short-term solution, you take the following steps:
- Back up your log files from the time that the problem occurred. Remember that log files rotate and might not be available at a future time.
- Correct enough problems so that the system works normally right now.
- Document all changes that you made to the system since the problem occurred.
- Schedule downtime later to implement a long-term solution.
Module 03: Tools for Troubleshooting vSphere
Lesson 1: Using the Command Line for Troubleshooting – ?? Minutes
Slide 03-06: vSphere Troubleshooting Toolkit
Slide 03-07: Troubleshooting References and Documentation
Slide 03-08: User Interfaces: GUI
Slide 03-09: User Interfaces: CLI
Slide 03-10: Running Commands
vSphere ESXi Shell includes a set of fully supported ESXCLI commands and a set of commands for diagnosing and managing ESXi hosts. Familiarize yourself with vSphere ESXi Shell in case VMware Technical Support directs you to use it.
With the Standalone ESXCLI command set, you can run common system administration and configuration tasks against vSphere systems, from an administration server of your choice. Standalone ESXCLI can be installed on supported operating systems, such as Windows and Linux.
Slide 03-11: Accessing vSphere ESXi Shell
An ESXi system includes a direct console that you can use to start and stop the system, and to perform a limited set of maintenance and troubleshooting tasks. The direct console user interface (DCUI) includes vSphere ESXi Shell, which is deactivated by default. You can activate vSphere ESXi Shell in the DCUI, or through the vSphere Client.
To access vSphere ESXi Shell locally, you require physical access to the DCUI and administrator privileges. Local users who are assigned to the administrator group have local shell access by default.
To access vSphere ESXi Shell remotely, you activate the SSH service. However, you should activate SSH access only for a limited time. Never leave SSH active on an ESXi host in a production environment. Activating SSH creates a security vulnerability and consumes ESXi host resources.
You can use the vSphere Client to activate local and remote access to the ESXi Shell:
- Select the ESXi host.
- Click Configure.
- Click Services.
- Start the ESXi Shell and SSH services.
Slide 03-12: vSphere ESXi Shell and SSH Timeout (1)
The Availability timeout setting determines how long the SSH and vSphere ESXi Shell services remain active:
- The default value is 0, and SSH and vSphere ESXi Shell remain active until manually deactivated.
- A value of 1 or higher sets how long the services remain active before being automatically deactivated. This value is measured in minutes in the DCUI and in seconds in the vSphere Client.
If the Idle timeout setting is configured, local and remote users are automatically logged out if their sessions are idle for the defined period:
- The default value is 0, and sessions are not logged out automatically.
- A value of 1 or higher sets how long an idle session remains active before being automatically logged out. This value is measured in minutes in the DCUI and in seconds in the vSphere Client.
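Because the same timeout is entered in minutes in the DCUI and in seconds in the vSphere Client, a small conversion helper can prevent unit mistakes. The following is a minimal sketch with hypothetical function names. It is also an assumption that the vSphere Client values correspond to the host advanced settings UserVars.ESXiShellTimeOut and UserVars.ESXiShellInteractiveTimeOut; verify this on your own hosts.

```python
def dcui_minutes_to_client_seconds(minutes: int) -> int:
    """Convert a timeout entered in the DCUI (minutes) to vSphere Client units (seconds)."""
    if minutes < 0:
        raise ValueError("timeout cannot be negative")
    return minutes * 60


def describe_timeout(seconds: int) -> str:
    """Describe a timeout value as it would be entered in the vSphere Client."""
    if seconds < 0:
        raise ValueError("timeout cannot be negative")
    if seconds == 0:
        return "no timeout (the default)"
    return f"{seconds} seconds in the vSphere Client = {seconds / 60:g} minutes in the DCUI"


if __name__ == "__main__":
    # Hypothetical choice: a 15-minute timeout.
    seconds = dcui_minutes_to_client_seconds(15)
    print(describe_timeout(seconds))
    print(describe_timeout(0))
```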
Slide 03-13: vSphere ESXi Shell and SSH Timeout (2)
If either the vSphere ESXi Shell service or the SSH service is activated, the timeout menu option is visible in the DCUI, but you cannot navigate to it.
Slide 03-14: ESXCLI Commands
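The ESXCLI commands are a comprehensive set of commands for managing most aspects of the vSphere environment.
For a full listing, run the command esxcli esxcli command list. The repeated esxcli is intentional: the second esxcli is the namespace that describes the command set itself.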
Help is available at all levels of the ESXCLI command set. For example, you can enter esxcli for a list of namespaces available with esxcli.
To determine the available commands in the network namespace, you enter esxcli network.
To determine the configuration options available for firewalls, you enter esxcli network firewall.
Each level displays command syntax help and options available for the namespace.
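The level-by-level help lookup can be pictured as one command per namespace level. The helper below is a hypothetical sketch that only generates the command strings to enter; it does not run esxcli.

```python
from typing import List


def esxcli_help_commands(namespace_path: List[str]) -> List[str]:
    """Return the commands to enter, one per level, down to the given namespace."""
    commands = ["esxcli"]
    for depth in range(1, len(namespace_path) + 1):
        commands.append(" ".join(["esxcli", *namespace_path[:depth]]))
    return commands


if __name__ == "__main__":
    # Drill down from the top level to the firewall namespace.
    for command in esxcli_help_commands(["network", "firewall"]):
        print(command)
```

Entering each generated command in turn displays the help for that level, from the list of top-level namespaces down to the firewall options.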
For more information about ESXCLI commands and their descriptions, see vSphere Command-Line Interface Reference at https://code.vmware.com.
Slide 03-15: Viewing vSphere Storage Information
The esxcli storage command set includes the following namespaces:
- core: Provides configuration options and details on adapters, devices, paths, plug-ins, claiming, and claim rules
- nmp: Provides command-line options to the default Native Multipathing Plug-in (NMP)
- san: Provides display and reset options for the available adapter types, including Fibre Channel, iSCSI, Fibre Channel over Ethernet (FCoE), and SAS
- vmfs: Provides the option of upgrading a VMFS3 datastore to VMFS5 and using the command line to manage snapshots and extents
- filesystem: Includes operations such as mounting, unmounting, rescanning, listing, and performing an automount on VMFS and NFS datastores
- nfs: Provides a way to add, remove, and list NFS datastores using the command line
Slide 03-16: Viewing vSphere Network Information
The esxcli network command has the following options:
- ens: Lists and manipulates the Enhanced Networking Stack (ENS) feature on a virtual switch
- firewall: Provides a way to view, load, refresh, set, and unload firewall settings
- ip: Provides a way to view and configure properties of the VMkernel interfaces to include DNS, Internet Protocol Security (IPsec), and route information
- nic: Provides a command-line interface for physical NIC operations including activating and deactivating the adapter, setting some general options, and listing the current NIC setup
- port: Provides port information
- sriovnic: Lists single root I/O virtualization capable physical adapters
- vm: Lists networking information for VMs that have active ports and lists ports used by VMs
- vswitch: Provides command-line options for standard and distributed switches
- diag: Sends ICMP echo requests to network hosts
Slide 03-17: Viewing vSAN Information
Slide 03-18: Viewing Hardware Information
Slide 03-19: Using the vim-cmd Tool
Slide 03-20: Using the vim-cmd Tool to Manipulate VMs
Slide 03-21: Using esxtop Utility
Slide 03-22: Getting Help for esxtop Utility
Lab 1: Using the Command Line – ?? Minutes
Lab 2: Using vim-cmd Commands – ?? Minutes
Lesson 2: Using Command-Line Tools – ?? Minutes
Slide 03-28: About Standalone ESXCLI
vSphere CLI is not supported in vSphere 7.0 and later, but all of its existing capabilities remain available through more API-centric tools, such as the Standalone ESXCLI package.
For information about vSphere CLI and vSphere 7, see VMware knowledge base article 78473 at https://kb.vmware.com/s/article/78473.
For information about the vSphere CLI command set, see vSphere Command-Line Interface Reference at https://code.vmware.com.
For vSphere CLI, SDK, and API documentation, see https://www.vmware.com/support/pubs/sdk_pubs.html.
Slide 03-29: ESXCLI Authentication from Standalone ESXCLI
You can use variables to pass authentication information.
A best practice is to preload the credential information of each ESXi host that you manage into the credential store of your Standalone ESXCLI server.
Slide 03-30: Manual ESXCLI Authentication (1)
You can download the certificate authority (CA) digital certificate from any vCenter instance:
- Open the URL http://<vCenter_system_FQDN> in your browser.
- Click Download trusted root certificates.
  A ZIP file is downloaded to your desktop.
- Extract the .0 and .r0 files from the ZIP file.
  The .0 file is a PEM-encoded digital security certificate. The .r0 file is the certificate revocation list.
- Rename the .0 file to .crt.
- Copy the file to the correct location on your Standalone ESXCLI platform.
  The location varies depending on the operating system. On Ubuntu Linux servers, the location is /usr/local/share/ca-certificates.
- Install the certificate.
  On Ubuntu Linux servers, you install the certificate using the update-ca-certificates command.
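The extract, rename, and copy steps above could be scripted roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the ZIP layout is not assumed beyond containing files that end in .0, and the demo uses a throwaway ZIP with placeholder content rather than a real vCenter download. Running update-ca-certificates afterward is left as a manual step because it requires root privileges.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import List


def stage_ca_certificates(zip_path: Path, dest_dir: Path) -> List[Path]:
    """Copy every .0 file in the ZIP to dest_dir, renamed with a .crt extension."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith(".0"):
                # Drop any folder prefix and swap the .0 suffix for .crt.
                target = dest_dir / (Path(member).stem + ".crt")
                with archive.open(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                staged.append(target)
    return staged


if __name__ == "__main__":
    # Demo with a fake ZIP; file names and contents are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        fake_zip = Path(tmp) / "download.zip"
        with zipfile.ZipFile(fake_zip, "w") as archive:
            archive.writestr("certs/example.0", "placeholder certificate")
            archive.writestr("certs/example.r0", "placeholder revocation list")
        for path in stage_ca_certificates(fake_zip, Path(tmp) / "ca-certificates"):
            print(path.name)
```

On an Ubuntu Linux server, dest_dir would be /usr/local/share/ca-certificates, followed by the update-ca-certificates command mentioned above.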
Slide 03-31: Manual ESXCLI Authentication (2)
Slide 03-32: Manual ESXCLI Authentication (3)
Slide 03-33: Digital Certificate Authentication
Slide 03-34: Credential Store Authentication
You must add the username and password to the credential store before you add the thumbprint to the credential store.
After you add the username, password, and thumbprint to the credential store, you can use any esxcli command with only the name of the ESXi host.
vSphere SDK for Perl can be downloaded here: https://developer.vmware.com/web/sdk/vsphere-perl
Slide 03-35: About DCLI
Slide 03-36: Running DCLI Commands from vCenter Server Appliance
Slide 03-37: Using DCLI and Certificates
A best practice is to download the trusted root CA certificates to your Standalone DCLI system from https://<vCenter FQDN>/certs/download.zip.
You can use Data Center CLI (DCLI) without digital certificates, but this practice is not recommended.
Unlike with the ESXCLI commands, you must use the fully qualified domain name (FQDN) of the vCenter system when you add the CA certificate.
Slide 03-38: DCLI Command Example
Slide 03-39: Choosing a Method for Running CLI-Based Commands
Slide 03-40: About PowerCLI
Download and find VMware PowerCLI documentation at https://code.vmware.com/tool/vmware-powercli
VMware PowerCLI is better suited to scripting than to troubleshooting, but if you are familiar with PowerShell, you can also use it as a troubleshooting tool.
Slide 03-41: vSphere PowerCLI Cmdlet Structure
Slide 03-42: vSphere PowerCLI Cmdlet Types
Slide 03-43: Connecting with vSphere PowerCLI
Slide 03-44: vSphere PowerCLI Cmdlet Example
Slide 03-45: Using the Show-Command
Lab 3: Using Standalone ESXCLI and DCLI – ?? Minutes
Lesson 3: Logging and Log Files – ?? Minutes
Slide 03-06:
Slide 03-07:
Slide 03-08:
Slide 03-09:
Slide 03-10:
Slide 03-11:
Slide 03-12:
Slide 03-13:
Slide 03-14:
Slide 03-15:
Slide 03-16:
Lesson 4: VMware Skyline Overview – ?? Minutes
Slide 03-06:
Slide 03-07:
Slide 03-08:
Slide 03-09:
Slide 03-10:
Slide 03-11:
Slide 03-12:
Slide 03-13:
Slide 03-14:
Slide 03-15:
Slide 03-16:
Slide 03-17:
Module 04:
Lesson 1:
Slide 04-06:
Slide 04-07:
Slide 04-08:
Slide 04-09:
Slide 04-10:
Slide 04-11:
Slide 04-12:
Slide 04-13:
Slide 04-14:
Slide 04-15:
Slide 04-16:
Slide 04-17:
Module 05:
Lesson 1:
Slide 05-06:
Slide 05-07:
Slide 05-08:
Slide 05-09:
Slide 05-10:
Slide 05-11:
Slide 05-12:
Slide 05-13:
Slide 05-14:
Slide 05-15:
Slide 05-16:
Slide 05-17:
Module 06:
Lesson 1:
Slide 06-06:
Slide 06-07:
Slide 06-08:
Slide 06-09:
Slide 06-10:
Slide 06-11:
Slide 06-12:
Slide 06-13:
Slide 06-14:
Slide 06-15:
Slide 06-16:
Slide 06-17:
Module 07:
Lesson 1:
Slide 07-06:
Slide 07-07:
Slide 07-08:
Slide 07-09:
Slide 07-10:
Slide 07-11:
Slide 07-12:
Slide 07-13:
Slide 07-14:
Slide 07-15:
Slide 07-16:
Slide 07-17:
Module 08:
Lesson 1:
Slide 08-06:
Slide 08-07:
Slide 08-08:
Slide 08-09:
Slide 08-10:
Slide 08-11:
Slide 08-12:
Slide 08-13:
Slide 08-14:
Slide 08-15:
Slide 08-16:
Slide 08-17:
Module 09:
Lesson 1:
Slide 09-06:
Slide 09-07:
Slide 09-08:
Slide 09-09:
Slide 09-10:
Slide 09-11:
Slide 09-12:
Slide 09-13:
Slide 09-14:
Slide 09-15:
Slide 09-16:
Slide 09-17:
Module 10:
Lesson 1:
Slide 10-06:
Slide 10-07:
Slide 10-08:
Slide 10-09:
Slide 10-10:
Slide 10-11:
Slide 10-12:
Slide 10-13:
Slide 10-14:
Slide 10-15:
Slide 10-16:
Slide 10-17:
Module 11:
Lesson 1:
Slide 11-06:
Slide 11-07:
Slide 11-08:
Slide 11-09:
Slide 11-10:
Slide 11-11:
Slide 11-12:
Slide 11-13:
Slide 11-14:
Slide 11-15:
Slide 11-16:
Slide 11-17:
Module 12:
Lesson 1:
Slide 12-06:
Slide 12-07:
Slide 12-08:
Slide 12-09:
Slide 12-10:
Slide 12-11:
Slide 12-12:
Slide 12-13:
Slide 12-14:
Slide 12-15:
Slide 12-16:
Slide 12-17:
Module 13:
Lesson 1:
Slide 13-06:
Slide 13-07:
Slide 13-08:
Slide 13-09:
Slide 13-10:
Slide 13-11:
Slide 13-12:
Slide 13-13:
Slide 13-14:
Slide 13-15:
Slide 13-16:
Slide 13-17:
Module 14:
Lesson 1:
Additional Resources
VMware Master glossary – https://www.vmware.com/pdf/master_glossary.pdf