I ran into an interesting problem while working on a recent project and would like to share my experience. I had designed and implemented a new System Center 2012 R2 Configuration Manager infrastructure integrated with MDT 2013. Initial testing of Operating System Deployment (OSD) of Windows 7 SP1 was successful. However, when testing the deployment of Windows 7 to another computer model, the task sequence failed to download the MDT package after the machine had joined the domain and the SCCM client had been installed.
The failure occurred after the restart that follows the "Setup Windows and ConfigMgr" step, which itself comes after the driver installation. The same MDT package had already been downloaded successfully earlier in the task sequence (TS), while running in Windows PE. The machine joined the domain successfully, and after the error it was still able to copy the client logs to a network share. The error in the client's smsts.log was 0x80072ee2, which maps to ERROR_INTERNET_TIMEOUT.
Although I noticed that the same network interface card (NIC) driver was being used in Windows PE and in the full Windows 7 installation, I obtained the latest NIC driver directly from the manufacturer (Intel), imported it into the driver package and disabled the older versions of it. The problem persisted.
Next I considered that I might be running into the download issue fixed by hotfix KB2905002. I had installed the hotfix on the server side, but its client portion was not being applied when the SCCM client was installed during OSD. I configured the client part of the hotfix to be installed during OSD using the PATCH property in the "Setup Windows and ConfigMgr" step, and validated that it was applied by checking the client version after the Windows 7 deployment. Unfortunately, the problem persisted.
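For reference, here is a minimal sketch of what the installation properties of the "Setup Windows and ConfigMgr" step can look like when the PATCH property is used; the package ID, file name and path below are placeholders and will differ depending on how you make the hotfix files available to the task sequence:

    PATCH="C:\_SMSTaskSequence\OSD\<HotfixPackageID>\configmgr2012ac-r2-kb2905002-x64.msp"

The path simply has to point to the location where the .msp for the client hotfix is present on the machine at the time the client is installed.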
I then observed that the error occurred only on a newer computer model using a modern network card; the deployment worked on the older computer models. All computers were tested while connected to the same port on the local switch, and the error occurred randomly on any file that is part of the MDT package. I decided to capture network traces on the client and on the server to investigate further.
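As an aside, if installing a capture tool on the distribution point is not possible, Windows Server also has a built-in tracing facility that can be used for the server-side capture (this is not what I used here, just an alternative worth knowing about):

    netsh trace start capture=yes tracefile=C:\Traces\dp-osd.etl maxsize=512
    rem reproduce the failure, then stop the trace
    netsh trace stop

The resulting .etl file can then be opened in Microsoft Network Monitor for analysis.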
The network traces revealed the following behavior during failure:
- The client requests the files of the MDT package from the appropriate distribution point by sending HTTP "GET" requests. It sends one "GET" request per file, using a URI similar to /SMS_DP_SMSPKG$/<MDTpkgID>/sccm?/Tools/x86/<MDTfile>.
- Many files are downloaded successfully.
- Eventually the failure occurs on a random file from the MDT package.
While downloading each file of the MDT package, the client resets its connection to TCP port 80 on the server and then immediately opens a second connection to download the same file. This happens for every file, and the file is downloaded successfully, except for the (random) file on which the failure occurs. When the failure happens, the client's TCP connection requests (SYN packets) to TCP port 80, sent after it resets the connection, are not answered by the server, even though the server does receive them (I captured simultaneous network traces on both the client and the server). The client then times out because none of the three SYN packets it sends as part of the TCP three-way handshake is answered with a SYN-ACK from the server. This is when error 0x80072ee2 is logged in the smsts.log file.
To capture a network trace on the client side, I used Wireshark on a monitoring computer connected to the same network device as the machine to which we were deploying Windows 7; the network device was in turn connected to the switch port at the wall. Wireshark ran in promiscuous mode, so all traffic going to the target machine of the deployment also reached the port of the monitoring machine. Wireshark's "Expert Info" reported many warnings of type "Previous segment not captured" and "ACKed segment that wasn't captured", and these occurred throughout the trace, not just at the start of the capture. On the server side, the Network Monitor capture marked many TCP acknowledgments (ACKs) from the client as "Dup Ack" (duplicates). All of this may indicate a packet loss problem (packets being dropped).
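If you want to isolate these symptoms quickly in your own captures, the following Wireshark display filters map to the warnings above (field names as used by current Wireshark versions, so verify them against your release):

    tcp.analysis.lost_segment
    tcp.analysis.ack_lost_segment
    tcp.analysis.duplicate_ack
    tcp.flags.syn == 1 and tcp.flags.ack == 0

The first two correspond to the "Previous segment not captured" and "ACKed segment that wasn't captured" warnings, the third shows duplicate acknowledgments, and the last lists connection attempts (SYN packets), which makes it easy to spot SYNs that never receive a SYN-ACK.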
So if there was a packet loss issue, why would the problem appear only on the newer computer models? The newer models have more resources, are faster and have modern NICs, and my customer's network was fast as well, with most links running over fiber optics. Analyzing the network captures in more detail, I noticed that the client and the server had agreed to use TCP Window Scaling.
Excerpts from RFC1323:
This memo presents a set of TCP extensions to improve performance over large bandwidth*delay product paths and to provide reliable operation over very high-speed paths. It defines new TCP options for scaled windows and timestamps
The introduction of fiber optics is resulting in ever-higher transmission speeds, and the fastest paths are moving out of the domain for which TCP was originally engineered. This memo defines a set of modest extensions to TCP to extend the domain of its application to match this increasing network capability
TCP performance depends not upon the transfer rate itself, but rather upon the product of the transfer rate and the round-trip delay. This "bandwidth*delay product" measures the amount of data that would "fill the pipe"; it is the buffer space required at sender and receiver to obtain maximum throughput on the TCP connection over the path, i.e., the amount of unacknowledged data that TCP must handle in order to keep the pipeline full. TCP performance problems arise when the bandwidth*delay product is large. We refer to an Internet path operating in this region as a "long, fat pipe", and a network containing this path as an "LFN" (pronounced "elephan(t)").
Expanding the window size to match the capacity of an LFN results in a corresponding increase of the probability of more than one packet per window being dropped. This could have a devastating effect upon the throughput of TCP over an LFN. In addition, if a congestion control mechanism based upon some form of random dropping were introduced into gateways, randomly spaced packet drops would become common, possibly increasing the probability of dropping more than one packet per window.
Excerpt from Wikipedia:
Because some routers and firewalls do not properly implement TCP Window Scaling, it can cause a user's Internet connection to malfunction intermittently for a few minutes, then appear to start working again for no reason. There is also an issue if a firewall doesn't support the TCP extensions.
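To put these excerpts into perspective with some illustrative numbers (the link speed and round-trip time below are assumptions for the sake of the example, not measurements from this environment), consider a 1 Gbit/s path with a 2 ms round-trip time:

    bandwidth * delay = 1,000,000,000 bit/s * 0.002 s = 2,000,000 bits, or about 250,000 bytes in flight
    largest TCP receive window without scaling = 65,535 bytes
    250,000 / 65,535 is about 4, so a window scale factor of 4 (shift count 2) is needed to fill the pipe

Without window scaling the sender can never have much more than 64 KB of unacknowledged data outstanding, so the pipe stays partly empty; with scaling the window grows to fill it, but as RFC 1323 points out, the larger the window, the more likely it becomes that more than one packet per window is lost.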
To test whether TCP Window Scaling was causing the problem, I configured the distribution point not to use it by setting the registry parameter Tcp1323Opts to 0. See the Microsoft documentation on the Tcp1323Opts registry value for information on how to configure this.
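For reference, the value lives under the TCP/IP parameters key, and a value of 0 disables both Window Scaling and Timestamps (a restart is required for the change to take effect). A minimal way to set it from an elevated command prompt:

    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v Tcp1323Opts /t REG_DWORD /d 0 /f

Setting the value to 3, or deleting it, re-enables both options.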
After restarting the server, the next Windows 7 deployment to one of the new computer models that had been failing completed successfully.
I would have liked to examine a network trace of the client downloading all the files of the MDT package while in Windows PE, to see whether TCP Window Scaling was used there, but I missed the opportunity to capture one. It would be nice to know whether the TCP/IP stack in Windows PE is capable of requesting the TCP 1323 options during the three-way TCP connection handshake, because if it isn't, that would explain why downloading all the files of the MDT package in Windows PE always worked.
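If you do end up with a capture from the Windows PE phase, checking this is straightforward: look at the TCP options carried by the SYN and SYN-ACK packets of the HTTP connections to the distribution point. In Wireshark, display filters along these lines show whether the Window Scale and Timestamps options were offered (again, field names as used by current Wireshark versions):

    tcp.flags.syn == 1 and tcp.options.wscale.shift
    tcp.flags.syn == 1 and tcp.options.timestamp.tsval

Keep in mind that each option is only used on a connection when both sides include it in their SYN and SYN-ACK.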
Although disabling TCP Window Scaling and Timestamps on the distribution point allowed our Windows 7 deployments to continue, I don't recommend doing this without first checking your network devices, such as gateways, routers and firewalls, for potential compatibility issues with the TCP 1323 extensions. After all, these extensions exist to take advantage of the increased network capability provided by "fat pipe" networks such as fiber optic networks.