It's a networking problem...
This is what most developers say when there is any issue connecting a program running on one server to a program running on another server. When you are a programmer and you make a remote method call, or even open a socket connection, you are relying on all the layers of software and hardware between your two services to make it happen and give you the connection you need to share data. When this fails, it is typically referred to as a leaky abstraction.
When there is what I now call a "Connectivity Issue" (I used to call it a firewall problem, or at best a network problem), most application developers will throw their hands up in the air and say it just doesn't work! Some will call a network engineer, who might not be able to help (unless they have a good application background). Some will call a Systems Administrator, who might not be able to help either, depending on where the problem is. Finally, a good architect should be able to diagnose and troubleshoot these types of problems, usually very quickly if they have access.
Sidebar: one of the biggest issues in large companies today is the siloization of IT folks into functional groups where no one can solve a problem like this. Even if there are people with the talent, they usually don't have the access to solve it. This is one of the reasons I am starting to believe that software engineering doesn't scale; see a future blog post on this one.
Now I would like to write down the steps I take (or should take, because sometimes I skip a step and it causes big problems) to solve these types of issues.
First I determine if it is a (1) DNS problem. If you are using a name and not an IP address, it could be a DNS issue. DNS issues can be complex to track down. They can be a caching issue, a misconfigured iterative resolver, an unhealthy iterative resolver, or one of many other problems. For troubleshooting, I recommend you use IPs or, if you need a name, add an entry to your /etc/hosts file. For further DNS troubleshooting info you can look here.
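To see what a name actually resolves to from the client box, and how long the lookup takes, a small script can help. This is only a minimal sketch in Python: the function name resolve is my own, and it uses whatever resolver the operating system is configured with (including /etc/hosts), so it shows what your program would see rather than what any particular DNS server says.

```python
import socket
import time


def resolve(name, port=80):
    """Resolve a name the way a client program would.

    Returns (addresses, seconds_taken, error_message_or_None).
    """
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    except socket.gaierror as err:
        # Resolution failed outright: report the error instead of addresses.
        return [], time.monotonic() - start, str(err)
    # Each entry ends with a sockaddr tuple whose first item is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return addresses, time.monotonic() - start, None
```

Call resolve() with the name your application uses. An empty address list, an unexpected address, or a lookup that takes seconds all point back at DNS.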
Once you have eliminated DNS as an issue, you need to figure out if you can get (2) basic connectivity to the server. For simplicity I recommend turning off all your software services and running netcat on your server (example: nc -l 1234 to listen on port 1234; some netcat versions want nc -l -p 1234 instead). Remember that on Linux you need to be root to listen on a privileged port (below 1024) like 80. Once netcat is running on the server, go to the client and telnet to the server on the same port (telnet IPADDRESS 1234). You will not get a login prompt, just a "Connected" message. Here is an example client --- server conversation:
// Server, joelslinux:
[root@joelslinux .ssh]# nc -l 1234
// Client, Joels-Macbookpro:
Joels-Macbookpro:/Users/joelnylund root# telnet 192.168.0.35 1234
Connected to 192.168.0.35.
Escape character is '^]'.
^CConnection closed by foreign host.
As you can see, the connection is established. Netcat is not really an echo server: anything you type into telnet shows up in the netcat terminal on the server, and anything typed into netcat shows up on the client. If you can get connectivity with netcat, you have narrowed your problem down to one of the application layers on either the client or the server.
You will be very tempted to skip the above step. Don't do it. I have, and it always costs me time later. This is not hard to do, and netcat is on almost every Linux server and on OS X. You can also get a Windows version, so you have no excuse.
If you can't get connectivity, you have narrowed your problem down to either an operating system problem or a true networking issue.
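If telnet is not installed on the client, the same basic check can be scripted. The sketch below is a rough stand-in in Python, not a replacement for the netcat test. The function name check_tcp and the three-second timeout are my own choices, and reading "timeout" as a dropped packet somewhere in between is a rule of thumb, not a certainty.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Try one TCP connection and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered with a reset: it is reachable, but nothing
        # is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall dropping packets or a
        # routing problem, but not proof of either.
        return "timeout"
    except OSError as err:
        # Anything else, for example "no route to host".
        return f"error: {err}"
```

For the conversation above, check_tcp("192.168.0.35", 1234) should return "connected" while netcat is listening. A "refused" result says the box is reachable and points at the server side, while "timeout" points more toward the OS or network steps further down.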
If you narrow it down to an (3) application layer problem, I recommend using the same technique as above, but replace netcat with the real server while still using telnet as the client. The nice thing about telnet is that you can type or paste most simple text-based protocols right into the command line, which lets you diagnose the issue.
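The "paste the protocol into telnet" trick can also be scripted when you want to repeat the same request many times. This is a minimal sketch in Python, assuming a simple text-based protocol where the server answers and then closes the connection or goes quiet. The function name send_text is hypothetical.

```python
import socket


def send_text(host, port, text, timeout=3.0):
    """Send text to a TCP service and return everything it sends back."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(text.encode("ascii"))
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break  # server closed the connection
                chunks.append(data)
        except socket.timeout:
            pass  # server went quiet; return what we have so far
    return b"".join(chunks)
```

For a web server you might try send_text("192.168.0.35", 80, "GET / HTTP/1.0\r\n\r\n") and read the raw response, exactly as you would in a telnet session.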
Another technique to diagnose application-level problems is using tcpdump.
Here is an example tcpdump with netcat listening on 1234, while I telnet in and type "joel is here now" on one line:
[root@joelslinux .ssh]# /usr/sbin/tcpdump src port 1234
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
20:34:38.007751 IP 192.168.0.35.1234 > 192.168.0.102.54490: S 1214955266:1214955266(0) ack 1626693439 win 5792
joel is here now
20:34:43.166928 IP 192.168.0.35.1234 > 192.168.0.102.54490: . ack 19 win 1448
20:34:49.286864 IP 192.168.0.35.1234 > 192.168.0.102.54490: . ack 24 win 1448
????20:34:53.391475 IP 192.168.0.35.1234 > 192.168.0.102.54490: . ack 29 win 1448
I will not go into detail on how to read tcpdump output here, but to learn about it go here.
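One small thing you can read off the capture above without much tcpdump knowledge is how many bytes the server acknowledged each time. After the first packet, tcpdump prints ack numbers relative to the start of the connection, so the difference between successive acks is the number of payload bytes received. The sketch below is a quick Python illustration. The regular expression only matches the plain ACK lines in the exact output format shown above, and the function name bytes_acked is my own.

```python
import re

# Matches plain ACK lines such as "...54490: . ack 19 win 1448".
ACK_RE = re.compile(r": \. ack (\d+) win")


def bytes_acked(lines):
    """Return the number of payload bytes covered by each plain ACK line."""
    sizes = []
    previous = 1  # relative sequence number just after the SYN
    for line in lines:
        match = ACK_RE.search(line)
        if match is None:
            continue  # not a plain ACK (for example the SYN-ACK line)
        ack = int(match.group(1))
        sizes.append(ack - previous)
        previous = ack
    return sizes


capture = [
    "20:34:38.007751 IP 192.168.0.35.1234 > 192.168.0.102.54490: S 1214955266:1214955266(0) ack 1626693439 win 5792",
    "20:34:43.166928 IP 192.168.0.35.1234 > 192.168.0.102.54490: . ack 19 win 1448",
    "20:34:49.286864 IP 192.168.0.35.1234 > 192.168.0.102.54490: . ack 24 win 1448",
    "20:34:53.391475 IP 192.168.0.35.1234 > 192.168.0.102.54490: . ack 29 win 1448",
]
print(bytes_acked(capture))  # [18, 5, 5]
```

The first value, 18, is consistent with the 16 characters of "joel is here now" plus the carriage return and line feed that telnet sends at the end of a line.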
(4) Operating System Issues:
Sometimes network problems are in the local operating system. It could be running an OS-level firewall (check that first). It could also be a permission or resource issue, such as a user not being able to get a file handle because there are too many open sockets. Use netstat to troubleshoot these types of issues. It could be a routing problem: depending on the number of NICs in the box and the complexity of the networking, the routing could be set up so the default route sends you to la la land and not where you want to go. You should get an SA to help you check your routes, or try to get to your destination on another protocol or port; if that works, you can usually rule out a routing issue. Finally, the networking itself could be misconfigured: the netmask, the IP address, etc. could all be an issue.
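Two of these checks can be done from inside a program without bothering an SA. The Python sketch below is an illustration under my own assumptions, not a full diagnosis. The function names are hypothetical, the resource module is only available on Unix-like systems, and the UDP trick only shows which local address the kernel would pick for a destination, not whether packets actually arrive.

```python
import resource
import socket


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def source_address_for(dest_ip, port=80):
    """Return the local IP the OS would use to reach dest_ip.

    Connecting a UDP socket sends no packets; it only makes the kernel
    pick a route and a source address for that destination.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((dest_ip, port))
        return sock.getsockname()[0]
```

If source_address_for() returns the address of a NIC you did not expect, or raises an OSError such as "Network is unreachable", the routing table is a good place to look next. A low soft limit from fd_limits() on a busy server is a hint that the "too many sockets" case may apply.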
(5) Real Networking Issues:
When it is a real networking issue it could be one of many things. Don't always assume it's a firewall issue. It might be, but it could also be a bad cable, a bad network jack, a router issue, a load balancer, a quality of service device, or one of many more. The best thing to do when communicating with network engineers is to say it's a connectivity issue and that you have no idea where. If you say it's a firewall issue, they may think you know what you're talking about, just check the firewall, and get back to you saying it's fine and not the problem. If you say it's a connectivity issue, they will often help you troubleshoot it using some of the same tools as above while they watch the various network devices. It's also important that you tell them where the machine is located. In a typical production configuration there are front office boxes and back office boxes. If you are trying to get out to somewhere in cyberspace from the front office boxes, you will typically NAT out through the front side firewall, or the load balancer if you are using that instead. If you are in the back office, you will often NAT out through the back side firewall, which sits between the front office and the back office.
A lot of the time this can be a NIC issue. The first thing to do is check your two boxes using mii-tool.
[root@joelslinux .ssh]# /sbin/mii-tool
eth0: negotiated 100baseTx-FD, link ok
This shows I'm running 100 Mb/s full-duplex Ethernet on my box, so if that is the expectation then I'm fine. Most of the time the issue is that you're running half duplex. Sometimes the NIC thinks everything is fine but the negotiation on the router side is the issue. In that case you will need a network engineer to look at it.
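If mii-tool is not installed, Linux exposes similar information as plain files under /sys/class/net. The sketch below reads them with Python. It assumes a Linux box, the function name link_settings is my own, and the sysfs_root parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def link_settings(iface, sysfs_root="/sys/class/net"):
    """Read link state, speed (Mb/s) and duplex for one interface from sysfs."""
    base = Path(sysfs_root) / iface
    settings = {}
    for key in ("operstate", "speed", "duplex"):
        try:
            settings[key] = (base / key).read_text().strip()
        except OSError:
            # The file is missing, or the kernel refuses to report the
            # value (this can happen when the link is down or virtual).
            settings[key] = None
    return settings
```

On a box like mine I would expect link_settings("eth0") to report a speed of 100 and a duplex of full. A duplex of half on a server NIC is worth raising with the network engineer.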
(6) Reverse DNS on web servers:
Most web servers will try to log the source of each connection, and a standard log file configuration wants the web server to do what is called a reverse DNS lookup (the equivalent of dig -x IP). Not all IPs have reverse DNS records, also called PTR records, which live under in-addr.arpa. If DNS is misconfigured on the server, this can cause connectivity and performance issues for clients trying to connect, because the web server will not hand over control of the socket to the protocol handling code until it has written its log record, and in order to write the log record it needs the reverse DNS record.
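To get a feel for how long the web server might be waiting, you can time a reverse lookup from the server itself. This is a minimal Python sketch using the operating system's resolver. The function name reverse_lookup is my own, and it measures the lookup as this script sees it, which may differ from what your web server does internally.

```python
import socket
import time


def reverse_lookup(ip):
    """Return (hostname or None, seconds taken) for a reverse DNS lookup."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        hostname = None
    return hostname, time.monotonic() - start
```

Run it on the web server against the IP of a client that is having trouble. A lookup that takes several seconds before returning None is the kind of delay that shows up to users as a slow or hanging connection.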
In this case we can't punt on the DNS issue, because we don't usually know which IPs will be connecting to the server (if we do, we can still use the /etc/hosts trick described above). Otherwise we will have to fix DNS. Here are some simple steps:
Look at the DNS servers the box is configured to use (on Linux these are normally the nameserver entries in /etc/resolv.conf). If there is more than one entry, try commenting out each entry one by one, running dig commands in between, and seeing if DNS responses start coming back, or start coming back faster. If there is only one server, do a dig, and if the response doesn't come back or comes back slowly, you will need to get that server fixed or find a new DNS server to use.
So that's it, my 6 simple steps to troubleshooting connectivity issues. I will update this and fill in the stuff I forgot as I remember.