Domain Name Validation Class for PHP
By the time I finish writing this, there will probably be 10 new domain validation routines uploaded somewhere on the Internet. The only problem is that the majority of them are incomplete, and incomplete can be the equivalent of wrong, depending on what you are looking for in a validation routine. The problem? Most developers don’t read the requirements and specifications. The majority of developers do not take the time to RTFM (read the freaking manual) for anything. What makes you think we will spend time reading the ever so boring RFC documents?
Domain validation is a common issue as evidenced by the more than 42,000,000 results on Google when you search for domain name validation php. But who can you trust to do it right? If I were to ask you what makes a valid domain name, do you know? Do you know how long a domain name can be? Do you even really know what a domain name is? Continue reading, and I bet you will be enlightened.
A domain name is one or more words, known as nodes and separated by periods, followed by what is known as a top level domain like .com, .net, or .org, just to name a few. So blogchuck.com or www.blogchuck.com would be examples of valid domain names. But what makes them valid?
If you want to read all of the RFCs (Request For Comments) to better understand what a domain is, go ahead. But a summary of the specifications found in the RFCs is as follows.
- Nodes are the group of characters found between each period in the domain name (my.mail.blogchuck.com nodes would be: my, mail, blogchuck, and com)
- The last node of the domain name is known as the Top Level Domain or TLD
- Each node starts with a letter or number; ends with a letter or number; and contains letters, numbers, and hyphens in between; all other characters are invalid
- The TLD is a specific group of letters (no other characters) taken from a published list of valid TLDs (IANA maintains the authoritative one)
- Each node can only be a maximum of 63 characters long
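- Total length of the domain name is limited to 255 characters (strictly, RFC 1035 counts 255 octets in the encoded form, which works out to 253 visible characters in the dotted form you type)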
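To make the node rules above concrete, here is a minimal sketch of a single-node check in PHP. The function name isValidNode is hypothetical and only for illustration; it covers the character and 63-character rules for one node and does not look at the TLD list or the total length.

```php
<?php
// Hypothetical helper (illustration only): checks ONE node against the rules above.
function isValidNode(string $node): bool
{
    // First and last character: a letter or a number.
    // Middle characters: letters, numbers, or hyphens.
    // {0,61} in the middle caps the whole node at 63 characters.
    // The D modifier stops "$" from matching before a trailing newline.
    return preg_match('/^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/iD', $node) === 1;
}

var_dump(isValidNode('blogchuck')); // bool(true)
var_dump(isValidNode('my-mail'));   // bool(true)
var_dump(isValidNode('-mail'));     // bool(false), starts with a hyphen
var_dump(isValidNode(''));          // bool(false), empty node
```

Splitting my.mail.blogchuck.com on the periods and running each piece through a check like this is the heart of the validation; the TLD and total length need their own checks.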
There may be a few things that raise some questions. I will try to answer a few that may (or may not) be obvious.
Q: Why can’t a TLD be a number?
A: Because then the DNS (Domain Name System), which is responsible for resolving domain names to their addresses, would have a difficult time determining whether it is looking at an IP address or a TLD. For example, you can legitimately set up 192.168.0.com and it would NOT be an IP address. However, 192.168.0.1 is an IP address. Do you see how that could be confusing? Any number from 0 to 255 is valid in each of the four parts of an IP (IPv4) address. And imagine how difficult it would be for users to determine whether they were visiting a real IP address or whether the server was faking it with a valid domain name.
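A quick way to see the difference between the two examples in that answer is PHP's built-in filter_var() with FILTER_VALIDATE_IP. This is only a sketch; the wrapper name isIpv4Address is made up for illustration.

```php
<?php
// Hypothetical wrapper (illustration only) around PHP's built-in IP filter.
function isIpv4Address(string $candidate): bool
{
    // filter_var() returns the input string when it is a valid IPv4 address,
    // and false otherwise.
    return filter_var($candidate, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
}

foreach (['192.168.0.1', '192.168.0.com'] as $candidate) {
    echo $candidate, ' is ', (isIpv4Address($candidate) ? '' : 'not '), 'an IP address', PHP_EOL;
}
```

Because the last node of 192.168.0.com contains letters, it can never be mistaken for an address. An all-numeric TLD would remove that guarantee.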
Q: I thought a valid domain could be 255 characters long?
A: It can. But it must be broken into nodes of 63 characters or fewer (the TLD counts as a node too). Try an experiment. Go to godaddy.com, type in a domain name, and see how many characters they let you put in the box. You cannot register a domain name longer than 63 characters. What makes the 255 characters allowable are the other nodes.
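The arithmetic is easier to see with a little code. The sketch below uses a hypothetical helper, measureDomain, that reports the total length and the longest node; the test names are built with str_repeat() purely for illustration.

```php
<?php
// Hypothetical helper (illustration only): total length and longest node of a name.
function measureDomain(string $domain): array
{
    $nodes = explode('.', $domain);

    return [
        'total'   => strlen($domain),                    // periods count toward the total
        'longest' => max(array_map('strlen', $nodes)),   // must not exceed 63
    ];
}

// Three 63-character nodes plus "com": long overall, but every node is within the limit.
$long = implode('.', [str_repeat('a', 63), str_repeat('b', 63), str_repeat('c', 63), 'com']);
print_r(measureDomain($long));   // total 195, longest 63

// One 64-character node plus "com": short overall, but the node is too long.
$tooLongNode = str_repeat('a', 64) . '.com';
print_r(measureDomain($tooLongNode));   // total 68, longest 64
```

The first name is fine on both counts; the second is far shorter but still invalid, because the node limit is checked separately from the total.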
THE PROBLEM
Now that we know what constitutes a valid domain, do any of the solutions found on the Internet meet these requirements? Most of them do not. For example, the very first search result (found at the time this was written) shows the following solution for using PHP to check the URL:
preg_match ("/^[a-z0-9][a-z0-9\-]+[a-z0-9](\.[a-z]{2,4})+$/i", $url)
What is wrong here? The part I want to focus on immediately is the part that validates the TLD. The portion of the regex that does that is (\.[a-z]{2,4}). This will not validate the TLD. It will only check that the TLD is between 2 and 4 letters long. What about the valid TLD of MUSEUM or TRAVEL? What about the invalid TLD of AA or WA? MUSEUM and TRAVEL won’t pass and AA or WA will. Not to mention that it does not check the length of the URL or any of its nodes. That is NOT domain validation. But this is a very common solution found on the Internet.
No matter how common, it is not correct.
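You can verify this yourself by running the quoted pattern against a few names. The sketch below wraps the same regex (only the PHP quoting differs) in a hypothetical function, matchesQuotedPattern, and the test names are illustrative.

```php
<?php
// Hypothetical wrapper (illustration only) around the pattern quoted above.
function matchesQuotedPattern(string $name): bool
{
    $pattern = '/^[a-z0-9][a-z0-9\-]+[a-z0-9](\.[a-z]{2,4})+$/i';

    return preg_match($pattern, $name) === 1;
}

var_dump(matchesQuotedPattern('blogchuck.com'));               // bool(true)
var_dump(matchesQuotedPattern('blogchuck.museum'));            // bool(false), valid TLD rejected
var_dump(matchesQuotedPattern('blogchuck.aa'));                // bool(true), invalid TLD accepted
var_dump(matchesQuotedPattern(str_repeat('a', 100) . '.com')); // bool(true), 100-character node accepted
```

The last case shows the missing length check: a 100-character node sails through even though the limit is 63.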
THE SOLUTION
To try to help resolve this, I have written an open source solution that I have posted to my Google Code repository. You can find it at: http://code.google.com/p/blogchuck/wiki/DomainsClass
I link to the wiki because it contains the link to the source code as well as the description of what it does. In a nutshell, my class will take the domain and do the following:
- verifies the number of characters in the domain does not exceed 255
- splits the domain into nodes
- validates each node, ensuring it contains 63 characters or fewer
- validates each node to make sure it starts with a letter or number, ends in a letter or number, only contains letters, numbers, and/or hyphens, and has at least 1 character (so something like blogchuck..com won’t sneak through)
- validates the last node to make sure it is a valid TLD
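The steps above can be sketched in a few lines of PHP. This is not the code from the class itself, just a minimal illustration of the same sequence; the function name isValidDomainSketch and the tiny TLD sample are assumptions, and a real TLD list would be much longer.

```php
<?php
// Minimal sketch of the steps listed above (illustration only, not the actual class).
// $validTlds is a hypothetical, lowercase sample of the real TLD list.
function isValidDomainSketch(string $domain, array $validTlds): bool
{
    // Step 1: the whole name must not exceed 255 characters.
    if (strlen($domain) > 255) {
        return false;
    }

    // Step 2: split the name into nodes.
    $nodes = explode('.', $domain);

    foreach ($nodes as $node) {
        // Step 3: each node is limited to 63 characters.
        if (strlen($node) > 63) {
            return false;
        }
        // Step 4: letter or number at both ends, hyphens allowed in between.
        // An empty node (as in blogchuck..com) fails this pattern too.
        if (preg_match('/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/iD', $node) !== 1) {
            return false;
        }
    }

    // Step 5: the last node must appear in the TLD list.
    $tld = strtolower($nodes[count($nodes) - 1]);

    return in_array($tld, $validTlds, true);
}

$sampleTlds = ['com', 'net', 'org', 'museum', 'travel'];

var_dump(isValidDomainSketch('www.blogchuck.com', $sampleTlds)); // bool(true)
var_dump(isValidDomainSketch('blogchuck.museum', $sampleTlds));  // bool(true)
var_dump(isValidDomainSketch('blogchuck..com', $sampleTlds));    // bool(false)
var_dump(isValidDomainSketch('blogchuck.aa', $sampleTlds));      // bool(false)
```

Note that a bare TLD such as com passes this sketch, since the listed steps do not require a second node; add a node-count check if that matters to you.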
As usual, if you have any feedback, I would love to hear it. But for now, I am happy knowing that I can help clean up the global code repository known as the Internet, by providing quality solutions available to anyone who is interested.
If you have something you would like to see a solution for, please drop me a line. I would love to research and write about it. One of these days, I may even contribute to an RFC or submit my own. Please avoid asking me to develop an entire website. I CAN do that, but it will cost you. I do little research projects like this as a hobby. Full-blown projects are handled by my consulting company. After all, I do have to eat.
Happy coding!








