The Linux Blog UNIX, LINUX, BSD, OSX

22 Apr 09

Block Referrer Spam

Log files are a useful tool for webmasters. It helps to know how people are finding your site, and what software they are using to view it, among other things. A strange decision by a small group of bloggers, though, has given unscrupulous marketers another window of opportunity to manipulate search engines to increase their traffic.

The decision made by these short-sighted bloggers was to display, on their site, a list of recent referrers to each page. I can't imagine any reason why a visitor might be in the least bit interested in seeing this, but nevertheless many sites now display referrers on every page.

As search engine spiders visit sites, they grab the contents of each page they visit. They use this snapshot in their index - meaning that although a page may change every minute or two, a search engine may be using a single copy of a page for several days, or even weeks.

So a referral URL that is on a page when the spiders come to visit can have quite a bit of value, if the search engine visiting uses link popularity in any way (Google uses link popularity, as do many others).

So marketers have started to use programs to visit pages using a fake referral header, to get their URLs listed on as many sites as possible, in the hopes that this will increase their traffic.

However, this renders log files almost completely useless. These fake visitors usually appear to arrive from search engines, having searched for a keyphrase relevant to their own site. They skew statistics on the number of visitors received, the countries visitors come from, the technology used to view the page, how users found the page, how long they spent on the site, and so on.

A webmaster may find their search engine rankings dropping because of this, and they may find search engines have removed them completely. Many sites that use spam techniques are quickly identified and penalised, and penalties will often be applied to sites that link to them as well.

There are plenty of techniques available for blocking referrer spam, and everyone has their favourite. Personally, I use a combination of two techniques.

The first is fairly simple - my referrer log is not indexable. I don't display referrers on the pages of my site. My referral log is publicly available, but search engines are instructed to ignore it. This removes the main incentive for people to referrer-spam my site (the other reason for this type of spam - the hope that the site owner will themselves visit the spamming URL - is less common, because it has such a low response rate).
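
One way to give search engines that instruction from an .htaccess file is an X-Robots-Tag response header. This is only a sketch of the idea, not necessarily how it is done on this site: it assumes mod_headers is available, and the filename "referrers.html" is a made-up placeholder. A robots.txt rule or a robots meta tag would do a similar job.

```apache
# Ask crawlers not to index the referrer log or follow its links
# (assumes mod_headers; "referrers.html" is a hypothetical filename)
<IfModule mod_headers.c>
  <Files "referrers.html">
    Header set X-Robots-Tag "noindex, nofollow"
  </Files>
</IfModule>
```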

Second, I use an .htaccess file to block requests from whatever I've managed to identify as either a crawler designed to find URLs to spam or a spamming URL. This is a relatively simple blacklist, and though it cannot work as a long term solution to this problem, it keeps me happy for now.

To implement this technique on your own site, first make sure you are running Apache with mod_rewrite. If you are, create a file called ".htaccess" (just that, not .htaccess.txt or anything else) and paste the following into it:

Update: 14th September 2005

The list below has been expanded substantially over the last year, and now covers much more spam than before. As stated before, this is not a practical solution to the problem in the long term, as this list can only ever get longer and longer, and may become unmaintainable, or even (eventually) slow a site to a crawl as all the rules are processed. However, as of now, it is still a useful tool.

  RewriteEngine on
  # Block Referrer Spam
  # Drugs / Herbal
  RewriteCond %{HTTP_REFERER} (sleep-?deprivation) [NC,OR]
  RewriteCond %{HTTP_REFERER} (sleep-?disorders) [NC,OR]
  RewriteCond %{HTTP_REFERER} (insomnia) [NC,OR]
  RewriteCond %{HTTP_REFERER} (phentermine) [NC,OR]
  RewriteCond %{HTTP_REFERER} (phentemine) [NC,OR]
  RewriteCond %{HTTP_REFERER} (vicodin) [NC,OR]
  RewriteCond %{HTTP_REFERER} (hydrocodone) [NC,OR]
  RewriteCond %{HTTP_REFERER} (levitra) [NC,OR]
  RewriteCond %{HTTP_REFERER} (hgh-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-hgh) [NC,OR]
  RewriteCond %{HTTP_REFERER} (ultram-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-ultram) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cialis) [NC,OR]
  RewriteCond %{HTTP_REFERER} (soma-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-soma) [NC,OR]
  RewriteCond %{HTTP_REFERER} (diazepam) [NC,OR]
  RewriteCond %{HTTP_REFERER} (gabapentin) [NC,OR]
  RewriteCond %{HTTP_REFERER} (celebrex) [NC,OR]
  RewriteCond %{HTTP_REFERER} (viagra) [NC,OR]
  RewriteCond %{HTTP_REFERER} (fioricet) [NC,OR]
  RewriteCond %{HTTP_REFERER} (ambien) [NC,OR]
  RewriteCond %{HTTP_REFERER} (valium) [NC,OR]
  RewriteCond %{HTTP_REFERER} (zoloft) [NC,OR]
  RewriteCond %{HTTP_REFERER} (finasteride) [NC,OR]
  RewriteCond %{HTTP_REFERER} (lamisil) [NC,OR]
  RewriteCond %{HTTP_REFERER} (meridia) [NC,OR]
  RewriteCond %{HTTP_REFERER} (allegra) [NC,OR]
  RewriteCond %{HTTP_REFERER} (diflucan) [NC,OR]
  RewriteCond %{HTTP_REFERER} (zovirax) [NC,OR]
  RewriteCond %{HTTP_REFERER} (valtrex) [NC,OR]
  RewriteCond %{HTTP_REFERER} (lipitor) [NC,OR]
  RewriteCond %{HTTP_REFERER} (proscar) [NC,OR]
  RewriteCond %{HTTP_REFERER} (acyclovir) [NC,OR]
  RewriteCond %{HTTP_REFERER} (sildenafil) [NC,OR]
  RewriteCond %{HTTP_REFERER} (tadalafil) [NC,OR]
  RewriteCond %{HTTP_REFERER} (xenical) [NC,OR]
  RewriteCond %{HTTP_REFERER} (melatonin) [NC,OR]
  RewriteCond %{HTTP_REFERER} (xanax) [NC,OR]
  RewriteCond %{HTTP_REFERER} (herbal) [NC,OR]
  RewriteCond %{HTTP_REFERER} (drugs) [NC,OR]
  RewriteCond %{HTTP_REFERER} (lortab) [NC,OR]
  RewriteCond %{HTTP_REFERER} (adipex) [NC,OR]
  RewriteCond %{HTTP_REFERER} (propecia) [NC,OR]
  RewriteCond %{HTTP_REFERER} (carisoprodol) [NC,OR]
  RewriteCond %{HTTP_REFERER} (tramadol) [NC]
  RewriteRule .* - [F]
  # Porn
  RewriteCond %{HTTP_REFERER} (porno) [NC,OR]
  RewriteCond %{HTTP_REFERER} (shemale) [NC,OR]
  RewriteCond %{HTTP_REFERER} (gangbang) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-cock) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-anal) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-orgy) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cock-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (anal-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (orgy-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (singles-?christian) [NC,OR]
  RewriteCond %{HTTP_REFERER} (dating-?christian) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cumeating) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cream-?pies) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cumsucking) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cumswapping) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cumfilled) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cumdripping) [NC,OR]
  RewriteCond %{HTTP_REFERER} (krankenversicherung) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cumpussy) [NC,OR]
  RewriteCond %{HTTP_REFERER} (suckingcum) [NC,OR]
  RewriteCond %{HTTP_REFERER} (drippingcum) [NC,OR]
  RewriteCond %{HTTP_REFERER} (pussycum) [NC,OR]
  RewriteCond %{HTTP_REFERER} (swappingcum) [NC,OR]
  RewriteCond %{HTTP_REFERER} (eatingcum) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cum-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-cum) [NC,OR]
  RewriteCond %{HTTP_REFERER} (sperm) [NC,OR]
  RewriteCond %{HTTP_REFERER} (christian-?dating) [NC,OR]
  RewriteCond %{HTTP_REFERER} (jewish-?singles) [NC,OR]
  RewriteCond %{HTTP_REFERER} (sex-?meetings) [NC,OR]
  RewriteCond %{HTTP_REFERER} (swinging) [NC,OR]
  RewriteCond %{HTTP_REFERER} (swingers) [NC,OR]
  RewriteCond %{HTTP_REFERER} (personals) [NC,OR]
  RewriteCond %{HTTP_REFERER} (sleeping) [NC,OR]
  RewriteCond %{HTTP_REFERER} (libido) [NC,OR]
  RewriteCond %{HTTP_REFERER} (grannies) [NC,OR]
  RewriteCond %{HTTP_REFERER} (mature) [NC,OR]
  RewriteCond %{HTTP_REFERER} (enhancement) [NC,OR]
  RewriteCond %{HTTP_REFERER} (sexual) [NC,OR]
  RewriteCond %{HTTP_REFERER} (gay-?teen) [NC,OR]
  RewriteCond %{HTTP_REFERER} (teen-?chat) [NC,OR]
  RewriteCond %{HTTP_REFERER} (gay-?chat) [NC,OR]
  RewriteCond %{HTTP_REFERER} (adult-?finder) [NC,OR]
  RewriteCond %{HTTP_REFERER} (adult-?friend) [NC,OR]
  RewriteCond %{HTTP_REFERER} (friend-?finder) [NC,OR]
  RewriteCond %{HTTP_REFERER} (friend-?adult) [NC,OR]
  RewriteCond %{HTTP_REFERER} (finder-?adult) [NC,OR]
  RewriteCond %{HTTP_REFERER} (finder-?friend) [NC,OR]
  RewriteCond %{HTTP_REFERER} (discrete-?encounters) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cheating-?wives) [NC,OR]
  RewriteCond %{HTTP_REFERER} (housewives) [NC,OR]
  RewriteCond %{HTTP_REFERER} (\-sex\.) [NC,OR]
  RewriteCond %{HTTP_REFERER} (xxx) [NC,OR]
  RewriteCond %{HTTP_REFERER} (snowballing) [NC]
  RewriteRule .* - [F]
  # Weight
  RewriteCond %{HTTP_REFERER} (fat-) [NC,OR]
  RewriteCond %{HTTP_REFERER} (-fat) [NC,OR]
  RewriteCond %{HTTP_REFERER} (diet) [NC,OR]
  RewriteCond %{HTTP_REFERER} (pills) [NC,OR]
  RewriteCond %{HTTP_REFERER} (weight) [NC,OR]
  RewriteCond %{HTTP_REFERER} (supplement) [NC]
  RewriteRule .* - [F]
  # Gambling
  RewriteCond %{HTTP_REFERER} (texas-?hold-?em) [NC,OR]
  RewriteCond %{HTTP_REFERER} (poker) [NC,OR]
  RewriteCond %{HTTP_REFERER} (casino) [NC,OR]
  RewriteCond %{HTTP_REFERER} (blackjack) [NC]
  RewriteRule .* - [F]
  # Loans / Finance
  RewriteCond %{HTTP_REFERER} (mortgage) [NC,OR]
  RewriteCond %{HTTP_REFERER} (refinancing) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cash-?advance) [NC,OR]
  RewriteCond %{HTTP_REFERER} (cash-?money) [NC,OR]
  RewriteCond %{HTTP_REFERER} (pay-?day) [NC]
  RewriteRule .* - [F]
  # User Agents
  RewriteCond %{HTTP_USER_AGENT} (Program\ Shareware|Fetch\ API\ Request) [NC,OR]
  RewriteCond %{HTTP_USER_AGENT} (Microsoft\ URL\ Control) [NC]
  RewriteRule .* - [F]
  # Misc / Specific Sites
  RewriteCond %{HTTP_REFERER} (netwasgroup\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (nic4u\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (wear4u\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (foxmediasolutions\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (liveplanets\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (aeterna-tech\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (continentaltirebowl\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (chemsymphony\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (infolibria\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (globaleducationeurope\.net) [NC,OR]
  RewriteCond %{HTTP_REFERER} (soma\.125mb\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (mitglied\.lycos\.de) [NC,OR]
  RewriteCond %{HTTP_REFERER} (jroundup\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (feathersandfurvanlines\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (conecrusher\.org) [NC,OR]
  RewriteCond %{HTTP_REFERER} (sbj-broadcasting\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (edthompson\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (codychesnutt\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (artsmallforsenate\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (axionfootwear\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (protzonbeer\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (candiria\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (bigsitecity\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (coresat\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (istarthere\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (amateurvoetbal\.net) [NC,OR]
  RewriteCond %{HTTP_REFERER} (alleghanyeda\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (xadulthosting\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (datashaping\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (zick\.biz) [NC,OR]
  RewriteCond %{HTTP_REFERER} (newprinceton\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (dvdsqueeze\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (xopy\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (webdevboard\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (devaddict\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (eaton-inc\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (whiteguysgroup\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (guestbookz\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (webdevsquare\.com) [NC,OR]
  RewriteCond %{HTTP_REFERER} (indfx\.net) [NC,OR]
  RewriteCond %{HTTP_REFERER} (snap\.to) [NC,OR]
  RewriteCond %{HTTP_REFERER} (2y\.net) [NC,OR]
  RewriteCond %{HTTP_REFERER} (astromagia\.info) [NC,OR]
  RewriteCond %{HTTP_REFERER} (free-?sms) [NC]
  RewriteRule .* - [F]

The above will block just about all of the most common referral spam that I've seen so far. I'm adding to the list constantly (last addition: 14th September 2005) so do check back and see if there are updates if you're using it.
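
If the length of the list is a concern, several keywords can share one condition by separating them with the pipe character, which means "or" in a regular expression. A minimal sketch, using the gambling group from the list above; it should behave the same as the four separate conditions:

```apache
# Gambling (one condition instead of four)
RewriteCond %{HTTP_REFERER} (texas-?hold-?em|poker|casino|blackjack) [NC]
RewriteRule .* - [F]
```

This keeps the file shorter, though very long alternations become hard to read, so grouping by category as above still seems sensible.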

One potential problem with this technique, other than that it will, in time, become useless as too many URLs are added, is that there is always a possibility authentic visitors will be blocked. So, on this site, instead of the "RewriteRule .* - [F]" line that ends each group above, I've actually used something a little more user-friendly:

  RewriteRule .* bad_referrer.php [L]

And there we have it. With minimum effort (for now), referral log spamming on my site has been almost entirely removed. Before adding this set of rules and scripts, I was seeing around 200 fake referrals per day in my log files. Now, I see about 3 or 4 a week. Hopefully, this will continue until I can devise a better way of protecting against this kind of problem, before blacklists become impossible to manage.

Filed under: Tutorials
22 Apr 09

URL Rewriting for Beginners

Introduction

URL rewriting can be one of the best and quickest ways to improve the usability and search friendliness of your site. It can also be the source of near-unending misery and suffering. It is definitely worth playing with, but carefully, and lots of testing is recommended. With great power comes great responsibility, and all that.

There are several other guides on the web already, that may suit your needs better than this one.

Before reading on, you may find it helpful to have the mod_rewrite cheat sheet and/or the regular expressions cheat sheet handy. A basic grasp of the concept of regular expressions would also be very helpful.

What is "URL Rewriting"?

Most dynamic sites include variables in their URLs that tell the site what information to show the user. Typically, this gives URLs like the following, telling the relevant script on a site to load product number 7.

  http://www.pets.com/show_a_product.php?product_id=7

The problems with this kind of URL structure are that the URL is not at all memorable. It's difficult to read out over the phone (you'd be surprised how many people pass URLs this way). Search engines and users alike get no useful information about the content of a page from that URL. You can't tell from that URL that that page allows you to buy a Norwegian Blue Parrot (lovely plumage). It's a fairly standard URL - the sort you'd get by default from most CMSes. Compare that to this URL:

  http://www.pets.com/products/7/

Clearly a much cleaner and shorter URL. It's much easier to remember, and vastly easier to read out. That said, it doesn't exactly tell anyone what it refers to. But we can do more:

  http://www.pets.com/parrots/norwegian-blue/

Now we're getting somewhere. You can tell from the URL, even when it's taken out of context, what you're likely to find on that page. Search engines can split that URL into words (hyphens in URLs are treated as spaces by search engines, whereas underscores are not), and they can use that information to better determine the content of the page. It's an easy URL to remember and to pass to another person.

Unfortunately, the last URL cannot be easily understood by a server without some work on our part. When a request is made for that URL, the server needs to work out how to process that URL so that it knows what to send back to the user. URL rewriting is the technique used to "translate" a URL like the last one into something the server can understand.

Platforms and Tools

Depending on the software your server is running, you may already have access to URL rewriting modules. If not, most hosts will enable or install the relevant modules for you if you ask them very nicely.

Apache is the easiest system to get URL rewriting running on. It usually comes with its own built-in URL rewriting module, mod_rewrite, enabled, and working with mod_rewrite is as simple as uploading correctly formatted and named text files.

IIS, Microsoft's server software, doesn't include URL rewriting capability as standard, but there are add-ons out there that can provide this functionality. ISAPI_Rewrite is the one I recommend working with, as I've so far found it to be the closest to mod_rewrite's functionality. Instructions for installing and configuring ISAPI_Rewrite can be found at the end of this article.

The code that follows is based on URL rewriting using mod_rewrite.

Basic URL Rewriting

To begin with, let's consider a simple example. We have a website, and we have a single PHP script that serves a single page. Its URL is:

  http://www.pets.com/pet_care_info_07_07_2008.php

We want to clean up the URL, and our ideal URL would be:

  http://www.pets.com/pet-care/

In order for this to work, we need to tell the server to internally redirect all requests for the URL "pet-care" to "pet_care_info_07_07_2008.php". We want this to happen internally, because we don't want the URL in the browser's address bar to change.

To accomplish this, we need to first create a text document called ".htaccess" to contain our rules. It must be named exactly that (not ".htaccess.txt" or "rules.htaccess"). This would be placed in the root directory of the server (the same folder as "pet_care_info_07_07_2008.php" in our example). There may already be an .htaccess file there, in which case we should edit that rather than overwrite it.

The .htaccess file is a configuration file for the server. If there are errors in the file, the server will display an error message (usually with an error code of "500"). If you are transferring the file to the server using FTP, you must make sure it is transferred using the ASCII mode, rather than BINARY. We use this file to perform two simple tasks in this instance: first, to tell Apache to turn on the rewrite engine, and second, to tell Apache what rewriting rule we want it to use. We need to add the following to the file:

  # Turn on the rewriting engine
  RewriteEngine On
  # Handle requests for "pet-care"
  RewriteRule ^pet-care/?$ pet_care_info_07_07_2008.php [NC,L]

A couple of quick items to note: any line that begins with a hash symbol in an .htaccess file is ignored as a comment, and I'd recommend you use comments liberally. Apache's documentation says that a comment may not share a line with a directive, so each comment here sits on its own line above the directive it describes. The "RewriteEngine" line should only be used once per .htaccess file (please note that I've not included this line from here onwards in the code examples).

The "RewriteRule" line is where the magic happens. Together with its comment, it can be broken down into 5 parts:

  • RewriteRule - Tells Apache that this line refers to a single RewriteRule.
  • ^pet-care/?$ - The "pattern". The server will check the URL of every request to the site to see if this pattern matches. If it does, then Apache will swap the URL of the request for the "substitution" section that follows.
  • pet_care_info_07_07_2008.php - The "substitution". If the pattern above matches the request, Apache uses this URL instead of the requested URL.
  • [NC,L] - "Flags", that tell Apache how to apply the rule. In this case, we're using two flags. "NC" tells Apache that this rule should be case-insensitive, and "L" tells Apache not to process any more rules if this one is used.
  • # Handle requests for "pet-care" - A comment line explaining what the rule does (optional but recommended)

The rule above is a simple method for rewriting a single URL, and is the basis for almost all URL rewriting rules.
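
Put together, a complete .htaccess file for this example might look like the sketch below. The IfModule wrapper is an optional extra rather than something the rule requires, and is an assumption about how you might want the file to fail: it makes Apache skip the rules quietly if mod_rewrite is not loaded, rather than return a 500 error.

```apache
<IfModule mod_rewrite.c>
  # Turn on the rewriting engine
  RewriteEngine On
  # Handle requests for "pet-care" and "pet-care/"
  RewriteRule ^pet-care/?$ pet_care_info_07_07_2008.php [NC,L]
</IfModule>
```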

Patterns and Replacements

The rule above allows you to redirect requests for a single URL, but the real power of mod_rewrite comes when you start to identify and rewrite groups of URLs based on patterns they contain.

Let's say you want to change all of your site URLs as described in the first pair of examples above. Your existing URLs look like this:

  http://www.pets.com/show_a_product.php?product_id=7

And you want to change them to look like this:

  http://www.pets.com/products/7/

Rather than write a rule for every single product ID, you of course would rather write one rule to manage all product IDs. Effectively you want to change URLs of this format:

  http://www.pets.com/show_a_product.php?product_id={a number}

And you want to change them to look like this:

  http://www.pets.com/products/{a number}/

In order to do so, you will need to use "regular expressions". These are patterns, defined in a specific format that the server can understand and handle appropriately. A typical pattern to identify a number would look like this:

  [0-9]+

The square brackets contain a range of characters, and "0-9" indicates all the digits. The plus symbol indicates that the pattern will identify one or more of whatever precedes the plus, so this pattern effectively means "one or more digits", which is exactly what we're looking to find in our URL.

The entire "pattern" part of the rule is treated as a regular expression by default - you don't need to turn this on or activate it at all.

  # Handle product requests
  RewriteRule ^products/([0-9]+)/?$ show_a_product.php?product_id=$1 [NC,L]

The first thing I hope you'll notice is that we've wrapped our pattern in brackets. This allows us to "back-reference" (refer back to) that section of the URL in the following "substitution" section. The "$1" in the substitution tells Apache to put whatever matched the earlier bracketed pattern into the URL at this point. You can have lots of backreferences, and they are numbered in the order they appear.

And so, this RewriteRule will now mean that Apache redirects all requests for domain.com/products/{number}/ to show_a_product.php?product_id={same number}.
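
Backreferences are numbered from left to right, so a second bracketed group becomes $2. As a sketch, suppose products also had numbered review pages; the "reviews" URL segment and the "page" parameter are hypothetical and only here for illustration:

```apache
# /products/7/reviews/2/ -> show_a_product.php?product_id=7&page=2
RewriteRule ^products/([0-9]+)/reviews/([0-9]+)/?$ show_a_product.php?product_id=$1&page=$2 [NC,L]
```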

Regular Expressions

A complete guide to regular expressions is rather beyond the scope of this article. However, important points to remember are that the entire pattern is treated as a regular expression, so always be careful of characters that are "special" characters in regular expressions.

The most common instance of this is when people use a period in their pattern. In a pattern, this actually means "any character" rather than a literal period, and so if you want to match a period (and only a period) you will need to "escape" the character: precede it with another special character, a backslash, that tells Apache to take the next character to be literal.

For example, this RewriteRule will not just match the URL "rss.xml" as intended - it will also match "rss1xml", "rss-xml" and so on.

  # Change feed URL
  RewriteRule ^rss.xml$ rss.php [NC,L]

This does not usually present a serious problem, but escaping characters properly is a very good habit to get into early. Here's how it should look:

  # Change feed URL
  RewriteRule ^rss\.xml$ rss.php [NC,L]

This only applies to the pattern, not to the substitution. Other characters that require escaping (referred to as "metacharacters") follow, with their meaning in brackets afterwards:

  • . (any character)
  • * (zero or more of the preceding)
  • + (one or more of the preceding)
  • {} (minimum to maximum quantifier)
  • ? (zero or one of the preceding; after a quantifier, makes it ungreedy)
  • ! (at start of string means "negative pattern")
  • ^ (start of string, or "negative" if at the start of a range)
  • $ (end of string)
  • [] (match any of contents)
  • - (range if used between square brackets)
  • () (group, backreferenced group)
  • | (alternative, or)
  • \ (the escape character itself)
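
A short sketch combining a few of these metacharacters. The braces ask for an exact number of digits, and the period before "html" is escaped so that it only matches a literal period. The archive URL layout and the archive.php script are hypothetical:

```apache
# /archive/2008/07.html -> archive.php?year=2008&month=07
RewriteRule ^archive/([0-9]{4})/([0-9]{2})\.html$ archive.php?year=$1&month=$2 [NC,L]
```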

Using regular expressions, it is possible to search for all sorts of patterns in URLs and rewrite them when they match. Time for another example - we wanted earlier to be able to identify this URL and rewrite it:

  http://www.pets.com/parrots/norwegian-blue/

And we want to be able to tell the server to interpret this as the following, but for all products:

  http://www.pets.com/get_product_by_name.php?product_name=norwegian-blue

And we can do that relatively simply, with the following rule:

  # Process parrots
  RewriteRule ^parrots/([A-Za-z0-9-]+)/?$ get_product_by_name.php?product_name=$1 [NC,L]

This rule matches any URL that starts with "parrots" followed by a slash (parrots/), then one or more (+) of any combination of letters, numbers and hyphens ([A-Za-z0-9-]). Note the hyphen at the end of the selection of characters within square brackets: placing it there means it is treated literally rather than as a range separator. We reference the product name in brackets with $1 in the substitution.

We can make it even more generic, if we want, so that it doesn't matter what directory a product appears to be in, it is still sent to the same script, like so:

  # Process all products
  RewriteRule ^[A-Za-z-]+/([A-Za-z0-9-]+)/?$ get_product_by_name.php?product_name=$1 [NC,L]

As you can see, we've replaced "parrots" with a pattern that matches letters and hyphens. That rule will now match anything in the parrots directory or any other directory whose name is made up of one or more letters and hyphens.

Flags

Flags are added to the end of a rewrite rule to tell Apache how to interpret and handle the rule. They can be used to tell Apache to treat the rule as case-insensitive, to stop processing rules if the current one matches, or a variety of other options. They are comma-separated, and contained in square brackets. Here's a list of the flags, with their meanings (this information is included on the cheat sheet, so no need to try to learn them all).

  • C (chained with next rule)
  • CO=cookie (set specified cookie)
  • E=var:value (set environment variable var to value)
  • F (forbidden - sends a 403 header to the user)
  • G (gone - no longer exists)
  • H=handler (set handler)
  • L (last - stop processing rules)
  • N (next - restart rule processing from the top, using the rewritten URL)
  • NC (case insensitive)
  • NE (do not escape special URL characters in output)
  • NS (ignore this rule if the request is a subrequest)
  • P (proxy - i.e., apache should grab the remote content specified in the substitution section and return it)
  • PT (pass through - use when processing URLs with additional handlers, e.g., mod_alias)
  • R (temporary redirect to new URL)
  • R=301 (permanent redirect to new URL)
  • QSA (append query string from request to substituted URL)
  • S=x (skip next x rules)
  • T=mime-type (force specified mime type)
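
Two sketches of flags in use. By default, when the substitution contains its own query string, the query string from the original request is dropped; QSA keeps it. G needs no real substitution, so a single hyphen is used to mean "leave the URL alone". The "old-catalogue" path is a hypothetical example:

```apache
# /products/7/?colour=blue -> show_a_product.php?product_id=7&colour=blue
RewriteRule ^products/([0-9]+)/?$ show_a_product.php?product_id=$1 [NC,QSA,L]
# Anything under /old-catalogue/ is reported as gone (HTTP 410)
RewriteRule ^old-catalogue/ - [G,NC]
```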

Moving Content

  # Temporary Move
  RewriteRule ^article/?$ http://www.new-domain.com/article/ [R,NC,L]

Adding an "R" flag to the flags section changes how a RewriteRule works. Instead of rewriting the URL internally, Apache will send a message back to the browser (an HTTP header) to tell it that the document has moved temporarily to the URL given in the "substitution" section. Either an absolute or a relative URL can be given in the substitution section. The header sent back includes a code - 302 - that indicates the move is temporary.

  # Permanent Move
  RewriteRule ^article/?$ http://www.new-domain.com/article/ [R=301,NC,L]

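If the move is permanent, append "=301" to the "R" flag to have Apache tell the browser the move is considered permanent. With either form the browser requests the new URL and shows it in the address bar; the difference is that a 301 tells browsers and search engines that the old address has been replaced for good, so they are free to remember the new one.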

This is one of the most common methods of rewriting URLs of items that have moved to a new URL (for example, it is in use extensively on this site to forward users to new post URLs whenever they are changed).
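
Redirects can use patterns and backreferences in the same way as internal rewrites. A sketch, assuming product pages used to live under a hypothetical "shop/item/" path and have moved to "products/" on the same site:

```apache
# /shop/item/7/ -> /products/7/ (permanent move)
RewriteRule ^shop/item/([0-9]+)/?$ /products/$1/ [R=301,NC,L]
```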

Conditions

Rewrite rules can be preceded by one or more rewrite conditions, and these can be strung together. This can allow you to only apply certain rules to a subset of requests. Personally, I use this most often when applying rules to a subdomain or alternative domain as rewrite conditions can be run against a variety of criteria, not just the URL. Here's an example:

  RewriteCond %{HTTP_HOST} ^addedbytes\.com [NC]
  RewriteRule ^(.*)$ http://www.addedbytes.com/$1 [L,R=301]

The rewrite rule above redirects all requests, no matter what for, to the same URL at "www.addedbytes.com". Without the condition, this rule would create a loop, with every request matching that rule and being sent back to itself. The rule is intended to only redirect requests missing the "www" URL portion, though, and the condition preceding the rule ensures that this happens.

The condition operates in a similar way to the rule. It starts with "RewriteCond" to tell mod_rewrite this line refers to a condition. Following that is what should actually be tested, and then the pattern to test. Finally, the flags in square brackets, the same as with a RewriteRule.

The string to test (the second part of the condition) can be a variety of different things. You can test the domain being requested, as with the above example, or you could test the browser being used, the referring URL (commonly used to prevent hotlinking), the user's IP address, or a variety of other things (see the "server variables" section for an outline of how these work).

The pattern is almost exactly the same as that used in a RewriteRule, with a couple of small exceptions. The pattern may not be interpreted as a pattern if it starts with specific characters as described in the following "exceptions" section. This means that if you wish to use a regular expression pattern starting with <, >, = or a hyphen, you should escape them with the backslash.

Rewrite conditions can, like rewrite rules, be followed by flags, and there are two you are likely to need. "NC", as with rules, tells Apache to treat the condition as case-insensitive. The other is "OR". If you only want to apply a rule if one of two conditions match, rather than repeat the rule, add the "OR" flag to the first condition, and if either matches then the following rule will be applied. The default behaviour, if a rule is preceded by multiple conditions, is that it is only applied if all conditions match.
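
Two sketches of conditions in use. The first is a basic hotlinking block: both conditions must match (the default), so images are refused only when a referrer is present and it is not this site. The second uses "OR" so that either host name triggers the redirect. The image extensions and the "pets.net" domain are assumptions for illustration:

```apache
# Refuse image requests that come from other sites' pages
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?pets\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]

# Send either alternative host name to the main site
RewriteCond %{HTTP_HOST} ^pets\.net$ [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.pets\.net$ [NC]
RewriteRule ^(.*)$ http://www.pets.com/$1 [R=301,L]
```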

Exceptions and Special Cases

Rewrite conditions can be tested in a few different ways - they do not need to be treated as regular expression patterns, although this is the most common way they are used. Here are the various ways rewrite conditions can be processed:

  • <Pattern (is test string lower than pattern)
  • >Pattern (is test string greater than pattern)
  • =Pattern (is test string equal to pattern)
  • -d (is test string a valid directory)
  • -f (is test string a valid file)
  • -s (is test string a valid file with size greater than zero)
  • -l (is test string a symbolic link)
  • -F (is test string a valid file, and accessible (via subrequest))
  • -U (is test string a valid URL, and accessible (via subrequest))
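
The file tests are most often used with REQUEST_FILENAME to leave real files and directories alone and rewrite everything else. A sketch, assuming a hypothetical index.php script that reads the requested path from a "path" parameter:

```apache
# Only rewrite when the request does not match an existing file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?path=$1 [QSA,L]
```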

Server Variables

Server variables are a selection of items you can test when writing rewrite conditions. This allows you to apply rules based on all sorts of request parameters, including browser identifiers, referring URL or a multitude of other strings. Variables are of the following format:

  %{VARIABLE_NAME}

And "VARIABLE_NAME" can be replaced with any one of the following items:

  • HTTP Headers
    • HTTP_USER_AGENT
    • HTTP_REFERER
    • HTTP_COOKIE
    • HTTP_FORWARDED
    • HTTP_HOST
    • HTTP_PROXY_CONNECTION
    • HTTP_ACCEPT
  • Connection Variables
    • REMOTE_ADDR
    • REMOTE_HOST
    • REMOTE_USER
    • REMOTE_IDENT
    • REQUEST_METHOD
    • SCRIPT_FILENAME
    • PATH_INFO
    • QUERY_STRING
    • AUTH_TYPE
  • Server Variables
    • DOCUMENT_ROOT
    • SERVER_ADMIN
    • SERVER_NAME
    • SERVER_ADDR
    • SERVER_PORT
    • SERVER_PROTOCOL
    • SERVER_SOFTWARE
  • Dates and Times
    • TIME_YEAR
    • TIME_MON
    • TIME_DAY
    • TIME_HOUR
    • TIME_MIN
    • TIME_SEC
    • TIME_WDAY
    • TIME
  • Special Items
    • API_VERSION
    • THE_REQUEST
    • REQUEST_URI
    • REQUEST_FILENAME
    • IS_SUBREQ
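
If it helps to picture what happens to %{VARIABLE_NAME}, the following Python sketch substitutes variables into a test string before it is compared with a pattern. The function expand_variables and the sample request values are assumptions for illustration; treating an unknown variable as an empty string is a choice made for this sketch.

```python
import re


def expand_variables(test_string, variables):
    """Replace every %{NAME} in test_string with its value from `variables`."""
    return re.sub(
        r"%\{([A-Z_]+)\}",
        lambda match: variables.get(match.group(1), ""),
        test_string,
    )


# Hypothetical values for one request.
request = {"HTTP_HOST": "example.com", "REQUEST_URI": "/blog/"}
print(expand_variables("%{HTTP_HOST}%{REQUEST_URI}", request))
```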

Working With Multiple Rules

The more complicated a site, the more complicated the set of rules governing it can be. This can be problematic when it comes to resolving conflicts between rules. You will find this issue rears its ugly head most often when you add a new rule to a file, and it doesn't work. What you may find, if the rule itself is not at fault, is that an earlier rule in the file is matching the URL and so the URL is not being tested against the new rule you've just added.

  # Process product requests
  RewriteRule ^([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_product_by_name.php?category_name=$1&product_name=$2 [NC,L]
  # Process blog posts
  RewriteRule ^([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_blog_post_by_title.php?category_name=$1&post_title=$2 [NC,L]

In the example above, the product pages of a site and the blog post pages have identical patterns. The second rule will never match a URL, because anything that would match that pattern will have already been matched by the first rule.

There are a few ways to work around this. Several CMSes (including WordPress) handle this by adding an extra portion to the URL to denote the type of request, like so:

  # Process product requests
  RewriteRule ^products/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_product_by_name.php?category_name=$1&product_name=$2 [NC,L]
  # Process blog posts
  RewriteRule ^blog/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_blog_post_by_title.php?category_name=$1&post_title=$2 [NC,L]

You could also write a single PHP script to process all requests, which checked to see if the second part of the URL matched a blog post or a product. I usually go for this option, as while it may increase the load on the server slightly, it gives much cleaner URLs.

  # Process product and blog requests
  RewriteRule ^([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/?$ get_product_or_blog_post.php?category_name=$1&item_name=$2 [NC,L]

There are certain situations where you can work around this issue by writing more precise rules and ordering your rules intelligently. Imagine a blog where there were two archives - one by topic and one by year.

  # Get archive by topic
  RewriteRule ^([A-Za-z0-9-]+)/?$ get_archives_by_topic.php?topic_name=$1 [NC,L]
  # Get archive by year
  RewriteRule ^([A-Za-z0-9-]+)/?$ get_archives_by_year.php?year=$1 [NC,L]

The above rules will conflict. Of course, years are numeric and only 4 digits, so you can make that rule more precise, and by running it first the only type of conflict you could encounter would be if you had a topic with a 4-digit number for a name.

  # Get archive by year
  RewriteRule ^([0-9]{4})/?$ get_archives_by_year.php?year=$1 [NC,L]
  # Get archive by topic
  RewriteRule ^([A-Za-z0-9-]+)/?$ get_archives_by_topic.php?topic_name=$1 [NC,L]
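
The "first matching rule wins" behaviour is easy to try out away from the server. Here is a minimal Python sketch using the year and topic patterns; the helper first_match and the rule lists are made up for illustration, and Python writes the backreference as \1 where mod_rewrite uses $1.

```python
import re


def first_match(rules, path):
    """Return the substitution of the first rule whose pattern matches path."""
    for pattern, substitution in rules:
        match = re.search(pattern, path, re.IGNORECASE)
        if match:
            return match.expand(substitution)
    return None


YEAR_RULE = (r"^([0-9]{4})/?$", r"get_archives_by_year.php?year=\1")
TOPIC_RULE = (r"^([A-Za-z0-9-]+)/?$", r"get_archives_by_topic.php?topic_name=\1")

# Precise rule first: years and topics are told apart.
print(first_match([YEAR_RULE, TOPIC_RULE], "2009"))
print(first_match([YEAR_RULE, TOPIC_RULE], "linux/"))
# Generic rule first: the year rule is never reached.
print(first_match([TOPIC_RULE, YEAR_RULE], "2009"))
```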

mod_rewrite

Apache's mod_rewrite comes as standard with most Apache hosting accounts, so if you're on shared hosting, you are unlikely to have to do anything. If you're managing your own box, then you most likely just have to turn on mod_rewrite. If you are using Apache1, you will need to edit your httpd.conf file and remove the leading '#' from the following lines:

  #LoadModule rewrite_module modules/mod_rewrite.so
  #AddModule mod_rewrite.c

If you are using Apache2 on a Debian-based distribution, you need to run the following command and then restart Apache:

  sudo a2enmod rewrite

Other distributions and platforms differ. If the above instructions are not suitable for your system, then Google is your friend. You may need to edit your apache2 configuration file and add "rewrite" to the "APACHE_MODULES" list, or edit httpd.conf, or even download and compile mod_rewrite yourself. For the majority, however, installation should be simple.

ISAPI_Rewrite

ISAPI_Rewrite is a URL rewriting plugin for IIS based on mod_rewrite and is not free. It performs most of the same functionality as mod_rewrite, and there is a good quality ISAPI_Rewrite forum where most common questions are answered. As ISAPI_Rewrite works with IIS, installation is relatively simple - there are installation instructions available.

ISAPI_Rewrite rules go into a file named httpd.ini. Errors will go into a file named httpd.parse.errors by default.

Leading Slashes

I have found myself tripped up numerous times by leading slashes in URL rewriting systems. Whether they should be used in the pattern or in the substitution section of a RewriteRule or used in a RewriteCond statement is a constant source of frustration to me. This may be in part because I work with different URL rewriting engines, but I would advise being careful of leading slashes - if a rule is not working, that's often a good place to start looking. I never include leading slashes in mod_rewrite rules and always include them in ISAPI_Rewrite.

Sample Rules

To redirect an old domain to a new domain:

  RewriteCond %{HTTP_HOST} old_domain\.com [NC]
  RewriteRule ^(.*)$ http://www.new_domain.com/$1 [L,R=301]

To redirect all requests missing "www" (yes www):

  RewriteCond %{HTTP_HOST} ^domain\.com [NC]
  RewriteRule ^(.*)$ http://www.domain.com/$1 [L,R=301]

To redirect all requests with "www" (no www):

  RewriteCond %{HTTP_HOST} ^www\.domain\.com [NC]
  RewriteRule ^(.*)$ http://domain.com/$1 [L,R=301]

Redirect old page to new page:

  RewriteRule ^old-url\.htm$ http://www.domain.com/new-url.htm [NC,R=301,L]
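
To see what the "yes www" pair of lines above decides for a given request, here is the same logic as a small Python sketch. The function yes_www is hypothetical; it assumes the path arrives without a leading slash, as it does for rules in a .htaccess file.

```python
import re


def yes_www(host, path):
    """Return (status, location) when a redirect is due, otherwise None."""
    # Same test as: RewriteCond %{HTTP_HOST} ^domain\.com [NC]
    if re.search(r"^domain\.com", host, re.IGNORECASE):
        # Same effect as: RewriteRule ^(.*)$ http://www.domain.com/$1 [L,R=301]
        return 301, "http://www.domain.com/" + path
    return None


print(yes_www("domain.com", "about.htm"))
print(yes_www("www.domain.com", "about.htm"))
```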

Summary

Hopefully if you've made it this far you now have a clear understanding of what URL rewriting is and how to add it to your site. It is worth taking the time to become familiar with - it can benefit your SEO efforts immediately, and increase the usability of your site.

Thanks To www.addedbytes.com

22Apr/090

HTTP Status Codes Explained

HTTP, Hypertext Transfer Protocol, is the method by which clients (i.e. you) and servers communicate. When someone clicks a link, types in a URL or submits a form, their browser sends a request to a server for information. It might be asking for a page, or sending data, but either way, that is called an HTTP Request. When a server receives that request, it sends back an HTTP Response, with information for the client. Usually, this is invisible, though I'm sure you've seen one of the very common Response codes - 404, indicating a page was not found. There are a fair few more status codes sent by servers, and the following is a list of the current ones in HTTP 1.1, along with an explanation of their meanings.

A more technical breakdown of HTTP 1.1 status codes and their meanings is available at http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html. There are several versions of HTTP, but currently HTTP 1.1 is the most widely used.

Informational

  • 100 - Continue
    A status code of 100 indicates that (usually the first) part of a request has been received without any problems, and that the rest of the request should now be sent.
  • 101 - Switching Protocols
    HTTP 1.1 is just one type of protocol for transferring data on the web, and a status code of 101 indicates that the server is changing to the protocol it defines in the "Upgrade" header it returns to the client. For example, when requesting a page, a browser might receive a status code of 101, followed by an "Upgrade" header showing that the server is changing to a different version of HTTP.

Successful

  • 200 - OK
    The 200 status code is by far the most common returned. It means, simply, that the request was received and understood and has been processed successfully.
  • 201 - Created
    A 201 status code indicates that a request was successful and as a result, a resource has been created (for example a new page).
  • 202 - Accepted
    The status code 202 indicates that the server has received and understood the request, and that it has been accepted for processing, although it may not be processed immediately.
  • 203 - Non-Authoritative Information
    A 203 status code means that the request was received and understood, and that information sent back about the response is from a third party, rather than the original server. This is virtually identical in meaning to a 200 status code.
  • 204 - No Content
    The 204 status code means that the request was received and understood, but that there is no need to send any data back.
  • 205 - Reset Content
    The 205 status code is a request from the server to the client to reset the document from which the original request was sent. For example, if a user fills out a form, and submits it, a status code of 205 means the server is asking the browser to clear the form.
  • 206 - Partial Content
    A status code of 206 is a response to a request for part of a document. This is used by advanced caching tools, when a user agent requests only a small part of a page, and just that section is returned.

Redirection

  • 300 - Multiple Choices
    The 300 status code indicates that the requested resource is available at more than one location. The response will also include a list of locations from which the user agent can select the most appropriate.
  • 301 - Moved Permanently
    A status code of 301 tells a client that the resource they asked for has permanently moved to a new location. The response should also include this location. It tells the client to use the new URL the next time it wants to fetch the same resource.
  • 302 - Found
    A status code of 302 tells a client that the resource they asked for has temporarily moved to a new location. The response should also include this location. It tells the client that it should carry on using the same URL to access this resource.
  • 303 - See Other
    A 303 status code indicates that the response to the request can be found at the specified URL, and should be retrieved from there. It does not mean that something has moved - it is simply specifying the address at which the response to the request can be found.
  • 304 - Not Modified
    The 304 status code is sent in response to a request (for a document) that asked for the document only if it was newer than the one the client already had. Normally, when a document is cached, the date it was cached is stored. The next time the document is viewed, the client asks the server if the document has changed. If not, the client just reloads the document from the cache.
  • 305 - Use Proxy
    A 305 status code tells the client that the requested resource has to be reached through a proxy, which will be specified in the response.
  • 307 - Temporary Redirect
    307 is the status code that is sent when a document is temporarily available at a different URL, which is also returned. There is very little difference between a 302 status code and a 307 status code. 307 was created as another, less ambiguous, version of the 302 status code.
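
The 304 exchange described above boils down to a date comparison. Here is a minimal Python sketch of the server side of that decision, assuming the client sends the date of its cached copy in an "If-Modified-Since" header. The function conditional_status is made up for illustration, and a real server also has to deal with other validators such as ETags.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 when the client's cached copy is still current, else 200."""
    if if_modified_since is None:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
        return 304 if last_modified <= since else 200
    except (TypeError, ValueError):
        # Unparseable or timezone-less header: send the full document.
        return 200


changed = datetime(2009, 4, 22, 10, 0, 0, tzinfo=timezone.utc)
print(conditional_status(changed, "Wed, 22 Apr 2009 10:00:00 GMT"))
print(conditional_status(changed, "Tue, 21 Apr 2009 10:00:00 GMT"))
```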

Client Error

  • 400 - Bad Request
    A status code of 400 indicates that the server did not understand the request due to bad syntax.
  • 401 - Unauthorized
    A 401 status code indicates that before a resource can be accessed, the client must be authorised by the server.
  • 402 - Payment Required
    The 402 status code is not currently in use, being listed as "reserved for future use".
  • 403 - Forbidden
    A 403 status code indicates that the client cannot access the requested resource. That might mean that the wrong username and password were sent in the request, or that the permissions on the server do not allow what was being asked.
  • 404 - Not Found
    The best known of them all, the 404 status code indicates that the requested resource was not found at the URL given, and the server does not say whether that is temporary or permanent.
  • 405 - Method Not Allowed
    A 405 status code is returned when the client has tried to use a request method that the server does not allow. Request methods that are allowed should be sent with the response (common request methods are POST and GET).
  • 406 - Not Acceptable
    The 406 status code means that, although the server understood and processed the request, the response is of a form the client cannot understand. A client sends, as part of a request, headers indicating what types of data it can use, and a 406 error is returned when the response is of a type not in that list.
  • 407 - Proxy Authentication Required
    The 407 status code is very similar to the 401 status code, and means that the client must be authorised by the proxy before the request can proceed.
  • 408 - Request Timeout
    A 408 status code means that the client did not produce a request quickly enough. A server is set to only wait a certain amount of time for responses from clients, and a 408 status code indicates that time has passed.
  • 409 - Conflict
    A 409 status code indicates that the server was unable to complete the request, often because a file would need to be edited, created or deleted, and that file cannot be edited, created or deleted.
  • 410 - Gone
    A 410 status code is the 404's lesser known cousin. It indicates that a resource has permanently gone (a 404 status code gives no indication if a resource has gone permanently or temporarily), and no new address is known for it.
  • 411 - Length Required
    The 411 status code occurs when a server refuses to process a request because a content length was not specified.
  • 412 - Precondition Failed
    A 412 status code indicates that one of the conditions the request was made under has failed.
  • 413 - Request Entity Too Large
    The 413 status code indicates that the request was larger than the server is able to handle, either due to physical constraints or to settings. Usually, this occurs when a file is sent using the POST method from a form, and the file is larger than the maximum size allowed in the server settings.
  • 414 - Request-URI Too Long
    The 414 status code indicates that the URL requested by the client was longer than the server can process.
  • 415 - Unsupported Media Type
    A 415 status code is returned by a server to indicate that part of the request was in an unsupported format.
  • 416 - Requested Range Not Satisfiable
    A 416 status code indicates that the server was unable to fulfill the request. This may be, for example, because the client asked for the 800th-900th bytes of a document, but the document was only 200 bytes long.
  • 417 - Expectation Failed
    The 417 status code means that the server was unable to properly complete the request. One of the headers sent to the server, the "Expect" header, indicated an expectation the server could not meet.

Server Error

  • 500 - Internal Server Error
    A 500 status code (all too often seen by Perl programmers) indicates that the server encountered something it didn't expect and was unable to complete the request.
  • 501 - Not Implemented
    The 501 status code indicates that the server does not support all that is needed for the request to be completed.
  • 502 - Bad Gateway
    A 502 status code indicates that a server, while acting as a proxy, received a response from a server further upstream that it judged invalid.
  • 503 - Service Unavailable
    A 503 status code is most often seen on extremely busy servers, and it indicates that the server was unable to complete the request due to a server overload.
  • 504 - Gateway Timeout
    A 504 status code is returned when a server acting as a proxy has waited too long for a response from a server further upstream.
  • 505 - HTTP Version Not Supported
    A 505 status code is returned when the HTTP version indicated in the request is not supported. The response should indicate which HTTP versions are supported.
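
The five groups used in this list follow directly from the first digit of the code. As a small sketch, Python's standard http module can supply the reason phrase while the first digit picks the group. The describe function and the CLASSES table are made up for illustration.

```python
from http import HTTPStatus

# Group names as used in the headings of this list.
CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe(code):
    """Return 'code - phrase (group)' for a known HTTP status code."""
    return f"{code} - {HTTPStatus(code).phrase} ({CLASSES[code // 100]})"


print(describe(404))
print(describe(301))
```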

22Apr/090

Understanding Network Address Translation, NAT

Network Address Translation (NAT) is one of the basic functions of a circuit level gateway. The simple purpose of NAT is to hide the IP addresses of a private network from the outside world.

Normally, when a router forwards a packet from one segment to another, the packet is unchanged. With NAT, as a packet crosses from a trusted segment of a circuit level gateway to an untrusted segment, the packet is rewritten so that the packet’s source address as it appears on the private segment is replaced by a translated source address. The translated source address is what the outside world sees. Thus, the private address remains hidden from the outside world.

When a host on a public network transmits a packet to a host on the private network, the source host addresses the packet to the private host’s publicly translated address. The sender on the public side does not know the destination host’s true address. As the packet crosses the circuit level gateway, the gateway rewrites the packet so that the destination address is translated to the destination host’s private address.

[Figure: changes in source and destination addresses as packets cross a circuit level gateway performing network address translation.]

One to One Translation
One form of NAT establishes a one to one translation between an equal number of private and public host addresses. For example, each host address on a Class C network on the private side of a circuit level gateway is uniquely mapped to a corresponding host address on a Class C network on the public side of the gateway. If 10.1.1.0/24 is the private network address and 172.19.19.0/24 is the public network address, then outbound packets with a source address of 10.1.1.5 can always be rewritten with a translated source address of 172.19.19.5, and inbound packets with a destination address of 172.19.19.5 can be rewritten with a translated destination address of 10.1.1.5. The mapping is persistent and bi-directional. Therefore, connections may be initiated from either side of the circuit level gateway unless a default deny policy is applied.
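
The one to one mapping in the example above is just "keep the host part, swap the network part". Here is a minimal Python sketch using the standard ipaddress module. The function translate is made up for illustration; a real gateway does this per packet inside the kernel.

```python
import ipaddress

PRIVATE_NET = ipaddress.ip_network("10.1.1.0/24")
PUBLIC_NET = ipaddress.ip_network("172.19.19.0/24")


def translate(address, from_net, to_net):
    """Map an address to the same host offset in another equally sized network."""
    if from_net.prefixlen != to_net.prefixlen:
        raise ValueError("one to one translation needs networks of equal size")
    addr = ipaddress.ip_address(address)
    if addr not in from_net:
        raise ValueError(f"{addr} is not in {from_net}")
    offset = int(addr) - int(from_net.network_address)
    return str(to_net.network_address + offset)


# Outbound: rewrite the source address.
print(translate("10.1.1.5", PRIVATE_NET, PUBLIC_NET))
# Inbound: rewrite the destination address back again.
print(translate("172.19.19.5", PUBLIC_NET, PRIVATE_NET))
```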

Pool of Translated Addresses
One form of NAT maps a large block of addresses from the private network to a small pool of addresses on the public segment. Multiple Class A addresses may be mapped to part of a Class C network block. If 10.0.0.0/8 is the private segment’s network address and 172.19.19.0/28 is the public pool of addresses, then an outbound packet with a source address of 10.1.1.5 may be rewritten to have a translated source address of any host address in the pool of 172.19.19.0/28. The NAT gateway will then create a temporary entry in its internal translation table to track the mapping. An inbound packet’s destination address cannot be translated unless a corresponding entry exists in the NAT table. If a current translation exists in the NAT table, the inbound packet’s destination address will be rewritten in accordance with the NAT table entry. The mapping is not persistent and is only temporarily bi-directional. An inbound connection may be accepted only until the NAT table entry expires.

Single Translated Addresses
The form of NAT commonly (but not exclusively) used in commercial circuit level gateways maps any number of addresses from the private network to a single address on the public segment. Given a private segment with the network address 10.0.0.0/8 and a NAT policy that sets 172.19.19.130 as the public address, all outbound packets from the private network will be rewritten to have a translated source address of 172.19.19.130. To correctly map replies to the private host that initiated the connection, the source port number of the outbound packet must also be translated. The NAT gateway will then create a temporary entry in its internal translation table to track the translated source address and port number. An inbound packet’s destination address and port number cannot be translated unless a corresponding entry exists in the NAT table. If a current translation exists in the NAT table, the inbound packet’s destination address and port number will be rewritten in accordance with the NAT table entry. The mapping is not persistent and is only temporarily bi-directional. An inbound connection may be accepted only until the NAT table entry expires.
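
The translation table is the heart of this form of NAT, so here is a minimal Python sketch of it. The class PortTranslator, its method names and the starting port of 40000 are all assumptions for illustration, and expiry of entries is deliberately left out.

```python
import itertools


class PortTranslator:
    """Toy table mapping many private (address, port) pairs to one public address."""

    def __init__(self, public_address, first_port=40000):
        self.public_address = public_address
        self._next_port = itertools.count(first_port)
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """Translate the source of an outbound packet, creating an entry if needed."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            public_port = next(self._next_port)
            self._outbound[key] = public_port
            self._inbound[public_port] = key
        return self.public_address, self._outbound[key]

    def inbound(self, public_port):
        """Translate the destination of a reply, or None if no entry exists."""
        return self._inbound.get(public_port)


nat = PortTranslator("172.19.19.130")
print(nat.outbound("10.1.1.5", 51000))
print(nat.outbound("10.2.2.7", 51000))
print(nat.inbound(40001))
print(nat.inbound(40002))
```

Two private hosts using the same source port end up on different public ports, which is what lets the gateway route each reply back to the right host.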

[Figure: changes in IP addresses and port numbers as packets cross a circuit level gateway performing network address and port translation.]

nat Chains
netfilter implements network address translation in the nat table. This pre-defined table consists of three built-in chains, the PREROUTING, OUTPUT and POSTROUTING chains. Rules in the PREROUTING chain apply to inbound packets (packets arriving at the gateway from any direction). Rules in the OUTPUT chain apply to locally generated packets (packets that are generated on the gateway itself). Rules in the POSTROUTING chain apply to outbound packets (packets leaving the gateway in any direction).

nat Targets
The nat table includes the built-in targets MASQUERADE, SNAT, DNAT, NETMAP and REDIRECT.

The MASQUERADE target is available in the POSTROUTING chain. MASQUERADE is intended to be used where a firewall’s public side IP address is dynamically assigned, such as where an ISP assigns IP addresses by DHCP. MASQUERADE translates all private network addresses to the single address of the external interface as illustrated, performing port translation as needed and rewriting the destination address and port of replies as needed. When the firewall’s external IP address is released or changed, all translations are dropped.

The SNAT target is available in the POSTROUTING chain. SNAT may be used on a firewall with statically assigned IP addresses. SNAT provides outbound (more trusted to less trusted) network address translation to a pool of public side addresses such that the source address of each outbound packet is translated to an address from the pool, with port translation being performed as needed and the destination address and port of replies being rewritten as needed.

SNAT can use a single public side address as an alternative to a pool of addresses, making SNAT comparable to MASQUERADE. However, SNAT should not be used with dynamically assigned public addresses.

Conversely to SNAT, the DNAT target is available in the PREROUTING and OUTPUT chains and provides inbound (less trusted to more trusted) network address translation. When a connection is initiated from a less trusted network, the destination address is the address of the firewall interface that faces the originating network. DNAT translates the destination address to the address of a host on a more trusted segment. Optionally, the destination port may also be translated. The source address and port of replies from the more trusted segment will be rewritten as needed.

DNAT can use a pool of destination addresses and ports, providing a simple circuit level method of performing load balancing across a number of hosts such as a farm of web servers.

The NETMAP target provides static one to one translation between two network blocks of equal size.

The REDIRECT target is available in the PREROUTING and OUTPUT chains. REDIRECT translates the destination IP address of each packet arriving on any interface to the IP address of the interface on which the packet arrived. For example, REDIRECT will translate the destination address of any packet arriving at eth2 to the IP address of eth2. Optionally, the destination port may also be translated. Among other uses, REDIRECT facilitates use of transparent proxies whereby client software such as web browsers may be automatically redirected through the firewall to a proxy server without reconfiguration on the client side.

15Apr/090

Use PHP, GD and .htaccess to Watermark All Images in a Directory

The goal here is to watermark all images in a certain directory, except for thumbnails or some other selection. You could watermark each file before placing it on your webserver, which is probably wise for CPU load reasons, but let’s say you want to do this dynamically for all files served from a single directory, a gallery for example.

The first step is to create a .png file with transparency which holds your watermark image. For this exercise, I’ve created this image:

tbwm.png

(I’ve added the border to make the image stand out from the background of the page).

Here is the original image we are going to test with:

boratwow.jpg

After we have our watermark and sample image, we need to write a PHP file that uses PHP’s GD functions to apply the watermark to our original image. The particular function we use is imagecopy(). Here is the code I am using; I name it w.php:

<?php
// Directory holding the gallery images and the watermark file.
$basedir = "/home/user/public_html/com/gallery/";
$watermarkimage = "tbwm.png";

// basename() strips any directory part, so only files in $basedir can be requested.
$file = basename($_GET['i']);

$image = $basedir."/".$file;
$watermark = $basedir."/".$watermarkimage;

// Load the watermark (a PNG with transparency).
$im = imagecreatefrompng($watermark);

// Pick the loader that matches the file extension.
$ext = strtolower(substr($image, -3));

if ($ext == "gif") {
    if (!$im2 = imagecreatefromgif($image)) {
        echo "Error opening $image!"; exit;
    }
} else if ($ext == "jpg") {
    if (!$im2 = imagecreatefromjpeg($image)) {
        echo "Error opening $image!"; exit;
    }
} else if ($ext == "png") {
    if (!$im2 = imagecreatefrompng($image)) {
        echo "Error opening $image!"; exit;
    }
} else {
    die;
}

// Draw a translucent black bar across the bottom, as tall as the watermark.
imagefilledrectangle($im2, 0, imagesy($im2) - imagesy($im), imagesx($im2), imagesy($im2), imagecolorallocatealpha($im2, 0, 0, 0, 100));
// Copy the watermark into the bottom-right corner.
imagecopy($im2, $im, imagesx($im2) - imagesx($im), imagesy($im2) - imagesy($im), 0, 0, imagesx($im), imagesy($im));

$last_modified = gmdate('D, d M Y H:i:s T', filemtime($image));

header("Last-Modified: $last_modified");
header("Content-Type: image/jpeg");
// The result is always sent as a JPEG at quality 95, whatever the source format.
imagejpeg($im2, NULL, 95);
imagedestroy($im);
imagedestroy($im2);

?>

This file is placed in the images directory.

Also in the images directory, create an .htaccess file with the following code:

RewriteEngine on
# Skip files whose names start with thumb_
RewriteCond $1 !^thumb_
RewriteRule ^(.*\.[jJ].*)$ /com/gallery/w.php?i=$1

This tells the web server that instead of serving jpg files straight out of this directory, it should pass the filename to w.php and serve the result to the browser. It also includes a clause so that files whose names start with thumb_ are skipped, which keeps the watermark off thumbnails.
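
Filename filters like this are easy to get subtly wrong, so it can be worth trying the pattern outside Apache first. Here is a small Python sketch of the intended filter, using a negative lookahead for the thumb_ prefix. The names WATERMARK_PATTERN and should_watermark are made up for illustration. Note that a character class such as [^thumb] is not the same thing: it rejects any name beginning with one of those five letters.

```python
import re

# Names that do not start with "thumb_" and whose extension starts with j or J.
WATERMARK_PATTERN = re.compile(r"^(?!thumb_).*\.[jJ].*$")


def should_watermark(filename):
    """Return True when the request should be handed to w.php."""
    return WATERMARK_PATTERN.match(filename) is not None


for name in ("holiday.jpg", "thumb_holiday.jpg", "team.JPG", "logo.png"):
    print(name, should_watermark(name))
```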

Here is the resulting image, with watermark! This is served right out of an image directory with no watermark on the original picture:

Borat with watermark from php

thanks to systembash

1Apr/090

Compiling a new kernel

1) We must have the following packages installed:

  • kernel-package
  • libncurses5-dev
  • fakeroot
  • wget
  • bzip2
  • build-essential

If not, install them with apt-get install [package name] after running apt-get update

2) Move to the /usr/src/ directory. To do this, use cd /usr/src

3) Get the kernel. To do this, open your browser, go to http://www.kernel.org and download the latest version, or the one you need.

You can also use “wget” for this. For example, if we want to use the 2.6.21.5 kernel, we type this in the console: wget http://www.eu.kernel.org/pub/linux/kernel/v2.6/linux-2.6.21.5.tar.gz

4) When the kernel has been downloaded, unpack the ‘tar.gz’ using tar xvf [tar.gz package name]

5) Make a symbolic link to the original folder which contains the (just unpacked) kernel. Type ln -s [Kernel folder name] linux

Why do we do this? The “linux” entry we created with “ln -s” is simply a link to the original folder. It is only there to make the work easier.

6) Move into the symbolic link folder “linux”. Just type cd linux

7) Make sure you’re in the /usr/src/linux folder, and now type make clean && make mrproper

8) Now, type make menuconfig

NOTE: There are other ways to configure the kernel, but I always use this one. I think it’s the easiest and safest.

A configuration menu will be loaded.

In it, you must select the things you need to run your system and the modules you want.

After this, you must save a configuration file with the settings you’ve selected.

9) Then, you must type the following:

make all
make modules_install
make install

10) We’ve installed our kernel, but now we should tell the system where the new kernel is.

To do this, type:

depmod [number of kernel]

Example -> depmod 2.6.21.5

apt-get install yaird

mkinitrd.yaird -o /boot/initrd.img-[Number of kernel] [Number of kernel]

update-grub

That’s it, we’ve compiled our own kernel :) . To load it, just reboot the computer.
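
Since the same version number turns up in several of the steps, it can help to see them with the placeholders filled in. The Python sketch below only builds the list of commands from this walkthrough for a given version and does not run anything. The function name kernel_build_commands is made up, and it assumes the tarball unpacks into a folder named linux-[version] and that the v2.6 download path applies.

```python
def kernel_build_commands(version):
    """Return the commands from the steps above with the version filled in."""
    tarball = f"linux-{version}.tar.gz"
    return [
        "cd /usr/src",
        f"wget http://www.eu.kernel.org/pub/linux/kernel/v2.6/{tarball}",
        f"tar xvf {tarball}",
        f"ln -s linux-{version} linux",
        "cd linux",
        "make clean && make mrproper",
        "make menuconfig",
        "make all",
        "make modules_install",
        "make install",
        f"depmod {version}",
        "apt-get install yaird",
        f"mkinitrd.yaird -o /boot/initrd.img-{version} {version}",
        "update-grub",
    ]


for command in kernel_build_commands("2.6.21.5"):
    print(command)
```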

1Apr/090

SSH: Secure SHell

1. General facts

1.1 What is SSH?

SSH, as its name suggests, is a Secure SHell which enables you to connect to a remote machine over a network. The network could be a local one, or the machine could be located in London, Madrid or New York! Furthermore, this protocol allows you to launch applications on the remote machine and transfer files to and from the server, all in a secure way. SSH establishes a secure communication channel and lets you authenticate to the remote server by means of a pair of keys.

1.2 Why use SSH?

The traditional commands such as rcp, rlogin and telnet are vulnerable. It is pretty easy to “snoop” a local network and find logins and passwords. Bear in mind that these commands send logins and passwords over the network unencrypted. It only takes an evil-minded person to listen on the network and analyze the circulating packets. Try for instance the Ethereal utility, and you will realize what we are talking about! SSH is much more secure. Data is encrypted, so it’s a pain in the neck for evil-minded folks! That said, stay careful! With SSH, your password is encrypted during the connection, and so is the data circulating between the server and the guest.

1.3 A few notations

- The local machine will be referred to as ipowerht, with an IP address set to 10.0.0.3.
- The remote server will be referred to as ipower, with an IP address set to 10.0.0.4.
- my login (connection ID) will be nadir.

2. SSH for the guest

2.1 Establishing the connection

To connect to an ssh server, we shall use the following command:

ssh login@remote_server

We should remind you that we are the guest now: the IP is 10.0.0.3 and the machine is ipowerht. We can check that:

nadir@ipowerht:~ $ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:0C:6E:4D:3F:2A
inet adr:10.0.0.3  Bcast:10.0.0.255  Mask:255.255.255.0
adr inet6: fe80::20c:6eff:fe4d:3f2a/64 Scope:Lien
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:12462 errors:0 dropped:0 overruns:0 frame:0
TX packets:11916 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 lg file transmission:1000
RX bytes:6283113 (5.9 MiB)  TX bytes:1396161 (1.3 MiB)
Interruption:22

Type “yes” to be connected.

nadir@ipowerht:~ $ ssh nadir@10.0.0.4
The authenticity of host '10.0.0.4 (10.0.0.4)' can't be established.
RSA key fingerprint is c2:72:14:de:97:a3:68:e6:80:96:e5:f6:03:6d:ab:b0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.4' (RSA) to the list of known hosts.
nadir@10.0.0.4's password:
Linux ipower 2.6.12-9-386 #1 Mon Oct 10 13:14:36 BST 2005 i686 GNU/Linux

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
You have new mail.
Last login: Mon Oct 31 12:32:04 2005
nadir@ipower:~$

We are now connected to the remote server ipower, the IP of which is 10.0.0.4.

A few details are necessary here: you must answer yes to be connected. Doing so saves the server’s public key in a file, known_hosts, located in the .ssh subdirectory of your home directory.

nadir@ipowerht:~ $ cat ~/.ssh/known_hosts
|1|0Oe37usFY/ObV2ZushNUYkaCJHw=|hUK+rpChGZkayH3B5DtafHoggwQ= ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAtb+INGggM1ISDOrKTBc0bp3wa7HZOzaAjgi8TJEt
VQyg4cB1ege3C+2WiOdUDoKBDxBdgKaWesEIwx4g+LTY1YzwB6bb2JHU487WJx5YRDRF
yfKYICLJLg2n++vQWW/Bw8fwpcjCfhjV591WyDzHtb9lPkX45qFcWWtgJvJGxOc=

2.2 File transfer via scp

To transfer files in a secure way, we use the scp command as follows:

scp source destination

If it’s a directory transfer, we use -r

scp -r source destination

We use the usual path notation for files on the guest machine, whereas for a remote server we use login@remote_server:path (do NOT put a blank space after the ’:’). Let’s see a few examples. Here we copy a file from the guest to the server:

nadir@ipowerht:~ $ scp Desktop/sources.list nadir@10.0.0.4:Desktop
nadir@10.0.0.4's password:
sources.list                                  100% 1686     1.7KB/s   00:00

Below, we are copying a directory (-r) from the server to the local machine:

nadir@ipowerht:~ $ scp -r nadir@10.0.0.4:simula+ Desktop/
nadir@10.0.0.4's password:
Root                                          100%   29     0.0KB/s   00:00
Repository                                    100%    8     0.0KB/s   00:00
Entries                                       100%   43     0.0KB/s   00:00
test.cpp                                      100%  181     0.2KB/s   00:00
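
The login@remote_server:path notation can be sketched as a small shell helper. This is only an illustration: the function name remote_spec is hypothetical (it is not part of OpenSSH), and the last line prints the scp command instead of running it, so nothing is actually transferred.

```shell
# remote_spec LOGIN HOST PATH
# Hypothetical helper: joins its arguments into the login@remote_server:path
# form that scp expects for a remote file or directory.
remote_spec() {
    # No blank space is inserted after the ':'.
    printf '%s@%s:%s\n' "$1" "$2" "$3"
}

# Dry run: print the command instead of executing it.
echo scp Desktop/sources.list "$(remote_spec nadir 10.0.0.4 Desktop)"
```

Remove the leading echo once the printed command looks right.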

2.3 File transfer via sftp

Instead of the classic ftp, we shall use the ’sftp’ command. Classic ftp sends your password over the network in clear text, so do not blame anybody but yourself if some folks find it. The syntax is as follows:

sftp login@remote_server

In practice:

nadir@ipowerht:~ $ sftp nadir@10.0.0.4
Connecting to 10.0.0.4...
nadir@10.0.0.4's password:
sftp>

2.4 Key authentication

We have previously seen that the authentication on the server was done via a login and a password. We shall now see that the authentication can be done by means of asymmetrical cryptography and a pair of private/public keys.

What exactly is strong authentication?
- first and foremost, it is based on RSA/DSA type encryption
- each user has two keys: a private one and a public one
- to be authenticated, the public key must be on the server and the private one on the guest.
- instead of entering your login and password, you only need to enter a “passphrase”, which is a password that may contain spaces, i.e. a whole sentence. NEVER EVER LEAVE THE PASSPHRASE EMPTY !!! Why? Simply because, if someone steals your private key, you are in the dog house, since an empty passphrase leaves nothing else to protect it.

To generate a pair of keys, we use the following command:

ssh-keygen -t rsa

You can choose between dsa, rsa and rsa1. Let’s see an example of how to generate public and private keys with dsa. Press ’Enter’ on your keyboard to accept the default file name, but not for the passphrase. I repeat: NEVER EVER LEAVE THE PASSPHRASE EMPTY !!!

nadir@ipowerht:~ $ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/nadir/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/nadir/.ssh/id_dsa.
Your public key has been saved in /home/nadir/.ssh/id_dsa.pub.
The key fingerprint is:
99:02:09:26:f7:18:1d:9d:85:93:d8:3f:12:2a:0e:52 nadir@ipowerht
nadir@ipowerht:~ $ ls -al .ssh
total 16
drwx------   2 nadir nadir 4096 2005-11-01 20:28 .
drwxr-xr-x  41 nadir nadir 4096 2005-11-01 19:16 ..
-rw-------   1 nadir nadir  736 2005-11-01 20:28 id_dsa
-rw-r--r--   1 nadir nadir  604 2005-11-01 20:28 id_dsa.pub

In the user’s home directory, the .ssh subdirectory contains the public key id_dsa.pub and the private key id_dsa. Note the permissions on the private key id_dsa:

-rw------- 1 nadir nadir 736 2005-11-01 20:28 id_dsa

Its mode is 600, i.e. rw for the owner, whereas everyone else has no permission at all. Why? Simply because nobody else should be able to read your private key.
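
As a minimal sketch of how these permissions can be set and checked, the following works in a throwaway directory created with mktemp rather than in your real ~/.ssh. The file names only mirror the listing above, and stat -c assumes GNU coreutils, as found on Ubuntu.

```shell
# Work in a scratch directory so that the real ~/.ssh is left untouched.
demo=$(mktemp -d)
mkdir "$demo/.ssh"
touch "$demo/.ssh/id_dsa" "$demo/.ssh/id_dsa.pub"

chmod 700 "$demo/.ssh"             # drwx------ : only the owner may enter
chmod 600 "$demo/.ssh/id_dsa"      # -rw------- : private key, owner only
chmod 644 "$demo/.ssh/id_dsa.pub"  # -rw-r--r-- : public key, readable by all

# Print the octal mode of each entry (GNU stat syntax).
stat -c '%a %n' "$demo/.ssh" "$demo/.ssh/id_dsa" "$demo/.ssh/id_dsa.pub"
```

The same chmod commands, pointed at ~/.ssh, restore the expected modes if they have been changed by accident.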

Transferring the public key to remote servers

To transfer the public key to a remote server, we use the following command:

ssh-copy-id -i path_to_the_public_key login@remote_server

Practically, we have:

nadir@ipowerht:~ $ ssh-copy-id -i ~/.ssh/id_dsa.pub nadir@10.0.0.4
27
The authenticity of host '10.0.0.4 (10.0.0.4)' can't be established.
RSA key fingerprint is c2:72:14:de:97:a3:68:e6:80:96:e5:f6:03:6d:ab:b0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.4' (RSA) to the list of known hosts.
nadir@10.0.0.4's password:
Now try logging into the machine, with "ssh 'nadir@10.0.0.4'", and check in:

.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Notice the message asking you to connect to the remote server and check the file .ssh/authorized_keys. Come on, be patient! Connect and enter the passphrase:

nadir@ipowerht:~ $ ssh nadir@10.0.0.4
Enter passphrase for key '/home/nadir/.ssh/id_dsa':
Linux ipower 2.6.12-9-386 #1 Mon Oct 10 13:14:36 BST 2005 i686 GNU/Linux

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
You have new mail.
Last login: Tue Nov  1 22:42:00 2005 from ipowerht.lan
nadir@ipower:~$ cat .ssh/authorized_keys
ssh-dss AAAAB3NzaC1kc3MAAACBANIk54Df106UVD4Op/bHSOFWvyWl+P5GfkaOJV
j5i+MOO2fs1GD3SivhpZ6UbWfV2VhpidGg+DgqLg9817Isj7GexCrIQ269TgNSvLRb
rD3fA3JkD8zXOlBwn0UhFXURZkUJ9+ghT/JQPUXoKLhb+SN6kQxH8XAh7yPH5+hsug
YRAAAAFQCQtWe+/XfqCmkQO9iAtnJqlNED7wAAAIBUXoNpRZKBNi6CXeSNCzMS6jjC
0yuaLEUetwDGYT0w1aI8rVNiCrohcPiPba/vKgrgO/F+uSBsU2RsNLL8TLHfZvl+2c
e4U6m2o6APzwZFCyRXlgZWvPmsZZqnx1qKwRLkjSOq5ufIfMBXed2RWprwYINq8W7U
a0NYbjVkdD8G7AAAAIEAutDMfZeURchN88dHVuLA5uSR5dE+y2Gk2OmZx2ZjxM7adB
Zi1dKJQ85h6NrnqrgrgNhA0yWDhOBkIWNp24S9jwXS9dCHlnRO+yeaR2faXZbOGjeV
CtEfcdjc5GLSR2heFqBDQntiNnwmNoYzV9kqu1SzylbmdIAHNWJarjxo9xw= nadir@ipowerht

You can check that it’s indeed the correct key (provided you are patient enough to check character by character ;-) ):

nadir@ipower:~$ exit
logout
Connection to 10.0.0.4 closed.
nadir@ipowerht:~ $ cat ~/.ssh/id_dsa.pub
ssh-dss AAAAB3NzaC1kc3MAAACBANIk54Df106UVD4Op/bHSOFWvyWl+P5GfkaOJV
j5i+MOO2fs1GD3SivhpZ6UbWfV2VhpidGg+DgqLg9817Isj7GexCrIQ269TgNSvLRb
rD3fA3JkD8zXOlBwn0UhFXURZkUJ9+ghT/JQPUXoKLhb+SN6kQxH8XAh7yPH5+hsug
YRAAAAFQCQtWe+/XfqCmkQO9iAtnJqlNED7wAAAIBUXoNpRZKBNi6CXeSNCzMS6jjC
0yuaLEUetwDGYT0w1aI8rVNiCrohcPiPba/vKgrgO/F+uSBsU2RsNLL8TLHfZvl+2c
e4U6m2o6APzwZFCyRXlgZWvPmsZZqnx1qKwRLkjSOq5ufIfMBXed2RWprwYINq8W7U
a0NYbjVkdD8G7AAAAIEAutDMfZeURchN88dHVuLA5uSR5dE+y2Gk2OmZx2ZjxM7adB
Zi1dKJQ85h6NrnqrgrgNhA0yWDhOBkIWNp24S9jwXS9dCHlnRO+yeaR2faXZbOGjeV
CtEfcdjc5GLSR2heFqBDQntiNnwmNoYzV9kqu1SzylbmdIAHNWJarjxo9xw= nadir@ipowerht
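
If you are not that patient, the comparison can be left to grep. The sketch below is not part of the original procedure: key_is_authorized is a hypothetical function name, the key material is made up, and the function assumes a public key file holding a single line, which is how ssh-keygen writes it.

```shell
# key_is_authorized PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Succeeds if the line stored in PUBKEY_FILE appears, character for
# character, as a whole line of AUTHORIZED_KEYS_FILE.
key_is_authorized() {
    # -F fixed string, -x whole line, -q quiet, -f read the pattern from a file
    grep -qxF -f "$1" "$2"
}

# Self-contained demonstration with made-up keys in scratch files.
pub=$(mktemp)
auth=$(mktemp)
echo 'ssh-dss AAAAexample nadir@ipowerht' > "$pub"
echo 'ssh-dss AAAAother someone@elsewhere' > "$auth"
cat "$pub" >> "$auth"

if key_is_authorized "$pub" "$auth"; then
    echo "key found"
else
    echo "key missing"
fi
```

On the real machines, the first argument would be ~/.ssh/id_dsa.pub and the second a copy of the server’s .ssh/authorized_keys.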

For every connection to the remote machine, ssh will ask for the passphrase used to encrypt the private key:

nadir@ipowerht:~ $ ssh nadir@10.0.0.4
Enter passphrase for key '/home/nadir/.ssh/id_dsa':

3. SSH for the server

The file /etc/ssh/sshd_config is used to configure the ssh server. It controls the connection parameters. Any modification of this file requires a restart of the sshd daemon, as follows:

nadir@ipower:~$ /etc/init.d/sshd restart

Note that, unlike stopping the daemon, restarting it does not disconnect the users who are already connected to the ssh server. To stop it:

nadir@ipower:~$ /etc/init.d/sshd stop

Before changing the parameters in this configuration file, we strongly encourage you to read the documentation:

nadir@ipower:~$ man sshd

Thanks to math-linux.com

1Apr/090

IPTables: Filtering by MAC Address

If we want to filter a MAC address in our firewall, we can use IPTables to do this. For example, if we want to filter a MAC like 00:12:8D:EE:6E:AB (the MAC must be typed in this format -> HH:HH:HH:HH:HH:HH) and deny it access to our firewall, we can type this:

iptables -A INPUT -m mac --mac-source 00:12:8D:EE:6E:AB -j DROP

We can also use the ! operator, which inverts the match. For example, if we type:

iptables -A INPUT -m mac ! --mac-source 00:12:8D:EE:6E:AB -j DROP

All the packets will be dropped, except the packets coming from the MAC address 00:12:8D:EE:6E:AB.
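
Since the address has to be written as HH:HH:HH:HH:HH:HH, it can help to validate it before building the rule. The following is only a sketch: is_mac is a hypothetical helper, and the rule is printed rather than applied, so no root privileges are needed to try it.

```shell
# is_mac ADDRESS
# Succeeds if ADDRESS is six pairs of hexadecimal digits separated by ':'.
is_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

mac=00:12:8D:EE:6E:AB
if is_mac "$mac"; then
    # Dry run: show the rule that would drop packets coming from this MAC.
    echo iptables -A INPUT -m mac --mac-source "$mac" -j DROP
else
    echo "invalid MAC address: $mac" >&2
fi
```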

1Apr/090

Redirecting a port to a local machine inside our network

If we want to redirect a port (such as the http port) to one of the machines on our network, we should use these IPTables rules:

iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 80 -j DNAT --to-destination 192.168.0.x

iptables -t nat -A POSTROUTING -p tcp -d 192.168.0.x --dport 80 -j SNAT --to-source 192.168.0.y