Webmaster knowledge: Protecting your site from the proxy hijacking SEO technique
What follows is a very good article from seofaststart.
How To Tell If You've Been Proxy Hacked
The simplest test, if you are experiencing a problem, is to examine Google search results for a phrase (search term in quotes) that should be unique to your page. For example, if your home page says "Fred's Widget Factory sells the best down-home widgets on Earth" then you can search for that phrase.
You want to use a phrase (or combination of phrases) that should only appear on your page, and nowhere else on the web… or very few places at least. Then you do the search – if there's more than one result (that is, anything besides your own page), then you need to examine the other URLs that are listed. If some of them are delivering an exact copy of the page, you just may be dealing with a proxy that has hijacked your content.
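The check described above can be sketched in PHP. This is only an illustration: the function names `exactPhraseSearchUrl` and `foreignResults` are hypothetical, the host `www.fredswidgets.example` is made up, and the list of result URLs is assumed to be something you copied from the results page by hand.

```php
<?php
// Hypothetical helper: build a Google search URL for an exact-phrase query.
function exactPhraseSearchUrl(string $phrase): string
{
    // Wrapping the phrase in double quotes asks for a verbatim match.
    return 'https://www.google.com/search?q=' . urlencode('"' . $phrase . '"');
}

// Hypothetical helper: given result URLs, return those NOT served from your own host.
function foreignResults(array $resultUrls, string $ownHost): array
{
    $foreign = [];
    foreach ($resultUrls as $url) {
        // parse_url() only recognises the host if a scheme is present.
        $withScheme = preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) ? $url : 'http://' . $url;
        $host = parse_url($withScheme, PHP_URL_HOST);
        if (!is_string($host) || strcasecmp($host, $ownHost) !== 0) {
            $foreign[] = $url;
        }
    }
    return $foreign;
}

echo exactPhraseSearchUrl("Fred's Widget Factory sells the best down-home widgets on Earth"), "\n";
```

Any URL that `foreignResults` returns is a candidate to open and compare against your own page.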
A typical proxy link looks something like this:
www.example.com/nph-proxy.pl/011110A/http/www.mattcutts.com/blog/
It's easy to see what URL that would fetch, if example.com were a real proxy. Other proxy URLs encode the target URL so it's not always that easy to determine what they're going to fetch just by looking.
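For the readable kind of proxy URL, the target can be pulled out mechanically. The sketch below assumes the layout seen in the single example above (script name, a flags segment, the scheme, then the target host and path); the function name `targetFromProxyUrl` is hypothetical, and proxies that encode their target will simply not match.

```php
<?php
// Hypothetical helper: recover the target URL from a readable nph-proxy style link.
// Assumed layout: <proxy host>/nph-proxy.(pl|cgi)/<flags>/<scheme>/<target host and path>
function targetFromProxyUrl(string $proxyUrl): ?string
{
    if (!preg_match('#/nph-proxy\.(?:pl|cgi)/[^/]+/(https?)/(.+)$#i', $proxyUrl, $m)) {
        return null; // not in the assumed layout (for example, an encoded target)
    }
    return strtolower($m[1]) . '://' . $m[2];
}

echo targetFromProxyUrl('www.example.com/nph-proxy.pl/011110A/http/www.mattcutts.com/blog/'), "\n";
```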
The mere presence of proxies in the index doesn't necessarily mean you'll be dropped or penalized. The situation inside Google's systems is no doubt very complex. I have seen sites with multiple proxies indexed, and no ill effects. It's possible that there are certain factors (trust, authority, domain age, etc.) that make one site more susceptible than another. I have no idea how they make the decision on which copy of a page to keep.
Why Is This Even Possible?
In simple terms, it appears that the original (authentic) page gets dropped or penalized as duplicate content.
A couple years ago, Google deployed some software & infrastructure changes collectively known as "Big Daddy." This involved crawling from many different data centers, and changes to the crawler itself. It appears that the changes include moving some of the duplicate content detection down to the crawlers. The bug probably arises from the way the data centers are synchronized. Pure speculation here, but the picture I have of what happens looks like this:
- The original page exists in at least some of the data centers.
- A copy (proxy) gets indexed in one data center, and that gets sync'd across to the others.
- A spider visits the original, checks to see if the content is duplicate, and erroneously decides that it is.
- The original is dropped or penalized.
As far as whether "any site" could get hacked, I don't know. I'm not a black hat. I don't have a link farm. I don't have a botnet to spam blogs with. So I can't manufacture thousands of links to thousands of proxies, in an attempt to knock sites off of the SERPs. I wouldn't do that anyway – it's evil. So what I know is based mostly on sites reporting a problem, blocking the proxies, and seeing the problem disappear after the proxies are gone. Then repeating the exercise with the same results.
How To Fight Back
There are basically three main possibilities for your situation:
Situation 1: You are running an Apache server. We have 2 solutions in this case, that were developed by Jaimie Sirovich (co-author of Professional Search Engine Optimization with PHP). We've worked some late nights on this.
Solution #1 uses mod_rewrite and .htaccess to pass all spider requests through a PHP script that validates the request. This only defends against being hacked via "normal" anonymous proxies that pass along the user agent – it only inspects visits from the "Big 4" search engines (Ask, Google, MSN, and Yahoo). I call this the "first tier" defense – it won't stop every proxy that exists, but it will come close, and you can implement it without modifying any of your applications. It will even work if your web site is all static pages. This is what I'm implementing. Jaimie doesn't like it because it's kind of a hack – and he would rather you didn't use it at all.
Solution #2 is a PHP script that implements the "reverse cloaking" defense, putting a "noindex, nofollow" robots meta tag into your pages unless the visitor is a spider that you have configured the script to recognize. This is only possible if your site is built on PHP. It wouldn't be terribly difficult for a competent PHP user to implement this in an all-static site; you'd just need to change .htaccess so that your .html files are parsed as PHP. A WordPress plug-in will follow soon. This is a more robust defense, against more proxies.
How to get the code: An implementation guide is provided on Jaimie's blog, along with a testing environment that you can use to check spider user agents & IP addresses, and of course the source code for both solutions. No warranty is given. This is hard core code for a hard core situation. Don't use it if you don't need it, and all code should really be deployed by professionals who can understand what it does, modify it to suit unique environments, etc.
Situation 2: You are running a Microsoft (IIS) server. Jaimie is working on an IIS/ASP solution similar to the Apache/PHP solution, which should be available soon. Think days, not weeks, in other words. Much sooner than his new book (Professional SEO with ASP), which is also in the pipeline.
Situation 3: You are on a hosted solution, aren't running PHP scripts that you can edit, don't control the web server, etc. This is a more complex situation. I will have another post tomorrow that will offer some possible solutions, including one that involves creating your own caching proxy on a separate server. In this case, I don't recommend doing anything unless you really believe that you have a problem with proxies.
There are other solutions available. Bill Atchison's Crawlwall is a professional (commercial) solution, that does a lot more to prevent content theft, etc. If you have the means, you may want to consider this instead, and move the burden of "keeping up with the spiders" onto Bill's shoulders. Jaimie is working on a more general proxy-blocking solution as well. Ekstreme has the beginnings of a spider validation solution in the PHP Search Engine Bot Authentication code they published.
If You Are Operating A Proxy – Don't Be Part of the Problem
If you are operating a proxy server, and you don't want to be part of the problem, you can prevent your server from being used as a tool by adding a robots.txt file that prevents all search engine spiders from indexing proxied content through your server. For example, if all proxy URLs begin with /proxy/ then you can use:
User-agent: *
Disallow: /proxy/
Of course, not all proxies are being run by innocent people for innocent reasons. Some of them are actually designed to hijack content – to deliver ads, etc. Some people want to steal your content, and they want the search engines to index it. In fact, I would not be surprised if a large part of the overall problem isn't caused by such people firing links at their own proxies.
Is It Just Google?
You got me… I haven't seen any cases on other engines that looked like a proxy hack, but I'd be surprised if it only affected Google. Google may simply be the only search engine that shows you enough search results to let you "catch" the proxies. Google may be more susceptible because they crawl more URLs more often, and use multiple data centers.
Assuming I am not completely wrong, it sure looks like less of a design flaw, and more of an "emergent property" of the very things that make Google the world's best search engine (just my opinion, apparently the average consumer no longer agrees). I don't know that there is an easy solution, especially if the problem arises because of their multiple-data-center strategy.
Unfortunately, any countermeasures that we implement could be thwarted by someone willing to copy our content in other ways, or by constructing a proxy that spoofs user agents, uses intermediate proxies to hide its IP address, and strips out meta tags. This has always been possible, BTW. Anyone actually doing these things, of course, would likely be committing a crime… and would be a lot easier to find than some script kiddie using comment spam to fire links at someone else's proxies.
UPDATE: As of May 1, 2008, I have every reason to believe that Google has solved this problem, at least in the general case. At this point, the only sites I can see getting "duped by proxy" are spammier than the proxies themselves.
Update again: September 2009 - damned if this thing hasn't cropped up again – now it looks like Google's replacing the duped URL with the copy's URL – and even RANKING the duplicates… (similar to the already-known-and-passed-off-as-a-feature 302 redirect bug).
My view at this point: we could use this SEO tip to improve our own systems, with one caveat. Googlebot is not stupid, but perhaps it needs some hints to do its work.
For protection, the following article from SEO Image is great:
For those of you who do not know, SEO Image is one of the most plagiarized websites. Our content is stolen and rewritten every day by new and novice SEO companies throughout the world.
One issue we have is that these novice SEO companies not only copy word for word, but some cause the same effect that the proxy search portals do: they trigger the duplicate content filter. For those of you new to my blog, I have been very anti-duplicate content filter since its unleashing in 2005 as an overly aggressive filter.
To take this further, proxy sites are a way for searchers to mask the IP they are browsing from, since the proxy server lets someone access a site that has banned their region. I do not want to get too technical about proxy servers; you can read more at Wikipedia. The problem with proxy servers is that they cache the websites that are visited and then allow search engines to spider those copies, so that the proxy appears to be a larger website (page spam) and ranks better, and people click on its paid ads.
The Google URL Removal tool is a sure way of removing proxy duplicates. While we feel the duplicate content filter will remove most copies, the proxy search results still concern us because they are used by black hat SEOs to try and hurt other websites' rankings.
There is one easy way to remove proxy copies with Google's Remove URL tool. First, you need to be able to deny IP ranges from accessing your website, either in Windows IIS Administration or in .htaccess on Linux servers.
First Step:
- Find the proxy indexed in Google with your content
- Look up the reverse DNS using DNS Stuff to determine the IP. We generally block the name servers and the IP by its C class (XXX.XXX.C-Class.XXX). If the IP does not work, try our Server Header Checker Tool.
- Using your .htaccess file or IIS Administration, deny access to the IP range by the C class of the IP.
- Click the link in the Google Search Results and see if it returns a 403 Forbidden Code.
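If you would rather do the C-class check inside a PHP application than in the server configuration, a minimal sketch might look like the following. The function names are hypothetical, and the addresses used are reserved documentation addresses, not real proxies. Only IPv4 is handled.

```php
<?php
// Hypothetical helper: return the first three octets of an IPv4 address
// (the "C class" prefix, with a trailing dot), or null if the input is not valid IPv4.
function classCPrefix(string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    $octets = explode('.', $ip);
    return $octets[0] . '.' . $octets[1] . '.' . $octets[2] . '.';
}

// Hypothetical helper: is the visitor inside one of the blocked C-class ranges?
function isBlockedByClassC(string $remoteIp, array $blockedPrefixes): bool
{
    $prefix = classCPrefix($remoteIp);
    return $prefix !== null && in_array($prefix, $blockedPrefixes, true);
}

// Example: block the whole range the offending proxy lives in.
$blocked = [classCPrefix('192.0.2.45')];
var_dump(isBlockedByClassC('192.0.2.200', $blocked));
```

In a real page you would pass `$_SERVER['REMOTE_ADDR']` as the first argument and send a 403 Forbidden header when the function returns true.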
This is where it gets tricky. If you get the 403 code, then the site will no longer be duplicating you. However, if the site uses a frameset or iframe, then you will NOT be able to use the Google URL Removal Tool, as it will see a 200 "OK" status code and assume the page still exists.
Use the URL Removal Tool and check off "anything associated with this domain". If the site does not use frames, then you will get it removed; if it does have frames, then Google gets a 200 code and will NOT remove the site despite the block. You can try to access the frame and submit that page, but it generally will not help.
All in all, the ability of proxy servers to hurt rankings is unknown. We believe it will affect some sites' rankings, but it may not be the full story. Another issue with proxy servers is that they can 302 hijack sites if they are configured poorly.
We have not found any code that can ban proxy servers, even ones that use nph-proxy.cgi.
How To Fight Back — Code implementations
Well that's where I come in. I have 2 implementations in beta (read: they work according to my tests, but I'm going to be testing more) that address the problem based on the methods the search engines cite. Then, essentially, we're using a benign form of cloaking (yes, cloaking!) to make it more difficult for bad bots, proxies, etc. to exploit us.
I'll expand the explanation in that documentation to make it easier to comprehend/install. But if you know PHP, dive right in.
The code and concepts were primarily based on the book I coauthored, "Search Engine Optimization with PHP." It is my sentiment that most SEOs have to be aware of technology more so than they think, hence the book authored by me and co-author Cristian Darie. This is just one example.
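Before the full class, here is a stripped-down sketch of the DNS check it relies on: a reverse lookup, a domain match, and a confirming forward lookup. It is not the SimpleCloakV2 code. The function name `verifyCrawlerByDns` and the injectable `$reverse` and `$forward` parameters are assumptions added so the logic can be tried without real DNS traffic; by default they fall back to PHP's `gethostbyaddr` and `gethostbyname`.

```php
<?php
// Hypothetical sketch of a forward-confirmed reverse DNS check.
function verifyCrawlerByDns(string $ip, string $hostRegex, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbyname';

    // Step 1: reverse lookup. gethostbyaddr() returns the IP unchanged on
    // failure and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    // Step 2: the host name must belong to the crawler's domain.
    if (!preg_match($hostRegex, $host)) {
        return false;
    }
    // Step 3: the forward lookup of that host must lead back to the same IP.
    return $forward($host) === $ip;
}

// Usage with stubbed lookups (documentation addresses, no network needed):
$rev = function ($ip) { return $ip === '192.0.2.10' ? 'crawl-1.googlebot.com' : $ip; };
$fwd = function ($host) { return $host === 'crawl-1.googlebot.com' ? '192.0.2.10' : $host; };
var_dump(verifyCrawlerByDns('192.0.2.10', '#\.googlebot\.com$#', $rev, $fwd));
```

Step 3 matters because whoever controls an IP range also controls its reverse DNS records, so a matching host name alone proves nothing.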
Note: I didn't realize WP changes quotes to curly quotes to look "pretty" since version 2.1. I turned that feature off. Cut and pasting should work.
Below is the main class necessary for the cloaking functionality, "SimpleCloakV2:"
<?php
$__metaRobotsExcludeProxiesCallbackHTML = '';
/*
// +----------------------------------------------------------------------+
// | SimpleCloakV2 Version 2                                              |
// | Class for cloaking content                                           |
// | http://www.SEOEgghead.com                                            |
// +----------------------------------------------------------------------+
// | Copyright (c) 2005-2006 Jaimie Sirovich and Cristian Darie           |
// +----------------------------------------------------------------------+
*/

// NOTE: this class uses the legacy mysql_* extension, which was removed in PHP 7.0

// load configuration file
require_once('config.inc.php');

class SimpleCloakV2
{
    function _connect()
    {
        if (USE_CUSTOM_CONNECT_CODE) return true;
        // Connect to MySQL server
        $dbLink = mysql_connect(DB_HOST, DB_USER, DB_PASSWORD)
            or die("Could not connect: " . mysql_error());
        // Select the configured database
        mysql_select_db(DB_DATABASE) or die("Could not select database");
        return $dbLink;
    }

    function _close($dbLink)
    {
        if (USE_CUSTOM_CONNECT_CODE) return true;
        // close database connection
        mysql_close($dbLink);
    }

    // returns the confidence level: +2 for a user agent match, +3 for an IP match
    function isSpider($spider_name = '', $check_uas = true, $check_ips = true, $use_user_defined_data = true, $ignore_bad_uas = true)
    {
        // default confidence level to 0
        $confidence = 0;
        // matching user agent?
        if ($check_uas)
            if (SimpleCloakV2::_get(0, $spider_name, 'UA', $_SERVER['HTTP_USER_AGENT'], '', $use_user_defined_data ? '' : 'N', $ignore_bad_uas ? 'bad' : ''))
                $confidence += 2;
        // matching IP?
        if ($check_ips)
            if (SimpleCloakV2::_get(0, $spider_name, 'IP', '', $_SERVER['REMOTE_ADDR'], $use_user_defined_data ? '' : 'N', $ignore_bad_uas ? 'bad' : ''))
                $confidence += 3;
        // return confidence level
        return $confidence;
    }

    // retrieve cloaking data filtered by the supplied parameters
    function _get($id = 0, $spider_name = '', $record_type = '',
                  $value = '', $wildcard_value = '', $is_user_defined_data = '', $not_spider_name = '')
    {
        // by default, retrieve all records
        $q = " SELECT cloak_data.* FROM cloak_data WHERE TRUE ";
        // add filters
        if ($id) {
            $id = (int) $id;
            $q .= " AND id = $id ";
        }
        if ($spider_name) {
            $spider_name = mysql_escape_string($spider_name);
            $q .= " AND spider_name = '$spider_name' ";
        }
        if ($record_type) {
            $record_type = mysql_escape_string($record_type);
            $q .= " AND record_type = '$record_type' ";
        }
        if ($value) {
            $value = mysql_escape_string($value);
            $q .= " AND value = '$value' ";
        }
        // matches either the exact value or a stored prefix (e.g. a partial IP)
        if ($wildcard_value) {
            $wildcard_value = mysql_escape_string($wildcard_value);
            $q .= " AND ( '$wildcard_value' = value OR '$wildcard_value' LIKE CONCAT(value, '.%') ) ";
        }
        // the column is named is_user_defined in the cloak_data table
        if ($is_user_defined_data) {
            $is_user_defined_data = mysql_escape_string($is_user_defined_data);
            $q .= " AND is_user_defined = '$is_user_defined_data' ";
        }
        if ($not_spider_name) {
            $not_spider_name = mysql_escape_string($not_spider_name);
            $q .= " AND spider_name <> '$not_spider_name' ";
        }
        $dbLink = SimpleCloakV2::_connect();
        // execute the query
        $tmp = mysql_query($q);
        SimpleCloakV2::_close($dbLink);
        // return the results as an associative array
        $rows = array();
        while ($_x = mysql_fetch_assoc($tmp)) {
            $rows[] = $_x;
        }
        return $rows;
    }

    // updates the entire database with fresh spider data, but only if our data is
    // more than 7 days old, and if the online version from iplists.com has changed
    function updateAll($delete_user_defined_data = false)
    {
        $dbLink = SimpleCloakV2::_connect();
        // retrieve last update information from database
        $q = "SELECT cloak_update.* FROM cloak_update";
        $tmp = mysql_query($q);
        $updated = mysql_fetch_assoc($tmp);
        $db_version = $updated['version'];
        $updated_on = $updated['updated_on'];
        // if the latest update is more recent than 7 days, don't attempt an update
        if (isset($updated_on) &&
            (strtotime($updated_on) > strtotime("-604800 seconds")))
        {
            // close database connection
            SimpleCloakV2::_close($dbLink);
            // return false to indicate an update wasn't performed
            return false;
        }
        // read the latest iplists version
        $version_url = 'http://www.iplists.com/nw/version.php';
        $ch = curl_init();
        curl_setopt ($ch, CURLOPT_URL, $version_url);
        curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
        $latest_version = curl_exec($ch);
        curl_close($ch);
        $latest_version = mysql_escape_string($latest_version);
        // if no updated version information was retrieved, abort
        if (!$latest_version)
        {
            // close database connection
            SimpleCloakV2::_close($dbLink);
            // return false to indicate an update wasn't performed
            return false;
        }
        // save the update data
        $q = "DELETE FROM cloak_update";
        mysql_query($q);
        $q = "INSERT INTO cloak_update (version, updated_on) " .
             "VALUES('$latest_version', NOW())";
        mysql_query($q);
        // if we already have the current data, don't attempt an update
        if ($latest_version == $db_version)
        {
            // close database connection
            SimpleCloakV2::_close($dbLink);
            // return false to indicate an update wasn't performed
            return false;
        }
        // update the database
        SimpleCloakV2::_updateCloakingDB('google',
            'http://www.iplists.com/nw/google.txt', $delete_user_defined_data);
        SimpleCloakV2::_updateCloakingDB('yahoo',
            'http://www.iplists.com/nw/inktomi.txt', $delete_user_defined_data);
        SimpleCloakV2::_updateCloakingDB('msn',
            'http://www.iplists.com/nw/msn.txt', $delete_user_defined_data);
        SimpleCloakV2::_updateCloakingDB('ask',
            'http://www.iplists.com/nw/askjeeves.txt', $delete_user_defined_data);
        SimpleCloakV2::_updateCloakingDB('altavista',
            'http://www.iplists.com/nw/altavista.txt', $delete_user_defined_data);
        SimpleCloakV2::_updateCloakingDB('lycos',
            'http://www.iplists.com/nw/lycos.txt', $delete_user_defined_data);
        SimpleCloakV2::_updateCloakingDB('wisenut',
            'http://www.iplists.com/nw/wisenut.txt', $delete_user_defined_data);
        // close connection
        SimpleCloakV2::_close($dbLink);
        // return true to indicate a successful update
        return true;
    }

    // update the database for the mentioned spider, by reading the provided URL
    function _updateCloakingDB($spider_name, $url, $delete_user_defined_data = false)
    {
        $ua_regex = '/^# UA "(.*)"$/m';
        $ip_regex = '/^([0-9.]+)$/m';
        // use cURL to read the data from $url
        // NOTE: additional settings are required when accessing the web through a proxy
        $ch = curl_init();
        curl_setopt ($ch, CURLOPT_URL, $url);
        curl_setopt ($ch, CURLOPT_HEADER, 1);
        curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
        $result = curl_exec($ch);
        curl_close($ch);
        // use _parseListURL to parse the list of IPs and user agents
        $lists = SimpleCloakV2::_parseListURL($result, $ua_regex, $ip_regex);
        // if the user agents and IPs weren't retrieved, we cancel the update
        if (!$lists['ua_list'] || !$lists['ip_list']) return;
        // lock the cloak_data table to avoid concurrency problems
        mysql_query('LOCK TABLES cloak_data WRITE');
        // delete all the existing data for $spider_name
        SimpleCloakV2::_deleteSpiderData($spider_name, $delete_user_defined_data ? '' : 'N');
        // insert the list of user agents for the spider
        foreach ($lists['ua_list'] as $ua) {
            SimpleCloakV2::_insertSpiderData($spider_name, 'UA', $ua);
        }
        // insert the list of IPs for the spider
        foreach ($lists['ip_list'] as $ip) {
            SimpleCloakV2::_insertSpiderData($spider_name, 'IP', $ip);
        }
        // release the table lock
        mysql_query('UNLOCK TABLES');
    }

    // helper function used to parse lists of user agents and IPs
    function _parseListURL($data, $ua_regex, $ip_regex)
    {
        $ua_list_ret = preg_match_all($ua_regex, $data, $ua_list);
        $ip_list_ret = preg_match_all($ip_regex, $data, $ip_list);
        return array('ua_list' => $ua_list[1], 'ip_list' => $ip_list[1]);
    }

    // inserts a new row of data to the cloaking table
    function _insertSpiderData($spider_name, $record_type, $value, $is_user_defined = 'N')
    {
        // escape input data
        $spider_name = mysql_escape_string($spider_name);
        $record_type = mysql_escape_string($record_type);
        $value = mysql_escape_string($value);
        $is_user_defined = mysql_escape_string($is_user_defined);
        // build and execute the INSERT query
        $q = "INSERT INTO cloak_data (spider_name, record_type, value, is_user_defined) " .
             "VALUES ('$spider_name', '$record_type', '$value', '$is_user_defined')";
        mysql_query($q);
    }

    // delete the cloaking data for the mentioned spider
    function _deleteSpiderData($spider_name, $is_user_defined = '')
    {
        // escape input data
        $spider_name = mysql_escape_string($spider_name);
        // build and execute the DELETE query
        $q = "DELETE FROM cloak_data WHERE spider_name='$spider_name'";
        if ($is_user_defined) {
            $is_user_defined = mysql_escape_string($is_user_defined);
            $q .= " AND is_user_defined = '$is_user_defined' ";
        }
        mysql_query($q);
    }

    // only use if it's not found via the IPLists cloaking database
    function botVerifyByDNS($ua = array('google', '#.*\.googlebot\.com$#'))
    {
        // check cache of bad bots
        if (SimpleCloakV2::isSpider('bad', false, true, true, false)) {
            return false;
        }
        // check only UA since this function is only called if the cloaking DB doesn't handle it
        if (SimpleCloakV2::isSpider($ua[0], true, false)) {
            // reverse lookup
            $host_name = gethostbyaddr($_SERVER['REMOTE_ADDR']);
            // if it says it's a certain UA but the gethostbyaddr result doesn't match
            // the corresponding domain regex, store it as bad and then abort
            if (!preg_match($ua[1], $host_name)) {
                $dbLink = SimpleCloakV2::_connect();
                SimpleCloakV2::_insertSpiderData('bad', 'IP', $_SERVER['REMOTE_ADDR'], 'Y');
                SimpleCloakV2::_close($dbLink);
                return false;
            }
            $connected_ip_address = $_SERVER['REMOTE_ADDR'];
            $host_name_ip_address = gethostbyname($host_name);
            // if the connected IP matches the authoritative IP, we have a match
            if ($connected_ip_address == $host_name_ip_address) {
                $dbLink = SimpleCloakV2::_connect();
                SimpleCloakV2::_insertSpiderData($ua[0], 'IP', $_SERVER['REMOTE_ADDR'], 'Y');
                SimpleCloakV2::_close($dbLink);
                return true;
            } else {
                // if it says it's a certain UA, gethostbyaddr says the right thing, but gethostbyname does not
                $dbLink = SimpleCloakV2::_connect();
                SimpleCloakV2::_insertSpiderData('bad', 'IP', $_SERVER['REMOTE_ADDR'], 'Y');
                SimpleCloakV2::_close($dbLink);
                return false;
            }
        }
        // it does not even say it's a bot via UA
        return false;
    }

    // output buffer callback: inserts the meta tag right after the closing title tag
    function _addMetaRobotsExcludeProxiesCallback($buffer)
    {
        global $__metaRobotsExcludeProxiesCallbackHTML;
        return preg_replace('#</title>#', '</title>' . $__metaRobotsExcludeProxiesCallbackHTML, $buffer);
    }

    // returns false for a verified spider; otherwise returns a truthy value and
    // (optionally) starts output buffering to add the robots meta tag
    function metaRobotsExcludeProxies(
        $auto_modify_content = true,
        $uas = array(
            array('google', '#.*\.googlebot\.com$#'),
            array('yahoo', '#.*\.yahoo\.net$#'),
            array('msn', '#.*\.live\.com$#'),
            array('ask', '#.*\.ask\.com$#')
        ),
        $meta_tag = '<meta name="robots" content="noindex,nofollow" />',
        $passlist_regex = ''
    )
    {
        global $__metaRobotsExcludeProxiesCallbackHTML;
        if ($meta_tag)
            $__metaRobotsExcludeProxiesCallbackHTML = $meta_tag;
        // if it's on our passlist
        // ex: #become|lycos|somestupidbot#
        if ($passlist_regex) {
            if (preg_match($passlist_regex, $_SERVER['HTTP_USER_AGENT'])) return false;
        }
        foreach ($uas as $u) {
            // if it's a bot according to UA, then start to investigate
            if (SimpleCloakV2::isSpider($u[0], true, false)) {
                // if it's a bot according to IPLists or our user-defined list
                if (SimpleCloakV2::isSpider($u[0], false, true)) {
                    return false;
                // if it's a bot according to DNS
                } else if (SimpleCloakV2::botVerifyByDNS($u)) {
                    return false;
                // if it's not
                } else {
                    if ($auto_modify_content) ob_start(array('SimpleCloakV2', '_addMetaRobotsExcludeProxiesCallback'));
                    return true;
                }
            }
        }
        // it's not a bot according to UA
        if ($auto_modify_content) ob_start(array('SimpleCloakV2', '_addMetaRobotsExcludeProxiesCallback'));
        // evaluates to 2, which is still truthy
        return true + 1;
    }
}
?>
Save this file as "simple_cloak_v2.inc.php" (the name used by the include statements further down).
You will also need the configuration file (it is referenced in "simple_cloak_v2.inc.php"):
<?php
// defines database connection data
// set to "1" if you are already connected in your application.
define("USE_CUSTOM_CONNECT_CODE", 0);
// usually localhost
define("DB_HOST", "your_db_host");
// db user
define("DB_USER", "some_user");
// password
define("DB_PASSWORD", "secret");
// db name
define("DB_DATABASE", "your_db");
?>
Save this as "config.inc.php."
Then, to implement:
Use this SQL to create the database tables needed for the SimpleCloakV2 class
Run the following queries in your MySQL database (using the mysql binary or phpMyAdmin):
CREATE TABLE `cloak_data` (
  `id` int(11) NOT NULL auto_increment,
  `spider_name` varchar(255) NOT NULL default '',
  `record_type` enum('UA','IP') NOT NULL default 'UA',
  `value` varchar(255) NOT NULL default '',
  `is_user_defined` enum('N','Y') NOT NULL default 'N',
  PRIMARY KEY (`id`),
  KEY `value` (`value`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `cloak_update` (
  `version` varchar(255) NOT NULL default '',
  `updated_on` datetime NOT NULL default '0000-00-00 00:00:00'
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Only if you already have a "cloak_data" table (from our book or a previous version of SimpleCloak on the blog), run this SQL:
ALTER TABLE cloak_data ADD `is_user_defined` ENUM('N','Y') NOT NULL;
Populate the Cloaking database with the data from IPLists.com
Note: This should be run periodically from a cron job to keep the data updated. It will update only once a week regardless. However, you may also put it in the footer of an application.
<?php
// load the SimpleCloakV2 library
require_once 'simple_cloak_v2.inc.php';
// update cloaking data and indicate the success status
if (SimpleCloakV2::updateAll())
{
    echo "Cloaking database updated!";
}
else
{
    echo "Cloaking database was already up to date, or the update failed.";
}
?>
Then pick *1* of the following methods.
Note: Method #2 is a bit of a kludge, as the RewriteMap directive of Apache cannot be used in .htaccess. *It has not been tested extensively yet!*
METHOD NUMBER 1 — PHP Implementation
Place this code at the top of your application (or relevant parts thereof):
<?php
include_once('simple_cloak_v2.inc.php');
$_x = SimpleCloakV2::metaRobotsExcludeProxies();
?>
The code automatically inserts the meta tag using PHP output buffering. If you want a more custom/efficient solution, that is also possible. See the first parameter of function "metaRobotsExcludeProxies." Set to false, it will not use the output buffering, and you may use the result to effect changes in your application as desired.
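As a rough illustration of the non-buffered route, the sketch below adds the meta tag to a page string yourself. The helper name `addRobotsMeta` is hypothetical; in a real application the `$exclude` flag would presumably be the value returned by `SimpleCloakV2::metaRobotsExcludeProxies(false)`, which is not called here so that the example runs on its own.

```php
<?php
// Hypothetical helper: insert a robots meta tag right after the closing title tag
// when $exclude is true; otherwise return the page unchanged.
function addRobotsMeta(string $html, bool $exclude, string $metaTag = '<meta name="robots" content="noindex,nofollow" />'): string
{
    if (!$exclude) {
        return $html;
    }
    $needle = '</title>';
    $pos = stripos($html, $needle);
    if ($pos === false) {
        return $html; // no title tag found, leave the page alone
    }
    // substr_replace() with length 0 inserts without overwriting anything
    return substr_replace($html, $metaTag, $pos + strlen($needle), 0);
}

$page = '<html><head><title>Widgets</title></head><body>...</body></html>';
echo addRobotsMeta($page, true), "\n";
```

Doing the insertion in your own template code avoids buffering the whole page, at the cost of touching the application.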
METHOD NUMBER 2 — .htaccess Implementation
Place this in your .htaccess file
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} yahoo|slurp|msn|ask|google|gsa [NC]
RewriteRule (^.*$) proxy.php?orig_url=$1

And this is the code for proxy.php:
<?php
include('simple_cloak_v2.inc.php');
// should we deny access?
if (SimpleCloakV2::metaRobotsExcludeProxies(false)) {
    header("HTTP/1.0 403 Forbidden");
    echo 'forbidden … ';
    exit();
}
// otherwise echo as it was …
// construct the original URL (REQUEST_URI already begins with a slash)
$url = 'http://' . $_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI'];
// get the contents
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_HEADER, 1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
// do some parsing: split at the first blank line, headers before it and body after it
$parts = explode("\r\n\r\n", $result, 2);
$headers = $parts[0];
$data = isset($parts[1]) ? $parts[1] : '';
$split_headers = explode("\r\n", $headers);
// we have to reissue the headers as is
foreach ($split_headers as $s) {
    header($s);
}
// echo the body.
echo $data;
?>
More implementation notes:
1. Add this to all of your headers:
|
and if you see an attempted hijack…
2. Block the site via .htaccess:
|
3. Block the IP address of the proxy
|
4. Do your research and file a spam report with Google.
