Scraper site
A scraper site is a website that copies all of its content from other websites using web scraping. A search engine is not a scraper site: sites such as Yahoo and Google gather content from other websites and index it so that the index can be searched with keywords, then display snippets of the original site content in response to a user's search. In the last few years, scraper sites built to spam search engines have proliferated rapidly.
Open content is a common source of material for scraper sites.
Made for AdSense
Some scraper sites are created to make money through advertising programs. In such cases they are called Made for AdSense (MFA) sites. The term is also used derogatorily for websites that have no redeeming value except to attract visitors for the sole purpose of getting them to click on advertisements. Made for AdSense sites are considered search engine spam that dilutes the search results, leaving surfers with less-than-satisfactory results.
The scraped content is considered redundant to what the search engine would have shown under normal circumstances had no MFA website appeared in the listings. Various search engines are eliminating these types of websites, which sometimes show up as supplemental results instead of being displayed in the initial search results. Some sites engage in "AdSense arbitrage": they buy AdWords spots for lower-cost search terms and bring the visitor to a page that is mostly AdSense. The arbitrager then pockets the difference between the low-value clicks bought from AdWords and the higher-value clicks generated by this traffic on the MFA sites. In 2007, Google cracked down on this business model by closing the accounts of many arbitragers. Google and Yahoo also combat the proliferation of arbitrage through quality scoring systems. For example, in Google's case, AdWords penalizes "low quality" advertiser pages by charging their campaigns a higher price per click. This effectively erases the arbitrager's profit margin.
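The arbitrage arithmetic described above can be sketched in a few lines of PHP. This is only an illustration: the function name and every figure in it (click prices, click-through rate) are hypothetical and do not come from Google or Yahoo.

```php
<?php
// All figures below are hypothetical; none come from any ad network.

// Expected profit per purchased visitor: each visitor costs one incoming
// click, and earns ad revenue only if they click an ad on the MFA page.
function arbitrageMarginPerVisitor(float $costPerClick, float $revenuePerClick, float $clickThroughRate): float
{
    return $revenuePerClick * $clickThroughRate - $costPerClick;
}

// Cheap incoming click, better-paying outgoing click: positive margin.
$before = arbitrageMarginPerVisitor(0.05, 0.50, 0.30);

// Same page after a quality penalty raises the price of the incoming click.
$after = arbitrageMarginPerVisitor(0.20, 0.50, 0.30);

printf("before: %.2f, after: %.2f\n", $before, $after);
```

With these made-up numbers the margin per visitor goes from positive to negative once the incoming click becomes more expensive, which is the effect the quality score is meant to have.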
Legality
Scraper sites may violate copyright law. Even taking content from an open content site can be a copyright violation if it is done in a way that does not respect the license. For instance, the GNU Free Documentation License (GFDL) and Creative Commons ShareAlike (CC-BY-SA) licenses require that a republisher inform readers of the license conditions and give credit to the original author.
Techniques
Many scrapers pull snippets and text from websites that rank high for the keywords they have targeted, hoping to rank highly themselves in the SERPs (search engine results pages). RSS feeds are vulnerable to scrapers.
Some scraper sites consist of advertisements and paragraphs of words randomly selected from a dictionary. Often a visitor will click on a pay-per-click advertisement because it is the only comprehensible text on the page, and the operators of these scraper sites gain financially from these clicks. Ad networks claim to be constantly working to remove these sites from their programs, although this is actively disputed, since the networks benefit directly from the clicks generated at these kinds of sites. From the advertiser's point of view, the networks don't seem to be making enough effort to stop this problem.
Scrapers tend to be associated with link farms, and are sometimes perceived as the same thing when multiple scrapers link to the same target site. A site that is frequently targeted may be accused of link-farm participation because of the artificial pattern of incoming links from multiple scraper sites.
Patrick Altoft says about this:
Scrapers can cause a lot of problems for bloggers, mainly because a lot of them remove links back to your blog making it hard for search engines to decide which blog is the copycat.
Here is what Matt Cutts recently said about how best to protect yourself against duplicate content:
If you are syndicating articles on third party sites make sure they link back to the original article on your site, rather than your homepage.
So, having internal links within the post as well as maybe a link to your homepage in your feed footer isn’t going to be the best solution. What you really need is a link to your blog post from within the feed content. Obviously your feed will already have a link to your post anyway but most scrapers tend to remove those links and just keep the title and the content.
Find your feed-rss2.php file in the wp-includes folder and add the following code to line 39 (in WP 2.3.1). The code needs to be added just after where it says <?php the_content() ?>
<p><a href="<?php the_guid(); ?>">Permalink + Comments</a></p>
This will make sure search engines know the source of the post and will give your readers an extra place to click to visit your site.
On scrapers, Yoast.com has something interesting to say:
I didn't do that because I thought you guys couldn't find my blog. It was actually a plugin requested by Shoemoney, who wanted to make the scrapers work for him too, by getting some more backlinks from them. I thought it was a great idea, and this is actually quite a simple plugin: you upload it, enable it, maybe change the default text (which is "This is a post from <link><blog name></link>"), and you're done.
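The idea behind such a plugin, adding a credit line with a link back to the source at the end of each feed item, might look roughly like the sketch below. It is not the plugin's actual code: the function name `appendSourceLink` is hypothetical, the wiring into WordPress is left out, and linking to the post rather than the homepage is a choice made here to match the advice quoted earlier.

```php
<?php
// Hypothetical helper, not the plugin's real code: appends a credit line
// with a link back to the original post to a feed item's HTML content.
function appendSourceLink(string $content, string $postUrl, string $blogName): string
{
    // Escape both values so they cannot break out of the attribute or the text.
    $safeUrl  = htmlspecialchars($postUrl, ENT_QUOTES, 'UTF-8');
    $safeName = htmlspecialchars($blogName, ENT_QUOTES, 'UTF-8');

    return $content
        . '<p>This is a post from <a href="' . $safeUrl . '">' . $safeName . '</a></p>';
}

// Example values are placeholders.
echo appendSourceLink('<p>Post body.</p>', 'http://www.example.com/my-post/', 'Example Blog'), "\n";
```

A scraper that copies the feed content verbatim would then republish the link along with the text.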
Sample Case – Scrape site
devtrenth on how to scrape a site:

Screen scraping has been around on the internet since people could code on it, and there are dozens of resources out there to figure out how to do it (search Google for "php screen scrape" to see what I mean). I want to touch on some things that I’ve figured out while scraping some screens. I assume you have PHP running, and know your way around Windows.
- Do it on your local computer. If you are scraping a lot of data you are going to have to do it in an environment that doesn’t have script time limits. The server that I use has a max execution time of 30 seconds, which just doesn’t work if you are scraping a lot of data off of slow pages. The best thing to do is to run your script from the command line, where there is no limit to how long a script can take to execute. This way, you’re not hogging server resources if you are on a shared host, or your own server’s resources if you are on a dedicated host. Obviously, if you’re screen scraping data to serve ‘on-the-fly’, then this scenario won’t work, but it’s awesome for collecting data. Make sure you can run PHP from the command line by opening up a command prompt window and typing ‘php -v’. You should get the version of PHP you are running. If you get an error message then you’ll need to add the folder containing your PHP executable to your PATH environment variable.
- Do it once. If you are writing a script that loops through all of the pages on a site, or a lot of pages – make sure your script works right before you execute it. If the host sees what you are doing and doesn’t like it, then they could just block you. So it’s best to make sure your script runs correctly by doing a small test run. Then when that works, unleash your script on the entire site. In that same vein, don’t screen scrape a site all the time. You’re just going to piss off the admin if they figure it out.
- Do it smart. Make sure the site doesn’t offer an API for doing what you want before you scrape their site. Often, the API can get you the information quicker and in a better format than the screen scrape can.
- Use the cURL library. I really don’t know any other way to scrape a page other than to use cURL; it works so well I just never have had to try anything else. Since you are going to be using PHP from the command line, you’re also going to want to use curl from the command line (it’s easier than using the PHP functions, and external libraries are not loaded anyway). Get curl from http://curl.haxx.se/download.html and download the non-SSL version. Add the folder containing curl.exe to your PATH environment variable, and make sure you can run curl from the command line.
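The first two tips can be sketched as a small guard at the top of a scraping script. The helper names and the example URLs are hypothetical and not part of the original post; the sketch only shows one way to confirm a command-line run and to limit a first pass to a handful of pages.

```php
<?php
// Hypothetical helpers illustrating the tips above.

// True when the script was started with the PHP command-line binary.
function runningFromCommandLine(): bool
{
    return PHP_SAPI === 'cli';
}

// For a test run, keep only the first few pages; otherwise keep the full list.
function pagesForRun(array $urls, bool $testRun, int $sampleSize = 3): array
{
    return $testRun ? array_slice($urls, 0, $sampleSize) : $urls;
}

// Placeholder URLs standing in for the real list of pages.
$allPages = [
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
    'http://www.example.com/page4',
];

if (runningFromCommandLine()) {
    // The command line has no execution time limit by default; this states it explicitly.
    set_time_limit(0);
}

// Small test run first: only two pages are selected here.
foreach (pagesForRun($allPages, true, 2) as $url) {
    echo "would scrape: $url\n";
}
```

Once the test run produces what you expect, passing `false` as the second argument selects the whole list.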
Those are all of my tips. Here is some screen scrape code that I use.
To call curl just write a function like this. This is so much easier than using the php commands, but you probably don’t want to use a shell_exec command on a web server where someone can put in their own input. That might be bad. I only use this code when I run it locally.
function GetCurlPage($pageSpec) {
    // Quote the URL so that shell metacharacters in it cannot run other commands.
    return shell_exec('curl ' . escapeshellarg($pageSpec));
}
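One way to reduce the risk mentioned above is to check the URL before it ever reaches the shell. The sketch below is an assumption about how that could be done, not the author's method: `buildCurlCommand` is a hypothetical name, and the `-s` flag is added only to silence curl's progress output.

```php
<?php
// Hypothetical helper: returns a curl command line for $url, or null when
// the URL is not a well-formed http or https address.
function buildCurlCommand(string $url): ?string
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    // Only allow web addresses, not file:// or other schemes.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        return null;
    }

    // escapeshellarg() quotes the URL so the shell treats it as one argument.
    return 'curl -s ' . escapeshellarg($url);
}

foreach (['http://www.example.com', 'file:///etc/passwd', 'not a url'] as $candidate) {
    $command = buildCurlCommand($candidate);
    echo $candidate, ' => ', $command === null ? 'rejected' : $command, "\n";
}
```

The returned string can be handed to `shell_exec()`. Note that on Windows `escapeshellarg()` is documented to replace percent signs with spaces, so URLs containing percent-encoded characters would need different handling there.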
This is the code that calls the curl function. We start by using the output buffer, which greatly speeds up our code. This particular code would grab the title of a page and print it:
ob_start();
$url = 'http://www.example.com';
$page = GetCurlPage($url);
// Capture whatever sits between the <title> tags into $m[1].
preg_match("~<title>(.*?)</title>~is", $page, $m);
print $m[1];
ob_end_flush();
To run your script from the command line and generate output to a file you simply call it like this:
php my_script_name.php > output.txt
Any output captured by the output buffer will be printed to the file you pass the output to.
This is a very simple example that doesn’t even check to see if the title exists on that page before it prints, but hopefully you can use your imagination to expand this into something that might grab all of the titles on an entire site. A common thing that I do is use a program like Xenu Link Sleuth to build my list of links I want to scrape, and then use a loop to go through and scrape every link on the list (in other words, use Xenu for your spider and your code to process the results).
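A minimal sketch of that expansion, under stated assumptions, is shown below: a title extractor that checks whether a title was found, and a loop over a list of links. The function names are hypothetical, and the page fetcher is passed in as a callable so that the sketch runs on canned pages without touching the network; in practice it would be replaced by a function that actually downloads the page.

```php
<?php
// Returns the trimmed text of the first <title> element, or null if there is none.
function extractTitle(string $html): ?string
{
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

// Hypothetical loop: $fetchPage is any callable that takes a URL and returns
// the page's HTML (for example, a wrapper around command-line curl).
function scrapeTitles(array $urls, callable $fetchPage): array
{
    $titles = [];
    foreach ($urls as $url) {
        // Cast to string so a failed fetch (null) is treated as an empty page.
        $titles[$url] = extractTitle((string) $fetchPage($url));
    }
    return $titles;
}

// Stand-in fetcher with canned pages so the sketch runs offline.
$fakePages = [
    'http://www.example.com/a' => '<html><head><title> Page A </title></head></html>',
    'http://www.example.com/b' => '<html><body>No title here</body></html>',
];
$fetch = function (string $url) use ($fakePages) {
    return $fakePages[$url] ?? '';
};

foreach (scrapeTitles(array_keys($fakePages), $fetch) as $url => $title) {
    echo $url, ' => ', $title ?? '(no title)', "\n";
}
```

If the list of links is saved to a text file with one URL per line (the file name `links.txt` is a placeholder), it could be loaded with `file('links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)` and passed in as `$urls`.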
