Following my last articles about proxy SEO, I did some more research on the topic and found a few more interesting keywords: scrape, Glype, web crawler. Perhaps some of these could be used with our own sites, adapted to white-hat use so that a webmaster's job of managing a site gets easier.
What is it?
Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding certain full-fledged Web browsers, such as the Internet Explorer (IE) and the Mozilla Web browser. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Exemplary uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration.
Web scraping is the process of automatically collecting Web information. Web scraping is a field with active developments sharing a common goal with the semantic Web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. Web scraping, instead, favors practical solutions based on existing technologies even though some solutions are entirely ad hoc. Therefore, there are different levels of automation that existing Web-scraping technologies can provide:
Techniques for Web scraping
- Human copy-and-paste: Sometimes even the best Web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites being scraped explicitly set up barriers to prevent machine automation.
- Text grepping and regular expression matching: A simple yet powerful approach to extract information from Web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).
- HTTP programming: Static and dynamic Web pages can be retrieved by posting HTTP requests to the remote Web server using socket programming.
- DOM parsing: By embedding a full-fledged Web browser, such as the Internet Explorer or the Mozilla Web browser control, programs can retrieve the dynamic contents generated by client side scripts. These Web browser controls also parse Web pages into a DOM tree, based on which programs can retrieve parts of the Web pages.
- HTML parsers: Some semi-structured data query languages, such as the XML query language (XQL) and the hyper-text query language (HTQL), can be used to parse HTML pages and to retrieve and transform Web content.
- Web-scraping software: Many Web-scraping software tools are available that can be used to customize Web-scraping solutions. These tools may provide a Web recording interface that removes the necessity to manually write Web-scraping code, or some scripting functions that can be used to extract and transform Web content, and database interfaces that can store the scraped data in local databases.
- Semantic annotation recognizing: The Web pages may include metadata or semantic markups/annotations which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformats are, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the Web pages, so the Web scrapers can retrieve data schema and instructions from this layer before scraping the pages.
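As a rough illustration of the regular expression technique from the list above, here is a minimal PHP sketch that turns a fragment of HTML into structured rows. The markup, the class names and the $rows layout are all made up for the example; a real page would need its own pattern.

```php
<?php
// Hypothetical product listing. In practice this string would come from a fetched page.
// A nowdoc (<<<'HTML') is used so the dollar signs are not treated as variables.
$html = <<<'HTML'
<ul>
  <li class="item"><span class="name">Widget</span> <span class="price">$4.50</span></li>
  <li class="item"><span class="name">Gadget</span> <span class="price">$12.00</span></li>
</ul>
HTML;

// One capture group for the name, one for the numeric part of the price.
$pattern = '/<span class="name">(.+?)<\/span>\s*<span class="price">\$([\d.]+)<\/span>/';

// PREG_SET_ORDER gives one array per matched item instead of one array per group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    $rows[] = ['name' => $m[1], 'price' => (float) $m[2]];
}

print_r($rows);
```

The resulting $rows array is the "structured data" the definition above talks about; from here it could be written to a database or a spreadsheet.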
Legal issues
Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. Also, in a February, 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found systematic crawling, indexing and deep linking by portal site ofir.dk of real estate site Home.dk not to conflict with Danish law or the database directive of the European Union.
U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels, which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.
In Australia, the Spam Act 2003 outlaws some forms of web harvesting.
Technical measures to stop bots
The administrator of a website can use various measures to stop or slow a bot. Some techniques include:
- Adding entries to robots.txt. A well-behaved application will adhere to them, so Google and other well-behaved bots can be stopped this way.
- Blocking an IP address. This will also block all browsing from that address.
- Sometimes bots declare who they are. Well behaved ones do (for example 'googlebot'). They can be blocked on that basis. Unfortunately, malicious bots may declare they are a normal browser.
- Bots can be blocked by excess traffic monitoring.
- Bots can be blocked with tools to verify that it is a real person accessing the site, such as the CAPTCHA project.
- Sometimes bots can be blocked with carefully crafted JavaScript.
- Locating bots with a honeypot or other method to identify the IP addresses of automated crawlers.
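Two of the measures above, blocking by declared identity and excess traffic monitoring, can be sketched in a few lines of PHP. This is only an illustration: the function names, the blocklist entries and the thresholds are assumptions, and a real site would normally do this in the web server or firewall.

```php
<?php
// Hypothetical blocklist of user-agent substrings; the names are illustrative only.
$blockedAgents = ['badbot', 'example-scraper'];

// Returns true when the declared user agent contains a blocked substring.
function isBlockedAgent(string $userAgent, array $blockedAgents): bool
{
    foreach ($blockedAgents as $needle) {
        // stripos() is a case-insensitive substring search.
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Returns true when more than $maxRequests timestamps fall inside the last $windowSeconds.
function isOverLimit(array $requestTimes, int $now, int $windowSeconds, int $maxRequests): bool
{
    $recent = array_filter($requestTimes, function ($t) use ($now, $windowSeconds) {
        return $t > $now - $windowSeconds;
    });
    return count($recent) > $maxRequests;
}

var_dump(isBlockedAgent('Mozilla/5.0 (compatible; BadBot/1.0)', $blockedAgents));        // bool(true)
var_dump(isBlockedAgent('Mozilla/5.0 (Windows NT 10.0) Firefox/120.0', $blockedAgents)); // bool(false)
var_dump(isOverLimit([100, 101, 102, 103], 104, 10, 3)); // bool(true)
var_dump(isOverLimit([10, 101, 102], 104, 10, 3));       // bool(false)
```

As the list already warns, the user-agent check only catches bots that declare themselves honestly, which is why it is usually combined with traffic monitoring.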
Some interesting articles about how to scrape a site
from oooff
Most Basic Web Data Parsing Script
Whole script -
The whole script, minus the line numbers of course. Those are just there for our reference.
1. <?php
2. $data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
3. $regex = '/Page 1 of (.+?) results/';
4. preg_match($regex,$data,$match);
5. var_dump($match);
6. echo $match[1];
7. ?>
Script Explanation -
Ok here goes with the basic explanation…
Line 2.
$data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
Now if you studied up on the first tutorial you'll know that we're pulling data from MSN search using the file_get_contents command and assigning the data to the $data variable.
However we're also passing some data in the url to get the specific page from MSN that we want to scrape. If you already know about passing variables in the url you can go to Line 3.
You might be asking what is all that stuff after the MSN url? I'm sure you've seen it a lot of times but might not have been sure what it was. Basically, all that stuff is just like passing a variable in a php script, but you're doing it through a url. Let's take a peek at the url we're using here to get a better understanding. Our url, if you don't remember, is "http://search.msn.com/results.aspx?q=site%3Afroogle.com".
Let's break it into two parts split on the question mark. Why, you ask? That's where the url ends and the data being passed begins. With it separated we have:
http://search.msn.com/results.aspx
and
q=site%3Afroogle.com
Now I hope I don't need to go into an explanation on the first part so I'm really only going to talk about the second. Also I'll do some basic tutorials on accepting data later so you understand what happens to this url on the other side. When you look at the second part of the url you'll always see a field and a value for the field, although sometimes that value is blank. How do you know which is the field and which is the value, you ask? The field is always going to come before the equal sign = and the value will come after. Basically think of it like assigning a variable a value. In the data being passed by this url our field is "q", if you didn't already guess, and our value is site%3Afroogle.com. The field 'q' that MSN takes stands for query. So passing data assigned to the 'q' field is telling MSN, "hey, look this search/query up for me."
The value assigned to the field 'q' is site%3Afroogle.com. First thing you're probably thinking is what in the world is that %3A, I didn't type that. Well, to keep things very simple, there are certain characters that can't be passed through urls as they are, things like colons, quotes, semi-colons etc., because these are reserved and mean certain things to a web server when it sees them. So we need to use some other form of formatting. In this case we're converting the ':' in site:froogle.com to an encoded value (more on that later). So what we're asking for by the site: command in MSN is how many pages from site X are in your search engine. So specifically, how many pages from froogle are indexed in MSN.
Click here to see the page we're scraping
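To see the field/value idea and the %3A encoding in action, here is a small sketch using PHP's built-in urlencode(), http_build_query(), parse_url() and parse_str(). The variable names are my own; only the url and the query come from the tutorial.

```php
<?php
$base  = 'http://search.msn.com/results.aspx';
$query = 'site:froogle.com';

// urlencode() converts the reserved ':' into %3A for us.
echo urlencode($query), "\n"; // site%3Afroogle.com

// http_build_query() builds the whole "field=value" part, encoding included.
$url = $base . '?' . http_build_query(['q' => $query]);
echo $url, "\n"; // http://search.msn.com/results.aspx?q=site%3Afroogle.com

// Going the other way: split on the question mark and decode the fields again.
$parts = parse_url($url);
parse_str($parts['query'], $fields);
echo $fields['q'], "\n"; // site:froogle.com
```

Building the url this way means you never have to remember the encoded values by hand.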
// Some hosting providers treat file_get_contents as a dangerous command and disable it for urls. Maybe we should try cURL, or some class, instead.
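If file_get_contents is not allowed to open urls on your host, something along these lines with the cURL extension might work instead. This is a sketch, not part of the original script: fetchPage() is a hypothetical helper name, the 10 second timeout is an arbitrary choice, and it assumes the cURL extension is installed.

```php
<?php
// Fetch a page with cURL and return its body, or null on failure.
// fetchPage() is a hypothetical helper, not part of the original tutorial.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds); // give up after this many seconds

    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }
    return $body;
}

// Usage (performs a real HTTP request, so it is left commented out here):
// $data = fetchPage('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
```

The rest of the script stays the same, since $data still ends up holding the page as a string.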
Line 3.
$regex = '/Page 1 of (.+?) results/';
<div id="search_header"><h1>site:froogle.com</h1><h5>Page 1 of 9,138 results</h5> <b>
Anytime you see a $varname = 'something here'; or $varname = "something here"; you know it's just a value being assigned to a variable. For a plain string like this one you can use single ' or double " quotes interchangeably; just be aware that PHP expands variables and escape sequences such as \n inside double quotes only.
(.+?) is our best friend when it comes to regex. It basically means match everything starting after the text at the beginning (I'll call that text an anchor too, so be prepared for me to use the two terms interchangeably) and stopping at our end text/anchor. More precisely, .+ matches one or more characters and the ? makes the match lazy, so it stops at the first place where the closing anchor fits. Something like this:
opening anchor text here (.+?) closing anchor text here
Pretty easy huh? Yeah I thought so. The only other thing to note is the forward slashes in '/stuff/'; that's a regex thing. Just know that in php the pattern always has to sit between a pair of delimiters, and forward slashes are the usual choice.
// Know more about regex here : http://de.php.net/manual/en/reference.pcre.pattern.syntax.php
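Here is a small sketch of why the ? in (.+?) matters, plus a delimiter other than the forward slash. The $data string is a made-up variation of the MSN snippet above with a second "results" in it, so the difference shows up.

```php
<?php
// Made-up variation of the MSN header with the word "results" appearing twice.
$data = '<h5>Page 1 of 9,138 results</h5> <b>Showing results for froogle</b>';

// Lazy: stops at the FIRST " results" after the opening anchor.
preg_match('/Page 1 of (.+?) results/', $data, $lazy);
echo $lazy[1], "\n";   // 9,138

// Greedy: runs on to the LAST " results" in the string.
preg_match('/Page 1 of (.+) results/', $data, $greedy);
echo $greedy[1], "\n"; // 9,138 results</h5> <b>Showing

// Any matching pair of delimiters works; with # the slash in </h5> needs no escaping.
preg_match('#<h5>(.+?)</h5>#', $data, $h5);
echo $h5[1], "\n";     // Page 1 of 9,138 results
```

When you scrape HTML, the lazy version is almost always the one you want.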
Line 4.
preg_match($regex,$data,$match);
Ah a new function's in town, preg_match(). Preg_match() is the PHP function to call regex for a single match. So anytime we want to match one thing in our data we're going to call the parsing function preg_match().
With preg_match() we're doing something called passing data to the function for it to work on. In this case we're passing $regex, $data, $match. We know what both $regex (the parsing string we just made) and $data (the scraped page from MSN) are, but what is the $match variable? It's just the variable that our parsed data is going to be returned to. In plain English we're saying take $data and then apply the filter $regex to it. Then whatever comes through that filter, dump it out into $match. Make sense?
I sure hope you said yep, that's easy.
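One thing the script does not check is whether anything matched at all. preg_match() returns 1 on a match, 0 on no match and false on an error, so a slightly more careful version might look like the sketch below. extractCount() is a hypothetical helper name, and the sample strings stand in for the scraped page.

```php
<?php
// Returns the text between the anchors, or null when the page does not contain them.
// extractCount() is a hypothetical helper, not part of the original script.
function extractCount(string $data): ?string
{
    $result = preg_match('/Page 1 of (.+?) results/', $data, $match);
    if ($result === 1) {
        return $match[1];
    }
    return null; // 0 means no match, false means the pattern itself failed
}

var_dump(extractCount('<h5>Page 1 of 9,138 results</h5>')); // string(5) "9,138"
var_dump(extractCount('<h5>No results found</h5>'));        // NULL
```

This matters for scraping because the page layout can change at any time, and then $match simply stays empty.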
Line 5.
var_dump($match);
The function var_dump() is your best friend as a programmer. It says whatever is in this variable or array dump it out onto the screen so I can see what's happening. So this line will output this onto the screen.
array(2) {
[0]=>
string(23) "Page 1 of 9,138 results"
[1]=>
string(5) "9,138"
}
Array? What's that? Well this is as good a time as any to introduce what an array is. They're extremely useful tools for you to know. So let's back up a little. We know that a variable is something that holds 1 thing, right? Well an array is just like a variable except it holds multiple things. I like to think of it like this. Stop and imagine a train for a second. It has all these cars on it that hold things, right? Well, a variable is a single car and can only hold a single thing, whereas an array is like a train that has multiple cars to hold things. In the output above we have a two (2) cell array, which is just like a 2 car train. In car 0 we have the string 'Page 1 of 9,138 results' and in car 1 we have the string '9,138', which is the result we want, right? You might be asking why preg_match returns an array rather than just a simple string. It does this to give you two options on how to match things. You'll notice car/cell 0 has the anchors included as well as the matched text, whereas car 1 only has the text inside the anchors.
Line 6.
echo $match[1];
What's with the new notation? If you hadn't already guessed, that's how we access the cars in our train. We know that if we have an array and what we want is in car 1, we access it by 'referencing' that car, which is what the [1] means. We want to output only what's in the second cell because we don't want the anchors included. This will output to our screen:
9,138
Which is exactly what we aimed to do.
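A small follow-up sketch: the value in car 1 is still a string with a comma in it, so if you want to compare or store it as a number you may want to clean it up first. The $match array below is typed in by hand to mirror the var_dump() output above; the 5000 threshold is just an example.

```php
<?php
// Same contents as the var_dump() output above, written out by hand.
$match = ['Page 1 of 9,138 results', '9,138'];

echo count($match), "\n"; // 2  (a two car train)
echo $match[0], "\n";     // Page 1 of 9,138 results
echo $match[1], "\n";     // 9,138

// Strip the thousands separator, then cast the string to an integer.
$total = (int) str_replace(',', '', $match[1]);
var_dump($total);         // int(9138)
var_dump($total > 5000);  // bool(true)
```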
Other things to try -
So fun stuff to try using our new skills.
1. Use the link: command in MSN and see if you can get the number of links for a domain. Don't forget that : = %3A
2. See if you can get the title of any web page. Hint: anchors are going to be <title> and </title>.
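If you get stuck on the second exercise, here is one possible sketch. It works on a hard-coded HTML string rather than a live page, getTitle() is a hypothetical helper name, and the i and s modifiers are my additions so that <TITLE> in capitals and titles spread over several lines still match.

```php
<?php
// Returns the page title, or null when no <title> element is found.
// getTitle() is a hypothetical helper name used only for this sketch.
function getTitle(string $html): ?string
{
    // i = case-insensitive, s = let the dot match newlines too
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

// Stand-in for a page pulled with file_get_contents().
$html = "<html><head>\n<title>\n  Example Domain\n</title></head><body></body></html>";

var_dump(getTitle($html));                   // string(14) "Example Domain"
var_dump(getTitle('<html><body></body></html>')); // NULL
```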
Conclusion -
You can make some pretty cool tools with just the two very basic things I've shared with you so far: pulling data from somewhere using the file_get_contents() function, and parsing it with the preg_match() function. Have fun with it and I'll see you on the next data scraping tutorial.
wiki scrape
<?php
function wikipedia($article) {
$pattern[0] = '/<a href="(.*?)">(.*?)<\\/a>/';
$replace[0] = '$2';
$pattern[1] = '/<h3 id=\"siteSub\">From Wikipedia, the free encyclopedia<\/h3>/';
$replace[1] = '';
$pattern[2] = '/<div id=\"contentSub\">(.*?)<\/div><div id=\"jump-to-nav\">Jump to: navigation, search<\/div>/';
$replace[2] = '';
$pattern[3] = '/<div class=\"messagebox cleanup metadata\">(.*?)<p><br \/><\/p>/';
$replace[3] = '';
$pattern[4] = '/<table class=\"messagebox\" (.*?)>(.*?)<\/table>/';
$replace[4] = '';
$pattern[5] = '/<dl>(.*?)<\/dl>/';
$replace[5] = '';
$pattern[6] = '/<h1 class=\"firstHeading"\>(.*?)<\/h1>/';
$replace[6] = '<h3>$1</h3>';
$pattern[7] = '/<table class=\"messagebox protected\" style=\"border: 1px solid #8888aa; padding: 0px; font-size:9pt;\">(.*?)<\/table>/';
$replace[7] = '';
$pattern[8] = '/<div class=\"infobox sisterproject\">(.*?)<\/div><\/div>/';
$replace[8] = '';
$pattern[9] = '/<sup (.*?)>(.*?)<\/sup>/';
$replace[9] = '';
$pattern[10] = '/<table style=\"background: transparent;\" width=\"0\">(.*?)<\/table>/';
$replace[10] = '';
$pattern[11] = '/<table class=\"messagebox current\" style=\"font-size: normal;\">(.*?)<\/table>/';
$replace[11] = '';
$pattern[12] = '/<table class=\"toccolours\" align=\"center\" width=\"55%\" cellpadding=\"0\" cellspacing=\"0\">(.*?)<\/table>/';
$replace[12] = '';
$pattern[13] = '/<div class=\"editsection\"(.*?)>(.*?)<\/div>/';
$replace[13] = '';
$pattern[14] = '/<div id=\"bodyContent\">/';
$replace[14] = '<div>';
$pattern[15] = '/<dd>(.*?)<\/dd>/';
$replace[15] = '';
$pattern[16] = '/<div class=\"messagebox cleanup metadata\">(.*?)<\/div>/';
$replace[16] = '';
$pattern[17] = '/<div class=\"thumbcaption\">(.*?)<\/div><\/div>/';
$replace[17] = '';
$pattern[18] = '/<div class=\"thumb tright\">/';
$replace[18] = '';
$pattern[19] = '/\[(.*?)\]/';
$replace[19] = '';
$pattern[20] = '/<table class="messagebox protected" (.*?)>(.*?)<\/table>/';
$replace[20] = '';
$pattern[21] = '/<div style="position:absolute; z-index:100; right:20px; top:10px; height:10px; width:300px;"><\/div>/';
$replace[21] = '';
$pattern[22] = '/<div style="position:absolute; z-index:100; right:10px; top:10px;" class="metadata" id="administrator">(.*?)<\/div><\/div>/';
$replace[22] = '';
$pattern[23] = '/<table class="messagebox current"(.*?)>(.*?)<\/table>/';
$replace[23] = '';
$pattern[24] = '/<table class="messagebox current" style="width: auto;">(.*?)<\/table>/';
$replace[24] = '';
$pattern[25] = '/<div class="dablink">(.*?)<\/div>/';
$replace[25] = '';
$pattern[26] = '/<b>/';
$replace[26] = '<strong>';
$pattern[27] = '/<\/b>/';
$replace[27] = '</strong>';
$pattern[28] = '/<div(.*?)>/';
$replace[28] = '';
$pattern[29] = '/<\/div>/';
$replace[29] = '';
$pattern[30] = '/<map(.*?)>(.*?)<\/map>/';
$replace[30] = '';
$pattern[31] = '/<img src="(.*?)" alt="This page is semi-protected." width="18" (.*?)\/>/';
$replace[31] = '';
$pattern[32] = '/<table style="width:100%;background:none">(.*?)<\/table>/';
$replace[32] = '';
$pattern[33] = '/<div class="messagebox merge metadata">(.*?)<\/div>/';
$replace[33] = '';
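// fopen() only returns a file handle; file_get_contents() returns the page itself as a string.
$wikipedia = file_get_contents($article);
if ($wikipedia === false) {
return; // the page could not be fetched
}
// Apply every pattern/replacement pair in order to strip the unwanted markup.
$wikipedia = preg_replace($pattern, $replace, $wikipedia);
// Keep only the article body, cutting at the first end marker that matches.
if (preg_match("/<\!-- start content --\>(.*)<table id=\"toc\" class=\"toc\" summary=\"(.*)\">/", $wikipedia, $w)) {
$wikipedia = $w[1];
} elseif (preg_match("/<\!-- start content --\>(.*)<a name=\"(.*)\">/is", $wikipedia, $w)) {
$wikipedia = $w[1];
} elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"boilerplate metadata\" id=\"stub\">/is", $wikipedia, $w)) {
$wikipedia = $w[1];
} elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"printfooter\">/is", $wikipedia, $w)) {
$wikipedia = $w[1];
}
print $wikipedia;
}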
?>
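The core trick in the wiki script is that preg_replace() accepts an array of patterns and an array of replacements and applies them pair by pair, in order. Here is a cut-down sketch of that idea on a hard-coded snippet. The sample HTML is made up, and only a handful of patterns in the spirit of the script are used.

```php
<?php
// Made-up fragment in the style of a Wikipedia article body.
$html = '<div id="bodyContent"><p><b>PHP</b> is a '
      . '<a href="/wiki/Scripting_language">scripting language</a>.'
      . '<sup id="cite_ref-1">[1]</sup></p></div>';

// Each pattern is paired with the replacement at the same position.
$patterns = [
    '/<a href="(.*?)">(.*?)<\/a>/s', // keep the link text, drop the tag
    '/<sup (.*?)>(.*?)<\/sup>/s',    // drop footnote markers
    '/<b>/',
    '/<\/b>/',
    '/<div(.*?)>/',                  // drop every opening div
    '/<\/div>/',
];
$replacements = ['$2', '', '<strong>', '</strong>', '', ''];

$clean = preg_replace($patterns, $replacements, $html);
echo $clean, "\n"; // <p><strong>PHP</strong> is a scripting language.</p>
```

Because the pairs run in order, a broad pattern such as the one that strips every div will also remove markup that a later, more specific pattern was looking for, so the order of the list is worth checking whenever you add to it.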
