Basic scraper with PHP and DOM

Category : Uncategorized

Who says you need UBot to run basic scraping tasks? Here’s a trivial script that scrapes centurian.org. It’s unoptimized (i.e. slow as fuck) but it still does a great job.

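<?php
// DOMDocument()s throw shitty warnings for broken HTML:
// hide PHP's own output and let libxml buffer its errors instead of printing them
ini_set('display_errors', 0);
libxml_use_internal_errors(true);

$scrapeURL  = 'http://www.centurian.org/popular-proxies/?start=';
$startIndex = 0;
$stop       = false;

while (!$stop) {
    // assume this is the last page until at least one proxy shows up on it
    $stop = true;

    // create the DOMDocument
    $dom = new DOMDocument();
    if ($dom->loadHTMLFile($scrapeURL . $startIndex)) {
        // get all links from the page
        foreach ($dom->getElementsByTagName('a') as $link) {
            $proxy = $link->textContent;
            // we only want the links that have "http://" in the anchor text
            if (strpos($proxy, 'http://') !== false) {
                print $proxy . "\r\n";
                $stop = false;
            }
        }
    }

    // drop the buffered libxml errors so they don't pile up across pages
    libxml_clear_errors();

    // move on to the next page (the start offset goes up by 32 each time)
    $startIndex += 32;
}

?>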

Usage:

pigpromoter@pleech:~/scrapers/centurianDOM$ php ./centurianDOM.php > proxies.txt
pigpromoter@pleech:~/scrapers/centurianDOM$ cat ./proxies.txt | wc -l
6277

… or simply put it in your htdocs and fire it up through the browser.

As I said, the script is trivial, but it’s a good way of seeing how DOM works. Enjoy!
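
If you want to poke at the DOM part without hitting the site on every run, here's a rough sketch of the same filtering logic wrapped in a function that takes an HTML string. The function name `extractProxies()` and the sample markup are made up for illustration, and the type hints assume PHP 7 or newer; the DOM calls are the same ones the script above uses, except `loadHTML()` in place of `loadHTMLFile()`.

```php
<?php
// Hypothetical helper, not part of the original script: returns the text of
// every <a> element whose anchor text contains "http://".
function extractProxies(string $html): array
{
    // buffer libxml's complaints about sloppy markup instead of printing them
    $previous = libxml_use_internal_errors(true);

    $dom     = new DOMDocument();
    $proxies = [];

    // loadHTML() rejects an empty string, so skip that case
    if ($html !== '' && $dom->loadHTML($html)) {
        foreach ($dom->getElementsByTagName('a') as $link) {
            $text = trim($link->textContent);
            if (strpos($text, 'http://') !== false) {
                $proxies[] = $text;
            }
        }
    }

    // clean up and restore whatever error handling was set before
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $proxies;
}

// Made-up sample markup (note the unclosed <li>, DOM copes with it)
$sample = '<ul>
  <li><a href="/go/1">http://proxy-one.example/</a></li>
  <li><a href="/about">About us</a></li>
  <li><a href="/go/2">http://proxy-two.example/</a>
</ul>';

foreach (extractProxies($sample) as $proxy) {
    print $proxy . "\n";
}
```

Only the two links with "http://" in their anchor text should come out; the "About us" link gets skipped, same as in the full script.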

Comments (2)

Many forget the power of PHP and DOM! Nice shot!

True. On the other hand, you've got cURL and messing around with regex for the exact same result.