Who says you need UBot to run basic scraping tasks? Here’s a trivial script that scrapes the proxy list from centurian.org. It’s unoptimized (it fetches one page at a time, so it’s slow as fuck), but it still does a great job.
<?php
// DOMDocument throws shitty warnings for broken HTML:
// keep them inside libxml, and keep any other PHP warnings out of the output
libxml_use_internal_errors(true);
ini_set('display_errors', 0);

$scrapeURL = 'http://www.centurian.org/popular-proxies/?start=';
$startIndex = 0;
$stop = FALSE;

while (!$stop) {
    // create a fresh DOMDocument for each page
    $dom = new DOMDocument();
    // assume this is the last page until a proxy link turns up
    $stop = TRUE;
    if ($dom->loadHTMLFile($scrapeURL . $startIndex)) {
        // get all links from the page
        foreach ($dom->getElementsByTagName('a') as $link) {
            $proxy = trim($link->textContent);
            // we only want the links that have "http://" in the anchor text
            if (strpos($proxy, 'http://') !== FALSE) {
                print $proxy . PHP_EOL;
                $stop = FALSE;
            }
        }
    }
    // drop the parse errors libxml collected for this page
    libxml_clear_errors();
    // the listing is paged 32 entries at a time
    $startIndex += 32;
}
?>
Usage:
pigpromoter@pleech:~/scrapers/centurianDOM$ php ./centurianDOM.php > proxies.txt
pigpromoter@pleech:~/scrapers/centurianDOM$ cat ./proxies.txt | wc -l
6277
… or simply put it in your htdocs and fire it up through the browser.
As I said, the script is trivial, but it’s a good way of seeing how DOM works. Enjoy!
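If you want to poke at the DOM part without hammering the site, here is a minimal sketch that runs the same link filter against an inline HTML string. The sample markup and the proxy-one.example / proxy-two.example hosts are made up for illustration; only the DOMDocument calls mirror the script above.

```php
<?php
// Hypothetical sample markup standing in for one fetched page (not real site data).
$html = <<<HTML
<html><body>
<a href="/about">About</a>
<a href="/go?1">http://proxy-one.example/</a>
<a href="/go?2">http://proxy-two.example/</a>
<p>Unclosed paragraph
</body></html>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$proxies = [];
// getElementsByTagName() returns a DOMNodeList, which works with foreach.
foreach ($dom->getElementsByTagName('a') as $link) {
    $text = trim($link->textContent);
    // Keep only anchors whose visible text contains "http://".
    if (strpos($text, 'http://') !== false) {
        $proxies[] = $text;
    }
}

print implode(PHP_EOL, $proxies) . PHP_EOL;
```

Run from the command line, this prints the two made-up proxy URLs, one per line, and skips the "About" link because its anchor text has no "http://" in it.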




Many forget the power of PHP and DOM! Nice shot!
True. On the other hand, you’ve got cURL and messing around with regex for the exact same result.