Hi! A little over half a year ago I made a site called [URL="http://hostpicker.net/home"] HostPicker.net[/URL]. The main purpose of the site is to display data such as price, locations, protection etc. from as many hosting companies and games as possible. When I made the original site I didn't know it would expand as much as it has. The system was fine at the time, but not anymore. The way I currently scrape data is that I keep a JSON list of regex rules in a database, which is then applied with PHP (Laravel framework) and cURL.
Example - JSON:
[CODE]
{
"links": {
"cart": "https://billing.serenityservers.net/cart.php?a=add&currency=1&billingcycle=monthly&pid=15",
"price": "_cart_&skipconfig=1&configoption[_cid_]=_max_",
"location": "_cart_"
},
"values": {
"sel": false,
"cid": "/configid.=.'([0-9]+)';/",
"min": "/configmin.=.([0-9]+);/",
"max": "/configmax.=.([0-9]+);/"
},
"prices": {
"sel": false,
"pri": "/<tr class=\"total\">[\\s]*<td[^>]*>[^<]*<\\/td>[\\s]*<td[^>]*>([^<]+)<\\/td>[\\s]*<\\/tr>/",
"div": "_max_",
"cur": "/([a-zA-Z]{3})/"
},
"locations": {
"aut": true,
"sel": "/Location<\\/td>[\\d\\D]+?<select.+>([\\d\\D]+?)<\\/select>/",
"loc": "/<option.+>(.+)<\\/option>/"
}
}
[/CODE]
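For clarity, applying one of the rules above is just a preg_match over the fetched page. Here's the same idea sketched in Python with a made-up page fragment (the HTML and values are hypothetical, just to show the pattern):

```python
import re

# Hypothetical snippet of a host's cart page (stand-in for the cURL response)
html = """
<script>
configid = '2';
configmin = 1;
configmax = 32;
</script>
"""

# Rules like the "values" block above: name -> regex with one capture group
rules = {
    "cid": r"configid.=.'([0-9]+)';",
    "min": r"configmin.=.([0-9]+);",
    "max": r"configmax.=.([0-9]+);",
}

# Apply every rule to the page and collect the captured values
values = {}
for name, pattern in rules.items():
    match = re.search(pattern, html)
    if match:
        values[name] = match.group(1)

print(values)  # {'cid': '2', 'min': '1', 'max': '32'}
```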
Since I have to rewrite the regex for each host (and again whenever they update), it has become a complete pain.
I have been looking for a better system for a while but haven't found anything useful. Do any of you know a better way to do this?
Are you looking for a better way to build scrapers for individual websites? You could do this quite easily with Selenium WebDriver.
[B]Edit:
[/B]To build out fast and dirty scrapers, you can use [URL="https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en"]XPath Helper[/URL] to grab the direct element on the webpage and then use Selenium to scrape that element.
For example, on [URL]http://www.vaugaming.com/garrys-mod.html[/URL]
If you have XPath Helper, press CTRL+SHIFT+X to turn it on, then whilst holding SHIFT, you can hover over anything and retrieve the XPath for that element.
For the pricing per slot element, our XPath is:
[CODE]/html/body/div[@class='kode-wrapper']/div[@class='kode-content']/section[@class='kode-pagesection']/div[@class='container']/div[@class='row']/div[@class='col-md-12']/div[@class='kd-imageframe']/div[@class='row']/div[@class='col-md-7']/p[4]/strong[/CODE]
Pretty ugly, but it works.
Then in Selenium's .NET WebDriver API, you can do
[CODE]
using OpenQA.Selenium;
using OpenQA.Selenium.IE;

IWebDriver driver = new InternetExplorerDriver();
driver.Navigate().GoToUrl("http://www.vaugaming.com/garrys-mod.html");
IWebElement element = driver.FindElement(By.XPath("/html/body/div[@class='kode-wrapper']/div[@class='kode-content']/section[@class='kode-pagesection']/div[@class='container']/div[@class='row']/div[@class='col-md-12']/div[@class='kd-imageframe']/div[@class='row']/div[@class='col-md-7']/p[4]/strong"));
string priceInfo = element.Text;
[/CODE]
That chrome extension will be super helpful! This system definitely seems better. Thank you!
[QUOTE=CodingBeast;49634386]That chrome extension will be super helpful! This system definitely seems better. Thank you![/QUOTE]
No problem, good luck! Don't get your ass sued for scraping!
Selenium is very heavy; if you want something more lightweight (generally desirable for web services), I can recommend something like [url=https://github.com/jmcarp/robobrowser]Robobrowser[/url] (Python). Instead of being a full headless web browser, it makes requests (with [url=http://docs.python-requests.org/en/latest/]requests[/url]) and lets you work with the results (with [url=http://www.crummy.com/software/BeautifulSoup/]BeautifulSoup[/url]).
If Python's not your thing, I can also recommend [url=https://github.com/sparklemotion/mechanize]Mechanize[/url] for Ruby. There are many other libraries like those for other languages as well, so search around a bit.
Web scraping has been ~50% of my day job for the past few months, so I thought I'd throw in my 2 cents. We have had to use Selenium for a couple of things, since some sites have protection against this kind of thing. I would definitely recommend using something smaller if speed is a concern, though.
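To show what the lightweight approach buys you, here's the fetch-and-parse pattern using nothing but Python's standard library (the HTML is made up; Robobrowser/BeautifulSoup exist precisely to replace this kind of boilerplate with a one-line CSS select):

```python
from html.parser import HTMLParser

# Stand-in for a fetched page (with requests/Robobrowser you'd GET this instead)
html = '<div id="pricing"><p>$0.50 per slot</p><p>Monthly billing</p></div>'

class PriceParser(HTMLParser):
    """Collects the text of <p> tags inside the element with id="pricing"."""
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.in_p = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "pricing":
            self.in_target = True
        elif tag == "p" and self.in_target:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_target = False
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.texts.append(data)

parser = PriceParser()
parser.feed(html)
print(parser.texts)  # ['$0.50 per slot', 'Monthly billing']
```

With BeautifulSoup the whole class collapses to a single selector call, which is the main selling point of these libraries over hand-rolled parsing.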
[QUOTE=BackwardSpy;49648726]Selenium is very heavy, if you want something more lightweight (generally desirable for web services), then I can recommend something like [url=https://github.com/jmcarp/robobrowser]Robobrowser[/url] (Python). Instead of being a full headless web browser, it makes requests (with [url=http://docs.python-requests.org/en/latest/]requests[/url]), and allows you to work with the results (with [url=http://www.crummy.com/software/BeautifulSoup/]BeautifulSoup[/url]).
If Python's not your thing, I can also recommend [url=https://github.com/sparklemotion/mechanize]Mechanize[/url] for Ruby. There are many other libraries like those for other languages as well, so search around a bit.
Web scraping has been ~50% of my day job for the past few months, so I thought I'd throw in my 2 cents. We have had to use selenium for a couple of things, since some sites have protection against this kind of thing. I would definitely recommend using something smaller if speed is a concern, though.[/QUOTE]
I figured he didn't want to deal with WebRequests and cookie management. He would have to learn Fiddler and whatnot.
You generally don't need to worry yourself with those anyway, at least not with Robobrowser and Mechanize. Occasionally it is useful to inspect a form POST or whatever if you want to, for example, log into a page with one request rather than two. However, for 99% of scraping you can just go to the page, find the URL, and get a CSS/XPath selector. With that, it's as simple as this (Ruby example):
[code]page = Mechanize.new.get URI 'url_goes_here'
elements = page.css '#some_element > p'[/code]
I usually do scraping with a credential database and multiple servers to avoid detection, so I have to manage my cookies and whatnot. He probably doesn't need to. I've never heard of those tools before.
Since I'm doing this in PHP I looked a bit around and found [URL="https://github.com/FriendsOfPHP/Goutte"]Goutte[/URL]. Not sure how it is compared to the other ones but from my tests it seems to run pretty well and include all the features I need.
I can't scrape all the data with XPath because some of it is stored in JavaScript variables. Is there any kind of extension or similar tool that could make writing the regex easier? I want to make it as easy as possible so that site staff with less coding knowledge can help me add new hosts and such.
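To illustrate what I mean, the data sits in inline script tags, so it ends up being matched with a targeted regex per variable, something like this (Python just for the sketch; the variable names are hypothetical):

```python
import re

# Hypothetical page fragment where prices live in inline JavaScript
html = """
<script type="text/javascript">
var slotPrice = 0.45;
var currency = "USD";
</script>
"""

def extract_js_var(page, name):
    """Pull the value assigned in a simple `var name = ...;` statement."""
    match = re.search(r"var\s+%s\s*=\s*([^;]+);" % re.escape(name), page)
    return match.group(1).strip().strip('"') if match else None

print(extract_js_var(html, "slotPrice"))  # 0.45
print(extract_js_var(html, "currency"))   # USD
```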
I made a fancy crawler for my work placement that scrapes a bunch of websites for around 170k products. I did it by writing a base class that can be extended per store, so you can add as many new stores as you like, then run the whole thing with a cronjob every day; it takes around 1-2 hours to complete all 170k products.
here's a lil snippet:
[code]
/**
 * Get all products from a website
 * @return array
 */
protected function getProducts( $query )
{
    $products = array();

    // Loop through all the crawled pages
    foreach ( $this->getCrawledPages() as $page )
    {
        // Load the DOM and make sure it won't produce errors
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML( $page );
        libxml_clear_errors();

        // Find the product containers
        $xpath = new DOMXpath( $dom );
        $domProducts = $xpath->query( $query );

        // Loop through the products
        foreach ( $domProducts as $domProduct )
        {
            // Collect all the product information
            $productLink = $this->getLink( $xpath, $domProduct );
            $products[] = array(
                'name'  => $this->getName( $xpath, $domProduct ),
                'link'  => $productLink,
                'price' => $this->getPrice( $xpath, $domProduct ),
            );
        }
    }

    return $products;
}
[/code]
Like I said, I did everything with pure PHP and cURL and it works great. It gives some extra headaches but it's worth it in the long run!
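If it helps anyone in another language, the same base-class structure sketches out like this in Python (the class names and the toy "store" data are made up; a real subclass would supply XPath queries like the PHP version above):

```python
class StoreScraper:
    """Base class: shared crawl loop; per-store subclasses supply the parsing."""
    def get_products(self):
        products = []
        for page in self.get_crawled_pages():
            for node in self.find_product_nodes(page):
                products.append({
                    "name": self.get_name(node),
                    "price": self.get_price(node),
                })
        return products

    # Subclasses override these with store-specific selectors/regexes
    def get_crawled_pages(self): raise NotImplementedError
    def find_product_nodes(self, page): raise NotImplementedError
    def get_name(self, node): raise NotImplementedError
    def get_price(self, node): raise NotImplementedError

class ExampleStore(StoreScraper):
    """Toy store where 'pages' are just lists of (name, price) tuples."""
    def get_crawled_pages(self):
        return [[("64 slots", 9.99), ("128 slots", 17.99)]]
    def find_product_nodes(self, page):
        return page
    def get_name(self, node):
        return node[0]
    def get_price(self, node):
        return node[1]

print(ExampleStore().get_products())
# [{'name': '64 slots', 'price': 9.99}, {'name': '128 slots', 'price': 17.99}]
```

Adding a new store is then just one new subclass, which is what makes the daily cronjob over many sites manageable.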
[QUOTE=robbert^^;49657222]I made a fancy crawler for my work placement that scrapes a bunch of websites for around 170k products. I did it by writing a base class that can be extended per store, so you can add as many new stores as you like, then run the whole thing with a cronjob every day; it takes around 1-2 hours to complete all 170k products.
here's a lil snippet:
[code]
/**
 * Get all products from a website
 * @return array
 */
protected function getProducts( $query )
{
    $products = array();

    // Loop through all the crawled pages
    foreach ( $this->getCrawledPages() as $page )
    {
        // Load the DOM and make sure it won't produce errors
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML( $page );
        libxml_clear_errors();

        // Find the product containers
        $xpath = new DOMXpath( $dom );
        $domProducts = $xpath->query( $query );

        // Loop through the products
        foreach ( $domProducts as $domProduct )
        {
            // Collect all the product information
            $productLink = $this->getLink( $xpath, $domProduct );
            $products[] = array(
                'name'  => $this->getName( $xpath, $domProduct ),
                'link'  => $productLink,
                'price' => $this->getPrice( $xpath, $domProduct ),
            );
        }
    }

    return $products;
}
[/code]
Like I said, I did everything with pure PHP and cURL and it works great. It gives some extra headaches but it's worth it in the long run![/QUOTE]
How fast is it to find an element on the DOM?
Selenium takes about 30ms per element using an #id, and 40ms per element using an XPath on my servers at work
I can benchmark it for you this Friday (?). I have a couple of projects running and some personal stuff that needs my attention, but last time I tested it took about a minute for 1000 products.
And I don't know how many things you need to scrape, but that seems okay if you run it at, say, 1 AM and build an item queue so it only scrapes X amount of items every day, and just let it run daily. If you have a high amount of items, that is.
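For what it's worth, that "only scrape X items per day" rotation is only a few lines; a sketch in Python (batch size and item count are made up):

```python
def todays_batch(items, batch_size, day_number):
    """Return the slice of items to scrape today, cycling through all items."""
    num_batches = -(-len(items) // batch_size)  # ceiling division
    start = (day_number % num_batches) * batch_size
    return items[start:start + batch_size]

items = list(range(10))           # pretend these are 10 product IDs
print(todays_batch(items, 4, 0))  # [0, 1, 2, 3]
print(todays_batch(items, 4, 1))  # [4, 5, 6, 7]
print(todays_batch(items, 4, 2))  # [8, 9]
print(todays_batch(items, 4, 3))  # wraps back to [0, 1, 2, 3]
```

The cronjob would pass in the day number (or a counter from the database) so every item gets refreshed once per cycle.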
thanks
really good thanks