Best way to Scrape data from a website?

Status
Not open for further replies.

ethanny2

Member
Basically, I'm making a web app, and while I have good data from a free API, I'd like to add review scores from Metacritic to make my data as good as possible. It seems Metacritic had an unofficial API at one point, but it was shut down, so the only way to get data from the site is to scrape it. I don't know much about web scraping, and most of the libraries I saw were in Python, a language I don't know. What does GAF recommend for scraping?
 
Python -> BeautifulSoup/lxml. If you find this too difficult, use Scrapy. You should also learn regular expressions. You might not know Python, but it's a very easy language to learn if you know anything else. You will most definitely also need to know enough HTML, and enough about how webpages are put together, to be able to pseudocode your scraper before you implement it.
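To give a rough idea of what that looks like, here is a minimal sketch that uses BeautifulSoup to find elements and a regular expression to pull the number out of the text. The HTML snippet, the class names (`title`, `score`), and the `extract_review` helper are all made up for illustration. Metacritic's real markup will be different, so inspect the page first.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup standing in for a review page; real pages will differ.
SAMPLE_HTML = """
<div class="game">
  <h1 class="title">Example Game</h1>
  <span class="score">Metascore: 87 / 100</span>
</div>
"""


def extract_review(html):
    """Return (title, score) from the hypothetical markup above, or None."""
    # "html.parser" is the parser bundled with Python; lxml is an optional faster one.
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", class_="title")
    score_tag = soup.find("span", class_="score")
    if title_tag is None or score_tag is None:
        return None  # element missing, e.g. the page layout changed

    # The regex isolates the number in text like "Metascore: 87 / 100".
    match = re.search(r"(\d+)\s*/\s*100", score_tag.get_text())
    if match is None:
        return None
    return title_tag.get_text(strip=True), int(match.group(1))


if __name__ == "__main__":
    print(extract_review(SAMPLE_HTML))
```

Returning `None` when an element is missing is one possible design choice here: scrapers tend to break quietly when a site changes its layout, so it helps to make the "not found" case explicit.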
 
Seconding this. I've used this setup many times, and Python is ridiculously easy to learn once you know programming.
 
Thanks, I'll check it out. Guess I'm using Python... Are there no JS-based scraping libraries out there? As for how web pages and HTML work, I know about the DOM and how it's sometimes hard to scrape because of AJAX, JS-generated content, etc. But I'm sure a good tutorial will help me out.
 
JS is mostly used for client-side scripting. You do not want to be scraping assets on the client side. It performs poorly, and it's unethical: scraping is a violation of the terms of use of Metacritic and other sites. You should still do it, but you owe it to them to cache data and behave responsibly, not rescrape on every page load. Browsers also restrict cross-domain requests for security reasons, which will prevent you from scraping this from the client.
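As a sketch of the "cache data" part, something like the following keeps each page for a fixed time and only goes back to the site once that time has passed. The `CachedFetcher` class, its parameters, and the one-day default are assumptions for illustration. In a real web app you would probably store results in a database or on disk rather than in memory.

```python
import time


class CachedFetcher:
    """Wrap a fetch function so each URL is fetched at most once per max_age seconds."""

    def __init__(self, fetch, max_age=24 * 60 * 60, clock=time.time):
        self._fetch = fetch      # any callable that takes a URL and returns the body
        self._max_age = max_age  # seconds a cached copy counts as fresh
        self._clock = clock      # injectable so the cache can be tested without waiting
        self._store = {}         # url -> (fetched_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]  # still fresh: no request is sent to the site
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body
```

Your page handlers would then call `fetcher.get(url)` instead of hitting the site directly, so a thousand page loads still cost the scraped site one request per day per URL.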

If you absolutely must use JavaScript, I'm sure there's some server-side or offline JS option, like through Node, for scraping. But I would recommend Python for this use case.

Getting content from AJAX requests isn't too tough: you basically just make the AJAX request directly yourself. You can use the Python requests library for this. If you need to emulate a full browser for whatever reason, which is useful when you're dealing with forms that do a bunch of funky stuff, use mechanize. True JS-generated content is a little more challenging because most of the libraries for HTTP requests or browser emulation don't emulate JS execution. There are ways around this, but this is a much more advanced use case than what you're starting out with.
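For the AJAX case, a rough sketch might look like this. It uses the standard library's `urllib.request` instead of requests so that it runs without installing anything; with requests, the fetch would be roughly `requests.get(url, headers=..., timeout=10).json()`. The URL, the payload shape (`reviews`, `name`, `score`), and both function names are hypothetical. You would find the real endpoint by watching the network tab in your browser's developer tools.

```python
import json
import urllib.request

# Hypothetical endpoint; look up the real one in your browser's network tab.
AJAX_URL = "https://example.com/api/reviews?game=example-game"


def fetch_json(url, timeout=10):
    """Request the same URL the page's JavaScript would call and decode the JSON."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Some sites only answer AJAX-style requests that carry this header.
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def scores_from_payload(payload):
    """Pull (name, score) pairs out of the assumed payload shape."""
    return [(item["name"], item["score"]) for item in payload.get("reviews", [])]
```

Calling `scores_from_payload(fetch_json(AJAX_URL))` would then give you the data without parsing any HTML at all, which is usually less fragile than scraping the rendered page.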
 