First, you have to:
- Install the beautifulsoup4 and requests libraries:
pip install beautifulsoup4
pip install requests
- Import these libraries in your script:
from bs4 import BeautifulSoup
import requests
Now the fun part begins.
- There are a few setup steps we should do before using it.
- Using requests, we retrieve data from a specific URL with a GET request via the requests.get() method, and the response is stored in the variable r.
r=requests.get(url)
- Then the content of that response (which, by the way, is HTML) is used to create a BeautifulSoup object.
soup=BeautifulSoup(r.content,"html.parser")
The "html.parser" argument is optional; if you omit it, BeautifulSoup picks the best available parser for you (and prints a warning), so it's good practice to name it explicitly. Everything in r.content is HTML, after all!
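Putting those two steps together, here is a minimal sketch. An inline HTML string stands in for r.content so the example runs without a network connection; with a real page you would use requests.get(url).content as shown above.

```python
from bs4 import BeautifulSoup

# In a real script you would fetch the page first:
#   import requests
#   r = requests.get(url)
#   html = r.content
# Here a small inline snippet stands in for r.content.
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").text  # visible text inside the <h1> tag
```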
- Actually, to do almost any web scrape you only need to know three keywords, and that's all. The rest is up to you.
- Here they are:
- findAll() function
- .contents
- .text
- findAll() function
soupObject.findAll("tag", {"attribute": "value"})
This returns a list of all HTML elements matching the tag name and attributes; the attribute dictionary is optional.
- .contents
item.contents
Returns the immediate children of an HTML element as a list (note that this includes whitespace text nodes, not just tags).
- .text
Gets the text (the content visible to you on the website) inside an HTML element, with all the HTML tags stripped out.
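Here is a minimal sketch showing all three keywords on a small inline HTML snippet; the markup and attribute values are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="list_l_box">
  <span itemprop="telephone">011-1234567</span>
  <span itemprop="telephone">011-7654321</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# findAll(): every matching element, returned as a list
phones = soup.findAll("span", {"itemprop": "telephone"})

# .text: the visible text with the HTML tags stripped
numbers = [p.text for p in phones]

# .contents: the immediate children of an element as a list
# (includes whitespace text nodes between the tags)
box = soup.find("div", {"class": "list_l_box"})
children = box.contents
```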
Here is a sample script for web scraping; the collected data could then be stored in an Excel sheet.
from bs4 import BeautifulSoup
import requests

def getSoupObject(url="http://www.list.com/search/home.html"):
    # Fetch the page and build a soup object from its HTML content
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def getDataFromPage(soupObj):
    lst = []
    divCont = soupObj.findAll("div", {"class": "list_l_box"})
    for item in divCont:
        itemList = item.contents
        # Skip listings that only carry a "No image" placeholder
        if itemList[1].findAll("img", {"alt": "No image"}) != []:
            continue
        companyDataList = itemList[3].contents
        # Keep only listings that actually have a telephone number
        if companyDataList[5].findAll("span", {"itemprop": "telephone"}) != []:
            companyName = companyDataList[1].text
            companyTele = companyDataList[5].findAll("span", {"itemprop": "telephone"})[0].text
            lst.append([companyName, companyTele])
    return lst
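The script above collects the data but stops short of the saving step. Here is a minimal sketch of writing the result to a file using the standard-library csv module (a true .xlsx file would need a library such as openpyxl); the data list here is a hypothetical stand-in for what getDataFromPage() returns.

```python
import csv

# Hypothetical [company name, telephone] pairs, standing in for
# the list returned by getDataFromPage()
data = [["Acme Ltd", "011-1234567"], ["Widget Co", "011-7654321"]]

# Write a header row followed by the scraped rows; the resulting
# CSV file opens directly in Excel
with open("companies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Company", "Telephone"])
    writer.writerows(data)
```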