First, you have to:
- Install the beautifulsoup4 and requests libraries:
pip install beautifulsoup4
pip install requests
- Import these libraries in your script:
from bs4 import BeautifulSoup
import requests
Now the fun part begins.
- There are a few setup steps we should do before using it.
- Using requests, we retrieve data from a specific URL with a GET request via the requests.get() method, and the response is stored in the variable r.
r=requests.get(url)
- Then the content of that response (which, by the way, is HTML) is used to create a BeautifulSoup object.
soup=BeautifulSoup(r.content,"html.parser")
The "html.parser" argument is optional; if you omit it, BeautifulSoup picks the best available parser for you (and prints a warning), so it's good practice to name it explicitly. Everything in r.content is HTML, after all!
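Putting those two steps together, here is a minimal sketch. An inline HTML string stands in for r.content so the example runs without a network connection; with a real page you would use requests.get(url).content as shown above.

```python
from bs4 import BeautifulSoup

# In a real script you would fetch the page first:
#   import requests
#   r = requests.get(url)
#   html = r.content
# Here a small inline snippet stands in for r.content.
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").text  # visible text inside the <h1> tag
```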
- Actually, to do almost any web scrape you only need to know three keywords, and that's all. The rest is up to you.
- Here they are:
- findAll() function
- .contents
- .text
- findAll() function
soupObject.findAll("tag", {"attribute": "value"})
This returns a list of all HTML elements matching the tag name and attributes; the attribute dictionary is optional.
- .contents
item.contents
Returns the immediate children of an HTML element as a list (note that this includes whitespace text nodes, not just tags).
- .text
Gets the text (the content visible to you on the website) inside an HTML element, with all the HTML tags stripped out.
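Here is a minimal sketch showing all three keywords on a small inline HTML snippet; the markup and attribute values are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="list_l_box">
  <span itemprop="telephone">011-1234567</span>
  <span itemprop="telephone">011-7654321</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# findAll(): every matching element, returned as a list
phones = soup.findAll("span", {"itemprop": "telephone"})

# .text: the visible text with the HTML tags stripped
numbers = [p.text for p in phones]

# .contents: the immediate children of an element as a list
# (includes whitespace text nodes between the tags)
box = soup.find("div", {"class": "list_l_box"})
children = box.contents
```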
Here is a sample script for web scraping; the collected data could then be stored in an Excel sheet.
from bs4 import BeautifulSoup
import requests

def getSoupObject(url="http://www.list.com/search/home.html"):
    # Fetch the page and build a soup object from its HTML content
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def getDataFromPage(soupObj):
    lst = []
    divCont = soupObj.findAll("div", {"class": "list_l_box"})
    for item in divCont:
        itemList = item.contents
        # Skip listings that only carry a "No image" placeholder
        if itemList[1].findAll("img", {"alt": "No image"}) != []:
            continue
        companyDataList = itemList[3].contents
        # Keep only listings that actually have a telephone number
        if companyDataList[5].findAll("span", {"itemprop": "telephone"}) != []:
            companyName = companyDataList[1].text
            companyTele = companyDataList[5].findAll("span", {"itemprop": "telephone"})[0].text
            lst.append([companyName, companyTele])
    return lst
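The script above collects the data but stops short of the saving step. Here is a minimal sketch of writing the result to a file using the standard-library csv module (a true .xlsx file would need a library such as openpyxl); the data list here is a hypothetical stand-in for what getDataFromPage() returns.

```python
import csv

# Hypothetical [company name, telephone] pairs, standing in for
# the list returned by getDataFromPage()
data = [["Acme Ltd", "011-1234567"], ["Widget Co", "011-7654321"]]

# Write a header row followed by the scraped rows; the resulting
# CSV file opens directly in Excel
with open("companies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Company", "Telephone"])
    writer.writerows(data)
```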