WebScrapingLearning


reference book

"Web Scraping with Python: Collecting More Data from the Modern Web" by Ryan Mitchell

1.

html = urlopen('http://www.pythonscraping.com/pages/page1.html')

Two main things can go wrong in this line:

  • The page is not found on the server (or there was an error in retrieving it). → HTTPError
  • The server is not found. → URLError (more serious)

How to handle them (HTTPError is a subclass of URLError, so it has to be caught first):

from urllib.request import urlopen 
from urllib.error import HTTPError 
from urllib.error import URLError
try: 
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e: 
    print(e)
except URLError as e: 
    print('The server could not be found!')
else: 
    print('It Worked!')
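A minimal sketch of the same pattern wrapped in a reusable helper. The name fetch and the timeout default are assumptions made for illustration; they are not from the book.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # the server answered, but with an error status such as 404 or 500
        print(f'HTTP error {e.code}: {e.reason}')
    except URLError as e:
        # no usable answer at all (unknown host, refused connection, ...)
        print(f'The server could not be found: {e.reason}')
    return None

# HTTPError derives from URLError, so swapping the two except clauses
# would make the HTTPError branch unreachable
print(issubclass(HTTPError, URLError))  # True
```

Returning None on failure keeps the caller's check to a single `if page is None`, at the cost of hiding which error occurred.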

2.

from bs4 import BeautifulSoup
# bs is a BeautifulSoup object created earlier
print(bs.nonExistentTag)

returns None, because the tag does not exist

# chaining another tag access onto that None
print(bs.nonExistentTag.someTag)

raises an AttributeError (because we access an attribute on the None object)

how to handle it:

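try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    # the first tag was missing, so .anotherTag was looked up on None
    print('Tag was not found')
else:
    if badContent is None:
        # the first tag exists, but the nested one does not
        print('Tag was not found')
    else:
        print(badContent)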
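The same check can be sketched as a small helper that walks down a chain of tag names and stops at the first missing one. The helper name get_nested and the inline HTML are hypothetical, chosen only so that the example runs without a network connection.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<html><body><h1>An Interesting Title</h1></body></html>',
                   'html.parser')

def get_nested(node, *tag_names):
    """Walk down tag_names one at a time; return None as soon as one is missing."""
    for name in tag_names:
        if node is None:
            return None
        node = node.find(name)
    return node

print(get_nested(bs, 'body', 'h1'))                  # <h1>An Interesting Title</h1>
print(get_nested(bs, 'nonExistentTag', 'someTag'))   # None
```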

3.

Putting 1 and 2 together:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

# returns either the title of the page, or None
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # the page is not found on the server (or there was an error in retrieving it)
        return None
    except URLError as e:
        # the server could not be reached; urlopen raises here instead of returning None
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        # bs.body is None when the page has no <body> tag, so bs.body.h1
        # may raise an AttributeError (attribute access on the None object)
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

4.

When to get_text() and When to Preserve Tags?

Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data.

In general, you should try to preserve the tag structure of a document as long as possible.
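A small sketch of why the order matters, using an invented HTML fragment rather than a page from the book: while the tags are still there the structure can be queried, and once get_text() has run only a flat string is left.

```python
from bs4 import BeautifulSoup

html = ('<div id="text">Well, <span class="green">Anna Pavlovna</span> '
        'spoke to <span class="green">Prince Vasili</span>.</div>')
bs = BeautifulSoup(html, 'html.parser')
div = bs.find('div', {'id': 'text'})

# tags still present: the green spans can be picked out individually
names = [span.get_text() for span in div.find_all('span', {'class': 'green'})]
print(names)   # ['Anna Pavlovna', 'Prince Vasili']

# after get_text() the span boundaries are gone and cannot be recovered
plain = div.get_text()
print(plain)   # Well, Anna Pavlovna spoke to Prince Vasili.
```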

5.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html,"html.parser")
# bs = BeautifulSoup(html.read(),"html.parser")
# calling .read() is optional here; both forms give the same result
for k in bs.find('table',{'id':'giftList'}).descendants:
    print(k)

urlopen returns a file-like object, and BeautifulSoup can read from it directly without .read() being called first.

Calling html.read() simply returns the raw HTML content of the page, which BeautifulSoup accepts as well.
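This can be checked offline with a stand-in for the response object. Here io.BytesIO plays the role of the file-like object returned by urlopen, and the page content is made up for the example.

```python
import io
from bs4 import BeautifulSoup

page = b'<html><body><h1>An Interesting Title</h1></body></html>'

# pass the file-like object itself
from_file = BeautifulSoup(io.BytesIO(page), 'html.parser')
# pass the bytes obtained by calling .read() first
from_bytes = BeautifulSoup(io.BytesIO(page).read(), 'html.parser')

print(from_file.h1.get_text())                  # An Interesting Title
print(str(from_file) == str(from_bytes))        # True
```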


6.

The usage of find_all and find:

find_all(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

tag

a string name of a tag or even a Python list of string tag names

.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

which can be done more easily as follows:

.find_all(re.compile('h[1-6]'))
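A runnable sketch comparing the two forms on an invented fragment. The ^ and $ anchors are an addition here, on the assumption that an unanchored pattern could also match longer tag names that merely contain h1 to h6.

```python
import re
from bs4 import BeautifulSoup

bs = BeautifulSoup('<h1>Title</h1><p>intro</p><h2>Section</h2><p>body</p>',
                   'html.parser')

by_list = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
by_regex = bs.find_all(re.compile('^h[1-6]$'))

print([tag.name for tag in by_list])    # ['h1', 'h2']
print([tag.name for tag in by_regex])   # ['h1', 'h2']
```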

attributes

a Python dictionary of attributes; when several values are given for one attribute, it matches tags that have any one of those values
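As an illustration, assuming a made-up fragment with three coloured spans, a list of values for the class attribute selects the spans that carry either value:

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<span class="green">A</span>'
                   '<span class="red">B</span>'
                   '<span class="blue">C</span>', 'html.parser')

# green OR red; the blue span is left out
matches = bs.find_all('span', {'class': ['green', 'red']})
print([span.get_text() for span in matches])   # ['A', 'B']
```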


recursive

Boolean, defaults to True: find_all then searches all descendants of the tag; with False it looks only at the direct children
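A short sketch of the difference, using an invented fragment in which one paragraph is a direct child of the div and another is nested one level deeper:

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<div id="outer"><p>direct child</p>'
                   '<section><p>nested</p></section></div>', 'html.parser')
outer = bs.find('div', {'id': 'outer'})

all_p = outer.find_all('p')                      # recursive=True by default
direct_p = outer.find_all('p', recursive=False)  # direct children only

print(len(all_p))                           # 2
print([p.get_text() for p in direct_p])     # ['direct child']
```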

text

unusual in that it matches based on the text content of the tags, rather than on properties of the tags themselves (newer Beautiful Soup releases also accept this argument under the name string)


html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all(text='the prince')
print(type(k))
for ele in k:
    print(ele)
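An offline variant of the same search on an invented sentence. It uses the newer spelling string= instead of text=, and shows that only strings equal to the whole search text are matched.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<p><span class="green">the prince</span> greeted '
                   '<span>the prince</span> and the princess</p>', 'html.parser')

found = bs.find_all(string='the prince')
print(len(found))                        # 2
# each result is a text node; .parent gives the tag that contains it
print([s.parent.name for s in found])    # ['span', 'span']
```

The trailing " and the princess" is not returned, because the filter compares the complete string rather than looking for a substring.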


limit

used only in the find_all method; find is equivalent to the same find_all call, with a limit of 1
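A minimal sketch on a made-up list, showing limit and the relationship between find and find_all:

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<ul><li>one</li><li>two</li><li>three</li></ul>',
                   'html.parser')

first_two = bs.find_all('li', limit=2)
print([li.get_text() for li in first_two])   # ['one', 'two']

# find returns the single element that find_all(..., limit=1) wraps in a list
print(bs.find('li') is bs.find_all('li', limit=1)[0])   # True
```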

keyword

select tags that contain a particular attribute or set of attributes

Each id value should be used only once on a page, so a find_all call that filters on an id keyword should in practice be equivalent to the find call with the same keyword.
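A sketch of keyword filtering, assuming an invented fragment in which one heading has id="title" and class="text":

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<h1 id="title" class="text">War and Peace</h1>'
                   '<p class="text">Chapter 1</p>', 'html.parser')

# both keyword conditions must hold for a tag to match
matches = bs.find_all(id='title', class_='text')
print(len(matches))                         # 1

# with a unique id, find gives the same element directly
print(bs.find(id='title') is matches[0])    # True
```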

keyword is redundant: anything that can be done with keyword can also be accomplished using other techniques, such as regular expressions and lambda expressions

For instance, a keyword argument such as id='text' selects the same tags as the attributes dictionary {'id': 'text'}.
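The equivalence can be checked on a small invented fragment. The dictionary is passed here through the attrs keyword of find_all, which is one way of supplying the attributes argument.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<div id="text">story</div><div id="footer">end</div>',
                   'html.parser')

by_keyword = bs.find_all(id='text')
by_dict = bs.find_all(attrs={'id': 'text'})

print([tag.get_text() for tag in by_keyword])   # ['story']
print(by_keyword == by_dict)                    # True
```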

When using a keyword argument to filter on the class attribute, write it with a trailing underscore, class_, because class is a reserved word in Python that cannot be used as a variable or argument name.
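A sketch that makes the restriction visible, on an invented two-span fragment. compile is used only to show that the Python parser rejects class= before Beautiful Soup is ever called.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<span class="green">A</span><span class="red">B</span>',
                   'html.parser')

# class='green' is a syntax error, independent of Beautiful Soup
try:
    compile("bs.find_all(class='green')", '<example>', 'eval')
    parsed = True
except SyntaxError:
    parsed = False
print(parsed)                          # False

# the trailing underscore sidesteps the reserved word
print(bs.find_all(class_='green'))     # [<span class="green">A</span>]
```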

Recall that passing a list of tags to .find_all() via the attributes list acts as an “or” filter (it selects a list of all tags that have tag1, tag2, or tag3...). If you have a lengthy list of tags, you can end up with a lot of stuff you don’t want. The keyword argument allows you to add an additional “and” filter to this.
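A minimal sketch of the "or" and "and" behaviour side by side; the fragment and the tag names are invented for the example.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<h1 class="green">T</h1><span class="green">A</span>'
                   '<span class="red">B</span><p>C</p>', 'html.parser')

# "or": any h1 or span, whatever its class
either_tag = bs.find_all(['h1', 'span'])
print(len(either_tag))                      # 3

# "and": h1 or span, and additionally class="green"
green_only = bs.find_all(['h1', 'span'], class_='green')
print([tag.name for tag in green_only])     # ['h1', 'span']
```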

7.

lambda expression

# finds the same text as the commented-out second line (the lambda returns the tags, text= returns the strings)
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')
# bs.find_all('', text='Or maybe he\'s only resting?')
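An offline sketch of the lambda filter next to the string filter, on an invented fragment. The function receives each tag in turn and must return True for the tags to keep.

```python
from bs4 import BeautifulSoup

target = "Or maybe he's only resting?"
bs = BeautifulSoup('<div><span class="excitingNote">Or maybe he\'s only resting?</span>'
                   '<span>No, he\'s stone dead.</span></div>', 'html.parser')

# lambda filter: returns the tags whose full text equals target
tags = bs.find_all(lambda tag: tag.get_text() == target)
print([tag.name for tag in tags])      # ['span']

# string filter: returns the matching text node instead of its tag
strings = bs.find_all(string=target)
print(strings[0].parent is tags[0])    # True
```

The enclosing div is not matched by the lambda, because its text also contains the second span.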

Note: the first argument, '', is the tag parameter described in note 6, and the empty string has no effect on the result:

html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all(text='the prince')
print(type(k))
for ele in k:
    print(ele)


html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all('',text='the prince')
print(type(k))
for ele in k:
    print(ele)


posted on 2021-12-08 14:34 by coderabcd
