WebScrapingLearning

reference book

Web Scraping with Python: Collecting More Data from the Modern Web, by Ryan Mitchell

1.

html = urlopen('http://www.pythonscraping.com/pages/page1.html')

Two main things can go wrong in this line:

The page is not found on the server (or there was an error in retrieving it). → HTTPError
The server is not found. → URLError (more serious)

how to handle both cases:

from urllib.request import urlopen 
from urllib.error import HTTPError 
from urllib.error import URLError
try: 
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e: 
    print(e)
except URLError as e: 
    print('The server could not be found!')
else: 
    print('It Worked!')
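
HTTPError is a subclass of URLError, so the order of the except clauses above matters: the more specific HTTPError has to be caught first. A minimal sketch of the same handling wrapped in a helper; the name fetch and the opener parameter are hypothetical and exist only so that both error paths can be tried without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response object, or None if the request failed.

    `opener` is a hypothetical hook for experimenting; it defaults to urlopen.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        print(f'HTTP error: {e.code}')
    except URLError as e:
        # No usable answer at all (unknown host, refused connection, ...).
        print(f'The server could not be found: {e.reason}')
    return None
```

Swapping the two except clauses would send every HTTPError into the URLError branch.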

2.

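from bs4 import BeautifulSoup
# bs is assumed to be a BeautifulSoup object built from a downloaded page
print(bs.nonExistentTag)

prints None, because accessing a tag that does not exist returns a None object

from bs4 import BeautifulSoup
print(bs.nonExistentTag.someTag)

raises an AttributeError (because we try to access someTag on the None object)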

how to handle it:

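try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    # nonExistingTag was None, so .anotherTag could not be looked up
    print('Tag was not found')
else:
    if badContent is None:
        # nonExistingTag exists, but anotherTag does not
        print('Tag was not found')
    else:
        print(badContent)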
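
A runnable sketch of the same idea, assuming a small inline HTML string in place of a downloaded page. The helper find_path is hypothetical (it is not part of BeautifulSoup); it follows a chain of tag names and stops at the first missing one instead of raising.

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for a downloaded page (an assumption for illustration).
bs = BeautifulSoup('<html><body><h1>Title</h1></body></html>', 'html.parser')


def find_path(root, *names):
    """Follow a chain of tag names; return None as soon as one is missing."""
    node = root
    for name in names:
        node = node.find(name)
        if node is None:
            return None
    return node


print(bs.nonExistentTag)                           # None
print(find_path(bs, 'body', 'h1'))                 # <h1>Title</h1>
print(find_path(bs, 'nonExistentTag', 'someTag'))  # None
```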

3.

putting 1. and 2. together:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

# returns either the title of the page, or a None object
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # the page is not found on the server (or there was an error in retrieving it)
        return None
    except URLError as e:
        # the server is not found; urlopen raises URLError here, it does not return None
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        # bs.body is None when the page has no <body>, and then
        # bs.body.h1 raises an AttributeError
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

4.

When to get_text() and When to Preserve Tags?

Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data.

In general, you should try to preserve the tag structure of a document as long as possible.
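
A small sketch of why the order matters, assuming an inline HTML snippet in place of a real page: while the tags are intact you can still reach attributes such as href, but the string returned by .get_text() no longer carries any of that.

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for a downloaded page (an assumption for illustration).
bs = BeautifulSoup('<p>Hello <b>big</b> <a href="/x">world</a></p>', 'html.parser')

paragraph = bs.find('p')
link_target = paragraph.find('a')['href']  # still possible: the tags are intact
text = paragraph.get_text()                # plain str: tags and attributes are gone

print(link_target)
print(text)
```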

5.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html,"html.parser")
# bs = BeautifulSoup(html.read(),"html.parser")
# calling .read() here is optional; both forms work
for k in bs.find('table',{'id':'giftList'}).descendants:
    print(k)

urlopen returns a file-like object, and BeautifulSoup can use that object directly without calling .read() first

html.read() is needed only when we want the raw HTML content of the page ourselves
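
This can be tried without a network connection. In the sketch below io.BytesIO is an assumed stand-in for the file-like object that urlopen returns, and the HTML bytes are made up for illustration.

```python
import io
from bs4 import BeautifulSoup

raw = b'<html><body><h1>An Interesting Title</h1></body></html>'

# io.BytesIO plays the role of the response object returned by urlopen.
from_file = BeautifulSoup(io.BytesIO(raw), 'html.parser')          # no .read()
from_bytes = BeautifulSoup(io.BytesIO(raw).read(), 'html.parser')  # explicit .read()

print(from_file.h1.get_text() == from_bytes.h1.get_text())
```

A file-like object can be read only once, which is why each call gets its own BytesIO.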

6.

The usage of find_all and find:

tag

the string name of a tag, or a Python list of tag-name strings

a list of all the heading tags (h1 through h6), for example, can be written more compactly with a regular expression:

.find_all(re.compile('h[1-6]'))
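
A sketch comparing the list form with the regular-expression form on a made-up inline document. BeautifulSoup applies a compiled pattern with its search() method, so the pattern here is anchored with ^ and $ to match whole tag names only; that anchoring is an addition for illustration.

```python
import re
from bs4 import BeautifulSoup

html = '<html><body><h1>One</h1><p>text</p><h2>Two</h2><h6>Six</h6></body></html>'
bs = BeautifulSoup(html, 'html.parser')

by_list = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
# ^ and $ keep the pattern from matching longer names that merely contain "h1" etc.
by_regex = bs.find_all(re.compile('^h[1-6]$'))

print([tag.name for tag in by_list])
print([tag.name for tag in by_regex])
```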

attributes

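a Python dictionary of attributes; it matches tags whose attributes agree with the dictionary. A value can also be a collection of alternatives, e.g. {'class': ['green', 'red']}, in which case a tag matches if it has any one of them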
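
A minimal sketch of the attributes argument on made-up inline HTML, showing a single value and a list of alternative values for the same attribute.

```python
from bs4 import BeautifulSoup

html = ('<span class="green">Anna</span>'
        '<span class="red">the prince</span>'
        '<span class="blue">narrator</span>')
bs = BeautifulSoup(html, 'html.parser')

# One allowed value for the class attribute.
one_class = bs.find_all('span', {'class': 'green'})
# A list of values: a span matches if its class is any one of them.
either_class = bs.find_all('span', {'class': ['green', 'red']})

print([tag.get_text() for tag in one_class])
print([tag.get_text() for tag in either_class])
```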

recursive

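Boolean, defaults to True. When True, find_all looks through children, children's children, and so on; when False, it looks only at the direct children of the tag it is called on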
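
A short sketch of the difference, assuming a made-up fragment with one paragraph directly inside a div and another nested one level deeper.

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><p>direct child</p><section><p>nested</p></section></div>'
bs = BeautifulSoup(html, 'html.parser')
outer = bs.find('div', {'id': 'outer'})

all_levels = outer.find_all('p')                  # recursive=True is the default
top_level = outer.find_all('p', recursive=False)  # direct children of the div only

print([p.get_text() for p in all_levels])
print([p.get_text() for p in top_level])
```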

text

unusual in that it matches on the text content of the tags rather than on properties of the tags themselves (recent BeautifulSoup versions call this argument string; text is the older name)

html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all(text='the prince')
print(type(k))
for ele in k:
    print(ele)

limit

used only in the find_all method; find is equivalent to the same find_all call with a limit of 1, except that it returns the single match itself (or None) instead of a list
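
A sketch of limit next to find, on a made-up list of three items.

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<ul><li>a</li><li>b</li><li>c</li></ul>', 'html.parser')

first_two = bs.find_all('li', limit=2)  # a list with at most two tags
only_one = bs.find_all('li', limit=1)   # still a list, holding one tag
first = bs.find('li')                   # the tag itself, or None if nothing matches

print([li.get_text() for li in first_two])
print(only_one[0] == first)
```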

keyword

selects tags that contain a particular attribute or set of attributes, passed as keyword arguments, for example find_all(id='title', class_='text')

each value for an id should be used only once on a page, so a find_all call that filters on an id should match at most one tag, the same tag that the corresponding find call returns

the keyword argument is technically redundant: anything it can do can also be accomplished with other techniques, such as the attributes dictionary, regular expressions, and lambda expressions

for instance, find_all(id='text') and find_all(attrs={'id': 'text'}) return the same result

when filtering on the class attribute with a keyword, remember to add a trailing underscore (class_), because class is a reserved word in Python and cannot be used as a variable or argument name
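
A sketch of keyword arguments next to the equivalent attributes dictionary, on made-up inline HTML. The dictionary is passed through the attrs parameter of find_all here.

```python
from bs4 import BeautifulSoup

html = ('<div id="title" class="text">Heading</div>'
        '<p class="text">Body</p>')
bs = BeautifulSoup(html, 'html.parser')

by_keyword = bs.find_all(id='title')
by_dict = bs.find_all(attrs={'id': 'title'})

by_class_keyword = bs.find_all(class_='text')  # note the trailing underscore
by_class_dict = bs.find_all(attrs={'class': 'text'})

print(by_keyword == by_dict)
print(by_class_keyword == by_class_dict)
```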

Recall that passing a list of tags to .find_all() acts as an “or” filter (it selects all tags that are tag1, tag2, or tag3...). If the list is long, you can end up with a lot of stuff you don’t want. The keyword argument lets you add an additional “and” filter on top of it.
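
A sketch of the "or" list narrowed by an "and" keyword filter, assuming made-up inline HTML in which only some of the tags carry the class intro.

```python
from bs4 import BeautifulSoup

html = ('<h1 class="intro">Title</h1>'
        '<p class="intro">Lead</p>'
        '<p class="body">Text</p>'
        '<span class="intro">Aside</span>')
bs = BeautifulSoup(html, 'html.parser')

# "or": any h1 or p tag.
either_tag = bs.find_all(['h1', 'p'])
# "or" plus "and": an h1 or p tag that also has class="intro".
narrowed = bs.find_all(['h1', 'p'], class_='intro')

print([tag.get_text() for tag in either_tag])
print([tag.get_text() for tag in narrowed])
```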

7.

lambda expression

# the lambda returns the tags whose text is exactly this sentence
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')
# the text filter below matches the same sentence without a lambda
# bs.find_all('', text='Or maybe he\'s only resting?')
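
A runnable sketch of the two approaches on made-up inline HTML. It uses the string argument, the newer name for text, and leaves out the tag argument. The lambda yields the enclosing tag, while the string filter yields the matching text node.

```python
from bs4 import BeautifulSoup

html = ('<p>Is he dead?</p>'
        '<p id="q" class="line">Or maybe he\'s only resting?</p>')
bs = BeautifulSoup(html, 'html.parser')

sentence = "Or maybe he's only resting?"

tags = bs.find_all(lambda tag: tag.get_text() == sentence)  # list of Tag objects
strings = bs.find_all(string=sentence)                      # list of text nodes

print(tags[0].name)
print(strings[0])
```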

note: the first '' is the tag argument described in note 6

it has no effect on the result, so the two snippets below print the same matches:

html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all(text='the prince')
print(type(k))
for ele in k:
    print(ele)

html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all('',text='the prince')
print(type(k))
for ele in k:
    print(ele)

posted on 2021-12-08 14:34 by coderabcd