WebScrapingLearning
Reference book:
《Web Scraping with Python: Collecting More Data from the Modern Web》 by Ryan Mitchell
1.
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
Two main things can go wrong in this line:
- The page is not found on the server (or there was an error retrieving it). → urlopen raises HTTPError
- The server is not found at all. → urlopen raises URLError (more serious)
How to handle both cases:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')
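The order of the two except clauses matters: in the standard library, HTTPError is a subclass of URLError, so a URLError clause placed first would catch both cases. The following is a minimal sketch that checks this without any network access. The `classify` helper and the example URL are made up for illustration.

```python
from urllib.error import HTTPError, URLError

# HTTPError derives from URLError, so `except HTTPError` must come first.
print(issubclass(HTTPError, URLError))


def classify(exc):
    """Hypothetical helper: report which except clause handles `exc`."""
    try:
        raise exc
    except HTTPError:
        return 'HTTPError'
    except URLError:
        return 'URLError'


# Build the exceptions by hand instead of contacting a real server.
http_error = HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
url_error = URLError('server not found')

print(classify(http_error))
print(classify(url_error))
```

If the two clauses in `classify` were swapped, both exceptions would be reported as URLError.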
2.
from bs4 import BeautifulSoup
# bs is a BeautifulSoup object built from a page, as in section 3
print(bs.nonExistentTag)
prints None, because accessing a tag that does not exist returns a None object
print(bs.nonExistentTag.someTag)
raises an AttributeError (because we access an attribute on the None object)
How to handle it:
try:
    badContent = bs.nonExistentTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
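Both behaviours can be reproduced offline. The sketch below parses a small inline HTML string (an assumption made so the example needs no network access) and shows that a missing tag gives None, while chaining onto it raises AttributeError.

```python
from bs4 import BeautifulSoup

# Tiny inline document so the example needs no network access.
bs = BeautifulSoup('<html><body><h1>Title</h1></body></html>', 'html.parser')

# Accessing a tag that is absent from the document gives None.
missing = bs.nonExistentTag
print(missing is None)

# Chaining another attribute access onto None raises AttributeError.
try:
    bs.nonExistentTag.someTag
except AttributeError:
    chained_failed = True
else:
    chained_failed = False
print(chained_failed)
```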
3.
Combining 1. and 2.:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

# returns either the title of the page, or None
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # the page is not found on the server (or there was an error in retrieving it)
        return None
    except URLError as e:
        # the server is not found; urlopen raises here rather than returning None
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        # bs.body is None when the page has no <body>,
        # so bs.body.h1 may raise an AttributeError
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)
4.
When to Call get_text() and When to Preserve Tags?
.get_text() strips all tags and returns a string containing only the text, so calling it should always be the last thing you do, immediately before you print, store, or manipulate your final data.
In general, you should try to preserve the tag structure of a document as long as possible.
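A small sketch of why the order matters, using an inline HTML string invented for illustration: while the result is still a tag you can navigate it and read attributes, but once get_text() has been called only a plain string is left.

```python
from bs4 import BeautifulSoup

html = '<div><p>Hello <b>bold</b> <a href="/next">link</a></p></div>'
bs = BeautifulSoup(html, 'html.parser')

# Still a Tag: the structure and attributes are available.
paragraph = bs.find('p')
link_target = paragraph.find('a')['href']

# After get_text() only a plain string is left; tags and attributes are gone.
text = paragraph.get_text()

print(link_target)
print(text)
```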
5.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html, "html.parser")
# bs = BeautifulSoup(html.read(), "html.parser")
# calling .read() is optional; both lines give the same result
for k in bs.find('table', {'id': 'giftList'}).descendants:
    print(k)
urlopen returns a file-like response object, and BeautifulSoup can read from it directly, so there is no need to call .read() first.
Call html.read() yourself only when you need the raw HTML content of the page.
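To check this without fetching a page, the sketch below uses io.BytesIO as a stand-in for the file-like object that urlopen returns. The table markup is invented for the example. It also shows what .descendants yields: every nested tag and the strings inside them.

```python
import io

from bs4 import BeautifulSoup
from bs4.element import Tag

raw = b'<table id="giftList"><tr><td>Fruit</td></tr></table>'

# io.BytesIO stands in for the file-like response returned by urlopen.
from_file = BeautifulSoup(io.BytesIO(raw), 'html.parser')
from_bytes = BeautifulSoup(io.BytesIO(raw).read(), 'html.parser')

# Both ways of passing the markup produce the same parsed document.
print(str(from_file) == str(from_bytes))

table = from_file.find('table', {'id': 'giftList'})

# .descendants walks every nested node: tags and the strings inside them.
tag_names = [node.name for node in table.descendants if isinstance(node, Tag)]
strings = [str(node) for node in table.descendants if not isinstance(node, Tag)]
print(tag_names)
print(strings)
```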

6.
The usage of find_all and find:

tag
a string name of a tag, or a Python list of tag names, for example:
.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
which can be done more easily with a regular expression (after import re):
.find_all(re.compile('h[1-6]'))
attributes
a Python dictionary of attributes; when one attribute is given several values, tags matching any one of those values are selected, for example:
.find_all('span', {'class': ['green', 'red']})

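recursive
a boolean, True by default: find_all searches all descendants of the tag; with recursive=False it looks only at the tag's direct children
text
unusual in that it matches on the text content of the tags rather than on properties of the tags themselves (recent BeautifulSoup versions call this argument string; text is the older name)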

html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all(text='the prince')
print(type(k))
for ele in k:
    print(ele)

limit
used only in the find_all method; find is equivalent to the same find_all call, with a limit of 1
keyword
select tags that contain a particular attribute or set of attributes, for example:
bs.find_all(id='title', class_='text')
each value for an id should be used only once on the page, so in practice the call above should match the same tag as bs.find(id='title')

keyword is redundant: anything that can be done with keyword arguments can also be accomplished with other techniques, such as regular expressions and lambda expressions
For instance, bs.find_all(id='text') and bs.find_all(attrs={'id': 'text'}) select the same tags

when filtering on the class attribute with a keyword argument, remember to add a trailing underscore and write class_, for example:
bs.find_all(class_='green')
class is a reserved word in Python that cannot be used as a variable or argument name
Recall that passing a list of tags to .find_all() via the attributes list acts as an “or” filter (it selects a list of all tags that have tag1, tag2, or tag3...). If you have a lengthy list of tags, you can end up with a lot of stuff you don’t want. The keyword argument allows you to add an additional “and” filter to this.
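The arguments described in this section can be tried side by side on a small document. The sketch below uses an inline HTML string invented for illustration. It is one way to exercise each argument, not a reproduction of the book's examples. The regular expression is anchored with ^ and $ here so that it matches only the tag names h1 to h6.

```python
import re

from bs4 import BeautifulSoup

html = '''
<div id="main">
  <h1 id="title" class="text">Heading</h1>
  <h2 class="green">Sub</h2>
  <span class="green">g</span>
  <span class="red">r</span>
  <span class="blue">b</span>
</div>
'''
bs = BeautifulSoup(html, 'html.parser')

# tag: a list of names and a regular expression select the same headings here.
by_list = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
by_regex = bs.find_all(re.compile('^h[1-6]$'))

# attributes: several values for one attribute act as an "or" filter.
green_or_red = bs.find_all('span', {'class': ['green', 'red']})

# keyword: class_ filters on the class attribute; adding a tag name is an "and".
green_any_tag = bs.find_all(class_='green')
green_spans = bs.find_all('span', class_='green')

# limit: find() is like find_all(..., limit=1) but returns the tag itself.
first_span = bs.find('span')
limited = bs.find_all('span', limit=1)

# recursive=False looks only at direct children of the object it is called on.
top_level_spans = bs.find_all('span', recursive=False)
div_level_spans = bs.div.find_all('span', recursive=False)

print([t.name for t in by_list], [t.name for t in by_regex])
print([t.get_text() for t in green_or_red])
print([t.name for t in green_any_tag], [t.name for t in green_spans])
print(len(top_level_spans), len(div_level_spans))
```

The spans are children of the div, not of the document root, which is why the recursive=False search from bs finds none of them while the same search from bs.div finds all three.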
7.
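lambda expression
find_all also accepts a function that takes a tag and returns True or False
# matches the same element as the commented-out call below; the lambda returns
# the tag itself, while the text filter returns the string inside it
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')
# bs.find_all('', text='Or maybe he\'s only resting?')
Note: the first argument '' is the tag parameter described in section 6.
An empty string places no restriction on the tag name, so it has no effect on the result: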
html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all(text='the prince')
print(type(k))
for ele in k:
    print(ele)

html3 = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs3 = BeautifulSoup(html3,"html.parser")
k = bs3.find_all('',text='the prince')
print(type(k))
for ele in k:
    print(ele)
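The relationship between the lambda filter and the text filter can be checked on a small inline document, invented here for illustration. This sketch uses the string argument, which is the newer name for text, and leaves out the empty tag argument.

```python
from bs4 import BeautifulSoup

html = "<div><span>Or maybe he's only resting?</span><span>Other</span></div>"
bs = BeautifulSoup(html, 'html.parser')

target = "Or maybe he's only resting?"

# The lambda is called once per tag and keeps the tags for which it returns True.
by_lambda = bs.find_all(lambda tag: tag.get_text() == target)

# The string filter matches text nodes, so it returns strings, not tags.
by_string = bs.find_all(string=target)

print(by_lambda)
print(by_string)

# The matched string sits inside the tag that the lambda selected.
print(by_string[0].parent is by_lambda[0])
```

The enclosing div is not selected by the lambda because its get_text() also contains the text of the second span.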
