ZhangZhihui's Blog  
You said:
When I ran the code below:

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population')

I got this error:

HTTPError: HTTP Error 403: Forbidden

But I can open that URL in a browser.
 
ChatGPT said:

This is a common issue when using pandas.read_html() on Wikipedia.
Even though you can open the page in a browser, Wikipedia blocks requests whose User-Agent identifies an automated client. read_html() fetches the URL with urllib under the hood, and urllib's default User-Agent (Python-urllib/3.x) gets rejected with HTTP 403 Forbidden.

✅ Why your code failed

  • Browser requests include headers like User-Agent, cookies, etc., so Wikipedia allows them.

  • pandas.read_html()'s default request looks like an automated bot, so Wikipedia blocks it with 403 Forbidden (you can see urllib's default User-Agent below).
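
If you want to see this for yourself, you can print the User-Agent that urllib announces by default (the exact version string depends on your Python):

import urllib.request

# The default opener identifies itself as Python-urllib, which Wikipedia rejects.
opener = urllib.request.build_opener()
print(opener.addheaders)   # e.g. [('User-agent', 'Python-urllib/3.12')]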

✅ How to fix it

You need to supply a proper User-Agent header by downloading the page yourself (using requests) and then passing the HTML to read_html().

✔ Working code

import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()   # ensure no HTTP errors

dfs = pd.read_html(response.text)
print(len(dfs))
dfs[0].head()

 

 

Sample output (dfs[0].head()):

        Location  Population % of world         Date  Source (official or from the United Nations) Notes
0          World  8232000000       100%  13 Jun 2025                            UN projection[1][3]   NaN
1          India  1417492000      17.3%   1 Jul 2025                         Official projection[4]   [b]
2          China  1408280000      17.1%  31 Dec 2024                           Official estimate[5]   [c]
3  United States   340110988       4.1%   1 Jul 2024                           Official estimate[6]   [d]
4      Indonesia   284438782       3.5%  30 Jun 2025                 National annual projection[7]   NaN
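
A small caveat: pandas 2.1 and later emit a FutureWarning when a literal HTML string is passed to read_html(). Wrapping the string in io.StringIO avoids the warning and keeps the code future-proof:

from io import StringIO

dfs = pd.read_html(StringIO(response.text))   # no FutureWarning about literal HTML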

This works because:

  • requests.get() sends the browser-like User-Agent we supplied.

  • Wikipedia accepts the request and returns the page.

  • pd.read_html() parses the tables out of the HTML content.

You should see the first table as expected.

 

You said:
In the code you gave, you used headers = {"User-Agent": "Mozilla/5.0"}. But my browser is Chrome; will it still work, or do I need to modify the code?
 
ChatGPT said:

You do not need to modify anything.
"Mozilla/5.0" is a generic, browser-like User-Agent value and works perfectly for Wikipedia.

Why it works

Most modern browsers (including Chrome, Edge, Firefox) send a User-Agent string that begins with:

Mozilla/5.0

For example, your actual Chrome User-Agent probably looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36

But Wikipedia does not require the full long string; it mainly rejects requests that announce themselves as scripts (default library User-Agents), so any reasonable browser-like value passes.

✔ Conclusion

  • headers = {"User-Agent": "Mozilla/5.0"} is enough

  • No need to pretend to be Chrome specifically

  • The code will work exactly the same
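
One refinement if you plan to make many requests: Wikimedia's User-Agent policy asks automated clients to identify themselves with a descriptive User-Agent that includes contact information. The script name and addresses below are placeholders, not real values:

# Hypothetical descriptive User-Agent following Wikimedia's User-Agent policy;
# substitute your own script name and contact details.
headers = {"User-Agent": "MyPopulationScript/1.0 (https://example.com; contact@example.com)"}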

 

You said:
I ran the 'Working code' you gave successfully, and dfs got the content of a table on that page. What confuses me is that the page contains other content before the table. Why did dfs = pd.read_html(response.text) read only the table content?
 
ChatGPT said:

pd.read_html() ignores all non-table content by design.
That’s why even though the Wikipedia page has paragraphs, images, headings, menus, etc., the line:

dfs = pd.read_html(response.text)

returns only tables, not anything else.


✅ Why read_html() only extracts tables

read_html() does the following:

  1. Parses the HTML using lxml (or html5lib).

  2. Searches for <table> tags only.

  3. Converts each <table> element into a pandas DataFrame.

  4. Returns a list of these DataFrames.

It completely ignores:

  • <p> (paragraphs)

  • <div>

  • <span>

  • <h1> / <h2> / etc.

  • Navigation menus

  • Images, scripts, styles, etc.

Its purpose is table extraction, not scraping the whole page.
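
To make the idea concrete, here is a rough sketch of the same "collect only <table> tags" step using BeautifulSoup. This is an illustration, not pandas' actual implementation:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}
html = requests.get(url, headers=headers).text

# Only <table> elements are collected; paragraphs, divs, menus, images
# and everything else on the page are simply never visited.
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table")
print(len(tables))   # roughly how many DataFrames read_html() would return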


✔ Example: Wikipedia page HTML structure

Even though the Wikipedia page contains many different sections, it also contains one or more blocks like:

<table class="wikitable sortable">
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>China</td><td>1,411,750,000</td></tr>
    ...
</table>

pd.read_html() finds these <table> blocks and turns each one into a DataFrame.

All other content stays out.
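
You can verify this with that fragment alone: feeding it to read_html() yields exactly one DataFrame, with the comma thousands separators already parsed into integers (read_html() uses thousands=',' by default):

from io import StringIO
import pandas as pd

html = '''
<table class="wikitable sortable">
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>China</td><td>1,411,750,000</td></tr>
</table>
'''
dfs = pd.read_html(StringIO(html))
print(len(dfs))   # 1
print(dfs[0])
#   Country  Population
# 0   China  1411750000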


✔ Why dfs contains several tables

On many Wikipedia pages, there are multiple tables:

  • The main population table

  • Side info boxes

  • Other data tables

  • Footnote tables

So dfs becomes a list of DataFrames, one for each table.

Example:

len(dfs)        # maybe 2, 3, or more tables
dfs[0]          # first table
dfs[1]          # second table

✔ If you want only the main table

You can select the largest one:

main_table = max(dfs, key=lambda x: x.shape[0])

Or search by column name (note that on this page the country column is labeled 'Location'):

main_table = [df for df in dfs if 'Location' in df.columns][0]
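
read_html() can also filter at parse time: its match parameter (a string or regex) keeps only tables whose text matches. Assuming the '% of world' header shown earlier is unique to the main table:

from io import StringIO

dfs = pd.read_html(StringIO(response.text), match="% of world")
main_table = dfs[0]   # only tables containing the matched text were returned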

Summary

  • read_html() is not an HTML scraper.

  • It is a table detector.

  • It extracts only <table> tags and turns them into DataFrames.

  • All other HTML content is ignored by default.

 

posted on 2025-11-15 21:02 by ZhangZhihuiAAA