ChatGPT said:
This is a common issue when using pandas.read_html() on Wikipedia.
Even though you can open the page in a browser, Wikipedia rejects requests that do not carry a browser-like User-Agent header, and read_html() fetches the URL with a plain urllib request underneath, which often gets rejected with HTTP 403 Forbidden.
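For reference, a minimal reproduction of the failure presumably looks something like this (using the same URL as the working example below):

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# pandas fetches the URL itself via urllib with a default User-Agent,
# so this call typically fails with HTTPError: HTTP Error 403: Forbidden.
dfs = pd.read_html(url)
```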
✅ Why your code failed
- Browser requests include headers like User-Agent, cookies, etc. → Wikipedia allows them.
- The default pandas.read_html() request looks like an automated bot → Wikipedia blocks it → 403 Forbidden.
✅ How to fix it
You need to supply a proper User-Agent header by downloading the page yourself (using requests) and then passing the HTML to read_html().
✔ Working code
```python
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()  # ensure no HTTP errors

dfs = pd.read_html(response.text)
print(len(dfs))
dfs[0].head()
```
| | Location | Population | % of world | Date | Source (official or from the United Nations) | Notes |
|---|---|---|---|---|---|---|
| 0 | World | 8232000000 | 100% | 13 Jun 2025 | UN projection[1][3] | NaN |
| 1 | India | 1417492000 | 17.3% | 1 Jul 2025 | Official projection[4] | [b] |
| 2 | China | 1408280000 | 17.1% | 31 Dec 2024 | Official estimate[5] | [c] |
| 3 | United States | 340110988 | 4.1% | 1 Jul 2024 | Official estimate[6] | [d] |
| 4 | Indonesia | 284438782 | 3.5% | 30 Jun 2025 | National annual projection[7] | NaN |
This works because:
- requests.get() sends a normal, browser-like User-Agent.
- Wikipedia returns the page.
- pd.read_html() parses the tables from the HTML content.
You should see the first table as expected.
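If you fetch several pages this way, one option is to wrap the two steps in a small helper. This is only a sketch; the function name read_wiki_tables is made up and not part of pandas or requests:

```python
import pandas as pd
import requests

def read_wiki_tables(url, user_agent="Mozilla/5.0"):
    """Fetch a page with a browser-like User-Agent and return its tables."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    response.raise_for_status()
    return pd.read_html(response.text)

tables = read_wiki_tables("https://en.wikipedia.org/wiki/List_of_countries_by_population")
print(len(tables))
```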
You said:
ChatGPT said:
You do not need to modify anything: "Mozilla/5.0" is a generic, browser-like User-Agent and works fine for Wikipedia.
Why it works
Most modern browsers (including Chrome, Edge, Firefox) send a User-Agent string that begins with:
Mozilla/5.0
For example, your actual Chrome User-Agent probably looks like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
But Wikipedia does not require the full string; it only checks that the request looks like it came from a browser rather than from a default script client.
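If you want to be a more polite client, Wikimedia's User-Agent policy asks automated tools to send a descriptive User-Agent with contact information. The project name and addresses below are placeholders, not real values:

```python
headers = {
    # Descriptive, identifiable User-Agent; the name and contact details are examples only.
    "User-Agent": "MyPopulationScript/1.0 (https://example.org; contact@example.org)"
}
```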
✔ Conclusion
- headers = {"User-Agent": "Mozilla/5.0"} is enough.
- There is no need to pretend to be Chrome specifically.
- The code will work exactly the same (a quick check is sketched below).
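If you want to see the difference yourself, you can compare the two status codes. Whether the first request is actually rejected depends on Wikipedia's current filtering, so treat the 403 as an expectation rather than a guarantee:

```python
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# Default "python-requests/x.y.z" User-Agent: may be rejected with 403.
print(requests.get(url).status_code)

# Minimal browser-like User-Agent: should return 200.
print(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).status_code)
```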
ChatGPT said:
pd.read_html() ignores all non-table content by design.
That’s why even though the Wikipedia page has paragraphs, images, headings, menus, etc., the line:
dfs = pd.read_html(response.text)
returns only tables and nothing else.
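A quick way to confirm this, assuming response from the working code above, is to look at the types that come back:

```python
import pandas as pd

# `response` is the requests.Response from the working example above.
dfs = pd.read_html(response.text)

print(type(dfs))     # <class 'list'>
print(type(dfs[0]))  # <class 'pandas.core.frame.DataFrame'>
```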
✅ Why read_html() only extracts tables
read_html() does the following (a rough manual equivalent is sketched after this list):
- Parses the HTML using lxml (or html5lib).
- Searches for <table> tags only.
- Converts each <table> element into a pandas DataFrame.
- Returns a list of these DataFrames.
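For intuition, here is a rough manual equivalent of those steps using BeautifulSoup. This is illustrative only, not how pandas is implemented internally, and it again assumes response from the working example above:

```python
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

soup = BeautifulSoup(response.text, "lxml")

# Keep only the <table> elements; every other tag is simply never visited.
tables = soup.find_all("table")

# Turn each <table> into a DataFrame and collect them in a list.
dfs = [pd.read_html(StringIO(str(t)))[0] for t in tables]
```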
It completely ignores:
- <p> (paragraphs)
- <div>
- <span>
- <h1> / <h2> / etc.
- Navigation menus
- Images, scripts, styles, etc.
That is because its purpose is table extraction, not scraping the whole page.
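One consequence worth knowing: if the HTML contains no <table> at all, read_html() does not return an empty list, it raises a ValueError. A small illustration:

```python
import pandas as pd
from io import StringIO

html = "<html><body><p>Just a paragraph, no tables here.</p></body></html>"

try:
    pd.read_html(StringIO(html))
except ValueError as err:
    print(err)  # "No tables found"
```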
✔ Example: Wikipedia page HTML structure
Even though the Wikipedia page contains many different sections, it also contains one or more blocks like:
```html
<table class="wikitable sortable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>China</td><td>1,411,750,000</td></tr>
  ...
</table>
```
pd.read_html() finds these <table> blocks and turns each one into a DataFrame.
All other content stays out.
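You can check this with the fragment above: feeding just that snippet to read_html() yields exactly one DataFrame. Wrapping the literal string in StringIO avoids a deprecation warning in recent pandas versions:

```python
import pandas as pd
from io import StringIO

html = """
<table class="wikitable sortable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>China</td><td>1,411,750,000</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))
print(len(tables))  # 1
print(tables[0])    # a DataFrame with Country and Population columns
```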
✔ Why dfs contains several tables
On many Wikipedia pages, there are multiple tables:
- The main population table
- Side info boxes
- Other data tables
- Footnote tables
So dfs becomes a list of DataFrames, one for each table.
Example:
```python
len(dfs)  # maybe 2, 3, or more tables
dfs[0]    # first table
dfs[1]    # second table
```
✔ If you want only the main table
You can select the largest one:
main_table = max(dfs, key=lambda x: x.shape[0])
Or search by a known column name (on this page the main table's first column is named "Location", not "Country"):
main_table = [df for df in dfs if "Location" in df.columns][0]
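Another option is read_html()'s match parameter, which keeps only tables whose text matches a regular expression. The keyword "Population" is just a plausible choice for this page, not a required value, and response is again the object from the working code above:

```python
# Only tables containing the text "Population" are parsed and returned.
dfs = pd.read_html(response.text, match="Population")
```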
Summary
- read_html() is not an HTML scraper.
- It is a table detector.
- It extracts only <table> tags and turns them into DataFrames.
- All other HTML content is ignored by default.
