ZhangZhihui's Blog  
You said:
When I ran the code below:

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population')

I got this error:

HTTPError: HTTP Error 403: Forbidden

But I can open that URL in a browser.
 
ChatGPT said:

This is a common issue when using pandas.read_html() on Wikipedia.
Even though you can open the page in a browser, Wikipedia blocks requests whose User-Agent identifies an automated client. read_html() fetches the URL with urllib under the hood, and urllib's default User-Agent (Python-urllib/3.x) gets rejected with HTTP 403 Forbidden.

✅ Why your code failed

  • Browser requests include headers like User-Agent, cookies, etc., so Wikipedia allows them.

  • pandas.read_html()'s default request looks like an automated bot, so Wikipedia blocks it with 403 Forbidden (you can see urllib's default User-Agent below).
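
If you want to see this for yourself, you can print the User-Agent that urllib announces by default (the exact version string depends on your Python):

import urllib.request

# The default opener identifies itself as Python-urllib, which Wikipedia rejects.
opener = urllib.request.build_opener()
print(opener.addheaders)   # e.g. [('User-agent', 'Python-urllib/3.12')]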

✅ How to fix it

You need to supply a proper User-Agent header by downloading the page yourself (using requests) and then passing the HTML to read_html().

✔ Working code

import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()   # ensure no HTTP errors

dfs = pd.read_html(response.text)
print(len(dfs))
dfs[0].head()

 

 

Sample output (dfs[0].head()):

        Location  Population % of world         Date  Source (official or from the United Nations) Notes
0          World  8232000000       100%  13 Jun 2025                            UN projection[1][3]   NaN
1          India  1417492000      17.3%   1 Jul 2025                         Official projection[4]   [b]
2          China  1408280000      17.1%  31 Dec 2024                           Official estimate[5]   [c]
3  United States   340110988       4.1%   1 Jul 2024                           Official estimate[6]   [d]
4      Indonesia   284438782       3.5%  30 Jun 2025                 National annual projection[7]   NaN
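
A small caveat: pandas 2.1 and later emit a FutureWarning when a literal HTML string is passed to read_html(). Wrapping the string in io.StringIO avoids the warning and keeps the code future-proof:

from io import StringIO

dfs = pd.read_html(StringIO(response.text))   # no FutureWarning about literal HTML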

This works because:

  • requests.get() sends the browser-like User-Agent we supplied.

  • Wikipedia accepts the request and returns the page.

  • pd.read_html() parses the tables out of the HTML content.

You should see the first table as expected.

 

You said:
In the code you gave, you used headers = {"User-Agent": "Mozilla/5.0"}. But my browser is Chrome; will it still work, or do I need to modify the code?
 
ChatGPT said:

You do not need to modify anything.
"Mozilla/5.0" is a generic, browser-like User-Agent value and works perfectly for Wikipedia.

Why it works

Most modern browsers (including Chrome, Edge, Firefox) send a User-Agent string that begins with:

Mozilla/5.0

For example, your actual Chrome User-Agent probably looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36

But Wikipedia does not require the full long string; it mainly rejects requests that announce themselves as scripts (default library User-Agents), so any reasonable browser-like value passes.

✔ Conclusion

  • headers = {"User-Agent": "Mozilla/5.0"} is enough

  • No need to pretend to be Chrome specifically

  • The code will work exactly the same
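
One refinement if you plan to make many requests: Wikimedia's User-Agent policy asks automated clients to identify themselves with a descriptive User-Agent that includes contact information. The script name and addresses below are placeholders, not real values:

# Hypothetical descriptive User-Agent following Wikimedia's User-Agent policy;
# substitute your own script name and contact details.
headers = {"User-Agent": "MyPopulationScript/1.0 (https://example.com; contact@example.com)"}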

 

You said:
I ran the 'Working code' you gave successfully, and dfs got the content of a table on that page. What confuses me is that the page contains other content before the table. Why did dfs = pd.read_html(response.text) read only the table content?
 
ChatGPT said:

pd.read_html() ignores all non-table content by design.
That’s why even though the Wikipedia page has paragraphs, images, headings, menus, etc., the line:

dfs = pd.read_html(response.text)

returns only tables, not anything else.


✅ Why read_html() only extracts tables

read_html() does the following:

  1. Parses the HTML using lxml (or html5lib).

  2. Searches for <table> tags only.

  3. Converts each <table> element into a pandas DataFrame.

  4. Returns a list of these DataFrames.

It completely ignores:

  • <p> (paragraphs)

  • <div>

  • <span>

  • <h1> / <h2> / etc.

  • Navigation menus

  • Images, scripts, styles, etc.

Its purpose is table extraction, not scraping the whole page.
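
To make the idea concrete, here is a rough sketch of the same "collect only <table> tags" step using BeautifulSoup. This is an illustration, not pandas' actual implementation:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}
html = requests.get(url, headers=headers).text

# Only <table> elements are collected; paragraphs, divs, menus, images
# and everything else on the page are simply never visited.
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table")
print(len(tables))   # roughly how many DataFrames read_html() would return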


✔ Example: Wikipedia page HTML structure

Even though the Wikipedia page contains many different sections, it also contains one or more blocks like:

<table class="wikitable sortable">
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>China</td><td>1,411,750,000</td></tr>
    ...
</table>

pd.read_html() finds these <table> blocks and turns each one into a DataFrame.

All other content stays out.
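
You can verify this with that fragment alone: feeding it to read_html() yields exactly one DataFrame, with the comma thousands separators already parsed into integers (read_html() uses thousands=',' by default):

from io import StringIO
import pandas as pd

html = '''
<table class="wikitable sortable">
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>China</td><td>1,411,750,000</td></tr>
</table>
'''
dfs = pd.read_html(StringIO(html))
print(len(dfs))   # 1
print(dfs[0])
#   Country  Population
# 0   China  1411750000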


✔ Why dfs contains several tables

On many Wikipedia pages, there are multiple tables:

  • The main population table

  • Side info boxes

  • Other data tables

  • Footnote tables

So dfs becomes a list of DataFrames, one for each table.

Example:

len(dfs)        # maybe 2, 3, or more tables
dfs[0]          # first table
dfs[1]          # second table

✔ If you want only the main table

You can select the largest one:

main_table = max(dfs, key=lambda x: x.shape[0])

Or search by column name (note that on this page the country column is labeled 'Location'):

main_table = [df for df in dfs if 'Location' in df.columns][0]
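
read_html() can also filter at parse time: its match parameter (a string or regex) keeps only tables whose text matches. Assuming the '% of world' header shown earlier is unique to the main table:

from io import StringIO

dfs = pd.read_html(StringIO(response.text), match="% of world")
main_table = dfs[0]   # only tables containing the matched text were returned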

Summary

  • read_html() is not an HTML scraper.

  • It is a table detector.

  • It extracts only <table> tags and turns them into DataFrames.

  • All other HTML content is ignored by default.

 

posted on 2025-11-15 21:02 by ZhangZhihuiAAA