Last year I wrote a Python script to store COVID-19 case data (active, cured and deaths) from the website. The script ran fine initially, but after the page was modified I now get only the first 2 rows, which are the headers, and nothing else. Earlier I was using the pandas.read_html() method, but it can no longer grab all the data. I also tried the following, but they did not help either:
- BeautifulSoup
- lxml.html
I also tried the code from here, but the issue is the same. Why is this happening, and what other steps could I take?
Here is what I have tried so far:
- Using pandas
import pandas as pd
url = "https://www.mohfw.gov.in/"
df_list = pd.read_html(url)
- Using lxml.html
>>> import requests
>>> page = requests.get(url)
>>> import lxml.html as lh
>>> doc = lh.fromstring(page.content)
>>> tbody_elements = doc.xpath('//tbody') # the table is under a `<tbody>` tag, but this is not able to get the data
>>> tbody_elements
[] # empty list here
>>> tr_elements = doc.xpath('//tr')
>>> tr_elements
[<Element tr at 0x7fb3f507d260>, <Element tr at 0x7fb3f507d2b8>, <Element tr at 0x7fb3f507d310>]
>>> len(tr_elements)
3
>>> r = 1  # row counter
>>> for i in tr_elements:
...     print("Row - ", r)
...     for row in i:
...         print(row.text_content())
...     r = r + 1
...
Output:
('Row - ', 1)
COVID-19 INDIA as on : 14 March 2021, 08:00 IST (GMT+5:30) [↑↓ Status change since yesterday]
('Row - ', 2)
S. No. Name of State / UT Active Cases* Cured/Discharged/Migrated* Deaths**
('Row - ', 3)
Total Change since yesterdayChange since yesterday Cumulative Change since yesterday Cumulative Change since yesterday
- Using BeautifulSoup
>>> from bs4 import BeautifulSoup
>>> url = 'https://www.mohfw.gov.in/'
>>> web_content = requests.get(url).content
>>> soup = BeautifulSoup(web_content, "html.parser")
>>> all_rows = soup.find_all('tr')
>>> all_rows
[<tr><h5>COVID-19 INDIA <span>as on : 15 March 2021, 08:00 IST (GMT+5:30)\t[\u2191\u2193 Status change since yesterday]</span></h5></tr>, <tr class="row1">\n<th rowspan="2" style="width:5%;"><strong>S. No.</strong></th>\n<th rowspan="2" style="width:24%;"><strong>Name of State / UT</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Active Cases*</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Cured/Discharged/Migrated*</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Deaths**</strong></th>\n</tr>, <tr class="row2"><th style="width: 12%;">Total</th><th style="width: 12%;"><span class="mob-hide">Change since yesterday</span><span class="mob-show">Change since<br/> yesterday</span></th>\n<th style="width: 12%;">Cumulative</th><th style="width: 12%;">Change since yesterday</th>\n<th style="width: 12%;">Cumulative</th><th style="width: 12%;">Change since yesterday</th></tr>]
>>> len(all_rows)
3
With both BeautifulSoup and lxml.html, I am only getting the first few rows, which are actually the headers of the table.

It looks like they've commented out the whole table. The table is not visible in my browser either.
You could use BeautifulSoup to find the comment entry and decode it as more soup, for example:
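A minimal sketch of that idea, assuming the data rows sit inside an HTML comment within the table. The helper name `rows_from_commented_tables` and the `SAMPLE_HTML` string are hypothetical and only stand in for the real page source, whose markup may differ:

```python
from bs4 import BeautifulSoup, Comment


def rows_from_commented_tables(html):
    """Return the cell text of every <tr> found inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Comments are a special kind of string node, so search the strings
    # and keep only those that are Comment instances.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment body as a document of its own ("more soup").
        inner = BeautifulSoup(comment, "html.parser")
        for tr in inner.find_all("tr"):
            cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
    return rows


# Hypothetical stand-in for the page source (assumption, not the real markup).
SAMPLE_HTML = """
<table>
<tr class="row1"><th>S. No.</th><th>Name of State / UT</th></tr>
<!--
<tbody>
<tr><td>1</td><td>Example State</td></tr>
<tr><td>2</td><td>Another State</td></tr>
</tbody>
-->
</table>
"""

rows = rows_from_commented_tables(SAMPLE_HTML)
for row in rows:
    print(row)
```

For the live page you could pass in `requests.get(url).text` instead of the sample string. Whether the page still keeps its rows inside a comment is something you would need to check in the page source.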
This would give you output starting: