I'm using Puppeteer with Cheerio (a jQuery-like API) and Node.js to try to get list items from a web page:
<table>
<td class="hr">
<ul class="people">
<li class = "person">Richard</li>
<li class = "person">Linus</li>
<li class = "person">Brian</li>
<li class = "team_lead">Charles</li>
</ul>
</td>
<td class="manufacturing">
<ul class="people">
<li class = "person">Alan</li>
<li class = "person">Margret</li>
<li class = "person">Ken</li>
<li class = "person">Edsger</li>
<li class = "team_lead">Dennis</li>
</ul>
</td>
<td class="design">
<ul class="people">
<li class = "person">Bill</li>
<li class = "person">Ada</li>
<li class = "person">Steve</li>
<li class = "person">Ken</li>
<li class = "team_lead">Dennis</li>
</ul>
</td>
</table>
and the following Node.js code:
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
async function main(){
const browser = await puppeteer.launch({headless : false, defaultViewport: {width: 1920, height: 1080}});
const page = await browser.newPage();
await page.goto("${url}");
const htmlContent = await page.content();
const $ = cheerio.load(htmlContent);
let peopleList = [];
$(`table td .people`).each(function(i, li){
peopleList.push(li.text());
});
console.log(`people: ${peopleList}`);
}
main();
I got the list-parsing code from another Stack Overflow answer, "How to store list items within an array with jQuery", and from a Udemy tutorial, and tried to adapt it accordingly.
I am looking to store the names in a two-dimensional array, with one inner array per list, so something like:
peopleList = [[Richard, Linus, Brian, Charles], [Alan, Margret, Edsger, Dennis], [Bill, Ada, Steve, Ken, Dennis]];
however I am getting a single string:
RichardLinusBrianCharlesAlanMargretEdsgerDenisBillAdSteveKenDennis,RichardLinusBrianCharlesAlanMargretEdsgerDenisBillAdSteveKenDennis,...
(repeated for each ul element), and when I try to go deeper and include li tags, I just get an empty string.
- Is there any way I can store the names in the desired structure?
- I am using a private site and have therefore removed the URL and changed the people to computer scientists. Is there any way to point Puppeteer to a site run locally, e.g. localhost/index.html?
There is no need to use Cheerio with Puppeteer. Puppeteer already works with the live page, so it generally doesn't make sense to snapshot the page into a string, then dump it into a separate library. This is inefficient and leads to confusing bugs when the snapshot goes stale.
Instead, use page.$$eval(yourSelector, browserCallback) to do the job. The callback runs in the browser and receives an array of the elements that match the selector.
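As a minimal sketch of that call (not the original snippet, which is not reproduced here), the following loads a shortened stand-in for the question's markup with page.setContent and collects the text of each matching item. The helper name textOf and the trimmed-down HTML are assumptions made for illustration.

```javascript
// Browser-side callback: receives the array of elements matched by the selector.
// Puppeteer serializes it and runs it in the page, so it must not use outer variables.
function textOf(elements) {
  return elements.map((el) => el.textContent.trim());
}

async function main() {
  // Required inside main so the helper above can be reused without a browser.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Shortened stand-in for the real page (assumption, for reproducibility).
    await page.setContent(`
      <table><tr><td class="hr">
        <ul class="people">
          <li class="person">Richard</li>
          <li class="person">Linus</li>
          <li class="team_lead">Charles</li>
        </ul>
      </td></tr></table>`);
    const people = await page.$$eval("table td .people .person", textOf);
    console.log(people);
  } finally {
    await browser.close();
  }
}
// Call main() to run the scrape; it is left uncalled so the snippet can be pasted into a larger script.
```

Note that .person does not match the team_lead items. If the team leads should be included, as in the desired output, a selector such as table td .people li would pick up every list item.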
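The joined string issue is resolved by using the selector table td .people .person, which matches the individual li elements rather than the ul containers. This would technically work in the Cheerio approach as well.
If you want to keep the categories distinct, you could use a nested query: select each ul, then query its list items inside the callback, which gives one inner array per ul.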
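A nested query along those lines might look like the sketch below. The function names namesPerList and scrapeTeams are hypothetical, scrapeTeams expects an already-open Puppeteer page, and the inner selector li (rather than .person) is an assumption chosen so that team leads are kept.

```javascript
// Browser-side callback: builds one inner array of names per matched <ul>.
function namesPerList(lists) {
  return lists.map((ul) =>
    [...ul.querySelectorAll("li")].map((li) => li.textContent.trim())
  );
}

// Takes an open Puppeteer page and resolves to a two-dimensional array of names.
async function scrapeTeams(page) {
  return page.$$eval("table td .people", namesPerList);
}
```

Calling await scrapeTeams(page) after the page has loaded should then resolve to an array with one entry per ul, each holding that list's names in document order.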
All that said, if the page you're working with has the data you want statically, using fetch and Cheerio may make sense. But I'm assuming you're working with a SPA or a website that requires some interaction to get to the scrape point, or there's some other good motivator for using Puppeteer.
As another aside, if you wind up sticking with Puppeteer but prefer to use jQuery, you can either add it to the page or use it if the page happens to include jQuery already. You'll then access $ inside an evaluate-family callback that runs in the browser context. This makes more sense than using Cheerio in most cases, since you're taking advantage of Puppeteer's real-time view of the page and won't suffer from stale data issues.
To answer your other question: for demo and reproducibility purposes, I use setContent, but you can run a server and navigate to your page on localhost. Just make sure to include the port.
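For the localhost case, a rough sketch could look like this. The helper names localUrl and openLocalPage are hypothetical, and the port is whatever your local server listens on (8080 in the usage note is only an example).

```javascript
// Hypothetical helper: builds the URL of a page served on localhost, port included.
function localUrl(port, path = "index.html") {
  return new URL(path, `http://localhost:${port}/`).href;
}

// Opens a new tab in an existing Puppeteer browser and navigates to the local page.
async function openLocalPage(browser, port) {
  const page = await browser.newPage();
  await page.goto(localUrl(port), { waitUntil: "domcontentloaded" });
  return page;
}
```

With a server running on port 8080, const page = await openLocalPage(browser, 8080) would navigate to http://localhost:8080/index.html, after which the $$eval calls work as on any other page.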