I am trying to build a Chrome extension that aggregates information from a number of sites when the user visits site A. This is the code I use to fetch and parse one of those pages:
// NOTE: proxyUrl is not defined in this snippet; it is assumed to be set elsewhere (e.g. a CORS proxy prefix)
async function fetchHTML(url) {
  const response = await fetch(proxyUrl + url);
  const html = await response.text();
  console.log(html);
  return html;
}

// Extract the "total violations" element's text from the HTML content
function extractTotalViolations(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, "text/html");
  // querySelector returns null when nothing matches, so guard before reading textContent
  const element = doc.querySelector(".total-violations");
  return element ? element.textContent : null;
}

// The URL of the page we want to scrape
const url = "https://whoownswhat.justfix.org/en/address/MANHATTAN/610/EAST%2020%20STREET";

// Fetch the HTML content of the page and extract the total violations
fetchHTML(url).then(html => {
  const totalViolations = extractTotalViolations(html);
  console.log(totalViolations);
});
When I print totalViolations, I get null. So I printed the fetched HTML and realized that I am getting some JavaScript code that looks nothing like the HTML I see when I open the website directly. I suspect the website is using some kind of JavaScript masking, or maybe I am not fetching the HTML correctly:
<script>
!function(e){function t(t){for(var n,l,i=t[0],f=t[1],a=t[2],p=0,s=[];p<i.length;p++)l=i[p],Object.prototype.hasOwnProperty.call(o,l)&&o[l]&&s.push(o[l][0]),o[l]=0;for(n in f)Object.prototype.hasOwnProperty.call(f,n)&&(e[n]=f[n]);for(c&&c(t);s.length;)s.shift()();return u.push.apply(u,a||[]),r()}function r(){for(var e,t=0;t<u.length;t++){for(var r=u[t],n=!0,i=1;i<r.length;i++){var f=r[i];0!==o[f]&&(n=!1)}n&&(u.splice(t--,1),e=l(l.s=r[0]))}return e}var n={},o={1:0},u=[];function l(t){if(n[t])return n[t].exports;var r=n[t]={i:t,l:!1,exports:{}};return e[t].call(r.exports,r,r.exports,l),r.l=!0,r.exports}l.m=e,l.c=n,l.d=function(e,t,r){l.o(e,t)||Object.defineProperty(e,t,{enumerable:!0,get:r})},l.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",
</script>
My question is: how can I extract the HTML correctly so that I can parse the DOM and get all the information from this site that I want to show in the extension? Thanks.
The fact that you got JavaScript as a response suggests that the request itself went through and that the page builds its content in the browser after the initial load. This means that you need to load the page with your browser's Dev Tools open and study the requests being sent. Based on your description, it is likely that the first request sent when you visit the page loads JavaScript code, which then runs and sends further requests to the server. Carefully study those requests, including their URLs, request headers and payloads, as well as the responses.
You will need to replicate those requests and parse their responses. If a response turns out to be HTML, you can parse it the way you already tried, changing only where and how the request is sent. If the response is not HTML but something else, such as JSON, study the HTML that ends up being displayed on the target site and write code that converts the raw server response into similar HTML.
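As a rough illustration of the JSON case, here is a minimal sketch. It assumes the Network tab reveals a JSON endpoint; the URL below, the `buildings` array and the `totalViolations` field are all made-up placeholders, not the real API of the target site, so substitute whatever you actually observe in Dev Tools.

```javascript
// Hypothetical endpoint: replace with the request URL you see in the Network tab.
const API_URL = "https://example.invalid/api/address?borough=MANHATTAN&housenumber=610";

// Sum a per-building violation count from the parsed JSON.
// The `buildings` and `totalViolations` names are assumptions for illustration.
function sumViolations(data) {
  const buildings = Array.isArray(data?.buildings) ? data.buildings : [];
  return buildings.reduce((sum, b) => sum + (Number(b.totalViolations) || 0), 0);
}

// Escape text before placing it into markup.
function escapeHTML(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build an HTML fragment similar in spirit to what the site would render.
function renderViolations(address, total) {
  return `<div class="summary"><span class="address">${escapeHTML(address)}</span>: ` +
    `<span class="total-violations">${total}</span></div>`;
}

// Replicate the request and convert the JSON response into HTML.
// fetchFn is injectable so the function can be exercised without a network.
async function loadViolations(apiUrl, address, fetchFn = fetch) {
  const response = await fetchFn(apiUrl, { headers: { Accept: "application/json" } });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const contentType = response.headers.get("content-type") || "";
  if (!contentType.includes("application/json")) {
    throw new Error(`Expected JSON but got "${contentType}"`);
  }
  const data = await response.json();
  return renderViolations(address, sumViolations(data));
}
```

In the extension you would call something like `loadViolations(API_URL, "610 EAST 20 STREET")` and insert the returned fragment into your own UI. If the real endpoint needs extra request headers or a payload, copy them from the request shown in Dev Tools.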