Scraping a website with dynamic wdtNonce parameter


I am pretty much self-taught in web scraping and I don't have a deep understanding of the inner workings of a webpage.

However, I've been able to scrape every website I've put my hands on.

Until I tried this one.

My goal is to be able to choose the date and download the corresponding prices.

By examining the network traffic, I have been able to replicate the HTTP Request that yields the desired response in JSON format.

The aforementioned request's payload looks like this:

    {
    "draw": "5",
    "columns[0][data]": "0",
    "columns[0][name]": "wdt_ID",
    "columns[0][searchable]": "true",
    "columns[0][orderable]": "false",
    "columns[0][search][value]": "",
    "columns[0][search][regex]": "false",
    "columns[1][data]": "1",
    "columns[1][name]": "date",
    "columns[1][searchable]": "true",
    "columns[1][orderable]": "false",
    "columns[1][search][value]": "26+Feb+2024|26+Feb+2024",
    "columns[1][search][regex]": "false",
    "columns[2][data]": "2",
    "columns[2][name]": "mtu",
    "columns[2][searchable]": "true",
    "columns[2][orderable]": "false",
    "columns[2][search][value]": "|",
    "columns[2][search][regex]": "false",
    "columns[3][data]": "3",
    "columns[3][name]": "almcpmwh",
    "columns[3][searchable]": "true",
    "columns[3][orderable]": "false",
    "columns[3][search][value]": "",
    "columns[3][search][regex]": "false",
    "columns[4][data]": "4",
    "columns[4][name]": "alvolumemwh",
    "columns[4][searchable]": "true",
    "columns[4][orderable]": "false",
    "columns[4][search][value]": "",
    "columns[4][search][regex]": "false",
    "columns[5][data]": "5",
    "columns[5][name]": "alnetpositionmwh",
    "columns[5][searchable]": "true",
    "columns[5][orderable]": "false",
    "columns[5][search][value]": "",
    "columns[5][search][regex]": "false",
    "columns[6][data]": "6",
    "columns[6][name]": "ksmcpmwh",
    "columns[6][searchable]": "true",
    "columns[6][orderable]": "false",
    "columns[6][search][value]": "",
    "columns[6][search][regex]": "false",
    "columns[7][data]": "7",
    "columns[7][name]": "ksvolumemwh",
    "columns[7][searchable]": "true",
    "columns[7][orderable]": "false",
    "columns[7][search][value]": "",
    "columns[7][search][regex]": "false",
    "columns[8][data]": "8",
    "columns[8][name]": "ksnetpositionmwh",
    "columns[8][searchable]": "true",
    "columns[8][orderable]": "false",
    "columns[8][search][value]": "",
    "columns[8][search][regex]": "false",
    "columns[9][data]": "9",
    "columns[9][name]": "datetime",
    "columns[9][searchable]": "true",
    "columns[9][orderable]": "false",
    "columns[9][search][value]": "|",
    "columns[9][search][regex]": "false",
    "start": "0",
    "length": "25",
    "search[value]": "",
    "search[regex]": "false",
    "sumColumns[]": [
        "alvolumemwh",
        "ksvolumemwh",
        "alnetpositionmwh",
        "ksnetpositionmwh"
    ],
    "avgColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "minColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "maxColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "wdtNonce": "c201b4ccc3"
    }

So far so good. Everything works fine and I am able to choose the date and download the data I want.
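For context, replaying a captured request like this can be sketched with the standard library alone. The endpoint URL below is an assumption (`wdtNonce` is a fingerprint of the wpDataTables WordPress plugin, which typically posts to `wp-admin/admin-ajax.php`); the real URL should be copied from the network tab:

```python
# Minimal sketch of replaying the captured request using only the stdlib.
# ENDPOINT is an assumption -- copy the real URL from your network traffic.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://alpex.al/wp-admin/admin-ajax.php"  # assumed, verify

def build_payload(date: str, nonce: str) -> dict:
    """Only the fields that change between requests; merge the rest of
    the captured payload in before sending."""
    return {
        "columns[1][search][value]": f"{date}|{date}",
        "wdtNonce": nonce,
    }

def post_payload(payload: dict) -> bytes:
    # urlencode produces the '+'-for-space encoding seen in the captured payload
    body = urlencode(payload).encode()
    req = Request(ENDPOINT, data=body, headers={
        "Content-Type": "application/x-www-form-urlencoded"})
    with urlopen(req, timeout=30) as resp:
        return resp.read()
```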

However, the value of this parameter

"wdtNonce": "c201b4ccc3"

seems to be dynamic: after a while the hard-coded value I am using stops being valid and the request returns no data.

Is there a way to make this persistent?

Is there a way to automatically renew the value of the parameter to a valid one?

Is there a way to circumvent this?

How does my browser "know" beforehand which value it should use for this parameter?

Is this a built in feature intended to block scraping?

I am not posting my code because the code itself works without any problems. Thank you in advance!


There are 2 answers

Sumanth (best answer):

Usually, these dynamic strings or tokens come from one of two sources:

  1. They are generated on the website using JavaScript.
  2. They are returned in the response to a previous request, either as a cookie, a response header, or in the body.

They can also come from a combination of both, or some other mechanism entirely.

In this particular website, the token comes from the main HTML page itself.

GET https://alpex.al/market-results/

The token is in the value attribute of the input tag whose id is 'wdtNonceFrontendEdit_53'.

[Screenshot: the token in the page's HTML source]

You can first fetch the main page, parse it, and apply the following XPath to extract the wdtNonce value and use it in the API payload.

//input[contains(@id, 'wdtNonceFrontendEdit')]/@value

You should re-fetch the main page whenever the token stops working, extract a fresh token, and use it in the API payload to crawl the data. Refer to https://stackoverflow.com/a/78007031/11809002 for more info on how to responsibly crawl data and why websites use such tokens.
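The fetch-and-extract step above can be sketched with the standard library. This replicates the answer's XPath with `html.parser` to avoid extra dependencies; swap in `lxml` and the XPath verbatim if you prefer:

```python
# Sketch: fetch the main page and pull the wdtNonce out of the hidden
# <input>. Page URL and the id substring come from the answer above.
from html.parser import HTMLParser
from urllib.request import urlopen

PAGE_URL = "https://alpex.al/market-results/"

class NonceExtractor(HTMLParser):
    """Records the value of any <input> whose id contains 'wdtNonceFrontendEdit'."""
    def __init__(self):
        super().__init__()
        self.nonce = None

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if "wdtNonceFrontendEdit" in (attrs.get("id") or ""):
            self.nonce = attrs.get("value")

def extract_wdt_nonce(page_source: str) -> str:
    parser = NonceExtractor()
    parser.feed(page_source)
    if parser.nonce is None:
        raise ValueError("wdtNonce input not found; page layout may have changed")
    return parser.nonce

def fetch_fresh_nonce() -> str:
    with urlopen(PAGE_URL, timeout=30) as resp:
        return extract_wdt_nonce(resp.read().decode("utf-8", errors="replace"))
```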

Krishan Kumawat:
Identify the logic behind wdtNonce generation: This might involve inspecting the network traffic or website code to understand how the server generates new wdtNonce values. If you can identify a pattern, you could potentially create a mechanism to generate new nonces when the current one becomes invalid. However, this approach can be fragile and break easily if the website changes its logic.
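The renew-on-invalidation idea can be sketched as a small wrapper: if the data request comes back empty, assume the nonce expired, refresh it, and retry once. `fetch_nonce` and `post_data_request` here are hypothetical placeholders for your own page-fetch and POST code:

```python
# Hedged sketch: retry-with-refresh for an expiring nonce. The two
# callables are placeholders the caller supplies; fetch_nonce() returns
# a fresh token, post_data_request(payload) returns the parsed JSON.
def fetch_with_nonce_refresh(payload, fetch_nonce, post_data_request):
    """Try the request; on an empty result, refresh the nonce and retry once."""
    result = {}
    for attempt in range(2):
        result = post_data_request(payload)
        if result.get("data"):               # non-empty -> nonce still valid
            return result
        payload["wdtNonce"] = fetch_nonce()  # refresh and retry
    return result
```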

Check for API availability: While the website might not provide an official API for data access, there might be an undocumented or hidden one. Explore the website's documentation or search online communities to see if anyone has discovered an unofficial API. Using a documented API is always the recommended approach, as it adheres to rate limits and avoids potential security risks.

Respect robots.txt and terms of service: Before attempting any scraping, always check the website's robots.txt file and terms of service. Scraping against their guidelines is unethical and can be illegal. If scraping is not allowed, respect their decision and explore alternative methods of data acquisition.

Consider alternative data sources: If scraping this specific website is not feasible due to ethical or technical reasons, look for alternative sources that provide the data you need. There might be public datasets, government reports, or official APIs from other organizations that offer similar information.