Jsoup can't find existing element by class name

I'm trying to parse a Pastebin page with all its parameters (likes, view count, etc.). Everything works fine except the raw text. I know there is a <paste_id>/raw endpoint, but I don't want to use it, since that would mean making two requests, and I can't go straight to <paste_id>/raw because I also need the additional parameters. The data I need is stored in

<textarea class="textarea -raw js-paste-raw">Text I need to parse</textarea>

I checked that this element looks exactly the same in the raw response from the Pastebin server and after JavaScript processing. Yet Jsoup simply ignores the element as if it doesn't exist.

Elements raw_test_1 = doc.getElementsByClass("textarea");
log(raw_test_1.size());

Elements raw_test_2 = doc.select("textarea");
log(raw_test_2.size());

Elements raw_test_3 = doc.getElementsByClass("textarea -raw js-paste-raw");
log(raw_test_3.size());

Result:

0
0
0

It looks like the pastebin developers did something nasty with this element to ruin the lives of those who try to parse their site. Edge shows that this element has errors.

(screenshot: Edge DevTools flagging errors on the textarea element)

This seems to confirm that this element of the site is intentionally corrupted.

Update:

I found out the cause of the problem. The response from the Pastebin servers is completely different for browsers and for Jsoup: browsers receive more than 900 lines of markup, Jsoup only about 300. Pastebin somehow detects that we are scraping the page and returns a stripped-down response, and the raw paste we need is in the part that gets stripped out. I tried .userAgent("Chrome/4.0.249.0 Safari/532.5"), but that isn't enough. How can I make the Pastebin servers treat our request as a real browser request?
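For reference, the request looks roughly like this (the paste URL is a placeholder, and the Accept-Language header is only shown as an example of adding extra headers):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchPaste {
    public static void main(String[] args) throws Exception {
        // Placeholder paste URL; the user-agent is the one mentioned above.
        Document doc = Jsoup.connect("https://pastebin.com/XXXXXXXX")
                .userAgent("Chrome/4.0.249.0 Safari/532.5")
                .header("Accept-Language", "en-US,en;q=0.9")
                .get();
        System.out.println(doc.select("textarea").size()); // still prints 0
    }
}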

1 Answer

Answered by rzwitserloot

Class line is invalid

A space in a class attribute separates class names. Hence, expecting .getElementsByClass("foo bar") to work is incorrect - getElementsByClass takes a single class name, and you're asking a lot from JSoup here. What you really want is .getElementsByClass("foo"), and then to take the intersection (the 'overlap') of that and .getElementsByClass("bar").

Secondly, HTML class names must start with a letter. In other words, -raw is an invalid class name. Given that it is invalid, the general trend in the browser/HTML world is not to hard-crash (had this been XHTML, most parser tools would have refused the document outright as invalid), but what does happen is more or less undefined - you shouldn't write invalid HTML. It is possible that JSoup simply ignores '-raw'.

Thus, we have two plausible reasons JSoup does not find this element, and neither is JSoup's fault - it is the fault of this invalid HTML:

  • It doesn't let you ask for multiple classes in a single "by class" call. Fix by taking the intersection of all the classes you are actually interested in (see the sketch after this list).

  • It doesn't let you ask for -raw. Simply don't ask for that one.
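A minimal sketch of both fixes, assuming doc is the parsed Document from the question (the paste URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PasteSelect {
    public static void main(String[] args) throws Exception {
        // Placeholder URL - substitute the paste you are actually parsing.
        Document doc = Jsoup.connect("https://pastebin.com/XXXXXXXX").get();

        // Fix 1: ask for one class at a time and intersect the results yourself.
        Elements textareas = doc.getElementsByClass("textarea");
        Elements intersection = new Elements();
        for (Element e : textareas) {
            if (e.hasClass("js-paste-raw")) { // Fix 2: don't ask for the invalid -raw class at all
                intersection.add(e);
            }
        }

        // Equivalent: chained class selectors in a CSS query do the intersection for you.
        Elements viaCss = doc.select("textarea.textarea.js-paste-raw");

        System.out.println(intersection.size());
        System.out.println(viaCss.size());
    }
}

Note that, given the next section, this alone won't help here: the element simply isn't in the HTML JSoup receives in the first place.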

Different HTML is returned

JSoup parses the HTML. It does not run any javascript. Turn javascript off in your browser and retry the request. I bet you get the same 'weird completely neutered HTML'.

There are various ways to integrate dynamism (i.e. anything that isn't 'hardcoded' by the programmers and designers - values that live in database tables; think 'list of movies on youtube', 'street names in google maps', 'most recent posts' on a forum site, and so on). The usual way in the past was templating, in which case the HTML you get from the server is very similar to the DOM that the browser ends up showing to your eyeballs.

However, more and more of the web is now built on a static-HTML+API-calls model: the HTML/CSS/JS served by the server is completely static ('hardcoded' - it is identical on every page load, and therefore contains nothing dynamic: no comments, nothing that changes based on user interaction at all). That static HTML/CSS/JS then runs a bunch of web requests from javascript to fetch the dynamic parts and creates the HTML elements on the fly, injecting them into the DOM (which is the 'HTML' as the browser understands it: parsed, turned into a syntax tree, and, crucially, editable by javascript, with every edit reflected live in what the page looks like).

All such pages (and that is most of the web at this point) cannot be parsed with JSoup. You need something that fully emulates an entire browser (because that javascript can tell the system to scroll the page, render a canvas, inspect the canvas and react to it, ask for your screen size, and a million other things that are difficult to emulate outside a browser). Then you can ask the browser for the DOM and go from there.

This is possible - 'browser automation' exists, e.g. in the form of Selenium, but these tools were generally designed to let you test your own websites, not to scrape existing sites. Selenium specifically, and browser-based web scraping in general, is incredibly slow and complicated: you're running an entire browser, which means you need a graphics stack, gigabytes of memory, and a second or three to process each page.
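If you do go the browser-automation route, a rough sketch with Selenium's Java bindings could look like the following - the paste URL, the headless flag and the js-paste-raw selector are assumptions carried over from the question, not verified against the live site:

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class PasteWithBrowser {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // run without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://pastebin.com/XXXXXXXX"); // placeholder paste URL
            // Wait for the page's javascript to inject the element into the DOM.
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
            WebElement raw = wait.until(
                ExpectedConditions.presenceOfElementLocated(By.cssSelector("textarea.js-paste-raw")));
            System.out.println(raw.getAttribute("value")); // the textarea's content
        } finally {
            driver.quit();
        }
    }
}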

Generally, the obvious way to get at all that dynamic data is to use the same APIs that the static HTML/CSS/JS is using - APIs don't change when a designer decides they don't like the colour of part of the page, and they are designed to be easy to use from code. Sites do tend to charge for API usage, though. Still, the significant hassle of writing the extremely complicated scraping code, maintaining the extremely complicated IAAS stack you need, and paying the extremely complicated team of lawyers you need to sort out the legality of using automated browsers to parse the web adds up to far, far more than API access costs. Which is why the vast majority of programmers use those APIs, and why few good programmers are working on making browser automation tools run faster.

That all combines into: Use the API.
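For completeness: the question rules out the per-paste raw endpoint because it means a second request, but if two requests turn out to be acceptable after all, fetching it with JSoup directly is trivial. A sketch, assuming Pastebin's public raw URL pattern (not verified here):

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        String pasteId = "XXXXXXXX"; // placeholder paste id
        // The raw endpoint serves plain text, so skip HTML parsing
        // and read the response body as-is.
        Connection.Response res = Jsoup.connect("https://pastebin.com/raw/" + pasteId)
                .ignoreContentType(true)
                .execute();
        System.out.println(res.body());
    }
}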