HtmlUnit with XPath not returning the expected result

I am trying to screen-scrape the top news story from a particular newspaper on a particular date through a Google search, using HtmlUnit. I can get the search results, but accessing the top news link via XPath fails. The failing code snippet is below.

        HtmlPage page = client.getPage("https://www.google.co.in");

        List<HtmlForm> allForms = page.getForms();
        System.out.println("No of Forms Detected :  ---  " + allForms.size());
        HtmlForm searchForm = allForms.get(0);
        HtmlTextArea searchTextArea = searchForm.getTextAreaByName("q");
        searchTextArea.setText("24 June 2019 news, the hindu");

        HtmlInput gSearch = searchForm.getInputByName("btnK");
        HtmlPage searchResultPage = (HtmlPage) gSearch.click();
        client.waitForBackgroundJavaScript(5000 * 2);
        System.out.println(searchResultPage.asNormalizedText());
        String xpath = "//*[@id='web']/ol/li[2]/div/div[1]/h3/a";

        HtmlAnchor topNewsLink = (HtmlAnchor) searchResultPage.getByXPath(xpath).get(0);
        HtmlPage postPage = topNewsLink.click();

        System.out.println(postPage.asNormalizedText());

Here, (HtmlAnchor) searchResultPage.getByXPath(xpath) never returns the expected anchor; the result is always null. I believe the XPath is correct; I am attaching a screenshot of the search result for reference.

Google Search of news on a particular date

1 answer

Answered by RBRi

I would like to give some hints; this is not a complete solution.

First, make sure you are reaching the correct page from the Google search by checking its text (as you already did). A lot of JavaScript usually runs as part of the search, so HtmlUnit may have ended up on a different page than your browser did.

        client.waitForBackgroundJavaScript(5000 * 2);
        System.out.println(searchResultPage.asNormalizedText());

If you are on the right page, have a look at the DOM tree as HtmlUnit sees it.

        System.out.println(searchResultPage.asXml());

Then you can use the XML output as the basis for constructing your XPath expression.
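As a rough sketch of that workflow: once the asXml() output is saved, candidate expressions can be tried offline with the JDK's own javax.xml.xpath, without rerunning the search each time. This assumes the saved output parses as well-formed XML, which is not guaranteed for every page. The class name XPathProbe, the match helper, and the sample markup are made up for illustration; the real Google markup will differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for trying XPath expressions against saved markup.
final class XPathProbe {

    /** Returns the trimmed text content of every node matched by the XPath. */
    public static List<String> match(String xml, String xpath) {
        try {
            // Parse the markup into a DOM document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Evaluate the expression and request a node set as the result.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for the saved asXml() output.
        String xml = "<html><body><div id=\"search\">"
                + "<div><a href=\"https://example.com/1\"><h3>First result</h3></a></div>"
                + "<div><a href=\"https://example.com/2\"><h3>Second result</h3></a></div>"
                + "</div></body></html>";

        // The expression from the question finds nothing in this structure.
        System.out.println(match(xml, "//*[@id='web']/ol/li[2]/div/div[1]/h3/a"));
        // An expression written against the dumped structure does match.
        System.out.println(match(xml, "//div[@id='search']//a[h3]"));
    }
}
```

An expression that matches here still needs a final check with getByXPath in HtmlUnit itself, since the offline parser only sees the serialized snapshot.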

If you think there is still a problem with the XPath evaluation itself, it would be great if you could create a simple example (based on a static HTML page) and open an issue on the GitHub project.