Extract text from Text Node using XPath


I am new to XPath and trying to capture the values "Time: " and "13:45" from the following HTML snippet. Any help or suggestion will be really useful. Thank you!

<div class="inner-box">
    <p class="inner-info-blk">
        <strong>Time: </strong>
        "13:45"
    </p>
</div>

I can access the label within the <strong>...</strong> element with the pattern below, but cannot figure out how to get the time value within the <p ...> element.

Label xpath:

//div[@class="inner-box"]/p[@class="inner-info-blk"]/strong

There are 3 answers

Yubo On

You can use text() to select an element's text nodes. Here the time is the second text node inside <p>; the first is the whitespace before <strong>.

from lxml import etree

html = '''
<div class="inner-box">
<p class="inner-info-blk">
    <strong>Time: </strong>
    "13:45"
</p>
</div>
'''

x = etree.HTML(html)
# text()[2] selects the second text node inside <p>, i.e. the one after <strong>
result = x.xpath('//div[@class="inner-box"]/p[@class="inner-info-blk"]/text()[2]')
print(result[0].strip())  # xpath() returns a list, so take the first item

And that would get the text from the <p> element.
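
For readers without lxml installed, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree instead of lxml. It assumes the snippet is well-formed XML, which real-world HTML often is not. ElementTree has no text() step; the text that follows a child element is exposed as that child's .tail. The function name extract_label_and_value is hypothetical, and stripping the surrounding quotes is an assumption about the desired result.

```python
import xml.etree.ElementTree as ET

SNIPPET = '''
<div class="inner-box">
    <p class="inner-info-blk">
        <strong>Time: </strong>
        "13:45"
    </p>
</div>
'''.strip()


def extract_label_and_value(markup):
    """Return (label, value) from the snippet (hypothetical helper)."""
    root = ET.fromstring(markup)
    # ElementTree supports a subset of XPath, including attribute predicates
    strong = root.find(".//p[@class='inner-info-blk']/strong")
    label = strong.text.strip()  # text inside <strong>
    # .tail holds the text node that follows </strong> inside <p>
    value = strong.tail.strip().strip('"')  # assumption: quotes are not wanted
    return label, value


label, value = extract_label_and_value(SNIPPET)
print(label, value)
```

With the snippet from the question this should print Time: 13:45.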

UPDATE: As @shailesh has mentioned, a Selenium locator cannot use an XPath expression that returns a text node, and to the best of my knowledge Selenium has no method that evaluates an arbitrary XPath expression. As an alternative, you can use a bit of JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(
    "file:///C:/Users/yubo/data/social/stackoverflow/6.10%20Selenium/example.html"
)
time = driver.find_element(
    By.XPATH,
    './/div[@class="inner-box"]/p[@class="inner-info-blk"]',
)
print(driver.execute_script("return arguments[0].lastChild.textContent", time).strip()) # Same as @undetected selenium; a coincidence where we happened to write at the same time.
driver.quit()
shailesh On

You can solve this with the split method, because Selenium locators cannot use text() in an XPath expression. "Time:" in your example is a static, unique value, so you can split on it to get the actual time value you expect. I would recommend trying XPath first and, if that does not lead to a solution, falling back to string logic. Maybe this can help you.

from selenium import webdriver
from selenium.webdriver.common.by import By


driver = webdriver.Firefox()
driver.get('https://www.yourpage.html')
time = driver.find_element(By.XPATH,"//p")

print(time.text.split("Time:")[1].strip())  # strip the space left after "Time:"

driver.quit()

O/P: "13:45"

This may also be relevant:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('file:///Users/shakava/Downloads/stackoverflow.html')
time = driver.find_element(By.XPATH,"//p")
arr = time.text.split(":")

# Drop the label (first piece) and rejoin the rest, restoring the ":" inside the time
timeVal = ":".join(arr[1:]).strip()

print(timeVal)
driver.quit()

O/P: "13:45"
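
The string handling in this answer can be tried without a browser. The following is a rough sketch, assuming the element's visible text is the string Time: "13:45". The helper name value_after_label is hypothetical, and str.partition is used here in place of split so that a missing label can be reported explicitly.

```python
def value_after_label(text, label="Time:"):
    """Return the text that follows the label (hypothetical helper)."""
    # partition splits on the first occurrence only, so the ":" in 13:45 is kept
    _, found, rest = text.partition(label)
    if not found:
        raise ValueError(f"label {label!r} not found in {text!r}")
    return rest.strip()


# Assumed value of element.text for the <p> in the question
print(value_after_label('Time: "13:45"'))
```

In a Selenium script the argument would be the .text of the located <p> element.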

undetected Selenium On

Given the HTML:

<div class="inner-box">
    <p class="inner-info-blk">
        <strong>Time: </strong>
        "13:45"
    </p>
</div>

The time value, i.e. 13:45, is within a text node that is the lastChild of its parent <p>. So to extract the desired text you can use either of the following locator strategies:

  • Using xpath, execute_script() and textContent:

    print(driver.execute_script('return arguments[0].lastChild.textContent;', driver.find_element(By.XPATH, '//div[@class="inner-box"]/p[@class="inner-info-blk"]')).strip())
    
  • Using css_selector, get_attribute() and splitlines():

    print(driver.find_element(By.CSS_SELECTOR, "div.inner-box > p.inner-info-blk").get_attribute("innerHTML").splitlines()[2])
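
The lastChild relationship can also be checked offline. This is only an illustrative sketch that uses the standard library's xml.dom.minidom in place of a browser DOM, and it assumes the snippet parses as well-formed XML. The helper name last_child_text is hypothetical.

```python
from xml.dom import minidom

SNIPPET = '''
<div class="inner-box">
    <p class="inner-info-blk">
        <strong>Time: </strong>
        "13:45"
    </p>
</div>
'''.strip()


def last_child_text(markup):
    """Return the stripped text of the last child of the first <p> (hypothetical helper)."""
    doc = minidom.parseString(markup)
    p = doc.getElementsByTagName("p")[0]
    node = p.lastChild
    # The last child of <p> is expected to be a text node, not an element
    if node.nodeType != node.TEXT_NODE:
        raise ValueError("last child of <p> is not a text node")
    return node.data.strip()


print(last_child_text(SNIPPET))
```

This mirrors what arguments[0].lastChild.textContent reads in the execute_script() call.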
    

Alternative

As an alternative you can also use Beautiful Soup as follows:

Code Block:

from bs4 import BeautifulSoup

html_text = '''
<div class="inner-box">
    <p class="inner-info-blk">
        <strong>Time: </strong>
        "13:45"
    </p>
</div>
'''

soup = BeautifulSoup(html_text, 'html.parser')

last_text = soup.find("p", {"class": "inner-info-blk"}).contents[2]
print(last_text.strip())

Console Output:

"13:45"
