How to webscrape this website using Selenium


I want to scrape the website https://www.rome2rio.com. Below is the code I came up with. Unfortunately, I get a CAPTCHA about 99% of the time I try. Can someone give me a hint on what I could add to the code, or how I could modify it, to avoid being detected?

Thanks

from selenium import webdriver
import undetected_chromedriver as uc
import time
import random

# Initialize undetected ChromeOptions
chrome_options = uc.ChromeOptions()

# Essential options to avoid detection
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--incognito")

# Correctly setting excludeSwitches within undetected_chromedriver context
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument("--start-maximized")  # To start maximized
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

# Rotating User-Agent
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    # Add more as needed
]
random_user_agent = random.choice(user_agents)
chrome_options.add_argument(f"user-agent={random_user_agent}")

# Adjusting viewport size to non-standard dimensions if needed
# chrome_options.add_argument("--window-size=1366,768")  # Use only if you don't want to start maximized

# Use undetected_chromedriver to avoid detection
driver = uc.Chrome(options=chrome_options)

# Open the specified website
driver.get("https://www.rome2rio.com/map/Marseille/Paris")

# Mimicking human behavior with random sleep
time.sleep(random.uniform(2, 4))

# Proceed with your script...

# Close the driver after operations are complete
driver.quit()

There are 2 answers

Answer by Leonardo Hysesani

I believe solving the CAPTCHA through the API of 2Captcha or another CAPTCHA-solving service would be more reliable than trying to evade detection. These services are not free, but at roughly $1-2 per 1,000 requests, depending on the CAPTCHA type, the cost is not an issue for most applications.
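As a rough illustration of what such an integration could look like, here is a minimal sketch of the submit-then-poll flow, assuming 2Captcha's legacy HTTP endpoints (in.php and res.php) and assuming the challenge is a reCAPTCHA v2. Neither assumption is confirmed for rome2rio, so check which CAPTCHA the site actually serves and the service's current documentation first. The helper names, the API key, and the site key are all placeholders.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: 2Captcha's legacy HTTP API; verify against the current docs.
API_BASE = "https://2captcha.com"


def build_submit_params(api_key, site_key, page_url):
    """Query parameters for submitting a reCAPTCHA v2 task to in.php."""
    return {
        "key": api_key,
        "method": "userrecaptcha",  # assumes the page shows reCAPTCHA v2
        "googlekey": site_key,      # the site key found in the page's HTML
        "pageurl": page_url,
        "json": 1,
    }


def build_poll_params(api_key, task_id):
    """Query parameters for asking res.php whether a task is solved."""
    return {"key": api_key, "action": "get", "id": task_id, "json": 1}


def _get_json(endpoint, params):
    """Send a GET request to the service and decode the JSON reply."""
    url = f"{API_BASE}/{endpoint}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def solve_recaptcha(api_key, site_key, page_url, poll_every=5, max_wait=180):
    """Submit a task, then poll until a token arrives or max_wait elapses."""
    submitted = _get_json("in.php", build_submit_params(api_key, site_key, page_url))
    if submitted.get("status") != 1:
        raise RuntimeError(f"Submit failed: {submitted.get('request')}")
    task_id = submitted["request"]

    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        time.sleep(poll_every)
        result = _get_json("res.php", build_poll_params(api_key, task_id))
        if result.get("status") == 1:
            return result["request"]  # the solved token
        if result.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"Solve failed: {result.get('request')}")
    raise TimeoutError("CAPTCHA was not solved within max_wait seconds")
```

The returned token would then have to be injected into the page (for reCAPTCHA v2, typically into the g-recaptcha-response field) before submitting the form; that step depends on the page and is not shown here.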

Answer by Michael Mintz

You can avoid the CAPTCHAs with SeleniumBase UC Mode (https://github.com/seleniumbase/SeleniumBase).

After running pip install seleniumbase, you can run the following with Python:

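from seleniumbase import Driver

# Start Chrome in UC (undetected-chromedriver) mode
driver = Driver(uc=True)
# Open the page, disconnect chromedriver for 3 seconds, then reconnect
driver.uc_open_with_reconnect("https://www.rome2rio.com/map/Marseille/Paris", 3)
# Fill in the search form and submit it
driver.type('input[aria-label="From"]', "Geneva, Switzerland")
driver.type('input[aria-label="To"]', "Vienna, Austria")
driver.click('button span:contains("Search")')

# Pause so the browser can be inspected before it closes
breakpoint()

driver.quit()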

The script pauses at the breakpoint(). Type c and press Enter in the console to continue from the breakpoint.

More documentation on UC Mode here: SeleniumBase/help_docs/uc_mode.md

The SeleniumBase driver includes all the original driver methods, plus new ones.
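To illustrate mixing the two kinds of methods, here is a minimal sketch that visits several routes with the UC Mode call from above and reads each page through the standard Selenium page_source attribute. The helper names build_route_url and fetch_route_pages are hypothetical. The URL pattern is taken from the /map/Marseille/Paris example; percent-encoding multi-word place names is an assumption, since the site's own format for such names is not confirmed here.

```python
import random
import time
from urllib.parse import quote

BASE_URL = "https://www.rome2rio.com/map"


def build_route_url(origin, destination):
    """Build a URL in the /map/<origin>/<destination> form used above.

    Assumption: place names can be percent-encoded as path segments.
    """
    return f"{BASE_URL}/{quote(origin, safe='')}/{quote(destination, safe='')}"


def fetch_route_pages(pairs, reconnect_time=3):
    """Return {(origin, destination): page HTML} for each pair of places."""
    # Imported lazily so the URL helper works without SeleniumBase installed.
    from seleniumbase import Driver

    driver = Driver(uc=True)
    pages = {}
    try:
        for origin, destination in pairs:
            # SeleniumBase-specific UC Mode method
            driver.uc_open_with_reconnect(
                build_route_url(origin, destination), reconnect_time
            )
            # Standard Selenium WebDriver attribute
            pages[(origin, destination)] = driver.page_source
            # Random pause between requests, as in the question's code
            time.sleep(random.uniform(2, 4))
    finally:
        driver.quit()
    return pages
```

Calling fetch_route_pages([("Marseille", "Paris")]) would open one browser session and return the HTML for that route. Parsing the HTML is left out because the page structure is not described in this thread.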