Using multithreading with SeleniumBase and different proxies


My task: I need to create a driver for each page, each with its own proxy, and parse these pages in parallel. After collecting the links from the pages, I need to destroy all the drivers and create new ones, again each with its own unique proxy, so that they parse the products from the different pages in parallel.

Problem: Every driver that gets created ends up with the same IP address, even though each one is given a different proxy from the pool.

Code:

import concurrent.futures
import sys
import time
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from seleniumbase import Driver  # needed for Driver(...) below

sys.argv.append("-n")

pages = 3
pool_proxies_for_pages = ['proxy0', 'proxy1', 'proxy2']
pool_proxies_for_products = ['proxy5', 'proxy6', 'proxy7']

def create_undetected_webdriver(proxy):
    driver = Driver(uc=True, proxy=proxy, agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36')
    return driver

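def parsing_products(page, list_for_pool, links):
    # each page gets its own driver with the proxy at the same index
    driver = create_undetected_webdriver(list_for_pool[page])
    try:
        #go through the products
        for link in links:
            driver.get(link)
    finally:
        #destroy driver even if a page fails to load
        driver.quit()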

def parsing_pages(page, proxy):
    driver = create_undetected_webdriver(proxy)
    url = f'https://www.ebay.com/e/_electronics/shop-all-ebay-refurbished-cell-phones?_pgn={page}'
    driver.get(url)

    #scrolling down
    ActionChains(driver).send_keys(Keys.END).perform()
    time.sleep(7)

    # get links
    elements = driver.find_elements("xpath", "//a[@tabindex='-1']")
    links = []
    for i in elements:
        link = i.get_attribute("href")
        links.append(link)

    #destroy driver
    driver.quit()
    #parsing products
    parsing_products(page, pool_proxies_for_products, links)

with concurrent.futures.ThreadPoolExecutor(max_workers=pages) as executor:
    for page in range(pages):
        executor.submit(parsing_pages, page, pool_proxies_for_pages[page])

I'd appreciate any suggestions.

There is 1 answer

Answered by Michael Mintz

Proxy with auth in SeleniumBase uses the solution from https://stackoverflow.com/a/35293284/7058266: essentially, a zip file containing the proxy credentials is created, and SeleniumBase then loads it into Chrome as an extension. The default setting assumes that only a single proxy is used, so if the zip file already exists it gets overwritten, which saves space/memory. If you need multiple simultaneous proxies, there's an arg that you need to set: multi_proxy=True, which creates a uniquely-named zip file for each test that uses a proxy.
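To make the overwrite behaviour concrete, here is a toy model of it. It is not SeleniumBase's actual code: the function names, the file names (proxy.zip, proxy_<id>.zip) and the proxy.txt entry are all made up for illustration. It only shows why one shared archive name leaves every worker with the last-written proxy, while unique names keep them apart.

```python
import os
import tempfile
import zipfile


def write_proxy_archive(folder, worker_id, proxy, unique_names):
    """Write a zip holding one proxy string and return its path (toy model)."""
    # Hypothetical naming scheme: one shared name vs. one name per worker.
    file_name = f"proxy_{worker_id}.zip" if unique_names else "proxy.zip"
    path = os.path.join(folder, file_name)
    # Mode "w" replaces any archive that already exists at this path.
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("proxy.txt", proxy)
    return path


def read_proxy_archive(path):
    """Return the proxy string stored in the archive at `path`."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("proxy.txt").decode()


def proxies_seen_by_workers(proxies, unique_names):
    """Write one archive per worker, then report what each worker would load."""
    with tempfile.TemporaryDirectory() as folder:
        paths = [
            write_proxy_archive(folder, worker_id, proxy, unique_names)
            for worker_id, proxy in enumerate(proxies)
        ]
        return [read_proxy_archive(path) for path in paths]


if __name__ == "__main__":
    pool = ["proxy0", "proxy1", "proxy2"]
    print(proxies_seen_by_workers(pool, unique_names=False))
    print(proxies_seen_by_workers(pool, unique_names=True))
```

With the shared name, all three workers read back the proxy that was written last, which matches the "same IP everywhere" symptom in the question. With unique names, each worker reads back its own proxy.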

Here's a sample script that uses it (in the pytest format):

from parameterized import parameterized
from seleniumbase import BaseCase
BaseCase.main(__name__, __file__, "-n3")

class ProxyTests(BaseCase):
    @parameterized.expand(
        [
            ["user1:pass1@host1:port1"],
            ["user2:pass2@host2:port2"],
            ["user3:pass3@host3:port3"],
        ]
    )
    def test_multiple_proxies(self, proxy_string):
        self.get_new_driver(
            undetectable=True, proxy=proxy_string, multi_proxy=True
        )
        self.driver.get("https://browserleaks.com/webrtc")
        self.sleep(30)

If you're not using pytest multithreading via pytest-xdist, then you should see https://github.com/seleniumbase/SeleniumBase/issues/2478#issuecomment-1981699298 for preventing thread resource conflicts.
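For the ThreadPoolExecutor layout from the question, a minimal sketch might look like the following. It assumes that Driver() accepts multi_proxy=True in the same way get_new_driver() does above, and that appending "-n" to sys.argv (already done in the question's code) is what enables SeleniumBase's thread-locking, as described in the linked comment. The helper names make_driver, visit_with_proxy and run_in_parallel are hypothetical.

```python
import sys
from concurrent.futures import ThreadPoolExecutor

if "-n" not in sys.argv:
    # Assumption from the linked comment: this asks SeleniumBase to use thread-locking.
    sys.argv.append("-n")


def make_driver(proxy):
    """Create one driver bound to one proxy (assumes Driver() takes multi_proxy)."""
    # Imported lazily so the helpers below can be exercised without a browser.
    from seleniumbase import Driver

    return Driver(uc=True, proxy=proxy, multi_proxy=True)


def visit_with_proxy(url, proxy, driver_factory=make_driver):
    """Open `url` through `proxy` and always quit the driver afterwards."""
    driver = driver_factory(proxy)
    try:
        driver.get(url)
        return proxy, driver.title
    finally:
        driver.quit()


def run_in_parallel(urls, proxies, driver_factory=make_driver):
    """Pair each URL with its own proxy and visit them all in parallel threads."""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as executor:
        futures = [
            executor.submit(visit_with_proxy, url, proxy, driver_factory)
            for url, proxy in zip(urls, proxies)
        ]
        # Results come back in submission order, one (proxy, title) pair per URL.
        return [future.result() for future in futures]
```

Passing the driver factory as a parameter is only a convenience: it lets the pairing of URLs and proxies be checked with a stand-in driver before any real browser or proxy is involved.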