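I have a Python script that visits a website every 30 seconds, and each visit needs to come from a different IP address.
What is the best / most time-efficient solution?
- Scraping free proxies online? Do you know of a Python script that collects proxies from many sources?
- Using Tor Browser to get a different IP each time? (I am using Selenium on an AWS EC2 instance; do you know of a tutorial on using Tor Browser on an Ubuntu server?)
- Other methods?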
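To collect and use different proxies, a robust approach is to send each request to the website through a freshly listed, active proxy taken from the Free Proxy List, as follows:
Code block: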
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Path to the local chromedriver binary (adjust for your machine)
CHROMEDRIVER = r'C:\WebDrivers\chromedriver.exe'
TABLE = "//table[@class='table table-striped table-bordered dataTable']"

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(CHROMEDRIVER), options=options)
driver.get("https://sslproxies.org/")
# Scroll the proxy table into view, then read the IP and port columns
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, TABLE + "//th[contains(., 'IP Address')]"))))
ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, TABLE + "//tbody//tr[@role='row']/td[position() = 1]")))]
ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, TABLE + "//tbody//tr[@role='row']/td[position() = 2]")))]
driver.quit()

# Build "ip:port" strings
proxies = [ip + ':' + port for ip, port in zip(ips, ports)]
print(proxies)

# Try the proxies one by one until one of them loads the check page
for proxy in proxies:
    driver = None
    try:
        print("Proxy selected: {}".format(proxy))
        options = webdriver.ChromeOptions()
        options.add_argument('--proxy-server={}'.format(proxy))
        driver = webdriver.Chrome(service=Service(CHROMEDRIVER), options=options)
        driver.get("https://www.whatismyip.com/proxy-check/?iref=home")
        # Compare against the element's text; a WebElement itself does not support "in"
        if "Proxy Type" in WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p.card-text"))).text:
            break  # working proxy found; this driver stays open for further use
    except Exception:
        pass  # dead or slow proxy: fall through and try the next one
    if driver is not None:
        driver.quit()
print("Proxy Invoked")
Console output:
['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
Proxy selected: 190.7.158.58:39871
Proxy selected: 175.139.179.65:54980
Proxy selected: 186.225.45.146:45672
Proxy selected: 185.41.99.100:41258
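To tie this back to the question (one visit every 30 seconds, each through a different proxy), the collected "ip:port" list can be cycled through on a timer. The following is only a rough sketch: `visit_with_rotation` and the `visit` callable are hypothetical names, not part of Selenium, and `visit` is assumed to open the target site through the given proxy and return True on success.

```python
import itertools
import time


def visit_with_rotation(proxies, visit, visits, interval=30, sleep=time.sleep):
    """Call visit(proxy) `visits` times, switching to the next proxy each time.

    `visit` is a hypothetical callable supplied by the caller; it should open
    the target site through the given "ip:port" proxy and return True on success.
    """
    if not proxies:
        raise ValueError("need at least one proxy")
    results = []
    pool = itertools.cycle(proxies)  # wraps around when the list runs out
    for n in range(visits):
        proxy = next(pool)
        try:
            ok = bool(visit(proxy))
        except Exception:
            ok = False  # free proxies die often; record the failure and move on
        results.append((proxy, ok))
        if n < visits - 1:
            sleep(interval)  # wait before the next visit
    return results
```

Note that once the list wraps around, addresses are reused, so a strictly new IP on every visit would require refreshing the proxy list before it is exhausted.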
The website "https://sslproxies.org/" seems to have been updated. Here is updated code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import chromedriver_autoinstaller # pip install chromedriver-autoinstaller

chromedriver_autoinstaller.install() # To update your chromedriver automatically
driver = webdriver.Chrome()

# Get free proxies for rotating
def get_free_proxies(driver):
    driver.get('https://sslproxies.org')

    table = driver.find_element(By.TAG_NAME, 'table')
    thead = table.find_element(By.TAG_NAME, 'thead').find_elements(By.TAG_NAME, 'th')
    tbody = table.find_element(By.TAG_NAME, 'tbody').find_elements(By.TAG_NAME, 'tr')

    headers = []
    for th in thead:
        headers.append(th.text.strip())

    proxies = []
    for tr in tbody:
        proxy_data = {}
        tds = tr.find_elements(By.TAG_NAME, 'td')
        for i in range(len(headers)):
            proxy_data[headers[i]] = tds[i].text.strip()
        proxies.append(proxy_data)

    return proxies

free_proxies = get_free_proxies(driver)
print(free_proxies)
You will get output like this, as a list of Python dictionaries -
[{'IP Address': '200.85.169.18',
'Port': '47548',
'Code': 'NI',
'Country': 'Nicaragua',
'Anonymity': 'elite proxy',
'Google': 'no',
'Https': 'yes',
'Last Checked': '8 secs ago'},
{'IP Address': '191.241.226.230',
'Port': '53281',
'Code': 'BR',
'Country': 'Brazil',
'Anonymity': 'elite proxy',
'Google': 'no',
'Https': 'yes',
'Last Checked': '8 secs ago'},
.
.
.
}]
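The dictionaries above still need to be turned into "ip:port" strings before they can be handed to a browser. A minimal sketch of that step follows; `to_proxy_addresses` is a hypothetical helper name, and filtering on the 'Https' column is an assumption about what is wanted, not something the site requires.

```python
def to_proxy_addresses(rows, https_only=True):
    """Turn scraped table rows (dicts keyed by column header) into "ip:port" strings."""
    addresses = []
    for row in rows:
        ip = row.get("IP Address", "").strip()
        port = row.get("Port", "").strip()
        if not ip or not port:
            continue  # skip empty or malformed rows
        if https_only and row.get("Https", "").lower() != "yes":
            continue  # assumption: only keep proxies that support HTTPS
        addresses.append(f"{ip}:{port}")
    return addresses


# Two rows copied from the sample output above
sample = [
    {"IP Address": "200.85.169.18", "Port": "47548", "Https": "yes"},
    {"IP Address": "191.241.226.230", "Port": "53281", "Https": "yes"},
]
print(to_proxy_addresses(sample))
# → ['200.85.169.18:47548', '191.241.226.230:53281']
```

Each resulting address can then be passed to Chrome with options.add_argument('--proxy-server=' + address) before creating the driver.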