How to write a function that fixes href links



Is there a requests/selenium function that converts an href link into a proper link, for example:

clickLink("https://www.google.com","about")

which returns a value like https://www.google.com/about?

In other words, it would take an href value and turn it into a regular, absolute link.

For example:

https://google.com about https://google.com/about
//www.pastebin.com/ / https://www.pastebin.com/

etc

I tried to write one myself, but had no luck:

    def fixLink(Link,LinkOriginalPage):
        '''Fixes link. ex. /f/d -> https://www.wtds.com/f/d
        LinkOriginalPage=page Link redirected from'''
        if Link.startswith("https://") or Link.startswith("http://"):
            return "debug1 " + Link # , and exit
        #fix 329 links crawled! - Latest link: https://www.wikipedia.com/https://kl.wikipedia.org/
        if Link.startswith("//"):
            Link="debug2 " + "https:"+Link # example, //www.pastebin.com/ -> https://www.pastebin.com/
            # print(Link)
            return Link # due to glitch
        # now link does not start with //
        # check if link is like a/b/c->site.com/a/b/c
        asciiLetters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        linkStartsWithValidProtocol=not (Link.startswith("http://") or Link.startswith("https://"))
        linkDoesNotStartWithSlash=Link[0] in asciiLetters
        if linkStartsWithValidProtocol and linkDoesNotStartWithSlash:
            if LinkOriginalPage.endswith("/"):
                Link="debug3 " + LinkOriginalPage+Link
            else:
                Link="debug4 " + LinkOriginalPage+"/"+Link
            return Link
        # now link does not start with ascii letter
        # check if link is like /a/b/c
        if Link.startswith("/"):
            domainOfLink=getDomainFromLink(LinkOriginalPage)
            # print(domainOfLink)
            Link="debug 5|"+LinkOriginalPage+" http://"+domainOfLink+Link
            # print("startswith / "+Link)
            return Link # due to glitch
        # fix div links (widely used bad code practice)
        if Link.startswith("#"):
            #glitch, invalid url like *&YT -> invalid url schema
            #fix div
            domainOfLink=getDomainFromLink(LinkOriginalPage)
            Link="debug 6 "+domainOfLink+Link
            return Link
        # return the output if not returned (nvm)
        return "https://about.io"

You can use the "urljoin" function. Here is an example.

from urllib.parse import urljoin
a = "http://www.example.com"
b = "index.html"
print(urljoin(a,b))
# Returns 'http://www.example.com/index.html'

P.S. http://www.example.com/ actually exists.
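As a rough sketch of how urljoin could replace the hand-written fixLink in the question, the snippet below wraps it in a small helper and runs it on the kinds of hrefs the question mentions. The name fix_link and the sample page URL are made up for illustration; only urljoin itself comes from the standard library.

```python
from urllib.parse import urljoin


def fix_link(link, page_url):
    """Resolve an href against the URL of the page it was found on.

    fix_link is a hypothetical helper name; urljoin does the actual work.
    """
    return urljoin(page_url, link)


# Assumed example page; any absolute URL works as the base.
page = "https://www.example.com/docs/index.html"

print(fix_link("https://kl.wikipedia.org/", page))  # already absolute: unchanged
print(fix_link("//www.pastebin.com/", page))        # protocol-relative: takes the page's scheme
print(fix_link("/f/d", page))                       # root-relative: keeps scheme and host
print(fix_link("about", page))                      # relative: replaces the last path segment
print(fix_link("#top", page))                       # fragment: appended to the page URL
```

Note that a relative href such as "about" is resolved against the directory of the page, not appended to the full page URL, which is how browsers treat it as well.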

Have you tried treating these links as plain strings? For example:

x = "google.com"
y = "about"
final_string= x+y

and then passing the result to your function as an argument?
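If you go the string route, keep in mind that plain concatenation does not insert a separator, so "google.com" + "about" gives "google.comabout". A minimal sketch that normalizes the slash between the two parts might look like this; join_simple is a made-up name, and it assumes the base already includes its scheme and that the second part is a simple path.

```python
def join_simple(base, path):
    """Join a base URL and a path with exactly one slash between them.

    Hypothetical helper: it does not handle absolute hrefs, "//host" links,
    or "#fragment" links, only the simple base + path case.
    """
    return base.rstrip("/") + "/" + path.lstrip("/")


print(join_simple("https://google.com", "about"))
print(join_simple("https://google.com/", "/about"))
```

Both calls produce https://google.com/about. For anything beyond this simple case, a real URL resolver is the safer choice.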
