Scraping a web page quickly using regex or another option



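Please see the update below first.

I'm trying to scrape all the moderators of a given subreddit on reddit. The API only lets you fetch the moderator usernames for a subreddit, so initially I fetched all of those and then made an additional request per profile to get the avatar URL. That ended up exceeding the API rate limit.

So instead I just want to fetch the source of the following page and paginate, collecting the 10 usernames and avatar URLs on each page. That would mean far fewer requests to the site. I know how to do the pagination, but I'm struggling to work out how to collect each username and its adjacent avatar URL.

So, taking the following URL: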

https://www.reddit.com/r/videos/about/moderators/

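I would pull the whole page source,

add each mod's username and avatar URL to a mod object, and then add those objects to an array.

Would using regex on the string I get back be a good idea?

Here is my code so far; any help would be great: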

func tester() {
    let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!
    let task = URLSession.shared.dataTask(with: url) { data, response, error in
        guard let data = data, error == nil else {
            print("\(error)")
            return
        }
        let string = String(data: data, encoding: .utf8)
        let regexUsernames = try? NSRegularExpression(pattern: "href=\"/user/[a-z0-9]\"", options: .caseInsensitive)
        var results = regexUsernames?.matches(in: string as String, options: [], range: NSRange(location: 0, length: string.length))
        let regexProfileURLs = try? NSRegularExpression(pattern: "><img src=\"[a-z0-9]\" style", options: .caseInsensitive)
        print("\(results)") // This shows as empty array
    }
    task.resume()
}

I also tried the following, but got this error:

Can't form Range with upperBound < lowerBound

Code:

func tester() {
    let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!
    let task = URLSession.shared.dataTask(with: url) { data, response, error in
        guard let data = data, error == nil else {
            print("data was nil")
            return
        }
        guard let htmlString = String(data: data, encoding: .utf8) else {
            print("cannot cast data into string")
            return
        }
        let leftSideOfValue = "href=\"/user/"
        let rightSideOfValue = "\""
        guard let leftRange = htmlString.range(of: leftSideOfValue) else {
            print("cannot find range left")
            return
        }
        guard let rightRange = htmlString.range(of: rightSideOfValue) else {
            print("cannot find range right")
            return
        }
        let rangeOfTheValue = leftRange.upperBound..<rightRange.lowerBound
        print(htmlString[rangeOfTheValue])
    }
    task.resume()
}

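Update:

So I've got to the point where it gives me the first username, but I'm looping and just getting the same username over and over. What's the best way to advance on each step? Is there a way to replace htmlString with the substring that follows the element we just got, something like let newHTMLString = htmlString.dropFirst(k: ?)?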

func tester() {
    let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!
    let task = URLSession.shared.dataTask(with: url) { data, response, error in
        guard let data = data, error == nil else {
            print("data was nil")
            return
        }
        guard let htmlString = String(data: data, encoding: .utf8) else {
            print("cannot cast data into string")
            return
        }

        let counter = htmlString.components(separatedBy: "href=\"/user/")
        let count = counter.count
        for i in 0...count {
            let leftSideOfUsernameValue = "href=\"/user/"
            let rightSideOfUsernameValue = "\""
            let leftSideOfAvatarURLValue = "><img src=\""
            let rightSideOfAvatarURLValue = "\">"

            guard let leftRange = htmlString.range(of: leftSideOfUsernameValue) else {
                print("cannot find range left")
                return
            }
            guard let rightRange = htmlString.range(of: rightSideOfUsernameValue) else {
                print("cannot find range right")
                return
            }
            let username = htmlString.slice(from: leftSideOfUsernameValue, to: rightSideOfUsernameValue)
            print(username)
            guard let avatarURL = htmlString.slice(from: leftSideOfAvatarURLValue, to: rightSideOfAvatarURLValue) else {
                print("Error")
                return
            }
            print(avatarURL)
        }
    }
    task.resume()
}

I also tried:

let endString = String(avatarURL + rightSideOfAvatarURLValue)
let endIndex = htmlString.index(endString.endIndex, offsetBy: 0)
let substringer = htmlString[endIndex...]
htmlString = String(substringer)
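
One way to step through the page without getting the same match again is to keep a moving search start and pass it to `range(of:range:)`, rather than rebuilding the string each time. The following is only a sketch under assumptions: `allSlices` and the sample markup are hypothetical and not part of the original post. Searching for the closing quote only after the opening marker also sidesteps the `upperBound < lowerBound` error, which most likely occurs because the first `"` in the page appears before the first `href="/user/`.

```swift
import Foundation

// Hypothetical helper (not from the original post): collects every substring
// found between `left` and `right`, moving the search start forward each time.
func allSlices(in text: String, from left: String, to right: String) -> [String] {
    var found: [String] = []
    var searchStart = text.startIndex
    while let leftRange = text.range(of: left, range: searchStart..<text.endIndex),
          let rightRange = text.range(of: right, range: leftRange.upperBound..<text.endIndex) {
        found.append(String(text[leftRange.upperBound..<rightRange.lowerBound]))
        // Continue after the closing delimiter so the same match is not returned again.
        searchStart = rightRange.upperBound
    }
    return found
}

// Made-up sample markup standing in for the downloaded page source.
let sampleHTML = "<a class=\"x\" href=\"/user/alice\">alice</a><a href=\"/user/bob\">bob</a>"
print(allSlices(in: sampleHTML, from: "href=\"/user/", to: "\""))
// prints ["alice", "bob"]
```

The sample deliberately contains a `"` before the first `href="/user/`, which is the situation that breaks a search for the right delimiter from the start of the string.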

You should be able to pull all the names and URLs into two separate arrays with a simple regex, like this:

import Foundation

func tester() {
    let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!
    let task = URLSession.shared.dataTask(with: url) { data, response, error in
        guard let data = data, error == nil else { return }
        guard let htmlString = String(data: data, encoding: .utf8) else { return }
        let names = htmlString.matching(regex: "href=\"/user/(.*?)\"")
        let imageUrls = htmlString.matching(regex: "><img src=\"(.*?)\" style")
        print(names)
        print(imageUrls)
    }
    task.resume()
}

extension String {
    // Returns the full matched text for every match of `regex` in this string.
    func matching(regex: String) -> [String] {
        guard let regex = try? NSRegularExpression(pattern: regex, options: []) else { return [] }
        // NSRange counts UTF-16 units, so derive it from the string's indices rather than `count`.
        let fullRange = NSRange(startIndex..<endIndex, in: self)
        let result = regex.matches(in: self, options: [], range: fullRange)
        return result.compactMap { match in
            Range(match.range, in: self).map { String(self[$0]) }
        }
    }
}
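
If only the username or URL itself is wanted, without the surrounding `href="/user/` or `<img src="` text, a variant can read the first capture group of each match instead of the whole match. This is a sketch under assumptions: `Moderator`, `firstCaptures`, and the sample markup are hypothetical names and data, and pairing the two arrays with `zip` assumes every moderator has both a link and an image in the same order.

```swift
import Foundation

// Hypothetical type for illustration; the original post only mentions a "mod object".
struct Moderator: Equatable {
    let username: String
    let avatarURL: String
}

// Returns the text of the first capture group for every match of `pattern`.
func firstCaptures(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return [] }
    // NSRange counts UTF-16 units, so build it from the string's indices.
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range(at: 1), in: text).map { String(text[$0]) }
    }
}

// Made-up sample markup standing in for the downloaded page source.
let sampleHTML = """
<a href="/user/alice"><img src="https://example.com/a.png" style="x"></a>
<a href="/user/bob"><img src="https://example.com/b.png" style="y"></a>
"""

let names = firstCaptures(of: "href=\"/user/(.*?)\"", in: sampleHTML)
let avatars = firstCaptures(of: "><img src=\"(.*?)\" style", in: sampleHTML)
// zip stops at the shorter array, so a missing avatar would misalign later pairs.
let moderators = zip(names, avatars).map { Moderator(username: $0, avatarURL: $1) }
print(moderators.map { $0.username })
// prints ["alice", "bob"]
```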

Or you could create an object for each <div class="_1sIhmckJjyRyuR_z7M5kbI"> and then pull out the name and URL from it as needed.
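
The per-block idea might look roughly like the sketch below, which splits the page on the opening `div` tag and reads the name and avatar from each piece. It rests on assumptions: that each such `div` wraps exactly one moderator, that the generated class name is still current, and that the `Moderator`, `slice`, and `moderators(in:)` names (all hypothetical) suit your code. Parsing block by block keeps each username tied to its own avatar, so an entry with no image is skipped rather than shifting the pairs that follow.

```swift
import Foundation

// Hypothetical type for illustration; the original post only mentions a "mod object".
struct Moderator: Equatable {
    let username: String
    let avatarURL: String
}

// Returns the substring between `left` and `right`, or nil if either is missing.
func slice(_ text: String, from left: String, to right: String) -> String? {
    guard let leftRange = text.range(of: left),
          let rightRange = text.range(of: right, range: leftRange.upperBound..<text.endIndex)
    else { return nil }
    return String(text[leftRange.upperBound..<rightRange.lowerBound])
}

func moderators(in html: String) -> [Moderator] {
    // Class name copied from the text above; it is generated and may change.
    let blockMarker = "<div class=\"_1sIhmckJjyRyuR_z7M5kbI\">"
    // The first component is everything before the first block, so skip it.
    return html.components(separatedBy: blockMarker).dropFirst().compactMap { block -> Moderator? in
        guard let name = slice(block, from: "href=\"/user/", to: "\""),
              let avatar = slice(block, from: "<img src=\"", to: "\"")
        else { return nil }
        return Moderator(username: name, avatarURL: avatar)
    }
}

// Made-up sample markup; the second block has no image and is skipped.
let sampleHTML = """
<div class="_1sIhmckJjyRyuR_z7M5kbI"><a href="/user/alice"><img src="https://example.com/a.png" style="x"></a></div>
<div class="_1sIhmckJjyRyuR_z7M5kbI"><a href="/user/bob"></a></div>
"""
print(moderators(in: sampleHTML).map { $0.username })
// prints ["alice"]
```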
