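I have an application that needs to fetch nodes from an HTML page on a website. The problem is that the page requires the user to log in. I looked for topics about logging in to websites, and in most of them people have just two fields: login and password.
But in my case there is also a combo box with a list of cities (login form screenshot). My current code: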
using System;
using System.Collections.Specialized;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        var client = new CookieAwareWebClient();
        client.BaseAddress = @"https://mystat.itstep.org/ru/login";
        var loginData = new NameValueCollection();
        loginData.Add("login", "login");
        loginData.Add("password", "password");
        client.UploadValues("login.php", "POST", loginData);
        string htmlSource = client.DownloadString("index.php");
        Console.WriteLine("Logged in!");
    }
}

// WebClient that keeps cookies between requests
public class CookieAwareWebClient : WebClient
{
    private CookieContainer cookie = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest httpRequest)
        {
            httpRequest.CookieContainer = cookie;
        }
        return request;
    }
}
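How can I select one of the cities in this list from C#?
You have to do an initial GET first, to obtain the cookies and the CSRF token that you need for your first POST. The CSRF token has to be parsed out of the first HTML response so that you can send it along with your username and password.
This is what your main flow should look like: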
var client = new CookieAwareWebClient();
client.BaseAddress = @"https://mystat.itstep.org/en/login";
// do an initial GET so that the cookies are sent to you
// and a server session is initiated;
// we also need to find the csrf token
var login = client.DownloadString("/");
string csrf;
// parse the response and go looking for the csrf token
ParseLogin(login, out csrf);
var loginData = new NameValueCollection();
loginData.Add("login", "someusername");
loginData.Add("password", "somepassword");
loginData.Add("city_id", "29"); // I picked this value from the raw html
loginData.Add("_csrf", csrf);
var loginResult = client.UploadValues("login.php", "POST", loginData);
// get the string from the received bytes
Console.WriteLine(Encoding.UTF8.GetString(loginResult));
// your task is to make sense of this result
Console.WriteLine("Logged in!");
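The parsing needs to be only as complex as the job requires. I only implemented the part that gets you the CSRF token. I leave the parsing of the cities (hint: they start at <select, then there is an <option on each line, until you find a </select>) for you to implement as an advanced exercise. Don't bother asking me for it.
Here is the CSRF parsing logic: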
void ParseLogin(string html, out string csrf)
{
    csrf = null;
    // read each line of the html
    using (var sr = new StringReader(html))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            // parse for csrf by looking for the input tag
            if (line.StartsWith(@"<input type=""hidden"" name=""_csrf""") && csrf == null)
            {
                // string split by space
                csrf = line
                    .Split(' ') // split to array of strings
                    .Where(s => s.StartsWith("value")) // value="what we need is here">
                    .Select(s => s.Substring(7, s.Length - 9)) // remove value=" and the last ">
                    .First();
            }
        }
    }
}
If you're feeling adventurous, you can write your own HTML parser, go wild with string methods, try some regular expressions, or use a library.
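As a rough illustration of the regular-expression route, here is a minimal sketch. It assumes the hidden input is named _csrf with its name attribute before its value attribute, and that the city list is a <select> named city_id (a guess based on the form field used above). The class and method names (LoginPageParser, ExtractCsrf, ExtractCities) and the sample markup are made up for this example, so check them against the real page source before relying on it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LoginPageParser
{
    // Returns the value of the hidden _csrf input, or null if it is not found.
    // Assumes the name attribute comes before the value attribute.
    public static string ExtractCsrf(string html)
    {
        var match = Regex.Match(
            html,
            @"<input[^>]*\bname=""_csrf""[^>]*\bvalue=""([^""]*)""",
            RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Returns a map from option value (the city id) to the displayed city name.
    // Assumes the <select> is named city_id; returns an empty map otherwise.
    public static Dictionary<string, string> ExtractCities(string html)
    {
        var cities = new Dictionary<string, string>();
        var select = Regex.Match(
            html,
            @"<select[^>]*\bname=""city_id""[^>]*>(.*?)</select>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!select.Success)
        {
            return cities;
        }

        var options = Regex.Matches(
            select.Groups[1].Value,
            @"<option[^>]*\bvalue=""([^""]*)""[^>]*>([^<]*)",
            RegexOptions.IgnoreCase);
        foreach (Match option in options)
        {
            cities[option.Groups[1].Value] = option.Groups[2].Value.Trim();
        }
        return cities;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Hypothetical markup, not copied from the real login page.
        const string html = @"<form>
<input type=""hidden"" name=""_csrf"" value=""abc123"">
<select name=""city_id"">
<option value=""29"">City A</option>
<option value=""30"">City B</option>
</select>
</form>";

        Console.WriteLine(LoginPageParser.ExtractCsrf(html));
        foreach (var city in LoginPageParser.ExtractCities(html))
        {
            Console.WriteLine(city.Key + " = " + city.Value);
        }
    }
}
```

Regular expressions like these break easily when the markup changes (different attribute order, single quotes, nested tags), which is the main argument for using a proper HTML parsing library instead.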
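Keep in mind that scraping a website may violate its terms of service. Verify that what you are doing is allowed and does not interfere with the site's operation.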