I found that I can use SgmlReader.SL to generate an XDocument object from HTML. https://bitbucket.org/neuecc/sgmlreader.sl/
The code looks like this.
public XDocument Html(TextReader reader)
{
XDocument xml;
using (var sgmlReader = new SgmlReader { DocType = "HTML", CaseFolding = CaseFolding.ToLower, InputStream = reader })
{
xml = XDocument.Load(sgmlReader);
}
return xml;
}
In addition, we can get the src attribute of each img tag from the XDocument object.
var ns = xml.Root.Name.Namespace;
var imgQuery = xml.Root.Descendants(ns + "img")
.Select(e => new
{
Link = e.Attribute("src").Value
});
And we can download the image's stream data and convert it to a BASE64 string.
public static string base64String;
WebClient wc = new WebClient();
// attach the handler before starting the request, so completion cannot be missed
wc.OpenReadCompleted += new OpenReadCompletedEventHandler(wc_OpenReadCompleted);
wc.OpenReadAsync(new Uri(url)); //image url from src attribute
void wc_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
using (MemoryStream ms = new MemoryStream())
{
while (true)
{
byte[] buf = new byte[32768];
int read = e.Result.Read(buf, 0, buf.Length);
if (read > 0)
{
ms.Write(buf, 0, read);
}
else { break; }
}
byte[] imageBytes = ms.ToArray();
base64String = Convert.ToBase64String(imageBytes);
}
}
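So, what I want to do is the steps below, and I would like to do them in a single method chain, as with LINQ or Reactive Extensions.
- Get the src attribute of each img tag from the XDocument object
- Get the image data from the URL
- Generate a BASE64 string from the image data
- Replace the src attribute with the BASE64 string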
Here is the simplest source and output.
Before
<html> <head> </head> <body> <img src='http://image.com/image.jpg' /> <img src='http://image.com/image2.png' /> </body> </html>
After
<html> <head> </head> <body> <img src='data:image/jpg;base64,iVBORw...' /> <img src='data:image/png;base64,iSDoske...' /> </body> </html>
Does anyone know a solution for this?
I would like to ask the experts.
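LINQ and Rx are both designed to facilitate transformations that produce new objects, not ones that modify existing objects, but this is still doable. You have already done the first step by breaking the task into parts. The next step is to make composable functions that implement those steps.
1) You basically have this one already, but we should probably keep the elements themselves so we can update them later.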
public IEnumerable<XElement> GetImages(XDocument document)
{
var ns = document.Root.Name.Namespace;
return document.Root.Descendants(ns + "img");
}
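2) From a composability standpoint, this seems to be where you hit a wall. First, let's make a FromEventAsyncPattern observable builder. Builders already exist for the Begin/End async pattern and for standard events, so this one sits somewhere between the two.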
public IObservable<TEventArgs> FromEventAsyncPattern<TDelegate, TEventArgs>
(Action method, Action<TDelegate> addHandler, Action<TDelegate> removeHandler
) where TEventArgs : EventArgs
{
return Observable.Create<TEventArgs>(
obs =>
{
//subscribe to the handler before starting the method
var ret = Observable.FromEventPattern<TDelegate, TEventArgs>(addHandler, removeHandler)
.Select(ep => ep.EventArgs)
.Take(1) //do this so the observable completes
.Subscribe(obs);
method(); //start the async operation
return ret;
}
);
}
Now we can use this method to turn a download into an observable. Based on your usage, I think you could also use DownloadDataAsync on WebClient.
public IObservable<byte[]> DownloadAsync(Uri address)
{
return Observable.Using(
() => new System.Net.WebClient(),
wc =>
{
return FromEventAsyncPattern<System.Net.DownloadDataCompletedEventHandler,
System.Net.DownloadDataCompletedEventArgs>
(() => wc.DownloadDataAsync(address),
h => wc.DownloadDataCompleted += h,
h => wc.DownloadDataCompleted -= h
)
.Select(e => e.Result);
//for robustness, you should probably check the error and cancelled
//properties instead of assuming it finished like I am here.
});
}
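The comment about checking the error and cancelled properties can be sketched as a small operator. This is only one possible shape, not part of the answer's code: the name SelectResult is hypothetical, and it assumes the event args derive from System.ComponentModel.AsyncCompletedEventArgs (as DownloadDataCompletedEventArgs and OpenReadCompletedEventArgs do). An error becomes OnError, a cancellation becomes an empty sequence, and only a successful completion reaches the selector.

```csharp
using System;
using System.ComponentModel;
using System.Reactive.Linq;

public static class AsyncCompletedExtensions
{
    // Hypothetical helper: projects completed-event args to a result,
    // surfacing Error as OnError and Cancelled as an empty sequence.
    public static IObservable<TResult> SelectResult<TArgs, TResult>(
        this IObservable<TArgs> source, Func<TArgs, TResult> selector)
        where TArgs : AsyncCompletedEventArgs
    {
        Func<TArgs, IObservable<TResult>> project = e =>
        {
            if (e.Error != null) return Observable.Throw<TResult>(e.Error);
            if (e.Cancelled) return Observable.Empty<TResult>();
            // Only touch the result once we know the operation succeeded.
            return Observable.Return(selector(e));
        };
        return source.SelectMany(project);
    }
}
```

With this in place, the final `.Select(e => e.Result)` in DownloadAsync could be written as `.SelectResult(e => e.Result)` instead.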
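Edit: Based on your comment, you appear to be using Silverlight, where WebClient is not IDisposable and does not have the method I was using. To work around that, try something like this: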
public IObservable<byte[]> DownloadAsync(Uri address)
{
var wc = new System.Net.WebClient();
var eap = FromEventAsyncPattern<OpenReadCompletedEventHandler,
OpenReadCompletedEventArgs>(
() => wc.OpenReadAsync(address),
h => wc.OpenReadCompleted += h,
h => wc.OpenReadCompleted -= h);
return from e in eap
from b in e.Result.ReadAsync()
select b;
}
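You will need to find an implementation of ReadAsync to read the stream. You should be able to find one fairly easily, and this post is already long enough, so I left it out.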
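For reference, a minimal sketch of what such a ReadAsync extension might look like. This is an assumption about its shape, not the implementation the answer had in mind: it reads the whole stream on a background thread via Observable.Start, reusing the buffered copy loop from the question, and yields the bytes as a single value. The class name StreamExtensions is hypothetical.

```csharp
using System;
using System.IO;
using System.Reactive.Linq;

public static class StreamExtensions
{
    // Hypothetical helper: reads the stream to the end and emits one byte[].
    // Defer makes the read start on subscription rather than on the call.
    public static IObservable<byte[]> ReadAsync(this Stream stream)
    {
        return Observable.Defer(() => Observable.Start(() =>
        {
            using (stream)                 // the stream is closed after reading
            using (var ms = new MemoryStream())
            {
                var buf = new byte[32768];
                int read;
                while ((read = stream.Read(buf, 0, buf.Length)) > 0)
                {
                    ms.Write(buf, 0, read);
                }
                return ms.ToArray();
            }
        }));
    }
}
```

Because the stream is disposed after the first read, the resulting observable is meant to be subscribed to once.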
3 & 4) Now we are ready to put everything together and update the elements. Since step 3 is so simple, I'll merge it with step 4.
public IObservable<Unit> ReplaceImageLinks(XDocument document)
{
return (from element in GetImages(document)
let address = new Uri(element.Attribute("src").Value)
select (from data in DownloadAsync(address)
select Convert.ToBase64String(data)
).Do(base64 => element.Attribute("src").Value = base64)
).Merge()
.IgnoreElements()
.Select(s => Unit.Default);
//select doesn't really do anything as IgnoreElements eats all
//the values, but it is needed to change the type of the observable.
//Task may be more appropriate here.
}
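Note that the desired output in the question uses a full data URI (`data:image/...;base64,...`), while the code above stores only the bare BASE64 string. A minimal sketch of building that prefix follows; the DataUri class and its extension-to-MIME table are hypothetical and only illustrative, guessing the type from the URL's file extension. The registered MIME type for JPEG is image/jpeg, so the sketch uses that where the question shows image/jpg.

```csharp
using System;
using System.IO;

public static class DataUri
{
    // Hypothetical helper: guesses a MIME type from the URL's file extension.
    // The table is illustrative, not exhaustive.
    public static string GuessMimeType(Uri address)
    {
        var ext = Path.GetExtension(address.AbsolutePath).ToLowerInvariant();
        switch (ext)
        {
            case ".jpg":
            case ".jpeg": return "image/jpeg";
            case ".png": return "image/png";
            case ".gif": return "image/gif";
            default: return "application/octet-stream";
        }
    }

    // Builds "data:<mime>;base64,<payload>" for use as an img src value.
    public static string Create(Uri address, byte[] data)
    {
        return "data:" + GuessMimeType(address) + ";base64," + Convert.ToBase64String(data);
    }
}
```

In ReplaceImageLinks, the inner query could then select `DataUri.Create(address, data)` instead of `Convert.ToBase64String(data)`. A server's Content-Type header would be a more reliable source for the MIME type than the extension, where it is available.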