Downloaded string returns gibberish



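I'm trying to get the text of a web page's source code as a string so that I can parse it. The result is formatted vaguely like the site's HTML, but the text is nonsense. I'm doing this as part of a tutorial, and the source code provided by the instructor gives me the same problem. It also happens with every website I try. Could something be wrong with my computer or network connection?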

Log output:

07-26 17:29:49.143 10863-10863/org.andrewedgar.downloadwebcontent I/Result: !otp tl
<-[fl E7>   hm ls=n-sl-e ti8l-e"ln=" !edf-><-[fI ]     hm ls=n-sl-e ti8 ag"><[ni]-
!-i E8>    <tlcas"oj ti9 ag"><[ni]-
!-i tI ]<-><tlcas"oj"ln=e" !-!edf->  <ed
mt hre=uf8>    <eanm=vepr"cnet"it=eiewdh nta-cl="
mt ae"ecito"cnet"omi  ooulaplnigpg hr nld wsm adn aedms"
mt ae"uhr otn=Wwhmz>
tteZpyoe/il>
ln e=sotu cn ye"mg/-cn rf"sai/m/aio.n"
<- otAeoeCS->    <ikrl"tlset rf"sai/s/otaeoemncs>    <- hmf cn S -
ln e=syehe"he=/ttccsteiyioscs>    <- lgn otIosCS->    <ikrl"tlset rf"sai/s/lgn-otioscs>    <- lgn ieIosCS->    <ikrl"tlset rf"sai/s/lgn-ieioscs>    <- otta S -
ln e=syehe"he=/ttccsbosrpmncs>    <- lcnvCS->    <ikrl"tlset rf"sai/s/lcnvmncs>    <- nmt S -
ln e=syehe"he=/ttccsaiaemncs>   <- eoo S -
ln e=syehe"he=/ttccsvnbxvnbxcs> <- W-aoslCS->    <ikrl"tlset rf"sai/s/w.aoslcs> <- anCS->    <ikrl"tlset rf"sai/s/ancs> <- epnieCS->    <ikrl"tlset rf"sai/s/epniecs>
srp r=/ttcj/edrmdrir283rsod142mnj"<srp>  <ha>  <oydt-p=srl"dt-agt"nveu aaofe=7"
!-i tI ]
pcas"rweugae>o r sn n<togotae<srn>bosr lae< rf"tp/boshpycm"ugaeyu rwe<a oipoeyu xeine<p
!edf->
dvi=peodr 
dvcas'odr 
dvcas"atr"<dv
/i>    <dv<- rlae -
<edri=hae"cas"edrscin>      <i ls=cnanr>        <a ls=nva"
ahe=# ls=nva-rn"<m d"rnLg"sc"sai/m/apCdLgWtTx.n"at"apcd"<a
dvcas"-lxmn-rp>            dvi=nveu ls=mimn"
<lcas"a"
l < aasrl ls=nvln cie rf"hm"Hm sa ls=s-ny>cret<sa>/>/i
/l
<dv
dvcas"eubn>              < rf"tp:/er.apcd.o"cas"utn1>er<a
/i>          <dv
/a>      <dv
/edr !-Hae -
<eto d"oe ls=hr_eto rdat1pdig>    <i ls=dslytbe>        <i ls=tbecl"
dvcas"otie"
dvcas"eocnet>             <1Lancd h<rfnwy/1
pPormigdenthv ob oigtdosadfutan.b>oehv oefnadlanhwt oe<p
ahe=hts/lanzpyoecm ls=bto_"LanNw/>            <dv
/i>
/i>   <dv
/eto>!-Hr eto -
<- QeyLb->  <citsc"sai/svno/qey11..i.s>/cit
!-BosrpJ -
srp r=/ttcj/edrbosrpmnj"<srp>   <- ehrJ -
srp r=/ttcj/edrtte.i.s>/cit
!-wyonsj -
srp r=/ttcj/edrjur.apit.203mnj"<srp>    <- lcnvJ -
srp r=/ttcj/edrjur.lcnvmnj"<srp>    <- W-aoslJ -
srp r=/ttcj/edrolcrue.i.s>/cit
!-CutrpJ -
srp r=/ttcj/edrjur.oneu.i.s>/cit
!-Sot colJ -
srp r=/ttcj/edrsot-colmnj"<srp> <- edrJ -
srp r=/ttcj/edrvnbxmnj"<srp>    <- jxhm S-> <citsc"sai/svno/qeyaacipmnj"<srp>   <- o S->    <citsc"sai/svno/o.i.s>/cit
!-Mi S->    <citsc"sai/smi.s>/cit
<bd><hm>

Code:

public class DownloadTask extends AsyncTask<String, Void, String> {
@Override
protected String doInBackground(String... urls) {
String result = "";
URL url;
HttpURLConnection urlConnection = null;
try {
url = new URL(urls[0]);
urlConnection = (HttpURLConnection) url.openConnection();
InputStream in = urlConnection.getInputStream();
InputStreamReader reader = new InputStreamReader(in);
int data = reader.read();
while (data != -1) {
data = reader.read();
char current = (char) data;
result += current;
data = reader.read();
}
return result;

} catch (Exception e) {
e.printStackTrace();
return "Failed";
}
}
}
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
DownloadTask task = new DownloadTask();
String result = null;

try {
result = task.execute("http://www.zappycode.com").get();
} catch (Exception e) {
e.printStackTrace();
}
Log.i("Result", result);
}
}

You read from the stream twice on each iteration:

while (data != -1) {
data = reader.read();  // <<- here
char current = (char) data;
result += current;
data = reader.read(); // <<- and here
}

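but append to the result only once. So you end up with only every other character (the odd-numbered ones), which is why "<!doctype html>" shows up in your log as "!otp tl". Something like this should work: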

int data;
while ((data = reader.read()) != -1) result += (char) data;
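
The skipping can be reproduced without any network access. The following is a minimal sketch, assuming a StringReader stands in for the HTTP stream; the class and method names (ReadLoopDemo, readSkipping, readAll) are made up for illustration and are not part of the original code:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLoopDemo {

    // Mirrors the loop from the question: read() is called twice per pass,
    // so the character that passed the loop check is overwritten before
    // anything is appended, and only every second character is kept.
    static String readSkipping(Reader reader) {
        try {
            StringBuilder result = new StringBuilder();
            int data = reader.read();
            while (data != -1) {
                data = reader.read();
                result.append((char) data);
                data = reader.read();
            }
            return result.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One read() per pass: every character is appended exactly once.
    static String readAll(Reader reader) {
        try {
            StringBuilder result = new StringBuilder();
            int data;
            while ((data = reader.read()) != -1) {
                result.append((char) data);
            }
            return result.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<!doctype html>";
        System.out.println(readSkipping(new StringReader(html)));
        System.out.println(readAll(new StringReader(html)));
    }
}
```

With the input "<!doctype html>", readSkipping returns a string that starts with "!otp tl", matching the first line of the log in the question. Because that input has an odd number of characters, the skipping loop also appends one stray character at the end: the cast of the end-of-stream value -1 to char.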

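But in general, reading raw bytes from the input and converting them to characters one at a time is not a good idea. Something like this would be more robust: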

BufferedReader br = new BufferedReader(reader);
StringBuilder accumulator = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    accumulator.append(line).append(System.lineSeparator());
}

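Your code appears to be reading raw 8-bit ASCII characters and displaying them. The site may be using a different character encoding (see the Wikipedia article on character encoding). Use a buffered reader and let Java convert the sequence of encoded bytes into a String, rather than reading byte by byte. @extractic pointed to another answer on Stack Overflow with a code example that should work here: How to read an HTTP input stream.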
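
To illustrate the encoding point, here is a hedged sketch of reading a stream with an explicit charset instead of the platform default. The helper name readText and the class name CharsetReadDemo are hypothetical, and UTF-8 is only an assumption: for a real page the charset would normally come from the Content-Type response header or the page's meta tag. A ByteArrayInputStream stands in for urlConnection.getInputStream() so the example runs offline:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadDemo {

    // Hypothetical helper (not from the original answers): decodes the whole
    // stream with the given charset and returns the text as one String.
    static String readText(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream).
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int count;
            // Read in chunks; count is the number of chars filled this pass.
            while ((count = br.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8 takes 5 bytes: the last letter needs two.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String right = readText(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        String wrong = readText(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(right.length()); // 4
        System.out.println(wrong.length()); // 5
    }
}
```

Decoding with the wrong charset garbles individual non-ASCII characters but leaves plain ASCII markup readable, so it would not by itself produce output where every other character is missing.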
