Downloading *.mp4 files with Python



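I am trying to download and save lecture videos from a website. The files download without errors, but they will not play in my media player. This is the code I am using: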

from bs4 import BeautifulSoup
import re
import urllib2

# Parse a saved copy of the course page and collect every link that mentions mp4
snippet = open('Python/SNA Page Source Revised.txt', 'r')
soup = BeautifulSoup(snippet)
links = [link.get('href') for link in soup.find_all('a')]
videos = []
for link in links:
    match = re.search('.*mp4.*', link)
    if match:
        videos.append(link)

# Download each video to a numbered file
vidNum = 1
for video in videos:
    f = urllib2.urlopen(video)
    # str() is needed here: concatenating an int to a str raises TypeError
    with open('Data Analysis/Social Network Analysis/Video ' + str(vidNum) + '.mp4', 'wb') as code:
        code.write(f.read())
    vidNum += 1

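Everything seems to work, but when I try to play one of the videos I get the following error: "Python (v2.7) requires to install plugins to play media files of the following type: text/html decoder". Also, if I download a video manually from the website the file is about 22.8 MB, but the file written by my script is only 7.8 kB.

Is there something wrong with the way I am downloading the files? Any help would be greatly appreciated.

Also: I am using Python v2.7 on Ubuntu 12.04 LTS.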

***EDIT***

Here is the code I am using now, based on the replies I received:

import requests

r = requests.get('https://class.coursera.org/sna-003/lecture/download.mp4?lecture_id=2', auth=('myUsername', 'myPassword'))
with open('Data Analysis/TestFile.mp4', 'wb') as fd:
    fd.write(r.content)

Here is the output of r.content:

<!DOCTYPE html>
<html itemtype="http://schema.org" xmlns:fb="http://ogp.me/ns/fb#"><head><meta content="IE=Edge,chrome=IE7" http-equiv="X-UA-Compatible"/><meta content="!" name="fragment"/><meta content="NOODP" name="robots"/><meta charset="utf-8"/><meta content="Coursera" property="og:title"/><meta content="website" property="og:type"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" property="og:image"/><meta content="https://www.coursera.org/" property="og:url"/><meta content="Coursera" property="og:site_name"/><meta content="en_US" property="og:locale"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." property="og:description"/><meta content="727836538,4807654" property="fb:admins"/><meta content="274998519252278" property="fb:app_id"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." name="description"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" name="image"/><meta content="app-id=736535961" name="apple-itunes-app"/><script>window.onerror = function(message, url, lineNum) {
// First check the URL and line number of the error
url = url || window.location.href;
// 99% of the time, errors without line numbers arent due to our code,
// they are due to third party plugins and browser extensions
if (lineNum === undefined || lineNum == null) return;
// Now figure out the actual error message
// If it's an event, as triggered in several browsers
if (message.target &amp;&amp; message.type) {
message = message.type;
}
if (!message.indexOf) {
message = 'Non-string, non-event error: ' + (typeof message);
}
var errorDescrip = {
message: message,
script: url,
line: lineNum,
url: document.URL
}
var err = {
key: 'page.error.javascript', 
value: errorDescrip
}
window._204 = window._204 || [];
window._204.push(err);
window._gaq = window._gaq || [];
window._gaq.push(err);
}</script><title>Coursera.org</title><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/css/home.css" rel="stylesheet" type="text/css"/><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/pages/auth/css/auth.css" rel="stylesheet" type="text/css"/><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" id="_mobile">(function(el) {
// Override certian behaviour if the page is for our mobile app.
// TODO(priya) Remove this conditional behaviour once I want to push this behaviour
// for regular authentication pages on mobile/smaller screens as well.
// Currently I'm keeping existing behaviour same and only adding mobile specific
// layouts ot /mobilesignup page (which is what isMobileApp = true signifies).
if ("false" == "true") {
var head = document.getElementsByTagName('head')[0];
// Add viewport meta tag
var viewport = document.querySelector('meta[name=viewport]');
var viewportContent = 'width=device-width, initial-scale=1.0, user-scalable=no';
if (!viewport) {
viewport = document.createElement('meta');
viewport.setAttribute('name', 'viewport');
head.appendChild(viewport);
}
viewport.setAttribute('content', viewportContent);
// Add responsive css
var link  = document.createElement('link');
link.rel  = 'stylesheet';
link.type = 'text/css';
link.href = el.getAttribute("data-baseurl") + "pages/auth/css/auth_responsive.css";
head.appendChild(link);
}
})(document.getElementById("_mobile"));
</script></head><body><div id="fb-root"></div><div id="origami"><div style="position:absolute;top:0px;left:0px;width:100%;height:100%;background:#f5f5f5;padding-top:5%;"><div id="coursera-loading-nojs" style="text-align:center; margin-bottom:10px;display:none;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div><div><span id="coursera-loading-js" style="display: none; padding-left:45%">loading   <img src="https://d2wvvaown1ul17.cloudfront.net/site-static/images/icons/loading.gif"/></span></div><noscript><div style="text-align:center; margin-bottom:10px;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div></noscript></div></div><!--[if gte IE 8]&gt;&lt;script&gt;document.getElementById("coursera-loading-js").style.display = 'block';&lt;/script&gt;&lt;![endif]-->
<!--[if lte IE 7]&gt;&lt;script&gt;document.getElementById("coursera-loading-nojs").style.display = 'block';
window._204 = window._204 || [];
window._gaq = window._gaq || [];
window._gaq.push(
['_setAccount', 'UA-28377374-1'],
['_setDomainName', window.location.hostname],
['_setAllowLinker', true],
['_trackPageview', window.location.pathname]);
window._204.push(
['client', 'home'],
{key:"pageview", value:window.location.pathname});
&lt;/script&gt;&lt;script src="https://eventing.coursera.org/204.min.js"&gt;&lt;/script&gt;&lt;script src="https://ssl.google-analytics.com/ga.js"&gt;&lt;/script&gt;&lt;![endif]-->
<!--[if !IE]&gt; --><script>document.getElementById("coursera-loading-js").style.display = 'block';</script><!-- &lt;![endif]--><script src="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/js/core/require.js" type="text/javascript"></script><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" data-debug="0" data-locale="" data-timestamp="1386838999742" data-version="e47434615f57601f9b9ccaf255a589e8550d328d" id="_require" type="text/javascript">if(document.getElementById("coursera-loading-js").style.display == 'block') {
(function(el) {
// prevent throw
require.onError = function(err) {
window._204 = window._204 || [];
window._204.push({key: 'requireErr', value: err});
};
define("pages/auth/authConfig",
function() {
return {"coursera_url": "https://www.coursera.org/",
"environment": "production"};
}
);
require.config({
enforceDefine: false,
waitSeconds: 14,
baseUrl: el.getAttribute("data-baseurl"),
urlArgs: el.getAttribute("data-debug") == "1" ? "v=" + el.getAttribute("data-timestamp") : "",
shim: {
"underscore": {
exports: '_'
},
"backbone": {
deps: ['underscore', 'jquery'],
exports: 'Backbone'
}
},
paths: {
"jquery":       "js/core/jquery",
"underscore":   "js/core/underscore",
"backbone":     "js/core/backbone",
"i18n":         "js/core/i18n._t"
},
callback: function() {
require(["pages/auth/routes"]); // bootup coursera
},
config: {
i18n: {
locale: (window.localStorage ? localStorage.getItem("locale") : '') || el.getAttribute("data-locale")
}
}
});
})(document.getElementById("_require"));
}</script><script type="text/javascript">define("pages/home/models/user.json", [], function(){
return null;
});
</script></body></html>

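This seems strange to me, though, because it looks just like the site's page source, yet when I look at r.url I get a real URL that loads in my browser and prompts me to save or view the video. Even when I pass in the new URL I got from it, which I think contains my cookie information, I still get the same content. I don't understand where I am going wrong.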

First, download and install the requests package.

Then use this code:

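import requests

def downloadfile(name, url):
    name = name + ".mp4"
    # Pass the url variable, not the literal string 'url';
    # stream=True avoids loading the whole video into memory at once
    r = requests.get(url, stream=True)
    print("****Connected****")
    with open(name, 'wb') as f:
        print("Downloading.....")
        for chunk in r.iter_content(chunk_size=255):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done")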

You need a valid cookie, otherwise you end up downloading the login page instead of the video.

Here is how to set a cookie with urllib2:

import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'cookiename=cookievalue'))
f = opener.open("http://example.com/")

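Alternatively, you can use cookielib to behave more like a web browser: go through the login process and obtain the correct cookies for downloading your movie.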
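The cookie-jar approach can be sketched roughly as follows. This is a minimal sketch written for Python 3, where cookielib is named http.cookiejar and urllib2 became urllib.request. The names login_url and form_fields are hypothetical: the real login URL and form field names depend on the site and are not given in this thread, and a site that also expects a CSRF token would need that value included.

```python
import http.cookiejar
import shutil
import urllib.parse
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_and_download(opener, login_url, form_fields, video_url, dest_path):
    """Log in with a form POST, then save video_url to dest_path.

    login_url and form_fields are assumptions for illustration only.
    """
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    # Any cookies the server sets in its reply are stored in the opener's jar
    opener.open(login_url, data=body).close()
    # The same opener sends those cookies back with the download request
    with opener.open(video_url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

Because the jar is attached to the opener, every later request made through that opener carries the session cookies automatically, which is what a browser does.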

Another option is requests, which is similar to urllib2 but makes it easier to automate the login process.

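I would first save the file as .html instead of .mp4, so you can be 100% sure whether it is a login page, an error page, or some other miscellaneous garbage. Some sites require cookies, a specific User-Agent (to block bots, scrapers, and automated vulnerability scanners), a Referer, and so on.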
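The same check can be done in code before writing anything to disk. The sketch below is only a heuristic, and classify_download is a hypothetical helper name: it treats a payload as HTML if the Content-Type says so or the body starts with an HTML tag, and as MP4 if bytes 4 to 8 spell "ftyp", which is how MP4 files normally begin.

```python
def classify_download(first_bytes, content_type=""):
    """Guess whether a downloaded payload is an MP4 file or an HTML page."""
    if "text/html" in content_type.lower():
        return "html"
    head = first_bytes[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    # MP4 files normally start with a box whose type field (bytes 4-8) is "ftyp"
    if first_bytes[4:8] == b"ftyp":
        return "mp4"
    return "unknown"
```

With requests, this could be called as classify_download(r.content[:512], r.headers.get("Content-Type", "")). A result of "html" for a URL ending in .mp4 strongly suggests the server sent a login or error page.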

Personally, I use Tamper Data or Live HTTP Headers to check that my program is sending the right requests while debugging.

If you are getting a CloudFront response, you are probably not handling the cookies, User-Agent, or Referer correctly.
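As a rough illustration, the headers can be set explicitly on a request object. This sketch uses Python 3's urllib.request; build_request is a hypothetical helper, and the header values a given site actually requires are an assumption that should be copied from what the browser sends.

```python
import urllib.request


def build_request(url, cookie_header, referer, user_agent="Mozilla/5.0"):
    """Build a request carrying browser-like headers.

    The default user_agent is a placeholder; copy the real value, the
    cookie string, and the referer from a browser session.
    """
    headers = {
        "User-Agent": user_agent,
        "Referer": referer,
        "Cookie": cookie_header,
    }
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to urllib.request.urlopen. Comparing these headers with the ones shown by a header-inspection tool is a quick way to spot what the script is missing.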

I just checked the link, and there is also a CSRF cookie {CSRF_token=toNQOP7stgOREzrDcbPc}, which you will 100% need in order to see anything past the login page.

If you have the direct link, you can also download the MP4 video with curl, which is easier:

import os

# yourURL holds the direct link to the video; f-strings need Python 3.6+
os.system(f"curl {yourURL} --output c:/Users/Desktop/yourFile.mp4")
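A slightly more defensive variant, offered as a sketch rather than a replacement, passes the arguments as a list through subprocess so that special characters in the URL (such as & or ?) are not interpreted by the shell. The function names are hypothetical, and curl is assumed to be installed and on the PATH.

```python
import subprocess


def build_curl_command(url, dest_path):
    """Return the curl argument list for saving url to dest_path."""
    # --location follows redirects; --fail makes curl exit non-zero on HTTP errors
    return ["curl", "--location", "--fail", "--output", dest_path, url]


def download_with_curl(url, dest_path):
    """Run curl and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_curl_command(url, dest_path), check=True)
```

Note that --fail only catches HTTP error statuses. A login page returned with status 200 would still be saved, so the cookie requirement discussed above applies here as well.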
