Extending script execution time beyond 5 minutes for Reddit scraping



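I am trying to collect all posts submitted to a particular subreddit using the code found here: https://www.labnol.org/internet/web-scraping-reddit/28369/. However, the execution time limit is reached before the script finishes.

I am looking for a way to extend the script's run time, ideally without any intervention from me once I click Run.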

const getThumbnailLink_ = url => {
  if (!/^http/.test(url)) return '';
  return `=IMAGE("${url}")`;
};

const getHyperlink_ = (url, text) => {
  if (!/^http/.test(url)) return '';
  return `=HYPERLINK("${url}", "${text}")`;
};

const writeDataToSheets_ = data => {
  const values = data.map(r => [
    new Date(r.created_utc * 1000),
    r.title,
    getThumbnailLink_(r.thumbnail),
    getHyperlink_(r.url, 'Link'),
    getHyperlink_(r.full_link, 'Comments')
  ]);
  const sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(sheet.getLastRow() + 1, 1, values.length, values[0].length).setValues(values);
  SpreadsheetApp.flush();
};

const isRateLimited_ = () => {
  const response = UrlFetchApp.fetch('https://api.pushshift.io/meta');
  const { server_ratelimit_per_minute: limit } = JSON.parse(response);
  return limit < 1;
};

const getAPIEndpoint_ = (subreddit, before = '') => {
  const fields = ['title', 'created_utc', 'url', 'thumbnail', 'full_link'];
  const size = 10000;
  const base = 'https://api.pushshift.io/reddit/search/submission';
  const params = { subreddit, size, fields: fields.join(',') };
  if (before) params.before = before;
  const query = Object.keys(params)
    .map(key => `${key}=${params[key]}`)
    .join('&');
  return `${base}?${query}`;
};

const scrapeReddit = (subreddit = 'AskMen') => {
  let before = '';
  do {
    const apiUrl = getAPIEndpoint_(subreddit, before);
    const response = UrlFetchApp.fetch(apiUrl);
    const { data } = JSON.parse(response);
    const { length } = data;
    before = length > 0 ? String(data[length - 1].created_utc) : '';
    if (length > 0) {
      writeDataToSheets_(data);
    }
  } while (before !== '' && !isRateLimited_());
};

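In general, it is better practice to optimize the script so that it does not hit the execution time defined by the quotas. In your case, one solution is to reduce the batch size per execution. The reference code you linked fetches 1000 posts per batch, whereas your code fetches 10000.

Try setting a smaller value and check whether the script's execution time still exceeds the quota.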

const getAPIEndpoint_ = (subreddit, before = '') => {
  const fields = ['title', 'created_utc', 'url', 'thumbnail', 'full_link'];
  const size = 1000;
  const base = 'https://api.pushshift.io/reddit/search/submission';
  const params = { subreddit, size, fields: fields.join(',') };
  if (before) params.before = before;
  const query = Object.keys(params)
    .map(key => `${key}=${params[key]}`)
    .join('&');
  return `${base}?${query}`;
};
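
If a smaller batch size alone does not keep a run under the limit, another option is to split the work across several executions. The following is only a rough sketch of that idea, not something from the linked article: `scrapeWithBudget` and `scrapeRedditResumable` are hypothetical names, the 4-minute budget and the `'before'` property key are assumptions, and it reuses `getAPIEndpoint_` and `writeDataToSheets_` from the question. Each run stops when the time budget is spent, saves the last `created_utc` cursor in the script properties, and schedules itself to continue a minute later.

```javascript
// Hypothetical helper: pages through results until the time budget is spent,
// then returns the cursor so that a later run can resume from it.
// fetchPage(before) must return an array of posts; writePage(data) stores them.
const scrapeWithBudget = ({
  fetchPage,
  writePage,
  before = '',
  budgetMs = 4 * 60 * 1000, // assumed budget, chosen to stay below the limit
  now = Date.now
}) => {
  const start = now();
  let cursor = before;
  while (now() - start < budgetMs) {
    const data = fetchPage(cursor);
    if (data.length === 0) return { done: true, before: '' };
    writePage(data);
    cursor = String(data[data.length - 1].created_utc);
  }
  return { done: false, before: cursor };
};

// Apps Script entry point (assumes getAPIEndpoint_ and writeDataToSheets_
// from the question are defined in the same project).
function scrapeRedditResumable() {
  const props = PropertiesService.getScriptProperties();
  const result = scrapeWithBudget({
    before: props.getProperty('before') || '',
    fetchPage: before => {
      const response = UrlFetchApp.fetch(getAPIEndpoint_('AskMen', before));
      return JSON.parse(response.getContentText()).data;
    },
    writePage: writeDataToSheets_
  });

  // Remove triggers left over from earlier runs of this function.
  ScriptApp.getProjectTriggers()
    .filter(t => t.getHandlerFunction() === 'scrapeRedditResumable')
    .forEach(t => ScriptApp.deleteTrigger(t));

  if (result.done) {
    props.deleteProperty('before');
  } else {
    // Save the cursor and continue in a new execution in about one minute.
    props.setProperty('before', result.before);
    ScriptApp.newTrigger('scrapeRedditResumable').timeBased().after(60 * 1000).create();
  }
}
```

Run `scrapeRedditResumable` once from the editor; under the assumptions above, it should keep rescheduling itself until the API returns an empty page. Note that this sketch drops the `isRateLimited_` check from the question, so you may want to add it back inside `fetchPage`.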

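However, if your business needs exceed your quota, you can upgrade to Google Workspace Basic, Business, or Enterprise, depending on how much additional quota you need and how much you are willing to pay.

For more information on the different account types and pricing, see here.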
