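Title: [Text Processing] [Solved] How can a BAT script batch-fetch the title and description of web pages?
Author: hotkean Time: 2015-11-12 10:03
Last edited by hotkean on 2015-11-17 09:22
I have a text file, a.txt, containing N lines of URLs. How can a batch script export the title of each URL and save the results to a-title.txt?
For example, if a.txt contains:
http://www.bathome.net
https://www.qq.com/
then a-title.txt should contain:
批处理之家 批处理_BAT_CMD_DOS_VBS_Perl_Python_PowerShell - Powered by Discuz!
腾讯首页
While I'm at it, could someone also help with a batch script that exports each site's description to a text file, a-des.txt? Thanks in advance!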
Author: pcl_test Time: 2015-11-12 12:36
Last edited by pcl_test on 2015-11-12 12:52
//&cls&cscript -nologo -e:jscript "%~f0"<"url列表.txt"&pause&exit /b
// Batch/JScript hybrid: save as a .bat file; the line above redirects "url列表.txt" (the URL list) to this script on stdin.
// Decode a binary response body into text using the given charset.
function BintoStr(strBin, strCharset) {
    var stream = new ActiveXObject('ADODB.Stream');
    stream.Type = 1; // adTypeBinary
    stream.Mode = 3; // adModeReadWrite
    stream.Open();
    stream.Write(strBin);
    stream.Position = 0;
    stream.Type = 2; // adTypeText
    stream.Charset = strCharset;
    var text = stream.ReadText();
    stream.Close();
    return text;
}

// Fetch a URL and return its HTML as text; returns undefined if the request fails.
function getHtmlTxt(strURL) {
    var http = new ActiveXObject('Msxml2.XMLHTTP');
    try {
        http.open('GET', strURL, false);
        http.send();
        // Prefer the charset declared in the Content-Type response header.
        var m = /charset\s*=\s*["']?([^\s;"']+)/i.exec(http.getResponseHeader('Content-Type') || '');
        if (!m) {
            // Otherwise look for a charset declared in a <meta> tag.
            m = /<meta[^>]+charset\s*=\s*["']?([^\s"'\/>;]+)/i.exec(http.responseText);
        }
        // Decode the raw bytes when a charset is known; otherwise fall back to responseText.
        if (m) {
            return BintoStr(http.responseBody, m[1]);
        }
        return http.responseText;
    } catch (e) {}
}

//var fso = new ActiveXObject('Scripting.Filesystemobject');
var url = WScript.StdIn.ReadAll().split(/\s+/);
var s = '';
for (var i = 0; i < url.length; i++) {
    if (!url[i]) continue; // skip empty entries left by leading or trailing whitespace
    var txt = getHtmlTxt(url[i]);
    if (!txt) {
        s += url[i] + '\r\naccess denied\r\n\r\n';
    } else {
        var title = /<title[^>]*>([^<]+)<\/title>/i.exec(txt);
        var description = txt.match(/<meta[^>]+name\s*=\s*["']?description["']?[^>]*>/i);
        // Accept a content attribute in either double or single quotes.
        var content = description && /content\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(description[0]);
        var titlestr = title ? title[1] : 'not found';
        var descriptionstr = content ? (content[1] || content[2] || '') : 'not found';
        s += url[i] + '\r\ntitle:' + titlestr + '\r\ndescription:' + descriptionstr + '\r\n\r\n';
    }
}
WScript.Echo(s);
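The charset handling in getHtmlTxt (check the Content-Type header first, then a <meta> tag) can be tried on its own without any network access. The following is only a rough sketch under assumptions: the helper name detectCharset and the exact regular expressions are illustrative and are not taken from the posted script. It is written in plain ES3-style JavaScript, so it should run under cscript as well as in other JavaScript engines.

```javascript
// Sketch: pick a charset name from a Content-Type header value, falling back
// to a <meta> declaration in the markup. Returns null when neither has one.
// detectCharset is a hypothetical helper, not part of the posted script.
function detectCharset(contentType, html) {
    var fromHeader = /charset\s*=\s*["']?([^\s;"']+)/i.exec(contentType || '');
    if (fromHeader) {
        return fromHeader[1];
    }
    // Covers both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
    var fromMeta = /<meta[^>]+charset\s*=\s*["']?([^\s"'\/>;]+)/i.exec(html || '');
    return fromMeta ? fromMeta[1] : null;
}
```

The returned name would then be passed as the second argument of BintoStr. Matching case-insensitively and reading the captured group directly avoids having to strip the "charset=" prefix in a separate step.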
Author: hotkean Time: 2015-11-12 15:22
Thanks. If I want the titles and descriptions exported separately to two files, a-title.txt and a-description.txt (one entry per line), how should the code be changed?
Author: pcl_test Time: 2015-11-12 16:27
Reply to #3 hotkean
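Replace everything from line 38 (the commented-out "//var fso = ..." line) onward with:
var fso = new ActiveXObject('Scripting.FileSystemObject');
var url = WScript.StdIn.ReadAll().split(/\s+/);
var titlestr = '', descriptionstr = '';
for (var i = 0; i < url.length; i++) {
    if (!url[i]) continue; // skip empty entries left by leading or trailing whitespace
    var txt = getHtmlTxt(url[i]);
    if (txt) {
        var title = /<title[^>]*>([^<]+)<\/title>/i.exec(txt);
        var description = txt.match(/<meta[^>]+name\s*=\s*["']?description["']?[^>]*>/i);
        // Accept a content attribute in either double or single quotes.
        var content = description && /content\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(description[0]);
        // Collapse whitespace so that each URL produces exactly one line per file.
        titlestr += (title ? title[1].replace(/\s+/g, ' ') : 'not found') + '\r\n';
        descriptionstr += (content ? (content[1] || content[2] || '').replace(/\s+/g, ' ') : 'not found') + '\r\n';
    } else {
        titlestr += 'access forbidden\r\n';
        descriptionstr += 'access forbidden\r\n';
    }
}
// CreateTextFile(path, overwrite, unicode): true overwrites an existing file.
// Pass true as a third argument to write UTF-16 if the text has characters outside the system code page.
var titleFile = fso.CreateTextFile('title.txt', true);
titleFile.Write(titlestr);
titleFile.Close();
var descriptionFile = fso.CreateTextFile('description.txt', true);
descriptionFile.Write(descriptionstr);
descriptionFile.Close();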
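Because the two output files are meant to line up with the URL list one line per URL, a title or description that contains a line break would shift every following line. The sketch below shows one possible way to extract both values from an HTML string and flatten them to a single line. The names extractMeta and oneLine are hypothetical helpers for illustration and do not appear in the posted script, and regular expressions are only an approximation of real HTML parsing.

```javascript
// Sketch: pull <title> and the description <meta> content out of an HTML string.
// extractMeta and oneLine are hypothetical helpers, not part of the posted script.
function oneLine(text) {
    // Collapse runs of whitespace (including line breaks) and trim both ends.
    return text.replace(/\s+/g, ' ').replace(/^ | $/g, '');
}

function extractMeta(html) {
    var title = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
    // Only accept a tag whose name attribute is exactly "description".
    var tag = /<meta[^>]+name\s*=\s*["']?description["']?[^>]*>/i.exec(html);
    // The content attribute may use double or single quotes.
    var content = tag && /content\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag[0]);
    return {
        title: (title && oneLine(title[1])) || 'not found',
        description: (content && oneLine(content[1] || content[2] || '')) || 'not found'
    };
}
```

In the loop above, extractMeta(txt).title and extractMeta(txt).description could then be appended to titlestr and descriptionstr, each followed by '\r\n'.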