Title: [Text Processing] (Solved) Extract image URLs from a single-line web page file, without duplicates
Author: web Time: 2018-1-18 16:00
Last edited by web on 2018-1-19 13:57
<p style="padding: 0px; line-height: 1.5; clear: both; color: rgb(51, 51, 51); font-family: "Hiragino Sans GB", Tahoma, Arial, 宋体, sans-serif;"><img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_9480.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_9480.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_5111.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_5111.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_4181.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_4181.jpg" /><br /> <br /> <br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_8536.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_8536.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_2145.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_2145.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123754_4315.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123754_4315.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_5113.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 
790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_5113.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_7621.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_7621.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_2878.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_2878.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_9000.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_9000.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_605.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_605.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123755_8239.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123755_8239.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_5145.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_5145.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_3003.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" 
data-original="/upload/externalpic/1214218/1214218_20170827123756_3003.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_6521.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_6521.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_9915.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_9915.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123756_2703.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123756_2703.jpg" /><br /> <img alt="undefined" src="/upload/externalpic/1214218/1214218_20170827123757_1357.jpg" style="border: none; visibility: visible; vertical-align: bottom; max-width: 790px; zoom: 1;" class="lazy" data-original="/upload/externalpic/1214218/1214218_20170827123757_1357.jpg" /></p>
<p><img src="/upload/files/2017/08/20/1503223004634.jpg" alt="" class="lazy" data-original="/upload/files/2017/08/20/1503223004634.jpg" height="1311" width="740" /><img src="/upload/files/2017/08/20/1503223032210.jpg" alt="" class="lazy" data-original="/upload/files/2017/08/20/1503223032210.jpg" height="9860" width="740" /><img src="/upload/files/2017/08/20/1503223054641.jpg" alt="" class="lazy" data-original="/upload/files/2017/08/20/1503223054641.jpg" height="5919" width="740" /></p>
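Both snippets above are files that contain a single line each; I am looking for one method that works for both.
Goal: extract the image URLs from the web page file, with no duplicates and no surrounding quotes.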
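Author: 523066680 Time: 2018-1-18 19:38
use strict; use warnings;
use Mojo::DOM;
use File::Slurp;

my $html = read_file( "a.htm" );
my $dom = Mojo::DOM->new( $html );

# Print each data-original value once; %seen filters out repeated URLs.
my %seen;
for my $img ( $dom->find("img")->each ) {
    my $url = $img->attr("data-original");
    print "$url\n" if defined $url && !$seen{$url}++;
}
Output: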
/upload/externalpic/1214218/1214218_20170827123754_9480.jpg
/upload/externalpic/1214218/1214218_20170827123754_5111.jpg
/upload/externalpic/1214218/1214218_20170827123754_4181.jpg
/upload/externalpic/1214218/1214218_20170827123754_8536.jpg
/upload/externalpic/1214218/1214218_20170827123754_2145.jpg
/upload/externalpic/1214218/1214218_20170827123754_4315.jpg
/upload/externalpic/1214218/1214218_20170827123755_5113.jpg
/upload/externalpic/1214218/1214218_20170827123755_7621.jpg
/upload/externalpic/1214218/1214218_20170827123755_2878.jpg
/upload/externalpic/1214218/1214218_20170827123755_9000.jpg
/upload/externalpic/1214218/1214218_20170827123755_605.jpg
/upload/externalpic/1214218/1214218_20170827123755_8239.jpg
/upload/externalpic/1214218/1214218_20170827123756_5145.jpg
/upload/externalpic/1214218/1214218_20170827123756_3003.jpg
/upload/externalpic/1214218/1214218_20170827123756_6521.jpg
/upload/externalpic/1214218/1214218_20170827123756_9915.jpg
/upload/externalpic/1214218/1214218_20170827123756_2703.jpg
/upload/externalpic/1214218/1214218_20170827123757_1357.jpg
/upload/files/2017/08/20/1503223004634.jpg
/upload/files/2017/08/20/1503223032210.jpg
/upload/files/2017/08/20/1503223054641.jpg
Author: slore Time: 2018-1-19 10:00
Last edited by slore on 2018-1-19 10:01
extractimg.rb (Ruby):
puts File.read('a.html').scan(/\/upload[^.]+\.jpg/).uniq
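Explanation: read the file, scan it for every match of the regular expression that describes a .jpg path, then call the array's uniq method to drop the repeated matches.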
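To see why the uniq step matters, here is a minimal sketch that runs the same scan on an inline string instead of a.html. The `html` sample is shortened from the markup in the question, and the variable names are made up for illustration. Each path occurs twice in the markup, once in `src` and once in `data-original`, so the raw match list is twice as long as the final result.

```ruby
# Shortened sample from the question: every path appears in both src and data-original.
html = '<p><img src="/upload/files/2017/08/20/1503223004634.jpg" alt="" class="lazy" ' \
       'data-original="/upload/files/2017/08/20/1503223004634.jpg" />' \
       '<img src="/upload/files/2017/08/20/1503223032210.jpg" alt="" class="lazy" ' \
       'data-original="/upload/files/2017/08/20/1503223032210.jpg" /></p>'

# scan returns every match in document order, duplicates included.
all_matches = html.scan(/\/upload[^.]+\.jpg/)

# uniq keeps the first occurrence of each path and preserves the order.
unique_paths = all_matches.uniq

puts "matches before uniq: #{all_matches.size}"
puts unique_paths
```

Note that `[^.]+` stops at the first dot, so this pattern assumes the paths contain no dot before the `.jpg` extension, which holds for the sample data.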
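Author: web Time: 2018-1-19 10:50
Thanks, everyone, for the replies. Is there a way to do this in a batch script, or in a batch script that calls a third-party tool? I do not know how to use the other languages yet. Sorry for the trouble.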
Author: WHY Time: 2018-1-19 13:17
Last edited by WHY on 2018-1-20 19:55
@echo off
rem Print the value of every src attribute in a.html, without the surrounding quotes.
PowerShell -c "[string]$s=type a.html;[regex]::Matches($s,'(?<=src=\")[^^\"]+')|%%{$_.Value}"
pause
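The pattern in this one-liner relies on a lookbehind: the match must be preceded by `src="`, but that prefix is not part of the match, and the character class then consumes everything up to the closing quote. As a rough illustration of that behavior (not a replacement for the batch file), the sketch below applies an equivalent pattern in Ruby to an inline sample. The `html` string, its paths, and the variable names are hypothetical and exist only for this example.

```ruby
# Hypothetical sample: the first image is repeated to show what the lookbehind does not cover.
html = '<img src="/upload/a_1.jpg" class="lazy" data-original="/upload/a_1.jpg" />' \
       '<img src="/upload/a_2.jpg" class="lazy" data-original="/upload/a_2.jpg" />' \
       '<img src="/upload/a_1.jpg" class="lazy" data-original="/upload/a_1.jpg" />'

# (?<=src=") is a lookbehind, so src=" must precede the match but is not part of it.
# [^"]+ then takes everything up to the closing quote, so the result has no quotes.
src_values = html.scan(/(?<=src=")[^"]+/)

# data-original="..." is skipped, but an image that occurs twice is still listed twice,
# so uniq is applied to satisfy the "no duplicates" requirement.
puts src_values.uniq
```

Because the lookbehind already ignores `data-original`, the sample data in the question yields no repeats; the final uniq only matters when the same image is embedded more than once.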
Author: web Time: 2018-1-19 13:52
Reply to 5# WHY
Thanks, that solved it.
I also searched around just now and found this, which comes close:
sed "y/;&/\n\n/" utf.txt | sed -n "/.*src=/ s/.*src=//p">b.txt
Author: WHY Time: 2018-1-20 19:59
Reply to 6# web
If third-party tools are allowed, I recommend grep:
grep -P -o "(?<=src=\")[^^\"]+" a.html
If you insist on sed, something like this might work:
sed -r "s/(src=|[^\"]\.jpg)\"/\1\n/g" a.html | findstr /b /e "\/.*\.jpg"
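The sed variant works in two stages: it first breaks the single line apart by inserting a newline after each `src=` and after each `.jpg` that is followed by a quote, and findstr then keeps only the lines that start with `/` and end with `.jpg`. The sketch below mimics that split-then-filter idea in Ruby on an inline sample, purely to make the two stages visible. The `html` string, its paths, and the variable names are hypothetical.

```ruby
# Hypothetical single-line sample in the same shape as the markup in the question.
html = '<img src="/upload/a_1.jpg" class="lazy" data-original="/upload/a_1.jpg" />' \
       '<img src="/upload/a_2.jpg" class="lazy" data-original="/upload/a_2.jpg" />'

# Stage 1 (the sed step): replace the quote after src= or after .jpg with a newline.
pieces = html.gsub(/(src=|[^"]\.jpg)"/, "\\1\n").split("\n")

# Stage 2 (the findstr /b /e step): keep lines that begin with / and end with .jpg.
paths = pieces.grep(/\A\/.*\.jpg\z/)

puts paths
```

The `data-original` copies are filtered out because their lines still begin with the leftover attribute text rather than with `/`. As with the sed pipeline, nothing here removes an image that is embedded twice; appending `.uniq` would cover that case.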
Welcome to 批处理之家 (http://www.bathome.net/)