Title: [Other] Scraping Supreme People's Court judgment documents with third-party commands
Author: 过气码农现律师    Posted: 2019-2-7 23:18    Title: Scraping Supreme People's Court judgment documents with third-party commands
This post was last edited by pcl_test on 2019-2-8 11:01.
A thank-you post.
Many thanks to this forum. For work reasons I spent the Spring Festival holiday writing a crawler for Supreme People's Court judgment documents in batch, using the sed and iconv tools provided here. All 10,000-plus documents can now be searched and queried locally. It is only a static-page crawler, but it still feels like a real achievement; next I hope to learn how to crawl dynamic pages, haha. Here is the code:

@echo off & setlocal enabledelayedexpansion
rem delayed expansion is required for the !var! references used throughout
rem step 1: fetch the index page, strip whitespace, then read the page count and document count
set url_1=http://www.court.gov.cn/wenshu.html
echo # target = %url_1%
curl %url_1%>temp.tmp 2>nul
iconv -c -f utf-8 -t gbk//ignore temp.tmp >gbk.tmp
sed -r -i "s/[[:space:]]//g" gbk.tmp
sed -n -r "/<liclass=\"last\">.*page/p" gbk.tmp >temp.tmp
sed -r -i "s/.*page.([0-9]+)(\.html)?\".*/\1/" temp.tmp
set /p pages=<temp.tmp
sed -n -r "/共收录<font>([0-9]+)<\/font>份/p" gbk.tmp >temp.tmp
sed -r -i "s/.*共收录<font>([0-9]+)<\/font>份.*/\1/" temp.tmp
set /p articles=<temp.tmp
echo # find !pages! pages , !articles! articles
set /p down_number=# input the number of latest articles to down ^( 1 - !articles! ^) :
rem step 2: walk the paging URLs and collect article links until enough have been gathered
set n=1
:loop_begin
if !n! GTR !pages! goto loop_end
curl %url_1%?page=!n!>temp.tmp 2>nul
iconv -c -f utf-8 -t gbk//ignore temp.tmp >gbk.tmp
sed -r -i "s/[[:space:]]//g" gbk.tmp
sed -n -r -i "/target=\"_blank\"href=\".*html/p" gbk.tmp
sed -r -i "s/.*target=\"_blank\"href=\"(.*html).*/\1/" gbk.tmp
type gbk.tmp>>link.bak
set /a n+=1
set lines=0
for /f %%i in (link.bak) do set /a lines+=1
if !lines! GEQ !down_number! (
    goto loop_end
) else (
    goto loop_begin
)
:loop_end
rem step 3: download each linked article, strip the HTML with sed, and save it as "<case number> <title>.txt"
set start_time=!time!
set n=0
:down_begin
set /a down_number-=1
if !down_number! LSS 0 goto end
set url=""
set /p url=<link.bak
if !url!=="" goto end
curl http://www.court.gov.cn!url! >temp.tmp 2>nul
iconv -c -f utf-8 -t gbk//ignore temp.tmp >gbk.tmp
sed -r -i "s/[[:space:]]//g" gbk.tmp
set t=""
sed -n "/<divclass=\"title\">/p" gbk.tmp>title.tmp
sed -r -i "s/<[^>]*>//g" title.tmp
sed -i "s/:/:/g" title.tmp
sed -i "s/(/(/g" title.tmp
sed -i "s/)/)/g" title.tmp
set /p t=<title.tmp
sed -n -i "/<divclass=\"txt_txt\"id=\"zoom\">/,/\[CDATA\[/p" gbk.tmp
sed -i "s/ //g" gbk.tmp
sed -r -i "s/<[^>]*>/\n/g" gbk.tmp
sed -n -i "1,/^二〇.*年.*月.*日/p" gbk.tmp
sed -r -i "/^$/d" gbk.tmp
sed -i "s/(/(/g" gbk.tmp
sed -i "s/)/)/g" gbk.tmp
sed -n -r "/^([0-9][0-9][0-9][0-9]).*号$/p" gbk.tmp >num.tmp
set file_number=""
set /p file_number=<num.tmp
sed -r -i "s/^(.*)/ \1/" gbk.tmp
ren gbk.tmp "!file_number! !t!".txt 2>nul
set /a n+=1
sed -i "1 d" link.bak
echo # !n! articles down
goto down_begin
:end
del link.bak
del *.tmp
echo # mission start at !start_time! end at !time! , !n! succeed
pause>nul
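Everything downstream hinges on two extractions from the index page: the last page number from the pager link, and the document count from the 共收录<font>N</font>份 text. Before running the whole loop it may be worth sanity-checking just those two patterns; here is a minimal sketch in PowerShell (the language the replies below use), where the two regexes are simply transcribed from the sed commands above and should be treated as assumptions about the current page layout:

# Sanity check of the two index-page extractions used by the batch script above.
# The patterns mirror its sed commands; adjust them if the page layout has changed.
$Web = New-Object System.Net.WebClient
$Web.Encoding = [Text.Encoding]::UTF8
$html = $Web.DownloadString('http://www.court.gov.cn/wenshu.html') -Replace '\s',''
if ($html -match '<liclass="last">.*?page\D(\d+)') { "pages    : $($Matches[1])" }
if ($html -match '共收录<font>(\d+)</font>份')      { "articles : $($Matches[1])" }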
Author: xczxczxcz    Posted: 2019-2-8 16:50
This one caught my interest, so here is a PowerShell version:
# configuration
$title = 'class="title|fl print';
$Web = New-Object System.Net.Webclient;
$Web.Encoding = [Text.Encoding]::UTF8;
$url = 'http://www.court.gov.cn/paper/default/index.html';
$Master = 'http://www.court.gov.cn'
$http = (split-path $url).Replace('\','/');
$save = '.\案件记录.log';   # output log ("case records")
NI $save -type file -force | Out-Null;
#
Function ANJian_ZenLi {
    Param ( [string]$url )
    [Collections.Arraylist] $arr = @();
    [Collections.Arraylist] $array = @();
    $Content = $null;
    $Page = $Web.DownloadString( $url ) -Split "`n";

    $Content = ( $Page | SLS -Pattern $title) -Split "`r`n";
    $array = $Content | %{ $_.Split('"')[-1] -Replace '(</?(li|div))?>','' };
    $array.Add('') | Out-Null;

    $Content = $Page |SLS -Pattern '</div><div style';
    $Content = $Content -Replace ";'>","`r`n@" -Replace '</div>',"`r`n";
    $arr = (($Content -Split "`r`n" | sls -Pattern '^@') -NotMatch '^@<') -Replace '@','';
    $arr = $arr -notmatch '^$'; if ( $arr ) { $arr.Add('') | Out-Null };
    $array = $array + $arr;

    $Content = (( $Page | SLS -Pattern '^\s+.*</div>$') -NotMatch '<div') -Split "`r`n";
    $arr = $Content | %{ ($_ -Replace '^\s+|</div>| ','').Trim() };
    $arr = $arr -notmatch '^$'; if ( $arr ) { $arr.Add('') | Out-Null };
    $array = $array + $arr;

    $Content = ( $Page | SLS -Pattern '<(div|p) style.*</(div|p)>$') -Split "`r`n";
    $arr = $Content | %{ $_.Split('"')[-1] -Replace '(</(p|div))?>| ','' };
    $arr = $arr -notmatch '^$'; if ( $arr ) { $arr.Add('') | Out-Null };
    $array = $array + $arr;
    $array.Add('*************************完成*************************') | Out-Null;
    Return $array;
};
$Page = $Web.DownloadString( $url ) -Split "`n";
$Last = (( $Page | SLS -Pattern '尾页' ) -Split "`r`n" ).Split('"')[-2];
[int]$Last = $Last.Split('/.')[-2];
[Collections.Arraylist] $PageArray = @($url);
(2..$Last) | %{ $PageArray += $http + '/index/page/' + "$_.html" };

For ( $i =0; $i -lt $PageArray.Count; $i++ ) {
    $DictXianQing = @{};
    if ( $i -ge 1 ) { $Page = $Web.DownloadString( $PageArray[$i] ) -Split "`n"; };
    $Content = ($Page | SLS -Pattern 'xiangqing') -Split "`r`n";

    $Content | %{
        $link = $Master + $_.Split('"')[-2];
        $str = $_.Split('"')[-1] -Replace '(</(a|li))?>','';
        $DictXianQing += @{ "$link" = "$str" };
    };

    Foreach ( $k in $DictXianQing.Keys.GetEnumerator() | Sort {[int]($_.Split('-.')[-2])} ) {
        $Receive = ANJian_ZenLi $k;
        $Receive |ac $save -force;
        ''|ac $save -force; ''|ac $save -force;
    };
    '按任意键处理下一页,删除此句会整理所有页面。';pause;   # "press any key for the next page; delete this line to process all pages"
};
'已全部完成 按任意键退出。';pause   # "all done, press any key to exit"
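The extraction trick this script leans on is splitting each interesting HTML line on the double-quote character: the second-to-last field is then the href and the last field is the link text, with a small tag fragment left to strip. A toy illustration (the sample line below is made up; only its shape matches the 'xiangqing' list items the script filters for):

# Hypothetical sample line; only its shape matches the real list items.
$line = '<li><a target="_blank" href="/zixun-xiangqing-12345.html">某某案判决书</a></li>'
$line.Split('"')[-2]                               # -> /zixun-xiangqing-12345.html
$line.Split('"')[-1] -Replace '(</(a|li))?>',''    # -> 某某案判决书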
Author: 过气码农现律师    Posted: 2019-2-9 21:40
Reply to 2# xczxczxcz
What language is this? It looks quite powerful.
Author: 523066680    Posted: 2019-2-10 11:29
use Encode;
use Modern::Perl;
use File::Slurp;
use Mojo::UserAgent;
use File::Basename qw/basename/;
use File::Path qw/mkpath/;
STDOUT->autoflush(1);

our $ua = Mojo::UserAgent->new();
our $main = "http://www.court.gov.cn";
our $wdir = "F:/temp/gov_wenshu";
mkpath $wdir unless -e $wdir;

# get the pager's URL prefix and the last page number
my ($prefix, $maxpg) = get_max_pgcode( $main ."/wenshu.html" );
for my $id ( 1 .. $maxpg ) {
    printf "${main}${prefix}$id.html\n";
    get_article( "${main}${prefix}$id.html" );
}

sub get_article
{
    our ($main, $wdir);
    my ( $link ) = @_;
    my $res;
    my $fpath;
    my $dom = $ua->get( $link )->result->dom;
    for my $e ( $dom->find(".list .l li a")->each )
    {
        printf "%s\n", basename($e->attr("href"));
        $fpath = $wdir ."/". basename($e->attr("href"));
        next if ( -e $fpath );
        $res = $ua->get( $main . $e->attr("href") )->result;
        write_file( $fpath, $res->body );
    }
}

sub get_max_pgcode
{
    my ( $link ) = @_;
    my $res = $ua->get( $link )->result;
    my $href = $res->dom->at(".yiiPager .last a")->attr("href");
    if ($href =~ /^(.*\/)(\d+)\.html/) { return ($1, $2); }
    else { printf "Failed to get max page code\n"; return undef }
}
Author: xczxczxcz    Posted: 2019-2-10 14:31
Bored, so here is a just-for-fun version: fully online, everything kept in memory instead of downloaded to disk, and you can go back a level. Nothing specially optimized, just added features, borrowed from the Windows 10 way of doing things; this kind of material I'd rather not save to disk.
Press <Enter> twice to show the content of the selected item; press <Enter> + <BackSpace> to go back one level. No window buttons here.
# configuration
$title = 'class="title|fl print';
$Web = New-Object System.Net.Webclient;
$Web.Encoding = [Text.Encoding]::UTF8;
$url = 'http://www.court.gov.cn/paper/default/index.html';
$Master = 'http://www.court.gov.cn'
$http = (split-path $url).Replace('\','/');
#
Function ANJian_ZenLi {
    Param ( [string]$url )
    [Collections.Arraylist] $arr = @();
    [Collections.Arraylist] $array = @();
    $Content = $null;
    $Page = $Web.DownloadString( $url ) -Split "`n";

    $array = $Page -Match $title | %{ $_.Split('"')[-1] -Replace '(</?(li|div))?>','' };
    $array.Add('') | Out-Null;

    $Content = $Page -Match '</div><div style' -Replace ";'>","`r`n@" -Replace '</div>',"`r`n";
    $arr = ($Content -Split "`n" -Match '^@') -NotMatch '^@<' -Replace '^@','' `
        -Replace '×','X' -Replace '“','“' -Replace '”','”' -Replace '…','…';
    $arr = $arr -notmatch '^$'; if ( $arr ) { $arr.Add('') | Out-Null };
    $array = $array + $arr;

    $Content = ( $Page -Match '^\s+.*</div>$') -NotMatch '<div';
    $arr = $Content -Replace '^\s+|</div>| ','' -Replace '×','X' `
        -Replace '“','“' -Replace '”','”' -Replace '…','…';
    $arr = $arr -notmatch '^$'; if ( $arr ) { $arr.Add('') | Out-Null };
    $array = $array + $arr;

    $arr = $Page -Match '<(div|p) style.*</(div|p)>$' | %{ $_.Split('"')[-1] -Replace '(</(p|div))?>| ','' };
    $arr = $arr -notmatch '^$'; if ( $arr ) { $arr.Add('') | Out-Null };
    $array = $array + $arr;
    $array.Add('*************************完成*************************') | Out-Null;
    '';$array;
};
Function InputLine {
    $num = Read-Host -Prompt "输入内容序号,1 - $n ;按<Enter>+<退格>返回上一层 按2次<Enter>显示内容";
    if ( [Console]::ReadKey($true).Key -eq 'BackSpace' ) { cls;InputPage };
    $num = $num -as [int32];
    if ( !$num -or ($num -lt 1 -or $num -gt $n) ) { CLS;'超出范围,重新输入'; InputLine };
    $Url = $XianQingArray[$num-1][0];cls;
    ANJian_ZenLi $Url;pause;cls;
    Write-Host " 第 $Number 页:如下" -fore Green;
    For ( $k =0; $k -lt $XianQingArray.Count; $k++ ) {
        ''; Write-Host " <$($XianQingArray[$k][1])> " -Fore red -NoNewLine;
        Write-Host $($XianQingArray[$k][2]) -Fore DarkYellow; '';
    };
    InputLine;
};
Function InputPage {
    $Specify = Read-Host -Prompt "输入某一个页面,1 - $Last ";
    [Collections.Arraylist] $XianQingArray = @();
    $Number = $Specify -as [int32];

    if ( !$Number -or ($Number -lt 1 -or $Number -gt $Last) ) { CLS;'超出范围,重新输入'; InputPage };
    cls; $Page = $Web.DownloadString( $PageArray[$Specify-1] ) -Split "`n";
    $Content = $Page -Match 'xiangqing';

    $n = 0;
    $Content | %{ $n++
        $link = $Master + $_.Split('"')[-2];
        $str = $_.Split('"')[-1] -Replace '(</(a|li))?>','';
        $XianQingArray += ,($link,$n,$str);
    };
    cls; Write-Host " 第 $Number 页:如下" -fore Green;
    For ( $k =0; $k -lt $XianQingArray.Count; $k++ ) {
        ''; Write-Host " <$($XianQingArray[$k][1])> " -Fore red -NoNewLine;
        Write-Host $($XianQingArray[$k][2]) -Fore DarkYellow; '';
    };
    InputLine;
};

$Page = $Web.DownloadString( $url ) -Split "`n";
[int]$Last = ($Page -Match '尾页' -Replace '\D','') -join '';
[Collections.Arraylist] $PageArray = @($url);
(2..$Last) | %{ $PageArray += $http + '/index/page/' + "$_.html" };

InputPage;
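The <Enter> + <BackSpace> navigation described above comes down to a single call: after Read-Host returns on the first <Enter>, the script peeks at the next keypress without echoing it. A stripped-down sketch of just that part:

# Read a value, then peek at the next key without echoing it;
# BackSpace means "go back", anything else falls through to showing the item.
$num = Read-Host -Prompt 'item number'
if ([Console]::ReadKey($true).Key -eq 'BackSpace') { 'go back one level' } else { "show item $num" }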