
[Text Processing] How can a batch script extract specific strings from txt files?

Last edited by batbat001 on 2023-2-8 22:43

{"[{"Title":"敏儿演剧史:剧本类","Summary":"上海 : 商务印书馆, **二十二年十月[1933.10]印行 : 王云五发行","PageSum":"101页","Id":"4b1e7731b001378fe85e3470b23e19ab","Pid":"f721283beed81168a7f56770c2e37f42"},{""Title":"上海新学会社图书目录","Summary":"上海 : 上海新学会社, [1936]","PageSum":"1册","Id":"18046902990da9f8c0e3b7081c62b7d9","Pid":"e24b0fcfdd69228163ee2a40179b3a10"},{""Title":"京师译学馆生理卫生学讲义","Summary":"[出版地不详] : [出版者不详], [19??]","PageSum":"1册","Id":"56373fe68f3834c486e0edfd3682ed96","Pid":"81d7172b2f060a842cc78a0486c16647"},{"Title":"中国中部奥陶纪头足类化石","Summary":"[北京] : 农矿部地质调查所 : 国立北平研究院地学研究, **十九年七月[1930.7]印行","PageSum":"18,101页","Id":"0b6ea856ab6a52c74415ced060d1028c","Pid":"7c902869e5461cc9bdb309aa3a0fd9a7"}

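I have a number of txt files, all UTF-8 encoded. Each one looks roughly like the sample above: catalog records for a set of books, with every record run together on a single line.
What I need: extract each book's information into an Excel spreadsheet, namely the contents of the "Title", "Summary", "PageSum", "Id", and "Pid" fields, one record per row.

Alternatively, could the records that are run together on one line in each txt file be split apart, so that each book's information sits on its own line? Then I can import it into Excel by hand.

Hoping an expert can lend a hand. Thanks!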

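I was going to extract this with JS, but after looking at the sample I can't tell whether it was pasted incompletely or something else went wrong: the first three characters {"[ and the doubled quotation mark in ""Title" will cause errors when the data is read.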
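
To illustrate the point above, here is a minimal Python sketch (Python is not used anywhere in this thread; it is only a neutral way to show the idea). The `fragment` string is a shortened, made-up stand-in that keeps the two defects mentioned, and `is_valid_json` and `PAIR` are hypothetical names. It suggests that a strict JSON parser rejects such text, while a regular expression over "key":"value" pairs does not depend on the text being well formed.

```python
import json
import re

# Shortened stand-in for the pasted fragment, keeping its two defects:
# the stray leading {"[ and the doubled quotation mark before "Title".
fragment = '{"[{"Title":"A","PageSum":"1册"},{""Title":"B","PageSum":"2页"}'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Pull out "key":"value" pairs without requiring the whole text to be valid JSON.
# Assumption: values never contain escaped quotation marks.
PAIR = re.compile(r'"(Title|PageSum)":"([^"]*)"')

print(is_valid_json(fragment))   # strict parsing rejects the fragment
print(PAIR.findall(fragment))    # the regex still finds every pair
```

This is why the later replies that target the cut-down sample rely on pattern matching rather than on a JSON parser.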

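Reply to 2# terse

Uploading the attachment kept failing, so I cut out a portion and pasted it here; I probably didn't cut a complete piece.
I tried using grep, but I just can't get it to work. :'(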

Last edited by hfxiang on 2023-2-8 20:18

    BEGIN {
        # Adjacent records are separated by },{ so use that as the record separator
        RS = "},{"
        print "\"Title\",\"Summary\",\"PageSum\",\"Id\",\"Pid\""
    }
    {
        # gensub() is a gawk extension; each call keeps only the captured field value
        Title = gensub(/^.*"Title":"([^"]+)",".*$/, "\"\\1\"", "g", $0)
        Summary = gensub(/^.*"Summary":"([^"]+)",".*$/, "\"\\1\"", "g", $0)
        PageSum = gensub(/^.*"PageSum":"([^"]+)",".*$/, "\"\\1\"", "g", $0)
        Id = gensub(/^.*"Id":"([^"]+)",".*$/, "\"\\1\"", "g", $0)
        Pid = gensub(/^.*"Pid":"([^"]+)".*$/, "\"\\1\"", "g", $0)
        print Title "," Summary "," PageSum "," Id "," Pid
    }
Save this as a.awk.
Download gawk ( http://bcn.bathome.net/tool/4.1.0/gawk.exe ) and run:
    gawk -f .\a.awk 输入文本.txt > 输出文本.txt
Result:
    "Title","Summary","PageSum","Id","Pid"
    "敏儿演剧史:剧本类","上海 : 商务印书馆, **二十二年十月[1933.10]印行 : 王云五发行","101页","4b1e7731b001378fe85e3470b23e19ab","f721283beed81168a7f56770c2e37f42"
    "上海新学会社图书目录","上海 : 上海新学会社, [1936]","1册","18046902990da9f8c0e3b7081c62b7d9","e24b0fcfdd69228163ee2a40179b3a10"
    "京师译学馆生理卫生学讲义","[出版地不详] : [出版者不详], [19??]","1册","56373fe68f3834c486e0edfd3682ed96","81d7172b2f060a842cc78a0486c16647"
    "中国中部奥陶纪头足类化石","[北京] : 农矿部地质调查所 : 国立北平研究院地学研究, **十九年七月[1930.7]印行","18,101页","0b6ea856ab6a52c74415ced060d1028c","7c902869e5461cc9bdb309aa3a0fd9a7"
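
For readers who want to see the same record-splitting idea outside awk, here is a rough Python sketch. It is not hfxiang's code, only an illustration under stated assumptions: `extract_rows` and `to_csv` are hypothetical helper names, the input is the cut-down format from the first post (records separated by },{ and field values free of quotation marks), and a missing field is written as an empty cell.

```python
import csv
import io
import re

FIELDS = ["Title", "Summary", "PageSum", "Id", "Pid"]


def extract_rows(text):
    """Split the text on '},{' and pull the five fields out of each record."""
    rows = []
    for record in text.split("},{"):
        row = []
        for name in FIELDS:
            # Assumption: values contain no quotation marks.
            match = re.search('"%s":"([^"]*)"' % name, record)
            row.append(match.group(1) if match else "")
        rows.append(row)
    return rows


def to_csv(rows):
    """Render a header plus the rows as CSV text with every cell quoted."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return out.getvalue()


# Two records copied from the sample in the first post, defects included.
sample = ('{"[{"Title":"上海新学会社图书目录","Summary":"上海 : 上海新学会社, [1936]",'
          '"PageSum":"1册","Id":"18046902990da9f8c0e3b7081c62b7d9",'
          '"Pid":"e24b0fcfdd69228163ee2a40179b3a10"},'
          '{""Title":"京师译学馆生理卫生学讲义","Summary":"[出版地不详] : [出版者不详], [19??]",'
          '"PageSum":"1册","Id":"56373fe68f3834c486e0edfd3682ed96",'
          '"Pid":"81d7172b2f060a842cc78a0486c16647"}')

print(to_csv(extract_rows(sample)), end="")
```

When saving the text to a file for Excel, opening the file with encoding "utf-8-sig" (UTF-8 with a byte order mark) is a common way to help Excel detect the encoding.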

{"categoryId":null,"totalCount":380,"pageCount":38,"pageSize":10,"documents":[{"Title":"理学的疗法","Title_F":"理学的疗法","TI":["理学的疗法"],"TitleInitial":"L","CLC":["R454"],"Language":"汉语","Summary":"重庆 : 商务印书馆, **三十四年一月[1945.1]出版兼印行 : 王云五发行","Summary_F":"重庆 : 商务印书馆, **三十四年一月[1945.1]出版兼印行 : 王云五发行","LiteratureCategory":10,"WorkSubject":["物理疗法"],"PageSum":"46页","Dimensions":["18×13cm"],"ProvisionActivityDate":["1945"],"ContributionRole":["著"],"Contribution":["刘雄"],"AU":["刘雄"],"PublicationOrganization":["王云五"],"ProvisionActivityPlace":["重庆"],"ProvisionActivityOrganization":["商务印书馆"],"_version_":1738744139389861888,"showMoreEdition":true,"showEntity":false,"entity":[{"OcrTag":true,"Id":"24175e103f83cd8526d3b1c805eeaf98","Pid":"a7769ea13a60e20e94ce1df28f5bb0c1","LiteratureCategory":10,"_version_":1738744221921181696}],"Abstract":null,"LiteratureCategoryPieceTypeId":101,"viewPermission":true,"hasEntites":true},{"Title":"**哲学 三 晚明诸儒之学术及其精神","Title_F":"**哲学 三 晚明诸儒之学术及其精神","TI":["**哲学 三 晚明诸儒之学术及其精神"],"TitleInitial":"G","CLC":["B248"],"Language":"汉语","Summary":"[出版地不详] : 中央训练团党政高级训练班, [19??]印","Summary_F":"[出版地不详] : 中央训练团党政高级训练班, [19??]印","LiteratureCategory":10,"WorkSubject":["哲学","中国","明代"],"PageSum":"12页","Dimensions":["18×13cm"],"ProvisionActivityDate":["19??"],"ContributionRole":["讲"],"Contribution":["钱穆"],"AU":["钱穆"],"ProvisionActivityPlace":["出版地不详"],"ProvisionActivityOrganization":["中央训练团党政高级训练班"],"_version_":1739945908735311872,"showMoreEdition":false,"showEntity":false,"entity":[{"OcrTag":true,"Id":"31ae82462ecddebba36cca9c355da8db","Pid":"ae257d2a67fd6bf7f1bab6ae092d73f3","LiteratureCategory":10,"_version_":1739945929203515397}],"Abstract":null,"LiteratureCategoryPieceTypeId":101,"viewPermission":true,"hasEntites":true}]}

Reply to 4# hfxiang

Got it. I'll give it a try. Thanks a lot! :)

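Reply to 6# batbat001

If gawk 4.1 (whose UTF-8 support is unreliable) does not produce the correct result, try uawk instead (it handles UTF-8 better).
Download: https://www.aliyundrive.com/s/rHoCCwmAww5 (note: download all of the files)
Extraction password: 1234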

Reply to 7# hfxiang

gawk doesn't seem to produce the correct result. I'll try this one next.

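Reply to 1# batbat001

If you need to upload a file, you can use Aliyun Drive or Baidu Netdisk.

If you need to upload a screenshot, you can use an image host, for example:
http://bbs.bathome.net/thread-60985-1-1.html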

Reply to 7# hfxiang

Excellent! It runs smoothly! Thank you very much!

Reply to 9# Batcher

Thanks for the reminder, moderator! It's great to have you all on the forum!

Last edited by WHY on 2023-2-9 10:58

Revised the regular expression pattern to drop the alternation branches.
    <# :
    @echo off
    PowerShell -C ". ([ScriptBlock]::Create((gc -Literal '%~f0') -Join \"`r`n\")) '%~dp0'"
    pause & exit
    #>
    # Everything below runs as PowerShell; $path is the folder containing this script.
    param($path);
    # Capture "Key":"value" pairs whose key starts with T, S, P or I.
    $reg = '"([TSPI][a-zA-Z]+)":("[^"]*")';
    $dic = New-Object 'System.Collections.Generic.Dictionary[string, [Collections.ArrayList]]';
    $Hash = @{Title=$true; Summary=$true; PageSum=$true; Id=$true; Pid=$true}
    $count = 0;
    forEach( $file In (dir -Literal $path -Filter *.txt) ){
        $str = [IO.File]::ReadAllText($file.FullName, [Text.Encoding]::UTF8);
        forEach( $m In [regex]::Matches($str, $reg) ){
            $key = $m.Groups[1].Value;
            # Keep only the five wanted fields, with one value list per field.
            If( $Hash.ContainsKey($key) ){
                If( !$dic.ContainsKey($key) ){
                    $dic[$key] = New-Object Collections.ArrayList;
                }
                [void]$dic[$key].Add($m.Groups[2].Value);
                $count = $dic[$key].Count;
            }
        }
    }
    $out = [Collections.ArrayList]@();
    [void]$out.Add('"' + ($dic.Keys -join '","') + '"');
    for( $i=0; $i -lt $count; $i++ ){
        # Join the i-th value of every field; -join leaves no trailing comma.
        $row = forEach( $key In $dic.Keys ){ $dic[$key][$i] };
        [void]$out.Add($row -join ',');
    }
    [IO.File]::WriteAllLines('result.csv', $out, [Text.Encoding]::UTF8);

    @if (0) == (0) echo off
    dir /b/a-d *.txt|cscript -nologo -e:jscript "%~0"
    pause & exit
    @end
    // Read a whole UTF-8 text file through ADODB.Stream.
    function adoText(path) {
        var stream = new ActiveXObject("ADODB.Stream");
        stream.type = 2;                    // 2 = text mode
        stream.charset = 'utf-8';
        stream.open();
        stream.loadFromFile(path);
        var result = stream.readText(-1);   // -1 = read to the end
        stream.close();
        return result;
    }
    // Wrap a value in quotation marks, doubling any quotation marks inside it.
    function q(value) {
        return '"' + String(value).replace(/"/g, '""') + '"';
    }
    var fso = new ActiveXObject("Scripting.FileSystemObject");
    var new_file = 'new_result.csv';
    var f = fso.CreateTextFile(new_file, true);
    f.WriteLine([q("Title"), q("Summary"), q("PageSum"), q("Id"), q("Pid")].join(','));
    // File names arrive on standard input, one per line, from the dir command above.
    while (!WScript.StdIn.AtEndOfStream) {
        var file = WScript.StdIn.ReadLine();
        var content = adoText(file);
        // Evaluate the JSON text as an object literal (only safe for trusted input).
        var obj = new Function("return (" + content + ")")();
        var arr = obj.documents;
        for (var i = 0, len = arr.length; i < len; i++) {
            var e = arr[i].entity[0];
            f.WriteLine([q(arr[i].Title), q(arr[i].Summary), q(arr[i].PageSum), q(e.Id), q(e.Pid)].join(','));
        }
    }
    f.Close();

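PowerShell version, regex extraction:
    $result = @('"Title","Summary","PageSum","Id","Pid"');
    # Number of columns, taken from the header line.
    $count = ($result -split(',')).Length
    # Running match counter, held in an object so the Group script block can update it.
    $num = [pscustomobject] @{ Value = 0 }
    # Match a quoted value that directly follows one of the five wanted keys.
    $reg = '(?<="(Title|Summary|PageSum|(?:I|Pi)d)":)"([^"]*)"';
    $group= (select-string *.txt -Pattern $reg -AllMatches -Encoding utf8 ).Matches.Value;
    # Every $count consecutive matches form one CSV row.
    $groups = $($group | Group { [math]::Floor($num.Value++ /$count)})
    $result += $groups.ForEach({$_.Group -join(',')})
    [IO.File]::WriteAllLines('new_result.csv', $result, [Text.Encoding]::UTF8);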
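
The grouping step in the script above (number each match, then group by the floor of index divided by column count) can be pictured with a small Python sketch. This is only an illustration, not part of the posted solution: `chunk` is a hypothetical helper and the values in `matches` are made-up placeholders, not data from the thread.

```python
def chunk(values, size):
    """Group a flat list into consecutive rows of `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# Placeholder matches in document order: five fields for each of two records.
matches = ['"T1"', '"S1"', '"P1"', '"I1"', '"D1"',
           '"T2"', '"S2"', '"P2"', '"I2"', '"D2"']

rows = [",".join(group) for group in chunk(matches, 5)]
for row in rows:
    print(row)
```

The approach only lines up if every record contains all five fields in the same order; a record with a missing field would shift every later row.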

This handles only the standard JSON format shown in post #5:
    @echo off
    echo;"Title","Summary","PageSum","Id","Pid">输出.csv
    PowerShell "foreach($d in (gc 输入.txt -Raw -Enc UTF8 | ConvertFrom-Json).documents){'\"'+(@($d.Title,$d.Summary,$d.PageSum,$d.entity[0].Id,$d.entity[0].Pid) -join '\",\"')+'\"'}">>输出.csv
    pause
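
As a cross-check of the parse-then-select idea for the well-formed JSON, here is a hedged Python sketch. `rows_from_json` is a hypothetical name, and the sample is one record from post #5 trimmed down to the keys that matter. It assumes, as that data suggests, that each entry in "documents" carries Title, Summary and PageSum at the top level and Id and Pid inside the first element of "entity".

```python
import json


def rows_from_json(text):
    """Return [Title, Summary, PageSum, Id, Pid] for every entry in "documents"."""
    data = json.loads(text)
    rows = []
    for doc in data["documents"]:
        entity = doc["entity"][0]   # Id and Pid live in the first entity
        rows.append([doc["Title"], doc["Summary"], doc["PageSum"],
                     entity["Id"], entity["Pid"]])
    return rows


# One record from post #5, trimmed to the keys used here.
sample = ('{"documents":[{"Title":"理学的疗法",'
          '"Summary":"重庆 : 商务印书馆, **三十四年一月[1945.1]出版兼印行 : 王云五发行",'
          '"PageSum":"46页",'
          '"entity":[{"OcrTag":true,"Id":"24175e103f83cd8526d3b1c805eeaf98",'
          '"Pid":"a7769ea13a60e20e94ce1df28f5bb0c1"}]}]}')

for row in rows_from_json(sample):
    print(row)
```

Parsing the structure avoids the false matches that a regular expression has to filter out, such as the "Title_F" and "TitleInitial" keys, but it works only when the file really is valid JSON.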
