How can a batch script quickly remove redundant duplicate lines from a large text file (keeping only one copy of each duplicated line)? - BAT Help & Discussion - 批处理之家


Rank: 5

Posts: 206
Points: 963
Tech: 16
Donations: 0
Registered: 2008-3-9

#1

Posted on 2011-7-26 12:07

Last edited by asnahu on 2011-7-26 12:09

Replace Pioneer is built on Perl, so wouldn't it be better to use Perl directly? Alternatively, awk can handle this:

gawk "!a[$0]++" FILE

awk '!a[$0]++'
This one-liner is very idiomatic. It records each line it sees in the associative array “a” (arrays are always associative in Awk) and, in the same expression, tests whether it has seen the line before. If it has, then a[line] > 0 and !a[line] == 0. A pattern that evaluates to false does nothing, and a pattern that evaluates to true with no action behaves like “{ print }”.

For example, suppose the input is:
foo
bar
foo
baz
When Awk sees the first “foo”, it evaluates the expression “!a["foo"]++”. “a["foo"]” has not been set yet, so it is false and “!a["foo"]” is true. The post-increment operator “++” then raises “a["foo"]” to one as part of the same expression, and because the test was true, Awk prints “foo”. Array “a” now contains one value, “a["foo"] == 1”.

Next Awk sees “bar” and does exactly what it did for “foo”: it prints “bar”. Array “a” now contains two values, “a["foo"] == 1” and “a["bar"] == 1”.

Now Awk sees the second “foo”. This time “a["foo"]” is true, “!a["foo"]” is false, and Awk does not print anything. Array “a” still contains two values, “a["foo"] == 2” and “a["bar"] == 1”.

Finally Awk sees “baz” and prints it because “!a["baz"]” is true. Array “a” now contains three values: “a["foo"] == 2”, “a["bar"] == 1” and “a["baz"] == 1”.

The output:
foo
bar
baz
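The walkthrough above can also be spelled out in long form. What follows is only a minimal sketch, assuming a POSIX shell with some awk on the PATH; the function name `dedupe_verbose` and the array name `seen` are illustrative and do not come from the post. It should behave like `!a[$0]++`, but with the test and the increment written as separate steps.

```shell
# Long-form equivalent of awk '!a[$0]++' (illustrative sketch).
# dedupe_verbose and seen are hypothetical names chosen for this example.
dedupe_verbose() {
    awk '{
        if (seen[$0] == 0) {   # an unset entry compares equal to 0: first occurrence
            print              # print the line once
        }
        seen[$0]++             # count every occurrence, printed or not
    }' "$@"
}

# Feed it the sample input from the walkthrough.
printf 'foo\nbar\nfoo\nbaz\n' | dedupe_verbose
```

With no arguments the function reads standard input; given file names, it passes them straight to awk.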
Here is another one-liner that does the same thing. Eric, in his one-liners, says it’s the most efficient way to do it.
awk '!($0 in a) { a[$0]; print }'
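It’s basically the same as the previous one, except that it uses the ‘in’ operator. Given an array “a”, the expression “foo in a” tests whether the value of “foo” is an index of “a”. Unlike a reference such as “a[foo]”, the test itself does not create an element.

Note that the bare statement “a[$0]” creates an element in the array just by referencing it; no assignment is needed.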
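The difference between testing with ‘in’ and referencing an element can be checked directly. This is a small sketch, again assuming a POSIX shell and any common awk; `in_vs_reference` and `count_keys` are hypothetical helper names introduced only for this example.

```shell
# Sketch: an `in` test does not create an element, a bare reference does.
# in_vs_reference and count_keys are hypothetical names for illustration.
in_vs_reference() {
    awk '
    # Count the indices of an array with a for-in loop (k and n are locals).
    function count_keys(arr,    k, n) { n = 0; for (k in arr) n++; return n }
    BEGIN {
        if ("foo" in a) print "unexpected"      # membership test only
        print "after in-test: " count_keys(a)
        a["foo"]                                # bare reference creates a["foo"]
        print "after reference: " count_keys(a)
        if ("foo" in a) print "foo is now a key"
    }'
}

in_vs_reference
```

The key count goes from 0 to 1 only after the bare reference, which is why `!($0 in a) { a[$0]; print }` needs the explicit `a[$0]` to remember a line.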

It can be downloaded here: http://unxutils.sourceforge.net/

#2

Posted on 2011-7-26 12:18

gawk has to be installed on the system. The explanation is right there below the command; which part is unclear?

#3

Posted on 2011-7-26 13:19

gawk "!a[$0]++"<a.txt>b.txt

See whether this works.
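For readers on a Unix-like shell rather than cmd.exe, the same redirect form needs different quoting. The sketch below assumes a POSIX shell; `dedupe_to` and the demo file names are hypothetical. Single quotes are used because a POSIX shell would otherwise expand `$0` inside double quotes, whereas cmd.exe, as in the command above, needs double quotes because it does not treat single quotes as quoting characters.

```shell
# Sketch: the redirect form from the post, written for a POSIX shell.
# dedupe_to and the demo paths are hypothetical names for this example.
dedupe_to() {
    # $1 = input file, $2 = output file; the input file is left untouched
    awk '!a[$0]++' < "$1" > "$2"
}

demo_dir="${TMPDIR:-/tmp}/dedupe_demo.$$"
mkdir -p "$demo_dir"
printf 'foo\nbar\nfoo\nbaz\n' > "$demo_dir/a.txt"
dedupe_to "$demo_dir/a.txt" "$demo_dir/b.txt"
cat "$demo_dir/b.txt"
rm -rf "$demo_dir"
```

Writing to a second file instead of overwriting the input keeps the original available for comparison.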

#4

Posted on 2011-7-26 17:22

awk is already very fast at processing lines.

#5

Posted on 2011-7-27 20:13

Reply to #12 cm535

Doesn't my answer in #2 already give a more efficient method? Please read it carefully.

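#6

Posted on 2011-7-27 20:22

Last edited by asnahu on 2011-7-27 20:26

I have no way to test here how long it actually takes to process a file that large. Keep in mind that the machine's hardware matters a great deal.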
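One way to get a rough number on a given machine is to time the one-liner against synthetic data. This is only a sketch under stated assumptions: a POSIX shell, an arbitrary file of 200000 lines with 1000 distinct values, and coarse one-second resolution from `date +%s`; `make_test_file` and the file name are hypothetical.

```shell
# Sketch: rough timing on synthetic data. Sizes and names are arbitrary assumptions.
make_test_file() {
    # $1 = total lines, $2 = number of distinct lines, $3 = output file
    awk -v total="$1" -v distinct="$2" 'BEGIN {
        for (i = 0; i < total; i++) print "line " (i % distinct)
    }' > "$3"
}

bench_file="${TMPDIR:-/tmp}/dedupe_bench.$$"
make_test_file 200000 1000 "$bench_file"

start=$(date +%s)
unique=$(awk '!($0 in a) { a[$0]; print }' "$bench_file" | wc -l | tr -d ' ')
end=$(date +%s)

echo "unique lines: $unique"
echo "elapsed seconds (coarse): $((end - start))"
rm -f "$bench_file"
```

Because awk keeps every distinct line in memory, memory use grows with the number of unique lines, so results on real data may differ considerably from this synthetic case.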