This post was last edited by batsealine on 2016-4-4 10:40
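First, an example of what the script needs to do:
There are two files, jd.txt and stem.txt. jd.txt contains "苏 eyo" and stem.txt contains "苏 ey"; the goal is to generate a new file containing "苏 eyo ey". More concretely:
jd.txt looks like this: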
木桨 fatr
本应 fatv
枰 faua
苏 eyo
stem.txt looks like this (every entry is a single character):
苏 ey
枰 fa
The generated new_jd.txt should be:
木桨 fatr
本应 fatv
枰 faua fa
苏 eyo ey
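The entries in new_jd.txt must stay in the same order as in jd.txt. In effect, the script only appends a column of word-formation codes after some of the entries in jd.txt.
File downloads (right-click, Save As): jd.txt stem.txt
I have already implemented this with a for loop. The two files have about 180,000 and 30,000 lines respectively, and a run takes around 15 minutes. I am looking for a better approach; if anyone knows grep or awk, pointers are welcome too.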
Here is my code:

    import re
    import os
    import sys
    import codecs

    # Input file is given on the command line; output is new_<name>.txt
    filepath = sys.argv[1]
    filename = re.search(r'(?:.*/)?(.*)\.', filepath).group(1)
    newFilename = "new_" + filename + ".txt"
    f = codecs.open(filepath, "r", 'utf-8')
    lines = f.readlines()
    f.close()

    f_stem = codecs.open(r'stem.txt', "r", 'utf-8')
    stemlines = f_stem.readlines()
    f_stem.close()

    newf = codecs.open(newFilename, "w", 'utf-8')

    for i,line in enumerate(lines):

        words = line.split()
        myre = words[0]
        # Only single-character entries can have a word-formation code
        if (len(myre) == 1):
            # Scan stem.txt from the top for every such entry (the slow part)
            for stemline in stemlines:
                if (re.match(myre, stemline)):
                    line = line.strip('\r\n') + "\t" + stemline.split()[1] + '\n'
                    break
        newf.write(line)
        # Progress indicator
        if (i % 100 == 0):
            print(i)

    newf.close()
############################## Divider
After switching to the OrderedDict the moderator suggested, the program runs in just 1.1 s. I am very happy with that.

    import re
    import os
    import sys
    import codecs
    from time import time
    from collections import OrderedDict


    # Input file is given on the command line; output is new_<basename>
    filepath = sys.argv[1]
    filename = os.path.basename(filepath)
    newFilename = "new_" + filename
    f = codecs.open(filepath, "r", 'utf-8')
    lines = f.readlines()
    f.close()

    f_stem = codecs.open(r'stem.txt', "r", 'utf-8')
    stemlines = f_stem.readlines()
    f_stem.close()

    newf = codecs.open(newFilename, "w", 'utf-8')


    # Build the lookup table once: character -> word-formation code
    stemwords = []
    for stemline in stemlines:
        stemline_split = stemline.split()
        stemwords.append((stemline_split[0], stemline_split[1]))


    stem = OrderedDict(stemwords)


    for i,line in enumerate(lines):

        word = line.split()[0]
        # One dictionary lookup per entry instead of a scan over stem.txt
        if (len(word) == 1 and stem.get(word)):

            line = line.strip('\r\n') + '\t' + stem[word] + '\n'
        newf.write(line)

    newf.close()
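The speedup most likely comes from replacing the scan over stem.txt (one regex match per stem line, for every single-character entry) with a single hash lookup per entry. The ordering of the dictionary should not matter here, because the output order comes from iterating over the jd.txt lines, so a plain dict ought to behave the same. Below is a minimal sketch under that assumption, run on the sample data from the top of the post. The function name `add_stem_codes` and the in-memory sample lists are hypothetical and not part of the original script.

```python
def add_stem_codes(jd_lines, stem_lines):
    """Append the code from stem_lines to matching single-character entries.

    The order of jd_lines is preserved; lines without a match pass through
    unchanged.
    """
    # Build the lookup table once: character -> word-formation code.
    stem = {}
    for stem_line in stem_lines:
        parts = stem_line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            stem[parts[0]] = parts[1]

    result = []
    for line in jd_lines:
        parts = line.split()
        # Only single-character entries can have a code in stem.txt.
        if parts and len(parts[0]) == 1 and parts[0] in stem:
            line = line.rstrip('\r\n') + '\t' + stem[parts[0]] + '\n'
        result.append(line)
    return result


# Sample data copied from the example at the top of the post.
jd_sample = ["木桨 fatr\n", "本应 fatv\n", "枰 faua\n", "苏 eyo\n"]
stem_sample = ["苏 ey\n", "枰 fa\n"]

new_lines = add_stem_codes(jd_sample, stem_sample)
print("".join(new_lines), end="")
```

As in the scripts above, the appended column is separated by a tab. The `in stem` test is used instead of `stem.get(word)` so that a key is matched even if its stored value happened to be empty.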