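Title: wincp.exe, a Windows code page conversion tool
Author: happy886rr Time: 2017-6-1 12:44 Title: wincp.exe, a Windows code page conversion tool
Last edited by happy886rr on 2017-6-1 18:43
[tvcp has been renamed to wincp; several bugs were fixed and the version number is now 1.1]
wincp is a code page conversion tool. It supports text encoding conversion, rewriting the BOM, writing a fake BOM, stripping the BOM, automatic parameter correction, and automatic BOM offset detection.
Download: save the externally hosted image as a.zip and extract it.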
WINCP.EXE (TEXT CODEPAGE CONVERSION TOOL, BY LEO, VERSION 1.1)
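Summary:
=========================================================================
A code page conversion tool. It supports text encoding conversion, rewriting,
forging and stripping the BOM, automatic parameter correction, automatic BOM
offset detection, and more.
Its uses go beyond plain encoding conversion: it can also reinterpret text
under another code page, forge or customize the BOM, and serve as a crude form
of encryption.
Note: the code_page parameter can be a code page number such as 936 or a code
page abbreviation such as GBK; see the Notes section (common code page
abbreviations) for the mapping.
=========================================================================
Usage: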
-------------------------------------------------------------------------
wincp [input_file] -f [code_page] -t [code_page] -s [skip_number] -b[fill_BOM] -o [out_file]
-------------------------------------------------------------------------
-f From the code page
-t Translate to the code page
-s Skip the number of bytes
-b Filling BOM
-o Output file name
-h Show help information
-------------------------------------------------------------------------
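The -b value is written as hex digits, for example 0xFFFE or 0xADFF0000 in the examples below, and the source keeps at most four BOM bytes (BOMS_SIZE). As a rough, portable sketch of that parsing step (not the tool's own TCHARRAY2BIN; the names parseBomLiteral and hexValue are made up for illustration, and unlike the original this version drops a trailing odd hex digit):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Convert one hex digit to its value, or return -1 if it is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: parse a BOM literal such as "0xFFFE" or "EFBBBF"
// into at most 4 bytes. Parsing stops at the first pair of characters
// that is not two hex digits.
std::vector<unsigned char> parseBomLiteral(const std::string& text) {
    const std::size_t kMaxBomBytes = 4;  // same limit as BOMS_SIZE in wincp
    std::size_t pos = 0;
    if (text.size() >= 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        pos = 2;  // skip the optional "0x" prefix
    }
    std::vector<unsigned char> bytes;
    while (bytes.size() < kMaxBomBytes && pos + 1 < text.size()) {
        int hi = hexValue(text[pos]);
        int lo = hexValue(text[pos + 1]);
        if (hi < 0 || lo < 0) break;
        bytes.push_back(static_cast<unsigned char>((hi << 4) | lo));
        pos += 2;
    }
    assert(bytes.size() <= kMaxBomBytes);
    return bytes;
}
```

Calling parseBomLiteral("0xFFFE") yields the two bytes FF FE, which is what the tool would write at the start of the output file.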
Examples:
-------------------------------------------------------------------------
REM Convert test.txt from BIG5 to UTF8
wincp test.txt -o out.txt -f BIG5 -t UTF8
REM Convert test.txt from ANSI (here GBK, code page 936) to UTF8
wincp test.txt -o out.txt -f 936 -t 65001
REM Convert test.txt from UTF8 to UCS-2LE (what Windows usually calls UNICODE) and write the BOM bytes FF FE
wincp test.txt -f 65001 -t 1200 -s 0 -b 0xFFFE -o out.txt
REM Convert test.txt from big-endian UNICODE to UTF8 (two equivalent spellings)
wincp test.txt -o out.txt -f UCS2BE -t UTF8
wincp test.txt -oout.txt -fUNICODEBE -tUTF8
REM Strip the BOM from test.txt
wincp test.txt -oout.txt
REM Write a fake BOM
wincp test.txt -oout.txt -b0xADFF0000
...
-------------------------------------------------------------------------
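The UCS2BE / UNICODEBE examples work because the source handles code page 1201 by swapping the two bytes of every 16-bit unit and then treating the data as code page 1200. A minimal portable sketch of that swap follows; the name swapUtf16ByteOrder is hypothetical, and it takes an explicit length instead of stopping at the first zero unit as the original loop does:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: swap the two bytes of every UTF-16 code unit in place,
// which converts big-endian data to little-endian and vice versa.
void swapUtf16ByteOrder(std::vector<std::uint16_t>& units) {
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint16_t u = units[i];
        units[i] = static_cast<std::uint16_t>(((u & 0x00FFu) << 8) | ((u & 0xFF00u) >> 8));
    }
}
```

Applying the function twice restores the original data, so the same routine serves both for reading 1201 input and for producing 1201 output.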
Notes: (common code page abbreviations)
-------------------------------------------------------------------------
ANSI 0
GBK 936
GB18030 54936
BIG5 950
UNICODE UTF16 UCS2 1200
UNICODEBE UTF16BE UCS2BE 1201
UTF8 65001
UTF7 65000
UTF32 12000
UTF32BE 12001
-------------------------------------------------------------------------
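The abbreviation table above maps directly onto a small lookup. The sketch below is a portable illustration of that idea, not the tool's own _tgetCP; codePageFromName is a made-up name. It assumes names are matched case-insensitively and that a purely numeric argument of up to five digits is taken as the code page number itself:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: map a name from the abbreviation table, or a decimal
// number, to a code page. Returns -1 for anything it does not recognise.
int codePageFromName(const std::string& text) {
    if (text.empty()) return -1;

    bool allDigits = true;
    std::string key;  // upper-cased copy used for the name comparison
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (!std::isdigit(c)) allDigits = false;
        key += static_cast<char>(std::toupper(c));
    }
    if (allDigits) return text.size() <= 5 ? std::atoi(text.c_str()) : -1;

    struct Alias { const char* name; int codePage; };
    static const Alias kAliases[] = {
        {"ANSI", 0}, {"GBK", 936}, {"GB18030", 54936}, {"BIG5", 950},
        {"UNICODE", 1200}, {"UTF16", 1200}, {"UCS2", 1200},
        {"UNICODEBE", 1201}, {"UTF16BE", 1201}, {"UCS2BE", 1201},
        {"UTF8", 65001}, {"UTF7", 65000}, {"UTF32", 12000}, {"UTF32BE", 12001},
    };
    for (const Alias& alias : kAliases) {
        if (key == alias.name) return alias.codePage;
    }
    return -1;
}
```

With this, codePageFromName("gbk") and codePageFromName("936") both give 936, matching the two forms the -f and -t switches accept.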
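Code pages: (general code page reference)
-------------------------------------------------------------------------
437 — the original IBM PC code page, implementing an extended ASCII character set
737 — Greek
850 — Latin-1 (Western European languages)
852 — Latin-2 (Central and Eastern European languages)
855 — Cyrillic
857 — Turkish
858 — “Multilingual” with the euro sign
860 — Portuguese
861 — Icelandic
863 — Canadian French
865 — Nordic
866 — Cyrillic
869 — Greek
874 — Thai
932 — Japanese
949 — Korean
936 — GBK (Simplified Chinese)
950 — BIG5 (Traditional Chinese)
1200 — UCS-2LE Unicode, little endian
1201 — UCS-2BE Unicode, big endian
1250 — Central and Eastern European Latin
1251 — Cyrillic
1252 — Western European Latin (a superset of ISO-8859-1)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Baltic
1258 — Vietnamese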
10000 — Macintosh Roman encoding (followed by several other Mac character sets)
10007 — Macintosh Cyrillic encoding
10029 — Macintosh Central European encoding
12000 — utf-32 Unicode UTF-32, little endian byte order; available only to managed applications
12001 — utf-32BE Unicode UTF-32, big endian byte order; available only to managed applications
28591 — iso-8859-1 ISO 8859-1 Latin 1; Western European (ISO)
51936 — EUC-CN EUC Simplified Chinese; Chinese Simplified (EUC)
54936 — GB18030
65000 — UTF-7 Unicode
65001 — UTF-8 Unicode
-------------------------------------------------------------------------
BOM: (common byte order marks)
-------------------------------------------------------------------------
UTF-8 EF BB BF
UTF-16 (LE) FF FE
UTF-16 (BE) FE FF
UTF-32 (LE) FF FE 00 00
UTF-32 (BE) 00 00 FE FF
UTF-7 2B 2F 76 +[38|39|2B|2F]
UTF-1 F7 64 4C
UTF-EBCDIC DD 73 66 73
SCSU 0E FE FF
BOCU-1 FB EE 28 (+FF)
GB-18030 84 31 95 33
-------------------------------------------------------------------------
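When -s is omitted, the tool looks at the first bytes of the input and skips a recognised BOM. The sketch below shows one way to do that check against a subset of the table above (UTF-7 and the rarer marks are left out). It is an illustration under those assumptions rather than the tool's own code, and bomLength and startsWith are hypothetical names. The longer marks are tested first because the UTF-32 (LE) mark begins with the UTF-16 (LE) mark:

```cpp
#include <cassert>
#include <cstddef>

// True if the first markSize bytes of data equal mark.
static bool startsWith(const unsigned char* data, std::size_t size,
                       const unsigned char* mark, std::size_t markSize) {
    if (size < markSize) return false;
    for (std::size_t i = 0; i < markSize; ++i) {
        if (data[i] != mark[i]) return false;
    }
    return true;
}

// Hypothetical helper: return the length in bytes of a recognised BOM at the
// start of data, or 0 if none of the marks below is present.
std::size_t bomLength(const unsigned char* data, std::size_t size) {
    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char gb18030[] = {0x84, 0x31, 0x95, 0x33};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};
    static const unsigned char utf16be[] = {0xFE, 0xFF};

    // Four-byte marks first, then three, then two.
    if (startsWith(data, size, utf32le, 4)) return 4;
    if (startsWith(data, size, utf32be, 4)) return 4;
    if (startsWith(data, size, gb18030, 4)) return 4;
    if (startsWith(data, size, utf8, 3)) return 3;
    if (startsWith(data, size, utf16le, 2)) return 2;
    if (startsWith(data, size, utf16be, 2)) return 2;
    return 0;
}
```

Because every comparison is bounded by the real data length, a two-byte file that holds only FF FE is reported as a 2-byte BOM rather than being mistaken for UTF-32.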
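Version:
VERSION 1.1
The source builds as either an ANSI (single-byte) or a Unicode (wide-character) program and compiles with the common Windows compilers.
- /*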
- TEXT CODEPAGE CONVERSION TOOL, COPYRIGHT@2017~2019 BY LEO, VERSION 1.1
- WINCP.EXE
-
- UNICODE COMPILATION:
- ==> G++ wincp.cpp -D _UNICODE -D UNICODE -municode -O2 -static
- ==> CL wincp.cpp /O2 /Oy- /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /MD
-
- ANSI COMPILATION:
- ==> G++ wincp.cpp -O2 -static
- ==> CL wincp.cpp /O2 /Oy- /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /MD
- */
-
-
- #include <stdio.h>
- #include <stdlib.h>
- #include <string.h>
- #include <windows.h>
- #include <locale.h>
- #include <ctype.h>
- #include <tchar.h>
- #include <time.h>
-
- #if !defined(_MSC_VER) && !defined(bool)
- #include <stdbool.h>
- #endif
-
- #if !defined(WIN32) && !defined(__WIN32__)
- #error Only run on windows system
- #endif
-
- /***************Macro definitions***************/
- //Maximum input file size (bytes)
- #define MAX_FILE_SIZE (1024 * 1024)
-
- //Standard line length
- #define BUFF_SIZE 1024
-
- //BOM buffer length
- #define BOMS_SIZE 4
-
- //Encoding detection threshold (bytes)
- #define CHECK_SIZE 16383
-
- //Help text
- #define HELP_INFORMATION _T("\
- wincp v1.1 - Console text codepage conv tool - Copyright (C) 2017-2019 by LEO\n\
- Usage: wincp [input_file] -f [code_page] -t [code_page] -s [skip_number] -b[fill_BOM] -o [out_file]\n\
- \n\
- General options:\n\
- -f From the code page\n\
- -t Translate to the code page\n\
- -s Skip the number of bytes\n\
- -b Filling BOM\n\
- -o Output file name\n\
- -h Show help information\n\
- \n\
- Official website:\n\
- http://www.bathome.net/thread-44343-1-1.html\n\
- ")
-
- /*
- Microsoft code pages:\n\
- 897 – IBM-PC SBCS Japanese (JIS X 0201-1976)
- 941 – IBM-PC Japanese DBCS for Open environment
- 947 – IBM-PC DBCS for (Big5 encoding)
- 950 – Traditional Chinese MIX (Big5 encoding) (1114 + 947) (same with euro: 1370)
- 1114 – IBM-PC SBCS (Simplified Chinese; GBK; Traditional Chinese; Big5 encoding)
- 1126 – IBM-PC Korean SBCS
- 1162 – Windows Thai (Extension of 874; but still called that in Windows)
- 1169 – Windows Cyrillic Asian
- 1250 – Windows Central Europe
- 1251 – Windows Cyrillic
- 1252 – Windows Western
- 1253 – Windows Greek
- 1254 – Windows Turkish
- 1255 – Windows Hebrew
- 1256 – Windows Arabic
- 1257 – Windows Baltic
- 1258 – Windows Vietnamese
- 1361 – Korean (JOHAB)
- 1362 – Korean Hangul DBCS
- 1363 – Windows Korean (1126 + 1362) (Windows CP 949)
- 1372 – IBM-PC MS T Chinese Big5 encoding (Special for DB2)
- 1373 – Windows Traditional Chinese (extension of 950)
- 1374 – IBM-PC DB Big5 encoding extension for HKSCS
- 1375 – Mixed Big5 encoding extension for HKSCS (intended to match 950)
- 1385 – IBM-PC Simplified Chinese DBCS (Growing CS for GB18030, also used for GBK PC-DATA.)
- 1386 – IBM-PC Simplified Chinese GBK (1114 + 1385) (Windows CP 936)
- 1391 – Simplified Chinese 4 Byte (Growing CS for GB18030, also used for GBK PC-DATA.)
- 1392 – IBM-PC Simplified Chinese MIX (1252 + 1385 + 1391)
- ...
- */
-
- //Return codes of the switch parser
- #define _OPT_TEOF -1
- #define _OPT_TILL -2
- #define _OPT_TERR -3
-
- //State variables of the switch parser
- int OPTIND=1, OPTOPT, UNOPTIND=-1;
- TCHAR* OPTARG;
-
- #if defined(_UNICODE) || defined(UNICODE)
- #define TCHARFORMAT WCHAR
- #else
- #define TCHARFORMAT CHAR
- #endif
-
- //Macro that packs four BOM bytes into a UINT, first byte most significant
- #define BOM2UINT(x) (unsigned int)(((unsigned char)(x)[0]<<24)|((unsigned char)(x)[1]<<16)|((unsigned char)(x)[2]<<8)|((unsigned char)(x)[3]))
-
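- /***************Helper functions***************/
- //Check whether a string is a non-negative decimal number
- int _istPositiveNumber(TCHAR* instr)
- {
- //Skip leading whitespace
- while(_istspace(*instr))
- {
- instr++;
- }
-
- //Reject empty strings and negative numbers
- if(*instr == _T('\0') || *instr == _T('-'))
- {
- return -1;
- }
-
- //Consume every digit
- while(_istdigit(*(instr)))
- {
- instr++;
- }
-
- //Return 0 if the whole string was digits, 1 otherwise
- return (*instr == _T('\0')) ?0 :1;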
- }
-
- //Resolve a code page number or abbreviation
- int _tgetCP(TCHAR* instr)
- {
- //Null pointer
- if(instr == NULL)
- {
- return -1;
- }
-
- //Return value
- int retCP;
- switch(_istPositiveNumber(instr))
- {
- case -1:
- return -1;
-
- case 0:
- return _ttoi((TCHARFORMAT*)instr);
-
- case 1:
- break;
- }
-
- if (_tcsicmp(instr, _T("ANSI") ) ==0)
- {
- retCP=CP_ACP;
- }
- else if(_tcsicmp(instr, _T("GBK") ) ==0)
- {
- retCP=936;
- }
- else if(_tcsicmp(instr, _T("GB18030")) ==0)
- {
- retCP=54936;
- }
- else if(_tcsicmp(instr, _T("BIG5") ) ==0)
- {
- retCP=950;
- }
- else if(
- _tcsicmp(instr, _T("UNICODE") ) ==0 ||
- _tcsicmp(instr, _T("UTF16") ) ==0 ||
- _tcsicmp(instr, _T("UCS2") ) ==0
- )
- {
- retCP=1200;
- }
- else if(
- _tcsicmp(instr, _T("UNICODEBE")) ==0 ||
- _tcsicmp(instr, _T("UTF16BE") ) ==0 ||
- _tcsicmp(instr, _T("UCS2BE") ) ==0
- )
- {
- retCP=1201;
- }
- else if(_tcsicmp(instr, _T("UTF7") ) ==0)
- {
- retCP=65000;
- }
- else if(_tcsicmp(instr, _T("UTF8") ) ==0)
- {
- retCP=65001;
- }
- else if(_tcsicmp(instr, _T("UTF32") ) ==0)
- {
- retCP=12000;
- }
- else if(_tcsicmp(instr, _T("UTF32BE")) ==0)
- {
- retCP=12001;
- }
- else
- {
- retCP=-1;
- }
- return retCP;
- }
-
- //Convert a hex digit character to its value
- int C2HEX(TCHAR intc)
- {
- int hret=-1;
- if (_T('0')<=intc && intc<=_T('9'))
- {
- hret=intc-48;
- }
- else if(_T('A')<=intc && intc<=_T('F'))
- {
- hret=intc-55;
- }
- else if(_T('a')<=intc && intc<=_T('f'))
- {
- hret=intc-87;
- }
- else
- {
- hret=-1;
- }
-
- return hret;
- }
-
- //Parse a hex BOM string into bytes
- int TCHARRAY2BIN(TCHAR* instr, BYTE* &tainer)
- {
- memset(tainer, 0, BOMS_SIZE);
-
- if(*instr == _T('x') || *instr == _T('X'))
- {
- instr ++;
- }
- if(_tcsnicmp(instr, _T("0x"), 2) ==0)
- {
- instr += 2;
- }
-
- int i=-1, hexNUM;
- while(++i<BOMS_SIZE)
- {
- hexNUM=C2HEX(*instr++);
- if(hexNUM != -1)
- {
- tainer[i] |= (hexNUM<<4);
- }
- else
- {
- break;
- }
-
- hexNUM=C2HEX(*instr++);
- if(hexNUM != -1)
- {
- tainer[i] |= hexNUM;
- }
- else
- {
- break;
- }
- }
-
- return i;
- }
-
- //Command-line switch parser
- int _tgetopt(int nargc, TCHAR* nargv[], TCHAR* ostr)
- {
- static TCHAR* place = (TCHAR*)_T("");
- static TCHAR* lastostr = NULL;
- register TCHAR* oli;
-
- if(ostr!=lastostr)
- {
- lastostr=ostr;
- place=(TCHAR*)_T("");
- }
-
- if(!*place)
- {
- if(
- (OPTIND>=nargc) ||
- (*(place=nargv[OPTIND]) !=(TCHAR)_T('-')) ||
- (!*(++place))
- )
- {
- if(*place !=(TCHAR)_T('-') && OPTIND <nargc)
- {
- place =(TCHAR*)_T("");
- if(UNOPTIND == -1)
- {
- UNOPTIND = OPTIND++;
- return _OPT_TILL;
- }
- else
- {
- return _OPT_TERR;
- }
- }
-
- place=(TCHAR*)_T("");
- return _OPT_TEOF;
- }
- if (*place == (TCHAR)_T('-') && *(place+1) == (TCHAR)_T('\0'))
- {
- ++OPTIND;
- return _OPT_TEOF;
- }
- }
-
- if (
- (OPTOPT=*place++) == (TCHAR)_T(':') ||
- !(oli=(TCHAR*)_tcschr((TCHARFORMAT*)ostr, (TCHAR)OPTOPT))
- )
- {
- if(!*place)
- {
- ++OPTIND;
- }
- }
-
- if (oli != NULL && *(++oli) !=(TCHAR)_T(':'))
- {
- OPTARG=NULL;
- if(!*place)
- {
- ++OPTIND;
- }
- }
- else
- {
- if(*place)
- {
- OPTARG=place;
- }
- else if(nargc <= ++OPTIND)
- {
- place=(TCHAR*)_T("");
- }
- else
- {
- OPTARG=nargv[OPTIND];
- }
- place=(TCHAR*)_T("");
- ++OPTIND;
- }
- return OPTOPT;
- }
-
- //Code page conversion
- void PageTurnAround(const BYTE* input, int inputSIZE, int inPAGE, int outPAGE, BYTE* &outDATA, int &oLEN)
- {
- int wLEN;
- char* outCACHE=NULL;
- wchar_t* wcsCACHE=NULL;
-
- if(inPAGE == outPAGE)
- {
- outDATA=(BYTE*)input, oLEN=inputSIZE;
- return;
- }
-
- //Input is already UCS-2 (code pages 1200 and 1201)
- if(inPAGE == 1200)
- {
- wcsCACHE=(wchar_t*)input;
- wLEN=inputSIZE/2+1;
- goto TOMCS;
- }
- if(inPAGE == 1201)
- {
- wchar_t* wp=(wchar_t*)input;
- while(*wp)
- {
- *wp = (((*wp)&0x00FF)<<8)|(((*wp)&0xFF00)>>8);
- wp ++;
- }
- wcsCACHE=(wchar_t*)input;
- wLEN=inputSIZE/2+1;
- goto TOMCS;
- }
-
- //Input code page -> Unicode (UTF-16) intermediate
- wLEN=MultiByteToWideChar(inPAGE, 0, (char*)input,-1, NULL, 0);
- if(wLEN <1)
- {
- _ftprintf(stderr, _T("Unable to convert code page\n"));
- exit(1);
- }
- wcsCACHE=(wchar_t*)malloc(wLEN * sizeof(wchar_t));
- MultiByteToWideChar(inPAGE, 0, (char*)input, -1, wcsCACHE, wLEN);
-
- TOMCS:
- //Output is UCS-2 (code pages 1200 and 1201)
- if(outPAGE == 1200)
- {
- outDATA=(BYTE*)wcsCACHE, oLEN=(wLEN-1)*2;
- return;
- }
- if(outPAGE == 1201)
- {
- wchar_t* wp=(wchar_t*)wcsCACHE;
- while(*wp)
- {
- *wp = (((*wp)&0x00FF)<<8)|(((*wp)&0xFF00)>>8);
- wp ++;
- }
- outDATA=(BYTE*)wcsCACHE, oLEN=(wLEN-1)*2;
- return;
- }
-
- //Unicode (UTF-16) intermediate -> output code page
- int uLEN=WideCharToMultiByte(outPAGE, 0, wcsCACHE, -1, NULL, 0, NULL, NULL);
- if(uLEN <1)
- {
- _ftprintf(stderr, _T("Unable to convert code page\n"));
- exit(1);
- }
- outCACHE=(char*)malloc(uLEN);
- WideCharToMultiByte(outPAGE, 0, wcsCACHE, -1, outCACHE, uLEN, NULL, NULL);
-
- outDATA=(BYTE*)outCACHE, oLEN=uLEN-1;
- return;
- }
-
- //Core text conversion routine
- bool ConveTextFile(TCHAR* inFILE, TCHAR* outFILE, int inPAGE, int outPAGE, int skipNUMBER, int binBOM_SIZE, BYTE* tainerBOM_BIN)
- {
- //Open the input file
- FILE* inFP=_tfopen(inFILE, _T("rb"));
- if(inFP == NULL)
- {
- _ftprintf(stderr, _T("Open input file error\n"));
- exit(1);
- }
-
- //Get the input file size
- fseek(inFP, 0, SEEK_END);
- int fsize = ftell(inFP);
- if(fsize > MAX_FILE_SIZE)
- {
- _ftprintf(stderr, _T("The input file is too large, can not be greater than %dKB\n"), MAX_FILE_SIZE/1024);
- exit(1);
- }
-
- //Never skip past the end of a very short file
- if(skipNUMBER > fsize)
- {
- skipNUMBER = fsize;
- }
- fseek(inFP, (long)skipNUMBER, SEEK_SET);
-
- //Allocate a zero-filled text buffer; the spare bytes terminate both
- //single-byte and two-byte (UCS-2) input
- BYTE* inDATA=(BYTE*)calloc(fsize+4, 1);
-
- //Read the text into memory
- int readSIZE=fsize-skipNUMBER;
- fread(inDATA, sizeof(BYTE), readSIZE, inFP);
- fclose(inFP);
-
- //Convert the code page
- int oLEN=0;
- BYTE* outDATA=NULL;
-
- //Call the code page conversion function
- PageTurnAround(inDATA, readSIZE, inPAGE, outPAGE, outDATA, oLEN);
-
- if(oLEN <1)
- {
- return false;
- }
-
- //Open the output file
- FILE* outFP=_tfopen(outFILE, _T("wb"));
- if(outFP == NULL)
- {
- _ftprintf(stderr, _T("Open output file error\n"));
- exit(1);
- }
-
- //Write the BOM first, then the converted text
- fwrite(tainerBOM_BIN, sizeof(BYTE), binBOM_SIZE, outFP);
- fwrite(outDATA, sizeof(BYTE), oLEN, outFP);
- fclose(outFP);
-
- free(inDATA);
- return true;
- }
-
- #if defined _MSC_VER
- #else
- extern "C"
- #endif
-
- //*************Main entry point*************/
- int _tmain(int argc, TCHAR** argv)
- {
- if(argc<2)
- {
- //No arguments: print the help text and exit
- _ftprintf(stdout, HELP_INFORMATION);
- return 0;
- }
-
- //Variables that hold the parsed arguments
- TCHAR *opeOUTFILE=NULL, *opeINFILE=NULL;
- int opeIN_PAGE=CP_ACP, opeOUT_PAGE=CP_ACP, opeSKIP_NUMBER=0, opeBOM_SIZE=0;
- BYTE opeFLAG=0x00, *pTAINER=NULL, tainerBOM_BIN[BOMS_SIZE]= {0};
-
- //Parse the switches
- int K=_OPT_TEOF;
- while( (K=_tgetopt(argc, argv, (TCHAR*)_T("f:t:s:b:o:hF:T:S:B:O:H"))) != _OPT_TEOF)
- {
- switch(K)
- {
- case _T('f'):
- case _T('F'):
- opeIN_PAGE =_tgetCP(OPTARG);
- if(opeIN_PAGE == -1)
- {
- _ftprintf(stderr, _T("The switch '-f' needs a positive number\n"));
- exit(1);
- }
- opeFLAG |= 0x01;
- break;
-
- case _T('t'):
- case _T('T'):
- opeOUT_PAGE =_tgetCP(OPTARG);
- //Check the output page here, not the input page
- if(opeOUT_PAGE == -1)
- {
- _ftprintf(stderr, _T("The switch '-t' needs a positive number\n"));
- exit(1);
- }
- opeFLAG |= 0x02;
- break;
-
- case _T('s'):
- case _T('S'):
- if(OPTARG == NULL)
- {
- _ftprintf(stderr, _T("The switch '-s' needs a positive number\n"));
- exit(1);
- }
- opeSKIP_NUMBER = _ttoi((TCHARFORMAT*)OPTARG);
- if(! (0<= opeSKIP_NUMBER && opeSKIP_NUMBER <=4 ) )
- {
- _ftprintf(stderr, _T("The switch '-s' needs a number between {0,4}\n"));
- exit(1);
- }
- opeFLAG |= 0x04;
- break;
-
- case _T('b'):
- case _T('B'):
- //A missing value is an error; accept at most "0x" plus 8 hex digits
- if(OPTARG == NULL || _tcslen(OPTARG) > 10)
- {
- _ftprintf(stderr, _T("The switch '-b' needs a hexadecimal number\n"));
- exit(1);
- }
- pTAINER=(BYTE*)tainerBOM_BIN;
- opeBOM_SIZE = TCHARRAY2BIN(OPTARG, pTAINER);
- opeFLAG |= 0x08;
- break;
-
- case _T('o'):
- case _T('O'):
- if(OPTARG != NULL)
- {
- opeFLAG |= 0x10;
- opeOUTFILE = OPTARG;
- }
- break;
-
- case _T('h'):
- case _T('H'):
- _ftprintf(stdout, HELP_INFORMATION);
- return 0;
-
- case _OPT_TILL:
- //The first argument without a switch is taken as the input file name
- opeINFILE = argv[UNOPTIND];
- break;
-
- case _OPT_TERR:
- _ftprintf(stderr, _T("Extra parameters \"%s\"\n"), argv[OPTIND]);
- exit(1);
-
- default:
- _ftprintf(stderr, _T("Unknown switch '-%c'\n"), K);
- exit(1);
- }
- }
-
- //No input file: exit with an error
- if(opeINFILE == NULL)
- {
- _ftprintf(stderr, _T("Needs input file name\n"));
- exit(1);
- }
-
- //No output file: overwrite the input file
- if(opeOUTFILE == NULL)
- {
- opeOUTFILE=opeINFILE;
- }
-
- //No -s given: detect the BOM and skip it automatically
- if((opeFLAG&0x04) == 0)
- {
-
- FILE* inFP=_tfopen(opeINFILE, _T("rb"));
- if(inFP == NULL)
- {
- _ftprintf(stderr, _T("Open input file error\n"));
- exit(1);
- }
- //Probe into a separate buffer so a BOM supplied with -b is not overwritten
- BYTE probeBOM[BOMS_SIZE]= {0};
- fread(probeBOM, sizeof(BYTE), BOMS_SIZE, inFP);
- fclose(inFP);
-
- UINT uBOM_VALUE = BOM2UINT(probeBOM);
-
- //Match the 4-byte BOMs first, then the shorter ones
- switch(uBOM_VALUE)
- {
- case 0xFFFE0000:
- case 0x0000FEFF:
- case 0x2B2F7638:
- case 0x84319533:
- opeSKIP_NUMBER = 4;
- break;
-
- default:
- if(
- (uBOM_VALUE>>16) == 0xFFFE ||
- (uBOM_VALUE>>16) == 0xFEFF
- )
- {
- opeSKIP_NUMBER = 2;
- }
- else if((uBOM_VALUE>>8) == 0xEFBBBF)
- {
- opeSKIP_NUMBER = 3;
- }
- else
- {
- opeSKIP_NUMBER = 0;
- }
- break;
- }
- }
-
- //No -b given: choose the BOM that matches the output code page
- if((opeFLAG&0x08) == 0)
- {
- TCHAR* tcsBIN =(TCHAR*)_T("");
- switch(opeOUT_PAGE)
- {
- case 1200:
- tcsBIN =(TCHAR*)_T("0xFFFE");
- break;
-
- case 1201:
- tcsBIN =(TCHAR*)_T("0xFEFF");
- break;
-
- case 12000:
- tcsBIN =(TCHAR*)_T("0xFFFE0000");
- break;
-
- case 12001:
- tcsBIN =(TCHAR*)_T("0x0000FEFF");
- break;
-
- case 65001:
- tcsBIN =(TCHAR*)_T("0xEFBBBF");
- break;
-
- //UTF-7 is code page 65000
- case 65000:
- tcsBIN =(TCHAR*)_T("0x2B2F7638");
- break;
-
- case 54936:
- tcsBIN =(TCHAR*)_T("0x84319533");
- break;
-
- default:
- break;
- }
-
- //Fill the BOM buffer
- pTAINER = (BYTE*)tainerBOM_BIN;
- opeBOM_SIZE = TCHARRAY2BIN(tcsBIN, pTAINER);
- }
-
- //Run the code page conversion
- if(! ConveTextFile(opeINFILE, opeOUTFILE, opeIN_PAGE, opeOUT_PAGE, opeSKIP_NUMBER, opeBOM_SIZE, tainerBOM_BIN))
- {
- _ftprintf(stderr, _T("Conver file error\n"));
- return 1;
- }
-
- return 0;
- }
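Author: 3518228042 Time: 2017-6-1 16:33
Could you demonstrate converting every web page under the local "小说" (novels) directory to txt?
I would like line breaks that are not paragraph breaks to be removed. For example, when a page is opened in a browser it shows:
脸
色当即羞红了起来
In the page source, the two <br/> <br/> tags and anything between them should be removed:
"脸<br/> <br/>色当即羞红了起来" should become "脸色当即羞红了起来".
If <br/> <br/> has Chinese text on its left and a proper paragraph break on its right (several full-width or half-width spaces), do not replace it;
if <br/> <br/> has Chinese text on both its left and its right, it must be replaced;
if the first character to the left of <br/> <br/> is one of ,、“:;, it must be replaced;
if the text to the right of <br/> <br/> is one of ,。、“”:;!?…, it must be replaced;
if the first character to the left of <br/> <br/> is one of 。?!”…… and the right side is ……, do not replace it;
if <br/> <br/> has : on its left and “ on its right, it must be replaced.
Sometimes the line breaks are not <br/> <br/> but </P><P>, <br/><br/>, or
<br/>
<br/>
Finally, extract the title and the content between the <br/> tags, replace advertising text according to custom rules, and clean up runs of blank characters and line breaks, leaving clean ANSI text.
I have a few test pages here.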
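Author: happy886rr Time: 2017-6-1 18:40
Reply to 2# 3518228042
I suggest using sed directly; this kind of processing is best done with a script and regular expressions. It can be done in C as well, but the code would be very tedious.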
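To give a feel for how verbose the hand-written route gets, here is a rough C++ sketch of just one of the requested rules: drop a "<br/> <br/>" marker unless the text after it starts with paragraph indentation (a half-width space or a UTF-8 full-width space). It assumes UTF-8 input, the names joinBrokenLines and startsWithIndent are made up, and none of the punctuation rules or the other marker variants are handled:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the text at pos starts with an ASCII space or a UTF-8 encoded
// full-width space (U+3000), or if nothing follows at all.
static bool startsWithIndent(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return true;
    if (s[pos] == ' ') return true;
    return s.compare(pos, 3, "\xE3\x80\x80") == 0;
}

// Hypothetical helper: remove every "<br/> <br/>" that is not followed by
// paragraph indentation, joining the text on both sides of it.
std::string joinBrokenLines(const std::string& html) {
    const std::string marker = "<br/> <br/>";
    std::string out;
    std::size_t pos = 0;
    while (true) {
        std::size_t hit = html.find(marker, pos);
        if (hit == std::string::npos) {
            out.append(html, pos, std::string::npos);  // copy the remainder
            break;
        }
        out.append(html, pos, hit - pos);               // copy text before the marker
        std::size_t after = hit + marker.size();
        if (startsWithIndent(html, after)) {
            out += marker;                              // real paragraph break: keep it
        }
        pos = after;
    }
    return out;
}
```

Each further rule from the request would need its own look-behind or look-ahead check on multi-byte characters, which is where a regular-expression tool such as sed becomes the more practical choice.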