Title: [Help request] How can VBS tell whether a string contains Unicode characters?
Author: lqh123108  Time: 2014-10-31 15:17
asc("拉")
But I don't know what the value range of ANSI is.
Any guidance would be appreciated.
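Author: yu2n  Time: 2014-11-4 09:57
1. The title is puzzling. Every character on this thread's page can be found in the Unicode character set. Can you give an example of a character that is not in Unicode?
2. About the VBS Asc function: you would not expect it, but its return value is quite interesting. It depends on the operating system language and changes with the system code page.
For example:
On Simplified Chinese XP, asc("拉") returns -16211, while the GBK code of "拉" is 49325 (note that 16211 + 49325 = 65536).
On Traditional Chinese XP, asc("拉") returns -22060. I did not look up the BIG5 code of "拉", but I guess it is 43476 (65536 - 22060). Why is the guess right? The documentation for Asc explains it:
Asc returns the "code point" (character code) of the input character. For single-byte character set (SBCS) values the return value ranges from 0 to 255;
for double-byte character set (DBCS) values the return value ranges from -32768 to 32767. For a chart of single-byte ASCII characters, see ASCII Character Codes.
The returned value depends on the code page of the current thread, which is contained in the ANSICodePage property of the TextInfo class in the System.Globalization namespace.
3. The value range of ANSI? Not ANSII? My guess is that ANSI means "俺寺"; I am slow on the uptake, so don't blame me.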
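The arithmetic in point 2 can be sketched in VBS. This is only an illustration: ToUnsigned16 is a helper name made up for this example, and it assumes the negative number is a double-byte code that Asc reported as a signed 16-bit value.

```vbscript
' Hypothetical helper: turn the signed 16-bit value that Asc reports for a
' double-byte character into the unsigned code (0..65535).
Function ToUnsigned16(n)
    If n < 0 Then
        ToUnsigned16 = CLng(n) + 65536
    Else
        ToUnsigned16 = CLng(n)
    End If
End Function

' The two return values quoted in the post above.
WScript.Echo ToUnsigned16(-16211)       ' 49325
WScript.Echo Hex(ToUnsigned16(-16211))  ' C0AD
WScript.Echo ToUnsigned16(-22060)       ' 43476
WScript.Echo Hex(ToUnsigned16(-22060))  ' A9D4
```

The hexadecimal form is the handier one for looking a character up in a code page table, because its two bytes (lead byte, trail byte) can be read off directly.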
Appendix: ANSICodePage
- Code Page Identifiers
- The following table defines the available code page identifiers.
- Note ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page.
- Identifier .NET Name Additional information
- 037 IBM037 IBM EBCDIC US-Canada
- 437 IBM437 OEM United States
- 500 IBM500 IBM EBCDIC International
- 708 ASMO-708 Arabic (ASMO 708)
- 709 Arabic (ASMO-449+, BCON V4)
- 710 Arabic - Transparent Arabic
- 720 DOS-720 Arabic (Transparent ASMO); Arabic (DOS)
- 737 ibm737 OEM Greek (formerly 437G); Greek (DOS)
- 775 ibm775 OEM Baltic; Baltic (DOS)
- 850 ibm850 OEM Multilingual Latin 1; Western European (DOS)
- 852 ibm852 OEM Latin 2; Central European (DOS)
- 855 IBM855 OEM Cyrillic (primarily Russian)
- 857 ibm857 OEM Turkish; Turkish (DOS)
- 858 IBM00858 OEM Multilingual Latin 1 + Euro symbol
- 860 IBM860 OEM Portuguese; Portuguese (DOS)
- 861 ibm861 OEM Icelandic; Icelandic (DOS)
- 862 DOS-862 OEM Hebrew; Hebrew (DOS)
- 863 IBM863 OEM French Canadian; French Canadian (DOS)
- 864 IBM864 OEM Arabic; Arabic (864)
- 865 IBM865 OEM Nordic; Nordic (DOS)
- 866 cp866 OEM Russian; Cyrillic (DOS)
- 869 ibm869 OEM Modern Greek; Greek, Modern (DOS)
- 870 IBM870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
- 874 windows-874 ANSI/OEM Thai (ISO 8859-11); Thai (Windows)
- 875 cp875 IBM EBCDIC Greek Modern
- 932 shift_jis ANSI/OEM Japanese; Japanese (Shift-JIS)
- 936 gb2312 ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)
- 949 ks_c_5601-1987 ANSI/OEM Korean (Unified Hangul Code)
- 950 big5 ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)
- 1026 IBM1026 IBM EBCDIC Turkish (Latin 5)
- 1047 IBM01047 IBM EBCDIC Latin 1/Open System
- 1140 IBM01140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
- 1141 IBM01141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
- 1142 IBM01142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
- 1143 IBM01143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
- 1144 IBM01144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
- 1145 IBM01145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
- 1146 IBM01146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
- 1147 IBM01147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
- 1148 IBM01148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
- 1149 IBM01149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
- 1200 utf-16 Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
- 1201 unicodeFFFE Unicode UTF-16, big endian byte order; available only to managed applications
- 1250 windows-1250 ANSI Central European; Central European (Windows)
- 1251 windows-1251 ANSI Cyrillic; Cyrillic (Windows)
- 1252 windows-1252 ANSI Latin 1; Western European (Windows)
- 1253 windows-1253 ANSI Greek; Greek (Windows)
- 1254 windows-1254 ANSI Turkish; Turkish (Windows)
- 1255 windows-1255 ANSI Hebrew; Hebrew (Windows)
- 1256 windows-1256 ANSI Arabic; Arabic (Windows)
- 1257 windows-1257 ANSI Baltic; Baltic (Windows)
- 1258 windows-1258 ANSI/OEM Vietnamese; Vietnamese (Windows)
- 1361 Johab Korean (Johab)
- 10000 macintosh MAC Roman; Western European (Mac)
- 10001 x-mac-japanese Japanese (Mac)
- 10002 x-mac-chinesetrad MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
- 10003 x-mac-korean Korean (Mac)
- 10004 x-mac-arabic Arabic (Mac)
- 10005 x-mac-hebrew Hebrew (Mac)
- 10006 x-mac-greek Greek (Mac)
- 10007 x-mac-cyrillic Cyrillic (Mac)
- 10008 x-mac-chinesesimp MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
- 10010 x-mac-romanian Romanian (Mac)
- 10017 x-mac-ukrainian Ukrainian (Mac)
- 10021 x-mac-thai Thai (Mac)
- 10029 x-mac-ce MAC Latin 2; Central European (Mac)
- 10079 x-mac-icelandic Icelandic (Mac)
- 10081 x-mac-turkish Turkish (Mac)
- 10082 x-mac-croatian Croatian (Mac)
- 12000 utf-32 Unicode UTF-32, little endian byte order; available only to managed applications
- 12001 utf-32BE Unicode UTF-32, big endian byte order; available only to managed applications
- 20000 x-Chinese_CNS CNS Taiwan; Chinese Traditional (CNS)
- 20001 x-cp20001 TCA Taiwan
- 20002 x_Chinese-Eten Eten Taiwan; Chinese Traditional (Eten)
- 20003 x-cp20003 IBM5550 Taiwan
- 20004 x-cp20004 TeleText Taiwan
- 20005 x-cp20005 Wang Taiwan
- 20105 x-IA5 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
- 20106 x-IA5-German IA5 German (7-bit)
- 20107 x-IA5-Swedish IA5 Swedish (7-bit)
- 20108 x-IA5-Norwegian IA5 Norwegian (7-bit)
- 20127 us-ascii US-ASCII (7-bit)
- 20261 x-cp20261 T.61
- 20269 x-cp20269 ISO 6937 Non-Spacing Accent
- 20273 IBM273 IBM EBCDIC Germany
- 20277 IBM277 IBM EBCDIC Denmark-Norway
- 20278 IBM278 IBM EBCDIC Finland-Sweden
- 20280 IBM280 IBM EBCDIC Italy
- 20284 IBM284 IBM EBCDIC Latin America-Spain
- 20285 IBM285 IBM EBCDIC United Kingdom
- 20290 IBM290 IBM EBCDIC Japanese Katakana Extended
- 20297 IBM297 IBM EBCDIC France
- 20420 IBM420 IBM EBCDIC Arabic
- 20423 IBM423 IBM EBCDIC Greek
- 20424 IBM424 IBM EBCDIC Hebrew
- 20833 x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended
- 20838 IBM-Thai IBM EBCDIC Thai
- 20866 koi8-r Russian (KOI8-R); Cyrillic (KOI8-R)
- 20871 IBM871 IBM EBCDIC Icelandic
- 20880 IBM880 IBM EBCDIC Cyrillic Russian
- 20905 IBM905 IBM EBCDIC Turkish
- 20924 IBM00924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
- 20932 EUC-JP Japanese (JIS 0208-1990 and 0212-1990)
- 20936 x-cp20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
- 20949 x-cp20949 Korean Wansung
- 21025 cp1025 IBM EBCDIC Cyrillic Serbian-Bulgarian
- 21027 (deprecated)
- 21866 koi8-u Ukrainian (KOI8-U); Cyrillic (KOI8-U)
- 28591 iso-8859-1 ISO 8859-1 Latin 1; Western European (ISO)
- 28592 iso-8859-2 ISO 8859-2 Central European; Central European (ISO)
- 28593 iso-8859-3 ISO 8859-3 Latin 3
- 28594 iso-8859-4 ISO 8859-4 Baltic
- 28595 iso-8859-5 ISO 8859-5 Cyrillic
- 28596 iso-8859-6 ISO 8859-6 Arabic
- 28597 iso-8859-7 ISO 8859-7 Greek
- 28598 iso-8859-8 ISO 8859-8 Hebrew; Hebrew (ISO-Visual)
- 28599 iso-8859-9 ISO 8859-9 Turkish
- 28603 iso-8859-13 ISO 8859-13 Estonian
- 28605 iso-8859-15 ISO 8859-15 Latin 9
- 29001 x-Europa Europa 3
- 38598 iso-8859-8-i ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
- 50220 iso-2022-jp ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
- 50221 csISO2022JP ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)
- 50222 iso-2022-jp ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)
- 50225 iso-2022-kr ISO 2022 Korean
- 50227 x-cp50227 ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)
- 50229 ISO 2022 Traditional Chinese
- 50930 EBCDIC Japanese (Katakana) Extended
- 50931 EBCDIC US-Canada and Japanese
- 50933 EBCDIC Korean Extended and Korean
- 50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
- 50936 EBCDIC Simplified Chinese
- 50937 EBCDIC US-Canada and Traditional Chinese
- 50939 EBCDIC Japanese (Latin) Extended and Japanese
- 51932 euc-jp EUC Japanese
- 51936 EUC-CN EUC Simplified Chinese; Chinese Simplified (EUC)
- 51949 euc-kr EUC Korean
- 51950 EUC Traditional Chinese
- 52936 hz-gb-2312 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
- 54936 GB18030 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
- 57002 x-iscii-de ISCII Devanagari
- 57003 x-iscii-be ISCII Bengali
- 57004 x-iscii-ta ISCII Tamil
- 57005 x-iscii-te ISCII Telugu
- 57006 x-iscii-as ISCII Assamese
- 57007 x-iscii-or ISCII Oriya
- 57008 x-iscii-ka ISCII Kannada
- 57009 x-iscii-ma ISCII Malayalam
- 57010 x-iscii-gu ISCII Gujarati
- 57011 x-iscii-pa ISCII Punjabi
- 65000 utf-7 Unicode (UTF-7)
- 65001 utf-8 Unicode (UTF-8)
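Author: lqh123108  Time: 2014-11-4 11:00
Thank you very much. The purpose of my program is to download web page content: every character that ANSI supports is saved to an ANSI text file, and every character it does not support (for example, one that exists only in Unicode) is deleted.
Author: lqh123108  Time: 2014-11-4 11:05
Also, I have read a lot of material. Are the characters in a web page variable-length Unicode characters, that is, "a" is a one-byte character and the Chinese character "中" is a two-byte character? And I suppose the default ANSI (XP, Simplified Chinese) covers the characters of encodings such as ANSI-1, ANSI-2 and GB2312, and the only ones it leaves out are the three-byte or four-byte Unicode characters. Is that understanding correct?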
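For reference, byte counts depend on the encoding, not on the character alone. In UTF-8, the variable-length encoding most web pages use, "a" takes 1 byte but "中" (U+4E2D) takes 3; it takes 2 bytes in GBK and in UTF-16. The sketch below counts UTF-8 bytes for a VBS string. Utf8Length is a made-up name, and the function assumes that surrogates in the string come in well-formed pairs.

```vbscript
' Hypothetical helper: number of bytes the string would occupy in UTF-8.
' VBS strings are UTF-16, so the loop looks at one 16-bit code unit at a time.
Function Utf8Length(text)
    Dim i, u, total
    total = 0
    For i = 1 To Len(text)
        ' AscW returns a signed Integer; mask it to 0..65535.
        u = AscW(Mid(text, i, 1)) And &HFFFF&
        If u < &H80 Then
            total = total + 1          ' ASCII
        ElseIf u < &H800 Then
            total = total + 2
        ElseIf u >= &HD800& And u <= &HDFFF& Then
            total = total + 2          ' half of a surrogate pair (4 bytes per pair)
        Else
            total = total + 3          ' rest of the BMP, including most Chinese characters
        End If
    Next
    Utf8Length = total
End Function

WScript.Echo Utf8Length("a")                               ' 1
WScript.Echo Utf8Length(ChrW(&H4E2D))                      ' 3  (U+4E2D, "中")
WScript.Echo Utf8Length(ChrW(&HD83D&) & ChrW(&HDE00&))     ' 4  (U+1F600 as a surrogate pair)
```

So "three or four bytes in UTF-8" does not by itself mean "missing from the ANSI code page": most GB2312 characters are three bytes in UTF-8.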
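Author: yu2n  Time: 2014-11-4 11:20
Reply to 3# lqh123108
ANSICodePage on a Simplified Chinese system:
936 gb2312 ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)
Taking a Simplified Chinese system as an example:
Use a For loop to walk the string and take one character at a time, then search the gb2312 character set for that character; if it is not there, then …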
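The loop described above might look like the following sketch. Instead of searching a gb2312 table, it uses a round trip through Asc and Chr, which work in whatever ANSI code page the system is using, so the result depends on the machine it runs on. IsAnsiChar and KeepAnsiOnly are names invented for this example, and the assumption that a character missing from the code page fails the round trip should be checked on the target system.

```vbscript
' Hypothetical helper: True when ch survives a round trip through the
' current ANSI code page (Asc to a code, Chr back to a character).
Function IsAnsiChar(ch)
    Dim code
    IsAnsiChar = False
    On Error Resume Next               ' any conversion error counts as "not representable"
    code = Asc(ch)
    If Err.Number = 0 Then IsAnsiChar = (Chr(code) = ch)
    On Error GoTo 0
End Function

' Hypothetical helper: return text with every non-representable character removed.
Function KeepAnsiOnly(text)
    Dim i, ch, result
    result = ""
    For i = 1 To Len(text)
        ch = Mid(text, i, 1)
        If IsAnsiChar(ch) Then result = result & ch
    Next
    KeepAnsiOnly = result
End Function

' ChrW(&H2603) is U+2603; whether it is kept depends on the system code page.
WScript.Echo KeepAnsiOnly("abc" & ChrW(&H2603) & "def")
```

Comparing Chr(Asc(ch)) with ch also rejects characters that the system only maps to a similar-looking substitute, which suits the goal of saving a faithful ANSI text file.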
Author: yu2n  Time: 2014-11-4 14:13
I found some characters that do not exist in Simplified Chinese gb2312/gbk:
- ¢£¥ˉ′μàáèéêìíòóùúüāēěīōū∥ǎǐǒǔǖǘǚǜ ̄°~一二三四上中下甲乙丙丁天地人月火水木金土日有社名特财祝劳代呼学监企资协祭休自至一二三四五六七八九十月火水木金土日株有社名特财祝劳秘男女适优注项休写医宗学监企资协夜豈更車賈滑串句龜龜契金喇奈懶癩羅蘿螺裸邏樂洛烙珞落酪駱亂卵欄爛蘭鸞嵐濫藍襤拉臘蠟廊朗浪狼來冷勞擄櫓爐盧老蘆虜路露魯鷺碌祿綠菉錄鹿論壟弄籠聾牢磊賂雷壘屢樓淚漏累縷陋勒肋凜凌稜綾菱陵讀拏樂諾丹寧怒率異北磻便復不泌數索參塞省葉說殺辰沈拾若掠略亮兩梁糧良諒量勵呂女廬旅濾礪閭驪麗黎力曆歷轢年憐戀撚漣煉璉練聯輦蓮連鍊列劣咽烈裂說廉念捻殮簾獵令囹寧嶺怜玲瑩羚聆鈴零靈領例禮醴隸惡了僚寮尿料樂燎療蓼遼龍暈阮劉杻柳流溜琉留硫紐類六戮陸倫崙淪輪律慄栗率隆利吏履易李梨泥理痢罹裡里離匿溺吝燐璘藺鱗麟林淋臨立笠粒狀炙識什茶刺切度拓糖宅洞暴輻行降見廓塚晴凞豬益神祥福靖精羽諸逸都飯飼館鶴
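---- It seems the forum database cannot display them and transcoded them, so many of the characters shown above are not the real ones. Try the following code:
' asc.vbs --- compare Asc with AscW
' This algorithm is slow: it took 296 seconds on an i3 2100 processor.
' If the generated text file (asc.vbs.log) looks garbled in Notepad, change the Notepad font:
' "Format" menu -> Font -> choose "宋体" (SimSun)

dt_star = Now()

For i = 1 To 65535
    ch = ChrW(i)
    ' Asc value of the character, re-read from its hexadecimal form
    code = CLng("&H" & Hex(Asc(ch)))
    If code < -32768 Or code > 32767 Then
        ' Look for a Chr value that produces exactly this character
        IS_FIND = False
        For j = -32768 To 32767
            If ch = Chr(j) Then IS_FIND = True : Exit For
        Next
        If IS_FIND = False Then
            strTxt = strTxt & ch
            strVbs = strVbs & " & ChrW(" & i & ")"
        End If
    End If
Next

' Write the characters found to <script name>.log (2 = ForWriting, -1 = Unicode)
Set fso = CreateObject("Scripting.FileSystemObject")
Set wTxt = fso.OpenTextFile(WScript.ScriptName & ".log", 2, True, -1)
wTxt.Write strTxt
wTxt.Close

' Write a script that echoes the same characters to <script name>.log.vbs
strVbs = "WScript.Echo """" " & strVbs
Set wVbs = fso.OpenTextFile(WScript.ScriptName & ".log.vbs", 2, True, -1)
wVbs.Write strVbs
wVbs.Close

WScript.Echo "Asc/AscW comparison finished, elapsed " & DateDiff("s", dt_star, Now()) & " seconds."
WScript.Echo strTxt
WScript.Echo strVbs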
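Author: lqh123108  Time: 2014-11-7 11:16
Reply to 5# yu2n
Thanks.
I found that for Unicode characters that ANSI does not support, if the character is four bytes long, taking single characters with for x=1 to len(str) vstr=mid(str,x,1) goes wrong: the four bytes are split apart and handled as two characters.
Also, does the GB2312 character set actually include ANSI and the like?
And where can I find this character set?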
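The four-byte characters mentioned above are stored in a VBS string as two 16-bit units (a surrogate pair), which is why Len counts them twice and Mid(str, x, 1) cuts them in half. Below is a sketch of a loop that keeps each pair together; the function names are invented for this example.

```vbscript
' Hypothetical helpers: classify one UTF-16 code unit.
Function IsHighSurrogate(ch)
    Dim u
    u = AscW(ch) And &HFFFF&           ' AscW is signed; mask to 0..65535
    IsHighSurrogate = (u >= &HD800& And u <= &HDBFF&)
End Function

Function IsLowSurrogate(ch)
    Dim u
    u = AscW(ch) And &HFFFF&
    IsLowSurrogate = (u >= &HDC00& And u <= &HDFFF&)
End Function

' Hypothetical helper: split text into an array of whole characters,
' keeping each surrogate pair as one two-unit element.
Function SplitCharacters(text)
    Dim parts(), count, i, ch
    If Len(text) = 0 Then
        SplitCharacters = Array()
        Exit Function
    End If
    count = 0
    i = 1
    Do While i <= Len(text)
        ch = Mid(text, i, 1)
        If IsHighSurrogate(ch) And i < Len(text) Then
            If IsLowSurrogate(Mid(text, i + 1, 1)) Then ch = Mid(text, i, 2)
        End If
        ReDim Preserve parts(count)
        parts(count) = ch
        count = count + 1
        i = i + Len(ch)
    Loop
    SplitCharacters = parts
End Function

Dim s, chars
s = "A" & ChrW(&HD83D&) & ChrW(&HDE00&) & "B"   ' "A", U+1F600, "B"
chars = SplitCharacters(s)
WScript.Echo Len(s)               ' 4
WScript.Echo UBound(chars) + 1    ' 3
```

For the goal of dropping everything that ANSI cannot hold, the pair can then be tested or removed as one unit instead of leaving half of it behind.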
Author: lqh123108  Time: 2014-11-7 11:18
Last edited by lqh123108 on 2014-11-7 12:17
Reply to 6# yu2n
Could you explain what this code does?
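Author: yu2n  Time: 2014-11-7 17:10
Reply to 8# lqh123108
Roughly, it compares the value ranges of the chr and chrw functions that correspond to asc and ascw.
It also shows the difference between the system code page and the Unicode character set.
My knowledge of this area is shallow; please ask someone else.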
Author: lqh123108  Time: 2014-11-15 16:56
Reply to 9# yu2n
Thank you very much.
Welcome to 批处理之家 (http://www.bathome.net/)