This post was last edited by CrLf on 2016-10-7 02:00
Reply to 15# plp626
8192 的 Base64 压缩后是 4987 字节(含映射表,比 Base64 多用了 $@#| 四种字符):- $20$1410@26144@U$@@2814@5B$@@2aUQ@i@|$10KFAAAChQAAAoUAAA$11kJCQ$13CQkJ$14A|TVqQAAM$144E$144//8AALg$149Q$1410$14bg$1454fug4AtAnNIbgBTM0hVGhpcyBwcm9ncmFtIGNhbm5vdCBiZSBydW4gaW4gRE9TIG1vZGUuDQ0KJ$149BQRQAATAEFAJOHz1c$14aOAADwMLAQI4AAw$144U$145gAAIBI$144Q$144I$@284Q$145gAAB$145E$144E$14aBg$144BAAAUV0AAAM$146CAAAB$146EAAAE$148B$14gBQAABAAw$204$147C50ZXh0$144FAo$144Q$144D$145Q$14iGAAAGAuZGF0YQAAAE$146I$145I$144Q$14iBAAADALnJkYXRhAADw$145D$145C$144Eg$14iQAAAQC5ic3M$145s$@28uIAAAMAuaWRhdGEAAEAD$144U$145Q$144U$14iBAAAD$20g$14mFWJ5YPsGIld+ItVCDHbiXX8iwIx9osAPZEAAMB3Qz2NAADAclu+AQAAAMcEJAg$144xwIlEJATopAgAAIP4AXRshcB0KscEJAgAAAD/0Lv/////idiLdfyLXfiJ7F3CBAA9kwAAwHS9PZQAAMB0u4nYi3X8i134iexdwgQAjXYAPQUAAMB16McEJAs$144x9ol0JAToRwgAAIP4AXQ0hcB0zccEJAsAAAD/0OuhxwQkC$144LsB$144iVwkBOgeCAAAhfZ0iOhFAwAAu//////rgccEJAsAAAC5AQAAALv/////iUwkBOj0BwAA6WL////rDZ$134BVieVTg+wkjV34xwQkABBAAOiKCAAAg+wE6PICAADHRfg$145uABAQACNVfSJXCQQiw0AIEAAiUQkBIlUJAiJTCQMxwQkBEBAAOjBBwAAoRBAQACFwHRYoxAgQACLFSRRQACF0g+FiwAAAIP64HQgoRBAQACJRCQEix0kUUAAi0swiQwk6HYHAACLFSRRQACD+sB0G4sdEEBAAIlcJASLDSRRQACLUVCJFCToUAcAAOg7BwAAix0QIEAAiRjoHgIAAIPk8OgGBwAAiwiJTCQIixUAQEAAiVQkBKEEQEAAiQQk6KkAAACJw+jSBgAAiRwk6LoHAACJRCQEixUkUUAAi0IQiQQk6PUGAACLFSRRQADpVf///412AI28Jw$144BVieWD7AjHBCQB$144/xUcUUAA6Mj+//+QjbQm$145FWJ5YPsCMcEJAIAAAD/FRxRQADoqP7//5CNtCY$145VYsNNFFAAInlXf/hjXQmAFWLDShRQACJ5V3/4ZCQkJBVieVd6VcDAACQkJCQkJCQVbgQ$144ieVXVlOD7AyLXQiLdQyD5PDodAUAAOgPAgAAxkXzAIP7Ab/oAwAAD4+C$144g/sCfiqLRggPtgiA+XgPlcIxwID5WA+VwIXCdF+A+WQPlMCA+UQPlMIJ0KgBdWUx9oX/D5TAOf4PnMIJ0KgBD4SL$144uwEAAADrCpBDgfv+$144f2iJHCTobwYAAIPsBGaFwHnngH3zeHRAgH3zZHQoidiNZfRbXl9dw8ZF83jrrYtGBIkEJOjgBQAAicfpbP///8ZF82TrlYlcJATHBCQAMEAA6LMFAADrxolcJATHBCQDMEAA6KEFAADrtMcEJAEAAABG6BIFAADpYf///4B983h0LIB982R0B7j/$14465HHBCQAMEAAuP8AAACJRCQE6GUFAAC4/wAAAOly////xwQkAzBAALr/$144iVQkBOhGBQAA69+QkJCQVbnwMEAAieXrFI22$145ItRBIsBg8EIAYIAAEAAgfnwMEAAcupdw5CQkJCQkJCQVYnl2+Ndw5CQkJCQkJCQkFWJ5YPsCKEgIEAAiwiFyXQm6w2Q
$114/xCLDSAgQACLUQSNQQSjICBAAIXSdenJw420Jg$144BVieVTg+wEoQAaQACD+P90KYXAicN0E4n2jbwn$145P8UnQAaQABLdfbHBCQgFEAA6Mr9//9bW13Diw0EGkAAMcCFyesKQIsUhQQaQACF0nX0672Ntg$144CNvw$144BVieVTg+wEoSBAQACFwHU2oQAaQAC7AQAAAIkdIEBAAIP4/3QlhcCJw3QPkI10JgD/FJ0AGkAAS3X2xwQkIBRAAOha/f//W1tdw4sNBBpAADHAhcnrCkCLFIUEGkAAhdJ19OvB$114VaFwQEAAieVdi0gE/+GJ9lW6QgAAAInlUw+3wIPsZIlUJAiNVagx24lUJASJBCT/FfRQQAC6HwAAALkB$144g+wMhcB1B+tGAclKeA6AfCqoQXX0CcsByUp58oM7PHUHidiLXfzJw7k0MEAAuuoAAACJTCQMiVQkCMcEJGEwQAC4gDBAAIlEJATokgIAALisMEAAu+QAAACJRCQMiVwkCOvXjbQm$145I28Jw$144BVieVXVlOB7MwAAACLDXBAQACFyXQIjWX0W15fXcPHRZhBQUFBoRAwQACNdZjHRZxBQUFBx0WgQUFBQYlFuKEUMEAAx0WkQUFBQcdFqEFBQUGJRbyhGDBAAMdFrEFBQUHHRbBBQUFBiUXAoRwwQADHRbRBQUFBiUXEoSAwQACJRcihJDBAAIlFzKEoMEAAiUXQoSwwQACJRdQPtwUwMEAAZolF2Ik0JP8V8FBAAA+3wIPsBIXAiYVE////D4U7AQAAxwQkP$144OijAgAAhcCJww+EWQEAAPyJx4uFRP///7kP$14486vHQwRgGUAAuQEAAADHQwgwFUAAoUBAQADHAzwAAACLFURAQADHQyg$145iUMUoTAgQACJUxiLFTQgQACJQxyhUEBAAIlTIMdDMP////+JQyyLFTwgQAChOCBAAIlTOLof$144iUM0ifaJ2CHIg/gBGcAkIAHJBEGIhCpI////SnnnoRAwQACJhWj///+hFDBAAImFbP///6EYMEAAiYVw////oRwwQACJhXT///+hIDBAAImFeP///6EkMEAAiYV8////oSgwQACJRYChLDBAAIlFhA+3BTAwQABmiUWIjYVI////iQQk/xXoUEAAD7f4g+wEhf91QjHShdJ1HokcJOhzAQAAiTQk/xXwUEAAg+wED7fA6F/9//+Jw4kdcEBAAI1DBKNgQEAAjUMIo4BAQACNZfRbXl9dw4n46Dj9//852In6dbHrsehLAQAAkJCQkJCQkJCQkJBRieGDwQg9ABAAAHIQgekAEAAAgwkALQAQAADr6SnBgwkAieCJzIsIi0AE/+CQkJBVieWD7BiLRRSJRCQQi0@2akDItFDIlEJAiLRQiJRCQEoSRRQACDwECJBCTo/gAAAKEkUUAAg8BAiQQk6N4AAADoyQAAAJCQkJCQkJCQkP8lBFFAAJCQ$14aD/JRxRQACQk$14b/yUgUUAAkJ$14bP8lFFFAAJCQ$14aD/JVBRQACQk$14b/yUYUUAAkJ$14bP8lLFFAAJCQ$14aD/JRBRQACQk$14b/yVMUUAAkJ$14bP8lOFFAAJCQ$14aD/JURRQACQk$14b/yVIUUAAkJ$14bP8lMFFAAJCQ$14aD/JTxRQACQk$14b/yVAUUAAkJ$14bP8lXFFAAJCQ$14aD/JfhQQACQk$14b/yXsUEAAkJ$14bP8l9FBAAJCQ$14aD/JfBQQACQk$14b/yXoUEAAkJ$14bFWJ5V3ph/j//5CQkJCQkJD/////8Bl$147D/////$20i$14dP////8$14hQ$14jEBp$14nD/////$145P////8$20g$14lJWQAJXg$14eC1MSUJHQ0NXMzItRUgtMi1TSkxKLUdUSFItTUlOR1czMgAAAHczMl9zaGFyZWRwdHItPnNpemUgPT0gc2l6ZW9mKFczMl9FSF9TSEFSRUQp
ACVzOiV1OiBmYWlsZWQgYXNzZXJ0aW9uIGAlcycKAAAuLi8uLi9nY2MvZ2NjL2NvbmZpZy9pMzg2L3czMi1zaGFyZWQtcHRyLmMAAEdldEF0b21OYW1lQSAoYXRvbSwgcywgc2l6ZW9mKHMpKSAhPSAw$20b$146BoU$14dDAUgAA6FAAAIRQ$14dNRSAAAEUQAAkF$14eJFMAABBRAADcU$14e0UwAAXFE$14yGRRAABwUQAAgFEAAIxRAACcUQ$14cC8UQ$14cDIUQAA2FEAAOhRAAD4UQAADFIAABhSAAAgUgAALFIAADhSAABAUgAATFIAAFRSAABgUgAAbFIAAHRSAACAUgAAjFI$14dmFI$14dZFEAAHBRAACAUQAAjFEAAJxR$14dLxR$14dMhRAADYUQAA6FEAAPhRAAAMUgAAGFIAACBSAAAsUgAAOFIAAEBSAABMUgAAVFIAAGBSAABsUgAAdFIAAIBSAACMUg$14cCYUg$148EAQWRkQXRvbUEAAJsARXhpdFByb2Nlc3MAAACvAEZpbmRBdG9tQQDcAEdldEF0b21OYW1lQQAA3wJTZXRVbmhhbmRsZWRFeGNlcHRpb25GaWx0ZXIAAABIAF9zbGVlc$146nAF9fZ2V0bWFpbmFyZ3MAPABfX3BfX2Vudmlyb24AAD4AX19wX19mbW9kZQ$144BQAF9fc2V0X2FwcF90eXBl$145HkAX2NleGl0$145OkAX2lvYgAAXgFfb25leGl0$144hAFfc2V0bW9kZQAAFQJhYm9ydAAcAmF0ZXhpd$146eAmF0b2kAADACZmZsdXNo$145DkCZnByaW50ZgAAAD8CZnJlZQAAcgJtYWxsb2M$145fwJwcmludGY$145kAJzaWduYWw$1453gBHZXRBc3luY0tleVN0YXRl$@26F$144BQ$@26FAAAEtFUk5FTDMyLmRsb$146UUAAAbXN2Y3J0LmRsbAAA$105KFAAAChQAABtc3ZjcnQuZGxsAAA8UAAAVVNFUjMyLmRsb$207$147
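Basic idea:
Borrowing the principle of a convolutional neural network, the compression process is split into alternating sampling layers and "convolution" layers (whatever that is).
Layer-1 sampling -> layer-1 compression -> layer-2 sampling -> layer-2 compression -> and so on. With parameter variables added, this can in theory be extended to an unlimited number of sampling layers.
On first use, special characters are escaped first (layer 0).
Layer 1 gives priority to matching the AAAAAAA pattern.
From layer 2 onward, both the AAAAAAA and the AAABAAACDAAAEAAA patterns are matched.
The sampling layer parses and counts these two basic compressible forms, AAAAAAA and AAABAAACDAAAEAAA, estimates the saving for each, sorts the samples by that estimate, and keeps the ones that compress best.
Because the sampled content also includes the mapping tables, given enough layers and large enough parameters, any repeated string can eventually be reduced to these two basic forms.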
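The sampling step for the AAAAAAA pattern can be sketched roughly as follows. This is a simplified illustration, not the script from this post: the name sampleRuns and the field name saving are made up for the example, and only the plain-run pattern is handled. The cost model (a 4-character index per occurrence, plus a 3-character table prefix and the unit itself for the first one) is taken from the arithmetic in the script further down.

```javascript
// Hypothetical helper: collect repeated runs and rank them by estimated saving.
function sampleRuns(text) {
  // A unit followed by 2 to 36 more copies of itself, as in the layer-1 pattern.
  var pattern = /(.+?)\1{2,36}/g;
  var found = {};
  var m;
  while ((m = pattern.exec(text)) !== null) {
    var unit = m[1];
    var key = '$' + unit;
    if (!(key in found)) {
      // First sighting: pay for the table entry ("$" + 2 digits + unit).
      found[key] = { from: unit, saving: -(unit.length + 3) };
    }
    // Every occurrence is replaced by a 4-character index such as "$117".
    found[key].saving += m[0].length - 4;
  }
  var list = [];
  for (var k in found) {
    if (found[k].saving > 0) list.push(found[k]);
  }
  // Best estimated saving first.
  list.sort(function (a, b) { return b.saving - a.saving; });
  return list;
}

var ranked = sampleRuns('AAAAAAAAAAxyBCBCBCBCBCBCzz');
```

Here the run of BC ranks ahead of the run of A because its estimated saving is larger, and zz is ignored because it is too short to match.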
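The processing ("convolution") layer compresses according to the sampling layer's suggestions, replacing each matched sample with an index. There are two kinds of index: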
-----------------------------------------------------------------------------------------------------------
AAAAAAA is mapped to $117, which reads as:
Mode  Layer  Index in layer  Repeat count
$     1      1               7
The mapping-table entry for $117 is $11A, where A is the original string.
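As a small illustration of this layout, the index and its table entry can be assembled like this. The function names encodeRun and runTableEntry are invented for the example; the only thing taken from the post is the field order and the use of base 36 for each field.

```javascript
// Hypothetical helpers for the "$" (plain run) index.
function encodeRun(layer, index, count) {
  // mode + layer + index in layer + repeat count, each written in base 36
  return '$' + layer.toString(36) + index.toString(36) + count.toString(36);
}

function runTableEntry(layer, index, unit) {
  // mode + layer + index in layer + the repeated unit itself
  return '$' + layer.toString(36) + index.toString(36) + unit;
}

var token = encodeRun(1, 1, 7);        // index for seven repeats of entry 1 in layer 1
var entry = runTableEntry(1, 1, 'A');  // table entry saying that the unit is "A"
var longer = encodeRun(1, 1, 18);      // 18 is written as the base-36 digit "i"
```

Each field stays a single character only while its value is below 36, which is worth keeping in mind when reading the compressed output by hand.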
-----------------------------------------------------------------------------------------------------------
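AAABAAACDAAAEAAA is mapped to @12, which reads as:
Mode  Layer  Index in layer
@     1      2
The mapping-table entry for @12 is @12AAA@B@CD@E@, where AAA is the repeated string and @B@CD@E@ is the sequence it occurs in (each @ stands for one occurrence of AAA).
In this mode, strings in the surrounding range are sampled and tested, according to the length of AAA, to find the sample that compresses best.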
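The relation between the original string, the repeated string and the stored sequence can be shown in a few lines. The helper names below are made up for the example, and the sketch assumes that the gaps between repeats contain neither @ nor the repeated string itself.

```javascript
// Hypothetical helpers for the "@" (repeat with gaps) mapping.
function makeList(str, from) {
  // Replace every occurrence of the repeated string with "@".
  return str.split(from).join('@');
}

function likeTableEntry(layer, index, from, list) {
  // mode + layer + index in layer + repeated string + sequence
  return '@' + layer.toString(36) + index.toString(36) + from + list;
}

function expandList(list, from) {
  // Inverse of makeList: put the repeated string back in place of each "@".
  return list.split('@').join(from);
}

var list = makeList('AAABAAACDAAAEAAA', 'AAA');
var entry = likeTableEntry(1, 2, 'AAA', list);
var back = expandList(list, 'AAA');
```

With these inputs, list is the sequence @B@CD@E@ and entry is the table entry @12AAA@B@CD@E@ quoted above.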
-----------------------------------------------------------------------------------------------------------
For example, this string:
AAAAAAAAAAAAAAAAAAEFAAABAAACAAADAAAEFAAAGAAAHHAAAAI
is compressed to (the part up to the | is the mapping table, the part after it is the compressed encoding):
@10AAA@@B@C@D@EF@G@HH@$11A|$11iEF@10AI
-----------------------------------------------------------------------------------------------------------
I did not write a decompression script, only the compression script. By default it performs 3 layers of sampling/compression:
1>1/* :
@echo off
set "Input=original.txt"
set "Output=compressed.txt"

rem Run this same file as JScript; "%~f0" is the full path of this script
cscript /nologo /e:jscript "%~f0" <"%Input%" 2>"%Output%"
pause
exit /b
*/

var DEBUG = true
var maxdeep = 3     // number of sampling/compression layers

var base64 = WSH.StdIn.ReadAll()
var map = []

var len = base64.length
var text = compress(base64.replace(/\r\n/gm,''), maxdeep)
WSH.StdOut.WriteLine('')
WSH.StdOut.WriteLine(text)
WSH.StdOut.WriteLine('')
WSH.StdOut.WriteLine('#Before = #' + len)
WSH.StdOut.WriteLine('#After = #' + text.length)

// The result goes to stderr, which the batch header redirects to the output file
WSH.StdErr.WriteLine(text)


function compress(text, deep){
    deep = deep || 3
    if(deep>36)deep = 36

    // Layer 0: the four characters the format reserves for itself
    var from_list = [
        [
            {from:'#', mode:'#'},
            {from:'$', mode:'#'},
            {from:'@', mode:'#'},
            {from:'|', mode:'#'}
        ]
    ]

    text = convolution(text,from_list[0],0)

    var len = text.length
    for(var i=1; i<=deep; i++){
        var $len = text.length
        var from_arr = sample(text, i)          // sampling layer
        from_list.push(from_arr)
        text = convolution(text,from_arr,i)     // compression layer
        text = getHead(from_arr,i) + text       // prepend this layer's mapping table
        if(DEBUG)WSH.Echo('\r\n'+len +' -> '+$len+' -> '+text.length+'\r\n')
    }

    return text.replace(/^\|+/,'').replace(/\|+/,'|')


    // Build the mapping table for one layer, terminated by '|'
    function getHead(from_arr){
        if(!from_arr.length)return ''
        var head=''

        for(var i=0; i<from_arr.length; i++){
            switch(from_arr[i].mode){
                case '$':   // $ + layer/index + repeated unit
                    head+='$'+from_arr[i].map+from_arr[i].from
                    break

                case '@':   // @ + layer/index + repeated string + sequence
                    head+='@'+from_arr[i].map+from_arr[i].from+from_arr[i].list
                    break
            }
        }
        head+='|'
        return head
    }

    function sample(text, index_deep){
        var from_map_repeat = {}
        var from_map_like = {}
        var from_map_like_stack = []
        var matches

        // Layer 1 looks only for plain runs; deeper layers also look for
        // a string that recurs with short gaps of 1 to 5 characters
        if(index_deep==1){
            var pattern = /(.+?)\1{2,36}/gm
        } else {
            var pattern = /(.+?)\1{2,36}|(..+)(?:[^@]{1,5}?\2)+/gm
        }
        pattern.lastIndex = 0

        while((matches=pattern.exec(text)) != null){
            if(DEBUG)WSH.StdOut.Write('\r\t\t\t\rpattern.lastIndex = ' + pattern.lastIndex)

            if(matches[1]){
                checkrepeat(pattern,matches[0],matches[1])
            } else {
                checklike(pattern,matches[0],matches[2])
            }
        }

        // Estimate the saving for a plain run. The first occurrence also pays for
        // the table entry (3 characters plus the unit); each one costs a 4-character index.
        function checkrepeat(pattern,str,$1){
            var cutoff = str.length - $1.length - 7
            if('$'+$1 in from_map_repeat){
                cutoff = str.length - 4
                from_map_repeat['$'+$1].cutoff += cutoff
            } else {
                cutoff = str.length - $1.length - 7
                from_map_repeat['$'+$1] = {from:$1, mode:'$', cutoff:cutoff}
            }
            from_map_like_stack = []
        }

        function checklike(pattern,str,$2){
            var $$2 = $2.replace(/\W/gm,'\\$&')

            var nextIndex = $2.length<2 ? pattern.lastIndex : pattern.lastIndex - str.length + 1

            str = str.replace(
                new RegExp('('+$$2+')(?='+$$2+')$','gm'),
                function(repeat){pattern.lastIndex-=repeat.length;return ''}
            )
            var list = convolution(str,from_list[0],index_deep).replace(new RegExp($2.replace(/\W/gm,'\\$&'),'gm'),'@')

            var cutoff
            if(str in from_map_like){
                cutoff = str.length - 4
                from_map_like[str].cutoff += str.length - 4
            } else {
                str=list.replace(/@/g,$2.replace(/\W/g,'\\$&'))
                cutoff = str.length - list.length - 7
                from_map_like[str] = {from:$2, mode:'@', str:str, list:list, cutoff:cutoff}
            }
            from_map_like[str].$length = $2.length+list.substr(1).search('@')
            var thisstack = {lastIndex:pattern.lastIndex, $length:$2.length+list.substr(1).search('@'), flag:true, cutoff:cutoff, ref:from_map_like[str]}
            from_map_like_stack.push(thisstack)

            for(var i=from_map_like_stack.length;i--;){
                var stack = from_map_like_stack[i]
                var endof = stack.lastIndex+stack.$length

                if(endof<pattern.lastIndex){
                    from_map_like_stack.splice(i+1)
                    break
                } else if(stack.cutoff > cutoff){
                    if(stack.ref.flag)stack.ref.cutoff -= cutoff
                    stack.ref.flag = false
                } else {
                    thisstack.ref.cutoff -= cutoff
                    thisstack.flag = false
                }
            }

            pattern.lastIndex = nextIndex

            return str
        }

        var from_arr = []
        for(var from in from_map_repeat)from_arr.push(from_map_repeat[from])
        for(var from in from_map_like)from_arr.push(from_map_like[from])

        // Drop samples whose estimated saving is below 2 characters
        for(var i=from_arr.length;i--;)if(from_arr[i].cutoff<2)from_arr.splice(i,1)

        // Sort by estimated saving, then by length of the repeated string
        from_arr = from_arr.sort(
            function(a,b){
                return b.cutoff - a.cutoff
            }
        ).sort(
            function(a,b){return b.from.length - a.from.length}
        )

        if(DEBUG){
            WSH.Echo('')
            WSH.Echo('')
            WSH.Echo('Deep = '+ index_deep)
            for(var i in from_arr){
                WSH.Echo(from_arr[i].from+'->'+ from_arr[i].mode+index_deep+'(?)')
                for(var j in from_arr[i]){
                    WSH.Echo(j + '\t= '+ from_arr[i][j])
                }
                WSH.Echo('')
            }
        }

        return from_arr
    }


    // Apply one layer: replace every sample found in text with its index
    function convolution(text,from_arr,index_deep){
        for(var i=0,count=0; i<from_arr.length; i++){
            from_arr[i].done = false

            switch(from_arr[i].mode){
                case '$':
                    // '$' + layer + index + repeat count, all in base 36
                    // (a run of exactly 36 is written as the two digits '10')
                    from_arr[i].map = index_deep.toString(36) + count.toString(36)
                    text = text.replace(
                        new RegExp('('+(from_arr[i].from.replace(/\W/gm,'\\$&'))+'){4,36}','gm'),
                        function(str,from){count++;from_arr[i].done=true;return '$'+from_arr[i].map+parseInt(str.length / from.length).toString(36)}
                    )
                    break

                case '@':
                    // '@' + layer + index stands for the whole stored sequence
                    from_arr[i].map = index_deep.toString(36) + count.toString(36)
                    text = text.replace(
                        new RegExp(from_arr[i].str.replace(/\W/gm,'\\$&'),'gm'),
                        function(str,from){count++;from_arr[i].done=true;return '@'+from_arr[i].map}
                    )
                    break

                case '#':
                    // Layer-0 entries define no map, so a reserved character would be
                    // emitted as '#undefined'; Base64 input never contains one
                    text = text.replace(
                        new RegExp(from_arr[i].from.replace(/\W/gm,'\\$&'),'gm'),
                        function(str,from){count++;from_arr[i].done=true;return '#'+from_arr[i].map}
                    )
                    break
            }

            // Stop once 36 replacements have been counted in this layer
            if(count>=36){
                from_arr.splice(i)
                break
            }
        }

        // Keep only the samples that were actually used
        for(var i=from_arr.length;i--;){
            if(!from_arr[i].done)from_arr.splice(i,1)
        }

        return text
    }
}
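Since no decompression script was posted, here is a rough sketch of how a single layer of the encoded body could be expanded. It is not the author's code and makes several simplifying assumptions: the mapping table is supplied as an already-parsed object (parsing it out of the header is not attempted), every field of an index is a single base-36 digit, # escapes are ignored, and only one layer is undone. The names expandBody and table are invented for the example.

```javascript
// Hypothetical helper: expand "$" and "@" indexes in body using a parsed table.
// table maps the first three characters of an index to the stored text:
// for "$" entries the repeated unit, for "@" entries the full expanded sequence.
function expandBody(body, table) {
  var out = '';
  var i = 0;
  while (i < body.length) {
    var ch = body.charAt(i);
    if (ch === '$') {
      // "$" + layer + index + repeat count
      var unit = table[body.substring(i, i + 3)];
      var count = parseInt(body.charAt(i + 3), 36);
      out += new Array(count + 1).join(unit);
      i += 4;
    } else if (ch === '@') {
      // "@" + layer + index
      out += table[body.substring(i, i + 3)];
      i += 3;
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}

// Table for the short worked example, written out by hand.
var table = {
  '$11': 'A',
  '@10': '@B@C@D@EF@G@HH@'.split('@').join('AAA')
};
var restored = expandBody('$11iEF@10AI', table);
```

For output produced with several layers, the same expansion would presumably have to be applied once per layer, starting from the last layer that was compressed, since each later layer also compresses the tables of the earlier ones.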