Original poster | Posted on 2021-7-10 13:04:29
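As I made clear earlier, KSDRIP is not open source, and the DA3 files it generates are automatically converted from UTF-16LE to GB2312, which loses the index and special characters. It also cannot parse the audio libraries.
What I have here is low-level parsing. It only demonstrates technical feasibility and was done just for fun; feel free to ignore it if it is not to your taste.
With this source code, Kingsoft PowerWord (金山词霸) DIC and ADIC files can be supported on any platform.
So far I have worked out the dictionary file formats of most Chinese dictionary products, including Youdao, Haidi (海笛), Eudic (欧路), Lingoes, Kingsoft PowerWord, MDict and others. Only Haidi's offline audio and image library remains unparsed: there is too little documentation and the encryption is fairly complex, so I will look into it properly when I have time.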
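To illustrate the kind of loss described above when UTF-16LE text is forced into GB2312, here is a minimal VB6 sketch. It is not KSDRIP's code (that tool is closed source); `SurvivesGB2312` is a hypothetical helper name, and the sketch assumes code page 936 is installed so that `StrConv` accepts LCID `&H804`. Windows code page 936 is really GBK, a superset of GB2312, so this check is if anything too lenient.

```vb
Option Explicit

' Returns True when sText is unchanged after a round trip through the
' ANSI code page of LCID &H804 (zh-CN, code page 936).
' Assumption: code page 936 is installed on the machine.
Public Function SurvivesGB2312(ByVal sText As String) As Boolean
    Dim bAnsi() As Byte
    Dim sBack As String

    ' VB6 strings are UTF-16 internally; convert to code page 936 bytes...
    bAnsi = StrConv(sText, vbFromUnicode, &H804)
    ' ...and back to UTF-16 again.
    sBack = StrConv(bAnsi, vbUnicode, &H804)

    ' Characters with no mapping come back as a substitute, so the
    ' binary comparison fails for lossy input.
    SurvivesGB2312 = (StrComp(sBack, sText, vbBinaryCompare) = 0)
End Function
```

For example, `SurvivesGB2312(ChrW$(&H4E2D))` (the character 中) should return True, while text containing characters outside the code page, such as a Thai letter or many phonetic symbols, should return False.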
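Generating a DIC and parsing one are two separate projects. As things stand, several fields in the 120-byte file header have meanings I have not worked out, and since I have no personal need for this I cannot spare the time.
Here is the file header, take a look for yourself: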
Option Explicit

' Kingsoft PowerWord (金山词霸) DIC dictionary parsing
' Kingsoft PowerWord Dic file format:
' Offset    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
' 00000000 4B 53 44 49 70 57 05 00 95 8B 00 00 52 A3 00 00  KSDIpW......R...
' 00000016 68 58 22 49 08 00 00 00 01 00 00 00 78 A8 25 00  hX"I........x.%.
' 00000032 00 40 00 00 01 00 02 00 04 08 00 00 09 04 00 00  .@..............
' 00000048 04 08 00 00 1D 1E 00 00 20 00 00 00 11 00 00 00  ........ .......
' 00000064 F4 01 00 00 00 00 00 00 78 00 00 00 78 00 00 00  ........x...x...
' 00000080 F8 07 00 00 70 08 00 00 F8 3F 00 00 68 48 00 00  ....p....?..hH..
' 00000096 E8 F0 00 00 50 39 01 00 D8 CB 01 00 28 05 03 00  ....P9......(...
' 00000112 50 A3 22 00 00 00 00 00 3C 00 64 00 69 00 63 00  P.".....<.d.i.c.
' Each zlib block decompresses to 16384 bytes
Type TCibaDIC
    lSign As Long            ' 0x4944534B, i.e. "KSDI"
    lFileSize As Long        ' file size
    lFileSize1 As Long
    lFileSize2 As Long
    lFileCRC32 As Long       ' crc32?
    lNum1 As Long            ' 8
    lNum2 As Long            ' 1
    lFileSizeOrig As Long    ' original file size after decryption
    lBlockSize As Long       ' 0x00004000
    lNum4 As Long            ' 0x00020001
    lSource_lcid As Long     ' 0x00000804
    lTarget_lcid As Long     ' 0x00000409
    lNum5 As Long            ' 0x00000804
    lNumWords As Long        ' 0x1E1D
    lNum6 As Long            ' 0x20
    lNum7 As Long            ' 0x11
    lNum8 As Long            ' 0x01F4
    lNum9 As Long            ' 0x00
    lOffStart As Long        ' 0x78
    lOffXML As Long          ' 0x78
    lLenXML As Long          ' 0x07F8
    ' The values below are read from the dump above
    lOffIdxTable As Long     ' 0x0870
    lLenIdxTable As Long     ' 0x3FF8
    lOffIdxTable1 As Long    ' 0x4868
    lLenIdxTable1 As Long    ' 0xF0E8
    lOffIndexTable As Long   ' 0x00013950
    lLenIndexTable As Long   ' 0x0001CBD8
    lOffWordsTable As Long   ' 0x00030528
    lLenWordsTable As Long   ' 0x0022A350
    ' These 29 Longs cover 116 bytes. The DWORD at offset 116
    ' (00 00 00 00 in the dump) is not mapped; the XML starts at 0x78.
End Type
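Since several header fields are still unidentified, one way to study them is to dump the header of several DIC files side by side and compare. The sketch below reads the first 120 bytes as 30 little-endian DWORDs without depending on the Type above. The function names, the error number and the example path are illustrative assumptions, not part of the original parser.

```vb
Option Explicit

' Reads the first 120 bytes of a file as 30 little-endian DWORDs.
' The 120-byte header size comes from the post; names are hypothetical.
Public Function ReadHeaderDwords(ByVal sPath As String) As Long()
    Const HEADER_DWORDS As Long = 30
    Dim lVals() As Long
    Dim f As Integer
    Dim i As Long

    ReDim lVals(0 To HEADER_DWORDS - 1)
    f = FreeFile
    Open sPath For Binary Access Read As #f
    If LOF(f) < HEADER_DWORDS * 4 Then
        Close #f
        Err.Raise vbObjectError + 1, "ReadHeaderDwords", _
                  "File is shorter than 120 bytes"
    End If
    For i = 0 To HEADER_DWORDS - 1
        Get #f, i * 4 + 1, lVals(i)     ' Get positions are 1-based
    Next i
    Close #f
    ReadHeaderDwords = lVals
End Function

' Prints one "offset  value" pair per DWORD to the Immediate window.
Public Sub DumpDicHeader(ByVal sPath As String)
    Dim lVals() As Long
    Dim i As Long

    lVals = ReadHeaderDwords(sPath)
    For i = LBound(lVals) To UBound(lVals)
        Debug.Print Right$("00" & CStr(i * 4), 3); "  0x"; _
                    Right$("0000000" & Hex$(lVals(i)), 8)
    Next i
End Sub
```

Called as `DumpDicHeader "C:\dict\sample.dic"` (a made-up path), a file that starts with the `KSDI` signature should print `000  0x4944534B` on its first line.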
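Reading the offset and length pairs off the hex dump (bytes 76 to 115) suggests that the four tables follow the XML back to back and end exactly at `lFileSizeOrig`. The sketch below checks that arithmetic for this one sample; whether it holds for every DIC file is an assumption that would need testing. `SectionsContiguous` and `SampleHeaderIsContiguous` are hypothetical names.

```vb
Option Explicit

' True when every section starts where the previous one ends and the
' last section ends at lTotal. lOff and lLen are parallel arrays.
Public Function SectionsContiguous(lOff() As Long, lLen() As Long, _
                                   ByVal lTotal As Long) As Boolean
    Dim i As Long

    For i = LBound(lOff) To UBound(lOff) - 1
        ' Leaving early keeps the default return value, False
        If lOff(i) + lLen(i) <> lOff(i + 1) Then Exit Function
    Next i
    i = UBound(lOff)
    SectionsContiguous = (lOff(i) + lLen(i) = lTotal)
End Function

' Feeds in the values from the sample dump in the post.
Public Function SampleHeaderIsContiguous() As Boolean
    Dim lOff(0 To 4) As Long
    Dim lLen(0 To 4) As Long

    ' The trailing & forces a Long literal; &HF0E8 on its own would be
    ' read as a negative Integer.
    lOff(0) = &H78&:    lLen(0) = &H7F8&      ' lOffXML / lLenXML
    lOff(1) = &H870&:   lLen(1) = &H3FF8&     ' lOffIdxTable / lLenIdxTable
    lOff(2) = &H4868&:  lLen(2) = &HF0E8&     ' lOffIdxTable1 / lLenIdxTable1
    lOff(3) = &H13950:  lLen(3) = &H1CBD8     ' lOffIndexTable / lLenIndexTable
    lOff(4) = &H30528:  lLen(4) = &H22A350    ' lOffWordsTable / lLenWordsTable

    ' &H25A878 is lFileSizeOrig in the same dump
    SampleHeaderIsContiguous = SectionsContiguous(lOff, lLen, &H25A878)
End Function
```

For the sample values this returns True. Note that `lOffWordsTable + lLenWordsTable` (0x25A878) is larger than the on-disk `lFileSize` (0x55770), so these offsets presumably refer to the unpacked data rather than to positions in the compressed file.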