Last edited by Oeasy on 2013-11-17 09:54

A tutorial, as simple as it gets, on scraping web pages and then building an mdx (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/, http://baike.baidu.com/view/1312507.htm
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, the site has no scraping restrictions (at least none that I can see for now), and the index is very easy to get, so I use it as the example.
Also: there is a pdf at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536, edition unknown, apparently the 17th.

Steps
1. Get the index
Look at http://www.infoplease.com/dictionary/brewers/: the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following contents:
http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html
I worked out all of these addresses by looking at the site above. Name the txt file download.txt.
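
The 26 addresses follow one pattern, so the list can also be generated rather than typed. A minimal Python sketch, assuming only the index-a.html through index-z.html pattern shown above; the function names are mine, not part of the original workflow:

```python
import string

BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls():
    """Return the 26 index page URLs, index-a.html through index-z.html."""
    return [f"{BASE}index-{letter}.html" for letter in string.ascii_lowercase]


def write_download_list(path="download.txt"):
    """Write one URL per line, the layout `wget -i` reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(index_urls()) + "\n")
```

Calling write_download_list() in D:\DOPF would produce the same download.txt as the hand-made one.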
I put download.txt and wget.exe (if the wget you downloaded is named wget+version number.exe, you may as well rename it to wget.exe) together in D:\DOPF.

cmd.exe -> cd /d D:\DOPF -> wget -i download.txt
Before long, all 26 html files are downloaded. Tidy up these 26 html files to get the entry links:

16698 links like this in total.
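
The post does the tidying in a text editor. As an alternative, here is a rough Python sketch that pulls the links out of an index page. It assumes, without having checked the real pages, that entry pages are linked with plain <a href> tags, live under the same /dictionary/brewers/ path, and end in .html; LinkCollector and entry_links are illustrative names:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
# Index pages themselves (index.html, index-a.html, ...) are not entries.
INDEX_NAME = re.compile(r"index(-[a-z])?\.html")


class LinkCollector(HTMLParser):
    """Collect absolute URLs of entry pages linked from one index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(BASE, href)  # resolve relative links against BASE
        name = url.rsplit("/", 1)[-1]
        # Assumption: an entry is any .html page under BASE that is not an index page.
        if url.startswith(BASE) and name.endswith(".html") and not INDEX_NAME.fullmatch(name):
            self.links.append(url)


def entry_links(html_text):
    """Return entry URLs found in html_text, in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html_text)
    return list(dict.fromkeys(collector.links))
```

Running entry_links over each of the 26 files and writing the results one per line gives a new list that wget -i can consume.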
2. Scrape the content
Same approach: wget -i download.txt, this time with the links above as the list,
fetches all N of those html pages; after that the rest is simple.
-2013-11-14 16:35:47
16695 html files were fetched successfully. 3 were missed, and I couldn't be bothered to work out which 3.
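
For anyone who does want to find the missing pages, one way is to compare the link list against the files on disk. A small sketch, assuming wget saved each page under the last path segment of its URL (its default when no directory options are given); missing_downloads is a made-up helper name:

```python
from pathlib import Path


def missing_downloads(url_list_path, download_dir):
    """Return the URLs in the list whose file name is absent from download_dir."""
    lines = Path(url_list_path).read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    present = {p.name for p in Path(download_dir).iterdir() if p.is_file()}
    # Compare on the last path segment, e.g. ".../brewers/comb.html" -> "comb.html".
    return [u for u in urls if u.rsplit("/", 1)[-1] not in present]
```

The returned URLs can be written to a fresh list and passed to wget -i again.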
3. Text extraction
Looking at a page shows that the entry content sits between the first <h1> and <div class="source">:
<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>
Use TextForever to extract the text.
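
The same "first <h1> up to <div class="source">" rule can be written in a few lines of Python. This is a sketch of that rule only, not a description of how TextForever works internally. It assumes the <h1> carries no attributes, as in the sample above, and it replaces line breaks with single spaces so that words on adjacent lines do not run together:

```python
import re


def extract_entry(html_text):
    """Return (headword, body_html) for one entry page, or None if a marker is missing."""
    start = html_text.find("<h1>")
    if start == -1:
        return None
    end = html_text.find('<div class="source">', start)
    if end == -1:
        return None
    chunk = html_text[start:end]
    m = re.match(r"<h1>(.*?)</h1>", chunk, flags=re.S)
    if m is None:
        return None
    headword = " ".join(m.group(1).split())
    # Collapse every run of whitespace, line breaks included, into one space.
    body = " ".join(chunk[m.end():].split())
    return headword, body
```

Pages where either marker is absent return None, which makes them easy to list and inspect by hand.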

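With extraction done, merge the 16695 resulting html files.

While making this dictionary I thought it over and decided there was no need for the "prepend the file name to the file contents" option (in some cases you do need it, to make extracting the keywords easier). Testing showed that "append a blank line after the file contents" is still needed.

This gives dopf-src.txt. Process this txt to get a txt that can be built into an mdx.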
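
The merge step, with a blank line after each file's contents, could look roughly like this in Python. The *.txt pattern and the UTF-8 encoding are assumptions about the extracted files, and merge_texts is a name chosen for illustration:

```python
from pathlib import Path


def merge_texts(src_dir, out_path, pattern="*.txt"):
    """Concatenate matching files from src_dir into out_path, a blank line after each."""
    out_path = Path(out_path)
    count = 0
    with out_path.open("w", encoding="utf-8", newline="\n") as out:
        for path in sorted(Path(src_dir).glob(pattern)):
            if path.resolve() == out_path.resolve():
                continue  # never merge the output file into itself
            text = path.read_text(encoding="utf-8").rstrip("\n")
            out.write(text + "\n\n")  # contents, then one empty line
            count += 1
    return count
```

The return value is the number of files merged, which can be checked against the expected 16695.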
4. Build the mdx
The merged text looks like this:

Evidently the dictionary at http://www.infoplease.com/dictionary/brewers/ is XML. Because the PC version of MDict does not support xml+css, the XML tags have to be replaced with HTML tags, through the series of operations below.
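
The post does not list the actual replacements, so the following is only a generic sketch of renaming tags in bulk with a regular expression. The tag names in TAG_MAP are placeholders invented for the example, not the dictionary's real markup:

```python
import re

# Hypothetical mapping from source tag names to HTML tag names (placeholders).
TAG_MAP = {"entry": "div", "hw": "b", "quote": "blockquote"}

# An opening or closing tag: optional slash, tag name, optional attributes.
_TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:-]*)((?:\s[^<>]*)?)>")


def rename_tags(text, tag_map=TAG_MAP):
    """Rename tags listed in tag_map; other tags and all attributes are left as they are."""
    def swap(match):
        slash, name, rest = match.groups()
        new_name = tag_map.get(name.lower())
        if new_name is None:
            return match.group(0)
        return f"<{slash}{new_name}{rest}>"

    return _TAG_RE.sub(swap, text)
```

An editor's regex search-and-replace does the same job one tag at a time; a mapping just keeps all the substitutions in one place.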
The final text after processing looks like this:

Then write a little css.

I ran into a few small problems along the way and solved them one by one. Finally, the finished product:

Doesn't it look a bit nicer than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

PS: It is finished, but I found some problems. As the screenshot above shows, some words are missing the space between them. I don't plan to fix this for now; I'll share the file once I have had time to fix it. If anyone wants to fix it for practice, PM me and I'll send you the downloaded pages.