掌上百科 - PDAWIKI

[Tutorial] Building the Dictionary of Phrase and Fable, E. Cobham Brewer, 1894


Posted 2013-11-14 08:24:13
Last edited by Oeasy on 2013-11-17 09:54

A dead-simple tutorial on scraping web pages and building an mdx from them (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Download tool: wget, http://users.ugent.be/~bpuype/wget/ (background: http://baike.baidu.com/view/1312507.htm)
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, the site has no scraping restrictions (at least none that I can see for now), and the index is very easy to get, so I use it as the example.
Also: there is a PDF at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536. The edition is not stated; it appears to be the 17th.

Steps
1. Get the index
Look at http://www.infoplease.com/dictionary/brewers/. The site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content:

http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

These addresses all come from looking at the site above. Name the txt file download.txt.
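The 26 index addresses differ only in the letter, so the list could also be generated instead of typed. A minimal Python sketch, not part of the original workflow; the script name in the usage note is hypothetical:

```python
import string

BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls():
    """Return the 26 per-letter index URLs, index-a.html through index-z.html."""
    return [f"{BASE}index-{letter}.html" for letter in string.ascii_lowercase]


# Print one URL per line, the layout that wget -i reads.
print("\n".join(index_urls()))
```

Saved as, say, make_index_list.py, this could be run as `python make_index_list.py > download.txt`.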
I put download.txt and wget.exe together in D:\DOPF. (If the wget you downloaded is named wget plus a version number, just rename it to wget.exe.)

cmd.exe -> cd /d D:\DOPF -> wget -i download.txt

The 26 html files come down quickly. Tidying up these 26 files gives:
http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

There are 16698 links like these in total.
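The post does the tidying in a text editor. For anyone who prefers a script, here is a rough Python sketch of the same idea: collect every link on an index page that points at an entry page. It assumes the entries are ordinary `<a href>` links into the same /dictionary/brewers/ folder and that only the index-a.html to index-z.html pages need filtering out; the markup of the real index pages is not shown in the post, so treat both as assumptions.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
# Per-letter index pages such as index-a.html are navigation, not entries.
INDEX_PAGE = re.compile(r"index-[a-z]\.html")


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self.links.append(urljoin(self.page_url, href))


def entry_links(html, page_url):
    """Return entry URLs found in one index page, in order, without repeats."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    seen, result = set(), []
    for url in collector.links:
        if not url.startswith(BASE):
            continue  # link leaves the dictionary folder
        name = url[len(BASE):]
        if "/" in name or not name.endswith(".html"):
            continue  # not a page directly inside the folder
        if INDEX_PAGE.fullmatch(name) or url in seen:
            continue  # navigation link or repeat
        seen.add(url)
        result.append(url)
    return result


# Tiny in-memory demonstration with made-up markup.
sample = '<a href="aaron.html">Aaron</a> <a href="index-b.html">B</a>'
print(entry_links(sample, BASE + "index-a.html"))
```

Running `entry_links` over the text of each of the 26 downloaded files and writing the results one per line would give the link list for the next step.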

2. Fetch the content
Same as before: wget -i download.txt, this time with the entry links in download.txt.
That pulls down all N html pages listed above, and the rest is simple.
- 2013-11-14 16:35:47
16695 html files were fetched successfully. 3 were missed, and I could not be bothered to work out which 3.
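Finding the 3 missing pages need not be manual work. The sketch below compares the link list against the file names in the download folder. It assumes wget saved each page under the last segment of its URL, which is its usual default, and compares names without regard to case because a default Windows file system does the same. The function names are made up for illustration.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit


def expected_name(url):
    """Last path segment of the URL, the name wget gives the saved file by default."""
    return posixpath.basename(urlsplit(url).path)


def missing_urls(urls, downloaded_names):
    """URLs whose expected file name is absent from the download folder."""
    # Compare case-insensitively, as a default Windows file system does.
    have = {name.lower() for name in downloaded_names}
    return [u for u in urls if expected_name(u).lower() not in have]


def clashing_names(urls):
    """File names that more than one URL maps to, ignoring case."""
    counts = Counter(expected_name(u).lower() for u in urls)
    return sorted(name for name, n in counts.items() if n > 1)


# In-memory demonstration with made-up data.
demo = ["http://x/d/a.html", "http://x/d/aaron.html"]
print(missing_urls(demo, ["a.html"]))
```

With the real data, `urls` would be the lines of download.txt and `downloaded_names` the result of `os.listdir(r"D:\DOPF")`. If `missing_urls` comes back empty, `clashing_names` is worth a look: repeated links or names that differ only in case would also explain a shortfall, though the post does not say which it was.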

3. Extract the text
Inspection shows that the entry content sits between the first <h1> and <div class="source">:
<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>

Use TextForever to extract the text. (screenshot)
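For readers without TextForever, the same cut can be sketched in a few lines of Python. This is an alternative to the tool used in the post, not a description of what TextForever does internally, and it relies on plain string search because the two markers are known to appear literally in the pages.

```python
def extract_entry(html):
    """Return the text from the first <h1> up to, not including, <div class="source">.

    Returns None when either marker is missing, so odd pages can be reviewed by hand.
    """
    start = html.find("<h1")
    if start == -1:
        return None
    end = html.find('<div class="source">', start)
    if end == -1:
        return None
    return html[start:end].strip()


# In-memory demonstration on a cut-down page.
page = ('<html><body><div id="nav">menu</div>'
        '<h1>Charybdis</h1>\n<p>A whirlpool on the coast of Sicily.</p>\n'
        '<div class="source">Source</div></body></html>')
print(extract_entry(page))
```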

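Once extraction is finished, merge the 16695 resulting html files.

While making this dictionary I gave it some thought and left the "prefix file content with the file name" option off. In some cases you do need it, because it makes extracting the keywords easier. After testing, though, the "append a blank line after file content" option still has to be on.

This gives dopf-src.txt. Working on that txt produces a txt that can be built into an mdx.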
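The merge itself is just concatenation with a blank line after each file. A small sketch under assumptions: the folder name `extracted` is hypothetical and UTF-8 is a guess at the encoding; only dopf-src.txt comes from the post.

```python
import os


def merge_entries(fragments):
    """Join the fragments into one text, each followed by a blank line."""
    return "".join(fragment.strip() + "\n\n" for fragment in fragments)


def merge_folder(src_dir="extracted", out_path="dopf-src.txt"):
    """Merge every file in src_dir, in name order, into out_path."""
    fragments = []
    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name), encoding="utf-8", errors="replace") as f:
            fragments.append(f.read())
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        out.write(merge_entries(fragments))
```

The blank line gives later search-and-replace steps a reliable boundary between one entry and the next.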

4. Build the mdx
The merged text looks like this: (screenshot)

The dictionary at http://www.infoplease.com/dictionary/brewers/ is clearly xml. Because the PC version of MDict does not support xml+css, the xml tags have to be replaced with html tags, through the series of operations below. (screenshot)

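The exact replacements were shown only in a screenshot, so they cannot be reproduced here. What can be sketched is the final shaping of one entry. The code assumes the usual MdxBuilder source layout as I understand it (headword on the first line, the html body after it, then a line containing only `</>`); check that against the MdxBuilder version you use. The function name is made up.

```python
import html
import re


def to_mdict_record(fragment):
    """Turn one extracted fragment into a headword / body / </> record.

    Returns None when the fragment does not start with an <h1> heading.
    """
    match = re.match(r"\s*<h1[^>]*>(.*?)</h1>", fragment, re.S)
    if not match:
        return None
    # Headword: heading text with inner tags removed and entities decoded.
    headword = html.unescape(re.sub(r"<[^>]+>", "", match.group(1)))
    headword = " ".join(headword.split())
    # Body: the fragment itself, minus blank lines.
    body = "\n".join(line for line in fragment.splitlines() if line.strip())
    return f"{headword}\n{body}\n</>\n"


print(to_mdict_record("<h1>Charybdis</h1>\n\n<p>A whirlpool.</p>\n"))
```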
The final text after processing looks like this: (screenshot)

Then write a little css. (screenshot)

A few small problems came up along the way and were solved one by one. Finally, the finished product: (screenshot)

Doesn't it look a bit nicer than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

PS: The dictionary is done, but I have found some problems. As the screenshots above show, the space between some words is missing. I do not plan to fix that for now and will share the file once I find time to correct it. If anyone wants to fix it for practice, PM me and I will send you the downloaded pages.
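The post does not say where the spaces were lost. One plausible cause, and it is only a guess, is that line breaks inside the source paragraphs (for example after "Scylla and" in the Charybdis sample) were deleted rather than turned into spaces. If that is what happened, collapsing whitespace instead of deleting line breaks would avoid it:

```python
import re


def unwrap(text):
    """Collapse every run of whitespace, line breaks included, to one space."""
    return re.sub(r"\s+", " ", text).strip()


wrapped = "Scylla and\nCharybdis are employed"
print(wrapped.replace("\n", ""))  # deleting the break fuses two words
print(unwrap(wrapped))            # replacing it keeps them apart
```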

Posted 2013-11-14 16:26:41
Bumping this thread!
Posted 2013-11-15 23:30:45 from mobile
Hello, and thank you for the tutorial.
I followed it and finished step 1, but I am at a loss as to how to do step 2 efficiently, the step where you fetch the thousand-plus pages. Typing them in by hand one at a time is one way, but it is not efficient. Do you have a way to get every word's page in bulk? I would appreciate any pointers. Many thanks.

The site I want to scrape is:
http://zokugo-dict.com/

The gojūon table on the right is the index.


OP | Posted 2013-11-16 14:14:53
liuyunrushui wrote on 2013-11-15 23:30:
Hello, and thank you for the tutorial.
I followed it and finished step 1, but how to do step 2 efficiently, the step where you ...

cmd.exe

wget -i download.txt
All the page links go in download.txt; see http://baike.baidu.com/view/1312507.htm. You can also write your own program to do the fetching. Combined with awk and the like it can actually go even faster: once the fetch finishes, the dictionary is practically built.
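As a rough idea of what "write your own program" could look like, here is a Python standard-library sketch that does roughly what wget -i does: fetch each URL and save it under the last segment of its path. The function name, the output folder and the pause between requests are all illustrative choices, not something from this thread.

```python
import os
import posixpath
import time
import urllib.request
from urllib.parse import urlsplit


def fetch_all(urls, out_dir, delay=0.0):
    """Download each URL into out_dir; return a list of (url, reason) failures."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path)
        if not name:
            failed.append((url, "no file name in URL"))
            continue
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                data = response.read()
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            failed.append((url, str(exc)))
            continue
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(data)
        if delay:
            time.sleep(delay)  # optional pause to go easy on the server
    return failed
```

A call such as `fetch_all(open("download.txt").read().split(), "pages", delay=0.5)` would fetch the whole list, and the returned failures show straight away which pages were missed.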


Posted 2014-4-1 09:02:21
thank you very much