Making Dictionary of Phrase and Fable, E. Cobham Brewer, 1894

Oeasy · posted 2013-11-14 08:24:13

Last edited by Oeasy on 2013-11-17 09:54

A web-scraping-to-mdx tutorial that could not be any simpler (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Fetching tool: wget, http://users.ugent.be/~bpuype/wget/ , http://baike.baidu.com/view/1312507.htm
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

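Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, the site has no crawl restrictions (at least none that I can see for now), and the index is very easy to get, so I use it as the example.
Also: there is a pdf at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536 , edition unknown, apparently the 17th.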

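Steps
1. Get the index
Look at http://www.infoplease.com/dictionary/brewers/ : the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content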

http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

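All of these addresses come from looking at the site above. Name the txt file download.txt.
I put download.txt and wget.exe (if the wget you downloaded is named wget+version.exe, you may as well rename it to wget.exe) together under D:\DOPF.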
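The 26 index addresses follow one pattern, so download.txt can also be generated instead of typed. A minimal Python sketch; the function names index_urls and write_url_list are made up for illustration, and the URL pattern is simply the one listed above:

```python
import string

BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls(base: str = BASE) -> list[str]:
    """Return the index-a.html ... index-z.html addresses, one per letter."""
    return [f"{base}index-{letter}.html" for letter in string.ascii_lowercase]


def write_url_list(path: str, urls: list[str]) -> None:
    """Write one URL per line, the layout that `wget -i` reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(urls) + "\n")
```

Calling write_url_list("download.txt", index_urls()) in D:\DOPF should produce the same file as the hand-made one.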

cmd.exe -> cd /d D:\DOPF -> wget -i download.txt

The 26 html files download quickly. Tidying up these 26 html files gives

http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

There are 16698 such links in total.
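The "tidying up" can be done in a text editor, or scripted. The sketch below is one way to pull entry links out of a downloaded index page in Python. It assumes (this is not confirmed by the post) that each index page links to its entries with ordinary <a href="..."> tags pointing under /dictionary/brewers/; LinkCollector and entry_links are names invented for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
# The 26 index pages themselves; they are not dictionary entries.
INDEX_PAGE = re.compile(r"index-[a-z]\.html")


class LinkCollector(HTMLParser):
    """Collect the absolute form of every <a href> on one page."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.page_url, href))


def entry_links(html: str, page_url: str) -> list[str]:
    """Return entry URLs under BASE, in page order, without duplicates."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    seen: set[str] = set()
    result: list[str] = []
    for url in collector.links:
        if not url.startswith(BASE):
            continue  # link to some other part of the site
        name = url[len(BASE):]
        if "/" in name or not name.endswith(".html"):
            continue
        if INDEX_PAGE.fullmatch(name) or url in seen:
            continue
        seen.add(url)
        result.append(url)
    return result
```

Running entry_links over each of the 26 saved files and writing the combined result to a new download.txt gives the list for the next step.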

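2. Fetch the content
Same as before: wget -i download.txt
Fetch all N of the html pages listed above, and the rest is simple.
- 2013-11-14 16:35:47
16695 html pages were fetched successfully and 3 were missed. I couldn't be bothered to work out which 3.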
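Finding the missing pages does not take much work. A hedged Python sketch: it assumes wget saved each page under the last path segment of its URL (its default behaviour without extra options) and compares names case-insensitively because Windows file names are case-insensitive. expected_name and missing_urls are hypothetical helper names.

```python
import posixpath
from urllib.parse import urlsplit


def expected_name(url: str) -> str:
    """File name wget is assumed to use: the last segment of the URL path."""
    return posixpath.basename(urlsplit(url).path)


def missing_urls(urls: list[str], downloaded_names: list[str]) -> list[str]:
    """Return the URLs whose expected file is not among the downloaded names."""
    have = {name.lower() for name in downloaded_names}
    return [u for u in urls if expected_name(u).lower() not in have]
```

For example, pass it the lines of download.txt and the result of os.listdir on the download folder; whatever comes back can be saved as a short retry list for wget -i.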

3. Extract the text
Inspection shows that the content of a dictionary entry sits between the first <h1> and <div class="source">

<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>

Use TextForever to extract the text
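The same cut can also be scripted. A minimal Python sketch, assuming every page has a bare <h1> tag and the <div class="source"> marker written exactly as in the sample above; extract_entry is a name made up for this example, and it keeps the <h1> line because the headword is needed later.

```python
START_MARK = "<h1>"
END_MARK = '<div class="source">'


def extract_entry(html: str) -> str:
    """Return the page text from the first <h1> up to the source line."""
    start = html.find(START_MARK)
    if start == -1:
        raise ValueError("no <h1> found")
    end = html.find(END_MARK, start)
    if end == -1:
        raise ValueError("no source marker found after <h1>")
    return html[start:end].strip()
```

Pages that raise ValueError are worth checking by hand, since their layout differs from the sample.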

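Once extraction is finished, merge the resulting 16695 html files.

While making this dictionary I thought it over and decided there was no need to "prepend the file name to the file content" (in some cases you do need it, because it makes extracting keywords easier). Testing showed that "append a blank line after the file content" is still needed.

This gives dopf-src.txt. Work on this txt to get a txt that can be built into an mdx.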

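4. Make the mdx
The merged text looks like this:

The dictionary at http://www.infoplease.com/dictionary/brewers/ is evidently marked up as xml. Since the PC version of MDict does not support xml+css, the xml tags have to be replaced with html tags, through the series of operations below.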
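The exact tag replacements are not spelled out in the text, so they are left out here. As a rough sketch of the target only, the Python below turns one extracted entry into a record in the plain-text layout that MdxBuilder is generally fed: a headword line, the html on one line, then a line containing only </>. That layout and the helper name to_mdx_record are assumptions of this example, not something the post states.

```python
import re
from html import unescape

H1_RE = re.compile(r"<h1>(.*?)</h1>", re.S)
TAG_RE = re.compile(r"<[^>]+>")


def to_mdx_record(fragment: str) -> str:
    """Format one entry as: headword line, one html line, '</>' line."""
    match = H1_RE.search(fragment)
    if match is None:
        raise ValueError("entry has no <h1> headword")
    # Headword: the <h1> text with inner tags removed and entities decoded.
    headword = " ".join(unescape(TAG_RE.sub("", match.group(1))).split())
    # Body: the whole fragment with every run of whitespace, including
    # line breaks, collapsed to a single space so it fits on one line.
    body = " ".join(fragment.split())
    return f"{headword}\n{body}\n</>"
```

Joining the records with newlines gives a single source file; any css would then be referenced from the html line.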

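The final text after processing looks like this:

Then write a little css

A few small problems came up along the way and were solved one by one. Finally, the finished product:

Doesn't it look a bit nicer than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

PS: Although it is done, I found some problems. As the screenshot above shows, spaces are missing between some words. I have no plans to fix this for now and will share the dictionary once I have had time to fix it. If anyone is interested in fixing it as practice, PM me and I will send you the downloaded pages.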
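One possible cause of the missing spaces (a guess, not something confirmed in this thread) is that line breaks inside a paragraph were deleted outright when the lines were joined. In the Charybdis sample, "<em> i.e.</em>" ends one line and "seeking" starts the next, so deleting the break glues the two words together. A small Python sketch of the difference; join_lines is a hypothetical helper:

```python
import re

# One or more line breaks, together with any blanks around them.
LINE_BREAKS = re.compile(r"(?:[ \t]*\r?\n[ \t]*)+")


def join_lines(text: str) -> str:
    """Join lines by turning each run of line breaks into one space."""
    return LINE_BREAKS.sub(" ", text).strip()


sample = "Charybdis,<em> i.e.</em>\nseeking to avoid one fault"
glued = sample.replace("\n", "")  # the suspected mistake
fixed = join_lines(sample)        # keeps the word boundary
```

Here glued contains "i.e.</em>seeking" while fixed contains "i.e.</em> seeking". An extra space before a closing tag such as </p> does not change how the html renders.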

Hugh · posted 2013-11-14 16:26:41

Bumping this thread!

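liuyunrushui · posted 2013-11-15 23:30:45

Hello, and thank you for the tutorial.
Following it, I finished step 1, but I am at a loss as to how to do step 2 efficiently, that is, the step where you fetch the thousand-plus pages. Entering them by hand one at a time is one way, but it is not efficient. Do you have a way to get the page for every word in bulk? I would be grateful for any pointers. Many thanks.

The site I want to fetch is:
http://zokugo-dict.com/

The gojūon (fifty-sounds) table on the right is the index.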

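Oeasy · posted 2013-11-16 14:14:53

liuyunrushui posted 2013-11-15 23:30
Hello, and thank you for the tutorial.
Following it, I finished step 1, but I am at a loss as to how to do step 2 efficiently, that is, the step where you ...

cmd.exe

wget -i download.txt
Put all the page links in download.txt; see http://baike.baidu.com/view/1312507.htm for reference. You can also write your own program to do the fetching. Combined with awk and the like, it can actually go faster: by the time the fetch finishes, the dictionary is as good as built.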
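For anyone who would rather write their own fetcher than use wget, here is a minimal Python sketch using only the standard library. fetch and download_all are names invented for this example; the delay parameter and the choice to skip files that already exist are design choices of the sketch, not something wget or the post prescribes.

```python
import posixpath
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, fetch_fn=fetch, delay=0.0):
    """Save each URL under its last path segment; return the URLs that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path) or "index.html"
        target = out / name
        if target.exists():
            continue  # already fetched, so a rerun only retries the gaps
        try:
            # fetch_fn runs first, so no file is created when it fails.
            target.write_bytes(fetch_fn(url))
        except OSError:  # urllib's URLError and timeouts are OSError subclasses
            failed.append(url)
        time.sleep(delay)  # optional pause between requests
    return failed
```

Reading the lines of download.txt into a list and passing them to download_all does the same job as wget -i download.txt, and the returned list shows exactly which pages were missed.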

tovaremeterio · posted 2014-4-1 09:02:21

thank you very much
