Making the Dictionary of Phrase and Fable, E. Cobham Brewer, 1894

Oeasy · posted 2013-11-14 08:24:13

Last edited by Oeasy on 2013-11-17 09:54

A tutorial, as simple as it can possibly get, on scraping web pages and then building an mdx (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/, http://baike.baidu.com/view/1312507.htm
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

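Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, the site has no scraping restrictions (at least none that I can see for now), and getting the index is very easy, so I use it as the example.
Also: there is a pdf at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536, edition unknown, apparently the 17th.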

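Steps
1. Get the index
Looking at http://www.infoplease.com/dictionary/brewers/, the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt with the following content: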

http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

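All of these addresses come from looking at the site above. Name the txt download.txt.
I put download.txt and wget.exe (if the wget you downloaded is named wget+version number.exe, you may as well rename it to wget.exe) both in D:\DOPF.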
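The list above follows a fixed pattern, so it does not have to be typed by hand. A minimal sketch in Python, assuming only the index-a.html to index-z.html naming visible in the list; the function names are made up for this example:

```python
import string

# Base address taken from the list above.
BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls():
    """Return the 26 index page URLs, index-a.html through index-z.html."""
    return [f"{BASE}index-{letter}.html" for letter in string.ascii_lowercase]


def write_download_list(path="download.txt"):
    """Write one URL per line, which is the layout `wget -i` reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(index_urls()) + "\n")
```

Calling write_download_list() in D:\DOPF should leave a download.txt with the same 26 lines as above.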

cmd.exe -> cd /d D:\DOPF -> wget -i download.txt

The 26 html files download very quickly. Tidying up these 26 html files gives

http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

There are 16698 such links in total.
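The post does not say how the index pages were tidied into this list (an editor was presumably used). As one possible alternative, here is a rough Python sketch that pulls entry links out of a downloaded index page. It assumes the index pages link to entries with ordinary <a href> tags pointing under /dictionary/brewers/; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
# The 26 index pages themselves, which should not end up in the entry list.
INDEX_PAGE = re.compile(r"index-[a-z]\.html")


class EntryLinkParser(HTMLParser):
    """Collect absolute URLs of links that sit directly under BASE."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.page_url, href)  # resolve relative links
        if not url.startswith(BASE):
            return
        name = url[len(BASE):]
        if name.endswith(".html") and "/" not in name and not INDEX_PAGE.fullmatch(name):
            self.links.append(url)


def entry_links(html, page_url):
    """Return entry URLs in page order, with duplicates removed."""
    parser = EntryLinkParser(page_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))
```

Running entry_links over each of the 26 saved index files and writing the results one per line would give a new download.txt for the next step.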

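2. Fetch the content
Same as before: wget -i download.txt, with download.txt now holding the links above.
Fetch all N of those html pages, and the rest is simple.
-2013-11-14 16:35:47
16695 html files were fetched successfully and 3 were missed. I could not be bothered to work out which 3.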
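Finding the missed pages can be done by comparing the link list with the files on disk. A small sketch, assuming wget saved each page under the last segment of its URL (its usual behaviour without extra options); the function names are invented for this example:

```python
import posixpath
from urllib.parse import urlsplit


def expected_name(url):
    """File name a URL is assumed to be saved under: its last path segment."""
    return posixpath.basename(urlsplit(url).path)


def missing_urls(urls, downloaded_names):
    """Return the URLs whose expected file name is not among the downloaded files."""
    have = set(downloaded_names)
    return [url for url in urls if expected_name(url) not in have]
```

Passing the lines of download.txt as urls and os.listdir("D:\\DOPF") as downloaded_names would list the candidates. If nothing is reported, duplicate links in the list are another possible explanation, though that is only a guess.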

3. Extract the text
Inspection shows that the entry content sits between the first <h1> and <div class="source">

<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>

Use TextForever to extract the text
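For readers without TextForever, the same rule (keep what lies between the first <h1> and <div class="source">) can be sketched in a few lines of Python. This is an illustration of the rule, not a description of what TextForever does internally, and extract_entry is a made-up name:

```python
def extract_entry(html):
    """Return the text from the first <h1> up to, but not including,
    <div class="source">; return None if either marker is missing."""
    start = html.find("<h1>")
    if start == -1:
        return None
    end = html.find('<div class="source">', start)
    if end == -1:
        return None
    return html[start:end].strip()
```

Pages for which this returns None are worth checking by hand, since they do not follow the layout described above.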


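Once extraction is done, merge the resulting 16695 html files.

While making this dictionary I thought it over: the option "prefix file content with the file name" is not needed here, although in some cases it is needed to make extracting keywords easier. Testing showed that "append a blank line after file content" is still required.

The merge gives dopf-src.txt. Process this txt further to get a txt that can be built into an mdx.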

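4. Build the mdx
The merged text looks like this:

The dictionary at http://www.infoplease.com/dictionary/brewers/ is clearly xml. Since the PC version of MDict does not support xml+css, the xml tags have to be replaced with html tags. This takes a series of operations.

After processing, the final text looks like this:

Then write a little simple css

A few small problems came up along the way and were solved one by one. And finally, the finished product:

Doesn't it look a bit nicer than the online version?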
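As a rough illustration of build-ready text, the sketch below turns one extracted entry into one record. It assumes the source layout commonly used with MdxBuilder (a headword line, the definition, then a line containing only </>); that layout and the function name are assumptions, not something the post confirms.

```python
import re

H1 = re.compile(r"<h1>(.*?)</h1>", re.S)
TAG = re.compile(r"<[^>]+>")


def to_mdx_record(entry_html):
    """Build one record: headword line, one-line html body, then '</>'."""
    match = H1.search(entry_html)
    if match is None:
        raise ValueError("entry has no <h1> headword")
    # Headword: the <h1> text with inner tags removed and whitespace collapsed.
    headword = " ".join(TAG.sub("", match.group(1)).split())
    # Body: whitespace runs (including line breaks) become single spaces.
    body = " ".join(entry_html.split())
    return f"{headword}\n{body}\n</>\n"
```

Concatenating the records for all entries would then give a single txt to feed to the builder.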
http://www.infoplease.com/dictionary/brewers/comb.html

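PS: Although it is finished, I have found some problems. As the screenshots above show, some words are missing the space between them. I do not plan to fix this for now and will share the result once I have had time to fix it. If anyone is interested in fixing it for practice, PM me and I will send you the downloaded pages.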
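One possible cause of the fused words, offered only as a guess: the source pages wrap lines in the middle of sentences (see the Charybdis sample), so deleting line breaks without putting a space back glues the last word of one line to the first word of the next. A small Python sketch of the difference, with join_wrapped_lines as a made-up helper:

```python
def join_wrapped_lines(text):
    """Join hard-wrapped lines with a single space so line-end words do not fuse."""
    parts = [line.strip() for line in text.splitlines()]
    return " ".join(part for part in parts if part)


wrapped = "Scylla and\nCharybdis are employed"
fused = "".join(wrapped.splitlines())  # plain line-break removal
fixed = join_wrapped_lines(wrapped)    # line breaks replaced by spaces
```

Here fused ends up with "andCharybdis" while fixed keeps the two words apart, which matches the kind of damage described.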

Hugh · posted 2013-11-14 16:26:41

This thread deserves a bump!

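liuyunrushui · posted 2013-11-15 23:30:45

Hi boss. Thank you for the tutorial.
I followed it and finished step 1, but I am at a loss as to how to do step 2 efficiently, that is, the step where you fetch the thousand-plus pages. Entering them by hand one at a time would work, but it is not efficient. Do you have a way to get the page for every word in bulk? Any pointers would be much appreciated. Many thanks.

The site I want to scrape is:
http://zokugo-dict.com/

The gojūon table on the right is the index.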

Oeasy · posted 2013-11-16 14:14:53

liuyunrushui posted on 2013-11-15 23:30
Hi boss. Thank you for the tutorial.
I followed it and finished step 1, but I am at a loss as to how to do step 2 efficiently, that is, the step where you ...

cmd.exe

wget -i download.txt
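All the page links go in download.txt; see http://baike.baidu.com/view/1312507.htm. You can also write your own program to do the fetching. Combined with awk and the like, it can actually be even faster: by the time the fetching is done, the dictionary is done too.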
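As an example of "writing your own program", here is a minimal Python sketch that does roughly what wget -i download.txt does, using only the standard library. The function names, the half-second delay and the file naming are choices made for this sketch, not something from the tutorial.

```python
import posixpath
import time
import urllib.request
from urllib.parse import urlsplit


def read_urls(path="download.txt"):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def local_name(url):
    """Save each page under the last segment of its URL path."""
    return posixpath.basename(urlsplit(url).path) or "index.html"


def fetch_all(urls, delay=0.5):
    """Download every URL into the current folder; return the failures."""
    failed = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
            with open(local_name(url), "wb") as out:
                out.write(data)
        except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
            failed.append((url, str(exc)))
        time.sleep(delay)  # small pause to go easy on the server
    return failed
```

Calling fetch_all(read_urls()) from the folder that holds download.txt fetches everything and returns the links that failed, so the misses can be retried.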

tovaremeterio · posted 2014-4-1 09:02:21

thank you very much
