Making Dictionary of Phrase and Fable, E. Cobham Brewer, 1894

Oeasy · posted 2013-11-14 08:24:13

Last edited by Oeasy on 2013-11-17 09:54

A tutorial, as simple as it can possibly get, on scraping web pages and then building an mdx (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget (http://users.ugent.be/~bpuype/wget/; see also http://baike.baidu.com/view/1312507.htm)
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

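Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, and the site does not restrict scraping (at least not as far as I can tell right now). The index is also very easy to obtain, so I use it as the example.
Also: there is a pdf at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536; the edition is unknown, but it appears to be the 17th.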

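Steps
1. Get the index
Looking at http://www.infoplease.com/dictionary/brewers/, the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content: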

http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

These addresses all come from looking at the site above. Name the txt file download.txt.
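The 26 index addresses differ only in the letter, so the list can also be generated instead of typed. A minimal Python sketch, assuming the index-a.html through index-z.html pattern listed above; the function names are made up for illustration:

```python
import string
from pathlib import Path

BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls(base: str = BASE) -> list[str]:
    """Return the 26 per-letter index URLs, index-a.html to index-z.html."""
    return [f"{base}index-{letter}.html" for letter in string.ascii_lowercase]


def write_url_list(path: str = "download.txt") -> int:
    """Write one URL per line (the layout `wget -i` reads) and return the count."""
    urls = index_urls()
    Path(path).write_text("\n".join(urls) + "\n", encoding="ascii")
    return len(urls)
```

Calling write_url_list() in D:\DOPF would produce the same download.txt as the hand-written one.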
I put download.txt and wget.exe (if the wget you downloaded is named wget plus a version number, you may as well rename it to wget.exe) together in D:\DOPF.

cmd.exe -> cd /d D:\DOPF -> wget -i download.txt

Very quickly, the 26 html files are downloaded. Processing these 26 html files to pull out the entry links gives:

http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

There are 16698 such links in total.
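The post does not say how the links were pulled out of the 26 index pages (a text editor was probably enough). One possible way to do it in Python is sketched below. It assumes, without having checked the real pages, that each entry is an ordinary `<a href="...">` pointing at an .html file directly under /dictionary/brewers/; the names LinkCollector and entry_links are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
INDEX_PAGE = re.compile(r"index-[a-z]\.html")  # the 26 per-letter index pages


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def entry_links(html, base=BASE):
    """Return absolute entry URLs found in one index page, in order, no repeats."""
    parser = LinkCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = urljoin(base, href)  # resolves relative and root-relative links
        name = url[len(base):] if url.startswith(base) else ""
        # Assumption: an entry page is "<something>.html" directly under base.
        if not name or "/" in name or not name.endswith(".html"):
            continue
        if INDEX_PAGE.fullmatch(name) or url in seen:
            continue
        seen.add(url)
        links.append(url)
    return links
```

Running entry_links over the text of each downloaded index-*.html file and writing the combined result, one URL per line, would give the download.txt for the next step.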

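2. Fetch the content
In the same way, wget -i download.txt
fetches all N of the html pages above, and from there it is easy.
- 2013-11-14 16:35:47
16695 html files were fetched successfully and 3 were missed; I could not be bothered to work out which 3.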
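Tracking down the missing pages is a matter of comparing the URL list with the files on disk. A minimal sketch, assuming wget saved each page under the last path segment of its URL (its usual default); the helper names are hypothetical:

```python
from pathlib import Path
from urllib.parse import urlsplit


def expected_name(url):
    """File name a page would get if saved under the URL's last path segment."""
    return urlsplit(url).path.rsplit("/", 1)[-1]


def downloaded_names(folder):
    """Names of the regular files currently in the download folder."""
    return {p.name for p in Path(folder).iterdir() if p.is_file()}


def missing_urls(urls, have_names):
    """URLs whose expected file name is not among the downloaded names."""
    return [u for u in urls if expected_name(u) not in have_names]
```

For example, missing_urls(urls, downloaded_names(r"D:\DOPF")) would list the pages to retry. If the same URL appears twice in the list, the file count will also come out lower than the link count even though nothing failed.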

3. Extract the text
Inspection shows that the entry content sits between the first <h1> and <div class="source">:

<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>

Use TextForever to extract the text.
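For readers without TextForever, the same "between the first <h1> and <div class="source">" rule can be approximated with a regular expression. This is only a sketch of that rule, not what TextForever does internally; extract_entry is a made-up name.

```python
import re

# First <h1>...</h1>, then everything up to the next <div class="source">.
ENTRY_RE = re.compile(r'<h1>(.*?)</h1>(.*?)<div class="source">', re.DOTALL)


def extract_entry(html):
    """Return (headword, body_markup) for one page, or None if not found."""
    match = ENTRY_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```

Applied to the Charybdis page above, this would return the headword "Charybdis" and the two <p> paragraphs, leaving out the source line.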


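Once extraction is finished, merge the resulting 16695 html files.

While making this dictionary I thought it over and decided that the "prefix file name before file content" option is not needed here (in some cases it is needed, to make extracting keywords easier). Testing showed, however, that "add blank line after file content" is still required.

This gives dopf-src.txt. Process this txt to get a txt that can be built into an mdx.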
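The merge itself is simple enough to script. A minimal sketch of "concatenate every file, with a blank line after each", assuming the extracted files are UTF-8 text with a .txt extension in one folder (both are assumptions; adjust the pattern and encoding to match the real files):

```python
from pathlib import Path


def merge_texts(texts):
    """Concatenate texts, ending each one with exactly one blank line."""
    return "".join(text.rstrip("\n") + "\n\n" for text in texts)


def merge_folder(folder, out_path="dopf-src.txt", pattern="*.txt"):
    """Merge all matching files in name order; return how many were merged."""
    files = sorted(Path(folder).glob(pattern))
    merged = merge_texts(p.read_text(encoding="utf-8") for p in files)
    Path(out_path).write_text(merged, encoding="utf-8")
    return len(files)
```

Keep the output file outside the input folder (or give it a different extension) so that a second run does not merge the previous result into itself.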

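4. Build the mdx
The merged text looks like this:

The dictionary at http://www.infoplease.com/dictionary/brewers/ is evidently xml. Because the PC version of MDict does not support xml+css, the xml tags have to be replaced with html tags, through the series of operations below.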
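As a rough illustration of this kind of conversion (not the author's actual operations, which the post does not spell out), the sketch below renames tags according to a caller-supplied mapping and wraps one entry in the plain-text layout that, to my understanding, MdxBuilder expects: headword on the first line, the definition after it, and a line containing only `</>` to close the record. The mapping in the usage note is purely hypothetical.

```python
import re

TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:-]*)([^<>]*)>")


def rename_tags(markup, mapping):
    """Rename element names per mapping; attributes and other tags are kept."""
    def swap(match):
        slash, name, rest = match.groups()
        return f"<{slash}{mapping.get(name.lower(), name)}{rest}>"
    return TAG_RE.sub(swap, markup)


def mdx_record(headword, body):
    """One source record: headword line, definition, then a '</>' line."""
    return f"{headword.strip()}\n{body.strip()}\n</>\n"
```

For example, rename_tags(body, {"cite": "i"}) would turn every <cite> element into <i>. Which tags actually need replacing depends on what the merged text contains.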

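After processing, the final text looks like this:

Then write a little css:

A few small problems came up along the way and were solved one by one. Finally, the finished product:

Doesn't it look a bit nicer than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

PS: Although it is finished, I found some problems. As the screenshot above shows, some words are missing the space between them. I do not plan to fix this for now; I will share the file once I have had time to fix it. If anyone is interested in fixing it as practice, PM me and I will send you the downloaded pages.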
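One plausible cause of the missing spaces (a guess, not confirmed in the post) is that the line breaks inside each paragraph were deleted rather than replaced with a space, so the last word of one line fused with the first word of the next. If so, re-processing the downloaded pages with a whitespace-collapsing step would avoid it:

```python
import re


def join_lines(text):
    """Collapse each run of whitespace, line breaks included, to one space."""
    return re.sub(r"\s+", " ", text).strip()
```

With the Charybdis entry, join_lines keeps "Thus Horace says an author" intact, whereas simply deleting the line break would produce "saysan".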

Hugh · posted 2013-11-14 16:26:41

This thread deserves a bump!

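liuyunrushui · posted 2013-11-15 23:30:45

Hello, and thank you for the tutorial.
I followed it and finished step 1, but I am completely lost on how to do step 2 efficiently, that is, the step where you fetch over a thousand pages. Typing them in by hand one at a time would work, but it is not efficient. Is there a way to get the page for every word in bulk? I would appreciate any pointers. Many thanks.

The site I want to scrape is:
http://zokugo-dict.com/

The gojuon (fifty-sounds) table on the right is the index.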

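Oeasy · posted 2013-11-16 14:14:53

liuyunrushui posted 2013-11-15 23:30
Hello, and thank you for the tutorial.
I followed it and finished step 1, but how to do step 2 efficiently, that is, the step you ...

cmd.exe

wget -i download.txt
All the page links go in download.txt; see http://baike.baidu.com/view/1312507.htm. You can also write your own program to do the fetching. Combined with awk and the like, it can actually be faster: by the time the fetching is done, the dictionary is made as well.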
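For the "write your own program" route, a minimal Python sketch that does roughly what wget -i download.txt does: read one URL per line, save each page under its last path segment, and report what failed. The function names, the output folder and the delay are assumptions for illustration; the fetch function is a parameter so the loop can be tried without touching the network.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def read_urls(path="download.txt"):
    """Return the non-empty lines of the URL list."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def http_get(url, timeout=30):
    """Fetch one URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, out_dir, fetch=http_get, delay=0.0):
    """Save every URL into out_dir; return the list of URLs that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        name = urlsplit(url).path.rsplit("/", 1)[-1] or "index.html"
        target = out / name
        if target.exists():  # already downloaded, so a rerun only fills gaps
            continue
        try:
            target.write_bytes(fetch(url))
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            failed.append(url)
        time.sleep(delay)  # optional pause between requests
    return failed
```

A call such as fetch_all(read_urls(), "pages", delay=0.5) can simply be repeated until it returns an empty list, since pages already on disk are skipped.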

tovaremeterio · posted 2014-4-1 09:02:21

thank you very much
