PDAWIKI (掌上百科)


[Tutorial] Building an mdx of the Dictionary of Phrase and Fable, E. Cobham Brewer, 1894


Posted 2013-11-14 08:24:13
Last edited by Oeasy on 2013-11-17 09:54

A tutorial on scraping web pages and building an mdx from them that could hardly be simpler (2013-11-14)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/ (background: http://baike.baidu.com/view/1312507.htm)
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ The dictionary is in the public domain, the site imposes no scraping restrictions (none that I can see for now, at least), and the index is very easy to obtain, so I use it as the example.
Also: there is a PDF at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536; its edition is unknown, but it appears to be the 17th.

Steps
1. Get the index
A look at http://www.infoplease.com/dictionary/brewers/ shows that the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content:

http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

These addresses all come from looking at the site above. Name the txt file download.txt.
I put download.txt and wget.exe (if the wget you downloaded is named wget plus a version number, you may as well rename it to wget.exe) together in D:\DOPF.
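
The 26 index addresses differ only in the letter, so the list need not be typed by hand. A minimal Python sketch, assuming the index-a.html to index-z.html pattern shown above; the function names `index_urls` and `write_url_list` are made up for this illustration and are not part of the original workflow:

```python
import string

BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls(base=BASE):
    """Return the 26 index-page URLs, index-a.html through index-z.html."""
    return [f"{base}index-{letter}.html" for letter in string.ascii_lowercase]


def write_url_list(path, urls):
    """Write one URL per line, which is the layout `wget -i` reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(urls) + "\n")
```

Calling `write_url_list(r"D:\DOPF\download.txt", index_urls())` would produce the same file as the hand-made list.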

cmd.exe -> CD /D D:\DOPF -> wget -i download.txt

The 26 HTML files come down very quickly. Sorting through these 26 files yields:
http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

That makes 16,698 links of this kind in total.
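
The post does not say how the index pages were turned into the link list (presumably with the text editors listed earlier). One way to do it in Python is sketched below. It assumes the index pages point at entries through ordinary `<a href="...">` links that resolve to `.html` files directly under /dictionary/brewers/; `LinkCollector` and `entry_links` are hypothetical names.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def entry_links(html, base=BASE):
    """Return unique entry URLs found in one index page, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = urljoin(base, href)          # resolve relative links
        if not url.startswith(base):
            continue                       # link leaves the dictionary
        name = url[len(base):]
        if "/" in name or not name.endswith(".html"):
            continue
        if re.fullmatch(r"index(-[a-z])?\.html", name):
            continue                       # skip the index pages themselves
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Running `entry_links` over each of the 26 downloaded files and concatenating the results would give the new contents of download.txt.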

2. Scrape the content
Same as before: with these links in download.txt, run wget -i download.txt
to fetch all N of the HTML pages above. After that it is all very simple.
- 2013-11-14 16:35:47
16,695 HTML pages were fetched successfully and 3 were missed; I couldn't be bothered to work out which 3.
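
Finding the missing pages takes only a comparison between the link list and the file names on disk. The sketch below assumes wget saved each page under the last component of its URL (its default behaviour when no directory options are given); `missing_pages` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlsplit


def missing_pages(urls, downloaded_names):
    """Return the URLs whose file name is absent from downloaded_names."""
    have = set(downloaded_names)
    return [
        url for url in urls
        if posixpath.basename(urlsplit(url).path) not in have
    ]
```

With `urls` read from download.txt and `downloaded_names` taken from `os.listdir(r"D:\DOPF")`, the returned list can be written to a new txt and passed to wget -i again. A shortfall can also come from duplicate links in the list rather than failed downloads, so an empty result is possible.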

3. Text extraction
Inspection shows that each entry's content lies between the first <h1> and <div class="source">:

<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>

TextForever is used to extract the text.
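
For readers who prefer a script to TextForever, the same cut can be expressed in a few lines of Python. This is a sketch of the rule stated above (first `<h1>` up to `<div class="source">`), not a reproduction of what TextForever does internally; `extract_entry` is a hypothetical name, and the heading itself is kept because the headword is needed later.

```python
def extract_entry(html):
    """Return the text from the first <h1> up to <div class="source">.

    Returns None when either marker is missing, so odd pages can be
    listed and inspected by hand instead of being silently mangled.
    """
    start = html.find("<h1>")
    if start == -1:
        return None
    end = html.find('<div class="source">', start)
    if end == -1:
        return None
    return html[start:end].strip()
```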


Once extraction is finished, merge the 16,695 extracted files.

While making this dictionary I concluded that the "prepend the file name to the file content" option is not needed here (in some cases it is needed, to make extracting keywords easier), but testing showed that "append a blank line after the file content" is still required.

The result is dopf-src.txt. Further processing of this txt gives a txt that can be built into an mdx.
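
The merge step can also be scripted. The sketch below concatenates every extracted file into one text file and writes a blank line after each file's content, matching the setting described above. The UTF-8 encoding is an assumption (the post does not state the pages' encoding), and `merge_entries` is a hypothetical name.

```python
from pathlib import Path


def merge_entries(src_dir, out_path, pattern="*.html"):
    """Concatenate matching files into out_path, one blank line after each.

    Returns the number of files merged.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for path in sorted(Path(src_dir).glob(pattern)):
            # errors="replace" keeps the merge going if a page is not UTF-8
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            out.write(text + "\n\n")
            count += 1
    return count
```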

4. Build the mdx
The merged text looks like this:
[screenshot]

The dictionary at http://www.infoplease.com/dictionary/brewers/ is evidently XML. Since the PC version of MDict does not support xml+css, the XML tags have to be replaced with HTML tags, through the following series of operations:
[screenshot]

The final text after processing looks like this:
[screenshot]
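
As a rough idea of what a buildable source can look like: MdxBuilder source text is commonly laid out as a headword line, the definition HTML, and a line containing only `</>` to close the record. The sketch below produces one such record from an extracted entry. Taking the headword from `<h1>` is an assumption based on the sample page, not something the post states, and `to_mdx_record` is a hypothetical name.

```python
import re

H1 = re.compile(r"<h1>(.*?)</h1>", re.S)


def to_mdx_record(entry_html):
    """Turn one extracted entry into 'headword / definition / </>' lines."""
    match = H1.search(entry_html)
    if match is None:
        return None
    # Strip any inline tags from the heading, then normalise whitespace.
    headword = re.sub(r"<[^>]+>", "", match.group(1))
    headword = " ".join(headword.split())
    # Join wrapped lines with single spaces so words stay separated.
    body = " ".join(entry_html.split())
    return f"{headword}\n{body}\n</>\n"
```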

Then write a little simple CSS.

I ran into a few small problems along the way and solved them one by one. Finally, the finished product:
[screenshot]
Doesn't it look a bit nicer than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

PS: Although it is finished, I have noticed some problems. As the screenshot above shows, some words are missing the space between them. I don't intend to fix this for now; I'll share the result once I have had time to correct it. If anyone is interested in fixing it for practice, PM me and I will send you the downloaded pages.
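
The post does not say what causes the missing spaces. One plausible cause, offered only as a guess, is that the source pages wrap text across lines (as in the Charybdis sample) and the line breaks were deleted rather than replaced with a space. The Python sketch below shows the difference; `unwrap` is a hypothetical helper.

```python
import re


def unwrap(text):
    """Collapse each run of whitespace, line breaks included, to one space."""
    return re.sub(r"\s+", " ", text).strip()


wrapped = "Scylla and\nCharybdis are employed"

# Deleting the line break glues the two words together.
glued = wrapped.replace("\n", "")

# Replacing it with a space keeps them apart.
fixed = unwrap(wrapped)
```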


Posted 2013-11-14 16:26:41
This thread deserves a bump!

Posted 2013-11-15 23:30:45 (from mobile)
Hello boss. Thank you for the tutorial.
Following your tutorial I finished step one, but I am completely lost on how to do step two efficiently, that is, the step where you scrape the thousand-odd pages. Typing them in by hand one by one is one way, but it is not efficient. Do you have a way of getting every word's page in bulk? I would appreciate any pointers. Many thanks.

The site I want to scrape is:
http://zokugo-dict.com/

The gojūon (kana) table on the right is the index.


Original poster | Posted 2013-11-16 14:14:53
liuyunrushui wrote on 2013-11-15 23:30:
Hello boss. Thank you for the tutorial.
I followed your tutorial and finished step one, but how to finish step two efficiently, the one you ...

cmd.exe
wget -i download.txt
Put all the page links in download.txt (see http://baike.baidu.com/view/1312507.htm). You can also write your own program to do the scraping. Combined with awk and the like, it can actually be even faster: by the time the scraping is done, the dictionary is essentially built.
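
As an illustration of the "write your own program" route, here is a minimal Python downloader that does roughly what wget -i does: it reads a list of URLs and saves each page under its last path component. It is a sketch, not a replacement for wget. The names `http_get` and `download_all`, the `fetch` parameter (there so the loop can be tried without network access) and the `delay` parameter are all assumptions of this example.

```python
import os
import posixpath
import time
from urllib.parse import urlsplit
from urllib.request import urlopen


def http_get(url, timeout=30):
    """Fetch one URL and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, fetch=http_get, delay=0.0):
    """Save each URL into out_dir; return the list of URLs that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path) or "index.html"
        target = os.path.join(out_dir, name)
        if os.path.exists(target):
            continue                      # already fetched on an earlier run
        try:
            data = fetch(url)
        except OSError:                   # URLError and timeouts are OSErrors
            failed.append(url)
            continue
        with open(target, "wb") as f:
            f.write(data)
        time.sleep(delay)                 # optional pause between requests
    return failed
```

Because pages already on disk are skipped, the function can simply be run again on the same list to retry whatever failed.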

Posted 2014-4-1 09:02:21
thank you very much