掌上百科 - PDAWIKI


[Tutorial] Building an mdx of Dictionary of Phrase and Fable, E. Cobham Brewer, 1894


Posted 2013-11-14 08:24:13
Last edited by Oeasy on 2013-11-17 09:54

A web-scraping-to-mdx tutorial that could hardly be simpler (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/ (introduction: http://baike.baidu.com/view/1312507.htm)
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ The dictionary is in the public domain, and the site does not restrict crawling (at least not as far as I can tell right now). Getting the index is also very easy, so I use it as the example.
Also: there is a PDF at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536. The edition is unknown, but it appears to be the 17th.

Steps
1. Get the index
Look at http://www.infoplease.com/dictionary/brewers/: the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content:
http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

These addresses all came from looking at the site above. Name the txt file download.txt.
I put download.txt and wget.exe both in D:\DOPF (if the wget you downloaded is named wget plus a version number, you may as well rename it to wget.exe).

cmd.exe -> CD /D D:\DOPF -> wget -i download.txt
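
The 26-line download.txt can also be generated instead of typed by hand. A minimal Python sketch, assuming only the index-<letter>.html URL pattern listed above; the function names are illustrative and not part of any tool used in this post:

```python
import string

BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls(base=BASE):
    """Return the 26 index page URLs, index-a.html through index-z.html."""
    return [f"{base}index-{letter}.html" for letter in string.ascii_lowercase]


def write_url_list(path, urls):
    """Write one URL per line, which is the layout `wget -i` reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        for url in urls:
            f.write(url + "\n")
```

Calling write_url_list("download.txt", index_urls()) in D:\DOPF would produce the same file as the hand-written list.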

Very quickly the 26 html files are downloaded. Tidying up these 26 html files yields:
http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

That makes 16698 links like these in total.
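
How the 26 index pages were tidied into this list is not shown. One way to do it is to collect every link in each index page and keep only those that point at an entry page. The Python sketch below is an assumption-laden illustration, not the method used in this post: it assumes entry pages sit directly under /dictionary/brewers/ and that only the pages named index-<letter>.html (or index.html) are navigation. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
INDEX_PAGE = re.compile(r"index(-[a-z])?\.html")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def entry_links(html, page_url):
    """Return absolute entry URLs found in one index page, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        if not url.startswith(BASE):
            continue
        name = url[len(BASE):]
        # Keep only .html pages directly under BASE, minus the index pages.
        if "/" in name or not name.endswith(".html"):
            continue
        if INDEX_PAGE.fullmatch(name):
            continue
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Running entry_links over each of the 26 downloaded files and writing the combined result, one URL per line, would give the download.txt needed for the next step.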

2. Scrape the content
Same as before, with download.txt now holding the entry links: wget -i download.txt
Download all N of the html pages above, and the rest is easy.
-2013-11-14 16:35:47
16695 html files were fetched successfully and 3 were missed. I did not bother to work out which 3.
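
Finding the missing pages is a set difference between the URL list and the files on disk. A minimal Python sketch, assuming wget saved each page under the last path segment of its URL (its usual behaviour when no naming options are given); the function names are illustrative:

```python
import os
from urllib.parse import urlsplit


def expected_name(url):
    """File name the page is assumed to be saved under: the last path segment."""
    return os.path.basename(urlsplit(url).path)


def missing_urls(urls, downloaded_names):
    """Return the URLs whose expected file name is not among the downloaded files."""
    have = set(downloaded_names)
    return [url for url in urls if expected_name(url) not in have]
```

With urls read from download.txt and downloaded_names taken from os.listdir(r"D:\DOPF"), the returned list can be saved as a new URL file and passed to wget -i again. If the list comes back empty, duplicate lines in download.txt would be another possible explanation for the lower file count.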

3. Extract the text
By inspection, the entry content sits between the first <h1> and <div class="source">:

<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>

Use TextForever to extract the text.

Once extraction is done, merge the resulting 16695 html files.

While making this dictionary I thought it over and decided not to use "prepend the file name to the file content". In some cases that is needed, because it makes extracting the keywords easier. After testing, though, "append a blank line after the file content" is still required.

This gives dopf-src.txt. Process this txt to get a txt that can be built into an mdx.
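
For readers without TextForever, the same extraction rule can be sketched in Python. This is an alternative illustration, not what TextForever does internally. It assumes the two markers appear in every page exactly as in the Charybdis sample (a bare <h1> and <div class="source">); the function names are hypothetical.

```python
def extract_entry(html):
    """Return the text from the first <h1> up to, but not including,
    <div class="source">. Return None if either marker is missing."""
    start = html.find("<h1>")
    if start == -1:
        return None
    end = html.find('<div class="source">', start)
    if end == -1:
        return None
    return html[start:end].strip()


def merge_entries(fragments):
    """Concatenate entries, adding a blank line after each one."""
    return "".join(fragment + "\n\n" for fragment in fragments)
```

Applying extract_entry to each downloaded page and writing merge_entries(...) of the results to a file would give something comparable to dopf-src.txt. Pages for which extract_entry returns None are worth listing separately rather than dropping silently.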

4. Build the mdx
The merged text looks like this (screenshot):

The dictionary at http://www.infoplease.com/dictionary/brewers/ is evidently XML. Because the PC version of MDict does not support xml+css, the XML tags have to be replaced with HTML tags, through the series of operations below.
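
As a rough illustration of where this step ends up (not the replacements actually used here), the Python sketch below reshapes one extracted entry into a source record. It assumes the plain-text layout that MdxBuilder is generally fed: a headword line, the definition on the following line, and a line containing only </> to close the entry. The function name is hypothetical, and HTML entities in the headword are left undecoded.

```python
import re

H1 = re.compile(r"<h1>(.*?)</h1>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def to_mdx_record(fragment):
    """Turn one extracted entry into: headword line, one-line body, '</>' line."""
    match = H1.search(fragment)
    if match is None:
        raise ValueError("entry has no <h1> headword")
    # Drop any inline tags from the headword and normalise its whitespace.
    headword = " ".join(TAG.sub("", match.group(1)).split())
    # Put the whole entry on one line, with single spaces between words.
    body = " ".join(fragment.split())
    return f"{headword}\n{body}\n</>\n"
```

Concatenating the records for all entries would give one text file in that layout. Whether the <h1> should stay in the body or be replaced by other markup is a styling choice left open here.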


The final text after processing looks like this (screenshot):

Then write a little simple css.

A few small problems came up along the way and I solved them one by one. Finally, the finished product (screenshot):

Doesn't it look a bit easier on the eye than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

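PS: Although it is finished, I found some problems. As the screenshot above shows, some words are missing the space between them. I don't plan to fix this for now and will share the dictionary once I have had time to fix it. If anyone wants to fix it as practice, PM me and I will send you the downloaded pages.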
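
One possible cause of the missing spaces, offered as a guess rather than a diagnosis: the source HTML wraps paragraphs across lines (as in the Charybdis sample), so deleting line breaks outright fuses the last word of one line with the first word of the next. The Python sketch below contrasts the two ways of joining lines; the function names are illustrative.

```python
import re


def join_by_deleting(text):
    """Delete line breaks outright; words at line ends fuse together."""
    return text.replace("\r\n", "").replace("\n", "")


def join_with_spaces(text):
    """Replace each run of whitespace, line breaks included, with one space."""
    return re.sub(r"\s+", " ", text).strip()
```

For the wrapped text "Scylla and\nCharybdis are employed", the first function yields "Scylla andCharybdis are employed" and the second keeps the words apart. If the problem has a different origin, this will not help.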

Posted 2013-11-14 16:26:41
This thread deserves a bump!

    Posted 2013-11-15 23:30:45 from mobile
    Hello, and thank you for the tutorial.
    I followed it and completed step 1, but I am lost on how to do step 2 efficiently, the step where you fetch over a thousand pages. Typing the addresses in by hand one at a time would work, but it is not efficient. Is there a way to get every word's page address in bulk? I would appreciate any pointers. Many thanks.

    The site I want to scrape is:
    http://zokugo-dict.com/

    The gojūon table on the right is the index.

    Original poster | Posted 2013-11-16 14:14:53
    liuyunrushui posted on 2013-11-15 23:30
    Hello, and thank you for the tutorial.
    I followed it and completed step 1, but I am lost on how to do step 2 efficiently, the step where you ...

    cmd.exe

    wget -i download.txt
    All the page links go in download.txt; see http://baike.baidu.com/view/1312507.htm. You can also write your own program to do the fetching. Combined with awk and the like it can actually go faster: by the time the scraping finishes, the dictionary is practically built.
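
As a hedged illustration of "write your own program", the Python sketch below does roughly what wget -i download.txt does, using only the standard library. It is a minimal sketch rather than a replacement for wget: it has no retries and no resume, and the function names, the fetcher parameter and the delay parameter are all illustrative.

```python
import os
import time
from urllib.parse import urlsplit
from urllib.request import urlopen


def local_name(url):
    """File name to save a URL under: its last path segment, or index.html."""
    return os.path.basename(urlsplit(url).path) or "index.html"


def fetch(url, timeout=30):
    """Download one URL and return the raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, out_dir, fetcher=fetch, delay=0.0):
    """Save every URL under out_dir and return the URLs that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        try:
            data = fetcher(url)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            failed.append(url)
            continue
        with open(os.path.join(out_dir, local_name(url)), "wb") as f:
            f.write(data)
        time.sleep(delay)  # optional pause between requests
    return failed
```

The returned list makes it easy to see which pages did not download and to retry just those.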


    Posted 2014-4-1 09:02:21
    thank you very much