掌上百科 - PDAWIKI


[Tutorial] Making Dictionary of Phrase and Fable, E. Cobham Brewer, 1894


Posted 2013-11-14 08:24:13
Last edited by Oeasy on 2013-11-17 09:54

A web-scraping and mdx-building tutorial that could hardly be simpler (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/ (reference: http://baike.baidu.com/view/1312507.htm)
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ . This dictionary is in the public domain, the site has no scraping restrictions (at least none that I can see for now), and the index is very easy to obtain, so I use it as the example.
Also: there is a PDF at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536 . The edition is unknown; it appears to be the 17th.

Steps
1. Get the index
Look at http://www.infoplease.com/dictionary/brewers/ : the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content:

http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

These addresses all come from looking at the site above. Name the txt file download.txt.
I put download.txt and wget.exe together in D:\DOPF (if the wget you downloaded is named wget plus a version number, you may as well rename it to wget.exe).
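
The 26 index addresses follow a single pattern, so the list does not have to be typed by hand. A minimal sketch, assuming Python 3 is available (the post itself only uses wget and text editors); the function names are made up for illustration:

```python
import string

BASE = "http://www.infoplease.com/dictionary/brewers/"

def index_urls():
    """Return the 26 per-letter index URLs, index-a.html through index-z.html."""
    return [f"{BASE}index-{letter}.html" for letter in string.ascii_lowercase]

def write_url_list(path="download.txt"):
    """Write one URL per line, which is the layout `wget -i` reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(index_urls()) + "\n")
```

Calling `write_url_list(r"D:\DOPF\download.txt")` would produce the same file as the hand-made one above.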

cmd.exe -> CD /D D:\DOPF -> wget -i download.txt

The 26 html files download quickly. Processing these 26 html files to pull out the entry links gives:

http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

There are 16698 links like these in total.
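
The post does not say how the 26 index pages were turned into this link list (presumably with the text editors listed earlier). One possible way to do it, sketched in Python with only the standard library, is below. It assumes that each index page links to its entries with ordinary `<a href>` tags pointing directly under `/dictionary/brewers/`; the class and function names are hypothetical, and the filter that skips `index-x.html` pages is a guess at what counts as a non-entry page.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
# Assumption: pages named index.html or index-<letter>.html are indexes, not entries.
INDEX_PAGE = re.compile(r"index(-[a-z])?\.html")

class LinkCollector(HTMLParser):
    """Collect absolute URLs of entry pages linked from one index page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.page_url, href)  # resolve relative links
        if not url.startswith(BASE):
            return  # link leaves the dictionary
        name = url[len(BASE):]
        if "/" in name or not name.endswith(".html"):
            return
        if INDEX_PAGE.fullmatch(name):
            return
        self.links.append(url)

def entry_links(html_text, page_url):
    """Return entry URLs found in html_text, de-duplicated, in page order."""
    parser = LinkCollector(page_url)
    parser.feed(html_text)
    parser.close()
    return list(dict.fromkeys(parser.links))
```

Running `entry_links` over each downloaded index file and writing the combined result to a new download.txt would give a list in the same shape as the one above.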

2. Fetch the content
Same as before: put the links above into download.txt and run wget -i download.txt
to fetch all N of those html pages. After that the rest is simple.
-2013-11-14 16:35:47
16695 html pages were fetched successfully and 3 were missed. I could not be bothered to work out which 3.
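
Finding the pages that were missed comes down to comparing the file names in the link list with the files that actually arrived. A small sketch, assuming each page is saved under the last path segment of its URL (which is what the file names in this post suggest); `missing_urls` is a made-up helper and the directory contents in the demonstration are invented:

```python
from posixpath import basename
from urllib.parse import urlsplit

def missing_urls(urls, downloaded_names):
    """Return the URLs whose file name is not among downloaded_names, in order."""
    have = set(downloaded_names)
    return [u for u in urls if basename(urlsplit(u).path) not in have]

# Tiny demonstration with invented directory contents.
urls = [
    "http://www.infoplease.com/dictionary/brewers/aaron.html",
    "http://www.infoplease.com/dictionary/brewers/ab.html",
    "http://www.infoplease.com/dictionary/brewers/aback.html",
]
print(missing_urls(urls, ["aaron.html", "aback.html"]))
```

In practice `urls` would be the lines of download.txt and `downloaded_names` the result of `os.listdir` on the download folder; the returned URLs can then be written to a second list and fetched again with wget -i.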

3. Extract the text
Inspection shows that the entry content sits between the first <h1> and <div class="source">:
<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>
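
The same cut, from the first <h1> up to the source line, can be expressed as a regular expression. This is only a sketch of the rule stated above, not what TextForever does internally; `extract_entry` is a hypothetical name, and it assumes the page contains the marker exactly as `<div class="source"`.

```python
import re

# From the first <h1 ...> up to, but not including, <div class="source"
ENTRY_RE = re.compile(
    r'<h1\b[^>]*>.*?(?=<div\s+class="source")',
    re.DOTALL | re.IGNORECASE,
)

def extract_entry(page_html):
    """Return the entry markup of one page, or None if the markers are absent."""
    match = ENTRY_RE.search(page_html)
    return match.group(0).strip() if match else None
```

Returning None instead of raising makes it easy to list the pages whose layout differs from the sample above.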

Use TextForever to extract the text.

[screenshot]

Once extraction is done, merge the resulting 16695 html files.

For this dictionary I decided, after some thought, not to use the "prepend the file name to the file content" option (in some cases you do need it, because it makes extracting keywords easier). Testing showed that "append a blank line after the file content" is still needed.

This gives dopf-src.txt. Further processing of this txt produces a txt that can be built into an mdx.
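
For readers without TextForever, the merge step (every file's content followed by a blank line) might look like the sketch below. The function name, the `*.html` pattern, and the UTF-8 encoding are assumptions; the post does not state the encoding of the downloaded pages, so undecodable bytes are replaced rather than allowed to stop the run.

```python
from pathlib import Path

def merge_files(src_dir, out_path, pattern="*.html"):
    """Concatenate matching files into out_path, a blank line after each.

    Returns the number of files merged. Files are taken in name order.
    """
    paths = sorted(Path(src_dir).glob(pattern))
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for path in paths:
            # Assumption: pages are UTF-8; bad bytes become U+FFFD.
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            out.write(text + "\n\n")
    return len(paths)
```

The blank line gives later search-and-replace steps a reliable boundary between one entry and the next.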

4. Build the mdx
The merged text looks like this:

[screenshot]

The dictionary at http://www.infoplease.com/dictionary/brewers/ is evidently XML. Since MDict for PC does not support xml+css, the xml tags have to be replaced with html tags, through the following series of operations.

[screenshot]

The final text after processing looks like this:

[screenshot]
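
As a rough idea of what build-ready text can look like: the sketch below assumes the plain-text source layout commonly used with MdxBuilder, namely a headword line, the entry's HTML on one line, then a line containing only `</>`. That layout is an assumption here, since the post's own processed text is not shown, and `to_record` is a hypothetical helper.

```python
import html
import re

H1_RE = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def to_record(entry_html):
    """Turn one extracted entry into a headword / body / </> record."""
    match = H1_RE.search(entry_html)
    if match is None:
        raise ValueError("entry has no <h1> headword")
    # Headword: text of the first <h1>, tags stripped, entities decoded.
    headword = " ".join(html.unescape(TAG_RE.sub("", match.group(1))).split())
    # Body: whole entry on one line; each run of whitespace becomes ONE space.
    body = " ".join(entry_html.split())
    return f"{headword}\n{body}\n</>\n"
```

Line breaks are collapsed to a single space rather than deleted because the entry paragraphs wrap mid-sentence in the page source (for example after "Scylla and"), so deleting the breaks outright would fuse adjacent words. That may be related to the missing spaces mentioned in the PS, though the post does not confirm the cause.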
Then write a little css.

[screenshot]

There were a few small problems along the way, solved one by one. Finally, the finished product:

[screenshot]

Doesn't it look a bit nicer than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

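PS: Although it is finished, I found some problems. As the screenshots above show, spaces are missing between some words. I don't plan to fix this for now; I'll share the result once I find time to fix it. If anyone wants to fix it as practice, PM me and I'll send you the downloaded pages.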


Posted 2013-11-14 16:26:41
This thread deserves a bump!

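    Posted 2013-11-15 23:30:45 from mobile
    Hello, and thank you for the tutorial.
    I followed it and finished step 1, but I am completely lost on how to do step 2 efficiently, that is, the step where you fetch the thousand-plus pages. Typing the addresses in by hand one at a time is one way, but it is not efficient. Is there a way to get the page address of every word in bulk? I would appreciate any pointers. Many thanks.

    The site I want to scrape is:
    http://zokugo-dict.com/

    The gojūon (fifty-sounds) table on the right is the index.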


    OP | Posted 2013-11-16 14:14:53
    liuyunrushui wrote on 2013-11-15 23:30:
    Hello, and thank you for the tutorial.
    I followed it and finished step 1, but how to do step 2 efficiently, that is, the step you ...

    cmd.exe
    wget -i download.txt
    All the page links go in download.txt; see http://baike.baidu.com/view/1312507.htm for reference. You can also write your own program to do the fetching. Combined with awk and the like it can actually be faster: by the time the fetching finishes, the dictionary is essentially built too.
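
For the "write your own program" route, a bare-bones downloader could look like the sketch below. It is not what the OP used (that was wget); the function names, the half-second pause, and the skip-if-already-present behaviour are illustrative choices. The `fetcher` parameter exists so the loop can be exercised without touching the network.

```python
import time
import urllib.request
from pathlib import Path
from posixpath import basename
from urllib.parse import urlsplit

def fetch(url, timeout=30):
    """Download one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def download_all(urls, out_dir, fetcher=fetch, delay=0.5):
    """Save each URL under out_dir; return the URLs that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        # Assumes every URL ends in a file name such as aaron.html.
        target = out / basename(urlsplit(url).path)
        if target.exists():
            continue  # already fetched, so an interrupted run can resume
        try:
            target.write_bytes(fetcher(url))
        except OSError:  # urllib's URLError and timeouts derive from OSError
            failed.append(url)
        time.sleep(delay)  # small pause between requests
    return failed
```

The returned list makes it immediately clear which pages need another attempt, which a plain wget -i run does not report as directly.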


    Posted 2014-4-1 09:02:21
    thank you very much