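If you don't need the images and links, you can use this Ruby script I threw together. :lol
max is the highest entry number and min is the lowest, starting from 1. Process as many entries per day as you like, or put it on a server and process everything into a single file.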
Code for a slow connection:

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'

    max = 200   # highest page number to fetch
    min = 1     # lowest page number to fetch

    # Open in append mode so repeated runs keep adding to the same file.
    dic = File.open("baidudic#{min}-#{max}.txt", "a")
    while min < max + 1 do
      url = "http://baike.baidu.com/view/#{min}.htm"
      puts url
      doc = Hpricot(open(url))
      # Use the part of the page title before the first "_" as the headword.
      title0 = (doc/:title).inner_html
      title = title0.split('_')
      content = (doc/"#lemmaContent").inner_html
      temp = content.gsub(/<\/?[^>]*>/, "")   # strip HTML tags (drops images and links)
      temp = temp.gsub(/编辑本段/, "")          # drop the "edit this section" link text
      dic.puts title[0]
      dic.puts "原文链接:#{url}"               # "original link" line
      dic.puts temp
      dic.puts "</>"                          # entry separator
      puts "OK"
      min = min + 1
    end
    dic.close
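The text-cleaning step of the script can be tried on its own, without any network access. The following is a minimal sketch: the helper names `strip_markup` and `format_entry` and the sample HTML are made up for illustration, and only the two `gsub` calls, the "原文链接" line, and the `</>` separator are taken from the script.

```ruby
# Hypothetical helpers; the script above does the same work inline.

# Remove HTML tags and the "编辑本段" ("edit this section") link text.
def strip_markup(html)
  html.gsub(/<\/?[^>]*>/, "").gsub(/编辑本段/, "")
end

# Build one dictionary entry: headword, source link, body text, then "</>".
def format_entry(title, url, html)
  [title, "原文链接:#{url}", strip_markup(html), "</>"].join("\n")
end

# Made-up sample input standing in for a downloaded page fragment.
sample = '<div id="lemmaContent"><p>Ruby is a <b>language</b>.</p>编辑本段</div>'
puts format_entry("Ruby", "http://baike.baidu.com/view/1.htm", sample)
```

Because the regular expression removes every tag, images and links disappear along with the markup, which is the point of the script.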
Code for a fast connection:

    # baidubaike 2 mdict by daming
    # [email protected]
    require 'rubygems'
    require 'hpricot'
    require 'pathname'
    require 'fileutils'
    require 'open-uri'

    Maxn = 20   # number of batches
    max = 100   # pages per batch
    min = 1     # first index within a batch

    dic = File.open("baidudic#{min}-#{Maxn*max}.txt", "a")
    for j in 0..(Maxn-1) do
      # Pass 1: download one batch of pages into a temporary directory.
      FileUtils.makedirs("temp")
      i = min
      while i < max + 1 do
        url = "http://baike.baidu.com/view/#{i+j*max}.htm"
        puts url
        data = open(url) { |f| f.read }
        # File.join keeps the path valid on both Windows and Linux.
        open(File.join("temp", "#{i}.htm"), "wb") { |f| f.write(data) }
        puts "download"
        i = i + 1
      end
      # Pass 2: convert the cached pages and append them to the dictionary file.
      i = min
      while i < max + 1 do
        # Rebuild the URL so the log line and the link in the entry match this page.
        url = "http://baike.baidu.com/view/#{i+j*max}.htm"
        puts url
        doc = open(File.join("temp", "#{i}.htm")) { |f| Hpricot(f) }
        title0 = (doc/:title).inner_html
        title = title0.split('_')
        content = (doc/"#lemmaContent").inner_html
        temp = content.gsub(/<\/?[^>]*>/, "")   # strip HTML tags
        temp = temp.gsub(/编辑本段/, "")          # drop the "edit this section" link text
        dic.puts title[0]
        dic.puts "原文链接:#{url}"               # "original link" line
        dic.puts temp
        dic.puts "</>"                          # entry separator
        puts "converted"
        i = i + 1
      end
      # Delete the cache before starting the next batch.
      dir = Pathname.new("temp")
      dir.rmtree
      puts "cache cleaned"
    end
    dic.close
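To see which pages each batch covers, the `i+j*max` arithmetic from the script can be pulled out and printed. This is only a sketch and assumes `min` stays at 1 as in the post. The names `page_id` and `batch_ids` are invented for the example.

```ruby
# Page number handled at position `index` (1..per_batch) of batch `batch`
# (0-based), mirroring the i + j*max expression in the script.
def page_id(batch, index, per_batch)
  index + batch * per_batch
end

# All page numbers covered by one batch, as a range.
def batch_ids(batch, per_batch)
  page_id(batch, 1, per_batch)..page_id(batch, per_batch, per_batch)
end

batches = 20     # Maxn in the script
per_batch = 100  # max in the script
batches.times do |j|
  ids = batch_ids(j, per_batch)
  puts "batch #{j}: pages #{ids.first}-#{ids.last}"
end
```

With these values the batches line up end to end, from page 1 through page 2000, which matches the `Maxn*max` used in the output file name.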
Ruby download for Windows:
http://rubyforge.org/frs/download.php/29263/ruby186-26.exe
On Linux this is not a problem.
Don't run it in several windows at once, or Baidu will block you.
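One way to go easy on the server is to fetch pages strictly one after another with a pause in between. The sketch below is not part of the original scripts. The helper name `each_page_politely` and the 0.5 second delay are assumptions, and the post does not say what request rate Baidu tolerates.

```ruby
# Visit each page number in order, one at a time, sleeping `delay` seconds
# between consecutive requests. The block does the actual work per page.
def each_page_politely(ids, delay)
  ids.each_with_index do |id, n|
    sleep(delay) if n > 0 && delay > 0
    yield id
  end
end

# Usage sketch: print the URLs instead of downloading them.
# The 0.5 second delay is an arbitrary example value.
each_page_politely(1..3, 0.5) do |id|
  puts "http://baike.baidu.com/view/#{id}.htm"
end
```

In the scripts above, the body of the block would be the download and conversion code for a single page.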
[ Last edited by 发哥 on 2008-10-15 21:12 ]