Posted on 2016-10-13 18:42:44
lxchen2001 posted on 2016-10-13 17:44
I understand your question now. You want to split the article into individual sentences.

In the page's HTML the text sits together; it only becomes two columns after processing ...
This should work:
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and parse it with the lxml parser.
    r = requests.get('http://www.cuyoo.com/article-30928-1.html')
    soup = BeautifulSoup(r.text, 'lxml')

    # .strings is a generator that yields each text fragment in the element.
    en = soup.find(id='en')
    enstring = en.strings
    cn = soup.find(id='cn')
    cnstring = cn.strings

    # Write English and Chinese fragments alternately, one per line.
    with open('/30928.txt', 'w', encoding='utf-8') as file:
        while True:
            try:
                ensentence = next(enstring)
                # print(ensentence)
                file.write(ensentence)
                file.write('\n')
                cnsentence = next(cnstring)
                # print(cnsentence)
                file.write(cnsentence)
                file.write('\n')
            except StopIteration:
                # Either column ran out of fragments.
                print('Finished')
                break
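One thing to watch: the loop above stops as soon as either column runs out, and `.strings` also yields whitespace-only fragments (BeautifulSoup has `.stripped_strings` if you want those skipped). Here is a rough sketch of the pairing step on its own, assuming you just want the leftover lines of the longer column written at the end. The `interleave` helper and the sample lists are made up for illustration and are not from the page.

```python
from itertools import zip_longest


def interleave(en_parts, cn_parts):
    """Yield English and Chinese fragments alternately, skipping blanks."""
    # Drop whitespace-only fragments and trim the rest.
    en_clean = (s.strip() for s in en_parts if s.strip())
    cn_clean = (s.strip() for s in cn_parts if s.strip())
    # zip_longest keeps going until the longer side is exhausted.
    for en, cn in zip_longest(en_clean, cn_clean):
        if en is not None:
            yield en
        if cn is not None:
            yield cn


# Hypothetical fragments standing in for en.strings and cn.strings.
sample_en = ["Hello.", "\n", "How are you?", "Goodbye."]
sample_cn = ["你好。", "你好吗?"]
lines = list(interleave(sample_en, sample_cn))
print("\n".join(lines))
```

With the real page you would pass `en.strings` and `cn.strings` instead of the sample lists and write each yielded line to the file. If the two columns do not have the same number of fragments, the pairs will not line up sentence for sentence, so it is worth checking the output by eye.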