Posted on 2018-11-16 21:27:49
    '''
    Based on xmllarge.py
    '''
    # from pyquery import PyQuery as pq
    from pathlib import Path


    def xml_iter(file, tag):
        '''
        Process huge xml files

        <tag> </tag> need to be in separate lines
        # TODO: in the middle of lines

        :file: file path
        :tag: element to retrieve
        '''
        # Byte patterns to search for, e.g. for tag='tu':
        # b'<tu>', b'<tu ' and b'</tu>'
        tagb1 = ('<' + tag + '>').encode()
        tagb2 = ('<' + tag + ' ').encode()
        tagb3 = ('</' + tag + '>').encode()

        with open(file, 'rb') as inputfile:
            append = False
            inputbuffer = b''  # defined up front so a stray closing tag cannot hit an unset name
            for line in inputfile:
                if tagb1 in line:
                    # opening tag without attributes: start a new buffer here
                    inputbuffer = line[line.index(tagb1):]
                    append = True
                elif tagb2 in line:
                    # opening tag with attributes
                    inputbuffer = line[line.index(tagb2):]
                    append = True
                elif tagb3 in line and append:
                    # closing tag: finish the buffer and hand it to the caller
                    inputbuffer += line[:line.index(tagb3) + len(tagb3)]
                    append = False
                    yield inputbuffer
                    #~ docitem = process_buffer(inputbuffer, id_num)
                    #~ print(id_num)
                    #~ id_num += 1
                    inputbuffer = b''
                elif append:
                    # line inside the element
                    inputbuffer += line
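The function above still needs the opening and closing tags on separate lines (see its TODO). Here is a rough sketch of one way to lift that restriction. The name `xml_iter_inline` and the regex approach are my own assumptions, not part of xmllarge.py, and it assumes the element is never nested inside itself and is never written in self-closing form.

```python
import re


def xml_iter_inline(file, tag):
    """Yield each <tag>...</tag> element as bytes, even when the opening and
    closing tags share a line. Hypothetical variant, not the original code."""
    # An opening tag is '<tag' followed by '>' or whitespace.
    open_re = re.compile(b'<' + re.escape(tag.encode()) + rb'[\s>]')
    close = b'</' + tag.encode() + b'>'
    buffer = None  # None means "currently not inside an element"
    with open(file, 'rb') as inputfile:
        for line in inputfile:
            pos = 0
            while pos < len(line):
                if buffer is None:
                    match = open_re.search(line, pos)
                    if match is None:
                        break  # no element starts on the rest of this line
                    buffer = b''
                    pos = match.start()
                end = line.find(close, pos)
                if end == -1:
                    buffer += line[pos:]  # element continues on the next line
                    break
                end += len(close)
                yield buffer + line[pos:end]
                buffer = None
                pos = end  # keep scanning: another element may follow
```

It still reads the file one line at a time, so memory use is bounded by the longest line plus the current element.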
So many people are looking for this? I'll package it up as a small tool and post it in a while.
Usage of the Python 3 function above:

    resu = b''  # xml_iter yields bytes, so start from an empty bytes object
    for elm in xml_iter(filename, 'tu'):
        resu += elm
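The memory footprint stays tiny no matter how large the file is, as long as each element is processed as it arrives rather than accumulated.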
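To keep the footprint small in practice, each yielded element can be parsed and thrown away before the next one is read. A minimal sketch, assuming TMX-style data where each `tu` contains `seg` children; the helper name `iter_segments` and the `seg` tag are assumptions for illustration, and the sample list stands in for `xml_iter(filename, 'tu')`.

```python
import xml.etree.ElementTree as ET


def iter_segments(chunks):
    """Parse each element as it arrives and yield the text of its <seg> children.

    `chunks` is any iterable of bytes objects, each holding one complete
    element, such as the generator returned by xml_iter(filename, 'tu').
    The <seg> child name is an assumption (TMX-style data).
    """
    for chunk in chunks:
        elem = ET.fromstring(chunk)  # only this one element is parsed at a time
        yield [''.join(seg.itertext()) for seg in elem.iter('seg')]


# Stand-in for xml_iter(filename, 'tu'): two small elements as bytes.
sample = [
    b'<tu>\n<tuv><seg>Hello</seg></tuv>\n<tuv><seg>Bonjour</seg></tuv>\n</tu>',
    b'<tu>\n<tuv><seg>Bye</seg></tuv>\n</tu>',
]
for segments in iter_segments(sample):
    print(segments)
```

One caveat: a chunk cut out of a larger file may use a namespace prefix declared on an ancestor element, in which case parsing it on its own will likely fail and the prefix would need to be handled first.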