Posted on 2018-11-25 16:50:10
Last edited by mikeee on 2018-11-25 18:08
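There is one approach that should work: first convert the pdf to docx with Abbyy Finereader, then convert the docx to htm.

I don't have Finereader installed on my machine, so I ran ten pages through the online version at https://finereaderonline.com (the online service only OCRs ten pages per day). The result is decent: the page headers disappear automatically in the htm, the two columns become a single column, bold is preserved, and the hyphens from line breaks in the original pdf appear to have been removed. However, paragraphs that span pages in the original pdf don't seem to have been merged.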
A quick look in Chrome DevTools shows:
css selector: p.Bodytext21 matches all the definitions
css selector: p.Bodytext21>span.Bodytext2Bold matches the bold text inside the definitions
I can't post images here, so here are the docx and htm files (10 pages only). Baidu Pan link: https://pan.baidu.com/s/15Qc4tQeWcePy7AhTJLiJXQ extraction code: encg
After some tinkering, I got this python3 code to process the htm mentioned above. Its output can more or less be turned into an mdx:
    '''Word and Phrase Origins test.'''
    from pyquery import PyQuery as pq

    file = r'WordandPhraseOrigins.htm'
    try:
        with open(file, 'rt', encoding='utf8') as fh:
            html = fh.read()
    except Exception as exc:
        print('error: {}. Trying gb2312...'.format(exc))
        try:
            with open(file, 'rt', encoding='gb2312') as fh:
                html = fh.read()
            print('Looks good')
        except Exception as exc:
            # Raise, so the script stops here instead of reaching pq(html) with html undefined.
            raise SystemExit('error: {}. Giving up...'.format(exc))
    doc = pq(html)

    css_text = 'p.Bodytext21'        # every definition paragraph
    css_bold = 'span.Bodytext2Bold'  # bold headword inside a paragraph
    css_body = 'span.Bodytext20'     # regular definition text inside a paragraph


    def entry(idx, elm):
        # Tag the headword with "(hw)"; paragraphs without one start with a bare newline.
        para = pq(elm)
        headword = para(css_bold).text()
        return headword + ('(hw)\n' if headword else '\n') + para(css_body).text()


    text = doc(css_text).map(entry)
    print('\n\n'.join(text[:60]))
The output of the code above looks roughly like this:
...
A-Rod.(hw)
People who have little or no knowledge of baseball might have trouble with these initials. They are short for Alex Rodriguez, the famous Yankee baseball star.

around Cape Horn.(hw)
An expression once used in whaling communities to mean “being away on a whaling voyage.” One old poem went:

“I’ll tell your father, boys,” I cried To lads at play upon my lawn.

They chorused back, “You’ll have to go Around Cape Horn.”

around the horn.(hw)
In the days of the tall ships any sailor who had sailed around Cape Horn was entitled to spit to windward; otherwise, it was a serious infraction of nautical rules of conduct. Thus, the permissible practice of spitting to windward was called Cape Horn isn’t so named because it is shaped like a horn. Captain Schouten, the Dutch navigator who first rounded it in 1616, named it after Hoorn, his birthplace in northern Holland.

arrant thief; knight errant.(hw)
was originally just a variation of nomadic or vagabond, the word best known in a knight who roamed the country performing good deeds. But from its persistent use in expressions such as an a thief who roamed the countryside holding up victims, came to mean thorough, downright, or out-
...
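As for turning this output into an mdx, here is a rough sketch of one way to do it. It assumes the plain-text source layout that MdxBuilder accepts (headword on one line, the definition as HTML on the next, then a line containing only `</>`). The function names, the dropped trailing period and the `<p>` wrapping are my own assumptions, not something the script above does.

```python
import html

HW_TAG = '(hw)'  # marker the script above appends to headword lines


def group_entries(output):
    """Group tagged output into (headword, [paragraphs]) pairs.

    A paragraph without a headword is attached to the entry before it,
    which also picks up paragraphs that were split at a page break.
    """
    entries = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(HW_TAG):
            headword = line[:-len(HW_TAG)].strip()
            # Assumption: drop the trailing period so "A-Rod." is looked up as "A-Rod".
            if headword.endswith('.'):
                headword = headword[:-1]
            entries.append((headword, []))
        elif entries:
            entries[-1][1].append(line)
    return entries


def to_mdx_source(output):
    """Render the entries as headword / definition / '</>' records."""
    records = []
    for headword, paragraphs in group_entries(output):
        body = ''.join('<p>{}</p>'.format(html.escape(p, quote=False))
                       for p in paragraphs)
        records.append('{}\n{}\n</>'.format(headword, body))
    return '\n'.join(records) + '\n'


SAMPLE = ('A-Rod.(hw)\nShort for Alex Rodriguez.\n\n'
          'around Cape Horn.(hw)\nAn expression once used in whaling communities.\n\n'
          '\nThey chorused back.')

print(to_mdx_source(SAMPLE))
```

Dropping the trailing period would mangle headwords that really end in one (an abbreviation, say), so that rule needs checking against the full headword list before use.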
By the way, a plug for pyquery: doesn't it beat regex, bs4 and lxml hands down?