Posted on 2018-11-25 16:50:10
Last edited by mikeee on 2018-11-25 18:08
Here is an approach that should work: convert the PDF to docx with Abbyy Finereader, then convert the docx to htm.

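I don't have Finereader installed, so I ran ten pages through the online version at https://finereaderonline.com (the online service only OCRs ten pages per day). The result is good: page headers disappear from the htm automatically, the two columns become a single column, bold text is preserved, and the hyphens from line breaks in the original PDF appear to have been removed. However, paragraphs that span a page break in the original PDF do not seem to have been merged.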
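
For the paragraphs split across pages, one possible workaround is to glue a paragraph onto the previous one whenever the previous one does not end in closing punctuation. The sketch below is only a heuristic under that assumption; the function names are made up for illustration, and it works on plain paragraph strings rather than on the htm itself.

```python
# Characters that may follow the final punctuation mark (quotes, brackets).
CLOSERS = '"\')\u201d\u2019'


def looks_finished(para):
    """Assumption: a complete paragraph ends in . ! ? : or ; (plus optional closers)."""
    stripped = para.rstrip().rstrip(CLOSERS)
    return stripped.endswith(('.', '!', '?', ':', ';'))


def merge_split_paragraphs(paragraphs):
    """Join each paragraph onto the previous one when the previous looks cut off."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        if merged and not looks_finished(merged[-1]):
            # A trailing hyphen is kept and joined without a space, so that
            # compounds such as "out-and-out" survive the merge.
            sep = '' if merged[-1].endswith('-') else ' '
            merged[-1] = merged[-1] + sep + para
        else:
            merged.append(para)
    return merged
```

Headwords or verse lines that legitimately end without punctuation would be merged by mistake, so the result still needs a manual check.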

A quick look in Chrome DevTools:
css selector: p.Bodytext21 matches every definition
css selector: p.Bodytext21>span.Bodytext2Bold matches the bold text inside a definition
I can't post images, so here are the docx and htm files (10 pages only). Baidu Pan link: https://pan.baidu.com/s/15Qc4tQeWcePy7AhTJLiJXQ extraction code: encg

After some tinkering, this python3 code processes the htm mentioned above; what it produces can more or less be turned into an mdx:

    '''Word and Phrase Origins test.'''
    from pyquery import PyQuery as pq

    file = r'WordandPhraseOrigins.htm'

    # Try UTF-8 first and fall back to GB2312 if reading fails.
    try:
        with open(file, 'rt', encoding='utf8') as fh:
            html = fh.read()
    except Exception as exc:
        print('error: {}. Trying gb2312...'.format(exc))
        try:
            with open(file, 'rt', encoding='gb2312') as fh:
                html = fh.read()
            print('Looks good')
        except Exception as exc:
            # Abort if neither encoding works.
            raise SystemExit('error: {}. Giving up...'.format(exc))

    doc = pq(html)

    css_text = 'p.Bodytext21'  # every definition paragraph
    css_bold = 'span.Bodytext2Bold'  # bold text inside a definition
    css_body = 'span.Bodytext20'  # body text of a definition


    def to_entry(idx, elm):
        # Bold text plus an '(hw)' marker on the first line, body text on the next.
        para = pq(elm)
        headword = para(css_bold).text()
        marker = '(hw)\n' if headword else '\n'
        return headword + marker + para(css_body).text()


    text = doc(css_text).map(to_entry)
    print('\n\n'.join(text[:60]))

The output of the code above looks roughly like this:

...
A-Rod.(hw)
People who have little or no knowledge of baseball might have trouble with these initials. They are short for Alex Rodriguez, the famous Yankee baseball star.

around Cape Horn.(hw)
An expression once used in whaling communities to mean “being away on a whaling voyage.” One old poem went:

“I’ll tell your father, boys,” I cried To lads at play upon my lawn.

They chorused back, “You’ll have to go Around Cape Horn.”

around the horn.(hw)
In the days of the tall ships any sailor who had sailed around Cape Horn was entitled to spit to windward; otherwise, it was a serious infraction of nautical rules of conduct. Thus, the permissible practice of spitting to windward was called Cape Horn isn’t so named because it is shaped like a horn. Captain Schouten, the Dutch navigator who first rounded it in 1616, named it after Hoorn, his birthplace in northern Holland.

arrant thief; knight errant.(hw)
was originally just a variation of nomadic or vagabond, the word best known in a knight who roamed the country performing good deeds. But from its persistent use in expressions such as an a thief who roamed the countryside holding up victims, came to mean thorough, downright, or out-
...
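
As for turning this output into an mdx, here is a rough sketch of one possible next step. It assumes the plain-text source layout commonly fed to MdxBuilder (headword on one line, the definition as HTML on the next, then a line containing only `</>`); that layout, the `to_mdx_source` name and the `<p>` wrapping are my assumptions, not something taken from the script above.

```python
import html

HW_MARK = '(hw)'  # marker the script above appends to bold headwords


def to_mdx_source(chunks):
    """Turn the script's text chunks into MDict-style source text (assumed layout)."""
    entries = []  # each item: [headword, [paragraph, ...]]
    for chunk in chunks:
        first, _, rest = chunk.partition('\n')
        if first.endswith(HW_MARK):
            headword = first[:-len(HW_MARK)].strip()
            entries.append([headword, [rest.strip()]])
        elif entries:
            # No headword: treat the chunk as a continuation of the previous entry.
            entries[-1][1].append(chunk.strip())

    records = []
    for headword, paragraphs in entries:
        body = ''.join(
            '<p>{}</p>'.format(html.escape(p, quote=False))
            for p in paragraphs if p
        )
        records.append('{}\n{}\n</>\n'.format(headword, body))
    return ''.join(records)
```

Passing in the `text` list from the script and writing the returned string to a UTF-8 file would give something to try in an mdx builder. Chunks that appear before the first headword are dropped in this sketch.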
By the way, a plug for pyquery: doesn't it beat regex, bs4, and lxml hands down?