Extracting Text from PDFs and Photos with Python, Exporting to Excel, and Removing Specific Characters (Regex)


늉_늉 2022. 12. 21. 19:00

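During a break, I picked up a part-time job grading tests at an English academy.

One of my tasks was to make test sheets by typing out the words from an English vocabulary book. After typing word after word,

I figured Python could make this easier, so I looked into it.

I tried two approaches.

1. Extracting text directly from a photo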

import os

import cv2
import pytesseract
from PIL import Image

# Path to the installed Tesseract executable (64-bit Windows)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
# On a 32-bit install => r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# Load the image and convert it to grayscale
image = cv2.imread("KakaoTalk_20221221_175129075.jpg")
if image is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError("Could not read the input image")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Save the grayscale image as a temporary file for OCR
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# Run OCR with the English and Korean language data, then delete the temp file
with Image.open(filename) as img:
    text = pytesseract.image_to_string(img, lang='eng+kor')
os.remove(filename)

print(text)

# Show the original image until a key is pressed
cv2.imshow("Image", image)
cv2.waitKey(0)

2. Extracting text from a PDF

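from tika import parser

pdf_path = "pdftest.pdf"
parsed = parser.from_file(pdf_path)

# Write the extracted text to a UTF-8 file.
# 'content' can be None when no text was extracted, so fall back to ''.
with open('output.txt', 'w', encoding='utf-8') as txt:
    print(parsed['content'] or '', file=txt)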
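The extracted text tends to contain blank lines and stray whitespace, so a small cleanup step can help before saving. The following is only a sketch: `extract_lines` and `fake_parsed` are hypothetical names, and the dictionary is a hand-made stand-in shaped like a Tika result, not real output.

```python
def extract_lines(parsed):
    """Return the non-empty, stripped lines of a Tika-style result dict."""
    # 'content' may be missing or None, so fall back to an empty string.
    content = parsed.get("content") or ""
    return [line.strip() for line in content.splitlines() if line.strip()]


# Hypothetical stand-in for the dict returned by parser.from_file.
fake_parsed = {"metadata": {}, "content": "\n\n1 apple\n\n  2 banana  \n"}
print(extract_lines(fake_parsed))
```

The returned list can then be joined with newlines and written to output.txt in place of the raw content.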

In both cases, the results were much better when I applied a black-and-white effect before processing.
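To make the black-and-white step concrete, here is a rough sketch of what a grayscale conversion does to a single pixel. It uses the standard luma weights that OpenCV documents for its RGB-to-gray conversion; the function name `to_gray` is made up for this illustration, and in practice cv2.cvtColor handles the whole image at once.

```python
def to_gray(r, g, b):
    """Return the luma-weighted gray level (0-255) for one RGB pixel."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


# White stays white, black stays black, and colors collapse to one brightness value.
for pixel in [(255, 255, 255), (0, 0, 0), (255, 0, 0)]:
    print(pixel, "->", to_gray(*pixel))
```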

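I found both snippets by googling.

For the second approach, I scanned the pages with an app called v-flat and converted them to PDF.

The result looks like this:

the English words come first, followed by some of the meanings,

and some stray content like this is mixed in as well.

The conversion isn't perfect, but going from typing everything to copy and paste should make the job a lot easier.

+ Converting the result to Excel and applying formulas there cuts down the manual work even more, so I added that as a post-processing step.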

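# pip install pandas openpyxl
import pandas as pd

# Earlier attempt with a tab separator:
# df = pd.read_csv('output.txt', sep="\t", encoding='utf-8')
# print(df)
# df.to_excel('newResult.xlsx', index=True)

# Split columns on four spaces; a multi-character separator needs the python engine
df = pd.read_csv('output.txt', sep='    ', engine='python', encoding='utf-8')

print(df)

# Writing .xlsx files requires openpyxl
df.to_excel('테스트1.xlsx', index=False)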
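To see what the four-space separator does to a single line, here is a small standard-library sketch. The helper `split_columns` and the sample lines are hypothetical; the point is just that single spaces inside a cell survive, while runs of four spaces become column breaks.

```python
import re


def split_columns(line, sep=r" {4}"):
    """Split one line of text into cells on the given regex separator."""
    return [cell.strip() for cell in re.split(sep, line.rstrip("\n"))]


# Hypothetical lines in the style of output.txt.
print(split_columns("1    apple    사과"))
print(split_columns("ice cream    아이스크림"))
```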

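Each English word came with a number in front of it,

so I used an Excel formula to pull out just the words. It reads every second row, starting from C2:

=OFFSET($C$2,ROW(A1)*2-2,0)

Enter this in the first couple of rows and drag it down to fill in the rest.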
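The same every-second-row selection can be sketched in Python with a slice. This is only an illustration: `every_second` is a made-up helper, and `column_c` is invented sample data standing in for column C from row 2 down.

```python
def every_second(values, start=0):
    """Return values[start], values[start + 2], ... as OFFSET(..., ROW()*2-2, 0) does."""
    return values[start::2]


# Hypothetical sample data: the wanted values sit on every second row.
column_c = ["apple", "1", "banana", "2", "cherry", "3"]
words = every_second(column_c)
print(words)
```

Passing start=1 would pick the other half of the rows instead.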

++++++++++++++++ While doing the Excel work at the academy, stray characters such as ㅁㅁ and ㅇㅇ kept showing up.

So this time I wrote a script to remove the unwanted characters.

import re

text_file_path = "400_3.txt"
target_word1 = 'ㅁㅁ'

# Lone Hangul consonants, runs of 口, and pairs of □ left behind by OCR
noise_pattern = re.compile("[ㄱ-ㅎ]|口+|□□")

cleaned_lines = []
with open(text_file_path, 'r', encoding='utf-8') as f:
    # Read every line before the file is overwritten below
    for line in f:
        new_string = line.strip().replace(target_word1, '')
        new_string = noise_pattern.sub('', new_string)
        # Lines that end up empty are kept as blank lines
        cleaned_lines.append(new_string)

# Overwrite the original file with the cleaned text
with open(text_file_path, 'w', encoding='utf-8') as f:
    f.write(''.join(s + '\n' for s in cleaned_lines))
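To check the regex without touching a file, the cleanup can be tried on a few strings first. This sketch is based on the pattern in the script above; `clean_line` and the sample strings are made up for illustration. Note that [ㄱ-ㅎ] only covers standalone consonants, so complete Hangul syllables such as 사과 are left alone.

```python
import re

# Lone Hangul consonants, runs of 口, and pairs of □.
NOISE = re.compile(r"[ㄱ-ㅎ]|口+|□□")


def clean_line(line):
    """Strip surrounding whitespace and delete OCR noise characters."""
    return NOISE.sub("", line.strip())


# Hypothetical OCR lines.
for sample in ["ㅁㅁapple", "사과□□", "口口口banana"]:
    print(repr(sample), "->", repr(clean_line(sample)))
```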
 

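I don't know Python well, but googling saves the day once again.

I tried things out based on several Tistory posts, and the version that finally worked is based on the post linked below.

https://zephyrus1111.tistory.com/106

I started this English academy job as a short gig to do while studying,
and somehow I ended up doing this.

At least my vocabulary work should take less time from now on ^ ^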
