Convert a scanned PDF to text in Python

I have a scanned PDF file and I am trying to extract the text from it.
I tried to use pypdfocr to run OCR on it, but I get this error:

"Could not find Ghostscript in the usual place"

After searching I found this solution, "Linking Ghostscript to pypdfocr in Windows Platform". I downloaded Ghostscript and added it to the environment variable, but I still get the same error.

How can I search for text in my scanned PDF file with Python?

Thanks.

Edit: here is my code sample:

import os
import sys
import re
import json
import shutil
import glob
from pypdfocr import pypdfocr_gs
from pypdfocr import pypdfocr_tesseract 
from PIL import Image

path = PATH_TO_MY_SCANNED_PDF
mainL = []
kk = {}


def new_init(self, kk):
    self.lang = 'heb'   
    self.binary = "tesseract"
    self.msgs = {
            'TS_MISSING': """ 
                Could not execute %s
                Please make sure you have Tesseract installed correctly
                """ % self.binary,
            'TS_VERSION':'Tesseract version is too old',
            'TS_img_MISSING':'Cannot find specified tiff file',
            'TS_FAILED': 'Tesseract-OCR execution failed!',
        }

pypdfocr_tesseract.PyTesseract.__init__ = new_init  

wow = pypdfocr_gs.PyGs(kk)
tt = pypdfocr_tesseract.PyTesseract(kk)


def secFile(filename,oldfilename):
    wow.make_img_from_pdf(filename)


    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg')  
    for file in files:
        im = Image.open(file)
        im.save(file + ".tiff") 

    files = glob.glob("PATH" + '*.tiff')  
    for file in files:
        tt.make_hocr_from_pnm(file)
    pdftxt = ""    
    files = glob.glob("PATH" + '*.html') 
    for file in files:
        with open(file) as myfile:
            pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile)
    findNum(pdftxt,oldfilename)

    folder ="PATH"

    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
        except Exception, e:
            print e

def pdf2ocr(filename):
    pdffile = filename
    os.system('pypdfocr -l heb ' + pdffile)

def ocr2txt(filename):  
    pdffile = filename


    output1 = pdffile.replace(".pdf","_ocr.txt")
    output1 = "PATH" + os.path.basename(output1)

    input1 = pdffile.replace(".pdf","_ocr.pdf")

    os.system("pdf2txt -o " + output1 + " " + input1)

    with open(output1) as myfile:
        pdftxt="".join(line.rstrip() for line in myfile)
    findNum(pdftxt,filename)


def findNum(pdftxt,pdffile):
    l = re.findall(r'\b\d+\b', pdftxt)


    output = open('PATH' + os.path.basename(pdffile) + '.txt', 'w')
    for i in l:
        output.write(",")
        output.write(i)
    output.close()    

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

i = 0     
files = glob.glob(path + '\\*.pdf') 
print path  
print files 
for file in files:
    if file.endswith(".pdf"):
        if is_ascii(file):
            print file
            pdf2ocr(file)    
            ocr2txt(file)
        else:
            newname = "PATH" + str(i) + ".pdf"
            shutil.copyfile(file, newname)
            print newname
            secFile(newname,file)
        i = i + 1

files = glob.glob(path + '\\' + '*_ocr.pdf')         

for file in files:
    print file
    shutil.copyfile(file, "PATH" + os.path.basename(file))
    os.remove(file)

Could you post your code sample?
I have edited it into my question.

Source/Author: Michal | 2017-08-03

2

Take a look at this library: https://pypi.python.org/pypi/pypdfocr
But a PDF file can also contain images. You may be able to analyse the page streams. Some scanners break a single scanned page up into several images, so you won't get the text with Ghostscript.

Still the same error. I typed pypdfocr name.pdf on the command line and got the error message: ERROR: Could not find Ghostscript in the usual place; please specify it using your config file
Which OS are you using?
I am using Windows 64-bit.
Did you install ghostscript with pip? pip install ghostscript
Yes, I did.
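One thing worth checking at this point: the ghostscript package on PyPI only provides Python bindings and does not install the Ghostscript program itself, so pypdfocr may still have no executable to call. Below is a minimal sketch (Python 3, standard library only) that checks whether a Ghostscript executable is visible on PATH. The function name find_ghostscript is made up for illustration, and the candidate names are the usual console executables on Windows and Unix.

```python
import shutil

# Usual names of the Ghostscript console executable:
# gswin64c / gswin32c on Windows, gs on Linux and macOS.
DEFAULT_CANDIDATES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(candidates=DEFAULT_CANDIDATES):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)  # searches the directories listed in PATH
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found:
        print("Ghostscript found at:", found)
    else:
        print("No Ghostscript executable on PATH; add its bin directory "
              "to PATH or point pypdfocr at it in its config file.")
```

If nothing is found, the PATH entry probably needs to be the bin folder inside the Ghostscript installation rather than the installation root, and the console has to be reopened after changing the environment variable.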

Source/Author: ghovat
0

You can use OpenCV for Python. There are a lot of examples for recognizing text.
Here is the link: enter link description here

I couldn't find out how to use it for PDF files.
Print the PDF file as an image (PNG or JPEG) and then you can use OpenCV OCR.
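A minimal sketch of that idea, with one substitution stated plainly: it renders the pages with pdf2image and runs Tesseract through pytesseract instead of OpenCV. It assumes Python 3, that the pdf2image and pytesseract packages are installed, and that the Poppler and Tesseract programs (with the heb language data) are available. The names ocr_pdf and find_numbers are made up for illustration; find_numbers uses the same regex as findNum in the question.

```python
import re


def ocr_pdf(pdf_path, lang="heb", dpi=300):
    """Render each page of a scanned PDF to an image and OCR it."""
    # Imported here so find_numbers below stays usable without the OCR stack.
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract                        # needs the tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return "\n".join(pytesseract.image_to_string(page, lang=lang)
                     for page in pages)


def find_numbers(text):
    """Return all whole numbers in the text, same regex as findNum."""
    return re.findall(r"\b\d+\b", text)


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        text = ocr_pdf(sys.argv[1])
        print(",".join(find_numbers(text)))
```

This route does not go through pypdfocr's Ghostscript lookup, so it may be a way around the error above, at the cost of needing Poppler instead.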
I tried taking a look at OpenCV, but when I import numpy I get AttributeError: 'module' object has no attribute 'einsum'
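That error may mean Python is picking up a very old NumPy (einsum was added in NumPy 1.6) or some other module that happens to be named numpy, though this is only a guess from the message. A small sketch (Python 3) to see which file an import would resolve to, without actually importing it; module_origin is a hypothetical helper name:

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)  # locates the module without importing it
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    print("numpy resolves to:", module_origin("numpy"))
```

If the path points at a numpy.py in your own project folder or at an old site-packages directory, that is the likely culprit; printing numpy.__version__ afterwards would confirm the version.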

Source/Author: E. Alex
