Trying to read a PDF file

DavinE

Hello guys,

I have a question/problem...
I'm trying to read a PDF file to get output like this:

['2374575', '1', 'ZB33S Zählerschr.,univ. Z,1100x800x205mm,SKII', 'ZB33S', 'Ausgeliefert', '272.89', '0']
['2081086', '1', 'FZ443N Kabelrangierkanal,3-feldig,aufsteckbar', 'FZ443N', 'Ausgeliefert', '50.49', '0']

I only get the full text as one block, and I don't know how to separate it.
In the example file I only need Pos. 1-6.

This is the output I get:

Auftrags-/RechnungsauskunftDatum Von 14.09.2021Datum Bis 14.11.2021Artikelnummer Artikel-Volltextsuche Auftragsnummer Bestellnummer AufträgeNummerDatumStatusObjektKommissionBestellangabenBestell-Nr.4919825 / 113.10.2021erfasst95463035 Bauer, ThorstenSeite 1 von 1PosAusschr.-PosArtikel-Nr.Verband-Nr.BezeichnungBestelltLieferbarRückstandEinzelpreisGesamtpreis128931232893123DEHN Kombi-Ableiter Typ1+2 DEHNsh 909340 ZP
Basic 2 für TN-S-Systeme110181,95181,95228400782840078HAGE vector AP-Kleinverteiler    VE312DN
IP65,3-reihig, 36TE, Rangierkanal110118,53118,53335072223507222HAGE Leitungsschutzschalter C16A  MCN316 6kA
3-polig 230/400V C-Charakter.11025,3025,30435070233507023HAGE FI-Schutzschalter 40A       CDS440D
QuickConnect Typ A 4-pol. 30mA 400V10130,9030,90535070923507092HAGE Gabel-Phasenschiene 10qmm   KDN363F
QuickConnect 3-polig+N 12M1105,255,25629661202966120HAGE Sammelschiene 1feldrig        ZM11C
universN L=245mm 12x5mm 1 Stück1103,653,65729661212966121HAGE Sammelschiene 2feldrig        ZM12C
universN L=495mm 12x5mm 1 Stück1107,307,30Gesamtwert: 372,88Seite 1 von 1

Here is an example file: https://imgur.com/a/RsMnG4A

This is my code:


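import PyPDF2

# Open the PDF in binary mode and read it with the legacy PyPDF2 API.
with open('AB_NEU.pdf', "rb") as f:
    reader = PyPDF2.PdfFileReader(f)
    print(reader)               # debug: reader object
    page = reader.getPage(0)    # first page only
    print(page)                 # debug: raw page dictionary
    text = page.extractText()   # whole page comes back as one string
    print(text)
    # Iterating over a string yields single characters, not table rows:
    #for texts in text:
    #    print(texts)
    #    print('____________')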

I will be glad to receive any help.

cvp

@DavinE Unfortunately, the doc for PageObject.extractText() says:
"Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used."

DavinE

@cvp

Yes, I have also seen that it unfortunately does not work the way I would like. For now I will handle the one manufacturer that does not provide me with CSV files separately and read its PDF.

Thank you anyway.

And I wish you a Happy New Year xD

JonB

You could also try using PDFKit in objc_util. If your documents are always the same format, you can select text from specific regions, which might fix problems with text being out of order. You can also return all text in the page.

JonB

Actually, you should check out pdfminer, which has ways of detecting tabular data based on bounding boxes. Camelot and tabula may also work.
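
A minimal sketch of the bounding-box idea, assuming the pdfminer.six package is installed. The helper names extract_text_boxes and group_into_rows and the y_tolerance value are made up for illustration, and the tolerance would likely need tuning for a real invoice. The idea: take each text line's position on the page, group lines whose vertical position is nearly the same into one row, then sort each row left to right.

```python
def extract_text_boxes(pdf_path):
    """Yield (x0, y0, text) for each text line on the first page.

    Needs pdfminer.six; it is imported here so that the grouping
    helper below can be tried without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path, maxpages=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        yield line.x0, line.y0, line.get_text().strip()


def group_into_rows(boxes, y_tolerance=3.0):
    """Group (x0, y0, text) items into table rows.

    Items whose y0 is within y_tolerance of the first item in the
    current row are treated as the same row. PDF coordinates start
    at the bottom left, so a larger y0 is higher on the page.
    """
    rows = []  # list of (reference_y, [(x0, text), ...])
    for x0, y0, text in sorted(boxes, key=lambda b: -b[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    # Sort the cells of each row from left to right and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hand-made positions standing in for real pdfminer output:
sample = [
    (300.0, 700.2, "181,95"),
    (50.0, 700.0, "1"),
    (100.0, 701.0, "2893123"),
    (50.0, 680.0, "2"),
    (100.0, 680.5, "2840078"),
]
rows = group_into_rows(sample)
print(rows)
```

With a real file it would be called as group_into_rows(extract_text_boxes('AB_NEU.pdf')). Descriptions that wrap onto a second line, as in the sample output earlier in the thread, would end up as separate rows and would need to be merged afterwards.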

DavinE

I think this is beyond me and my current knowledge...
But thanks for your help.