Trying to read a PDF file
-
Hello guys,
I have a question/problem...
I'm trying to read a PDF file to get output like this: ['2374575', '1', 'ZB33S Zählerschr.,univ. Z,1100x800x205mm,SKII', 'ZB33S', 'Ausgeliefert', '272.89', '0'] ['2081086', '1', 'FZ443N Kabelrangierkanal,3-feldig,aufsteckbar', 'FZ443N', 'Ausgeliefert', '50.49', '0']
I only get the full text in one piece, and I don't know how to separate it...
In the example file I only need Pos. 1-6. This is the output I get:
Auftrags-/RechnungsauskunftDatum Von 14.09.2021Datum Bis 14.11.2021Artikelnummer Artikel-Volltextsuche Auftragsnummer Bestellnummer AufträgeNummerDatumStatusObjektKommissionBestellangabenBestell-Nr.4919825 / 113.10.2021erfasst95463035 Bauer, ThorstenSeite 1 von 1PosAusschr.-PosArtikel-Nr.Verband-Nr.BezeichnungBestelltLieferbarRückstandEinzelpreisGesamtpreis128931232893123DEHN Kombi-Ableiter Typ1+2 DEHNsh 909340 ZP Basic 2 für TN-S-Systeme110181,95181,95228400782840078HAGE vector AP-Kleinverteiler VE312DN IP65,3-reihig, 36TE, Rangierkanal110118,53118,53335072223507222HAGE Leitungsschutzschalter C16A MCN316 6kA 3-polig 230/400V C-Charakter.11025,3025,30435070233507023HAGE FI-Schutzschalter 40A CDS440D QuickConnect Typ A 4-pol. 30mA 400V10130,9030,90535070923507092HAGE Gabel-Phasenschiene 10qmm KDN363F QuickConnect 3-polig+N 12M1105,255,25629661202966120HAGE Sammelschiene 1feldrig ZM11C universN L=245mm 12x5mm 1 Stück1103,653,65729661212966121HAGE Sammelschiene 2feldrig ZM12C universN L=495mm 12x5mm 1 Stück1107,307,30Gesamtwert: 372,88Seite 1 von 1
Here is an example file: https://imgur.com/a/RsMnG4A
This is my code:
from PyPDF2 import PdfFileReader, PdfFileWriter
import PyPDF2

with open('AB_NEU.pdf', "rb") as f:
    reader = PyPDF2.PdfFileReader(f)
    print(reader)
    page = reader.getPage(0)
    print(page)
    text = page.extractText()
    print(text)
    #for texts in text:
    #    print(texts)
    #    print('____________')
I will be glad to receive any help.
-
@DavinE Unfortunately, the doc says:
"PageObject.extractText(): Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used."
-
Yes, I have also seen that it unfortunately does not work the way I would like. I will now work something out for the one manufacturer that does not provide me with CSV files, so that I can read the PDF.
Thank you anyway,
and Happy New Year xD
-
You could also try using PDFKit in objc_util. If your documents are always the same format, you can select text from specific regions, which might fix problems with text being out of order. You can also return all text in the page.
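If the documents really are always in the same format, another option that needs no extra library might be to split the flattened extractText() output with a regular expression. The following is only a rough sketch based on the sample output posted above. It assumes that the position number and the three quantity columns are single digits, that the 7-digit article number is always printed twice in a row, and that prices always have two decimals. The names ROW and split_rows, and the choice of output columns, are made up for illustration.

```python
import re

# One table row in the flattened text, as guessed from the sample output.
ROW = re.compile(
    r"(?P<pos>\d)"                # position number (assumed single digit)
    r"(?P<art>\d{7})(?P=art)"     # article number, printed twice in a row
    r"(?P<desc>.+?)"              # description, as short as possible
    r"(?P<ordered>\d)(?P<available>\d)(?P<backorder>\d)"  # assumed single digits
    r"(?P<unit>\d+,\d{2})"        # unit price with decimal comma
    r"(?P<total>\d+,\d{2})"       # total price with decimal comma
)


def split_rows(text, max_pos=6):
    """Return one list per table row, keeping only positions up to max_pos."""
    rows = []
    for m in ROW.finditer(text):
        if int(m["pos"]) > max_pos:
            continue
        rows.append([
            m["art"],
            m["ordered"],
            m["desc"].strip(),
            m["unit"].replace(",", "."),  # decimal comma -> decimal point
            m["backorder"],
        ])
    return rows


# Two rows copied from the output posted above.
sample = (
    "Gesamtpreis128931232893123DEHN Kombi-Ableiter Typ1+2 DEHNsh 909340 "
    "ZP Basic 2 für TN-S-Systeme110181,95181,95"
    "228400782840078HAGE vector AP-Kleinverteiler VE312DN IP65,3-reihig, "
    "36TE, Rangierkanal110118,53118,53"
)

for row in split_rows(sample):
    print(row)
```

This will break as soon as a quantity has two digits or a description happens to end in something that looks like quantities and prices, so it is worth checking the result against the PDF (for example, that quantity times unit price equals the total).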
-
Actually, you should check out pdfminer, which has ways of detecting tabular data based on bounding boxes. Camelot and tabula may also work.
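To give an idea of what the bounding-box approach looks like, here is a small sketch. It assumes pdfminer.six is installed and that each table cell comes out as its own text line, which may not hold for every PDF. The helper names text_boxes and group_into_rows and the tolerance value are made up for illustration; the coordinates in the usage example are invented, only the cell texts come from the output above.

```python
def group_into_rows(boxes, tolerance=3.0):
    """Group (x0, y0, x1, y1, text) tuples into rows of cell texts.

    Boxes whose bottom edge (y0) lies within `tolerance` points of the
    first box in a row are treated as belonging to that row.
    """
    rows = []
    # PDF y coordinates grow upward, so sort descending to go top to bottom.
    for box in sorted(boxes, key=lambda b: -b[1]):
        if rows and abs(rows[-1][0] - box[1]) <= tolerance:
            rows[-1][1].append(box)
        else:
            rows.append((box[1], [box]))
    # Within each row, order the cells from left to right.
    return [[b[4] for b in sorted(cells, key=lambda b: b[0])]
            for _, cells in rows]


def text_boxes(pdf_path):
    """Collect the text lines of the first page with their bounding boxes."""
    # Imported here so the grouping helper works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    boxes = []
    for page_layout in extract_pages(pdf_path, maxpages=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        x0, y0, x1, y1 = line.bbox
                        boxes.append((x0, y0, x1, y1, line.get_text().strip()))
    return boxes


# Usage with invented coordinates, just to show the grouping:
demo = [
    (300, 700.5, 340, 710, "181,95"),
    (50, 700, 60, 710, "1"),
    (100, 701, 250, 710, "DEHN Kombi-Ableiter"),
    (50, 680, 60, 690, "2"),
    (100, 680, 250, 690, "HAGE vector"),
]
for row in group_into_rows(demo):
    print(row)
```

With a real file you would call group_into_rows(text_boxes('AB_NEU.pdf')) and then pick the rows and columns you need.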
-
I think this is too much for me and my knowledge....
But thanks for your help