Webpage Slices are Different from what is There

TomD

I downloaded a webpage successfully and its content looks correct. When I print slices of it, the characters are different.

Specifically, when I print the first character it shows the first plus the next 3 characters and an apostrophe on the end. So one character becomes 5 characters.

On printing longer slices of the webpage the number of characters is also greater and the apostrophe is always added on the end.

What is happening?

Tom

cvp

@TomD could you post your code here?

cvp

@TomD If you download with something like

data = requests.get(url).content

data is a bytes object, and when you print it, it is converted to a string shown as b'xxx'
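
A minimal sketch of that behaviour, assuming a hard-coded bytes literal as a stand-in for the downloaded page (so no network access is needed). It shows that slicing bytes gives bytes, while indexing a single position gives an integer:

```python
# Hypothetical stand-in for requests.get(url).content
data = b'\r\n\r\n<!DOCTYPE html>'

printed = str(data)        # the text that print(data) displays
first_slice = data[0:1]    # slicing bytes returns bytes
first_item = data[0]       # indexing bytes returns an int (the byte value)

print(printed)
print(first_slice, len(first_slice))
print(first_item)
```

The b and the quotes belong to the printed representation only; they are not part of the data.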

TomD

#The "with" statement overflows into the next line due to this narrow comment box

import urllib.request
tda=str
with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
    tda=response.read()

#Print the entire html string so I know what is in it
#The output of this print statement starts:
#b'\r\n\r\n\r\n<!DOCTYPE

print (tda)

#Separately print the first 5 characters in the html string
#The output of this is, including spaces between items:
#b'\r' b'\n' b'\r' b'\n' b'\r'

print (tda[0:1],tda[1:2],tda[2:3],tda[3:4],tda[4:5])

#Print the first 5 characters in the string
#The output of this is:
#b'\r\n\r\n\r'

print(tda[0:5])

TomD

CVP, so no printed slice shows the HTML one character at a time. It combines them into groups and adds apostrophes.

cvp

@TomD You can see the content is shown between b' and '
Sequences starting with \ are escapes for non-printable characters, e.g. \n = newline
Thus b'\n' is only one character, a newline
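
To make the "one character becomes 5 characters" effect concrete, here is a small sketch comparing the real length of a one-byte value with the length of its printed form. The variable names are only illustrative:

```python
one_byte = b'\n'          # a single newline byte
shown = repr(one_byte)    # the text print() displays for it

print(len(one_byte))      # length of the data itself
print(len(shown))         # length of the displayed form: b, ', \, n, '
```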

TomD

Thanks CVP. That has me onto something.
I am data scraping. Maybe I would be better off using a package like BeautifulSoup?

cvp

@TomD try this

st = tda.decode('utf8')
print(st)

And you will see that there are empty lines at the beginning, which are the \r\n sequences

TomD

It doesn't like
print (st)

cvp

@TomD try this script

import urllib.request
with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
	tda=response.read()
st = tda.decode('utf8')
print(st)
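
A hedged sketch of the same decode step without the network call, assuming a sample bytes literal in place of the real page. decode_page is a hypothetical helper, not part of urllib, and the errors="replace" fallback is my own assumption in case the page contains bytes that are not valid UTF-8:

```python
def decode_page(raw, charset=None):
    """Decode downloaded bytes to str, defaulting to UTF-8."""
    # errors="replace" substitutes a marker for undecodable bytes
    # instead of raising UnicodeDecodeError
    return raw.decode(charset or "utf-8", errors="replace")

raw = b'\r\n\r\n\r\n<!DOCTYPE html>'   # hypothetical sample, not the real page
st = decode_page(raw)

first_five = st[0:5]          # now a str slice, one character per position
stripped = st.lstrip()        # drop the leading blank lines

print(repr(first_five))
print(stripped[:9])
```

Once decoded, slices are ordinary strings, so printing them no longer shows the b'...' wrapper.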

TomD

I see, so I could work on that decoded string more easily

cvp

@TomD st contains a string, thus yes, good luck

TomD

Much appreciated. You have helped me around an obstacle

mikael

@TomD, I definitely recommend using BeautifulSoup or a webview with JavaScript. The latter especially if you are trying to scrape pages with dynamic content.
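
As a rough illustration of parsing instead of slicing, here is a sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup (a swapped-in technique, so it runs without installing anything). The LinkCollector class and the sample HTML are hypothetical and not taken from the real page:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

# Hypothetical sample; in practice this would be the decoded page text
sample_html = '<html><body><a href="/a.pdf">A</a> <a href="/b.pdf">B</a></body></html>'

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

This only handles static HTML; pages that build their content with JavaScript would still need the webview approach.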