dropbox.put_file() chopping off bytes

roosterboy197

I don't understand what's going on here so hopefully one of you can enlighten me.

I subscribe to the RSS feed for http://kupu.maori.nz in MrReader and have a service set up to send a post to a Pythonista script to extract the word of the day and append it to a text file in my Dropbox. But every time it runs, the file shrinks by a number of bytes, usually between 2 and 25. I can't see any correlation between the size of the appended string and the size of the shrinkage.

What am I doing wrong? I'm betting it's something simple I've missed but I dunno.

#coding: utf-8
import sys
import dropboxlogin
from dropbox import rest
import console
import locale
import webbrowser

dropboxlogin.app_key = '...'
dropboxlogin.app_secret = '...'

DB_FOLDER = '/flashcards/'
DB_FILE = 'Māori.cards.txt'

def main():
    locale.setlocale(locale.LC_ALL, '')
    
    #extract the word of the day from the RSS text
    rss = sys.argv[1]
    new_word = rss.split('.')[0]
    new_word = new_word.replace(': ', ' :: ')
    #this gives us text like "ngeru :: cat"
    
    try:
        db = dropboxlogin.get_client()
        ff, md = db.get_file_and_metadata(DB_FOLDER + DB_FILE)
        wordlist = ff.read().decode('utf-8').splitlines()
        ff.close()
        print(md) #to check our starting size
        wordlist.append(new_word)
        wordlist = list(set(wordlist))
        wordlist.sort(cmp=locale.strcoll)
        md = db.put_file(DB_FOLDER + DB_FILE, u'\n'.join(wordlist), overwrite=True)
        print(md) #to see how much we've shrunk
        console.hud_alert('added {}'.format(new_word))
    except rest.ErrorResponse as e:
        console.alert('Error - add_maori.py', message='{}\n'.format(e))
    
    webbrowser.open('mrreader://')

if __name__ == '__main__':
    main()

ccc

I would try...

DB_FOLDER = '/flashcards/'
DB_FILE = 'Māori.cards.txt'
DB_FILEPATH = DB_FOLDER + DB_FILE  # use DB_FILEPATH in your main program

# ...
        print('before', len(wordlist), len(''.join(wordlist)))
        wordlist = list(set(wordlist))  # is the set() operation removing data?
        print(' after', len(wordlist), len(''.join(wordlist)))

roosterboy197

Here are the results of running that several times, with the string I added and the bytes returned from the Dropbox metadata.

# "whatitoka :: door"
('bytes - before', 1157)
('before', 60, 1091)
('after', 60, 1091)
('bytes - after', 1150)
# 17 characters, -7 bytes

# "awa :: river"
('bytes - before', 1150)
('before', 60, 1083)
('after', 60, 1083)
('bytes - after', 1142)
# 12 characters, -8 bytes

# "hapa :: dinner"
('bytes - before', 1142)
('before', 60, 1078)
('after', 60, 1078)
('bytes - after', 1137)
# 14 characters, -5 bytes

# "āporo :: apple"
('bytes - before', 1137)
('before', 60, 1073)
('after', 60, 1073)
('bytes - after', 1132)
# 14 characters, -5 bytes

# "hgtj :: gfrd" - random characters
('bytes - before', 1132)
('before', 60, 1066)
('after', 60, 1066)
('bytes - after', 1125)
# 12 characters, -7 bytes

# "hgtjfj :: hytgfrd" - random characters
('bytes - before', 1125)
('before', 59, 1064)
('after', 59, 1064)
('bytes - after', 1122)
# 17 characters, -3 bytes

I can't really see any pattern here. After seeing that the byte difference was -5 for both 14-character strings, I tested strings of the same lengths as the earlier examples (17 and 12) but got different results.
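
One check worth running on these numbers, as a sketch rather than a diagnosis: the 'before'/'after' figures count characters in the list joined with no separator, so adding one newline per gap between lines gives the character count of the string handed to put_file. The tuples below are copied from the output above; the variable names are made up for illustration.

```python
# (line count, characters without newlines, 'bytes - after' from the metadata)
runs = [
    (60, 1091, 1150),
    (60, 1083, 1142),
    (60, 1078, 1137),
    (60, 1073, 1132),
    (60, 1066, 1125),
    (59, 1064, 1122),
]

results = []
for lines, chars, bytes_after in runs:
    # u'\n'.join() puts one newline between each pair of adjacent lines
    joined_chars = chars + (lines - 1)
    results.append(joined_chars == bytes_after)
    print(lines, chars, joined_chars, bytes_after)

# True only if every run's uploaded size equals the joined character count
print(all(results))
```

If the two columns line up in every run, the uploaded size is tracking the number of characters rather than the number of UTF-8 bytes, which would fit a file being cut short by one byte for each two-byte letter such as 'ā'. That is an inference from the numbers, not something confirmed in this thread.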

roosterboy197

Yeah, now I'm really confused. I woke up this morning thinking "hmm, maybe it's the sort", added in just one print statement and now my output looks like this:

# "hgtjfj :: hytgfrd"
('bytes - before', 1122)
('before', 59, 1062)
('after', 58, 1045)
('sorted', 58, 1045)
('bytes - after', 1102)
# 17 characters, -20 bytes
# "hgtjfj :: hytgfrd"
('bytes - before', 1102)
('before', 58, 1044)
('after', 57, 1027)
('sorted', 57, 1027)
('bytes - after', 1083)
# 17 characters, -19 bytes
# "hgtjfj :: hytgfrd"
('bytes - before', 1083)
('before', 57, 1027)
('after', 56, 1010)
('sorted', 56, 1010)
('bytes - after', 1065)
# 17 characters, -18 bytes
# "hgtjfj :: hytgfrd"
('bytes - before', 1065)
('before', 57, 1010)
('after', 56, 993)
('sorted', 56, 993)
('bytes - after', 1048)
# 17 characters, -17 bytes

Now the wordlist = list(set(wordlist)) line is removing data! And it looks like the byte count of the lost data is dropping by one each time. I just don't understand.

roosterboy197

And that's why I shouldn't code first thing out of bed in the morning. Of course it's doing that; I'm reusing the same input each time so changing it to a set strips out the dupe. Fixing that gives me this output:

# "12hgtjfj :: 34hytgfrd"
('bytes - before', 1048)
('before', 56, 999)
('after', 56, 999)
('sorted', 56, 999)
('bytes - after', 1054)
# 21 characters, 6 bytes

So it's not the sort.

But I just noticed that my byte count increased this time!
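
A possible explanation to test, offered as an assumption: len() on a unicode string counts characters, while the size Dropbox reports counts bytes, and each macron vowel such as 'ā' takes two bytes in UTF-8. Below is a minimal sketch of the difference, using two words from this thread. Encoding the joined text before passing it to put_file is a suggestion to try, not a confirmed fix.

```python
# -*- coding: utf-8 -*-
wordlist = [u'āporo :: apple', u'awa :: river']
text = u'\n'.join(wordlist)

# Bytes as they would be stored in a UTF-8 text file
data = text.encode('utf-8')

print(len(text))  # number of characters
print(len(data))  # number of bytes; larger when the text has non-ASCII letters

# Hypothetical change to the script above: upload the encoded bytes instead
# md = db.put_file(DB_FOLDER + DB_FILE, data, overwrite=True)
```

If the upload length were taken from the character count, a file with N non-ASCII letters would lose its last N bytes on each run, and appending a word longer than that loss would still make the file grow, as in the last result.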