Fast File Format to provide fast writing operations
I am dealing with a huge amount of data which I need to store in a file. I know how I can write these in files and everything is working fine, but since I receive new sensor data every 10ms from 3 peripherals, I am not able to store all data because writing in a text file is too slow and I am losing data.
It seems like that the file format plays an important role, and I found out that you can use HDF5 files which are quite compact and especially build for dealing with a lot of data. The problem is that in Pythonista it is not possible to use the import hdf and import pandas. Do you know an alternative file format for fast writing?
@ProgrammingGo Thanks for your clear explanation...
Try adding a buffer, instead of outputting each value at one time, group them up then output
You still have not told us how much data is in each 10ms sensor sample. 1 byte? 30 bytes? 1 MB?
Also, how long do you record?
Raw transfer speeds on an iPhone are probably in the 50-100 MB/sec, or more on newer devices. Writing 100 bytes at 100 Hz is still only 10kB/sec.
I seriously doubt that is your problem.
Eliminating the constant file opening is going to help a lot. I'd suggest you log file opening/writing/close time, then compare to simply time to write. As the file gets bigger, the open time will get longer and longer.
You can buffer up large writes, by just appending bytes, but internally python's write is doing that anyway, so I doubt yours see much improvement there. You might try increasing the buffer (third argument to open).
Eliminating all of those str() calls will help, just write the raw bytes. But again, hard to imagine that is your problem. I wonder if you are just having issues with the cb module.
@cvp You're welcome :)
@ellie_ff1493 Hi okay I will give a try to collect it in an array an then write in the file. Do you think that the redirection to a file via logging could help?
@JonB the amount of data per sample rate is about 35 to 40 bytes. The duration is continuously that means the whole day measuring.
I don't think it is a problem of the cb module, with 500ms sample rates it works pretty fine. I don't have any alternative to the cb module to allow multiple connections.
By the way, you suggested me to use logging to write in files. How can I manage to write in 3 different files by using logging? - Each peripheral has is own file.
By eliminating the str() you mean to use the binary file for writing the values: wb for opening the file and then file.write(value) ?
So you are recording ~10GB data each day!? 10ms * 40bytes * 3 sensors * 100 = 117kB/s * 86400s = 9,66GB/d => Sorry I'm curious, what will you do with this huge amount of data? For processing you probably like to switch to another platform...
@brumm Am I erroneous or is it 40b x 3 devices x 100 (1000msec/10msec) = 12000 = 12kb/sec?
Even if I'm correct, 1GB per day is a lot
40b 1000 msec 86400 sec ------- x 3 x --------- = 12000 b = 12 kb x --------- = 1 gb/day 10 msec sec day
So, writing 30 bytes to an open file in a tight loop takes maybe 70 microseconds on my crappy iPad3. (I timed by running 100000 writes). Even flushing between writes, very little difference. I doubt you'd improve much by trying to manually buffer-- since python files handle buffering for you, it shouldn't be an issue.
Opening the file each time, takes about 1 msec per write. I suspect as you get to GB files, the seek time might take some extra time. So don't do that. Open the file and hang onto the handle, and only close it in an error handler or script end.
I still think there cb module could be problematic. Are the sensortags recording data on their own internal timer, then you ask for data based on a python timer? Or, do you poll for data which triggers a sample?
If the sensor tags are sampling based on their own internal timer, things could drift because the clocks are different. To check for that, you would as a counter in the sensortags payload that is incremented only when it sends a response (not when it samples)