Fast File Format to provide fast writing operations
I am dealing with a huge amount of data which I need to store in a file. I know how I can write these in files and everything is working fine, but since I receive new sensor data every 10ms from 3 peripherals, I am not able to store all data because writing in a text file is too slow and I am losing data.
It seems that the file format plays an important role, and I found out that you can use HDF5 files, which are quite compact and built especially for dealing with a lot of data. The problem is that in Pythonista it is not possible to import the HDF5 libraries or pandas. Do you know an alternative file format for fast writing?
how much is "huge"?
hdf files are efficient, but if your bottleneck is disk access that is not going to matter. i would like to see how you are writing. i'd think your bluetooth is probably the limiter, not disk writing.
hdf files are efficient to read and for random access, but not necessarily more efficient to write. if actual disk writing is the limiter, you could bz2 your data so you write less of it. also, if you are opening/writing/closing the file every 10 msec, that will be slow. instead you could queue up writes to a BytesIO, then do one big write.
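A minimal sketch of the "queue up writes to a BytesIO, then do one big write" idea. The class name BufferedSampleWriter and the 64 kB threshold are assumptions made up for illustration, not something from the thread. The file is opened once and kept open, and records pile up in memory until the threshold is reached.

```python
import io


class BufferedSampleWriter:
    """Collect small records in memory and write them to disk in large chunks.

    Hypothetical helper: the name and the default threshold are illustrative.
    """

    def __init__(self, path, flush_bytes=64 * 1024):
        self._file = open(path, 'ab')   # opened once, kept open
        self._buf = io.BytesIO()        # in-memory queue of pending records
        self._flush_bytes = flush_bytes

    def write(self, record):
        """Queue one record (bytes); flush to disk once enough has piled up."""
        self._buf.write(record)
        if self._buf.tell() >= self._flush_bytes:
            self.flush()

    def flush(self):
        """Write everything queued so far in a single call."""
        self._file.write(self._buf.getvalue())
        self._file.flush()
        self._buf = io.BytesIO()        # start a fresh, empty buffer

    def close(self):
        self.flush()
        self._file.close()
```

Usage would be one writer per peripheral: call write() from the data callback and close() when the script ends. Anything still in the buffer is lost if the script is killed, so the threshold is a trade-off between fewer writes and less data at risk.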
Also, ascii writes are probably slower than raw bytes.
Hi, thank you for the response. I am recording data nonstop, so it is getting really huge. Do you mean to collect the values in a kind of buffer and then do one big write? For example, collect 1000 datasets, write them to the file, and then overwrite the buffer with the new ones? But wouldn't I maybe get the problem that I am too slow clearing the buffer and writing the new datasets into it?
@ProgrammingGo As usual, due to my poor English, I probably don't understand, but do you speak about replacing your file and not appending to it? So you would not need to close/reopen it at each write.
show me your write operation.
if you are doing
    while True:
        data = getData()
        with open(file, 'a') as f:
            f.write(data)
that will be slow, because it has to open, seek, write, close every time. if instead you did
    f = open(file, 'w')
    while True:
        data = getData()
        f.write(data)
the file is opened only once.
another option is to buffer up writes to be in large chunks, rather than few bytes at a time.
again, i'd like to see how you are writing the data. how much data are you writing each time? are you converting from bytes to string? also, i'd suggest using the logging module, since that is designed for exactly what you are doing, and can for instance handle keeping files to a fixed size (it opens new files when files get too big), etc.
why do you think the file operation is where you are losing data? are you timing the writes and finding them to take too long?
Hi, I am doing exactly that one:
    while True:
        data = getData()
        with open(file, 'a') as f:
            f.write(data)
By the way, I am using the logging module to avoid print(), because that saves some time as well.
I think the file operation is the problem because I read that write speed differs between file formats, and that a .txt file is not the best choice when you need fast writes.
ok, you should open the file once. also, use 'wb' if you are writing binary data, to avoid text conversions.
how much data are you writing each time?
i am actually a little confused by your statement that you are using logging, yet show the above code hammering the file system... which is it? are you just calling logging.debug instead of write?
can you post your actual code to a gist?
Hi, okay, so also wb for writing binary, thank you. I will try to collect more data in a dictionary or list and then write it to the file, instead of opening the file for every incoming dataset.
Regarding how much data I am writing, each dataset is: sensordata1, sensordata2, sensordata3, sensordata4, time. It is written via
file.write(str(sensordata1) + "," + str(sensordata2) + "," + str(sensordata3) + "," + str(sensordata4) + "," + str(time)). I concatenate all the values into one string and write that. Every time a new dataset is available, I execute the write call above to write it to the file.
Did you mean that I should use the logging module to write the data into files, instead of write()? If yes, how would that work? Do you mean something like setting up a logger that writes its output directly to the file, without opening and closing it every time?
@cvp Hi cvp, thank you for your message, it is not a problem, I appreciate your support. What I want to do is append new datasets to the file. I have 3 peripherals, let's say P1, P2, P3. For each of them I will create a file: P1.txt, P2.txt and P3.txt. The incoming datasets should be appended to the appropriate file. Because I receive datasets from all 3 peripherals via BLE every 10 ms, I need to store them in the files, but it is too fast, and sometimes a file is empty or some values are missing. If I increase the sending interval to 500 ms everything works fine. And yes, I reopen the file every time I have a new dataset.
@ProgrammingGo Thanks for your clear explanation...
Try adding a buffer: instead of outputting each value one at a time, group them up and then output them together.
You still have not told us how much data is in each 10ms sensor sample. 1 byte? 30 bytes? 1 MB?
Also, how long do you record?
Raw transfer speeds on an iPhone are probably in the 50-100 MB/sec range, or more on newer devices. Writing 100 bytes at 100 Hz is still only 10 kB/sec.
I seriously doubt that is your problem.
Eliminating the constant file opening is going to help a lot. I'd suggest you log the time for file opening/writing/closing, then compare it to the time for the write alone. As the file gets bigger, the open time will get longer and longer.
You can buffer up large writes by just appending bytes, but internally python's write is doing that anyway, so I doubt you'd see much improvement there. You might try increasing the buffer size (the third argument to open).
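A rough sketch of how that comparison could be timed. The function names, the 40-byte record and the loop count are assumptions for illustration, and the numbers will differ from device to device.

```python
import os
import tempfile
import time


def time_reopen_each_write(path, record, n):
    """Average seconds per record when the file is reopened for every write."""
    start = time.perf_counter()
    for _ in range(n):
        with open(path, 'ab') as f:
            f.write(record)
    return (time.perf_counter() - start) / n


def time_single_open(path, record, n):
    """Average seconds per record when the file is opened once and kept open."""
    start = time.perf_counter()
    with open(path, 'ab') as f:
        for _ in range(n):
            f.write(record)
    return (time.perf_counter() - start) / n


if __name__ == '__main__':
    record = b'x' * 40                      # assumed sample size in bytes
    tmp = tempfile.mkdtemp()
    reopen = time_reopen_each_write(os.path.join(tmp, 'reopen.bin'), record, 2000)
    single = time_single_open(os.path.join(tmp, 'single.bin'), record, 2000)
    print('reopen each write: %.1f us per record' % (reopen * 1e6))
    print('open once:         %.1f us per record' % (single * 1e6))
```

If both figures are far below the 10 ms sample interval, the file writes are unlikely to be where the data is lost.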
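For reference, a small sketch of passing a larger buffer size to open(). The helper name and the 1 MB default are assumptions for illustration. With a binary file, writes smaller than the buffer stay in memory until the buffer fills or flush()/close() is called.

```python
import os
import tempfile


def open_with_big_buffer(path, buffer_bytes=1024 * 1024):
    """Open path for binary appending with a large write buffer.

    Hypothetical helper; the third argument of open() is the buffer size.
    """
    return open(path, 'ab', buffering=buffer_bytes)


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'P1.bin')
    f = open_with_big_buffer(path)
    try:
        for _ in range(100):
            f.write(b'x' * 40)   # held in the buffer, not yet on disk
    finally:
        f.close()                # close() flushes whatever is still buffered
    print(os.path.getsize(path))
```

The trade-off is the same as with any buffering: data still in the buffer is lost if the script crashes before it is flushed.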
Eliminating all of those str() calls will help, just write the raw bytes. But again, hard to imagine that is your problem. I wonder if you are just having issues with the cb module.
@cvp You're welcome :)
@ellie_ff1493 Hi, okay, I will try collecting the values in an array and then writing them to the file. Do you think that redirecting to a file via logging could help?
@JonB Each sample is about 35 to 40 bytes. The recording runs continuously, which means measuring the whole day.
I don't think it is a problem of the cb module, with 500ms sample rates it works pretty fine. I don't have any alternative to the cb module to allow multiple connections.
By the way, you suggested that I use logging to write to files. How can I write to 3 different files using logging? Each peripheral has its own file.
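One possible way to do this, sketched under assumptions: one named logger per peripheral, each with its own RotatingFileHandler from the standard library. The helper name make_data_logger, the 50 MB size limit and the backup count are made up for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def make_data_logger(name, path, max_bytes=50 * 1024 * 1024, backups=5):
    """Return a logger that appends bare lines to its own file.

    Hypothetical helper: one logger (and one open file) per peripheral.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False              # keep data lines out of the root logger
    if not logger.handlers:               # avoid adding a second handler on re-run
        handler = logging.handlers.RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter('%(message)s'))
        logger.addHandler(handler)
    return logger


if __name__ == '__main__':
    folder = tempfile.mkdtemp()
    loggers = {name: make_data_logger(name, os.path.join(folder, name + '.txt'))
               for name in ('P1', 'P2', 'P3')}
    # Placeholder values standing in for sensordata1..4 and time.
    loggers['P1'].info('%s,%s,%s,%s,%s', 1.0, 2.0, 3.0, 4.0, 0.01)
```

The handler keeps the file open between records and starts a new file once the size limit is reached. It still flushes after every record, so it removes the repeated open/close but is not a buffering scheme.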
By eliminating the str() calls, do you mean using a binary file for the values, that is, opening the file with wb and then calling file.write(value)?
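A sketch of what writing raw bytes could look like if the values arrive as Python numbers. The record layout (four 32-bit floats plus a 64-bit float timestamp, little-endian) is purely an assumption, since the thread does not say what types the sensor values have. If the peripheral already delivers bytes, they can be written unchanged.

```python
import struct

# Assumed layout: four float32 sensor values + one float64 timestamp,
# little-endian, no padding. 4 * 4 + 8 = 24 bytes per record.
RECORD = struct.Struct('<4fd')


def pack_sample(s1, s2, s3, s4, timestamp):
    """Turn one dataset into a fixed-size bytes record."""
    return RECORD.pack(s1, s2, s3, s4, timestamp)


def read_samples(path):
    """Read a file of packed records back into a list of tuples."""
    with open(path, 'rb') as f:
        data = f.read()
    # Step through whole records only; a truncated final record is ignored.
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, len(data) - RECORD.size + 1, RECORD.size)]
```

Writing then becomes f.write(pack_sample(...)) on a file opened once in 'ab' mode. Fixed-size records also make it easy to find sample number n later, because it starts at byte n * RECORD.size.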
So you are recording ~10GB data each day!? 10ms * 40bytes * 3 sensors * 100 = 117kB/s * 86400s = 9,66GB/d => Sorry, I'm curious, what will you do with this huge amount of data? For processing you would probably want to switch to another platform...
@brumm Am I wrong, or is it 40 bytes x 3 devices x 100 (1000 msec / 10 msec) = 12000 bytes/sec = 12 kB/sec?
Even if I'm correct, 1GB per day is a lot
40 bytes x 3 devices x (1000 msec / 10 msec) = 12000 bytes/sec = 12 kB/sec
12 kB/sec x 86400 sec/day = ~1 GB/day
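The same arithmetic as a few lines of Python, using the figures given in the thread (40 bytes per sample, 3 devices, one sample every 10 ms). The variable names are just for illustration.

```python
BYTES_PER_SAMPLE = 40        # upper end of the 35 to 40 bytes mentioned
DEVICES = 3
SAMPLE_INTERVAL_MS = 10
SECONDS_PER_DAY = 86400

samples_per_second = 1000 // SAMPLE_INTERVAL_MS                    # 100
bytes_per_second = BYTES_PER_SAMPLE * DEVICES * samples_per_second # 12000
bytes_per_day = bytes_per_second * SECONDS_PER_DAY                 # 1036800000

print(bytes_per_second / 1e3, 'kB/sec')   # 12.0 kB/sec
print(bytes_per_day / 1e9, 'GB/day')      # 1.0368 GB/day
```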
So, writing 30 bytes to an open file in a tight loop takes maybe 70 microseconds on my crappy iPad3. (I timed by running 100000 writes). Even flushing between writes, very little difference. I doubt you'd improve much by trying to manually buffer-- since python files handle buffering for you, it shouldn't be an issue.
Opening the file each time, takes about 1 msec per write. I suspect as you get to GB files, the seek time might take some extra time. So don't do that. Open the file and hang onto the handle, and only close it in an error handler or script end.
I still think the cb module could be problematic. Are the sensortags recording data on their own internal timer, and then you ask for data based on a python timer? Or do you poll for data, which triggers a sample?
If the sensor tags are sampling based on their own internal timer, things could drift because the clocks are different. To check for that, you would add a counter to the sensortag's payload that is incremented only when it sends a response (not when it samples).
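On the receiving side, such a counter could be checked roughly like this. The payload layout is an assumption for illustration only: the sketch pretends the counter is an unsigned 16-bit little-endian value in the first two bytes of each payload and that it wraps around at 65536.

```python
import struct

COUNTER_MODULO = 1 << 16   # assumption: 16-bit counter that wraps to 0


def read_counter(payload):
    """Extract the counter from the first two bytes (assumed layout)."""
    return struct.unpack_from('<H', payload, 0)[0]


class DropDetector:
    """Count payloads that never arrived, based on gaps in the counter.

    Assumes payloads arrive in order and are never duplicated.
    """

    def __init__(self):
        self.last = None
        self.dropped = 0

    def update(self, counter):
        """Feed the next received counter; return how many were skipped."""
        missed = 0
        if self.last is not None:
            missed = (counter - self.last - 1) % COUNTER_MODULO
        self.dropped += missed
        self.last = counter
        return missed
```

One detector per peripheral would be fed from the data callback. If dropped stays at zero but the file is still missing values, the loss happens after reception; if it grows, the samples never reached the script in the first place.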