include standalone ObjC function

JonB

Accelerate libraries are tricky.
These are all C functions, so you have to declare each one yourself with ctypes, e.g. c.vDSP_blah.argtypes=[...], etc.
That means digging up all of the function prototypes.
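
To illustrate the pattern, here is a minimal sketch of declaring a C prototype with ctypes. It uses pow from the standard C math library as a stand-in (my assumption, chosen because Accelerate only exists on Apple platforms); for a vDSP function you would copy the argument and return types from the Accelerate headers in the same way.

```python
import ctypes
import ctypes.util

# Stand-in library for illustration: the C math library.
# find_library may return None, in which case CDLL(None) falls back to the
# symbols already loaded into the current process.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# C prototype: double pow(double x, double y);
# Without these declarations ctypes assumes int arguments and an int return.
libm.pow.argtypes = [ctypes.c_double, ctypes.c_double]
libm.pow.restype = ctypes.c_double

result = libm.pow(2.0, 10.0)
print(result)
```

Every vDSP call needs the same two declarations, which is why the prototype hunting gets tedious.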

However, you can just use the equivalent numpy methods, which are probably very similar in speed, since they are also vectorized and probably use the same underlying BLAS code. There are some efficient ways to cast the buffer you get as a numpy array, without copying. Then, to get average power, you could use np.sqrt(np.mean(np.square(np_data))) to get RMS, and np.max(np.abs(np_data)) to get peak.
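
A minimal sketch of those two measurements, run on a synthetic sine wave rather than real microphone data (the test tone and the helper name rms_and_peak are my own, for illustration only):

```python
import numpy as np

def rms_and_peak(samples):
    """Return (rms, peak) for a 1-D array of float samples."""
    samples = np.asarray(samples, dtype=np.float32)
    rms = float(np.sqrt(np.mean(np.square(samples))))
    peak = float(np.max(np.abs(samples)))
    return rms, peak

# Synthetic test signal (assumption): one second of a 440 Hz sine
# at amplitude 0.5, sampled at 44.1 kHz.
t = np.arange(44100) / 44100.0
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

rms, peak = rms_and_peak(tone)
print(rms, peak)
```

For a sine wave the RMS should come out near amplitude / sqrt(2), which makes this an easy sanity check before wiring it to live audio.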

Sorry I meant to post some code on this..

JonB

By the way, the answer at the bottom of that Stack Overflow post is what I've been playing around with... But the mixer is screwing up the inputNode, since the formats are incompatible.

daltonb

No worries man, appreciate the help. Any chance you could just post that casting snippet for now? That sounds tricky for me.

As to the second point.. any reason not to just add the processing code to the tap block that updates the RecognitionRequest? (instead of adding a mixer)

JonB

Sorry, on my phone, away from my iPad... But yes, you get access to the buffer in the handler, and can compute the meter level directly there before passing it on to the recognizer.

The one issue is that iOS doesn't seem to respect the buffer size -- instead giving us 16535 samples - about .375 sec -- so you only get new data a few times per second.
There is, in theory, a way to request fewer samples (and thus a faster call rate and lower latency) using the lower-level Audio Unit API, but I can't seem to get that working...

daltonb

Ok, I'll try to figure out the right casting call in the meantime. Yeah, that is annoying; however, from that Stack Overflow post I've verified that calling buffer.setFrameLength(1024) succeeds in speeding up the callback rate significantly after the first long (0.375s) buffer. I haven't checked yet to see if I can update that before the first buffer, but it shouldn't matter too much for my purposes.

JonB

import numpy as np
from objc_util import ObjCInstance

def handler(_cmd, obj1_ptr, obj2_ptr):
    # param1 = AVAudioPCMBuffer
    #   The buffer parameter is a buffer of audio captured
    #   from the output of an AVAudioNode.
    # param2 = AVAudioTime
    #   The when parameter is the time the buffer was captured
    if obj1_ptr:
        obj1 = ObjCInstance(obj1_ptr)
        #print('length:', obj1.frameLength(), 'sample', ObjCInstance(obj2_ptr).sampleTime())
        #print('format:', obj1.format())
        # pointer to the first channel's float samples
        data = obj1.floatChannelData().contents
        # wrap the samples as a numpy array without copying;
        # if you want to use it outside of the handler, use .copy()
        data_np = np.ctypeslib.as_array(obj=data, shape=(obj1.frameLength(),))
        # RMS power of this buffer
        power = np.sqrt(np.mean(np.square(data_np)))
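
The no-copy behaviour of as_array can be checked off-device. A minimal sketch, assuming a plain ctypes float array as a stand-in for the AVAudioPCMBuffer memory (the names raw, ptr, and view are illustrative, not part of the handler above):

```python
import ctypes
import numpy as np

# Stand-in for the native audio buffer (assumption): a zero-filled ctypes
# array of 8 floats, accessed through a float* as the handler would see it.
n_frames = 8
raw = (ctypes.c_float * n_frames)()
ptr = ctypes.cast(raw, ctypes.POINTER(ctypes.c_float))

# Wrap the pointer as a numpy array. A pointer carries no length,
# so the shape has to be given explicitly.
view = np.ctypeslib.as_array(ptr, shape=(n_frames,))

# Writing to the underlying memory is visible through the view: no copy.
raw[3] = 0.75
shares_memory = bool(view[3] == 0.75)

# A .copy() is independent of the buffer, so it is safe to keep around
# after the handler returns.
snapshot = view.copy()
raw[3] = -1.0
print(shares_memory, snapshot[3], view[3])
```

This is why the .copy() matters: the view follows whatever the engine writes into the buffer next, while the snapshot does not.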

JonB

You would then, in the handler, set an attribute on your view with the power, which will get used next frame. (Or better yet, don't use update in the view; instead, trigger the draw from the handler, thus ensuring you only draw when updated info is available.)

If you want a 60 Hz frame rate, you'd want the frameLength to be 735 samples (assuming a 44.1 kHz sample rate: 44100 / 60 = 735).
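
The arithmetic behind these numbers, as a small sketch. It assumes a 44.1 kHz sample rate, which is consistent with the figures quoted in this thread; the function names are my own:

```python
def frame_length_for_rate(sample_rate_hz, updates_per_sec):
    """Number of frames per callback needed for the given update rate."""
    return round(sample_rate_hz / updates_per_sec)

def buffer_duration_sec(frame_length, sample_rate_hz):
    """How long one buffer of frame_length samples lasts, in seconds."""
    return frame_length / sample_rate_hz

frames_60hz = frame_length_for_rate(44100, 60)
print(frames_60hz)                                  # 735
print(round(buffer_duration_sec(16535, 44100), 3))  # 0.375
print(round(buffer_duration_sec(1024, 44100), 4))
```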

daltonb

MONEY. This works great for accessing the sound data!!

Sadly, upon further testing, setting the frame length to 1024 makes the speech recognition results very poor. Not sure why.. any ideas? Do you think the speech recognizer is expecting the original frame length somehow? For instance I say "Hello" and it outputs "LOL" sometimes, so maybe the input is getting clipped.

My frame length on my phone is actually 4410 by default which is ok, but I guess this is a platform specific number.

JonB

I have not tried the frameLength trick, but I wonder if the copy is having trouble keeping up, resulting in dropouts. You could write those samples to a .wav file, then listen to it using Quick Look, to see if the quality is suffering. If you comment out the numpy stuff, does the lower frame length still cause poor results? If not, there are some techniques we can use to speed up that processing.
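
A minimal sketch of the .wav dump using only the standard library. It assumes mono float samples in the range [-1.0, 1.0] that you have already collected (for example from .copy() calls in the handler); write_wav and the demo file name are hypothetical:

```python
import array
import os
import sys
import tempfile
import wave

def write_wav(path, samples, sample_rate_hz=44100):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM."""
    pcm = array.array('h', (
        int(max(-1.0, min(1.0, s)) * 32767) for s in samples
    ))
    if sys.byteorder == 'big':
        pcm.byteswap()                 # WAV data is little-endian
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)       # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm.tobytes())

# Demo with placeholder samples instead of captured audio
demo_path = os.path.join(tempfile.gettempdir(), 'tap_capture_demo.wav')
write_wav(demo_path, [0.0, 0.5, -0.5, 1.0] * 256)

with wave.open(demo_path, 'rb') as check:
    frames_written = check.getnframes()
print(frames_written)
```

Collect the samples inside the handler, but do the file writing after recording stops, so the disk I/O does not add to the processing time you are trying to measure.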

Another possibility would be to reduce the sample rate (8000, 11025, or 22050), which should ease the processor burden.

JonB

This may be obvious, but be sure to set the frameLength prior to passing the buffer to the recognizer; otherwise it will be getting duplicate data.

What happens, I think, is that the buffer contains all of the samples, including the initial 0.375 sec or so. If you change the frame length to 1024, you are telling the engine how many samples you consumed. It wants to keep that buffer the same size and never skip, so it calls you sooner next time, with everything shifted left and new samples appended at the end. The samples at the end therefore have the least latency. This takes the latency down from .375 sec for me to maybe 20-30 msec.
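
A toy model of that guess, not a description of what AVAudioEngine actually does internally. It treats the buffer as a fixed-size window over a running sample counter and assumes a 44.1 kHz sample rate; all names here are made up for the illustration:

```python
from collections import deque

SAMPLE_RATE = 44100     # assumed sample rate
CAPACITY = 16535        # buffer size reported earlier in the thread
CONSUMED = 1024         # frame length set in each callback

def simulate(callbacks):
    """Return (first, last) sample index seen in each simulated callback."""
    window = deque(range(CAPACITY), maxlen=CAPACITY)
    next_sample = CAPACITY
    history = []
    for _ in range(callbacks):
        history.append((window[0], window[-1]))
        # Consuming CONSUMED samples shifts the window left and appends
        # the same number of new samples at the end.
        for _ in range(CONSUMED):
            window.append(next_sample)
            next_sample += 1
    return history

history = simulate(3)
callback_interval_ms = 1000.0 * CONSUMED / SAMPLE_RATE
print(history)
print(round(callback_interval_ms, 1))
```

Under this model each callback starts 1024 samples later than the previous one, and the interval between callbacks is about 23 ms, which fits the 20-30 msec observed.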


import numpy as np
from objc_util import ObjCClass, ObjCInstance

AVAudioTime = ObjCClass('AVAudioTime')
# update_path and requestBuffer are defined elsewhere in the script

def handler(_cmd, buffer_ptr, samptime_ptr):
    if buffer_ptr:
        buffer = ObjCInstance(buffer_ptr)
        # a way to get the time in sec of the start of the buffer, comparable to time.perf_counter.
        # you can difference these to see the latency to the start of the buffer.
        hostTimeSec = AVAudioTime.secondsForHostTime_(ObjCInstance(samptime_ptr).hostTime())

        # you can also check for skips by looking at sampleTime(), which should always increment
        # by whatever you set the frameLength to... if more than that, your other processing is taking too long

        # this just sets up pointers that numpy can read... no actual read yet
        data = buffer.floatChannelData().contents
        data_np = np.ctypeslib.as_array(obj=data, shape=(buffer.frameLength(),))

        # take the LAST N samples for use in visualization, i.e. the most recent, with the least latency
        update_path(data_np[-1024:])

        # this tells the engine how many samples we consumed... next time, we will get samples [1024:] along with 1024 new samples
        buffer.setFrameLength_(1024)

        # be sure to append the buffer AFTER setting the frameLength, otherwise you will keep feeding it repeated portions of the data
        requestBuffer.append(buffer)
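
The skip check mentioned in the comments could look something like this. It is a sketch that works on a list of sampleTime() values you have logged yourself; find_skips and the logged values are hypothetical:

```python
def find_skips(sample_times, frame_length):
    """Return (index, missing_samples) for each gap larger than frame_length."""
    skips = []
    for i in range(1, len(sample_times)):
        step = sample_times[i] - sample_times[i - 1]
        if step > frame_length:
            skips.append((i, step - frame_length))
    return skips

# Hypothetical sampleTime() values from four callbacks, 1024 frames each;
# the last callback arrived one buffer late.
logged = [0, 1024, 2048, 4096]
skips = find_skips(logged, 1024)
print(skips)  # [(3, 1024)]
```

An empty result means the handler is keeping up; any entries tell you which callback fell behind and by how many samples.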

daltonb

Hey @JonB sorry for the slow response, this did help me get over a hump though. I think my frameCapacity is less than yours which is apparently the upper limit for frameLength.. setting sample size to 2048 worked well. I'm planning to post a first crack at a live speech recognition module soon.