(Available as of version 1.74.00; Advanced features available as of 1.77.00)
AudioCapture() allows easy audio recording and saving of arbitrary sounds to a file (wav format). AudioCapture will likely be replaced entirely by AdvAudioCapture in the near future.
AdvAudioCapture() can do everything AudioCapture does, and also allows onset-marker sound insertion and detection, loudness computation (RMS audio “power”), and lossless file compression (flac). The Builder microphone component now uses AdvAudioCapture by default.
Speech2Text() provides speech recognition (courtesy of google), with about 1-2 seconds latency for a 2 sec voice recording. Note that the sound files are sent to google over the internet. Intended for within-experiment processing (near real-time, 1-2s delayed), in which priority is given to keeping an experiment session moving along, even if that means skipping a slow response once in a while. See coder demo > input > speech_recognition.py.
Eventually, other features are planned, including: speech onset detection (to automatically estimate vocal RT for a given speech sample), and interactive visual inspection of sound waveform, with playback and manual onset determination (= the “gold standard” for RT).
You need to switch on the microphone before use, which can take several seconds. The only time you can specify the sample rate (in Hz) is during switchOn().
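For example, a minimal sketch (the sampleRate keyword name is an assumption based on the description above; switchOff() is the matching shutdown call):

from psychopy import microphone

microphone.switchOn(sampleRate=48000)  # boots the audio server; can take a few seconds
# ... create capture objects and record ...
microphone.switchOff()  # when completely done recording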
Considerations on the default sample rate 48kHz:
DVD or video = 48,000
CD-quality = 44,100 / 16 bit
human hearing: ~15,000 (adult); children & young adult higher
human speech: 100-8,000 (useful for telephone: 100-3,300)
google speech API: 16,000 or 8,000 only
Nyquist rate: sample at twice the highest frequency of interest; good to oversample a bit
pyo’s downsamp() function can reduce 48,000 to 16,000 in about 0.02s (it uses integer step sizes). So recording at 48kHz will generate high-quality archival data and permit easy downsampling.
Class extends AudioCapture, plays a marker sound as a “start” indicator.
Has method for retrieving the marker onset time from the file, to allow calculation of vocal RT (or other sound-based RT).
See Coder demo > input > latencyFromTone.py
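A minimal recording sketch (the savedFile attribute name is an assumption here):

from psychopy import microphone

microphone.switchOn(sampleRate=48000)  # must precede creating a capture object
mic = microphone.AdvAudioCapture(filename='trial_01.wav')
mic.record(2.0, block=True)  # plays the onset marker, records 2 s, then returns
print(mic.savedFile)  # path to the saved .wav file (attribute name assumed)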
Compress using FLAC (lossless compression).
Return the RMS loudness of the saved recording.
Returns (hz, duration, volume) of the marker sound. Custom markers always return 0 hz (regardless of the sound).
Return (onset, offset) time of the first marker within the first secs of the saved recording.
Has approximately 1.33 ms resolution at 48000 Hz with chunk=64. Larger chunks can speed up processing at the cost of some resolution, e.g., to pre-process long recordings with multiple markers.
If given a filename, it will first set that file as the one to work with, and then try to detect the onset marker.
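As a sketch, assuming the method is named getMarkerOnset() and mic is an AdvAudioCapture object with a saved recording:

onset, offset = mic.getMarkerOnset(secs=0.5)  # search the first 0.5 s of the file
voiceOnset = 1.25  # hypothetical: speech onset, in seconds into the same file
vocalRT = voiceOnset - onset  # sound-based RT, relative to the marker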
Plays the current marker sound. This is automatically called at the start of recording, but can be called anytime to insert a marker.
Plays the saved .wav file, as just recorded or resampled. Execution blocks by default, but can return immediately with block=False.
loops : number of extra repetitions; 0 = play once
stop : True = immediately stop ongoing playback (if there is one), and return
Starts recording and plays an onset marker tone just prior to returning. The idea is that the start of the tone in the recording indicates when this method returned, to enable you to sync a known recording onset with other events.
Re-sample the saved file to a new rate, return the full path.
Can take several visual frames to resample a 2s recording.
The default values for resample() are for google-speech, keeping the original (presumably recorded at 48kHz) to archive. A warning is generated if the new rate is not an integer factor / multiple of the old rate.
To control anti-aliasing, use pyo.downsamp() or upsamp() directly.
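For example (the keyword names newRate and keep are assumptions based on the description above):

newFile = mic.resample(newRate=16000, keep=True)  # 16 kHz copy for google-speech; returns the path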
Restores to a fresh state, ready to record again.
Sets the name of the file to work with, e.g., for getting onset time.
Sets the onset marker, where tone is either in hz or a custom sound.
The default tone (19000 Hz) is recommended for auto-detection because it is easier to isolate from speech sounds (and so more reliable to detect). The default duration and volume are appropriate for a quiet setting such as a lab testing room. A louder volume, a longer duration, or both may give better results when recording loud sounds or in noisy environments, and will still be auto-detected just fine (even more easily). If the hardware microphone in use is not physically near the speaker hardware, a louder volume is likely to be required.
Custom sounds cannot be auto-detected, but are supported anyway for presentation purposes. E.g., a recording of someone saying “go” or “stop” could be passed as the onset marker.
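A hedged sketch, assuming the method is named setMarker() with tone / secs / volume keywords (the values here are illustrative, not recommendations):

mic.setMarker(tone=19000, secs=0.05, volume=0.3)  # louder and longer, e.g., for a noisy room
mic.setMarker(tone='go.wav')  # a custom sound; supported, but not auto-detectable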
Interrupt a recording that is in progress; close & keep the file.
Ends the recording before the duration that was initially specified. The same file name is retained, with the same onset time but a shorter duration.
The same recording cannot be resumed after a stop (it is not a pause), but you can start a new one.
Uncompress from FLAC to .wav format.
Class for speech-recognition (voice to text), using Google’s public API.
Google’s speech API is currently free to use, and seems to work well. Intended for within-experiment processing (near real-time, 1-2s delayed), in which it’s often important to skip a slow or failed response rather than wait a long time; BatchSpeech2Text() reverses these priorities.
It is possible (and perhaps even likely) that Google will start charging for usage. In addition, they can change the interface at any time, including in the middle of an experiment. (If so, please post to the user list and we’ll try to develop a fix, but there could still be some downtime.) Presumably, confidential or otherwise sensitive voice data should not be sent to google.
Note: Requires that flac is installed (free download from https://xiph.org/flac/download.html). If you download and install flac but get an error that flac is missing, try setting the full path to flac in Preferences -> General -> flac.
Usage:
Always import and make an object; no data are available yet:
from psychopy.microphone import Speech2Text
gs = Speech2Text('speech_clip.wav')  # set-up only
Then, either: Initiate a query and wait for a response from google (or until the time-out limit is reached). This is “blocking” mode, and is the easiest to do:
resp = gs.getResponse()  # execution blocks here
print(resp.word, resp.confidence)
Or instead (advanced usage): Initiate a query, but do not wait for a response (“thread” mode: no blocking, no timeout, more control). running will change to False when a response is received (or hang indefinitely if something goes wrong, so you might want to implement a time-out as well):
import sys
from psychopy import core

resp = gs.getThread()  # returns immediately
while resp.running:
    print('.', end='')  # displays dots while waiting
    sys.stdout.flush()
    core.wait(0.1)
print(resp.words)
Options: Set-up with a different language for the same speech clip; you’ll get a different response (possibly having UTF-8 characters):
gs = Speech2Text('speech_clip.wav', lang='ja-JP')
resp = gs.getResponse()
Example: See Coder demos / input / speech_recognition.py
Known limitations:
Availability is subject to the whims of google. Any changes google makes along the way could either cause complete failure (disruptive), or could cause slightly different results to be obtained (without it being readily obvious that something had changed). For this reason, it’s probably a good idea to re-run speech samples through Speech2Text at the end of a study; see BatchSpeech2Text().
Author: Jeremy R. Gray, with thanks to Lefteris Zafiris for his help and excellent command-line perl script at https://github.com/zaf/asterisk-speech-recog (GPLv2)
Calls getThread(), and then polls the thread until there’s a response.
Will time-out if no response comes within timeout seconds. Returns an object having the speech data in its namespace. If there’s no match, generally the values will be equivalent to None (e.g., an empty string).
If you do resp = getResponse(), you’ll be able to access the data in several ways:
- resp.word : the best match, i.e., the most probable word, or None
- resp.confidence : google’s confidence about .word, ranging 0 to 1
- resp.words : tuple of up to 5 guesses; so .word == .words[0]
- resp.raw : the raw response from google (just a string)
- resp.json : a parsed version of raw, from json.load(raw)
Send a query to google using a new thread, no blocking or timeout.
Returns a thread which will eventually (not immediately) have the speech data in its namespace; see getResponse. In theory, you could have several threads going simultaneously (almost all the time is spent waiting for a response), rather than doing them sequentially (not tested).
Like Speech2Text(), but takes a list of sound files or a directory name to search for matching sound files, and returns a list of (filename, response) tuples. The responses are described in Speech2Text.getResponse().
Can use up to 5 concurrent threads. Intended for post-experiment processing of multiple files, in which waiting for a slow response is not a problem (better to get the data).
If files is a string, it will be used as a directory name for glob (matching all *.wav, *.flac, and *.spx files). There’s currently no re-try on http error.
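A sketch of post-experiment batch processing, assuming each response behaves like a getThread() object (poll .running, then read .word):

from psychopy import core
from psychopy.microphone import BatchSpeech2Text

for filename, resp in BatchSpeech2Text('data/'):  # all *.wav, *.flac, *.spx in data/
    while resp.running:  # a response may still be in flight
        core.wait(0.1)
    print(filename, resp.word)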
PsychoPy provides lossless compression using the FLAC codec. (This requires that flac is installed on your computer. It is not included with PsychoPy by default, but you can download it for free from http://xiph.org/flac/ .) Functions for file-oriented Discrete Fourier Transform and RMS computation are also provided.
Lossless compression: convert .wav file (on disk) to .flac format.
If path is a directory name, convert all .wav files in the directory.
keep to retain the original .wav file(s), default True.
level is compression level: 0 is fastest but larger, 8 is slightly smaller but much slower.
Uncompress: convert .flac file (on disk) to .wav format (new file).
If path is a directory name, convert all .flac files in the directory.
keep to retain the original .flac file(s), default True.
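As a sketch, a round trip might look like this (assuming the module-level names wav2flac() and flac2wav(), and that flac is installed):

from psychopy.microphone import wav2flac, flac2wav

wav2flac('speech_clip.wav', keep=True, level=5)  # writes speech_clip.flac
flac2wav('speech_clip.flac', keep=True)  # writes a .wav back out, losslessly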
Compute and return magnitudes of numpy.fft.fft() of the data.
If given a sample rate (samples/sec), will return (magn, freq). If wantPhase is True, phase in radians is also returned (magn, freq, phase). data should have a power-of-2 number of samples, or it will be truncated.
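A sketch, assuming the function is named getDft() and takes a sampleRate keyword:

import numpy as np
from psychopy.microphone import getDft

t = np.arange(1024) / 48000.0  # 1024 samples = a power of 2
data = np.sin(2 * np.pi * 440 * t)  # a 440 Hz test tone at 48 kHz
magn, freq = getDft(data, sampleRate=48000)  # freq is returned because a rate was given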
Compute and return the audio power (“loudness”).
Uses numpy.std() as RMS. std() is the same as RMS if the mean is 0, and .wav data should have a mean of 0. Returns an array if given stereo data (RMS computed within-channel).
data can be an array (1D, 2D) or filename; .wav format only. data from .wav files will be normalized to -1..+1 before RMS is computed.
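A sketch, assuming the function is named getRMS():

from psychopy.microphone import getRMS

loudness = getRMS('speech_clip.wav')  # .wav data are normalized to -1..+1 first
# A 1D or 2D numpy array also works; stereo input returns one RMS per channel.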