Quite a while back, I used Jasper to create a speech recognition periodic table fun fact program. That was before my website was revamped and I lost the original blog post. (Well, I do have it in my backups, but it’s not really worth reviving at this point.) I do have a video showing the results, however:
I’ve started to revisit this project, and here are my first thoughts…
Introduction
I’m working on voice-activated scientific instrumentation, and I would like to have a speech recognition system that is
- easy to install
- easy to maintain
- easy to configure
Jasper was a lot of fun, but there is a tremendous amount of overhead, a (lack of) tremendous amount of documentation, and a (sadly, not lack of) tremendous number of hiccups along the way. This last point is especially true for those trying to use a version 2 Raspberry Pi with Jessie. Since my project does not require an exhaustive dictionary, and I am only interested in a few commands (at least, in the beginning), I wanted to take a minimalist approach to speech recognition.
Setup
I’m using Kamino Base – a fresh Jessie install with just a few additional software packages and minimal configuration such as timezone, keyboard layout and ssh access. I then installed pocketsphinx with `sudo apt-get install pocketsphinx`, which installs version 0.8-5. No other software is installed at this point. I have a USB microphone attached and have made sure I can record sound via `arecord` and play sound via `aplay`. I have made no attempt to change the default audio devices or to reorder devices with modules, so for `arecord` I need the `-D plughw:1,0` flag to indicate that my mic is device #1.
Progress
Simply running `pocketsphinx_continuous -adcdev plughw:1,0` gets me some speech recognition. Nice and simple. Sadly, the text I want is buried deep within a whole bunch of information that is more-or-less useless to me. We can get rid of most of it with the `-logfn /dev/null` flag. Now, I get output that looks like this:
Now, I find the result rather ironic, since I said, “I understand what you say”, but I’ll worry about accuracy later. Right now, I want to be able to take the output of this command and use it in another program. My thought is this: what if I filter out all extraneous text and send it to a named pipe? That way, another program can be aware of the most recently uttered text and use that information as desired.
Along the way, I have learned a bit about sed and named pipes. Perhaps I’ll provide some more detail at some point, but here are the main facts:
- sed can be used to remove any lines that don’t contain the useful information. In this case, the useful information has a nice identifier – 9 digits plus a colon and a space.
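- There’s plenty of information on the web about Linux and named pipes. This project, however, required what is called a non-blocking pipe: the writer must keep going even when nothing is reading from the pipe. Surprisingly, this can be done with a [relatively straightforward C program](http://stackoverflow.com/q/7360473/2711057).
- In order to pass sed’s output on to the pipe without delay, we need the `--unbuffered` flag. Otherwise sed holds its output in a buffer when it is not writing to a terminal.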
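The filtering idea in the first bullet can be tried without a microphone. The sketch below feeds a few made-up lines through sed. The sample text and the function name `filter_hypotheses` are assumptions for illustration; only the "9 digits, a colon and a space" prefix comes from the description above, and the pattern is written a little more strictly than it needs to be.

```shell
#!/bin/sh
# Keep only lines that start with 9 digits, a colon and a space,
# and print the text that follows the prefix. All other lines are dropped.
filter_hypotheses() {
    sed -n 's/^[0-9]\{9\}: \(.*\)/\1/p'
}

# Hypothetical sample input: two status lines and one recognized utterance.
printf '%s\n' \
    'some status line' \
    '000000001: hello world' \
    'another status line' | filter_hypotheses
```

Only the text after the numeric prefix should come out the other end. The `--unbuffered` flag is left out here because nothing downstream is waiting on the output.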
The results
So, the first thing we need is to make the `ftee` command described in the above link. The code:
    /* ftee - clone stdin to stdout and to a named pipe
       (c) racic@stackoverflow
       WTFPL Licence */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <signal.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int readfd, writefd;
        struct stat status;
        char *fifonam;
        char buffer[BUFSIZ];
        ssize_t bytes;

        /* Do not die when the reader goes away */
        signal(SIGPIPE, SIG_IGN);

        if (2 != argc) {
            printf("Usage:\n someprog 2>&1 | %s FIFO\n FIFO - path to a"
                   " named pipe, required argument\n", argv[0]);
            exit(EXIT_FAILURE);
        }
        fifonam = argv[1];

        /* Open the read end first so the non-blocking write open can succeed */
        readfd = open(fifonam, O_RDONLY | O_NONBLOCK);
        if (-1 == readfd) {
            perror("ftee: readfd: open()");
            exit(EXIT_FAILURE);
        }

        if (-1 == fstat(readfd, &status)) {
            perror("ftee: fstat");
            close(readfd);
            exit(EXIT_FAILURE);
        }

        if (!S_ISFIFO(status.st_mode)) {
            printf("ftee: %s is not a fifo!\n", fifonam);
            close(readfd);
            exit(EXIT_FAILURE);
        }

        writefd = open(fifonam, O_WRONLY | O_NONBLOCK);
        if (-1 == writefd) {
            perror("ftee: writefd: open()");
            close(readfd);
            exit(EXIT_FAILURE);
        }

        close(readfd);

        while (1) {
            bytes = read(STDIN_FILENO, buffer, sizeof(buffer));
            if (bytes < 0 && errno == EINTR)
                continue;
            if (bytes <= 0)
                break;

            /* Small change from the linked version: the count from read()
               is used for both writes, so a failed stdout write cannot
               alter how much is sent to the FIFO. */
            if (-1 == write(STDOUT_FILENO, buffer, bytes))
                perror("ftee: writing to stdout");
            if (-1 == write(writefd, buffer, bytes))
                ; /* no reader or pipe full: ignore the error */
        }
        close(writefd);
        return 0;
    }
which is compiled with gcc -o ftee ftee.c. I put that in ~/bin, which I have in my PATH variable.
Next, we make a pipe using mkfifo /tmp/speech and then run the following command:
    pocketsphinx_continuous -adcdev plughw:1,0 -logfn /dev/null | sed --unbuffered -n 's/^[0-9: ]\{11\}\(.*\)/\1/p' | ftee /tmp/speech
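Since ftee exits with an error when its argument is missing or is not a FIFO, a start-up script may want to create the pipe only when needed. Here is a minimal sketch of that check; the function name `ensure_fifo` is hypothetical and not part of the original setup.

```shell
#!/bin/sh
# Create the named pipe only if it is not already there.
# Fails if the path exists but is something other than a FIFO.
ensure_fifo() {
    if [ -p "$1" ]; then
        return 0
    elif [ -e "$1" ]; then
        echo "ensure_fifo: $1 exists and is not a FIFO" >&2
        return 1
    fi
    mkfifo "$1"
}

# Usage: ensure_fifo /tmp/speech, then start the pipeline.
```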
In another shell, I can use `cat < /tmp/speech` to view the text recognized by pocketsphinx. Here’s what’s going on: `-adcdev plughw:1,0` tells pocketsphinx to use my mic, which is device #1; `-logfn /dev/null` hides the INFO lines. We then pipe the output to sed. The `--unbuffered` flag makes sed flush its output promptly instead of holding it in a buffer, so each result travels down the pipeline right away, and `-n` prevents printing of output unless we say otherwise. The sed regular expression says to search for a line beginning with an 11-character prefix made of digits, colons and spaces (`^[0-9: ]\{11\}`; in practice, 9 digits, a colon and a space) and store whatever is left (`\(.*\)`), because we want that returned (`\1`) and printed (`p`). Lastly, we pipe the output to ftee, which copies it both to stdout and to the named pipe.
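Instead of `cat`, another program can read the pipe one line at a time and react to each utterance. The following is only a sketch of that idea: the function names `handle_phrase` and `listen`, and the phrases "start" and "stop", are placeholders rather than commands from this project.

```shell
#!/bin/sh
# Map a recognized phrase to an action. Phrases and actions are placeholders.
handle_phrase() {
    case "$1" in
        "start") echo "starting measurement" ;;
        "stop")  echo "stopping measurement" ;;
        *)       echo "ignored: $1" ;;
    esac
}

# Read the given file or pipe line by line; each line is one utterance.
listen() {
    while IFS= read -r phrase; do
        handle_phrase "$phrase"
    done < "$1"
}

# Usage (waits for a writer such as ftee to have the pipe open):
#   listen /tmp/speech
```

The loop ends when the writer closes the pipe, so a long-running consumer would need to reopen it if the recognition pipeline is restarted.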
Conclusion
So, at this point I have a very compact set of commands with minimal overhead that provides access to the most recent text recognized by the speech-to-text engine. The next step is to create a program that will take advantage of this feature.