Speech recognition part 1

Quite a while back, I used Jasper to create a speech recognition periodic table fun fact program.  That was before my website was revamped and I lost the original blog post.  (Well, I do have it in my backups, but it’s not really worth reviving at this point.)  I do have a video showing the results, however:

I’ve started to revisit this project, and here are my first thoughts…

Introduction

I’m working on voice-activated scientific instrumentation, and I would like to have a speech recognition system that is

  • easy to install
  • easy to maintain
  • easy to configure

Jasper was a lot of fun, but there is a tremendous amount of overhead, a (lack of) tremendous amount of documentation, and a (sadly, not lack of) tremendous amount of hiccups along the way.  This last point is especially true for those trying to use a version 2 Raspberry Pi with Jessie.  Since my project does not require an exhaustive dictionary, and I am only interested in a few commands (at least, in the beginning), I wanted to take a minimalist approach to speech recognition.

Setup

I’m using Kamino Base – a fresh Jessie install with just a few additional software packages and minimal configurations such as timezone, keyboard layout and ssh access.  I then installed pocketsphinx with sudo apt-get install pocketsphinx, which installs version 0.8-5.  No other software is installed at this point.  I have a USB microphone attached and have made sure I can record sound via arecord and play sound via aplay.  I have made no attempt to change the default audio devices or to reorder the sound modules, so for arecord, I need the -D plughw:1,0 flag to indicate that my mic is device #1.

Progress

Simply running pocketsphinx_continuous -adcdev plughw:1,0 gets me some speech recognition.  Nice and simple.  Sadly, the text I want is buried deep within a whole bunch of information that is more-or-less useless to me.  We can get rid of most of it with the -logfn /dev/null flag.  Now, I get output that looks like this:

[Image: speech_output]

Now, I find the result rather ironic, since I said, “I understand what you say”, but I’ll worry about accuracy later.  Right now, I want to be able to take the output of this command and use it in another program.  My thought is this: what if I filter out all extraneous text and send it to a named pipe?  That way, another program can be aware of the most recently uttered text and use that information as desired.
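To make the named-pipe idea concrete, here is a minimal sketch of one process writing a line of text into a FIFO and another reading it back. The scratch path and the variable names are assumptions for illustration only, and printf merely stands in for the recognizer. Note that with a plain FIFO the writer waits until a reader has opened the other end.

```shell
#!/bin/sh
# Scratch location for the demo (hypothetical; not the pipe used later in the post).
fifo_dir=$(mktemp -d)
fifo="$fifo_dir/demo_pipe"
mkfifo "$fifo"

# Writer: stands in for the recognizer. Opening the FIFO for writing
# blocks until some reader has opened it as well, so run it in the background.
printf 'i understand what you say\n' > "$fifo" &

# Reader: stands in for the program that wants the most recent text.
IFS= read -r heard < "$fifo"
wait

echo "heard: $heard"

# Clean up the scratch FIFO.
rm -r "$fifo_dir"
```

If the reader is never started, the writer in this sketch simply hangs, which is awkward for a recognizer that should keep running whether or not anyone is listening.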

Along the way, I have learned a bit about sed and named pipes.  Perhaps I’ll provide some more detail at some point, but here are the main facts:

  • sed can be used to remove any lines that don’t contain the useful information.  In this case, the useful information has a nice identifier – 9 digits plus a colon and a space.
  • there’s plenty of information on the web about Linux and named pipes.  This project, however, required what is called a non-blocking pipe.  Surprisingly, this can be done with a [relatively straightforward C program](http://stackoverflow.com/q/7360473/2711057).
  • In order to redirect sed to the pipe, we need to use the --unbuffered flag.
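The sed step can be tried on its own, without a microphone. The sketch below assumes GNU sed and uses made-up sample lines that imitate the recognizer's output; the function names are hypothetical. It is written with basic regular expression escapes, which may differ from the exact expression used in the original command.

```shell
#!/bin/sh
# Keep only lines that start with 11 characters drawn from digits, colon
# and space (the "000000000: " style prefix), and print what follows it.
filter_utterances() {
    sed --unbuffered -n 's/^[0-9: ]\{11\}\(.*\)/\1/p'
}

# Made-up sample input imitating the recognizer's output (an assumption).
sample_output() {
    printf '%s\n' \
        'READY....' \
        'Listening...' \
        '000000000: i understand what you say' \
        'READY....'
}

# Only the recognized text survives the filter.
sample_output | filter_utterances
```

A stricter pattern such as ^[0-9]\{9\}: would insist on exactly nine digits followed by a colon and a space, at the cost of being slightly longer.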

The results

So, the first thing we need is to build the ftee command described in the link above.  The code is saved as ftee.c and compiled with gcc -o ftee ftee.c.  I put the resulting binary in ~/bin, which I have in my PATH variable.

Next, we make a pipe using mkfifo /tmp/speech and then run the following command:

In another shell, I can use cat < /tmp/speech to view the text recognized by pocketsphinx. Here’s what’s going on: “-adcdev plughw:1,0” tells pocketsphinx to use my mic, which is device #1; “-logfn /dev/null” hides the INFO lines.  We then pipe the output to sed.  The “--unbuffered” flag makes sed flush each line as soon as it has been processed, so the text reaches the next stage of the pipeline right away, and “-n” prevents printing of output unless we say otherwise.  The sed regular expression matches a line beginning with 9 digits, a colon, and a space (^[0-9: ]{11}, which strictly speaking accepts any 11 characters drawn from digits, colons, and spaces), captures whatever is left (.*), substitutes just that capture (\1), and prints it (p).  Lastly, we pipe the output to ftee.
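Putting the pieces described above together, the pipeline might look like the sketch below. It is reconstructed from the description rather than copied from the original command, so treat the exact quoting and the function name speech_to_pipe as assumptions. It also assumes that pocketsphinx_continuous and the compiled ftee helper are on the PATH, that ftee takes the pipe path as its argument, and that the pipe already exists.

```shell
#!/bin/sh
# Sketch: recognizer -> sed filter -> ftee -> named pipe.
# speech_to_pipe is a hypothetical wrapper name; the pipe path defaults
# to /tmp/speech, which must have been created with mkfifo beforehand.
speech_to_pipe() {
    pipe="${1:-/tmp/speech}"
    pocketsphinx_continuous -adcdev plughw:1,0 -logfn /dev/null |
        sed --unbuffered -n 's/^[0-9: ]\{11\}\(.*\)/\1/p' |
        ftee "$pipe"
}

# To start listening (not run here, since it needs a microphone):
#   mkfifo /tmp/speech
#   speech_to_pipe /tmp/speech
```

Wrapping the pipeline in a function only keeps the sketch from starting the recognizer as soon as the file is sourced; typing the three piped commands directly at the prompt is equivalent.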

Conclusion

So, at this point I have a very compact set of commands with minimal overhead that provides access to the most recent text recognized by the speech-to-text engine.  The next step is to create a program that will take advantage of this feature.
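As a rough idea of what such a program could look like, here is a minimal sketch of a consumer that reads recognized text line by line and maps it to an action. The trigger words, the actions, and the function names are all placeholders invented for illustration; none of them come from the project itself.

```shell
#!/bin/sh
# Map one line of recognized text to an action (placeholder phrases).
handle_utterance() {
    case "$1" in
        *start*) echo "action: start" ;;
        *stop*)  echo "action: stop" ;;
        *)       echo "ignored: $1" ;;
    esac
}

# Read utterances line by line from standard input.
dispatch_loop() {
    while IFS= read -r line; do
        handle_utterance "$line"
    done
}

# Example run on canned text. Against the real pipe this would be:
#   dispatch_loop < /tmp/speech
printf '%s\n' 'please start' 'i understand what you say' | dispatch_loop
```

Matching on substrings rather than whole sentences is a deliberate choice here, since the recognizer is not yet accurate enough to rely on exact phrases.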

Published by

BoB
