Description of the CMU Sphinx-4 Speech Recognizer's use inside infinear.com
infinear.com uses an open source technology stack with Asterisk 1.6 as the primary PSTN interface. The PSTN gateway is hosted by an external provider. Asterisk uses the asynchronous AGI interface to communicate with a Java Sphinx-4 speech decoder. The multithreaded Java decoder uses a MySQL database for user content. The primary data format internal to infinear.com is XML, so all HTML content is converted into XML using NekoHTML. Cepstral's TTS product is used via JNI to generate speech from text.
Internally, infinear.com uses the CMU Sphinx-4 decoder. A very fast trainer focused only on phonemes generates the acoustic models. A typical training session generates models in a couple of hours on a dual-core 2.6 GHz machine. The CMU-Cambridge Language Toolkit generates the language models. The exact dictionary varies for every user and use case, so we switch grammars at decode time depending on the user and use case. A typical grammar generation scenario is driven by user content stored in a database. The user-specific words are put in a dictionary, and a rules engine generates a phoneme sequence for each word. The CMU LM Toolkit then generates a language model (LM) and a vocabulary.
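The dictionary step described above can be sketched roughly as follows. This is a minimal illustration, not the infinear.com rules engine: the class name UserDictionaryBuilder, its methods, and the one-phoneme-per-letter table are all hypothetical, and a real rules engine would use context-dependent letter-to-sound rules. The output uses the word-followed-by-phones layout, one entry per line, that CMU pronunciation dictionaries use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: turn user-specific words into dictionary lines.
final class UserDictionaryBuilder {

    // Toy letter-to-phoneme table indexed by 'a'..'z' (an assumption for
    // illustration only; real pronunciation rules depend on context).
    private static final String[] LETTER_PHONES = {
        "AE", "B", "K", "D", "EH", "F", "G", "HH", "IH", "JH", "K", "L", "M",
        "N", "AA", "P", "K", "R", "S", "T", "AH", "V", "W", "K S", "Y", "Z"
    };

    // Maps each ASCII letter to a phoneme and skips every other character.
    static String pronounce(String word) {
        List<String> phones = new ArrayList<>();
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                phones.add(LETTER_PHONES[c - 'a']);
            }
        }
        return String.join(" ", phones);
    }

    // Returns sorted, de-duplicated lines of the form "WORD PH1 PH2 ...".
    static List<String> buildDictionary(List<String> userWords) {
        Map<String, String> entries = new TreeMap<>();
        for (String word : userWords) {
            String key = word.trim().toUpperCase(Locale.ROOT);
            String phones = pronounce(key);
            if (!phones.isEmpty()) {
                entries.put(key, phones);
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // In the described system the words would come from the user database.
        for (String line : buildDictionary(Arrays.asList("inbox", "Mail"))) {
            System.out.println(line);
        }
    }
}
```

With the toy table, the two sample words produce the lines "INBOX IH N B AA K S" and "MAIL M AE IH L". The same word list would also serve as input text for the language model toolkit.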
The Sphinx decoder is initialized at runtime from this dynamically generated grammar. The standard decode phases then generate text from the speech input. Some standard rules are applied to eliminate stop words and classify the resulting string. The string is then used to drive the user-specified flows (for example, logging in to Yahoo Mail or performing some other web action).
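The post-processing of the decoded text might look something like the sketch below. The stop-word list, the keyword table, and the flow names (LOGIN_MAIL, WEB_ACTION, UNKNOWN) are assumptions made up for illustration; the actual rules used by infinear.com are not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: strip stop words, then map the text to a flow.
final class UtteranceClassifier {

    // Assumed stop-word list; a deployed system would use a larger one.
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "the", "to", "my", "me", "of", "please"));

    // Assumed keyword-to-flow table; the flow names are placeholders.
    private static final Map<String, String> KEYWORD_TO_FLOW = new HashMap<>();
    static {
        KEYWORD_TO_FLOW.put("mail", "LOGIN_MAIL");
        KEYWORD_TO_FLOW.put("email", "LOGIN_MAIL");
        KEYWORD_TO_FLOW.put("search", "WEB_ACTION");
        KEYWORD_TO_FLOW.put("open", "WEB_ACTION");
    }

    // Lowercases the hypothesis, splits on whitespace, drops stop words.
    static List<String> removeStopWords(String hypothesis) {
        List<String> kept = new ArrayList<>();
        for (String token : hypothesis.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Returns the flow of the first recognized keyword, or "UNKNOWN".
    static String classify(String hypothesis) {
        for (String token : removeStopWords(hypothesis)) {
            String flow = KEYWORD_TO_FLOW.get(token);
            if (flow != null) {
                return flow;
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        String hypothesis = "please log in to my mail";
        System.out.println(removeStopWords(hypothesis));
        System.out.println(classify(hypothesis));
    }
}
```

Keeping the stop-word filter separate from the classifier means either table can be swapped per user, in the same spirit as the per-user grammars.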
A key deployment choice was Amazon EC2. There are EC2 data centers on the east and west coasts of the US, as well as data centers in Europe and Singapore. The Singapore data center provides a low-latency springboard to India and China for future expansion.
Amazon EC2 instances can be dynamically started and stopped depending on load. A SIP/HTTP load balancer chooses which instance to bind a user call to. All instances host the entire infinear.com Java/Asterisk/PHP codebase, and a single large shared database instance serves all of them. Asterisk has been configured to handle only SIP signalling. All RTP traffic avoids EC2 entirely, so we do not have to pay for EC2 resources during long user calls.
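The text does not say how the load balancer picks an instance. As one plausible policy, the sketch below binds a new call to the instance with the fewest active calls that still has spare capacity. The class InstanceSelector, the Instance fields, and the host names are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a least-loaded instance selection policy.
final class InstanceSelector {

    // Minimal view of one running instance (fields are assumptions).
    static final class Instance {
        final String host;
        final int activeCalls;
        final int capacity;

        Instance(String host, int activeCalls, int capacity) {
            this.host = host;
            this.activeCalls = activeCalls;
            this.capacity = capacity;
        }
    }

    // Picks the instance with the fewest active calls among those below
    // capacity; returns empty when every instance is full, which would be
    // the signal to start another instance.
    static Optional<Instance> choose(List<Instance> instances) {
        Instance best = null;
        for (Instance candidate : instances) {
            if (candidate.activeCalls >= candidate.capacity) {
                continue; // no room for another call
            }
            if (best == null || candidate.activeCalls < best.activeCalls) {
                best = candidate;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<Instance> pool = Arrays.asList(
            new Instance("node-a.example", 5, 10),
            new Instance("node-b.example", 2, 10),
            new Instance("node-c.example", 10, 10));
        System.out.println(choose(pool).map(i -> i.host).orElse("none available"));
    }
}
```

Because every instance hosts the full codebase and shares one database, any instance can serve any caller, which is what makes a simple policy like this workable.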
infinear.com is primarily a server-side solution driven by voice and speech, but smartphones provide good platforms for an enhanced user experience. Though our focus is on hands-free solutions, there are many situations where a user may want to interact via a web interface or a native app interface. In 2010, infinear.com will focus primarily on the iPhone, BlackBerry and Google's Android platforms.
Last Updated: 27 February 2008