Hands-On with Polly: Amazon’s AI-Based Speech Synthesizer

Artificial intelligence (AI) is one of the most important trends in IT right now.
Amazon Web Services (AWS) has many offerings in the AI space. One of the most interesting Amazon AI services for non-developers is Polly, which was released in November 2016 at the AWS re:Invent conference.
At its simplest, Polly is a text-to-speech engine. The GUI doesn’t make Polly look very impressive or sophisticated. You can use it to enter text strings and click a button to have your computer speak them. Figure 1 shows how this looks.
Figure 1: This is Amazon Polly’s GUI.
I must admit that I was a little bit nostalgic when I first saw the Polly interface. A friend of mine had an Apple II program called S.A.M. that did almost the same thing. You could type a text string into the computer and it would then speak what you had written.
My friends and I were initially amazed at the robotic-sounding speech coming from the computer. Eventually, though, our amazement gave way to finding creative ways to trick the computer into swearing (which wasn’t difficult).
The point is that text-to-speech engines have been around for decades. If you base your opinion of Polly solely on Figure 1 above, it is easy to dismiss Polly as a cloud-based rehash of a 30-year-old app.
As with many things in IT, however, Polly is more than it first appears. The text-to-speech engine supports many languages and dialects, and some languages and regions offer multiple voices.
In preparation for this article, I spent some time playing with the different voices associated with the English (U.S.) option. Some voices, like “Salli,” sound surprisingly real. Others, like “Joey,” sound more robotic. Some voices, such as “Justin” and “Ivy,” sound more like children.
What impressed me the most, however, was the accents. As an American, I didn’t notice the American accent at first. But when I started to experiment with voices from other English-speaking countries, I discovered that Polly can also speak with an Australian or British accent.
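For readers who would rather explore the voices programmatically than click through the GUI, here is a minimal sketch of how the voices for a few English variants might be listed. It assumes the boto3 SDK for Python and that `client` is a Polly client created with something like `boto3.client("polly")`; the function name and the default language codes are my own choices for illustration.

```python
def voices_by_language(client, language_codes=("en-US", "en-GB", "en-AU")):
    """Return {language_code: sorted list of voice IDs} for each code.

    `client` is assumed to be a boto3 Polly client, for example
    boto3.client("polly"). Creating it requires AWS credentials.
    """
    result = {}
    for code in language_codes:
        voice_ids = []
        kwargs = {"LanguageCode": code}
        while True:
            response = client.describe_voices(**kwargs)
            voice_ids.extend(voice["Id"] for voice in response["Voices"])
            # Follow the pagination token if the service returns one.
            token = response.get("NextToken")
            if not token:
                break
            kwargs["NextToken"] = token
        result[code] = sorted(voice_ids)
    return result
```

Calling `voices_by_language(client)` against a real account should return the voice names shown in the GUI for each accent, although the exact list depends on what Amazon offers at the time.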
Although I had a lot of fun (probably too much) with Polly’s voices, Polly is more than a simple speech engine that parrots text input. Developers can integrate Polly-based speech into their applications using a rich API. This is something that was missing from the old Apple II text-to-speech program, which was fun to play with but offered no way to use it outside of the program’s own interface.
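To give a feel for what that integration can look like, here is a minimal sketch using boto3, the AWS SDK for Python. The helper names `make_polly_client` and `synthesize`, the default region and the default voice are assumptions made for this example; the `synthesize_speech` call and its `Text`, `OutputFormat` and `VoiceId` parameters come from the SDK.

```python
def make_polly_client(region_name="us-east-1"):
    """Create a Polly client. Needs boto3 installed and AWS credentials."""
    import boto3  # imported lazily so synthesize() can be tried with a stub
    return boto3.client("polly", region_name=region_name)


def synthesize(client, text, voice_id="Salli"):
    """Ask Polly to speak `text` and return the MP3 audio as bytes."""
    response = client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # AudioStream is a file-like object holding the encoded audio.
    return response["AudioStream"].read()


# Example usage (left commented out because it calls the live service):
# audio = synthesize(make_polly_client(), "Hello from Polly.")
# with open("hello.mp3", "wb") as f:
#     f.write(audio)
```

Passing the client in as an argument is a design choice for the sketch: it keeps the speech logic separate from credential handling and makes it easy to swap in a fake client while testing.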
The Polly API is great for basic speech integration, but there are two things I really like about it. The first is that you can customize Polly’s vocabulary: if Polly doesn’t pronounce a word correctly, you can tell it how to pronounce that word.
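As a rough illustration of vocabulary customization, the sketch below builds a small pronunciation lexicon in the W3C Pronunciation Lexicon Specification (PLS) format, uploads it, and then references it when synthesizing. It assumes `client` is a boto3 Polly client; the function names, the lexicon name `articleDemo` and the example word are hypothetical, and only alias-style entries (replace one spelling with another) are covered.

```python
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Build a PLS document mapping each written word to a spoken alias."""
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(word)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>\n"
        for word, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}</lexicon>\n"
    )


def speak_with_lexicon(client, text, aliases,
                       lexicon_name="articleDemo", voice_id="Salli"):
    """Upload the lexicon, then synthesize `text` with it applied."""
    # Lexicon names are assumed here to be short and alphanumeric.
    client.put_lexicon(Name=lexicon_name, Content=build_lexicon(aliases))
    response = client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        LexiconNames=[lexicon_name],
    )
    return response["AudioStream"].read()
```

For example, `speak_with_lexicon(client, "Visit the W3C site.", {"W3C": "World Wide Web Consortium"})` should make Polly read the abbreviation in full, provided the lexicon language matches the language of the chosen voice.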
Another thing that stands out is Polly’s ability to create MP3 files. You probably noticed the “Download MP3” button in Figure 1, which lets you create short MP3 files that contain Polly speaking just a few lines of text. Beyond that, although you can’t do it through the GUI, there is an option to upload a text file and have Polly convert it into spoken words within an MP3 file. As an author, I am considering using Polly to create audio versions of some of my books.
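One way this text-file-to-MP3 conversion might be scripted is sketched below. It is not necessarily the upload option described above: it simply reads a local text file, splits it into pieces, sends each piece to `synthesize_speech`, and appends the returned audio to one file. The function names and the `max_chars` default of 2500 are assumptions (Polly limits how much text a single request may contain, so check the current limits), and `client` is again assumed to be a boto3 Polly client.

```python
def split_text(text, max_chars=2500):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def text_file_to_mp3(client, text_path, mp3_path,
                     voice_id="Salli", max_chars=2500):
    """Convert a UTF-8 text file to one MP3 file; return the chunk count."""
    with open(text_path, encoding="utf-8") as f:
        chunks = split_text(f.read(), max_chars)
    with open(mp3_path, "wb") as out:
        for chunk in chunks:
            response = client.synthesize_speech(
                Text=chunk, OutputFormat="mp3", VoiceId=voice_id
            )
            # Appending MP3 segments back to back is assumed to play fine.
            out.write(response["AudioStream"].read())
    return len(chunks)
```

Splitting on word boundaries is the simplest approach, but it can cut a sentence in half and produce an odd pause at the seam. For something like an audiobook, splitting on sentence or paragraph boundaries would probably sound more natural.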
This brings us to an important point. Polly seems to be able to do basic text-to-speech conversions very well.