Pronouncing Things with Amazon's Polly - Cuttlesoft, Custom Software Developers

I made a cool thing last week that I wanted to share.

I've been studying for my AWS Certified Solutions Architect exam and going through the corresponding A Cloud Guru course. The course has a series of labs using Polly, Amazon's text-to-speech service; these labs inspired me to build something with Polly for my own use.

What I Built

I spend a lot of time on Wikipedia, where I often encounter words I don’t know how to pronounce, such as the names of various animal genera.

Wikipedia usually spells these phonetically in International Phonetic Alphabet (IPA) notation, which looks like /ˈpɪdʒ.ən/ for the word "pigeon." Wikipedia also links the notation to its IPA help page and, if you hover over each character, provides a simplified per-character pronunciation guide.

What I wanted was something to read me the pronunciation aloud without me having to comb through interesting but complex charts — sometimes I just want to know how to pronounce the scientific name for whiskers (spoiler: it's /vaɪˈbrɪsi/).

So that's what I built.

How I Built It

I set up a Lambda function (triggered by an incoming POST request to API Gateway) to take the given IPA notation and voice selection, send them to the Polly service to be translated into speech, then handle the returned audio stream. Initially, this meant saving the audio as a file on S3; later, I decided to just return the Base64-encoded audio directly.

Lambda + Polly

My first step was to create the following IAM policy, which allows a Lambda function to use Polly’s speech synthesis feature, and then to create an IAM role to attach the policy to.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "polly:SynthesizeSpeech"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
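
The policy only grants permissions; the role it's attached to also needs a trust policy so that the Lambda service is allowed to assume it. Here's a rough Boto 3 sketch of that setup, in case you'd rather script it than click through the console. The policy and role names are hypothetical, and `iam` is assumed to be a `boto3.client("iam")`.

```python
import json

# Permissions policy from the post: allow Polly speech synthesis.
POLLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["polly:SynthesizeSpeech"], "Resource": ["*"]}
    ],
}

# Trust policy: lets the Lambda service assume the role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_polly_role(iam, policy_name="polly-synthesize", role_name="polly-lambda-role"):
    """Create the policy and the role, attach them, and return the role ARN.

    `iam` is assumed to be boto3.client("iam"); both default names are hypothetical.
    """
    policy = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(POLLY_POLICY),
    )
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(LAMBDA_TRUST_POLICY),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy["Policy"]["Arn"])
    return role["Role"]["Arn"]
```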

I then created the Lambda function and assigned it the new role I created. The full code for the Lambda function (using Python 3) is below, but basically it:

Initializes the Boto 3 Polly client
Gets the IPA notation and selected Polly voice from the request
Wraps the text with some structure so that Polly will pronounce it, not read it
Sends the text to Polly to convert to speech
Encodes and returns the audio it gets back from Polly
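
To make those steps concrete, here's a minimal sketch of what such a handler could look like. It isn't the original function: the request field names (`ipa`, `voice`), the response shape, and the default voice are all assumptions, and it expects API Gateway to pass the parsed JSON body straight through as `event`.

```python
import base64


def synthesize_ipa(polly, ipa, voice):
    """Ask Polly to pronounce IPA notation and return the MP3 bytes."""
    # Note: `ipa` is interpolated as-is here; real input should be XML-escaped.
    ssml = f'<speak><phoneme alphabet="ipa" ph="{ipa}"></phoneme></speak>'
    response = polly.synthesize_speech(
        OutputFormat="mp3",
        Text=ssml,
        TextType="ssml",
        VoiceId=voice,
    )
    return response["AudioStream"].read()


def lambda_handler(event, context, polly=None):
    # Create the client lazily so the sketch can be exercised with a stub.
    if polly is None:
        import boto3
        polly = boto3.client("polly")

    # Assumed request shape: {"ipa": "...", "voice": "..."}.
    ipa = event["ipa"].strip().strip("/")
    voice = event.get("voice", "Joanna")  # default voice is an assumption

    audio = synthesize_ipa(polly, ipa, voice)
    # Base64 keeps the binary audio safe inside a JSON response.
    return {"audio": base64.b64encode(audio).decode("ascii")}
```

Passing the client in as an optional argument is just a convenience for trying the handler locally with a fake Polly; on Lambda it would be called with only `event` and `context`.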

The only real configuration I had to do here was to get Polly to read the text as IPA notation rather than as regular text.

Polly reading /ˈkʌ.təl.fɪʃ/ as plain text instead of the IPA notation for pronouncing "cuttlefish."

One of the arguments taken by the synthesize_speech method is TextType, which accepts either "text" or "ssml" as its value. The default value is "text" and will result in Polly reading the text as you would expect a person to read it. The "ssml" option, however, allows the use of supported Speech Synthesis Markup Language (SSML) tags to control how Polly generates speech from text. For reading IPA notation, the <phoneme> tag did exactly what I was looking for: "ipa" goes in the alphabet attribute, and the IPA notation to read goes in the ph (phonetic symbols for pronunciation) attribute.

<phoneme alphabet="ipa" ph="ˈkʌ.təl.fɪʃ"></phoneme>

Polly reading /ˈkʌ.təl.fɪʃ/ as the IPA notation for pronouncing "cuttlefish," using the code above.
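
When sending SSML to Polly, the whole document has to sit inside a `<speak>` element, so the `<phoneme>` tag gets wrapped before it goes out. Here's a small sketch of building that string from user input. The `build_ssml` helper is a hypothetical name of mine; it also strips the surrounding slashes and escapes characters that would otherwise break the markup.

```python
from xml.sax.saxutils import escape


def build_ssml(ipa):
    """Wrap IPA notation in the SSML that tells Polly to pronounce it."""
    # Drop whitespace and the /slashes/ Wikipedia puts around IPA notation.
    notation = ipa.strip().strip("/")
    # Escape &, <, > and double quotes so the attribute stays well-formed.
    ph = escape(notation, {'"': "&quot;"})
    return f'<speak><phoneme alphabet="ipa" ph="{ph}"></phoneme></speak>'


print(build_ssml("/ˈkʌ.təl.fɪʃ/"))
```

The resulting string is what would be passed as Text, with TextType set to "ssml".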

Originally, I set the Lambda function up to save the audio returned from Polly as an MP3 file in a bucket on S3, then to check whether the audio already existed before sending the text to Polly. I eventually decided to just Base64 encode the audio and return it directly, skipping the S3 step.
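
For the curious, the check-then-upload idea could look roughly like the sketch below. This is not the linked implementation: the key scheme, the bucket handling, and the `synthesize` callable are assumptions of mine, and `s3` is expected to be a `boto3.client("s3")`. It also assumes the role is allowed to list the bucket; without that permission, S3 reports a missing key as a 403 rather than a 404.

```python
import hashlib


def cache_key(ipa, voice):
    """Derive a stable object key from the voice and the IPA notation."""
    digest = hashlib.sha256(f"{voice}:{ipa}".encode("utf-8")).hexdigest()
    return f"audio/{digest}.mp3"  # the "audio/" prefix is a hypothetical choice


def audio_exists(s3, bucket, key):
    """Return True if the object is already in the bucket."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except s3.exceptions.ClientError as error:
        if error.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # anything else (e.g. access denied) is a real problem


def get_or_create_audio(s3, bucket, ipa, voice, synthesize):
    """Upload the audio only if it isn't cached yet; return its key.

    `synthesize(ipa, voice)` is assumed to return MP3 bytes from Polly.
    """
    key = cache_key(ipa, voice)
    if not audio_exists(s3, bucket, key):
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=synthesize(ipa, voice),
            ContentType="audio/mpeg",
        )
    return key
```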

If you're interested in the implementation with the S3 upload intact, you can check it out here. (Don't forget to update your IAM policy to let Lambda access S3, too.)

API Gateway

Once I created the Lambda function, I needed to create the trigger for it. For this, I created a new API with the API Gateway service. The API itself only took a few steps to configure:

Add POST method ("Create Method" in the "Actions" menu, select "POST", and confirm)
Set endpoint "Integration type" as "Lambda function"
Select newly-created Lambda function in "Lambda Function" field and save
Enable CORS ("Enable CORS" in the "Actions" menu)
Deploy API ("Deploy API" in the "Actions" menu)

I then grabbed the resulting invoke URL for my static site to POST to, and that was it for API Gateway setup.
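
For a quick check of the deployed endpoint without the web page, something like the following sketch should work. The invoke URL is a placeholder, and the request and response field names (`ipa`, `voice`, `audio`) are assumptions; adjust them to whatever your Lambda function actually expects and returns.

```python
import base64
import json
import urllib.request

# Placeholder; use the invoke URL shown after deploying your own API.
INVOKE_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod"


def build_request(ipa, voice, url=INVOKE_URL):
    """Build the JSON POST request for the pronunciation API."""
    body = json.dumps({"ipa": ipa, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_audio(response_body):
    """Pull the Base64-encoded audio out of the JSON response."""
    return base64.b64decode(json.loads(response_body)["audio"])


def pronounce(ipa, voice, out_path="pronunciation.mp3"):
    """POST the notation, then save the returned audio as an MP3 file."""
    with urllib.request.urlopen(build_request(ipa, voice)) as response:
        audio = decode_audio(response.read())
    with open(out_path, "wb") as mp3:
        mp3.write(audio)
    return out_path
```

Calling `pronounce("ˈpɪdʒ.ən", "Joanna")` would then write the audio to a local file you can open in any player.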

S3 Static Site

The web page I created for interacting with the Lambda function and Polly is perhaps the least interesting part of the process, but it also uses AWS services: it's hosted as a static website in an S3 bucket. The page itself is just some HTML for structure, some JavaScript to POST the submitted form to the Lambda API and to present the audio player when the Polly audio comes back, and some CSS for fun.

/kənˈkluːʒən/

And that's it! The whole process was surprisingly simple and a lot of fun.

I've already been using the result myself, but give it a try and let me know what you think in the comments!
