Speech Recognition System Using Raspberry pi 4

Today, we will design a speech recognition system using the Raspberry Pi 4, writing Python code that captures speech and recognizes it.

Posted at: 31 - May - 2022

Category: Raspberry Pi

Author: syedzainnasir

Departments:

Microcontrollers:

Raspberry Pi 4

Codings:

Python

Thank you for joining us for yet another session of this series on Raspberry Pi programming. In the preceding tutorial, we created a Pi-hole ad blocker for our home network using the Raspberry Pi 4. We also learned how to install Pi-hole on the Raspberry Pi 4 and how to access it from other devices. In this tutorial, we will implement a speech recognition system on the Raspberry Pi and use it in a project. First, we will learn the fundamentals of speech recognition, and then we will build a game that the user plays with their voice to discover how it all works with a speech recognition package.

Here, you'll learn:

The basics of speech recognition
Which speech recognition packages are available on PyPI
How to use the SpeechRecognition package and its wide range of useful features

Where To Buy?
No.	Components	Distributor	Link To Buy
1	Raspberry Pi 4	Amazon	Buy Now

Components

Raspberry pi 4
Microphone

A Brief Overview of Speech Recognition

Are you curious about how to incorporate speech recognition into a Python program? Before doing voice recognition in Python, there are a few things you need to know. I'm not going to overwhelm you with the technical specifics, because they would fill an entire book. Modern voice recognition technology has come a long way: today's systems can recognize multiple speakers and have extensive vocabularies in several languages.

The first component of speech recognition is, of course, speech. A mic and an analog-to-digital converter are required to turn speech into an electrical signal and then into digital data. Once the audio has been digitized, various models can be used to convert it to text.

Most modern voice recognition programs rely on Hidden Markov Models (HMMs). They work on the assumption that an audio signal can be reasonably approximated as a stationary process when viewed over a short timescale.

In a conventional HMM, the audio signal is broken into 10-millisecond fragments. The spectrum of each fragment is mapped to a vector of real numbers called cepstral coefficients. The dimension of this vector typically ranges from about 10 to 32, depending on the accuracy of the system. The final output of the HMM is a sequence of these vectors.
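The framing step described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it computes a basic real cepstrum (the inverse transform of the log magnitude spectrum) with a naive DFT, whereas production recognizers normally add windowing and mel filter banks. The function names, the 8 kHz sample rate, the 13 retained coefficients, and the synthetic tone are all choices made for this example.

```python
import cmath
import math


def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a list of samples into consecutive frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def real_cepstrum(frame, num_coeffs=13):
    """Return the first num_coeffs real-cepstrum coefficients of one frame."""
    n = len(frame)
    # Naive DFT: acceptable for an 80-sample frame, too slow for real workloads.
    spectrum = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
                for k in range(n)]
    # Log magnitude spectrum; the small constant avoids log(0).
    log_mag = [math.log(abs(c) + 1e-10) for c in spectrum]
    # The inverse DFT of the log magnitude spectrum is the real cepstrum.
    cepstrum = [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                    for k in range(n)).real / n
                for q in range(n)]
    return cepstrum[:num_coeffs]


# 0.1 seconds of a synthetic 440 Hz tone at 8 kHz (a stand-in for speech).
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(800)]

frames = split_into_frames(signal, SAMPLE_RATE)
features = [real_cepstrum(frame) for frame in frames]
print(len(frames), "frames,", len(features[0]), "coefficients per frame")
```

Each 10-millisecond frame ends up as one short vector, which is the kind of sequence the recognizer works with instead of the raw waveform.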

To decode the speech into text, groups of vectors are matched to one or more phonemes, the fundamental units of speech. Training is required for this calculation because the sound of a phoneme varies from speaker to speaker and even within a single utterance by the same person. A special algorithm then determines the most probable word or words that produce the given phoneme sequence.

As one might expect, this entire process can be computationally costly. Many modern speech recognition programs apply feature transformations and dimensionality reduction before HMM recognition. Voice activity detectors can also be used to reduce an audio signal to only those parts that are likely to contain speech. As a result, the recognizer does not have to waste time analyzing sections of the signal that aren't relevant.
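To make the idea of a voice activity detector more concrete, here is a minimal energy-based sketch. It simply flags frames whose RMS energy exceeds a fixed threshold. Real detectors are considerably more sophisticated, and the function names, frame length, threshold, and synthetic signal below are assumptions chosen for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech_frames(samples, frame_len, threshold):
    """Return one boolean per frame: True when the frame is louder than threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) > threshold)
    return flags


# Synthetic input: near-silence, a louder 200 Hz tone, then near-silence again.
quiet = [0.01] * 160
loud = [0.5 * math.sin(2 * math.pi * 200 * t / 8000) for t in range(160)]
signal = quiet + loud + quiet

flags = detect_speech_frames(signal, frame_len=80, threshold=0.1)
print(flags)
```

Only the frames flagged True would be handed to the recognizer, which is how a detector of this kind saves computation on silent stretches.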

Choosing a Speech Recognition Tool

Several speech recognition packages are available on PyPI.

Some of them go beyond basic speech recognition and use natural language processing to discern a user's intent. Others, such as Google Cloud Speech, focus on speech-to-text conversion alone.

SpeechRecognition stands out as the most user-friendly of these packages.

Speech recognition requires audio input, and SpeechRecognition makes retrieving it a cinch. Rather than requiring you to write your own code for accessing mics and processing audio files, SpeechRecognition will get you up and running in minutes.

Since it wraps a variety of common speech APIs, the SpeechRecognition package is highly flexible. That flexibility and ease of use make the library a good choice for almost any Python project. However, the APIs it wraps may not support every feature, so you'll need to research the available options to find out whether SpeechRecognition will work in your situation.

You've decided to give SpeechRecognition a go, and now you need to install it in your environment.

Speech Recognition Software Installation

You can install SpeechRecognition from the terminal using pip:

$ pip install SpeechRecognition

Once the installation completes, verify it by opening a Python interpreter session and typing:

import speech_recognition as sr
sr.__version__

Leave this session open for now. You'll be using it soon enough.

If you only need to work with existing audio files, SpeechRecognition works straight out of the box. Some use cases have additional prerequisites, though. In particular, the PyAudio package is required to capture audio from a mic.

As you continue reading, you'll discover which components you need. For the time being, let's look at the package's fundamentals.

Recognizer Class

The Recognizer class is at the heart of SpeechRecognition's magic.

Naturally, the fundamental purpose of a Recognizer instance is to recognize speech. Each instance comes with a range of settings and functionality for recognizing speech from an audio source.

Creating a Recognizer instance is straightforward. Just type the following in your active interpreter session:

r = sr.Recognizer()

Each Recognizer instance has seven methods for recognizing speech from an audio source, each using a different API.

Of the seven, only recognize_sphinx() works offline, using the CMU Sphinx engine. The other six require an internet connection.

This tutorial does not cover all of the capabilities and features of every API in detail. SpeechRecognition ships with a default API key for the Google Web Speech API, so you can get up and running with it immediately. For this reason, this tutorial uses the Google Web Speech API extensively. The other six APIs all require authentication, with either an API key or a username and password.

The default API key that SpeechRecognition provides is for testing purposes only, and Google may revoke it at any time. Using the Google Web Speech API in a production setting is not recommended. Even with a valid API key, there is no way to raise the daily request quota. Still, what you learn about using SpeechRecognition today will be straightforward to apply to any of your projects.

Each recognize_*() method raises a RequestError exception if its API is unavailable. For recognize_sphinx(), this could be caused by a faulty Sphinx installation. For the other six methods, a RequestError may be raised if quotas are exceeded, the server is unreachable, or there is no internet connection.

Let us call recognize_google() with no arguments in our interpreter session and see what happens!

What happened?

Most likely, you got a TypeError complaining that a required positional argument, audio_data, is missing.

You could probably have foreseen this. How is it possible to recognize something from nothing?

Every recognize_*() method of the Recognizer class expects an audio_data argument. For each of them, audio_data must be an instance of SpeechRecognition's AudioData class.

There are two ways to create an AudioData instance: from an audio file or from audio recorded with a mic. We'll begin with audio files because they're simpler to work with.

Using Audio Files

To proceed, you must first download and save an audio file. Store it in the same directory in which your Python interpreter session is running.

SpeechRecognition's AudioFile class makes it easy to work with audio files. The class is initialized with the path to an audio file and provides a context manager interface for reading the file's contents.

File Formats that are supported

SpeechRecognition supports several file formats, including:

WAV
AIFF
FLAC

To work with FLAC files, you may need to install a FLAC encoder and make sure that the flac command-line tool is available.

Recording data using the record() Function

To read the "har.wav" file, enter the following commands into your interpreter session:

har = sr.AudioFile('har.wav')
with har as source:
    audio = r.record(source)

The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. The record() method then records the data from the entire file into an AudioData instance. You can verify this by checking the type of audio:

type(audio)

You can now call recognize_google() to try to recognize any speech in the audio. Depending on the speed of your internet connection, you might have to wait a few seconds for the result to appear.

r.recognize_google(audio)

Congratulations! You've just finished your very first audio transcription!

If you're curious, the phrases in the "har.wav" file are examples of Harvard Sentences. The IEEE published these phrases in 1965 for evaluating the speech intelligibility of telephone lines. They are still used in VoIP and telecom testing today.

The Harvard Sentences comprise 72 lists of 10 phrases each. You can find freely available recordings of these phrases on the Open Speech Repository website. Recordings are available in several languages, and they provide an excellent source of free material for testing your code.

Segments with a start and end time

You may want to capture only a portion of the speech in a file. The record() method accepts a duration keyword argument, which stops the recording after the specified number of seconds.

For example, the following captures and transcribes only the first four seconds of the file:

with har as source:
    audio = r.record(source, duration=4)

r.recognize_google(audio)

When record() is called inside a with block, the file stream moves forward. This means that if you record for four seconds and then record for four seconds again, the second call returns the four seconds of audio that follow the first four.

with har as source:
    audio1 = r.record(source, duration=4)
    audio2 = r.record(source, duration=4)

r.recognize_google(audio1)

r.recognize_google(audio2)

Notice that audio2 contains part of the third phrase in the file. When a duration is specified, the recording may stop in the middle of a phrase, or even in the middle of a word, which can hurt the accuracy of the transcription. More on this in a bit.
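The way a file stream advances between successive reads can be illustrated with Python's standard wave module, without SpeechRecognition at all. The sketch below is an analogy, not the library's implementation: it builds a hypothetical 10-second silent WAV in memory, reads four seconds twice, and shows that the second read starts where the first one stopped. The helper names and the 8 kHz sample rate are assumptions for this example.

```python
import io
import wave

SAMPLE_RATE = 8000  # assumed sample rate for this illustration


def make_silent_wav(seconds, sample_rate=SAMPLE_RATE):
    """Build an in-memory mono, 16-bit WAV file containing only silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


def read_seconds(reader, seconds):
    """Read up to `seconds` of audio starting at the reader's current position."""
    return reader.readframes(int(seconds * reader.getframerate()))


with wave.open(make_silent_wav(10), "rb") as reader:
    start1 = reader.tell()            # frame position before the first read
    chunk1 = read_seconds(reader, 4)
    start2 = reader.tell()            # the second read begins here
    chunk2 = read_seconds(reader, 4)
    end2 = reader.tell()

print(start1 / SAMPLE_RATE, start2 / SAMPLE_RATE, end2 / SAMPLE_RATE)
```

The positions are reported in seconds, so the second chunk covers seconds four to eight of the file, just as audio2 does in the example above.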

In addition to a duration, you can pass the offset keyword argument to the record() method. It specifies the number of seconds from the start of the file to skip before recording begins.

with har as source:
    audio = r.record(source, offset=4, duration=3)

r.recognize_google(audio)

The offset and duration keyword arguments are useful for segmenting an audio file if you know the structure of the speech in the file beforehand. However, using them hastily can result in poor transcriptions. To see this effect, try the following in your interpreter:

with har as source:
    audio = r.record(source, offset=4.7, duration=2.8)

r.recognize_google(audio)

Because the recording started at 4.7 seconds, the "it t" at the beginning of the phrase was missed. The API only received "akes heat," which it matched to "Mesquite."

Similarly, at the end of the recording you captured "a co," the beginning of the third phrase, which the API matched to "Aiko."
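It can help to think of offset and duration as a window of frames cut out of the file. The following sketch converts the two values from seconds into start and end frame indices. It is plain arithmetic rather than SpeechRecognition's own code, and the function name, the 8 kHz sample rate, and the 10-second file length are hypothetical values chosen for illustration.

```python
def segment_bounds(offset, duration, sample_rate, total_frames):
    """Convert an offset and duration in seconds to (start, end) frame indices."""
    # Clamp both ends so the window never runs past the end of the file.
    start = min(round(offset * sample_rate), total_frames)
    end = min(start + round(duration * sample_rate), total_frames)
    return start, end


SAMPLE_RATE = 8000                 # hypothetical frames per second
TOTAL_FRAMES = 10 * SAMPLE_RATE    # hypothetical 10-second recording

clean_cut = segment_bounds(4, 3, SAMPLE_RATE, TOTAL_FRAMES)
hasty_cut = segment_bounds(4.7, 2.8, SAMPLE_RATE, TOTAL_FRAMES)
past_end = segment_bounds(9, 4, SAMPLE_RATE, TOTAL_FRAMES)

print(clean_cut, hasty_cut, past_end)
```

Comparing the first two windows shows why the hasty cut fails: it starts 0.7 seconds later and ends half a second later than the clean one, so it clips the start of one phrase and picks up the start of the next.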

There is another reason you may get inaccurate transcriptions: noise! The examples above all worked well because the audio file is reasonably clean. In the real world, you cannot expect noise-free audio unless you have the opportunity to process the audio files beforehand.

Noise Can Affect Speech Recognition.

Noise is an unavoidable part of everyday existence. All audio recordings contain some level of noise, and unhandled noise can wreck the accuracy of speech recognition programs.

To get a feel for how noise can affect speech recognition, use the "jackmer.wav" audio sample. Be sure to save it to the working directory of your interpreter session.

In this file, the phrase "the stale smell of old beer lingers" is spoken with a loud jackhammer in the background.

Try to transcribe this file and see what happens.

jackmer = sr.AudioFile('jackmer.wav')
with jackmer as source:
    audio = r.record(source)

r.recognize_google(audio)

The result is way off!

So, how do you deal with this? One thing you can try is the adjust_for_ambient_noise() method of the Recognizer class.

with jackmer as source:
    r.adjust_for_ambient_noise(source)
    audio = r.record(source)

r.recognize_google(audio)

That gets you closer, but it's still not quite right. In addition, "the" is missing from the beginning of the phrase. Why is that?

The adjust_for_ambient_noise() method reads the first second of the file stream and calibrates the recognizer to the noise level of the audio. As a result, that portion of the stream has already been consumed by the time you call record() to capture the data.

The adjust_for_ambient_noise() method takes a duration keyword argument that sets the time frame used for analysis. The argument is given in seconds and defaults to 1. Try lowering this value to 0.5.

with jackmer as source:
    r.adjust_for_ambient_noise(source, duration=0.5)
    audio = r.record(source)

r.recognize_google(audio)

Well, that got you "the" at the start of the phrase, but now you have some new problems to deal with. Sometimes it isn't possible to remove the effect of the noise because the signal is simply too noisy to handle successfully. That's the case with this file.

If you run into these problems frequently, you may have to resort to some pre-processing of the audio. This can be done with audio editing software, which can apply filters to the files. For the time being, just be aware that ambient noise can cause problems and must be addressed to maximize the accuracy of speech recognition.
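The trade-off behind the duration argument can be sketched with a deliberately simplified calibration routine. This is not how SpeechRecognition implements adjust_for_ambient_noise(); it only illustrates the two effects discussed above: a threshold is estimated from the leading audio, and whatever was used for the estimate is no longer available for recording. The function names, the margin factor, and the synthetic signal are assumptions made for this example.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibrate(samples, sample_rate, duration=1.0, margin=1.5):
    """Estimate an energy threshold from the first `duration` seconds.

    Returns the threshold and the samples left over after calibration.
    `margin` is a hypothetical safety factor, not a library parameter.
    """
    consumed = int(duration * sample_rate)
    threshold = rms(samples[:consumed]) * margin
    return threshold, samples[consumed:]


SAMPLE_RATE = 100                  # tiny rate to keep the example small
recording = [0.2, -0.2] * 150      # three seconds of constant-level "noise"

threshold_1s, rest_1s = calibrate(recording, SAMPLE_RATE)
threshold_half, rest_half = calibrate(recording, SAMPLE_RATE, duration=0.5)

print(len(rest_1s) / SAMPLE_RATE, len(rest_half) / SAMPLE_RATE)
```

A shorter calibration window leaves more of the recording for transcription, at the price of a noise estimate based on less data.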

When working with noisy files, it can be helpful to see the actual API response. Most APIs return a JSON string containing many possible transcriptions. The recognize_google() method always returns the most likely transcription unless you explicitly ask for the full response.

You can do this by setting the show_all keyword argument of recognize_google() to True.

r.recognize_google(audio, show_all=True)

With show_all=True, recognize_google() returns a dictionary whose 'alternative' key holds a list of possible transcripts. The structure of this response varies from API to API, and it is mainly useful for debugging.
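If you want to inspect such a response in code, a small helper along the following lines may be useful. It assumes the response is a dictionary with an 'alternative' list whose entries carry a 'transcript' and, sometimes, a 'confidence' value. The sample response is made up for illustration and is not real API output, and the helper name is hypothetical.

```python
def best_transcript(response):
    """Return the transcript with the highest confidence, or None if there is none.

    Entries without a "confidence" value are treated as confidence 0.0.
    """
    if not isinstance(response, dict):
        return None  # for example, an empty result when nothing was recognized
    alternatives = response.get("alternative", [])
    if not alternatives:
        return None
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best["transcript"]


# Made-up response in the assumed format, for illustration only.
sample_response = {
    "alternative": [
        {"transcript": "the still smell of old beer", "confidence": 0.55},
        {"transcript": "the stale smell of old beer lingers", "confidence": 0.81},
        {"transcript": "the snail smell of old beer"},
    ],
    "final": True,
}

print(best_transcript(sample_response))
```

Looking at all alternatives side by side often makes it easier to see how noise is distorting the recognition.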

As you've seen, the SpeechRecognition package has a lot to offer. Besides gaining experience with the offset and duration keyword arguments, you also learned about the detrimental effect that noise has on transcription accuracy.

Now the fun begins. Instead of transcribing audio clips that require no input from the user, you can make your project interactive by accepting input from a mic.

Using Microphone

To access your mic with SpeechRecognition, you must install the PyAudio package.

Install PyAudio

Use the command below to install PyAudio on the Raspberry Pi:

sudo apt-get install python-pyaudio python3-pyaudio

Confirmation of Successful Setup

You can verify from the console that PyAudio is working properly:

python -m speech_recognition

Make sure your mic is turned on and unmuted. If everything went according to plan, the program asks for a moment of silence, sets its energy threshold, and then prompts you to say something.

Speak into your mic and see how accurately SpeechRecognition transcribes your voice.

Microphone instance

Open another interpreter session and create an instance of the Recognizer class.

import speech_recognition as sr
r = sr.Recognizer()

Instead of an audio file, you'll now use the system mic as your audio source. You can access it by creating an instance of the Microphone class.

mic = sr.Microphone()

On the Raspberry Pi, you must provide a device index to select the mic you want to use. To get a list of available mic names, call the list_microphone_names() static method of the Microphone class.

sr.Microphone.list_microphone_names()

Keep in mind that your output may differ from what is shown in the examples.

The device index of a mic is its position in the list returned by list_microphone_names(). For example, if you wanted to use the mic called "front," which has index 3 in the output, you would create the mic instance like this:

mic = sr.Microphone(device_index=3)
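Rather than reading the index off the list by eye, you could look it up by name. The helper below is a small sketch that works on any list of names, so it can be tried without a mic attached. The function name and the sample device names are hypothetical, and on a real system you would pass in the result of sr.Microphone.list_microphone_names() instead.

```python
def find_device_index(names, keyword):
    """Return the index of the first device whose name contains keyword, or None."""
    keyword = keyword.lower()
    for index, name in enumerate(names):
        # Skip empty entries in case a device reports no name.
        if name and keyword in name.lower():
            return index
    return None


# Hypothetical device list, for illustration only.
sample_names = [
    "HDA Intel PCH: Analog (hw:0,0)",
    "sysdefault",
    "default",
    "front",
    "surround40",
]

front_index = find_device_index(sample_names, "front")
print(front_index)
```

The returned value can then be passed as the device_index argument when creating the Microphone instance. Checking for None first avoids selecting a device that does not exist.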

Use listen() to record the audio from the mic

Now that you have a Microphone instance ready, it's time to capture some input.

Just like the AudioFile class, Microphone is a context manager. You can capture input from the mic using the listen() method of the Recognizer class inside the with block. This method takes an audio source as its first argument and records input from the source until silence is detected.

with mic as source:
    audio = r.listen(source)

Once you execute the with block, try saying "hi" into your mic. Wait a moment for the interpreter prompt to reappear. Once the ">>>" prompt returns, you're ready to recognize the speech.

r.recognize_google(audio)

If the prompt never returns, your mic is probably picking up too much ambient noise. You can interrupt the process with Ctrl+C to get your prompt back.

To handle ambient noise, you need to use the adjust_for_ambient_noise() method of the Recognizer class, just as you did when trying to make sense of the noisy audio file. Since mic input is far more unpredictable than an audio file, it is a good idea to do this every time you listen for mic input.

with mic as source:
    r.adjust_for_ambient_noise(source)
    audio = r.listen(source)

After running the above code, wait a second for adjust_for_ambient_noise() to finish, and then try speaking "hello" into the mic. Again, wait a moment for the interpreter prompt to return before trying to recognize the speech.

Recall that adjust_for_ambient_noise() analyzes the audio source for one second. If this seems too long, you can shorten it with the duration keyword argument.

The SpeechRecognition documentation recommends a duration of no less than 0.5 seconds. In some cases, longer durations work better. The lower the ambient noise, the lower the value you need. Unfortunately, the noise level is usually not known during development. In my experience, the default duration of one second is adequate for most applications.

How to handle speech that isn't recognizable?

In your interpreter, run the code snippet above again and mutter something unintelligible into the mic. You should get an exception in response.

An UnknownValueError exception is raised whenever the API cannot transcribe the speech into text. You should always wrap calls to the API in try and except blocks to handle this exception.

Triggering the exception may take more effort than you imagine. The API works very hard to transcribe any vocal sounds. For me, even short grunts were transcribed as words like "how." A cough, a hand clap, or a tongue click would consistently raise an exception.

A "Guess the Word" game to Put everything together

To put what you've learned about the SpeechRecognition library into practice, let's develop a simple game that randomly selects a word from a list and gives the player three tries to guess it.

Here is the full script:

import random
import time

import speech_recognition as sr


def recognize_speech_from_mic(recognizer, microphone):
    """Transcribe speech recorded from `microphone`.

    Returns a dictionary with three keys:
    "success": False if the API request failed, otherwise True
    "error": None, or a message describing what went wrong
    "transcription": None, or the transcribed text
    """
    # check that the arguments are of the expected types
    if not isinstance(recognizer, sr.Recognizer):
        raise TypeError("`recognizer` must be `Recognizer` instance")

    if not isinstance(microphone, sr.Microphone):
        raise TypeError("`microphone` must be `Microphone` instance")

    # calibrate for ambient noise, then record from the mic
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    # set up the response object
    response = {
        "success": True,
        "error": None,
        "transcription": None
    }

    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # the API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API unavailable"
    except sr.UnknownValueError:
        # the speech was unintelligible
        response["error"] = "Unable to recognize speech"

    return response


if __name__ == "__main__":
    # word list, number of guesses, and prompts allowed per guess
    WORDS = ["apple", "banana", "grape", "orange", "mango", "lemon"]
    NUM_GUESSES = 3
    PROMPT_LIMIT = 5

    recognizer = sr.Recognizer()
    microphone = sr.Microphone()

    # pick the word the player has to guess
    word = random.choice(WORDS)

    instructions = (
        "I'm thinking of one of these words:\n"
        "{words}\n"
        "You have {n} tries to guess which one.\n"
    ).format(words=', '.join(WORDS), n=NUM_GUESSES)

    # show the instructions and wait three seconds before starting
    print(instructions)
    time.sleep(3)

    for i in range(NUM_GUESSES):
        # prompt up to PROMPT_LIMIT times until a transcription is returned
        # or the API request fails
        for j in range(PROMPT_LIMIT):
            print('Guess {}. Speak!'.format(i+1))
            guess = recognize_speech_from_mic(recognizer, microphone)
            if guess["transcription"]:
                break
            if not guess["success"]:
                break
            print("I didn't catch that. What did you say?\n")

        # stop the game if there was an error
        if guess["error"]:
            print("ERROR: {}".format(guess["error"]))
            break

        print("You said: {}".format(guess["transcription"]))

        # compare the guess with the chosen word, ignoring case
        guess_is_correct = guess["transcription"].lower() == word.lower()
        user_has_more_attempts = i < NUM_GUESSES - 1

        if guess_is_correct:
            print("Correct! You win!")
            break
        elif user_has_more_attempts:
            print("Incorrect. Try again.\n")
        else:
            print("Sorry, you lose!\nI was thinking of '{}'.".format(word))
            break

Let's break this down a little.

The recognize_speech_from_mic() function takes a Recognizer and a Microphone instance as arguments and returns a dictionary with three keys. The first key, "success," is a boolean that indicates whether the API request succeeded. The second key, "error," is either None or an error message indicating that the API is unavailable or that the speech was unintelligible. Finally, the "transcription" key contains the transcription of the audio recorded by the mic.

The function first checks that the recognizer and microphone arguments are of the correct type and raises a TypeError if either is invalid.

The listen() method is then used to record the mic's input.

The adjust_for_ambient_noise() method re-calibrates the recognizer every time recognize_speech_from_mic() is called.

Next, recognize_google() is invoked to transcribe any speech in the recording. A try and except block catches the RequestError and UnknownValueError exceptions and handles them accordingly. The success of the API request, any error message, and the transcribed speech are stored in the "success," "error," and "transcription" keys of the response dictionary, which recognize_speech_from_mic() returns.

To see if the function works as expected, save the script as guessing_game.py and run the following code in an interpreter session:

import speech_recognition as sr
from guessing_game import recognize_speech_from_mic

r = sr.Recognizer()
m = sr.Microphone()

recognize_speech_from_mic(r, m)

The game itself is quite simple. First, a list of words, a maximum number of allowed guesses, and a prompt limit are declared.

Next, the Recognizer and Microphone instances are created, and a random word is chosen from WORDS.

After the instructions are printed, a for loop manages each of the user's attempts at guessing the chosen word. The first thing inside this outer loop is another for loop, which prompts the user at most PROMPT_LIMIT times for a guess, attempting to recognize the input each time and storing the returned dictionary in the variable guess.

If the "transcription" key of guess is not None, the user's speech was transcribed, and the inner loop ends with a break. If the speech was not transcribed and the "success" key is False, an API error occurred, and the loop is again ended with a break. Otherwise, the API request succeeded but the speech was unintelligible. In that case, the user is warned and the inner loop repeats, giving them another chance at the current attempt.

Once the inner loop ends, the guess dictionary is checked for errors. If there are any, the error message is printed, and a break exits the outer for loop, which ends the program.

If there are no errors, the transcription is compared with the randomly chosen word. The lower() method of string objects is used to make the comparison more robust. This way, it doesn't matter whether the API returns "Apple" or "apple" for the spoken word "apple."

If the user's guess was correct, the game ends and they have won. If the guess was incorrect and the user has attempts left, the outer for loop repeats and a new guess is retrieved. Otherwise, the user loses the game.
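The decision made at the end of each attempt can be isolated into a small pure function, which makes it easy to try out without a mic or an internet connection. This is a sketch of the same logic rather than part of the original script. The function name and the returned labels are hypothetical, and the strip() call is an extra normalization step that the script itself does not perform.

```python
def evaluate_guess(transcription, word, attempt, num_guesses):
    """Classify one attempt as "win", "retry", or "lose".

    `attempt` is zero-based, like the loop variable i in the game script.
    """
    # Case-insensitive comparison, ignoring surrounding whitespace.
    if transcription.strip().lower() == word.lower():
        return "win"
    if attempt < num_guesses - 1:
        return "retry"
    return "lose"


outcomes = [
    evaluate_guess("Apple", "apple", attempt=0, num_guesses=3),
    evaluate_guess("banana", "apple", attempt=1, num_guesses=3),
    evaluate_guess("banana", "apple", attempt=2, num_guesses=3),
]
print(outcomes)
```

Keeping this decision separate from the audio handling means the game rules can be checked independently of the speech recognition.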

This is what you'll get when you run the program:

Recognition of Other Languages

So far, we have recognized speech in English, which is the default language of the recognize_google() method. Recognizing speech in other languages is entirely doable and incredibly simple.

To recognize speech in a language other than English, set the language keyword argument of the recognize_*() method to a string that identifies the desired language.

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile('path/to/audiofile.wav') as source:
    audio = r.record(source)

r.recognize_google(audio, language='fr-FR')

Only some of the recognize_*() methods accept a language keyword argument.

What are the applications of speech recognition software?

Mobile Payment with Voice command

Have you ever wondered how you'll pay for purchases in the future? Has it occurred to you that you may one day pay for goods and services simply by speaking? There's a good chance that will happen soon! Several companies are already developing voice commands for money transfers.

Such a system lets you speak a one-time passcode instead of typing one before buying a product. Think of it as the spoken counterpart of captchas and other one-time passwords used for online security. This is a considerably better option than reusing the same password every time. Soon, voice-activated mobile banking may be widely used.

AI Assistants

While driving, you can use intelligent assistants to get directions, perform a Google search, start a playlist, or even turn on the lights in your home without touching your device. By default, these digital assistants are programmed to respond to any voice that activates them, regardless of the user.

Newer technologies enable AI assistants to recognize individual users, so that an assistant responds exclusively to the voice of a particular person. The iPhone is one example, and the feature has been around for a few years now: you can set up Siri to respond only when you are the one speaking. Unauthorized access to your devices, information, and property is far less likely when only your voice can activate your AI assistant. Anyone who is not permitted to use the assistant will not be able to activate it. Other uses for this technology are almost certainly on the horizon.

Translation Application

Imagine trying to check into a hotel in an unfamiliar, faraway place. Neither you nor the front desk employee speaks the other's language, and no one is available to act as a translator. With a translator device, you can speak into the microphone and have your speech processed and translated, as audio or as text, so that you can communicate with the other person.

This technology can also benefit multinational enterprises, educational institutions, and other organizations. It lets you have a productive conversation with anyone who doesn't speak your language, which helps break down language barriers.

Conclusion

In this tutorial, you learned how to install the SpeechRecognition package and use its Recognizer class to recognize speech from both audio files and the mic. You also learned how to use the offset and duration keyword arguments of the record() method to extract segments from an audio file.

You've seen how the adjust_for_ambient_noise() method calibrates the recognizer's tolerance to noise. You've also learned which exceptions a Recognizer instance can raise, namely RequestError and UnknownValueError, and how to handle them with try and except blocks.

There is more to learn about speech recognition than what you've just read. In our upcoming tutorial, we will integrate an RTC module to enable real-time control.
