machine-learning

flask

opencv

mediapipe

Create a simple hand sign recognition Flask app

All about how to create a simple hand sign recognition Flask app using Mediapipe and OpenCV, and how to use it to control and give commands in games

Kushagra Agarwal (kushagra)


In this blog, I will be discussing how to create a Flask app that can recognize the hand signs you make and perform the corresponding actions. I will be using the Google Mediapipe and OpenCV libraries to implement a machine learning hand sign recognition model. So, by the end of this blog, you will be able to make your own Flask app that implements hand sign recognition in real time.

In this blog, we'll be going over

  1. How to set up your Flask app
  2. Creating your own hand sign recognition model
  3. Setting up webcam streaming in Flask

Let’s get right into the first step!

How to set up your Flask app

Flask is a Python-based framework, so first you have to install Python in order to get Flask up and running. You can do that by going to python.org and downloading the latest version of Python; for this blog I have used Python 3.8.

Now that you have Python up and running, the next step is to install Flask. You can do that simply by going to your terminal or cmd and typing in the command

pip install Flask

This command will install Flask on your system. Now that you have Flask, let's learn to create a simple Flask app that prints "Hello World". I will be using VS Code as my code editor for this blog, but feel free to use any editor you want.

Create a folder for your project; I created a folder named HelloWorld.

For setting up a Flask app you need to create the following files and folders:

  1. app.py
  2. static folder
  3. templates folder

So, your app.py file basically contains your backend code, such as the routes, the functions, or the computation part. The static folder will contain your CSS, JavaScript, images, or any static file you want for your app. The templates folder will contain all your HTML pages.

So, your directory structure looks somewhat like this:
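
HelloWorld/
├── app.py
├── static/
└── templates/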

You can also refer to the Flask documentation for making a minimal Flask application.

Now, go to your app.py file and write the following code out

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"
    
if __name__=='__main__':
    app.run(debug=True)

In the above code we imported Flask, created a Flask app instance with app=Flask(__name__), and finally added a route for / (the home page); this route returns the response Hello, World!

Now go ahead and run your app.py file

python3 app.py

Congratulations!!! You have successfully created your first Flask app.

Creating your own hand sign recognition model

Now that you know how to create a basic Flask app, let's start with our hand sign recognition model so that you can implement it in Flask.

Prerequisites

  • You should have basic knowledge of OpenCV
  • Python Programming
  • Basics of NumPy, Pandas

What are we going to discuss

  1. How to capture live feed from your webcam using OpenCV
  2. How to manipulate captured video frames
  3. How to apply mediapipe hand-tracking model
  4. How to analyze the data you get from the hand-tracking model and calculate hand signs using it

How to capture live feed from your webcam using OpenCV

Okay, first let's create a Python file and import the OpenCV library. In case you don't have OpenCV installed, you can install it using the following command

pip install opencv-python

Now that we have OpenCV set up, let's understand how we can use it to get the live feed from our webcam. Go ahead and type out the following code

import cv2 as cv

vid = cv.VideoCapture(0)

So, this will basically create a VideoCapture instance vid. In VideoCapture() you can pass a video file path, a device index, a video stream URL, or a sequence of images. VideoCapture stores the video stream frame by frame in the vid instance. These frames are then decoded into NumPy arrays, and those arrays can be manipulated to do all kinds of cool stuff like color transformation, pixel rendering, object detection, and much more. You can refer to the OpenCV documentation if you want to take a deep dive into the world of image processing; for this blog we will just stick to our aim, i.e., getting the feed from our webcam using OpenCV. As you can see, I have passed 0 as the VideoCapture parameter here, which means it will take the feed from my internal webcam. In case you want to access an external webcam, you have to pass 1, 2, 3, and so on.
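
For instance, here's a minimal sketch (not part of the final app) that reads a single frame and prints the shape and dtype of the decoded NumPy array, just to show what a frame really is:

import cv2 as cv

vid = cv.VideoCapture(0)

ret, frame = vid.read()            # grab and decode one frame
if ret:
    # each frame is a NumPy array of shape (height, width, 3) holding BGR pixel values
    print(frame.shape, frame.dtype)    # e.g. (480, 640, 3) uint8

vid.release()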

Now that you have the camera feed, let's decode the feed and display it using the OpenCV imshow function. Go ahead and type out the following code

import cv2 as cv

vid=cv.VideoCapture(0)  

print(vid.isOpened())

vid.set(cv.CAP_PROP_FPS,60)  #set the video FPS to 60

while vid.grab():  
    
    state,frame=vid.read()  
    cv.imshow("video",frame)   
    
    if cv.waitKey(16) & 0xFF == ord('q'):
        break

vid.release()
print(vid.isOpened())  #value becomes false as vid is released
cv.destroyAllWindows()

Description of each function in brief

  1. vid.set — the set method is used to set properties of the video feed we captured; check the list of settable properties in the official documentation. Here I have used set to set the FPS of the video to 60 (see the short example after this list)
  2. vid.grab — the grab method advances to the next video frame and returns true if a frame was grabbed and false otherwise
  3. vid.isOpened — the isOpened method returns true if the video capture was opened successfully
  4. vid.read — the read method returns the decoded frame and true if a frame is present; if no frame is present it returns false
  5. cv.imshow — this method displays the output in a display window; the parameters passed are the name of the window and a mat object, i.e., the decoded frame
  6. cv.waitKey — this method is used to hold the frame; the time is passed in milliseconds, and if 0 is passed the frame is held until a key is pressed
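
As a quick illustration (a small sketch of my own, not part of the main code), you can read these properties back with the get method to confirm whether the camera actually accepted them, since not every webcam supports every property:

import cv2 as cv

vid = cv.VideoCapture(0)

vid.set(cv.CAP_PROP_FPS, 60)              # request 60 FPS
vid.set(cv.CAP_PROP_FRAME_WIDTH, 640)     # request a 640x480 resolution
vid.set(cv.CAP_PROP_FRAME_HEIGHT, 480)

# get returns the value the camera is actually using
print(vid.get(cv.CAP_PROP_FPS))
print(vid.get(cv.CAP_PROP_FRAME_WIDTH), vid.get(cv.CAP_PROP_FRAME_HEIGHT))

vid.release()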

How to manipulate captured video frames

We have successfully captured our video using OpenCV; now the next thing is how we can manipulate the video frames that we just captured. So, as discussed above, an image/frame is decoded as a NumPy array (mat object) by the read method in OpenCV. So, in order to manipulate the video, we actually have to manipulate the NumPy array, which in turn transforms the video.

OpenCV mat object structure

So, as you can see from the above diagram, in a mat object each pixel is a vector containing 3 values: B, G, R. It's clear that if we change the values of this vector we can perform color transformations on the image, and if we delete some of these values we can remove pixels from the image. This is a very basic overview of image processing. Now that you know how images are manipulated in OpenCV, let's write some code and test it.

import cv2 as cv

img = cv.imread('../../images/ninetail.jpg')

imgGray=cv.cvtColor(img,cv.COLOR_BGR2GRAY)

img[:,:,1],img[:,:,0]=0,0

cv.imshow("gray",imgGray)
cv.imshow("red",img)

cv.waitKey(0)

So, here I have used the cvtColor method to convert the image to a grayscale image. You can also use simple array manipulation, like img[:,:,1],img[:,:,0]=0,0, to get a red-only image; here we are simply setting the B and G channels of the image to 0 using Python slicing.

result

You can try a bunch of other image processing functions; check this out.
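
For example, here are a few common ones you could experiment with (a quick sketch, reusing the img variable loaded above):

resized = cv.resize(img, (300, 300))           # resize the image to 300x300 pixels
blurred = cv.GaussianBlur(img, (15, 15), 0)    # blur with a 15x15 kernel
edges = cv.Canny(img, 100, 200)                # edge detection
flipped = cv.flip(img, 1)                      # mirror the image horizontally

cv.imshow("edges", edges)
cv.waitKey(0)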

How to apply mediapipe hand-tracking model

Okay, so we have covered all the major concepts; now it's time to have some fun with what we have learned. We will be learning how to use the Google Mediapipe library in our project to implement hand detection on the frames captured using OpenCV. First you have to install the mediapipe library; for that, use the following command:

pip install mediapipe

Understanding Hand landmarks

So, what the Mediapipe hand-tracking model basically does is take an image/frame as a parameter and check if there is any hand present in the frame. If a hand is detected, it returns the coordinates of the hand landmarks shown in the figure.

An example of the data returned for the above image is:

[[0, 572, 66], [1, 547, 105], [2, 514, 117], [3, 488, 116], [4, 464, 113], [5, 497, 130], [6, 451, 144], [7, 427, 141], [8, 413, 134], [9, 489, 107], [10, 433, 107], [11, 423, 96], [12, 423, 87], [13, 482, 82], [14, 432, 82], [15, 437, 76], [16, 445, 73], [17, 478, 57], [18, 443, 61], [19, 449, 63], [20, 460, 63]]

Here we can see that a total of 21 one-dimensional arrays are returned, each containing 3 elements: the first element is the landmark number, the second is the x coordinate of the landmark, and the third is the y coordinate.
Now that you have a basic understanding of hand landmarks, go ahead and try out the following code

import mediapipe as mp
import cv2 as cv
import numpy as np


mp_drawing=mp.solutions.drawing_utils
mp_hands= mp.solutions.hands

cap=cv.VideoCapture(0)


with mp_hands.Hands(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as hands:

    while cap.grab():
        ret,frame=cap.read()
        cv.imshow('hand-tracking',frame)
        
        image = cv.cvtColor(cv.flip(frame, 1), cv.COLOR_BGR2RGB)

        image.flags.writeable= False
        results=hands.process(image)
        image.flags.writeable= True
        print(results.multi_hand_landmarks)  # None if no hand is detected
        cv.imshow('hand-tracking-show',image)
        if cv.waitKey(5) & 0xFF == 27:
            break
        

cap.release()
cv.destroyAllWindows()

Here we create two instances from mediapipe: mp_drawing=mp.solutions.drawing_utils, which will be used to draw an outline over the detected object, and mp_hands= mp.solutions.hands, which we will be using to track our hand. I have also created a VideoCapture instance cap that will store my video frames. As you can see, I have converted the BGR frame to RGB; that is because Mediapipe image processing is done on RGB images. With mp_hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5) you can pass parameters like the tracking confidence and the maximum number of hands you want to detect, and you finally get the coordinates of the landmarks by calling the process method on Hands.
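
For reference, here's roughly how those parameters look when passed with their keyword names (a small sketch; rgb_image stands for the RGB frame produced in the loop above, and max_num_hands is the parameter that limits how many hands are detected):

with mp_hands.Hands(
        static_image_mode=False,       # treat the input as a video stream, not separate photos
        max_num_hands=1,               # track at most one hand
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:

    results = hands.process(rgb_image)              # rgb_image: an RGB frame
    if results.multi_hand_landmarks:
        print(results.multi_hand_landmarks[0])      # landmarks of the first detected hand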

Now that we are able to detect our hand, let's draw over the detected region. For that, we use the mp_drawing instance.

Try out the following code

import mediapipe as mp
import cv2 as cv
import numpy as np


mp_drawing=mp.solutions.drawing_utils
mp_hands= mp.solutions.hands

cap=cv.VideoCapture(0)


with mp_hands.Hands(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as hands:

    while cap.grab():
        ret,frame=cap.read()
        cv.imshow('hand-tracking',frame)
        
        image = cv.cvtColor(cv.flip(frame, 1), cv.COLOR_BGR2RGB)

        image.flags.writeable= False
        results=hands.process(image)
        image.flags.writeable= True
        
        image=cv.cvtColor(image, cv.COLOR_RGB2BGR)
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(
                    image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
        cv.imshow('hand-tracking-show',image)
        if cv.waitKey(5) & 0xFF == 27:
            break
        

cap.release()
cv.destroyAllWindows()

In this code we simply iterate over our results variable; results.multi_hand_landmarks yields the landmarks of a single hand in each iteration if multiple hands are present in the frame. mp_drawing.draw_landmarks is used to draw circles on each landmark and can also draw the connections between the landmarks.

So, that's how easy it is to get your hand detection model up and running using the Mediapipe library.

How to analyze the data you get from the hand-tracking model and calculate hand signs using it

Now that we have our hand detection model working, let's see how we can use the data we get from it to predict hand signs.

Try out the following code

import cv2
import mediapipe as mp
import numpy as np
import time


class handDetector():
    def __init__(self, mode=False, maxHands=1, detectionCon=0.5, trackCon=0.5):
        self.mode = mode
        self.maxHands = maxHands
        self.detectionCon = detectionCon
        self.trackCon = trackCon

        self.mpHands = mp.solutions.hands
        self.hands = self.mpHands.Hands(self.mode, self.maxHands,
                                        self.detectionCon, self.trackCon)
        self.mpDraw = mp.solutions.drawing_utils

    def findHands(self, img, draw=True):
        imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        self.results = self.hands.process(imgRGB)
        # print(results.multi_hand_landmarks)

        if self.results.multi_hand_landmarks:
            for handLms in self.results.multi_hand_landmarks:
                if draw:
                    self.mpDraw.draw_landmarks(img, handLms,
                                               self.mpHands.HAND_CONNECTIONS,
                                               self.mpDraw.DrawingSpec(color=(255,206,85)),
                                               self.mpDraw.DrawingSpec(color=(240,171,0))
                                               )
        return img

    def findPosition(self, img, handNo=0, draw=True):

        lmList = []
        if self.results.multi_hand_landmarks:
            myHand = self.results.multi_hand_landmarks[handNo]
            for id, lm in enumerate(myHand.landmark):
                # print(id, lm)
                h, w, c = img.shape
                cx, cy = int(lm.x * w), int(lm.y * h)
                # print(id, cx, cy)
                lmList.append([id, cx, cy])
                font = cv2.FONT_HERSHEY_SIMPLEX
                if id==12:
                    cv2.putText(img,'12',(cx,cy), font, .5,(0,0,255),2,cv2.LINE_AA)
                elif id==11:
                    cv2.putText(img,'11',(cx,cy), font, .5,(0,0,255),2,cv2.LINE_AA)
        return lmList



cap = cv2.VideoCapture(0)
detector = handDetector()
handsign='none'
while True:
    success, img = cap.read()
    img = cv2.flip(img, 1)
    img = detector.findHands(img)
    font = cv2.FONT_HERSHEY_SIMPLEX
    lmList = detector.findPosition(img)
    if len(lmList) != 0:
        # print(lmList)
        
        if lmList[12][2]>lmList[11][2]:
            cv2.putText(img,'down',(10,200), font, 5,(0,0,255),2,cv2.LINE_AA)
        else:
            cv2.putText(img,'up',(10,200), font, 5,(0,0,255),2,cv2.LINE_AA)
        
    cv2.imshow("hand",img)
    if cv2.waitKey(5) & 0xFF == 27:
            break

So, in this code I have basically created a class that detects my hand and returns the landmarks, and then I use those landmarks to calculate the hand sign. Say I want to check whether the middle finger is open or closed; to do that, I compare the y coordinate of landmark 12 with the y coordinate of landmark 11.

Here you can see that when my middle finger is up, the y coordinate of landmark 12 is 80 and the y coordinate of landmark 11 is 114; that is, y12 < y11 when the middle finger is up.

But as I put my middle finger down, the value of y12 becomes greater than the value of y11, so by comparing y12 and y11 we can determine whether the middle finger is up or down. Similarly, we can check multiple landmarks to recognize different kinds of hand signs.
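
If you want to tidy that comparison up, a small helper like the one below (my own sketch, using the [id, x, y] format that findPosition returns) makes the checks easier to read; tip and pip are the landmark numbers of a fingertip and the joint right below it:

def finger_is_up(lmList, tip, pip):
    # lmList entries are [id, x, y]; a smaller y value means higher up in the image
    return lmList[tip][2] < lmList[pip][2]

# middle finger: the tip is landmark 12 and the joint below it is landmark 11
if len(lmList) != 0:
    print("middle finger up" if finger_is_up(lmList, 12, 11) else "middle finger down")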

Fun to do for you:

import cv2
import mediapipe as mp
import numpy as np
import time


class handDetector():
    def __init__(self, mode=False, maxHands=1, detectionCon=0.5, trackCon=0.5):
        self.mode = mode
        self.maxHands = maxHands
        self.detectionCon = detectionCon
        self.trackCon = trackCon

        self.mpHands = mp.solutions.hands
        self.hands = self.mpHands.Hands(self.mode, self.maxHands,
                                        self.detectionCon, self.trackCon)
        self.mpDraw = mp.solutions.drawing_utils

    def findHands(self, img, draw=True):
        imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        self.results = self.hands.process(imgRGB)
        # print(results.multi_hand_landmarks)

        if self.results.multi_hand_landmarks:
            for handLms in self.results.multi_hand_landmarks:
                if draw:
                    self.mpDraw.draw_landmarks(img, handLms,
                                               self.mpHands.HAND_CONNECTIONS,
                                               self.mpDraw.DrawingSpec(color=(255,206,85)),
                                               self.mpDraw.DrawingSpec(color=(240,171,0))
                                               )
        return img

    def findPosition(self, img, handNo=0, draw=True):

        lmList = []
        if self.results.multi_hand_landmarks:
            myHand = self.results.multi_hand_landmarks[handNo]
            for id, lm in enumerate(myHand.landmark):
                # print(id, lm)
                h, w, c = img.shape
                cx, cy = int(lm.x * w), int(lm.y * h)
                # print(id, cx, cy)
                lmList.append([id, cx, cy])

        return lmList



cap = cv2.VideoCapture(0)
detector = handDetector()
handsign='none'
while True:
    success, img = cap.read()
    img = cv2.flip(img, 1)
    img = detector.findHands(img)
    lmList = detector.findPosition(img)
    if len(lmList) != 0:
        # print(lmList)
        if lmList[12][2]>lmList[11][2] and lmList[16][2]>lmList[15][2] and lmList[4][1]<lmList[5][1] and lmList[7][2]>lmList[8][2] and lmList[19][2]>lmList[20][2]:
            handsign="yo"
        elif lmList[12][2]>lmList[11][2] and lmList[16][2]>lmList[15][2] and lmList[4][1]<lmList[5][1] and lmList[7][2]<lmList[8][2] and lmList[19][2]>lmList[20][2]:
            handsign="thulu"
        elif lmList[12][2]>lmList[11][2] and lmList[16][2]>lmList[15][2] and lmList[4][1]<lmList[5][1] and lmList[7][2]>lmList[8][2] and lmList[19][2]<lmList[20][2]:
            handsign="L"
        elif lmList[12][2]<lmList[11][2] and lmList[16][2]<lmList[15][2] and lmList[4][1]<lmList[5][1] and lmList[7][2]>lmList[8][2] and lmList[19][2]>lmList[20][2]:
            handsign="open"
        elif lmList[12][2]>lmList[11][2] and lmList[16][2]<lmList[15][2] and lmList[4][1]<lmList[5][1] and lmList[7][2]>lmList[8][2] and lmList[19][2]>lmList[20][2]:
            handsign="MidDown"
        elif lmList[12][2]<lmList[11][2] and lmList[16][2]>lmList[15][2] and lmList[4][1]<lmList[5][1] and lmList[7][2]>lmList[8][2] and lmList[19][2]>lmList[20][2]:
            handsign="MidcloseDown"
        elif lmList[12][2]>lmList[11][2] and lmList[4][1]>lmList[5][1] and lmList[16][2]>lmList[15][2] and lmList[7][2]<lmList[8][2] and lmList[19][2]<lmList[20][2]:
            handsign="fist"
        elif lmList[12][2]<lmList[11][2] and lmList[16][2]>lmList[15][2] and lmList[4][1]<lmList[5][1] and lmList[7][2]>lmList[8][2] and lmList[19][2]<lmList[20][2]:
            handsign="LL"
        else:
            handsign="no move"
    print(handsign)
    cv2.imshow("hand",img)
    if cv2.waitKey(5) & 0xFF == 27:
            break
                    
        
        


Try out this code on your own and try to think through the logic behind it!!!

Setting up webcam streaming in Flask

The final step is to set up webcam streaming so that we can implement hand tracking in our Flask app. I have already discussed how to set up a basic Flask app; now let's go further.

Try out the following code

app.py

from flask import Flask,render_template, Response
import cv2 as cv

app = Flask(__name__)

def generate_frames():
    cap=cv.VideoCapture(0)
    
    while cap.grab():

        success, frame = cap.read()  # read the camera frame

        if not success:
            break

        image = cv.flip(frame, 1)  # mirror the frame

        ret, buffer = cv.imencode('.jpg', image)  # encode the frame as a JPEG
        image = buffer.tobytes()
        yield (b'--frame\r\n'
            b'Content-Type: image/jpeg\r\n\r\n' + image + b'\r\n')

@app.route("/")
def index():
    return render_template('index.html')


@app.route("/video")
def video():
    return Response(generate_frames(),mimetype='multipart/x-mixed-replace; boundary=frame')


if __name__ == "__main__":
    app.run(debug=True)

index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    
    <div class="container">
        <div class="row">
            <div class="col-lg-8  offset-lg-2" style="display:flex;flex-direction:column;">
                <div style="border: 10px;width: 100%;"><img src="{{ url_for('video') }}" width="50%"></div>
            </div>
        </div>
    </div>


</body>
</html>

Here, in our Flask app, we basically created a /video route that generates frames using the generate_frames method. The generate_frames method produces frames using OpenCV, as discussed above, and continuously yields them to the video route using the yield keyword. In the HTML code I have used Jinja2 templating to get the video stream from our backend code to the frontend.

Congratulations!!! You have completed your first video streaming Flask app.

Fun to do for you

Try to create a Flask app that can perform hand tracking in real time.
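
As a hint, one possible approach (a rough sketch that simply combines the handDetector class and the generate_frames function from above; it assumes handDetector is defined in, or imported into, app.py) is to run the detector on every frame before encoding it:

def generate_frames():
    cap = cv.VideoCapture(0)
    detector = handDetector()              # the class we wrote earlier

    while cap.grab():
        success, frame = cap.read()
        if not success:
            break

        frame = cv.flip(frame, 1)
        frame = detector.findHands(frame)  # draw the hand landmarks onto the frame

        ret, buffer = cv.imencode('.jpg', frame)
        yield (b'--frame\r\n'
               b'Content-Type: image/jpeg\r\n\r\n' + buffer.tobytes() + b'\r\n')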

Here's something extra for you:

I created this Naruto Jutsu Battle game for a hackathon.

Demo Video

Here's the GitHub repository for this project.

Thank you for reading my blog; if you found it useful, do share it with your friends. Also, for more awesome blogs make sure to follow the TechHub Community. TechHub is a great community to learn and explore new technologies. We also have a Discord server; join today to get the latest updates.