Lane Detection Implementation with Mask R-CNN
A simple step-by-step tutorial to implement lane detection using Mask R-CNN
Intro
This is a straightforward example of how to implement a simple object detection and segmentation system with Mask R-CNN. The post was inspired by a team project during my ‘Intelligent Sensing Systems’ course, in which my teammates and I compared U-Net with Mask R-CNN and ultimately chose U-Net as our project direction. To make use of the Mask R-CNN pipeline we had already prepared, I decided to share the details here as a step-by-step tutorial.
This tutorial covers the whole pipeline: model introduction, data collection, annotation, coding, training and testing. To keep it straightforward, the implementation detects only a single class, the lane marks. However, with a slight modification the model can detect multiple classes, which is covered in a later part.
Background on Mask R-CNN
For many tasks, knowing that there is an object in the image is just as important as determining which pixels in the image correspond to that object. The first part is called object detection, while the second is called image segmentation.
Mask R-CNN itself extends Faster R-CNN (which in turn builds on R-CNN) for object detection. Image segmentation is added by attaching a fully convolutional network branch to the existing architecture. So while the main branch generates bounding boxes and identifies each object’s class, the fully convolutional branch, which is fed features from the main branch, generates image masks.
Another interesting part of Mask R-CNN is RoIAlign. Unlike RoIPool, RoIAlign avoids quantization when pooling features, which gives it higher precision than RoIWarp and RoIPool. You may find a very good post explaining and comparing them Here.
So, basically, Mask R-CNN is a two-stage convolutional network that performs both object detection and object masking.
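To make the two branches concrete, here is a minimal sketch of what a single detection call returns in the Matterport implementation used later in this post, assuming a model already built in inference mode and an RGB image array:
# Minimal sketch, assuming `model` is a Matterport Mask R-CNN built in
# inference mode and `image` is an RGB numpy array.
r = model.detect([image], verbose=0)[0]
# From the detection branch: boxes, class ids and confidence scores
print(r['rois'].shape)      # (N, 4) boxes as (y1, x1, y2, x2)
print(r['class_ids'])       # (N,) predicted class id per box
print(r['scores'])          # (N,) probability of each prediction
# From the mask branch: one binary mask per detected instance
print(r['masks'].shape)     # (H, W, N) boolean masks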
Definition
- Scope: illustration for a single class only, covering the pipeline from data collection to testing
- Datasets: the original dataset is a video captured by a dashcam; 203 frames were randomly picked for training and 103 frames for validation.
- Labels: a predicted class with its probability is assigned to each detected object.
- Necessary domain knowledge: not much; the only common sense required is to recognize the lane marks.
- Definition of done: randomly pick an image and feed it into the model to detect and mask the lane marks.
Overview
The model itself will be explained in a later part; the pipeline below shows how to build it up from scratch. First of all, data is important. Without proper data, no model can be trained to give good results. The data source used in this implementation is a dashcam recording, which is easy to obtain and sufficient for a demonstration. On the other hand, a single video lacks variety in terms of weather conditions, countries and road situations.
Annotation is a time-consuming part. In layman’s terms, you need to ‘teach’ the model what the object you want looks like before it can recognize it. Thus, the annotation is done manually, frame by frame. In this demonstration the lane marks are four-sided polygons, which are relatively easy to annotate. When it comes to objects with more edges or irregular shapes, the annotation will require more time and patience.
Model modification and training are based on pretrained COCO weights. You may also train from scratch, which would require more data and training time. The model downloads the COCO weights itself during the first launch. Pretraining gives the model a general impression of a wide variety of objects. You may find more information on the COCO dataset here.
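With the Matterport code base, starting from those COCO weights roughly looks like the sketch below. The head layers are excluded because our class count differs from COCO’s; `config` and `COCO_WEIGHTS_PATH` are placeholders for what maskrcnn_kr.py actually sets up.
import mrcnn.model as modellib

# Sketch only: `config` and COCO_WEIGHTS_PATH come from the training script.
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")
model.load_weights(COCO_WEIGHTS_PATH, by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])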
After training is done, a randomly selected frame from the internet is fed into the model to see the result. Of course, a quantitative comparison could also be carried out to evaluate the results. However, this will not be covered here because this post focuses on implementation. A separate post may be published covering the evaluation part.
Data Collection and Annotation
Data collection can actually be done in different ways. You may record your own dashcam video, use an existing video or even collect individual frames/photos. The key things are variety and quantity.
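If you start from a raw dashcam recording, a quick way to get frames to annotate is to dump every n-th frame with OpenCV. A minimal sketch, where the video path and the sampling interval are placeholders:
import cv2
import os

VIDEO_PATH = "dashcam.mp4"   # placeholder: your own recording
OUT_DIR = "frames"
EVERY_N = 30                 # keep roughly one frame per second at 30 fps

os.makedirs(OUT_DIR, exist_ok=True)
capture = cv2.VideoCapture(VIDEO_PATH)
count, kept = 0, 0
success, frame = capture.read()
while success:
    if count % EVERY_N == 0:
        cv2.imwrite(os.path.join(OUT_DIR, "frame_{:05d}.jpg".format(kept)), frame)
        kept += 1
    count += 1
    success, frame = capture.read()
capture.release()
print("saved", kept, "frames")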
For annotation, I used the VGG Image Annotator from Oxford, which is an easy-to-use interface for object detection and image segmentation annotation.
The main adjustment for me was just making sure each polygon region is labelled with the class name using the Region Attributes in the annotator. A JSON file is then generated to record all the annotated point coordinates belonging to each image.
As you can see from the screenshot below, the annotation is quite straightforward: select the polygon tool and trace out the objects in the image. A sequence number is used if you annotate more than one object within the same class.
Both the class name and the object sequence need to be indicated. For example, in the screenshot above, the annotation highlighted with red dots belongs to the ‘road’ class and is No. 3 in that class within this image.
The left screenshot shows the class name being entered as well as the polygon highlighted in red. It is important to set the annotation class correctly; otherwise, the model will go haywire when it receives a confusing annotation.
After annotation is done, export the annotations as JSON. The JSON file will be read by the model to learn where the objects are. You may also re-organize the JSON file by removing the header and restructuring the content to make it easier to read, which is what I did in my demonstration.
Last but not least, put the JSON file in the same directory as the training/validation data.
You may refer to my training data/JSON and validating data/JSON.
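Before moving to the code, here is a minimal sketch of what the training code does with that JSON: it reads each image’s polygon points and rasterizes them into boolean masks. The file name and image size below are placeholders, and the exact JSON layout depends on how you re-organized the export (the real logic lives in maskrcnn_kr.py).
import json
import numpy as np
import skimage.draw

annotations = json.load(open("via_export.json"))   # placeholder file name
for key, a in annotations.items():
    regions = a['regions']
    # VIA exports regions either as a dict or a list, depending on version
    polygons = [r['shape_attributes'] for r in
                (regions.values() if isinstance(regions, dict) else regions)]
    mask = np.zeros((720, 1280, len(polygons)), dtype=bool)  # placeholder size
    for i, p in enumerate(polygons):
        rr, cc = skimage.draw.polygon(p['all_points_y'], p['all_points_x'],
                                      shape=mask.shape[:2])
        mask[rr, cc, i] = True
    print(a['filename'], mask.shape)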
Code and Training
This part was actually quite nice for me because it forced me to dig through the Matterport code to try and figure out how to modify it for my situation.
See my maskrcnn_kr.py file for all of the code; I will just show some segments here.
The following packages are required:
- numpy
- scipy
- Pillow
- cython
- matplotlib
- scikit-image
- tensorflow==2.0.0
- keras==2.2.4
- opencv-python
- h5py
- imgaug
You may also follow the lines below to prepare the environment. If you do not have conda yet, you may refer to this post to get it ready.
- installation:
conda create -n mrcnn python=3.5
conda install pip
pip install -r requirements.txt
- download weights:
To speed up the training, it is recommended to train from the COCO weights. The program will automatically download the COCO weights if you start a fresh training.
- run:
activate mrcnn
python maskrcnn_kr.py
After running the lines above, you should see the program asking whether to train, predict or exit, as in the screenshot below.
Now you may start training the model. It may take a while, depending on the specs of your machine.
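For reference, with the Matterport code base the training step boils down to something like the sketch below. The configuration values and epoch count are only illustrative, and `dataset_train`/`dataset_val` stand for the Dataset objects that read the images and the VIA JSON described earlier (the real values live in maskrcnn_kr.py).
from mrcnn.config import Config
import mrcnn.model as modellib

class RoadmarkConfig(Config):
    NAME = "roadmark"
    NUM_CLASSES = 1 + 1            # background + roadmark
    STEPS_PER_EPOCH = 100          # illustrative value
    DETECTION_MIN_CONFIDENCE = 0.9

config = RoadmarkConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")
# ... load the COCO weights as shown earlier, then fine-tune the head layers:
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=10,
            layers='heads')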
Result
A dashcam image taken in Singapore’s KPE (Kallang–Paya Lebar Expressway) tunnel was chosen here; the result is shown below. Generally, it met expectations. Again, there are more quantitative ways to evaluate the outcome, and I may write another post about them in the future.
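For completeness, running a single test image through the trained weights looks roughly like this with the Matterport utilities; the weights file and image path are placeholders, and RoadmarkConfig is the training configuration sketched earlier.
import skimage.io
import mrcnn.model as modellib
from mrcnn import visualize

class InferenceConfig(RoadmarkConfig):   # RoadmarkConfig from the training sketch
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(),
                          model_dir="logs")
model.load_weights("mask_rcnn_roadmark.h5", by_name=True)  # placeholder path

image = skimage.io.imread("test_frame.jpg")                # placeholder image
r = model.detect([image], verbose=1)[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'],
                            ['BG', 'roadmark'], r['scores'])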
What’s more
As mentioned earlier, it is easy to modify the model for multi-class detection. In this demonstration model there are only two classes, ‘background’ and ‘roadmark’.
class_names = ['BG', 'roadmark']
When adding more classes in the code, don’t forget to modify the annotations accordingly.
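For instance, with the Matterport Dataset class, going multi-class mainly means registering each extra class, bumping NUM_CLASSES, and keeping the class names in the same order everywhere. A rough sketch, where the extra ‘road’ class is only illustrative:
from mrcnn import utils
from mrcnn.config import Config

class MultiClassConfig(Config):
    NAME = "lanes"
    NUM_CLASSES = 1 + 2            # background + roadmark + road

class MultiClassDataset(utils.Dataset):
    def load_custom(self, dataset_dir, subset):
        # One add_class call per object class; ids start at 1.
        self.add_class("lanes", 1, "roadmark")
        self.add_class("lanes", 2, "road")
        # ... then read the VIA JSON and add_image() each frame, keeping the
        # class name from the region attributes so load_mask() can return
        # the matching class ids.

# Display-time class names must follow the same order:
class_names = ['BG', 'roadmark', 'road']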
Want to make the model work on video? Here is an example.
import datetime
import cv2

# `model` (built in inference mode), `color_show` and `video_path` are
# defined in maskrcnn_kr.py; only the video handling is shown here.

# Video capture
vcapture = cv2.VideoCapture(video_path)
width = int(vcapture.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(vcapture.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = vcapture.get(cv2.CAP_PROP_FPS)

# Define codec and create video writer
file_name = "splash_{:%Y%m%dT%H%M%S}.avi".format(datetime.datetime.now())
vwriter = cv2.VideoWriter(file_name,
                          cv2.VideoWriter_fourcc(*'MJPG'),
                          fps, (width, height))

count = 0
success = True
while success:
    print("frame: ", count)
    # Read next image
    success, image = vcapture.read()
    if success:
        # OpenCV returns images as BGR, convert to RGB
        image = image[..., ::-1]
        # Detect objects
        r = model.detect([image], verbose=0)[0]
        # Color splash
        splash = color_show(image, r['masks'])
        # RGB -> BGR to save image to video
        splash = splash[..., ::-1]
        # Add image to video writer
        vwriter.write(splash)
        count += 1
vwriter.release()
Basically, a video is just frames stacked together, so the way to handle video is to process it frame by frame and then stack the frames back into a video.
Final Thoughts
This quick project is an example of how to build a single-class image model with detection and segmentation functions. As stated above, depending on the use case, this model could be tweaked to suit a wide range of scenarios where localizing specific objects/areas in images/videos is useful.
The idea is to show how a model is built up from scratch. Step-by-step guidance gives a chance not only to folks who are already equipped with the relevant knowledge, but also, more importantly, to newcomers who wish to give it a try. Just like ‘hello world’ in programming, let’s treat this as a ‘welcome’ to the visual sensing world.