Detecting Pavement Cracks in Road Images and Videos using YOLO and Faster R-CNN

Link to Github repo | Link to website | Link to paper

Introduction

In this project I apply both the Faster R-CNN and YOLO (“You Only Look Once”) computer vision frameworks to the 2020 Global Road Damage Detection (“GRDC”) Challenge organized by the Institute of Electrical and Electronics Engineers (“IEEE”), which in May 2020 published a novel dataset of 21,041 annotated images of pavement distresses and called upon academic and other researchers to submit innovative deep learning-based solutions to common road hazard detection problems. Feel free to skip to the Example Video Predictions section to see some of the object detection outputs from this project.

Making use of this dataset, I propose a supervised object detection approach leveraging the YOLO and Faster R-CNN frameworks to detect and classify road distresses in real time via a vehicle dashboard-mounted smartphone camera. The approach achieves an F1 score of 0.68, ranking in the top 5 of the 121 teams that had entered the challenge as of December 2021.


Dataset

The GRDC dataset comprises 21,041 road images of 600 x 600 and 720 x 720 pixels captured through a smartphone camera mounted on a vehicle dashboard traveling at an average speed of 25 mph, subsequently hand-annotated by a team of researchers from the Indian Institute of Technology Roorkee and the University of Tokyo.1 The GRDC’s stated objective was to develop deep learning models capable of generalizing to road distresses across multiple countries, as opposed to a single country as in the competition’s 2018 precursor, which used 9,053 images collected in Japan only. Accordingly, the raw dataset is divided into 10,506, 7,706 and 2,829 images from Japan, India and the Czech Republic respectively.

Table I. Data Dictionary of Road Distress Classes

Additionally, while the raw dataset provides annotations for a total of eight road distress classes based on the Japanese Maintenance Guidebook for Road Pavements,2 only the top four classes by frequency count, namely Longitudinal Cracks (class label D00), Lateral Cracks (D10), Alligator Cracks (D20) and Potholes (D40), were considered in the GRDC. A data dictionary of these classes is provided in Table I, with example annotated images and frequency distributions by type and country shown in Figures 1 and 2 respectively.

Figure 1. Example images of annotated road distress images from the 2020 Global Road Damage Detection Challenge dataset
Figure 2. Frequency distributions of road distresses by type and country

Methodology

In order to explore the relative strengths and weaknesses of one- and two-stage detectors applied to the task at hand, I apply both the Faster R-CNN and YOLO (“You Only Look Once”) computer vision frameworks to the GRDC dataset. The Faster R-CNN framework developed by Ren et al.3 may be designated as “two-stage” given its process of first outputting region proposals as candidate regions of an image potentially containing an object of interest, before applying a second Region of Interest (“RoI”) layer to each proposal to classify the object it contains and predict its bounding box vertices and dimensions.

Although Faster R-CNN boasts higher mAP scores than its predecessor Fast R-CNN and R-CNN models as measured on popular image dataset benchmarks such as MS COCO and PASCAL VOC 2007, and approaches real-time detection with test times of approximately 200 ms per image on GPUs versus 47.0 s in the case of R-CNN,4 one-stage detectors such as YOLO were subsequently developed to overcome the remaining inference time bottlenecks.

YOLO may be styled as “one-stage” due to its bypass of Faster R-CNN’s region proposal stage: it instead divides a convolved input image into an S x S grid of cells, with each cell tasked with outputting K bounding box predictions whose center coordinates lie within that cell. This set of K * S^2 predicted bounding boxes is then passed through a time-efficient non-maximum suppression (“NMS”) algorithm that eliminates bounding boxes overlapping above a certain Intersection over Union (“IoU”) threshold, yielding a final list of predicted bounding boxes.5
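To make this filtering step concrete, below is a minimal NumPy sketch of greedy NMS over a set of candidate boxes; the corner-format boxes and the 0.5 IoU default are illustrative assumptions rather than YOLOv5’s exact implementation.

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression.
        boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
        Returns the indices of the boxes kept."""
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]          # highest-confidence box first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # IoU of the kept box against all remaining candidates
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            # drop candidates overlapping the kept box above the IoU threshold
            order = order[1:][iou <= iou_threshold]
        return keep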

Altogether this one-stage process allows for much improved inference speeds, with the latest YOLO implementation, ultralytics-YOLO (“YOLOv5”), achieving per-image prediction times in the 7-10 ms range on GPUs.6 Given these differences in model architecture and inference time, I investigated both YOLOv5 in its x (142M trainable parameters) and l (77M parameters) size varieties as well as Faster R-CNN, finding that both YOLOv5 versions outperformed Faster R-CNN in F1 score and inference time. YOLOv5 was therefore used as the base model architecture in this approach.
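For reference, the YOLOv5 sizes compared here can be loaded for inference through PyTorch Hub as sketched below; the image path is a placeholder, and the COCO-pretrained weights shown would in practice be replaced by checkpoints fine-tuned on the four GRDC classes.

    import torch

    # Load the two YOLOv5 sizes compared in this project (COCO-pretrained here;
    # in practice these would be checkpoints fine-tuned on the four GRDC classes).
    yolov5x = torch.hub.load('ultralytics/yolov5', 'yolov5x', pretrained=True)
    yolov5l = torch.hub.load('ultralytics/yolov5', 'yolov5l', pretrained=True)

    # Run inference on a single road image (placeholder path).
    results = yolov5x('road_image.jpg')
    results.print()          # per-class detection summary
    boxes = results.xyxy[0]  # tensor of [x1, y1, x2, y2, confidence, class]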

To further improve the F1-score performance of this YOLO-based method, the Ensemble Model (“EM”) and Test Time Augmentation (“TTA”) approaches were used in the prediction stage. The EM approach ensembles, or averages, the bounding box predictions of several YOLOv5 models trained with different batch size, learning rate, optimizer and other hyperparameters, with the differing kernel patterns each model learns under its unique set of hyperparameters supplementing those of the other included models. As with standard bagging- or boosting-style ensembling methods such as Random Forests or Gradient Boosting, this has the effect of reducing model prediction variance such that improved accuracy may be achieved.7 The tradeoff for this improved accuracy is increased inference time and reduced model interpretability, as no single model is responsible for the resulting predictions.

The second approach, Test Time Augmentation, similarly ensembles an individual model’s predictions on several augmented versions of the same base test image, derived through horizontal flipping and scaling of the image resolution by 1.30x, 0.83x and 0.67x. The five resulting bounding box prediction sets, corresponding to one base and four augmented images, are then filtered through the NMS procedure based on a selected IoU threshold and a comparison of bounding box confidence scores. Like model ensembling, this TTA procedure reduces generalization error by ensembling multiple predictions.

Lastly, these TTA and EM approaches can be combined: each of the k base and augmented test images produced through TTA is fed to each of the i ensembled models, yielding k * i bounding box prediction sets that are then averaged and filtered through the NMS procedure as detailed in Figure 4. This allows for increased prediction accuracy by averaging the predictions of several different models across multiple augmented versions of the same base test image.
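A minimal sketch of this combined EM + TTA prediction loop is shown below, assuming each detector is callable on an image and the supplied inverse transforms map its boxes back to the original image frame; these interfaces are illustrative, and YOLOv5 also exposes its own built-in TTA via an augment=True inference flag.

    import numpy as np

    def predict_em_tta(models, image, augmentations, nms_fn, iou_threshold=0.999):
        """Ensemble Model + Test Time Augmentation inference sketch.
        models: list of i trained detectors, each callable as model(img) -> (boxes, scores)
        augmentations: list of k (transform, inverse_transform) pairs, e.g. identity,
                       horizontal flip, and 1.30x / 0.83x / 0.67x rescales
        nms_fn: an NMS routine such as the one sketched earlier"""
        all_boxes, all_scores = [], []
        for transform, inverse_transform in augmentations:    # k augmented views
            augmented = transform(image)
            for model in models:                              # i ensembled models
                boxes, scores = model(augmented)
                all_boxes.append(inverse_transform(boxes))    # map back to the base image frame
                all_scores.append(scores)
        boxes = np.concatenate(all_boxes)
        scores = np.concatenate(all_scores)
        keep = nms_fn(boxes, scores, iou_threshold)           # final NMS filtering of the k * i sets
        return boxes[keep], scores[keep]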

The GRDC train dataset was further split into 98% training images and 2% validation images in order to monitor validation loss after each training epoch, with the final train and validation sets containing 20,621 and 420 images respectively. For YOLOv5, this base training set was further augmented using YOLOv5’s standard training augmentation pipeline, including horizontal and vertical image flipping and saturation and hue augmentations as detailed in Table II, while the unaltered base dataset was used for Faster R-CNN training.
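A simple way to reproduce this 98/2 split is sketched below, assuming a single directory of GRDC training images and writing out the image-list files that YOLOv5 can consume; the paths and seed are assumptions.

    import random
    from pathlib import Path

    random.seed(42)  # assumed seed for reproducibility

    # Assumed layout: one directory containing the 21,041 GRDC training images.
    images = sorted(Path('grdc/train/images').glob('*.jpg'))
    random.shuffle(images)

    split = int(0.98 * len(images))          # ~20,621 train / ~420 validation images
    train_images, val_images = images[:split], images[split:]

    for name, subset in [('train', train_images), ('val', val_images)]:
        with open(f'{name}.txt', 'w') as f:  # image list files referenced from the dataset YAML
            f.write('\n'.join(str(p) for p in subset))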

Table II. YOLOv5 Training Data Augmentations Used

Results

Per GRDC competition guidelines, test scores were derived by submitting prediction sets through the GRDC’s competition website for two unreleased test sets (“test1” and “test2”) containing 2,631 and 2,664 images respectively, sampled per the GRDC to follow country and target class distributions similar to those of the training set.8 YOLOv5x, YOLOv5l and Faster R-CNN models were first trained using standard out-of-the-box values for learning rate, optimizer, momentum and other hyperparameters, producing F1 scores of 0.52, 0.52 and 0.50 respectively. Additional tuning of batch size and optimizer hyperparameters showed an 8-32 image batch size and stochastic gradient descent with Nesterov accelerated momentum to be optimal for YOLOv5, while SGD with simple momentum and an 8-16 batch size were optimal for Faster R-CNN. As further tuning of the YOLOv5x, YOLOv5l and Faster R-CNN models showed superior performance on the part of YOLOv5 as measured by F1 score, the YOLO framework was adopted as the core of this proposed approach.

In order to increase model heterogeneity and make the ensemble more generalizable, and operating within a maximum inference time constraint of 0.50 s per image to theoretically enable real-time detection in the field, several YOLOv5x and YOLOv5l models configured with different batch size and other hyperparameter values were trained and subsequently ensembled. Following this approach, an ensemble of six models, comprising YOLOv5x and YOLOv5l each trained with batch sizes of 32, 16 and 8 for 150 epochs, was shown empirically to yield a significant improvement over the previous single-model experiments with an F1 score of 0.57, and this ensemble structure was therefore selected as the core of this approach. As per-image inference times were observed to increase linearly with the number of models included in the ensemble, this six-model configuration, producing maximum per-image inference times of 0.42 s with the vast majority of prediction times falling in the 0.21-0.40 s range, was selected to satisfy the self-imposed 0.5 s inference time constraint.

Following this EM stage, applying the TTA augmentations shown in Figure 3 further increased the F1 score to 0.59. Finally, to further improve prediction performance, an exhaustive grid search over YOLOv5’s NMS IoU and minimum confidence threshold (C) hyperparameters was conducted to ascertain their optimal combination, yielding a highest, top 5-placing F1 score of 0.68 with C = 0.25 and NMS = 0.999. A summary of all F1 scores produced through this approach for both the test1 and test2 datasets is shown in Tables III and IV below.
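This tuning step amounts to the simple loop sketched below; the candidate threshold grids and the run_ensemble and evaluate_f1 helpers (running the six-model EM + TTA pipeline and scoring its predictions) are hypothetical names used only for illustration.

    import itertools

    # Assumed candidate grids for minimum confidence (C) and NMS IoU thresholds.
    conf_grid = [0.15, 0.20, 0.25, 0.30, 0.35]
    nms_grid = [0.50, 0.75, 0.90, 0.99, 0.999]

    best = (None, None, -1.0)
    for conf_thresh, iou_thresh in itertools.product(conf_grid, nms_grid):
        # run_ensemble and evaluate_f1 are hypothetical helpers standing in for the
        # EM + TTA prediction pipeline and the GRDC F1 scoring of its output.
        predictions = run_ensemble('grdc/test1/', conf=conf_thresh, iou=iou_thresh)
        f1 = evaluate_f1(predictions)
        if f1 > best[2]:
            best = (conf_thresh, iou_thresh, f1)

    print(f'Best F1 {best[2]:.2f} at C={best[0]}, NMS IoU={best[1]}')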

Table III. Test1 F1 Scores of YOLOv5 Model Ensemble varying Confidence Threshold and NMS
Table IV. Test2 F1 Scores of YOLOv5 Model Ensemble varying Confidence Threshold and NMS

System Implementation

Semi-automated road monitoring systems leveraging computer vision algorithms such as those presented here could be deployed using dashboard-mounted smartphones in order to supplement or potentially replace human visual inspection in either a real-time or offline data processing setting.
To further improve recall performance in higher-resource environments, this system could use images taken from several smartphones mounted at different angles in the same vehicle in order to strengthen same-location predictions with different fields of view of the same sections of road.

Furthermore, by using the GPS coordinates automatically embedded in each image file’s EXIF data, complete road quality maps of neighborhoods, cities or states could be compiled post-data collection in order to quantify levels of road distress across different road sections, offering a visualization medium to better inform, for instance, government agencies’ road maintenance funding allocation decisions. To demonstrate this, I created a simple Python folium map of road surface quality for a Paulus Hook neighborhood block in Jersey City, NJ, as shown in Figure 5, using road images queried through the Google Street View API and passed to this six-model ensemble. Leveraging the model’s prediction confidence score as a relatively crude proxy for road damage severity, road damage scores can be computed for different sections of road using these road distress frequencies and severities. To facilitate further analysis, this road section-level data could be exported to a tabular format for storage in government agency databases, allowing comprehensive road analyses across entire cities and states to be performed.
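A minimal sketch of such a map, assuming each image’s latitude/longitude has already been extracted from its EXIF tags and assigned a damage score from the ensemble’s detections, is shown below; the coordinates and scores are placeholder values.

    import folium

    # Placeholder (lat, lon, damage_score) triples; in practice the coordinates come from
    # each image's EXIF GPS tags and the score from the ensemble's detections, e.g. summing
    # prediction confidences as a crude severity proxy.
    scored_locations = [
        (40.7140, -74.0337, 0.1),
        (40.7145, -74.0341, 0.6),
        (40.7150, -74.0345, 0.9),
    ]

    def score_to_color(score):
        """Map a road damage score to a traffic-light color."""
        return 'green' if score < 0.3 else 'orange' if score < 0.7 else 'red'

    road_map = folium.Map(location=[40.7145, -74.0341], zoom_start=17)
    for lat, lon, score in scored_locations:
        folium.CircleMarker(
            location=[lat, lon],
            radius=6,
            color=score_to_color(score),
            fill=True,
            popup=f'Damage score: {score:.2f}',
        ).add_to(road_map)

    road_map.save('road_quality_map.html')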

Other low-cost data collection methods, such as smartphone accelerometer data, could further reinforce this computer vision approach, for example by providing estimates of road roughness as proposed in Douangphachanh et al.9 As the International Roughness Index (IRI) is another road distress metric commonly monitored by OECD government agencies, including many US state Department of Transportation (“DOT”) agencies as part of MAP-21 federal reporting requirements, a computer vision-based model such as the one presented in this project could be supplemented with models regressing IRI on accelerometer data to provide a fuller picture of road quality across both surface quality and roughness.10


Conclusion

The currently elevated costs associated with completing regular and extensive road damage surveys at the local and regional levels through human visual inspection call for computer vision-assisted monitoring of road infrastructure. This project put forward a YOLO-based approach to road distress detection using model ensembling and test time augmentation, yielding a 0.68 F1 score on test data and placing in the top 5 of the 121 teams that had entered the 2020 Global Road Damage Detection Challenge as of December 2021.

Leveraging this YOLO model ensemble, I furthermore proposed a novel approach to road distress monitoring using several dashboard-mounted smartphones, enabling the real-time capture and processing of images and videos of road hazards from different angles. Using a batch of Google Street View API road images with embedded EXIF GPS coordinate data queried for a neighborhood block in Jersey City, NJ, I further demonstrated a simple indexing methodology for quantifying and mapping road surface quality based on distress frequency and severity. As part of future work, I plan to investigate additional methods for improving the cost-effectiveness of road roughness data collection and processing in order to integrate road roughness as an additional dimension of road quality monitoring.


Example Video Predictions
i) Longitudinal Crack Detection
ii) Lateral Crack Detection
iii) Alligator Crack Detection
iv) Pothole Detection

Thanks for reading and feel free to check out DeepRoad AI’s website to learn more!

Sources

  1. D. Arya, H. Maeda, S. K. Ghosh, D. Toshniwal, A. Mraz, T. Kashiyama, and Y. Sekimoto, Deep learning-based road damage detection and classification for multiple countries, Automation in Construction, vol. 132, 2021.
  2. Japan Road Association, Maintenance Guidebook for Road Pavements, 2013 edition, Technical Report, http://www.road.or.jp/english/publication/index.html, Accessed: 2021-12-15.
  3. S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems, vol. 28, 2015.
  4. R. Girshick, Fast R-CNN, Proceedings of the IEEE international Conference on Computer Vision, 2015, pp. 1440–1448.
  5. A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020.
  6. J. Solawetz, YOLOv5 is here: State-of-the-art object detection at 140 fps, https://blog.roboflow.com/yolov5-is-here/, 2020, Accessed: 2021-12-16.
  7. T. G. Dietterich and E. B. Kong, Machine learning bias, statistical bias, and statistical variance of decision tree algorithms, Citeseer, Tech. Rep., 1995.
  8. G.R.D.C. Organizing Team, Data, https://rdd2020.sekilab.global/data/, 2020, Accessed: 2021-12-17.
  9. V. Douangphachanh and H. Oneyama, Estimation of road roughness condition from smartphones under realistic settings, 13th International Conference on ITS Telecommunications (ITST), 2013, pp. 433–439.
  10. P. Mucka, Current approaches to quantify the longitudinal road roughness, International Journal of Pavement Engineering, vol. 17, no. 8, pp. 659–679, 2016.