Labeling for object detection: It is not that obvious the first time you do it.
To train an object detection model, you usually need to build a new training dataset, which means you also have to label the data. Labeling for object detection can be done with the open-source annotation tool LabelImg. Link to its GitHub repo is here:
(Hey Pythonistas, it is written in Python!) With LabelImg, annotations can be saved in three widely used formats: Pascal VOC (XML files), YOLO, and CreateML.
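To make the difference between two of these formats concrete, here is a minimal sketch that converts one Pascal VOC annotation into YOLO label lines. It assumes the usual Pascal VOC layout (size/width, size/height, and one object element per box with name and bndbox children) and the usual YOLO line layout (class id, then box center, width and height normalized by the image size). The names CLASS_NAMES, voc_box_to_yolo and voc_xml_to_yolo_lines are made up for this illustration, and the class order is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical class order; the index of each name becomes the YOLO class id.
CLASS_NAMES = ["man", "woman", "girl", "boy"]


def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corner coordinates to normalized YOLO center format."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def voc_xml_to_yolo_lines(xml_text, class_names=CLASS_NAMES):
    """Turn one Pascal VOC annotation (given as a string) into YOLO label lines."""
    root = ET.fromstring(xml_text)
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        class_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        corners = [float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        x, y, w, h = voc_box_to_yolo(*corners, img_w, img_h)
        lines.append(f"{class_id} {x:.6f} {y:.6f} {w:.6f} {h:.6f}")
    return lines
```

For a 400x500 image with one woman in the box (100, 50, 300, 250), this yields the single line "1 0.500000 0.300000 0.500000 0.400000".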
In this blog, I will share the annotation guidelines I worked out during an object detection project. To give you some background on the workflow: I have been building a training dataset of people in public spaces from all around the world, to train yolov4 to detect people and predict their gender and age interval. In this task, I have 4 classes: man, woman, girl and boy. After training, my baby model yolov4 should confidently tell these four classes apart. Let me tell you the truth fellas, it is all about the data! Many practitioners talk about hacks and tricks, but first, it is all about the data. By this I mean: if you design the training dataset right, with strong annotations (as little room as possible for human error and wrong labels) and balanced classes, then the newborn model won't be stupid (it will be able to generalize and behave independently).
For the sake of building a smart object detection model, let us focus on how we should build a training dataset and, most importantly, how we should label it.
I have a weird habit of observing my data, whether it is .mp3, .wav, .jpeg, .csv or anything else. I always take a look at the data at hand. Training set design mainly comes down to two points: quantity and quality. If you have N classes, then the number of instances in each class should be similar. Otherwise, you will get good performance only on the large classes. For object detection tasks in general, and the yolo family specifically, a common rule of thumb is at least 1000 instances per class to get good performance. For example, I have man, woman, girl and boy as classes, so I need at least 4000 people in total in my dataset. This does not mean I have to collect 4000 pictures; it means the pictures I have for training should together contain at least 4000 people, with at least 1000 in every class. I started with 545 images of people in the wild. This set includes 1750 women, 2000 men, 80 girls and 75 boys. I know the classes aren't balanced. I chose to start off with this dataset and then grow the training set toward a better balance. [I made the count per class based on my annotation files with a python script that I can share if that is desired]. During the labeling of this dataset, I set myself a few annotation guidelines to respect, explained below:
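As a quick aside before the guidelines, a per-class count like the one mentioned above can be sketched in a few lines. This is not the original script, only a minimal illustration. It assumes YOLO-format .txt label files (one box per line, class id first), a classes.txt file in the same folder that should be skipped, and a hypothetical class order in CLASS_NAMES; the function names are made up for this example.

```python
from collections import Counter
from pathlib import Path

# Hypothetical class order; it must match the order used while annotating.
CLASS_NAMES = ["man", "woman", "girl", "boy"]


def count_instances(label_dir, class_names=CLASS_NAMES):
    """Count labeled boxes per class across all YOLO .txt files in a folder."""
    counts = Counter({name: 0 for name in class_names})
    for label_file in sorted(Path(label_dir).glob("*.txt")):
        if label_file.name == "classes.txt":  # class list, not an annotation file
            continue
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            counts[class_names[int(parts[0])]] += 1
    return counts


def imbalance_ratio(counts):
    """Largest class count divided by the smallest non-zero class count."""
    nonzero = [n for n in counts.values() if n > 0]
    return max(nonzero) / min(nonzero) if nonzero else 0.0
```

With the counts quoted above (2000 men against 75 boys), imbalance_ratio gives roughly 26.7, which is a compact way to track how far the dataset is from balanced as more images are added.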
First, most if not all people in the image should be labelled. The pictures sometimes include large crowds with a lot of overlap and occlusion, and other times the instances (people) are spread out in space. When the instances overlap a lot, you might get confused about which ones to label and which not. If I draw a bounding box around one person and not another, then my model will treat the unlabeled person as part of the background. Not labeling every identifiable object/target results in data loss and confuses the model. If I do not label every instance I can classify as a human, then my model will not detect all the people it should detect in the test phase. That is why I have made sure to turn off my human laziness and label all instances in my data. It is better to be as objective and mechanical as possible. After all, computers do not feel lazy: a model trained on complete labels will try to detect every target, and a model trained on lazy labels will learn to skip people too. So most if not all targets in every picture should be labelled correctly. Every time, I remind myself: if you can identify whether this person is a man or a woman, a boy or a girl, then you should label them, regardless of how blurry, dark, shady, close or far away they are.
Second, another major point that affects the precision of an object detector is the resolution and lighting of the training data. If you train your model on low resolution pictures, then it will tend to perform well only on low resolution test data. You should match the quality of your training data to the specifics of the deployment data stream. I once trained a yolov5 on a low resolution dataset, and it then performed poorly on random test images I took from the internet. Similarly, I trained a VGG16 on a high resolution dataset of celebrities and tested it on my webcam video stream. Clearly, my training and test data did not match in quality in either case. That does not make those models stupid. Models are built for specific tasks and environments. So make your training data match the production environment.
As I previously mentioned, many pictures in my dataset are crowded with people, in spaces with poor lighting. These characteristics make it hard for the model to differentiate between classes, since many training examples look like shadows with blurred, unclear edges and faces. To counter this, I mix my data with images of better resolution and better lighting. Hopefully, the model will then be able to detect most people equally well, whether far away or close, under poor or normal lighting conditions.
During annotation with LabelImg or another tool, you will notice that many bounding boxes overlap, and you will ask yourself: am I messing up? Should I label each of these? I can easily see that this is a woman; will the model understand that too if I label her? Well, you are better off training your model to detect every class at any size; later you can choose to filter predictions by small, medium or large dimensions. I came up with the following rule regarding overlapping bounding boxes: if a bounding box contains more than one target subject (a person, or whatever your target object is), then you should label all of the targets inside the bigger bounding box. Otherwise, the model will treat the pixels of a man, boy or girl as important pixels for detecting a woman. Since differentiating between these classes (woman, man, boy, girl) is hard, the model should be told explicitly which class every object belongs to. Remember, if you don't label an instance, it will be considered part of the background.
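Filtering predictions by size after training can be as simple as bucketing each predicted box by its pixel area. The sketch below is only an illustration: the thresholds are borrowed from the COCO small/medium/large split (32x32 and 96x96 pixels) and are an assumption you should tune to your own images, and the prediction tuple layout (label, confidence, x_min, y_min, x_max, y_max) is hypothetical rather than the output format of any particular yolo implementation.

```python
# Area thresholds in pixels, borrowed from the COCO small/medium/large split.
# They are an assumption here; tune them to your own image sizes.
SMALL_MAX_AREA = 32 ** 2
MEDIUM_MAX_AREA = 96 ** 2


def size_bucket(x_min, y_min, x_max, y_max):
    """Return 'small', 'medium' or 'large' based on the box area in pixels."""
    area = max(0, x_max - x_min) * max(0, y_max - y_min)
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def filter_by_size(predictions, keep=("medium", "large")):
    """Keep predictions whose box falls in one of the requested size buckets.

    Each prediction is a hypothetical tuple:
    (label, confidence, x_min, y_min, x_max, y_max), with pixel coordinates.
    """
    return [p for p in predictions if size_bucket(*p[2:6]) in keep]
```

The point of doing this at prediction time is that the labels stay complete: the model still learns from every person you could classify, and the decision to ignore tiny detections remains a setting you can change without relabeling.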
The above picture is an illustrative example. There are people in the back, but they appear very small. Should I label them? Since I cannot identify their gender or age interval with my naked eye, I exclude them from annotation. All the other people, whom I can classify, are in bounding boxes. You can also notice that I make the bounding box include all parts of the instance (the hands, the feet, ...) but not necessarily held objects like a handbag, backpack or other bags, because not every man or woman will be holding a bag or object. I do label people occluded by others too, especially if they fall inside the bounding box of another instance. That is because I want my model to know that one box can contain, for example, a woman in front and a man occluded behind her with only his head and upper body visible.
Third, it is good practice to include pictures with no objects to detect in the training set. The more the better. This helps the model learn the background better. These background pictures should be rich. Do not let all your background pictures be dominated by one color shade or one kind of edge. Make it a rich mixture to prevent background data from interfering with decision making (classification). If most of my pictures of females have a light or pink background, then the model may classify a man or boy as a woman or girl only because the background is light or pink.
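To keep an eye on how many background pictures the training set actually contains, a small check like the following can help. It is a sketch under assumptions: labels are YOLO-style .txt files sharing the image's file stem, and an image whose label file is missing or empty is treated as a background picture. The function name and the list of image suffixes are made up for this example.

```python
from pathlib import Path

# Image extensions to consider; extend this set for your own data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def split_background_images(image_dir, label_dir):
    """Separate images with at least one labeled box from background images.

    Assumes YOLO-style labels: one .txt file per image with the same stem.
    A missing or empty label file is treated as a background image.
    """
    with_objects, background = [], []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        label_path = Path(label_dir) / (image_path.stem + ".txt")
        has_boxes = label_path.exists() and label_path.read_text().strip() != ""
        (with_objects if has_boxes else background).append(image_path.name)
    return with_objects, background
```

Note that a missing label file can also mean an image you simply have not annotated yet, so it is worth eyeballing the background list before trusting the count.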
MAKE (machine) LEARNING A BETTER EXPERIENCE FELLAS!