AI_DL_Assignment / 20. Principles of Object Detection /2. Object Detection Introduction - Sliding Windows with HOGs.srt
1
00:00:00,470 --> 00:00:01,000
OK.
2
00:00:01,050 --> 00:00:02,430
So let's start at the beginning.
3
00:00:02,460 --> 00:00:05,520
Let's talk about object detection, really object detectors.
4
00:00:05,670 --> 00:00:11,910
So I'm going to introduce you to the history of it. So firstly, object detection is one of the holy grails of
5
00:00:11,910 --> 00:00:17,610
computer vision because previously what we have been doing is just classifying like an entire image
6
00:00:17,610 --> 00:00:20,510
and seeing what class the objects belong to.
7
00:00:20,730 --> 00:00:26,490
But can we take an image like this and label each major component as being a dog, car, horse, or the
8
00:00:26,760 --> 00:00:28,340
person in the back.
9
00:00:28,350 --> 00:00:32,230
Not until we come across object detection.
10
00:00:32,640 --> 00:00:40,620
So object detection is a mix of object classification and localization. Localization is the identification
11
00:00:40,650 --> 00:00:43,120
of a bounding box outlining the object.
12
00:00:43,140 --> 00:00:49,590
So, like with my face here, it's basically extracting a bounding box around my face, and face detection is perhaps
13
00:00:49,590 --> 00:00:53,760
one of the most popular object detection algorithms that we all know.
14
00:00:53,830 --> 00:00:57,220
We're all quite familiar with from using cameras in our cell phones.
15
00:00:57,270 --> 00:00:57,780
OK.
16
00:00:58,290 --> 00:01:04,150
So basically, instead of just telling you this object here is a cat,
17
00:01:04,170 --> 00:01:09,070
it actually tells you where the cat is, and that is the whole point of object detection.
18
00:01:10,620 --> 00:01:15,340
So let's get into the history of it and start with Haar cascade classifiers.
19
00:01:15,360 --> 00:01:19,140
Now, there were many object detectors before this.
20
00:01:19,140 --> 00:01:24,840
However, the Haar cascade classifier, this is what made it mainstream and quite popular, because
21
00:01:24,840 --> 00:01:26,340
it was so fast.
22
00:01:26,370 --> 00:01:33,420
So basically this was developed by Viola and Jones in their face detection algorithm in 2001, not
23
00:01:33,420 --> 00:01:35,480
that long ago, 17 years ago,
24
00:01:35,520 --> 00:01:40,960
to be fair. And it was super fast, and it's actually still used in a number of applications.
25
00:01:41,280 --> 00:01:43,710
Basically it's been optimized and tweaked to be even faster.
26
00:01:43,710 --> 00:01:49,890
So it basically reduces the CPU load, and it's very, very accurate.
27
00:01:49,890 --> 00:01:52,930
Basically, what it is is a cascade of classifiers.
28
00:01:53,190 --> 00:01:56,640
That's basically how it got its name, and it uses Haar features.
29
00:01:56,640 --> 00:01:58,590
Basically let's go into the next slide.
30
00:01:58,660 --> 00:02:02,760
Actually, I don't have it in this section, but it basically uses Haar features, and Haar features are
31
00:02:02,760 --> 00:02:06,210
basically like you have rectangles
32
00:02:06,250 --> 00:02:07,100
overlapping here.
33
00:02:07,240 --> 00:02:12,690
You imagine a white rectangle here and one here, and then there are different types of Haar features.
34
00:02:12,810 --> 00:02:15,590
So basically it's just feature extraction,
35
00:02:15,690 --> 00:02:22,350
basically what we learned before, and it slides this box, the window, over and over continuously,
36
00:02:22,410 --> 00:02:31,950
looking for a face. They're very good, but they are pretty hard to train, develop, and optimize.
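The rectangle features just described can be sketched in code. This is only an illustrative example, not the course's code: a two-rectangle Haar-like feature computed from an integral image (summed-area table), which is the trick that makes Viola-Jones so fast; all function names here are my own.

```python
# Illustrative sketch of a two-rectangle Haar-like feature computed with
# an integral image. Real Haar cascades (e.g. Viola-Jones) combine many
# such features in a cascade of boosted classifiers.

def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), size w x h,
    using only four lookups into the integral image."""
    a = ii[y + h - 1][x + w - 1]
    b = ii[y - 1][x + w - 1] if y > 0 else 0
    c = ii[y + h - 1][x - 1] if x > 0 else 0
    d = ii[y - 1][x - 1] if x > 0 and y > 0 else 0
    return a - b - c + d

def haar_two_rect(ii, x, y, w, h):
    """White (left) rectangle sum minus black (right) rectangle sum."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

img = [[1, 1, 5, 5],
       [1, 1, 5, 5]]  # dark left half, bright right half
ii = integral_image(img)
print(haar_two_rect(ii, 0, 0, 4, 2))  # 4 - 20 = -16: a strong edge response
```

The point of the integral image is that each rectangle sum costs only four lookups, no matter how big the rectangle is.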
37
00:02:32,010 --> 00:02:38,010
So let's move on to histograms of oriented gradients (HOGs) with SVM sliding windows. Sliding windows is a method
38
00:02:38,010 --> 00:02:43,580
where we extract segments of a full image, piece by piece, in the form of a rectangular extraction box.
39
00:02:43,590 --> 00:02:48,000
So I mentioned it in the previous slide when I was talking about this box being slid across this image.
40
00:02:48,330 --> 00:02:53,430
Here in this image is a picture of my wife from her last bodybuilding bikini competition
41
00:02:53,430 --> 00:02:54,560
two months ago.
42
00:02:54,870 --> 00:03:02,550
And what it does is, just imagine this window is being moved here, then down here, and then down here, just
43
00:03:02,550 --> 00:03:05,670
like, remember how we moved across the image
44
00:03:05,680 --> 00:03:07,960
in CNNs? It's exactly the same thing.
45
00:03:07,970 --> 00:03:14,430
And we can actually set the same parameters like stride and the size of this box and what this box does
46
00:03:14,430 --> 00:03:17,640
here, in sliding windows with histograms of oriented gradients and
47
00:03:17,700 --> 00:03:25,980
SVM, is that it basically extracts the HOG, all the gradients, in this box at different scales.
48
00:03:25,980 --> 00:03:31,620
So basically it does it with the image at one scale, and then another, smaller scale, and then this one
49
00:03:31,620 --> 00:03:35,480
here, and this one basically has no room to go right, so it just goes straight down.
50
00:03:35,760 --> 00:03:39,480
And it tries to match up the HOG gradients with what it knows
51
00:03:39,480 --> 00:03:41,700
the object is supposed to look like, to find the object.
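As a rough sketch of what a histogram of oriented gradients boils down to (illustrative only, not the course's code; a real HOG descriptor also adds a grid of cells, block normalization, and interpolation):

```python
# Minimal sketch of a HOG-style orientation histogram for one cell:
# bin the gradient orientations of a patch, weighted by gradient
# magnitude. Function name and parameters are illustrative.
import math

def cell_histogram(patch, n_bins=9):
    """Histogram of unsigned gradient orientations (0-180 deg)."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # horizontal gradient
            gy = patch[y + 1][x] - patch[y - 1][x]   # vertical gradient
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang / 180.0 * n_bins) % n_bins] += mag
    return hist

# A vertical edge: all gradients point horizontally, so all the
# weight lands in the 0-degree bin.
patch = [[0, 0, 9, 9]] * 4
print(cell_histogram(patch))
```

Concatenating many such cell histograms gives the feature vector that the SVM compares against what the object "is supposed to look like."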
52
00:03:42,000 --> 00:03:47,400
Now, as you can see, this could be an effective way, but it's not really that efficient.
53
00:03:47,400 --> 00:03:48,410
Why?
54
00:03:48,420 --> 00:03:53,400
Because imagine we have to do this for every segment of the image, continuously.
55
00:03:53,400 --> 00:03:55,680
It gets exhaustive and computationally expensive
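The sliding-window-plus-pyramid loop just described can be sketched like this (an illustrative example with an assumed window size, stride, and scale factor, not the course's code; each yielded window would be fed to the HOG + SVM classifier):

```python
# Sketch of exhaustive sliding-window search over an image pyramid.

def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield the top-left (x, y) of every window position."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y

def pyramid_scales(img_w, img_h, win_w, win_h, factor=0.75):
    """Yield downscaled image sizes until the window no longer fits."""
    w, h = img_w, img_h
    while w >= win_w and h >= win_h:
        yield w, h
        w, h = int(w * factor), int(h * factor)

# Count how many windows a 64x128 detector box would visit on a
# modest 640x480 image with a stride of 16 pixels.
total = 0
for w, h in pyramid_scales(640, 480, 64, 128):
    total += sum(1 for _ in sliding_windows(w, h, 64, 128, stride=16))
print(total)  # well over a thousand classifier evaluations for one image
```

Every one of those positions means one full HOG extraction plus one SVM evaluation, which is exactly why this gets so expensive.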
56
00:03:58,720 --> 00:04:05,370
so the previous approach, which is basically manual feature extraction, I just mentioned that. And why would we
57
00:04:05,370 --> 00:04:10,740
want to manually find these features if CNNs actually eliminate that?
58
00:04:10,740 --> 00:04:16,350
All right, CNNs actually automatically find features by just running all this training data
59
00:04:16,680 --> 00:04:20,350
through the algorithm, finding the loss, and matching it with the correct class.
60
00:04:20,370 --> 00:04:22,770
So that's what's brilliant about CNNs.
61
00:04:22,770 --> 00:04:24,760
It takes that step away from us.
62
00:04:26,340 --> 00:04:31,970
So as I said, one of the problems with doing this is the issue of scale.
63
00:04:32,100 --> 00:04:34,920
Imagine this is a simple image just 20 by 20.
64
00:04:34,920 --> 00:04:36,870
So this box can be passed over here.
65
00:04:36,960 --> 00:04:39,630
But imagine this was a much bigger image, like a CCTV image.
66
00:04:39,720 --> 00:04:44,130
How many different times, how many different boxes, would we extract?
67
00:04:44,130 --> 00:04:46,460
How do we know what size the box should be?
68
00:04:46,470 --> 00:04:50,410
I mean, that's where we rescale the image, but how many different rescalings are we going to do?
69
00:04:50,440 --> 00:04:54,830
So as you can see this is not a very effective way of doing object detection.
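To make the scale problem concrete, here is a rough count of window positions at a single scale (an illustrative formula with assumed sizes and stride, not from the course):

```python
# Number of window positions at one scale:
# ((W - w) // stride + 1) * ((H - h) // stride + 1)
def n_windows(W, H, w, h, stride):
    """Count sliding-window positions for a w x h box on a W x H image."""
    return ((W - w) // stride + 1) * ((H - h) // stride + 1)

print(n_windows(20, 20, 10, 10, 5))       # tiny 20x20 image: 9 windows
print(n_windows(1920, 1080, 64, 128, 8))  # full-HD frame: 27960 windows
```

And that is before multiplying by the number of rescalings, which is why exhaustive sliding windows does not scale to large images.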
70
00:04:56,430 --> 00:05:02,600
So, talking a bit about histograms of oriented gradients: I'm not going to go into this in detail. I taught
71
00:05:02,600 --> 00:05:05,480
this in my other OpenCV course.
72
00:05:05,480 --> 00:05:07,280
The video is included free in that section.
73
00:05:07,290 --> 00:05:09,230
So that's why I'm not going to talk about it much here.
74
00:05:09,550 --> 00:05:15,290
But basically the slides are here for you to go through on your own and you can pretty much infer from
75
00:05:15,290 --> 00:05:17,720
these steps here what HOGs really are.
76
00:05:20,110 --> 00:05:22,090
So now we move on to R-CNNs.