1
00:00:00,470 --> 00:00:01,000
OK.

2
00:00:01,050 --> 00:00:02,430
So let's start at the beginning.

3
00:00:02,460 --> 00:00:05,520
Let's talk about object detection, or really object detectors.

4
00:00:05,670 --> 00:00:11,910
So I'm going to introduce you to the history of it. So firstly, object detection is one of the holy grails of

5
00:00:11,910 --> 00:00:17,610
computer vision, because previously what we have been doing is just classifying an entire image

6
00:00:17,610 --> 00:00:20,510
and saying what object or what class it belongs to.

7
00:00:20,730 --> 00:00:26,490
But can we take an image like this and label each major component as being a dog, car, person, horse,

8
00:00:26,760 --> 00:00:28,340
person in the back?

9
00:00:28,350 --> 00:00:32,230
Not yet, not until we come to object detection.

10
00:00:32,640 --> 00:00:40,620
So object detection is a mix of object classification and localization. Object detection is the identification

11
00:00:40,650 --> 00:00:43,120
of a bounding box outlining the object.

12
00:00:43,140 --> 00:00:49,590
So like with my face here, basically it is extracting a bounding box around my face, and face detection is perhaps

13
00:00:49,590 --> 00:00:53,760
one of the most popular object detection algorithms that we all know.

14
00:00:53,830 --> 00:00:57,220
We're all quite familiar with it from using the cameras in our cell phones.

15
00:00:57,270 --> 00:00:57,780
OK.

16
00:00:58,290 --> 00:01:04,150
So basically, instead of just telling you this object here is a cat,

17
00:01:04,170 --> 00:01:09,070
it actually tells you where the cat is, and that is the whole point of object detection.

18
00:01:10,620 --> 00:01:15,340
So let's get into the history of it and start with Haar cascade classifiers.

19
00:01:15,360 --> 00:01:19,140
Now there were many object detectors before this.

20
00:01:19,140 --> 00:01:24,840
However, Haar cascade classifiers are what made it mainstream and quite popular, because

21
00:01:24,840 --> 00:01:26,340
it was so fast.

22
00:01:26,370 --> 00:01:33,420
So basically this was developed by Viola and Jones as a face detection algorithm in 2001, not

23
00:01:33,420 --> 00:01:35,480
that long ago, 17 years ago,

24
00:01:35,520 --> 00:01:40,960
to be fair, and it was super fast and it's actually still used in a number of applications.

25
00:01:41,280 --> 00:01:43,710
Basically it's been optimized and tweaked to be even faster.

26
00:01:43,710 --> 00:01:49,890
So it basically reduces the CPU load and it's very, very accurate.

27
00:01:49,890 --> 00:01:52,930
Basically, what it is, it's a cascade of classifiers.

28
00:01:53,190 --> 00:01:56,640
That's basically how it got its name, and it uses Haar...

29
00:01:56,640 --> 00:01:58,590
Basically let's go into the next slide.

30
00:01:58,660 --> 00:02:02,760
Actually, I don't have it in this section, but it basically uses Haar features, and Haar features are

31
00:02:02,760 --> 00:02:06,210
basically like, you have rectangles

32
00:02:06,250 --> 00:02:07,100
overlaying here.

33
00:02:07,240 --> 00:02:12,690
Imagine a white rectangle here and one here, and then there are different types of Haar cascade classifiers.

34
00:02:12,810 --> 00:02:15,590
So basically it's just feature extraction,

35
00:02:15,690 --> 00:02:22,350
basically what we learned before, and it slides this box over the image, over and over, continuously

36
00:02:22,410 --> 00:02:31,950
looking for a face. They're very good, but they are pretty hard to train and develop and optimize.
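
As a rough sketch of the rectangle idea described above: a two-rectangle Haar-like feature is the pixel sum under one rectangle minus the sum under its neighbour, and an integral image makes each rectangle sum a four-lookup operation. The function names, the left-minus-right layout, and the NumPy dependency are assumptions for illustration; this is not the lecture's code or OpenCV's implementation.

```python
import numpy as np

def integral_image(gray):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

def two_rect_feature(ii, x, y, w, h):
    """Left half minus right half of a w-by-h window (hypothetical layout)."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return left - right

# Toy image: bright left half, dark right half.
img = np.zeros((4, 4), dtype=np.uint8)
img[:, :2] = 10
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 4, 4))
```

A large absolute value means the window straddles a strong vertical edge; a real cascade evaluates many such features at each window position.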

37
00:02:32,010 --> 00:02:38,010
So let's move on to histogram of gradients and SVM sliding windows. So sliding windows is a method

38
00:02:38,010 --> 00:02:43,580
where we extract segments of a full image piece by piece in the form of a rectangular extractor box.

39
00:02:43,590 --> 00:02:48,000
So I mentioned it in the previous slide when I was talking about this box being slid across this image.

40
00:02:48,330 --> 00:02:53,430
What it does here... this image is a picture of my wife from her last bodybuilding bikini competition

41
00:02:53,430 --> 00:02:54,560
two months ago.

42
00:02:54,870 --> 00:03:02,550
And what it does is just imagine this window is being moved here then down here and then down here just

43
00:03:02,550 --> 00:03:05,670
like, remember how we moved across the image

44
00:03:05,680 --> 00:03:07,960
in CNNs? It's exactly the same thing.

45
00:03:07,970 --> 00:03:14,430
And we can actually set the same parameters, like stride and the size of this box, and what this box does

46
00:03:14,430 --> 00:03:17,640
here in sliding windows with histogram of gradients and

47
00:03:17,700 --> 00:03:25,980
SVM is that it basically extracts the entire HOG, all the histograms of gradients in this box, at different scales.

48
00:03:25,980 --> 00:03:31,620
So basically it does it with the image at one scale and then another scale, a smaller scale, and then this one

49
00:03:31,620 --> 00:03:35,480
here, and this one basically has no room to go right, so it just goes straight down.

50
00:03:35,760 --> 00:03:39,480
And it tries to match up the HOG gradients with what it knows

51
00:03:39,480 --> 00:03:41,700
it's supposed to look like, to find the object.
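
To make the "gradients in this box" idea a little more concrete, here is a simplified sketch of a gradient-orientation histogram for a single window. It assumes NumPy, 9 bins over unsigned orientations from 0 to 180 degrees, and one histogram for the whole patch; a full HOG descriptor also splits the window into cells and normalizes over blocks, which is omitted here. The function name is hypothetical.

```python
import numpy as np

def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of gradient orientations for one patch."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                       # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0),
                           weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A horizontal intensity ramp: every gradient points along the x axis.
ramp = np.tile(np.arange(8), (8, 1))
print(orientation_histogram(ramp))
```

In a HOG plus SVM detector, a vector like this (built from many cells) is what the SVM scores for each window to decide whether the object is present.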

52
00:03:42,000 --> 00:03:47,400
Now as you can see this could be an effective way but it's not really that resilient.

53
00:03:47,400 --> 00:03:48,410
Why?

54
00:03:48,420 --> 00:03:53,400
Because imagine we have to do this for every segment of the image, continuously.

55
00:03:53,400 --> 00:03:55,680
It gets exhaustive and computationally expensive.
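
The sliding-window step described above can be sketched as a small generator. This is a minimal illustration assuming NumPy arrays and a fixed square stride; the function name and parameters are hypothetical rather than taken from the lecture.

```python
import numpy as np

def sliding_windows(image, window_size, stride):
    """Yield (x, y, patch) for every window position, left to right, top to bottom."""
    win_w, win_h = window_size
    height, width = image.shape[:2]
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

# A 20x20 image scanned with an 8x8 box and a stride of 4.
image = np.zeros((20, 20), dtype=np.uint8)
windows = list(sliding_windows(image, window_size=(8, 8), stride=4))
print(len(windows))
```

Each patch would then be passed to a feature extractor and classifier, which is where the cost adds up.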

56
00:03:58,720 --> 00:04:05,370
So the previous methods are basically manual feature extraction, as I just mentioned, and why would we

57
00:04:05,370 --> 00:04:10,740
want to actually manually find features if CNNs actually eliminate that?

58
00:04:10,740 --> 00:04:16,350
Right, CNNs actually automatically find features by just running all this training data

59
00:04:16,680 --> 00:04:20,350
through the algorithm and finding the loss, matching it with the correct class.

60
00:04:20,370 --> 00:04:22,770
So that's what's brilliant about CNNs.

61
00:04:22,770 --> 00:04:24,760
It takes that step away from us.

62
00:04:26,340 --> 00:04:31,970
So as I said, one of the problems with doing this is the issue of scale.

63
00:04:32,100 --> 00:04:34,920
Imagine this is a simple image just 20 by 20.

64
00:04:34,920 --> 00:04:36,870
So this box can be passed over here.

65
00:04:36,960 --> 00:04:39,630
But imagine this was a much bigger HDTV image.

66
00:04:39,720 --> 00:04:44,130
How many different times, how many different boxes would we extract?

67
00:04:44,130 --> 00:04:46,460
How do we know what size the box should be?

68
00:04:46,470 --> 00:04:50,410
I mean, that's where we rescale the image, but how many different rescalings are we going to do?

69
00:04:50,440 --> 00:04:54,830
So as you can see this is not a very effective way of doing object detection.
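
To put rough numbers on the scale problem, the sketch below counts how many windows a sliding-window detector would have to evaluate, first at one scale and then over an image pyramid. The window size, stride, and pyramid scale factor are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def windows_at_scale(width, height, win, stride):
    """Number of win-by-win positions in a width-by-height image."""
    if width < win or height < win:
        return 0
    return ((width - win) // stride + 1) * ((height - win) // stride + 1)

def windows_in_pyramid(width, height, win, stride, scale=1.5):
    """Total windows when the image is repeatedly shrunk by `scale`."""
    total = 0
    w, h = width, height
    while w >= win and h >= win:
        total += windows_at_scale(w, h, win, stride)
        w, h = int(w / scale), int(h / scale)
    return total

print(windows_at_scale(20, 20, win=8, stride=4))        # small 20x20 image
print(windows_at_scale(1920, 1080, win=64, stride=8))   # assumed HD frame
print(windows_in_pyramid(1920, 1080, win=64, stride=8))
```

Every one of those windows needs its own feature extraction and classification, which is why the count grows so quickly with image size and the number of rescalings.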

70
00:04:56,430 --> 00:05:02,600
So to talk a bit about histogram of gradients: I'm not going to go into this in detail. I've taught

71
00:05:02,600 --> 00:05:05,480
this in my other OpenCV course. You can...

72
00:05:05,480 --> 00:05:07,280
The video is included free in that section.

73
00:05:07,290 --> 00:05:09,230
So that's why I'm not going to talk about it much here.

74
00:05:09,550 --> 00:05:15,290
But basically the slides are here for you to go through on your own and you can pretty much infer from

75
00:05:15,290 --> 00:05:17,720
these steps here what HOGs really are.

76
00:05:20,110 --> 00:05:22,090
So now we move on to R-CNNs.