1
00:00:00,470 --> 00:00:01,000
OK.

2
00:00:01,050 --> 00:00:02,430
So let's start at the beginning.

3
00:00:02,460 --> 00:00:05,520
Let's talk about object detection, or really, object detectors.

4
00:00:05,670 --> 00:00:11,910
So I'm going to introduce you to the history of it. So firstly, object detection is one of the holy grails of

5
00:00:11,910 --> 00:00:17,610
computer vision because previously what we have been doing is just classifying like an entire image

6
00:00:17,610 --> 00:00:20,510
and saying what object, or what class, it belongs to.

7
00:00:20,730 --> 00:00:26,490
But can we take an image like this and label each major component as being a dog, car, person, horse,

8
00:00:26,760 --> 00:00:28,340
person in the back?

9
00:00:28,350 --> 00:00:32,230
Not yet, not until we come to object detection.

10
00:00:32,640 --> 00:00:40,620
So object detection is a mix of object classification and localization. Object detection is the identification

11
00:00:40,650 --> 00:00:43,120
of a bounding box outlining the object.

12
00:00:43,140 --> 00:00:49,590
So like with my face here, it's basically drawing a bounding box around my face, and face detection is perhaps

13
00:00:49,590 --> 00:00:53,760
one of the most popular object detection algorithms that we all know.

14
00:00:53,830 --> 00:00:57,220
We're all quite familiar with it from using the cameras in our cell phones.

15
00:00:57,270 --> 00:00:57,780
OK.

16
00:00:58,290 --> 00:01:04,150
So basically, instead of only telling you that this object here is a cat,

17
00:01:04,170 --> 00:01:09,070
it actually tells you where the cat is, and that is the whole point of object detection.

18
00:01:10,620 --> 00:01:15,340
So let's get into the history of it and start with Haar cascade classifiers.

19
00:01:15,360 --> 00:01:19,140
Now there were many object detectors before this.

20
00:01:19,140 --> 00:01:24,840
However, Haar cascade classifiers are what made it mainstream and quite popular, because

21
00:01:24,840 --> 00:01:26,340
it was so fast.

22
00:01:26,370 --> 00:01:33,420
So basically this was developed by Viola and Jones in their face detection algorithm in 2001, not

23
00:01:33,420 --> 00:01:35,480
that long ago, 17 years ago.

24
00:01:35,520 --> 00:01:40,960
To be fair, it was super fast, and it's actually still used in a number of applications.

25
00:01:41,280 --> 00:01:43,710
Basically it's been optimized and tweaked to be even faster.

26
00:01:43,710 --> 00:01:49,890
So it basically reduces the CPU load, and it's very, very accurate.

27
00:01:49,890 --> 00:01:52,930
Basically, what it is, is a cascade of classifiers.

28
00:01:53,190 --> 00:01:56,640
That's basically how it got its name, and it uses Haar features.

29
00:01:56,640 --> 00:01:58,590
Basically let's go into the next slide.

30
00:01:58,660 --> 00:02:02,760
Actually I don't have it in this section, but it basically uses Haar features, and Haar features are

31
00:02:02,760 --> 00:02:06,210
basically like this: you have rectangles

32
00:02:06,250 --> 00:02:07,100
overlaid here.

33
00:02:07,240 --> 00:02:12,690
You imagine a white rectangle here and one here, and then there are different types of Haar cascade classifiers.

34
00:02:12,810 --> 00:02:15,590
So basically it's just feature extraction.

35
00:02:15,690 --> 00:02:22,350
Basically what we learned before, and it slides this box, this window, over the image over and over, continuously

36
00:02:22,410 --> 00:02:31,950
looking for a face. They're very good, but they are pretty hard to train, develop, and optimize.

37
00:02:32,010 --> 00:02:38,010
So let's move on to histogram of gradients and SVM sliding windows. So sliding windows is a method

38
00:02:38,010 --> 00:02:43,580
where we extract segments of a full image piece by piece in the form of a rectangular extractor box.

39
00:02:43,590 --> 00:02:48,000
So I mentioned it in the previous slide when I was talking about this box being slid across the image.

40
00:02:48,330 --> 00:02:53,430
What it does here, and this image is a picture of my wife from her last bodybuilding bikini competition

41
00:02:53,430 --> 00:02:54,560
two months ago.

42
00:02:54,870 --> 00:03:02,550
And what it does is, just imagine this window being moved here, then down here, and then down here, just

43
00:03:02,550 --> 00:03:05,670
like, remember, how we moved across the image

44
00:03:05,680 --> 00:03:07,960
in CNNs. It's exactly the same thing.

45
00:03:07,970 --> 00:03:14,430
And we can actually set the same parameters, like stride and the size of this box, and what this box does

46
00:03:14,430 --> 00:03:17,640
here, in sliding windows with histogram of gradients and

47
00:03:17,700 --> 00:03:25,980
SVM, is that it basically extracts the entire HOG, all the gradients, in this box at different scales.

48
00:03:25,980 --> 00:03:31,620
So basically it does it with the image at one scale, and then another scale, a smaller scale, and then this one

49
00:03:31,620 --> 00:03:35,480
here, and this one basically has no room to go right, so it just goes straight down.

50
00:03:35,760 --> 00:03:39,480
And it tries to match up the HOG gradients with what it knows

51
00:03:39,480 --> 00:03:41,700
the object is supposed to look like, to find the object.

52
00:03:42,000 --> 00:03:47,400
Now, as you can see, this could be an effective way, but it's not really that resilient.

53
00:03:47,400 --> 00:03:48,410
Why?

54
00:03:48,420 --> 00:03:53,400
Because imagine we have to do this for every segment of the image, continuously.

55
00:03:53,400 --> 00:03:55,680
It gets exhaustive and computationally expensive.

56
00:03:58,720 --> 00:04:05,370
So previous detection, which is basically custom feature extraction, I just mentioned that. And why would we

57
00:04:05,370 --> 00:04:10,740
want to actually manually find features if CNNs actually eliminate that?

58
00:04:10,740 --> 00:04:16,350
Right, CNNs actually automatically find features by just running all this training data

59
00:04:16,680 --> 00:04:20,350
through the algorithm and finding the loss, matching it with the correct class.

60
00:04:20,370 --> 00:04:22,770
So that's what's brilliant about CNNs.

61
00:04:22,770 --> 00:04:24,760
It takes that step away from us.

62
00:04:26,340 --> 00:04:31,970
So as I said, one of the problems with doing this is the issue of scale.

63
00:04:32,100 --> 00:04:34,920
Imagine this is a simple image just 20 by 20.

64
00:04:34,920 --> 00:04:36,870
So this box can be passed over here.

65
00:04:36,960 --> 00:04:39,630
But imagine this was a much bigger, HDTV-sized image.

66
00:04:39,720 --> 00:04:44,130
How many different times, how many different boxes would we extract?

67
00:04:44,130 --> 00:04:46,460
How do we know what size the box should be?

68
00:04:46,470 --> 00:04:50,410
I mean, that's where we rescale the image, but how many different rescalings are we going to do?

69
00:04:50,440 --> 00:04:54,830
So as you can see this is not a very effective way of doing object detection.

70
00:04:56,430 --> 00:05:02,600
So to talk a bit about histogram of gradients: I'm not going to go into this in detail. I've taught

71
00:05:02,600 --> 00:05:05,480
this in my other OpenCV course.

72
00:05:05,480 --> 00:05:07,280
The video is included free in that section.

73
00:05:07,290 --> 00:05:09,230
So that's why I'm not going to talk about it much here.

74
00:05:09,550 --> 00:05:15,290
But basically the slides are here for you to go through on your own and you can pretty much infer from

75
00:05:15,290 --> 00:05:17,720
these steps here what HOGs really are.

76
00:05:20,110 --> 00:05:22,090
So now we move on to R-CNNs.