1 00:00:00,470 --> 00:00:01,000 OK. 2 00:00:01,050 --> 00:00:02,430 So let's start at the beginning. 3 00:00:02,460 --> 00:00:05,520 Let's talk about object detection, really object detectors. 4 00:00:05,670 --> 00:00:11,910 So I'm going to introduce you to the history of it. So firstly, object detection is one of the holy grails of 5 00:00:11,910 --> 00:00:17,610 computer vision, because previously what we have been doing is just classifying an entire image 6 00:00:17,610 --> 00:00:20,510 and seeing what class it belongs to. 7 00:00:20,730 --> 00:00:26,490 But can we take an image like this and label each major component as being a dog, car, person, horse, 8 00:00:26,760 --> 00:00:28,340 person in the back? 9 00:00:28,350 --> 00:00:32,230 Not yet, not until we come to object detection. 10 00:00:32,640 --> 00:00:40,620 So object detection is a mix of object classification and localization. Object detection is the identification 11 00:00:40,650 --> 00:00:43,120 of a bounding box outlining the object. 12 00:00:43,140 --> 00:00:49,590 So like my face here: basically it's extracting a bounding box around my face, and face detection is perhaps 13 00:00:49,590 --> 00:00:53,760 one of the most popular object detection algorithms that we all know. 14 00:00:53,830 --> 00:00:57,220 We're all quite familiar with it from using the cameras in our cell phones. 15 00:00:57,270 --> 00:00:57,780 OK. 16 00:00:58,290 --> 00:01:04,150 So basically, instead of just telling you this object here is a cat, 17 00:01:04,170 --> 00:01:09,070 it actually tells you where the cat is, and that is the whole point of object detection. 18 00:01:10,620 --> 00:01:15,340 So let's get into the history of it and start with Haar cascade classifiers. 19 00:01:15,360 --> 00:01:19,140 Now there were many object detectors before this. 
20 00:01:19,140 --> 00:01:24,840 However, Haar cascade classifiers are what made it mainstream and quite popular because 21 00:01:24,840 --> 00:01:26,340 it was so fast. 22 00:01:26,370 --> 00:01:33,420 So basically this was developed by Viola and Jones in their face detection algorithm in 2001, not 23 00:01:33,420 --> 00:01:35,480 that long ago, 17 years ago 24 00:01:35,520 --> 00:01:40,960 to be fair, and it was super fast and it's actually still used in a number of applications. 25 00:01:41,280 --> 00:01:43,710 Basically it's been optimized and tweaked to be even faster. 26 00:01:43,710 --> 00:01:49,890 So it basically reduces the CPU load and it's very, very accurate. 27 00:01:49,890 --> 00:01:52,930 Basically what it does: it's a cascade of classifiers. 28 00:01:53,190 --> 00:01:56,640 That's basically how it got its name, and it uses Haar features. 29 00:01:56,640 --> 00:01:58,590 Basically let's go into the next slide. 30 00:01:58,660 --> 00:02:02,760 Actually I don't have it in this section, but it basically uses Haar features, and Haar features are 31 00:02:02,760 --> 00:02:06,210 basically like you have rectangles 32 00:02:06,250 --> 00:02:07,100 overlaying here. 33 00:02:07,240 --> 00:02:12,690 You imagine a white rectangle here and one here, and then there are different types of Haar cascade classifiers. 34 00:02:12,810 --> 00:02:15,590 So basically it's just feature extraction, 35 00:02:15,690 --> 00:02:22,350 basically what we learned before, and it slides this box over the image over and over, continuously 36 00:02:22,410 --> 00:02:31,950 looking for a face. They're very good, but they are pretty hard to train and develop and optimize. 37 00:02:32,010 --> 00:02:38,010 So let's move on to histogram of gradients and SVM sliding windows. So sliding windows is a method 38 00:02:38,010 --> 00:02:43,580 where we extract segments of a full image piece by piece in the form of a rectangular extractor box. 
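To make the sliding window idea concrete, here is a minimal sketch in plain Python. It only generates the box coordinates that a detector would then crop and score. The function name `sliding_windows` and the sizes used (a 20 by 20 image, an 8 by 8 box, a stride of 4) are illustrative assumptions, not values taken from the slides.

```python
def sliding_windows(image_h, image_w, win_h, win_w, stride):
    """Yield (top, left, bottom, right) boxes, scanning left to right, then down.

    Hypothetical helper for illustration: it only produces coordinates;
    a real detector would crop each box and pass it to a classifier.
    """
    for top in range(0, image_h - win_h + 1, stride):
        for left in range(0, image_w - win_w + 1, stride):
            yield (top, left, top + win_h, left + win_w)


# Assumed example sizes: 20x20 image, 8x8 window, stride of 4 pixels.
boxes = list(sliding_windows(20, 20, 8, 8, 4))
for box in boxes[:3]:
    print(box)
```

The stride plays the same role as in a convolution: a smaller stride gives more overlapping boxes and finer localization, at the cost of more boxes to classify.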
39 00:02:43,590 --> 00:02:48,000 So I mentioned it in the previous slide when I was talking about this box being slid across the image. 40 00:02:48,330 --> 00:02:53,430 What we have here in this image is a picture of my wife from her last bodybuilding bikini competition 41 00:02:53,430 --> 00:02:54,560 two months ago. 42 00:02:54,870 --> 00:03:02,550 And what it does is, just imagine this window being moved here, then down here, and then down here, just 43 00:03:02,550 --> 00:03:05,670 like, remember how we moved across the image 44 00:03:05,680 --> 00:03:07,960 in CNNs? It's exactly the same thing. 45 00:03:07,970 --> 00:03:14,430 And we can actually set the same parameters, like stride and the size of this box, and what this box does 46 00:03:14,430 --> 00:03:17,640 here in sliding windows with histogram of gradients 47 00:03:17,700 --> 00:03:25,980 and SVM is that it basically extracts the HOGs, all the gradients, in this box at different scales. 48 00:03:25,980 --> 00:03:31,620 So basically it does it with the image at one scale, and then another scale, a smaller scale, and then this one 49 00:03:31,620 --> 00:03:35,480 here, and this one basically has no room to go right, so it just goes straight down. 50 00:03:35,760 --> 00:03:39,480 And it tries to match up the HOG gradients with what it knows 51 00:03:39,480 --> 00:03:41,700 the object is supposed to look like, to find the object. 52 00:03:42,000 --> 00:03:47,400 Now as you can see, this could be an effective way, but it's not really that resilient. 53 00:03:47,400 --> 00:03:48,410 Why? 54 00:03:48,420 --> 00:03:53,400 Because imagine we have to do this for every segment of the image continuously. 55 00:03:53,400 --> 00:03:55,680 It gets exhaustive and computationally expensive. 56 00:03:58,720 --> 00:04:05,370 So the previous section, which is basically manual feature extraction, I just mentioned that, and why would we 57 00:04:05,370 --> 00:04:10,740 want to actually manually find HOG features if CNNs actually eliminate that? 
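To see why scanning every segment at several scales gets expensive, here is a rough sketch that counts how many boxes a sliding window visits across a shrinking image pyramid. The function name `count_windows` and the defaults (a 64 pixel square box, a stride of 8, a rescale factor of 1.5) are assumptions chosen for illustration, not settings from the lecture.

```python
def count_windows(image_h, image_w, win=64, stride=8, scale=1.5):
    """Count sliding-window positions summed over an image pyramid.

    Hypothetical helper: the image is repeatedly shrunk by `scale`
    until the fixed-size window no longer fits, and the number of
    window positions at each level is added up.
    """
    total = 0
    h, w = image_h, image_w
    while h >= win and w >= win:
        rows = (h - win) // stride + 1  # vertical window positions
        cols = (w - win) // stride + 1  # horizontal window positions
        total += rows * cols
        h, w = int(h / scale), int(w / scale)  # next, smaller pyramid level
    return total


# Assumed example resolutions for comparison.
print(count_windows(480, 640))
print(count_windows(1080, 1920))
```

Every one of those boxes needs its own feature extraction and classifier call, so the cost grows quickly with image size, with a smaller stride, and with the number of scales.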
58 00:04:10,740 --> 00:04:16,350 All right, CNNs actually automatically find features by just running all this training data 59 00:04:16,680 --> 00:04:20,350 through the algorithm and finding the class, matching it with the correct class. 60 00:04:20,370 --> 00:04:22,770 So that's what's brilliant about CNNs. 61 00:04:22,770 --> 00:04:24,760 It takes that step away from us. 62 00:04:26,340 --> 00:04:31,970 So as I said, one of the problems with doing this is the issue of scale. 63 00:04:32,100 --> 00:04:34,920 Imagine this is a simple image, just 20 by 20. 64 00:04:34,920 --> 00:04:36,870 So this box can be passed over here. 65 00:04:36,960 --> 00:04:39,630 But imagine this was a much bigger HDTV image. 66 00:04:39,720 --> 00:04:44,130 How many different times, how many different boxes would we extract? 67 00:04:44,130 --> 00:04:46,460 How do we know what size the box should be? 68 00:04:46,470 --> 00:04:50,410 I mean, that's where we rescale the image, but how many different rescalings are we going to do? 69 00:04:50,440 --> 00:04:54,830 So as you can see, this is not a very effective way of doing object detection. 70 00:04:56,430 --> 00:05:02,600 So to talk a bit about histogram of gradients: I'm not going to go into this in detail, as I taught 71 00:05:02,600 --> 00:05:05,480 this in my other OpenCV course, you can... 72 00:05:05,480 --> 00:05:07,280 The video is included free in that section. 73 00:05:07,290 --> 00:05:09,230 So that's why I'm not going to talk about it much here. 74 00:05:09,550 --> 00:05:15,290 But basically the slides are here for you to go through on your own, and you can pretty much infer from 75 00:05:15,290 --> 00:05:17,720 these steps here what HOGs really are. 76 00:05:20,110 --> 00:05:22,090 So now we move on to R-CNNs.