AI_DL_Assignment / 20. Principles of Object Detection /2. Object Detection Introduction - Sliding Windows with HOGs.srt
1
00:00:00,470 --> 00:00:01,000
OK.
2
00:00:01,050 --> 00:00:02,430
So let's start at the beginning.
3
00:00:02,460 --> 00:00:05,520
Let's talk about object detection, really object detectors.
4
00:00:05,670 --> 00:00:11,910
So I'm going to introduce you to the history of it. So firstly, object detection is one of the holy grails of
5
00:00:11,910 --> 00:00:17,610
computer vision because previously what we have been doing is just classifying like an entire image
6
00:00:17,610 --> 00:00:20,510
and seeing what class the objects belong to.
7
00:00:20,730 --> 00:00:26,490
But can we take an image like this and label each major component as being a dog, car, horse, or the
8
00:00:26,760 --> 00:00:28,340
person in the back.
9
00:00:28,350 --> 00:00:32,230
Not until we come across object detection.
10
00:00:32,640 --> 00:00:40,620
So object detection is a mix of object classification and localization. Localization is the identification
11
00:00:40,650 --> 00:00:43,120
of a bounding box outlining the object.
12
00:00:43,140 --> 00:00:49,590
So, like with my face here, it's basically extracting a bounding box around my face, and face detection is perhaps
13
00:00:49,590 --> 00:00:53,760
one of the most popular object detection algorithms that we all know.
14
00:00:53,830 --> 00:00:57,220
We're all quite familiar with from using cameras in our cell phones.
15
00:00:57,270 --> 00:00:57,780
OK.
16
00:00:58,290 --> 00:01:04,150
So basically, instead of just telling you this object here is a cat,
17
00:01:04,170 --> 00:01:09,070
it actually tells you where the cat is, and that is the whole point of object detection.
18
00:01:10,620 --> 00:01:15,340
So let's get into the history of it and start with Haar cascade classifiers.
19
00:01:15,360 --> 00:01:19,140
Now, there were many object detectors before this.
20
00:01:19,140 --> 00:01:24,840
However, the Haar cascade classifier, this is what made it mainstream and quite popular, because
21
00:01:24,840 --> 00:01:26,340
it was so fast.
22
00:01:26,370 --> 00:01:33,420
So basically this was developed by Viola and Jones in their face detection algorithm in 2001, not
23
00:01:33,420 --> 00:01:35,480
that long ago, 17 years ago,
24
00:01:35,520 --> 00:01:40,960
to be fair. And it was super fast, and it's actually still used in a number of applications.
25
00:01:41,280 --> 00:01:43,710
Basically it's been optimized and tweaked to be even faster.
26
00:01:43,710 --> 00:01:49,890
So it basically reduces the CPU load, and it's very, very accurate.
27
00:01:49,890 --> 00:01:52,930
Basically, what it is is a cascade of classifiers.
28
00:01:53,190 --> 00:01:56,640
That's basically how it got its name, and it uses Haar features.
29
00:01:56,640 --> 00:01:58,590
Basically let's go into the next slide.
30
00:01:58,660 --> 00:02:02,760
Actually, I don't have it in this section, but it basically uses Haar features, and Haar features are
31
00:02:02,760 --> 00:02:06,210
basically like you have rectangles
32
00:02:06,250 --> 00:02:07,100
overlapping here.
33
00:02:07,240 --> 00:02:12,690
You imagine a white rectangle here and one here, and then there are different types of Haar features.
34
00:02:12,810 --> 00:02:15,590
So basically it's just feature extraction,
35
00:02:15,690 --> 00:02:22,350
basically what we learned before, and it slides this box, the window, over and over continuously,
36
00:02:22,410 --> 00:02:31,950
looking for a face. They're very good, but they are pretty hard to train, develop, and optimize.
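The rectangle features just described can be sketched in code. This is only an illustrative example, not the course's code: a two-rectangle Haar-like feature computed from an integral image (summed-area table), which is the trick that makes Viola-Jones so fast; all function names here are my own.

```python
# Illustrative sketch of a two-rectangle Haar-like feature computed with
# an integral image. Real Haar cascades (e.g. Viola-Jones) combine many
# such features in a cascade of boosted classifiers.

def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), size w x h,
    using only four lookups into the integral image."""
    a = ii[y + h - 1][x + w - 1]
    b = ii[y - 1][x + w - 1] if y > 0 else 0
    c = ii[y + h - 1][x - 1] if x > 0 else 0
    d = ii[y - 1][x - 1] if x > 0 and y > 0 else 0
    return a - b - c + d

def haar_two_rect(ii, x, y, w, h):
    """White (left) rectangle sum minus black (right) rectangle sum."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

img = [[1, 1, 5, 5],
       [1, 1, 5, 5]]  # dark left half, bright right half
ii = integral_image(img)
print(haar_two_rect(ii, 0, 0, 4, 2))  # 4 - 20 = -16: a strong edge response
```

The point of the integral image is that each rectangle sum costs only four lookups, no matter how big the rectangle is.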
37
00:02:32,010 --> 00:02:38,010
So let's move on to histograms of oriented gradients (HOGs) with SVM sliding windows. Sliding windows is a method
38
00:02:38,010 --> 00:02:43,580
where we extract segments of a full image, piece by piece, in the form of a rectangular extraction box.
39
00:02:43,590 --> 00:02:48,000
So I mentioned it in the previous slide when I was talking about this box being slid across this image.
40
00:02:48,330 --> 00:02:53,430
Here in this image is a picture of my wife from her last bodybuilding bikini competition
41
00:02:53,430 --> 00:02:54,560
two months ago.
42
00:02:54,870 --> 00:03:02,550
And what it does is, just imagine this window is being moved here, then down here, and then down here, just
43
00:03:02,550 --> 00:03:05,670
like, remember how we moved across the image
44
00:03:05,680 --> 00:03:07,960
in CNNs? It's exactly the same thing.
45
00:03:07,970 --> 00:03:14,430
And we can actually set the same parameters like stride and the size of this box and what this box does
46
00:03:14,430 --> 00:03:17,640
here, in sliding windows with histograms of oriented gradients and
47
00:03:17,700 --> 00:03:25,980
SVM, is that it basically extracts the HOG, all the gradients, in this box at different scales.
48
00:03:25,980 --> 00:03:31,620
So basically it does it with the image at one scale, and then another, smaller scale, and then this one
49
00:03:31,620 --> 00:03:35,480
here, and this one basically has no room to go right, so it just goes straight down.
50
00:03:35,760 --> 00:03:39,480
And it tries to match up the HOG gradients with what it knows
51
00:03:39,480 --> 00:03:41,700
the object is supposed to look like, to find the object.
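As a rough sketch of what a histogram of oriented gradients boils down to (illustrative only, not the course's code; a real HOG descriptor also adds a grid of cells, block normalization, and interpolation):

```python
# Minimal sketch of a HOG-style orientation histogram for one cell:
# bin the gradient orientations of a patch, weighted by gradient
# magnitude. Function name and parameters are illustrative.
import math

def cell_histogram(patch, n_bins=9):
    """Histogram of unsigned gradient orientations (0-180 deg)."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # horizontal gradient
            gy = patch[y + 1][x] - patch[y - 1][x]   # vertical gradient
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang / 180.0 * n_bins) % n_bins] += mag
    return hist

# A vertical edge: all gradients point horizontally, so all the
# weight lands in the 0-degree bin.
patch = [[0, 0, 9, 9]] * 4
print(cell_histogram(patch))
```

Concatenating many such cell histograms gives the feature vector that the SVM compares against what the object "is supposed to look like."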
52
00:03:42,000 --> 00:03:47,400
Now, as you can see, this could be an effective way, but it's not really that efficient.
53
00:03:47,400 --> 00:03:48,410
Why?
54
00:03:48,420 --> 00:03:53,400
Because imagine we have to do this for every segment of the image, continuously.
55
00:03:53,400 --> 00:03:55,680
It gets exhaustive and computationally expensive
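The sliding-window-plus-pyramid loop just described can be sketched like this (an illustrative example with an assumed window size, stride, and scale factor, not the course's code; each yielded window would be fed to the HOG + SVM classifier):

```python
# Sketch of exhaustive sliding-window search over an image pyramid.

def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield the top-left (x, y) of every window position."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y

def pyramid_scales(img_w, img_h, win_w, win_h, factor=0.75):
    """Yield downscaled image sizes until the window no longer fits."""
    w, h = img_w, img_h
    while w >= win_w and h >= win_h:
        yield w, h
        w, h = int(w * factor), int(h * factor)

# Count how many windows a 64x128 detector box would visit on a
# modest 640x480 image with a stride of 16 pixels.
total = 0
for w, h in pyramid_scales(640, 480, 64, 128):
    total += sum(1 for _ in sliding_windows(w, h, 64, 128, stride=16))
print(total)  # well over a thousand classifier evaluations for one image
```

Every one of those positions means one full HOG extraction plus one SVM evaluation, which is exactly why this gets so expensive.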
56
00:03:58,720 --> 00:04:05,370
so the previous approach, which is basically manual feature extraction, I just mentioned that. And why would we
57
00:04:05,370 --> 00:04:10,740
want to manually find these features if CNNs actually eliminate that?
58
00:04:10,740 --> 00:04:16,350
All right, CNNs actually automatically find features by just running all this training data
59
00:04:16,680 --> 00:04:20,350
through the algorithm, finding the loss, and matching it with the correct class.
60
00:04:20,370 --> 00:04:22,770
So that's what's brilliant about CNNs.
61
00:04:22,770 --> 00:04:24,760
It takes that step away from us.
62
00:04:26,340 --> 00:04:31,970
So as I said, one of the problems with doing this is the issue of scale.
63
00:04:32,100 --> 00:04:34,920
Imagine this is a simple image just 20 by 20.
64
00:04:34,920 --> 00:04:36,870
So this box can be passed over here.
65
00:04:36,960 --> 00:04:39,630
But imagine this was a much bigger image, like a CCTV image.
66
00:04:39,720 --> 00:04:44,130
How many different times, how many different boxes, would we extract?
67
00:04:44,130 --> 00:04:46,460
How do we know what size the box should be?
68
00:04:46,470 --> 00:04:50,410
I mean, that's where we rescale the image, but how many different rescalings are we going to do?
69
00:04:50,440 --> 00:04:54,830
So as you can see this is not a very effective way of doing object detection.
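To make the scale problem concrete, here is a rough count of window positions at a single scale (an illustrative formula with assumed sizes and stride, not from the course):

```python
# Number of window positions at one scale:
# ((W - w) // stride + 1) * ((H - h) // stride + 1)
def n_windows(W, H, w, h, stride):
    """Count sliding-window positions for a w x h box on a W x H image."""
    return ((W - w) // stride + 1) * ((H - h) // stride + 1)

print(n_windows(20, 20, 10, 10, 5))       # tiny 20x20 image: 9 windows
print(n_windows(1920, 1080, 64, 128, 8))  # full-HD frame: 27960 windows
```

And that is before multiplying by the number of rescalings, which is why exhaustive sliding windows does not scale to large images.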
70
00:04:56,430 --> 00:05:02,600
So, talking a bit about histograms of oriented gradients: I'm not going to go into this in detail. I taught
71
00:05:02,600 --> 00:05:05,480
this in my other OpenCV course.
72
00:05:05,480 --> 00:05:07,280
The video is included free in that section.
73
00:05:07,290 --> 00:05:09,230
So that's why I'm not going to talk about it much here.
74
00:05:09,550 --> 00:05:15,290
But basically the slides are here for you to go through on your own and you can pretty much infer from
75
00:05:15,290 --> 00:05:17,720
these steps here what HOGs really are.
76
00:05:20,110 --> 00:05:22,090
So now we move on to R-CNNs.