IBM Machine Learning, Unsupervised Learning, Deep Learning, and Reinforcement Learning Notes (Complete)

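001: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p01 0_Course Introduction.zh_en -BV1eu4m1F7oz_p1-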

Hi, my name is Miguel, and I am one of your instructors in our course on unsupervised learning.

In this course, you will learn the tools and techniques that help you leverage data that doesn't have a target variable or a label variable.

Companies from around the globe use unsupervised learning to segment their customers, assess the quality of their data, and group similar observations together for their analysis.

In this course, you will learn clustering techniques like K-means, hierarchical clustering, and DBSCAN, as well as dimensionality reduction techniques like principal component analysis and matrix factorization.

A very important part of our course is the final project. We recommend that you post your solution online, be it on an online portfolio, a personal repository, the IBM online community, or a GitHub page. It will help you highlight your machine learning and analytics skills.

If you need any help, please reach out to your peers and instructors using the discussion boards. We are in this together and we will help one another, and with that, I will see you in the course.

Thank you.

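002: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p02 1_Overview of Unsupervised Learning.zh_en -BV1eu4m1F7oz_p2-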

In this set of videos, we will reintroduce the concept of unsupervised learning and what it entails. This will serve as the foundation for the remainder of this specific course.

In the last set of courses, we dove into the algorithms available, assuming that we have the known outcome available in our data set. In this course, we're going to talk about a whole other class of machine learning algorithms called unsupervised learning.

This class of algorithms is relevant when we don't have outcomes we are trying to predict, but rather are interested in finding structure within our data set and perhaps want to partition our data into smaller pieces.

Now there can be a couple of use cases for unsupervised learning.

One popular use case is called clustering, where we use our unlabeled data to identify an unknown structure. An example of this may be segmenting our customers into different groups.

The other major use case for unsupervised algorithms is dimensionality reduction, namely using structural characteristics to reduce the size of our data set without losing much of the information contained in that original data set.

Now, in regard to clustering, we'll be covering the K-means algorithm, the hierarchical agglomerative clustering algorithm, the DBSCAN algorithm, and the mean shift algorithm.

And then in regard to dimensionality reduction, we'll be covering principal component analysis, or PCA, as well as non-negative matrix factorization. We won't go into detail on these here, but we'll go into each of them in more depth as we get through these videos.

Now, just to give you some intuition as to why dimensionality reduction will be important, let's talk about that infamous curse of dimensionality, or at least infamous for those of us in these circles.

Dimensionality refers to the number of features in our data. Theoretically, and in ideal situations, the more features we have, the better the model should perform, since models have more things to learn from and should therefore be more successful.

However, real life is more complicated than that, and there are several reasons why too many features may end up leading to worse performance in practice.

If you have too many features, several things can go wrong.

Maybe some of those features are spurious correlations, meaning they correlate within your data set, but maybe not outside your data set as new data comes in.

Too many features may create more noise than signal. Algorithms find it harder to sort through non-meaningful features if you have too many features.

And then the number of training examples required will increase exponentially with the dimensionality.

This becomes especially clear when we think about distance-based algorithms such as the K-nearest neighbors that we talked about in our last course.

So if we look here, imagine that we have a survey question with 10 possible responses. To get 60% coverage of those 10 possible responses, we only need six answers; we only need six different people to answer that survey.

If we add a second survey question with 10 possible response values, then to get that same 60% coverage, so that the K nearest neighbors are the same distance from whatever new value comes in, we would need 60 people to respond. So we need 60 different rows of data to get the same coverage that we had with just six rows in one dimension.

And then you can imagine what happens once we increase that to three dimensions, with three different survey questions, each one with 10 possible positions. To get that same coverage, with each neighbor as close as it was in that original one dimension with only 10 positions, we would need 600 different rows.

So you see that the more dimensions you add, the more rows you need: the more data you need to get that same amount of coverage.

Now, on top of that, higher dimensions will often lead to slower performance, as dealing with more columns is going to be more computationally expensive. It will also lead to the incidence of outliers increasing as the number of dimensions increases.
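The survey arithmetic above can be checked with a few lines of Python. This is only a sketch of the lecture's example; the function name `rows_needed` and its parameters are hypothetical and chosen for illustration.

```python
def rows_needed(positions_per_survey: int, n_surveys: int, coverage: float) -> int:
    """Rows needed to cover a given fraction of all possible answer combinations."""
    total_cells = positions_per_survey ** n_surveys  # 10, 100, 1000, ...
    # round() guards against floating-point noise in the multiplication
    return round(coverage * total_cells)


for n_surveys in (1, 2, 3):
    print(n_surveys, rows_needed(10, n_surveys, 0.6))
# 1 6
# 2 60
# 3 600
```

Each extra dimension multiplies the required number of rows by 10, which is the exponential growth the lecture refers to.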

So to mitigate some, not all, of the problems I just mentioned, one usually needs a lot of rows to train on, which may not be possible in real life. You may not be able to gather these 600 different examples, and in practice you would usually have many more than three dimensions, so you would need that many more rows to get a certain amount of coverage.

Therefore, it often becomes necessary to reduce the dimension of one's data set.

So far we have seen feature selection as a way of achieving this, and in this course, we'll discuss how we can accomplish the same goal using unsupervised machine learning models such as principal component analysis, or PCA, which we just mentioned.

Now, to think about this with a real-life example: this curse of dimensionality comes up often in applications. Consider the customer churn example that we discussed in earlier courses. The original data set had 54 different columns, so 54 different features.

Some, like age, under 30, or senior citizen, will obviously be very closely related. Others, such as latitude, for example, are essentially duplicated throughout. And even if we remove duplicates and non-numeric columns, this curse of dimensionality can still apply. We can still have too many columns, even if they are not necessarily perfectly correlated.

Now, here are things that we can do with this churn data set. Clustering can help identify groups of similar customers without us thinking about whether or not they churn; maybe that allows us to segment our customers into different groupings.

And then dimensionality reduction can improve both the performance, because it speeds things up as we reduce the number of features, and the interpretability of each of these groupings that we just came up with.

Now, just a high-level overview. When we're working with unsupervised learning, we start off with an unlabeled data set. We then fit a model to that unlabeled data set, depending on the model that we choose, and we get our fitted model.

Once we have that model, we can look at new, again unlabeled, data. We're still working with unlabeled data, but we can take this new data, use the model that we just fit, and use it to predict either the groupings for the new data or its reduced-dimension representation, depending on whether we are doing clustering or dimensionality reduction.

As an example for clustering, say we want to group news articles by topic and we don't have those topics as labels. So our starting point is text articles of unknown topics. We then create our model, whether that's K-means or any of the other models that we will discuss, in order to see what kind of groupings we naturally find in our data set.

We fit that model to the data set, so that, according to certain features within these articles, such as certain words showing up, we come up with certain groupings.

We then take another group of text articles of unknown topics and use the model that we just fit. Again, that model used certain words, certain features, to determine the groupings, so we can then predict which articles are similar to the articles that we had in our original data set.
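The fit-then-predict workflow described above might look roughly like the following in scikit-learn. This is a minimal sketch, not the course's code: the tiny corpus is made up, and the choice of `TfidfVectorizer` to turn words into features is an assumption (the lecture only says that the model uses "certain words").

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up articles of "unknown" topic (two about sport, two about finance).
train_articles = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the central bank raised interest rates",
    "stock markets fell after the interest rates decision",
]
new_articles = [
    "the goalkeeper saved a penalty in the match",
    "investors expect interest rates to fall",
]

# Turn each article into word-based features (an assumed featurization).
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_articles)

# Fit the clustering model on the unlabeled training articles.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X_train)

# Reuse the SAME fitted vectorizer and model on new, still unlabeled, articles.
X_new = vectorizer.transform(new_articles)
new_labels = model.predict(X_new)

print(model.labels_)  # cluster index of each training article
print(new_labels)     # cluster index assigned to each new article
```

The cluster indices themselves are arbitrary; what matters is that new articles land in the same cluster as the training articles they share words with.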

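003: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p03 2_Clustering Use Cases for Unsupervised Learning.zh_en -BV1eu4m1F7oz_p3-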

Now let's talk about some common use cases out in the real world for clustering. Clustering will be used for classification, for anomaly detection, for customer segmentation, as well as even for improving supervised learning models.

So a common use case to start with is classification, as we mentioned, for data that is not labeled. Even if your data does not have a column that specifies the classes, clustering algorithms will try to find heterogeneous groupings within your data set.

Examples of this for unlabeled data include finding groupings that are different from your normal emails to help you identify spam (again, assuming you don't have labels available), or finding subgroups in text like product reviews in order to come up with these different groupings.

Now, another common use case for clustering will be anomaly detection.

Imagine that we are working with credit card transactions, and we have a certain user. We see that there's a small cluster, compared to the rest of that user's transactions, that has a high volume of attempts, or perhaps a smaller volume of attempts, or attempts at new merchants.

This would create its own new cluster, and that would present an anomaly within the data set. Perhaps that would indicate to the credit card company that fraudulent transactions are happening.

Another common use case will be customer segmentation. Think of finding, for example, groupings that help you find out how many types of customers your business has based on the recency, the frequency, and the average amount of visits in the last three months. It takes a combination of each one of those different features and comes up with different segments.

Another common segmentation is by demographics and level of engagement. For example, you can come up with groups for single customers, new parents, empty nesters, etc., determine the preferred marketing channel for each group or for some clustering of them together, and use these insights to drive your future marketing campaigns.

And then a final common use case will be to help improve supervised learning.

For example, you can take a good model, say a logistic regression model that you trained on your entire data set, and see how well it performs compared to models trained on subsegments of your data that you found through clustering.

Perhaps you'll be able to improve your performance if you look at each one of these different groupings and come up with different predictions for each one.

Now, there's no guarantee that this will always work, but it is common practice to segment the data to find these heterogeneous groups and then train a model for each group to help improve that classification.

Now, again, the other type of unsupervised learning that we discussed is going to be dimension reduction, and we will often use this for high-resolution images. We take our high-resolution images and fit our model to them in order to come up with a reduced, more compact version of those images that still hopefully contains most of the information that tells us what each image actually contains.

And then, with that model that we fit, we can take high-resolution images that we haven't seen before and again come up with smaller, reduced versions of those images as well. So we can predict what the compressed image should look like, using those algorithms to determine what kind of reduced, compressed image will still work in practice.

So for common dimension reduction use cases, image processing will probably be one of the most common use cases for PCA: both compressing images and, in computer vision, image tracking, as it will reduce the noise down to the primary factors that are relevant in your video capture.

And the reduced size of the data set can greatly speed up the computational efficiency of your detection algorithms.
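To make the compress-then-restore idea concrete, here is a small sketch using scikit-learn's `PCA`. The "images" are synthetic (random 64-pixel vectors built from 5 hidden factors plus a little noise), which is an assumption made purely so the example is self-contained; real images would be loaded and flattened instead.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_pixels, n_factors = 64, 5
basis = rng.normal(size=(n_factors, n_pixels))  # hidden structure shared by all images


def make_images(n):
    """Synthetic stand-in for flattened images: low-rank signal plus small noise."""
    weights = rng.normal(size=(n, n_factors))
    return weights @ basis + 0.01 * rng.normal(size=(n, n_pixels))


train_images = make_images(200)
new_images = make_images(20)  # images the model has not seen

# Fit on the training images, then compress and restore unseen images.
pca = PCA(n_components=n_factors)
pca.fit(train_images)
compressed = pca.transform(new_images)        # shape (20, 5) instead of (20, 64)
restored = pca.inverse_transform(compressed)  # back to shape (20, 64)

relative_error = np.linalg.norm(new_images - restored) / np.linalg.norm(new_images)
print(compressed.shape, relative_error)
```

Because the synthetic data really does have only 5 underlying factors, 5 components restore it almost exactly; on real images the number of components trades off size against reconstruction quality.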

Now, with that, we close out our introduction to unsupervised learning, and in the next video, we'll begin to home in on the concept of clustering to help prepare us conceptually for our first unsupervised model, the K-means algorithm. All right, I'll see you there.

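004: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p04 3_Introduction to Clustering.zh_en -BV1eu4m1F7oz_p4-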

Now, we're going to start with a simple example in order to introduce this concept of clustering.

In our example, we have customers of a site with one feature that we use to segment these customers: the number of visits of that customer.

If we were to use clustering to segment the users into two groups, where do you think we would draw the line?

Probably what you see here would ultimately be the best choice. Visually, it makes a lot of sense. These are our two clusters, and we're going to explore how this actually works mathematically and algorithmically in just a bit.

Perhaps you find that, for your business objective, you need three clusters, and this is what your three clusters would look like. Or maybe you need five clusters, and this is what your five clusters would look like.

In this course, you'll learn to use a wide variety of clustering algorithms and how to select the number of clusters that best suits your data.

With this in mind, in the next video, we will introduce our first unsupervised learning model, K-means.

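005: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p05 4_The K-Means Algorithm.zh_en -BV1eu4m1F7oz_p5-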

Here we will introduce our first unsupervised machine learning algorithm used for clustering: K-means.

We're going to use a similar example to what we just saw in the last video, but this time we have two features. We have the number of visits to our site that we had before, and the recency: how recently that customer came to our store.

Visually, hopefully you can see that there are already two clusters that you can come up with according to the data points that we have. That answer is obvious to us, but our goal here with K-means is to see how we can come up with this algorithmically.

The way that K-means works is that, since we are prescribing two clusters, we're going to initialize our algorithm by picking two random points. These are going to act as the centroids of our clusters. So we have our cluster in blue and our cluster in pink that are going to come from these two centroids.

Then, with our centroids initialized, we take each example in our space and determine which cluster it belongs to by computing the distance to each centroid and seeing which one is closer.

So here in the first iteration, the examples are color coded as we see here, and now every point belongs to a cluster.

Now, obviously, thinking back to the clusters that you thought of when you first looked at this data set, we are not done yet, as this assignment is somewhat arbitrary and it hasn't converged. We will explain what it looks like and what it means when it converges in just a bit.

The second step is then to adjust those centroids that we just discussed to the new mean of our clusters. So the new location of the pink square is right in the middle of all of the pink circles, and the same for the blue. We move our centroids so that they're in the center of their assigned points.

We're now through the first iteration, and we're going to keep repeating this process until no example is assigned to a different cluster.

So let's see the first step of the second iteration with our new cluster centroids in place.

We are again going to identify which cluster each point belongs to, and we see that some points have changed cluster according to which new centroid is closer, given the means of that last cluster.

Then we move our centroids again to the new mean of the data points that are now within each of our two groupings.

And then we do this for a third iteration. We see again that the colors have changed, and now the cluster centroids don't move anymore. Once we have that, that's the sign of convergence. It found the visual structure in the data set automatically by continuously iterating, moving to the mean of those identified points that were closest, until it was not able to move any more. Those centroids stayed in place, and we have our two clusters.
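The two alternating steps described above (assign each point to its nearest centroid, then move each centroid to the mean of its points, until nothing moves) can be sketched in NumPy as follows. This is an illustrative implementation under some assumptions: the initial centroids are drawn from the data points themselves, convergence is detected by the centroids no longer moving, and the function name `kmeans` and the toy two-blob data are made up for the example.

```python
import numpy as np


def kmeans(X, k, rng, max_iter=100):
    """Plain K-means: alternate assignment and centroid-update steps."""
    # Initialize by picking k distinct data points at random (an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (a centroid with no points is left where it is).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: the centroids did not move in this iteration.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2, rng=rng)
print(centroids)
```

On data this cleanly separated, the loop typically stops after a handful of iterations with one centroid near each blob.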

Now, for three clusters, the clusters can look like this. However, there can be multiple solutions, such as what we see here.

When we say that there are multiple solutions, what we mean is that the algorithm has converged and the centroids are not going to move any more, but it can converge in different places.

So the problem with the K-means algorithm is that it's sensitive to the choice of those initial points. Different initial configurations may yield different results.

I'll pause here, and in the next video, we will discuss how to choose the right model, that is, which one of these different converged solutions makes the most sense.

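006: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p06 5_Initializing the K-Means Algorithm.zh_en -BV1eu4m1F7oz_p6-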

Now, as discussed, we may end up with different clusters every time we run this K-means algorithm. Again, the process is to take three centroids, find the nearest points, take the average of the points that are closer to that centroid than to any other centroid, set that average as the new centroid, and find the closest points to that new centroid.

This movement towards that average, as we keep re-initiating that centroid after every iteration, is going to stop once that centroid no longer moves, and this is going to happen at different points depending on where we initiate our centroids.

So we need a way of judging the converged results and ranking them according to goodness.

On top of that, another idea to ensure that we get to a better optimization of this K-means algorithm is to initialize it in a smarter way.

Local optima, or just non-optimal solutions, often happen when two cluster centroids are initialized close to each other; means being initialized close to each other leads to local optima, not optimal solutions.

We can therefore make an effort to initialize with points that are far enough away from one another.

So how do we do this? We can start with a random initial point, as we see here.

Then for the second pick, instead of choosing it uniformly at random, we're going to prioritize faraway points by assigning each point a probability equal to its squared distance from that initial centroid over the sum of the squared distances of all points from that initial centroid.

So we look at every single point and square its distance from the original centroid. If you look at this formula, we put a lot more weight on the points that are far away, because they take up a larger proportion of the total squared distance of all of our points.

So we'll be more likely to end up with a not-so-close point, such as the blue one, as our second cluster centroid.

And then we'll repeat this process if we want three different clusters.

This time the distance is calculated as the minimum distance of that point to either of the two existing centroids. So rather than the distance being from just one centroid, it's the minimum distance to those two centroids, to ensure that we are far away from both of the centroids that we currently have.

And then we can do this one more time, or as many more times as we need, depending on the K that we define. Again, the distance measure is now the minimum distance from all three of our initiated centroids, therefore ensuring that the new point is far away from all three of the different centroids that we have initiated.

This algorithm with this smarter initialization is called K-means++, and it helps avoid getting stuck at these local optima.

This is actually going to be the default initialization for K-means in sklearn, which we will be using later.
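The seeding procedure just described can be sketched in a few lines of NumPy. This is a simplified illustration rather than scikit-learn's actual implementation; the function name `kmeans_plus_plus_init` and the toy data are hypothetical.

```python
import numpy as np


def kmeans_plus_plus_init(X, k, rng):
    """Pick k initial centroids, favoring points far from those already chosen."""
    n = len(X)
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from each point to its NEAREST already-chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Probability of each point = its squared distance / sum of squared distances.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)


# Toy data: three blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.4, size=(40, 2))
    for center in ([0.0, 0.0], [6.0, 0.0], [3.0, 5.0])
])
init_centroids = kmeans_plus_plus_init(X, k=3, rng=rng)
print(init_centroids)
```

Points that are already centroids have squared distance zero and so are not picked again. The selection is still random, so far-apart seeds are likely but not guaranteed.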

So here we've discussed getting a better initialization point. In the next video, we'll talk through picking the right number of clusters as well, in terms of how many clusters are actually built into our data set.

All right, I'll see you there.

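007: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p07 6_Choosing the Right Number of Clusters for K-Means.zh_en -BV1eu4m1F7oz_p7-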

Now that we're familiar with how K-means works, let's ask an important question: how do we choose K, the number of clusters?

Sometimes there's going to be a specific number of clusters that you know you would like, depending on the specific objective of your clustering task.

Examples of this may be that your computer has four cores, so it naturally follows that you're looking for four clusters. Or the business side of your organization may dictate that there are 10 clusters when trying to determine the different measurements to incorporate into our different sizes. Or a navigation interface for browsing scientific papers may need to be split into 20 disciplines specifically, so you set K equal to 20.

On the other hand, there are going to be times when the number of clusters is unclear, and we thus need an approach for selecting the right number of clusters for our problem.

In order to do so, we're going to introduce some metrics. One of those metrics is going to be inertia, a popular metric to help us accomplish this goal and understand the entropy built into our different clusters.

The metric is just going to give us the total sum of squared distances of each point to its cluster centroid. This way we're penalizing spread-out clusters and rewarding clusters that are tighter around their centroids.

One drawback of using inertia is that this value will be sensitive to the number of points in the clusters. If you think about it, no matter what, as we add more points, we will continuously add to our inertia, even if those points are relatively closer to the centroids than the existing points.

Distortion, on the other hand, takes the average of the squared distances from each point to its cluster centroid. Again, it will still hold that smaller values correspond to tighter clusters. But this time, adding more points will not necessarily increase distortion, as closer points will actually help decrease that average distance.

So thinking about inertia versus distortion, both are going to be measures of entropy per cluster. Inertia will always increase as more members are added to each cluster, but this will not be the case with distortion, since it works by taking that average.

Thus, when the similarity of points in the cluster is more important, you should use distortion. And if you are more concerned that clusters have similar numbers of points, then you should use inertia. Generally speaking, these will decrease fairly similarly.
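The two metrics differ only in whether the squared distances are summed or averaged, which a short NumPy sketch makes explicit. The function name `inertia_and_distortion` and the tiny data set are made up for illustration, and the centroids are held fixed when the extra points are added so that the effect described in the lecture is isolated.

```python
import numpy as np


def inertia_and_distortion(X, labels, centroids):
    """Inertia = sum of squared distances to own centroid; distortion = their mean."""
    sq_dist = ((X - centroids[labels]) ** 2).sum(axis=1)
    return sq_dist.sum(), sq_dist.mean()


centroids = np.array([[1.0, 0.0], [11.0, 0.0]])

# Four points, each exactly 1 unit from its centroid.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
inertia, distortion = inertia_and_distortion(X, labels, centroids)
# inertia is 4.0, distortion is 1.0

# Add two points that are CLOSER to the centroids (0.5 units away).
X_more = np.vstack([X, [[1.5, 0.0], [11.5, 0.0]]])
labels_more = np.array([0, 0, 1, 1, 0, 1])
inertia_more, distortion_more = inertia_and_distortion(X_more, labels_more, centroids)
# inertia rises to 4.5, while distortion falls to 0.75
```

Adding tight points raises inertia but lowers distortion, which is the contrast the lecture draws between the two.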

So what can we do in order to find the clustering with the best inertia?

What we would do is initiate our K-means algorithm several times with different initial configurations. With that, assuming we predefine what our K is, we can compute the resulting inertia or distortion, keep those results, and see which one of our different initializations or configurations leads to the best inertia or distortion.
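As a rough sketch of this "run several times, keep the best" idea, the loop below fits scikit-learn's `KMeans` from several random starting configurations and keeps the run with the lowest inertia. The synthetic data from `make_blobs`, the number of runs, and the use of `init="random"` with `n_init=1` (so that each fit is a single initialization) are assumptions made for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three underlying groups (an assumption for the example).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

runs = []
for seed in range(5):
    # One single random initialization per fit, so each run can converge differently.
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    runs.append((km.inertia_, seed, km))

# Keep the converged model with the lowest inertia.
best_inertia, best_seed, best_model = min(runs, key=lambda run: run[0])
print([round(inertia, 1) for inertia, _, _ in runs], "-> best seed:", best_seed)
```

In practice `KMeans` can do this for you: its `n_init` parameter runs the algorithm that many times and keeps the result with the lowest inertia.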

As an example of this, we're thinking about which model is going to be the right one.

For this K equals 3, we have our three different centroids that the algorithm converged to, and we see that the inertia is equal to 12.645. We look at this other converged K-means run with K equal to 3 again, and we see that the inertia is equal to 12.943. And then for a third run, we see the inertia is equal to 13.112. These are all converged K-means runs, but with different initializations.

So we would want to pick the model with the lowest inertia of the three.

Here, we introduced inertia and distortion and showed how they can be used, as we just saw, to choose the best model given a specific K. In the next video, we will extend this to show how they can be used to help determine the right number of clusters as well, and we will show the syntax used to compute these metrics in Python.

All right, I'll see you there.

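008: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p08 7_The Elbow Method and Applying K-Means.zh_en -BV1eu4m1F7oz_p8-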

So how do we use inertia or distortion to help choose the right number of clusters, as I promised we would do in the last video?

We know that inertia and distortion measure the distance of each one of our points to its respective centroid. If we think about either metric, technically speaking, we will almost always be decreasing this value as we increase the number of clusters.

We can think of this in terms of the extreme: if we had a cluster for every single one of our data points, the distance of each point to its centroid would be equal to 0, and we would end up with an inertia or distortion of 0.

To account for this, this is where the elbow method comes into play. We see here that we have an inflection point that could be chosen, perhaps, as a good K. Again, this is a graph of the number of clusters on the x axis and either inertia or distortion on the y axis.

We can see that until this inflection point, the inertia or distortion goes down very rapidly, but after this point, the rate of decrease slows down quite dramatically.

This slowing down can indicate a natural point in our data set where the number of groupings makes sense, and it should serve as a logical choice for K.

Again, this works for both distortion and inertia, where inertia penalizes differing numbers of points within clusters and leads to more balance, whereas distortion penalizes average distance and leads to more similar clusters.

So how do we implement K-means in Python? This will be the first unsupervised learning algorithm that we run in Python.

We will still use that same first step where we import the class for our unsupervised model. So from sklearn.cluster, we import KMeans. We're then again going to initiate an instance of this class and pass in each of the different hyperparameters for that class.

We pass in the final number of clusters. We are going to have to decide that, and we'll show in just a bit how we can use the elbow method to do so, but here we set n_clusters equal to 3. We're also initiating using the K-means++ initialization that we discussed earlier, where we had the distance squared over the total distance squared. This is also the default for KMeans, and you can look at the different initialization techniques in the documentation.

We're then going to take this initiated instance, fit it to the data, and then use it to predict the clusters for either new data or even our existing data. So the first step is to call .fit on X1, and then we call .predict.

This is all similar to what we saw with supervised learning as well. Here, though, it is safer to fit and predict on that same data set, because we're just trying to find those groupings and we're not overfitting to some type of solution, as was the concern with supervised learning. So we could predict on X1 as well to see the groupings that come out of X1.

And then just a side note: we can also use batch mode, which will randomly select different batches and do something not just similar to but exactly like K-means, only with smaller batches. This will help speed up the algorithm if you find that K-means is too slow.
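Put together, the import, instantiate, fit, and predict steps described above might look like the sketch below. The arrays `X1` (data to fit on) and `X2` (new data) are synthetic stand-ins, and the specific values such as `n_init=10`, `batch_size=32`, and `random_state=0` are assumptions for illustration. `MiniBatchKMeans` is scikit-learn's mini-batch variant, which I take to be the batch mode the lecture refers to.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Synthetic stand-ins for the lecture's data: three blobs of 50 points each.
rng = np.random.default_rng(0)
X1 = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
])
X2 = np.array([[0.1, -0.2], [4.8, 5.1]])  # new, unlabeled points

# Create an instance of the class with its hyperparameters.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)

# Fit to the data, then predict cluster membership.
kmeans.fit(X1)
y_existing = kmeans.predict(X1)  # groupings for the data we fit on
y_new = kmeans.predict(X2)       # groupings for new data

# Mini-batch variant: same interface, trained on small random batches.
mbk = MiniBatchKMeans(n_clusters=3, batch_size=32, n_init=3, random_state=0)
mbk.fit(X1)
print(y_new, mbk.cluster_centers_.shape)
```

The fitted centroids are available as `kmeans.cluster_centers_`, and the training labels as `kmeans.labels_`.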

Now, to implement the elbow method, what we're going to want to do is fit K-means for various values of K and then save those inertia values. So we're going to start off with inertia equal to an empty list. We're then going to run through a number of different clusters, ranging from 1 to 10, and fit the K-means algorithm for each number of clusters.

So we loop for k in our list of clusters and initiate a new KMeans with the number of clusters equal to one, two, three, four, etc. We call .fit on our data. Once it is fit, the model has an inertia attribute for that number of clusters, so we can get the inertia for each one of these different numbers of clusters and append it to our list. We can then use plt.plot with the list of clusters as our x axis and the inertia as our y axis in order to find that actual elbow.
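The loop just described could be written roughly as follows. The data here comes from `make_blobs` with four centers, which is an assumption so that the example runs on its own; the variable names `inertia` and `list_clusters` and the helper `plot_elbow` are likewise illustrative rather than taken from the course notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four underlying groups (an assumption for the example).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=0)

inertia = []
list_clusters = list(range(1, 11))  # K = 1, 2, ..., 10
for k in list_clusters:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    inertia.append(km.inertia_)  # inertia of the fitted model for this K


def plot_elbow(ks, values):
    """Plot inertia against the number of clusters to look for the elbow."""
    import matplotlib.pyplot as plt

    plt.plot(ks, values, marker="o")
    plt.xlabel("Number of clusters (K)")
    plt.ylabel("Inertia")
    plt.show()
```

Calling `plot_elbow(list_clusters, inertia)` draws the curve; the K where the drop in inertia levels off is the candidate elbow.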

Now that closes out our discussion here on K-means, and we will move into our lab, where we'll see how we do all this in practice. All right, I'll see you there.

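009: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p09 8_K-Means Notebook (Optional) Part 1.zh_en -BV1eu4m1F7oz_p9-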

Welcome to our lab on K-means clustering, our first lab for Course 4.

In this lab, we're going to learn how to use K-means with sklearn. Throughout this lab, we will run a K-means algorithm, understand which parameters are customizable within the algorithm, and learn how to use the inertia curve that we discussed in lecture to determine the optimal number of clusters.

Now, a quick overview: K-means is one of the most basic clustering algorithms that we'll be working with. It relies on finding cluster centers to group data points, based on minimizing the sum of squared errors between each data point and its cluster center.

So first things first, we're going to import all the necessary libraries. We bring in NumPy, pandas, seaborn, and matplotlib. And then we're bringing in KMeans from sklearn. We're also going to import make_blobs, and we'll see how this comes into play; it will be very useful for playing around with K-means. And then we'll use shuffle as well, and we'll see that later on.

We're then going to set a bunch of parameters for our visualizations, and then we are going to get started with creating our first simple data set.

In order to do this, we're first going to create our function, and I'll break this down step by step. We have our color, and you can think of it as a list: our color equals B, R, G, C, M, K, so we can think of looping through B for blue, R for red, and so on.

We say alpha equal to 0.5, and that's just going to be how opaque each one of our data points is. Hopefully you realize at this point that we're going to be creating some type of scatter plot. And then the size of each one of our points is s equals 20.

We're going to call plt.gca(), which stands for get current axes, and set the aspect equal to 'equal'. You see here, in quotes, this string 'equal'. That's just because we're going to be plotting a circle, and we want each unit in the x direction and in the y direction to be equal to one another, so we see a clean circle. You can try erasing this to see what it looks like otherwise.

We then say that if we have no clusters, so we're not clustering at all, we just create a scatter plot of our X: all rows of the first column against all rows of the second column. The color is just that first color, so b for blue, our alpha is 0.5, and our size is 20.

Now, if we do have a number of clusters, we want to plot each of our different clusters separately.

The way we do this is to loop over the number of clusters, for the first one, then the second one, and so on, and for each i call plt.scatter on the rows of X for which our K-means model came up with a label equal to i.

We take the first column for those rows, and then the second column for those same rows. So we get both columns, but only the rows whose label matches the cluster we are currently on.

We give each cluster a different color by looping through the colors that we defined above.

We are also going to plot the actual cluster centers so we can see where those lie as well.

So we take cluster center i, the one for the cluster that we are currently on, and plot its x coordinate and y coordinate, that is, its first and second columns, using that same color.

We mark it with an x so that we can differentiate it from our actual data points, and we make it larger by setting the size equal to 100.
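Putting those pieces together, the plotting helper described above might look roughly like the following. This is a reconstruction from the narration rather than the notebook's exact code: the name `display_cluster`, its argument names, and the modulo on the color index (so that more clusters than colors does not fail) are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt


def display_cluster(X, km=None, num_clusters=0):
    """Scatter-plot 2-D data, optionally colored by fitted K-means labels."""
    color = "brgcmk"   # b for blue, r for red, and so on
    alpha = 0.5        # opacity of each data point
    s = 20             # marker size of each data point
    ax = plt.gca()     # get current axes
    ax.set_aspect("equal")  # one x unit is drawn as long as one y unit

    if num_clusters == 0:
        # No clustering yet: plot every row in the first color.
        plt.scatter(X[:, 0], X[:, 1], c=color[0], alpha=alpha, s=s)
    else:
        for i in range(num_clusters):
            c = color[i % len(color)]  # assumption: wrap around if colors run out
            rows = km.labels_ == i     # rows assigned to cluster i
            plt.scatter(X[rows, 0], X[rows, 1], c=c, alpha=alpha, s=s)
            # Mark the cluster center with a larger "x" in the same color.
            center = km.cluster_centers_[i]
            plt.scatter(center[0], center[1], c=c, marker="x", s=100)
    return ax
```

Calling `display_cluster(X)` with no model plots the raw points; passing a fitted model and its number of clusters adds the color coding and the centroids.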

So to see what this looks like, we're going to create our X here. In order to do that, we create this angle, which is just a NumPy array of values between 0 and 2π: 20 equally spaced points.

We're saying we don't want the endpoint, so it goes up to, but does not include, 2π.

We then append two different arrays together to create our X, with our first feature and our second feature, one for each of our two axes: the first is the cosine of our angle, and the second is the sine of our angle.

The 0 says that we want to append these along axis 0, so that we have one alongside the other, and then we transpose the result so that we have two columns.

So I'm going to show you quickly. First, I will run this, and when we display the cluster we see our perfect circle.

Just to take a quick look at what X looks like: it's these two columns, one being the cosine of the angle and the other being the sine of that angle.

For now we have zero clusters, because the default above set the number of clusters equal to 0. There is no KMeans model yet (we'll introduce K-means models next), so all we're doing is plotting X.
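The construction of the circle data described above can be sketched as follows. The variable names `angle` and `X` follow the narration; the final check that every point lies on the unit circle is an addition for illustration.

```python
import numpy as np

# 20 equally spaced angles in [0, 2*pi), excluding the endpoint.
angle = np.linspace(0, 2 * np.pi, 20, endpoint=False)

# Stack cos and sin along axis 0 (shape (2, 20)), then transpose to (20, 2)
# so that each row is one point and each column is one feature.
X = np.append([np.cos(angle)], [np.sin(angle)], 0).transpose()

print(X.shape)
print(np.allclose((X ** 2).sum(axis=1), 1.0))  # every point has radius 1
```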

Now we're going to group this data into two clusters to see what it looks like.

We use two different random states to initialize the algorithm, to see how we come up with different results depending on how we initialize it.

So we set the number of clusters equal to 2. We call KMeans, set the number of clusters equal to that value, set the random state equal to 10, and say that we only want to initialize once.

Generally speaking, KMeans will initialize a number of times and then choose the run with the best inertia.

Here, we're choosing to initialize only one time, just to see the differences between two different random states, even though the default initialization, if we look here, is k-means++.

That makes it more likely to choose starting points that are far apart, but it will still choose different points from run to run. So it is important either to let KMeans initialize a number of times, or to run it several times yourself, check those inertias, and choose which one is best on your own.

So we create a KMeans with the hyperparameters that we've passed, and we call km.fit on X. Now we have our K-means model fit.

Then, using that fitted model, we can display the clusters with the function that we defined earlier.

You can see that the fitted km has attributes that give us the labels as well as the cluster centers, so those are now available and we can create this scatter plot.

So we run this, and we see our two groupings. Because of the way that we created this data, these groupings could really fall anywhere.

There are no natural groupings, which is why the boundary is likely to fall in many different places given the way that we are running this, and why it will not necessarily converge to the same spot.

We see that here, with the x's marking where the centroids actually lie: the red x should be the average of all the red dots, and the blue x the average of all the blue dots. And we see how it assigns each point to one of those two clusters.

Now, setting the random state equal to 20, we can see that it comes up with a very different clustering.

Coming back to lecture: why are these clusters different when we run K-means twice?

This should be clear, as we talked through it quite thoroughly when I went through each of those graphs: the starting points of the cluster centers have an impact on where the final clusters actually lie.

And again, these are clusters that probably don't actually exist, given how equally spaced these points are, so it's highly likely that the clusters will come up in a different place on each run.
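The experiment with two random states can be sketched like this. The seeds 10 and 20 and the single initialization come from the narration; the `results` dictionary and the printing are assumptions added so the two runs can be compared side by side without a plot.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same 20 equally spaced points on the unit circle as before.
angle = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.column_stack([np.cos(angle), np.sin(angle)])

results = {}
for seed in (10, 20):
    # n_init=1: a single initialization, so the starting centers matter.
    km = KMeans(n_clusters=2, random_state=seed, n_init=1).fit(X)
    results[seed] = (km.labels_, km.inertia_)
    print(f"random_state={seed}")
    print("  labels :", km.labels_)
    print("  centers:", km.cluster_centers_.round(2))
```

Because the points have no natural grouping, the split between the two clusters can land at a different place on the circle for each seed.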

So I'm going to pause here, and we will continue by figuring out the optimal number of clusters and how we'd actually do that using Python code.

All right, I'll see you in a bit.

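010: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p10 9_K-Means Algorithm Notebook Part 2.zh_en -BV1eu4m1F7oz_p10-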

Now, in this video, we're going to talk through how you can actually choose the optimal number of clusters, depending on your dataset.

We're going to synthetically create our dataset here so that we know the number of clusters, and that will allow us to really understand how we choose the number of clusters and how that looks when we create the elbow plot that we discussed earlier.

In order to synthetically create these different blobs, as you see here, we're using the make_blobs function.

We first say how many samples we want, so we're going to have 1,000 data points. We set the number of bins equal to 4, and then we set the centers of each of our blobs.

So we actually have these centers predefined, and we set them at (-3, -3), (0, 0), (3, 3), and (6, 6), so you can imagine these lying along a straight diagonal line.

Then we call make_blobs, which outputs two values; here we care about the X.

We pass in the number of samples that we want, which we set to 1,000.

The other important argument here is cluster_std, the standard deviation of each cluster, which defines how tightly around each of these centers our data points will be scattered.

Then we set our centers equal to the centers that we defined above, and we set the random state equal to 42 to ensure that we get the same values.

So I call display_cluster, which is the same function that we discussed in the first video, on our new X, and we can already somewhat see our four different blobs visually.

If you wanted these to be very clear, you could use that standard deviation: with a smaller standard deviation, the points will be tighter around those centers. You see that if I run this with a smaller value, they are really tight around each center and the clusters are very, very obvious.

That takes it a little too far. We don't want it quite that obvious, so we're going to set it equal to 1.
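A minimal sketch of the synthetic data described above. The sample count, the centers, the standard deviation of 1, and the random state of 42 follow the narration; the variable names and the per-blob count check are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

n_samples = 1000
centers = [(-3, -3), (0, 0), (3, 3), (6, 6)]  # four centers on a diagonal line

# make_blobs returns the points X and the index y of the blob each came from.
X, y = make_blobs(
    n_samples=n_samples,
    n_features=2,
    centers=centers,
    cluster_std=1.0,   # smaller values give tighter, more obvious blobs
    random_state=42,
)

print(X.shape)
print(np.bincount(y))  # number of points drawn around each center
```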

We're then going to run our K-means and set our initial number of clusters equal to 7.

So we call KMeans with the number of clusters equal to 7, we do km.fit, and we use display_cluster, the function that we defined earlier, which gives us each of our clusters, color coded, as well as their centroids. We pass in km and that number of clusters.

We see here the seven clusters that are created once we call K-means with 7 clusters, and it seems to be arbitrarily splitting each of the four blobs into subsets.

Now, if we set the number of clusters equal to 4 and run that same code, it seems visually that we have a much cleaner set of four clusters.

Now, we ask the question here: should we use four or seven clusters? Here it's obvious, because we have the data plotted in two dimensions and we had these clear, distinct blobs.

But in the real world, data will usually have more than two dimensions, and a dataset in a higher-dimensional space is going to be very hard to visualize.

So a way to solve this problem and decide whether we should use 4 or 7 is what we discussed earlier: finding the elbow by plotting inertia versus the number of clusters.

By calling km.inertia_, we can get the inertia for the last fitted model, so this will be for the number of clusters equal to 4.

And I want you to think, before I run this, about which one will have a lower overall inertia, 4 clusters or 7 clusters. We'll discuss that in just a second.

In order to plot this out, we create a blank list, so we have inertia equal to this blank list.

We then run through a range of numbers of clusters, from 1 up to, but not including, 11, so 1 through 10.

For each number of clusters in that list, we fit a K-means with that number of clusters, then take that inertia list and append onto it the inertia of the fitted model, for clusters equal to 1, 2, 3, et cetera, up to 10.

We're then going to plot. As our x axis we use our list of numbers of clusters, which holds those values 1 to 10, and as our y axis we use the inertia values that we appended onto the list.

We call plt.plot on these two, which creates the line plot, and plt.scatter, which creates the actual markers. There are other ways we could have done this as well.

We then set our x label and y label to number of clusters and inertia, respectively.

We run this, and we see this steep decline from 1 to 2, then from 2 to 3, and from 3 to 4, and then you see that it slows down after 4.

Now, this is obviously not always going to be perfect. At times it will be difficult to say exactly where that inflection point is. Here it may even look like 2, because of such a steep drop-off from 1 to 2, but you can see that at 4 it starts to flatten out.

I asked you that question earlier: which one will have lower inertia, 4 or 7? Hopefully, if you've been paying attention, you noticed, and as you see on the plot, that the inertia continues to go down as you increase the number of clusters, essentially no matter what.

But it doesn't go down as quickly once we hit 4. So that's our inflection point, and we say that we should probably use four clusters.
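The elbow procedure walked through above can be sketched as follows. The range 1 through 10 and the plot calls follow the narration; the data is regenerated here so the block runs on its own, and `n_init=10` and `random_state=0` are assumptions chosen for reproducibility.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

centers = [(-3, -3), (0, 0), (3, 3), (6, 6)]
X, _ = make_blobs(n_samples=1000, centers=centers, cluster_std=1.0,
                  random_state=42)

inertia = []
list_num_clusters = list(range(1, 11))  # 1 up to, but not including, 11
for num_clusters in list_num_clusters:
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    km.fit(X)
    inertia.append(km.inertia_)  # inertia of the model just fitted

plt.plot(list_num_clusters, inertia)     # the line
plt.scatter(list_num_clusters, inertia)  # the markers
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
```

Inertia keeps falling as clusters are added, so the thing to look for is where the drop slows down, not where the value is smallest.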

That closes out this video on the elbow plot of the number of clusters versus the inertia. In the next video, we'll see a practical application of this and how we can use K-means on actual images.

All right, I'll see you there.

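011: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p11 10_K-Means Algorithm Notebook (Optional) Part 3.zh_en -BV1eu4m1F7oz_p11-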

Now, let's move to a more practical application where we can see this in practice: we will take this image of bell peppers and group together its different colors, so that rather than working with the multitude of colors within the image, we're only working with the number of colors that we create with our clusters.

We'll see what we mean as we walk through this notebook.

The first thing we're going to do is read in this image. When we call plt.imread on this image, we are actually bringing it in as a NumPy array, and I'll show this in just a second.

We can then use plt.imshow to show that image, which is currently a NumPy array, within our Jupyter notebook. And we call plt.axis('off') because we don't want any axes when we're just plotting an image.

So we run this, and we see our image with its different colors and different shades of green, red, and yellow.

We then call shape on the image here, but quickly, I want to show you what the actual image object looks like.

As I mentioned, rather than holding the actual picture that we just plotted, it represents the image as an array, where each value is how much red, green, or blue a given pixel has, and that's for every single pixel.

So we want to see how many pixels we have: we have 480 times 640 pixels, and each pixel has three values that represent how much red, green, and blue it has.

Below, just to home in on how this picture is represented with NumPy arrays, we're going to look at a single color, using R equals 35, G equals 95, and B equals 131. These are all values between 0 and 255.

We're going to call plt.imshow on just that specific array, as if it were a single pixel with a certain amount of coloration.

So we run this, and we see that, since it's mostly blue, it outputs something close to blue.

If I were to decrease the blue to 13 and increase green to 195, then you see this very green image.

And just so you understand a bit of how coloring works: if we set these all to 100, so all the values are the same, it should be somewhere around gray, because it has equal amounts of each.

And if we set them all to 0, what do you think will happen? I'll run that here, and you see that we have black.

And if they are each 255, which is the maximum value, you see that we get white. So that's a quick look at how each of these pixels is created using this NumPy array.
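The single-pixel experiments above can be reproduced with a sketch like this one. The RGB triples are the ones mentioned in the narration; the helper name `solid_color` and the side-by-side layout are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt


def solid_color(r, g, b):
    """Return a 1x1 RGB image (shape (1, 1, 3)) with values from 0 to 255."""
    return np.array([[[r, g, b]]], dtype=np.uint8)


colors = {
    "mostly blue": (35, 95, 131),
    "more green": (35, 195, 13),
    "gray": (100, 100, 100),
    "black": (0, 0, 0),
    "white": (255, 255, 255),
}

fig, axes = plt.subplots(1, len(colors), figsize=(10, 2))
for ax, (name, rgb) in zip(axes, colors.items()):
    ax.imshow(solid_color(*rgb))  # a single pixel stretched over the axes
    ax.set_title(name)
    ax.axis("off")
```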

What we're going to do next is reshape our array so that every single pixel is a row. Rather than having three dimensions, we're going to make this two dimensions, so we take our 480 by 640 pixels and multiply 480 times 640.

So again, each row will represent a single pixel, and the other dimension will be the RGB values, that is, how much of each color goes into that particular pixel.

So we call reshape. We say that the first dimension is going to be the first dimension of our original image times its second dimension.

Again, it was originally in three dimensions; we're taking those first two dimensions and multiplying them together, and that's how many rows we have.

Then the number of columns will be 3, one for each of R, G, and B.
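As a small sketch of that reshape, assuming an image array of the same 480 by 640 by 3 shape (a blank placeholder stands in for the pepper photo here, and the names `img` and `img_flat` are assumptions):

```python
import numpy as np

# Placeholder with the same shape as the lab image: height, width, RGB.
img = np.zeros((480, 640, 3), dtype=np.uint8)

# One row per pixel, one column per color channel.
img_flat = img.reshape(img.shape[0] * img.shape[1], 3)

print(img.shape)       # (480, 640, 3)
print(img_flat.shape)  # (307200, 3)
```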

Then, just to see the first five values: each of these rows represents a pixel, and each of the numbers within that row represents the red, the green, or the blue, respectively.

Since 480 times 640 equals 307,200, that's the number of rows in our new NumPy array.

Now we're going to run K-means on that image using eight clusters, so we're going to come up with eight groupings. That is, we take every single one of these 307,200 pixels and find eight groups to segment them into.

We're then going to create a copy of that image and replace that copy's values with the cluster centers for their respective labels, out of these eight clusters.

So rather than the actual value that was there, for each of our unique labels, 0 through 7, we take all of those 307,200 rows where the label is equal to that label and replace them with the actual value of that cluster center.

So I'm going to run this, and it will replace all those values. Just to show you quickly what that looks like: our new values, you see, are all the same here, 43, 156, 43, 56, and later on 236, 172, 8.

These represent one of the eight different clusters that we had, and we replaced the original values that we see up here with one of these eight values that we created using our centroids.

Now, to see what that looks like, now that we've replaced this multitude of hues with only eight possible colors, we're going to reshape that back to the original image shape.

In order to actually show this as an image using plt.imshow, we have to get it back to 480 by 640 by 3. We can then call plt.imshow again and turn off the axis.

And we see that we can still get a lot of our initial picture with just these eight colors. We can see the different hues and how it differentiates between the different peppers, and how we lose a bit of the granularity, but we still see these clusters of the red, the white, the green, the black, and so on.

The next thing we want to do, in order to take this a step further, is create a function that takes in any image as well as a number of clusters, and returns the image with each pixel replaced by its centroid.

As we just did with 8, we want to do that for any image and for any k, any number of clusters.

To do that, we're going to repeat the steps that we just did. We set the flattened image to the reshaped image, with the product of the first two dimensions and then 3 for the RGB.

We then set the number of clusters equal to the k that we have defined here, and we set the random state equal to 0 just to ensure that we get the same values when I look at it and when you look at it back at home.

We then fit that to our flattened image, again in two dimensions, in our case 307,200 by 3.

We then create a copy as we did before, and we ultimately change this copy by running a for loop through each of our different labels: wherever the label is equal to the current value, we replace those rows with that specific cluster center, again doing the same steps as we did before.

We then reshape that back to the original image shape so that we can ultimately display it.

So we've created our function, which outputs both that new image with the replaced colors and the inertia for the fitted K-means model, which depends on the k that we used.

We're then going to call that function for k between 2 and 20, counting by two, and draw that inertia curve, and later on we'll also show many of these pictures.

So the k values that we loop through are going to be 2 up to, but not including, 21, counting by 2.

Then we initialize empty lists, one for the images, so we can save them, and one for the inertias.

Each time we call this image_cluster function that we defined, we get back both the new image with replaced pixels and the inertia, and we append each of these output values to the lists that we initialized.

So I'm going to run this, and this will take just a second, and it will output for us each of these images as well as the inertia values, which we'll plot in just a second. All right, I'll see you as soon as it stops running.
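The function and the loop over k described above might be sketched as follows. This is a reconstruction from the narration, not the notebook's exact code: the small random image stands in for the pepper photo so that the block runs quickly and on its own, and `n_init=10` and the variable names other than `image_cluster` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def image_cluster(img, k):
    """Replace every pixel by its K-means centroid; return image and inertia."""
    img_flat = img.reshape(img.shape[0] * img.shape[1], 3)
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(img_flat)

    img_flat2 = img_flat.copy()
    for i in np.unique(kmeans.labels_):
        # All pixels with label i take the color of cluster center i.
        img_flat2[kmeans.labels_ == i, :] = kmeans.cluster_centers_[i]

    img2 = img_flat2.reshape(img.shape)  # back to height x width x 3
    return img2, kmeans.inertia_


# Stand-in image (assumption): random colors instead of the pepper photo.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(40, 60, 3), dtype=np.uint8)

k_vals = list(range(2, 21, 2))  # 2, 4, ..., 20
img_list, inertia = [], []
for k in k_vals:
    img2, ine = image_cluster(img, k)
    img_list.append(img2)
    inertia.append(ine)

# Each output image uses at most k distinct colors.
for k, out in zip(k_vals, img_list):
    n_colors = len(np.unique(out.reshape(-1, 3), axis=0))
    print(k, n_colors)
```

On the full 480 by 640 photo the same loop fits ten models on 307,200 pixels each, which is why it takes noticeably longer than this toy version.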

So that should have taken about five minutes to run.

Now we have, from the outputs, our different inertia values as well as our images, which we'll get to in just a second, and we can plot our inertia values against each of our numbers of clusters.

We call plt.plot to get the line graph, and on top of that we call plt.scatter to get each of the points. And we set our x label and y label to k and inertia.

We see here that it curves down smoothly, and it's hard to see an exact elbow. So this is a case where maybe we can't see exactly where that elbow is and determine k using the elbow method.

So we note here a metric that you can dive deeper into, the silhouette coefficient.

What it does is compare how similar each point is to the other points in its own cluster with how similar it is to the points in nearby clusters.

Again, you can dive deeper, but that gives you a different method for choosing what the number k should be.
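For readers who want to try the silhouette coefficient mentioned above, here is a minimal sketch using scikit-learn's `silhouette_score`. It is not part of the lab as narrated: the blob data, the range of k, and the variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [(-3, -3), (0, 0), (3, 3), (6, 6)]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=1.0,
                  random_state=42)

scores = {}
for k in range(2, 8):  # the score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean over all points; values near 1 mean tight, well-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

Unlike inertia, this score does not improve automatically as k grows, so its maximum can be read off directly.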

Now, the next step is to plot each of the images that we have, to see how each one looks with its different number of colors. Again, we're only using as many colors as we have clusters.

So we run through our k values, counting by 2 between 2 and 20, so over the range of the length of those values, for 10 different subplots.

We plot five rows by two columns, so we have a grid of 10 axes, where each one holds a different image.

One at a time, we show the image for the given k value, title it, and turn off the axis. And we can see, as we increase the number of colors, how much of the image we are able to discern.

Here at the bottom, where we're using 20 colors, so 20 centroids replacing the original values, we can pretty clearly see each of our peppers and really discern the original photo well.

Just to give you an idea of how many colors there were originally, we can run np.unique on the flattened image, with axis equal to 0, and look at the length.

You see that originally there were 98,452 unique colors making up that picture, and we can see how well we can represent it with just 20 colors.

So we were able to group those roughly 98,000 colors into just 20.

That closes out our notebook on K-means clustering, and I look forward to seeing you back at lecture. All right, I'll see you there.

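012: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p12 11_Euclidean Distance and Manhattan Distance.zh_en -BV1eu4m1F7oz_p12-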

Our clustering methods will rely very heavily on our definition of distance, so let's take a step back and discuss the different distance metrics that are available to us.

Now, let's go over the learning goals for this set of videos.

In these videos， the main topic of discussion will be different measures of distance between different points。

And with that， we will discuss the different applications of these different distance measures and how they relate to clustering。

And the different measures that we're going to discuss are going to be the Euclidean distance。

which is going to be that classic distance that you're probably already familiar with。

As well as the Manhattan distance, the cosine similarity, and the Jaccard distance.

Now， our choice of distance metric will be incredibly important when discussing any of our clustering algorithms。

As these clustering algorithms will all be dependent on some type of measure of how distant or in that same vein。

how similar one point is with the next。

Now there are several choices of distance metrics， and they all have their strengths and more appropriate use cases。

But at times， we may also need to use empirical evaluation to determine which one of our distance metrics works best in achieving our goals。

Now， the most intuitive distance metric that we are hopefully already somewhat familiar with。

and what we use in K means is going to be the Euclidean distance。

Now another name for this is the L2 distance。So in order to highlight how Euclidean distance is calculated。

we're going to take these two points and calculate the Euclidean distance between them。

So we remove all the other points so we can just look at these two, and hopefully you remember parts of this from math class back in the day.

In order to find this distance, d, we first need to find our change in visits as well as our change in recency, that is, the change along the x axis and the change along the y axis.

Then, thinking back to that math class from back in the day, how do we think these changes in visits and recency relate to our calculation of d?

We get d by taking the square root of the sum of the squares of these changes.

So the math equation that I was hinting at is a squared plus b squared equals c squared. You take the square root of c squared, and you end up with the formula that we see here.

And we can move this on to higher dimensions。 Imagine if we had three dimensions。

four dimensions and so on。 We just take the square of each of those and then take the square root of the sum of all those values。
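The Euclidean (L2) distance described above can be written in a few lines. The function name and the sample points are assumptions for illustration; the same code works for any number of dimensions.

```python
import numpy as np


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))


# Two dimensions: changes of 3 and 4 give the classic 3-4-5 triangle.
print(euclidean_distance((0, 0), (3, 4)))        # 5.0
# Three dimensions: just one more squared term under the root.
print(euclidean_distance((1, 2, 3), (1, 2, 5)))  # 2.0
```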


Another distance metric that you may already be familiar with is the L1 distance, or the Manhattan distance.

Instead of squaring each term, we add up the absolute value of each term. It will always be larger than the L2 distance, unless the two points differ along only one axis, so they have the same number of visits or the same recency, in which case the two distances are equal.

We'd use this in business cases where there's very high dimensionality, as high dimensionality often makes it difficult to distinguish the distance between one pair of points from another, and the L1 distance does better than the L2 distance at distinguishing these distances once we move up to higher-dimensional space.

Now, these are the two most commonly known distance metrics, which hopefully you already know a bit about.
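A short sketch of the Manhattan (L1) distance, compared against the L2 distance on the same pairs of points. The function name and the sample points are assumptions for illustration.

```python
import numpy as np


def manhattan_distance(p, q):
    """Sum of absolute coordinate differences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.abs(p - q)))


a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
l1 = manhattan_distance(a, b)
l2 = float(np.linalg.norm(a - b))  # Euclidean distance for comparison
print(l1, l2)                      # 7.0 5.0

# When the points differ along a single axis only, L1 and L2 coincide.
c = np.array([0.0, 4.0])
print(manhattan_distance(a, c), float(np.linalg.norm(a - c)))  # 4.0 4.0
```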

In the next video， we will introduce some less well known distance metrics that can prove to be very powerful for certain applications。

All right， I'll see you there。

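013: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p13 12_Cosine Distance and Jaccard Distance.zh_en -BV1eu4m1F7oz_p13-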

So we start here with a bit of a less intuitive distance metric， namely the cosine distance。

So we're going to start off again with two points in two dimensional space just to highlight our example。

And hopefully from the lines that we just drew， it should be clear that this is already shaping out to be much different than the L1 and L2 metrics that we just discussed。

What we really care about with the cosine distance is the angle between the two vectors defined by these two points.

The underlying measure is the cosine of the angle between those two vectors, and the distance is 1 minus that cosine.

The formula still holds as we move up to higher dimensions: we take the dot product, as you see in the numerator, over the product of the norms of the two points in the denominator.

The key to the cosine distance is that it is insensitive to scaling with respect to the origin. That is, we can move one of those points, as we have here, along that same line, and the distance remains the same.

So any two points on the same ray passing through the origin will have a distance of zero from one another.

The idea is that we care about the relationship between recency and visits from one point to the other much more than we care about the actual physical distance between the two.

So recency being 1 and visits being 1 is, as far as the cosine distance is concerned, the same as recency being 10 and visits being 10, since both lie along that same ray.

So for two vectors that are pointing in the same direction, our cosine distance will spit out zero. It thinks of them as very close, or essentially exactly the same.

The Euclidean distance, however, may think of them as very far apart, depending on where those values actually lie, even if they are on the same line.

So how is it useful to be able to classify them as exactly the same if they are pointing in the same direction?

Let's say we have text data and our features are going to be different counts of different words within the documents。

Now just because one document is longer than the other， so it has more counts of each of these words。

Does not mean that they need to be far away from one another and thus cluster differently。

Maybe they're about the exact same thing。Maybe one of those articles is a summary of the other。

In that case， you want to mark them as close to one another。

and cosine distance will come in handy in that situation。

So if you have 3 counts of the term "data science" and 10 counts of the word "application" in one document, and 30 of "data science" and 100 of "application" in another, then you probably want to assume that those are in the same category and cluster them together.

Even though they may be far apart in Euclidean distance, they point in exactly the same direction, and thus their cosine distance is zero.
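The word-count example above can be checked numerically. This sketch takes cosine distance to be 1 minus the cosine of the angle between the vectors; the function name and the extra orthogonal example are assumptions for illustration.

```python
import numpy as np


def cosine_distance(u, v):
    """1 minus the cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


short_doc = [3, 10]    # counts of "data science" and "application"
long_doc = [30, 100]   # ten times as long, same proportions

print(cosine_distance(short_doc, long_doc))  # zero, up to rounding error
print(float(np.linalg.norm(np.subtract(long_doc, short_doc))))  # far apart in L2

# Vectors at a right angle are as far apart as two count vectors can be.
print(cosine_distance([1, 0], [0, 1]))
```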

Another advantage of the cosine distance is that it's more robust against the curse of dimensionality. Euclidean distance can be affected and lose meaning if we have a lot of features, as we saw in our initial discussion of the curse of dimensionality.

So our takeaway here is that the best choice of distance is going to heavily depend on what our application is。

Another distance metric to keep in mind is the Jaccard distance, which is useful for text as well.

It applies to sets, and an example that is used pretty often, and that we'll walk through here, is unique word occurrence.

Say we have sentence A: "I like chocolate ice cream." Set A is just the unique words in that sentence: I, like, chocolate, ice, and cream.

Say sentence B is: "Do I want chocolate cream or vanilla cream?" Then set B is do, I, want, chocolate, cream, or, and vanilla, again not counting that second "cream", only the unique values.

The Jaccard distance is then 1 minus the share of values in common, that is, the intersection over the union: the number of words shared by the two sentences over the total number of unique words across the two sentences.

We'll see this example and its calculation in just a second.

And it can be used as a different option when we have these text documents to group similar topics together。

So using this example, we can calculate the score between our two sentences. Running through it,

we see that our intersection ends up having three words (I, chocolate, and cream),

and there are nine unique words in total. So the distance is going to be 1 minus one third, which equals two thirds, or 0.67,

and that will be our distance.
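The calculation above can be reproduced with Python sets. This is a small sketch rather than code from the course; lower-casing the words and splitting on whitespace are simplifying assumptions, and `jaccard_distance` is an illustrative helper name.

```python
def jaccard_distance(set_a, set_b):
    # One minus (size of the intersection) / (size of the union).
    intersection = set_a & set_b
    union = set_a | set_b
    return 1 - len(intersection) / len(union)

sentence_a = "I like chocolate ice cream"
sentence_b = "Do I want chocolate cream or vanilla cream"

# Build the sets of unique words; the second "cream" in B is dropped automatically.
set_a = set(sentence_a.lower().split())
set_b = set(sentence_b.lower().split())

shared = set_a & set_b
distance = jaccard_distance(set_a, set_b)

print(sorted(shared))       # ['chocolate', 'cream', 'i']
print(len(set_a | set_b))   # 9
print(round(distance, 2))   # 0.67
```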

So that closes out our different distance metrics and overall in this discussion。

Just to recap， we discussed the importance of having different measures of distance between our two points。

As well as the applications of distance measures to clustering and how the measures of distance or similarity will ultimately have a large effect on the groupings that we end up creating。

And with that, we discussed the Euclidean distance as our most common metric, where we used the old formula that we learned back in the day: a squared plus b squared equals c squared.

We discussed the Manhattan distance, which was the absolute values of the differences along each individual feature, all added together.

We discussed the cosine similarity， which highlighted the angle between our points。

And then finally we discussed the Jaccard distance,

which was useful for showing how similar or different two sets of values are.
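As a quick companion to the recap, the first two metrics can be sketched in a few lines of plain Python. The points (0, 0) and (3, 4) are arbitrary values picked for illustration.

```python
import math

def euclidean(a, b):
    # Square root of the summed squared differences (a^2 + b^2 = c^2).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute differences along each feature.
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```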

All right， I'll see you in the next video。

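014: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p14 13_Curse of Dimensionality Notebook Part 1.zh_en -BV1eu4m1F7oz_p14-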

Now， in this demo， we're going to take a brief aside and touch back on the curse of dimensionality。

Distance measures will come into play slightly, as we will talk about the Euclidean distance for each of these,

but the focus here is going to be the curse of dimensionality。So with that in mind。

We can talk about the demo objectives， which will be to gain a deeper understanding of why observations are going to be further apart once we move to higher dimensional space。

We're then going to see an example of how adding dimensions will ultimately degrade certain model performance when we're working with classification。

and then we're going to start to learn how to fight that curse of dimensionality within your different modeling projects.

So the main point is that in higher dimensional space， points will tend to be further apart。

And this is going to impact our data analysis. Intuitively,

if we think back to the clustering examples that we've already gone through, we talked about how distant each of the different points is from one another and what the nearest neighbor really is.

Can we really say it's a neighbor

if the points move incredibly far apart once we go to these higher dimensions?

So this notebook will show why higher dimensional space does lead to this sparse data。

leads to data points being naturally further apart from one another。

So we're going to start off with a circle inside a square。

And the idea is that the side of the square is going to equal the diameter of the circle.

That's going to be both the width and the height of our square, since a square's

width and height are automatically the same.

We're going to have a unit circle, so the diameter is going to be 2,

and then our width is going to be 2 and our height is going to be 2.

And the point is that that circle should touch the borders of our square。

and we want to know using that circle within the square， how much of that is empty space。

And then we're going to move to the next step and create a sphere within a cube using those same dimensions:

two by two by two, with a radius of one. And we're going to see how, just by moving to higher dimensions

with those same measurements, a larger proportion of the space is not covered by that round object.

And then generalize that to higher dimensions and discuss how as we move to higher dimensions。

just the fact that we're moving to higher dimensions leads to there being more empty space within our square or that square moved into higher dimensions。

So the point is to bring in that concept， but I'd be remiss if I didn't also walk you through a lot of the new code that we're going to be talking about as we go through some more complicated plots。

This way， when you're back home， you will be able to go ahead。

create these plots on your own and understand what went into the code。

as well as once we get to the next couple of cells。

being able to start to even plot in three dimensions。So with that in mind。

I'm going to create an empty cell above so that we can walk through all the different code that's within this function。

So to start off, we're going to initiate our figure. `plt.gcf()`

is a way to get the current figure; if that figure doesn't exist,

it will initiate a new figure. And then taking that figure,

we're going to add on our subplot, and that's going to be our axes. And we're just saying one by one.

If you think about subplots, that could be two by one or two by two. If you say two by one,

you'd have two rows, each with its own bounding box, and one column.

And we're saying which one we want to select; we're just selecting that first one.

And then we're saying `aspect` equals `'equal'`. This is similar to what we saw in the last notebook when we wanted to draw that circle: it matters that our x axis and our y axis are on the same scale.

If one of those is on a different scale, then it looks like we have a rectangle rather than a square, or an oval rather than a circle.
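The figure and axes setup described above can be sketched as follows. This is a reconstruction from the narration, not the notebook's exact cell; the `Agg` backend line is an assumption added only so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script
import matplotlib.pyplot as plt

# Get the current figure, creating a new one if none exists yet.
fig = plt.gcf()

# One row, one column, select the first subplot; keep x and y on the same scale.
ax = fig.add_subplot(1, 1, 1, aspect="equal")

# With nothing drawn yet, the bounding box runs from 0 to 1 on both axes.
print(ax.get_xlim(), ax.get_ylim())
```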

So we run this and we see that we now have our bounding box going from 0 to 1.

Now I'm going to quickly skip over the building of the circle, and we're going to come back to it.

That's because this is going to be the circle, like we mentioned, centered at (0, 0),

going from 0 to 1 and also from 0 to negative 1,

and our box currently only goes from 0 to 1, not from negative 1 to 1.

So I'll bring this back into play once we walk through the code where we increase our x limit and our y limit.

So then we're going to add on this scatter plot, which is just going to be a single dot, because (0, 0) is the point that we're bringing in.

And we are saying that we have the size equal to 10 and the color equal to black. That's at (0, 0).

And now the scale has changed a little bit to ensure that the dot is in the center. But again,

it still doesn't go from negative 1 to 1.

We're then going to add on a straight line, and this is going to represent the radius of our circle,

so it's going to go from 0 to 1. And we're going to have 100 different points,

so the x values go from 0 to 1 in 100 evenly spaced steps.

And then for the y values, we're going to stay at 0.

And this will allow us to create that straight line that we see here. Again,

this messes with the axes a bit, so the plot looks a little bit funky.

But we'll see in just a second what this looks like once we increase those limits. In fact,

I'll do that right now. Let's change the x limit and y limit.

We're going to add this on to our graph.

We have to make sure that we have no extra tabs there. And now we see that

the line goes from 0 to 1, and we're able to see negative 1 to 1 on our x axis and negative 1 to 1 on our y axis.

Now we can go through some of the pieces that we skipped over， so coming back first to the circle。

And this is probably the newest piece for those who are watching this video.

What we're doing is getting the current axes, which is just our bounding box.

And then we're calling `add_artist`. An artist object is essentially anything that you have within your plot:

that's going to be your ticks, that's going to be your numbers, that's going to be your lines.

Those are all artist objects. When we call `plt.Circle`,

that's going to be a subclass of that artist object, and it won't show up unless we call `add_artist`.

And you can Google and look at the discussion on `add_artist` and how it works in regards to creating your

Matplotlib plots and the different ways that you can use it.

But the idea is that it will take things that are subclasses of that artist object and be able to add them on.

So we're adding on this circle. This circle is going to be centered at (0, 0) with a radius of one.

And then we're saying alpha equals 0。5， that's just how opaque our circle is the same way that we saw alpha earlier。

So we run this and now we have a circle on our plot。So we have our circle within our bounded square。

We're then going to add on an "r". So we're just adding text, so `ax.text`.

And we're saying that we want an "r"; we can set the size of that "r" and where we want it to lie, at (0.4, 0.1),

so we have that "r" there at 0.4 and 0.1.

We've already set our x limit and y limit, and hopefully you're already familiar with setting your y label,

your x label, and your title, but we'll throw that in.

So we have all that within our plot. And then, when we say `point` equals 0 here,

I want to ensure that no one's misled. The way that it's being used is that `point` is treated as false.

In Python, 0 always evaluates to false,

whereas any other number evaluates to true. So if `point` is true,

that is, if it's not zero as it is by default, then we're going to create a dot

at (0.85, 0.85), and we're just going to write on top of it

that it's a far away point, just to highlight what we're signifying as a far away point.

So the idea is that each axis in this example is supposed to be a different covariate, and we're supposed to imagine we've standard scaled our data,

so they're centered on 0. This means that the average for each covariate is now 0, which is the center of our circle, and points that are outside the unit circle would be harder to classify because these values are far away from our mean.

So this is just saying that values that are outside the circle。

so taking this idea of a circle within a square and moving it to the idea of how it would apply when we're talking about creating our different machine learning models。

Is that we are now identifying that anything outside that circle is pretty far away from the mean as we have standard scaled our data and that means it's over a single standard deviation away。

So we're going to run this, and we see our unit circle when we call `make_circle` on its own,

very similar to what we have above.

And then when we call `make_circle` and pass in 1 rather than `point` equals 0,

it's going to add on that far away point, and that far away point will be the same no matter what.

It has nothing to do with the number you pass in. Again, it's just true versus false there.
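Putting the steps from this walkthrough together, a sketch of such a `make_circle` function might look like the following. It is reconstructed from the narration, so the text size, the axis labels, the title, the purple color, and the exact position of the "Far away" label are assumptions; `plt.figure()` is used instead of `plt.gcf()` so that every call starts from a clean figure when run as a script.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script
import matplotlib.pyplot as plt
import numpy as np

def make_circle(point=0):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1, aspect="equal")

    # The circle is an artist; it only shows up once it is added to the axes.
    ax.add_artist(plt.Circle((0, 0), 1, alpha=0.5))

    # Center dot and a radius line made of 100 points from 0 to 1 at y = 0.
    ax.scatter(0, 0, s=10, color="black")
    ax.plot(np.linspace(0, 1, 100), np.zeros(100), color="black")
    ax.text(0.4, 0.1, "r", size=48)  # size is an assumed value

    # Widen the view so the whole unit circle fits inside the square.
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_xlabel("Covariate A")  # assumed label
    ax.set_ylabel("Covariate B")  # assumed label
    ax.set_title("Unit Circle")   # assumed title

    # `point` is only used as a truthy/falsy flag: 0 is False, anything else True.
    if point:
        ax.scatter(0.85, 0.85, s=10, color="purple")
        ax.text(0.55, 0.9, "Far away", color="purple")
    return ax

plain_ax = make_circle()      # circle only
flagged_ax = make_circle(1)   # circle plus the far away point
```

Calling `make_circle(7)` gives the same picture as `make_circle(1)`, since only the truthiness of the argument matters.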

Now, the point that we want to make here is how much of this square is going to be outside that circle.

Again, thinking back to how this relates to our modeling,

if we have standard scaled our two different covariates,

which means that being one unit away from the mean

means that we're a standard deviation from the mean value for each one of our different covariates, covariate A and covariate B,

which we'll ultimately be using for predictions, how many of our points are going to be far away? Now,

since the square has a side length of 2r, with the radius r being 1, the area of the square is going to be (2r) squared,

just taking the formula for the area of a square, 2r times 2r.

The percentage of the square outside the circle is going to be one minus pi r squared,

which is the area of your circle, over (2r) squared.

That's just one minus the area of the circle over the area of the square,

so it's 1 minus pi over 4 once you cancel out the r squared terms.

And that 1 minus pi over 4 means that approximately 21% of that square is outside the circle.
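The arithmetic above is easy to check in plain Python. This is just a sketch of the formula 1 - pi r^2 / (2r)^2; the function name is illustrative.

```python
import math

def fraction_outside_circle(r=1.0):
    # Square with side 2r around a circle of radius r.
    square_area = (2 * r) ** 2
    circle_area = math.pi * r ** 2
    return 1 - circle_area / square_area

print(round(fraction_outside_circle(), 4))     # 0.2146
print(round(fraction_outside_circle(3.0), 4))  # 0.2146, the radius cancels out
```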

So I'm going to pause here。And in the next video， we're going to extend this out to a cube and also walk through how you can create 3D graphs using Python。

All right， I'll see you there。

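015: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p15 14_Curse of Dimensionality Notebook Part 2.zh_en -BV1eu4m1F7oz_p15-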

Building off of what we just discussed in two dimensions。 So we had our square and our circle。

And we saw that 21 per cent of our square was outside of the circle。

We are now going to push that out into three dimensions and work with a sphere rather than a circle and a cube rather than a square。

Now， I want to remind you how this ties back to a data science problem。

The idea here thinking about the two dimensions is that we have for each dimension。

that is a different feature。 So a different covariate。 So we have covariate A， Covariate B。

Both have been normalized。 And we can think that the values lie between negative one and one for each one of those。

and they've been standardized。 So they have a mean of 0 and a standard deviation of one。

If we think about this value and look at the square here。

The idea is that to be a single unit away from that center value。

That would indicate that you're one standard deviation away， whether you're pointing horizontally。

vertically or diagonally。 And we can see all the different values that lie one standard deviation from the mean。

That's your unit circle。 And then all values that are outside of that circle are going to relate to those that are far away from the mean above one standard deviation from the mean。

But still within that negative one to one range。 So we see。

given that we're working with values between negative one and one for our covariates,

These are the values outside the circle that will be， in a sense， outliers。

Now， in that same sense， if we were to add on a third dimension Covariate C。

And that's what we're planning to do here。 The idea is。

we're still working from negative one to one。 Still。

our sphere now will indicate one standard deviation away in any direction， now。

not just diagonally but diagonally within space。

And then anything outside that sphere， we can then again think of in the same sense that we just did with the circle and the square that this is more than one standard deviation away still between negative one and one and see how many outliers we have。

So again， with the square we had 21% now we're moving to plotting in three dimensions。

I'm going to show you step by step how you can plot some of these values in three dimensions so that you can go home or leave this notebook and then be able to plot in three dimensions yourself as well。

So the first thing that we do is import this `Axes3D` class. Now,

if we don't do this and we try to create our figure,

and then from that figure we get our current axes and make them a 3D projection,

we'll see that we get an error.

We will have to first import `Axes3D` to give us that option.

So we pull this in.

And I'm going to run through this just as we did before:

add a cell above so we can see this step by step. But now that I've imported `Axes3D`,

we see that now， rather than working in two dimensions。

you can see how we can start to work within three dimensions。

So hopefully this is exciting to see that we have values for X， Y axes and then now a z axis as well。

We're then going to draw our cube。 Now we have here this idea of combinations and taking the products。

I don't want to walk too much into it。 I'll show you quickly how the product works。

and I would suggest you can look at the combinations and see how it works as well built off of this product that I'm about to create。

But right now, we're taking the product of three r's, and r is just defined as negative 1 and 1.

In order to make this a little clearer,

we're going to use three different lists of two values. Rather than negative 1 and 1, though,

we're going to use [1, 2], [3, 4], and [5, 6]. And when I take the product,

I'll wrap it in a list so that we can see the output; otherwise it's just a lazy `product` object.

We also have to make sure that we import that library.

You see that it comes up with every possible combination, not counting different orderings. So 1, 3, and 5,

taking the first value from each of the lists, then 1, 3, and 6, so first, first, and second,

and then 1, 4, 5 and 1, 4, 6,

then 2, 3, 5, so you can see how it's going through each one of these different values, ensuring that it covers all the different possible combinations.
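The small demonstration just described can be reproduced with `itertools.product` from the standard library:

```python
from itertools import product

# One value from each list, in every possible combination.
triples = list(product([1, 2], [3, 4], [5, 6]))

for triple in triples:
    print(triple)
# (1, 3, 5), (1, 3, 6), (1, 4, 5), (1, 4, 6),
# (2, 3, 5), (2, 3, 6), (2, 4, 5), (2, 4, 6)
```

With `r = [-1, 1]` passed three times, the same call yields the eight corners of the cube.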

So it does that with negative 1 and 1. And then taking combinations of size 2 will give you every possible pair of those corners.

I wouldn't worry too much about it. The point here is that, given that

we're pulling out an s and an e, it's going to output two different corners each time we get a combination.

We're going to take the sum of the absolute differences between s and e, and that has to be equal to, using our r, r[1]

minus r[0] in order for it to be an edge on our cube. So that's all it's trying to do:

it's trying to find where each of our edges lies.
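The edge test can be sketched without any plotting. This version uses plain tuples instead of NumPy arrays, and it assumes the check is on the sum of absolute differences: two corners share an edge when they differ in exactly one coordinate, so that sum equals `r[1] - r[0]`.

```python
from itertools import combinations, product

r = [-1, 1]
corners = list(product(r, r, r))  # the 8 corners of the cube

edges = [
    (s, e)
    for s, e in combinations(corners, 2)  # every pair of corners
    # Keep the pair only if the corners differ along a single axis.
    if sum(abs(a - b) for a, b in zip(s, e)) == r[1] - r[0]
]

print(len(corners), len(edges))  # 8 12
print(edges[0])                  # ((-1, -1, -1), (-1, -1, 1))
```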

Now I'm going to pull out this portion of code just to show you how one line is drawn in three dimensional space。

Let me take all this.

We copy this and move it above. And we're saying for s and e;

we don't care too much about that. But what we do care about is this zip of s and e and then plotting that.

So, in order to see what that looks like: the star is going to ensure that it unpacks it.

So rather than just getting a zip object, we'll see the actual output.

And I'll actually print here so we can see what

this output looks like, zipping s and e. And then I'm going to break,

so we're just going to plot one line.

So I'm going to run this. r is not defined yet if you forgot to copy that in,

so say r equals [-1, 1].

And we see that we plotted this one line. Now, the zip of s and e:

this is going to be the x values of our two points, the y values of our two points,

and the z values of our two points.

So we're plotting from (-1, -1, -1) up to (-1, -1, 1),

so that's the idea that we're seeing here. And it's hard to see in three dimensional space,

but we are going from (-1, -1, -1) up to (-1, -1, 1).

Now, when we run through all the different lines, all we're doing is using this `plot3D`,

which will work exactly the same as plot in two dimensional space.

It just creates the lines connecting those two dots, the same way you would in two dimensional space

by calling `ax.plot`.

So if I don't run the break here and let the for loop run all the way through,

you see here that we now have our cube connecting each one of these points that we have here。
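A sketch of the full cube drawing, under a few assumptions: the axes are created with `add_subplot(projection="3d")` rather than through the current-axes call used in the video, the line color is arbitrary, and the `Agg` backend is set only so the snippet runs as a script. Newer Matplotlib releases register the 3D projection on their own, so the `Axes3D` import may only be required on older versions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script
import matplotlib.pyplot as plt
from itertools import combinations, product
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older Matplotlib)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection="3d")

r = [-1, 1]
for s, e in combinations(list(product(r, r, r)), 2):
    # Two corners form an edge when they differ in exactly one coordinate.
    if sum(abs(a - b) for a, b in zip(s, e)) == r[1] - r[0]:
        # zip(s, e) pairs up the x values, the y values, and the z values.
        ax.plot3D(*zip(s, e), color="b")

print(len(ax.lines))  # one line per edge of the cube
```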

The next step that we want to do in order to draw our sphere is we're first going to create this mesh grid。

So I'm going to copy this above into a different cell。And in order to make this a little clear。

this is going to be the number of points。

As for the j: just in general, so you know,

within Python it means a complex number.

We are working here with the j not because we're working with complex numbers,

but because the complex step lets the grid know that, rather than counting by 20,

we want 20 points between 0 and two times pi.

That's all we're doing here by using the complex number. But we're going to reduce this,

just for our example, to 3 and 2:

Three values and two values. The idea is that we want to plot along many different points. Here the first axis is supposed to go from 0 to two times pi,

and we want to have three different values, so it'll go 0, then pi, then two times pi.

And then we're also going from 0 to pi with just two values, so 0 and pi.

And the idea is that we want to plot all the possible combinations of these points.

In order to do that, we have to create this mesh grid so that we have 0 and 0, as well as 0 and pi;

and then,

with pi coming from our count from 0 through 2 pi, we have pi and 0,

and then pi and pi for our second axis, and so on and so forth.

So that's the idea of the mesh grid: to allow you to plot on each one of these multiple points. Now,

it has two outputs,

one for each of the different grids, and those are both equal in shape. So you have index 0 and index 1.

That's supposed to be your x and y; here, we're plotting in three dimensions.
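The reduced mesh grid from this example can be inspected directly with `np.mgrid`. The variable names `u` and `v` follow the narration in the next step; everything else is a minimal sketch.

```python
import numpy as np

# A complex step such as 3j means "3 evenly spaced points, endpoints included".
u, v = np.mgrid[0:2 * np.pi:3j, 0:np.pi:2j]

print(u.shape, v.shape)  # (3, 2) (3, 2)
print(u)  # rows hold 0, pi, 2*pi (the first axis values, repeated across columns)
print(v)  # columns hold 0 and pi (the second axis values, repeated down the rows)

# Pairing the two grids element by element gives every (u, v) combination.
pairs = list(zip(u.ravel(), v.ravel()))
print(len(pairs))  # 6
```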

And all we're doing is taking that two dimensional graph。

And we're expanding that to create our sphere by using each one of those different points and taking the cosines and sines of these values,

with u going from 0 to 2 pi and v going from 0 to pi,

and multiplying them together. And then for z, we just take the cosine of v,

and that will create our sphere. So we're going to have our three coordinates for

all these multiple points. And right now, they're just points out in space.

And in order to connect all those points into one final sphere,

we're going to use this `plot_wireframe`, which will connect all those dots together.

So we call `ax.plot_wireframe` on the x, y, and z.

And we run this。And then we see all these different points that were created in three dimensional space。

all being connected by this wire frame。
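The sphere step can be sketched as below, using the standard spherical parameterization. The 20 points for `u` come from the narration; the 10 points for `v`, the color, and the `Agg` backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script
import matplotlib.pyplot as plt
import numpy as np

# u sweeps around the sphere (0 to 2*pi), v sweeps from pole to pole (0 to pi).
u, v = np.mgrid[0:2 * np.pi:20j, 0:np.pi:10j]

# Standard parameterization of the unit sphere.
x = np.cos(u) * np.sin(v)
y = np.sin(u) * np.sin(v)
z = np.cos(v)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection="3d")
ax.plot_wireframe(x, y, z, color="r")  # connects the grid points into a wire frame

# Every generated point sits exactly one unit from the origin.
print(np.allclose(x ** 2 + y ** 2 + z ** 2, 1.0))  # True
```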

And ultimately, we saw here how to plot in 3D, though it may be difficult to visualize how much extra empty space there is.

But if we think about it in terms of the equations,

the volume of this sphere is given by 4 over 3 times pi r cubed.

And since we're working with a cube with a side length of 2r,

it's going to have (2r) cubed in terms of volume. And when we calculate the percent of that cube,

again thinking of this as three different covariates,

we can see that the share of the volume outside the sphere is going to be one minus the volume of the sphere,

four over three times pi r cubed, over (2r) cubed. You do some simplification,

and you end up with 1 minus pi over 6, so approximately 48 percent of your values are outside the sphere.

So working with that same range of negative one to one。

And that same radius being described as your standard deviation and being beyond that being a bit of an outlier。

we see that 48 percent of our values are now outliers

now that we've moved up to three dimensional space.
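As with the circle, the three dimensional figure can be checked in a couple of lines. This sketch evaluates 1 - (4/3 pi r^3) / (2r)^3; the function name is illustrative.

```python
import math

def fraction_outside_sphere(r=1.0):
    # Cube with side 2r around a sphere of radius r.
    cube_volume = (2 * r) ** 3
    sphere_volume = 4 / 3 * math.pi * r ** 3
    return 1 - sphere_volume / cube_volume

print(round(fraction_outside_sphere(), 4))  # 0.4764, i.e. 1 - pi/6
```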

So that closes out this video in the next video we will continue and show you how you can actually generalize this to even higher dimensional space and see those percentages as we continuously increase the number of dimensions。

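016: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p16 15_Curse of Dimensionality Notebook Part 3.zh_en -BV1eu4m1F7oz_p16-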

Now， we discussed how we moved from two dimensions up to three dimensions。

And we saw that when we moved from2 to three dimensions。

We saw how many more values are more than one unit away from that mean value of each one of our different covariates。

Again， working between negative one and one。 And we see that before it was at 21 per cent in two dimensions and then just adding on one more covariate with the same range from negative one to 1 and the same idea that it's going to be standardized with the standard deviation of one and a mean of 0。

We saw that 48 per cent lie outside. Now, what we want to see from there is whether we can generalize up to higher dimensions.

Now， obviously， we won't be able to plot in higher dimensions。

But we can start to get an intuitive sense if the idea is。

if it's within one unit away from that mean。That would mean that we are working within the ball。

within the sphere， whatever you want to call it。 And then outside of that， still using that range。

the similar range for each one of our different covariates。 If it's outside of that ball。

then we would say that it was outside of that standard deviation。

And we'd say that's a bit of an outlier using the same sized covariates。So in order to do that。

what we're going to start with is a random sample. Calling `np.random.sample` is just going to pull random points from a uniform distribution

from 0 to 1. We're saying that we want the size to be 5 rows and two columns,

so we're going to have two dimensional points. We're then going to get the norm. Again,

this is just the distance from (0, 0). The norm is going to be that Euclidean distance from (0, 0).

So the Euclidean distance just starts from each value squared, because we're measuring from (0, 0).

So we square each value, sum them, and then take the square root of that sum of squares.

We're calling `.sum` and summing along axis 1 just because we're going to be passing in an array,

And we want to get that sum for each one of our individual points。

So we're getting that Euclidean distance for each one of our points。

And then we want to determine using that norm whether or not we are one unit away from that mean or if we're greater than one unit away。

And no matter the dimensional space， that's going to be the way that we determine whether or not we are within the ball within that sphere or not。

So we're going to say that `in_the_ball` will just tell us

whether that value, using the norm that we just defined, is within the ball or not within the ball.

And it will return either a true or false value。

So just to see an example of this, we're going to use that sample data. We're going to say for x,

y in zip, and we're going to zip together both the norm value, so we can see what the norm is for each one of these sample data points,

and then we can say whether or not that's in the ball。

and we should see anything above one being outside the ball。

So we run this。 And first， we printed out our sample data that we randomly generated with all the points being between 0 and 1。

And we see that all of these were actually within the circle。

Here we're working in two dimensional space, and that was a bit lucky. You see, if I run this again,

that two of them happened to be outside of the circle。
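The norm and the in-the-ball check described above can be sketched with NumPy as follows. The function names `norm` and `in_the_ball` follow the narration; the exact bodies are reconstructed and should be read as assumptions.

```python
import numpy as np

def norm(x):
    # Euclidean distance of each row from the origin: sqrt of the row-wise sum of squares.
    return np.sqrt((x ** 2).sum(1))

def in_the_ball(x):
    # True where the point is less than one unit from the origin.
    return norm(x) < 1

# 5 random two dimensional points, each coordinate uniform on [0, 1).
sample_data = np.random.sample((5, 2))
print(sample_data)

# Show each point's distance from the origin and whether it is inside the unit circle.
for distance, inside in zip(norm(sample_data), in_the_ball(sample_data)):
    print(distance, inside)
```

Because the points are random, the number that land outside the circle changes from run to run, as in the video.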

Now, how would we generalize this beyond two dimensions?

We saw we could do three dimensions, but we can do it for any number of dimensions.

So the way that we're going to do that is to create this function called `what_percent_of_the_ncube_is_in_the_nball`,

so what percent of the n-dimensional cube is in the n-dimensional ball. We pass in the number of dimensions.

We can also pass in our sample size; here we're going to use 10,000.

So we're going to generate 10,000 random points. We're then going to create a random sample again;

those will be values between 0 and 1, using a shape of 10,000 different rows,

all with the dimensions defined by the number of dimensions we pass into this function。

So originally， we just did two dimensions， as we saw in our samples here。 Now。

we're going to move that up to 3，4，5 dimensions。 And you can also imagine this again。

Think that each one of these different rows。Contains our first covariate then our second covariate。

And when we add more dimensions, all we're doing is adding on more features,

Adding on more dimensions。

So what we're going to do is call `in_the_ball` for these 10,000 different values.

And then we're going to call `.mean`. So if you think about it,

this will be outputting either true or false for each one of these 10,000 values.

True or false can be used as 1 and 0, with true being 1 and false being 0. If we take the average,

we can see what percentage actually falls within the ball。

That's how this `.mean` will work for us. And then we're saying for iteration in range 100, so that we get 100 different examples of these 10,000 points, to ensure that we converge on something close to the actual solution when generalizing to these higher dimensions.

So we end up with 100 different values for the average amount that lies within the ball versus outside the ball。

And then we take the mean of those values。

And that will give us the percentage of the n-cube that's in the n-ball.
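A sketch of this Monte Carlo estimate is shown below. The function name and defaults follow the narration; the body is reconstructed, so treat it as an assumption rather than the notebook's exact code. Sampling from 0 to 1 covers only one corner of the cube, but by symmetry the fraction inside the ball is the same as for the full cube from negative 1 to 1.

```python
import numpy as np

def norm(x):
    return np.sqrt((x ** 2).sum(1))

def in_the_ball(x):
    return norm(x) < 1

def what_percent_of_the_ncube_is_in_the_nball(d_dim, sample_size=10**4):
    shape = (sample_size, d_dim)
    # 100 repetitions; each one is the share of 10,000 random points inside the ball.
    data = np.array([
        in_the_ball(np.random.sample(shape)).mean()  # True/False averaged as 1/0
        for _ in range(100)
    ])
    return data.mean()

print(what_percent_of_the_ncube_is_in_the_nball(2))  # close to pi/4
print(what_percent_of_the_ncube_is_in_the_nball(3))  # close to pi/6
```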

We're then going to run this for dimensions ranging from 2 up to 15, not including 15,

so up to 14. Those are going to be the different dimensions that we're going to test. And then

our data is going to be, for each of these, the percentage that is in the ball.

So we're just going to map these different dimensions into our `what_percent_of_the_ncube_is_in_the_nball` function,

and that will output, for each one of these different values in the range,

what percentage actually lies within the ball, the circle, whatever it is.

You see here that we also include 2 and 3, so we'll also be able to check, compared to what we saw before,

whether or not we have close approximations of what the actual values are。

given the calculations that we had in regards to the actual。

Formulas of a sphere versus a cube and a circle versus a square。

So we say for dim and percent. So we're just getting, say, starting with 2, the dimension 2 and then the output for 2 from that data.

We're going to zip those two together and get the dimension as well as the percent within the ball.

So we see that 78 per cent fall within the ball at first, 78.5, which makes sense

given that we saw before that 21 per cent was outside of the ball.

Same with 52 per cent being in the ball for three dimensions; we saw 48 per cent outside above.

And we see how that drops off quite dramatically as we keep increasing the number of dimensions。
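For comparison with the simulated percentages, the exact fraction can be computed from the standard closed-form volume of the unit n-ball, pi^(n/2) / Gamma(n/2 + 1), divided by the volume 2^n of the enclosing cube. This formula is not part of the notebook walkthrough; it is added here only as a cross-check, and the function name is illustrative.

```python
import math

def exact_fraction_in_ball(n_dim):
    # Volume of the unit n-ball divided by the volume of the cube with side 2.
    ball_volume = math.pi ** (n_dim / 2) / math.gamma(n_dim / 2 + 1)
    return ball_volume / 2 ** n_dim

# Same range as in the video: 2 up to, but not including, 15.
for dim in range(2, 15):
    print(dim, round(exact_fraction_in_ball(dim), 4))
```

Dimensions 2 and 3 give about 0.7854 and 0.5236, matching the 78.5 and 52 per cent seen above, and the value falls below one per cent by dimension 10.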

So more and more of our values, as we add on these features,

all with similar ranges and similar standard deviations,

We see how many of them tend to be outliers。

And we can finally plot this with a simple plot, calling `plt.plot`.

We're going to get our x label， our Y label， and just our title。

and all we're doing is our dimensions。

Versus the data， which is the percentage of。

The amount that falls within the ball versus not。 And we can see how it steeply drops off as we add on more and more dimensions。

So just to double check our understanding, we see that this is dropping off quite dramatically.

We're also going to measure the distance from the center of our cube to its nearest point。

So you can see out of all those points that we have。

But here we're going to generate, rather than 10,000, just 1,000 points.

Out of those thousand points, we can see which one is closest to the center,

and, to give you a little bit of a spoiler,

We will see that that closest point will be farther and farther away as we increase the number of dimensions。

So this is just a bit more evidence to that same point。So we're going to pass in the dimension。

We're going to pass in our sample size, with the default set equal to 1,000.

We're going to, again, call a random sample. This time, rather than 0 to 1, we'll subtract 0.5,

so it's centered at 0, and the values will go from negative 0.5 up to 0.5.

And then we will return the min of the norm of each one of these points. Again,

the norm is the distance from 0 in any direction.

And then in order to estimate the closest point for a given dimension,

we can use that `get_min_distance` that we just defined, which will give us that minimum distance using the norm of each one of those points.

We're going to do that 100 times over。 So in the same fashion that we just did to ensure that we have a large enough sample。

And then we're going to return not just the average of that data。

But the minimum of those minimums。

As well as the maximum of those minimums so that we can get a bit of a range of the values in regards to how far away they are from the origin。
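The two helpers described here can be sketched as follows. The names `get_min_distance` and `estimate_closest` follow the narration; the bodies are reconstructed, so they are assumptions about the notebook rather than a copy of it.

```python
import numpy as np

def norm(x):
    return np.sqrt((x ** 2).sum(1))

def get_min_distance(dimension, sample_size=10**3):
    # Uniform points in a cube centered at the origin: each coordinate in [-0.5, 0.5).
    points = np.random.sample((sample_size, dimension)) - 0.5
    # Distance from the origin to the closest of those points.
    return norm(points).min()

def estimate_closest(dimension):
    # Repeat 100 times, then summarize the 100 minimum distances.
    data = np.array([get_min_distance(dimension) for _ in range(100)])
    return data.mean(), data.min(), data.max()

mean_6, low_6, high_6 = estimate_closest(6)
print(mean_6, low_6, high_6)
```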

So we're going to calculate this from values ranging from 2 to 100。

We're then going to map those dims into that estimate closest function that we just defined above。

And we can print this out。

And this will take just a second to run。 And then afterwards。

we'll also be able to plot using that same functionality that we just discussed. So we see here that,

for dimension 6, the

average value was 0.22. The minimum of those minimum values was 0.1,

and the maximum of those minimum values, given the 100 different iterations of this, was 0.3.

So we're going to plot those dimensions against the `min_data`, all of the rows and the first column.

And then there's this `plt.fill_between`. We're going to use that in order to plot both our min and max.

So we'll have the line of the average values, and then we'll also be able to fill between the min and max values so we can see a bit more clearly what the range was as we increase the number of dimensions.

So the `min_data`, if you recall, is going to hold three different values:

column 0 is the mean; column 1, the second column since Python counts from 0, is the minimum;

and column 2 is the max. And we're saying alpha equal to 0.5 because it's going to fill between the two values,

And we want to also see that line in between。

So you run this。And we can see as we increase the number of dimensions。

how far that minimum point is from the origin， as well as that good of range that we are able to get using that fill between as well。
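
A plot along these lines could look like the sketch below, assuming Matplotlib and NumPy. The `summarize` helper, the reduced sample size and trial count, the axis labels, and the seed are illustrative assumptions chosen to keep the example quick.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for repeatability


def summarize(dimension, sample_size=500, n_trials=20):
    """Mean, min and max of the closest-point distance over several trials."""
    mins = [
        np.linalg.norm(rng.random((sample_size, dimension)) - 0.5, axis=1).min()
        for _ in range(n_trials)
    ]
    return np.mean(mins), np.min(mins), np.max(mins)


dims = np.arange(2, 101)
min_data = np.array([summarize(d) for d in dims])  # columns: mean, min, max

fig, ax = plt.subplots()
# Column 0 is the mean, drawn as a line.
ax.plot(dims, min_data[:, 0], label="mean of minimum distances")
# Columns 1 and 2 bound the shaded band; alpha=0.5 keeps the line visible.
ax.fill_between(dims, min_data[:, 1], min_data[:, 2], alpha=0.5, label="min to max")
ax.set_xlabel("number of dimensions")
ax.set_ylabel("distance from origin to closest point")
ax.legend()
# In a notebook the figure renders inline; a script would call plt.show().
```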

So that closes out this video. It gave us an opportunity to look at how we can expand up into higher dimensions.

With all this in mind, in the next video we will begin to show you the effects of working with high dimensional data when you are actually trying to use the different classification algorithms that we introduced in the last course.

All right, I'll see you there.

017：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p17 16_维数灾难笔记本第4部分.zh_en -BV1eu4m1F7oz_p17-

Welcome back to the final video for this notebook. In this final video, we're going to show how high dimensionality can end up affecting model performance.

And with that, I want to quickly touch again on how we can fight the curse of dimensionality.

Two different methods that should immediately come to mind, as we discussed them in the intro to this course, are feature selection, where you would use domain knowledge to reduce the number of features to the ones that you think are already informative, as well as feature extraction.

With feature extraction, you're going to use dimensionality reduction techniques such as PCA, which we'll learn later on within this course, to transform our raw data into lower-dimensional data that will hopefully preserve the majority of the variability in that original data.

And again, we'll touch on this later on in the course.

So here we're going to show, by creating toy data sets, how high dimensionality will end up affecting our model performance.

In order to do so, we're going to import many libraries here. We're doing a classification problem, so we need our train_test_split, and we're going to need our StandardScaler.

And then something new that we haven't seen yet: we're going to use this make_classification function, which is available in sklearn.datasets. That's just going to create a toy data set with a certain number of classes, and I'll show you this in practice in just a second.

And then we're going to use our DecisionTreeClassifier to ultimately predict the class.

So the first thing that we're going to do is create our classification data set. In order to do so, we're using this make_classification function that I just introduced a second ago, and I'm going to show you a bit about how these arguments work.

So I'm going to create a cell above.

And of course, first we're going to have to import that library, but we're then going to use this to create our X and y. So now we have our X.

And our X is going to be this two-dimensional data set. The default is that there are going to be 100 samples, and you see here that it has 100 samples. So if we were to run X.shape, we would see that we have 100 rows and two features.

We get those two features because we said that we want the number of features equal to two.

We're saying that the number of redundant features, the ones that don't give any extra information, is going to be equal to 0.

If you imagine, often we will have redundant features. As we discussed, if you're talking about age versus whether or not someone is a senior, there will be a bit of redundancy built in.

The number of informative features will be the rest of that, so we're saying all of our features will be informative here.

And then the number of clusters per class will allow us to spread out that data in a way.

Now, I'm going to plot each one of our different classes. Along with that X, we have y, which tells us which class each one of those values belongs to.

In order to look at both of those, we're going to use a scatter plot, and we're going to scatter our X such that y equals 1. On X, we want first our first feature, and then we're going to want our second feature.

And then I'm going to use this again to create another scatter plot. That's going to be for our y equal to 0, and everything else is the same.

So here you see our two different classes. They're differentiated fairly clearly.

And just to show you how different settings work: if we were to say the number of clusters is equal to one, so we don't have separate clusters within our different classes, then you see that they're very clearly separated. So adding on this extra cluster allowed them to be a little bit closer together, as there were going to be separate clusters for each class.

Also, to go along with that, if instead of having both of our features being informative we only had one of our features as informative, then we would see that the other one is redundant and everything falls along one axis.

So we don't create this separation, and there's no use in one of those features. One of those features essentially doesn't add any extra value; or, combined, they don't add any extra value.
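
As a rough sketch of the call discussed above, assuming scikit-learn: the argument names are real `make_classification` parameters, while the value of `n_clusters_per_class` and the `random_state` seed are assumptions, since the transcript does not state them.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_features=2,            # total number of columns in X
    n_redundant=0,           # no features built as combinations of the informative ones
    n_informative=2,         # both features carry class information
    n_clusters_per_class=2,  # assumed value; each class is drawn from two blobs
    random_state=0,          # assumed seed, only for repeatability
)
print(X.shape)  # n_samples defaults to 100

# Boolean masks select the rows of each class. These are the arrays a scatter
# plot would receive, for example plt.scatter(class_1[:, 0], class_1[:, 1]).
class_1 = X[y == 1]
class_0 = X[y == 0]
print(len(class_0), len(class_1))
```

Changing `n_clusters_per_class` to 1 gives each class a single blob, which is the variation described in the walkthrough.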

So that's how make_classification works. We saw that original plot of what our data actually looks like when we're working with two features.

We're then going to add a bit of noise onto that. So we're just going to use our random state here.

We're setting a random state equal to 2. With that object, we can call rng.uniform, and we're going to be adding on two times a bunch of random values of the same shape as X.

So we're adding 2 times something the same size as X, which means it'll add to each one of the individual data points within our 100 by 2 array.

And we're going to add on values that are between 0 and 2. The default for rng.uniform will be values between 0 and 1, and we're going to multiply that by two, so they'll be values between 0 and 2.

We're then going to scale our data so that the mean will be 0 and the standard deviation will be 1.

And now that we have our data, resetting X to the StandardScaler version of itself, we can split it into our X train, X test, y train, and y test.

So we have our toy data set.

We can then use our DecisionTreeClassifier, run that on X train and y train, and see what our score is for our X test and y test.

And we see that our score from this two-feature classifier is 0.875.
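
The steps above can be sketched end to end as follows, assuming scikit-learn and NumPy. The seeds other than the random state of 2, the `test_size=0.4` split, and `n_clusters_per_class=1` are assumptions, so the printed accuracy will not necessarily match the value quoted in the video.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Two informative features, as in the walkthrough; random_state=1 is an assumption.
X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, random_state=1,
)

# Add uniform noise in [0, 2): rng.uniform defaults to [0, 1), then scale by 2.
rng = np.random.RandomState(2)
X = X + 2 * rng.uniform(size=X.shape)

# Standardize each feature to mean 0 and standard deviation 1.
X = StandardScaler().fit_transform(X)

# Hold out part of the data; test_size=0.4 and the seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
score = clf.score(X_test, y_test)  # accuracy on the held-out rows
print(f"test accuracy with 2 features: {score:.3f}")
```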

Now we're going to run all the same steps. What's important to note here is that the number of features is obviously going up 100-fold, but with that, we're also ensuring that each one of those different features is informative.

So we're not allowing for redundant features here; we still have all of our features being informative.

We are going to run through the same steps. Otherwise everything else is the same. We are going to set our rng, again setting that random state, add on that extra noise of two times the uniform value, and run through the steps of setting up the training set and the test set.

Then we're going to again use our DecisionTreeClassifier on our standard-scaled data, and check our score on our test set after fitting on our train set.

And we see that our score goes all the way down to 0.425.

So we see that adding on additional features, even if they're informative, ends up leading to worse model performance, due to the fact that it will very heavily increase the amount that the model will overfit to each one of these features.

And something to note along with this is that, as we mentioned during the lectures, if you are going to have more features, you should try to also have more rows of data.

So if we had enough rows of data, maybe we could counteract this problem. But generally, if you are going to have a certain number of rows, the fewer features you have, the more informative each of those features can be and the less likely you will be to overfit.

We're then going to, rather than just looking at 2 and 200, loop through values between 50 and 4000 and run through each one of these same steps.

So all the steps are going to be the same. We're calling for num in np.linspace, starting at 50 and going in increments up till 4000, counting by 50.

And we're just going to continuously pass in that num for the number of features, as well as setting the number of redundant features equal to zero, so that all of them will be informative.

And then everything else is the same. We can get each one of our different scores, as those are going to be appended onto this empty list.

We run this, and this will take just a second to run, and then we can plot that as well, just looking across each one of the different numbers of features and seeing the classification accuracy as we increase the number of features.

Now, by chance, some of these can be a bit more accurate, but adding features in general can very much lead to reductions in accuracy. Not all the time, but it very easily can.

So in this example, the accuracy is highly volatile in the number of features, and increasing features, again, can reduce that accuracy.
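
A loop of this kind might be sketched as below, assuming scikit-learn and NumPy. The helper name `tree_accuracy`, the seeds, and `test_size=0.4` are assumptions, and the range stops at 500 features rather than 4000 purely to keep the example fast. Note that in scikit-learn `n_informative` defaults to 2, so the sketch passes it explicitly to make every feature informative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def tree_accuracy(n_features, seed=2):
    """Build a noisy toy data set with n_features columns and score a decision tree."""
    X, y = make_classification(
        n_features=n_features, n_redundant=0,
        n_informative=n_features,  # explicit, because the default is 2
        random_state=1,
    )
    rng = np.random.RandomState(seed)
    X = StandardScaler().fit_transform(X + 2 * rng.uniform(size=X.shape))
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return clf.score(X_test, y_test)


# 50, 100, ..., 500; the video's range runs up to 4000 in the same steps of 50.
feature_counts = np.linspace(50, 500, 10, dtype=int)
scores = []
for num in feature_counts:
    scores.append(tree_accuracy(int(num)))

for num, s in zip(feature_counts, scores):
    print(f"{num:4d} features -> accuracy {s:.3f}")
```

Plotting `feature_counts` against `scores` gives the accuracy curve discussed in the video.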

Additionally, in our example, we specified that none of the features are redundant, and in practice, when you have this many more features, generally speaking, you will almost definitely have redundant features.

For example, if we are predicting customer churn, as we've discussed throughout these courses, using a variety of customer characteristics, we may have collected extensive data for each customer across many dimensions.

This would be an example in practice of a high-dimensional space, which can make it difficult to apply unsupervised learning methods directly and can potentially lead to issues with this curse of dimensionality as we try to create these groupings.

So that closes out our video here on the curse of dimensionality. With that, we're going to go back to discussing different types of groupings, different types of clustering algorithms, starting off with agglomerative hierarchical clustering, and I look forward to seeing you there.

018：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p18 17_层次聚合聚类.zh_en -BV1eu4m1F7oz_p18-

Now let's talk about our next clustering algorithm, hierarchical agglomerative clustering.

With hierarchical agglomerative clustering, we'll try to continuously split out and merge new clusters successively until we reach a level of convergence.

Now, let's see how hierarchical agglomerative clustering actually works.

So here we're using the same example as before, and we're going to try and come up with our different clusters.

For hierarchical agglomerative clustering, we start off by looking at the points and identifying the pair which has the minimal distance.

So notice here again that the distance becomes a very important factor in the success of our clustering algorithm, so we need to take into account which distance metric we're actually using.

We see that these two points that we have in green are the closest, so we color code them here to highlight that these are going to be our first pair.

And then we continue to do this again, looking for the next closest pair of points, and the next closest pair. And we can keep doing this.

But the next closest pair can actually be a pair of clusters. So we might have two different clusters, or a cluster and a point, that are going to be closest to one another. It doesn't necessarily just have to be two different points.

Now, how we define the distance from a cluster to a point, or from a cluster to another cluster, will depend on our linkage criterion, and we'll expand on this a bit later.

But for the distance to a particular cluster, maybe it's going to be the distance to the average of the points in that cluster. Or maybe it's the minimal distance between all points in a given cluster and that point, so just taking the minimum distance.

And if the closest pair does turn out to involve a cluster, then we can go ahead and merge them into their own cluster.

We don't have that yet, but here we move one more step, and we see that the blue and the green that we had have merged together into the two greens.

And we can continue to create more and more clusters, and we keep going, creating each one of our pairs, moving forward.

And now we see some of them merging together, merging further together.

We have those red dots all creating their own cluster.

And now the number of clusters will start to reduce as we keep moving forward, each one of them combining together.

And at this point, looking here, we're at six different clusters.

We can run again, and we get down to five clusters as we continue to find the closest linkage.

Now we're down to four.

And again, as we continue to move up that ladder, we can continue to merge these different clusters together.

And then we're at three different clusters, then at two different clusters.

And if we were to continue this, we can end up with one large cluster.

So this means that if we allow this to continue, eventually we don't have separate clusters, so we have to come up with some type of stopping criterion when we're using agglomerative clustering.
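
The merge loop described above can be sketched naively in NumPy as follows. This is an illustration only: it assumes the minimum point-to-point distance between clusters as the merge rule and a target number of clusters as the stopping criterion, and the function name `agglomerate` and the toy points are made up for the example.

```python
import numpy as np


def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    # Every point starts as its own cluster, stored as a list of row indices.
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster-to-cluster distance: the closest pair of member points.
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [20, 0]], dtype=float
)
clusters = agglomerate(points, n_clusters=3)
print(clusters)
```

Letting the loop run down to `n_clusters=1` would merge everything into a single cluster, which is why a stopping criterion is needed.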

019：层次链接类型.zh_en -BV1eu4m1F7oz_p19-

So with the idea of using the average distances of all the points within their respective clusters, how do we go about actually finding our stopping point?

So let's say we're at this stage and we have five clusters, as we see here, and each one of those clusters is color coded as we move forward.

So at this stage, we can look at the average cluster distance for each one of our clusters, which we have marked here with the same colors that we just saw in that two-dimensional plot.

So we have each of the average distances, and with that we have our gray dotted line, which marks the point where we are going to stop once all of these average distances are above that line.

So in the next iteration, we find that the light purple and magenta clusters are going to be merged. Therefore, the average cluster distance for that particular cluster should go ahead and increase.

So we can visualize this change in that average cluster distance as follows.

For that new combined cluster, we now have this average cluster distance that is higher than the previous two. Before, we had the light purple and the magenta; we merged those into that higher version of magenta, and we see that we have a higher average cluster distance.

And now we only have four remaining clusters, and as a whole, they're a bit closer to that limit set by that gray line.

In the next step, that purple cluster is going to merge with the teal cluster in that top right corner.

And the new cluster formed by combining that teal and purple is now above that threshold.

Now, we don't stop at this point, though. We are only going to stop once the minimum is above that threshold, and the minimum average cluster distance is still not above that threshold: we still have the pink and magenta below that threshold.

Now, in this next step, once we move to two clusters, the magenta cluster and the pink cluster merge together to create this new pink cluster.

And finally, once we have merged these two, all the cluster distances are above this threshold. They are big enough to therefore claim that the algorithm has finally converged.
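
A distance-based stopping rule can be tried out with SciPy, as in the hedged sketch below. Note the assumption: SciPy cuts the hierarchy on the linkage distance at which two clusters would merge, which is related to, but not the same as, the per-cluster average distance shown on the slide. The blob data and the threshold of 4.0 are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)  # seed chosen only for repeatability
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
# Three tight blobs of 20 points each around the assumed centers.
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Each row of Z records one merge; column 2 is the distance at which it happened.
Z = linkage(X, method="average")

# Stop merging once the next merge would exceed the threshold.
threshold = 4.0
labels = fcluster(Z, t=threshold, criterion="distance")
n_found = len(np.unique(labels))
print("clusters found:", n_found)
print("last three merge distances:", np.round(Z[-3:, 2], 2))
```

The merge distances in `Z[:, 2]` grow as the hierarchy is built, so the threshold acts like the gray line: merges below it are accepted and merges above it are not.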

Now, we mentioned earlier that at each step we would want to merge the clusters that are closest to one another.

But that idea of which cluster is closer is a bit of an ambiguous concept, especially when there are going to be multiple points belonging to each one of these different clusters.

Now, there are several methods to measure that distance between these clusters, and these different methods are called the different linkage types.

The first example that we have here is single linkage, and that's going to be the minimum pairwise distance between our different clusters.

So, given the different clusters that we have here on our data, it's going to be the distance between the two closest points, say one from the teal cluster and one from magenta. We can see the blue lines that connect each pair of clusters, according to which is going to be the minimum distance between a certain point in the magenta and a certain point in the teal.

And we take that distance between those specific points and declare that that will be the distance between those two clusters. Then we try to find, for all these pairwise linkages, which one is the minimum, and we combine those together as we move up the hierarchy.

Now, and we will talk through many different types of linkages, a pro of the single linkage, or the minimum pairwise distance between clusters, is that it can help in ensuring a clear separation of our clusters that have any points within a certain distance of one another, so it has clear boundaries.

But a con of this single linkage will be that it won't be able to separate clusters cleanly if there's some noise between the two different clusters. So it'll be very easy for it to be skewed by certain outliers falling close to certain clusters.

Now, another linkage type is going to be called the complete linkage. With complete linkage, instead of taking the minimum distance given the points within each cluster, we would take the maximum value. So we take the furthest distance between each pair of clusters, and from those maximum distances, decide which one is the smallest.

And then we can move up that hierarchy, reducing here from four clusters down to three.

Now, a pro of this method is that it will do a much better job of separating out the clusters if there's a bit of noise or overlapping points between the two clusters, unlike with the single linkage. But a con of this is that it can tend to break apart larger existing clusters, depending on where that maximum distance between those different points may end up lying.

Alternatively, we can also take the average of all the points for a given cluster and use those averages, or those cluster centroids as we've been introduced to them, to determine the distance between our different clusters.

Now, the pros and cons of using the average can kind of be seen as an average between the pros and cons of using the single and complete linkage, in that it may also break up those larger clusters and may also be a bit drawn towards noise, but it will also do a better job than either the single linkage or the complete (maximum) linkage in regards to the cons of each.

And then finally, we have the Ward linkage, and the Ward linkage is going to compute the inertia.

So if you recall, the inertia is going to be the sum of the squared distances between each one of our different points and their centroids.

Ward linkage picks the pair of clusters to merge that's going to ultimately minimize that inertia value, so trying to minimize that sum of squares of the distances to the cluster centroids. In that sense, you can think of it as something similar to K-means in trying to come up with the new combination of the different clusters.

And again, the pros and cons of Ward will be similar to the average, in that it will balance out both the pros and cons of the min and max linkage.
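
To make the linkage definitions concrete, the sketch below computes each one for two tiny clusters in plain NumPy. The function names and the `teal` and `magenta` coordinates are invented for illustration. It shows both the centroid distance described in the lecture and the mean of all pairwise distances, which is how scikit-learn documents its "average" linkage.

```python
import numpy as np


def pairwise(a, b):
    """Matrix of Euclidean distances between every point in a and every point in b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def single_linkage(a, b):
    return pairwise(a, b).min()    # closest pair of points


def complete_linkage(a, b):
    return pairwise(a, b).max()    # furthest pair of points


def average_linkage(a, b):
    return pairwise(a, b).mean()   # mean over all cross-cluster pairs


def centroid_distance(a, b):
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))


def inertia(c):
    """Sum of squared distances from each point to the cluster centroid."""
    return ((c - c.mean(axis=0)) ** 2).sum()


def ward_increase(a, b):
    """How much total inertia grows if clusters a and b are merged."""
    return inertia(np.vstack([a, b])) - inertia(a) - inertia(b)


teal = np.array([[0.0, 0.0], [0.0, 2.0]])
magenta = np.array([[3.0, 0.0], [3.0, 2.0]])

print("single  :", single_linkage(teal, magenta))
print("complete:", complete_linkage(teal, magenta))
print("average :", average_linkage(teal, magenta))
print("centroid:", centroid_distance(teal, magenta))
print("ward    :", ward_increase(teal, magenta))
```

Under Ward linkage, the pair of clusters with the smallest `ward_increase` is the one merged next.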

020：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p20 19_应用层次聚合聚类.zh_en -BV1eu4m1F7oz_p20-

Now, in order to do this in practice, to do this in Python, it'll be very similar steps to what we've seen so far.

We will start by importing our class, here called AgglomerativeClustering.

We're then going to create an instance of the class, so we say agg equal to AgglomerativeClustering with our different hyperparameters.

We can choose the number of clusters. Here, we set the number of clusters equal to three, so that it'll keep building up until we get to three clusters.

We then have the option to choose our distance metric, and here you see we chose Euclidean: with affinity equal to euclidean, we use the Euclidean distance.

And we can also define what our linkage will be. Going through the different linkages that we just discussed are available, we can choose which one we'd like to use for our current clustering algorithm.

And then, as before, we would fit the instance on the data and use that to predict clusters for new data.
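
A minimal sketch of these steps with scikit-learn follows; the blob data and seed are made up for illustration. Two caveats worth flagging: newer scikit-learn releases name the distance argument `metric` rather than `affinity`, so the sketch leaves it at its Euclidean default, and `AgglomerativeClustering` exposes `fit` and `fit_predict` but no separate `predict` method, so labels are obtained for the data that was fitted.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)  # seed chosen only for repeatability
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

results = {}
for linkage in ("ward", "complete", "average", "single"):
    # n_clusters=3 stops merging once three clusters remain.
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    # fit_predict fits the hierarchy and returns one label per row of X.
    results[linkage] = agg.fit_predict(X)
    print(linkage, np.bincount(results[linkage]))
```

On well-separated blobs like these the four linkages tend to agree; differences show up mainly with noise or overlapping groups.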

Now let's recap what we went over here in this section.

In this section, we introduced the hierarchical agglomerative clustering method and how we can use it to slowly build up to larger and larger clusters.

This method becomes very useful in business practices when you may want to also see the subgroups that build up to these larger groupings.

We then discussed stopping conditions, and how you may either have a predetermined number of clusters in mind, or you can say to continue up until you reach a threshold for the minimum average of our cluster distances.

And finally, we went over the different linkage types, including single linkage, using the closest points to determine the distance between clusters; complete linkage, using the furthest points to determine the distance between clusters; average linkage; and Ward linkage, which finds the merge that results in the lowest inertia.

That closes this video on hierarchical agglomerative clustering, and in the next video we're going to dive into our next clustering algorithm, DBSCAN. All right, I'll see you there.

021：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p21 20_DBSCAN算法.zh_en -BV1eu4m1F7oz_p21-

The next unsupervised learning algorithm we are going to cover is density-based spatial clustering of applications with noise, or DBSCAN.

And the noise part is going to be important, as this is one of the few approaches that truly clusters our data rather than partitioning it. It will help us find outliers rather than putting them all into different clusters, and we'll see how in just a bit.

Now, let's cover the learning goals for this section.

In this section, we're going to discuss how the DBSCAN clustering algorithm actually works in finding our different clusters.

We will also discuss the input arguments and their importance for determining our clusters, as well as the outputs of our DBSCAN algorithm.

And finally, we'll close out by discussing the strengths and weaknesses of working with the DBSCAN algorithm.

So let's start here with a quick introduction to what DBSCAN is. As we mentioned, a key part of this clustering algorithm is that it truly finds clusters of data rather than just partitioning our data, and thus it works better when we have noise in our data set.

We know that outliers will show up in most of our data sets, and in reality we should be able to create our clusters and say that these outlier points do not belong to any of these clusters.

Now, the basics of how DBSCAN works is that we are working under the assumption that points in a cluster should be within a certain distance of one another, within a certain neighborhood.

So we would randomly select points from these higher density regions and slowly expand our clusters.

And as we expand, we only include points that are within a certain distance of the points that have iteratively already been included within that cluster, given the distance that we're using from point to point.

The algorithm ends when no more points are within that distance of the clusters already identified, and thus all points will have been classified as either belonging to a particular cluster or, otherwise, as noise.

Now, this is all high level, and in just a few slides we'll make sure to visualize how this actually works in practice.

Before we get to those visualizations, though, let's talk about the inputs for DBSCAN, as these inputs will be of utmost importance to getting our clusters identified correctly.

So first, as we've seen repeatedly with all our clustering algorithms, we have to define the distance metric used to define the similarity between our different points.

Then we have to define the epsilon. As we mentioned, we are starting at random points and then, using those points, determining if other points are within a certain distance. If they are, they become part of the cluster.

Now, a point is going to be considered part of the same cluster if its distance is within a certain epsilon range. So that's going to be our epsilon: how close a point needs to be in order to be considered part of that cluster.

Next is n_clu, often seen as min_samples, which is actually the name of the argument used in scikit-learn.

This input will be the minimum number of points for a particular point to be considered a core point of a cluster.

Core points are going to be defined by this n_clu argument: they're those points that have at least n_clu neighbors within epsilon, including the point itself.

So if we set n_clu equal to 3, that means that a core point has at least two other neighbors that are within that epsilon distance.

A non-core point can still be a part of the cluster if it's in the neighborhood of a core point.

But to understand this, let's dive a bit deeper into the different classifications of points given our DBSCAN model.

So there are three possible labels for any given point.

First, we have our core point, which we just defined as any point that has at least n_clu neighbors. And all clusters will require at least one core point.

We then have density-reachable, or border, points. These will be points that can be reached from a core point, but may have fewer than n_clu neighbors themselves.

These will still be a part of the cluster as long as they are in the epsilon neighborhood of a core point.

And then finally, we have noise, and noise is going to be a point that is not part of any cluster. That would be a point that is not itself a core point and has no core points in its epsilon neighborhood.

So if we have n_clu equal to 4, and three points are within epsilon of one another and no others are nearby, none of these three are going to be core points, and thus they are all going to be identified as noise. Again, we'll visualize this a bit more clearly in the videos to come.

And with those possible labels for any point, we identify clusters as the connected core and density-reachable points within our data set.
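
The three labels can be computed directly from the definitions, as in this small NumPy sketch. The function name `classify_points`, the parameter name `n_clu` (borrowed from the lecture), and the toy coordinates are assumptions for illustration.

```python
import numpy as np


def classify_points(X, eps, n_clu):
    """Label every point as 'core', 'border', or 'noise' from the definitions."""
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = dist <= eps                      # includes the point itself
    is_core = neighbors.sum(axis=1) >= n_clu     # enough neighbors within eps
    # A point is near a core point if any of its eps-neighbors is core.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))


# Four points on a line one unit apart, plus one far-away point.
line = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [10, 10]], dtype=float)
print(classify_points(line, eps=1.2, n_clu=3))

# The lecture's example: three mutually close points with n_clu = 4.
three = np.array([[0, 0], [0.5, 0], [0, 0.5]], dtype=float)
print(classify_points(three, eps=1.0, n_clu=4))
```

In the second example no point reaches four neighbors, so there is no core point to attach to and all three end up as noise.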

So that closes out this video, and in the next video we're going to turn to that visualization I keep promising, to clearly understand how the DBSCAN algorithm works.

All right, I'll see you there.

022：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p22 21_可视化DBSCAN.zh_en -BV1eu4m1F7oz_p22-

So as promised, let's start to visualize how DBSCAN actually works.

As we have in our past clustering algorithms, we're going to start with this two-dimensional data set, and we're going to come up with clusters depending on the visits and the recency and how far away each point is from one another.

So we start at a random point. Here we have this point in pink.

And then we look at the radius epsilon around that point. We'll have to define that epsilon, and here we define it as 1.75.

So we create that 1.75 epsilon radius and we look around, and we ask: are there enough points, given our n_clu, within that circle to start a cluster?

And we see that there are four points, and again, we include that point itself, so with that point we get up to five. So we now have our first cluster.

So every point within that epsilon is going to be part of our first cluster.

And then we process each new point in the same way. So we move on to our next point here, and anything within that epsilon radius gets included as part of that cluster.

And we keep moving along. Then here we see this point, which is part of the cluster because it's near one of the core points: as we looped through, we saw that the point to the right of it, within that epsilon radius, was a core point with four points.

This one is not a core point, but it is density reachable, so it will be part of our cluster. We will highlight that this particular point is going to be a border point, a density-reachable point, and not one of our core points.

So we highlight that in a lighter pink. And we keep going down this chain, adding on points according to those that fall within epsilon.

We keep running through, and we see these are all core points, because they all have at least four points, including themselves, within epsilon.

And we move along, and then this point only has three. So this one, again, is going to be a border point, but it is near one of the core points, so it will still count as part of the cluster.

We highlight that in light pink, and we see we can keep moving along.

And eventually, we have all of our points within the cluster; we've searched along all the points.

And then, if there are no neighbors left, we will randomly try a new, unvisited point to potentially start a brand new cluster.

And when we do that, here we start with the blue, and we want to check: is this going to be a core point once again?

So we check again within epsilon of this new random point that we sought out. We see that it is a core point, so now we have started our new cluster.

Now this point, again, is going to be a density-reachable point, but it will still be part of the cluster because it's near another point that is a core point.

And we can continue to move along to build out our cluster here.

And you see, again, we have a density-reachable point. We've had a couple so far, but all of those are near core points, so they are still going to be part of our cluster.

And then we see here that, with n_clu equal to 4, we only have three points within this radius. So this is going to be a density-reachable point, but not a core point.

And then when we move over to this point over here, we see that the only one within that radius is going to be that density-reachable point. So there are no core points within this radius.

If there are no core points within this radius, then this becomes a noise point. It becomes an outlier. So this isn't part of either of our two clusters and is labeled as an outlier point, which is why we have marked it here in gray.

Now, I want you to take a moment and, given the DBSCAN method that we just walked through, notice which points tended to be the core points, which we have labeled in a darker hue; which ones were the density-reachable points, which are still part of our cluster but don't have the number of neighbors that would make them a core point, given our n_clu; and which point we have labeled as an outlier.
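
The walkthrough above can be turned into a compact from-scratch sketch. This is a simplified illustration, not scikit-learn's implementation: it visits unassigned points in index order instead of at random, and the function name `dbscan`, the `n_clu` parameter name, and the toy coordinates are assumptions. The epsilon of 1.75 and the neighbor count of 4 mirror the values used in the video.

```python
from collections import deque

import numpy as np

NOISE = -1


def dbscan(X, eps, n_clu):
    """Grow clusters outward from core points; anything never reached stays noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # includes i
    is_core = np.array([len(nb) >= n_clu for nb in neighbors])
    labels = np.full(n, NOISE)
    cluster = 0
    for start in range(n):
        # Only an unassigned core point can start a new cluster.
        if labels[start] != NOISE or not is_core[start]:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster      # core or border, it joins the cluster
                    if is_core[q]:
                        queue.append(q)      # only core points keep expanding
        cluster += 1
    return labels, is_core


X = np.array([
    [0, 0], [1, 0], [0, 1], [1, 1],          # dense group: all core points
    [2.5, 1],                                # border point next to (1, 1)
    [4, 1],                                  # only near the border point: noise
    [10, 10], [11, 10], [10, 11], [11, 11],  # second dense group
    [5, 5],                                  # isolated outlier
], dtype=float)

labels, is_core = dbscan(X, eps=1.75, n_clu=4)
print(labels)
print(is_core)
```

The point at (4, 1) mirrors the gray point in the video: its only neighbor is a border point, so no core point can reach it and it keeps the noise label.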

Now that we understand how the DBSCAN algorithm works, let's discuss some strengths and weaknesses of working with the DBSCAN algorithm.

So as we saw with the DBSCAN algorithm, we do not need to specify the number of clusters, as DBSCAN will automatically determine the clusters depending on how close points are to one another.

It also allows for noise and will not automatically determine that outliers are part of a particular cluster.

It'll also do a strong job of handling arbitrary shapes, as it's going to be searching out points that are within epsilon distance of one another and will stop whenever a gap occurs, no matter what the shape of the boundary between the clusters is.

Now, some weaknesses.

It's going to require two parameters, which means we need to search over more possible values to find that optimal solution.

Also, those hyperparameters can be very difficult to fine-tune in higher dimensional space.

And then finally, it will not do well with clusters of different density. Even if we have two clear groups, if for one group the points are about five units away from one another and for the other they are about one unit away, then depending on our distance metric and on the distance between those two clusters, it may be difficult to differentiate between them.

Now let's walk through how the DBSCAN algorithm can actually be used in Python。

So first things first， we import the class containing our clustering method。

So from sklearn.cluster we import DBSCAN。

We then create an instance of that class and pass in the necessary hyperparameters。Here，

we're setting eps， our epsilon， equal to 3 and min_samples equal to 2。

So min_samples is that n_clu that we've been talking about， and eps is the epsilon we've been talking about，

that distance from every single point used to decide whether to include it as a core point or within the cluster。

We're then going to fit this instance on the data， just calling db.fit。

And we can't call db.predict because of the way that the algorithm actually works。

If you recall， it's defining the points iteratively by scanning through each one of the different points within that data，

so it's just creating clusters within that fitted data。You can't call predict with DBSCAN。

If you wanted to cluster a larger data set， you would just include it in that fit，

and then you can come up with the different clusters。So we get our db.labels_。

And just to note， for those labels， we're going to have class 0， class 1， and so on，

and if there's any outlier，

as we saw can happen with DBSCAN， it will be labeled negative one。
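The walkthrough above can be sketched as follows。This is a minimal illustration rather than the course notebook： the toy array X is made up， and only the eps=3 and min_samples=2 settings and the name db come from the narration。

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up for illustration): two tight groups and one far-away point.
X = np.array([
    [1.0, 1.0], [2.0, 1.5], [1.5, 2.5],        # group A
    [20.0, 20.0], [21.0, 21.5], [22.0, 20.5],  # group B
    [60.0, 5.0],                               # isolated point
])

# eps is the neighborhood radius; min_samples plays the role of n_clu.
db = DBSCAN(eps=3, min_samples=2)
db.fit(X)

# DBSCAN has no predict method; the cluster assignments live in labels_.
labels = db.labels_
print(labels)                  # clusters are numbered from 0, noise is -1
print(hasattr(db, "predict"))  # False
```

To cluster additional points， they would have to be appended to X and the model refit， as described above。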

Now let's recap what we learned here in this section。In this section。

we discussed the DBSCAN algorithm and how it will come up with its own clusters depending on which points are within a certain distance of the other points。

We then discussed the inputs and their importance， especially that of the epsilon and n_clu chosen，

as well as the outputs and understanding the difference between a core point，

a density reachable point， and just outliers or noise。

And finally， we discussed some of the algorithm's strengths and weaknesses， such as it being able to better determine clusters of arbitrary shapes，

but perhaps having difficulty determining clusters that may have different densities。

Now this closes out our discussion on DBSCAN， and in the next video we'll introduce our final clustering algorithm，

mean shift clustering。All right， I'll see you there。

023：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p23 22_平均漂移算法.zh_en -BV1eu4m1F7oz_p23-

Here we will be discussing our final clustering algorithm， the mean shift algorithm。Now。

let's go over the learning goals for this section。

In this section， we're going to cover the mean shift clustering algorithm and how we use the concept of moving towards the highest density to help determine our different clusters。

And then we're also going to discuss the strengths and weaknesses of working with the mean shift algorithm。

Now the mean shift algorithm will work similarly to K means in that we will be partitioning our points according to their nearest cluster centroid。

For K means， though， the centroid represented the mean of all points within that cluster，

while with mean shift， that centroid is going to be the most dense point within the cluster，

which in principle， can be anywhere in that cluster。

And the algorithm will assign points to a cluster by moving to the densest points within a certain window。

So how do we calculate this local density to say where the highest density point is？

In order to do so， we're going to calculate the weighted mean around each point。

So what do we mean here when we are asking for the weighted mean。

We can think of the weighted mean as assigning more weight to those points closer to the original point within our window。

So say we select this black point to start。

We calculate the weighted mean in the local neighborhood， or within this window， this pink square。

And we would find that the densest point， given the weighted mean， would be here in pink。

And note， on the side， that the new mean does not have to be at a data point

and can be somewhere else within this window。So how do we go about using this to create our different clusters？

So the steps are going to be that you choose a point and a window。

So we set that window size and start at a random point。

We calculate that weighted mean within that window。

And then we shift the centroid of the window to the new mean。So we shift that square

so it's now centered on that new weighted mean that we just found， that new denser point。

We then continuously repeat steps 2 and 3 until convergence， until there's no shift，

meaning that we have reached the local density maximum， and we'll call this the mode。

Once the mode is reached，

we then repeat steps 1 through 4 for all data points， until finally

data points that lead to the same mode will all be grouped together in that same cluster。
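The steps above can be sketched in a few lines of NumPy。This is a simplified illustration under stated assumptions， not the scikit-learn implementation： it uses a Gaussian weight over all points instead of a hard square window， the function names mean_shift_mode and mean_shift are hypothetical， and modes are merged when they land within half a bandwidth of each other。

```python
import numpy as np

def mean_shift_mode(start, X, bandwidth, tol=1e-5, max_iter=300):
    """Shift one window until it stops moving and return the mode it reaches."""
    center = start.astype(float)
    for _ in range(max_iter):
        # Step 2: Gaussian-weighted mean of the points around the current center.
        sq_dist = np.sum((X - center) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2 * bandwidth ** 2))
        new_center = weights @ X / weights.sum()
        # Step 4: stop when the shift is negligible.
        if np.linalg.norm(new_center - center) < tol:
            return new_center
        # Step 3: move the window to the new mean.
        center = new_center
    return center

def mean_shift(X, bandwidth):
    # Step 5: run the procedure starting from every data point.
    modes = np.array([mean_shift_mode(x, X, bandwidth) for x in X])
    # Points whose modes coincide (within half a bandwidth) share a cluster.
    centers, labels = [], np.empty(len(X), dtype=int)
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# Made-up data: two well separated blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(6, 0.5, (50, 2))])
centers, labels = mean_shift(X, bandwidth=1.0)
print(len(centers), "modes found")
```

Note that the number of clusters is never passed in； it falls out of how many distinct modes the windows converge to。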

So let's visualize how this actually works in practice。So we start with a centroid at a given point。

And then given that window， we sample that local density。

and then we follow the gradient towards the denser direction。

So we keep moving towards the highest density。So we keep recalculating where that densest point is，

and we create our new window around it， and we see we move along each one of our data points

until ultimately we find the local density maximum and we stop there。

We can do this again， starting at another point。We can sample the local density and again。

follow that gradient towards the denser direction。

And we see that we move along towards that densest direction。 And again。

we end up finding that same local maximum。 So we would assign those both to the same cluster。

We can do this again， starting at another point， this time， starting further away。

At a point that will probably lie outside this cluster。

We sample that local density and follow the gradient towards the denser direction。

And you see that it moves along as we move towards that denser direction。

and then it finds that local density maximum and stops there。

And to keep going， we can start at each one of the different points， sample that local density。

Follow the gradient towards that denser direction。 And here again。

we see that that point finds the same local maximum。

So we would end up labeling it as the same cluster。

And we keep going like this。And eventually， it's going to find for us 4 unique local maxima。

So we see them laid out here， each one of our four local maxima。

And it is going to assign the points to the centroids that they fall into。

So we see here that all of the pink fall under that pink centroid。

We see all the teal values falling next to that teal centroid and all the blue values falling under that blue centroid。

And now we have， as well， the purple with its purple centroid。

And we have our four different clusters。

And we did not need to define the number of clusters， only the window size。

It's just going to move towards that densest direction and figure out those clusters for us。

Now， let's hone in a bit into what we mean here by this weighted mean。

That mean that we keep moving towards as we get higher and higher density。

So that new mean is going to be calculated using the sum over points within the window。

And we see this in both the numerator and the denominator。

We're also going to have this weighting or this kernel function that's going to allow us to give a certain weight。

according to how far each one of these different points are from the previous mean。

And we see that in the numerator we weight each point by that kernel。

So we take the distance of each point from the previous mean， and those points at a lower distance will have a higher weight。

And the common kernel that's used is the RBF kernel，

which is the Gaussian kernel，

giving more weight， again， to those values that are closer and less weight，

according to the normal distribution， to those values that are further away。
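Written out， the weighted mean described here takes the following form。The symbols are assumptions chosen for this sketch， since the slide itself is not in the transcript： x is the previous mean， N(x) the set of points inside the window around it， K the kernel， and h the bandwidth。

```latex
m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x)\, x_i}{\sum_{x_i \in N(x)} K(x_i - x)},
\qquad
K(x_i - x) = \exp\!\left(-\frac{\lVert x_i - x \rVert^2}{2h^2}\right)
```

The sum over points in the window appears in both the numerator and the denominator， and the kernel value shrinks as the distance from x grows， so nearby points pull the new mean harder than distant ones。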

Now let's talk about some strengths and weaknesses of working with a mean shift。

The mean shift is model free。 It does not assume the number or the shape of each one of our clusters。

So that's going to be a pro that we didn't see when we worked with something like K means。

We can use just one parameter。We don't have to tune over more than one parameter like we did with DBSCAN，

that parameter being the window size or the bandwidth。

And it will be robust to outliers。 We have that window size， and it won't be affected。

and it can have those outliers outside of each one of our different clusters。

Some weaknesses。

The results will heavily depend on our window size。

So it's going to depend on the bandwidths that we choose and selection of that window of that bandwidth is not going to be an easy thing to decipher in general。

And finally， it can be slow to run。The complexity is going to be proportional to m times n squared，

where m is the number of iterations that it has to do and n is the number of data points。

So the more data points we have， the more expensive it becomes。

You see that it's n squared complexity， so if we have a large data set，

this may take a while to converge。

Now let's walk through the syntax that you need in order to perform mean shift using Python。

So first thing that we want to do is import the class containing that clustering method。

So from sklearn.cluster we import MeanShift。

We then create an instance of this class， setting ms equal to MeanShift，

and we pass in our parameter， bandwidth equals 2。

So again， our window size here will be equal to two。

And then we fit the instance on the data， and we can use that to predict clusters for new data。

So we call ms， that instance of our class， dot fit on X1， so it finds our clusters using X1。

And then we can call ms.predict on X2 to see which clusters the new data fall under。
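A minimal sketch of that syntax follows。The arrays X1 and X2 are made-up data for illustration； only the class name， the bandwidth=2 setting and the fit and predict calls come from the narration。

```python
import numpy as np
from sklearn.cluster import MeanShift

# Made-up training data: two blobs, plus two new points to assign afterwards.
rng = np.random.default_rng(42)
X1 = np.vstack([rng.normal(0, 0.4, (40, 2)), rng.normal(8, 0.4, (40, 2))])
X2 = np.array([[0.2, -0.1], [7.9, 8.3]])

ms = MeanShift(bandwidth=2)  # bandwidth is the window size
ms.fit(X1)                   # finds the cluster centers using X1
print(ms.cluster_centers_)

# Unlike DBSCAN, MeanShift can assign new points to the nearest found center.
new_labels = ms.predict(X2)
print(new_labels)
```

The cluster numbering is decided by the algorithm， so the sketch prints the labels rather than assuming which blob becomes cluster 0。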

So to recap。In this video， we talked about the mean shift clustering algorithm and how we use the concept of a window，

as well as the densest point within our window， to find the different centroids of our clusters。

And we discussed the algorithm's strengths and weaknesses。

such as not needing to define the number of clusters。

as well as understanding that this model will have a higher overall complexity。

So with that， we close out our different clustering methods， And in the next video。

we will compare and contrast all the different methods that we discussed and which ones are best to use for which use cases。

024：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p24 23_算法比较.zh_en -BV1eu4m1F7oz_p24-

In this video， let's briefly bring together the different clustering algorithms that we've introduced。

And discuss some of the pros， the cons， and the use cases for each one。

So what will we cover here in this section？In this section。

we'll go over a review of the clustering approaches that we went through throughout this course。

We'll then summarize and compare each one of these different approaches。

As well as providing some guidelines for choosing which approach is best。

given the business case that you are working with。

So let's review the clustering algorithms discussed in this course so far。First， we have K means。

And recall that with K means， we were going to have to predetermine that number of clusters that we're looking for。

And once we do so， our clusters will depend on coming up with some mean value that is trying to reduce the distance from our centroids or that mean of that cluster to each one of the different points within that cluster。

With that in mind， we will get the results that we see here for the shapes given and we see that it doesn't do a perfect job of getting shapes that aren't necessarily spherical。

and we're going to dive a bit deeper into the pros and cons of each in just a bit。

but this is just a recap and an intro to what we're working with with each of the models that we had introduced。

So next， we have mean shift， which does not require us to set that number of clusters as we had to do with K means，

but rather will iteratively move towards those densest points given a window， and we'll get the results that we see here under mean shift。

And notice that for both K means and mean shift， they are going to heavily favor more of a spherical shape and may not have quite the flexibility to find different shapes。

Next， we have Ward。And what we mean here by Ward is agglomerative hierarchical clustering with Ward as the linkage type between our clusters。

Recall that ward linkage specifies distance between clusters as the new combined inertia of those clusters。

And since we are linking closest clusters when we work with hierarchical clustering。

We have a bit more flexibility in combining clusters of different shapes。

But some noise can throw this off， as it did in our two circles example above。

And while we can set the means by which we want clusters determined，

they do not quite get determined of their own accord， as we saw with mean shift，

or as we will see in this next one with DBSCAN。So finally， we have DBSCAN，

which will find those points which are closest to one another in order to create those clusters。

And this will both create its own clusters， so you don't have to predetermine the number of clusters，

and be able to identify clusters of different shapes。Now，

this may seem like DBSCAN should always be the one to go with，

but we'll dive a bit deeper into what can make DBSCAN a bit more difficult and at times not the ideal candidate。
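To make the comparison concrete， the four algorithms can be run side by side on a non-spherical toy data set。This is a sketch under assumptions： the two-moons data from make_moons， the parameter values (eps=0.2， min_samples=5， default bandwidth) and the use of the adjusted Rand index against the known moon labels are all choices made here， not taken from the lecture slide。

```python
from sklearn.cluster import KMeans, MeanShift, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half circles: a shape that is clearly not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "MeanShift": MeanShift(),  # bandwidth estimated from the data
    "Ward": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
}

# Adjusted Rand index: 1.0 means the clusters match the true moons exactly.
ari = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    ari[name] = adjusted_rand_score(y_true, labels)
    print(f"{name:10s} ARI = {ari[name]:.2f}")
```

On data like this， the centroid-based methods tend to cut each moon in half， while DBSCAN can follow the curved shape， provided eps is chosen well。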

So let's dive deeper starting with K means。With K means if we use mini batch to find our centroids and clusters。

this will find our clusters fairly quickly， so it will run with fairly low complexity compared to the other models。

If we don't already know how many clusters we are looking for。With K means。

we're going to have to search through our K values and use something like our elbow method that we introduced to determine that number of clusters。

It'll generally be a bit more skewed to finding even sized clusters when we work with K means。

And it's not going to work well with non spherical cluster shapes。

as we'll be looking at distance from the centroid in every single direction as we move towards that mean。

and therefore， we'll only be able to find more spherical shapes。

which is why it doesn't do a great job with these different shapes that we have here。

Next， we have mean shift， and with mean shift， we do not have to guess K， as that number of clusters will be determined for us。

Also， mean shift will do a fairly good job of finding uneven cluster sizes。

It'll simply be moving towards that highest density，

given a specified bandwidth， so we can find uneven clusters。They don't have to be even by any means，

as we saw with K means。Now， it can be slow with a lot of data。

We said that K means with mini batch can run very fast。

The mean shift can tend to be a bit slow if we have a lot of data。

as it's going to be searching for points for highest local density for every single point。

It will do a good job of finding a lot of clusters if they exist in the data set。

So if you think that there are a lot of clusters， this may be a good choice。

It will not do a great job of finding weird shapes， as again。

we are looking for closeness in every direction within a certain window so tend to go towards more spherical shapes。

And it will be limited to using the Euclidean distance within its formulation。

so we don't get to use these other metrics， these other distance metrics that we introduced earlier in the course。

Now we move to hierarchical clustering here with ward。

And the strength of hierarchical clustering really comes into play when we want to get a full hierarchy tree and see how some groups may be subgroups of others。

Now， you do have to come up with some means of deciding the number of clusters on your own。

whether that's choosing the numbers directly or with a minimum average distance threshold。

as we saw in our course on hierarchical clustering。It will often find uneven cluster sizes，

as we can easily have a tiny cluster of one or two points that are far away from the rest。

There are going to be many different distance metrics and linkage options that can be chosen。

which may make it difficult to fine tune this type of model。

And it can end up being very slow to calculate as the number of observations increases。

So this also will have fairly high complexity。

Now with DBSCAN， it seems you can often get the best of both worlds if you choose the right parameters。

But finding those correct parameters can prove to be a difficult task。Now， with DBSCAN，

you will be able to find clusters of uneven sizes。As long as it reaches the n_clu amount that was predefined，

it will create a new cluster。So， again， if n_clu is equal to 4，

then as long as you have four points within that epsilon radius， you will create a new cluster。

It will work with distance metrics of your choosing，

so you're not limited to just Euclidean distance。DBSCAN will be able to easily move along a cluster in small steps，

thus being able to find clusters of uneven shapes。

Now there is a danger if you choose too small of an epsilon， that you will have too many clusters。

which is probably not ideal or trustworthy for most business cases。And finally，

the main disadvantage is that it can have great difficulty determining clusters of different densities。

Now to bring it all together， I would say take a look at this page if you're ever trying to decide which one of the different clustering approaches to use。

If you look at the parameters， for K means you just need to choose the number of clusters， and for mean shift the bandwidth，

which may be a little bit difficult to fine tune。For hierarchical clustering，

you choose the number of clusters， but you can also visualize the clusters that are created as they grow one on top of the other。

so it becomes a bit easier to choose that number of clusters。

And then the neighborhood size could be fairly difficult to choose when you're working with DBSCAN。

Now the scalability of each。With K means， you can scale to a very large number of samples，

so very large data， and you probably want a medium number of clusters， not too many clusters，

and this is when using mini batch， which will help speed things along。

Mean shift will not be quite as scalable with the number of samples，

so as we increase the number of samples， it tends to take quite some time as the complexity increases。

With hierarchical clustering， you can use a large， though not very large like K means，

number of samples， as well as a large number of clusters。

And then DBSCAN， again， will scale to quite a large number of samples and a medium number of clusters。

Now we have here the different general use cases。But I want to skip more to the applications。

All I want to highlight for the general use cases is that， again， with DBSCAN

you can also use it for outlier detection， unlike the others，

as it'll do a good job of determining those outliers。Now in regards to the applications，

we have that for K means， you can find a few clusters of roughly the same size。

I would say this is a quick and dirty way。If you know the number of clusters that you're looking for，

then this may be a good way to get started in your clustering of your data set。With mean shift，

it can identify the number of clusters on its own。So if you don't know the number of clusters，

this is a good choice。It is often used in video， and also， again， if you don't know those clusters，

this is a good choice for a business case， especially if DBSCAN may be difficult to fine tune or if you have clusters of different densities。

And then hierarchical clustering will be good for business cases where you may want to find the subgroups as well。

so if you don't just want the groups but the subgroups that build into those groups。

And then finally， DBSCAN， which is often used for computer vision applications，

but also for business cases where you don't know the number of clusters

and they are of similar density， so you can use DBSCAN to identify those clusters for you。

So just to quickly summarize， what we went through here were different clustering techniques， where clustering is just unsupervised learning，

meaning we don't have labels， but we can come up with groupings of our data to see if there are different segments of our data that can be clumped together。

And we discussed several approaches that were possible， such as K means，

hierarchical agglomerative clustering， DBSCAN， and mean shift，

and all of this can be implemented using scikit-learn。And if you're interested in learning more about the different hyperparameters that can be passed through or even more clustering methods，

I would suggest looking at the link， and feel free to dive deeper and experiment with everything that you have there available。

Now， just to recap。In this section， we had a review of the different clustering approaches that we've discussed throughout this course。

We summarized and compared each one of the different clustering approaches and then finally provided some guidelines for choosing which approach is appropriate for the given business situation。

That closes out our section here on clustering。 And in the next videos。

we will move on to dimensionality reduction。 All right， I'll see you there。

025：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p25 24_聚类笔记本第1部分.zh_en -BV1eu4m1F7oz_p25-

Welcome to our lab here on different clustering methods。

We have here the same data set that we used earlier on。

in that we will be looking at the wine quality and the data contains various chemical properties of the wine。

such as acidity， the sugar levels， the ph levels， the alcohol levels。

and also contains a quality metric，3 through 9 with 9 being the highest， and a color。

either red or white。And we're going to be using these chemical properties。

So everything besides quality and color in order to cluster our wine。And with that in mind。

we can actually see whether or not our clusters will relate to the wine being either red or white。

And we'll see that in practice later on in this notebook。

First thing that we want to do is import all the necessary libraries。

We're also going to change our directory here to data，

and we're going to pull in our wine quality data CSV file。

We call data dot head， and we take the first four rows。 And you see here that we transpose it。

And this is just for a bit of readability。 So really fixed acidity， volatile acidity。

Those are going to be our column names。 And we get a quick peek at each one of our different columns。

And what may stand out is the different scales of each one of these different features。

as well as the fact that color is going to be a string。 So the rest are numerical。

and this one is a string。 And this will come into play in just a second。😊。

We then look at the shape of our data， and we see that we have 6497 rows， with just 13 columns。

And then we're going to look at the data type， like I just mentioned for each one of our different entries。

And the reason why this is important is because for K means to work，

as well as most scikit-learn algorithms， we're going to need our data to all be numerical。

Otherwise we won't be able to pass it through。So when we check this out，

we are fortunate to see that， apart from quality and color， which we are not going to use when we create our clusters，

all the other values are going to be floats。They are not objects， like we see with color，

or even integers。

We're then going to check the value counts for our different wine colors。

as well as for our different qualities。 Again， those qualities ranging from 3 to 9。

So we check the color， and we see the majority of our data set is going to be white wine。

We can even call it， as we've seen earlier， with normalize equals True， and we can see that

a bit over 75% of our data set is going to be white wine， whereas about 24.6% is going to be red wine。

And then we can check this out in terms of the value counts for the quality。

And we see that the majority of our quality values are going to center around 5，

6 and 7， with very few at very low quality and very few at very high quality。

Now we want to look at a histogram breaking down the quality by red and white wine，

given our data set。So we're going to first initialize these colors， red and white。Now，

these are just going to be objects pointing to a certain color。So from sns， that's seaborn，

we're going to pull our color palette， and we want red to be associated with the value at index 2， which is the third value because of Python's zero-based indexing， and then white is an object pointing to the fifth value of the color palette。

We're then going to。Explicitly tell our histogram what our bin range is going to be。

We don't want to combine any one of our values。 We saw that 9 up here could end up being very low。

especially when we split it between red and white wine。

and we want to ensure that we have a separate bin for every single one of our different quality values。

With those different objects created。

We're then going to initialize our axes， so our bounding box， using plt.axes。

And then we're going to zip together this list of red and white。

which is just the string of red and white。As well as the red and white colors that we initiated up here。

So red will associate with red in that first iteration through the for loop。

and white will associate with this white in the second iteration。

So color will be the string and plot_color will be one of these color objects that we have here。

We are then going to take a subset of our data。So we say data.loc，

and we want to locate where the color is equal to the color that we have specified，

either red or white。And then we only want the quality column，

since we're just taking a histogram of the quality。We then call q_data， that's our subset， dot hist。

And we say， again， we want to use just the bins that we specified above。 So 3，4，5，6，7，8， and 9。

We set alpha equal to 0.5， because we want it to be somewhat see through，

as we will be plotting one histogram on top of the other。We then set

where we want to plot this， so ax equals ax， the axes that we initialized earlier。

And then the color that we're going to use is going to be that plot color。

which is either going to be this red object or this white object that we had defined up here。

And then we're just going to create a label for our legend later on。

labeling the white as white and the red as red， using this string。

We then want our legend， our x label， and our y label。We set our x limits，

our x ticks are going to be in between each one of these values， so at 3.5， 4.5 and so on，

and our labels are just going to be the different bin range values， 3， 4， 5， and so on。So you run this。
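The plotting loop described above can be sketched roughly as follows。Several things here are assumptions made for a self-contained example： the DataFrame is random stand-in data rather than the wine file， the colors are plain matplotlib names instead of the seaborn palette entries， and the bin edges run from 3 to 10 so that quality 9 gets its own bar。

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the wine data: only the two columns the plot needs.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "quality": rng.integers(3, 10, size=200),  # integers 3 through 9
    "color": rng.choice(["red", "white"], size=200),
})

bin_edges = np.arange(3, 11)  # edges 3..10 give one bin per quality value
ax = plt.axes()
for color, plot_color in zip(["red", "white"], ["tab:red", "tab:gray"]):
    # Subset to one wine color and keep only the quality column.
    q_data = data.loc[data["color"] == color, "quality"]
    q_data.hist(bins=bin_edges, alpha=0.5, ax=ax, color=plot_color, label=color)

ax.legend()
ax.set(xlabel="Quality", ylabel="Occurrence")
ax.set_xlim(3, 10)
ax.set_xticks(bin_edges[:-1] + 0.5)   # tick in the middle of each bar
ax.set_xticklabels(bin_edges[:-1])    # labelled 3, 4, ..., 9
ax.grid(False)
```

Setting alpha to 0.5 is what lets the second histogram show through the first where the two overlap。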

And we can see here our breakdown of red and white wine， and see that

red is slightly more centered around this 5， 6， and the white wine has a higher peak at that 6。

So more values of 6， but otherwise， somewhat of a normal distribution。

whereas red is going to be a little bit flatter with kind of a bimodal 5 and 6 in regards to the red wine quality。

We're then going to， in question 2， examine the correlation and skew of our relevant variables。

So everything except for color and quality。 We're not going to drop these。

but we do want to exclude these when we look at our cross correlations between each one of our different values。

And that's because we're going to be using our cluster algorithms without either of these two values。

And then on top of that， we're going to perform any appropriate feature transformations or scaling。

Now， what's important is we have to recall that we are using distance metrics when we use K means or any one of our clustering algorithms。

And something that wasn't mentioned throughout lecture is the importance of scale with distance metrics。

And this should have already clicked as we were thinking through what we've done with each one of our supervised learning models。

If we are using distance， it will be of utmost importance that each one of our features are going to be on the same scale。

We don't want any one of our different features being more heavily favored or causing further distance than the other one。

So we just want their variation to be changing what our clusters will look like。

rather than their actual magnitudes or their built in values。

And then we're going to finally examine the pair wise distribution of the variables with pair plots to verify the scaling and the normalization efforts that we went through。

So we're also going to make sure that there's a normal distribution that just makes things a bit cleaner so that we don't have a heavy skew in one direction when we take each one of our distance metrics。

So we're specifying here that our float columns are going to be all of our columns。

except for color and quality。

We're then going to create our correlation matrix by just specifying that we only want those columns

and calling .corr()。And then finally， just to make sure that

we're not seeing each one of our features' correlations with themselves，

which would obviously have a correlation of one， since every value with itself will have a correlation of one，

we're replacing the values across that diagonal。So for x in range of the length of our float columns，

so however many columns there are， we're going to replace the diagonal using .iloc。

With both index values being the same， we are zeroing out each one of the values in our correlation matrix along that diagonal。

So let's see what this looks like。Again， you see that 0，0 for fixed acidity。

and this will come into play because we're going to look at the highest correlation between each one of our different features。

And we can see here it's a little bit difficult to quickly visualize what those highest values are。

So to get those pairwise maximal correlations， we're going to call corr_mat.abs()， so taking the absolute value，

because whether it's a negative or positive correlation， we want to see if they're highly correlated。

And then we call .idxmax()。

And we see that fixed acidity is most highly correlated with density，

and we could also even just call .max() if we wanted to see what the maximum values are。
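Here is a small sketch of that correlation check。The DataFrame is synthetic， with two deliberately correlated pairs of made-up columns， so the column names are placeholders rather than the wine features； the sequence of calls follows the narration。

```python
import numpy as np
import pandas as pd

# Synthetic features: a/a_noisy are positively related, b/b_neg negatively.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = rng.normal(size=200)
data = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.1 * rng.normal(size=200),
    "b": b,
    "b_neg": -b + 0.1 * rng.normal(size=200),
})
float_columns = list(data.columns)

corr_mat = data[float_columns].corr()

# Zero out the diagonal so a feature is never its own strongest partner.
for x in range(len(float_columns)):
    corr_mat.iloc[x, x] = 0.0

# Absolute value: a strong negative correlation counts as much as a positive one.
strongest_partner = corr_mat.abs().idxmax()
strongest_value = corr_mat.abs().max()
print(pd.DataFrame({"partner": strongest_partner, "abs_corr": strongest_value}))
```

Without the absolute value， the negatively related pair would be missed， because idxmax would only look for the largest positive entry。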

And we see some high correlations between certain values。And the reason why this is important。

Is if you recall when we discussed earlier in the lecture， as well as in the last lab。

If we have high correlation between different values。

then we start to reach that problem of high dimensionality。

and we know that that causes a problem whenever we're working with distance metrics as we are with most of our clustering algorithms。

So we're not going to do anything there。There are some fairly high correlations，

but not high enough for us to exclude certain values。

But that is the reason why we'd want to start to investigate how high the correlations are between each one of our different values that we're going to use for our unsupervised model。

We're now going to look at the skew of each one of our columns。

So we can just call for those float columns。We're looking at the skew， we just call dot skew。

And recall that 0 means no skew。 positive value means a right skew。

A negative value means a left skew， meaning it's not normally distributed。

Right skew means a heavy right tail， and left skew means a heavy left tail。

And then we're going to sort those values from highest to lowest。

and then we're just going to take those that are above 0.75.

So we look at that, and we are saying that each of these values above 0.75

has a heavy skew. In order to help correct that skew,

we're just saying for col in each one of these skew columns,

only taking the index values, because we don't care about the actual skew values;

we just care about each one of these column names.

We change that data to the log version of itself.

And that will help normalize our features.
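A minimal sketch of this skew check and log transform, on synthetic data. The column names are made up, and the use of `np.log1p` (log of one plus the value, which tolerates zeros) is an assumption; the video only says "the log version".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "symmetric": rng.normal(loc=10, scale=1, size=1000),
    "right_skewed": rng.lognormal(mean=0, sigma=1, size=1000),
})

skew_before = df.skew()

# Sort from highest to lowest and keep only the heavily right-skewed columns.
skew_columns = skew_before.sort_values(ascending=False)
skew_columns = skew_columns[skew_columns > 0.75]

# Only the index (the column names) is needed, not the skew values themselves.
for col in skew_columns.index:
    df[col] = np.log1p(df[col])

skew_after = df.skew()
print(skew_before, skew_after, sep="\n")
```

Note that this filter only catches right skew; a strongly negative skew would pass through untouched, and a log transform would not help with it anyway.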

So we run this and we've replaced all of those columns. And then on top of that,

as I mentioned before, it's of utmost importance that all of our features are on the same scale.

So in order to ensure this, we import StandardScaler, as we've done before, from sklearn.preprocessing.

We assign that StandardScaler object to a variable.

We call fit_transform on all of our float column data,

and we use the result to replace all of our float columns.

And then we investigate this briefly and see that our values are now all on a similar scale.
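The scaling step might look like the sketch below. The two columns are synthetic and deliberately on very different scales; the names `small`, `large`, `float_columns`, and `sc` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "small": rng.normal(loc=0.5, scale=0.1, size=500),
    "large": rng.normal(loc=100.0, scale=25.0, size=500),
})
float_columns = ["small", "large"]

sc = StandardScaler()
# Replace the float columns with their standardized versions
# (zero mean and unit variance per column).
df[float_columns] = sc.fit_transform(df[float_columns])

print(df.describe().loc[["mean", "std"]])
```

Without this step, a distance-based method such as K-means would be dominated by whichever column has the largest numeric range.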

Finally， just to make sure that we get a visual of what these actually look like。

we're going to run the pair plot。

In order to run our pair plot, we want all of our float columns, as well as color.

The reason why we want color is that we are going to break apart our scatter plots,

as well as our histograms, so the pair plot shows them by color and we can start to

investigate that natural differentiation between our different features and the color values.

For the color values, we're just ordering them white, then red. And then for our palette,

we want red to be equal to the red that we defined earlier. And then for white,

we're going to use gray as the coloring. Now I'm going to run this,

and pair plots generally take a bit of time to run。

So I'm going to pause the video real quick and we'll come back once it has run.

Now, we ran this pair plot, and we did that to see the relationship between each of our different features.

But on top of that, because we're also looking at the breakdown between red and white,

we can look at any two features. And again, we'll have more than two features, or two dimensions, to work with when we create our K-means model.

But even with just two features, we can begin to see that there is somewhat of a clustering between the red and the white wines.

So we can see that there will probably be a fairly clean separation in our data that shows us which wines are red and which are white, without actually having those labels available.

That closes out question number two in this video. In question number three,

we will start to fit a K-means model and see what kind of clusters we actually come up with without the labels in our data.

026：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p26 25_聚类笔记本第2部分.zh_en -BV1eu4m1F7oz_p26-

Now, for question number 3, we're going to continue by fitting our first K-means clustering model.

We're going to use two clusters, and we're going to do so without identifying the red and white wines to the model.

We're not going to have that column included in the data we fit on. And then we're going to examine the clusters

against the red and white wines, to see if the model automatically clusters

according to this red and white differentiation. So what we do is import our KMeans model from sklearn.cluster.

We then initiate our model and say that we want two clusters.

Recall that with K-means, we need to say how many clusters we want. And then we call km.fit

on just our float columns, so including neither the quality column nor the color column.

We then call km.predict on those same columns, and we set that equal to its own column within the data set.

And we'll see why we do that in just a second。And once we do that， we can call data dot head。

And we can see all the way here at the end that we create this new column that's either 0 or one。

And we're going to see how that relates to this color column to see if all the reds were identified as 0 and all the whites as one。

So in order to do that, we're going to take only the subset of columns color and kmeans, kmeans being the one we just defined.

We then group by both of these columns, so we're aggregating by both of them.

And then .size() is just going to give us the count of that breakdown.

Now, that's going to be a pandas Series, so we change it to a DataFrame

and rename its column; by default, the column will be named 0,

so we're just calling it number.

We run this, and we can see that for cluster 0, the majority are red wine,

with only 87 white wines identified as 0. And for cluster 1,

only 23 were red wines, and 4,811 were white.

So we can see that, without any labels, it did a pretty good job of separating our data set into two clusters that are very closely related to our red and white groups.
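The fit, predict, and cross-tabulation steps can be sketched as follows. The data here is synthetic (two well-separated blobs labeled red and white), so the counts will not match the wine numbers quoted in the video; the column names `f1`, `f2`, `kmeans`, and `number` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
red = rng.normal(loc=-3, scale=1, size=(100, 2))
white = rng.normal(loc=3, scale=1, size=(300, 2))
data = pd.DataFrame(np.vstack([red, white]), columns=["f1", "f2"])
data["color"] = ["red"] * 100 + ["white"] * 300
float_columns = ["f1", "f2"]

# The color label is never shown to the model.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
km = km.fit(data[float_columns])
data["kmeans"] = km.predict(data[float_columns])

# Count rows for every (color, cluster) combination.
counts = (data[["color", "kmeans"]]
          .groupby(["color", "kmeans"])
          .size()
          .to_frame()
          .rename(columns={0: "number"}))
print(counts)
```

Which group ends up as cluster 0 and which as cluster 1 is arbitrary, so the table should be read for how cleanly each color falls into a single cluster, not for the label values themselves.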

Now we're going to fit K-means models with the number of clusters ranging from 1 to 20. With this,

we are assuming that we don't know the value of k that we want.

We don't know how many clusters we want. And for each model,

we're going to store the number of clusters, as well as the inertia value.

And then we're going to plot the number of clusters against the inertia and see if we can find the elbow that would identify the best number of clusters for our data set.

So we start with an empty list, and then we loop over values from 1 to 20.

We call KMeans, and we initiate it with that number. We then fit it on our float columns,

and then we take our km_list and keep appending to it a pandas Series that holds the clusters,

which is just the loop variable at that point, and the inertia for this fitted model.

And then we can also save the model as well, just the full fitted model,

in case we want to access it later. So I'm going to run this, and it's going to take just a second to run,

so again I will pause the video and we'll come back when it's done running. OK, that has now run,

and we now have our km_list of the different cluster counts and their inertias, as well as their models.

Each element of that list is a pandas Series, and recall that clusters, inertia, and model

are the indices of each series. So we concatenate those series together using axis=1,

and then we transpose the result so that our column names are clusters,

inertia, and model, and we have one row for each of our different numbers of clusters,

with its respective inertia.

We then take only clusters and inertia; once we have those as our columns,

we're selecting only those two. We set our index to clusters,

the number of clusters, and that allows us to easily call plot_data,

which is now the pandas DataFrame that we have created here, .plot,

and say that we want markers connected by lines,

the markers being o's here. And then we just want our xticks to go from 0 up to 20,

and our x limits to go from 0 to 20, with our x label being cluster and our y label being inertia. We run this.
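A sketch of the loop that collects inertia per cluster count and reshapes the results for plotting. The data is synthetic (three blobs) and the range stops at 10 to keep it fast; the names `km_list` and `plot_data` follow the narration but are otherwise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

km_list = []
for num_clusters in range(1, 11):
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    km = km.fit(X)
    # One Series per fitted model: cluster count, inertia, and the model itself.
    km_list.append(pd.Series({"clusters": num_clusters,
                              "inertia": km.inertia_,
                              "model": km}))

# Each Series becomes a column; transposing turns them into rows.
plot_data = (pd.concat(km_list, axis=1)
             .T[["clusters", "inertia"]]
             .set_index("clusters"))
print(plot_data)
```

With matplotlib installed, calling `plot_data.plot(marker="o")` would draw the elbow curve; on this synthetic data the bend sits at three clusters because that is how the blobs were generated.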

And we look to see if there's any strong elbow. It doesn't seem like there's quite that;

maybe a bit of one at

4, perhaps, where the inertia stops declining as quickly.

So maybe you choose 4. But probably the best approach

is to use what you know about the natural groupings in the data,

as we did with the quality of the wine, where we knew that there were seven different values, from 3 to 9,

or, if you knew that there's red and white wine, you would choose that number as your k. Now,

that closes out our discussion of K-means. In the next question, in the next video,

we are going to start to discuss using agglomerative clustering to create our different clusters.

All right, I'll see you there.

027：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p27 26_聚类笔记本（选修部分）第3部分.zh_en -BV1eu4m1F7oz_p27-

Hi, and welcome back for question number 5 of our notebook. In this question,

we're going to fit an agglomerative clustering model with just two clusters.

We're then going to compare the results of agglomerative clustering to those of K-means,

and also compare both against the red and white wines to see if the numbers and the groupings seem to be the same.

We're then also going to visualize the dendrogram, which shows the subgroups building up to larger groups, as we saw during lecture when we talked about agglomerative clustering.

So we're going to see how we can do that in Python as well.

So the first thing that we want to do is import AgglomerativeClustering.

We create our object here and pass in the arguments.

We say that we want two clusters, so we're specifying the number of clusters equal to 2.

If we want, we can instead pass in the distance threshold as an argument here.

If you want to try that back at home, you just have to make sure to set the number of clusters equal to None.

You have to give either the number of clusters or the distance threshold; you cannot give both.

Here we'll use the number of clusters. We want it to compute the full tree;

if you didn't, the tree building would get cut off early to save computation time.

So if you want to save computation time, you could set this to False,

but this will run fairly quickly, and computing the full tree allows us to see everything that was built up within our tree.

And then we're setting our linkage to ward. And again,

that linkage means that at each step we merge the pair of clusters whose merger increases the inertia

the least, compared with any other grouping. So once that has been initiated,

we fit it to our data using just the float columns, as we did before.

And we add the result as another column within our data; we had kmeans before,

and now we're also going to have the agglom column. So I run this,

and this will take just a second to run, but not too long. Now we have our data,

and we can then use the same method that we saw before.
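A minimal sketch of the agglomerative fit, again on synthetic two-blob data. The column names (`f1`, `f2`, `agglom`, `number`) are assumptions; the constructor arguments are the ones named in the narration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(5)
data = pd.DataFrame(
    np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (80, 2))]),
    columns=["f1", "f2"],
)
data["color"] = ["red"] * 50 + ["white"] * 80
float_columns = ["f1", "f2"]

# Give n_clusters OR distance_threshold, never both;
# to use distance_threshold, n_clusters must be None.
ag = AgglomerativeClustering(n_clusters=2, linkage="ward",
                             compute_full_tree=True)
ag = ag.fit(data[float_columns])
data["agglom"] = ag.labels_

counts = (data[["color", "agglom"]]
          .groupby(["color", "agglom"])
          .size()
          .to_frame()
          .rename(columns={0: "number"}))
print(counts)
```

Unlike K-means, this estimator has no separate `predict` step for new rows; the labels for the fitted data are read from `labels_`.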

So we take our data and take the subset of columns: color,

the agglom column that we just created, and the kmeans column that we created earlier. First,

we group by color and agglom and see the counts. So we run this,

and we see again that the red and white wines were grouped appropriately.

We see that for red, only 31 were in agglom class 0 and 1,568 were in agglom class 1,

whereas the majority of white was classified as agglom class 0.

So we have the zeros and ones very highly related,

very highly correlated, with our red and white wine.

And that was a similar story when we worked with K-means as well.

Now, the numbers are flipped, but whether a cluster is called 0 or 1 is arbitrary;

what matters is that the wines are being separated into distinct classes. For red, we see 1,576 versus 23 with K-means

and 1,568 versus 31 here, so maybe not quite as well, and agglomerative clustering maybe didn't do quite as well for the white wine either,

but it still did a good job of separating these two classes.

And then if we want to look at both of these together, it will be a little bit difficult to read,

given that flip between 0 and 1 and the multi-index.

I would suggest just looking at the top two tables that we just discussed.

But if you want to dive deeper, you can see, for red wine, how often agglom

and kmeans were in agreement. Because the labels are flipped, agreement here is the combination of 1 and 0, which covers 1,563 rows.

And you can break it down accordingly and take a deeper dive into where the mismatches may have happened.

So again, though the clusters are not identical, they are very consistent within a single wine variety,

either red or white. Now we're going to plot out our dendrogram.

And I don't want to walk through all the different pieces of code;

this is just for plotting dendrograms, and it's all you will really need in order to use this moving forward.

But your fitted model should have this children_ attribute, which will help us identify the breakdown of our model.

We use hierarchy.linkage, which we imported from scipy.cluster,

and which allows us to create what we need to pass into the dendrogram that we're going to plot.

We initiate our figure and our axes, and we create the colors that we want to use.

And we set the link color palette, which controls how the links are colored;

you'll see this red and gray come into play in just a second, once we've plotted it out.

And then we call hierarchy, which is what we imported here,

.dendrogram to plot out our dendrogram. Now, Z is equal to that hierarchy linkage object that we

created just above. There are some important arguments, but first,

let me run this so we can see what it looks like before going through them.

So we see the dendrogram, and we see how it broke down from side to side.

And we also see that this went down a certain number of levels.

It didn't go all the way down to the bottom. If we wanted to see all the way down to the bottom,

we could change that, and it would take some more time to plot.

We also set show_leaf_counts=True,

so we can see the number that shows up under each one of these subgroups,

which is how many rows fall into each of them. Now, if we wanted to see less data,

we could set p equal to something like 10. And I run this again,

and now, if you count the lines at the bottom, there are only 10,

so it's breaking it down only until there are 10 subgroups left.

And that depends on using truncate_mode lastp. You can also write level here,

and then p says how many levels down you want to go. So just to highlight,

this is about two levels down. If we were to run this just one level down,

we can see that it just breaks out into these two subgroups. Again,

I changed p and truncate_mode at the same time in order to control how much of that dendrogram we actually want to visualize.
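The truncation options can be explored without drawing anything by passing `no_plot=True`. One assumption to flag loudly: this sketch builds the linkage matrix directly from synthetic data with `hierarchy.linkage`, which is a simplification and not necessarily how the notebook derives `Z` from the fitted model.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-3, 1, (30, 2)), rng.normal(3, 1, (30, 2))])

# Ward linkage matrix: one row per merge, so n - 1 rows for n points.
Z = hierarchy.linkage(X, method="ward")

# Full tree: every original point is a leaf.
full = hierarchy.dendrogram(Z, no_plot=True)

# Show only the last p merged groups; each leaf label then reports how many
# original rows it contains because show_leaf_counts is True.
last10 = hierarchy.dendrogram(Z, truncate_mode="lastp", p=10,
                              show_leaf_counts=True, no_plot=True)

# Alternatively, limit how many levels below the root are shown.
one_level = hierarchy.dendrogram(Z, truncate_mode="level", p=1, no_plot=True)

print(len(full["leaves"]), len(last10["leaves"]), len(one_level["leaves"]))
print(last10["ivl"])
```

Dropping `no_plot=True` and passing a matplotlib axis through the `ax` argument would render the same trees as figures.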

Now, we're going to stop here. In the next video,

we're going to discuss how you can actually incorporate these different clusters into creating your models,

seeing the performance of each, and then closing out this notebook by walking through the performance with

different numbers of clusters. All right, I'll see you there.

028：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p28 27_聚类笔记本第4部分.zh_en -BV1eu4m1F7oz_p28-

Now, in this question, we are going to explore the idea of using clustering as a form of feature engineering.

So the first thing that we need to do is create a variable that we're going to try to predict, since this feature engineering

will now be for supervised learning. So we create a binary target variable y,

which is just going to denote whether or not the quality is greater than 7.

So greater than 7 will be equal to 1, and 7 or less will be equal to 0.

We then create a variable called X_with_kmeans.

That comes from our original data, so it's going to be a pandas DataFrame.

We take that data and everything that we've worked with so far. If you recall,

we added on agglom as a column as well as kmeans as a column. So we drop quality, color, and agglom,

which leaves kmeans. So we have all of our float columns plus that kmeans column.

And then we create another pandas DataFrame, X_without_kmeans,

which just takes X_with_kmeans and drops the kmeans column.

Then, for both data sets, we use StratifiedShuffleSplit with 10 different splits.

We fit 10 different random forest classifiers and, with that,

compute the ROC AUC score of each of these 10 classifiers,

find the average for each data set, and see which performed better:

the one with kmeans or the one without kmeans.

So in order to do so, we first have to import our RandomForestClassifier.

We also import roc_auc_score, as well as StratifiedShuffleSplit.

Hopefully you recall all of that from the course on supervised learning.

We then create our target variable, which is just where the quality is greater than 7.

We set that equal to 1. If we evaluate just this part, the quality greater than 7,

it returns either True or False. Calling .astype(int) converts True to 1 and False to 0.

We then initiate X_with_kmeans, which is the data set that we have built up so far,

but dropping agglom, color, and quality. So we keep the kmeans column and our float columns.

And then X_without_kmeans takes the X_with_kmeans that we just defined

and drops the kmeans column. So now we have these two different pandas DataFrames.

One is just the float columns, which is X_without_kmeans,

and one is the float columns plus that kmeans column, which is X_with_kmeans.
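The target and the two feature frames can be sketched on a tiny hand-made table. The four rows and the column names `alcohol` and `density` are invented for illustration.

```python
import pandas as pd

# Tiny made-up table with the same kinds of columns as the narration.
data = pd.DataFrame({
    "alcohol": [9.4, 12.8, 10.5, 13.1],
    "density": [0.998, 0.991, 0.996, 0.990],
    "quality": [5, 8, 7, 9],
    "color": ["red", "white", "red", "white"],
    "kmeans": [0, 1, 0, 1],
    "agglom": [1, 0, 1, 0],
})

# The comparison gives True/False; astype(int) turns that into 1/0.
y = (data["quality"] > 7).astype(int)

X_with_kmeans = data.drop(["agglom", "color", "quality"], axis=1)
X_without_kmeans = X_with_kmeans.drop("kmeans", axis=1)

print(y.tolist())
print(list(X_with_kmeans.columns), list(X_without_kmeans.columns))
```

Note that a quality of exactly 7 lands in class 0, because the comparison is strictly greater than 7.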

We then initiate our StratifiedShuffleSplit object.

And then we define this function, which allows us to pass in an estimator.

That estimator, spoiler alert, will be a RandomForestClassifier here,

but we'll see how we use this again for logistic regression as well.

We also pass in an X and a y, so our features and our outcome variable. First,

we initiate an empty list of ROC AUC scores, and that's because, if you recall,

we're going to create 10 different values and then take the mean of those values.

So we'll append each of those values to this empty list. We loop over train_index and test_index

in sss.split for our X and y, depending on the X and y that we passed into our function.

And because this sss is defined to have 10 different splits, when we run this for loop

we are running through 10 different iterations

of stratified shuffle splits. These are different splits of our data that are stratified,

so that the same proportion of rows with quality greater than 7 shows up in each of our train and test sets.

We then set X_train and X_test using those train indices and test indices,

and we set y_train and y_test with those same train indices and test indices.

We can then take the estimator that we're passing into our function

and call .fit on the training set that we defined.

And then we can come up with our actual predictions,

which are estimator.predict on our test set, our holdout set.

And we can do the same for our predicted probabilities. If you recall, if we want the ROC AUC score,

we need the predicted probabilities to compute it.

So we get the probabilities, which are output for both of the classes.

We only want the positive class, so we take all rows but only the column at index 1,

not the zeroth column. Those are our scores.

And then, for each one of our iterations, we can compute

the ROC AUC score from our actual values, y_test,

and the scored values that we just computed.

We keep appending that to our list so that we get

all 10 different ROC AUC values. We then take the mean of that list,

and we have the average ROC AUC score across those 10 splits.
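The function described above might be sketched like this. The function name `get_avg_roc_10splits`, the random seeds, the forest size, and the synthetic data are assumptions; the structure (10 stratified splits, positive-class probabilities, mean ROC AUC) follows the narration. NumPy arrays are used for `X` and `y` so that positional indexing works directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=10, random_state=6532)

def get_avg_roc_10splits(estimator, X, y):
    """Average ROC AUC of `estimator` over the 10 stratified splits in `sss`."""
    roc_auc_list = []
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        estimator.fit(X_train, y_train)
        # Column 1 holds the probability of the positive class.
        y_scored = estimator.predict_proba(X_test)[:, 1]
        roc_auc_list.append(roc_auc_score(y_test, y_scored))
    return np.mean(roc_auc_list)

# Synthetic, imbalanced binary problem driven mostly by the first feature.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

estimator = RandomForestClassifier(n_estimators=25, random_state=0)
avg_auc = get_avg_roc_10splits(estimator, X, y)
print(avg_auc)
```

Stratification matters here because the positive class is a minority; without it, a small test split could end up with too few positives for the ROC AUC to be meaningful.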

So now that we have that function defined, which outputs the average across the 10 splits,

we can set our estimator to a RandomForestClassifier,

so we have estimator equal to this object.

We pass that in to the function that we just defined, along with X_with_kmeans,

so this is with the kmeans column. That's going to be our X, together with our target column y.

And then we do the same thing, running that function with the same estimator,

except on X_without_kmeans, so on our data set without that extra column. We run this,

and we can see that the model without the kmeans cluster actually did worse than the one with the kmeans cluster.

So we performed better when we had our kmeans cluster as an input into our random forest.

Now, what I'd like to do is explore the idea of changing the number of labels that we incorporate when we create this new feature, or this new set of features, if we think about it in terms of one-hot encoding.

So we're going to say for n equals 1 through 20, we fit a K-means algorithm with n clusters.

So first one cluster, then two clusters, three clusters, and so on. We then have to one-hot encode the labels, because otherwise label number 19 will be thought of as greater than label number 5 or label number 10.

Instead, we want each label one-hot encoded so that there's no ordinal meaning attached to those values.

And once we have our one-hot encoded version of that column,

we fit a logistic regression model and compute the average ROC AUC score.

And then we plot that average ROC AUC score for each one of our different numbers of clusters.

So I'm going to run this while I explain it because it may take a little bit of time。

But the way that we start off is that we set X_basis equal to just those float columns.

We then initiate our StratifiedShuffleSplit with 10 splits, as we did before.

We then define this new function, create_kmeans_columns. As I mentioned,

we can't just create one column with multiple labels; we have to one-hot

encode those labels. So we say km equals KMeans with the number of clusters equal to whatever n we pass in.

We then fit on just our float columns. And when we call km.predict on our X_basis here,

we output the label for each row. So if n were equal to 20,

we'd have values starting from 0 up to 19, for our 20 different clusters.

We then take that column that we just created and call pd.get_dummies on it.

That creates, if there are 20 labels, 20 different columns, each holding a 1 or a 0

depending on whether that row's label was 0, 1, 2, and so on.

We then concatenate those float columns with the new kmeans columns that we defined,

so that may be up to 20 columns that we're adding on. And once we have this data frame,

the idea is that we can pass it in as our features and then fit our models.
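A sketch of the one-hot step on synthetic float columns. The names `X_basis` and `create_kmeans_columns` follow the narration; the column prefix `kmeans_cluster` and the seeds are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
X_basis = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])

def create_kmeans_columns(n):
    """Return X_basis plus one 0/1 indicator column per K-means cluster."""
    km = KMeans(n_clusters=n, n_init=10, random_state=0)
    km.fit(X_basis)
    labels = pd.Series(km.predict(X_basis), index=X_basis.index)
    # One indicator column per label, so no ordering is implied between labels.
    km_cols = pd.get_dummies(labels, prefix="kmeans_cluster")
    return pd.concat([X_basis, km_cols], axis=1)

X_5 = create_kmeans_columns(5)
print(X_5.head())
```

Each row has exactly one indicator set, which is what lets a linear model such as logistic regression learn a separate offset per cluster.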

So we initiate our estimator as logistic regression. We say that ns,

the numbers of clusters that we want to run through, are 1 through 20.

We then get our list of ROC AUC values

by calling the average-ROC-over-10-splits function that we defined in the cell above.

We pass into that the estimator. Our X value is the create_kmeans_columns output. Remember,

this outputs a pandas DataFrame that concatenates our new labels, one-hot encoded, onto the original float columns.

And we use that same target variable for each n in the ns that we have defined up here.

We then plot that out: we initialize our plot,

and then we plot the ns against the ROC AUCs that are output

by the function that we're running here. We've already run this,

so let's look down at the results.

And we see it jumps around quite a bit as we add or remove clusters.

And this is after just 10 iterations.

That closes out our section on the different clustering methods

and gives you an introduction to how you can use these clustering methods to do some feature engineering.

And with that, we close out our section on clustering, and in lecture

we will move on to dimensionality reduction.

All right, I'll see you there.

029：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p29 28_降维度简介.zh_en -BV1eu4m1F7oz_p29-

In this set of videos, we are moving away from clustering and moving on to a different class of unsupervised learning,

namely dimensionality reduction， or finding ways of representing our data set in lower dimensions。

Now let's discuss the learning goals for this section。In this section。

we're going to have an overview of dimensionality reduction and how we can go about solving the problem of the curse of dimensionality by coming up with a lower dimensional representation of our data that maintains the majority of information that's important to us within that original data set。

We'll then discuss principal component analysis or PCA and how we can use that to come up with new features in lower dimensional space。

solving our problem of the curse of dimensionality. And then we're going to discuss non-negative matrix factorization and how we can use it to decompose our original data into only non-negative values and again reduce the number of dimensions.

Now, we should recall from earlier in the course, as well as from working through our notebook on the curse of dimensionality, that

in practice, too many features may lead to worse performance for our models.

Our distance measures perform poorly, and the incidence of outliers increases, as we increase the number of dimensions.

And the reason for this is, if we think about working with just one dimension that has, say,

10 positions, then we need only six observations,

only 6 rows, to cover 60% of this space. If we increase this to two dimensions,

each one with 10 different positions, then we need 60 different observations within our data set in order to cover 60% of the possible positions.

And then if we increase it to three dimensions and beyond,

we can see how the number of observations needed to cover the same share of the available space

increases exponentially as more and more dimensions get added on.
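The arithmetic behind this example can be checked with a few lines. The variable names are illustrative; the 10 positions per dimension and the 60% coverage target come from the lecture.

```python
positions_per_dimension = 10
coverage_percent = 60

needed_by_dims = {}
for n_dims in range(1, 6):
    total_positions = positions_per_dimension ** n_dims
    # Rows needed to occupy 60% of all possible positions.
    needed_by_dims[n_dims] = total_positions * coverage_percent // 100

for n_dims, needed in needed_by_dims.items():
    print(f"{n_dims} dimension(s): {needed} observations")
```

Every added dimension multiplies the requirement by 10, which is the exponential growth the lecture refers to.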

So this is a very common situation within business。

within enterprise data sets that often contain many， many features。

Data can be often represented by using fewer dimensions or fewer features than your original data may have。

And ways to accomplish this would be either reduce the dimensionality by selecting a certain subset that you deem are the most important features within that larger data set that you're working with。

Or you can combine features using linear and nonlinear transformations,

which is what we're going to do here, starting with PCA.

So how does PCA, or this idea of creating new features out of the many features, actually work?

Here in this example， we'll start with two features。

And we see that we have phone usage and data usage as our two features。They look very correlated。

one with the other， and visually， it looks like the points lie very close to a line。

So the question is, can we reduce the number of features from the two that we have down to one? Now,

what if we considered this line, projected the points onto it, and used those projections instead?

So here are the different projections. This entails a linear transformation of our data to create this new single line,

and if we think about going into higher dimensional space,

we can imagine projecting from 3D down to 2D, or from 100 dimensions down to 10; in general, just projecting down to lower dimensions.

Now with our linear transformation， the points are going to now lie on this line that we see here。

We have now created out of those two original dimensions。

a one dimensional feature space that is the combination of phone and data usage。

We can think of this transformation as a scaled addition of the two columns. Thus,

we now have one column created as a combination of those two original columns.
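The "scaled addition" can be made concrete with a small NumPy sketch. The data is synthetic, and the direction of the line is picked by hand here (an assumption for illustration); PCA would instead choose it from the data.

```python
import numpy as np

rng = np.random.default_rng(9)
phone = rng.normal(size=100)
data_usage = 0.8 * phone + rng.normal(scale=0.1, size=100)
X = np.column_stack([phone, data_usage])

# Unit vector along the line the points roughly follow (chosen by hand).
direction = np.array([1.0, 0.8])
direction /= np.linalg.norm(direction)

# The single new feature: a weighted sum of the two original columns.
new_feature = X @ direction

# Where each projected point sits back in the original 2D space.
projected_points = np.outer(new_feature, direction)

print(direction)
print(new_feature[:5])
```

The two weights in `direction` are the scaling factors applied to phone usage and data usage before they are added together.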

This is going to be the idea behind principal component analysis or PCA。

We replace the columns by some linear combinations of those original columns。

And these linear combinations are not going to be arbitrary。

They're going to be intelligently selected in order to preserve the underlying meaning of our data。

And what we mean by that, as we'll see in a second, is trying to maintain as much of the original variance as possible.

030：IBM《机器学习（无监督学习、深度学习和强化学习、毕业项目）｜machine learning》中英字幕 p30 29_主成分分析降维.zh_en -BV1eu4m1F7oz_p30-

And now， looking at what we had before compared to what we now have。

we have successfully created a single feature out of the two features we were originally working with,

thus reducing the dimensionality of our feature space.

So now let's focus on how principal component analysis, or PCA, finds these lines on which to project our data.

Let's say this is the data set we're now working with.

And we can see pretty clearly that the data is distributed in a certain way on a certain axis that we can see visually。

Now, linear algebra has tools that can determine exactly where the axis

with the most variance is. Using linear algebra, we can find this primary vector,

the main direction along which the data set is distributed;

mathematically, it's called the primary right singular vector.

It accounts for the maximum amount of variance in any direction for our data set.

Now, setting aside that primary right singular vector, this is the second axis for the data set.

It's another right singular vector, secondary to the primary one that we just highlighted.

Once we have this decomposition of our data into orthogonal, or perpendicular, vectors,

with each of these vectors perpendicular to the others,

we can determine a meaningful projection of our data. Here,

since the vectors' lengths are so different, it makes sense to project onto the V1 that we saw,

and we wouldn't lose a lot of information if we projected our data down to V1.

This is because there's not much variance in V2's direction. If you were to project onto V2,

in the same way that we did in the last example, you'd see that the scale would be very small:

we'd be scrunching up our data much more than if we projected onto V1. If we project onto V1,

we're able to maintain a lot of that original variance.

So in order to find these singular vectors。The mathematical theory that enables us to find this is called the singular value decomposition。

Now, the data set that we work with does not need to be square.

As we see here, our original data set A is going to be an M by N matrix, with M and N not being equal.

We can decompose A into the matrices U, S, and V. U and V here can be thought of as just rotations in space,

one in the M by M space and one in the N by N space.

They encode the information of V1 and V2's directions only, but not their lengths.

They are more like auxiliary or technical matrices; the real geometric idea lies with S. The matrix S stores the actual lengths of those vectors.

Recall that those longer vectors tell you which ones should be your primary vectors when deciding where to project your data down onto.

So S, as we see here, given where the stars are, is what's called a diagonal matrix,

meaning the only nonzero entries in that matrix are along the diagonal.

These values are sorted from largest to smallest,

and they tell us which vectors are actually important. So here in this example,

we're working with a 5 by 3 matrix originally, and we decompose that into U being 5 by 5,

S being 5 by 3, and V transposed (or V originally) being 3 by 3.

And this singular value decomposition is what scikit-learn actually uses for PCA, for principal component analysis.
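To make those shapes concrete, here is a minimal sketch using NumPy on a made-up 5 by 3 matrix. The random data and the variable names are assumptions for illustration only; they are not from the lecture slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))            # made-up 5 x 3 data matrix (M=5, N=3)

# full_matrices=True returns the full U (M x M) and V transposed (N x N)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# NumPy returns the singular values as a 1-D array, sorted largest first.
# Place them on the diagonal of an M x N matrix to get S.
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)

print(U.shape, S.shape, Vt.shape)      # (5, 5) (5, 3) (3, 3)
print(np.allclose(A, U @ S @ Vt))      # True: the three pieces rebuild A
```

Note that NumPy hands back V already transposed, which is why the product is written as `U @ S @ Vt`.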

So let's say our data set, when decomposed, looks like what we have here.

We have three singular values. Those three values along the diagonal are, say, 9, 5, and 2,

with 9 at the top left, then 5 and 2. That tells us that the first two singular vectors are more important than the third. Again,

the larger the value, the more important it will be.

So most of the variance in the data is in the direction of the first two principal components.

And those principal components are going to be calculated from the V that we have here.

If we were to plot the values of V, they would give us

the points, measured from the origin, here in three dimensions,

that each principal component points to. And again,

the first principal component is the one that accounts for the most variance.

Our goal is to bring the data down from N dimensions to K dimensions.

So we're working with A, an M by N matrix, and we want to change that to

a new matrix, not necessarily A, that is going to be M by K:

we keep the same number of rows, and K is going to be fewer columns than N,

which is currently 3. All we have to do is take that decomposition

and see where we can remove columns. Here we use the vectors from V that go with the largest singular values.

We can multiply A by the transpose of that truncated V. If the truncated V is K by N,

its transpose is N by K. So taking the dot product of an M by N matrix with an N by K matrix,

we end up with a new matrix that has dimensions M by K.

That gives us a new data set, using this singular value decomposition,

that is now M by K: a reduced number of columns, each a combination of the original columns.
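The multiplication described above can be sketched in a few lines of NumPy. This assumes made-up data, and it mean-centers the columns first, which is what scikit-learn's PCA does before the SVD; the names `V_k` and `A_reduced` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: M=100 rows, N=3 columns with different spreads
A = rng.normal(size=(100, 3)) * np.array([9.0, 5.0, 2.0])
A_centered = A - A.mean(axis=0)        # center each column

U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

k = 2
V_k = Vt[:k].T                         # N x K: directions with the k largest singular values
A_reduced = A_centered @ V_k           # (M x N) @ (N x K) -> M x K

print(A_reduced.shape)                 # (100, 2)
```

Each of the two new columns is a linear combination of the three original columns, with the combination weights given by a column of `V_k`.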

Something to keep in mind when we're doing principal component analysis

is that, since we are talking about lengths here a lot,

the algorithm will be very sensitive to scaling. So it will be important to scale prior to applying PCA.

If we think about each of the different algorithms that we've used so far in this course

and the effects of distance, we'll notice that having unscaled data would allow one of those axes to have more weight in determining where the maximum variance appears to be.

So if our data is not scaled, we can end up with this projection that we see here, when in reality we'd want this projection down the center of our data.
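A small sketch of that scaling effect, assuming two made-up correlated features where one happens to be recorded in much larger units. The data is synthetic and only meant to illustrate the point.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)      # y is positively correlated with x
X = np.column_stack([x * 1000.0, y])          # column 0 is in much larger units

# First principal component without and with scaling
pc1_raw = PCA(n_components=1).fit(X).components_[0]
pc1_scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print(np.abs(pc1_raw).round(3))      # loading sits almost entirely on column 0
print(np.abs(pc1_scaled).round(3))   # both columns contribute equally (about 0.707 each)
```

Without scaling, the first component simply follows the axis with the biggest numbers; after standardizing, it runs down the center of the data.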

Now, in order to do PCA using scikit-learn, we write: from sklearn.decomposition import PCA.

We're then going to create our instance of the class here,

so PCA_inst = PCA(...), and we have to say how many components we want to reduce our original data frame down to.

So if we're starting off with 10 columns and we want to reduce it down to three columns,

that's what n_components is going to signify.

So we can pass in the final number of components that we actually want.

We can then take that initialized instance of PCA, with the number of components equal to 3,

and call fit_transform, the same way that we have with our standard scalers:

we call fit_transform and it outputs a new data set, now with fewer columns.

So for example, we can transform our customer churn data set,

which has around 20 numeric features, to one with only three features,

with those three features being combinations of the original 20 features that we had,

using that singular value decomposition that gave us that V matrix to show us how to reduce the number of dimensions.
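As a minimal sketch of that syntax, assuming a random 20-column array as a stand-in for the churn features (the real data set is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))              # stand-in for ~20 numeric features

X_scaled = StandardScaler().fit_transform(X)   # scale before PCA

PCA_inst = PCA(n_components=3)              # keep three components
X_reduced = PCA_inst.fit_transform(X_scaled)

print(X_reduced.shape)                      # (200, 3)
print(PCA_inst.components_.shape)           # (3, 20): each component mixes all 20 features
```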

Now that closes out our discussion here on linear PCA. In the next video we will discuss how we can move beyond linearity. All right,

I'll see you there。

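031: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p31 30_Dimensionality Reduction Notebook (Optional) Part 1.zh_en -BV1eu4m1F7oz_p31-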

Welcome to our notebook here on Dimensionality Reduction. In this notebook,

we're going to be using the Portuguese wholesale distributor data set.

That data set contains the annual spending on fresh products, on milk products,

grocery products and so on. The last two, which we're actually going to end up dropping, are channel and region. We drop those because we want to focus on the numeric values here, and these are technically both categorical values. It would be just as easy, if we wanted to, to one-hot encode them,

but for this we're just going to drop those two columns.

We're then going to import our necessary libraries as we do at the start of each one of our notebooks。

Then here for part 1, we're going to want to import our data and check each of the data types.

We're then, as mentioned, going to drop the channel and region columns, as we won't be focusing on these throughout our examples here using PCA.

We're then going to convert the remaining columns to floats, if that's necessary.

And then we're going to copy a version of the data that we just created, using the .copy() method,

to preserve it. We'll be using that later on, and we'll see how in a bit. So first things first,

we import our data using pandas.read_csv.

We look at the shape and we see that we have 440 rows and eight columns. Recall that the number of columns is going to be important, as our goal here with PCA is to reduce the number of columns that we're working with when we create our models, or whatever it is that we want to do with our data.

Maybe we want to visualize it, and so we want to reduce it to two columns. So we look at our first five rows,

and we see here that we still have that channel and region, which we said we don't want to include.

So we're just going to call data.drop, and we drop channel and region with axis equals 1.

We look at the data types and see that they're each integers,

and we're just going to convert each of those to float,

calling .astype(float) on each one of the different columns.

Now we have them all as floats, and then, as mentioned,

we're going to want to save this original data for later. So recall, here we have our data,

which is the data frame that we've just created, and then data_orig is going to be a copy of that, which we're not going to touch for a bit.
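The preparation steps above might look roughly like this. Since the CSV file itself is not available here, the sketch builds a random stand-in frame; the column names are assumed from the public wholesale customers data set, and in the notebook the frame would come from pandas.read_csv instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the real file: random integer spending values (assumption)
cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
data = pd.DataFrame(rng.integers(1, 20000, size=(440, 6)), columns=cols)
data.insert(0, "Region", rng.integers(1, 4, size=440))
data.insert(0, "Channel", rng.integers(1, 3, size=440))

print(data.shape)                       # (440, 8)

# Drop the two categorical columns and convert everything else to float
data = data.drop(["Channel", "Region"], axis=1)
data = data.astype(float)

# Keep an untouched copy for later
data_orig = data.copy()
print(data.dtypes)
```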

Here in part two, we need to, again, ensure that our data is scaled and relatively normally distributed.

It'll be easier to work with normally distributed data,

and, as mentioned in the lecture, we saw how important it is to scale our data to ensure that no feature has extra weight when trying to come up with the different principal components.

So we're going to examine the correlation between each one of our different features.

Recall this will be important because, when we are doing PCA,

what we are looking for is this: if two features are very highly correlated,

they're not adding much extra information, and we want to remove or reduce those, or combine a few, to end up with fewer features overall.

So if they're highly correlated, we can probably remove some without losing much variance from the overall data set.

We're then going to perform any transformations and scale our data using whatever scaling method you prefer,

whether it's the MinMaxScaler or the StandardScaler.

We're then going to view the pairwise correlation plots using our pair plot, just to visualize all the relationships, as well as to see whether we now have normally distributed data,

looking across the diagonal of the pair plot.

So the first thing that we want to do is call data.corr(),

so we can see the correlation between each one of the different features.

This gives us, for each feature, the correlation with all the other features, in a square data frame.

We want to find, for each feature, which other feature it has the highest correlation with.

And because a feature will always have a correlation of 1 with itself,

we're going to replace the diagonal values, which start off as all ones, with zeros.

So we're saying: for x in range(corr_mat.shape[0]).

It's a square matrix, so we could have used shape[0] or shape[1].

So that loops over every row index of our matrix.

For each index in that range, we take the diagonal value,

so (0, 0), (1, 1), (2, 2), and replace that 1 with a 0.

And we can see now our correlation matrix has the correlation between fresh and milk, and fresh and grocery,

and then for fresh and fresh it's just a zero, as it is all along the diagonal.

Now, we're going to take the absolute value of that full correlation matrix,

as we don't care if it's positive or negative, just the strength of that correlation.

And we're going to call idxmax to see which feature is most highly correlated with each of the other features.

So we're asking, what's the index of the max value? So for fresh, it's frozen; for milk, it's grocery;

and so on and so forth.
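Here is a sketch of those correlation steps on made-up data. The three columns are hypothetical stand-ins in which Milk and Grocery share a common signal, so the result will not match the pairings read out from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=300)
data = pd.DataFrame({
    "Milk": base + 0.3 * rng.normal(size=300),      # shares a signal with Grocery
    "Grocery": base + 0.3 * rng.normal(size=300),
    "Fresh": rng.normal(size=300),                  # unrelated to the others
})

corr_mat = data.corr()

# A feature always correlates perfectly with itself, so zero out the diagonal
for x in range(corr_mat.shape[0]):
    corr_mat.iloc[x, x] = 0.0

# Strength only: take absolute values, then find each column's strongest partner
strongest = corr_mat.abs().idxmax()
print(strongest)
```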

We're then going to examine the skew for each one of our different columns, and then take the log transformation, if necessary,

for those that have higher skew. Recall that the skew is a value with 0 being no skew,

a positive value being a right skew, and a negative value being a left skew. The larger that value is in magnitude,

the stronger the skew. So we call data.skew() to see the skew of each one of our different columns.

We sort them from largest to smallest, and those are going to be our log columns,

which will now be a pandas Series. And then we keep only those log columns whose skew is greater than 0.75,

those that have a higher skew, and we see here the columns that tend to have a higher skew.

And for those, we're going to take the log transformation of each,

hopefully creating more normally distributed data.

So: for col in log_columns.index.

log_columns, which we just defined, is that pandas Series.

If we call the index, we get each one of these: delicatessen, frozen, milk and so on,

which also match up with our data columns.

So we're going to replace those columns, in place, with the log transformation of those columns.

We can then also apply the MinMaxScaler, so from sklearn.preprocessing we import MinMaxScaler.

We want to ensure that all our values are on the same scale. We call MinMaxScaler,

we instantiate the object, and then we say: for column in each one of our columns,

we fit and transform on that column, again replacing it in place, to rescale that data so all values are between 0 and 1 using the MinMaxScaler,

which, we recall, just subtracts the minimum value and then divides by the max minus the min.

So that'll ensure all our values are between 0 and 1.
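The skew check, log transform, and min-max scaling can be sketched as follows. The data is synthetic, and using np.log1p for the log step is an assumption (the transcript only says "log transformation"); log1p is chosen here because it stays defined at zero.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Fresh": rng.lognormal(mean=8, sigma=1, size=440),       # strongly right-skewed
    "Milk": rng.lognormal(mean=7, sigma=1, size=440),        # strongly right-skewed
    "Symmetric": rng.normal(loc=100, scale=10, size=440),    # roughly zero skew
})

# Skew per column, largest first; keep only those above the 0.75 cutoff
log_columns = data.skew().sort_values(ascending=False)
log_columns = log_columns.loc[log_columns > 0.75]

# Replace the skewed columns with their log transform
for col in log_columns.index:
    data[col] = np.log1p(data[col])

# Rescale every column to the range [0, 1]: (x - min) / (max - min)
mms = MinMaxScaler()
for col in data.columns:
    data[col] = mms.fit_transform(data[[col]]).squeeze()

print(list(log_columns.index))
print(data.describe().loc[["min", "max"]])
```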

The next thing that we want to do is visualize everything that we've just done.

So we're going to see each of the relationships, and hopefully see those high correlations in the scatter plots of the pair plot, as well as hopefully seeing more normally distributed data,

which we do see, for the most part, throughout each one of our different columns. And we see, for example,

that milk and grocery have a pretty high correlation: if you look just three columns in and two columns down,

you see that high correlation.

Now, in part 3, we want to introduce how we can do this all in one step.

This will be especially useful if we want to incorporate this into some supervised learning model later on and be able to pass in different parameters throughout.

So we're going to use the Pipeline class, which we saw during our course on supervised learning.

But what's important when using Pipeline is that each one of the steps that are passed in,

each one of the different pieces of that pipeline, has to have fit and transform methods.

So we want to take the log and then apply the MinMaxScaler.

But the log doesn't have that fit_transform that's built in with each one of the scikit-learn objects that we've been working with.

So MinMaxScaler has a fit_transform, but the log function does not.

So in order to ensure that we have a version of that log transformation with the fit and transform methods that we can pass into our pipeline,

we're going to use FunctionTransformer. It takes whatever function it is that you want to pass in

and converts it so that it has fit and transform methods available.

So now we have a log_transformer object, which applies the log transformation and has fit and transform methods.

And once we do that, we can pass it into our pipeline. So first, within our pipeline,

we need to pass in a list of tuples, where the first value of each tuple is just a name, in case we want to pull that step out later,

and the next value is the actual transformer that we want to call.

So here we use that log_transformer that we just created, and then MinMaxScaler.

We pass this list of tuples into our Pipeline, and then we can just call pipeline.fit_transform on our original data.

If you recall, we made a copy, and we didn't change the data in that copy at all.

And we can call fit_transform and get the output, down the line, of both taking that log transformation and applying that MinMaxScaler.

And we run this. That data_pipe should then equal the data that we just transformed.

So we're going to check that using numpy.allclose,

which checks that each value within each of our arrays is the same,

allowing for a bit of possible rounding error many decimal places down the line. So we run this,

and we see that it's true that all of our values are the same,

and we see that our pipeline works just as well as taking each one of these different steps separately.
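A sketch of that pipeline on synthetic data. The step names, the use of np.log1p, and the random stand-in frame are assumptions; in this sketch every column is log-transformed on both routes so that the two results are directly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

rng = np.random.default_rng(0)
data_orig = pd.DataFrame(
    rng.lognormal(mean=8, sigma=1, size=(440, 3)),     # skewed stand-in data
    columns=["Fresh", "Milk", "Grocery"],
)

# Manual route: log transform, then min-max scale, column by column
data = data_orig.copy()
for col in data.columns:
    data[col] = np.log1p(data[col])
    data[col] = MinMaxScaler().fit_transform(data[[col]]).squeeze()

# Pipeline route: wrap the plain function so it has fit and transform methods
log_transformer = FunctionTransformer(np.log1p)
estimators = [("log1p", log_transformer), ("minmaxscale", MinMaxScaler())]
pipeline = Pipeline(estimators)
data_pipe = pipeline.fit_transform(data_orig)

print(np.allclose(data_pipe, data.to_numpy()))         # True
```

Naming the steps pays off later: a step can be retrieved with `pipeline.named_steps["minmaxscale"]`, and its parameters can be addressed in a grid search.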

Now, that closes out part 3. In part 4, we're going to start working with PCA on this transformed data that we've been working with, and see how much of the variance we can explain with different numbers of these principal components.

All right， I'll see you there。

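032: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p32 31_Dimensionality Reduction Notebook Part 2.zh_en -BV1eu4m1F7oz_p32-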

Now, for part 4, which we will be working through here in this video,

we're going to perform PCA on that data that we worked through in the last video.

And we're going to perform PCA for the number of components ranging from one to five.

We start off with six columns, so no matter what, we're going to reduce the number of columns that we'll ultimately be working with.

We're then going to store the amount of explained variance for each one of the different numbers of dimensions.

So for one dimension, how much variance was explained, then for two, and so on and so forth.

And if we were to set the number of components equal to 6,

then we would have explained 100% of the variance.

So we're asking how much of the variance is explained at each one of the different steps.

We're also going to store the feature importances for each number of dimensions.

Something to note is that PCA won't explicitly provide this feature importance,

but the components_ attribute, which we'll show you how to use in just a bit,

will show you how each one of those principal components was composed as a combination of the original features. The larger those values are,

given that we've standardized our data, the more impact each one of those features has had on that principal component,

and therefore we can assume that it is a more important feature.

And then we're going to plot both that explained variance and these feature importances.

Now I'm going to break this down step by step, so I'm going to actually create a cell above.

But before I do that, just to show you where we're starting off:

from sklearn.decomposition we're going to import PCA.

We're going to initialize two empty lists, the PCA list and the feature weight list,

which we're going to use to store our explained variances and the feature importances.

And then for n in range(1, 6), so 1 through 5, including 5,

since that's what we want to range through, we're going to initialize a model,

a PCA model with the number of components equal to wherever we are within that range.

And then we're going to fit it to our data, which we have now transformed to ensure that it is on the same scale and mostly normal.

We're then going to take the explained variance of each and append that to the PCA list.

And then after a few steps, which I'll walk you through in just a bit,

we're going to take each one of the feature importances and append it to the feature weight list.

And then after we do this for n in range one through six,

we have this for each one of our different numbers of principal components.

So let's start off by looking at just this step here. We're going to create a pandas Series.

We are also going to need, of course, to initialize our model.

Since I'm pulling this out here, I'm going to set n equal to two as we discuss all the steps.

And you can imagine that the loop is going to do this for n equals 1 through 5.

So we set n equal to 2, and then let's see what this series is that we're going to be outputting.

It should contain n, which is the number of components, which we set to two,

the actual model, as well as the explained variance up to that point: so, using two components,

how much variance was explained.

So run this, and you can see that it explained 72% of the overall variance. Now,

just to see how the explained variance ratio actually looks, let's pull this out.

And we can see that, if you set n equal to 2,

it shows you how much of the variance was explained by the first principal component,

which was about 45%, and how much by the second component, which was about 27%.

And the first one should always have more than the second,

which should always have more than the third, right?

Our first principal component should be the component that explains the most variance.

So we will have, for each number of components, the amount of variance explained,

so that is covered. Our next step is going to be to find the feature importances.
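The explained-variance loop described above could be sketched like this. The six-column frame is synthetic, so the printed percentages will differ from the 45% and 72% quoted from the notebook; the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Six correlated stand-in features built from two hidden signals plus noise
latent = rng.normal(size=(440, 2))
data = pd.DataFrame(
    latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(440, 6)),
    columns=[f"f{i}" for i in range(6)],
)

pca_list = []
for n in range(1, 6):                       # n = 1, 2, 3, 4, 5
    PCAmod = PCA(n_components=n)
    PCAmod.fit(data)
    # Store n, the fitted model, and the total variance explained by n components
    pca_list.append(pd.Series({"n": n, "model": PCAmod,
                               "var": PCAmod.explained_variance_ratio_.sum()}))

pca_df = pd.concat(pca_list, axis=1).T.set_index("n")
print(pca_df["var"])

# Per-component ratios for the 5-component model: sorted largest first
print(pca_list[-1]["model"].explained_variance_ratio_)
```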

So the first thing that we're going to do here (and let's add this on over here)

is set some weights. The idea of the weights is that we have the breakdown of each of our principal components,

but we want to add more weight to the more important principal components,

so the first one should count more than the second one, and so on and so forth.

So what I'm doing here is taking this explained variance ratio that we output here.

And then, if we're working with two components,

we're setting it as a proportion of one. We're sort of saying:

45% and 27%, we're adding those up, and we're asking, out of one,

what proportion is 45 and what proportion is 27? So just to look at what that means, you see

we take that original amount, with 45 and 27, and we just divide by the total of 45 plus 27.

So we see that the weights are 0.62 for that first component and 0.38 for that second component,

and we're going to weight our components according to how important these different principal components are.

So this will become clear in just a second. The next thing that we're going to look at is this PCA components_ attribute.

What's important here about the PCA components is that this is going to be the breakdown of

how each one of the components is actually composed. So let's first strip away

everything besides the PCA components.

And we can see here that we have, for the first component,

how each one of the different features that we had (we have six different features)

entered into the linear combination that makes up our first component,

and then the linear combination that makes up our second component. So again,

the idea is: the larger these absolute values are, the more that feature contributed to each component,

and the more important that feature is. So what we had here before

is that we took the absolute value, because we don't care about whether it's positive or negative;

we just care about how much it affected that principal component.

And then we're weighting it according to these weights. If you recall,

the weights reflect how important each one of the principal components is.

So this first row is going to be multiplied by 0.62,

and this second row is going to be multiplied by 0.38, so that we don't put too much weight on the second component.

So we see here that we use 70% of whatever feature this is; this is the fifth feature.

And then we use 70% here in the second principal component, for a different feature.

We want to ensure that these do not get equal weights.

This one should get a higher weight than that one, since it is part of the first principal component.

So that's why we multiply by the weights. And then we can see what the overall contribution is.

Let's just copy and paste that.

And we can see the overall contribution for each one of the different components.

And then we're going to take the sum with axis equals zero,

so that we can see, now that we've weighted each one of them,

how much each one of these different features, with their weights,

went into making up these principal components that we have. So we see here that,

whatever feature it is, the fifth feature was the most important in the first two components, if you add up the weights across the first two components.

We're then going to divide that value down here. So we have the absolute feature values,

and we're going to divide those by the total sum of these values, to ensure that each one of these values is a proportion summing up to one.

So again, these each represent how much weight each one of our original features

played in coming up with our two principal components.

We normalize that to one to see, as a proportion of one,

how much each of these features contributed to coming up with these principal components.

And that's going to be the values that we have here.
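For n equal to two, the weighting steps just described can be sketched as follows. The data is synthetic and the variable names are assumptions; note that this "feature importance" is a heuristic built from the loadings, not something PCA reports itself.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(440, 2))
data = pd.DataFrame(
    latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(440, 6)),
    columns=[f"f{i}" for i in range(6)],
)

n = 2
PCAmod = PCA(n_components=n).fit(data)

# Each component's share of the retained variance, as a column vector (n x 1)
ratio = PCAmod.explained_variance_ratio_
weights = ratio.reshape(-1, 1) / ratio.sum()

# Absolute loadings (n components x 6 features), each row scaled by its weight
overall_contribution = np.abs(PCAmod.components_) * weights

# Sum over components (axis=0), then normalize so the importances add up to 1
abs_feature_values = overall_contribution.sum(axis=0)
feature_importance = pd.Series(abs_feature_values / abs_feature_values.sum(),
                               index=data.columns)
print(weights.ravel())
print(feature_importance)
```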

And then we are going to have a data frame that has the number of components,

and then each one of the different columns, so that we can line those up with each one of these values,

and then, for each one of those different values,

which column it lines up with. Those are going to be our values here.

So I'm going to run this, and the first thing that outputs is the amount of explained variance for each of our different numbers of principal components.

We see the first one covered 45%, the first two covered 72%, then 83%, 92% and 98%,

so once we get to five we've covered 98% of our overall variance.

We're then going to concatenate. If you recall, let's actually look at this feature weight list that we created.

This is going to be a bunch of data frames, so let's just look at the first one.

And we see this is going to be, for the number of components equal to one,

how much each one of these different features contributed to that principal component.

If we set this index to one, we can see, for the first two components,

how much each feature contributed.

We're going to concatenate all these different data frames together so that we have one long data frame.

And then we're going to pivot that, setting the index equal to this n so that we don't have multiple ones and

twos, but a single row for each n. We are also going to set our columns equal to the different features,

and then we can just have our values as the values.

And now we have this data frame that we have here,

where we see, when the number of components is equal to one,

the contribution of each one of these different features;

when the number of components is equal to two, the contribution of each of the features; and so on and so forth.
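Putting the loop, the concatenation, and the pivot together might look like this. Again the data is a synthetic stand-in and the names `feature_weight_list` and `features_df` are assumed spellings of what is read out in the video.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(440, 2))
data = pd.DataFrame(
    latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(440, 6)),
    columns=[f"f{i}" for i in range(6)],
)

feature_weight_list = []
for n in range(1, 6):
    PCAmod = PCA(n_components=n).fit(data)
    ratio = PCAmod.explained_variance_ratio_
    weights = ratio.reshape(-1, 1) / ratio.sum()
    abs_feature_values = (np.abs(PCAmod.components_) * weights).sum(axis=0)
    # One small long-format frame per n: (n, feature name, normalized importance)
    feature_weight_list.append(pd.DataFrame({
        "n": n,
        "features": data.columns,
        "values": abs_feature_values / abs_feature_values.sum(),
    }))

# Stack the frames, then pivot so each row is one n and each column one feature
features_df = (pd.concat(feature_weight_list)
                 .pivot(index="n", columns="features", values="values"))
print(features_df)
```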

Now we're going to plot the overall variance, just using a bar plot.

So this is plotting what we had up here, that pca_df, which is just that overall variance.

And we just set our x label, our y label, and our title.

And we see how much of the overall variance was explained as we add on each one of these different principal components.

And then finally, we plot features_df. We're going to see,

for each of the different numbers of

dimensions that we're working with, how much each one of the different features contributes to all of our principal components. So we see here that detergents paper at first explained most of the variance; it was the most important feature.

It tends to balance out as we increase the number of components.

Now that closes out our section here on part 4, showing you how to use PCA,

see the overall explained variance, and get a hint at the actual feature importances as we create each one of our different principal components.

In the next section, we will discuss how we can actually use grid search to fine-tune our PCA model,

especially when working with kernels. All right, I'll see you there.

033: Image Processing.zh_en -BV1eu4m1F7oz_p33-

In this video, we will use an example to see how PCA can be used to reduce the feature space of an actual image in practice.

Now, the learning goal for this section is just to show how dimensionality reduction can be used in a real-world application.

And with that, we'll bring together an example, using dimensionality reduction to take one image and compress it down to a smaller number of features, and see what that compressed image actually looks like when doing PCA.

Now, we're going to walk through how we can use dimensionality reduction in real-life practice.

So frequently, we want to use dimensionality reduction when we end up with a lot of different features,

when we have high-dimensional data. This can happen often with text data. Features there are usually going to be word-existence flags or word counts per document,

and as we saw with the non-negative matrix factorization notebook we just went through,

this can end up creating quite a lot of features very, very fast, and thus a lot of dimensions.

So you often want to use it when working with NLP. Or, as we see here,

if we're working with images, especially if we're working, say, with colored images,

the features can be the brightness values for R, G and B,

so the brightness of each one of those different colors per pixel.

That means we can end up with quite a lot of features,

on the order of the number of pixels that are present within our image. Here,

we're working with black and white, so it'll just be the brightness of each one of the pixels, without the RGB values, but we can still end up with quite a lot of pixels.


So in this example, we're going to see how PCA can be used for image compression.

We're going to reduce this image's dimensions, but hopefully retain most of the image.

So to see this image as a data set, we put a grid on top of this image,

where each square is going to be a 12 by 12 pixel section.

So each one of these different squares will have 144 pixels,

and each one of those squares will represent a single observation within the full data set of this image.

Something to note is that this grid is just for visual representation; in our example,

we would imagine that there are more squares than what we see here.

So each square, again, is a single observation that is 12 by 12, so a total of 144 pixels.

This is going to be a black and white image, which means that every pixel contains only one numeric value, indicating the brightness of that pixel.

And putting those 144 pixels side by side, we end up with just one row vector.

So we see here we take that 12 by 12 and we unravel it to have 144 different features

for each one of our different squares. Each row in our data set will be one of the individual squares in our original image.

We can then perform PCA on all of our data points.

So we see here again that we have each one of our different rows representing a single square.

So we end up with a matrix that's the size of the number of squares times 144,

which is the number of features we now have. We can apply PCA to this matrix to try to reduce the current dimensionality, so that we end up with a new matrix that still has the same number of rows,

which is going to match up with the number of squares,

times M, where M is going to be some value less than 144.

And those new columns will be projections onto special combinations of those original features, the principal components, which describe the most variance.
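Building that "squares by 144" matrix can be sketched with a NumPy reshape. The random array below is a stand-in for the grayscale image (the butterfly picture is not available here), and the image size and M = 16 are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
image = rng.random((240, 360))          # stand-in grayscale image, one brightness per pixel

p = 12                                  # side length of each square
h, w = image.shape

# Cut into non-overlapping 12 x 12 squares, then unravel each square into one row
patches = (image.reshape(h // p, p, w // p, p)
                .swapaxes(1, 2)
                .reshape(-1, p * p))
print(patches.shape)                    # (600, 144): 600 squares, 144 features each

m = 16                                  # target number of dimensions, m < 144
patches_reduced = PCA(n_components=m).fit_transform(patches)
print(patches_reduced.shape)            # (600, 16)
```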

So to see this in action, we see here a reduction from 144 down to 60 dimensions. So each square,

rather than being represented by those 144 different values, is now represented by these 60 different values.

We can still see quite a clear picture of our original image. We then reduce down to 16 dimensions,

and we still don't lose much from the original image, in terms of visually looking at one next to the other.

So after PCA, you get these top 16 components, and these will be the 16 most important principal components.

And every original 12 by 12 square in this image is now some linear combination of these 16 components that we have here,

once we reduce down to 16 dimensions using PCA.

We can reduce this further, down to just four dimensions.

So here we're reducing the dimensionality severely.

But since we're keeping the four most important principal components,

the image is still somewhat recognizable. And here we have the L2 error between the original image and the compressed image at various levels of dimensionality,

where we're just seeing the distance between what that original image looked like and the values that we're working with now in the compressed version.

And we can see that, for quite some time, we don't have that high of a relative error as we continue to reduce the number of dimensions.
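A rough sketch of that error curve: compress the squares to k dimensions, map them back with inverse_transform, and measure the relative L2 error. The smooth synthetic image and the exact error definition (norm of the difference divided by norm of the original) are assumptions, so the numbers will not match the slide.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Smooth synthetic stand-in image with a little noise
yy, xx = np.mgrid[0:240, 0:360]
image = np.sin(xx / 15.0) * np.cos(yy / 25.0) + 0.05 * rng.normal(size=(240, 360))

p = 12
h, w = image.shape
patches = image.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p * p)

errors = {}
for k in (60, 16, 4, 1):
    pca = PCA(n_components=k, svd_solver="full").fit(patches)
    # Compress to k numbers per square, then reconstruct all 144 pixels
    restored = pca.inverse_transform(pca.transform(patches))
    errors[k] = np.linalg.norm(patches - restored) / np.linalg.norm(patches)

for k, err in errors.items():
    print(k, round(err, 4))
```

To view a reconstruction as an image, the rows of `restored` can be folded back with the inverse of the reshape used to build `patches`.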

Now we see here just the top four principal components. And again,

from that original 144 we are able to create combinations that come up with these four principal components.

And something to note, as we recall from working with PCA in the PCA notebook,

is that these are going to be the top four of our top 16, or even of our original 144 components.

So reducing to 16 and then selecting the top four is the same as just reducing down to our top four.

So no matter what, we always have the most important principal component first, the second one

second, and so on and so forth.
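That nesting property is easy to check in code. This sketch assumes random stand-in patches and fixes svd_solver="full" so both fits use the same exact decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
patches = rng.random((600, 144))        # stand-in for the squares-by-144 matrix

pca16 = PCA(n_components=16, svd_solver="full").fit(patches)
pca4 = PCA(n_components=4, svd_solver="full").fit(patches)

# The first four rows of the 16-component model are the 4-component model
print(np.allclose(pca16.components_[:4], pca4.components_))    # True
```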

And then here we can see what that image actually looks like, reduced down to one dimension.

So you see that now we're only working with one dimension, and each one of our different squares is just a different weight applied to that single component.

And we can still see somewhat of a fuzzy image here.

So we can see here how PCA is actually compressing our original image, and the amount of data that we have to store in order to represent that image.

Now, just to quickly recap: in this section, we discussed the applications of dimensionality reduction in the real world,

using the example of working with that butterfly image,

using PCA to reduce the number of dimensions and showing how we didn't lose much from that original image when we reduced the number of features.

Now that closes out our section here on unsupervised learning， and it was a pleasure teaching you。

Thank you。

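034: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles p34 33_Kernel PCA and Multidimensional Scaling.zh_en -BV1eu4m1F7oz_p34-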

Now let's move beyond linearity to working with nonlinear transformations。

So with what we've talked about so far, principal component analysis and singular value decomposition,

everything that we were working with was a linear transformation,

so we were using linear transformations to map our original data set to a lower dimension.

Now, data in general can very often have nonlinear features.

And when we work with nonlinear features and try to perform PCA,

this can cause our dimensionality reduction to ultimately fail.

So here we have this example data set。And we can see here we're doing a mapping from two dimensions to two principal components。

So it will end up not changing the dimension of the space, but in general,

as we try to map from higher dimensions to lower dimensions and we have nonlinear features,

we won't be able to maintain that variance while reducing the number of dimensions as we've done so far with linear PCA.

So if you recall our discussion of support vector machines,

there are kernel functions which we can use to apply nonlinear transformations to our data.

Now, if you did think back to support vector machines,

what probably came to mind is that with kernel functions

we're mapping up to a higher dimensional space, whereas the goal here is to map to a lower dimensional space.

But the key is that when you use these kernel functions and map to a higher dimensional space,

you're able to uncover nonlinear structures within your data set and then map down in a linear fashion, similar to how with support vector machines you're then able to come up with a linear boundary.

Once you map up to those higher dimensions, you can use linear PCA in order to actually come up with fewer dimensions.

So here we see, from that original space that we saw earlier, that using the kernel PCA projection

we're able to come up with a linearly separable space, so we're able to adjust the space.

Now, in the figure here on the left,

we're applying PCA directly and we see this curvature in our data,

and we wouldn't be able to maintain the total amount of variance if we just directly applied linear PCA.

So instead, we apply this kernel, which will map our data to a linear space,

and then we can reduce it down to a lower number of dimensions without losing the information that we would lose by squashing our data down onto that original linear projection.

So how do we actually perform kernel PCA using Python?

As usual, we're going to import the class containing the dimensionality reduction method.

Once we import KernelPCA from sklearn.decomposition, we then instantiate our class,

and we're going to specify the number of components we want and what type of kernel we want to use.

There are different kernels available, as there were with support vector machines.

We also choose the gamma, and if you recall, the gamma determines how curvy or how complex you want the transformation to be with regard to the nonlinearity of that original data set.

And then, same as working with just PCA, we can call .fit_transform on that object with our data set, and we have our transformed data set using the kernel PCA.
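A minimal sketch of that syntax follows. The concentric-circles data from `make_circles` and the value `gamma=1.0` are assumptions chosen for illustration, not the lecture's data or settings.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Hypothetical nonlinear data: two concentric circles in two dimensions
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Number of components, kernel type, and gamma are the main choices
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)
```

Other kernels such as `"poly"` or `"sigmoid"` can be swapped in through the same `kernel` argument.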

Now let's talk briefly about manifold learning, which is another class of nonlinear dimensionality reduction.

What we are working with here is multidimensional scaling, or MDS. Now, MDS,

unlike PCA, will not strive to preserve the variance within the data. Recall that with PCA

the goal is to maintain as much of the variance within the original data as possible. With MDS, instead,

the goal is to maintain the geometric distances between each one of the different points.

So the figure on the left is supposed to be a sphere in three dimensions. And under MDS

it's mapped to a disk, and the distances between each of the points in three dimensions are maintained as closely as possible as we move down to these two dimensions.

Now, in order to run MDS within Python, we are going to import the class containing the dimensionality reduction method,

so this time from sklearn.manifold we import MDS. We create an instance of the class, passing in the number of components that we ultimately want.

And again, we just take that MDS object and call fit_transform on our data set.

And then we will have X_mds as our transformed data set, which now has only two columns, or two features.
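Here is a small sketch of that call, assuming scikit-learn's `MDS` class (which lives in `sklearn.manifold`) and hypothetical points sampled on a unit sphere to mimic the figure:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Hypothetical data: 100 points on the surface of a unit sphere in 3 dimensions
pts = rng.normal(size=(100, 3))
X = pts / np.linalg.norm(pts, axis=1, keepdims=True)

# Ask for two output dimensions; random_state fixes the random initialization
mds = MDS(n_components=2, random_state=0)
X_mds = mds.fit_transform(X)

print(X_mds.shape)
```

Unlike PCA, `MDS` has no separate `transform` for new data, so `fit_transform` is the call to use.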

Now, other popular manifold learning methods exist, such as Isomap,

which uses nearest neighbors and tries to maintain the nearest neighbor ordering, or t-SNE,

which tries to keep similar points closer together and dissimilar points further apart and can be very good for visualization.

And there are going to be several ways to do decomposition, and generally I would say try a few out.

A good approach would be to try those out and then, if you're able to move down to two or three dimensions, use EDA and visualization to see how well you were able to come up with clusters or maintain the amount of variance that was originally there.

Now that closes out our discussion here in regards to principal component analysis

as well as the different types of manifold learning. In the next lesson,

we're going to go through a demo of using PCA in practice. All right, I'll see you in the notebook.

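035: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles p35 34_Dimensionality Reduction Notebook Part 3.zh_en -BV1eu4m1F7oz_p35-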

Welcome back for part 5 of our notebook here. Here,

we're going to introduce kernel PCA, or PCA working with a kernel,

where we're going to use what we discussed in lecture so that we can come up with a nonlinear combination rather than the linear PCA,

to find where the high variance is by mapping up to higher dimensions

and capture that curvature in those lower dimensions.

Now, note that we're choosing here that our kernel is equal to RBF.

We can also search through different kernels, and I suggest you look at the documentation as well.

But we can also search through different gammas when we're working with RBF,

and that'll tell you essentially how complex that boundary is going to be, or how curvy the line that you can project onto will actually be.

So we're going to search through different gammas, and we're going to use grid search. When we use grid search,

what we're trying to do is find the best model. When we do this with supervised learning,

this is clear: we can use a scoring method such as mean squared error, or accuracy, or whatever other classification score you want to use, and optimize on that score.

Now when we're using unsupervised learning, it's not quite as clear how we can end up scoring which one of these different models performed better.

But we do need to come up with some type of scoring option in order to decide which gamma or, if we wanted to search through different kernels,

which kernel worked the best.

So what we're going to do here is introduce a custom scoring method.

So you'll see here that we define a score, and we'll walk through what that score is,

but essentially what we're going to do is take a model, fit a PCA model to our data,

and then take the inverse of that, and then see how far away the inverse of that PCA model is from our original data.

The lower that value is, the better we did. So let's walk through that here.

So first we're going to import KernelPCA rather than just PCA.

We're going to import GridSearchCV, as we'll be using that in order to find the optimal hyperparameters for our kernel PCA.

And then you'll see in just a second how we're going to incorporate mean squared error in regards to coming up with the best version of our kernel PCA.

So the first thing that we're going to do is define a score. We're going to pass into that score

the PCA model as well as our X. And there's going to be no y here; it's just going to be that X,

right? We're using unlabeled data. There's no label that we're attributing to this.

All we're doing here with this try and except is ensuring that we are working with a NumPy array rather than a pandas data frame.

So if X is a pandas data frame, we call .values

and we're working with the array.

If it's already an array, then it'll just set X_val to that array.

We're then going to take the PCA model that we passed into the score,

and we're going to call it on X_val: we fit and transform our data to get our new version with

however many components we're passing through, one component, two components, and so on,

as well as whatever kernel and whatever gamma we're using,

specific to what this PCA model is.

We're then going to take the output of that and pass it into the model's inverse_transform function to get the inverse,

which should undo what we did, but it can't perfectly undo it because we lost some information when we did that original dimensionality reduction.

So it'll take the inverse, and that will be our new inverse-transformed data.

And then what we're going to do is take the original data that we had

and see how far off that is from the inverse transform that we just did. In order to do that,

we'll just take the mean squared error. Now when we do a score,

we want to get the highest value possible, whereas with mean squared error

obviously we want to minimize it,

so we're just going to multiply it by negative one so that we can optimize by getting the highest value.
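The scoring function described above could look roughly like the following sketch. The name `scorer`, the variable names, and the random test data are assumptions for illustration; the notebook's own code may differ in detail.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

def scorer(pcamodel, X, y=None):
    """Negative reconstruction error, so that a higher score is better."""
    try:
        X_val = X.values          # pandas DataFrame -> NumPy array
    except AttributeError:
        X_val = X                 # already an array

    # Reduce the dimensions, then map back to the original space
    data_reduced = pcamodel.fit_transform(X_val)
    data_inv = pcamodel.inverse_transform(data_reduced)

    # Mean squared error between the original and the reconstruction
    mse = mean_squared_error(data_inv.ravel(), X_val.ravel())
    return -1 * mse

# Hypothetical data, passed once as a DataFrame and once as an array
rng = np.random.default_rng(0)
X_df = pd.DataFrame(rng.normal(size=(80, 5)))
model = KernelPCA(n_components=2, kernel="rbf", gamma=0.5,
                  fit_inverse_transform=True)

score_from_df = scorer(model, X_df)
score_from_array = scorer(model, X_df.values)
print(score_from_df, score_from_array)
```

The unused `y=None` argument keeps the signature compatible with how scikit-learn calls a custom scoring function.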

And that's going to be our scoring function. From there,

it should be as simple as any other grid search that we've worked with in the past.

You're going to set your parameter grid as a dictionary: it's going to contain gamma, where we'll loop through different gamma values,

and the number of components, where we'll loop through different numbers of components.

Now, I'll let you know, generally speaking, the higher the number of components,

the better this transform and inverse transform will work.

But this will allow us to home in on the right level of gamma.

We're then going to use GridSearchCV. We're going to say that we want to pass in the KernelPCA,

and the things that we don't want to search over, but want to keep the same through every single loop, are that the kernel is equal to RBF

and that we want it to fit the inverse transform. If we don't set this when we create the kernel PCA,

then we won't have the option to call the inverse transform that we called up here in our scoring function.

So we say fit_inverse_transform equals True. We can then pass in our parameter grid that we defined up here.

And then we can pass in the score that we just created.

We say n_jobs equal to -1, just to say we want to parallelize as much as possible,

and then using this kernel PCA grid search that we're defining here,

we can call fit on the data and get our best estimator to see which one of these gammas performed the best.
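Putting the pieces together, here is a hedged sketch of the grid search. The data, the grid values, and the compact `scorer` are assumptions for illustration, and `n_jobs` is left at its default here for simplicity, whereas the lecture sets it to -1.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

def scorer(pcamodel, X, y=None):
    """Negative reconstruction MSE, so that higher is better."""
    X_val = np.asarray(X)
    X_reduced = pcamodel.fit_transform(X_val)
    X_back = pcamodel.inverse_transform(X_reduced)
    return -mean_squared_error(X_val.ravel(), X_back.ravel())

# Hypothetical unlabeled data: 100 rows, 6 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Values to search over (illustrative, not the notebook's exact grid)
param_grid = {"gamma": [0.01, 0.1, 0.5, 1.0],
              "n_components": [2, 3, 4]}

kernel_pca = GridSearchCV(
    KernelPCA(kernel="rbf", fit_inverse_transform=True),  # fixed settings
    param_grid=param_grid,
    scoring=scorer,
)
kernel_pca.fit(X)   # no y: this is unsupervised

best = kernel_pca.best_params_
print(best)
```

After fitting, `kernel_pca.best_estimator_` holds the `KernelPCA` refit with the winning gamma and number of components.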

So I'll run that, and that will take just a second to run. So I'm going to pause the video. Oh,

there it is, never mind. And we see here that for our gamma value, 0.5 was the best

option in regards to that transform to inverse transform. And we see that the number of

components is equal to 4, which is the max value, which is what I said:

usually when you are looping through the number of components,

the max value will be the one chosen.

But now we can see that we should probably use gamma equals 0.5 when choosing our gamma for our kernel PCA.

Now, for part 6, we're going to show you how you can build PCA into your modeling pipeline in order to perhaps make your logistic regression work better on the data that you have.

So we're going to be loading in this very large data set,

which is the human activity recognition using smartphones data. We've seen this before.

It has tons of different columns. We can look at the shape here and see that

it is 10,299 rows and 562 different columns, so we're going to try to reduce that number of columns.

So what we're going to do is first import the different libraries needed:

our Pipeline, StandardScaler, and StratifiedShuffleSplit, to keep the same ratio of each one of our different outcome values.

We're now using logistic regression, and we can pull in our accuracy score since we're doing a classification problem here.

X is going to be all columns except for Activity. y is going to be the Activity.

And then we're going to instantiate our stratified shuffle split, and we'll call this in just a bit when we want to get our average score.

Now, this get average score is just going to be a function that does all the steps in the pipeline: standard scaling,

PCA, and then logistic regression. All we're going to change on each call is the number of components.

So we set this pipe equal to this list and we pass it into our Pipeline as we've done before.

So we've created our pipeline, but haven't fit anything yet.

We have our scores set to an empty list. Then we use the sss that we created here,

that stratified shuffle split.

We're going to get five different splits, since we set the number of splits equal to five.

And for each of those, we'll get a new X train and a new X test,

as well as a new y train and a new y test.

And we can call pipe, that being the pipeline we created here,

dot fit on our X train and y train. We do that five different times, and each time we're also going to get the accuracy score on the test set.

So once it's fit on the training set, we can see the actual score on the test set.

We'll have five different scores, and then we'll output the average of those five different scores.

We're going to set the list of ns from 10 up to 500, and recall our original data had 562 columns.

We're going to see, if we reduce the number of dimensions,

whether there is a point where perhaps we don't need all of the features, or even perhaps some improvement with lower dimensions.

So we're going to get our score list by running this get average score function that we defined up here on each n in this list of ns that we have here.
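As a sketch of that function and loop, assuming synthetic classification data from `make_classification` in place of the human activity data set, and with names such as `get_avg_score` and `ns` chosen to mirror the narration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 400 rows, 40 features, 3 classes
X, y = make_classification(n_samples=400, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)

# Five stratified splits, reused for every value of n
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

def get_avg_score(n):
    """Average test accuracy of scale -> PCA(n) -> logistic regression."""
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n)),
        ("estimator", LogisticRegression(max_iter=1000)),
    ])
    scores = []
    for train_idx, test_idx in sss.split(X, y):
        pipe.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], pipe.predict(X[test_idx])))
    return np.mean(scores)

ns = [5, 10, 20, 40]
score_list = [get_avg_score(n) for n in ns]
print(list(zip(ns, score_list)))
```

Plotting `ns` against `score_list` then shows where adding more components stops helping.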

So I run this, and this one will actually take some time. So I'm going to pause the video here,

and I'll see you in just a bit as we touch on the results from running this function. All right,

I'll see you there.

All right, now that has finished running, and it may have taken a couple of minutes.

Let's see what this score list came out as. We run this, and this should be in the same order as the ns that we have here.

And we see that after a certain point, once we get to the 450 to 500 range,

there doesn't seem to be any more improvement from adding more variables and adding more features.

And we can see this with the plot as well, just plotting out the ns versus our score list.

We can see that it really plateaus, and the y-axis isn't even starting at 0 here;

it's starting at 0.84. So we see that adding on all these extra dimensions doesn't really add that much extra value in regards to the logistic regression.

So we could probably shrink this down to even 100 features here,

or 200 features, and still have a pretty high accuracy,

depending on what you're trying to get at, and be able to speed up how long it will take to learn this model.

That closes out our demo here on dimensionality reduction, and I'll see you back at lecture. Thank you.

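036: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles p36 35_Non-Negative Matrix Factorization.zh_en -BV1eu4m1F7oz_p36-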

Now we introduce another way of reducing the number of dimensions,

namely non-negative matrix factorization. Now with non-negative matrix factorization,

we are still going to be decomposing our original matrix.

But this time we're starting with only positive values as input,

so you can think of word counts or pixels in an image as examples of matrices with only positive values.

And then we decompose that original matrix of positive values into two matrices, W and H,

with both also having only positive values, so that's the non-negative part of non-negative matrix factorization.

Now we can think of taking a term-document matrix.

To create a matrix of this sort out of many documents,

you can think of each one of your different observations, or each one of your different rows, as being a specific document,

and each column as being a word. With documents for rows and words for columns,

each one of the values will be the word count or some other measure of the word for that document, depending on how you preprocess your text data.

We can then decompose this into how the terms each make up certain topics, and that's your W here.

And that number of topics will be of your choosing,

similar to the choosing of components when we're doing PCA.

And then the H will be how to combine these new topics together to recreate our original documents.

Now, thinking of images, let's think back to PCA. PCA is highly recommended when you have to transform higher dimensions into lower dimensions

and you are okay with losing the original features in the process as new ones are introduced.

So when we look at the breakdown of the components,

it's going to be difficult to gain any insight into how they all combine to recreate that original image, as each one of these new components is

composed of a hard-to-interpret combination of those original features.

Now with non-negative matrix factorization, since we are only working with positive values

and we can only add those values together (we can't subtract, since everything's positive in both our W and H matrices),

the different components tend to have more of an intuitive feel,

as we'll be adding together the shading of the eyes, the eyebrows, the nose, et cetera,

all together to recreate an image of our face as we see here.

Now, non-negative matrix factorization has proven to be powerful for word and vocabulary recognition,

image processing problems, text mining, transcription processes, and cryptic encoding and decoding,

and it can also handle decomposition of non-interpretable data objects such as video, music,

or images.

So why focus on a decomposition of only positive values? For one,

since non-negative matrix factorization only works with positive values,

it can never undo the application of a latent feature.

There's no cancelling out with negative values. It's only going to be additive.

And thus each included feature must be important, as again, we can't cancel it out down the line.

Also, since it's only positive values, this leads to features that may be interpretable,

as they must all add together to recreate our original data. So, as mentioned,

for something like a data set of different faces, you may have the nose, the ears, et cetera,

and those will add together to recreate the face.

Something to note is that non-negative matrix factorization has the extra constraint of positive values only, so if we end up in that original decomposition with some negative values,

the algorithm will automatically truncate those to 0 and thus may not be able to maintain as much of our original information.

Something else to note is that unlike PCA, there's going to be no constraint of only orthogonal vectors when we're working only with positive values,

so that decomposition can have portions pointing in similar directions in n-dimensional space.

So now let's briefly touch on how non-negative matrix factorization will work with something like natural language processing.

So as input to our non-negative matrix factorization for documents,

you would pass in some type of preprocessed version of each of your documents,

turning words into numeric values. You can either use a count vectorizer for the count of words, or TF-IDF,

which is term frequency-inverse document frequency,

which will give you a value that gives less weight to more common words, such as "a" or "the" or "is", within

the entire range of all of your documents.

We can then tune the number of topics that we ultimately want,

as well as the means of preprocessing our text; we may want to remove certain stop words or frequent terms altogether.

And then our output will be how the different terms relate to the different topics,

and then another matrix telling us how to use those topics to reconstruct our original documents.

Now, in order to actually use NMF within Python, the syntax will be very similar to what we've seen so far with the different decomposition methods,

so from sklearn.decomposition we import NMF. We then create an instance of our class, passing in the appropriate arguments, so we say how many topics,

how many different components, we actually want. And then we say how we want to initialize.

Most of you will initialize as random. But what is important to note is that the method can be sensitive to the type of initialization,

as we have seen with other models, and the results will not necessarily be unique.
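Here is a minimal sketch tying the text preprocessing to the NMF syntax. The six short documents are made up for illustration, and `TfidfVectorizer` with English stop words is just one of the preprocessing options mentioned above.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus: three sports-like and three finance-like documents
docs = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the coach praised the team after the goal",
    "the bank raised interest rates again",
    "shares fell as the bank reported lower profit",
    "investors expect profit and interest rates to rise",
]

# Turn words into non-negative numeric values (documents x terms)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Number of topics and initialization are the main arguments
nmf = NMF(n_components=2, init="random", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(X)   # one row per document, one column per topic
topic_term = nmf.components_       # one row per topic, one column per term

print(doc_topic.shape, topic_term.shape)
```

In scikit-learn's convention, `fit_transform` returns the per-document weights and `components_` holds the per-topic term weights; changing `random_state` can give a different, equally valid factorization.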

So we instantiate our class NMF with a number of components,

and then we can fit the instance and create a transformed version of the data by calling nmf.fit as well as nmf.transform in order to come up with our new data.

Now, just to recap the different approaches that we went through:

dimensionality reduction is going to be common across a wide range of applications,

and we have here some rules of thumb for selecting which approach you'd like to use.

Principal component analysis will be great if you have a linear combination of features

with which you believe you can maintain the amount of original variance,

and your goal is to preserve variance by creating a linear combination of those original features.

Kernel PCA will be similar, except that we assume there is more of a nonlinear relationship,

and we still want to preserve the overall variance within each one of our features.

Multidimensional scaling is like PCA, except that the new transformed features are determined based on preserving distance

rather than maintaining variance as we did with PCA.

So if maintaining distances is more important,

which may be useful if you want to visualize different clusters,

this may be a better approach, and you'd want to use MDS. And then finally, as we just discussed,

there is non-negative matrix factorization, which is useful when you're working with only positive values, such as working with word matrices or working with images.

Now let's recap what we learned here in this section. In this section,

we discussed dimensionality reduction and how we can solve our problem of the curse of dimensionality by coming up with a lower dimensional representation of our original data that maintains the majority of the information important to us in that original data set.

We then discussed principal component analysis, or PCA, and how we can use it to come up with new features created as a linear combination of those original features,

or, if we use kernel PCA, a nonlinear combination of those original features, to maintain as much of the variance from that original data set as possible.

And then finally, we discussed non-negative matrix factorization and how working with only positive values can lead to more intuitive and powerful representations of our original data in lower dimensions.

Now that closes out our lecture on dimensionality reduction,

and from here we're going to move to a demo actually working with non-negative matrix factorization using Python.

All right, I'll see you there.

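037: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles p37 36_Non-Negative Matrix Factorization Notebook Part 1.zh_en -BV1eu4m1F7oz_p37-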

Welcome to our notebook here on non-negative matrix factorization. In this notebook,

we're going to be covering the BBC data set of different articles across five different topics.

The data has been preprocessed so that we have a sparse matrix.

We'll see what that means in just a second. With that, we have bbc.terms,

which is just a list of the words that are used, as well as bbc.docs,

which is just going to be a list of the articles

listed by topic. So at a high level, what we're going to do here is turn our bbc.mtx file into an actual sparse matrix.

It's already in sparse matrix form, as we'll see. But in general,

working with a sparse matrix just means that rather than storing a ton of zeros for many of your columns,

we only specify, for each row and column position that holds a value, that there is a value there and what that value is.

Otherwise, when you have a larger, non-sparse matrix with a lot of zeros,

you can end up eating a lot of memory.
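To get a feel for the memory difference, here is a small sketch assuming SciPy's `coo_matrix`. The 1000 by 1000 matrix with two nonzero entries is made up for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical dense matrix that is almost entirely zeros
dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
dense[7, 0] = 2.0

# The sparse version stores only the nonzero values and their positions
sparse = coo_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.row.nbytes + sparse.col.nbytes
print(sparse.nnz, dense_bytes, sparse_bytes)
```

The dense array pays for every zero, while the sparse one pays only for the two stored entries and their row and column indices.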

We're then going to decompose that sparse matrix using non-negative matrix factorization,

and then use the resulting components of that non-negative matrix factorization to analyze the topics that we end up coming up with.

So the first thing that we want to do is take that bbc.mtx, which is our sparse matrix,

and open that file, so now we have our file available as f. Once we have that f object, we just call readlines, and the output of readlines will be stored in content.

So now we have our content, and just to see what that looks like,

I'm going to run this, and you see that it's going to be a list.

And what we have beyond the first two values is just going to be a sparse matrix representation.

We're going to go through, in part 1 below, what each of these different lines means.

But first, we want to remove each of these first two values.

So we're just going to call content.pop(0) twice, so we remove

the zeroth value, and then we remove the zeroth value again.

So we see that the last one that I removed was this value here from the list.

And now we should only have from here and below if we call content.

Now, in part one, we're going to turn this list, which is currently

a list of strings, into a list of tuples. And that list of tuples will represent a sparse matrix.

So that sparse matrix is going to have, as the first value within each tuple, the word ID.

As the second value within that tuple, we're going to have the article ID.

And the third is going to be the number of times that that particular word shows up in that particular article.

So as an example here, if word 1 appears in article 3 two times, then our element for the list,

that tuple, will be word 1, article 3, showed up two times. Now, in order to create this tuple,

what we do is this somewhat complicated looking list comprehension.

I will break it down very quickly. If we just do c for c in content,

that'll just give us the exact list that we saw before.

So let's just first call split on each string.

And we see that we've now split that string that we originally had into the three separate values.

So we're on our way there. And then all we do from there

is map float over those values, since it's difficult to get an integer directly out of the string 1.0.

We map float over them, and then we take the integer of each float so that we're only working with integers.

So first, we map float over each one of the values. Then we map int over those.

And then we turn that output into a tuple.

And if we look at the output for just those first eight values, we see that we now have a list of tuples,

(1, 1, 1), (1, 7, 2), and so on and so forth, telling us the word, the article,

and then the count of that word within that article.
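The parsing steps above can be sketched as follows. The lines in `content`, including the two header lines, are hypothetical stand-ins for what `readlines` would return on the matrix file.

```python
# Hypothetical lines as they might come back from f.readlines()
content = [
    "%%MatrixMarket matrix coordinate real general\n",  # hypothetical header line
    "3 11 3\n",                                          # hypothetical size line
    "1 1 1.0\n",
    "1 7 2.0\n",
    "1 11 1.0\n",
]

# Drop the first two lines, which are not (word, article, count) entries
content.pop(0)
content.pop(0)

# split -> float -> int -> tuple; int("1.0") would fail, hence float first
sparsemat = [tuple(map(int, map(float, c.split()))) for c in content]

print(sparsemat)  # [(1, 1, 1), (1, 7, 2), (1, 11, 1)]
```

Each tuple reads as (word ID, article ID, count), still using the file's 1-based IDs.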

Now we want to prepare the actual sparse matrix that we're going to be passing into our NMF, into our non-negative matrix factorization.

So we're going to import numpy and pandas, and we're also going to import coo_matrix from scipy.sparse,

which will give us a means of turning our data, the way we currently have it constructed, into a sparse matrix.

So we're going to specify what our rows are going to be. If you look back up here,

the IDs actually start with word 1, article 1.

Just for it to match up with Python indexing, we're going to make it word 0, article 0, and so on and so forth.

So for every single entry, x[1] values are going to be our rows.

We want our rows to be each one of our different documents, so we say x[1] minus 1.

So it's going to be whatever this ID is, minus 1,

for x within the sparse matrix list that we have defined here. And then x[0], recall, is the word ID.

We're going to subtract one from that x[0] value, and that will give us our different columns.

So our different columns will be each one of our different words.

And then the actual values are going to be the number of times that that word shows up.

And when we call coo_matrix, we pass in the values, and with that

we pass the related rows and columns for where those values should actually fall.

So if we have row 1, column 1, it'll plug in whatever that value is.

So for this second one, it'll say row 7, column 1: plug in the value two. So we run this.
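A small sketch of that construction, assuming SciPy's `coo_matrix` and a made-up list of three (word ID, article ID, count) tuples:

```python
from scipy.sparse import coo_matrix

# Hypothetical (word_id, article_id, count) tuples with 1-based IDs
sparsemat = [(1, 1, 1), (1, 7, 2), (2, 3, 4)]

# Shift to 0-based indices: articles become rows, words become columns
rows = [x[1] - 1 for x in sparsemat]
cols = [x[0] - 1 for x in sparsemat]
values = [x[2] for x in sparsemat]

# Pass the values together with their (row, column) positions
coo = coo_matrix((values, (rows, cols)))

dense = coo.toarray()
print(coo.shape)
print(dense[6, 0])  # article 7, word 1
```

When no `shape` is given, the matrix is sized from the largest row and column index that appears.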

And just to make this perfectly clear, we're actually going to recreate an actual pandas data frame from that sparse matrix,

so we know what the matrix that we're doing non-negative matrix factorization on is actually made up of.

So we're going to pull in the actual terms, and these will line up by position: index 0 will be the zeroth term,

index 1 will be the first term, and so on and so forth. So from this flat file, bbc.terms,

we'll call f.readlines again. And then just to access that first value,

which is going to be the actual word, we call c.split on that string.

That'll output a list of strings as before, and we only want the first value.

And that will be our output for words.

And I'll run this, and we see the different words that come out.

And then we'll do the same thing for each one of our document names.

All the code's the same, except we're working with a different flat file,

and we can see all the different document names.

And then I'm going to take that coo, which we initialized here

and which is just going to be a sparse matrix.

We're going to turn that into a NumPy array,

pass that into our data frame, and set our columns equal to those words we pulled out and our index equal to those document names.

So this is going to be the actual original data frame that we're working with.
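As a sketch of that step, with made-up lines standing in for the contents of the terms and docs files and a tiny hand-built sparse matrix:

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical lines as read from the terms and docs flat files
term_lines = ["ad\n", "sale\n", "profit\n"]
doc_lines = ["business.001\n", "business.002\n"]

# Keep the first whitespace-separated token of each line
words = [c.split()[0] for c in term_lines]
docs = [c.split()[0] for c in doc_lines]

# Hypothetical sparse counts: 2 documents (rows) by 3 words (columns)
coo = coo_matrix(([1, 5, 10, 3], ([0, 0, 0, 1], [0, 1, 2, 1])), shape=(2, 3))

# Densify only for inspection; words label the columns, documents the index
df = pd.DataFrame(coo.toarray(), columns=words, index=docs)
print(df)
```

On the full data set the dense frame is mostly zeros, which is why the factorization itself is run on the sparse matrix.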

This is going to be article 1, business 001. And we see that the word ad showed up once,

the word sales showed up five times, profit 10 times, and so on and so forth.

And you see the reason why we'd want a sparse matrix: we'd have all of these zeros for almost every single one of these different articles,

because we need a separate column for every single word that showed up in any single one of the articles,

which is why we generally work with sparse matrices when we're doing natural language processing.

So now we have a data frame that we want to work with,

and the next step will be to decompose our matrix using non-negative matrix factorization.

We'll save that for the next video, and I look forward to seeing you there.

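038: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles p38 37_Non-Negative Matrix Factorization Notebook Part 2.zh_en -BV1eu4m1F7oz_p38-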

Welcome back to our notebook In this video， we're going to actually conduct non negative matrix factorization。

If you recall， just before we created our data frame that had each one of our different articles for each row。

and then for each column， we had each one of the different words。

and the values were how often those words showed up in each one of the different articles。

We're going to decompose that into different topics。And we will end up with two matrices。

One will be each one of the words and how much they relate to each topic。

and then the other one will be how to take those topics and recreate those documents that we have。

So in order to do non negative matrix factorization。

we're going to have to define how many components we want。

we're going to set the number of components equal to five。

which is the number of topics that we actually had in the original documents。

And this will allow us to later on compare to see how related the new topics are to the actual topics that we had within each one of the different articles。

So we import NMF. We then call NMF and set the number of components.

Our initialization is going to be just random, with a random state of 818. Recall that with non-negative matrix factorization you're not guaranteed to get the same exact solution every single time.

We're going to pass in our sparse matrix, calling model.fit_transform on that sparse matrix,

and that will output this doc_topic,

which is going to be an array of shape 2225 by 5.

2225 is the number of articles that we had,

and we have reshaped that into just the five topics that we now want.

So rather than having 2225 by the number of words in COO, which if we recall was somewhere in the 900s,

we have reduced it down to five topics.
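The fitting step can be sketched with scikit-learn's `NMF` as below. The settings (five components, random initialization, random state 818) follow the video; the toy count matrix and `max_iter=500` are assumptions made so the example is self-contained.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Hypothetical toy term counts: 20 "documents" by 12 "words".
rng = np.random.RandomState(0)
X = csr_matrix(rng.poisson(1.0, size=(20, 12)).astype(float))

# Five topics, random initialization, fixed seed for repeatability.
model = NMF(n_components=5, init="random", random_state=818, max_iter=500)

# One row per document, one column per topic.
doc_topic = model.fit_transform(X)

print(doc_topic.shape)          # (20, 5)
print(model.components_.shape)  # (5, 12)
```

Changing `random_state` will generally give a different, equally valid factorization, which is why the seed is fixed.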

Now we want to look at the components of this model。

And when we look at the components, all that is going to be is the different words

and how they make up each one of the different topics that we now have.

So we're going to create a new data frame, which we're going to call topic_word here.

And we're going to pass in model.components_; that's going to be this output here, which again

is going to be the weighting of each one of the words for each particular topic.

We want the index to be topic one through topic five,

and the columns are going to be those words that we pulled in earlier.

And when we look at this, we can see,

for each one of the different topics, how much each one of the different words contributes to that particular topic.
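A small sketch of wrapping the fitted components in a labeled data frame. The six-word vocabulary, the random counts, and the choice of three topics are assumptions to keep the example short; the notebook itself uses five topics and the real vocabulary.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Hypothetical vocabulary and toy counts (30 documents by 6 words).
words = ["party", "election", "game", "play", "profit", "sales"]
rng = np.random.RandomState(1)
X = rng.poisson(2.0, size=(30, len(words))).astype(float)

model = NMF(n_components=3, init="random", random_state=818, max_iter=500).fit(X)

# Rows are topics, columns are words, values are the learned weights.
n_topics = model.components_.shape[0]
topic_word = pd.DataFrame(
    model.components_.round(3),
    index=[f"topic_{i + 1}" for i in range(n_topics)],
    columns=words,
)
print(topic_word)
```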

Now, just to make further sense of how this relates the topics and the words,

as well as the articles and the words, recall that the original data had five topics: business,

entertainment, politics, sports and tech.

Now I'm going to do topics per doc, and we're going to again pass in the

actual values of that doc_topic that we computed earlier.

We're then going to set our index. If we recall what the docs actually look like,

these are the different articles.

It's going to be each one of these different values: business.001, business.002, and so on.

That first word before the dot is going to tell us which topic we're working with.

So we're just going to call i.split on that dot,

and then we're just going to take the first value, so we'll have all business,

or later on all entertainment, and so on and so forth.

And then our columns will be topic one, topic two, and so on and so forth.

And when we look at this, we can see,

for each one of those original documents, which topic it most relates to.

So you see, business seems to relate most to topic 2,

and we see that repeatedly for every one of the different business articles. And then for tech,

we see topic four, I believe.

We'll see in just a second. What we're going to do here is reset the index so that this index is its own column.

We're then going to group by that index and get the average value for each one of the topics.

And when we do that, we can see that for topic one the max value is politics, topic two was business,

topic three was sports, and so on and so forth. Just to make this clear, let's quickly

show you what that matrix looks like. This is the matrix,

and we're just seeing which one of these has the highest values.
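The relabel, group, and average steps can be sketched with pandas like this. The document names and weights are invented for illustration and stand in for the real NMF output.

```python
import pandas as pd

# Hypothetical document names and document-topic weights.
doc_names = ["business.001", "business.002", "politics.001", "politics.002", "sport.001"]
doc_topic = [
    [0.1, 0.9, 0.0],
    [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.0],
    [0.9, 0.0, 0.1],
    [0.0, 0.1, 0.6],
]

topics_per_doc = pd.DataFrame(
    doc_topic,
    index=[name.split(".")[0] for name in doc_names],  # keep only the label before the dot
    columns=["topic_1", "topic_2", "topic_3"],
)

# Move the label into its own column, then average each topic per label.
mean_by_label = topics_per_doc.reset_index().groupby("index").mean()
print(mean_by_label)

# For each topic, the label with the highest average weight.
print(mean_by_label.idxmax())
```

With these made-up numbers, `topic_1` maps to politics, `topic_2` to business, and `topic_3` to sport.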

And that's how we end up with these different groups. And then to make this perfectly clear:

we see that topic 1 should, for example, relate to politics. So if we take our topic_word

that we saw up here, we're going to transpose it so that each one of the different topics is going to be a column.

And then we will sort by topic one first, with the highest values on top, so ascending equals False,

and we see party, labor, government, elects, Blair; these all tend to relate highly to politics,

which makes sense.

And if we did topic three, which was sport, we can see game, play, and so on and so forth,

and you can play around with this for each one of the different topics.
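Here is a short sketch of the transpose-and-sort step. The tiny `topic_word` table is hypothetical; only the pandas calls mirror what the video describes.

```python
import pandas as pd

# Hypothetical topic-word weights: rows are topics, columns are words.
topic_word = pd.DataFrame(
    [[0.9, 0.7, 0.0, 0.1],
     [0.0, 0.1, 0.8, 0.6]],
    index=["topic_1", "topic_2"],
    columns=["party", "government", "game", "play"],
)

# Transpose so topics become columns, then put the heaviest words on top.
top_words = topic_word.T.sort_values(by="topic_1", ascending=False)
print(top_words.head(2))
```

Swapping `by="topic_1"` for another topic column shows the top words for that topic instead.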

So we see that with this unsupervised model, if perhaps you don't actually have your topics available,

you can come up with these types of clusters using non-negative matrix factorization.

That closes out our video here and our notebook on non negative matrix factorization。

And I'll see you back in lecture. All right. Thank you.

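039: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p39 0_Course Introduction.zh_en -BV1eu4m1F7oz_p39-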

Hi, my name is Miguel and I am one of your instructors in our course for deep learning and reinforcement learning.

Deep learning is a very exciting topic because it powers most of our favorite AI applications.

Anything from self-driving cars to computer vision and speech-to-text recognition is using some shape or form of deep learning.

And it's really going to help you in all your classification tasks and even in unsupervised learning applications.

You will first start learning about neural networks: what they are, how they work, and best practices.

Then you will learn some deep neural network applications like recurrent neural networks and convolutional neural networks.

And you will wrap up learning some more modern architectures like

generative adversarial networks, or GANs, and reinforcement learning,

which is one of the bigger promises

of machine learning and artificial intelligence. Even if it's very computationally and data intensive,

it holds big promise and it might be what the future holds for AI.

Of all the IBM professional certificates and specializations,

this course is one of the most advanced and complex,

so make sure that you take enough breaks, and if you need any help please don't hesitate to reach out to your instructors and peers.

We are here to help one another and we will go through this together.

Another very important part of this course is the final project.

It will really help you highlight your analytical and machine learning skills, so make sure you post your solution online.

It can be on a GitHub page, an online portfolio, or the IBM communities;

we really encourage you to post your solution out there.

And with that， I will see you in the course， thank you。

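040: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p40 1_Introduction to Neural Networks.zh_en -BV1eu4m1F7oz_p40-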

In this set of videos, we will introduce the basic concepts behind working with neural networks.

Now, neural networks and deep learning are behind most of the artificial intelligence that shapes our everyday life today.

Think of all the cool features in our phones, ranging from face recognition to autocorrect to text autocomplete and

voicemail-to-text previews; also the way that we find what we need on the internet using predictive internet searches,

content or product recommendations, and even self-driving cars.

Also， many of the classification and regression problems that you need to solve at your business are going to end up being good candidates for neural networks and deep learning as well。

Now there are several Watson applications and artificial intelligence APIs that help you infuse artificial intelligence into your business。

Here we have some of the most used, with links to live demos that you can explore on your own.

As you go through some of these, think about ways you can use these applications within your business,

whether it's identifying the pieces of an image, coming up with an efficient translation into a foreign language,

summarizing and classifying comments or reviews of your product,

or finding whether those comments have a positive, negative or perhaps neutral tone.

Now it's often noted that the biology of the brain serves as an inspiration for the mathematical models that make up our neural networks。

The idea is that the brain functions by firing neurons along a chain, where one neuron gets signals from prior neurons.

And according to the firings of prior neurons, the next neurons decide whether or not to generate signals.

Those neurons that were activated

then pass signals down that chain to the next neurons. And by layering many neurons together,

we end up creating a very complex model. Now, moving to the actual neural network,

we can think of it as a complicated computation engine.

We're going to train it using our training data, so we train our neural net model.

And then we'll use that trained neural net model to generate predictions using new data.

Note here the similar train-test approach to what we did with supervised learning,

which will become of utmost importance as we create our neural networks.

Now that closes out this video. In the next video we're going to dive into a single one of these cells to see how data flows in and how data flows out of each one, from layer to layer. All right,

I'll see you there.

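041: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p41 2_Neuron Basics.zh_en -BV1eu4m1F7oz_p41-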

Now let's zoom in on a single node in the middle of our basic neural network。

First， that node will get input values from the previous layer wherever that node lies。

Those input values are then going to be combined via weights from each of those different values similar to basic multiple linear regression。

Then that combination of weights is going to be transformed。

similar to how logistic regression transforms a linear combination to squash those values between 0 and 1。

And that transformed value is used as input for the next layer。

Now let's add in some variables to make this process a bit clearer.

So we can have as our three input values， x1， x2， and x3。

And we'll assume also an intercept term with that value equal to one as we do with multiple regression。

We also have the respective weights for each one of our different values, W1, W2 and W3,

as well as B, and our model is going to learn each one of these weights as well as the B.

And as mentioned, we will multiply each value by its weight and add them up together with B,

as we do with linear regression, and end up with some output value Z.

Finally, we're going to use an activation function, like I said,

similar to or even the same as the one in logistic regression,

to transform that output and use that value as input for the next layer.

Now， without this activation function， we are restricted to only linear output or linear combinations of our inputs。

And no matter how many layers deep we go， we are still just working with a linear combination of our features。

It's going to be this activation function that allows for the great flexibility in how a neural network

relates the model outputs to our model inputs.
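The claim that stacked layers without an activation are still just one linear combination can be checked numerically. This is a small sketch with randomly drawn weights; the layer sizes (3 inputs, 4 hidden units, 2 outputs) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)  # three input features

# Two layers of weights and biases, with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Passing x through both layers...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...gives the same result as one linear layer with combined weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```

Inserting a nonlinear function between the two layers breaks this equivalence, which is what lets deeper networks represent more than a linear model.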

Now, some notation that'll be worth getting familiar with as we walk through working with neural networks.

We have z, which is going to be the net input, or the linear combination of the inputs prior to activation,

so essentially the output of just that linear regression.

We'll have our bias term, that b that we just saw,

which is similar to the intercept term within linear regression.

We'll have f, our activation function, that nonlinear function we use to transform z.

And then we have a, our output, or the value once we take f of z,

once we transform z, that we ultimately pass through to our next layer.

Now, with this notation in mind, as well as that basic unit that we just walked through, that basic neuron,

we can see that there is a lot of relation between that neuron and logistic regression.

So when we choose f(z) = 1 / (1 + e^(-z)),

where z is the output of just the linear part of that neuron,

we are actually looking at something very similar to logistic regression.

And z is just going to be equal to that intercept term plus the sum of each one of the different inputs multiplied by their respective weights,

which we've expanded out here.

And our neuron is then simply a unit of logistic regression,

where the weights that we learn are just the coefficients of logistic regression,

the inputs are the different variables that we have here, and the bias term is that constant term.

So it all relates back to our basic logistic regression.

And because logistic regression and our neural network can in a way accomplish the same task if we're doing classification,

we want to ensure, when we move to a neural network, that we actually need a more complex model: that we don't just need this single unit,

but multiple units and perhaps multiple layers.

That's when we do switch over to neural networks.

The trade off being that you may be able to come up with a more complex boundary with neural networks。

But you'll lose a lot of the explanatory value that you have with logistic regression。

So what we have here is going to be the sigmoid function, which we use for logistic regression

as well as for our activation here when we talked about the neuron and the output for the neural network.

And what our sigmoid function will do is take that linear combination and create a nonlinear function;

as we see here, we have nonlinearity, not a straight line. And it squashes those values between 0 and 1,

which will be useful as we walk through the different steps of our neural network.
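A minimal sketch of the sigmoid function and the two properties mentioned here, squashing into the range 0 to 1 and not being a straight line. The grid of z values is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 13)  # evenly spaced net inputs from -6 to 6
a = sigmoid(z)

# Every output lies strictly between 0 and 1.
print(a.min(), a.max())

# Equal steps in z do not give equal steps in a, so the curve is not a line.
print(np.diff(a).round(3))
```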

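042: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p42 3_Neural Networks with sklearn.zh_en -BV1eu4m1F7oz_p42-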

Now, in order to create the multilayer perceptron in practice using Python,

we're going to go over the sklearn version of creating this neural network.

Something to note is that we can make this simple multilayer perceptron using sklearn,

but as we move on to more complex models, you will see that we're going to move away from scikit-learn and start working with a library called Keras.

But for now let's continue to focus on sklearn. So as usual, from sklearn.neural_network we're going to import the MLPClassifier.

We then need to specify our activation function. So we pass in the different arguments when we instantiate this MLPClassifier class.

Some of the arguments that you see here are the hidden layer sizes,

so this will actually be the sizes of each layer between your input and your output.

As we saw before, we input x1, x2 and x3, and then we have a certain number of hidden layers,

and we're saying here the size of each one of those hidden layers.

So the fact that this tuple is of length 2 means that there are two hidden layers, one of size 5 and

one of size 2. If we wanted three and we wanted the third one to be of size 5,

we could pass (5, 2, 5), so that's how the hidden_layer_sizes argument will work.

And then there is the activation function that we want to use.

We've seen so far that we've only used the sigmoid function. By default,

sklearn will actually use the ReLU function, which we'll learn about a bit later,

but because we want to stay in line with what we've discussed so far,

we're going to set the activation equal to logistic here, and logistic is just the same as sigmoid.

We can then, as usual, fit and predict given our data. So we pass into fit

our X train and our y train, and then we can pass into mlp.predict our holdout set,

our X test, and see how well we performed on this holdout set.
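Putting the pieces together, a minimal end-to-end sketch with `MLPClassifier` might look like this. The arguments `hidden_layer_sizes=(5, 2)` and `activation="logistic"` follow the video; the synthetic data set, the train/test split, `max_iter=2000`, and the random seeds are assumptions added so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical data: 300 rows with three features (x1, x2, x3) and a binary label.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two hidden layers (5 units, then 2 units) with sigmoid activations.
mlp = MLPClassifier(
    hidden_layer_sizes=(5, 2), activation="logistic", max_iter=2000, random_state=0
)
mlp.fit(X_train, y_train)

# Predict on the holdout set and report accuracy.
y_pred = mlp.predict(X_test)
print(mlp.score(X_test, y_test))
```

After fitting, `mlp.coefs_` holds one weight matrix per connection between layers: here 3 to 5, 5 to 2, and 2 to the output.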

Now that closes out this video. In the next video we're going to go into some of the common terminology used for the multilayer perceptron,

as well as some intuition behind the basic math that brings this all together. All right,

I'll see you there.

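043: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p43 4_Neurons in Practice.zh_en -BV1eu4m1F7oz_p43-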

So let's zoom back into this single node. When we're working with just a single neuron,

what we have here is a perceptron, and this is the basis upon which all neural networks are built.

Now, note here that we have, as before, our input values x1, x2 and x3, as well as our intercept,

and then our weights and our b that we're going to learn. And in this example,

we're going to be using logistic regression, with that sigmoid activation function that we just discussed.

Now, let's look at actual values, just to make clear how this actually looks in practice.

Imagine that we have a row with feature 1 equal to 0.9,

feature 2 equal to 0.2 and feature 3 equal to 0.3, and that our W1, W2 and W3 are 2,

3 and negative 1, with a b of 0.5. We can then calculate the actual z value,

2(0.9) + 3(0.2) - 1(0.3) + 0.5 = 2.6, which is the input into our activation function.

That activation function is 1 over 1 plus e to the negative of whatever z we calculated,

and we'd end up with a value of 0.93. And that would be the output of this particular node.
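The arithmetic for this single node can be reproduced in a few lines. The inputs, weights, and bias are the ones quoted in the video; the variable names are just illustrative.

```python
import numpy as np

x = np.array([0.9, 0.2, 0.3])   # feature values x1, x2, x3
w = np.array([2.0, 3.0, -1.0])  # weights W1, W2, W3
b = 0.5                         # bias term

# Net input: weighted sum of the inputs plus the bias.
z = float(w @ x + b)

# Sigmoid activation gives the node's output.
a = 1.0 / (1.0 + np.exp(-z))

print(round(z, 2), round(float(a), 2))  # 2.6 0.93
```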

So our node output is 0.93. So why not just use a single neuron?

Why do we need to have a larger network where we have one layer stacked on top of the other?

If we have just a single neuron, as we would if we were just doing logistic regression,

that would only permit a linear decision boundary.

When we move on to stacking one layer on top of the other,

we are able to come up with a much more complex decision boundary,

and most of our real-world problems will probably be much more complicated than the linear decision boundary that we can learn with something like logistic regression or with just one unit.

So in order to take our inputs, pass them through, and get our different outputs as we see here,

we'd be working with a multilayer perceptron: we take that one-unit perceptron and add on more units.

And we see here that we have this feed-forward structure, where we have our inputs x1, x2 and x3.

Those will each be inputs into the next layer. If you look at each one of the arrows,

x1 goes to each one of the different perceptrons in that next layer, as do x2 and x3.

And then that next layer, the second layer, is connected to every value in the third layer,

and so on and so forth until we get our outputs y1, y2 and y3.
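The feed-forward structure described here, where every value in one layer feeds every unit in the next, can be sketched as a loop over weight matrices. The layer sizes (3 inputs, two hidden layers of 4 units, 3 outputs) and the random weights are assumptions for illustration; a trained network would have learned weights instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass x through each fully connected layer in turn."""
    a = x
    for W, b in layers:
        # Each unit takes a weighted sum of every value in the previous layer.
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 3]  # x1..x3 -> hidden -> hidden -> y1..y3 (assumed sizes)
layers = [
    (rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

y = forward(np.array([0.9, 0.2, 0.3]), layers)
print(y.shape)  # (3,)
```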

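044: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p44 5_Neural Networks with sklearn.zh_en -BV1eu4m1F7oz_p44-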