IBM Machine Learning, Unsupervised Learning, Deep Learning, and Reinforcement Learning Notes (Complete)
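001: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p01 0_Course Introduction.zh_en -BV1eu4m1F7oz_p1-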
Hi, my name is Miguel and I am one of your instructors in our course for unsupervised learning.
In this course, you will learn the tools and techniques that help you leverage data that doesn't have an event variable or a label variable.
Companies from around the globe use unsupervised learning to segment their customers, assess the quality of their data, and group similar observations together for their analysis.
In this course, you will learn clustering techniques like K-means, hierarchical clustering, and DBSCAN, as well as dimension reduction techniques like principal components analysis and matrix factorization.
A very important part of our course is the final project. We recommend that you post your solution online, be it on an online portfolio, a personal repository, the IBM online community, or a GitHub page. It will help you highlight your machine learning and analytics skills.
If you need any help, please reach out to your peers and instructors using the discussion boards. We are in this together and we will help one another. And with that, I will see you in the course.
Thank you.
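002: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p02 1_Overview of Unsupervised Learning.zh_en -BV1eu4m1F7oz_p2-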
In this set of videos, we will reintroduce the concept of unsupervised learning and what it entails.
This will serve as the foundation for the remainder of this course.
In the last set of courses, we dove into the algorithms available when we have a known outcome in our data set. In this course, we're going to talk about a whole other class of machine learning algorithms called unsupervised learning.
This class of algorithms is relevant when we don't have outcomes we are trying to predict, but rather are interested in finding structure within our data set, and perhaps want to partition our data into smaller pieces.

Now there can be a couple of use cases for unsupervised learning.
One popular use case is clustering, where we use our unlabeled data to identify an unknown structure. An example of this may be segmenting our customers into different groups.
The other major use case for unsupervised algorithms is dimensionality reduction, namely using structural characteristics to reduce the size of our data set without losing much of the information contained in that original data set.
Now, in regards to clustering, we'll be covering the K-means algorithm, the hierarchical agglomerative clustering algorithm, the DBSCAN algorithm, and the mean shift algorithm.
And then in regards to dimensionality reduction, we'll be covering principal component analysis, or PCA, as well as non-negative matrix factorization. We won't go into these in detail here, but we'll go into each of them in more depth as we get through these videos.

Now, just to give you some intuition as to why dimensionality reduction will be important, let's talk about the infamous curse of dimensionality, or infamous for those of us in these circles.
Dimensionality refers to the number of features in our data. Theoretically, and in ideal situations, the more features we have, the better the model should perform, since models have more things to learn from and should therefore be more successful.
However, real life is more complicated than that, and there are several reasons why too many features may end up leading to worse performance in practice.
If you have too many features, several things can go wrong.
Maybe some of those features are spurious correlations, meaning they correlate within your data set but maybe not outside your data set as new data comes in.
Too many features may create more noise than signal, and algorithms find it harder to sort through non-meaningful features when there are too many of them.
And then the number of training examples required will increase exponentially with the dimensionality.
This becomes especially clear when we think about distance-based algorithms such as the K-nearest neighbors that we talked about in our last course.
So if we look here, imagine that we have a survey with 10 possible responses.
For those 10 possible responses, to get 60% coverage we only need six answers. We only need six different people to answer that survey.
If we add on another survey with 10 possible response values, then in order to get that same 60% coverage, so that your K-nearest neighbors are the same distance from whatever new value comes in, we would need 60 people to respond. So we need 60 different rows of data to get the same coverage that we had with just six rows in one dimension.
And then you can imagine once we increase that to three dimensions, with three different surveys, each one with 10 possible positions. In order to get that same coverage, for each neighbor to be as close as it was in that original one dimension with only 10 positions, we would need 600 different rows.
So you see how the more dimensions you add on, the more rows you need, the more data you need to get that same amount of coverage.
Now, on top of that, higher dimensions will often lead to slower performance, as dealing with more columns is going to be more computationally expensive. And it'll also lead to the incidence of outliers increasing as the number of dimensions increases.
So to mitigate some, not all, of the problems I just mentioned, one usually needs a lot of rows to train on, which may not be possible in real life. You may not be able to gather these 600 different examples. And obviously you would often have many more than three dimensions, and would need that many more rows to get a certain amount of coverage.
Therefore, it often becomes necessary to reduce the dimension of one's data set.
So far we have seen feature selection as a way of achieving this, and in this course we'll discuss how we can accomplish the same goal using unsupervised machine learning models such as principal component analysis, or PCA, which we just mentioned.
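As a rough sketch of the survey arithmetic above: if each feature has 10 possible values, the number of distinct combinations is 10 raised to the number of features, so the rows needed for a fixed coverage grow exponentially. The function name `rows_needed` and its parameters are illustrative, not taken from the course material.

```python
def rows_needed(coverage: float, values_per_feature: int, n_features: int) -> int:
    """Rows required to cover a given fraction of all value combinations."""
    total_combinations = values_per_feature ** n_features
    return round(coverage * total_combinations)


# 60% coverage with 10 possible responses per survey, for 1, 2 and 3 surveys
for n_features in (1, 2, 3):
    print(n_features, rows_needed(0.6, 10, n_features))
# prints: 1 6, then 2 60, then 3 600
```

The same call with, say, 10 features would ask for billions of rows, which is the practical motivation for reducing dimensionality.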

Now, to think about this in a real-life example: this curse of dimensionality comes up often in applications.
Consider the customer churn example that we discussed in earlier courses.

The original data set had 54 different columns, so 54 different features.
Some, like age, under 30, or senior citizen, will obviously be very closely related.
Others, such as latitude, are essentially duplicated throughout. And even if we remove duplicates and non-numeric columns, this curse of dimensionality can still apply. We can still have too many columns, even if they are not perfectly correlated.
Now, as for things that we can do with this churn data set: clustering can help identify groups of similar customers without us thinking about whether or not they churn. Maybe that allows us to segment our customers into different groupings.
And then dimensionality reduction can improve both the performance, because reducing the number of features speeds things up, and the interpretability of each of the groupings that we just came up with.

Now, just a high-level overview. When we're working with unsupervised learning, we start off with an unlabeled data set. We then fit a model to that unlabeled data set, depending on the model that we choose, and we get our fitted model.
Once we have that model, we can look at new data, which again is unlabeled. We use the model that we just fit to predict either the groupings for that new data or its reduced-dimension representation, depending on whether we are doing clustering or dimensionality reduction.

So as an example for clustering, say we want to group news articles by topic and we don't have those topics as labels.
Our starting point is text articles of unknown topics. We then create our model, whether that's K-means or any of the other models that we will discuss, in order to see what kind of groupings we naturally find in our data set.
We fit that model to the data set, so that, according to certain features within these articles, such as certain words showing up, we come up with certain groupings.
We then take another group of text articles of unknown topics and use the model that we just fit. Since that model used certain words, certain features, to determine the groupings, we can then assign each new article to the grouping of similar articles from our original data set.
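The fit-then-predict workflow for the news article example might look roughly like the sketch below. The tiny article lists are made up for illustration, and turning text into features with `TfidfVectorizer` is an assumption; the lecture only says that the groupings depend on which words show up.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unlabeled articles (no topic column anywhere)
train_articles = [
    "the team won the football match",
    "the striker scored a goal in the match",
    "the central bank raised interest rates",
    "stock markets fell as interest rates rose",
]
new_articles = ["the goalkeeper saved a penalty in the match"]

# Turn word occurrences into numeric features, then fit the clustering model
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_articles)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# New, still unlabeled articles are assigned to one of the groupings found above
X_new = vectorizer.transform(new_articles)
new_labels = model.predict(X_new)
print(model.labels_, new_labels)
```

Note that the cluster numbers themselves carry no meaning; only which articles share a number matters.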


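003: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p03 2_Clustering Use Cases for Unsupervised Learning.zh_en -BV1eu4m1F7oz_p3-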
Now let's talk about some common use cases out in the real world for clustering.
Clustering is used for classification, for anomaly detection, for customer segmentation, and even for improving supervised learning models.

A common use case to start with is classification, as we mentioned, for data that is not labeled. Even if your data does not have a column that specifies the classes, clustering algorithms will try to find heterogeneous groupings within your data set.
Examples of this for unlabeled data include finding groupings that are different from your normal emails to help you identify spam (again, assume you don't have labels available), or finding subgroups in text like product reviews in order to come up with these different groupings.


Now, another common use case for clustering is anomaly detection.
Imagine that we are working with credit card transactions for a certain user, and we see a small cluster that stands apart from the rest of that user's transactions: one with a high volume of attempts, or perhaps a smaller volume of attempts, or attempts at new merchants.
This would create its own new cluster, and that would present an anomaly within the data set, which may indicate to the credit card company that fraudulent transactions are happening.


Another common use case is customer segmentation. Think of finding, for example, groupings that help you find out how many types of customers your business has based on the recency, the frequency, and the average amount of visits in the last three months.
Clustering takes a combination of each of those features and comes up with different segments.
Another common segmentation is by demographics and level of engagement. For example, you can come up with groups for single customers, new parents, empty nesters, etc., determine each group's preferred marketing channel, and use these insights to drive your future marketing campaigns.

And then a final common use case is to help improve supervised learning.
For example, you can take a good model, say a logistic regression model that you trained on your entire data set, and see how well it performs compared to models trained on subsegments of your data that you found through clustering.
Perhaps you'll be able to improve your performance if you look at each of these groupings and come up with different predictions for each one.
Now, there's no guarantee that this will always work, but it is common practice to segment the data to find these heterogeneous groups and then train a model for each group to help improve that classification.

Now, again, the other type of unsupervised learning that we discussed is dimension reduction.
We will often use this for high-resolution images. We take our high-resolution images and fit our model to them in order to come up with a reduced, more compact version of those images that hopefully still contains most of the information that tells us what each image actually shows.
With that fitted model, we can then take high-resolution images that we haven't seen before and again come up with smaller, reduced versions of those images as well. We use the fitted model to determine what the compressed version of each new image should be.
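A minimal sketch of that image workflow, assuming PCA as the reduction model and using random arrays as stand-ins for flattened images (the shapes and the choice of 16 components are arbitrary illustrations):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_images = rng.random((200, 64))  # 200 synthetic "images", 8x8 pixels flattened
new_images = rng.random((5, 64))      # images the model has not seen before

# Fit the reduction on the original images
pca = PCA(n_components=16).fit(train_images)

# Apply the fitted model to unseen images to get their compact versions
compressed = pca.transform(new_images)        # shape (5, 16)
restored = pca.inverse_transform(compressed)  # approximate reconstruction, shape (5, 64)
print(compressed.shape, restored.shape)
```

On real images, neighboring pixels are strongly correlated, so far fewer components are usually needed than on random noise like this.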

So, common dimension reduction use cases. Image processing will probably be one of the most common use cases for PCA, both for compressing images and, in computer vision, for image tracking, where it reduces the noise down to the primary factors that are relevant in your video capture.
The reduced size of the data set can also greatly speed up your detection algorithms.
Now, with that, we close out our introduction to unsupervised learning. In the next video, we'll begin to home in on the concept of clustering to help prepare us conceptually for our first unsupervised model, the K-means algorithm. All right, I'll see you there.

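004: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p04 3_Introduction to Clustering.zh_en -BV1eu4m1F7oz_p4-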
Now, we're going to start with a simple example in order to introduce this concept of clustering.
In our example, we have customers of a site, with one feature that we can use to segment these customers: the number of visits of that customer.


Now, if we were to use clustering to segment the users of the app into two groups, where do you think we would draw the line?

Probably what you see here would ultimately be the best choice. Visually, it makes a lot of sense.
These are our two clusters, and we're going to explore how this actually works mathematically and algorithmically in just a bit.



And perhaps you find that, for your business objective, you need three clusters, and this is what your three clusters would look like.

Or maybe you need five clusters, and this is what your five clusters would look like.

In this course, you'll learn to use a wide variety of clustering algorithms and how to select the number of clusters that best suits your data.

With this in mind, in the next video we will introduce our first unsupervised learning model, K-means.



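005: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p05 4_The K-Means Algorithm.zh_en -BV1eu4m1F7oz_p5-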
Here we will introduce our first unsupervised machine learning algorithm used for clustering: K-means.
We're going to use a similar example to what we just saw in the last video, but this time we have two features. We have the number of visits to our site, and the recency: how recently that customer came to our store.
Visually, hopefully you can see that there are already two clusters that you can come up with according to the data points that we have.
That answer is obvious to us, but our goal here with K-means is to see how we can come up with it algorithmically.

So the way that K-means works is that, since we are prescribing two clusters, we initialize our algorithm by picking two random points.

These are going to act as the centroids of our clusters.
So we have our cluster in blue and our cluster in pink, each coming from one of these two centroids.

Then, with our centroids initialized, we take each example in our space and determine which cluster it belongs to by computing its distance to each centroid and seeing which one is closer.
Here in the first iteration, the examples are color coded as we see here, and now every point belongs to a cluster.
Thinking back to the clusters that you thought of when you first looked at this data set, we are obviously not done yet, as this assignment is somewhat arbitrary and it hasn't converged. We'll explain what convergence means and what it looks like in just a bit.

The second step is then to move those centroids that we just discussed to the new mean of each cluster.
So the new location of the pink square is right in the middle of all of the pink circles, and the same for the blue.
We move our centroids so that they're in the center of their assigned points.
We're now through the first iteration, and we're going to keep repeating this process until no example is assigned to a different cluster.


So let's see the first step of the second iteration with our new cluster centroids in place.

We again identify which cluster each point belongs to, and we see that some points have changed cluster according to which of the new centroids, the means of the previous clusters, is closer.


And then we move our centroids again, to the new mean of the data points that are now within each of our two groupings.

And then we do this for a third iteration, and we see again that the colors have changed.
Now the cluster centroids don't move anymore, and once we have that, that's the sign of convergence. The algorithm found the visual structure in the data set automatically by continuously iterating, moving each centroid to the mean of the points closest to it until it was not able to move any more.
Those centroids stayed in place, and we have our two clusters.
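The two alternating steps described above (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in plain NumPy. This is a simplified illustration rather than the course's or any library's implementation; the function name `kmeans`, its parameters, and the synthetic two-blob data are assumptions.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: returns (centroids, labels) after convergence or n_iter steps."""
    rng = np.random.default_rng(seed)
    # Initialize by picking k distinct data points at random as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Converged: no example was assigned to a different cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Two synthetic groups of points, standing in for the visits/recency example
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

At convergence every point is labeled with its nearest centroid and every centroid is the mean of its points, which is exactly the "nothing moves anymore" condition from the lecture.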

Now, for three clusters, the clusters can look like this.

However, there can be multiple solutions, such as what we see here.
When we say that there are multiple solutions, we mean that the algorithm has converged and the centroids are not going to move any more.

But we can converge in different places where we will no longer move those centroids.
So the problem with the K-means algorithm is that it's sensitive to the choice of those initial points.
Different initial configurations may yield different results.


So I'll pause here, and in the next video we will discuss how to choose the right model, that is, which of these different converged solutions makes the most sense.



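006: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p06 5_Initializing the K-Means Algorithm.zh_en -BV1eu4m1F7oz_p6-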
Now, as discussed, we may end up with different clusters every time we run this K-means algorithm.
Again, the process is to take three centroids, find the nearest points to each, take the average of the points that are closer to that centroid than to any other centroid, set that average as the new centroid, and then find the closest points to that new centroid.
This movement towards the average, as we keep updating each centroid after every iteration, stops once the centroids no longer move, and this is going to happen at different points depending on where we initialize our centroids.
So we need a way of judging the converged results and ranking them according to goodness.

Now, on top of that, another idea to help us get a better optimization of this K-means algorithm is to initialize it in a smarter way.
Local optima, or just non-optimal solutions, often happen when two cluster centroids are initialized close to each other.
We can therefore make an effort to initialize with points that are far enough away from one another.
So how do we do this? We can start with a random initial point, as we see here.

Then for the second pick, instead of choosing it at random, we prioritize faraway points by assigning each point a probability equal to its squared distance from that initial centroid, divided by the sum of the squared distances of all points from that initial centroid.
So we look at every single point and square its distance from the original centroid. If you look at this formula, we put a lot more weight on points that are far away, because they take up a larger proportion of the total squared distance over all of our points.
So we'll be more likely to end up with a not-so-close point, such as the blue one, as our second cluster centroid.
And then we repeat this process if we want three different clusters.

This time the distance for each point is calculated as the minimum distance from that point to either of the two existing centroids.
So rather than the distance being from just one centroid, it's the minimum over those two centroids, to ensure that we are far away from both of the centroids that we currently have.
And then we can do this one more time, or as many more times as we need, depending on the K that we define. Again, the distance measure is now the minimum distance to all three of our initialized centroids, ensuring that the new point is far away from all three.
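The sampling rule described above, where each point is picked with probability equal to its squared distance to the nearest already-chosen centroid divided by the sum of those squared distances, can be sketched as follows. This is a simplified illustration; the function name `init_far_apart` and the random demo data are assumptions, and library implementations may differ in details.

```python
import numpy as np


def init_far_apart(X, k, seed=0):
    """Pick k initial centroids, favoring points far from those already chosen."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Probability of each point = its squared distance / total squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
init_centroids = init_far_apart(X, k=3)
print(init_centroids)
```

A point that is already a centroid has squared distance zero, so it can never be picked twice.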

This algorithm with this smarter initialization is called K-means++, and it helps avoid getting stuck at these local optima.
It is also the default initialization of K-means in scikit-learn (sklearn), which we will be using later.
So here we've discussed getting a better initialization point. In the next video, we'll talk through picking the correct number of clusters, in terms of how many clusters are actually built into our data set.

All right, I'll see you there.

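007: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p07 6_Choosing the Right Number of Clusters for K-Means.zh_en -BV1eu4m1F7oz_p7-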
Now that we're familiar with how K-means works, let's ask an important question: how do we choose K, how do we choose that number of clusters?
Sometimes there's going to be a specific number of clusters that you know you would like, depending on the specific objective of your clustering task.
An example of this may be that your computer has four cores, so it naturally follows that you're looking for four clusters.

Or the business side of your organization may dictate that there are 10 clusters, for example when trying to determine the different measurements to incorporate into our different sizes.


Or a navigation interface for browsing scientific papers may need to be split into 20 disciplines specifically, so you set K equal to 20.


On the other hand, there are going to be times when the number of clusters is unclear, and we thus need an approach for selecting the right number of clusters for our problem.


Now, in order to do so, we're going to introduce some metrics.
One of those metrics is inertia, a popular metric to help us accomplish this goal and understand the entropy built into our different clusters.

This metric just gives us the total sum of the squared distances from each point to its cluster centroid.
This way we're penalizing spread-out clusters and rewarding clusters that are tighter around their centroids.
One drawback of using inertia is that its value is sensitive to the number of points in the clusters.
If you think about it, no matter what, as we add more points we keep increasing our inertia, even if those points are relatively closer to the centroids than the existing points.


Now, distortion, on the other hand, takes the average of the squared distances from each point to its cluster centroid.


Again, it'll still hold that smaller values correspond to tighter clusters.

But this time, adding more points will not necessarily increase distortion, as closer points will actually help decrease that average distance.


So thinking about inertia versus distortion, both are going to be measures of entropy per cluster.
Inertia will always increase as more members are added to each cluster, but this will not be the case with distortion, since it works by taking the average.
Thus, when the similarity of points in the cluster is more important, you should use distortion, and if you are more concerned that clusters have similar numbers of points, then you should use inertia.
Generally speaking, the two will decrease fairly similarly.
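To make the two definitions concrete, here is a small sketch that computes both from a fitted scikit-learn `KMeans` model: inertia as the sum of squared distances to the assigned centroid, and distortion as the average of those same squared distances, following the lecture's definitions. The synthetic data from `make_blobs` is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from each point to the centroid of its assigned cluster
sq_dists = np.sum((X - km.cluster_centers_[km.labels_]) ** 2, axis=1)

inertia = sq_dists.sum()      # total: grows as more points are added
distortion = sq_dists.mean()  # average: does not grow just because there are more points
print(inertia, km.inertia_, distortion)
```

The manually computed inertia should agree with the model's `inertia_` attribute up to floating-point error.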

So what can we do in order to find the clustering with the best inertia?
We initialize our K-means algorithm several times with different initial configurations. With that, assuming we predefine what our K is, we can compute the resulting inertia or distortion, keep those results, and see which of our different initializations leads to the best inertia or distortion.
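One way to sketch this "run several initializations, keep the best" procedure with scikit-learn is to fit one model per random seed and pick the one with the lowest inertia. The synthetic data and the choice of five seeds are illustrative assumptions; `init="random"` with `n_init=1` is used so that each fit corresponds to exactly one random starting configuration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# One K-means run per seed, each starting from a single random initialization
runs = []
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    runs.append(km)

# Keep the converged solution with the lowest inertia
best = min(runs, key=lambda model: model.inertia_)
print([round(model.inertia_, 3) for model in runs], round(best.inertia_, 3))
```

In practice, the `n_init` parameter of `KMeans` does this internally: it runs that many initializations and keeps the result with the lowest inertia.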

So as an example of this, we're asking which model is going to be the right one.
We see for this K equals 3, with the three different centroids that it converged to, that the inertia is equal to 12.645. We look at this other converged K-means run, with K equal to 3 again, and we see that the inertia is equal to 12.943. And then for a third one, we see that the inertia is equal to 13.112.
These are all converged K-means runs, but with different initializations.
So we would want to pick the model with the lowest inertia among the three.
Here, we introduced inertia and distortion and showed how they can be used, as we just saw, to choose the best model given a specific K.
In the next video, we will extend this to show how they can be used to help determine the correct number of clusters as well, and we'll also show the syntax used to compute these metrics using Python.
All right, I'll see you there.


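008: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p08 7_The Elbow Method and Applying K-Means.zh_en -BV1eu4m1F7oz_p8-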

So how do we use inertia or distortion to help choose the right number of clusters, as I promised we would do in the last video?

Now we know that inertia and distortion measure the distance of each of our points to its respective centroid.
If we think about this metric, either inertia or distortion, technically speaking we will almost always be decreasing its value as we increase the number of clusters.

We can think of this in regards to the extreme: if we had a cluster for every single one of our data points, the distance from each point to its centroid would be equal to 0, and we would end up with an inertia or distortion of 0.
To account for this, this is where the elbow method comes into play. We see here that we have an inflection point that could be chosen, perhaps, as a good K.
Again, this is a graph of the number of clusters on the x-axis and either inertia or distortion on the y-axis.
We can see that until this inflection point, the inertia or distortion goes down very rapidly, but after this point, the rate of decrease slows down quite dramatically.

This slowing down can indicate a natural point in our data set where the number of groupings makes sense and should serve as a logical choice for K.
Again, this works for both distortion and inertia, where inertia penalizes differing numbers of points within clusters and leads to more balance, whereas distortion penalizes average distance and leads to more similar clusters.


So how do we implement K-means in Python? This will be our first unsupervised learning algorithm that we do in Python.
We still use the same first step, where we import the class for our unsupervised model. So from sklearn.cluster, we import KMeans.
We then again create an instance of this class and pass in each of the different hyperparameters for that class.

We pass in the final number of clusters, which we have to decide on, and we'll show in just a bit how we can use the elbow method to do that.
Here we set n_clusters equal to 3. We're also initializing using the k-means++ initialization that we discussed earlier, where we had the distance squared over the total distance squared.
This is also the default for KMeans, and you can look at other initialization techniques in the documentation.


We then take this instance, fit it to the data, and use it to predict the clusters for either new data or even our existing data.
So the first step is to call .fit on X1, and then we call .predict.
This is all similar to what we saw with supervised learning as well. Here, though, it is safer to fit and predict on the same data set, because we're just trying to find those groupings and we're not overfitting to some known outcome as we could with supervised learning. So we could predict on X1 as well, to see the groupings that come out of X1.
And then just a side note: we can also use batch mode, which randomly selects different batches and runs essentially the same K-means procedure on those smaller batches. This will help speed up the algorithm if you find that K-means is too slow.
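Put together, the steps just described might look like the following sketch. `X1` here is synthetic data generated with `make_blobs` as a stand-in for the lecture's data set, and `MiniBatchKMeans` is assumed to be the "batch mode" the lecture refers to; the specific `batch_size` and `n_init` values are arbitrary.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Stand-in for the lecture's X1 (unlabeled data; the true labels are discarded)
X1, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Create an instance with the chosen hyperparameters, then fit and predict
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
kmeans.fit(X1)
labels = kmeans.predict(X1)  # predicting on the same data we fit on is fine here

# Mini-batch variant: same idea, but each update uses a small random batch
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=3, random_state=0)
mb_labels = mbk.fit_predict(X1)

print(labels[:10], mb_labels[:10])
```

The cluster numbers from the two models need not match each other; only the groupings are comparable.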

Now, to implement the elbow method, what we want to do is fit K-means for various values of K and then save the inertia values.
So we start off with inertia equal to an empty list.
We then run through a number of different clusters, ranging from 1 to 10, and fit the K-means algorithm for each number of clusters.
So for each k in our list of clusters, we create a new KMeans with the number of clusters equal to one, two, three, four, etc., and we call .fit on our data.
Once it is fit to our data, the model has an attribute holding the inertia for that number of clusters, and we append that to our list. We can then use plt.plot with the list of clusters as our x-axis and the inertia as our y-axis in order to find that actual elbow.
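A sketch of that loop is shown below. The data comes from `make_blobs` as a stand-in, the variable names `inertia` and `list_clusters` follow the spoken description but their exact spelling in the course notebook is an assumption, and the range is taken to include both 1 and 10.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertia = []                          # start with an empty list
list_clusters = list(range(1, 11))    # K = 1, 2, ..., 10
for k in list_clusters:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    inertia.append(km.inertia_)       # inertia of the fitted model for this K

# Number of clusters on the x-axis, inertia on the y-axis
plt.plot(list_clusters, inertia, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
# Call plt.show() or plt.savefig(...) to view the curve and look for the elbow
```

With four generated blobs, the curve would be expected to flatten somewhere around K = 4, though where the elbow appears on real data is often less clear-cut.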
Now that closes out our discussion here on K-means, and we will move into our lab, where we'll see how we do all this in practice. All right, I'll see you there.


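009: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p09 8_K-Means Notebook (Optional), Part 1.zh_en -BV1eu4m1F7oz_p9-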
Welcome to our lab here on K-means clustering, our first lab for Course 4.

In this course, we're going to learn how to use K-means using sklearn. Throughout this lab, we will run a K-means algorithm, understand what parameters are customizable within the algorithm, and know how to use the inertia curve that we discussed in lecture to determine the optimal number of clusters.
Now, a quick overview: K-means is one of the most basic clustering algorithms that we'll be working with.
It relies on finding cluster centers to group data points, based on minimizing the sum of squared errors between each data point and its cluster center.



So first things first, we're going to import all the necessary libraries. We bring in NumPy, pandas, seaborn, and matplotlib. And then we're bringing in sklearn's KMeans.
We also bring in make_blobs, which will be very useful for playing around with K-means, and we'll see how it comes into play. And then we'll use shuffle as well, and we'll see that later on.




We're then going to just set a bunch of our parameters for our visualizations.

And then we are going to get started with creating our first simple data set here.

In order to do this, we're first going to create our function.

I'll break this down step by step. We have our color, and you can think of it as similar to a list, where color equals the characters b, r, g, c, m, k. We can think of looping through it: b for blue, r for red, and so on.
We set alpha equal to 0.5, which is just how opaque each of our data points is. Hopefully you realize at this point that we're going to be creating some type of scatter plot. And then the size of each of our points is s equals 20.


We then call plt.gca(), which stands for get current axes, and set the aspect equal to 'equal'. You see here, in quotes, the string 'equal'.
That's because we're going to be plotting a circle, and we want each unit in the x direction to be the same length as each unit in the y direction, so that we see a clean circle.
You can try erasing this line to see what it looks like otherwise.



We then say that if we have no clusters, so we're not clustering at all, we just create a scatter plot, passing in our X: all rows for column 1 and all rows for column 2. The color is just that first color, so that's b. Our alpha is equal to 0.5, and our size is equal to 20.

Now, if we have a number of clusters, what we want to do is plot each of our different clusters.
The way that we do this is we call plt.scatter, and we say we want the X values for which the label that our K-means model came up with is equal to i, where i is whatever cluster we're on as we loop through the number of clusters: the first one, then the second one, and so on.

From those rows we take the first column, and then again, for all those rows with label equal to i, we take the second column. So we get each of the two columns, but only for the rows whose labels match the current cluster.
We set each cluster to a different color, looping through the colors that we defined above.
And then we also plot the actual cluster centers, so we can see where those lie as well.
And then we are also going to plot the actual cluster centers so we can see where those lie as well。

So we just say cluster I related to the cluster that we are currently on。
And we say the x coordinate, as well as the y coordinate。
So saying that first column and second column。 And then again, using that same color。
and we're going to mark that with an x so that we can differentiate that from our actual data points。
We're also going to make the size of that larger, we're going to say the size is equal to 100。
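Here is a minimal sketch of the plotting function just described. The name display_cluster and its argument names are assumptions based on how the function is called later in the demo, the Agg backend line is only there so the sketch also runs outside a notebook, and the modulo on the color index is an added safeguard for the case of more clusters than colors.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np


def display_cluster(X, km=None, num_clusters=0):
    """Scatter-plot X; if a fitted model is passed, color points by cluster."""
    color = "brgcmk"  # blue, red, green, cyan, magenta, black
    alpha = 0.5       # opacity of each point
    s = 20            # marker size of each point
    plt.gca().set_aspect("equal")  # one unit in x is drawn as long as one unit in y
    if num_clusters == 0:
        # No clustering: plot all rows of both columns in the first color.
        plt.scatter(X[:, 0], X[:, 1], c=color[0], alpha=alpha, s=s)
    else:
        for i in range(num_clusters):
            c = color[i % len(color)]   # wrap around if clusters outnumber colors
            members = km.labels_ == i   # rows assigned to cluster i
            plt.scatter(X[members, 0], X[members, 1], c=c, alpha=alpha, s=s)
            center = km.cluster_centers_[i]
            # Mark the centroid with a larger x in the same color.
            plt.scatter(center[0], center[1], c=c, marker="x", s=100)
```

Any object with labels_ and cluster_centers_ attributes, such as a fitted scikit-learn KMeans model, can be passed as km.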



So to see what this looks like, we're going to create our X here。In order to do that,
we create this angle, which is just going to be a NumPy array of values between 0 and two times pi,
and it's going to be 20 equally spaced points。

And we're saying we don't want the endpoint, so it's going to be up to,
but not including, two times pi。

We're then going to append two different arrays together to create our X,
with our first feature and our second feature, so each of our two axes,
where the first one is going to be the cosine of our angle
and the second one is going to be the sine of our angle。

And this 0 just says that we want to append these along the zero axis, so that we have them
stacked one after the other as two rows。And then we transpose this so that we have two different columns。
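The steps just described can be sketched as follows. The variable names angle and X are assumptions; the calls themselves are standard NumPy.

```python
import numpy as np

# 20 equally spaced angles in [0, 2*pi), leaving out the endpoint 2*pi.
angle = np.linspace(0, 2 * np.pi, 20, endpoint=False)

# Stack cos and sin as two rows (axis 0), then transpose to get two columns.
X = np.append([np.cos(angle)], [np.sin(angle)], 0).transpose()

print(X.shape)  # (20, 2)
```

Each row of X is one point on the unit circle, which is why the scatter plot shows a clean circle once the aspect ratio is set to equal.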


So I'm going to show you quickly。First, I will run this。

So we display the cluster, and we see our perfect circle here。
And just to take a quick look at what X looks like:

X is just going to be these two different columns, with one of them being the cosine of the angle and the other one being the sine of that angle。

And for now we have 0 clusters, because our default above set the number of clusters equal to 0。
There is no K-means model yet, but we'll introduce one shortly。So all we're doing is plotting the X。


Now we're going to group this data into two clusters to see what it looks like。
And we use two different random states to initialize the algorithm,
to see how we come up with different results depending on how we initialize the algorithm。

So we set the number of clusters equal to 2。We call KMeans,
and we set the number of clusters equal to that number of clusters。We set random_state equal to 10,
and we're saying we only want to initialize once。

Generally speaking, K-means would initialize a number of times and then just choose the run with the best inertia。
Here, we're saying initialize only one time, just to see the differences between two different random states,
even though, again, the default, if we look here, is to use the k-means++ initialization。
So that will make it more likely to choose far-apart starting points,
but it will still choose different points from run to run。So it is going to be important either to let K-means initialize a number of times
or to run it several times yourself,
check those inertias, and choose which one is best on your own。

So we create a KMeans object with those hyperparameters that we've passed。

We call km.fit on X。So now we have our K-means model fit。
And then, using that K-means model that we fit, we can display the clusters with the function that we defined earlier。


And you can see that the km object we came up with
has an attribute that gives us the different labels, as well as one for the different cluster centers。

So those are now available, and we can create this scatter plot。So we run this, and we see
our two different groupings。And again, these groupings,
because of the way that we created this data, could really fall anywhere。
There are no natural groupings, which is why the split is likely to fall in many different places,
given the way that we are running this。That's why it will not necessarily converge to the same spot。
So we see that here, and we have the x marking where those centroids actually lie。The red x should be the average of all the red dots,
the blue x is going to be the average of all the blue dots,
and we see how it assigns each point to one of those two clusters。
Now setting the random state equal to 20 here。

We can see that it comes up with a very different clustering。

And if we think about it, coming back to lecture: why are these clusters different when we run K-means twice?
This should be clear, as we talked through it quite thoroughly when I went through each one of those different graphs。
It's because the starting points of the cluster centers have an impact on where the final clusters actually lie。
And again, these are clusters that probably don't actually exist,
given how equally spaced each one of these points is。
So it's highly likely that each one of these clusters will come up in a different place。
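As a rough sketch of this experiment, assuming scikit-learn's KMeans and the circle data from earlier, we can fit once per random state with a single initialization and compare the results. The results dictionary is a hypothetical helper for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rebuild the 20 points on the unit circle.
angle = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.column_stack([np.cos(angle), np.sin(angle)])

results = {}
for seed in (10, 20):
    # n_init=1: a single initialization, so the random state alone decides the start.
    km = KMeans(n_clusters=2, random_state=seed, n_init=1)
    km.fit(X)
    results[seed] = km
    print(seed, km.cluster_centers_.round(2), round(km.inertia_, 3))
```

With evenly spaced points there is no natural split, so the centers reported for the two seeds may land in different places even when the inertia is similar.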

So I'm going to pause here。And we will continue by figuring out the optimal number of clusters and how we'd actually do that using Python code。
All right, I'll see you in a bit。



010:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p10 9_K-均值算法笔记本第2部分.zh_en -BV1eu4m1F7oz_p10-
Now, in this video, we're going to talk through how you can actually choose the optimum number of clusters。
depending on your data set。So we're going to synthetically create our data set here so we know the number of clusters。
and that will allow us to really understand how we choose those clusters and how that looks when we create that elbow plot that we discussed earlier。
So in order to synthetically create these different blobs, as you see here,
we're using the make_blobs function。We first say how many samples we want,
so we're going to have one thousand different data points。
For how many bins, we set that equal to 4。And then we set the centers of each one of our different blobs。
So we actually have these centers predefined, and we set them at (-3, -3), (0, 0), (3, 3), and (6, 6)。
So you can imagine these are in a straight diagonal line。And then we call make_blobs,
which outputs two values; here we only care about the X。
We pass the number of samples that we want, which we set equal to 1000。
The other important argument that we have here is cluster_std,
the standard deviation of each cluster,
and that defines how tightly around each one of these centroids our data points will be drawn。
And then again, we set our centers equal to the centers that we defined above,
and we set the random state equal to 42 to ensure that we get the same values as well。
So I call display_cluster, which is the same function that we discussed in the first video。
We call it on our new X, and we can already somewhat see our four different blobs visually。
So if you wanted these to be very clear, again, we can use that standard deviation: with a smaller standard deviation,
the points will be tighter around those centers。And you see, if I were to run this,
that they are really tight around each cluster and very, very obvious。
So that takes it a little too far。We don't want it quite that obvious,
so we're going to set it equal to one。
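A minimal sketch of this data set, assuming scikit-learn's make_blobs. The centers are written on the diagonal as described, and the variable names are assumptions.

```python
from sklearn.datasets import make_blobs

n_samples = 1000
centers = [[-3, -3], [0, 0], [3, 3], [6, 6]]  # four predefined centers on a diagonal

# make_blobs returns the points and the blob each point was drawn from;
# only X is needed for clustering.
X, y = make_blobs(
    n_samples=n_samples,
    centers=centers,
    cluster_std=1.0,   # smaller values pull points tighter around each center
    random_state=42,   # fixed so the same points are generated every time
)

print(X.shape)
```

Lowering cluster_std makes the four blobs almost trivially separable, which is the effect described above.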

We're then going to run our K-means and set our initial number of clusters equal to 7。
So we create KMeans with the number of clusters equal to 7。We call km.fit,
and we can use display_cluster, the function again that we defined earlier,
which will give us each one of our different clusters, color coded,
as well as their centroids。We pass in km, we pass in that number of clusters,
and we see here the seven clusters that are
created when we call K-means with clusters equal to 7, and it seems to be arbitrarily splitting
each of those four blobs into subsets。Now, if we set the number of clusters equal to 4
and run that same code, it seems visually that we have a much cleaner set of four different clusters。
Now, we ask the question here: should we use four or seven clusters? Here it's obvious, because we have it plotted in two dimensions
and we have these clear, distinct blobs。
But in the real world,
data will usually have more than these two dimensions,
and a data set in higher dimensional space is going to be very hard to visualize。
So the way to solve this problem and decide whether we should use 4 or 7 is going to be what we discussed earlier,
finding that elbow by plotting inertia versus the number of clusters。
So by calling km.inertia_, we can get the inertia for the last fitted model。
So this will be for the number of clusters equal to 4。


And I want you to think before I run this, which one will have a lower overall inertia。
4 clusters or 7 clusters。 And we'll discuss that in just a second。

So in order to plot this out, we create a blank list, so we have inertia equal to this blank list。

We're going to run through a range of different numbers of clusters, ranging from 1 up to 11,
not including 11, so up to 10。And then, for num_clusters in this list, so for 1 through 10,
we're going to fit a K-means on that number of clusters。We're then going to take that inertia list
and append onto it the inertia for that fitted model, for clusters equal to 1, 2, 3,
et cetera, up until 10。We're then going to plot。As our x axis,
we're going to use the list of cluster counts, which is going to be those values 1 to 10,
and our y axis is going to be the inertia values that we're appending onto the list。
We call plt.plot on these two, so this will actually create a line plot,
and plt.scatter will create our actual markers。There are other ways we could have done this as well。
We're then going to set our x label and our y label to number of clusters and inertia, respectively。

And we run this, and we see this steep decline from 1 to 2, then from 2 to 3, and 3 to 4。
And then you see that it kind of slows down after that 4。Now,
this is obviously not always going to be perfect。At times, it will be difficult to really say
where that inflection point is。Here, it may even look like 2, because of such a steep drop-off from 1 to 2。
But you see that at 4
it kind of starts to flatten out。And I asked you that question earlier:
which one will have lower inertia, 4 or 7?And hopefully, if you've been paying attention,
you noticed, as you see on the plot, that the inertia continues to go down as you increase the number of clusters,
essentially no matter what。
But it doesn't go down as quickly once we hit that 4。So that's our inflection point,
and we say that we should probably use four clusters。
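The loop behind this elbow plot can be sketched as below, assuming the same synthetic blobs. The fixed random_state and n_init values are assumptions chosen for repeatability, and the plotting calls are left as a comment so the sketch stays focused on the inertia values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

centers = [[-3, -3], [0, 0], [3, 3], [6, 6]]
X, _ = make_blobs(n_samples=1000, centers=centers, cluster_std=1.0, random_state=42)

inertia = []
list_num_clusters = list(range(1, 11))  # 1 through 10
for num_clusters in list_num_clusters:
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    km.fit(X)
    inertia.append(km.inertia_)  # sum of squared distances to the nearest center

# To draw the elbow: plt.plot(list_num_clusters, inertia) for the line,
# then plt.scatter(list_num_clusters, inertia) for the markers.
for k, value in zip(list_num_clusters, inertia):
    print(k, round(value, 1))
```

Reading the printed values from top to bottom shows the same pattern as the plot: large drops up to 4 clusters and much smaller ones afterwards.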
Now that closes out this video in regards to looking at this elbow plot of the number of clusters versus the inertia。In the next video, we'll see a practical application and how we can use this K-means
on actual images。All right, I'll see you there。


011:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p11 10_K-均值算法笔记本(选修部分)第3部分.zh_en -BV1eu4m1F7oz_p11-
Now, let's move to a more practical application where we can actually see this in practice, as we take this image of bell peppers
and group together the different colors, so that rather than working with the multitude of colors within this image,
we're only going to be working with the number of colors that we create within our clusters。
And we'll see what we mean as we walk through this notebook。
So the first thing that we're going to do is read in this image。Now,
when we call plt.imread on this image,
we are actually bringing it in as a NumPy array。I'll show this in just a second。
We can then use plt.imshow to actually show that image, which is currently a NumPy array,
so we can see it within our Jupyter notebook。And then we're just calling plt.axis
with 'off', because we don't want any axes when we're just plotting an image。

So we run this。

And we see our image with our different colors and our different shades of green, red, and yellow。
We then call the image's shape here, but quickly, I want to show you what the actual image object looks like。

And as I mentioned, it takes this image that we just plotted out, and rather than
giving the actual picture, it represents it as an array where each value is how much
red, green, and blue each one of these pixels has。
And that's going to be for every single pixel。So we want to see how many pixels we have,
and we have 480 times 640 different pixels, and each pixel has three values that represent how much,
again, red, green, and blue it has。

And below, just to hone in on how we have this picture representation within NumPy arrays,

we're going to look at a pixel with R equals 35, G equals 95, and B equals 131。
And these are all values between 0 and 255。We're going to call plt.imshow
on just this specific array, so as if it's just one pixel with a certain amount of coloration。

So we run this, and we see that this, since it's mostly blue, outputs something close to blue。
If I were to decrease the blue to 13 and increase the green to 195, then you see this very green image。
And then, just so you understand a bit of how coloring works:

if we were to set these all to 100, so all the values are the same, it should be somewhere around gray,

because it's equal amounts of each。And if we set them all to 0, what do you think will happen?
I'll run that here。

You see that we have black。And then if they are each 255, which is the maximum value,

you see that we get white。So that's just
a quick understanding of how each one of these pixels is being created using this NumPy array。
So what we're going to do next is reshape our array so that every single pixel is a row。Rather than having three dimensions,
we're going to make this two dimensions。So we're going to take our 480 by 640 pixels and multiply 480 times 640。
So again, each row will represent a single pixel, and the other dimension will be the RGB values, how much of each is incorporated into that particular pixel。
So we call reshape。We say that the first dimension is going to be the first dimension of our original image times the second dimension。
Again, it was originally in three dimensions。We're taking those first two dimensions and
multiplying them together, so that's how many rows we have。
And then the number of columns will be 3, for R, G, and B。
And then, just to see the first five values: each one of these rows represents a pixel,
and each one of the numbers within that row represents the red,
the green, or the blue, respectively。And since 480 times 640 equals 307,200,
that's going to be the number of rows in our new NumPy array。Now we're going to run K-means on the
image that we had, using eight clusters。So we're going to come up with eight groupings:
we take every single one of these 307,200 pixels and find eight groups to segment them into。
We're then going to create a copy of that image and replace that copy's values
with the cluster centers for their respective labels, from these eight different clusters。
So rather than the actual value that was there, we're saying, for all those 307,200 rows
where the K-means label is equal to a given one of our eight unique labels,
replace the row with the actual value for that cluster center。So I'm going to run this,
and it will replace all those values。And just to show you quickly what that looks like:
our new values, you see they are all the same here, 43, 156, 43, 56, and later on 236, 172, 8。
These represent some of the eight different cluster centers that we had,
and we replaced those original values that we see up here
with one of these eight values that we created using our different centroids。Now,
to see what that looks like, now that we've replaced this multitude of different hues with only eight possible colors,
we're going to reshape that to the original image shape。
In order to actually show this as an image using plt.imshow, we have to get it back to 480 by 640
by 3。We can then call plt.imshow again and turn off the axis。

And we see that we can still recover a lot of our initial picture with just these eight different colors。
And we can see the different hues and how it differentiates between the different peppers, and how we lose a bit of the granularity。
But we see these clusters of the red, the white, the green, the black, and so on。

So the next thing that we want to do, in order to take this a step further, is create a function that will take in any image,
as well as a number of clusters, and return the image using just the specific centroids, replacing each one of those different pixels。
As we just did with 8, we want to do that for any image and for any K, any number of clusters。
So to do that, we're going to repeat the steps that we just did。
We set the flattened image to the reshaped image, given the first two dimensions and then three
for the RGB。We then set the number of clusters equal to the K that we pass in here。
We set random_state equal to zero just to ensure that we get the same values when I look at it here and you look at it back at home。
We then fit that to our flattened image, again in two dimensions, in our case
307,200 by three。We then create a copy as we did before,
and we're ultimately going to change this copy by running a for loop through each one of our different labels。
And wherever our labels are equal to that value,
we replace those rows with that specific cluster center, again doing the same steps as we did before。
We then reshape that back to the original image shape so that we can ultimately display it。
And then we output from this function both that new image with the replaced colors
and the inertia for that specific K-means model, depending on what our K was。

So we've created our function that will output, again, that new image with the replaced pixels,
as well as the inertia for that fitted model, depending on the K that we use。
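Here is a minimal sketch of such a function, assuming scikit-learn's KMeans. The name image_cluster and its internal variable names are assumptions, and a small random image stands in for the peppers photo so the sketch runs quickly without an image file.

```python
import numpy as np
from sklearn.cluster import KMeans


def image_cluster(img, k):
    """Replace every pixel of img by the center of its K-means color cluster."""
    # One row per pixel, three columns for R, G, and B.
    img_flat = img.reshape(img.shape[0] * img.shape[1], 3)
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(img_flat)
    img_flat2 = img_flat.copy()
    for i in np.unique(kmeans.labels_):
        # Every pixel with label i takes the color of cluster center i
        # (cast back to the image's dtype on assignment).
        img_flat2[kmeans.labels_ == i, :] = kmeans.cluster_centers_[i]
    img2 = img_flat2.reshape(img.shape)  # back to height x width x 3
    return img2, kmeans.inertia_


# Assumption: a random 24 x 32 RGB array used in place of the real photo.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(24, 32, 3), dtype=np.uint8)
img2, inertia = image_cluster(img, 8)
print(img2.shape, len(np.unique(img2.reshape(-1, 3), axis=0)))
```

On the real 480 by 640 image the same call works unchanged, it just takes longer because K-means is fit on 307,200 rows.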

We're then going to call that function for K between 2 and 20,
counting by two, and draw that inertia curve。Later on
we'll also display many of these pictures。So we're saying the K values
that we loop through are going to be 2 up to 21, not including 21,
counting by 2。And then we initialize empty lists: one so we can save the images,
and one for the different inertias。Then again,
when we call this image cluster function that we defined, we get an output of both the new image with replaced pixels and the inertia。
So we call this function, get the new image as well as the inertia,
and then append each one of these output values to the lists that we initialized here。
So I'm going to run this, and this will take just a second,
and it will output for us each of these different images, as well as the inertia values。
And then we'll plot out these inertia values in just a second。All right,
I'll see you as soon as it stops running。

So that should have taken about five minutes to run。
Now we have from the outputs our different inertia values,
as well as our images, which we'll get to in just a second,
and we can plot our inertia values against each one of our different numbers of clusters。
So we call plt.plot to get the line graph, and on top of that
we call plt.scatter to get each of the points。And we
set our x label and y label to K and inertia。And we see here that it curves down smoothly,
and it's hard to see an exact elbow。So this is a case where maybe we can't see exactly where that elbow is and decide using the elbow method。
So we note here, and you can dive deeper into this, the metric called the silhouette coefficient。
What it does is compare how similar a point is to the other points in its own cluster with how similar it is to points in nearby clusters。
And again, you can dive deeper, but that is a different method for deciding what the number K should be。
Now, the next step that we have here is to plot each one of the images, to see
how each one looks with a different number of colors。Again,
we're only going to use the number of colors that we have as clusters。
So we're going to run through our K values, counting by 2 between 2 and 20,
so over the range of the length of those values, for 10 different subplots。We're going to plot
five rows by two columns, so we're going to have a subplot grid
of 10 different axes, where each one will have a different image。And one at a time,
we show that image for the K value that we're using。
And then we title it and also turn off the axis, and we can see。

As we increase the number of colors, how much of the image we are able to actually discern。
given the number of colors we're using。

So here at the bottom, where we see that we're using 20 different colors,
so we have 20 centroids replacing the original values,
we can actually see each one of our different peppers pretty clearly and really discern the original photo well。
Just to give you an idea of how many colors there were originally, we can run np.unique
on the flattened image, and we say axis equals 0。And let's see the length here。

And you see that originally there were 98,452 unique colors making up that picture,
and we can see how well we can represent that with just 20 colors here。
So we can see how well we were able to group those 98,000 different colors into just 20 colors。
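The unique-color count can be sketched on a tiny example. The 2 by 2 image below is a made-up stand-in for the photo; np.unique with axis=0 treats each row, that is each RGB triple, as one item.

```python
import numpy as np

# Assumption: a tiny 2 x 2 RGB image with one repeated red pixel.
img = np.array(
    [[[255, 0, 0], [255, 0, 0]],
     [[0, 255, 0], [0, 0, 255]]],
    dtype=np.uint8,
)

img_flat = img.reshape(-1, 3)                 # one row per pixel
n_colors = len(np.unique(img_flat, axis=0))   # unique rows = unique colors
print(n_colors)  # 3
```

Without axis=0, np.unique would flatten the array and count unique channel values instead of unique colors.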
That closes out our notebook here on K-means clustering,
and I look forward to seeing you back at lecture。All right, I'll see you there。


012:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p12 11_欧式距离和曼哈顿距离.zh_en -BV1eu4m1F7oz_p12-
Our clustering methods will rely very heavily on our definition of distance。
So let's take a step back and discuss the different distance metrics that are available to us。Now,
let's go over the learning goals for this set of videos。

In these videos, the main topic of discussion will be different measures of distance between different points。

And with that, we will discuss the different applications of these different distance measures and how they relate to clustering。

And the different measures that we're going to discuss are going to be the Euclidean distance。
which is going to be that classic distance that you're probably already familiar with。
As well as the Manhattan distance, the cosine similarity, and the Jaccard distance。

Now, our choice of distance metric will be incredibly important when discussing any of our clustering algorithms。
As these clustering algorithms will all be dependent on some type of measure of how distant or in that same vein。
how similar one point is with the next。

Now there are several choices of distance metrics, and they all have their strengths and more appropriate use cases。

But at times, we may also need to use empirical evaluation to determine which one of our distance metrics works best in achieving our goals。

Now, the most intuitive distance metric, one that we are hopefully already somewhat familiar with
and the one we use in K-means, is going to be the Euclidean distance。
Another name for this is the L2 distance。So in order to highlight how Euclidean distance is calculated,
we're going to take these two points and calculate the Euclidean distance between them。

So we remove all the other points so we can just look at these two points, and hopefully you remember parts of this from math class back in the day。

In order to find this distance, d, we need to first find our change in visits
as well as our change in recency, or a change along the x axis as well as a change along the y axis。

And then, if you think back to that math class
from back in the day, how do we think these values,
the changes in visits and recency, will relate to our calculation of d?
We get d by taking the square root of the sum of the squares of each of these changes。
So that math equation that I was hinting at was a squared plus b squared equals c squared。
And again, you take the square root of c squared, and you end up with the formula that we see here。
And we can extend this to higher dimensions。Imagine if we had three dimensions,
four dimensions, and so on。We just take the square of each of those changes and then take the square root of the sum of all those values。
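The calculation described above can be sketched in a few lines of plain Python. The function name and the example points are assumptions for illustration; the same code works for any number of dimensions.

```python
import math


def euclidean_distance(p, q):
    """L2 distance: square root of the sum of squared changes per feature."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# A change of 3 in one feature and 4 in the other gives the 3-4-5 triangle.
print(euclidean_distance((0, 0), (3, 4)))        # 5.0
# The same formula extends to three or more dimensions.
print(euclidean_distance((1, 2, 3), (1, 2, 3)))  # 0.0
```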

Another distance metric that you may already be familiar with is the L1 distance, or the Manhattan distance。
Instead of squaring each term, we add up the absolute value of each term。
It will always be at least as large as the L2 distance, and the two are equal only when the points differ along a single axis,
so when they have the same number of visits or the same recency。
And we'd use this in business cases where there's very high dimensionality,
as high dimensionality often leads to difficulty in distinguishing distances between one point and another,
and the L1 distance does better than the L2 distance at distinguishing these different distances
once we move up to higher dimensional space。Now, these are the two most commonly known distance metrics, which hopefully you already know a bit about。
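To compare the two metrics side by side, here is a small sketch in plain Python. The function names and example points are assumptions for illustration.

```python
import math


def manhattan_distance(p, q):
    """L1 distance: sum of the absolute changes per feature."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean_distance(p, q):
    """L2 distance, included here only for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Points that differ in both features: L1 is larger than L2.
print(manhattan_distance((0, 0), (3, 4)), euclidean_distance((0, 0), (3, 4)))  # 7 5.0
# Points that differ along a single axis: the two distances agree.
print(manhattan_distance((0, 0), (0, 4)), euclidean_distance((0, 0), (0, 4)))  # 4 4.0
```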
In the next video, we will introduce some less well known distance metrics that can prove to be very powerful for certain applications。
All right, I'll see you there。


013:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p13 12_余弦距离和Jaccard距离.zh_en -BV1eu4m1F7oz_p13-

So we start here with a bit of a less intuitive distance metric, namely the cosine distance。

So we're going to start off again with two points in two dimensional space just to highlight our example。

And hopefully from the lines that we just drew, it should be clear that this is already shaping out to be much different than the L1 and L2 metrics that we just discussed。

What we really care about with the cosine distance is the angle between these two points。
This metric is built on the cosine of the angle between the two vectors defined by each of these two points, and the distance is one minus that cosine。


To move up to higher dimensions, this formula still holds: we take the dot product,
as you see in the numerator, over the product of the norms of each point in the denominator。

And the key to the cosine distance is that it remains insensitive to scaling with respect to the origin。

That is, we can move one of those points, as we have here,
along that same line, and the distance would remain the same。

So any two points on the same ray passing through the origin will have a distance of zero from one another。

And the idea is that we want to see the relationship here between recency and visits,
from one point to the other, much more than we care about the actual physical distance between the two。

So recency being 1 and visits being 1 is the same, as far as the cosine distance is concerned, as
recency being 10 and visits being 10,
because they lie along that same ray。So for two vectors that are pointing in the same direction,
our cosine distance will spit out zero。

It'll think of them as very close, or essentially exactly the same。

But Euclidean distance may think of them as very far apart,
depending on where those values actually lie, even if they are on the same line。
So how is this useful, being able to classify them as exactly the same if they are pointing in the same direction?


Let's say we have text data and our features are going to be different counts of different words within the documents。

Now just because one document is longer than the other, so it has more counts of each of these words。


Does not mean that they need to be far away from one another and thus cluster differently。

Maybe they're about the exact same thing。Maybe one of those articles is a summary of the other。

In that case, you want to mark them as close to one another。
and cosine distance will come in handy in that situation。
So if you have 3 counts of the phrase "data science" and 10 counts of the word
"application" in one document, and then 30 of "data science" and 100 of "application" in another,
then you probably want to assume that those fall in the same category and cluster them together。
Even though their Euclidean distance may be large,
their vectors point in the exact same direction, and thus their cosine distance is zero。
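The word-count example can be checked with a short sketch in plain Python. It assumes the common definition of cosine distance as one minus the cosine of the angle; the function and variable names are made up for illustration.

```python
import math


def cosine_distance(p, q):
    """One minus the cosine of the angle between vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


# Counts of ("data science", "application") in a short and a long document.
doc_short = (3, 10)
doc_long = (30, 100)

# Same direction, so the distance is zero up to floating-point rounding.
print(abs(cosine_distance(doc_short, doc_long)) < 1e-9)  # True
# Perpendicular vectors are as far apart as non-negative counts can get.
print(cosine_distance((1, 0), (0, 1)))                   # 1.0
```

Scaling either document's counts by any positive factor leaves the result unchanged, which is the insensitivity to document length described above.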





Another advantage of the cosine distance is that it's more robust against this curse of dimensionality。

Euclidean distance can get affected and lose meaning if we have a lot of features。
as we saw in our initial discussion of that curse of dimensionality。


So our takeaway here is that the best choice of distance is going to heavily depend on what our application is。

Another distance metric to keep in mind is going to be the Jaccard distance,
which will be useful for text as well。


It applies to sets, and an example that is used pretty often, and one that we walk through here,
is word occurrence, the unique word occurrence。


So say we have a sentence A: I like chocolate ice cream。

Set A is just going to be the unique words in that sentence: I, like, chocolate, ice, and cream。


Say sentence B is going to be: do I want chocolate cream or vanilla cream?

So set B is going to be: do, I, want, chocolate, cream, or, and vanilla,
again not counting that second cream, only the unique values。


And then the Jaccard distance is going to be one minus the proportion of values shared,
so the intersection over the union: the number of values shared by the two sentences over the total number of unique values across those two sentences。
And we'll see this example in just a second, and the calculation as well。




And it can be used as a different option when we have these text documents to group similar topics together。


So using this example, we can calculate the score between our two sentences, and running through it,
we see that our intersection ends up having three words。

And there are nine unique words in total。So the distance is going to be 1 minus one third, which equals two thirds, or 0.67,
and that will be our distance。
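The same calculation can be sketched with Python sets. The function name is an assumption, and the sentences are lowercased and stripped of punctuation so that each word is counted once.

```python
def jaccard_distance(a, b):
    """One minus the size of the intersection over the size of the union."""
    return 1.0 - len(a & b) / len(a | b)


set_a = set("I like chocolate ice cream".lower().split())
set_b = set("Do I want chocolate cream or vanilla cream".lower().split())

print(sorted(set_a & set_b))                     # ['chocolate', 'cream', 'i']
print(len(set_a | set_b))                        # 9
print(round(jaccard_distance(set_a, set_b), 2))  # 0.67
```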


So that closes out our different distance metrics and overall in this discussion。

Just to recap, we discussed the importance of having different measures of distance between our two points。

As well as the applications of distance measures to clustering and how the measures of distance or similarity will ultimately have a large effect on the groupings that we end up creating。

And with that, we discussed the Euclidean distance as our most common metric, where we used the old formula that we learned back in the day: a squared plus b squared equals c squared.
We discussed the Manhattan distance, which was the absolute difference along each individual feature, all added together.



We discussed the cosine similarity, which highlighted the angle between our points。
and then finally we discussed the Jaccard distance,
which is useful for showing how different or similar two sets of values are.
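As a quick recap in code, the three numeric measures can be computed with NumPy as below. This is only a sketch: the two points `p` and `q` are made-up values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical points, chosen so the differences are 3 and 4.
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean: square root of the summed squared differences (a^2 + b^2 = c^2).
euclidean = np.sqrt(((p - q) ** 2).sum())

# Manhattan: absolute difference along each feature, all added together.
manhattan = np.abs(p - q).sum()

# Cosine similarity: depends only on the angle between the two vectors.
cosine_similarity = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

print(euclidean, manhattan)  # 5.0 7.0
```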



All right, I'll see you in the next video。

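014: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p14 13_Curse of Dimensionality Notebook Part 1.zh_en -BV1eu4m1F7oz_p14-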
Now, in this demo, we're going to take a brief aside and touch back on the curse of dimensionality。
Distance measures will come into play slightly, as we will talk about the Euclidean distance for each of these,
but the focus here is going to be the curse of dimensionality. So with that in mind,
we can talk about the demo objectives, which will be to gain a deeper understanding of why observations are going to be further apart once we move to higher dimensional space.
We're then going to see an example of how adding dimensions will ultimately degrade model performance when we're working with classification,
and then we're going to start to learn how to fight that curse of dimensionality within your different modeling projects.
So the main point is that in higher dimensional space, points will tend to be further apart。
And this is going to impact our data analysis.
Intuitively, think back to the clustering examples that we've already gone through, where we talked about how distant each of the points is from one another and what the nearest neighbor really is.
Can we really say it's a neighbor if the points end up an incredibly far distance apart once we move to these higher dimensions?
So this notebook will show why higher dimensional space does lead to this sparse data。
leads to data points being naturally further apart from one another。
So we're going to start off with a circle inside a square。

And the idea is that the side of the square is going to equal the diameter of the circle.
That's going to be both the width and the height of our square, since a square's width and height are automatically the same.
We're going to have a unit circle, so the diameter is going to be 2,
and then our width is going to be 2 and our height's going to be 2.

And the point is that that circle should touch the borders of our square。
and we want to know using that circle within the square, how much of that is empty space。
And then we're going to move to the next step and create a sphere within a cube using those same dimensions:
two by two by two, with a radius of one. And we're going to see how, just by moving to higher dimensions
with those same measurements, a larger proportion of the space is not covered by that round object.
We'll then generalize that to higher dimensions and discuss how
the mere fact that we're moving to higher dimensions leads to more empty space within our square, or that square moved into higher dimensions.

So the point is to bring in that concept, but I'd be remiss if I didn't also walk you through a lot of the new code that we're going to be talking about as we go through some more complicated plots。
This way, when you're back home, you will be able to go ahead。
create these plots on your own and understand what went into the code。
as well as once we get to the next couple of cells。
being able to start to even plot in three dimensions。So with that in mind。
I'm going to create an empty cell above so that we can walk through all the different code that's within this function。

So to start off, we're going to initiate our figure. plt.gcf is a way to get the current figure; if that figure doesn't exist,
it will initiate a new figure. And then, taking that figure,
we're going to add on our subplot, and that's going to be our axes. We're just saying one by one.
If you think about subplots, that could be two by one or two by two. If you say two by one,
you'd have two rows, each with its own bounding box, and then one column.
The last number says which one we want to select, and we're just selecting that first one.
And then we're saying aspect equals equal。 And this can be similar to what we saw in the last notebook that we had when we wanted to draw that circle and the importance of actually ensuring that our X axis and our y axis are on the same scale。
If one of those is on the wrong scale, then it looks like we have a rectangle rather than a square, or an oval rather than a circle.

So we run this and we see that we now have our bounding box going from zero to one.
Now I'm going to skip quickly over the line that builds the circle, and we'll come back to it.
That's because this is going to be the circle, like we mentioned, centered at (0, 0),
going from zero to one and from zero to negative one as well,
and our box currently only goes from zero to one, not from negative one to one.

So I'll bring this back into play once we walk through this code where we increase our X limit and our Y limit。
So then we're going to add on this scatter plot, which is just going to be a single dot, because (0, 0) is the point that we're bringing in.

We're saying that the size equals 10 and the color equals black, and that's at (0, 0).
Now the scale has changed a little bit to ensure that the dot is in the center. But again,
the axes still don't run from negative one to one.

We're then going to add on a straight line, and this is going to represent the radius of our circle,
so it's going to go from 0 to 1. And we're going to have 100 different points.
So it goes from zero to one in 100 evenly spaced steps, and those are going to be our x values.
And then for the y values, we're going to stay at 0.
This will allow us to create that straight line that we see here. Again,
this messes with the axes a bit, so the plot looks a little bit funky.
But we'll see in just a second what this looks like once we increase those. In fact,
I'll do that right now. Let's change the x limit and y limit.
We're going to add this on to our graph.

We have to make sure that we have no extra tabs there. And now we see that
the line goes from 0 to 1, and we're able to see negative one to one on our x axis and negative one to one on our y axis.
Now we can go through some of the pieces that we skipped over, so coming back first to the circle。

And this is probably the newest piece for those that are watching this video.
What we're doing is getting the current axes, which is just our bounding box.
And then we're calling add_artist. An Artist object is essentially anything that you have within your plot:
your ticks, your numbers, your lines;
those are all Artist objects. When we call plt.Circle,
that creates a subclass of that Artist object, and it won't show up unless we call add_artist.
You can Google and look at the discussion on add_artist and how it works in regards to creating your
Matplotlib plots and the different ways that you can use it.
But the idea is that it will take things that are subclasses of that Artist object and add them on.
So we're adding on this circle. This circle is going to be centered at (0, 0) with a radius of one.
And then we're saying alpha equals 0.5; that's just how opaque our circle is, the same way that we saw alpha earlier.
So we run this and now we have a circle on our plot. So we have our circle within our bounding square.
We're then going to add on an "r". We're just adding text, so ax.text.
We say that we want an "r", and we can set the size of that "r" and where we want it to lie, at (0.4, 0.1),
so we have that "r" there at 0.4 and 0.1.


We've already set our x limit and y limit, and hopefully you're already familiar with setting your y label,
your x label, and your title, but we'll throw that in.

So we have all that within our plot. And then, when we say point equals 0 here,
I want to ensure that no one's misled: the way that it's being used is that point is equal to false.
In Python, 0 is always treated as false,
whereas any other number is treated as true. So if point is true,
so if it's not zero as it is by default, then we're just going to create a dot
at (0.85, 0.85), and we're going to write on top of that
that it's a far away point, just to highlight what we're signifying as a far away point.
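Putting the steps above together, the plotting function might look roughly like the sketch below. The name `make_circle` and the values described in the video (circle at the origin with radius 1 and alpha 0.5, dot size 10, 100 points for the radius line, text at (0.4, 0.1), far-away dot at (0.85, 0.85)) come from the walkthrough; the axis labels, title, font size, purple color, label position, the `Agg` backend, and returning the axes are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np


def make_circle(point=0):
    fig = plt.gcf()                                   # get (or create) the current figure
    ax = fig.add_subplot(1, 1, 1, aspect="equal")     # one-by-one grid, equal x/y scale
    ax.add_artist(plt.Circle((0, 0), 1, alpha=0.5))   # unit circle centered at the origin
    ax.scatter(0, 0, s=10, color="black")             # dot marking the center
    ax.plot(np.linspace(0, 1, 100), np.zeros(100), color="black")  # radius line
    ax.text(0.4, 0.1, "r", size=48)                   # label for the radius (size assumed)
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_xlabel("Covariate A")                      # hypothetical labels
    ax.set_ylabel("Covariate B")
    ax.set_title("Unit Circle")
    if point:                                         # any non-zero value is truthy
        ax.scatter(0.85, 0.85, s=10, color="purple")
        ax.text(0.55, 0.9, "Far away", color="purple")
    return ax


ax = make_circle(point=1)
```

Returning the axes is a design choice made here so the result can be inspected or saved; in a notebook you would typically just let the figure render.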

So the idea is that each axis in this example is supposed to be a different covariate, and we're supposed to imagine we've standard scaled our data
so it's centered on 0. This means that the average for each covariate is now 0, the center of our circle, and points that are outside the unit circle would be harder to classify because these values are far away from our mean.

So, taking this idea of a circle within a square and applying it to how we create our different machine learning models,
we are now identifying that anything outside that circle is pretty far away from the mean: as we have standard scaled our data, that means it's over a single standard deviation away.

So we're going to run this, and we see our unit circle when we call make_circle on its own,
very similar to what we have above.

And then when we call make_circle and pass in one rather than point equals 0,
it's going to add on that far away point, and that far away point will be the same no matter what;
it has nothing to do with the number you pass in. Again, it's just true versus false there.


Now, the point that we want to make here is how much of this square is going to be outside that circle.
Again, think back to how this relates to our modeling:
if we have standard scaled our two different covariates,
then being one unit away from the mean
means that we're a standard deviation from the mean value for each of our covariates, covariate A and covariate B,
which we will ultimately be using for predictions. How many of our points are going to be far away? Now,
since the square has a side length of 2R, with the radius R being 1, the area of the square is going to be 2R, all squared,
just taking the formula for the area of a square: 2R times 2R.
The fraction of the square outside the circle is going to be one minus pi R squared,
which is the area of your circle, over 2R all squared.
That's just one minus the area of the circle over the area of the square,
so it's 1 minus pi over 4 once you cancel out the R squared terms.
And 1 minus pi over 4 means that approximately 21% of that square is outside the circle.
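Written out as an equation, with R as the radius of the circle and 2R as the side of the square, the calculation just described is:

```latex
\[
\frac{A_{\text{square}} - A_{\text{circle}}}{A_{\text{square}}}
  = 1 - \frac{\pi R^{2}}{(2R)^{2}}
  = 1 - \frac{\pi}{4}
  \approx 0.215
\]
```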
So I'm going to pause here。And in the next video, we're going to extend this out to a cube and also walk through how you can create 3D graphs using Python。
All right, I'll see you there。


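015: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p15 14_Curse of Dimensionality Notebook Part 2.zh_en -BV1eu4m1F7oz_p15-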
Building off of what we just discussed in two dimensions。 So we had our square and our circle。
And we saw that 21 per cent of our square was outside of the circle。

We are now going to push that out into three dimensions and work with a sphere rather than a circle and a cube rather than a square。

Now, I want to remind you how this ties back to a data science problem。
The idea here thinking about the two dimensions is that we have for each dimension。
that is a different feature。 So a different covariate。 So we have covariate A, Covariate B。
Both have been normalized。 And we can think that the values lie between negative one and one for each one of those。
and they've been standardized。 So they have a mean of 0 and a standard deviation of one。






If we think about this value and look at the square here。

The idea is that to be a single unit away from that center value。

That would indicate that you're one standard deviation away, whether you're pointing horizontally。
vertically or diagonally。 And we can see all the different values that lie one standard deviation from the mean。
That's your unit circle。 And then all values that are outside of that circle are going to relate to those that are far away from the mean above one standard deviation from the mean。
But they're still within that negative one to one range. So we see,
given that we're working with values between negative one and one for our covariates,





These are the values outside the circle that will be, in a sense, outliers。

Now, in that same sense, suppose we add on a third dimension, covariate C,
which is what we're planning to do here. The idea is that
we're still working from negative one to one, and
our sphere now indicates one standard deviation away in any direction:
not just diagonally in the plane, but diagonally within space.




And then anything outside that sphere, we can then again think of in the same sense that we just did with the circle and the square that this is more than one standard deviation away still between negative one and one and see how many outliers we have。


So again, with the square we had 21% now we're moving to plotting in three dimensions。

I'm going to show you step by step how you can plot some of these values in three dimensions so that you can go home or leave this notebook and then be able to plot in three dimensions yourself as well。


So the first thing that we do is import Axes3D. Now,
if we don't do this, and we try to create our figure

and then, from that figure, get our current axes and make them a 3D projection,

we'll see that we get an error.

We first have to import Axes3D to give us that option.

So。We pull this in。

And I'm going to run through this just as we did before,
adding a cell above so we can see it step by step. Now that I've imported Axes3D,
we see that, rather than working in two dimensions,
we can start to work within three dimensions.



So hopefully this is exciting to see that we have values for X, Y axes and then now a z axis as well。
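A minimal sketch of creating 3D axes is shown below. It uses `fig.add_subplot(..., projection="3d")` rather than the get-current-axes call described in the video; that substitution is an assumption made here because it should behave the same across Matplotlib versions. The `Agg` backend and the axis labels are likewise assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the "3d" projection on older Matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")  # axes with x, y, and z
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")  # only 3D axes have a z label
```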


We're then going to draw our cube。 Now we have here this idea of combinations and taking the products。
I don't want to walk too much into it。 I'll show you quickly how the product works。
and I would suggest you can look at the combinations and see how it works as well built off of this product that I'm about to create。



But right now, we're taking the product of three r's, where r is just defined as the list [-1, 1].

In order to make this a little clearer,

we're going to use three different lists of two values; rather than negative one and one, though,
we're going to use [1, 2], [3, 4], and [5, 6]. And when I take the product,
I'll wrap it in list so that we can see the output; otherwise it's just a generator object.

We also have to make sure that we import that library。

You see that it comes up with every possible combination, not accounting for ordering. So 1, 3, and 5,
taking the first value from each of the lists; then 1, 3, and 6, so first, first, and second;
and then 1, 4, 5 and 1, 4, 6.

Then 2, 3, 5, so you can see how it's going through each one of these different values, ensuring that it covers all the different possible combinations.
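The behavior just described can be checked directly with `itertools.product`, using the same three example lists. This is a short sketch; the variable name `triples` is made up for illustration.

```python
from itertools import product

# One value from each list, in every possible combination.
triples = list(product([1, 2], [3, 4], [5, 6]))

print(triples[:4])   # [(1, 3, 5), (1, 3, 6), (1, 4, 5), (1, 4, 6)]
print(len(triples))  # 8, i.e. 2 * 2 * 2
```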


So it does that with [-1, 1], which gives the corner points of the cube. The combinations call with a value of 2 then gives you every pair of those points,
and I wouldn't worry too much about it. The point here is that, for each combination,

we're pulling out an s and an e, so it outputs two different points.

We're going to take the sum of the absolute differences between s and e, and that has to be equal to
r[1] minus r[0] in order for the pair to be an edge on our cube. So that's all it's trying to do:
it's trying to find where each of our edges lies.
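To make the edge test concrete, here is a minimal sketch that lists the cube's edges without plotting them. It assumes the comparison uses absolute differences (so that the direction of the pair does not matter); the names `corners` and `edges` are hypothetical.

```python
from itertools import combinations, product

import numpy as np

r = [-1, 1]
corners = np.array(list(product(r, r, r)))  # the 8 corners of the cube

# Two corners share an edge when they differ in exactly one coordinate,
# i.e. when the summed absolute difference equals the side length r[1] - r[0].
edges = [
    (s, e)
    for s, e in combinations(corners, 2)
    if np.sum(np.abs(s - e)) == r[1] - r[0]
]

print(len(edges))  # 12, the number of edges on a cube
```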

Now I'm going to pull out this portion of code just to show you how one line is drawn in three dimensional space。


Let me copy all this and move it above. We're saying "for s and e";
we don't care too much about that. What we do care about is this zip of s and e, and then plotting that.
The star is going to ensure that it unpacks the zip into separate arguments.
So rather than just seeing a generator object, we'll see the actual output,
and I'll print it here so we can see what this output looks like: zipping s and e. And then I'm going to break,
so we're just going to plot one line.

So I'm going to run this. r is not defined yet if you forgot to copy that in,
so set r equal to [-1, 1].


And we see that we plotted this one line. Now, the zip of s and e
is going to give the x values of our two points, the y values of our two points,
and the z values of our two points.

So we're plotting from (-1, -1, -1) up to (-1, -1, 1);
that's the idea that we're seeing here. It's hard to see in three-dimensional space,
but the line runs from (-1, -1, -1) up to (-1, -1, 1).
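The regrouping that `zip` performs on the two endpoints can be seen without any plotting. In this sketch `s` and `e` are the first pair of corners from the walkthrough; the names `xs`, `ys`, and `zs` are made up for illustration.

```python
s = (-1, -1, -1)  # start point (x, y, z)
e = (-1, -1, 1)   # end point (x, y, z)

# zip regroups the two points by coordinate: all x values, all y values, all z values.
xs, ys, zs = zip(s, e)
print(xs, ys, zs)  # (-1, -1) (-1, -1) (-1, 1)

# Passing *zip(s, e) to a 3D plot call is then the same as passing xs, ys, zs.
```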

Now, when we run through all the different lines, all we're doing is using this plot3D,
which works exactly the same as plot in two-dimensional space.
It is just creating lines connecting those two dots, the same way you would in two-dimensional space
by calling ax.plot.




So if I take out the break here and let the for loop run all the way through,
you see that we now have our cube connecting each one of these points.


The next step that we want to do in order to draw our sphere is we're first going to create this mesh grid。
So I'm going to copy this above into a different cell。And in order to make this a little clear。
this is going to be the number of points。

A note on the j: just so you know, in general
within Python it denotes a complex number.

We are using the j here not because we're working with complex numbers,
but because the complex step lets NumPy know that, rather than counting in steps of 20,
we want 20 points between 0 and two times pi.

That's all we're doing here by using the complex number. But we're going to reduce this,
just for our example, to 3 and 2, so

three values and two values. The idea is that we want to plot along many different points. Here the first axis is supposed to go from 0 to two times pi,
and we want three different values, so it'll go 0, then pi, then two times pi.

And then we're also going from 0 to pi with just two values, so 0 and pi.
The idea is that we want to plot all the possible combinations of these points.
In order to do that, we have to create this mesh grid so that we have (0, 0) as well as (0, pi);
then, with pi coming from our count from 0 through 2 pi, we have (pi, 0)
and (pi, pi), and so on and so forth.
So that's the idea of the mesh grid: to allow you to plot on each one of these multiple points. Now,
it has two outputs,
one for each of the grid axes, and both are equal in shape. So you have output 0 and output 1.
Those would ordinarily be your x and y; here, we're plotting in three dimensions.
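The reduced example with three and two values can be run directly with `np.mgrid`. This is a small sketch; `u` and `v` are the names used for the two outputs later in the walkthrough.

```python
import numpy as np

# A complex "step" of 3j means "3 evenly spaced points, endpoints included".
u, v = np.mgrid[0:2 * np.pi:3j, 0:np.pi:2j]

print(u.shape, v.shape)  # (3, 2) (3, 2): both outputs have the same shape
print(u[:, 0])           # first-axis values: 0, pi, 2*pi
print(v[0, :])           # second-axis values: 0, pi
```

Reading `u` and `v` together entry by entry gives every pairing of the two axes: (0, 0), (0, pi), (pi, 0), (pi, pi), and so on.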
And all we're doing is taking that two-dimensional grid

and expanding it to create our sphere, by using each one of those different points and taking the cosines and sines of these values,
with u going from 0 to 2 pi and v going from 0 to pi,

and multiplying them together. And then for z, we just take the cosine of v,
and that will create our sphere. So we're going to have our three coordinates
for all these multiple points. Right now, they're just points out in space.
In order to connect all those points into one final sphere,
we're going to use this plot_wireframe, which will connect all those dots together.


So we call ax.plot_wireframe on the x, y, and z.

And we run this。And then we see all these different points that were created in three dimensional space。
all being connected by this wire frame。
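A sketch of the sphere step is below, using the standard spherical-coordinate parameterization of a unit sphere. The 20 points for `u` come from the walkthrough; the 10 points for `v`, the red color, the `Agg` backend, and the use of `add_subplot` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# u sweeps around the sphere (0 to 2*pi); v sweeps from pole to pole (0 to pi).
u, v = np.mgrid[0:2 * np.pi:20j, 0:np.pi:10j]  # 10j is an assumed resolution
x = np.cos(u) * np.sin(v)
y = np.sin(u) * np.sin(v)
z = np.cos(v)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_wireframe(x, y, z, color="r")  # connect the grid of points into a wireframe
```

Every (x, y, z) produced this way satisfies x² + y² + z² = 1, which is why the points form a unit sphere.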

And ultimately, we saw here how to plot in 3D, though it may be difficult to visualize how much extra empty space there is.


But if we think about it in terms of the equations,
the volume of this sphere is given by 4 over 3 times pi R cubed.
And since we're working with a cube with a side length of 2R,
it's going to have a volume of 2R, all cubed. And when we calculate the percentage of that cube,
again thinking of this as three different covariates,

we can see that the fraction of the volume outside the sphere is going to be one minus the volume of the sphere over the volume of the cube:

four over three times pi R cubed, over 2R all cubed. You cancel the R cubed terms
and end up with 1 minus pi over 6, so approximately 48 percent of your values lie outside the sphere.
So, working with that same range of negative one to one,
with that same radius standing in for one standard deviation and anything beyond it being a bit of an outlier,
we see that 48 percent of our values are now outliers,
now that we've moved up to three-dimensional space.
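As an equation, with R as the radius of the sphere and 2R as the side of the cube, the three-dimensional version of the calculation is:

```latex
\[
\frac{V_{\text{cube}} - V_{\text{sphere}}}{V_{\text{cube}}}
  = 1 - \frac{\tfrac{4}{3}\pi R^{3}}{(2R)^{3}}
  = 1 - \frac{\pi}{6}
  \approx 0.476
\]
```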


So that closes out this video. In the next video, we will continue and show you how you can generalize this to even higher dimensional space and see those percentages as we continuously increase the number of dimensions.



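016: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p16 15_Curse of Dimensionality Notebook Part 3.zh_en -BV1eu4m1F7oz_p16-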
Now, we discussed how we moved from two dimensions up to three dimensions。
When we moved from two to three dimensions,
we saw how many more values are more than one unit away from the mean value of each of our covariates,
again working between negative one and one. Before, it was 21 percent in two dimensions; then, just by adding on one more covariate with the same range from negative one to one, and the same idea that it's standardized with a standard deviation of one and a mean of 0,
we saw that 48 percent lie outside. Now, what we want to see from there is whether we can generalize up to higher dimensions.
Now, obviously, we won't be able to plot in higher dimensions。


But we can start to get an intuitive sense. The idea is that
a point within one unit of that mean is within the ball,
the sphere, whatever you want to call it. Outside of that, still using
a similar range for each of our covariates, a point outside of that ball
is more than one standard deviation away,
and we'd say that's a bit of an outlier, using the same sized covariates. So in order to do that,
what we're going to start with is a random sample. Calling np.random.sample just pulls
random points from a uniform distribution between 0 and 1. We're saying that we want the size to be 5 rows and two columns,
so we're going to have five two-dimensional points. We're then going to get the norm. Again,
this is just the distance from (0, 0): the norm is going to be the Euclidean distance from the origin.
Because we're measuring from (0, 0), the Euclidean distance is just each coordinate squared,
summed up, and then the square root of that sum.
We're calling .sum with an argument of 1 because we're going to be passing in an array,
and we want that sum for each of our individual points, that is, along each row.
So we're getting the Euclidean distance for each one of our points.
And then we want to determine using that norm whether or not we are one unit away from that mean or if we're greater than one unit away。
And no matter the dimensional space, that's going to be the way that we determine whether or not we are within the ball within that sphere or not。
So this in_the_ball function will just say
whether that value, using the norm that we just defined, is within the ball or not within the ball.
And it will return either a True or False value.

So just to see an example of this, we're going to use that sample data. We say "for x,
y in zip", and we zip together the norm values, so we can see what the norm output is for each of these sample data points,
with whether or not each one is in the ball,
and we should see anything above one being outside the ball.
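The two helper functions and the loop described above might be sketched as follows. The names `norm`, `in_the_ball`, and `sample_data` follow the walkthrough; the seeded `default_rng` generator is an assumption used here in place of `np.random.sample` so that the output is reproducible.

```python
import numpy as np


def norm(x):
    # Euclidean distance of each row from the origin: square, sum along axis 1, square root.
    return np.sqrt((x ** 2).sum(1))


def in_the_ball(x):
    # True where the point lies less than one unit from the origin.
    return norm(x) < 1


rng = np.random.default_rng(0)        # assumption: seeded generator for reproducibility
sample_data = rng.random((5, 2))      # 5 points, 2 dimensions, uniform on [0, 1)

for distance, inside in zip(norm(sample_data), in_the_ball(sample_data)):
    print(round(float(distance), 3), bool(inside))
```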




So we run this。 And first, we printed out our sample data that we randomly generated with all the points being between 0 and 1。

And we see that all of these were actually within the circle.
Here we're working in two-dimensional space, and that was a bit lucky: you see, if I run this again,
two of them happen to be outside of the circle.


Now, how would we generalize this beyond two dimensions?
We saw we could do three dimensions, but we want to do it for any number of dimensions.
So the way that we're going to do that is to create this function called what_percent_of_the_ncube_is_in_the_nball,
so what percentage of the n-dimensional cube is in the n-dimensional ball. We pass in the number of dimensions.
We can also pass in our sample size; here, we're going to use 10,000,
so we're going to generate 10,000 random points. We're then going to create a random sample again;
those will be values between 0 and 1, using a shape of 10,000 rows,
each with the number of dimensions we pass into this function.
So originally, we just did two dimensions, as we saw in our samples here. Now
we're going to move that up to 3, 4, 5 dimensions. And you can again
think of each of these rows as containing our first covariate, then our second covariate, and so on.
When we add more dimensions, all we're doing is adding on more features.

So what we're going to do is call in_the_ball for these 10,000 different values,
and then we're going to call .mean. If you think about it,
this will be outputting either True or False for each one of these 10,000 values.
True and False can be used as 1 and 0, with True being 1 and False being 0. If we take the average,
we can see what percentage actually falls within the ball.
That's how this .mean will work for us. And then we're saying "for iteration in range(100)", so that we get 100 different examples of these 10,000 points, to ensure that we converge on something close to the actual solution when generalizing to these higher dimensions.
So we end up with 100 different values for the average amount that lies within the ball versus outside the ball,
and then we take the mean of those values.




And that will give us the percentage of the n-cube that's in the n-ball.
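A sketch of the simulation function follows. The function name, the default of 10,000 points, and the 100 repetitions come from the walkthrough; the seeded generator and the helper definitions repeated here (so the block runs on its own) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded generator for reproducibility


def norm(x):
    return np.sqrt((x ** 2).sum(1))


def in_the_ball(x):
    return norm(x) < 1


def what_percent_of_the_ncube_is_in_the_nball(d_dim, sample_size=10**4):
    shape = (sample_size, d_dim)
    # Each repetition: draw points uniformly in [0, 1)^d and average the True/False
    # results (True counts as 1), giving the fraction of points inside the ball.
    fractions = np.array(
        [in_the_ball(rng.random(shape)).mean() for _ in range(100)]
    )
    return fractions.mean()


two_d = what_percent_of_the_ncube_is_in_the_nball(2)
print(round(two_d, 3))  # close to pi / 4
```

The points are drawn from the corner of the cube where every coordinate is positive; by symmetry, the fraction inside the ball is the same as for the full cube from negative one to one.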


We're then going to run this for dimensions ranging from 2 up to 15, not including 15,
so up to 14; those are going to be the different dimensions that we're going to test. And then
our data is going to be, for each of these, the percentage that is in the ball.
So we're just going to map these different dimensions into our what_percent_of_the_ncube_is_in_the_nball function,
and that will output, for each one of these values in the range,
what percentage actually lies within the ball, the sphere, the circle, whatever it is.

You see here that we also include 2 and 3. So we'll also be able to check, compared to what we saw before,
whether or not we have close approximations of the actual values,
given the calculations that we did using the actual
formulas for a sphere versus a cube and a circle versus a square.

So we say "for dim and percent". So we're just getting, say, starting with 2, the dimension, and then the output for 2 from that data.
We zip those two together and get the dimension as well as the percent within the ball.

So we see that 78 percent, or 78.5, falls within the ball at first, which makes sense
given that we saw before that 21 percent was outside of the ball.
Same with 52 percent being in the ball for three dimensions: we saw 48 percent outside above.
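The comparison just described can be sketched as below. The simulation part follows the walkthrough; the column of exact values uses the standard closed-form volume of an n-dimensional ball, which is not something the video derives and is added here only as a cross-check. The seeded generator and the name `exact_fraction` are assumptions.

```python
from math import gamma, pi

import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded generator for reproducibility


def in_the_ball(x):
    return np.sqrt((x ** 2).sum(1)) < 1


def what_percent_of_the_ncube_is_in_the_nball(d_dim, sample_size=10**4):
    shape = (sample_size, d_dim)
    return np.mean([in_the_ball(rng.random(shape)).mean() for _ in range(100)])


def exact_fraction(d_dim):
    # Volume of the unit d-ball, pi^(d/2) / Gamma(d/2 + 1), over the cube volume 2^d.
    return pi ** (d_dim / 2) / (gamma(d_dim / 2 + 1) * 2 ** d_dim)


dims = range(2, 15)
data = np.array(list(map(what_percent_of_the_ncube_is_in_the_nball, dims)))

for dim, percent in zip(dims, data):
    print(dim, round(float(percent), 4), round(exact_fraction(dim), 4))
```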


And we see how that drops off quite dramatically as we keep increasing the number of dimensions。
So as we add on more of these features,
all with similar ranges and similar standard deviations,
we see that more and more of our values tend to be outliers.


And finally we can plot this, getting a simple plot by calling plt.plot.
We're going to set our x label, our y label, and our title,
and all we're plotting is our dimensions

versus the data, which is the percentage of

points that fall within the ball. And we can see how it steeply drops off as we add on more and more dimensions.


So, just to double check our understanding: we see that this is dropping off quite dramatically.
We're also going to measure the distance from the center of our cube to its nearest point,
out of all those points that we have.

But here we're going to generate, rather than 10,000, just 1,000 points.
We can see which one of those thousand points is closest to the center,
and I will give you a little bit of a spoiler:
we will see that the closest point is farther and farther away as we increase the number of dimensions.

So this is just a bit more evidence for that same point. We're going to pass in the dimension
and our sample size, setting the default equal to 1000.
We again call a random sample, but this time, rather than leaving it from 0 to 1, we subtract 0.5,
so it's centered at (0, 0) and runs from negative 0.5 up to 0.5.

And then we return the min of the norm of each of these points. Again,
the norm is the distance from the origin.


And then, in order to estimate the closest point for a given dimension,
we can use that get_min_distance function that we just defined, which gives us the minimum distance using the norm of each of those points.
We're going to do that 100 times over, in the same fashion as before, to ensure that we have a large enough sample.

And then we're going to return not just the average of that data。

But the minimum of those minimums。

As well as the maximum of those minimums so that we can get a bit of a range of the values in regards to how far away they are from the origin。
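The two functions described here might look roughly like this sketch. The names `get_min_distance` and `estimate_closest`, the default of 1,000 points, and the 100 repetitions follow the walkthrough; the seeded generator is an assumption for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded generator for reproducibility


def norm(x):
    return np.sqrt((x ** 2).sum(1))


def get_min_distance(dimension, sample_size=10**3):
    # Uniform points in [-0.5, 0.5)^d, so the cube is centered on the origin.
    points = rng.random((sample_size, dimension)) - 0.5
    return np.min(norm(points))  # distance from the center to the nearest point


def estimate_closest(dimension):
    data = np.array([get_min_distance(dimension) for _ in range(100)])
    # Average, smallest, and largest of the 100 minimum distances.
    return data.mean(), data.min(), data.max()


mean_2, low_2, high_2 = estimate_closest(2)
mean_50, low_50, high_50 = estimate_closest(50)
print(mean_2, mean_50)  # the nearest point is much farther away in 50 dimensions
```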

So we're going to calculate this from values ranging from 2 to 100。

We're then going to map those dims into that estimate closest function that we just defined above。
And we can print this out。

And this will take just a second to run. Afterwards,
we'll also be able to plot using the same functionality that we just discussed. So we see here that,
for dimension 6, the

average value was 0.22, the minimum of those minimum values was 0.1,
and the maximum of those minimum values, across the 100 different iterations, was 0.3.
So we're going to plot those dimensions against the min-distance data: all of the rows, first column.
And then there's this plt.fill_between, which we use to plot both our min and max.
So we'll have the line of average values, and we'll also fill between the min and max values so we can see a bit more clearly what the range was as we increase the number of dimensions.

So the min-distance data, if you recall, is going to have three different values per row.

Column 0 is the mean, column 1 (the second column, since Python counts from 0) is the minimum,
and column 2 is the max. And we're setting alpha equal to 0.5 because it's going to fill between the two values,
and we want to also see that line in between.

So you run this, and we can see, as we increase the number of dimensions,
how far that minimum point is from the origin, as well as the range that we are able to show using that fill between.
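The plot described above can be sketched as follows. To keep the run time short, this version samples every fifth dimension between 2 and 100; that stride, the axis labels, the seeded generator, and the `Agg` backend are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded generator for reproducibility


def get_min_distance(dimension, sample_size=10**3):
    points = rng.random((sample_size, dimension)) - 0.5
    return np.min(np.sqrt((points ** 2).sum(1)))


def estimate_closest(dimension):
    data = np.array([get_min_distance(dimension) for _ in range(100)])
    return data.mean(), data.min(), data.max()


dims = list(range(2, 100, 5))  # assumption: every fifth dimension, for speed
min_distance_data = np.array([estimate_closest(d) for d in dims])

fig, ax = plt.subplots()
ax.plot(dims, min_distance_data[:, 0])              # column 0: mean of the minimums
ax.fill_between(dims,
                min_distance_data[:, 1],            # column 1: smallest minimum
                min_distance_data[:, 2],            # column 2: largest minimum
                alpha=0.5)                          # semi-transparent so the line shows
ax.set_xlabel("Number of dimensions")               # hypothetical labels
ax.set_ylabel("Distance from center to nearest point")
```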

So that closes out this video。 And it gave us an opportunity to look at how we can expand up into higher dimensions。

With all this in mind。In the next video, we will begin to show you the effects of working with high dimensional data when you are actually trying to use your different classification algorithms that we introduced in the last course。


All right, I'll see you there。

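017:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p17 16_Curse of Dimensionality Notebook Part 4.zh_en -BV1eu4m1F7oz_p17-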
Welcome back to the final video for this notebook。In this final video。
we're going to show how dimensionality, how high dimensionality can end up affecting model performance。

And with that, I want to quickly touch on again how we can fight the curse of dimensionality。

And two different methods that should immediately come to mind, as we discussed them in the intro to this course, are going to be feature selection,
where you would use domain knowledge to reduce the number of features。
given the ones that you think are already informative。



As well as feature extraction。 And with feature extraction,
you're going to use dimensionality reduction techniques such as PCA,
which we'll learn later on within this course, to transform our raw data into lower dimensionality data that will hopefully preserve the majority of the variability in that original data。
And again, we'll touch on this later on in the course。



So here we're going to show, by creating toy data sets, how high dimensionality will end up affecting our model performance。


In order to do so, we're going to import many libraries。Here we're doing a classification problem,
so we need our train_test_split, we're going to need our StandardScaler,
and then something new that we haven't seen yet is we're going to use this make_classification function,
which is available in sklearn.datasets, and that's just going to create a toy data set with a certain amount of classes,
and I'll show you this in practice in just a second。
And then we're going to use our DecisionTreeClassifier to ultimately predict the class。






So the first thing that we're going to do is create our classification data set。 In order to do so,
we're using this make_classification

function that I just introduced a second ago, and I'm going to show you a bit about how these arguments work。
So I'm going to create a cell above。


And of course, first we're going to have to import that library,
but we're then going to use this to create our X and y。 So now we have our X。


And our X is going to be this two dimensional data set。
The default is that there are going to be 100 samples, and you see here that it has 100 samples,
so if we were to run X.shape,
we would see that we have 100 rows and two features。

Those two features are there because we said that we want the number of features equal to two。
We're saying that the number of redundant features, ones that don't give any extra information, is going to be equal to 0。
If you imagine, often we will have redundant features, such as when we discussed。
if you're talking about age versus whether or not they're a senior。
there will be a bit of redundancy built in。



The number of informative features will be the rest of that。
so we're saying all of our features will be informative here。
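A call along the lines described here could look like the sketch below. The argument values follow the narration; `n_clusters_per_class=2` and `random_state=0` are assumptions, since the exact values used in the notebook are not stated.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,            # the default, written out for clarity
    n_features=2,             # two columns in X
    n_redundant=0,            # no features that merely repeat information
    n_informative=2,          # so both features are informative
    n_clusters_per_class=2,   # assumed value for the "clusters per class" argument
    random_state=0,           # assumed seed, only for reproducibility
)

print(X.shape)          # (100, 2)
print(np.bincount(y))   # how many samples landed in each of the two classes
```

Note that `n_redundant` defaults to 2 in scikit-learn, so it has to be set to 0 explicitly when there are only two features.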

And then the number of clusters per class will allow us to spread out that data in a way。 Now,
I'm going to plot each one of our different classes。

Along with that X, we have y, which tells us which class each one of those values belongs to。

In order to look at both of those, we're going to use a scatter plot, and we're going to scatter

our X such that y equals 1。And that'll be, on the x axis, our first feature,
and then on the y axis we're going to want our second feature。And then I'm going to use this again

to create another scatter plot。That's going to be for our y equal to 0,

and everything else the same。So here you see our two different classes,
they're differentiated fairly clearly。And just to show you how different things work。
if we were to say the number of clusters is equal to one。
so we don't have separate clusters within our different classes。


Then you see that they're very clearly separated, so adding on this extra cluster allowed them to be a little bit closer together as there were going to be separate clusters for that class。
Also, to go along with that, if instead of having both of our features being informative,
we only had

one of our features as informative, then we would see that the other one is redundant and everything lies along one axis。
So we don't create this separation, and there's no use in one of those features。
One of those features doesn't essentially add any extra value, or combined,
they don't add any extra value。
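The three variations just discussed can be put side by side with a small helper. This is a sketch: `plot_classes` is a hypothetical helper name, the seed is assumed, and the third panel assumes the second feature was generated as a redundant one (`n_redundant=1`), which is one way to get everything lying along a single line.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

def plot_classes(ax, title, **kwargs):
    """Generate a two-feature toy data set and scatter each class in its own color."""
    X, y = make_classification(n_features=2, random_state=0, **kwargs)
    ax.scatter(X[y == 1, 0], X[y == 1, 1], label="y = 1")  # first feature vs second feature
    ax.scatter(X[y == 0, 0], X[y == 0, 1], label="y = 0")
    ax.set_title(title)
    ax.legend()
    return X, y

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
plot_classes(axes[0], "2 informative, 2 clusters per class",
             n_informative=2, n_redundant=0, n_clusters_per_class=2)
plot_classes(axes[1], "2 informative, 1 cluster per class",
             n_informative=2, n_redundant=0, n_clusters_per_class=1)
plot_classes(axes[2], "1 informative, 1 redundant",
             n_informative=1, n_redundant=1, n_clusters_per_class=1)
plt.show()
```

In the third panel the redundant feature is a linear function of the informative one, so the points fall on a straight line and the second feature adds no separation.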



So that's how the make classification works。 We saw that original plot of what our data actually looks like when we're working with two features。


We're then going to add onto that a bit of noise。So we're just going to use our random state here。
We're setting a random state equal to 2, and with that object we can call rng.uniform,
and we're going to be adding on two times a bunch of random values of the same shape as X。
So we're adding 2 times something the same size as X,
so it'll add to each one of the individual data points within our 100 by 2 array。



And we're going to add on values that are between 0 and 2。
So the default for rng.uniform will be values between 0 and 1,

and we're going to multiply that by two, so they'll be values between 0 and 2。
We're then going to scale our data so that it is all, well,
so that the mean will be 0 and the standard deviation will be 1。
And now that we have our data, resetting X to the StandardScaler version of itself,
we can split it into our X train, X test, y train and y test。
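The preparation steps above can be sketched as follows. The seed of 2 for the noise follows the narration; the `make_classification` seed, `test_size=0.4` and the split's `random_state` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=2, random_state=0)

rng = np.random.RandomState(2)
X = X + 2 * rng.uniform(size=X.shape)   # uniform() draws from [0, 1), so the added noise lies in [0, 2)

X = StandardScaler().fit_transform(X)   # each column now has mean 0 and standard deviation 1

# test_size=0.4 and random_state=42 are assumed values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(X_train.shape, X_test.shape)      # (60, 2) (40, 2)
```

Scaling before splitting mirrors the order described in the video; in a real project the scaler would usually be fitted on the training split only.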



So we have our toy data set。

We can then use our decision tree classifier。And run that on X train and Y train and see what our score is for our X test and Y test。

And we see that our score from this two feature classifier is 0.875。
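Fitting and scoring the tree on the two-feature data could look like this. It is a sketch with assumed seeds and an assumed 60/40 split, so the printed score will not necessarily equal the 0.875 seen in the video.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same preparation as described above (seeds and split size are assumptions)
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=2, random_state=0)
rng = np.random.RandomState(2)
X = StandardScaler().fit_transform(X + 2 * rng.uniform(size=X.shape))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)                 # fit on the training rows only
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)    # mean accuracy on the held-out rows
print(train_score, test_score)
```

Comparing the training score with the test score gives a quick read on how much the unpruned tree is overfitting.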

Now we're going to run all the same steps, and what's important to note here is that the number of features is obviously going to be going up 100 fold,
but with that, we're also ensuring that each one of those different features is informative。

So we're not allowing for redundant features here, so we still have all of our features being informative。

We are going to run through the same steps。 Otherwise everything else is the same。 We are going to

set our rng again, setting that random state, adding on that extra noise of two times the uniform values,
and run through the steps of setting up the training set and the test set。

Then we're going to use again our decision tree classifier on our standard scaled data,
and check our score on our test set after fitting on our train set。
And we see that our score goes all the way down to 0.425。

So we see that adding on additional features, even if they're informative,
can end up leading to worse model performance,

due to the fact that it will very heavily increase the amount that the model will overfit to each one of these features。


And something to note along with this is that, as we mentioned during the lectures,
if you are going to have more features, you should try to also have more rows of data。
So if we had enough rows of data, maybe we can counteract this problem。 But generally,
if you are going to have a certain amount of rows, the fewer features you have,
the more informative each of those features can be, and the less likely you will be to overfit。





We're then going to, rather than just looking at 2 and 200, loop through values between 50 and 4000 and run through each one of these same steps。


So all the steps are going to be the same。 We're calling for n in np.linspace, starting at 50,
going in increments up till 4000, counting by 50。
And we're just going to continuously pass in that n for the number of features,


as well as setting the number of redundant features equal to zero, so that all of them will be informative。

And then everything else is the same。 We can get each one of our different scores as those are going to be appended on to this empty list。

We run this and this will take just a second to run, and then we can plot that as well。
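The loop could be sketched as below. `score_for` is a hypothetical helper, and the seeds and split size are assumptions. The video sweeps `np.linspace(50, 4000, 80)`; a shorter grid is used here only to keep the sketch quick to run.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def score_for(n_features):
    """Test accuracy of a decision tree on a noisy toy data set where every feature is informative."""
    X, y = make_classification(n_features=n_features, n_informative=n_features,
                               n_redundant=0, random_state=0)
    rng = np.random.RandomState(2)
    X = StandardScaler().fit_transform(X + 2 * rng.uniform(size=X.shape))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
    return DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

# Shortened grid: 50, 100, ..., 500 (the video goes up to 4000 in steps of 50)
feature_counts = np.linspace(50, 500, 10, dtype=int)
scores = []
for n in feature_counts:
    scores.append(score_for(int(n)))   # append each score to the initially empty list

plt.plot(feature_counts, scores)
plt.xlabel("number of features")
plt.ylabel("test accuracy")
plt.show()
```

Here `n_informative` is passed explicitly, because setting `n_redundant=0` alone does not make scikit-learn treat every feature as informative.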


Just looking across each one of the numbers of different features and seeing the classification accuracy as we increase the number of features。
Now, by chance, some of these can be a bit more accurate。
but adding features in general can very much lead to reductions in accuracy, not all the time。
but it very easily can。

So in this example, the accuracy is highly volatile with respect to the number of features, and increasing features again can reduce that accuracy。


Additionally, in our example, we specified that none of the features are redundant, and in practice, when you have this many more features,

generally speaking, you will almost definitely have redundant features。

And for example, if we are predicting customer churn, as we've discussed throughout these courses。
using a variety of customer characteristics, we may have collected extensive data say for each customer that we have across many dimensions。
and this would be an example in practice of high dimensional space。
which can make it difficult to apply unsupervised learning methods directly。
and potentially lead to issues with this curse of dimensionality as we try to create these groupings。






So that closes out our video here on the curse of dimensionality。 With that, we're going to go back to discussing different types of groupings,
different types of clustering algorithms, starting off with agglomerative hierarchical clustering,
and I look forward to seeing you there。




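018:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p18 17_Hierarchical Agglomerative Clustering.zh_en -BV1eu4m1F7oz_p18-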
Now let's talk about our next clustering algorithm, hierarchical agglomerative clustering。
With hierarchical agglomerative clustering, we'll try to continuously find and merge clusters successively until we reach a level of convergence。
Now, let's see how hierarchical agglomerative clustering actually works。

So here we're using the same example as before, and we're going to try and come up with our different clusters。

For hierarchical agglomerative clustering, we start off by looking at the points and identifying the pair
which has the minimal distance。


So notice here again that the distance becomes a very important factor in the success of our clustering algorithm,
so we need to take into account which distance metric we're actually using。


We see that these two points that we have in green are the closest。
so we color code them here to highlight that these are going to be our first pair。


And then we continue to do this again, looking for the next closest pair of points。

And the next closest pair。And we can keep doing this。

But the next closest pair can actually be a pair of clusters。
So we might have two different clusters or a cluster and a point that are going to be closest to one another。
It doesn't necessarily just have to be two different points。



Now, how we define the distance from a cluster to a point or from a cluster to another cluster will depend on our linkage criterion。


And we'll expand on this a bit later。 But for the distance to a particular cluster。
maybe it's going to be the average of points in a given cluster and the distance to that average of given points。
Or maybe it's the minimal distance between all points in a given cluster to that point。
So just taking the minimum distance。



And if the closest pair does involve a cluster, whether that's a pair of clusters or a cluster and a point,
then we can go ahead and merge them into their own cluster。



So we don't have that yet, but here we move one more step, and we see that the

blue and the green have merged together into one green cluster。

And we can continue to see that we can create more and more clusters, and we keep going,
creating each one of our pairs, moving forward。


And now we see some of them merging together, merging further together。
We have those red dots all creating their own cluster。
And now the number of clusters will start to reduce。



As we keep moving forward。Each one of them, combining together。

And at this point, looking here, we're at six different clusters。

We can run again, and we get down to five clusters as we continue to find the closest linkage。

Now we're down to four。

And again, as we continue to move up that ladder, we can continue to merge these different clusters together。
And then we're at three different clusters, then at two different clusters。
And if we were to continue this, we can end up with one large cluster。



So this means that if we allow this to continue, eventually we don't have separate clusters anymore,
so we have to come up with some type of stopping criterion when we're using agglomerative clustering。
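The merging loop walked through in this video can be sketched as follows. This is a deliberately naive illustration, not how scikit-learn implements it: it assumes single linkage (minimum pairwise distance), Euclidean distance, and a target number of clusters as the stopping criterion.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering; returns a list of index lists."""
    clusters = [[i] for i in range(len(points))]   # every point starts as its own cluster
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]
    return clusters

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
print(agglomerate(points, n_clusters=3))   # [[0, 1], [2, 3], [4]]
```

Calling it with `n_clusters=1` lets the loop run to the end, leaving a single cluster that contains every point.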



019:Hierarchical Linkage Types.zh_en -BV1eu4m1F7oz_p19-
So with the idea of using the average distances of all the points within their respective clusters,
how do we go about actually finding our stopping point?
So let's say we're at this stage and we have five clusters as we see here,
and each one of those clusters is color coded as we move forward。

So at this stage, we can look at the average cluster distance for each one of our clusters,
which we have marked here with the same colors that we just saw in that two dimensional plot。
So we have each of the average distances, and with that we have our gray dotted line,
which marks the point where we are going to stop, once all of these average distances are above that line。

So in the next iteration, we find that the light purple and magenta clusters are going to be merged。
Therefore, that average cluster distance for that particular cluster should go ahead and increase。

So we can visualize this change in that average cluster distance as follows。
For that new combined cluster, we now have this average cluster distance that we see that is higher than the previous two。
so before we had the light purple and the magenta。
we merge those to that higher version of magenta and we see that we have a higher average cluster distance。
And now we only have four remaining clusters, and as a whole。
they're a bit closer to that limit set to that gray line。

In the next step, that purple cluster is going to merge with the teal cluster in that top right corner。

And the new cluster that forms, combining that teal and purple, is now above that threshold。
Now we don't stop at this point, though。 We are only going to stop once the minimum is above that threshold,
and the minimum average cluster distance is still not above that threshold:
we still have the pink and magenta below that threshold。

Now, in this next step, once we move to two clusters,
the magenta cluster and the pink cluster merge together to create this new pink cluster。

And finally, once we have merged these two, all the cluster distances are above this threshold。
They are big enough to therefore claim that the algorithm has finally converged。
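scikit-learn offers a related, though not identical, threshold-based stopping rule: with `n_clusters=None` and a `distance_threshold`, merging stops once the linkage distance between the closest remaining pair reaches the threshold. The sketch below uses made-up points and an arbitrary threshold of 5.0, both assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well separated groups of three points each (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# n_clusters must be None when distance_threshold is given
model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0, linkage="average")
labels = model.fit_predict(X)

print(model.n_clusters_)   # 2
print(labels)
```

Within each group the average linkage distances stay near 1, while the distance between the two groups is far above 5, so merging stops at two clusters.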

Now we mentioned earlier that we would want to merge clusters at some point that are closest to one another。
But that idea of which cluster is closer is a bit of an ambiguous concept。
Especially when there are going to be multiple points belonging to each one of these different clusters。
Now there are several methods to measure that distance between these clusters。
and these different methods are called the different linkage types。
The first example that we have here is single linkage,
and that's going to be the minimum pairwise distance between our different clusters。So,
given that we have our different clusters that we have here on our data,
it's going to be the distance between the two closest points,
say one from the teal cluster and one from magenta。
And we can see the blue lines that connect each one of these,
according to which is going to be the minimum distance between a certain point in the magenta and a certain point in the teal。
And we take that distance between those specific points and declare that that will be the distance between those two clusters。
And then we try to find, for all these pairwise linkages, which one is the minimum,
and then we would combine those together as we move up the hierarchy。
Now, we will talk through many different types of linkages。
A pro to the single linkage, or the minimum pairwise distance between clusters,
is that it can help in ensuring a clear separation between clusters: any points within a certain distance of one another end up together, so it has clear boundaries。
But a con of this single linkage will be that it won't be able to separate out cleanly if there's some noise between the two different clusters。
So it'll be very easy to be skewed by certain outliers falling close to certain clusters。

Now another linkage type is going to be called the complete linkage。And with complete linkage,
instead of taking the minimum distance, given the points within each cluster,
we would take the maximum value。 So taking the furthest distance between each pair of clusters,
and from those maximum distances, decide which one is the smallest。
And then we can move up that hierarchy, reducing here from four clusters down to 3。
Now, a pro of this method is that it will do a much better job of separating out the clusters if there's a bit of noise or overlapping points between the two clusters,
unlike with the single linkage。But a con of this is that it can tend to break apart larger existing clusters, dependent on where that maximum distance of those different points may end up lying。

Alternatively, we can also take the average of all the points for a given cluster and use those averages, or those cluster centroids as we've been introduced to, to determine the distance between our different clusters。
Now, the pros and cons of using the average can kind of be seen as an average between the pros and cons of using the single and complete linkage, in that it may also break up those larger clusters and may also be a bit drawn towards noise, but it can also do a better job than either the single linkage or the complete linkage in regards to the cons of each。

And then finally, we have the Ward linkage, and the Ward linkage is going to compute the inertia。
So if you recall, the inertia is going to be the sum of the squared distances between each one of our different points and their centroids。
And it picks the pair that's going to ultimately minimize that inertia value,
so trying to minimize that sum of squares of the distances to their cluster centroids。
So in that sense you can think of it as something similar to K-means in trying to come up with the new
combining of the different clusters。And again, the pros and cons of Ward will be similar to the average, in that it will
balance out both the pros and cons of the min and max linkage。
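The linkage types above can each be written as a small function of two clusters. This is a sketch with hypothetical function names and Euclidean distance assumed. Two "average" variants are shown because the centroid-based description in the video differs slightly from scikit-learn's `average` linkage, which averages all pairwise distances.

```python
import numpy as np

def pairwise(a, b):
    """Matrix of Euclidean distances between every point in a and every point in b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def single_linkage(a, b):
    return pairwise(a, b).min()      # closest pair of points

def complete_linkage(a, b):
    return pairwise(a, b).max()      # furthest pair of points

def average_linkage(a, b):
    return pairwise(a, b).mean()     # mean over all pairwise distances

def centroid_distance(a, b):
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))   # distance between cluster means

def ward_cost(a, b):
    """Increase in inertia (within-cluster sum of squares) if a and b were merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * centroid_distance(a, b) ** 2

a = np.array([[0.0, 0.0], [0.0, 2.0]])
b = np.array([[3.0, 0.0], [3.0, 2.0]])
print(single_linkage(a, b))      # 3.0
print(complete_linkage(a, b))    # sqrt(13), about 3.61
print(centroid_distance(a, b))   # 3.0
print(ward_cost(a, b))           # 9.0
```

At each step, an agglomerative algorithm would evaluate the chosen function for every pair of clusters and merge the pair with the smallest value.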

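020:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p20 19_Applying Hierarchical Agglomerative Clustering.zh_en -BV1eu4m1F7oz_p20-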
Now, in order to do this in practice, to do this in Python。
it'll be very similar steps to what we've seen so far。
We will start by importing our class here called AgglomerativeClustering。

We're then going to create an instance of the class,
so we say agg equal to AgglomerativeClustering with our different hyperparameters。

We can choose the number of clusters。 Here, we set the number of clusters equal to three so that it'll keep building up until we get to three clusters。
We then have the option to choose our distance metric, and here you see we chose Euclidean:
with affinity equal to euclidean, we use the Euclidean distance。


And we can also define what our linkage will be going through the different linkages that we just discussed are available。
we can choose which one we'd like to use for our current clustering algorithm。


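And then as before, we would fit the instance on the data and use it to get the cluster assignment for each point。 Note that, unlike K-means, scikit-learn's AgglomerativeClustering offers fit_predict on the data it was fitted on rather than a separate predict for new data。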
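In code, the steps above might look like the sketch below. The data points are made up for illustration. The distance argument is left at its Euclidean default because its name differs across scikit-learn versions (`affinity` in older releases, `metric` in newer ones).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three small, well separated groups (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [5.0, 6.0],
              [10.0, 0.0], [10.0, 1.0]])

agg = AgglomerativeClustering(n_clusters=3, linkage="ward")  # Ward linkage works with Euclidean distance
labels = agg.fit_predict(X)   # fit and return one cluster label per row of X
print(labels)
```

The integer values of the labels are arbitrary; what matters is which rows share the same label.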


Now let's recap what we went over here in this section。

In this section we introduced the hierarchical agglomerative clustering method and how we can use it to slowly build up to larger and larger clusters。
And this method becomes very useful in business practices when you may want to also see these subgroups that build up to these larger groupings。


We then discussed stopping conditions and how you may either have a predetermined amount of groups in mind or a predetermined amount of clusters in mind,

or you can say to continue up until you reach a threshold for the minimum average of our cluster distances。
And finally, we went over different linkage types,
including single linkage, using the closest points to determine the distance between clusters,
complete linkage, using the furthest points to determine the distance between clusters,
average linkage, and Ward linkage, which finds the combined clusters that least increase the amount of inertia。
That closes this video on hierarchical agglomerative clustering,
and in the next video we're going to dive into our next clustering algorithm, DBSCAN。 All right,
I'll see you there。


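021:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p21 20_DBSCAN Algorithm.zh_en -BV1eu4m1F7oz_p21-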
Our next unsupervised learning algorithm we are going to cover is density-based spatial clustering of applications with noise, or DBSCAN。
And the noise part is going to be important, as this is one of the few approaches that truly clusters our data rather than partitioning it。
And it will help us find outliers rather than putting them all into different clusters。
And we'll see how in just a bit。Now, let's cover the learning goals for this section。
In this section, we're going to discuss how this DBSCAN clustering algorithm actually works in finding our different clusters。

We will also discuss the input arguments and their importance for determining our clusters,
as well as discussing the outputs of our DBSCAN algorithm。

And finally, we'll close out by discussing the strengths and weaknesses of working with the DBSCAN algorithm。

So let's start here with a quick introduction to what DBSCAN is。 As we mentioned,
a key part of this clustering algorithm is that it truly finds clusters of data rather than just partitioning our data and thus works better when we have noise in our data set。
We know that outliers will show up in most of our data sets。
and in reality we should be able to create our clusters and say that these outlier points do not belong to any of these clusters。

Now, the basics of how DBSCAN works is that we are working under the assumption that points in a cluster should be within a certain distance of one another, within a certain neighborhood。
So we would randomly select points from these higher density regions and slowly expand our clusters。
And as we expand, we only include points that are within a certain distance of the points that have iteratively already been included within that cluster,
given that distance that we're using from point to point。
And the algorithm ends when no more points are within that distance of the clusters already identified。
And thus all points will have been classified as either belonging to a particular cluster or otherwise,
they would be noise。 Now, this is all high level。 And in just a few slides we'll make sure to visualize how this actually works in practice。




Before we get to those visualizations, though, let's talk about the inputs for DBSCAN,
as these inputs will be of utmost importance to getting our clusters identified correctly。So first,
as we've seen repeatedly with all our clustering algorithms。
we have to define the distance metric used to define our similarity between our different points。

Then we have to define the epsilon。As we mentioned,
we are starting at random points and then, using those points, we are determining if other points are within a certain distance。
And if they are, they become part of the cluster。Now a point
is going to be considered part of the same cluster if its distance is within a certain epsilon range。
So that's going to be our epsilon: how close a point needs to be to be considered part of that cluster。
Then we have n_clu, more often seen as min_samples, which is actually the argument name used in sklearn。

And this argument, this input, will be the minimum number of points in a particular point's neighborhood for it to be considered a core point of a cluster。


And core points are going to be defined by this n_clu argument。

And they're going to be defined as those points that have at least n_clu neighbors,
including itself。

So if we set n_clu equal to 3, that means that that point has at least two other neighbors that are within that epsilon distance。
A non core point can still be a part of the cluster if it's in the neighborhood of a core point。
But to understand this, let's dive a bit deeper into the different classification of points given our DB scan model。


So there are three possible labels for any given point。

First, we have our core point, which we just defined as any point that has at least n_clu neighbors, including itself。
And all clusters will require at least one core point。

We then have density reachable or border points。And these will be points that can be reached from a core point

but may have fewer than n_clu neighbors themselves。

These will still be a part of the cluster as long as they are in the Epsilon neighborhood of a core point。

And then finally, we have noise and noise is going to be a point that is not part of any cluster。

And that would be one that is not a core point itself and has no core points in its epsilon neighborhood。

So if we have n_clu equal to 4 and three points are within epsilon of one another and no others are nearby,
none of these three are going to be core points, and thus they are all going to be identified as noise。 And again,
we'll visualize this a bit more clearly in the videos to come。

And with those possible labels, for any point, we identify clusters as the connected core and density reachable points within our data set。
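The three labels can be computed directly from their definitions. This is a sketch: `classify_points` is a hypothetical helper, Euclidean distance is assumed, and a point counts itself as a neighbor, as described above.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' following the definitions above."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dist <= eps                                  # epsilon neighborhoods; each point includes itself
    is_core = within.sum(axis=1) >= min_samples           # at least min_samples neighbors
    near_core = (within & is_core[None, :]).any(axis=1)   # some core point lies within eps
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 10.0]])
print(classify_points(X, eps=1.0, min_samples=3).tolist())
# ['border', 'core', 'core', 'border', 'noise']

# The case from the narration: n_clu = 4, three mutually close points, nothing else nearby
trio = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])
print(classify_points(trio, eps=1.0, min_samples=4).tolist())
# ['noise', 'noise', 'noise']
```

In the first example the two end points of the chain have only two neighbors each, so they are border points attached to the core points next to them.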

So that closes out this video, and in the next video we're going to turn to that visualization
that I keep promising we're going to see, to clearly understand how the DBSCAN algorithm works。
All right, I'll see you there。



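022:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p22 21_Visualizing DBSCAN.zh_en -BV1eu4m1F7oz_p22-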

So as promised, let's start to visualize how DBSCAN actually works。

And as we have in our past clustering algorithms, we're going to start with this two dimensional data set and we're going to come up with clusters depending on the visits and the recency and how far away each point is from one another。
So we start at a random point。 Here we have this point in pink。
And then we look at the radius epsilon around that point, and we'll have to define that epsilon, and here we define it as 1.75。
And we create that 1.75 epsilon circle, and we look around, and we see:
are there enough points, given our n_clu, within that circle to start a cluster?
And we see that there are four points, and since we again include that point itself,
with that point we get up to 5。 So we now have our first cluster。
So every point within that epsilon is going to be part of our first cluster。




And then we process each new point in the same way。
So we move on to our next point here and anything within that radius。
within that epsilon radius gets included as part of that cluster。


And we keep moving along。And then here we see this point。 It is part of the cluster
because it's near one of the core points: as we looped through,
we saw that the point to the right of it, within that epsilon radius, was a core point with four points。
This one is not a core point。But it is density reachable, so it will be part of our cluster,
but we will highlight that this particular point is going to be a border point,
a density reachable point, and not one of our core points。

So we label that in a lighter pink。And we keep going down this chain, adding on points
according to those that fall within epsilon。

Keep running through and we see these are all core points because they all have at least four points including themselves in there。
And we move along and then this point only has three。 So this one again。
is going to be a border point, but it is near one of the core points。
so it will count as part of the cluster still。


But we see we highlight that in light pink, and we see we can keep moving along。

And eventually, we have

all of our points within the cluster。 We've searched along all the points。
And then if there are no neighbors left, we will randomly try a new,
unvisited point to potentially start a brand new cluster。

And when we do that, here we start with the blue, and we want to check,
is this going to be a core point once again?

So we check again within epsilon of this new random point that we picked out。

We see that it is a core point, now we have started our new cluster。
Now this point again is going to be that density reachable point。
but it will still be part of the cluster because it's near another point that is a core point。

And we can continue to move along to build out our cluster here。

And you see again, we have a density reachable point。 We've had a couple so far,
but all those are near core points, so they still are going to be

part of our cluster。And then we see here that, with n_clu equal to 4,
we only have three within this neighborhood。 So this is going to be a density reachable point,
but not a core point。

And then when we move over to this point over here, we see that the only one within that radius
is going to be that density reachable point。 So there are no core points within this radius。

So if there are no core points within this radius, then this becomes a noise point。
It becomes an outlier。 So this isn't part of either of our two clusters and is labeled as an outlier point,
which is why we have marked it here in gray。
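The walk-through above can be condensed into a short from-scratch sketch. It is an illustration under assumptions, not scikit-learn's implementation: points are visited in index order rather than at random, Euclidean distance is used, and the example data is made up.

```python
import numpy as np

NOISE, UNVISITED = -1, -2

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN sketch; returns one label per point, with -1 for noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # each list includes i itself
    labels = np.full(n, UNVISITED)
    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE            # may be relabeled later as a border point
            continue
        labels[i] = cluster              # i is a core point: start a new cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])   # j is a core point too, so keep expanding
        cluster += 1
    return labels

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [10.0, 12.0], [20.0, 20.0]])
print(dbscan(X, eps=1.0, min_samples=3).tolist())   # [0, 0, 0, 0, 1, 1, 1, -1]
```

The first point is initially marked as noise because it has only two neighbors, then gets pulled into cluster 0 as a border point once its core neighbor is expanded.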



Now, I want you to take a moment。 And given that DBSCAN method that we just walked through,
notice which points tended to be the core points, as we have them labeled in a darker hue,

which ones were those density reachable points, which are still part of our cluster
but don't have the number of points that would make them a core point, given our n_clu,

and then which point we have labeled as an outlier。

Now that we understand how the DBSCAN algorithm works,
let's discuss some strengths and weaknesses of working with the DBSCAN algorithm。


So as we saw with the DBSCAN algorithm, we do not need to specify the number of clusters, as DBSCAN will automatically determine the clusters dependent on how close points are to one another。


It also allows for noise and will not automatically determine that outliers are part of a particular cluster。

It'll also do a strong job of handling arbitrary shapes,
as it's going to be searching out points that are within epsilon distance of one another and will stop whenever a gap occurs,
no matter what the shape of that boundary between the clusters is。



Now some weaknesses。

It's going to require two parameters, which means we need to search over more possible values to find that optimal solution。

Also, those hyperparameters can be very difficult to fine tune in higher dimensional space。

And then finally, it will not do well with clusters of different density。
So even if we have two clear groups, if for one group the points are about five units away from one another and for the other they are one unit away, depending on our distance metric,
then depending on the distance between those two clusters,
it may be difficult to determine the differentiation between those two clusters。



Now let's walk through how the DBSCAN algorithm can actually be used in Python。
So first things first, we import the class containing our clustering method,
so from sklearn.cluster we import DBSCAN。

We then create an instance of that class and pass in the necessary hyperparameters。 Here,
we're setting epsilon equal to 3 and min_samples equal to 2。
So that's that n_clu that we've been talking of, and epsilon is the epsilon we've been talking of,
that distance from a point within which another point must fall in order to count toward it being a core point or to be included within the cluster。


We're then going to fit the instance on the data。So just calling db.fit。

And then we can't call db.predict because of the way that the algorithm actually works。
If you recall, it's defining the points iteratively by scanning through each one of the different points within that data,
so it's just creating clusters within that fitted data; you can't call predict with DBSCAN。

If you wanted to cluster a larger data set, then you just include it in that fit,
and then you can come up with the different clusters。So we get our db.labels_。

And just to note, for those labels, we're going to have class 0, class 1, and so on,
and any outlier,
as we saw can happen with DBSCAN, will be labeled negative one。
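The syntax walked through above can be sketched as follows. The toy array `X` is made up for illustration; `eps=3` and `min_samples=2` are the values mentioned in the video.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one far-away point.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# eps is the epsilon radius; min_samples is the minimum number of points
# (the point itself included) needed to make a core point.
db = DBSCAN(eps=3, min_samples=2)
db = db.fit(X)

# There is no predict method; the cluster assignments live in labels_.
clusters = db.labels_
print(clusters)

# Outliers are labeled -1, so they are excluded when counting clusters.
n_clusters = len(set(clusters) - {-1})
n_outliers = int((clusters == -1).sum())
print(n_clusters, n_outliers)
```

To assign clusters to additional points, one option is to refit on the combined data, since the labels only exist for the data that was passed to `fit`.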

Now let's recap what we learned here in this section。In this section,
we discussed the DBSCAN algorithm and how it will come up with its own clusters dependent on which points are within a certain distance of the other points。

We then discussed the inputs and their importance, especially that of the epsilon and n_clu chosen,
as well as the outputs and understanding the difference between a core point,
a density reachable point, and just outliers or noise。



And finally, we discussed some of the algorithm's strengths and weaknesses, such as it being able to better determine clusters of arbitrary shapes
but perhaps having difficulty determining clusters that may have different densities。
Now this closes out our discussion on DBSCAN, and in the next video we'll introduce our final clustering algorithm,
mean shift clustering。All right, I'll see you there。


023:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p23 22_平均漂移算法.zh_en -BV1eu4m1F7oz_p23-
Here we will be discussing our final clustering algorithm, the mean shift algorithm。Now。
let's go over the learning goals for this section。

In this section, we're going to cover the mean shift clustering algorithm and how we use the concept of moving towards the highest density to help determine our different clusters。
And then we're also going to discuss the strengths and weaknesses of working with the mean shift algorithm。

Now the mean shift algorithm will work similarly to K means in that we will be partitioning our points according to their nearest cluster centroid。


For K means, though, the centroid represented the mean of all points within that cluster。

While with mean shift, that centroid is going to be the most dense point within the cluster。
which in principle, can be anywhere in that cluster。


And the algorithm will assign points to a cluster by moving to the densest points within a certain window。


So how do we calculate this local density to say where the highest density point is?

In order to do so, we're going to calculate the weighted mean around each point。

So what do we mean here when we are asking for the weighted mean?

We can think of the weighted mean as assigning more weight to those points closer to the original point within our window。

So say we select this black point to start。

We calculate the weighted mean in the local neighborhood or within this window, this pink square。

And it would find that the densest point, given the weighted mean, would be here in pink。
And note on the side that the new mean does not have to be at a data point

and can be somewhere else within this window。So how do we go about using this to create our different clusters?

So the steps are going to be: first, you choose a point and a window。
So we saw that window size, and we start at a random point。
Second, we calculate that weighted mean within that window。

And third, we shift the centroid of the window to the new mean。 So we shift that square
so it's now perfectly around that new weighted mean that we just found, that new denser point。


Fourth, we repeat steps 2 and 3 until convergence, until there's no shift,
meaning that we have reached the local density maximum, and we'll call this the mode。


And then we repeat steps 1 through 4 for all data points, until finally
data points that lead to the same mode will all be grouped together in that same cluster。
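The steps above can be sketched in plain Python. This is an illustrative sketch rather than the implementation used in the course: it assumes a Gaussian weighting over all points, with `bandwidth` playing the role of the window size, and the helper names (`gaussian_weight`, `weighted_mean`, `find_mode`, `mean_shift`), the tolerances and the toy data are all made up here.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Weight that shrinks as a point gets further from the current mean."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def weighted_mean(center, data, bandwidth):
    """Step 2: kernel-weighted mean of the data around `center`."""
    weights = [gaussian_weight(math.dist(center, x), bandwidth) for x in data]
    total = sum(weights)
    return tuple(
        sum(w * x[d] for w, x in zip(weights, data)) / total
        for d in range(len(center))
    )


def find_mode(start, data, bandwidth, tol=1e-6, max_iter=500):
    """Steps 2 to 4: keep shifting until the mean stops moving."""
    center = tuple(start)
    for _ in range(max_iter):
        shifted = weighted_mean(center, data, bandwidth)
        moved = math.dist(center, shifted)
        center = shifted  # step 3: re-center the window on the new mean
        if moved < tol:
            break
    return center


def mean_shift(data, bandwidth, merge_tol=1e-2):
    """Run the search from every point and group points by the mode reached."""
    modes, labels = [], []
    for x in data:
        mode = find_mode(x, data, bandwidth)
        for idx, known in enumerate(modes):
            if math.dist(mode, known) < merge_tol:
                labels.append(idx)
                break
        else:
            modes.append(mode)
            labels.append(len(modes) - 1)
    return modes, labels


# Hypothetical toy data: two small groups far apart from each other.
data = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
modes, labels = mean_shift(data, bandwidth=1.0)
print(labels)
print(modes)
```

Points that climb to the same mode, up to the small `merge_tol`, receive the same label, which is the grouping rule described in the last step.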


So let's visualize how this actually works in practice。So we start with a centroid at a given point。

And then given that window, we sample that local density,
and then we follow the gradient towards the denser direction。
So we keep moving towards the highest density。 So we keep recalculating where that densest point is,
and we create our new window around it, and we see we move along each one of our data points

until ultimately, we find the local density maximum and we stop there。

We can do this again, starting at another point。We can sample the local density and again。
follow that gradient towards the denser direction。
And we see that we move along towards that densest direction。 And again。
we end up finding that same local maximum。 So we would assign those both to the same cluster。

We can do this again, starting at another point, this time, starting further away。

At a point that will probably lie outside this cluster。
We sample that local density and follow the gradient towards the denser direction。

And you see that it moves along as we move towards that denser direction。
and then it finds that local density maximum and stops there。


And to keep going, we can start at each one of the different points, sample that local density。

Follow the gradient towards that denser direction。 And here again。
we see that that point finds the same local maximum。
So we would end up labeling it as the same cluster。


And we keep going like this。And eventually, it's going to find for us 4 unique local maxima。
So we see them laid out here, each one of our four local maxima。

And it is going to assign the points to the centroids that they fall into。
So we see here that all of the pink fall under that pink centroid。
We see all the teal values falling next to that teal centroid and all the blue values falling under that blue centroid。
And now we have, as well, the purple with its purple centroid。
And we have our four different clusters。

And no number of clusters needs to be defined up front; we only set the window size。
It's just going to move towards that densest direction and figure out those clusters for us。


Now, let's home in a bit on what we mean here by this weighted mean,
that mean that we keep moving towards as we get higher and higher density。


So that new mean is going to be calculated using the sum over points within the window。
And we see this in both the numerator and the denominator。

We're also going to have this weighting or this kernel function that's going to allow us to give a certain weight。
according to how far each one of these different points are from the previous mean。



And we see that in the numerator, we weight each point。
So we take the distance of that point from the previous mean, and those that have a lower distance will have a higher weight。


And the common kernel that's used is going to be the RBF kernel,
that is, the Gaussian kernel,
giving more weight again to those values that are closer and less weight,
according to the normal distribution, for those values that are further away。
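Written out, the update described here is a kernel-weighted average. The formula below is a standard form of the mean shift update and not a copy of the slide. The symbols are notation chosen for this note: $x$ is the previous mean, $x_i$ are the points in the window $W(x)$, $K$ is the kernel and $h$ is the bandwidth.

```latex
m(x) = \frac{\sum_{x_i \in W(x)} K(x_i - x)\, x_i}{\sum_{x_i \in W(x)} K(x_i - x)},
\qquad
K(u) = \exp\!\left(-\frac{\lVert u \rVert^2}{2h^2}\right)
```

The sum over the window appears in both the numerator and the denominator, and the Gaussian kernel makes the weight fall off as the distance $\lVert x_i - x \rVert$ grows.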


Now let's talk about some strengths and weaknesses of working with a mean shift。

The mean shift is model free。 It does not assume the number or the shape of each one of our clusters。
So that's going to be a pro that we didn't see when we worked with something like K means。


We can use just one parameter。 We don't have to tune over more than one parameter like we did with DBSCAN,
that parameter being the window size or the bandwidth。

And it will be robust to outliers。 We have that window size, so far-away points won't affect the result,
and it can have those outliers outside of each one of our different clusters。


Some weaknesses。

The results will heavily depend on our window size。
So it's going to depend on the bandwidth that we choose, and selection of that bandwidth is not going to be an easy thing to decide in general。
And also, finally, it can be slow to run。The complexity is going to be proportional to m times n squared,
where m is going to be the number of iterations that it has to do and n the number of data points。
So the more data points that it goes through, the more and more expensive it is going to be。
You see that it's n squared complexity。 So if we have a large data set,
this may take a while to converge。




Now let's walk through the syntax that you need in order to perform mean shift using Python。
So first thing that we want to do is import the class containing that clustering method。
So from sklearn.cluster, we import MeanShift。

We then create an instance of this class, setting ms equal to MeanShift。
And we pass in our parameter bandwidth equals 2。

So again, our window here will be equal to two。

And then we fit the instance on the data, and we can use that to predict clusters for new data。
So we call ms, that instance of our class, .fit on X1, so it finds our clusters using X1。

And then we can call ms.predict on X2 to see which clusters they fall under given the new data。
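A minimal sketch of that syntax follows. The arrays `X1` and `X2` are made-up toy data standing in for the training data and the new data. Note that scikit-learn's `MeanShift` uses a flat kernel by default, so every point inside the bandwidth gets equal weight.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical toy data standing in for X1 (training) and X2 (new points).
X1 = np.array([[1, 1], [2, 1], [1, 0], [4, 7], [3, 5], [3, 6]])
X2 = np.array([[0, 0], [5, 5]])

# bandwidth is the window size; it is the one parameter we tune here.
ms = MeanShift(bandwidth=2)
ms = ms.fit(X1)

train_labels = ms.labels_      # cluster of each training point
new_labels = ms.predict(X2)    # nearest learned centroid for each new point

print(ms.cluster_centers_)
print(train_labels, new_labels)
```

Unlike DBSCAN, mean shift ends up with explicit centroids (`cluster_centers_`), which is why new points can be assigned with `predict`.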

So to recap。In this video, we talked about the mean shift clustering algorithm and how we use the concept of a window,
as well as the densest point within our window, to find the different centroids of our clusters。


And we discussed the algorithm's strengths and weaknesses。
such as not needing to define the number of clusters。
as well as understanding that this model will have a higher overall complexity。


So with that, we close out our different clustering methods。And in the next video,
we will compare and contrast all the different methods that we discussed and which ones are best to use for which use cases。



024:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p24 23_算法比较.zh_en -BV1eu4m1F7oz_p24-
In this video, let's briefly bring together the different clustering algorithms that we've introduced。
And discuss some of the pros, the cons, and the use cases for each one。
So what will we cover here in this section?In this section。
we'll go over a review of the clustering approaches that we went through throughout this course。
We'll then summarize and compare each one of these different approaches。
As well as providing some guidelines for choosing which approach is best。
given the business case that you are working with。

So let's review the clustering algorithms discussed in this course so far。First, we have K means。
And recall that with K means, we were going to have to predetermine that number of clusters that we're looking for。
And once we do so, our clusters will depend on coming up with some mean value that is trying to reduce the distance from our centroids or that mean of that cluster to each one of the different points within that cluster。
With that in mind, we will get the results that we see here for the shapes given and we see that it doesn't do a perfect job of getting shapes that aren't necessarily spherical。
and we're going to dive a bit deeper into the pros and cons of each in just a bit。
but this is just a recap and an intro to what we're working with with each of the models that we had introduced。
So next, we have the mean shift, which does not require us to set that number of clusters as we had to do with K means,
but rather will iteratively move towards those densest points given a window, and we'll get the results that we see here under mean shift。
And notice that for both K means and mean shift, they are going to heavily favor more of a spherical shape and may not have quite the flexibility to find different shapes。
Next, we have Ward。 And what we mean here by Ward is the agglomerative hierarchical clustering with Ward as the linkage type between our clusters。
Recall that Ward linkage specifies distance between clusters as the new combined inertia of those clusters。
And since we are linking closest clusters when we work with hierarchical clustering。
We have a bit more flexibility in combining clusters of different shapes。
But some noise can throw this off, as it did in our two circles example above。
And while we can set the means by which we want clusters determined,
they do not quite get determined of their own accord, as we saw with mean shift
or as we will see in this next one with DBSCAN。So finally, we have DBSCAN,
which will find those points which are closest to one another in order to create those clusters。
And this will both create its own clusters, so you don't have to predetermine the number of clusters,
and be able to identify clusters of different shapes。 Now,
this may seem like DBSCAN should always be the one to go with。
But we'll dive a bit deeper into what can make DBSCAN a bit more difficult and at times not the ideal candidate。
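To get a feel for this comparison, here is a hedged sketch that runs the four algorithms on scikit-learn's two-moons toy data, a non-spherical shape. The data set and every hyperparameter value (`bandwidth=0.8`, `eps=0.2`, `min_samples=5`, and so on) are choices made for this illustration and are not taken from the slide. The adjusted Rand index (ARI) compares each result against the known moon membership, which we only have because the data is synthetic.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MeanShift
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half circles with a little noise.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

models = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "mean shift": MeanShift(bandwidth=0.8),
    "ward": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # ARI is 1.0 for a perfect match and near 0 for a random assignment.
    scores[name] = adjusted_rand_score(true_labels, labels)
    print(f"{name:>10}: ARI = {scores[name]:.2f}")
```

On this particular shape DBSCAN tends to follow each moon, while K means cuts the plane with a straight boundary. A different data set, for example one with clusters of very different densities, could reverse that ranking.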

So let's dive deeper, starting with K means。With K means, if we use mini batch to find our centroids and clusters,
this will find our clusters fairly quickly, so it will run with fairly low complexity compared to the other models。
If we don't already know how many clusters we are looking for, then with K means
we're going to have to search through our K values and use something like the elbow method that we introduced to determine that number of clusters。
It'll generally be a bit more skewed to finding even sized clusters when we work with K means。
And it's not going to work well with non spherical cluster shapes。
as we'll be looking at distance from the centroid in every single direction as we move towards that mean。
and therefore, we'll only be able to find more spherical shapes。
which is why it doesn't do a great job with these different shapes that we have here。

Next, we have mean shift, and with mean shift, we do not have to guess K; that number of clusters will be determined for us。
Also, mean shift will do a fairly good job of finding uneven cluster sizes。
It'll simply be moving towards that highest density,
given a specified bandwidth, so we can find uneven clusters。 They don't have to be even by any means,
such as what we saw with K means。Now it can be slow with a lot of data。
We said that K means with the mini batch can run very fast。
The mean shift can tend to be a bit slow if we have a lot of data。
as it's going to be searching for points for highest local density for every single point。
It will do a good job of finding a lot of clusters if they exist in the data set。
So if you think that there are a lot of clusters, this may be a good choice。
It will not do a great job of finding weird shapes, as again。
we are looking for closeness in every direction within a certain window so tend to go towards more spherical shapes。
And it will be limited to using the Euclidean distance within its formulation。
so we don't get to use these other metrics, these other distance metrics that we introduced earlier in the course。

Now we move to hierarchical clustering here with ward。
And the strength of hierarchical clustering really comes into play when we want to get a full hierarchy tree and see how some groups may be subgroups of others。
Now, you do have to come up with some means of deciding the number of clusters on your own。
whether that's choosing the numbers directly or with a minimum average distance threshold。
as we saw in our course on hierarchical clustering。It will often find uneven cluster sizes,
as we can easily have a tiny cluster of one or two points that are far away from the rest。
There are going to be many different distance metrics and linkage options that can be chosen。
which may make it difficult to fine tune this type of model。
And it can end up being very slow to calculate as the number of observations increases。
So this also will have fairly high complexity。

Now with DBSCAN, it seems you can often get the best of both worlds if you choose the right parameters。
But finding those correct parameters can prove to be a difficult task。Now, with DBSCAN,
it will be able to find clusters of uneven sizes。As long as it reaches the n_clu amount that was predefined,
it will create a new cluster。So if n_clu is equal to 4, then
as long as you have four points within that epsilon radius, you will create a new cluster。
It will work with distance metrics of your choosing,
so you're not limited to just Euclidean distance。DBSCAN will be able to easily move along a cluster in small steps,
thus being able to find clusters of uneven shapes。
Now there is a danger if you choose too small of an epsilon, that you will have too many clusters,
which is probably not ideal or trustworthy for most business cases。And finally,
the main disadvantage is that it can have great difficulty determining clusters of different densities。

Now to bring it all together, I would say take a look at this page if you're ever trying to decide which one of the different clustering approaches to use。
If you look at the parameters, for K means you just need to choose the number of clusters。For mean shift, you choose the bandwidth,
which may be a little bit difficult to fine tune。For hierarchical clustering,
you choose the number of clusters, but you can also visualize the clusters that are created as they grow one on top of the other,
so it becomes a bit easier to choose that number of clusters。
And then the neighborhood size could be fairly difficult to choose when you're working with DBSCAN。
Now the scalability of each。With K means, you can scale to a very large number of samples,
so very large data, but you probably want a medium amount of clusters, not too many clusters,
and this is using mini batch, which will help speed things along。
Mean shift will not be quite as scalable with the number of samples,
so as we increase the number of samples, it tends to take quite some time as the complexity increases。
With hierarchical clustering, you can use a large, so not very large like K means,
but large number of samples, as well as a large number of clusters。
And then DBSCAN again will scale to quite a large number of samples and a medium amount of clusters。
Now we have here the different general use cases。But I want to skip more to the applications。
All I want to highlight for the general use cases is that again with DBSCAN,
you can also use it for outlier detection; unlike the others,
it'll do a good job of determining those outliers。Now in regards to the applications,
we have that for K means, you can find a few clusters of roughly the same size。
I would say this is a quick and dirty way。If you know the number of clusters that you're looking for,
then this may be a good way to get started in your clustering of your data set。With mean shift,
it can identify the number of clusters on its own。 So if you don't know the number of clusters,
this is a good choice, often used in video。And also again, if you don't know those clusters,
this is good for a business case, especially if DBSCAN may be difficult to fine tune or if you have clusters of different densities。
And then hierarchical clustering will be good for business cases where you may want to find the subgroups as well,
so if you don't just want the groups but the subgroups that build into those groups。
And then finally, DBSCAN, that's often used for computer vision applications,
but also for business cases where you don't know the number of clusters
and they are of similar density; then you can use DBSCAN to identify those clusters for you。

So just to quickly summarize, what we went through here were different clustering techniques, where clustering is just unsupervised learning,
meaning we don't have labels。But we can come up with groupings of our data to see if there are different segments of our data that can be clumped together。
And we discussed several approaches that were possible, such as K means,
hierarchical agglomerative clustering, DBSCAN, and mean shift,
and all this can be implemented using scikit-learn。 And if you're interested in learning more about the different hyperparameters that can be passed in or even more clustering methods,
I would suggest looking at the link, and feel free to dive deeper and experiment with everything that you have there available。
Now, just to recap。In this section, we had a review of the different clustering approaches that we've discussed throughout this course。
We summarized and compared each one of the different clustering approaches and then finally provided some guidelines for choosing which approach is appropriate for the given business situation。
That closes out our section here on clustering。 And in the next videos。
we will move on to dimensionality reduction。 All right, I'll see you there。

025:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p25 24_聚类笔记本第1部分.zh_en -BV1eu4m1F7oz_p25-
Welcome to our lab here on different clustering methods。
We have here the same data set that we used earlier on,
in that we will be looking at the wine quality, and the data contains various chemical properties of the wine,
such as acidity, the sugar levels, the pH levels, the alcohol levels。
It also contains a quality metric, 3 through 9 with 9 being the highest, and a color,
either red or white。And we're going to be using these chemical properties,
so everything besides quality and color, in order to cluster our wine。And with that in mind,
we can then see whether or not our clusters relate to the wine being either red or white。
And we'll see that in practice later on in this notebook。
First thing that we want to do is import all the necessary libraries。
We're also going to change our directory here to data,
and we're going to pull in our data, winequalitydata.csv。

We call data.head, and we take the first four rows。 And you see here that we transpose it。
And this is just for a bit of readability。 So really fixed acidity, volatile acidity,
those are going to be our column names。 And we get a quick peek at each one of our different columns。
And what may stand out is the different scales of each one of these different features,
as well as the fact that color is going to be a string。 So the rest are numerical,
and this one is a string。 And this will come into play in just a second。

We then look at the shape of our data, and we see that we have 6497 rows, with just 13 columns。
And then we're going to look at the data type, like I just mentioned, for each one of our different columns。
And the reason why this is important is because for K means to work,
as well as most scikit-learn algorithms, we're going to need for our data to all be numerical。
Otherwise we won't be able to pass it through。So when we check this out,
we are fortunate to see that all of the values we need are floats。Again,
we are not going to be using quality or color when we create our clusters。
All the other values are going to be floats。 They are not going to be objects like we see with color,
or even integers。

We're then going to check the value counts for our different wine colors。
as well as for our different qualities。 Again, those qualities ranging from 3 to 9。


So we check the color, and we see the majority of our data set is going to be white wine。
We can even pass, as we've seen earlier, normalize equals True, and we can see that
a bit over 75% of our data set is going to be white wine, whereas about 24.6% is going to be red wine。


And then we can check this out in terms of the value counts for the quality。
And we see that the majority of our quality is going to center around that 5,
6 and 7 value, with very few very low quality and very few very high quality。
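The value-count checks described here can be sketched with pandas as follows. The four-row `data` frame is a made-up stand-in for the wine data, so the proportions differ from the ones quoted in the video.

```python
import pandas as pd

# Hypothetical stand-in for the wine data: just the two label columns.
data = pd.DataFrame({
    "color": ["white", "white", "white", "red"],
    "quality": [6, 5, 6, 7],
})

# normalize=True returns shares instead of raw counts.
color_share = data["color"].value_counts(normalize=True)

# sort_index orders the quality counts from lowest to highest score.
quality_counts = data["quality"].value_counts().sort_index()

print(color_share)
print(quality_counts)
```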

Now we want to look at a histogram breaking down the quality by red and white wine,

given our data set。So we're going to first initiate these colors, red and white。 Now,
these are just going to be objects pointing to a certain color。 So from sns, that's seaborn,
we're going to pull our color palette, and we want red to be associated with index 2, which is the third value because of Python's zero-based indexing, and then the white object points to the color palette's fifth value。


We're then going to explicitly tell our histogram what our bin range is going to be。
We don't want to combine any one of our values。 We saw that 9 up here could end up being very low。
especially when we split it between red and white wine。
and we want to ensure that we have a separate bin for every single one of our different quality values。

With those different objects created。

We're then going to initiate our axis, so our bounding box, using plt.axes。
And then we're going to zip together this list of red and white,
which is just the strings red and white, as well as the red and white colors that we initiated up here。
So red will associate with red in that first iteration through the for loop,
and white will associate with this white in the second iteration。
And that will be color for the string and plot_color for these different objects that we have here。

We are then going to take a subset of our data。So we say data.loc,
and we want to locate where the color is equal to the color that we have specified,
either red or white。 And then we only want the column of quality。
We're just taking a histogram of the quality。We then call q_data, that's our subset, .hist。
And we say, again, we want to use just the bins that we specified above。 So 3, 4, 5, 6, 7, 8, and 9。
We set alpha equal to 0.5, because we want it to be somewhat see-through,
as we will be plotting one histogram on top of the other。We then set
where we want to plot this: ax equals ax, that axes that we initiated earlier。
And then the color that we're going to use is going to be that plot_color,
which is either going to be this red object or this white object that we had defined up here。

And then we're just going to create a label for our legend later on。
labeling the white as white and the red as red, using this string。

We then want our legend。 We want our x label and y label。We set our x limits。
Our x ticks are going to be in between each one of these values, so at 3.5, 4.5 and so on,
and our labels are just going to be the different bin range values, 3, 4, 5, and so on。So you run this。
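A rough sketch of this overlaid histogram follows. It is not the notebook's code: the `data` frame is randomly generated, plain matplotlib color names replace the seaborn palette entries, and the bin edges run from 3 to 10 so that each quality value from 3 to 9 gets its own bin (an assumption about how the bins are meant to behave).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine data.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "quality": rng.integers(3, 10, size=300),  # integers 3..9
    "color": rng.choice(["red", "white"], size=300, p=[0.25, 0.75]),
})

# Bin edges 3, 4, ..., 10: one bin per quality value.
bin_range = np.arange(3, 11)

fig, ax = plt.subplots()
for color, plot_color in zip(["red", "white"], ["tab:red", "tab:gray"]):
    # Subset the quality column for one wine color at a time.
    q_data = data.loc[data["color"] == color, "quality"]
    # alpha=0.5 keeps both histograms visible where they overlap.
    ax.hist(q_data, bins=bin_range, alpha=0.5, color=plot_color, label=color)

ax.legend()
ax.set(xlabel="Quality", ylabel="Occurrence", xlim=(3, 10))

# Put each tick in the middle of its bin and label it with the quality value.
ax.set_xticks(bin_range[:-1] + 0.5)
ax.set_xticklabels([str(q) for q in bin_range[:-1]])

fig.savefig("quality_histogram.png")
```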

And we can see here our different breakdown of red and white wine, and see that
red is slightly more centered around this 5, 6。And the white wine has a higher peak at that 6。
So more values of 6, but otherwise, somewhat of a normal distribution,
whereas red is going to be a little bit flatter with kind of a bimodal 5 and 6 in regards to the red wine quality。

We're then going to, in question 2, examine the correlation and skew of our relevant variables。
So everything except for color and quality。 We're not going to drop these。
but we do want to exclude these when we look at our cross correlations between each one of our different values。
And that's because we're going to be using our cluster algorithms without either of these two values。
And then on top of that, we're going to perform any appropriate feature transformations or scaling。
Now, what's important is we have to recall that we are using distance metrics when we use our K means or any one of our clustering algorithms。
And something that wasn't mentioned throughout lecture is the importance of scale with distance metrics。
And this should have already clicked as we were thinking through what we've done with each one of our supervised learning models。
If we are using distance, it will be of utmost importance that each one of our features are going to be on the same scale。
We don't want any one of our different features being more heavily favored or causing further distance than the other one。
So we just want their variation to be changing what our clusters will look like。
rather than their actual magnitudes or their built in values。
And then we're going to finally examine the pair wise distribution of the variables with pair plots to verify the scaling and the normalization efforts that we went through。
So we're also going to make sure that there's a normal distribution that just makes things a bit cleaner so that we don't have a heavy skew in one direction when we take each one of our distance metrics。
So we're specifying here that our float columns are going to be all of our columns。
except for color and quality。

We're then going to create our correlation matrix by just specifying that we only want those columns。

And calling .corr。And then finally, just to make sure that
we're not seeing each one of our different correlations with themselves,
which would obviously have a correlation of one, since every value with itself will have a correlation of one,
we're replacing them across that diagonal。 So for x in range of the length of our float columns,
so however many columns we're using, we're going to replace the diagonal using .iloc。
And with both of these index values being the same, we are going to be zeroing out each one of the different values in our correlation matrix along that diagonal。

So let's see what this looks like。Again, you see that 0,0 for fixed acidity。
and this will come into play because we're going to look at the highest correlation between each one of our different features。
And we can see here it's a little bit difficult to quickly visualize what those highest values are。

So to get those pairwise maximal correlations, we're going to call corr_mat.abs, so taking the absolute value,
because whether it's negative or positive correlation, we want to see if they're highly correlated。
And then we call .idxmax。

And we see that fixed acidity is most highly correlated with density,
and we could also even just call .max if we wanted to see what the maximum values are。
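The correlation steps just described can be sketched like this. The data frame is synthetic, with `density` deliberately built from `fixed_acidity` so that the pair shows up as strongly correlated; the column names and the names `float_columns` and `corr_mat` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine data.
rng = np.random.default_rng(0)
n = 200
fixed_acidity = rng.normal(7.0, 1.0, n)
data = pd.DataFrame({
    "fixed_acidity": fixed_acidity,
    "density": 0.99 + 0.002 * fixed_acidity + rng.normal(0, 0.0005, n),
    "alcohol": rng.normal(10.5, 1.2, n),
    "quality": rng.integers(3, 10, n),
    "color": rng.choice(["red", "white"], n),
})

# Everything except the two label columns goes into the clustering.
float_columns = [c for c in data.columns if c not in ("color", "quality")]

corr_mat = data[float_columns].corr()

# Zero out the diagonal so a column is never its own best match.
for i in range(len(float_columns)):
    corr_mat.iloc[i, i] = 0.0

# For each column: the most correlated other column, and how strong it is.
strongest_partner = corr_mat.abs().idxmax()
strongest_value = corr_mat.abs().max()

print(strongest_partner)
print(strongest_value)
```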
And we see some high correlations between certain values。And the reason why this is important
is, if you recall when we discussed earlier in the lecture, as well as in the last lab,
if we have high correlation between different values,
then we start to reach that problem of high dimensionality,
and we know that that causes a problem whenever we're working with distance metrics as we are with most of our clustering algorithms。
So we're not going to do anything there。 There are some fairly high correlations,
but not high enough for us to exclude certain values。
But that is the reason why we'd want to start to investigate how high the correlations are between each one of our different values that we're going to use for our unsupervised model。

We're now going to look at the skew of each one of our columns。
So we can just call for those float columns。We're looking at the skew, so we just call .skew。
And recall that 0 means no skew, a positive value means a right skew,
and a negative value means a left skew, meaning it's not normally distributed。
Right skew means heavy right tail。 Left skew means heavy left tail。

And then we're going to sort those values from highest to lowest,
and then we're just going to take those that are above 0.75。
So we look at that and we see each one of these values。 We are saying that above 0.75
is a heavy skew。In order to help to correct that skew,
we're just saying for col in each one of these skew columns,
only taking the index values because we don't care about the actual values。
We just care about each one of these column names。


We're going to change that data to the log version of itself。
And that will help normalize our features。
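A minimal sketch of the skew check and log transform follows, on synthetic data. The transcript only says "log"; using `np.log1p` (the log of one plus the value, which stays defined at zero) is an assumption here, as are the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: one heavily right-skewed column, one symmetric one.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "residual_sugar": rng.lognormal(mean=1.0, sigma=0.8, size=500),
    "pH": rng.normal(3.2, 0.15, 500),
})
float_columns = ["residual_sugar", "pH"]

# Skew per column, highest first, keeping only the heavily skewed ones.
skew_columns = data[float_columns].skew().sort_values(ascending=False)
skew_columns = skew_columns.loc[skew_columns > 0.75]
skew_before = skew_columns.copy()

# Only the index (the column names) is needed for the transform.
for col in skew_columns.index:
    data[col] = np.log1p(data[col])

print(skew_before)
print(data[float_columns].skew())
```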

So we run this and we've replaced all of those columns。And then on top of that,
as I mentioned before, it's of utmost importance that all of our features are on the same scale。
So in order to ensure this, we are importing StandardScaler, as we've done before, from sklearn.preprocessing。
We set sc to that StandardScaler object。
We call fit_transform on all of our float column data,
so we're going to replace all of the float columns in our data。

And then we're going to investigate this briefly and see that our values are now all on a similar scale。
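A minimal sketch of that scaling step, assuming two made-up columns on very different scales in place of the wine features。The variable name sc is how the object is referred to in this sketch; the column names are hypothetical。

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
data = pd.DataFrame({
    "small_scale": rng.normal(0.5, 0.1, 500),
    "large_scale": rng.normal(100.0, 30.0, 500),
})
float_columns = ["small_scale", "large_scale"]

# Fit the scaler and overwrite the float columns with their scaled values.
sc = StandardScaler()
data[float_columns] = sc.fit_transform(data[float_columns])

# Each column should now have mean ~0 and standard deviation ~1.
summary = data[float_columns].agg(["mean", "std"])
```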

Finally, just to make sure that we get a visual of what these actually look like。
we're going to run the pair plot。

In order to run our pair plot, we want all of our float columns, as well as color。
And the reason why we want color is that we are going to break apart our scatter plots,
as well as our histograms, by color, to start to investigate
that natural differentiation between our different features and the color values。

We're just ordering it white, then red。And then we're just setting our palette here。
We want red to be equal to the red that we defined earlier, and then for white
we're going to use gray as the coloring。Now I'm going to run this,
and pair plots generally take a bit of time to run,
so I'm going to pause the video real quick and we'll come back once it has finished running。

Now, we ran this pair plot, and we did that to see the relationship between each one of our different features。
But on top of that, because we're also looking at the breakdown between red and white,
we can look at just these two features。And again, we have more than two features, or two dimensions, to work with when we create our K-means clusters,
but even with these two features, we can begin to see that there is somewhat of a clustering between the red and the white wines。

So we can see that there probably will be a pretty clean classification given our data that will show us which wines are red and which are white without actually having those labels available。


Now that closes out question number two in our video here。In question number three,
we will start to fit a K-means model and see what kind of clusters we actually come up with, without the labels for our data。



026:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p26 25_聚类笔记本第2部分.zh_en -BV1eu4m1F7oz_p26-
Now, for question number 3 here, we're going to continue by fitting our first K-means clustering model。
We're going to use two clusters, without identifying the red and white;
we're not going to have the color included in the data we fit on。And then we're going to examine the clusters
against the red and white wines, to see if the model automatically clusters
according to this red and white differentiation。So what we do is we import, from sklearn.cluster,
our KMeans model。We're then going to initiate our model and say that we want two clusters。
Recall that with K-means we need to say how many clusters we want。And then we call km.fit
on just our float columns, so not including either the quality column or the color column。
We then call km.predict on those same columns, and we set that equal to its own column within the data set。

And we'll see why we do that in just a second。And once we do that, we can call data dot head。

And we can see all the way here at the end that we create this new column that's either 0 or one。
And we're going to see how that relates to this color column to see if all the reds were identified as 0 and all the whites as one。
So in order to do that, we're going to only take the subset of columns of color and K means K means being the one we just defined。
We're then going to group by each one of these objects, so we're aggregating by both of them。
And then dot size is just going to give us the count of that breakdown。
Now it's going to be a pandas Series, so we're just changing it to a DataFrame,
and we're renaming that column; by default, it will name the column 0,
so we're just calling it number。
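The steps above can be sketched as follows。This uses two synthetic, well-separated groups as stand-ins for the red and white wines, so the feature names f1 and f2 and the group sizes are assumptions and the counts will not match the notebook。

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two well-separated synthetic groups standing in for red and white wines.
red = rng.normal(loc=-5.0, scale=1.0, size=(100, 2))
white = rng.normal(loc=5.0, scale=1.0, size=(300, 2))
data = pd.DataFrame(np.vstack([red, white]), columns=["f1", "f2"])
data["color"] = ["red"] * 100 + ["white"] * 300
float_columns = ["f1", "f2"]

# Fit on the float columns only; color is never shown to the model.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
km.fit(data[float_columns])
data["kmeans"] = km.predict(data[float_columns])

# Count rows for every (color, cluster) pair.
counts = (data[["color", "kmeans"]]
          .groupby(["color", "kmeans"])
          .size()
          .to_frame()
          .rename(columns={0: "number"}))
```

Which group ends up labeled 0 and which 1 is arbitrary, so the thing to read from counts is whether each color falls mostly into a single cluster。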

And we run this。And we can see that for cluster 0, the majority of them are going to be red wine,
with only 87 whites being identified as 0。And for cluster 1,
only 23 reds were identified as 1, and 4811 whites were identified as 1。
So we can see that it did a pretty good job, without any labels, of separating our data set into two different clusters that are very highly related to our red and white groups。

Now we're going to fit K-means models with the number of clusters ranging from 1 to 20。And now, with this,
we are assuming that we don't know the value of k that we want;
we don't know how many clusters we want。And for each model,
we're going to store the number of clusters, as well as the inertia value。
And then we're going to plot the number of clusters versus the inertia and see if we can find that elbow that would identify the best number of clusters given our data set。
So we start with an empty list, and then we range from values from 1 to 20。

We call KMeans, and we initiate it with that number of clusters。We then fit it on our float columns,
and then we take our km_list and keep appending to it a pandas Series that holds the number of clusters,
which is just the for-loop value at that point, and the inertia for this fitted model。
And then we can also save the model as well, just the full-on model,
if we want to access that later。So I'm going to run this, and this is going to take just a second to run,
so again I will pause the video and we'll come back when it's done running。Okay, that has now run,
and we now have our km_list of the different cluster counts and their inertias, as well as their models。
For that list, if we think about each pandas Series, recall that clusters, inertia and model are going to be
the indices for that Series。So we're going to concatenate those Series together using axis equals 1,
and then we're going to transpose the result so that our column names are going to be clusters,
inertia and model, and we'll have a row for each one of our different cluster counts,
with the respective inertia for each of those cluster counts。
We're then only going to take clusters and inertia, so once we have those as our columns,
we're only selecting those two columns。We're setting our index to clusters,
which is going to be the number of clusters, and that will allow us to easily call plot_data,
which is now the pandas DataFrame that we have created here, dot plot,
and say that we want markers connected by lines,
the markers being o's here。And then we just want our xticks to go from 0 to 20,
and then our x limits to go from 0 to 20。We run this, with our x label being cluster and our y label being inertia。
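Here is a minimal sketch of that loop and reshaping, assuming three synthetic blobs instead of the wine data and only 10 values of k to keep it quick。The plotting call is left as a comment so the sketch runs without matplotlib。

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Three synthetic blobs, so an elbow near k = 3 is expected.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0.0, 1.0, size=(100, 2)) for c in centers])

km_list = []
for clust in range(1, 11):
    km = KMeans(n_clusters=clust, n_init=10, random_state=42).fit(X)
    # One Series per model: cluster count, inertia, and the fitted model.
    km_list.append(pd.Series({"clusters": clust,
                              "inertia": km.inertia_,
                              "model": km}))

# Each Series becomes a column; transposing gives one row per cluster count.
plot_data = (pd.concat(km_list, axis=1)
             .T[["clusters", "inertia"]]
             .set_index("clusters"))
inertias = plot_data["inertia"].astype(float)

# plot_data.plot(marker="o", ls="-") would draw the elbow curve (needs matplotlib).
```

On data like this the inertia drops sharply up to three clusters and then flattens, which is the elbow shape being looked for。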
And we try to see if there's any strong elbow。It doesn't seem like there's quite that;
maybe there's a bit of one at 4, where the inertia stops declining as quickly。
So maybe you choose 4。But probably the best approach
is to use what you already know about the groupings in your data,
as we did with the quality of the wine, where we knew that there were seven different values from 3 to 9,
or, if you knew that there's red or white wine, you choose two as your k。
Now that closes out our discussion here with K-means。In the next question and in the next video,
we are going to start to discuss using agglomerative clustering to create our different clusters。
All right, I'll see you there。


027:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p27 26_聚类笔记本(选修部分)第3部分.zh_en -BV1eu4m1F7oz_p27-
Hi and welcome back for question number 5 of our notebook here。In this question,
we're going to fit an agglomerative clustering model with just two clusters。
We're then going to go ahead and compare the results of agglomerative clustering to those of K-means,
and then also compare that against the red and white wines and see if the numbers and the groupings seem to be the same。
We're then also going to visualize the dendrogram, and the dendrogram is going to show those subgroups building up to the larger groups that we saw during lecture when we talked about agglomerative clustering。
So we're going to see how we can do that in Python as well。
So the first thing that we want to do is import AgglomerativeClustering。
We're going to create our object here and pass in the arguments。
We're going to say that we want two clusters, so we're specifying the number of clusters equal to 2。
If we want, we can also pass in the distance threshold as an argument here。If you do that,
if you want to try that back at home, you just have to make sure to set the number of clusters equal to None。
You have to specify either the number of clusters or the distance threshold; you cannot do both。
Here we'll use the number of clusters。We want it to compute the full tree;
if you didn't, the tree building would get cut off early to save computation time。
So if you want to save computation time, you could set this to False,
but this will run fairly quickly, and it'll allow us to see everything that built up within our tree。
And then we're setting our linkage to ward。And again,
that linkage means that at each step we merge the pair of clusters that increases the inertia
the least compared with any other possible merge。So once that has been initiated,
we're then going to fit it to our data, just using the float columns as we did before。
And we're going to add that in as another column within our data, so we had the kmeans column before,
and now we're also going to have the agglom column。So I run this,
and this will take just a second to run, but not too long。Now we have our data。
And we can then use the same method that we saw before。
So we're going to take our data and take the subset of the color column,
the agglom column that we just created, and the kmeans column that we created earlier。First,
we're going to group by color and agglom and see the counts。So we run this。

And we see again that the model was able to group the red and white wines appropriately。
So we're able to see that for red, only 31 were of agglom class 0 and 1568 were of agglom class 1,
whereas the majority of white was classified as agglom class 0 here。
So we have the zeros and ones very highly related,
very highly correlated with our red and white wine。
And that was a similar story when we worked with K-means as well。
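A rough sketch of the agglomerative fit and the comparison with K-means, again assuming two synthetic groups in place of the wines。The column names agglom and kmeans follow the narration; f1, f2 and the group sizes are made up for illustration。

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(4)
red = rng.normal(-5.0, 1.0, size=(100, 2))
white = rng.normal(5.0, 1.0, size=(300, 2))
data = pd.DataFrame(np.vstack([red, white]), columns=["f1", "f2"])
data["color"] = ["red"] * 100 + ["white"] * 300
float_columns = ["f1", "f2"]

# Specify n_clusters (not distance_threshold); ward merges the pair of
# clusters that adds the least within-cluster variance at each step.
ag = AgglomerativeClustering(n_clusters=2, linkage="ward", compute_full_tree=True)
data["agglom"] = ag.fit_predict(data[float_columns])
data["kmeans"] = KMeans(n_clusters=2, n_init=10,
                        random_state=42).fit_predict(data[float_columns])

by_agglom = (data.groupby(["color", "agglom"]).size()
             .to_frame().rename(columns={0: "number"}))
by_kmeans = (data.groupby(["color", "kmeans"]).size()
             .to_frame().rename(columns={0: "number"}))

# The 0/1 numbering is arbitrary, so the two methods agree if their labels
# match everywhere or are flipped everywhere.
agree = (data["agglom"] == data["kmeans"]).mean()
same_partition = agree in (0.0, 1.0)
```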

Now the numbers are a bit flipped, but it doesn't really matter whether a cluster is labeled 0 or 1; that's arbitrary。
What matters is the fact that they are separating the wines out into specific classes。So for red we see 1576 versus 23 with K-means
and 1568 versus 31 with agglom, so maybe not quite as good there, and agglom maybe didn't do quite as well for the white wine either,
but it still did a good job of separating these two classes。


And then if we want to look at both of these in total, this will be a little bit difficult to read,
given that flip between 0 and 1 and also just having this multi-index。
I would suggest just looking at the top two tables that we just discussed。
But you can dive deeper and see, for red wine, when we had a given agglom label,
how much of the K-means labels were in agreement。And these would be in agreement, agglom 1 and K-means 0, at 1563。
And you can break it down accordingly and take a deeper dive into where the mismatches may have happened。




So again, though the clusters are not identical, the clusters are very consistent within a single wine variety。
either red or white。Now we're going to plot out our dendrogram。

And I don't want to walk through all the different pieces of code;
this is just for plotting out dendrograms, and it's all you will really need in order to use this moving forward。
But your fitted model should have this children_ attribute, which will help us identify the breakdown of our model。
We use this hierarchy.linkage that we imported from scipy.cluster,
which will allow us again to create what we need to pass into the dendrogram that we're going to use。
We're going to initiate our figure and our axes, and we're going to create the colors that we want to use
and set the link color palette, so how we're going to color each of these links,
and you'll see this red and gray come into play in just a second once we've plotted it out。
And then we call hierarchy, which is what we imported here,
dot dendrogram to plot out our dendrogram。Now, Z is equal to that hierarchy linkage object that we
created just here above。There are some important arguments, but first
let me run this so we can see what this looks like before going through the arguments。
So we see the dendrogram, and we see how it broke down from side to side。
And we also see that this went down a certain number of levels。
This didn't go all the way down to the bottom。If we wanted to see all the way down to the bottom,
we can change that, and it would take some more time to plot。
But we also see the counts, because we said show_leaf_counts equals True。
We can see the number that shows up in each one of these different subgroups,
so how many rows ended up in each of these subgroups。Now, if we wanted to see less data,
we could set this p equal to something like 10。And I run this again,
and now, if you count the bottom lines that we have here, there are only 10 lines,
so it's breaking it down so you can see up until there are only 10 subgroups left。
And that's dependent on using lastp as the truncate mode。You can also write level here,
and then p says how many levels down you want to go。So just to highlight,
this is about two levels down。If we were to run this just one level down,
we can see that it just breaks out into these two subgroups。Again,
I changed the p and the truncate mode at the same time in order to control how much of that dendrogram we actually want to visualize。
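To make the truncation options concrete, here is a small sketch on synthetic points。Note one swap from the notebook: this computes the ward linkage directly from the feature matrix with scipy rather than from the fitted model's children_, and it passes no_plot=True so that only the layout is computed and matplotlib is not needed。

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-5.0, 1.0, size=(40, 2)),
               rng.normal(5.0, 1.0, size=(60, 2))])

# Ward linkage computed directly from the features (an assumption of this sketch).
Z = hierarchy.linkage(X, method="ward")

# truncate_mode="lastp" keeps only the last p merged groups as leaves.
den_lastp = hierarchy.dendrogram(Z, truncate_mode="lastp", p=10,
                                 show_leaf_counts=True, no_plot=True)

# truncate_mode="level" limits how many levels below the root are expanded.
den_level = hierarchy.dendrogram(Z, truncate_mode="level", p=1,
                                 show_leaf_counts=True, no_plot=True)

# "ivl" holds the leaf labels; with show_leaf_counts they look like "(12)".
leaf_labels = den_lastp["ivl"]
```

Dropping no_plot=True and passing an axes object would draw the same trees, which is closer to what is shown in the video。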
Now, we're going to stop here。In the next video,
we're going to discuss how you can actually incorporate these different clusters into creating your different models,
seeing the performance of each, and then closing out with another walk-through of the performance with different numbers of
clusters。All right, I'll see you there。


028:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p28 27_聚类笔记本第4部分.zh_en -BV1eu4m1F7oz_p28-
Now, in this question, we are going to explore the idea of using clustering as a form of feature engineering。
So the first thing that we need to do is create a variable that we're going to try to predict, since when we're doing our feature engineering
this will now be for supervised learning。So we are going to create a binary target variable y,
which is just going to denote whether or not the quality is greater than 7。
So greater than 7 will be equal to 1, and 7 or less will be equal to 0。
We're then going to create a variable called X_with_kmeans,
and that's going to be built from our original data, so it's going to be a pandas DataFrame。
We're going to take that data with everything that we've worked with so far。If you recall,
we added on agglom as a column as well as kmeans as a column。So we'll drop quality, color and agglom,
which will leave that kmeans column。So we have all of our float columns plus that kmeans column。
And then we're going to create another pandas DataFrame, which is X_without_kmeans,
and that's just taking what we just created, X_with_kmeans, and dropping the kmeans column。
And then for both data sets, we will use StratifiedShuffleSplit with 10 different splits。
We will fit 10 different random forest classifiers, and with that
compute the ROC AUC score of these 10 classifiers,
find the average for each data set, and see which performed better:
the one with kmeans or the one without kmeans。

So in order to do so, we're going to have to first import our RandomForestClassifier。
We will also import roc_auc_score, as well as StratifiedShuffleSplit。
So hopefully you recall all of that from the course when we did supervised learning。
We're then going to create our target variable, which is just when the quality is greater than 7;
we set that equal to 1。So if we run just this part, the quality greater than 7,
that will return either True or False。Calling astype int converts that True to 1 and the False to 0。
We then initiate our object X_with_kmeans, which is just going to be the data set that we have built up so far,
but dropping agglom, color and quality。So we keep the kmeans column and our float columns。
And then X_without_kmeans will take that X_with_kmeans that we just defined
and drop the kmeans column。So now we have these two different pandas DataFrames。
One is just the float columns, which is X_without_kmeans,
and one is the float columns with that kmeans column as well, which is X_with_kmeans。

We're then going to initiate our stratified shuffle split object。

And then we're going to define this function, which will allow us to pass in an estimator。
And that estimator, spoiler alert, will be a random forest classifier,
but we'll see how we'll use this again for logistic regression as well。



And then it takes an X and a y, so our different features and then our outcome variable。So first,
we initiate an empty list of ROC AUC scores, and that's because, if you recall,
we're going to create 10 different values and then take the mean of those values。
So we'll append each of those values to this empty list。We take train index and test index
for the values in our sss.split, on the X and y that we passed into our function。
And because this sss is defined to have 10 different splits, when we run this for loop
we are running through 10 different iterations
of different stratified shuffle splits。So these are different splits of our data that have ensured there's a stratification,
meaning that a consistent proportion of rows with quality greater than 7 shows up in each one of our train and test sets。
So then we set X_train and X_test using those train indices and test indices,
and we set y_train and y_test with those train indices and test indices。
We can then call that estimator that we're passing into our function
and call dot fit on the training set that we defined。
And then we can come up with our actual prediction,
which is going to be estimator.predict on our test set, on our holdout set。
And we can do the same for our predicted probabilities。If you recall, if we want that ROC AUC score,
then we need the predicted probabilities to actually compute it。
So we get the probabilities; that's going to output the probabilities for both of the classes。
We only want the positive class, so we're taking all rows, but only the column at index 1,
not the column at index 0。And that's going to be our scores。
And then we can compute, for each one of our different iterations,
the ROC AUC score from our actual values, that's y_test,
as well as the scored values that we just computed。
And we will continuously append that to our empty list so that we get
all 10 different ROC AUC values。We then take the mean of that list,
and we will have the average of the ROC AUC scores across those 10 different splits。
So now that we have that function defined, which will output that average across the 10 splits,
we can set our estimator here to RandomForestClassifier,
so we have estimator equal to this object。

We pass that in to the function that we just defined, along with X_with_kmeans,
so this is with the kmeans column。And that's going to be our X, as well as our target column y。
And then we're going to do the same thing, running that function with the same estimator,
except on X_without_kmeans, so on our data set without that extra column。We run this。

And we can see that the model without the kmeans cluster column actually did worse than the one with it。
So we performed better when we had our K-means cluster as an input into our random forest。
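The scoring function walked through above can be sketched like this。The data here is synthetic, with an informative column a standing in for the extra feature, so it illustrates the mechanics of comparing two feature sets rather than the notebook's actual result; the function name and the column names are assumptions of this sketch。

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(6)
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=["a", "b", "c", "d"])
# Imbalanced binary target driven by column "a", plus noise.
y = ((X["a"] + 0.5 * rng.normal(size=400)) > 1.0).astype(int)

sss = StratifiedShuffleSplit(n_splits=10, random_state=42)

def get_avg_roc_10splits(estimator, X, y):
    """Average ROC AUC of the estimator over the 10 stratified splits."""
    roc_auc_list = []
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        estimator.fit(X_train, y_train)
        # Column 1 holds the probability of the positive class.
        y_scored = estimator.predict_proba(X_test)[:, 1]
        roc_auc_list.append(roc_auc_score(y_test, y_scored))
    return np.mean(roc_auc_list)

estimator = RandomForestClassifier(n_estimators=50, random_state=42)
avg_with_signal = get_avg_roc_10splits(estimator, X, y)
avg_without_signal = get_avg_roc_10splits(estimator, X.drop(columns="a"), y)
```

Because the function only needs an estimator with fit and predict_proba, the same call works unchanged with a logistic regression。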

Now, what I'd like to do is explore the idea of changing the number of labels that we incorporate when we create this new feature, or this new set of features if we think about it in terms of one-hot encoding。
So we're going to say, for n equals 1 through 20, we fit a K-means algorithm with n clusters,
so first one cluster, then two clusters, three clusters, and so on。And we then have to one-hot encode the labels, because otherwise label number 19 will be thought of as greater than label number 5 or label number 10。
So instead we want each of those one-hot encoded so that there's no ordinal value attached to the different labels。

And once we have our one-hot encoded version of that column,
we're then going to fit a logistic regression model and compute that average ROC AUC score。
And then we're going to plot that average ROC AUC score for each one of our different numbers of clusters。
So I'm going to run this while I explain it because it may take a little bit of time。
But the way that we start off is that we're going to set X_basis equal to just those float columns。
We're then going to initiate our StratifiedShuffleSplit with 10 splits, as we did before。
We're then going to define this new function, create_kmeans_columns。So as I mentioned,
we can't just create that one column with multiple labels; we have to one-hot
encode those labels。So we say km equals KMeans with the number of clusters equal to whatever n we pass in。
We're then going to fit on just our float columns。And then when we call km.predict on our X_basis here,
we're actually outputting each one of these different labels。So if n was equal to 20,
we'd have values starting from 0 up until 19 for our 20 different clusters。
We then take that column that we just created, and we call pd.get_dummies on that column。
And now we create one column per label, each holding a 1 or a 0
depending on whether that row was assigned that label。
We then concatenate just those float columns with the new kmeans columns that we defined,
so that may be up to 20 columns that we're adding on。And then once we have this data frame,
the idea is that we will be able to pass that in as our data frame and then fit our models。
So we initiate our estimator as LogisticRegression。We say that ns,
the numbers of clusters that we want to run through, are 1 through 20。
We're then going to get our list of ROC AUC values
by calling that get_avg_roc_10splits that we defined in the cell above。
We pass into that the estimator。Our X value is going to be the output of create_kmeans_columns。So remember,
this will actually output a pandas DataFrame that concatenates onto the original float columns our new labels,
one-hot encoded。And we use that same target variable, for each n in the ns that we have defined up here。
We're then going to plot that out: we initialize our plot,
and then we just plot the ns versus the different ROC AUCs that are output
by the function that we're running here。So we've already run this;
let's look down at the results。
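A minimal sketch of the one-hot step, covering only the column-creation function and not the scoring loop。X_basis here is synthetic, and the feature names and the kmeans_cluster prefix are assumptions of this sketch。

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Synthetic stand-in for the float columns.
X_basis = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])

def create_kmeans_columns(n):
    """Return X_basis plus one 0/1 indicator column per K-means cluster."""
    km = KMeans(n_clusters=n, n_init=10, random_state=42)
    km.fit(X_basis)
    km_col = pd.Series(km.predict(X_basis), index=X_basis.index)
    # One indicator column per cluster label, so labels carry no ordering.
    km_cols = pd.get_dummies(km_col, prefix="kmeans_cluster")
    return pd.concat([X_basis, km_cols], axis=1)

X_with_5 = create_kmeans_columns(5)
indicator_cols = X_with_5.filter(like="kmeans_cluster")
```

Each row has exactly one indicator set, so a linear model such as logistic regression gets a separate coefficient per cluster instead of treating label 4 as larger than label 1。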

And we see it jumps around quite a bit as we add and remove some of those clusters,
and keep in mind that each point is just the average over 10 iterations。
So that closes out our section here on the different clustering methods,
and gives you an introduction to how you can also use these clustering methods to actually do some feature engineering。
And with that, we close out our section on clustering, and in lecture
we will move on to dimensionality reduction。

All right, I'll see you there。

029:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p29 28_降维度简介.zh_en -BV1eu4m1F7oz_p29-
In this set of videos, we are moving away from clustering and moving on to a different class of unsupervised learning,
namely dimensionality reduction, or finding ways of representing our data set in lower dimensions。
Now let's discuss the learning goals for this section。In this section,
we're going to have an overview of dimensionality reduction and how we can go about solving the problem of the curse of dimensionality by coming up with a lower-dimensional representation of our data that maintains the majority of the information that's important to us within that original data set。
We'll then discuss principal component analysis, or PCA, and how we can use that to come up with new features in a lower-dimensional space,
solving our problem of the curse of dimensionality。And then we're going to discuss non-negative matrix factorization and how we can use it to decompose our original data into only non-negative values and again reduce the number of dimensions。

Now, we should recall from earlier in the course, as well as from working through our notebook on the curse of dimensionality, that
in practice, too many features may lead to worse performance for our different models。
The distance measures that we're using perform poorly, and the incidence of outliers increases, as we increase the number of dimensions。

And the reason why this is: if we think about just working with one dimension that has, say,
10 positions, then in order to cover 60% of this space we only need six observations,
so we would only need 6 rows。If we increase this to two dimensions,
each one with 10 different positions, then we would need 60 different observations within our data set in order to cover 60% of the possible positions。

And then if we increase it to three dimensions and beyond,
we can see how the number of observations needed to cover the same share of the available space
increases exponentially as more and more dimensions get added on。
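The arithmetic behind that growth can be written out in a few lines。This is just an illustration of the counting argument; the function name is made up for this sketch。

```python
def observations_needed(positions_per_dim, n_dims, coverage):
    """Rows needed to cover a given share of a grid with n_dims dimensions."""
    total_positions = positions_per_dim ** n_dims
    # round() guards against floating point error in products like 0.6 * 10.
    return round(coverage * total_positions)

# 60% coverage with 10 positions per dimension, for 1, 2, 3 and 6 dimensions.
needed = {d: observations_needed(10, d, 0.6) for d in (1, 2, 3, 6)}
# needed == {1: 6, 2: 60, 3: 600, 6: 600000}
```

Every extra dimension multiplies the required number of rows by 10, which is why enterprise data sets with many features are almost always sparse in this sense。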

So this is a very common situation within business。
within enterprise data sets that often contain many, many features。
Data can be often represented by using fewer dimensions or fewer features than your original data may have。
And ways to accomplish this would be either to reduce the dimensionality by selecting the subset of features that you deem most important within the larger data set that you're working with,
or to combine features using linear and nonlinear transformations,
which is what we're going to do here, starting with PCA。

So how does PCA, or this idea of creating new features out of the many features, actually work?
Here in this example, we'll start with two features。
And we see that we have phone usage and data usage as our two features。They look very correlated。
one with the other, and visually, it looks like the points lie very close to a line。
So the question is, can we reduce the number of features from the two that we have down to one?Now,
what if we considered this line, projected the points onto that line, and used those projections instead?
So here are the different projections。This will entail a linear transformation of our data to create this new single line,
and if we think about this going out to higher-dimensional space,
we can imagine projecting from 3D down to 2D, or even from 100 dimensions down to 10 dimensions, or in general just projecting down to lower dimensions。

Now with our linear transformation, the points are going to now lie on this line that we see here。
We have now created out of those two original dimensions。
a one dimensional feature space that is the combination of phone and data usage。
We can think of this transformation as a scaled addition of each of the two columns。Thus。
what ended up happening is we now have one column created as a combination of those two original columns。
This is going to be the idea behind principal component analysis or PCA。
We replace the columns by some linear combinations of those original columns。
And these linear combinations are not going to be arbitrary。
They're going to be intelligently selected in order to preserve the underlying meaning of our data。
And what we mean by that, as we'll see in a second, is trying to maintain as much of the original variance as possible。
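To make the projection idea concrete, here is a small numpy sketch with two correlated synthetic columns standing in for phone and data usage。Picking the line as the leading eigenvector of the covariance matrix is one assumed way to find the direction of maximum variance; the numbers themselves are made up。

```python
import numpy as np

rng = np.random.default_rng(8)
phone = rng.normal(0.0, 1.0, 200)
data_usage = 0.8 * phone + rng.normal(0.0, 0.1, 200)   # strongly correlated
X = np.column_stack([phone, data_usage])
X_centered = X - X.mean(axis=0)

# Direction of maximum variance: leading eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
direction = eigvecs[:, -1]            # eigh returns eigenvalues in ascending order

# New single feature: a scaled addition (linear combination) of the two columns.
new_feature = X_centered @ direction

# Share of the original total variance kept by the one new column.
retained = new_feature.var(ddof=1) / X_centered.var(axis=0, ddof=1).sum()
```

With columns this correlated, the single new feature keeps nearly all of the original variance, which is the sense in which the combination is chosen intelligently rather than arbitrarily。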

030:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p30 29_主成分分析降维.zh_en -BV1eu4m1F7oz_p30-
And now, looking at what we had before compared to what we now have,
we have successfully created a single feature out of the two features we were originally working with,
thus reducing the dimensionality of our feature space。

So now let's focus on how principal component analysis, or PCA, finds these lines on which to project our data。
So let's say this is the data set we're now working with。
And we can see pretty clearly that the data is distributed in a certain way along a certain axis that we can see visually。
Now linear algebra has tools that can determine exactly where that axis is,
where we have the most variance。So using linear algebra, we can find this primary vector;
this is called the primary vector that the data set is distributed along,
and mathematically it's going to be called the primary right singular vector。
And this is going to account for the maximum amount of variance in any direction for our data set。
Now, excluding that primary right singular vector, this is going to be the second axis for the data set。
It's going to be another right singular vector, secondary behind that primary one that we just highlighted。
Once we have this decomposition of our data into orthogonal, or perpendicular, vectors,
where each one of these vectors as we move forward will be perpendicular, or orthogonal, to the others,
we can then determine a meaningful projection of our data。Here,
since the vectors' lengths are so disproportionate, it'll make sense to project onto that V1 that we saw,
and we wouldn't lose a lot of information if we projected our data down to V1。
This is because there's not much variance in V2's direction。If you were to project onto V2,
you'd see that the scale would be very small; if we projected down to V2 the same way that we did in that last example,
we'd be scrunching up our data much more so than if we project onto V1。If we project onto V1,
we're able to maintain a lot of that original variance。

So in order to find these singular vectors, the mathematical theory that enables us to do this is called the singular value decomposition。
Now, the data set that we work with does not need to be square;
as we see here, our original data set A is going to be an M by N matrix, with M and N not being equal。
We can decompose A into the matrices U, S and V。And U and V here can be thought of as just rotations in space,
one in the M by M space, one in the N by N space。
They encode the information of V1's and V2's directions only, but not the lengths。
They are going to be more auxiliary or technical matrices, whereas the real geometric idea is going to lie with S。Now the matrix S is going to store the actual lengths of those vectors,
so recall that those longer vectors will tell you which ones should be your primary vectors in regards to where to project your data down onto。
So S, as we see here, given where the stars are, is what's called a diagonal matrix,
meaning the only nonzero entries in that matrix are along that diagonal。
And these values are going to be sorted from largest to smallest,
and they will tell us which vectors are actually important。So here in this example,
we're working with a 5 by 3 matrix originally, and then we decompose that into U being 5 by 5,
S being 5 by 3, and V transposed, or V originally, being 3 by 3。
And this singular value decomposition is going to be what scikit-learn actually uses for PCA, for principal component analysis。
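The shapes in that 5 by 3 example can be checked with numpy。This is a sketch on a random matrix, so the singular values themselves are arbitrary; only the shapes, the ordering and the reconstruction are the point。

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(5, 3))          # M = 5 rows, N = 3 columns

# full_matrices=True gives U as 5x5 and V transposed as 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# numpy returns the singular values as a vector; place them on the
# diagonal of a 5x3 matrix to get S as described above.
S = np.zeros((5, 3))
S[:3, :3] = np.diag(s)

# Multiplying the three pieces back together recovers A.
A_rebuilt = U @ S @ Vt
```

The vector s comes back already sorted from largest to smallest, which is what lets us rank the directions by importance。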

So let's say our data set, when decomposed, looks like what we have here。
We have three singular values。Those three values across the diagonal, say they are 9, 5 and 2,
9 being the top left, down to 5 and 2, and that'll tell us that the first two right singular vectors are more important than the third。Again,
the larger the value, the more important it will be。
So most of the variance in the data is in the direction of the first two principal components。
And those principal components are going to be calculated from the V that we have here。
If we were to plot this out, the values of V would give us,
for each principal component, the point in three dimensions, measured from the origin,
that the principal component points to。And again,
the first principal component is the one that accounts for the largest amount of variance。
And if we want to bring it down from n dimensions to k dimensions, which is our goal,
so we're working with A, an m by n matrix, and we want to change that to
a new matrix, not necessarily A, that's going to be m by k:
we keep the same number of rows, m, and k is going to be fewer columns than n,
which is currently 3. All we'd have to do is take that decomposition
and see where we can remove one of those columns. Here we use the singular vectors from V.
We can multiply A by the transpose of our truncated V transposed. If we keep only the first k rows of V transposed, that matrix is k by n,
and if we take its transpose it's n by k. So with an m by n matrix multiplied by, or taking the dot product with, an n by k matrix,
we end up with a new matrix that has dimensions of m by k,
and that will give us a new data set, using this singular value decomposition,
that is now m by k: a reduced number of columns, each of which is a combination of those original columns.
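The shape bookkeeping above can be sketched in a few lines of NumPy. The data here is random and the sizes (m = 6, n = 3, k = 2) are assumptions chosen only to keep the example small; the columns are centered first, since PCA works on centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))     # m = 6 rows, n = 3 columns (made-up data)
A = A - A.mean(axis=0)          # center each column before projecting

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
V_k = Vt[:k].T                  # first k rows of V^T are k x n; transposed: n x k
A_reduced = A @ V_k             # (m x n) @ (n x k) -> m x k

print(A_reduced.shape)          # (6, 2)
```

Each of the k new columns is a linear combination of the original n columns, with the coefficients taken from the corresponding row of V transposed.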

Something to take into account when we're doing principal component analysis
is that, since we are talking about lengths here a lot,
the algorithm will be very sensitive to scaling. So it will be important to scale prior to applying PCA.
If we think about every one of the different algorithms that we've used so far in this course
and the effects of distance, we'll notice that having unscaled data would allow one of those axes to have more weight in determining where the maximum variance may actually be.
So if our data is not scaled, we can end up with this projection that we see here, when in reality we'd want this projection down the center of our data.

Now, in order to do PCA using sklearn, we import PCA from sklearn.decomposition.
We're then going to create our instance of the class here,
so PCAinst equals PCA, and we have to say how many components we want to reduce our original data frame down to.
So if we're starting off with 10 columns and we want to reduce it down to three columns,
that's what n_components is going to signify.
So we can pass in that final number of components that we actually want.
We can then take that initiated instance of PCA, with the number of components equal to 3,
and we can call fit_transform. The same way that we have for many of our different objects, such as the standard scalers,
we are able to call fit and transform, and it will output a new data set, now with fewer columns.
So, for example, we can transform our customer churn data set,
which has around 20 numeric features, to one with only three features,
with those three features being a combination of those original 20 features that we had,
using that singular value decomposition that gave us that V matrix to show us how to reduce the number of dimensions.
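As a rough sketch of those steps, the snippet below reduces 10 columns to 3. The data is random (a stand-in, not the customer churn data set), the variable name `PCAinst` is an assumed spelling of the name used in the video, and the scaling step is added because of the sensitivity to scale discussed earlier.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))                   # stand-in data: 100 rows, 10 numeric columns

X_scaled = StandardScaler().fit_transform(X)     # scale first: PCA is sensitive to scale

PCAinst = PCA(n_components=3)                    # keep three principal components
X_trans = PCAinst.fit_transform(X_scaled)        # fit and project in one call

print(X_trans.shape)                             # (100, 3)
print(PCAinst.explained_variance_ratio_)         # share of variance per component
```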
Now that closes out our discussion here on linear PCA. In the next video, we will discuss how we can move beyond linearity. All right,
I'll see you there.


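031: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p31 30_Dimensionality Reduction Notebook (Optional) Part 1.zh_en -BV1eu4m1F7oz_p31-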
Welcome to our notebook here on Dimensionality Reduction. In this notebook,
we're going to be using the Portuguese wholesale distributor data set.
That data set contains the annual spending on fresh products, on milk products,
grocery products and so on. The last two columns, which we're actually going to end up dropping, are channel and region. The reason we drop those is that we want to focus on the numeric values here, and these are technically both categorical values. It would be just as easy, if we wanted to, to one-hot encode them,
but for this we're just going to drop those two columns.
We're then going to import our necessary libraries, as we do at the start of each one of our notebooks.
Then here for part 1, we're going to import our data and check each of the data types.
We're then, as mentioned, going to drop the channel and region columns, as we won't be focusing on these throughout our examples here using PCA.
We're then going to convert the remaining columns to floats, if that's necessary.
And then we're going to copy a version of the data that we just created, using the .copy() method,
to preserve it. We'll be using that later on, and we'll see how in a bit. So first things first,
we import our data using pandas.read_csv.

We look at the shape and we see that we have 440 rows and eight columns. Recall that the number of columns is going to be important, as our goal here with PCA is to reduce the number of columns that we're working with when we create our models, or whatever it is that we want to do with our data.
Maybe we want to visualize it, and so we want to reduce to two columns. So we see our first five rows,
and we see here that we still have that channel and region, which we said we don't want to include,
so we're just going to call data.drop and drop channel and region, with axis equals 1.

We look at the data types and see that they're each integers,
and we're just going to convert each of those to float,
calling .astype(float) for each one of the different columns.

Now we have them all as floats, and then, as mentioned,
we're going to want to save this original data for later. So recall here we have our data,
which is our data frame that we've just created, and then data_orig is going to be a copy of that, which we're not going to touch for a bit.
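The part 1 steps can be sketched as follows. Because the CSV file is not available here, a small synthetic data frame stands in for the result of `pandas.read_csv`; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the wholesale data (the notebook reads a CSV
# with pandas.read_csv; column names and values here are assumptions).
rng = np.random.default_rng(0)
cols = ["Channel", "Region", "Fresh", "Milk", "Grocery",
        "Frozen", "Detergents_Paper", "Delicassen"]
data = pd.DataFrame(rng.integers(1, 1000, size=(10, len(cols))), columns=cols)

data = data.drop(["Channel", "Region"], axis=1)   # drop the two categorical columns
data = data.astype(float)                         # convert the remaining columns to float
data_orig = data.copy()                           # untouched copy, kept for later

print(data.shape)      # (10, 6)
print(data.dtypes)
```

Using `.copy()` matters here: a plain assignment would only create a second name for the same data frame, so later in-place changes would also alter the "original".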

Here in part two, we need to, again, ensure that our data is scaled and relatively normally distributed.
It'll be easier to work with normally distributed data,
and, as mentioned in the lecture, we saw how important it is to scale our data to ensure that no feature has extra weight when trying to come up with the different principal components.
So we're going to examine the correlation between each one of our different features.
Recall this will be important because, when we are doing PCA,
what we are looking for is whether two features are very highly correlated.
If they are, they're not adding any extra information, and we want to remove or reduce those, or combine a few, to end up with fewer features overall.
So if they're highly correlated, we can probably remove some without losing much variance from the overall data set.
We're then going to perform any transformations and scale our data using whatever scaling method you prefer,
whether it's MinMaxScaler or StandardScaler.
We're then going to view the pairwise correlation plots using our pair plot, just to visualize all the relationships, as well as to see whether we now have normally distributed data,
looking across the diagonal of the pair plot.

So the first thing that we want to do is call data.corr(),
so we can see the correlation between each one of the different features.
This will give us, for each feature, the correlation with all the other features, in a square data frame.
We want to find, for each feature, which other feature has the highest correlation with it.
Because a feature will always have a correlation of 1 with itself,
we're going to replace the diagonal values, which start off as all ones, with zeros.
So we're saying for x in range(corr_mat.shape[0]).
It's a square matrix, so we could have used shape[0] or shape[1].
So, for every index in the range of our matrix, we're going to take the diagonal value,
so (0, 0), (1, 1), (2, 2), and replace that one with a 0.

And we can see now that our correlation matrix has the correlation between fresh and milk, fresh and grocery, and so on,
and then for fresh and fresh it's just a zero, as it is all along the diagonal.

Now, we're going to take the absolute value of that full correlation matrix,
as we don't care if a correlation is positive or negative, just about its strength.
And we're going to call idxmax to see which feature is most highly correlated with each of the other features.
So we're saying, what's the index of the max value? For fresh, it's frozen; for milk, it's grocery;
and so on and so forth.
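Here is a minimal sketch of the correlation steps just described. The three-column data frame is synthetic, built so that two columns move together; the name `corr_mat` is an assumed spelling of the variable used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: "Milk" and "Grocery" share a common signal, "Fresh" does not.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
data = pd.DataFrame({
    "Milk": base + 0.1 * rng.normal(size=200),
    "Grocery": base + 0.1 * rng.normal(size=200),
    "Fresh": rng.normal(size=200),
})

corr_mat = data.corr()

# A feature always correlates perfectly with itself, so zero out the diagonal.
for x in range(corr_mat.shape[0]):
    corr_mat.iloc[x, x] = 0.0

# For each feature, the name of the feature it is most strongly correlated with.
strongest = corr_mat.abs().idxmax()
print(strongest)
```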

We're then going to examine the skew for each one of our different columns, and then take the log transformation, if necessary,
for those that have a higher skew. Recall that the skew is going to be a value with 0 being no skew,
a positive value being a right skew and a negative value being a left skew. The higher that value is,
the stronger the skew. So we call data.skew() to see the skew of each one of our different columns.
We sort them from largest to smallest, and those are going to be our log columns,
and that will now be a pandas Series. And then we're just going to keep those log columns with a skew greater than 0.75.


Those are the ones that have a higher skew, and we see here the values that tend to have a higher skew.
For those, we're going to take the log transformation of each,
hopefully creating more normally distributed data.
So: for col in log_columns.index.
Here log_columns is the pandas Series that we just defined.
If we call the index, we get each one of these: delicatessen, frozen, milk and so on,
which also match up with our different data columns.

So we're going to replace those columns, in place, with the log transformation of those columns.
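The skew check and log transformation can be sketched like this. The data is synthetic, and the use of `np.log1p` (the log of one plus the value, which is safe at zero) is an assumption; the video only says "log transformation".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = pd.DataFrame({
    "Skewed": rng.lognormal(mean=0.0, sigma=1.0, size=500),   # strong right skew
    "Symmetric": rng.normal(loc=10.0, scale=1.0, size=500),   # roughly no skew
})

# Skew per column, sorted from largest to smallest, keeping only skew > 0.75.
log_columns = data.skew().sort_values(ascending=False)
log_columns = log_columns.loc[log_columns > 0.75]

# Replace each selected column, in place, with its log transformation.
for col in log_columns.index:
    data[col] = np.log1p(data[col])   # assumption: log1p rather than a plain log

print(log_columns.index.tolist())
print(data.skew())
```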

We can then also use the MinMaxScaler, so from sklearn.preprocessing we import MinMaxScaler.
We want to ensure that all our values are on the same scale. We call MinMaxScaler,
we initiate the object, and then we say, for column in each one of our columns,
we're going to fit and transform on that column, so we're going to replace it, again in place, to rescale that data so that all values are between 0 and 1 by using the MinMaxScaler,
which, we recall, is just subtracting the minimum value and then dividing by the max minus the min.
So that'll ensure all our values are between 0 and 1.
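A small sketch of that column-by-column scaling, using a tiny hand-made data frame (the values are assumptions chosen so the result is easy to check by hand):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({"Fresh": [2.0, 4.0, 10.0], "Milk": [1.0, 3.0, 5.0]})

mms = MinMaxScaler()
for col in data.columns:
    # fit_transform expects a 2-D input, hence the double brackets;
    # ravel() flattens the (n, 1) result back into a single column.
    data[col] = mms.fit_transform(data[[col]]).ravel()

print(data)
# Fresh: (2-2)/(10-2)=0.0, (4-2)/8=0.25, (10-2)/8=1.0
# Milk:  (1-1)/(5-1)=0.0,  (3-1)/4=0.5,  (5-1)/4=1.0
```

Calling `mms.fit_transform(data)` once on the whole data frame would give the same numbers, since MinMaxScaler already scales each column independently; the loop simply mirrors the steps described above.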
The next thing that we want to do is visualize everything that we've just done.
We're going to see each of the relationships, and hopefully see those high correlations in the different scatter plots of the pair plot, as well as, hopefully, seeing more normally distributed data,
which we do see for the most part throughout each one of our different columns. We see, for example, that
milk and grocery have a pretty high correlation: if you look three columns in and two columns down,
you see that high correlation.





Now, in part 3, we want to introduce how we can do this all in one step,
and this will be especially useful if we want to incorporate this into some supervised learning model later on and be able to pass in different parameters throughout.
So we're going to use our Pipeline, which we saw during our course on supervised learning.
What's important when using the Pipeline is that each one of the steps that are passed in,
each one of the different pieces of that pipeline, has to have a fit and a transform method.
So we want to take the log and then apply the MinMaxScaler.
But the log doesn't have the fit and transform that are built in to each one of the different sklearn objects that we've been working with.
So MinMaxScaler has a fit_transform, but the log function does not.
So, in order to have a version of that log transformation with the fit and transform methods, which we can pass into our pipeline,
we're going to use FunctionTransformer. It will take whatever function it is that you want to pass in
and convert it so that it has a fit and transform method available to it.
So now we have a log transformer object, which applies the log and has a fit and transform method.

And once we do that, we can pass it into our pipeline. So first, within our pipeline,
we need to pass in a list of tuples, where the first value of each tuple is just going to be a name, in case we want to pull that step out later,
and the next value is going to be the actual transformer that we want to call.
So here we have the log transformer that we just created, and then MinMaxScaler.
We pass this list of tuples into our pipeline, and then we can just call pipeline.fit_transform on our original data.
If you recall, we made a copy, and we didn't change that copy of the data at all.
We can call fit_transform and get the output, down the line, of both taking that log transformation and applying the MinMaxScaler.
We run this, and then that data_pipe should equal the data that we just transformed.
So we're going to check that using np.allclose,
which is just going to check that each value within each of our arrays is the same,
up to a bit of possible rounding error, many decimal points down the line. So we run this,
and we see that it's true that all of our values are the same,
and we see that our pipeline works just as well as taking each one of these different steps separately.
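The comparison between the manual steps and the pipeline can be sketched as below. The data is synthetic, the step names in the tuples are arbitrary labels, and `np.log1p` applied to every column is an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

rng = np.random.default_rng(3)
data_orig = pd.DataFrame(rng.lognormal(size=(50, 3)),
                         columns=["Fresh", "Milk", "Grocery"])

# Manual route: log-transform each column, then min-max scale.
data = data_orig.copy()
for col in data.columns:
    data[col] = np.log1p(data[col])
data = MinMaxScaler().fit_transform(data)

# Pipeline route: wrap the plain function so it gains fit and transform methods.
log_transformer = FunctionTransformer(np.log1p)
pipeline = Pipeline([
    ("log1p", log_transformer),        # (name, transformer) tuples
    ("minmaxscale", MinMaxScaler()),
])
data_pipe = pipeline.fit_transform(data_orig)

print(np.allclose(data_pipe, data))    # True: both routes give the same values
```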

Now, that closes out part 3. In part 4, we're going to start working with PCA on this transformed data that we've been working with, and see how much of the variance we can explain with different numbers of these principal components.
All right, I'll see you there.


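032: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p32 31_Dimensionality Reduction Notebook Part 2.zh_en -BV1eu4m1F7oz_p32-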
Now, for part 4, which we will be working through here in this video,
we're going to perform PCA on the data that we worked through in the last video.
And we're going to perform PCA for the number of components ranging from one to five.
So we start off with six columns and, no matter what, we're going to reduce the number of columns that we'll ultimately be working with.
We're then going to store the amount of explained variance for each one of the different numbers of dimensions.
So for one dimension, how much variance was explained, versus two, and so on and so forth.
And if we were to set the number of components equal to 6,
then we would have explained 100% of the variance.
So we're asking how much of the variance we explain at each one of the different steps.
We're also going to store the feature importances for each number of dimensions.
Something to note is that PCA won't explicitly provide this feature importance,
but the components_ attribute, which we'll show you how to use in just a bit,
will show you how each one of those principal components was composed as a combination of the original features. The larger those values are,
given that we've standardized our data, the more impact each one of those features has had on that principal component,
and therefore we can assume that it is a more important feature.
And then we're going to plot both that explained variance and these feature importances.
Now I'm going to break this down step by step, so I'm going to actually create a cell above.
But before I do that, just to show you where we're starting off:

from sklearn.decomposition, we import PCA.
We initiate two empty lists, the PCA list and the feature weight list,
which we're going to use to store our explained variances and the feature importances.

And then for n in range 1 through 6, so one through five, including five.
That's what we want to range through. We're going to initiate a model,
a PCA model with the number of components equal to wherever we are within that range.

And then we're going to fit it to our data, which we have now transformed to ensure that it is on the same scale and mostly normally distributed.

We're then going to take the explained variance of each and append that to the PCA list.
And then after a few steps, which I'll walk you through in just a bit,
we're going to take each one of the feature importances and append it to the feature weight list.
And after we do this for n in range one through six,
we have this for each one of our different numbers of principal components.
So let's start off by looking at just this step here. We're going to create a pandas Series.

We are also going to need, of course, to initiate our model.
What I'm going to do, since I'm pulling this out here, is set n equal to two as we discuss all the steps.
You can imagine that the loop is going to do this for n equals 1 through 5.

So we set n equal to 2, and then let's see what this series is that we're going to be outputting.
It should hold n, which is the number of components, which we set to two,
the actual model, as well as the explained variance up to that point: using two components,
how much variance was explained.

So we run this, and you can see that it explained 72% of the overall variance. Now,
just to see how the explained variance ratio actually looks, let's pull this out.

We can see that, if you set n equal to 2,
it shows you how much of the variance was explained by the first principal component,
which was about 45%, and how much by the second component, which was about 27%.
The first one should always have more than the second,
which should always have more than the third.
Our first principal component should be the component that explains the most variance.
So, for each number of components, we will have the amount of variance explained,
so that is covered. Our next step is going to be to find the feature importances.
The first thing that we're going to do here, and let's add this on over here,
is set some weights. The idea of the weights is that we have the breakdown of each of our principal components,
but we want to add more weight to the more important principal components,
so the first one should be more important than the second one, and so on and so forth.
So what I'm doing here is taking the explained variance ratio that we output here,
and then, if we're working with two components,
we're setting it as a proportion of one. We're saying:
we have roughly 45% and 27%, we add those up, and out of one,
what proportion is the 45 and what proportion is the 27? So just to look at what that means: you see
we take those original amounts, 45 and 27, and we just divide by the total of 45 plus 27,
so that the weights are about 0.62 for that first component and 0.38 for that second component,
and we're going to weight our components according to how important these different principal components are.

This will become clear in just a second. The next thing that we're going to look at is components_ on our PCA model.
What's important here is that the PCA components are going to be the breakdown of
how each one of the components is actually composed. So let's first strip away
everything besides the PCA components.

And we can see here, for the first component,
how each one of the different features that we had, and we have six different features,
entered into a linear combination to come up with our first component,
and then the linear combination that came up with our second component. So again,
the idea is that the larger these absolute values are, the more they contributed to each component,
and the more important that feature is. So what we had here before
is that we took the absolute value, because we don't care about whether it's positive or negative,
we just care about how much it affected that principal component.
And then we're weighting it according to these weights. If you recall,
the weights reflect how important each one of the principal components is.
So this first row is going to be multiplied by 0.62,
and this second row is going to be multiplied by 0.38, so that we don't put too much weight on the second component.
So we see here that we use about 70% of whatever feature this is, the fifth feature,
and then we also use about 70% here, in the second principal component, for a different feature.
We want to ensure that these do not get equal weights.
The first should get a higher weight than the second, since it is part of the first principal component.
So that's why we multiply by the weights. And then we can see what the overall contribution is.
Let's just copy and paste that.

And we can see the overall contribution for each one of the different components.

And then we're going to take the sum, with axis equals zero,
so that we can see, now that we've weighted each one of them,
how much each one of these different features, with their weights,
contributed to the principal components that we have. So we see here that,
whatever feature it is, the fifth feature was the most important in the first two components, once you add up its weighted values across the first two components.

We're then going to divide that value down here. So we have the absolute feature values,
and we're going to divide them by the total sum of these values, to ensure that each one is a proportion that adds up to one.
So again, these each represent how much weight each one of our original features
played in coming up with our two principal components.
We normalize that to sum to one, to see, as a proportion,
how much each feature contributed to coming up with these principal components.
And those are going to be the values that we have here.
Then we are going to have a data frame that has the number of components,
and each one of the different columns, so that we can line those up with each one of these values,
and then, for each one of those columns,
the value that goes with it, and those are going to be our values here.
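The weighting scheme walked through above can be sketched for n = 2 as follows. The data is synthetic, the column names are assumptions, and names such as `weights`, `overall_contribution` and `abs_feature_values` are hypothetical labels for the intermediate results; this is one reading of the steps described, not a built-in PCA feature-importance method.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
data = pd.DataFrame(rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6)), columns=cols)

n = 2
PCAmod = PCA(n_components=n).fit(data)

# Weights: each component's explained variance as a proportion of the total kept.
weights = PCAmod.explained_variance_ratio_.reshape(-1, 1)
weights = weights / weights.sum()

# components_ has one row per component and one column per original feature.
overall_contribution = np.abs(PCAmod.components_) * weights   # row i scaled by weight i
abs_feature_values = overall_contribution.sum(axis=0)         # add up over components

feature_weights = pd.DataFrame({
    "n": n,
    "features": data.columns,
    "values": abs_feature_values / abs_feature_values.sum(),  # normalize to sum to 1
})
print(feature_weights)
```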

So I'm going to run this, and the first thing that it outputs is the amount of explained variance for each one of our different numbers of principal components.
We see the first one covered 45%, the first two covered 72%, then 83%, 92% and 98%,
so once we get to five we've covered 98% of our overall variance.

We're then going to concatenate. But first, let's actually look at this feature weight list that we created.
This is going to be a bunch of data frames, so let's just look at the first one.

And we see this is going to be for the number of components equal to one:
how much each one of these different features contributed to that principal component.
If we set the index equal to one, we can see, for the first two components,
how much each feature contributed to the principal components.

We're going to concatenate all these different data frames together so that we have one long data frame.
And then we're going to pivot that, and set the index equal to this n, so that we don't have multiple ones and
twos, but a single row for each n. We are also going to set our columns equal to the different features,
and then we can just have our values as the values.
And now we have this data frame that we have here,
where we see, when the number of components is equal to one,
the contribution of each one of these different features;
when the number of components is equal to two, the contribution of each of the features; and so on and so forth.
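The concatenate-and-pivot step can be sketched with two tiny hand-made data frames. The numbers are invented for illustration, and `feature_weight_list` and `features_df` are assumed spellings of the notebook's variable names.

```python
import pandas as pd

# One small data frame per number of components (values are made up).
feature_weight_list = [
    pd.DataFrame({"n": 1, "features": ["Fresh", "Milk"], "values": [0.7, 0.3]}),
    pd.DataFrame({"n": 2, "features": ["Fresh", "Milk"], "values": [0.55, 0.45]}),
]

# Stack them into one long frame, then pivot: one row per n, one column per feature.
features_df = (pd.concat(feature_weight_list)
               .pivot(index="n", columns="features", values="values"))
print(features_df)
```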


Now we're going to plot the overall variance, just using a bar plot.
So this is plotting what we had up here, that pca_df, which is just that overall variance.

And we just set our x label, our y label, and our title.
And we see how much of the overall variance was explained once we add on each one of these different principal components.

And then finally, we have plotting the features_df. And we're going to see,
for each one of the different numbers of dimensions that we're working with,
how much each one of the different features contributes to all of our principal components. So we see here, for detergents paper, that at first it explained most of the variance; it was the most important feature.
It tends to balance out as we add to the number of components.
Now that closes out our section here on part 4, showing you how to use PCA,
see the explained overall variance, as well as getting a hint at the actual feature importances as we create each one of our different principal components.


In the next section, we will discuss how we can actually use grid search to fine-tune our PCA model,
especially when working with kernels. All right, I'll see you there.


033: Image Processing.zh_en -BV1eu4m1F7oz_p33-
In this video, we will use an example to see how PCA can be used to reduce the feature space of an actual image in practice.
Now, the learning goal for this section is just to show how dimensionality reduction can be used in a real-world application.
With that, we'll bring together an example, using dimensionality reduction to take one image and compress it down to a smaller number of features, and see what that compressed image actually looks like when doing PCA.

Now, we're going to walk through how we can use dimensionality reduction in real-life practice.
Frequently, we want to use dimensionality reduction when we end up with a lot of different features,
when we have high-dimensional data. This can happen often with text data. Features are usually going to be word-existence flags or word counts per document,
and as we saw with the non-negative matrix factorization notebook we just went through,
this can end up creating quite a lot of features very, very fast, and thus a lot of dimensions.
So you often want to use it when working with NLP. Or, as we see here,
if we're working with images, especially if we're working, say, with colored images,
the features can be the brightness values for the R, G and B channels,
so the brightness of each one of those different colors per pixel.
That means that we can end up with quite a lot of features,
on the order of the number of pixels that are present within our image. Here,
we're working with black and white, so it'll just be the brightness of each one of the pixels, without the RGB values, but we can still end up with quite a lot of pixels.

So in this example, we're going to see how PCA can be used for image compression.
We're going to reduce this image's dimensions, but hopefully retain most of the image.

So to see this image as a data set, we put a grid on top of this image,
where each square is going to be a 12 by 12 pixel section.
So each one of these different squares will have 144 pixels,
and each one of those squares will represent a single observation within the full data set of this image.
Something to note is that this grid is just for visual representation; in our example,
we would imagine that there are more squares than what we see here.

So each square, again, is a single observation that is 12 by 12, so a total of 144 pixels.
This is going to be a black and white image, which means that every pixel contains only one numeric value, indicating the brightness of that pixel.
And putting those 144 pixels side by side, we end up with just one row vector.

So we see here that we take that 12 by 12 square and unravel it to have 144 different features
for each one of our different squares. Each row in our data set will be one of the individual squares in our original image.
We can then perform PCA on all of our data points.
So we see here again that we have each one of our different rows representing a single square.
So we end up with a matrix whose size is the number of squares times 144,
which is the number of features we now have. We can apply PCA to this matrix to reduce the current dimensionality, so that we end up with a new matrix that still has that same number of rows,
matching the number of squares,
times m, where m is going to be some value less than 144.
And those new columns will be projections onto special combinations of those original features, our principal components, which describe the most variance.
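The patch-based setup described above can be sketched like this. A random array stands in for the grayscale image (the butterfly image itself is not available here), and the image size and m = 16 are assumptions; the reshape logic is the part worth studying.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
image = rng.random((120, 96))     # stand-in grayscale image; sides are multiples of 12

p = 12
h, w = image.shape
# Cut the image into non-overlapping 12x12 squares and unravel each into a row of 144.
patches = (image.reshape(h // p, p, w // p, p)
                .swapaxes(1, 2)
                .reshape(-1, p * p))
print(patches.shape)              # (80, 144): number of squares x 144 features

m = 16
pca = PCA(n_components=m)
codes = pca.fit_transform(patches)        # (80, 16): each square as 16 numbers
approx = pca.inverse_transform(codes)     # back to (80, 144), with some loss

# Put the approximated squares back into image layout and measure the L2 error.
reconstructed = (approx.reshape(h // p, w // p, p, p)
                       .swapaxes(1, 2)
                       .reshape(h, w))
rel_error = np.linalg.norm(image - reconstructed) / np.linalg.norm(image)
print(codes.shape, rel_error)
```

On a real photograph, where neighboring pixels are strongly correlated, the relative error for a given m would typically be much lower than on this random stand-in.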

So to see this in action, we see here the result of reducing from 144 down to 60 dimensions. So each square,
rather than being represented by those 144 different values, is now represented by 60 different values.
We can still see quite a clear picture of our original image. When we reduce down to 16 dimensions,
we still don't lose much from the original image in regards to visually looking at one next to the other.
So after PCA, you will get these top 16 components, and these will be the 16 most important principal components.
And every original 12 by 12 square in this image is now some linear combination of these 16 components that we have here,
once we have reduced down to 16 dimensions using PCA.

We can reduce this further, down to just four dimensions.
So here we're reducing the dimensionality severely.
But since we're keeping the four most important principal components,
the image is still somewhat recognizable. And here we have the L2 error between that original image and the compressed image with various levels of dimensionality,
where we're just seeing the distance between the original image and the values that we're working with now in the compressed version.
And we can see that for quite some time we don't have that high a relative error as we continue to reduce the number of dimensions.

Now we see here just the top four principal components. And again,
from those original 144 features we are able to create combinations that come up with these four principal components.

Something to note, as we recall from working with PCA in the PCA notebook,
is that these are going to be the top four of our original top 16, or even of our original 144, components.

So reducing to 16 and then selecting the top four is the same as just reducing down to our top 4.
No matter what, we always have the most important principal component first, the second one
second, and so on and so forth.
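That nesting property can be checked with a short sketch. The data is random with columns of decreasing spread (an assumption, just to give the components a clear ordering), and `svd_solver="full"` is chosen so both fits use the same exact decomposition. Absolute values are compared so that a sign flip of a component would not matter.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# 80 observations of 144 features, with columns of decreasing spread.
X = rng.normal(size=(80, 144)) * np.linspace(3.0, 0.1, 144)

pca16 = PCA(n_components=16, svd_solver="full").fit(X)
pca4 = PCA(n_components=4, svd_solver="full").fit(X)

# The first four of the 16 components match the four components fitted directly.
same = np.allclose(np.abs(pca16.components_[:4]), np.abs(pca4.components_))
print(same)
```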

And then here we can see what that image actually looks like reduced down to one dimension.
So you see that now we're only working with one dimension, and each one of our different squares is represented by just a single weight on that one component.
And we can still see somewhat of a fuzzy image here.
So we can see here how PCA is actually compressing our original image and the amount of data that we have to store in order to represent that image.
Now, just to quickly recap: in this section, we discussed the applications of dimensionality reduction in the real world,
using the example of working with that butterfly image,
using PCA to reduce the number of dimensions and showing how we didn't lose much from that original image when we reduced the number of features.
Now that closes out our section here on unsupervised learning, and it was a pleasure teaching you.
Thank you.

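034: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p34 33_Kernel Principal Component Analysis and Multidimensional Scaling.zh_en -BV1eu4m1F7oz_p34-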
Now let's move beyond linearity to working with nonlinear transformations.
What we've talked about so far, with principal component analysis and singular value decomposition,
were all linear transformations,
so we're using linear transformations to map our original data set to a lower dimension. Now,
data in general can very often have nonlinear features.
And when we work with nonlinear features and try to perform PCA,
this can cause our dimensionality reduction to ultimately fail.
So here we have this example data set, and we can see that we're doing a mapping from two dimensions to two principal components.
So it will end up not changing the space. But in general,
as we try to map from higher dimensions to lower dimensions when we have nonlinear features,
we won't be able to maintain that variance while reducing the number of dimensions, as we've done so far with linear PCA.

So if you recall our discussion during support vector machines。
there are going to be kernel functions which we can use to apply nonlinear transformations to our data。
Now, if you did think back to support vector machines,
what probably came to mind is that with the kernel functions
we're mapping up to a higher dimensional space, whereas the goal here is to map to a lower dimensional space。
But the key is that when you use these kernel functions and map to a higher dimensional space,
you're able to uncover nonlinear structures within your data set and then map down in a linear fashion, similar to how with support vector machines you're then able to come up with a linear boundary。
Once you map up to those higher dimensions, you can use linear PCA in order to actually come up with fewer dimensions。
So here we see that, from that original space that we saw earlier, using the kernel PCA projection
we're able to come up with a linearly separable space, so we're able to adjust the space。
Now, in the figure here on the left, we'd be applying PCA directly, and we see this curvature in our data。
We wouldn't be able to maintain the total amount of variance if we just directly applied linear PCA。
So instead we apply this kernel, which will map our data to a linear space,
and then we can reduce it down to a lower number of dimensions without losing the information that we would lose by squashing our data down onto that original linear projection。

So how do we actually perform kernel PCA using Python?
As usual, we're going to import the class containing the dimensionality reduction method。

Once we import KernelPCA from sklearn.decomposition, we then instantiate our class,
and we specify the number of components we want and what type of kernel we want to use。
There are actually different kernels available, as there were with support vector machines。
We also choose the gamma, and if you recall, the gamma determines how curvy or how complex you want the transformation to be in regards to the nonlinearity of that original data set。

And then, same as working with just PCA, we can call .fit_transform on that object with our data set, and we have our transformed data set using kernel PCA。
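The steps just described can be sketched as follows. This is a minimal sketch rather than the course's own code: the two-circles toy data and the specific values chosen for `n_components` and `gamma` are assumptions made for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Toy nonlinear data set (an assumption; not the data shown in the lecture).
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Choose the number of components, the kernel type, and gamma.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0)

# Same pattern as plain PCA: fit and transform in one call.
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```

Swapping `kernel="rbf"` for another supported kernel, or changing `gamma`, is the kind of choice the lecture refers to.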

Now let's talk briefly about manifold learning。This is going to be another class of nonlinear dimensionality reduction。
And what we are working with here is going to be multidimensional scaling, or MDS。Now, MDS,
unlike PCA, will not strive to preserve the variance within the data。So recall that with PCA,
the goal is to maintain as much of the variance within the original data as possible。With MDS, instead,
the goal is to maintain the geometric distances between each one of the different points。
So the figure on the left is supposed to be a sphere in three dimensions。Under MDS,
it's mapped to a disk, and MDS tries to maintain the distances between each of the points in three dimensions as we move down to two dimensions。

Now, in order to run MDS within Python, we are going to import the class containing the dimensionality reduction method。
So this time from sklearn.manifold, rather than sklearn.decomposition, we import MDS。We create an instance of the class, passing in the number of components that we ultimately want。
And again, we just take the MDS object and call fit_transform on our data set。
And then we will have X_mds as our transformed data set, which now has only two columns, or two features。
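A minimal sketch of those MDS steps, assuming toy data: the random points on a sphere below stand in for the sphere figure from the lecture, and `random_state` is added only to make the run repeatable.

```python
import numpy as np
from sklearn.manifold import MDS

# Toy data (an assumption): 100 random points on the unit sphere in 3-D.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
X = points / np.linalg.norm(points, axis=1, keepdims=True)

# Ask for two output dimensions; MDS tries to keep pairwise distances.
mds = MDS(n_components=2, random_state=0)
X_mds = mds.fit_transform(X)
print(X_mds.shape)
```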
Now, other popular manifold learning methods exist, such as Isomap,
which uses nearest neighbors and tries to maintain the nearest neighbor ordering, or t-SNE,
which tries to keep similar points closer together and dissimilar points further apart, and can be very good for visualization。
There are going to be several ways to do decomposition, and generally I would say try a few out。
A good approach would be to try those out and then, if you're able to move down to two or three dimensions, use EDA and visualization to see how well you were able to come up with clusters or maintain the amount of variance that was originally there。
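Trying a few methods side by side might look like the sketch below. The swiss-roll toy data, `n_neighbors=10`, and `perplexity=30` are assumptions picked for illustration, not values from the course.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, TSNE

# Toy 3-D data lying on a curved 2-D sheet (an assumption for illustration).
X, _ = make_swiss_roll(n_samples=300, random_state=0)

# Isomap builds on a nearest-neighbor graph of the points.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# t-SNE keeps similar points close together; often used for 2-D plots.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_iso.shape, X_tsne.shape)
```

Each result is a two-column array that can be passed to a scatter plot to compare how the methods lay out the same data.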
Now that closes out our discussion here in regards to principal component analysis,
as well as the different types of manifold learning。In the next lesson,
we're going to go through a demo of using PCA in practice。All right, I'll see you in the notebook。


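035: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p35 34_Dimensionality Reduction Notebook Part 3.zh_en -BV1eu4m1F7oz_p35-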
Welcome back for part 5 of our notebook here。
Here we're going to introduce kernel PCA, or PCA working with a kernel,
where we're going to use what we discussed in lecture so that we can come up with a nonlinear combination rather than the linear combination of plain PCA,

finding where the high variance is by mapping up to higher dimensions
to capture that curvature in the lower dimensions。


Now, we're choosing here that our kernel is equal to RBF。
We can also search through different kernels, and I suggest you look at the documentation as well。

But when we're working with RBF we can also search through different gammas,
and that'll tell you essentially how complex that boundary is going to be, or how curvy the line that you project onto will actually be。


So we're going to search through different gammas using grid search, and when we use grid search,
what we're trying to do is find the best model。When we do this with supervised learning,
this is clear: we can use a scoring method such as mean squared error, accuracy, or whatever other classification score you want to use, and optimize on that score。





Now, when we're using unsupervised learning, it's not quite as clear how we can end up scoring which one of these different models performed better。

But we do need to come up with some type of scoring option in order to decide which gamma or, if we wanted to search through different kernels,
which kernel works the best。

So what we're going to do here is introduce a custom scoring method。
So you'll see here that we define a scoring function, and we'll walk through what it is,
but essentially what we're going to do is take a model, fit a PCA model to our data,
and then take the inverse of that, and then see how far away the inverse of that PCA model is from our original data。
The lower that value is, the better we did。So let's walk through that here。
So first we're going to import KernelPCA rather than just PCA。
We're going to import GridSearchCV, as we'll be using that in order to find the optimal hyperparameters for our kernel PCA。
And then you'll see in just a second how we're going to incorporate mean squared error in regards to coming up with the best version of our kernel PCA。
So the first thing that we're going to do is define the scoring function。We're going to pass into it
the PCA model as well as our X。There's going to be no y here; it's just going to be that X,
right? We're using unsupervised data, so there's no label that we're attributing to this。
All we're doing here with this try and except is ensuring that we are working with a NumPy array rather than a pandas data frame。
So if X is a pandas data frame, we call .values
and we're working with the array。


If it's already an array, then it'll just set X_val to that array。

We're then going to take the PCA model that we passed into the scoring function

and call it on X_val, and we fit and transform our data to get our new version with
however many components we're passing through (one component, two components, and so on),
as well as whatever kernel and whatever gamma we're using,

specific to what this PCA model is。

We're then going to take the output of that and pass it into the model's inverse_transform function to get the inverse,
which should undo what we did, but it can't perfectly undo it because we lost some information when we did that original dimensionality reduction。
So it'll take the inverse, and that will be our new data。



And then what we're going to do is take the original data that we had
and see how far off that is from the inverse transform that we just did。In order to do that,
we'll just take the mean squared error。Now, when we do a score,
we want to get the highest value possible, whereas with mean squared error
we obviously want to minimize it,
so we're just going to multiply it by negative one so that we can optimize by getting the highest value。
And that's going to be our scoring function。
From there, it should be as simple as any other grid search that we've worked with in the past。
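The scoring function walked through above can be sketched roughly as follows. The names `scorer`, `pcamodel`, `X_val`, and `data_inv` are assumptions for this sketch (the transcript does not spell them out), and the random demo data is made up.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

def scorer(pcamodel, X, y=None):
    """Negative reconstruction error: higher (closer to 0) is better."""
    try:
        X_val = X.values          # pandas DataFrame -> NumPy array
    except AttributeError:
        X_val = X                 # already an array

    # Reduce the dimensions, then map back to the original space.
    data_inv = pcamodel.fit(X_val).transform(X_val)
    data_inv = pcamodel.inverse_transform(data_inv)

    # MSE should be minimized, so flip the sign to get a score to maximize.
    return -1 * mean_squared_error(data_inv, X_val)

# Toy usage on random data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
model = KernelPCA(n_components=2, kernel="rbf", gamma=0.5,
                  fit_inverse_transform=True)
score = scorer(model, X_demo)
print(score)
```

The `y=None` argument is there only so the function matches the `(estimator, X, y)` call shape that grid search uses for a scoring callable.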

You're going to set your parameter grid, which is going to be a dictionary containing gamma, where we'll loop through different gamma values,
and the number of components, where we'll loop through different numbers of components。
Now, I'll let you know, generally speaking, the higher the number of components,
the better this transform and inverse transform will work。
But this will allow us to home in on the right level of gamma。



We're then going to do GridSearchCV。We're going to say that we want to pass in the KernelPCA,
and the things that we don't want to search over, but want to keep the same through every single loop, are that the kernel is equal to RBF
and that we want it to fit the inverse transform。If we don't set this when we create the PCA,
then we won't have the option to call the inverse transform that we called up here in our scoring function。
So we say fit_inverse_transform equals True。We can then pass in the parameter grid that we defined up here,
and then we can pass in the scoring function that we just created。






We say n_jobs equals -1, just to say we want to parallelize as much as possible,
and then using this kernel PCA grid search that we're defining here,
we can call fit on the data and get our best estimator to see which one of these gammas performed the best。
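Putting the grid search together might look like the sketch below. The gamma values, the random toy data, and the variable names are assumptions; the notebook's actual grid and data set are not shown in the transcript. The sketch uses `n_jobs=1` for portability where the lecture uses -1.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

def scorer(pcamodel, X, y=None):
    # Negative MSE between the data and its transform / inverse transform.
    X_val = np.asarray(X)
    data_inv = pcamodel.inverse_transform(pcamodel.fit(X_val).transform(X_val))
    return -1 * mean_squared_error(data_inv, X_val)

# Toy data (an assumption): 100 rows, 5 numeric columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Hypothetical grid; the values searched in the notebook may differ.
param_grid = {
    "gamma": [0.001, 0.01, 0.05, 0.1, 0.5, 1.0],
    "n_components": [2, 3, 4],
}

kernel_pca_search = GridSearchCV(
    KernelPCA(kernel="rbf", fit_inverse_transform=True),  # fixed settings
    param_grid=param_grid,
    scoring=scorer,
    n_jobs=1,  # the lecture passes -1 to parallelize
)
kernel_pca_search.fit(X)
print(kernel_pca_search.best_params_)
```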



So I'll run that, and that will take just a second to run。So I'm going to pause the video。Oh,

there it is, never mind。And we see here that for our gamma value, 0.5 was the best

option in regards to that transform and inverse transform。And we see that the number of
components is equal to 4, which is the max value, which is what I said:
usually when you are looping through the number of components,
the max value will be the one chosen。

But now we can see that we should probably use 0.5 when choosing the gamma for our kernel PCA。


Now, for part 6, we're going to show you how you can use PCA built into your modeling pipeline in order to perhaps use it to make your logistic regression work better on the data that you have。



So we're going to be loading in this very large data set,
which is the Human Activity Recognition Using Smartphones data set。We've seen this before。
It has tons of different columns。We can look at the shape here and see that

it is 10,299 rows and 562 different columns, so we're going to try and reduce that number of columns。


So what we're going to do is first import the different libraries needed:
our Pipeline, StandardScaler, and StratifiedShuffleSplit, to keep the same ratio of each one of our different outcome values。

We're using LogisticRegression, and we can pull in our accuracy_score since we're doing a classification problem here。
X is going to be all values except for activity。y is going to be the activity。
And then we're going to initiate our StratifiedShuffleSplit, and we'll call this in just a bit when we want to get our average score。

Now, this get average score is just going to be a function that does all the steps in the pipeline, standard scaling,
PCA, and then logistic regression, and all we're going to change each time is the number of components。

So we set this pipe equal to this list and we pass it into our Pipeline as we've done before。
We have our scores, which are just a blank list。So we've initiated our pipeline, but haven't fit anything yet。
But then we use the sss that we initiated here,
that StratifiedShuffleSplit。


And we're going to get five different splits since we set the number of splits equal to five。

And for each of those, we'll get a new X train and a new X test。
as well as a new Y train and a new Y test。

And we can call pipe, that being the pipeline we created here,

.fit on our X_train and y_train。We do that five different times, and each time we're also going to get the accuracy score on the test set。
So once it's fit on the training set, we can see the actual score on the test set。
We'll have five different scores, and then we'll output the average of those five different scores。



We're going to set ns, the numbers of components to try, from 10 up to 500; recall that our original data had 562 columns。

We're going to see, as we reduce the number of dimensions,
whether there is a point where perhaps we don't need all of the data set, or perhaps even some improvement with lower dimensions。

So we're going to get our score list by running this get average score function that we defined up here on each n in the list of ns that we have here。
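The function and loop described above can be sketched as follows. This is an approximation under stated assumptions: a small synthetic classification data set stands in for the activity data, the step names inside the pipeline and the function name `get_avg_score` are guesses, and `ns` is scaled down to fit the toy data's 30 columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the activity data (an assumption): 300 rows, 30 columns.
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=3, random_state=42)

sss = StratifiedShuffleSplit(n_splits=5, random_state=42)

def get_avg_score(n):
    # Scale, reduce to n components, then classify.
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n)),
        ("estimator", LogisticRegression(max_iter=1000)),
    ])
    scores = []
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        pipe.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, pipe.predict(X_test)))
    return np.mean(scores)   # average accuracy over the five splits

ns = [2, 5, 10, 20, 30]
score_list = [get_avg_score(n) for n in ns]
print(list(zip(ns, score_list)))
```

Plotting `ns` against `score_list` gives the kind of plateau curve discussed next in the video.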


So I run this, and this one will actually take some time。So I'm going to pause the video here,
and I'll see you in just a bit as we touch on the results from running this function。All right,
I'll see you there。

All right, now that has finished running, and it may have taken a couple of minutes。

Let's see what this score list came out as。We run this, and this should be in the same order as the ns that we have here。

And we see that after a certain point, once we get to the 450 to 500 range,
there doesn't seem to be any more improvement from adding more variables and adding more features。



And we can see this with the plot as well, just plotting out ns versus our score list。
We can see that it really plateaus, and the y axis is not even starting at 0 here;
it's starting at 0.84。So we see that adding on all these extra dimensions doesn't really add that much extra value in regards to the logistic regression。
So we could probably shrink this down to even 100 features here

or 200 features and still have a pretty high accuracy,
depending on what you're trying to get at, and be able to speed up how long it takes to learn this model。

That closes out our demo here on dimensionality reduction, and I'll see you back at lecture。Thank you。


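036: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p36 35_Non-Negative Matrix Factorization.zh_en -BV1eu4m1F7oz_p36-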

Now we introduce another way of reducing the number of dimensions,
namely non-negative matrix factorization。Now, with non-negative matrix factorization,
we are still going to be decomposing our original matrix,
but this time we're starting with only positive values as input,
so you can think of word counts or the pixels in an image as examples of matrices with only positive values。
And then we decompose that original matrix of positive values into two matrices, W and H,
with both also having only positive values, so that's the non-negative in non-negative matrix factorization。
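Written out, the decomposition described here is commonly stated as below. The symbols V (the original matrix), m, n, and k (the number of components) are notation assumed for this sketch; the lecture itself only names W and H. Minimizing the squared Frobenius reconstruction error is one common way to find the two factors, not necessarily the only objective used in practice.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{m \times n},\;
W \in \mathbb{R}_{\ge 0}^{m \times k},\;
H \in \mathbb{R}_{\ge 0}^{k \times n}

\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^2
```

Because every entry of W and H is non-negative, each entry of the product WH is a sum of non-negative terms, which is why components can only add and never cancel.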
Now we can think of taking a term-document matrix。
To create a matrix of this sort out of many documents,
you can think of each one of your different observations, each one of your different rows, as being a specific document,
and each column as being a word。With documents for rows and words for columns,
each one of the values will be the word count, or some other measure of the word for that document, depending on how you preprocess your text data。
We can then decompose this into how the terms each make up certain topics, and that's your W here。
And that number of topics will be of your choosing。
similar to the choosing of components when we're doing PCA。
And then the H will be how to combine these new topics together to recreate our original documents。
Now, thinking of images, if we think back to PCA, PCA is highly recommended when you have to transform higher dimensions into lower dimensions
and you are okay with losing the original features in the process as new ones are introduced。
So when we look at the breakdown of the components,
it's going to be difficult to gain any insight into how they all combine to recreate that original image, as each one of these new components is
composed of a weird combination of those original features。
Now, with non-negative matrix factorization, since we are only working with positive values
and we can only add those values together (we can't subtract, since everything's positive in both our W and H matrices),
the different components tend to have more of an intuitive feel,
as we'll be adding together the shading of the eyes, the eyebrows, the nose, et cetera,
all together to recreate an image of our face, as we see here。Now,
non-negative matrix factorization has proven to be powerful for word and vocabulary recognition,
image processing problems, text mining, transcription processes, and cryptic encoding and decoding,
and it can also handle decomposition of non-interpretable data objects such as video, music,
or images。So why focus on a decomposition of only positive values? For one,
since non-negative matrix factorization only works with positive values,
it can never undo the application of a latent feature。
There's no cancelling out with negative values; it's only going to be additive。
And thus each included feature must be important, as again, we can't cancel it out down the line。
Also, since it's only positive values, this leads to features that may be interpretable,
as they must all add together to recreate our original data。So, as mentioned,
for something like a data set of different faces, you may have the nose, the ears, et cetera,
and those will add together to recreate the face。Something to note is that, because non-negative matrix factorization has the extra constraint of positive values only, if we end up in that original decomposition with some negative values,
the algorithm will automatically truncate those to 0 and thus may not be able to maintain as much of our original information。
Something else to note is that, unlike PCA, there's going to be no constraint of only orthogonal vectors when we're working only with positive values,
so the decomposition can have portions pointing in similar directions in n-dimensional space。
So now let's briefly touch on how non-negative matrix factorization will work with something like natural language processing。
So as input to our non-negative matrix factorization for documents,
you would pass in some type of preprocessed version of each of your documents,
turning words into numeric values。You can either use a count vectorizer for the count of words or TF-IDF,
which is term frequency-inverse document frequency,
which will give you a value that gives less weight to more common words, such as "a" or "the" or "is", across
the entire range of all of your documents。We can then tune the number of topics that we ultimately want,
as well as the means of preprocessing our text; we may want to remove certain stop words or frequent terms altogether。
And then our output will be how the different terms relate to the different topics,
and then another matrix telling us how to use those topics to reconstruct our original documents。Now,
in order to actually use NMF within Python, the syntax will be very similar to what we've seen so far with the different decomposition methods。
So from sklearn.decomposition, we import NMF。We then create an instance of our class, passing in the appropriate arguments, so we say how many topics,
how many different components, we actually want。And then we say how we want to initialize。
Most of you will initialize as random, but what is important to note is that the method can be sensitive to the type of initialization,
as we've seen with other models, and the results will not necessarily be unique。
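As a rough illustration of the text workflow and the NMF syntax just described, here is a minimal sketch. The four short documents are made up, and `n_components=2`, `random_state=0`, and `max_iter=500` are assumptions chosen for the toy example.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus (an assumption, for illustration only).
docs = [
    "stocks fell as the market reacted to profit warnings",
    "the bank reported strong profit and rising stocks",
    "the team won the match with a late goal",
    "the striker scored a goal in the final match",
]

# Turn words into numeric values, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Number of topics and the initialization method are the main choices.
nmf = NMF(n_components=2, init="random", random_state=0, max_iter=500)
nmf.fit(X)
X_nmf = nmf.transform(X)        # one row per document, one column per topic

print(X_nmf.shape, nmf.components_.shape)
```

Here `nmf.components_` holds how each term relates to each topic; changing `random_state` may give a different but similarly good factorization, which is the non-uniqueness mentioned above.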
So we initiate our class NMF with a number of components,
and then we can fit the instance and create a transformed version of the data by calling nmf.fit
as well as nmf.transform in order to come up with our new data。Now,
just to recap the different approaches that we went through。
Dimensionality reduction is going to be common across a wide range of applications,
and we have here some rules of thumb for selecting what approach you'd like to use。
Principal component analysis will be great if you have a linear combination of features
that you believe can maintain the amount of original variance,
and your goal is to preserve variance by creating a linear combination of those original features。
Kernel PCA will be similar, except it assumes there is more of a nonlinear relationship,
and we still want to preserve the overall variance within each one of our features。
With multidimensional scaling, unlike PCA, the new transformed features are determined based on preserving distance
rather than maintaining variance as we did with PCA。
So if maintaining the amount of distance is more important,
which may be useful if you want to visualize different clusters,
this may be a better approach, and then you'd want to use MDS。And then finally, as we just discussed,
there is non-negative matrix factorization, which is useful when you're working with only positive values, such as working with word matrices or working with images。
Now let's recap what we learned here in this section。In this section,
we discussed dimensionality reduction and how we can solve our problem of the curse of dimensionality by coming up with a lower dimensional representation of our original data that maintains the majority of the information important to us in that original data set。
We then discussed principal component analysis, or PCA, and how we can use it to come up with new features created as a linear combination of those original features,
or, if we use kernel PCA, a nonlinear combination of those original features, to maintain as much of the variance from that original data set as possible。
And then finally, we discussed non-negative matrix factorization and how working with only positive values can lead to us being able to come up with more intuitive and powerful representations of our original data in lower dimensions。
Now that closes out our lecture on dimensionality reduction。
and from here we're going to move to a demo actually working with non negative matrix factorization using Python。
All right, I'll see you there。

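037: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p37 36_Non-Negative Matrix Factorization Notebook Part 1.zh_en -BV1eu4m1F7oz_p37-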
Welcome to our notebook here on non-negative matrix factorization。In this notebook,
we're going to be covering the BBC data set of different articles across five different topics。
The data has been preprocessed so that we have a sparse matrix。
We'll see what that means in just a second。With that, we have bbc.terms,
which is just a list of the words that are used, as well as bbc.docs,
which is just going to be a list of the articles

listed by topic。So at a high level, what we're going to do here is turn our bbc.mtx file into an actual sparse matrix。
So it's already in sparse matrix form, as we'll see。But in general,
working with a sparse matrix just means that, rather than storing a ton of zeros for many of your columns,
we only specify, for each row and column position that has a value, where it is and what that value is。



Otherwise, when you have a larger, non-sparse matrix with a lot of zeros,
you can end up eating a lot of memory。

We're then going to decompose that sparse matrix, using non negative matrix factorization。
and then use the resulting components of that non negative matrix factorizations to analyze the topics that we end up coming up with。

So the first thing that we want to do is take that bbc.mtx, which is our sparse matrix,
and open that file, so now we have our file available as f。Once we have that f object, we just call readlines, and the output of readlines will be stored in content。
So now we have our content, and just to see what that looks like,
I'm going to run this, and you see that it's going to be a list。

And what we have beyond the first two values is just going to be a sparse matrix representation。
We're going to go through, in part 1 below, what each of these different lines means。
But first, we want to remove the first two values。
So we're just going to call content.pop(0) twice, so we remove

the zeroth value, and then we remove the zeroth value again。
So we see that the last one that I removed was this value here from the list。
And now we should only have from here and below if we call content。
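The read-then-pop step can be sketched as below. Since bbc.mtx is not available here, an in-memory file with made-up contents stands in for it; the two header lines and the three data lines are assumptions that mimic the layout described in the video.

```python
import io

# Stand-in for open("bbc.mtx") (an assumption): two header lines followed by
# "word_id article_id count" lines.
fake_file = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "3 7 3\n"
    "1 1 1.0\n"
    "1 7 2.0\n"
    "3 2 5.0\n"
)

with fake_file as f:
    content = f.readlines()

# Drop the two header lines so only the sparse entries remain.
content.pop(0)
content.pop(0)
print(content)
```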

Now, in part 1, we're going to turn this list, which currently
is a list of strings, into a list of tuples, and that list of tuples will represent a sparse matrix。

So that sparse matrix is going to have, as the first value within each tuple, the word ID。
As the second value within that tuple, we're going to have the article ID。
And the third is going to be the number of times that that particular word shows up in that particular article。
So as an example here, if word 1 appears in article 3 two times, then our element of the list,
that tuple, will be word 1, article 3, showed up two times。Now, in order to create this tuple,
what we do is this somewhat complicated looking list comprehension。
I will break it down very quickly。If we just do c for c in content,

that'll just give us the exact list that we saw before。
So let's first call split on that。

And we see that we've now split each string that we originally had into three separate values。
So we're on our way there。And then all we do from there

is map over float, since it'll be difficult to get an integer directly out of a string like 1.0。
We map over float, and then we take the integer of that float so that we're only working with integers。
So first we map float over each one of the values, then we map int over the result,
and then we set that output as just a tuple。

And if we look at the output for just those first eight values, we see that we now have a list of tuples,
(1, 1, 1), (1, 7, 2), and so on and so forth, telling us the word, the article,
and then the count of that word within that article。
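The list comprehension being broken down can be sketched like this. The three input strings are made-up stand-ins for lines of the file, and the name `sparsemat` is an assumption.

```python
# Made-up lines in "word_id article_id count" form (an assumption).
content = ["1 1 1.0\n", "1 7 2.0\n", "3 2 5.0\n"]

# split -> strings, float -> numbers, int -> whole numbers, tuple -> one entry.
# int("1.0") would raise ValueError, which is why float is applied first.
sparsemat = [tuple(map(int, map(float, c.split()))) for c in content]

print(sparsemat)  # [(1, 1, 1), (1, 7, 2), (3, 2, 5)]
```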


Now we want to prepare the actual sparse matrix that we're going to be passing into our NMF, our non-negative matrix factorization。

So we're going to import numpy and pandas, and we're also going to import coo_matrix from scipy.sparse,
which will give us a means of passing the data, the way we currently have it constructed, into a sparse matrix。

So we're going to specify what our rows are going to be。Since these start off,
if you look back up here, with word 1, article 1,
just for it to match up with Python indexing, we're going to make it word 0, article 0, and so on and so forth。
So for every single tuple we take x[1], and these are going to be our rows。
We want our rows to be each one of our different documents, so we say x[1] - 1。
So it's going to be whatever this ID is minus 1,
for x within the sparse matrix list that we have defined here。And then x[0], recall, is the word ID。
We're going to subtract one from that x[0] value, and that will give us our different columns。
So our different columns will be each one of our different words。


And then the actual values are going to be the number of times that word shows up。

And when we call coo_matrix, we pass in the values, and along with that
we pass the related rows and columns for where those values should actually fall。
So if we have row 1, column 1, it'll plug in whatever that value is。
So for this second one, it'll say row 7, column 1, plug in the value 2。So we run this。
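Building the sparse matrix from the tuples might look like the following sketch. The three tuples are toy values (the second matches the "row 7, column 1, value 2" example above), and the variable names are assumptions.

```python
from scipy.sparse import coo_matrix

# (word_id, article_id, count) tuples, 1-based as in the file (toy values).
sparsemat = [(1, 1, 1), (1, 7, 2), (3, 2, 5)]

rows = [x[1] - 1 for x in sparsemat]   # article ID -> 0-based row
cols = [x[0] - 1 for x in sparsemat]   # word ID -> 0-based column
values = [x[2] for x in sparsemat]     # count of the word in the article

# coo_matrix takes (values, (rows, cols)) and infers the shape.
coo = coo_matrix((values, (rows, cols)))
dense = coo.toarray()
print(coo.shape)
print(dense)
```

In 1-based terms, article 7 and word 1 end up at `dense[6, 0]` once the IDs are shifted to Python's 0-based indexing.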
And just to make this perfectly clear, we're actually going to recreate from that sparse matrix an actual pandas data frame,
so we know what the actual matrix that we're doing non-negative matrix factorization on is made up of。

So we're going to pull in the actual terms, and these will line up: index 0 will be the zeroth term,
index 1 will be the first term, and so on and so forth。So from this flat file, bbc.terms,
we'll call f.readlines again。And then just to access that first value,
which is going to be the actual word, we call c.split on that string。
That'll output a list of strings as before, and we only want the first value。
And that will be our output for words。

And I'll run this, and we see the different words that come out。

And then we'll do the same thing for each one of our document names。

All the code is the same, except we're working with a different flat file,
and we can see all the different document names。

And then I'm going to take that coo, which we initialized here
and which is just going to be a sparse matrix。

We're going to turn that into a NumPy array,

pass that into our data frame, and set our columns equal to those words we pulled out and our index equal to those document names。


So this is going to be the actual original data frame that we're working with。
This is going to be article 1, business.001, and we see that the word ad showed up once,
the word sales showed up five times, profit 10 times, and so on and so forth。
And you see the reason why we'd want a sparse matrix is because we'd have all of these zeros for almost every single one of these different articles,
because we need a separate column for every single word that showed up in any single one of the articles,
which is why we generally work with sparse matrices when we're doing natural language processing。
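Turning the sparse matrix back into a labeled data frame can be sketched as follows. The three words, two document names, and counts are toy values loosely echoing the example in the video; they are assumptions, not the real BBC data.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Toy labels and counts (assumptions for illustration).
words = ["ad", "sale", "profit"]
docnames = ["business.001", "business.002"]

rows = [0, 0, 0, 1]
cols = [0, 1, 2, 2]
values = [1, 5, 10, 3]
coo = coo_matrix((values, (rows, cols)), shape=(2, 3))

# Densify, then label columns with words and rows with document names.
df = pd.DataFrame(coo.toarray(), columns=words, index=docnames)
print(df)
```

Any cell not listed in `rows`/`cols` shows up as 0 in the dense frame, which is the memory cost that the sparse form avoids.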




So now we have a data frame that we want to work with,
and the next step will be to decompose our matrix using non-negative matrix factorization。
We'll save that for the next video, and I look forward to seeing you there。


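038: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p38 37_Non-Negative Matrix Factorization Notebook Part 2.zh_en -BV1eu4m1F7oz_p38-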
Welcome back to our notebook。In this video, we're going to actually conduct non-negative matrix factorization。
If you recall, just before we created our data frame that had each one of our different articles for each row。
and then for each column, we had each one of the different words。
and the values were how often those words showed up in each one of the different articles。
We're going to decompose that into different topics。And we will end up with two matrices。
One will be each one of the words and how much they relate to each topic。
and then the other one will be how to take those topics and recreate those documents that we have。
So in order to do non negative matrix factorization。
we're going to have to define how many components we want。
we're going to set the number of components equal to five。
which is the number of topics that we actually had in the original documents。

And this will allow us to later on compare to see how related the new topics are to the actual topics that we had within each one of the different articles。
So we import NMF。We then call NMF and set the number of components。
Our initialization is going to be just random, with a random state of 818。Recall that with non-negative matrix factorization you're not guaranteed to get the exact same solution every single time。
We're going to pass in our sparse matrix, calling model.fit_transform on that sparse matrix,
and that will output this doc_topic,

which is just going to be an array of shape 2225 by 5:
2225 is the number of articles that we had,
and we have reshaped the data into just the five topics that we now want。
So rather than having 2225 by, if we recall, somewhere in the 9,000s in regards to the number of words,
we have reduced it down to five topics。


Now we want to look at the components of this model。

And when we look at the components, all that is is going to be the different words。
And how they make up each one of the different topics that we now have。

So we're going to create a new data frame, which we're going to call topic_word,
and we're going to pass in model.components_, which is going to be this output here, again
the weighting of each one of the words for each particular topic。
We want the index to be topic 1 through topic 5,
and then the columns are going to be those words that we pulled in earlier。And when we look at this,

we can see that

we have, for each one of the different topics, how much each one of the different words contributes to that particular topic。

Now, just to make further sense of how this relates the topics and the words。
as well as the articles and the words, we recall that the original data had five topics, business。
entertainment, politics, sports and tech。


Now I'm going to build the topics per doc, and we're again going to pass in the
actual values of that doc_topic that we created earlier。
We're then going to set our index。If we recall what the docs actually look like,
this is going to be the different articles。

It's going to be each one of these different values: business.001, business.002。
That first word before the dot is going to tell us which topic we're working with。

So we're just going to call i.split on that dot。
And then we're just going to take the first value, so we'll have all business,
or later on, all entertainment, so on and so forth。

And then our columns will be topic one, topic two, so on and so forth。

And when we look at this, we can see that we are
taking each one of those original documents and saying which topic they most relate to。
So you see, business seems to relate most to topic 2。
And we see that repeatedly for every one of the different business articles, and then for tech
we see topic four, I believe,
which we'll see in just a second。What we're going to do here is reset the index so that this index is its own column。
We're then going to group by that index and get the average value for each one of the topics。
And when we do that, we can see that for topic one the max value is politics。Topic two was business,
topic three was sports, so on and so forth。Let me just quickly, to make this clear,
show you what that matrix looks like。This is the matrix,
and we're just seeing which one of these has the highest values。
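The label-then-average step can be sketched as follows. The document names and topic weights are invented for illustration, and the variable names `topics_per_doc` and `mean_by_label` are assumptions, not necessarily what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical document names and topic weights, made up for illustration.
doc_names = ["business.001", "business.002", "tech.001", "tech.002"]
doc_topic = np.array([
    [0.01, 0.30],
    [0.02, 0.25],
    [0.40, 0.03],
    [0.35, 0.01],
])

topics_per_doc = pd.DataFrame(
    doc_topic,
    index=[name.split(".")[0] for name in doc_names],  # keep the label before the dot
    columns=["topic_1", "topic_2"],
)

# Move the label into its own column, then average each topic per label.
mean_by_label = topics_per_doc.reset_index().groupby("index").mean()

# For each topic, the label with the highest average weight.
print(mean_by_label.idxmax())
```

With these made-up numbers, `topic_1` is dominated by the tech documents and `topic_2` by the business documents, which is the same kind of reading done on the real matrix.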


And that's how we end up with these different groups。And then to make this perfectly clear,
we see that topic 1 should, for example, relate to politics。So if we take our topic_word
that we saw up here, we're going to transpose that so that each one of the different topics is going to be a column。

And then we will sort first by topic one, with the highest values on top, so ascending equals False,
and we see party, labor, government, elects, Blair; these all tend to relate highly to politics,
which makes sense。
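A minimal sketch of the transpose-and-sort step, assuming a `topic_word` frame with topics as rows and words as columns. The three words and their weights are made up; they are not the model's actual output.

```python
import numpy as np
import pandas as pd

# Hypothetical component weights for three words and two topics (not real output).
words = ["party", "game", "market"]
components = np.array([
    [0.9, 0.0, 0.1],   # topic_1
    [0.1, 0.8, 0.0],   # topic_2
])

topic_word = pd.DataFrame(components, index=["topic_1", "topic_2"], columns=words)

# Transpose so topics are columns, then put the heaviest words for topic_1 first.
top_words = topic_word.T.sort_values(by="topic_1", ascending=False)
print(top_words.index.tolist())   # ['party', 'market', 'game']
```

Changing `by="topic_1"` to another column name gives the top words for that topic instead.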


And if we did topic three, which was sport, we can see game, play, so on and so forth,
and you can play around with this for each one of the different topics。
So we see that with this unsupervised model, if perhaps you don't actually have your topics available,
you can come up with these, in a way, types of clusters with non-negative matrix factorization。

That closes out our video here and our notebook on non-negative matrix factorization。
And I'll see you back in lecture。All right。Thank you。


039:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p39 0_课程介绍.zh_en -BV1eu4m1F7oz_p39-

Hi, my name is Miguel and I am one of your instructors in our course for deep learning and reinforcement learning。

Deep learning is a very exciting topic because it powers most of our favorite AI applications。
Anything from self-driving cars to computer vision and speech-to-text recognition is using some shape or form of deep learning。

And it's really going to help you in all your classification tasks and even in unsupervised learning applications。

You will first start learning about neural networks, what they are, how they work and best practices,
and then you will learn some deep neural network applications like recurrent neural networks and convolutional neural networks。



And you will wrap up learning some more modern architectures like
generative adversarial networks, or GANs, and reinforcement learning,
which is one of the bigger promises
of machine learning and artificial intelligence。Even if it's very computationally and data intensive,
it holds big promise and it might be what the future holds for AI。



Of all the IBM professional certificates and specializations,
this course is one of the most advanced and complex,
so make sure that you take enough breaks, and if you need any help please don't hesitate to reach out to your instructors and peers。
We are here to help one another and we will go through this together。




Another very important part of this course is the final project。
It will really help you highlight your analytical and machine learning skills, so make sure you post your solution online。
It can be on a GitHub page, an online portfolio, or the IBM communities;
we really encourage you to post your solution out there。



And with that, I will see you in the course, thank you。

040:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p40 1_神经网络简介.zh_en -BV1eu4m1F7oz_p40-
In this set of videos, we will introduce the basic concepts behind working with neural networks。Now,
neural networks and deep learning are behind most of the artificial intelligence that shapes our everyday life today。
Think of all the cool features in our phones, ranging from face recognition to autocorrect to text autocomplete,
voicemail-to-text previews, and also the way that we find what we need on the internet using predictive internet searches,
content or product recommendations, and even self-driving cars。

Also, many of the classification and regression problems that you need to solve at your business are going to end up being good candidates for neural networks and deep learning as well。

Now there are several Watson applications and artificial intelligence APIs that help you infuse artificial intelligence into your business。
Here we have some of the most used with links to live demos that you can explore here on your own。
and as you go through some of these, think about ways you can use these applications within your business。
whether it's identifying the pieces of an image, coming up with an efficient translation into a foreign language。
summarizing and classifying comments or reviews of your product。
as well as finding whether those comments have a positive, negative or perhaps a neutral tone。

Now it's often noted that the biology of the brain serves as an inspiration for the mathematical models that make up our neural networks。
The idea being that the brain functions by firing neurons along a chain where one neuron gets signals from prior neurons。
And according to the firings of prior neurons, the next neurons decide whether to generate signals or not generate signals,
according to those inputs。Those signals that were activated
then pass on signals down that chain to the next neurons。And by layering many neurons together,
we end up creating a very complex model。Now, moving to the actual neural network,
we can think of it as a complicated computation engine。
We're going to train it using our training data, so train our neural net model。
And then we'll use that trained neural net model to generate predictions using new data。
So note here the similar train-test approach as we used with supervised learning,
which will become of utmost importance as we create our neural networks。
Now that closes out this video。In the next video we're going to dive into a single one of these cells to see how data flows in and how data flows out from each one, from layer to layer。All right,
I'll see you there。

041:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p41 2_神经元基础.zh_en -BV1eu4m1F7oz_p41-
Now let's zoom in on a single node in the middle of our basic neural network。

First, that node will get input values from the previous layer wherever that node lies。

Those input values are then going to be combined via weights from each of those different values similar to basic multiple linear regression。

Then that combination of weights is going to be transformed。
similar to how logistic regression transforms a linear combination to squash those values between 0 and 1。


And that transformed value is used as input for the next layer。

Now let's add in some variables to paint this process a bit clearer。

So we can have as our three input values, x1, x2, and x3。
And we'll assume also an intercept term with that value equal to one as we do with multiple regression。
We also have the respective weights for each one of our different values, W1, W2 and W3。
as well as B, And our model is going to learn each one of these weights as well as the B。

And as mentioned, we will multiply each value by its weight。
as we do with linear regression and end up with some output value Z。


Finally, we're going to use an activation function, like I said,
similar to logistic regression, or even the logistic function itself,
to transform that output and use that value as input for the next layer。

Now, without this activation function, we are restricted to only linear output or linear combinations of our inputs。
And no matter how many layers deep we go, we are still just working with a linear combination of our features。
It's going to be this activation function that allows for the great flexibility with respect to how we consider the model outputs,
given our model inputs, using a neural network。
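To make the point about linearity concrete, here is a small sketch with random placeholder weights (the shapes and values are assumptions). Two layers without an activation collapse into a single linear layer, while inserting a sigmoid between them breaks that equivalence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))       # one row with three features
W1 = rng.normal(size=(3, 4))      # first layer weights (placeholder values)
W2 = rng.normal(size=(4, 2))      # second layer weights (placeholder values)

# Two layers with no activation in between...
two_layers = (x @ W1) @ W2
# ...equal a single linear layer whose weight matrix is W1 @ W2.
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))

# With a sigmoid between the layers, the single-matrix shortcut no longer matches.
with_activation = sigmoid(x @ W1) @ W2
print(np.allclose(with_activation, one_layer))
```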




Now, some notation that'll be worth getting familiar with as we walk through working with neural networks。

We have Z, which is going to be the net input or the linear combination of the inputs prior to activation。
so essentially the output of just that linear regression。

We'll have our bias term or that B that we just saw。
which is also similar to our bias term within linear regression。

We'll have f, our activation function, that nonlinear function we use to transform z。
And then we have a, our output, or the value once we take f of z,
once we transform z, that we ultimately pass through to our next layer。

Now, with this notation in mind, as well as that basic unit that we just walked through, that basic neuron,
we've seen that there is a lot of relation between that neuron and logistic regression。
So when we choose f of z equals 1 over 1 plus e to the negative z,
where z is our output of just the linear part of that neuron,

we are actually looking at something very similar to logistic regression。

And what we have here for z is just going to be equal to that intercept term plus the sum of each one of the different inputs multiplied by their respective weights,
which we've expanded out here。
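Written out for the three-input neuron described above, with b as the intercept and f as the sigmoid, the two steps are:

```latex
z = b + \sum_{i=1}^{3} w_i x_i = b + w_1 x_1 + w_2 x_2 + w_3 x_3,
\qquad
a = f(z) = \frac{1}{1 + e^{-z}}
```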



And our neuron is then simply just a unit of logistic regression,
where the different weights that we learn are just the coefficients of logistic regression。
The inputs are the different variables that we have here, and the bias term is that constant term。
So it all relates back to our basic logistic regression。



And because logistic regression and our neural network can, in a way, accomplish the same task if we're trying to accomplish classification,
we want to ensure that when we move to a neural network we actually need a more complex model, that we don't just need this single unit
but we need multiple units and perhaps multiple layers,
and that's when we do switch over to neural networks。


The trade off being that you may be able to come up with a more complex boundary with neural networks。

But you'll lose a lot of the explanatory value that you have with logistic regression。

So what we have here is going to be the sigmoid function, which we use for logistic regression
as well as for our activation here when we talked about the neuron and the output for the neural network。

And what our sigmoid function will do is take that linear combination and create a nonlinear function。
As we see here, we have nonlinearity, not a straight line, and it squashes those values between 0 and 1,
which will be useful as we walk through the different steps of our neural network。


042:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p42 3_使用sklearn的神经网络.zh_en -BV1eu4m1F7oz_p42-
Now in order to create the multilayer perceptron in practice, using Python,
we're going to go over the sklearn version of creating this neural network。
Now something to note is that we can make this simple multilayer perceptron using sklearn,
but as we move on to more complex models, you will see that we're going to move away from scikit-learn and start working with a library called Keras。
But for now let's continue to focus on sklearn。So as usual we're going to import from sklearn; here, from sklearn.neural_network, we're going to import MLPClassifier。
We then need to specify our activation function。So we pass in the different arguments while we instantiate this MLPClassifier class。
And some of the arguments that you see here are the hidden layer sizes,
so this will actually be the sizes of each layer between your input and your output。
As we saw before, we input x1, x2, x3 and then we have a certain number of hidden layers,
and we're saying here the size of each one of those hidden layers。
So the fact that this tuple is only of size 2 means that there are two hidden layers, one of size 5,
one of size 2。If we wanted three and we wanted the third one to be of size 5,
we can do 5 comma 2 comma 5, so that's how the hidden_layer_sizes argument will work。
And then there's the activation function that we want to use。
We've seen so far that we've only used the sigmoid function。By default,
sklearn will actually use the ReLU function, which we'll learn about a bit later,
but because we want to stay in line with what we've discussed so far,
we're going to set the activation equal to logistic here, and logistic is just the same as setting it equal to sigmoid。
We can then, as usual, fit and predict given our data, so we pass into our fit
our X_train and our y_train, and then we can pass into our mlp.predict our holdout set,
our X_test, and see how well we performed on this holdout set。
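Putting those pieces together, a minimal sketch might look like the following. The synthetic data from `make_classification`, the `max_iter` value, and the `random_state` settings are assumptions added so the example runs on its own; the lecture's slide uses its own `X_train`, `y_train`, and `X_test`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for the course data set (an assumption).
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers (5 nodes, then 2 nodes) with sigmoid ("logistic") activation.
mlp = MLPClassifier(hidden_layer_sizes=(5, 2), activation="logistic",
                    max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

y_pred = mlp.predict(X_test)      # predictions on the holdout set
print(mlp.score(X_test, y_test))  # accuracy on the holdout set
```

After fitting, `mlp.coefs_` holds one weight matrix per connection between layers, here of shapes (3, 5), (5, 2) and (2, 1).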
Now that closes out this video, and in the next video we're going to go into some of the common terminology used for the multilayer perceptron,
as well as some intuition behind the basic math that brings this all together。All right,
I'll see you there。


043:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p43 4_神经元实战.zh_en -BV1eu4m1F7oz_p43-
So let's zoom back into this single node。When we're working with just a single neuron。
what we have here is a perceptron。And this is the basis upon which all neural networks are built。
Now, note here that we have, as before, our input values x1, x2 and x3, as well as our intercept,
and then our weights and our b that we're going to learn。And in this example,
we're going to be using logistic regression, using that sigmoid activation function that we just discussed。
Now, let's look at actual values just to make clear how this looks in practice。
We can have the values as inputs: imagine that we have a row with feature 1 equal to 0.9,
feature 2 equal to 0.2, feature 3 equal to 0.3, and then our w1, w2 and w3 are 2,
3 and negative 1, with a b of 0.5。We can then calculate the actual z value
that would be the input, once we have each one of these values, into our activation function。
That activation function is 1 over 1 plus e to the negative of whatever z we calculated。
And we'd end up with a value of 0.93。And that would be the output of this particular node。
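The arithmetic in this example can be checked in a few lines. The variable names are chosen for illustration; the numbers are the ones given above.

```python
import math

x = [0.9, 0.2, 0.3]      # feature values from the example
w = [2.0, 3.0, -1.0]     # weights w1, w2, w3
b = 0.5                  # bias (intercept) term

# Net input: bias plus the weighted sum of the inputs.
z = b + sum(wi * xi for wi, xi in zip(w, x))

# Sigmoid activation squashes z into (0, 1).
a = 1.0 / (1.0 + math.exp(-z))

print(round(z, 2), round(a, 2))   # 2.6 0.93
```

Here z = 0.5 + 1.8 + 0.6 - 0.3 = 2.6, and the sigmoid of 2.6 is about 0.93, matching the node output quoted in the lecture.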

So our node output is 0.93。So why not just use a single neuron?
Why do we need to have a larger network where we have one stacked on top of the other?
If we have just a single neuron, as we would if we were just doing logistic regression,
that would only permit a linear decision boundary。
When we move on to stacking one layer on top of the other,
we are able to come up with a much more complex decision boundary,
and most of our real-world problems will probably be much more complicated than just that linear decision boundary that we can learn with something like logistic regression or something with just one unit。

So in order to take our inputs and pass them through and get our different outputs as we see here,
we'd be working with a multilayer perceptron: we saw that one-unit perceptron, and we add on more of them。
And we see here that we have this feedforward structure, where we have our inputs of x1, x2, and x3。
Those will each be inputs into the next layer。If you look at each one of the arrows,
x1 goes to each one of the different perceptrons on that next layer, as does x2 and as does x3。
And then that next layer, the second layer, is connected to every value in the third layer,
and so on and so forth until we get our output of y1, y2, and y3。


045:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p45 6_前向传播.zh_en -BV1eu4m1F7oz_p45-
Now let's walk through some of the important terminology that we should keep in mind when working with a neural network, and here a multilayer perceptron,
as well as some of the basics as to how we get from this first layer up until the final layer that we have, from the x's up until the y's。




So first off, we have our different weights and those weights will determine how do we combine each one of the different layers along our neural network。


So each one of these arrows that will connect x1 to each point each node within that next layer。
as well as all the lines between the second layer and the third layer。
these will all signify the specific weights in how to combine each one of these different layers。



We have our input layer, which is just going to be our input data set here just to make it especially clear。
we can imagine that this first x1, x2 x3 is just going to be the first row where x1 is feature 1。
x2 is feature 2 and x3 is feature 3。





We then have our hidden layers and those are going to be all of these purple nodes that fall between our input layer and what we will define right now as our output layer。
so everything between our input layer and output layer are going to be called our hidden layers。



And those hidden layers, as we specified when we walked through the Python syntax, can be defined
however we'd like, with however many layers we'd like。We can say we want five hidden layers and there would be five different columns of nodes in between our input layer and our output layer。
That's something that we would predefine, and all the weights would connect each of those in order to learn this complex model, feeding from our input layer through the hidden layers out through the output layer,
which will be our actual predictions。







The weights, which are the different arrows, are going to be represented by matrices,
and each of those different matrices will again just be the way that we combine each layer step by step。
Those matrices will have to be of the appropriate shape to ensure that if we have an input that's going to be a 3-vector, it transforms it into a 4-vector in the next layer, then maintains that 4-vector in the next layer, and then brings that down to a 3-vector in that final layer,
and I'll walk through this in just a second。







Our net input will be the sum of the weighted inputs, so that's going to be your z values,
and that's going to be, again, similar to your linear regression。
So x1 times some weight plus x2 times some weight plus x3 times some weight will equal one of your values of z。

And then we will have four different z values for that first layer,
so our z1 is actually going to be a 4-vector, as will our z2, and then our z3 will be a 3-vector。



We then finally have our activation values, and those activation values are just going to be taking those z values that we just discussed and passing them through our activation function。
So I'm going to briefly skip over a0 here, but a1 should be a 4-vector as well,
where we just take that z1 and pass
each one of those different values in that 4-vector through, for example, the sigmoid function。
We can do the same for a2, passing through z2, and then for a3, we can pass z3,
probably here through a softmax layer, in order to give the predicted probabilities that we'd want as output for this classification problem that we have here。


Now if we go back to a0。

a0 is signifying that we want any a to be passed into the next layer。

Even though we're not doing anything to the x1, x2 and x3,
that is going to be fed as input into the next layer,
so we'll often call that a0 just for simplicity。


046:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p46 7_前向传播的矩阵表示.zh_en -BV1eu4m1F7oz_p46-
So imagine working with just a single row of data again, so we have a single data point,
that is, a single row with a certain number of features。

Our W1, or our first weight matrix, would be a 3 by 4 matrix, taking the input values of x1, x2 and x3,
which would be a 1 by 3 matrix, and multiplying that 1 by 3 matrix by the 3 by 4 matrix, W1。

And that will result in our 1 by 4 matrix, which would be our z1。

We can then pass all the values of z1 through our activation function, and that would result in a1,
another 4-vector。


In order to make this as clear as possible, and as we saw on the prior slide,
we can think of our input values as a0。

And every a, so a0, a1, a2, will be the value that's passed as input into the next layer。

And again, our z1 is equal to the dot product of x and W1。

And our a1 is just going to be the activation function of z1,
and that will be passed on to the next layer。
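The first-layer step for a single row can be sketched in NumPy to check the shapes. The weight values are random placeholders and the input row is made up; only the shapes (1 by 3 times 3 by 4 giving 1 by 4) come from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a0 = np.array([[0.9, 0.2, 0.3]])   # input row x, treated as a0, shape (1, 3)
W1 = rng.normal(size=(3, 4))       # placeholder weights, shape (3, 4)

z1 = a0 @ W1        # net input of the first hidden layer, shape (1, 4)
a1 = sigmoid(z1)    # activation passed on to the next layer, shape (1, 4)

print(z1.shape, a1.shape)   # (1, 4) (1, 4)
```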


Now to take this a step further to see how this computes through the entire process of our neural network for a single row。

We are planning to start with a vector representing that row。
in our case that was a row vector of length three。
And plan to end with an output of a row vector of length three as well, which means in this example。
we're probably performing classification with output of three classes。



Now we showed how we got Z1 as a dot product of x and W1。

That allows us to calculate A1 by taking the activation function of Z1。
Z2 or the second layer of Z is calculated as a linear combination of each of those a ones that we just calculated。


And in order to get a linear combination of the correct dimensions, we have W2。
which matches up the shape of A1 and our eventual shape of the values we want for a2。
A2 is then again just the activation function of Z2。

And then Z3 is then again the linear combination of that prior output of A2。

So we need a new weight matrix, W3。

And once we get the linear combination of A2, and since this is the final layer,
we just take the softmax of Z3 in order to give us the predicted probabilities for each one of our different classes, and that will be our predicted y。
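Collected in one place, the single-row forward pass described above reads as follows, with f the activation function and, following the shapes used in this example, W1 of size 3 by 4, W2 of size 4 by 4 and W3 of size 4 by 3. Bias terms are left out here, as they are in the walkthrough.

```latex
\begin{aligned}
z^{(1)} &= x\,W^{(1)},       & a^{(1)} &= f\!\left(z^{(1)}\right) \\
z^{(2)} &= a^{(1)} W^{(2)},  & a^{(2)} &= f\!\left(z^{(2)}\right) \\
z^{(3)} &= a^{(2)} W^{(3)},  & \hat{y} &= \operatorname{softmax}\!\left(z^{(3)}\right)
\end{aligned}
```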



Now, in practice, we're not just working with a single row at a time,

but rather we'd be working with an entire data set's worth of rows。
But in order to calculate this generalized version of our multilayer perceptron,
our equations should look very similar or exactly the same。


So this time we're inputting an n by3 matrix where n is the number of rows still working with three columns though。


And our output should also be an n by3 matrix with a predicted probability for each one of our different rows。


Now, the math should be the same as we saw before, but this time
the dot product of X and W is now just going to be an n by 4 matrix, or whatever the size of our next layer is。Rather than it just being 1 by 4,
we're now n by 4。So if you imagine again that X is n by 3,

we can have our W be a 3 by 4 so that we'd end up with an n by 4 matrix for z1。

We can then take the activation function for all of the outputs that we get for Z1 and end up with our output from that first layer。

And again, we have the appropriate matrix to get the linear combination of each one of those outputs for each one of the different rows。


And end up with Z2。We pass each of those Z2s through the activation function to get A2。
That A2 will be the output。

For the second layer and input into the third layer。

And that will give us Z3 when we take the linear combination of A2 and W3。

And taking that Z3 now for multiple rows, we can take the soft max。


And end up with predicted probabilities for each one of the three classes for all of our N rows。

So that expands it out to the amount of rows within our entire data set。
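A compact NumPy sketch of the full forward pass over n rows is shown below. The weights are random placeholders, n = 5 is arbitrary, and the helper names `sigmoid` and `softmax` are defined here for illustration; bias terms are omitted, as in the walkthrough.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row maximum for numerical stability; each row then sums to 1.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 3))     # n rows, three features
W1 = rng.normal(size=(3, 4))    # input layer -> first hidden layer
W2 = rng.normal(size=(4, 4))    # first hidden layer -> second hidden layer
W3 = rng.normal(size=(4, 3))    # second hidden layer -> three output classes

a1 = sigmoid(X @ W1)            # (n, 4)
a2 = sigmoid(a1 @ W2)           # (n, 4)
y_hat = softmax(a2 @ W3)        # (n, 3): one probability per class per row

print(y_hat.shape)              # (5, 3)
print(y_hat.sum(axis=1))        # each row sums to 1, up to rounding
```

Setting n = 1 recovers the single-row case, since the same matrix products apply.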

047:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p47 8_主要的深度神经网络类型.zh_en -BV1eu4m1F7oz_p47-
Now, there are many deep learning approaches which we're going to discuss throughout this course。
and along with these basic groupings, there's also much more being developed。So quick overview。
we have the neural network models, which are just going to be your multilayer perceptron and feedforward networks,
and this is going to be applied to many traditional predictive problems such as the classification and regression that we've discussed so far。
We have recurrent neural networks, and we have here the classes RNN and LSTM; LSTM is long short-term memory and RNN is recurrent neural network,
and this is going to be useful for modeling sequences。
So this will be useful for time series, where maybe each one of the different steps along the way is dependent on prior steps, or sentence prediction, where each one of the different words may be dependent on prior words。
We'll have convolutional neural networks or CNNs, and that's going to be very useful for feature and object recognition in visual data,
as it will take all the surrounding features and take them in as context moving forward,
as well as being used at times for forecasting,
where it can take points on either end or see some type of patterns within the data in order to predict future values。
And then deep learning can also be used with unsupervised pretrained networks, with autoencoders,
deep belief networks and generative adversarial networks, And there's going to be many uses。
including generating actual images, labeling some outcomes, as well as dimensionality reduction。
using deep learning, and we'll discuss many of these throughout this course。
Now that closes our introduction to neural networks。In the next video。
we will begin to discuss the optimization that's needed in order to come up with our weights using gradient descent。
which will be a key factor in learning each one of our neural network models。 All right。
I'll see you there。


048:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p48 9_神经网络介绍笔记本(选修部分)第1部分.zh_en -BV1eu4m1F7oz_p48-
Welcome to our lab here on the introduction to neural networks。

Here we're going to start off with an exercise based on different logic gates,
such as working with the AND functionality or the OR functionality。And then, as we'll see later on,
working with XOR, in order to find out if we can actually use a single perceptron,
as we discussed in lecture, to come up with these AND or OR functions,
as well as more complex ones such as XOR。





So the first thing that we want to do is import our libraries, so we call
import numpy as np and matplotlib.pyplot as plt, and then we're going to introduce the sigmoid function,
which we discussed in lecture, where we're just going to set sigmoid equal to 1 over 1 plus e to the negative x。



Where we can pass in values of x。

And if that value of x is very high, we'll end up with a value very close to one。
And if that value is very low, low being negative,
then we'll have a value close to zero。

And as we saw with the graph, it'll range between the values of 0 and 1,
but can't go any lower than zero and can't go any higher than one。

So we're just going to define that sigmoid function。
which is just going to be 1 over 1 plus e to the negative x。


We're then going to plot it out and all we're going to do is say that we want 100 values equally spaced between negative 10 and 10。


And then we're going to take the sigmoid of each one of those values,
which we just defined above, so that we get the activation, and then we're just going to plot
the different values on the x axis versus the activation on the y axis。
The rest is just drawing lines and creating a grid and ensuring that we have the right y limits so that we don't go too far negative or too far positive。

So we run that。And we see that as the value gets close to negative 10,
we're essentially at zero for our blue line, and as the value on our x axis gets close to 10,
our y axis gets very close to positive one。



So keep this in mind, and how the sigmoid function works, with the idea that as we move higher, again,
we get closer to one, and as we go negative, we get closer to zero。

Just to highlight as well, at the point of 0 itself, we're at 0.5,
as you see from the grid line at 0.5 on that y axis, so that's going to be exactly a 50-50
chance of either being 1 or 0 if we were to create a threshold。
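The sigmoid definition and the grid of 100 inputs between -10 and 10 can be sketched like this. The plotting calls are left out so the snippet stays dependency-light; the variable names are assumptions.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

vals = np.linspace(-10, 10, num=100)   # 100 equally spaced inputs
activation = sigmoid(vals)             # one activation per input

print(activation[0])    # close to 0 at x = -10
print(activation[-1])   # close to 1 at x = 10
print(sigmoid(0.0))     # 0.5
```

Passing `vals` and `activation` to a plotting function then gives the S-shaped curve described above.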




Now a logic gate is going to take in two booleans,
so two different inputs, usually either true or false,
but we're going to signify true as one and false as zero。

And then with those two inputs, it'll return either a 0 or a 1,
depending on the rule defined for that input。So we have here the truth table for a logic gate that shows the output,
given that we're working with an OR gate。So if we think about an OR gate,
if either one of our two values
is equal to one, then our output should be true; one is equivalent to true。
And the only time we should get false, or zero, is if both of our inputs are equal to zero。

So what we want to know is, can we come up
with a neuron that uses the sigmoid activation function that we just defined, which comes up with values between 0 and 1,
and that will allow us to always output the appropriate values of either 0 or 1?

The idea being that if the sigmoid output is over the threshold of 0.5, then we would predict one,
and if it's less than 0.5, then we would predict 0。

So we pass in x1 and x2, each of which will take on a value of either zero or one,
as well as the intercept, and then we multiply each of those by a certain weight。

Now, again, by limiting the inputs x1 and x2 to be either 0 or 1,
we can simulate the effect of the logic gate that we just saw in the table above。

The goal is to find the weights represented by the question marks we have here in this image。
such that it returns an output close to zero or one depending on what the inputs are。


So the idea would be, if we think through the OR problem, if both x1 and x2 are equal to 0,
then we want to output a 0; otherwise, if either of them is equal to one,
then we want to output a 1。



So we have to think about what those weights should be, and if we think about the plot that we have above:

If the input is very negative, again, negative 10 or less,
then we have a value very close to 0, and if it's very positive, positive 10 or more,
then it's very close to 1。 So that's our goal。



So thinking this through, you can see in the picture here below that we already have the weights,
but let's talk through how these weights will actually work in outputting the value that we want for this OR gate。

If x1 and x2 are both equal to 0, the only value that's going to affect z,
this equation that we have over here, is going to be that intercept term b。

And because we want the result for (0, 0), where both x1 and x2 are equal to 0, to be close to 0,
b should be negative。 It has to be less than 0 to ensure that our sigmoid function outputs a value less than 0.5。
Now, if either x1 or x2 is 1, we want the output to be close to one。

And that means that the weights associated with x1 and x2 should be enough to offset that negative 10 that we have for b。
So if we give b that value of negative 10, w1 and w2 each have to be greater than 10,
so we set them each to 20。 So if x1 is equal to 1 and x2 is 0, then we have negative 10 plus 20, which is positive 10,
we pass that through the sigmoid function, and that outputs a value very close to 1。
The same holds if we have x2 equal to 1 and x1 equal to zero。

And then if both of them are equal to 1, then we end up with positive 30, and again, once we pass positive 30 through the sigmoid function,
we have a value very close to 1。 So as long as either x1 or x2 is equal to 1,
given the weights 20 and 20 and the intercept negative 10, we get a value close to 1, and if both of them are 0,
then we get a value close to 0 out of our sigmoid function。
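The arithmetic just walked through can be checked with a small sketch. It uses the weights quoted in the narration (20, 20, and an intercept of negative 10); the list name `rows` is only an illustrative assumption.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 20, 20, -10  # weights and intercept quoted in the narration

rows = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = w1 * x1 + w2 * x2 + b            # linear combination fed to the sigmoid
    rows.append((x1, x2, z, round(sigmoid(z))))
    print(x1, x2, z, sigmoid(z))
```

The z values come out as -10, 10, 10 and 30, so only the (0, 0) row lands on the low side of the sigmoid.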




So here we see how we can come up with the appropriate weights
to ensure that we actually get this OR functionality。 So I run this。

And the idea that we have here is that we create a function for the logic gate that will take our w1 and our w2,
as well as our b, and then return the sigmoid of w1 times x1 plus w2 times x2 plus b,
which is what we hope to ultimately output。

We are passing z, which is w1 x1 plus w2 x2 plus b, into our sigmoid,
then running the sigmoid and hoping for a value of one
or zero, depending on what we want our outputs to be if we're using an OR gate or an AND gate,
and so on and so forth。


And then we're going to test it by saying, for each one of these input pairs, (0, 0), (0, 1), (1, 0), and (1, 1),
we want to output, for a and b, what the actual value of the gate is going to be。
And again, once we pass the inputs through that sigmoid, say (1, 0) or (0, 1), the output may start off at something like 0.9,
and when we call np.round, we round it to ensure that we get a one。

So we have our OR gate, which is just going to be equal to our logic gate,
which we defined as just the sigmoid of that linear combination。

And we pass in our weights w1 and w2 and our intercept of negative 10。 We test that OR gate,
and we get our different outputs, and we see that they match up with what we saw our OR gate should actually be。
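A minimal sketch of the pattern described above: a function that builds a gate from two weights and an intercept, plus a helper that rounds the output for each input pair. The name `logic_gate` follows the narration; `truth_table` is a hypothetical helper, and the notebook's own test function may print rather than return values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logic_gate(w1, w2, b):
    # Returns a function of (x1, x2) that applies a single sigmoid neuron.
    return lambda x1, x2: sigmoid(w1 * x1 + w2 * x2 + b)

def truth_table(gate):
    # Hypothetical helper: rounds the neuron's output for the four input pairs.
    return [(a, b, int(np.round(gate(a, b))))
            for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]

or_gate = logic_gate(20, 20, -10)  # OR weights quoted in the narration
for row in truth_table(or_gate):
    print(row)
```

Returning a function from `logic_gate` means each gate is just a different choice of three numbers, which is the point the video is building toward.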



Now let's quickly look at the AND gate and how we can come up with the AND gate。

So with the AND gate, when we look at the table that we have here:

If both of them are false, which is our (0, 0), then it should output false。

If only one of them is true, then they're not both true。 That's the AND gate, right:
with the AND gate you want both the first input and the second input to be true。

So we'd still have a zero, and it stays zero unless both inputs are true。
So can we again come up with the appropriate weights to ensure that if they are both true,
then we end up with a true value, and otherwise we get a false value?

So as we see here, we set b equal to a negative value that's negative enough
that even once we add on just one of these weights, whether it's w1 or w2,
which is the equivalent of just one of the inputs being true, we still have a negative output。
So for b plus w2 times 1, we'd have negative 20 plus 10,
and we'd end up with negative 10。 As long as it's negative,
our sigmoid function will output a value less than 0.5, and we round that down to 0。
The only way that we end up with a positive value is if both of these are true。

Then we have negative 20 plus 11 plus 10, and we end up with positive one。
We pass positive one into our sigmoid function, and we have a value greater than 0.5。
Now these w1 and w2 values can be a range of numbers, but first let's show that this works。

We see that it outputs zero for every single input except for (1, 1), as it should with our AND gate。

We can also see that, if we wanted to, we could make each of these weights any value
less than 20, the absolute value of b, so that once just one of them is added on, the sum remains negative,
but once both of them are added on, it becomes positive。
So each weight has to be less than 20, and the two together have to add up to something greater than 20。


So we could run this and see that again, we get all the correct values。
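As a quick check of the AND conditions, here is a sketch with two weight choices. The first set (11, 10, -20) is inferred from the arithmetic spoken in the video; the second set (15, 15, -20) is an assumption chosen only to satisfy "each weight below 20, both together above 20".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logic_gate(w1, w2, b):
    return lambda x1, x2: sigmoid(w1 * x1 + w2 * x2 + b)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Weights inferred from the narration's arithmetic: -20 + 10 = -10, -20 + 11 + 10 = 1.
and_gate = logic_gate(11, 10, -20)
and_outputs = [round(and_gate(a, b)) for a, b in INPUTS]

# An alternative (assumed) choice that satisfies the same two conditions.
alt_and_gate = logic_gate(15, 15, -20)
alt_outputs = [round(alt_and_gate(a, b)) for a, b in INPUTS]

print(and_outputs, alt_outputs)
```

Both gates round to the same truth table, although the second is more decisive on (1, 1) because its z value there is 10 rather than 1.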

Now we're going to do the same thing for the NOR gate and the NAND gate。 NOR and NAND just mean NOT OR and NOT AND,
so the opposite of OR and the opposite of AND, and we'll see why this is important once we get to the next exercise。

So NOT OR is just going to be the opposite of the OR。 So if it's
any of these three rows in this table, which would all have been true for the OR,
then we set it to false, and we only keep it at true if both values are equal to zero。

And thinking through which weights will work, we just need to ensure that we have a
positive value if both are equal to 0, and otherwise a negative value。

So we just have to ensure that b is positive, and that both weights are negative with absolute values greater than b。
And then we have our NOR gate, and we'll double check the outputs。

And we see that (0, 0) is equal to one, and otherwise they're all 0。

And then finally, we're going to close out this video with NAND,
where we'll see, again, this is just the opposite of the actual AND。
So AND would only be true if both values were true。 Now that we do the opposite,
it's true every other time, except for when both values are true。

So what we need to do is ensure that only when both inputs are
1 do the weights cancel out the b that we have here; otherwise, we always have a positive value。
So we do that by saying that these two weights added together, w1 and w2, will outweigh the b,
but on their own they can never outweigh that b value,
and that will ensure that we always get what we have here in the table for the NOT AND gate。

And we can see that that holds as well。
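The NOR and NAND conditions can be sketched the same way. The video does not read out the weights for these two gates, so the values below are assumptions chosen to satisfy the rules just described (for NOR, a positive b and two negative weights each larger than b in absolute value; for NAND, a positive b that only both negative weights together can outweigh).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logic_gate(w1, w2, b):
    return lambda x1, x2: sigmoid(w1 * x1 + w2 * x2 + b)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Assumed weights, not taken from the notebook.
nor_gate = logic_gate(-20, -20, 10)    # either input alone pushes z below zero
nand_gate = logic_gate(-11, -10, 20)   # only both inputs together push z below zero

nor_outputs = [round(nor_gate(a, b)) for a, b in INPUTS]
nand_outputs = [round(nand_gate(a, b)) for a, b in INPUTS]

print("NOR: ", nor_outputs)
print("NAND:", nand_outputs)
```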

Now in the next video, we're going to pick back up
and discuss why there is a limit to working with only a single neuron, and how we can build off of a single neuron
to create another layer of neurons,
as we do with our multilayer perceptron, and come up with this XOR functionality,
which we'll discuss in the next video。

All right, I'll see you there。

049:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p49 10_神经网络介绍笔记本(选修部分)第2部分.zh_en -BV1eu4m1F7oz_p49-
Now in this video, we're going to discuss the limits of working with just a single neuron。
So far, in the past video, we saw how a single neuron is able to handle the AND gate,
the OR gate, the NOR gate, and the NAND gate。 Now we'll see the limits when we work with the XOR, or exclusive OR, gate。
And for those of you that have taken computer science courses,
perhaps you're familiar with the XOR gate。 For those of you that are not,
the idea of the XOR gate is to only pass true if one or the other of our inputs is true,
but not both of them。 So we see that if both are false, we return false, but also,
if both are true, then we return false。 Only if exactly one of them is true do we return true。
So can we create a set of weights such that a single neuron can output this property that we see here?

And it turns out that we can't, since no single straight line separates the true cases from the false cases。

If we want to, what we're going to need to do is actually create another layer。
So we'll pass in our input values x1 and x2, as well as our intercept。
And then we'll create another layer, as we do with our feedforward neural networks,
and see how, using two layers, we can come up with this XOR gate。
So the concept is, if it's going to be XOR, we want one of the outputs in the second layer
to actually be equivalent to the OR gate, and the other one to be equal to the NAND。 So the idea is:
if we have x1 equal to 1, or x2 equal to 1, or both equal to 1,
then the OR gate will return one。 The only input it won't return one for is if they're both zero。
And then the NAND will return the true value, one, for all of them
except for both inputs being one。 And then we can take the outputs of the OR gate and the NAND gate,
and in the next level, add on another AND gate。 And that will give us our XOR function。
So if we think about that, if we start with (0, 0), then the OR gate will pass a 0。
So then when we take the AND gate at the second level,
no matter what, if one of its inputs is 0, then we automatically end up with a 0。
So that is correct when both the inputs are 0, and with our XOR gate,
as we see here in the table, it should be 0。 Now, if one of them is a 1 and the other is a zero,
the OR gate will return a 1, and the NAND gate, which returns one for every input
except (1, 1), will also return a 1。 And in that second level, if we take the AND of one and one,
we will output one, so we'll get the correct value。 And then finally, we want a zero
if both x1 and x2 are equal to 1, so both true。 And for our NAND gate, if we have one and one,
we'll pass through a 0。 So even though our OR gate would pass through a 1,
the AND gate at the second level, taking the one and the 0,
will ensure that it ends up passing a zero。
So that matches the final row of the XOR table, where we pass out a zero。 So again,
the XOR gate will just be a combination of an OR gate and a NAND gate, which we learned just above,
and then we can pass those through in the second layer to an AND gate。
And that will give the correct output for our XOR gate。 And we see that in practice,
where we define our XOR gate as an AND gate applied to the outputs of
both the OR gate and the NAND gate。 We pass in that c and d,
which will be ones or zeros accordingly, and add on that AND gate。 And when we test that,
we get the output that we would expect, where we have 0 for both of them being false, 0 for both of them being true, and true
if just one of them is equal to true。
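The two-layer construction can be sketched as follows. The OR and AND weights follow the numbers spoken in the previous video; the NAND weights are an assumption, and the notebook's own `xor_gate` may be written slightly differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logic_gate(w1, w2, b):
    return lambda x1, x2: sigmoid(w1 * x1 + w2 * x2 + b)

or_gate = logic_gate(20, 20, -10)
nand_gate = logic_gate(-11, -10, 20)   # assumed weights
and_gate = logic_gate(11, 10, -20)

def xor_gate(a, b):
    c = or_gate(a, b)      # first layer, neuron 1
    d = nand_gate(a, b)    # first layer, neuron 2
    return and_gate(c, d)  # second layer combines the two

xor_outputs = [round(xor_gate(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)
```

Note that `c` and `d` are passed on as unrounded sigmoid outputs, so the second-layer neuron sees values near 0 or 1 rather than exact Booleans; the rounding only happens at the very end.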


So that closes out our discussion in regards to working with that XOR gate and adding on the extra layer, and seeing how we can come up with more complex boundaries once we move to multiple layers。

Now, we discussed during lecture the actual weight matrices, taking our input,
how that's transformed into the first layer and then into the second hidden layer,
and then eventually into our output。 What we're going to do here is make that more concrete by actually coming up with some random weights as well as some random inputs, and see how these matrix sizes transform as we go from our input through to our output。

So here we're going to start with three weight matrices, W1, W2, and W3,
representing the weights in each layer。

And the convention for these matrices is that W_ij gives the weight from neuron i in the prior layer
to neuron j in the next layer, so the weight of moving from i to j。

Now a vector x_in is going to represent a single input。 As we discussed during lecture,
we can work with just a single input, as well as a full data set of inputs,
and our x_mat_in is going to represent a toy version of a full data set with just seven rows。

And the goal for the exercise here is that, for input x_in,
we're going to calculate the inputs and outputs to each layer
as we move from our linear combination,
which outputs some z value, to taking the sigmoid of that value, and then seeing how that's passed through to each one of the different layers。
We're going to write a function that does the entire neural network calculation for a single input,
do that again for a matrix of inputs, and then test the functions that we just created using our x_in and our x_mat_in,
which is our toy data set。

So let's look at this W1, W2 and W3, which will highlight for us how these actual weight matrices should look in the back end。
Now this isn't learning the optimal parameters for us。
but it's showing us just one step through the feed forward from the input all the way through to the output and when we get back to lecture。
we'll talk about how we can actually optimize these weights。
So here we start with a three by four matrix for W1。 We have three rows and four columns。
W2 is going to be 4 by 4。 And then W3 is going to be 4 by 3。 And while I say those numbers,
you should be thinking of how we're transforming our three-dimensional vector of x1, x2, x3:
we take our 3 by 4 matrix to expand that into a four-vector, then keep that as a four-vector by multiplying by a 4 by 4 matrix, and then multiply by a 4 by 3 matrix to ensure that we have three outputs。
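The shape bookkeeping above can be sketched with random matrices. The weights here are random placeholders rather than the notebook's values, and the activations are deliberately left out so that only the shapes are in focus.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Placeholder weights with the shapes described in the narration.
W_1 = rng.normal(size=(3, 4))
W_2 = rng.normal(size=(4, 4))
W_3 = rng.normal(size=(4, 3))

x = rng.uniform(size=3)   # a single three-dimensional input

h1 = x @ W_1              # 3-vector times 3x4 gives a 4-vector
h2 = h1 @ W_2             # 4-vector times 4x4 stays a 4-vector
out = h2 @ W_3            # 4-vector times 4x3 gives three outputs

print(h1.shape, h2.shape, out.shape)
```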


So that's the idea of W1, W2 and W3。 So our input's just going to be these three values, 0.5,
0.8 and 0.2。 And then our toy data set, which is going to have seven rows,
is also going to have three columns, where we have as the first row
the same entry that we have in our x_in, and we see how this can expand to six more rows where each row has the same number of columns。

We're defining here the softmax for a vector, which is just going to allow us to output probabilities for a single vector,
and then we do the same for a full matrix。 So we run this, and we see our output: as mentioned,
we have for our W1 a 3 by 4 matrix。

That's going to be multiplied by a row vector, so a 1 by 3 multiplied by a 3 by 4 should output a 1 by 4 matrix,
which should be our first hidden layer。 And then we see the matrix for
our toy data set, with its 7 rows。 So imagine this as a 7 by 3。 When we multiply this by W1,
we end up in our first hidden layer with outputs for every single one of the different rows。
So it'll be a 7 by 4 matrix, and we'll see this in just a second。 So
let's first pass in here just the x_in, and we take the dot product。
Then, as mentioned, we get the linear combination。 Here we're looking at z2,
which is just that linear combination, taking the dot product of x_in and our matrix W1,
and we end up with this four-vector, as mentioned。
We're then just going to take the sigmoid of that output。 So we got the linear combination,
now we take the sigmoid of that and we still have the same shape,
but now it's the sigmoid of each one of those outputs。
And now that output will feed into the next layer。 And again, W2 was a 4 by 4 matrix,
and here we have a 1 by 4 vector, so we'll end up with, again, our z3 being a 1 by 4 vector。
And once we have that linear combination, we can again take the sigmoid,
and that will be the input into the final layer, and z4 will be the dot product of a3 and W3,
where again, W3 is going to be that 4 by 3 matrix, to ensure that it matches up with the 1 by 4 vector and outputs a 1 by 3 vector。

And then we take that z4 and call softmax to see probabilities for each one of the different classes。

So we see that, of the different classes that we predict, the first class is the most likely。
And that's the idea of feeding through this neural network up until that softmax to come up with a solution to our classification problem。
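Here is a sketch of the single-input forward pass, step by step. The input values come from the narration; the weights are random placeholders, so the resulting probabilities (and which class comes out most likely) will not match the notebook's output. The variable names z2, a2, z3, a3, z4 mirror the spoken ones, and `y_out` is an assumed name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(vec):
    # Subtracting the max is a standard numerical-stability step.
    e = np.exp(vec - np.max(vec))
    return e / e.sum()

rng = np.random.default_rng(0)
W_1 = rng.normal(size=(3, 4))   # placeholder weights, not the notebook's
W_2 = rng.normal(size=(4, 4))
W_3 = rng.normal(size=(4, 3))

x_in = np.array([0.5, 0.8, 0.2])

z2 = x_in @ W_1      # linear combination into the first hidden layer
a2 = sigmoid(z2)     # activation, same shape as z2
z3 = a2 @ W_2        # linear combination into the second hidden layer
a3 = sigmoid(z3)
z4 = a3 @ W_3        # three raw output scores
y_out = softmax(z4)  # probabilities over the three classes

print(y_out, y_out.sum())
```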
Now, quickly, I want to show you what this looks like if we pass in the full matrix。

So we run x_mat_in, and instead of it just being one row,
now we have all seven rows being passed through。 And then we have that 7 by 4 matrix。 We can take the
dot product of that output with the 4 by 4, and we still have a 7 by 4 matrix。

Then we take the sigmoid of that, excuse me, of the z3, so it's still the same shape,
but now taking the sigmoid of each one of those values。
We can then take the dot product with W3 so that we get the linear combination for each of those rows, but only outputting three different values,
and we can take the softmax, and then we can see the probabilities for each one of the rows of that original matrix。



Now, just to see how that computes all the way through from beginning to end,

we create a function that calls the softmax vec on, as you see from what we pass through,
the sigmoid of the dot product of x and W1, then the sigmoid of the dot product of that with W2, and then the dot product of that with W3。
And then we can do the same for the matrix, and it's essentially the same function:
as we just mentioned, the computation on x works just as well whether it's a matrix or just a single input。
We do create two different functions, and we can run them,
and we can see that we have the solutions desired。
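A sketch of what the two wrapper functions might look like. The names `soft_max_vec`, `soft_max_mat`, `nn_comp_vec` and `nn_comp_mat` are assumptions, the weights are random placeholders, and the toy matrix is random apart from its first row, which repeats the single input as described earlier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_max_vec(vec):
    e = np.exp(vec)
    return e / np.sum(e)

def soft_max_mat(mat):
    e = np.exp(mat)
    return e / np.sum(e, axis=1, keepdims=True)  # normalise each row separately

rng = np.random.default_rng(0)
W_1 = rng.normal(size=(3, 4))
W_2 = rng.normal(size=(4, 4))
W_3 = rng.normal(size=(4, 3))

def nn_comp_vec(x):
    # Full forward pass for a single input vector.
    return soft_max_vec(sigmoid(sigmoid(x @ W_1) @ W_2) @ W_3)

def nn_comp_mat(x):
    # Same computation for a matrix with one observation per row.
    return soft_max_mat(sigmoid(sigmoid(x @ W_1) @ W_2) @ W_3)

x_in = np.array([0.5, 0.8, 0.2])
x_mat_in = rng.uniform(size=(7, 3))
x_mat_in[0] = x_in   # first row repeats the single input

out_vec = nn_comp_vec(x_in)
out_mat = nn_comp_mat(x_mat_in)
print(out_vec)
print(out_mat.shape)
```

The only real difference between the two functions is the softmax: for a matrix the normalisation has to run along each row, otherwise the whole matrix would be scaled to sum to one.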



All right, well, that closes out our video here, working from beginning to end with a neural network。
And once we get back into lecture, we'll discuss how to actually optimize this model so that we're not just looking at random weights,
but eventually the optimal weights, using what we learned with gradient descent and then something called backpropagation。
All right, I'll see you there。


050:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p50 11_梯度下降基础.zh_en -BV1eu4m1F7oz_p50-
In this section, we will discuss gradient descent, one of the most crucial components that make learning the parameters of our neural nets actually possible。
So the learning goals for this section are going to include going over basic gradient descent,
as well as how to do stochastic gradient descent, where we update our parameters by going through one observation at a time。
And then finally, mini-batch gradient descent, which will update our parameters using just a subset of the observations within our data。

So for gradient descent, we're going to start with a cost function, J of beta,
which looks like the following。 And our goal is to find
the beta values at which this cost function is minimized。 Now, in order to find that beta,
which we see on our x axis, we initialize at some random point on our cost function。
And then we use the gradient to gradually descend towards that minimum value,
and that minimum value means we have minimized the cost function, and that will be our optimal value for beta。

Now let's discuss an example with linear regression to help make this a bit clearer。
So imagine that you're working with a simple regression where we're trying to learn two beta values,
beta naught and beta 1。

Now in three dimensions, we have on one axis all the possible values for beta naught。

On the other, we have all the possible values for beta 1, and on that final axis, the vertical axis,
we have the output of the cost function we're trying to minimize for all values of beta naught and beta 1。

Now that we have increased the number of dimensions。
we now have a more complicated surface on which to find that minimum value。


So how do we find that minimum value if we can't tell exactly how this cost function will look for our given model?

The key again, is to start at a random point。


We then compute the gradient at this point with respect to beta naught and beta 1。

And that gradient will always point in the direction of the largest increase。

Now we take the negative of that gradient, and now we are pointing in the direction of the largest decrease。

Now, the gradient that we are discussing will be a vector in the same dimensional space as our parameters,
consisting of the partial derivatives of the cost function with respect to each one of those parameters。

So we have the gradient, and that's going to tell us the direction of descent for each one of our individual parameters。


As we see here, we have for every single parameter each one of their partial derivatives。

We can then use the gradients and the given cost function to calculate a new point from that original initialized point。


So we started off with w naught, and now w1 will equal w naught minus a learning rate,
which we'll discuss in just a second, multiplied by the gradient of our cost function。



And we can see that we have now moved closer to minimizing our cost function as we move down and we subtract that gradient。


Now that learning rate is going to be a tunable parameter that will tell us how large we want to make each one of our steps within our cost function。


And we want to be careful with this, because with too large of a step we'll end up overshooting our minimum,

and with too small of a step it'll take too long to optimize our model。

Now, using the same concept of subtracting the gradient,
we can iterate to move closer and closer to the minimum value from that last step。


So now w2 is going to be equal to that w1 we just calculated, minus the learning rate times the gradient of the cost function at w1。


And you see we move closer with w2, and then w3 will be the same thing, subtracting from the weights we got for w2
the learning rate times the gradient。

And we see we move a bit closer。 And eventually we end up at the global minimum。
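The update rule can be sketched on a toy cost function. The bowl-shaped cost below, its minimum at (1, -2), the starting point and the learning rate are all assumptions chosen for illustration; they are not the cost surface shown on the slide.

```python
import numpy as np

target = np.array([1.0, -2.0])   # assumed location of the minimum

def cost(beta):
    # A simple convex bowl: J(beta) = sum((beta - target)^2)
    return np.sum((beta - target) ** 2)

def gradient(beta):
    # Vector of partial derivatives of J with respect to each parameter.
    return 2.0 * (beta - target)

learning_rate = 0.1
beta = np.array([4.0, 3.0])      # arbitrary starting point
path = [beta.copy()]

for _ in range(100):
    # w_next = w_current - learning_rate * gradient(w_current)
    beta = beta - learning_rate * gradient(beta)
    path.append(beta.copy())

print(beta, cost(beta))
```

On this particular bowl, any learning rate at or above 1.0 keeps the iterates from converging (exactly 1.0 bounces between two points, anything larger diverges), which is the overshooting behaviour mentioned above, while a very small rate converges but needs many more steps.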

051:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p51 12_比较不同的梯度下降方法.zh_en -BV1eu4m1F7oz_p51-
Now the concept of stochastic gradient descent, compared to the gradient descent we just discussed,
is to speed things up by using only a single data point to determine the gradient of the cost function。
So rather than summing together all of the error values and then taking the gradient,
as we did with that vanilla gradient descent,

we instead calculate our next weights by subtracting from the current weights the learning rate times the gradient
given the error for just one observation。 So above, we had the sum over all the values,
and now, if we look at the x observation and y observation here,
we're doing this for just one specific value。 Then using this single point,
we can again iterate to continue to update the weights。
So for each update we can use a different random point,
but we keep using only a single random point, and we keep updating our weights,
moving down our cost function。 And eventually, we hopefully end up near the global minimum,
but that path is going to be much less direct due to the noise of working with just a single data point。
And that's the bit of randomness that makes it a stochastic gradient descent。


Now, finally, with mini-batch gradient descent, we can choose some value n between one and the size of the entire data set,
and now perform an update for every n training examples。 So now,
rather than summing over the entire data set or just one single observation,
we are summing over a random subset of our original data, seeing our error,
taking the gradient, and moving down that line given the gradient for that subset of values。
And now we get the best of both worlds: we can reduce the memory needed relative to our original or vanilla gradient descent, where we use the entire data set,

but it'll be less noisy and get to the optimal value much more smoothly than working with stochastic gradient descent。
So that's going to be the idea behind mini-batch gradient descent,
and that n will be another parameter that we'll be tuning as we work with our neural network models。

Now, just to recap: in this section we discussed gradient descent, or full-batch gradient descent, where we went through every single row in our data set in order to make each update to our parameters。
We then discussed stochastic gradient descent, and how we can take steps according to the gradient on each one of the single rows within our data set,
so checking the error against a single row and then updating accordingly。
And we discussed, between the two, how gradient descent may take a long time
but will move more smoothly, whereas stochastic gradient descent will make each update more efficiently
but may be a little bit bouncy in getting to that desired goal of our optimal value。
So the compromise was this mini-batch gradient descent,
where we reach that optimal value by not taking every single observation within our data set in order to calculate the gradient,
and not just taking a single value, but taking a mini batch of, say,
32 observations or 64 observations before making an update using that gradient。
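The three variants differ only in how many rows feed each update, which a single function with a batch-size argument can illustrate. This is a sketch under stated assumptions: the function name `minibatch_gd`, the synthetic data (intercept 1.5, slope 2.0, small noise), the learning rate and the number of epochs are all chosen for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, lr=0.1, epochs=100, seed=0):
    """Gradient descent on mean squared error, updating once per batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle rows every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)
            theta = theta - lr * grad
    return theta

# Synthetic data with known coefficients (assumed for this sketch).
rng = np.random.default_rng(42)
n = 200
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.5 + 2.0 * x + rng.normal(0, 0.1, size=n)

theta_full = minibatch_gd(X, y, batch_size=n)   # full batch: one update per epoch
theta_sgd = minibatch_gd(X, y, batch_size=1)    # stochastic: one row per update
theta_mini = minibatch_gd(X, y, batch_size=32)  # mini-batch: 32 rows per update

print(theta_full, theta_sgd, theta_mini)
```

All three should land near (1.5, 2.0); the stochastic estimate typically wobbles the most from run to run because each step sees only one noisy row.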
Now that closes out our video here on gradient descent,
and we'll get a clearer picture in our next video, where we'll jump into a notebook on how gradient descent actually works。
All right, I'll see you there。


052:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p52 13_梯度下降笔记本(选修部分)第1部分.zh_en -BV1eu4m1F7oz_p52-

Welcome to our notebook here on gradient descent。 In this notebook,
we're going to have an overview of working with gradient descent in order to solve a simple linear regression,
as well as working with stochastic gradient descent,
which we defined in lecture as taking a single row, seeing the error, and moving according to just the error on that single row, compared with vanilla gradient descent, where we use the entire data set。

So to start off, we're going to import all the necessary libraries, here just importing NumPy,
pandas and Matplotlib。

We're then going to generate data from a known distribution。
so we'll know the actual values that we want to find when we do our gradient descent。

So if we think about just working with linear regression in general。
what we're trying to solve for is some y where that y is equal to some betas or some different coefficients multiplied by our different values in our X data set。
So here we have y equals b, which is just our intercept term, plus theta 1 times x1 plus theta 2 times x2 plus some error term。

And in order to generate the data where we specify each one of the different thetas。
we're going to have x1 and x2, each be random values between 0 and 10。
where any value between 0 and 10 is equally likely to be picked since we're picking from the uniform distribution。
We're then going to actually set the values for b, theta 1 and theta 2, so b is going to be 1.5。
Theta 1 is going to be equal to 2, and theta 2 is going to be equal to 5。

And then from there, we can generate our y values, as well as our feature matrix。
which will have our x1 and x2 values。 So how do we do that? The first thing we want to do is set the random seed, so that you at home are seeing the same solutions that we have here。

We're then going to say that we want 100 observations。

And we're going to pass that through, as our x1s and our x2s are going to be random values between 0 and 10。
So we call np.random.uniform with values between 0 and 10,
and we want 100 different observations between those values of 0 and 10。
We set that equal to our x1 and our x2。 And then for our constant term,
we're just going to call np.ones, which will just create an array of ones
of a certain shape that you define, and we just define it as a one-dimensional array with 100 different values。
And then finally, we're going to add on that error term。 If you recall, up here
we also wanted to include the error term。 This is to ensure that the data doesn't fit exactly, and we'll set that error term equal to values from a normal distribution with a mean of 0 and a standard deviation of
0.5, and again, we want 100 different values。 We're then going to choose our b,
our theta 1 and our theta 2 to match the values we defined above。
And then y is just going to be equal to b times that constant term, which is just our ones,
plus theta 1 times our x1, which we defined as random values between 0 and 10, plus theta 2 times x2,
which is again different values between 0 and 10, plus that small error value。
We're then going to create an array out of our x1, our x2, and our constant term,
so that we have our X matrix, or our feature matrix。 And we run this。
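The data generation described above might look roughly like this. The seed value, the variable names, and the column order of the feature matrix (constant first) are assumptions; the distributions and coefficient values follow the narration.

```python
import numpy as np

np.random.seed(1234)   # seed value is an assumption; any fixed seed works
num_obs = 100

x1 = np.random.uniform(0, 10, num_obs)   # uniform on [0, 10)
x2 = np.random.uniform(0, 10, num_obs)
const = np.ones(num_obs)                 # column of ones for the intercept
eps = np.random.normal(0, 0.5, num_obs)  # noise: mean 0, standard deviation 0.5

b, theta_1, theta_2 = 1.5, 2, 5

y = b * const + theta_1 * x1 + theta_2 * x2 + eps

# Feature matrix with one row per observation: [const, x1, x2].
x_mat = np.array([const, x1, x2]).T

print(x_mat.shape, y.shape)
```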

And then we can see what our actual y values are, and each should be some combination of, if we look at this x_mat,

something along the lines of two times this value and five times this value plus
1.5, since 1.5 will just be multiplied by one, and we'll have that for each x1 and x2。

So in order to get the right answer directly, we can look at the closed-form solution of this model rather than using something like gradient descent。
With linear regression, we can actually use matrix algebra to get the exact solution
that will find the maximum likelihood, or least squares, estimate for our data set。

And that's just going to be this matrix algebra here。 It's not too important。
All that's important here is to know that there's a closed-form solution, and that for linear regression
you do not necessarily have to use gradient descent to find each one of your parameters。
Now the reason why we introduce gradient descent。Is because when we're doing deep learning or even for many of our other models。
we can't find this closed form solution and we'll need to use gradient descent to move towards that optimal value as we discussed in lecture。

So here we're going to use scikit-learn's linear regression model,

as well as the actual matrix algebra that we have defined here,
which we can just compute with NumPy。


So from sklearn, we're going to import our linear regression model。
We're going to call LinearRegression, and we don't want to fit the intercept, since the intercept's already included in our feature values in the x_mat array that we defined earlier,
so we set fit_intercept equal to False。

And then we can fit our x_mat and our y, and then see what coefficients it comes up with。


And we can see that they're very close to the values that we wanted: b of 1.5,
theta 1 of 2 and theta 2 of 5。

Now, scikit-learn's linear regression model solves this same least squares problem directly rather than iteratively。
So we should get the same solution when we compute this equation ourselves, just using NumPy。
So that's just going to be the inverse of the dot product of X transpose and X itself,
and then the dot product of that with X transpose, and then the dot product of that with y。

And then when we look at the solution that that comes up with, again,
it's the same as what we just saw with linear regression from scikit-learn。
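The closed-form computation can be sketched with NumPy alone. To avoid depending on scikit-learn here, the comparison is made against `np.linalg.lstsq` instead, which also solves the least squares problem directly; the data is regenerated with an arbitrary seed, so the exact numbers differ from the notebook's.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for this sketch
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
x_mat = np.column_stack([np.ones(n), x1, x2])
y = 1.5 + 2 * x1 + 5 * x2 + rng.normal(0, 0.5, n)

# Normal equations: theta = (X^T X)^(-1) X^T y
theta_normal = np.linalg.inv(x_mat.T @ x_mat) @ x_mat.T @ y

# Same least squares problem solved by NumPy's built-in routine.
theta_lstsq, *_ = np.linalg.lstsq(x_mat, y, rcond=None)

print(theta_normal)
print(theta_lstsq)
```

Explicitly inverting X transpose X is fine for a three-column toy problem; for larger or ill-conditioned feature matrices, a dedicated least squares solver is generally the safer choice.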

Now, that closes out this section, where we got an intro to the data that we're working with。 In the next video,
we're going to discuss actually solving this problem using gradient descent,
as well as how to visualize that process。 All right, I'll see you there。


053:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p53 14_梯度下降笔记本(选修部分)第2部分.zh_en -BV1eu4m1F7oz_p53-
Welcome back for the second part of this demo on gradient Descent。As we mentioned in the last video。
when we are working with other problems such as working with neural networks。
we are not going to have this analytical solution that we just found。So instead。
we're going to have to move towards that optimal value using gradient descent。
So in order to see this in practice, we're going to pick a learning rate, as well as a number of iterations,
run the code, and plot the trajectory as gradient descent moves us towards the optimum。
If you recall in lecture, that means that we're going to pick how big each one of our step sizes are going to be and then see step by step as we move closer to the optimal value。
how we're actually moving towards each one of the different thetas that we're trying to predict。
And then using that we'll find some examples where the learning rate is too high, too low。
or just right。

So we're going to start off with a learning rate of 1 times 10 to the negative 3, so 0.001。
And we're going to say we want 10,000 iterations, so 10,000 different steps,
and we're going to initialize theta with the values 3, 3, 3。

So in order to actually perform gradient descent, we're going to pass in a learning rate,
the number of iterations, and the theta initial, all defined up here above。
So for the initialization steps, we'll set our theta originally equal to that theta initial,
which is at this point (3, 3, 3)。We're then going to set the theta path at first equal to just a bunch of zeros, with as many rows as the number of iterations plus one。
So if we're doing 10,000 iterations, there'll be 10,001 rows, and each one will have three columns。
So it'll say, for each one of our different values, which are b, theta 1, and theta 2,
how we're moving closer and closer to each one of the true values through each one of the different iterations。
And then we're setting that first row, so row 0 of the theta path for all the columns, equal to that theta initial。

And then just to start off, we're going to set the loss vector equal to np.zeros, with one entry per iteration,
and we'll record the loss as we move through each one of the steps to see if we continue to minimize that loss。

We're then going to do this main gradient descent loop。
which is going to be what we discuss in lecture in regards to starting at the initial point。
finding the gradients, and then using that gradient to move closer and closer towards our optimal value。
So we're going to set our prediction equal to the dot product of our x matrix and our thetas。
If we think about that, in this case initializing with 3, 3, 3, all we're doing for each row is multiplying
three times the first value, three times the second, and three times the third,
adding those all together, and getting our first prediction for each one of our different y values。
For our loss vector, which we defined up here, that first loss will be equal to the sum of the squares of the actual values minus what we predicted。
So we'll get not the mean squared error, but just the sum of squared errors。And then our gradient vector,
which we didn't go through in lecture, but just so you know what the gradient actually looks like as we're taking those partial derivatives:
it's going to be the error on the prediction, so y minus y_pred,
and we take the dot product of that with x_mat。So that's actually going to give us that gradient vector。
And it will be of the size that we need for updating theta。In this case,
this will actually be the negative of the gradient, up to a constant factor of 2 that gets absorbed into the learning rate。So later on we're going to add it on rather than subtract it。
So we see here that we add on that gradient vector,
but that's actually going to be equal to the negative of our true gradient vector。
We divide that by the number of observations that we have, and we'll use that in order to move a step closer towards our actual theta。
So at first our theta is (3, 3, 3)。We got that gradient vector, and we set theta to (3, 3, 3) plus that learning rate multiplied by that gradient vector,
which should also be of the same shape as that (3, 3, 3), so a row vector with three values。
And then we say that row i plus 1 of the theta path is equal to the theta that we just found, so we reset the theta,
and then we go back through the for loop using that new theta and coming up with our new prediction,
our new error, our new gradients, and then our new theta values。
And then, after we go through that entire for loop, we're going to return the entire theta path
as we go through each one of the different iterations, as well as the full loss vector,
which tells us how far off we are from our actual solution as we move down the line。
And if we recall, that loss vector is just going to be the sum of squared errors。
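The function walked through above can be sketched as follows. This is a sketch under stated assumptions rather than the notebook's exact code: `x_mat` and `y` are passed in explicitly instead of being read from the notebook's global state, and the synthetic data and seed below are made up for the demo.

```python
import numpy as np

def gradient_descent(learning_rate, num_iter, theta_initial, x_mat, y):
    """Batch gradient descent for linear regression, recording every step."""
    num_obs = x_mat.shape[0]
    theta = np.asarray(theta_initial, dtype=float)

    # One row per step, plus row 0 for the starting point
    theta_path = np.zeros((num_iter + 1, theta.size))
    theta_path[0, :] = theta
    loss_vec = np.zeros(num_iter)

    for i in range(num_iter):
        y_pred = x_mat @ theta                      # prediction for every row
        loss_vec[i] = np.sum((y - y_pred) ** 2)     # sum of squared errors
        # (y - y_pred) @ x_mat points downhill (the negative gradient, up to a
        # constant factor), so it is added to theta rather than subtracted.
        grad_vec = (y - y_pred) @ x_mat / num_obs
        theta = theta + learning_rate * grad_vec
        theta_path[i + 1, :] = theta
    return theta_path, loss_vec

# Assumed synthetic data with true coefficients b=1.5, theta_1=2, theta_2=5
rng = np.random.default_rng(0)
x_mat = np.column_stack([np.ones(100), rng.uniform(0, 10, 100), rng.uniform(0, 10, 100)])
y = x_mat @ np.array([1.5, 2.0, 5.0]) + rng.normal(0, 0.5, 100)

theta_path, loss_vec = gradient_descent(1e-3, 1000, [3.0, 3.0, 3.0], x_mat, y)
print("final theta:", theta_path[-1])
print("first loss:", loss_vec[0], "last loss:", loss_vec[-1])
```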

So we have our gradient descent function。And then we're going to actually plot this out。
I'm going to quickly walk through this, I'm not going to go through every single line of code。
but I do want you to get some intuition as to what we're plotting here。

So we have our true coefficients, which are just equal to the b, the theta 1, and theta 2 that we defined earlier。And then we define plot_ij。
And what we're doing here is plotting two of these different values against each other,
so either b versus theta 1, or b versus theta 2, or theta 1 versus theta 2。
Those are going to be our three plots。
So in plot_ij, we're going to plot the actual true coefficients。
If we say i and j correspond to b and theta 1, then we'll say we want the zeroth value,
which will be b, and the first value, so j would be theta 1。
So we're just going to plot the actual values and then mark those as the true coefficients。
We're then going to plot the theta path。Again, we're only using two of the dimensions at a time。
And let's say, again, we're working with b and theta 1。If we set i equal to 0,
then we want for the theta path only that first column, which will be the different values for b,
and, if we set j equal to 1, that second column, the different values for theta 1。

And then we can mark what the initial value is by taking row 0 of the theta path at columns i and j and labeling that as the start。
And also note here that we draw each one of the steps that we take from the start to the end using dashed lines,
with a triangle marker for each step。And then finally,
we're going to take row negative one of the theta path at columns i and j
in order to get the final value, and we'll label that as the finish。


And then plot_ij is just going to be one piece of what happens when we call plot_all。

What we're doing is just taking each one of the different combinations of axes that we can,
so we'll have b versus theta 1, b versus theta 2, and theta 1 versus theta 2。

And as we see here, we're calling plot_ij with 0, 1, then with 0, 2, and then with 1, 2,
each on its own set of axes。And then on top of that,
we're also going to plot our loss vector to see how we reduce the actual loss function as we iterate and move closer and closer to the true values。
With that, we call our gradient descent function to output both our theta path and our loss vector。
We can then pass those into the plot_all function that we defined, along with our learning rate,
our number of iterations, and our initial theta。

Before we run this, we obviously have to run the cells that define these functions, so we run those,
and this will plot out the actual steps that we take。


Here we see the start, and we see each one of these triangles moving closer and closer to the finish。If we look at the top left plot,
we see the x axis is theta 0, the y axis is theta 1, and we want theta 0 to be moving towards 1.5,
that's our b, and we want theta 1 to be moving towards 2。
And then if we look at the actual values, it looks like theta 0 stopped at around 1.9。
Theta 1 also stopped at around 1.9。And then if we look at the top right graph,
we see theta 0 is still the x axis, so also still at 1.9, but now the y axis is theta 2,
and that's actually pretty close to 5 already。

And we can see each step it takes along the way。And then here we can see theta 1 versus theta 2,
which, as with all of these, starts at 3, 3。
And then it should have moved towards that 2 and 5 that we'd like。And in the bottom right graph,
we see the loss against the number of iterations, and we see that pretty quickly, once it gets to, it looks like, maybe 100 or
200 iterations, it drops off to a very low error, and then it slowly,
very gradually continues to minimize that error, but not by nearly as much。
That, together with the small learning rate, is probably why we haven't gotten all the way to our optimal values,
and why we have the gap between the actual true values and the endpoints that we got using gradient descent。

Now I quickly want to show you what this looks like if we were to decrease the number of iterations。

So if we decrease this to, say, 100, and we run all this, we see that it stops a lot earlier along the way。
The gradient descent didn't quite get to finish。


Now if we keep this at 10,000 and we increase the learning rate, which allows for larger steps,
we can see here that it actually reached the optimal solution。

So increasing the learning rate allowed for those larger steps,
and we were actually able to get to those optimal solutions。

Finally, I want to show you what will happen if you set that learning rate too large。

Now if we run this, it may be a little bit difficult to see, but if you look at the top left corner of the top left plot,
we see that we're talking about massive, massive numbers。So it's 0.2 times 10 to the 305。

We have missed our value by a long shot。We totally overshot the optimal value。
And if we look at the actual error, we see that as we increase the number of iterations,
the error shoots way, way, way up as we completely miss the optimal value。
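The three regimes shown in the plots (learning rate too low, about right, and too high) can be reproduced on a much smaller problem. The sketch below is an illustration only: it swaps the regression loss for a one-parameter loss J(theta) = (theta - 2)^2, and the function name and the three learning rates are chosen for the demo, not taken from the notebook.

```python
def run_gradient_descent(learning_rate, num_iter=100, theta=3.0, target=2.0):
    """Minimize J(theta) = (theta - target)**2 with plain gradient descent."""
    for _ in range(num_iter):
        gradient = 2.0 * (theta - target)        # dJ/dtheta
        theta = theta - learning_rate * gradient
    return theta

too_small = run_gradient_descent(0.001)    # steps so small it barely moves in 100 iterations
about_right = run_gradient_descent(0.1)    # settles on the target
too_large = run_gradient_descent(1.1)      # every step overshoots further than the last

for name, value in [("too small", too_small),
                    ("about right", about_right),
                    ("too large", too_large)]:
    print(f"{name:12s} final theta = {value:.6g}")
```

With the largest rate, each update multiplies the distance to the optimum by a factor whose magnitude exceeds one, which is the same mechanism behind the enormous values on the axes in the video.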
So that closes out our video here working with vanilla gradient descent。In the next video we will go through the same steps and briefly walk through how you can do it using stochastic gradient descent。


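054: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p54 15_Gradient Descent Notebook (Optional) Part 3.zh_en -BV1eu4m1F7oz_p54-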
Welcome back to our notebook here on gradient descent。In this video we're going to close out by discussing stochastic gradient descent,
so rather than averaging the gradients across the entire data set before taking any steps,
we're now going to take a step for every single data point, as we discussed in lecture。Now,
our exercise here is going to be to run the stochastic gradient descent function that we have here,
and then also modify the code so that we randomly pick which one of the different data points it uses at each update, and I'll walk through that in just a second。


So first, for stochastic gradient descent, much of the function will look similar to our gradient descent function。
We'll set theta equal to this initial theta。

We will have our array of zeros, whose size now depends not just on the number of iterations
but also on the number of observations。So here the idea is that the number of iterations will be how many times we are going to go through the full data set。
But we're updating at every single data point。So by the time we run through the full data set once, if our data has 100 values,
we will have made 100 updates。
If our number of iterations is equal to 100 and our number of observations is equal to 100,
then we'll be making 10,000 updates, but only running through the entire data set 100 times。


And this will become more and more familiar when we work with our deep learning models later on in this course。
These passes will be called epochs, and the number of epochs will be the number of times you run through the full data set。
And then you will also define the batch size, as we talked about with mini batch gradient descent,
where we will actually find the right balance between the batch size and how many times we want to run through the entire data set。
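The relationship between epochs, batch size, and the number of parameter updates is simple arithmetic, sketched below. The helper name `num_updates` is hypothetical, and rounding a partial final batch up to one extra update is an assumption about how the last batch is handled.

```python
import math

def num_updates(num_epochs, num_obs, batch_size):
    """Total parameter updates when each batch produces one update."""
    updates_per_epoch = math.ceil(num_obs / batch_size)  # a partial last batch still counts
    return num_epochs * updates_per_epoch

# 100 passes over 100 observations, as in the example above
print("stochastic (batch size 1):  ", num_updates(100, 100, 1))
print("full batch (batch size 100):", num_updates(100, 100, 100))
print("mini batch (batch size 32): ", num_updates(100, 100, 32))
```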




So we're going to have zeros for the theta path, and its size is going to be the number of iterations
times the number of observations in our x matrix。
We're then going to set the first row equal to theta initial, and our loss vector again is equal to zeros, with the number of iterations times the number of observations entries。

Then for the main stochastic gradient descent loop, we're saying for i in range of the number of iterations。But now we're also saying for j in range of the number of our observations。
So for i, and then for j, and that nested loop will visit every single observation within every single pass。
So if we do 100 iterations, then for each iteration
we're going to go through each one of the different observations,
and that's going to be your value for j。And then we're going to get that gradient vector。
So up until the gradient vector, things are the same。For that gradient vector,
we're going to take the actual y value at row j minus the value that was predicted for that specific row。
And then we're going to take the x matrix, but only row j, and multiply it by that error to get our gradient vector just for that single observation。
And then with that new gradient vector, we'll update our theta values at each one of the different steps to get each one of our new thetas。

And then we update the theta path according to the count,
and our count is incremented by one each time we run through the inner j loop。
That's why we have this count variable to keep adding on, rather than just using i or j。
If we were looking at the observation index, that only goes from 0 to 100, and the iteration index
also only goes from 0 to 100, whereas the count will go from 0 through that 10,000
as it goes through both of these for loops。

And then we'll just return, as we did before, the theta path, as well as the loss vector。
So here we set the learning rate to 1e-4。
The number of iterations here again is just 100, but that means we're making 10,000 steps,
while only running through the full data set 100 times。
And our theta initial is going to be (3, 3, 3) again。So we can call stochastic gradient descent,
get our path as well as our loss vector, and use that same plot_all function that we defined above
to plot out that theta path and the loss vector, with the appropriate labels for the learning rate,
the number of iterations, and the initial theta。
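The stochastic version described above can be sketched as follows. As before, this is an approximation of the walkthrough rather than the notebook's code: `x_mat` and `y` are passed in as arguments, the data and seed are made up, and the demo uses 20 passes instead of 100 to keep it quick. Rows are visited in their stored order here.

```python
import numpy as np

def stochastic_gradient_descent(learning_rate, num_iter, theta_initial, x_mat, y):
    """One parameter update per observation; num_iter counts passes over the data."""
    num_obs = x_mat.shape[0]
    theta = np.asarray(theta_initial, dtype=float)

    total_updates = num_iter * num_obs
    theta_path = np.zeros((total_updates + 1, theta.size))
    theta_path[0, :] = theta
    loss_vec = np.zeros(total_updates)

    count = 0                                  # running index across both loops
    for i in range(num_iter):                  # passes over the full data set
        for j in range(num_obs):               # one update per row
            y_pred = x_mat @ theta
            loss_vec[count] = np.sum((y - y_pred) ** 2)
            # Gradient estimate from row j only: scalar error times that row
            grad_vec = (y[j] - y_pred[j]) * x_mat[j, :]
            theta = theta + learning_rate * grad_vec
            theta_path[count + 1, :] = theta
            count += 1
    return theta_path, loss_vec

# Assumed synthetic data with true coefficients b=1.5, theta_1=2, theta_2=5
rng = np.random.default_rng(0)
x_mat = np.column_stack([np.ones(100), rng.uniform(0, 10, 100), rng.uniform(0, 10, 100)])
y = x_mat @ np.array([1.5, 2.0, 5.0]) + rng.normal(0, 0.5, 100)

theta_path, loss_vec = stochastic_gradient_descent(1e-4, 20, [3.0, 3.0, 3.0], x_mat, y)
print("updates made:", loss_vec.size)
print("final theta:", theta_path[-1])
```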

So we run this。

And we can see the path that was taken, and if we look at this bottom left graph,
that'll be your first clear observation of how much it swerves back and forth, rather than taking a straight path towards where we're trying to aim as we did with the normal gradient descent,
so it's a bit more random。

Now, here is something to note if you are doing something like stochastic gradient descent or mini batch gradient descent。

If we run that for loop with a fixed ordering, then as we go through each one of the values,
the update for step 20 will be dependent on the update from step 19 and the update from
step 18, and so on and so forth, so the path will be biased according to the ordering of our actual data frame。
So rather than do that, we're going to make this a bit more random and at each step set our j equal to a random integer。

We do that with np.random.randint, passing in the number of observations as the upper bound,
so we can draw an index from zero up to the number of observations。

And we run this and now rather than that ordering, we have a bit more randomness。
and we can see it's a bit more squiggly, but that ensures that it doesn't have like we saw before a clear pattern going back and forth and it's not dependent on the ordering of the data frame。
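The random row selection can be sketched in isolation. The first option follows the `np.random.randint` approach mentioned in the video; the second, shuffling once per pass with `np.random.permutation`, is a common alternative added here for comparison and is an assumption, not something the notebook is known to use. The seed is only there to make the sketch reproducible.

```python
import numpy as np

np.random.seed(0)
num_obs = 100

# Option from the video: draw a fresh random row index at every update.
# Sampling is with replacement, so a pass may repeat some rows and skip others.
random_rows = [np.random.randint(0, num_obs) for _ in range(num_obs)]

# Alternative (assumption): shuffle the row order once per pass,
# so every row is still used exactly once per pass.
shuffled_rows = np.random.permutation(num_obs)

print("distinct rows drawn with randint:", len(set(random_rows)))
print("distinct rows after permutation: ", len(set(shuffled_rows.tolist())))
```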



Now we can play around with this。As we see here, we can increase the number of iterations, and I'll run this because it will take a second。Now we're making 10,000 times 100 updates along the way, so using stochastic gradient descent we will hopefully be able to get to that solution。
But in general, using stochastic gradient descent, as mentioned during lecture, will allow you to speed things up, but at the same time it may not land exactly on the right solution, as it will be bouncing around on the way to that ultimate solution。

That may have taken just a second to run, but as we see here,
once we increase the number of iterations (and that may even have been too many iterations, we didn't have to go as long as we did), we end up
at that final point, at the finish that we'd hoped for,
with theta 2 equal to 5, if we look at the bottom left, and theta 1 equal to 2。


And we see that we have this sharp decrease in the loss right at the beginning of the iterations and only a very slight amount of improvement after that。
But we do know where we are aiming。Now, something to note here, as I said:
we may not have needed to go as many iterations, as we get towards the bottom of that slope。
If we imagine that convex curve that we discussed during lecture,
the gradient is going to get smaller and smaller as it approaches that optimum value,
so the updates will be smaller and smaller as we get closer and closer to that optimum value。
So that's going to ensure that, if we have a small enough learning rate, we don't overshoot it。
It just keeps moving in the right direction and settles near that minimum point of the convex curve。
Now, that closes out our section here on gradient descent。And with that,
we will get back to lecture and discuss further working with our neural networks。All right,
I'll see you there。


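055: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p55 16_How to Train a Neural Network.zh_en -BV1eu4m1F7oz_p55-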
In this section we're going to go over the basics of back propagation。
and that's going to be the magic that ultimately makes training these complex neural networks actually possible。
So what's the process of actually training our neural net?
We saw before that we start with some initial guess。
We then saw as well how to get a prediction by pushing that initial input through the feed forward network。
The next step is to compare our prediction to the actual value and calculate that loss function J,
which measures our error。Once we see that error, we can adjust our weights accordingly and repeat the process。
So how do we know exactly how to adjust each one of our weights?
Back propagation is ultimately going to be that key framework that tells us how to make a single adjustment to our weights in the right direction。
using the chain rule from calculus。

So how did we train our model before, using gradient descent?
We saw this done during the gradient descent notebook。We made a prediction,
given our initialized parameters。We then calculated the loss function for that particular prediction。
given the actual values we were trying to predict。
We then calculated the gradient of that loss function with respect to our parameters。
recall that this will be the direction of the steepest increase。
By subtracting that gradient value multiplied by some learning rate。
we were able to move our parameters in the direction that will minimize our loss function。
And we iterate over and over with the number of iterations and steps predetermined by our model。
And ultimately, our goal is to reach the optimal values that minimize our loss function。
So let's start off our discussion of this process for neural networks with these first two steps。
making a prediction and calculating the loss。

So first we pass in our input values。We have initialized values for each of the weights。
and with that, we calculate the different values at each layer。
And then we ultimately get our predicted output values for that given input and the given weights。
And with that, we get to evaluate our loss function, for example。
our squared error or our logistic loss。

So how can we ultimately change our weights to continue to lower that loss function at each iteration?
Let's think of the neural network as a single function F that takes as input X and outputs some value y。
The key to the complex computation that makes up this function F all comes down to the different weights, represented as
the different arrows in our picture below。And given the structure of our neural networks,
where the inputs are defined by our data sets and our activations will be predetermined when we create our model。
the parameters that we are trying to learn are going to be those weights。
and ultimately those weights will define that function F。And ultimately。
our loss function will be a function of the true value Y and this function F with the input of X。
And if we focus in on just one layer, we can see how many different weights actually need to be calculated。
So to get the z value for just a single node in that first layer,
we'll need to learn a w for each input, as well as a value for b。And then again,
another four parameters for the second node of that layer,
and so on for every node within that layer。And then we need to also do this for every node in all of the other layers in our network as well。
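The parameter count described above is easy to tally. The sketch below assumes three inputs per node, which is what "four parameters" per node (three weights plus one bias) implies, and a layer of four nodes chosen purely for illustration; `layer_parameter_count` is a hypothetical helper.

```python
def layer_parameter_count(num_inputs, num_nodes):
    """Each node learns one weight per input plus one bias term."""
    return (num_inputs + 1) * num_nodes

print("one node, three inputs:  ", layer_parameter_count(3, 1))
print("four nodes, three inputs:", layer_parameter_count(3, 4))

# A whole network is the sum over layers, e.g. 3 inputs -> 4 nodes -> 1 output
layer_sizes = [3, 4, 1]
total = sum(layer_parameter_count(n_in, n_out)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print("total for 3 -> 4 -> 1:   ", total)
```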

Now our goal then, when we find the gradients, is to find the partial derivative of J with respect to each weight,
in other words, how much does a small change in the parameter affect our loss function J?Now,
given the way that the gradient is calculated, this will tell us in what direction to adjust each weight W_k to lower our loss function。
And once we're able to do this, we can then just adjust and repeat。
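The idea of a partial derivative as "how much does J change for a small change in one weight" can be checked numerically. The sketch below uses a made-up one-weight model (prediction = w * x with squared error) and a central finite difference; the function names, the step size `h`, and the learning rate are assumptions for illustration.

```python
def loss(w, x=2.0, y=1.0):
    """Squared error of a one-weight model whose prediction is w * x."""
    return (w * x - y) ** 2

def numerical_partial(f, w, h=1e-6):
    """Central finite difference: nudge w slightly and see how f responds."""
    return (f(w + h) - f(w - h)) / (2 * h)

w = 3.0
dJ_dw = numerical_partial(loss, w)
analytic = 2 * (w * 2.0 - 1.0) * 2.0     # chain rule: 2 * (w*x - y) * x

# Stepping against the derivative lowers the loss
learning_rate = 0.01
w_new = w - learning_rate * dJ_dw

print("numerical dJ/dw:", dJ_dw)
print("analytic  dJ/dw:", analytic)
print("loss before:", loss(w), "loss after:", loss(w_new))
```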
So that closes out this first video, and in the next video we'll officially introduce the concept of back propagation and how it ties in with this idea of using the partial derivatives to find these optimal weights。


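056: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p56 17_Back Propagation.zh_en -BV1eu4m1F7oz_p56-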
Now, getting into back propagation, what we're ultimately going to want,
if you recall with the gradients, is that partial derivative with respect to each one of our weights, and using that partial derivative
we'll be able to update those weights in the correct direction moving forward。

Now this idea of being able to use calculus to update our parameters is going to play an important role in how our neural network models are actually constructed。
Our functions used to calculate our y, as well as our loss function。
are going to be chosen so that they have nice derivatives。
As we saw earlier when we touched on the derivative of the sigmoid function。
And we'll ultimately need to also be aware of some numerical issues that will come into play when working with these derivatives。
such as exploding and vanishing gradients, which we'll touch on later on。

Now with that in mind, we're going to think of the weights layer by layer。
and now we're going to dive a bit into the calculations used when actually conducting back propagation。
So the values for the weights for that final layer in our neural network
will be updated using the partial derivative of the loss with respect to the weights of that final layer。
And that's going to be calculated by taking the dot product of our error term,
y hat minus y, and the output from the prior layer that fed into our final layer。And then from there,
in order to calculate the gradient for the weights of the second layer, the layer before the final layer,
we take what we learned from that final layer, the error term, and take the dot product with W of that final layer,
multiplied by the derivative of the activation of z, again from that final layer。
And with that, we take the dot product with the output of the prior layer,
and that's our a1 over here。And finally, we add on the further steps needed and take the dot product with X,
our initial input, in order to get the derivative with respect to the weights of our initial layer。
And notice how these will be affected by our actual error term,
so larger or smaller errors will affect the size of each one of our gradients。
Also note that if we use the sigmoid activation function,
the derivative is simply the sigmoid of z times 1 minus the sigmoid of z。
And we're going to touch on this later on when we talk about vanishing gradients。
And we will be using this in our notebook, and although it looks a bit complex and may have sounded a bit complex as we talk through it。
we'll see in the notebook that they're actually quite easy to compute。
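The layer-by-layer calculation above can be sketched in NumPy. This is a sketch under several assumptions: a three-weight-matrix network with sigmoid activations everywhere, a summed log loss (for which the output error term is y hat minus y), and made-up layer sizes and data. The names `a1`, `a2`, and `delta*` index activations by the layer that produced them, which may not match the numbering on the lecture slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2, W3):
    a1 = sigmoid(X @ W1)        # first hidden layer output
    a2 = sigmoid(a1 @ W2)       # second hidden layer output
    y_hat = sigmoid(a2 @ W3)    # network prediction
    return a1, a2, y_hat

def log_loss(y, y_hat):
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backward(X, y, W1, W2, W3):
    a1, a2, y_hat = forward(X, W1, W2, W3)
    delta3 = y_hat - y                           # error term at the output
    grad_W3 = a2.T @ delta3                      # error times the prior layer's output
    delta2 = (delta3 @ W3.T) * a2 * (1 - a2)     # push back through W3 and sigmoid'
    grad_W2 = a1.T @ delta2
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)     # push back through W2 and sigmoid'
    grad_W1 = X.T @ delta1                       # finally, dot product with the input X
    return grad_W1, grad_W2, grad_W3

# Small made-up problem: 5 rows, 3 inputs, two hidden layers of 4 nodes, 1 output
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 3))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
W1 = rng.uniform(-1, 1, size=(3, 4))
W2 = rng.uniform(-1, 1, size=(4, 4))
W3 = rng.uniform(-1, 1, size=(4, 1))

grad_W1, grad_W2, grad_W3 = backward(X, y, W1, W2, W3)
print(grad_W1.shape, grad_W2.shape, grad_W3.shape)
```

Each gradient has the same shape as the weight matrix it updates, and every one of them is scaled by the output error term, which is why larger errors produce larger updates.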

So the idea of back propagation is that we'll first run our neural network with our initialized weights。

Then, moving back through our layers, we're going to take the derivative of our loss function with respect to each of the weights in our final layer。

Then we use that to get our partial derivatives with respect to our layer 2 weights,
and then, finally, our layer 1 weights。And we'll use these to update our initialized values,
and then again, feed these updated weights through our neural net and repeat the process。

Now I want to quickly touch on this concept of vanishing gradients that I mentioned earlier。
Recall that this was the derivative, what we see here, that we get for updating our three layer feed forward neural network。
What I want to highlight here is the fact that we are multiplying by the derivative of the activation function of our z from all the layers later in the network。
Now, with that, we want to re-emphasize that for our sigmoid function,
the maximum value the derivative can take is going to be 0.25。And you can run the math,
but this is due to the fact that the sigmoid can only take on values between 0 and 1, and the derivative, sigmoid times one minus sigmoid, is largest when the sigmoid equals 0.5。
And thus the max value it can take on would be 0.5 times (1 - 0.5),
when we talk about the derivative here。And that would equal our 0.25,
which we're stating is the maximum。Now, if we think about this,
the fact that 0.25 is the absolute maximum means that if we continue to make our network deeper and deeper and we continue to multiply by these small values,
the gradient at these earlier layers, such as the W1 that we see here, will eventually get very,
very small。And this problem of the gradient eventually getting incredibly small as we create these deeper neural networks
is what we call the vanishing gradient problem。And for this reason,
other activations such as ReLU and others, which we'll touch on right after our next notebook in the next video,
have become more and more common。
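The 0.25 bound and its effect on depth can be checked with a few lines of plain Python. This is a small illustration rather than anything from the course notebook: the grid of z values and the list of depths are arbitrary, and the product of 0.25 factors is only an upper bound on the activation-derivative part of the gradient (the weights themselves also enter the real product).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Scan z from -10 to 10: the derivative peaks at z = 0, where sigmoid(z) = 0.5
z_values = [k / 10.0 for k in range(-100, 101)]
max_derivative = max(sigmoid_derivative(z) for z in z_values)
print("largest sigmoid derivative on the grid:", max_derivative)

# Upper bound on the product of sigmoid derivatives across successive layers
upper_bounds = {depth: 0.25 ** depth for depth in (1, 5, 10, 20)}
for depth, bound in upper_bounds.items():
    print(f"{depth:2d} layers: product of derivatives <= {bound:.3e}")
```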


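057: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p57 18_Back Propagation Notebook (Optional) Part 1.zh_en -BV1eu4m1F7oz_p57-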
Welcome to our demo here on back propagation In this exercise。
we're going to use back propagation to train a multi layer perceptron or our feed forward neural network that we've gone over in lecture。
Using just a single hidden layer。You'll have an opportunity on your own to play around with different patterns。
we're just going to focus on one pattern here, doing a classification problem and you'll see that in just a bit and see how quickly or slowly the different weights converge。
And with that, we'll also see the impact and interplay of different parameters such as the learning rate。
the number of iterations, and the number of data points that we're working with。So first thing。
we import our library, and then, just to give you an overview of what we're going to go over in this exercise:
we're going to prepare code to create a multi layer perceptron with just a single hidden layer。
And in that hidden layer, there'll be four nodes。And we'll train it using back propagation。
And if you notice in the libraries that we just imported。
We're not going to be using the traditional deep learning libraries just yet。
We're going to show you just using Numpy how to go through these steps of the feed forward and then the back propagation in order to learn the optimal weights。
So in order to do so, we're going to initialize the weights to some random values between negative one and one,
as we discussed again in lecture。We're then going to perform that feed forward computation through our neural net。
We'll compute that loss function at the end, and using that loss function will calculate the gradients for all of our weights。
using back propagation, telling us in which direction to update the different weight matrices。
using also our learning rate parameter, which we'll define。
and then we'll execute those steps 2 through 5 for a fixed number of iterations that we'll define as well。
And with that we'll plot the different accuracies and the log loss and observe how they change over time。
And we'll be using the log loss here since we'll be doing a classification problem。
And then once the code is running, you can address the following questions。
which patterns was the neural network able to learn quickly and which ones took longer again。
you can go through this on your own what learning rates and number of iterations worked well。
and then if you have time, you can try varying the size of the hidden layer and experiment with different activation functions,
including ReLU that we have here, which we'll discuss when we get out of this notebook。
and that might take a bit more to play around, but that's definitely an option that should be available as you get more and more familiar with the code。

So with that, I'd like to walk through the code that we have here。
we're going to start off with a certain number of observations。
we're going to create our initial pattern here, So we're saying there's going to be 500 different observations in our data set。
so equivalent of 500 rows。And then our X matrix。Is just going to be random values between negative one and one。
And we're going to have two different variables for our x matrix。
so these are the two variables for our input values。And with that。
we're also going to predefine that bias term, which is just going to be a column of ones。
We can concatenate those two together, and those are all the inputs that we need to learn weights for。
So we will learn the optimal weights for the two x values, as well as for that bias term。
So if you'd like to try a different pattern, you can just uncomment any one of the other patterns。We've chosen the diamond here。

Each one should work fine and you can look through each one of them。
and we're going to print out the actual shape to see what this looks like。

But if you think about the diamond pattern that we are outputting, we have
values between negative one and one in our x_mat_full。So we're going to take
the absolute value of each one of the first two columns and add them together。And if that sum is less than one,
then we mark that as True,
and anything greater than one would be False, and those become 1 and 0 once we cast that as type integer,
and that will allow us to classify whether or not a point is in the diamond。
So if you think about it, for random values between negative one and one,
that means that the absolute values of our first two x values have to add up to less than one。
If it's greater than that, so if one value is 0.75 and another one's negative 0.5,
the absolute values sum to 1.25, which is greater than one, so that would be outside of the range,
and we would be outside the diamond, and we'll see this in the graph in just a second。
So we're printing what the actual shape is of our x matrix, as well as our output y,
which should just be one dimension。

x_mat_full should be 500 by 3, since we're including the bias term。
And then we're just going to plot, taking that x_mat_full。We're going to plot the points
where the y value that we defined up here is equal to one, and we will mark those with o's
in red。And then for those where the y is equal to zero, so we're taking a different subset,
we'll mark those with x's, so that we can see that diamond that we're talking about。
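The data setup for the diamond pattern can be sketched as below; the plotting step is left out. The variable names follow the ones heard in the video as closely as the transcript allows (`x_mat_full`, `y`), but `x_mat_1`, `x_mat_bias`, and the seed are assumptions.

```python
import numpy as np

np.random.seed(0)   # assumed seed, only for reproducibility of the sketch
num_obs = 500

# Two input variables drawn uniformly from [-1, 1), plus a column of ones for the bias
x_mat_1 = np.random.uniform(-1, 1, size=(num_obs, 2))
x_mat_bias = np.ones((num_obs, 1))
x_mat_full = np.concatenate((x_mat_1, x_mat_bias), axis=1)

# Diamond pattern: a point is inside (label 1) when |x1| + |x2| < 1
y = ((np.abs(x_mat_full[:, 0]) + np.abs(x_mat_full[:, 1])) < 1).astype(int)

print("shape of x_mat_full:", x_mat_full.shape)
print("shape of y:", y.shape)
print("fraction inside the diamond:", y.mean())
```

Since the diamond covers half the area of the square the points are drawn from, roughly half of the labels should come out as 1.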

So you run this, and you can see our diamond here。

And you see the points inside are those where that sum stays less than one。So essentially,
if you look at any one of the corners,
those are going to be at values of 1 and 0。And then there should be approximately a straight line along each edge, dividing those that are less than one from those that are greater than one,
when you take the absolute value of each one of the
x1 and x2 components here and sum them together。

So we want to come up with this classification method using what we learned in regards to our neural networks。
Clearly, it's not going to be a linear divider。 It's going to have to be a bit more complex than that。
And in the next video, we'll walk through how we initiate each one of the functions that we need。
as well as that feed forward and back propagation structure and start to learn the actual weights that we need in order to classify this object。
All right, I'll see you there。


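058:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p58 19_Backpropagation notebook (optional) part 2.zh_en -BV1eu4m1F7oz_p58-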

Welcome back。In this cell that we have here。We're going to walk through a bunch of the different functions that we're going to use throughout in order to make that feed forward neural net。
as well as take those back propagation steps。So to start off, we define our sigmoid function。
which you should recall from lecture, is just going to be one over one plus e to the negative x,
x just being our input here。Generally speaking, we'll be passing in that z value in order to get our a,
if you recall the symbols that we use throughout the lecture。
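As a small illustration of the sigmoid just described, here is a minimal NumPy sketch. The function name `sigmoid` follows the video; the sample inputs are made up for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

# z plays the role of the linear output that we pass in to get the activation a.
z = np.array([-5.0, 0.0, 5.0])
a = sigmoid(z)
print(a)  # values strictly between 0 and 1; the middle entry is exactly 0.5
```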
Then we're going to define our loss function, passing in y_true and y_pred。
We also have this epsilon, which is going to be this very small value。
And the reason for that is just that we will tend to error out if we get exactly one or exactly zero for our prediction。
And as we see here, if our prediction, y_pred, is equal to 0, we'll take the maximum of that value
and our epsilon, which is this incredibly small number,
and we'll end up with that incredibly small number, rather than ending up with 0。
And on the other end, we take the minimum between y_pred and 1 minus that epsilon。
If y_pred was exactly 1, then 1 minus that epsilon will be 0.99 et cetera, something very,
very close to 1, but not exactly 1。Then once we have that。
We are going to compute the log loss function, which is just going to be
the negative log of our prediction times y_true, plus the negative log of one minus our prediction times one minus y_true。And if you think about y_true,
it's either going to be zero or one。So if this y_true is equal to zero,
then whatever np.log of the prediction is, it gets multiplied by zero,
so that first term will automatically cancel out and it won't count toward the error。
But if y_true is 1 and you predict something very close to zero,
then the negative log of that prediction is large, and you'll end up with a higher value。And the same thing goes on the other end:
if y_true is equal to one, then the second term will cancel out。But if y_true is equal to 0
and you predict something very close to one, then the error is large, because one minus the prediction is close to zero and the negative log of a value near zero is very large。
So that's our loss function, and that's going to be the log loss again because we're doing classification between 0 and 1。
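The clipping and log loss just described might look like the sketch below. The name `loss_fn`, the default `eps=1e-15`, and averaging over observations are assumptions for illustration; the video only says that epsilon is a very small value.

```python
import numpy as np

def loss_fn(y_true, y_pred, eps=1e-15):
    """Mean log loss, with predictions clipped away from exactly 0 and 1."""
    y_pred = np.maximum(y_pred, eps)      # a prediction of 0 becomes eps
    y_pred = np.minimum(y_pred, 1 - eps)  # a prediction of 1 becomes 1 - eps
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong

print(loss_fn(y_true, good) < loss_fn(y_true, bad))  # True
# Without the clipping, predictions of exactly 0 or 1 could give log(0).
print(np.isfinite(loss_fn(y_true, np.array([1.0, 0.0, 0.0, 1.0]))))  # True
```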
We then want to define here our forward pass, and we're going to pass in our initialized weights W1 and W2。
as well as our updated weights throughout, and we'll see how we do this in the next cell。With that。
we're actually also going to be doing our back propagation steps here。
so we're also going to be computing the gradients。
or we're going to use the output of this to do our back propagation steps。
So we're pulling in these global values that we've defined Xmat, y and nu。

We're then going to compute the new predictions step by step here。
So if we think about our feed forward neural net, in order to get to z2,
that next step in our process, we just take the dot product of our matrix of original inputs and W1。
That'll give us z2。In order to get a2, we have to do our nonlinear transformation,
which is going to be our sigmoid function on that z2。So first, we did the linear step,
then that nonlinear step taking the sigmoid, and then z3 is just going to be taking that output。
And z3 will ultimately be the last step where we have to take the sigmoid before our prediction, because we're only doing one hidden layer。
So we take the output from the prior layer, which was a2, and we take the dot product of that with W2。
And then our prediction is just going to be the sigmoid of that z3, and when we reshape it here,
we're just making sure that it's only one dimension。Now to compute the gradient。
Given the loss function that we have defined, the gradient
of our loss function with respect to z is just going to be negative y plus y_pred。
Now this is with respect to z。Before, when we were looking at the loss function in the lecture,
we were taking it with respect to that final output, which would be a3,
but here we're starting with z3。We take that gradient of J with respect to z3 and we use that in order to calculate the gradient of J with respect to W2。
And that's just going to be this J_z_3, that prior gradient that we just calculated,
and the dot product of that with a2。And then the next value that we calculate
is not a gradient of J; here we're calculating the gradient
of a with respect to z。And if you recall what we do with the z value in order to get a,
this is just going to be the sigmoid of z2 times 1 minus the sigmoid of z2。
And that's just going to be, again, that derivative of the sigmoid。
And then we're going to use everything
that we've calculated so far in order to get the gradient with respect to W1,
which starts from that J_z_3, which is what we defined up here。
We reshape it just to ensure it's in the right shape and that it's not a one dimensional array,
and take the dot product of that with W2 transposed specifically。And then we multiply it
by this value that we have output over here, the gradient of a2 with respect to z2。And then we take the dot product of the transpose of that
with our original input, and take the transpose of that result。
So we just go back through each one of the steps, taking each one of the different gradients that we learned before,
using back propagation to ultimately get the gradients with respect to each one of our weights,
W1 and W2, which is going to be what we need in order to do our back propagation steps and continuously update each one of our parameters, which are just going to be our W1 and W2。
And this function will return both our prediction, which we have defined up here。
As well as this gradient, which is going to be this tuple that we have here。
And that gradient is going to have the gradient in respect to each one of our different weights。
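The forward pass and gradient steps walked through above can be sketched as follows. This is an illustrative version under stated assumptions, not the notebook's exact code: the video's function reads the data from global variables, while this sketch takes `x_mat` and `y` as arguments, and the names `hidden_grad`, `x_demo` and `y_demo` are introduced only for the example. The gradients are those of the summed (not averaged) log loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x_mat, y, W_1, W_2):
    """One forward pass plus the gradients of the summed log loss.

    x_mat: (n, 3) inputs including the bias column, y: (n,) labels in {0, 1},
    W_1: (3, 4) input-to-hidden weights, W_2: (4,) hidden-to-output weights.
    """
    # Forward: linear step, sigmoid, linear step, sigmoid.
    z_2 = np.dot(x_mat, W_1)                   # (n, 4)
    a_2 = sigmoid(z_2)                         # (n, 4)
    z_3 = np.dot(a_2, W_2)                     # (n,)
    y_pred = sigmoid(z_3).reshape(len(x_mat))  # keep the prediction one dimensional

    # Backward: chain rule from the output back to each set of weights.
    J_z_3_grad = -y + y_pred                          # dJ/dz_3, shape (n,)
    J_W_2_grad = np.dot(J_z_3_grad, a_2)              # dJ/dW_2, shape (4,)
    a_2_z_2_grad = sigmoid(z_2) * (1 - sigmoid(z_2))  # da_2/dz_2, shape (n, 4)
    hidden_grad = np.dot(J_z_3_grad.reshape(-1, 1), W_2.reshape(-1, 1).T) * a_2_z_2_grad
    J_W_1_grad = np.dot(hidden_grad.T, x_mat).T       # dJ/dW_1, shape (3, 4)

    return y_pred, (J_W_1_grad, J_W_2_grad)

# Tiny usage example on random inputs with diamond labels.
rng = np.random.default_rng(0)
x_demo = np.concatenate((rng.uniform(-1, 1, size=(10, 2)), np.ones((10, 1))), axis=1)
y_demo = ((np.abs(x_demo[:, 0]) + np.abs(x_demo[:, 1])) < 1).astype(int)
W_1 = rng.uniform(-1, 1, size=(3, 4))
W_2 = rng.uniform(-1, 1, size=4)

y_pred, (J_W_1_grad, J_W_2_grad) = forward_pass(x_demo, y_demo, W_1, W_2)
print(y_pred.shape, J_W_1_grad.shape, J_W_2_grad.shape)  # (10,) (3, 4) (4,)
```

Each gradient has the same shape as the weights it belongs to, which is what lets the update step subtract it directly from W1 or W2.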


And then finally, we define this plot_loss_accuracy function,
which will just show us each one of the different loss values and the accuracies。
We create our initial figure and say what the title is going to be,
and we're going to create subplots。So we're going to create two different plots, one for the loss,
one for the accuracies。So ideally, the losses should be going down and the accuracies should be going up。
And we're just going to call ax.plot, so it's a very simple plot。
And then we're going to do that again for our accuracies。
So we have our loss values and our accuracies, and those are passed into the function。
So now we see all of the different functions that we have available to us。
And we're going to use those in the next video in order to do both our feed forward steps。
As well as our back propagation。 and then ultimately。
once we do that and go through a number of iterations。
see the output of the different accuracies and the different losses。All right, I'll see you there。


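059:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p59 20_Sigmoid activation function.zh_en -BV1eu4m1F7oz_p59-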
Now let's introduce some other activation functions besides just working with the sigmoid function that we discussed earlier。
So let's go over the learning goals for this section。

In this section, we're going to cover the different activation functions for our nodes within our neural network。

And then at the end, we'll summarize each one of the commonly used activation functions that we discuss。


So recall that this basic perceptron that we have here is the simplest building block for our neural network。
The classic perceptron model typically used a step activation function。
so 0 for values less than 0 and 1 for values above 0, as we get that input z into our activation function。
Now the activation function allows more flexibility than just this01 step function as it's called。
And without this activation function in general。The model would be a linear model。
so the activation improves our ability to determine nonlinear outcomes。In the next slides。
we're going to discuss various activation functions and a great analogy building off of our basic statistical knowledge and what we have discussed thus far is that of thinking of logistic regression as linear regression with a sigmoid activation function。
So recall that we have that linear regression and that linear combination of those past weights。
and then we pass that through a sigmoid activation function,
and that's the same as working with our logistic regression。

And in this analogy, we use the sigmoid because we want outputs between 0 and one。
and we want a nonlinear model, and with deep learning models。
we sometimes want more flexibility over which types of outputs we can consider。
and we may not only want to emphasize positive values between 0 and 1, for example。



So we're going to start our discussion of activation functions with that sigmoid function。
And this is the only activation function that we've discussed thus far。
Some advantages of this activation function are that it produces that simple derivative that we saw earlier。
And that it keeps the values between 0 and 1 and technically never gets to exactly 0 or 1; as it goes to infinity or negative infinity it gets closer and closer,
but not exactly to 0 or 1。The disadvantage that we discussed earlier, and something that we can even see graphically here, is that the derivative can tend to be a very low value。
Now, what do I mean by the fact that we can see this graphically。
The derivative is meant to show us how much y changes with a tiny change in X。
And if we look at very small or very large numbers,
we can see that the y value barely changes at all。For x values between 5 and 10 or between negative 5 and negative 10,
we can see that the y values are essentially flat。
And this will get even worse for x values beyond 10 or beyond negative 10。
So the derivatives are going to be very, very small。
So although the sigmoid function is easily interpretable and keeps values between 0 and 1,
it's very prone to this vanishing gradient problem and thus will often lead to difficulty when trying to optimize using gradient descent and back propagation。
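To put numbers on how flat the sigmoid becomes, here is a minimal sketch that evaluates its derivative, sigmoid(x) times one minus sigmoid(x), at a few points. The helper name `sigmoid_grad` is an assumption made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# The derivative is 0.25 at x = 0, about 0.0066 at x = 5, and about 0.000045 at x = 10.
```

Since backpropagation multiplies these derivatives together layer by layer, even the maximum value of 0.25 shrinks the gradient at every sigmoid layer it passes through.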

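060:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p60 21_Other commonly used activation functions.zh_en -BV1eu4m1F7oz_p60-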
Next we have the hyperbolic tangent function, so again we're going through our different activation functions。
This is the equation that we have here, which is tanh of z,
which is equal to sinh of z over cosh of z, which is just equal to e to the 2z minus 1 over e to the 2z plus 1。
Now, what's important here is not the equation itself, but rather
some of the properties of this function。In a lot of ways it's going to be very similar to the sigmoid function that we just discussed,
just a bit stretched out。So rather than with the sigmoid, where the sigmoid of 0 is equal to 0.5,
the tanh of 0 is going to be equal to 0。And as we approach infinity and negative infinity,
we approach one and negative one rather than 1 and 0 as we did with the sigmoid function。

So what does this look like?When we look at the graph of this function and we think about it in relation to the sigmoid function。
we see that for values between negative 2 and 2。 we have a sharper slope。
And thus the derivative is going to be a bit larger, i.e.,
a small change in x equals a larger change in y,
and gradient descent may be optimized or work a bit faster when working with this activation function。
And it's powerful if for any reason you want your values to be between negative one and one rather than between zero and 1。
But as we discussed with the sigmoid function。 and as we can see here on the graph。
we still have that same problem of very small derivatives for higher absolute values。
And thus we again face that vanishing gradient problem。
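The relationship between tanh and the sigmoid can be checked numerically. This is a small sketch; the helper name `tanh_from_exp` is made up for the example, and `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_from_exp(z):
    """tanh(z) written as (e^(2z) - 1) / (e^(2z) + 1)."""
    return (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)

z = np.linspace(-3, 3, 7)
print(np.allclose(tanh_from_exp(z), np.tanh(z)))        # True
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True: tanh is a rescaled sigmoid

# Slopes at z = 0: tanh'(0) = 1 - tanh(0)^2 = 1, while sigmoid'(0) = 0.25.
print(1 - np.tanh(0.0) ** 2, sigmoid(0.0) * (1 - sigmoid(0.0)))
# Far from zero the slope is tiny again, which is the vanishing gradient problem.
print(1 - np.tanh(5.0) ** 2)
```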

So in order to answer this vanishing gradient problem,
we have introduced the rectified linear unit function, or ReLU。
And this activation function is actually quite simple。For any z value less than 0,
we just return the value 0, and for any z value greater than 0, we just return that actual value z。
So we're essentially taking the maximum value between the output z and zero。
So ReLU of zero is again going to be zero。ReLU for any value of z greater than zero
is going to be equal to z。And the ReLU of any negative value will just be 0。
So what does this look like graphically。

We can see it here in thinking through this graph。 We can see that it is still nonlinear。
As we see this transition between less than0 and greater than zero, introducing that nonlinearity。
And as we can see on the right side of zero, we no longer have that tiny derivative causing us problems。
Then on the left side, rather than those tiny changes, there is zero change。
So these values will actually zero out particular nodes。Now。
this zeroing out will allow for us to ignore nodes that may not be providing much extra information。
And thus may be more efficient than the sigmoid or hyperbolic tangent functions that always maintain at least some information at each node。
Now, on the other hand。There will be no learning happening at each of those nodes that are being zeroed out。
and perhaps you want to ensure some type of learning at all nodes。
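A minimal NumPy sketch of ReLU and its slope, to make the zeroing out concrete. The function names and the choice of slope 0 at exactly z = 0 are assumptions for this example.

```python
import numpy as np

def relu(z):
    """ReLU: the elementwise maximum of z and 0."""
    return np.maximum(z, 0.0)

def relu_grad(z):
    """Slope of ReLU: 1 where z > 0, otherwise 0 (using 0 at z = 0 by convention)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # zeros for the negative inputs and for 0, then 0.5 and 2.0
print(relu_grad(z))  # zero slope on the left, slope 1 on the right
```

A node whose input stays negative has zero slope, so no gradient flows back through it and its weights stop updating.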

With that in mind, we have the leaky rectified linear unit, or leaky ReLU。
And the way that this works is for positive values, the function remains the same,
so it's just going to be z as your output。But for negative values, rather than simply zeroing out z,
for z values less than 0, we multiply that negative value, that value z, by some small number,
and that is the new output rather than just having zero。
So now our function is going to be the maximum value between z, as we had before,
and then, rather than 0, some positive alpha less than 1 multiplied by z。
And recall that this is going to be larger than z itself if we're working with negative values, right, so any negative number multiplied by a positive value less than one
will be larger than that original negative number。So our outputs for leaky ReLU are going to be
zero for z equal to zero, z for any value z greater than 0, and then for values less than 0,
for z values less than 0, it will be alpha multiplied by that z。
So what does this look like graphically?

Now we can see with the graph for this function that we no longer have that zero value at any of our nodes。
This will solve that problem of nodes zeroing out throughout our network while keeping the advantage of a steady learning rate without that vanishing gradient problem。
Now, I would like to note here, just because it solves a potential problem that regular ReLU may have,
this doesn't mean that leaky ReLUs are necessarily better every single time。
They are better a lot of the time, but they aren't necessarily better all the time。
Sometimes you may want a more efficient network that allows for zeroing out of some nodes。
And the best practice would be to try both, starting with ReLU or leaky ReLU and then trying the other as well and seeing which one performs better。
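The leaky ReLU can be sketched the same way. The default `alpha=0.01` is an assumption for illustration; the video only says alpha is some small number less than 1.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: max(z, alpha * z) for a small positive alpha less than 1."""
    return np.maximum(z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope is 1 for z > 0 and alpha otherwise, so it is never exactly zero."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))       # negative inputs are scaled by alpha instead of zeroed out
print(leaky_relu_grad(z))  # alpha on the left, 1 on the right
```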

So to summarize, the sigmoid function is going to be powerful when we want outputs between 0 and 1,
but will again suffer from that vanishing gradient problem。
The hyperbolic tangent is useful if you want outputs between negative one and one,
and perhaps a bit of a steeper slope, but also suffers from that vanishing gradient problem。
ReLU will solve the vanishing gradient problem, but potentially suffers from that dying ReLU problem。
The dying ReLU problem is what it's called, and that's just that zeroing out of certain nodes。
And then finally, the leaky ReLU will also solve that vanishing gradient problem that was introduced with sigmoid and hyperbolic tangent,
but also solves for that potential of the dying ReLU problem as well。

So to recap。In this section, we discussed the different activation functions that we just went over for nodes in our neural network and summarized each one of those commonly used activation functions。
Now, in the next video, we'll begin to introduce a topic that will be important for any machine learning problem。
but especially so for deeper neural networks, and that's going to be fighting overfitting by introducing different regularization techniques。
All right, I'll see you there。


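061:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p61 22_Backpropagation notebook (optional) part 3.zh_en -BV1eu4m1F7oz_p61-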
So hopefully, given the functions that we walked through in that last cell,
you've been able to think through how we would move forward through our neural net,
as well as how we'd use that back propagation and the output from our function,
specifically from that forward pass, which gives us our gradient and our prediction,
and how we would actually go about using those outputs, iterating over a number of iterations
and updating our model。So with that in mind, hopefully you were thinking through it;
we're going to walk through that code over here。So the first thing that we want to do is create our W1 and W2;
again, those are just going to be random values here to start。
So they are going to be random values between negative 1 and 1,
and if we think about the weights for W1, it should be a 3 by 4 matrix, since our input should have three dimensions:
our two dimensions, x1 and x2, as well as our bias term, which we're including in here。And then
it's going to be by four since we'll have four nodes,
and then that means that our W2 will have to just be of size four。
And we are just going to have four because that's going to be the size of our output from that hidden layer。
So when we take the dot product of that output with W2,
then we'll be able to output a single value per observation, which becomes our classification value of either one or zero。
We're going to set the number of iterations here equal to 5,000 and our learning rate equal to 0.001。
And then we're just setting x_mat to that x_mat_full so that we can pass it through。
Our loss values and our accuracies are going to start off as empty lists, as we have here。
And then we're going to do our iterations。So for i in range of the number of iterations,
5,000 different iterations, we're going to take our y_pred
and our output gradients from our forward pass。So I talked about this just at the top of this video, that this forward pass will output
both a prediction, as well as our gradient values, and that gradient value output,
if you recall from last video,
is going to be a tuple of two different values, so we're going to pull them out separately as J_W_1_grad and J_W_2_grad。
And then to update the weights, all we have to do is take that initialized value and,
after the forward pass, which we've done and gotten our gradients from,
we can subtract out the learning rate multiplied by that gradient
to take that small step in the right direction, updating both our W1 and W2。
We can then get what our current loss is by calling the loss function on y and y_pred。
We then append that value to the loss values, and then our accuracy is just going to be how many we predicted correctly over the total number of observations。
And just because we're using the log loss again, our prediction will be some value between 0 and 1。
It won't be exactly 0 or 1。So we are just going to
make them discrete values of 0 or 1 by setting y_pred greater than or equal to 0.5 equal to 1,
and all the values lower equal to 0。And then we can append to our list of accuracies
that accuracy value, and that's just for our first iteration。
And then what we have here is that we're just going to print out, at every 200th iteration,
what the log loss was and what the accuracy is。And then at the end,
we can plot that loss and accuracy once we've gone through every single iteration of those 5,000
iterations。Now, coming back to this W1 and W2, we think about the fact that we have updated these after the first iteration。
Once we update them, we can then pass them back in as this W1 and W2 for the forward pass, since we updated each one as itself。
And we continue to update them and get closer and closer to that optimal value。
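Putting the pieces together, the training loop described above might look like the following self-contained sketch. It is written under assumptions: the helper functions are compact re-implementations of the ones described in the previous video, the data and labels are regenerated here with an arbitrary seed, and the functions take the data as arguments instead of reading globals. The exact losses and accuracies printed will therefore differ from the ones shown in the video.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_fn(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() away from exactly 0
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def forward_pass(x_mat, y, W_1, W_2):
    a_2 = sigmoid(np.dot(x_mat, W_1))            # hidden activations, (n, 4)
    y_pred = sigmoid(np.dot(a_2, W_2))           # predictions, (n,)
    J_z_3_grad = y_pred - y                      # dJ/dz_3
    J_W_2_grad = np.dot(J_z_3_grad, a_2)         # dJ/dW_2, (4,)
    hidden_grad = np.outer(J_z_3_grad, W_2) * a_2 * (1 - a_2)
    J_W_1_grad = np.dot(x_mat.T, hidden_grad)    # dJ/dW_1, (3, 4)
    return y_pred, (J_W_1_grad, J_W_2_grad)

# Diamond data: two uniform features in [-1, 1) plus a bias column.
rng = np.random.default_rng(0)
num_obs = 500
x_mat = np.concatenate((rng.uniform(-1, 1, size=(num_obs, 2)), np.ones((num_obs, 1))), axis=1)
y = ((np.abs(x_mat[:, 0]) + np.abs(x_mat[:, 1])) < 1).astype(int)

# Random starting weights between -1 and 1.
W_1 = rng.uniform(-1, 1, size=(3, 4))
W_2 = rng.uniform(-1, 1, size=4)
num_iter = 5000
learning_rate = 0.001
loss_vals, accuracies = [], []

for i in range(num_iter):
    # Forward pass gives the prediction and the tuple of gradients.
    y_pred, (J_W_1_grad, J_W_2_grad) = forward_pass(x_mat, y, W_1, W_2)

    # Take a small step against the gradient for both sets of weights.
    W_1 = W_1 - learning_rate * J_W_1_grad
    W_2 = W_2 - learning_rate * J_W_2_grad

    # Track the log loss and the accuracy at a 0.5 threshold.
    curr_loss = loss_fn(y, y_pred)
    loss_vals.append(curr_loss)
    acc = np.mean((y_pred >= 0.5) == y)
    accuracies.append(acc)

    if i % 200 == 0:
        print(f"iteration {i}, log loss {curr_loss:.4f}, accuracy {acc:.3f}")
```

Because W_1 and W_2 are reassigned inside the loop, each iteration's forward pass automatically uses the weights updated in the previous iteration.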

So we're going to run this, and we can see the outputs as we go,
at every set of 200 iterations: 200, 400, 600, 800 and so on。
And we can see that that log loss goes down further and further,
and our accuracy goes up correspondingly, as we go through each one of the different iterations。And we see here at the end,
at iteration 4,800, we end up with an accuracy of 94.4。

And you can play around with changing the learning rate,
so if you imagine I made that learning rate really small, I want you to think about what would happen。
I'm going to run this,
and we see that it is updating a bit too slowly
as we set that learning rate incredibly small。
So we don't want to set it too small or too large。We'll set this back to what it was before, so that we can do the next step,
which was 0.001。


And then we can actually plot out where we got it correct and where we got it incorrect, to see where on our diamond
we were more likely to have errors。So we run this, and these plots should be simple enough given everything that we've learned。
And we see the false positives and false negatives all tend to be right around these edges。
So it did a pretty good job of actually finding that classification boundary; it's just right along the edges
that some are incorrect。

Now feel free to play around with different shapes to look at what type of errors come out。
You see everything that we can do here within this notebook,
playing with the different number of iterations, playing with the learning rate。
And then we discussed also that you can play around with the different activation functions。
And you'd have to change where we have it defined here as a sigmoid;
you'd have to pass in something else besides sigmoid。And with that in mind,
that's a smooth transition into our next lecture,
which is just going to be discussing different activation functions。All right, I'll see you there。


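062:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p62 23_Popular deep learning libraries.zh_en -BV1eu4m1F7oz_p62-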
In this video, we're finally going to introduce the actual Python library, Keras,
which we're going to use in order to build out our neural networks。
So in this section we're going to cover a bit of an overview of the different Python libraries that are going to be available in general;
again, we're just going to be focusing on what our title here was, which was Keras。

We're then going to show you how to set up a network structure using Keras。

And then finally, from A to Z, how to actually build out that model, and once we are able to do that,
we'll jump into the actual code to create our first deep learning network within Python using Keras。



So what are some of the many Python libraries that are available to us if we want to do deep learning in general?
Some of the most common libraries used include TensorFlow。TensorFlow is built by Google and has now actually incorporated Keras's simplified syntax into the TensorFlow package。

TensorFlow was originally known, as we talk through the different packages available, as the more complicated, steep learning curve version of a deep learning framework,
but its incorporation of Keras has made it a lot more accessible。TensorFlow also has a larger community than most of the other packages that are available, and it originally built on ideas from what we'll talk about next,
this package Theano。Now Theano is essentially
dead, as development ceased back in 2017, but many academic researchers relied on Theano, and Theano is considered somewhat of the grandfather of deep learning frameworks。



And then we have PyTorch, and PyTorch also has a large following, a large community,
and is currently a bit more research oriented compared to TensorFlow,
which is a bit more geared toward building out those AI related products。

PyTorch is developed by Facebook and, again, also has a large community。
It was originally known for being more accessible than TensorFlow, but again,
with that incorporation of Keras into TensorFlow,
TensorFlow is now also very accessible。

And then, like we said, Keras is going to be that high level library, very accessible,
like Python。When we say high level, that means it's very close to English。
It can run on either TensorFlow or Theano, and with TensorFlow's incorporation of Keras into the TensorFlow package,
it's most likely going to be running on TensorFlow moving forward。

And in this course, we'll be focusing specifically on running Keras with TensorFlow under the hood,
so that's going to be what we will be focusing on in the code that we'll see throughout this course。


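063:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p63 24_Typical Keras workflow.zh_en -BV1eu4m1F7oz_p63-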
So let's talk through the typical command structure when building out our deep learning frameworks。
So the first thing that we're going to want to do is to actually build out that structure of our network。
how many layers do we want, how many nodes do we want in each layer。

And then we're going to compile that model that we create, however many layers it is,
whatever the specifics of each layer are, and we're going to learn more complicated types of frameworks later on。
When we compile that model,
we're going to specify the loss function that we're going to use at the end of our model,
the different metrics that we want to track, maybe accuracy, maybe loss functions, things of that sort,
as well as the optimizer that we're going to use。That's also going to include the learning rate that we use in order to specify that optimizer, and that optimizer will be one of those that we discussed earlier,
whether that's Adam, stochastic gradient descent, something with momentum and so on。


We're then going to fit the model onto our training data。
And when we fit that model we'll specify the batch size as well as the number of epochs。
the number of times we'll run through the data set。


Once we do that, we'll be able to predict on new data once it's fit。
once that model's already fit onto the training data。

And then we can evaluate our results。
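The five steps above (build, compile, fit, predict, evaluate) can be sketched end to end as follows. This is a hedged illustration that assumes TensorFlow's bundled Keras is installed. The toy diamond data, the layer sizes, the SGD learning rate of 0.1, the batch size and the number of epochs are all assumptions chosen to keep the example small; they are not values from the course.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Toy data (an assumption for this sketch): the diamond pattern from the earlier notebook.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2)).astype("float32")
y = (np.abs(X[:, 0]) + np.abs(X[:, 1]) < 1).astype("float32")

# 1. Build the structure: how many layers, how many nodes in each.
model = Sequential([
    Input(shape=(2,)),
    Dense(4, activation="sigmoid"),
    Dense(1, activation="sigmoid"),
])

# 2. Compile: loss function, optimizer (with its learning rate) and metrics to track.
model.compile(loss="binary_crossentropy",
              optimizer=SGD(learning_rate=0.1),
              metrics=["accuracy"])

# 3. Fit on the training data, specifying batch size and number of epochs.
history = model.fit(X, y, batch_size=32, epochs=5, verbose=0)

# 4. Predict on data (here the training data again, just to show the call).
y_prob = model.predict(X, verbose=0)

# 5. Evaluate: returns the loss followed by each tracked metric.
loss, acc = model.evaluate(X, y, verbose=0)
print(y_prob.shape, loss, acc)
```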

Now, when we work with Keras,
Keras is going to provide two approaches to building the structure of our model。

We'll have the sequential model, which allows for an easy to use linear stack of layers。
It's going to be much simpler than the other version that we're going to talk about, and more convenient if the model has a simpler,
more relatable form that you're already used to, whether that's going to be the dense networks that we've talked about so far, or later on just your typical convolutional neural nets or recurrent neural nets and so on。



And then there's also the functional API, and that's going to be a bit more detailed and complex。
but will allow for more complicated architectures。


Now, probably given that you're watching this course, you're just learning this for the first time。

Everything that the sequential model is going to provide for you will probably cover everything that you need to know so far。

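064:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project)|machine learning" Chinese-English subtitles p64 25_Implementing an example neural network in Keras.zh_en -BV1eu4m1F7oz_p64-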
Now let's talk through how to actually write out the code in Keras to create this neural network that we see here。
So this neural network will take in an input of a data set with three different features, x1,
x2 and x3。It will then have two hidden layers, each one with four nodes,
or as we'll see in a second, Keras is going to call them units, but they're one and the same。
So we have four units in each one of the different layers, and they are fully connected, as we've seen so far in each one of our feedforward neural networks。
And then finally, we're going to have that output layer, y1, y2 and y3。And here,
as you see in the purple hidden layers, we're going to be using the sigmoid activation function throughout,
and we can use different activation functions if we'd like, but in this example,
we'll just show you using the sigmoid activation functions。

So the first thing that you're going to want to do is import that Sequential function and initialize your model object,
so from keras.models we're importing this Sequential function。
Once we have that Sequential function available, we just initialize our model by setting model equal to Sequential, and from there we can add on each one of our different layers with their specific activation functions to build out the remainder of our model。
So right now we just have an initialized model with no details involved。
We can then add layers to the model one by one。In order to do so, we're going to use these different types of layers。Here we're just going to stick with dense layers, which are those fully connected layers。Later on we're going to learn about more complex layers, such as those for recurrent neural nets and convolutional neural nets, but again, we're just going to focus here on dense layers。And then we can also add on our activations, and with those activations we can specify whether we want sigmoid,
ReLU, leaky ReLU and so on。

So the actual code is going to look like what we have here, and in that first step we're importing the libraries and the model elements we just discussed,
so from keras.models our Sequential function, and from keras.layers
our Dense and Activation functions。Then we initialize our model as Sequential。
Then, to add on our first layer, we're going to want to specify the input dimension,
and that will just ensure that we have that first step correct and that we're putting on Lego blocks that actually fit。
So we call model.add and we're adding on to that empty model a dense layer,
that fully connected layer。That layer is going to have four units, or four nodes,
and the input dimension, so coming in to that hidden layer with four nodes,
is going to be three dimensions, and those represent the three features coming in。
We're then going to specify our activation function, and we'll see as we do the code in a second that this could actually be specified while we add on our dense layer as well。
We'll see that when we write out the code, but just to make this easier to read and give you another syntax for adding on the activation function,
we add it on separately: we call model.add and we call Activation,
and for the type of activation that we want, we pass in sigmoid。
And then to add on further layers to make a deeper neural network。
We're going to have that input dimension already presumed from the previous layer so you don't have to keep writing out what the input dimension is。
And you can just call model.add。We want a fully connected layer, and the units is equal to 4,
and then model.add with Activation sigmoid。And we have created our second layer here。Now,
there's more steps to complete that model that we just saw。
but we will walk through it in greater detail as soon as we get to the code in just a second。
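The layer-by-layer construction described above can be sketched as follows, assuming TensorFlow's bundled Keras. Two things differ from what the video describes and are assumptions of this sketch: it declares the three input features with an explicit `Input` layer (the video passes the input dimension to the first dense layer), and it adds a plain three-unit output layer, since the video leaves the remaining steps for the notebook.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

model = Sequential()

# Three input features: x1, x2 and x3.
model.add(Input(shape=(3,)))

# First hidden layer: four fully connected units, then a sigmoid activation.
model.add(Dense(units=4))
model.add(Activation("sigmoid"))

# Second hidden layer: the input size is inferred from the previous layer.
model.add(Dense(units=4))
model.add(Activation("sigmoid"))

# Assumption: an output layer with three units for y1, y2 and y3.
model.add(Dense(units=3))

model.summary()
```

The activation could also be passed directly to the dense layer, for example `Dense(units=4, activation="sigmoid")`, which is the alternative syntax the video mentions.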

So just to recap: in this section we discussed an overview of the different Python libraries available to build out these deep learning frameworks, and that included TensorFlow,
Theano and PyTorch。Currently PyTorch and TensorFlow are the main competitors, and we're going to be using TensorFlow specifically, and to be even more specific,
we're going to use the Keras package, which is now available in TensorFlow and has made TensorFlow a bit more accessible than it ever was。
We then discussed, with that in mind, setting up an actual network structure using Keras, as well as how to build out our models using Keras。
We discussed that briefly, and with that, as promised just a second ago,
we're going to go a bit deeper into actually building out those networks using Keras, and we're going to see that in just the next video。All right,
I'll see you there。


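065: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p65 26_Keras notebook (optional) part 1.zh_en -BV1eu4m1F7oz_p65-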
All right, welcome to our notebook here that will introduce Keras。This one's a bit of a lab,
so hopefully you're able to walk through some of this on your own。
The goal here is going to be to use Keras to build and train neural networks。
We're going to be using the UCI Pima diabetes dataset,
which will allow us to predict whether or not a certain person has diabetes based on the attributes that we have here。We have nine different columns within our data,
and one of them is the outcome variable we're trying to predict, so we're working with eight features。
If you recall, when we need to pass in that input dimension, that input here would be equal to 8。




We're also going to start off by fitting a random forest, in order to get a baseline value for what the accuracy should be around,
and then hopefully we'll try and improve on that。We may see as we go through this
that deep learning may not always be the answer, and we'll also see how much longer it takes。
But often it will be the answer, so don't conclude that just because it's not the answer here we should never use neural nets。

So the first step is going to be the imports。Here in this first cell are libraries that we're already familiar with。


And then in this second cell you see what we're importing。Again,
as I mentioned in the lecture, TensorFlow has now incorporated the Keras syntax, so to get it specifically from TensorFlow
we import from tensorflow.keras rather than from keras。
So from tensorflow.keras.models we import Sequential, and from tensorflow.keras.layers
we import Dense。I believe we're just going to be working with SGD here, as we'll see later on,
but from tensorflow.keras.optimizers we import a couple of options,
and I would say feel free to try working with some of the other options that we may not use。




We're then going to import the actual file that we're going to be working with, and that's going to be this diabetes data frame。
We name each one of the columns as specified here,
so we just have a list equal to these names, and we use those names when we read in our CSV file。



We can then get the shape and, as anticipated, there are nine columns,
one of them being that outcome, whether or not they have diabetes, and there are 768 rows,
so this isn't a huge data set。Generally speaking, deep learning works better when you have a larger data set。
We're then going to set X equal to diabetes_df.iloc,
and you see we're taking all of the rows and
every column besides that last column, and then for the y variable we're just taking that last column, whether the patient has diabetes。


We're then going to use that train test split with a 75 to 25% split。
something that we should already be familiar with so that we get our X train, our X test。
our Y train and our Y test, and then we can train on our train and test on our test sets。


We pull out the mean value for y and 1 minus y here。Again, the labels are just zeros or ones,
so the mean is some value between 0 and 1, and it shows us what proportion of our data set
is already a positive value。

So because we see that 35% of the patients have diabetes whereas 65% do not,
we can get an accuracy of 65% by just predicting that nobody has diabetes。
We've talked about classification and how we need to be careful when working with something like accuracy, so we'll look at accuracy, but with that we'll also look at the ROC AUC metric。
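To make the 65% figure concrete, here is a minimal sketch of the majority-class baseline。The labels list is hypothetical (35 positives out of 100, mirroring the split mentioned above) and the function name is an assumption, not something from the notebook。

```python
def majority_baseline_accuracy(labels):
    """Accuracy from always predicting the most common class (labels are 0 or 1)."""
    positive_rate = sum(labels) / len(labels)
    return max(positive_rate, 1 - positive_rate)

# Hypothetical labels: 35% positive, 65% negative.
labels = [1] * 35 + [0] * 65
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

Any model worth keeping should beat this number, which is why accuracy alone can be misleading on imbalanced data。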


So we're going to get our baseline using random forest。
so we're going to train a random forest using 200 trees。
so our number of estimators that we see here is going to be equal to 200。


So we initiate rf_model as a RandomForestClassifier, since this is a classification problem,
and then we fit that to our training set, our X train and our y train。


Now our model is fit, and we're going to want to predict the actual values。
We're also going to want to predict the probability outputs
for each one of our rows, and we do that so that we can plot the ROC curve
and get the area under that curve in the next cell。



So we call rf_model.predict on our X test, as well as rf_model.predict_proba in order to get the different probabilities。

And then we can get what our actual accuracy is by passing in the Y tests and the predicted value。
and then we can get the ROC AUC score by passing in the Y test and the predicted probabilities。


And here, predict_proba outputs the predicted probabilities for each one of the classes, and we want them just for the positive class,
which is why we index with one here。


So we see that our accuracy is about 77.6%, better than just predicting "no diabetes" for everything,
which gave 65%。

And our ROC AUC is going to be 83.6%。

We're then going to create this function to plot our ROC curve, and we write it as a function so that we can use it again later on。


We get our false positive rate and our true positive rate,
as well as the thresholds that we're using, by calling roc_curve on the y test and our predicted values。
Those predicted values are just whatever we pass in as our y_pred, and we should pass in the probabilities,
not the actual class predictions。
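To illustrate why probabilities rather than hard predictions are needed, here is a small pure-Python sketch of what an ROC computation does: it sweeps a threshold over the scores and records the false and true positive rates at each one。This is a simplified stand-in, not the library routine used in the notebook, and the function names and toy data are assumptions。

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists, one point per distinct score used as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(y_score), reverse=True):
        predicted = [s >= threshold for s in y_score]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoid-rule area under the ROC points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_prob)
area = area_under_curve(fpr, tpr)
print(area)  # 0.75
```

With hard 0/1 predictions there would be only one usable threshold, so the curve would collapse to a single point。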


We also pass in the model name of what we're using, so that we can label this as random forest versus our neural nets,
which we'll use later。We're then going to just initiate our figure and our axis。
And then we're going to plot our false positive rate versus our true positive rate with a black line。
And then we're going to plot just a straight diagonal line,
which shows how well we would do if we were to just predict randomly, so we can see the area above that line as well。
That's going to be a dashed line with a line width of 0.5 rather than the full line width of one,
so we'll see a thinner line there。


We keep our grid, and then we can set our title as well as our axis limits,
so essentially from 0 to 1 on our x and y axes。

Then once that function has been created, we just call plot_roc on our y test
and our predicted probabilities, again specifying that we just want the positive class,
and we say that this is a random forest, "RF"。

And we see here the ROC curve for RF on that Pima diabetes problem。
We see it does better than random prediction, and perhaps we can do a little bit better:
it's not a perfect prediction, as we have that ROC AUC of 0.836, where 1 is perfect。
So that closes out our baseline values, so remember these values, 77.6 and 83.6,
as well as this graph that we have here。And with that, we are now ready to build out our first neural net model。All right,
I'll see you in the next video。



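066: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p66 27_Keras notebook (optional) part 2.zh_en -BV1eu4m1F7oz_p66-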
Welcome back。Hopefully at this point you're excited to finally build out your first neural network。Here we're going to build just a single-hidden-layer neural network: we again have that input of eight variables, and we have one single hidden layer with 12 nodes。
Now, something that we didn't touch on for neural networks。
which we'll touch on in the next lecture that we do after this notebook。
is that it's going to be important to actually scale your data before building out your neural networks。
The reason behind this will have to do with how gradient descent works and how it will update certain weights differently depending on their scale。
and we'll get into that in the next lecture, but for now note that you're going to want to scale your data before training your neural networks。
So we create our normalizer with that StandardScaler。
We then create our X_train_norm by calling fit_transform on our X_train,
and then we get our X_test_norm by just using transform,
not fit_transform again, because we want to ensure that our holdout set is indeed a holdout set and that we only use what we learned from the training set, which is all we would have had available to us。
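The fit-on-train, transform-on-test idea can be sketched without any library。This is a simplified one-column illustration with made-up numbers; the helper names fit_scaler and transform are assumptions, and the population standard deviation is used to mirror standard scaling。

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values only."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(x - mu) / sigma for x in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mu, sigma = fit_scaler(train)               # statistics come from the training set
train_norm = transform(train, mu, sigma)
test_norm = transform(test, mu, sigma)      # the test set reuses them unchanged
print(train_norm, test_norm)
```

Re-fitting on the test set would leak information about the holdout data into preprocessing。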
And here we build out our first model, as we discussed in the lecture。For model_1
we initialize our model by calling Sequential。We then add on our first layer,
which will have 12 units; the number of units is the first positional argument。For our input shape,
you don't have to specify the number of rows;
the number of columns is what's important。And then, like I said during lecture,
we can skip a step that we saw when we walked through the actual syntax,
and we can include the activation within this model.add as we add on that Dense layer。
It is still part of that Dense layer, if you look at where these parentheses actually close out。
So we set our activation here equal to sigmoid。We showed other options that are available to us,
and we could use tanh or ReLU or leaky ReLU and so on。And then to close it out,
we're just going to be condensing those 12 nodes into one node to predict some value between 0 and 1。
So we add on one Dense layer that's fully connected to those 12 nodes,
and we set the activation equal to sigmoid again because we want to output a value between zero and one。
So you run that, we've initialized our model, and we can call model_1.summary。

And we get some of these nice details about our actual layers and how many weights there are going to be。
So if we look here, we see the total number of parameters that we need to train and how many we need to train at each layer。
We see that we have 121 total parameters: 108 at the first layer and 13 at the second layer。
I would advise you to pause and try to think through why there are 108 parameters and 13 parameters,
so I'm going to give you a second here to pause。Assuming you paused and thought this through,
the reason why we have 108 parameters at that first layer is that we have 8 input features,
and those are fully connected to each one of the 12 nodes。So if you think about that,
you'd originally think maybe something like 8 times 12。But we also have that bias term,
so there are actually nine inputs connected to each node, and 9 times 12 gives you your 108。
Then to get to the next layer, you're going from 12 nodes down to one,
so that one node is fully connected to the 12 plus a bias term, so you have 12 weights plus one bias,
and that's equal to 13, which is why you have to learn a total of 121 different parameters。
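The counting argument above can be checked with a few lines of Python。The helper name dense_params is an assumption made for this sketch。

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

# 8 input features -> 12 hidden nodes -> 1 output node
layer_sizes = [8, 12, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [108, 13] 121
```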

We're then going to compile our actual model。This is going to be our first time seeing how to actually compile that model using a specified optimizer,
our loss function, and the different metrics that we want to track throughout。
So we call model_1.compile, and we say SGD, since we're using stochastic gradient descent,
which we imported earlier。Our learning rate is 0.003,
and we can change that learning rate to make it faster or slower。We then have our loss function,
which here is going to be binary cross-entropy: that's for a binary outcome, either 0 or 1。
If you wanted something categorical, across many different categories, you'd use categorical cross-entropy, and if you want to predict something continuous you can use mean squared error,
which is MSE。Then we set the metrics that we want to track; here we want to track accuracy, and it will automatically also track the loss function throughout。
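For intuition about the loss being minimized, here is a minimal sketch of binary cross-entropy on plain Python lists。The clipping constant eps and the function name are assumptions for this illustration; library implementations differ in detail。

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # good, confident predictions
unsure = binary_cross_entropy([1, 0], [0.5, 0.5])     # coin-flip predictions
print(confident, unsure)
```

Confident correct predictions give a loss near zero, while coin-flip predictions give a loss of ln 2。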
And then we're going to actually save the run history, and we'll see how this becomes useful throughout。
That's going to be the output from our fit:
we fit to our X_train_norm and our y_train,
and then what we can actually do is pass in our validation data,
here our test set, to see how we're performing on that holdout set as we fit to our training set。
And then we set the number of epochs, the number of times we want to run through our data set,
and we set that equal to 200, so it's going to run through the full data set 200 times。
And at each one of the different epochs
it reports how much we are decreasing our loss,
how much we are increasing our accuracy, and how we are doing on the validation loss and accuracy for that holdout set。
So I'm going to pause the video, as this will take just a second, and I'll see you as soon as it's done running。
OK, it's done running; it was only 200 epochs that we had to run through。
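Putting the build, compile and fit steps together, a minimal sketch might look like the following。It assumes a TensorFlow installation that bundles Keras, uses random stand-in data instead of the diabetes file, trains for only 3 epochs to stay quick, and declares the eight inputs with an Input object rather than an input shape argument on the first Dense layer。The learning_rate keyword is the spelling in recent versions。

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Stand-in data (assumption): 8 already-scaled features, binary target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 8)).astype("float32")
y_train = (X_train[:, 0] > 0).astype("float32")
X_test = rng.normal(size=(16, 8)).astype("float32")
y_test = (X_test[:, 0] > 0).astype("float32")

model_1 = Sequential()
model_1.add(Input(shape=(8,)))
model_1.add(Dense(12, activation="sigmoid"))  # single hidden layer, 12 nodes
model_1.add(Dense(1, activation="sigmoid"))   # one output between 0 and 1

model_1.compile(optimizer=SGD(learning_rate=0.003),
                loss="binary_crossentropy",
                metrics=["accuracy"])

# The returned object keeps per-epoch metrics in its .history dictionary.
run_hist_1 = model_1.fit(X_train, y_train,
                         validation_data=(X_test, y_test),
                         epochs=3, verbose=0)
print(sorted(run_hist_1.history.keys()))
```

Calling fit again on the same model continues training from the current weights rather than starting over。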

And then, like we did for random forest, we're going to generate two kinds of predictions:
one's going to be that hard prediction and the other one's going to be the probabilistic score。
So we predict our different classes with model_1.predict_classes,
which is available to us once we fit the model,
and then we get the probabilities with just predict, again on that fitted model_1。
So we have our different predictions, and we can see that for our different classes
we have either zeros or ones。


And for our probabilities, we have some value between 0 and 1。And the differentiator,
as you look at this, is just going to be whether or not the probability is greater than 0.5。
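The relationship between the two kinds of prediction is just a threshold。A minimal sketch with made-up probabilities (the helper name to_classes is an assumption); this also works as a fallback where a class-prediction method is not available, as in later TensorFlow releases。

```python
def to_classes(probabilities, threshold=0.5):
    """Turn predicted probabilities into hard 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]

probs = [0.12, 0.50, 0.51, 0.93]
classes = to_classes(probs)
print(classes)  # [0, 0, 1, 1]
```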

With that, we can then create our ROC curve, using the function we created earlier,
as well as looking at our accuracy score and our ROC AUC score。
So let's see how we did compared to our baseline model。If you recall the earlier values,
we did a little bit worse。It's hard to tell exactly from the curve,
but we can see the ROC AUC is 0.782, whereas before it was somewhere above 0.8, and our accuracy is 0.729。


So that's going to be our first neural network model。There may be some variations because, again,
we randomize the initialization, so there is some randomness involved in creating these neural net models,
and you may not get the exact same result。Hopefully you get between 75% and 85% accuracy (here we did a little bit worse) and an AUC between 0.8 and 0.9
(ours is again a little bit worse), but you may end up with a higher value depending on your initialization。
And when we saved that history from the fitting of the model,
what we actually got is this object, run_hist_1。
If we look at its type, this initial output that we saved
is the History object that Keras makes available to you。
It has with it this history attribute, which is just a dictionary。
As we just saw, that dictionary has certain keys, and those keys are
the loss, which is the loss at each one of the different epochs,
the accuracy at each epoch, the validation loss,
so what your loss was for that holdout set, and then your validation accuracy。
The accuracy entries are only there because we specified that we want to track accuracy
when we compiled our model here。
Otherwise, accuracy would not be available within this dictionary。
So once we have each one of these things, the loss, the accuracy and the validation loss,
we can actually plot these out。So we initiate our figure and our axis,
and then we call run_hist_1.history with the loss key to get the loss values at each one of the different epochs。
Those are in order as it trains, so they should get lower and lower。
Then we can also get our validation loss, and we plot the two in red and blue。
Red is going to be the loss on the training set,
so that should keep going down as the model gets closer and closer to fitting exactly to our training set。
The validation loss, in blue, hopefully gets smaller too, but it could possibly increase,
meaning that we overfit to our training set。
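Besides plotting, the same history dictionary can be inspected directly。A minimal sketch, using a hypothetical history dictionary with made-up loss values and an assumed helper name best_epoch:

```python
def best_epoch(val_loss):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)

# Hypothetical values: training loss keeps falling, validation loss turns upward.
history = {
    "loss":     [0.70, 0.60, 0.52, 0.45, 0.40],
    "val_loss": [0.72, 0.63, 0.58, 0.60, 0.66],
}

epoch = best_epoch(history["val_loss"])
overfitting = history["val_loss"][-1] > history["val_loss"][epoch]
print(epoch, overfitting)  # 2 True
```

A validation loss that rises after its minimum while the training loss keeps falling is the overfitting signature described above。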

And here we see that the loss is still going down on both the training set and the validation set,
and this suggests that the model might benefit from further training, that is, from running through more epochs。
So with that in mind, let's train the model a little bit more and see what happens。

And something to note is that it will pick up from where it left off so it'll continue to train given where it left off at these 200 epochs。
So we're using that same model one that's already been compiled and fit once。
and we're going to run it for another thousand epochs。
I will run this, and again I am going to pause the video here and we'll come back once it's done running, as this will take approximately five times as long since it's five times as many runs through the data。

All right, that may have taken a couple of minutes to run,
but now it's run through all 1,000 epochs。And we want to see what kind of improvement we got。
Recall that as we fit on the training set, the training loss should keep going down
and the training accuracy should continue to go up, whereas for the validation set, that holdout set,
it's possible that we start to overfit and actually have the loss go up or the accuracy go down。
So to plot this, first we define n as the length of our original run_hist_1, and you see our new output here was called run_hist_1b。
So the length of our original run history is n, which is our first 200 epochs,
and the length of 1b, which we call m, should be the next 1,000。We then plot over range(n), taking run_hist_1 and getting its loss,
again just for those first 200 epochs, and we plot that in red。
This is for the training set if we just say loss; if we want to see the holdout set
we say val_loss, which we'll see in just a second。Then for n through n plus m, so from 200 to 1,200,
we look at the additional loss values, to see how much we were able to improve that loss function,
how much it was able to decrease as we did 1,000 more epochs。
Then we do the same thing for the validation loss。So we use red and hot pink
for our training loss, and blue and light sky blue for the validation loss, so that we can differentiate between the first run and the second run。
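The index bookkeeping for stitching two training runs onto one x-axis can be sketched as follows。The loss values are made up and much shorter than the real 200 and 1,000 epochs; the variable names are assumptions。

```python
# Hypothetical per-epoch training losses from two consecutive fit calls.
first_run_loss = [0.70, 0.60, 0.55]
second_run_loss = [0.52, 0.50]

n = len(first_run_loss)    # epochs in the first run
m = len(second_run_loss)   # epochs in the continuation

epochs_first = list(range(n))           # 0 .. n-1
epochs_second = list(range(n, n + m))   # n .. n+m-1, so the curves line up
print(epochs_first, epochs_second)  # [0, 1, 2] [3, 4]
```

Each pair (epochs_first, first_run_loss) and (epochs_second, second_run_loss) can then be passed to a plotting call with its own colour。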


So we plot this out, and we see the loss continued to decrease after 200 epochs。
We see the validation loss also continue to decrease
but really start to flatten out, while the training loss decreased even further, though not at the same rate, as it began to fit and perhaps overfit a bit to the training data。

So that closes out our first neural net model, playing around with running it through different numbers of epochs
and seeing the plot and the output once we fit that model。
In the next video we're going to try and play around with different models
and see what kind of effect that has on overall accuracy, as well as how fast they are able to fit。All right,
I'll see you there。


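067: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p67 28_Keras notebook (optional) part 3.zh_en -BV1eu4m1F7oz_p67-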
Let's close out this notebook with exercise 2 over here。
We're going to build another model, and this time we're going to have two hidden layers, each with six nodes。


We're going to use the ReLU activation function for each one of those hidden layers,
which is generally going to be best practice, and then sigmoid for that final layer, as we're trying to output values between 0 and 1,
and if you recall, with ReLU you're not necessarily going to get values between 0 and 1。
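A quick numeric check of that last point, with both activations written out by hand (the input values are arbitrary):

```python
import math

def relu(z):
    """Rectified linear unit: passes positive values through, zeroes the rest."""
    return max(0.0, z)

def sigmoid(z):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

inputs = [-2.0, 0.0, 3.5]
relu_out = [relu(z) for z in inputs]
sigmoid_out = [sigmoid(z) for z in inputs]
print(relu_out)     # [0.0, 0.0, 3.5]
print(sigmoid_out)
```

ReLU is unbounded above, so it cannot serve as a probability output, which is why the final layer keeps sigmoid。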


We're then going to use a learning rate of 0.003, and we're going to train for 1,500 epochs,
so that's going to take some time, as we saw when we did 1,000 epochs,
so I will again pause and continue as we go through this。

And then as we did before, we're going to graph the trajectory of the loss functions as well as the accuracy on both the train and test set。
and then we'll plot that ROC curve for these different predictions。


So we're going to initialize our model
as a sequential model, so we just call Sequential。And then we're just going to add on each one of our different layers。
So we have our first hidden layer, and that's going to be a Dense layer with six nodes,
so fully connected。For that first hidden layer we pass in the actual input shape,
and that's going to be eight。And for the activation this time, rather than saying sigmoid,
all we have to do is change that string to relu。So you do that for the first layer,
and then we can do that again for the second hidden layer,
so we just call model.add with Dense again, and this time we don't need the input shape, and we set the activation equal to relu once again。



And then to get to that final output, we recall that we only want one node, as we're just trying to predict 1 or 0 for each one of the different rows。

And we therefore have a Dense layer fully connected to just one node,
and we want the activation at that final layer to be equal to sigmoid,
as we again want some value between 0 and 1。We're then going to compile our model, and that's the next step,
and there we're going to pass in our optimizer as well as our loss function。
as well as the different metrics we want to track。

So we again say we want stochastic gradient descent, but feel free on your own to try Adam as well as RMSprop, which we imported earlier,
as well as playing around with different learning rates perhaps。

Then we're going to still use binary cross-entropy, as we're still trying to predict a binary zero or one value。
And then we're going to also track accuracy as one of the metrics。
We're going to save our history as run_hist_2, and we get that output from calling model_2.fit on our X train and our y train,
and then again we can specify here what we want our holdout set to be so that we can track that throughout。

And with that, when we fit our model we also need to specify how many times we want to run through our data set,
so how many epochs we want, and we said in the exercise prompt that we want
1,500 run-throughs of the entire data set。


So I'm going to run this。

And you see, we start to get each one of the different epochs。
and we see that the accuracy increases, the loss decreases。
and at least in the beginning the validation loss should continue to decrease。


So I'm going to pause here and we'll come back once it's run through those 1,500 epochs。

So now we have run through our 1,500 epochs, as we see here, and just a quick reminder:
if we call run_hist_2, which we defined when we set it equal to the fit output,
we have our dictionary with our different keys, which are the loss, the accuracy,
the validation loss, and the validation accuracy。Now, last time we didn't do this, but here we're actually going to plot the accuracy as well。
So we're going to create subplots: we call plt.figure, and then on that figure we add a subplot, and we say that it's a one-by-two grid of subplots and we want to look at the first one。

On that axis we're going to plot the loss
as well as the validation loss, in red and blue respectively。And then for the next subplot
we call add_subplot again and we say we want that second subplot。

There we plot the accuracy as well as the validation accuracy, in red and blue as well。
And we're going to include a legend on each, so you won't have to remember which one was which。

So you run this, and here we see
that the loss on the training set obviously continued to go down,
but on the validation set, going through 1,500 epochs with those two layers,
we definitely overfit, as we see that
the validation loss actually starts to increase around the 800-epoch point,
where we kind of see a bit of an inflection。

Then we can see the accuracy jumping up and down as it's evaluated throughout。
We see that it fairly steadily increases as we fit our model closer and closer,
and then again we see a bit of a decrease around
800 to 1,000 epochs。That's not going to correlate perfectly with the loss:
one plot is our loss function and the other is our overall accuracy when we check it。
We see the accuracy jump up and down, but it really plateaus, definitely past the 1,000-epoch mark。

Then we can use what we did before in order to predict both the classes and the probabilities,
and check our accuracy as well as our ROC AUC score once we get those outputs, using the predicted classes and the predicted probabilities respectively。
Then we can also plot our ROC curve using that function that we defined earlier。
So you run this, and we see a bit higher accuracy and a bit higher ROC AUC,
and then we see the curve as well。Now again, there's a bit of randomization, so we won't necessarily get the exact same answer, and maybe there's not that much improvement。Especially given the amount of time it took to train this model, there probably wasn't enough improvement over our random forest。So keep in mind that a neural net is not always going to be the best solution。There is this talk within the data science community about how people just throw neural nets at everything, and how that's not best practice。
I want to ensure as you watch this video that you keep that in mind as well, but neural nets will be very powerful throughout, and that's why we're learning them here。







That closes out our video here with the introduction to Keras。
Feel free to play around with this model that we have here。You can add on more layers,
increase the number of nodes, decrease the number of nodes, change the activation functions,
change your optimizers。You can play around with each of these and see how the model runs。
I would say don't do each one for 1,500 epochs。You're probably overfitting, and it'll take too long,
but it is worth playing around and getting familiar with what you can change within this Keras framework。
All right, I'll see you back at lecture。






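068: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p68 29_Optimizers and momentum.zh_en -BV1eu4m1F7oz_p68-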
Now let's talk about optimizers。In this video we're going to discuss different optimizers available to us when learning the appropriate weights for our given data and our neural net model。
So far we've discussed different approaches to gradient descent that vary the number of data points involved in each one of our gradient descent steps,
such as a single data point in stochastic gradient descent,
a subset of data points with mini-batch gradient descent,
and the entire set with full-batch gradient descent。Now, no matter which we use,
they all have that same update formula to find the optimal weights:
our weight at the next iteration is going to be equal to the prior weight minus the gradient times some learning rate alpha。
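Written out as an equation, the update just described is the following, where the symbols W for the weights and J for the cost function are notation chosen here rather than taken from the slide:

$$
W_{t+1} = W_t - \alpha \, \nabla J(W_t)
$$

Here \alpha is the learning rate and \nabla J(W_t) is the gradient of the cost evaluated at the current weights。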
But there are actually several variants to the step of updating the weights that will give us better performance。

And these tweaks to the updating step will all be built around improving further and further from this original formulation that we see here。

And these different methods of updating the weights or optimizing these weights are going to be called optimizers。

So let's start with the concept of momentum。With regular gradient descent,
you'll generally move slowly towards your optimum, and you can be changing direction fairly frequently。

Now with momentum, you're going to smooth out this process。
You do this by taking somewhat of a running average of the steps, thus smoothing out the variation between the individual steps of regular gradient descent。

So if we look at our formula, we see that rather than just updating our weights with that gradient,
we also look back to prior values to smooth out these steps。
So this v at step t that we have will incorporate some amount of v at step t minus 1,
as well as the current gradient at the step that we are at。With this in mind,
our eta (η) value here is going to denote our momentum hyperparameter。
And the larger the value is for that momentum hyperparameter,
the more we are going to be smoothing out our values。In other words,
the more we are incorporating past values into our running average。
We'll generally be giving it values less than one, and a common value chosen here is 0.9,
but again, if you want smoother steps, use a higher value; otherwise use a lower value。
Also worth noting, if you want to do further reading on your own in regards to momentum:
often that term η is going to be replaced by beta,
so beta is going to be the common nomenclature for that value,
and the alpha is replaced by one minus beta。So η is replaced by beta, and the alpha that we see here, which we're used to using as the learning rate, becomes 1 minus beta。
And when we choose our η and our alpha in practice,
we may want to keep this relationship in mind, so if we choose an η equal to 0.9,
you'll probably want to use an alpha around 0.1。
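A minimal sketch of the momentum update on a one-dimensional toy problem, assuming the common form v_t = η·v_{t−1} − α·∇J(w) followed by w ← w + v_t (the slide itself is not reproduced in the transcript, so this form and its sign convention are assumptions)。The cost J(w) = w², the starting point and the step counts are made up for illustration。

```python
def grad(w):
    """Gradient of the toy cost J(w) = w**2."""
    return 2.0 * w

def gradient_descent(w, alpha, steps):
    """Plain update: w <- w - alpha * gradient."""
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

def momentum_descent(w, alpha, eta, steps):
    """Momentum update: the step v blends the previous step with the new gradient."""
    v = 0.0
    for _ in range(steps):
        v = eta * v - alpha * grad(w)
        w = w + v
    return w

w_plain = gradient_descent(5.0, alpha=0.01, steps=50)
w_momentum = momentum_descent(5.0, alpha=0.01, eta=0.9, steps=50)
print(w_plain, w_momentum)
```

On this toy problem, with the same small learning rate, the momentum run ends closer to the optimum at 0 because consecutive steps in the same direction accumulate。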


So just to show this in terms of a picture: for gradient descent,
we can see that we take small steps that can fluctuate quite often。

Now with momentum, we tend to smooth out those steps。The fluctuations aren't going to be as dramatic,
and the steps can get much larger as momentum is gained。
Also worth noting is that momentum can cause you to actually overshoot your optimum value。
But the momentum will shrink at this point, and you should be able to come back to that optimal value, as we see here in the picture。

So the idea with Nesterov momentum, which builds off of the momentum we just learned,
is that it controls for this problem of overshooting,
and it does so by looking one step ahead。


So now, rather than just taking the momentum and the gradient at the current position,
we take the momentum and the gradient at the position with that momentum accounted for。
So you see, rather than just taking the gradient of the cost function at the current weights as we did before,
we take the gradient of the cost function with η times v at step t minus 1, the momentum from the previous step, accounted for。
And this will work because, generally speaking, the momentum vector will be pointing in the right direction,
so it'll be a bit more accurate to use the gradient with the momentum accounted for than the gradient at that original position。
So if we think of standard momentum steps, we see that by using the past steps
we can take larger steps that are closer to the correct direction。And if we now separate out
just that momentum term in our last equation, this is going to be the direction that it actually takes。
Then, taking the gradient with the momentum accounted for, as we do with Nesterov momentum,
we have this extra correction in the right direction,
and the Nesterov steps move even more smoothly towards our optimal value。

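069: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p69 30_Regularization techniques for deep learning.zh_en -BV1eu4m1F7oz_p69-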
Now, we've discussed the importance of regularization and the trade off between bias and variance when working with different algorithms in prior courses。
In this current set of videos, we'll discuss different regularization techniques for deep learning specifically and why it's of utmost importance when creating these more complex models。
So in this section, we're going to cover regularization techniques for deep learning and why it's so important for deep learning specifically。

We'll touch on dropout as well as early stopping, which are different regularization techniques specific to neural nets。

And with that, we'll also discuss different approaches to optimization when working through our neural nets and why those are important。


So to start off, technically, a deep neural net is a neural net that has two or more hidden layers, and often many more。

So a neural network becomes a deep neural network once we move beyond two layers。
The importance of regularization when it comes to deep neural nets

is that with more and more layers, we can learn more and more complex models,
and those complex models can often fit our training data nearly perfectly,
but due to their complexity they will often overfit to that training data and not generalize well to new data。




Now, with that in mind, let's quickly remind ourselves what regularization is。

Regularization is any modification we make to a learning algorithm that's intended to reduce its generalization error。
but not necessarily its training error。

So again, we're not optimizing for training error, but rather tweaking our algorithm to optimize for generalization, usually at the price of additional training error。
So for neural nets, there are several means of regularization available to us。

Adding some regularization penalty in our actual cost function is an option,
and this would be similar to what we saw with Lasso or Ridge, for example,
where more and higher weights will be penalized within our actual cost function。



We have something called dropout, where we'll randomly drop certain neurons in our network to ensure the model is not over-reliant on any particular neuron or any particular path。

We'll talk about early stopping, which will just be the idea of stopping gradient descent short so that the model is not perfectly fit to the training set。

And then with that, to some degree, those ideas of stochastic and mini batch gradient descent that we discussed in prior videos may ensure that we don't perfectly fit to our training set

and therefore may generalize a bit better than full batch gradient descent。

Now, starting with that first option that we had listed there,
we can choose a loss function that will penalize our model for having higher weights。And again,
this will be similar to Ridge regression if we're trying to predict a numerical value, in that we can take something like our mean squared error

and add on a regularization term that penalizes the weights squared。

And if we are trying to predict a categorical variable

rather than a numerical variable, this will be very similar, in that we will simply be adjusting our categorical loss function to penalize, again, those larger weights。
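As a rough illustration of the penalized loss just described, the sketch below adds a squared-weight penalty to the mean squared error. The function names, the penalty strength `lam`, and the sample numbers are assumptions made up for the example.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)


def regularized_mse(y_true, y_pred, weights, lam=0.1):
    """Data-fit term plus the penalty on large weights."""
    return mse(y_true, y_pred) + l2_penalty(weights, lam)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.5]

# Same predictions, different weight magnitudes: the larger weights cost more.
small_weights_loss = regularized_mse(y_true, y_pred, weights=[0.5, -0.2, 0.1])
large_weights_loss = regularized_mse(y_true, y_pred, weights=[0.5, -2.0, 1.0])
print(small_weights_loss, large_weights_loss)
```

For a categorical target the same `l2_penalty` term would simply be added to the categorical loss instead of to `mse`.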


Now let's move to a method that will be more specific to working with neural networks。
namely dropout。

Now, with dropout, we'll be randomly removing a subset of the neurons for each batch。

So a common problem that neural nets faced in earlier implementations was that although there are all these different pathways available when optimizing these complex neural nets。

Often they would become over reliant on particular pathways。
While not much learning was done on other pathways through other nodes。

So creating a deeper or denser network, adding on more layers or adding on more nodes would tend to not add that much value to our neural net。


By adding in dropout, our neural networks will not be able to over-rely on any individual pathway,

and thus they'll become more robust to overfitting to that particular pathway。

And finally, since we're randomly dropping out neurons and learning on just a subset,
we will then have to rescale the weights of those neurons at the end to reflect the percentage of the time that each neuron was active during training versus not active。



So we can think of this image that we have here as our standard fully connected feed forward neural network that we are familiar with at this point。


With dropout, at each iteration we end up with something like we see here to the right,
where we drop a predetermined percentage of the nodes at each one of our layers, so that our model is forced to learn with different possible pathways through to our solution。



So what exactly do we mean when we say that we are rescaling our neurons after training our model?
If you think about it, when we are training our model,
a node is going to have a probability p, as we see here to the left, of being present

at any particular iteration, and thus the remaining weights are going to be scaled up to make up for the nodes that are missing。

Thus, when we actually get to testing time and all the nodes are present,
we need to ensure that we appropriately rescale our weights,
and that's why we have p times w at the end,

when all these nodes are always present。

Now, another heuristic that is available to us, which we discussed earlier, is the idea of early stopping。

And this just refers to choosing some rule at which to stop training our model to ensure that we don't overfit。

Now an example of this would be to start by checking the loss on a validation set,
so not the training set, but rather some holdout validation set, every 10 epochs。

And if the validation loss at that next check, 10 epochs later, is higher than it was at the previous check,
then we would stop training。

So that's the idea of early stopping。
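The rule from this example can be sketched as a small function. It is a simplified illustration that works on a list of already-recorded validation losses; the function name and the synthetic loss curve are assumptions for the example.

```python
def early_stopping_epoch(val_losses, check_every=10):
    """Return the epoch at which training would stop, or None if it never stops.

    val_losses[i] is the validation loss measured after epoch i + 1.
    The loss is only inspected every `check_every` epochs, and training
    stops the first time a check is worse than the previous check.
    """
    previous = None
    for epoch in range(check_every, len(val_losses) + 1, check_every):
        current = val_losses[epoch - 1]
        if previous is not None and current > previous:
            return epoch
        previous = current
    return None


# Synthetic curve: validation loss falls until epoch 30, then rises again.
val_losses = [0.5 + 0.01 * abs(epoch - 30) for epoch in range(1, 51)]
stop_epoch = early_stopping_epoch(val_losses)
print("stop training at epoch:", stop_epoch)
```

In practice the check usually runs inside the training loop, and it is common to keep the weights from the best check rather than the last one.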

Now that closes out this section on common regularization techniques that are available to us。

In the next videos, we'll discuss another important part of tuning your neural net,

namely what type of optimizer you're going to use。All right, I'll see you there。


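070: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p70 31_Popular Optimizers.zh_en -BV1eu4m1F7oz_p70-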
Now let's move a bit away from this concept of momentum and talk about the AdaGrad optimizer,
which is short for adaptive gradient algorithm。The idea here is to scale the update for each weight separately as we do our gradient descent and update our weights。
So what will this do? It'll update frequently updated weights a bit less。
And while updating, it will keep a running sum of the prior squared gradients。
And then any new updates will be scaled down by a factor based on that running sum, so that the steps continuously decrease。
So let's look at what this actually means。

The key difference when we use AdaGrad, compared to our normal gradient descent, is this term G。
And this term G will continue to increase, as we'll be starting at zero
and we'll keep on adding squares of that derivative that we see here,
and obviously squares will always be positive, so G will continuously increase。
Then in order to update W, rather than just using the learning rate,
we use the learning rate divided by the square root of this G value。
And since G is continuously increasing, we know that the effective learning rate will continuously decrease,
and this will lead to smaller and smaller updates at each iteration。
So as we get closer and closer to the optimal value,
that learning rate will shrink and will help us avoid that overshooting。
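The AdaGrad update described here can be sketched for a single weight as follows. The small constant `eps` (added to avoid dividing by zero), the base learning rate, and the toy cost J(w) = w^2 are assumptions for illustration.

```python
import math


def adagrad_step(w, g_sum, grad, lr=0.5, eps=1e-8):
    """One AdaGrad update for a single weight.

    g_sum is the running sum G of squared gradients; it only ever grows,
    so the effective learning rate lr / sqrt(G) only ever shrinks.
    """
    g_sum = g_sum + grad ** 2
    effective_lr = lr / (math.sqrt(g_sum) + eps)
    return w - effective_lr * grad, g_sum, effective_lr


w, g_sum = 5.0, 0.0
effective_lrs = []
for _ in range(20):
    grad = 2.0 * w  # gradient of the toy cost J(w) = w**2
    w, g_sum, effective_lr = adagrad_step(w, g_sum, grad)
    effective_lrs.append(effective_lr)

print("first and last effective learning rate:", effective_lrs[0], effective_lrs[-1])
print("weight after 20 steps:", w)
```

In a real network G is kept separately for every weight, which is what lets frequently updated weights take smaller steps than rarely updated ones.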

Now I'd like to move on to another optimization method, namely RMSprop,
which is short for root mean square propagation。
Now we're working with very similar functionality to the AdaGrad that we just discussed,
except that rather than just using the sum of our prior squared gradients,
we're going to be decaying older gradients and giving more weight to more recent gradients。
And this is similar to the functionality that we used for momentum。Now we're just using that weighting that we discussed for momentum, except applied to the learning rates。
And this will allow for updates to be more adaptive to recent gradients, and it is usually much more efficient than working with just AdaGrad。
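The lecture does not give the RMSprop formula, so the sketch below assumes the commonly used form, an exponentially decaying average of squared gradients with decay rate `beta`, and contrasts it with AdaGrad's plain running sum. The gradient sequence and the value of `beta` are illustrative.

```python
import math


def adagrad_accumulate(g_sum, grad):
    """AdaGrad: plain running sum of squared gradients (never forgets)."""
    return g_sum + grad ** 2


def rmsprop_accumulate(v, grad, beta=0.9):
    """RMSprop (assumed form): decaying average, so old gradients fade out."""
    return beta * v + (1.0 - beta) * grad ** 2


# One large early gradient followed by many small ones.
grads = [10.0] + [0.1] * 100
g_sum, v = 0.0, 0.0
for g in grads:
    g_sum = adagrad_accumulate(g_sum, g)
    v = rmsprop_accumulate(v, g)

lr = 0.01
adagrad_lr = lr / math.sqrt(g_sum)  # still dominated by the first gradient
rmsprop_lr = lr / math.sqrt(v)      # has adapted to the recent small gradients
print("AdaGrad effective learning rate:", adagrad_lr)
print("RMSprop effective learning rate:", rmsprop_lr)
```

Because the early large gradient decays away, RMSprop's effective learning rate recovers, whereas AdaGrad's stays small for the rest of training.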

And then finally we have this optimizer Adam,
which is short for adaptive moment estimation; don't worry too much about what it's short for,
but this will combine both the concept of momentum and the RMSprop that we just discussed, putting them both together。
So on the left side here, we have values similar to momentum。
If you recall our discussion of momentum, we're just going to be replacing our η with beta 1 and our alpha with 1 minus beta 1,
which can be used for the momentum in our past formula as well, as we discussed。
Now we didn't get into the math of RMSprop, but I did mention that it'll work similarly to the formula for momentum,
which is what we see here to the left。So to the right, for RMSprop, our v_t value,
which stands for the RMSprop portion, is specific to our learning rate。
We'll have a very similar update to give most weight to the most recent values。
Now I'd like to note here, if you're trying to figure out how to set each one of these values, beta 1 and beta 2: by default,
beta 1 will be 0.9 and beta 2 will be 0.999, and they generally do not need to be played around with too much,
but you can play around with them a bit if you find that you're not getting to the optimal model。
Now there's going to be a bit of bias built into each of these terms。So for m_t,
you're going to want to correct that bias by dividing by 1 minus beta 1 to the t。
And this is meant more for correction towards the beginning。As you can imagine, as t is growing,
the larger t is, the smaller beta 1 to the t will be。It'll continue to shrink as t grows。
And then we do the same for v_t, which, again, is the RMSprop portion, dividing by 1 minus beta 2 to the t。And finally,
we update our weights using our special learning rate scaled for v_t that we just calculated,
multiplied by our momentum term m_t。And there we have it,
our Adam optimizer, combining both RMSprop and this concept of momentum。
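Putting the pieces together, one Adam update for a single weight might look like the sketch below. The defaults beta 1 = 0.9 and beta 2 = 0.999 come from the lecture; the learning rate of 0.001 and the small constant `eps` are assumed values, and the function name is made up for the example.

```python
import math


def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight at time step t (t starts at 1)."""
    # Momentum-like running average of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    # RMSprop-like running average of the squared gradient.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction: both averages start at zero, so early values are too small.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Step scaled by the corrected second moment, in the direction of the first.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step from w = 5 with gradient 10 and zero-initialised averages.
w1, m1, v1 = adam_step(w=5.0, m=0.0, v=0.0, grad=10.0, t=1)
print("uncorrected m after one step:", m1)  # much smaller than the gradient itself
print("weight after one step:", w1)
```

At t = 1 the uncorrected average m is only a tenth of the gradient; dividing by 1 minus beta 1 to the t restores it to the full gradient, which is why the correction matters most early in training.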

Now, which one should we choose among the optimizers that are available to us? Now,
RMSprop and Adam have become quite popular, and from 2012 to 2017,
approximately 23% of deep learning papers submitted to this popular platform for research in deep learning
mentioned using the Adam approach。

Now, it can be difficult though, to predict in advance which one of these approaches will work best for a particular problem。

And this is actually still an active area of inquiry in deep learning research。Now,
I would say it's important to note that while Adam speeds up the optimization process tremendously
and usually does a fairly good job at finding optimal solutions,
there are going to be times when it does have trouble converging。
And there are actually even different versions of Adam that have been developed recently。
And with that, I would say, whether you're using different iterations of Adam or the other optimizers that we just discussed that may speed up the training,
if you're still having trouble with convergence, at least try using just regular mini batch gradient descent, or full batch or stochastic gradient descent, as well。


So just to recap, in this section we went over why it's so important to have regularization with deep learning models, as these complex models are powerful enough to fit almost exactly to our training data,
and with that in mind, we went over different regularization techniques,
such as what we've seen in Ridge, with adding on a penalization term for higher weights within that cost function,
as well as, as we see here in the next bullet, using dropout so that our models aren't over-reliant on particular pathways through the network,
as well as early stopping, where we may be checking against a validation set as we train to prevent overfitting。
And finally, we discussed different optimizers available to us beyond regular gradient descent,
including using momentum, RMSprop, or combining the two using Adam。
Now that closes out this set of videos。In the next set of videos,
we'll review some of the extra pieces to keep in mind when building out our actual neural networks, which will close out all we need to know to get started in tuning our own neural networks in Python。

All right, I look forward to seeing you there。

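071: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p71 32_Details of Training Neural Networks.zh_en -BV1eu4m1F7oz_p71-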
In this section, we're going to cover some final missing pieces to keep in mind before starting to look into actually coding up our own neural networks。
Now let's go over the learning goals for this section。In this section,
we're going to cover some of the details of training neural network models,
and a lot of this will be review, as we'll go back over stochastic gradient descent
as well as other batching approaches and important terminology。The reason we do this is that once we start to actually implement our neural nets in Python,
we'll actually have to tune each one of these different parameters that we're going to discuss here。

So given our different data points within our data set。
We now know how to compute the derivative for each one of our weights。
And we went over different options on how to use that derivative to update our weights using different optimizers。

And now I want to review how often we should actually go about updating our weights。
As this is going to be something again that we're going to have to tune when creating our neural net models in Python。

So what do I mean by how often we need to update our weights?
We're going back and reviewing this idea of using all of our data set, part of our data,
or maybe even just a single row。So in our classical approach,
we'll be getting the derivative for the entire data set,
and we'll use that derivative to update our weights。So we're using the entire data set。
The pro of this is that each step will be informed by all the data。
But the con will be that this can tend to be very slow,
especially as that data set grows very large。

Now on the other end of the spectrum, we again have stochastic gradient descent。
And with stochastic gradient descent, we get the derivative at just a single row, at just a single point, and take a step in that direction。
This means that each one of those individual steps may be less informed,
but you'll ultimately take many more of those steps as you run through your entire data set。
And the hope is that, with us being able to quickly take more steps,
it will ultimately balance out any missteps you make along the way。
Given that you can take missteps at every iteration,
you probably want a smaller step taken each time so you don't veer too far away in the wrong direction。
And also, since it won't be perfectly fitting to the entire data set,
this will help in slightly regularizing your model as well。

And then we have our compromise, using mini batch gradient descent。
And here we'll get the derivative using just a subset of our data set and then take a step in that direction,
according to the derivative of that subset。The typical mini batch size will tend to be 16 or 32 rows, and you can tune this。Approximately, the more rows that you choose,
the slower it may be to learn。Again, think about stochastic gradient descent being a single row, learning very quickly,
so the larger the subset you have to learn that derivative on, the slower it may take。
And the idea of this compromise is to strike a balance between the extremes of full batch gradient descent and stochastic gradient descent。

Now, just to hammer this all home, let's visualize each of these approaches in comparison to one another。
So we see all the way to the left here faster and less accurate steps, and all the way to the right we'll have slower and more accurate steps,
and I want you to think, given everything we just discussed,
where stochastic gradient descent will fall, where mini batch gradient descent will fall and where full batch gradient descent will fall。
So all the way here to the left, as I hope you predicted on your own,
we're going to have stochastic gradient descent, where we'll have faster, less accurate steps。
And then we see the zigzag going as it tries to optimize the model。
Then on the other end of the spectrum, we have full batch gradient descent,
which is going to be those slower but more accurate steps taken。And then finally,
we have our compromise in mini batch gradient descent, where it falls somewhere in the middle:
it's not quite as fast as stochastic, but faster than full batch,
and it's not quite as accurate as full batch, but it is more accurate than stochastic gradient descent。

Now, just to review some batching terminology, we have full batch using the entire data set to compute the gradient before updating。
we have mini batch, which uses a smaller portion of the data。
but more than just that single example that you would use with stochastic gradient descent。


And then we have stochastic gradient descent, which just uses a single example to compute the gradient before updating,
though something to note as you do some learning on your own is that
people sometimes use SGD to refer to mini batch,
so be aware of that as you start to read the literature in regards to choosing your batch size。


Now, another piece of important terminology is going to be this idea of an epoch。
And the number of epochs is going to be one of those hyperparameters that you're going to have to tune when you are actually implementing your neural nets in Python。
An epoch refers to a single pass through all of the training data。Now, what do I mean by that?
If we think about full batch gradient descent, there would be one step taken at every epoch, because we're setting how many times we're passing through the data,
so in every single step we pass through all the data; we do a full epoch。In SGD,
in stochastic gradient descent, there's going to be n steps taken per epoch。
So we're going to take as many steps as there are rows in the data set every time you run through an epoch,
because, again, an epoch just means that we have run through the whole data set。
And then with mini batch, there's going to be n, the number of rows, divided by the batch size
steps taken per epoch。So if you just think about the data set being 360 rows and we say a batch size of 36,
we will take 10 steps at every single epoch。
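The step counts above follow from one small formula, sketched here. Rounding up when the batch size does not divide the number of rows evenly (so the last, smaller batch still counts as a step) is an assumption; some setups drop that partial batch instead.

```python
import math


def steps_per_epoch(n_rows, batch_size):
    """Number of weight updates in one pass through the data."""
    return math.ceil(n_rows / batch_size)


n_rows = 360
print("full batch:", steps_per_epoch(n_rows, batch_size=n_rows))
print("stochastic:", steps_per_epoch(n_rows, batch_size=1))
print("mini batch:", steps_per_epoch(n_rows, batch_size=36))
```

Full batch and stochastic gradient descent are just the two extreme batch sizes, the whole data set and a single row.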

And when training, we often refer to the number of epochs that are needed for that model to be trained,
and that's going to be an important hyperparameter that we're going to tune as we try to create our own neural net models in Python。
So that closes out this video, and in the next video
we're going to discuss another piece of terminology worth understanding,

namely data shuffling。

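072: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p72 33_Data Shuffling.zh_en -BV1eu4m1F7oz_p72-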
Now in this video, let's discuss the concept of data shuffling。
So if we think about stochastic gradient descent or mini batch gradient descent,
we'll be going over a subset of our entire data set at each step。So to avoid any cyclical movements,
that is, to avoid going down the same path every time as we do our gradient descent, and to aid convergence,
it's recommended to shuffle the data after each epoch。By doing so,
the data is not seen in the same order, if you think about, again,
mini batch gradient descent or stochastic gradient descent, so that you're not looking at the batches in the same order every single time,
and the batches are not going to be the exact same ones every single time。

So let's go over what this actually looks like。Now, if we were to do full batch gradient descent,
then we would run through the entire data set, that would be a single epoch,
and there would be no reordering necessary。

Now if we are to split this into multiple batches, as we normally would with mini batch gradient descent,
for example, there's going to be a specific ordering that we'd split it up into, with batch one being a certain subset,
batch two being a certain subset and so on。

And then recall that at each one of these batches, we find the derivative and use that to move our weights towards the optimal value。
So at each batch, we're taking another step, moving closer and closer towards its optimal value。
And once we run through the whole data set, then we've run through a single epoch。Now, in reality。
we're going to run through more than one epoch。

We're going to want to actually have multiple run-throughs of the data set。
And just to see how many run-throughs we have here, we split it up into a bunch of slices;
this is meant to represent, even though it's the same length,
multiple epochs through our full data set。

And you see that there's not that same ordering of the different colors。
The colors are a bit random here, as after that first epoch。

Rather than going back to batch 1, it's going to actually start with some other random batch。
And that batch doesn't even have to be the same batch that we had before。
And that will be the next step and will keep running through until we reach that optimal value。Again。
the idea being that we shuffle around our data。So that at each step。
we are going to be looking at a different subset of data so that we don't keep repeating that same path。
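A minimal sketch of reshuffling before each epoch is shown below. The generator name, the toy data set of 12 row indices, and the fixed seed are assumptions for the example; the point is that every epoch still covers each row exactly once, but the composition and order of the batches change.

```python
import random


def minibatches(rows, batch_size, rng):
    """Yield the rows in mini batches, in a freshly shuffled order."""
    indices = list(range(len(rows)))
    rng.shuffle(indices)  # new random order every time the generator is created
    for start in range(0, len(indices), batch_size):
        yield [rows[i] for i in indices[start:start + batch_size]]


rows = list(range(12))   # stand-in for 12 rows of training data
rng = random.Random(42)  # fixed seed so the run is repeatable

# Three epochs, each one a full pass split into batches of 4 rows.
epochs = [list(minibatches(rows, batch_size=4, rng=rng)) for _ in range(3)]
for number, batches in enumerate(epochs, start=1):
    print("epoch", number, batches)
```

In a training loop, one gradient step would be taken for each batch yielded, so this example takes three steps per epoch.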

Now that closes out this video。Now let's recap what we learned here in this section。In this section。
we discussed the details of training neural network models。
specifically working with different types of batching。
such as we see here stochastic gradient descent or mini batch gradient descent or full batch gradient descent。
And with those different batching approaches, we discuss important terminology such as working with Epochs and understanding that an epoch is just one run through the data set and depending on whether you're doing stochastic。
mini batch or full batch gradient descents, you will make a certain amount of steps。
Towards your optimal value at each epoch。And then we discuss this idea of shuffling。
where if you're going to use mini batch or stochastic gradient descent。
make sure that you're not just repeating the same steps over and over again at each epoch。
Now that closes out this video in regards to the fundamentals that we will need。
And in the next video, we'll actually introduce the library that we'll be using in order to implement our neural nets in Python。
All right, I'll see you there。


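073: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p73 34_Transformations.zh_en -BV1eu4m1F7oz_p73-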
So now I'd like to discuss actually scaling our inputs。So in our discussion of back propagation,
we briefly touched on the formula for the gradient used to update the values of our weights W。
And this will, I promise, tie back into scaling our inputs, so just hold tight,
but in order to update our weights, we take the partial derivative with respect to W and we get, again,
y hat minus y, which is that first partial derivative, in a dot product with a,
whatever that input was from the last layer。

And at each iteration of gradient descent, W new, our new W, is going to be W old minus a learning rate times this partial derivative。

Now, when i equals zero, that is, at the first layer, we are using the actual input values x as part of the derivative to update W new。
So those input values at that first layer are going to play a large role。


And this is going to mean that if we don't normalize the input values,

the weights for inputs with higher values are going to update much more quickly than those for inputs with lower values,
because again, we're using the a_i from that prior step in order to update our values。

So if we have them on different scales, the weights for higher values will update quickly and the weights for lower values will not update as quickly,
throwing off the way that we update our actual models。
This imbalance can greatly slow down the speed at which our model actually converges。
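The imbalance can be seen directly from the gradient formula, the error (y hat minus y) times the input feeding each weight. The sketch below assumes a single linear unit for simplicity, and the feature values are made up so that the second feature sits on a scale 1000 times larger than the first.

```python
def weight_gradients(x, y, w):
    """Gradient for each weight of a single linear unit: (y_hat - y) * x_j."""
    y_hat = sum(wj * xj for wj, xj in zip(w, x))
    return [(y_hat - y) * xj for xj in x]


def gradient_step(w, grads, lr):
    """W new = W old - learning rate * gradient."""
    return [wj - lr * gj for wj, gj in zip(w, grads)]


x = [0.5, 500.0]  # two features on very different scales
y = 1.0
w = [0.0, 0.0]

grads = weight_gradients(x, y, w)
w_new = gradient_step(w, grads, lr=0.001)
print("gradients:", grads)
print("updated weights:", w_new)
```

With the same learning rate, the weight attached to the large-scale feature moves 1000 times further in one step, which is the imbalance that scaling the inputs removes.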


So for that reason, we need to scale our inputs, and one of the ways to scale our inputs that we've discussed in prior courses is linear scaling to the interval between 0 and 1,
which is going to be our min-max scaling, (x − x_min) / (x_max − x_min), to ensure the values are all between 0 and 1。

Or we can do linear scaling to the interval between negative 1 and 1,
which is just going to be 2 (x_i − x_min) / (x_max − x_min) − 1,
and that just ensures that you have values between negative 1 and 1。And again,
we could also use the standard scaler。Sometimes we want these values between 0 and 1 or between negative 1 and 1
because, if you think about using the sigmoid function or the hyperbolic tangent function,
that will allow each one of our inputs and outputs to stay on that same scale。
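The three scalings mentioned here can be written out in a few lines. This is a plain-Python sketch rather than the scikit-learn scalers used in the course notebooks; the function names and sample values are assumptions, and the code assumes the values are not all identical (otherwise the denominators would be zero).

```python
import math


def min_max_scale(values):
    """Linear scaling to [0, 1]: (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def scale_to_minus_one_one(values):
    """Linear scaling to [-1, 1]: 2 * (x - x_min) / (x_max - x_min) - 1."""
    lo, hi = min(values), max(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def standardize(values):
    """Standard scaling: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


incomes = [20000.0, 35000.0, 50000.0, 80000.0]
print(min_max_scale(incomes))
print(scale_to_minus_one_one(incomes))
print(standardize(incomes))
```

Whichever scaling is chosen, the minimum, maximum, mean, and standard deviation should be computed on the training data and then reused for validation and test data.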


So let's recap what we learned here in this section。

In this section, we discussed preprocessing and preparing our data for our neural net models,
and with that we introduced how we can do multiclass classification with neural networks using one-hot encoding as well as the softmax function,
and then we discussed the importance of scaling your neural network inputs to ensure that you have balanced updates of each one of your weights,
and we talked about how you can use different scalers, such as the MinMaxScaler or the StandardScaler, to ensure that each one of your values is on the same scale。






So that closes out our discussion here on different transformations that are important for your different neural net models。
and in the next section, we're going to introduce our first different type of model framework for our neural networks。
namely convolutional neural networks。All right, I'll see you there。





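074: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p74 35_Categorical Cross-Entropy.zh_en -BV1eu4m1F7oz_p74-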
So we just went through that notebook introducing Keras,
and in that notebook we saw that we needed to actually implement some transformations in order to ensure that our neural nets performed optimally。
Here we're going to talk about some other important transformations to keep in mind when training our neural net models。
So let's go over the learning goals for this section。In this section,
we're going to cover preprocessing and preparing your data for analysis,
so all the steps that are going to have to come into play when you're thinking about creating a neural network。


Part of that will be if you're doing multi class classification。

How to set it up so that you can predict across multiple classes rather than what we've seen so far where there's just one class or the other。


And then finally, we're going to discuss the importance of scaling the inputs to your neural net models,
and we saw this in our notebook as we went ahead and used the StandardScaler in order to scale our data,
and you can also use something like the MinMaxScaler, which we've seen in earlier courses。



So for binary classification problems,

where we're just trying to decide between two different classes,
we have a final layer with just a single node and a sigmoid activation, and we saw that in our last notebook, where we had a fully connected dense network all connected to that final node;
there was only one node in that final layer and we had a sigmoid activation function in order to allow for that output。




Now the sigmoid activation function has many desirable properties。

One is that it gives an output strictly between0 and1。
and that value can be interpreted as a probability so we can say which one is more likely and by how much。



It's going to have a nice derivative, meaning that it's going to be easy to find the gradient as well as to use that to do back propagation。

And it's going to be analogous to logistic regression, where you'll have a bunch of inputs
go linearly into that node, and then you'll have that one nonlinear transformation, as you do with logistic regression, to output that value, again between 0 and 1。



Now the question is, is there a way to extend this to a multiclass setting if we're trying to predict across multiple classes?

If we want to do this multiclass classification, we can use what we learned in regards to one-hot encoding。We used this most frequently when working with different feature variables, and we're just going to use that concept for our outcome variable。


So one-hot encoding, again, is for categories。

And you can take, for example, a vector with length equal to the number of categories,
so say that your vector just has one value for each category, and those different categories are going to be, in this case,
checking, savings and mortgage, the type of account that you have there。




You can then represent each category with one at a particular position and zero everywhere else。

So, for example, with our bank account example, rather than just having 1, 2, 3, we can have

three new columns, one for each category。Where the original value was checking,
we put a one there on top and zeros everywhere else。

Where the value was savings, we put a one in the middle and zeros everywhere else,
and again, that top zero refers to whether that value is checking。The bottom position will be whether or not it's mortgage, and we put a one at that bottom position when the value was mortgage and zeros everywhere else。
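The bank account example can be reproduced with a short sketch. The category order (checking, savings, mortgage, from top to bottom) follows the description above; the function name and the sample records are assumptions.

```python
def one_hot(label, categories):
    """Vector with a 1 at the position of `label` and 0 everywhere else."""
    if label not in categories:
        raise ValueError(f"unknown category: {label}")
    return [1 if label == category else 0 for category in categories]


categories = ["checking", "savings", "mortgage"]
accounts = ["checking", "savings", "mortgage", "savings"]

encoded = [one_hot(account, categories) for account in accounts]
for account, vector in zip(accounts, encoded):
    print(account, vector)
```

Each encoded vector has the same length as the number of classes, which is also the size of the final layer in a multiclass network.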





So for multiclass classification problems, we're going to let that final layer be a vector with length equal to the number of possible classes as we just saw on the last slide。


And then we can extend the idea of the sigmoid to multiclass classification using the softmax function,
and that softmax function is just going to be e to the z,
whatever that z output was for a particular class,

over the sum of e to the z for all of the classes combined。

And what that does is yield a vector with entries that are going to be between 0 and 1,
normalizing them all to between 0 and 1, and that will ultimately sum to one,
so that we can get the probabilities for each one of the individual classes。
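The softmax function just described can be sketched in a few lines. Subtracting the largest z before exponentiating is an addition made here for numerical stability; it does not change the result, because the same factor cancels in the numerator and the denominator. The sample scores are made up.

```python
import math


def softmax(z):
    """e**z_i divided by the sum of e**z_j over all classes."""
    shift = max(z)  # stability trick: keeps exp() from overflowing
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


z = [2.0, 1.0, 0.1]  # raw outputs of the final layer, one per class
probabilities = softmax(z)
print(probabilities)
print("sum:", sum(probabilities))
```

The class with the largest z always receives the largest probability, so taking the arg max of the softmax output gives the predicted class.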



Now for the loss function,

when we input it, it's going to be categorical cross-entropy that we're trying to calculate。

And this is just going to be the log loss function in disguise,
so we take that cross-entropy, and that's equal to the negative sum over the classes of y_i,
y being the actual values, times the log of ŷ_i, whatever that prediction is。

And the derivative of this will have a nice property when used with the softmax, in that the derivative of the loss with respect to that last z_i is going to be ŷ_i,
the prediction, minus y_i。
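The "nice property" of the derivative can be checked numerically. The sketch below computes the categorical cross-entropy of a softmax output and compares the claimed gradient, prediction minus actual, against a finite-difference estimate. The scores, the one-hot target, and the step size `h` are illustrative assumptions.

```python
import math


def softmax(z):
    """Softmax with the usual max-subtraction for numerical stability."""
    shift = max(z)
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred):
    """Negative sum of y_i * log(y_hat_i); terms with y_i = 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def loss_from_scores(z, y_true):
    """Cross-entropy of the softmax of the raw scores z."""
    return cross_entropy(y_true, softmax(z))


def numerical_gradient(z, y_true, h=1e-6):
    """Central-difference estimate of d(loss)/d(z_i) for every i."""
    grads = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grads.append((loss_from_scores(up, y_true) - loss_from_scores(down, y_true)) / (2 * h))
    return grads


z = [2.0, 0.5, -1.0]
y_true = [0, 1, 0]  # one-hot target: the second class is the correct one

analytic = [p - t for p, t in zip(softmax(z), y_true)]  # y_hat - y
numeric = numerical_gradient(z, y_true)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print("y_hat - y:        ", analytic)
print("finite difference:", numeric)
```

The two gradients agree to within numerical error, which is why the softmax and categorical cross-entropy pair is so convenient for back propagation.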


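075: IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning", Chinese-English subtitles, p75 36_Introduction to Convolutional Neural Networks (CNN).zh_en -BV1eu4m1F7oz_p75-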
In this set of videos, we're going to introduce a new neural net architecture called convolutional neural nets。
And this has been incredibly powerful for image recognition and is now being used to solve many other tasks as well。
So in order to cover this topic, we're going to discuss convolutional neural networks in general and a bit about that architecture with that original motivation of working with image data in mind。



And then we're going to go over many common terms that you're going to need to know and we'll understand much deeper as we go through these videos such as grid size。
padding, pooling, and depth。


Now let's start off with the motivation behind these convolutional neural networks。
So if we imagine an image, the way that an image works is that each one of the different pixels will have a different numerical value to give you the intensity within the red, green, blue spectrum。
We can think about it on the gray scale to start, but the idea is that there's going to be some type of relationship between each one of our different pixels,
which are going to be each one of our different features。



And the structure of our neural network so far treats all of our inputs interchangeably,
so the relationship or spatial arrangement of those features has no impact on our model。


So the model sees no relationship between these individual features。
You just have this ordered set of variables, feature one, feature two, and so on。
And what we want is to be able to incorporate our domain knowledge of how images are actually built into our neural net architecture。

Now again, these convolutional networks that we discuss here were developed to deal with image data and the motivation behind them will become clear as we work through examples involving image data。
but increasingly as I mentioned before, these approaches are also being applied in other common analytical problems of regression and classification。
such as working with time series。




Now, some thoughts to keep in mind when diving into the motivation behind working with a new architecture。
The variables, and in this case the variables are our pixels, are going to have a natural topology。
They're going to have this spatial component that's actually meaningful。
And this makes images different from, say, loan default prediction where the variables do not have this natural topology or relationship in space from one to another。


We'll also want translation invariance, so when we're trying to identify whether there's a certain object in the image。
We want to ensure that it doesn't matter where that object sits, or the size or the orientation of that object。
It'll be translation invariant。We also want our model to be able to appropriately handle pixel intensities changing due to lighting and contrast。

Convolutional neural nets are going to be based on what we know about the structure of images and also what we know about the human visual system。

The human visual system has receptive fields, which respond to horizontal bars, vertical bars, etc。
and it pieces them together。

Within data, many of the pixels, which again are going to be our features。
will actually tend to have fairly similar values that perhaps won't add much information on their own。
and we want to keep that in mind as well。


Within these images, we're also going to want to be able to identify edges and shapes that exist within that data。
And then finally, you will also want to ensure that it's scale invariant, again。
meaning that it will classify an object within that picture as a cat。
no matter the size of that object, so again, this idea of invariance。
whether it's the size or the orientation。

076:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p76 37_图像数据集.zh_en -BV1eu4m1F7oz_p76-
Now, a fully connected network for images, where the number of pixels in the image is that starting number of features, all fully connected to the next hidden layer, would tend to require a vast number of parameters。

And taking advantage of these structures that we're discussing here will end up meaning fewer parameters。

So if we think about this MNIST image that we're going to see。
that's going to be 28 by 28 pixels on the gray scale, and that's what we see here。
We have this example of the MNIST image。



And the idea here is that we have these handwritten digits ranging from zero to nine。
and we want to use deep learning to predict whether that handwritten image is a four。
here or a five, six, seven, etc。



Now this is an MNIST image on the gray scale。

An average color image, on the other hand, will typically contain 200 by 200 pixels, with three different color channels, red, green and blue, for a total of 120,000 values or 120,000 features to start out our network。


So if we imagine that fully connected network, we will have to start off with at least 120,000 weights just for a single node in that first layer, or 120,001 if you include the bias。



And you can imagine that a single fully connected layer would require an incredible number of weights if that layer has more than one node, or a number of nodes close to the size of that input。


So with this many weights, that variance would be incredibly high with a very high likelihood of overfitting to your data。
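To make those numbers concrete, here is a small sketch of the arithmetic in Python. The helper name `dense_layer_params` and the 1,000-node layer are illustrative assumptions; the 200 by 200 by 3 and 28 by 28 image sizes come from the discussion above.

```python
def dense_layer_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias term.
    return n_units * (n_inputs + 1)

color_features = 200 * 200 * 3   # width * height * RGB channels
mnist_features = 28 * 28         # gray scale, a single channel

print(color_features)                            # 120000 input features
print(dense_layer_params(color_features, 1))     # one node: 120001 parameters
print(dense_layer_params(color_features, 1000))  # hypothetical 1,000-node layer
print(dense_layer_params(mnist_features, 1))     # one node on an MNIST image
```

Even a modest hidden layer pushes the parameter count past one hundred million, which is the variance problem the convolutional architecture is meant to address.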


With that in mind, we're going to introduce a bias, and in this case that means a bias relative to that fully connected network, not the bias term inside a node, such that the architecture will be adjusted to look for certain kinds of patterns。



Now the motivation behind this new architecture is that different layers can learn certain intermediate features。


So we can start off with edges which then build up into shapes。
which can then be built into relations between different shapes even。


As well as identifying different textures within images, and if you think about this buildup。
the relationship between the different pixels will be needed to make the identification of the slightest edges or these types of textures that we're talking about。



So an example of this buildup of features can be understood by thinking about the identification of a cat。
which has features such as two eyes that are certain distance and angle from one another。
as well as having the texture of cat fur。



So to identify just an eye, which would have to be a building block to get to two eyes of a certain relation。
we would first need the building blocks of a dark circle。
that's the pupil, inside of another circle or an oval shape, since it's an eye。



And that circle will be built from a combination of lower level features such as edges。
and the cat fur should also be made up of these lower level edges in a particular pattern。



So that closes out our video providing this idea of that motivation behind that convolutional neural network。

In the next video, we'll talk about kernels and the actual convolution function that's going to make this all possible。


077:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p77 38_核.zh_en -BV1eu4m1F7oz_p77-

So in order to capture this relationship between our different features, those features being the different pixels within our image, we're going to make use of kernels。
Now, a kernel is just going to be a grid of weights that's overlaid on a certain portion of our image, centered around a single pixel。
Once that kernel is overlaid on that portion of the image, each weight from the kernel is multiplied by the pixel beneath it, and remember that pixel is just some number。
The output for that centered pixel is just the sum of all of those multiplications of the kernel weights and their respective pixels。
That's the convolution operation, and that's where we get the name for our convolutional neural nets。
This method of using kernels is what allows us to capture the relationships of nearby pixels to detect blurred portions of images, sharp portions, edges, etc。
So let's look at an example of this in action using a three by three kernel。
Say these are the different values for the pixels, and our image in this example is just three by three。
And then what we have here is our kernel。How would we calculate the output?
Note that overlaying a three by three kernel on a three by three image will only output one single value, and that one single value will be at the center of what will ultimately be our output matrix, which we see here to the right。
So the key is to overlay that kernel on top of the image。
And now, in a way, it's just going to be like a dot product, where we go cell by cell, starting at that first row。
We take the 3 and multiply it by negative 1, right, that's the top left corner of the image times the top left corner of the kernel, and then add 2 times 0, plus 1 times 1。
You see we multiplied each value with its respective point within the kernel across that first row。
And we keep adding that up row by row。So for the second row, we do 1 times negative 2, plus 2 times 0, plus 3 times 2, and we do the same thing for the third row。
So now we have nine different multiplications all being added up, each one with its respective value in the kernel, similar to how we would work with a dot product。
We add those all together and we end up with this output value of two。
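The worked example can be sketched in a few lines of Python. The first two rows of the image and of the kernel are the values read out above. The third rows are not stated in the transcript, so the ones used here are assumptions, chosen only so that the total matches the stated output of two.

```python
# Rows 1 and 2 come from the walkthrough; row 3 of each grid is hypothetical.
image = [
    [3, 2, 1],
    [1, 2, 3],
    [1, 1, 1],   # assumed, not given in the transcript
]
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],  # assumed, not given in the transcript
]

def apply_kernel(patch, kernel):
    # Multiply each pixel by the weight lying on top of it, then add everything up.
    return sum(
        patch[r][c] * kernel[r][c]
        for r in range(len(kernel))
        for c in range(len(kernel[0]))
    )

# Row-by-row partial sums, mirroring how the example is read out.
row_sums = [
    sum(p * k for p, k in zip(patch_row, kernel_row))
    for patch_row, kernel_row in zip(image, kernel)
]
output = apply_kernel(image, kernel)

print(row_sums)
print(output)
```

With these values the first row contributes -2 and the second contributes 4, which is exactly the arithmetic spoken in the example.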
When you're working with an actual image, the original input will probably be something larger than three by three, as we see here。
So what we do is slide that same kernel one over to the right。
Sliding it over one to the right provides the output to the right of that two value within our output matrix。
Similarly, since we have a larger input, again not a 3 by 3 input, we can slide that kernel one cell down, do all the multiplications, and take that dot product。
That gives us the output within our output matrix right below the two, because we slid it down one。And we slide that kernel across every single position that it can take throughout our input image。
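A minimal sketch of that sliding step, assuming a stride of one and no padding. The function name `conv2d_valid` and the 4 by 4 image are illustrative; only the top left 3 by 3 corner echoes the earlier example, and its third row is still an assumption.

```python
def conv2d_valid(image, kernel):
    # Slide the kernel over every position where it fits entirely inside the image.
    # No kernel flip is applied, matching the operation described in the lecture.
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for r in range(k_h):
                for c in range(k_w):
                    total += image[top + r][left + c] * kernel[r][c]
            row.append(total)
        output.append(row)
    return output

image = [               # hypothetical 4x4 input
    [3, 2, 1, 0],
    [1, 2, 3, 1],
    [1, 1, 1, 2],
    [0, 1, 2, 3],
]
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

result = conv2d_valid(image, kernel)
for row in result:
    print(row)
```

A 3 by 3 kernel fits in a 4 by 4 image at two positions across and two positions down, so the output is 2 by 2, with the familiar two in its top left cell.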
Now you can think of the kernels as feature detectors。So here we have a vertical line detector。
There are some good videos on how this helps you detect an actual line, using a matrix similar to what we see here with that convolution function。
But the basic concept is just that as you move this filter along some type of vertical edge, assuming you have that vertical edge, and run that convolution and get your output, you end up being able to highlight the existence of this vertical line。
And similarly, we can overlay the filter that we see here and detect a horizontal line。
Or use this filter that we have here, run it across, and detect any corners that we may have in the image。
The point that we want to take away from here is that each one of these different kernels will be able to detect edges, whether they're vertical, horizontal, or diagonal, corners, or other combinations of features that may be important。
Now, these different filters that we just introduced are useful for building some intuition of what a filter can be, but in reality, the network will find the most useful kernels for you。
Also, I'd like to note, we'll probably set up our framework so that we learn many different kernels, not just one, and every single one of these different kernels will operate across that entire image。
This is what allows for that translation invariance, so it doesn't matter where the object is within an image, whether that object is flipped, or what the size of that object is。
And then also, compared to our fully connected architecture, if you think about having as many different kernels as we have, each one with only nine weights, this is going to require far fewer parameters to learn。
And this will reduce that overall variance in regards to that bias-variance trade-off。
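To see a kernel acting as a feature detector, here is a hedged sketch in plain Python. The kernel values shown on the slides are not in the transcript, so the two Sobel-style kernels below are assumptions, as is the toy image whose left half is dark and right half is bright.

```python
def conv2d_valid(image, kernel):
    # Same sliding sum-of-products as in the lecture: stride 1, no padding.
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(
                image[top + r][left + c] * kernel[r][c]
                for r in range(k_h)
                for c in range(k_w)
            )
            for left in range(len(image[0]) - k_w + 1)
        ]
        for top in range(len(image) - k_h + 1)
    ]

# Hypothetical image: a vertical edge between column 2 and column 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Assumed Sobel-style detectors; the course slides may use different weights.
vertical_detector = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]
horizontal_detector = [
    [-1, -2, -1],
    [0, 0, 0],
    [1, 2, 1],
]

vertical_response = conv2d_valid(image, vertical_detector)
horizontal_response = conv2d_valid(image, horizontal_detector)

print(vertical_response)    # large values where the kernel straddles the edge
print(horizontal_response)  # all zeros: there is no horizontal edge here
```

Each detector uses only nine weights no matter how large the image is, which is the parameter saving relative to a fully connected layer.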

078:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p78 39_彩色图像的卷积.zh_en -BV1eu4m1F7oz_p78-
Now, to bring this all home when you're working with images, generally speaking。
most of our images will not just be on the gray scale, but rather have color。


And for a color image to be represented numerically, it will generally have to be three two-dimensional arrays, all stacked one on top of the other。
As we see here, each one of these two-dimensional arrays represents either the red scale, the green scale or the blue scale respectively。




Now to move our kernels to three dimensions。

Rather than using the convolution operation with just this kernel that's three by three。
We're going to use convolutions with a filter, filter being the term once we move up to three dimensions。
which may be three by three by three, so it's going to be three 3 by 3 kernels all stacked together。

So that instead of having nine multiplications added together to get our one output。
we have the sum of 27 multiplications。

You can think about the filters that we learn as having nine weights for each one of these different channels。
so 9 for red, 9 for green and 9 for blue, and we multiply those respectively by each one of their different components within that input image to get our one centered output。
so we're adding together 27 different multiplications。
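A small sketch of that 27-term sum, using made-up numbers. The patch and filter values below are hypothetical, and the channel-first nesting (`patch[channel][row][column]`) is just a convenient layout for this illustration.

```python
# Hypothetical 3x3 patch of an RGB image, stored as one 3x3 grid per channel.
patch = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # red
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],  # green
    [[3, 3, 3], [3, 3, 3], [3, 3, 3]],  # blue
]

# Hypothetical 3x3x3 filter: three stacked 3x3 kernels, one per channel.
filt = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],        # weights applied to red
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],        # weights applied to green
    [[-1, -1, -1], [-1, -1, -1], [-1, -1, -1]],  # weights applied to blue
]

# One multiplication per (channel, row, column) position.
products = [
    patch[ch][r][c] * filt[ch][r][c]
    for ch in range(3)
    for r in range(3)
    for c in range(3)
]

output = sum(products)

print(len(products))  # number of multiplications
print(output)         # the single output value for this position
```

The three channels collapse into one number per position, which is why a single filter produces a two-dimensional output.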





So once we use that filter, we go back to a two-dimensional output rather than these three dimensions。

Now, something that you may have noticed as we went through this idea of working with convolutions is that when we work with these centered values, and we're trying to output centered values, the edges of our image and the corners of our image tend to get somewhat overlooked。

So in the next video, we're going to address this problem and introduce the concept of padding。

All right, I'll see you there。

079:填充和步长.zh_en -BV1eu4m1F7oz_p79-
So before we get into this idea of padding, I do want to discuss a bit about the grid size of our kernels。
The grid size is just a way of specifying the number of pixels that a kernel sees at once。
Typically we're going to want to use odd numbers so that there's some center pixel, but that's not necessary when you move your kernel across your image。
It's just a bit easier to compute and is best practice。Also, the kernel does not need to be square。
Again, a square kernel is what's typically used, but you do have the option of using non-square kernels as well。
So this is our square kernel that we have here, with a height and width of three。
And then if we think about kernels that aren't square, we have here height 1 and width 3, and we can move this along our image as well, as well as taking a height of three and a width of one and moving this along our image。

Now, I discussed before that as we move a kernel across our image, it's possible that we do not put as much weight on the edges。
That's this edge effect: if we apply kernels directly to our images, the corners and the edges will not have as much play in allowing us to identify what that object is。
The reason is that pixels near the edge will not be used as center pixels, since there are not enough surrounding pixels。
So if you have something like a 3 by 3 kernel and you try to center it on that top left corner, you won't be able to, because centering that 3 by 3 there means the top row and the left column of that kernel will not be overlaid on any values within that image。

So the idea is to pad, and padding adds extra pixels around the frame of your image, so that pixels along the edge of the original image become center pixels as the kernel moves across that image。
Those added pixels are typically going to be zero valued, so we just call it zero padding。

So to think about this example, first I want you to look at our original input, which is going to be our original image。
Then I want you to look at the shape of our kernel, and then the shape of our output。
As we move this kernel along the image, to the right and then down, we won't be able to center it on every single value。
So the output will actually be smaller than our original input。
We also won't be able to center around that one in the top left corner, or the two right next to it, or the one below it。Anything along those edges, we can't center our kernel on。

Now, with padding, we add on these zeros around each one of the edges。
And we're able to actually center now on that top left corner。
on that one and get the output value that we see to the right。
Another thing that I want you to notice is that we're still using that 3 by three kernel。
But now with padding, because we now have a larger input, if we take into account the padding。
our output will be larger as well and closer to the size of that original image。
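Here is a minimal sketch of zero padding in plain Python. The helper names, the 4 by 4 image and the kernel weights are all hypothetical; the slide's actual numbers are not in the transcript. It shows the output growing from 2 by 2 to 4 by 4 once one ring of zeros is added.

```python
def zero_pad(image, pad):
    # Surround the image with `pad` rows and columns of zeros on every side.
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded

def conv2d_valid(image, kernel):
    # Stride 1, kernel kept entirely inside whatever image it is given.
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(
                image[top + r][left + c] * kernel[r][c]
                for r in range(k_h)
                for c in range(k_w)
            )
            for left in range(len(image[0]) - k_w + 1)
        ]
        for top in range(len(image) - k_h + 1)
    ]

image = [            # hypothetical 4x4 input
    [1, 2, 0, 1],
    [1, 0, 2, 1],
    [0, 1, 1, 0],
    [2, 1, 0, 1],
]
kernel = [           # hypothetical 3x3 kernel
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

unpadded_output = conv2d_valid(image, kernel)
padded_output = conv2d_valid(zero_pad(image, 1), kernel)

print(len(unpadded_output), len(unpadded_output[0]))  # output size without padding
print(len(padded_output), len(padded_output[0]))      # output size with padding
print(padded_output[0][0])  # kernel centered on the original top-left pixel
```

With a 3 by 3 kernel, one ring of zeros is exactly enough for the output to match the size of the original image.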

Another thing that we can tune when creating our convolutional neural nets is the stride, or the step size as the kernel moves across the image。
So we said that it'll keep moving across the image。Normally, if you leave it at its default, it will just move over one at a time。
So along that image, that square will just move over one to the right, then another one to the right, until it gets to the end。
Then it will start back all the way at the left, just one cell down, and move along to the right again。That's going to be your step size。
And you can even set that to be different for vertical and horizontal steps。
But again, usually you're going to use the same value, and that will be the default and what you'll see throughout。
And when that stride is greater than one, think about the output that we would get as we do these convolution operations。
If we move over 2 at a time, rather than using a stride of one, our output will have to be smaller, because we're doing fewer convolution operations across our image。
So that brings down the size of that output。

So here we have an example of stride equal to two, so rather than just moving over one, that kernel moves over two spots。
And the next output would be 3。
And we see our output is even smaller than we originally had。

And then our vertical step is also going to be moving down two。So once it got to the end of the image, it moved down two。
We now have the new part of our image that we're going to run the kernel over, and we get our next output, which is just going to be zero。
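The effect of the stride can be sketched as follows. The `conv2d` helper, the 5 by 5 image and the kernel are hypothetical and do not reproduce the slide's numbers. The point is only that a stride of two visits every other position, so its output is a subsample of the stride-one output.

```python
def conv2d(image, kernel, stride=1):
    # No padding: the kernel must fit fully inside the image at each step.
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        top = i * stride          # move down `stride` cells per output row
        row = []
        for j in range(out_w):
            left = j * stride     # move right `stride` cells per output column
            row.append(sum(
                image[top + r][left + c] * kernel[r][c]
                for r in range(k_h)
                for c in range(k_w)
            ))
        output.append(row)
    return output

image = [                      # hypothetical 5x5 input
    [1, 2, 0, 1, 3],
    [0, 1, 2, 1, 0],
    [2, 1, 0, 0, 1],
    [1, 0, 1, 2, 2],
    [3, 1, 0, 1, 0],
]
kernel = [                     # hypothetical 3x3 kernel
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

dense = conv2d(image, kernel, stride=1)
strided = conv2d(image, kernel, stride=2)

print(len(dense), len(dense[0]))      # 3 3
print(len(strided), len(strided[0]))  # 2 2
```

Going from a stride of one to a stride of two shrinks this output from nine values to four.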

Now, we can combine this with padding as well and still have this stride equal to 2。
This will be our first convolution operation, ending up with negative 2。
We then move over 2 to the right, and we have our next operation, which will output two。
And then we can do the same thing, moving down two。
And you see, again, that our output will be larger, well, not much larger, just a bit larger, as we add on that extra padding。
Now that closes out our discussion of padding。
In the next video, I want to introduce to you the idea of adding on depth, so that you can actually pass through multiple kernels at each one of your different layers。

All right, I'll see you there。

080:深度和池化.zh_en -BV1eu4m1F7oz_p80-
So as mentioned in prior videos, in images, we often have multiple numbers associated with each pixel location。
Thinking about that pixel location in two dimensions。

And these different numbers at that same location are generally going to be referred to as the numbers for the different channels。
Examples of this include RGB, which is just the red, green and blue channels that make up an image on your computer screen, and we saw this a bit earlier。
And then, a little less commonly, you have CMYK, or cyan, magenta, yellow and black, for printing images rather than just displaying them on a screen。

Now, the number of channels that you have within your image is referred to as the depth of that input image。

And the filter itself will have a depth the same size as the number of input channels。
So if you're working with RGB, your filter will be as deep as there are channels, so it would have a depth of three。

So an example of this is if you're working with a 5 by 5 kernel on an RGB image。
Then how many weights will that kernel have?It'll have five by five by three, because there'll be a 5 by 5 grid for each channel, equaling 75 weights。

Now, the output from the layer will also have a depth。The way that works is the network typically trains many different kernels。
Again, each kernel will go over the entire image。And even though we're working with three dimensions here with our kernel, for example five by five by three, that's still going to output a single number。
So each kernel outputs a single number at each pixel location。

But you can have many kernels, so if you have 10 kernels in a layer, the output of that layer will have a depth equal to 10。
And that's because we don't want to be confined to only working with a single kernel that can only detect a certain pattern。
10 kernels allow us to detect 10 different patterns。
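The weight counts above are easy to verify with a short sketch. The helper name is made up, and the one bias per filter is an assumption that reflects how convolutional layers are commonly set up; the lecture only counts the 75 weights.

```python
def conv_layer_params(kernel_h, kernel_w, in_channels, n_filters, use_bias=True):
    # Each filter spans the full depth of its input.
    weights_per_filter = kernel_h * kernel_w * in_channels
    # Assumption: one bias term per filter, as is typical in practice.
    bias_per_filter = 1 if use_bias else 0
    return n_filters * (weights_per_filter + bias_per_filter)

weights_one_filter = conv_layer_params(5, 5, 3, 1, use_bias=False)
weights_ten_filters = conv_layer_params(5, 5, 3, 10, use_bias=False)
params_ten_filters = conv_layer_params(5, 5, 3, 10, use_bias=True)

print(weights_one_filter)   # 5 * 5 * 3
print(weights_ten_filters)  # ten such filters
print(params_ten_filters)   # ten filters plus ten biases
```

Note that none of these counts depend on the height or width of the image, only on the kernel size, the input depth and the number of filters.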

So how is that going to work?If you look all the way to the top left, we have our original image。
In that original image, we're starting off with a 32 by 32 by 3 image, and that's going to be the data from that original image。
We have the three there, and that relates to the red, green and blue channels, and each individual channel will be 32 by 32。
Then in the next layer, we see that we have a 32 by 32 by 10 layer, and that means that our depth is equal to 10。
So how do we get that depth equal to 10?Each one of those 32 by 32 slices will represent the output of a single kernel。
So we see that we have that kernel that's five by five by three。
If we take one section and run that convolution operation, then we get that single data point that's one by one by one。
And we can do that by moving that 5 by 5 by 3 kernel along the entire image to get the next output, and the next, to get that pink slice that you see within that three-dimensional cube in that second layer。
So that's how we get one single slice out of the 10。And since there are 10 different filters, if we look down to the image in the bottom row, we have another filter in green。
That green filter moves along our image and produces another one of those 10 slices, and each one of our different filters will produce a different slice。
Now I do want to note that if you are using a 5 by 5 by 3 filter and that's moving along your 32 by 32 image, then you need some extra padding so that your next layer will still be 32 by 32。
You'd also have to take the minimal stride of one to go from a 32 by 32 layer to another layer that's also 32 by 32。
But the idea is that the number of filters you have will be the depth of the next layer。
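The note about padding and stride can be checked with the standard output-size formula, floor((n + 2p - k) / s) + 1. The sketch below assumes the same padding p on every side; the function name is illustrative.

```python
def conv_output_size(n, kernel, stride=1, pad=0):
    # n: input height or width, pad: zeros added on each side.
    return (n + 2 * pad - kernel) // stride + 1

no_padding = conv_output_size(32, kernel=5, stride=1, pad=0)
with_padding = conv_output_size(32, kernel=5, stride=1, pad=2)
with_stride_two = conv_output_size(32, kernel=5, stride=2, pad=2)

n_filters = 10
output_shape = (with_padding, with_padding, n_filters)

print(no_padding)       # a 5x5 kernel alone shrinks 32 down
print(with_padding)     # two rings of zeros keep it at 32
print(with_stride_two)  # a larger stride shrinks it again
print(output_shape)     # height, width, depth of the next layer
```

So for a 5 by 5 kernel, keeping a 32 by 32 output takes two zeros of padding on each side together with a stride of one.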

So now I want to introduce another concept that's important in convolutional neural nets, and that's the idea of pooling。
Pooling will reduce the image size by mapping a patch of pixels to a single value。
So that will shrink the dimensions of the image, and it's not going to need any parameters。
There are different types of pooling operations, but every single one of them will be something like a max or an average, where you take whatever values are in the patch and output the maximum value or the average value, whatever it is。

So speaking of the different types of pooling, we have max pooling。
With max pooling, for each one of our distinct patches, the output will be the maximum for that patch。

So in the example here we're using two by two max pooling, and we take our original image that's four by four。
And we split it up into each one of these two by two squares。
And we take the max value within each one of those squares to reduce the size of that data to the 8, 1, 5, 4 that we see on the right。

And then average pooling is self-explanatory from the name。
We take each distinct patch and we get the average。
and we can see again that we proceed just as we did before。
but rather than taking that max value, we take the average value。
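Both pooling types can be sketched with one small function. The 4 by 4 image below is hypothetical; its values were chosen only so that 2 by 2 max pooling gives the 8, 1, 5, 4 mentioned above, since the slide's actual input is not in the transcript.

```python
def pool2d(image, size, reducer):
    # Split the image into non-overlapping size x size patches
    # and reduce each patch to a single value.
    output = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            patch = [
                image[top + r][left + c]
                for r in range(size)
                for c in range(size)
            ]
            row.append(reducer(patch))
        output.append(row)
    return output

def mean(values):
    return sum(values) / len(values)

image = [           # hypothetical 4x4 input
    [2, 8, 1, 0],
    [3, 4, 1, 1],
    [5, 0, 4, 2],
    [1, 2, 3, 0],
]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, mean)

print(max_pooled)
print(avg_pooled)
```

Neither version has any weights to learn; the only choices are the patch size and the reducing function.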


I would say taking the max value is generally the much more common practice when you actually pool your data。

So just to recap。In this section, we gave you an idea of what convolutional neural networks are。
what that convolution operation is, and how we can use things such as filters and kernels to come up with the next layers within our network。
We discussed the original motivation, and why we would want a certain type of framework when we are working with image data specifically, and how we can even use RGB, using an image with three dimensions。
where one of those dimensions is the number of channels, in order to come up with our next layer。
And with that, we discussed things such as the grid size and how we'd use that grid size to move along our image。
adding on padding so that we wouldn't lose information along those edges。
the idea of pooling to reduce the number of dimensions。
whether that's max pooling or average pooling, as well as this idea of depth, where each one of your different filters will add to the depth of the next layer。
Now that closes out our discussion here on convolutional neural nets。And in the next video。
we are going to have a notebook where we'll actually see convolutional neural nets in practice。
All right, I'll see you there。


081:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p81 42_CNN示例笔记本(选修部分)第1部分.zh_en -BV1eu4m1F7oz_p81-
Welcome to our notebook here on convolutional neural nets。
Here we're going to be using Python to build out a convolutional neural net in order to classify images using this famous CIFAR-10 data set。
This CIFAR-10 data set is 60,000 different images, each 32 by 32 pixels, and they're color images, so they will also have a certain amount of depth, if you recall from lecture。

And each one of these different images will be one of 10 classes, either airplane, automobile, bird。
and so on。Now, in order to build out our convolutional nets, we're going to have to introduce new parts of Keras, new functions and new layers that we hadn't used in our prior notebook。
So we introduced the Sequential model, and we can still use that。At some point we will have to use that Dense connected layer, similar to what we did with that fully connected network。
We can also use Dropout in order to regularize and ensure that it doesn't overfit。
We'll have our different activation layers that we can use as well, whether that's relu or sigmoid or hyperbolic tangent, whatever it is。
And then we also have this Flatten layer。This Flatten layer will be important as we move from our convolutional layers to our dense layers。
Eventually, in order to make some type of prediction, we're going to need to flatten it out and then have that dense layer connected to that final prediction。
And then we're also going to import this Conv2D and this MaxPooling2D, which will allow us to build out the convolutional layers as well as the pooling layers that we introduced in lecture。
Now to get our data, we imported earlier this cifar10 from the Keras datasets, so this data set is actually available within the Keras library。
We call load_data, and that will give us two tuples, our training set as well as our test set, with the X and Y values for each。
Now we're going to print out the shape of our training set, as well as the number of samples of both our training and test set separately。

There are going to be 50,000 train samples and 10,000 test samples。
So we'll be training on those 50,000 train samples, and then we can ultimately test on that holdout set of 10,000。
But what I want to look at is the shape of x_train。
If you think back to what we've been working with so far, that was not image data。Here you see that we have four dimensions。
The first dimension is going to be the number of rows, or the number of samples that we're working with, and that's 50,000。

And then the next one is going to be the height and width in terms of the number of pixels of our images as well as our depth。
And that's why we have the 32 by 32, and then by three for the red, green and blue different layers。
And we can look for each individual image that we have this 32 by 32 by 3 shape。

And that's just going to be a bunch of numbers from zero to 255 for each one of these different colors。
red, green and blue。And we can see that these actually represent actual images。
So we see here that this is of class 9, if we look at y_train at index 444。
and then we can look at that same index in x_train, which is the actual image itself without the label。

And we see that it's actually going to be a real image。Now, it's not a high definition image。
At 32 by 32 pixels, it will not be a high definition image。
But we can tell here that we have a truck, and hopefully our convolutional neural net will be able to pick up the tires and the large back and whatever other features there are that make up the truck。
Now, if we look at our y_train originally, we see that it's just a bunch of numbers。

Each one representing a different category。And as we discussed in lecture, we often need to take something that's categorical, with maybe many different categories, and turn it into a set of indicator variables using one-hot encoding。
So Keras has functionality built in to change this output into that one-hot encoded version of the output。
In order to do so, we just call keras.utils.to_categorical, and we pass the set that we want to change to categorical and the number of classes in that set, which is going to be equal to 10。
We run that。And now if we look at that y_train value that we had above, which we saw was nine。

The new value is going to be that one-hot encoded version, with zeros everywhere except for in the nine spot。
Then we're going to want to make sure all of our values are floats and scaled down to between zero and one。
So recall that all of our different pixels will be values between 0 and 255。So if we divide by 255, we ensure that all of our values are going to be between zero and 1。
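To show what those two preprocessing steps do to the numbers, here is a plain-Python sketch that does not call Keras. The helpers `one_hot` and `scale_pixels` are stand-ins written for this illustration; in the notebook the encoding is done by `keras.utils.to_categorical`, and the pixel values below are made up.

```python
def one_hot(label, num_classes):
    # All zeros except a single 1.0 at the position of the class label.
    encoded = [0.0] * num_classes
    encoded[label] = 1.0
    return encoded

def scale_pixels(pixels):
    # Convert 0..255 integer intensities to floats in the range 0..1.
    return [float(p) / 255 for p in pixels]

label = 9                      # the truck class seen at index 444
encoded = one_hot(label, 10)

raw_pixels = [0, 128, 255]     # hypothetical intensities from one pixel's RGB values
scaled = scale_pixels(raw_pixels)

print(encoded)
print(scaled)
```

The class label 9 becomes a vector of length 10 with its last entry set to one, and every pixel ends up between zero and one.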

Now, when we use these convolutional neural nets, when we create these convolutional layers, just the same as we did with Dense, we call Conv2D。
And we want to ensure that we understand the different parameters that we can pass through, so that we can specify exactly what kind of convolutional layer we want to use。

So thinking back to lecture, some of the important parameters that you should know。
First there's filters, and that's going to be the number of filters used per location, so in other words, that's going to be the depth of your output。
Or the number of kernels used。If you think about that depth, recall in lecture we had at one point a depth of 10, and that was because we had 10 different kernels, which is what we get if we set filters equal to 10。

Then we have our kernel_size, which will be a tuple giving the height and width of the kernel used, and you can specify that height and width to be different。
If you just pass through one number, it will assume a square, and I would say stick with squares to start。
You can try playing around with other values, but squares are going to be best practice for the majority of your starter material。
Then we have the strides, so that's going to be how you move those kernels along your image。
Whether you want to move one at a time or two at a time, going left to right as well as up and down。
In Keras the first value is the stride along the height, going up and down, and the next one is the stride along the width, going left to right。
And then you're going to want your input shape。
which, recall, we also passed through when we had our dense neural network, and there it was just one value。
Here, if you recall what we pulled out in terms of the shape of a single image, that's actually going to be three dimensions。
And we want to ensure that this is what we specify in that first layer。
And one more thing that I want to point out that's not here is the padding。
When we set padding equal to valid, that means that we are not using any padding, and, if we're moving from left to right, the kernel will stop as soon as its rightmost part hits the edge as we move along with those strides。
So if we imagine that we have a six by six image and our kernel is 5 by 5, then it'll only move to the right once, and then stop。
And then if we set padding equal to same, and we'll see this later on, then that will pad on some extra zeros。
Generally speaking it'll be just one set of zeros all the way around, but if the amount of padding needed doesn't split evenly, there might be two on one side, and it'll be padding with zeros。
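The two padding modes can be compared with a little arithmetic. The formulas below follow my understanding of the convention TensorFlow uses for `valid` and `same`, including putting any odd leftover zero at the end; treat them as an assumption to verify against the library documentation rather than as the library's own code. The function names are made up.

```python
import math

def conv_output_size(n, kernel, stride, padding):
    # Output length along one dimension for the two Keras padding modes.
    if padding == "same":
        return math.ceil(n / stride)
    if padding == "valid":
        return math.ceil((n - kernel + 1) / stride)
    raise ValueError("padding must be 'valid' or 'same'")

def same_padding(n, kernel, stride):
    # Zeros added (before, after) along one dimension under 'same'.
    out = math.ceil(n / stride)
    total = max((out - 1) * stride + kernel - n, 0)
    before = total // 2
    return before, total - before

valid_6_by_5 = conv_output_size(6, kernel=5, stride=1, padding="valid")
same_6_by_5 = conv_output_size(6, kernel=5, stride=1, padding="same")

print(valid_6_by_5)              # start position plus one move to the right
print(same_6_by_5)               # output keeps the input size at stride 1
print(same_padding(6, 3, 1))     # 3x3 kernel: one zero on each side
print(same_padding(32, 5, 2))    # uneven split: one zero before, two after
```

The 6 by 6 image with a 5 by 5 kernel gives two positions per row under `valid`, which is the "move to the right once, then stop" behavior described above.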

And then again, we have this Flatten layer, and that turns whatever input it has into a one-dimensional vector。
And that will allow us once we do that to transition between the convolutional layers and those fully connected layers。

So we have here the initialization of our model using the sequential function。
So we're going to be using that sequential API again in order to build out our model。
We then add on our first Conv2D layer。And this is in the ordering that we saw above。
and if we press Shift+Tab, and press it a couple more times。
we can see that we're setting the number of filters, so the depth is also going to be equal to 32。

Our kernel size is going to be 5 by 5。We're not going to use the default for strides。
we're going to set that to two by two。We're also going to add on padding。
so there's actually going to be padding on this layer。Rather than the default of no padding。

And then we specify the input shape, and if you recall, the shape of our x train is going to be the number of different samples we have, then the actual shape of each sample。
and if we take that shape from index one onward, then we're just specifying the shape of a single object。



So we added on that convolutional layer, we then have our activation。
which is going to be ReLU to ensure that we have that nonlinearity。
We then add on another convolutional layer, again setting the number of filters equal to 32。
We are again going 5 by 5 in regards to our kernel, and recall that if we do 5 by 5。
then as we move along our image, that will continuously reduce the size of each one of the layers。
especially since our strides are 2 by 2。We then add on another activation layer。
We can then do our max pooling, which will just take the max of a certain grid, and we're setting that pool size equal to 2 by 2。
So we're going to reduce very quickly the size of our layer。
We're then also going to introduce some dropout to add a bit of regularization。

We then flatten that out。So that now we're working with just a one dimensional object rather than that three dimensional object that we were working with in the earlier layers。
We can then add on a dense layer so that we have one fully connected layer, and again call ReLU。
so we have a nonlinearity, and again call dropout for some extra regularization。
And then we're going to add on that final dense layer。
So that our output is equal to the number of classes we have。
because ultimately if you think about our neural network it needs to specify。
it needs to predict one of these 10 classes。And then we set our activation equal to softmax, as we do when we're trying to predict amongst multiple categories。
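As a rough illustration of what that final softmax activation does, the sketch below turns ten raw scores into probabilities with NumPy. The scores are made-up numbers, not output from the notebook's model.

```python
import numpy as np


def softmax(logits):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# Hypothetical raw scores for 10 classes coming out of the final dense layer.
scores = np.array([1.0, 0.5, 3.0, 0.0, -1.0, 2.0, 0.1, 0.3, -0.5, 1.5])
probs = softmax(scores)

print(probs.round(3))
print(int(probs.argmax()))  # index of the most likely class
```

The largest score always gets the largest probability, so taking the arg max of the softmax output gives the predicted class.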


And we can call model 1 dot summary。

And we can see here the number of parameters at each step。
And we can see that our output shape is going to reduce at each one of these steps。
So it reduces at first to 16 by 16 by 32。 We kept the depth at 32 at both steps。
Then we have 6 by 6 by 32。 We called max pooling, and that reduced it to 3 by 3 by 32。
And then we had our dense layer, and you see there's a ton of parameters there, and recall that these dense layers are going to have a lot more parameters and a lot more variance。
Than we would have with our convolutional layers。 We have one more dense layer。
and then our final activation。

We can then specify our batch size。As well as the optimizer we're going to use。
So we're specifying we want the RMSprop optimizer with this learning rate。
and we can specify the decay, if you recall from RMSprop。And then with that。
we compile using categorical cross entropy rather than our binary cross entropy。
We can specify the optimizer。And the metrics we want to track。
And then actually fit our model with the batch size, specifying the number of epochs。
We have our validation data to see how it does on the holdout set。And we can set shuffle equal to True。
And that just means that, as we optimize, we want to shuffle our data throughout。
So I'm going to let this run and we'll come back once it's done running。 I want to warn you。
this may take some time。
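To make the categorical cross entropy loss mentioned above a bit more concrete, here is a small NumPy sketch. It is only an illustration with made-up labels and probabilities, not the Keras implementation.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross entropy between one hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid taking log(0)
    per_example = -np.sum(y_true * np.log(y_pred), axis=1)
    return float(per_example.mean())


# Two examples, three classes; the true classes are 2 and 0.
y_true = np.array([[0, 0, 1], [1, 0, 0]])
confident = np.array([[0.05, 0.05, 0.90], [0.80, 0.10, 0.10]])
unsure = np.array([[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]])

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Only the probability assigned to the true class contributes to each example's loss, so predictions that are confident and correct give a lower loss than hesitant ones.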

So that should have taken maybe five minutes, maybe a bit longer in order to run。
and we can actually see the timing for each epoch as we ran through。

And we have here the 15 different epochs。And we can see that the loss on the training set continuously went down for each one in 15 epochs。
And we didn't save the result as we did before, when we wanted to access that history key。
that history dictionary, but we can still see what happened at each step。
and we can see that we're tracking that validation loss, and that goes down for the first number of epochs, and then around here we see that it starts to fluctuate, where it goes down, back up, and down again on that validation set, ranging from 1.049 to 1.08。
and then we can see that the accuracy, rather than continuing to go up, starts to fluctuate as well。
On that validation set, but it continues to increase。For that training set。

Now, if we wanted to do any type of prediction, we can do the same thing that we did in our last notebook by taking that model。

And if we want the probabilities, we can just call dot predict。

And we can call that on our X test。And we run that。
and that'll give the probabilities for each one of the different classes。
If we want to predict the specific class, we can just call dot predict classes。

As we did before, and we see here that there is a predicted class for each example。Now, if we recall。
if we wanted to test our accuracy, let's say, or any other metric that requires that actual prediction。
If you recall, our Y test had been converted to this one hot encoded version of Y test。
So we'll have to take the inverse, take it back to what it was originally。 And in order to do that。
we can just pull in numpy。And call np.argmax。And that'll just say。
where's the maximum argument, and you have to specify across axis 1。And when we call that。
we can get each one of the actual values, and we can see from what we have here。
it probably predicted correctly for 388 and 507, the accuracy score should be what we have here。
but we can actually test this, we can import our accuracy score from sklearn metrics。
And then we can take that accuracy score of what we have here。
these are the actual values。And then our prediction that we have here。

And we have that same value, 0.6176, since that test set was used as the validation set。
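The step of undoing the one hot encoding with arg max and then scoring accuracy can be sketched as follows. The labels and probabilities are small made-up arrays, and accuracy is computed directly with NumPy; the accuracy score from sklearn metrics reports the same fraction of matching labels.

```python
import numpy as np

# Hypothetical one hot labels for five test examples and three classes.
y_test_one_hot = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# Hypothetical class probabilities, shaped like the output of dot predict.
probabilities = np.array([
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.6, 0.1],
])

# Arg max across axis 1 recovers one class index per row.
y_true = np.argmax(y_test_one_hot, axis=1)
y_pred = np.argmax(probabilities, axis=1)

accuracy = float(np.mean(y_true == y_pred))
print(y_true, y_pred, accuracy)
```

Three of the five made-up predictions match their labels, so the accuracy here is 0.6.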
So that closes out that first exercise, and in the next exercise we will walk through building out a different convolutional neural net and see if we can make any improvements on our current model。All right。
I'll see you there。


082:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p82 43_CNN示例笔记本(选修部分)第2部分.zh_en -BV1eu4m1F7oz_p82-
Welcome back for our next exercise。Now, in our previous model, we had the structure that we have here。
a convolutional layer, another convolutional layer, then a max pool to bring down the size。
flattening it out, then that dense connection and then that final classification, with the activation functions and the dropouts that we had specified。
Now we want to try building a more complicated model and it's going to have the following structure。
Convolutional layer, convolution layer, max pool, and then two more convolution layers。
So we're adding on an extra two convolution layers, another max pool。
and then that flatten, that dense connection, and our final classification。
We're also going to use strides of one for each one of our convolutional layers。
so rather than moving that kernel along two to the right and then two down, as we did before。
we're only going to move it across and down by one each time。
We're then going to see how many parameters our new model has compared to our old model。
and then we're going to train it only for five epochs。
It will be more complicated so it'll take some more time。
And then we can look at the loss and accuracy numbers for both the training and validation sets。
On your own, you can go ahead and try different structures and run times and see how accurate you can get your model to be。
So we're going to run this with this newly specified framework。 So we're going to, again。
Have 32 different filters。Here our grid is going to be 3 by 3, whereas above, if you recall, we had 5 by 5。
So that will also move across a bit quicker。


And before we had the strides equal to two by two, and now they're going to be at their default of one by one。
and then we're having padding on this layer, which will also add on some extra。
Work that we'll have to do。 It will have to go through more convolutional operations as we have that padding。
Then we have our ReLU activation, and again, we keep the defaults。
we have another convolutional layer, this time without padding, and we have。Another activation of ReLU。
some max pooling。And then we have another convolutional layer, this time with 64 different filters。
And we're going to do that with padding。 And then again without padding。
again using a three by three grid。And then we'll flatten, have our dense layer。
and then our final dense layer to predict the classes as well as the activation of softmax。

So we run this to set up our new framework, and then when we look down at the number of total parameters that we have to learn, we're up to 1.25 million total parameters that we have to train。


And if you recall, before we only had around 181,000 to train。
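The two parameter totals can be checked by hand. The sketch below counts weights and biases layer by layer under assumptions the lecture does not state in this section: 32 by 32 inputs with 3 channels, a 512 unit hidden dense layer, and 10 output classes. Those assumed values happen to reproduce the reported figures, but treat them as a guess at the notebook's settings.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


def out_size(size, kernel, stride, padding):
    """Spatial size after a convolution with 'same' or 'valid' padding."""
    if padding == "same":
        return -(-size // stride)  # ceiling division
    return (size - kernel) // stride + 1


# Assumed settings (not given in the lecture text).
SIZE, CHANNELS, HIDDEN, CLASSES = 32, 3, 512, 10

# First model: conv 5x5 stride 2 (same), conv 5x5 stride 2 (valid), 2x2 max pool.
s = out_size(SIZE, 5, 2, "same")   # 16
s = out_size(s, 5, 2, "valid")     # 6
s = s // 2                         # 3
model_1 = (conv_params(5, CHANNELS, 32) + conv_params(5, 32, 32)
           + dense_params(s * s * 32, HIDDEN) + dense_params(HIDDEN, CLASSES))

# Second model: two blocks of conv 3x3 (same), conv 3x3 (valid), 2x2 max pool,
# with 32 filters in the first block and 64 in the second, all with stride 1.
s = out_size(SIZE, 3, 1, "same")   # 32
s = out_size(s, 3, 1, "valid")     # 30
s = s // 2                         # 15
s = out_size(s, 3, 1, "same")      # 15
s = out_size(s, 3, 1, "valid")     # 13
s = s // 2                         # 6
model_2 = (conv_params(3, CHANNELS, 32) + conv_params(3, 32, 32)
           + conv_params(3, 32, 64) + conv_params(3, 64, 64)
           + dense_params(s * s * 64, HIDDEN) + dense_params(HIDDEN, CLASSES))

print(model_1)  # 181162
print(model_2)  # 1250858
```

Most of the growth comes from the first dense layer: with strides of one, the flattened input to it is 6 by 6 by 64 instead of 3 by 3 by 32.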


So if we think about the timing that this will take and we'll start to run it here。

We're going to have probably something that's going to take a lot longer at each one of the epochs。
So we see that ETA, it's going down pretty quickly, but still, at each one of the epochs。
It's around this three minute mark。That's showing 3 minutes, 20 seconds。
And it's going to take some time at each epoch, compared to。
What we had before, which was going through each epoch in。Around 27 seconds。

Now, I'm going to pause the video here and come back when this is done running。
and this will take some time to run, even longer than it did before。
But it's something that we want to make sure that you take into account as you start to build out your deep neural networks, understanding that as you have a more complex structure。
you'll probably need a stronger machine or some way of parallelizing across multiple machines as you build these out。

All right, I'll see you in just a bit。So hopefully you're able to run that on your own。
And as we see here, it took quite a bit of time to run that。
We see it's a bit under three minutes for each one of the different epochs, so for five epochs。
we're getting close to 15 minutes to run through and fit the model。
But what we also see is, if we look at the accuracy, and specifically the validation accuracy for that holdout set。
after the fourth epoch, we got to a higher accuracy than we ever got before with the other architecture。
So we see this more complex framework was able to better fit to our actual data set。
Now we can play around with different frameworks, adding on extra convolutional layers。
removing convolutional layers, changing the stride, and so on。 But as we saw here。
it could take some time。And because of that flexibility, there are actually some architectures。
some frameworks, that are best practices or most common practices used throughout, which we'll discuss in just a bit。
but before that, in our next video, we will discuss how we can use something that we trained on a specific data set。
such as what we did here。And use that training to actually supplement。
A classification of images for a completely different data set。
And we'll see what that will mean in just a second when we discuss, in the next lecture, the idea of transfer learning。
All right, I'll see you there。


083:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p83 44_迁移学习简介.zh_en -BV1eu4m1F7oz_p83-
In this set of videos, we'll discuss transfer learning。
which allows us to leverage already trained networks to make predictions for new data sets。
So in this section, we're going to cover an overview of transfer learning。
starting off with the motivation behind transfer learning。
As well as understanding some guiding principles in regards to fine tuning our transfer learning models。

Now, generally speaking, the earlier layers within our neural network。
Are going to be the slowest to train。
And this is generally going to be in large part due to the way that our weights are being optimized。
So if we recall that vanishing gradient problem, we recall that because of back propagation。
By the time we get to the partial derivative in regards to our earlier layers。
It's very possible that we're not making any major updates to our weights。
But if we think about how our convolutional networks actually work。
And we think back to past lectures。Those earlier layers are meant to represent only the most primitive features。
such as say an edge。
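To see why the earliest layers can end up with tiny updates, the sketch below multiplies together one activation derivative per layer, which is roughly what back propagation does on the way to the first layer. It assumes sigmoid activations and weights of about size one purely for illustration; the notebooks in this course use ReLU.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


# The sigmoid derivative is largest at z = 0, where it equals 0.25.
best_case = sigmoid_derivative(0.0)

# Even in that best case, the product over many layers shrinks quickly.
for depth in (1, 5, 10, 20):
    print(depth, best_case ** depth)
```

After ten such factors the gradient signal is already below one millionth of its original size, which is why the earliest layers update so slowly.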

Now, our later layers, on the other hand, are going to be capturing those features that are particular to these specific images in our data set。
And those features。In the later layers were built off of those earlier primitive layers that we just discussed。
And these later layers will also be much easier and quicker to train。
as they don't suffer from the problems just mentioned for those earlier layers in regards to how fast they can train, and they will have a more immediate impact on that final result。
So just to motivate this a bit further。Any one of our famous competition winning models is going to be incredibly difficult to train from scratch。
This is due to the fact that they're going to be trained on huge data sets。
Huge data sets will obviously take a much longer time to train on。
They're going to have to go through a very large number of iterations to get to that optimal answer。
And we saw it just in that last notebook。 How long it takes for a very simple model。
To learn the optimal weights, and that was, of course, though, on our own personal machines。
And with that in mind, when we build these award winning models。
we also will need some very heavy computing power to learn these patterns in a reasonable amount of time。
And that's all assuming that you got your framework right on that first time。
We'll also have to spend time experimenting to get those hyperparameters right, such as the number of layers。
What kind of strides we want, when to flatten, et cetera。Now, what we often see, though。
is that the basic features such as those edges and simple shapes learned in earlier layers of the network will generalize fairly well to any similar problems。
And if you just want to store the results。It's just a matter of storing those learned weights。
not the actual lift that was needed to learn those weights in the first place。
So our new idea will be to save those early layers of a pretrained network。
And then just retrain the later layers for a specific application, for whatever our data set is。
And this concept。Is going to be what we call transfer learning: taking those earlier layers of a pretrained network and then just retraining those later layers。

Now let's walk through a visualization of this concept of transfer learning。So what we have here。
In the image in front of you。Represents our first trained convolutional neural network with a number of convolutional layers then that fully connected layer。
ultimately leading to that final softmax classifier, very similar to what we did in our last notebook。
The idea will be to remove that final output layer。
And then we can use what we learned so far, or even go back further。 For example。
back one of the fully connected layers, or even further, removing one of those convolutional layers and so on。
And then we can use that pretrained network and train only that last layer or last few layers。
using those learned earlier layers from the prior problem。In order to make a prediction on new data。
Now, this idea is going to be more of an art than a science in figuring out how long to train that last layer。
whether or not to go back further and retrain more layers, and so on。And in the next video。
we're going to discuss some of the options available to you。
as well as some basic guiding principles to keep in mind。

All right, I'll see you there。

084:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p84 45_迁移学习和微调.zh_en -BV1eu4m1F7oz_p84-
So let's start off with some different transfer learning options that are available to us。
So the additional training that we do of a pretrained network on a specific new data set。
so those extra steps on top of that pretrained network, is going to be referred to as the step of fine tuning。

As mentioned earlier, understanding exactly how to fine tune in regards to how much or how far back is going to require you to think through a lot of different options。
Should you just train the very last layer, Should you go back a few layers。

Or even retrain the entire network。Using that pretrained network to just initialize the weights for your new data。

And for your new framework。Now, while there are no hard and fast rules for fine tuning。
your transfer learning model。

There are going to be some guiding principles that you're going to want to keep in mind。First off。
the more similar your data and problem are to the source data of your pretrained network。

The less fine tuning you'll have to do。So for example。
if you're using a network that was pretrained on ImageNet to distinguish between dogs and cats。
You should need relatively little fine tuning, so you don't need to go as far back。
say in your model, and can use a lot of those pretrained weights。

And that's due to the fact that ImageNet was already used to distinguish between different breeds of dogs and different breeds of cats。
so the network has likely already learned all those features that you're going to need。

Also, the more data you have available about your specific problem。
The more the network will benefit from the longer and deeper fine tuning of your model。

So for example, if you only had 100 dogs and 100 cats in your new training set。
You probably want to do very little fine tuning。Maybe just remove that final layer or two, again。
for example, and use a lot of those pretrained features that were learned from, say, ImageNet。

On the other hand, if you have 100,000 dogs and 100,000 cats。
you may get more value from longer and deeper fine tuning。
going back further or even retraining the full network。
using that past network to initialize your weights。

Also, if your data is substantially different in nature。
than the data the source model was trained on。

Transfer learning may actually be of little value。 So an obvious example is a network that was trained on recognizing typed Latin alphabet characters。
it probably won't do a good job in regards to helping you distinguish between cats and dogs。
but it likely would be useful as a starting point for recognizing, say, Cyrillic alphabet characters。
as they are both some type of alphabet。




So to recap this idea of transfer learning, we had an overview of transfer learning。
as well as providing that motivation, understanding that it takes a while to learn those smaller pieces。
those lower level features such as an edge, so you may want to actually take pretrained networks。
if you want to do something like image classification and you don't have that large of a data set。
And with that in mind, we discussed a few guiding principles in regards to fine tuning that model。
you want similar data sets, and you want to ensure that if you have only a small data set。
you don't do too much fine tuning。 If you have a larger data set。
perhaps you'd benefit from doing even more fine tuning。
Now that closes out our lecture here on transfer learning。
And in the next video we'll show you how to actually conduct transfer learning using Python。
All right, I'll see you there。


085:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p85 46_迁移学习笔记本(选修部分).zh_en -BV1eu4m1F7oz_p85-
Welcome to our demo notebook here on transfer learning。 In this exercise。
we're going to be using the well known MNIST digit data set。
Which is just going to be a bunch of handwritten digits between 0 and 9。Along with their labels。
so if there's a five written down, then that's labeled as a five。
A 0 written down is labeled as a 0。And we're going to use this data set to illustrate the power and the concepts behind transfer learning。
So we're going to train a convolutional neural net on just the digits between 5 and 9。
And after that, we're going to train just the last layer of the network on the digits 0 through 4 and see how well the features learned from 5 through 9。
those earlier features before that final layer, are going to be able to help with classifying 0 through 4。
So we're going to import the necessary libraries, and you should be familiar with all of these from before。
the only thing that's new is that we're importing the MNIST data, which is available in tensorflow.keras.datasets。

And then we'll see how this is used later on, but we're also importing the backend from Keras。
and we're importing that as K。We're then going to pull in this now function。
And the reason for that is just that we want to get the actual timing。
and we can use a magic function within these Jupyter notebooks to get the timing as well。
And we've done that before。 But generally, that tries to compute a confidence interval。
And it has to do more than one loop through all the data。 And it may take some time。
So we're just going to lose the aspect of having a confidence interval。
but be able to do it a bit more quickly。

We're then going to set some of the parameters。So we're going to have the same batch size。
same number of classes, same number of epochs each time。 And those are 128, 5, and 5 respectively。
We're also going to set the pixel numbers for the number of rows and the number of columns in regards to the pixels of our image。
and that's going to be 28 by 28。We're going to set our filters。
which is going to be the depth of each one of our next layers using our convolutional neural net。
We're then also going to set the pool size and that'll just be the square so it'll be two by two in regards to our max pooling as well as the kernel size。
and that'll be three by three as we create those kernels as well。

Now we're bringing out that K that we mentioned earlier, which is just the back end。

And we're saying that for images, depending on your backend, when you pull in this data set。
it'll either have the number of channels of your image or that depth of your image first or last。
So if you think about this as RGB, then there would be a depth of three。
So if the depth was first or the channels were first。Then using RGB, it would be 3 by 28 by 28。
This is just the gray scale, so there's only a depth of one, so it's 1 by 28 by 28。
and that's the dimensions of your image。Now, if it's not channels first, but channels last。
then it'll be 28 by 28 by one。 And this is just to ensure that, no matter your backend。
that you're going to be producing the same results as we are here。


Now we're going to create a function。In order to actually run our model in the same way each time。
So we're going to set up the framework of a model first, before actually passing it through this function that we have here。
We'll have our train set, which will be both our X train and our Y train。
So it'll be a tuple, and then we'll have our test set。
which will be X test and Y test, so also a tuple, as well as the number of classes。
so those are going to be the parameters that we pass through this train model function。Now。
recall that。This train that we pass in is going to be both X train and Y train。
To define X train, we say we want the first value from that tuple。
So that's going to be X train and not Y train, and we're going to reshape that。
So that it has the same number of rows, so if you think about pulling out that X train and calling dot shape zero。
that's just going to give you how many examples you have。
Plus the input shape, and this is the input shape that we defined up here。Which will either be。
The channels, image rows, and image columns, or the image rows, image columns。
and then the channels。 And this just ensures that we have those in the right ordering。
So I'm going to actually run the next cell to make this a bit clearer。
So this is going to initialize all of our values。 And if we think about our X train。
This is going to be。Our first value in our tuple, and then if I call dot shape。That first value。
Is going to be the number of examples。 So we say we just want 60,000 plus that input shape。
so rather than being 60,000 by 28 by 28, it's going to be 60,000 by 28 by 28 by one。
Or 60,000 by one by 28 by 28, depending on which one of these is true。
We're then going to do the same for X test。We're then going to ensure that we're working only with float values。
and then we will make sure all those values are between 0 and 1。And that's, again。
just by dividing by 255, since our pixel values will all be between 0 and 255。
We'll then print out the shape so that we'll be able to confirm that the shapes are as we expected。
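The reshaping and scaling steps described above can be sketched with NumPy alone. This is an illustration on six random stand-in images rather than the 60,000 MNIST images, and the `data_format` string is hard-coded here as an assumption; in Keras it would come from `K.image_data_format()`.

```python
import numpy as np

img_rows, img_cols = 28, 28
data_format = "channels_last"  # assumed value for this sketch

# Put the single grayscale channel first or last, depending on the backend.
if data_format == "channels_first":
    input_shape = (1, img_rows, img_cols)
else:
    input_shape = (img_rows, img_cols, 1)

# Six random stand-in images with pixel values from 0 to 255.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(6, img_rows, img_cols), dtype=np.uint8)

# Number of examples plus the input shape, then floats scaled into [0, 1].
x_train = x_train.reshape((x_train.shape[0],) + input_shape)
x_train = x_train.astype("float32") / 255

print(x_train.shape)
print(x_train.min(), x_train.max())
```

Only the shape and scale change: each image keeps the same 784 pixel values, now laid out with an explicit channel dimension.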

And then we can see how many train samples we have and how many test samples we have。And then。
because we are doing。
Classification of values either 0 through 4 or 5 through 9。
we're actually going to have to create those categorical variables as we've done before。
create those different classes, doing something along the lines of one hot encoding。
so we do that for our train set as well as our test set。
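A minimal sketch of that one hot encoding step, using NumPy only: the helper name `to_one_hot` is made up for this example, and it behaves roughly like the `to_categorical` utility in Keras for integer labels.

```python
import numpy as np


def to_one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 per row."""
    return np.eye(num_classes, dtype="float32")[labels]


# Four hypothetical labels drawn from five classes (0 through 4).
y = np.array([0, 3, 4, 1])
encoded = to_one_hot(y, 5)
print(encoded)
```

Each row has a 1 in the column of its class and zeros elsewhere, which is the format categorical cross entropy expects.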
And then we call model dot compile on whatever model we pass in, and we compile it using this loss。
We're using a different optimizer。 I wouldn't worry too much about this being different than what you've seen before。
It's fairly similar to the math of RMSprop, and to that portion of Adam, if you recall, as well。
But I wouldn't worry too much about it。 We use this specifically so that it wouldn't train quite as fast as something like RMSprop or Adam actually would。
After you run this, you can try switching this for Adam or RMSprop and see that it actually gets to those optimal values much quicker。
We're then going to also track the metric of accuracy。 We call t equals now to get the timing。
We're then going to fit our model, and fitting our model is going to be what takes the most time, on our X train。
our Y train, that batch size that we specified earlier, and the number of epochs we specified earlier。
Verbose equals one, and that's the default。 If you set verbose equal to 0, then the steps throughout each epoch that we've seen every time we've run those deep neural nets。
Just would not show up。 So that would keep all those extra lines from showing up。
I generally find those useful, but if you see that they're taking up a lot of room on your screen。
feel free to set verbose equal to 0。 And then we have that validation set to see how we're doing on our holdout set throughout。
And then to figure out how long it took, we call now again and we subtract that t that we initialized here before fitting our model。
We can then call model dot evaluate on X test and Y test in order to get the scoring。
and that will give us both our error and our accuracy for that model that we ran。
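The timing pattern described above, calling now before and after the slow step and subtracting, can be sketched with the standard library. The `slow_step` function is a made-up stand-in for fitting the model.

```python
import datetime

# Keep a short handle on the function, as the lecture describes.
now = datetime.datetime.now


def slow_step():
    """Stand-in for model fitting: just burns a little CPU time."""
    return sum(i * i for i in range(200_000))


t = now()                # timestamp before the slow step
result = slow_step()
elapsed = now() - t      # a timedelta covering the slow step

print("Training time:", elapsed)
```

This runs the step exactly once, so it gives a single wall clock measurement rather than the repeated runs and spread that a notebook timing magic would report.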

So, here, we have initialized。

Our train model function。We're then going to get the data that we need。 So we're loading。
All of our data by calling mnist dot load data, which will give us the X train and Y train tuple as well as the X test and Y test tuple。
We're then going to separate out less than 5。And greater than or equal to 5。So that we have。
X train such that our outcome variables are all less than five。
Or our X train such that they're all greater than or equal to 5。
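Splitting the data by label can be sketched with boolean masks in NumPy. The arrays below are tiny stand-ins for the MNIST images and labels. The final shift of the 5 through 9 labels down to 0 through 4 is an assumption added here so that both groups fit five output classes; the lecture does not spell that step out.

```python
import numpy as np

# Stand-in data: eight "images" (one number each) and their digit labels.
x_train = np.arange(8).reshape(8, 1)
y_train = np.array([0, 7, 4, 5, 9, 2, 6, 3])

# Boolean mask that is True where the label is less than 5.
lt5 = y_train < 5

x_train_lt5, y_train_lt5 = x_train[lt5], y_train[lt5]
x_train_gte5, y_train_gte5 = x_train[~lt5], y_train[~lt5]

# Assumed step: shift labels 5..9 down to 0..4 before one hot encoding.
y_train_gte5_shifted = y_train_gte5 - 5

print(y_train_lt5, y_train_gte5, y_train_gte5_shifted)
```

The same mask indexes both the images and the labels, so each image stays paired with its own label in either group.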


We're then going to define the feature layers。 And these are going to be those earlier layers that we hope to transfer to our new problem。
And we're going to freeze these layers during the fine tuning process。
So these are going to be those layers that we freeze if you think back to what transfer learning is and how it works。

And we're going to set these all to a list。

So feature layers is equal to this list, and it's going to have this convolutional layer with the filters and kernels that we specified earlier。
that activation, then another convolutional layer。
with another activation, then max pooling, some dropout。
and then it's going to flatten our convolutional output into a one dimensional array。
And we put these into a list, and we'll see later on that the sequential model can actually take in a list of those layers。
so it has that add functionality。But also, if needed。
you can pass in all those layers as a list into your sequential function。
And then we're going to have the layers that we're actually going to be fine tuning。
and that's going to be this dense layer, another activation layer。

Some drop out and then another dense layer to get it down to the number of classes。
plus that soft max function。So I'll run each of these。
And then we're setting our model equal to sequential again, as mentioned。
we can just pass in that list of the different layers。

Now we have our model, we can look at the summary。

And we see here, in the summary。That we have each one of those layers that we specified before in regards to our different feature layers。
as well as those classification layers, leading all the way to the end with our softmax function。
And then we see that we have a total number of parameters of 600,165 that we're going to train。

Now that we have our model, we can call that function that we created earlier。Using that model。
Then our training set。Then our test set and then our number of classes, which is still equal to five。
So we're going to start running this, and this will take。Maybe two minutes。
something along those lines, So I'm going to pause here, and once it's done running。
we'll come back and look at these results and discuss these results。

So we see here that that took about three minutes to train, three minutes and 12 seconds here。
we can also look at the improvement in the accuracy step by step, going from 0.22 to 0.3, 0.38。
and so on in regards to our training set, and you can look at the validation set as well。
but you see that it's slowly getting that accuracy up at each step。
and it probably can continue to improve。Now, our goal here with transfer learning is going to be to freeze certain layers and only train on those later layers。
So Keras allows layers to be frozen during the training process。
In order to do what we just said, that is, some layers would have their weights updated during the training process while others will remain frozen。
and they won't be updated。 And this is going to be that core part of transfer learning。
You also want to note that a lot of the training time。
and we mention this in lecture is going to be spent back propagating the gradients back to that first layer。
Therefore, if we only need to compute the gradients for a small number of layers。
the training time should speed up。 should speed up at least a bit, hopefully quite a bit。

So in order to freeze the layers, we just loop through each one of those layers。
which are going to be the layers in our feature layers that we defined earlier as those that we will ultimately freeze, and for each one of those。
we just set l dot trainable equal to False。And that will freeze the weights where they are and won't allow for further training。
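The effect of freezing can be illustrated without Keras. The toy sketch below is a two layer linear network in NumPy where each layer carries a `trainable` flag, loosely mirroring the idea of setting trainable to False on a layer; the data, sizes, and learning rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = rng.normal(size=(32, 1))

# Two linear layers; the "feature" layer is frozen, the "head" is trainable.
feature = {"w": rng.normal(size=(4, 3)), "trainable": False}
head = {"w": rng.normal(size=(3, 1)), "trainable": True}

feature_before = feature["w"].copy()
head_before = head["w"].copy()


def loss():
    """Mean squared error of the two layer linear network."""
    return float(np.mean((x @ feature["w"] @ head["w"] - y) ** 2))


initial_loss = loss()
learning_rate = 0.01
for _ in range(100):
    hidden = x @ feature["w"]
    error = hidden @ head["w"] - y
    grad_head = 2 * hidden.T @ error / len(x)
    if feature["trainable"]:
        # Skipped while frozen: no gradient is computed for this layer at all.
        grad_feature = 2 * x.T @ (error @ head["w"].T) / len(x)
        feature["w"] -= learning_rate * grad_feature
    if head["trainable"]:
        head["w"] -= learning_rate * grad_head
final_loss = loss()

print(initial_loss, final_loss)
```

The frozen layer's weights stay exactly as they were while the head still learns, and the gradient for the frozen layer is never computed, which is where the time saving comes from.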

Now when we look at the model dot summary。

We see that our total number of parameters is still 600,165。
But the number of trainable parameters is going to be less, at 590,597。
And that's going to be due to the fact that we are freezing these upper layers in place。
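The split between total and trainable parameters can be worked out by hand. The sketch below assumes settings that are not stated in this section: 32 filters in each convolutional layer, no padding, and a 128 unit hidden dense layer. With the stated 3 by 3 kernels, 2 by 2 pooling, 28 by 28 grayscale inputs, and 5 classes, those assumptions reproduce the 600,165 total from the model summary.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


# FILTERS and HIDDEN are assumed; the rest follows the lecture.
FILTERS, KERNEL, POOL, HIDDEN, CLASSES = 32, 3, 2, 128, 5

# Feature layers: two 3x3 convolutions on a 28x28x1 image (these get frozen).
feature_params = conv_params(KERNEL, 1, FILTERS) + conv_params(KERNEL, FILTERS, FILTERS)

# Each unpadded 3x3 convolution trims 2 pixels: 28 -> 26 -> 24, then pooling -> 12.
side = (28 - 2 * (KERNEL - 1)) // POOL
flat = side * side * FILTERS

# Classification layers: hidden dense layer plus the output layer (still trainable).
classifier_params = dense_params(flat, HIDDEN) + dense_params(HIDDEN, CLASSES)

total = feature_params + classifier_params
print(total, classifier_params, feature_params)  # 600165 590597 9568
```

Under these assumptions only 9,568 parameters are frozen, so most of the weights still train; the saving comes mainly from not back propagating through the convolutional layers.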

Now we do still have to train a lot, but again, those being later layers。
they'll be able to update a bit quicker than those earlier layers。

So now we can call train model。

This time on the values less than 5。 So if we recall, going back up。
we originally trained on the values greater than or equal to 5。 We froze those layers。
and now we want to see how our model, where we again just froze those first few layers and are only allowing training on those final layers, does。
For values less than 5。 So I'll run this。 And again, this will take some time。
but hopefully it'll be faster than the last one。And we will come back as soon as it's done training and touch on our results。

Now, looking at the results that we have here, we see that the total training time came down by a full minute, and when we're only training five epochs,
that's quite a bit。And also we're seeing that towards the end
we were getting higher overall accuracies, both in our training set as well as in our validation set。
So we see how this power of transfer learning allowed us to save time while gaining higher accuracy in that shorter amount of time。

Now, just to close out, we want to flip these two steps。 So rather than
first training on our greater than five and then freezing layers,
we're going to train on our less than five, then freeze our layers,
and then run our final model on greater than five。

So in order to do that, we're going to reset our feature layers。
They are going to be the same layers。 But now we want to retrain them,
so we set them back to trainable after having trained them on a different data set, and then do the same thing for our classification layers,
leaving those as trainable。And then we set that up as Model 2,
so we have a Sequential model with our new layers, which are going to be the same layers as before,
just not trained yet。

We look at the summary and it should show the same layers,
except now our total parameters and our trainable parameters should be the same。
We can then call our train model function。This time, starting off with the less than five values。
And we run this。And this will take some time to run。
so we'll see you as soon as this is done running。

Now that it's done running, we can see the different accuracy numbers
as well as the validation accuracy。 We can go through the same steps of freezing those trainable layers。


We look at the summary again。We see that same number of total parameters versus the number of trainable parameters。
And then we can again call that train model function,
passing in our new model, Model 2, with those layers frozen, and try to get that greater than 5 accuracy。
We run this, and again, this will take a bit to run, though less than the last one。
I'm going to pause it here and then we will look at the results once it's complete。

So as we see here, now that the results are completed, we were able to reduce the training time。
But once we flipped which one we were performing first,
we didn't quite get the same accuracy results that we had before, with
a little bit less accuracy on that validation set。
And that can happen。 Transfer learning is a bit of an art; it takes a bit more understanding of where we can fine-tune and how deep we can fine-tune。 Another thing that we can keep in mind is the fact that each epoch is moving a lot faster

and we are getting continuous improvement in accuracy,
so we could even add on an extra epoch or two and get that improved accuracy while doing it in less time than just running it from scratch。
Feel free to play around with training different layers, going deeper back,
seeing how you are able to work with holding certain parts constant, and so on。
That closes out our notebook here on transfer learning,
and I look forward to seeing you back in lecture。 All right, I'll see you there。


086:LeNet.zh_en -BV1eu4m1F7oz_p86-
In general, it may be difficult to determine the appropriate architecture for your convolutional neural network。
With that in mind, in this section, we're going to discuss some different architectures,
which will help provide a framework as you move towards building out your own convolutional neural nets。
Now let's go over our learning goals for this section。In this section,
we're going to discuss different architectures used in convolutional neural networks。
And we're going to specifically talk about some commonly used network types。
We're going to start off with LeNet, which is an earlier architecture,
so it's going to be more of a motivating architecture。 It was one of the first successes, and it was used on black and white images。
We'll then discuss AlexNet, which is what really made convolutional neural networks popular, as it won the 2012 ImageNet competition by a landslide。
Then we'll discuss VGG, which is a means of coming up with a simpler overall architecture that's still able to identify more complex features。
We'll then discuss Inception, which will be a means of combining different types of layers together within a single layer,
and we'll see what that means in just a bit。And then finally, we have ResNet,
which is going to be a means of working with much,
much deeper networks and still getting high accuracy。

So starting off with LeNet。 LeNet was created by Yann LeCun in the 1990s, so again,
it's one of those earlier architectures。And LeNet was built for MNIST,
and the MNIST data set is specifically handwritten digits between 0 and 9,
and we want to identify which digit is written for a given image。
They're all going to be black and white, or grayscale,
so we're only going to have one channel if we think back to our discussion about image data。
And he was able to use this concept of convolutions for the first time to efficiently learn the features that are built into the data,
whether those are edges, loops, or the sharp turns that you may see in any handwritten digit。

So let's walk through the actual architecture of LeNet。
So we have the actual structure diagram here in front of us, and we start off with this input,
which is a 32 by 32 grayscale image。 Again, in the original data set we are working with digits from the MNIST data set, and that's why our output at the end is going to be 10 different values: it's going to predict whether it's a 0,
1, 2, 3, and so on through 9。 So 10 different possible outputs。 Here the diagram shows an A, but you can just imagine that it is a handwritten digit。
And we also have no extra depth here since we have this in grayscale, it's in black and white,
so we don't have to worry about having different channels; we're just going to have a depth of one。
So then we have our first convolutional layer。And that's going to be a five by five convolutional layer with a stride of one,
so it's going to be moving across the image one step at a time and then down the image one step at a time。
And this will have a resulting output with a dimension of 28 by 28。And the reason for this,
as we discussed, is that if we're moving that 5 by 5 filter across our image and down our image,
we're actually going to be reducing the spatial dimensions,
especially if there's no padding。We also are going to use a depth of six,
so this means we will end up with six different kernels that are being learned。
So our filter will have six different kernels and we'll have that output that we have here of 6 by 28 by 28,
so the next layer does have a depth, and that depth is 6。
So we want to think: how many weights do we need to learn for this particular layer?
If we think about the size of our kernel, that is 5 by 5, so we have 25 weights there。
Then we add on the bias term, so we end up with 26 weights。And then we think about the depth
of that layer, of that filter。So we multiply that 26,
which is just one kernel, times six for the depth to come to 156 weights being learned at that first layer。
Next, we have a pooling layer with stride equal to two。
So there are no weights that need to be learned, as pooling is just a fixed operation。
But we want to note that here, given that we're working with this older architecture,
the original paper actually does a more complicated pooling than max or average pooling,
but this is essentially considered obsolete by now, so if you are going to be using this,
you would probably use something like max pooling。
We then have another 5 by 5 filter, again with stride equal to one and with no padding, so again,
we're going to be reducing our size even further。This time we go to a depth of 16,
so we went from 6 out to 16, and we reduced that output size to 10 by 10。 As we discussed,
as we move that 5 by 5 filter, it will reduce the actual size of that next layer, and then we have that depth again of 16。
The kernels will be taking in the full depth of that previous layer,
and that full depth is equal to 6。So each five by five kernel now looks at six times five by five pixels,
not just the 5 by 5 that we saw in the first layer。
So in order to calculate each one of the individual pixels in each one of our 16 output channels,
we now have to look at six times five times five pixels。
So because each kernel has six times five times five different weights that are being learned,
we have 150 weights, plus that bias term, so equal to 151 for each particular kernel。
And then to get the total weights for this layer, we multiply this by our new depth, which is 16。
So we're learning here 2,416 weights, which is just 16 times 151。And after another pooling layer we have our output here,
which is 16 by 5 by 5。 We can then flatten this into a vector of length 400。
And now we are just working with fully connected layers, so we can go from 400 down to 120,
and then from 120 down to 84, then ultimately from 84 down to 10, which allows us to predict which class we are actually working with for our digit 0 through 9。
So that's a softmax output of size 10, one for each of those 10 digits。
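The shape changes described above can be traced with a short sketch。 This is an illustration under stated assumptions, not the original implementation: the helper names are hypothetical, the pooling window of 2 is assumed (the lecture only states a stride of two), and no padding is used。

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2, stride=2):
    """Spatial output size of a pooling step (window of 2 is an assumption)."""
    return (size - window) // stride + 1


size = 32
trace = [("input", 1, size)]
steps = [
    ("conv1", 6, lambda s: conv_output_size(s, kernel=5)),
    ("pool1", 6, pool_output_size),
    ("conv2", 16, lambda s: conv_output_size(s, kernel=5)),
    ("pool2", 16, pool_output_size),
]
for name, depth, op in steps:
    size = op(size)
    trace.append((name, depth, size))

# Flatten the final depth x height x width volume into one vector.
flattened = trace[-1][1] * size * size

for name, depth, side in trace:
    print(f"{name:6s} {depth:2d} x {side} x {side}")
print("flattened length:", flattened)
```

The trace goes 32, 28, 14, 10, 5 in each spatial dimension, and the final 16 by 5 by 5 volume flattens to the 400 values fed to the fully connected layers。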

So how many weights did we actually have to train as we walked through this LeNet structure?
If you think about that first layer, we only had 156 weights in that convolutional layer。
Then 2,416, and then moving to those fully connected layers, that's when the numbers really jump up,
and we see roughly 48,000, then roughly 10,000, then 850, and ultimately we had a total of
61,706 weights。Now this is always going to be less than the equivalent for a fully connected network。
And the major takeaway that we want to take from here is that convolutional layers are generally going to have relatively few weights compared to these fully connected layers。
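The tally above can be reproduced with a few lines of arithmetic。 The sketch below assumes one bias per kernel and one bias per dense output, which is what the 26 and 151 figures in the lecture imply; the function names are hypothetical。

```python
def conv_params(kernel, in_depth, out_depth):
    """(kernel * kernel * in_depth weights + 1 bias) per kernel, times out_depth kernels."""
    return (kernel * kernel * in_depth + 1) * out_depth


def dense_params(n_in, n_out):
    """(n_in weights + 1 bias) per output node."""
    return (n_in + 1) * n_out


lenet_layers = [
    ("conv 5x5, depth 1 -> 6", conv_params(5, 1, 6)),
    ("conv 5x5, depth 6 -> 16", conv_params(5, 6, 16)),
    ("dense 400 -> 120", dense_params(400, 120)),
    ("dense 120 -> 84", dense_params(120, 84)),
    ("dense 84 -> 10", dense_params(84, 10)),
]
total = sum(count for _, count in lenet_layers)

for name, count in lenet_layers:
    print(f"{name:26s} {count:>7,d}")
print(f"{'total':26s} {total:>7,d}")
```

The two convolutional layers together account for 2,572 of the 61,706 weights; almost everything else sits in the first fully connected layer。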
This structure that we just walked through, the LeNet structure, is still used today
in regards to that convolutional layer, then the pooling layer, and again
the convolutional layer, ultimately leading to those fully connected layers at the end。
That closes out our discussion here in regards to LeNet。 In the next video
we will discuss the AlexNet structure, which was used to win ImageNet back in 2012。 All right,
I'll see you there。


087:AlexNet.zh_en -BV1eu4m1F7oz_p87-
Now let's discuss AlexNet。 AlexNet is named after Alex Krizhevsky, one of its main creators,
and if you watched the intro course that we had here for all of the courses within this learning path,
you will recall that this was when convolutional neural nets really hit the main stage。

And that was due to the fact that it won the ImageNet competition。

The goal of this competition was to predict the correct label from among a thousand different classes。

And this was amongst 1.2 million different images, so we're working with a very large data set and a large classification problem in general。

Now, again, AlexNet is considered the flashpoint for modern deep learning in general。

This is due to the fact that it demolished the competition with a top-5 error rate of 15.4%,

whereas the next best was 26.2%。

So let's dive into AlexNet under the hood。Here we have an actual diagram of AlexNet。
Now don't be too nervous about this breakdown of layers。
The reason why we have two separate paths that this network is walking through is that, in order to run a model on such a large data set,

what they actually did was split it up into two parallel paths。
So rather than thinking about, say, this first layer that we have here,
which we see is 55 by 55 by 48 twice, you can imagine, using your normal convolutional layer, that this would be something along the lines of 55 by 55 by 96,

where that depth of 96 is split into two parallel paths。
And then you see the same with the next layer and every layer moving forward。 So in the next layer,
you can imagine that rather than 27 by 27 by 128 twice, it would be 27 by 27 by 256。
And when we look at this large network, along with those dense fully connected layers at the back end,

there were actually 60 million different parameters that had to be learned。

So that parallelization was very important, and also this would take weeks to actually train。
But again, it had this very high performance, knocking out the competition with that difference that we just saw of around 15% to around 26%。



So now I want to go over just a few more details in regards to AlexNet。

So first off, the AlexNet developers performed data augmentation before feeding these images through the network。
They did things such as cropping, horizontal flipping and other manipulations,
and that augmentation helped reduce overfitting。So if you think about working with an image of a cat,
for example, and you were to crop down that image, you would still have an image of a cat。
Likewise, doing a horizontal flip of an image, you'd still have that image of a cat。
So you'd be able to avoid overfitting to those exact images and learn the extra features that actually go into what makes up a cat。
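As a small illustration of the two augmentations named above, here is a sketch that treats an image as a nested list of pixel values。 The function names and the 4 by 4 toy image are hypothetical, and this is not the AlexNet authors' pipeline; it only shows what a flip and a crop do to the pixel grid。

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def crop(image, top, left, height, width):
    """Keep a height x width window whose upper-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Toy 4 x 4 grayscale "image" (hypothetical values).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

flipped = horizontal_flip(image)
cropped = crop(image, top=1, left=1, height=2, width=2)
print(flipped[0])   # first row, mirrored
print(cropped)      # central 2 x 2 patch
```

Each variant keeps the same label as the original image, so the network sees more distinct examples of the same class。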

Now, the basic template, which we just saw in the last slide,
is that we'll have convolutions with ReLUs as the activation functions。
ReLUs were fairly new at the time, and that was a major part of why they were able to create this huge breakthrough and train such a large network。

And with that, they would sometimes add on a max pooling layer after convolutional layers,
and as we've seen before, at the end there were fully connected layers that led to that softmax classifier, which allowed you to identify the class of that image。



Now that closes out AlexNet。

088:Inception.zh_en -BV1eu4m1F7oz_p88-
Now I'd like to talk about the Inception architecture。Now with Inception,
the idea is that perhaps you don't know exactly what type of filter or what type of layer you want at each step。
So you may want to combine or try a bunch of them together。
But this can be computationally expensive, and we probably want to accomplish this with some level of computational efficiency。
And we're also going to want to ensure that we can reduce the total number of activations that are needed to run through our entire network。

So our solution with the Inception architecture will be to turn each individual layer into branches of convolutions rather than just working with a single filter type。
And each of these branches is going to handle a small portion of the workload。
And then each layer will concatenate the different branches to complete a single layer。
So let's see a visual of this。

So what we see here is that we are moving from our previous layer to the next layer
using one of these Inception blocks。 So this is our first idea of this Inception block, where we have the previous layer,
and then we're going to concatenate many different types of convolutions, as well as max pooling,
which will make up that next layer。 So we'll have one by one convolutions
make up a certain depth, then 3 by 3 convolutions, 5 by 5, and so on。
And then we concatenate all those together to get our full depth。
And then we can run our activation function on that concatenated version of that layer。Now,
the way that this is laid out, we are computing the filters directly from the previous layer using 3 by 3 and 5 by 5 convolutions。
So if we run three by three and five by five convolutions through the full depth of the previous layer,
we should recall that we're going to have to have a weight for every single channel, or every level of depth, of that prior layer in order to get each individual output value。
Thus, we end up requiring a ton of operations to complete the calculation for that filter。

Now, instead of what we just saw, what we can do first,
as we look at these one by one convolutions, is run a one by one convolution。
That one by one convolution may at first seem meaningless,
but recall that we're also working with the entire depth。
So we would have a different single number for each level of the new depth that we're trying to calculate。
Similar to when we work with different three by three kernels to come up with different depths,
we do the same with a one by one, but this time we're just multiplying each channel by a single value。
And by first doing these one by one convolutions, we can reduce the depth
without nearly as many calculations as would be needed if we did a 3 by 3 or a 5 by 5 filter。
And then once we have reduced that depth, we can do our five by five or three by three convolutions with far fewer operations required,
thus reducing the computational complexity。
Now, with that, we also have this pooling that we see all the way out to the right。And for max pooling,
we still are going to end up with that same number of channels, or that same depth, as the previous layer。
So doing the one by one convolutions after the max pooling allows us to again reduce that depth to whatever depth we want before concatenating together all the different types of convolutions that we have here in our Inception layer。
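To get a feel for the savings, the sketch below counts multiplications for a 5 by 5 convolution applied directly versus after a one by one reduction。 All sizes here (a 28 by 28 by 192 input, 32 output channels, a reduced depth of 16) are assumptions chosen for illustration, not figures from the lecture, and the output is assumed to keep the 28 by 28 spatial size。

```python
def conv_multiplications(height, width, in_depth, out_depth, kernel):
    """Multiplications for a convolution whose output is height x width x out_depth."""
    return height * width * out_depth * kernel * kernel * in_depth


# Hypothetical sizes, for illustration only.
HEIGHT, WIDTH = 28, 28
IN_DEPTH, OUT_DEPTH, REDUCED_DEPTH = 192, 32, 16

# 5x5 convolution straight from the full-depth previous layer.
direct = conv_multiplications(HEIGHT, WIDTH, IN_DEPTH, OUT_DEPTH, kernel=5)

# 1x1 convolution to shrink the depth, then the 5x5 on the reduced depth.
reduce_step = conv_multiplications(HEIGHT, WIDTH, IN_DEPTH, REDUCED_DEPTH, kernel=1)
conv_step = conv_multiplications(HEIGHT, WIDTH, REDUCED_DEPTH, OUT_DEPTH, kernel=5)
bottleneck = reduce_step + conv_step

print(f"direct 5x5:        {direct:,}")
print(f"1x1 then 5x5:      {bottleneck:,}")
print(f"reduction factor:  {direct / bottleneck:.1f}x")
```

Under these assumed sizes the one by one step costs a small fraction of the direct convolution, and the 5 by 5 that follows works on a twelfth of the original depth。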

And this whole block serves the function of that previous convolutional layer。
so we combine these all together, and then once we combine these all together。
then we pass it through our activation function。

So to see what this looks like for a full network, we have our input coming in from the left,
and we see that we have multiple different convolutions within a single layer, where we have
one by one, three by three, and five by five convolutions, as well as perhaps an average pool。
Once we concatenate those all together, we can pass the result on, and we can continue to do that with different types of convolutions at each one of the different layers, feeding all the way through to our softmax output。

089:ResNet.zh_en -BV1eu4m1F7oz_p89-
Now I'd like to talk briefly about our final architecture, the ResNet architecture。
And I want to start off with the motivation behind it。
Researchers were building deeper and deeper networks as they started to realize the power of convolutional neural nets。
But they started to find that as they built out these deeper networks,
they were actually tending to get worse performance。

And we see this here in the training error of the 20 layer versus the 56 layer network: we actually have a higher training error with the deeper network。

And hopefully you find this unintuitive。 It would be intuitive for, perhaps, the test error on our holdout set,
but when focusing just on our training set, ideally
we should just be getting better and better in regards to the training error, on that same training set that we learned the model on。



And this is surprising, again, because deeper networks should overfit more and so should do better on the training set。

So what was happening?Earlier layers of deep networks were very slow to adjust, so it was hard to update those earlier layers within the network。

This is analogous to that vanishing gradient issue as we do backpropagation
and move towards the front of our neural network。

So we are getting this lower performance on the training set when, in theory,
we should be able to just have an identity transformation that makes the deeper network behave just like the shallower one。
So if our 20 layer network is doing well and we add on another layer, or another 30 layers,
there's no reason why we would do worse, because we can just add on identity layers that keep it exactly the same。
But our convolutional neural nets, due to the vanishing gradient issue, weren't able to learn these identity transformations in any type of way。

So the assumption that will make ResNet possible, ResNet being the solution to this problem, is that the best transformation over multiple layers will be close to F of x plus x,
where x is going to be our input to the series of layers,
and we see that here in our diagram where we have that x。
And F of x is our function represented by several layers,
say convolutions here with their ReLU activations, as we see in the diagram in between。
And we can then take that last layer's linear transformation,
as well as the added-on initial input x, and pass their sum through that current ReLU。

And this shortcut allows the information from those earlier layers to easily pass through our network。

And we can continue to do this throughout the network to ensure that the output of prior layers, say two layers
back, and not just that initial input, as you may have thought about with x,
can continue to be added to the output of the most recent layer。


And the idea basically is keep passing that initial information unchanged to the next layer。
as well as that transformed information as we move along our network。


Now, this will actually allow you to continue to pass through past information
if we just set our weights to 0 for our new layers,
so it's possible to just have the ReLU of that shortcut connection, represented by the loop。
And what goes wrong without it is that as we go deeper and deeper,
it becomes difficult to even learn something like that identity function due to the vanishing gradient issue。
But if we allow for that initial value to be passed through,
then we can hold on to that value from that earlier layer, apply, again,
zero weights to that new information coming in, and effectively have that identity transformation as we move through our network。
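The zero-weight argument above can be checked with a tiny sketch。 It assumes small dense layers standing in for the convolutions and omits biases, and the function names are hypothetical; the point is only to compare a plain two-layer stack with the same stack plus a shortcut, ReLU(F(x) + x)。

```python
def relu(vector):
    return [max(0.0, value) for value in vector]


def dense(weights, vector):
    """Matrix-vector product, one row of weights per output value."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def plain_block(x, w1, w2):
    """Two stacked layers with no shortcut."""
    return relu(dense(w2, relu(dense(w1, x))))


def residual_block(x, w1, w2):
    """Same two layers, but the input x is added back before the final ReLU."""
    fx = dense(w2, relu(dense(w1, x)))
    return relu([f + value for f, value in zip(fx, x)])


x = [0.5, 2.0, 1.5]                      # non-negative, as after an earlier ReLU
zeros = [[0.0] * 3 for _ in range(3)]    # new layers with all weights set to 0

print(plain_block(x, zeros, zeros))      # the signal is wiped out
print(residual_block(x, zeros, zeros))   # the input passes through unchanged
```

With zero weights the plain block outputs all zeros, while the residual block returns its input, so the added layers start out as an identity transformation rather than having to learn one。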

Now to recap。In this section, we discussed many different common architectures that we should be aware of when we're working with convolutional neural nets。
We started off with LeNet, which was an earlier version where we saw the framework of that convolutional layer,
then the pooling layer, and then those fully connected layers first being introduced using the MNIST data。

Then we discussed AlexNet, which introduced ReLUs into the equation,
as well as other breakthroughs and efficiencies, to create that complex network that ultimately blew away the ImageNet competition back in 2012。

We then discussed VGG and how that allowed for a simpler,
powerful framework that compensated for its simplicity with deeper networks。

We discussed the Inception model, which allowed us to include multiple layer types within a single layer while maintaining computational efficiency。

Then finally, we discussed ResNet, which allowed for maintaining information from earlier in the network to ensure that we can build out incredibly deep networks while still continuing to reduce training error。
Now that closes out our discussion on convolutional neural networks,
which are very powerful for image data。But in the next video we're going to introduce our next major framework, which is going to be powerful for working with text and time series data,
namely recurrent neural networks。 All right, I'll see you there。


090:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p90 51_循环神经网络(RNN).zh_en -BV1eu4m1F7oz_p90-
In this set of videos, we're going to motivate and gain an understanding of how recurrent neural networks work。
Now let's go over the learning goals for this set of videos。In this set of videos,
we're going to cover what recurrent neural networks are as well as the motivation behind them。

We'll discuss both the practical and mathematical details that allow you to understand how recurrent neural networks work。

And then finally we'll touch on some limitations of these recurrent neural networks, and that will lead into our next set of videos on how to adjust for those limitations。


So we discussed how processing images forces them into a specific input dimension,
where with grayscale we can imagine the two dimensions of pixels,
say 28 by 28, and why something like a convolutional operation,
which we saw in the past videos, takes in surrounding cells
and makes sense for these types of input data。

But this may not be immediately obvious in regards to text,
in regards to what kind of data we want to input and what kind of operations we want to use。

For example, say our problem statement was to classify tweets as positive, negative or neutral。

Different tweets can have different numbers of words。

And we want to know: how can we account for this variable length for each one of our input sequences?

Now we want to do better than just the bag-of-words implementation,
which would essentially take every word and just state how many times that word appeared in the document。
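A quick sketch shows what the bag-of-words representation keeps and what it throws away。 The helper name and the two example sentences are hypothetical; the counting itself uses Python's standard `collections.Counter`。

```python
from collections import Counter


def bag_of_words(text):
    """Count how many times each lowercase, whitespace-separated word appears."""
    return Counter(text.lower().split())


first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

print(first)
print(first == second)   # word order is lost, so the two bags are identical
```

Both sentences map to the same counts even though they mean different things, which is the loss of context that feeding words in one at a time is meant to address。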

Ideally, when working with text data, each word can be processed and understood in the appropriate context。
And by context, we can think of the prior words surrounding that word, prior sentences,
et cetera。

And those words should be handled differently depending on that context。 You can think about, say,
a bat being either the animal or a baseball bat。


Also, as we get more words,
we should be able to update the context that we are currently working with。

So the solution will be to use this idea of recurrence,
where we input the words into our network one by one。

And this would mean that we can deal with variable length by just continuing to feed until the end of the sentence or till the end of the document。

And because we have the information from prior words,
the response to any particular word can depend on those that actually preceded it, since we're feeding it one by one。


Our network would then output two things as each new word came in。One being a prediction:
if the sequence were to end at that particular word, what would the prediction be?

And second, a state, which contains a summary of everything that happened in the past leading up to that point,
or again, that context that we're looking for。


091:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p91 52_状态和循环神经网络.zh_en -BV1eu4m1F7oz_p91-
Now this picture of how the recurrent network is used, or how it's built, is often a bit confusing to digest。

But what we're looking at here

is that the input will come into the network one word or one time step at a time。
And as those values come in, we can update the state with all past context, so we can keep track of the inputs that have come in,


as well as outputting a value, so that we can have a prediction at each word or at each time step。

Now, as I said, this may be a bit unclear, so let's unroll this recurrent neural net and take a deeper dive inside。

Now looking at the quote unquote unrolled version of what we just saw,
we're going to have our words coming in as input, and those come in one at a time。
And then starting back at w1, we have a linear transformation denoted here by this matrix U。
And note that w will be some vector representing that single word。 And in general,
an RNN does not only take words, but can take in any information at one point in time。
So you can imagine this being an input vector with sales data, inventory levels,
and promotional spend, all for one time period, and then w2 being that same information for time period 2, and so on。
But again, in this instance, we're just going to continue to think of this in terms of words,
where each word has its vector representation for word one, word two, and so on within the sentence。
Now, how are we able to store and pass that information from one cell to the next?
The way that we do that is that at each step, along with our input, the dot product of w and U,
which we just did, where we took the dot product of that w vector with our U transformation,
we're also going to be getting as input the state from the prior cell。
So starting at s1, or even before s1, we can initialize that state with a zero vector。
But then we pass that information from s1, or state 1, to s2 and so on。
And the way that we do this is we take the values from that prior state
and take the dot product of that state with our matrix W。
We then combine the values from the input and U, as well as the prior state and that W matrix,
and pass that all together, all combined, through an activation function to get our new state。
Now, the output from this activation can be used as the actual output at each step,
and that is often the case。 What we can also do, as we see here with these V matrices,
is have another transformation take place with this vector, pass that through another activation,
and either pass that new value as output or even create another layer on top。
You can think of this as just the first layer in our neural network with some number of nodes;
we can add another layer taking in as input these output values。
And once all that is done, often we may only use that final output that we have here to create the ultimate prediction that we're trying to make。
So as an example, the first two words have an unknown sentiment,
while the last two words that we have here are going to have that positive sentiment。
So you see the question mark, question mark, and then it was able to predict the positive sentiment, if it was predicting sentiment at each one of these different outputs。

Now each of these cells can have an output with more than one value,
so we can imagine that if we set our matrix to have an output with, say, five values,
so that o1 is an array with five different values, we would be getting five different outputs at each one of these steps。
And this is the same idea as having more than one node in the first layer of a feedforward neural network;
those are one and the same。

So if we're assuming something like five or more nodes, or five or more outputs,
and ultimately we want to predict something like a class that's only between two values or three values,

we'll need to have a dense matrix that gives a linear combination of each one of those nodes,
as well as an activation function that results in either three values or only a single value, depending on what we're looking at。

Now, if this is a bit confusing, an important note is that usually we're only looking at that final output,
so here, say, o4, output 4。And since that is the only output with information from all the other inputs,
that's going to be the most important。And that single output o4
can have those five values that we just talked about, or 32 values,
whatever number of values you want in regards to the number of nodes you want in that first layer。
And if we have something like five or 32 nodes, then we need to pass that through a dense layer,
just that o4, in order to come up with the prediction,
whether that's an output of just one value or three values,
whatever it is you're trying to classify。
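As a sketch of that last step, the code below takes a five-value final output o4 and passes it through a dense layer with a softmax to pick one of three sentiment classes。 The weights, biases, input values, and the function name are all made up for illustration; in practice the dense weights would be learned during training。

```python
import math


def classify_final_output(final_output, weights, biases, labels):
    """Dense layer plus softmax applied to the last RNN output only."""
    logits = [
        sum(w * v for w, v in zip(row, final_output)) + b
        for row, b in zip(weights, biases)
    ]
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs


# Hypothetical final output with five values (five nodes in the recurrent layer).
o4 = [0.2, -0.5, 0.9, 0.1, -0.3]

# Hypothetical 3 x 5 dense weights: one row per class.
weights = [
    [1.0, 0.0, 2.0, 0.0, 0.0],   # positive
    [0.0, 1.0, 0.0, 0.0, 1.0],   # negative
    [0.0, 0.0, 0.0, 1.0, 0.0],   # neutral
]
biases = [0.0, 0.0, 0.0]
labels = ["positive", "negative", "neutral"]

label, probs = classify_final_output(o4, weights, biases, labels)
print(label, [round(p, 3) for p in probs])
```

The same pattern applies with 32 nodes; only the width of the dense weight rows changes。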

Now what we have circled here is really the crux of our recurrent neural net,
which passes through that saved state from all the prior inputs within our sequence。

In Keras, we call this part that we have as the input the kernel,
and the kernel refers to the matrices used for the input transformations, those U's,
and we can initialize these weights using our kernel initializer, and we'll see that in the notebook。

And then we also have weights within the recurrent portion of our network,
and those will also need to be initialized, and those are going to be the W's that we see here。
Now that closes out this video, and in the next video we'll start to walk through at a high level the actual math of how this all works。

All right, I'll see you there。


092:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p92 53_循环神经网络详解.zh_en -BV1eu4m1F7oz_p92-
In this video, let's get into the mathematical details behind recurrent neural nets。
So starting out, we have our inputs, w_i, where i represents the i-th position in our sequence。
And as we talked about so far, that would be the i-th word within our sentence。And with that,
we also have s_i, which is the state at position i that holds all past information that should be passed through our network。
We have o_i, which is the output at position i。And to calculate s_i,
as mentioned in the previous video, we take a function of the linear combination of our input
added on to a linear combination of our prior state。
And that function should be some nonlinear activation function。
And then to get our output, or our final output, we take a linear combination of our current state and pass that through an activation function; assuming we're trying to predict classes, that may be a softmax,
for example。
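Written out, the recurrence just described can be sketched as follows. The names U, W and V are the weight matrices this lecture refers to for the input, the prior state and the output; f stands in for whichever nonlinear activation is chosen, and bias terms are left out, as they are in the spoken description.

$$
s_i = f\left(U w_i + W s_{i-1}\right), \qquad o_i = \mathrm{softmax}\left(V s_i\right)
$$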

So what are we doing here? We get our current state as a function of the old state and the current input。
We then get our current output as a function of that current state。

And we learn the appropriate weights for these functions by training our network,
that is, the different weights that we need in order to get that current state。

Now, let's think through our matrix multiplication,
treating all of this as passing in just a single input (in reality,
we would pass through a batch at a time)。We're starting off with an input of dimension R,
and in our example, that represented a single word from our input,
so that's going to be a vector of dimension R。S is going to be the dimension of our hidden state。
And T is the dimension of our output vector after passing through the dense layer at that final layer。
So in order to get the transformations that we need, thinking back to
the visualization that we saw earlier: U will be an S by R matrix, so that'll take our R-dimensional vector,
our word vector input, and return something that is an S vector, so it's the same shape as our state。
W is then an S by S matrix。 So it'll take that prior state of dimension S and keep it in dimension S,
so it'll still be an S vector。And then finally, V is going to be a T by S matrix,
and it'll transform our S vector from that hidden state into something that is of size T, a T vector that fits the dimensions of our output。
And with that, we should note that the learned weights U,
V and W are going to be the same across all positions。 So we saw that unrolled version of our RNN,
and we had that U show up repeatedly。 We should note that that U, that V and that W will be the same throughout。
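The shapes above can be checked with a small NumPy sketch. This is only an illustration under assumptions: the sizes R, S and T, the tanh activation, the zero initial state and the helper name `rnn_forward` are all made up for the example, and biases are omitted.

```python
import numpy as np

def rnn_forward(inputs, U, W, V):
    """Run a simple RNN over one sequence; return the final output and state."""
    state = np.zeros(W.shape[0])             # initial state (assumed to be all zeros)
    for w_i in inputs:                       # one R-dimensional vector per position
        # the same U and W are reused at every position in the sequence
        state = np.tanh(U @ w_i + W @ state)
    logits = V @ state                       # T-dimensional vector from the final state
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum(), state

R, S, T = 50, 5, 3                           # illustrative sizes only
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(S, R))       # input -> state, S by R
W = rng.normal(scale=0.1, size=(S, S))       # state -> state, S by S
V = rng.normal(scale=0.1, size=(T, S))       # state -> output, T by S
sequence = rng.normal(size=(4, R))           # four positions, e.g. four word vectors

probs, final_state = rnn_forward(sequence, U, W, V)
print(probs.shape, final_state.shape)        # (3,) (5,)
```

Only the output computed from the last state is returned, which matches the point that the final output is usually the one that matters.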

And as mentioned as well, we will often ignore the intermediate outputs and only care about that final output that has seen all inputs from our sequence。
So thinking about that unrolled RNN, we discussed that output 4 being that final output。
We only really care about that final output。

In order to train recurrent neural nets, there's going to be a slight variation to our normal back propagation method called back propagation through time that allows us to update the weights within our recurrent neural network。

Now we're not going to get into too much detail about this。
But one can imagine that recurrent neural nets must learn weights by updating across the entire sequence。
And thus, if the sequence is very long, we are even more prone to that vanishing or exploding gradient problem than we are with our regular feedforward neural nets。
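A rough way to see why long sequences make this worse is to picture a signal being scaled by the same weight once per position. The sketch below is a deliberately simplified scalar stand-in for repeated matrix multiplication (no activation function, made-up weights of 0.5 and 1.5, hypothetical helper name), not backpropagation through time itself.

```python
def repeated_scaling(weight, steps):
    """Scale a signal of 1.0 by the same weight once per sequence position."""
    signal = 1.0
    for _ in range(steps):
        signal *= weight
    return signal

shrinking = repeated_scaling(0.5, 50)   # weight below 1: the signal collapses toward zero
growing = repeated_scaling(1.5, 50)     # weight above 1: the signal blows up
print(shrinking, growing)
```

After 50 steps the first value is vanishingly small and the second is in the hundreds of millions, which is the intuition behind capping sequence length.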

And in practice, we are going to set a maximum length to our sequences to ensure that they don't get too long。

And with that in mind, if the input is shorter than that maximum, then we just pad that sequence。
and if it's too long, then we would truncate it。


And this ensures uniform input lengths for all of our sequences。
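The pad-or-truncate rule can be sketched in a few lines of plain Python. The helper name `pad_or_truncate` and the pad value of 0 are assumptions for illustration; padding and truncating at the front is one possible convention (it is the default of Keras's `pad_sequences`, as far as I know), and the other end could be used just as well.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Return a list of exactly maxlen items, keeping the end of long sequences."""
    if len(sequence) >= maxlen:
        # too long: drop items from the front
        return list(sequence[len(sequence) - maxlen:])
    # too short: add padding at the front
    return [pad_value] * (maxlen - len(sequence)) + list(sequence)

print(pad_or_truncate([7, 3, 9], 5))            # [0, 0, 7, 3, 9]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))   # [3, 4, 5, 6]
```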

Now we touched on this briefly earlier。But although RNNs are often used for text applications and those are the examples we've seen。
there are multiple uses for working with such a framework。

They can be used for all types of sequential data, including customer sales, loss rates。
or network traffic over time。

Speech recognition, so working with audio input for call center automation and voice applications。

For manufacturing sensor data to tell where along a chain, failure may happen to occur。

And has even been extremely powerful in regards to our ability to now do genome sequencing。

So we talked all about RNNs and some of the powers of RNNs,
but one of the major weaknesses of RNNs is that the nature of that state transition, as it's currently constructed,
makes it hard to leverage information from the distant past, or in other words, from early on in our sequence。

With that in mind, in our next lecture, we're going to introduce LSTMs, or long
short-term memory networks。

These use similar concepts to what we just learned,

but have a more complex mechanism for updating the state that allows for longer-term memory。

So that closes out our section here on recurrent neural nets。In this section,
we discussed recurrent neural networks and their motivation in regards to learning neural networks for sequences。
We discussed the practical and mathematical details of how they provide context for our sequential data。
And then finally, we touched on the limitations of recurrent neural nets in accounting for information throughout the entire sequence,
especially those longer sequences, and how in the next video
we're going to introduce LSTMs to help account for such issues。

All right, I'll see you there。

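093:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p93 54_RNN Notebook (Optional) Part 1.zh_en -BV1eu4m1F7oz_p93-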
Welcome to our demo here on recurrent neural nets。
In this demo we're going to be using recurrent neural nets to classify sentiment on the IMDB data set。IMDB is just going to be movie reviews in general,
so our data consists of 25,000 training sequences。
so those are just going to be sequences representing different reviews。

And we're also going to have with it 25,000 test sequences that we can use to test how well we were able to train our model,
and then our outcome will be binary, either it's a positive review or a negative review,
and those will be labeled for us。

So Keras provides a convenient interface to load in this data; it is actually built into Keras as one of its data sets, and it will immediately encode those words as integers,
and those are going to be based on the most common words, and we'll see in just a bit how those are encoded as integers。
And then from there, what we're going to do is show you how to come up with vector representations of those words and then train our actual recurrent neural nets。
So first things first, we import all the necessary libraries from Keras。
Something that I want to point out here is that we're going to bring in Embedding。We didn't talk about this much, but when we use an embedding, what we're doing is taking, in this case,
the integers, so taking those sequences and those words, and coming up with word vectors that will represent the syntax or the context of that word in a way。
So if you have two words that are basically synonyms, such as doing something fast or doing something quickly, where fast and quickly could have very similar meanings,
the embedding will have vectors that are very similar to one another。
So that gives you another layer of learning that we're going to come up with,
and that's going to be our embedding layer。 We're also going to import SimpleRNN,
which is just going to be that recurrent neural net。We talked about how that is going to be simpler than the versions we're going to learn in later videos, such as LSTMs,
and that it may have that problem of longer-term learning within a longer sequence。
But just to learn how these cells piece together, we're going to start off here with a simple RNN。



We're then going to initialize the number of features, and we see that max features is 20,000。
This is used when we're loading in our data using the IMDB data set:
it's going to pick the most common words, up to whatever number of max features we have here, so
the most common 20,000 words。



We're then going to set the maximum length of the sequence, as we discussed in lecture a bit,
and we'll truncate after this, as well as pad if it's not up to that length。
And then we'll decide our batch size here as well when we do our deep learning, of course。
we always say what the batch size is, so when we run through one epoch that'll be decided by whatever our batch size is and the size of the entire data set。
so you can imagine with a batch size of 32 and quite a large data set we'll go through many iterations of gradient descent before going through a single epoch。
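To put a number on "many iterations", here is a small sketch using the 25,000 training sequences and batch size of 32 from this demo. It assumes one gradient update per batch and that the last, smaller batch is still used.

```python
import math

train_size = 25_000   # training sequences in the demo
batch_size = 32       # batch size in the demo

# each batch triggers one gradient update; the final batch may be smaller
updates_per_epoch = math.ceil(train_size / batch_size)
print(updates_per_epoch)  # 782
```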




We're then going to load in our data, and when we load in our data, we just call imdb.load_data,
and the only parameter we pass here is going to be that max features of 20,000,
which again is the 20,000 most common words within our data set。


And that will output our X train and our Y train as well as our X test and our Y test。
And we can look at the length of each and they should be equal to that 25,000 that we just discussed。
It'll take just a second to load here, and here we see 25,000 train sequences and 25,000 test sequences。



Now, as mentioned, we're also going to pad or truncate each of our sequences using that max length that we discussed earlier,
which was equal to 30, as we see right over here。

So we use sequence, which we pulled in here from that preprocessing library,
and which has the functionality of padding our sequences。
So Keras has something built in in order to quickly pad or truncate our sequences,
and we pad X train to that max length as well as our X test to that max length。


And now, when we look at the shape of those sequences。
we see that there are now 25000 different examples。


Where each one of those examples is of length 30。And if we want to see what one of those examples look like。
we see here that again, this is meant to represent a bunch of words and each one of those words are represented by a single integer。


Now, our goal is to build out a recurrent neural net。And in order to do so,
we should dive in a bit into what this embedding layer is, as well as how this SimpleRNN layer works。

So rather than using pre trained word vectors, we're going to learn what those vectors actually are using this embedding layer。


Now, when I say that, again, that embedding will allow you to have that context so that similar words will have vectors close to each other。So if we're talking about X-dimensional space,
let's see what our dimension was here。

We put in a word embedding dimension of 50, we see it down here。
Then we are going to have a vector that has 50 numbers, and in 50-dimensional space,
one vector should be close to the other if they're similar in meaning。


And we're going to learn whether they're similar in meaning using this embedding layer。

So the layer maps each integer into a distinct dense word vector of length output dim as we just mentioned。
And we can think of this as learning that word vector embedding on the fly。
So using the context of IMDB reviews。 So it will be specific to IMDB, which could be powerful。
Something to note if you're trying to do embeddings on your own:
there are pre-trained embeddings available, such as word2vec,
and because those are pre-trained, it makes it easy to
take whatever data set you have and automatically use that embedding to come up with vectors that are similar to one another if they're synonyms。
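Mechanically, an embedding layer can be pictured as a lookup table with one row per word in the vocabulary. The NumPy sketch below uses the 20,000-word vocabulary and 50-dimensional vectors from this demo, but the random matrix, the made-up word indices and the `cosine_similarity` helper are assumptions for illustration; in the real model the rows are learned during training.

```python
import numpy as np

vocab_size, embedding_dim = 20_000, 50          # sizes used in the demo
rng = np.random.default_rng(0)
# stand-in for the learned table: one 50-dimensional row per word index
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))

review = np.array([14, 22, 16, 43, 530])        # made-up word indices
word_vectors = embedding_matrix[review]         # row lookup: one vector per word
print(word_vectors.shape)                       # (5, 50)

def cosine_similarity(a, b):
    """Close to 1.0 when two vectors point in nearly the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# after training, words with similar meaning would be expected to score high here
print(cosine_similarity(word_vectors[0], word_vectors[1]))
```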

We then are going to have, again, that input dimension should be the size of our vocabulary。
and then the input length specifies the length of the sequences that the network is going to expect。
And we just discussed how we're going to keep that at 30 by padding or truncating accordingly。




We then have our SimpleRNN layer,

to which we're going to pass the number of units。So we can think again of our diagram that we saw earlier, and we can say how many units we want that to output。
We can say what type of activation to use。tanh is usually best as we pass through our SimpleRNN,
but we have the option of working with others, and feel free to play around with that。

And then we have our kernel initializer and our recurrent initializer,
which are going to be the initial values for our weight matrices。 Again,
that kernel initializer is going to be the initial weights for the input, and the recurrent initializer is going to be the initial weights for those state layers。



Here we're actually going to change the activation to ReLU, if you see,
so you can try going back to tanh and see how that works。
And then we're also going to just pass in that input shape,
which is just going to be x_train.shape[1:],
and we can just look at what that is:

x_train.shape[1:]

And we see that that's going to be of shape 30, and that doesn't matter how many examples we have in general。
when you're trying to pass a shape, you're going to be passing in that shape of what a single vector would look like。

So let's build out our first RNN。

So our RNN hidden dim is going to be equal to 5。

Our word embedding dimension is going to be 50。 So again, we're going to take those
integers that we currently have and, given their context,
come up with an embedding that's going to transform each one of those single values into a vector of dimension 50。

We're then going to initialize our model and add on our embedding layer, passing in the max features
as well as the word embedding dim。That max features is going to be what we have here,
20,000, to give us what the actual
input dimension is。And then our word embedding dim is going to tell us our new dimension。
And that's going to be the first layer。Once we have
our new embedding and our data ready to be fed forward, we can pass that through our simple
recurrent neural network。We pass in the number of hidden dimensions, which is just five。
We then call out our kernel initializer, as well as our recurrent initializer, again
initializing those weights for that first layer, for our input, as well as that state layer。

The first is just random normal with a very tight standard deviation around zero for random values,
and the second is just going to be a diagonal matrix where, along the diagonal,
we're going to have a bunch of ones。
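As a rough picture of what those two starting points look like, here is a NumPy sketch. The standard deviation of 0.001 is an assumed value (the lecture only says it is very tight), and the matrices are stored as input-by-units here, which is the transpose of the S by R convention used in the lecture.

```python
import numpy as np

embedding_dim, hidden_dim = 50, 5
rng = np.random.default_rng(0)

# input weights: small random values around zero (0.001 is an assumed standard deviation)
kernel = rng.normal(loc=0.0, scale=0.001, size=(embedding_dim, hidden_dim))
# state-to-state weights: ones along the diagonal, zeros elsewhere
recurrent_kernel = np.eye(hidden_dim)

state = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(recurrent_kernel @ state)   # [1. 2. 3. 4. 5.]
```

With the diagonal of ones, the prior state initially passes through the recurrent weights unchanged, which is one intuition for why it can be a reasonable starting point.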

This shouldn't make that large of a difference starting off。
You can try just removing these and using the default values, which we have up here。 I've tried it。
and I believe they are around similar performance。 I think this outperforms it by just a bit。



We then set our activation to ReLU and the input shape equal to that x_train shape that we just saw。
And then finally, to get just one output, because we just want positive or negative。
we add on that dense layer with the activation of sigmoid。



So now we have our model and we can look at the summary。

And we see we have to train a bunch of parameters for that embedding layer。

Then for the SimpleRNN, if we think about it, we're going to have that initial matrix going from our input to our state layer。

We have 50 as input,
and then we have five hidden cells。We add on that bias term。So we end up with
50 times 5 plus the five bias terms, so 255 weights there。
And then to go from one state layer to the next, we recall that we're going to use a 5 by 5 matrix to keep that same dimension。

So that's going to be another 25 weights that we learn and that's how we get 280 parameters that we are currently learning and finally that dense layer。
which will just be those five input plus the bias term。
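The arithmetic behind the summary can be reproduced directly. This sketch assumes the layer sizes stated in the demo (20,000-word vocabulary, 50-dimensional embedding, 5 hidden units, one sigmoid output); the variable names are made up.

```python
vocab_size, embedding_dim, hidden_dim = 20_000, 50, 5

embedding_params = vocab_size * embedding_dim               # one 50-vector per word
input_to_state = embedding_dim * hidden_dim + hidden_dim    # 50*5 weights + 5 biases = 255
state_to_state = hidden_dim * hidden_dim                    # 5 by 5 recurrent matrix = 25
rnn_params = input_to_state + state_to_state                # 280
dense_params = hidden_dim * 1 + 1                           # 5 weights + 1 bias = 6

print(embedding_params, rnn_params, dense_params)           # 1000000 280 6
```

Almost all of the parameters sit in the embedding layer; the recurrent layer itself is tiny by comparison.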


We can then call our optimizer; we're going to use RMSprop with a learning rate of 0.0001。

We're going to use binary cross entropy since we're deciding between 0 and 1。
we're going to use that optimizer that we just discussed。
and we're going to track that accuracy as well。
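For reference, the loss and metric named here can be computed by hand on a toy example. The labels and predicted probabilities below are made up, the helper names are hypothetical, and a 0.5 threshold is assumed for turning a probability into a class.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean negative log-likelihood of the true 0/1 labels."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples whose thresholded prediction matches the label."""
    hits = [int((p >= threshold) == bool(y)) for y, p in zip(y_true, y_prob)]
    return sum(hits) / len(hits)

labels = [1, 0, 1, 0]            # made-up labels: 1 = positive review
probs = [0.9, 0.2, 0.6, 0.7]     # made-up sigmoid outputs
print(round(binary_cross_entropy(labels, probs), 4), accuracy(labels, probs))
```

The last example is a confident wrong prediction, so it lowers the accuracy and contributes the largest term to the loss.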

And then finally, in order to fit that model, we can pass that in, pass in our X train, our Y train。
We pass in our batch size, which we defined earlier as 32, the number of epochs,
as well as a validation set, which is going to be our X test and Y test to allow us to evaluate how well we're actually performing and whether we're overfitting on that holdout set。

So you run this and this will take just a bit, so I'm going to pause the video here。
And we'll come back when this is done running and discuss the results。


So now our model has run through the 10 epochs。
As we're starting to learn, or probably have learned at this point,
oftentimes it will take a bit for our deep learning models to actually learn each one of the weights and optimize on the models that we're trying to run。


Now we're going to call evaluate on our RNN model, and we're going to evaluate on our test set,
so on our X test and Y test。We call evaluate; this will take just a second, not too long,
and we're going to get our score and our accuracy。That score is just going to be our binary cross entropy loss,
so our loss score, and our accuracy will just be our actual accuracy。We have all these lines here, so we're going to scroll down to the bottom since we've printed it out, and we see that we have a score, that log loss, of 0.45
and then a test accuracy of 0.78。




So that closes out this video。In the next video we're going to briefly touch on different ways that we can manipulate the model that we just went through, trying different parameters and hyperparameters。We're not going to go through all possible parameters and hyperparameters,
but we will discuss them, and after we go through it in the next video I suggest that you, at home, go through and try playing around with each one of the different parameters as well。All right,
I'll see you there。





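094:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p94 55_RNN Notebook (Optional) Part 2.zh_en -BV1eu4m1F7oz_p94-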
Welcome back。Now in this exercise we're just going to play around with some of the parameters,
to show you some of the parameters that you can play around with。And then, on your own,
as mentioned earlier, I would suggest you try playing around as well with these max features,
the max length, as well as something that we won't do, the hidden dimension for our RNN。
I believe we're also going to work with the word embedding dimension, and we'll see the performance for each as we move along。

First thing that we have is we set the max features equal to 20,000。
which is the same as what we had before。And then the other thing that we have is that we're setting the max length。
so recall that we cap off our sequences or our sentences at a certain length and then pad them accordingly as well。
Here, rather than the 30 that we had before, we're going to pad or truncate at 80。

Everything else that we have here stays the same。The same holds for our hidden dimension,
as well as our word embeddings, as well as the setup of our model, our RNN model。

So we're just going to run this, we're not going to walk through that again。

Again, we're going to use RMSprop, and again the loss will be binary cross entropy。
The metric that we will track is going to be accuracy, and along with that the loss will automatically be tracked。
We run this and then again, finally we will fit on our training set using our X train and Y train and then having that hold out validation data of X test and Y test。

So we run this and again, this may take some time to run and the only difference again that we have with this sample versus before is that we're setting the max length where we will truncate our sequences up here at 80。
So I'm going to pause here and we'll come back once this training is done。

So that actually takes quite a bit of time to run。We're not going to separately run the accuracy results as we did before,
but you can actually see those accuracy results here on the validation set, and we see that it went up to 0.842,
compared to what we had before, which was 0.7846, and that matches that evaluate output over here。So we see we're able to increase our accuracy。




Now we want to see again, we're going to play around with just one more parameter here。

So, again, this time, instead of
20,000 features, which is the number of actual words we're going to use, taking the most common words,
we bring that down to 5,000, keeping that max length up at 80。

And then for our word embedding dimension, recall that that's going to take those integer values that we're starting off with
and convert them into X-dimensional vectors; here, 20-dimensional, where prior they were 50-dimensional vectors。
So we're shrinking that down as well。 So we're actually changing two features here。

So you run this to get our new Xtrain and X test。

We get our new RNN model, with everything else kept the same。

And then after that, again, we will use the compile with the same loss function, the same optimizer。
tracking the same metrics。And we're going to call fit again here。

And then after that, this is going to run。 and this will take some time as well。
So all these will take a bit of time, and it's part of the process when we're doing this deep learning。

But here we see it's running through that first epoch。
and then we'll do that again for another 10 epochs here。
The goal being that if the accuracy on our holdout set is continuing to improve,
we should probably run it for more epochs。 And we did see that that was the case up here。
If we look at the validation accuracy, we see that it continued to go up

after each epoch, so we probably could have continued to run that and gotten even higher accuracy。

So we're actually going to do that here。 we'll see after 10 epochs how well we were able to perform。
and then after that, I'm just going to run this now。
It'll run for another 10 epochs and we'll see how much that accuracy can actually improve。
So I'm going to pause it here and we will get back once both of these runs have finished,
which will take quite a bit of time。


Now, looking at our results here after going through 20 epochs, 10 on the first run and
another 10 on the next run (and that second 10, of course,
as it was in our last notebook, will pick up where we left off), we see that we had an accuracy of 0.8479

on the training set, and we see that continues to go up as that loss continues to go down。

And we see towards the end that we get that validation accuracy of about 0.84, which is

around equal to what we were able to accomplish in just 10 epochs

using the word embeddings with 50 dimensions, max features of 20,000 and the max length of 80。

I'd say again, feel free to play around with these different parameters。
see if you can improve the model, but we are also going to in the next lecture start to discuss a more powerful recurrent neural net structure with more long term memory。
specifically LSTMs。

So I'll see you back in lecture where we will pick up with long short term memory models。 All right。
I'll see you there。


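095:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p95 56_Long Short-Term Memory Networks (LSTM).zh_en -BV1eu4m1F7oz_p95-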
In this section, we're going to introduce the more complex versions of recurrent neural networks,
such as long short-term memory, or LSTM, which we have listed here,
as well as gated recurrent units, or GRUs, which are going to be a somewhat simpler version of
the same concept as LSTMs, as well as other subjects in regards to recurrent neural networks,
which you'll see as we introduce the learning objectives right now。
So, our learning goals for this section。We're going to cover LSTMs and how they can help to solve that long-term memory problem that we discussed with recurrent neural nets。


We'll discuss gated recurrent units or GRUs, which are another solution to that long term memory problem。
That's not quite as sophisticated as LSTMs, but it's going to be more efficient to train and often has similar performance。

We'll discuss sequence to sequence models or trying to predict another sequence given a certain sequence。
which will be powerful for language translation and helping us to understand how perhaps words are pieced together or sequences are pieced together that may be different lengths but related to one another。

And then finally, we're going to cover some common enterprise applications of LSTM models that you may look to use yourself within the workplace。


Now, we discussed how the matrices that we're using for recurrent neural nets tend to weaken the signal of those earlier inputs as we get further down that sequence。

And what we're ultimately going to need is some structure that allows us to keep some portions unchanged over many steps to maintain information from earlier in the sequence。

And this is going to be the problem that our long, short term memory recurrent neural nets will go about addressing。

Now, the way that the LSTM accomplishes this is that it makes remembering easy, with a somewhat more complicated update mechanism for defining our recurrent neural net's current state。

Now by default, we already know that LSTM should remember information from that last step。
the same as with recurrent neural nets。

But on top of that, rather than keeping or adjusting past information。
we have more flexibility in retaining or forgetting a large portion from those prior steps。

Besides just that last step。

Now, LSTMs are just going to be a special kind of recurrent neural network,
and they were invented back in 1997。

They're still called state of the art, though, because although the concept is a bit older,
the computing power that actually makes them applicable is new。


Now the idea behind it is that it's going to add in an explicit memory unit。

And the key to that unit is that it adds on a few additional gate units。
which you can think of as gates that allow for information to be passed along and how long we will continue to allow that to stay in memory。



With that in mind, the cell will have an input gate,
which given a certain value will tell the system to store that value in memory。


We'll have quote unquote, a forget gate, which, again。
given a certain value will determine whether that information will be removed from memory。


And then finally, we're going to have our output gate。
which is going to fire off the response to move the current hidden unit forward within our network。

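096:IBM "Machine Learning (Unsupervised Learning, Deep Learning and Reinforcement Learning, Capstone Project) | machine learning" Chinese-English subtitles p96 57_LSTM Explained.zh_en -BV1eu4m1F7oz_p96-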
Now here we have our LSTM diagram, and if we think back to our unrolled recurrent neural net,
you recall we unrolled our original diagram。This is going to be a single cell from that unrolled LSTM。
So we can imagine that we are working with some sequence。
And we're going to have here just an input of a single word from that full sequence of words。
Now I know this looks complex, which is what we promised, but let's start off explaining how it works,
and I promise by the end of this it'll make a bit more sense。So we have our input, which is, again,
just going to be a single word as we're working with that unrolled version。
Whether that's a single word or a single time step, we are passing in that input。
We also have this C sub t, which is our cell state at time t,
which will be new and wasn't a part of our recurrent neural net。

We're also going to have that hidden state, which is something similar to what we saw in the recurrent neural net。
and this will also feed into the next unit along with that cell state。


And then we have our output, which will be the same as the h sub t that we are passing along to the next cell。
And this is similar to that recurrent neural net, where each one of the cells will have an output,
but what will ultimately matter will be that final output from that final cell of our unrolled version, where we have the information from all of our inputs within our sequence。


So let's focus now on this new cell state that we're working with。

The cell state is going to get updated in two stages。We have our

forget gate, whose purpose is to give us an easy mechanism to decide what information from that prior cell state
to forget, based on the current input coming in。

And then we have the add new information portion, which tells us what new information is worth maintaining。

So let's start off with that mechanism built to help us decide what we should forget。

So, here, our cell is looking at the value h sub t minus 1, so that input h sub t minus 1 from the prior cell,
as well as our new input x sub t。So the gate is, again,
based off of the previous output, that h sub t minus 1, as well as the current input, which is x sub t。
We concatenate these two vectors together, the x sub t and the h sub t minus 1,

as we see here, so we have a single vector。And once we have that single vector, looking back at our equation for f sub t,
we see that we take that single vector, multiply it through by the weight matrix W sub f,
and pass that through a sigmoid function,
so it'll output values between 0 and 1。

Now let's walk through the portion built for adding in new information。

So that's this add-in-new-information portion, and here we are again calculating that same function, a sigmoid of a weight matrix
multiplied by the concatenation of our input x sub t and h sub t minus 1, here, of course,
just learning new weights。 So now we're working with W sub i rather than W sub f, but again
outputting some value between 0 and 1 as we pass that through the sigmoid。

With that, we're also going to be computing at this point a tanh of another weight matrix
multiplied by a concatenation of, again,
that input x sub t and h sub t minus 1。We pass these values through a tanh activation,
so we'll have resulting values between negative 1 and 1。
And then if you look just above our sigmoid and tanh functions that we just walked through,
we have a multiplication, which is meant to multiply those two values together。
The idea being that the tanh output is the actual information you are deciding whether or not to add on,
and then that sigmoid between 0 and 1 will tell you what portion of that new information we would want to add on。
So if it's close to one, we add on nearly all of that information; if it's close to zero,
we don't add on very much of that new information。

Now, if we've been looking at this diagram throughout。
we would notice that the arrows from both our forget portion。
as well as our add on portion that we just discussed will both lead up to our cell state that we're trying to compute。
And our new cell state then will be a function of each of these outputs。

The f sub t that we calculated, which is, again, that value between 0 and 1,
will tell us how much of that old cell state to keep and how much to forget, since we're multiplying it by some value between 0 and 1。

And then from there, we can add on the output of our last calculation, that tanh and sigmoid multiplied together,
to decide how much new information to add on。 So that's going to be that addition to ultimately get to our new cell state C sub t,
which, again, is just going to be f sub t, some value between 0 and 1,
multiplied by our prior cell state, plus i sub t
multiplied by that candidate cell state (the tanh output) that we just calculated, to figure out how much extra information to add on。
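Collected in one place, the cell state update described in this video can be sketched as the equations below. This follows the common textbook formulation: the bias terms b and the name W_C for the tanh weight matrix are assumptions, since the lecture only names the forget and input weight matrices explicitly, and the tilde marks the candidate values produced by the tanh.

$$
\begin{aligned}
f_t &= \sigma\left(W_f \, [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \, [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \, [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\end{aligned}
$$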


Now, let's close out by looking at that output。 So we have that similar function again,
of taking the sigmoid of some weight matrix。 And this time, that weight matrix is W sub o,
so not the same weight matrix, multiplied by the concatenation of h sub t minus 1 and x sub t。

And then that value will be multiplied by the tanh of our new updated cell state,
that cell state that we just calculated。

And again, we just computed this C sub t, so there are no new weights needed here;
we just need to pass that through the tanh activation。And then multiplying those two together,
we get our h sub t, so our hidden state at the current cell,
and that's going to be both the output value, which we see here at the top。
as well as the input or that Ht minus1 as we move along to the next cell。
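
The gate calculations walked through above can be put together in a minimal NumPy sketch of a single LSTM step. This is an illustration under assumptions, not the course's code: the function name `lstm_step`, the dictionary layout of the weights, and the toy sizes are all made up for the example, and the equations follow the standard textbook formulation (forget, input, candidate, output).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W and b are dicts keyed by gate: 'f', 'i', 'c', 'o'."""
    concat = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])        # forget gate, values in (0, 1)
    i_t = sigmoid(W["i"] @ concat + b["i"])        # input gate, values in (0, 1)
    c_tilde = np.tanh(W["c"] @ concat + b["c"])    # candidate new information
    c_t = f_t * c_prev + i_t * c_tilde             # new cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])        # output gate
    h_t = o_t * np.tanh(c_t)                       # new hidden state (also the output)
    return h_t, c_t

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):           # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, b)              # h and c persist across steps
```

Note that the four weight matrices are separate, which is where the memory cost mentioned in the lecture comes from.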

And we see that process here, how that cell state and the h sub t's persist as we input each value in the sequence.
And as we've seen, the LSTM is going to require a good number of parameters, and thus a lot of memory, in order to compute this LSTM that gives us this longer-term memory, deciding what to keep and what to forget.
In the next video, we're going to start off by discussing GRUs,
which may not always be quite as accurate, but usually give similar results while using much less memory than the parameters of the LSTM need.
All right, I'll see you there.


097:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p97 58_门控循环单元(GRU).zh_en -BV1eu4m1F7oz_p97-

Now let's discuss the gated recurrent unit, or GRU. Now, just to note,
we're not going to get as much into the nitty-gritty of the GRU as we did with the LSTM,
but we will highlight that it has similar functionality overall to the LSTM. Now,
what makes the difference? One major difference is the lack of a cell state.
We saw that cell state throughout our LSTM; here, as we see in the diagram,
the GRU just has that past hidden state, and that hidden state allows for persistence as well as deciding what will be updated and what will be forgotten.
And as mentioned in the prior video, ultimately we can think of the GRU as a simpler version of the LSTM:
it still accomplishes that same functionality of having a longer-term memory than our vanilla recurrent neural net,
it just has fewer weights that we'll have to keep in memory throughout. So in our GRU,
we're going to have our reset gate. If we look at the diagram, where we see that r sub t within the box we just highlighted, it takes in the past hidden state h sub t minus 1 as well as x sub t and passes them through a sigmoid in order to figure out what's going to be reset.
We then have our update gate. If we look at z sub t, we can see that, again,
this is a combination of h sub t minus 1 and our input x sub t passed through a sigmoid.
There's a lot more to that internal state, but again,
the idea remains the same: some type of functionality decides what we remember, maintaining information from the past,
and another portion of the cell updates it with the new information.
So the question arises: shall we use an LSTM or a GRU?
LSTMs are a bit more complex and may therefore be able to find more complicated patterns.
And of course, on the other side, GRUs are a bit simpler and therefore quicker to train.
In general, GRUs will perform just about as well as LSTMs with that shorter training time,
especially when we're working with smaller data sets. And luckily, in Keras,
if you're trying to decide whether to use an LSTM or a GRU,
all we need to do is call that layer type, so it isn't too complicated to swap between the two and plug and play.
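
To make the "fewer weights" point concrete, here is a rough parameter count, assuming the textbook formulations: four weight blocks for the LSTM (forget, input, candidate, output) and three for the GRU (reset, update, candidate), each acting on the concatenation of the previous hidden state and the input. The helper name and layer sizes are hypothetical, and a given library's exact count can differ slightly (for example, through extra bias terms).

```python
def recurrent_param_count(n_gates, hidden_units, input_dim):
    """Each gate has a weight matrix over [h_{t-1}, x_t] plus a bias vector."""
    per_gate = hidden_units * (hidden_units + input_dim) + hidden_units
    return n_gates * per_gate

hidden_units, input_dim = 64, 32       # illustrative sizes only
lstm_params = recurrent_param_count(4, hidden_units, input_dim)
gru_params = recurrent_param_count(3, hidden_units, input_dim)

print(f"LSTM: {lstm_params} parameters, GRU: {gru_params} parameters")
```

Under these assumptions the GRU needs three quarters of the LSTM's parameters for the same hidden size.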
Now I want to discuss another concept. We're moving away from the LSTM and GRU,
but it will build on the idea of recurrent neural nets.
I want to discuss this concept of seq2seq, or sequence to sequence, which is meant to convert
a sequence from one domain, say English, to some other domain, such as French. And thus,
given the example I just gave, it's going to be very powerful for machine translation.
Think about how our recurrent neural nets work. Recall that, given a sequence,
the words are entered into the network one at a time. And as these words come in,
we get a new, updated hidden state that accounts for all the past information,
and that's what we see above with h1, h2, through h6 as the sentence "the black cat drank milk" is fed through our network.
At the end of our sentence, that final hidden state should have all the information relating to all the words contained within that sequence, within our sentence.
And we can leverage this vector, that hidden state. No matter the size of our sentence,
if we're just looking at h6, that final hidden state,
it's just going to be a vector the size of the state vector.
We can take that final hidden state, which contains all the information for that given sequence, here the English sentence.
And the part that gathers that information from the English words is what we call the encoder portion of an encoder-decoder model,
which is going to be the crux of how we work with seq2seq modeling. So we have,
again, these words coming in, and this is towards the end of the sentence, "drank milk",
and then we actually have a term for end of sentence. And from here,
we have our hidden state from the encoder portion. And now for the decoder portion:
it can work as a language model, moving forward and just trying to predict the next word,
and it's going to use as its initial state what was output from the encoder portion.
This makes sure that it's not just a language model spitting out French words,
but one that's spitting out new French words in a sequence conditional on that English sentence.

098:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p98 59_GRU详解.zh_en -BV1eu4m1F7oz_p98-
Now, we went over the basics of the encoder-decoder model for sequence-to-sequence learning.
But maybe you noticed that, the way this is currently constructed,
the model produces a single word at a time,
and each word that's produced is conditional on whatever the prior word that was produced was.

With that in mind, if at any point it produces one wrong word,
we may end up on a completely different trajectory, and that would throw off the entire translation, the entire sequence that we're trying to predict.

Now, again, the way it's currently constructed is that it will continue to predict new words until it hits this end-of-sentence term.
Now, one solution to the problem we just discussed, that our trajectory can be thrown off,
is beam search: produce multiple different sentences through to the end-of-sentence term
and then see which one of those full sentences or full sequences is the most likely.
So we can imagine that each of the branches we're seeing built out here would lead to some end-of-sentence term,
and once we have some predetermined number of possible sentences,
possible sequences, we can then probabilistically determine which one of these full sentences is the most likely.
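
As a rough illustration of keeping several candidate sentences alive and picking the most likely full sequence, here is a toy beam search. Everything in it is made up for the example: the probability table stands in for a real decoder, and it conditions only on the previous word, whereas a real decoder also conditions on the encoder state and the full prefix.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word.
NEXT_WORD_PROBS = {
    "<s>": {"le": 0.6, "un": 0.4},
    "le": {"chat": 0.55, "chien": 0.45},
    "un": {"chat": 0.9, "chien": 0.1},
    "chat": {"</s>": 1.0},
    "chien": {"</s>": 1.0},
}

def beam_search(beam_width, max_len=10):
    beams = [(0.0, ["<s>"])]            # (log probability, words so far)
    finished = []
    for _ in range(max_len):
        # Extend every live sequence by every possible next word.
        candidates = []
        for log_p, words in beams:
            for word, p in NEXT_WORD_PROBS[words[-1]].items():
                candidates.append((log_p + math.log(p), words + [word]))
        # Keep only the most likely `beam_width` candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            if cand[1][-1] == "</s>":
                finished.append(cand)   # reached the end-of-sentence term
            else:
                beams.append(cand)
        if not beams:
            break
    return max(finished, key=lambda c: c[0])

greedy = beam_search(beam_width=1)      # one word at a time, no alternatives
best = beam_search(beam_width=2)        # keeps two candidate sentences alive
```

In this toy table the greedy path commits to "le" because it is the likeliest first word, while the wider beam finds that the sentence starting with "un" is more likely overall.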

So, the way that our encoder-decoder model is currently built out,
our decoder works with that hidden state from the encoder, which has information about the entire sentence.
So the final hidden state is used as the initializer for the decoder,
whether we're using the beam search that we just discussed, where we can look at multiple different sentences and decide on the most likely,
or just looking at one. With that in mind, at each decoder time step,
as we go through producing each one of the different words,
each one depends on that same encoder embedding and has no relationship
to where in the sentence we currently are within the decoder: in other words,
are we at the translation for "the cat", or are we at the translation for "drank milk"?
We'd want some way of, rather than looking at the entire sentence within the encoder,
only looking at those terms that are similar to the terms where we're at within the decoder.
And attention is going to solve that problem and allow us to look specifically at the terms that matter.

Now, with attention, our goal is to consider the words that are most similar to our current position in our sentence generation.
So rather than just using the hidden state from the entire sentence, so again, just using h6,
that final hidden state,

we can actually use the hidden state from each one of the different terms. So how does this work?
We know that each word in either language will be represented by some type of vector.

So what we can do, for each word within our decoder,
is look at how close that vector in our decoder, in that different language,
is to each word within our encoder. We have that function s of i, j,
which we can just think of as some type of function
that gives a similarity measure between the decoder state i and the encoder state j,
so that we know how similar each term is. So you see, this is a function mapping all the encoder terms to each one of our single decoder terms, to decide which one is the most similar.

This similarity function will then weight the different embeddings,
each one of these hidden states from the encoder, to give us a better embedding for the prediction of that next word.
So if the second term in the encoder is the closest, in regards to the vector distance, that s measure,
to the decoder term where we currently are, then it will have a much higher weight
than terms 1, 3, 4, and 5, with all the weights adding to one.
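
A small sketch of that weighting, under stated assumptions: the lecture only says "some type of function s", so the dot product used for similarity and the softmax used to make the weights sum to one are choices made for this example, and the vectors are made-up toy values.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weight encoder hidden states by their similarity to the decoder state."""
    scores = encoder_states @ decoder_state           # s(i, j) as a dot product (assumption)
    scores = scores - scores.max()                    # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights add to one
    context = weights @ encoder_states                # weighted mix of encoder states
    return weights, context

# Five toy encoder hidden states (one per input term), three dimensions each.
encoder_states = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
])
decoder_state = np.array([0.0, 1.0, 0.0])             # current decoder position

weights, context = attention_context(decoder_state, encoder_states)
```

Here the second encoder term is the closest to the decoder state, so it receives the largest weight, matching the situation described above.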

And this will better allow you to translate between languages where the ordering of the words is often different. So that you know,
okay, even though we started with "the cat", and the French
terms would have the cat at the end of the sentence, would have that noun at the end of the sentence,
we can still see how close each one of those final terms is to the terms in our encoder model.


099:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p99 60_自编码器简介.zh_en -BV1eu4m1F7oz_p99-
In this section, we're going to introduce our first deep learning model that's going to be used for unsupervised learning: autoencoders.
Now, let's discuss the learning goals for this section. In this section,
we'll start off with a review of non-deep-learning-based techniques for data representation,
such as PCA, and how we can use them to condense our original data set into a smaller representation of that same data set.
After that, we will discuss how autoencoders leverage neural networks to also come up with lower dimensional representations of our data.
Then finally, we'll discuss a bit about how trained autoencoders can be used to actually generate images.
Now, autoencoders will be our first time looking at deep learning from an unsupervised learning vantage point.
The goal of autoencoders is to use the hidden layers in our neural networks to find a means of decomposing and then recreating our data,
and we'll see how this is done in just a bit. This proves to be powerful for things such as dimensionality reduction and fighting the curse of dimensionality,
as we've seen in prior courses when we were working with PCA. With that,
this dimensionality reduction can be powerful for preprocessing for classification and for identifying only the essential elements of our input data while filtering out the noise within our data set.

Now, as motivation, let's say we want to find whether two images are similar to one another.
We have two pictures here, each of a kangaroo, and we want to know whether these images are similar to one another.

Now one option, which I'll say right off the bat we probably don't want to use,
is to look at the pixel-wise distance between these two images.
So if we look at the image on the left and we look at the pixels there in the top right corner,
compared to the pixels in the top right corner of the right image,
we see that these two are clearly not the same, and we can even see with our own eyes that these are very different if we are just looking at these particular pixels.
Our problem here is that if we just look at the pixels as a whole,
we'll only be able to see the placement of the color scheme, the brightness, etc.,
and not the actual content of our image. So the goal would be to find some type of representation that captures that actual content within the image.
So if we think about the image here on the left, we're looking at brown kangaroo fur, kangaroo ears,
a kangaroo nose, and a beige background, and that's the content within that left image.
In that right image, we have brown kangaroo fur, kangaroo ears,
a kangaroo nose, and a green background. So many similarities, and maybe just a difference in background.
And the idea would be that, if we think about what we've learned with deep learning,
how it's able to find each of the features that make up the image,
then with something like autoencoders, which leverage these deep neural networks,
we should be able to capture these essences.

Now, before getting into the actual method of working with autoencoders,
I'd like to introduce a business application for what we just discussed and how autoencoders can be used in business practice.
Think of electronic components within a production line.
Some defects might be imperceptible to the human eye or difficult to detect at scale:
if we're looking at images of each of these chips, given the number of pixels in each of these images and the millions of chips we have to look across, it may be too difficult to scale.

We may then want to reduce the dimensionality of those pixels, so we can look at these defects at a lower dimensionality when we're comparing whether or not the chips are similar to one another,
whether or not there's a defect. So what's one approach to this?
One approach to being able to detect these differences at scale would be to use PCA to reduce the dimensionality of our features,
which here are going to be pixels, so that each image is described by some linear combination of our principal components.
We start off here with a pixel vector; it's going to be RGB, so we have the three channels, and for each one of those channels we have the height and width.
We use PCA, and again, we're able to create a linear combination of our principal components, if you recall from courses past, and reduce the number of dimensions that we are working with.

Now, just a quick reminder of how PCA works before we get into autoencoders, to motivate autoencoders.
PCA works by reducing the dimensions of our original data,
and the goal within PCA is to find the dimensions that capture the most variance from our original data.
So as an example, if we are working with just two dimensions,
you can think of this again as just two features. The directions of the arrows represent the principal components, or the directions representing the most variance within our data,
and the lengths of these arrows correspond to the amount of variance in the original data that is explained.
So we see the diagonal pointing to the upper right accounting for more of our overall variance
than the one pointing to our upper left. And we see that each one of these arrows is actually composed of a combination of both x1 and x2.
So we're not just saying which single axis accounts for the most variation,
but rather creating new axes from some combination of x1 and x2. Now,
there are going to be limits to working with PCA, which is why we'd want to move to something like autoencoders.
The main thing is that the learned features have to be some type of linear combination of our original features,
when in reality there may be some complex nonlinear relationship between those original features and the best lower dimensional representation of those features.
And finally, how we define the best representation can be different depending on what our problem is.
So that closes out our video motivating the use of autoencoders. In the next video, we will pick up and dive into how autoencoders actually work. All right,
I'll see you there.

100:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p100 61_自编码器.zh_en -BV1eu4m1F7oz_p100-
So autoencoders are a neural network architecture that forces the learning of some lower dimensional representation of our data, and they're commonly used for images.

So the way that autoencoders work, step by step, is that they have the same value as both input and output, as we see in the image.
We then feed this input through our encoder network, represented here as the densely connected nodes in blue.

From this encoding step, we're going to be able to produce the lower dimensional embedding of our original data.
And that's going to be what we're actually looking for,
that lower dimensional representation of our data,
and it will be here in the middle of our network. And then finally,
that embedding is fed through the decoder network,
which we have there through to our final output, which again is going to be the same as our input values.

That decoder portion is meant to create a reconstructed version of the original data.

And once we have that reconstructed version of the data,
we can compute the loss between that reconstructed version and the original input,
and use that to train our actual network. So we take that loss function, as we do with any neural net, and use it to update the weights within our network,
and that's going to be the feedforward and backpropagation steps that you're already used to.

And the result is in that middle portion.
You may have noticed the number of nodes shrinking:
we start with three nodes, then two nodes, and it could be 100 down to 10,
whatever it is, as the number of nodes shrinks toward the middle, and then
expands back out to that reconstructed version in the decoder step.
In that middle, where we have just two nodes, we're going to have that lower dimensional representation of our original data.
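
The encode, decode, compute loss, update weights loop described above can be sketched in plain NumPy. This is a deliberately minimal illustration, not the course's model: it uses a single linear layer each for the encoder and decoder (three nodes down to two and back to three), synthetic data, and hand-written gradients; the learning rate and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 3 dimensions that really only vary along 2 directions.
latent_true = rng.normal(size=(200, 2))
X = latent_true @ rng.normal(size=(2, 3))

n_in, n_hidden = 3, 2
W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder weights

def forward(X):
    Z = X @ W_enc          # lower dimensional embedding (the middle of the network)
    X_hat = Z @ W_dec      # reconstruction of the input
    return Z, X_hat

def mse(X, X_hat):
    return float(np.mean((X - X_hat) ** 2))

initial_loss = mse(X, forward(X)[1])

lr = 0.02
for _ in range(3000):
    Z, X_hat = forward(X)                      # feedforward
    grad_out = 2.0 * (X_hat - X) / X.size      # d(loss)/d(X_hat)
    grad_W_dec = Z.T @ grad_out                # backpropagate to decoder weights
    grad_W_enc = X.T @ (grad_out @ W_dec.T)    # backpropagate to encoder weights
    W_dec -= lr * grad_W_dec
    W_enc -= lr * grad_W_enc

final_loss = mse(X, forward(X)[1])
print(f"reconstruction loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The input serves as its own target, so no labels are needed; the two-column `Z` is the lower dimensional representation.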

And we can use an autoencoder to find image similarity:
we feed two images through the encoder network,
and we can calculate the similarity score of their latent vectors, of this lower dimensional representation.
So that allows us to actually see how similar two images are, in a version that we'd be able to scale because it wouldn't require as much memory.
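
As a sketch of comparing images through their latent vectors: the `encode` function below is a hypothetical stand-in (a fixed random linear projection) for a trained encoder network, and cosine similarity is one possible choice of similarity score; the lecture does not specify which score is used.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(784, 16))     # hypothetical encoder weights: 784 pixels -> 16 latent values

def encode(image):
    """Stand-in for a trained encoder: flatten the image and project it down."""
    return image.reshape(-1) @ W_enc

image_a = rng.random((28, 28))
image_b = np.clip(image_a + rng.normal(scale=0.01, size=(28, 28)), 0.0, 1.0)  # near copy of a
image_c = rng.random((28, 28))                                                # unrelated image

sim_ab = cosine_similarity(encode(image_a), encode(image_b))
sim_ac = cosine_similarity(encode(image_a), encode(image_c))
```

Only the 16-value latent vectors need to be stored and compared, rather than all 784 pixels per image.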

And we can always use the decoder portion of the network to map those vectors from our lower dimensional space
back to the full dimensional space of our images. The point being that this gives us a means of compressing and then decompressing our data.

Now, another use of the decoder model is to actually work as a generative model.
In order to do this properly, we would probably want to work with variational autoencoders,
which we'll discuss in the next lesson. But even with variational autoencoders,
using them as a generative model isn't commonly done, due to the fact that in order to get reasonable results,
some deep convolutional architecture is generally going to be required. And even with that,
generally speaking, the results of that image generation will be inferior to those of GANs,
which we'll learn about not in the next lesson, but in the lesson afterwards.
So autoencoders can have a wide variety of enterprise applications.
They can be powerful for preprocessing and reducing the dimensionality of our data prior to learning some classification model.
They can be powerful for sending information in a compressed form, as well as retrieving such information.
They may be used for anomaly detection, as we discussed with the chip images.
They can help with machine translation, as we're generally working in a very high dimensional space if we're doing machine translation.
They can be powerful for image-related applications such as generating images, denoising,
or taking fuzzier images and sharpening them, as well as processing and compressing, as we discussed.
And for drug discovery, popularity prediction of social media posts, and sound and music synthesis, they can help find the key components in each one of these
different domains, the components that may be key to a model of drug discovery,
the popularity of a social media post, or sound and music synthesis. One last note:
while autoencoders can use deep layers, they are often going to be trained with just a single layer
each for the encoding and decoding step. An example of working with a deeper network
is sparse autoencoders, which essentially allow for those deeper networks,
but only certain nodes will be firing within those networks,
and this has been used successfully in things such as recommender systems.

So just to recap. In this section, we discussed non-deep-learning-based techniques for data representation and reminded ourselves how PCA uses a linear combination of our original features to come up with a representation that maintains as much of the variance as possible from our original data set.
We then discussed how autoencoders work, with the encoder portion coming up with a condensed version of our data,
which can then be reconstructed using our decoder network. And then finally,
we discussed a bit about how trained autoencoders can be used to generate images,
and that's especially true using something called variational autoencoders,
which we'll discuss in this upcoming video. All right, I'll see you there.

101:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p101 62_自编码器笔记本(选修部分)第1部分.zh_en -BV1eu4m1F7oz_p101-

Welcome to our lab here on autoencoders. In this lab, we're going to be going over dimensionality reduction techniques on the MNIST data set,
which is those handwritten digits that we had seen in earlier notebooks.
We're going to perform dimensionality reduction using first PCA as a baseline,
which we learned in earlier courses, and then, as just discussed in lecture,
autoencoders and variational autoencoders.

With each one of these models, we're going to use the appropriate scoring metrics so that we can compare the performance across each.
So, a quick reminder of what the MNIST data set is: it's handwritten digits between zero and nine.
We have 70,000 different handwritten digits, all in black and white,
and when we run this traditionally, we split this into 60,000 training images and 10,000 validation images.
This is such a common data set that it's actually built into Keras. So from keras.datasets
we pull out the mnist functionality, and with mnist we can call load_data. Once we call that method,
we get the output of our training set, x_train and y_train, and our test set, x_test and y_test.
I run this, and we can quickly look at the shape here.
We're working with 28 by 28 images. And if you recall, because it's black and white,
this won't be three dimensional. If it had RGB or some color scheme to it,
then it would be three dimensions, such as 28 by 28 by 3. But here it's all black and white,
so our pixels will all be on the grayscale and it will be 28 by 28.
And if we look at just a single image,

we can see that each one of these pixels is some number between 0 and 255,
representing the lightness or darkness of that pixel.

This takes a lot of space, so I'm just going to clear that out.
Now we want to make sure that they're all between 0 and 1, so we're just going to divide here by 255.

If you think about it, the max number is 255; we divide by 255, and the max number is now one.

We're now going to use PCA as a baseline against which we can compare our deep learning models,
which we'll do later on.

So for PCA, we're going to have to treat each image like a row of data,
so we're going to have to flatten out those 28 by 28 images.

In order to do that, all we have to do is call x_train.reshape.
We saw that the original shape was 60,000 by 28 by 28. We keep that 60,000, because we still have 60,000 different examples.
And then we just take a product of that 28 by 28 here:
taking the shape from index 1 onward gives that 28
and then the second 28, and np.prod is the product of the two,
so we'll end up with 60,000 by 784. And then the same for the test set: we'll have 10,000 by 784.
And we see those results here.
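
The scaling and flattening steps just described can be sketched as follows. To keep the example self-contained, small random arrays stand in for the real MNIST arrays (100 training and 20 test images instead of 60,000 and 10,000); the variable names mirror the ones spoken in the lab but are otherwise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the MNIST arrays: integer pixel values from 0 to 255.
x_train = rng.integers(0, 256, size=(100, 28, 28)).astype("float32")
x_test = rng.integers(0, 256, size=(20, 28, 28)).astype("float32")

# Scale pixel values into the range [0, 1].
x_train /= 255.0
x_test /= 255.0

# Flatten each 28 by 28 image into a single row of 784 values.
n_pixels = int(np.prod(x_train.shape[1:]))            # 28 * 28 = 784
x_train_flat = x_train.reshape((len(x_train), n_pixels))
x_test_flat = x_test.reshape((len(x_test), n_pixels))

print(x_train_flat.shape, x_test_flat.shape)
```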

Now, just a quick one-sentence reminder on how PCA works:
PCA will do a matrix decomposition of the data that we're working with

to find the eigenvectors, and these eigenvectors end up being the principal components of our data, or those latent features that describe the maximum amount of variance in the data.
We've been talking about these latent features, that lower dimensional space that hopefully represents the important portions of each one of the pieces of data that we're working with.

So just to ensure that it's scaled (we already got it between 0 and 1),
we will run the MinMaxScaler here. We call MinMaxScaler, which we imported from sklearn.preprocessing,
and we fit it on our x_train_flat, that flattened data that we just produced.

And then we call transform on that x_train_flat, and now we have our x_train_scaled.

Now, the function that we're going to use for PCA is going to take in some data,
and that will end up being this x_train_scaled, to start at least. And we pass the number of components:
how much do we want to reduce our data set by? That's going to be something that we determine up front, so we can say we only want two dimensions,
five dimensions, and so on. Then, going into the actual function, we call PCA.
That model needs the hyperparameter of how many components we actually want,
and we say that we want the number of components that we defined here above in the function,
and then we have our PCA model, which has been initialized.
We then fit that model to our data set by calling pca.fit
on the X data that we pass into our function. And now we have this fit PCA,
we have that model fit to our data. Then, after we run this, we can quickly print out
how much of the overall variance was explained with the number of components
that we passed in through our function. So we say variance explained with, say, two components,
and then we just print out the actual amount explained, and that's done by calling that fit model
and getting the explained variance ratio, which we'll show in just a second what that actually looks like.
But quickly, that's going to be the amount of variance explained by each one of our different components.
We take that sum and we see the overall amount explained, where the maximum value, summing them all together, should be equal to one.
We then return our fit model, so that we can use it once our function has run,
as well as our transformed data. So that's going to be our data set reduced down to our new number of components,
that new number of dimensions. So we're going to do this with 784 dimensions,
so that's going to be all the data. And when I run this, recall,
it's going to print out the amount of variance explained.
I want you to think to yourself about how much of the variance should be explained. We run this.
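
A function of the shape described above can be sketched like this. Note the swap: the lab uses scikit-learn's PCA, whereas this sketch computes the same quantities with NumPy's SVD so that it is self-contained; the function name, the dictionary it returns, and the random stand-in data are all assumptions for illustration.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Fit PCA via SVD, print the variance explained, return (model, transformed data)."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S ** 2 / (len(X) - 1)
    ratio = explained_variance / explained_variance.sum()

    components = Vt[:n_components]
    transformed = X_centered @ components.T
    print(f"Variance explained with {n_components} components: "
          f"{ratio[:n_components].sum():.4f}")

    model = {
        "mean": mean,
        "components": components,
        "explained_variance_ratio": ratio[:n_components],
    }
    return model, transformed

rng = np.random.default_rng(0)
X = rng.random((200, 10))                       # stand-in for the scaled, flattened images

model_all, data_all = pca_fit_transform(X, 10)  # keep every dimension
model_2, data_2 = pca_fit_transform(X, 2)       # reduce to two dimensions
```

Keeping every component explains all of the variance, which is the result the lab arrives at with all 784 dimensions.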

And with all 784 components being taken into account, all of the features being taken into account,
we see that 100% of the variance was explained, which makes sense. Now,
just to be clear on what this explained variance ratio attribute of our model is:

if I run this, it's an array that says, for each principal component,
what is the marginal amount of variance explained?
So the first component explained this much of the variance, the second one
this much; I believe it's around 0.09 and 0.07. And we can see that the length of this, given that our PCA model took all the components,

will be 784. And if we actually plot this out, we can take the cumulative sum across all 784 components,
and that's what this cumsum does for the entire array that we just printed out.
We can see how much of the variance is explained as we add on more and more components.

And we can see that we need about 250 components to explain about 90% of our variance there.
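
Reading the number of components off the cumulative sum can be sketched as follows. The ratios here are made-up values for a six-component example, not the MNIST results.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest.
ratios = np.array([0.40, 0.25, 0.15, 0.12, 0.05, 0.03])

cumulative = np.cumsum(ratios)          # variance explained by the first k components

# First position where the running total reaches 90%, converted to a count.
n_for_90 = int(np.argmax(cumulative >= 0.90)) + 1

print(cumulative)
print(f"components needed for 90% of the variance: {n_for_90}")
```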

Now, for visualization purposes, let's try reducing the number of features down to two.
If we reduce the number of features down to two,
then perhaps with those two features we can see whether or not we're able to group together where the ones lie in those two features,
where the twos lie with just those two features, and so on and so forth.

So we run this, and we get our output of both the model and our transformed data.

Now I'm going to get to this in just a second; that's just an example to explain what we have here.
What we want to do is plot out each one of these numerical values that we have, zero through nine,
and we have those labels available to us. And for each one,
we want to plot out, in this new two dimensional space that we have,
where each one of those points lies. Just to ensure that we don't have too large of a scatter plot,
we're only taking the first 250 examples of each. Remember, our x_train has 60,000 examples; that would be quite a dense scatter plot. Then for numbers ranging from 0 through 9,
including 0 and 9 (range(10) doesn't include 10), we're going to create this mask,
which is just asking: does y_train equal the number where we currently are? So we can say here,
let's just start off with y_train equals 0, and we can see, for each example in our full array, whether or not it matches. And that mask is going to be the same length as y_train.

We'll just print that out: 60,000, right? So it says, for each one of the different examples,
is it labeled zero? We're then going to mask our data.
This MNIST data 2, again, is the output from our function, so it's that two dimensional data.
We're only taking the rows from that two dimensional data
where y_train is equal to the current number in our for loop.
And then we're going to take just the first column (we only have two columns, only two features),
and then we're only taking the first 250 examples.
And then we're going to do the same thing to get our y data, our second axis, by saying, again,
I only want the rows where the label is equal to the digit specified.
So starting off at 0, is it equal to 0? And then we want that second column, and again,
only the first 250 examples. We then plot that scatter plot of the x data and the y data, and we label it;
we'll call that legend later on. And once we run this,

we can see, for each one of the different values, whether or not we create clumps in this two dimensional space representing that 0,
1, 2, and so on.
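
The masking loop described above can be sketched with synthetic data. Random labels and random two dimensional points stand in for `y_train` and the PCA output, and the plotting call is left as a comment so the example runs without a plotting library; the names `data_2d` and `points_per_digit` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 5,000 labels from 0 to 9 and matching two dimensional points.
y_train = rng.integers(0, 10, size=5000)
data_2d = rng.normal(size=(5000, 2))

points_per_digit = {}
for digit in range(10):                     # 0 through 9; range(10) excludes 10
    mask = y_train == digit                 # True where the label matches this digit
    x_data = data_2d[mask, 0][:250]         # first column, first 250 matching rows
    y_data = data_2d[mask, 1][:250]         # second column, first 250 matching rows
    points_per_digit[digit] = (x_data, y_data)
    # With matplotlib available: plt.scatter(x_data, y_data, label=digit)

print({digit: len(xs) for digit, (xs, _) in points_per_digit.items()})
```

The mask has one boolean per example, so it is always the same length as the label array.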

And we can see, for example, that the ones here in orange are already disentangled and grouped together on their own.
We can look at the nines, which are light blue here, and the fours, which are purple,
and we can see that those are fairly close to one another, which makes sense
given the way that fours and nines are drawn. And you can continue to explore where groups are able to separate themselves out or where there seems to be some type of overlap.

But we can already see that these latent features within PCA
are learning somewhat how to disentangle our features, and perhaps a neural network could help even further in doing this.

So now we want to score our PCA again, we're going to want to come up with some actual function to decide whether or not we are improving how well we are performing in regards to working with the PCA versus working with our network models。

So the number that we're going to be using, the latent features。
the amount of latent features we're going to be working with, here's going to be 64。
So we call that NNS PCA function that we defined earlier。We call 64 dimensions, we run this。
we now have our model, as well as besides that model, also the actual transform data。


We then take the X test flat that we defined earlier and scale it, so that it's on the same scaling we used above: this S was fit for transforming our X train, so we use that same S to transform our X test. We then use the PCA 64 that we just fit on our training set to transform our X test scaled. That gives us X test flat in 64 dimensions.

And then we're going to reconstruct that same image back to the original dimensionality, the 784 dimensions that we're working with. We do that by calling inverse transform on PCA 64, which is our model, passing in the 64-dimensional X test flat 64 that we just produced by calling the PCA 64 transform. So this is the reconstruction we're after: we reduce the number of dimensions and then go back to the original number of dimensions to reconstruct our original image.
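The scale, transform, and inverse-transform steps described above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the arrays are random stand-ins for the flattened images, the variable names are hypothetical, and the PCA is written out with an SVD rather than using the notebook's own scaler and PCA objects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the flattened train/test images (784 pixels each).
x_train_flat = rng.random((500, 784))
x_test_flat = rng.random((100, 784))

# Fit the scaling on the training set only, then reuse it for the test set.
mean = x_train_flat.mean(axis=0)
std = x_train_flat.std(axis=0)
std[std == 0] = 1.0  # guard against constant pixels
x_train_scaled = (x_train_flat - mean) / std
x_test_scaled = (x_test_flat - mean) / std

# "Fit" PCA on the training set: the principal axes are the top
# right-singular vectors of the centered training data.
n_components = 64
center = x_train_scaled.mean(axis=0)
_, _, vt = np.linalg.svd(x_train_scaled - center, full_matrices=False)
components = vt[:n_components]  # shape (64, 784)

# transform: project the test set down to 64 dimensions.
x_test_flat_64 = (x_test_scaled - center) @ components.T
# inverse transform: map the 64-dimensional codes back to 784 dimensions.
x_test_reconstructed = x_test_flat_64 @ components + center

print(x_test_flat_64.shape)        # (100, 64)
print(x_test_reconstructed.shape)  # (100, 784)
```

The key design point is that both the scaling statistics and the principal axes come from the training set; the test set is only ever transformed with them.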

We can look at the shape here and, as expected given that this is the test set, there are 10,000 different samples, each reconstructed back to the original dimensionality of 784. We then just call these true and reconstructed, so it's clear what we pass into our function: that's our X test scaled versus our reconstructed data.

We then come up with the mean squared error of that reconstruction. So we just take true minus reconstructed, squared, for all the different pixels, and average out the total error. And we can call that on the true and reconstructed that we just defined above.
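A minimal sketch of such a scoring function follows. The name `mse_reconstruction` and the choice to average over every pixel of every sample are assumptions; the notebook's own function may normalize differently, so absolute values from this sketch need not match the figures quoted in the video.

```python
import numpy as np

def mse_reconstruction(true, reconstructed):
    """Mean of the squared pixel-wise differences over all samples and pixels."""
    true = np.asarray(true, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean((true - reconstructed) ** 2))

true = np.array([[0.0, 1.0], [2.0, 3.0]])
reconstructed = np.array([[0.0, 2.0], [2.0, 1.0]])
print(mse_reconstruction(true, reconstructed))  # 1.25
```

In the toy call, the squared differences are 0, 1, 0, and 4, so the average is 1.25.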

And we see that we have a mean squared error of about 90.5 when we use 64 components for PCA. So that's going to be the baseline we're working with: a mean squared error of about 90.6. That closes out our motivation and building out that baseline. In the next video, we're going to start working with autoencoders, beginning with a simple autoencoder, to see if we can do better than this baseline of 90.6. All right, I'll see you there.


102:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p102 63_自编码器笔记本(选修部分)第2部分.zh_en -BV1eu4m1F7oz_p102-
Welcome back to our lab here on autoencoders. If you recall, in the last video we used PCA to reduce the number of dimensions down to 64, then reconstructed our image and saw how far off the reconstructed image was from the original.

Now we're going to do the same thing, except using autoencoders built with neural nets.

So far, whenever we built out our neural nets, we've been using Keras, and from Keras we've been using the Sequential API, where we just add on layers. That's the simpler way of building out your neural nets.

What we're going to introduce here is the functional API. It's fairly simple as well, and we'll walk through the steps for actually building out more complex architectures.

What it allows for is more flexibility in building out your models. If you think back to our convolutional neural net discussion (we didn't have a notebook there), building some of the more complex architectures such as Inception or ResNet requires something like the functional API: with Inception you concatenate a bunch of different types of layers together, and with ResNet you carry the output of a layer forward to later layers. So it is worth getting the hang of as we talk through it here.

So we're going to import, from tensorflow.keras, this Input and this Dense layer (we'll see how those are used in just a second), as well as Model, which is the key piece for the functional API within Keras.

So the goal when we build out our autoencoder will be to build three different models. First, the full autoencoder: it takes the inputs (remember, the inputs and the outputs are the same), takes that image, and ultimately reconstructs that image at the back end. So it is the encoder and decoder together, where we deconstruct and then reconstruct these images. Second, the encoder portion, which is just the part that takes the inputs and tries to bring them into the latent space. And third, the decoder, which takes that latent space and tries to reconstruct the image from it.

So we're setting our encoding dimension to 64, as we did with PCA, so we'll have that latent space in 64 dimensions. When we use the functional API, we have to define what our input is going to be. This just creates a blank tensor. When we say tensor (and that's where the name TensorFlow comes from), all we're talking about is an array with a certain number of dimensions. Here we're using two dimensions: one is the number of samples, and the other, the one whose shape we define, is how many features we have. If we had kept the images at 28 by 28, then for shape you'd put in 28 by 28, and you'd still leave out the 60,000.

But we're initiating this blank tensor, and we need it in order to define what the flow from inputs to outputs actually is within the functional API.

The reason it's called a functional API, why the term function is involved, becomes clear right here. We create this dense layer, similar to the dense layer we saw before: we call Dense, and we say what the dimension of that dense layer is (how many hidden nodes there will be) and what type of activation we're going to use.

And this portion right here is actually a function, and that function can take a certain input. All we have to do is say that, as input, we want these inputs that we defined here, that tensor. We only have a simple model here, and this is just the encoder portion. To bring all the steps together, we call Model and pass the inputs and the outputs. As long as the inputs and outputs connect, even if we added a bunch more layers in between (which we'll see later on), as long as the first value is able to reach the final value through the steps we defined with the functional API, the model will be able to bring them all together.

So we have our encoder model, which is this model that starts with the inputs and ends with the outputs from this encoded portion.

We're then going to have our decoder model, where again we need our input defined. This time, if we think about the decoder taking that latent space and reconstructing our image, the input should have the dimension of the latent space. So it will take in vectors of length 64, the encoding dim that we have here.

We then pass that through a dense layer, and we want that dense layer to reconstruct our image back up to 784 dimensions. We use the sigmoid activation, and what we pass into that dense layer goes in these parentheses, as you would with a function: we pass in the encoded inputs.

And then our decoder model will just be, again, Model, saying what the input is and what the output is.

And then to define the full model, we first say what its outputs are going to be. If you recall, our decoder model as currently constructed just takes in this input, which is just a blank tensor. Instead, we feed it the output of the encoder model called on our inputs. So we move from the inputs through the encoder model, and the output of the encoder model then becomes the input for our decoder model. The decoder model will then, because of the way we constructed it, output that final dense layer that reconstructs our image.

And then our full model will just be Model with the inputs, which are those initial inputs defined up here, and those final outputs, which we just defined. We've now created all the steps needed: the flow runs from these inputs through the encoder model, then through the decoder model, and outputs this reconstruction.
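The three models described above can be sketched as follows. This is a hedged reconstruction from the narration, not the notebook's code: the variable names are guesses based on the spoken names, and the `relu` activation in the encoder is an assumption, since only the decoder's sigmoid is mentioned.

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

ENCODING_DIM = 64

# Encoder: 784 inputs -> 64-dimensional latent space.
inputs = Input(shape=(784,))
encoded = Dense(ENCODING_DIM, activation="relu")(inputs)  # activation is an assumption
encoder_model = Model(inputs, encoded)

# Decoder: 64-dimensional latent vector -> 784 outputs.
encoded_inputs = Input(shape=(ENCODING_DIM,))
reconstruction = Dense(784, activation="sigmoid")(encoded_inputs)
decoder_model = Model(encoded_inputs, reconstruction)

# Full autoencoder: chain the two models on the original inputs.
outputs = decoder_model(encoder_model(inputs))
full_model = Model(inputs, outputs)

full_model.summary()
```

Because `full_model` reuses `encoder_model` and `decoder_model` rather than copying them, training the full model also updates the weights that the two smaller models hold.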



So we run this, and we now have our full model available to us. We set our model with inputs equals inputs and outputs equals outputs, which is what we already did here. From there, the steps for compiling and fitting the model are the same as with the Sequential API: we call compile, we define our optimizer and our loss function, and then which metrics we want to track. We're going to track accuracy.

And then we run this for just one epoch on our X train flat. Recall that our output is also going to be X train flat: our input and output are the same. We fit the model, running through just one epoch, which will take just a second. Afterwards, we also have the option to look at the summary, so we'll call full model dot summary right after this is done. And three, two, one. Looking at the summary, we can see that it passes from that input of dimension 784 down to 64 dimensions, that being our latent space, and then reconstructs the image.

Now, because of the way Keras works, even though we built out these smaller intermediate models, they were actually trained along the way. The encoder model that we fed into the full model we defined up here has actually been fit to the data, so we can use it to encode our image and output that 64-dimensional representation. So we run this, and we see that we took our X test flat, which is originally 10,000 by 784, and reduced its dimensions down to 64.

And we can look at that and see the values that now represent the 64-dimensional version of the encoded image.

And now that we have this available, our goal is to see what the reconstruction error actually is. We want to use the trained autoencoder to generate reconstructed images, then compute the pixel-wise distance between each reconstructed image and the original, and see how we did compared to our baseline PCA.

So we have our full model, which both encodes and decodes our images, and we can just pass X test flat into it. It has already been trained, so if we call full model dot predict on X test flat, it will try to bring the data down to the latent space (encode it) and then decode it again into what should be exactly the same image, if it were able to do so perfectly. So when we run full model dot predict, it both encodes and decodes, doing that reconstruction step here.

We get our decoded images, and then we can run the MSE reconstruction function that we defined above to see what the actual error is on our decoded images compared to our original X test flat.

So we run this, and we see that it is significantly worse. Recall that we had around 90.6 for the mean squared error with PCA, and now we're at about 346. We could run for more epochs as well, rather than just one epoch, to better fit the model. But even with five epochs, you'll see that it still does a bit worse.

In the next video, we'll see what happens when, instead of using just one hidden layer, we make our network a bit deeper and also run for a few more epochs. Hopefully from there we start to perform better than what we saw with the PCA model.

All right, I'll see you in the next video.

103:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p103 64_自编码器笔记本(选修部分)第3部分.zh_en -BV1eu4m1F7oz_p103-
Welcome back to our autoencoder notebook. As we saw in the last video, we built out our first autoencoder with the encoder and decoder portions of our model, and then with the reconstruction we saw a fairly high error. One of the reasons we had this poor model was that we weren't really doing any deep learning. So what we're going to do here is, first, add on some extra layers. We're also going to run it for a few more epochs and see whether we can get a lower reconstruction error than we did with PCA, and how much we're able to improve on that error.

So again, our final encoding dimension will be 64, but this time we're also including a hidden dimension of 256.

This will work very similarly to what we just did, so it's another chance to practice with the functional API for Keras. We start off again with the Input, where we just give the shape, and that initiates the type of tensor we want to pass through. Then we have, first, that hidden layer, which is just a Dense, so a fully connected layer with that hidden dimension of 256, and we pass our inputs into it. That produces some output, which we can pass through to another dense layer. So this is the hidden layer, and then this is the encoded layer: it takes in that encoder hidden within the function call, and the function itself is another dense layer, now reducing down to 64 nodes.

And then to create the model, we just call Model, going from the inputs all the way out to the encoded outputs. You don't have to write out the middle step: as long as there's a connection from what's being input all the way through to the output, all you pass to Model is the input and the final output.

Then we do our decoder model. Again, our decoder model should start off with that encoding dimension of 64, as it will now be decoding the latent space that has 64 dimensions. We have another hidden dimension, so it takes the next step up to 256 nodes and then finally reaches the 784 nodes. And we can create our model, which takes in the encoded inputs and passes out this reconstruction that we have here.

That's our decoder model. Just as before, we create our outputs by passing the encoder model, called on the inputs, into the decoder model. That gives the final output that lets us go all the way from the inputs to the reconstruction.

Then for the full model, we can just say Model with these inputs all the way out to these outputs that we just defined.

We can look at that full model summary. As with the summary before, it doesn't show each individual layer; it just shows the different models. But now you see there are a lot more parameters being learned: rather than the roughly 50,000 we had before, we're up to about 217,000, because we have that 256-node hidden layer between the inputs and our encoding dimension.
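The two figures quoted here are consistent with the encoder's parameter count, assuming every layer is a plain fully connected layer with a bias term. A quick arithmetic check, with `dense_params` as a hypothetical helper:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Single-layer encoder from the previous video: 784 -> 64.
simple_encoder = dense_params(784, 64)
# Deeper encoder from this video: 784 -> 256 -> 64.
deep_encoder = dense_params(784, 256) + dense_params(256, 64)

print(simple_encoder)  # 50240
print(deep_encoder)    # 217408
```

Almost all of the increase comes from the 784-by-256 weight matrix of the new hidden layer.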




So again, we have our full model going from the inputs we defined to the outputs we defined. We can compile it using the optimizer, the loss function that we care about, and the metrics we want to track.

This time, we're going to run it for 5 epochs, with the batch size equal to 32, so it will run a gradient update every 32 samples that we go through. Again we go from X train flat as the input to X train flat as the output; those two are the same, and in the middle the model comes up with that encoding layer.
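To make the batch size concrete, here is the number of gradient updates this setup implies. It assumes the full 60,000 training images mentioned earlier are used, with no validation split held out.

```python
import math

n_samples = 60_000   # training images, as mentioned earlier in the lab
batch_size = 32
epochs = 5

# One gradient update per batch; the last batch may be smaller.
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)           # 1875
print(updates_per_epoch * epochs)  # 9375
```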

So we run this for five epochs, and it may take a bit of time, so I'm going to pause the video and we'll come back as soon as this is done running.

So now our model has been fit to our X train flat as both input and output. We can then call full model dot predict on our test set. Again, that's going to deconstruct and reconstruct: bring the data down to that latent space and then reconstruct the original image. We run that and then, again, get the mean squared error of the reconstruction, passing in the new decoded images as well as the original X test flat. When we run that, we see a score of 84.3. So it did better than PCA, now that we've done 5 epochs and used this deeper network.

Now, let's see if we can improve even further. So far we've only used a handful of epochs. As we saw, the loss continued to go down and hadn't plateaued yet, and the accuracy continued to go up, so I want to see whether further training gets us better performance as we increase the number of epochs.

So this function that we have here just puts together all the steps that we have above. You can imagine that was just copied and pasted into this function, so we're not going to go through it again. Then we pass in the number of epochs that we want to run through. And this is the part that's different: for i in range of the number of epochs, we keep fitting the model to the training set. Recall that if we're not reinitiating the model, then when we say only one epoch here, each fit across the entire range of epochs picks up the training where the last one left off. So we'll see the results for one epoch, for two epochs, for three epochs, and so on. And every single time, we get the decoded images, compute our reconstruction loss, and append it to this list that we initiated up here.

And we print what the reconstruction loss is after each number of epochs, to see whether or not we continue to improve. So we initiate that function, and then we're going to run it for 10 epochs; the number of epochs is the only argument we have available.
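The pattern of fitting one epoch at a time without reinitializing the model can be illustrated without Keras. The sketch below swaps in a tiny linear autoencoder trained with hand-written mini-batch gradient descent on made-up data; every name in it is hypothetical, and it only demonstrates the loop structure, not the notebook's network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 256 samples, 20 features driven by 4 underlying factors.
factors = rng.normal(size=(256, 4))
x = factors @ rng.normal(size=(4, 20)) + 0.1 * rng.normal(size=(256, 20))
x = (x - x.mean(axis=0)) / x.std(axis=0)

# Linear "autoencoder" weights, initialized ONCE, outside the epoch loop.
w_enc = 0.1 * rng.normal(size=(20, 4))
w_dec = 0.1 * rng.normal(size=(4, 20))


def fit_one_epoch(x, w_enc, w_dec, batch_size=32, lr=0.1):
    """One pass over the data with mini-batch gradient descent (updates in place)."""
    for start in range(0, len(x), batch_size):
        batch = x[start:start + batch_size]
        code = batch @ w_enc
        residual = code @ w_dec - batch
        scale = 2.0 / residual.size
        grad_dec = scale * (code.T @ residual)
        grad_enc = scale * (batch.T @ (residual @ w_dec.T))
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec


def train_and_track(n_epochs):
    """Fit one epoch at a time and record the reconstruction MSE after each."""
    losses = []
    for epoch in range(n_epochs):
        fit_one_epoch(x, w_enc, w_dec)  # continues from where the last epoch stopped
        decoded = x @ w_enc @ w_dec
        losses.append(float(np.mean((x - decoded) ** 2)))
        print(f"Reconstruction loss after {epoch + 1} epoch(s): {losses[-1]:.4f}")
    return losses


losses = train_and_track(10)
```

Because the weights live outside the loop, each call to `fit_one_epoch` resumes from the previous state, which is the same effect as calling fit repeatedly with one epoch on an already-built Keras model.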

We run this for 10 epochs, and it will take a bit of time to run. If the five epochs took X amount of time, this will take about double that. So we'll come back as soon as that's done running.

So now we see the results. After 10 epochs, we are able to, for the most part, continue to reduce that reconstruction error, the mean squared error. We do see that it is not monotonically decreasing; at some points it seems to waver. But towards the end of the 10 epochs, we get it down to the mid 60s, compared to that higher number we had earlier. So even though, if we look at the accuracy score, it may not be improving that much, those extra epochs did continue to decrease the mean squared error of the reconstruction. That closes out our video on working with plain autoencoders. In the next video, we'll introduce variational autoencoders and how we can leverage those. All right, I'll see you there.

104:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p104 65_自编码器笔记本(选修部分)第4部分.zh_en -BV1eu4m1F7oz_p104-
Welcome back to our lab on autoencoders. In this video, we're going to work with variational autoencoders specifically. If we recall, variational autoencoders create a latent space that is now a mean and a standard deviation describing a normal distribution. From that normal distribution we sample values and then pass those values through to our decoder to try to reconstruct our images.

So again, at a high level, the first neural network is the encoder, where we predict two vectors for each of our images. These two vectors are interpreted as the mean and the standard deviation, and they are used to sample a value from the corresponding normal distribution.

The second neural network is the decoder, which takes the results of our encoder and reconstructs our original image. The entire system is then trained using backpropagation at each iteration. If we recall, with regular autoencoders our input and output are the same. That's still the case, and we still try to minimize the error when reproducing the image, but now we're also adding another penalization term that applies when those two values are not standard normal values, that is, when they are not 0 for our mean and 1 for our standard deviation. That's the KL divergence that we discussed in lecture.

So we import the libraries that we need. Many of these we've seen, some of them are new, and I'll touch on the new ones as we get into each of the different portions of our code.

So recall how we create the decoder's input from the encoded output. That encoded output is, again, going to be two-dimensional, with a mean and a standard deviation for each dimension. To transform that into input for the decoder, we take the mean and add on the standard deviation multiplied by some random epsilon, where epsilon is a value sampled randomly from a normal distribution with a mean of 0 and a standard deviation of 1. That's how there will be some variation in what is output each time.

So to create that sample, we write this sampling function. What we pass in as args is the mu and the log sigma, the output from the encoder. We'll actually add this on later, and we'll see how to do this in Keras, at the back end of our encoder model so that it produces a single vector. So we have our mu and log sigma, which are our args, and we unpack them here. We then set epsilon equal to a random value with a mean of 0 and a standard deviation of 1, which is the default when you call random normal, and we want it to be the same shape as mu so that we can add it on, as we'll see in just a second, according to the formula that we have here. We then set sigma equal to e to the log of sigma, so that we have sigma itself. If we recall, to ensure that sigma is always positive, the network actually outputs the log of sigma, so now we need to transform that back into sigma. From there, once we have our mu, our epsilon, and our sigma, we can produce random values by taking mu plus sigma multiplied by that epsilon, the randomly sampled value.

So that's going to be our sampling function.
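A NumPy sketch of that sampling step follows. It mirrors the description (unpack mu and log sigma, draw epsilon, exponentiate, combine) but is not the notebook's Keras code; the random generator and the example values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling(args):
    """Draw z = mu + sigma * epsilon, with epsilon ~ N(0, 1)."""
    mu, log_sigma = args
    epsilon = rng.standard_normal(size=np.shape(mu))  # same shape as mu
    sigma = np.exp(log_sigma)                         # always positive
    return mu + sigma * epsilon

mu = np.tile([1.0, -2.0], (10_000, 1))  # 10,000 samples, 2 latent dimensions
log_sigma = np.zeros_like(mu)           # log(1) = 0, so sigma = 1
z = sampling([mu, log_sigma])
print(z.shape)  # (10000, 2)
```

Averaged over many draws, the samples center on mu with a spread of sigma, while any single draw differs from the next, which is the variation the decoder sees during training.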

Now we create our actual encoder network. We have our inputs, same as before. We have our dense hidden layer with hidden dim, that 256 dimension, so this will be a slightly deeper network. It takes the inputs that we define here, again using the functional API. That x is then used as input for getting both the mean and the log variance of our z value. We wouldn't be able to do this as easily with the Sequential API, and that's why we use the functional API: we can pass x to a dense layer here and to a different dense layer here, so that we output two different values from the same input.

Then, to get the final output we're looking for, we call Lambda. The Lambda layer allows you to pass in your own function, here the sampling function we created above. So we call Lambda with sampling, and the input for that sampling is this list of Z mean and Z log var, which corresponds to our args, which we unpack up there into mu and log sigma.

At the end of our encoder, we call Model: the input is going to be inputs, and for the output we actually output three different values. We output the Z mean so that we can track it, we output the Z log var, and we output the actual Z value, the sampled value, as well.

So now we've set up the encoder model, and next we build out our decoder model. It takes an input with the shape of our latent dimension, those two dimensions, and then expands it. This is similar to the autoencoder, where we first shrank the data down with the encoder and then expanded it with the decoder. So we expand first up to the hidden dimension, 256, passing in those latent inputs. Then the final output is the dense 784-node layer, so it has the same dimensionality as our original image.

Then we can say that the decoder model takes in those latent inputs and passes out the outputs we defined. To get the full model's outputs, we call the decoder model on the encoder's output for the inputs, but this time we only want the third value as input. Recall that the encoder model actually outputs three different values; what we really want to pass through into our decoder network is just the third one, the sampled Z, which is why we specify index 2 here. Once we have our outputs, we can create our full model, which goes from the inputs defined all the way up in our encoder model to the outputs that we just defined.

Now, just to take a quick dive in, we can look at our model and at each of the different layers involved. Layers here are defined a little differently from what we're probably used to with multi-layer neural networks: the layers are the models being used. So we have the encoder input, which has no weights being learned. It is of dimension 784, and None just stands for however many samples we want to pass in.

Then, in our actual encoder layer beyond the input, the input is 784 and the output, as we discussed, is three two-dimensional vectors: one being the means, one being the log sigmas, and one being the sampled values.

Then within this encoder, still in layer 2, our first weights are those of the dense layer, which gives us 784 by 256. Our next weights are the bias term, which is just 256 different weights. We then get the weights for our mean values, which are 256 by 2. Then our weights 4. Now, when I say weights 1, weights 2, weights 3, weights 4, that's not the number of weights; it just means weights number 1, weights number 2, and so on, the same way we said layer number 1 and layer number 2. So weights 4 is our mean bias, which is just two values, and then we have the same for the log variance, with 256 by 2 and 2.
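The weight shapes just listed can be laid out explicitly. The arrays below are zero-filled placeholders with the shapes from the narration, not the trained weights, and the ordering of the six entries is an assumption based on the order in which they are described.

```python
import numpy as np

input_dim, hidden_dim, latent_dim = 784, 256, 2

# Zero-filled placeholders with the shapes read off the model summary.
encoder_weights = [
    np.zeros((input_dim, hidden_dim)),   # weights 1: hidden dense kernel
    np.zeros(hidden_dim),                # weights 2: hidden dense bias
    np.zeros((hidden_dim, latent_dim)),  # weights 3: mean kernel
    np.zeros(latent_dim),                # weights 4: mean bias
    np.zeros((hidden_dim, latent_dim)),  # weights 5: log-variance kernel
    np.zeros(latent_dim),                # weights 6: log-variance bias
]

for number, w in enumerate(encoder_weights, start=1):
    print(f"weights {number}: shape {w.shape}")

total = sum(w.size for w in encoder_weights)
print(total)  # 201988
```

The sampling step adds no entries to this list, since it only combines the mean and log-variance outputs with random noise.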

And then in layer 3, we have our decoder, starting with an input of 2 and an output of 784, with all the intermediate weights that similarly build back up to the 784 that we discussed.

Now, before we can go ahead and actually compile the model we built, we need to write out the loss function that it requires. So I'm going to pause here, now that we've built out the variational autoencoder. In the next video, we'll talk through how to build out this loss function appropriately so that we can compile the model we just built. All right, I'll see you there.

105:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p105 66_自编码器笔记本(选修部分)第5部分.zh_en -BV1eu4m1F7oz_p105-
As discussed at the end of the last video, we need a different loss function for our variational autoencoder. The first part is the reconstruction error, which is the same as for autoencoders. The second part, which is specific to variational autoencoders, is what we have here. We discussed in lecture how we predict log sigma, because predicting sigma directly could result in a negative value, and a negative variance doesn't make sense. This second part of the cost function has two components, both of which penalize us for results that deviate from the standard normal distribution. If log sigma is far from 0, that is penalized by the term e to the x minus (x plus 1), which is minimized at x equals 0; that's equivalent to the first portion of the KL divergence. The second component simply penalizes the mu value, the mean, for being far away from zero. And again, the other part of the loss function is just the reconstruction error.
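Written out, the two components described above match the standard closed form for the KL divergence between a diagonal Gaussian and the standard normal. The notation is an assumption: x_j here denotes the log variance of latent dimension j, which the narration loosely calls log sigma.

```latex
\mathrm{KL}\bigl(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\bigr)
  = \frac{1}{2} \sum_{j} \Bigl(
      \underbrace{e^{x_j} - (x_j + 1)}_{\text{variance term}}
      + \underbrace{\mu_j^2}_{\text{mean term}}
    \Bigr),
\qquad x_j = \log \sigma_j^2 .
```

The variance term is zero exactly when x_j = 0, that is, when the variance is 1, and the mean term is zero exactly when the mean is 0, so the whole penalty vanishes only for a standard normal.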
So our exercise here was to actually create this loss function so that we can optimize with it, minimizing this loss. We start with the reconstruction loss. We multiply by 784 so that we're working with the full value over all of the pixels rather than the average. We're using binary cross-entropy. So far we've been passing that in as a string for our loss function, but if you look at the imports we had earlier, we actually pulled in binary cross-entropy, which is available from Keras losses. So we pull that in and say we want the binary cross-entropy for the inputs and outputs, and we define that as our reconstruction loss. That's one portion of our loss function. The other portion is the formula that we have above. We pass in the Z log var, which is one of the outputs we specified in our encoder, and then, finally, the square of the mean value. We then sum that KL loss on the final axis, and the total loss is the mean of the reconstruction loss plus the KL loss: the reconstruction loss that we defined here, plus the KL loss defined just below it, which is the function we discussed both in lecture and here in the notebook. So we run those cells.
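The combined loss can be sketched in NumPy as follows. This is an illustration under assumptions rather than the notebook's Keras code: the function names are hypothetical, the inputs are random stand-ins, and the KL term uses the standard closed form for a diagonal Gaussian against the standard normal.

```python
import numpy as np

def binary_crossentropy(true, pred, eps=1e-7):
    """Per-sample binary cross-entropy, averaged over the pixel axis."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(true * np.log(pred) + (1.0 - true) * np.log(1.0 - pred), axis=-1)

def total_vae_loss(inputs, outputs, z_mean, z_log_var, n_pixels=784):
    # Reconstruction term: undo the per-pixel average by multiplying by 784.
    reconstruction_loss = n_pixels * binary_crossentropy(inputs, outputs)
    # KL term, summed over the latent (final) axis.
    kl_loss = 0.5 * np.sum(
        np.exp(z_log_var) - z_log_var - 1.0 + np.square(z_mean), axis=-1
    )
    # Mean over the batch of the per-sample totals.
    return float(np.mean(reconstruction_loss + kl_loss))

rng = np.random.default_rng(0)
inputs = rng.random((4, 784))
outputs = rng.random((4, 784))
z_mean = np.zeros((4, 2))
z_log_var = np.zeros((4, 2))

# With z_mean = 0 and z_log_var = 0 the KL term vanishes.
base = total_vae_loss(inputs, outputs, z_mean, z_log_var)
# Shifting both latent means to 1 adds 0.5 * (1 + 1) = 1 per sample.
shifted = total_vae_loss(inputs, outputs, z_mean + 1.0, z_log_var)
print(round(shifted - base, 6))  # 1.0
```

The comparison at the end isolates the KL penalty: the reconstruction term is identical in both calls, so the difference comes entirely from moving the latent means away from zero.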
Once we have that total VAE loss, we can add it to our model using the add loss functionality within Keras: we call add loss and pass in this total VAE loss. Then we can just compile, specifying our optimizer as well as the metrics we want to track. Once we do that, we can look at our summary. It gets a bit intense with everything that we're doing, but we don't have to worry too much about it here. Then we fit that to our X train flat. So this is training our full model, with the number of epochs, defined above, as just one. I'm going to pause the video here, as it will take about 10 seconds to run, and we'll come back as soon as it's done running.
So that should have been a quick 10 seconds to run. Now we want to look at the reconstruction error for the new model we're working with, the variational autoencoder. I want you to think about whether that reconstruction error should be higher or lower than for the original autoencoder, maybe without reading what we have here below. I'm going to run this, and we see that we have a much higher reconstruction error. The reason is that this latent space is built more for interpretability: the model samples from a distribution rather than being trained to reconstruct the original images perfectly, as we're now sampling rather than directly taking the latent representation that we learned.
Now I want to plot out the latent space that we had, so we have our models。
we're going to create a tuple that's just the encoder model and the decoder model so that we can pass that into our function later。
and then our data is also going to be this tuple, which is X test flat and the related Y test。
So this is pretty large, I'm going to summarize I'm not going to go through every line of code here。
but I am going to summarize what the data is actually or what the function is actually trying to do。
So we have our encoder and decoder as well as our X test and Y test。
And the first portion of our function is going to be to actually plot out in two dimensional space as we did with PCA。
each one of our different numbers, so that's going to be the first portion in order to do that。
we need to get our z value。So we're going to get each of the predicted Z values and with those z values we're going to also get their related numbers。
so we're setting the color in our scatter plot according to that Y test that they're related to。
And then we have our two dimensions of z0 and z1 in that two dimensional space,
and those z's are going to be those random samples that we pull out。
according to what each one of those digits are。And then on the lower portion。
so we're going to actually plot out two plots here。
We're going to create a grid space ranging from negative 4 to 4。You can see our limit here:
the limit we're passing into the function is 4, or whatever limit we choose to pass in, so here it's negative 4 to 4。
we're going to plot a bunch of values。Here n is equal to 30。
so 30 different values between negative 4 and 4 on both the x and y axis。
and we'll see what we get for these different values of x and y。Recall that
with our z's, we originally plotted out in two dimensions
which numbers they're closest to, and we'll see this once the picture comes out。
But now we're actually going to generate numbers using the decoder。
So let's say our first value is negative 4, negative 4。Then for negative 4, negative 4,
we're going to pass that in, using that xi, yi, to our decoder
and see what kind of sample it actually generates。You can think of other values such as 0 and 2, or 0 and 3, and remember these are supposed to be samples from something close to a normal distribution。
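As a rough sketch of that grid step (not the notebook's actual function): the snippet below builds the n = 30 evenly spaced values between negative 4 and 4 on each axis and passes every (xi, yi) pair to a decoder. The `toy_decoder` here is a hypothetical stand-in, since the trained Keras decoder isn't reproduced in these notes, and the helper names are made up for illustration。

```python
def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def toy_decoder(z0, z1):
    """Hypothetical stand-in for the trained decoder.

    A real decoder would return a 28x28 image; here we just return a
    label describing which latent point was decoded.
    """
    return f"image decoded from z=({z0:.2f}, {z1:.2f})"


def decode_latent_grid(decoder, limit=4.0, n=30):
    """Decode every point on an n-by-n grid spanning [-limit, limit]^2."""
    axis = linspace(-limit, limit, n)
    grid = []
    for yi in axis:  # one row of generated samples per y value
        row = [decoder(xi, yi) for xi in axis]
        grid.append(row)
    return axis, grid


axis, grid = decode_latent_grid(toy_decoder, limit=4.0, n=30)
print(len(grid), len(grid[0]))  # number of rows and columns in the grid
print(grid[0][0])               # the sample decoded from the first corner
```

With a real decoder, each entry of `grid` would be an image, and tiling those images gives the picture of generated digits discussed next。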
So now that function has been defined。I also forgot to run this cell above;
we need to make sure that we have all of that defined。

And then we're going to actually plot the results, and this should make what I just discussed a bit clearer。

So here we have the different groupings。We see our purple here, which is related to our ones;
those tend to be in the top left corner, which is negative four on z0, our first dimension, and positive four on z1。


And then when we look at the numbers generated, those tend to be ones。

Whereas you see here on the bottom that the darker purple is the zeroes,
and here it's generating those zeroes。Top right, we have the light green,
which is associated with the sevens, and we see that it's
generating sevens as well, so we can see this generative process。


Then finally, in exercise six, we want to train the variational autoencoder as well as the autoencoder for 10 epochs each,
and then plot the reconstruction mean squared error as a function of the total number of epochs for each one of these models。
And see which one seems to have more potential to continuously learn as it's given more computing time。

So here it's going to start running for 10 epochs。I'm going to let it continue running
and talk through what we have here, because all we are doing is running our variational autoencoder model for 10 epochs。
For the autoencoder, we had that function defined in an earlier video。
I believe it was video 2, because video 1 was on PCA。I'm going to scroll all the way up。



That's where we defined this function here that trains our autoencoder for a certain number of epochs。
We don't have that for the variational autoencoder, so all we have to do is loop, for i in range 10, and
fit that model each time for one epoch, again recalling that every time you run it,
it's going to pick up where the last one left off。
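The loop just described can be sketched like this. `ToyModel` is a made-up stand-in for the compiled variational autoencoder: its `fit` only mimics the fact that a model keeps its weights between successive calls to fit, and the loss values it produces are illustrative, not results from the notebook。

```python
class ToyModel:
    """Hypothetical stand-in for a compiled model.

    It fits a single weight w to minimize (w - target)^2 and keeps its
    state between successive calls to fit().
    """

    def __init__(self, target=3.0, lr=0.2):
        self.w = 0.0
        self.target = target
        self.lr = lr

    def fit(self, epochs=1):
        for _ in range(epochs):
            grad = 2.0 * (self.w - self.target)
            self.w -= self.lr * grad  # one gradient step per "epoch"
        return (self.w - self.target) ** 2  # loss after training


model = ToyModel()
losses = []
for i in range(10):
    # Each call trains for one more epoch, starting from the weights
    # left by the previous call, so we can record the loss per epoch.
    losses.append(model.fit(epochs=1))

print(losses)
```

Recording the loss after every single-epoch call is what makes it possible to plot the error as a function of the total number of epochs。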
So each of these is going to take some time to actually run。
We'll come back when that's done and then we'll actually plot out the loss functions across each。All right,
I'll see you in a bit。All right, now we have had our autoencoder as well as our variational autoencoder run for 10 epochs each,
and we can look at the plot over time or over the number of epochs。
and we see that there tends to be a plateau for the autoencoders, whereas the variational autoencoders are continuing to go down。You could probably run this a little bit further if you wanted, maybe 10 or 15 more epochs, for the autoencoders to really plateau。
But we see that the autoencoders are fitting exactly and eventually are going to plateau due to the fact that they're fitting exactly,
whereas the variational autoencoders continue to decrease and will take a little bit more time to get to that same reconstruction rate,
probably never getting to that same exact reconstruction rate as we see with the autoencoders。
Now that closes out our notebook here。And after this。
we're going to get back into the lecture and discuss a different generative process。
specifically GANs, or generative adversarial networks。All right, I'll see you there。


106:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p106 67_什么是变分自编码器.zh_en -BV1eu4m1F7oz_p106-
In the next videos, we will introduce the concept of variational autoencoders,
which will work similarly to the autoencoders that we just discussed, except now that latent space,
that hidden space that we're trying to represent, is going to be described by a distribution rather than exact figures。
Now let's go over the learning goals for this section。

In this section, we're going to cover how variational auto encoders work。
And how we can come up with this new latent space represented by some distribution。
Then we'll discuss variational autoencoder loss functions。
and that will provide us with some intuition as to how variational autoencoders are used and optimized。

Now, with variational autoencoders, we will still be generating that latent representation,
or again that compressed representation of the data that encodes those similarities。

And then similarly, we can reconstruct these to generate new samples。Now。
some important features of variational autoencoders will include that, rather than the data being represented by just a single set of vectors,
the values of the data in that latent representation will now be represented by a set of normally distributed latent factors。
And now rather than the encoder coming up with a particular value。
It instead generates the parameters of our normal distribution。
namely the mu and sigma or the mean and the standard deviation。
And then using variational autoencoders and the fact that we are going to be sampling from a given distribution rather than some fixed values,
we can actually generate new images。

Now, the goal of variational auto encoders will be to generate images using the decoder portion of our network。
Now, again, starting off our encoded latent vector will now be represented by some normal distribution。
And the parameters for that normal distribution will be learned by the encoder portion of our network within this variational autoencoder and then fed through to our learned decoder portion to produce the images。
And a secondary goal that will come along with that is that similar images will be close together within the latent space。
So as we'll see in our notebook later on, when looking at handwritten digits between 0 and 9,
the latent space for all the zeroes will be close to one another for all the fives will be close to one another and so on and so forth。

So let's walk through the steps of how variational autoencoders will come up with this latent space represented by a normal distribution。

The first step will still be to pass through a network with some bottleneck。
So reducing the number of nodes。

As we did with the regular autoencoders。But now at step 2,
we're going to be learning a mu and a sigma for each value that are meant to represent a normal distribution from which values can be sampled。
So for example, here we may end up coming up with the vectors that we see, which are 0.7 and negative 0.6 for the mu values,
and then 1 and 0.6 for the sigma values。
In the next step for our variational autoencoders,
we combine these two vectors into one vector by adding on some white noise with a mean of 0 and a standard deviation of 1。
So using our example from before, we can come up with this vector at the end of 2.01 and 0.54
by adding the mean, that mu, plus the sigma multiplied by our noise term。
And those are the vectors that we see here at the bottom that will tell us what is going to be some sample from our distribution。
This randomly sampled vector is then fed through our decoder network, in step 4。
And we then can produce our reconstructed image。So that's how you walk through this variational autoencoder with the mu and sigma。
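To make the sampling step concrete, here is a small sketch of the mu plus sigma times noise calculation using the numbers from the example. The noise values 1.31 and 1.9 are an assumption, picked only because they reproduce the 2.01 and 0.54 quoted above; in practice they would be fresh draws from a standard normal each time. The function name `sample_latent` is made up for illustration。

```python
import random


def sample_latent(mu, sigma, noise=None):
    """Return z = mu + sigma * eps for each latent dimension.

    If no noise is supplied, eps is drawn from a standard normal
    (mean 0, standard deviation 1).
    """
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, noise)]


mu = [0.7, -0.6]    # means produced by the encoder (from the example)
sigma = [1.0, 0.6]  # standard deviations produced by the encoder

# Assumed noise draw, chosen so the arithmetic matches the example.
z = sample_latent(mu, sigma, noise=[1.31, 1.9])
print([round(v, 2) for v in z])

# A fresh random draw gives a different z on each call.
print(sample_latent(mu, sigma))
```

Because the randomness lives entirely in the noise term, the same mu and sigma can produce many different latent vectors, which is what lets the decoder generate new images。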
In the next video, we're going to touch at a high level on some of the math that makes variational autoencoders work and that is specific to the variational autoencoder。
All right, I'll see you there。


107:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p107 68_变分自编码器的工作原理.zh_en -BV1eu4m1F7oz_p107-
In this video, I want to start off by discussing the loss function of the variational autoencoder compared to what we would use with just the autoencoder。
So our goal will be to reconstruct the original image, as we did in our auto encoders。
Now we know that a major difference is that the variational autoencoder will be reconstructing from a vector drawn from a normal distribution that we want to be close to standard normal。
And with that in mind, we're going to have two components to our variational autoencoder loss function。
First, we have what we had for the normal autoencoders,
which is just the error measuring how far off our reconstructed image was from that original image。
The second part of the loss function will be a penalty associated with generating vectors of the parameters mu and sigma that are both different from 0 and 1。
So mu being different than 0 and sigma being different than one。
As our goal will be to balance between that low reconstruction penalty, as we did with autoencoders,
While keeping our parameters as close to the standard normal distribution as possible。

So this comes down to the pixel wise difference between the reconstructed and original image for that first component。
And in order to calculate this, we can use a loss function such as mean squared error to see the distance between the reconstructed image and the original image。
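As a minimal sketch of that first component, mean squared error over pixels can be written as below. The two tiny "images" are made-up flattened pixel lists, used only to illustrate the calculation, and the function name is an assumption rather than anything from the notebook。

```python
def mean_squared_error(original, reconstructed):
    """Average squared pixel-wise difference between two flattened images."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    total = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    return total / len(original)


original = [0.0, 0.5, 1.0, 0.5]       # made-up pixel intensities
reconstructed = [0.0, 0.5, 0.5, 0.0]  # made-up reconstruction

print(mean_squared_error(original, reconstructed))
```

A perfect reconstruction gives an error of zero, and the error grows with every pixel that is off。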
And then again, the second component will be the difference between the vectors produced by the encoder and the parameters of a standard normal distribution。
So how do we go about calculating that second component of our loss function?For this,
variational autoencoders will use something called KL divergence between the distribution that we generated and the standard normal distribution。
So here we have, for example, our mu values versus a mu of 0。
And then we're actually going to take the log of sigma。If we think about our sigma,
the ideal sigma would be one, and the log of one would be 0,
so we're going to be comparing the log of our values versus 0 again,
since the log of sigma at that ideal value, the log of one, is going to be 0。
And we work with the log to ensure that, once exponentiated, we end up with strictly positive values, as a negative value for variance doesn't really make any sense。
Now, the actual KL divergence formula will be what we have here,
where e to the log of sigma, minus the quantity log of sigma plus 1,
is going to penalize the sigma for straying from one。We know that if sigma is 1,
then log of sigma is 0。And if we replace log of sigma with 0 here,
we would see that this comes out to 1 minus 1, which is 0, and it is thus minimized at sigma equal to one。
And then for mu, obviously, mu squared will be minimized when that mu value is equal to 0,
because if it goes a little bit to the right, to 0.5 or even up to 1, then mu squared would be 1。
If it goes a little to the left, to negative one, again you'll have a value of one。
So it'll be minimized when mu is equal to 0。And in regards to that sigma portion,
we can somewhat see graphically if we imagine subtracting that orange line from the blue line。
That, again, e to the x minus the quantity x plus 1, which is similar to our sigma portion that we discussed in our KL divergence formula, is going to be minimized at x equal to 0。
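A quick numerical check of those two claims. This sketch uses the usual closed form for the KL divergence between a normal distribution with mean mu and log-variance x and the standard normal, which is 0.5 times (e to the x, minus x, minus 1, plus mu squared). Treating the "log of sigma" term from the slide as that log-variance x is an assumption here, and the notebook's exact code may differ。

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (math.exp(log_var) - log_var - 1.0 + mu ** 2)


# The penalty is zero exactly at mu = 0 and log_var = 0 (sigma = 1) ...
print(kl_to_standard_normal(0.0, 0.0))

# ... and grows as either parameter moves away in either direction.
for mu, log_var in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    print(mu, log_var, kl_to_standard_normal(mu, log_var))
```

For a latent vector with several dimensions, the per-dimension penalties are simply summed。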
Now a note on KL divergence。It's not technically necessary to include this component in our loss function。
But the reason that we do like to include it, though。
is that it helps generate a desired latent space where visually similar images are going to be close together within that latent space。

So as an example, what we have here is a variational autoencoder trained with two dimensions in the mu and sigma vectors。
And because this is a generative model, we can scan the latent plane and sample points at regular intervals and actually generate corresponding digits。
Given what we sampled for each one of these points。
And as long as our sample values are close within that latent space。
within that lower dimensional space, which is represented by our mu and sigma。
they will generate similar images。And that's why you see sevens up to the left and some nines in the middle。
None of these were actual images within our data set, but using our variational autoencoders,
we are able to generate these randomly。

Now, just a recap。That will close out our section here on variational autoencoders。
We went through the basics of how variational autoencoders work and how they differ from regular autoencoders in that they produce a probabilistic means of describing our latent space。
And with that, we also discussed the loss function used and why adding on KL divergence to measure the distance from the standard normal distribution can be very powerful。
Now let's take a look at some actual code as to how we can actually build out autoencoders as well as variational autoencoders of our own。All right,
I'll see you there。

108:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p108 69_生成对抗网络(GAN)简介.zh_en -BV1eu4m1F7oz_p108-
In this next set of videos, we're going to introduce another unsupervised learning approach and an often more powerful generative model,
namely generative adversarial networks。Now let's go over the learning goals for this section。
In this section, we'll have an overview of generative adversarial networks, or GANs,
and we'll discuss their origin story as well as the motivation behind why they work。
We'll discuss the actual training procedure for GANs and how we'll use two adversarial networks,
one for generating classes, and one for discriminating which classes are real。
And then we'll discuss some key hyperparameters that govern how GANs train and why GANs may be even more sensitive to these hyperparameters compared to normal neural networks。

Now, part of the motivation of GANs is the realization that the means by which neural networks interpret a new example makes them vulnerable to adversarial examples。
So a broader example, if you were to think of trying to learn a spam filter。
and once a neural net has learned what makes an email spam versus not spam。
It then becomes possible using that same network to begin designing emails that look as much as possible like the non spam emails that can trick our actual network。
and these are adversarial examples。And if we look here at our image and are trying to determine whether or not this is a handwritten image。
If our neural net learned a bunch of handwritten images。
Then it actually knows all the features that make up a handwritten image。
According to that neural network。And therefore, we can replicate that to produce something that according to the neural network。
will be classified as a handwritten image。Now, we won't get into the math。
But a way that we have learned to generate these adversarial examples。
such as that spam that looks like it's not spam, or non-handwritten digits that look like handwritten digits, is to take a training set。
And then focus on adjusting our original images in that backward pass in relation to each one of our gradients。

So the invention of GANs was connected to neural networks' vulnerability to these adversarial examples。
Researchers were going to run some speech synthesis contests to see which neural network would generate the most realistic sounding speech。
And they would have a neural network, the discriminator。
to judge whether that speech was real or not。But they decided not to run the contest because they realized that with neural networks,
it would be possible for people to generate speech that was just going to fool this particular network,
given the way that it would be trained rather than actually generating realistic speech。

The researchers realized they could solve this by having the discriminator continually improve at distinguishing between real and fake speech。
They could have, or they could do this by feeding it real speech alongside fake speech。
or in other words, introducing their own adversarial examples。
And by passing the gradient of the resulting discriminator's loss with respect to the input, that is, the loss with respect to x, back to another neural network that was actually generating that speech,
They realized that they could train a network to generate very realistic speech。
and this is the root of our adversarial networks。 we're trying to improve both our generator model so that it creates better fake output。
As well as our discriminator model, which is meant to differentiate real and fake output at the same time。
They are going to be adversarial networks with opposite goals。Again。
we're going to have that generative model trying to trick the network so that it cannot discriminate between false and true examples。
whereas we're going to have discriminator working hard to get better at discriminating between the two。
And as both networks try to beat the other, they continue to improve。

So with this in mind。What exactly are GANs or generative adversarial networks?
GANs provide a way of training two neural networks simultaneously。
One of the neural networks works as the generator and learns to map random noise to images with the goal of making them indistinguishable from those within our training set。
So looking at this image, we start off with our generator network。And that starts with an input。
which is just going to be some random noise。And then tries to create an image indistinguishable from the training set images。
And not the same as any particular image, but rather trying to find similar properties of the image value distributions in that training set。
And then that produces an image which is fed through the discriminator。
and that discriminator is meant to decipher which images are generated or are fake and which ones are the actual images from our training set。

Now, going back to GANs and their original paper。GANs were established as a training network in 2014 in a paper by Ian Goodfellow。
And they showed how they were able to generate new high quality 28 by 28 pixel MNIST digits,
so those MNIST digits again are those handwritten digits。
The model relied on simple architectures for both the generator and discriminator,
where there were no convolutional layers, but rather just fully connected layers and ReLU activations between each layer。
There was also no dropout and no regularization used in either the generator or the discriminator networks。

And the results were what we see here below where the images produced were nearly indistinguishable from the handwritten values in our training set。

And again, none of the images just shown existed in that original data set。
And just to show these alongside the actual training set images。
we see the generated nine and the generated zero versus that from the training set。
and then the same for the respective twos and eights that we see here。
Now that closes out this video, and in the next video we'll dive a bit deeper into the actual GAN training procedure。

All right, I'll see you there。

109:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p109 70_GAN的工作原理.zh_en -BV1eu4m1F7oz_p109-
In this video, we're actually going to take a look at the generative adversarial network training procedures。
So diving right in。The first step will be as usual with our neural networks to start with randomly initialized weights for our generator and discriminator networks。

Now, the next step, to start our generator process, which is meant to find some distribution that does well in representing the actual values in our training set,
be those images or whatever it may be,
will be to pass through just a random noise vector。

So our input's going to be a random noise vector, and the goal is that our generator network will be able to take this random noise and come up with the appropriate weights to generate the complex distributions of our images of whatever our training set is。



And then that generated image should then be the same size as the actual training set images as it will later be fed through our discriminator network。


The next step, step three, will be for the discriminator to predict the probability that the generated image is a real image rather than one that our network has just created。


So once again, our produced image is going to pass through our discriminator network,
and the goal is to discriminate between actual training images and those our generator network is producing。



And the output will be the probability that we are working with an actual image and not a generated one。


Step 4。

We are then going to want to compute the losses, both assuming that the image we just generated is fake。

And a loss function assuming that the image generated was real。
and we'll get into this in just a second。

We're going to call the loss function assuming it was fake, which it is, of course, just a reminder。
we are working here in the realm of that generated image。


So we're going to call the loss function assuming it was fake L0。
as in how far away was our discriminator from predicting that there's a zero probability that this generated image is real。


And then we're going to call the loss function assuming it was real, which again, it's not。
we're going to call that loss function L1。

Representing how far the discriminator was from predicting this was 100% a real image。

And then these two combined will produce our total loss function。
which we can later leverage to train both our generator and discriminator networks。
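One common way to write these two losses is as binary cross-entropy terms, which is what the sketch below assumes; the lecture itself doesn't name the exact loss. Here `p_real` is the discriminator's output for the generated image, that is, its predicted probability that the image is real, and the value 0.2 is made up for illustration。

```python
import math


def loss_assuming_fake(p_real):
    """L0: how far the discriminator is from predicting 0 (fake)."""
    return -math.log(1.0 - p_real)


def loss_assuming_real(p_real):
    """L1: how far the discriminator is from predicting 1 (real)."""
    return -math.log(p_real)


# Discriminator output for one generated image (made-up value).
p_real = 0.2
print("L0:", loss_assuming_fake(p_real))  # small: it leans towards "fake"
print("L1:", loss_assuming_real(p_real))  # large: far from "real"
```

The same discriminator output therefore gives a small L0 and a large L1 when the discriminator is confident the image is fake, and the reverse when it has been fooled。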


Now, here in step five, we'll make it a little bit clearer why we have these two different loss functions。

So our discriminator network is going to only care about correctly predicting that this image is not a real image。

So it'll want to use that loss function L0 when we're talking about the discriminator network。
How far off the output was from predicting zero is what we're trying to get here,
so we backpropagate in relation to the loss of
how far off we were from saying that this is not a real image。
And then we update the weights of only our discriminator network accordingly。

Now, where does our L1 penalty come into play?Our L1 penalty, again,
the L1 being the loss assuming that our image produced was actually real,
is going to be ignored by our discriminator network because our discriminator network doesn't want to optimize assuming that a fake image is real。


The goal of computing L1 is to ultimately tell us if we're doing a good job of producing an image that seems realistic,

and thus whether our output from our discriminator is too far from one。

So we compute our gradients in regards to L1 and backpropagate through without training the discriminator, and pass that out ultimately to our generator。


So rather than using that gradient to train our discriminator network。
We use it to actually update our weights for our generator network。

So we continue to back propagate using that L1 gradient to update the weights of our generator。

Now what's missing in this training procedure that we just walked through?

The goal of our discriminator network is to learn to classify fake images as fake。
real images as real。


And for it to actually improve its ability to distinguish between fake and real images。

We're also going to have to give it images from the actual training set。

So our next step
is for the real images within our training set to be passed through。
And we want to calculate the probability that the image passed through from our actual training set is real。

So we pass in that real image, XR, whereas before we were working with XG, that X generated,
and want to know the probability that this is actually a real image,
and that's going to be the goal of the discriminator now。


We then compute a new L1 loss for our real image。So again, when trying to predict real,
we're actually going to want our discriminator to output one,
so we're actually going to use this also to update our discriminator network。

And the L1 in regards to XR is going to be how far off our probability,
which should be as close to one as possible, was from that value of one。

And we can use that L1 loss to train and update our weights appropriately within that discriminator network。


And we repeat this procedure with new random noise from the generator each time。

And continue until images from the generator begin to look real。
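Putting the steps together, here is a toy sketch of the training loop. Everything in it is an assumption made to keep it small: the "images" are single numbers, the real data are draws from a normal distribution centred at 3, the generator is a line a times z plus b, the discriminator is a single logistic unit, the losses are binary cross-entropy, and the gradients are written out by hand instead of coming from a deep learning library。

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def generate(gen, z):
    """Generator: map noise z to a fake sample."""
    return gen["a"] * z + gen["b"]


def discriminate(disc, x):
    """Discriminator: probability that x is real."""
    return sigmoid(disc["w"] * x + disc["c"])


def discriminator_step(disc, x_fake, x_real, lr):
    """Update only the discriminator, using L0 on the fake sample and
    L1 on the real sample."""
    p_fake = discriminate(disc, x_fake)
    p_real = discriminate(disc, x_real)
    # d(-log(1 - p))/d(logit) = p   and   d(-log(p))/d(logit) = p - 1
    grad_w = p_fake * x_fake + (p_real - 1.0) * x_real
    grad_c = p_fake + (p_real - 1.0)
    disc["w"] -= lr * grad_w
    disc["c"] -= lr * grad_c


def generator_step(gen, disc, z, lr):
    """Update only the generator, using L1 on the fake sample. The
    gradient flows back through the discriminator, whose weights stay
    fixed."""
    p_fake = discriminate(disc, generate(gen, z))
    grad_x = (p_fake - 1.0) * disc["w"]  # dL1/d(x_fake)
    gen["a"] -= lr * grad_x * z
    gen["b"] -= lr * grad_x


random.seed(0)
gen = {"a": 1.0, "b": 0.0}
disc = {"w": 0.1, "c": 0.0}

for step in range(500):
    z = random.gauss(0.0, 1.0)       # new random noise each time
    x_real = random.gauss(3.0, 0.5)  # a sample from the "training set"
    discriminator_step(disc, generate(gen, z), x_real, lr=0.05)
    generator_step(gen, disc, z, lr=0.05)

print("generator offset b after training:", gen["b"])
```

The thing to notice is which weights each loss touches: the discriminator step never changes the generator, and the generator step uses the discriminator only to obtain a gradient。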

Now, we must note that values of losses from discriminator and generator may still be fluctuating when the generator is producing realistic images。
so we can't use these alone to determine when to stop training。



And there are ways of quantifying image quality within our generative process such as the inception score to help us find where we should actually stop training。
That's a bit beyond the scope of the lesson, but I encourage you to look into it if you want to delve a bit deeper into working with GANs。




110:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p110 71_GAN训练的问题.zh_en -BV1eu4m1F7oz_p110-
Now I wanted to pull out here a quote from the original paper on GANs。
The generative model can be thought of as analogous to a team of counterfeiters。
trying to produce fake currency and use it without detection。
While the discriminative model is analogous to the police trying to detect counterfeit currency。
Competition in this game drives both of these teams to improve their methods until the counterfeits are indistinguishable from the genuine articles。
So if we think if the police are too lax, then the counterfeiters won't have any incentive to create better bills。
If the police are too good, then perhaps the counterfeiters won't have any motivation to produce those bills in the first place。

So there's going to be this balance that we also want to take into account。Now,
training GANs effectively is highly dependent on both the generator and discriminator learning at the same rate。
So the ability of two networks to learn is affected by network architectures。
different learning rates, different loss functions, different optimization techniques。
And in general, GANs are going to be more sensitive than traditional neural networks to choices on any of these dimensions。
So if you think about what I just discussed in relation to the dangers of overfitting,
in regards to that balance when we talked about the counterfeiters and the police,
we know that in regards to overfitting, we now have that compounded by the fact that we are working with adversarial networks with competing goals。
So if the discriminator is too good, for example, the generator wouldn't be able to train at all, as fake examples will have too high of an error with the discriminator。
And if the discriminator is too lenient, on the other hand。
there would not be much learned in respect to creating realistic images。

So what should you be doing to train GANs yourself?Since GANs are fairly new,
it'll be that much more important to actually take a deeper dive into the research and original papers on the subject,
such as Ian Goodfellow's original paper。And some examples of where GANs are used currently are deepfakes,
which you may have seen in some spoof videos, where you can take an existing photo or video and replace the person in it with someone else's likeness。
Age interpolation or making people in images look older or younger than they actually are。
and even taking text and producing images related to that text。

And here we have some links and some examples of GANs being used for generating fake images of people and learning lip sync from audio。

Now, just to recap。In this section, we discussed an overview of generative adversarial networks, or GANs, and the idea of how we can leverage adversarial examples to generate more realistic samples。
We discussed the training procedure for GANs and how we use specific loss functions to update the weights of our discriminator and generator networks to optimize our model。
Then finally, we discussed some key hyperparameters that govern how GANs train and how we need to be careful when fine tuning them due to the problem of stability when working with both generators and discriminators within the same network。
Now that closes out our video here on GANs。In the next video we're going to discuss some other topics that you should be aware of when working with deep learning models in general。


111:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p111 72_深度学习的其他主题.zh_en -BV1eu4m1F7oz_p111-
In this video, we're going to touch on some final additional topics that we should be aware of when working with deep learning models。
Now let's go over the learning goals for this section。
In this section we're going to discuss computational issues and with that specialized hardware when working with these deep learning models。
And then finally, just to ensure that we have the interpretability of our models。
we're going to briefly touch on local interpretable model-agnostic explanations, or LIME, to help us gain some interpretability when working with these deep learning models that are generally more on the black box end of a model。
Now we're going to jump right in here to hardware for artificial intelligence, specifically GPUs。
Originally GPUs were designed to process graphics, and GPUs have become a popular deep learning workhorse。
Reasons being that they feature thousands of small, simple cores specialized for numeric。
parallel computations。There's going to be many transistors dedicated to computation。
The GPUs excel at repeated similar instructions in parallel。
GPUs are going to be optimized for parallel data throughput computation。
And major neural net breakthroughs since 2012 have been powered by these GPU computations。
GPU performance has since increased by more than 5x over what we had in 2012。
And once the data is in the GPU memory, the bottlenecks are actually very small。
And if you're looking for a GPU of your own, NVIDIA is a very popular GPU maker for deep learning。
And just to highlight the main takeaway from all this is that GPUs are great at parallelization。
And we're going to look into why right now。

So the CPU is only made of a couple dozen cores, where cores are what power the CPU to do any small calculations。

Whereas GPUs, on the other hand, are made of hundreds of maybe weaker cores,
but they're able to run all these calculations in parallel。


So looking at the CPU and the breakdown, we have the ALU, or the arithmetic logic unit,
and that's a portion that can execute simple arithmetic and logical operations。
We have control and control here is to decode instructions into commands and calls on that ALU to perform the necessary calculations。
We have here the cache, which serves as high speed memory where instructions can be copied to and retrieved。
And then we have the DRAM,
which is for longer term memory。And then if we look at this breakdown for the GPUs, all the colors should match accordingly。
We see that for the GPUs, we spread out the control and cache, and they're much smaller for each portion,
and for each we have a ton of ALU units that can parallelize the workload of our operations。

Now GPUs are specialized processors, great for specific tasks such as gaming and deep learning。CPUs,
however, are still going to be the main computation engines on most computers as they are much more versatile。
Now, some differences between the two will include that CPUs have dozens of cores,
whereas GPUs have thousands of less powerful cores, which allows for high parallelization。
But for GPUs, if tasks are not being parallelized, they may not work as efficiently as CPUs。
CPUs are going to have fewer ALUs and a lower compute density than GPUs。
And we saw this in the image in our prior slide。CPUs are lower latency and have larger cache memory,
so compared to GPUs, they can make data more immediately available to our users。
GPUs are designed for parallel tasks, which is incredibly powerful for high amounts of major computations such as what we have in gaming and deep learning。
GPPUs in general will perform well for a single instruction performed over a large amount of data that can be paralyzed。
whereas CPUUs will perform better for a wider variety of tasks that don't use as much data and don't require as much parallelzization。
Now again, GPUs are specialized processors。And CPUs, however, are again。
still that main computation engine, so a bit more on that。
GPUs will have additional overhead when copying data from the main memory。Compared to CPUs。
CPUs will be better when a large number of memory swaps are needed as they have more efficient memory units than GPUs。
GPUs are poor for tasks that cannot be paralyzed。As mentioned earlier。
that is really the specialty of GPUs being able to paralyze in general。
GepUs are poor for heavy processing on fewer data streams。
it's really better for those larger data streams。CPUs excel at serial tasks and are easy to program。
so when there is some type of ordering to the tasks to be done and parallelization isn't leveraged。
CPUUs will usually outperform GPUs。And finally, most popular programming languages will be compiled will compile to machine code to be run on CPUs by default。
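
To make the contrast concrete, here is a minimal Python sketch of the two kinds of work described above. It is only an illustration of structure, not a benchmark: the function names are made up for this example, and Python threads stand in for the many cores of a GPU without giving a real speedup. The first task applies one instruction to every element independently, so it can be split into chunks; the second needs the previous result at every step, so it has to run in order.

```python
from concurrent.futures import ThreadPoolExecutor


def square_chunk(chunk):
    # The same instruction applied to every element; no element depends on another.
    return [x * x for x in chunk]


def parallel_squares(values, n_workers=4):
    # Split the data into independent chunks and process them concurrently.
    size = max(1, len(values) // n_workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(square_chunk, chunks)  # map preserves chunk order
    return [y for part in results for y in part]


def running_total(values):
    # Each step needs the previous result, so the work is inherently ordered.
    total, out = 0, []
    for x in values:
        total += x
        out.append(total)
    return out


data = list(range(10))
squares = parallel_squares(data)
totals = running_total(data)
```

The chunked version gives the same answer no matter how the data is split, which is the property that lets thousands of simple cores share the work。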

112:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p112 73_模型可解释性AI.zh_en -BV1eu4m1F7oz_p112-
Now, on a different note, because this is just some topics in regards to deep learning。
Deep learning models in general are difficult to interpret。
There are going to be many parameters and complex networks and connections within any one of our deep learning models。

So one approach is to generate local interpretable model-agnostic explanations, or LIME。
And LIME treats the model as a black box and focuses instead on the sensitivity of outputs to small changes in the inputs。
So we'll test how well simpler models will perform if we look at small changes in feature values for specific samples。

This will be analogous to feature importance, in the respect that LIME will summarize the sensitivity of regression or classification outcomes to each one of our variables。
And it will lean on linear models to produce the feature importance,
and thus nonlinearities, and variables that cannot be perturbed,
such as binary variables that can only take on values of 0 and 1 and can't be perturbed slightly, will present challenges to working with this approach。
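
The idea can be sketched in a few lines of NumPy. This is not the actual LIME library; it is a simplified illustration under stated assumptions: `black_box` is a made-up stand-in model, the perturbations are Gaussian noise, and the proximity weighting uses a simple exponential kernel chosen for this example. The sketch perturbs one sample, asks the black box for predictions, and fits a weighted linear model whose slopes act as local feature importances.

```python
import numpy as np


def black_box(X):
    # Stand-in for an opaque model: nonlinear in feature 0, linear in feature 1.
    return X[:, 0] ** 2 + 3.0 * X[:, 1]


def local_linear_explanation(predict, x, n_samples=500, scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Perturb the sample slightly and record how the black box responds.
    X_pert = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y_pert = predict(X_pert)
    # Weight perturbed points by proximity to the original sample.
    dist2 = np.sum((X_pert - x) ** 2, axis=1)
    weights = np.exp(-dist2 / (2.0 * scale ** 2))
    # Weighted least squares for an intercept plus one slope per feature.
    A = np.hstack([np.ones((n_samples, 1)), X_pert - x])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y_pert * sw[:, 0], rcond=None)
    return coef[1:]  # local sensitivities (feature importances near x)


x0 = np.array([1.0, 2.0])
importances = local_linear_explanation(black_box, x0)
```

Near the point (1, 2) the slopes should come out close to 2 and 3, the local sensitivities of the stand-in model。A binary feature could not be nudged by small noise like this, which is the limitation mentioned above。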

So to briefly recap: we discussed the computational issues of working with deep learning models and how we may want to use GPUs versus CPUs,
depending on the model that we're building out, and we also touched on why it's important that CPUs are still being used in our current computers。
Then finally, we closed out with local interpretable model-agnostic explanations and discussed how we can use them in order to dive deeper into a model,
perturbing some of our examples in order to understand the feature importance within our deep learning models。
That closes out our video here。In our final video,
we're going to discuss reinforcement learning。All right, I'll see you there。


113:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p113 74_强化学习(RL).zh_en -BV1eu4m1F7oz_p113-
In this video, we will provide a high level introduction to reinforcement learning。
Now let's go over the learning goals for this section。In this section we're going to cover an overview of reinforcement learning at a very high level。
We'll discuss a bit of the approaches to, and implementation of, reinforcement learning。
Then finally, we'll introduce reinforcement learning implementation using Python。

So let's start off with a reinforcement learning overview as we promised。

Now in reinforcement learning, the idea is that agents will be interacting with an environment。
And agents then are going to be the thing that ultimately takes the action。
So if you think in regards to games, which are currently very popular for reinforcement learning,
this would be the actual player。If we were thinking through a model that was meant to figure out,
for example, where to place ads on a web page, the agent would just be the program that makes the decision where the ad will be placed。
And the environment is going to be the world through which our agent moves。
So if you're playing a game such as chess, this would be the actual chess board。If we're thinking of something like ads on a web page,
this would be the entire web page。And agents choose from a set of available actions。And again,
using the game example, this would be all possible moves that an agent or player can make in our game。In our ads example,
this may be adding an ad, removing an ad, or taking the option of neither removing nor adding an ad on the current page。
And the actions that we take are going to impact the environment, which, in turn,
impacts the agents via rewards。So when an action is taken,
we have impacted the environment where our agent exists。
So if we move a piece in our game, we have adjusted the environment of our game。
If our move resulted in us getting more points or winning the game,
this would be an example of our reward, and our system would learn that the actions taken were good actions。
And similarly, for the ads example, our reward could be an increase in clicks or an increase in revenue。
Now, something to note is that rewards are generally unknown and must be estimated by the agent,
so oftentimes it will take many steps to reach that reward stage of your game, whether that's to win the game or to get to a certain place within the game。
So for a game of any kind, oftentimes it'll take multiple moves before you get any type of reward。
And this process repeats dynamically, so agents continuously learn how to estimate rewards over time。

Now, advances in deep learning have led to many recent reinforcement learning developments。
For example, in 2013, researchers from DeepMind developed a system to play Atari games and actually beat humans in Atari games。
And in 2017, the AlphaGo system defeated the world champion in Go。So for the first time,
machines were able to beat a human champion in a complex game such as Go
using reinforcement learning。

Now, in general, reinforcement learning algorithms have been limited due to significant data and computational requirements。
Think about the enormous number of possibilities at every juncture,
whether you're adjusting for every person that visits your site or playing games,
which is the example that has proven to be successful。
The reason why it's taken so long is that, for something like Go or chess,
there is a vast number of moves that anyone can make,
along with the following moves and reactions to those moves。

This leads to us needing a lot of data to train our reinforcement learning models。Now, more recently,
progress has been made in areas with more direct business applications。

And examples include recommendation engines, where recommending correctly could be the reward; marketing, with higher revenues or higher clicks
again being that reward mechanism; and automated bidding, where you optimize the amount spent or paid per item and set up some reward system in that sense as well。

Now the idea here, again, if you think about reinforcement learning, is that
the agent takes an action。That action affects the current environment。
And then feedback from that environment is passed back to the agent in terms of a reward,
so if it resulted in a positive result in relation to our reward system,
the agent's actions are then reinforced。And vice versa for negative results:
if it ended up in a bad state, then the agent is reinforced not to take those same steps。
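
The feedback loop just described can be sketched with a toy example. Everything here is hypothetical and chosen for illustration: the two-action environment, its payoff probabilities, and the simple "mostly pick the best-looking action, sometimes explore" rule are assumptions, not something from the lecture. The point is only to show an agent acting, receiving a reward, and updating its estimate of how good each action is.

```python
import random


class TwoChoiceEnvironment:
    """Hypothetical environment: action 1 pays off more often than action 0."""

    def __init__(self, payoff_probs=(0.2, 0.8), seed=0):
        self.payoff_probs = payoff_probs
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment responds to the action with a reward of 1 or 0.
        return 1.0 if self.rng.random() < self.payoff_probs[action] else 0.0


def run_agent(env, n_steps=2000, explore=0.1, seed=1):
    rng = random.Random(seed)
    estimates = [0.0, 0.0]  # the agent's running reward estimate per action
    counts = [0, 0]
    for _ in range(n_steps):
        # Mostly repeat the action that looks best so far; sometimes explore.
        if rng.random() < explore:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: estimates[a])
        reward = env.step(action)
        # Feedback moves the estimate toward the observed reward,
        # reinforcing actions that paid off.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_agent(TwoChoiceEnvironment())
```

The agent is never told the payoff probabilities; it only estimates them from the rewards it receives, and it ends up choosing the better action far more often。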

Now, reinforcement learning problems will vary significantly。
And solutions represent a policy by which agents choose actions in response to the current state。
Or in other words, since this is not directly supervised learning, what takes our input
and comes up with the resulting action is the policy,
and that policy is what we ultimately try to optimize。
And agents typically work to maximize expected rewards over time。
And this differs from typical machine learning problems because, unlike with labels,
rewards are not known and are often highly uncertain。We may not know at every juncture whether actions resulted in immediate rewards, or even if they did,
whether those intermediate rewards will lead to our larger goals。

Whereas with typical machine learning problems, the solutions remain static。
With reinforcement learning, as actions impact the environment, the state changes。
which continuously changes the problem that we are working with。Then finally。
agents face a trade off between rewards in different periods。
again pointing to this uncertainty that revolves around this reward system。

Now just a quick introduction, and we will get into a notebook: in Python,
the most common library for reinforcement learning is going to be OpenAI Gym。
So we're going to want to import our gym library。To create an environment we call gym.make,
and there are environments available to us according to the strings that we pass, and we'll see this in the notebook, so that we can specify the game or environment, that world in which we are living。
And then env.render, now that we've created that environment object, will show the current state of our environment。

Now, just to recap: in this section, we discussed the reinforcement learning overview, with an understanding of that feedback loop,
where the goal is for an agent to interact with the environment,
choosing from a set of available actions to increase possible rewards。
And those rewards lead to reinforcement of those actions within the environment。
And we discussed how solution approaches to reinforcement learning rely on the policy by which agents choose actions in response to the given state,
and as those actions impact the environment, the state changes, which changes the problem we are currently working with。
Then finally, we closed out with a quick introduction to reinforcement learning
implementation in Python, which we're going to go into further in our final notebook。


114:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p114 75_RL笔记本(选修部分)第1部分.zh_en -BV1eu4m1F7oz_p114-
Welcome to our notebook here on reinforcement learning。In this notebook,
we're going to use the library OpenAI Gym, which you can install using pip install or by following the instructions that we have linked here。
Some key concepts that we should be aware of before we get started with OpenAI Gym: first, the idea of an observation,
and that's going to be the current state of the game。
And that will describe where your agent currently is within that environment,
within the world of the game。There are going to be actions, and those are the different moves that the agent can make, as we discussed in lecture。
There's going to be this idea of an episode。Since we're just playing games here,
one full game played from the beginning, which we initiate with env.reset,
until the end, where done equals True, is going to represent a single episode, and we'll see this clearly as we walk through the actual examples。
And then the step is the part of the game that includes one action,
so that's just one specific action, and the game transitions from one observation, where you currently are in the game, to the next,
where you will be after that action is taken。

So we're going to import gym, and we're going to import pandas, as we'll use that as well。
And we're going to play this game using the environment FrozenLake-v0,
that is, version zero of Frozen Lake。And if you're curious, I'm going to load it up in just a second:
there are a lot of games available in OpenAI Gym, and you can click on this link that we have here,
and that will allow you to look at what goes into each one of the different environments。Now,
in this environment, the goal of the game is that you start at this S portion

and you try to make it to the G here at the end。

So you see, the object of the game is to get to the goal G without landing in any holes。
So the idea is that you're walking across some frozen pond。You can go onto the frozen portions,
and you should be fine。But if you step on an H, if you step on a hole,
then you will fall into the hole and you lose the game。

Now, built into that and we'll discuss this a bit later。
It's not going to be that you can always necessarily go in the direction that you're trying to go。
if you choose to go to the left, down, right or up, you may, by some small probability。
end up going the wrong direction and falling in a hole。
So we have to take that into account as well。 And that's going to be what we're trying to learn in terms of the probabilistic。
best path to take。 Otherwise it would be obvious where to go in order to get to G。

So we're going to create our environment by calling gym.make and passing in the string FrozenLake-v0,
and there are a bunch of environments available within the gym, and again you can look at the documentation to see them。

We're then going to get our current observation。When we call env.reset,
what we're doing is resetting our values so that we are starting at that starting point, S。

And we'll see that current observation in just a second, we print that out。
and we see that we're at observation zero and zero should indicate that we're at the beginning of our process。

Now, I want to look a bit deeper into what our environment actually is。I'm going to take
the environment object that we just initiated

and look at its documentation。And in the documentation, we can see more about the game。
So you read here that winter is here, and you and your friends were tossing around a Frisbee at the park when you made a wild throw that left the Frisbee out in the middle of the lake。
And the goal is to get to that Frisbee, and they have a cute story here: the water is mostly frozen, and there's a Frisbee shortage, so you need to get to that disc,
but the ice is slippery, so you won't always move in the direction that you intend,
which I mentioned earlier。And the surface is described using the grid that we have here,
where again S is the starting point, F is the frozen surface, H is a hole,
and G is the goal that you're trying to get to。

So that's how this game actually is going to work。If we want to look at the actual environment。
whether it's this game or another, we can look at our current environment。
where that orange highlight indicates where we currently are, what is our current state。


So I'm going to print out our action space, and the action space will include four discrete actions that we can take。When we print this out, it just says Discrete(4), which describes that we can go up, down,
right or left。

And then if we want to take an action, what we can do, because we don't know which actions to take,
is take a random action, so we call env.action_space,
which is that attribute we just pulled out, and its sample method, which just randomly chooses moving
up, down, right or to the left。And each one of those different directions is going to be indicated by some number。
And we see here on that first try that we randomly took direction 3, which is actually the fourth direction because Python starts counting at 0。
And I can run this again, and it's still 3。And I run this again,
and we see that it chose a different random sample of 1。

So now let's actually take this action。Our new action is equal to
that sample that we're taking, so we're going to move up, down, right or left。
And then we take a step, and that step indicates actually taking an action within our environment。
And we're going to take a step in the new action direction, and that will output this tuple here,
which we're saving as the observation, the reward, whether or not it's done, and some extra information。
And we'll discuss this a bit further later on; we're actually going to print them out right here so we can take a look。

So we see that the observation is observation 0。There's no reward;
we only get a reward for getting to the end of the game。Are we done with the game? No。
And then there's this extra information, something about probabilities in regards to the steps that we're taking。
So if we run this again, we see that we now took a different step, and now our state is different:
our orange square is now below。Again, our reward will still be 0 because we haven't reached the end of the game。
Are we done? No, and that same information holds in regards to those probabilities。

Now we're going to take five different moves。So for i in range(5),
we just repeat what we just did, and we're going to print out at each one of these steps
what our observation is, what the reward is, whether or not we're done, and that extra information,
and then we're also going to render the actual state, the actual environment, at each step。

So we see we moved down, then we moved down again, then we moved up。

Then we went to the right, and we see that we made each one of these moves throughout。Again,
the move isn't necessarily indicative of where we end up。
There's a probability that we go in a different direction, and we have to take that into account。
The reward stays at zero unless we reach the end of the game。And here it says done,
and the reason why we're done is because we hit a hole。And if you hit a hole,
then you have lost the game: the game's over, and you get zero reward。

So that's why we have done actually equal to true here。

Now, hopefully we have a clear idea, or a guess, as to what each one of these outputs means,
given that our observation starting off was zero, then it moved to four, as we see here。
If we think about those observations, it makes sense that we start off at 0,
and these are the numerical values for every single state within our environment。So 0, 1, 2, 3, 4, 5, 6,
and we see this is 5,

and that's observation 5。The reward refers to the outcome of the game, and we only get a 1 if we're at that G。
Done tells us whether the game has ended, and again,
that game ends either at G or once you land in a hole。
Then the info gives us extra information about the world, and here it's probabilities。
And we ask you to perhaps guess what this means here, and we'll talk about this a bit further on。
But again, it's the idea that there's some type of probabilistic determination, rather than a clear determination, in regards to each step that you're going to take。

So now we want to simulate an entire episode。Again,
we described an entire episode as a game that's played from start to finish,
whether you land in a hole or you land at G。So we start off with the environment's
current observation being at zero by resetting our environment。

We know that done is equal to false to start off。

And while that's false, we set that new action equal to the sample, and we then take that random action。
And we keep doing this, since we have this while loop, until done is set to true。
And each time we run this, we're going to print out the new action, the new observation, the reward,
whether or not we're done, and that information。So we run this,
and we see each one of the actions taken and the observations where we ended up。And then finally,
we ended up again at that observation 5, which is the same hole we fell into before,
and our game ended。

So there are some things to notice about the actions taken and where we ended up。If we look at
some of the actions and where we end up, we see here action 0 at observation 4,
and we ended up at observation 8, whereas here we were back at observation 4 and we took action 0,
and we stayed at observation 4。So there's a probability that we don't end up actually taking that same action,
or ending up in that same state, given that we took that same action。

And if we think about it, actions seem like, again, up, down, left or right,
but they also have some type of stochastic term built into them,
and there seems to be a one-third chance of going into a different square rather than the square that you're aiming for,
given that you chose a certain direction。
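
To tie the pieces together without depending on a particular Gym version, here is a small pure-Python stand-in for the Frozen Lake game. It is not the real library: `ToyFrozenLake` and `sample_action` are names invented for this sketch, and the dynamics are assumptions (the standard 4x4 layout, actions numbered 0 to 3 as left, down, right, up, and the ice sending you in the intended direction or one of its two perpendicular directions with probability 1/3 each). It mirrors the reset/step pattern and the observation, reward, done, info tuple walked through above, then plays one random episode.

```python
import random

# Layout as described in the lecture: S start, F frozen, H hole, G goal.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up


class ToyFrozenLake:
    """Stand-in with a Gym-like reset/step interface (not the real library)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0  # observation 0 is the start square S
        return self.state

    def sample_action(self):
        return self.rng.randrange(4)

    def step(self, action):
        # Slippery ice: the intended direction or one of its two
        # perpendicular neighbours, each with probability 1/3.
        actual = self.rng.choice([(action - 1) % 4, action, (action + 1) % 4])
        row, col = divmod(self.state, 4)
        d_row, d_col = MOVES[actual]
        row = min(max(row + d_row, 0), 3)  # walls keep the agent on the grid
        col = min(max(col + d_col, 0), 3)
        self.state = row * 4 + col
        tile = LAKE[row][col]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "GH"
        return self.state, reward, done, {"prob": 1 / 3}


env = ToyFrozenLake(seed=0)
observation = env.reset()
done = False
history = []
while not done:
    action = env.sample_action()
    observation, reward, done, info = env.step(action)
    history.append((action, observation, reward, done))
```

Under these assumed dynamics, taking action 0 (left) at observation 4 can leave you at 4, move you up to 0, or move you down to 8, which matches the behaviour seen in the notebook output。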


Now, I'm going to pause the video here, now that we've walked through this setup of working with Gym。
And in the next video, we're going to start to gather information so we can start to figure out our reward system and whether or not certain actions throughout all of our steps are good actions or bad actions。
All right, I'll see you in the next video。


115:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p115 76_RL笔记本(选修部分)第2部分.zh_en -BV1eu4m1F7oz_p115-
Welcome back。Here we're in part 2。In our first video,
we introduced everything that we needed to understand for OpenAI Gym。
And here we're going to run a bunch of game simulations through to the end。

And start to observe and gather data on what type of actions actually lead to rewards。

So some things to keep in mind: we're going to want to store each one of the different run-throughs, and we're going to call each one of those an episode。We discussed earlier how one game from start to finish is a single episode; we start again, and that'll be the second episode, and so on and so forth。We're going to do that for a thousand or more episodes; here we're actually going to do it for
40,000 episodes。And at each step, we're going to want to save the observation, where we currently are,
the action taken from that observation, and then the current reward,
whether or not we've reached a reward state。

And some things to keep in mind as we go through this。
We're going to want to reset our environment after each episode so that we start anew。
We saw before how we output this tuple of new observation, reward,
whether or not it's done, and that extra information every time we take a step,
every time we take an action。And we're going to want to continue this game until done is equal to true, until we're actually finished running through;
that's going to be one of these outputs that we have。

And we saw how we did that up above。And then some things that we can work with: if we think about env.action_space.n,
that's going to give us the number of possible actions, and if we think about that,
that should be four possible actions: up, down, right and left。And then if we want to take a sample,

that's going to be that random step, and that's what we're going to be doing as we gather data。
And then if we want to know how many possible states there are in the environment,
we can call env.observation_space.n, and that will tell us how many possible states there are,
and if we think about our four-by-four grid, that means there are going to be 16 different states。
So these are things that you can look at in general as you work with these environments。



So walking through what we have here: we're going to re-create our environment。

We're saying that we want to go through 40,000 full episodes, each from reset up until the end,
so that's going to be our goal as we run through this for loop。
And we're going to save all of them in this life memory。So then we say for i in the range of the number of episodes,
and I'm actually going to pull this out so that we can look at some of these steps separately。

So this is the entire for loop that I'm going to pull out and put into a separate cell。


So we have the old observation, and starting off our old observation will be 0。
Whether or not it's done starts at false。The total reward at this point is equal to 0。
And the episode memory, again within this for loop that we just pulled out, is set to blank。


And while not done, we take a new action, and that new action is just going to be a random action。
And we saw earlier how we can get a random action using env.action_space.sample。
Then, using that new action and using env.step,
which is how we take an action within our environment, we can get the current observation,
the reward, and whether or not we're done。And then our total reward for these steps will be equal to the past reward plus the new reward。
And generally speaking, this will only add up to 0 or 1, because once we hit 1,
we've hit the end of the game, or we could end the game by landing in a hole,
and therefore we'd stay at 0。Then what we want to save at each step is going to be the observation,
so where we were before, what action we took, what the reward was once we took that action,
and then which episode we're on, which will stay the same until we end the game。
This is still all within this while loop。

And we keep doing this until we hit done。

So let's actually just run this portion, I'm going to put this in the cell below。

So we've now run one episode。And as for i, it's left over from a different for loop, where i is equal to 4,
so these steps are all going to be listed as episode 4。But if we look at the episode memory now,


we see that we took a few steps: observation zero, and the actions that we took until we hit the end of the game。
And we hit the end of the game by falling in a hole,
and therefore, by ending in a hole,
we didn't actually reach our reward。

And then when we're saving these episode memories, there's more that we want to save when we put this into a pandas DataFrame,
which is what we're ultimately going to do。
So here we have the observation, the action, the reward, the episode。
All these will be different features in our data frame。We're also going to want the total reward,

which is just going to be those values that we added up as new rewards came in。
So if we did hit a reward and we did get to one, then we'd attribute one to every single one of our different observations in that episode。

And then we want to look at this value: i times total reward divided by number of steps。
And what that does, and we'll look at this a little bit later on once we actually have some episodes that reached that reward state,
is weight each of our steps according to how close we were to achieving that reward。
So in the first step, that action is probably not as important as that last step that led to our reward。
So that last step will get a higher decay reward, which is what that's going to be called。


So we're going to extend that life memory using this。

Oh, we need life memory defined up here, since I pulled that out of the for loop。

But we can add on that episode memory and let's just look at what that episode memory looks like。
And you see that it's going to be the same as before。
except now we added on that total reward plus that decay reward。

So that's going to be everything here within the for loop。
and then that's going to be added onto our life memory。

Which is our list here。

And then at the end, we're going to take that life memory and put that into a pandas DataFrame。
So let's run this, and this will take just a bit to run as we're running through
40,000 different iterations。So now it's run, and we're going to call memory_df.describe。

And we see for each one of our features that we have the count, and for observation, the average state where we were,
and usually you're probably closer to the beginning before the game ends。Then the actions taken, and the reward。
That mean reward will be useful to understand, but we'll get a little bit deeper into understanding the full reward; that probably means more for the total reward amounts, where we have
2.4。



And to make this a bit clearer, well, first, let's look at the shape。
We see that we have 306,997 different rows。And let's look at a couple of values here。

Let's filter memory_df, or rather,
let's look where there's actually a reward: let's look where the total reward equals one,
and I'll explain why in just a second。

So we have our total reward here equal to one。And we see that it took quite a few steps to get there。
And now we can start to understand what each one of these new rows actually mean。

So if we look at the reward at each individual step, and these are in order... well,
let's also just look at a single one of the episodes,
so let's say episode 182, because we see that it ended in a reward。
So if we look at each one of these different steps, they're in order。
And we see that the reward is zero all the way up until the end, when we got to that final step。
And with that, our total reward is then equal to one, and it says that for episode 182,
given the steps that we took, we were able to get to that end goal that we were looking for。
And then the decay reward weights each one of the steps differently,
according to how close we were to getting to that final step。So here we are only a step away,
so we get more decay reward, whereas at the beginning, in the first one,
those steps probably aren't as important in regards to getting to that final reward that we're looking for。
And that's going to play a role in deciding how we're going to ultimately optimize on our reward system。

Then, to see how often we actually got to the reward, we're going to group by episode。
So grouping all of our values by episode, we're summing the reward, and recall that each episode will either sum to 0 or sum to 1。
As we saw here, these are all zeros for the reward feature, except for that final value。
So if we group by episode, it'll either be 0 or 1。And if we take the average value,
then we can see, on average, how often we were successful。
And we see that we were only successful 1.4% of the time, so if we just take random steps,
we probably won't be successful very often。
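
The whole data-gathering loop can be sketched in plain Python. This is an approximation of what the notebook does, under stated assumptions: instead of Gym and pandas it uses a small hand-written simulator (`slippery_step`, with the same assumed layout and 1/3 slip probabilities as the real game is believed to have) and a list of dictionaries in place of the data frame. The function and variable names are illustrative, and fewer episodes are run than the 40,000 in the video.

```python
import random

LAKE = "SFFFFHFHFFFHHFFG"  # 4x4 grid flattened row by row (assumed layout)
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up


def slippery_step(state, action, rng):
    # Assumed dynamics: intended move or a perpendicular slip, 1/3 each.
    d_row, d_col = MOVES[rng.choice([(action - 1) % 4, action, (action + 1) % 4])]
    row = min(max(state // 4 + d_row, 0), 3)
    col = min(max(state % 4 + d_col, 0), 3)
    new_state = row * 4 + col
    return new_state, float(LAKE[new_state] == "G"), LAKE[new_state] in "GH"


def gather_random_episodes(num_episodes, seed=0):
    rng = random.Random(seed)
    life_memory = []
    for episode in range(num_episodes):
        old_observation, done, total_reward = 0, False, 0.0
        episode_memory = []
        while not done:
            action = rng.randrange(4)  # random action
            observation, reward, done = slippery_step(old_observation, action, rng)
            total_reward += reward
            episode_memory.append({"observation": old_observation, "action": action,
                                   "reward": reward, "episode": episode})
            old_observation = observation
        num_steps = len(episode_memory)
        for i, row in enumerate(episode_memory):
            row["total_reward"] = total_reward
            # Later steps of a successful episode get more credit.
            row["decay_reward"] = i * total_reward / num_steps
        life_memory.extend(episode_memory)
    return life_memory


memory = gather_random_episodes(5000)
reward_by_episode = {}
for row in memory:
    reward_by_episode[row["episode"]] = reward_by_episode.get(row["episode"], 0.0) + row["reward"]
success_rate = sum(reward_by_episode.values()) / len(reward_by_episode)
```

Summing the reward per episode gives 0 or 1 for each game, so the mean of those sums is the fraction of games won, which stays small when every move is random。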

So the next goal, and what we'll discuss in the next video, is actually leveraging the data that we gathered, all those 300,000-plus rows that we now have, to come up with more intelligent steps to take throughout our system。
All right, I'll see you there。


116:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p116 77_RL笔记本(选修部分)第3部分.zh_en -BV1eu4m1F7oz_p116-
In this video, we're going to go about actually predicting what the next step should be。
So in order to do that, we're actually going to model using those observations that we just built。
using that data frame that we just built。So we're going to create somewhat of a supervised learning problem now。
And we're going to import the random forest regressor and the extra trees regressor,
and if we recall, the extra trees regressor is just going to be a more randomized version of random forest, where the splits are going to be random。
So we're going to set our extra trees regressor as our model and set the number of estimators。
We're then going to set what our target value is going to be, so we need to create a y variable
given the information that we have。And the way that we're going to do that is we're going to weight some of the different rewards that we had within our data frame。
So we had the actual reward at each step, and recall that's only going to be available at that final step, if we did end up reaching that reward。
We'll add 0.1 times the decay reward, and recall that that's going to more heavily weight those actions that lead closer to that final reward,
and then we're going to put the most weight on the total reward:
whether or not the steps along the way led to us ultimately getting a reward within that episode。
And then our X variable, our X data that we care about, is just going to be the observation and the action taken at that observation。
So given the state that we're at, what action did we take, and, given that action,
did that lead to higher or lower reward? So we fit that。

Given our X and Y。
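As a rough sketch of the setup just described, the target and features could be built like this。The data here is random stand-in data, the column names (observation, action, reward, decay_reward, total_reward) are assumptions, and the weights of 1.0 on the step reward and on the total reward are also assumptions; only the 0.1 on the decay reward comes from the description above。

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Random stand-in for the gathered experience (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
memory_df = pd.DataFrame({
    "observation": rng.integers(0, 16, size=n),
    "action": rng.integers(0, 4, size=n),
    "reward": rng.integers(0, 2, size=n).astype(float),
    "decay_reward": rng.random(n),
    "total_reward": rng.integers(0, 2, size=n).astype(float),
})

# Target: step reward + 0.1 * decayed reward + episode total reward.
# The 1.0 weights on reward and total_reward are illustrative assumptions.
y = memory_df["reward"] + 0.1 * memory_df["decay_reward"] + memory_df["total_reward"]

# Features: the state we were in and the action we took there.
X = memory_df[["observation", "action"]]

model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Predicted combined reward for the first few (observation, action) pairs.
preds = model.predict(X.head())
print(preds)
```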

And that gives us steps to take along the way。And it gives us an outcome of given the X values that went in。
what is going to be the predicted y。And then ultimately, if we think about it, the goal will be。
Take the observation that we're currently at。Pass each one of the optional actions that we have。
And then return which one of those actions resulted in our model, giving the highest output。
So let's see that here。

So we have our model here; we're going to use the random forest regressor。
We're using that same y and that same X that we just discussed, and then we're fitting our model。


We're then going to do 500 different episodes this time just to see the results。

We don't need to worry about this right now。

And then we're going to save our life memory as we did before,
and all of these steps are very similar to before: we have our initial observation for our episode。
We're not done yet, so our done is equal to false until we actually get to done,
our total reward is zero, and we have this blank episode memory for this specific episode。
Now, here's our first major difference。Our predictions are going to take the old observation,
as well as i for i in range(4); recall that we only have four different actions。And if we look at,
say, some old observation。

We can see, let's say here, that it's at five。
And we can actually take the model that we fit above。Well, actually, let's just do this first。
We can see what our data is: we're passing in this tuple, and recall that this tuple should relate to the observation and the action。
So we're at observation 5, and we can either take action 0, take action 1,
take action 2, or take action 3 when we're at observation 5。

We then call our model and call predict for each one of these different tuples。And when we do that,
we have that model that we fit above, so we can actually call model.predict here。

This is, again, referencing this model above, not the model that we have initiated down here。

But you see that this outputs four different values where that maximum value should be the next step。
because that's predicting the most amount of reward possible。

And that's why we call np.argmax on the output of what we had here。And that will tell us
which action would lead to the highest possible reward。
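The action-selection step described above can be sketched like this。It is a minimal illustration only: choose_action and toy_predict are hypothetical names, and toy_predict is a stand-in for the predict method of whatever fitted model is used。

```python
import numpy as np

def choose_action(predict, observation, n_actions):
    """Score every (observation, action) pair and return the best action."""
    # One row per candidate action, e.g. [[5, 0], [5, 1], [5, 2], [5, 3]].
    candidates = np.array([[observation, a] for a in range(n_actions)])
    predicted = predict(candidates)
    # argmax gives the index, which is also the action, with the highest score.
    return int(np.argmax(predicted)), predicted

def toy_predict(pairs):
    """Stand-in for model.predict: made-up scores that depend only on the action."""
    scores = {0: 0.10, 1: 0.05, 2: 0.40, 3: 0.20}
    return np.array([scores[int(a)] for _, a in pairs])

action, predicted = choose_action(toy_predict, observation=5, n_actions=4)
print(action, predicted)
```

With a fitted regressor, model.predict would be passed in place of toy_predict。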

And then we take that new action, so now our new action is going to be decided by that argmax。

And then as we go along, we keep adding on the reward,
we append the actual observation that we were at, the action we took,
the reward, and the episode that we're at。



And we update the old observation and run through this loop until we hit that done equal to true,
right, this over here being equal to true。

We're then going to incorporate the total reward into our episode memory,
so once this is done running, we're going to say, if we hit any reward,
then we're setting that equal to 1 or 0。Generally speaking, as we discussed,
if we landed in a hole, we'd end up with a total reward of 0。
If we were able to get to that end point, then we'd end up with a total reward of one。

And then we add that on to our life memory。

And then we have our new data frame with this new life memory。
and then we can look at the mean value of our episode。
given the fact that we're now taking more educated steps along the way。

So I'm going to run this。And it'll take just a second to run:
we're going through 500 different iterations, and now we have a model that has to fit first and also predict at every single step along the way。
So we're going to pause the video as it's taking just a second to run。

And then here we see the results that we ended up getting。And that may have taken just a bit of time。
We see that it got here up to 62.4%。So much better in terms of how often it was able to get to that end goal。
I'm not promising it'll be this high。That's going to be somewhat random。
There is a bit of a stochastic process there。But still,
we see how much we are able to improve once we implemented that reward system。
Now that closes out our video here。In the last video in regards to reinforcement learning,
we're going to introduce one last environment and see how we can work through that environment as well。
And hopefully, as you start to go through reinforcement learning on your own and play around in the OpenAI Gym,
you'll be ready to hit the ground running。All right, I'll see you there。


117:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p117 78_RL笔记本(选修部分)第4部分.zh_en -BV1eu4m1F7oz_p117-
Welcome back to our final video here on reinforcement learning。In this video,
we're going to touch on a new environment, so that you'll learn how to work outside of just that Frozen Lake environment that we discussed earlier。
So here we're going to work with CartPole-v0。And as before, you can look at the documentation,
which can be helpful。But what we can also do, and what I want to reference here, is the actual site on OpenAI discussing this environment,
and then we'll come back to the documentation as well, as there are more details。

So the idea of CartPole is that there's a pole attached by an un-actuated joint to a cart, as we see in the picture to the right,
which moves along a frictionless track。And that system is controlled by applying a force of either plus one or minus one to the cart。
And the pendulum starts upright, and the goal is to prevent it from falling over。
So a reward of plus one is provided for every single time step that the pole remains upright。
So now our reward system is different: rather than there just being a reward at the end of the game,
there can be a reward at every single step。And then the episode ends when the pole is more than 15 degrees from the vertical,
so if we move further than 15 degrees from the vertical,
or the cart moves more than 2.4 units from the center。
So if you're able to move it over 2.4 units while keeping it upright, then you also end the game。
And the goal is to keep it up without moving 2.4 units for as long as possible,
and it also maxes out at 200, so that's going to be the maximum amount of reward that we can get for a single episode。
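To make the three ending conditions concrete, here is a small sketch of the rule as it is described above。The function name episode_done is hypothetical, and the thresholds (15 degrees, 2.4 units, 200 steps) are simply the numbers quoted in this video rather than values read from the library, so they may differ from what a given version of the environment uses。

```python
def episode_done(cart_position, pole_angle_deg, steps_taken,
                 max_position=2.4, max_angle_deg=15.0, max_steps=200):
    """Return (done, reason) following the rule described in the video."""
    # The pole has tipped too far from vertical.
    if abs(pole_angle_deg) > max_angle_deg:
        return True, "pole fell"
    # The cart has drifted too far from the center of the track.
    if abs(cart_position) > max_position:
        return True, "cart left the track"
    # The episode is capped, so the reward per episode is capped too.
    if steps_taken >= max_steps:
        return True, "reached the step cap"
    return False, "still balancing"

print(episode_done(0.3, 4.0, 57))
print(episode_done(2.6, 1.0, 80))
print(episode_done(0.1, 2.0, 200))
```

Since each surviving step earns a reward of plus one, the step cap is also the maximum total reward for an episode。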

Now some details that are worth knowing, that are in our documentation here, are that at each time step,
at each observation, to know our state, we can get the cart position and the cart velocity,
and we have the min and max for each。The position only goes from negative 4.8 to 4.8。
The velocity can go as fast or as slow as possible, or as fast in the opposite direction。
We have the angle that goes from negative 24 degrees to 24 degrees,
and then the velocity at the tip。And those are going to be the different observations that will be available to us。
And if we think about it, again, we're going to want to leverage those observations。
In order to come up with our modeling of how to optimize on rewards。
And there's only going to be two different actions rather than the four that we had before。
either pushing the cart to the left one unit or pushing the cart to the right one unit。
So now many of these steps are going to be very similar to what we did before。
so first we're going to gather our data by just taking random actions。
So we see here that we're using env.action_space.sample(),
so we're taking a random action, and for each random action we get the observation。And now again,
the observation is going to have four different values。We're going to get the reward, and again,
we can get a reward now at every single step。We get whether or not it's done,
and we talked about the criteria for it being done:
either it moves a certain number of units without falling, it goes beyond a certain number of degrees
and therefore does fall, or we are able to keep it balanced for 200 time steps。
And then some extra information。Now, our total reward is also going to be more important,
as at each time step we can add on more reward, and we can end the game and have a pretty successful game without maxing out the total reward,
which would be 200。And then we're going to append to our memory,
and this is going to be something we put into our data frame, what observation 0 was。And
I don't recall which one is which, but we can call this again if you're curious:
we had the position, the velocity, the angle, and the velocity at the tip。
Those are going to be our different observations。Then we have the action taken,
given the state that we are in, which is described by observations 0 through 3。
We get whether or not we had a reward, and we keep track of what episode we're in。
And then we set the old observation to the new observation,
and we continue until we reach the end of the game。And then at the end of the game,
we're also going to save in our dictionary what the total reward is, as we did before, so that when we get our data frame at the end,
we can look at that as well。So we run this and we put this into a data frame as we did before,
and we see that the mean number of steps that we were able to take without it falling,
without us failing, is 22。And we can look at our memory_df and we see the observations and the average values for each one of the observations,
the different actions taken, which should average out at 50% because there's 0 and 1。
The reward on average was very close to one; it was only zero when we failed,
and that only happened at the end of the game, so on the 23rd step。
And then what will make this clear is if we look again as we did before。Let's first
go here and let's actually look at where memory_df.total_reward is the max value。And hopefully
we have one here where it got all the way to 200。And we don't; we only got up to 94 here,
and I guess that's because we're taking random steps; probably once we optimize,
we'll be able to get to 200。And we can see that we took an action of either 1 or 0 throughout,
and we were able to keep it balanced without moving 2.4 units for 94 different time steps。
Now as before, let's create our regressor。We're also going to have to create the y variable that we're trying to optimize on。
So if we see here, we're going to create this comb_reward, or combined reward。
And that's going to be 0.5 times the actual reward at each time step, plus the total reward。
So we're going to optimize more on whether we're able to remain higher on that total reward。
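The combined reward described above can be sketched on a tiny made-up log as follows。The column names are assumptions, and computing total_reward with a group-by transform is also an assumption made here for self-containment; in the notebook the total reward is saved into memory at the end of each episode instead。

```python
import pandas as pd

# Hypothetical per-step log for two short episodes.
memory_df = pd.DataFrame({
    "episode": [0, 0, 0, 1, 1],
    "reward":  [1.0, 1.0, 0.0, 1.0, 0.0],
})

# Broadcast each episode's total reward onto every row of that episode.
memory_df["total_reward"] = memory_df.groupby("episode")["reward"].transform("sum")

# Combined target: half weight on the step reward, full weight on the episode total.
memory_df["comb_reward"] = 0.5 * memory_df["reward"] + memory_df["total_reward"]
print(memory_df)
```

Every step of a long episode gets a high target, which is what pushes the model toward actions seen in episodes that stayed balanced longer。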
And then we're going to fit our extra trees regressor here on those different observations
and the actions taken。And that's similar to before,
except before our observation, or our state, could only take on one value; here it has four different values that describe that state。
And we have our action, and we're trying to optimize on this memory_df.comb_reward。
So given all these different values, what's going to be the predicted reward,
the combined reward that we just came up with? Now that we have our model fit,
we're going to use the same steps that we did before。Here we're taking that old observation, and let's
again show this above。Oh, first, we have to run this。And then we're going to look at this above,
and we see that we have all the observation values and then the action that we'd want to take, either 0 or 1,
so that's going to be the old observation plus i for i in range(2)。
And those are going to be our input values, and given those input values,
we can come up with a prediction as we did before, so we call model.predict。
And we get the two different values and we just choose the maximum value again。So at each step,
we choose whether to go left or right, according to which one maximizes the predicted output given the data that we've gathered。
And then all the steps from there are essentially the same, saving it all into memory。
And then we can see, given our new model, how much further we got,
how much further along we were able to keep that cart balanced。So I'm going to run this。
And we'll pause the video again, and we'll come back once it's done running and discuss those results。
Now, as we see here in our results in output 49, we were able to get up to 113.77
on average in regards to how many different steps we were able to take,
and each step is going to add on to that reward。So we see that by optimizing our model on our combined rewards,
we're able to greatly increase from 22 up to 113.77。
Now I also pulled out here, from our new data frame,
the actual total rewards and where those total rewards max out。Now, as mentioned, if we get up to 200,
it'll stop the game and say that you've accomplished the highest goal possible。
So we see we got here up to 200 for 2000 rows。And this is one row for every action, so that means we got there about 10 times,
so we see we're able to max out, which we were not able to do when we were only taking random steps。
Now that closes out our video here on reinforcement learning。
I encourage you to dive into the OpenAI website and keep playing around with the different environments that may be available to you, to keep learning more and more about reinforcement learning。

118:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p118 0_机器学习毕业项目简介.zh_en -BV1eu4m1F7oz_p118-

Welcome to the Machine Learning Capstone Project Introduction。

Yan Luo PhD is a data scientist and developer at IBM Canada。

He has been building innovative AI and cognitive applications in various areas。
such as mining software repositories, personalized health management, wireless networks。
and digital banking。

Yan received his PhD in machine learning from the University of Western Ontario。

In previous machine learning courses, you learned four types of machine learning algorithms;
let's briefly recap those。Regression algorithms belong to supervised learning。
Regression aims to map a feature vector onto a numerical target variable
and to learn the coefficient of each feature in relation to the target variable。
Classification algorithms also belong to supervised learning。They map a feature vector onto a categorical target variable, such as customer churn versus no churn。

For unsupervised learning and dimension reduction, you do not have a target variable。Instead,
in unsupervised learning, you try to find the patterns within the data itself。
Typical tasks are similarity measurement, clustering, principal component analysis, and so on。

Lastly, you also learned about deep learning, which involves building deep and complex machine learning models such as neural networks to solve complicated tasks like computer vision and natural language processing。

Now in this capstone project, you will have opportunities to apply the machine learning knowledge and skills you acquired from previous courses。
You will be given an industrial scenario with real-world data sets, and you will solve valuable real-world problems using machine learning。
Finally, after you have completed the project, you will have the opportunity to showcase your comprehensive machine learning skills to your peers。More specifically, in this project you will be asked to apply a wide range of machine learning algorithms, such as regression,
classification, and clustering, to predict whether a user will like an item or not。This problem setting, that is, user-item interaction prediction, is fundamental to many successful machine learning systems such as recommender systems,
social network mining, and advertising prediction。In this capstone project,
we will focus on recommender systems。


With this in mind, we believe this capstone course will be an asset to your machine learning portfolio。

119:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p119 1_推荐系统简介.zh_en -BV1eu4m1F7oz_p119-
Hello and welcome。In this video, we'll be going through a quick introduction to recommendation systems,
so let's get started。

Even though people's tastes may vary, they generally follow patterns。By that。
I mean that there are similarities in the things that people tend to like。
Or another way to look at it is that people tend to like things in the same category or things that share the same characteristics。
For example, if you've recently purchased a book on machine learning and Python and you've enjoyed reading it。
it's very likely that you'll also enjoy reading a book on data visualization。
People also tend to have similar tastes to those of the people they're close to in their lives。
Recommender systems try to capture these patterns and similar behaviors to help predict what else you might like。
Recommender systems have many applications that I'm sure you're already familiar with。Indeed,
recommender systems are usually at play on many websites。For example,
suggesting books on Amazon and movies on Netflix。In fact,
everything on Netflix's website is driven by customer selection。
If a certain movie gets viewed frequently enough, Netflix's recommender system ensures that that movie gets an increasing number of recommendations。



Another example can be found in a daily use mobile app where a recommender engine is used to recommend anything from where to eat or what job to apply to。
On social media, sites like Facebook or LinkedIn regularly recommend friendships。

Recommender systems are even used to personalize your experience on the web。For example,
when you go to a news platform website, a recommender system will make note of the types of stories that you clicked on and make recommendations on which types of stories you might be interested in reading in the future。
There are many of these types of examples, and they are growing in number every day。
So let's take a closer look at the main benefits of using a recommendation system。

One of the main advantages of using recommendation systems is that users get a broader exposure to many different products they might be interested in。
This exposure encourages users towards continual usage or purchase of their product。

Not only does this provide a better experience for the user。
but it benefits the service provider as well, with increased potential revenue and better security for its customers。

There are generally two main types of recommendation systems。
content based and collaborative filtering。

The main difference between each can be summed up by the type of statement that a consumer might make。

For instance, the main paradigm of a content based recommendation system is driven by the statement。
Show me more of the same of what I've liked before。

Content based systems try to figure out what a user's favorite aspects of an item are and then make recommendations on items that share those aspects。

Collaborative filtering is based on a user saying, tell me what's popular among my neighbors because I might like it too。


Collaborative filtering techniques find similar groups of users and provide recommendations based on similar tastes within that group。
In short, it assumes that a user might be interested in what similar users are interested in。 Also。
there are hybrid recommender systems, which combine various mechanisms。

In terms of implementing recommender systems, there are two types, memory based and model based。
In memory based approaches, we use the entire user item data set to generate a recommendation system。
it uses statistical techniques to approximate users or items。

Examples of these techniques include Pearson correlation, cosine similarity,
and Euclidean distance, among others。In model based approaches,
a model of users is developed in an attempt to learn their preferences。
Models can be created using machine learning techniques like regression, clustering, classification,
and so on。
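For the memory based side, the three similarity measures named above can be written out in a few lines。This is a minimal sketch using the standard textbook definitions; the two rating vectors are made-up numbers, and the function names are just illustrative。

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two rating vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Cosine similarity after subtracting each user's own mean rating."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)

# Hypothetical ratings given by two users to the same three items.
user_a = [5, 3, 4]
user_b = [4, 2, 3]

print(euclidean_distance(user_a, user_b))
print(cosine_similarity(user_a, user_b))
print(pearson_correlation(user_a, user_b))
```

Here user_b rates everything exactly one point lower than user_a, so the Pearson correlation treats their tastes as identical even though the raw distance is not zero。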


This is the end of our video。Thanks for watching。
120:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p120 2_基于内容的推荐系统.zh_en -BV1eu4m1F7oz_p120-
Hello and welcome。In this video we'll be covering content based recommender systems,
so let's get started。

A content based recommendation system tries to recommend items to users based on their profile。
The user's profile revolves around that user's preferences and tastes。
It is shaped based on user ratings, including the number of times that user has clicked on different items or perhaps even liked those items。
The recommendation process is based on the similarity between those items。
similarity or closeness of items is measured based on the similarity in the content of those items。
When we say content, we're talking about things like the item's category, tag, genre, and so on。
For example, if we have four movies, and if the user likes or rates the first two items,
and if item 3 is similar to item 1 in terms of their genre,
the engine will also recommend item 3 to the user。In essence,
this is what content based recommender system engines do。Now let's dive into a content based recommender system to see how it works。

Let's assume we have a data set of only six movies。
This data set shows movies that our user has watched and also the genre of each of the movies。
For example, Batman versus Superman is in the adventure and superhero genres, and Guardians of the Galaxy is in the comedy, adventure,
superhero, and science fiction genres。Let's say the user has watched and rated three movies so far,
and she has given a rating of 2 out of 10 to the first movie, 10 out of 10 to the second movie,
and 8 out of 10 to the third。The task of the recommender engine is to recommend one of the three candidate movies to this user,
or in other words, we want to predict what the user's possible rating would be of the three candidate movies if she were to watch them。

To achieve this, we have to build the user profile。First,
we create a vector to show the user's ratings for the movies that she's already watched。
We call it input user ratings。Then we encode the movies through the one-hot encoding approach;
the genres of the movies are used here as the feature set。

We use the first three movies to make this matrix, which represents the movie feature set matrix。
If we multiply these two matrices, we can get the weighted feature set for the movies。

Let's take a look at the result。 This matrix is also called the weighted genre matrix and represents the interests of the user for each genre based on the movies that she's watched。
Now, given the weighted genre matrix, we can shape the profile of our active user, essentially。
we can aggregate the weighted genres and then normalize them to find the user profile。
It clearly indicates that she likes superhero movies more than other genres。
We use this profile to figure out what movie is proper to recommend to this user。
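The profile-building steps above can be sketched with a few lines of NumPy。The ratings (2, 10, 8) and the genres of the first two movies come from the description above; the genre encoding of the third movie is an assumption made up for this illustration, as are the variable names。

```python
import numpy as np

genres = ["comedy", "adventure", "superhero", "sci-fi"]

# Input user ratings for the three movies she has watched.
input_user_ratings = np.array([2.0, 10.0, 8.0])

# One-hot genre matrix, one row per watched movie.
movie_genres = np.array([
    [0, 1, 1, 0],  # Batman versus Superman: adventure, superhero
    [1, 1, 1, 1],  # Guardians of the Galaxy: comedy, adventure, superhero, sci-fi
    [1, 0, 1, 0],  # third movie: genres assumed here (comedy, superhero)
], dtype=float)

# Weighted genre matrix: each movie's genre row scaled by its rating.
weighted_genres = movie_genres * input_user_ratings[:, None]

# Aggregate per genre, then normalize so the profile sums to 1.
genre_totals = weighted_genres.sum(axis=0)
user_profile = genre_totals / genre_totals.sum()

for genre, weight in zip(genres, user_profile):
    print(f"{genre:10s} {weight:.2f}")
```

Under these assumed genres the superhero weight comes out highest, which matches the conclusion drawn in the video。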

Recall that we also had three candidate movies for recommendation that haven't been watched by the user。
We encode these movies as well。 Now, we're in the position where we have to figure out which of them is most suited to be recommended to the user。

To do this, we simply multiply the user profile matrix by the candidate movie matrix。
which results in the weighted movies matrix。It shows the weight of each genre with respect to the user profile。
Now, if we aggregate these weighted ratings, we get the active users possible interest level in these three movies。
In essence, it's our recommendation list, which we can sort to rank the movies and recommend them to the user。
For example, we can say that The Hitchhiker's Guide to the Galaxy has the highest score in our list and is a proper movie to recommend to the user。
Now, you can come back and fill in the predicted ratings for the user。
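The scoring of the candidate movies can be sketched in the same way。Everything numeric here is illustrative: the user profile vector, the genre encodings, and the two unnamed candidates are assumptions; only the title The Hitchhiker's Guide to the Galaxy is taken from the video。

```python
import numpy as np

genres = ["comedy", "adventure", "superhero", "sci-fi"]

# Assumed normalized user profile (weights per genre, summing to 1).
user_profile = np.array([0.3, 0.2, 1 / 3, 1 / 6])

# One-hot genre encodings for the unwatched candidates (assumed values).
candidates = {
    "The Hitchhiker's Guide to the Galaxy": [1, 1, 0, 1],
    "Candidate B (hypothetical)": [0, 1, 1, 0],
    "Candidate C (hypothetical)": [1, 0, 0, 0],
}
titles = list(candidates)
candidate_matrix = np.array([candidates[t] for t in titles], dtype=float)

# Weighted movies matrix: each genre flag scaled by the user's weight for that genre.
weighted_movies = candidate_matrix * user_profile

# Aggregating across genres gives one interest score per candidate.
scores = weighted_movies.sum(axis=1)

# Sort from highest to lowest score to get the recommendation list.
order = np.argsort(-scores)
ranked = [(titles[i], float(scores[i])) for i in order]
for title, score in ranked:
    print(f"{score:.2f}  {title}")
```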

So to recap what we've discussed so far, the recommendation in a content based system is based on the user's taste and the content or feature set of the items。
Such a model is very efficient。However, in some cases, it doesn't work。For example,
assume that we have a movie in the drama genre, which the user has never watched。
So this genre would not be in her profile。Therefore,
she'll only get recommendations related to genres that are already in her profile, and the recommender engine may never recommend any movie within other genres。
This problem can be solved by other types of recommender systems such as collaborative filtering。
Thanks for watching。

121:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p121 3_基于协作过滤的推荐系统.zh_en -BV1eu4m1F7oz_p121-
Hello and welcome。In this video we'll be covering a recommender system technique called collaborative filtering,
so let's get started。

Collaborative filtering is based on the fact that relationships exist between products and people's interests。

Many recommendation systems use collaborative filtering to find these relationships and to give an accurate recommendation of a product that the user might like or be interested in。Collaborative filtering has basically two approaches,
user based and item based。User based collaborative filtering is based on the user's similarity or neighborhood。

Item based collaborative filtering is based on similarity among items。
Let's first look at the intuition behind the user based approach。

In user based collaborative filtering, we have an active user for whom the recommendation is aimed。
The collaborative filtering engine first looks for users who are similar;
that is, users who share the active user's rating patterns。
Collaborative filtering bases this similarity on things like history,
preference, and choices that users make when buying, watching, or enjoying something。For example,
movies that similar users have rated highly。

Then it uses the ratings from these similar users to predict the possible ratings by the active user for a movie that she had not previously watched。
For instance, if two users are similar or are neighbors in terms of their interest in movies。
we can recommend a movie to the active user that her neighbor has already seen。

Now, let's dive into the algorithm to see how all of this works。

Assume that we have a simple user item matrix, which shows the ratings of four users for five different movies。
Let's also assume that our active user has watched and rated three out of these five movies。
Let's find out which of the two movies that our active user hasn't watched should be recommended to her。

The first step is to discover how similar the active user is to the other users。How do we do this?
Well, this can be done through several different statistical and vectorial techniques such as distance or similarity measurements,
including Euclidean distance, Pearson correlation, cosine similarity, and so on。
To calculate the level of similarity between two users,
we use the three movies that both users have rated in the past。
Regardless of what we use for similarity measurement, let's say, for example,
the similarity could be 0.7, 0.9, and 0.4 between the active user and the other users。
These numbers represent similarity weights or proximity of the active user to other users in the dataset。

The next step is to create a weighted rating matrix。
We just calculated the similarity of users to our active user in the previous slide。
Now we can use it to calculate the possible opinion of the active user about our two target movies。

This is achieved by multiplying the similarity weights by the user ratings。

It results in a weighted ratings matrix, which represents the opinions of the user's neighbors about our two candidate movies for recommendation。
In fact, it incorporates the behavior of other users and gives more weight to the ratings of those users who are more similar to the active user。

Now we can generate the recommendation matrix by aggregating all of the weighted ratings。

However, as three users rated the first potential movie and two users rated the second movie,
we have to normalize the weighted rating values。We do this by dividing them by the sum of the similarity indices of the users who rated each movie。

The result is the potential rating that our active user will give to these movies based on her similarity to other users。
It is obvious that we can use it to rank the movies for providing recommendation to our active user。
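The weighting and normalization steps above can be sketched as follows。The similarity weights 0.7, 0.9, and 0.4 are the ones quoted in the video; the neighbor ratings, and which neighbor skipped the second movie, are made-up values for illustration。

```python
import numpy as np

# Similarity of each of the three other users to the active user.
similarities = np.array([0.7, 0.9, 0.4])

# Neighbors' ratings for the two candidate movies (rows = users, columns = movies).
# np.nan marks "not rated"; these numbers are hypothetical.
neighbor_ratings = np.array([
    [9.0, 6.0],
    [8.0, np.nan],
    [7.0, 9.0],
])

rated = ~np.isnan(neighbor_ratings)

# Weighted ratings matrix: each rating scaled by that neighbor's similarity.
weighted_ratings = np.where(rated, neighbor_ratings, 0.0) * similarities[:, None]

# Aggregate per movie, then divide by the similarity mass of the users who rated it.
weighted_sums = weighted_ratings.sum(axis=0)
similarity_sums = (rated * similarities[:, None]).sum(axis=0)
predicted_ratings = weighted_sums / similarity_sums

print(predicted_ratings)
```

Dividing by the similarity sum of only the users who rated each movie keeps a movie from being penalized just because fewer neighbors have seen it。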

Now, let's examine what's different between user based and item based collaborative filtering。

In the user based approach, the recommendation is based on users of the same neighborhood with whom he or she shares common preferences。
For example, as user 1 and user 3 both liked item 3 and item 4,
we consider them as similar or neighbor users, and recommend item 1,
which is positively rated by user 1, to user 3。

In the item based approach, similar items build neighborhoods based on the behavior of users。Please note,
however, that it is not based on their contents。For example,
item 1 and item 3 are considered neighbors, as they were positively rated by both user 1 and user 2。

So item 1 can be recommended to user 3, as he has already shown interest in item 3。Therefore,
the recommendations here are based on the items in the neighborhood that a user might prefer。
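The item based idea can be sketched with plain Python sets。The likes below follow the example in the video (user 1 and user 3 both like items 3 and 4; items 1 and 3 are both liked by users 1 and 2); using Jaccard overlap of the "fans" of each item as the item similarity is an assumption chosen for simplicity, not necessarily the measure used in the course。

```python
# Which items each user rated positively (taken from the example above).
likes = {
    "user1": {"item1", "item3", "item4"},
    "user2": {"item1", "item3"},
    "user3": {"item3", "item4"},
}

def item_fans(likes):
    """Invert the mapping: for each item, the set of users who liked it."""
    fans = {}
    for user, items in likes.items():
        for item in items:
            fans.setdefault(item, set()).add(user)
    return fans

def jaccard(a, b):
    """Overlap of two user sets; depends on behavior only, not item content."""
    return len(a & b) / len(a | b)

def recommend(likes, user):
    """Score each unseen item by its similarity to the items the user liked."""
    fans = item_fans(likes)
    scores = {}
    for candidate in fans:
        if candidate in likes[user]:
            continue
        scores[candidate] = sum(
            jaccard(fans[candidate], fans[liked]) for liked in likes[user]
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

recommendations = recommend(likes, "user3")
print(recommendations)
```

Note that no genre or other content feature appears anywhere; the neighborhoods come purely from who liked what。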


Collaborative filtering is a very effective recommendation system, however。
there are some challenges with it as well。One of them is data sparsity。
Data sparsity happens when you have a large data set of users who generally rate only a limited number of items。
As mentioned, collaborative filtering based recommenders can only predict the score of an item if there are other users who have rated it。
Due to sparsity, we might not have enough ratings in the user-item data set,
which makes it impossible to provide proper recommendations。
Another issue to keep in mind is something called cold start。
Cold start refers to the difficulty the recommendation system has when there is a new user,
and as such, a profile doesn't exist for them yet。
Cold start can also happen when we have a new item which has not received a rating。
Scalability can become an issue as well。As the number of users or items increases and the amount of data expands,
collaborative filtering algorithms will begin to suffer drops in performance,
simply due to growth in the similarity computation。
There are some solutions for each of these challenges such as using hybrid based recommender systems。
but they are out of scope of this course。Thanks for watching。

122:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p122 4_成功数据分析报告的要素.zh_en -BV1eu4m1F7oz_p122-
While finding and cleaning data is an important first step in data analysis。
a concept can be lost if you are not able to organize and represent the findings effectively to your audience。In this video,
you will learn how to represent your findings by focusing on specific elements to create a successful data findings report。
After the data has been collected, cleaned and organized the work of interpretation begins。
You are now able to obtain a complete view of the data and hopefully answer the questions that were formed before starting the analysis。
Now, you typically begin to compose a findings report that explains what was learned。
Depending on the stakeholders and how they receive the information, the report could vary in form。
This could include a paper style report, a slideshow presentation, or maybe even both。
The findings report is a crucial part of data analysis,
as it conveys what was discovered。When beginning this process,
the collected data and information may seem a little overwhelming。The best way to get through this block is to begin by creating an outline。By completing an outline,
you can then get a complete picture and begin to write in a precise but simple manner。
While there are many different formats for creating a data driven presentation,
we have created a simple outline that is easy to follow yet effective。When creating your outline,
always remember to structure it towards your audience and create a presentation that is appropriate for your situation。
You first begin with your cover page。This beginning section will have the title of your presentation,
your name, and then the date。The next section in your outline will be an executive summary and then the table of contents。
The table of contents will contain the sections and subsections of your report in order to give your audience an overview of the contents。
This also enables readers to go directly to a specific section that may be more important to them。
Continue your presentation with the introduction, methodology, results, discussion, conclusion, and finally
the appendix。Note that the depth and length for each element may vary depending on the audience and format of the report。
The first step in creating your report is properly creating an executive summary.
This summary will briefly explain the details of the project and should be considered a standalone document.
This information is taken from the main points of your report, and while it is acceptable to repeat information, no new information is presented.
The next section after the table of contents is the introduction.
The introduction explains the nature of the analysis, states the problem, and gives the questions that were to be answered by performing the analysis.
The next section is methodology. Methodology explains the data sources that were used in the analysis and outlines the plan for the collected data.
For example, was the cluster or regression method used to analyze the data?
Next, we have the results section. This section goes into the detail of the data collection, how it was organized, and how it was analyzed. This portion would also contain the charts and graphs that would substantiate the results and call attention to more complex or crucial findings.
By providing this interpretation of data, you are able to give a detailed explanation to the audience and convey how it relates to the problem that was stated in the introduction.
Next, discuss the report findings and implications. For this section, you would engage the audience with a discussion of the implications that were drawn from the research.
For example, let's say you were conducting research on the top programming languages for college graduates.
Would you find they need to learn multiple languages to remain competitive in the job market, or would one language always reign supreme?
We have now reached the conclusion of the report findings.
This final section should reiterate the problem given in the introduction and give an overall summary of the findings.
It would also state the outcome of the analysis and whether any other steps would be taken in the future.
And last, we have the appendix. This section would contain information that really didn't fit in the main body of the report, but that you deemed still important enough to include. This type of information could include locations where the raw data was collected or other details such as resources, acknowledgments, or references.
In this video, we learned about the important elements in creating a successful data findings report.
In the next video, we will learn the best practices when presenting your findings.


123:IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 p123 5_展示发现的最佳实践.zh_en -BV1eu4m1F7oz_p123-
Okay, you've spent weeks, maybe months studying the data, and the time has come to report your findings. The questions have been answered, and you feel good about the story. So how will you speak to your audience so they leave with the intended message?
In this video, learn how to present your findings in a way that will engage and keep the attention of your audience.
Delivering data-driven presentations may seem easy, but there are a few important factors to remember in accurately conveying your message.
Make sure charts and graphs are not too small and are clearly labeled.
Use the data only as supporting evidence. Share only one point from each chart or graph, and eliminate data that does not support the key message.

Have you ever sat through a presentation where the information being presented was difficult to read or understand?
While this may seem apparent, small charts and labels can be easily overlooked.
Make sure to test the visualizations by sitting at different distances, like your audience.
And if the data cannot be seen clearly, then maybe a redesign should be considered.
When preparing the report, you may feel the only way to explain the findings is to pack the slides with data.
While this may seem sensible as a data analyst, your audience will probably not appreciate the intricacies of the data and just see a pile of numbers. To resolve this issue, begin by forming the key messages that need to be conveyed to the audience and build the story around these messages. After forming the outline, go back and insert the data to support your findings. By not relying heavily on the data and using this method to create the presentation, you will create a story that is engaging and interesting to your audience.
Presenting your data using charts and graphs is the best way to get your message across. However, if you are supplying too much information, it can be confusing. For example, look at this pie chart.
Can you decipher what the key message is and what the presenter is trying to convey? In the example, the chart has so much information that it is hard to determine what point the presenter is trying to make and what the focus should be for the audience. By sticking with one idea and not summarizing multiple points into one visualization, you are able to accurately convey the idea to the audience and avoid any confusion.
Data analysts can spend months researching data. However, some items that seem interesting to the analyst may not be relevant to the project.
Trying to explain every little detail to your audience and not recognizing irrelevant data could damage the key message. By eliminating this unnecessary data and highlighting only data points that support your key ideas, you will keep the presentation clear and concise.
In this video, we learned about creating a data-driven presentation that will keep your audience engaged and how to deliver a clear and concise message.


IBM《机器学习(无监督学习、深度学习和强化学习、毕业项目)|machine learning》中英字幕 -BV1eu4m1F7oz_p94-
Welcome back. Now, in this exercise, we're just going to play around with some of the parameters and show you some of the parameters that you can play around with. And then, on your own, as mentioned earlier, I would suggest you try playing around as well with the max features, the max length, as well as something that we won't do here, the hidden dimension for our RNN.
I believe we're also going to work with the word embedding dimension, and we'll see the performance for each as we move along.

The first thing that we have is we set the max features equal to 20,000, which is the same as what we had before. And then the other thing that we have is that we're setting the max length. So recall that we cap off our sequences, or our sentences, at a certain length and then pad them accordingly as well.
Here, rather than the 30 that we had before, we're going to pad or truncate at 80.
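
To make the pad-or-truncate step concrete, here is a minimal pure-Python sketch. The helper name `pad_or_truncate` and the toy sequences are hypothetical, not from the notebook. It assumes padding and truncation both happen at the front of the sequence, which is the default behavior of Keras's `pad_sequences`; the notebook may be configured differently.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly `maxlen` integers.

    Longer sequences keep only their last `maxlen` tokens; shorter ones
    are left-padded with `pad_value` (assumed convention, see lead-in).
    """
    if len(seq) >= maxlen:
        # Drop tokens from the front so the end of the review survives.
        return list(seq[len(seq) - maxlen:])
    # Pad at the front so real tokens sit right before the final RNN step.
    return [pad_value] * (maxlen - len(seq)) + list(seq)


# Toy word-index sequences: one short review, one long review.
reviews = [[5, 8, 2], list(range(1, 101))]
padded = [pad_or_truncate(r, maxlen=80) for r in reviews]
print([len(p) for p in padded])
```

Raising the max length from 30 to 80 simply means each review contributes up to 80 time steps to the RNN instead of 30, so more of the text is seen at the cost of longer training.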

Everything else that we have here stays the same. The same holds for our hidden dimension, as well as our word embeddings, as well as the setup of our model, our RNN model.

So we're just going to run this; we're not going to walk through that again.

Again, we're going to use RMSprop, and again the loss will be binary cross-entropy. The metric that we will track is going to be accuracy; along with that, the loss will automatically be tracked.
We run this, and then again, finally, we will fit on our training set using our X train and Y train, with the holdout validation data of X test and Y test.
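
As a reminder of what those two tracked numbers measure, here is a small pure-Python sketch of binary cross-entropy and accuracy. The function names, the clipping constant `eps`, the 0.5 threshold, and the toy labels and probabilities are all assumptions made for illustration; they are not taken from the notebook.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Toy sentiment labels (1 = positive) and made-up model outputs.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 0.7]
loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(f"loss={loss:.4f} accuracy={acc:.2f}")
```

The loss rewards confident correct probabilities, while accuracy only looks at which side of the threshold each prediction falls on, which is why the two can move somewhat independently from epoch to epoch.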

So we run this, and again, this may take some time to run. The only difference that we have with this sample versus before is that we're setting the max length, where we will truncate our sequences, up here at 80.
So I'm going to pause here, and we'll come back once this training is done.

So that will actually take quite a bit of time to run. We're also not going to run the accuracy results as we did before, but you can actually see those accuracy results here on the validation set. We see that it went up to 0.842, compared to what we had before, which was 0.7846, and that matches the evaluate output over here. So we see we're able to increase our accuracy.




Now, again, we're going to play around with just one more parameter here.

So this time, instead of 20,000 features, which is the number of actual words we're going to use, taking the most common words, we bring that down to 5,000, keeping that max length up at 80.
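
Capping the vocabulary at the most common words can be sketched as follows. This is only an illustration: the helper names, the toy documents, and the choice to reserve id 0 for padding and id 1 for out-of-vocabulary words are assumptions, and the dataset loader used in the notebook may assign ids differently.

```python
from collections import Counter


def build_capped_vocab(tokenized_docs, max_features, reserved=2):
    """Map the most frequent words to ids, leaving `reserved` ids free.

    Ids 0 and 1 are kept for padding and out-of-vocabulary tokens
    (an assumed convention for this sketch).
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    keep = max_features - reserved
    most_common = [word for word, _ in counts.most_common(keep)]
    return {word: i + reserved for i, word in enumerate(most_common)}


def encode(doc, vocab, oov_id=1):
    """Replace each word by its id, or by `oov_id` if it was cut from the vocab."""
    return [vocab.get(tok, oov_id) for tok in doc]


docs = [["great", "movie", "great", "cast"], ["bad", "movie"], ["great", "plot"]]
vocab = build_capped_vocab(docs, max_features=4)
encoded = encode(["great", "movie", "plot"], vocab)
print(vocab, encoded)
```

With a smaller max features value, rarer words collapse into the single out-of-vocabulary id, so the model sees less detail but has far fewer embedding rows to learn.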

And then for our word embedding dimension, recall that that's going to take those integer values that we're starting off with and convert them into x-dimensional vectors. Here they're 20-dimensional; prior, they were 50-dimensional vectors.
So we're shrinking that down as well. So we're actually changing two parameters here.
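
To see how much smaller the second configuration is, the sketch below counts trainable weights using the standard formulas for an embedding table (one vector per word) and a simple recurrent layer (input weights, recurrent weights, and a bias per hidden unit). The hidden dimension of 5 is an assumed placeholder, since this lecture only says it stays the same; the function names are hypothetical.

```python
def embedding_params(max_features, embed_dim):
    """One embed_dim-sized vector per word id."""
    return max_features * embed_dim


def simple_rnn_params(embed_dim, hidden_dim):
    """Input weights + recurrent weights + bias for each hidden unit."""
    return hidden_dim * (embed_dim + hidden_dim + 1)


HIDDEN_DIM = 5  # assumed value; the lecture does not state it here

configs = {
    "first run": {"max_features": 20000, "embed_dim": 50},
    "second run": {"max_features": 5000, "embed_dim": 20},
}

param_counts = {}
for name, cfg in configs.items():
    emb = embedding_params(cfg["max_features"], cfg["embed_dim"])
    rnn = simple_rnn_params(cfg["embed_dim"], HIDDEN_DIM)
    param_counts[name] = (emb, rnn)
    print(f"{name}: embedding={emb:,} rnn={rnn:,}")
```

Under these assumptions the embedding table dominates the parameter count in both runs, so cutting max features and the embedding dimension together shrinks the model by roughly a factor of ten.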

So we run this to get our new X train and X test.

We get our new RNN model, with everything else kept the same.

And then after that, again, we will compile with the same loss function, the same optimizer, tracking the same metrics. And we're going to call fit again here.

And then after that, this is going to run, and this will take some time as well.
So all of these will take a bit of time, and it's part of the process when we're doing deep learning.

But here we see it's running through that first epoch, and then we'll do that again for another 10 epochs here.
The goal being that if the accuracy on our holdout set is continuing to improve, we should probably run it for more epochs. And we did see that that was the case up here.
If we look at the validation accuracy, we see that it continued to go up after each epoch, so we probably could have continued to run that and gotten even higher accuracy.
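
The rule of thumb described here, keep training while holdout accuracy is still rising, can be written as a small check. This is a sketch only: the function name, the `window` and `min_delta` parameters, and both accuracy histories are made up for illustration and are not results from the notebook.

```python
def still_improving(val_acc, window=3, min_delta=0.0):
    """True if the best accuracy in the last `window` epochs beats all earlier epochs."""
    if len(val_acc) <= window:
        return True  # too little history to justify stopping
    best_before = max(val_acc[:-window])
    best_recent = max(val_acc[-window:])
    return best_recent > best_before + min_delta


# Hypothetical per-epoch validation accuracies.
rising = [0.70, 0.74, 0.77, 0.79, 0.80, 0.81]
flat = [0.70, 0.78, 0.82, 0.82, 0.81, 0.82]

print(still_improving(rising))  # keep training
print(still_improving(flat))    # more epochs are unlikely to help
```

A check like this is the manual version of what early-stopping utilities automate; looking at the printed validation accuracy after each epoch, as done in the lecture, serves the same purpose.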

So we're actually going to do that here. We'll see after 10 epochs how well we were able to perform, and then after that, I'm just going to run this now.
It'll run for another 10 epochs, and we'll see how much that accuracy can actually improve.
So I'm going to pause it here, and we will get back once we are done running both of these, which will take quite a bit of time.


Now, looking at our results here, we're going through 20 epochs now: 10 on the first run and another 10 on the next run. That second 10, of course, as it was in our last notebook, will pick up where we left off. So here we see that we actually had a 0.8479 on the training set, and we see that continues to go up as that loss continues to go down.

And we see towards the end that we get a validation accuracy of about 0.84. That is around equal to what we were able to accomplish in just 10 epochs using the word embeddings with 50 dimensions, max features of 20,000, and the max length of 80.

I'd say again, feel free to play around with these different parameters and see if you can improve the model. But in the next lecture, we are also going to start to discuss a more powerful recurrent neural net structure with more long-term memory, specifically LSTMs.

So I'll see you back in lecture, where we will pick up with long short-term memory models. All right, I'll see you there.


