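    PCA (Principal Component Analysis) is a common preprocessing method in statistical machine learning and data mining. It has two main uses: dimensionality reduction and data visualization. In real applications the dimensionality of each sample can be very large, sometimes far larger than the number of samples. The model then becomes very complex and tends to overfit, training is slow, and memory consumption is high. At the same time, some dimensions of real data may be linearly correlated or contain noise, which makes dimensionality reduction worthwhile. That said, PCA does little to prevent overfitting; regularization is the usual tool for that.

    PCA is a linear dimensionality reduction method. It finds a set of orthogonal directions along which the samples vary the most (these can be regarded as the principal components of the samples) and projects the samples onto those directions, reducing them to a linear subspace. PCA keeps the principal components of the data and discards the minor ones, which may be redundant or noise. It is often described as feature selection, though strictly speaking it is feature extraction, since each new feature is a linear combination of the original ones. If the data has n dimensions, we can pick the k projection directions with the largest variance and project the data onto the subspace they span. This gives k-dimensional data, a "compressed" representation of the original. If we keep all n projection directions, we have only chosen a different set of coordinate axes to express the same data.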
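
The projection described above can be sketched on a small synthetic data set. This is only an illustrative sketch, not part of the exercise: the data, the choice k = 1, and the variable names z, lambda and relErr are all assumptions made up for the example. It follows the same recipe as the exercise code (zero-mean, covariance, svd, project) and shows that the relative reconstruction error equals the fraction of variance that was discarded.

```matlab
% Minimal PCA sketch on synthetic data (all names and values are illustrative).
rng(0);                                   % fix the seed so the run is repeatable
m = 500;                                  % number of samples
z = randn(1, m);                          % one hidden factor driving the data
x = [z; 2*z + 0.1*randn(1, m); 0.1*randn(1, m)];  % 3 x m, one sample per column

x = x - repmat(mean(x, 2), 1, m);         % zero-mean each dimension (row)
sigma = x * x' / m;                       % 3 x 3 covariance matrix
[U, S, ~] = svd(sigma);                   % columns of U are the principal directions
lambda = diag(S);                         % variance along each direction, descending

k = 1;                                    % keep only the first direction
xTilde = U(:, 1:k)' * x;                  % k x m compressed representation
xHat = U(:, 1:k) * xTilde;                % back-projection into the original space

% Relative reconstruction error versus the fraction of variance discarded
relErr = norm(x - xHat, 'fro')^2 / norm(x, 'fro')^2;
fprintf('variance kept: %.4f, relative error: %.4f\n', ...
        lambda(1) / sum(lambda), relErr);
```

The two printed numbers should add up to 1, which is why the retained variance is a sensible measure of how much information a given k preserves.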

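   So how do we choose k, the number of principal components? The usual approach is to decide what percentage of the variance to retain and take the smallest k that reaches it. For example, to retain 99% of the variance, choose the smallest k for which the sum of the k largest eigenvalues of the covariance matrix is at least 99% of the sum of all eigenvalues.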
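
As a rough sketch of this rule, the smallest k can be read off the cumulative sum of the eigenvalues. The spectrum lambda below is made up for illustration, and the variable name retained is an assumption of this example; in the exercise the eigenvalues would come from the diagonal of S.

```matlab
% Pick the smallest k whose leading eigenvalues hold at least 99% of the variance.
% lambda is a made-up eigenvalue spectrum, sorted in descending order.
lambda = [5.0; 3.0; 1.5; 0.3; 0.15; 0.05];
retained = cumsum(lambda) / sum(lambda);   % fraction of variance kept for k = 1, 2, ...
k = find(retained >= 0.99, 1, 'first');    % first k that reaches the target
fprintf('k = %d keeps %.2f%% of the variance\n', k, 100 * retained(k));
```

For this spectrum the first four components keep 98% of the variance, so the rule settles on k = 5.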

%% Step 0a: Load data
%  Here we provide the code to load natural image data into x.
%  x will be a 144 * 10000 matrix, where the kth column x(:, k) corresponds to
%  the raw image data from the kth 12x12 image patch sampled.
%  You do not need to change the code below.

x = sampleIMAGESRAW();
figure('name','Raw images');
randsel = randi(size(x,2),200,1); % A random selection of samples for visualization
display_network(x(:,randsel)); % display_network is the helper from the UFLDL starter code

%% Step 0b: Zero-mean the data (by row)
%  You can make use of the mean and repmat/bsxfun functions.

% -------------------- YOUR CODE HERE -------------------- 
[n,m] = size(x); % m is the number of samples, n is the sample dimension
avg = mean(x, 2);
x = x - repmat(avg, 1, size(x,2));
%% Step 1a: Implement PCA to obtain xRot
%  Implement PCA to obtain xRot, the matrix in which the data is expressed
%  with respect to the eigenbasis of sigma, which is the matrix U.

% -------------------- YOUR CODE HERE -------------------- 
xRot = zeros(size(x)); % You need to compute this
sigma = x * x' / size(x, 2);
[U,S,V] = svd(sigma);
xRot = U'*x;

%% Step 1b: Check your implementation of PCA
%  The covariance matrix for the data expressed with respect to the basis U
%  should be a diagonal matrix with non-zero entries only along the main
%  diagonal. We will verify this here.
%  Write code to compute the covariance matrix, covar. 
%  When visualised as an image, you should see a straight line across the
%  diagonal (non-zero entries) against a blue background (zero entries).

% -------------------- YOUR CODE HERE -------------------- 
covar = zeros(size(x, 1)); % You need to compute this
covar = xRot*xRot' / size(x, 2); % Normalise by the number of samples

% Visualise the covariance matrix. You should see a line across the
% diagonal against a blue background.
figure('name','Visualisation of covariance matrix');
imagesc(covar);

%% Step 2: Find k, the number of components to retain
%  Write code to determine k, the number of components to retain in order
%  to retain at least 99% of the variance.

% -------------------- YOUR CODE HERE -------------------- 
k = 0; % Set k accordingly
var_sum = sum(diag(covar));
curr_var_sum = 0;
for i=1:length(covar)
    curr_var_sum = curr_var_sum + covar(i,i);
    if curr_var_sum / var_sum >= 0.99
        k = i;
        break; % Stop at the smallest k that retains 99% of the variance
    end
end

%% Step 3: Implement PCA with dimension reduction
%  Now that you have found k, you can reduce the dimension of the data by
%  discarding the remaining dimensions. In this way, you can represent the
%  data in k dimensions instead of the original 144, which will save you
%  computational time when running learning algorithms on the reduced
%  representation.
%  Following the dimension reduction, invert the PCA transformation to produce 
%  the matrix xHat, the dimension-reduced data with respect to the original basis.
%  Visualise the data and compare it to the raw data. You will observe that
%  there is little loss due to throwing away the principal components that
%  correspond to dimensions with low variation.

% -------------------- YOUR CODE HERE -------------------- 
xHat = zeros(size(x));  % You need to compute this
xTilde = U(:, 1:k)'*x;
xHat = U*[xTilde; zeros(n-k,m)];

% Visualise the data, and compare it to the raw data
% You should observe that the raw and processed data are of comparable quality.
% For comparison, you may wish to generate a PCA reduced image which
% retains only 90% of the variance.

figure('name',['PCA processed images ',sprintf('(%d / %d dimensions)', k, size(x, 1)),'']);
display_network(xHat(:,randsel));
figure('name','Raw images');
display_network(x(:,randsel));

%% Step 4a: Implement PCA with whitening and regularisation
%  Implement PCA with whitening and regularisation to produce the matrix
%  xPCAWhite. 

epsilon = 1;
xPCAWhite = zeros(size(x));
xPCAWhite = diag(1./sqrt(diag(S)+epsilon)) * xRot;
% -------------------- YOUR CODE HERE -------------------- 

%% Step 4b: Check your implementation of PCA whitening 
%  Check your implementation of PCA whitening with and without regularisation. 
%  PCA whitening without regularisation results a covariance matrix 
%  that is equal to the identity matrix. PCA whitening with regularisation
%  results in a covariance matrix with diagonal entries starting close to 
%  1 and gradually becoming smaller. We will verify these properties here.
%  Write code to compute the covariance matrix, covar. 
%  Without regularisation (set epsilon to 0 or close to 0), 
%  when visualised as an image, you should see a red line across the
%  diagonal (one entries) against a blue background (zero entries).
%  With regularisation, you should see a red line that slowly turns
%  blue across the diagonal, corresponding to the one entries slowly
%  becoming smaller.

% -------------------- YOUR CODE HERE -------------------- 
covar = xPCAWhite*xPCAWhite' / size(x, 2); % Normalise by the number of samples
% Visualise the covariance matrix. You should see a red line across the
% diagonal against a blue background.
figure('name','Visualisation of covariance matrix');
imagesc(covar);

%% Step 5: Implement ZCA whitening
%  Now implement ZCA whitening to produce the matrix xZCAWhite. 
%  Visualise the data and compare it to the raw data. You should observe
%  that whitening results in, among other things, enhanced edges.

xZCAWhite = zeros(size(x));
xZCAWhite = U*xPCAWhite; % Rotate the PCA-whitened data back to the original basis
% -------------------- YOUR CODE HERE -------------------- 

% Visualise the data, and compare it to the raw data.
% You should observe that the whitened images have enhanced edges.
figure('name','ZCA whitened images');
display_network(xZCAWhite(:,randsel));
figure('name','Raw images');
display_network(x(:,randsel));
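
The whitening checks from Step 4b can also be tried on a small synthetic data set, where they run without the image data. This is only a sketch under assumptions: the mixing matrix A, the sample count, and the names covPCA and covZCA are made up for the example, and epsilon is set to 0 so that both covariances should come out as the identity. ZCA whitening is taken here in its usual form, U times the PCA-whitened data.

```matlab
% Check PCA and ZCA whitening on synthetic data (illustrative names and values).
rng(1);                                    % fix the seed so the run is repeatable
m = 2000;                                  % number of samples
A = [2 0.5 0; 0.5 1 0.2; 0 0.2 0.5];       % mixing matrix that correlates the dimensions
x = A * randn(3, m);                       % 3 x m, one sample per column
x = x - repmat(mean(x, 2), 1, m);          % zero-mean each dimension

sigma = x * x' / m;                        % covariance matrix
[U, S, ~] = svd(sigma);                    % eigenvectors and eigenvalues of sigma
epsilon = 0;                               % no regularisation in this check
xPCAWhite = diag(1 ./ sqrt(diag(S) + epsilon)) * (U' * x);  % unit variance per component
xZCAWhite = U * xPCAWhite;                 % rotate back to the original axes

covPCA = xPCAWhite * xPCAWhite' / m;       % should be close to the identity
covZCA = xZCAWhite * xZCAWhite' / m;       % should also be close to the identity
fprintf('max deviation from identity: PCA %.2e, ZCA %.2e\n', ...
        max(max(abs(covPCA - eye(3)))), max(max(abs(covZCA - eye(3)))));
```

With a positive epsilon the diagonal entries drop below 1, most noticeably for the components with the smallest eigenvalues, which matches the behaviour described in Step 4b.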

Reference: Andrew Ng's machine learning lecture notes on PCA

