# hat matrix properties proof

Note that $(X^T X)^{-1} X^T$ is the pseudoinverse of $X$.[4]

By the definition of eigenvectors, and since $A$ is idempotent, $Ax = \lambda x \implies A^2 x = \lambda A x \implies Ax = \lambda A x = \lambda^2 x$. Hence $\lambda x = \lambda^2 x$ with $x \neq 0$, so $\lambda \in \{0, 1\}$. By properties of a projection matrix, it has $p = \operatorname{rank}(X)$ eigenvalues equal to 1, and all other eigenvalues are equal to 0.

Because $X^T X$ is symmetric, so is its inverse, and it follows that the hat matrix $H$ is symmetric too.

Practical applications of the projection matrix in regression analysis include leverage and Cook's distance, which are concerned with identifying influential observations.

The matrix $M \equiv I - P$ is symmetric ($M^T = M$) and idempotent ($M^2 = M$), and is sometimes referred to as the residual maker matrix. Since it also has the property $MX = 0$, it follows from (3.11) that

$$X^T e = 0. \tag{3.13}$$

We may write the explained component $\hat{y}$ of $y$ as

$$\hat{y} = Xb = Hy, \tag{3.14}$$

where

$$H = X (X^T X)^{-1} X^T \tag{3.15}$$

is called the "hat matrix", since it transforms $y$ into $\hat{y}$ (pronounced "y-hat"). It describes the influence each response value has on each fitted value. (The term "hat matrix" is due to John W. Tukey, who introduced us to the technique about ten years ago.)

The aim of regression analysis is to explain $Y$ in terms of $X$ through a functional relationship like $Y_i = f(X_i, \ast)$.

The least-squares estimator is unbiased: $E(\hat\beta) = E((X^T X)^{-1} X^T Y) = (X^T X)^{-1} X^T E(Y) = (X^T X)^{-1} X^T X \beta = I\beta = \beta$.

In the accompanying example, the two x values furthest away from the mean have the largest leverages (0.176 and 0.163), while the x value closest to the mean has a smaller leverage (0.048).
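The symmetry, idempotency and eigenvalue claims can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the design matrix `X` (an intercept plus two random predictors) is a hypothetical example, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
# Hypothetical design matrix: intercept column plus two random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X^T X)^{-1} X^T; solve() avoids forming the inverse explicitly.
H = X @ np.linalg.solve(X.T @ X, X.T)

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)

# eigvalsh is appropriate here because H is symmetric.
eigenvalues = np.linalg.eigvalsh(H)
n_unit_eigenvalues = int(np.sum(np.isclose(eigenvalues, 1.0)))
n_zero_eigenvalues = int(np.sum(np.isclose(eigenvalues, 0.0)))
rank_X = int(np.linalg.matrix_rank(X))
```

With a full-column-rank `X`, the number of unit eigenvalues should match `rank_X`, and the remaining `n - p` eigenvalues should be numerically zero.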
The element in the $i$th row and $j$th column of $P$ is equal to the covariance between the $j$th response value and the $i$th fitted value, divided by the variance of the former:

$$p_{ij} = \frac{\operatorname{Cov}[\hat{y}_i, y_j]}{\operatorname{Var}[y_j]}$$

The formula for the vector of residuals $r$ can also be expressed compactly using the projection matrix:

$$r = y - \hat{y} = y - Py = (I - P)y,$$

where $I$ is the identity matrix. Therefore, the covariance matrix of the residuals, by error propagation, equals

$$\Sigma_r = (I - P)^T \Sigma (I - P),$$

where $\Sigma$ is the covariance matrix of the error vector (and by extension, the response vector as well). For the case of linear models with independent and identically distributed errors in which $\Sigma = \sigma^2 I$, this reduces to:[3]

$$\Sigma_r = (I - P)\sigma^2.$$

Notice here that $u^T u$ is a scalar or number (such as 10,000), because $u^T$ is a $1 \times n$ matrix and $u$ is an $n \times 1$ matrix, and the product of these two matrices is a $1 \times 1$ matrix (thus a scalar).

The variable $Y$ is generally referred to as the response variable. Additional information about the samples is available in the form of $Y$. GDF is thus defined to be the sum of the sensitivity of each fitted value, $\hat{Y}_i$, to perturbations in its corresponding output, $Y_i$.

A few examples of methods whose fitted values are a linear function of the observations are linear least squares, smoothing splines, regression splines, local regression, kernel regression, and linear filtering.

In the language of linear algebra, the projection matrix is the orthogonal projection onto the column space of the design matrix $X$. Let $H$ be a symmetric idempotent real-valued matrix. Then the eigenvalues of $H$ are all either 0 or 1.

The following property of the transpose holds: $(A^T)^T = A$, that is, the transpose of the transpose of $A$ is $A$ (the operation of taking the transpose is an involution).

Three of the data points (the smallest x value, an x value near the mean, and the largest x value) are labeled with their corresponding leverages.

Suppose the design matrix $X$ can be decomposed by columns as $X = [A~~B]$. Define the hat or projection operator as $P\{X\} = X (X^T X)^{-1} X^T$. Similarly, define the residual operator as $M\{X\} = I - P\{X\}$. Then the projection matrix can be decomposed as follows:[9]

$$P\{X\} = P\{A\} + P\{M\{A\}B\},$$

where, e.g., $P\{A\} = A (A^T A)^{-1} A^T$ and $M\{A\} = I - P\{A\}$.
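The blockwise decomposition $P\{X\} = P\{A\} + P\{M\{A\}B\}$ can be verified on a small example. This is a sketch under stated assumptions: NumPy is available, `A` is taken to be an intercept column, `B` holds two hypothetical random predictors, and the helper names `proj` and `resid_maker` are illustrative only.

```python
import numpy as np

def proj(A):
    """Projection operator P{A} = A (A^T A)^{-1} A^T (A needs full column rank)."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def resid_maker(A):
    """Residual operator M{A} = I - P{A}."""
    return np.eye(A.shape[0]) - proj(A)

rng = np.random.default_rng(1)
n = 10
A = np.ones((n, 1))            # intercept column
B = rng.normal(size=(n, 2))    # hypothetical predictors
X = np.hstack([A, B])

lhs = proj(X)
rhs = proj(A) + proj(resid_maker(A) @ B)
decomposition_holds = np.allclose(lhs, rhs)

# Residuals r = (I - P) y are orthogonal to every column of X.
y = rng.normal(size=n)
residuals = resid_maker(X) @ y
```

Here `resid_maker(A) @ B` is simply `B` with its column means removed, which is why adding an intercept is equivalent to centering the other predictors.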
Just note that

$$\hat{y} = y - e = [I - M]y = Hy, \tag{31}$$

where

$$H = X (X^T X)^{-1} X^T. \tag{32}$$

Greene calls this matrix $P$, but he is alone. $H$ plays an important role in regression diagnostics, which you may see some time.

These properties of the hat matrix are of importance in, for example, assessing the amount of leverage or influence that $y_j$ has on $\hat{y}_i$, which is related to the $(i,j)$-th entry of the hat matrix.

The residual vector is given by $e = (I_n - H)y$ with the variance-covariance matrix $V = (I_n - H)\sigma^2$, where $I_n$ is the identity matrix of order $n$.

For other models such as LOESS that are still linear in the observations $y$, the projection matrix can be used to define the effective degrees of freedom of the model.

For every $n \times n$ matrix $A$, the determinant of $A$ equals the product of its eigenvalues. If $\lambda$ is an eigenvalue of an idempotent matrix, then $\lambda^2 = \lambda$ and hence $\lambda \in \{0, 1\}$.

The hat matrix is calculated as $H = X (X^T X)^{-1} X^T$, and the estimated coefficients are calculated as $\hat\beta = (X^T X)^{-1} X^T y$.
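As a concrete illustration of these formulas, the sketch below computes $\hat\beta$, $H$, the fitted values and the residuals for synthetic data. It assumes NumPy is available; the data-generating line (intercept 1.5, slope 0.8) is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])                 # intercept + one predictor
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=n)    # synthetic responses

# Coefficients from the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix and fitted values: y_hat = H y.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
residuals = y - y_hat                                # e = (I - H) y

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice a least-squares solver such as `np.linalg.lstsq` is usually preferable to the normal equations for numerical stability; the explicit form is shown only because it mirrors the formula.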
Each point of the data set tries to pull the ordinary least squares (OLS) line towards itself.

Let $X$ be an $n \times k$ matrix in which we have observations on $k$ independent variables for $n$ observations. The fitted values are

$$\hat{y} = X\hat\beta = X (X^T X)^{-1} X^T y = Py, \qquad P = X (X^T X)^{-1} X^T,$$

so that $\hat{y} = Py = Hy$. Note that $P$ depends only on $X$, not on $y$. A related matrix is the hat matrix, which makes $\hat{y}$, the predicted $y$, out of $y$.

Geometrically, the closest point from the vector $b$ onto the column space of $A$ is $Ax$, and it is one where we can draw a line orthogonal to the column space of $A$. A vector that is orthogonal to the column space of a matrix is in the nullspace of the matrix transpose, so

$$A^T (b - Ax) = 0.$$

From there, one rearranges, so

$$A^T A x = A^T b \implies x = (A^T A)^{-1} A^T b.$$

Therefore, since $Ax$ is on the column space of $A$, the projection matrix, which maps $b$ onto $Ax$, is just $A (A^T A)^{-1} A^T$.

The hat matrix is not always symmetric and idempotent; in locally weighted scatterplot smoothing (LOESS), for example, the hat matrix is in general neither symmetric nor idempotent.

For linear models, the trace of the projection matrix is equal to the rank of $X$, which is the number of independent parameters of the linear model.[8]

$h_{ii}$ is a measure of the distance between the $X$ values of the $i$th observation and the mean of the $X$ values for all $n$ observations. The leverages sum to the number of coefficients, $\sum_{i=1}^{n} h_{ii} = p$, so the mean leverage is $\bar{h} = \sum_{i=1}^{n} h_{ii} / n = p/n$ (show it), where $p$ is the number of coefficients in the regression model, and $n$ is the number of observations. The minimum value of $h_{ii}$ is $1/n$ for a model with a constant term.
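The statements about the trace, the mean leverage and the $1/n$ lower bound can be checked numerically. This is a minimal sketch assuming NumPy is available and a hypothetical random design matrix with an intercept column.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 4
# Hypothetical design matrix with a constant term.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)
leverages = np.diag(H)             # h_ii

trace_H = leverages.sum()          # should equal rank(X) = p
mean_leverage = leverages.mean()   # should equal p / n
min_leverage = leverages.min()     # at least 1/n because X has an intercept
rank_X = int(np.linalg.matrix_rank(X))
```

The identity $\operatorname{tr}(H) = \operatorname{tr}((X^T X)^{-1} X^T X) = \operatorname{tr}(I_p) = p$ is what makes the sum of the leverages independent of the data values.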
If the vector of response values is denoted by $y$ and the vector of fitted values by $\hat{y}$, then $\hat{y} = Py$. As $\hat{y}$ is usually pronounced "y-hat", the projection matrix $P$ is also named hat matrix as it "puts a hat on $y$".

$H$ is called the hat matrix and is central in regression analysis:

$$\hat{Y} = Xb = X (X^T X)^{-1} X^T Y = HY, \qquad H = X (X^T X)^{-1} X^T.$$

Recall that $H = [h_{ij}]_{i,j=1}^{n}$ and $h_{ii} = X_i (X^T X)^{-1} X_i^T$, where $X_i$ is the $i$th row of $X$. The diagonal elements $h_{ii}$ are called leverages. Influential observations are observations which have a large effect on the results of a regression.

An idempotent matrix $M$ is a matrix such that $M^2 = M$. The projection matrix has the following properties: it is idempotent, meaning $P \cdot P = P$, and symmetric, meaning $P^T = P$.

Let $\mathbf{1}$ be the first column vector of the design matrix $X$.

To minimize the least-squares objective function, we can take its first derivative in matrix form. These estimates will be approximately normal in general.

Because $q^T \hat\beta$ and $k^T y$ are both scalars, by our definition of variances the variance of their difference is

$$\operatorname{Var}(q^T \hat\beta - k^T y) = \operatorname{Var}(q^T \hat\beta) + \operatorname{Var}(k^T y) - 2 \operatorname{Cov}(q^T \hat\beta, k^T y).$$

The covariance term factors out as twice the covariance because both quantities are scalars.

Theorem (Solution). Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$, and suppose that $AA^{+} b = b$. Then any vector of the form

$$x = A^{+} b + (I - A^{+} A) y, \quad y \in \mathbb{R}^{n} \text{ arbitrary}, \tag{4}$$

is a solution of

$$Ax = b. \tag{5}$$

When the weights for each observation are identical and the errors are uncorrelated, the estimated parameters are

$$\hat\beta = (X^T X)^{-1} X^T y,$$

so the fitted values are $\hat{y} = X\hat\beta = X (X^T X)^{-1} X^T y$. Therefore, the projection matrix (and hat matrix) is given by

$$P \equiv X (X^T X)^{-1} X^T.$$

The above may be generalized to the cases where the weights are not identical and/or the errors are correlated. Suppose that the covariance matrix of the errors is $\Psi$. Then since

$$\hat\beta_{\text{GLS}} = (X^T \Psi^{-1} X)^{-1} X^T \Psi^{-1} y,$$

the hat matrix is thus

$$H = X (X^T \Psi^{-1} X)^{-1} X^T \Psi^{-1},$$

and again it may be seen that $H^2 = H \cdot H = H$, though now it is no longer symmetric.
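The claim that the generalized hat matrix stays idempotent but loses symmetry can be illustrated with a small example. This is a sketch assuming NumPy is available; the diagonal error covariance `Psi` with unequal variances is a hypothetical choice made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Hypothetical error covariance: uncorrelated errors with unequal variances.
Psi = np.diag(rng.uniform(0.5, 3.0, size=n))
Psi_inv = np.linalg.inv(Psi)

# Generalized hat matrix H = X (X^T Psi^{-1} X)^{-1} X^T Psi^{-1}.
H_gls = X @ np.linalg.solve(X.T @ Psi_inv @ X, X.T @ Psi_inv)

gls_idempotent = np.allclose(H_gls @ H_gls, H_gls)
gls_symmetric = np.allclose(H_gls, H_gls.T)
```

For this kind of `Psi`, `gls_idempotent` should be true while `gls_symmetric` should be false, because the trailing factor $\Psi^{-1}$ rescales the columns of $H$ unequally.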
Hat matrix and leverages. Basic idea: use the hat matrix to identify outliers in $X$.

The least-squares estimate is $\hat\beta = (X^T X)^{-1} X^T y$. This vector is a linear combination of the elements of $Y$, and $\hat\beta$ is an unbiased estimator of $\beta$. When $Y$ is normal, $\hat\beta$ has a multivariate normal distribution.

Since our model will usually contain a constant term, one of the columns in the $X$ matrix will contain only ones.

Recall that $M = I - P$, where $P$ is the projection onto the linear space spanned by the columns of the matrix $X$. The projection matrix is positive semi-definite. (Similarly, the effective degrees of freedom of a spline model is estimated by the trace of the projection matrix $S$, where $\hat{Y} = SY$.)

We prove that if $A^T A = A$, then $A$ is a symmetric idempotent matrix.

Matrix operations on block matrices can be carried out by treating the blocks as matrix entries.

We call this the "hat matrix" because it turns $Y$'s into $\hat{Y}$'s. Hat matrix properties:

1. The hat matrix is symmetric.
2. The hat matrix is idempotent, i.e. $HH = H$.

The matrix $Z^T Z$ is symmetric, and so therefore is $(Z^T Z)^{-1}$.

Important idempotent matrix property: for a symmetric and idempotent matrix $A$, $\operatorname{rank}(A) = \operatorname{trace}(A)$, the number of non-zero eigenvalues of $A$.

Properties of leverages $h_{ii}$: $0 \le h_{ii} \le 1$ (can you show this?).
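One way to answer the question about the bounds on $h_{ii}$ is the following short derivation. It is a sketch that uses only the symmetry and idempotency of $H$ stated above. Since $H = H^2 = H H^T$, the $i$th diagonal entry satisfies

$$
h_{ii} = \sum_{j=1}^{n} h_{ij} h_{ji} = \sum_{j=1}^{n} h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 .
$$

The right-hand side is a sum of squares, so $h_{ii} \ge 0$. Dropping the non-negative off-diagonal terms gives $h_{ii} \ge h_{ii}^2$, i.e. $h_{ii}(1 - h_{ii}) \ge 0$, which together with $h_{ii} \ge 0$ yields $h_{ii} \le 1$.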
The hat matrix is a matrix used in regression analysis and analysis of variance. It is defined as the matrix that converts values from the observed variable into estimations obtained with the least squares method.

Suppose that we wish to estimate a linear model using linear least squares. The model can be written as

$$y = X\beta + \varepsilon,$$

where $X$ is a matrix of explanatory variables (the design matrix), $\beta$ is a vector of unknown parameters to be estimated, and $\varepsilon$ is the error vector.

The projection matrix corresponding to a linear model is symmetric and idempotent, that is, $P^2 = P$. The trace of a matrix is equal to the sum of its characteristic values (eigenvalues).

These estimates are normal if $Y$ is normal. The residuals, like the fitted values $\hat{Y}_i$, can be expressed as linear combinations of the response variable observations $Y_i$.

Proof: The subspace inclusion criterion follows essentially from the definition of the range of a matrix. The matrix criterion is from the previous theorem.

There are a number of applications of decomposing the design matrix by columns as $X = [A~~B]$. In the classical application, $A$ is a column of all ones, which allows one to analyze the effects of adding an intercept term to a regression. Another use is in the fixed effects model, where $A$ is a large sparse matrix of the dummy variables for the fixed effect terms. One can use this partition to compute the hat matrix of $X$ without explicitly forming the matrix $A$, which might be too large to fit into computer memory.

The diagonal elements of the projection matrix are the leverages, which describe the influence each response value has on the fitted value for that same observation. The leverage of observation $i$ is the value of the $i$th diagonal term, $h_{ii}$, of the hat matrix $H$, where $H = X (X^T X)^{-1} X^T$.
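When only the leverages are needed, the full $n \times n$ hat matrix does not have to be formed. The sketch below uses a thin QR factorization, which is a swapped-in technique not described in the text: if $X = QR$ with orthonormal columns in $Q$, then $H = QQ^T$ and $h_{ii}$ is the squared norm of the $i$th row of $Q$. It assumes NumPy is available and a hypothetical full-column-rank design matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Thin QR: Q is n x p with orthonormal columns, so H = Q Q^T.
Q, _ = np.linalg.qr(X)               # default mode is 'reduced'
leverages = np.sum(Q**2, axis=1)     # h_ii = squared norm of row i of Q

# Cross-check against the explicit hat matrix (only sensible for moderate n).
H = X @ np.linalg.solve(X.T @ X, X.T)
```

The QR route needs $O(np)$ memory instead of $O(n^2)$, which matters once $n$ is large.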
In statistics, the projection matrix $(P)$, sometimes also called the influence matrix or hat matrix $(H)$, maps the vector of response values (dependent variable values) to the vector of fitted values (or predicted values).

The matrix $X$ is called the design matrix. The least-squares fitted values are

$$\hat{y} = X\hat\beta = X (X^T X)^{-1} X^T y = X C^{-1} X^T y = Py,$$

where $C = X^T X$ and $P$ is a projection matrix.

The OLS estimator was found to be given by the $(p \times 1)$ vector $b = (X^T X)^{-1} X^T y$. The predicted values $\hat{y}$ can then be written as

$$\hat{y} = Xb = X (X^T X)^{-1} X^T y =: Hy,$$

where $H := X (X^T X)^{-1} X^T$ is an $n \times n$ matrix, which "puts the hat" on $y$.

A symmetric idempotent matrix such as $H$ is called a perpendicular projection matrix. Although the ANOVA hat matrix is not a projection matrix, it shares many of the same geometric properties as its parametric counterpart.

In some derivations, we may need different $P$ matrices that depend on different sets of variables. For example, if there are large blocks of zeros in a matrix, or blocks that look like an identity matrix, it can be useful to partition the matrix accordingly.

The column of ones for the constant term should be treated exactly the same as any other column in the $X$ matrix.

Show that $H\mathbf{1} = \mathbf{1}$ for the multiple linear regression case ($p - 1 > 1$). Writing $H = [r_1~r_2~\ldots~r_n]^T$, where $r_i$ is the $i$th row of $H$, this amounts to $r_i \cdot \mathbf{1} = 1$ (a scalar) for each row, e.g. $r_1 \cdot \mathbf{1} = 1$.
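A short argument for $H\mathbf{1} = \mathbf{1}$ is sketched below. It assumes that the model contains a constant term, so that $\mathbf{1}$ is the first column of $X$; the symbol $e_1$ (the first standard basis vector in $\mathbb{R}^p$) is notation introduced here for the illustration. First,

$$
HX = X (X^T X)^{-1} X^T X = X .
$$

Since $\mathbf{1} = X e_1$, it follows that

$$
H\mathbf{1} = H X e_1 = X e_1 = \mathbf{1},
$$

so every row of $H$ sums to 1, regardless of the number of predictors.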
The $n \times 1$ vector of ordinary predicted values of the response variable is $\hat{y} = Hy$, where the $n \times n$ prediction or hat matrix, $H$, is given by

$$H = X (X^T X)^{-1} X^T. \tag{1.4}$$

The projection matrix has a number of useful algebraic properties.[5][6]

$(A + B)^T = A^T + B^T$: the transpose of a sum is the sum of transposes.

Properties of $\hat\beta$ (Theorem 4.2): the covariance matrix of $\hat\beta$ is $\operatorname{Cov}(\hat\beta) = \sigma^2 (X^T X)^{-1}$.

Practice problem: Let $A$ be an $n \times n$ matrix. Prove that if $A$ is idempotent, then $\det(A)$ is equal to either 0 or 1.

Now, we can use the SVD of $X$ to unveil the properties of the hat matrix. In particular, the left singular vectors $U$ are a set of eigenvectors for $XX^T$, and the right singular vectors $V$ are a set of eigenvectors for $X^T X$. The non-zero singular values of $X$ are the square roots of the eigenvalues of both $XX^T$ and $X^T X$.
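The SVD view can be made concrete: for a full-column-rank $X$ with thin SVD factors $U$, $s$, $V^T$, the hat matrix reduces to $H = UU^T$. The sketch below checks this, along with the singular-value statement and the determinant of the (singular, idempotent) hat matrix. It assumes NumPy is available and a hypothetical random design matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 9, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Thin SVD: U is n x p, s holds the p singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

H_svd = U @ U.T                                   # valid when X has full column rank
H = X @ np.linalg.solve(X.T @ X, X.T)

# Squared singular values equal the eigenvalues of X^T X.
eig_XtX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

# H is idempotent with rank p < n, so its determinant is 0 (not 1).
det_H = np.linalg.det(H)
```

The only idempotent matrix with determinant 1 is the identity, which corresponds to the degenerate case $p = n$ where the fit interpolates every observation.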