Why do we dummify variables for linear regressions?
1) Why do you want to convert race into numbers? I'm assuming you want to fit something like a regression model, so I'll treat this as a question about how to handle "categorical data" (categories such as different races) in regression.
So, you want numerical variables, and you could just assign a number to each race. But if you choose White=1, Black=2, Asian=3, does it really make sense that the distance between Whites and Blacks is exactly half the distance between Whites and Asians? With that coding, a regression is forced to assume that going from White to Asian has exactly twice the effect of going from White to Black. And is that ordering even correct? Probably not.
Instead, what you do is create dummy variables. Let's say you have just those three races. Then you create two dummy variables: White and Black. You could also use White and Asian, or Black and Asian; the key is that you always create one fewer dummy variable than there are categories. The White variable is 1 if the individual is White and 0 otherwise, and the Black variable is 1 if the individual is Black and 0 otherwise. If you now fit a regression model, the coefficient for White tells you the average difference in the outcome between Whites and Asians, holding any other variables in the model fixed (the Asian dummy variable was not used, so Asians become the baseline we compare to). Likewise, the coefficient for Black tells you the average difference between Blacks and Asians.
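To make the dummy coding concrete, here is a minimal sketch in plain Python. The data values are invented purely for illustration, and the helper names (dummy_row, solve, fit_ols, group_mean) are hypothetical, not from any library. It builds the White and Black dummies with Asian as the baseline, fits ordinary least squares by solving the normal equations, and lets you check that each coefficient is the difference between that group's mean and the baseline mean.

```python
# Toy data, invented for illustration: (race, outcome) pairs.
data = [
    ("White", 10.0), ("White", 12.0),
    ("Black", 7.0), ("Black", 9.0),
    ("Asian", 4.0), ("Asian", 6.0),
]


def dummy_row(race):
    """Return [intercept, White, Black]; Asian is the omitted baseline."""
    return [1.0,
            1.0 if race == "White" else 0.0,
            1.0 if race == "Black" else 0.0]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def group_mean(race):
    """Plain average of the outcome within one category."""
    values = [v for r, v in data if r == race]
    return sum(values) / len(values)


X = [dummy_row(race) for race, _ in data]
y = [value for _, value in data]
intercept, coef_white, coef_black = fit_ols(X, y)

print("intercept (baseline = Asian mean):", round(intercept, 6))
print("White coefficient:", round(coef_white, 6))
print("Black coefficient:", round(coef_black, 6))
print("White mean - Asian mean:", group_mean("White") - group_mean("Asian"))
print("Black mean - Asian mean:", group_mean("Black") - group_mean("Asian"))
```

With only the dummies in the model, the intercept is the mean of the baseline group and each coefficient is a plain difference in group means. Once other variables are added, the coefficients become differences adjusted for those variables.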
Note: If you're using software to fit your regression model, you probably don't have to worry about all this. You just tell your software that the variable is categorical, and it handles all these details.
2) You don't need to worry about this, at least if you're doing a regression. Running the regression model gives you a coefficient for each variable along with its standard error. Together they tell you how large each estimated effect is and how precisely it is estimated, which is the usual starting point for judging which variables matter. If you want help interpreting those coefficients, that's a whole new topic.