Note: If you want to jump directly to the “punch line” skipping all of my explanation and development (and making me feel unappreciated :() click here.

Recall that in regular least squares regression we fit a line to the data. More technically, what we do is model the expected value (average value) of the dependent variable *Y* as a linear function of the explanatory *X*-variables. That is, we model *E(Y)* by the linear equation

$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

In the last article (click here to review it), I explained why this equation makes no sense when your dependent *Y* variable is binary (only takes the values of 0 or 1).

So what do we do? We would still very much like to be dealing with a linear function of the *X*’s because linear functions are relatively simple and interpretable. Thus, we would like to keep the right-hand side of our equation the same as in the least-squares case:

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

Perhaps we can fix things by applying some kind of function to the left-hand side of the equation so that it makes sense as a model for when *Y* is binary. That is, maybe we can find a function *f( )* so that

$$f(E(Y)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

This can, in fact, be done. The special function *f( )* we use is called the logistic transform (it is also commonly called the logit function):

$$f(p) = \log\left(\frac{p}{1-p}\right)$$

I have used *p* as the argument to this function because the argument takes values between 0 and 1 (like a probability). Also, note that the log function here is the natural log (log to the base *e*). (Note: for more discussion about logarithms and the notation log( ) see the next post by clicking here.)

The next thing that I need to point out is that, when *Y* is a binary variable (taking values of 0 or 1 only), *E(Y) = p*, where *p* is the probability that *Y* takes the value 1. In words, what this says is that the expected value of *Y* (that is, the average value) is the probability that *Y* is 1. If you do not see why *E(Y) = p*, accept it as a fact for now. I will explain it in a later article.

So putting all of this together (the punch line!), the key equation (usually termed the “multivariate logistic regression equation” or “multivariate logistic regression model”) that we fit to our data is

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

where *p* is the probability that *Y* is 1. Since *p/(1-p)* is called the “odds” (more about odds here), what logistic regression does is model the log odds as a linear function of the *X*-variables.
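To make “odds” and “log odds” concrete, here is a minimal Python sketch. The function names `odds` and `log_odds` are made up for this illustration; they simply evaluate p/(1-p) and its natural log for a few probabilities.

```python
import math

def odds(p):
    """Odds of an event that has probability p (requires 0 < p < 1)."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds: the left-hand side of the logistic regression equation."""
    return math.log(odds(p))

# Probabilities below 0.5 give negative log odds, above 0.5 positive log odds.
for p in (0.1, 0.5, 0.8):
    print(f"p = {p:.1f}   odds = {odds(p):.3f}   log odds = {log_odds(p):+.3f}")
```

Note that a probability of 0.5 corresponds to odds of 1 and log odds of 0, and that the log odds can be any real number, which is what lets it be matched to a linear function of the *X*’s.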

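For completeness here, I am now going to “undo” the logistic transform and show you the equation for *p* (recall that *p = E(Y)*). Here it is:

$$p = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}$$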

Notice that this equation for *p* is clearly not linear. You may well not understand how I got from the last equation to this one. That is OK. Accept it for now and I will explain it in detail in a later post.
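For readers who would rather not wait, here is a short sketch of the algebra. The symbol $\eta$ is shorthand introduced only for this sketch (it is not notation from the post); it stands for the linear function of the *X*’s on the right-hand side of the logistic regression equation.

$$
\begin{aligned}
\log\left(\frac{p}{1-p}\right) &= \eta \\
\frac{p}{1-p} &= e^{\eta} \\
p &= e^{\eta}\,(1-p) \\
p\,(1 + e^{\eta}) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1 + e^{\eta}}
\end{aligned}
$$

The second line exponentiates both sides, and the remaining lines solve for *p*.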

So now I will show you in a graph what the logistic regression equation is doing. The figure below shows the same data appropriate for logistic regression that I used in the post “Why Regular Regression Does NOT Work” (click here to review it), but with the logistic equation fit to the data.

In this figure, the smooth s-shaped trace shows the logistic function that is fit to the binary data. This function is an estimate of the probability that *Y* is 1. As you can see, the probability that *Y* is 1 is very small on the left-hand side of the figure. It increases through the middle of the figure and is nearly 1 on the right-hand side of the figure.

Just to contrast the logistic regression fit with the regular least-squares regression line, I will now add the least-squares line to the figure.

This figure clearly shows how silly the least-squares line is for this binary data and how well the logistic curve estimates the probability that the dependent variable is 1.
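The comparison can be reproduced numerically. The sketch below uses a small made-up data set (not the data plotted in the figures) and assumes the logistic curve is fit by maximum likelihood with Newton-Raphson steps, which is one standard approach; the post itself does not say how its curve was fit. All variable names are illustrative.

```python
import numpy as np

# Hypothetical binary data, invented for this sketch (not the post's data).
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Design matrix: a column of ones for the intercept, then x.
X = np.column_stack([np.ones_like(x), x])

# Regular least-squares line fit to the 0/1 values.
ls_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ls_pred = X @ ls_coef

# Logistic regression by maximum likelihood (Newton-Raphson iterations).
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))        # current fitted probabilities
    gradient = X.T @ (y - p)                      # gradient of the log-likelihood
    information = X.T @ (X * (p * (1.0 - p))[:, None])  # X' W X
    beta = beta + np.linalg.solve(information, gradient)

p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))

print("least-squares predictions:", np.round(ls_pred, 3))
print("logistic probabilities:   ", np.round(p_hat, 3))
```

On this made-up data the least-squares predictions fall below 0 at the left end and above 1 at the right end, while the logistic estimates always stay strictly between 0 and 1.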

Depending on your mathematical background, the above may seem a bit complicated, confusing, and maybe even mysterious. Don’t worry. I am going to help you out in the next few articles.

As always, if you have any comments or questions please feel free to leave them below.

I don’t get how you carry out the regression. Let’s say I want to do it “by hand” from the following set of paired values (x, y):

X Y

{0.1 – 0}

{0.2 – 0}

{0.3 – 1}

{0.4 – 1}

I don’t have values of “p”; the closest I have are the Y values (0 or 1). If I use these Y values as “p” values, I get:

when

y = 0; log(p/(1-p)) = log(0/1)

and even worse, when:

y = 1; log(p/(1-p)) = log(1/0)!!!!

How do you solve it?? A simple logistic regression, by hand with the values I gave you?

Sir,

Thank you for helping in understanding the concepts better.

Can you provide end-to-end steps for OLS (multiple linear) and MLE (logistic) regressions? In particular, the challenges of both and how to do variable selection and resolve multicollinearity, outliers, and overfitting/underfitting. If it can be a case with banking data, that would be great.

Thank you!

Hello,

As all other participants I want to congratulate you for the clarity and high pedagogical level of your explanations. I have some questions here:

(I have a good level of math, so feel free to use it…)

1) Why is the function log(p/(1-p)) a reasonable choice for the transform? There are infinitely many possible transforms that map a [0..1] variable to (-infinity..+infinity).

2) Once we have chosen the transform, why is it reasonable to assume that the error term is normally distributed?

3) In linear regression, in some sense, the goal is to minimize, let’s say, the “Euclidean” distance in the n-dimensional space R^n between our measured set of points { (Yi,X1,…,Xn) } and the set of points given by the assumed model (here linear), i.e. { (a1*X1+…+an*Xn; X1,…,Xn) }.

What would be the equivalent in logistic regression? Here my measured set of points is { (Yi,X1,…,Xn) } with Yi = 0 or 1.

If I have understood correctly, the assumed model is { (exp(a1*X1+…+an*Xn) / (1 + exp(a1*X1+…+an*Xn)); X1,…,Xn) }. But what do we minimize in order to find the coefficients a1,…,an?

Thanks for your time

Gianni

Esteemed Professor,

Thanks for making your nicely-written explanations available. I have some Math background, and I am trying to understand why E(Yi)=pi. Please comment; my doubts are wrapped around ** ‘s :

Yi is a random variable with two outcomes, success and failure here, i.e., a binomial R.V. taking values in {0,1}.

Then the mean is E(Y_i) = np_i; n is the number of trials, where ** p_i is the probability of success in a single trial, given the vector X_i of _fixed_ values X_i:=(x_i1, X_i2,…., X_in).

If we consider a single trial, then E(Y_i)=1(p_i)=p_i . In the simplest case of having a single variable, Y_i is the experiment resulting from fixed input, e.g., say Y is the experiment of whether someone votes and X is age. Then Y_35 is the experiment/output of someone who is 35 years old voting. We assume anyone who is 35 has the same probability of voting (we can make X_i into continuous variable by considering age up to any fraction of a second)**.

Then logistic regression would determine E(Y_i|X_i) , i.e., the probability of someone voting given any age X_i.

Basically , the logistic function determines, given as an input a vector X=(x_1, x_2,..,x_n) of inputs, the probability of success given this fixed vector.

I hope I am not too far off; please ignore if this is the case, and thank you for your time and patience.

Basically, everything you have written starting with “If we consider a single trial, …” is correct. I also think you do understand this.

I would like to “sharpen” the first two sentences. Each Yi is a Bernoulli trial which takes the values 0 and 1 with P(Yi=1) = pi (not really a binomial).

The difference between the logistic regression set up and the binomial distribution is that in the binomial distribution each of the Yi’s has the same pi; that is pi = p for all i. In logistic regression, the pi’s can all be different. But they are assumed to follow the functional form given by the logistic equation (last formula in the post above).

One last remark. You refer to E(Yi|Xi), where Xi is an input vector as you describe. This is what is given by the last equation above, but I have just used E(Yi) on the left hand side and suppressed the conditional dependence on Xi (i.e., I have omitted “|Xi”). But the way you have written it is more precise.

I am glad you are finding this material useful.

Regards,

StatsProf

What is the explanation of E(Yi) = pi?

Yi takes on the values 0 and 1 with probabilities (1-pi) and pi.

The expectation is the probability weighted average of all of the possible outcomes.

Thus, the expected value of Yi is: E(Yi) = 0*(1-pi) + 1*pi = pi.
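The same calculation can be written as a few lines of Python. This is only an illustrative sketch: the function name `expected_value_binary` and the small sample are made up for the example.

```python
def expected_value_binary(p):
    """E(Y) when Y is 1 with probability p and 0 with probability 1 - p."""
    outcomes = [(0, 1.0 - p), (1, p)]  # (value, probability) pairs
    # Probability-weighted average of the possible outcomes.
    return sum(value * prob for value, prob in outcomes)

print(expected_value_binary(0.3))

# The sample analogue: the average of 0/1 values is the fraction of ones.
sample = [0, 1, 1, 0, 1]
sample_mean = sum(sample) / len(sample)
print(sample_mean)
```

The term for the outcome 0 always contributes nothing, so only 1*p survives, which is why the expected value equals the probability.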

What is the explanation of the logistic regression equation for p, i.e., E(Yi) = pi?

Thank you so much for sharing these articles. They are really helpful and illustrate the issues well

Prof

Do we have multivariate multinomial logistic regression? How is it done?

Thanks for this great resource. I am trying to picture the data set I would use in regression analysis. If I have 5 rows of data and the dependent variable is 1 for 2 and 0 for 3 rows, will the log (p/1-p) be the same for each with a 1 value? (i.e., p=(2/5)/(1-(2/5))?

Hi Josh,

Sorry I did not reply to your question sooner, but I was travelling internationally.

The answer to your question, “… will the log (p/1-p) be the same for each with a 1 value? (i.e., p=(2/5)/(1-(2/5))?” is no, they will (most likely) be different.

What I don’t think you are taking into account is the value of the X variable(s). If the X-variables are different for the different rows that have a value of 1 for the Y variable, then log (p/(1-p)) will be estimated differently for each of them. Logistic regression is trying to model how the X-variable affects the probability that the Y variable is 1, so the estimated probability that the Y-variable is one depends on the “concentration” of 1’s as the X-variable changes.

For all rows where the Y is 1 AND the X’s are exactly the same, log (p/(1-p)) will be the same.
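Here is a small sketch of that point. The coefficients `b0` and `b1` and the four rows are hypothetical values chosen only for illustration; they are not estimated from any real data.

```python
import math

# Hypothetical fitted coefficients, invented for this illustration.
b0, b1 = -3.0, 0.5

# Hypothetical rows of data as (x, y) pairs.
rows = [(2.0, 1), (2.0, 1), (6.0, 1), (6.0, 0)]

def fitted_log_odds(x):
    """Estimated log(p/(1-p)) for a row: it depends only on x."""
    return b0 + b1 * x

def fitted_p(x):
    """Estimated probability that Y is 1, obtained by undoing the log odds."""
    return 1.0 / (1.0 + math.exp(-fitted_log_odds(x)))

for x, y in rows:
    print(f"x = {x}, y = {y}: log odds = {fitted_log_odds(x):+.2f}, p = {fitted_p(x):.3f}")
```

The two rows with x = 2.0 get the same estimated log odds, and so do the two rows with x = 6.0, even though one of those has Y = 1 and the other Y = 0. The observed Y values influence the estimates only through the fitted coefficients.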

I hope this helps.

Regards,

StatsProf

Hi,

Thank you for the explanation, it was really helpful.

I have the same problem as mentioned above and I don’t know how to calculate p/1-p for each row.

In my example Y is 1 for 25 rows but X is different for each row.

Thank you,

Mahnoush

Thank you so much. With my limited knowledge of math, this is really helpful.

I am bad at math, and you saved my life!

Do you have another blog about ML, statistics, or data analysis like this? Anyway, best wishes to you.

With “limited” maths knowledge, I must say the explanations are very good. I did a bit of statistics during my 10+2, but at school level it’s very limited.

I am an aspiring Data Scientist and your explanations are extremely helpful. Thank you for such nice stuff.

Thanks so much for your comment. I am really glad my explanations help. I hope to have more good stuff coming!

StatsProf

Dear Prof

Your article really helped me understand-with v limited maths and stats skills-

v grateful

Thambu

Hello,

I am looking for a simplified logistic equation to fit my data to. Keep in mind that I am only in high school. My data is the Olympic 100 meter times from 1928 until now. Linear regression is obviously not the correct regression to use. I was wondering if you knew which equation for logistic functions could fit the data.

Daphne

Hi Daphne,

Linear regression probably is the right method to use for your data. Your Y variable is continuous, and therefore not a candidate for logistic regression. However, the Y variable probably needs to be transformed.

You have not given me much information about the X variables in the equation, except for the year variable. Do you have times for many runners or just the winning times? Do you have other X variables?

As far as your time variable goes, you should look at a histogram or normal probability plot of the Y values to see if their distribution is skewed. If it is, then you should consider using log(time) or perhaps the cube root of time (time^(1/3)) as the Y. The objective is to find a transformation of the time variable that makes it look as normal (or at least as symmetric) as possible. Only consider logs, cube roots, square roots, and untransformed. Note that using a log transformation in a regular linear regression does not make it a logistic regression. The terms kind of sound the same, but they have different meanings.

Two other comments. First, the race time can be converted into speed (speed = 100M/time). It may make more sense to analyze speed than time. If you look at the histogram or normal probability plot of the speeds and they are skewed, you would want to use transformation as described above as well. Basically, you want to use the time or speed variable and one of the transformations that will make the Y variable look as normal as possible.

Second, when data comes in a time series (e.g., one 100M time per year), then there is often correlation from year to year. You might want to put in the Y value from the previous year as an X-variable.

So the kind of model I am thinking about would be Y(t) = intercept + beta1 * Y(t-1) + beta2 * (Years since 1900) + other X’s and Betas.

Note, you should probably use years since 1900 instead of Year (i.e., 28 instead of 1928) to reduce multicollinearity with the constant term.
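A minimal sketch of how such a model could be set up with NumPy follows. The series below is synthetic, generated from a made-up formula purely to have numbers to work with; it is not real race data, and the variable names are illustrative.

```python
import numpy as np

# Synthetic, made-up series (NOT real race results): one value every 4 years.
years_since_1900 = np.arange(28, 100, 4, dtype=float)   # 28, 32, ..., 96
y = 10.9 - 0.01 * years_since_1900 + 0.05 * np.sin(years_since_1900)

# Each row uses the previous value Y(t-1) and the current year as X-variables,
# so the first observation is lost (it has no previous value).
target = y[1:]
X = np.column_stack([
    np.ones(len(target)),      # intercept
    y[:-1],                    # Y(t-1), the lagged Y value
    years_since_1900[1:],      # years since 1900
])

# Ordinary least squares fit of Y(t) on the intercept, Y(t-1), and the year.
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
residuals = target - X @ coef

print("intercept, beta1, beta2:", np.round(coef, 4))
```

The residuals computed at the end are what you would examine (with a histogram or normal probability plot) to check that they look like they come from a normal distribution.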

Once you fit the regression, you should check that the residuals look like they are from a normal distribution.

Now I may have gone well beyond what you know or have been taught so far. So ignore the stuff that goes beyond what you have some knowledge of. You can start with the simplest case: A regression of Y on (Year since 1900), where Y is transformed by log or cube root, whichever works best.

Finally, if you have an equation that predicts log time, I will leave it to you to figure out how to convert it to a prediction of time.

Good luck.

StatsProf

Thank you so much for this insight! I’m just re-learning logistic regression now after a multi-year hiatus of working after college — if I have any questions after reading your site, can I reach out to you?

Thanks again!

Thank you so much for taking the time to comment.

Please feel free to contact me with questions. I am notified by e-mail of any comment you make on any of the posts on the web site and I will try to get back to you as soon as I can.

StatsProf