Let's say I have contingent valuation data for a recreation trip where the dependent variable is y=0 if the respondent would take the trip and y=1 otherwise. The independent variables are the added cost of the trip, a risk factor, and whether the respondent takes a day or overnight trip. The logit generates a constant, a<0, and slopes on the cost, risk, and overnight variables, b>0, c>0, d<0. Willingness to pay for a day trip with no risk is WTP = -a/b. The y=1 outcome can be decomposed into 3 categories: (1) stay at home, (2) visit another recreation site, or (3) do something else. I've estimated the binary logit, where the 3 categories are collapsed into y=1, and the multinomial logit, where y = 0, 1, 2, or 3. When I constrain the b and c coefficients to be equal across response categories, the constant is 62% larger in the multinomial logit. This makes WTP significantly higher in the multinomial logit.
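
As a minimal sketch of the WTP formula above: WTP is the added cost at which the index a + b*cost + c*risk + d*overnight equals zero, so the predicted probability of declining the trip is 0.5. The coefficient values below are hypothetical, chosen only to match the stated signs, and the function name `wtp` is illustrative rather than anything from the original analysis.

```python
def wtp(a, b, c=0.0, d=0.0, risk=0.0, overnight=0):
    """Added cost at which a + b*cost + c*risk + d*overnight = 0."""
    return -(a + c * risk + d * overnight) / b

# Hypothetical coefficients with the signs from the post: a<0, b>0, c>0, d<0.
a, b, c, d = -2.0, 0.05, 0.8, -0.5

wtp_day = wtp(a, b)                            # day trip, no risk: -a/b
wtp_overnight = wtp(a, b, c, d, overnight=1)   # overnight trip, no risk

# Holding b fixed, a constant 62% larger in magnitude scales WTP by 1.62.
wtp_day_scaled = wtp(1.62 * a, b)
```

Because WTP is linear in the constant when b is held fixed, any change in the constant passes straight through to WTP.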

My question is: Why is the constant so much larger in the multinomial logit?

[Figure: MNLvsBinary]

 

  1. Charlie Gibbons

    I think this is what’s going on:
    Logit models are a way of using discrete choices to model utility.
    The underlying utility has observable and unobservable components.
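    In general, utility represents preferences only up to a positive affine transformation: adding the same constant to every alternative's utility, or multiplying every utility by the same positive number, leaves the predicted choices unchanged.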
    Because of this affine invariance, the level and scale of utility must be fixed to make the econometric model identifiable.
    The level of utility is often fixed by setting the “no trip” constant to 0.
    The scale of utility is fixed by setting the variance of the unobserved utility to a chosen value (pi^2/6 in the standard logit, which is the variance of a standard type I extreme value error).
    Collapsing the alternatives impacts the level normalization, particularly because there is only one constant in the model. To offer an extreme example, suppose that there are three options: take the trip, get a root canal, or do something else. Perhaps we observe results that are 50-0-50%. When each option is explicitly considered in the MNL model, the constant for the trip would be relatively high—it’s partly being compared against the odds of choosing something really bad. But once the alternatives are collapsed, the model becomes essentially a 50-50 model of trip vs. anything else, yielding a different constant value.
    To make this point really clear, imagine having 98 really terrible options as alternatives and one anything else option plus the trip. The MNL would need to have a constant that yields 50% spread evenly over the 99 no trip alternatives and 50% on the trip alternative (the model only has one constant and it’s for the trip alternative). That would have to be a big constant: its exponential must be about 99, so the constant itself is about ln(99) ≈ 4.6, and larger still if the other observed terms are costs that reduce utility. Collapse it down, however, and the constant just needs to yield a 50-50 prediction, which would be closer to 0.
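    A minimal sketch of the arithmetic in the two examples above, assuming the trip has utility equal to the single constant and every no-trip alternative has utility 0 (no cost, risk, or overnight terms). Under those assumptions the trip share is exp(a) / (exp(a) + J) for J no-trip alternatives. The function names are illustrative.

```python
import math

def trip_share(constant, n_no_trip):
    """Logit share of the trip when each no-trip alternative has utility 0."""
    return math.exp(constant) / (math.exp(constant) + n_no_trip)

def trip_constant(share, n_no_trip):
    """Constant that reproduces a target trip share; inverse of trip_share."""
    return math.log(n_no_trip * share / (1.0 - share))

c_binary = trip_constant(0.5, 1)     # collapsed model: trip vs. anything else
c_three = trip_constant(0.5, 2)      # trip, root canal, something else
c_hundred = trip_constant(0.5, 99)   # trip plus 99 no-trip alternatives

print(c_binary, c_three, c_hundred)
```

    The same 50% trip share needs a constant of 0 in the collapsed model, ln(2) with two no-trip alternatives, and ln(99), about 4.6, with 99 of them.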
    Collapsing also impacts the scale normalization. Suppose that we could observe the unobserved utility for each option for a given respondent. The trip unobserved component is 0.5 and the unobserved components for the “no trip” options are -0.2, 0.1, and 0.4. The sample variance of these four values is 0.1.
    Now suppose that we lump the no trip options together. This essentially puts the trip option against the best no trip option; hence we are keeping the unobserved utilities of 0.5 and 0.4. [It’s not clear to me from your description whether the observable components are the same for all no trip options.] Now the unobserved utility for this respondent has a sample variance of 0.005.
    However, each formulation is going to be rescaled such that the variance of the unobserved utility is pi^2/6. This rescaling affects the two modeling approaches differently. Hence, the scale of the parameters cannot be compared between the two models.
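    The toy numbers above can be checked directly. This is only a sketch of the comment's illustration, not of how a logit is actually estimated: it computes the two sample variances and the multiplier that would bring each one to pi^2/6. The helper `rescale_factor` is a hypothetical name.

```python
import math
import statistics

full = [0.5, -0.2, 0.1, 0.4]   # trip plus the three no-trip components
collapsed = [0.5, 0.4]         # trip plus the best no-trip component

# statistics.variance uses the sample (n - 1) denominator.
var_full = statistics.variance(full)
var_collapsed = statistics.variance(collapsed)

target = math.pi ** 2 / 6      # variance fixed by the logit normalization

def rescale_factor(variance):
    """Multiplier on utility that moves `variance` to the target variance."""
    return math.sqrt(target / variance)

print(var_full, var_collapsed)
print(rescale_factor(var_full), rescale_factor(var_collapsed))
```

    The two multipliers differ by a factor of sqrt(0.1 / 0.005) = sqrt(20), which is why coefficient magnitudes from the two models are not on a common scale.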
    How could you get the same results from a binary logit and an MNL? You could retain the respondent’s actual choice, then randomly choose one alternative among the remaining to complete the choice set. Note that this could mean comparing two distinct no trip options to each other. Kenneth Train’s discrete choice book discusses this approach as a way to reduce the computational burden of a data set, for example.
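    A minimal sketch of building such a two-alternative choice set, assuming alternatives are identified by simple labels. The function name `sampled_choice_set` and the labels are hypothetical, and this covers only the sampling step, not the estimation.

```python
import random

def sampled_choice_set(chosen, alternatives, rng=random):
    """Keep the chosen alternative and add one randomly drawn unchosen one."""
    others = [alt for alt in alternatives if alt != chosen]
    return [chosen, rng.choice(others)]

alternatives = ["trip", "stay home", "other site", "something else"]
rng = random.Random(0)  # seeded only so the sketch is reproducible

# A respondent who stayed home may be paired with another no-trip option.
pair = sampled_choice_set("stay home", alternatives, rng)
print(pair)
```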

  2. Charlie

    One point that might not be clear: Because the scale can change between the models, all the parameters could change in value between the two, hence it would not be appropriate to constrain the coefficients to be the same between the two models.

  3. John Whitehead

    Thanks, that’s very helpful!
    Can you use AIC to choose between models?

  4. Charlie

    Model selection criteria typically measure goodness of fit (i.e., how well they predict an outcome), often with penalty terms for more complex models (i.e., models with more parameters). Because these criteria measure how well an outcome is predicted, they can only be used to compare models where the outcome is the same.
    For some intuition, imagine running a regression of log income on demographic factors. Then, imagine running a (linear) regression of an indicator for being in the top 10% of incomes on the same demographic factors.
    R-squared is a measure of goodness of fit for regression models (not a great model selection criterion, but it works for intuition here). It wouldn’t be very useful in comparing the two proposed regressions, however. In one case it measures how well you can predict someone’s (log) income; in the other it measures how well you can predict who’s in the top 10% of incomes. Those are different questions.
    Similarly in your case, you are changing what you’re trying to predict between the models. You can imagine being able to predict one choice better than the other. But that outcome might not be relevant for your question of interest.
    So the short answer is that AIC or other model selection criteria can’t help choose between the models because the outcomes are different.
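    To see the problem in numbers, here is a sketch with constants-only models and hypothetical response counts (not the data from the post). A constants-only logit with a full set of alternative-specific constants fits the observed shares exactly, so its maximized log-likelihood is the sum of count * log(share).

```python
import math

def max_loglik(counts):
    """Maximized log-likelihood of a constants-only model fit to `counts`."""
    n = sum(counts)
    return sum(k * math.log(k / n) for k in counts if k > 0)

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

# Hypothetical counts: trip, stay home, other site, something else.
counts_mnl = [50, 20, 20, 10]
counts_binary = [counts_mnl[0], sum(counts_mnl[1:])]

ll_binary = max_loglik(counts_binary)
ll_mnl = max_loglik(counts_mnl)
ll_within_no_trip = max_loglik(counts_mnl[1:])

aic_binary = aic(ll_binary, n_params=1)   # one constant
aic_mnl = aic(ll_mnl, n_params=3)         # three constants

print(aic_binary, aic_mnl)
```

    The multinomial log-likelihood equals the binary one plus the log-likelihood of predicting which no-trip option was chosen, so its AIC is higher simply because it is scored on a harder outcome, not because it fits the trip decision any worse.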
    Another point of comparison that might be useful: Andrew Gelman has discussed on his blog why continuous outcomes should not be discretized: https://statmodeling.stat.columbia.edu/2014/02/26/econometrics-political-science-epidemiology-etc-dont-model-probability-discrete-outcome-model-underlying-continuous-variable/. Collapsing the no trip options is similar, in that it is discarding potentially useful information. This analogy suggests that you should model each option separately; you could simulate aggregated comparisons after estimation if you wanted.
    One further note if you estimate each no trip option separately: You should probably allow each alternative to have its own constant (excluding one for identification as usual). This ensures that you get accurate share predictions for each alternative.
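    A short sketch of why alternative-specific constants reproduce the shares, again using hypothetical shares and a constants-only model. The first alternative is taken as the base with its constant fixed at 0; the function names are illustrative.

```python
import math

def constants_from_shares(shares):
    """Alternative-specific constants relative to the first (base) alternative."""
    base = shares[0]
    return [math.log(s / base) for s in shares]

def shares_from_constants(constants):
    """Logit shares implied by a list of constants."""
    weights = [math.exp(c) for c in constants]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical shares: stay home (base), other site, something else, trip.
observed = [0.2, 0.2, 0.1, 0.5]

constants = constants_from_shares(observed)
fitted = shares_from_constants(constants)

# With a single trip constant, the three no-trip options are forced to be equal.
one_constant = shares_from_constants([0.0, 0.0, 0.0, math.log(3)])

print(fitted, one_constant)
```

    With one constant per non-base alternative the fitted shares match the observed ones; with only a trip constant, each no-trip option is predicted at the same share regardless of the data.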

