Assignment 3. Due date November 4th.

Question 1. Classification and out-of-sample prediction.

This question is based on the internet demographics dataset:

uva.jmp  The data in JMP format (2.3 MB).
uva.zip  The data (zipped) in JMP format (0.18 MB).
uva.txt  The data in TXT format (1.6 MB).

Recall that the variable of interest is ``Newbie'', defined as those internet users in the survey who self-report less than one year's internet experience.

About 25% of users in the survey would be classified as Newbies. The objective of this question is to attempt to find demographic variables that help identify these new users.

Variables to choose from include:

     ``Age''                  ``Gender''
     ``Household.Income''     ``Sexual.Preference''
     ``Country''              ``Education.Attainment''
     ``Major.Occupation''     ``Marital.Status''

A. This dataset has missing values. Which variable has the highest proportion of missing values? What impact do you think removing the rows with missing values will have on a subsequent analysis? Be brief and state any assumptions.
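
One way to tabulate the proportion of missing values per variable is sketched below in Python. It assumes the data have already been read into a list of dictionaries and that a missing value shows up as an empty string or None; how uva.txt actually encodes missing values is not stated here, so check that first. The function name and the toy rows are made up for illustration.

```python
def missing_proportions(rows, missing=("", None)):
    """Return {column: fraction of rows whose value counts as missing}."""
    if not rows:
        return {}
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col in counts:
            if row.get(col) in missing:
                counts[col] += 1
    return {col: n / len(rows) for col, n in counts.items()}


# Toy rows, NOT taken from the survey data.
toy_rows = [
    {"Age": "34", "Gender": "Male", "Household.Income": ""},
    {"Age": "", "Gender": "Female", "Household.Income": ""},
    {"Age": "51", "Gender": "Female", "Household.Income": "50-74"},
    {"Age": "27", "Gender": "Male", "Household.Income": "30-39"},
]

proportions = missing_proportions(toy_rows)
worst = max(proportions, key=proportions.get)  # column with most missing values
print(proportions, worst)
```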

B. Comment on any apparent relationships between the demographic variables and ``Newbie'' status. That is, perform a marginal analysis of the Newbie variable by each potential explanatory variable. Write a single paragraph that characterizes a ``Newbie'' from this marginal perspective.

A product is to be targeted at new users. Contacting a potential customer costs $10.00. If the user is a Newbie then you can expect to make $30.00 from the sale of the product. If the user is not a Newbie then you make no money and the $10.00 is wasted. You have enough money that in theory you could mail everyone, but this would be wasteful if you are trying to maximize your return: only about 25% of users in the survey are Newbies, so most of those contacts would cost $10.00 and earn nothing. Instead, a subset needs to be selected whose members have a higher probability of being a Newbie.
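
The cost and revenue figures above imply a break-even probability, which the short Python sketch below works out. It assumes the $30.00 is gross revenue from which the $10.00 contact cost is then subtracted; the function and constant names are illustrative only.

```python
CONTACT_COST = 10.00   # cost of contacting one user (from the text)
SALE_REVENUE = 30.00   # revenue if the contacted user is a Newbie (from the text)


def expected_profit_per_contact(p_newbie):
    """Expected profit from contacting one user who is a Newbie with probability p_newbie."""
    return p_newbie * SALE_REVENUE - CONTACT_COST


# Probability at which a contact exactly pays for itself.
break_even = CONTACT_COST / SALE_REVENUE

print(expected_profit_per_contact(0.25))  # mailing a user drawn at random from the survey
print(break_even)
```

Under these assumptions a contact is worthwhile on average only when the user's probability of being a Newbie exceeds one third, which is above the 25% base rate.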

Your objective is to identify such a subset using a logistic regression model. You can use out-of-sample validation to test your ideas. For example, say you use a simple logistic regression model that gives the following out-of-sample prediction:

 
                Predicted
                OLD   NEW 
          +---+-----+----+
Observed  |OLD| 5607| 226|
          +---+-----+----+   
          |NEW| 1656| 227|
          +---+-----+----+

and you mail out to the 453 people (226 + 227) who were predicted to be Newbies. This costs you $4530.00. Of these 453, 227 were in fact Newbies, so you could expect to get back $30.00 * 227 = $6810.00, a profit of $6810.00 - $4530.00 = $2280.00.
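
The same arithmetic can be written as a small Python function, so that it can be reused on other classification tables. This is only a sketch: the nested-dictionary layout (observed class first, predicted class second) and the function name are choices made here for illustration.

```python
def mailing_profit(confusion, cost=10.00, revenue=30.00):
    """Profit from mailing everyone predicted NEW.

    confusion[observed][predicted] holds the count for that cell.
    """
    mailed = confusion["OLD"]["NEW"] + confusion["NEW"]["NEW"]
    sales = confusion["NEW"]["NEW"]  # predicted NEW and actually NEW
    return revenue * sales - cost * mailed


# Counts copied from the table above.
table = {
    "OLD": {"OLD": 5607, "NEW": 226},
    "NEW": {"OLD": 1656, "NEW": 227},
}

profit = mailing_profit(table)
print(profit)
```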

C. Report a chosen model, using no interaction terms, together with a strategy and the expected return from that strategy based on an out-of-sample analysis (use 60% of the data for the out-of-sample analysis).
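
One possible form of strategy is to mail only those held-out users whose predicted probability of being a Newbie exceeds some cutoff, and to compare the resulting profit across cutoffs. The Python sketch below illustrates the bookkeeping. It assumes you already have predicted probabilities for the held-out rows from your fitted model; the probabilities and labels shown are invented toy values, not results from the survey data.

```python
def profit_at_threshold(probs, is_newbie, threshold, cost=10.00, revenue=30.00):
    """Profit from mailing every held-out user whose predicted probability exceeds threshold."""
    mailed = [y for p, y in zip(probs, is_newbie) if p > threshold]
    return revenue * sum(mailed) - cost * len(mailed)


# Invented toy values: predicted P(Newbie) and the observed status (1 = Newbie).
toy_probs = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
toy_newbie = [0, 0, 0, 0, 1, 0, 1, 1]

for threshold in (0.0, 1.0 / 3.0, 0.5):
    print(threshold, profit_at_threshold(toy_probs, toy_newbie, threshold))
```

A cutoff of 0.0 corresponds to mailing everyone in the hold-out set, which gives a baseline against which other cutoffs can be judged.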

D. Repeat the analysis, but this time using a model that includes interaction terms. Report the model. How much extra expected profit does including interaction terms in the classification model provide?

Richard Waterman
Tue Oct 26 12:20:22 EDT 1999