Question 1. Classification and out of sample prediction.

Take the internet demographics dataset, called uva.txt.

Recall that the variable of interest is ``Newbie'', defined as those Internet users in the survey who self-report less than one year's Internet experience.

About 25% of users in the survey would be classified as Newbies. The objective of this question is to attempt to find demographic variables that help identify these new users.

Variables to choose from include:

     ``Age''                 
     ``Gender''               ``Household.Income''    
     ``Sexual.Preference''    ``Country''             
     ``Education.Attainment'' ``Major.Occupation''    
     ``Marital.Status''

A. This dataset has missing values. Which variable has the highest proportion of missing values? What impact do you think that removing the rows with missing values will have on a subsequent analysis: be brief and state any assumptions.

Filter out the rows with missing values from the dataset.
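
As a rough sketch of how part A might be approached in R, the example below counts missing values per column and then drops incomplete rows. The data frame `toy` and all of its values are invented for illustration; the real data would first have to be read from uva.txt, whose file layout is not described here.

```r
# Toy stand-in for the survey data; all values are invented for illustration.
toy <- data.frame(
  Newbie = c(1, 0, 0, 1, 0, 0),
  Age = c(23, NA, 41, 35, 29, 52),
  Gender = c("Male", "Female", NA, "Male", "Female", "Female"),
  Household.Income = c(NA, NA, "50-74", NA, "Under 10", "30-39")
)

# Proportion of missing values in each column, largest first.
prop.missing <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(prop.missing)

# Keep only the rows with no missing value in any column.
toy.nomiss <- na.omit(toy)
print(nrow(toy.nomiss))
```

Comparing `nrow(toy)` with `nrow(toy.nomiss)` shows how much data the filter discards, which is relevant to the question about its impact.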

B. Comment on any apparent relationships between the demographic variables and ``Newbie'' status. (Use the table command from the command line; type help(table) to find out how to use it.)
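
A minimal sketch of the kind of cross-tabulation part B asks for, assuming R. The data frame below is invented for illustration, and `prop.table` is used here as one convenient way to turn counts into within-group proportions.

```r
# Toy stand-in for the filtered data; all values are invented for illustration.
toy.nomiss <- data.frame(
  Newbie = c(1, 0, 0, 1, 0, 0, 1, 0),
  Gender = c("Female", "Male", "Male", "Female",
             "Male", "Female", "Male", "Male")
)

# Cross-tabulate a demographic variable (rows) against Newbie status (columns).
counts <- table(toy.nomiss$Gender, toy.nomiss$Newbie)
print(counts)

# Row proportions: the share of Newbies within each Gender level.
row.props <- prop.table(counts, 1)
print(row.props)
```

Row proportions that differ noticeably from the overall Newbie rate suggest that the variable may help to identify new users.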

A product is to be targeted at new users. Contacting potential customers costs $10.00. If the user is a Newbie then you can expect to make $30.00 from the sale of the product. If the user is not a Newbie then you make no money and the $10.00 is wasted. You have enough money that in theory you could contact everyone, but this would lose money on average: only about 25% of users in the survey are Newbies, so the expected revenue per contact ($30.00 * 0.25 = $7.50) is less than the $10.00 cost. Rather, a subset needs to be selected whose members have a higher probability of being a Newbie.
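
The economics above can be written out as a short calculation. This is only a sketch in R; the function name `expected.profit` is a hypothetical helper, and the only inputs are the $10.00 cost and $30.00 revenue stated in the text.

```r
# Expected profit from contacting one user whose probability of being a
# Newbie is p: $30 revenue with probability p, minus the $10 contact cost.
expected.profit <- function(p, revenue = 30, cost = 10) {
  revenue * p - cost
}

# Contacting everyone: p is the overall Newbie rate of about 25%.
print(expected.profit(0.25))

# Break-even probability: a contact pays off on average only when
# p exceeds cost / revenue.
break.even <- 10 / 30
print(break.even)
```

Under these figures a contact is worthwhile on average only for users whose probability of being a Newbie is above one third.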

Your objective is to identify such a subset using a logistic regression model. You can use ``out of sample validation'' to test your ideas. For example, say you use the simple logistic regression model

glm.out <- glm(Newbie ~ Age + Gender + Education.Attainment, 
                        family=binomial,data=uva.nomiss,subset=training)

which provides the following out of sample prediction:

     0   1 
0 5607 226
1 1656 227

Here the rows give the actual Newbie status and the columns the predicted status. Suppose you mailed the 226 + 227 = 453 people who were predicted to be Newbies. This costs you $4530.00. Of these 453, 227 were in fact Newbies, so you could expect to get back $30.00 * 227 = $6810.00, a profit of $6810.00 - $4530.00 = $2280.00.
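
The profit calculation can be checked directly from the table. The sketch below re-enters the four counts by hand, with rows as actual status and columns as predicted status; `mailing.profit` is a hypothetical helper name, not part of any package.

```r
# The out of sample table from the text: rows are actual status,
# columns are predicted status.
confusion <- matrix(c(5607, 1656, 226, 227), nrow = 2,
                    dimnames = list(actual = c("0", "1"),
                                    predicted = c("0", "1")))

# Hypothetical helper: profit from mailing everyone predicted to be a Newbie.
mailing.profit <- function(tab, revenue = 30, cost = 10) {
  n.mailed <- sum(tab[, "1"])    # everyone predicted to be a Newbie
  n.newbies <- tab["1", "1"]     # mailed users who really are Newbies
  revenue * n.newbies - cost * n.mailed
}

print(mailing.profit(confusion))
```

The same helper can be applied to the table from any other model or cutoff, which makes competing strategies easy to compare.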

C. Report a chosen model together with a strategy and the expected return from that strategy based on an out of sample analysis.
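
One possible shape for the out of sample analysis is sketched below in R. Because uva.txt is not reproduced here, the data are simulated: the data frame `sim`, its two predictors and the effect sizes are all invented. The cutoff of 1/3 is an illustrative choice, being the probability at which $30.00 * p equals the $10.00 contact cost; other cutoffs may do better.

```r
set.seed(1)

# Simulated stand-in for uva.nomiss; variables and effects are invented.
n <- 2000
sim <- data.frame(
  Age = round(runif(n, 18, 70)),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
p.true <- plogis(0.5 - 0.04 * sim$Age + 0.6 * (sim$Gender == "Female"))
sim$Newbie <- rbinom(n, 1, p.true)

# Logical vector marking a random half of the rows as the training set.
training <- sample(rep(c(TRUE, FALSE), length.out = n))

# Fit the logistic regression on the training rows only.
fit <- glm(Newbie ~ Age + Gender, family = binomial, data = sim,
           subset = training)

# Predicted Newbie probabilities for the held-out rows.
holdout <- sim[!training, ]
p.hat <- predict(fit, newdata = holdout, type = "response")

# Strategy: mail the held-out users whose predicted probability
# exceeds the cutoff, then total up revenue minus cost.
cutoff <- 1/3
mailed <- p.hat > cutoff
profit <- 30 * sum(holdout$Newbie[mailed]) - 10 * sum(mailed)

print(table(actual = holdout$Newbie, predicted = as.numeric(mailed)))
print(profit)
```

Because the profit is computed only on rows that were not used for fitting, it is a fairer estimate of the return from the strategy than the same calculation on the training rows.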

Repeat the analysis, but this time using a Neural Network, rather than a logistic regression model for your classifier.
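
A hedged sketch of the neural network version, assuming the `nnet` package (single hidden layer networks) is installed. As before the data are simulated and every variable is invented; the settings `size = 3`, `decay = 0.01` and `maxit = 200` are arbitrary illustrative choices, not recommendations.

```r
library(nnet)   # assumption: the nnet package is available
set.seed(1)

# Simulated stand-in for uva.nomiss; variables and effects are invented.
n <- 2000
sim <- data.frame(
  Age.scaled = runif(n, 18, 70) / 100,   # rescaled to help the optimizer
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
p.true <- plogis(0.5 - 4 * sim$Age.scaled + 0.6 * (sim$Gender == "Female"))
sim$Newbie <- factor(rbinom(n, 1, p.true))   # factor response: classification

# Random half of the rows for training, the rest held out.
training <- sample(rep(c(TRUE, FALSE), length.out = n))
train <- sim[training, ]
holdout <- sim[!training, ]

# Single hidden layer network with 3 hidden units.
net <- nnet(Newbie ~ Age.scaled + Gender, data = train,
            size = 3, decay = 0.01, maxit = 200, trace = FALSE)

# For a two-level factor response, type = "raw" gives the fitted
# probability of the second level, here "1" (Newbie).
p.hat <- as.vector(predict(net, newdata = holdout, type = "raw"))

# Same mailing rule and profit calculation as for a logistic regression.
mailed <- p.hat > 1/3
profit <- 30 * sum(holdout$Newbie[mailed] == "1") - 10 * sum(mailed)
print(profit)
```

Network fits depend on the random starting weights, so results can vary between runs unless the seed is fixed.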

D. Report your network, strategy and expected return from an out of sample analysis.

Richard Waterman
Tue Nov 18 12:45:42 EST 1997