Analysis of SAT Verbal and Math scores

Introduction

This is a simple analysis of some SAT scores. First we read in the data and display its structure:

sat=read.table("sat.txt",header=TRUE)
str(sat)

## 'data.frame':    162 obs. of  3 variables:
##  $ verbal: int  450 640 590 400 600 610 630 660 660 590 ...
##  $ math  : int  450 540 570 400 590 610 610 570 720 640 ...
##  $ gender: Factor w/ 2 levels "F","M": 1 1 2 2 2 2 1 2 1 1 ...

The data appear to have been read in correctly.
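One thing to watch if you rerun this on a recent version of R: since R 4.0.0, read.table no longer turns text columns into factors by default, so gender would show up as chr rather than Factor in the str output. Here is a minimal sketch of one way to get the factor back. It uses a temporary file holding only the first four rows shown above; the file and the name sat.small are stand-ins for illustration, not the real sat.txt:

```r
# Write the first four rows from the str() output to a temporary file
# (a stand-in for sat.txt, which is not reproduced here)
tmp=tempfile(fileext=".txt")
writeLines(c("verbal math gender",
             "450 450 F",
             "640 540 F",
             "590 570 M",
             "400 400 M"),tmp)

# stringsAsFactors=TRUE asks for text columns to be read as factors,
# which was the default behaviour before R 4.0.0
sat.small=read.table(tmp,header=TRUE,stringsAsFactors=TRUE)
str(sat.small)
```

With the real file, the same extra argument on the read.table call should give gender as a factor with levels F and M.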

We are interested in whether there is a relationship between math scores and verbal scores.

Scatterplot

Our first step is to draw a scatterplot, as below. I added a smooth trend so that we can see what kind of relationship we have:

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

ggplot(sat,aes(x=verbal,y=math))+geom_point()+
  geom_smooth()

## `geom_smooth()` using method = 'loess'

We see that as verbal score increases, math score also tends to increase (though the rate of increase seems to level off at higher verbal scores). The strength of the relationship is moderate at best: there are many students who are far from the trend.

Regression

Let us try to fit a regression to see how well it works. We’ll predict math score from verbal score (ignoring gender):

sat.1=lm(math~verbal,data=sat)
summary(sat.1)

## 
## Call:
## lm(formula = math ~ verbal, data = sat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -173.590  -47.596    1.158   45.086  259.659 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 209.55417   34.34935   6.101 7.66e-09 ***
## verbal        0.67507    0.05682  11.880  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.75 on 160 degrees of freedom
## Multiple R-squared:  0.4687, Adjusted R-squared:  0.4654 
## F-statistic: 141.1 on 1 and 160 DF,  p-value: < 2.2e-16

This shows a strongly significant relationship between the two variables. I am a little surprised that the P-value for verbal is so small, given that the trend is only moderately strong. But we have a lot of data, so that even a relationship of this strength is a lot stronger than we would see just by chance.
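To see how a moderate relationship can come with such a small P-value, it may help to redo the arithmetic by hand. The sketch below uses only numbers copied from the output above, apart from the sample size of 20, which is made up for comparison. The variable names are mine:

```r
# Numbers copied from the summary(sat.1) output above
est=0.67507        # slope estimate for verbal
se=0.05682         # its standard error
r2=0.4687          # multiple R-squared
df.resid=160       # residual degrees of freedom

t.stat=est/se                       # t statistic for the slope, about 11.88
f.stat=r2/(1-r2)*df.resid           # F statistic from R-squared, about 141
r.xy=sqrt(r2)                       # correlation, about 0.68 (positive, as the slope is)
p.value=2*pt(-abs(t.stat),df.resid) # two-sided P-value for the slope

# The same correlation, but with only 20 students (a made-up sample size)
t.small=r.xy*sqrt(20-2)/sqrt(1-r2)
p.small=2*pt(-abs(t.small),20-2)

c(t=t.stat,F=f.stat,r=r.xy,p=p.value,t.small=t.small,p.small=p.small)
```

The correlation is the same in both cases, but the t statistic grows with the square root of the degrees of freedom, so the larger sample gives a much smaller P-value.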

Extra stuff

This is as far as I wanted you to go, but I was interested in a couple of things:

should I be fitting a curve?
is there an effect of gender also?

To tackle the first thing, our starting point is a plot of the residuals against the fitted values, thus:

ggplot(sat.1,aes(x=.fitted,y=.resid))+geom_point()

This looks pretty random. I also don’t see any fan-out (or anything like non-constant variance). I’m guessing, as well, that the normal quantile plot of the residuals should look pretty good:

r=resid(sat.1)
qqnorm(r)
qqline(r)

There is one small outlier at the top, and the many (161) other points are all on the line. I call that good.
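Another way to ask the "curve" question, not used above, is to add a squared term to the regression and test it. The sketch below does this on simulated data (not the real SAT scores), generated from a straight line with noise, so the names sim, sim.1 and sim.2 and all the numbers in the simulation are my assumptions:

```r
# Simulated stand-in for the SAT data (NOT the real scores):
# a straight-line relationship plus noise, the same size as the real data
set.seed(1)
verbal=round(rnorm(162,mean=600,sd=100))
math=210+0.675*verbal+rnorm(162,mean=0,sd=72)
sim=data.frame(verbal,math)

sim.1=lm(math~verbal,data=sim)        # straight line
sim.2=update(sim.1,.~.+I(verbal^2))   # add a squared term

summary(sim.2)$coefficients           # look at the I(verbal^2) row
anova(sim.1,sim.2)                    # the same test, done as an F-test
```

With the real data you would replace sim.1 by sat.1. A squared term that is not significant would agree with the random-looking residual plot.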

As for gender, well, I just add it to the regression and test it for significance:

sat.2=update(sat.1,.~.+gender)
summary(sat.2)

## 
## Call:
## lm(formula = math ~ verbal + gender, data = sat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -167.786  -43.444   -2.023   44.512  279.214 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 184.58164   34.06782   5.418 2.19e-07 ***
## verbal        0.68613    0.05513  12.446  < 2e-16 ***
## genderM      37.21856   10.93993   3.402 0.000846 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.49 on 159 degrees of freedom
## Multiple R-squared:  0.5047, Adjusted R-squared:  0.4985 
## F-statistic: 81.02 on 2 and 159 DF,  p-value: < 2.2e-16

Once again, there is a strong effect of verbal (knowing a student’s verbal test score definitely helps to predict their math score), but there is also an effect of gender, with a P-value of 0.00085. If you look at the coefficient, which is labelled genderM, you’ll see that it is about 37. This means that for two students with the same verbal test score (this is the “everything else equal” thing), one of them male and the other female, the male would be expected to get 37 more points on average than the female. You can speculate as to the reasons for that, but the statistics say that it is real, and not just chance.
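The “everything else equal” interpretation can be checked by hand from the coefficients. This sketch uses only numbers copied from the output above, apart from the verbal score of 600, which is an arbitrary value chosen for illustration. The variable names are mine:

```r
# Coefficients copied from the summary(sat.2) output above
b0=184.58164       # intercept (female is the baseline)
b.verbal=0.68613   # slope for verbal
b.male=37.21856    # genderM: extra points for a male student
se.male=10.93993   # standard error of genderM

v=600              # an arbitrary verbal score, chosen for illustration
pred.F=b0+b.verbal*v          # predicted math score, female
pred.M=b0+b.verbal*v+b.male   # predicted math score, male
pred.M-pred.F                 # equals b.male whatever value v has

# 95% confidence interval for the gender effect (159 residual df)
ci.male=b.male+c(-1,1)*qt(0.975,159)*se.male
ci.male
```

The difference in predictions does not depend on v, because this model has the same slope for both genders. The confidence interval stays above zero, which matches the small P-value, but it is wide.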

ggplot makes it easy to get a scatterplot with the points indicated by gender, and also to put separate lines on for males and females:

ggplot(sat,aes(x=verbal,y=math,colour=gender))+geom_point()+
  geom_smooth(method="lm")

The plot shows the gender effect, which is consistent over all verbal scores, but it also shows that there is an enormous amount of variability: the gender effect is very much an “on average” thing, with some females scoring a lot higher and some males scoring a lot lower than you would guess. The grey envelopes, which are confidence intervals for the mean math score for all the different verbal scores and the two genders, overlap quite a bit (even though they are “significantly different”), and the prediction intervals for individual students with those verbal scores would overlap a lot more for male and female students.
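The difference between the two kinds of interval can be seen with predict. The sketch below uses simulated data (not the real SAT scores), with made-up values loosely based on the regression output, and it fits the additive model with verbal and gender rather than the separate lines drawn on the plot. The names sim and sim.fit are mine:

```r
# Simulated stand-in for the SAT data (NOT the real scores)
set.seed(2)
n=162
verbal=round(rnorm(n,mean=600,sd=100))
gender=factor(sample(c("F","M"),n,replace=TRUE))
math=185+0.69*verbal+37*(gender=="M")+rnorm(n,mean=0,sd=70)
sim=data.frame(verbal,math,gender)
sim.fit=lm(math~verbal+gender,data=sim)

# One female and one male student, both with verbal score 600
new=data.frame(verbal=c(600,600),gender=factor(c("F","M"),levels=c("F","M")))
cint=predict(sim.fit,new,interval="confidence")  # interval for the mean score
pint=predict(sim.fit,new,interval="prediction")  # interval for one student

cw=cint[,"upr"]-cint[,"lwr"]   # widths of the confidence intervals
pw=pint[,"upr"]-pint[,"lwr"]   # widths of the prediction intervals
cbind(new,ci.width=cw,pi.width=pw)
```

The prediction intervals are always the wider ones, because they include the student-to-student variability as well as the uncertainty in the mean.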

Ken Butler

2017-01-06
