This is a simple analysis of some SAT scores. First we read in the data and display its structure:

```
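# stringsAsFactors=TRUE reads gender as a factor (as shown below) in R 4.0 and later too
sat=read.table("sat.txt",header=TRUE,stringsAsFactors=TRUE)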
str(sat)
```

```
## 'data.frame': 162 obs. of 3 variables:
## $ verbal: int 450 640 590 400 600 610 630 660 660 590 ...
## $ math : int 450 540 570 400 590 610 610 570 720 640 ...
## $ gender: Factor w/ 2 levels "F","M": 1 1 2 2 2 2 1 2 1 1 ...
```

The data appear to have been read in correctly.

We are interested in whether there is a relationship between math scores and verbal scores.

Our first step should be to draw a scatterplot, which we do below. I added a smooth trend so that we can see what kind of relationship we have:

```
library(tidyverse)
```

```
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
```

```
ggplot(sat,aes(x=verbal,y=math))+geom_point()+
geom_smooth()
```

`` ## `geom_smooth()` using method = 'loess' ``

We see that as verbal score increases, math score also tends to increase (though the rate of increase seems to level off at the higher verbal scores). The strength of the relationship is moderate at best: there are many students who are far from the trend.

Let us try to fit a regression to see how well it works. We’ll predict math score from verbal score (ignoring gender):

```
sat.1=lm(math~verbal,data=sat)
summary(sat.1)
```

```
##
## Call:
## lm(formula = math ~ verbal, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -173.590 -47.596 1.158 45.086 259.659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 209.55417 34.34935 6.101 7.66e-09 ***
## verbal 0.67507 0.05682 11.880 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.75 on 160 degrees of freedom
## Multiple R-squared: 0.4687, Adjusted R-squared: 0.4654
## F-statistic: 141.1 on 1 and 160 DF, p-value: < 2.2e-16
```

This shows a strongly significant relationship between the two variables. I am a little surprised that the P-value for `verbal` is so small, given that the trend is only moderately strong. But we have a lot of data, so that even a relationship of this strength is a *lot* stronger than we would see just by chance.
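To see where that small P-value comes from, here is a rough sketch that rebuilds the test statistic from nothing but the R-squared and the sample size in the output. The variable names (`r_squared`, `n` and so on) are my own, and the numbers are copied from the printed output, so the results are only as accurate as that rounding.

```r
# Values copied from the summary(sat.1) output (rounded as printed)
r_squared <- 0.4687
n <- 162

# In a one-predictor regression, F = R^2 * (n - 2) / (1 - R^2)
f_stat <- r_squared * (n - 2) / (1 - r_squared)

# ... and the t statistic for the slope is the square root of F
t_stat <- sqrt(f_stat)

# Two-sided P-value on n - 2 degrees of freedom
p_value <- 2 * pt(-t_stat, df = n - 2)

c(F = f_stat, t = t_stat, P = p_value)
```

These come out at about 141 and 11.9, matching the F and t values in the output. The `n - 2` factor is what does it: with the same R-squared but only a dozen students, the same formula would give a much smaller statistic.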

This is as far as I wanted you to go, but I was interested in a couple of things:

- should I be fitting a curve?
- is there an effect of `gender` also?

To tackle the first thing, our starting point is a plot of the residuals against the fitted values, thus:

`ggplot(sat.1,aes(x=.fitted,y=.resid))+geom_point()`

This looks pretty random. I also don’t see any fan-out (or anything like non-constant variance). I’m guessing, as well, that the normal quantile plot of the residuals should look pretty good:

```
r=resid(sat.1)
qqnorm(r)
qqline(r)
```

One small outlier at the top, and every one of the many (161) other points on the line. I call that good.
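If the residual plot had been less clear-cut, one way to check for a curve more formally would be to add a squared term and test it. The sketch below does this on simulated data rather than the real SAT scores, so `fake`, the numbers used to generate it, and the model names `fit_line` and `fit_curve` are all made up for illustration.

```r
# Synthetic stand-in data (NOT the real SAT scores):
# a straight-line trend plus noise, same number of rows as sat
set.seed(1)
n <- 162
fake <- data.frame(verbal = round(runif(n, 300, 800)))
fake$math <- 200 + 0.7 * fake$verbal + rnorm(n, sd = 70)

# Straight-line model, then the same model with a squared term added
fit_line <- lm(math ~ verbal, data = fake)
fit_curve <- update(fit_line, . ~ . + I(verbal^2))

# The t-test on the squared term asks whether the curve improves on the line
summary(fit_curve)$coefficients

# The equivalent F-test comparing the two nested models
anova(fit_line, fit_curve)
```

On the real data this would be `update(sat.1, . ~ . + I(verbal^2))`. A non-significant squared term would back up what the residual plot suggests, that a straight line is enough.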

As for `gender`, well, I just add it to the regression and test it for significance:

```
sat.2=update(sat.1,.~.+gender)
summary(sat.2)
```

```
##
## Call:
## lm(formula = math ~ verbal + gender, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.786 -43.444 -2.023 44.512 279.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 184.58164 34.06782 5.418 2.19e-07 ***
## verbal 0.68613 0.05513 12.446 < 2e-16 ***
## genderM 37.21856 10.93993 3.402 0.000846 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.49 on 159 degrees of freedom
## Multiple R-squared: 0.5047, Adjusted R-squared: 0.4985
## F-statistic: 81.02 on 2 and 159 DF, p-value: < 2.2e-16
```

Once again, there is a strong effect of `verbal` (knowing a student’s verbal test score definitely helps to predict their math score), but there is *also* an effect of gender, with a P-value of 0.00085. If you look at the coefficient, which is labelled `genderM`, you’ll see that it is about 37, which means that for two students *with the same verbal test score* (this is the “everything else equal” thing), one of them male and the other female, the male would be expected to get 37 more points on the math test on average than the female. You can speculate as to the reasons for that, but the statistics say that it is real, and not just chance.
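To make the “everything else equal” idea concrete, here is a small worked example using the coefficients from the output. The verbal score of 600 is just a value I picked for illustration, and the variable names are my own.

```r
# Coefficients copied from the summary(sat.2) output
b_intercept <- 184.58164
b_verbal <- 0.68613
b_male <- 37.21856

# Hypothetical verbal score shared by both students
verbal_score <- 600

# genderM is an indicator: 0 for a female student, 1 for a male student
pred_female <- b_intercept + b_verbal * verbal_score
pred_male <- b_intercept + b_verbal * verbal_score + b_male

c(female = pred_female, male = pred_male, difference = pred_male - pred_female)
```

The predicted math scores are about 596 and 633. Whatever verbal score you pick, the difference is always the `genderM` coefficient, because this model gives males and females the same slope.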

`ggplot` makes it easy to get a scatterplot with the points indicated by gender, and also to put separate lines on for males and females:

```
ggplot(sat,aes(x=verbal,y=math,colour=gender))+geom_point()+
geom_smooth(method="lm")
```

The plot shows the gender effect, which is consistent over all verbal scores, but it also shows that there is an enormous amount of variability: the gender effect is very much an “on average” thing, with some females scoring a lot higher and some males scoring a lot lower than you would guess. The grey envelopes, which are confidence intervals for the mean math score for all the different verbal scores and the two genders, overlap quite a bit (even though they are “significantly different”), and the prediction intervals for individual students with those verbal scores would overlap a lot more for male and female students.
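As a rough check on how much the prediction intervals would overlap, we can compare the gender effect with an approximate prediction half-width built from the residual standard error in the output. This is only a back-of-envelope sketch: it ignores the uncertainty in the fitted line itself, and the variable names are my own.

```r
# Values copied from the summary(sat.2) output
resid_se <- 69.49          # residual standard error
resid_df <- 159            # its degrees of freedom
gender_effect <- 37.21856  # genderM coefficient

# Rough 95% prediction half-width for one student: t* times the residual SE.
# This leaves out the (smaller) uncertainty in the fitted line, so the real
# prediction interval is a little wider than this.
half_width <- qt(0.975, df = resid_df) * resid_se

c(half_width = half_width, gender_effect = gender_effect)
```

The half-width comes out at a bit under 140 points either side of the prediction, against a gender difference of about 37. For the exact intervals, `predict()` on the fitted model with `interval = "prediction"` would be the thing to use.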