# Simple Linear Regressions
First we load the tidyverse and read the "fishCatchData.csv" as a tibble.
```{r setup, include=FALSE}
library(broom)
library(tidyverse)
fish <- read_csv('fishCatchData.csv')
```
Let's plot it!
```{r plot1}
plot(fish$Length_cm, fish$Weight_g, pch=19)
# (plot(x,y) puts x on x-axis and y on y-axis)
```
Next we make our linear model, which essentially just finds the coefficients that minimize the residuals.
weight = intercept + x*length
```{r}
model <- lm(Weight_g ~ Length_cm, fish)
coef(model)
```
The summary function summarizes key aspects of the model.
```{r}
summary(model)
```
Finally, let's plot the model. We can show the fit with the geom_smooth function.
```{r}
fish %>% ggplot(aes(x = Length_cm, y = Weight_g)) +
geom_point() + geom_smooth(method = 'lm') +
theme_minimal()
```
# Categorical Independent Variables
Next we want to see if the type of fish can help us predict the weight. We need to make the "Name" a *factor*. Factors in R are variables which take on only a certain number of different values. Storing data as factors ensures the models treat such data correctly.
```{r}
tibble(fish)
summary(fish)
```
As you can see R does not automatically interpret the Names of fish as factors.
```{r}
fish <- mutate(fish, Name=factor(Name))
tibble(fish)
summary(fish)
```
We can also use the function 'str' to see the structure of the tibble. Note also that 'summary' reports the factor column differently than before.
```{r}
str(fish)
```
We can also group our data by the levels in the factors. Remember, in statistics speak, the factors contain levels.
```{r}
fish %>% group_by(Name) %>% summarize(M=mean(Weight_g), SD = sd(Weight_g))
```
The reference level is the first factor in alphanumeric order. In our case, this is 'bream'. We can change this later.
We can now see how well we can predict the weight if we know the name (and only the name).
```{r}
model_name <- lm (Weight_g ~ Name , fish)
summary(model_name)
```
What does the above mean? The Bream is the origin of a 6 dimensional space. There is Parkki dimension, Perch dimension, Pike dimension and so on. The estimates are the coefficients of the linear equation.
The values of Parkki, Perch etc can only take on values of 0 or 1.
We can only compare the rows directly to the intercept. So knowing a fish is a Parkki means its weight will be significantly less than a bream and so on.
Next let's look at Name and Length together. We get a much better fit!
```{r}
model_name_ln <- lm (Weight_g ~ Name + Length_cm, fish)
summary(model_name_ln)
```
Here we can visualize the line in 7 dimensional space projected by fish type onto 2 dimensions...
```{r}
ggplot(data=fish, aes(x=Length_cm, y=Weight_g,color=Name)) + geom_point() + geom_smooth(method='lm')
```