# Data analysis in R

1) **Read** the income dataset, “zipIncomeAssignment.csv”, into R. (You can find the CSV file in iLearn under the Content -> Week 2 folder.)

2) Change the column **names** of your data frame so that *zcta* becomes *zipCode* and *meanhouseholdincome* becomes *income*.
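As a minimal sketch of steps 1 and 2, the snippet below reads a CSV with *read.csv()* and renames the two columns by name. The rows written to the temporary file are made up for illustration and only mimic the column names given above; the variable names *csv_path* and *zip_income* are likewise assumptions. With the real data, point *read.csv()* at your copy of “zipIncomeAssignment.csv” instead.

```r
# Made-up stand-in rows; the real zipIncomeAssignment.csv is not reproduced here.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "zcta,meanhouseholdincome",
  "1,52000",
  "1,61000",
  "2,48000",
  "3,250000",
  "3,0"
), csv_path)

# Step 1: read the file into a data frame.
zip_income <- read.csv(csv_path)

# Step 2: rename by matching the old names, so column order does not matter.
names(zip_income)[names(zip_income) == "zcta"] <- "zipCode"
names(zip_income)[names(zip_income) == "meanhouseholdincome"] <- "income"

# Inspect the result: column names, types, and the first few values.
str(zip_income)
```

Matching on the old names rather than assigning *names(zip_income) <- c(...)* is a design choice here: it should still work if the real file turns out to contain additional columns.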

3) Analyze the **summary** of your data. What are the mean and median of the average incomes?
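One way to approach step 3 is sketched below on a small, hypothetical data frame (the values are invented and are not taken from the assignment's CSV). It shows *summary()* alongside *mean()* and *median()*, and how a single very large income pulls the mean above the median.

```r
# Hypothetical values, not taken from the assignment's CSV.
zip_income <- data.frame(
  zipCode = c(1, 1, 2, 2, 3),
  income  = c(40000, 50000, 60000, 70000, 280000)
)

# Minimum, quartiles, median, mean, and maximum in one call.
summary(zip_income$income)

# The same two statistics computed individually.
mean_income   <- mean(zip_income$income)
median_income <- median(zip_income$income)

# The one large value (280000) drags the mean well above the median.
mean_income > median_income
```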

4) **Plot** a scatter plot of the data. Although this graph is not too informative, do you see any outlier values? If so, what are they?

5) In order to omit outliers, create a **subset** of the data so that:

$7,000 < income < $200,000 (or, in R syntax, income > 7000 & income < 200000)
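The condition above can be applied with *subset()*, as in this minimal sketch. The data frame is hypothetical (invented values with one very low and one very high income), and the name *trimmed* is an assumption chosen for illustration.

```r
# Hypothetical values, including one very low and one very high income.
zip_income <- data.frame(
  zipCode = c(1, 1, 2, 2, 3, 3),
  income  = c(0, 40000, 50000, 60000, 70000, 280000)
)

# Keep only the rows strictly inside the (7000, 200000) range.
trimmed <- subset(zip_income, income > 7000 & income < 200000)

# Number of rows that survive the filter, and the mean of what is left.
rows_kept    <- nrow(trimmed)
trimmed_mean <- mean(trimmed$income)
```

Both comparisons are strict, so an income of exactly 7000 or 200000 would be dropped as well.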

6) What’s your new mean?

7) Create a simple **box plot** of your data. Be sure to add a title and label the axes.

HINT: Take a look at: https://www.tutorialspoint.com/r/r_boxplots.htm (specifically, Creating the Boxplot.) Instead of “mpg ~ cyl”, you want to use “income ~ zipCode”.

In the box plot you created, notice that all of the income data is pushed towards the bottom of the graph because most average incomes tend to be low. Create a new box plot where the y-axis uses a log scale. Be sure to add a title and label the axes.

For the next two questions, use the *ggplot2* library in R, which enables you to create graphs with several different types of plots layered over each other.
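For the two base-graphics box plots in step 7, a minimal sketch might look like the following. The data frame is hypothetical, with invented incomes and single-digit zip code groups (the real file may be structured differently), and the titles and axis labels are placeholders. The *log = "y"* argument of *boxplot()* is one way to get a logarithmic y-axis.

```r
# Hypothetical incomes in three made-up zip code groups.
zip_income <- data.frame(
  zipCode = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  income  = c(30000, 45000, 60000, 40000, 55000, 90000, 35000, 70000, 150000)
)

# Formula interface: one box of income values per zipCode group.
bp <- boxplot(income ~ zipCode, data = zip_income,
              main = "Average household income by zip code",
              xlab = "Zip code", ylab = "Income")

# Same plot with a logarithmic y-axis, which spreads out the low incomes.
bp_log <- boxplot(income ~ zipCode, data = zip_income, log = "y",
                  main = "Average household income by zip code (log scale)",
                  xlab = "Zip code", ylab = "Income (log scale)")
```

*boxplot()* invisibly returns the statistics it draws, so *bp$n* and *bp$names* can be used to check how many observations fell into each group.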

8) Make a *ggplot* that consists of just a scatter plot using the function *geom_point()* with position = “*jitter*” so that the data points are grouped by zip code. Be sure to use *ggplot*’s function for taking the log10 of the y-axis data. (Hint: for *geom_point*, have *alpha*=0.2).

9) Create a new *ggplot* by adding a box plot layer to your previous graph. To do this, add the *ggplot* function *geom_boxplot()*. Also, add color to the scatter plot so that data points between different zip codes are different colors. Be sure to label the axes and add a title to the graph. (Hint: for *geom_boxplot*, have *alpha*=0.1 and *outlier.size*=0).
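The layering described in steps 8 and 9 can be sketched as below, assuming the *ggplot2* package is installed. The data frame is hypothetical (invented incomes in single-digit zip code groups), the plot names *p_points* and *p_layered* are assumptions, and the title and labels are placeholders. Wrapping *zipCode* in *factor()* is a choice made here so that a numeric zip code is treated as discrete groups; *scale_y_log10()* is *ggplot2*'s log10 transformation for the y-axis.

```r
library(ggplot2)

# Hypothetical incomes in three made-up zip code groups.
zip_income <- data.frame(
  zipCode = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  income  = c(30000, 45000, 60000, 40000, 55000, 90000, 35000, 70000, 150000)
)

# Step 8: jittered scatter plot only, with a log10 y-axis.
p_points <- ggplot(zip_income, aes(x = factor(zipCode), y = income)) +
  geom_point(position = "jitter", alpha = 0.2) +
  scale_y_log10()

# Step 9: colour the points by zip code and add a box plot layer on top.
p_layered <- ggplot(zip_income, aes(x = factor(zipCode), y = income)) +
  geom_point(aes(colour = factor(zipCode)), position = "jitter", alpha = 0.2) +
  geom_boxplot(alpha = 0.1, outlier.size = 0) +
  scale_y_log10() +
  labs(title = "Average household income by zip code",
       x = "Zip code", y = "Income (log10 scale)", colour = "Zip code")
```

Neither plot is drawn until it is printed, so run *print(p_points)* or *print(p_layered)* (or type the object name at the console) to display it.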

10) What can you conclude from this data analysis/visualization?