# 1. Exercise: Price vs. x

```r
library(ggplot2)  # provides ggplot() and the diamonds data set
ggplot(aes(x = x, y = price), data = diamonds) + geom_point()
```

# 2. Findings: Price vs. x

The plot shows some outliers and an exponential relationship between price and x.
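
One way to probe the exponential shape noted above is to compare how linearly x relates to price versus log(price). The following is a minimal sketch, not part of the original exercise; the names `d`, `cor_raw`, `cor_log`, and `p_log` are illustrative, and dropping rows where x is 0 is an assumption about which outliers to exclude.

```r
library(ggplot2)  # provides the diamonds data set

# Assumption: rows with x recorded as 0 are data-entry outliers, so drop them.
d <- subset(diamonds, x > 0)

# If price grows roughly exponentially in x, log(price) should be
# closer to linear in x than price itself.
cor_raw <- cor(d$x, d$price)
cor_log <- cor(d$x, log(d$price))
print(c(raw = cor_raw, log = cor_log))

# Same scatterplot on a log10 price axis; transparency reduces overplotting.
p_log <- ggplot(d, aes(x = x, y = price)) +
  geom_point(alpha = 1/20) +
  scale_y_log10()
```

Calling `print(p_log)` draws the plot; a roughly straight band of points on the log axis would support the exponential reading.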

# 3. Exercise: Correlations

```r
# Pearson correlation of price with each of the three dimensions
cor(diamonds$x, diamonds$price)
cor(diamonds$y, diamonds$price)
cor(diamonds$z, diamonds$price)
```

# 4. Exercise: Price vs. Depth

```r
ggplot(aes(x = depth, y = price), data = diamonds) + geom_point()
```

# 8. Exercise: Price vs. Carat

Create a scatterplot of price vs carat and omit the top 1% of price and carat values.

```r
# xlim()/ylim() drop points above the 99th percentile before the linear fit
ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  xlim(0, quantile(diamonds$carat, 0.99)) +
  ylim(0, quantile(diamonds$price, 0.99)) +
  geom_smooth(method = "lm", color = "red")
```
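
`xlim()` and `ylim()` remove out-of-range points before `geom_smooth()` fits the line, and ggplot2 warns about the dropped rows. Below is a sketch of an alternative that makes the trimming explicit by filtering first. The names `carat_cut`, `price_cut`, `trimmed`, and `p_trimmed` are illustrative, and the strict `<` comparison is an assumption about how to treat values exactly at the 99th percentile.

```r
library(ggplot2)  # provides the diamonds data set

# 99th-percentile cutoffs for both variables
carat_cut <- quantile(diamonds$carat, 0.99)
price_cut <- quantile(diamonds$price, 0.99)

# Keep only diamonds strictly below both cutoffs
trimmed <- subset(diamonds, carat < carat_cut & price < price_cut)

# The linear fit now uses exactly the rows in `trimmed`
p_trimmed <- ggplot(trimmed, aes(x = carat, y = price)) +
  geom_point(alpha = 1/10) +
  geom_smooth(method = "lm", color = "red")
```

Filtering up front also makes it easy to check how many rows were dropped with `nrow(diamonds) - nrow(trimmed)`.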

# 9. Price vs. Volume

Create a scatterplot of price vs. volume (x * y * z). This is a very rough approximation of a diamond’s volume. Create a new variable for volume in the diamonds data frame.

```r
# Rough volume estimate from the three dimensions (stored in column `volumn`)
diamonds$volumn <- diamonds$x * diamonds$y * diamonds$z

ggplot(aes(x = volumn, y = price), data = diamonds) +
  geom_point() +
  xlim(0, quantile(diamonds$volumn, 0.99))
```

```r
library(plyr)
# Count how many diamonds have a volume of exactly 0
count(diamonds$volumn == 0)
```
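
The same count can be obtained without plyr, which avoids plyr masking dplyr functions such as `summarise()` if it is attached after dplyr. A base-R sketch, working on a copy `d` (a hypothetical name) so the original data frame is left untouched; `n_zero` is likewise illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Work on a copy and rebuild the volume column used above
d <- diamonds
d$volumn <- d$x * d$y * d$z

# TRUE counts as 1, so summing the comparison gives the number of zero volumes
n_zero <- sum(d$volumn == 0)
print(n_zero)

# table() shows the FALSE/TRUE split in one call
print(table(d$volumn == 0))
```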

# 11. Correlations on Subsets

```r
# Exclude diamonds with volume 0 or volume above 800, then test the correlation
with(subset(diamonds, volumn > 0 & volumn <= 800),
     cor.test(price, volumn, method = "pearson"))
```

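```
Pearson's product-moment correlation

data:  price and volumn
t = 486.33, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9008054 0.9039398
sample estimates:
      cor
0.9023845
```

Note that df = 53938 corresponds to all 53,940 rows of `diamonds`, so this output appears to come from the unfiltered data rather than from the subset.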

# 12. Exercise: Adjustments - Price vs. Volume

```r
# Transparency reduces overplotting; geom_smooth() uses its default smoother
ggplot(aes(x = volumn, y = price), data = subset(diamonds, volumn > 0 & volumn <= 800)) +
  geom_point(alpha = 1/20) + geom_smooth()
```

# 13. Exercise: Mean Price by Clarity

Use functions from the dplyr package to create a new data frame containing info on diamonds by clarity.

• Name the data frame diamondsByClarity.
• The data frame should contain the following variables in this order: (1) mean_price (2) median_price (3) min_price (4) max_price (5) n, where n is the number of diamonds in each level of clarity.

• Approach 1
```r
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))

data(diamonds)
# Group by clarity, then compute the requested summary statistics per group
diamondsByClarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            n = n()) %>%
  arrange(clarity)
```
• Approach 2
```r
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))

data(diamonds)
# Same summary without the pipe: group first, then summarise the grouped data
clarity_groups <- group_by(diamonds, clarity)
diamondsByClarity <- summarise(clarity_groups,
                               mean_price = mean(price),
                               median_price = median(price),
                               min_price = min(price),
                               max_price = max(price),
                               n = n())
```
• Both approaches produce the same data set: group by clarity, then compute the summary statistics for each group.
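
A quick way to confirm that both approaches agree is to build each result and compare them with `all.equal()`. This is a minimal sketch; `by_pipe`, `by_steps`, and `same` are illustrative names that do not appear in the exercise.

```r
suppressMessages(library(dplyr))
library(ggplot2)  # provides the diamonds data set

# Approach 1: piped version
by_pipe <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            n = n()) %>%
  arrange(clarity)

# Approach 2: explicit intermediate grouping
by_steps <- summarise(group_by(diamonds, clarity),
                      mean_price = mean(price),
                      median_price = median(price),
                      min_price = min(price),
                      max_price = max(price),
                      n = n())

# Compare as plain data frames so only the values and column names matter
same <- isTRUE(all.equal(as.data.frame(by_pipe), as.data.frame(by_steps)))
print(same)
```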

# 14. Exercise: Bar Charts of Mean Price

We’ve created summary data frames with the mean price by clarity and color. You can run the code in R to verify what data is in the variables diamonds_mp_by_clarity and diamonds_mp_by_color.

Your task is to write additional code to create two bar plots on one output image using the grid.arrange() function from the package gridExtra.

```r
# Summary data frames: mean price by clarity and by color
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))

diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
```

• Exercise:
```r
library(gridExtra)

# mean_price is already computed per level, so draw bars with geom_col()
p1 <- ggplot(aes(x = clarity, y = mean_price), data = diamonds_mp_by_clarity) + geom_col()
p2 <- ggplot(aes(x = color, y = mean_price), data = diamonds_mp_by_color) + geom_col()

grid.arrange(p1, p2, ncol = 1)
```

# 15. Exercise: Trends in Mean Price

We think something odd is going on here. These trends seem to go against our intuition. Mean price tends to decrease as clarity improves. The same can be said for color. We encourage you to look into the mean price across cut.

```r
# Mean price by cut, drawn as bars
diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mp_by_cut <- summarise(diamonds_by_cut, mean_price = mean(price))
ggplot(aes(x = cut, y = mean_price), data = diamonds_mp_by_cut) + geom_col()
```

```r
# Mean carat by cut, to check whether size differences accompany the price trend
diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mc_by_cut <- summarise(diamonds_by_cut, mean_carat = mean(carat))
ggplot(aes(x = cut, y = mean_carat), data = diamonds_mc_by_cut) + geom_col()
```
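
Since the two plots above look at mean price and mean carat separately, it can help to put both in one table. The sketch below assumes that mean price per carat is a reasonable way to adjust for size; this normalisation is an assumption, not part of the exercise, and `cut_summary` is an illustrative name.

```r
suppressMessages(library(dplyr))
library(ggplot2)  # provides the diamonds data set

# One row per cut: raw mean price, mean size, and a size-adjusted price
cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price),
            mean_carat = mean(carat),
            mean_price_per_carat = mean(price / carat))

print(cut_summary)
```

Reading the three columns side by side shows whether cuts with a higher mean price also have a higher mean carat.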

# 16. Exercise: Gapminder Revisited

The Gapminder website contains over 500 data sets with information about the world’s population. Your task is to continue the investigation you did at the end of Problem Set 3 or you can start fresh and choose a different data set from Gapminder.

If you’re feeling adventurous or want to try some data munging see if you can find a data set or scrape one from the web.

In your investigation, examine pairs of variables and create 2-5 plots that make use of the techniques from Lesson 4.

You can find a link to the Gapminder website in the Instructor Notes.

Once you’ve completed your investigation, create a post in the discussions that includes:

1. the variable(s) you investigated, your observations, and any summary statistics
2. snippets of code that created the plots
(Omitted)