有了绝对定量就完了？

Nov 28, 2017 5 min read 读文献

微生物组研究走向绝对定量

Jeroen Raes研究组上周在Nature发表文章，使用Flow cytometry估算粪便中的微生物细胞数量，配合16S测序算出的相对丰度，估算出OTU的绝对数量（absolute abundance）。文章很多结论都印证了直接应用相对丰度进行分析时，我们所看到的很多现象是成分数据性质的假象（可以想象，由于相对丰度在每一个样本中相加为1或100，一个OTU相对丰度增加必将引起其他OTU相对丰度减少，所谓的compositional bias）。特别针对于计算两个OTU的相关系数，当OTU分布不均匀时，很容易看到负相关的OTU–而这仅仅是因为它们受到相加为常数的限制而已。另一个典型例子就是主成分分析（PCA），PCA意在保持欧式距离不变的情况下对数据进行变换，但是相对丰度其实不在欧式空间中（可以参考:J. Aitchison, The Statistical Analysis of Compositional Data, 1986.），这就是为什么在微生物组的研究中更多采用生态学的距离（如，Bray-Curtis distance）来计算 $β$ -diversity，然后进行基于距离矩阵的分析（PCoA）。

似乎文章的take home message很简单了，微生物组的研究，我们应该使用类似的方法进行绝对定量。可是仔细想一想，文章指出粪便微生物总量的个体差异可以达到10倍之多，这样大的差异，如果某种微生物在个体之间差异很小，转换成绝对数量之后，个体间的差异将受制于微生物总量。

使用Flow cytometry测定的肠道微生物总量的variation有多大？

下载Nature文章中的Supplementary Table，这里并不不需要购买文章阅读权限，其中表6是Flow cytometry的数据。

加载R包，ggplot2主题

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0

## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggplot2':
## 
##     ggsave

library(readxl)

figtheme <- theme_bw() + 
  theme(text = element_text(size=10,face='bold'),panel.border  = element_rect(colour = "black",size=2),
        axis.title.y=element_text(margin=margin(0,15,0,0)),axis.title.x=element_text(margin=margin(15,0,0,0)),
        plot.margin = unit(c(1,1,1,1), "cm"),
        plot.title = element_text(margin=margin(0,0,15,0)))
theme_set(figtheme)

读取表6，并预处理

cell_counts_dat <- read_excel('~/nature24460-s2.xlsx', sheet=6) %>%
  mutate(Cohort=gsub('[0-9]+','',Individual)) %>%
  mutate(Cell_count_avg=`Average cell count (per gram of frozen feces)`)

## Warning in strptime(x, format, tz = tz): unknown timezone 'zone/tz/2017c.
## 1.0/zoneinfo/Asia/Singapore'

cell_counts_dat

## # A tibble: 321 x 10
##    Individual Cohort   Day `Health status`
##         <chr>  <chr> <dbl>           <chr>
##  1       SC01     SC     1         Healthy
##  2       SC02     SC     1         Healthy
##  3       SC03     SC     1         Healthy
##  4       SC04     SC     1         Healthy
##  5       SC05     SC     1         Healthy
##  6       SC06     SC     1         Healthy
##  7       SC07     SC     1         Healthy
##  8       SC08     SC     1         Healthy
##  9       SC09     SC     1         Healthy
## 10       SC10     SC     1         Healthy
## # ... with 311 more rows, and 6 more variables: `Average cell count (per
## #   gram of fresh feces)` <chr>, `STDEV cell count (per gram of fresh
## #   feces)` <chr>, `Average cell count (per gram of frozen feces)` <dbl>,
## #   `STDEV cell count (per gram of frozen feces)` <chr>, Enterotype <chr>,
## #   Cell_count_avg <dbl>

微生物总量差异

个体差异

aggregate(data=cell_counts_dat, Cell_count_avg~Cohort, function(x) c(Mean=mean(x), SD=sd(x), CV=sd(x)/mean(x)) )

##   Cohort Cell_count_avg.Mean Cell_count_avg.SD Cell_count_avg.CV
## 1     DC        9.092967e+10      6.344747e+10      6.977642e-01
## 2     LC        1.249278e+11      5.406697e+10      4.327858e-01
## 3     SC        1.528339e+11      7.018824e+10      4.592451e-01
## 4     VC        1.447904e+11      7.630896e+10      5.270304e-01

ggplot(cell_counts_dat, aes(x=Cohort, y=Cell_count_avg)) + geom_boxplot()

个体内差异

aggregate(data=cell_counts_dat[cell_counts_dat$Cohort=='LC',], Cell_count_avg~Individual, function(x) c(Mean=mean(x), SD=sd(x), CV=sd(x)/mean(x)) )

##    Individual Cell_count_avg.Mean Cell_count_avg.SD Cell_count_avg.CV
## 1        LC01        1.121069e+11      3.062958e+10      2.732175e-01
## 2        LC02        1.057725e+11      1.457955e+10      1.378387e-01
## 3        LC03        6.998669e+10      3.207140e+10      4.582500e-01
## 4        LC04        6.433548e+10      1.478855e+10      2.298661e-01
## 5        LC05        2.198245e+11      2.726579e+10      1.240343e-01
## 6        LC06        2.054927e+11      4.951278e+10      2.409466e-01
## 7        LC07        1.197242e+11      3.438372e+10      2.871911e-01
## 8        LC08        1.162890e+11      3.286685e+10      2.826307e-01
## 9        LC09        1.501435e+11      6.018554e+10      4.008535e-01
## 10       LC10        1.790086e+11      8.293900e+10      4.633241e-01
## 11       LC11        1.565348e+11      4.038210e+10      2.579752e-01
## 12       LC12        1.247153e+11      2.494971e+10      2.000533e-01
## 13       LC13        9.804859e+10      2.625298e+10      2.677548e-01
## 14       LC14        1.037225e+11      4.329933e+10      4.174534e-01
## 15       LC15        1.592359e+11      2.337419e+10      1.467897e-01
## 16       LC16        1.171450e+11      6.207212e+10      5.298741e-01
## 17       LC17        1.215423e+11      3.782074e+10      3.111736e-01
## 18       LC18        1.042144e+11      4.236728e+10      4.065395e-01
## 19       LC19        7.013542e+10      2.326036e+10      3.316492e-01
## 20       LC20        1.465100e+11      2.763866e+10      1.886470e-01

ggplot(cell_counts_dat[cell_counts_dat$Cohort=='LC',], aes(x=Day, y=Cell_count_avg)) + geom_line() + facet_wrap(~Individual)

容易看出，疾病组（DC）的微生物总量最低，其他几个健康组微生物总量差异不大，而个人的微生物总量在一周内的变化要小一些（CV~30%），但其实比起个体间差异不是小很多。

人的肠道微生物相对丰度的variation有多大？

下载人类微生物组项目（HMP）数据

我们使用HMPv35的16S的OTU table和元数据表(mapping file)。下载之后解压。

清洗数据，保留肠道数据并去掉整行为0的OTU

metadata <- fread('~/v35_map_uniquebyPSN.txt')
stoolIDs <- as.character((metadata %>% filter(HMPbodysubsite=='Stool'))[,1])

otutab <- fread('~/otu_table_psn_v35.txt',head=TRUE, select=stoolIDs,sep='\t')

## Warning in fread("~/otu_table_psn_v35.txt", head = TRUE, select =
## stoolIDs, : Starting data input on line 2 and discarding line 1 because it
## has too few or too many items to be column names or data: # QIIME v1.3.0-
## dev OTU table

## Warning in fread("~/otu_table_psn_v35.txt", head = TRUE, select =
## stoolIDs, : Column name '700107040' not found in column name header (case
## sensitive), skipping.

## Warning in fread("~/otu_table_psn_v35.txt", head = TRUE, select =
## stoolIDs, : Column name '700114489' not found in column name header (case
## sensitive), skipping.

## 
Read 44.1% of 45383 rows
Read 45383 rows and 319 (of 4790) columns from 0.409 GB file in 00:00:04

## keep samples with counts > 2000
otutab <- otutab %>% select(which(colSums(otutab)> 2000 ))
## we down sample (rarefy) the samples to have 2000 counts in total
otutab_rare <- apply(otutab, 2, function(x) rmultinom(1,size=2000, prob=x))
## remove OTUs not present at all and normalize to proportions
otutab_rel <- apply(otutab_rare[rowSums(otutab_rare)!=0,], 2, function(x)x/sum(x))

SD, CV, Mean

SD <- apply(otutab_rel,1,sd)
Mean <- apply(otutab_rel, 1, mean)
tmp <- data.frame(CV=SD/Mean, Mean, SD)
g1 <- ggplot(tmp, aes(y=SD, x=Mean)) + 
  geom_point(alpha=0.1) + 
  scale_x_log10()  + scale_y_log10() + 
  annotation_logticks(sides = 'lb')
g2 <- ggplot(tmp, aes(y=CV, x=Mean)) + 
  geom_point(alpha=0.1) + 
  scale_x_log10()   + 
  annotation_logticks(sides = 'b')
plot_grid(g1, g2, nrow=1)

从CV来看，HMP的数据相对丰度其实在个体间差距比较大（特别是低丰度的OTU，可以想象这是跟0的数量多有关），所以可能大多数情况下，我们不会受到总量变化（CV~50%）的影响。Note: 还是需要注意在我们的研究数据中会不会

绝对定量的分析，A thought experiment

正态分布的随机变量（ $M$ ， $X$ ）分别代表微生物总量（ $M \sim | N o r m a l (10^{11}, C V_{m} \times 10^{11} |)$ ），微生物 $x$ 的相对丰度（ $X \sim | N o r m a l (10^{- 1}, C V_{x} \times 10^{- 1}) |$ ），那么绝对数量 $M X$ 会与 $M$ 相关吗？

M <- abs(rnorm(1000, 1e11, 0.5e11)) ## CV=0.5
X <- sapply(c(0.1,0.5,1,5,10), function(x)abs(rnorm(1000, 0.1, x*0.1)))   
colnames(X) <- c('X_cv0.1','X_cv0.5','X_cv1', 'X_cv5', 'X_cv10')
dat <- melt(X)%>% mutate(MX=value*M) %>% mutate(M=rep(M,5))

dat %>% group_by(Var2) %>% summarise(cor(MX, M, method='spearman'))

## # A tibble: 5 x 2
##      Var2 `cor(MX, M, method = "spearman")`
##    <fctr>                             <dbl>
## 1 X_cv0.1                         0.9745780
## 2 X_cv0.5                         0.6241846
## 3   X_cv1                         0.4850720
## 4   X_cv5                         0.5014045
## 5  X_cv10                         0.4692217

ggplot(dat,aes(x=M,y=MX)) + 
  geom_point(alpha=0.4) + 
  facet_wrap(~Var2, nrow=2, scale='free_y')

所以不难看出来微生物的绝对数量与微生物总量相关，特别是当相对丰度差异不大的时候。So what？记得correlation有传递性吗？所以如果有两个不相关的微生物，它们的绝对数量会不会因为总量而变得相关呢？我们根据 $X$ 的分布生成另一个独立的随机变量 $Y$ ，让我们来看下他们的correlation是怎样的。

Y <- sapply(c(0.1,0.5,1,5,10), function(x)abs(rnorm(1000, 0.1, x*0.1))) 
colnames(Y) <- c('Y_cv0.1','Y_cv0.5','Y_cv1', 'Y_cv5', 'Y_cv10')
dat <- dat %>% mutate(Y=melt(Y)$value) %>% mutate(MY=M*Y)

毫无关联的 $X$ , $Y$

dat %>% group_by(Var2) %>% summarise(cor(value, Y, method='spearman'))

## # A tibble: 5 x 2
##      Var2 `cor(value, Y, method = "spearman")`
##    <fctr>                                <dbl>
## 1 X_cv0.1                          0.039534460
## 2 X_cv0.5                          0.025382017
## 3   X_cv1                          0.039640060
## 4   X_cv5                          0.015882484
## 5  X_cv10                          0.002462222

转化成绝对数量后

dat %>% group_by(Var2) %>% 
  summarise(Spearman_rho=cor(MX, MY, method='spearman'),
            p_value=cor.test(MX, MY, method='spearman')$p.value)

## # A tibble: 5 x 3
##      Var2 Spearman_rho      p_value
##    <fctr>        <dbl>        <dbl>
## 1 X_cv0.1    0.9509044 0.000000e+00
## 2 X_cv0.5    0.4455341 0.000000e+00
## 3   X_cv1    0.3194627 0.000000e+00
## 4   X_cv5    0.2866312 2.842296e-20
## 5  X_cv10    0.2930657 3.241526e-21

散点图

ggplot(dat,aes(x=MX,y=MY)) + 
  geom_point(alpha=0.4) + 
  facet_wrap(~Var2, nrow=2, scale='free')