Example usage of the `clusimp` package
clusimp.RmdIntroduction
The clusimp package implements an externally validated
clustering algorithm assigning people with heart failure with reduced
ejection fraction (HFrEF) into 3 groups with distinct rates of disease
progression. Specifically, people with HFrEF are grouped based on the 5K
SomaLogic SomaScan assay measuring plasma proteins values. More detailed
information on the proteins measured can be found in protein_info.tsv.
This vignette explains how to use the principal functions
clusimp and idep of the clusimp
package.
The data
As an illustrative example we have included a simulated dataset of 50 individuals with protein measurements. When using your own data please make sure the input data uses the exact same column names as the example data, which can be seen below and loaded from here. The protein names required in the dataset are also listed in protein_info.tsv.
library(clusimp)
# load the data
data(simulated_proteins)
# look at the data
dim(simulated_proteins)[1] 50 4211
pander(simulated_proteins[1:10, 1:10])| ID | CRYBB2.10000.28 | RAF1.10001.7 | ZNF41.10003.15 | ELK1.10006.25 |
|---|---|---|---|---|
| iid_1 | 1190 | 791 | 115.9 | 738.6 |
| iid_2 | 514.9 | 890.7 | 175.5 | 5210 |
| iid_3 | 481 | 209.1 | 183.6 | -440.8 |
| iid_4 | 412.5 | 559.3 | 151.1 | 335.3 |
| iid_5 | 506.7 | 552.6 | 61.25 | 4955 |
| iid_6 | 243.8 | 629 | 47.59 | 1099 |
| iid_7 | 683.5 | 476.8 | 108.8 | 541.1 |
| iid_8 | 1367 | 891.4 | 91.78 | 606.1 |
| iid_9 | 718.3 | 460.6 | 85.31 | 2523 |
| iid_10 | 765 | 528.7 | 165 | 756.9 |
| GUCA1A.10008.43 | OCRL.10011.65 | SPDEF.10012.5 | SNAI2.10014.31 |
|---|---|---|---|
| 525.3 | -60.79 | 1232 | 876.1 |
| 621.3 | 2223 | 1443 | 1295 |
| 471.8 | 2353 | 1180 | 1231 |
| -147.8 | 2425 | 1321 | 839.9 |
| 486.8 | 2804 | 1355 | 1076 |
| -15.1 | 2489 | 1584 | 947.7 |
| 422.3 | 1873 | 2347 | 781 |
| 586.2 | 2280 | 2361 | 1261 |
| 591 | 2736 | 1542 | 1399 |
| 398.5 | 7118 | 1798 | 846.4 |
| KCNAB2.10015.119 |
|---|
| 642.5 |
| 750.1 |
| 1012 |
| 648.4 |
| 456.1 |
| 696.8 |
| 650.8 |
| 555.8 |
| 119.8 |
| 729.6 |
Implement the clustering model
To assign the HFrEF patients to clusters, we apply the
clusimp function, which requires all protein measurements.
It normalises the measurements based on the mean and standard deviation,
performs principal component analysis, and uses these to cluster the
individuals. More detail is given in the accompanying paper [AVAILABLE
UPON PUBLICATION].
By default, the cluster model is implemented using means and standard
deviation from the derivation data. If there is more than one input row
the newly supplied data can be used for normalisation. Similarly, before
applying the clustering model the principal components will be
categorised into quartiles. This can be done on the derivation cut-offs
or on the quartiles of the newly supplied data. This is implemented by
parameters local_stand and local_cut. For the
derivation means and standard deviation, as well the quartile
cut-points, please refer to protein_info.tsv
and protein_cat_bounds.tsv.
res <- clusimp(df = simulated_proteins, id = "ID", column = "Predicted class",
local_stand = FALSE, local_cut = FALSE)The output consists of a dataframe containing the original data, the
principal components, the standardised data, the categorised principal
components and the predicted class for each individual. The predicted
class membership is saved in the column Predicted class,
specified in the column parameter.
| ID | CRYBB2.10000.28 | RAF1.10001.7 | Predicted class |
|---|---|---|---|
| iid_1 | 1190 | 791 | 3 |
| iid_10 | 765 | 528.7 | 3 |
| iid_11 | 459.5 | 1145 | 2 |
| iid_12 | 562.1 | 960.6 | 2 |
| iid_13 | 2440 | 1211 | 2 |
| iid_14 | 1423 | 638.8 | 2 |
| iid_15 | 1527 | 651 | 2 |
| iid_16 | 665.3 | 699.1 | 1 |
| iid_17 | 268.2 | 749.4 | 2 |
| iid_18 | 1521 | 651.5 | 3 |
Extracting differentially expressed proteins
If you have information on more than a single participants,
preferably from 50 or more, one can also identify differentially
expressed proteins, using the idep function. In this
function the data is pre-processed (normalised when
norm = TRUE) and the lmFit function of
limma is applied. This is illustrated below:
# define proteins to be tested
incl <- names(simulated_proteins)[!names(simulated_proteins) %in% c("ID", "Predicted class")]
# identify the differentially expressed proteins
deps <- idep(df = res, id = "ID", clus_col = "Predicted class", pval = 0.05,
logf = 1, prot_col = incl, norm = TRUE)Error in get(paste0(generic, “.”, class), envir = get_method_env()) : object ‘type_sum.accel’ not found
# check results
pander(deps$feptable)| clus_1.clus_2 | clus_1.clus_3 | clus_2.clus_3 | AveExpr | |
|---|---|---|---|---|
| CNDP1.5456.59 | 0.2168 | 1.015 | 0.7979 | 1.868 |
| TMPO.8265.225 | -0.753 | -1.188 | -0.4345 | 0.7969 |
| GHR.2948.58 | 0.3859 | 1.088 | 0.7019 | -0.8448 |
| DLK2.9359.9 | -0.3226 | -1.019 | -0.6967 | 0.2235 |
| CRIP1.18275.5 | 0.4227 | 1.011 | 0.5881 | 1.282 |
| GM2A.15441.6 | -0.1915 | -1.147 | -0.9553 | -0.1645 |
| BTN3A3.17692.2 | -0.5456 | -1.111 | -0.5651 | -0.1786 |
| WFDC2.11388.75 | -0.714 | -1.102 | -0.3877 | 1.72 |
| NBL1.2944.66 | -0.4257 | -1.182 | -0.7559 | -0.01545 |
| PDGFA.4499.21 | -1.386 | -1.208 | 0.1787 | 1.506 |
| F | P.Value | adj.P.Val | |
|---|---|---|---|
| CNDP1.5456.59 | 22.66 | 9.689e-08 | 8.765e-05 |
| TMPO.8265.225 | 21.34 | 1.953e-07 | 8.765e-05 |
| GHR.2948.58 | 20.21 | 3.622e-07 | 8.765e-05 |
| DLK2.9359.9 | 16.19 | 3.733e-06 | 0.0005232 |
| CRIP1.18275.5 | 15.69 | 5.092e-06 | 0.0006627 |
| GM2A.15441.6 | 12.96 | 2.887e-05 | 0.002252 |
| BTN3A3.17692.2 | 11.28 | 8.984e-05 | 0.00496 |
| WFDC2.11388.75 | 11.06 | 0.0001045 | 0.005602 |
| NBL1.2944.66 | 10.68 | 0.0001361 | 0.006703 |
| PDGFA.4499.21 | 8.359 | 0.0007347 | 0.01834 |
The output represents the values for each contrast, the average expression, test-statistic and (adjusted) p-values for each protein.