Skip to contents

Introduction

The clusimp package implements an externally validated clustering algorithm assigning people with heart failure with reduced ejection fraction (HFrEF) into 3 groups with distinct rates of disease progression. Specifically, people with HFrEF are grouped based on the 5K SomaLogic SomaScan assay measuring plasma proteins values. More detailed information on the proteins measured can be found in protein_info.tsv.

This vignette explains how to use the principal functions clusimp and idep of the clusimp package.

The data

As an illustrative example we have included a simulated dataset of 50 individuals with protein measurements. When using your own data please make sure the input data uses the exact same column names as the example data, which can be seen below and loaded from here. The protein names required in the dataset are also listed in protein_info.tsv.

library(clusimp)

# load the data
data(simulated_proteins)

# look at the data
dim(simulated_proteins)

[1] 50 4211

pander(simulated_proteins[1:10, 1:10])
Table continues below
ID CRYBB2.10000.28 RAF1.10001.7 ZNF41.10003.15 ELK1.10006.25
iid_1 1190 791 115.9 738.6
iid_2 514.9 890.7 175.5 5210
iid_3 481 209.1 183.6 -440.8
iid_4 412.5 559.3 151.1 335.3
iid_5 506.7 552.6 61.25 4955
iid_6 243.8 629 47.59 1099
iid_7 683.5 476.8 108.8 541.1
iid_8 1367 891.4 91.78 606.1
iid_9 718.3 460.6 85.31 2523
iid_10 765 528.7 165 756.9
Table continues below
GUCA1A.10008.43 OCRL.10011.65 SPDEF.10012.5 SNAI2.10014.31
525.3 -60.79 1232 876.1
621.3 2223 1443 1295
471.8 2353 1180 1231
-147.8 2425 1321 839.9
486.8 2804 1355 1076
-15.1 2489 1584 947.7
422.3 1873 2347 781
586.2 2280 2361 1261
591 2736 1542 1399
398.5 7118 1798 846.4
KCNAB2.10015.119
642.5
750.1
1012
648.4
456.1
696.8
650.8
555.8
119.8
729.6

Implement the clustering model

To assign the HFrEF patients to clusters, we apply the clusimp function, which requires all protein measurements. It normalises the measurements based on the mean and standard deviation, performs principal component analysis, and uses these to cluster the individuals. More detail is given in the accompanying paper [AVAILABLE UPON PUBLICATION].

By default, the cluster model is implemented using means and standard deviation from the derivation data. If there is more than one input row the newly supplied data can be used for normalisation. Similarly, before applying the clustering model the principal components will be categorised into quartiles. This can be done on the derivation cut-offs or on the quartiles of the newly supplied data. This is implemented by parameters local_stand and local_cut. For the derivation means and standard deviation, as well the quartile cut-points, please refer to protein_info.tsv and protein_cat_bounds.tsv.

res <- clusimp(df = simulated_proteins, id = "ID", column = "Predicted class",
               local_stand = FALSE, local_cut = FALSE)

The output consists of a dataframe containing the original data, the principal components, the standardised data, the categorised principal components and the predicted class for each individual. The predicted class membership is saved in the column Predicted class, specified in the column parameter.

pander(res[1:10, c(names(res)[1:3], "Predicted class")])
ID CRYBB2.10000.28 RAF1.10001.7 Predicted class
iid_1 1190 791 3
iid_10 765 528.7 3
iid_11 459.5 1145 2
iid_12 562.1 960.6 2
iid_13 2440 1211 2
iid_14 1423 638.8 2
iid_15 1527 651 2
iid_16 665.3 699.1 1
iid_17 268.2 749.4 2
iid_18 1521 651.5 3

Extracting differentially expressed proteins

If you have information on more than a single participants, preferably from 50 or more, one can also identify differentially expressed proteins, using the idep function. In this function the data is pre-processed (normalised when norm = TRUE) and the lmFit function of limma is applied. This is illustrated below:

# define proteins to be tested
incl <- names(simulated_proteins)[!names(simulated_proteins) %in% c("ID", "Predicted class")]

# identify the differentially expressed proteins
deps <- idep(df = res, id = "ID", clus_col = "Predicted class", pval = 0.05,
             logf = 1, prot_col = incl, norm = TRUE)

Error in get(paste0(generic, “.”, class), envir = get_method_env()) : object ‘type_sum.accel’ not found

# check results
pander(deps$feptable)
Table continues below
  clus_1.clus_2 clus_1.clus_3 clus_2.clus_3 AveExpr
CNDP1.5456.59 0.2168 1.015 0.7979 1.868
TMPO.8265.225 -0.753 -1.188 -0.4345 0.7969
GHR.2948.58 0.3859 1.088 0.7019 -0.8448
DLK2.9359.9 -0.3226 -1.019 -0.6967 0.2235
CRIP1.18275.5 0.4227 1.011 0.5881 1.282
GM2A.15441.6 -0.1915 -1.147 -0.9553 -0.1645
BTN3A3.17692.2 -0.5456 -1.111 -0.5651 -0.1786
WFDC2.11388.75 -0.714 -1.102 -0.3877 1.72
NBL1.2944.66 -0.4257 -1.182 -0.7559 -0.01545
PDGFA.4499.21 -1.386 -1.208 0.1787 1.506
  F P.Value adj.P.Val
CNDP1.5456.59 22.66 9.689e-08 8.765e-05
TMPO.8265.225 21.34 1.953e-07 8.765e-05
GHR.2948.58 20.21 3.622e-07 8.765e-05
DLK2.9359.9 16.19 3.733e-06 0.0005232
CRIP1.18275.5 15.69 5.092e-06 0.0006627
GM2A.15441.6 12.96 2.887e-05 0.002252
BTN3A3.17692.2 11.28 8.984e-05 0.00496
WFDC2.11388.75 11.06 0.0001045 0.005602
NBL1.2944.66 10.68 0.0001361 0.006703
PDGFA.4499.21 8.359 0.0007347 0.01834

The output represents the values for each contrast, the average expression, test-statistic and (adjusted) p-values for each protein.