Example usage of the `clusimp` package • clusimp

Introduction

The clusimp package implements an externally validated clustering algorithm assigning people with heart failure with reduced ejection fraction (HFrEF) into 3 groups with distinct rates of disease progression. Specifically, people with HFrEF are grouped based on the 5K SomaLogic SomaScan assay measuring plasma proteins values. More detailed information on the proteins measured can be found in protein_info.tsv.

This vignette explains how to use the principal functions clusimp and idep of the clusimp package.

The data

As an illustrative example we have included a simulated dataset of 50 individuals with protein measurements. When using your own data please make sure the input data uses the exact same column names as the example data, which can be seen below and loaded from here. The protein names required in the dataset are also listed in protein_info.tsv.

library(clusimp)

# load the data
data(simulated_proteins)

# look at the data
dim(simulated_proteins)

[1] 50 4211

pander(simulated_proteins[1:10, 1:10])

Table continues below
ID	CRYBB2.10000.28	RAF1.10001.7	ZNF41.10003.15	ELK1.10006.25
iid_1	1190	791	115.9	738.6
iid_2	514.9	890.7	175.5	5210
iid_3	481	209.1	183.6	-440.8
iid_4	412.5	559.3	151.1	335.3
iid_5	506.7	552.6	61.25	4955
iid_6	243.8	629	47.59	1099
iid_7	683.5	476.8	108.8	541.1
iid_8	1367	891.4	91.78	606.1
iid_9	718.3	460.6	85.31	2523
iid_10	765	528.7	165	756.9

Table continues below
GUCA1A.10008.43	OCRL.10011.65	SPDEF.10012.5	SNAI2.10014.31
525.3	-60.79	1232	876.1
621.3	2223	1443	1295
471.8	2353	1180	1231
-147.8	2425	1321	839.9
486.8	2804	1355	1076
-15.1	2489	1584	947.7
422.3	1873	2347	781
586.2	2280	2361	1261
591	2736	1542	1399
398.5	7118	1798	846.4

KCNAB2.10015.119
642.5
750.1
1012
648.4
456.1
696.8
650.8
555.8
119.8
729.6

Implement the clustering model

To assign the HFrEF patients to clusters, we apply the clusimp function, which requires all protein measurements. It normalises the measurements based on the mean and standard deviation, performs principal component analysis, and uses these to cluster the individuals. More detail is given in the accompanying paper [AVAILABLE UPON PUBLICATION].

By default, the cluster model is implemented using means and standard deviation from the derivation data. If there is more than one input row the newly supplied data can be used for normalisation. Similarly, before applying the clustering model the principal components will be categorised into quartiles. This can be done on the derivation cut-offs or on the quartiles of the newly supplied data. This is implemented by parameters local_stand and local_cut. For the derivation means and standard deviation, as well the quartile cut-points, please refer to protein_info.tsv and protein_cat_bounds.tsv.

res <- clusimp(df = simulated_proteins, id = "ID", column = "Predicted class",
               local_stand = FALSE, local_cut = FALSE)

The output consists of a dataframe containing the original data, the principal components, the standardised data, the categorised principal components and the predicted class for each individual. The predicted class membership is saved in the column Predicted class, specified in the column parameter.

pander(res[1:10, c(names(res)[1:3], "Predicted class")])

ID	CRYBB2.10000.28	RAF1.10001.7	Predicted class
iid_1	1190	791	3
iid_10	765	528.7	3
iid_11	459.5	1145	2
iid_12	562.1	960.6	2
iid_13	2440	1211	2
iid_14	1423	638.8	2
iid_15	1527	651	2
iid_16	665.3	699.1	1
iid_17	268.2	749.4	2
iid_18	1521	651.5	3

Extracting differentially expressed proteins

If you have information on more than a single participants, preferably from 50 or more, one can also identify differentially expressed proteins, using the idep function. In this function the data is pre-processed (normalised when norm = TRUE) and the lmFit function of limma is applied. This is illustrated below:

# define proteins to be tested
incl <- names(simulated_proteins)[!names(simulated_proteins) %in% c("ID", "Predicted class")]

# identify the differentially expressed proteins
deps <- idep(df = res, id = "ID", clus_col = "Predicted class", pval = 0.05,
             logf = 1, prot_col = incl, norm = TRUE)

Error in get(paste0(generic, “.”, class), envir = get_method_env()) : object ‘type_sum.accel’ not found

# check results
pander(deps$feptable)

Table continues below
	clus_1.clus_2	clus_1.clus_3	clus_2.clus_3	AveExpr
CNDP1.5456.59	0.2168	1.015	0.7979	1.868
TMPO.8265.225	-0.753	-1.188	-0.4345	0.7969
GHR.2948.58	0.3859	1.088	0.7019	-0.8448
DLK2.9359.9	-0.3226	-1.019	-0.6967	0.2235
CRIP1.18275.5	0.4227	1.011	0.5881	1.282
GM2A.15441.6	-0.1915	-1.147	-0.9553	-0.1645
BTN3A3.17692.2	-0.5456	-1.111	-0.5651	-0.1786
WFDC2.11388.75	-0.714	-1.102	-0.3877	1.72
NBL1.2944.66	-0.4257	-1.182	-0.7559	-0.01545
PDGFA.4499.21	-1.386	-1.208	0.1787	1.506

	F	P.Value	adj.P.Val
CNDP1.5456.59	22.66	9.689e-08	8.765e-05
TMPO.8265.225	21.34	1.953e-07	8.765e-05
GHR.2948.58	20.21	3.622e-07	8.765e-05
DLK2.9359.9	16.19	3.733e-06	0.0005232
CRIP1.18275.5	15.69	5.092e-06	0.0006627
GM2A.15441.6	12.96	2.887e-05	0.002252
BTN3A3.17692.2	11.28	8.984e-05	0.00496
WFDC2.11388.75	11.06	0.0001045	0.005602
NBL1.2944.66	10.68	0.0001361	0.006703
PDGFA.4499.21	8.359	0.0007347	0.01834

The output represents the values for each contrast, the average expression, test-statistic and (adjusted) p-values for each protein.