| Title: | Kernel Partial Correlation Coefficient |
|---|---|
| Description: | Implementations of two empirical versions the kernel partial correlation (KPC) coefficient and the associated variable selection algorithms. KPC is a measure of the strength of conditional association between Y and Z given X, with X, Y, Z being random variables taking values in general topological spaces. As the name suggests, KPC is defined in terms of kernels on reproducing kernel Hilbert spaces (RKHSs). The population KPC is a deterministic number between 0 and 1; it is 0 if and only if Y is conditionally independent of Z given X, and it is 1 if and only if Y is a measurable function of Z and X. One empirical KPC estimator is based on geometric graphs, such as K-nearest neighbor graphs and minimum spanning trees, and is consistent under very weak conditions. The other empirical estimator, defined using conditional mean embeddings (CMEs) as used in the RKHS literature, is also consistent under suitable conditions. Using KPC, a stepwise forward variable selection algorithm KFOCI (using the graph based estimator of KPC) is provided, as well as a similar stepwise forward selection algorithm based on the RKHS based estimator. For more details on KPC, its empirical estimators and its application on variable selection, see Huang, Z., N. Deb, and B. Sen (2022). “Kernel partial correlation coefficient – a measure of conditional dependence” (URL listed below). When X is empty, KPC measures the unconditional dependence between Y and Z, which has been described in Deb, N., P. Ghosal, and B. Sen (2020), “Measuring association on topological spaces using kernels and geometric graphs” <doi:10.48550/arXiv.2010.01768>, and it is implemented in the functions KMAc() and Klin() in this package. The latter can be computed in near linear time. |
| Authors: | Zhen Huang [aut, cre], Nabarun Deb [ctb], Bodhisattva Sen [ctb], Benjamin Lang [ctb] |
| Maintainer: | Zhen Huang <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.3 |
| Built: | 2026-06-07 06:12:02 UTC |
| Source: | https://github.com/zh2395/kpc |
A dataset containing 9 variables, consists of the voting results earned by the top five candidates from 250 electoral districts in Korea.
ElecDataElecData
A data frame with 1250 rows and 9 variables:
250 precinct codes designated by the election committee (4 digits)
250 city codes of administrative standard code management system (5 digits)
Symbols 1-5, corresponding to Moon Jae-in, Hong Jun-pyo, Ahn Cheol-soo, Yoo Seung-min, Shim Sang-jung
Average age of voters in 17 years: statistics on resident registration population of the Ministry of Government Administration and Home Affairs
Average number of years of education for voters
Average price per square meter in 17 years
The average insurance premium for each city, county, district
Vote rate by candidate
Number of votes by candidate
https://github.com/OhmyNews/2017-Election
Variable selection with KPC using directed K-NN graph or minimum spanning tree (MST)
KFOCI( Y, X, Z = NULL, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = min(ceiling(NROW(Y)/20), 20), num_features = NULL, stop = TRUE, numCores = parallel::detectCores(), verbose = FALSE )KFOCI( Y, X, Z = NULL, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = min(ceiling(NROW(Y)/20), 20), num_features = NULL, stop = TRUE, numCores = parallel::detectCores(), verbose = FALSE )
Y |
a matrix of responses (n by dy). |
X |
a matrix of predictors (n by dx). |
Z |
Integer vector of column indices in |
k |
a function |
Knn |
a positive integer indicating the number of nearest neighbor; or "MST". The suggested choice of Knn is 0.05n for samples up to a few hundred observations. For large n, the suggested Knn is sublinear in n. That is, it may grow slower than any linear function of n. The computing time is approximately linear in Knn. A smaller Knn takes less time. |
num_features |
the number of variables to be selected from the non-pre-conditioned set, cannot be larger than |
stop |
If |
numCores |
number of cores that are going to be used for parallelizing the process. |
verbose |
whether to print each selected variables during the forward stepwise algorithm |
A stepwise forward selection of variables using KPC. At each step it selects the that maximizes
selected . When Z is specified, the algorithm conditions on those variables throughout, i.e. the formal goal is then to find a subset such that .
It is suggested to normalize the predictors before applying KFOCI.
Euclidean distance is used for computing the K-NN graph and the MST.
The algorithm returns a vector of the indices from 1,...,dx from the non-pre-conditioned set of the selected variables in the same order that they were selected. The variables at the front are expected to be more informative in predicting Y.
n = 200 p = 10 X = matrix(rnorm(n * p), ncol = p) Y = X[, 1] * X[, 2] + sin(X[, 1] * X[, 3]) KFOCI(Y, X, k=kernlab::rbfdot(1), Knn=1, numCores=1) # 1 2 3n = 200 p = 10 X = matrix(rnorm(n * p), ncol = p) Y = X[, 1] * X[, 2] + sin(X[, 1] * X[, 3]) KFOCI(Y, X, k=kernlab::rbfdot(1), Knn=1, numCores=1) # 1 2 3
Calculate (the unconditional version of graph-based KPC) using directed K-NN graph or minimum spanning tree (MST).
The computational complexity is O(nlog(n))
Klin( Y, X, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = 1 )Klin( Y, X, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = 1 )
Y |
a matrix of response (n by dy) |
X |
a matrix of predictors (n by dx) |
k |
a function |
Knn |
the number of K-nearest neighbor to use; or "MST". A small Knn (e.g., Knn=1) is recommended. |
is an estimate of the population kernel measure of association, based on data from .
For K-NN graph, can be computed in near linear time (in ).
In particular,
,
where all symbols have their usual meanings as in the definition of .
Euclidean distance is used for computing the K-NN graph and the MST.
The algorithm returns a real number ‘Klin’: an empirical kernel measure of association which can be computed in near linear time when K-NN graphs are used.
Deb, N., P. Ghosal, and B. Sen (2020), “Measuring association on topological spaces using kernels and geometric graphs” <arXiv:2010.01768>.
library(kernlab) Klin(Y = rnorm(100), X = rnorm(100), k = rbfdot(1), Knn = 1)library(kernlab) Klin(Y = rnorm(100), X = rnorm(100), k = rbfdot(1), Knn = 1)
Calculate (the unconditional version of graph-based KPC) using directed K-NN graph or minimum spanning tree (MST).
KMAc( Y, X, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = 1 )KMAc( Y, X, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = 1 )
Y |
a matrix of response (n by dy) |
X |
a matrix of predictors (n by dx) |
k |
a function |
Knn |
the number of K-nearest neighbor to use; or "MST". A small Knn (e.g., Knn=1) is recommended for an accurate estimate of the population KMAc. |
is an estimate of the population kernel measure of association, based on data from .
For K-NN graph, ties will be broken at random. MST is found using package emstreeR.
In particular,
where denotes a MST or K-NN graph on , denotes the set of edges of and
implies that there is an edge from to in .
Euclidean distance is used for computing the K-NN graph and the MST.
The algorithm returns a real number ‘KMAc’, the empirical kernel measure of association
Deb, N., P. Ghosal, and B. Sen (2020), “Measuring association on topological spaces using kernels and geometric graphs” <arXiv:2010.01768>.
library(kernlab) KMAc(Y = rnorm(100), X = rnorm(100), k = rbfdot(1), Knn = 1)library(kernlab) KMAc(Y = rnorm(100), X = rnorm(100), k = rbfdot(1), Knn = 1)
Calculate the kernel partial correlation (KPC) coefficient with directed K-nearest neighbor (K-NN) graph or minimum spanning tree (MST).
KPCgraph( Y, X, Z, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = 1, trans_inv = FALSE )KPCgraph( Y, X, Z, k = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), Knn = 1, trans_inv = FALSE )
Y |
a matrix (n by dy) |
X |
a matrix (n by dx) or |
Z |
a matrix (n by dz) |
k |
a function |
Knn |
a positive integer indicating the number of nearest neighbor to use; or "MST". A small Knn (e.g., Knn=1) is recommended for an accurate estimate of the population KPC. |
trans_inv |
TRUE or FALSE. Is |
The kernel partial correlation squared (KPC) measures the conditional dependence
between and given , based on an i.i.d. sample of .
It converges to the population quantity (depending on the kernel) which is between 0 and 1.
A small value indicates low conditional dependence between and given , and
a large value indicates stronger conditional dependence.
If X == NULL, it returns the KMAc(Y,Z,k,Knn), which measures the unconditional dependence between and .
Euclidean distance is used for computing the K-NN graph and the MST.
MST in practice often achieves similar performance as the 2-NN graph. A small K is recommended for the K-NN graph for an accurate estimate of the population KPC,
while if KPC is used as a test statistic for conditional independence, a larger K can be beneficial.
The algorithm returns a real number which is the estimated KPC.
library(kernlab) n = 2000 x = rnorm(n) z = rnorm(n) y = x + z + rnorm(n,1,1) KPCgraph(y,x,z,vanilladot(),Knn=1,trans_inv=FALSE) n = 1000 x = runif(n) z = runif(n) y = (x + z) %% 1 KPCgraph(y,x,z,rbfdot(5),Knn="MST",trans_inv=TRUE) discrete_ker = function(y1,y2) { if (y1 == y2) return(1) return(0) } class(discrete_ker) <- "kernel" set.seed(1) n = 2000 x = rnorm(n) z = rnorm(n) y = rep(0,n) for (i in 1:n) y[i] = sample(c(1,0),1,prob = c(exp(-z[i]^2/2),1-exp(-z[i]^2/2))) KPCgraph(y,x,z,discrete_ker,1) ##0.330413library(kernlab) n = 2000 x = rnorm(n) z = rnorm(n) y = x + z + rnorm(n,1,1) KPCgraph(y,x,z,vanilladot(),Knn=1,trans_inv=FALSE) n = 1000 x = runif(n) z = runif(n) y = (x + z) %% 1 KPCgraph(y,x,z,rbfdot(5),Knn="MST",trans_inv=TRUE) discrete_ker = function(y1,y2) { if (y1 == y2) return(1) return(0) } class(discrete_ker) <- "kernel" set.seed(1) n = 2000 x = rnorm(n) z = rnorm(n) y = rep(0,n) for (i in 1:n) y[i] = sample(c(1,0),1,prob = c(exp(-z[i]^2/2),1-exp(-z[i]^2/2))) KPCgraph(y,x,z,discrete_ker,1) ##0.330413
Compute estimate of Kernel partial correlation (KPC) coefficient using conditional mean embeddings in the reproducing kernel Hilbert spaces (RKHS).
KPCRKHS( Y, X = NULL, Z, ky = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), kx = kernlab::rbfdot(1/(2 * stats::median(stats::dist(X))^2)), kxz = kernlab::rbfdot(1/(2 * stats::median(stats::dist(cbind(X, Z)))^2)), eps = 0.001, appro = FALSE, tol = 1e-05 )KPCRKHS( Y, X = NULL, Z, ky = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), kx = kernlab::rbfdot(1/(2 * stats::median(stats::dist(X))^2)), kxz = kernlab::rbfdot(1/(2 * stats::median(stats::dist(cbind(X, Z)))^2)), eps = 0.001, appro = FALSE, tol = 1e-05 )
Y |
a matrix (n by dy) |
X |
a matrix (n by dx) or |
Z |
a matrix (n by dz) |
ky |
a function |
kx |
the kernel function for |
kxz |
the kernel function for |
eps |
a small positive regularization parameter for inverting the empirical cross-covariance operator |
appro |
whether to use incomplete Cholesky decomposition for approximation |
tol |
tolerance used for incomplete Cholesky decomposition (implemented by the function |
The kernel partial correlation (KPC) coefficient measures the conditional dependence
between and given , based on an i.i.d. sample of .
It converges to the population quantity (depending on the kernel) which is between 0 and 1.
A small value indicates low conditional dependence between and given , and
a large value indicates stronger conditional dependence.
If X = NULL, it measures the unconditional dependence between and .
The algorithm returns a real number which is the estimated KPC.
n = 500 set.seed(1) x = rnorm(n) z = rnorm(n) y = x + z + rnorm(n,1,1) library(kernlab) k = vanilladot() KPCRKHS(y, x, z, k, k, k, 1e-3/n^(0.4), appro = FALSE) # 0.4854383 (Population quantity = 0.5) KPCRKHS(y, x, z, k, k, k, 1e-3/n^(0.4), appro = TRUE, tol = 1e-5) # 0.4854383 (Population quantity = 0.5)n = 500 set.seed(1) x = rnorm(n) z = rnorm(n) y = x + z + rnorm(n,1,1) library(kernlab) k = vanilladot() KPCRKHS(y, x, z, k, k, k, 1e-3/n^(0.4), appro = FALSE) # 0.4854383 (Population quantity = 0.5) KPCRKHS(y, x, z, k, k, k, 1e-3/n^(0.4), appro = TRUE, tol = 1e-5) # 0.4854383 (Population quantity = 0.5)
The algorithm performs a forward stepwise variable selection using RKHS estimators.
KPCRKHS_VS( Y, X, num_features, ky = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), kS = NULL, eps = 0.001, appro = FALSE, tol = 1e-05, numCores = parallel::detectCores(), verbose = FALSE )KPCRKHS_VS( Y, X, num_features, ky = kernlab::rbfdot(1/(2 * stats::median(stats::dist(Y))^2)), kS = NULL, eps = 0.001, appro = FALSE, tol = 1e-05, numCores = parallel::detectCores(), verbose = FALSE )
Y |
a matrix of responses (n by dy) |
X |
a matrix of predictors (n by dx) |
num_features |
the number of variables to be selected, cannot be larger than dx. |
ky |
a function |
kS |
a function that takes X and a subset of indices S as inputs, and then outputs the kernel for X_S. The first argument of kS is X, and the second argument is a vector of positive integer. If |
eps |
a positive number; the regularization parameter for the RKHS estimator |
appro |
whether to use incomplete Cholesky decomposition for approximation |
tol |
tolerance used for incomplete Cholesky decomposition ( |
numCores |
number of cores that are going to be used for parallelizing the process. |
verbose |
whether to print each selected variables during the forward stepwise algorithm |
A stepwise forward selection of variables using KPC. At each step it selects the that maximizes selected .
It is suggested to normalize the features before applying the algorithm.
The algorithm returns a vector of the indices from 1,...,dx of the selected variables in the same order that they were selected. The variables at the front are expected to be more informative in predicting Y.
n = 200 p = 10 X = matrix(rnorm(n * p), ncol = p) Y = X[, 1] * X[, 2] + sin(X[, 1] * X[, 3]) library(kernlab) kS = function(X,S) return(rbfdot(1/length(S))) KPCRKHS_VS(Y, X, num_features = 3, rbfdot(1), kS, eps = 1e-3, appro = FALSE, numCores = 1) kS = function(X,S) return(rbfdot(1/(2*stats::median(stats::dist(X[,S]))^2))) KPCRKHS_VS(Y, X, num_features = 3, rbfdot(1), kS, eps = 1e-3, appro = FALSE, numCores = 1)n = 200 p = 10 X = matrix(rnorm(n * p), ncol = p) Y = X[, 1] * X[, 2] + sin(X[, 1] * X[, 3]) library(kernlab) kS = function(X,S) return(rbfdot(1/length(S))) KPCRKHS_VS(Y, X, num_features = 3, rbfdot(1), kS, eps = 1e-3, appro = FALSE, numCores = 1) kS = function(X,S) return(rbfdot(1/(2*stats::median(stats::dist(X[,S]))^2))) KPCRKHS_VS(Y, X, num_features = 3, rbfdot(1), kS, eps = 1e-3, appro = FALSE, numCores = 1)
A dataset containing three variables (creatinine clearance C; digoxin clearance D; urine flow U) from 35 patients.
medmed
A data frame with 35 rows and 3 variables:
creatinine clearance, in ml/min/1.73m^2
digoxin clearance, in ml/min/1.73m^2
urine flow, in ml/min
Edwards, D. (2012). Introduction to graphical modelling, Section 3.1.4, Springer Science & Business Media.
Calculate using directed K-NN graph or minimum spanning tree (MST).
TnKnn(Y, X, k, Knn = 1)TnKnn(Y, X, k, Knn = 1)
Y |
a matrix of response (n by dy) |
X |
a matrix of predictors (n by dx) |
k |
a function |
Knn |
the number of K-nearest neighbor to use; or "MST". |
is an estimate of , with , drawn iid from , given .
For K-NN graph, ties will be broken at random. Algorithm finding the MST is implemented the package emstreeR.
The algorithm returns a real number which is the value of Tn.