Lasso on Vertically Partitioned Data

Implementing the lasso on vertically partitioned data in a naive fashion illustrates the issues that arise with such data.

Consider a dataset with response \(y\) where the predictor matrix \(X\) is vertically partitioned among three sites.

\[ X = [X_1, X_2, X_3] \]

where the combined \(X\) is \(n\times p\), each \(X_k\) is \(n\times p_k\) for \(k = 1, 2, 3\), and \(\sum_{k=1}^3 p_k = p.\)

We wish to fit a lasso model:

\[ \begin{array}{ll} \underset{\beta}{\mbox{minimize}} & \frac{1}{2}||y - X\beta||_2^2 + \lambda ||\beta||_1. \end{array} \]

For purposes of illustration, we generate a dataset.

set.seed(129)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)
## The columns of X are split across the three sites
X1 <- X[, 1:2]
X2 <- X[, 3, drop = FALSE]
X3 <- X[, 4:5]
## Only the first, fourth, and fifth predictors affect the response
beta_true <- c(5, 0, 0, 2, 2)
y <- X %*% beta_true + rnorm(n)

The Aggregated Fit

If the data were not vertically partitioned, the fit would be straightforward for a specified value, say \(\lambda = 2\): we would just solve the optimization problem above, the primal problem.

suppressWarnings(suppressMessages(library(CVXR)))

beta <- Variable(p)
lambda <- 2
## Primal lasso problem: residual sum of squares plus an l1 penalty
p_obj <- sum_squares(y - X %*% beta) + lambda / 2 * p_norm(beta, 1)
p_prob <- Problem(Minimize(p_obj))
p_result <- solve(p_prob)

The resulting value of the primal objective, p_result$value, is 109.5436324, and the fitted estimate is

(beta_primal <- p_result$getValue(beta))
##              [,1]
## [1,]  4.937721027
## [2,] -0.007471594
## [3,]  0.094597544
## [4,]  2.065937851
## [5,]  2.130020709

The Lasso Dual

The dual of the lasso problem for that \(\lambda\) is

\[ \begin{array}{ll} \underset{u}{\mbox{minimize}} & ||y - u||_2^2 \end{array} \mbox{ subject to } ||X^Tu||_{\infty} \leq \lambda. \]

In the above, \(u\) is an \(n\)-vector of dual variables, and the vector appearing in the constraint decomposes by site as

\[ X^Tu = \left[\begin{array}{l} X_1^Tu\\ X_2^Tu\\ X_3^Tu \end{array}\right] \]

where each \(X_k^Tu\) is a vector of length \(p_k\). It follows that

\[ ||X^Tu||_{\infty} = \max\{||X_1^Tu||_{\infty}, ||X_2^Tu||_{\infty}, ||X_3^Tu||_{\infty}\} \]

Thus, if each site \(k\) provides \(||X_k^Tu||_{\infty}\) for a given \(u\), the constraint can be computed in a distributed fashion by a master performing the optimization.
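
For example, each site could expose a small helper that returns just this scalar for any trial \(u\). The snippet below is a hypothetical sketch for illustration only; site_inf_norm and u_test are names introduced here and are not part of the fit that follows.

## Hypothetical per-site helper: given a candidate dual vector u,
## a site returns only the scalar ||X_k^T u||_inf; X_k never leaves the site.
site_inf_norm <- function(Xk, u) max(abs(crossprod(Xk, u)))

## A master can then evaluate the overall constraint from three scalars;
## this matches the infinity norm computed on the pooled X.
u_test <- rnorm(n)
max(site_inf_norm(X1, u_test), site_inf_norm(X2, u_test), site_inf_norm(X3, u_test))
max(abs(t(X) %*% u_test))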

So the dual is solvable in a distributed fashion.

Indeed, we can compute this to check.

u <- Variable(n)
d_obj <- 0.5 * sum_squares(y - u)
## Aggregated form of the constraint (needs the pooled X):
## d_constraint <- list(p_norm(t(X) %*% u, Inf) <= lambda)
## Distributed form: the maximum over the per-site infinity norms
d_constraint <- list(
    max(p_norm(t(X1) %*% u, Inf),
        p_norm(t(X2) %*% u, Inf),
        p_norm(t(X3) %*% u, Inf)) <= lambda
)
d_prob <- Problem(Minimize(d_obj), d_constraint)
d_result <- solve(d_prob)
uVal <- d_result$getValue(u)
## Print a few values out of the 100
head(uVal)
##             [,1]
## [1,] -1.05733012
## [2,] -0.32031376
## [3,]  0.57115415
## [4,] -1.28355101
## [5,] -1.97281480
## [6,]  0.09022386
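
As an additional sanity check (not shown in the original output), one could verify that the constraint is satisfied, and indeed essentially active, at the dual solution.

## The dual constraint ||X^T u||_inf <= lambda should hold at the solution;
## since the unconstrained minimizer u = y is far from feasible here, we
## expect near equality (up to solver tolerance).
max(abs(t(X) %*% uVal))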

So far, so good.

The Catch

Now that we have solved the dual problem, we need to recover the solution to the original primal problem. The correspondence between the primal solution \(\hat{\beta}\) and dual solution \(u\) is:

\[ X\hat{\beta} = y - u. \]

Assuming \(X\) has full column rank, solving for \(\hat{\beta}\) gives

\[ \hat{\beta} = (X^TX)^{-1}X^T(y- u). \]

This is where we run into trouble, because \(X^TX\) involves cross-site terms \(X_j^TX_k\) with \(j \neq k\), as its block structure shows:
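
\[ X^TX = \left[\begin{array}{lll} X_1^TX_1 & X_1^TX_2 & X_1^TX_3\\ X_2^TX_1 & X_2^TX_2 & X_2^TX_3\\ X_3^TX_1 & X_3^TX_2 & X_3^TX_3 \end{array}\right] \]

Each off-diagonal block requires the raw data of two different sites at once, so no single site can compute it locally.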

If we nevertheless ignore that restriction and use the pooled \(X\), we can check that the estimate recovered from the dual closely matches the primal estimate.

## Recover the primal estimate from the dual solution; forming X^T X
## requires the pooled X, i.e., the cross-site products
XtX <- t(X) %*% X
beta_dual <- solve(XtX) %*% t(X) %*% (y - uVal)
cbind(beta_dual, beta_primal)
##               [,1]         [,2]
## [1,]  4.919405e+00  4.937721027
## [2,] -9.091532e-06 -0.007471594
## [3,]  7.857126e-02  0.094597544
## [4,]  2.047664e+00  2.065937851
## [5,]  2.110122e+00  2.130020709

References