Reviewer 5 of 2018 ACC submission 1409
Comments to the author
======================

This paper deals with "privatised" versions of the distributed gradient descent algorithm on a graph. It appears to be the conference version of preprints [36-38] by the authors. The authors introduce three algorithms in which they add noise to the parameters estimated by distributed gradient descent in such a way that the noise eventually balances out; they provide a convergence rate analysis to show how this additional privacy affects the constant and not the actual rate.

* The proposed algorithms are rather well introduced, as are the comparisons with standard DGD. However, their actual privacy and the comparison with usual privacy setups (differential privacy) are treated too quickly and too late in the paper, which leaves the reader questioning the interest of the proposed algorithms while reading.

* I fail to see an actual use case for the Functional Sharing version; could the authors provide one?

* The numerical illustrations are not very convincing to me:
i) Polynomial optimization: the functions are rather simple, but more importantly the convergence speed depends heavily on the initialization. It seems strange that RSS-LB performs much better than standard DGD; the reviewer wonders whether this is due to the initialization or to a particular realization.
ii) Machine learning: I suppose that the optimized function is non-convex for the MNIST deep neural network training, so I wonder what point it actually makes. For the logistic regression, an objective function value would make more sense in an optimization context, since accuracy can be distorted by regularization and overfitting issues.

Minor comments:
* [Assumptions] The compactness of X is crucial: it is a rather strong assumption that implies both A-1-3 (X^* is bounded) and that A-2-2 => A-2-1. The text of the assumptions could thus be simplified to highlight this.
* [Sec. V] In the title "Proof of Theorem 3", "Theorem" is misspelled.

Reviewer 2 of 2018 ACC submission 1409
Comments to the author
======================

The authors propose a series of algorithms for distributed optimization with some privacy guarantees. The paper is well written; nevertheless, there are some concerns about the provided results. A borderline rejection is suggested. Find the comments below:

If the authors consider fixed undirected graphs, why do they single out the assumption of strongly connected graphs? It seems that in this scenario the graph being connected is enough.

It is not clear whether the authors consider a fixed or a time-varying graph. On the one hand, they define the set of neighbors to be fixed, but in their definition of the gossip matrix they use the subindex k.

Apart from boundedness, there is never any intuition on how to sample the random variables s^{i,j}; in fact, in their Algorithm 1 there is never a sampling stage where realizations of these random variables are drawn. How can they, in step 1, assign the value 0 to a random variable?

The boundedness assumption on the random noise variables also implies that the agents require extra information on the number of agents in the network.

Again, why does the matrix B depend on k? If the authors use the Metropolis weights and the graph is fixed, then there is no reason for this: the weights are determined by the topology alone, as the sketch below illustrates.
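For concreteness, here is a minimal sketch of the standard Metropolis weights on a fixed undirected graph (this is the reviewer's own illustration, not the authors' code; the function name is hypothetical). The matrix is built once from the topology alone, so no iteration subindex k is needed:

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis weights for a fixed undirected graph.

    adj: symmetric 0/1 adjacency matrix (no self-loops).
    Returns a symmetric, doubly stochastic weight matrix that
    depends only on the topology -- it carries no iteration index k.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        # Self-weight absorbs the remaining mass so each row sums to 1.
        W[i, i] = 1.0 - W[i].sum()
    return W

# Example: a path graph on three nodes.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(metropolis_weights(A))  # rows and columns each sum to 1
```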
MAIN Concerns:

1) In Theorem 1, the authors claim that their algorithmic proposal "solves" the distributed optimization problem. However, one required input for their algorithm is the parameter MAX_ITER, which, by the way, seems to be discarded in Algorithm 3. This MAX_ITER is never discussed; how can the authors find a value for this parameter? Moreover, what is the definition of a solution for the problem, since clearly this algorithm will only produce approximate ones in finite time? They talk about asymptotic convergence, yet their algorithms execute in finite time.

2) The authors claim their algorithms achieve some trade-off between privacy and solution quality of the distributed optimization problem. On the solution side, Theorem 3 provides a result that relates the magnitude of the added noise to the convergence rate of the algorithm. However, there is no way to quantify the effect of \Delta on the privacy-preservation side. The results in Theorem 2 talk about preserving privacy but do not relate in any manner to the parameter value \Delta.

NOTATION: f is used both for a function and for the number of adversarial agents, which creates confusion.

The experimental results on the machine learning example seem interesting; nevertheless, deep neural network objectives are known to be non-convex, thus invalidating the initial convexity assumption required for the results of the paper to hold.

For Theorem 3, the authors refer to [16] for their convergence results. Nevertheless, [16] works in a different setup of time-varying directed graphs.

Reviewer 4 of 2018 ACC submission 1409
Comments to the author
======================

The content and the focus of the paper are very interesting and relevant.

In many places in the paper you have used the terminology 'feasible objective function' and 'feasible instances of Problem 1'; it is used especially in the key results (e.g., claims and theorems). In my opinion this is unclear and confusing. When you use 'feasible set', its definition is clear; however, the terminology 'feasible function' is not. So I want you to be more precise about this.

You have mentioned that "When each agent's local objective function is restricted to be any polynomial of a bounded degree, the set of feasible functions forms an additive group. Theorem 2 makes a claim regarding the privacy achieved using function sharing in this case." However, in Assumption A1 you exclude nonconvex functions, so the set of feasible polynomials is not closed under additive inverses and hence is not a group. Therefore, the claim above, in particular "Theorem 2 makes a claim regarding the privacy achieved using function sharing in this case", is not really true. Please refine the sentence appropriately.
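To make the last point concrete (the reviewer's illustration, not material from the paper): the additive inverse of a convex polynomial is concave, so under Assumption A1 the feasible functions are closed under addition but lack inverses, forming a monoid rather than a group. A minimal numerical sketch:

```python
import numpy as np

def midpoint_convex_on_grid(f, xs, tol=1e-9):
    """Numerically check midpoint convexity of f on a 1-D grid."""
    return all(f((a + b) / 2) <= (f(a) + f(b)) / 2 + tol
               for a in xs for b in xs)

xs = np.linspace(-2.0, 2.0, 21)
f = lambda x: x ** 2      # convex polynomial of degree 2 (satisfies A1)
g = lambda x: -(x ** 2)   # its additive inverse: concave, excluded by A1

print(midpoint_convex_on_grid(f, xs))  # True
print(midpoint_convex_on_grid(g, xs))  # False -> no inverse within A1
```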