======= Review 1 ======= > *** Contributions: What are the major issues addressed in the paper? Do you consider them important? Comment on the novelty, creativity, impact, and technical depth in the paper. This paper proposes a novel scheme for guaranteeing Byzantine fault tolerance. The proposed method is based on network coding, and to the best of my knowledge is a new application of this technique. The paper does not provide very much theoretical analysis (although there is a proof of correctness), relying instead on testbed evaluation. The performance improvement, as measured in the testbed, appears to be relatively minor. > *** Strengths: What are the major reasons to accept the paper? [Be brief.] The paper presents a new approach to an important problem, as well as a new application of network coding. The experimental data provide comparison between the proposed scheme and existing methods. > *** Weaknesses: What are the major reasons NOT to accept the paper? [Be brief.] The improvement of the proposed method over existing methods is not reflected in the simulation plots. Not much theoretical analysis. > *** Detailed Comments: Please provide detailed comments that will help the TPC assess the paper and help provide feedback to the authors. My first comment is that the title of the paper is somewhat misleading. Based on the title, the paper appears to be an experimental/testbed study of existing Byzantine fault tolerance methods, when in fact it proposes and implements a new method. Also, it is not clear why the contribution is motivated by data centers, since Byzantine faults are present in other contexts as well. The paper proposes an approach (referred to as NCBA) to correcting Byzantine faults based on network coding. While this is a novel application of network coding, it appears to rely on existing network coding tools. Also, the paper would benefit from additional explanation of why the particular coding scheme used in this work was selected, as well as some theoretical or experimental evaluation of other possible encoding methods. The paper demonstrates the effectiveness of NCBA through experimental evaluation. The simulation plots in Figures 3-6, however, seem to show that NCBA either does not improve on the performance of existing methods (Figures 3 and 4) or performs worse (Figures 5 and 6, left-most data points). This casts doubt on the practical usefulness of the proposed method. Hence, while the use of network coding in ensuring fault tolerance is an interesting concept, the paper does not make the case that it is a viable alternative to the current state of the art. > *** Recommendation: Your overall rating (Please try giving as few borderlines as possible). B = (top 30% of reviewer's perception of all INFOCOM submissions, but not top 20%) - accept if there is room (3) ======= Review 2 ======= > *** Contributions: What are the major issues addressed in the paper? Do you consider them important? Comment on the novelty, creativity, impact, and technical depth in the paper. There are two standard protocols for Byzantine fault tolerant state machine protocol that are applicable for data centers: one is the basic one introduced by Lamport and then the second one which s a somewhat simplified (i.e., with some additional assumption) version that uses collision resistant hash functions to reduce the network traffic. The current paper proposes new interesting approach to do the same using the well concept of network coding that has been fairly extensively used in other applications. The new approach does not use any hash function and hence provides a correct solution as opposed to a correct solution with a high probability; and, of course, the new schemes is more expensive in terms of communication cost but is computationally cheaper. The paper also does a comprehensive experimental investigation to compare the three schemes and show that the new scheme is at least as good as the previous ones, but does not depend on the goodness of the chosen hash functions. > *** Strengths: What are the major reasons to accept the paper? [Be brief.] The use of the of the concept of network encoding to solve the problem is novel, interesting and demonstrates the usefulness of network coding in yet another application domain. > *** Weaknesses: What are the major reasons NOT to accept the paper? [Be brief.] I do not see much except I would like to see some theoretical analysis of the proposed NCBA scheme; then that is outside the stated scope of the paper. > *** Detailed Comments: Please provide detailed comments that will help the TPC assess the paper and help provide feedback to the authors. There are two standard protocols for Byzantine fault tolerant state machine protocol that are applicable for data centers: one is the basic one introduced by Lamport and then the second one which s a somewhat simplified (i.e., with some additional assumption) version that uses collision resistant hash functions to reduce the network traffic. The current paper proposes new interesting approach to do the same using the well concept of network coding that has been fairly extensively used in other applications. The new approach does not use any hash function and hence provides a correct solution as opposed to a correct solution with a high probability; and, of course, the new schemes is more expensive in terms of communication cost but is computationally cheaper. The paper also does a comprehensive experimental investigation to compare the three schemes and show that the new scheme is at least as good as the previous ones, but does not depend on the goodness of the chosen hash functions. > *** Recommendation: Your overall rating (Please try giving as few borderlines as possible). B+ = (top 20% of reviewer's perception of all INFOCOM submissions, but not top 10%) weak accept (4) ======= Review 3 ======= > *** Contributions: What are the major issues addressed in the paper? Do you consider them important? Comment on the novelty, creativity, impact, and technical depth in the paper. This paper implements and evaluates three different Byzantine Fault-Tolerant (BFT) state machine replication protocols for data centers, and the authors propose a network coding based BFT protocol that performs comparably with Digest. > *** Strengths: What are the major reasons to accept the paper? [Be brief.] This paper represents the first implementation of BFT with network coding, which can trade off between communication cost and computational cost in computing hash functions. > *** Weaknesses: What are the major reasons NOT to accept the paper? [Be brief.] My concern regarding the paper is related to the significance of trade-off between communication cost and computational cost in computing hash functions. > *** Detailed Comments: Please provide detailed comments that will help the TPC assess the paper and help provide feedback to the authors. The authors claim that "even though NCBA introduces 50% more communication cost than PBFT does, it is compensated by reducing the computational cost of hashing. Through extensive experiments, we verified that NCBA performs at least as well as Digest, without relying on any cryptographic assumption on the hardness of breaking the hash function." I really doubt the practical significance of the above claim. First of all, standard hash functions such as SHA-256 has been widely applied almost to every aspect of the network/communication/internet/security protocols. I don't see any problem using it. Hash functions are designed to be efficient in computation; paying 50% more communication overhead to reduce the computation cost of the hash function just is not the right way to pursue in my opinion. That said, it is nice to see a new approach implementing BFT based on network coding, even though its significance is not justified too well. > *** Recommendation: Your overall rating (Please try giving as few borderlines as possible). B = (top 30% of reviewer's perception of all INFOCOM submissions, but not top 20%) - accept if there is room (3) ======= Review 4 ======= > *** Contributions: What are the major issues addressed in the paper? Do you consider them important? Comment on the novelty, creativity, impact, and technical depth in the paper. This paper investigates the Byzantine Fault-Tolerant (BFT) replication protocol design problem for data center, online storage, and cloud computing. Instead of using hashing function and relying on its good collision-reduction property, the proposed protocol uses RS codes for state broadcasts among replicas. Such RS codes allow the replicas to efficiently determine the faulty one(s) among themselves. > *** Strengths: What are the major reasons to accept the paper? [Be brief.] The strength of the paper is the unique application of RS code into faulty tolerancy and cloud computing. The problem is interesting and solution is novel. The paper is well-written. > *** Weaknesses: What are the major reasons NOT to accept the paper? [Be brief.] Please see the detailed comments below. > *** Detailed Comments: Please provide detailed comments that will help the TPC assess the paper and help provide feedback to the authors. This paper investigates the Byzantine Fault-Tolerant (BFT) replication protocol design problem for data center, online storage, and cloud computing. Instead of using hashing function and relying on its good collision-reduction property, the proposed protocol uses RS codes for state broadcasts among replicas. Such RS codes allow the replicas to efficiently determine the faulty one(s) among themselves. The strength of the paper is the unique application of RS code into faulty tolerancy and cloud computing. The problem is interesting and solution is novel. The paper is well-written. I have several comments: In Figure 1, prepare phase, shouldn't peer 3 broadcast to peers 1 and 2 as well? In addition, blue color arrows in this figure can't be distinguished in black/white printout and should be avoided. In first paragraph of Section IV.B. The third issue (a faulty peer Pi broadcasts Detected_i=TRUE even though it should be FALSE). Under what circumstances will such an error occur? In the following paragraph, Y will ignore X when X ignores Y (as discussed by the authors). Will this affect whether other fault-free nodes will detect the faulty Y? Majority vote? The linear lines in Figure 3 are not very informative. The authors are suggested to present results that help readers understand the contents better. Furthermore, why did the authors include SHA-256 with SHA-512? The results look rather similar. > *** Recommendation: Your overall rating (Please try giving as few borderlines as possible). A = (top 10% of reviewer's perception of all INFOCOM submissions) strong accept (5)