MP-SPDZ Framework Development: SecureML Implementation (work in progress)

Branch 1: MP-SPDZ protocol structure

Target: understand MP-SPDZ's low-level interfaces

The empty protocol: NoProtocol
According to the official documentation, adding a new protocol to MP-SPDZ requires implementing the following parts:

  1. Machines/no-party.cpp
    ○ Contains the main function.
  2. Protocols/NoShare.h
    ○ Contains the NoShare class, which is supposed to hold one share. NoShare takes the cleartext type as a template parameter.
  3. Protocols/NoProtocol.h
    ○ Contains a number of classes representing instances of protocols:
    a. NoInput: private input.
    b. NoProtocol: multiplication protocol.
    c. NoOutput: public output.
  4. Protocols/NoLivePrep.h
    ○ Contains the NoLivePrep class, representing a preprocessing instance.
If the new protocol includes additional libraries, the Makefile must be modified to link them.
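
As a rough illustration of the idea behind a share class that takes its cleartext type as a template parameter (hypothetical code, not the actual MP-SPDZ NoShare interface): addition of shares is a purely local operation, and the cleartext type is exposed back to the framework.

#include <cstdint>
#include <iostream>

// Hypothetical sketch, not MP-SPDZ code: a share type parameterized by its cleartext type.
template<class T>
class ToyShare {
    T a;                      // this party's additive share of the secret
public:
    typedef T clear;          // the cleartext type exposed to the rest of the framework
    ToyShare(const T& v = 0) : a(v) {}
    ToyShare& operator+=(const ToyShare& o) { a += o.a; return *this; }  // local addition of shares
    T raw() const { return a; }
};

int main() {
    ToyShare<uint64_t> s0(3), s1(4);           // two parties' shares of the secret 7
    s0 += ToyShare<uint64_t>(10);              // each party adds its share of 30 locally
    s1 += ToyShare<uint64_t>(20);
    std::cout << s0.raw() + s1.raw() << "\n";  // reconstruction yields 7 + 30 = 37
    return 0;
}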

Semi-honest two-party protocols: Semi & Semi2k
Compiling and running semi2k

make -j8 semi2k-party.x
./compile.py -R 64 tutorial
./semi2k-party.x -N 2 -I -p 0 tutorial
./semi2k-party.x -N 2 -I -p 1 tutorial (in a separate terminal)

The main protocol components semi2k uses at the protocol layer are Protocols/Semi2kShare.h and Protocols/SemiPrep2k.h
Most of Semi2kShare's functionality is implemented in SemiShare

01 Preprocessing
SemiPrep: generates multiplication triples
SemiPrep2k: inherits from SemiPrep and additionally implements the RepRingOnlyEdabitPrep interface

02 Multiplication protocol: Semi
The multiplication protocol in Semi is built from several layers.
Semi layer
Implements two-party truncation using the trunc_pr function from Processor/TruncPrTuple.h
Beaver layer
Triple-based (Beaver) multiplication, corresponding to the TPC multiplication in piranha
Implements the phases of a multiplication: prepare_mul, start_exchange, stop_exchange, finalize_mul (a standalone sketch of these phases follows this list)
ProtocolBase layer (Replicated.h)
ProtocolBase provides:
● muls: carried out through the processor (sub-processor)
● mulrs:
● multiply: vector multiplication; communication and computation are done via the basic operators below
● mul: a single multiplication; communication and computation are done via the basic operators below
● Basic operators (init_mul, prepare_mul, exchange, finalize_mul): these are usually overridden by the layers above
Processor layer
● muls: n groups of element-wise vector multiplications, each of length size; S[0] = S[1] * S[2]
● mulrs: n groups of element-wise vector multiplications with per-group lengths given by each group's S[0]; S[1] = S[2] * S[3]
● dotprods: dot products
● matmuls: matrix multiplication, computed through repeated calls to dotprods
● matmulsm:
● conv2d:
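
To make the phase structure above concrete, here is a minimal standalone sketch of a Beaver-triple multiplication split into prepare / exchange / finalize steps. This is not MP-SPDZ code: the struct and function names are invented, the triple is dealt in trusted-dealer style, and the exchange is simulated by summing both parties' masked shares locally.

#include <cassert>
#include <cstdint>

// Illustrative sketch only (names invented): two-party Beaver multiplication over Z_{2^64},
// phrased as prepare / exchange / finalize phases.
struct BeaverParty {
    uint64_t x, y, z;         // this party's triple shares, with x*y == z in total
    uint64_t eShare, fShare;  // masked values produced by the prepare phase

    // prepare: mask the input shares with the triple shares
    void prepare(uint64_t aShare, uint64_t bShare) {
        eShare = aShare - x;  // share of e = a - x
        fShare = bShare - y;  // share of f = b - y
    }
    // finalize: combine the opened e, f with the local triple shares
    uint64_t finalize(uint64_t e, uint64_t f, bool leader) const {
        uint64_t res = z + e * y + f * x;
        if (leader) res += e * f;  // the public term is added by one party only
        return res;
    }
};

int main() {
    // trusted-dealer style triple: x = 8, y = 9, z = x*y = 72, additively shared
    BeaverParty p0{5, 7, 11}, p1{3, 2, 0};
    p1.z = (p0.x + p1.x) * (p0.y + p1.y) - p0.z;

    uint64_t a = 6, b = 9;  // the secrets, additively shared below
    uint64_t a0 = 100, a1 = a - a0, b0 = 42, b1 = b - b0;

    p0.prepare(a0, b0); p1.prepare(a1, b1);
    uint64_t e = p0.eShare + p1.eShare;  // "exchange": both masked shares are opened
    uint64_t f = p0.fShare + p1.fShare;

    uint64_t c0 = p0.finalize(e, f, true);
    uint64_t c1 = p1.finalize(e, f, false);
    assert(c0 + c1 == a * b);  // the two results are additive shares of a*b
    return 0;
}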

03 Secret sharing: Semi2kShare
Corresponds to the data types sint, sfix, etc.
Semi2kShare
Inherits from SemiShare, specifying the template class SignedZ2 as SemiShare's super, with K as the ring parameter.
Implements split?
SemiShare
Inherits from the template parameter super and implements the interfaces of ShareInterface

GC
SemiSecret
Implements trans (a division-based truncation method), andrsvec (a form of multiplication), the Boolean operations bitcom and bitdec, and plaintext reconstruction via reveal
Semi
Inherits from Protocols/Beaver with SemiSecret as the template argument, so that Semi is a SemiSecret equipped with Beaver multiplication. Also implements prepare_mult?

Issues about semi

Issue about comparison

Issues about triples
https://github.com/data61/MP-SPDZ/issues/821 — Semi generates multiplication triples in batches of 1000, so 1 multiplication and 100 multiplications use the same amount of preprocessing. The batch size can be reduced with -b, as indicated in the program output.
https://github.com/data61/MP-SPDZ/issues/335 — the number of triples corresponds to the number of multiplications, and the number of input tuples corresponds to the number of inputs.
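
For example (assumed invocation; the value 100 is arbitrary, and -b is the batch-size option referred to in the issue reply above):

./semi2k-party.x -N 2 -I -b 100 -p 0 tutorial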

Branch 2: GPU implementation of SecureML (piranha)

Target: learn how SecureML is implemented
● piranha has already been deployed on 173
● piranha's two-party implementation at the protocol layer is SecureML.
● In piranha, the protocol layer and the device layer are decoupled, so replacing the GPU interfaces it calls with plain CPU computation might make it usable directly for implementing SecureML in MP-SPDZ, or at least serve as a reference
Two-party protocol (P-SecureML). In 2017, Mohassel and Zhang [60] proposed a 2-party (and a trusted third party variant) protocol for privacy-preserving machine learning, using a 2-out-of-2 arithmetic secret sharing as the basis for its functionality. The linear layers are computed using Beaver triples and the non-linear layers are evaluated with garbled circuits. In our implementation, we replace the expensive GC-based evaluation of ReLUs with a more recent and efficient comparison protocol using edaBits [34, 56].

mpc layer
01 Precompute.h & Precompute.cpp
Contains a single class, class Precompute
getBeaverTriples
Fetches a multiplication triple: the three share objects are simply initialized to 1 (so x*y = z = 1), which is convenient for experiments (the authors responded to this point in an issue)
TPC.fill() assigns x = y = z = 1 to one party and x = y = z = 0 to the other

        template<typename T, typename Share>
        void getBeaverTriples(Share &x, Share &y, Share &z) {

            x.fill(1);
            y.fill(1);
            z.fill(1);
        }

02 TPC.h & TPC.inl
fill
Used by getBeaverTriples.
For the x = y = z = 1 from Precompute, after splitting, party A holds x_A = y_A = z_A = 1 and party B holds x_B = y_B = z_B = 0

template<typename T, typename I>
void TPCBase<T, I>::fill(T val) {
    shareA->fill(partyNum == PARTY_A ? val : 0);
}

reconstruct
Communication module: transmit sends, receive receives, and join copies the data onto the GPU
Each party sends its own share and receives the other party's share; the returned result is the sum of the two
The two profilers are external global objects that accumulate communication cost and the number of communication rounds

template<typename T, typename I, typename I2>
void reconstruct(TPC<T, I> &in, DeviceData<T, I2> &out) {

    comm_profiler.start();
    // 1 - send shareA to next party
    in.getShare(0)->transmit(TPC<T>::otherParty(partyNum));

    // 2 - receive shareA from previous party into DeviceBuffer 
    DeviceData<T> rxShare(in.size());
    rxShare.receive(TPC<T>::otherParty(partyNum));

    in.getShare(0)->join();
    rxShare.join();
    comm_profiler.accumulate("comm-time");

    // 3 - result is our shareB + received shareA
    out.zero();
    out += *in.getShare(0);
    out += rxShare;

    func_profiler.add_comm_round();
}

Multiplication: share × share
getShare returns the share held inside the TPC object as DeviceData
getBeaverTriples fetches the Beaver triple x, y, z, with x_0 = y_0 = z_0 = 1 and x_1 = y_1 = z_1 = 0
The two operands of the multiplication are this and rhs; write them as [a] and [b]
reconstruct opens a + x and b + y: each party sends its shares <x+a>, <y+b> (getShare(0)), receives the other party's shares, and reconstructs e = a + x, f = b + y
Since share ± share is computed by both parties, while share ± public (DeviceData) is applied by party 0 only:
● Party 0 computes <a*b>_0 = z_0 + (f - y_0) * e - x_0 * f = z_0 + e*f - y_0*e - x_0*f
● Party 1 performs only the share-with-share steps, so <a*b>_1 = z_1 (the (f - y_1)*e and x_1*f terms are never added; with this triple they would be 0 anyway)
● The sum satisfies <a*b>_0 + <a*b>_1 = x*y + f*e - y*e - x*f = (e - x) * (f - y) = a * b, using x_1 = y_1 = 0
● Thus this ends up holding <a*b>_i, and both parties obtain shares of the product a*b (a small numeric check follows)
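
A quick sanity check of these formulas (hypothetical numbers; the input shares are chosen arbitrarily, and the triple is the degenerate one from Precompute, so party 1's triple shares are all zero):

#include <cassert>
#include <cstdint>

int main() {
    // degenerate triple from Precompute: party 0 holds (1, 1, 1), party 1 holds (0, 0, 0)
    uint64_t x0 = 1, y0 = 1, z0 = 1;
    uint64_t z1 = 0;

    uint64_t a = 3, b = 5;           // secrets
    uint64_t a0 = 10, a1 = a - a0;   // additive shares of a (mod 2^64)
    uint64_t b0 = 7,  b1 = b - b0;   // additive shares of b

    // both parties open e = a + x and f = b + y (party 1's triple shares are 0)
    uint64_t e = (a0 + x0) + a1;
    uint64_t f = (b0 + y0) + b1;

    // party 0 applies the public terms, party 1 keeps only its z share
    uint64_t ab0 = z0 + (f - y0) * e - x0 * f;
    uint64_t ab1 = z1;

    assert(ab0 + ab1 == a * b);      // 1 + (6 - 1)*4 - 1*6 = 15 = 3*5
    return 0;
}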

template<typename T, typename I>
template<typename I2>
TPCBase<T, I> &TPCBase<T, I>::operator*=(const TPCBase<T, I2> &rhs) {

    size_t size = rhs.size();
    TPC<T> x(size), y(size), z(size);
    PrecomputeObject.getBeaverTriples<T, TPC<T> >(x, y, z);
    DeviceData<T> e(size), f(size), temp(size);

    *x.getShare(0) += *this->getShare(0); 
    *y.getShare(0) += *rhs.getShare(0);
    reconstruct(x, e); reconstruct(y, f);
    *x.getShare(0) -= *this->getShare(0);
    *y.getShare(0) -= *rhs.getShare(0);
    
    this->zero();
    *this += z;

    temp.zero();
    temp += f;
    temp -= *y.getShare(0);
    temp *= e;
    *this += temp;

    temp.zero();
    temp -= *x.getShare(0);
    temp *= f;
    *this += temp;
 
    return *this;
}

Matrix multiplication
localMatMul is a GPU routine; note that the truncation is implemented with dividePublic, the shared division by a public value

template<typename T>
void matmul(const TPC<T> &a, const TPC<T> &b, TPC<T> &c,
        int M, int N, int K,
        bool transpose_a, bool transpose_b, bool transpose_c, T truncation) {

    localMatMul(a, b, c, M, N, K, transpose_a, transpose_b, transpose_c);

    // truncate
    dividePublic(c, (T)1 << truncation);
}
template<typename T>
void localMatMul(const TPC<T> &a, const TPC<T> &b, TPC<T> &c,
        int M, int N, int K,
        bool transpose_a, bool transpose_b, bool transpose_c) {
    
    TPC<T> x(a.size()), y(b.size()), z(c.size());

    int a_rows = transpose_a ? K : M; int a_cols = transpose_a ? M : K;
    int b_rows = transpose_b ? N : K; int b_cols = transpose_b ? K : N;
    PrecomputeObject.getMatrixBeaverTriple<T, TPC<T> >(x, y, z, a_rows, a_cols, b_rows, b_cols, transpose_a, transpose_b);

    DeviceData<T> e(x.size()), f(y.size()), temp(z.size());

    x += a; y += b;
    reconstruct(x, e);
    reconstruct(y, f);
    x -= a; y -= b;

    c.zero();
    c += z;

    gpu::gemm(M, N, K, &e, transpose_a, &f, transpose_b, &temp, transpose_c);
    c += temp;
    temp.zero();

    gpu::gemm(M, N, K, &e, transpose_a, y.getShare(0), transpose_b, &temp, transpose_c);
    c -= temp;
    temp.zero();

    gpu::gemm(M, N, K, x.getShare(0), transpose_a, &f, transpose_b, &temp, transpose_c);
    c -= temp;
}

Division: share divided by a public value
Communication:
● Precomputed values: r = 1, rPrime = denominator = d
● One party gets [r], [rPrime] = 1, d; the other gets [r], [rPrime] = 0, 0
● reconstruct opens the masked dividend [a]: each party sends [a - rPrime];
● and receives the other share, reconstructing reconstructed = a - rPrime = a - d
Local computation:
● a += r is share + share, computed by both parties
● a += reconstructed is share + DeviceData, computed by only one party
Therefore:
● Party 0 computes [a]_0 = [r]_0 + reconstructed / d = [r]_0 + (a - d) / d
● Party 1 computes [a]_1 = [r]_1
● Adding the two: [a]_0 + [a]_1 = r + (a - d) / d = 1 + (a - d) / d = a / d
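As a concrete check with hypothetical numbers: for a = 20 and d = 4, with r shared as (1, 0), the opened value is reconstructed = 20 - 4 = 16; party 0 computes 1 + 16/4 = 5, party 1 keeps 0, and the shares sum to 5 = 20/4.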

template<typename T, typename I>
void dividePublic(TPC<T, I> &a, T denominator) {

    TPC<T> r(a.size()), rPrime(a.size());
    PrecomputeObject.getDividedShares<T, TPC<T> >(r, rPrime, denominator, a.size()); 
    a -= rPrime;
    
    DeviceData<T> reconstructed(a.size());
    reconstruct(a, reconstructed);
    reconstructed /= denominator;

    a.zero();
    a += r;
    a += reconstructed;
}

carryOut
● piranha's iterator-based, memory-optimized carry computation (used to obtain the most significant bit)
● Four views are initialized, mapping the odd and even positions of p and the odd and even positions of g
● The computation loops level by level, merging odd and even positions at each step (a plain sketch of this recursion follows the list):
○ gTemp = pOdd & gEven
○ pEven &= pOdd
○ gOdd ^= gTemp
○ gOdd -> gEven, gOdd
○ pEven -> pEven, pOdd
● Finally, gEven and gOdd are interleaved and copied into the output out
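
For intuition, here is a plain (non-MPC, non-GPU) sketch of the same carry recursion on public bits, assuming the usual generate/propagate definitions g[i] = a[i] & b[i] and p[i] = a[i] ^ b[i]; the strided-iterator code below performs this merge in place on shares.

#include <cassert>
#include <cstdint>
#include <vector>

// Plain sketch: LSB at index 0, k a power of two. Each round merges pairs
// (even = low half, odd = high half); after log2(k) rounds the surviving
// generate bit is the carry out of a + b (with carry-in 0).
uint8_t carryOutPlain(std::vector<uint8_t> p, std::vector<uint8_t> g) {
    for (size_t k = p.size(); k > 1; k /= 2) {
        for (size_t i = 0; i < k / 2; i++) {
            uint8_t pEven = p[2 * i], pOdd = p[2 * i + 1];
            uint8_t gEven = g[2 * i], gOdd = g[2 * i + 1];
            g[i] = gOdd ^ (pOdd & gEven);  // the pair generates a carry
            p[i] = pOdd & pEven;           // the pair propagates an incoming carry
        }
    }
    return g[0];
}

int main() {
    uint8_t a = 200, b = 100;              // 200 + 100 = 300 > 255, so the carry out is 1
    std::vector<uint8_t> p(8), g(8);
    for (int i = 0; i < 8; i++) {
        uint8_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        p[i] = ai ^ bi;
        g[i] = ai & bi;
    }
    assert(carryOutPlain(p, g) == 1);
    return 0;
}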

template<typename T, typename I, typename I2>
void carryOut(TPC<T, I> &p, TPC<T, I> &g, int k, TPC<T, I2> &out) {

    // get zip iterators on both p and g
    //  -> pEven, pOdd, gEven, gOdd

    int stride = 2;
    int offset = 1;

    using SRIterator = typename StridedRange<I>::iterator;

    StridedRange<I> pEven0Range(p.getShare(0)->begin(), p.getShare(0)->end(), stride);
    DeviceData<T, SRIterator> pEven0(pEven0Range.begin(), pEven0Range.end());
    TPC<T, SRIterator> pEven(&pEven0);

    StridedRange<I> pOdd0Range(p.getShare(0)->begin() + offset, p.getShare(0)->end(), stride);
    DeviceData<T, SRIterator> pOdd0(pOdd0Range.begin(), pOdd0Range.end());
    TPC<T, SRIterator> pOdd(&pOdd0);

    StridedRange<I> gEven0Range(g.getShare(0)->begin(), g.getShare(0)->end(), stride);
    DeviceData<T, SRIterator> gEven0(gEven0Range.begin(), gEven0Range.end());
    TPC<T, SRIterator> gEven(&gEven0);

    StridedRange<I> gOdd0Range(g.getShare(0)->begin() + offset, g.getShare(0)->end(), stride);
    DeviceData<T, SRIterator> gOdd0(gOdd0Range.begin(), gOdd0Range.end());
    TPC<T, SRIterator> gOdd(&gOdd0);

    while(k > 1) {

        // gTemp = pOdd & gEven
        //  store result in gEven
        gEven &= pOdd;

        // pEven & pOdd
        //  store result in pEven
        pEven &= pOdd;

        // gOdd ^ gTemp
        //  store result in gOdd
        gOdd ^= gEven;

        // regenerate zip iterators to p and g

        //  gOdd -> gEven, gOdd
        gEven0Range.set(g.getShare(0)->begin() + offset, g.getShare(0)->end(), stride*2);
        gEven0.set(gEven0Range.begin(), gEven0Range.end());
        gEven.set(&gEven0);

        offset += stride;

        gOdd0Range.set(g.getShare(0)->begin() + offset, g.getShare(0)->end(), stride*2);
        gOdd0.set(gOdd0Range.begin(), gOdd0Range.end());
        gOdd.set(&gOdd0);

        //  pEven -> pEven, pOdd
        stride *= 2;

        pEven0Range.set(p.getShare(0)->begin(), p.getShare(0)->end(), stride);
        pEven0.set(pEven0Range.begin(), pEven0Range.end());
        pEven.set(&pEven0);

        pOdd0Range.set(p.getShare(0)->begin() + stride/2, p.getShare(0)->end(), stride);
        pOdd0.set(pOdd0Range.begin(), pOdd0Range.end());
        pOdd.set(&pOdd0);

        k /= 2;
    }

    // copy output to destination
    // out.zip(gEven, gOdd);
    StridedRange<I> outputEven0Range(out.getShare(0)->begin(), out.getShare(0)->end(), 2);
    thrust::copy(gEven.getShare(0)->begin(), gEven.getShare(0)->end(), outputEven0Range.begin());

    StridedRange<I> outputOdd0Range(out.getShare(0)->begin() + 1, out.getShare(0)->end(), 2);
    thrust::copy(gOdd.getShare(0)->begin(), gOdd.getShare(0)->end(), outputOdd0Range.begin());
}

dReLU
● Input: r_A = 1 + input_A, r_B = input_B
● Reconstruction gives a = r + 1 = input + 2
● gpu::setCarryOutMSB: msb_A = msb_r ^ msb_a, msb_B = msb_r; the MSB positions are then reset to rbit = 0 and abit = 1
● preResult = carryOut(r ^ a, r & a)
● result = 1 - (msb ^ preResult)

// TODO change into 2 arguments with subtraction, pointer NULL indicates compare w/ 0
template<typename T, typename U, typename I, typename I2>
void dReLU(const TPC<T, I> &input, TPC<U, I2> &result) {

    //TO_BE_DONE
    int bitWidth = sizeof(T) * 8;

    TPC<T> r(input.size());
    TPC<U> rbits(input.size() * bitWidth);
    rbits.fill(1);

    DeviceData<T> a(input.size());
    r += input;
    reconstruct(r, a);
    a += 1;

    DeviceData<U> abits(rbits.size());
    gpu::bitexpand(&a, &abits);

    TPC<U> msb(input.size());

    // setCarryOutMSB overwrites abits/rbits, so make sure if we're party C that we don't accidentally use the modified values (hacky)
    gpu::setCarryOutMSB(*(rbits.getShare(0)), abits, *(msb.getShare(0)), bitWidth, partyNum == TPC<U>::PARTY_A);

    TPC<U> g(rbits.size());
    g.zero();
    g += rbits;
    g &= abits;

    TPC<U> p(rbits.size());
    p.zero();
    p += rbits;
    p ^= abits;

    TPC<U> preResult(result.size());
    carryOut(p, g, bitWidth, preResult);

    preResult ^= msb;

    result.fill(1);
    result -= preResult;
}

03 A problem in share multiplication
Before the change, party B's triple is (0, 0, 0), so party B's result in the multiplication is always 0. Party B's computation in the multiplication may in fact contain partial errors, but the result still happens to come out as 0.
Modifying fill to swap the two parties' Beaver triples as follows:
shareA->fill(partyNum == PARTY_A ? 0 : val);
afterwards, the 10-class MNIST accuracy drops to 10%

lx@dell-PowerEdge-R740:~/piranha$ ./files/samples/localhost_runner.sh
run unit tests? false
config network: "files/models/secureml-norelu.json"
network filename: files/models/secureml-norelu.json
----------------------------------------------
(1) FC Layer              784 x 128
                          512            (Batch Size)
----------------------------------------------
(2) ReLU Layer            512 x 128
----------------------------------------------
(3) FC Layer              128 x 128
                          512            (Batch Size)
----------------------------------------------
(4) ReLU Layer            512 x 128
----------------------------------------------
(5) FC Layer              128 x 10
                          512            (Batch Size)
TRAINING, EPOCHS = 10 ITERATIONS = 117
epoch,0
total time (s),320.481703
total tx comm (MB),4757.923828
total rx comm (MB),4757.923828
train accuracy,0.099910
epoch,1
total time (s),689.910743
total tx comm (MB),9515.847656
total rx comm (MB),9515.847656
train accuracy,0.101379
epoch,2
total time (s),1053.758367
total tx comm (MB),14273.771484
total rx comm (MB),14273.771484
train accuracy,0.100978
epoch,3
total time (s),1414.614286
total tx comm (MB),19031.695312
total rx comm (MB),19031.695312
train accuracy,0.099927
epoch,4
total time (s),1772.819317
total tx comm (MB),23789.619141
total rx comm (MB),23789.619141
train accuracy,0.099860
epoch,5
total time (s),2128.047614
total tx comm (MB),28547.542969
total rx comm (MB),28547.542969
train accuracy,0.098608

This does not affect implementing SecureML itself, but attention must be paid to how the triples are set up in the implementation
nn layer
01 secureml.json
The secureml model: composed of alternating fc and relu layers
secureml-norelu: the same, but with the final relu layer removed

{
    "name": "SecureML",
    "dataset": "MNIST",
    "batch_size": 128,
    "input_size": 784,
    "num_classes": 10,
    "model": [
        {
            "layer": "fc",
            "input_dim": 784,
            "output_dim": 128 
        },
        {
            "layer": "relu",
            "input_dim": 128
        },
        {
            "layer": "fc",
            "input_dim": 128,
            "output_dim": 128 
        },
        {
            "layer": "relu",
            "input_dim": 128
        },
        {
            "layer": "fc",
            "input_dim": 128,
            "output_dim": 10 
        },
        {
            "layer": "relu",
            "input_dim": 10 
        }
    ] 
}

02 ReLULayer.cu
● Forward pass

template<typename T, template<typename, typename...> typename Share>
void ReLULayer<T, Share>::forward(const Share<T> &input) {

    if (piranha_config["debug_all_forward"]) {
        printf("layer %d\n", this->layerNum);
        //printShareTensor(*const_cast<Share<T> *>(&input), "fw pass input (n=1)", 1, 1, 1, input.size() / conf.batchSize);
    }

    log_print("ReLU.forward");

    /*
size_t rows = conf.batchSize; // ???
size_t columns = conf.inputDim;
size_t size = rows*columns;
*/

    this->layer_profiler.start();
    relu_profiler.start();
    debug_profiler.start();

    activations.zero();

    ReLU(input, activations, reluPrime);

    debug_profiler.accumulate("relu-fw-fprop");
    this->layer_profiler.accumulate("relu-forward");
    relu_profiler.accumulate("relu-forward");

    if (piranha_config["debug_all_forward"]) {
        //printShareTensor(*const_cast<Share<T> *>(&activations), "fw pass activations (n=1)", 1, 1, 1, activations.size() / conf.batchSize);
        std::vector<double> vals(activations.size());
        copyToHost(activations, vals);

        printf("relu,fw activation,min,%e,avg,%e,max,%e\n", 
            *std::min_element(vals.begin(), vals.end()),
            std::accumulate(vals.begin(), vals.end(), 0.0) / static_cast<float>(vals.size()), 
            *std::max_element(vals.begin(), vals.end()));
    }
}

● Backward pass

template<typename T, template<typename, typename...> typename Share>
void ReLULayer<T, Share>::backward(const Share<T> &delta, const Share<T> &forwardInput) {

    if (piranha_config["debug_all_backward"]) {
        printf("layer %d\n", this->layerNum);
        //printShareFinite(*const_cast<Share<T> *>(&delta), "input delta for bw pass (first 10)", 10);
        std::vector<double> vals(delta.size());
        copyToHost(
            *const_cast<Share<T> *>(&delta),
            vals
        );
        
        printf("relu,bw input delta,min,%e,avg,%e,max,%e\n", 
                *std::min_element(vals.begin(), vals.end()),
                std::accumulate(vals.begin(), vals.end(), 0.0) / static_cast<float>(vals.size()), 
                *std::max_element(vals.begin(), vals.end()));
    }

	log_print("ReLU.backward");

	relu_profiler.start();
	this->layer_profiler.start();
    debug_profiler.start();

    this->deltas.zero();

	// (1) Compute backwards gradient for previous layer
	Share<T> zeros(delta.size());
	zeros.zero();
    selectShare(zeros, delta, reluPrime, deltas);

    // (2) Compute gradients w.r.t. layer params and update
    // nothing for ReLU

    debug_profiler.accumulate("relu-bw");
    relu_profiler.accumulate("relu-backward");
    this->layer_profiler.accumulate("relu-backward");

    //return deltas;
}

Branch 3: CPU implementation of SecureML

https://github.com/shreya-28/Secure-ML
Not yet deployed
A reproduction of SecureML by a third party (not the original authors), kept as a reference
