Introduction to PySyft

Philosophy

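The goal is to give data scientists a space that hosts different datasets. Each data scientist can use those datasets to carry out tasks, but the data cannot be shared, downloaded, leaked, or even viewed. In other words, everyone can use the data, but no one can see it in plaintext. All operations are remote operations.
PySyft is a developer-facing framework (a scaffold to build on), similar in that respect to TensorFlow and PyTorch.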

Concepts

  • worker: a participating machine.
  • data scientist: a participant role.
  • pointer: a reference to a remote object, analogous to a pointer in C. Because every operation is remote, pointers are used constantly.
  • Domain Node: the server that stores the shared data. In the original wording: "Domain Node is a server in which we will host our private data."
  • network node: a layer above the Domain Node. In the original wording: "A Network Node is a level of abstraction above a Domain Node. It's a server which exists outside of any Data Owner's institution, providing services to a network for Data Owners and Data Scientists, such as dataset searching and bulk project approval (the ability to participate in projects across groups of domains and data scientists at a time)."
  • privacy budget: used for differential privacy on a dataset. In the original wording: "The privacy budget represents how much noise the Data Scientist can remove from a dataset when accessing it. Domains will set a privacy budget per data scientist."
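To make the pointer and Domain Node concepts above concrete, here is a minimal toy sketch in plain Python. It does not use PySyft: ToyDomainNode and ToyPointer are hypothetical names invented for illustration, and there is no real network involved. It only shows the idea that the data stays on the node while the data scientist holds handles and triggers operations remotely.

```python
import itertools
import operator


class ToyDomainNode:
    """Hypothetical stand-in for a Domain Node: every value stays server-side."""

    def __init__(self):
        self._store = {}
        self._ids = itertools.count(1)

    def put(self, value):
        """Store a value on the node and hand back only a pointer to it."""
        obj_id = next(self._ids)
        self._store[obj_id] = value
        return ToyPointer(self, obj_id)

    def run(self, op, *obj_ids):
        """Execute an operation next to the data; the result also stays on the node."""
        args = [self._store[i] for i in obj_ids]
        return self.put(op(*args))


class ToyPointer:
    """A handle to remote data: it knows where the object lives, not what it contains."""

    def __init__(self, node, obj_id):
        self._node = node
        self.obj_id = obj_id

    def __add__(self, other):
        # Adding two pointers asks the node to do the work and returns a new pointer.
        return self._node.run(operator.add, self.obj_id, other.obj_id)

    def __repr__(self):
        return f"<ToyPointer id={self.obj_id}>"


node = ToyDomainNode()
ages = node.put([31, 45, 27])
more_ages = node.put([52, 38])
combined = ages + more_ages  # list concatenation happens on the node
print(combined)  # prints <ToyPointer id=3>
```

Printing the result shows only the pointer, never the underlying values, which mirrors the "usable but not visible" idea described in the Philosophy section.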

Operations

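  • Fetching data: connect to the server and search by dataset name. Reading data requires an account on the Domain Node, and you must log in for each read.
  • "Viewing" data: only dataset-level summary information and sample information are returned (the samples may even be synthetic look-alikes).
  • Locating data: on the server (called a Domain Node in the framework), data is located through the domain.
  • Starting the server (Domain Node): use HAGrid.
  • Shutting down the server: the framework calls this "land the domain node" or "spin down the domain node". It stops the data-sharing service; nothing is deleted.
  • Uploading data: in other words, sharing data. You upload your own data to the domain node and must specify parameters when uploading: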

    You'll notice that in the example above, within .load_dataset(), there are four properties you can define:
    Assets
    Name
    Description
    Metadata
    These properties are shown to Data Scientists when they are searching for datasets they'd like to work with. You can think of them as your dataset's listing information. When "Metadata" is being referred to here, it is being referred to not in a Differential Privacy sense but in a search and discovery sense.
    Note: To help prevent data being uploaded that cannot be made differentially private, PyGrid will raise a warning if the data you've uploaded isn't compatible with its Differential Privacy framework. When that happens you will need to doublecheck that the dataset is using the correct data type and has its min_val, max_val, and entities defined.

  • Publishing results: use publish. In the original wording: "publish uses the privacy budget approved by the data owner to access the data in a noised format that does not compromise the original dataset. sigma is the amount of privacy budget the data scientist plans to use."
  • Data theft: the framework uses differential privacy, which is intended to guard against this.
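The upload bounds (min_val, max_val), the per-scientist privacy budget, and the publish step described above can be sketched with a toy example. This is not PySyft's API or its actual privacy accounting: ToyPrivateDataset, set_budget, and publish_sum are hypothetical names, and the sketch swaps in the textbook Laplace mechanism with a simple additive epsilon budget, assuming neighboring datasets differ by replacing one record.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a query would spend more budget than remains."""


class ToyPrivateDataset:
    """Hypothetical dataset that carries the bounds needed for differential privacy."""

    def __init__(self, values, min_val, max_val, seed=None):
        if min_val >= max_val:
            raise ValueError("min_val must be smaller than max_val")
        # Clip so that replacing one record moves the sum by at most max_val - min_val.
        self._values = [min(max(v, min_val), max_val) for v in values]
        self._sensitivity = max_val - min_val
        self._budgets = {}
        self._rng = random.Random(seed)

    def set_budget(self, scientist, epsilon):
        """The data owner grants a privacy budget to one data scientist."""
        self._budgets[scientist] = epsilon

    def remaining(self, scientist):
        """Budget left for this scientist; unknown scientists have none."""
        return self._budgets.get(scientist, 0.0)

    def publish_sum(self, scientist, epsilon):
        """Release a noisy sum and deduct epsilon from the scientist's budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining(scientist):
            raise BudgetExceeded(f"{scientist} has {self.remaining(scientist)} left")
        self._budgets[scientist] -= epsilon
        scale = self._sensitivity / epsilon
        # The difference of two Exp(1) draws is a standard Laplace sample.
        noise = scale * (self._rng.expovariate(1.0) - self._rng.expovariate(1.0))
        return sum(self._values) + noise


dataset = ToyPrivateDataset([31, 45, 27, 52, 38], min_val=0, max_val=100, seed=0)
dataset.set_budget("alice", 1.0)
noisy_total = dataset.publish_sum("alice", epsilon=0.5)
print(noisy_total, dataset.remaining("alice"))
```

A smaller epsilon spends less budget but adds more noise. Once the budget is used up, further queries are refused, which is the intuition behind how a budget limits what a data scientist can learn about the original records.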

Summary

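In short, PySyft is a consumer-facing (to-C) tool framework that helps you quickly set up a small federated learning scenario.
By contrast, FATE is business-facing (to-B): it is all-inclusive and tightly encapsulated, whereas PySyft is more lightweight (though possibly less efficient).
As for data transfer, PySyft offers no material on remote network transport and connectivity, so you have to implement that yourself, whereas FATE's official documentation provides more than one deployment method.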

Posted 2023-10-17 11:37 by ZephyrYin