Big Memory .NET Part 1 – The Challenges in Handling 1 Billion Resident Business Objects
Overview
This article describes the concept of Big Memory and concentrates on its applicability to managed execution models like the one used in Microsoft’s Common Language Runtime (CLR). A few different approaches are suggested to resolve GC pausing issues that arise when a managed process starts to store over a few million objects.
Use Cases
Why do we need to store so many objects in memory?
Say we store several hundred million addresses in an application that needs to calculate routes. The application builds an object graph and stores it together with the addresses, then keeps these objects in memory for days while processing thousands of queries a second.
Or consider social data with lots of messages and responses. An average site may easily add tens of thousands of “social records” daily. Fetching those bits from the database is slow. Many people use out-of-process caches, but my data is already here, in-process; I want a Dictionary with millions of entries.
Then there is the “Internet of Things”, known for producing countless records as devices generate data from sensors (such as fitness trackers) on their own, without human intervention. When users log in to a portal they want to quickly see an “overview” of daily activities, e.g. a pulse plot. A Big Memory cache is a perfect fit for this kind of problem, as the whole daily series from every user device can be kept resident in RAM.
About Big Memory
The term “Big Data” is nothing new. It describes huge data volumes, whether on disk, on the network, or anywhere else. Big Memory facilitates Big Data activities by doing more of the processing on a server or a tight cluster of servers while keeping the data in RAM. The Big Memory approach is also conducive to real-time querying/aggregation/analytics. Think map/reduce in real time: a kind of Hadoop that does not need to “start” and wait until done, but rather a “real-time Hadoop” that keeps working on data in RAM.
Big Memory comes in different forms, primarily heaps and composite data structures. Just like in any “regular” programming language, all complex/composite data structures such as lists, trees, dictionaries, and sets are built around the heap primitives: alloc/read/write/free.
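To make the alloc/read/write/free contract concrete, here is a minimal sketch of what such a heap facade could look like in C#. The `IBigMemoryHeap` interface and the `MemHandle` value are hypothetical names for illustration only, not the API of any particular library.

```csharp
// An opaque "reference" into the big heap; consumers never see a raw address.
public readonly struct MemHandle
{
    public readonly long Address;
    public MemHandle(long address) { Address = address; }
}

// The four heap primitives named above, expressed as a C# interface.
public interface IBigMemoryHeap
{
    MemHandle Alloc(object value);                // alloc: store an object, get a handle back
    object Read(MemHandle handle);                // read: dereference the handle
    void Write(MemHandle handle, object value);   // write: overwrite the slot in place
    void Free(MemHandle handle);                  // free: release the slot by reference
}
```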
Big Memory heaps provide a familiar object model: you obtain references when you allocate objects, dereference them to read and write, and free up space by deallocating by reference. Heaps are very useful for things that need to be traversed, like graph databases or neural networks. In these kinds of systems a reference is all you need to access another object; there is no need to look anything up by key.
Big Memory data structures, including lookup tables, queues, and trees, are then modeled on top of the Big Memory heap.
Big Memory caches are used for lookups; they are actually a form of dictionary/lookup table. Caches are used everywhere these days, and are usually built on top of heaps. Like many other data structures, a cache is a layer of indirection between a caller and a heap.
What is interesting is that both Big Memory heaps and caches may be either local (one machine with 64 gigabytes or more of RAM) or distributed. Speed is an important requirement; since we are in memory, we can have very fast access. Terracotta has been pushing the Big Memory concept in the Java world for a few years now, and Redis and memcached are good examples of quick in-RAM storage.
So, what qualifies a solution as a Big Memory one? It is the purposeful memory utilization pattern. If your solution is purposely built to store/cache lots of data, at the expense of memory consumption, so that it can do real-time tasks that would otherwise have been deferred or batched, it is a Big Memory solution. For example, instead of running a Hadoop job for 5 hours every night against many shards in your DB, you may get the end result at any time, in seconds, from your web servers if they maintain all the rollups in RAM. Storing 100 million pre-computed keys in RAM, updated with every transaction, is orders of magnitude faster than re-querying any data store, given that each update is itself very fast. For this pattern to be effective, the memory utilization has to be specifically designed for the purpose; it isn’t enough to simply allocate multiple gigabytes of memory in a haphazard fashion.
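As a minimal sketch of that “rollups in RAM” pattern, assuming a hypothetical SalesRollups class: every transaction performs a cheap in-memory update, and the web tier reads the aggregate instantly instead of waiting for a batch job.

```csharp
using System.Collections.Concurrent;

public static class SalesRollups
{
    // Bucket key (e.g. "region-west|2015-01-07") -> running total for that bucket.
    private static readonly ConcurrentDictionary<string, decimal> Totals =
        new ConcurrentDictionary<string, decimal>();

    // Called once per transaction: an O(1) in-memory update.
    public static void RecordSale(string bucket, decimal amount) =>
        Totals.AddOrUpdate(bucket, amount, (_, current) => current + amount);

    // Called by the web tier: answers from RAM, no nightly batch job required.
    public static decimal GetTotal(string bucket) =>
        Totals.TryGetValue(bucket, out var total) ? total : 0m;
}
```

Of course, once such a dictionary grows to hundreds of millions of entries it becomes exactly the kind of structure that triggers the GC problem described next.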
Garbage Collection and Big Memory? There is a Problem!
GC managed memory is acceptable for most applications and use cases. But GC does not work well with big in-memory heaps, caches, and other data structures. This issue has been well known in the Java community for many years; here is a video about JVM low-latency which describes similar problems and solutions.
Periodically the GC makes “major” scans. With enough RAM, those scans can “stop the world” for seconds. Not only does this suspend all operations for the duration of the GC cycle, it also empties the CPU’s low-latency caches, so it takes even longer before you return to full speed. What’s worse, from the user’s perspective these pauses happen at random. While there are heuristics that deal with current memory pressure, low water marks, and other checks and balances to determine when the GC will run, they are hidden from the user.
Here is the non-obvious problem: the more physical memory your host has, the more prone it is to GC pauses. Why? When you give the GC more RAM to work with, it procrastinates collecting Gen 2; that is, it keeps postponing the “major work” as long as it can. By allocating many objects very fast you will create enough GC pressure to start sweeps even when 50% of memory is free. If, on the other hand, you allocate objects slowly, around 250 per second, the GC cycle is delayed. On my machine with 64 GB, the GC does not kick in with a full scan until I use around 60 GB. Then there is a pause lasting between 3 and 30 seconds, depending on the size of each object.
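A rough way to observe this on your own machine is to allocate surviving objects in a loop and watch for gaps in the timeline; any large gap is almost certainly a GC pause. This is a diagnostic sketch, not a benchmark; the thresholds and counts are arbitrary, so trim them down on machines with little RAM.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class GcPauseProbe
{
    static void Main()
    {
        var keep = new List<byte[]>();           // hold on to everything: objects survive into Gen 2
        var sw = Stopwatch.StartNew();
        long last = 0;

        for (int i = 0; i < 20_000_000; i++)     // roughly 3 GB of small arrays; reduce on small machines
        {
            keep.Add(new byte[128]);

            long now = sw.ElapsedMilliseconds;
            if (now - last > 100)                // a single iteration never takes this long on its own
                Console.WriteLine($"~{now - last} ms stall near allocation {i:N0}; " +
                                  $"Gen 2 collections so far: {GC.CollectionCount(2)}");
            last = now;
        }
    }
}
```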
This can be particularly troubling when the workload increases dramatically: for example, when servers sit mostly idle throughout the night and then kick into high gear as employees log in at the beginning of their shift. This is when the GC tries to collect as little as possible (so that no long block-all pause happens), yet it is already almost out of resources, because the heap has been slowly growing for the past few hours and is now at 90% RAM utilization.
What makes matters worse: those patterns are very hard to predict, as the business workload may change dynamically. Modern APIs like GCLatencyMode do help in some cases, but they make the whole system more complex and brittle.
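For reference, this is how GCLatencyMode is typically applied (GCSettings lives in System.Runtime). SustainedLowLatency asks the runtime to avoid blocking full collections while the mode is active; it trades throughput and memory for latency, and as noted above it only mitigates the pauses rather than removing them.

```csharp
using System;
using System.Runtime;

class LatencyDemo
{
    static void Main()
    {
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
        try
        {
            // ... latency-critical burst of work against the in-RAM data here ...
            Console.WriteLine($"Running with {GCSettings.LatencyMode}");
        }
        finally
        {
            // Restore the default (Interactive is the default for concurrent workstation GC).
            GCSettings.LatencyMode = GCLatencyMode.Interactive;
        }
    }
}
```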
But we need to store many objects for real-life Big Memory applications!
IO bandwidth places a major limitation on application throughput and latency. For more complex queries, the database may take several minutes to fetch the data from disk. This simply isn’t acceptable to business users who expect real-time responses.
Possible Solutions
On one hand, we want to use .NET and managed memory, because we don’t want to step back to the unsafe, costly development of unmanaged applications in languages such as C++, C, and Delphi.
On the other hand, GC does not allow us to efficiently work with big caches that store a large number of native objects (i.e. reference types) and keep them around for a long time, while still preserving OOP goodies such as virtual methods and interfaces. You can densely pack structs into arrays, but lose the ability to work with subclasses.
The proposed solutions are roughly based on two approaches:
- Make objects, or at least their data/state, “invisible” to GC - this way GC has nothing to scan, or
- Keep objects visible to GC but reuse them at the same location (object pool).
So, one could use:
- Structs instead of classes. Since the GC does not “see” structs (unless they are boxed), this relieves the pressure on the managed GC system. Keeping data this way requires working with large arrays of structs (see the first sketch after this list).
  The disadvantage of this approach is that struct types cannot represent referential models (parent/child/parent) that may be needed within the business logic. Furthermore, structs do not support inheritance, which means they can only be used for the simplest of DTOs. There are workarounds, but they entail simulating OOP features in a rather opaque fashion rarely seen outside of advanced C programming (i.e. the Linux kernel).
- Pre-allocated object pools (see the second sketch after this list). This can stop the GC from shifting objects around in RAM: the objects are “almost pinned”, since they keep being recycled at the same address (most of the time).
  Just like the structs approach, this one is not transparent enough. The objects have to be designed to recycle their internal state; you cannot do a regular new, or Dispose/using (if needed), and any transitive references have to be “pool-prepped” as well. Again, extra work. On the other hand, an object pool still keeps those objects as CLR objects: the GC sees them, and has to visit them, so this does not really relieve any GC pressure.
- Unmanaged RAM + marshalling (see the third sketch after this list). The only thing I like about this is that “marshal” reminds me of Tommy Lee Jones’ stellar performances in “The Fugitive” and “US Marshals”. The idea of copying memory to and from an unmanaged heap is interesting, but without true “walking” of references it is not a generic solution; one cannot just copy some buffer from one RAM chunk to another.
  Yes, there are “zero-copy” serializers, but they use special memory formats incompatible with CLR objects. To use zero-copy, one needs to adopt those object layouts in memory, which shifts the serialization work from the “zero-copy serializer” that does nothing into your business code. Keep in mind, the cost is still there. This approach may be great for some cases, but it cannot be used as a generic solution for polymorphic CLR objects.
- A limited number of pre-allocated byte arrays, with smaller data objects serialized into them. This would require a true memory manager that allocates chunks, and would probably be very slow due to serialization overhead; most likely unusable.
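First sketch: structs packed into one large array. The PricePoint layout is illustrative; the point is that the whole store is a single object from the GC’s point of view, no matter how many entries it holds.

```csharp
using System;

// Plain data, no reference fields: the GC never has to scan the entries.
public struct PricePoint
{
    public long InstrumentId;
    public long TimestampTicks;
    public double Bid;
    public double Ask;
}

class StructStore
{
    static void Main()
    {
        // 10 million entries (~320 MB), yet only ONE object for the GC to track.
        // (Arrays over 2 GB additionally require gcAllowVeryLargeObjects.)
        var points = new PricePoint[10_000_000];

        points[42].InstrumentId = 1001;   // in-place access, no per-entry allocation
        points[42].Bid = 101.25;
        points[42].Ask = 101.27;

        Console.WriteLine($"Spread: {points[42].Ask - points[42].Bid:F2}");
    }
}
```

The moment PricePoint needs a string field or a subclass, this scheme breaks down, which is exactly the limitation described above.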
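Second sketch: a pre-allocated pool, assuming hypothetical Order/OrderPool types. Objects are created once up front and recycled via Reset(), so the steady state allocates nothing; but as noted above, they remain ordinary CLR objects that the GC still has to visit.

```csharp
using System.Collections.Concurrent;

// A "pool-prepped" object: it knows how to wipe its own state for reuse.
public sealed class Order
{
    public long Id;
    public decimal Amount;
    public void Reset() { Id = 0; Amount = 0m; }
}

public sealed class OrderPool
{
    private readonly ConcurrentBag<Order> _free = new ConcurrentBag<Order>();

    public OrderPool(int size)            // pre-allocate everything up front
    {
        for (int i = 0; i < size; i++) _free.Add(new Order());
    }

    // Rent an instance; fall back to a fresh one only if the pool runs dry.
    public Order Rent() => _free.TryTake(out var order) ? order : new Order();

    // Reset and return, so the same CLR object keeps being recycled.
    public void Return(Order order) { order.Reset(); _free.Add(order); }
}
```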
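Third sketch: copying a struct into unmanaged RAM via the Marshal class. The memory on the other side of the copy is invisible to the GC, but note the copy on every access and the manual lifetime management; the Reading type is illustrative.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Reading
{
    public long DeviceId;
    public double Value;
}

class UnmanagedStore
{
    static void Main()
    {
        var reading = new Reading { DeviceId = 7, Value = 36.6 };

        // Allocate unmanaged memory: the GC never sees this block.
        IntPtr ptr = Marshal.AllocHGlobal(Marshal.SizeOf<Reading>());
        try
        {
            Marshal.StructureToPtr(reading, ptr, false);          // copy out of the managed world...
            Reading back = Marshal.PtrToStructure<Reading>(ptr);  // ...and copy back in on demand
            Console.WriteLine($"Device {back.DeviceId}: {back.Value}");
        }
        finally
        {
            Marshal.FreeHGlobal(ptr); // manual lifetime management: no GC safety net here
        }
    }
}
```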
So what do we do now? Can a CLR process efficiently keep hundreds of millions of my objects resident beyond Gen 0 without stalls?
In Part 2, Dmitriy Khmaladze introduces his solution to this problem: NFX Pile. Pile is a hybrid memory manager written in C#, with 100% managed code, designed to act as an in-memory object warehouse.
