Wednesday, May 28, 2014

TCMALLOC and glibc malloc

TCMALLOC

Installation steps:

packages needed


1]    Libunwind
$ tar -zxvf libunwind-1.1.tar.gz
$ cd libunwind-1.1
$ ./configure
$ make
$ make install


2 ] gperftools
$ tar -zxvf  gperftools-2.1.tar.gz
$ cd  gperftools-2.1
$ ./configure
$ make
$ make install




WHY IS IT MORE EFFICIENT ??
A good memory allocator needs to balance a number of goals. Two of these most prominent goals are minimizing time and minimizing space usage. Speed is important for any malloc implementation because if malloc can get faster, thousands of existing applications having bottlenecks on dynamic memory allocation will get significant performance boost without the need to change any code.


Disadvantages of Traditional malloc and an improvement:

  • Lots of wasted space especially for small allocation objects: each Header / Footer occupies 4 bytes (in a 32 bit machine), if coalescing are adopted (without coalescing, only Header is required), every object must be surrounded by Header and Footer, so N- 8 byte objects will account for 16*N bytes.

  • If the size of a free object is bigger than required, but not big enough to carve into smaller objects, then there will be an internal fragmentation.

  • No mechanism to separate small allocations with large (eg. requesting more than 200 MB memory) ones. Cannot adjust the bin size to speed up all kinds of allocations.
  • In a multithreaded application, these data structures need to be protected with locks. As memory is being allocated concurrently in multiple threads, all the threads must wait in line while requests are handled once at a time. Therefore, all threads are competing for access to the same heap causing a problem known as heap contention. Adding a Second Layer (as a thread-cache) on top of the “Base Allocator” can greatly increase the scalability of the allocator .









TCMalloc’s design to counter glibc malloc’s shortages:

  • Memories are allocated by a run of pages instead of arbitrary sizes, thus greatly reduced sbrk or mmap system call overhead.

  • Instead of using Header / Footer, a global page map was used to map between a given page to the location containing info about this page. For 64-bit architecture with 4K pages, 2^52 items have to be mapped. Therefore, a three-level radix tree was used to minimize memory cost, at the beginning, about 4.5 MB are used for the mapping.

  • Separating the small allocation (<= 32KB) with big ones. Also, each thread gets a private thread cache, lock free multi-threaded allocation can be achieved if there is enough space in thread cache.

  • Large object allocations are satisfied by the central page heap, the central heap is NOT thread-safe, so a spinlock has to be taken when allocating from central page heap.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.