Wednesday, December 3, 2014

Thread-specific heap/tcmalloc

Wanted to write this down somewhere for some time, so doing it now.

As I have touched upon before (and as I understand), the threads of a process share its data section as well as its heap (note that when a process is executed by the kernel, the kernel sets up an address space for it consisting of the text segment, the data segment, and the heap/memory arena). Because all threads share the heap, allocation in the heap (e.g. done using malloc, which in turn uses brk or mmap) needs a lock, and this is a costly operation: every time malloc/free is called from any of the threads, a lock has to be taken and memory allocated/deallocated while the other threads wait on the lock.

With tcmalloc, I would expect the heap to be broken up into parts so that each thread allocates from its own part of the heap without locking, plus a centralized heap that a thread grabs memory from when it runs out of space in its local part.
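A minimal sketch of that idea (my own illustration, not tcmalloc's actual internals; it assumes a single size class, while real tcmalloc keeps per-thread caches per size class):

```c
/* Illustrative sketch only (not tcmalloc's real design): each thread keeps
 * a small cache of fixed-size blocks it can reuse without locking; only
 * the central heap (here just malloc behind a mutex) needs the lock. */
#include <pthread.h>
#include <stdlib.h>

#define BLOCK_SIZE  256   /* assume a single size class for simplicity */
#define CACHE_SLOTS 64    /* per-thread free-list depth */

static __thread void *cache[CACHE_SLOTS];  /* thread-local: no lock needed */
static __thread int   cached;

static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;

static void *central_alloc(void)   /* slow path: take the global lock */
{
    pthread_mutex_lock(&central_lock);
    void *p = malloc(BLOCK_SIZE);
    pthread_mutex_unlock(&central_lock);
    return p;
}

void *block_alloc(void)
{
    if (cached > 0)
        return cache[--cached];    /* fast path: no lock taken */
    return central_alloc();        /* local cache empty: hit the central heap */
}

void block_free(void *p)
{
    if (cached < CACHE_SLOTS)
        cache[cached++] = p;       /* give the block back to the local cache */
    else
        free(p);                   /* cache full: release to the central heap */
}
```

The point of the design is that the common case (alloc/free of small blocks) never touches the shared lock.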


(Btw, while on the topic, a quick note about Valgrind's issues with tcmalloc, from the Valgrind man page: "When a shared library is loaded, Valgrind checks for functions in the library that must be replaced or wrapped. For example, Memcheck replaces all malloc related functions (malloc, free, calloc, ...) with its own versions. Such replacements are done by default only in shared libraries whose soname matches a predefined soname pattern (e.g. libc.so* on Linux). By default, no replacement is done for a statically linked library or for alternative libraries such as tcmalloc. In some cases, the replacements allow --soname-synonyms to specify one additional synonym pattern, giving flexibility in the replacement.")
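If I remember the Valgrind manual correctly, the documented workaround for a tcmalloc-linked binary is along the lines of `valgrind --soname-synonyms=somalloc=*tcmalloc* ./myapp` (the exact pattern may vary by Valgrind version, so check your local man page).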

Saturday, October 18, 2014

Rate Control/Traffic Policing/Traffic Shaping

Again, nothing new and something quite well known in the networking literature (maybe networking 101 for some), but putting it out here as a reference because this is used pretty commonly in the networking industry and folks should be aware of it.

Let's say the goal is to rate-control events (an event can be anything: a hit, a click, a byte delivered, etc.) so that incoming traffic can be handled without overwhelming one's servers. The most commonly used (in my experience) algorithms for rate control (shaping, policing) are the token bucket and the leaky bucket, chosen based on one's requirements. However, one may also consider other approaches, e.g. a sliding window. Here are the pros and cons of each approach:

  1. Sliding Window: In this approach, we specify the number of events we will accept within the sliding window, e.g. B events per window, and anything in excess is dropped. The problem, as shown below, is that this cannot handle bursts properly; in a pathological case the algorithm risks shaping the traffic incorrectly over a long duration. As the picture below illustrates, one reason bursts are handled poorly is that the sliding window slides one slot at a time. If instead we slide by a random amount, we have effectively constructed a "randomized algorithm": not 100% accurate rate control, but on average expected to be pretty close to the target rate. Further, one can combine the sliding-window approach with the token-bucket approach as outlined below. 
  2. Leaky Bucket algorithm: In this algorithm, a "bucket" data structure is created, which can be implemented as a queue: packets are inserted into the queue when they arrive and are taken out at a predefined fixed rate. If an egress event occurs and there is nothing in the queue, we say an "underrun" has occurred; if there are too many packets in the queue (more than the configured queue size), we say an "overrun" has occurred. How big a burst can be is configurable and relates to the sustainable cell rate (or sustainable bit rate, or sustainable event rate). [Ref: Wikipedia; image credits: Wikipedia.] This is useful as a jitter buffer, i.e. a buffer to eliminate jitter: packets/events come in at whatever rate, but they go out at a fixed rate. 
  3. Token Bucket algorithm: In this approach, tokens are generated at a constant rate (or whatever rate one wants); arriving events/data are queued up, and when a packet is dequeued (see * below), the number of tokens corresponding to that packet is pulled out of the token bucket (see the sketch after this list). [Image credits: this.] * Typically there is some logic for dequeuing, e.g. (A) when dequeuing media packets that carry a timestamp, e.g. a PTS in an MPEG-2 stream, a packet sits in the queue until the system wall-clock time/timestamp reaches the packet's PTS; (B) sometimes networking queues are simply list-driven, e.g. a reorder queue: one keeps a counter for the expected packet count and keeps egressing packets that arrive in order. When a packet arrives out of order (greater than the one expected), it is queued in a sorted queue; when the expected packet arrives, the elements of the queue that are now in order are dequeued (and passed to the token-check stage). Note that the token bucket algorithm can also run into problems with bursty traffic. One may need a token-debt algorithm to deal with certain scenarios (http://www.google.com/patents/US20110075562), where if the token count becomes negative, egress is stalled until the count becomes positive again. This is good for shaping. What about policing using a token bucket? That can be handled by having a lower bound for the debt: our rate controller/policer may only operate within those bounds of debt. For policing, there could be a long-term debt checker which, if it stays at zero, guarantees that our algorithm is token-non-negative, meaning the given stream is within bounds; if not, we drop. 
  4. Hybrids: one can devise hybrids, e.g. leaky bucket + token bucket, where data comes out of a leaky bucket at a fixed rate and goes into the token-bucket queue at that fixed rate, and rate control and policing can be done at the outgoing-packet level. For policing, the sliding-window approach can also be combined with the token bucket, where some number of tokens is generated every second, say B tokens/sec, so that the burst shown in that example goes through, but had there been another (third) burst, it would be dropped. Thus we would need the negative-debt tokens described in point 3 above.
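Here is a minimal sketch of the token-bucket policer from point 3 (my own illustration; names like tb_allow are made up, and a real implementation would add locking and more careful arithmetic):

```c
/* Token-bucket sketch: rate = tokens added per second, burst = bucket depth. */
#include <stdbool.h>
#include <time.h>

struct token_bucket {
    double tokens;           /* current token count */
    double rate;             /* tokens refilled per second */
    double burst;            /* maximum bucket depth */
    struct timespec last;    /* last refill time */
};

static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
}

void tb_init(struct token_bucket *tb, double rate, double burst)
{
    tb->tokens = burst;      /* start full so an initial burst is allowed */
    tb->rate   = rate;
    tb->burst  = burst;
    clock_gettime(CLOCK_MONOTONIC, &tb->last);
}

/* Returns true if the event may pass; false means drop (policing) or
 * delay until enough tokens accumulate (shaping). */
bool tb_allow(struct token_bucket *tb, double cost)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    tb->tokens += tb->rate * elapsed_sec(&tb->last, &now);  /* lazy refill */
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;                             /* cap at depth */
    tb->last = now;
    if (tb->tokens >= cost) {
        tb->tokens -= cost;                                 /* spend tokens */
        return true;
    }
    return false;
}
```

A token-debt variant as in point 3 would instead let tb->tokens go negative down to some bound, stalling egress until the count recovers.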
Of course, when implementing these, one needs some kind of timer mechanism. One can use the commonly used libevent to create timers via evtimer_new(), evtimer_add(), etc. Note that libevent's timing mechanism is heap-based, where insertions and deletions are O(log n) (see minheap-internal.h).
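For instance, a periodic token-refill tick with libevent might look like this (a sketch using a persistent event; the evtimer_* calls mentioned above are the one-shot wrappers, and tick_cb is a made-up name):

```c
/* Sketch: a periodic 100 ms tick with libevent, e.g. to refill tokens. */
#include <event2/event.h>
#include <stdio.h>

static void tick_cb(evutil_socket_t fd, short what, void *arg)
{
    (void)fd; (void)what; (void)arg;
    printf("tick: refill tokens here\n");
}

int main(void)
{
    struct event_base *base = event_base_new();
    struct timeval tv = { 0, 100000 };             /* 100 ms */
    /* EV_PERSIST re-arms the timer automatically after each callback */
    struct event *ev = event_new(base, -1, EV_PERSIST, tick_cb, NULL);
    event_add(ev, &tv);
    event_base_dispatch(base);                     /* run the event loop */
    event_free(ev);
    event_base_free(base);
    return 0;
}
```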

One can, of course, also implement timers using timing wheels, e.g. http://25thandclement.com/~william/projects/timeout.c.html, where insertions and deletions are O(1). 
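To see where the O(1) comes from, here is a toy single-level timing wheel (illustrative only; timeout.c's actual design is hierarchical, and this sketch assumes every timeout is fewer than WHEEL_SLOTS ticks away):

```c
/* Toy timing wheel: each slot holds a doubly linked list of timers, so
 * insertion and removal are O(1) pointer operations. */
#include <stddef.h>

#define WHEEL_SLOTS 256

struct timer {
    struct timer *next, *prev;
    unsigned slot;
    void (*cb)(void *);
    void *arg;
};

static struct timer *wheel[WHEEL_SLOTS];
static unsigned cur_slot;

void wheel_add(struct timer *t, unsigned ticks_from_now)
{
    unsigned slot = (cur_slot + ticks_from_now) % WHEEL_SLOTS;
    t->slot = slot;
    t->prev = NULL;
    t->next = wheel[slot];             /* O(1): push onto the slot's list */
    if (t->next)
        t->next->prev = t;
    wheel[slot] = t;
}

void wheel_del(struct timer *t)        /* O(1): unlink, given the node */
{
    if (t->prev)
        t->prev->next = t->next;
    else
        wheel[t->slot] = t->next;
    if (t->next)
        t->next->prev = t->prev;
}

void wheel_tick(void)                  /* advance one tick; fire the slot */
{
    cur_slot = (cur_slot + 1) % WHEEL_SLOTS;
    struct timer *t = wheel[cur_slot];
    wheel[cur_slot] = NULL;
    while (t) {
        struct timer *next = t->next;
        t->cb(t->arg);
        t = next;
    }
}
```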

Thursday, August 14, 2014

spinlock vs mutex

Found this info nice. Noting it.

http://stackoverflow.com/questions/5869825/when-should-one-use-a-spinlock-instead-of-mutex

"In homogeneous multi-core environments, if the time spend on critical section is small than use Spinlock, because we can reduce the context switch time. (Single-core comparison is not important, because some systems implementation Spinlock in the middle of the switch"

is what I took away from it. 
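A small, contrived illustration of the two primitives (build with -pthread; the counter workload just stands in for a short critical section):

```c
/* The same tiny critical section protected by a spinlock (contending
 * threads busy-wait, no context switch) vs a mutex (contenders may sleep). */
#include <pthread.h>

static long counter;
static pthread_spinlock_t spin;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void bump_spin(void)        /* contending threads spin on the CPU */
{
    pthread_spin_lock(&spin);
    counter++;              /* critical section is a handful of cycles */
    pthread_spin_unlock(&spin);
}

void bump_mutex(void)       /* contending threads may be descheduled */
{
    pthread_mutex_lock(&mtx);
    counter++;
    pthread_mutex_unlock(&mtx);
}

int main(void)
{
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
    bump_spin();
    bump_mutex();
    pthread_spin_destroy(&spin);
    return 0;
}
```

The takeaway from the quote: the mutex's context switch costs more than a few cycles of spinning, so for very short sections on multi-core machines the spinlock wins.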

Monday, August 4, 2014

CBR over TCP

Here is one study of trying to achieve CBR over TCP:
http://www.cs.columbia.edu/~salman/publications/votcp-conext07.pdf


two scenarios:
(1) Let's assume that an encoder is generating X bytes/second, and this rate is constant at every time instant (however small a time window we consider).
The goal is to send these X bytes at X bytes/second on the network using TCP, i.e. the instantaneous rate should be X bytes/sec.

We know that TCP's send window is chosen as min(receiver's flow-control window, congestion window (cwnd)). If we want true CBR, we would like to send exactly the same number of bytes every second until we experience a loss, and in case of loss somehow compensate for the multiplicative decrease by sending additional data. This is, in my opinion, pretty hard to achieve.
In other words, an instantaneously constant bitrate over TCP is very hard; however, the average behavior over 1 second may be close to CBR.
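For that averaged flavor, application-level pacing is the usual trick. A sketch (pace_send is a made-up helper; the kernel's cwnd and the receiver window still govern the instantaneous wire rate, which is exactly why this only approximates CBR):

```c
/* Pace writes on a TCP socket: send a fixed chunk every 10 ms so the
 * average rate over a second is roughly bytes_per_sec. */
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

void pace_send(int sock, const uint8_t *buf, size_t len, size_t bytes_per_sec)
{
    const size_t chunk = bytes_per_sec / 100;   /* 1/100th of a second's data */
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (size_t off = 0; off < len; ) {
        size_t n = len - off < chunk ? len - off : chunk;
        ssize_t w = write(sock, buf + off, n);  /* may block when cwnd is full */
        if (w <= 0)
            break;
        off += (size_t)w;
        next.tv_nsec += 10 * 1000 * 1000;       /* next 10 ms deadline */
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec  += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}
```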

Wednesday, June 25, 2014

Survey of algorithmic techniques

Follow-up: lots of interesting ideas for me here:

http://www.scottaaronson.com/blog/?p=458

I am going to sift through the literature, note down algorithmic techniques that one commonly finds, and make a map of problem <--> algorithm and its complexity. 

Also, here is a map of the whole of theoretical computer science. It is not of great relevance directly, but can be useful to know. Note also that the recursive languages are a proper subset of the recursively enumerable languages.

http://upload.wikimedia.org/wikipedia/en/6/64/Theoretical_computer_science.svg

Tuesday, March 11, 2014

Computer languages

There are broadly two categories of computer programming languages today, viz.

  1. Imperative/Procedural Languages
    1. unstructured (gotos)
    2. structured (functions)
    3. modular (separation of concerns)
    4. object oriented
    5. subject oriented (who calls the method on an object also matters; this can be modeled in OO by passing the subject as a parameter)
  2. Declarative languages (describe what is to be done, rather than how) 
    1. constraint programming
    2. functional programming
    3. logic programming
    4. SQL
    5. HTML, etc.

Wednesday, February 5, 2014

VQE-C

What we were working on when I was with Cisco:
https://github.com/wmanley/cisco-vqe-client