Some Notes: March 2012

Thursday, March 29, 2012

List of important papers in Computer Science

http://en.wikipedia.org/wiki/List_of_important_publications_in_computer_science

Wednesday, March 28, 2012

Affinity Lattice

lets say there are 3 bits 000, the graph im interested in looks like this

111 ------\

/ \ \

011 101 |

/ \ / \ |

| 001 | |

| | | |

| 000 | |

\ / \ / |

010 100 |

\ / /

110 ------/

note that this is the graph where we go from one node to another if 1 bit changes. What i have drawn above is essentially a cube flattened out. The trouble is that i cannot draw a planar/non-overlapping graph/lattice for 4 dimensional codes.I am curious to find out if it is even possible at all to draw a hypercube on paper without overlapping edges.

Further, we cannot write this in a row-by-row table because the geometry of the table does not allow more than 2 neighbors.

What i am essentially looking for are geometric objects into which these codes can be embedded, and i want to call those geometric objects as "Affinity Objects" perhaps Affinity lattices.

Let me get back on this and find out the literature about this

Monday, March 26, 2012

Intro to File Systems

Quick high-level intro to File Systems on Linux. File System design can be motivated by things such as trying to reduce the seek-time (thereby latency, thereby increase throughput) and reduce the wait-time for write operations, and to achieve ACID (or not).

Berkeley Fast File System: Place "sibling" files in nearby blocks. Boot blocks + superblock + cylinder groups (free-list + inodes(permissions, first 10 data blocks + indirect blocks) + directory blocks + data blocks). The MS-DOS FAT is a simpler version of this.
Journaling File Systems:

Metadata Journaling file systems: Maintain a log/Journal of the meta-operations related to allocating inodes, dirs, blocks, but dont actually store the data being written to blocks, into the log/Journal. [2] e.g. EXT3FS
Log Structured File System[1] [2]: See this for a quick high level understanding. To avoid circular logs and fragmentation, "segments"/pools of free segments etc are used based on Garbage Collection techniques. [Wikipedia]

Hierarchical File Systems: [Wikipedia]
Distributed File Systems [Wikipidia]

Chunked file systems e.g. Google File System

Have a look at this Wikipedia page :http://en.wikipedia.org/wiki/Comparison_of_file_systems

Wednesday, March 14, 2012

Internet Next-Gen

Some Proposals:

Today there is quite a bit of data on the internet that is public. This has in fact been the reason why Google could crawl so much of data. The issue is that when content is going to get locked up into social networks, internet quickly approches the society of humans i.e. privacy/information sharing/hiding etc for political agendas may follow. What is needed here is some governing mechanism to allow sites to interoperate and share data for searches. Some data like Wikipedia is in the best-interests of everyone to be shared. There needs to be thinking about how data is going to be shared across social networking silos.
Named Data Networking: Routers cache index to data. Effort made by routers is to find the path to the router that has access to content, rather than to attempt to connect to some end point, and treats content as the entity of primary importance, than treating destination IP as the entity of primary importance. Data can re thus retrieved based on the URL(i.e. URL without the host)

Monday, March 12, 2012

Notes about HTTPS & caches

http://encosia.com/cripple-the-google-cdns-caching-with-a-single-character/

Sunday, March 11, 2012

On Caches

This is quite a vast topic, because so much of what the work that is done in the Computer Science field is an extension of some-sort-of-cacheing by some entities. One can in fact look upon a major chunk of the computer science field itself as a search for "better" cacheing techniques. (the meaning of "better" changes with applications)

The simplest way of cacheing is of course, store data in a hash table with the key as the "address" and the value as the data. This kind of cacheing is called an associative array or hash based cacheing.

However, we can generalize this process a bit by trying to map the entire address space into a smaller address space. I will start off the discussion with the caches as one finds in Computer Organization books i.e. caching of data one finds in slower memories into faster memories. In such types of cacheing, there are 4 things of interest and we can arrange the 4 things (line-number, set, tag, and element-offset; to be discussed soon) in 4! or 24 ways, so there are at least 24 types of caching possible. I am saying at least 24 because one may use advanced data structures to implement caches which will scale up the total types of caches possible.

Organization:
See figures for an conceptual understanding. Caches are in reality not necessarily flat as shown here. The flat caches as ive drawn is perhaps the simplest way to explain it. I will briefly discuss 2 alternative ways in which cacheing of data from CPU addresses can be achieved.

Type 1: Flat Cache/Tags are part of the Cache
See the following picture. Lets say the main memory is 2^A * k bytes. We can envision the memory address to be broken up into x, y, z as follows. The "address-chunk" is referred to as an "element" in literature.

Type 2: Flat Cache/Note the difference in the tag bits.

Storage Benefits:
Cacheing benefits are achieved by the "tag field" being non-zero (e.g. shown as y > 0 bits). Note that we can potentially store the tag in a separate faster-memory than the cache, and this is often done (see this), but i am for now showing it part of the cache, as the tag is an important part/storage of the caching process.

Search:
The search process for caches is as follows:

Address --> find the cache line, or group of lines (if s > 0) associated with this address
Search "associatively" within this set's tags and look for a match.

If number of sets is 1 i.e. s is 0 bits, we say that the cache is "direct mapped"; and when there is only 1 cache line i.e. we view upon all lines as part of 1 set, then we call it "fully associative" cache. One will also find in the literature the definition that a direct-mapped cache is a set-associative cache with 1 line per set [this]. The terminology only makes sense because of the way "Associative" searches are done for the elements part of a "set" i.e. this is an advancement over the simplistic data structure ive shown i.e. one may do some sort of fast-searches for the tag-section in a set by keeping the tags in a special memory.

Question: What difference do these things (type 1 and type2) make ?

The answer is that in type1, "close" addresses map to the same line (the higher order bits in the address, decide the cache-line); and in type 2, "close" addresses map to different lines (as, relatively lower order bits in the address, decide the cache line)

As a result of this, type 2 is preferred, because cacheing wants to rely on principle of locality i.e. we want close addresses to map to different cache lines so that all those lines are in the cache when the processor needs them, effectively saving latency and increasing throughput.

Cache Eviction Logic:

In the above discussion the eviction logic was that, in case of "direct mapped cache" i.e. each line is a separate set, if the cache line one is interested in, is occupied with a different tag, we kick out that data and load new data.

If we are talking about a set-associative cache, then, we would potentially associatively search for a the tag match in all the lines in the "set", and then we have to decide some mechanism to kick out an entry - this may be done as LRU, LFU, FIFO etc (google for this).

Cache Coherency:

http://en.wikipedia.org/wiki/Cache_coherence

Distributed Caches & Partitioning: See this

Extensions:

Web Cacheing:

One can in principle extend this logic to cacheing in general, e.g. suppose we want to cache the content from www. google.com/blah1 and www.google.com/blah2. May it be nice to use higher-order bits (e.g. the www.google.com to map to a cache-block, or may be we better off using finer granularities (i.e. blah1, blah2) to store data. I would go with the latter for locality of reference. One can see that there is potential for research here too i.e. what is the performance benefits of each style depending on the browsing habits of the user.

Network Buffering:

Often, in networking systems, one may consider the caches to be made up of a pool of buffers separated in zones -- and thus in the above classification scheme, networking caches may be envisioned as; "zone" meaning a set, and buffers in a zone meaning the cache-lines in a set. The eviction scheme is used to figure out which buffer to re-use in a particular zone

Tuesday, March 6, 2012

Components of the Web

The Web is made up of 3 general components.

Origin Servers
Clients
Proxy: (See this, which im reproducing ) There are four kinds of widgets that people might call proxies. Note that the term "widget" below may be either software, hardware, or a combination.

A router is a widget that sends packets back and forth between multiple network segments (i..e., between ethernet, ADSL, and LocalTalk networks all coming in to the same machine, but possibly on different ports or cards). The widget may choose to filter out some packets and not route them. Important characteristic: multiple networks attached.
A gateway is widget that translates data from one protocol (like TCP/IP) to another protocol (like AppleTalk) as it flies past. The widget may also choose to not translate some packets, and filter them out.Important characteristic: protocol translation.
A proxy is a widget that receives requests with its left hand (usually from one network), and then launches similar requests with its right hand (usually on a different network). As the results arrive back at the right hand, they are copied over to the still-waiting left hand and returned to the original client. (It acts as a server with its left hand, and as a client with its right hand.) The widget may filter out certain requests or certain responses and not copy them from hand to hand. Important characteristic: acts as both client and server. When a proxy operates at the packet level it is a transparent proxy, and the clients usually don't know that there's a proxy involved.
A cache is a widget that receives requests and either responds directly from it's own store of information, or forwards the requests "upstream" to another server. When data comes back from the upstream server, this widget keeps a copy of the data for itself, and also returns the data to the original client. Important characteristic: data stored locally.

Cloud Computing Landscape

http://cloudstack.org/blog.html?start=5 led to me nice overviews about the landscape. Will explore further, add more detailed info, better surveys etc. soon.