The Architecture of Infoseek

Date: 
Wednesday, March 6, 2002 - 17:30
Location: 
TH 331
Presenter: 
Kinson Ho
Abstract: 
Infoseek was a high-traffic web site that served highly dynamic and personalized content to its users. All the HTML pages were constructed at the time of request. In this talk I describe the key techniques we used to ensure that the web site would be highly available, fast and scalable. The web site was structured as a tightly integrated distributed system of mostly internally developed software systems. We developed our own high-performance proxy servers, web servers, persistent object stores, a distributed caching server and a search system. The basic building block of these software systems was a lightweight, multithreaded client-server framework with persistent connections. We used N+1 redundancy in every server cluster to guarantee high availability. Software-based solutions were used for server failure detection and load balancing. We used client-side, non-persistent content caching extensively to reduce latency, and to ensure that our system could easily scale up to handle a large number of users. Client cache invalidation based on unicasts and multicasts was used to maintain consistency between the caches and the persistent content servers. The search system used a distributed architecture so that the search index could grow without increasing search latency, and used large in-memory caches to reduce the need for slow disk accesses. Custom-designed binary transport and encoding protocols were used between client-server pairs in the performance-sensitive parts of the distributed system to reduce network latency and the overhead of packet parsing.
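
As a rough illustration of the client-side caching and invalidation idea described above, the following minimal Python sketch may help. It is not Infoseek's implementation: the class names ContentServer and CachingClient are hypothetical, and an in-process callback loop stands in for the unicast or multicast invalidation messages, whose actual wire format is not described here.

```python
from typing import Callable, Dict, List


class ContentServer:
    """Hypothetical stand-in for a persistent content server."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self._listeners: List[Callable[[str], None]] = []
        self.reads = 0  # number of fetches that actually reach the server

    def subscribe(self, on_invalidate: Callable[[str], None]) -> None:
        self._listeners.append(on_invalidate)

    def get(self, key: str) -> str:
        self.reads += 1
        return self._store[key]

    def put(self, key: str, value: str) -> None:
        self._store[key] = value
        # Assumption: this loop replaces a real unicast/multicast message.
        for notify in self._listeners:
            notify(key)


class CachingClient:
    """Client with a non-persistent (in-memory) content cache."""

    def __init__(self, server: ContentServer) -> None:
        self._server = server
        self._cache: Dict[str, str] = {}
        server.subscribe(self._invalidate)

    def _invalidate(self, key: str) -> None:
        # Drop the stale entry; the next get() refetches from the server.
        self._cache.pop(key, None)

    def get(self, key: str) -> str:
        if key not in self._cache:
            self._cache[key] = self._server.get(key)
        return self._cache[key]


server = ContentServer()
server.put("headline", "v1")
a, b = CachingClient(server), CachingClient(server)

first = (a.get("headline"), b.get("headline"))  # each client fetches once
repeat = a.get("headline")                      # served from a's cache
reads_before_update = server.reads

server.put("headline", "v2")                    # invalidates both caches
after = (a.get("headline"), b.get("headline"))  # both refetch the new value
```

In this sketch repeated reads are served locally, and an update costs each client one refetch rather than a server round trip on every request, which is the latency and scalability benefit the abstract attributes to client-side caching.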
Bio: 

Kinson Ho was part of the core engineering group at Infoseek (GO.com) from 1997 to 2001. He worked on parts of the multithreaded client-server framework, implemented a parallel library used in the search system, and developed a multicast-based cache invalidation mechanism. He has considerable experience in troubleshooting site-wide distributed system issues, and in performance monitoring, analysis and tuning.