无奈,换 zeus 了,很坚挺,商业的就是商业的。 由 soff 发表于 Wed Sep 13 13:39:01 2006 3. Re:Lighttpd+Squid+Apache搭建高效率Web服务器 His results look weird, and as a result his conclusion is wrong.
Squid does not boost dynamic pages at all; the speed gain in his test is because his client is requesting the same page in parallel, and squid will return the same page for the concurrent requests. I also guess that he did not configure an expire time for static content in his web server, so Squid will try to refetch the file with an If-Modified-Since header for each request. That's why squid performs poorly in the static test. 由 kxn 发表于 Wed Sep 13 13:41:24 2006 4. Re:Lighttpd+Squid+Apache搭建高效率Web服务器 不太同意这一点,对Squid而言,动态页面和静态页面是一样的,只要设置好HTTP头;如果不设置Expires,是没有缓存效果的。如果不能Cache动态页面的话,那怎么起到加速效果? 由 davies 发表于 Wed Sep 13 13:42:00 2006 5. Re:Lighttpd+Squid+Apache搭建高效率Web服务器 不好意思,英语不好,误导你了,上午在单位的机器没法输入中文。动态页面除非正确设置HTTP的过期时间头,否则就是没有加速效果的。反过来说,静态页面也需要设置过期时间头才对。
你实际测试动态页面有性能提升,这有几种可能:一是你的测试用的是并发请求同一个页面,squid对并发的同页面请求,如果拿到的结果里面没有 no-cache 头,会把这一个结果同时发回给所有请求,相当于有一个非常短时间的cache,测试结果看起来会好很多,但是实际因为请求同一页面的机会不是很多,所以基本没有啥改进;另一种情况是你用的动态页面程序是支持 If-Modified-Since 头的,它如果判断这个时间以后没有修改过,就直接返回 Not Modified,速度也会加快很多。
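下面是一段示意性的 PHP 片段(非原帖内容,仅为说明上文提到的 If-Modified-Since 机制;用脚本文件时间代表内容更新时间只是假设,实际应取业务数据的更新时间):

<?php
// 动态页面支持条件请求:内容未变化时直接返回 304,
// Squid/浏览器即可复用本地缓存,省去重新生成页面的开销。
$lastModified = filemtime(__FILE__);   // 假设:以脚本文件时间代表内容更新时间
$gmtMtime = gmdate('D, d M Y H:i:s', $lastModified) . ' GMT';

header('Last-Modified: ' . $gmtMtime);

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    $_SERVER['HTTP_IF_MODIFIED_SINCE'] == $gmtMtime) {
    header('HTTP/1.1 304 Not Modified'); // 内容未修改,直接返回 304
    exit;
}

echo '页面正文 ...';
?>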
使用软件方式来实现基于网络地址转换的负载均衡则要实际的多,除了一些厂商提供的解决方法之外,更有效的方法是使用免费的自由软件来完成这项任务。其中包括Linux Virtual Server Project中的NAT实现方式,或者本文作者在FreeBSD下对natd的修订版本。一般来讲,使用这种软件方式来实现地址转换,中心负载均衡器存在带宽限制,在100Mbps的快速以太网条件下,能得到最快达80Mbps的带宽,然而在实际应用中,可能只有40Mbps-60Mbps的可用带宽。
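以 LVS 的 NAT 方式为例,下面是一个示意性的 ipvsadm 配置(IP 地址均为假设,仅用来说明上文所述软件 NAT 负载均衡的思路):

# 在负载均衡器上定义虚拟服务 202.96.1.1:80,调度算法为轮询(rr)
ipvsadm -A -t 202.96.1.1:80 -s rr
# 以 NAT(masquerading)方式加入两台内网真实服务器
ipvsadm -a -t 202.96.1.1:80 -r 192.168.0.2:80 -m
ipvsadm -a -t 202.96.1.1:80 -r 192.168.0.3:80 -m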
开源平台的高并发集群思考 目前碰到的高并发应用,需要高性能需求的主要是两个方面 1。网络 2。数据库 这两个方面的解决方式其实还是一致的 1。充分接近单机的性能瓶颈,自我优化 2。单机搞不定的时候( 数据传输瓶颈: 单位时间内磁盘读写/网络数据包的收发 cpu计算瓶颈),把负荷分担给多台机器,就是所谓的负载均衡 网络方面单机的处理 1。底层包收发处理的模式变化(从select 模式到epoll / kevent) 2。应用模式的变化 2.1 应用层包的构造方式 2.2 应用协议的实现 2.3 包的缓冲模式 2.4 单线程到多线程 网络负载均衡的几个办法 1。代理模式:代理服务器只管收发包,收到包以后转给后面的应用服务器群(服务器群后可能还会有一堆堆的数据库服务器等等),并且把返回的结果再返回给请求端 2。虚拟代理ip:代理服务器收发包还负载太高,那就增加多台代理服务器,都来管包的转发。这些代理服务器可以用统一的虚拟ip,也可以单独的ip 3。p2p:一些广播的数据可以p2p的模式来减轻服务器的网络压力 数据库(指mysql)单机的处理 1。数据库本身结构的设计优化(分表,分记录,目的在于保证每个表的记录数在可定的范围内) 2。sql语句的优化 3。master + slave模式 数据库集群的处理 1。master + slave模式 (可有效地处理并发查询) 2。mysql cluster 模式 (可有效地处理并发数据变化) 相关资料: http://dev.mysql.com/doc/refman/5.0/en/ndbcluster.html 大型、高负载网站架构和应用初探 时间:30-45分钟 开题:163,sina,sohu等网站他们有很多应用程序都是PHP写的,为什么他们究竟是如何能做出同时跑几千人甚至上万同时在线应用程序呢? • 挑选性能更好web服务器 o 单台 Apache web server 性能的极限 o 选用性能更好的web server TUX,lighttpd,thttpd … o 动,静文件分开,混合使用 • 应用程序优化,Cache的使用和共享 o 常见的缓存技术 生成静态文件 对象持久化 serialize & unserialize o Need for Speed ,在最快的地方做 cache Linux 系统下的 /dev/shm tmpfs/ramdisk php内置的 shared memory function /IPC memcached MySQL的HEAP表 o 多台主机共享cache NFS,memcached,MySQL 优点和缺点比较 • MySQL数据库优化 o 配置 my.cnf,设置更大的 cache size o 利用 phpMyAdmin 找出配置瓶颈,榨干机器的每一点油 o 集群(热同步,mysql cluster) • 集群,提高网站可用性 o 最简单的集群,设置多条A记录,DNS轮询,可用性问题 o 确保高可用性和伸缩性能的成熟集群解决方案 通过硬件实现,如路由器,F5 network 通过软件或者操作系统实现 基于内核,通过修改TCP/IP数据报文负载均衡,并确保伸缩性的 LVS以及 确保可用性守护进程ldirectord 基于 layer 7,通过URL分发的 HAproxy o 数据共享问题 NFS,Samba,NAS,SAN o 案例 • 解决南北互通,电信和网通速度问题 o 双线服务器 o CDN 根据用户IP转换到就近服务器的智能DNS,dnspod … Squid 反向代理,(优点,缺点) o 案例 http://blog.yening.cn/2007/03/25/226.html#more-226 说说大型高并发高负载网站的系统架构 By Michael 转载请保留出处:俊麟 Michael’s blog ( http://www.toplee.com/blog/?p=71) Trackback Url : http://www.toplee.com/blog/wp-trackback.php?p=71 我在CERNET做过拨号接入平台的搭建,而后在Yahoo&3721从事过搜索引擎前端开发,又在MOP处理过大型社区猫扑大杂烩的架构升级等工作,同时自己接触和开发过不少大中型网站的模块,因此在大型网站应对高负载和并发的解决方案上有一些积累和经验,可以和大家一起探讨一下。
一个小型的网站,比如个人网站,可以使用最简单的html静态页面就实现了,配合一些图片达到美化效果,所有的页面均存放在一个目录下,这样的网站对系统架构、性能的要求都很简单,随着互联网业务的不断丰富,网站相关的技术经过这些年的发展,已经细分到很细的方方面面,尤其对于大型网站来说,所采用的技术更是涉及面非常广,从硬件到软件、编程语言、数据库、WebServer、防火墙等各个领域都有了很高的要求,已经不是原来简单的html静态网站所能比拟的。 大型网站,比如门户网站。在面对大量用户访问、高并发请求方面,基本的解决方案集中在这样几个环节:使用高性能的服务器、高性能的数据库、高效率的编程语言、还有高性能的Web容器。但是除了这几个方面,还没法根本解决大型网站面临的高负载和高并发问题。 上面提供的几个解决思路在一定程度上也意味着更大的投入,并且这样的解决思路具备瓶颈,没有很好的扩展性,下面我从低成本、高性能和高扩张性的角度来说说我的一些经验。 1、HTML静态化 其实大家都知道,效率最高、消耗最小的就是纯静态化的html页面,所以我们尽可能使我们的网站上的页面采用静态页面来实现,这个最简单的方法其实也是最有效的方法。但是对于大量内容并且频繁更新的网站,我们无法全部手动去挨个实现,于是出现了我们常见的信息发布系统CMS,像我们常访问的各个门户站点的新闻频道,甚至他们的其他频道,都是通过信息发布系统来管理和实现的,信息发布系统可以实现最简单的信息录入自动生成静态页面,还能具备频道管理、权限管理、自动抓取等功能,对于一个大型网站来说,拥有一套高效、可管理的CMS是必不可少的。 除了门户和信息发布类型的网站,对于交互性要求很高的社区类型网站来说,尽可能的静态化也是提高性能的必要手段,将社区内的帖子、文章进行实时的静态化,有更新的时候再重新静态化也是大量使用的策略,像Mop的大杂烩就是使用了这样的策略,网易社区等也是如此。目前很多博客也都实现了静态化,我使用的这个Blog程序WordPress还没有静态化,所以如果面对高负载访问, www.toplee.com一定不能承受 同时,html静态化也是某些缓存策略使用的手段,对于系统中频繁使用数据库查询但是内容更新很小的应用,可以考虑使用html静态化来实现,比如论坛中论坛的公用设置信息,这些信息目前的主流论坛都可以进行后台管理并且存储再数据库中,这些信息其实大量被前台程序调用,但是更新频率很小,可以考虑将这部分内容进行后台更新的时候进行静态化,这样避免了大量的数据库访问请求。 在进行html静态化的时候可以使用一种折中的方法,就是前端使用动态实现,在一定的策略下进行定时静态化和定时判断调用,这个能实现很多灵活性的操作,我开发的台球网站故人居( www.8zone.cn)就是使用了这样的方法,我通过设定一些html静态化的时间间隔来对动态网站内容进行缓存,达到分担大部分的压力到静态页面上,可以应用于中小型网站的架构上。故人居网站的地址: http://www.8zone.cn,顺便提一下,有喜欢台球的朋友多多支持我这个免费网站:) 2、图片服务器分离 大家知道,对于Web服务器来说,不管是Apache、IIS还是其他容器,图片是最消耗资源的,于是我们有必要将图片与页面进行分离,这是基本上大型网站都会采用的策略,他们都有独立的图片服务器,甚至很多台图片服务器。这样的架构可以降低提供页面访问请求的服务器系统压力,并且可以保证系统不会因为图片问题而崩溃。 在应用服务器和图片服务器上,可以进行不同的配置优化,比如Apache在配置ContentType的时候可以尽量少支持,尽可能少的LoadModule,保证更高的系统消耗和执行效率。 我的台球网站故人居8zone.cn也使用了图片服务器架构上的分离,目前是仅仅是架构上分离,物理上没有分离,由于没有钱买更多的服务器:),大家可以看到故人居上的图片连接都是类似img.9tmd.com或者img1.9tmd.com的URL。 另外,在处理静态页面或者图片、js等访问方面,可以考虑使用lighttpd代替Apache,它提供了更轻量级和更高效的处理能力。 3、数据库集群和库表散列 大型网站都有复杂的应用,这些应用必须使用数据库,那么在面对大量访问的时候,数据库的瓶颈很快就能显现出来,这时一台数据库将很快无法满足应用,于是我们需要使用数据库集群或者库表散列。 在数据库集群方面,很多数据库都有自己的解决方案,Oracle、Sybase等都有很好的方案,常用的MySQL提供的Master/Slave也是类似的方案,您使用了什么样的DB,就参考相应的解决方案来实施即可。 上面提到的数据库集群由于在架构、成本、扩张性方面都会受到所采用DB类型的限制,于是我们需要从应用程序的角度来考虑改善系统架构,库表散列是常用并且最有效的解决方案。我们在应用程序中安装业务和应用或者功能模块将数据库进行分离,不同的模块对应不同的数据库或者表,再按照一定的策略对某个页面或者功能进行更小的数据库散列,比如用户表,按照用户ID进行表散列,这样就能够低成本的提升系统的性能并且有很好的扩展性。sohu的论坛就是采用了这样的架构,将论坛的用户、设置、帖子等信息进行数据库分离,然后对帖子、用户按照板块和ID进行散列数据库和表,最终可以在配置文件中进行简单的配置便能让系统随时增加一台低成本的数据库进来补充系统性能。 4、缓存 缓存一词搞技术的都接触过,很多地方用到缓存。网站架构和网站开发中的缓存也是非常重要。这里先讲述最基本的两种缓存。高级和分布式的缓存在后面讲述。 架构方面的缓存,对Apache比较熟悉的人都能知道Apache提供了自己的mod_proxy缓存模块,也可以使用外加的Squid进行缓存,这两种方式均可以有效的提高Apache的访问响应能力。 网站程序开发方面的缓存,Linux上提供的Memcached是常用的缓存方案,不少web编程语言都提供memcache访问接口,php、perl、c和java都有,可以在web开发中使用,可以实时或者Cron的把数据、对象等内容进行缓存,策略非常灵活。一些大型社区使用了这样的架构。 另外,在使用web语言开发的时候,各种语言基本都有自己的缓存模块和方法,PHP有Pear的Cache模块和eAccelerator加速和Cache模块,还要知名的Apc、XCache(国人开发的,支持!)php缓存模块,Java就更多了,.net不是很熟悉,相信也肯定有。 5、镜像 镜像是大型网站常采用的提高性能和数据安全性的方式,镜像的技术可以解决不同网络接入商和地域带来的用户访问速度差异,比如ChinaNet和EduNet之间的差异就促使了很多网站在教育网内搭建镜像站点,数据进行定时更新或者实时更新。在镜像的细节技术方面,这里不阐述太深,有很多专业的现成的解决架构和产品可选。也有廉价的通过软件实现的思路,比如Linux上的rsync等工具。 6、负载均衡 负载均衡将是大型网站解决高负荷访问和大量并发请求采用的终极解决办法。 负载均衡技术发展了多年,有很多专业的服务提供商和产品可以选择,我个人接触过一些解决方法,其中有两个架构可以给大家做参考。另外有关初级的负载均衡DNS轮循和较专业的CDN架构就不多说了。 6.1 硬件四层交换 第四层交换使用第三层和第四层信息包的报头信息,根据应用区间识别业务流,将整个区间段的业务流分配到合适的应用服务器进行处理。 第四层交换功能就象是虚IP,指向物理服务器。它传输的业务服从的协议多种多样,有HTTP、FTP、NFS、Telnet或其他协议。这些业务在物理服务器基础上,需要复杂的载量平衡算法。在IP世界,业务类型由终端TCP或UDP端口地址来决定,在第四层交换中的应用区间则由源端和终端IP地址、TCP和UDP端口共同决定。 在硬件四层交换产品领域,有一些知名的产品可以选择,比如Alteon、F5等,这些产品很昂贵,但是物有所值,能够提供非常优秀的性能和很灵活的管理能力。Yahoo中国当初接近2000台服务器使用了三四台Alteon就搞定了。 6.2 软件四层交换 大家知道了硬件四层交换机的原理后,基于OSI模型来实现的软件四层交换也就应运而生,这样的解决方案实现的原理一致,不过性能稍差。但是满足一定量的压力还是游刃有余的,有人说软件实现方式其实更灵活,处理能力完全看你配置的熟悉能力。 软件四层交换我们可以使用Linux上常用的LVS来解决,LVS就是Linux Virtual 
Server,他提供了基于心跳线heartbeat的实时灾难应对解决方案,提高系统的鲁棒性,同时可供了灵活的虚拟VIP配置和管理功能,可以同时满足多种应用需求,这对于分布式的系统来说必不可少。 一个典型的使用负载均衡的策略就是,在软件或者硬件四层交换的基础上搭建squid集群,这种思路在很多大型网站包括搜索引擎上被采用,这样的架构低成本、高性能还有很强的扩张性,随时往架构里面增减节点都非常容易。这样的架构我准备空了专门详细整理一下和大家探讨。 总结: 对于大型网站来说,前面提到的每个方法可能都会被同时使用到,Michael这里介绍得比较浅显,具体实现过程中很多细节还需要大家慢慢熟悉和体会,有时一个很小的squid参数或者apache参数设置,对于系统性能的影响就会很大,希望大家一起讨论,达到抛砖引玉之效。 转载请保留出处:俊麟 Michael’s blog ( http://www.toplee.com/blog/?p=71) Trackback Url : http://www.toplee.com/blog/wp-trackback.php?p=71 This entry is filed under 其他技术, 技术交流. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site. (2 votes, average: 6.5 out of 10) Loading ... 58 Responses to “说说大型高并发高负载网站的系统架构” 1 pi1ot says: April 29th, 2006 at 1:00 pm Quote 各模块间或者进程间的通信普遍异步化队列化也相当重要,可以兼顾轻载重载时的响应性能和系统压力,数据库压力可以通过file cache分解到文件系统,文件系统io压力再通过mem cache分解,效果很不错. 2 Exception says: April 30th, 2006 at 4:40 pm Quote 写得好!现在,网上像这样的文章不多,看完受益匪浅 3 guest says: May 1st, 2006 at 8:13 am Quote 完全胡说八道! “大家知道,对于Web服务器来说,不管是Apache、IIS还是其他容器,图片是最消耗资源的”,你以为是在内存中动态生成图片啊.无论是什么文件,在容器输出时只是读文件,输出给response而已,和是什么文件有什么关系. 关键是静态文件和动态页面之间应该采用不同策略,如静态文件应该尽量缓存,因为无论你请求多少次输出内容都是相同的,如果用户页面中有二十个就没有必要请求二十次,而应该使用缓存.而动态页面每次请求输出都不相同(否则就应该是静态的),所以不应该缓存. 所以即使在同一服务器上也可以对静态和动态资源做不同优化,专门的图片服务器那是为了资源管理的方便,和你说的性能没有关系. 4 Michael says: May 2nd, 2006 at 1:15 am Quote 动态的缓存案例估计楼上朋友没有遇到过,在处理inktomi的搜索结果的案例中,我们使用的全部是面对动态的缓存,对于同样的关键词和查询条件来说,这样的缓存是非常重要的,对于动态的内容缓存,编程时使用合理的header参数可以方便的管理缓存的策略,比如失效时间等。 我们说到有关图片影响性能的问题,一般来说都是出自于我们的大部分访问页面中图片往往比html代码占用的流量大,在同等网络带宽的情况下,图片传输需要的时间更长,由于传输需要花很大开销在建立连接上,这会延长用户client端与server端的http连接时长,这对于apache来说,并发性能肯定会下降,除非你的返回全部是静态的,那就可以把 httpd.conf 中的 KeepAlives 为 off ,这样可以减小连接处理时间,但是如果图片过多会导致建立的连接次数增多,同样消耗性能。 另外我们提到的理论更多的是针对大型集群的案例,在这样的环境下,图片的分离能有效的改进架构,进而影响到性能的提升,要知道我们为什么要谈架构?架构可能为了安全、为了资源分配、也为了更科学的开发和管理,但是终极目都是为了性能。 另外在RFC1945的HTTP协议文档中很容易找到有关Mime Type和Content length部分的说明,这样对于理解图片对性能影响是很容易的。 楼上的朋友完全是小人作为,希望别用guest跟我忽悠,男人还害怕别人知道你叫啥?再说了,就算说错了也不至于用胡说八道来找茬!大家重在交流和学习,我也不是什么高人,顶多算个普通程序员而已。 5 Ken Kwei says: June 3rd, 2006 at 3:42 pm Quote Michael 您好,这篇文章我看几次了,有一个问题,您的文章中提到了如下一段: “对于交互性要求很高的社区类型网站来说,尽可能的静态化也是提高性能的必要手段,将社区内的帖子、文章进行实时的静态化,有更新的时候再重新静态化也是大量使用的策略,像Mop的大杂烩就是使用了这样的策略,网易社区等也是如此。” 对于大型的站点来说,他的数据库和 Web Server 一般都是分布式的,在多个区域都有部署,当某个地区的用户访问时会对应到一个节点上,如果是对社区内的帖子实时静态化,有更新时再重新静态化,那么在节点之间如何立刻同步呢?数据库端如何实现呢?如果用户看不到的话会以为发帖失败?造成重复发了,那么如何将用户锁定在一个节点上呢,这些怎么解决?谢谢。 6 Michael says: June 3rd, 2006 at 3:57 pm Quote 对于将一个用户锁定在某个节点上是通过四层交换来实现的,一般情况下是这样,如果应用比较小的可以通过程序代码来实现。大型的应用一般通过类似LVS和硬件四层交换来管理用户连接,可以制定策略来使用户的连接在生命期内保持在某个节点上。 静态化和同步的策略比较多,一般采用的方法是集中或者分布存储,但是静态化却是通过集中存储来实现的,然后使用前端的proxy群来实现缓存和分担压力。 7 javaliker says: June 10th, 2006 at 6:38 pm Quote 希望有空跟你学习请教网站负载问题。 8 barrycmster says: June 19th, 2006 at 4:14 pm Quote Great website! Bookmarked! I am impressed at your work! 9 heiyeluren says: June 21st, 2006 at 10:39 am Quote 一般对于一个中型网站来说,交互操作非常多,日PV百万左右,如何做合理的负载? 10 Michael says: June 23rd, 2006 at 3:15 pm Quote heiyeluren on June 21, 2006 at 10:39 am said: 一般对于一个中型网站来说,交互操作非常多,日PV百万左右,如何做合理的负载? 交互如果非常多,可以考虑使用集群加Memory Cache的方式,把不断变化而且需要同步的数据放入Memory Cache里面进行读取,具体的方案还得需要结合具体的情况来分析。 11 donald says: June 27th, 2006 at 5:39 pm Quote 请问,如果一个网站处于技术发展期,那么这些优化手段应该先实施哪些后实施哪些呢? 或者说从成本(技术、人力和财力成本)方面,哪些先实施能够取得最大效果呢? 12 Michael says: June 27th, 2006 at 9:16 pm Quote donald on June 27, 2006 at 5:39 pm said: 请问,如果一个网站处于技术发展期,那么这些优化手段应该先实施哪些后实施哪些呢? 或者说从成本(技术、人力和财力成本)方面,哪些先实施能够取得最大效果呢? 
先从服务器性能优化、代码性能优化方面入手,包括webserver、dbserver的优化配置、html静态化等容易入手的开始,这些环节争取先榨取到最大化的利用率,然后再考虑从架构上增加投入,比如集群、负载均衡等方面,这些都需要在有一定的发展积累之后再做考虑比较恰当。 13 donald says: June 30th, 2006 at 4:39 pm Quote 恩,多谢Michael的耐心讲解 14 Ade says: July 6th, 2006 at 11:58 am Quote 写得好,为人也不错. 15 ssbornik says: July 17th, 2006 at 2:39 pm Quote Very good site. Thanks for author! 16 echonow says: September 1st, 2006 at 2:28 pm Quote 赞一个先,是一篇很不错的文章,不过要真正掌握里面的东西恐怕还是需要时间和实践! 先问一下关于图片服务器的问题了! 我的台球网站故人居9tmd.com也使用了图片服务器架构上的分离,目前是仅仅是架构上分离,物理上没有分离,由于没有钱买更多的服务器:),大家可以看到故人居上的图片连接都是类似img.9tmd.com或者img1.9tmd.com的URL。 这个,楼主这个img.9tmd.com是虚拟主机吧,也就是说是一个apache提供的服务吧,这样的话对于性能的提高也很有意义吗?还是只是铺垫,为了方便以后的物理分离呢? 17 Michael says: September 1st, 2006 at 3:05 pm Quote echonow on September 1, 2006 at 2:28 pm said: 赞一个先,是一篇很不错的文章,不过要真正掌握里面的东西恐怕还是需要时间和实践! 先问一下关于图片服务器的问题了! 我的台球网站故人居9tmd.com也使用了图片服务器架构上的分离,目前是仅仅是架构上分离,物理上没有分离,由于没有钱买更多的服务器:),大家可以看到故人居上的图片连接都是类似img.9tmd.com或者img1.9tmd.com的URL。 这个,楼主这个img.9tmd.com是虚拟主机吧,也就是说是一个apache提供的服务吧,这样的话对于性能的提高也很有意义吗?还是只是铺垫,为了方便以后的物理分离呢? 这位朋友说得很对,因为目前只有一台服务器,所以从物理上无法实现真正的分离,暂时使用虚拟主机来实现,是为了程序设计和网站架构上的灵活,如果有了一台新的服务器,我只需要把图片镜像过去或者同步过去,然后把img.9tmd.com的dns解析到新的服务器上就自然实现了分离,如果现在不从架构和程序上实现,今后这样的分离就会比较痛苦:) 18 echonow says: September 7th, 2006 at 4:59 pm Quote 谢谢lz的回复,现在主要实现问题是如何能在素材上传时直接传到图片服务器上呢,总不至于每次先传到web,然后再同步到图片服务器吧 19 Michael says: September 7th, 2006 at 11:25 pm Quote echonow on September 7, 2006 at 4:59 pm said: 谢谢lz的回复,现在主要实现问题是如何能在素材上传时直接传到图片服务器上呢,总不至于每次先传到web,然后再同步到图片服务器吧 通过samba或者nfs实现是比较简单的方法。然后使用squid缓存来降低访问的负载,提高磁盘性能和延长磁盘使用寿命。 20 echonow says: September 8th, 2006 at 9:42 am Quote 多谢楼主的耐心指导,我先研究下,用共享区来存储确实是个不错的想法! 21 Michael says: September 8th, 2006 at 11:16 am Quote echonow on September 8, 2006 at 9:42 am said: 多谢楼主的耐心指导,我先研究下,用共享区来存储确实是个不错的想法! 不客气,欢迎常交流! 22 fanstone says: September 11th, 2006 at 2:26 pm Quote Michael,谢谢你的好文章。仔细看了,包括回复,受益匪浅。 Michael on June 27, 2006 at 9:16 pm said: donald on June 27, 2006 at 5:39 pm said: 请问,如果一个网站处于技术发展期,那么这些优化手段应该先实施哪些后实施哪些呢? 或者说从成本(技术、人力和财力成本)方面,哪些先实施能够取得最大效果呢? 先从服务器性能优化、代码性能优化方面入手,包括webserver、dbserver的优化配置、html静态化等容易入手的开始,这些环节争取先榨取到最大化的利用率,然后再考虑从架构上增加投入,比如集群、负载均衡等方面,这些都需要在有一定的发展积累之后再做考虑比较恰当。 尤其这个部分很是有用,因为我也正在建一个电子商务类的网站,由于是前期阶段,费用的问题毕竟有所影响,所以暂且只用了一台服务器囊括过了整个网站。除去前面所说的图片服务器分离,还有什么办法能在网站建设初期尽可能的为后期的发展做好优化(性能优化,系统合理构架,前面说的websever、dbserver优化,后期譬如硬件等扩展尽可能不要过于烦琐等等)? 也就是所谓的未雨绸缪了,尽可能在先期考虑到后期如果发展壮大的需求,预先做好系统规划,并且在前期资金不足的情况下尽量做到网站以最优异的性能在运行。关于达到这两个要求,您可以给我一些稍稍详细一点的建议和技术参考吗?谢谢! 看了你的文章,知道你主要关注*nix系统架构的,我的是.net和win2003的,不过我觉得这个影响也不大。主要关注点放在外围的网站优化上。 谢谢!希望能得到您的一些好建议。 23 Michael says: September 11th, 2006 at 2:55 pm Quote 回复fanstone: 关于如何在网站的前期尽可能低成本的投入,做到性能最大化利用,同时做好后期系统架构的规划,这个问题可以说已经放大到超出技术范畴,不过和技术相关的部分还是有不少需要考虑的。 一个网站的规划关键的就是对阶段性目标的规划,比如预测几个月后达到什么用户级别、存储级别、并发请求数,然后再过几个月又将什么情况,这些预测必须根据具体业务和市场情况来进行预估和不断调整的,有了这些预测数据作为参考,就能进行技术架构的规划,否则技术上是无法合理进行架构设计的。 在网站发展规划基础上,考虑今后要提供什么样的应用?有些什么样的域名关系?各个应用之间的业务逻辑和关联是什么?面对什么地域分布的用户提供服务?等等。。。 上面这些问题有助于规划网站服务器和设备投入,同时从技术上可以及早预测到未来将会是一个什么架构,在满足这个架构下的每个节点将需要满足什么条件,就是初期架构的要求。 总的来说,不结合具体业务的技术规划是没有意义的,所以首先是业务规划,也就是产品设计,然后才是技术规划。 24 fanstone says: September 11th, 2006 at 8:52 pm Quote 谢谢解答,我再多看看! 25 Roc says: March 22nd, 2007 at 11:48 pm Quote 很好的文章,楼主说的方法非常适用,目前我们公司的网站也是按照楼主所说的方法进行设计的,效果比较好,利于以后的扩展,另外我再补充一点,其实楼主也说了,网站的域名也需要提前考虑和规划,比如网站的图片内容比较多,可以按应用图片的类型可以根据不同的业务需求采用不同的域名img1~imgN等,便于日后的扩展和移至,希望楼主能够多发一些这样的好文章。 26 zhang says: April 3rd, 2007 at 9:08 am Quote 图片服务器与主数据分离的问题。 图片是存储在硬盘里好还是存储在数据库里好? 请您分硬盘和数据库两种情况解释下面的疑问。 当存放图片的服务器容量不能满足要求时如何办? 当存放图片的服务器负载不能满足要求时如何办? 
谢谢。 27 Michael says: April 3rd, 2007 at 2:29 pm Quote zhang on April 3, 2007 at 9:08 am said: 图片服务器与主数据分离的问题。 图片是存储在硬盘里好还是存储在数据库里好? 请您分硬盘和数据库两种情况解释下面的疑问。 当存放图片的服务器容量不能满足要求时如何办? 当存放图片的服务器负载不能满足要求时如何办? 谢谢。 肯定是存储在硬盘里面,出现存储在数据库里面的说法实际上是出自一些虚拟主机或者租用空间的个人网站和企业网站,因为网站数据量小,也为了备份方便,从大型商业网站来说,没有图片存储在数据库里面的大型应用。数据库容量和效率都会是极大的瓶颈。 你提到的后面两个问题。容量和负载基本上是同时要考虑的问题,容量方面,大部分的解决方案都是使用海量存储,比如专业的盘阵,入门级的磁盘柜或者高级的光纤盘阵、局域网盘阵等,这些都是主要的解决方案。记得我原来说过,如果是考虑低成本,一定要自己使用便宜单台服务器来存储,那就需要从程序逻辑上去控制,比如你可以多台同样的服务器来存储,分别提供NFS的分区给前端应用使用,在前端应用的程序逻辑中自己去控制存储在哪一台服务器的NFS分区上,比如根据Userid或者图片id、或者别的逻辑去进行散列,这个和我们规划大型数据库存储散列分表或者分库存储的逻辑类似。 基本上图片负载高的解决办法有两种,前端squid缓存和镜像,通过对存储设备(服务器或者盘阵)使用镜像,可以分布到多台服务器上对外提供图片服务,然后再配合squid缓存实现负载的降低和提高用户访问速度。 希望能回答了您的问题。 28 Michael says: April 3rd, 2007 at 2:41 pm Quote Roc on March 22, 2007 at 11:48 pm said: 很好的文章,楼主说的方法非常适用,目前我们公司的网站也是按照楼主所说的方法进行设计的,效果比较好,利于以后的扩展,另外我再补充一点,其实楼主也说了,网站的域名也需要提前考虑和规划,比如网站的图片内容比较多,可以按应用图片的类型可以根据不同的业务需求采用不同的域名img1~imgN等,便于日后的扩展和移至,希望楼主能够多发一些这样的好文章。 欢迎常来交流,还希望能得到你的指点。大家相互学习。 29 zhang says: April 4th, 2007 at 11:39 pm Quote 非常感谢您的回复, 希望将来有合作的机会。 再次感谢。 30 Charles says: April 9th, 2007 at 2:50 pm Quote 刚才一位朋友把你的 BLOG 发给我看,问我是否认识你,所以我就仔细看了一下你的 BLOG,发现这篇文章。 很不错的一篇文章,基本上一个大型网站需要做的事情都已经提及了。我自己也曾任职于三大门户之一,管理超过 100 台的 SQUID 服务器等,希望可以也分享一下我的经验和看法。 1、图片服务器分离 这个观点是我一直以来都非常支持的。特别是如果程序与图片都放在同一个 APAHCE 的服务器下,每一个图片的请求都有可能导致一个 HTTPD 进程的调用,而 HTTPD 如果包含有 PHP 模块的的时候,就会占用过多的内存,而这个是没有任何必要的。 使用独立的图片服务器不但可以避免以上这个情况,更可以对不同的使用性质的图片设置不同的过期时间,以便同一个用户在不同页面访问相同图片时不会再次从服务器(基于是缓存服务器)取数据,不但止快速,而且还省了带宽。还有就是,对于缓存的时间上,亦可以做调立的调节。 在我过往所管理的图片服务器中,不但止是将图片与应用及页面中分离出来,还是为不同性质的图片启用不同的域名。以缓解不同性质图片带来的压力。例如 photo.img.domain.com 这个域名是为了摄影服务的,平时使用 5 台 CACHE,但到了 5.1 长假期后,就有可能需要独立为他增加至 10 台。而增加的这 5 台可以从其他负载较低的图片服务器中调动过来临时使用。 2、数据库集群 一套 ORACLE RAC 的集群布置大概在 40W 左右,这个价格对于一般公司来说,是没有必要的。因为 WEB 的应用逻辑相对较简单,而 ORACLE 这些大型数据库的价值在于数据挖掘,而不在于简单的存储。所以选择 MySQL 或 PostgreSQL 会实际一些。 简单的 MySQL 复制就可以实现较好的效果。读的时候从 SLAVE 读,写的时候才到 MASTER 上更新。实际的情况下,MySQL 的复制性能非常好,基本上不会带来太高的更新延时。使用 balance ( http://www.inlab.de/balance.html)这个软件,在本地(127.0.0.1)监听 3306 端口,再映射多个 SLAVE 数据库,可以实现读取的负载均衡。 3、图片保存于磁盘还是数据库? 
对于这个问题,我亦有认真地考虑过。如果是在 ext3 的文件系统下,建 3W 个目录就到极限了,而使用 xfs 的话就没有这个限制。图片的存储,如果需要是大量的保存,必须要分隔成很多个小目录,否则就会有 ext3 只能建 3W 目录的限制,而且文件数及目录数太多会影响磁盘性能。还没有算上空间占用浪费等问题。 更更重要的是,对于一个大量小文件的数据备份,要占用极大的资源和非常长的时间。在这些问题前面,可能将图片保存在数据库是个另外的选择。 可以尝试将图片保存到数据库,前端用 PHP 程序返回实际的图片,再在前端放置一个 SQUID 的服务器,可以避免性能问题。那么图片的备份问题,亦可以利用 MySQL 的数据复制机制来实现。这个问题就可以得到较好的解决了。 4、页面的静态化我就不说了,我自己做的 wordpress 就完全实现了静态化,同时能很好地兼顾动态数据的生成。 5、缓存 我自己之前也提出过使用 memcached,但实际使用中不是非常特别的理想。当然,各个应用环境不一致会有不一致的使用结果,这个并不重要。只要自己觉得好用就用。 6、软件四层交换 LVS 的性能非常好,我有朋友的网站使用了 LVS 来做负责均衡的调度器,数据量非常大都可以轻松支撑。当然是使用了 DR 的方式。 其实我自己还想过可以用 LVS 来做 CDN 的调度。例如北京的 BGP 机房接受用户的请求,然后通过 LVS 的 TUN 方式,将请求调度到电信或网通机房的实际物理服务器上,直接向用户返回数据。 这种是 WAN 的调度,F5 这些硬件设备也应用这样的技术。不过使用 LVS 来实现费用就大大降低。 以上都只属个人观点,能力有限,希望对大家有帮助。 :) 31 Michael says: April 9th, 2007 at 8:17 pm Quote 很少见到有朋友能在我得blog上留下这么多有价值的东西,代表别的可能看到这篇文章的朋友一起感谢你。 balance ( http://www.inlab.de/balance.html) 这个东西准备看一下。 32 Michael says: April 16th, 2007 at 1:29 am Quote 如果要说3Par的光纤存储局域网技术细节,我无法给您太多解释,我对他们的产品没有接触也没有了解,不过从SAN的概念上是可以知道大概框架的,它也是一种基于光纤通道的存储局域网,可以支持远距离传输和较高的系统扩展性,传统的SAN使用专门的FC光通道SCSI磁盘阵列,从你提供的内容来看,3Par这个东西建立在低成本的SATA或FATA磁盘阵列基础上,这一方面能降低成本,同时估计3Par在技术上有创新和改进,从而提供了廉价的高性能存储应用。 这个东西细节只有他们自己知道,你就知道这是个商业的SAN (存储局域网,说白了也是盘阵,只是通过光纤通道独立于系统外的)。 33 zhang says: April 16th, 2007 at 2:10 am Quote myspace和美国的许多银行都更换为了3Par 请您在百忙之中核实一下,是否确实像说的那么好。 下面是摘抄: Priceline.com是一家以销售空座机票为主的网络公司,客户数量多达680万。该公司近期正在内部实施一项大规模的SAN系统整合计划,一口气购进了5套3PARdata的服务器系统,用以替代现有的上百台Sun存储阵列。如果该方案部署成功的话,将有望为Priceline.com节省大量的存储管理时间、资本开销及系统维护费用。 Priceline.com之前一直在使用的SAN系统是由50台光纤磁盘阵列、50台SCSI磁盘阵列和15台存储服务器构成的。此次,该公司一举购入了5台3Par S400 InServ Storage Servers存储服务器,用以替代原来的服务器系统,使得设备整体能耗、占用空间及散热一举降低了60%。整个系统部署下来,总存储容量将逼近30TB。 Priceline的首席信息官Ron Rose拒绝透露该公司之前所使用的SAN系统设备的供应商名称,不过,消息灵通人士表示,PriceLine原来的存储环境是由不同型号的Sun系统混合搭建而成的。 “我们并不愿意随便更换系统供应商,不过,3Par的存储系统所具备的高投资回报率,实在令人难以抗拒,”Rose介绍说,“我们给了原来的设备供应商以足够的适应时间,希望它们的存储设备也能够提供与3Par一样的效能,最后,我们失望了。如果换成3Par的存储系统的话,短期内就可以立刻见到成效。” Rose接着补充说,“原先使用的那套SAN系统,并没有太多让我们不满意的地方,除了欠缺一点灵活性之外:系统的配置方案堪称不错,但并不是最优化的。使用了大量价格偏贵的光纤磁盘,许多SAN端口都是闲置的。” 自从更换成3Par的磁盘阵列之后,该公司存储系统的端口数量从90个骤减为24个。“我们购买的软件许可证都是按端口数量来收费的。每增加一个端口,需要额外支付500-1,500美元,单单这一项,就为我们节省了一笔相当可观的开支,”Rose解释说。而且,一旦启用3Par的精简自动配置软件,系统资源利用率将有望提升30%,至少在近一段时间内该公司不必考虑添置新的磁盘系统。 精简自动配置技术最大的功效就在于它能够按照应用程序的实际需求来分配存储资源,有效地降低了空间闲置率。如果当前运行的应用程序需要额外的存储资源的话,该软件将在不干扰应用程序正常运行的前提下,基于“按需”和“公用”的原则来自动发放资源空间,避免了人力干预,至少为存储管理员减轻了一半以上的工作量。 3Par的磁盘阵列是由低成本的SATA和FATA(即:低成本光纤信道接口)磁盘驱动器构成的,而并非昂贵的高效能FC磁盘,大大降低了系统整体成本。 3Par推出的SAN解决方案,实际上是遵循了“允许多个分布式介质服务器共享通过光纤信道SAN 连接的公共的集中化存储设备”的设计理念。“这样一来,就不必给所有的存储设备都外挂一个代理服务程序了,”Rose介绍说。出于容灾容错和负载均衡的考虑,Priceline搭建了两个生产站点,每一个站点都部署了一套3Par SAN系统。此外,Priceline还购买了两台3Par Inservs服务器,安置在主数据中心内,专门用于存放镜像文件。第5台服务器设置在Priceline的企业资料处理中心内,用于存放数据仓库;第6台服务器设置在实验室内,专门用于进行实际网站压力测试。 MySpace目前采用了一种新型SAN设备——来自加利福尼亚州弗里蒙特的3PARdata。在3PAR的系统里,仍能在逻辑上按容量划分数据存储,但它不再被绑定到特定磁盘或磁盘簇,而是散布于大量磁盘。这就使均分数据访问负荷成为可能。当数据库需要写入一组数据时,任何空闲磁盘都可以马上完成这项工作,而不再像以前那样阻塞在可能已经过载的磁盘阵列处。而且,因为多个磁盘都有数据副本,读取数据时,也不会使SAN的任何组件过载。 3PAR宣布,VoIP服务供应商Cbeyond Communications已向它订购了两套InServ存储服务器,一套应用于该公司的可操作支持系统,一套应用于测试和开发系统环境。3PAR的总部设在亚特兰大,该公司的产品多销往美国各州首府和大城市,比如说亚特兰大、达拉斯、丹佛、休斯顿、芝加哥,等等。截至目前为止,3PAR售出的服务器数量已超过了15,000台,许多客户都是来自于各行各业的龙头企业,它们之所以挑选3PAR的产品,主要是看中了它所具备的高性能、可扩展性、操作简单、无比伦比的性价比等优点,另外,3PAR推出的服务器系列具有高度的集成性能,适合应用于高速度的T1互联网接入、本地和长途语音服务、虚拟主机(Web hosting)、电子邮件、电话会议和虚拟个人网络(VPN)等服务领域。 亿万用户网站MySpace的成功秘密 ◎ 文 / David F. 
Carr 译 / 罗小平 高速增长的访问量给社区网络的技术体系带来了巨大挑战。MySpace的开发者多年来不断重构站点软件、数据库和存储系统,以期与自身的成长同步——目前,该站点月访问量已达400亿。绝大多数网站需要应对的流量都不及MySpace的一小部分,但那些指望迈入庞大在线市场的人,可以从MySpace的成长过程学到知识。 MySpace开发人员已经多次重构站点软件、数据库和存储系统,以满足爆炸性的成长需要,但此工作永不会停息。“就像粉刷金门大桥,工作完成之时,就是重新来过之日。”(译者注:意指工人从桥头开始粉刷,当到达桥尾时,桥头涂料已经剥落,必须重新开始)MySpace技术副总裁Jim Benedetto说。 既然如此,MySpace的技术还有何可学之处?因为MySpace事实上已经解决了很多系统扩展性问题,才能走到今天。 Benedetto说他的项目组有很多教训必须总结,他们仍在学习,路漫漫而修远。他们当前需要改进的工作包括实现更灵活的数据缓存系统,以及为避免再次出现类似7月瘫痪事件的地理上分布式架构。 背景知识 当然,这么多的用户不停发布消息、撰写评论或者更新个人资料,甚至一些人整天都泡在Space上,必然给MySpace的技术工作带来前所未有的挑战。而传统新闻站点的绝大多数内容都是由编辑团队整理后主动提供给用户消费,它们的内容数据库通常可以优化为只读模式,因为用户评论等引起的增加和更新操作很少。而MySpace是由用户提供内容,数据库很大比例的操作都是插入和更新,而非读取。 浏览MySpace上的任何个人资料时,系统都必须先查询数据库,然后动态创建页面。当然,通过数据缓存,可以减轻数据库的压力,但这种方案必须解决原始数据被用户频繁更新带来的同步问题。 MySpace的站点架构已经历了5个版本——每次都是用户数达到一个里程碑后,必须做大量的调整和优化。Benedetto说,“但我们始终跟不上形势的发展速度。我们重构重构再重构,一步步挪到今天”。 在每个里程碑,站点负担都会超过底层系统部分组件的最大载荷,特别是数据库和存储系统。接着,功能出现问题,用户失声尖叫。最后,技术团队必须为此修订系统策略。 虽然自2005年早期,站点账户数超过7百万后,系统架构到目前为止保持了相对稳定,但MySpace仍然在为SQL Server支持的同时连接数等方面继续攻坚,Benedetto说,“我们已经尽可能把事情做到最好”。 里程碑一:50万账户 按Benedetto 的说法,MySpace最初的系统很小,只有两台Web服务器和一个数据库服务器。那时使用的是Dell双CPU、4G内存的系统。 单个数据库就意味着所有数据都存储在一个地方,再由两台Web服务器分担处理用户请求的工作量。但就像MySpace后来的几次底层系统修订时的情况一样,三服务器架构很快不堪重负。此后一个时期内,MySpace基本是通过添置更多Web服务器来对付用户暴增问题的。 但到在2004年早期,MySpace用户数增长到50万后,数据库服务器也已开始汗流浃背。 但和Web服务器不同,增加数据库可没那么简单。如果一个站点由多个数据库支持,设计者必须考虑的是,如何在保证数据一致性的前提下,让多个数据库分担压力。 在第二代架构中,MySpace运行在3个SQL Server数据库服务器上——一个为主,所有的新数据都向它提交,然后由它复制到其他两个;另两个全力向用户供给数据,用以在博客和个人资料栏显示。这种方式在一段时间内效果很好——只要增加数据库服务器,加大硬盘,就可以应对用户数和访问量的增加。 里程碑二:1-2百万账户 MySpace注册数到达1百万至2百万区间后,数据库服务器开始受制于I/O容量——即它们存取数据的速度。而当时才是2004年中,距离上次数据库系统调整不过数月。用户的提交请求被阻塞,就像千人乐迷要挤进只能容纳几百人的夜总会,站点开始遭遇“主要矛盾”,Benedetto说,这意味着MySpace永远都会轻度落后于用户需求。 “有人花5分钟都无法完成留言,因此用户总是抱怨说网站已经完蛋了。”他补充道。 这一次的数据库架构按照垂直分割模式设计,不同的数据库服务于站点的不同功能,如登录、用户资料和博客。于是,站点的扩展性问题看似又可以告一段落了,可以歇一阵子。 垂直分割策略利于多个数据库分担访问压力,当用户要求增加新功能时,MySpace将投入新的数据库予以支持它。账户到达2百万后,MySpace还从存储设备与数据库服务器直接交互的方式切换到SAN(Storage Area Network,存储区域网络)——用高带宽、专门设计的网络将大量磁盘存储设备连接在一起,而数据库连接到SAN。这项措施极大提升了系统性能、正常运行时间和可靠性,Benedetto说。 里程碑三:3百万账户 当用户继续增加到3百万后,垂直分割策略也开始难以为继。尽管站点的各个应用被设计得高度独立,但有些信息必须共享。在这个架构里,每个数据库必须有各自的用户表副本——MySpace授权用户的电子花名册。这就意味着一个用户注册时,该条账户记录必须在9个不同数据库上分别创建。但在个别情况下,如果其中某台数据库服务器临时不可到达,对应事务就会失败,从而造成账户非完全创建,最终导致此用户的该项服务无效。 另外一个问题是,个别应用如博客增长太快,那么专门为它服务的数据库就有巨大压力。 2004年中,MySpace面临Web开发者称之为“向上扩展”对“向外扩展”(译者注:Scale Up和Scale Out,也称硬件扩展和软件扩展)的抉择——要么扩展到更大更强、也更昂贵的服务器上,要么部署大量相对便宜的服务器来分担数据库压力。一般来说,大型站点倾向于向外扩展,因为这将让它们得以保留通过增加服务器以提升系统能力的后路。 但成功地向外扩展架构必须解决复杂的分布式计算问题,大型站点如Google、Yahoo和Amazon.com,都必须自行研发大量相关技术。以Google为例,它构建了自己的分布式文件系统。 另外,向外扩展策略还需要大量重写原来软件,以保证系统能在分布式服务器上运行。“搞不好,开发人员的所有工作都将白费”,Benedetto说。 因此,MySpace首先将重点放在了向上扩展上,花费了大约1个半月时间研究升级到32CPU服务器以管理更大数据库的问题。Benedetto说,“那时候,这个方案看似可能解决一切问题。”如稳定性,更棒的是对现有软件几乎没有改动要求。 糟糕的是,高端服务器极其昂贵,是购置同样处理能力和内存速度的多台服务器总和的很多倍。而且,站点架构师预测,从长期来看,即便是巨型数据库,最后也会不堪重负,Benedetto说,“换句话讲,只要增长趋势存在,我们最后无论如何都要走上向外扩展的道路。” 因此,MySpace最终将目光移到分布式计算架构——它在物理上分布的众多服务器,整体必须逻辑上等同于单台机器。拿数据库来说,就不能再像过去那样将应用拆分,再以不同数据库分别支持,而必须将整个站点看作一个应用。现在,数据库模型里只有一个用户表,支持博客、个人资料和其他核心功能的数据都存储在相同数据库。 既然所有的核心数据逻辑上都组织到一个数据库,那么MySpace必须找到新的办法以分担负荷——显然,运行在普通硬件上的单个数据库服务器是无能为力的。这次,不再按站点功能和应用分割数据库,MySpace开始将它的用户按每百万一组分割,然后将各组的全部数据分别存入独立的SQL Server实例。目前,MySpace的每台数据库服务器实际运行两个SQL Server实例,也就是说每台服务器服务大约2百万用户。Benedetto指出,以后还可以按照这种模式以更小粒度划分架构,从而优化负荷分担。 当然,还是有一个特殊数据库保存了所有账户的名称和密码。用户登录后,保存了他们其他数据的数据库再接管服务。特殊数据库的用户表虽然庞大,但它只负责用户登录,功能单一,所以负荷还是比较容易控制的。 里程碑四:9百万到1千7百万账户 2005年早期,账户达到9百万后,MySpace开始用Microsoft的C#编写ASP.NET程序。C#是C语言的最新派生语言,吸收了C++和Java的优点,依托于Microsoft .NET框架(Microsoft为软件组件化和分布式计算而设计的模型架构)。ASP.NET则由编写Web站点脚本的ASP技术演化而来,是Microsoft目前主推的Web站点编程环境。 
可以说是立竿见影,MySpace马上就发现ASP.NET程序运行更有效率,与ColdFusion相比,完成同样任务需消耗的处理器能力更小。据技术总监Whitcomb说,新代码需要150台服务器完成的工作,如果用ColdFusion则需要246台。Benedetto还指出,性能上升的另一个原因可能是在变换软件平台,并用新语言重写代码的过程中,程序员复审并优化了一些功能流程。 最终,MySpace开始大规模迁移到ASP.NET。即便剩余的少部分ColdFusion代码,也从Cold-Fusion服务器搬到了ASP.NET,因为他们得到了BlueDragon.NET(乔治亚州阿尔法利塔New Atlanta Communications公司的产品,它能将ColdFusion代码自动重新编译到Microsoft平台)的帮助。 账户达到1千万时,MySpace再次遭遇存储瓶颈问题。SAN的引入解决了早期一些性能问题,但站点目前的要求已经开始周期性超越SAN的I/O容量——即它从磁盘存储系统读写数据的极限速度。 原因之一是每数据库1百万账户的分割策略,通常情况下的确可以将压力均分到各台服务器,但现实并非一成不变。比如第七台账户数据库上线后,仅仅7天就被塞满了,主要原因是佛罗里达一个乐队的歌迷疯狂注册。 某个数据库可能因为任何原因,在任何时候遭遇主要负荷,这时,SAN中绑定到该数据库的磁盘存储设备簇就可能过载。“SAN让磁盘I/O能力大幅提升了,但将它们绑定到特定数据库的做法是错误的。”Benedetto说。 最初,MySpace通过定期重新分配SAN中数据,以让其更为均衡的方法基本解决了这个问题,但这是一个人工过程,“大概需要两个人全职工作。”Benedetto说。 长期解决方案是迁移到虚拟存储体系上,这样,整个SAN被当作一个巨型存储池,不再要求每个磁盘为特定应用服务。MySpace目前采用了一种新型SAN设备——来自加利福尼亚州弗里蒙特的3PARdata。 在3PAR的系统里,仍能在逻辑上按容量划分数据存储,但它不再被绑定到特定磁盘或磁盘簇,而是散布于大量磁盘。这就使均分数据访问负荷成为可能。当数据库需要写入一组数据时,任何空闲磁盘都可以马上完成这项工作,而不再像以前那样阻塞在可能已经过载的磁盘阵列处。而且,因为多个磁盘都有数据副本,读取数据时,也不会使SAN的任何组件过载。 当2005年春天账户数达到1千7百万时,MySpace又启用了新的策略以减轻存储系统压力,即增加数据缓存层——位于Web服务器和数据库服务器之间,其唯一职能是在内存中建立被频繁请求数据对象的副本,如此一来,不访问数据库也可以向Web应用供给数据。换句话说,100个用户请求同一份资料,以前需要查询数据库100次,而现在只需1次,其余都可从缓存数据中获得。当然如果页面变化,缓存的数据必须从内存擦除,然后重新从数据库获取——但在此之前,数据库的压力已经大大减轻,整个站点的性能得到提升。 缓存区还为那些不需要记入数据库的数据提供了驿站,比如为跟踪用户会话而创建的临时文件——Benedetto坦言他需要在这方面补课,“我是数据库存储狂热分子,因此我总是想着将万事万物都存到数据库。”但将像会话跟踪这类的数据也存到数据库,站点将陷入泥沼。 增加缓存服务器是“一开始就应该做的事情,但我们成长太快,以致于没有时间坐下来好好研究这件事情。”Benedetto补充道。 里程碑五:2千6百万账户 2005年中期,服务账户数达到2千6百万时,MySpace切换到了还处于beta测试的SQL Server 2005。转换何太急?主流看法是2005版支持64位处理器。但Benedetto说,“这不是主要原因,尽管这也很重要;主要还是因为我们对内存的渴求。”支持64位的数据库可以管理更多内存。 更多内存就意味着更高的性能和更大的容量。原来运行32位版本的SQL Server服务器,能同时使用的内存最多只有4G。切换到64位,就好像加粗了输水管的直径。升级到SQL Server 2005和64位Windows Server 2003后,MySpace每台服务器配备了32G内存,后于2006年再次将配置标准提升到64G。 意外错误 如果没有对系统架构的历次修改与升级,MySpace根本不可能走到今天。但是,为什么系统还经常吃撑着了?很多用户抱怨的“意外错误”是怎么引起的呢? 
原因之一是MySpace对Microsoft的Web技术的应用已经进入连Microsoft自己也才刚刚开始探索的领域。比如11月,超出SQL Server最大同时连接数,MySpace系统崩溃。Benedetto说,这类可能引发系统崩溃的情况大概三天才会出现一次,但仍然过于频繁了,以致惹人恼怒。一旦数据库罢工,“无论这种情况什么时候发生,未缓存的数据都不能从SQL Server获得,那么你就必然看到一个‘意外错误’提示。”他解释说。 去年夏天,MySpace的Windows 2003多次自动停止服务。后来发现是操作系统一个内置功能惹的祸——预防分布式拒绝服务攻击(黑客使用很多客户机向服务器发起大量连接请求,以致服务器瘫痪)。MySpace和其他很多顶级大站点一样,肯定会经常遭受攻击,但它应该从网络级而不是依靠Windows本身的功能来解决问题——否则,大量MySpace合法用户连接时也会引起服务器反击。 “我们花了大约一个月时间寻找Windows 2003服务器自动停止的原因。”Benedetto说。最后,通过Microsoft的帮助,他们才知道该怎么通知服务器:“别开枪,是友军。” 紧接着是在去年7月某个周日晚上,MySpace总部所在地洛杉矶停电,造成整个系统停运12小时。大型Web站点通常要在地理上分布配置多个数据中心以预防单点故障。本来,MySpace还有其他两个数据中心以应对突发事件,但Web服务器都依赖于部署在洛杉矶的SAN。没有洛杉矶的SAN,Web服务器除了恳求你耐心等待,不能提供任何服务。 Benedetto说,主数据中心的可靠性通过下列措施保证:可接入两张不同电网,另有后备电源和一台储备有30天燃料的发电机。但在这次事故中,不仅两张电网失效,而且在切换到备份电源的过程中,操作员烧掉了主动力线路。 2007年中,MySpace在另两个后备站点上也建设了SAN。这对分担负荷大有帮助——正常情况下,每个SAN都能负担三分之一的数据访问量。而在紧急情况下,任何一个站点都可以独立支撑整个服务,Benedetto说。 MySpace仍然在为提高稳定性奋斗,虽然很多用户表示了足够信任且能原谅偶现的错误页面。 “作为开发人员,我憎恶Bug,它太气人了。”Dan Tanner这个31岁的德克萨斯软件工程师说,他通过MySpace重新联系到了高中和大学同学。“不过,MySpace对我们的用处很大,因此我们可以原谅偶发的故障和错误。” Tanner说,如果站点某天出现故障甚至崩溃,恢复以后他还是会继续使用。 这就是为什么Drew在论坛里咆哮时,大部分用户都告诉他应该保持平静,如果等几分钟,问题就会解决的原因。Drew无法平静,他写道,“我已经两次给MySpace发邮件,而它说一小时前还是正常的,现在出了点问题……完全是一堆废话。”另一个用户回复说,“毕竟它是免费的。”Benedetto坦承100%的可靠性不是他的目标。“它不是银行,而是一个免费的服务。”他说。 换句话说,MySpace的偶发故障可能造成某人最后更新的个人资料丢失,但并不意味着网站弄丢了用户的钱财。“关键是要认识到,与保证站点性能相比,丢失少许数据的故障是可接受的。”Benedetto说。所以,MySpace甘冒丢失2分钟到2小时内任意点数据的危险,在SQL Server配置里延长了“checkpoint”操作——它将待更新数据永久记录到磁盘——的间隔时间,因为这样做可以加快数据库的运行。 Benedetto说,同样,开发人员还经常在几个小时内就完成构思、编码、测试和发布全过程。这有引入Bug的风险,但这样做可以更快实现新功能。而且,因为进行大规模真实测试不具可行性,他们的测试通常是在仅以部分活跃用户为对象,且用户对软件新功能和改进不知就里的情况下进行的。因为事实上不可能做真实的加载测试,他们做的测试通常都是针对站点。 “我们犯过大量错误,”Benedetto说,“但到头来,我认为我们做对的还是比做错的多。” 34 zhang says: April 16th, 2007 at 2:15 am Quote 了解联合数据库服务器 为达到最大型网站所需的高性能级别,多层系统一般在多个服务器之间平衡每一层的处理负荷。SQL Server 2005 通过对 SQL Server 数据库中的数据进行水平分区,在一组服务器之间分摊数据库处理负荷。这些服务器独立管理,但协作处理应用程序的数据库请求;这样一组协作服务器称为“联合体”。 只有在应用程序将每个 SQL 语句发送到包含该语句所需的大部分数据的成员服务器时,联合数据库层才能达到非常高的性能级别。这称为使用语句所需的数据来配置 SQL 语句。使用所需的数据来配置 SQL 语句不是联合服务器所特有的要求。群集系统也有此要求。 虽然服务器联合体与单个数据库服务器对应用程序来说是一样的,但在实现数据库服务层的方式上存在内部差异, 35 Michael says: April 16th, 2007 at 3:18 am Quote 关于MySpace是否使用了3Par的SAN,并且起到多大的关键作用,我也无法考证,也许可以通过在MySpace工作的朋友可以了解到,但是从各种数据和一些案例来看,3Par的确可以改善成本过高和存储I/O性能问题,但是实际应用中,除非电信、银行或者真的类似MySpace这样规模的站点,的确很少遇到存储连SAN都无法满足的情况,另外,对于数据库方面,据我知道,凡电信、金融和互联网上电子商务关键数据应用,基本上Oracle才是最终的解决方案。 包括我曾经工作的Yahoo,他们全球超过70%以上应用使用MySQL,但是和钱相关的或者丢失数据会承担责任的应用,都是使用Oracle。在UDB方面,我相信Yahoo的用户数一定超过MySpace的几千万。 事实上,国内最值得研究的案例应该是腾讯,腾讯目前的数据量一定是惊人的,在和周小旻的一次短暂对话中知道腾讯的架构专家自己实现了大部分的技术,细节我无法得知。 36 Michael says: April 16th, 2007 at 3:23 am Quote 图片存储到数据库我依然不赞同,不过一定要这么做也不是不可以,您提到的使用CGI程序输出到用户客户端,基本上每种web编程语言都支持,只要能输出标准的HTTP Header信息就可以了,比如PHP可以使用 header(”content-type:image/jpeg\r\n”); 语句输出当前http返回的文件mime类型为图片,同时还有更多的header()函数可以输出的HTTP Header信息,包括 content-length 之类的(提供range 断点续传需要),具体的可以参考PHP的手册。 另外,perl、asp、jsp这些都提供类似的实现方法,这不是语言问题,而是一个HTTP协议问题。 37 zhang says: April 16th, 2007 at 8:52 am Quote 早晨,其实已经是上午,起床后, 看到您凌晨3:23的回复,非常感动。 一定注意身体。 好像您还没有太太, 太太和孩子也像正规程序一样,会良好地调节您的身体。 千万不要使用野程序调节身体,会中毒。 开个玩笑。 38 zhang says: April 16th, 2007 at 8:59 am Quote 看到您凌晨3:23的回复, 非常感动! 一定注意身体, 好像您还没有太太, 太太和孩子就像正规程序一样,能够良好地调节您的身体, 千万不要使用野程序调节身体,会中毒。 开个玩笑。 39 Michael says: April 16th, 2007 at 11:04 am Quote zhang on April 16, 2007 at 8:59 am said: 看到您凌晨3:23的回复, 非常感动! 一定注意身体, 好像您还没有太太, 太太和孩子就像正规程序一样,能够良好地调节您的身体, 千万不要使用野程序调节身体,会中毒。 开个玩笑。 哈哈,最近我是有点疯狂,不过从大学开始,似乎就习惯了晚睡,我基本多年都保持2点左右睡觉,8点左右起床,昨晚有点夸张,因为看一个文档和写一些东西一口气就到3点多了,临睡前看到您的留言,顺便就回复了。 40 myld says: April 18th, 2007 at 1:38 pm Quote 感谢楼主写了这么好的文章,谢谢!!! 
41 楓之谷外掛 says: April 27th, 2007 at 11:04 pm Quote 看ㄋ你的文章,很有感覺的說.我自己也做網站,希望可以多多交流一下,大家保持聯繫. http://www.gameon9.com/ http://www.gameon9.com.tw/ 42 南半球 says: May 9th, 2007 at 8:22 pm Quote 关于两位老大讨论的:图片保存于磁盘还是数据库 个人觉得数据库存文件的话,查询速度可能快点,但数据量很大的时候要加上索引,这样添加记录的速度就慢了 mysql对大数据量的处理能力还不是很强,上千万条记录时,性能也会下降 数据库另外一个瓶颈问题就是连接 用数据库,就要调用后台程序(JSP/JAVA,PHP等)连接数据库,而数据库的连接连接、传输数据都要耗费系统资源。数据库的连接数也是个瓶颈问题。曾经写过一个很烂的程序,每秒访问3到5次的数据库,结果一天下来要连接20多万次数据库,把对方的mysql数据库搞瘫痪了。 43 zhang says: May 19th, 2007 at 12:07 am Quote 抽空儿回这里浏览了一下, 有收获, “写真照”换了,显得更帅了。 ok 44 Michael says: May 19th, 2007 at 12:17 am Quote zhang on May 19, 2007 at 12:07 am said: 抽空儿回这里浏览了一下, 有收获, “写真照”换了,显得更帅了。 ok 哈哈,让您见笑了 45 David says: May 30th, 2007 at 3:27 pm Quote 很好,虽然我不是做web的,但看了还是收益良多。 46 pig345 says: June 13th, 2007 at 10:23 am Quote 感谢Michael 47 疯子日记 says: June 13th, 2007 at 10:12 pm Quote 回复:说说大型高并发高负载网站的系统架构 … 好文章!学习中…………. 48 terry says: June 15th, 2007 at 4:29 pm Quote 推荐nginx 49 7u5 says: June 16th, 2007 at 11:54 pm Quote 拜读 50 Michael says: June 16th, 2007 at 11:59 pm Quote terry on June 15, 2007 at 4:29 pm said: 推荐nginx 欢迎分享Nginx方面的经验:) 51 说说大型高并发高负载网站的系统架构 - 红色的河 says: June 21st, 2007 at 11:40 pm Quote […] 来源: http://www.toplee.com/blog/archives/71.html 时间:11:40 下午 | 分类:技术文摘 标签:系统架构, 大型网站, 性能优化 […] 52 laoyao2k says: June 23rd, 2007 at 11:35 am Quote 看到大家都推荐图片分离,我也知道这样很好,但页面里的图片的绝对网址是开发的时候就写进去的,还是最终执行的时候替换的呢? 如果是开发原始网页就写进去的,那本地调试的时候又是怎么显示的呢? 如果是最终执行的时候替换的话,是用的什么方法呢? 53 Michael says: June 23rd, 2007 at 8:21 pm Quote 都可以,写到配置文件里面就可以,或者用全局变量定义,方法很多也都能实现,哪怕写死了在开发的时候把本地调试也都配好图片server,在hosts文件里面指定一下ip到本地就可以了。 假设用最终执行时候的替换,就配置你的apache或者别的server上的mod_rewrite模块来实现,具体的参照相关文档。 54 laoyao2k says: June 25th, 2007 at 6:43 pm Quote 先谢谢博主的回复,一直在找一种方便的方法将图片分离。 看来是最终替换法比较灵活,但我知道mod_rewrite是用来将用户提交的网址转换成服务器上的真实网址。 看了博主的回复好像它还有把网页执行的结果进行替换后再返回给浏览器的功能,是这样吗? 55 Michael says: June 25th, 2007 at 11:00 pm Quote 不是,只转换用户请求,对url进行rewrite,进行重定向到新的url上,规则很灵活,建议仔细看看lighttpd或者apache的mod_rewrite文档,当然IIS也有类似的模块。 56 laoyao2k says: June 25th, 2007 at 11:56 pm Quote 我知道了,如果要让客户浏览的网页代码里的图片地址是绝对地址,只能在开发时就写死了(对于静态页面)或用变量替换(对于动态页面更灵活些),是这样吗? 我以为有更简单的方法呢,谢博主指点了。 57 马蜂不蛰 says: July 24th, 2007 at 1:25 pm Quote 请教楼主: 我正在搞一个医学教育视频资源在线预览的网站,只提供几分钟的视频预览,用swf格式,会员收看预览后线下可购买DVD光碟。 系统架构打算使用三台服务器:网页服务器、数据库服务器、视频服务器。 网页使用全部静态,数据库用SQL Server 2000,CMS是用ASP开发的。 会员数按十万级设计,不使用库表散列技术,请楼主给个建议,看看我的方案可行不? 58 Michael says: July 24th, 2007 at 11:56 pm Quote 这个数量级的应用好好配置优化一下服务器和代码,三台服务器完全没有问题,关键不是看整体会员数有多少,而是看同时在线的并发数有多少,并发不多就完全没有问题了,并发多的话,三台的这种架构还是有些问题的。 mixi技术架构 mixi.jp:使用开源软件搭建的可扩展SNS网站 总概关键点: 1,Mysql 切分,采用Innodb运行 2,动态Cache 服务器 -- 美国Facebok.com,中国Yeejee.com,日本mixi.jp均采用开源分布式缓存服务器Memcache 3,图片缓存和加
顺便提及,MySpace在2006年7月24号晚上开始了长达12小时的瘫痪,期间只有一个可访问页面——该页面解释说位于洛杉矶的主数据中心发生故障。为了让大家耐心等待服务恢复,该页面提供了用Flash开发的派克人(Pac-Man)游戏。Web站点跟踪服务研究公司总经理Bill Tancer说,尤其有趣的是,MySpace瘫痪期间,访问量不降反升,“这说明了人们对MySpace的痴迷——所有人都拥在它的门口等着放行”。 现Nielsen Norman Group 咨询公司负责人、原Sun Microsystems公司工程师,因在Web站点方面的评论而闻名的Jakob Nielsen说,MySpace的系统构建方法显然与Yahoo、eBay以及Google都不相同。和很多观察家一样,他相信MySpace对其成长速度始料未及。“虽然我不认为他们必须在计算机科学领域全面创新,但他们面对的的确是一个巨大的科学难题。”他说。
用户注册时,需要提供个人基本信息,主要包括籍贯、性取向和婚姻状况。虽然MySpace屡遭批评,指其为网上性犯罪提供了温床,但对于未成年人,有些功能还是不予提供的。 MySpace的个人资料页上表述自己的方式很多,如文本式“关于本人”栏、选择加载入MySpace音乐播放器的歌曲,以及视频、交友要求等。它还允许用户使用CSS(一种Web标准格式语言,用户以此可设置页面元素的字体、颜色和页面背景图像)自由设计个人页面,这也提升了人气。不过结果是五花八门——很多用户的页面布局粗野、颜色迷乱,进去后找不到东南西北,不忍卒读;而有些人则使用了专业设计的模版(可阅读《Too Much of a Good Thing?》第49页),页面效果很好。 在网站上线8个月后,开始有大量用户邀请朋友注册MySpace,因此用户量大增。“这就是网络的力量,这种趋势一直没有停止。”Chau说。
拥有Fox电视网络和20th Century Fox影业公司的媒体帝国——新闻集团,看到了他们在互联网用户中的机会,于是在2005年斥资5.8亿美元收购了MySpace。新闻集团董事局主席Rupert Murdoch最近向一个投资团透露,他认为MySpace目前是世界主要Web门户之一,如果现在出售MySpace,那么可获60亿美元——这比2005年收购价格的10倍还多!新闻集团还惊人地宣称,MySpace在2006年7月结束的财政年度里总收入约2亿美元,而且预期在2007年度,Fox Interactive公司总收入将达到5亿美元,其中4亿来自MySpace。 然而MySpace还在继续成长。12月份,它的注册账户达到1.4亿,而2005年11月时不过4千万。当然,这个数字并不等于真实的用户个体数,因为有些人可能有多个帐号,而且个人资料也表明有些是乐队,或者是虚构的名字,如波拉特(译者注:喜剧电影《Borat》主角),还有像Burger King(译者注:美国最大的汉堡连锁集团)这样的品牌名。
Judicious use of server-side state No server affinity Functional server pools Horizontal and vertical database partitioning eBay取得数据访问的线性扩展的做法是非常让人感兴趣的。他们提到使用"定制的O-R mapping" 来支持本地Cache和全局Cache、lazy loading, fetch sets (deep and shallow)以及读取和提交更新的子集。 而且,他们只使用bean管理的事务以及使用数据库的自动提交和O-R mapping来route不同的数据源. 有几个事情是非常令人吃惊的。第一,完全不使用Entity Beans,只使用他自己的O-R mapping工具(Hibernate anyone?)。 第二、基于Use-Case的应用服务器划分。第三、数据库的划分也是基于Use-Case。最后是系统的无状态本性以及明显不使用集群技术。 下面是关于服务器状态的引用: 基本上我们没有真正地使用server-side state。我们可能使用它,但现在我们并没有找到使用它的理由。....。如果需要状态化的话,我们把状态放到数据库中;需要的时候我们再从数据库中取。我们不必使用集群。也就不用为集群做任何工作。 总之,你自己不必为架构一台有状态的服务器所困扰,更进一步,忘掉集群,你不需要它。现在看看功能划分:
(2)、商业逻辑层的设计模式: - Business Delegate - Service Locator (X) - Session Facade - Application Service (X) - Business Object (X) - Composite Entity - Transfer Object (X) - Transfer Object Assembler (X) - Value List Handler (X) 带(X)表示这些设计模式在eBay.com的架构中采用了。
3、集成层(也称为数据访问层) 设计模式: - Data Access Object (X) - Service Activator - Domain Store (X) - Web Service Broker (X) 带(X)表示这些设计模式在eBay.com的架构中采用了。
2、为了可适应未来的架构,ebay采用了下面的做法 采用J2EE模式 Only adopt Technology when required Create new Technology as needed 大量的性能测试 大量的容量计划 大量关键点的调优 Highly redundant operational infrastructure and the technology to leverage it
3、为了实现可线性扩展,ebay采用了下面的做法: (1) 合理地使用server state (2) No server affinity (3) Functional server pools。 (4) Horizontal and vertical database partitioning。
4、为了使得数据访问可线性扩展 (1) 建模我们的数据访问层 (2) 支持Support well-defined data access patterns Relationships and traversals 本地cache和全局cache (3) 定制的O-R mapping—域存储模式 (4) Code generation of persistent objects (5) 支持lazy loading (6) 支持fetch sets (shallow/deep fetches) (7) 支持retrieval and submit (Read/Write sets)
5、为了使数据存储可线性扩展,eBay采用了下列做法 (1)商业逻辑层的事务控制 只采用Bean管理的事务 Judicious use of XA 数据库的自动提交 (2)基于内容的路由 运行期间采用 O-R Mapping ,找到正确的数据源 支持数据库的水平线性扩展 Failover hosts can be defined (3)数据源管理 动态的 Overt and heuristic control of availability 如果数据库宕机,应用可以为其他请求服务。
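下面用一段示意性的 PHP 代码说明上文"基于内容的路由 + failover 数据源"的思路(主机名、账号与散列方式均为假设,并非 eBay 的实际实现):

<?php
// 按业务内容(这里以用户 ID 散列)选择数据源,主库不可用时切换备用主机。
$dataSources = array(
    0 => array('primary' => 'db0.example.com', 'failover' => 'db0-bak.example.com'),
    1 => array('primary' => 'db1.example.com', 'failover' => 'db1-bak.example.com'),
);

function get_db_link($userId) {
    global $dataSources;
    $shard = $userId % count($dataSources);   // 水平划分:根据内容决定路由到哪个库
    $conf  = $dataSources[$shard];

    $link = @mysql_connect($conf['primary'], 'user', 'pass');
    if (!$link) {                             // 主库宕机时切换到 failover 主机,
        $link = mysql_connect($conf['failover'], 'user', 'pass'); // 应用仍可服务
    }
    return $link;
}
?>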
http_port 80
httpd_accel_host virtual
httpd_accel_single_host off
httpd_accel_port 80
httpd_accel_uses_host_header on
httpd_accel_with_proxy on

# accelerate my domains only
acl acceleratedHostA dstdomain .example1.com
acl acceleratedHostB dstdomain .example2.com
acl acceleratedHostC dstdomain .example3.com

# accelerate http protocol on port 80
acl acceleratedProtocol protocol HTTP
acl acceleratedPort port 80

# access control
acl all src 0.0.0.0/0.0.0.0

# Allow requests when they are to the accelerated machine AND to the
# right port with right protocol
http_access allow acceleratedProtocol acceleratedPort acceleratedHostA
http_access allow acceleratedProtocol acceleratedPort acceleratedHostB
http_access allow acceleratedProtocol acceleratedPort acceleratedHostC

# logging
emulate_httpd_log on
cache_store_log none

# manager
acl manager proto cache_object
http_access allow manager all
cachemgr_passwd pass all
----------------------cut here---------------------------------
创建缓存目录: /usr/local/squid/sbin/squid -z
启动squid: /usr/local/squid/sbin/squid
停止squid: /usr/local/squid/sbin/squid -k shutdown
启用新配置: /usr/local/squid/sbin/squid -k reconfigure
通过crontab每天0点截断/轮循日志: 0 0 * * * (/usr/local/squid/sbin/squid -k rotate)
可缓存的动态页面设计 什么样的页面能够比较好的被缓存服务器缓存呢?如果返回内容的HTTP HEADER中有"Last-Modified"和"Expires"相关声明,比如: Last-Modified: Wed, 14 May 2003 13:06:17 GMT Expires: Fri, 16 Jun 2003 13:06:17 GMT 前端缓存服务器在此期间会将生成的页面缓存在本地:硬盘或者内存中,直至上述页面过期。 因此,一个可缓存的页面: • 页面必须包含Last-Modified: 标记 一般纯静态页面本身都会有Last-Modified信息,动态页面需要通过函数强制加上,比如在PHP中: // always modified now header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT"); • 必须有Expires或Cache-Control: max-age标记设置页面的过期时间: 对于静态页面,通过apache的mod_expires根据页面的MIME类型设置缓存周期:比如图片缺省是1个月,HTML页面缺省是2天等。 <IfModule mod_expires.c> ExpiresActive on ExpiresByType image/gif "access plus 1 month" ExpiresByType text/css "now plus 2 day" ExpiresDefault "now plus 1 day" </IfModule>
对于动态页面,则可以直接通过写入HTTP返回的头信息,比如对于新闻首页index.php可以是20分钟,而对于具体的一条新闻页面可能是1天后过期。比如:在php中加入了1个月后过期: // Expires one month later header("Expires: " .gmdate ("D, d M Y H:i:s", time() + 3600 * 24 * 30). " GMT"); • 如果服务器端有基于HTTP的认证,必须有Cache-Control: public标记,允许前台 ASP应用的缓存改造 首先在公用的包含文件中(比如include.asp)加入以下公用函数: <% ' Set Expires Header in minutes Function SetExpiresHeader(ByVal minutes) ' set Page Last-Modified Header: ' Converts date (19991022 11:08:38) to http form (Fri, 22 Oct 1999 12:08:38 GMT) Response.AddHeader "Last-Modified", DateToHTTPDate(Now())
' The Page Expires in Minutes Response.Expires = minutes
' Set cache control to externel applications Response.CacheControl = "public" End Function ' Converts date (19991022 11:08:38) to http form (Fri, 22 Oct 1999 12:08:38 GMT) Function DateToHTTPDate(ByVal OleDATE) Const GMTdiff = #08:00:00# OleDATE = OleDATE - GMTdiff DateToHTTPDate = engWeekDayName(OleDATE) & _ ", " & Right("0" & Day(OleDATE),2) & " " & engMonthName(OleDATE) & _ " " & Year(OleDATE) & " " & Right("0" & Hour(OleDATE),2) & _ ":" & Right("0" & Minute(OleDATE),2) & ":" & Right("0" & Second(OleDATE),2) & " GMT" End Function Function engWeekDayName(dt) Dim Out Select Case WeekDay(dt,1) Case 1:Out="Sun" Case 2:Out="Mon" Case 3:Out="Tue" Case 4:Out="Wed" Case 5:Out="Thu" Case 6:Out="Fri" Case 7:Out="Sat" End Select engWeekDayName = Out End Function Function engMonthName(dt) Dim Out Select Case Month(dt) Case 1:Out="Jan" Case 2:Out="Feb" Case 3:Out="Mar" Case 4:Out="Apr" Case 5:Out="May" Case 6:Out="Jun" Case 7:Out="Jul" Case 8:Out="Aug" Case 9:Out="Sep" Case 10:Out="Oct" Case 11:Out="Nov" Case 12:Out="Dec" End Select engMonthName = Out End Function %> 然后在具体的页面中,比如index.asp和news.asp的“最上面”加入以下代码:HTTP Header <!--#include file="../include.asp"--> <% '页面将被设置20分钟后过期 SetExpiresHeader(20) %> 应用的缓存兼容性设计
经过代理以后,由于在客户端和服务之间增加了中间层,因此服务器无法直接拿到客户端的IP,服务器端应用也无法直接通过转发请求的地址返回给客户端。但是在转发请求的HTTP头信息中,增加了HTTP_X_FORWARDED_????信息。用以跟踪原有的客户端IP地址和原来客户端请求的服务器地址: 下面是2个例子,用于说明缓存兼容性应用的设计原则: '对于一个需要服务器名的地址的ASP应用:不要直接引用HTTP_HOST/SERVER_NAME,判断一下是否有HTTP_X_FORWARDED_HOST function getHostName () dim hostName as String = "" hostName = Request.ServerVariables("HTTP_HOST") if not isDBNull(Request.ServerVariables("HTTP_X_FORWARDED_HOST")) then if len(trim(Request.ServerVariables("HTTP_X_FORWARDED_HOST"))) > 0 then hostName = Request.ServerVariables("HTTP_X_FORWARDED_HOST") end if end if return hostName end function
//对于一个需要记录客户端IP的PHP应用:不要直接引用REMOTE_ADDR,而是要优先使用HTTP_X_FORWARDED_FOR:
function getUserIP (){
    // 缺省取直连地址;经过代理转发时,真实客户端IP在 X-Forwarded-For 头里
    $user_ip = $_SERVER["REMOTE_ADDR"];
    if (isset($_SERVER["HTTP_X_FORWARDED_FOR"]) && $_SERVER["HTTP_X_FORWARDED_FOR"]) {
        $user_ip = $_SERVER["HTTP_X_FORWARDED_FOR"];
    }
    return $user_ip;
}
phpMan.php是一个基于php的man page server,每个man page需要调用后台的man命令和很多页面格式化工具,系统负载比较高,提供了Cache Friendly的URL,以下是针对同样的页面的性能测试资料: 测试环境:Redhat 8 on Cyrix 266 / 192M Mem 测试程序:使用apache的ab(apache benchmark): 测试条件:请求50次,并发5个连接 测试项目:直接通过apache 1.3 (80端口) vs squid 2.5(8000端口:加速80端口)
测试1:无CACHE的80端口动态输出: /home/apache/bin/ab -n50 -c5 "http://localhost/phpMan.php/man/kill/1" This is ApacheBench, Version 1.3d <$Revision: 1.2 $> apache-1.3 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2001 The Apache Group, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software: Apache/1.3.23
Server Hostname: localhost
Server Port: 80

Concurrency Level: 5
Time taken for tests: 63.164 seconds
Complete requests: 50
Failed requests: 0
Broken pipe errors: 0
Total transferred: 245900 bytes
HTML transferred: 232750 bytes
Requests per second: 0.79 [#/sec] (mean)
Time per request: 6316.40 [ms] (mean)
Time per request: 1263.28 [ms] (mean, across all concurrent requests)
Transfer rate: 3.89 [Kbytes/sec] received
Connnection Times (ms)
min mean[+/-sd] median max
Connect: 0 29 106.1 0 553
Processing: 2942 6016 1845.4 6227 10796
Waiting: 2941 5999 1850.7 6226 10795
Total: 2942 6045 1825.9 6227 10796
Percentage of the requests served within a certain time (ms)
50% 6227
66% 7069
75% 7190
80% 7474
90% 8195
95% 8898
98% 9721
99% 10796
100% 10796 (last request)
测试2:SQUID缓存输出 /home/apache/bin/ab -n50 -c5 "http://localhost:8000/phpMan.php/man/kill/1" This is ApacheBench, Version 1.3d <$Revision: 1.2 $> apache-1.3 Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Copyright (c) 1998-2001 The Apache Group, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software: Apache/1.3.23
Server Hostname: localhost
Server Port: 8000

Concurrency Level: 5
Time taken for tests: 4.265 seconds
Complete requests: 50
Failed requests: 0
Broken pipe errors: 0
Total transferred: 248043 bytes
HTML transferred: 232750 bytes
Requests per second: 11.72 [#/sec] (mean)
Time per request: 426.50 [ms] (mean)
Time per request: 85.30 [ms] (mean, across all concurrent requests)
Transfer rate: 58.16 [Kbytes/sec] received
Connnection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 9.5 0 68
Processing: 7 83 537.4 7 3808
Waiting: 5 81 529.1 6 3748
Total: 7 84 547.0 7 3876
Percentage of the requests served within a certain time (ms)
50% 7
66% 7
75% 7
80% 7
90% 7
95% 7
98% 8
99% 3876
100% 3876 (last request)
可缓存的页面设计 http://linux.oreillynet.com/pub/a/linux/2002/02/28/cachefriendly.html 运用ASP.NET的输出缓冲来存储动态页面 - 开发者 - ZDNet China http://www.zdnet.com.cn/developer/tech/story/0,2000081602,39110239-2,00.htm 相关RFC文档:
• RFC 2616:
o section 13 (Caching) o section 14.9 (Cache-Control header) o section 14.21 (Expires header) o section 14.32 (Pragma: no-cache) is important if you are interacting with HTTP/1.0 caches o section 14.29 (Last-Modified) is the most common validation method o section 3.11 (Entity Tags) covers the extra validation method
YouTube Scalability Talk Cuong Do of YouTube / Google recently gave a Google Tech Talk on scalability. I found it interesting in light of my own comments on YouTube’s 45 TB a while back. Here are my notes from his talk, a mix of what he said and my commentary: In the summer of 2006, they grew from 30 million pages per day to 100 million pages per day, in a 4 month period. (Wow! In most organizations, it takes nearly 4 months to pick out, order, install, and set up a few servers.) YouTube uses Apache for FastCGI serving. (I wonder if things would have been easier for them had they chosen nginx, which is apparently wonderful for FastCGI and less problematic than Lighttpd) YouTube is coded mostly in Python. Why? “Development speed critical”. They use psyco, a Python -> C compiler, and also C extensions, for performance critical work. They use Lighttpd for serving the video itself, for a big improvement over Apache. Each video is hosted by a “mini cluster”, which is a set of machines with the same content. This is a simple way to provide headroom (slack), so that a machine can be taken down for maintenance (or can fail) without affecting users. It also provides a form of backup. The most popular videos are on a CDN (Content Distribution Network) - they use external CDNs as well as Google’s CDN. Requests to their own machines are therefore tail-heavy (in the “Long Tail” sense), because the head goes to the CDN instead. Because of the tail-heavy load, random disk seeks are especially important (perhaps more important than caching?). YouTube uses simple, cheap, commodity hardware. The more expensive the hardware, the more expensive everything else gets (support, etc.). Maintenance is mostly done with rsync, SSH, and other simple, common tools. The fun is not over: Cuong showed a recent email titled “3 days of video storage left”. There is constant work to keep up with the growth. Thumbnails turn out to be surprisingly hard to serve efficiently. Because there are, on average, 4 thumbnails per video and many thumbnails per page, the overall number of thumbnails per second is enormous. They use a separate group of machines to serve thumbnails, with extensive caching and OS tuning specific to this load. YouTube was bitten by a “too many files in one dir” limit: at one point they could accept no more uploads (!!) because of this. The first fix was the usual one: split the files across many directories, and switch to another file system better suited for many small files. Cuong joked about “The Windows approach of scaling: restart everything” Lighttpd turned out to be poor for serving the thumbnails, because its main loop is a bottleneck to load files from disk; they addressed this by modifying Lighttpd to add worker threads to read from disk. This was good but still not good enough, with one thumbnail per file, because the enormous number of files was terribly slow to work with (imagine tarring up many million files). Their new solution for thumbnails is to use Google’s BigTable, which provides high performance for a large number of rows, fault tolerance, caching, etc. This is a nice (and rare?) example of actual synergy in an acquisition. YouTube uses MySQL to store metadata. Early on they hit a Linux kernel issue which prioritized the page cache higher than app data; it swapped out the app data, totally overwhelming the system. They recovered from this by removing the swap partition (while live!). This worked. YouTube uses Memcached. To scale out the database, they first used MySQL replication.
Like everyone else that goes down this path, they eventually reach a point where replicating the writes to all the DBs, uses up all the capacity of the slaves. They also hit a issue with threading and replication, which they worked around with a very clever “cache primer thread” working a second or so ahead of the replication thread, prefetching the data it would need. As the replicate-one-DB approach faltered, they resorted to various desperate measures, such as splitting the video watching in to a separate set of replicas, intentionally allowing the non-video-serving parts of YouTube to perform badly so as to focus on serving videos. Their initial MySQL DB server configuration had 10 disks in a RAID10. This does not work very well, because the DB/OS can’t take advantage of the multiple disks in parallel. They moved to a set of RAID1s, appended together. In my experience, this is better, but still not great. An approach that usually works even better is to intentionally split different data on to different RAIDs: for example, a RAID for the OS / application, a RAID for the DB logs, one or more RAIDs for the DB table (uses “tablespaces” to get your #1 busiest table on separate spindles from your #2 busiest table), one or more RAID for index, etc. Big-iron Oracle installation sometimes take this approach to extremes; the same thing can be done with free DBs on free OSs also. In spite of all these effort, they reached a point where replication of one large DB was no longer able to keep up. Like everyone else, they figured out that the solution database partitioning in to “shards”. This spread reads and writes in to many different databases (on different servers) that are not all running each other’s writes. The result is a large performance boost, better cache locality, etc. YouTube reduced their total DB hardware by 30% in the process. It is important to divide users across shards by a controllable lookup mechanism, not only by a hash of the username/ID/whatever, so that you can rebalance shards incrementally. An interesting DMCA issue: YouTube complies with takedown requests; but sometimes the videos are cached way out on the “edge” of the network (their caches, and other people’s caches), so its hard to get a video to disappear globally right away. This sometimes angers content owners. Early on, YouTube leased their hardware. High Performance Web Sites by Nate Koechley One dozen rules for faster pages 1. Share results of our research in t Yahoo is firml committed to openness Why talk about performance? In the last 2 years, we do a lot more with web pages. Steve Souders - High Performance Two Performance Flavors: Response Time and System Efficiency The importance of front end performance!!! 95% is front end. Back0end vs. Front-end Until now we haveon 1 Perception - How fast does it feel to the users? Perceived response time It's in the eye of the beholder 2 80% of consequences Yahoo Interface Blog yuiblog.com 3 Cache Sadly the cache doesn't work as well as it should 40-60% of users still have an empty cache Therefore optimize for no-cache and with cache 4 Cookies Set scope correctly Keep sizes low, 80ms delay with cookies Total cookie size - Amazon 60 bytes - good example. 1. Eliminate unnecessary cookies 2. Keep cookie sizes low 3. 
5 Parallel Downloads One Dozen Rules Rule 1 - Make fewer HTTP requests css sprites alistapart.com/articles/sprites Combine Scripts, Combined Stylesheets Rule 2 - Use a CDN amazon.com - Akamai Distribute your static content beofre distributing content Rule 3 Add an Expires Header Not just for images images, stylesheets and scripts Rule 4: Gzip Components You can addect users download times 90% of browsers support compression Gzip compresses more than deflate Gzip: not just for HTML for gzip scripts, Free YUI Hosting includes Aggregated files w Rule 5: Put CSS at the top stsylesheets use < link > not @import!!!!! Slower, but perceived loading time is faster Rule 6; Move scripts to the bottom of th te page scripts block rendering what about defer? - no good Rule 7: Avoid CSS Expressions Rule 8: Make JS and CSS External Inline: bigger HTML but no hhtp request External: cachable but extra http Except for a users "home page" Post-Onload Download Dynamic Inlining Rule 9: Reduce DNS Lookups Best practice: Max 2-4 hosts Use keep-alive Rule 10: Minify Javascript Take out white space, Two popular choices - Dojo is a better compressor but JSMin is less error prone. minify is safer than obstifacation Rule 11: Avoid redirects Redirects are worst form of blocking Redirects - Amazon have none! Rule 12: Tuen off ETags Case Studies Yahoo 1 Moved JS to onload 2 removed redirects 50% faster What about performance and Web 2.0 apps? Client-side CPU is more of an issue User expectations are higher start off on the right foot - care! Live Analysis IBM Page Detailer - windows only Fasterfox - measures load time of pages LiveHTTPHeaders firefox extension Firebug - Recommended! YSlow to be released soon. Conclusion Focus on the front end harvest the low hanging fruit reduce http requests Rules for High Performance Web Sites These rules are the key to speeding up your web pages. They've been tested on some of the most popular sites on the Internet and have successfully reduced the response times of those pages by 25-50%. The key insight behind these best practices is the realization that only 10-20% of the total end-user response time is spent getting the HTML document. You need to focus on the other 80-90% if you want to make your pages noticeably faster. These rules are the best practices for optimizing the way servers and browsers handle that 80-90% of the user experience. • Rule 1 - Make Fewer HTTP Requests • Rule 2 - Use a Content Delivery Network • Rule 3 - Add an Expires Header • Rule 4 - Gzip Components • Rule 5 - Put CSS at the Top • Rule 6 - Move Scripts to the Bottom • Rule 7 - Avoid CSS Expressions • Rule 8 - Make JavaScript and CSS External • Rule 9 - Reduce DNS Lookups • Rule 10 - Minify JavaScript • Rule 11 - Avoid Redirects • Rule 12 - Remove Duplicate Scripts • Rule 13 - Turn Off ETags • Rule 14 - Make AJAX Cacheable and Small 对于应用高并发,DB千万级数量该如何设计系统哪? 背景: 博客类型的应用,系统实时交互性比较强。各种统计,计数器,页面的相关查询之类的都要频繁操作数据库。数据量要求在千万级,同时在线用户可能会有几万人活跃。系统现在是基于spring + hibernate + jstl + mysql的,在2千人在线,几十万记录下没有什么压力。可对于千万记录以及数万活跃用户没什么经验和信心。 对于这些,我的一点设计想法与问题,欢迎大家指导:
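针对上面提到的计数器、统计类频繁读写数据库的情况,一个常见做法是先用 memcached 挡掉大部分数据库访问。下面是一段示意性的 PHP 片段(键名、过期时间均为假设,load_count_from_db 是假设的取数函数):

<?php
// 浏览计数:命中缓存时只在内存里累加,未命中时才回源数据库。
$mc = new Memcache();
$mc->connect('127.0.0.1', 11211);

$key   = 'blog:view_count:' . $blogId;  // $blogId 由上层业务代码提供(假设)
$count = $mc->get($key);
if ($count === false) {
    // 缓存未命中:从数据库读出当前值(load_count_from_db 为假设的函数)
    $count = load_count_from_db($blogId);
    $mc->set($key, $count, 0, 300);     // 缓存 5 分钟
} else {
    $mc->increment($key);               // 命中则直接累加,避免频繁 UPDATE
}
?>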
体系结构选定之后,我们就要考虑更加细节的部分,比如说用什么操作系统,用操作系统提供的那些API。在这方面,前辈们已经做过很多,我们只需要简单的“拿来”即可,如果再去枉费唇舌,简直就是浪费时间,图财害命。High-Performance Server Architecture从根本上分析了导致服务器低效的罪魁祸首:数据拷贝、(用户和内核)上下文切换、内存申请(管理)和锁竞争;The C10K Problem列举并分析了UNIX、Linux甚至是部分Windows为提高服务器性能而设计的一些系统调用接口,这篇文档的难能可贵之处还在于它一致保持更新;Benchmarking BSD and Linux更是通过实测数据用图表的形式把BSD和Linux的相关系统调用的性能直观地陈列在我们眼前,结果还是令人激动的:Linux 2.6的相关系统调用的时间复杂度竟然是O(1)。
interface Ethernet0/0
 ip address 192.168.1.4 255.255.255.248
 ip nat inside
!
interface Serial0/0
 ip address 200.200.1.1 255.255.255.248
 ip nat outside
!
access-list 1 permit 200.200.1.2
!
ip nat pool websrv 192.168.1.1 192.168.1.3 netmask 255.255.255.248 type rotary
ip nat inside destination list 1 pool websrv
select count(distinct v_yjhm)
  from (select v_yjhm
          from zjjk_t_yssj_o_his a
         where n_yjzl > 0
           and d_sjrq between to_date('20070301', 'yyyymmdd') and to_date('20070401', 'yyyymmdd')
           and v_yjzldm like '40%'
           and not exists (select 'a' from INST_TRIG_ZJJK_T_YSSJ_O b where a.v_yjtm = b.yjbh)
           --and v_yjtm not in (select yjbh from INST_TRIG_ZJJK_T_YSSJ_O)
        union
        select v_yjhm
          from zjjk_t_yssj_u_his a
         where n_yjzl > 0
           and d_sjrq between to_date('20070301', 'yyyymmdd') and to_date('20070401', 'yyyymmdd')
           and v_yjzldm like '40%'
           and not exists (select 'a' from INST_TRIG_ZJJK_T_YSSJ_U b where a.v_yjtm = b.yjbh))
           --and v_yjtm not in (select yjbh from INST_TRIG_ZJJK_T_YSSJ_U))
优化建议: 1、什么是DISTINCT?就是分组排序后取唯一值,底层行为是分组排序。 2、什么是UNION、UNION ALL?UNION:对多个结果集取DISTINCT,生成一个不含重复记录的结果集返回给前端;UNION ALL:不对结果集进行去重复操作。底层行为:分组排序。 3、什么是COUNT(*)?累加。 4、需要有什么样的索引? d_sjrq + v_yjzldm,理由:假如全省的数据量在表中全部数为1000万,查询月数据量为200万,1000万中特快占50%,则通过 between 时间(d_sjrq)+ 种类(v_yjzldm),可过滤出约100万,这是最好的检索方式了。 5、两表都要进行一次 NOT EXISTS 运算,如何做最优? NOT EXISTS 是不好做的运算,但是我们可以合并两次的 NOT EXISTS 运算,让这费资源的活只干一次。
综合以上,我们可以如下优化这个SQL: 1、内部的UNION也是去重复,外部的DISTINCT也是去重复,可以去掉其中一个,建议内部的改为UNION ALL。这里稍请注意一下,如果V_YJHM有NULL的情况,可能会引起COUNT值不符合实际数的情况。 2、建一个 D_SJRQ + V_YJZLDM 的复合索引。 3、将两个子查询先UNION ALL联结,另两个用来做NOT EXISTS的表也UNION ALL联结。 4、在3的基础上再做NOT EXISTS。 5、将NOT EXISTS替换为NOT IN,同时加提示HASH_AJ做半连接HASH运算。 6、最后由外层的COUNT(DISTINCT …)获得结果数。
SQL书写如下:
select count(distinct v_yjhm)
  from (select v_yjtm, v_yjhm
          from zjjk_t_yssj_o_his a
         where n_yjzl > 0
           and d_sjrq between to_date('20070301', 'yyyymmdd') and to_date('20070401', 'yyyymmdd')
           and v_yjzldm like '40%'
        union all
        select v_yjtm, v_yjhm
          from zjjk_t_yssj_u_his a
         where n_yjzl > 0
           and d_sjrq between to_date('20070301', 'yyyymmdd') and to_date('20070401', 'yyyymmdd')
           and v_yjzldm like '40%') a
 where a.v_yjtm not in
       (select /*+ HASH_AJ */ yjbh
          from (select yjbh from INST_TRIG_ZJJK_T_YSSJ_O
                union all
                select yjbh from INST_TRIG_ZJJK_T_YSSJ_U))
SQL Server 2005是微软在推出SQL Server 2000后时隔五年推出的一个数据库平台,它的数据库引擎为关系型数据和结构化数据提供了更安全可靠的存储功能,使用户可以构建和管理用于业务的高可用和高性能的数据应用程序。此外SQL Server 2005结合了分析、报表、集成和通知功能。这使企业可以构建和部署经济有效的BI解决方案,帮助团队通过记分卡、Dashboard、Web Services和移动设备将数据应用推向业务的各个领域。无论是开发人员、数据库管理员、信息工作者还是决策者,SQL Server 2005都可以提供出创新的解决方案,并可从数据中获得更多的益处。
建立分区表先要创建文件组,而创建多个文件组主要是为了获得好的 I/O 平衡。一般情况下,文件组数最好与分区数相同,并且这些文件组通常位于不同的磁盘上。每个文件组可以由一个或多个文件构成,而每个分区必须映射到一个文件组。一个文件组可以由多个分区使用。为了更好地管理数据(例如,为了获得更精确的备份控制),对分区表应进行设计,以便只有相关数据或逻辑分组的数据位于同一个文件组中。使用 ALTER DATABASE,添加逻辑文件组名:
ALTER DATABASE [DeanDB] ADD FILEGROUP [FG1]
DeanDB为数据库名称,FG1文件组名。创建文件组后,再使用 ALTER DATABASE 将文件添加到该文件组中:
ALTER DATABASE [DeanDB] ADD FILE ( NAME = N'FG1', FILENAME = N'C:\DeanData\FG1.ndf', SIZE = 3072KB, FILEGROWTH = 1024KB ) TO FILEGROUP [FG1]
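文件和文件组就绪后,接下来就可以创建分区函数和分区方案,把各分区映射到文件组,再在分区方案上建表。下面是一个示意(分区边界值、FG1 之外的文件组名以及表名均为假设,FG2、FG3 需按上文方法先行创建):

-- 按整型键做范围分区(RANGE RIGHT),边界值仅为示例
CREATE PARTITION FUNCTION pf_DeanRange (int)
AS RANGE RIGHT FOR VALUES (100000, 200000);

-- 将各分区依次映射到文件组
CREATE PARTITION SCHEME ps_DeanScheme
AS PARTITION pf_DeanRange TO ([FG1], [FG2], [FG3]);

-- 在分区方案上创建表,分区列为 id
CREATE TABLE DeanTable (
    id   int          NOT NULL,
    name nvarchar(50)
) ON ps_DeanScheme (id);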
如果Qcache_lowmem_prunes的值非常大,则表明经常出现缓冲不够的情况;同时Qcache_hits的值非常大,则表明查询缓冲使用非常频繁,此时需要增加缓冲大小;如果Qcache_hits的值不大,则表明你的查询重复率很低,这种情况下使用查询缓冲反而会影响效率,那么可以考虑不用查询缓冲。此外,在SELECT语句中加入SQL_NO_CACHE可以明确表示不使用查询缓冲。 如果Qcache_free_blocks的值非常大,则表明缓冲区中碎片很多。query_cache_type指定是否使用查询缓冲。 我设置: QUOTE: query_cache_size = 32M
query_cache_type = 1
得到如下状态值:
Qcache_queries_in_cache 12737 (目前缓存的条数)
Qcache_inserts 20649006
Qcache_hits 79060095 (看来重复查询率还挺高的)
Qcache_lowmem_prunes 617913 (有这么多次出现缓存过低的情况)
Qcache_not_cached 189896
Qcache_free_memory 18573912 (目前剩余缓存空间)
Qcache_free_blocks 5328 (这个数字似乎有点大,碎片不少)
Qcache_total_blocks 30953
如果内存允许32M应该要往上加点 table_cache指定表高速缓存的大小。每当MySQL访问一个表时,如果在表缓冲区中还有空间,该表就被打开并放入其中,这样可以更快地访问表内容。通过检查峰值时间的状态值Open_tables和Opened_tables,可以决定是否需要增加table_cache的值。如果你发现open_tables等于table_cache,并且opened_tables在不断增长,那么你就需要增加table_cache的值了(上述状态值可以使用SHOW STATUS LIKE ‘Open%tables’获得)。注意,不能盲目地把table_cache设置成很大的值。如果设置得太高,可能会造成文件描述符不足,从而造成性能不稳定或者连接失败。 对于有1G内存的机器,推荐值是128-256。 笔者设置 QUOTE: table_cache = 256 得到以下状态: Open tables 256
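把上文讨论的这几个参数汇总起来,my.cnf 中对应的片段大致如下(数值沿用上文的设置,仅供参考,应按实测状态值调整):

[mysqld]
query_cache_type = 1
query_cache_size = 32M    # Qcache_lowmem_prunes 偏高时可适当调大
table_cache      = 256    # 1G 内存机器的推荐值为 128-256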
DB封装类中这样定义基类: CODE: /***************************************************************************** CLASS table base *****************************************************************************/ class table { var $table_name; ...
function table() {} //切换DB function change_db($db_name) { global $db; $db->Database=$db_name; mysql_select_db($db_name,$db->Link_ID); }
...
}
相应表类定义: CODE:
/***************************************************************************** CLASS info *****************************************************************************/
class info extends table {
    function info() {
        global $site_name;
        $this->change_db($site_name);
    }
}
• haiwangstar 发表于:2007-03-28 10:01:55 4楼 得分:0 另外关于索引:表会有3个数字列,一个是级别,共15级;另一个是X,再有一个是Y。这两列的数据范围都是从0到1000左右,查询的时候每次都会用到这三个列,大抵应该是这样:select image from table1 WHERE level = ? and x = ? and y = ? 这个时候如何建索引会比较好?将级别建为簇索引,X、Y建为复合索引?还是三个列统一建为复合索引好?
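就楼上描述的查询形态(level、x、y 三列都是等值条件),一个常见做法是建一个覆盖三列的复合索引,示意如下(表名、列名沿用楼上的描述;是否把它设为聚簇索引,需结合实际数据分布自行测试):

-- 三列均为等值谓词,一个复合索引即可一次定位;
-- 通常把选择性较高的列放在索引前面更有利
CREATE INDEX idx_level_x_y ON table1 (level, x, y);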
With the rise of Web 2.0, the Internet has entered a new wave of growth. The user-oriented philosophy of the new generation of websites has not only produced a large number of successful new sites, but has also greatly advanced the Internet as a whole. At the same time, this user-centered approach gives these sites new characteristics - high concurrency, high page views, large data volumes, and complex logic - and thus places new demands on site construction. This article focuses on architecture design for high-concurrency, high-traffic websites, and mainly discusses the following topics. First, it examines the use of mirror sites across the wide-area network and CDN content distribution networks, the convenience they bring to load balancing, and their respective strengths and weaknesses. It then briefly discusses layer-4 switching within the local area network, including the hardware solution F5 and the software solution LVS. At the level of a single server, the article concentrates on socket optimization, hard-disk caching, in-memory caching, balancing CPU against I/O, and read-write separation. At the application level, it introduces techniques commonly used by large-scale websites and the reasons these techniques are chosen. Finally, it discusses capacity expansion and fault tolerance in website architecture. The article combines theory with practice and draws on experience gained in the author's practical work, so it should have fairly broad applicability.
KEY WORDS: high page views, high concurrency, website architecture, scalability
[Appendix 2] MySQL wrapper class

<?php
class mysqlRpc {
    var $_hostWrite = '';
    var $_userWrite = '';
    var $_passWrite = '';
    var $_hostRead  = '';
    var $_userRead  = '';
    var $_passRead  = '';
    var $_dataBase  = '';
    var $db_write_handle = null;
    var $db_read_handle  = null;
    var $db_last_handle  = null;
    var $_cacheData = array();
    var $mmtime = 60;

    function mysqlRpc($database, $w_servername, $w_username, $w_password,
                      $r_servername = '', $r_username = '', $r_password = '') {}

    function connect_write_db() {}
    function connect_read_db() {}
    function query_write($sql, $return = false) {}
    function query_read($sql, $return = false) {}
    function query_first($sql, $return = false) {}
    function insert_id() {}
    function affected_rows() {}
    function escape_string($string) {}
    function fetch_array($queryresult, $type = MYSQL_ASSOC) {}
}
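The method bodies are omitted in the appendix. As a rough sketch of how the read/write split presumably works (my own reconstruction, not the original implementation), the two query paths might look like this, written as method bodies for the class above:

<?php
// Sketch only: writes go to the master, reads go to the slave,
// connecting lazily and falling back to the master when no read
// host is configured. Details beyond the listed interface are assumed.
function connect_write_db() {
    if (!$this->db_write_handle) {
        $this->db_write_handle = mysql_connect(
            $this->_hostWrite, $this->_userWrite, $this->_passWrite);
        mysql_select_db($this->_dataBase, $this->db_write_handle);
    }
}

function query_write($sql, $return = false) {
    $this->connect_write_db();
    $this->db_last_handle = $this->db_write_handle;
    $result = mysql_query($sql, $this->db_write_handle);
    return $return ? $result : true;
}

function query_read($sql, $return = false) {
    if ($this->_hostRead == '') {            // no slave configured
        return $this->query_write($sql, $return);
    }
    $this->connect_read_db();                // assumed symmetric to the above
    $this->db_last_handle = $this->db_read_handle;
    return mysql_query($sql, $this->db_read_handle);
}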
Applying memcache in a blog system
http://hi.baidu.com/gowtd/blog/item/4fa90f2353f5444e935807fa.html

I have been working with memcache for less than four or five days, and most of that time has gone into figuring out how to fold its interface into our blog system in a simple, effective way. The journal module already used memcache and showed good performance in testing (a very high cache hit rate, which has a lot to do with the small total volume of data), so the photo album module also began adding memcache support in its DAO layer.

The problem was that the original memcache wrapper was not abstract enough: adding memcache to the journal module's DAO increased its code size by 50% and left the code in a mess. When the same approach was applied to the album module, whose logic is more complex, the DAO layer ballooned further, and because I did not yet understand memcache well enough, the album DAO was riddled with bugs. Frankly, with memcache turned off, the first memcache-enabled version of the album module lost half its performance; with memcache turned on, performance did not improve and even dropped slightly, while the code became still messier and harder to read and maintain. There was nothing for it but to rewrite the album DAO myself. After several rounds of discussion with taotao on Thursday and Friday, the new DAO was finished. The code is much leaner, the memcache interface now has some abstraction (ugly as the interface still is), and almost every DAO method fits in 10-15 lines. Readability is much better, and fiddly chores such as managing secondary indexes are hidden away. Next week we need to put it on the servers for testing, and of course keep pushing taotao to make the interface prettier.

Today, on my day off, I read through the memcache client and server code; it is simple stuff. Still, a few things in it seem worth thinking about.

1. memcache's handling of Java objects. The memcache client optimizes primitive types. The server is completely ignorant of the types of the data it stores; it is not even an object store, just blocks of memory holding bytes. On the client side, following pure Java OO style would be wasteful: to cache a boolean, the usual Java approach is to create a Boolean object, serialize it, and send it to the server. A Boolean object occupies some 40 bytes, for which the server must allocate a 64-byte chunk, when a single byte would represent the value. memcache does well here: it compresses primitives, so a boolean travels as two bytes, one byte for the type and one for the actual value, which is a method well worth recommending. For other objects, memcache falls back to generic serialization and rebuilds the object on retrieval. That approach is simple, and programmers need not think about reconstruction; they rely on Java to rebuild the object. In our application, however, this simple generic method is unnecessary, and other approaches could improve efficiency and performance. Take a Photo object to be placed in the cache: suppose serialization turns it into 139 bytes (common in practice; often more). Under memcache's tiered memory allocation the server then assigns it a 256-byte chunk; a 260-byte object would get 512 bytes, with nearly half the memory unused. Looking at a serialized object, much of its content is type information used only to rebuild it in Java. If, following memcache's treatment of primitives, we had Photo implement a specific interface that can initialize a Photo object from a plain string, we would avoid storing all that type information on the server and save memory. The cost is that everything placed in the cache must implement that interface, which limits memcache's generality; but in our blog application only a handful of object types go into the cache, so that generality can be sacrificed.

2. Whether Blog objects and Photo objects should be stored together. To my mind a Blog object is a giant next to a Photo object: a long post occupies a large, and variable, chunk of memory, while Photo objects are fairly uniform in size. Under memcache's slab allocator, when memory can no longer be allocated, objects are evicted by LRU from the slab class matching the size of the incoming object. Imagine a cache server holding many small Photo objects and some large Blog objects, where early on the server mostly serves Photo storage. Once the system reaches a steady state, newly inserted Blog objects will much more readily trigger evictions in their slab class, because the bulk of memory has been carved up by countless small Photo objects while the Blog objects can claim only a small share. The system will not shrink the Photo slabs to make room for Blogs, since the two largely live in different slab classes. In effect the average lifetime of a Blog object in the cache is shorter than a Photo's: it is evicted more readily, and the miss rate for Blogs ends up higher than for Photos. My idea is to store Blog objects and Photo objects separately, so that objects of roughly the same size share a cache server; this could be done by taking object size into account when choosing the cache server.
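The article's example is Java, but in the spirit of this compilation here is the same compact-encoding idea as a minimal PHP sketch (the class, its fields, the key format, and the delimiter are all illustrative; the pecl/memcache Memcache class is assumed):

CODE:
<?php
// Illustrative sketch: encode a Photo as a compact delimited string
// instead of a full serialize()d object, saving cache memory.
class Photo {
    var $id; var $owner; var $title;

    function toCacheString() {
        return $this->id . '|' . $this->owner . '|' . $this->title;
    }
    function fromCacheString($s) {
        list($this->id, $this->owner, $this->title) = explode('|', $s, 3);
    }
}

$mc = new Memcache();
$mc->connect('localhost', 11211);

$p = new Photo();
$p->id = 42; $p->owner = 'gowtd'; $p->title = 'sunset';
$mc->set('photo:42', $p->toCacheString(), 0, 60);   // 60s TTL, like $mmtime above

$q = new Photo();
$q->fromCacheString($mc->get('photo:42'));          // rebuild without unserialize()
?>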
A Brief Analysis of CommunityServer's Performance Problems

Foreword
Many of you who use CommunityServer (CS below) may have found that as the data grows, the system gets slower and slower and consumes more and more resources; on an underpowered server it is simply a nightmare. I have just put that nightmare behind me, so let me briefly go over what causes CS's various performance problems. (Databases are not my specialty, so if there are mistakes here, please point them out, with my thanks!)

I forgot to introduce myself: I am Baoyu. I previously did the localization work for ASP.NET Forums and CommunityServer, and the community site of my alma mater NWPU (http://www.openlab.net.cn) runs on CS. Now someone will accuse me of advertising; actually it is anti-piracy, a trick I learned from brother Guo Anding, haha!

Performance problem analysis

Multi-site support of dubious value
One performance cost is the multi-site feature. It may indeed sound like a nice idea: one database, and each domain name gets a completely independent site. But is it really that useful to the vast majority of users? Leaving that question aside, it definitely costs performance. At startup the system must first load the settings for every site, which is one reason the first hit on a CS install is so slow. After that, most queries must carry the SettingsId column, and the database indexes on that column are not ideally built; for queries over large data sets, one extra poorly indexed condition can have an enormous impact on performance.

Centralized storage of content data
Ordinary systems try as far as possible to split bulk content data across separate stores (the user store of the Fetion system, for instance, is split across databases), and databases have dedicated partitioning schemes, all to increase performance and retrieval efficiency. CS, for architectural reasons, keeps forums, blogs, photo albums, guestbooks and all other content-management data in cs_Groups (groups), cs_Sections (categories), cs_Threads (thread index) and cs_Posts (content). This architecture is wonderfully convenient for the code, but performance-wise it is an out-and-out killer, and it is the root cause of CS's slowness. Say my forum has 1,000,000 rows, blogs 50,000 and albums 100,000: to retrieve the latest blog posts I must search those 1.15 million rows for matching data, with extra conditions such as SettingsId and ApplicationType to pick out which site and which content type. As the data grows this inevitably becomes a nightmare, with query response times stretching from seconds to tens of seconds to SQL timeouts.

Over-reliance on caching
Caching is a fine thing: it greatly reduces database access and is indispensable for ASP.NET performance. But while designing systems and enjoying your cache, have you considered what happens when the cache is invalidated, or when it grows too large? CS users will recognize the feeling that the site is very slow just after startup, or suddenly becomes slow during use. That is because a pile of data (site settings, user profiles, the Groups collection, the Sections collection and so on) has not yet been loaded into the cache, and on a weak server loading all of it is a long process. If at that moment someone also queries the newest forum posts or unanswered threads, the nightmare begins; pray you are not the first visitor after the application pool recycles. CS's cache strategy is too coarse-grained: it tends to cache whole collections at a time (the latest-forum-posts collection, say), so the cache must be refreshed frequently and each cached entry is large. Memory usage climbs quickly, which in turn triggers frequent application pool recycles. In this way the caching that should be CS's strength turns into a liability, keeping the server's resource usage stubbornly high.

CCS making things worse
As mentioned, I did CS's localized development (CCS), and my then-meager database knowledge led to some behavior that made the performance even worse. For example, in the digest-post feature, the ValuedLevel column that marks whether a post is a digest (its digest level) has no index, so with large data volumes the lookup is slow. Since I no longer work on CCS, I have no way to fix these problems and can only offer everyone my apologies.

Postscript: how to solve it?
The simplest cure is to wait for upgrades: later CS versions will surely grow stronger, and these problems will certainly be solved step by step. If you cannot wait, you will have to roll up your sleeves: use SQL Profiler to monitor the SQL being executed, find the queries that hurt performance, and optimize them one by one.

I said above that I ended my CS performance nightmare, and some of you will surely ask how; let me leave that as a teaser. In 2005 I began thinking about how to build a high-performance system along CS's lines; I started the design in early 2006 and developed it in my spare time. It has now come some way, with a qualitative leap in performance, and I will gradually share the design and optimization lessons on my blog.

Digg PHP's Scalability and Performance
Monday April 10, 2006 9:28AM by Brian Fioca in Technical

Several weeks ago there was a notable bit of controversy over some comments made by James Gosling, father of the Java programming language. He has since addressed the flame war that erupted, but the whole ordeal got me thinking seriously about PHP and its scalability and performance abilities compared to Java. I knew that several hugely popular Web 2.0 applications were written in scripting languages like PHP, so I contacted Owen Byrne - Senior Software Engineer at digg.com - to learn how he addressed any problems they encountered during their meteoric growth. This article addresses the all-too-common false assumptions about the cost of scalability and performance in PHP applications. At the time Gosling's comments were made, I was working on tuning and optimizing the source code and server configuration for the launch of Jobby, a Web 2.0 resume tracking application written using the WASP PHP framework. I really hadn't done any substantial research on how to best optimize PHP applications at the time. My background is heavy in the architecture and development of highly scalable applications in Java, but I realized there were enough substantial differences between Java and PHP to cause me concern. In my experience, it was certainly faster to develop web applications in languages like PHP; but I was curious as to how much of that time savings might be lost to performance tuning and scaling costs. What I found was both encouraging and surprising. What are Performance and Scalability? Before I go on, I want to make sure the ideas of performance and scalability are understood. Performance is measured by the output behavior of the application. In other words, performance is whether or not the app is fast.
A good performing web application is expected to render a page in around or under 1 second (depending on the complexity of the page, of course). Scalability is the ability of the application to maintain good performance under heavy load with the addition of resources. For example, as the popularity of a web application grows, it can be called scalable if you can maintain good performance metrics by simply making small hardware additions. With that in mind, I wondered how PHP would perform under heavy load, and whether it would scale well compared with Java. Hardware Cost My first concern was raw horsepower. Executing scripting language code is more hardware intensive because the code isn't compiled. The hardware we had available for the launch of Jobby was a single hosted Linux server with a 2GHz processor and 1GB of RAM. On this single modest server I was going to have to run both Apache 2 and MySQL. Previous applications I had worked on in Java had been deployed on 10-20 application servers with at least 2 dedicated, massively parallel, ultra expensive database servers. Of course, these applications handled traffic in the millions of hits per month. To get a better idea of what was in store for a heavily loaded PHP application, I set up an interview with Owen Byrne, cofounder and Senior Software Engineer at digg.com. From talking with Owen I learned digg.com gets on the order of 200 million page views per month, and they're able to handle it with only 3 web servers and 8 small database servers (I'll discuss the reason for so many database servers in the next section). Even better news was that they were able to handle their first year's worth of growth on a single hosted server like the one I was using. My hardware worries were relieved. The hardware requirements to run high-traffic PHP applications didn't seem to be more costly than for Java. Database Cost Next I was worried about database costs. The enterprise Java applications I had worked on were powered by expensive database software like Oracle, Informix, and DB2. I had decided early on to use MySQL for my database, which is of course free. I wondered whether the simplicity of MySQL would be a liability when it came to trying to squeeze the last bit of performance out of the database. MySQL has had a reputation for being slow in the past, but most of that seems to have come from sub-optimal configuration and the overuse of MyISAM tables. Owen confirmed that the use of InnoDB tables for read/write data makes a massive performance difference. There are some scalability issues with MySQL, one being the need for large numbers of slave databases. However, these issues are decidedly not PHP related, and are being addressed in future versions of MySQL. It could be argued that even with the large number of slave databases that are needed, the hardware required to support them is less expensive than the 8+ CPU boxes that typically power large Oracle or DB2 databases. The database requirements to run massive PHP applications still weren't more costly than for Java. PHP Coding Cost Lastly, and most importantly, I was worried about scalability and performance costs directly attributed to the PHP language itself. During my conversation with Owen I asked him if there were any performance or scalability problems he encountered that were related to having chosen to write the application in PHP.
A bit to my surprise, he responded by saying, "none of the scaling challenges we faced had anything to do with PHP," and that "the biggest issues faced were database related." He even added, "in fact, we found that the lightweight nature of PHP allowed us to easily move processing tasks from the database to PHP in order to deal with that problem." Owen mentioned they use the APC PHP accelerator platform as well as MCache to lighten their database load. Still, I was skeptical. I had written Jobby entirely in PHP 5 using a framework which uses a highly object oriented MVC architecture to provide application development scalability. How would this hold up to large amounts of traffic? My worries were largely related to the PHP engine having to effectively parse and interpret every included class on each page load. I discovered this was just my misunderstanding of the best way to configure a PHP server. After doing some research, I found that by using a combination of Apache 2's worker threads, FastCGI, and a PHP accelerator, this was no longer a problem. Any class or script loading overhead was only encountered on the first page load. Subsequent page loads were of comparable performance to a typical Java application. Making these configuration changes was trivial and generated massive performance gains. With regard to scalability and performance, PHP itself, even PHP 5 with heavy OO, was not more costly than Java. Conclusion Jobby was launched successfully on its single modest server and, thanks to links from Ajaxian and TechCrunch, went on to happily survive hundreds of thousands of hits in a single week. Assuming I applied all of my newfound PHP tuning knowledge correctly, the application should be able to handle much more load on its current hardware. Digg is in the process of preparing to scale to 10 times current load. I asked Owen Byrne if that meant an increase in headcount and he said that wasn't necessary. The only real change they identified was a switch to a different database platform. There doesn't seem to be any additional manpower cost to PHP scalability either. It turns out that it really is fast and cheap to develop applications in PHP. Scaling and performance challenges are almost always related to the data layer, and are common across all language platforms. Even as a self-proclaimed PHP evangelist, I was very startled to find out that all of the theories I was subscribing to were true. There is simply no truth to the idea that Java is better than scripting languages at writing scalable web applications. I won't go as far as to say that PHP is better than Java, because it is never that simple. However it just isn't true to say that PHP doesn't scale, and with the rise of Web 2.0, sites like Digg, Flickr, and even Jobby are proving that large scale applications can be rapidly built and maintained on-the-cheap, by one or two developers. Further Reading

YouTube Architecture
Tue, 07/17/2007 - 20:20 — Todd Hoff YouTube Architecture (3936) YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google? Information Sources • Google Video Platform • Apache • Python • Linux (SuSe) • MySQL • psyco, a dynamic python->C compiler • lighttpd for video instead of Apache What's Inside? The Stats • Supports the delivery of over 100 million videos per day. • Founded 2/2005 • 3/2006 30 million video views/day • 7/2006 100 million video views/day • 2 sysadmins, 2 scalability software architects • 2 feature developers, 2 network engineers, 1 DBA Recipe for handling rapid growth
while (true) { identify_and_fix_bottlenecks(); drink(); sleep(); notice_new_bottleneck(); } This loop runs many times a day. Web Servers • NetScaler is used for load balancing and caching static content. • Run Apache with mod_fast_cgi. • Requests are routed for handling by a Python application server. • Application server talks to various databases and other information sources to get all the data and formats the html page. • Can usually scale the web tier by adding more machines. • The Python web code is usually NOT the bottleneck; it spends most of its time blocked on RPCs. • Python allows rapid flexible development and deployment. This is critical given the competition they face. • Usually less than 100 ms page service times. • Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops. • For high CPU intensive activities like encryption, they use C extensions. • Some pre-generated cached HTML for expensive to render blocks. • Row level caching in the database. • Fully formed Python objects are cached. • Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends. Video Serving • Costs include bandwidth, hardware, and power consumption. • Each video is hosted by a mini-cluster. Each video is served by more than one machine. • Using a cluster means: - More disks serving content which means more speed. - Headroom. If a machine goes down others can take over. - There are online backups. • Servers use the lighttpd web server for video: - Apache had too much overhead. - Uses epoll to wait on multiple fds. - Switched from single process to multiple process configuration to handle more connections. • Most popular content is moved to a CDN (content delivery network): - CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network. - CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory. • Less popular content (1-20 views per day) uses YouTube servers in various colo sites. - There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disk blocks are being accessed. - Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won't always be your performance savior. - Tune RAID controller and pay attention to other lower level issues to help. - Tune memory on each machine so there's not too much and not too little. Serving Video Key Points • Keep it simple and cheap. • Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load. • Use commodity hardware. The more expensive the hardware gets, the more expensive everything else gets too (support contracts). You are also less likely to find help on the net. • Use simple common tools. They use most tools built into Linux and layer on top of those. • Handle random seeks well (SATA, tweaks). Serving Thumbnails • Surprisingly difficult to do efficiently.
• There are about 4 thumbnails for each video so there are a lot more thumbnails than videos. • Thumbnails are hosted on just a few machines. • Saw problems associated with serving a lot of small objects: - Lots of disk seeks and problems with inode caches and page caches at OS level. - Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea. - A high number of requests/sec as web pages can display 60 thumbnails per page. - Under such high loads Apache performed badly. - Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20. - Tried using lighttpd, but with a single thread it stalled. Ran into problems with multiprocess mode because each process kept a separate cache. - With so many images setting up a new machine took over 24 hours. - Rebooting a machine took 6-10 hours for the cache to warm up enough to not go to disk. • To solve all their problems they started using Google's BigTable, a distributed data store: - Avoids small file problem because it clumps files together. - Fast, fault tolerant. Assumes it's working on an unreliable network. - Lower latency because it uses a distributed multilevel cache. This cache works across different collocation sites. - For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable. Databases • The Early Years - Use MySQL to store meta data like users, tags, and descriptions. - Served data off a monolithic RAID 10 volume with 10 disks. - Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered. - They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach. - Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master. - Updates cause cache misses which go to disk, where slow I/O causes slow replication. - Using a replicating architecture you need to spend a lot of money for incremental bits of write performance. - One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video so that function should get the most resources. The social networking features of YouTube are less important so they can be routed to a less capable cluster. • The later years: - Went to database partitioning. - Split into shards with users assigned to different shards. - Spreads writes and reads. - Much better cache locality which means less IO. - Resulted in a 30% hardware reduction. - Reduced replica lag to 0. - Can now scale the database almost arbitrarily. Data Center Strategy • Used managed hosting providers at first. Living off credit cards so it was the only way. • Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements. • So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts. • Use 5 or 6 data centers plus the CDN.
• Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN. • Video is bandwidth dependent, not really latency dependent. Can come from any colo. • For images latency matters, especially when you have 60 images on a page. • Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest. Lessons Learned • Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions. • Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities. • Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas. • Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening. • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more write performance. • Constant iteration on bottlenecks: - Software: DB, caching - OS: disk I/O - Hardware: memory, RAID • You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

Justin Silverton at Jaslabs has a supposed list of 10 tips for optimizing MySQL queries. I couldn't read this and let it stand because this list is really, really bad. Some guy named Mike noted this, too. So in this entry I'll do two things: first, I'll explain why his list is bad; second, I'll present my own list which, hopefully, is much better. Onward, intrepid readers! Why That List Sucks 1. He's swinging for the top of the trees The rule in any situation where you want to optimize some code is that you first profile it and then find the bottlenecks. Mr. Silverton, however, aims right for the tippy top of the trees. I'd say 60% of database optimization is properly understanding SQL and the basics of databases. You need to understand joins vs. subselects, column indices, how to normalize data, etc. The next 35% is understanding the performance characteristics of your database of choice. COUNT(*) in MySQL, for example, can either be almost-free or painfully slow depending on which storage engine you're using. Other things to consider: under what conditions does your database invalidate caches, when does it sort on disk rather than in memory, when does it need to create temporary tables, etc. The final 5%, where few ever need to venture, is where Mr. Silverton spends most of his time. Never once in my life have I used SQL_SMALL_RESULT. 2. Good problems, bad solutions There are cases when Mr. Silverton does note a good problem. MySQL will indeed use a dynamic row format if it contains variable length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table. The following schema represents this idea:

CREATE TABLE posts (
  id int UNSIGNED NOT NULL AUTO_INCREMENT,
  author_id int UNSIGNED NOT NULL,
  created timestamp NOT NULL,
  PRIMARY KEY(id)
);

CREATE TABLE posts_data (
  post_id int UNSIGNED NOT NULL,
  body text,
  PRIMARY KEY(post_id)
);

3. That's just...yeah Some of his suggestions are just mind-boggling, e.g., "remove unnecessary parentheses." It really doesn't matter whether you do SELECT * FROM posts WHERE (author_id = 5 AND published = 1) or SELECT * FROM posts WHERE author_id = 5 AND published = 1. None. Any decent DBMS is going to optimize these away. This level of detail is akin to wondering when writing a C program whether the post-increment or pre-increment operator is faster. Really, if that's where you're spending your energy, it's a surprise you've written any code at all. My list Let's see if I fare any better. I'm going to start from the most general. 4. Benchmark, benchmark, benchmark! You're going to need numbers if you want to make a good decision. What queries are the worst? Where are the bottlenecks? Under what circumstances am I generating bad queries? Benchmarking will let you simulate high-stress situations and, with the aid of profiling tools, expose the cracks in your database configuration. Tools of the trade include supersmack, ab, and SysBench. These tools either hit your database directly (e.g., supersmack) or simulate web traffic (e.g., ab). 5. Profile, profile, profile! So, you're able to generate high-stress situations, but now you need to find the cracks. This is what profiling is for. Profiling enables you to find the bottlenecks in your configuration, whether they be in memory, CPU, network, disk I/O, or, what is more likely, some combination of all of them. The very first thing you should do is turn on the MySQL slow query log and install mtop. This will give you access to information about the absolute worst offenders. Have a ten-second query ruining your web application? These guys will show you the query right off. After you've identified the slow queries you should learn about the MySQL internal tools, like EXPLAIN, SHOW STATUS, and SHOW PROCESSLIST. These will tell you what resources are being spent where, and what side effects your queries are having, e.g., whether your heinous triple-join subselect query is sorting in memory or on disk. Of course, you should also be using your usual array of command-line profiling tools like top, procinfo, vmstat, etc. to get more general system performance information. 6. Tighten Up Your Schema Before you even start writing queries you have to design a schema. Remember that the memory requirements for a table are going to be around #entries * size of a row. Unless you expect every person on the planet to register 2.8 trillion times on your website you do not in fact need to make your user_id column a BIGINT. Likewise, if a text field will always be a fixed length (e.g., a US zipcode, which always has a canonical representation of the form "XXXXX-XXXX") then a VARCHAR declaration just adds a superfluous byte for every row. Some people pooh-pooh database normalization, saying it produces unnecessarily complex schemas. However, proper normalization results in a minimization of redundant data. Fundamentally that means a smaller overall footprint at the cost of performance — the usual performance/memory tradeoff found everywhere in computer science. The best approach, IMO, is to normalize first and denormalize where performance demands it. Your schema will be more logical and you won't be optimizing prematurely.
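To make the schema-tightening advice concrete, here is a small hedged sketch (table and column names are invented for illustration, and the right sizes always depend on your actual data): a user table keyed by a right-sized integer and using fixed-width CHAR for the zipcode, run through PHP as used elsewhere in this compilation:

CODE:
<?php
// Illustrative schema tightening: MEDIUMINT UNSIGNED covers ~16 million
// rows (no need for BIGINT), and a US zip+4 code is always 10 characters,
// so fixed-width CHAR avoids VARCHAR's extra length byte per row.
mysql_query("CREATE TABLE users (
    user_id  MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- not BIGINT
    zipcode  CHAR(10) NOT NULL,                           -- 'XXXXX-XXXX'
    PRIMARY KEY(user_id)
)");
?>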
7. Partition Your Tables Often you have a table in which only a few columns are accessed frequently. On a blog, for example, one might display entry titles in many places (e.g., a list of recent posts) but only ever display teasers or the full post bodies once on a given page. Vertical partitioning helps:

CREATE TABLE posts (
  id int UNSIGNED NOT NULL AUTO_INCREMENT,
  author_id int UNSIGNED NOT NULL,
  title varchar(128),
  created timestamp NOT NULL,
  PRIMARY KEY(id)
);

CREATE TABLE posts_data (
  post_id int UNSIGNED NOT NULL,
  teaser text,
  body text,
  PRIMARY KEY(post_id)
);

The above represents a situation where one is optimizing for reading. Frequently accessed data is kept in one table while infrequently accessed data is kept in another. Since the data is now partitioned the infrequently accessed data takes up less memory. You can also optimize for writing: frequently changed data can be kept in one table, while infrequently changed data can be kept in another. This allows more efficient caching since MySQL no longer needs to expire the cache for data which probably hasn't changed. 8. Don't Overuse Artificial Primary Keys Artificial primary keys are nice because they can make the schema less volatile. If we stored geography information in the US based on zip code, say, and the zip code system suddenly changed we'd be in a bit of trouble. On the other hand, many times there are perfectly fine natural keys. One example would be a join table for many-to-many relationships. What not to do:

CREATE TABLE posts_tags (
  relation_id int UNSIGNED NOT NULL AUTO_INCREMENT,
  post_id int UNSIGNED NOT NULL,
  tag_id int UNSIGNED NOT NULL,
  PRIMARY KEY(relation_id),
  UNIQUE INDEX(post_id, tag_id)
);

Not only is the artificial key entirely redundant given the column constraints, but the number of post-tag relations is now limited by the maximum size of an integer. Instead one should do:

CREATE TABLE posts_tags (
  post_id int UNSIGNED NOT NULL,
  tag_id int UNSIGNED NOT NULL,
  PRIMARY KEY(post_id, tag_id)
);

9. Learn Your Indices Often your choice of indices will make or break your database. For those who haven't progressed this far in their database studies, an index is a sort of hash. If we issue the query SELECT * FROM users WHERE last_name = 'Goldstein' and last_name has no index then your DBMS must scan every row of the table and compare it to the string 'Goldstein.' An index is usually a B-tree (though there are other options) which speeds up this comparison considerably. You should probably create indices for any field on which you are selecting, grouping, ordering, or joining. Obviously each index requires space proportional to the number of rows in your table, so too many indices wind up taking more memory. You also incur a performance hit on write operations, since every write now requires that the corresponding index be updated. There is a balance point which you can uncover by profiling your code. This varies from system to system and implementation to implementation. 10. SQL is Not C C is the canonical procedural programming language and the greatest pitfall for a programmer looking to show off his database-fu is that he fails to realize that SQL is not procedural (nor is it functional or object-oriented, for that matter). Rather than thinking in terms of data and operations on data one must think of sets of data and relationships among those sets. This usually crops up with the improper use of a subquery:

SELECT a.id,
       (SELECT MAX(created)
        FROM posts
        WHERE author_id = a.id) AS latest_post
FROM authors a

Since this subquery is correlated, i.e., references a table in the outer query, one should convert the subquery to a join:

SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
INNER JOIN posts p ON (a.id = p.author_id)
GROUP BY a.id

11. Understand your engines MySQL has two primary storage engines: MyISAM and InnoDB. Each has its own performance characteristics and considerations. In the broadest sense MyISAM is good for read-heavy data and InnoDB is good for write-heavy data, though there are cases where the opposite is true. The biggest gotcha is how the two differ with respect to the COUNT function. MyISAM keeps an internal cache of table meta-data like the number of rows. This means that, generally, COUNT(*) incurs no additional cost for a well-structured query. InnoDB, however, has no such cache. For a concrete example, let's say we're trying to paginate a query. If you have a query SELECT * FROM users LIMIT 5,10, let's say, running SELECT COUNT(*) FROM users is essentially free with MyISAM but takes the same amount of time as the first query with InnoDB. MySQL has a SQL_CALC_FOUND_ROWS option which tells InnoDB to calculate the number of rows as it runs the query, which can then be retrieved by executing SELECT FOUND_ROWS(). This is very MySQL-specific, but can be necessary in certain situations, particularly if you use InnoDB for its other features (e.g., row-level locking, stored procedures, etc.). 12. MySQL specific shortcuts MySQL provides many extensions to SQL which help performance in many common use scenarios. Among these are INSERT ... SELECT, INSERT ... ON DUPLICATE KEY UPDATE, and REPLACE. I rarely hesitate to use the above since they are so convenient and provide real performance benefits in many situations. MySQL has other keywords which are more dangerous, however, and should be used sparingly. These include INSERT DELAYED, which tells MySQL that it is not important to insert the data immediately (say, e.g., in a logging situation). The problem with this is that under high load situations the insert might be delayed indefinitely, causing the insert queue to balloon. You can also give MySQL index hints about which indices to use. MySQL gets it right most of the time and when it doesn't it is usually because of a bad schema or poorly written query. 13. And one for the road... Last, but not least, read Peter Zaitsev's MySQL Performance Blog if you're into the nitty-gritty of MySQL performance. He covers many of the finer aspects of database administration and performance. Library This is a collection of slides, presentations and videos on topics related to the design of high-throughput, scalable, highly available websites that I've been collecting for a while. 8/28/2007 Blog Inside Myspace
8/28/2007 FAQ Good Memcached FAQ
8/28/2007 Blog Measuring scalability
8/28/2007 Blog Distributed caching with memcached
8/28/2007 Blog Mailinator stats (all on a single server)
8/28/2007 Blog Architecture of Mailinator
8/25/2007 Slides 74 Building Scalable web architectures
8/25/2007 Slides 42 Typepad Architecture change: Change Your Car’s Tires at 100 mph
8/25/2007 Slides Skype protocol (also talks about p2p connections which is critical for its scalability)
8/20/2007 slides Slashdot’s History of scaling Mysql
8/19/2007 Slides 20 Big Bad Postgres SQL
8/19/2007 Slides 90 Scalable internet architectures
8/19/2007 Slides 59 Production troubleshooting (not related to scalability… but shit happens everywhere)
8/19/2007 Slides 31 Clustered Logging with mod_log_spread
8/19/2007 Slides 86 Understanding and Building HA/LB clusters
8/12/2007 Blog Multi-Master Mysql Replication
8/12/2007 Blog Large-Scale Methodologies for the World Wide Web
8/12/2007 Blog Scaling gracefully
8/12/2007 Blog Implementing Tag cloud - The nasty way
8/12/2007 Blog Normalized Data is for sissies
8/12/2007 Slides APC at facebook
8/6/2007 Video Plenty Of fish interview with its CEO
8/6/2007 Slides PHP scalability myth
8/6/2007 Slides 79 High performance PHP
8/6/2007 Blog Digg: PHP’s scalability and Performance
8/2/2007 Blog Getting Started with Drupal
8/2/2007 Blog 4 Problems with Drupal
8/2/2007 Video 55m Seattle Conference on Scalability: MapReduce Used on Large Data Sets
8/2/2007 Video 60m Seattle Conference on Scalability: Scaling Google for Every User
8/2/2007 Video 53m Seattle Conference on Scalability: VeriSign’s Global DNS Infrastructure
8/2/2007 Video 53m Seattle Conference on Scalability: YouTube Scalability
8/2/2007 Video 59m Seattle Conference on Scalability: Abstractions for Handling Large Datasets
8/2/2007 Video 55m Seattle Conference on Scalability: Building a Scalable Resource Management
8/2/2007 Video 44m Seattle Conference on Scalability: SCTPs Reliability and Fault Tolerance
8/2/2007 Video 27m Seattle Conference on Scalability: Lessons In Building Scalable Systems
8/2/2007 Video 41m Seattle Conference on Scalability: Scalable Test Selection Using Source Code
8/2/2007 Video 53m Seattle Conference on Scalability: Lustre File System
8/2/2007 Slides 16 Technology at Digg.com
8/2/2007 Blog Extreme Makeover: Database or MySQL@YouTube
8/2/2007 Slides 60 “Real Time
8/2/2007 Blog Mysql at Google
8/2/2007 Slides 56 Scaling Twitter
8/2/2007 Slides 53 How we build Vox
8/2/2007 Slides 97 High Performance websites
8/2/2007 Slides 101 Beyond the file system design
8/2/2007 Slides 145 Scalable web architectures
8/2/2007 Blog “Build Scalable Web 2.0 Sites with Ubuntu
8/2/2007 Slides 34 Scalability set Amazon’s servers on fire not yours
8/2/2007 Slides 41 Hardware layouts for LAMP installations
8/2/2007 Video 91m Mysql scaling and high availability architectures
8/2/2007 Audio 137 Lessons from Building world’s largest social music platform
8/2/2007 PDF 137 Lessons from Building world’s largest social music platform
8/2/2007 Slides 137 Lessons from Building world’s largest social music platform
8/2/2007 PDF 80 Livejournal’s backend: history of scaling
8/2/2007 Slides 80 Livejournal’s backend: history of scaling
8/2/2007 Slides 26 Scalable Web Architectures (w/ Ruby and Amazon S3)
8/2/2007 Blog Yahoo! bookmarks uses symfony
8/2/2007 Slides Getting Rich with PHP 5
8/2/2007 Audio Getting Rich with PHP 5
8/2/2007 Blog Scaling Fast and Cheap - How We Built Flickr
8/2/2007 News Open source helps Flickr share photos
8/2/2007 Slides 41 Flickr and PHP
8/2/2007 Slides 30 Wikipedia: Cheap and explosive scaling with LAMP
8/2/2007 Blog YouTube Scalability Talk
8/2/2007 High Order Bit: Architecture for Humanity
8/2/2007 PDF Mysql and Web2.0 companies
8/3/2007 Slides 36 Building Highly Scalable Web Applications
8/3/2007 Introduction to hadoop
8/3/2007 webpage The Hadoop Distributed File System: Architecture and Design
8/3/2007 Interpreting the Data: Parallel Analysis with Sawzall
8/3/2007 PDF ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
8/3/2007 PDF SEDA: An Architecture for well conditioned scalable internet services
8/3/2007 PDF A scalable architecture for Global web service hosting service
8/3/2007 Meet Hadoop
8/3/2007 Blog Yahoo’s Hadoop Support
8/3/2007 Blog Running Hadoop MapReduce on Amazon EC2 and Amazon S3
8/3/2007 53m LH*RSP2P : A Scalable Distributed Data Structure for P2P Environment
8/3/2007 90m Scaling the Internet routing table with Locator/ID Separation Protocol (LISP)
8/3/2007 Hadoop Map/Reduce
8/3/2007 Slides Hadoop distributed file system
8/3/2007 Video 45m Brad Fitzpatrick - Behind the Scenes at LiveJournal: Scaling Storytime
8/3/2007 PDF Bigtable: A Distributed Storage System for Structured Data
8/3/2007 PDF Fault-Tolerant and scalable TCP splice and web server architecture
8/3/2007 Video BigTable: A Distributed Structured Storage System
8/3/2007 PDF MapReduce: Simplified Data Processing on Large Clusters
8/3/2007 PDF Google Cluster architecture
8/3/2007 PDF Google File System
8/3/2007 Doc Implementing a Scalable Architecture
8/3/2007 News How linux saved Millions for Amazon
8/3/2007 Yahoo experience with hadoop
8/3/2007 Slides Scalable web application using Mysql and Java
8/3/2007 Slides Friendster: scaling for 1 Billion Queries per day
8/3/2007 Blog Lightweight web servers
8/3/2007 PDF Mysql Scale out by application partitioning
8/3/2007 PDF Replication under scalable hashing: A family of algorithms for Scalable decentralized data distribution
8/3/2007 Product Clustered storage revolution
8/3/2007 Blog Early Amazon Series
8/3/2007 Web Wikimedia Server info
8/3/2007 Slides 32 Wikimedia Architecture
8/3/2007 Slides 21 MySpace presentation
8/3/2007 PDF A scalable and fault-tolerant architecture for distributed web resource discovery
8/4/2007 PDF The Chubby Lock Service for Loosely-Coupled Distributed Systems
8/5/2007 Slides 47 Real world Mysql tuning
8/5/2007 Slides 100 Real world Mysql performance tuning
8/5/2007 Slides 63 Learning MogileFS: Building scalable storage system
8/5/2007 Slides Reverse Proxy and Webserver
8/5/2007 PDF Case for Shared Nothing
8/5/2007 Slides 27 A scalable stateless proxy for DBI
8/5/2007 Slides 91 Real world scalability web builder 2006
8/5/2007 Slides 52 Real world web scalability
Friendster Architecture
Thu, 07/12/2007 - 05:18 — Todd Hoff • Friendster Architecture (341) Friendster is one of the largest social network sites on the web. It emphasizes genuine friendships and the discovery of new people through friends. Site: http://www.friendster.com/ Information Sources • Friendster - Scaling for 1 Billion Queries per day Platform • MySQL • Perl • PHP • Linux • Apache What's Inside? • Dual x86-64 AMD Opterons with 8 GB of RAM • Faster disk (SAN) • Optimized indexes • Traditional 3-tier architecture with hardware load balancer in front of the databases • Clusters based on types: ad, app, photo, monitoring, DNS, gallery search DB, profile DB, user info DB, IM status cache, message DB, testimonial DB, friend DB, graph servers, gallery search, object cache. Lessons Learned • No persistent database connections. • Removed all sorts. • Optimized indexes. • Don't go after the biggest problems first. • Optimize without downtime. • Split load. • Moved sorting query types into the application and added LIMITs (a sketch of this pattern follows the list). • Reduced ranges. • Range on primary key. • Benchmark -> Make Change -> Benchmark -> Make Change (Cycle of Improvement). • Stabilize: always have a plan to rollback. • Work with a team. • Assess: define the issues. • A key design goal for the new system was to move away from maintaining session state toward a stateless architecture that would clean up after each request. • Rather than buy big, centralized boxes, [our philosophy] was about buying a lot of thin, cheap boxes. If one fails, you roll over to another box.
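The "range on primary key" plus "added LIMITs" lessons above are commonly implemented as keyset pagination; a hedged PHP sketch (table and column names are invented, not Friendster's actual schema):

CODE:
<?php
// Sketch: page through rows by ranging on the primary key instead of
// using OFFSET, so MySQL scans only the rows actually returned.
$last_id = 0;                       // id of the last row of the previous page
$res = mysql_query(sprintf(
    "SELECT id, friend_id FROM friends WHERE id > %d ORDER BY id LIMIT 20",
    $last_id));
while ($row = mysql_fetch_assoc($res)) {
    $last_id = $row['id'];          // remember where the next page starts
}
?>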
Feedblendr Architecture - Using EC2 to Scale
Wed, 10/31/2007 - 05:15 — Todd Hoff • Feedblendr Architecture - Using EC2 to Scale (56) A man had a dream. His dream was to blend a bunch of RSS/Atom/RDF feeds into a single feed. The man is Beau Lebens of Feedville and like most dreamers he was a little short on coin. So he took refuge in the home of a cheap hosting provider and Beau realized his dream, creating FEEDblendr. But FEEDblendr chewed up so much CPU creating blended feeds that the cheap hosting provider ordered Beau to find another home. Where was Beau to go? He eventually found a new home in the virtual machine room of Amazon's EC2. This is the story of how Beau was finally able to create his own feeds safe within the cradle of affordable CPU cycles. Site: http://feedblendr.com/ The Platform • EC2 (Fedora Core 6 Lite distro) • S3 • Apache • PHP • MySQL • DynDNS (for round robin DNS) The Stats • Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr. • FEEDblendr uses 2 EC2 instances. The same Amazon Machine Image (AMI) is used for both instances. • Over 10,000 blends have been created, containing over 45,000 source feeds. • Approx 30 blends created per day. Processors on the 2 instances are actually pegged pretty high (load averages at ~ 10 - 20 most of the time). The Architecture • Round robin DNS is used to load balance between instances. - The DNS is updated by hand: an instance is validated to be working correctly before the DNS is updated. - Instances seem to be more stable now than they were in the past, but you must still assume they can be lost at any time and no data will be persisted between reboots. • The database is still hosted on an external service because EC2 does not have a decent persistent storage system. • The AMI is kept as minimal as possible. It is a clean instance with some auto-deployment code to load the application off of S3. This means you don't have to create new instances for every software release. • The deployment process is: - Software is developed on a laptop and stored in subversion. - A makefile is used to get a revision, fix permissions etc, package and push to S3. - When the AMI launches it runs a script to grab the software package from S3. - The package is unpacked and a specific script inside is executed to continue the installation process. - Configuration files for Apache, PHP, etc are updated. - Server-specific permissions, symlinks etc are fixed up. - Apache is restarted and email is sent with the IP of that machine. Then the DNS is updated by hand with the new IP address. • Feeds are intelligently cached independently on each instance. This is to reduce the costly polling for feeds as much as possible. S3 was tried as a common feed cache for both instances, but it was too slow. Perhaps feeds could be written to each instance so they would be cached on each machine? Lessons Learned • A low budget startup can effectively bootstrap using EC2 and S3. • For the budget conscious the free ZoneEdit service might work just as well as the paid DynDNS service (which works fine). • Round robin load balancing is slow and unreliable. Even with a short TTL for the DNS some systems hold on to the IP address for a long time, so new machines are not load balanced to. • Many problems exist with RSS implementations that keep feeds from being effectively blended. A lot of CPU is spent reading and blending feeds unnecessarily because there's no reliable cross-implementation way to tell when a feed has really changed or not.
• It's really a big mindset change to consider that your instances can go away at any time. You have to change your architecture and design to live with this fact. But once you internalize this model, most problems can be solved. • EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be. • Use the AMI's ability to be passed a parameter to select which configuration to load from S3. This allows you to test different configurations without moving/deleting the current active one. • Create an automated test system to validate an instance as it boots. Then automatically update the DNS if the tests pass. This makes it easy to create new instances and takes the slow human out of the loop. • Always load software from S3. The last thing you want happening is your instance loading, and for some reason not being able to contact your SVN server, and thus failing to load properly. Putting it in S3 virtually eliminates the chances of this occurring, because it's on the same network. Related Articles • What is a 'River of News' style aggregator? • Build an Infinitely Scalable Infrastructure for 0 Using Amazon Services Comments Wed, 10/31/2007 - 15:04 — Greg Linden (not verified) Re: Feedblendr Architecture - Using EC2 to Scale I might be missing something, but I don't see how this is an interesting example of "using EC2 to scale". There appears to be no difference between using EC2 in the way Beau is using it and setting up two leased servers from a normal provider. In fact, getting leased servers might be better, since the cost might be lower (an EC2 instance has a fixed monthly cost plus bandwidth) and the database would be on the same network. Beau does not appear to be doing anything that takes advantage of EC2, such as dynamically creating and discarding instances based on demand. Am I missing something here? Is this an interesting use of EC2 to scale? Wed, 10/31/2007 - 16:35 — Todd Hoff
Re: Feedblendr Architecture - Using EC2 to Scale > I might be missing something, but I don't see how this is an interesting example of "using EC2 to scale". I admit to being a bit polymorphously perverse with respect to finding things interesting, but from Beau's position, which many people are in, the drama is thrilling. The story starts with a conflict: how to implement this idea? The first option is the traditional cheap host option. And for a long time that would have been the end of the story. Dedicated servers with higher end CPUs, RAM, and persistent storage are still not cheap. So if you aren't making money that would have been where the story ended. Scaling by adding more and more dedicated servers would be impossible. Hopefully the new grid model will allow a lot of people to keep writing their stories. His learning curve of creating the system is what was most interesting. Figuring out how to set things up, load balance, load the software, test it, regular nuts and bolts development stuff. And that puts him in the position of being able to get more CPU immediately when and if the time comes. He'll be able to add that feature in quickly because he's already done the ground work. But for now it's running fine. The spanner in the plan was the database and that points out the fatal flaw of EC2, which is the database. The plan would look a bit more successful if that part had worked out better, but it didn't, which is also interesting. Wed, 10/31/2007 - 18:41 — Beau Lebens (not verified) Re: Feedblendr Architecture - Using EC2 to Scale @Todd, thanks for the write-up, and a couple quick corrections/clarifications: - "Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr." - Just to be clear, the learning curve was mostly in dealing with EC2 and how it works, not so much FeedBlendr, which at its core is relatively simple. - "no data will be persisted between reboots" this is not exactly true. Rebooting will persist data, but a true "crash" or termination of your instance will discard everything. - "The database is still hosted on an external service because EC2 does not have a decent persistent storage system" - more the case here is that I didn't want to have to deal with (or pay for) setting something up to cater to them not having persistent storage. It is being done by other people, and can be done, it just seemed like overkill for what I was doing. - "EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be" - to be clear, EC2 has no inherent load balancing, so it's up to you (the developer/admin) to provide it yourself somehow. There are a number of different ways of doing it, but I chose dynamic DNS because it was something I was familiar with. @Greg in response to your question - I suppose the point here is that even though FeedBlendr isn't currently a poster-child for scaling, that's also kind of the point. As Todd says, this is about the learning curve and trials and tribulations of getting to a point where it can scale. There is nothing stopping me (other than budget!) from launching an additional 5 instances right now and adding them into DNS, and then I've suddenly scaled. From there I can kill some instances off and scale back. This is all about getting to the point where I even have that option, and how it was done on EC2 in particular. Cheers, Beau

PlentyOfFish Architecture
Tue, 10/30/2007 - 04:48 — Todd Hoff • PlentyOfFish Architecture (983) Update: by Facebook standards Read/WriteWeb says POF is worth a cool one billion dollars. It helps to talk like Dr. Evil when saying it out loud. PlentyOfFish is a hugely popular on-line dating system slammed by over 45 million visitors a month and 30+ million hits a day (500 - 600 pages per second). But that's not the most interesting part of the story. All this is handled by one person, using a handful of servers, working a few hours a day, while making millions of dollars a year from Google ads. Jealous? I know I am. How are all these love connections made using so few resources? Site: http://www.plentyoffish.com/ Information Sources • Channel9 Interview with Markus Frind • Blog of Markus Frind • Plentyoffish: 1-Man Company May Be Worth $1 Billion The Platform • Microsoft Windows • ASP.NET • IIS • Akamai CDN • Foundry ServerIron Load Balancer The Stats • PlentyOfFish (POF) gets 1.2 billion page views/month, and 500,000 average unique logins per day. The peak season is January, when it will grow 30 percent. • POF has one single employee: the founder and CEO Markus Frind. • Makes millions of dollars a year on Google ads working only two hours a day. • 30+ Million Hits a Day (500 - 600 pages per second). • 1.1 billion page views and 45 million visitors a month. • Has 5-10 times the click through rate of Facebook. • A top 30 site in the US based on Compete's Attention metric, top 10 in Canada and top 30 in the UK. • 2 load balanced web servers, each with 2 Quad Core Intel Xeon X5355 @ 2.66GHz CPUs, 8 GB of RAM (using about 800 MB), and 2 hard drives, running Windows Server 2003 x64. • 3 DB servers. No data on their configuration. • Approaching 64,000 simultaneous connections and 2 million page views per hour. • Internet connection is a 1Gbps line of which 200Mbps is used. • 1 TB/day serving 171 million images through Akamai. • 6TB storage array to handle millions of full sized images being uploaded every month to the site. What's Inside • Revenue model has been to use Google ads. Match.com, in comparison, generates hundreds of millions of dollars a year, primarily from subscriptions. POF's revenue model is about to change so it can capture more revenue from all those users. The plan is to hire more employees, hire sales people, and sell ads directly instead of relying solely on AdSense. • With 30 million page views a day you can make good money on advertising, even at 5 - 10 cents CPM. • Akamai is used to serve 100 million plus image requests a day. If you have 8 images and each takes 100 msecs you are talking about a second of load time just for the images. So distributing the images makes sense. • Tens of millions of image requests are served directly from their servers, but the majority of these images are less than 2KB and are mostly cached in RAM. • Everything is dynamic. Nothing is static. • All outbound data is Gzipped at a cost of only 30% CPU usage. This implies a lot of processing power on those servers, but it really cuts bandwidth usage. • No caching functionality in ASP.NET is used. It is not used because as soon as the data is put in the cache it's already expired. • No built in components from ASP are used. Everything is written from scratch. Nothing is more complex than simple if-thens and for loops. Keep it simple. • Load balancing - IIS arbitrarily limits the total connections to 64,000 so a load balancer was added to handle the large number of simultaneous connections.
Adding a second IP address and then using a round robin DNS was considered, but the load balancer was considered more redundant and allowed easier swap in of more web servers. And using ServerIron allowed advanced functionality like bot blocking and load balancing based on passed on cookies, session data, and IP data. - The Windows Network Load Balancing (NLB) feature was not used because it doesn't do sticky sessions. A way around this would be to store session state in a database or in a shared file system. - 8-12 NLB servers can be put in a farm and there can be an unlimited number of farms. A DNS round-robin scheme can be used between farms. Such an architecture has been used to enable 70 front end web servers to support over 300,000 concurrent users. - NLB has an affinity option so a user always maps to a certain server, thus no external storage is used for session state and if the server fails the user loses their state and must relogin. If this state includes a shopping cart or other important data, this solution may be poor, but for a dating site it seems reasonable. - It was thought that the cost of storing and fetching session data in software was too expensive. Hardware load balancing is simpler. Just map users to specific servers and if a server fails have the user log in again. - The cost of a ServerIron was cheaper and simpler than using NLB. Many major sites use them for TCP connection pooling, automated bot detection, etc. ServerIron can do a lot more than load balancing and these features are attractive for the cost. • Has a big problem picking an ad server. Ad server firms want several hundred thousand a year plus they want multi-year contracts. • In the process of getting rid of ASP.NET repeaters and instead uses the append string thing or response.write. If you are doing over a million page views a day just write out the code to spit it out to the screen. • Most of the build out costs went towards a SAN. Redundancy at any cost. • Growth was through word of mouth. Went nuts in Canada, spread to UK, Australia, and then to the US. • Database - One database is the main database. - Two databases are for search. Load balanced between search servers based on the type of search performed. - Monitors performance using task manager. When spikes show up he investigates. Problems were usually blocking in the database. It's always database issues. Rarely any problems in .net. Because POF doesn't use the .net library it's relatively easy to track down performance problems. When you are using many layers of frameworks finding out where problems are hiding is frustrating and hard. - If you call the database 20 times per page view you are screwed no matter what you do. - Separate database reads from writes. If you don't have a lot of RAM and you do reads and writes you get paging involved which can hang your system for seconds. - Try and make a read only database if you can. - Denormalize data. If you have to fetch stuff from 20 different tables try and make one table that is just used for reading. - One day it will work, but when your database doubles in size it won't work anymore. - If you only do one thing in a system it will do it really really well. Just do writes and that's good. Just do reads and that's good. Mix them up and it messes things up. You run into locking and blocking issues. - If you are maxing the CPU you've either done something wrong or it's really really optimized. If you can fit the database in RAM do it. • The development process is: come up with an idea. 
• The development process: come up with an idea. Throw it up within 24 hours. It kind of half works. See what the user response is by looking at what they actually do on the site. Do messages per user increase? Do session times increase? If people don't like it, take it down.
• System failures are rare and short-lived. The biggest issues are DNS issues, where some ISP says POF doesn't exist anymore. But because the site is free, people accept a little downtime. People often don't notice a site being down because they think it's their own problem.
• Going from one million to 12 million users was a big jump. He could scale to 60 million users with two web servers.
• Will often look at competitors for ideas for new features.
• Will consider something like S3 when it becomes geographically load balanced.

Lessons Learned
• You don't need millions in funding, a sprawling infrastructure, and a building full of employees to create a world-class website that handles a torrent of users while making good money. All you need is an idea that appeals to a lot of people, a site that takes off by word of mouth, and the experience and vision to build a site without falling into the typical traps of the trade. That's all you need :-)
• Necessity is the mother of all change.
• When you grow quickly, but not too quickly, you have a chance to grow, modify, and adapt.
• RAM solves all problems. After that it's just growing using bigger machines.
• When starting out, keep everything as simple as possible. Nearly everyone gives this same advice, and Markus makes a point of saying everything he does is just obvious common sense. But clearly what is simple isn't merely common sense. Creating simple things is the result of years of practical experience.
• Keep database access fast and you have no issues.
• A big reason POF can get away with so few people and so little equipment is that they use a CDN for serving large, heavily used content. Using a CDN may be the secret sauce in a lot of large websites. Markus thinks there isn't a single site in the top 100 that doesn't use a CDN. Without a CDN he thinks load time in Australia would go to 3 or 4 seconds because of all the images.
• Advertising on Facebook yielded poor results. Out of 2000 clicks, only 1 signed up. With a CTR of 0.04%, Facebook delivers 0.4 clicks per 1000 ad impressions, so the effective cost per click is 2.5 times the CPM: a 5 cent CPM works out to 12.5 cents a click, a 50 cent CPM to $1.25 a click, and a $1.00 CPM to $2.50 a click. (The arithmetic is sketched after this list.)
• It's easy to sell a few million page views at high CPMs. It's a LOT harder to sell billions of page views at high CPMs, as shown by MySpace and Facebook.
• The ad-supported model limits your revenues. You have to go to a paid model to grow larger. To generate $100 million a year as a free site is virtually impossible, as you need too big a market.
• Growing page views via Facebook for a dating site won't work. Having a visitor on your own site is much more profitable. Most of Facebook's page views are outside the US, and you have to split the 5 cent CPMs with Facebook.
• Co-reg is a potentially large source of income. This is where your site's sign-up offers to send the user more information about mortgages or some other product.
• You can't always listen to user responses. Some users will always love new features and others will hate them. Only a fraction will complain. Instead, look at what features people are actually using by watching your site.
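The click-cost arithmetic above, as a back-of-the-envelope sketch (plain Python; the numbers only restate the bullet):

```python
def cost_per_click(cpm, ctr=0.0004):
    """CPM is the price per 1000 impressions; a 0.04% CTR means 0.4 clicks per 1000."""
    clicks_per_1000_impressions = ctr * 1000
    return cpm / clicks_per_1000_impressions

for cpm in (0.05, 0.50, 1.00):
    print(f"${cpm:.2f} CPM -> ${cost_per_click(cpm):.3f} per click")
# $0.05 CPM -> $0.125 per click
# $0.50 CPM -> $1.250 per click
# $1.00 CPM -> $2.500 per click
```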
Wikimedia architecture

Wed, 08/22/2007 - 23:56 — Todd Hoff

Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet.

Site: http://wikimedia.org/

Information Sources
• Wikimedia architecture
• http://meta.wikimedia.org/wiki/Wikimedia_servers
• scale-out vs scale-up, from the Oracle-to-MySQL blog

Platform
• Apache
• Linux
• MySQL
• PHP
• Squid
• LVS
• Lucene for search
• Memcached for a distributed object cache
• Lighttpd for the image server

The Stats
• 8 million articles spread over hundreds of language projects (English, Dutch, ...)
• 10th busiest site in the world (source: Alexa)
• Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
• 30,000 HTTP requests/s during peak time
• 3 Gbit/s of data traffic
• 3 data centers: Tampa, Amsterdam, Seoul
• 350 servers, ranging from single-processor P4s to dual quad-core Xeons, with 0.5 - 16 GB of memory
• Managed by ~6 people
• 3 clusters on 3 different continents

The Architecture
• Geographic load balancing, based on the source IP of the client's resolver, directs clients to the nearest server cluster. IP addresses are statically mapped to countries, and countries to clusters.
• HTTP reverse proxy caching implemented using Squid, grouped into a text cluster for wiki content and a media cluster for images and large static files.
• 55 Squid servers currently, plus 20 waiting for setup.
• 1,000 HTTP requests/s per server, up to 2,500 under stress
• ~100 - 250 Mbit/s per server
• ~14,000 - 32,000 open connections per server
• Up to 40 GB of disk caches per Squid server
• Up to 4 disks per server (1U rack servers)
• 8 GB of memory, half of that used by Squid
• Hit rates: 85% for text, 98% for media, since they started using CARP.
• PowerDNS provides the geographical distribution.
• In their primary and regional data centers they run text and media clusters built on LVS, CARP Squid, and cache Squid. The primary data center holds the media storage.
• To make sure the latest revision of every page is served, invalidation requests are sent to all Squid caches.
• One centrally managed and synchronized software installation for hundreds of wikis.
• MediaWiki scales well with multiple CPUs, so they buy dual quad-core servers now (8 CPU cores per box).
• Hardware is shared with External Storage and Memcached tasks.
• Memcached is used to cache image metadata, parser data, diffs, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories, etc.), user accounts and settings, is stored in the core databases.
• Actual revision text is stored as blobs in external storage.
• Static (uploaded) files, such as images, are stored separately on the image server; metadata (size, type, etc.) is cached in the core database and object caches.
• Separate database per wiki (not separate server!)
• One master, many replicated slaves.
• Read operations are load balanced over the slaves; write operations go to the master.
• The master is used for some read operations in case the slaves are not yet up to date (lagged).
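A minimal sketch of that read-routing rule (my illustration, not MediaWiki's actual load balancer; the hosts, lag numbers, and threshold are made up): prefer a slave whose replication lag is acceptable, and fall back to the master otherwise.

```python
import random

MAX_LAG_SECONDS = 5   # assumed threshold; tune to what your pages can tolerate

def pick_read_host(master, slaves, lag_of):
    """Return a host to read from. lag_of(host) reports replication lag in seconds."""
    fresh = [s for s in slaves if lag_of(s) <= MAX_LAG_SECONDS]
    return random.choice(fresh) if fresh else master   # all lagged? read from the master

lags = {"db2": 1.2, "db3": 0.4, "db4": 42.0}           # hypothetical slaves and their lags
print(pick_read_host("db1", list(lags), lags.get))
```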
• External Storage
- Article text is stored on separate data storage clusters: simple append-only blob storage. This saves space on the expensive and busy core databases for largely unused data.
- Allows use of spare resources on application servers (2x 250-500 GB per server).
- Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability.

Lessons Learned
• Focus on architecture, not so much on operations or nontechnical stuff.
• Sometimes caching costs more than recalculating or looking the data up at the source... profiling!
• Avoid expensive algorithms, database queries, etc.
• Cache every result that is expensive and has temporal locality of reference.
• Focus on the hot spots in the code (profiling!).
• Scale by separating:
- Read and write operations (master/slave)
- Expensive operations from cheap and more frequent operations (query groups)
- Big, popular wikis from smaller wikis
• Improve caching: exploit temporal and spatial locality of reference, and reduce the data set size per server.
• Text is compressed and only the differences between revisions are stored. (See the sketch after this list.)
• Simple-seeming library calls, like using stat to check for a file's existence, can take too long when the system is loaded.
• Disk seek I/O limited: the more disk spindles, the better!
• Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16 GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16 GB is right for the working set size, and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly, the web servers are currently 8-core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
• It is a lot of work to scale out, and more if you didn't design for it originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
• Anyone who wants to design their database architecture so that it'll allow them to inexpensively grow from a single box serving next to nothing to one of the top ten or hundred sites on the net should start by designing it to handle slightly out-of-date data from replication slaves, to load balance all read queries to the slaves, and, if at all possible, to let chunks of data (batches of users, accounts, whatever) live on different servers. You can do this from day one using virtualisation, proving the architecture while you're small. It's a LOT easier than doing it while load is doubling every few months!
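The revision-diff lesson above, as a minimal sketch (my illustration; MediaWiki's real scheme is more involved and also compresses the result): store the first revision in full, then store each later revision as copy ranges against its predecessor plus literal text, and reconstruct on demand.

```python
import difflib

def make_delta(base: str, new: str):
    """Encode `new` as copy-ranges from `base` plus literal text for everything else."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=base, b=new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse a slice of the previous revision
        else:
            ops.append(("data", new[j1:j2]))  # insert/replace: store only the new text
    return ops

def apply_delta(base: str, ops) -> str:
    parts = []
    for op in ops:
        parts.append(base[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(parts)

rev1 = "The cat sat on the mat."
rev2 = "The cat sat on the extremely comfortable mat."
delta = make_delta(rev1, rev2)
assert apply_delta(rev1, delta) == rev2   # full text reconstructed from base + delta
```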
Scaling Early Stage Startups

Mon, 10/29/2007 - 04:26 — Todd Hoff

Mark Maunder of No VC Required--who advocates not taking VC money lest you be turned into a frog instead of the prince (or princess) you were dreaming of--has an excellent slide deck on how to scale an early stage startup. His blog also has some good SEO tips and a very spooky widget showing the geographical location of his readers. Perfect for Halloween! What are Mark's otherworldly scaling strategies for startups?

Site: http://novcrequired.com/

Information Sources
• Slides from the Seattle Tech Startup Talk.
• Scaling Early Stage Startups blog post by Mark Maunder.

The Platform
• Linux
• An ISAM-type data store
• Perl
• Httperf is used for benchmarking.
• Websitepulse.com is used for performance monitoring.

The Architecture
• Performance matters because being slow could cost you 20% of your revenue. The UIE guys disagree, saying this ain't necessarily so. They explain their reasoning in the Usability Tools Podcast: The Truth About Page Download Time. The idea is: "There was still another surprising finding from our study: a strong correlation between perceived download time and whether users successfully completed their tasks on a site. There was, however, no correlation between actual download time and task success, causing us to discard our original hypothesis. It seems that, when people accomplish what they set out to do on a site, they perceive that site to be fast." So it might be a better use of time to improve the front-end rather than the back-end.
• MySQL was dumped because of performance problems: MySQL didn't handle a high number of writes and deletes on large tables; writes blow away the query cache; large numbers of small tables (over 10,000) are not well supported; it uses a lot of memory to cache indexes; and it maxed out at 200 concurrent read/write queries per second with over 1 million records.
• For data storage they evolved to a fixed-length, ISAM-like record scheme that allows seeking directly to the data. It still uses file-level locking, and it's benchmarked at 20,000+ concurrent reads/writes/deletes. They are considering moving to BerkeleyDB, which performs very well and is used by many large websites, especially when you primarily need key-value type lookups. I think it might be interesting to store JSON if a lot of this data ends up being displayed on the web page.
• Moved to httpd.prefork for Perl. That, with no keepalive on the application servers, uses less RAM and works well.

Lessons Learned
• Configure your DB and web server correctly. MySQL and Apache's memory usage can easily spiral out of control, which leads to grindingly slow performance as swapping increases. Here are a few resources for helping with configuration issues.
• Serve only the users you care about. Block content thieves that crawl your site, using a lot of valuable resources for nothing. Monitor the number of content pages they fetch per minute. If a threshold is exceeded, do a reverse lookup on their IP address and configure your firewall to block them. (A sketch of this kind of throttle follows after this list.)
• Cache as much DB data and static content as possible. Perl's Cache::FileCache was used to cache DB data and rendered HTML on disk.
• Use two different host names in URLs to enable browser clients to load images in parallel.
• Make content as static as possible. Create separate image and CSS servers to serve the static content. Use keepalives on static content, as static content uses little memory per thread/process.
• Leave plenty of spare memory.
Spare memory allows Linux to use more memory for file system caching, which increased performance by about 20 percent.
• Turn keepalive off for your dynamic content. Keepalive connections tie up the threads and memory needed to serve new requests.
• You may not need a complex RDBMS for accessing data. Consider a lighter-weight database like BerkeleyDB.
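A minimal sketch of the crawler throttle mentioned in the lessons above (my illustration; the window and threshold are assumptions to tune, and a real deployment would push the block into the firewall rather than keep it in process memory):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0
PAGES_PER_WINDOW = 120    # assumed threshold for "this is a scraper, not a person"

recent_hits = defaultdict(deque)
blocked_ips = set()

def allow(ip: str) -> bool:
    if ip in blocked_ips:
        return False
    now = time.time()
    hits = recent_hits[ip]
    hits.append(now)
    while hits and hits[0] < now - WINDOW_SECONDS:   # drop hits outside the window
        hits.popleft()
    if len(hits) > PAGES_PER_WINDOW:
        blocked_ips.add(ip)   # here you would reverse-lookup the IP and add a firewall rule
        return False
    return True
```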
Database parallelism choices greatly impact scalability

By Sam Madden on October 30, 2007

Large databases require the use of parallel computing resources to get good performance. There are several fundamentally different parallel architectures in use today; in this post, Dave DeWitt, Mike Stonebraker, and I review three approaches and reflect on the pros and cons of each. Though these tradeoffs were articulated in the research community twenty years ago, we wanted to revisit these issues to bring readers up to speed before publishing upcoming posts that will discuss recent developments in parallel database design.

Shared-memory systems don't scale well as the shared bus becomes the bottleneck
In a shared-memory approach, as implemented on many symmetric multi-processor machines, all of the CPUs share a single memory and a single collection of disks. This approach is relatively easy to program. Complex distributed locking and commit protocols are not needed because the lock manager and buffer pool are both stored in the memory system where they can be easily accessed by all the processors.
Unfortunately, shared-memory systems have fundamental scalability limitations, as all I/O and memory requests have to be transferred over the same bus that all of the processors share. This causes the bandwidth of the bus to rapidly become a bottleneck. In addition, shared-memory multiprocessors require complex, customized hardware to keep their L2 data caches consistent. Hence, it is unusual to see shared-memory machines with more than 8 or 16 processors unless they are custom-built from non-commodity parts (and if they are custom-built, they are very expensive). As a result, shared-memory systems don't scale well.
Shared-disk systems don't scale well either
Shared-disk systems suffer from similar scalability limitations. In a shared-disk architecture, there are a number of independent processor nodes, each with its own memory. These nodes all access a single collection of disks, typically in the form of a storage area network (SAN) system or a network-attached storage (NAS) system. This architecture originated with the Digital Equipment Corporation VAXcluster in the early 1980s, and has been widely used by Sun Microsystems and Hewlett-Packard.
Shared-disk architectures have a number of drawbacks that severely limit scalability. First, the interconnection network that connects each of the CPUs to the shared-disk subsystem can become an I/O bottleneck. Second, since there is no pool of memory that is shared by all the processors, there is no obvious place for the lock table or buffer pool to reside. To set locks, one must either centralize the lock manager on one processor or resort to a complex distributed locking protocol. This protocol must use messages to implement in software the same sort of cache-consistency protocol implemented by shared-memory multiprocessors in hardware. Either of these approaches to locking is likely to become a bottleneck as the system is scaled.
To make shared-disk technology work better, vendors typically implement a "shared-cache" design. Shared cache works much like shared disk, except that, when a node in a parallel cluster needs to access a disk page, it first checks to see if the page is in its local buffer pool ("cache"). If not, it checks to see if the page is in the cache of any other node in the cluster. If neither of those efforts works, it reads the page from disk.
Such a cache appears to work fairly well on OLTP but performs less well for data warehousing workloads. The problem with the shared-cache design is that cache hits are unlikely to happen because warehouse queries are typically answered using sequential scans of the fact table (or via materialized views). Unless the whole fact table fits in the aggregate memory of the cluster, sequential scans do not typically benefit from large amounts of cache. Thus, the entire burden of answering such queries is placed on the disk subsystem. As a result, a shared cache just creates overhead and limits scalability.
In addition, the same scalability problems that exist in the shared-memory model also occur in the shared-disk architecture. The bus between the disks and the processors will likely become a bottleneck, and resource contention for certain disk blocks, particularly as the number of CPUs increases, can be a problem. To reduce bus contention, customers frequently configure their large clusters with many Fibre Channel controllers (disk buses), but this complicates system design because now administrators must partition data across the disks attached to the different controllers.
Shared-nothing scales the best
In a shared-nothing approach, by contrast, each processor has its own set of disks. Data is "horizontally partitioned" across nodes. Each node has a subset of the rows from each table in the database. Each node is then responsible for processing only the rows on its own disks. Such architectures are especially well suited to the star schema queries present in data warehouse workloads, as only a very limited amount of communication bandwidth is required to join one or more (typically small) dimension tables with the (typically much larger) fact table.
In addition, every node maintains its own lock table and buffer pool, eliminating the need for complicated locking and software or hardware consistency mechanisms. Because shared nothing does not typically have nearly as severe bus or resource contention as shared-memory or shared-disk machines, shared nothing can be made to scale to hundreds or even thousands of machines. Because of this, it is generally regarded as the best-scaling architecture.
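A toy sketch of the shared-nothing idea (my illustration, in-process Python rather than a real cluster): rows are hash-partitioned across nodes, each node scans only its own partition, and only the small partial results cross node boundaries.

```python
NODES = 4
partitions = [[] for _ in range(NODES)]   # each node's local slice of the table

def route(row):
    """Horizontal partitioning: the key's hash decides which node owns the row."""
    partitions[hash(row["user_id"]) % NODES].append(row)

for uid in range(1_000):                  # load some fake fact-table rows
    route({"user_id": uid, "spend": uid % 7})

# Each node computes an aggregate over only its own disks/rows...
partials = [sum(r["spend"] for r in part) for part in partitions]
# ...and only NODES partial sums need to be exchanged to answer the query.
print("total spend:", sum(partials))
```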
The shared-nothing approach complements other enhancements
As a closing point, we note that this shared-nothing approach is completely compatible with other advanced database techniques we've discussed on this blog, such as compression and vertical partitioning. Systems that combine all of these techniques are likely to offer the best performance and scalability when compared to more traditional architectures.

Introduction to Distributed System Design

Table of Contents
Audience and Pre-Requisites
The Basics
So How Is It Done?
Remote Procedure Calls
Distributed Design Principles
Exercises
References
________________________________________
Audience and Pre-Requisites

This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.

The Basics

What is a distributed system? It's one of those things that's hard to define without first defining many other things. Here is a "cascading" definition of a distributed system:
A program is the code you write.
A process is what you get when you run it.
A message is used to communicate between processes.
A packet is a fragment of a message that might travel on a wire.
A protocol is a formal description of the message formats and the rules that two processes must follow in order to exchange those messages.
A network is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.
A component can be a process or any piece of hardware required to run a process, support communications between processes, store data, etc.
A distributed system is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate to perform a single task or a small set of related tasks.

Why build a distributed system? There are lots of advantages, including the ability to connect remote users with remote resources in an open and scalable way. When we say open, we mean each component is continually open to interaction with other components. When we say scalable, we mean the system can easily be altered to accommodate changes in the number of users, resources and computing entities. Thus, given the combined capabilities of its distributed components, a distributed system can be much larger and more powerful than combinations of stand-alone systems.

But it's not easy - for a distributed system to be useful, it must be reliable. This is a difficult goal to achieve because of the complexity of the interactions between simultaneously running components. To be truly reliable, a distributed system must have the following characteristics:
• Fault-Tolerant: It can recover from component failures without performing incorrect actions.
• Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
• Recoverable: Failed components can restart themselves and rejoin the system after the cause of the failure has been repaired.
• Consistent: The system can coordinate actions by multiple components, often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
• Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running.
This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or the overall load on the system. In a scalable system, this should not have a significant effect.
• Predictable Performance: The ability to provide desired responsiveness in a timely manner.
• Secure: The system authenticates access to data and services. [1]

These are high standards, which are challenging to achieve. Probably the most difficult challenge is that a distributed system must be able to continue operating correctly even when components fail. This issue is discussed in the following excerpt from an interview with Ken Arnold. Ken is a research scientist at Sun, one of the original architects of Jini, and a member of the architectural team that designed CORBA.
________________________________________
Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in 10^13, how often would it happen?" Common sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time."

When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern. What does designing for failure mean? One classic problem is partial failure. If I send a message to you and then a network failure occurs, there are two possible outcomes. One is that the message got to you, and then the network broke, and I just didn't get the response. The other is that the message never got to you because the network broke before it arrived. So if I never receive a response, how do I know which of those two results happened? I cannot determine that without eventually finding you. The network has to be repaired or you have to come up, because maybe what happened was not a network failure but you died. How does this change how I design things? For one thing, it puts a multiplier on the value of simplicity. The more things I can do with you, the more things I have to think about recovering from. [2]
________________________________________
Handling failures is an important theme in distributed systems design. Failures fall into two obvious categories: hardware and software. Hardware failures were a dominant concern until the late 80s, but since then internal hardware reliability has improved enormously. Decreased heat production and power consumption of smaller circuits, reduction of off-chip connections and wiring, and high-quality manufacturing techniques have all played a positive role in improving hardware reliability. Today, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures.

Software failures are a significant issue in distributed systems. Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%). Residual bugs in mature systems can be classified into two main categories [5]:
• Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when investigated in a debug-mode build.
The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle). • Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions. [6] Heisenbugs tend to be more prevalent in distributed systems than in local systems. One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes. Let's get a little more specific about the types of failures that can occur in a distributed system: • Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure. • Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop. • Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded. • Network failures: A network link breaks. • Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure. • Timing failures: A temporal property of the system is violated. For example, clocks on different computers which are used to coordinate processes are not synchronized; when a message is delayed longer than a threshold period, etc. • Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc. [1] Our goal is to design a distributed system with the characteristics listed above (fault-tolerant, highly available, recoverable, etc.), which means we must design for failure. To design for failure, we must be careful to not make any assumptions about the reliability of the components of a system. Everyone, when they first build a distributed system, makes the following eight assumptions. These are so well-known in this field that they are commonly referred to as the "8 Fallacies". 1. The network is reliable. 2. Latency is zero. 3. Bandwidth is infinite. 4. The network is secure. 5. Topology doesn't change. 6. There is one administrator. 7. Transport cost is zero. 8. The network is homogeneous. [3] Latency: the time between initiating a request for data and the beginning of the actual data transfer. Bandwidth: A measure of the capacity of a communications channel. The higher a channel's bandwidth, the more information it can carry. Topology: The different configurations that can be adopted in building networks, such as a ring, bus, star or meshed. Homogeneous network: A network running a single network protocol. So How Is It Done? Building a reliable system that runs over an unreliable communications network seems like an impossible goal. We are forced to deal with uncertainty. 
So How Is It Done?

Building a reliable system that runs over an unreliable communications network seems like an impossible goal. We are forced to deal with uncertainty. A process knows its own state, and it knows what state other processes were in recently. But the processes have no way of knowing each other's current state. They lack the equivalent of shared memory. They also lack accurate ways to detect failure, or to distinguish a local software/hardware failure from a communication failure.

Distributed systems design is obviously a challenging endeavor. How do we do it when we are not allowed to assume anything, and there are so many complexities? We start by limiting the scope. We will focus on a particular type of distributed systems design, one that uses a client-server model with mostly standard protocols. It turns out that these standard protocols provide considerable help with the low-level details of reliable network communications, which makes our job easier. Let's start by reviewing client-server technology and the protocols.
In client-server applications, the server provides some service, such as processing database queries or sending out current stock prices. The client uses the service provided by the server, either displaying database query results to the user or making stock purchase recommendations to an investor. The communication that occurs between the client and the server must be reliable. That is, no data can be dropped and it must arrive on the client side in the same order in which the server sent it. There are many types of servers we encounter in a distributed system. For example, file servers manage disk storage units on which file systems reside. Database servers house databases and make them available to clients. Network name servers implement a mapping between a symbolic name or a service description and a value such as an IP address and port number for a process that provides the service. In distributed systems, there can be many servers of a particular type, e.g., multiple file servers or multiple network name servers. The term service is used to denote a set of servers of a particular type. We say that a binding occurs when a process that needs to access a service becomes associated with a particular server which provides the service. There are many binding policies that define how a particular server is chosen. For example, the policy could be based on locality (a Unix NIS client starts by looking first for a server on its own machine); or it could be based on load balance (a CICS client is bound in such a way that uniform responsiveness for all clients is attempted). A distributed service may employ data replication, where a service maintains multiple copies of data to permit local access at multiple locations, or to increase availability when a server process may have crashed. Caching is a related concept and very common in distributed systems. We say a process has cached data if it maintains a copy of the data locally, for quick access if it is needed again. A cache hit is when a request is satisfied from cached data, rather than from the primary service. For example, browsers use document caching to speed up access to frequently used documents. Caching is similar to replication, but cached data can become stale. Thus, there may need to be a policy for validating a cached data item before using it. If a cache is actively refreshed by the primary service, caching is identical to replication. [1] As mentioned earlier, the communication between client and server needs to be reliable. You have probably heard of TCP/IP before. The Internet Protocol (IP) suite is the set of communication protocols that allow for communication on the Internet and most commercial networks. The Transmission Control Protocol (TCP) is one of the core protocols of this suite. Using TCP, clients and servers can create connections to one another, over which they can exchange data in packets. The protocol guarantees reliable and in-order delivery of data from sender to receiver. The IP suite can be viewed as a set of layers, each layer having the property that it only uses the functions of the layer below, and only exports functionality to the layer above. A system that implements protocol behavior consisting of layers is known as a protocol stack. Protocol stacks can be implemented either in hardware or software, or a mixture of both. Typically, only the lower layers are implemented in hardware, with the higher layers being implemented in software. 
________________________________________
Resource: The history of TCP/IP mirrors the evolution of the Internet. Here is a brief overview of this history.
________________________________________

There are four layers in the IP suite:

1. Application Layer: The application layer is used by most programs that require network communication. Data is passed down from the program in an application-specific format to the next layer, then encapsulated into a transport layer protocol. Examples of applications are HTTP, FTP or Telnet.

2. Transport Layer: The transport layer's responsibilities include end-to-end message transfer independent of the underlying network, along with error control, fragmentation and flow control. End-to-end message transmission at the transport layer can be categorized as either connection-oriented (TCP) or connectionless (UDP). TCP is the more sophisticated of the two protocols, providing reliable delivery. First, TCP ensures that the receiving computer is ready to accept data. It uses a three-packet handshake in which both the sender and receiver agree that they are ready to communicate. Second, TCP makes sure that data gets to its destination. If the receiver doesn't acknowledge a particular packet, TCP automatically retransmits the packet, typically three times. If necessary, TCP can also split large packets into smaller ones so that data can travel reliably between source and destination. TCP drops duplicate packets and rearranges packets that arrive out of sequence.
UDP is similar to TCP in that it is a protocol for sending and receiving packets across a network, but with two major differences. First, it is connectionless. This means that one program can send off a load of packets to another, but that's the end of their relationship. The second might send some back to the first and the first might send some more, but there's never a solid connection. UDP is also different from TCP in that it doesn't provide any sort of guarantee that the receiver will receive the packets that are sent, or that they will arrive in the right order. All that is guaranteed is the packet's contents. This means it's a lot faster, because there's no extra overhead for error-checking above the packet level. For this reason, games often use this protocol. In a game, if one packet for updating a screen position goes missing, the player will just jerk a little. The other packets will simply update the position, and the missing packet - although making the movement a little rougher - won't change anything.
Although TCP is more reliable than UDP, the protocol is still at risk of failing in many ways. TCP uses acknowledgements and retransmission to detect and repair loss. But it cannot overcome longer communication outages that disconnect the sender and receiver for long enough to defeat the retransmission strategy. The normal maximum disconnection time is between 30 and 90 seconds. TCP could signal a failure and give up when both end-points are fine. This is just one example of how TCP can fail, even though it does provide some mitigating strategies.

3. Network Layer: As originally defined, the network layer solves the problem of getting packets across a single network. With the advent of the concept of internetworking, additional functionality was added to this layer, namely getting data from a source network to a destination network. This generally involves routing the packet across a network of networks, e.g. the Internet. IP performs the basic task of getting packets of data from source to destination.
4. Link Layer: The link layer deals with the physical transmission of data, and usually involves placing frame headers and trailers on packets for travelling over the physical network, and dealing with the physical components along the way.

________________________________________
Resource: For more information on the IP suite, refer to the Wikipedia article.
________________________________________

Remote Procedure Calls

Many distributed systems were built using TCP/IP as the foundation for the communication between components. Over time, an efficient method for clients to interact with servers evolved, called RPC, which stands for remote procedure call. It is a powerful technique based on extending the notion of local procedure calling, so that the called procedure need not exist in the same address space as the calling procedure. The two processes may be on the same system, or they may be on different systems with a network connecting them.

An RPC is similar to a function call. Like a function call, when an RPC is made, the arguments are passed to the remote procedure and the caller waits for a response to be returned. The client makes a procedure call that sends a request to the server. The client process waits until either a reply is received or it times out. When the request arrives at the server, it calls a dispatch routine that performs the requested service and sends the reply to the client. After the RPC call is completed, the client process continues.

Threads are common in RPC-based distributed systems. Each incoming request to a server typically spawns a new thread. A thread in the client typically issues an RPC and then blocks (waits). When the reply is received, the client thread resumes execution.

A programmer writing RPC-based code does three things:
1. Specifies the protocol for client-server communication
2. Develops the client program
3. Develops the server program

The communication protocol is created by stubs generated by a protocol compiler. A stub is a routine that doesn't actually do much other than declare itself and the parameters it accepts. The stub contains just enough code to allow it to be compiled and linked. The client and server programs must communicate via the procedures and data types specified in the protocol. The server side registers the procedures that may be called by the client, and receives and returns the data required for processing. The client side calls the remote procedure, passes any required data, and receives the returned data. Thus, an RPC application uses classes generated by the stub generator to execute an RPC and wait for it to finish. The programmer needs to supply classes on the server side that provide the logic for handling an RPC request.
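To make the stub idea concrete, here is a minimal sketch using Python's built-in XML-RPC machinery (one illustrative RPC mechanism among many; the procedure name and port are made up). The ServerProxy object plays the role of the client-side stub: calling proxy.add() looks like a local call but is marshalled over HTTP.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b                        # the server-side logic behind the stub

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")    # register the procedure clients may call
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = xmlrpc.client.ServerProxy("http://localhost:8000")   # the client-side stub
print(proxy.add(2, 3))                  # looks local, but travels over the network
```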
RPC introduces a set of error cases that are not present in local procedure programming. For example, a binding error can occur when a server is not running when the client is started. Version mismatches occur if a client was compiled against one version of a server, but the server has now been updated to a newer version. A timeout can result from a server crash, a network problem, or a problem on the client computer. Some RPC applications view these types of errors as unrecoverable. Fault-tolerant systems, however, have alternate sources for critical services and fail over from a primary server to a backup server.

A challenging error-handling case occurs when a client needs to know the outcome of a request in order to take the next step after the failure of a server. This can sometimes result in incorrect actions and results. For example, suppose a client process asks a ticket-selling server to check for a seat in the orchestra section of Carnegie Hall. If it's available, the server records the request and the sale. But the request fails by timing out. Was the seat available and the sale recorded? Even if there is a backup server to which the request can be re-issued, there is a risk that the client will be sold two tickets, which is an expensive mistake in Carnegie Hall [1].

Here are some common error conditions that need to be handled:
• Network data loss resulting in retransmit: Often a system tries to achieve 'at most once' transmission. In the worst case, if duplicate transmissions occur, we try to minimize any damage done by the data being received multiple times.
• Server process crashes during RPC operation: If a server process crashes before it completes its task, the system usually recovers correctly, because the client will initiate a retry request once the server has recovered. If the server crashes after completing the task but before the RPC reply is sent, duplicate requests sometimes result due to client retries.
• Client process crashes before receiving response: The client is restarted. The server discards the response data.
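One standard defense against the double-sold-ticket problem is to make requests idempotent: the client attaches a unique request ID, and the server remembers the outcome so a retried request replays the recorded answer instead of selling a second seat. A minimal sketch (my illustration, not from the referenced text; a real server would persist the table):

```python
completed = {}   # request_id -> result of the first successful execution

def sell_seat(request_id: str, seat: str) -> str:
    if request_id in completed:
        return completed[request_id]    # duplicate retry: replay the recorded outcome
    result = f"sold {seat}"             # stand-in for actually recording the sale
    completed[request_id] = result
    return result

print(sell_seat("req-42", "orchestra-A7"))   # first attempt sells the seat
print(sell_seat("req-42", "orchestra-A7"))   # retry after a timeout: no double sale
```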
Some Distributed Design Principles Given what we have covered so far, we can define some fundamental design principles which every distributed system designer and software engineer should know. Some of these may seem obvious, but it will be helpful as we proceed to have a good starting list. ________________________________________ • As Ken Arnold says: "You have to design distributed systems with the expectation of failure." Avoid making assumptions that any component in the system is in a particular state. A classic error scenario is for a process to send data to a process running on a second machine. The process on the first machine receives some data back and processes it, and then sends the results back to the second machine assuming it is ready to receive. Any number of things could have failed in the interim and the sending process must anticipate these possible failures. • Explicitly define failure scenarios and identify how likely each one might occur. Make sure your code is thoroughly covered for the most likely ones. • Both clients and servers must be able to deal with unresponsive senders/receivers. • Think carefully about how much data you send over the network. Minimize traffic as much as possible. • Latency is the time between initiating a request for data and the beginning of the actual data transfer. Minimizing latency sometimes comes down to a question of whether you should make many little calls/data transfers or one big call/data transfer. The way to make this decision is to experiment. Do small tests to identify the best compromise. • Don't assume that data sent across a network (or even sent from disk to disk in a rack) is the same data when it arrives. If you must be sure, do checksums or validity checks on data to verify that the data has not changed. • Caches and replication strategies are methods for dealing with state across components. We try to minimize stateful components in distributed systems, but it's challenging. State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component. If it can be reconstructed it's a cache. Caches can be helpful in mitigating the risks of maintaining state across components. But cached data can become stale, so there may need to be a policy for validating a cached data item before using it. If a process stores information that can't be reconstructed, then problems arise. One possible question is, "Are you now a single point of failure?" I have to talk to you now - I can't talk to anyone else. So what happens if you go down? To deal with this issue, you could be replicated. Replication strategies are also useful in mitigating the risks of maintaining state. But there are challenges here too: What if I talk to one replicant and modify some data, then I talk to another? Is that modification guaranteed to have already arrived at the other? What happens if the network gets partitioned and the replicants can't talk to each other? Can anybody proceed? There are a set of tradeoffs in deciding how and where to maintain state, and when to use caches and replication. It's more difficult to run small tests in these scenarios because of the overhead in setting up the different mechanisms. • Be sensitive to speed and performance. Take time to determine which parts of your system can have a significant impact on performance: Where are the bottlenecks and why? Devise small tests you can do to evaluate alternatives. Profile and measure to learn more. 
Talk to your colleagues about these alternatives and your results, and decide on the best solution.
• Acks are expensive and tend to be avoided in distributed systems wherever possible.
• Retransmission is costly. It's important to experiment so you can tune the delay that prompts a retransmission to be optimal.

Exercises
1. Have you ever encountered a Heisenbug? How did you isolate and fix it?
2. For the different failure types listed above, consider what makes each one difficult for a programmer trying to guard against it. What kinds of processing can be added to a program to deal with these failures?
3. Explain why each of the 8 fallacies is actually a fallacy.
4. Contrast TCP and UDP. Under what circumstances would you choose one over the other?
5. What's the difference between caching and data replication?
6. What are stubs in an RPC implementation?
7. What are some of the error conditions we need to guard against in a distributed environment that we do not need to worry about in a local programming environment?
8. Why are pointers (references) not usually passed as parameters to a Remote Procedure Call?
9. Here is an interesting problem called partial connectivity that can occur in a distributed environment. Let's say A and B are systems that need to talk to each other. C is a master that also talks to A and B individually. The communications between A and B fail. C can tell that A and B are both healthy. C tells A to send something to B and waits for this to occur. C has no way of knowing that A cannot talk to B, and thus waits and waits and waits. What diagnostics can you add in your code to deal with this situation?
10. What is the leader-election algorithm? How can it be used in a distributed system?
11. This is the Byzantine Generals problem: Two generals are on hills on either side of a valley. They each have an army of 1000 soldiers. In the woods in the valley is an enemy army of 1500 men. If each general attacks alone, his army will lose. If they attack together, they will win. They wish to send messengers through the valley to coordinate when to attack. However, the messengers may get lost or caught in the woods (or brainwashed into delivering different messages). How can they devise a scheme by which they either attack with high probability, or not at all?

References
[1] Birman, Kenneth. Reliable Distributed Systems: Technologies, Web Services and Applications. New York: Springer-Verlag, 2005.
[2] Interview with Ken Arnold.
[3] The Eight Fallacies.
[4] Wikipedia article on the IP Suite.
[5] Gray, J. and Reuter, A. Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann, 1993.
[6] Bohrbugs and Heisenbugs.
Flickr Architecture

Wed, 08/29/2007 - 10:04 — Todd Hoff

Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge: they must handle a vast sea of ever-expanding new content, ever-increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

Site: http://www.flickr.com/

Information Sources
• Flickr and PHP (an early document)
• Capacity Planning for LAMP
• Federation at Flickr: A tour of the Flickr Architecture
• Building Scalable Web Sites by Cal Henderson from Flickr
• Database War Stories #3: Flickr, by Tim O'Reilly
• Cal Henderson's Talks. A lot of useful PowerPoint presentations.

Platform
• PHP
• MySQL
• Shards
• Memcached for a caching layer
• Squid in reverse-proxy for HTML and images
• Linux (RedHat)
• Smarty for templating
• Perl
• PEAR for XML and email parsing
• ImageMagick for image processing
• Java for the node service
• Apache
• SystemImager for deployment
• Ganglia for distributed system monitoring
• Subcon, which stores essential system configuration files in a Subversion repository for easy deployment to machines in a cluster
• Cvsup for distributing and updating collections of files across a network

The Stats
• More than 4 billion queries per day.
• ~35M photos in Squid cache (total)
• ~2M photos in Squid's RAM
• ~470M photos, 4 or 5 sizes of each
• 38k req/sec to memcached (12M objects)
• 2 PB raw storage (consumed about ~1.5 TB on Sunday)
• Over 400,000 photos being added every day

The Architecture
• A pretty picture of Flickr's architecture can be found on this slide. A simple depiction is:
-- Pair of ServerIrons
---- Squid Caches
------ Net Apps
---- PHP App Servers
------ Storage Manager
------ Master-master shards
------ Dual Tree Central Database
------ Memcached Cluster
------ Big Search Engine
- The Dual Tree structure is a custom set of changes to MySQL that allows scaling by incrementally adding masters without a ring architecture. This allows cheaper scaling because you need less hardware compared to master-master setups, which always require double the hardware.
- The central database includes data like the 'users' table, which includes primary user keys (a few different IDs) and a pointer to the shard on which a user's data can be found (a lookup along these lines is sketched below).
• Use dedicated servers for static content.
• Talks about how to support Unicode.
• Use a shared-nothing architecture.
• Everything (except photos) is stored in the database.
• Statelessness means they can bounce people around servers, and it's easier to make their APIs.
• Scaled at first by replication, but that only helps with reads.
• Create a search farm by replicating the portion of the database they want to search.
• Use horizontal scaling so they just need to add more machines.
• Handle pictures emailed from users by parsing each email as it's delivered, in PHP. Email is parsed for any photos.
• Earlier they suffered from master-slave lag. Too much load, and they had a single point of failure.
• They needed the ability to do live maintenance, repair data, and so forth, without taking the site down.
• Lots of excellent material on capacity planning. Take a look in the Information Sources for more details.
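A minimal sketch of that central-lookup pattern (my illustration; table contents and host names are hypothetical): the central database answers "which shard?", and the application then talks to that shard directly.

```python
# Central 'users' table: user id -> shard id. In Flickr's case this lives in
# the dual-tree central database; here it is just a dict.
USER_SHARD = {1001: "shard-5", 1002: "shard-13"}

# Shard id -> connection info for that shard's master-master pair (hypothetical hosts).
SHARD_HOSTS = {"shard-5": "db5.example.com", "shard-13": "db13.example.com"}

def shard_host_for(user_id: int) -> str:
    """One cheap central lookup; all of the user's reads and writes then go to that shard."""
    return SHARD_HOSTS[USER_SHARD[user_id]]

print(shard_host_for(1001))   # -> db5.example.com
```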
• Went to a federated approach so they can scale far into the future:
- Shards: My data gets stored on my shard, but the record of performing an action on your comment is on your shard, e.g. when making a comment on someone else's blog.
- Global Ring: It's like DNS: you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.
- PHP logic to connect to the shards and keep the data consistent (10 lines of code, with comments!)
• Shards:
- A slice of the main database.
- Active master-master ring replication: there are a few drawbacks in MySQL 4.1, such as honoring commits in master-master. AutoIncrement IDs are automated to keep it active-active.
- Shard assignments come from a random number for new accounts.
- Migration is done from time to time, so you can move certain power users. Needs to be balanced if you have a lot of photos... 192,000 photos and 700,000 tags take about 3-4 minutes. Migration is done manually.
• Clicking a Favorite:
- Pulls the photo owner's account from cache, to get the shard location (say, on shard-5).
- Pulls my information from cache, to get my shard location (say, on shard-13).
- Starts a "distributed transaction" to answer the question: Who favorited the photo? What are my favorites?
• Can ask the question from any shard and recover the data. It's absolutely redundant.
• Getting rid of replication lag:
- On every page load, the user is assigned to a bucket.
- If a host is down, go to the next host in the list; if all hosts are down, display an error page. They don't use persistent connections; they build connections and tear them down. Every page load thus tests the connection.
• Every user's reads and writes are kept in one shard. The notion of replication lag is gone.
• Each server in a shard runs at 50% load, so half the servers in each shard can be shut down: 1 server in the shard can take the full load if another server of that shard is down or in maintenance mode. To upgrade, you just shut down half the shard, upgrade that half, and then repeat the process.
• During traffic spikes they break the 50% rule, though. They do something like 6,000-7,000 queries per second. Now it's designed for at most 4,000 queries per second to keep it at 50% load.
• Average queries per page are 27-35 SQL statements. Favorites counts are real time. API access to the database is all real time. Achieved the real-time requirements without any disadvantages.
• Over 36,000 queries per second, running within the capacity threshold. Bursts of traffic double the 36K qps.
• Each shard holds 400K+ users' data.
- A lot of data is stored twice. For example, a comment is part of the relation between the commenter and the commentee. Where is the comment stored? How about both places? Transactions are used to prevent out-of-sync data: open transaction 1, write commands; open transaction 2, write commands; commit the 1st transaction if all is well; commit the 2nd transaction if the 1st committed. But there is still a chance for failure when a box goes down during the 1st commit.
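A minimal sketch of that dual-write, ordered-commit pattern (my illustration using SQLite stand-ins for two shards; Flickr does this against MySQL). Note the same failure window the text describes, between the two commits:

```python
import sqlite3

def open_shard():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (photo_id INT, commenter INT, body TEXT)")
    return conn

owner_shard, commenter_shard = open_shard(), open_shard()

def write_comment(photo_id, commenter, body):
    row = (photo_id, commenter, body)
    try:
        owner_shard.execute("INSERT INTO comments VALUES (?, ?, ?)", row)
        commenter_shard.execute("INSERT INTO comments VALUES (?, ?, ?)", row)
        owner_shard.commit()
        commenter_shard.commit()  # a crash between the two commits leaves the copies out of sync
    except sqlite3.Error:
        owner_shard.rollback()    # undoes the write if it was not yet committed
        commenter_shard.rollback()
        raise

write_comment(301293250, 1002, "great shot!")
```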
• Search:
- Two search back-ends: the shards (35k qps on a few shards) and Yahoo!'s (proprietary) web search.
- An owner's single-tag search or a batch tag change (say, via Organizr) goes to the shards due to real-time requirements; everything else goes to the Yahoo! engine (probably about 90% behind the real-time goodness).
- Think of it as Lucene-like search.
• Hardware:
- EMT64 with RHEL4, 16 GB RAM
- 6-disk 15K RPM RAID-10
- Data size is at 12 TB of user metadata (these are not photos, this is just InnoDB ibdata files - the photos are a lot larger).
- 2U boxes. Each shard has ~120 GB of data.
• Backup procedure:
- ibbackup on a cron job that runs across various shards at different times.
- Hot backup to a spare.
- Snapshots are taken every night across the entire cluster of databases.
- Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea.
- However much it costs to keep multiple days of backups of all of your data, it's worth it. Keeping staggered backups is good for when you discover something went wrong a few days later. Something like 1-, 2-, 10- and 30-day backups.
• Photos are stored on the filer. Upon upload, it processes the photos, gives you different sizes, then it's complete. Metadata and pointers to the filers are stored in the database.
• Aggregating the data: very fast, because it's a process per shard. Stick it into a table, or recover data from another copy from other users' shards.
• max_connections = 400 connections per shard, or 800 connections per server and shard. Plenty of capacity and connections. The thread cache is set to 45, because you don't have more than 45 users with simultaneous activity.
• Tags:
- Tags do not fit well with traditional normalized RDBMS schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags.
- Some of their data views are calculated offline by dedicated processing clusters, which save the results into MySQL, because some relationships are so complicated to calculate they would absorb all the database CPU cycles.
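A toy sketch of that offline precompute idea (my illustration; a real job would run on a processing cluster and write into a MySQL summary table): count tag frequencies in batch, outside the request path, so page rendering only reads finished counts.

```python
from collections import Counter

# (photo_id, tag) rows as they might come out of the tag tables
photo_tags = [(1, "sunset"), (2, "sunset"), (3, "cat"), (4, "sunset"), (5, "cat")]

# Offline job: aggregate once, in batch...
tag_counts = Counter(tag for _, tag in photo_tags)

# ...then persist into a denormalized summary table that the web tier reads directly.
print(tag_counts.most_common(2))   # [('sunset', 3), ('cat', 2)]
```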
Lessons Learned • Think of your application as more than just a web application. You'll have REST APIs, SOAP APIs, RSS feeds, Atom feeds, etc. • Go stateless. Statelessness makes for a simpler, more robust system that can handle upgrades without flinching. • Re-architecting your database sucks. • Capacity plan. Bring capacity planning into the product discussion EARLY. Get buy-in from the $$$ people (and engineering management) that it's something to watch. • Start slow. Don't buy too much equipment just because you're scared/happy that your site will explode. • Measure reality. Capacity planning math should be based on real things, not abstract ones. • Build in logging and metrics. Usage stats are just as important as server stats. Build in custom metrics to measure real-world usage alongside server-based stats. • Cache. Caching and RAM are the answer to everything. • Abstract. Create clear levels of abstraction between database work, business logic, page logic, page mark-up and the presentation layer. This supports quick-turnaround iterative development. • Layer. Layering allows developers to create page-level logic which designers can use to build the user experience. Designers can ask for page logic as needed. It's a negotiation between the two parties. • Release frequently. Even every 30 minutes. • Forget about small efficiencies, about 97% of the time. Premature optimization is the root of all evil. • Test in production. Build into the architecture mechanisms (config flags, load balancing, etc.) with which you can deploy new hardware easily into (and out of) production. • Forget benchmarks. Benchmarks are fine for getting a general idea of capabilities, but not for planning. Artificial tests give artificial results, and the time is better spent testing for real. • Find ceilings. - What is the maximum something that every server can do? - How close are you to that maximum, and how is it trending? - MySQL (disk IO?) - SQUID (disk IO? or CPU?) - memcached (CPU? or network?) • Be sensitive to the usage patterns for your type of application. - Do you have event-related growth? For example: disasters, news events. - Flickr gets 20-40% more uploads on the first work day of the year than any previous peak the previous year. - 40-50% more uploads on Sundays than the rest of the week, on average. • Be sensitive to the demands of exponential growth. More users means more content, more content means more connections, more connections mean more usage. • Plan for peaks. Be able to handle peak loads up and down the stack. Comments Wed, 08/08/2007 - 13:23 — Sam How to store images? Is there an easier solution for managing images with a combination of database and files? It seems storing your images in the database might really slow down the site. Wed, 08/08/2007 - 16:23 — Douglas F Shearer RE: How to store images? Flickr only store a reference to an image in their databases; the actual file is stored on a separate storage server elsewhere on the network. A typical URL for a Flickr image looks like this: http://farm1.static.flickr.com/104/301293250_dc284905d0_m.jpg If we split this up we get: farm1 - Obviously the farm at which the image is stored. I have yet to see a value other than one. .static.flickr.com - Fairly self explanatory. /104 - The server ID number. /301293250 - The image ID. _dc284905d0 - The image 'secret'. I assume this is to prevent images being copied without first getting the information from the API. _m - The size of the image. In this case the 'm' denotes medium, but this can be small, thumb, etc. For the standard image size there is no size suffix of this form in the URL.
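Based purely on the breakdown in the comment above, here is a small Python function that splits such a URL into its parts. The field names are informal labels taken from the comment, not official Flickr terminology:

import re

def parse_flickr_url(url):
    """Split a Flickr photo URL into the parts described above.
    The size suffix is optional, matching the standard-size case."""
    pattern = (r"http://farm(?P<farm>\d+)\.static\.flickr\.com/"
               r"(?P<server>\d+)/(?P<photo_id>\d+)_(?P<secret>[0-9a-f]+)"
               r"(?:_(?P<size>[a-z]+))?\.jpg")
    m = re.match(pattern, url)
    return m.groupdict() if m else None

print(parse_flickr_url("http://farm1.static.flickr.com/104/301293250_dc284905d0_m.jpg"))
# {'farm': '1', 'server': '104', 'photo_id': '301293250',
#  'secret': 'dc284905d0', 'size': 'm'}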
Amazon Architecture Tue, 09/18/2007 - 19:44 — Todd Hoff • Amazon Architecture This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included. Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles. Site: http://amazon.com Information Sources
• Early Amazon by Greg Linden • How Linux saved Amazon millions • Interview with Werner Vogels - Amazon's CTO • Asynchronous Architectures - a nice summary of Werner Vogels' talk by Chris Loosley • Learning from the Amazon technology platform - A Conversation with Werner Vogels • Werner Vogels' Weblog - building scalable and robust distributed systems Platform • Linux • Oracle • C++ • Perl • Mason • Java • Jboss • Servlets The Stats • More than 55 million active customer accounts. • More than 1 million active retail partners worldwide. • Between 100-150 services are accessed to build a page. The Architecture • What is it that we really mean by scalability? A service is said to be scalable if, when we increase the resources in a system, performance increases in a manner proportional to the resources added. Increasing performance in general means serving more units of work, but it can also mean handling larger units of work, such as when datasets grow. • The big architectural change Amazon made was to move from a two-tier monolith to a fully-distributed, decentralized services platform serving many different applications. • Started as one application talking to a back end. Written in C++. • It grew. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. In 2001 it became clear that the front-end application couldn't scale anymore. The databases were split into small parts, and around each part a services interface was created that was the only way to access the data. • The databases had become a shared resource that made it hard to scale out the overall business. The front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes. • Their architecture is loosely coupled and built around services. A service-oriented architecture gave them the isolation that would allow building many software components rapidly and independently. • Grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server. So are the applications that serve the Web-services interface, the customer service application, and the seller interface. • Many third party technologies are hard to scale to Amazon's size, especially communication infrastructure technologies. They work well up to a certain scale and then fail. So Amazon is forced to build their own. • Not stuck with one particular approach. In some places they use jboss/java, but they use only servlets, not the rest of the J2EE stack. • C++ is used to process requests. Perl/Mason is used to build content. • Amazon doesn't like middleware because it tends to be a framework and not a tool. If you use a middleware package you get lock-in around the software patterns they have chosen: you'll only be able to use their software, and if you want to use different packages you won't be able to. You're stuck. One event loop for messaging, data persistence, AJAX, etc. Too complex. If middleware were available in smaller components, more as a tool than a framework, they would be more interested. • The SOAP web stack seems to want to solve all the same distributed systems problems all over again. • Offer both SOAP and REST web services. 30% use SOAP. These tend to be Java and .NET users and use WSDL files to generate remote object interfaces. 70% use REST.
These tend to be PHP or Perl users. • In either SOAP or REST, developers can get an object interface to Amazon. Developers just want to get the job done. They don't care what goes over the wire. • Amazon wanted to build an open community around their services. Web services were chosen because they're simple. But that's only on the perimeter. Internally it's a service-oriented architecture. You can only access the data via the interface. It's described in WSDL, but they use their own encapsulation and transport mechanisms. • Teams are Small and are Organized Around Services - Services are the independent units delivering functionality within Amazon. It's also how Amazon is organized internally in terms of teams. - If you have a new business idea or problem you want to solve, you form a team. Limit the team to 8-10 people because communication is hard. They are called two-pizza teams: the number of people you can feed with two pizzas. - Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit. - As an example, they created a team to find phrases within a book that are unique to the text. This team built a separate service interface for that feature and they had the authority to do what they needed. - Extensive A/B testing is used to integrate a new service. They see what the impact is and take extensive measurements. • Deployment - They create special infrastructure for managing dependencies and doing a deployment. - The goal is to have all the right services deployed on a box. All application code, monitoring, licensing, etc. should be on a box. - Everyone has a home-grown system to solve these problems. - The output of the deployment process is a virtual machine. You can use EC2 to run them. • Work From the Customer Backwards to Verify a New Service is Worth Doing - Work from the customer backward. Focus on the value you want to deliver for the customer. - Force developers to focus on the value delivered to the customer instead of building technology first and then figuring out how to use it. - Start with a press release of what features the user will see and work backwards to check that you are building something valuable. - End up with a design that is as minimal as possible. Simplicity is the key if you really want to build large distributed systems. • State Management is the Core Problem for Large Scale Systems - Internally they can deliver infinite storage. - Not all that many operations are stateful. Checkout steps are stateful. - The most-recently-clicked web page service has recommendations based on session IDs. - They keep track of everything anyway, so it's not a matter of keeping state. There's little separate state that needs to be kept for a session. The services will already be keeping the information, so you just use the services. • Eric Brewer's CAP Theorem, or the Three Properties of Systems - Three properties of a system: consistency, availability, tolerance to network partitions. - You can have at most two of these three properties for any shared-data system. - Partitionability: divide nodes into small groups that can see other groups, but can't see everyone. - Consistency: write a value and then read the value, and you get the same value back. In a partitioned system there are windows where that's not true. - Availability: you may not always be able to write or read. The system may say you can't write because it wants to keep the system consistent. - To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency. - Choose a specific approach based on the needs of the service. - For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later (a toy sketch of this choice follows). - When a customer submits an order you favor consistency, because several services--credit card processing, shipping and handling, reporting--are simultaneously accessing the data.
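A toy sketch of what choosing availability for the cart can look like: accept the add-to-cart write on whichever replica is reachable, and reconcile the copies later. This illustrates the principle from the CAP discussion above, not Amazon's actual implementation (which uses far more sophisticated reconciliation):

class CartReplica:
    """Toy replica that accepts add-to-cart writes independently."""
    def __init__(self):
        self.items = []   # writes accepted locally
        self.up = True

    def add(self, item):
        if not self.up:
            raise ConnectionError("replica down")
        self.items.append(item)

def add_to_cart(replicas, item):
    # Favor availability: try replicas until one accepts the write, so
    # the customer never sees an error just because one node is down.
    for r in replicas:
        try:
            r.add(item)
            return True
        except ConnectionError:
            continue
    return False

def merged_cart(replicas):
    # Consistency is restored later by merging what every replica saw.
    # A real system needs a smarter merge (version vectors, etc.);
    # a de-duplicated union is enough to show the idea.
    seen, merged = set(), []
    for r in replicas:
        for item in r.items:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

replicas = [CartReplica(), CartReplica()]
replicas[0].up = False            # one node fails...
add_to_cart(replicas, "book")     # ...but the write still succeeds
print(merged_cart(replicas))      # ['book']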
Lessons Learned • You must change your mentality to build really scalable systems. Approach chaos in a probabilistic sense: expect that things will work well overall even though parts fail. In traditional systems we present a perfect world where nothing goes down, and then we build complex algorithms (agreement technologies) on this perfect world. Instead, take it for granted that stuff fails - that's reality - and embrace it. For example, go more with a fast reboot and fast recover approach. With a decent spread of data and services you might get close to 100%. Create self-healing, self-organizing, lights-out operations. • Create a shared-nothing infrastructure. Infrastructure can become a shared resource for development and deployment, with the same downsides as shared resources in your logic and data tiers. It can cause locking, blocking, and deadlock. A service-oriented architecture allows the creation of a parallel and isolated development process that scales feature development to match your growth. • Open up your system with APIs and you'll create an ecosystem around your application. • The only way to manage a large distributed system is to keep things as simple as possible. Keep things simple by making sure there are no hidden requirements and hidden dependencies in the design. Cut technology to the minimum you need to solve the problem you have. It doesn't help the company to create artificial and unneeded layers of complexity. • Organizing around services gives agility. You can do things in parallel because the output is a service. This allows fast time to market. Create an infrastructure that allows services to be built very fast. • There are bound to be problems with anything that produces hype before real implementation. • Use SLAs internally to manage services. • Anyone can very quickly add web services to their product. Just implement one part of your product as a service and start using it. • Build your own infrastructure for performance, reliability, and cost control reasons. By building it yourself you never have to say you went down because it was company X's fault. Your software may not be more reliable than others', but you can fix, debug, and deploy much quicker than when working with a 3rd party. • Use measurement and objective debate to separate the good from the bad. I've been to several presentations by ex-Amazoners, and this is the aspect of Amazon that strikes me as uniquely different and interesting from other companies. Their deep-seated ethic is to expose real customers to a choice and see which one works best, and to make decisions based on those tests. Avinash Kaushik calls this getting rid of the influence of the HiPPOs, the highest paid people in the room. This is done with techniques like A/B testing and Web Analytics.
If you have a question about what you should do, code it up, let people use it, and see which alternative gives you the results you want. • Create a frugal culture. Amazon used doors for desks, for example. • Know what you need. Amazon had a bad experience with an early recommender system that didn't work out: "This wasn't what Amazon needed. Book recommendations at Amazon needed to work from sparse data, just a few ratings or purchases. It needed to be fast. The system needed to scale to massive numbers of customers and a huge catalog. And it needed to enhance discovery, surfacing books from deep in the catalog that readers wouldn't find on their own." • People's side projects, the ones they pursue because they are interested, are often the ones where you get the most value and innovation. Never underestimate the power of wandering where you are most interested. • Involve everyone in making dog food. Go out into the warehouse and pack books during the Christmas rush. That's teamwork. • Create a staging site where you can run thorough tests before releasing into the wild. • A robust, clustered, replicated, distributed file system is perfect for read-only data used by the web servers. • Have a way to roll back if an update doesn't work. Write the tools if necessary. • Switch to a deep services-based architecture ( http://webservices.sys-con.com/read/262024.htm). • Look for three things in interviews: enthusiasm, creativity, competence. The single biggest predictor of success at Amazon.com was enthusiasm. • Hire a Bob. Someone who knows their stuff, has incredible debugging skills and system knowledge, and most importantly, has the stones to tackle the worst high-pressure problems imaginable by just leaping in. • Innovation can only come from the bottom. Those closest to the problem are in the best position to solve it. Any organization that depends on innovation must embrace chaos. Loyalty and obedience are not your tools. • Creativity must flow from everywhere. • Everyone must be able to experiment, learn, and iterate. Position, obedience, and tradition should hold no power. For innovation to flourish, measurement must rule. • Embrace innovation. In front of the whole company, Jeff Bezos would give an old Nike shoe as a "Just do it" award to those who innovated. • Don't pay for performance. Give good perks and high pay, but keep it flat. Recognize exceptional work in other ways. Merit pay sounds good but is almost impossible to do fairly in large organizations. Use non-monetary awards, like an old shoe. It's a way of saying thank you, somebody cared. • Get big fast. The big guys like Barnes and Noble are on your tail. Amazon wasn't even the first, second, or even third book store on the web, but their vision and drive won out in the end. • In the data center, only 30 percent of staff time spent on infrastructure issues was related to value creation, with the remaining 70 percent devoted to dealing with the "heavy lifting" of hardware procurement, software management, load balancing, maintenance, scalability challenges and so on. • Prohibit direct database access by clients. This means you can make your service scale and be more reliable without involving your clients. This is much like Google's ability to independently distribute improvements in their stack to the benefit of all applications. • Create a single unified service-access mechanism. This allows for the easy aggregation of services, decentralized request routing, distributed request tracking, and other advanced infrastructure techniques (a minimal sketch of data access behind a service interface follows).
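A minimal sketch of those last two lessons: all reads and writes go through a service object, and the store behind it stays private, so the owning team can reshard, cache, or replace storage without touching any client. Purely illustrative names; the same idea applies whether the interface is an in-process object or an HTTP/WSDL endpoint:

class UserService:
    """The service interface is the only way at the data. Callers never
    see the underlying store, so it can change shape freely."""
    def __init__(self, store):
        self._store = store          # private: clients cannot reach it

    def get_user(self, user_id):
        # Hand out copies, never internal references.
        return dict(self._store[user_id])

    def rename_user(self, user_id, name):
        self._store[user_id]["name"] = name

service = UserService({1: {"name": "alice"}})
service.rename_user(1, "bob")
print(service.get_user(1))   # {'name': 'bob'} - the sanctioned access path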
• Making Amazon.com available through a Web services interface to any developer in the world, free of charge, has also been a major success, because it has driven so much innovation that they couldn't have thought of or built it all on their own. • Developers themselves know best which tools make them most productive and which tools are right for the job. • Don't impose too many constraints on engineers. Provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, allow teams to function as independently as possible. • Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. Have many support tools that are of a self-help nature. Support an environment around the service development that never gets in the way of the development itself. • You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service. • Developers should spend some time with customer service every two years. There they'll actually listen to customer service calls, answer customer service e-mails, and really understand the impact of the kinds of things they do as technologists. • Use a "voice of the customer," which is a realistic story from a customer about some specific part of your site's experience. This helps managers and engineers connect with the fact that we build these technologies for real people. Customer service statistics are an early indicator if you are doing something wrong, or of what the real pain points are for your customers. • Infrastructure for Amazon, like for Google, is a huge competitive advantage. They can build very complex applications out of primitive services that are by themselves relatively simple. They can scale their operation independently, maintain unparalleled system availability, and introduce new services quickly without the need for massive reconfiguration. Comments Thu, 08/09/2007 - 16:22 — herval Jeff.. Bazos? Wed, 08/29/2007 - 22:07 — Joachim Rohde Werner Vogels, the CTO of Amazon, spoke a tiny bit about technical details on SE-Radio. You can find the podcast under http://www.se-radio.net/index.php?post_id=157593 Interesting episode. Fri, 08/31/2007 - 05:27 — Todd Hoff
Re: Amazon Architecture That is a good interview. Thanks. I'll be adding the new information soon. Fri, 09/07/2007 - 08:39 — Arturo Fernandez Re: Amazon Architecture Amazon uses Perl and Mason. See: http://www.masonhq.com/?MasonPoweredSites Tue, 09/11/2007 - 07:52 — Alexei A. Korolev Re: Amazon Architecture As I see it, they are reducing the C++ part and moving to J2EE? Tue, 09/18/2007 - 02:30 — Anonymous It's WSDL I am not sure how you can get that one wrong, unless you are a manager, but even then some engineer would school you 'til Sunday. Tue, 09/18/2007 - 04:37 — Todd Hoff
Re: It's WSDL Thanks for catching that. I listen to these things a few times and sometimes I just write what I hear instead of what I mean. Tue, 09/18/2007 - 07:28 — Werner Re: Amazon Architecture I actually gave a crisper definition of scalability at: A Word on Scalability. Personally I like the interview in ACM Queue best for a high level view. --Werner Scaling Twitter: Making Twitter 10000 Percent Faster
Mon, 10/08/2007 - 21:01 — Todd Hoff • Scaling Twitter: Making Twitter 10000 Percent Faster Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:
For us, it's really about scaling horizontally - to that end, Rails and Ruby haven't been stumbling blocks, compared to any other language or framework. The performance boosts associated with a "faster" language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January. If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher? Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz Site: http://twitter.com Information Sources • Scaling Twitter Video by Blaine Cook. • Scaling Twitter Slides • Good News blog post by Rick Denatale • Scaling Twitter blog post by Patrick Joyce. • Twitter API Traffic is 10x Twitter's Site. • A Small Talk on Getting Big. Scaling a Rails App & all that Jazz - really cute dog pics The Platform • Ruby on Rails • Erlang • MySQL • Mongrel - hybrid Ruby/C HTTP server designed to be small, fast, and secure • Munin • Nagios • Google Analytics • AWStats - real-time logfile analyzer to get advanced statistics • Memcached The Stats • Over 350,000 users. The actual numbers are, as always, very super super top secret. • 600 requests per second. • Average 200-300 connections per second, spiking to 800 connections per second. • MySQL handles 2,400 requests per second. • 180 Rails instances. Uses Mongrel as the "web" server. • 1 MySQL server (one big 8-core box) and 1 slave. The slave is read-only, for statistics and reporting. • 30+ processes for handling odd jobs. • 8 Sun X4100s. • Process a request in 200 milliseconds in Rails. • Average time spent in the database is 50-100 milliseconds. • Over 16 GB of memcached. The Architecture • Ran into very public scaling problems. The little bird of failure popped up a lot for a while. • Originally they had no monitoring, no graphs, no statistics, which makes it hard to pinpoint and solve problems. Added Munin and Nagios. There were difficulties using tools on Solaris. Had Google Analytics, but the pages weren't loading so it wasn't that helpful :-) • Use caching with memcached a lot. - For example, if getting a count is slow, you can memoize the count into memcache in a millisecond (see the sketch after this section). - Getting your friends' status is complicated. There are security and other issues. So rather than doing a query, a friend's status is updated in cache instead. It never touches the database. This gives a predictable response time frame (upper bound 20 msecs). - ActiveRecord objects are huge, so they aren't cached. They want to store critical attributes in a hash and lazy-load the other attributes on access. - 90% of requests are API requests. So they don't do any page/fragment caching on the front-end. The pages are so time-sensitive it doesn't do any good. But they cache API requests.
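A sketch of that count memoization using the python-memcached client. It assumes a memcached instance on localhost; the key name, the 60-second TTL, and the compute_count callable are illustrative choices, not Twitter's actual values (theirs is Ruby):

import memcache   # python-memcached client; assumes memcached on localhost

mc = memcache.Client(["127.0.0.1:11211"])

def follower_count(user_id, compute_count):
    """Serve the cached count when present; otherwise run the slow
    query once and cache the result."""
    key = "follower_count:%d" % user_id
    cached = mc.get(key)
    if cached is not None:
        return cached                    # ~1ms cache hit
    count = compute_count(user_id)       # the slow database query
    mc.set(key, count, time=60)          # cache for 60s; tune to taste
    return count

# Usage: the first call hits the database, later calls are cache reads.
# follower_count(42, lambda uid: run_slow_sql_count(uid))  # hypothetical query fn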
• Messaging - Use messaging a lot. Producers produce messages, which are queued and then distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc). - Send a message to invalidate a friend's cache in the background, instead of doing them all individually and synchronously. - Started with DRb, which stands for distributed Ruby: a library that allows you to send and receive messages from remote Ruby objects via TCP/IP. But it was a little flaky and a single point of failure. - Moved to Rinda, which is a shared queue that uses a tuplespace model, along the lines of Linda. But the queues aren't persistent, so messages are lost on failure. - Tried Erlang. Problem: how do you get a broken server running on a Sunday with 20,000 users waiting? The developer didn't know. Not a lot of documentation. So it violates the use-what-you-know rule. - Moved to Starling, a distributed queue written in Ruby. - Distributed queues were made to survive system crashes by writing them to disk. Other big websites take this simple approach as well. • SMS is handled using an API supplied by third party gateways. It's very expensive. • Deployment - They do a review and push out new mongrel servers. No graceful way yet. - An internal server error is given to the user if their mongrel server is replaced. - All servers are killed at once. A rolling blackout isn't used because the message queue state is in the mongrels, and a rolling approach would cause all the queues in the remaining mongrels to fill up. • Abuse - A lot of down time because people crawl the site and add everyone as friends. 9000 friends in 24 hours. It would take down the site. - Build tools to detect these problems so you can pinpoint when and where they are happening. - Be ruthless. Delete them as users. • Partitioning - They plan to partition in the future. Currently they don't; these changes have been enough so far. - The partition scheme will be based on time, not users, because most requests are very temporally local. - Partitioning will be difficult because of automatic memoization. They can't guarantee read-only operations will really be read-only. They may write to a read-only slave, which is really bad. • Twitter's API traffic is 10x Twitter's site. - Their API is the most important thing Twitter has done. - Keeping the service simple allowed developers to build on top of their infrastructure and come up with ideas that are way better than Twitter could come up with. For example, Twitterrific, which is a beautiful way to use Twitter that a small team with different priorities could create. • Monit is used to kill processes if they get too big. Lessons Learned • Talk to the community. Don't hide and try to solve all problems yourself. Many brilliant people are willing to help if you ask. • Treat your scaling plan like a business plan. Assemble a board of advisers to help you. • Build it yourself. Twitter spent a lot of time trying other people's solutions that just almost seemed to work, but not quite. It's better to build some things yourself so you at least have some control and you can build in the features you need. • Build in user limits. People will try to bust your system. Put in reasonable limits and detection mechanisms to protect your system from being killed. • Don't make the database the central bottleneck of doom. Not everything needs to require a gigantic join. Cache data. Think of other creative ways to get the same result. A good example is talked about in Twitter, Rails, Hammers, and 11,000 Nails per Second. • Make your application easily partitionable from the start. Then you always have a way to scale your system. • Realize your site is slow. Immediately add reporting to track problems. • Optimize the database. - Index everything. Rails won't do this for you. - Use EXPLAIN to see how your queries are running. Indexes may not be used as you expect. - Denormalize a lot. It single-handedly saved them. For example, they store all of a user's friend IDs together, which prevented a lot of costly joins (a sketch of this pattern follows the list below). - Avoid complex joins. - Avoid scanning large sets of data. • Cache the hell out of everything. Individual active records are not cached, yet. The queries are fast enough for now.
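A sketch of that friend-IDs denormalization: store the whole ID list as one serialized value keyed by user, so a read is a single primary-key lookup instead of a join against a friendships table. The schema is illustrative, not Twitter's actual tables:

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friend_lists (user_id INTEGER PRIMARY KEY, friend_ids TEXT)")

def set_friends(user_id, friend_ids):
    # One row per user: one write, and reads need no join.
    conn.execute("INSERT OR REPLACE INTO friend_lists VALUES (?, ?)",
                 (user_id, json.dumps(friend_ids)))
    conn.commit()

def get_friends(user_id):
    # A single primary-key lookup replaces a join over a friendships table.
    row = conn.execute("SELECT friend_ids FROM friend_lists WHERE user_id = ?",
                       (user_id,)).fetchone()
    return json.loads(row[0]) if row else []

set_friends(1, [2, 3, 5, 8])
print(get_friends(1))   # [2, 3, 5, 8]

The classic denormalization trade-off applies: changing one friendship means rewriting the whole row, which is fine when reads vastly outnumber writes.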
• Test everything. - You want to know when you deploy an application that it will render correctly. - They have a full test suite now. So when the caching broke they were able to find the problem before going live. • Long-running processes should be abstracted to daemons. • Use exception notifier and exception logger to get immediate notification of problems so you can address them right away. • Don't do stupid things. - Scale changes what can be stupid. - Trying to load 3000 friends at once into memory can bring a server down, but when there were only 4 friends it worked great. • Most performance comes not from the language, but from application design. • Turn your website into an open service by creating an API. Their API is a huge reason for Twitter's success. It allows users to create an ever-expanding ecosystem around Twitter that is difficult to compete with. You can never do all the work your users can do, and you probably won't be as creative. So open your application up and make it easy for others to integrate your application with theirs. Related Articles • For a discussion of partitioning take a look at Amazon Architecture, An Unorthodox Approach to Database Design : The Coming of the Shard, Flickr Architecture • The Mailinator Architecture has good strategies for abuse protection. • GoogleTalk Architecture addresses some interesting issues when scaling social networking sites. Comments Thu, 09/13/2007 - 22:51 — Royans Re: Scaling Twitter: Making Twitter 10000 Percent Faster Todd, thanks for the excellent research u did on twitter. Its amazing that the entire Twitter infrastructure is running with just one rw database. Would be interesting to find out the usage stats on that single box... Fri, 09/14/2007 - 00:15 — Bob Warfield Re: Scaling Twitter: Making Twitter 10000 Percent Faster Loved your article, it echoes a lot of themes I've been talking about for awhile on my blog, so I wrote about the Twitter case based on your article here: http://smoothspan.wordpress.com/2007/09/14/twitter-scaling-story-mirrors... Sat, 09/15/2007 - 07:15 — Shanti Braford Re: Scaling Twitter: Making Twitter 10000 Percent Faster I wonder what the RoR haters will make up now to say that ruby doesn't scale. They loved jumping on the ruby hate bandwagon when twitter was going through its difficulties. Little bo beep has been quite silent since. Caching was the answer? Shock. Gasp. Awe. Just like PHP?!? Crazy! Sat, 09/15/2007 - 11:23 — Dave Hoover Re: Scaling Twitter: Making Twitter 10000 Percent Faster I think you're referring to Starfish, not Starling. Great article! Thu, 09/20/2007 - 08:36 — choonkeat Re: Scaling Twitter: Making Twitter 10000 Percent Faster No, it's not Starfish. In the video of his presentation, he mentions "so I wrote Starling..." Thu, 09/20/2007 - 16:02 — miles Re: Scaling Twitter: Making Twitter 10000 Percent Faster Great article (and site), Todd. Thanks for pulling all this information together. It's a great resource. ps. @Dave: Blaine referred to his 'starling' messaging framework at the SJ Ruby Conference earlier in the year. Sat, 09/22/2007 - 14:01 — Marcus They could have been 20% better?
So, let's be clear: the biased source in defense mode says themselves they could have been 20% faster just by selecting a different language (note that it doesn't exactly say what the performance hit of the Rails framework itself is, so let's just go with a 20% improvement by changing languages and ignore potential problems in (1) their coding decisions and (2) their chosen framework).... Wow, sign me up for an easy 20% improvement! Yeah, yeah, I know, I'll hear the usual tripe about how amazingly fast Ruby is to develop with. Visual Basic is pretty easy too, as is PHP, but I don't use those either. Mon, 09/24/2007 - 08:02 — Mikael Re: Scaling Twitter: Making Twitter 10000 Percent Faster Sounds like Ruby on Rails _was_ to blame, as the 10000 percent improvement was reached by essentially removing the "on Rails" part of the equation through extensive caching. This seems to be the real weakness of RoR; Ruby in itself seems OK performance-wise - slower than PHP for example, but not catastrophically so. PHP is slower than Java but scales nicely anyway. The database abstraction in "on Rails" is a real performance killer though, and all the high traffic sites that use RoR successfully (twitter, penny arcade, ...) seem to have taken steps to avoid using the database abstraction on live page views by extensive caching. Of course, caching is a necessary tool for scaling regardless of the platform, but with a less inefficient abstraction layer than the one in RoR it is possible to grow more before you have to recode stuff for caching. Fri, 10/12/2007 - 18:07 — Dustin Puryear Re: Scaling Twitter: Making Twitter 10000 Percent Faster Excellent article. I agree with one of the other commenters that it's surprising they have this running from a single MySQL server. Wow. The fact that Twitter tends to be very write-heavy, and MySQL isn't exactly perfect for multi-master replication architectures, probably has a lot to do with that. I wonder what they are planning to do for future growth? Obviously this will not continue to work as-is. -- Dustin Puryear Author, "Best Practices for Managing Linux and UNIX Servers" http://www.puryear-it.com/pubs/linux-unix-best-practices Google Architecture
Mon, 07/23/2007 - 04:26 — Todd Hoff • Google Architecture Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet-scale applications at an alarmingly high, competition-crushing rate. Their goal is always to build a higher performing, higher scaling infrastructure to support their products. How do they do that? Information Sources • Video: Building Large Systems at Google • Google Lab: The Google File System • Google Lab: MapReduce: Simplified Data Processing on Large Clusters • Google Lab: BigTable. • Video: BigTable: A Distributed Structured Storage System. • Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems. • How Google Works by David Carr in Baseline Magazine. • Google Lab: Interpreting the Data: Parallel Analysis with Sawzall. • Dare Obasanjo's notes on the scalability conference. Platform • Linux • A large diversity of languages: Python, Java, C++ What's Inside? The Stats • An estimated 450,000 low-cost commodity servers in 2006. • In 2005 Google indexed 8 billion web pages. By now, who knows? • Currently there are over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across a cluster. • Currently there are 6000 MapReduce applications at Google, and hundreds of new applications are being written each month. • BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users. The Stack Google visualizes their infrastructure as a three-layer stack: • Products: search, advertising, email, maps, video, chat, blogger • Distributed Systems Infrastructure: GFS, MapReduce, and BigTable. • Computing Platforms: a bunch of machines in a bunch of different data centers • Make it easy for folks in the company to deploy at a low cost. • Look at price/performance data on a per-application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data. Reliable Storage Mechanism with GFS (Google File System) • Reliable scalable storage is a core need of any application. GFS is their core storage platform. • Google File System - a large distributed log-structured file system into which they throw a lot of data. • Why build it instead of using something off the shelf? Because they control everything, and it's the platform that distinguishes them from everyone else. They required: - high reliability across data centers - scalability to thousands of network nodes - huge read/write bandwidth - support for large blocks of data which are gigabytes in size - efficient distribution of operations across nodes to reduce bottlenecks • The system has master and chunk servers. - Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that contains the data they need on disk. - Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers (a small toy of this read path follows).
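An in-memory toy of that read path, just to show the division of labor - metadata from the master, bytes straight from a chunkserver. Real GFS adds leases, replica selection, and failure handling; all names here are illustrative:

CHUNK_SIZE = 64 * 1024 * 1024   # GFS stores files as 64MB chunks

class Master:
    """Holds only metadata: which chunkservers hold which chunk."""
    def __init__(self):
        self.chunk_locations = {}   # (filename, chunk_index) -> [chunkserver ids]

class ChunkServer:
    """Holds the actual chunk bytes (a dict stands in for disk)."""
    def __init__(self):
        self.chunks = {}            # (filename, chunk_index) -> bytes

def read(master, chunkservers, filename, offset, length):
    # 1. Ask the master which chunk we need and where its replicas live.
    index = offset // CHUNK_SIZE
    replicas = master.chunk_locations[(filename, index)]
    # 2. Fetch the bytes directly from a chunkserver. The master never
    #    touches file contents, so it never becomes a bandwidth bottleneck.
    data = chunkservers[replicas[0]].chunks[(filename, index)]
    start = offset % CHUNK_SIZE
    return data[start:start + length]

master, cs = Master(), ChunkServer()
cs.chunks[("crawl.log", 0)] = b"x" * 1024
master.chunk_locations[("crawl.log", 0)] = ["cs0"]
print(read(master, {"cs0": cs}, "crawl.log", 100, 10))   # b'xxxxxxxxxx'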
• A new application coming online can use an existing GFS cluster or make its own. It would be interesting to understand the provisioning process they use across their data centers. • The key is having enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs. Do Something With the Data Using MapReduce • Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across 1000 machines. Databases don't scale to those levels, or don't scale cost-effectively. That's where MapReduce comes in. • MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. • Why use MapReduce? - A nice way to partition tasks across lots of machines. - Handles machine failure. - Works across different application types, like search and ads. Almost every application has map-reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc. - Computation can automatically move closer to the IO source. • The MapReduce system has three different types of servers. - The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks. - The Map servers accept user input and perform map operations on it. The results are written to intermediate files. - The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them. • For example, say you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously, and all the coordination, job scheduling, failure handling, and data transport would be done automatically. - The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS. - In MapReduce a map maps one view of data to another, producing a key/value pair, which in our example is word and count. - Shuffling aggregates key types. - The reductions sum up all the key/value pairs and produce the final answer. • The Google indexing pipeline has about 20 different map reductions. A pipeline looks at data as a whole bunch of records and aggregating keys. A second map-reduce comes along, takes that result and does something else. And so on. • Programs can be very small. As little as 20 to 50 lines of code (the word-count sketch below gives the flavor).
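The word-count example in miniature, as plain single-process Python. In the real system the map calls run in parallel on 1000s of machines over GFS data, but the map/shuffle/reduce shape is exactly this:

from collections import defaultdict

def map_phase(page_text):
    # Map: emit one (key, value) pair per word occurrence.
    return [(word, 1) for word in page_text.split()]

def shuffle(pairs):
    # Shuffle: group all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the values for each key to get the final counts.
    return {word: sum(counts) for word, counts in grouped.items()}

pages = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for page in pages for pair in map_phase(page)]
print(reduce_phase(shuffle(pairs)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}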
• One problem is stragglers. A straggler is a computation that is going slower than the others, which holds up everyone. Stragglers may happen because of slow IO (say, a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one is done, kill all the rest. • Data transferred between map and reduce servers is compressed. The idea is that because the servers aren't CPU bound, it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O. Storing Structured Data in BigTable • BigTable is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second. • BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL-type queries. • It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure. • Commercial databases simply don't scale to this level, and they don't work across 1000s of machines. • By controlling their own low-level storage system, Google gets more control and leverage to improve their system. For example, if they want features that make cross-data-center operations easier, they can build them in. • Machines can be added and deleted while the system is running, and the whole system just works. • Each data item is stored in a cell which can be accessed using a row key, column key, or timestamp. • Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable. • BigTable has three different types of servers: - The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed. - The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB-200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers. - The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion. • A locality group can be used to physically store related bits of data together for better locality of reference. • Tablets are cached in RAM as much as possible. Hardware • When you have a lot of machines, how do you build them to be cost-efficient and use power efficiently? • Use ultra-cheap commodity hardware and build software on top to handle their death. • A 1,000-fold computer power increase can be had for a 33-times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work. • Linux, in-house rack design, PC-class motherboards, low-end storage. • Price per wattage on a performance basis isn't getting better. They have huge power and cooling issues. • Use a mix of collocation and their own data centers. Misc • Push changes out quickly rather than wait for QA. • Libraries are the predominant way of building programs. • Some applications are provided as services, like crawling. • An infrastructure handles versioning of applications, so they can be released without a fear of breaking things. Future Directions for Google • Support geo-distributed clusters. • Create a single global namespace for all data. Currently data is segregated by cluster. • More and better automated migration of data and computation. • Solve consistency issues that happen when you couple wide-area replication with network partitioning (e.g.
keeping services up even if a cluster goes offline for maintenance or due to some sort of outage). Lessons Learned • Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at a scale few others can compete with. Many companies take a completely different approach. Many companies treat infrastructure as an expense. Each group will use completely different technologies and there will be little planning and commonality in how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software. • Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky. • Take a look at Hadoop if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here. • An under-appreciated advantage of a platform approach is that junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel, you'll run into difficulty, because the people who know how to do this are relatively rare. • Synergy isn't always crap. By making all parts of a system work together, an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system, then there's no continual incremental improvement across the entire stack. • Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines offline, and gracefully handle upgrades. • Create a Darwinian infrastructure. Perform time-consuming operations in parallel and take the winner. • Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large-scale deployment. • Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO. Digg Architecture
Tue, 08/07/2007 - 01:28 — Todd Hoff • Digg Architecture Traffic generated by Digg's over 1.2 million famously info-hungry users can crash an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle all this load? Site: http://digg.com Information Sources • How Digg.com uses the LAMP stack to scale upward • Digg PHP's Scalability and Performance Platform • MySQL • Linux • PHP • Lucene • APC PHP Accelerator • MCache The Stats • Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the default MyISAM storage engine. • Over 1.2 million users. • Over 200 million page views per month. • 100 servers hosted in multiple data centers. - 20 database servers - 30 Web servers - A few search servers running Lucene. - The rest are used for redundancy. • 30GB of data. • None of the scaling challenges they faced had anything to do with PHP. The biggest issues faced were database related. • The lightweight nature of PHP allowed them to move processing tasks from the database to PHP in order to improve scaling. Ebay does this in a radical way. They moved nearly all work out of the database and into applications, including joins, an operation we normally think of as the job of the database. What's Inside • A load balancer in the front sends queries to PHP servers. • Uses a MySQL master-slave setup (a sketch of the usual read/write routing for such a setup follows this section). - Transaction-heavy servers use the InnoDB storage engine. - OLAP-heavy servers use the MyISAM storage engine. - They did not notice a performance degradation moving from MySQL 4.1 to version 5. • Memcached is used for caching. • Sharding is used to break the database into several smaller ones. • Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale. • They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance. But what it does is leave a giant data integrity hole in failure scenarios. This is really a pretty common problem and can be hard to fix, depending on your hardware setup. • To lighten their database load they used the APC PHP accelerator and MCache. • You can configure PHP not to parse and compile on each load using a combination of Apache 2's worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiled, so any subsequent page loads are very fast.
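The usual way application code uses such a read-heavy master-slave setup is to route writes to the master and reads to a slave. A minimal sketch of that routing in Python (illustrative only; Digg's actual PHP data layer is not published here, and the connection objects stand in for real MySQL handles):

import random

class ReadWriteRouter:
    """Send writes to the master and spread reads across slaves -
    a good fit when, as above, 98% of accesses are reads."""
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return random.choice(self.slaves)   # the 98% read path
        return self.master                      # INSERT/UPDATE/DELETE

# router = ReadWriteRouter(master_conn, [slave1_conn, slave2_conn])
# conn = router.connection_for("SELECT * FROM stories WHERE id = 42")

One caveat the Twitter notes also hint at: replication lag means a read routed to a slave may briefly miss a just-committed write, so lag-sensitive reads sometimes need to go to the master.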
Lessons Learned • Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, tables that are transactional on the master can use MyISAM on read-only slaves. • At some point in their growth curve they were unable to grow by adding RAM, so they had to grow through architecture. • People often complain Digg is slow. This is perhaps due to their large javascript libraries rather than their backend architecture. • One way they scale is by being careful about which applications they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if that infrastructure doesn't grow along with the features. So push back until your system can handle the new features. This goes to capacity planning, something Flickr emphasizes in their scaling process. • You have to wonder: by limiting new features to match their infrastructure, might Digg lose ground to other, faster moving social bookmarking services? Perhaps if the infrastructure were more easily scaled they could add features faster, which would help them compete better? On the other hand, just adding features because you can doesn't make a lot of sense either. • The data layer is where most scaling and performance problems are to be found, and these are language specific. You'll hit them using Java, PHP, Ruby, or insert your favorite language here. An Unorthodox Approach to Database Design : The Coming of the Shard
Tue, 07/31/2007 - 18:13 — Todd Hoff • An Unorthodox Approach to Database Design : The Coming of the Shard Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In an extraordinary two-article series, Dathan Pattishall explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less than a few seconds, and can scale linearly at a low cost. What is sharding, and how has it come to be the answer to large website scaling problems? Information Sources * Unorthodox approach to database design Part 1: History * Unorthodox approach to database design Part 2: Friendster What is sharding? While working at Auction Watch, Dathan got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are called shards. The advantages are: • High availability. If one box goes down the others still operate. • Faster queries. Smaller amounts of data in each user group mean faster querying. • More write bandwidth. With no master database serializing writes you can write in parallel, which increases your write throughput. Writing is a major bottleneck for many websites. • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work. How is sharding different than traditional architectures? Sharding is different than traditional database architecture in several important ways: • Data are denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together. This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc., but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write. • Data are parallelized across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel. • Data are kept small. The larger a set of data a server handles, the harder it is to cache intelligently, because you have such a wide diversity of data being accessed. You need huge gobs of RAM that may not even be enough to cache the data when you need it.
By isolating data into smaller shards, the data you are accessing is more likely to stay in cache. Smaller sets of data are also easier to back up, restore, and manage. • Data are more highly available. Since the shards are independent, a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity, it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and with making the data more parallelized, so more work can be done on the data. You can also set up a shard to have a master-slave or dual-master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over. • It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master. Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled. Sharding cleanly and elegantly solves the problems with replication. Some Problems With Sharding Sharding isn't perfect. It does have a few problems. • Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let's say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard. On some platforms I've worked on this is a killer problem. You had to build out the data center correctly from the start, because moving data from shard to shard required a lot of downtime. Rebalancing has to be built in from the start. Google's shards automatically rebalance. For this to work, data references must go through some sort of naming service so they can be relocated. This is what Flickr does. And your references must be invalidateable so the underlying data can be moved while you are using it. • Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can't just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and then build the page. Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent. • How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc. go somewhere else? Unfortunately there are no easy answers to these questions (a small sketch of shard assignment options follows this list). • Less leverage. People have experience with traditional RDBMS tools, so there is a lot of help out there. You have books, experts, tool chains, and discussion forums when something goes wrong or you are wondering how to implement a new feature. Eclipse won't have a shard view and you won't find any automated backup and restore programs for your shard. With sharding you are on your own. • Implementing shards is not well supported.
Comments

Tue, 07/31/2007 - 22:12 — Tim (not verified)
Great post! This is probably the most interesting post I've read in a long, long time. Thanks for sharing the advantages and drawbacks of sharding... and thanks for putting together all these resources/info about scaling... it's really, really interesting.

Wed, 08/01/2007 - 20:22 — Vinit (not verified)
Thanks for this info. Helped me understand what the heck to do with all this user data coming my way!!!

Wed, 08/01/2007 - 23:17 — Ryan T Mulligan (not verified)
Intranet? I dislike how your LiveJournal link does not actually go to a LiveJournal website, or to information about their toolchain.

Thu, 08/02/2007 - 01:48 — Todd Hoff
re: intranet
I am not sure what you mean about LiveJournal. It goes to a page on this site which references two danga.com sites. Oh, I see: memcached goes to a category link which doesn't include memcached. The hover text does include the link, but I'll add it in. Good catch. Thanks.

Thu, 08/02/2007 - 15:10 — tim wee (not verified)
Question about a statement in the post: "Sharding cleanly and elegantly solves the problems with replication." Is this true? You do need to replicate still, right? You need duplication and a copy of the data that is not too stale in case one of your shards goes down? So you still need to replicate, correct?

Thu, 08/02/2007 - 15:21 — Todd Hoff
sharding and replication
> Is this true? You do need to replicate still right?

You won't have the problems with replication overhead and lag because you are writing to an appropriately sized shard rather than a single master that must replicate to all its slaves, assuming you have a lot of slaves. You will still replicate within the shard, but that will be more of a fixed, reasonable cost because the number of slaves will be small. Google replicates content 3 times, so even in that case it's more of a fixed overhead than chaining a lot of slaves together. That's my understanding at least.

Sat, 08/04/2007 - 17:23 — Kevin Burton (not verified)
We might OSS our sharding framework
We've been building out a sharding framework for use with Spinn3r. It's about to be deployed into production this week. We're very happy with the way things have moved forward and will probably be OSSing it. We aggregate a LOT of data on behalf of our customers, so we have huge data storage requirements.
Kevin

Sat, 08/04/2007 - 21:53 — Anonymous (not verified)
Lookup table
Do you still need a master lookup table? How do you know which shard has the data you need to access?

Sat, 08/04/2007 - 23:01 — Todd Hoff
re: Lookup table
I think a lookup table is the most flexible option. It allows for flexible shard assignment algorithms and you can change the mapping when you need to. Here are a few other ideas; I am sure there are more. Flickr talks about a "global ring" that is like DNS, mapping a key to a shard ID. The map is kept in memcached for half an hour or so. I bought Cal Henderson's book and it should arrive soon. If he has more details I'll do a quick write up. Users could be assigned to shards initially on a round-robin basis or through some other whizzbang algorithm. You could embed shard IDs into your assigned row IDs so the mapping is obvious from the ID. You could hash on a key to a shard bucket. You could partition based on key values. So users with names starting with A-C go to shard1, that sort of thing. MySQL has a number of different partition rules as examples.
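The assignment strategies Todd lists can be put side by side in code. A minimal sketch with an invented shard count, bucket rule, and ID layout; none of it is Flickr's or MySQL's actual implementation:

import hashlib

NUM_SHARDS = 4  # invented for illustration

def shard_by_hash(key: str) -> int:
    """Hash a key to a shard bucket (the 'hash on a key' idea above)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def shard_by_range(name: str) -> int:
    """Range partitioning: names A-C to shard 0, D-F to shard 1, and so on."""
    return min((ord(name[0].upper()) - ord("A")) // 3, NUM_SHARDS - 1)

def make_row_id(shard_id: int, sequence: int) -> int:
    """Embed the shard ID in the row ID so the mapping is obvious from the ID."""
    return (shard_id << 48) | sequence

def shard_from_row_id(row_id: int) -> int:
    return row_id >> 48

print(shard_by_hash("alice"))                     # stable bucket for this key
print(shard_by_range("Bob"))                      # 0: B falls in the A-C range
print(shard_from_row_id(make_row_id(3, 12345)))   # 3

Flickr's "global ring" then fronts whichever mapping is chosen with a cache (memcached with roughly a half-hour expiry, per the comment above), so most lookups never hit the mapping store; the hash and embedded-ID schemes trade that flexibility for zero lookup cost.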
Mon, 08/06/2007 - 02:13 — Frank (not verified)
Thank you
It's helpful to know how the big players handle their scaling issues. Thanks for sharing!

Thu, 08/09/2007 - 12:43 — Anonymous (not verified)
Sharding
Attacking sharding from the application layer is simply wrong. This functionality should be left to the DBMS. Access from the app layer would be transparent, and it would be up to the DB admin to configure the data servers correctly for sharding to automatically and transparently scale out across them. If you are implementing sharding from the app layer you are getting yourself into a very tight corner and will one day find out how stuck you are there. This is the cornerstone of improper delegation of the functionalities in a multi-tier system.

Mon, 08/13/2007 - 14:44 — Diogin (not verified)
> How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc. go somewhere else? Unfortunately there are no easy answers to these questions.

I have exactly this question to ask. I've referred to the architectures of LiveJournal and Mixi, both of which introduce shards. However, I saw a "Global Cluster" which stores meta information for the other clusters. By doing this we get an extremely heavy cluster: it must handle all the cluster_id <-> user_id mappings, and we lose the advantage of sharding, don't we? The other way, partitioning by algorithms on keys, is difficult in transition. So, could you give me some advice? Thank you very much for sharing your experiences :)

Mon, 08/13/2007 - 15:36 — Todd Hoff
> By doing this we get an extremely heavy cluster: it must handle
> all the cluster_id <-> user_id mappings

I think the idea is that because these mappings are so small they can all be cached in RAM, and thus their resolution can be very, very fast. Do you think that would be too slow?

Mon, 08/13/2007 - 17:11 — Diogin (not verified)
Yeah, you reminded me! I had some doubts about this before, when I thought the table would be terribly huge, as all the records from the other clusters are gathered in this one table and all queries must first refer to this cluster. Maybe I can use memcached clusters to cache them. Thank you :)

Tue, 08/14/2007 - 17:33 — Arnon Rotem-Gal-Oz (not verified)
Partitioning is the new standard
If you look at the architectures of all the major internet-scale sites (such as eBay, Amazon, Flickr, etc.) you'd see they've come to the same conclusions and the same patterns. I've also published an article on InfoQ discussing this topic yesterday (I only found this site now or I would have included it in the article).
Arnon

Wed, 08/22/2007 - 21:24 — Todd Hoff
More Good Info on Partitioning
Jeremy Cole and Eric Bergen have an excellent section on database partitioning starting at about page 14 of MySQL Scaling and High Availability Architectures. They talk about different partitioning models, difficulties with partitioning, HiveDB, Hibernate Shards, and server architectures. I'll definitely do a write up on this document a little later, but if you're interested, dive in now.

Fri, 09/14/2007 - 11:20 — Norman Harebottle (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of the Shard
I agree with the above post questioning the placement of partitioning logic in the application layer. Why not write the application layer against a logical model (NOT a storage model!) and then just engineer the existing data storage abstraction mechanism (the DBMS engine) such that it will handle the partitioning functionality in a parallel manner? I would be very interested to see a study done comparing the architectures of this sharding concept against a federated database design such as what is described on this site: http://www.sql-server-performance.com/tips/federated_databases_p1.aspx

Fri, 09/14/2007 - 15:22 — Todd Hoff
Re: is the logical model the correct place for partitioning?
> I would be very interested to see a study done comparing the
> architectures of this sharding concept against a federated database design

Most of us don't have access to a real, affordable federated database (parallel queries), so it's somewhat of a moot point :-) And even these haven't been able to scale at the highest levels anyway. The advantage of partitioning in the application layer is that you are not bottlenecked on the traffic cop server that must analyze SQL, redistribute work to the proper federations, and then integrate the results. You go directly to where the work needs to be done. I understand the architectural drive to push this responsibility to a logical layer, but it's hard to ignore the advantages of using client side resources to go directly to the right shard based on local context. What is the cost? A call to library code that must be kept in sync with the partitioning scheme. Is that really all that bad?

Thu, 09/20/2007 - 14:10 — Anonymous (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of the Shard
Sorry, this isn't new, except maybe to younger programmers. I've been using this approach for many years. Federated databases with horizontally partitioned data are old news. It's just not taught as a standard technique for scaling, mostly because scaling isn't taught as a core subject. (Why is that, do you suppose? Too hard to cover? Practical examples not practical in an academic setting?) The reason this is getting attention now is a perfect storm of cheap hardware, nearly free network connectivity, free database software, and beaucoup users (testers?) over the 'net.

Sat, 09/22/2007 - 13:53 — Marcus (not verified)
Useful, interesting, but not new
Great content and great site... except for the worshipful adoration of these young teams who seem to think they've each discovered America. I could go on at length about Flickr's performance troubles and extremely slow changes after their early success, Twitter's meltdowns and indefensible defense of the slowest language in wide use on the net, and old-school examples of horizontal partitioning (AKA "sharding", heh), but I'll spare you. This cotton candy sugary sweet lovin' of Web 2.0 darlings really is a bit tiresome. Big kudos though on deep coverage of the subject matter between the cultic chants. :-D

Sat, 09/22/2007 - 18:39 — Todd Hoff
Re: Useful, interesting, but not new
> Big kudos though on deep coverage of the subject matter between the cultic chants. :-D

I am actually more into root music. But thanks, I think :-)

Fri, 09/28/2007 - 06:11 — Sean Bannister (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of the Shard
Good article, very interesting to read.

Sat, 09/29/2007 - 06:46 — Ed (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of the Shard
Fanball.com has been using this technique for its football commissioner product for years. As someone else commented, it used to be called horizontal partitioning back then. Does it cost less if it's called sharding? :-)

Mon, 10/08/2007 - 03:02 — Anonymous (not verified)
Re: An Unorthodox Approach to Database Design
Why is this even news? We did something similar at my old job: split up different clients among different server stacks. Move along, nothing to see...

Sun, 10/14/2007 - 11:29 — Anonymous (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of the Shard
Could it be the fact that people are pulling this off with MySQL and BerkeleyDB that is making horizontal partitioning interesting? When you compare two solutions, one using an open source database and one using a closed source database, is one solution more inherently scalable? Well, all things being equal performance-wise, it's nice to not have to do a purchase order for the closed source software, so I would say that is why this is getting all the 'hype'. Old-school Oracle/MS SQL Server patronizing DBAs are getting schooled by non-DBAs who are setting up the *highest* data throughput architectures and not using SQL Server or Oracle. That is why this is getting high visibility. Believe it or not, many people still say MySQL/BerkeleyDB is a toy outside of some of the major tech hubs. Stories like this are what make people, especially DBAs, listen. The only recourse is 'I have done that before with xxx database'. Well, you should be the one suggesting doing it with the open source 'toy' database then, if you are so good. In my experience there are many old-school DBAs in denial that this kind of architecture is capable of outperforming their *multi-million dollar Oracle software purchase decisions*, and they don't want to admit it.

Wed, 10/24/2007 - 13:12 — Harel Malka (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of the Shard
What I'm most interested in, relating to shards, is people's thoughts and experience in migrating TO a shard approach from a single database, and moving (large amounts of) data around from shard to shard. In particular: strategies to maintain referential integrity as we're moving a user's data. As well, should you need to query data joining user A and user B, which reside on different shards, what approaches do people see as fit?
Harel
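Harel's cross-shard join question is usually answered with application-side scatter-gather, as the article's "joining data from multiple shards" section describes: query each owning shard independently and merge in the application. A minimal sketch, with an invented two-user schema standing in for real shard connections:

# Application-side "join" across shards (scatter-gather). The schema and
# shard layout here are invented for illustration.

USER_SHARDS = {
    "userA": {"users": {"userA": {"name": "Alice"}}},
    "userB": {"users": {"userB": {"name": "Bob"}}},
}

def shard_for(user_id):
    # In a real system this is the lookup/hash mapping sketched earlier.
    return USER_SHARDS[user_id]

def fetch_profiles(user_ids):
    """Query each owning shard independently, then merge in the app."""
    results = {}
    for uid in user_ids:          # the requests could be issued in parallel
        shard = shard_for(uid)
        results[uid] = shard["users"][uid]
    return results

# Build the page from the merged results; no SQL join ever crosses a shard.
print(fetch_profiles(["userA", "userB"]))

With caching and parallel requests the per-page cost of this merge is usually acceptable, which is the trade the article makes explicit.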
LiveJournal Architecture

Mon, 07/09/2007 - 16:57 — Todd Hoff • LiveJournal Architecture (608)

A fascinating and detailed story of how LiveJournal evolved their system to scale. LiveJournal was an early player in the free blog service race and faced issues from quickly adding a large number of users. Blog posts come fast and furious, which causes a lot of writes, and writes are particularly hard to scale. Understanding how LiveJournal faced their scaling problems will help any aspiring website builder.

Site: http://www.livejournal.com/

Information Sources
• LiveJournal - Behind The Scenes Scaling Storytime
• Google Video
• Tokyo Video
• 2005 version

Platform
• Linux
• MySQL
• Perl
• Memcached
• MogileFS
• Apache

What's Inside?
• Scaling from 1, 2, and 4 hosts to clusters of servers.
• Avoid single points of failure.
• Using MySQL replication only takes you so far.
• Becoming IO bound kills scaling.
• Spread out writes and reads for more parallelism.
• You can't just keep adding read slaves and scale.
• Shard storage approach, using DRBD, for maximal throughput. Allocate shards based on roles.
• Caching to improve performance with memcached. Two-level hashing into distributed RAM (see the sketch after this section).
• Perlbal for web load balancing.
• MogileFS, a distributed file system, for parallelism.
• TheSchwartz and Gearman for distributed job queuing to do more work in parallel.
• Solving persistent connection problems.

Lessons Learned
• Don't be afraid to write your own software to solve your own problems. LiveJournal has provided incredible value to the community through their efforts.
• Sites can evolve from small 1-2 machine setups to larger systems as they learn about their users and what their system really needs to do.
• Parallelization is key to scaling. Remove choke points by caching, load balancing, sharding, clustering file systems, and making use of more disk spindles.
• Replication has a cost. You can't just keep adding more and more read slaves and expect to scale.
• Low-level issues like which OS event notification mechanism to use, file system and disk interactions, threading and event models, and connection types matter at scale.
• Large sites eventually turn to a distributed queuing and scheduling mechanism to distribute large work loads across a grid.
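The "two-level hashing" bullet is worth unpacking: the client first hashes a key to pick a memcached node, and the chosen node then hashes the key again internally to locate the value in its own buckets. A minimal sketch of the first level, assuming a static node list (the node addresses are invented, and real clients also handle dead nodes and rehashing):

import hashlib

# First level: hash the key to choose which memcached node owns it, so
# every web server agrees on placement without any coordination.
MEMCACHED_NODES = [
    "10.0.0.1:11211",
    "10.0.0.2:11211",
    "10.0.0.3:11211",
]

def node_for(key: str) -> str:
    """Map a cache key to one memcached node."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return MEMCACHED_NODES[h % len(MEMCACHED_NODES)]

# The second level happens inside memcached itself: the chosen node hashes
# the key again into its own internal bucket structure to find the value.
print(node_for("user:42:profile"))

The effect is one logical cache spread across the RAM of many machines, which is what lets caching remove the database choke point named in the lessons above.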
GoogleTalk Architecture

Mon, 07/23/2007 - 22:47 — Todd Hoff • GoogleTalk Architecture (549)

Google Talk is Google's instant communications service. Interestingly, the IM messages aren't the major architectural challenge; handling user presence indications dominates the design. They also have the challenge of handling small, low-latency messages and integrating with many other systems. How do they do it?

Site: http://www.google.com/talk

Information Sources
• GoogleTalk Architecture

Platform
• Linux
• Java
• Google Stack
• Shard

What's Inside? The Stats
• Support presence and messages for millions of users.
• Handles billions of packets per day in under 100ms.
• IM is different than many other applications because the requests are small packets.
• Routing and application logic are applied per packet for sender and receiver.
• Messages must be delivered in order.
• Architecture extends to new clients and Google services.

Lessons Learned
• Measure the right thing.
- People ask how many IMs you deliver or how many active users you have. That turns out not to be the right engineering question.
- The hard part of IM is how to show correct presence to all connected users, because growth is non-linear: ConnectedUsers * BuddyListSize * OnlineStateChanges (a back-of-envelope example follows this section).
- Linear user growth can mean very non-linear server growth, which requires serving many billions of presence packets per day.
- Have a large number of friends and presence explodes. The number of IMs is not that big a deal.
• Real Life Load Tests
- Lab tests are good, but don't tell you enough.
- Did a backend launch before the real product launch.
- Simulate presence requests and going on-line and off-line for weeks and months, even if real data is not returned. It works out many of the kinks in network, failover, etc.
• Dynamic Resharding
- Divide user data or load across shards.
- Google Talk backend servers handle traffic for a subset of users.
- Make it easy to change the number of shards with zero downtime.
- Don't shard across data centers. Try and keep users local.
- Servers can be brought down and backups take over. Then you can bring up new servers, data is migrated automatically, and clients auto-detect and go to the new servers.
• Add Abstractions to Hide System Complexity
- Different systems should have little knowledge of each other, especially when separate groups are working together.
- Gmail and Orkut don't know about sharding, load-balancing, fail-over, data center architecture, or the number of servers. These can change at any time without cascading changes throughout the system.
- Abstract these complexities into a set of gateways that are discovered at runtime.
- RPC infrastructure should handle rerouting.
• Understand Semantics of Lower Level Libraries
- Everything is abstracted, but you must still have enough knowledge of how the layers work to architect your system.
- Does your RPC create TCP connections to all or some of your servers? Very different implications.
- Does the library perform health checking? This has architectural implications, as you can have separate systems failing independently.
- Which kernel operation should you use? IM requires a lot of connections, but few have any activity. Use epoll vs poll/select.
• Protect Against Operational Problems
- Smooth out all spikes in server activity graphs.
- What happens when servers restart with an empty cache?
- What happens if traffic shifts to a new data center?
- Limit cascading problems. Back off from busy servers. Don't accept work when sick.
- Isolate in emergencies. Don't infect others with your problems.
- Have intelligent retry logic policies abstracted away. Don't sit in hard 1 msec retry loops, for example.
• Any Scalable System is a Distributed System
- Add fault tolerance to every component of the system. Everything fails.
- Add the ability to profile live servers without impacting the server. This allows continual improvement.
- Collect metrics from servers for monitoring. Log everything about your system so you can see patterns in causes and effects.
- Log end-to-end so you can reconstruct an entire operation from beginning to end across all machines.
• Software Development Strategies
- Make sure binaries are both backward and forward compatible so you can have old clients work with new code.
- Build an experimentation framework to try new features.
- Give engineers access to production machines. This gives end-to-end ownership, and is very different from many companies, which have completely separate ops teams in their data centers; often developers can't touch production machines.
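To see why the presence formula above dominates the engineering, here is a back-of-envelope sketch; every number in it is invented for illustration, not a Google figure:

# Back-of-envelope for ConnectedUsers * BuddyListSize * OnlineStateChanges.
# All numbers below are invented to illustrate the growth behavior.

def presence_packets_per_day(connected_users, buddy_list_size,
                             state_changes_per_user_per_day):
    # Every state change must be fanned out to the user's whole buddy list.
    return connected_users * buddy_list_size * state_changes_per_user_per_day

small = presence_packets_per_day(1_000_000, 20, 6)    # 120 million/day
bigger = presence_packets_per_day(2_000_000, 40, 6)   # 480 million/day

# Doubling users while buddy lists also double quadruples presence traffic:
print(small, bigger, bigger / small)   # 120000000 480000000 4.0

This is the non-linearity the writeup warns about: buddy-list size tends to grow along with the user base, so "twice the users" can easily mean four times the presence packets, even though the IM message volume itself grows only linearly.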