网络运维词汇汇总


    本篇之所以起这个名字,是因为其中收录的是我在一家网络公司工作时遇到的一些相关词汇,仅供参考。

1、关系型数据库服务 RDS:

         关系型数据库服务(Relational Database Service,简称RDS)是一种稳定可靠、可弹性伸缩的在线数据库服务。RDS采用即开即用方式,兼容MySQL、SQL Server两种关系型数据库,并提供数据库在线扩容、备份回滚、性能监测及分析功能。RDS与云服务器搭配使用可使I/O性能倍增,内网互通避免网络瓶颈。

 

2、开放存储服务 OSS:

         开放存储服务(Open Storage Service,简称OSS)是支持任意数据类型的存储服务,支持任意时间、地点的数据上传和下载,OSS中每个存储对象(object)由名称、内容、描述三部分组成。

 

3、内容分发网络 CDN:

         内容分发网络(Content Delivery Network,简称CDN)将加速内容分发至离用户最近的节点,缩短用户查看对象的延迟,提高用户访问网站的响应速度与网站的可用性,解决网络带宽小、用户访问量大、网点分布不均等问题。

 

4、负载均衡 SLB:

         负载均衡(Server Load Balancer,简称SLB)是对多台云服务器进行流量分发的负载均衡服务。SLB可以通过流量分发扩展应用系统对外的服务能力,通过消除单点故障提升应用系统的可用性。

 

5、Django

Django是一个开放源代码的Web应用框架,由Python写成。采用了MVC的软件设计模式,即模型M,视图V和控制器C。它最初是被开发来用于管理劳伦斯出版集团旗下的一些以新闻内容为主的网站的。并于2005年7月在BSD许可证下发布。这套框架是以比利时的吉普赛爵士吉他手Django Reinhardt来命名的。

 

Django的主要目标是使得开发复杂的、数据库驱动的网站变得简单。Django注重组件的重用性和“可插拔性”,敏捷开发和DRY法则(Don’t Repeat Yourself)。在Django中Python被普遍使用,甚至包括配置文件和数据模型。

 

Django于2008年6月17日正式成立基金会。

 

Web应用框架(Web application framework)是一种计算机软件框架,用来支持动态网站、网络应用程序及网络服务的开发。这种框架有助于减轻网页开发时共通性活动的工作负荷,例如许多框架提供数据库访问接口、标准样板以及会话管理等,可提升代码的可再用性。

 

DRC(Data Replication Center):数据复制中心,由异地容灾的需求而来。

DAM(Database Activity Monitor):数据库活动监控,从安全角度对数据库的异常活动进行监测和审计。

6、异地备份

异地备份,是指通过互联网TCP/IP协议,由备特佳容灾备份系统将本地的数据实时备份到异地服务器中。可以利用异地备份的数据进行远程恢复,也可以在异地进行数据回退。如果想做业务接管,需要专线连接,并且只有在同一网段内才能实现业务的接管。

在建立容灾备份系统时会涉及到多种技术,如:SAN或NAS技术、远程镜像技术、虚拟存储、基于IP的SAN的互连技术、快照技术等。

 

许多存储厂商纷纷推出基于SAN的异地容灾软、硬件产品,希望能够为用户提供整套以SAN网络环境和异地实时备份为基础的,高效、可靠的异地容灾解决方案,并且能够为用户提供支持各种操作系统平台、数据库应用和网络应用的系统容灾服务。

为了确保基于存储区域网络(Storage Area Network,SAN)的异地容灾系统在主系统发生意外灾难后实现同城异地的数据容灾,通常采用SAN作为数据存储模式,通过光纤通道将生产数据中心和备份数据中心连接起来,使用跨阵列磁盘镜像技术实现异地数据中心之间的备份和恢复。容灾系统是计算机系统安全的最后保障,这种方案适用于大多数中小型企业的数据容灾需求,同时还为企业将来实现更高级别的系统容灾做准备。

NAS(Network Attached Storage,网络附属存储)是一种将分布、独立的数据整合为大型、集中化管理的数据中心,以便于对不同主机和应用服务器进行访问的技术。按字面简单说就是连接在网络上、具备资料存储功能的装置,因此也称为“网络存储器”。它是一种专用数据存储服务器。它以数据为中心,将存储设备与服务器彻底分离,集中管理数据,从而释放带宽、提高性能、降低总拥有成本、保护投资。其成本远远低于使用服务器存储,而效率却远远高于后者。

DAS即直连方式存储,英文全称是Direct Attached Storage,中文翻译成“直接附加存储”。顾名思义,在这种方式中,存储设备是通过电缆(通常是SCSI接口电缆)直接连接到服务器上的,I/O(输入/输出)请求直接发送到存储设备。DAS也可称为SAS(Server-Attached Storage,服务器附加存储)。它依赖于服务器,其本身是硬件的堆叠,不带有任何存储操作系统。

 

7、Nginx

Nginx (“engine x”) 是一个高性能的 HTTP 和反向代理服务器,也是一个IMAP/POP3/SMTP 代理服务器。 Nginx 是由 Igor Sysoev 为俄罗斯访问量第二的 Rambler.ru 站点开发的,第一个公开版本0.1.0发布于2004年10月4日。其将源代码以类BSD许可证的形式发布,因它的稳定性、丰富的功能集、示例配置文件和低系统资源的消耗而闻名。2011年6月1日,nginx 1.0.4发布。

Nginx作为负载均衡服务器:Nginx 既可以在内部直接支持Rails 和 PHP 程序对外进行服务,也可以支持作为 HTTP代理服务器对外进行服务。Nginx采用C进行编写,不论是系统资源开销还是CPU使用效率都比Perlbal 要好很多。

8、反向代理

反向代理(Reverse Proxy)方式是指以代理服务器来接受internet上的连接请求,然后将请求转发给内部网络上的服务器,并将从服务器上得到的结果返回给internet上请求连接的客户端,此时代理服务器对外就表现为一个服务器。

9、Hadoop

一个分布式系统基础架构,由Apache基金会所开发。

用户可以在不了解分布式底层细节的情况下,开发分布式程序。充分利用集群的威力高速运算和存储。

Hadoop实现了一个分布式文件系统(Hadoop Distributed File System),简称HDFS。HDFS有高容错性的特点,并且设计用来部署在低廉的(low-cost)硬件上;而且它提供高传输率(high throughput)来访问应用程序的数据,适合那些有着超大数据集(large data set)的应用程序。HDFS放宽了(relax)POSIX的要求,可以以流的形式访问(streaming access)文件系统中的数据。

10、PXE

PXE(preboot execute environment,预启动执行环境)是由Intel公司开发的一种网络引导技术,工作于Client/Server的网络模式,支持工作站通过网络从远端服务器下载映像,并由此支持通过网络启动操作系统。在启动过程中,终端要求服务器分配IP地址,再用TFTP(trivial file transfer protocol)或MTFTP(multicast trivial file transfer protocol)协议下载一个启动软件包到本机内存中执行,由这个启动软件包完成终端基本软件设置,从而引导预先安装在服务器中的终端操作系统。PXE可以引导多种操作系统,如:Windows95/98/2000/windows2003/windows2008/winXP/win7/win8,linux等。
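作为参考,下面是一个用 dnsmasq 同时提供 DHCP 与 TFTP 的最小 PXE 引导示例(仅为示意:网段、网卡名和路径都是假设值,且需事先把 pxelinux.0 等引导文件放到 TFTP 根目录):

# 在 eth0 上同时提供 DHCP 和 TFTP,并告诉客户端从 pxelinux.0 引导
dnsmasq --interface=eth0 \
        --dhcp-range=192.168.1.100,192.168.1.200,12h \
        --dhcp-boot=pxelinux.0 \
        --enable-tftp --tftp-root=/srv/tftp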

11、Linux I/O调度方法

操作系统的调度有

CPU调度   CPU scheduler

IO调度      IO scheduler

 IO调度器的总体目标是希望让磁头能够总是往一个方向移动,移动到底了再往反方向走,这恰恰就是现实生活中的电梯模型,所以IO调度器也被叫做电梯(elevator),而相应的算法也就被叫做电梯算法。Linux中IO调度的电梯算法有好几种:

as(Anticipatory),预期的

cfq(Complete Fairness Queueing),

deadline,

noop(No Operation).

 具体使用哪种算法我们可以在启动的时候通过内核参数elevator来指定.

一)I/O调度的4种算法

 1)CFQ(完全公平排队I/O调度程序)

特点:
在最新的内核版本和发行版中,都选择CFQ做为默认的I/O调度器,对于通用的服务器也是最好的选择. CFQ试图均匀地分布对I/O带宽的访问,避免进程被饿死并实现较低的延迟,是deadline和as调度器的折中. CFQ对于多媒体应用(video,audio)和桌面系统是最好的选择.
CFQ赋予I/O请求一个优先级,而I/O优先级请求独立于进程优先级,高优先级的进程的读写不能自动地继承高的I/O优先级.

工作原理:
CFQ为每个进程/线程,单独创建一个队列来管理该进程所产生的请求,也就是说每个进程一个队列,各队列之间的调度使用时间片来调度, 以此来保证每个进程都能被很好的分配到I/O带宽.I/O调度器每次执行一个进程的4次请求.

2)NOOP(电梯式调度程序)

特点:
在Linux2.4或更早的版本中,调度程序只有这一种I/O调度算法。NOOP实现了一个简单的FIFO队列,它像电梯的工作方法一样对I/O请求进行组织,当有一个新的请求到来时,它将请求合并到最近的请求之后,以此来保证请求同一介质。NOOP倾向饿死读而利于写。NOOP对于闪存设备、RAM、嵌入式系统是最好的选择。电梯算法饿死读请求的解释:因为写请求比读请求更容易。写请求通过文件系统cache,不需要等一次写完成,就可以开始下一次写操作,写请求通过合并,堆积到I/O队列中;读请求需要等到它前面所有的读操作完成,才能进行下一次读操作,在读操作之间有几毫秒时间,而写请求在这之间就到来,饿死了后面的读请求。

 3)Deadline(截止时间调度程序)

特点:
通过时间以及硬盘区域进行分类,这个分类和合并要求类似于noop的调度程序. Deadline确保了在一个截止时间内服务请求,这个截止时间是可调整的,而默认读期限短于写期限.这样就防止了写操作因为不能被读取而饿死的现象. Deadline对数据库环境(ORACLE RAC,MYSQL等)是最好的选择.

4)AS(预料I/O调度程序)

特点:
本质上与Deadline一样,但在最后一次读操作后,要等待6ms,才能继续对其它I/O请求进行调度。它可以从应用程序中预订一个新的读请求,改进读操作的执行,但以一些写操作为代价。
它会在每6ms内插入新的I/O操作,并会将一些小写入流合并成一个大写入流,用写入延时换取最大的写入吞吐量。AS适合于写入较多的环境,比如文件服务器;AS对数据库环境表现很差。

  查看当前系统支持的IO调度算法
dmesg | grep -i scheduler

[root@localhost ~]# dmesg | grep -i scheduler
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)

查看当前系统的I/O调度方法:

cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

临时更改I/O调度方法:
例如:想更改到noop电梯调度算法:
echo noop > /sys/block/sda/queue/scheduler

想永久的更改I/O调度方法:
修改内核引导参数,加入elevator=调度程序名
vi /boot/grub/menu.lst
更改到如下内容:
kernel /boot/vmlinuz-2.6.18-8.el5 ro root=LABEL=/ elevator=deadline rhgb quiet

重启之后,查看调度方法:
cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq
已经是deadline了

 二 )I/O调度程序的测试

本次测试分为只读,只写,读写同时进行.
分别对单个文件600MB,每次读写2M,共读写300次.

 1)测试磁盘读:

[root@test1 tmp]# echo deadline > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.81189 seconds, 92.4 MB/s

real 0m6.833s
user 0m0.001s
sys 0m4.556s
[root@test1 tmp]# echo noop > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.61902 seconds, 95.1 MB/s

real 0m6.645s
user 0m0.002s
sys 0m4.540s
[root@test1 tmp]# echo anticipatory > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 8.00389 seconds, 78.6 MB/s

real 0m8.021s
user 0m0.002s
sys 0m4.586s
[root@test1 tmp]# echo cfq > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 29.8 seconds, 21.1 MB/s

real 0m29.826s
user 0m0.002s
sys 0m28.606s
结果:
第一 noop:用了6.61902秒,速度为95.1MB/s
第二 deadline:用了6.81189秒,速度为92.4MB/s
第三 anticipatory:用了8.00389秒,速度为78.6MB/s
第四 cfq:用了29.8秒,速度为21.1MB/s

2)测试写磁盘:

[root@test1 tmp]# echo cfq > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.93058 seconds, 90.8 MB/s

real 0m7.002s
user 0m0.001s
sys 0m3.525s
[root@test1 tmp]# echo anticipatory > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.79441 seconds, 92.6 MB/s

real 0m6.964s
user 0m0.003s
sys 0m3.489s
[root@test1 tmp]# echo noop > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 9.49418 seconds, 66.3 MB/s

real 0m9.855s
user 0m0.002s
sys 0m4.075s
[root@test1 tmp]# echo deadline > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.84128 seconds, 92.0 MB/s

real 0m6.937s
user 0m0.002s
sys 0m3.447s

测试结果:
第一 anticipatory,用了6.79441秒,速度为92.6MB/s
第二 deadline,用了6.84128秒,速度为92.0MB/s
第三 cfq,用了6.93058秒,速度为90.8MB/s
第四 noop,用了9.49418秒,速度为66.3MB/s

3)测试同时读/写

 

[root@test1 tmp]# echo deadline > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 15.1331 seconds, 41.6 MB/s
[root@test1 tmp]# echo cfq > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 36.9544 seconds, 17.0 MB/s
[root@test1 tmp]# echo anticipatory > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 23.3617 seconds, 26.9 MB/s
[root@test1 tmp]# echo noop > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 17.508 seconds, 35.9 MB/s

测试结果:
第一 deadline,用了15.1331秒,速度为41.6MB/s
第二 noop,用了17.508秒,速度为35.9MB/s
第三 anticipatory,用了23.3617秒,速度为26.9MB/s
第四 cfq,用了36.9544秒,速度为17.0MB/s

 

三)ionice

 ionice可以更改任务的类型和优先级,不过只有cfq调度程序可以用ionice.

有三个例子说明ionice的功能:

 采用cfq的实时调度,优先级为7
ionice -c1 -n7 time dd if=/dev/sda1 of=/tmp/test bs=2M count=300 &

采用缺省的磁盘I/O调度,优先级为3
ionice -c2 -n3 time dd if=/dev/sda1 of=/tmp/test bs=2M count=300 &

采用空闲的磁盘调度,优先级为0
ionice -c3 -n0 time dd if=/dev/sda1 of=/tmp/test bs=2M count=300 &

ionice的三种调度方法,实时调度最高,其次是缺省的I/O调度,最后是空闲的磁盘调度.
ionice的磁盘调度优先级有8种,最高是0,最低是7.

注意,磁盘调度的优先级与进程nice的优先级没有关系.
一个是针对进程I/O的优先级,一个是针对进程CPU的优先级.
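除了在启动命令时指定,ionice 也可以查看或调整已在运行的进程的 I/O 优先级(进程名和 PID 仅为示例):

ionice -p $(pidof mysqld)      # 查看某个正在运行进程当前的 I/O 调度类别与优先级(以 mysqld 为例)
ionice -c2 -n0 -p 1234         # 把 PID 1234 调整为 best-effort 类别、最高优先级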

 

Anticipatory I/O scheduler        适用于大多数环境,但不太适合数据库应用

Deadline I/O scheduler            通常与Anticipatory相当,但更简洁小巧,更适合于数据库应用

CFQ I/O scheduler                 为所有进程分配等量的带宽,适合于桌面多任务及多媒体应用,是默认的I/O调度器


 The CFQ scheduler has the following tunable parameters:

/sys/block/<device>/queue/iosched/slice_idle

When a task has no more I/O to submit in its time slice, the I/O scheduler waits for a while before scheduling the next thread to improve locality of I/O. For media where locality does not play a big role (SSDs, SANs with lots of disks), setting /sys/block/<device>/queue/iosched/slice_idle to 0 can improve the throughput considerably.

/sys/block/<device>/queue/iosched/quantum

This option limits the maximum number of requests that are being processed by the device at once. The default value is 4. For a storage with several disks, this setting can unnecessarily limit parallel processing of requests. Therefore, increasing the value can improve performance, although this can cause the latency of some I/O to increase due to more requests being buffered inside the storage. When changing this value, you can also consider tuning /sys/block/<device>/queue/iosched/slice_async_rq (the default value is 2), which limits the maximum number of asynchronous requests (usually write requests) that are submitted in one time slice.

/sys/block/<device>/queue/iosched/low_latency

For workloads where the latency of I/O is crucial, setting /sys/block/<device>/queue/iosched/low_latency to 1 can help.
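上面这些 CFQ 参数都可以直接通过 sysfs 读写,下面是一个简单示例(sda 仅为示例设备,且要求当前调度器为 cfq,low_latency 需内核支持):

cat /sys/block/sda/queue/iosched/slice_idle       # 查看当前值
echo 0 > /sys/block/sda/queue/iosched/slice_idle  # SSD/多盘SAN场景可设为0以提高吞吐
echo 8 > /sys/block/sda/queue/iosched/quantum     # 提高一次可下发给设备的请求数
echo 1 > /sys/block/sda/queue/iosched/low_latency # 对延迟敏感的负载可打开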

DEADLINE

DEADLINE is a latency-oriented I/O scheduler. Each I/O request has got a deadline assigned. Usually, requests are stored in queues (read and write) sorted by sector numbers. The DEADLINE algorithm maintains two additional queues (read and write) where the requests are sorted by deadline. As long as no request has timed out, the “sector” queue is used. If timeouts occur, requests from the “deadline” queue are served until there are no more expired requests. Generally, the algorithm prefers reads over writes.

This scheduler can provide a superior throughput over the CFQ I/O scheduler in cases where several threads read and write and fairness is not an issue. For example, for several parallel readers from a SAN and for databases (especially when using “TCQ” disks). The DEADLINE scheduler has the following tunable parameters:

/sys/block/<device>/queue/iosched/writes_starved

Controls how many reads can be sent to disk before it is possible to send writes. A value of 3 means that three read operations are carried out for one write operation.

/sys/block/<device>/queue/iosched/read_expire

Sets the deadline (current time plus the read_expire value) for read operations in milliseconds. The default is 500.

/sys/block/<device>/queue/iosched/write_expire

Sets the deadline (current time plus the write_expire value) for write operations in milliseconds. The default is 5000.

 5.x

[root@logsys0 data]# wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

 6.x

[root@logsys0 data]# wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

[root@logsys0 data]# rpm -ivh epel-release-6-8.noarch.rpm

3. 安装syslog-ng

[root@logsys0 data]# yum --enablerepo=epel install syslog-ng eventlog syslog-ng-libdbi

设置变更: vim /etc/syslog-ng/syslog-ng.conf

[root@logsys0 data]# chkconfig rsyslog off; chkconfig syslog-ng on

[root@logsys0 data]# service rsyslog stop; service syslog-ng start

关闭系统日志记录器:                                      [确定]

启动 syslog-ng:                                          [确定]

#重新加载配置:service syslog-ng reload

安装成功

4. 开启防火墙80和514端口

vi /etc/sysconfig/iptables

添加两条规则

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT

-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 514 -j ACCEPT

配置文件如下:

# Firewall configuration written bysystem-config-firewall

# Manual customization of this file is notrecommended.

*filter

:INPUT ACCEPT [0:0]

:FORWARD ACCEPT [0:0]

:OUTPUT ACCEPT [0:0]

:RH-Firewall-1-INPUT - [0:0]

-A INPUT -j RH-Firewall-1-INPUT

-A FORWARD -j RH-Firewall-1-INPUT

-A RH-Firewall-1-INPUT -i lo -j ACCEPT

-A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT

 

-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

-A INPUT -p icmp -j ACCEPT

-A INPUT -i lo -j ACCEPT

-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT

-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 514 -j ACCEPT

-A INPUT -j REJECT --reject-with icmp-host-prohibited

-A FORWARD -j REJECT --reject-with icmp-host-prohibited

COMMIT

 

[root@logsys0 data]# /etc/init.d/iptables restart

iptables:清除防火墙规则:                                [确定]

iptables:将链设置为政策 ACCEPT:filter                    [确定]

iptables:正在卸载模块:                                  [确定]

iptables:应用防火墙规则:                                [确定]

12、OLAP 即 联机分析处理

简写为OLAP,随着数据库技术的发展和应用,数据库存储的数据量从20世纪80年代的兆(M)字节及千兆(G)字节过渡到现在的兆兆(T)字节和千兆兆(P)字节,同时,用户的查询需求也越来越复杂,涉及的已不仅是查询或操纵一张关系表中的一条或几条记录,而且要对多张表中千万条记录的数据进行数据分析和信息综合,关系数据库系统已不能全部满足这一要求。在国外,不少软件厂商采取了发展其前端产品来弥补关系数据库管理系统支持的不足,力图统一分散的公共应用逻辑,在短时间内响应非数据处理专业人员的复杂查询要求。

联机分析处理(OLAP)系统是数据仓库系统最主要的应用,专门设计用于支持复杂的分析操作,侧重对决策人员和高层管理人员的决策支持,可以根据分析人员的要求快速、灵活地进行大数据量的复杂查询处理,并且以一种直观而易懂的形式将查询结果提供给决策人员,以便他们准确掌握企业(公司)的经营状况,了解对象的需求,制定正确的方案。

 

13、磁盘

整个磁盘盘上头好像有多个同心圆绘制出的饼图,而由圆心以放射状的方式分割出磁盘的最小储存单位,那就是扇区(Sector)。在物理组成方面,每个扇区大小为512Bytes,这个值是不会改变的。而扇区组成一个圆就成为磁道(track),如果是在多碟的硬盘上面,在所有磁盘盘上面的同一个磁道可以组成一个磁柱(Cylinder),磁柱也是一般我们分割硬盘时的最小单位了!

在计算整个硬盘的储存量时,简单的计算公式就是:『header数量 * 每个header负责的磁柱数量 * 每个磁柱所含有的扇区数量 * 扇区的容量』,单位换算为『header * cylinder/header * sector/cylinder * 512bytes/sector』
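举个例子,按这个公式粗略验算一块常见的500GB磁盘(255个磁头、60801个磁柱、每磁道63个扇区,这组几何数值仅为示例):

echo "255 * 60801 * 63 * 512" | bc
# 输出 500105249280 字节,约 500GB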

 

装置                          装置在Linux内的文件名

IDE硬盘机                     /dev/hd[a-d]

SCSI/SATA/USB硬盘机           /dev/sd[a-p]

USB快闪碟                     /dev/sd[a-p](与SATA相同)

整颗磁盘的第一个扇区特别的重要,因为他记录了整颗磁盘的重要信息! 磁盘的第一个扇区主要记录了两个重要信息,分别是:

 主要启动记录区(Master Boot Record, MBR):可以安装开机管理程序的地方,有446 bytes

 分割表(partition table):记录整颗硬盘分割的状态,有64 bytes
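因为 MBR 整体只有第一个扇区的 512 bytes,可以直接用 dd 备份和还原(sda 仅为示例设备,操作需格外小心):

dd if=/dev/sda of=/root/mbr.bak bs=512 count=1     # 备份整个 MBR(含分割表)
dd if=/root/mbr.bak of=/dev/sda bs=446 count=1     # 只还原开机管理程序,不覆盖分割表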

 

14、磁盘分区

鸟哥p82-

/dev/sdb是SSD设备,/dev/sda是传统的磁盘设备,加载了Flashcache之后,会将这两个设备虚拟化为一个带有缓存的块设备/dev/mapper/cachedev。

也就是说,Flashcache把普通的SAS盘(/dev/sda)和一个高速的SSD(/dev/sdb)虚拟成一个带缓存的块设备(/dev/mapper/cachedev)。后续还会有更多关于Flashcache的文章出现。

 

所谓的『挂载』就是利用一个目录当成进入点,将磁盘分区槽的数据放置在该目录下;也就是说,进入该目录就可以读取该分割槽的意思。这个动作我们称为『挂载』,那个进入点的目录我们称为『挂载点』。由于整个Linux系统最重要的是根目录,因此根目录一定需要挂载到某个分割槽。至于其他的目录则可依用户自己的需求挂载到不同的分割槽。

 

tmpfs是一种基于内存的文件系统,它和虚拟磁盘ramdisk比较类似,但不完全相同。和ramdisk一样,tmpfs可以使用RAM,但它也可以使用swap分区来存储。而且传统的ramdisk是个块设备,要用mkfs来格式化它,才能真正地使用它;而tmpfs是一个文件系统,并不是块设备,只要挂载它,就可以使用了。tmpfs是最好的基于RAM的文件系统。
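下面是一个挂载 tmpfs 的最小示例(挂载点和大小都是假设值):

mkdir -p /mnt/ramcache
mount -t tmpfs -o size=512M tmpfs /mnt/ramcache
df -h /mnt/ramcache        # 可以看到一个 512M 的基于内存的文件系统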

15、NAT服务器

NAT英文全称是“Network Address Translation”,中文意思是“网络地址转换”,它是一个IETF(Internet Engineering Task Force, Internet工程任务组)标准,允许一个整体机构以一个公用IP(Internet Protocol)地址出现在Internet上。顾名思义,它是一种把内部私有网络地址(IP地址)翻译成合法网络IP地址的技术。
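在 Linux 上,最常见的 NAT(源地址转换)做法是 iptables 的 MASQUERADE,大致如下(内网网段与出口网卡均为假设):

echo 1 > /proc/sys/net/ipv4/ip_forward                                   # 打开内核转发
iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE   # 内网经 eth0 出去时做源地址转换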

Web(WWW服务器)

CentOS使用的是Apache这套软件来达成WWW网站的功能。在WWW服务器上面,如果你还有提供数据库系统的话,那么CPU的等级就不能太低,而最重要的则是RAM了!要增加WWW服务器的效能,通常提升RAM是一个不错的考虑。

16、Proxy(代理服务器)

这也是常常会安装的一个服务器软件,尤其像中小学校等带宽较不足的环境下,Proxy将可有效地解决带宽不足的问题!当然,你也可以在家里内部安装一个Proxy喔!但是,这个服务器的硬件要求可以说是相对而言最高的,他不但需要较强有力的CPU来运作,对于硬盘的速度与容量要求也很高!自然,既然提供了网络服务,网络卡则是重要的一环!

17、/usr

很多读者都会误会/usr为user的缩写,其实usr是Unix Software Resource的缩写,也就是『Unix操作系统软件资源』所放置的目录,而不是用户的数据啦!这点要注意。 FHS建议所有软件开发者,应该将他们的数据合理地分别放置到这个目录下的次目录,而不要自行建立该软件自己独立的目录。

18、/var

如果/usr是安装时会占用较大硬盘容量的目录,那么/var就是在系统运作后才会渐渐占用硬盘容量的目录。因为/var目录主要针对常态性变动的档案,包括快取(cache)、登录档(log file)以及某些软件运作所产生的档案,包括程序档案(lock file, run file),或者例如MySQL数据库的档案等等。常见的次目录有:

19、tac (反向列示)

cat 与 tac ,有没有发现呀!对啦!tac 刚好是将 cat 反写过来,所以他的功能就跟 cat 相反啦,

20、less

less 的用法比起 more 又更加的有弹性,怎么说呢?在 more 的时候,我们并没有办法向前面翻,只能往后面看,但若使用了 less 时,呵呵!就可以使用 [pageup] [pagedown] 等按键的功能来往前往后翻看文件,你瞧,是不是更容易用来观看一个档案的内容了呢!

21、Set UID

 
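简单来说,Set UID(SUID)是一种特殊权限位:可执行文件被设置了SUID之后,任何用户执行它时都会暂时以该文件拥有者的身份运行,典型例子是 /usr/bin/passwd。常用操作如下(myprog 仅为示例程序):

ls -l /usr/bin/passwd                    # 权限显示为 -rwsr-xr-x,其中 s 就是 SUID 位
chmod u+s /usr/local/bin/myprog          # 给自己的程序加上 SUID(需谨慎)
find / -perm -4000 -type f 2>/dev/null   # 找出系统中所有带 SUID 位的程序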

22、LVM

LVM是 Logical Volume Manager(逻辑卷管理)的简写,它是Linux环境下对磁盘分区进行管理的一种机制,由Heinz Mauelshagen在Linux 2.4内核上实现,目前最新版本为:稳定版1.0.5,开发版 1.1.0-rc2,以及LVM2开发版

LVM是逻辑盘卷管理(Logical Volume Manager)的简称,它是Linux环境下对磁盘分区进行管理的一种机制。LVM是建立在硬盘和分区之上的一个逻辑层,用来提高磁盘分区管理的灵活性。通过LVM系统管理员可以轻松管理磁盘分区,如:将若干个磁盘分区连接为一个整块的卷组(volume group),形成一个存储池。管理员可以在卷组上随意创建逻辑卷(logical volumes),并进一步在逻辑卷上创建文件系统。管理员通过LVM可以方便地调整存储卷组的大小,并且可以对磁盘存储按照组的方式进行命名、管理和分配,例如按照使用用途进行定义:“development”和“sales”,而不是使用物理磁盘名“sda”和“sdb”。而且当系统添加了新的磁盘,通过LVM管理员就不必将磁盘的文件移动到新的磁盘上以充分利用新的存储空间,而是直接扩展文件系统跨越磁盘即可。

LVM是在磁盘分区和文件系统之间添加的一个逻辑层,来为文件系统屏蔽下层磁盘分区布局,提供一个抽象的盘卷,在盘卷上建立文件系统。首先我们讨论以下几个LVM术语:

*物理存储介质(The physical media)

这里指系统的存储设备:硬盘,是存储系统最低层的存储单元。

*物理卷(physical volume,PV)

物理卷就是指硬盘分区或从逻辑上与磁盘分区具有同样功能的设备(如RAID),是LVM的基本存储逻辑块,但和基本的物理存储介质(如分区、磁盘等)比较,却包含有与LVM相关的管理参数。

*卷组(Volume Group,VG)

LVM卷组类似于非LVM系统中的物理硬盘,其由物理卷组成。可以在卷组上创建一个或多个“LVM分区”(逻辑卷),LVM卷组由一个或多个物理卷组成。

*逻辑卷(logical volume,LV)

LVM的逻辑卷类似于非LVM系统中的硬盘分区,在逻辑卷之上可以建立文件系统(比如/home或者/usr等)。

*PE(physical extent,PE)

每一个物理卷被划分为称为PE(Physical Extents)的基本单元,具有唯一编号的PE是可以被LVM寻址的最小单元。PE的大小是可配置的,默认为4MB。

*LE(logical extent,LE)

逻辑卷也被划分为被称为LE(Logical Extents)的可被寻址的基本单位。在同一个卷组中,LE的大小和PE是相同的,并且一一对应。

首先,物理卷(PV)由大小等同的基本单元PE组成;一个卷组由一个或多个物理卷组成。

PE和LE有着一一对应的关系。逻辑卷建立在卷组上,逻辑卷就相当于非LVM系统的磁盘分区,可以在其上创建文件系统。

也就是说,磁盘分区、卷组、逻辑卷和文件系统是按这样的层次逐级构建起来的。

和非LVM系统将包含分区信息的元数据保存在位于分区的起始位置的分区表中一样,逻辑卷以及卷组相关的元数据也是保存在位于物理卷起始处的VGDA(卷组描述符区域)中。VGDA包括以下内容:PV描述符、VG描述符、LV描述符、和一些PE描述符。

系统启动LVM时激活VG,并将VGDA加载至内存,来识别LV的实际物理存储位置。当系统进行I/O操作时,就会根据VGDA建立的映射机制来访问实际的物理位置
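把上面这些概念串起来,大致的操作流程如下(设备名、卷组名、逻辑卷名和大小均为假设):

pvcreate /dev/sdb1 /dev/sdc1              # 把分区初始化为物理卷 PV
vgcreate vg_data /dev/sdb1 /dev/sdc1      # 用 PV 组成卷组 VG
lvcreate -L 20G -n lv_home vg_data        # 在 VG 上划出逻辑卷 LV
mkfs.ext4 /dev/vg_data/lv_home            # 在 LV 上建立文件系统并挂载
mount /dev/vg_data/lv_home /home
lvextend -L +10G /dev/vg_data/lv_home     # 之后可以在线扩容
resize2fs /dev/vg_data/lv_home            # 再扩展文件系统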

 

 

每个 inode 与 block 都有编号,至于这三个数据的意义可以简略说明如下:

 superblock:记录此filesystem 的整体信息,包括inode/block的总量、使用量、剩余量,以及文件系统的格式与相关信息等;

 inode:记录档案的属性,一个档案占用一个inode,同时记录此档案的数据所在的 block 号码;

 block:实际记录档案的内容,若档案太大时,会占用多个 block 。
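可以用下面的命令实际观察这三类数据(以 ext 系列文件系统、/dev/sda1 为例,均为假设的设备与路径):

dumpe2fs -h /dev/sda1 | head -20     # superblock 信息:inode/block 总量、剩余量、block 大小等
ls -li /etc/passwd                   # 第一栏就是该档案占用的 inode 号码
df -i                                # 各文件系统 inode 的使用情况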

 

23、apparmor

AppArmor是一个高效和易于使用的Linux系统安全应用程序。AppArmor对操作系统和应用程序所受到的威胁进行从内到外的保护,甚至包括未被发现的0day漏洞和未知的应用程序漏洞所导致的攻击。AppArmor安全策略可以完全定义个别应用程序可以访问的系统资源与各自的特权。AppArmor包含大量的默认策略,并将先进的静态分析和基于学习的工具结合起来,使得即使是非常复杂的应用也能在很短的时间内成功套用安全策略。
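如果系统装有 apparmor-utils,可以用下面的命令查看和调整 profile(示例中的 profile 路径仅为假设):

aa-status                                      # 查看已加载的 profile,以及处于 enforce/complain 模式的进程
aa-complain /etc/apparmor.d/usr.sbin.nginx     # 把某个 profile 切换到 complain(学习)模式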

 

https://linuxcontainers.org/news/

http://www.ibm.com/developerworks/cn/linux/l-lxc-containers/

24、PCI

我们先来看一个例子,我的电脑装有1G的RAM,1G以后的物理内存地址空间都是外部设备IO在系统内存地址空间上的映射。 /proc/iomem描述了系统中所有的设备I/O在内存地址空间上的映射。我们来看地址从1G开始的第一个设备在/proc/iomem中是如何描述的:

  40000000-400003ff :0000:00:1f.1

  这是一个PCI设备,40000000-400003ff是它所映射的内存地址空间,占据了内存地址空间的1024 bytes的位置,而 0000:00:1f.1则是一个PCI外设的地址,它以冒号和点号分隔为4个部分,第一个16位表示域,第二个8位表示一个总线编号,第三个5位表示一个设备号,最后是3位,表示功能号。

  因为PCI规范允许单个系统拥有高达256个总线,所以总线编号是8位。但对于大型系统而言,这是不够的,所以,引入了域的概念,每个 PCI域可以拥有最多256个总线,每个总线上可支持32个设备,所以设备号是5位,而每个设备上最多可有8种功能,所以功能号是3位。由此,我们可以得出上述的PCI设备的地址是0号域0号总线上的31号设备上的1号功能。
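可以用下面的命令自己验证这种编址方式:

head -5 /proc/iomem        # 查看设备 I/O 在内存地址空间上的映射
lspci -s 00:1f.1 -v        # 按 总线:设备.功能 查看对应的 PCI 设备详情
lspci -D                   # -D 会显示完整的 域:总线:设备.功能 编号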

编程范型或编程范式(英语:Programming paradigm),(范即模范之意,范式即模式、方法),是一类典型的编程风格,是指从事软件工程的一类典型的风格(可以对照方法学)。如:函数式编程、程序编程、面向对象编程、指令式编程等等为不同的编程范型。

 

编程范型提供了(同时决定了)程序员对程序执行的看法。例如,在面向对象编程中,程序员认为程序是一系列相互作用的对象,而在函数式编程中一个程序会被看作是一个无状态的函数计算的串行。

 

正如软件工程中不同的群体会提倡不同的“方法学”一样,不同的编程语言也会提倡不同的“编程范型”。一些语言是专门为某个特定的范型设计的(如Smalltalk和Java支持面向对象编程,而Haskell和Scheme则支持函数式编程),同时还有另一些语言支持多种范型(如Ruby、Common Lisp、Python和Oz)。

25、范型

很多编程范型已经被熟知他们禁止使用哪些技术,同时允许使用哪些。 例如,纯粹的函数式编程不允许有副作用[1];结构化编程不允许使用goto。可能是因为这个原因,新的范型常常被那些惯于较早的风格的人认为是教条主义或过分严格。然而,这样避免某些技术反而更加证明了关于程序正确性——或仅仅是理解它的行为——的法则,而不用限制程序语言的一般性。

编程范型和编程语言之间的关系可能十分复杂,由于一个编程语言可以支持多种范型。例如,C++设计时,支持过程化编程、面向对象编程以及泛型编程。然而,设计师和程序员们要考虑如何使用这些范型元素来构建一个程序。一个人可以用C++写出一个完全过程化的程序,另一个人也可以用C++写出一个纯粹的面向对象程序,甚至还有人可以写出杂揉了两种范型的程序。

 

26、python错误ImportError: No module named setuptools 解决方法

在python运行过程中出现如下错误:

python错误:ImportError: No module named setuptools
这句错误提示的表面意思是:没有setuptools的模块,说明python缺少这个模块,那我们只要安装这个模块即可解决此问题,下面我们来安装一下:
在命令行下:
下载setuptools包

shell# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
解压setuptools包

shell# tar zxvf setuptools-0.6c11.tar.gz

shell# cd setuptools-0.6c11
编译setuptools

shell# python setup.py build
开始执行setuptools安装

shell# python setup.py install

 

27、Mysql用户:

select distinct(User) from mysql.user;

格式:grant 权限 on 数据库名.表名 to 用户@登录主机identified by “用户密码”;

@ 后面是访问mysql的客户端IP地址(或是 主机名) % 代表任意的客户端,如果填写localhost 为本地访问(那此用户就不能远程访问该mysql数据库了)。
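例如,给一个应用账号授权(库名、用户名、密码、网段均为假设值,适用于 MySQL 5.x 的语法):

mysql -u root -p -e "GRANT SELECT,INSERT,UPDATE ON testdb.* TO 'webuser'@'192.168.1.%' IDENTIFIED BY '123456';"
mysql -u root -p -e "FLUSH PRIVILEGES;"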

DROP USER 'username'@'host';

 

删除行:

mysql> delete from pet where name='Whistler';

28、HBase

HBase 是一个分布式的可扩展、非关系型开源数据库。它用 Java 很好地实现了 Google 的 Bigtable 系统的大部分特性。

 

29、OpenStack

OpenStack是一个云平台管理的项目,它不是一个软件。这个项目由几个主要的组件组合起来完成一些具体的工作。OpenStack是一个旨在为公共及私有云的建设与管理提供软件的开源项目。它包括控制器、计算 (Nova)、存储 (Swift)、消息队列 (RabbitMQ) 和网络 (Quantum) 组件。

30、Hadoop

是一个非常优秀的分布式编程框架,设计精巧而且目前没有同级别同重量的替代品。

 

在Hadoop的架构中,Map和Reduce是两个最基本的处理阶段,之前有输入数据格式定义和数据分片,之后有输出数据格式定义,二者中间还可以实现combine这个本地reduce操作和partition这个重定向mapper输出的策略行为。

Hadoop不适合用来处理大批量的小文件。其实这是由namenode的局限性所决定的,如果文件过小,namenode存储的元信息相对来说就会占用过大比例的空间,无论内存还是磁盘开销都非常大。
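针对小文件过多的问题,一个常见的缓解办法是用 Hadoop Archive 把小文件打包成 har 文件再访问(下面的路径仅为示例):

hadoop archive -archiveName logs.har -p /user/hadoop/logs /user/hadoop/archived
hadoop fs -ls har:///user/hadoop/archived/logs.har      # 可以像普通目录一样查看归档内容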

 

31、浅谈TCP优化

 

   Ilya Grigorik 在「High Performance Browser Networking」中做了很多细致的描述,让人读起来醍醐灌顶,我大概总结了

 

32、Mysql集群性能

整个集群由三类节点构成:数据节点、应用程序节点以及管理节点。

•数据节点通常负责数据访问与存储事务。

•应用程序节点提供由应用程序逻辑层及应用API指向数据节点的链接。

•管理节点在集群配置体系中的作用至关重要,并在网络分区环境下负责负载指派。

 

 

33、Shell脚本,备份:

备份网站内容

#!/bin/bash

#指定运行的脚本shell

#运行脚本要给用户执行权限

bakdir=/backup

month=`date +%m`

day=`date +%d`

year=`date +%Y`

hour=`date +%k`

min=`date +%M`

dirname=$year-$month-$day-$hour-$min

mkdir $bakdir/$dirname

mkdir $bakdir/$dirname/conf

mkdir $bakdir/$dirname/web

mkdir $bakdir/$dirname/db

#备份conf,检测通过

gzupload=upload.tgz

cp /opt/apache2/conf/httpd.conf $bakdir/$dirname/conf/httpd.conf

cd /opt/apache2/htdocs/php

tar -zcvf $bakdir/$dirname/web/$gzupload ./upload

#远程拷贝的目录要有可写权限

scp -r /backup/$dirname root@10.1.1.178:/backup

 

备份数据库:

#!/bin/bash

#指定运行的脚本shell

#运行脚本要给用户执行权限

bakdir=/backup

month=`date +%m`

day=`date +%d`

year=`date +%Y`

hour=`date +%k`

min=`date +%M`

dirname=$year-$month-$day-$hour-$min

mkdir $bakdir/$dirname

mkdir $bakdir/$dirname/conf

mkdir $bakdir/$dirname/web

mkdir $bakdir/$dirname/db

#热备份数据库

cp /opt/mysql/my.cnf $bakdir/$dirname/db/my.cnf

cd /opt/mysql

mysqldump --opt -u zhy --password=1986 test > $bakdir/$dirname/db/test.sql

mysqldump --opt -u zhy --password=1986 phpwind > $bakdir/$dirname/db/phpwind.sql

#远程拷贝的目录要有可写权限

scp -r /backup/$dirname root@10.1.1.178:/backup

 

MySQL的热备份脚本

本脚本是mysqldump --opt的补充:

#!/bin/bash

PATH=/usr/local/sbin:/usr/bin:/bin

# The Directory of Backup

BACKDIR=/usr/mysql_backup

# The Password of MySQL

ROOTPASS=password

# Remake the Directory of Backup

rm -rf $BACKDIR

mkdir -p $BACKDIR

# Get the Name of Database

DBLIST=`ls -p /var/lib/mysql | grep / | tr -d /`

# Backup with Database

for dbname in $DBLIST

do

mysqlhotcopy $dbname -u root -p $ROOTPASS $BACKDIR | logger -t mysqlhotcopy

done
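这类备份脚本一般交给 cron 定时执行,例如每天凌晨 2 点跑一次(脚本与日志路径仅为示例):

crontab -e
# 在 crontab 中加入如下一行
0 2 * * * /root/backup_web.sh >> /var/log/backup.log 2>&1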

 

33、RPS和RFS

•       RPS 全称是 Receive Packet Steering, 这是Google工程师 Tom Herbert (therbert@google.com )提交的内核补丁, 在2.6.35进入Linux内核. 这个patch采用软件模拟的方式,实现了多队列网卡所提供的功能,分散了在多CPU系统上数据接收时的负载, 把软中断分到各个CPU处理,而不需要硬件支持,大大提高了网络性能。

•       RFS 全称是 Receive Flow Steering, 这也是Tom提交的内核补丁,它是用来配合RPS补丁使用的,是RPS补丁的扩展补丁,它把接收的数据包送达应用所在的CPU上,提高cache的命中率。

•       这两个补丁往往都是一起设置,来达到最好的优化效果, 主要是针对单队列网卡多CPU环境(多队列多重中断的网卡也可以使用该补丁的功能,但多队列多重中断网卡有更好的选择:SMP IRQ affinity)

原理

RPS: RPS实现了数据流的hash归类,并把软中断的负载均衡分到各个cpu,实现了类似多队列网卡的功能。由于RPS只是单纯的把同一流的数据包分发给同一个CPU核来处理了,但是有可能出现这样的情况,即给该数据流分发的CPU核和执行处理该数据流的应用程序的CPU核不是同一个:数据包均衡到不同的 cpu,这个时候如果应用程序所在的cpu和软中断处理的cpu不是同一个,此时对于cpu cache的影响会很大。那么RFS补丁就是用来确保应用程序处理的cpu跟软中断处理的cpu是同一个,这样就充分利用cpu的cache。

•       应用RPS之前: 所有数据流被分到某个CPU,多CPU没有被合理利用,造成瓶颈

 

•       应用RPS之后: 同一流的数据包被分到同个CPU核来处理,但可能出现cpucache迁跃

 

•       应用RPS+RFS之后: 同一流的数据包被分到应用所在的CPU核

 

 

必要条件

使用RPS和RFS功能,需要有大于等于2.6.35版本的Linux kernel.

如何判断内核版本:

     $ uname -r

     2.6.38-2-686-bigmem
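确认内核版本满足要求后,可以用类似下面的命令临时开启 RPS/RFS(网卡名、CPU 掩码和表项数值只是常见示例,需要按实际核数调整):

echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus          # 把 rx-0 队列的软中断分散到 8 个核
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt    # 每队列的 RFS 流表大小
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries       # 全局 RFS 流表大小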

对比测试

类别      测试客户端                                          测试服务端
型号      BladeCenter HS23p                                   BladeCenter HS23p
CPU       Xeon E5-2609                                        Xeon E5-2630
网卡      Broadcom NetXtreme II BCM5709S Gigabit Ethernet     Emulex Corporation OneConnect 10Gb NIC
内核      3.2.0-2-amd64                                       3.2.0-2-amd64
内存      62GB                                                66GB
系统      Debian 6.0.4                                        Debian 6.0.5
超线程    否                                                  是
CPU核     4                                                   6
驱动      bnx2                                                be2net

     客户端: netperf

     服务端: netserver

     RPS cpu bitmap测试分类: 0(不开启rps功能), one cpu per queue(每队列绑定到1个CPU核上), all cpus per queue(每队列绑定到所有cpu核上), 不同分类的设置值如下

1)      0(不开启rps功能)

/sys/class/net/eth0/queues/rx-0/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-1/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-2/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-3/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-4/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-5/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-6/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-7/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-4/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-5/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-6/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-7/rps_flow_cnt 0
/proc/sys/net/core/rps_sock_flow_entries 0

2)      one cpu per queue(每队列绑定到1个CPU核上)

/sys/class/net/eth0/queues/rx-0/rps_cpus 00000001
/sys/class/net/eth0/queues/rx-1/rps_cpus 00000002
/sys/class/net/eth0/queues/rx-2/rps_cpus 00000004
/sys/class/net/eth0/queues/rx-3/rps_cpus 00000008
/sys/class/net/eth0/queues/rx-4/rps_cpus 00000010
/sys/class/net/eth0/queues/rx-5/rps_cpus 00000020
/sys/class/net/eth0/queues/rx-6/rps_cpus 00000040
/sys/class/net/eth0/queues/rx-7/rps_cpus 00000080
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-4/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-5/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-6/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-7/rps_flow_cnt 4096
/proc/sys/net/core/rps_sock_flow_entries 32768

3)      all cpus per queue(每队列绑定到所有cpu核上)

/sys/class/net/eth0/queues/rx-0/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-1/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-2/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-3/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-4/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-5/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-6/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-7/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-4/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-5/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-6/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-7/rps_flow_cnt 4096
/proc/sys/net/core/rps_sock_flow_entries 32768

测试方法: 每种测试类型执行3次,中间睡眠10秒, 每种测试类型分别执行100、500、1500个实例,每实例测试时间长度为60秒。

     TCP_RR 1 byte: 测试TCP 小数据包 request/response的性能

netperf -t TCP_RR -H $serverip -c -C -l 60

     UDP_RR 1 byte: 测试UDP 小数据包 request/response的性能

netperf -t UDP_RR -H $serverip -c -C -l 60

     TCP_RR 256 byte: 测试TCP 大数据包 request/response的性能

netperf -t TCP_RR -H $serverip -c -C -l 60 -- -r 256,256

     UDP_RR 256 byte: 测试UDP 大数据包 request/response的性能

netperf -t UDP_RR -H $serverip -c -C -l 60 -- -r 256,256

 

TPS测试结果

     TCP_RR 1 byte小包测试结果

     TCP_RR 256 byte大包测试结果

     UDP_RR 1 byte小包测试结果

     UDP_RR 256 byte大包测试结果

 

CPU负载变化

在测试过程中,使用mpstat收集各个CPU核的负载变化

1.      关闭RPS/RFS: 可以看出关闭RPS/RFS时,软中断的负载都在cpu0上,并没有有效的利用多CPU的特性,导致了性能瓶颈。

Average:    CPU    %usr   %nice   %sys %iowait    %irq   %soft %steal  %guest   %idle

Average:    all    3.65    0.00  35.75    0.05    0.01  14.56    0.00    0.00  45.98

Average:       0   0.00    0.00    0.00   0.00    0.00  100.00   0.00    0.00    0.00

Average:       1   4.43    0.00   37.76   0.00    0.11   11.49   0.00    0.00   46.20

Average:       2   5.01    0.00   45.80   0.00    0.00    0.00   0.00    0.00   49.19

Average:       3   5.11    0.00   45.07   0.00    0.00    0.00   0.00    0.00   49.82

Average:       4   3.52    0.00   40.38   0.14    0.00    0.00   0.00    0.00   55.96

Average:       5   3.85    0.00   39.91   0.00    0.00    0.00   0.00    0.00   56.24

Average:       6   3.62    0.00   40.48   0.14    0.00    0.00   0.00    0.00   55.76

Average:       7   3.87    0.00   38.86   0.11    0.00    0.00   0.00    0.00   57.16

2.      每队列关联到一个CPU TCP_RR: 可以看出软中断负载已经能分散到各个CPU核上,有效利用了多CPU的特性,大大提高了系统的网络性能。

Average:    CPU    %usr   %nice   %sys %iowait    %irq   %soft %steal  %guest   %idle

Average:    all    5.58    0.00  59.84    0.01    0.00  22.71    0.00    0.00  11.86

Average:       0   2.16    0.00   20.85   0.00    0.04   72.03   0.00    0.00    4.93

Average:       1   4.68    0.00   46.27   0.00    0.00   42.73   0.00    0.00    6.32

Average:       2   6.76    0.00   63.79   0.00    0.00   11.03   0.00    0.00   18.42

Average:       3   6.61    0.00   65.71   0.00    0.00   11.51   0.00    0.00   16.17

Average:       4   5.94    0.00   67.83   0.07    0.00   11.59   0.00    0.00   14.58

Average:       5   5.99    0.00   69.42   0.04    0.00   12.54   0.00    0.00   12.01

Average:       6   5.94    0.00   69.41   0.00    0.00   12.86   0.00    0.00   11.78

Average:       7   6.13    0.00   69.61   0.00    0.00   14.48   0.00    0.00    9.77

 

3.      每队列关联到一个CPU UDP_RR: CPU负载未能均衡地分布到各个CPU, 这是由于网卡hash计算在UDP包上的不足,详细请见本文后记部分。

Average:    CPU    %usr   %nice   %sys %iowait    %irq   %soft %steal  %guest   %idle

Average:    all    3.01    0.00  29.84    0.07    0.01  13.35    0.00    0.00  53.71

Average:       0   0.00    0.00    0.08   0.00    0.00   90.01   0.00    0.00    9.91

Average:       1   3.82    0.00   32.87   0.00    0.05   12.81   0.00    0.00   50.46

Average:       2   4.84    0.00   37.53   0.00    0.00    0.14   0.00    0.00   57.49

Average:       3   4.90    0.00   37.92   0.00    0.00    0.16   0.00    0.00   57.02

Average:       4   2.57    0.00   32.72   0.20    0.00    0.09   0.00    0.00   64.42

Average:       5   2.66    0.00   33.54   0.11    0.00    0.08   0.00    0.00   63.60

Average:       6   2.75    0.00   32.81   0.09    0.00    0.06   0.00    0.00   64.30

Average:       7   2.71    0.00   32.66   0.17    0.00    0.06   0.00    0.00   64.40

4.      每队列关联到所有CPU: 可以看出软中断负载已经能分散到各个CPU核上,有效利用了多CPU的特性,大大提高了系统的网络性能

Average:    CPU    %usr   %nice   %sys %iowait    %irq   %soft %steal  %guest   %idle

Average:    all    5.39    0.00  59.97    0.00    0.00  22.57    0.00    0.00  12.06

Average:       0   1.46    0.00   21.83   0.04    0.00   72.08   0.00    0.00    4.59

Average:       1   4.45    0.00   46.40   0.00    0.04   43.39   0.00    0.00    5.72

Average:       2   6.84    0.00  65.62    0.00    0.00  11.39    0.00    0.00  16.15

Average:       3   6.71    0.00   67.13   0.00    0.00   12.07   0.00    0.00   14.09

Average:       4   5.73    0.00   66.97   0.00    0.00   10.71   0.00    0.00   16.58

Average:       5   5.74    0.00   68.57   0.00    0.00   13.02   0.00    0.00   12.67

Average:       6   5.79    0.00   69.27   0.00    0.00   12.31   0.00    0.00   12.63

Average:       7   5.96    0.00   68.98   0.00    0.00   12.00   0.00    0.00   13.06

 

结果分析

以下结果只是针对测试服务器特定硬件及系统的数据,在不同测试对象的RPS/RFS测试结果可能有不同的表现。

TCP性能:

•       在没有打开RPS/RFS的情况下,随着进程数的增加,TCP tps性能并没有明显提升,在184~188k之间。

•       打开RPS/RFS之后,随着RPS导致软中断被分配到所有CPU上和RFS增加的cache命中, 小数据包(1字节)及大数据包(256字节,相对小数据包而言, 而不是实际应用中的大数据包)的tps性能都有显著提升

•       100个进程提升40%的性能(两种RPS/RFS设置的性能结果一致),cpu负载升高40%

•       500个进程提升70%的性能(两种RPS/RFS设置的性能结果一致),cpu负载升高62%

•       1500个进程提升75%的性能(两种RPS/RFS设置的性能结果一致),cpu负载升高77%

UDP性能:

•       在没有打开RPS/RFS的情况下,随着进程数的增加,UDP tps性能并没有明显提升,在226~235k之间。

•       打开RPS/RFS之后,随着RPS导致软中断被分配到所有CPU上和RFS增加的cache命中, 小数据包(1字节)及大数据包(256字节,相对小数据包而言, 而不是实际应用中的大数据包)的TPS性能, 在每队列关联到所有CPU的情况下有显著提升, 而每队列关联到一个CPU后反倒导致了UDP tps性能下降1% (这是bnx2网卡不支持UDP port hash及此次测试的局限性造成的结果, 详细分析见: 后记)

•       每队列关联到所有CPU的情况下, 在100个进程时小包提升40%的性能, cpu负载升高60%; 大包提升33%, cpu负载升高47%

•       每队列关联到所有CPU的情况下, 在500个进程时小包提升62%的性能, cpu负载升高71%; 大包提升60%, cpu负载升高65%

•       每队列关联到所有CPU的情况下, 在1500个进程时小包提升65%的性能, cpu负载升高75%; 大包提升64%, cpu负载升高74%

后记

UDP在每队列绑定到一个CPU时性能下降,而绑定到所有CPU时,却有性能提升,这一问题涉及到几个因素,当这几个因素凑一起时,导致了这种奇特的表现。

•       此次测试的局限性:本次测试是1对1的网络测试,产生的数据包的IP地址都是相同的

•       bnx2 网卡在RSS hash上,不支持UDP Port,也就是说,网卡在对TCP数据流进行队列选择时的hash包含了ip和port, 而在UDP上的hash, 只有IP地址,导致了本次测试(上面的局限性影响)的UDP数据包的hash结果都是一样的,数据包被转送到同一条队列。

•       单单上面两个因素,还无法表现出UDP在每队列绑定到一个CPU时性能下降,而绑定到所有CPU时,却有性能提升的现象。 因为RPS/RFS本身也有hash计算,也就是进入队列后的数据包,还需要经过RPS/RFS的hash计算(这里的hash支持udp port), 然后进行第二次数据包转送选择;如果每队列绑定到一个CPU, 系统直接跳过第二次hash计算,数据包直接分配到该队列关联的CPU处理,也就导致了在第一次hash计算后被错误转送到某一队列的UDP数据包,将直接送到cpu处理,导致了性能的下降;而如果是每队列绑定到所有CPU,那么进入队列后的数据包会在第二次hash时被重新分配,修正了第一次hash的错误选择。

相关对比测试

1. SMP IRQ affinity:http://www.igigo.net/archives/231

参考资料

•       Software receive packetsteering

•       Receive Packet Steering

•       Receive packet steering

•       Receive Flow Steering

•       linux kernel 2.6.35中RFS特性详解

•       Linux 2.6.35 新增特性 RPS RFS

•       kernel/Documentation/networking/scaling.txt

 

34、SLES9下配置 IP Bonding 的步骤

   To avoid problems it is advisable that all network cards use the same driver. If they use different drivers, please take the following into consideration:

There are three driver-dependent methods for checking whether a network card has a link or a network connection.

   *   MII link status detection

   *   Register in the driver netif_carrier

   *   ARP monitoring

It is very important that the used drivers support the same method. If this is not the case because e.g. the first network card driver only supports MII link status detection whereas the second driver just supports netif_carrier, the only solution is to replace the network card in order to use a different driver.

To find out what method is supported by your driver, proceed as follows:

* MII link status can be determined with the tools mii-tool or ethtool.

* In the case of netif_carrier and ARP monitoring, refer to the driver's source code to find out whether these methods are supported or not. The corresponding kernel sources must be installed for this purpose. Regarding netif_carrier, search exactly for this string in the driver's source code, e.g.

     grep netif_carrier via-rhine.c

As for the ARP monitoring method, the driver must support either the register last_rx or trans_start. Thus, you can search in the driver's source code for:

     grep "last_rx\|trans_start" via-rhine.c

Start with the setup only after having verified this.

Procedure

In this sample scenario, two network cards will be combined by way of bonding mode=1 (active backup).

1. Configure your network cards with YaST. Allocate the IP address that must be used for the bonding device to one network card and a dummy IP address to the rest of the network cards.

2. Copy the configuration of the network card with the right IP address to a file ifcfg-bond0.

     cd /etc/sysconfig/network

     cp ifcfg-eth-id-xx:xx:xx:xx:xx:01 ifcfg-bond0

3. Find out and write down the PCI IDs of all the involved network cards. For example:

     linux:~ # grep bus-pci ifcfg-eth-id-xx:xx:xx:xx:xx:01

     _nm_name='bus-pci-0000:00:09.0'

     linux:~ # grep bus-pci ifcfg-eth-id-xx:xx:xx:xx:xx:02

     _nm_name='bus-pci-0000:00:0a.0'

     linux:~ #

4. Edit the file ifcfg-bond0 previously created and insert the following lines.

     BONDING_MASTER=yes

     BONDING_SLAVE_0='bus-pci-0000:00:09.0'

     BONDING_SLAVE_1='bus-pci-0000:00:0a.0'

     Now insert the options for the bonding module. Depending on what link detection method you are using, the line may look like this:

    *  MII link detection method

           BONDING_MODULE_OPTS='miimon=100 mode=1 use_carrier=0'

    *  netif_carrier method

           BONDING_MODULE_OPTS='miimon=100 mode=1 use_carrier=1'

    *  ARP monitoring method

           BONDING_MODULE_OPTS='arp_interval=2500 arp_ip_target=192.168.1.1 mode=1'

5. Remove the old configuration files

  linux:~ # rm ifcfg-eth-id-xx:xx:xx:xx:xx:01

  linux:~ # rm ifcfg-eth-id-xx:xx:xx:xx:xx:02

6. Restart the network with

  rcnetwork restart

Additional Information

Occasionally it has been experienced that not all network interfaces come up after a system reboot. To prevent this, the loading of the modules should start earlier during the reboot process. The following procedure is helpful in this case:

   1. Edit the file /etc/sysconfig/kernel and add this line:

      MODULES_LOADED_ON_BOOT="bcm5700"

   2. Reboot the server and check the status of all network interfaces, using the commands lspci and ifconfig.

   3. If this method is not successful, edit the file /etc/sysconfig/kernel again and remove the line inserted at step 1. Modify the line containing the INITRD_MODULES statement; add bcm5700 to this line. It should read like INITRD_MODULES="cdrom scsi_mod ide-cd ehci-hcd reiserfs bcm5700"

   4. Call command mkinitrd

   5. Reboot the server as in step 2

Another method is to delay the starting of the network interfaces after loading the modules. To do this, edit the file /etc/sysconfig/network/config and change the variable WAIT_FOR_INTERFACES to the wanted delay in seconds. To delay the interfaces 3 seconds, enter

WAIT_FOR_INTERFACES=3

Reboot the server to verify the success of this measure.

当然也可以采用一些简单的办法,例如直接修改 /etc/init.d/network 网络启动脚本。

在start) 部分的结尾处添加 IP bonding 的手工脚本,例如:

ifconfig eth0 0.0.0.0

ifconfig eth1 0.0.0.0

modprobe bonding miimon=100 mode=1 use_carrier=1

ifconfig bond0 192.168.1.123 netmask 255.255.255.0

ifenslave bond0 eth0

ifenslave bond0 eth1

route add default gw 192.168.1.1

然后在 stop) 部分开始考虑添加:

ifdown bond0

rmmod bonding
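
无论用哪种方式配置,都可以通过下面的命令确认 bonding 是否正常工作:

cat /proc/net/bonding/bond0     # 查看 bonding 模式、各 slave 的链路状态与失败计数
ifconfig bond0                  # 查看 bond0 的 IP 与 MAC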

———————————————————————

Introduction

       The Linux bonding driver provides a method for aggregating multiple network interfaces into a single logical "bonded" interface. The behavior of the bonded interfaces depends upon the mode; generally speaking, modes provide either hot standby or load balancing services. Additionally, link integrity monitoring may be performed.

       The bonding driver originally came from Donald Becker's beowulf patches for kernel 2.0. It has changed quite a bit since, and the original tools from extreme-linux and beowulf sites will not work with this version of the driver.

       For new versions of the driver, updated userspace tools, and who to ask for help, please follow the links at the end of this file.

2. Bonding Driver Options

       Options for the bonding driver are supplied as parameters to the bonding module at load time. They may be given as command line arguments to the insmod or modprobe command, but are usually specified in either the /etc/modules.conf or /etc/modprobe.conf configuration file, or in a distro-specific configuration file (some of which are detailed in the next section).

       The available bonding driver parameters are listed below. If a parameter is not specified the default value is used. When initially configuring a bond, it is recommended "tail -f /var/log/messages" be run in a separate window to watch for bonding driver error messages.

       It is critical that either the miimon or arp_interval and arp_ip_target parameters be specified, otherwise serious network degradation will occur during link failures. Very few devices do not support at least miimon, so there is really no reason not to use it.

       Options with textual values will accept either the text name or, for backwards compatibility, the option value. E.g., "mode=802.3ad" and "mode=4" set the same mode.

       The parameters are as follows:

arp_interval

       Specifies the ARP link monitoring frequency in milliseconds. If ARP monitoring is used in an etherchannel compatible mode (modes 0 and 2), the switch should be configured in a mode that evenly distributes packets across all links. If the switch is configured to distribute the packets in an XOR fashion, all replies from the ARP targets will be received on the same link which could cause the other team members to fail. ARP monitoring should not be used in conjunction with miimon. A value of 0 disables ARP monitoring. The default value is 0.

arp_ip_target

       Specifies the IP addresses to use as ARP monitoring peers when arp_interval is > 0. These are the targets of the ARP request sent to determine the health of the link to the targets. Specify these values in ddd.ddd.ddd.ddd format. Multiple IP addresses must be separated by a comma. At least one IP address must be given for ARP monitoring to function. The maximum number of targets that can be specified is 16. The default value is no IP addresses.

downdelay

       Specifies the time, in milliseconds, to wait before disabling a slave after a link failure has been detected. This option is only valid for the miimon link monitor. The downdelay value should be a multiple of the miimon value; if not, it will be rounded down to the nearest multiple. The default value is 0.

lacp_rate

       Option specifying the rate in which we'll ask our link partner to transmit LACPDU packets in 802.3ad mode. Possible values are:

       slow or 0
                Request partner to transmit LACPDUs every 30 seconds

       fast or 1
                Request partner to transmit LACPDUs every 1 second

       The default is slow.

max_bonds

       Specifies the number of bonding devices to create for this instance of the bonding driver. E.g., if max_bonds is 3, and the bonding driver is not already loaded, then bond0, bond1 and bond2 will be created. The default value is 1.

miimon

       Specifies the MII link monitoring frequency in milliseconds. This determines how often the link state of each slave is inspected for link failures. A value of zero disables MII link monitoring. A value of 100 is a good starting point. The use_carrier option, below, affects how the link state is determined. See the High Availability section for additional information. The default value is 0.

mode

       Specifies one of the bonding policies. The default is balance-rr (round robin). Possible values are:

       balance-rr or 0
                Round-robin policy: Transmit packets in sequential order from the first available slave through the last. This mode provides load balancing and fault tolerance.

       active-backup or 1
                Active-backup policy: Only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch.
                In bonding version 2.6.2 or later, when a failover occurs in active-backup mode, bonding will issue one or more gratuitous ARPs on the newly active slave. One gratuitous ARP is issued for the bonding master interface and each VLAN interface configured above it, provided that the interface has at least one IP address configured. Gratuitous ARPs issued for VLAN interfaces are tagged with the appropriate VLAN id.
                This mode provides fault tolerance. The primary option, documented below, affects the behavior of this mode.

       balance-xor or 2
                XOR policy: Transmit based on the selected transmit hash policy. The default policy is a simple [(source MAC address XOR'd with destination MAC address) modulo slave count]. Alternate transmit policies may be selected via the xmit_hash_policy option, described below.
                This mode provides load balancing and fault tolerance.

       broadcast or 3
                Broadcast policy: transmits everything on all slave interfaces. This mode provides fault tolerance.

       802.3ad or 4
                IEEE 802.3ad Dynamic link aggregation. Creates aggregation groups that share the same speed and duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification.
                Slave selection for outgoing traffic is done according to the transmit hash policy, which may be changed from the default simple XOR policy via the xmit_hash_policy option, documented below. Note that not all transmit policies may be 802.3ad compliant, particularly in regards to the packet mis-ordering requirements of section 43.2.4 of the 802.3ad standard. Differing peer implementations will have varying tolerances for noncompliance.
                Prerequisites:
                1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
                2. A switch that supports IEEE 802.3ad Dynamic link aggregation. Most switches will require some type of configuration to enable 802.3ad mode.

       balance-tlb or 5
                Adaptive transmit load balancing: channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.
                Prerequisite:
                Ethtool support in the base drivers for retrieving the speed of each slave.

       balance-alb or 6
                Adaptive load balancing: includes balance-tlb plus receive load balancing (rlb) for IPV4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server.
                Receive traffic from connections created by the server is also balanced. When the local system sends an ARP Request the bonding driver copies and saves the peer's IP information from the ARP packet. When the ARP Reply arrives from the peer, its hardware address is retrieved and the bonding driver initiates an ARP reply to this peer assigning it to one of the slaves in the bond. A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast it uses the hardware address of the bond. Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave. This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed. Receive traffic is also redistributed when a new slave is added to the bond and when an inactive slave is re-activated. The receive load is distributed sequentially (round robin) among the group of highest speed slaves in the bond.
                When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all active slaves in the bond by initiating ARP Replies with the selected mac address to each of the clients. The updelay parameter (detailed below) must be set to a value equal or greater than the switch's forwarding delay so that the ARP Replies sent to the peers will not be blocked by the switch.
                Prerequisites:
                1. Ethtool support in the base drivers for retrieving the speed of each slave.
                2. Base driver support for setting the hardware address of a device while it is open. This is required so that there will always be one slave in the team using the bond hardware address (the curr_active_slave) while having a unique hardware address for each slave in the bond. If the curr_active_slave fails its hardware address is swapped with the new curr_active_slave that was chosen.

primary

       A string (eth0, eth2, etc) specifying which slave is the primary device. The specified device will always be the active slave while it is available. Only when the primary is off-line will alternate devices be used. This is useful when one slave is preferred over another, e.g., when one slave has higher throughput than another.
       The primary option is only valid for active-backup mode.

updelay

       Specifies the time, in milliseconds, to wait before enabling a slave after a link recovery has been detected. This option is only valid for the miimon link monitor. The updelay value should be a multiple of the miimon value; if not, it will be rounded down to the nearest multiple. The default value is 0.

use_carrier

       Specifies whether or not miimon should use MII or ETHTOOL ioctls vs. netif_carrier_ok() to determine the link status. The MII or ETHTOOL ioctls are less efficient and utilize a deprecated calling sequence within the kernel. The netif_carrier_ok() relies on the device driver to maintain its state with netif_carrier_on/off; at this writing, most, but not all, device drivers support this facility.
       If bonding insists that the link is up when it should not be, it may be that your network device driver does not support netif_carrier_on/off. The default state for netif_carrier is "carrier on," so if a driver does not support netif_carrier, it will appear as if the link is always up. In this case, setting use_carrier to 0 will cause bonding to revert to the MII / ETHTOOL ioctl method to determine the link state.
       A value of 1 enables the use of netif_carrier_ok(), a value of 0 will use the deprecated MII / ETHTOOL ioctls. The default value is 1.

xmit_hash_policy

       Selects the transmit hash policy to use for slave selection in balance-xor and 802.3ad modes. Possible values are:

       layer2
                Uses XOR of hardware MAC addresses to generate the hash. The formula is
                (source MAC XOR destination MAC) modulo slave count
                This algorithm will place all traffic to a particular network peer on the same slave. This algorithm is 802.3ad compliant.

       layer3+4
                This policy uses upper layer protocol information, when available, to generate the hash. This allows for traffic to a particular network peer to span multiple slaves, although a single connection will not span multiple slaves.
                The formula for unfragmented TCP and UDP packets is
                ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count
                For fragmented TCP or UDP packets and all other IP protocol traffic, the source and destination port information is omitted. For non-IP traffic, the formula is the same as for the layer2 transmit hash policy.
                This policy is intended to mimic the behavior of certain switches, notably Cisco switches with PFC2 as well as some Foundry and IBM products.
                This algorithm is not fully 802.3ad compliant. A single TCP or UDP conversation containing both fragmented and unfragmented packets will see packets striped across two interfaces. This may result in out of order delivery. Most traffic types will not meet this criteria, as TCP rarely fragments traffic, and most UDP traffic is not involved in extended conversations. Other implementations of 802.3ad may or may not tolerate this noncompliance.

       The default value is layer2. This option was added in bonding version 2.6.3. In earlier versions of bonding, this parameter does not exist, and the layer2 policy is the only policy.

3. Configuring Bonding Devices

       There are, essentially, two methods for configuring bonding: with support from the distro's network initialization scripts, and without. Distros generally use one of two packages for the network initialization scripts: initscripts or sysconfig. Recent versions of these packages have support for bonding, while older versions do not.

       We will first describe the options for configuring bonding for distros using versions of initscripts and sysconfig with full or partial support for bonding, then provide information on enabling bonding without support from the network initialization scripts (i.e., older versions of initscripts or sysconfig).

       If you're unsure whether your distro uses sysconfig or initscripts, or don't know if it's new enough, have no fear. Determining this is fairly straightforward.

       First, issue the command:

$ rpm -qf /sbin/ifup

       It will respond with a line of text starting with either "initscripts" or "sysconfig," followed by some numbers. This is the package that provides your network initialization scripts.

       Next, to determine if your installation supports bonding, issue the command:

$ grep ifenslave /sbin/ifup

       If this returns any matches, then your initscripts or sysconfig has support for bonding.

3.1 Configuration with sysconfig support

        This section applies to distros using a version of sysconfig
with bonding support, for example, SuSE Linux Enterprise Server 9.

        SuSE SLES 9's networking configuration system does support
bonding, however, at this writing, the YaST system configuration
frontend does not provide any means to work with bonding devices.
Bonding devices can be managed by hand, however, as follows.

        First, if they have not already been configured, configure the
slave devices.  On SLES 9, this is most easily done by running the
yast2 sysconfig configuration utility.  The goal is to create an
ifcfg-id file for each slave device.  The simplest way to accomplish
this is to configure the devices for DHCP (this is only to get the
ifcfg-id file created; see below for some issues with DHCP).  The
name of the configuration file for each device will be of the form:

ifcfg-id-xx:xx:xx:xx:xx:xx

        Where the "xx" portion will be replaced with the digits from
the device's permanent MAC address.

        Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been
created, it is necessary to edit the configuration files for the slave
devices (the MAC addresses correspond to those of the slave devices).
Before editing, the file will contain multiple lines, and will look
something like this:

BOOTPROTO='dhcp'
STARTMODE='on'
USERCTL='no'
UNIQUE='XNzu.WeZGOGF+4wE'
_nm_name='bus-pci-0001:61:01.0'

        Change the BOOTPROTO and STARTMODE lines to the following:

BOOTPROTO='none'
STARTMODE='off'

        Do not alter the UNIQUE or _nm_name lines.  Remove any other
lines (USERCTL, etc).

        Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified,
it's time to create the configuration file for the bonding device
itself.  This file is named ifcfg-bondX, where X is the number of the
bonding device to create, starting at 0.  The first such file is
ifcfg-bond0, the second is ifcfg-bond1, and so on.  The sysconfig
network configuration system will correctly start multiple instances
of bonding.

        The contents of the ifcfg-bondX file is as follows:

BOOTPROTO="static"
BROADCAST="10.0.2.255"
IPADDR="10.0.2.10"
NETMASK="255.255.0.0"
NETWORK="10.0.2.0"
REMOTE_IPADDR=""
STARTMODE="onboot"
BONDING_MASTER="yes"
BONDING_MODULE_OPTS="mode=active-backup miimon=100"
BONDING_SLAVE0="eth0"
BONDING_SLAVE1="bus-pci-0000:06:08.1"

        Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK
values with the appropriate values for your network.

        The STARTMODE specifies when the device is brought online.
The possible values are:

        onboot:  The device is started at boot time.  If you're not
                 sure, this is probably what you want.

        manual:  The device is started only when ifup is called
                 manually.  Bonding devices may be configured this
                 way if you do not wish them to start automatically
                 at boot for some reason.

        hotplug: The device is started by a hotplug event.  This is not
                 a valid choice for a bonding device.

        off or ignore: The device configuration is ignored.

        The line BONDING_MASTER='yes' indicates that the device is a
bonding master device.  The only useful value is "yes."

        The contents of BONDING_MODULE_OPTS are supplied to the
instance of the bonding module for this device.  Specify the options
for the bonding mode, link monitoring, and so on here.  Do not include
the max_bonds bonding parameter; this will confuse the configuration
system if you have multiple bonding devices.

        Finally, supply one BONDING_SLAVEn="slave device" for each
slave, where "n" is an increasing value, one for each slave.  The
"slave device" is either an interface name, e.g., "eth0", or a device
specifier for the network device.  The interface name is easier to
find, but the ethN names are subject to change at boot time if, e.g.,
a device early in the sequence has failed.  The device specifiers
(bus-pci-0000:06:08.1 in the example above) specify the physical
network device, and will not change unless the device's bus location
changes (for example, it is moved from one PCI slot to another).  The
example above uses one of each type for demonstration purposes; most
configurations will choose one or the other for all slave devices.

        When all configuration files have been modified or created,
networking must be restarted for the configuration changes to take
effect.  This can be accomplished via the following:

# /etc/init.d/network restart

        Note that the network control script (/sbin/ifdown) will
remove the bonding module as part of the network shutdown processing,
so it is not necessary to remove the module by hand if, e.g., the
module parameters have changed.

        Also, at this writing, YaST/YaST2 will not manage bonding
devices (they do not show bonding interfaces on its list of network
devices).  It is necessary to edit the configuration file by hand to
change the bonding configuration.

        Additional general options and details of the ifcfg file
format can be found in an example ifcfg template file:

/etc/sysconfig/network/ifcfg.template

        Note that the template does not document the various BONDING_
settings described above, but does describe many of the other options.

3.1.1 Using DHCP with sysconfig

        Under sysconfig, configuring a device with BOOTPROTO='dhcp'
will cause it to query DHCP for its IP address information.  At this
writing, this does not function for bonding devices; the scripts
attempt to obtain the device address from DHCP prior to adding any of
the slave devices.  Without active slaves, the DHCP requests are not
sent to the network.

3.1.2 Configuring Multiple Bonds with sysconfig

        The sysconfig network initialization system is capable of
handling multiple bonding devices.  All that is necessary is for each
bonding instance to have an appropriately configured ifcfg-bondX file
(as described above).  Do not specify the "max_bonds" parameter to any
instance of bonding, as this will confuse sysconfig.  If you require
multiple bonding devices with identical parameters, create multiple
ifcfg-bondX files.
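        As an illustration only (the device names, addresses and mode
shown here are hypothetical), a second bond simply gets its own
ifcfg-bond1 file with its own options, for example:

BOOTPROTO="static"
IPADDR="10.0.3.10"
NETMASK="255.255.0.0"
STARTMODE="onboot"
BONDING_MASTER="yes"
BONDING_MODULE_OPTS="mode=balance-rr miimon=100"
BONDING_SLAVE0="eth2"
BONDING_SLAVE1="eth3"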

        Because the sysconfig scripts supply the bonding module
options in the ifcfg-bondX file, it is not necessary to add them to
the system /etc/modules.conf or /etc/modprobe.conf configuration file.

3.3 Configuring Bonding Manually

        This section applies to distros whose network initialization
scripts (the sysconfig or initscripts package) do not have specific
knowledge of bonding.  One such distro is SuSE Linux Enterprise Server
version 8.

        The general method for these systems is to place the bonding
module parameters into /etc/modules.conf or /etc/modprobe.conf (as
appropriate for the installed distro), then add modprobe and/or
ifenslave commands to the system's global init script.  The name of
the global init script differs; for sysconfig, it is
/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.

        For example, if you wanted to make a simple bond of two e100
devices (presumed to be eth0 and eth1), and have it persist across
reboots, edit the appropriate file (/etc/init.d/boot.local or
/etc/rc.d/rc.local), and add the following:

modprobe bonding mode=balance-alb miimon=100
modprobe e100
ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
ifenslave bond0 eth0
ifenslave bond0 eth1

        Replace the example bonding module parameters and bond0
network configuration (IP address, netmask, etc) with the appropriate
values for your configuration.

        Unfortunately, this method will not provide support for the
ifup and ifdown scripts on the bond devices.  To reload the bonding
configuration, it is necessary to run the initialization script, e.g.,

# /etc/init.d/boot.local

        or

# /etc/rc.d/rc.local

        It may be desirable in such a case to create a separate script
which only initializes the bonding configuration, then call that
separate script from within boot.local.  This allows for bonding to be
enabled without re-running the entire global init script.

        To shut down the bonding devices, it is necessary to first
mark the bonding device itself as being down, then remove the
appropriate device driver modules.  For our example above, you can do
the following:

# ifconfig bond0 down
# rmmod bonding
# rmmod e100

        Again, for convenience, it may be desirable to create a script
with these commands.
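        A minimal sketch of such a script (the path and name of the
script are arbitrary example choices; the commands simply repeat those
shown above):

#!/bin/sh
# bond-stop: tear down the example bond0 configuration shown above
ifconfig bond0 down
rmmod bonding
rmmod e100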

3.3.1 Configuring Multiple Bonds Manually

        This section contains information on configuring multiple
bonding devices with differing options for those systems whose network
initialization scripts lack support for configuring multiple bonds.

        If you require multiple bonding devices, but all with the same
options, you may wish to use the "max_bonds" module parameter,
documented above.

        To create multiple bonding devices with differing options, it
is necessary to load the bonding driver multiple times.  Note that
current versions of the sysconfig network initialization scripts
handle this automatically; if your distro uses these scripts, no
special action is needed.  See the section Configuring Bonding
Devices, above, if you're not sure about your network initialization
scripts.

        To load multiple instances of the module, it is necessary to
specify a different name for each instance (the module loading system
requires that every loaded module, even multiple instances of the same
module, have a unique name).  This is accomplished by supplying
multiple sets of bonding options in /etc/modprobe.conf, for example:

alias bond0 bonding
options bond0 -o bond0 mode=balance-rr miimon=100
alias bond1 bonding
options bond1 -o bond1 mode=balance-alb miimon=50

        will load the bonding module two times.  The first instance is
named "bond0" and creates the bond0 device in balance-rr mode with an
miimon of 100.  The second instance is named "bond1" and creates the
bond1 device in balance-alb mode with an miimon of 50.

        In some circumstances (typically with older distributions),
the above does not work, and the second bonding instance never sees
its options.  In that case, the second options line can be substituted
as follows:

install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
        mode=balance-alb miimon=50

        This may be repeated any number of times, specifying a new and
unique name in place of bond1 for each subsequent instance.

5. Querying Bonding Configuration

5.1 Bonding Configuration

       Each bonding device has a read-only file residing in the

/proc/net/bonding directory.  The file contents include information
about the bonding configuration, options and state of each slave.

        For example, the contents of /proc/net/bonding/bond0 after the
driver is loaded with parameters of mode=0 and miimon=1000 is
generally as follows:

       Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004)

       Bonding Mode: load balancing (round-robin)

       Currently Active Slave: eth0

       MII Status: up

       MII Polling Interval (ms): 1000

       Up Delay (ms): 0

       Down Delay (ms): 0

       Slave Interface: eth1

       MII Status: up

       Link Failure Count: 1

       Slave Interface: eth0

       MII Status: up

       Link Failure Count: 1

        The precise format and contents will change depending upon the
bonding configuration, state, and version of the bonding driver.

5.2 Network configuration

        The network configuration can be inspected using the ifconfig
command.  Bonding devices will have the MASTER flag set; bonding slave
devices will have the SLAVE flag set.  The ifconfig output does not
contain information on which slaves are associated with which masters.

        In the example below, the bond0 interface is the master
(MASTER) while eth0 and eth1 are slaves (SLAVE).  Notice all slaves of
bond0 have the same MAC address (HWaddr) as bond0 for all modes except
TLB and ALB, which require a unique MAC address for each slave.

# /sbin/ifconfig

bond0    Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4

         inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0

         UP BROADCAST RUNNING MASTER MULTICAST MTU:1500  Metric:1

         RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0

         TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0

         collisions:0 txqueuelen:0

eth0     Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4

         inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0

         UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500  Metric:1

         RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0

         TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0

         collisions:0 txqueuelen:100

         Interrupt:10 Base address:0x1080

eth1     Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4

         inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0

         UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1

         RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0

         TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0

         collisions:0 txqueuelen:100

         Interrupt:9 Base address:0x1400

6. Switch Configuration

        For this section, "switch" refers to whatever system the
bonded devices are directly connected to (i.e., where the other end of
the cable plugs into).  This may be an actual dedicated switch device,
or it may be another regular system (e.g., another computer running
Linux).

        The active-backup, balance-tlb and balance-alb modes do not
require any specific configuration of the switch.

        The 802.3ad mode requires that the switch have the appropriate
ports configured as an 802.3ad aggregation.  The precise method used
to configure this varies from switch to switch, but, for example, a
Cisco 3550 series switch requires that the appropriate ports first be
grouped together in a single etherchannel instance, then that
etherchannel is set to mode "lacp" to enable 802.3ad (instead of
standard EtherChannel).

        The balance-rr, balance-xor and broadcast modes generally
require that the switch have the appropriate ports grouped together.
The nomenclature for such a group differs between switches; it may be
called an "etherchannel" (as in the Cisco example, above), a "trunk
group" or some other similar variation.  For these modes, each switch
will also have its own configuration options for the switch's transmit
policy to the bond.  Typical choices include XOR of either the MAC or
IP addresses.  The transmit policy of the two peers does not need to
match.  For these three modes, the bonding mode really selects a
transmit policy for an EtherChannel group; all three will interoperate
with another EtherChannel group.
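        As a rough illustration only (exact commands vary by switch
model and software version, and the interface ranges and channel-group
numbers below are assumptions), a Cisco IOS switch might be set up for
an 802.3ad (LACP) aggregation, or for a static EtherChannel group
suitable for balance-rr/balance-xor/broadcast, roughly as follows:

! 802.3ad / LACP aggregation on two ports
interface range GigabitEthernet0/1 - 2
 channel-group 1 mode active
! static EtherChannel (no LACP) on two other ports
interface range GigabitEthernet0/3 - 4
 channel-group 2 mode on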

7. 802.1q VLAN Support

        It is possible to configure VLAN devices over a bond interface
using the 8021q driver.  However, only packets coming from the 8021q
driver and passing through bonding will be tagged by default.  Self
generated packets, for example, bonding's learning packets or ARP
packets generated by either ALB mode or the ARP monitor mechanism, are
tagged internally by bonding itself.  As a result, bonding must
"learn" the VLAN IDs configured above it, and use those IDs to tag
self generated packets.

        For reasons of simplicity, and to support the use of adapters
that can do VLAN hardware acceleration offloading, the bonding
interface declares itself as fully hardware offloading capable, it gets
the add_vid/kill_vid notifications to gather the necessary
information, and it propagates those actions to the slaves.  In case
of mixed adapter types, hardware accelerated tagged packets that
should go through an adapter that is not offloading capable are
"un-accelerated" by the bonding driver so the VLAN tag sits in the
regular location.

        VLAN interfaces must be added on top of a bonding interface
only after enslaving at least one slave.  The bonding interface has a
hardware address of 00:00:00:00:00:00 until the first slave is added.
If the VLAN interface is created prior to the first enslavement, it
would pick up the all-zeroes hardware address.  Once the first slave
is attached to the bond, the bond device itself will pick up the
slave's hardware address, which is then available for the VLAN device.

        Also, be aware that a similar problem can occur if all slaves
are released from a bond that still has one or more VLAN interfaces on
top of it.  When a new slave is added, the bonding interface will
obtain its hardware address from the first slave, which might not
match the hardware address of the VLAN interfaces (which was
ultimately copied from an earlier slave).

        There are two methods to insure that the VLAN device operates
with the correct hardware address if all slaves are removed from a
bond interface:

•  Remove all VLAN interfaces then recreate them

•  Set the bonding interface's hardware address so that it
matches the hardware address of the VLAN interfaces.

        Note that changing a VLAN interface's HW address would set the
underlying device (i.e. the bonding interface) to promiscuous
mode, which might not be what you want.
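        A minimal sketch of adding a VLAN on top of a bond with the
8021q tools (the VLAN ID 100 and the address below are assumptions,
and at least one slave must already be enslaved to bond0):

# modprobe 8021q
# vconfig add bond0 100
# ifconfig bond0.100 192.168.100.1 netmask 255.255.255.0 up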

8. Link Monitoring

        The bonding driver at present supports two schemes for
monitoring a slave device's link state: the ARP monitor and the MII
monitor.

        At the present time, due to implementation restrictions in the
bonding driver itself, it is not possible to enable both ARP and MII
monitoring simultaneously.

8.1 ARP Monitor Operation

        The ARP monitor operates as its name suggests: it sends ARP
queries to one or more designated peer systems on the network, and
uses the response as an indication that the link is operating.  This
gives some assurance that traffic is actually flowing to and from one
or more peers on the local network.

        The ARP monitor relies on the device driver itself to verify
that traffic is flowing.  In particular, the driver must keep up to
date the last receive time, dev->last_rx, and transmit start time,
dev->trans_start.  If these are not updated by the driver, then the
ARP monitor will immediately fail any slaves using that driver, and
those slaves will stay down.  If network monitoring (tcpdump, etc)
shows the ARP requests and replies on the network, then it may be that
your device driver is not updating last_rx and trans_start.

8.2 Configuring Multiple ARP Targets

        While ARP monitoring can be done with just one target, it can
be useful in a High Availability setup to have several targets to
monitor.  In the case of just one target, the target itself may go
down or have a problem making it unresponsive to ARP requests.  Having
an additional target (or several) increases the reliability of the ARP
monitoring.

        Multiple ARP targets must be separated by commas as follows:

# example options for ARP monitoring with three targets
alias bond0 bonding
options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9

        For just a single target the options would resemble:

# example options for ARP monitoring with one target
alias bond0 bonding
options bond0 arp_interval=60 arp_ip_target=192.168.0.100

8.3 MII Monitor Operation

        The MII monitor monitors only the carrier state of the local
network interface.  It accomplishes this in one of three ways: by
depending upon the device driver to maintain its carrier state, by
querying the device's MII registers, or by making an ethtool query to
the device.

        If the use_carrier module parameter is 1 (the default value),
then the MII monitor will rely on the driver for carrier state
information (via the netif_carrier subsystem).  As explained in the
use_carrier parameter information, above, if the MII monitor fails to
detect carrier loss on the device (e.g., when the cable is physically
disconnected), it may be that the driver does not support
netif_carrier.

        If use_carrier is 0, then the MII monitor will first query the
device's (via ioctl) MII registers and check the link state.  If that
request fails (not just that it returns carrier down), then the MII
monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain
the same information.  If both methods fail (i.e., the driver either
does not support or had some error in processing both the MII register
and ethtool requests), then the MII monitor will assume the link is
up.

9. Potential Sources of Trouble

9.1 Adventures in Routing

        When bonding is configured, it is important that the slave
devices not have routes that supercede routes of the master (or,
generally, not have routes at all).  For example, suppose the bonding
device bond0 has two slaves, eth0 and eth1, and the routing table is
as follows:

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 eth0
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 eth1
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 bond0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo

        This routing configuration will likely still update the
receive/transmit times in the driver (needed by the ARP monitor), but
may bypass the bonding driver (because outgoing traffic to, in this
case, another host on network 10 would use eth0 or eth1 before bond0).

        The ARP monitor (and ARP itself) may become confused by this
configuration, because ARP requests (generated by the ARP monitor)
will be sent on one interface (bond0), but the corresponding reply
will arrive on a different interface (eth0).  This reply looks to ARP
as an unsolicited ARP reply (because ARP matches replies on an
interface basis), and is discarded.  The MII monitor is not affected
by the state of the routing table.

        The solution here is simply to insure that slaves do not have
routes of their own, and if for some reason they must, those routes do
not supercede routes of their master.  This should generally be the
case, but unusual configurations or errant manual or automatic static
route additions may cause trouble.

9.2 Ethernet Device Renaming

        On systems with network configuration scripts that do not
associate physical devices directly with network interface names (so
that the same physical device always has the same "ethX" name), it may
be necessary to add some special logic to either /etc/modules.conf or
/etc/modprobe.conf (depending upon which is installed on the system).

        For example, given a modules.conf containing the following:

alias bond0 bonding
options bond0 mode=some-mode miimon=50
alias eth0 tg3
alias eth1 tg3
alias eth2 e1000
alias eth3 e1000

        If neither eth0 and eth1 are slaves to bond0, then when the
bond0 interface comes up, the devices may end up reordered.  This
happens because bonding is loaded first, then its slave device's
drivers are loaded next.  Since no other drivers have been loaded,
when the e1000 driver loads, it will receive eth0 and eth1 for its
devices, but the bonding configuration tries to enslave eth2 and eth3
(which may later be assigned to the tg3 devices).

        Adding the following:

add above bonding e1000 tg3

        causes modprobe to load e1000 then tg3, in that order, when
bonding is loaded.  This command is fully documented in the
modules.conf manual page.

        On systems utilizing modprobe.conf (or modprobe.conf.local),
an equivalent problem can occur.  In this case, the following can be
added to modprobe.conf (or modprobe.conf.local, as appropriate), as
follows (all on one line; it has been split here for clarity):

install bonding /sbin/modprobe tg3; /sbin/modprobe e1000;
        /sbin/modprobe --ignore-install bonding

        This will, when loading the bonding module, rather than
performing the normal action, instead execute the provided command.
This command loads the device drivers in the order needed, then calls
modprobe with --ignore-install to cause the normal action to then take
place.  Full documentation on this can be found in the modprobe.conf
and modprobe manual pages.

9.3. Painfully Slow Or No Failed Link Detection By Miimon

        By default, bonding enables the use_carrier option, which
instructs bonding to trust the driver to maintain carrier state.

        As discussed in the options section, above, some drivers do
not support the netif_carrier_on/_off link state tracking system.
With use_carrier enabled, bonding will always see these links as up,
regardless of their actual state.

        Additionally, other drivers do support netif_carrier, but do
not maintain it in real time, e.g., only polling the link state at
some fixed interval.  In this case, miimon will detect failures, but
only after some long period of time has expired.  If it appears that
miimon is very slow in detecting link failures, try specifying
use_carrier=0 to see if that improves the failure detection time.  If
it does, then it may be that the driver checks the carrier state at a
fixed interval, but does not cache the MII register values (so the
use_carrier=0 method of querying the registers directly works).  If
use_carrier=0 does not improve the failover, then the driver may cache
the registers, or the problem may be elsewhere.
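        For example, a minimal sketch of module options that disable
use_carrier alongside MII monitoring (the mode and interval shown are
just examples) would be:

alias bond0 bonding
options bond0 mode=active-backup miimon=100 use_carrier=0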

        Also, remember that miimon only checks for the device's
carrier state.  It has no way to determine the state of devices on or
beyond other ports of a switch, or if a switch is refusing to pass
traffic while still maintaining carrier on.

10. SNMP agents

        If running SNMP agents, the bonding driver should be loaded
before any network drivers participating in a bond.  This requirement
is due to the interface index (ipAdEntIfIndex) being associated to
the first interface found with a given IP address.  That is, there is
only one ipAdEntIfIndex for each IP address.  For example, if eth0 and
eth1 are slaves of bond0 and the driver for eth0 is loaded before the
bonding driver, the interface for the IP address will be associated
with the eth0 interface.  This configuration is shown below; the IP
address 192.168.1.1 has an interface index of 2 which indexes to eth0
in the ifDescr table (ifDescr.2).

    interfaces.ifTable.ifEntry.ifDescr.1 = lo

    interfaces.ifTable.ifEntry.ifDescr.2 = eth0

    interfaces.ifTable.ifEntry.ifDescr.3 = eth1

    interfaces.ifTable.ifEntry.ifDescr.4 = eth2

    interfaces.ifTable.ifEntry.ifDescr.5 = eth3

    interfaces.ifTable.ifEntry.ifDescr.6 = bond0

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1

        This problem is avoided by loading the bonding driver before
any network drivers participating in a bond.  Below is an example of
loading the bonding driver first; the IP address 192.168.1.1 is
correctly associated with ifDescr.2.

    interfaces.ifTable.ifEntry.ifDescr.1 = lo

    interfaces.ifTable.ifEntry.ifDescr.2 = bond0

    interfaces.ifTable.ifEntry.ifDescr.3 = eth0

    interfaces.ifTable.ifEntry.ifDescr.4 = eth1

    interfaces.ifTable.ifEntry.ifDescr.5 = eth2

    interfaces.ifTable.ifEntry.ifDescr.6 = eth3

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5

    ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1

        While some distributions may not report the interface name in
ifDescr, the association between the IP address and IfIndex remains
and SNMP functions such as Interface_Scan_Next will report that
association.

11. Promiscuous mode

        When running network monitoring tools, e.g., tcpdump, it is
common to enable promiscuous mode on the device, so that all traffic
is seen (instead of seeing only traffic destined for the local host).
The bonding driver handles promiscuous mode changes to the bonding
master device (e.g., bond0), and propagates the setting to the slave
devices.

        For the balance-rr, balance-xor, broadcast, and 802.3ad modes,
the promiscuous mode setting is propagated to all slaves.

        For the active-backup, balance-tlb and balance-alb modes, the
promiscuous mode setting is propagated only to the active slave.

        For balance-tlb mode, the active slave is the slave currently
receiving inbound traffic.

        For balance-alb mode, the active slave is the slave used as a
"primary."  This slave is used for mode-specific control traffic, for
sending to peers that are unassigned or if the load is unbalanced.

        For the active-backup, balance-tlb and balance-alb modes, when
the active slave changes (e.g., due to a link failure), the
promiscuous setting will be propagated to the new active slave.
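        As a quick sketch (the interface names are assumptions), the
propagation can be observed by starting a capture on the bond and then
checking the flags of a slave, which should now include PROMISC:

# tcpdump -ni bond0
  (in another shell)
# ifconfig eth0 | grep -i promisc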

12. Configuring Bonding for High Availability

        High Availability refers to configurations that provide
maximum network availability by having redundant or backup devices,
links or switches between the host and the rest of the world.  The
goal is to provide the maximum availability of network connectivity
(i.e., the network always works), even though other configurations
could provide higher throughput.

12.1 High Availability in a Single Switch Topology

        If two hosts (or a host and a single switch) are directly
connected via multiple physical links, then there is no availability
penalty to optimizing for maximum bandwidth.  In this case, there is
only one switch (or peer), so if it fails, there is no alternative
access to fail over to.  Additionally, the bonding load balance modes
support link monitoring of their members, so if individual links fail,
the load will be rebalanced across the remaining devices.

        See Section 13, "Configuring Bonding for Maximum Throughput"
for information on configuring bonding with one peer device.

12.2 High Availability in a Multiple Switch Topology

        With multiple switches, the configuration of bonding and the
network changes dramatically.  In multiple switch topologies, there is
a trade off between network availability and usable bandwidth.

        Below is a sample network, configured to maximize the
availability of the network:

                |                                     |
                |port3                           port3|
          +-----+----+                          +-----+----+
          |          |port2       ISL      port2|          |
          | switch A +--------------------------+ switch B |
          |          |                          |          |
          +-----+----+                          +-----+----+
                |port1                           port1|
                |            +-------+                |
                +------------+ host1 +----------------+
                        eth0 +-------+ eth1

        In this configuration, there is a link between the two
switches (ISL, or inter switch link), and multiple ports connecting to
the outside world ("port3" on each switch).  There is no technical
reason that this could not be extended to a third switch.

12.2.1 HA Bonding Mode Selection for Multiple Switch Topology

        In a topology such as the example above, the active-backup and
broadcast modes are the only useful bonding modes when optimizing for
availability; the other modes require all links to terminate on the
same peer for them to behave rationally.

active-backup: This is generally the preferred mode, particularly if
        the switches have an ISL and play together well.  If the
        network configuration is such that one switch is specifically
        a backup switch (e.g., has lower capacity, higher cost, etc),
        then the primary option can be used to insure that the
        preferred link is always used when it is available (a sample
        set of module options follows this list).

broadcast: This mode is really a special purpose mode, and is suitable
        only for very specific needs.  For example, if the two
        switches are not connected (no ISL), and the networks beyond
        them are totally independent.  In this case, if it is
        necessary for some specific one-way traffic to reach both
        independent networks, then the broadcast mode may be suitable.
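        A minimal sketch of module options for this topology (the
device name, monitoring interval and target address below are
assumptions), using active-backup with eth0 as the preferred primary
and the ARP monitor pointed at a host reachable through either switch:

alias bond0 bonding
options bond0 mode=active-backup primary=eth0 arp_interval=500 arp_ip_target=192.168.0.254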

12.2.2 HA Link Monitoring Selection for Multiple Switch Topology

        The choice of link monitoring ultimately depends upon your
switch.  If the switch can reliably fail ports in response to other
failures, then either the MII or ARP monitors should work.  For
example, in the above example, if the "port3" link fails at the remote
end, the MII monitor has no direct means to detect this.  The ARP
monitor could be configured with a target at the remote end of port3,
thus detecting that failure without switch support.

        In general, however, in a multiple switch topology, the ARP
monitor can provide a higher level of reliability in detecting end to
end connectivity failures (which may be caused by the failure of any
individual component to pass traffic for any reason).  Additionally,
the ARP monitor should be configured with multiple targets (at least
one for each switch in the network).  This will insure that,
regardless of which switch is active, the ARP monitor has a suitable
target to query.

13. Configuring Bonding for Maximum Throughput

13.1 Maximizing Throughput in a Single Switch Topology

        In a single switch configuration, the best method to maximize
throughput depends upon the application and network environment.  The
various load balancing modes each have strengths and weaknesses in
different environments, as detailed below.

        For this discussion, we will break down the topologies into
two categories.  Depending upon the destination of most traffic, we
categorize them into either "gatewayed" or "local" configurations.

        In a gatewayed configuration, the "switch" is acting primarily
as a router, and the majority of traffic passes through this router to
other networks.  An example would be the following:

     +----------+                     +----------+
     |          |eth0            port1|          |  to other networks
     |  Host A  +---------------------+  router  +------------------->
     |          +---------------------+          |  Hosts B and C are out
     |          |eth1            port2|          |  here somewhere
     +----------+                     +----------+

        The router may be a dedicated router device, or another host
acting as a gateway.  For our discussion, the important point is that
the majority of traffic from Host A will pass through the router to
some other network before reaching its final destination.

        In a gatewayed network configuration, although Host A may
communicate with many other systems, all of its traffic will be sent
and received via one other peer on the local network, the router.

        Note that the case of two systems connected directly via
multiple physical links is, for purposes of configuring bonding, the
same as a gatewayed configuration.  In that case, it happens that all
traffic is destined for the "gateway" itself, not some other network
beyond the gateway.

        In a local configuration, the "switch" is acting primarily as
a switch, and the majority of traffic passes through this switch to
reach other stations on the same network.  An example would be the
following:

    +----------+            +----------+       +--------+
    |          |eth0   port1|          +-------+ Host B |
    |  Host A  +------------+  switch  |port3  +--------+
    |          +------------+          |                 +--------+
    |          |eth1   port2|          +-----------------+ Host C |
    +----------+            +----------+port4            +--------+

        Again, the switch may be a dedicated switch device, or another
host acting as a gateway.  For our discussion, the important point is
that the majority of traffic from Host A is destined for other hosts
on the same local network (Hosts B and C in the above example).

        In summary, in a gatewayed configuration, traffic to and from
the bonded device will be to the same MAC level peer on the network
(the gateway itself, i.e., the router), regardless of its final
destination.  In a local configuration, traffic flows directly to and
from the final destinations, thus, each destination (Host B, Host C)
will be addressed directly by their individual MAC addresses.

        This distinction between a gatewayed and a local network
configuration is important because many of the load balancing modes
available use the MAC addresses of the local network source and
destination to make load balancing decisions.  The behavior of each
mode is described below.

13.1.1 MT Bonding Mode Selection for Single Switch Topology

        This configuration is the easiest to set up and to understand,
although you will have to decide which bonding mode best suits your
needs.  The trade offs for each mode are detailed below:

balance-rr: This mode is the only mode that will permit a single
        TCP/IP connection to stripe traffic across multiple
        interfaces.  It is therefore the only mode that will allow a
        single TCP/IP stream to utilize more than one interface's
        worth of throughput.  This comes at a cost, however: the
        striping often results in peer systems receiving packets out
        of order, causing TCP/IP's congestion control system to kick
        in, often by retransmitting segments.

        It is possible to adjust TCP/IP's congestion limits by
        altering the net.ipv4.tcp_reordering sysctl parameter (see the
        sysctl example after this mode list).  The usual default value
        is 3, and the maximum useful value is 127.

        For a four interface balance-rr bond, expect that a single
        TCP/IP stream will utilize no more than approximately 2.3
        interface's worth of throughput, even after adjusting
        tcp_reordering.

        Note that this out of order delivery occurs when both the
        sending and receiving systems are utilizing a multiple
        interface bond.  Consider a configuration in which a
        balance-rr bond feeds into a single higher capacity network
        channel (e.g., multiple 100Mb/sec ethernets feeding a single
        gigabit ethernet via an etherchannel capable switch).  In this
        configuration, traffic sent from the multiple 100Mb devices to
        a destination connected to the gigabit device will not see
        packets out of order.  However, traffic sent from the gigabit
        device to the multiple 100Mb devices may or may not see
        traffic out of order, depending upon the balance policy of the
        switch.  Many switches do not support any modes that stripe
        traffic (instead choosing a port based upon IP or MAC level
        addresses); for those devices, traffic flowing from the
        gigabit device to the many 100Mb devices will only utilize one
        interface.

        If you are utilizing protocols other than TCP/IP, UDP for
        example, and your application can tolerate out of order
        delivery, then this mode can allow for single stream datagram
        performance that scales near linearly as interfaces are added
        to the bond.

        This mode requires the switch to have the appropriate ports
        configured for "etherchannel" or "trunking."

active-backup: There is not much advantage in this network topology to
        the active-backup mode, as the inactive backup devices are all
        connected to the same peer as the primary.  In this case, a
        load balancing mode (with link monitoring) will provide the
        same level of network availability, but with increased
        available bandwidth.  On the plus side, active-backup mode
        does not require any configuration of the switch, so it may
        have value if the hardware available does not support any of
        the load balance modes.

balance-xor: This mode will limit traffic such that packets destined
        for specific peers will always be sent over the same
        interface.  Since the destination is determined by the MAC
        addresses involved, this mode works best in a "local" network
        configuration (as described above), with destinations all on
        the same local network.  This mode is likely to be suboptimal
        if all your traffic is passed through a single router (i.e., a
        "gatewayed" network configuration, as described above).

        As with balance-rr, the switch ports need to be configured for
        "etherchannel" or "trunking."

broadcast: Like active-backup, there is not much advantage to this
        mode in this type of network topology.

802.3ad: This mode can be a good choice for this type of network
        topology.  The 802.3ad mode is an IEEE standard, so all peers
        that implement 802.3ad should interoperate well.  The 802.3ad
        protocol includes automatic configuration of the aggregates,
        so minimal manual configuration of the switch is needed
        (typically only to designate that some set of devices is
        available for 802.3ad).  The 802.3ad standard also mandates
        that frames be delivered in order (within certain limits), so
        in general single connections will not see misordering of
        packets.  The 802.3ad mode does have some drawbacks: the
        standard mandates that all devices in the aggregate operate at
        the same speed and duplex.  Also, as with all bonding load
        balance modes other than balance-rr, no single connection will
        be able to utilize more than a single interface's worth of
        bandwidth.

        Additionally, the linux bonding 802.3ad implementation
        distributes traffic by peer (using an XOR of MAC addresses),
        so in a "gatewayed" configuration, all outgoing traffic will
        generally use the same device.  Incoming traffic may also end
        up on a single device, but that is dependent upon the
        balancing policy of the peer's 802.3ad implementation.  In a
        "local" configuration, traffic will be distributed across the
        devices in the bond.

        Finally, the 802.3ad mode mandates the use of the MII monitor,
        therefore, the ARP monitor is not available in this mode.

balance-tlb: The balance-tlb mode balances outgoing traffic by peer.
        Since the balancing is done according to MAC address, in a
        "gatewayed" configuration (as described above), this mode will
        send all traffic across a single device.  However, in a
        "local" network configuration, this mode balances multiple
        local network peers across devices in a vaguely intelligent
        manner (not a simple XOR as in balance-xor or 802.3ad mode),
        so that mathematically unlucky MAC addresses (i.e., ones that
        XOR to the same value) will not all "bunch up" on a single
        interface.

        Unlike 802.3ad, interfaces may be of differing speeds, and no
        special switch configuration is required.  On the down side,
        in this mode all incoming traffic arrives over a single
        interface, this mode requires certain ethtool support in the
        network device driver of the slave interfaces, and the ARP
        monitor is not available.

balance-alb: This mode is everything that balance-tlb is, and more.
        It has all of the features (and restrictions) of balance-tlb,
        and will also balance incoming traffic from local network
        peers (as described in the Bonding Module Options section,
        above).

        The only additional down side to this mode is that the network
        device driver must support changing the hardware address while
        the device is open.
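        As referenced in the balance-rr entry above, a minimal sketch
of raising the reordering limit (127 is the documented maximum useful
value) is:

# sysctl -w net.ipv4.tcp_reordering=127

        or, equivalently:

# echo 127 > /proc/sys/net/ipv4/tcp_reordering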

13.1.2 MT Link Monitoring for Single Switch Topology

        The choice of link monitoring may largely depend upon which
mode you choose to use.  The more advanced load balancing modes do not
support the use of the ARP monitor, and are thus restricted to using
the MII monitor (which does not provide as high a level of end to end
assurance as the ARP monitor).

13.2 Maximum Throughput in a Multiple Switch Topology

        Multiple switches may be utilized to optimize for throughput
when they are configured in parallel as part of an isolated network
between two or more systems, for example:

                       +-----------+
                       |  Host A   |
                       +-+---+---+-+
                         |   |   |
                +--------+   |   +---------+
                |            |             |
         +------+---+  +-----+----+  +-----+----+
         | Switch A |  | Switch B |  | Switch C |
         +------+---+  +-----+----+  +-----+----+
                |            |             |
                +--------+   |   +---------+
                         |   |   |
                       +-+---+---+-+
                       |  Host B   |
                       +-----------+

        In this configuration, the switches are isolated from one
another.  One reason to employ a topology such as this is for an
isolated network with many hosts (a cluster configured for high
performance, for example); using multiple smaller switches can be more
cost effective than a single larger switch, e.g., on a network with 24
hosts, three 24 port switches can be significantly less expensive than
a single 72 port switch.

        If access beyond the network is required, an individual host
can be equipped with an additional network device connected to an
external network; this host then additionally acts as a gateway.

13.2.1 MT Bonding Mode Selection for Multiple Switch Topology

        In actual practice, the bonding mode typically employed in
configurations of this type is balance-rr.  Historically, in this
network configuration, the usual caveats about out of order packet
delivery are mitigated by the use of network adapters that do not do
any kind of packet coalescing (via the use of NAPI, or because the
device itself does not generate interrupts until some number of
packets has arrived).  When employed in this fashion, the balance-rr
mode allows individual connections between two hosts to effectively
utilize greater than one interface's bandwidth.

13.2.2 MT Link Monitoring for Multiple Switch Topology

        Again, in actual practice, the MII monitor is most often used
in this configuration, as performance is given preference over
availability.  The ARP monitor will function in this topology, but its
advantages over the MII monitor are mitigated by the volume of probes
needed as the number of systems involved grows (remember that each
host in the network is configured with bonding).

14. Switch Behavior Issues

14.1 Link Establishment and Failover Delays

        Some switches exhibit undesirable behavior with regard to the
timing of link up and down reporting by the switch.

        First, when a link comes up, some switches may indicate that
the link is up (carrier available), but not pass traffic over the
interface for some period of time.  This delay is typically due to
some type of autonegotiation or routing protocol, but may also occur
during switch initialization (e.g., during recovery after a switch
failure).  If you find this to be a problem, specify an appropriate
value to the updelay bonding module option to delay the use of the
relevant interface(s).

        Second, some switches may "bounce" the link state one or more
times while a link is changing state.  This occurs most commonly while
the switch is initializing.  Again, an appropriate updelay value may
help.

        Note that when a bonding interface has no active links, the
driver will immediately reuse the first link that goes up, even if the
updelay parameter has been specified (the updelay is ignored in this
case).  If there are slave interfaces waiting for the updelay timeout
to expire, the interface that first went into that state will be
immediately reused.  This reduces down time of the network if the
value of updelay has been overestimated, and since this occurs only in
cases with no connectivity, there is no additional penalty for
ignoring the updelay.

        In addition to the concerns about switch timings, if your
switches take a long time to go into backup mode, it may be desirable
to not activate a backup interface immediately after a link goes down.
Failover may be delayed via the downdelay bonding module option.

14.2 Duplicated Incoming Packets

        It is not uncommon to observe a short burst of duplicated
traffic when the bonding device is first used, or after it has been
idle for some period of time.  This is most easily observed by issuing
a "ping" to some other host on the network, and noticing that the
output from ping flags duplicates (typically one per slave).

        For example, on a bond in active-backup mode with five slaves
all connected to one switch, the output may appear as follows:

# ping -n 10.0.4.2
PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms

        This is not due to an error in the bonding driver, rather, it
is a side effect of how many switches update their MAC forwarding
tables.  Initially, the switch does not associate the MAC address in
the packet with a particular switch port, and so it may send the
traffic to all ports until its MAC forwarding table is updated.  Since
the interfaces attached to the bond may occupy multiple ports on a
single switch, when the switch (temporarily) floods the traffic to all
ports, the bond device receives multiple copies of the same packet
(one per slave device).

        The duplicated packet behavior is switch dependent, some
switches exhibit this, and some do not.  On switches that display this
behavior, it can be induced by clearing the MAC forwarding table (on
most Cisco switches, the privileged command "clear mac address-table
dynamic" will accomplish this).

15. Hardware Specific Considerations

        This section contains additional information for configuring
bonding on specific hardware platforms, or for interfacing bonding
with particular switches or other devices.

15.1 IBM BladeCenter

        This applies to the JS20 and similar systems.

        On the JS20 blades, the bonding driver supports only
balance-rr, active-backup, balance-tlb and balance-alb modes.  This is
largely due to the network topology inside the BladeCenter, detailed
below.

JS20 network adapter information

        All JS20s come with two Broadcom Gigabit Ethernet ports
integrated on the planar (that's "motherboard" in IBM-speak).  In the
BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to
I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2.
An add-on Broadcom daughter card can be installed on a JS20 to provide
two more Gigabit Ethernet ports.  These ports, eth2 and eth3, are
wired to I/O Modules 3 and 4, respectively.

        Each I/O Module may contain either a switch or a passthrough
module (which allows ports to be directly connected to an external
switch).  Some bonding modes require a specific BladeCenter internal
network topology in order to function; these are detailed below.

        Additional BladeCenter-specific networking information can be
found in two IBM Redbooks (www.ibm.com/redbooks):

"IBM eServer BladeCenter Networking Options"
"IBM eServer BladeCenter Layer 2-7 Network Switching"

BladeCenter networking configuration

        Because a BladeCenter can be configured in a very large number
of ways, this discussion will be confined to describing basic
configurations.

        Normally, Ethernet Switch Modules (ESMs) are used in I/O
modules 1 and 2.  In this configuration, the eth0 and eth1 ports of a
JS20 will be connected to different internal switches (in the
respective I/O modules).

        A passthrough module (OPM or CPM, optical or copper,
passthrough module) connects the I/O module directly to an external
switch.  By using PMs in I/O module #1 and #2, the eth0 and eth1
interfaces of a JS20 can be redirected to the outside world and
connected to a common external switch.

        Depending upon the mix of ESMs and PMs, the network will
appear to bonding as either a single switch topology (all PMs) or as a
multiple switch topology (one or more ESMs, zero or more PMs).  It is
also possible to connect ESMs together, resulting in a configuration
much like the example in "High Availability in a Multiple Switch
Topology," above.

Requirements for specific modes

        The balance-rr mode requires the use of passthrough modules
for devices in the bond, all connected to a common external switch.
That switch must be configured for "etherchannel" or "trunking" on the
appropriate ports, as is usual for balance-rr.

        The balance-alb and balance-tlb modes will function with
either switch modules or passthrough modules (or a mix).  The only
specific requirement for these modes is that all network interfaces
must be able to reach all destinations for traffic sent over the
bonding device (i.e., the network must converge at some point outside
the BladeCenter).

        The active-backup mode has no additional requirements.

Link monitoring issues

        When an Ethernet Switch Module is in place, only the ARP
monitor will reliably detect link loss to an external switch.  This is
nothing unusual, but examination of the BladeCenter cabinet would
suggest that the "external" network ports are the ethernet ports for
the system, when in fact there is a switch between these "external"
ports and the devices on the JS20 system itself.  The MII monitor is
only able to detect link failures between the ESM and the JS20 system.

        When a passthrough module is in place, the MII monitor does
detect failures to the "external" port, which is then directly
connected to the JS20 system.

Other concerns

        The Serial Over LAN (SoL) link is established over the primary
ethernet (eth0) only, therefore, any loss of link to eth0 will result
in losing your SoL connection.  It will not fail over with other
network traffic, as the SoL system is beyond the control of the
bonding driver.

        It may be desirable to disable spanning tree on the switch
(either the internal Ethernet Switch Module, or an external switch) to
avoid fail-over delay issues when using bonding.

16. Frequently Asked Questions

1. Is it SMP safe?

        Yes.  The old 2.0.xx channel bonding patch was not SMP safe.
The new driver was designed to be SMP safe from the start.

2. What type of cards will work with it?

        Any Ethernet type cards (you can even mix cards - an Intel
EtherExpress PRO/100 and a 3com 3c905b, for example).  For most modes,
devices need not be of the same speed.

3. How many bonding devices can I have?

        There is no limit.

4. How many slaves can a bonding device have?

        This is limited only by the number of network interfaces Linux
supports and/or the number of network cards you can place in your
system.

5. What happens when a slave link dies?

        If link monitoring is enabled, then the failing device will be
disabled.  The active-backup mode will fail over to a backup link, and
other modes will ignore the failed link.  The link will continue to be
monitored, and should it recover, it will rejoin the bond (in whatever
manner is appropriate for the mode).  See the sections on High
Availability and the documentation for each mode for additional
information.

       

        Link monitoring can be enabled via either the miimon or
arp_interval parameters (described in the module parameters section,
above).  In general, miimon monitors the carrier state as sensed by
the underlying network device, and the arp monitor (arp_interval)
monitors connectivity to another host on the local network.

        If no link monitoring is configured, the bonding driver will
be unable to detect link failures, and will assume that all links are
always available.  This will likely result in lost packets, and a
resulting degradation of performance.  The precise performance loss
depends upon the bonding mode and network configuration.

6. Can bonding be used for High Availability?

        Yes.  See the section on High Availability for details.

7. Which switches/systems does it work with?

        The full answer to this depends upon the desired mode.

        In the basic balance modes (balance-rr and balance-xor), it
works with any system that supports etherchannel (also called
trunking).  Most managed switches currently available have such
support, and many unmanaged switches as well.

        The advanced balance modes (balance-tlb and balance-alb) do
not have special switch requirements, but do need device drivers that
support specific features (described in the appropriate section under
module parameters, above).

        In 802.3ad mode, it works with systems that support IEEE
802.3ad Dynamic Link Aggregation.  Most managed and many unmanaged
switches currently available support 802.3ad.

        The active-backup mode should work with any Layer-II switch.

8. Where does a bonding device get its MAC address from?

       If not explicitly configured (with ifconfig or ip link), the MAC address of the bonding device is taken from its first slave device. This MAC address is then passed to all following slaves and remains persistent (even if the first slave is removed) until the bonding device is brought down or reconfigured.

       If you wish to change the MAC address, you can set it with ifconfig or ip link:

# ifconfig bond0 hw ether 00:11:22:33:44:55

# ip link set bond0 address 66:77:88:99:aa:bb

       The MAC address can be also changed by bringing down/up the device and then changing its slaves (or their order):

# ifconfig bond0 down ; modprobe -r bonding

# ifconfig bond0 …. up

# ifenslave bond0 eth…

       This method will automatically take the address from the next slave that is added.

       To restore your slaves' MAC addresses, you need to detach them from the bond (`ifenslave -d bond0 eth0`). The bonding driver will then restore the MAC addresses that the slaves had before they were enslaved.

Linux总是可以用一种最简单的方式实现一个很复杂的功能,特别是网络方面的,哪怕这个功能被认为只是在高端设备上才有,linux也可以很容易地实现。以前的文章已经说了不少次了,比如vlan功能,比如高级路由和防火墙功能等等,本文着重说一下linux的bonding,也就是端口聚合的功能模块。不可否认,在网络设备这个层面上,linux搞出了两个很成功的虚拟设备的概念,一个是tap网卡,另一个就是本文所讲述的bonding,关于tap网卡的内容,请参阅之前关于OpenVPN的文章。

 

     如果有一个问题摆在眼前,那就是关于linux bonding有什么比较好的资料,答案就是linux内核的文档,该文档在$KERNEL-ROOT/Documentation/networking/bonding.txt,我觉得没有任何资料比这个更权威了。

一、bonding简介

bonding是一个linux kernel的driver,加载了它以后,linux支持将多个物理网卡捆绑成一个虚拟的bond网卡,随着版本的升级,bond驱动可配置的参数越来越多,而且配置本身也越来越方便了。

     我们在很多地方会使用到物理网卡端口汇聚的功能,比如我们想提升网络速率,比如我们想提供热备份,比如我们想把我们的主机配置成一个网桥,并且使之支持802.3ad动态端口聚合协议等等,然而最重要的还是两点,第一点是负载均衡,第二点就是热备份啦。

二、驱动以及Changes介绍

linux的bonding驱动的最初版本仅仅提供了基本的机制,而且需要在加载模块的时候指定配置参数,如果想更改配置参数,那么必须重新加载bonding模块;后来modprobe支持了一种rename机制,也就是在modprobe的时候可以用-o为模块重新命名,这样就可以让同一个模块以不同的配置参数加载多次。起初,比如我有4个网口,想把两个配置成负载均衡,两个配置成热备,只能手工把bonding编译成不同名称的模块来解决;modprobe有了-o选项之后,就可以两次加载相同的驱动了,比如可以使用:

modprobe bonding -o bond0 mode=0

modprobe bonding -o bond1 mode=1

加载两次bonding驱动,用lsmod看一下,结果是bond0和bond1,并没有bonding,这是由于modprobe加载时命名了,然而最终,这个命名机制不再被支持了,因为正如modprobe的man手册所叙述的一样,-o重命名机制主要适用于test。最后,bonding支持了sysfs的配置机制,对/sys/class/net/目录下的文件进行读或者写就可以完成对驱动的配置。
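下面是一个极简的 sysfs 配置示意(非内核文档原文,物理网口 eth0/eth1 与各参数取值均为假设,具体以 bonding.txt 为准):

# modprobe bonding
# echo +bond0 > /sys/class/net/bonding_masters              # 若加载驱动时未自动创建 bond0
# echo active-backup > /sys/class/net/bond0/bonding/mode    # mode 需在 bond0 down 的状态下设置
# echo 100 > /sys/class/net/bond0/bonding/miimon
# ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
# echo +eth0 > /sys/class/net/bond0/bonding/slaves
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
# cat /proc/net/bonding/bond0                               # 查看当前聚合状态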

     不管怎样,在sysfs完全支持bonding配置之前,如果想往某一个bonding网卡添加设备或者删除设备的时候,还是要使用经典且传统的ioctl调用,因此必然需要一个用户态程序与之对应,该程序就是ifenslave。

     我想,如果linux的所有关于设备的配置都能统一于sysfs,所有的关于内核和进程配置统一于procfs(内核是所有进程共享的地址空间,也有自己的内核线程以及进程0,因此对内核的配置应该在procfs中),对所有的消息,使用netlink通信,这就太好了,摆脱了命令式的ioctl配置,文件式(netlink使用的sendto之类的系统调用也可以归为文件系统调用相关的)的配置将更加高效,简单以及好玩!

三、bonding配置参数

在内核文档中,列举了许多bonding驱动的参数,然而本文不是文档的翻译,因此不再翻译文档,也不介绍和主题无关的参数,仅对比较重要的参数进行介绍,并且这些介绍也不是翻译,而是一些建议或者心得。

ad_select: 802.3ad相关。如果不明白这个,那不要紧,抛开Linux的bonding驱动,直接去看802.3ad的规范就可以了。列举这个选项说明linux bonding驱动完全支持了动态端口聚合协议。

arp_interval和arp_ip_target: 以一个固定的间隔向某些固定的地址发送arp,以监控链路。有些配置下,需要使用arp来监控链路,因为这是一种三层的链路监控,使用网卡状态或者链路层pdu监控只能监控到双绞线两端的接口的健康情况,而监控不到到下一条路由器或者目的主机之间的全部链路的健康状况。

primary: 表示优先权,顺序排列,当出现某种选择事件时,按照从前到后的顺序选择网口,比如802.3ad协议中的选择行为。

fail_over_mac: 对于热备模式是否使用同一个mac地址,如果不使用同一个mac的话,就要完全依赖免费arp机制更新其它机器的arp缓存了。比如,有两块网卡,网卡1和网卡2处于热备模式,网卡1的mac是mac1,网卡2的mac是mac2,网卡1一直是master,但是网卡1突然down掉了,此时需要网卡2接替,然而网卡2的mac地址与之前的网卡1不同,别的主机回复数据包的时候还是使用网卡1的mac地址来回复的,由于mac1已经不在网络上了,这就会导致数据包不会被任何网卡接收。因此网卡2接替了master的角色之后,最好有一个回调事件,处理这个事件的时候,进行一次免费的arp广播,广播自己更换了mac地址。

lacp_rate: 发送802.3ad的LACPDU,以便对端设备自动获取链路聚合的信息。

max_bonds: 初始时创建bond设备接口的数量,默认值是1。但是这个参数并不影响可以创建的最大的bond设备数量。

use_carrier: 使用MII的ioctl还是使用驱动获取保持的状态,如果是前者的话需要自己调用mii的接口进行硬件检测,而后者则是驱动自动进行硬件检测(使用watchdog或者定时器),bonding驱动只是获取结果,然而这依赖网卡驱动必须支持状态检测,如果不支持的话,网卡的状态将一直是on。

mode: 这个参数最重要,配置以什么模式运行,这个参数在bond设备up状态下是不能更改的,必须先down设备(使用ifconfig bondX down)才可以配置,主要的有以下几个:

1.balance-rr or 0: 轮转方式的负载均衡模式,流量轮流在各个bondX的真实设备之间分发。注意,一定要用状态检测机制,否则如果一个设备down掉以后,由于没有状态检测,该设备将一直是up状态,仍然接受发送任务,这将会出现丢包。

2.active-backup or 1: 热备模式。在比较高的版本中,免费arp会在切换时自动发送,避免一些故障,比如fail_over_mac参数描述的故障。

3.balance-xor or 2: 我不知道既然bonding有了xmit_hash_policy这个参数,为何还要将之单独设置成一种模式,在这个模式中,流量也是分发的,和轮转负载不同的是,它使用源/目的mac地址为自变量通过xor|mod函数计算出到底将数据包分发到哪一个口。

4.broadcast or 3: 向所有的口广播数据,这个模式很XX,但是容错性很强大。

5.802.3ad or 4: 这个就不多说了,就是以802.3ad的方式运行。

xmit_hash_policy: 这个参数的重要性我认为仅次于mode参数,mode参数定义了分发模式 ,而这个参数定义了分发策略 ,文档上说这个参数用于mode2和mode4,我觉得还可以定义更为复杂的策略呢。

1.layer2: 使用二层帧头作为计算分发出口的参数,这导致通过同一个网关的数据流将完全从一个端口发送,为了更加细化分发策略,必须使用一些三层信息,然而却增加了计算开销,天啊,一切都要权衡!

2.layer2+3: 在1的基础上增加了三层的ip报头信息,计算量增加了,然而负载却更加均衡了,一个个主机到主机的数据流形成并且同一个流被分发到同一个端口,根据这个思想,如果要使负载更加均衡,我们在继续增加代价的前提下可以拿到4层的信息。

3.layer3+4: 这个还用多说吗?可以形成一个个端口到端口的流,负载更加均衡。然而且慢! 事情还没有结束,虽然策略上我们不想将同一个tcp流的传输处理并行化以避免re-order或者re-transmit,因为tcp本身就是一个串行协议,比如Intel的8257X系列网卡芯片都在尽量减少将一个tcp流的包分发到不同的cpu,同样,端口聚合的环境下,同一个tcp流也应该使用本policy使用同一个端口发送,但是不要忘记,tcp要经过ip,而ip是可能要分段的,分了段的ip数据报中直到其被重组(到达对端或者到达一个使用nat的设备)都再也不能将之划为某个tcp流了。ip是一个完全无连接的协议,它只关心按照本地的mtu进行分段而不管别的,这就导致很多时候我们使用layer3+4策略不会得到完全满意的结果。可是事情又不是那么严重,因为ip只是依照本地的mtu进行分段,而tcp是端到端的,它可以使用诸如mss以及mtu发现之类的机制配合滑动窗口机制最大限度减少ip分段,因此layer3+4策略,很OK!

miimon和arp: 使用miimon仅能检测链路层的状态,也就是链路层的端到端连接(即交换机某个口和与之直连的本地网卡口),然而交换机的上行口如果down掉了还是无法检测到,因此必然需要网络层的状态检测,最简单也是最直接的方式就是arp了,可以直接arp网关,如果定时器到期网关还没有回复arp reply,则认为链路不通了。
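一个示意性的加载参数写法(取值仅为举例,网关 192.168.1.1 为假设地址,实际请换成自己的监控目标):

# modprobe bonding mode=active-backup miimon=100
# 或者改用三层的 arp 监控:
# modprobe bonding mode=active-backup arp_interval=1000 arp_ip_target=192.168.1.1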

 

35、Pssh

是一个可以在多台服务器上并行执行命令的工具,同时支持拷贝文件,是同类工具中很出色的。使用时必须在各个服务器上配置好密钥认证访问。

 

在系统centos 5.6 64位 和 red hat enterprise linux 6.1 64位中测试通过

1   安装pssh

    在http://www.theether.org/pssh/  或者http://code.google.com/p/parallel-ssh/下载pssh最新版本

   #   wget  http://www.theether.org/pssh/pssh-1.4.3.tar.gz

   #   tar zxvf pssh-1.4.3.tar.gz

  #   cd pssh-1.4.3

 #   wget 'http://peak.telecommunity.com/dist/ez_setup.py'

 #   python ez_setup.py

 # python setup.py install
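安装完成后的用法示意(假设已配好 SSH 密钥认证,hosts.txt 每行写一台主机,具体选项以所装版本的 --help 为准):

 #   pssh -h hosts.txt -l root -i 'uptime'             # 在所有主机上并行执行命令并汇总输出
 #   pscp -h hosts.txt -l root /tmp/app.conf /tmp/     # 并行拷贝文件到所有主机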

36、道

学会翻墙并搭建自己的网站;

去stackoverflow.com上回答10个问题;

在ATA上发一篇技术文章;

了解淘宝网站是如何支撑秒杀的;

了解支撑双11这样的大促都做了什么;

尝试做一件非工作职责范围,对团队或公司业务有帮助的事;

尝试给涉及的或感兴趣的开源软件提交一个patch;

初步判断自己希望发展的方向

 

 

37、High Speed FrameWork(HSF)

远程服务调用框架(RPC)

方便易用,对Java代码侵入很小

支撑了800多个线上业务系统,双11HSF日调用量超过1200亿

37、淘宝分布式数据层(TDDL)

数据源管理,数据切分,去IOE的最重要组件之一

安全稳定,从未出过严重故障

支持了600多个线上的业务系统

38、Ipsan

39、Storage Area Network

存储区域网络,多采用高速光纤通道,对速率、冗余性要求高

使用iSCSI存储协议,块级传输
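IP SAN 在主机端通常用 open-iscsi 发起连接,下面是一个示意(IP、端口与 IQN 均为假设值):

# iscsiadm -m discovery -t sendtargets -p 10.0.0.10:3260
# iscsiadm -m node -T iqn.2000-01.com.example:storage.lun1 -p 10.0.0.10:3260 --login
# fdisk -l          # 登录成功后可以看到新增的块设备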

 

40、IDC

IDC 互联网数据中心(Internet Data Center)简称IDC,就是电信部门利用已有的互联网通信线路、带宽资源,建立标准化的电信专业级机房环境,为企业、政府提供服务器托管、租用以及相关增值等方面的全方位服务。

 

41、TDP

TDP的英文全称是“Thermal Design Power”,中文直译是“散热设计功耗”,主要提供给计算机系统厂商、散热片/风扇厂商以及机箱厂商等进行系统设计时使用。TDP主要应用于CPU:CPU的TDP值对应该系列CPU的最终版本在满负荷(理论上CPU利用率为100%)时可能达到的最高散热量,散热器必须保证在处理器达到TDP最大值时,处理器的温度仍然在设计范围之内。

 

42、DIMM

(Dual Inline Memory Module,双列直插内存模块)与SIMM相当类似,不同的只是DIMM的金手指两端不像SIMM那样是互通的,它们各自独立传输信号,因此可以满足更多数据信号的传送需要。

 

NUMA     

MySQL单机多实例方案

MySQL单机多实例方案,是指在一台物理的PC服务器上运行多个MySQL数据库实例,为什么要这样做?这样做的好处是什么?

1.存储技术飞速发展,IO不再是瓶颈

普通PC服务器的CPU与IO资源不均衡,因为磁盘的IO能力非常有限,为了满足应用的需要,往往需要配置大量的服务器,这样就造成CPU资源的大量浪费。但是,Flash存储技术的出现改变了这一切,单机的IO能力不再是瓶颈,可以在单机运行多个MySQL实例提升CPU利用率。

2.MySQL对多核CPU利用率低

MySQL对多核CPU的利用率不高,一直是个问题,5.1版本以前的MySQL,当CPU超过4个核时,性能无法线性扩展。虽然MySQL后续版本一直在改进这个问题,包括Innodb plugin和Percona XtraDB都对多核CPU的利用率改进了很多,但是依然无法实现性能随着CPU core的增加而提升。我们现在常用的双路至强服务器,单颗CPU有4-8个core,在操作系统上可以看到16-32 CPU(每个core有两个线程),四路服务器可以达到64 core甚至更多,所以提升MySQL对于多核CPU的利用率是提升性能的重要手段。下图是Percona的一份测试数据:

 

3.NUMA对MySQL性能的影响

我们现在使用的PC服务器都是NUMA架构的,下图是Intel 5600 CPU的架构:

 

NUMA的内存分配策略有四种:

1.缺省(default):总是在本地节点分配(分配在当前进程运行的节点上);

2.绑定(bind):强制分配到指定节点上;

3.交叉(interleave):在所有节点或者指定的节点上交织分配;

4.优先(preferred):在指定节点上分配,失败则在其他节点上分配。

因为NUMA默认的内存分配策略是优先在进程所在CPU的本地内存中分配,会导致CPU节点之间内存分配不均衡,当某个CPU节点的内存不足时,会导致swap产生,而不是从远程节点分配内存。这就是所谓的swap insanity现象。

MySQL采用了线程模式,对于NUMA特性的支持并不好,如果单机只运行一个MySQL实例,我们可以选择关闭NUMA,关闭的方法有三种:1.硬件层,在BIOS中设置关闭;2.OS内核,启动时设置numa=off;3.可以用numactl命令将内存分配策略修改为interleave(交叉),有些硬件可以在BIOS中设置。

如果单机运行多个MySQL实例,我们可以将MySQL绑定在不同的CPU节点上,并且采用绑定的内存分配策略,强制在本节点内分配内存,这样既可以充分利用硬件的NUMA特性,又避免了单实例MySQL对多核CPU利用率不高的问题。
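一个绑定示意(端口与配置文件路径均为假设;选项名以所装 numactl 版本的 man 手册为准,常见写法是 --cpunodebind/-N 配合 --localalloc):

# numactl --cpunodebind=0 --localalloc mysqld --defaults-file=/etc/my3306.cnf &
# numactl --cpunodebind=1 --localalloc mysqld --defaults-file=/etc/my3307.cnf &
# numactl --hardware        # 查看 NUMA 节点与内存分布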

 

资源隔离方案

1.CPU,Memory

numactl --cpubind=0 --localalloc,此命令将MySQL绑定在不同的CPU节点上,cpubind是指NUMA概念中的CPU节点,可以用numactl --hardware查看,localalloc参数指定内存为本地分配策略。

2.IO

我们在机器中内置了fusionio卡(320G),配合flashcache技术,单机的IO不再成为瓶颈,所以IO我们采用了多实例共享的方式,并没有对IO做资源限制。多个MySQL实例使用相同的物理设备,用不同的目录来进行区分。

3.Network

因为单机运行多个实例,必须对网络进行优化,我们通过多个IP的方式,将多个MySQL实例绑定在不同的网卡上,从而提高整体的网络能力。还有一种更高级的做法是,将不同网卡的中断与CPU绑定,这样可以大幅度提升网卡的效率。

4.为什么不采用虚拟机

虚拟机会耗费额外的资源,而且MySQL属于IO类型的应用,采用虚拟机会大幅度降低IO的性能,而且虚拟机的管理成本比较高。所以,我们的数据库都不采用虚拟机的方式。

5.性能

下图是Percona的测试数据,可以看到运行两个实例的提升非常明显。

 

高可用方案

因为单机运行了多个MySQL实例,所以不能采用主机层面的HA策略,比如heartbeat。因为当一个MySQL实例出现问题时,无法将整个机器切换。所以必须改为MySQL实例级别的HA策略,我们采用了自己开发的MySQL访问层来解决HA的问题,当某个实例出问题时,只切换一个实例,对于应用来说,这一层是透明的。

MySQL单机多实例方案的优点

1.节省成本,虽然采用Flash存储的成本比较高,但是如果可以缩减机器的数量,考虑到节省电力和机房使用的成本,还是比单机单实例的方案更便宜。

2.提升利用率,利用NUMA特性,将MySQL实例绑定在不同的CPU节点,不仅提高了CPU利用率,同时解决了MySQL对多核CPU的利用率问题。

3.提升用户体验,采用Flash存储技术,大幅度降低IO响应时间,有助于提升用户的体验。

–EOF–

关于NUMA可以参考这篇文章:NUMA与Intel新一代Xeon处理

 

43、SLC/MLC/TLC

SLC = Single-Level Cell ,即1bit/cell,速度快寿命长,价格超贵(约MLC 3倍以上的价格),约10万次擦写寿命   

MLC = Multi-Level Cell,即2bit/cell,速度一般寿命一般,价格一般,约3000—10000次擦写寿命   

TLC = Trinary-Level Cell,即3bit/cell,也有Flash厂家叫8LC,速度慢寿命短,价格便宜,约500次擦写寿命,目前还没有厂家能做到1000次。

 

1、读写效能区别:相比SLC闪存,MLC的读写效能要差,SLC闪存约可以反复读写10万次左右,而MLC则大约只能读写1万次左右,甚至有部分产品只能达到5000次左右。
2、读写速度区别:在相同条件下,MLC的读写速度要比SLC芯片慢,目前MLC芯片速度大约只有2M左右。
3、能耗区别:在相同使用条件下,MLC能耗比SLC高,要多15%左右的电流消耗。
4、成本区别:MLC内存颗粒容量大,可大幅节省制造商端的成本;SLC容量小,成本高。从性能上讲SLC好,单从性价比讲MLC高。

 

44、存储架构

服务器内置存储:一般指在服务器内部的存储空间如IDE、SCSI、SAS、SATA等。

直接连接存储DAS(Direct Attached Storage):通过IDE、SCSI、FC接口与服务器直接相连,以服务器为中心。客户机的数据访问必须通过服务器,然后经过其I/O总线访问相应的存储设备,服务器实际上起到一种存储转发的作用。

网络连接存储NAS:使用一个专用存储服务器,去掉了通用服务器原有的不适用的大多数计算功能,而仅仅提供文件系统功能,用于存储服务。NAS通过基于 IP的网络文件协议向多种客户端提供文件级I/O服务,客户端可以在NAS存储设备提供的目录或设备中进行文件级操作。专用服务器利用NFS或CIFS,充当远程文件服务器,对外提供了跨平台的文件同时存取服务,因此NAS主要应用于文件共享任务。

存储区域网络SAN(Storage Area Network):通过网络方式连接存储设备和应用服务器,这个网络专用于主机和存储设备之间的访问,当有数据的存取需求时,数据可以通过存储区域网络在服务器和后台存储设备之间高速传输,这种形式的网络存储结构称为SAN。SAN由应用服务器、后端存储系统、SAN连接设备组成。后端存储系统由SAN 控制器和磁盘系统构成。SAN控制器是后端存储系统的关键,它提供存储接入、数据操作及备份、数据共享、数据快照等数据安全管理和系统管理功能。后端存储系统使用磁盘阵列和RAID策略为数据提供存储空间和安全保护措施。SAN连接设备包括交换机、HBA卡和各种介质的连接线。

 

 一、iSCSI存储系统架构之控制器系统架构

  iSCSI的核心处理单元采用与FC光纤存储设备相同的结构。即采用专用的数据传输芯片、专用的RAID数据校验芯片、专用的高性能cache缓存和专用的嵌入式系统平台。打开设备机箱时可以看到iSCSI设备内部采用无线缆的背板结构,所有部件与背板之间通过标准或非标准的插槽链接在一起,而不是普通PC中的多种不同型号和规格的线缆链接。

  这种类型的iSCSI存储系统架构核心处理单元采用高性能的硬件处理芯片,每个芯片功能单一,因此处理效率较高。操作系统是嵌入式设计,与其他类型的操作系统相比,嵌入式操作系统具有体积小、高稳定性、强实时性、固化代码以及操作方便简单等特点。因此控制器架构的iSCSI存储设备具有较高的安全性和和稳定性。

  控制器架构iSCSI存储内部基于无线缆的背板链接方式,完全消除了链接上的单点故障,因此系统更安全,性能更稳定。一般可用于对性能的稳定性和高可用性具有较高要求的在线存储系统,比如:中小型数据库系统,大型数据的库备份系统,远程容灾系统,网站、电力或非线性编辑制作网等。

  控制器架构的iSCSI设备由于核心处理器全部采用硬件,制造成本较高,因此一般销售价格较高。

  目前市场还可以见到一种特殊的基于控制器架构的iSCSI存储设备。该类存储设备是在现有FC存储设备的基础上扩充或者增加iSCSI协议转换模块,使得FC存储设备可以支持FC数据传输协议和iSCSI传输协议,如EMC 150i/300i/500i 等。

  区分一个设备是否是控制器架构,可从以下几个方面去考虑:

  1、是否双控:除了一些早期型号或低端型号外,高性能的iSCSI存储一般都会采用active-active的双控制器工作方式。控制器为模块化设计,并安装在同一个机箱内,非两个独立机箱的控制器。

 

  2、缓存:有双控制器缓存镜像、缓存断电保护功能。

 

  3、数据校验:采用专用硬件校验和数据传输芯片,非依靠普通CPU的软件校验,或普通RAID卡。

 

  4、内部结构:打开控制器架构的设备,内部全部为无线缆的背板式连接方式,各硬件模块连接在背板的各个插槽上。

  二、iSCSI存储系统架构之iSCSI连接桥系统架构

  整个iSCSI存储系统架构分为两个部分,一个部分是前端协议转换设备,另一部分是后端存储。结构上类似NAS网关及其后端存储设备。

  前端协议转换部分一般为硬件设备,主机接口为千兆以太网接口,磁盘接口一般为SCSI接口或FC接口,可连接SCSI磁盘阵列和FC存储设备。通过千兆以太网主机接口对外提供iSCSI数据传输协议。

  后端存储一般采用SCSI磁盘阵列和FC存储设备,将SCSI磁盘阵列和FC存储设备的主机接口直接连接到iSCSI桥的磁盘接口上。

  iSCSI连接桥设备本身只有协议转换功能,没有RAID校验和快照、卷复制等功能。创建RAID组、创建LUN等操作必须在存储设备上完成,存储设备有什么功能,整个iSCSI设备就具有什么样的功能。

  SANRAD的V-Switch系列,ATTO Technology的iPBridge系列的iSCSI桥接器,提供iSCSI-to-SCSI与iSCSI-to-FC 的桥接,可将直连的磁盘阵列柜(Disk Array,JBOD、DAS)或磁带设备(Autoloader、Tape Library)转变成iSCSI存储设备。

  不过随着iSCSI技术的逐渐成熟,连接桥架构的iSCSI设备越来越少,目前的市场上基本已看不到这样的产品了。

  三、iSCSI存储系统架构之PC系统架构

  那么何谓PC架构?按字面的意思可以理解为存储设备建立在PC服务器的基础上。即就是选择一个普通的、性能优良的、可支持多块磁盘的PC(一般为PC服务器和工控服务器),选择一款相对成熟稳定的iSCSI target软件,将iSCSI target软件安装在PC服务器上,使普通的PC服务器转变成一台iSCSI存储设备,并通过PC服务器的以太网卡对外提供iSCSI数据传输协议。

  目前常见的iSCSI target软件多半由商业软件厂商提供,如DataCore Software的SANmelody,FalconStor Software的iSCSI Server for Windows,和String Bean Software的WinTarget等。这些软件都只能运行在Windows操作系统平台上。

  在PC架构的iSCSI存储设备上,所有的RAID组校验、逻辑卷管理、iSCSI 运算、TCP/IP 运算等都是以纯软件方式实现,因此对PC的CPU和内存的性能要求较高。另外iSCSI存储设备的性能极容易受PC服务器运行状态的影响。

  由于PC架构iSCSI存储设备的研发、生产、安装使用相对简单,硬件和软件成本相对较低,因此市场上常见的基于PC架构的iSCSI设备的价格都比较低,在一些对性能稳定性要求较低的系统中具有较大的价格优势。

  四、iSCSI存储系统架构之PC+NIC系统架构

PC+iSCSI target软件方式是一种低价格、低效能的iSCSI存储系统架构解决方案,另外还有一种基于PC+NIC的更高阶、高效能的iSCSI存储系统架构方案。

这款iSCSI存储系统架构方案是指在PC服务器中安装高性能的TOE智能NIC卡,将占用CPU资源较大的iSCSI运算、TCP/IP运算等数据传输操作转移到智能卡的硬件芯片上,由智能卡的专用硬件芯片来完成iSCSI运算、TCP/IP运算等,简化网络两端的内存数据交换程序,从而加速数据传输效率,降低PC的CPU占用,提高存储的性能。

 

1、规划建设和管理SAN存储网络;
2、管理主流中高端存储产品,如EMC、HDS、NETAPP等;
3、应用主流备份软件,如Veritas、TSM、CommVault,实施数据备份解决方案;
4、存储管理、备份管理、数据迁移、性能优化。
职位要求:
1、熟练掌握存储网络建设管理和相关产品方案;
2、熟练掌握主流中高端存储产品,如EMC、NetApp、HDS等;
3、熟练掌握主流备份软件,如Veritas、TSM、CommVault,实施数据备份解决方案;
4、熟悉Linux、AIX、HP-UX系统管理,掌握相关Shell脚本编程;
5、具有一定数据库知识。

45、SSD

 

地址空间虚拟化、容量冗余、垃圾回收、磨损均衡、坏块管理等一系列机制和措施,保证了SSD 使用寿命的最大化。

 

46、error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory


安装了python2.7,第一次执行时报错:

error while loading shared libraries:libpython2.7.so.1.0: cannot open shared object file: No such file or directory

 

 

解决方法如下:

1.编辑      vi /etc/ld.so.conf

如果是非root权限帐号登录,使用 sudo vi /etc/ld.so.conf

添加上python2.7的lib库地址,如我的/usr/local/Python2.7/lib,保存文件

 

 

2.执行 /sbin/ldconfig -v命令,如果是非root权限帐号登录,使用sudo  /sbin/ldconfig -v。这样 ldd 才能找到这个库,执行python2.7就不会报错了

 

 

/etc/ld.so.conf:

这个文件记录了运行时动态链接库的搜索路径。

默认情况下,动态链接器只会使用/lib和/usr/lib这两个目录下的库文件

如果你安装了某些库,没有指定 --prefix=/usr,这样lib库就装到了/usr/local下,而又没有在/etc/ld.so.conf中添加/usr/local/lib,就会报错了

 

 

ldconfig是个什么东东吧 :

它是一个程序,通常它位于/sbin下,是root用户使用的东东。具体作用及用法可以man ldconfig查到

简单的说,它的作用就是将/etc/ld.so.conf列出的路径下的库文件缓存到/etc/ld.so.cache 以供使用

因此当安装完一些库文件(例如刚安装好glib),或者修改ld.so.conf增加新的库路径后,需要运行一下/sbin/ldconfig,使所有的库文件都被缓存到ld.so.cache中。如果没做,即使库文件明明就在/usr/lib下,也是不会被使用的,结果编译或运行过程中报错,缺少xxx库。
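把上面几步串起来,大致就是下面这样(路径沿用文中 /usr/local/Python2.7/lib 的示例,python2.7 的实际安装位置请自行调整):

# echo '/usr/local/Python2.7/lib' >> /etc/ld.so.conf
# /sbin/ldconfig -v
# ldd /usr/local/bin/python2.7 | grep libpython      # 验证 libpython2.7.so.1.0 是否能被找到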

 

47、Cache

为了提高磁盘存取效率, Linux做了一些精心的设计, 除了对dentry进行缓存(用于VFS,加速文件路径名到inode的转换), 还采取了两种主要Cache方式:Buffer Cache和Page Cache。前者针对磁盘块的读写,后者针对文件inode的读写。这些Cache有效缩短了 I/O系统调用(比如read,write,getdents)的时间。
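可以用下面的命令观察这两类Cache的占用,必要时手动释放(释放操作会影响性能,只建议在测试场景使用):

# free -m                                     # buffers、cached两列分别对应Buffer Cache和Page Cache
# sync && echo 3 > /proc/sys/vm/drop_caches   # 1=释放pagecache,2=释放dentries/inodes,3=两者都释放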

 

48、linux性能问题(CPU,内存,磁盘I/O,网络)

一. CPU性能评估

1.vmstat [-V] [-n] [delay [count]]

 

-V : 打印出版本信息,可选参数

-n : 在周期性循环输出时,头部信息仅显示一次

delay : 两次输出之间的时间间隔

count : 按照delay指定的时间间隔统计的次数。默认是1

如:vmstat 1 3

user1@user1-desktop:~$ vmstat 1 3

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----

r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs  us sy id wa

0 0 0 1051676 139504 477028 0 0 46 31 130493 3 1 95 2

0 0 0 1051668 139508 477028 0 0 0 4 3771792 3 1 95 0

0 0 0 1051668 139508 477028 0 0 0 0 3271741 3 1 95 0

r : 运行和等待CPU时间片的进程数(若长期大于CPU的个数,说明CPU不足,需要增加CPU)【注意】

b : 在等待资源的进程数(如等待I/O或者内存交换等)

swpd : 切换到内存交换区的内存数量,单位kB

free : 当前空闲物理内存,单位kB

buff : buffers cache的内存数量,一般对块设备的读写才需要缓存

cache : page cached的内存数量,一般作为文件系统cached,频繁访问的文件都会被cached

si : 每秒从磁盘(交换区)调入内存的数据量,即数据由交换区进入内存,单位kB/s

so : 每秒由内存调出到磁盘(交换区)的数据量,即数据由内存进入交换区,单位kB/s

bi : 从块设备读入数据的总量,即读磁盘,单位kB/s

bo : 写入到块设备的数据总量,即写磁盘,单位kB/s

in : 某一时间间隔中观测到的每秒设备中断数

cs : 每秒产生的上下文切换次数

us :用户进程消耗的CPU时间百分比【注意】

sy : 内核进程消耗CPU时间百分比【注意】

id : CPU处在空闲状态的时间百分比【注意】

wa :IO等待所占用的CPU时间百分比

如果si、so的值长期不为0,表示系统内存不足,需要增加系统内存

bi+bo参考值为1000,若超过1000,且wa较大,表示系统IO有问题,应该提高磁盘的读写性能

in与cs越大,内核消耗的CPU时间就越多

us+sy参考值为80%,如果大于80%,说明可能存在CPU资源不足的情况

 

综上所述,CPU性能评估中重点注意r、us、sy和id列的值。
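下面是一个简单的采样脚本示意(阈值80%取自上文的参考值;us、sy所在列号以实际vmstat输出为准,这里假设分别是第13、14列):

vmstat 5 | awk 'NR > 2 { busy = $13 + $14; if (busy > 80) printf "%s CPU繁忙: us+sy=%d%%\n", strftime("%Y-%m-%d %H:%M:%S"), busy }'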

2. sar [options] [-o filename] [interval [count]]

options:

-A :显示系统所有资源设备(CPU、内存、磁盘)的运行状态

-u : 显示系统所有CPU在采样时间内的负载状态

-P : 显示指定CPU的使用情况(CPU计数从0开始)

-d : 显示所有硬盘设备在采样时间内的使用状况

-r : 显示内存在采样时间内的使用状况

-b : 显示缓冲区在采样时间内的使用情况

-v : 显示进程、文件、I节点和锁表状态

-n :显示网络运行状态。参数后跟DEV(网络接口)、EDEV(网络错误统计)、SOCK(套接字)、FULL(显示其它3个参数所有)。可单独或一起使用

-q : 显示运行队列的大小,与系统当时的平均负载相同

-R : 显示进程在采样时间内的活动情况

-y : 显示终端设备在采样时间内的活动情况

-w : 显示系统交换活动在采样时间内的状态

-o : 将命令结果以二进制格式存放在指定的文件中

interval : 采样间隔时间,必须有的参数

count : 采样次数,默认1

如:sar -u 1 3

user1@user1-desktop:~$ sar -u 1 3

Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)

09时27分18秒 CPU %user %nice %system %iowait %steal %idle

09时27分19秒 all 1.99 0.00 0.50 5.97 0.00 91.54

09时27分20秒 all 3.90 0.00 2.93 5.85 0.00 87.32

09时27分21秒 all 2.93 0.00 1.46 4.39 0.00 91.22

平均时间: all 2.95 0.00 1.64 5.40 0.00 90.02

%user : 用户进程消耗CPU时间百分比

%nice : 运行正常进程消耗CPU时间百分比

%system : 系统进程消耗CPU时间百分比

%iowait : IO等待所占用CPU时间百分比

%steal : 虚拟化环境下,hypervisor为其它虚拟CPU服务而使本CPU处于非自愿等待状态的时间百分比

%idle : CPU处在空闲状态的时间百分比

3. iostat [-c | -d] [-k] [-t] [-x [device]] [interval [count]]

-c :显示CPU使用情况

-d :显示磁盘使用情况

-k : 每秒以k bytes为单位显示数据

-t :打印出统计信息开始执行的时间

-x device :指定要统计的磁盘设备名称,默认为所有磁盘设备

interval :制定两次统计时间间隔

count : 统计次数

如: iostat -c

user1@user1-desktop:~$ iostat -c

Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle

2.51 0.02 1.27 1.40 0.00 94.81

(每项代表的含义与sar相同)

4. uptime ,如:

user1@user1-desktop:~$ uptime

10:13:30 up 1:15, 2 users, load average:0.00, 0.07, 0.11

显示的分别是:系统当前时间,系统上次开机到现在运行了多长时间,目前登录用户个数,系统在1分钟内、5分钟内、15分钟内的平均负载

注意:load average的三个值一般不能大于系统CPU的个数,否则说明CPU很繁忙

二 . 内存性能评估

1. free

2. watch 与 free 相结合,在watch后面跟上需要运行的命令,watch就会自动重复去运行这个命令,默认是2秒执行一次,如:

Every 2.0s: free Sat Mar 5 10:30:17 2011

total used free shared buffers cached

Mem: 2060496 1130188 930308 0 261284 483072

-/+ buffers/cache: 385832 1674664

Swap: 3000316 0 3000316

(-n指定重复执行的时间,-d表示高亮显示变动)

3.使用vmstat,关注swpd、si和so

4. sar -r,如:

user1@user1-desktop:~$ sar -r 2 3

Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)

10时34分11秒 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit

10时34分13秒 923548 1136948 55.18 265456 487156 1347736 26.63

10时34分15秒 923548 1136948 55.18 265464 487148 1347736 26.63

10时34分17秒 923548 1136948 55.18 265464 487156 1347736 26.63

平均时间: 923548 1136948 55.18 265461 487153 1347736 26.63

kbmemfree : 空闲物理内存

kbmemused : 已使用物理内存

%memused : 已使用内存占总内存百分比

kbbuffers : Buffer Cache大小

kbcached : Page Cache大小

kbcommit : 应用程序当前使用内存大小

%commit :应用程序使用内存百分比

三 . 磁盘I/O性能评估

1. sar -d ,如:

user1@user1-desktop:~$ sar -d 1 3

Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)

10时42分27秒 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util

10时42分28秒 dev8-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

10时42分28秒 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util

10时42分29秒 dev8-0 2.00 0.00 64.00 32.00 0.02 8.00 8.00 1.60

10时42分29秒 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util

10时42分30秒 dev8-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

平均时间: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util

平均时间: dev8-0 0.67 0.00 21.33 32.00 0.01 8.00 8.00 0.53

DEV : 磁盘设备名称

tps :每秒到物理磁盘的传送数,即每秒的I/O流量。一个传送就是一个I/O请求,多个逻辑请求可以被合并为一个物理I/O请求

rd_sec/s : 每秒从设备读入的扇区数(1扇区=512字节)

wr_sec/s : 每秒写入设备的扇区数目

avgrq-sz : 平均每次设备I/O操作的数据大小(以扇区为单位)

avgqu-sz : 平均I/O队列的长度

await : 平均每次设备I/O操作的等待时间(毫秒)

svctm :平均每次设备I/O 操作的服务时间(毫秒)

%util : 一秒中有百分之几的时间用于I/O操作

正常情况下svctm应该小于await,而svctm的大小和磁盘性能有关,CPU、内存的负荷也会对svctm值造成影响,过多的请求也会间接导致svctm值的增加。

await的大小一般取决于svctm的值和I/O队列长度以及I/O请求模式。如果svctm与await很接近,表示几乎没有I/O等待,磁盘性能很好;如果await的值远高于svctm的值,表示I/O队列等待太长,系统上运行的应用程序将变慢,此时可以通过更换更快的硬盘来解决问题。

%util若接近100%,表示磁盘产生I/O请求太多,I/O系统已经满负荷地在工作,该磁盘可能存在瓶颈。长期下去,势必影响系统的性能,可通过优化程序或者通过更换更高、更快的磁盘来解决此问题。
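顺着这个思路,可以用一条简单的采样命令把%util偏高的盘筛出来(阈值90%仅为示意;%util通常是iostat -x输出的最后一列,字段位置随sysstat版本可能略有不同):

iostat -dx 10 | awk '$1 ~ /^(sd|hd|vd|xvd|dm-)/ && $NF+0 > 90 { print $1, "util=" $NF "%" }'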

2. iostat -d

user1@user1-desktop:~$ iostat -d 2 3

Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 5.89 148.87 57.77 1325028 514144

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 0.00 0.00 0.00 0 0

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 0.00 0.00 0.00 0 0

Blk_read/s : 每秒读取的数据块数

Blk_wrtn/s : 每秒写入的数据块数

Blk_read : 读取的所有块数

Blk_wrtn : 写入的所有块数

如果Blk_read/s很大,表示磁盘直接读取操作很多,可以将读取的数据写入内存中进行操作;如果Blk_wrtn/s很大,表示磁盘的写操作很频繁,可以考虑优化磁盘或者优化程序。这两个选项没有一个固定的大小,不同的操作系统值也不同,但长期的超大的数据读写,肯定是不正常的,一定会影响系统的性能。

3. iostat -x /dev/sda 2 3 ,对指定磁盘的单独统计

4. vmstat -d

四 . 网络性能评估

1. ping

time值显示了两台主机之间的网络延时情况,若很大,表示网络的延时很大。packets loss表示网络丢包率,越小表示网络的质量越高。

2. netstat -i ,如:

user1@user1-desktop:~$ netstat -i

Kernel Interface table

Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg

eth0 1500 0 6043239 0 0 0 87311 0 0 0 BMRU

lo 16436 0 2941 0 0 0 2941 0 0 0 LRU

Iface : 网络设备的接口名称

MTU : 最大传输单元,单位字节

RX-OK / TX-OK : 准确无误地接收 / 发送了多少数据包

RX-ERR / TX-ERR : 接收 / 发送数据包时产生了多少错误

RX-DRP / TX-DRP : 接收 / 发送数据包时丢弃了多少数据包

RX-OVR / TX-OVR : 由于误差而遗失了多少数据包

Flg :接口标记,其中:

L :该接口是个回环设备

B : 设置了广播地址

M : 接收所有的数据包

R :接口正在运行

U : 接口处于活动状态

O : 在该接口上禁用arp

P :表示一个点到点的连接

正常情况下,RX-ERR,RX-DRP,RX-OVR,TX-ERR,TX-DRP,TX-OVR都应该为0,若不为0且很大,那么网络质量肯定有问题,网络传输性能也一定会下降。

当网络传输存在问题时,可以检测网卡设备是否存在故障,还可以检查网络部署环境是否合理。

3. netstat -r (default行对应的值表示系统的默认路由)

4. sar -n ,n后为DEV(网络接口信息)、EDEV(网络错误统计信息)、SOCK(套接字信息)、和FULL(显示所有)

wangxin@wangxin-desktop:~$ sar -n DEV 2 3

Linux 2.6.35-27-generic (wangxin-desktop)2011年03月05日 _i686_ (2 CPU)

11时55分32秒 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s

11时55分34秒 lo 2.00 2.00 0.12 0.12 0.00 0.00 0.00

11时55分34秒 eth0 2.50 0.50 0.31 0.03 0.00 0.00 0.00

11时55分34秒 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s

11时55分36秒 lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00

11时55分36秒 eth0 1.50 0.00 0.10 0.00 0.00 0.00 0.00

11时55分36秒 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s

11时55分38秒 lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00

11时55分38秒 eth0 14.50 0.00 0.88 0.00 0.00 0.00 0.00

平均时间: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s

平均时间: lo 0.67 0.67 0.04 0.04 0.00 0.00 0.00

平均时间: eth0 6.17 0.17 0.43 0.01 0.00 0.00 0.00

IFACE : 网络接口设备

rxpck/s : 每秒接收的数据包数量

txpck/s : 每秒发送的数据包数量

rxkB/s : 每秒接收的千字节数

txkB/s : 每秒发送的千字节数

rxcmp/s : 每秒接收的压缩数据包数

txcmp/s : 每秒发送的压缩数据包数

rxmcst/s : 每秒接收的多播数据包数

 

 

49、HSF(High-Speed Service Framework)

是一个远程调用(RPC)框架,建筑起淘宝整个Java应用分布式集群环境。

 

 

50、LVS简介

LVS是Linux Virtual Server的简称,也就是Linux虚拟服务器,

使用LVS技术要达到的目标是:通过LVS提供的负载均衡技术和Linux操作系统实现一个高性能、高可用的服务器群集,它具有良好可靠性、可扩展性和可操作性。从而以低廉的成本实现最优的服务性能。

LVS自从1998年开始,发展到现在已经是一个比较成熟的技术项目了。可以利用LVS技术实现高可伸缩的、高可用的网络服务,例如WWW服务、Cache服务、DNS服务、FTP服务、MAIL服务、视频/音频点播服务等等,有许多比较著名网站和组织都在使用LVS架设的集群系统,例如:Linux的门户网站(www.linux.com)、向RealPlayer提供音频视频服务而闻名的Real公司(www.real.com)、全球最大的开源网站(sourceforge.net)等。

二、 LVS体系结构

使用LVS架设的服务器集群系统有三个部分组成:最前端的负载均衡层,用Load Balancer表示,中间的服务器群组层,用Server Array表示,最底端的数据共享存储层,用Shared Storage表示,在用户看来,所有的内部应用都是透明的,用户只是在使用一个虚拟服务器提供的高性能服务。

LVS体系结构如图1所示:

图1 LVS的体系结构

 下面对LVS的各个组成部分进行详细介绍:

 Load Balancer层:位于整个集群系统的最前端,由一台或者多台负载调度器(Director Server)组成,LVS模块就安装在Director Server上,而Director的主要作用类似于一个路由器,它含有完成LVS功能所设定的路由表,通过这些路由表把用户的请求分发给Server Array层的应用服务器(Real Server)。同时,在Director Server上还要安装对Real Server服务的监控模块Ldirectord,此模块用于监测各个Real Server服务的健康状况。在Real Server不可用时把它从LVS路由表中剔除,恢复时重新加入。

 Server Array层:由一组实际运行应用服务的机器组成,Real Server可以是WEB服务器、MAIL服务器、FTP服务器、DNS服务器、视频服务器中的一个或者多个,每个Real Server之间通过高速的LAN或分布在各地的WAN相连接。在实际的应用中,Director Server也可以同时兼任Real Server的角色。

 Shared Storage层:是为所有Real Server提供共享存储空间和内容一致性的存储区域,在物理上,一般由磁盘阵列设备组成,为了提供内容的一致性,一般可以通过NFS网络文件系统共享数据,但是NFS在繁忙的业务系统中,性能并不是很好,此时可以采用集群文件系统,例如Red hat的GFS文件系统,oracle提供的OCFS2文件系统等。

从整个LVS结构可以看出,Director Server是整个LVS的核心,目前,用于Director Server的操作系统只能是Linux和FreeBSD,linux 2.6内核不用任何设置就可以支持LVS功能,而FreeBSD作为Director Server的应用还不是很多,性能也不是很好。

对于Real Server,几乎可以是所有的系统平台,Linux、windows、Solaris、AIX、BSD系列都能很好的支持。

三、  LVS集群的特点

3.1  IP负载均衡与负载调度算法

1.IP负载均衡技术

负载均衡技术有很多实现方案,有基于DNS域名轮流解析的方法、有基于客户端调度访问的方法、有基于应用层系统负载的调度方法,还有基于IP地址的调度方法,在这些负载调度算法中,执行效率最高的是IP负载均衡技术。

LVS的IP负载均衡技术是通过IPVS模块来实现的,IPVS是LVS集群系统的核心软件,它的主要作用是:安装在Director Server上,同时在Director Server上虚拟出一个IP地址,用户必须通过这个虚拟的IP地址访问服务。这个虚拟IP一般称为LVS的VIP,即Virtual IP。访问的请求首先经过VIP到达负载调度器,然后由负载调度器从Real Server列表中选取一个服务节点响应用户的请求。

当用户的请求到达负载调度器后,调度器如何将请求发送到提供服务的Real Server节点,而Real Server节点如何返回数据给用户,是IPVS实现的重点技术,IPVS实现负载均衡机制有三种,分别是NAT、TUN和DR,详述如下:

 VS/NAT:即(Virtual Server via Network Address Translation)

也就是网络地址翻译技术实现虚拟服务器,当用户请求到达调度器时,调度器将请求报文的目标地址(即虚拟IP地址)改写成选定的Real Server地址,同时报文的目标端口也改成选定的Real Server的相应端口,最后将报文请求发送到选定的Real Server。在服务器端得到数据后,Real Server返回数据给用户时,需要再次经过负载调度器将报文的源地址和源端口改成虚拟IP地址和相应端口,然后把数据发送给用户,完成整个负载调度过程。

可以看出,在NAT方式下,用户请求和响应报文都必须经过Director Server地址重写,当用户请求越来越多时,调度器的处理能力将成为瓶颈。

 VS/TUN:即(Virtual Server via IP Tunneling)

也就是IP隧道技术实现虚拟服务器。它的连接调度和管理与VS/NAT方式一样,只是它的报文转发方法不同,VS/TUN方式中,调度器采用IP隧道技术将用户请求转发到某个Real Server,而这个Real Server将直接响应用户的请求,不再经过前端调度器,此外,对Real Server的地域位置没有要求,可以和Director Server位于同一个网段,也可以是独立的一个网络。因此,在TUN方式中,调度器将只处理用户的报文请求,集群系统的吞吐量大大提高。

 VS/DR:即(Virtual Server via Direct Routing)

也就是用直接路由技术实现虚拟服务器。它的连接调度和管理与VS/NAT和VS/TUN中的一样,但它的报文转发方法又有不同,VS/DR通过改写请求报文的MAC地址,将请求发送到Real Server,而Real Server将响应直接返回给客户,免去了VS/TUN中的IP隧道开销。这种方式是三种负载调度机制中性能最高最好的,但是必须要求Director Server与Real Server都有一块网卡连在同一物理网段上。

2.负载调度算法

上面我们谈到,负载调度器是根据各个服务器的负载情况,动态地选择一台Real Server响应用户请求,那么动态选择是如何实现呢,其实也就是我们这里要说的负载调度算法,根据不同的网络服务需求和服务器配置,IPVS实现了如下八种负载调度算法,这里我们详细讲述最常用的四种调度算法,剩余的四种调度算法请参考其它资料。

 轮叫调度(Round Robin)

“轮叫”调度也叫1:1调度,调度器通过“轮叫”调度算法将外部用户请求按顺序1:1的分配到集群中的每个Real Server上,这种算法平等地对待每一台Real Server,而不管服务器上实际的负载状况和连接状态。

 加权轮叫调度(Weighted Round Robin)

“加权轮叫”调度算法是根据Real Server的不同处理能力来调度访问请求。可以对每台Real Server设置不同的调度权值,对于性能相对较好的Real Server可以设置较高的权值,而对于处理能力较弱的Real Server,可以设置较低的权值,这样保证了处理能力强的服务器处理更多的访问流量。充分合理地利用了服务器资源。同时,调度器还可以自动查询Real Server的负载情况,并动态地调整其权值。

 最少链接调度(Least Connections)

“最少连接”调度算法动态地将网络请求调度到已建立的链接数最少的服务器上。如果集群系统的真实服务器具有相近的系统性能,采用“最小连接”调度算法可以较好地均衡负载。

 加权最少链接调度(Weighted Least Connections)

“加权最少链接调度”是“最少连接调度”的超集,每个服务节点可以用相应的权值表示其处理能力,而系统管理员可以动态的设置相应的权值,缺省权值为1,加权最小连接调度在分配新连接请求时尽可能使服务节点的已建立连接数和其权值成正比。

其它四种调度算法分别为:基于局部性的最少链接(Locality-Based Least Connections)、带复制的基于局部性最少链接(Locality-Based Least Connections with Replication)、目标地址散列(Destination Hashing)和源地址散列(Source Hashing),对于这四种调度算法的含义,本文不再讲述,如果想深入了解这其余四种调度策略的话,可以登陆LVS中文站点zh.linuxvirtualserver.org,查阅更详细的信息。
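下面用ipvsadm给出一个最小化的配置示意(VIP与Real Server的IP均为假设值;-g表示DR转发,-m为NAT,-i为TUN):

# ipvsadm -A -t 192.168.1.100:80 -s rr                    # 定义虚拟服务,调度算法为轮叫(rr)
# ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.11:80 -g    # 添加Real Server,采用DR方式转发
# ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.12:80 -g
# ipvsadm -Ln                                             # 查看当前的转发规则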

3.2 高可用性

LVS是一个基于内核级别的应用软件,因此具有很高的处理性能,用LVS构架的负载均衡集群系统具有优秀的处理能力,每个服务节点的故障不会影响整个系统的正常使用,同时又实现负载的合理均衡,使应用具有超高负荷的服务能力,可支持上百万个并发连接请求。如配置百兆网卡,采用VS/TUN或VS/DR调度技术,整个集群系统的吞吐量可高达1Gbits/s;如配置千兆网卡,则系统的最大吞吐量可接近10Gbits/s。

3.3 高可靠性

LVS负载均衡集群软件已经在企业、学校等行业得到了很好的普及应用,国内外很多大型的、关键性的web站点也都采用了LVS集群软件,所以它的可靠性在实践中得到了很好的证实。有很多以LVS做的负载均衡系统,运行很长时间,从未做过重新启动。这些都说明了LVS的高稳定性和高可靠性。

3.4 适用环境

LVS对前端Director Server目前仅支持Linux和FreeBSD系统,但是支持大多数的TCP和UDP协议,支持TCP协议的应用有:HTTP,HTTPS,FTP,SMTP,POP3,IMAP4,PROXY,LDAP,SSMTP等等。支持UDP协议的应用有:DNS,NTP,ICP,视频、音频流播放协议等。

LVS对Real Server的操作系统没有任何限制,Real Server可运行在任何支持TCP/IP的操作系统上,包括Linux,各种Unix(如FreeBSD、Sun Solaris、HP Unix等),Mac/OS和Windows等。

3.5 开源软件

LVS集群软件是按GPL(GNU PublicLicense)许可证发行的自由软件,因此,使用者可以得到软件的源代码,并且可以根据自己的需要进行各种修改,但是修改必须是以GPL方式发行

 

51、精卫简介

精卫填海(简称精卫)是一个基于MySQL数据库的数据复制组件,远期目标是构建一个完善可接入多种不同类型源数据的实时数据复制框架。基于最原始的生产者-消费者模型,引入Pipeline(负责数据传送)、Extractor(生产数据)、Applier(消费数据)的概念,构建一套高易用性的数据复制框架。

 

52、HDD传输带宽:

具有高带宽规格的硬盘在传输大块连续数据时具有优势,而具有高IOPS的硬盘在传输小块不连续的数据时具有优势。

 

53、IOPS=1/(换道时间+数据传输时间)

完成一次IO所用的时间=寻道时间+旋转延迟时间+数据传输时间,IOPS=IO并发系数/(寻道时间+旋转延迟时间+数据传输时间)
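举个粗略的估算例子(数值为假设的典型值):一块15000转的SAS盘,平均寻道时间约3.5ms,旋转延迟约为半圈即60/15000/2≈2ms,忽略小块随机IO的数据传输时间,则单盘IOPS≈1/(0.0035+0.002)≈180左右;而7200转的SATA盘按寻道9ms、旋转延迟4.17ms估算,IOPS只有75左右。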

 

54、Lxc Python3scripting

 

As much fun as C may be, I usually like to script my containers and C isn't really the best language for that. That's why I wrote and maintain the official python3 binding.

 

The equivalent to the example above in python3 would be:

 

import lxc
import sys

# Setup the container object
c = lxc.Container("apicontainer")
if c.defined:
    print("Container already exists", file=sys.stderr)
    sys.exit(1)

# Create the container rootfs
if not c.create("download", lxc.LXC_CREATE_QUIET, {"dist": "ubuntu",
                                                   "release": "trusty",
                                                   "arch": "i386"}):
    print("Failed to create the container rootfs", file=sys.stderr)
    sys.exit(1)

# Start the container
if not c.start():
    print("Failed to start the container", file=sys.stderr)
    sys.exit(1)

# Query some information
print("Container state: %s" % c.state)
print("Container PID: %s" % c.init_pid)

# Stop the container
if not c.shutdown(30):
    print("Failed to cleanly shutdown the container, forcing.")
    if not c.stop():
        print("Failed to kill the container", file=sys.stderr)
        sys.exit(1)

# Destroy the container
if not c.destroy():
    print("Failed to destroy the container.", file=sys.stderr)
    sys.exit(1)

Now for that specific example, python3 isn't that much simpler than the C equivalent.

 

But what if we wanted to do something slightly more tricky, like iterating through all existing containers, start them (if they're not already started), wait for them to have network connectivity, then run updates and shut them down?

 

import lxc
import sys

for container in lxc.list_containers(as_object=True):
    # Start the container (if not started)
    started = False
    if not container.running:
        if not container.start():
            continue
        started = True

    if not container.state == "RUNNING":
        continue

    # Wait for connectivity
    if not container.get_ips(timeout=30):
        continue

    # Run the updates
    container.attach_wait(lxc.attach_run_command,
                          ["apt-get", "update"])
    container.attach_wait(lxc.attach_run_command,
                          ["apt-get", "dist-upgrade", "-y"])

    # Shutdown the container
    if started:
        if not container.shutdown(30):
            container.stop()

The most interesting bit in the example above is the attach_wait command, which basically lets you run a standard python function in the container's namespaces, here's a more obvious example:

 

import lxc

c = lxc.Container("p1")
if not c.running:
    c.start()

def print_hostname():
    with open("/etc/hostname", "r") as fd:
        print("Hostname: %s" % fd.read().strip())

# First run on the host
print_hostname()

# Then on the container
c.attach_wait(print_hostname)

if not c.shutdown(30):
    c.stop()

And the output of running the above:

 

stgraber@castiana:~$ python3 lxc-api.py

/home/stgraber/<frozen>:313: Warning: The python-lxc API isn't yet stable and may change at any point in the future.

Hostname: castiana

Hostname: p1

It may take you a little while to wrap your head around the possibilities offered by that function, especially as it also takes quite a few flags (look for LXC_ATTACH_* in the C API) which lets you control which namespaces to attach to, whether to have the function contained by apparmor, whether to bypass cgroup restrictions, …

 

That kind of flexibility is something you'll never get with a virtual machine and the way it's supported through our bindings makes it easier than ever to use by anyone who wants to automate custom workloads.

 

You can also use the API to script cloning containers and using snapshots (though for that example to work, you need current upstream master due to a small bug I found while writing this…):

 

import lxc
import os
import sys

if not os.geteuid() == 0:
    print("The use of overlayfs requires privileged containers.")
    sys.exit(1)

# Create a base container (if missing) using an Ubuntu 14.04 image
base = lxc.Container("base")
if not base.defined:
    base.create("download", lxc.LXC_CREATE_QUIET, {"dist": "ubuntu",
                                                   "release": "precise",
                                                   "arch": "i386"})

    # Customize it a bit
    base.start()
    base.get_ips(timeout=30)
    base.attach_wait(lxc.attach_run_command, ["apt-get", "update"])
    base.attach_wait(lxc.attach_run_command, ["apt-get", "dist-upgrade", "-y"])

    if not base.shutdown(30):
        base.stop()

# Clone it as web (if not already existing)
web = lxc.Container("web")
if not web.defined:
    # Clone base using an overlayfs overlay
    web = base.clone("web", bdevtype="overlayfs",
                     flags=lxc.LXC_CLONE_SNAPSHOT)

    # Install apache
    web.start()
    web.get_ips(timeout=30)
    web.attach_wait(lxc.attach_run_command, ["apt-get", "update"])
    web.attach_wait(lxc.attach_run_command, ["apt-get", "install",
                                             "apache2", "-y"])

    if not web.shutdown(30):
        web.stop()

# Create a website container based on the web container
mysite = web.clone("mysite", bdevtype="overlayfs",
                   flags=lxc.LXC_CLONE_SNAPSHOT)
mysite.start()
ips = mysite.get_ips(family="inet", timeout=30)
if ips:
    print("Website running at: http://%s" % ips[0])
else:
    if not mysite.shutdown(30):
        mysite.stop()

The above will create a base container using a downloaded image, then clone it using an overlayfs based overlay, add apache2 to it, then clone that resulting container into yet another one called "mysite". So "mysite" is effectively an overlay clone of "web" which is itself an overlay clone of "base".

 

 

 

So there you go, I tried to cover most of the interesting bits of our API with the examples above, though there's much more available, for example, I didn't cover the snapshot API (currently restricted to system containers) outside of the specific overlayfs case above and only scratched the surface of what's possible to do with the attach function.

 

LXC 1.0 will release with a stable version of the API, we'll be doing additions in the next few 1.x versions (while doing bugfix only updates to 1.0.x) and hope not to have to break the whole API for quite a while (though we'll certainly be adding more stuff to it)

 

55、NIC

台式机一般都采用内置网卡来连接网络。网卡也叫“网络适配器”,英文全称为“Network Interface Card”,简称“NIC”,网卡是局域网中最基本的部件之一,它是连接计算机与网络的硬件设备。无论是双绞线连接、同轴电缆连接还是光纤连接,都必须借助于网卡才能实现数据的通信。它的主要技术参数为带宽、总线方式、电气接口方式等。它的基本功能为:从并行到串行的数据转换,包的装配和拆装,网络存取控制,数据缓存和网络信号。目前主要是8位和16位网卡。
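日常排查时常用ethtool查看网卡的速率、双工等参数(接口名eth0为假设):

# ethtool eth0        # 查看协商速率、双工模式、链路状态
# ethtool -i eth0     # 查看驱动及固件版本
# ethtool -S eth0     # 查看网卡收发统计计数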

56、NAT:

NAT(Network Address Translation,网络地址转换)是将IP 数据包头中的IP 地址转换为另一个IP 地址的过程。在实际应用中,NAT 主要用于实现私有网络访问公共网络的功能。这种通过使用少量的公有IP 地址代表较多的私有IP 地址的方式,将有助于减缓可用IP地址空间的枯竭。在RFC 1632中有对NAT的说明。

NAT功能

NAT不仅能解决IP地址不足的问题,而且还能够有效地避免来自网络外部的攻击,隐藏并保护网络内部的计算机。

1.宽带分享:这是 NAT 主机的最大功能。

2.安全防护:NAT 之内的 PC 联机到 Internet 上面时,他所显示的 IP 是 NAT 主机的公共 IP,所以 Client 端的 PC 当然就具有一定程度的安全了,外界在进行 portscan(端口扫描)的时候,就侦测不到源Client 端的 PC 。

NAT实现方式

NAT的实现方式有三种,即静态转换Static NAT、动态转换Dynamic NAT和端口多路复用OverLoad。

静态转换是指将内部网络的私有IP地址转换为公有IP地址,IP地址对是一对一的,是一成不变的,某个私有IP地址只转换为某个公有IP地址。借助于静态转换,可以实现外部网络对内部网络中某些特定设备(如服务器)的访问。

动态转换是指将内部网络的私有IP地址转换为公用IP地址时,IP地址是不确定的,是随机的,所有被授权访问上Internet的私有IP地址可随机转换为任何指定的合法IP地址。也就是说,只要指定哪些内部地址可以进行转换,以及用哪些合法地址作为外部地址时,就可以进行动态转换。动态转换可以使用多个合法外部地址集。当ISP提供的合法IP地址略少于网络内部的计算机数量时。可以采用动态转换的方式。

端口多路复用(Port Address Translation,PAT)是指改变外出数据包的源端口并进行端口转换,即端口地址转换。采用端口多路复用方式,内部网络的所有主机均可共享一个合法外部IP地址实现对Internet的访问,从而可以最大限度地节约IP地址资源;同时,又可隐藏网络内部的所有主机,有效避免来自Internet的攻击。因此,目前网络中应用最多的就是端口多路复用方式。

NAT的技术背景

要真正了解NAT就必须先了解现在IP地址的适用情况,私有 IP 地址是指内部网络或主机的IP 地址,公有IP 地址是指在因特网上全球唯一的IP 地址。RFC 1918 为私有网络预留出了三个IP 地址块,如下:

A 类:10.0.0.0~10.255.255.255

B 类:172.16.0.0~172.31.255.255

C 类:192.168.0.0~192.168.255.255

上述三个范围内的地址不会在因特网上被分配,因此可以不必向ISP 或注册中心申请而在公司或企业内部自由使用。

NAPT

NAPT(Network Address Port Translation),即网络端口地址转换,可将多个内部地址映射为一个合法公网地址,但以不同的协议端口号与不同的内部地址相对应,也就是<内部地址+内部端口>与<外部地址+外部端口>之间的转换。NAPT普遍用于接入设备中,它可以将中小型的网络隐藏在一个合法的IP地址后面。NAPT也被称为“多对一”的NAT,或者叫PAT(Port Address Translation,端口地址转换)、地址超载(address overloading)。

NAPT与动态地址NAT不同,它将内部连接映射到外部网络中的一个单独的IP地址上,同时在该地址上加上一个由NAT设备选定的TCP端口号。NAPT算得上是一种较流行的NAT变体,通过转换TCP或UDP协议端口号以及地址来提供并发性。除了一对源和目的IP地址以外,这个表还包括一对源和目的协议端口号,以及NAT盒使用的一个协议端口号。

NAPT的主要优势在于,能够使用一个全球有效IP地址获得通用性。主要缺点在于其通信仅限于TCP或UDP。当所有通信都采用TCP或UDP,NAPT允许一台内部计算机访问多台外部计算机,并允许多台内部主机访问同一台外部计算机,相互之间不会发生冲突。

 

NAT工作原理

NAT将自动修改IP报文的源IP地址和目的IP地址,Ip地址校验则在NAT处理过程中自动完成。有些应用程序将源IP地址嵌入到IP报文的数据部分中,所以还需要同时对报文的数据部分进行修改,以匹配IP头中已经修改过的源IP地址。否则,在报文数据部分嵌入IP地址的应用程序就不能正常工作。

①如图这个 client(终端)的 gateway (网关)设定为 NAT 主机,所以当要连上 Internet 的时候,该封包就会被送到 NAT 主机,这个时候的封包 Header 之 source IP(源IP)为 192.168.1.100 ;

②而透过这个 NAT 主机,它会将 client 的对外联机封包的 source IP ( 192.168.1.100 ) 伪装成 ppp0 ( 假设为拨接情况 )这个接口所具有的公共 IP,因为是公共 IP 了,所以这个封包就可以连上 Internet 了,同时 NAT 主机并且会记忆这个联机的封包是由哪一个 ( 192.168.1.100 ) client 端传送来的;

③由 Internet 传送回来的封包,当然由 NAT主机来接收了,这个时候, NAT 主机会去查询原本记录的路由信息,并将目标 IP 由 ppp0 上面的公共 IP 改回原来的 192.168.1.100 ;

④最后则由 NAT 主机将该封包传送给原先发送封包的 Client。
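与上述过程对应的最简iptables配置示意(假设内网网段为192.168.1.0/24,出口接口为ppp0):

# echo 1 > /proc/sys/net/ipv4/ip_forward
# iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o ppp0 -j MASQUERADE
# 若出口IP固定,也可以改用SNAT(假设出口为eth1、公网IP为1.2.3.4):
# iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth1 -j SNAT --to-source 1.2.3.4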

配置NAT

在配置NAT(网络地址转换)之前,首先需要了解内部本地地址和内部全局地址的分配情况。根据不同的需求,执行以下不同的配置任务。

内部源地址NAT配置

内部源地址NAPT配置

重叠地址NAT配置

TCP负载均衡

 

 

57、How-to install LXC and OpenQuake LXC on RHEL/CentOS 6

As root user:

1) Add the EPEL repo to your RHEL/CentOS 6 server

$ rpm -ivh http://mirror.1000mbps.com/fedora-epel/6/i386/epel-release-6-8.noarch.rpm

2) Install LXC 0.9.0 from epel and some other stuff needed

$ yum install --enablerepo=epel-testing lxc lxc-libs lxc-templates bridge-utils libcgroup

3) Enable the cgroups

$ service cgconfig start
$ service cgred start
$ chkconfig --level 345 cgconfig on
$ chkconfig --level 345 cgred on

4) Setup the network:

The easiest way is to create an internal network, so you do not need to expose the LXC to the bare-metal server network.

a) Create the bridge

$ brctl addbr lxcbr0

问题终于解决了。不是一般的菜鸟真的伤不起啊。

解决方法参考链接:http://www.360doc.com/content/12/0507/14/9318309_209243400.shtml

其实,问题很简单,就是要关闭网络管理器:

chkconfig NetworkManager off
service NetworkManager stop

似乎在某文章里提到过这个东西,但我不知道怎么关闭,就忽略了,不知道把这个服务启动会怎么样。

 

b) Make the bridge persistent on reboot

create /etc/sysconfig/network-scripts/ifcfg-lxcbr0 and add

DEVICE="lxcbr0"
TYPE="Bridge"
BOOTPROTO="static"
IPADDR="10.0.3.1"
NETMASK="255.255.255.0"

c) Start the bridge interface

$ ifup lxcbr0

5)Configure the firewall

to allow outgoing traffic from the container: edit /etc/sysconfig/iptables and

a) Comment or remove

-A FORWARD -j REJECT --reject-with icmp-host-prohibited

b) Add at the end of file

*nat
:PREROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT

c) Restart the firewall

$ service iptables restart

6) Enable IPv4 forwarding

edit /etc/sysctl.conf and change net.ipv4.ip_forward = 0 to net.ipv4.ip_forward = 1, then apply the new parameters with

$ sysctl -p

7) Download OpenQuake LXC

$ cd /tmp && wget http://ftp.openquake.org/oq-master/lxc/Ubuntu_lxc_12.04_64_oq_master_nightly-140310.tar.bz2

8) Extract the OpenQuake LXC

$ tar --numeric-owner -C /var/lib/lxc -xpsjf /tmp/Ubuntu_lxc_12.04_64_oq_master_nightly-140310.tar.bz2

9) Check if the LXC is installed and ready

with the command lxc-ls you should see

$ lxc-ls
openquake-nightly-140310

10) Setup the OpenQuake LXC ip address

open /var/lib/lxc/openquake/rootfs/etc/network/interfaces and change iface eth0 inet dhcp to

iface eth0 inet static
address 10.0.3.2
netmask 255.255.255.0
gateway 10.0.3.1
dns-nameservers 8.8.8.8

11) Start the OpenQuake LXC

$ lxc-start -d -n openquake

12) Login into the running OpenQuake LXC

$ lxc-console -n openquake
(to detach press ctrl-a + q)

You can also login using SSH from the host server:

$ ssh openquake@10.0.3.2
User: openquake
Password: openquake

Please note:

·        This how-to is intended for a fresh, standard installation of RHEL/CentOS 6 (and is tested on 6.4). It may need some adjustments for customized installations.

·        On 5. the firewall could be already customized by the sysadmin, please be careful when editing it. For more details please ask your network and/or system administrator.

·        On 5. section b. -A POSTROUTING -o eth0 -j MASQUERADE "eth0" is the name of the host server main interface. It can differ in your configuration (see the used interface with ifconfig).

·        On 8. the --numeric-owner is mandatory.

·        On 10. the 8.8.8.8 DNS is the one provided by Google. It's better to use your internal DNS, so change that IP address with the one associated to your DNS server. For more details please ask your network and/or system administrator.

·        On certain installations the rsyslogd process inside the container can eat lots of CPU cycles. To fix it run, within the container, these commands:

service rsyslog stop
sed -i -e 's/^\$ModLoad imklog/#\$ModLoad imklog/g' /etc/rsyslog.conf
service rsyslog start

58、django

开放源代码的Web应用框架,由Python写成。采用了MVC的软件设计模式,即模型M,视图V和控制器C。它最初是被开发来用于管理劳伦斯出版集团旗下的一些以新闻内容为主的网站的,即是CMS(内容管理系统)软件。并于2005年7月在BSD许可证下发布。这套框架是以比利时的吉普赛爵士吉他手Django Reinhardt来命名的。

Django 框架的核心组件有:

用于创建模型的对象关系映射

为最终用户设计的完美管理界面

一流的 URL 设计

设计者友好的模板语言

缓存系统。

 

59、Lxc 1.0 no configuration file for '/sbin/init'

Those who have a problem with "lxc-start: no configuration file for '/sbin/init' (may crash the host)", just add "-f" and the config path to lxc-start. For example:

 

lxc-start -n vps101 -l DEBUG -f /var/lib/lxc/vps101/config

 /usr/var/lib/lxc/centos/config

 

By default, containers are located under /var/lib/lxc for the root user, and $HOME/.local/share/lxc otherwise. The location can be specified for all lxc commands using the "-P|--lxcpath" argument.

 

 

60、Httpd

httpd是Apache超文本传输协议(HTTP)服务器的主程序。被设计为一个独立运行的后台进程,它会建立一个处理请求的子进程或线程的池。

 

61、Turbo Boost

Intel® Turbo Boost Technology 2.0 automatically allows processor cores to run faster than the rated operating frequency if they're operating below power, current, and temperature specification limits.

 

62、P-state

 在Intel平台上通常指的是EIST(Enhanced Intel SpeedStep Technology),EIST允许多个核动态的切换电压和频率,动态的调整系统的功耗。OSPM通过WRMSR指令写IA32_PERF_CTL MSR的方式调整CPU电压和工作频率。
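在Linux下可以通过cpufreq的sysfs接口观察和调整相关的频率策略(以下文件在多数发行版可见,具体条目取决于所用的cpufreq驱动):

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor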

 

68、如何将Python的json本地化

首先从http://pypi.python.org/pypi/python-json下载python-json,然后安装。

解压zip包然后把json.py minjson.py 拷到 /usr/lib/python2.5/下面就行了。

怎样使用请看:http://docs.python.org/library/json.html

 

69、Pythonmysqldb

tar xvzf MySQL-python-1.2.1.tar.gz

cd MySQL-python-1.2.1

yum install -y python-devel

yum install -y mysql-devel

python setup.py build

python setup.py install

 

70、IO调度策略

bfq、

Budget Fair Queueing(预算公平队列)

cfq、

Completely Fair Queuing(完全公平队列)

noop、

Noop is the idea of first come, first served: you get your cake and wait in line to get another piece, so if one task keeps demanding the whole cake, the others get hungry but must wait their turn. BFQ can solve that issue, especially under sudden heavy multitasking, by not letting a single task make everything feel sluggish.

deadline

 

 

71、tmpfs

tmpfs是一种基于内存的文件系统,它和虚拟磁盘ramdisk比较类似,但不完全相同:和ramdisk一样,tmpfs可以使用RAM,但它也可以使用swap分区来存储。而且传统的ramdisk是个块设备,要用mkfs来格式化它,才能真正地使用它;而tmpfs是一个文件系统,并不是块设备,只要挂载它,就可以使用了。tmpfs是最好的基于RAM的文件系统。
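一个简单的挂载示意(大小与挂载点均为假设):

# mkdir -p /mnt/ramcache
# mount -t tmpfs -o size=512m tmpfs /mnt/ramcache
# df -h /mnt/ramcache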

 

 

DMI,即Desktop Management Interface。

 

72、iptables

iptables 是建立在 netfilter 架构基础上的一个包过滤管理工具,最主要的作用是用来做防火墙或透明代理。iptables 从 ipchains 发展而来,它的功能更为强大。iptables 提供以下三种功能:包过滤、NAT(网络地址转换)和通用的 pre-route packet mangling。包过滤:用来过滤包,但是不修改包的内容。iptables 在包过滤方面相对于 ipchains 的主要优点是速度更快,使用更方便。NAT:NAT 可以分为源地址 NAT 和目的地址 NAT。

        iptables 可以追加、插入或删除包过滤规则。实际上真正执行这些过滤规则的是 netfilter 及其相关模块(如 iptables 模块和 nat 模块)。

 

看内核时总遇到if(likely( )){}或是if(unlikely( ))这样的语句,最初不解其意,现在有所了解,所以也想介绍一下。

 

 

73、Linux kernel–likely

likely() 与 unlikely()是内核(我看的是2.6.22.6版本,2.6的版本应该都有)中定义的两个宏。位于/include/linux/compiler.h中,

具体定义如下:

#define likely(x) __builtin_expect(!!(x),1)

#define unlikely(x) __builtin_expect(!!(x),0)

 

__builtin_expect是gcc(版本>=2.96,网上写的,我没验证过)中提供的一个预处理命令(这个名词也是网上写的,我想叫函数更好些),有利于代码优化。gcc(version 4.4.0)具体定义如下:

long __builtin_expect (long exp, long c)[Built-in Function]

 

注解为:

You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this ('-fprofile-arcs'), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect. The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c.

 

它的意思是:我们可以使用这个函数人为告诉编绎器一些分支预测信息“exp==c”是“很可能发生的”。

 

#define likely(x) __builtin_expect(!!(x),1)也就是说明x==1是“经常发生的”或是“很可能发生的”。

使用likely,执行if后面语句的可能性大些,编译器将if{}里的内容编译到前面;使用unlikely,执行else后面语句的可能性大些,编译器将else{}里的内容编译到前面。这样有利于cpu预取,提高预取指令的正确率,因而可提高效率。

 

74、查看内存页大小与文件系统块大小:

getconf PAGESIZE

tune2fs -l /dev/sda1 | grep 'Block size'

 

75、安装mysql 5.5环境

76、t-mysql.x86_64 5.5.18

yum install -b current t-mysql

Loaded plugins: branch, security

Setting up Install Process

Resolving Dependencies

--> Running transaction check

---> Package t-mysql.x86_64 0:5.5.18.4132-38.el6 will be installed

--> Finished Dependency Resolution

yum install -y -b current t-alimysql-env

 

77、内存管理中的coldpage和hot page, 冷页 vs 热页

所谓冷热是针对处理器cache来说的,冷就是页不大可能在cache中,热就是有很大几率在cache中。

78、LBA

LBA(Logical Block Address),中文名称:逻辑区块地址。是描述电脑存储设备上数据所在区块的通用机制,一般用在像硬盘这样的辅助记忆设备。LBA可以意指某个数据区块的地址或是某个地址所指向的数据区块。电脑上所谓一个逻辑区块通常是512或1024位组。ISO-9660格式的标准CD则以2048位组为一个逻辑区块大小。

79、SAN

存储区域网络(SAN)是一种高速网络或子网络,提供在计算机与存储系统之间的数据传输。存储设备是指一张或多张用以存储计算机数据的磁盘设备。一个 SAN 网络由负责网络连接的通信结构、负责组织连接的管理层、存储部件以及计算机系统构成,从而保证数据传输的安全性和力度。当前常见的可使用 SAN 技术,诸如 IBM 的光纤 ESCON 及其后继的 FICON(一种更新的光纤信道技术)。LSI的Nytro XD智能高速缓存技术,为存储区域网络 (SAN) 和直接附加存储 (DAS) 环境提供开箱即用的应用加速功能。另外存储区域网络中也运用到高速以太网协议。SCSI 和 iSCSI 是目前使用较为广泛的两种存储区域网络协议。

80、OSI(开放系统互联,Open System Interconnection)

 

81、DIMM

DIMM(Dual Inline Memory Module,双列直插内存模块)与SIMM(single in-line memory module,单边接触内存模组)相当类似,不同的只是DIMM的金手指两端不像SIMM那样是互通的,它们各自独立传输信号,因此可以满足更多数据信号的传送需要。同样采用DIMM,SDRAM 的接口与DDR内存的接口也略有不同,SDRAM DIMM为168Pin DIMM结构,金手指每面为84Pin,金手指上有两个卡口,用来避免插入插槽时,错误将内存反向插入而导致烧毁;DDR2 DIMM则采用240pin DIMM结构,金手指每面有120Pin。卡口数量的不同,是二者最为明显的区别。DDR3 DIMM同为240pin DIMM结构,金手指每面有120Pin,与DDR2 DIMM一样金手指上也只有一个卡口,但是卡口的位置与DDR2 DIMM稍微有一些不同,因此DDR3内存是插不进DDR2 DIMM的,同理DDR2内存也是插不进DDR3 DIMM的,因此在一些同时具有DDR2 DIMM和DDR3 DIMM的主板上,不会出现将内存插错插槽的问题。

82、DMI

DMI是英文单词Desktop Management Interface的缩写,也就是桌面管理界面,它含有关于系统硬件的配置信息。计算机每次启动时都对DMI数据进行校验,如果该数据出错或硬件有所变动,就会对机器进行检测,并把测试的数据写入BIOS芯片保存。
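DMI信息在Linux下通常用dmidecode读取(需要root权限):

# dmidecode -t system    # 整机厂商、型号、序列号
# dmidecode -t memory    # 内存插槽、容量与频率
# dmidecode -t bios      # BIOS版本与日期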

 

83、translation lookaside buffer

Translates a virtual address into a physical address

 

84、IA-32 systems start at 0x08048000, leaving a gap of roughly 128 MiB between the lowest possible address and the start of the text mapping that is used to catch NULL pointers

 

85、SElinux

查看SELinux状态:

1、/usr/sbin/sestatus -v      ##如果SELinux status参数为enabled即为开启状态

SELinux status:                 enabled

2、getenforce                 ##也可以用这个命令检查

关闭SELinux:

1、临时关闭(不用重启机器):

setenforce 0                  ##设置SELinux 成为permissive模式

                              ##setenforce 1 设置SELinux 成为enforcing模式

2、修改配置文件需要重启机器:

修改/etc/selinux/config 文件

将SELINUX=enforcing改为SELINUX=disabled

重启机器即可

今天的文章网络运维词汇汇总分享到此就结束了,感谢您的阅读,如果确实帮到您,您可以动动手指转发给其他人。
