本篇之所以起这个名字,是因为这里整理的是我在一家网络公司工作时遇到的一些相关词汇,仅供参考。
1、关系型数据库服务 RDS:
关系型数据库服务(Relational Database Service,简称RDS)是一种稳定可靠、可弹性伸缩的在线数据库服务。RDS采用即开即用方式,兼容MySQL、SQL Server两种关系型数据库,并提供数据库在线扩容、备份回滚、性能监测及分析功能。RDS与云服务器搭配使用I/O性能倍增,内网互通避免网络瓶颈。
2、开放存储服务 OSS:
开放存储服务(Open Storage Service,简称OSS)是支持任意数据类型的存储服务,支持任意时间、地点的数据上传和下载,OSS中每个存储对象(object)由名称、内容、描述三部分组成。
3、内容分发网络 CDN:
内容分发网络(Content Delivery Network,简称CDN)将加速内容分发至离用户最近的节点,缩短用户查看对象的延迟,提高用户访问网站的响应速度与网站的可用性,解决网络带宽小、用户访问量大、网点分布不均等问题。
4、负载均衡 SLB:
负载均衡(Server Load Balancer,简称SLB)是对多台云服务器进行流量分发的负载均衡服务。SLB可以通过流量分发扩展应用系统对外的服务能力,通过消除单点故障提升应用系统的可用性。
5、Django
Django是一个开放源代码的Web应用框架,由Python写成,采用了MVC的软件设计模式,即模型M、视图V和控制器C。它最初被开发用于管理劳伦斯出版集团旗下一些以新闻内容为主的网站,并于2005年7月在BSD许可证下发布。这套框架以比利时的吉普赛爵士吉他手Django Reinhardt命名。
Django的主要目标是使得开发复杂的、数据库驱动的网站变得简单。Django注重组件的重用性和“可插拔性”,敏捷开发和DRY法则(Don’t Repeat Yourself)。在Django中Python被普遍使用,甚至包括配置文件和数据模型。
Django于2008年6月17日正式成立基金会。
Web应用框架(Web application framework)是一种计算机软件框架,用来支持动态网站、网络应用程序及网络服务的开发。这种框架有助于减轻网页开发时共通性活动的工作负荷,例如许多框架提供数据库访问接口、标准样板以及会话管理等,可提升代码的可再用性。
DRC(Data Replication Center):由异地容灾需求而来的数据复制中心。
DAM(Database Activity Monitor):从安全角度对数据库的异常活动进行监测和审计。
6、异地备份
异地备份是指通过互联网(TCP/IP协议),由容灾备份系统(如备特佳)将本地数据实时备份到异地服务器中;既可以利用异地备份的数据进行远程恢复,也可以在异地进行数据回退。如果想做业务接管,则需要专线连接,而且只有在同一网段内才能实现业务的接管。
在建立容灾备份系统时会涉及到多种技术,如:SAN或NAS技术、远程镜像技术、虚拟存储、基于IP的SAN的互连技术、快照技术等。
许多存储厂商纷纷推出基于SAN的异地容灾软、硬件产品,希望能够为用户提供整套以SAN网络环境和异地实时备份为基础的,高效、可靠的异地容灾解决方案,并且能够为用户提供支持各种操作系统平台、数据库应用和网络应用的系统容灾服务。
为了确保基于存储区域网络(Storage Area Network,SAN)的异地容灾系统在主系统发生意外灾难后实现同城异地的数据容灾,采用SAN作为数据存储模式,通过光纤通道将生产数据中心和备份数据中心连接起来,使用跨阵列磁盘镜像技术实现异地数据中心之间的备份和恢复。容灾系统是计算机系统安全的最后保障,适用于大多数中小型企业的数据容灾需求,同时,还为企业将来实现更高级别的系统容灾做准备。
NAS(Network Attached Storage,网络附属存储)是一种将分布、独立的数据整合为大型、集中化管理的数据中心,以便于对不同主机和应用服务器进行访问的技术。按字面简单说就是连接在网络上、具备资料存储功能的装置,因此也称为“网络存储器”。它是一种专用数据存储服务器。它以数据为中心,将存储设备与服务器彻底分离,集中管理数据,从而释放带宽、提高性能、降低总拥有成本、保护投资。其成本远远低于使用服务器存储,而效率却远远高于后者。
DAS即直连方式存储,英文全称是Direct Attached Storage,中文翻译成“直接附加存储”。顾名思义,在这种方式中,存储设备是通过电缆(通常是SCSI接口电缆)直接连接到服务器的,I/O(输入/输出)请求直接发送到存储设备。DAS也可称为SAS(Server-Attached Storage,服务器附加存储)。它依赖于服务器,其本身是硬件的堆叠,不带有任何存储操作系统。
7、Nginx
Nginx("engine x")是一个高性能的 HTTP 和反向代理服务器,也是一个 IMAP/POP3/SMTP 代理服务器。Nginx 是由 Igor Sysoev 为俄罗斯访问量第二的 Rambler.ru 站点开发的,第一个公开版本 0.1.0 发布于2004年10月4日。其源代码以类BSD许可证的形式发布,因它的稳定性、丰富的功能集、示例配置文件和低系统资源的消耗而闻名。2011年6月1日,nginx 1.0.4发布。
Nginx作为负载均衡服务器:Nginx 既可以在内部直接支持Rails 和 PHP 程序对外进行服务,也可以支持作为 HTTP代理服务器对外进行服务。Nginx采用C进行编写,不论是系统资源开销还是CPU使用效率都比Perlbal 要好很多。
8、反向代理
反向代理(Reverse Proxy)方式是指以代理服务器来接受internet上的连接请求,然后将请求转发给内部网络上的服务器,并将从服务器上得到的结果返回给internet上请求连接的客户端,此时代理服务器对外就表现为一个服务器。
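下面给出一个极简的 Nginx 反向代理配置示意(假设后端应用监听在本机 8080 端口,配置文件路径仅为示例,请按实际环境调整):
cat > /etc/nginx/conf.d/reverse-proxy.conf <<'EOF'
server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8080;        # 将外部请求转发给内部应用
        proxy_set_header Host $host;             # 保留原始 Host 头
        proxy_set_header X-Real-IP $remote_addr; # 把客户端真实IP传给后端
    }
}
EOF
nginx -s reload    # 重新加载配置使其生效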
9、Hadoop
一个分布式系统基础架构,由Apache基金会所开发。
用户可以在不了解分布式底层细节的情况下开发分布式程序,充分利用集群的威力进行高速运算和存储。
Hadoop实现了一个分布式文件系统(Hadoop Distributed File System),简称HDFS。HDFS有高容错性的特点,并且设计用来部署在低廉的(low-cost)硬件上;而且它提供高传输率(high throughput)来访问应用程序的数据,适合那些有着超大数据集(large data set)的应用程序。HDFS放宽了(relax)POSIX的要求,可以以流的形式访问(streaming access)文件系统中的数据。
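作为补充,下面是几条常见的 HDFS 操作命令示意(目录和文件名均为假设,仅演示“流式访问超大文件”的基本用法):
hadoop fs -mkdir /data                      # 在HDFS上创建目录
hadoop fs -put access.log /data/            # 把本地文件上传到HDFS
hadoop fs -ls /data                         # 查看目录内容
hadoop fs -cat /data/access.log | head      # 以流的方式读取文件内容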
10、PXE
PXE(Preboot Execution Environment,预启动执行环境)是由Intel公司开发的网络引导技术,工作于Client/Server的网络模式,支持工作站通过网络从远端服务器下载映像,并由此支持通过网络启动操作系统。在启动过程中,终端要求服务器分配IP地址,再用TFTP(trivial file transfer protocol)或MTFTP(multicast trivial file transfer protocol)协议下载一个启动软件包到本机内存中执行,由这个启动软件包完成终端基本软件设置,从而引导预先安装在服务器中的终端操作系统。PXE可以引导多种操作系统,如:Windows 95/98/2000/2003/2008/XP/7/8、Linux等。
操作系统的调度有
CPU调度 CPU scheduler
IO调度 IO scheduler
IO调度器的总体目标是希望让磁头能够总是往一个方向移动,移动到底了再往反方向走,这恰恰就是现实生活中的电梯模型,所以IO调度器也被叫做电梯(elevator),而相应的算法也就被叫做电梯算法。Linux中IO调度的电梯算法有好几种:
as(Anticipatory),预期的
cfq(Complete Fairness Queueing),
deadline,
noop(No Operation).
具体使用哪种算法我们可以在启动的时候通过内核参数elevator来指定.
一)I/O调度的4种算法
1)CFQ(完全公平排队I/O调度程序)
特点:
在最新的内核版本和发行版中,都选择CFQ做为默认的I/O调度器,对于通用的服务器也是最好的选择. CFQ试图均匀地分布对I/O带宽的访问,避免进程被饿死并实现较低的延迟,是deadline和as调度器的折中. CFQ对于多媒体应用(video,audio)和桌面系统是最好的选择.
CFQ赋予I/O请求一个优先级,而I/O优先级独立于进程优先级,高优先级的进程的读写不能自动地继承高的I/O优先级。
工作原理:
CFQ为每个进程/线程,单独创建一个队列来管理该进程所产生的请求,也就是说每个进程一个队列,各队列之间的调度使用时间片来调度, 以此来保证每个进程都能被很好的分配到I/O带宽.I/O调度器每次执行一个进程的4次请求.
2)NOOP(电梯式调度程序)
特点:
在Linux 2.4或更早的版本中,调度程序只有这一种I/O调度算法。NOOP实现了一个简单的FIFO队列,它像电梯的工作方法一样对I/O请求进行组织,当有一个新的请求到来时,它将请求合并到最近的请求之后,以此来保证请求访问的是相邻的介质位置。NOOP倾向饿死读而利于写。NOOP对于闪存设备、RAM、嵌入式系统是最好的选择。电梯算法饿死读请求的解释:因为写请求比读请求更容易得到满足,写请求通过文件系统cache,不需要等一次写完成,就可以开始下一次写操作,写请求通过合并,堆积到I/O队列中;而读请求需要等到它前面所有的读操作完成,才能进行下一次读操作,在读操作之间有几毫秒时间,而写请求在这之间就到来,饿死了后面的读请求。
3)Deadline(截止时间调度程序)
特点:
通过时间以及硬盘区域进行分类,这个分类和合并要求类似于noop调度程序。Deadline确保了在一个截止时间内服务请求,这个截止时间是可调整的,而默认读期限短于写期限,这样就防止了读操作因为写操作过多而被饿死的现象。Deadline对数据库环境(ORACLE RAC、MySQL等)是最好的选择。
4)AS(预料I/O调度程序)
特点:
本质上与Deadline一样,但在最后一次读操作后,要等待6ms,才能继续对其它I/O请求进行调度。可以从应用程序中预订一个新的读请求,改进读操作的执行,但以一些写操作为代价。
它会在每个6ms中插入新的I/O操作,而会将一些小写入流合并成一个大写入流,用写入延时换取最大的写入吞吐量. AS适合于写入较多的环境,比如文件服务器,AS对数据库环境表现很差.
查看当前系统支持的IO调度算法
dmesg | grep -i scheduler
[root@localhost ~]# dmesg | grep -i scheduler
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
查看当前系统的I/O调度方法:
cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
临时更改I/O调度方法:
例如:想更改到noop电梯调度算法:
echo noop > /sys/block/sda/queue/scheduler
想永久的更改I/O调度方法:
修改内核引导参数,加入elevator=调度程序名
vi /boot/grub/menu.lst
更改到如下内容:
kernel /boot/vmlinuz-2.6.18-8.el5 ro root=LABEL=/ elevator=deadline rhgb quiet
重启之后,查看调度方法:
cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq
已经是deadline了
二)I/O调度程序的测试
本次测试分为只读,只写,读写同时进行.
分别对单个文件600MB,每次读写2M,共读写300次.
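在动手之前,也可以用下面这个小脚本把四种调度算法依次切换并跑同一个 dd 读测试(仅为示意,假设被测设备为 /dev/sda1,请换成自己的设备,注意 dd 会产生真实 I/O):
for sched in noop anticipatory deadline cfq; do
    echo $sched > /sys/block/sda/queue/scheduler      # 切换I/O调度算法
    echo 3 > /proc/sys/vm/drop_caches                 # 清掉页缓存,避免影响读测试
    echo "=== $sched ==="
    dd if=/dev/sda1 of=/dev/null bs=2M count=300 2>&1 | tail -1
done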
1)测试磁盘读:
[root@test1 tmp]# echo deadline > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.81189 seconds, 92.4 MB/s
real 0m6.833s
user 0m0.001s
sys 0m4.556s
[root@test1 tmp]# echo noop > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.61902 seconds, 95.1 MB/s
real 0m6.645s
user 0m0.002s
sys 0m4.540s
[root@test1 tmp]# echo anticipatory > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 8.00389 seconds, 78.6 MB/s
real 0m8.021s
user 0m0.002s
sys 0m4.586s
[root@test1 tmp]# echo cfq > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/sda1 of=/dev/null bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 29.8 seconds, 21.1 MB/s
real 0m29.826s
user 0m0.002s
sys 0m28.606s
结果:
第一 noop:用了6.61902秒,速度为95.1MB/s
第二 deadline:用了6.81189秒,速度为92.4MB/s
第三 anticipatory:用了8.00389秒,速度为78.6MB/s
第四 cfq:用了29.8秒,速度为21.1MB/s
2)测试写磁盘:
[root@test1 tmp]# echo cfq > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.93058 seconds, 90.8 MB/s
real 0m7.002s
user 0m0.001s
sys 0m3.525s
[root@test1 tmp]# echo anticipatory > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.79441 seconds, 92.6 MB/s
real 0m6.964s
user 0m0.003s
sys 0m3.489s
[root@test1 tmp]# echo noop > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 9.49418 seconds, 66.3 MB/s
real 0m9.855s
user 0m0.002s
sys 0m4.075s
[root@test1 tmp]# echo deadline > /sys/block/sda/queue/scheduler
[root@test1 tmp]# time dd if=/dev/zero of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 6.84128 seconds, 92.0 MB/s
real 0m6.937s
user 0m0.002s
sys 0m3.447s
测试结果:
第一 anticipatory,用了6.79441秒,速度为92.6MB/s
第二 deadline,用了6.84128秒,速度为92.0MB/s
第三 cfq,用了6.93058秒,速度为90.8MB/s
第四 noop,用了9.49418秒,速度为66.3MB/s
3)测试同时读/写
[root@test1 tmp]# echo deadline > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 15.1331 seconds, 41.6 MB/s
[root@test1 tmp]# echo cfq > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 36.9544 seconds, 17.0 MB/s
[root@test1 tmp]# echo anticipatory > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 23.3617 seconds, 26.9 MB/s
[root@test1 tmp]# echo noop > /sys/block/sda/queue/scheduler
[root@test1 tmp]# dd if=/dev/sda1 of=/tmp/test bs=2M count=300
300+0 records in
300+0 records out
629145600 bytes (629 MB) copied, 17.508 seconds, 35.9 MB/s
测试结果:
第一 deadline,用了15.1331秒,速度为41.6MB/s
第二 noop,用了17.508秒,速度为35.9MB/s
第三 anticipatory,用了23.3617秒,速度为26.9MB/s
第四 cfq,用了36.9544秒,速度为17.0MB/s
三)ionice
ionice可以更改任务的类型和优先级,不过只有cfq调度程序可以用ionice.
有三个例子说明ionice的功能:
采用cfq的实时调度,优先级为7:
ionice -c1 -n7 dd if=/dev/sda1 of=/tmp/test bs=2M count=300 &
采用缺省的磁盘I/O调度,优先级为3:
ionice -c2 -n3 dd if=/dev/sda1 of=/tmp/test bs=2M count=300 &
采用空闲的磁盘调度,优先级为0:
ionice -c3 -n0 dd if=/dev/sda1 of=/tmp/test bs=2M count=300 &
ionice的三种调度类别中,实时调度优先级最高,其次是缺省的I/O调度,最后是空闲的磁盘调度。
ionice的磁盘调度优先级有8种,最高是0,最低是7.
注意,磁盘调度的优先级与进程nice的优先级没有关系.
一个是针对进程I/O的优先级,一个是针对进程CPU的优先级.
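补充一个小示例:ionice 也可以用 -p 查看或修改一个已在运行的进程的 I/O 调度类别与优先级(PID 为假设值):
ionice -p 1234            # 查看 PID 为 1234 的进程当前的 I/O 调度类别和优先级
ionice -c2 -n0 -p 1234    # 将其改为 best-effort 类别、优先级 0(仅在 cfq 下生效)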
Anticipatory I/O scheduler 适用于大多数环境,但不太适合数据库应用
Deadline I/O scheduler 通常与Anticipatory相当,但更简洁小巧,更适合于数据库应用
CFQ I/O scheduler 为所有进程分配等量的带宽,适合于桌面多任务及多媒体应用,是默认的IO调度器
Default I/O scheduler
The CFQ scheduler has the following tunable parameters:
/sys/block/<device>/queue/iosched/slice_idle
When a task has no more I/O to submit in its time slice, the I/O scheduler waits for a while before scheduling the next thread to improve locality of I/O. For media where locality does not play a big role (SSDs, SANs with lots of disks), setting /sys/block/<device>/queue/iosched/slice_idle to 0 can improve the throughput considerably.
/sys/block/<device>/queue/iosched/quantum
This option limits the maximum number of requests that are being processed by the device at once. The default value is 4. For a storage with several disks, this setting can unnecessarily limit parallel processing of requests. Therefore, increasing the value can improve performance, although this can cause the latency of some I/O to be increased due to more requests being buffered inside the storage. When changing this value, you can also consider tuning /sys/block/<device>/queue/iosched/slice_async_rq (the default value is 2), which limits the maximum number of asynchronous requests (usually writing requests) that are submitted in one time slice.
/sys/block/<device>/queue/iosched/low_latency
For workloads where the latency of I/O is crucial, setting /sys/block/<device>/queue/iosched/low_latency to 1 can help.
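下面是按照上述说明调整 CFQ 参数的一个示意(以 sda 为例,数值仅供参考,请结合实际存储类型测试):
cat /sys/block/sda/queue/iosched/slice_idle          # 查看当前值
echo 0 > /sys/block/sda/queue/iosched/slice_idle     # SSD/多盘SAN上可尝试关闭空闲等待
echo 8 > /sys/block/sda/queue/iosched/quantum        # 提高一次可下发给设备的请求数
echo 1 > /sys/block/sda/queue/iosched/low_latency    # 延迟敏感的负载可打开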
DEADLINE
DEADLINE is a latency-oriented I/O scheduler. Each I/O request has got a deadline assigned. Usually, requests are stored in queues (read and write) sorted by sector numbers. The DEADLINE algorithm maintains two additional queues (read and write) where the requests are sorted by deadline. As long as no request has timed out, the "sector" queue is used. If timeouts occur, requests from the "deadline" queue are served until there are no more expired requests. Generally, the algorithm prefers reads over writes.
This scheduler can provide a superior throughput over the CFQ I/O scheduler in cases where several threads read and write and fairness is not an issue, for example, for several parallel readers from a SAN and for databases (especially when using "TCQ" disks). The DEADLINE scheduler has the following tunable parameters:
/sys/block/<device>/queue/iosched/writes_starved
Controls how many reads can be sent to disk before it is possible to send writes. A value of 3 means that three read operations are carried out for one write operation.
/sys/block/<device>/queue/iosched/read_expire
Sets the deadline (current time plus the read_expire value) for read operations in milliseconds. The default is 500.
/sys/block/<device>/queue/iosched/write_expire
Sets the deadline (current time plus the write_expire value) for write operations in milliseconds. The default is 5000.
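对应地,deadline 的参数也可以这样查看和调整(以 sda 为例,数值仅为示意):
cat /sys/block/sda/queue/iosched/read_expire          # 读请求的截止时间(毫秒)
echo 250 > /sys/block/sda/queue/iosched/read_expire   # 缩短读截止时间,进一步偏向读
echo 2 > /sys/block/sda/queue/iosched/writes_starved  # 每处理2批读之后再处理一次写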
11、syslog-ng(日志服务器)安装
1. 安装 EPEL 源
5.x:
[root@logsys0 data]# wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
6.x:
[root@logsys0 data]# wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
2. 安装 rpm 包
[root@logsys0 data]# rpm -ivh epel-release-6-8.noarch.rpm
3. 安装 syslog-ng
[root@logsys0 data]# yum --enablerepo=epel install syslog-ng eventlog syslog-ng-libdbi
设置变更:vim /etc/syslog-ng/syslog-ng.conf
[root@logsys0 data]# chkconfig rsyslog off; chkconfig syslog-ng on
[root@logsys0 data]# service rsyslog stop; service syslog-ng start
关闭系统日志记录器: [确定]
启动 syslog-ng: [确定]
#重新加载配置:service syslog-ng reload
安装成功
4. 开启防火墙80和514端口
vi /etc/sysconfig/iptables
添加两条规则
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 514 -j ACCEPT
配置文件如下:
# Firewall configuration written bysystem-config-firewall
# Manual customization of this file is notrecommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 514 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
[root@www.linuxidc.com data]# /etc/init.d/iptables restart
iptables:清除防火墙规则: [确定]
iptables:将链设置为政策 ACCEPT:filter [确定]
iptables:正在卸载模块: [确定]
iptables:应用防火墙规则: [确定]
12、OLAP 即 联机分析处理
联机分析处理简写为OLAP。随着数据库技术的发展和应用,数据库存储的数据量从20世纪80年代的兆(M)字节及千兆(G)字节过渡到现在的太(T)字节和拍(P)字节;同时,用户的查询需求也越来越复杂,涉及的已不仅是查询或操纵一张关系表中的一条或几条记录,而且要对多张表中千万条记录的数据进行数据分析和信息综合,关系数据库系统已不能全部满足这一要求。在国外,不少软件厂商采取了发展其前端产品来弥补关系数据库管理系统支持的不足,力图统一分散的公共应用逻辑,在短时间内响应非数据处理专业人员的复杂查询要求。
联机分析处理(OLAP)系统是数据仓库系统最主要的应用,专门设计用于支持复杂的分析操作,侧重对决策人员和高层管理人员的决策支持,可以根据分析人员的要求快速、灵活地进行大数据量的复杂查询处理,并且以一种直观而易懂的形式将查询结果提供给决策人员,以便他们准确掌握企业(公司)的经营状况,了解对象的需求,制定正确的方案。
13、磁盘
整个磁盘盘面上好像有多个同心圆绘制出的饼图,而由圆心以放射状的方式分割出磁盘的最小储存单位,那就是扇区(Sector),在物理组成方面,每个扇区大小为512 Bytes,这个值是不会改变的。而扇区组成一个圆就成为磁道(track),如果是在多碟的硬盘上面,所有盘片上面的同一个磁道可以组成一个磁柱(Cylinder),磁柱也是一般我们分割硬盘时的最小单位!
在计算整个硬盘的储存量时,简单的计算公式就是:『header数量 * 每个header负责的磁柱数量 * 每个磁柱所含有的扇区数量 * 扇区的容量』,单位换算为『header * cylinder/header * sector/cylinder * 512 bytes/sector』
装置 | 装置在Linux内的文件名
IDE硬盘机 | /dev/hd[a-d]
SCSI/SATA/USB硬盘机 | /dev/sd[a-p]
USB快闪碟 | /dev/sd[a-p](与SATA相同)
整颗磁盘的第一个分区特别的重要,因为他记录了整颗磁盘的重要信息! 磁盘的第一个分区主要记录了两个重要信息,分别是:
主要启动记录区(Master Boot Record, MBR):可以安装开机管理程序的地方,有446 bytes
分割表(partition table):记录整颗硬盘分割的状态,有64 bytes
14、磁盘分区
鸟哥p82-
/dev/sdb是SSD设备,/dev/sda是传统的磁盘设备,加载了Flashcache之后呢,会将这两个设备虚拟化为一个带有缓存的块设备/dev/mapper/cachedev。
上图中,Flashcache将普通的SAS盘(/dev/sda)和一个高速的SSD(/dev/sdb)虚拟成一个带缓存的块设备(/dev/mapper/cachedev)。后续还会有更多关于Flashcache的文章。
所谓的『挂载』就是利用一个目录当成进入点,将磁盘分区槽的数据放置在该目录下;也就是说,进入该目录就可以读取该分割槽的意思。这个动作我们称为『挂载』,那个进入点的目录我们称为『挂载点』。由于整个Linux系统最重要的是根目录,因此根目录一定需要挂载到某个分割槽;至于其他的目录则可依用户自己的需求来挂载到不同的分割槽。
tmpfs是一种基于内存的文件系统,它和虚拟磁盘ramdisk比较类似,但不完全相同。和ramdisk一样,tmpfs可以使用RAM,但它也可以使用swap分区来存储。而且传统的ramdisk是个块设备,要用mkfs格式化之后才能真正使用;而tmpfs是一个文件系统,并不是块设备,只要挂载就可以使用。tmpfs是最好的基于RAM的文件系统。
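挂载一个 tmpfs 非常简单,例如(挂载点与大小均为示例):
mkdir -p /mnt/tmp
mount -t tmpfs -o size=512m tmpfs /mnt/tmp    # 分配最大512MB的内存文件系统
df -h /mnt/tmp                                # 确认挂载结果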
15、NAT服务器
NAT英文全称是“Network Address Translation”,中文意思是“网络地址转换”,它是一个IETF(Internet Engineering Task Force,Internet工程任务组)标准,允许一个整体机构以一个公用IP(Internet Protocol)地址出现在Internet上。顾名思义,它是一种把内部私有网络地址(IP地址)翻译成合法网络IP地址的技术。
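在 Linux 上,最常见的 NAT 做法是用 iptables 做源地址伪装(下面假设 eth0 为外网口、内网网段为 192.168.1.0/24,仅作示意):
echo 1 > /proc/sys/net/ipv4/ip_forward                                    # 打开内核转发
iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE    # 把内网地址转换成出口地址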
Web(WWW服务器)
CentOS使用的是Apache这套软件来达成WWW网站的功能,在WWW服务器上面,如果你还有提供数据库系统的话,那么CPU的等级就不能太低,而最重要的则是RAM!要增加WWW服务器的效能,通常提升RAM是一个不错的考虑。
16、Proxy(代理服务器)
这也是常常会安装的一个服务器软件,尤其像中小学校带宽较不足的环境下,Proxy将可有效地解决带宽不足的问题!当然,你也可以在家里内部安装一个Proxy!但是,这个服务器的硬件要求可以说是相对而言最高的,它不但需要较强有力的CPU来运作,对于硬盘的速度与容量要求也很高!自然,既然提供了网络服务,网卡也是重要的一环!
17、/usr
很多读者都会误会/usr为user的缩写,其实usr是Unix Software Resource的缩写,也就是『Unix操作系统软件资源』所放置的目录,而不是用户的数据,这点要注意。FHS建议所有软件开发者,应该将他们的数据合理地分别放置到这个目录下的次目录,而不要自行建立该软件自己独立的目录。
18、/var
如果/usr是安装时会占用较大硬盘容量的目录,那么/var就是在系统运作后才会渐渐占用硬盘容量的目录。因为/var目录主要针对常态性变动的档案,包括快取(cache)、登录档(log file)以及某些软件运作所产生的档案,包括程序档案(lock file, run file),或者例如MySQL数据库的档案等等。常见的次目录有:
19、tac (反向列示)
cat 与 tac,有没有发现呀!对啦!tac 刚好是将 cat 反写过来,所以它的功能就跟 cat 相反啦。
20、less
less 的用法比起 more 又更加有弹性,怎么说呢?在 more 的时候,我们并没有办法向前面翻,只能往后面看,但若使用了 less 时,就可以使用 [pageup] [pagedown] 等按键来往前往后翻看文件,是不是更容易用来观看一个档案的内容了呢!
21、Set UID
22、LVM
LVM是 Logical Volume Manager(逻辑卷管理)的简写,它是Linux环境下对磁盘分区进行管理的一种机制,由Heinz Mauelshagen在Linux 2.4内核上实现,目前最新版本为:稳定版1.0.5,开发版1.1.0-rc2,以及LVM2开发版。
LVM是逻辑盘卷管理(Logical Volume Manager)的简称,它是Linux环境下对磁盘分区进行管理的一种机制。LVM是建立在硬盘和分区之上的一个逻辑层,用来提高磁盘分区管理的灵活性。通过LVM,系统管理员可以轻松管理磁盘分区,如:将若干个磁盘分区连接为一个整块的卷组(volume group),形成一个存储池;管理员可以在卷组上随意创建逻辑卷(logical volumes),并进一步在逻辑卷上创建文件系统。管理员通过LVM可以方便地调整存储卷组的大小,并且可以对磁盘存储按照组的方式进行命名、管理和分配,例如按照使用用途进行定义:“development”和“sales”,而不是使用物理磁盘名“sda”和“sdb”。而且当系统添加了新的磁盘,通过LVM,管理员就不必将文件移动到新的磁盘上以充分利用新的存储空间,而是直接扩展文件系统跨越磁盘即可。
LVM是在磁盘分区和文件系统之间添加的一个逻辑层,来为文件系统屏蔽下层磁盘分区布局,提供一个抽象的盘卷,在盘卷上建立文件系统。首先我们讨论以下几个LVM术语:
*物理存储介质(The physical media)
这里指系统的存储设备:硬盘,是存储系统最低层的存储单元。
*物理卷(Physical Volume,PV)
物理卷就是指硬盘分区或从逻辑上与磁盘分区具有同样功能的设备(如RAID),是LVM的基本存储逻辑块,但和基本的物理存储介质(如分区、磁盘等)比较,却包含有与LVM相关的管理参数。
*卷组(Volume Group,VG)
LVM卷组类似于非LVM系统中的物理硬盘,由一个或多个物理卷组成;可以在卷组上创建一个或多个“LVM分区”(逻辑卷)。
*逻辑卷(Logical Volume,LV)
LVM的逻辑卷类似于非LVM系统中的硬盘分区,在逻辑卷之上可以建立文件系统(比如/home或者/usr等)。
*PE(Physical Extent,PE)
每一个物理卷被划分为称为PE(Physical Extent)的基本单元,具有唯一编号的PE是可以被LVM寻址的最小单元。PE的大小是可配置的,默认为4MB。
*LE(Logical Extent,LE)
逻辑卷也被划分为被称为LE(Logical Extent)的可被寻址的基本单位。在同一个卷组中,LE的大小和PE是相同的,并且一一对应。
首先可以看到,物理卷(PV)由大小等同的基本单元PE组成。
一个卷组由一个或多个物理卷组成。
从上图可以看到,PE和LE有着一一对应的关系。逻辑卷建立在卷组上。逻辑卷就相当于非LVM系统的磁盘分区,可以在其上创建文件系统。
下图是磁盘分区、卷组、逻辑卷和文件系统之间的逻辑关系的示意图:
和非LVM系统将包含分区信息的元数据保存在位于分区的起始位置的分区表中一样,逻辑卷以及卷组相关的元数据也是保存在位于物理卷起始处的VGDA(卷组描述符区域)中。VGDA包括以下内容:PV描述符、VG描述符、LV描述符、和一些PE描述符。
系统启动LVM时激活VG,并将VGDA加载至内存,来识别LV的实际物理存储位置。当系统进行I/O操作时,就会根据VGDA建立的映射机制来访问实际的物理位置
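把上面这些概念串起来,一个典型的 LVM 建立流程大致如下(设备名、卷组名和大小均为假设,仅作示意):
pvcreate /dev/sdb1 /dev/sdc1              # 把分区初始化为物理卷(PV)
vgcreate vg_data /dev/sdb1 /dev/sdc1      # 用两个PV组成卷组(VG)
lvcreate -L 20G -n lv_mysql vg_data       # 在卷组上创建20G的逻辑卷(LV)
mkfs.ext4 /dev/vg_data/lv_mysql           # 在逻辑卷上建立文件系统
lvextend -L +10G /dev/vg_data/lv_mysql    # 以后可以在线扩展逻辑卷的容量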
每个 inode 与 block 都有编号,至于这三个数据的意义可以简略说明如下:
superblock:记录此 filesystem 的整体信息,包括 inode/block 的总量、使用量、剩余量,以及文件系统的格式与相关信息等;
inode:记录档案的属性,一个档案占用一个 inode,同时记录此档案的数据所在的 block 号码;
block:实际记录档案的内容,若档案太大时,会占用多个 block。
23、apparmor
AppArmor是一个高效和易于使用的Linux系统安全应用程序。AppArmor对操作系统和应用程序所受到的威胁进行从内到外的保护,甚至可以防御未被发现的0day漏洞和未知的应用程序漏洞所导致的攻击。AppArmor安全策略可以完全定义个别应用程序可以访问的系统资源与各自的特权。AppArmor包含大量的默认策略,并将先进的静态分析和基于学习的工具结合起来,使非常复杂的应用也可以在很短的时间内成功应用策略。
https://linuxcontainers.org/news/
http://www.ibm.com/developerworks/cn/linux/l-lxc-containers/
24、PCI
我们先来看一个例子,我的电脑装有1G的RAM,1G以后的物理内存地址空间都是外部设备IO在系统内存地址空间上的映射。 /proc/iomem描述了系统中所有的设备I/O在内存地址空间上的映射。我们来看地址从1G开始的第一个设备在/proc/iomem中是如何描述的:
40000000-400003ff : 0000:00:1f.1
这是一个PCI设备,40000000-400003ff是它所映射的内存地址空间,占据了内存地址空间中1024 bytes的位置,而0000:00:1f.1则是一个PCI外设的地址,它以冒号和点号分隔为4个部分:第一部分16位表示域,第二部分8位表示总线编号,第三部分5位表示设备号,最后3位表示功能号。
因为PCI规范允许单个系统拥有高达256个总线,所以总线编号是8位。但对于大型系统而言,这是不够的,所以,引入了域的概念,每个 PCI域可以拥有最多256个总线,每个总线上可支持32个设备,所以设备号是5位,而每个设备上最多可有8种功能,所以功能号是3位。由此,我们可以得出上述的PCI设备的地址是0号域0号总线上的31号设备上的1号功能。
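可以用 lspci 对照确认这个地址对应的具体设备(地址沿用上文示例,实际系统中请替换成自己的地址):
lspci -s 0000:00:1f.1          # 查看 0号域0号总线31号设备1号功能 对应的设备
grep 00:1f.1 /proc/iomem       # 查看它在内存地址空间上的映射区间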
编程范型或编程范式(英语:Programming paradigm),(范即模范之意,范式即模式、方法),是一类典型的编程风格,是指从事软件工程的一类典型的风格(可以对照方法学)。如:函数式编程、程序编程、面向对象编程、指令式编程等等为不同的编程范型。
编程范型提供了(同时决定了)程序员对程序执行的看法。例如,在面向对象编程中,程序员认为程序是一系列相互作用的对象,而在函数式编程中一个程序会被看作是一个无状态的函数计算的串行。
正如软件工程中不同的群体会提倡不同的“方法学”一样,不同的编程语言也会提倡不同的“编程范型”。一些语言是专门为某个特定的范型设计的(如Smalltalk和Java支持面向对象编程,而Haskell和Scheme则支持函数式编程),同时还有另一些语言支持多种范型(如Ruby、Common Lisp、Python和Oz)。
25、范型
很多编程范型已经被熟知他们禁止使用哪些技术,同时允许使用哪些。 例如,纯粹的函数式编程不允许有副作用[1];结构化编程不允许使用goto。可能是因为这个原因,新的范型常常被那些惯于较早的风格的人认为是教条主义或过分严格。然而,这样避免某些技术反而更加证明了关于程序正确性——或仅仅是理解它的行为——的法则,而不用限制程序语言的一般性。
编程范型和编程语言之间的关系可能十分复杂,由于一个编程语言可以支持多种范型。例如,C++设计时,支持过程化编程、面向对象编程以及泛型编程。然而,设计师和程序员们要考虑如何使用这些范型元素来构建一个程序。一个人可以用C++写出一个完全过程化的程序,另一个人也可以用C++写出一个纯粹的面向对象程序,甚至还有人可以写出杂糅了两种范型的程序。
26、python错误ImportError: No module named setuptools 解决方法
在python运行过程中出现如下错误:
python错误:ImportError: No module named setuptools
这句错误提示的表面意思是:没有setuptools的模块,说明python缺少这个模块,那我们只要安装这个模块即可解决此问题,下面我们来安装一下:
在命令行下:
下载setuptools包
shell# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
解压setuptools包
shell# tar zxvf setuptools-0.6c11.tar.gz
shell# cd setuptools-0.6c11
编译setuptools
shell# python setup.py build
开始执行setuptools安装
shell# python setup.py install
27、Mysql用户:
select distinct(User) from mysql.user;
格式:grant 权限 on 数据库名.表名 to '用户'@'登录主机' identified by '用户密码';
@ 后面是访问mysql的客户端IP地址(或是主机名),% 代表任意的客户端;如果填写 localhost,则为本地访问(此用户就不能远程访问该mysql数据库了)。
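按照上面的格式,一个假设性的授权例子如下(库名、用户名、密码均为演示用,请替换成实际值):
mysql -u root -p -e "GRANT SELECT,INSERT,UPDATE,DELETE ON testdb.* TO 'webuser'@'192.168.1.%' IDENTIFIED BY 'Demo_Pass123'; FLUSH PRIVILEGES;"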
DROP USER 'username'@'host';
删除行:
mysql> delete from pet where name="Whistler";
28、HBase
HBase 是一个分布式的、可扩展的非关系型开源数据库,它用 Java 很好地实现了 Google Bigtable 系统的大部分特性。
29、OpenStack
OpenStack是一个云平台管理项目,它不是单一的软件,而是由几个主要的组件组合起来完成具体的工作。OpenStack是一个旨在为公共及私有云的建设与管理提供软件的开源项目,包括控制器、计算(Nova)、存储(Swift)、消息队列(RabbitMQ)和网络(Quantum)等组件。
30、Hadoop
是一个非常优秀的分布式编程框架,设计精巧而且目前没有同级别同重量的替代品。
在Hadoop的架构中,Map和Reduce是两个最基本的处理阶段,之前有输入数据格式定义和数据分片,之后有输出数据格式定义,二者中间还可以实现combine(本地reduce操作)和partition(重定向mapper输出)这两种策略行为。
Hadoop不适合用来处理大批量的小文件。其实这是由namenode的局限性所决定的:如果文件过小,namenode存储的元信息相对来说就会占用过大比例的空间,无论内存还是磁盘开销都非常大。
31、浅谈TCP优化
Ilya Grigorik 在「High Performance Browser Networking」中做了很多细致的描述,让人读起来醍醐灌顶,我大概总结了
32、Mysql集群性能
整个集群由三类节点构成:数据节点、应用程序节点以及管理节点。
•数据节点通常负责数据访问与存储事务。
•应用程序节点提供由应用程序逻辑层及应用API指向数据节点的链接。
•管理节点在集群配置体系中的作用至关重要,并在网络分区环境下负责负载指派。
33、Shell脚本,备份:
备份网站内容
#!/bin/bash
#指定运行的脚本shell
#运行脚本要给用户执行权限
bakdir=/backup
month=`date +%m`
day=`date +%d`
year=`date +%Y`
hour=`date +%k`
min=`date +%M`
dirname=$year-$month-$day-$hour-$min
mkdir $bakdir/$dirname
mkdir $bakdir/$dirname/conf
mkdir $bakdir/$dirname/web
mkdir $bakdir/$dirname/db
#备份conf,检测通过
gzupload=upload.tgz
cp /opt/apache2/conf/httpd.conf $bakdir/$dirname/conf/httpd.conf
cd /opt/apache2/htdocs/php
tar -zcvf $bakdir/$dirname/web/$gzupload ./upload
#远程拷贝的目录要有可写权限
scp -r /backup/$dirname root@10.1.1.178:/backup
备份数据库:
#!/bin/bash
#指定运行的脚本shell
#运行脚本要给用户执行权限
bakdir=/backup
month=`date +%m`
day=`date +%d`
year=`date +%Y`
hour=`date +%k`
min=`date +%M`
dirname=$year-$month-$day-$hour-$min
mkdir $bakdir/$dirname
mkdir $bakdir/$dirname/conf
mkdir $bakdir/$dirname/web
mkdir $bakdir/$dirname/db
#热备份数据库
cp /opt/mysql/my.cnf $bakdir/$dirname/db/my.cnf
cd /opt/mysql
mysqldump --opt -u zhy --password=1986 test > $bakdir/$dirname/db/test.sql
mysqldump --opt -u zhy --password=1986 phpwind > $bakdir/$dirname/db/phpwind.sql
#远程拷贝的目录要有可写权限
scp -r /backup/$dirname root@10.1.1.178:/backup
MySQL的热备份脚本
本脚本是mysqldump --opt的补充:
#!/bin/bash
PATH=/usr/local/sbin:/usr/bin:/bin
# The Directory of Backup
BACKDIR=/usr/mysql_backup
# The Password of MySQL
ROOTPASS=password
# Remake the Directory of Backup
rm -rf $BACKDIR
mkdir -p $BACKDIR
# Get the Name of Database
DBLIST=`ls -p /var/lib/mysql | grep / | tr -d /`
# Backup with Database
for dbname in $DBLIST
do
mysqlhotcopy $dbname -u root -p $ROOTPASS $BACKDIR | logger -t mysqlhotcopy
done
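这类备份脚本一般会配合 crontab 定时执行,例如(脚本路径为假设):
# 每天凌晨2点执行备份,并把输出追加到日志
0 2 * * * /root/scripts/mysql_backup.sh >> /var/log/mysql_backup.log 2>&1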
34、RPS和RFS
• RPS 全称是 Receive Packet Steering, 这是Google工程师 Tom Herbert (therbert@google.com) 提交的内核补丁, 在2.6.35进入Linux内核. 这个patch采用软件模拟的方式,实现了多队列网卡所提供的功能,分散了在多CPU系统上数据接收时的负载, 把软中断分到各个CPU处理,而不需要硬件支持,大大提高了网络性能。
• RFS 全称是 Receive Flow Steering, 这也是Tom提交的内核补丁,它是用来配合RPS补丁使用的,是RPS补丁的扩展补丁,它把接收的数据包送达应用所在的CPU上,提高cache的命中率。
• 这两个补丁往往都是一起设置,来达到最好的优化效果, 主要是针对单队列网卡多CPU环境(多队列多重中断的网卡也可以使用该补丁的功能,但多队列多重中断网卡有更好的选择:SMP IRQ affinity)
原理
RPS: RPS实现了数据流的hash归类,并把软中断的负载均衡分到各个cpu,实现了类似多队列网卡的功能。由于RPS只是单纯的把同一流的数据包分发给同一个CPU核来处理了,但是有可能出现这样的情况,即给该数据流分发的CPU核和执行处理该数据流的应用程序的CPU核不是同一个:数据包均衡到不同的 cpu,这个时候如果应用程序所在的cpu和软中断处理的cpu不是同一个,此时对于cpu cache的影响会很大。那么RFS补丁就是用来确保应用程序处理的cpu跟软中断处理的cpu是同一个,这样就充分利用cpu的cache。
• 应用RPS之前: 所有数据流被分到某个CPU,多CPU没有被合理利用,造成瓶颈
• 应用RPS之后: 同一流的数据包被分到同个CPU核来处理,但可能出现CPU cache迁跃
• 应用RPS+RFS之后: 同一流的数据包被分到应用所在的CPU核
必要条件
使用RPS和RFS功能,需要有大于等于2.6.35版本的Linux kernel.
如何判断内核版本:
$ uname -r
2.6.38-2-686-bigmem
对比测试
类别 | 测试客户端 | 测试服务端
型号 | BladeCenter HS23p | BladeCenter HS23p
CPU | Xeon E5-2609 | Xeon E5-2630
网卡 | Broadcom NetXtreme II BCM5709S Gigabit Ethernet | Emulex Corporation OneConnect 10Gb NIC
内核 | 3.2.0-2-amd64 | 3.2.0-2-amd64
内存 | 62GB | 66GB
系统 | Debian 6.0.4 | Debian 6.0.5
超线程 | 否 | 是
CPU核 | 4 | 6
驱动 | bnx2 | be2net
客户端: netperf
服务端:netserver
RPS cpu bitmap测试分类: 0(不开启rps功能)、one cpu per queue(每队列绑定到1个CPU核上)、all cpus per queue(每队列绑定到所有cpu核上),不同分类的设置值如下(三组设置的批量脚本示例见下文):
1) 0(不开启rps功能)
/sys/class/net/eth0/queues/rx-0/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-1/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-2/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-3/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-4/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-5/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-6/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-7/rps_cpus 00000000
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-4/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-5/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-6/rps_flow_cnt 0
/sys/class/net/eth0/queues/rx-7/rps_flow_cnt 0
/proc/sys/net/core/rps_sock_flow_entries 0
2) one cpu per queue(每队列绑定到1个CPU核上)
/sys/class/net/eth0/queues/rx-0/rps_cpus 00000001
/sys/class/net/eth0/queues/rx-1/rps_cpus 00000002
/sys/class/net/eth0/queues/rx-2/rps_cpus 00000004
/sys/class/net/eth0/queues/rx-3/rps_cpus 00000008
/sys/class/net/eth0/queues/rx-4/rps_cpus 00000010
/sys/class/net/eth0/queues/rx-5/rps_cpus 00000020
/sys/class/net/eth0/queues/rx-6/rps_cpus 00000040
/sys/class/net/eth0/queues/rx-7/rps_cpus 00000080
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-4/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-5/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-6/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-7/rps_flow_cnt 4096
/proc/sys/net/core/rps_sock_flow_entries 32768
3) all cpus per queue(每队列绑定到所有cpu核上)
/sys/class/net/eth0/queues/rx-0/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-1/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-2/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-3/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-4/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-5/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-6/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-7/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-4/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-5/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-6/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-7/rps_flow_cnt 4096
/proc/sys/net/core/rps_sock_flow_entries 32768
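上面三组设置如果手工逐个 echo 会比较繁琐,可以用类似下面的小脚本批量完成(以“all cpus per queue”这一组为例,网卡名和 CPU 掩码请按实际环境修改):
for q in /sys/class/net/eth0/queues/rx-*; do
    echo ff > $q/rps_cpus           # 每个接收队列都允许全部8个CPU参与软中断处理
    echo 4096 > $q/rps_flow_cnt     # 每队列的flow表大小
done
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries   # 全局flow表大小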
测试方法: 每种测试类型执行3次,中间睡眠10秒, 每种测试类型分别执行100、500、1500个实例,每实例测试时间长度为60秒。
TCP_RR 1 byte: 测试TCP 小数据包 request/response的性能
netperf -t TCP_RR -H $serverip -c -C -l 60
UDP_RR 1 byte: 测试UDP 小数据包 request/response的性能
netperf -t UDP_RR -H $serverip -c -C -l 60
TCP_RR 256 byte: 测试TCP 大数据包 request/response的性能
netperf -t TCP_RR -H $serverip -c -C -l 60 -- -r 256,256
UDP_RR 256 byte: 测试UDP 大数据包 request/response的性能
netperf -t UDP_RR -H $serverip -c -C -l 60 -- -r 256,256
TPS测试结果
TCP_RR 1 byte小包测试结果
TCP_RR 256 byte大包测试结果
UDP_RR 1 byte小包测试结果
UDP_RR 256 byte大包测试结果
CPU负载变化
在测试过程中,使用mpstat收集各个CPU核的负载变化
1. 关闭RPS/RFS: 可以看出关闭RPS/RFS时,软中断的负载都在cpu0上,并没有有效的利用多CPU的特性,导致了性能瓶颈。
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
Average: all 3.65 0.00 35.75 0.05 0.01 14.56 0.00 0.00 45.98
Average: 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
Average: 1 4.43 0.00 37.76 0.00 0.11 11.49 0.00 0.00 46.20
Average: 2 5.01 0.00 45.80 0.00 0.00 0.00 0.00 0.00 49.19
Average: 3 5.11 0.00 45.07 0.00 0.00 0.00 0.00 0.00 49.82
Average: 4 3.52 0.00 40.38 0.14 0.00 0.00 0.00 0.00 55.96
Average: 5 3.85 0.00 39.91 0.00 0.00 0.00 0.00 0.00 56.24
Average: 6 3.62 0.00 40.48 0.14 0.00 0.00 0.00 0.00 55.76
Average: 7 3.87 0.00 38.86 0.11 0.00 0.00 0.00 0.00 57.16
2. 每队列关联到一个CPU,TCP_RR: 可以看出软中断负载已经能分散到各个CPU核上,有效利用了多CPU的特性,大大提高了系统的网络性能。
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
Average: all 5.58 0.00 59.84 0.01 0.00 22.71 0.00 0.00 11.86
Average: 0 2.16 0.00 20.85 0.00 0.04 72.03 0.00 0.00 4.93
Average: 1 4.68 0.00 46.27 0.00 0.00 42.73 0.00 0.00 6.32
Average: 2 6.76 0.00 63.79 0.00 0.00 11.03 0.00 0.00 18.42
Average: 3 6.61 0.00 65.71 0.00 0.00 11.51 0.00 0.00 16.17
Average: 4 5.94 0.00 67.83 0.07 0.00 11.59 0.00 0.00 14.58
Average: 5 5.99 0.00 69.42 0.04 0.00 12.54 0.00 0.00 12.01
Average: 6 5.94 0.00 69.41 0.00 0.00 12.86 0.00 0.00 11.78
Average: 7 6.13 0.00 69.61 0.00 0.00 14.48 0.00 0.00 9.77
3. 每队列关联到一个CPU,UDP_RR: CPU负载未能均衡地分布到各个CPU, 这是由于网卡hash计算在UDP包上的不足,详细请见本文后记部分。
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
Average: all 3.01 0.00 29.84 0.07 0.01 13.35 0.00 0.00 53.71
Average: 0 0.00 0.00 0.08 0.00 0.00 90.01 0.00 0.00 9.91
Average: 1 3.82 0.00 32.87 0.00 0.05 12.81 0.00 0.00 50.46
Average: 2 4.84 0.00 37.53 0.00 0.00 0.14 0.00 0.00 57.49
Average: 3 4.90 0.00 37.92 0.00 0.00 0.16 0.00 0.00 57.02
Average: 4 2.57 0.00 32.72 0.20 0.00 0.09 0.00 0.00 64.42
Average: 5 2.66 0.00 33.54 0.11 0.00 0.08 0.00 0.00 63.60
Average: 6 2.75 0.00 32.81 0.09 0.00 0.06 0.00 0.00 64.30
Average: 7 2.71 0.00 32.66 0.17 0.00 0.06 0.00 0.00 64.40
4. 每队列关联到所有CPU: 可以看出软中断负载已经能分散到各个CPU核上,有效利用了多CPU的特性,大大提高了系统的网络性能
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
Average: all 5.39 0.00 59.97 0.00 0.00 22.57 0.00 0.00 12.06
Average: 0 1.46 0.00 21.83 0.04 0.00 72.08 0.00 0.00 4.59
Average: 1 4.45 0.00 46.40 0.00 0.04 43.39 0.00 0.00 5.72
Average: 2 6.84 0.00 65.62 0.00 0.00 11.39 0.00 0.00 16.15
Average: 3 6.71 0.00 67.13 0.00 0.00 12.07 0.00 0.00 14.09
Average: 4 5.73 0.00 66.97 0.00 0.00 10.71 0.00 0.00 16.58
Average: 5 5.74 0.00 68.57 0.00 0.00 13.02 0.00 0.00 12.67
Average: 6 5.79 0.00 69.27 0.00 0.00 12.31 0.00 0.00 12.63
Average: 7 5.96 0.00 68.98 0.00 0.00 12.00 0.00 0.00 13.06
结果分析
以下结果只是针对测试服务器特定硬件及系统的数据,在不同测试对象的RPS/RFS测试结果可能有不同的表现。
TCP性能:
• 在没有打开RPS/RFS的情况下,随着进程数的增加,TCP tps性能并没有明显提升,在184~188k之间。
• 打开RPS/RFS之后,随着RPS导致软中断被分配到所有CPU上和RFS增加的cache命中, 小数据包(1字节)及大数据包(256字节,相对小数据包而言, 而不是实际应用中的大数据包)的tps性能都有显著提升
• 100个进程提升40%的性能(两种RPS/RFS设置的性能结果一致),cpu负载升高40%
• 500个进程提升70%的性能(两种RPS/RFS设置的性能结果一致),cpu负载升高62%
• 1500个进程提升75%的性能(两种RPS/RFS设置的性能结果一致),cpu负载升高77%
UDP性能:
• 在没有打开RPS/RFS的情况下,随着进程数的增加,UDP tps性能并没有明显提升,在226~235k之间。
• 打开RPS/RFS之后,随着RPS导致软中断被分配到所有CPU上和RFS增加的cache命中, 小数据包(1字节)及大数据包(256字节,相对小数据包而言, 而不是实际应用中的大数据包)的TPS性能, 在每队列关联到所有CPU的情况下有显著提升, 而每队列关联到一个CPU后反倒导致了UDP tps性能下降1% (这是bnx2网卡不支持UDP port hash及此次测试的局限性造成的结果, 详细分析见: 后记)
• 每队列关联到所有CPU的情况下, 在100个进程时小包提升40%的性能, cpu负载升高60%; 大包提升33%, cpu负载升高47%
• 每队列关联到所有CPU的情况下, 在500个进程时小包提升62%的性能, cpu负载升高71%; 大包提升60%, cpu负载升高65%
• 每队列关联到所有CPU的情况下, 在1500个进程时小包提升65%的性能, cpu负载升高75%; 大包提升64%, cpu负载升高74%
后记
UDP在每队列绑定到一个CPU时性能下降,而绑定到所有CPU时,却有性能提升,这一问题涉及到几个因素,当这几个因素凑一起时,导致了这种奇特的表现。
• 此次测试的局限性:本次测试是1对1的网络测试,产生的数据包的IP地址都是相同的
• bnx2 网卡在RSS hash上,不支持UDP Port,也就是说,网卡在对TCP数据流进行队列选择时的hash包含了ip和port, 而在UDP上的hash, 只有IP地址,导致了本次测试(上面的局限性影响)的UDP数据包的hash结果都是一样的,数据包被转送到同一条队列。
• 单单上面两个因素,还无法表现出UDP在每队列绑定到一个CPU时性能下降,而绑定到所有CPU时却有性能提升的现象。因为RPS/RFS本身也有hash计算,也就是进入队列后的数据包,还需要经过RPS/RFS的hash计算(这里的hash支持udp port),然后进行第二次数据包转送选择;如果每队列绑定到一个CPU,系统直接跳过第二次hash计算,数据包直接分配到该队列关联的CPU处理,也就导致了在第一次hash计算后被错误转送到某一队列的UDP数据包将直接送到该CPU处理,导致了性能的下降;而如果是每队列绑定到所有CPU,那么进入队列后的数据包会在第二次hash时被重新分配,修正了第一次hash的错误选择。
相关对比测试
1. SMP IRQ affinity:http://www.igigo.net/archives/231
参考资料
• Software receive packetsteering
• Receive Packet Steering
• Receive packet steering
• Receive Flow Steering
• linux kernel 2.6.35中RFS特性详解
• Linux 2.6.35 新增特性 RPS RFS
• kernel/Documentation/networking/scaling.txt
35、SLES9下配置 IP Bonding 的步骤
To avoid problems it is advisable that all network cards use the same driver. If they use different drivers, please take the following into consideration:
There are three driver-dependent methods to check whether a network card has a link or a network connection.
* MII link status detection
* Register in the driver netif_carrier
* ARP monitoring
It is very important that the used drivers support the same method. If this is not the case because e.g. the first network card driver only supports MII link status detection whereas the second driver just supports netif_carrier, the only solution is to replace the network card in order to use a different driver.
To find out what method is supported by your driver, proceed as follows:
* MII link status can be determined with the tools mii-tool or ethtool.
* In the case of netif_carrier and ARP monitoring, refer to the driver's source code to find out whether these methods are supported or not. The corresponding kernel sources must be installed for this purpose. Regarding netif_carrier, search exactly for this string in the driver's source code, e.g.
grep netif_carrier via-rhine.c
As for the ARP monitoring method, the driver must support either the register last_rx or trans_start. Thus, you can search in the driver's source code for:
grep "last_rx\|trans_start" via-rhine.c
Start with the setup only after having verified this.
Procedure
In this sample scenario, two network cards will be combined by way of bonding mode=1 (active backup).
1. Configure your network cards with YaST. Allocate the IP address that must be used for the bonding device to one network card and a dummy IP address to the rest of the network cards.
2. Copy the configuration of the network card with the right IP address to a file ifcfg-bond0.
cd /etc/sysconfig/network
cp ifcfg-eth-id-xx:xx:xx:xx:xx:01 ifcfg-bond0
3. Find out and write down the PCI IDs of all the involved network cards.
For example:
linux:~ # grep bus-pci ifcfg-eth-id-xx:xx:xx:xx:xx:01
_nm_name='bus-pci-0000:00:09.0'
linux:~ # grep bus-pci ifcfg-eth-id-xx:xx:xx:xx:xx:02
_nm_name='bus-pci-0000:00:0a.0'
linux:~ #
4. Edit the file ifcfg-bond0 previously created and insert the following lines.
BONDING_MASTER=yes
BONDING_SLAVE_0='bus-pci-0000:00:09.0'
BONDING_SLAVE_1='bus-pci-0000:00:0a.0'
Now insert the options for the bonding module. Depending on what link detection method you are using, the line may look like this:
* MII link detection method
BONDING_MODULE_OPTS='miimon=100 mode=1 use_carrier=0'
* netif_carrier method
BONDING_MODULE_OPTS='miimon=100 mode=1 use_carrier=1'
* ARP monitoring method
BONDING_MODULE_OPTS='arp_interval=2500 arp_ip_target=192.168.1.1 mode=1'
5. Remove the old configuration files
linux:~ # rm ifcfg-eth-id-xx:xx:xx:xx:xx:01
linux:~ # rm ifcfg-eth-id-xx:xx:xx:xx:xx:02
6. Restart the network with
rcnetwork restart
Additional Information
Occasionally it has been experienced that not all network interfaces come up after a system reboot. To prevent this, the loading of the modules should start earlier during the reboot process. The following procedure is helpful in this case:
1. Edit the file /etc/sysconfig/kernel and add this line:
MODULES_LOADED_ON_BOOT="bcm5700"
2. Reboot the server and check the status of all network interfaces, using commands lspci and ifconfig.
3. If this method is not successful, edit the file /etc/sysconfig/kernel again and remove the line inserted at step 1. Modify the line containing the INITRD_MODULES statement; add the bcm5700 to this line. It should read like INITRD_MODULES="cdrom scsi_mod ide-cd ehci-hcd reiserfs bcm5700"
4. Call command mkinitrd
5. Reboot the server as in step 2
Another method is to delay the starting of the network interfaces after loading the modules. To do this, edit the file /etc/sysconfig/network/config and change the variable WAIT_FOR_INTERFACES to the wanted delay in seconds. To delay the interfaces 3 seconds, enter
WAIT_FOR_INTERFACES=3
Reboot the server to verify the success ofthis measure.
当然也可以采用一些简单的办法,例如直接修改 /etc/init.d/network 网络启动脚本。
在 start) 部分的结尾处添加 IP bonding 的手工脚本,例如:
ifconfig eth0 0.0.0.0
ifconfig eth1 0.0.0.0
modprobe bonding miimon=100 mode=1 use_carrier=1
ifconfig bond0 192.168.1.123 netmask 255.255.255.0
ifenslave bond0 eth0
ifenslave bond0 eth1
route add default gw 192.168.1.1
然后在 stop) 部分开始考虑添加:
ifdown bond0
rmmod bonding
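手工配置完成后,可以通过 /proc 下的状态文件确认 bonding 是否生效(bond0 为上文示例中的设备名):
cat /proc/net/bonding/bond0     # 查看当前模式、活动slave及链路状态
ifconfig bond0                  # 确认IP地址已经绑定到bond0上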
———————————————————————
Introduction
The Linux bonding driver provides a method for aggregating multiple network interfaces into a single logical "bonded" interface. The behavior of the bonded interfaces depends upon the mode; generally speaking, modes provide either hot standby or load balancing services. Additionally, link integrity monitoring may be performed.
The bonding driver originally came from Donald Becker's beowulf patches for kernel 2.0. It has changed quite a bit since, and the original tools from extreme-linux and beowulf sites will not work with this version of the driver.
For new versions of the driver, updated userspace tools, and who to ask for help, please follow the links at the end of this file.
2. Bonding Driver Options
Options for the bonding driver are supplied as parameters to the bonding module at load time. They may be given as command line arguments to the insmod or modprobe command, but are usually specified in either the /etc/modules.conf or /etc/modprobe.conf configuration file, or in a distro-specific configuration file (some of which are detailed in the next section).
The available bonding driver parameters are listed below. If a parameter is not specified the default value is used. When initially configuring a bond, it is recommended "tail -f /var/log/messages" be run in a separate window to watch for bonding driver error messages.
It is critical that either the miimon or arp_interval and arp_ip_target parameters be specified, otherwise serious network degradation will occur during link failures. Very few devices do not support at least miimon, so there is really no reason not to use it.
Options with textual values will accept either the text name or, for backwards compatibility, the option value. E.g., "mode=802.3ad" and "mode=4" set the same mode.
The parameters are as follows:
arp_interval
Specifies the ARP link monitoring frequency in milliseconds. If ARP monitoring is used in an etherchannel compatible mode (modes 0 and 2), the switch should be configured in a mode that evenly distributes packets across all links. If the switch is configured to distribute the packets in an XOR fashion, all replies from the ARP targets will be received on the same link which could cause the other team members to fail. ARP monitoring should not be used in conjunction with miimon. A value of 0 disables ARP monitoring. The default value is 0.
arp_ip_target
Specifies the IP addresses to use as ARP monitoring peers when arp_interval is > 0. These are the targets of the ARP request sent to determine the health of the link to the targets. Specify these values in ddd.ddd.ddd.ddd format. Multiple IP addresses must be separated by a comma. At least one IP address must be given for ARP monitoring to function. The maximum number of targets that can be specified is 16. The default value is no IP addresses.
downdelay
Specifies the time, in milliseconds, to wait before disabling a slave after a link failure has been detected. This option is only valid for the miimon link monitor. The downdelay value should be a multiple of the miimon value; if not, it will be rounded down to the nearest multiple. The default value is 0.
lacp_rate
Option specifying the rate in which we'll ask our link partner to transmit LACPDU packets in 802.3ad mode. Possible values are:
slow or 0
Request partner to transmit LACPDUs every 30 seconds
fast or 1
Request partner to transmit LACPDUs every 1 second
The default is slow.
max_bonds
Specifies the number of bonding devices to create for this instance of the bonding driver. E.g., if max_bonds is 3, and the bonding driver is not already loaded, then bond0, bond1 and bond2 will be created. The default value is 1.
miimon
Specifies the MII link monitoring frequency in milliseconds. This determines how often the link state of each slave is inspected for link failures. A value of zero disables MII link monitoring. A value of 100 is a good starting point. The use_carrier option, below, affects how the link state is determined. See the High Availability section for additional information. The default value is 0.
mode
Specifies one of the bonding policies. The default is balance-rr (round robin). Possible values are:
balance-rr or 0
Round-robin policy: Transmit packets in sequential order from the first available slave through the last. This mode provides load balancing and fault tolerance.
active-backup or 1
Active-backup policy: Only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch.
In bonding version 2.6.2 or later, when a failover occurs in active-backup mode, bonding will issue one or more gratuitous ARPs on the newly active slave. One gratuitous ARP is issued for the bonding master interface and each VLAN interface configured above it, provided that the interface has at least one IP address configured. Gratuitous ARPs issued for VLAN interfaces are tagged with the appropriate VLAN id.
This mode provides fault tolerance. The primary option, documented below, affects the behavior of this mode.
balance-xor or 2
XOR policy: Transmit based on the selected transmit hash policy. The default policy is a simple [(source MAC address XOR'd with destination MAC address) modulo slave count]. Alternate transmit policies may be selected via the xmit_hash_policy option, described below.
This mode provides load balancing and fault tolerance.
broadcast or 3
Broadcast policy: transmits everything on all slave interfaces. This mode provides fault tolerance.
802.3ad or 4
IEEE 802.3ad Dynamic link aggregation. Creates aggregation groups that share the same speed and duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification.
Slave selection for outgoing traffic is done according to the transmit hash policy, which may be changed from the default simple XOR policy via the xmit_hash_policy option, documented below. Note that not all transmit policies may be 802.3ad compliant, particularly in regards to the packet mis-ordering requirements of section 43.2.4 of the 802.3ad standard. Differing peer implementations will have varying tolerances for noncompliance.
Prerequisites:
1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
2. A switch that supports IEEE 802.3ad Dynamic link aggregation.
Most switches will require some type of configuration to enable 802.3ad mode.
balance-tlb or 5
Adaptive transmit load balancing: channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.
Prerequisite:
Ethtool support in the base drivers for retrieving the speed of each slave.
balance-alb or 6
Adaptive load balancing: includes balance-tlb plus receive load balancing (rlb) for IPV4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server.
Receive traffic from connections created by the server is also balanced. When the local system sends an ARP Request the bonding driver copies and saves the peer's IP information from the ARP packet. When the ARP Reply arrives from the peer, its hardware address is retrieved and the bonding driver initiates an ARP reply to this peer assigning it to one of the slaves in the bond. A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast it uses the hardware address of the bond. Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave. This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed. Receive traffic is also redistributed when a new slave is added to the bond and when an inactive slave is re-activated. The receive load is distributed sequentially (round robin) among the group of highest speed slaves in the bond.
When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all active slaves in the bond by initiating ARP Replies with the selected mac address to each of the clients. The updelay parameter (detailed below) must be set to a value equal or greater than the switch's forwarding delay so that the ARP Replies sent to the peers will not be blocked by the switch.
Prerequisites:
1. Ethtool support in the base drivers for retrieving the speed of each slave.
2. Base driver support for setting the hardware address of a device while it is open. This is required so that there will always be one slave in the team using the bond hardware address (the curr_active_slave) while having a unique hardware address for each slave in the bond. If the curr_active_slave fails its hardware address is swapped with the new curr_active_slave that was chosen.
primary
A string (eth0, eth2, etc) specifying which slave is the primary device. The specified device will always be the active slave while it is available. Only when the primary is off-line will alternate devices be used. This is useful when one slave is preferred over another, e.g., when one slave has higher throughput than another.
The primary option is only valid for active-backup mode.
updelay
Specifies the time, in milliseconds, to wait before enabling a slave after a link recovery has been detected. This option is only valid for the miimon link monitor. The updelay value should be a multiple of the miimon value; if not, it will be rounded down to the nearest multiple. The default value is 0.
use_carrier
Specifies whether or not miimon should use MII or ETHTOOL ioctls vs. netif_carrier_ok() to determine the link status. The MII or ETHTOOL ioctls are less efficient and utilize a deprecated calling sequence within the kernel. The netif_carrier_ok() relies on the device driver to maintain its state with netif_carrier_on/off; at this writing, most, but not all, device drivers support this facility.
If bonding insists that the link is up when it should not be, it may be that your network device driver does not support netif_carrier_on/off. The default state for netif_carrier is "carrier on," so if a driver does not support netif_carrier, it will appear as if the link is always up. In this case, setting use_carrier to 0 will cause bonding to revert to the MII / ETHTOOL ioctl method to determine the link state.
A value of 1 enables the use of netif_carrier_ok(), a value of 0 will use the deprecated MII / ETHTOOL ioctls. The default value is 1.
xmit_hash_policy
Selects the transmit hash policy to use for slave selection in balance-xor and 802.3ad modes. Possible values are:
layer2
Uses XOR of hardware MAC addresses to generate the hash. The formula is
(source MAC XOR destination MAC) modulo slave count
This algorithm will place all traffic to a particular network peer on the same slave. This algorithm is 802.3ad compliant.
layer3+4
This policy uses upper layer protocol information, when available, to generate the hash. This allows for traffic to a particular network peer to span multiple slaves, although a single connection will not span multiple slaves.
The formula for unfragmented TCP and UDP packets is
((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count
For fragmented TCP or UDP packets and all other IP protocol traffic, the source and destination port information is omitted. For non-IP traffic, the formula is the same as for the layer2 transmit hash policy.
This policy is intended to mimic the behavior of certain switches, notably Cisco switches with PFC2 as well as some Foundry and IBM products.
This algorithm is not fully 802.3ad compliant. A single TCP or UDP conversation containing both fragmented and unfragmented packets will see packets striped across two interfaces. This may result in out of order delivery. Most traffic types will not meet this criteria, as TCP rarely fragments traffic, and most UDP traffic is not involved in extended conversations. Other implementations of 802.3ad may or may not tolerate this noncompliance.
The default value is layer2. This option was added in bonding version 2.6.3. In earlier versions of bonding, this parameter does not exist, and the layer2 policy is the only policy.
3. Configuring Bonding Devices
There are, essentially, two methods for configuring bonding: with support from the distro's network initialization scripts, and without. Distros generally use one of two packages for the network initialization scripts: initscripts or sysconfig. Recent versions of these packages have support for bonding, while older versions do not.
We will first describe the options for configuring bonding for distros using versions of initscripts and sysconfig with full or partial support for bonding, then provide information on enabling bonding without support from the network initialization scripts (i.e., older versions of initscripts or sysconfig).
If you're unsure whether your distro uses sysconfig or initscripts, or don't know if it's new enough, have no fear. Determining this is fairly straightforward.
First, issue the command:
$ rpm -qf /sbin/ifup
It will respond with a line of text starting with either "initscripts" or "sysconfig," followed by some numbers. This is the package that provides your network initialization scripts.
Next, to determine if your installation supports bonding, issue the command:
$ grep ifenslave /sbin/ifup
If this returns any matches, then your initscripts or sysconfig has support for bonding.
3.1 Configuration with sysconfig support
This section applies to distros using a version of sysconfig with bonding support, for example, SuSE Linux Enterprise Server 9.
SuSE SLES 9's networking configuration system does support bonding, however, at this writing, the YaST system configuration frontend does not provide any means to work with bonding devices. Bonding devices can be managed by hand, however, as follows.
First, if they have not already been configured, configure the slave devices. On SLES 9, this is most easily done by running the yast2 sysconfig configuration utility. The goal is to create an ifcfg-id file for each slave device. The simplest way to accomplish this is to configure the devices for DHCP (this is only to get the ifcfg-id file created; see below for some issues with DHCP). The name of the configuration file for each device will be of the form:
ifcfg-id-xx:xx:xx:xx:xx:xx
Where the "xx" portion will be replaced with the digits from the device's permanent MAC address.
Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been created, it is necessary to edit the configuration files for the slave devices (the MAC addresses correspond to those of the slave devices). Before editing, the file will contain multiple lines, and will look something like this:
BOOTPROTO='dhcp'
STARTMODE='on'
USERCTL='no'
UNIQUE='XNzu.WeZGOGF+4wE'
_nm_name='bus-pci-0001:61:01.0'
Change the BOOTPROTO and STARTMODE lines to the following:
BOOTPROTO='none'
STARTMODE='off'
Do not alter the UNIQUE or _nm_name lines. Remove any other lines (USERCTL, etc).
Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified, it's time to create the configuration file for the bonding device itself. This file is named ifcfg-bondX, where X is the number of the bonding device to create, starting at 0. The first such file is ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig network configuration system will correctly start multiple instances of bonding.
The contents of the ifcfg-bondX file is as follows:
BOOTPROTO="static"
BROADCAST="10.0.2.255"
IPADDR="10.0.2.10"
NETMASK="255.255.0.0"
NETWORK="10.0.2.0"
REMOTE_IPADDR=""
STARTMODE="onboot"
BONDING_MASTER="yes"
BONDING_MODULE_OPTS="mode=active-backup miimon=100"
BONDING_SLAVE0="eth0"
BONDING_SLAVE1="bus-pci-0000:06:08.1"
Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK values with the appropriate values for your network.
The STARTMODE specifies when the device is brought online. The possible values are:
onboot: The device is started at boot time. If you're not sure, this is probably what you want.
manual: The device is started only when ifup is called manually. Bonding devices may be configured this way if you do not wish them to start automatically at boot for some reason.
hotplug: The device is started by a hotplug event. This is not a valid choice for a bonding device.
off or ignore: The device configuration is ignored.
The line BONDING_MASTER='yes' indicates that the device is a bonding master device. The only useful value is "yes."
The contents of BONDING_MODULE_OPTS are supplied to the instance of the bonding module for this device. Specify the options for the bonding mode, link monitoring, and so on here. Do not include the max_bonds bonding parameter; this will confuse the configuration system if you have multiple bonding devices.
Finally, supply one BONDING_SLAVEn="slave device" for each slave, where "n" is an increasing value, one for each slave. The "slave device" is either an interface name, e.g., "eth0", or a device specifier for the network device. The interface name is easier to find, but the ethN names are subject to change at boot time if, e.g., a device early in the sequence has failed. The device specifiers (bus-pci-0000:06:08.1 in the example above) specify the physical network device, and will not change unless the device's bus location changes (for example, it is moved from one PCI slot to another). The example above uses one of each type for demonstration purposes; most configurations will choose one or the other for all slave devices.
When all configuration files have been modified or created, networking must be restarted for the configuration changes to take effect. This can be accomplished via the following:
# /etc/init.d/network restart
Note that the network control script (/sbin/ifdown) will remove the bonding module as part of the network shutdown processing, so it is not necessary to remove the module by hand if, e.g., the module parameters have changed.
Also, at this writing, YaST/YaST2 will not manage bonding devices (they do not show bonding interfaces on its list of network devices). It is necessary to edit the configuration file by hand to change the bonding configuration.
Additional general options and details of the ifcfg file format can be found in an example ifcfg template file:
/etc/sysconfig/network/ifcfg.template
Note that the template does not document the various BONDING_ settings described above, but does describe many of the other options.
3.1.1 Using DHCP with sysconfig
Under sysconfig, configuring a device with BOOTPROTO='dhcp' will cause it to query DHCP for its IP address information. At this writing, this does not function for bonding devices; the scripts attempt to obtain the device address from DHCP prior to adding any of the slave devices. Without active slaves, the DHCP requests are not sent to the network.
3.1.2 Configuring Multiple Bonds with sysconfig
The sysconfig network initialization system is capable of handling multiple bonding devices. All that is necessary is for each bonding instance to have an appropriately configured ifcfg-bondX file (as described above). Do not specify the "max_bonds" parameter to any instance of bonding, as this will confuse sysconfig. If you require multiple bonding devices with identical parameters, create multiple ifcfg-bondX files.
Because the sysconfig scripts supply the bonding module options in the ifcfg-bondX file, it is not necessary to add them to the system /etc/modules.conf or /etc/modprobe.conf configuration file.
3.3 Configuring Bonding Manually
This section applies to distros whose network initialization scripts (the sysconfig or initscripts package) do not have specific knowledge of bonding. One such distro is SuSE Linux Enterprise Server version 8.
The general method for these systems is to place the bonding module parameters into /etc/modules.conf or /etc/modprobe.conf (as appropriate for the installed distro), then add modprobe and/or ifenslave commands to the system's global init script. The name of the global init script differs; for sysconfig, it is /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.
For example, if you wanted to make a simple bond of two e100 devices (presumed to be eth0 and eth1), and have it persist across reboots, edit the appropriate file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the following:
modprobe bonding mode=balance-alb miimon=100
modprobe e100
ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
ifenslave bond0 eth0
ifenslave bond0 eth1
Replace the example bonding module parameters and bond0
network configuration (IP address, netmask,etc) with the appropriate
values for your configuration.
Unfortunately, this method will not provide support for the
ifup and ifdown scripts on the bond devices. To reload the bonding
configuration, it is necessary to run the initialization script, e.g.,
# /etc/init.d/boot.local
or
# /etc/rc.d/rc.local
It may be desirable in such a case to create a separate script
which only initializes the bonding configuration, then call that
separate script from within boot.local. This allows bonding to be
enabled without re-running the entire global init script.
To shut down the bonding devices, it is necessary to first
mark the bonding device itself as being down, then remove the
appropriate device driver modules. For our example above, you can do
the following:
# ifconfig bond0 down
# rmmod bonding
# rmmod e100
Again, for convenience, it may be desirable to create a script
with these commands.
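A minimal sketch of such a pair of helper scripts, reusing the e100/bond0 example above (the script names and paths are purely illustrative):
/usr/local/sbin/bond-up (illustrative name):
#!/bin/sh
# load the drivers and bring up the bond, as in the example above
modprobe bonding mode=balance-alb miimon=100
modprobe e100
ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
ifenslave bond0 eth0
ifenslave bond0 eth1
/usr/local/sbin/bond-down (illustrative name):
#!/bin/sh
# tear the bond down and unload the drivers
ifconfig bond0 down
rmmod bonding
rmmod e100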
3.3.1 Configuring Multiple Bonds Manually
This section contains information on configuring multiple
bonding devices with differing options forthose systems whose network
initialization scripts lack support forconfiguring multiple bonds.
If you require multiple bonding devices, but all with the same
options, you may wish to use the”max_bonds” module parameter,
documented above.
To create multiple bonding devices with differing options, it
is necessary to load the bonding drivermultiple times. Note that
current versions of the sysconfig networkinitialization scripts
handle this automatically; if your distrouses these scripts, no
special action is needed. See the section Configuring Bonding
Devices, above, if you’re not sure aboutyour network initialization
scripts.
To load multiple instances of the module, it is necessary to
specify a different name for each instance (the module loading system
requires that every loaded module, even multiple instances of the same
module, have a unique name). This is accomplished by supplying
multiple sets of bonding options in /etc/modprobe.conf, for example:
alias bond0 bonding
options bond0 -o bond0 mode=balance-rr miimon=100
alias bond1 bonding
options bond1 -o bond1 mode=balance-alb miimon=50
will load the bonding module two times. The first instance is
named "bond0" and creates the bond0 device in balance-rr mode with an
miimon of 100. The second instance is named "bond1" and creates the
bond1 device in balance-alb mode with an miimon of 50.
In some circumstances (typically with older distributions),
the above does not work, and the second bonding instance never sees
its options. In that case, the second options line can be substituted
as follows:
install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
mode=balance-alb miimon=50
This may be repeated any number of times, specifying a new and
unique name in place of bond1 for each subsequent instance.
5. Querying Bonding Configuration
5.1 Bonding Configuration
Each bonding device has a read-only file residing in the
/proc/net/bonding directory. The file contents include information
about the bonding configuration, optionsand state of each slave.
For example, the contents of /proc/net/bonding/bond0 after the
driver is loaded with parameters of mode=0and miimon=1000 is
generally as follows:
Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004)
Bonding Mode: load balancing (round-robin)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 1000
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Slave Interface: eth0
MII Status: up
Link Failure Count: 1
The precise format and contents will change depending upon the
bonding configuration, state, and versionof the bonding driver.
5.2 Network configuration
The network configuration can be inspected using the ifconfig
command. Bonding devices will have the MASTER flag set; Bonding slave
devices will have the SLAVE flag set. The ifconfig output does not
contain information on which slaves areassociated with which masters.
In the example below, the bond0 interface is the master
(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice that all slaves of
bond0 have the same MAC address (HWaddr) as bond0 for all modes except
TLB and ALB, which require a unique MAC address for each slave.
# /sbin/ifconfig
bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:0
eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:100
Interrupt:10 Base address:0x1080
eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
Interrupt:9 Base address:0x1400
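Since the ifconfig output does not show which slaves belong to which master, the association can be read from the proc file described in section 5.1 instead (a quick sketch, assuming the bond0 example above):
# grep "Slave Interface" /proc/net/bonding/bond0
Slave Interface: eth1
Slave Interface: eth0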
6. Switch Configuration
For this section, “switch” refers to whatever system the
bonded devices are directly connected to(i.e., where the other end of
the cable plugs into). This may be an actual dedicated switchdevice,
or it may be another regular system (e.g.,another computer running
Linux),
The active-backup, balance-tlb and balance-alb modes do not
require any specific configuration of theswitch.
The 802.3ad mode requires that the switch have the appropriate
ports configured as an 802.3ad aggregation. The precise method used
to configure this varies from switch to switch, but, for example, a
Cisco 3550 series switch requires that the appropriate ports first be
grouped together in a single etherchannel instance, then that
etherchannel is set to mode "lacp" to enable 802.3ad (instead of
standard EtherChannel).
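As a rough illustration only (the interface range and channel-group number are invented, and the exact commands vary with the switch model and IOS version), the switch-side configuration might resemble:
switch# configure terminal
switch(config)# interface range FastEthernet0/1 - 2
switch(config-if-range)# channel-group 1 mode active
switch(config-if-range)# end
Here "mode active" requests LACP (802.3ad) rather than static EtherChannel.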
The balance-rr, balance-xor and broadcast modes generally
require that the switch have theappropriate ports grouped together.
The nomenclature for such a group differsbetween switches, it may be
called an “etherchannel” (as inthe Cisco example, above), a “trunk
group” or some other similarvariation. For these modes, each switch
will also have its own configuration optionsfor the switch’s transmit
policy to the bond. Typical choices include XOR of either the MACor
IP addresses. The transmit policy of the two peers does notneed to
match. For these three modes, the bonding mode really selects a
transmit policy for an EtherChannel group;all three will interoperate
with another EtherChannel group.
7. 802.1q VLAN Support
It is possible to configure VLAN devices over a bond interface
using the 8021q driver. However, only packets coming from the 8021q
driver and passing through bonding will betagged by default. Self
generated packets, for example, bonding’slearning packets or ARP
packets generated by either ALB mode or theARP monitor mechanism, are
tagged internally by bonding itself. As a result, bonding must
“learn” the VLAN IDs configuredabove it, and use those IDs to tag
self generated packets.
For reasons of simplicity, and to support the use of adapters
that can do VLAN hardware accelerationoffloading, the bonding
interface declares itself as fully hardwareoffloading capable, it gets
the add_vid/kill_vid notifications togather the necessary
information, and it propagates thoseactions to the slaves. In case
of mixed adapter types, hardwareaccelerated tagged packets that
should go through an adapter that is notoffloading capable are
“un-accelerated” by the bondingdriver so the VLAN tag sits in the
regular location.
VLAN interfaces must be added on top of a bonding interface
only after enslaving at least oneslave. The bonding interface has a
hardware address of 00:00:00:00:00:00 untilthe first slave is added.
If the VLAN interface is created prior tothe first enslavement, it
would pick up the all-zeroes hardwareaddress. Once the first slave
is attached to the bond, the bond deviceitself will pick up the
slave’s hardware address, which is thenavailable for the VLAN device.
Also, be aware that a similar problem can occur if all slaves
are released from a bond that still has oneor more VLAN interfaces on
top of it. When a new slave is added, the bonding interface will
obtain its hardware address from the firstslave, which might not
match the hardware address of the VLANinterfaces (which was
ultimately copied from an earlier slave).
There are two methods to ensure that the VLAN device operates
with the correct hardware address if all slaves are removed from a
bond interface:
• Remove all VLAN interfaces, then recreate them
• Set the bonding interface's hardware address so that it
matches the hardware address of the VLAN interfaces.
Note that changing a VLAN interface's HW address would set the
underlying device (i.e., the bonding interface) to promiscuous
mode, which might not be what you want.
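A hedged sketch of the two approaches (the VLAN ID 10, the interface names and the MAC address are placeholders; the vconfig commands assume the standard 8021q userspace tools):
Option 1, remove the VLAN device and recreate it once a slave has been enslaved:
# vconfig rem bond0.10
# vconfig add bond0 10
Option 2, set the bond's MAC address to match the existing VLAN interface:
# ifconfig bond0 hw ether 00:11:22:33:44:55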
8. Link Monitoring
The bonding driver at present supports two schemes for
monitoring a slave device’s link state: theARP monitor and the MII
monitor.
At the present time, due to implementation restrictions in the
bonding driver itself, it is not possible toenable both ARP and MII
monitoring simultaneously.
8.1 ARP Monitor Operation
The ARP monitor operates as its name suggests: it sends ARP
queries to one or more designated peersystems on the network, and
uses the response as an indication that thelink is operating. This
gives some assurance that traffic isactually flowing to and from one
or more peers on the local network.
The ARP monitor relies on the device driver itself to verify
that traffic is flowing. In particular, the driver must keep up to
date the last receive time,dev->last_rx, and transmit start time,
dev->trans_start. If these are not updated by the driver, thenthe
ARP monitor will immediately fail anyslaves using that driver, and
those slaves will stay down. If networking monitoring (tcpdump, etc)
shows the ARP requests and replies on thenetwork, then it may be that
your device driver is not updating last_rxand trans_start.
8.2 Configuring Multiple ARP Targets
While ARP monitoring can be done with just one target, it can
be useful in a High Availability setup tohave several targets to
monitor. In the case of just one target, the target itself may go
down or have a problem making itunresponsive to ARP requests. Having
an additional target (or several) increasesthe reliability of the ARP
monitoring.
Multiple ARP targets must be separated by commas as follows:
# example options for ARP monitoring with three targets
alias bond0 bonding
options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9
For just a single target the options would resemble:
# example options for ARP monitoring with one target
alias bond0 bonding
options bond0 arp_interval=60 arp_ip_target=192.168.0.100
8.3 MII Monitor Operation
The MII monitor monitors only the carrier state of the local
network interface. It accomplishes this in one of three ways: by
depending upon the device driver to maintain its carrier state, by
querying the device's MII registers, or by making an ethtool query to
the device.
If the use_carrier module parameter is 1 (the default value),
then the MII monitor will rely on the driver for carrier state
information (via the netif_carrier subsystem). As explained in the
use_carrier parameter information, above, if the MII monitor fails to
detect carrier loss on the device (e.g., when the cable is physically
disconnected), it may be that the driver does not support
netif_carrier.
If use_carrier is 0, then the MII monitor will first query the
device's MII registers (via ioctl) and check the link state. If that
request fails (not just that it returns carrier down), then the MII
monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain
the same information. If both methods fail (i.e., the driver either
does not support or had some error in processing both the MII register
and ethtool requests), then the MII monitor will assume the link is
up.
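As a small illustration (the miimon value is arbitrary), the two behaviours are selected through the module options, e.g. in modprobe.conf (one line or the other, not both):
# default: trust the driver's netif_carrier state
options bond0 miimon=100 use_carrier=1
# fall back to direct MII register / ethtool queries
options bond0 miimon=100 use_carrier=0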
9. Potential Sources of Trouble
9.1 Adventures in Routing
When bonding is configured, it is important that the slave
devices not have routes that supercederoutes of the master (or,
generally, not have routes at all). For example, suppose the bonding
device bond0 has two slaves, eth0 and eth1,and the routing table is
as follows:
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 eth0
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 eth1
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 bond0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
This routing configuration will likely still update the
receive/transmit times in the driver (needed by the ARP monitor), but
may bypass the bonding driver (because outgoing traffic to, in this
case, another host on network 10 would use eth0 or eth1 before bond0).
The ARP monitor (and ARP itself) may become confused by this
configuration, because ARP requests (generated by the ARP monitor)
will be sent on one interface (bond0), but the corresponding reply
will arrive on a different interface (eth0). This reply looks to ARP
like an unsolicited ARP reply (because ARP matches replies on an
interface basis), and is discarded. The MII monitor is not affected
by the state of the routing table.
The solution here is simply to ensure that slaves do not have
routes of their own, and if for some reason they must, those routes do
not supersede routes of their master. This should generally be the
case, but unusual configurations or errant manual or automatic static
route additions may cause trouble.
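As an illustrative check only (the network and interface names match the example routing table above), the stray slave routes can be listed and removed by hand:
# route -n
# route del -net 10.0.0.0 netmask 255.255.0.0 dev eth0
# route del -net 10.0.0.0 netmask 255.255.0.0 dev eth1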
9.2 Ethernet Device Renaming
On systems with network configuration scripts that do not
associate physical devices directly withnetwork interface names (so
that the same physical device always hasthe same “ethX” name), it may
be necessary to add some special logic toeither /etc/modules.conf or
/etc/modprobe.conf (depending upon which isinstalled on the system).
For example, given a modules.conf containing the following:
alias bond0 bonding
options bond0 mode=some-mode miimon=50
alias eth0 tg3
alias eth1 tg3
alias eth2 e1000
alias eth3 e1000
If neither eth0 nor eth1 are slaves to bond0, then when the
bond0 interface comes up, the devices may end up reordered. This
happens because bonding is loaded first, then its slave device's
drivers are loaded next. Since no other drivers have been loaded,
when the e1000 driver loads, it will receive eth0 and eth1 for its
devices, but the bonding configuration tries to enslave eth2 and eth3
(which may later be assigned to the tg3 devices).
Adding the following:
add above bonding e1000 tg3
causes modprobe to load e1000 then tg3, in that order, when
bonding is loaded. This command is fully documented in the
modules.conf manual page.
On systems utilizing modprobe.conf (or modprobe.conf.local),
an equivalent problem can occur. In this case, the following can be
added to modprobe.conf (or modprobe.conf.local, as appropriate), as
follows (all on one line; it has been split here for clarity):
install bonding /sbin/modprobe tg3; /sbin/modprobe e1000;
/sbin/modprobe --ignore-install bonding
This will, when loading the bonding module, rather than
performing the normal action, instead execute the provided command.
This command loads the device drivers in the order needed, then calls
modprobe with --ignore-install to cause the normal action to then take
place. Full documentation on this can be found in the modprobe.conf
and modprobe manual pages.
9.3. Painfully Slow Or No Failed LinkDetection By Miimon
By default, bonding enables the use_carrier option, which
instructs bonding to trust the driver tomaintain carrier state.
As discussed in the options section, above, some drivers do
not support the netif_carrier_on/_off linkstate tracking system.
With use_carrier enabled, bonding willalways see these links as up,
regardless of their actual state.
Additionally, other drivers do support netif_carrier, but do
not maintain it in real time, e.g., onlypolling the link state at
some fixed interval. In this case, miimon will detect failures,but
only after some long period of time hasexpired. If it appears that
miimon is very slow in detecting linkfailures, try specifying
use_carrier=0 to see if that improves thefailure detection time. If
it does, then it may be that the driverchecks the carrier state at a
fixed interval, but does not cache the MIIregister values (so the
use_carrier=0 method of querying theregisters directly works). If
use_carrier=0 does not improve thefailover, then the driver may cache
the registers, or the problem may beelsewhere.
Also, remember that miimon only checks for the device’s
carrier state. It has no way to determine the state ofdevices on or
beyond other ports of a switch, or if aswitch is refusing to pass
traffic while still maintaining carrier on.
10. SNMP agents
If running SNMP agents, the bonding driver should be loaded
before any network drivers participating ina bond. This requirement
is due to the interface index(ipAdEntIfIndex) being associated to
the first interface found with a given IPaddress. That is, there is
only one ipAdEntIfIndex for each IPaddress. For example, if eth0 and
eth1 are slaves of bond0 and the driver foreth0 is loaded before the
bonding driver, the interface for the IPaddress will be associated
with the eth0 interface. This configuration is shown below; the IP
address 192.168.1.1 has an interface index of 2, which indexes to eth0
in the ifDescr table (ifDescr.2).
interfaces.ifTable.ifEntry.ifDescr.1 = lo
interfaces.ifTable.ifEntry.ifDescr.2 = eth0
interfaces.ifTable.ifEntry.ifDescr.3 = eth1
interfaces.ifTable.ifEntry.ifDescr.4 = eth2
interfaces.ifTable.ifEntry.ifDescr.5 = eth3
interfaces.ifTable.ifEntry.ifDescr.6 = bond0
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
This problem is avoided by loading the bonding driver before
any network drivers participating in a bond. Below is an example of
loading the bonding driver first; the IP address 192.168.1.1 is
correctly associated with ifDescr.2.
interfaces.ifTable.ifEntry.ifDescr.1 = lo
interfaces.ifTable.ifEntry.ifDescr.2 = bond0
interfaces.ifTable.ifEntry.ifDescr.3 = eth0
interfaces.ifTable.ifEntry.ifDescr.4 = eth1
interfaces.ifTable.ifEntry.ifDescr.5 = eth2
interfaces.ifTable.ifEntry.ifDescr.6 = eth3
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
While some distributions may not report the interface name in
ifDescr, the association between the IPaddress and IfIndex remains
and SNMP functions such asInterface_Scan_Next will report that
association.
11. Promiscuous mode
When running network monitoring tools, e.g., tcpdump, it is
common to enable promiscuous mode on thedevice, so that all traffic
is seen (instead of seeing only trafficdestined for the local host).
The bonding driver handles promiscuous modechanges to the bonding
master device (e.g., bond0), and propagatesthe setting to the slave
devices.
For the balance-rr, balance-xor, broadcast, and 802.3ad modes,
the promiscuous mode setting is propagatedto all slaves.
For the active-backup, balance-tlb and balance-alb modes, the
promiscuous mode setting is propagated onlyto the active slave.
For balance-tlb mode, the active slave is the slave currently
receiving inbound traffic.
For balance-alb mode, the active slave is the slave used as a
“primary.” This slave is used for mode-specific controltraffic, for
sending to peers that are unassigned or ifthe load is unbalanced.
For the active-backup, balance-tlb and balance-alb modes, when
the active slave changes (e.g., due to alink failure), the
promiscuous setting will be propagated tothe new active slave.
12. Configuring Bonding for HighAvailability
High Availability refers to configurations that provide
maximum network availability by havingredundant or backup devices,
links or switches between the host and therest of the world. The
goal is to provide the maximum availabilityof network connectivity
(i.e., the network always works), eventhough other configurations
could provide higher throughput.
12.1 High Availability in a Single SwitchTopology
If two hosts (or a host and a single switch) are directly
connected via multiple physical links, thenthere is no availability
penalty to optimizing for maximumbandwidth. In this case, there is
only one switch (or peer), so if it fails,there is no alternative
access to fail over to. Additionally, the bonding load balance modes
support link monitoring of their members,so if individual links fail,
the load will be rebalanced across theremaining devices.
See Section 13, “Configuring Bonding for Maximum Throughput”
for information on configuring bonding withone peer device.
12.2 High Availability in a Multiple SwitchTopology
With multiple switches, the configuration of bonding and the
network changes dramatically. In multiple switch topologies, there is
a trade off between network availabilityand usable bandwidth.
Below is a sample network, configured to maximize the
availability of the network:
                |                                     |
                |port3                           port3|
          +-----+----+                          +-----+----+
          |          |port2       ISL      port2|          |
          | switch A +--------------------------+ switch B |
          |          |                          |          |
          +-----+----+                          +-----++---+
                |port1                          port1|
                |             +-------+              |
                +-------------+ host1 +--------------+
                         eth0 +-------+ eth1
In this configuration, there is a link between the two
switches (ISL, or inter switch link), andmultiple ports connecting to
the outside world (“port3” oneach switch). There is no technical
reason that this could not be extended to athird switch.
12.2.1 HA Bonding Mode Selection for Multiple Switch Topology
In a topology such as the example above, the active-backup and
broadcast modes are the only useful bonding modes when optimizing for
availability; the other modes require all links to terminate on the
same peer for them to behave rationally.
active-backup: This is generally the preferred mode, particularly if
the switches have an ISL and play together well. If the
network configuration is such that one switch is specifically
a backup switch (e.g., has lower capacity, higher cost, etc),
then the primary option can be used to ensure that the
preferred link is always used when it is available (see the
example following this list).
broadcast: This mode is really a special purpose mode, and is suitable
only for very specific needs. For example, if the two
switches are not connected (no ISL), and the networks beyond
them are totally independent. In this case, if it is
necessary for some specific one-way traffic to reach both
independent networks, then the broadcast mode may be suitable.
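For instance (a sketch only; the choice of eth0 as primary and the miimon value are arbitrary), the preferred-link behaviour mentioned above can be requested with:
options bond0 mode=active-backup miimon=100 primary=eth0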
12.2.2 HA Link Monitoring Selection for Multiple Switch Topology
The choice of link monitoring ultimately depends upon your
switch. If the switch can reliably fail ports in response to other
failures, then either the MII or ARP monitors should work. For
example, in the above example, if the "port3" link fails at the remote
end, the MII monitor has no direct means to detect this. The ARP
monitor could be configured with a target at the remote end of port3,
thus detecting that failure without switch support.
In general, however, in a multiple switch topology, the ARP
monitor can provide a higher level of reliability in detecting end to
end connectivity failures (which may be caused by the failure of any
individual component to pass traffic for any reason). Additionally,
the ARP monitor should be configured with multiple targets (at least
one for each switch in the network). This will ensure that,
regardless of which switch is active, the ARP monitor has a suitable
target to query.
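A hedged example of such a configuration (the two target addresses are placeholders, one reachable through each switch):
options bond0 mode=active-backup arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.2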
13. Configuring Bonding for MaximumThroughput
13.1 Maximizing Throughput in a SingleSwitch Topology
In a single switch configuration, the best method to maximize
throughput depends upon the application andnetwork environment. The
various load balancing modes each havestrengths and weaknesses in
different environments, as detailed below.
For this discussion, we will break down the topologies into
two categories. Depending upon the destination of mosttraffic, we
categorize them into either”gatewayed” or “local” configurations.
In a gatewayed configuration, the “switch” is acting primarily
as a router, and the majority of trafficpasses through this router to
other networks. An example would be the following:
     +----------+                     +----------+
     |          |eth0            port1|          |  to other networks
     | Host A   +---------------------+  router  +------------------->
     |          +---------------------+          |  Hosts B and C are out
     |          |eth1            port2|          |  here somewhere
     +----------+                     +----------+
The router may be a dedicated router device, or another host
acting as a gateway. For our discussion, the important point isthat
the majority of traffic from Host A willpass through the router to
some other network before reaching itsfinal destination.
In a gatewayed network configuration, although Host A may
communicate with many other systems, all ofits traffic will be sent
and received via one other peer on thelocal network, the router.
Note that the case of two systems connected directly via
multiple physical links is, for purposes ofconfiguring bonding, the
same as a gatewayed configuration. In that case, it happens that all
traffic is destined for the”gateway” itself, not some other network
beyond the gateway.
In a local configuration, the”switch” is acting primarily as
a switch, and the majority of trafficpasses through this switch to
reach other stations on the samenetwork. An example would be the
following:
    +----------+            +----------+       +--------+
    |          |eth0   port1|          +-------+ Host B |
    |  Host A  +------------+  switch  |port3  +--------+
    |          +------------+          |                 +--------+
    |          |eth1   port2|          +-----------------+ Host C |
    +----------+            +----------+port4            +--------+
Again, the switch may be a dedicated switch device, or another
host acting as a gateway. For our discussion, the important point is
that the majority of traffic from Host A isdestined for other hosts
on the same local network (Hosts B and C inthe above example).
In summary, in a gatewayed configuration, traffic to and from
the bonded device will be to the same MAClevel peer on the network
(the gateway itself, i.e., the router),regardless of its final
destination. In a local configuration, traffic flowsdirectly to and
from the final destinations, thus, eachdestination (Host B, Host C)
will be addressed directly by theirindividual MAC addresses.
This distinction between a gatewayed and a local network
configuration is important because many ofthe load balancing modes
available use the MAC addresses of thelocal network source and
destination to make load balancingdecisions. The behavior of each
mode is described below.
13.1.1 MT Bonding Mode Selection for SingleSwitch Topology
This configuration is the easiest to set up and to understand,
although you will have to decide whichbonding mode best suits your
needs. The trade offs for each mode are detailed below:
balance-rr: This mode is the only mode thatwill permit a single
TCP/IP connection to stripe traffic across multiple
interfaces. It is therefore the only mode that will allow a
single TCP/IP stream to utilize more than one interface’s
worth of throughput. This comesat a cost, however: the
striping often results in peer systems receiving packets out
of order, causing TCP/IP’s congestion control system to kick
in, often by retransmitting segments.
It is possible to adjust TCP/IP's congestion limits by
altering the net.ipv4.tcp_reordering sysctl parameter (see the
example following this list). The usual default value is 3, and the
maximum useful value is 127. For a four interface balance-rr bond,
expect that a single TCP/IP stream will utilize no more than
approximately 2.3 interfaces' worth of throughput, even after
adjusting tcp_reordering.
Note that this out of order delivery occurs when both the
sending and receiving systems are utilizing a multiple
interface bond. Consider aconfiguration in which a
balance-rr bond feeds into a single higher capacity network
channel (e.g., multiple 100Mb/sec ethernets feeding a single
gigabit ethernet via an etherchannel capable switch). In this
configuration, traffic sent from the multiple 100Mb devices to
a destination connected to the gigabit device will not see
packets out of order. However,traffic sent from the gigabit
device to the multiple 100Mb devices may or may not see
traffic out of order, depending upon the balance policy of the
switch. Many switches do notsupport any modes that stripe
traffic (instead choosing a port based upon IP or MAC level
addresses); for those devices, traffic flowing from the
gigabit device to the many 100Mb devices will only utilize one
interface.
If you are utilizing protocols other than TCP/IP, UDP for
example, and your application can tolerate out of order
delivery, then this mode can allow for single stream datagram
performance that scales near linearly as interfaces are added
to the bond.
This mode requires the switch to have the appropriate ports
configured for “etherchannel” or “trunking.”
active-backup: There is not much advantagein this network topology to
the active-backup mode, as the inactive backup devices are all
connected to the same peer as the primary. In this case, a
load balancing mode (with link monitoring) will provide the
same level of network availability, but with increased
available bandwidth. On the plusside, active-backup mode
does not require any configuration of the switch, so it may
have value if the hardware available does not support any of
the load balance modes.
balance-xor: This mode will limit trafficsuch that packets destined
for specific peers will always be sent over the same
interface. Since the destinationis determined by the MAC
addresses involved, this mode works best in a “local” network
configuration (as described above), with destinations all on
the same local network. This modeis likely to be suboptimal
if all your traffic is passed through a single router (i.e., a
“gatewayed” network configuration, as described above).
As with balance-rr, the switch ports need to be configured for
“etherchannel” or “trunking.”
broadcast: Like active-backup, there is notmuch advantage to this
mode in this type of network topology.
802.3ad: This mode can be a good choice forthis type of network
topology. The 802.3ad mode is anIEEE standard, so all peers
that implement 802.3ad should interoperate well. The 802.3ad
protocol includes automatic configuration of the aggregates,
so minimal manual configurationof the switch is needed
(typically only to designate that some set of devices is
available for 802.3ad). The802.3ad standard also mandates
that frames be delivered in order (within certain limits), so
in general single connections will not see misordering of
packets. The 802.3ad mode doeshave some drawbacks: the
standard mandates that all devices in the aggregate operate at
the same speed and duplex. Also,as with all bonding load
balance modes other than balance-rr, no single connection will
be able to utilize more than a single interface’s worth of
bandwidth.
Additionally, the linux bonding 802.3ad implementation
distributes traffic by peer (using an XOR of MAC addresses),
so in a “gatewayed” configuration, all outgoing traffic will
generally use the same device. Incoming traffic may also end
up on a single device, but that is dependent upon the
balancing policy of the peer's 802.3ad implementation. In a
“local” configuration, traffic will be distributed across the
devices in the bond.
Finally, the 802.3ad mode mandates the use of the MII monitor,
therefore, the ARP monitor is not available in this mode.
balance-tlb: The balance-tlb mode balancesoutgoing traffic by peer.
Since the balancing is done according to MAC address, in a
“gatewayed” configuration (as described above), this mode will
send all traffic across a single device. However, in a
“local” network configuration, this mode balances multiple
local network peers across devices in a vaguely intelligent
manner (not a simple XOR as in balance-xor or 802.3ad mode),
so that mathematically unlucky MAC addresses (i.e., ones that
XOR to the same value) will not all “bunch up” on a single
interface.
Unlike 802.3ad, interfaces may be of differing speeds, and no
special switch configuration is required. On the down side,
in this mode all incoming traffic arrives over a single
interface, this mode requires certain ethtool support in the
network device driver of the slave interfaces, and the ARP
monitor is not available.
balance-alb: This mode is everything thatbalance-tlb is, and more.
It has all of the features (and restrictions) of balance-tlb,
and will also balance incoming traffic from local network
peers (as described in the Bonding Module Options section,
above).
The only additional down side to this mode is that the network
device driver must support changing the hardware address while
the device is open.
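As noted in the balance-rr discussion above, the reordering tolerance can be raised at run time; a minimal sketch (127 is the maximum useful value cited above, and the setting can be made persistent via /etc/sysctl.conf):
# sysctl -w net.ipv4.tcp_reordering=127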
13.1.2 MT Link Monitoring for Single SwitchTopology
The choice of link monitoring may largely depend upon which
mode you choose to use. The more advanced load balancing modes do not
support the use of the ARP monitor, and arethus restricted to using
the MII monitor (which does not provide ashigh a level of end to end
assurance as the ARP monitor).
13.2 Maximum Throughput in a MultipleSwitch Topology
Multiple switches may be utilized to optimize for throughput
when they are configured in parallel aspart of an isolated network
between two or more systems, for example:
                       +-----------+
                       |  Host A   |
                       +-+---+---+-+
                         |   |   |
                +--------+   |   +---------+
                |            |             |
         +------+---+  +-----+----+  +-----+----+
         | Switch A  |  | Switch B |  | Switch C |
         +------+---+  +-----+----+  +-----+----+
                |            |             |
                +--------+   |   +---------+
                         |   |   |
                       +-+---+---+-+
                       |  Host B   |
                       +-----------+
In this configuration, the switches are isolated from one
another. One reason to employ a topology such as this is for an
isolated network with many hosts (a clusterconfigured for high
performance, for example), using multiplesmaller switches can be more
cost effective than a single larger switch,e.g., on a network with 24
hosts, three 24 port switches can besignificantly less expensive than
a single 72 port switch.
If access beyond the network is required, an individual host
can be equipped with an additional networkdevice connected to an
external network; this host thenadditionally acts as a gateway.
13.2.1 MT Bonding Mode Selection forMultiple Switch Topology
In actual practice, the bonding mode typically employed in
configurations of this type isbalance-rr. Historically, in this
network configuration, the usual caveatsabout out of order packet
delivery are mitigated by the use ofnetwork adapters that do not do
any kind of packet coalescing (via the useof NAPI, or because the
device itself does not generate interruptsuntil some number of
packets has arrived). When employed in this fashion, the balance-rr
mode allows individual connections betweentwo hosts to effectively
utilize greater than one interface’sbandwidth.
13.2.2 MT Link Monitoring for MultipleSwitch Topology
Again, in actual practice, the MII monitor is most often used
in this configuration, as performance isgiven preference over
availability. The ARP monitor will function in thistopology, but its
advantages over the MII monitor aremitigated by the volume of probes
needed as the number of systems involvedgrows (remember that each
host in the network is configured withbonding).
14. Switch Behavior Issues
14.1 Link Establishment and Failover Delays
Some switches exhibit undesirable behavior with regard to the
timing of link up and down reporting by theswitch.
First, when a link comes up, some switches may indicate that
the link is up (carrier available), but notpass traffic over the
interface for some period of time. This delay is typically due to
some type of autonegotiation or routingprotocol, but may also occur
during switch initialization (e.g., duringrecovery after a switch
failure). If you find this to be a problem, specify an appropriate
value to the updelay bonding module optionto delay the use of the
relevant interface(s).
Second, some switches may “bounce” the link state one or more
times while a link is changing state. This occurs most commonly while
the switch is initializing. Again, an appropriate updelay value may
help.
Note that when a bonding interface has no active links, the
driver will immediately reuse the first link that goes up, even if the
updelay parameter has been specified (the updelay is ignored in this
case). If there are slave interfaces waiting for the updelay timeout
to expire, the interface that first went into that state will be
immediately reused. This reduces down time of the network if the
value of updelay has been overestimated, and since this occurs only in
cases with no connectivity, there is no additional penalty for
ignoring the updelay.
In addition to the concerns about switch timings, if your
switches take a long time to go into backup mode, it may be desirable
to not activate a backup interface immediately after a link goes down.
Failover may be delayed via the downdelay bonding module option.
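A hedged illustration of both options together (the millisecond values are arbitrary and should be tuned to the switch in question; both should be multiples of miimon):
options bond0 miimon=100 updelay=2000 downdelay=500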
14.2 Duplicated Incoming Packets
It is not uncommon to observe a short burst of duplicated
traffic when the bonding device is firstused, or after it has been
idle for some period of time. This is most easily observed by issuing
a “ping” to some other host onthe network, and noticing that the
output from ping flags duplicates(typically one per slave).
For example, on a bond in active-backup mode with five slaves
all connected to one switch, the output may appear as follows:
# ping -n 10.0.4.2
PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
This is not due to an error in the bonding driver, rather, it
is a side effect of how many switchesupdate their MAC forwarding
tables. Initially, the switch does not associate the MAC address in
the packet with a particular switch port,and so it may send the
traffic to all ports until its MACforwarding table is updated. Since
the interfaces attached to the bond mayoccupy multiple ports on a
single switch, when the switch(temporarily) floods the traffic to all
ports, the bond device receives multiplecopies of the same packet
(one per slave device).
The duplicated packet behavior is switchdependent, some
switches exhibit this, and some donot. On switches that display this
behavior, it can be induced by clearing theMAC forwarding table (on
most Cisco switches, the privileged command”clear mac address-table
dynamic” will accomplish this).
15. Hardware Specific Considerations
This section contains additional information for configuring
bonding on specific hardware platforms, orfor interfacing bonding
with particular switches or other devices.
15.1 IBM BladeCenter
This applies to the JS20 and similar systems.
On the JS20 blades, the bonding driver supports only
balance-rr, active-backup, balance-tlb andbalance-alb modes. This is
largely due to the network topology insidethe BladeCenter, detailed
below.
JS20 network adapter information
All JS20s come with two Broadcom Gigabit Ethernet ports
integrated on the planar (that’s”motherboard” in IBM-speak). In the
BladeCenter chassis, the eth0 port of allJS20 blades is hard wired to
I/O Module #1; similarly, all eth1 portsare wired to I/O Module #2.
An add-on Broadcom daughter card can beinstalled on a JS20 to provide
two more Gigabit Ethernet ports. These ports, eth2 and eth3, are
wired to I/O Modules 3 and 4, respectively.
Each I/O Module may contain either a switch or a passthrough
module (which allows ports to be directlyconnected to an external
switch). Some bonding modes require a specific BladeCenter internal
network topology in order to function; theseare detailed below.
Additional BladeCenter-specific networking information can be
found in two IBM Redbooks (
www.ibm.com/redbooks):
“IBM eServer BladeCenter NetworkingOptions”
“IBM eServer BladeCenter Layer 2-7Network Switching”
BladeCenter networking configuration
Because a BladeCenter can be configured in a very large number
of ways, this discussion will be confinedto describing basic
configurations.
Normally, Ethernet Switch Modules (ESMs) are used in I/O
modules 1 and 2. In this configuration, the eth0 and eth1ports of a
JS20 will be connected to differentinternal switches (in the
respective I/O modules).
A passthrough module (OPM or CPM, optical or copper,
passthrough module) connects the I/O moduledirectly to an external
switch. By using PMs in I/O module #1 and #2, the eth0 and eth1
interfaces of a JS20 can be redirected tothe outside world and
connected to a common external switch.
Depending upon the mix of ESMs and PMs, the network will
appear to bonding as either a single switchtopology (all PMs) or as a
multiple switch topology (one or more ESMs,zero or more PMs). It is
also possible to connect ESMs together,resulting in a configuration
much like the example in “HighAvailability in a Multiple Switch
Topology,” above.
Requirements for specific modes
The balance-rr mode requires the use of passthrough modules
for devices in the bond, all connected to a common external switch.
That switch must be configured for "etherchannel" or "trunking" on the
appropriate ports, as is usual for balance-rr.
The balance-alb and balance-tlb modes will function with
either switch modules or passthroughmodules (or a mix). The only
specific requirement for these modes isthat all network interfaces
must be able to reach all destinations fortraffic sent over the
bonding device (i.e., the network mustconverge at some point outside
the BladeCenter).
The active-backup mode has no additional requirements.
Link monitoring issues
When an Ethernet Switch Module is in place, only the ARP
monitor will reliably detect link loss toan external switch. This is
nothing unusual, but examination of the BladeCenter cabinet would
suggest that the "external" network ports are the ethernet ports for
the system, when in fact there is a switch between these "external"
ports and the devices on the JS20 system itself. The MII monitor is
only able to detect link failures between the ESM and the JS20 system.
When a passthrough module is in place, the MII monitor does
detect failures to the “external”port, which is then directly
connected to the JS20 system.
Other concerns
The Serial Over LAN (SoL) link is established over the primary
ethernet (eth0) only, therefore, any loss oflink to eth0 will result
in losing your SoL connection. It will not fail over with other
network traffic, as the SoL system isbeyond the control of the
bonding driver.
It may be desirable to disable spanning tree on the switch
(either the internal Ethernet SwitchModule, or an external switch) to
avoid fail-over delay issues when usingbonding.
16. Frequently Asked Questions
1. Is it SMP safe?
Yes. The old 2.0.xx channel bonding patch was not SMP safe.
The new driver was designed to be SMP safefrom the start.
2. What type of cards will work with it?
Any Ethernet type cards (you can even mix cards; an Intel
EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes,
devices need not be of the same speed.
3. How many bonding devices can I have?
There is no limit.
4. How many slaves can a bonding device have?
This is limited only by the number of network interfaces Linux
supports and/or the number of network cardsyou can place in your
system.
5. What happens when a slave link dies?
If link monitoring is enabled, then the failing device will be
disabled. The active-backup mode will fail over to a backup link, and
other modes will ignore the failedlink. The link will continue to be
monitored, and should it recover, it willrejoin the bond (in whatever
manner is appropriate for the mode). Seethe sections on High
Availability and the documentation for eachmode for additional
information.
Link monitoring can be enabled via either the miimon or
arp_interval parameters (described in themodule parameters section,
above). In general, miimon monitors the carrier state as sensed by
the underlying network device, and the arpmonitor (arp_interval)
monitors connectivity to another host onthe local network.
If no link monitoring is configured, the bonding driver will
be unable to detect link failures, and willassume that all links are
always available. This will likely result in lost packets, anda
resulting degradation of performance. The precise performance loss
depends upon the bonding mode and networkconfiguration.
6. Can bonding be used for High Availability?
Yes. See the section on HighAvailability for details.
7. Which switches/systems does it work with?
The full answer to this depends upon the desired mode.
In the basic balance modes (balance-rr and balance-xor), it
works with any system that supportsetherchannel (also called
trunking). Most managed switches currently available have such
support, and many unmanaged switches aswell.
The advanced balance modes (balance-tlb and balance-alb) do
not have special switch requirements, butdo need device drivers that
support specific features (described in theappropriate section under
module parameters, above).
In 802.3ad mode, it works with systems that support IEEE
802.3ad Dynamic Link Aggregation. Most managed and many unmanaged
switches currently available support802.3ad.
The active-backup mode should work with any Layer-II switch.
8. Where does a bonding device get its MAC address from?
If not explicitly configured (with ifconfig or ip link), the
MAC address of the bonding device is takenfrom its first slave
device. This MAC address is then passed to all following slaves and
remains persistent (even if the first slaveis removed) until the
bonding device is brought down orreconfigured.
If you wish to change the MAC address, you can set it with
ifconfig or ip link:
# ifconfig bond0 hw ether 00:11:22:33:44:55
# ip link set bond0 address 66:77:88:99:aa:bb
The MAC address can be also changed by bringing down/up the
device and then changing its slaves (or their order):
# ifconfig bond0 down ; modprobe -r bonding
# ifconfig bond0 …. up
# ifenslave bond0 eth…
This method will automatically take the address from the next
slave that is added.
To restore your slaves' MAC addresses, you need to detach them
from the bond ('ifenslave -d bond0 eth0'). The bonding driver will
then restore the MAC addresses that the slaves had before they were
enslaved.
Linux has a way of implementing very complex features, especially networking features, in a remarkably simple fashion; even functionality you might expect only on high-end equipment is easy to reproduce on Linux. Earlier posts have covered this more than once, for example the VLAN feature, advanced routing and firewalling. This article focuses on Linux bonding, i.e. the port aggregation module. At the network device level, Linux has produced two very successful virtual device concepts: one is the tap interface, and the other is bonding, which this article covers. For the tap interface, see the earlier articles on OpenVPN.
If the question is where to find good material on Linux bonding, the answer is the kernel's own documentation, at $KERNEL-ROOT/Documentation/networking/bonding.txt; I do not think any other reference is more authoritative.
I. An introduction to bonding
bonding is a Linux kernel driver. Once it is loaded, Linux can bundle several physical NICs into a single virtual bond interface, and with each release the driver has gained more configurable parameters and easier configuration.
Port aggregation is useful in many situations: to increase network throughput, to provide hot standby, or to turn a host into a bridge that supports the 802.3ad dynamic link aggregation protocol, and so on. The two most important use cases, however, are load balancing and hot standby.
II. The driver and its changes
The earliest versions of the Linux bonding driver provided only the basic mechanism, and the configuration parameters had to be given when the module was loaded; changing them meant reloading the module. Later, modprobe gained a rename mechanism: the -o option lets the same module be loaded under a different name, so one module can be loaded several times with different parameters. Before that, if you had four ports and wanted two in a load-balancing bond and two in a hot-standby bond, the only way was to recompile bonding under different names; with modprobe -o the same driver can simply be loaded twice, for example:
modprobe bonding -o bond0 mode=0
modprobe bonding -o bond1 mode=1
After loading the driver twice this way, lsmod shows bond0 and bond1 rather than bonding, because modprobe renamed the module at load time. Eventually, however, this renaming mechanism was dropped; as the modprobe man page notes, -o renaming is mainly intended for testing. In the end, bonding gained a sysfs configuration interface: the driver can be configured by reading and writing files under /sys/class/net/.
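As a rough sketch of the sysfs style of configuration (the interface names, mode and addresses are only examples, and the exact file layout depends on the driver version):
# echo +bond0 > /sys/class/net/bonding_masters
# echo balance-alb > /sys/class/net/bond0/bonding/mode
# echo 100 > /sys/class/net/bond0/bonding/miimon
# ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
# echo +eth0 > /sys/class/net/bond0/bonding/slaves
# echo +eth1 > /sys/class/net/bond0/bonding/slaves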
In any case, before sysfs fully supported bonding configuration, adding a device to or removing a device from a bond interface still required the classic ioctl calls, which in turn need a matching userspace program; that program is ifenslave.
My own wish is that all device configuration in Linux were unified under sysfs, all kernel and process configuration under procfs (the kernel is the address space shared by all processes and has its own kernel threads and process 0, so kernel configuration belongs in procfs), and all messaging done over netlink. Freed from imperative ioctl configuration, file-style configuration (the sendto-style system calls used by netlink can also be counted as file-related) would be more efficient, simpler, and more fun.
III. bonding configuration parameters
The kernel documentation lists a great many bonding driver parameters. This article is not a translation of that document, so it does not cover parameters unrelated to the topic; it only comments on the more important ones, and the comments are suggestions and lessons learned rather than translations.
ad_select: related to 802.3ad. If this means nothing to you, don't worry; set the Linux bonding driver aside and read the 802.3ad specification directly. The existence of this option shows that the Linux bonding driver fully supports the dynamic link aggregation protocol.
arp_interval and arp_ip_target: send ARP requests to a set of fixed addresses at a fixed interval in order to monitor the link. Some setups need ARP monitoring because it is a layer-3 check: watching the NIC state or link-layer PDUs only tells you about the health of the two ends of the cable, not about the whole path to the next-hop router or the destination host.
primary: expresses preference, in order; when a selection event occurs, ports are chosen front to back, as in the selection behaviour of the 802.3ad protocol.
fail_over_mac: whether hot-standby mode keeps a single MAC address across slaves. If a single MAC is not used, you depend entirely on gratuitous ARP to refresh the ARP caches of other machines. For example, suppose NIC 1 and NIC 2 are in hot-standby mode, NIC 1 has MAC mac1, NIC 2 has MAC mac2, and NIC 1 has always been the master. If NIC 1 suddenly goes down, NIC 2 must take over, but its MAC differs from NIC 1's; other hosts will keep replying to NIC 1's MAC, and since mac1 is no longer on the network those packets will be received by nothing. So when NIC 2 takes over the master role there should be a callback event whose handler sends a gratuitous ARP broadcast announcing the MAC change.
lacp_rate: how often to send 802.3ad LACPDUs, so that the peer can automatically obtain the link aggregation information.
max_bonds: the number of bond interfaces created initially; the default is 1. This parameter does not limit the maximum number of bond devices that can be created.
use_carrier: whether to use the MII ioctl or the state kept by the driver. With the former, bonding calls the MII interface itself to probe the hardware; with the latter, the driver performs the hardware check automatically (via a watchdog or timer) and bonding merely reads the result. The latter requires the NIC driver to support state detection; if it does not, the NIC's state will always appear to be on.
mode: the most important parameter; it selects the operating mode. It cannot be changed while the bond device is up; the device must be brought down first (ifconfig bondX down). The main modes are:
1. balance-rr or 0: round-robin load balancing; traffic is distributed in turn across the real devices under bondX. Be sure to use a link monitoring mechanism: otherwise, if a device goes down, it will still be considered up and keep being given transmit work, and packets will be lost.
2. active-backup or 1: hot-standby mode. In newer versions a gratuitous ARP is sent automatically on failover, which avoids problems such as the one described under fail_over_mac.
3. balance-xor or 2: I do not know why this is a separate mode given that bonding already has the xmit_hash_policy parameter. In this mode traffic is also distributed; unlike round-robin, it uses the source/destination MAC addresses as input to an xor/mod function to decide which port a packet leaves through.
4. broadcast or 3: broadcast the data out of every port. A crude mode, but extremely fault tolerant.
5. 802.3ad or 4: not much to add; run in 802.3ad mode.
…
xmit_hash_policy: in my view second in importance only to mode. mode defines the distribution model, while this parameter defines the distribution policy; the documentation says it applies to mode 2 and mode 4, though I think even more elaborate policies could be defined. (A combined example follows this list.)
1. layer2: use the layer-2 frame header to compute the outgoing port. As a result, all flows that pass through the same gateway leave through a single port. To refine the policy you have to bring in layer-3 information, which raises the computation cost; everything is a trade-off.
2. layer2+3: adds the layer-3 IP header on top of option 1. The computation grows, but the load is spread more evenly: host-to-host flows form, and a given flow is always sent through the same port. Following the same idea, if you want the load spread even more evenly you can pay a further price and use layer-4 information.
3. layer3+4: this hardly needs explanation; port-to-port flows form and the load is balanced even better. But that is not the whole story. At the policy level we do not want to parallelize the transmission of a single TCP flow, to avoid reordering and retransmission, since TCP itself is a serial protocol; Intel 8257X NIC chips, for instance, try hard not to spread one TCP flow across different CPUs, and likewise, with port aggregation, a single TCP flow should leave through a single port under this policy. But remember that TCP runs over IP, and IP may fragment: once an IP datagram is fragmented, the fragments cannot be attributed to a particular TCP flow until they are reassembled (at the far end, or at a NAT device). IP is completely connectionless; it only cares about fragmenting according to the local MTU, so in many cases layer3+4 does not give a fully satisfying result. Still, the problem is not that serious: IP fragments only according to the local MTU, while TCP is end to end and can use mechanisms such as MSS and MTU discovery together with the sliding window to minimize IP fragmentation. So the layer3+4 policy is generally fine.
miimon and arp: miimon can only check the link-layer state, i.e. the layer-2 link between a switch port and the directly attached local NIC port; if the switch's uplink goes down this cannot be detected, so a network-layer check is needed as well. The simplest and most direct method is ARP: ARP the gateway directly, and if no ARP reply has arrived when the timer expires, consider the link dead.
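For example (a sketch only; the mode, policy and miimon value are illustrative), mode and hash policy can be combined when loading the driver:
modprobe bonding mode=802.3ad xmit_hash_policy=layer3+4 miimon=100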
35. pssh
pssh is a tool for running commands on many servers at once; it can also copy files, and it is one of the best tools of its kind. To use it, key-based SSH authentication must already be configured on every server.
Tested on CentOS 5.6 64-bit and Red Hat Enterprise Linux 6.1 64-bit.
1. Install pssh
Download the latest pssh release from http://www.theether.org/pssh/ or http://code.google.com/p/parallel-ssh/
# wget http://www.theether.org/pssh/pssh-1.4.3.tar.gz
# tar zxvf pssh-1.4.3.tar.gz
# cd pssh-1.4.3
# wget 'http://peak.telecommunity.com/dist/ez_setup.py'
# python ez_setup.py
# python setup.py install
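A minimal usage sketch (the host file contents, user and command are placeholders, and the flags shown follow recent pssh releases; key-based access must already work, as noted above):
# cat hosts.txt
192.168.1.11
192.168.1.12
# pssh -h hosts.txt -l root -i "uptime"
# pscp -h hosts.txt -l root /etc/ntp.conf /etc/ntp.conf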
36. The Way (道)
- Learn to get past the firewall and set up your own website;
- Answer 10 questions on stackoverflow.com;
- Publish a technical article on ATA;
- Understand how the Taobao site supports flash sales;
- Understand what is done to support a big promotion like Double 11;
- Try to do something outside your job duties that helps the team or the company's business;
- Try to submit a patch to an open source project you use or are interested in;
Form an initial view of the direction in which you want to develop.
37. High Speed Framework (HSF)
- A remote procedure call (RPC) framework
- Easy to use, with very little intrusion into Java code
- Supports more than 800 production systems; on Double 11, HSF handled over 120 billion calls in a day
37. Taobao Distributed Data Layer (TDDL)
- Data source management and data sharding; one of the most important components of the "de-IOE" effort
- Safe and stable; has never suffered a serious failure
- Supports more than 600 production systems
38. IP SAN
39. Storage Area Network (SAN)
A storage area network, usually built on high-speed Fibre Channel, with high requirements on throughput and redundancy.
An IP SAN uses the iSCSI storage protocol, with block-level transfers.
40. IDC
IDC, short for Internet Data Center: a telecom operator uses its existing Internet lines and bandwidth to build a standardized, carrier-grade machine-room environment and provides enterprises and governments with a full range of services such as server colocation, rental and related value-added services.
41. TDP
TDP stands for "Thermal Design Power". It is provided mainly for system builders, heatsink/fan vendors and chassis vendors to use when designing systems. TDP applies mostly to CPUs: a CPU's TDP is the maximum heat the final version of that CPU series may dissipate at full load (theoretically 100% CPU utilization), and the cooler must keep the processor within its design temperature range even at maximum TDP.
42. DIMM
(Dual Inline Memory Module.) Quite similar to a SIMM; the difference is that the gold-finger contacts on the two sides of a DIMM are not tied together as on a SIMM but carry independent signals, so a DIMM can carry more data signals.
NUMA
MySQL multiple instances on a single server
This means running several MySQL database instances on one physical PC server. Why do this, and what are the benefits?
1. Storage technology has advanced rapidly and IO is no longer the bottleneck
On an ordinary PC server, CPU and IO resources are unbalanced: disk IO capacity is very limited, so to meet application demand a large number of servers are deployed and a great deal of CPU capacity is wasted. Flash storage has changed this; single-machine IO capacity is no longer the bottleneck, so several MySQL instances can run on one machine to raise CPU utilization.
2. MySQL makes poor use of multi-core CPUs
MySQL's poor multi-core utilization has long been a problem: before version 5.1, performance did not scale linearly beyond 4 cores. Later versions keep improving this, and both the InnoDB plugin and Percona XtraDB have improved multi-core utilization considerably, but performance still does not grow with the number of cores. The common two-socket Xeon servers have 4-8 cores per CPU, which the OS sees as 16-32 CPUs (two threads per core), and four-socket servers reach 64 cores or more, so improving MySQL's multi-core utilization is an important way to improve performance. The figure below (not reproduced here) is a set of benchmark results from Percona.
3. The impact of NUMA on MySQL performance
The PC servers we use today are NUMA machines; the figure below (not reproduced here) shows the architecture of an Intel 5600 CPU.
NUMA has four memory allocation policies:
1. default: always allocate on the local node (the node the current process is running on);
2. bind: force allocation on a specified node;
3. interleave: interleave allocations across all nodes or a specified set of nodes;
4. preferred: allocate on a specified node first, and fall back to other nodes on failure.
Because NUMA's default policy prefers the local memory of the CPU the process runs on, memory use across CPU nodes becomes unbalanced; when one node runs out of memory the system starts swapping instead of allocating from a remote node. This is the so-called "swap insanity" phenomenon.
MySQL uses a threaded model and does not handle NUMA well. If a machine runs only one MySQL instance, NUMA can be disabled in one of three ways: 1. in hardware, via a BIOS setting; 2. in the OS kernel, by booting with numa=off; 3. with the numactl command, by changing the allocation policy to interleave (some hardware also allows this in the BIOS).
If a machine runs several MySQL instances, we can instead bind each instance to a different CPU node and use the bind allocation policy so that memory is allocated strictly on the local node; this exploits the hardware's NUMA characteristics and at the same time sidesteps the poor multi-core utilization of a single MySQL instance.
Resource isolation
1. CPU and memory
numactl --cpubind=0 --localalloc: this binds a MySQL instance to a given CPU node (cpubind refers to the NUMA CPU node, which can be inspected with numactl --hardware), and localalloc selects the local memory allocation policy (see the sketch after this list).
2. IO
The machines have a built-in FusionIO card (320 GB) combined with flashcache, so single-machine IO is no longer the bottleneck; IO is therefore shared by all instances and not limited per instance. The MySQL instances use the same physical device, distinguished only by different directories.
3. Network
With several instances on one machine the network must be tuned as well. We give the machine multiple IPs and bind the MySQL instances to different NICs to raise overall network capacity. A more advanced technique is to bind each NIC's interrupts to specific CPUs, which improves NIC efficiency considerably.
4. Why not virtual machines?
Virtual machines consume extra resources, and MySQL is an IO-bound workload: virtualization would cut IO performance sharply, and the management cost of VMs is relatively high. For these reasons our databases do not run in virtual machines.
5. Performance
The figure below (not reproduced here) shows Percona's benchmark results; the improvement from running two instances is very clear.
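A hedged sketch of starting two instances bound to different NUMA nodes (the config file paths, ports and the use of mysqld_safe are placeholders for whatever startup method is actually in use):
# numactl --cpubind=0 --localalloc mysqld_safe --defaults-file=/etc/my_3306.cnf &
# numactl --cpubind=1 --localalloc mysqld_safe --defaults-file=/etc/my_3307.cnf &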
High availability
Because one machine runs several MySQL instances, host-level HA schemes such as heartbeat cannot be used: when one instance has a problem, the whole machine cannot simply be switched over. HA therefore has to operate at the MySQL instance level. We use a MySQL access layer developed in-house; when an instance fails, only that instance is switched, and the layer is transparent to applications.
Advantages of the single-server multi-instance approach
1. Lower cost: Flash storage is relatively expensive, but if it reduces the number of machines, then counting power and machine-room costs it is still cheaper than one instance per server.
2. Higher utilization: binding MySQL instances to different CPU nodes via NUMA raises CPU utilization and works around MySQL's poor multi-core scaling.
3. Better user experience: Flash storage cuts IO response time dramatically, which helps improve the user experience.
--EOF--
For more on NUMA, see the article: NUMA与Intel新一代Xeon处理
43. SLC / MLC / TLC
SLC = Single-Level Cell, 1 bit/cell: fast and long-lived, but very expensive (about 3x or more the price of MLC), roughly 100,000 program/erase cycles.
MLC = Multi-Level Cell, 2 bits/cell: average speed, lifetime and price, roughly 3,000-10,000 program/erase cycles.
TLC = Trinary-Level Cell, 3 bits/cell (some flash vendors call it 8LC): slow and short-lived but cheap, roughly 500 program/erase cycles; so far no vendor reaches 1,000.
1. Endurance: compared with SLC flash, MLC endures far fewer writes; SLC can be rewritten about 100,000 times, MLC only about 10,000, and some parts only about 5,000. 2. Speed: under the same conditions MLC reads and writes more slowly than SLC; current MLC chips reach only about 2 MB/s. 3. Power: under the same conditions MLC draws about 15% more current than SLC. 4. Cost: MLC dies hold more capacity and greatly reduce manufacturers' costs, while SLC holds less and costs more. On raw performance SLC wins; on price/performance MLC wins.
44. Storage architectures
Server internal storage: storage inside the server itself, e.g. IDE, SCSI, SAS, SATA.
Direct Attached Storage (DAS): storage connected directly to the server over IDE, SCSI or FC, with the server at the centre. Client access to the data must go through the server and then over its I/O bus to the storage device, so the server effectively acts as a store-and-forward stage.
Network Attached Storage (NAS): a dedicated storage server that strips away most of the general-purpose computing functions and provides only a file system, purely as a storage service. NAS offers file-level I/O to many kinds of clients over IP-based network file protocols; clients perform file-level operations on the directories or devices exported by the NAS device. The dedicated server uses NFS or CIFS to act as a remote file server and provide cross-platform concurrent file access, so NAS is mainly used for file sharing.
Storage Area Network (SAN): storage devices and application servers are connected over a network dedicated to host-storage access; when data must be read or written it travels at high speed across this network between the servers and the back-end storage. A SAN consists of application servers, a back-end storage system and SAN interconnect equipment. The back-end storage system is made up of SAN controllers and disk systems; the controller is the key component, providing storage access, data operations, backup, data sharing, snapshots and other data protection and management functions. The back-end storage uses disk arrays and RAID to provide capacity and protection. SAN interconnect equipment includes switches, HBAs and cabling of various media.
I. iSCSI storage systems: controller-based architecture
The core processing unit of such an iSCSI system has the same structure as an FC storage array: dedicated data-transfer chips, dedicated RAID parity chips, dedicated high-performance cache and a dedicated embedded platform. Opening the chassis reveals a cable-less backplane design: every component plugs into the backplane through standard or proprietary slots, instead of the assortment of cables found inside an ordinary PC.
In this architecture the core processing unit is built from high-performance, single-purpose hardware chips, so processing efficiency is high. The operating system is embedded; compared with other operating systems, an embedded OS is small, highly stable, strongly real-time, runs from fixed code and is simple to operate, so controller-based iSCSI storage offers high security and stability.
The cable-less backplane interconnect removes single points of failure in the interconnect, making the system safer and its performance more stable. It is typically used for online storage with high requirements on stable performance and availability, e.g. small and medium database systems, backup systems for large databases, remote disaster-recovery systems, web sites, utilities or non-linear editing networks.
Because the core processing is done entirely in hardware, controller-based iSCSI devices are expensive to build and therefore usually carry a high price.
There is also a special kind of controller-based iSCSI device on the market: an existing FC array extended with an iSCSI protocol conversion module, so that it supports both the FC and iSCSI transport protocols, e.g. EMC 150i/300i/500i.
Whether a device really is controller-based can be judged from the following points:
1. Dual controllers: apart from some early or low-end models, high-performance iSCSI storage generally uses two controllers working active-active; the controllers are modular and sit in the same chassis, not in two separate boxes.
2. Cache: mirrored cache between the two controllers and power-loss protection for the cache.
3. Data integrity: dedicated parity and data-transfer chips, not software checksums on an ordinary CPU or an ordinary RAID card.
4. Internal construction: opening a controller-based device shows an entirely cable-less backplane layout, with every hardware module plugged into a backplane slot.
II. iSCSI storage systems: iSCSI bridge architecture
This architecture has two parts: a front-end protocol conversion device and back-end storage, structurally similar to a NAS gateway plus its back-end storage.
The front-end protocol converter is usually a hardware device with Gigabit Ethernet host ports and SCSI or FC disk-side ports, so it can attach SCSI disk arrays and FC storage; it exposes the iSCSI protocol over its Gigabit Ethernet host ports.
The back-end storage is normally a SCSI disk array or FC array whose host ports connect directly to the disk-side ports of the iSCSI bridge.
The bridge itself only converts protocols; it has no RAID parity, snapshot or volume-copy features. RAID groups and LUNs must be created on the storage device itself, and the overall iSCSI system can only do whatever the back-end storage can do.
Examples are the SANRAD V-Switch series and ATTO Technology's iPBridge series of iSCSI bridges, which provide iSCSI-to-SCSI and iSCSI-to-FC bridging and can turn a directly attached disk enclosure (disk array, JBOD, DAS) or tape device (autoloader, tape library) into iSCSI storage.
As iSCSI technology has matured, bridge-architecture devices have become rare, and the market has essentially none left today.
III. iSCSI storage systems: PC architecture
What is a PC architecture? Literally, storage built on a PC server: take an ordinary, reasonably fast PC that can hold many disks (usually a PC server or industrial server), install a relatively mature and stable iSCSI target software package on it, and the ordinary PC server becomes an iSCSI storage device, exposing the iSCSI protocol over the server's Ethernet ports.
The common iSCSI target packages are mostly commercial, e.g. DataCore Software's SANmelody, FalconStor Software's iSCSI Server for Windows, and String Bean Software's WinTarget; these run only on Windows.
On a PC-architecture iSCSI device, all RAID parity, logical volume management, iSCSI processing and TCP/IP processing are done purely in software, so the demands on the PC's CPU and memory are high, and the storage performance is easily affected by whatever else the PC server is doing.
Because PC-architecture iSCSI devices are relatively simple to develop, build and deploy, and their hardware and software costs are low, they are cheap, which gives them a clear price advantage in systems with modest requirements on performance stability.
四、iSCSI存储系统架构之PC+NIC系统架构
PC+iSCSI target软件方式是一种低成本、性能一般的iSCSI存储系统架构解决方案,另外还有一种基于PC+NIC的更高性能的iSCSI存储系统架构方案。
这种方案是指在PC服务器中安装高性能的TOE智能NIC卡,将占用CPU资源较多的iSCSI运算、TCP/IP运算等数据传输操作转移到智能卡的硬件芯片上,由智能卡的专用硬件芯片来完成iSCSI运算、TCP/IP运算等,简化网络两端的内存数据交换程序,从而加速数据传输效率,降低PC的CPU占用,提高存储的性能。
工作职责:
1、规划建设和管理SAN存储网络;
2、管理主流中高端存储产品,如EMC、HDS、NETAPP等;
3、应用主流备份软件,如Veritas、TSM、Commvault,实施数据备份解决方案;
4、存储管理、备份管理、数据迁移、性能优化。
职位要求:
1、熟练掌握存储网络建设管理和相关产品方案;
2、熟练掌握主流中高端存储产品,如EMC、NetApp、HDS等;
3、熟练掌握主流备份软件,如Veritas、TSM、Commvault,实施数据备份解决方案;
4、熟悉Linux、AIX、HP-UX系统管理,掌握相关Shell脚本编程;
5、具有一定数据库知识。
45、SSD
地址空间虚拟化、容量冗余、垃圾回收、磨损均衡、坏块管理等一系列机制和措施,保证了SSD使用寿命的最大化。
46、error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
安装了python2.7后,第一次执行时报错:
error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
解决方法如下:
1.编辑 vi /etc/ld.so.conf
如果是非root权限帐号登录,使用 sudo vi /etc/ld.so.conf
添加上python2.7的lib库地址,如我的/usr/local/Python2.7/lib,保存文件
2.执行 /sbin/ldconfig -v命令,如果是非root权限帐号登录,使用sudo /sbin/ldconfig -v。这样 ldd 才能找到这个库,执行python2.7就不会报错了
/etc/ld.so.conf:
这个文件记录了运行时动态链接器搜索共享库的路径。
默认情况下,动态链接器只会使用/lib和/usr/lib这两个目录下的库文件。
如果你安装某些库时没有指定 --prefix=/usr,lib库就会装到/usr/local下,而又没有在/etc/ld.so.conf中添加/usr/local/lib,就会报错了。
再说说ldconfig是个什么东西:
它是一个程序,通常位于/sbin下,由root用户使用。具体作用及用法可以man ldconfig查到。
简单的说,它的作用就是将/etc/ld.so.conf列出的路径下的库文件缓存到/etc/ld.so.cache 以供使用
因此当安装完一些库文件(例如刚安装好glib),或者修改ld.so.conf增加新的库路径后,需要运行一下/sbin/ldconfig,使所有的库文件都被缓存到ld.so.cache中。如果没做,即使库文件明明就在/usr/lib下,也不会被使用,结果编译过程中报错,提示缺少xxx库。
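把上面两步合起来,可以用一小段Python脚本完成(仅为示意,假设以root权限运行,库路径沿用上文示例中的/usr/local/Python2.7/lib,实际请按安装位置调整):
import subprocess

lib_path = "/usr/local/Python2.7/lib"     # 上文示例中的Python 2.7库目录

# 若/etc/ld.so.conf中还没有该路径,则追加一行
with open("/etc/ld.so.conf", "r+") as f:
    content = f.read()
    if lib_path not in content.split():
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(lib_path + "\n")

# 重新生成/etc/ld.so.cache,使动态链接器能找到libpython2.7.so.1.0
subprocess.check_call(["/sbin/ldconfig", "-v"])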
47、Cache
为了提高磁盘存取效率,Linux做了一些精心的设计,除了对dentry进行缓存(用于VFS,加速文件路径名到inode的转换),还采取了两种主要Cache方式:Buffer Cache和Page Cache。前者针对磁盘块的读写,后者以内存页为单位缓存文件内容的读写。这些Cache有效缩短了I/O系统调用(比如read,write,getdents)的时间。
48、linux性能问题(CPU,内存,磁盘I/O,网络)
一. CPU性能评估
1.vmstat [-V] [-n] [delay [count]]
-V : 打印出版本信息,可选参数
-n : 在周期性循环输出时,头部信息仅显示一次
delay : 两次输出之间的时间间隔
count : 按照delay指定的时间间隔统计的次数。默认是1
如:vmstat 1 3
user1@user1-desktop:~$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 1051676 139504 477028 0 0 46 31 130 493 3 1 95 2
0 0 0 1051668 139508 477028 0 0 0 4 377 1792 3 1 95 0
0 0 0 1051668 139508 477028 0 0 0 0 327 1741 3 1 95 0
r : 运行和等待CPU时间片的进程数(若长期大于CPU的个数,说明CPU不足,需要增加CPU)【注意】
b : 在等待资源的进程数(如等待I/O或者内存交换等)
swpd : 切换到内存交换区的内存数量,单位kB
free : 当前空闲物理内存,单位kB
buff : buffers cache的内存数量,一般对块设备的读写才需要缓存
cache : page cached的内存数量,一般作为文件系统cached,频繁访问的文件都会被cached
si : 每秒由磁盘(交换区)调入内存的数量,单位kB/s
so : 每秒由内存调出到磁盘(交换区)的数量,单位kB/s
bi : 从块设备读入数据的总量,即读磁盘,单位kB/s
bo : 写入到块设备的数据总量,即写磁盘,单位kB/s
in : 某一时间间隔中观测到的每秒设备中断数
cs : 每秒产生的上下文切换次数
us :用户进程消耗的CPU时间百分比【注意】
sy : 内核进程消耗CPU时间百分比【注意】
id : CPU处在空闲状态的时间百分比【注意】
wa :IO等待所占用的CPU时间百分比
如果si、so的值长期不为0,表示系统内存不足,需要增加系统内存
bi+bo参考值为1000,若超过1000,且wa较大,表示系统IO有问题,应该提高磁盘的读写性能
in与cs越大,内核消耗的CPU时间就越多
us+sy参考值为80%,如果大于80%,说明可能存在CPU资源不足的情况
综上所述,CPU性能评估中重点注意r、us、sy和id列的值。
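按照上面几条经验阈值,可以用一小段Python脚本对vmstat的一次采样做粗略判断(仅为示意,阈值沿用上文的经验值):
import multiprocessing
import subprocess

# 采样两次,取最后一行作为当前数据(第一行是自开机以来的平均值)
out = subprocess.check_output(["vmstat", "1", "2"]).decode()
fields = out.strip().splitlines()[-1].split()

r = int(fields[0])
si, so = int(fields[6]), int(fields[7])
us, sy, idle, wa = (int(x) for x in fields[12:16])

ncpu = multiprocessing.cpu_count()
if r > ncpu:
    print("r=%d 大于CPU个数%d,CPU可能不足" % (r, ncpu))
if us + sy > 80:
    print("us+sy=%d%% 超过80%%,CPU资源可能紧张" % (us + sy))
if si or so:
    print("si/so不为0,可能存在内存不足导致的换页")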
2. sar [options] [-o filename] [interval [count]]
options:
-A :显示系统所有资源设备(CPU、内存、磁盘)的运行状态
-u : 显示系统所有CPU在采样时间内的负载状态
-P : 显示指定CPU的使用情况(CPU计数从0开始)
-d : 显示所有硬盘设备在采样时间内的使用状况
-r : 显示内存在采样时间内的使用状况
-b : 显示缓冲区在采样时间内的使用情况
-v : 显示进程、文件、I节点和锁表状态
-n :显示网络运行状态。参数后跟DEV(网络接口)、EDEV(网络错误统计)、SOCK(套接字)、FULL(显示其它3个参数所有)。可单独或一起使用
-q : 显示运行队列的大小,与系统当时的平均负载相同
-R : 显示进程在采样时间内的活动情况
-y : 显示终端设备在采样时间内的活动情况
-w : 显示系统交换活动在采样时间内的状态
-o : 将命令结果以二进制格式存放在指定的文件中
interval : 采样间隔时间,必须有的参数
count : 采样次数,默认1
如:sar -u 1 3
user1@user1-desktop:~$ sar -u 1 3
Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)
09时27分18秒 CPU %user %nice %system %iowait %steal %idle
09时27分19秒 all 1.99 0.00 0.50 5.97 0.00 91.54
09时27分20秒 all 3.90 0.00 2.93 5.85 0.00 87.32
09时27分21秒 all 2.93 0.00 1.46 4.39 0.00 91.22
平均时间: all 2.95 0.00 1.64 5.40 0.00 90.02
%user : 用户进程消耗CPU时间百分比
%nice : 以nice优先级运行的用户进程消耗CPU时间百分比
%system : 系统进程消耗CPU时间百分比
%iowait : IO等待所占用的CPU时间百分比
%steal : 虚拟化环境下,CPU等待虚拟机监控程序服务其它虚拟CPU所占的时间百分比
%idle : CPU处在空闲状态的时间百分比
3. iostat [-c | -d] [-k] [-t] [-x [device]] [interval [count]]
-c :显示CPU使用情况
-d :显示磁盘使用情况
-k : 每秒以k bytes为单位显示数据
-t :打印出统计信息开始执行的时间
-x device :指定要统计的磁盘设备名称,默认为所有磁盘设备
interval : 指定两次统计的时间间隔
count : 统计次数
如: iostat -c
user1@user1-desktop:~$ iostat -c
Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.51 0.02 1.27 1.40 0.00 94.81
(每项代表的含义与sar相同)
4. uptime ,如:
user1@user1-desktop:~$ uptime
10:13:30 up 1:15, 2 users, load average:0.00, 0.07, 0.11
显示的分别是:系统当前时间,系统上次开机到现在运行了多长时间,目前登录用户个数,系统在1分钟内、5分钟内、15分钟内的平均负载
注意:load average的三个值一般不能大于系统CPU的个数,否则说明CPU很繁忙
二 . 内存性能评估
1. free
2. watch 与 free 相结合,在watch后面跟上需要运行的命令,watch就会自动重复去运行这个命令,默认是2秒执行一次,如:
Every 2.0s: free Sat Mar 5 10:30:17 2011
total used free shared buffers cached
Mem: 2060496 1130188 930308 0 261284 483072
-/+ buffers/cache: 385832 1674664
Swap: 3000316 0 3000316
(-n指定重复执行的时间,-d表示高亮显示变动)
3.使用vmstat,关注swpd、si和so
4. sar -r,如:
user1@user1-desktop:~$ sar -r 2 3
Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)
10时34分11秒 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
10时34分13秒 923548 1136948 55.18 265456 487156 1347736 26.63
10时34分15秒 923548 1136948 55.18 265464 487148 1347736 26.63
10时34分17秒 923548 1136948 55.18 265464 487156 1347736 26.63
平均时间: 923548 1136948 55.18 265461 487153 1347736 26.63
kbmemfree : 空闲物理内存
kbmemused : 已使用物理内存
%memused : 已使用内存占总内存百分比
kbbuffers : Buffer Cache大小
kbcached : Page Cache大小
kbcommit : 保证当前工作负载所需要提交的内存估计值(含物理内存和交换区)
%commit : kbcommit占内存总量(物理内存+交换区)的百分比
三 . 磁盘I/O性能评估
1. sar -d ,如:
user1@user1-desktop:~$ sar -d 1 3
Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)
10时42分27秒 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
10时42分28秒 dev8-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10时42分28秒 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
10时42分29秒 dev8-0 2.00 0.00 64.00 32.00 0.02 8.00 8.00 1.60
10时42分29秒 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
10时42分30秒 dev8-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
平均时间: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
平均时间: dev8-0 0.67 0.00 21.33 32.00 0.01 8.00 8.00 0.53
DEV : 磁盘设备名称
tps :每秒到物理磁盘的传送数,即每秒的I/O流量。一个传送就是一个I/O请求,多个逻辑请求可以被合并为一个物理I/O请求
rd_sec/s : 每秒从设备读入的扇区数(1扇区=512字节)
wr_sec/s : 每秒写入设备的扇区数目
avgrq-sz : 平均每次设备I/O操作的数据大小(以扇区为单位)
avgqu-sz : 平均I/O队列的长度
await : 平均每次设备I/O操作的等待时间(毫秒)
svctm :平均每次设备I/O 操作的服务时间(毫秒)
%util : 一秒中有百分之几的时间用于I/O操作
正常情况下svctm应该小于await,而svctm的大小和磁盘性能有关,CPU、内存的负荷也会对svctm值造成影响,过多的请求也会间接导致svctm值的增加。
await的大小一般取决于svctm的值和I/O队列长度以及I/O请求模式。如果svctm与await很接近,表示几乎没有I/O等待,磁盘性能很好;如果await的值远高于svctm的值,表示I/O队列等待太长,系统上运行的应用程序将变慢,此时可以通过更换更快的硬盘来解决问题。
%util若接近100%,表示磁盘产生I/O请求太多,I/O系统已经满负荷地在工作,该磁盘可能存在瓶颈。长期下去,势必影响系统的性能,可通过优化程序或者通过更换更高、更快的磁盘来解决此问题。
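把这几条判断规则写成代码更直观,下面是一个Python小示意(倍数阈值为示意用的假设值,示例数值取自上文sar -d的输出):
def diagnose(await_ms, svctm_ms, util_pct):
    """根据 await、svctm、%util 给出粗略的磁盘I/O判断"""
    msgs = []
    if util_pct >= 90:
        msgs.append("%util接近100%,磁盘可能已满负荷,存在瓶颈")
    if await_ms > svctm_ms * 3:          # 倍数为示意用的假设阈值
        msgs.append("await远高于svctm,I/O队列等待过长,应用可能变慢")
    elif abs(await_ms - svctm_ms) < 1:
        msgs.append("await与svctm接近,几乎没有I/O等待,磁盘性能良好")
    return msgs

for msg in diagnose(await_ms=8.00, svctm_ms=8.00, util_pct=1.60):
    print(msg)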
2. iostat -d
user1@user1-desktop:~$ iostat -d 2 3
Linux 2.6.35-27-generic (user1-desktop)2011年03月05日 _i686_ (2 CPU)
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 5.89 148.87 57.77 1325028 514144
Device: tps Blk_read/s Blk_wrtn/s Blk_readBlk_wrtn
sda 0.00 0.00 0.00 0 0
Device: tps Blk_read/s Blk_wrtn/s Blk_readBlk_wrtn
sda 0.00 0.00 0.00 0 0
Blk_read/s : 每秒读取的数据块数
Blk_wrtn/s : 每秒写入的数据块数
Blk_read : 读取的所有块数
Blk_wrtn : 写入的所有块数
如果Blk_read/s很大,表示磁盘直接读取操作很多,可以将读取的数据写入内存中进行操作;如果Blk_wrtn/s很大,表示磁盘的写操作很频繁,可以考虑优化磁盘或者优化程序。这两个选项没有一个固定的大小,不同的操作系统值也不同,但长期的超大的数据读写,肯定是不正常的,一定会影响系统的性能。
3. iostat -x /dev/sda 2 3 ,对指定磁盘的单独统计
4. vmstat -d
四 . 网络性能评估
1. ping
time值显示了两台主机之间的网络延时情况,若很大,表示网络的延时很大。packet loss表示网络丢包率,越小表示网络的质量越高。
2. netstat -i ,如:
user1@user1-desktop:~$ netstat -i
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 6043239 0 0 0 87311 0 0 0 BMRU
lo 16436 0 2941 0 0 0 2941 0 0 0 LRU
Iface : 网络设备的接口名称
MTU : 最大传输单元,单位字节
RX-OK / TX-OK : 准确无误地接收 / 发送了多少数据包
RX-ERR / TX-ERR : 接收 / 发送数据包时产生了多少错误
RX-DRP / TX-DRP : 接收 / 发送数据包时丢弃了多少数据包
RX-OVR / TX-OVR : 由于误差而遗失了多少数据包
Flg :接口标记,其中:
L :该接口是个回环设备
B : 设置了广播地址
M : 接收所有的数据包
R :接口正在运行
U : 接口处于活动状态
O : 在该接口上禁用arp
P :表示一个点到点的连接
正常情况下,RX-ERR,RX-DRP,RX-OVR,TX-ERR,TX-DRP,TX-OVR都应该为0,若不为0且很大,那么网络质量肯定有问题,网络传输性能也一定会下降。
当网络传输存在问题时,可以检测网卡设备是否存在故障,还可以检查网络部署环境是否合理。
3. netstat -r (default行对应的值表示系统的默认路由)
4. sar -n ,n后为DEV(网络接口信息)、EDEV(网络错误统计信息)、SOCK(套接字信息)、和FULL(显示所有)
wangxin@wangxin-desktop:~$ sar -n DEV 2 3
Linux 2.6.35-27-generic (wangxin-desktop)2011年03月05日 _i686_ (2 CPU)
11时55分32秒 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
11时55分34秒 lo 2.00 2.00 0.12 0.12 0.00 0.00 0.00
11时55分34秒 eth0 2.50 0.50 0.31 0.03 0.00 0.00 0.00
11时55分34秒 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
11时55分36秒 lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11时55分36秒 eth0 1.50 0.00 0.10 0.00 0.00 0.00 0.00
11时55分36秒 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
11时55分38秒 lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11时55分38秒 eth0 14.50 0.00 0.88 0.00 0.00 0.00 0.00
平均时间: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
平均时间: lo 0.67 0.67 0.04 0.04 0.00 0.00 0.00
平均时间: eth0 6.17 0.17 0.43 0.01 0.00 0.00 0.00
IFACE : 网络接口设备
rxpck/s : 每秒接收的数据包数
txpck/s : 每秒发送的数据包数
rxkB/s : 每秒接收的千字节数
txkB/s : 每秒发送的千字节数
rxcmp/s : 每秒接收的压缩数据包数
txcmp/s : 每秒发送的压缩数据包数
rxmcst/s : 每秒接收的多播数据包数
49、HSF(High-Speed Service Framework)
是一个远程调用(RPC)框架,支撑着淘宝整个Java应用的分布式集群环境。
50、LVS简介
LVS是Linux Virtual Server的简称,也就是Linux虚拟服务器。
使用LVS技术要达到的目标是:通过LVS提供的负载均衡技术和Linux操作系统实现一个高性能、高可用的服务器群集,它具有良好可靠性、可扩展性和可操作性。从而以低廉的成本实现最优的服务性能。
LVS自从1998年开始,发展到现在已经是一个比较成熟的技术项目了。可以利用LVS技术实现高可伸缩的、高可用的网络服务,例如WWW服务、Cache服务、DNS服务、FTP服务、MAIL服务、视频/音频点播服务等等,有许多比较著名网站和组织都在使用LVS架设的集群系统,例如:Linux的门户网站(www.linux.com)、向RealPlayer提供音频视频服务而闻名的Real公司(www.real.com)、全球最大的开源网站(sourceforge.net)等。
二、 LVS体系结构
使用LVS架设的服务器集群系统由三个部分组成:最前端的负载均衡层,用Load Balancer表示;中间的服务器群组层,用Server Array表示;最底端的数据共享存储层,用Shared Storage表示。在用户看来,所有的内部应用都是透明的,用户只是在使用一个虚拟服务器提供的高性能服务。
LVS体系结构如图1所示:
图1 LVS的体系结构
下面对LVS的各个组成部分进行详细介绍:
Load Balancer层:位于整个集群系统的最前端,由一台或者多台负载调度器(Director Server)组成,LVS模块就安装在Director Server上,而Director的主要作用类似于一个路由器,它含有完成LVS功能所设定的路由表,通过这些路由表把用户的请求分发给Server Array层的应用服务器(Real Server)。同时,在Director Server上还要安装对Real Server服务的监控模块Ldirectord,此模块用于监测各个Real Server服务的健康状况,在Real Server不可用时把它从LVS路由表中剔除,恢复时重新加入。
Server Array层:由一组实际运行应用服务的机器组成,Real Server可以是WEB服务器、MAIL服务器、FTP服务器、DNS服务器、视频服务器中的一个或者多个,每个Real Server之间通过高速的LAN或分布在各地的WAN相连接。在实际的应用中,Director Server也可以同时兼任Real Server的角色。
Shared Storage层:是为所有Real Server提供共享存储空间和内容一致性的存储区域,在物理上一般由磁盘阵列设备组成。为了提供内容的一致性,一般可以通过NFS网络文件系统共享数据,但是NFS在繁忙的业务系统中性能并不是很好,此时可以采用集群文件系统,例如Red Hat的GFS文件系统、Oracle提供的OCFS2文件系统等。
从整个LVS结构可以看出,Director Server是整个LVS的核心。目前,用于Director Server的操作系统只能是Linux和FreeBSD,Linux 2.6内核不用任何设置就可以支持LVS功能,而FreeBSD作为Director Server的应用还不是很多,性能也不是很好。
对于Real Server,几乎可以是所有的系统平台,Linux、windows、Solaris、AIX、BSD系列都能很好的支持。
三、 LVS集群的特点
3.1 IP负载均衡与负载调度算法
1.IP负载均衡技术
负载均衡技术有很多实现方案,有基于DNS域名轮流解析的方法、有基于客户端调度访问的方法、有基于应用层系统负载的调度方法,还有基于IP地址的调度方法,在这些负载调度算法中,执行效率最高的是IP负载均衡技术。
LVS的IP负载均衡技术是通过IPVS模块来实现的,IPVS是LVS集群系统的核心软件,它的主要作用是:安装在Director Server上,同时在Director Server上虚拟出一个IP地址,用户必须通过这个虚拟的IP地址访问服务。这个虚拟IP一般称为LVS的VIP,即Virtual IP。访问的请求首先经过VIP到达负载调度器,然后由负载调度器从Real Server列表中选取一个服务节点响应用户的请求。
当用户的请求到达负载调度器后,调度器如何将请求发送到提供服务的Real Server节点,而Real Server节点如何返回数据给用户,是IPVS实现的重点技术,IPVS实现负载均衡机制有三种,分别是NAT、TUN和DR,详述如下:
VS/NAT:即(Virtual Server via Network Address Translation)
也就是网络地址翻译技术实现虚拟服务器,当用户请求到达调度器时,调度器将请求报文的目标地址(即虚拟IP地址)改写成选定的Real Server地址,同时报文的目标端口也改成选定的Real Server的相应端口,最后将报文请求发送到选定的Real Server。在服务器端得到数据后,Real Server返回数据给用户时,需要再次经过负载调度器将报文的源地址和源端口改成虚拟IP地址和相应端口,然后把数据发送给用户,完成整个负载调度过程。
可以看出,在NAT方式下,用户请求和响应报文都必须经过Director Server地址重写,当用户请求越来越多时,调度器的处理能力将成为瓶颈。
VS/TUN:即(Virtual Server via IP Tunneling)
也就是IP隧道技术实现虚拟服务器。它的连接调度和管理与VS/NAT方式一样,只是它的报文转发方法不同,VS/TUN方式中,调度器采用IP隧道技术将用户请求转发到某个Real Server,而这个Real Server将直接响应用户的请求,不再经过前端调度器,此外,对Real Server的地域位置没有要求,可以和Director Server位于同一个网段,也可以是独立的一个网络。因此,在TUN方式中,调度器将只处理用户的报文请求,集群系统的吞吐量大大提高。
VS/DR:即(Virtual Server via Direct Routing)
也就是用直接路由技术实现虚拟服务器。它的连接调度和管理与VS/NAT和VS/TUN中的一样,但它的报文转发方法又有不同,VS/DR通过改写请求报文的MAC地址,将请求发送到Real Server,而Real Server将响应直接返回给客户,免去了VS/TUN中的IP隧道开销。这种方式是三种负载调度机制中性能最好的,但是必须要求Director Server与Real Server都有一块网卡连在同一物理网段上。
2.负载调度算法
上面我们谈到,负载调度器是根据各个服务器的负载情况,动态地选择一台Real Server响应用户请求,那么动态选择是如何实现呢,其实也就是我们这里要说的负载调度算法,根据不同的网络服务需求和服务器配置,IPVS实现了如下八种负载调度算法,这里我们详细讲述最常用的四种调度算法,剩余的四种调度算法请参考其它资料。
轮叫调度(RoundRobin)
“轮叫”调度也叫1:1调度,调度器通过“轮叫”调度算法将外部用户请求按顺序1:1的分配到集群中的每个Real Server上,这种算法平等地对待每一台Real Server,而不管服务器上实际的负载状况和连接状态。
加权轮叫调度(Weighted Round Robin)
“加权轮叫”调度算法是根据Real Server的不同处理能力来调度访问请求。可以对每台Real Server设置不同的调度权值,对于性能相对较好的Real Server可以设置较高的权值,而对于处理能力较弱的Real Server,可以设置较低的权值,这样保证了处理能力强的服务器处理更多的访问流量,充分合理地利用了服务器资源。同时,调度器还可以自动查询Real Server的负载情况,并动态地调整其权值。
最少链接调度(Least Connections)
“最少连接”调度算法动态地将网络请求调度到已建立的链接数最少的服务器上。如果集群系统的真实服务器具有相近的系统性能,采用“最小连接”调度算法可以较好地均衡负载。
加权最少链接调度(Weighted Least Connections)
“加权最少链接调度”是“最少连接调度”的超集,每个服务节点可以用相应的权值表示其处理能力,而系统管理员可以动态的设置相应的权值,缺省权值为1,加权最小连接调度在分配新连接请求时尽可能使服务节点的已建立连接数和其权值成正比。
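把“已建立连接数与权值成正比”这一规则写成代码很直观,下面是加权最少链接(WLC)选择逻辑的Python示意(服务器列表与数值均为虚构的示例数据,并非IPVS的真实实现):
servers = [
    {"name": "rs1", "weight": 3, "conns": 12},
    {"name": "rs2", "weight": 1, "conns": 3},
    {"name": "rs3", "weight": 2, "conns": 5},
]

def pick_server(servers):
    # 权值为0的节点不参与调度;其余按 已建立连接数/权值 最小者优先
    candidates = [s for s in servers if s["weight"] > 0]
    return min(candidates, key=lambda s: float(s["conns"]) / s["weight"])

chosen = pick_server(servers)
chosen["conns"] += 1
print("新连接调度到:%s" % chosen["name"])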
其它四种调度算法分别为:基于局部性的最少链接(Locality-Based Least Connections)、带复制的基于局部性最少链接(Locality-Based Least Connections with Replication)、目标地址散列(Destination Hashing)和源地址散列(Source Hashing),对于这四种调度算法的含义,本文不再讲述,如果想深入了解这其余四种调度策略的话,可以登陆LVS中文站点zh.linuxvirtualserver.org,查阅更详细的信息。
3.2 高可用性
LVS是一个基于内核级别的应用软件,因此具有很高的处理性能,用LVS构架的负载均衡集群系统具有优秀的处理能力,每个服务节点的故障不会影响整个系统的正常使用,同时又实现负载的合理均衡,使应用具有超高负荷的服务能力,可支持上百万个并发连接请求。如配置百兆网卡,采用VS/TUN或VS/DR调度技术,整个集群系统的吞吐量可高达1Gbits/s;如配置千兆网卡,则系统的最大吞吐量可接近10Gbits/s。
3.3 高可靠性
LVS负载均衡集群软件已经在企业、学校等行业得到了很好的普及应用,国内外很多大型的、关键性的web站点也都采用了LVS集群软件,所以它的可靠性在实践中得到了很好的证实。有很多以LVS做的负载均衡系统,运行很长时间,从未做过重新启动。这些都说明了LVS的高稳定性和高可靠性。
3.4 适用环境
LVS对前端Director Server目前仅支持Linux和FreeBSD系统,但是支持大多数的TCP和UDP协议,支持TCP协议的应用有:HTTP,HTTPS,FTP,SMTP,POP3,IMAP4,PROXY,LDAP,SSMTP等等。支持UDP协议的应用有:DNS,NTP,ICP,视频、音频流播放协议等。
LVS对Real Server的操作系统没有任何限制,RealServer可运行在任何支持TCP/IP的操作系统上,包括Linux,各种Unix(如FreeBSD、Sun Solaris、HP Unix等),Mac/OS和Windows等。
3.5 开源软件
LVS集群软件是按GPL(GNU General Public License)许可证发行的自由软件,因此,使用者可以得到软件的源代码,并且可以根据自己的需要进行各种修改,但是修改必须是以GPL方式发行。
51、精卫简介
精卫填海(简称精卫)是一个基于MySQL数据库的数据复制组件,远期目标是构建一个完善的、可接入多种不同类型源数据的实时数据复制框架。它基于最原始的生产者-消费者模型,引入Pipeline(负责数据传送)、Extractor(生产数据)、Applier(消费数据)的概念,构建一套高易用性的数据复制框架。
52、HDD传输带宽:
具有高带宽规格的硬盘在传输大块连续数据时具有优势,而具有高IOPS的硬盘在传输小块不连续的数据时具有优势。
53、IOPS=1/(寻道时间+旋转延迟时间+数据传输时间)
完成一次IO所用的时间=寻道时间+旋转延迟时间+数据传输时间,IOPS=IO并发系数/(寻道时间+旋转延迟时间+数据传输时间)
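按这个公式可以粗算一块7200转SATA盘的随机IOPS,下面的Python小例子中各项耗时均为假设的典型值,仅用于说明量级:
seek_ms = 9.0                          # 平均寻道时间(假设值)
rotation_ms = 60.0 * 1000 / 7200 / 2   # 平均旋转延迟 = 半圈时间,约4.17ms
transfer_ms = 0.1                      # 小块数据的传输时间,近似可忽略

iops = 1000.0 / (seek_ms + rotation_ms + transfer_ms)
print("估算IOPS约为 %.0f" % iops)       # 约75左右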
54、LXC Python3 scripting
As much fun as C may be, I usually like to script my containers and C isn't really the best language for that. That's why I wrote and maintain the official python3 binding.
The equivalent to the example above in python3 would be:
import lxc
import sys

# Setup the container object
c = lxc.Container("apicontainer")
if c.defined:
    print("Container already exists", file=sys.stderr)
    sys.exit(1)

# Create the container rootfs
if not c.create("download", lxc.LXC_CREATE_QUIET, {"dist": "ubuntu",
                                                   "release": "trusty",
                                                   "arch": "i386"}):
    print("Failed to create the container rootfs", file=sys.stderr)
    sys.exit(1)

# Start the container
if not c.start():
    print("Failed to start the container", file=sys.stderr)
    sys.exit(1)

# Query some information
print("Container state: %s" % c.state)
print("Container PID: %s" % c.init_pid)

# Stop the container
if not c.shutdown(30):
    print("Failed to cleanly shutdown the container, forcing.")
    if not c.stop():
        print("Failed to kill the container", file=sys.stderr)
        sys.exit(1)

# Destroy the container
if not c.destroy():
    print("Failed to destroy the container.", file=sys.stderr)
    sys.exit(1)
Now for that specific example, python3 isn't that much simpler than the C equivalent.
But what if we wanted to do something slightly more tricky, like iterating through all existing containers, start them (if they're not already started), wait for them to have network connectivity, then run updates and shut them down?
import lxc
import sys

for container in lxc.list_containers(as_object=True):
    # Start the container (if not started)
    started = False
    if not container.running:
        if not container.start():
            continue
        started = True

    if not container.state == "RUNNING":
        continue

    # Wait for connectivity
    if not container.get_ips(timeout=30):
        continue

    # Run the updates
    container.attach_wait(lxc.attach_run_command,
                          ["apt-get", "update"])
    container.attach_wait(lxc.attach_run_command,
                          ["apt-get", "dist-upgrade", "-y"])

    # Shutdown the container
    if started:
        if not container.shutdown(30):
            container.stop()
The most interesting bit in the example above is the attach_wait command, which basically lets you run a standard python function in the container's namespaces, here's a more obvious example:
import lxc

c = lxc.Container("p1")
if not c.running:
    c.start()

def print_hostname():
    with open("/etc/hostname", "r") as fd:
        print("Hostname: %s" % fd.read().strip())

# First run on the host
print_hostname()

# Then on the container
c.attach_wait(print_hostname)

if not c.shutdown(30):
    c.stop()
And the output of running the above:
stgraber@castiana:~$ python3 lxc-api.py
/home/stgraber/<frozen>:313: Warning: The python-lxc API isn't yet stable and may change at any point in the future.
Hostname: castiana
Hostname: p1
It may take you a little while to wrap your head around the possibilities offered by that function, especially as it also takes quite a few flags (look for LXC_ATTACH_* in the C API) which lets you control which namespaces to attach to, whether to have the function contained by apparmor, whether to bypass cgroup restrictions, …
That kind of flexibility is something you'll never get with a virtual machine and the way it's supported through our bindings makes it easier than ever to use by anyone who wants to automate custom workloads.
You can also use the API to script cloning containers and using snapshots (though for that example to work, you need current upstream master due to a small bug I found while writing this…):
import lxc
import os
import sys

if not os.geteuid() == 0:
    print("The use of overlayfs requires privileged containers.")
    sys.exit(1)

# Create a base container (if missing) using an Ubuntu 14.04 image
base = lxc.Container("base")
if not base.defined:
    base.create("download", lxc.LXC_CREATE_QUIET, {"dist": "ubuntu",
                                                   "release": "precise",
                                                   "arch": "i386"})

    # Customize it a bit
    base.start()
    base.get_ips(timeout=30)
    base.attach_wait(lxc.attach_run_command, ["apt-get", "update"])
    base.attach_wait(lxc.attach_run_command, ["apt-get", "dist-upgrade", "-y"])

    if not base.shutdown(30):
        base.stop()

# Clone it as web (if not already existing)
web = lxc.Container("web")
if not web.defined:
    # Clone base using an overlayfs overlay
    web = base.clone("web", bdevtype="overlayfs",
                     flags=lxc.LXC_CLONE_SNAPSHOT)

    # Install apache
    web.start()
    web.get_ips(timeout=30)
    web.attach_wait(lxc.attach_run_command, ["apt-get", "update"])
    web.attach_wait(lxc.attach_run_command, ["apt-get", "install",
                                             "apache2", "-y"])

    if not web.shutdown(30):
        web.stop()

# Create a website container based on the web container
mysite = web.clone("mysite", bdevtype="overlayfs",
                   flags=lxc.LXC_CLONE_SNAPSHOT)
mysite.start()
ips = mysite.get_ips(family="inet", timeout=30)
if ips:
    print("Website running at: http://%s" % ips[0])
else:
    if not mysite.shutdown(30):
        mysite.stop()
The above will create a base container using a downloaded image, then clone it using an overlayfs based overlay, add apache2 to it, then clone that resulting container into yet another one called "mysite". So "mysite" is effectively an overlay clone of "web" which is itself an overlay clone of "base".
So there you go, I tried to cover most of the interesting bits of our API with the examples above, though there's much more available, for example, I didn't cover the snapshot API (currently restricted to system containers) outside of the specific overlayfs case above and only scratched the surface of what's possible to do with the attach function.
LXC 1.0 will release with a stable version of the API, we'll be doing additions in the next few 1.x versions (while doing bugfix only updates to 1.0.x) and hope not to have to break the whole API for quite a while (though we'll certainly be adding more stuff to it).
55、NIC
台式机一般都采用内置网卡来连接网络。网卡也叫“网络适配器”,英文全称为“Network Interface Card”,简称“NIC”,网卡是局域网中最基本的部件之一,它是连接计算机与网络的硬件设备。无论是双绞线连接、同轴电缆连接还是光纤连接,都必须借助于网卡才能实现数据的通信。它的主要技术参数为带宽、总线方式、电气接口方式等。它的基本功能为:从并行到串行的数据转换,包的装配和拆装,网络存取控制,数据缓存和网络信号。目前主要是8位和16位网卡。
56、NAT:
NAT(Network Address Translation,网络地址转换)是将IP数据包头中的IP地址转换为另一个IP地址的过程。在实际应用中,NAT主要用于实现私有网络访问公共网络的功能。这种通过使用少量的公有IP地址代表较多的私有IP地址的方式,将有助于减缓可用IP地址空间的枯竭。在RFC 1632中有对NAT的说明。
NAT功能
NAT不仅解决了IP地址不足的问题,而且还能够有效地避免来自网络外部的攻击,隐藏并保护网络内部的计算机。
1.宽带分享:这是 NAT 主机的最大功能。
2.安全防护:NAT 之内的 PC 联机到 Internet 上面时,他所显示的 IP 是 NAT 主机的公共 IP,所以 Client 端的 PC 当然就具有一定程度的安全了,外界在进行 portscan(端口扫描)的时候,就侦测不到源Client 端的 PC 。
NAT实现方式
NAT的实现方式有三种,即静态转换Static NAT、动态转换Dynamic NAT和端口多路复用OverLoad。
静态转换是指将内部网络的私有IP地址转换为公有IP地址,IP地址对是一对一的,是一成不变的,某个私有IP地址只转换为某个公有IP地址。借助于静态转换,可以实现外部网络对内部网络中某些特定设备(如服务器)的访问。
动态转换是指将内部网络的私有IP地址转换为公用IP地址时,IP地址是不确定的,是随机的,所有被授权访问上Internet的私有IP地址可随机转换为任何指定的合法IP地址。也就是说,只要指定哪些内部地址可以进行转换,以及用哪些合法地址作为外部地址时,就可以进行动态转换。动态转换可以使用多个合法外部地址集。当ISP提供的合法IP地址略少于网络内部的计算机数量时。可以采用动态转换的方式。
端口多路复用即端口地址转换(Port Address Translation,PAT),是指在转换地址的同时改变外出数据包的源端口。采用端口多路复用方式,内部网络的所有主机均可共享一个合法外部IP地址实现对Internet的访问,从而可以最大限度地节约IP地址资源。同时,又可隐藏网络内部的所有主机,有效避免来自internet的攻击。因此,目前网络中应用最多的就是端口多路复用方式。
NAT的技术背景
要真正了解NAT就必须先了解现在IP地址的适用情况,私有 IP 地址是指内部网络或主机的IP 地址,公有IP 地址是指在因特网上全球唯一的IP 地址。RFC 1918 为私有网络预留出了三个IP 地址块,如下:
A 类:10.0.0.0~10.255.255.255
B 类:172.16.0.0~172.31.255.255
C 类:192.168.0.0~192.168.255.255
上述三个范围内的地址不会在因特网上被分配,因此可以不必向ISP 或注册中心申请而在公司或企业内部自由使用。
NAPT
NAPT(Network Address Port Translation),即网络端口地址转换,可将多个内部地址映射为一个合法公网地址,但以不同的协议端口号与不同的内部地址相对应,也就是<内部地址+内部端口>与<外部地址+外部端口>之间的转换。NAPT普遍用于接入设备中,它可以将中小型的网络隐藏在一个合法的IP地址后面。NAPT也被称为“多对一”的NAT,或者叫PAT(Port Address Translation,端口地址转换)、地址超载(address overloading)。
NAPT与动态地址NAT不同,它将内部连接映射到外部网络中的一个单独的IP地址上,同时在该地址上加上一个由NAT设备选定的TCP端口号。NAPT算得上是一种较流行的NAT变体,通过转换TCP或UDP协议端口号以及地址来提供并发性。除了一对源和目的IP地址以外,这个表还包括一对源和目的协议端口号,以及NAT盒使用的一个协议端口号。
NAPT的主要优势在于,能够使用一个全球有效IP地址获得通用性。主要缺点在于其通信仅限于TCP或UDP。当所有通信都采用TCP或UDP,NAPT允许一台内部计算机访问多台外部计算机,并允许多台内部主机访问同一台外部计算机,相互之间不会发生冲突。
NAT工作原理
NAT将自动修改IP报文的源IP地址和目的IP地址,IP地址校验则在NAT处理过程中自动完成。有些应用程序将源IP地址嵌入到IP报文的数据部分中,所以还需要同时对报文的数据部分进行修改,以匹配IP头中已经修改过的源IP地址。否则,在报文数据部分嵌入IP地址的应用程序就不能正常工作。
①如图这个 client(终端)的 gateway (网关)设定为 NAT 主机,所以当要连上 Internet 的时候,该封包就会被送到 NAT 主机,这个时候的封包 Header 之 source IP(源IP)为 192.168.1.100 ;
②而透过这个 NAT 主机,它会将 client 的对外联机封包的 source IP ( 192.168.1.100 ) 伪装成 ppp0 ( 假设为拨接情况 )这个接口所具有的公共 IP,因为是公共 IP 了,所以这个封包就可以连上 Internet 了,同时 NAT 主机并且会记忆这个联机的封包是由哪一个 ( 192.168.1.100 ) client 端传送来的;
③由 Internet 传送回来的封包,当然由 NAT主机来接收了,这个时候, NAT 主机会去查询原本记录的路由信息,并将目标 IP 由 ppp0 上面的公共 IP 改回原来的 192.168.1.100 ;
④最后则由 NAT 主机将该封包传送给原先发送封包的 Client。
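上面①~④描述的“记录内部主机、改写源/目的地址”的过程,可以用一小段Python做个简化模拟(IP和端口均为假设值,仅示意NAPT映射表的查询与回写,并非真实的内核实现):
nat_table = {}            # (内部IP, 内部端口) -> NAT主机上的外部端口
next_port = [30000]       # 可分配的起始端口(示意)
PUBLIC_IP = "1.2.3.4"     # 假设的NAT主机公网IP

def translate_out(src_ip, src_port):
    """出方向:把内部源地址伪装成公网IP,并记录映射以便回包还原"""
    key = (src_ip, src_port)
    if key not in nat_table:
        nat_table[key] = next_port[0]
        next_port[0] += 1
    return PUBLIC_IP, nat_table[key]

def translate_in(dst_port):
    """入方向:根据目的端口查表,把目标地址改回内部主机"""
    for (ip, port), ext_port in nat_table.items():
        if ext_port == dst_port:
            return ip, port
    return None

print(translate_out("192.168.1.100", 51000))   # ('1.2.3.4', 30000)
print(translate_in(30000))                     # ('192.168.1.100', 51000)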
配置NAT
在配置NAT(网络地址转换)之前,首先需要了解内部本地地址和内部全局地址的分配情况。根据不同的需求,执行以下不同的配置任务。
内部源地址NAT配置
内部源地址NAPT配置
重叠地址NAT配置
TCP负载均衡
57、How-to install LXC and OpenQuake LXC on RHEL/CentOS 6
As root user:
1) Add the EPEL repo to your RHEL/CentOS 6 server
$ rpm -ivh http://mirror.1000mbps.com/fedora-epel/6/i386/epel-release-6-8.noarch.rpm
2) Install LXC 0.9.0 from epel and some other stuff needed
$ yum install --enablerepo=epel-testing lxc lxc-libs lxc-templates bridge-utils libcgroup
3) Enable the cgroups
$ service cgconfig start
$ service cgred start
$ chkconfig --level 345 cgconfig on
$ chkconfig --level 345 cgred on
4) Setup the network:
the easiest way is to create an internal network, so you do not need to expose the LXC to the bare-metal server network.
$ brctl addbr lxcbr0
解决方法参考链接:http://www.360doc.com/content/12/0507/14/9318309_209243400.shtml
其实,问题很简单,就是要关闭网络管理器:
chkconfig NetworkManager off
service NetworkManager stop
似乎在某文章里提到过这个东西,但我不知道怎么关闭,就忽略了,不知道把这个服务启动会怎么样。
b) Make the bridge persistent on reboot
create /etc/sysconfig/network-scripts/ifcfg-lxcbr0
and add
DEVICE="lxcbr0"
TYPE="Bridge"
BOOTPROTO="static"
IPADDR="10.0.3.1"
NETMASK="255.255.255.0"
$ ifup lxcbr0
5) Configure the firewall
to allow outgoing traffic from the container: edit /etc/sysconfig/iptables
and
a) Comment or remove
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
b) Add at the end of file
*nat
:PREROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
c) Restart the firewall
$ service iptables restart
6) Enable IPv4 forwarding
edit /etc/sysctl.conf
and change net.ipv4.ip_forward = 0
to net.ipv4.ip_forward = 1
,then apply the new parameters with
$ sysctl -p
7) Download OpenQuake LXC
$ cd /tmp && wget http://ftp.openquake.org/oq-master/lxc/Ubuntu_lxc_12.04_64_oq_master_nightly-140310.tar.bz2
8) Extract the OpenQuake LXC
$ tar --numeric-owner -C /var/lib/lxc -xpsjf /tmp/Ubuntu_lxc_12.04_64_oq_master_nightly-140310.tar.bz2
9) Check if the LXC is installed and ready
with the command lxc-ls you should see
$ lxc-ls
openquake-nightly-140310
10) Setup the OpenQuake LXC ip address
open /var/lib/lxc/openquake/rootfs/etc/network/interfaces
and change iface eth0 inet dhcp
to
iface eth0 inet static
address 10.0.3.2
netmask 255.255.255.0
gateway 10.0.3.1
dns-nameservers 8.8.8.8
11) Start the OpenQuake LXC
$ lxc-start -d -n openquake
12) Login into the running OpenQuake LXC
$ lxc-console -n openquake
(to detach press ctrl-a + q)
You can also loginusing SSH from the host server:
$ ssh openquake@10.0.3.2
User: openquake
Password: openquake
Please note:
· This how-to is intended for a fresh, standard installation of RHEL/CentOS 6 (and is tested on 6.4). It may need some adjustments for customized installations.
· On 5. the firewall could be already customized by the sysadmin, please be careful when editing it. For more details please ask your network and/or system administrator.
· On 5. section b. -A POSTROUTING -o eth0 -j MASQUERADE
"eth0" is the name of the host server main interface. It can differ in your configuration (see the used interface with ifconfig).
· On 8. the --numeric-owner is mandatory.
· On 10. the 8.8.8.8 DNS is the one provided by Google. It's better to use your internal DNS, so change that IP address with the one associated to your DNS server. For more details please ask your network and/or system administrator.
· On certain installations the rsyslogd process inside the container can eat lots of CPU cycles. To fix it run, within the container, these commands:
service rsyslog stop
sed -i -e 's/^\$ModLoad imklog/#\$ModLoad imklog/g' /etc/rsyslog.conf
service rsyslog start
58、django
开放源代码的Web应用框架,由Python写成。采用了MVC的软件设计模式,即模型M,视图V和控制器C。它最初是被开发来用于管理劳伦斯出版集团旗下的一些以新闻内容为主的网站的,即是CMS(内容管理系统)软件。并于2005年7月在BSD许可证下发布。这套框架是以比利时的吉普赛爵士吉他手Django Reinhardt来命名的。
Django 框架的核心组件有:
用于创建模型的对象关系映射
为最终用户设计的完美管理界面
一流的 URL 设计
设计者友好的模板语言
缓存系统。
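下面给出一个把“URL设计+视图”组合起来的最小示意(基于Django 2.0及以后版本的常见写法,文件名和划分只是惯例,省略了settings等工程文件):
# views.py
from django.http import HttpResponse

def index(request):
    # 一个最简单的视图:直接返回一段文本响应
    return HttpResponse("Hello, Django")

# urls.py
from django.urls import path
# 实际工程中应写成 from myapp.views import index(此处为示意,视图已在上方定义)

urlpatterns = [
    path("", index),   # 将站点根路径路由到index视图
]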
59、LXC 1.0 no configuration file for '/sbin/init'
Those who have problems with "lxc-start: no configuration file for '/sbin/init' (may crash the host)", just add "-f " to the lxc-start. For example:
lxc-start -n vps101 -l DEBUG -f /var/lib/lxc/vps101/config
/usr/var/lib/lxc/centos/config
By default, containers are located under /var/lib/lxc for the root user, and $HOME/.local/share/lxc otherwise. The location can be specified for all lxc commands using the "-P|--lxcpath" argument.
60、Httpd
httpd是Apache超文本传输协议(HTTP)服务器的主程序。被设计为一个独立运行的后台进程,它会建立一个处理请求的子进程或线程的池。
61、Turbo Boost
Intel® Turbo Boost Technology 2.0 automatically allows processor cores to run faster than the rated operating frequency if they're operating below power, current, and temperature specification limits.
62、P-state
在Intel平台上通常指的是EIST(Enhanced Intel SpeedStep Technology),EIST允许多个核动态地切换电压和频率,动态地调整系统的功耗。OSPM通过WRMSR指令写IA32_PERF_CTL MSR的方式调整CPU电压和工作频率。
68、如何为Python安装json模块
首先从http://pypi.python.org/pypi/python-json下载python-json,然后安装。
解压zip包然后把json.py minjson.py 拷到 /usr/lib/python2.5/下面就行了。
怎样使用请看:http://docs.python.org/library/json.html
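装好之后的基本用法大致如下(小示例,数据内容为随意编造;另外Python 2.6及以后版本已自带标准库json,无需再单独安装python-json):
import json

data = {"name": "RDS", "replicas": 2, "tags": ["mysql", "backup"]}

text = json.dumps(data, ensure_ascii=False)   # 对象 -> JSON字符串
print(text)

restored = json.loads(text)                   # JSON字符串 -> 对象
print(restored["replicas"])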
69、Python MySQLdb
tar xvzf MySQL-python-1.2.1.tar.gz
cd MySQL-python-1.2.1
yum install -y python-devel
yum install -y mysql-devel
python setup.py build
python setup.py install
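安装完成后可以用下面的小例子验证(连接参数均为假设值,请换成自己的数据库账号):
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root",
                       passwd="password", db="test", charset="utf8")
cur = conn.cursor()
cur.execute("SELECT VERSION()")     # 查询MySQL版本,验证连接可用
print(cur.fetchone())
cur.close()
conn.close()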
70、IO调度策略
bfq(Budget Fair Queueing)、
cfq(Completely Fair Queueing)、
noop(No Operation)、
deadline
Noop is the idea of first come, first served: you get your cake and wait in line to get another piece, so if one task keeps demanding the whole cake, the others get hungry but must wait their turn. BFQ can solve that issue, especially with sudden heavy multitasking, by not letting a single task make the system feel sluggish.
71、tmpfs
tmpfs是一种基于内存的文件系统,它和虚拟磁盘ramdisk比较类似,但不完全相同。和ramdisk一样,tmpfs可以使用RAM,但它也可以使用swap分区来存储。而且传统的ramdisk是个块设备,要用mkfs来格式化它,才能真正地使用它;而tmpfs是一个文件系统,并不是块设备,只要挂载它,就可以使用了。tmpfs是最好的基于RAM的文件系统。
DMI,即Desktop Management Interface。
72、iptables
iptables 是建立在 netfilter 架构基础上的一个包过滤管理工具,最主要的作用是用来做防火墙或透明代理。iptables 从 ipchains 发展而来,它的功能更为强大。iptables 提供以下三种功能:包过滤、NAT(网络地址转换)和通用的 pre-route packet mangling。包过滤:用来过滤包,但是不修改包的内容。iptables 在包过滤方面相对于 ipchains 的主要优点是速度更快,使用更方便。NAT:NAT 可以分为源地址 NAT 和目的地址 NAT。
iptables 可以追加、插入或删除包过滤规则。实际上真正执行这些过滤规则的是 netfilter 及其相关模块(如 iptables 模块和 nat 模块)。
73、Linux kernel likely()与unlikely()
看内核时总遇到if(likely( )){}或是if(unlikely( ))这样的语句,最初不解其意,现在有所了解,所以也想介绍一下。
likely() 与 unlikely()是内核(我看的是2.6.22.6版本,2.6的版本应该都有)中定义的两个宏。位于/include/linux/compiler.h中,
具体定义如下:
#define likely(x) __builtin_expect(!!(x),1)
#define unlikely(x) __builtin_expect(!!(x),0)
__builtin_expect是gcc(版本>=2.96,网上写的,我没验证过)中提供的一个预处理命令(这个名词也是网上写的,我想叫函数更好些),有利于代码优化。gcc(version 4.4.0)具体定义如下:
long __builtin_expect (long exp, long c) [Built-in Function]
注解为:
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this ('-fprofile-arcs'), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect. The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c.
它的意思是:我们可以使用这个函数人为告诉编绎器一些分支预测信息“exp==c”是“很可能发生的”。
#define likely(x) __builtin_expect(!!(x),1)也就是说明x==1是“经常发生的”或是“很可能发生的”。
使用likely,表示执行if后面语句的可能性大些,编译器将if{}里的内容编译到前面;使用unlikely,表示执行else后面语句的可能性大些,编译器将else{}里的内容编译到前面。这样有利于CPU预取,提高预取指令的正确率,因而可提高效率。
74、查看内存页大小:
getconf PAGESIZE
tune2fs -l /dev/sda1 | grep 'Block size'
75、安装mysql 5.5环境
76、t-mysql.x86_64 5.5.18
yum install -b current t-mysql
Loaded plugins: branch, security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package t-mysql.x86_64 0:5.5.18.4132-38.el6 will be installed
--> Finished Dependency Resolution
yum install -y -b current t-alimysql-env
77、内存管理中的cold page和hot page,冷页 vs 热页
所谓冷热是针对处理器cache来说的,冷就是页不大可能在cache中,热就是有很大几率在cache中。
78、LBA
LBA(Logical Block Address),中文名称:逻辑区块地址。是描述电脑存储设备上数据所在区块的通用机制,一般用在像硬盘这样的辅助记忆设备。LBA可以意指某个数据区块的地址或是某个地址所指向的数据区块。电脑上所谓一个逻辑区块通常是512或1024字节。ISO-9660格式的标准CD则以2048字节为一个逻辑区块大小。
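LBA与字节偏移的换算很简单,下面用Python按512字节的扇区大小演示一下(块大小沿用上文取值):
block_size = 512                      # 硬盘常见的逻辑块(扇区)大小

def lba_to_offset(lba):
    return lba * block_size           # 某个LBA对应数据在设备上的起始字节偏移

def offset_to_lba(offset):
    return offset // block_size       # 某个字节偏移落在哪个逻辑块内

print(lba_to_offset(2048))            # 1048576,即1MiB处
print(offset_to_lba(1048576))         # 2048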
79、SAN
存储区域网络(SAN)是一种高速网络或子网络,提供在计算机与存储系统之间的数据传输。存储设备是指一张或多张用以存储计算机数据的磁盘设备。一个 SAN 网络由负责网络连接的通信结构、负责组织连接的管理层、存储部件以及计算机系统构成,从而保证数据传输的安全性和力度。当前常见的可使用 SAN 技术,诸如 IBM 的光纤 SCON,它是 FICON 的增强结构,或者说是一种更新的光纤信道技术。LSI的Nytro XD智能高速缓存技术,为存储区域网络 (SAN) 和直接附加存储 (DAS) 环境提供开箱即用的应用加速功能。另外存储区域网络中也运用到高速以太网协议。SCSI 和 iSCSI 是目前使用较为广泛的两种存储区域网络协议。
80、OSI(开放系统互联,Open System Interconnection)
81、DIMM
DIMM(Dual Inline Memory Module,双列直插内存模块)与SIMM(single in-line memory module,单边接触内存模组)相当类似,不同的只是DIMM的金手指两端不像SIMM那样是互通的,它们各自独立传输信号,因此可以满足更多数据信号的传送需要。同样采用DIMM,SDRAM的接口与DDR内存的接口也略有不同,SDRAM DIMM为168Pin DIMM结构,金手指每面为84Pin,金手指上有两个卡口,用来避免插入插槽时,错误将内存反向插入而导致烧毁;DDR2 DIMM则采用240Pin DIMM结构,金手指每面有120Pin。卡口数量的不同,是二者最为明显的区别。DDR3 DIMM同为240Pin DIMM结构,金手指每面有120Pin,与DDR2 DIMM一样金手指上也只有一个卡口,但是卡口的位置与DDR2 DIMM稍微有一些不同,因此DDR3内存是插不进DDR2 DIMM的,同理DDR2内存也是插不进DDR3 DIMM的,因此在一些同时具有DDR2 DIMM和DDR3 DIMM的主板上,不会出现将内存插错插槽的问题。
82、DMI
DMI是英文单词Desktop Management Interface的缩写,也就是桌面管理界面,它含有关于系统硬件的配置信息。计算机每次启动时都对DMI数据进行校验,如果该数据出错或硬件有所变动,就会对机器进行检测,并把测试的数据写入BIOS芯片保存。
83、Translation Lookaside Buffer(TLB)
Translates a virtual address into a physical address
84、IA-32 systems start at 0x08048000, leaving a gap of roughly 128 MiB between the lowest possible address and the start of the text mapping that is used to catch NULL pointers
85、SElinux
查看SELinux状态:
1、/usr/sbin/sestatus -v ##如果SELinux status参数为enabled即为开启状态
SELinux status: enabled
2、getenforce ##也可以用这个命令检查
关闭SELinux:
1、临时关闭(不用重启机器):
setenforce 0 ##设置SELinux 成为permissive模式
##setenforce 1 设置SELinux 成为enforcing模式
2、修改配置文件需要重启机器:
修改/etc/selinux/config 文件
将SELINUX=enforcing改为SELINUX=disabled
重启机器即可
今天的文章网络运维词汇汇总分享到此就结束了,感谢您的阅读,如果确实帮到您,您可以动动手指转发给其他人。