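Linux (12): sed and gawk Basics

Author: onceday  Date: September 19, 2022

For all articles in this series, see the column "Linux Shell Basics" on Once_day's CSDN blog.

It's a long road; has anyone smiled at you along the way…

This article mainly collects and organizes material from the following books:

《Linux命令行与shell脚本编程大全》 (Linux Command Line and Shell Scripting Bible)
《鸟哥的Linux私房菜》 (VBird's Linux Private Kitchen)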


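1. Introduction

sed and gawk are an indispensable part of learning shell scripting, and they are very helpful when processing large amounts of text.

The sed editor is a stream editor. Unlike an ordinary interactive editor, a stream editor is given a set of rules up front and uses them to edit a stream of data.

Once it has its commands, sed does the following:

Reads one line of data at a time from the input
Matches that data against the supplied editor commands
Modifies the data in the stream as the commands specify
Writes the new data to STDOUT

After the stream editor has matched every command against one line of data, it reads the next line and repeats the process. When all lines in the stream have been processed, it stops. Because the input is handled in a single pass, this is very efficient.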
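The read, match, modify, output cycle described above can be imitated with a plain shell loop. This is only a sketch to make the cycle concrete, not how sed is implemented; the sample input and the variable names are made up for illustration.

```shell
#!/bin/sh
# Imitate one sed "cycle" per input line with a shell loop:
# read a line, apply the edit, print the result, move on.
input='hello word
see you'

emulated=$(printf '%s\n' "$input" | while IFS= read -r line; do
    case $line in
        (hello*) line="hi${line#hello}" ;;   # same effect as s/^hello/hi/
    esac
    printf '%s\n' "$line"
done)

# The same edit expressed as a one-command sed script.
with_sed=$(printf '%s\n' "$input" | sed 's/^hello/hi/')

printf '%s\n' "$emulated"
```

Both variables end up holding the same two lines, which is the point: sed is the loop, and the script only supplies the edit.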

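2. The sed editor

The command format is:

sed options script file

The three main options are:

Option	Description
-e script	Add the commands in script to the commands run while processing the input
-f file	Add the commands in file to the commands run while processing the input
-n	Suppress automatic output; use the print command to produce output

By default, sed applies the commands to the STDIN input stream:

onceday@ubuntu:test$ echo "hello" | sed 's/hello/hello world/'
hello world

Several commands can be run from the command line with the -e option, separated by semicolons:

onceday@ubuntu:test$ cat data.txt
hello word
onceday@ubuntu:test$ sed -e 's/hello/see/;s/word/me/' data.txt
see me

The secondary prompt can also be used to put each command on its own line:

onceday@ubuntu:test$ sed -e '
> s/hello/see/
> s/word/me/
> ' data.txt
see me

sed commands can also be read from a file; the .sed suffix is recommended, for example:

onceday@ubuntu:test$ echo "s/hello/see/
> s/word/me/" > da.sed
onceday@ubuntu:test$ cat da.sed
s/hello/see/
s/word/me/
onceday@ubuntu:test$ sed -f da.sed data.txt
see me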
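A script file can also carry comments, which helps once it grows beyond a couple of commands. The sketch below writes a temporary script (the file name comes from mktemp and is not part of the original example) and runs it with -f.

```shell
#!/bin/sh
# Write a two-command sed script to a temporary file and run it with -f.
script=$(mktemp)
cat > "$script" <<'EOF'
# lines starting with "#" are comments in a sed script file
s/hello/see/
s/word/me/
EOF

result=$(printf 'hello word\n' | sed -f "$script")
printf '%s\n' "$result"
rm -f "$script"
```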

2.1 Substitution flags

The s (substitute) command replaces text within a line.

By default it replaces only the first occurrence on each line. To replace occurrences at other positions in a line, use a substitution flag.

s/pattern/replacement/flags

Four substitution flags are available:

A number: which occurrence of the pattern the new text should replace.
g: replace every occurrence of the pattern.
p: print the line if a substitution was made.
w file: write the result of the substitution to a file.

For example, this replaces the second occurrence of the pattern:

onceday@ubuntu:test$ cat data.txt
hello word, hello word, hello word, hello word
see me, see you, see him, see me, see you, see him
onceday@ubuntu:test$ sed 's/hello/test/2' data.txt
hello word, test word, hello word, hello word
see me, see you, see him, see me, see you, see him

Combining p with -n outputs only the lines that the substitute command changed.

onceday@ubuntu:test$ sed -n 's/see/test/p' data.txt
test me, see you, see him, see me, see you, see him

The w flag saves output to the named file; only lines that contain the matching pattern are written.

onceday@ubuntu:test$ sed 's/see/test/w out.txt' data.txt
hello word, hello word, hello word, hello word
test me, see you, see him, see me, see you, see him
onceday@ubuntu:test$ cat out.txt
test me, see you, see him, see me, see you, see him

If the pattern or the replacement text contains a / character, as a path name does, each / must be escaped with a backslash (\).

onceday@ubuntu:test$ sed 's/hello/test\/me/2' data.txt
hello word, test/me word, hello word, hello word
see me, see you, see him, see me, see you, see him

A different delimiter, such as !, can be used instead:

onceday@ubuntu:test$ sed 's!hello!bin/bash!' data.txt
bin/bash word, hello word, hello word, hello word
see me, see you, see him, see me, see you, see him
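The flags and the alternative delimiter can be compared side by side on one short string. This is a small sketch; the sample strings and variable names are invented for the comparison.

```shell
#!/bin/sh
line='a-a-a-a'

first=$(printf '%s\n' "$line" | sed 's/a/X/')    # no flag: first match only
third=$(printf '%s\n' "$line" | sed 's/a/X/3')   # numeric flag: third match only
all=$(printf '%s\n' "$line" | sed 's/a/X/g')     # g flag: every match

# A different delimiter avoids escaping the slashes in a path name.
path=$(printf '/usr/bin/sh\n' | sed 's!/usr/bin!/bin!')

printf '%s\n' "$first" "$third" "$all" "$path"
```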

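2.2 Addressing lines

By default, the commands used in sed apply to every line of the text data. To apply a command only to a specific line or group of lines, use line addressing. There are two forms:

A numeric range of lines
A text pattern that filters lines

Both use the same format, [address]command, or several commands grouped together:

address {
    command1
    command2
    command3
}

With numeric addressing, lines are referenced by their position in the text stream: sed numbers the first line 1 and continues sequentially for the lines that follow.

A range of lines is written with a comma, such as 2,5, and $ can stand for the last line.

onceday@ubuntu:test$ sed '2s!hello!bin/bash!' data.txt
hello word, hello word
bin/bash word, hello word
hello word, hello word
hello word, hello word
onceday@ubuntu:test$ sed '2,4s!hello!bin/bash!' data.txt
hello word, hello word
bin/bash word, hello word
bin/bash word, hello word
bin/bash word, hello word
onceday@ubuntu:test$ sed '3,$s!hello!bin/bash!' data.txt
hello word, hello word
hello word, hello word
bin/bash word, hello word
bin/bash word, hello word

The second method filters with a text pattern, in the form /pattern/command:

pattern is the filter. It can be fixed text: hello, for instance, selects exactly the lines that contain the string hello.

onceday@ubuntu:test$ sed '/hello/s!hello!bin/bash!' data.txt
bin/bash word, hello word
bin/bash word, hello word
bin/bash word, hello word
bin/bash word, hello word

The pattern also supports regular expressions, which makes this feature very powerful.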
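A regular expression address is handy when the lines of interest are identified by their shape rather than their position. The sketch below uses a made-up configuration snippet to contrast a numeric address with a regex address.

```shell
#!/bin/sh
config='# listen port
port=80
# document root
root=/var/www'

# Numeric address: edit only line 2.
by_number=$(printf '%s\n' "$config" | sed '2s/80/8080/')

# Regex address: edit only the lines that start with "#".
by_pattern=$(printf '%s\n' "$config" | sed '/^#/s/^# */## /')

printf '%s\n' "$by_number" "---" "$by_pattern"
```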

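2.3 Grouping commands

To run several commands on a single line, group them inside braces; sed applies every command in the group to the addressed line.

onceday@ubuntu:test$ sed '2{
> s/hello/hi/
> s/word/you/
> }' data.txt
hello word, hello word
hi you, hello word
hello word, hello word
hello word, hello word

A range can be given as well: 2,10 covers the nine lines from 2 to 10, and 5,$ covers line 5 through the last line.

Every command in the group is then executed on each addressed line.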
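As a quick illustration of a group applied to a range, the following sketch runs two substitutions on lines 2 and 3 only. The four-line sample mirrors the data.txt used above.

```shell
#!/bin/sh
text='hello word
hello word
hello word
hello word'

# Both commands inside the braces run on every line in the range 2,3.
grouped=$(printf '%s\n' "$text" | sed '2,3{
s/hello/hi/
s/word/you/
}')

printf '%s\n' "$grouped"
```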

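2.4 Deleting lines

The d (delete) command deletes lines from the text stream; with no address it deletes every line.

It is therefore normally combined with an address, so that only the specified lines are deleted.

onceday@ubuntu:test$ sed 'd' data.txt
onceday@ubuntu:test$ sed '2d' data.txt
hello word, hello word
hello word, hello word
hello word, hello word
onceday@ubuntu:test$ sed '2,$d' data.txt
hello word, hello word

A text pattern can also be used as the address:

onceday@ubuntu:test$ sed '/hello/d' data.txt
see you, see me
see you, see me
onceday@ubuntu:test$ cat data.txt
hello word, hello word
hello word, hello word
see you, see me
see you, see me

In all of these commands, lines are removed only from the output; the original file is not modified.

Text patterns can also form a line range:

onceday@ubuntu:test$ sed '/hello/,/see/d' data.txt
see you, see me
onceday@ubuntu:test$ cat data.txt
hello word, hello word
hello word, hello word
i will be deleted right ?
see you, see me
see you, see me

As shown, the first pattern turns line deletion on and the second turns it off. The range runs from the first line that matches the start pattern through the next line that matches the end pattern.

This matching does not happen only once: whenever a later line matches the start pattern, deletion is switched on again.

If the stop pattern is never matched, deletion continues until no text is left.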
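Both behaviours, the range reopening and the range running to the end of input, can be seen with a small sample. The BEGIN/END markers and the NEVER pattern below are made up for the demonstration.

```shell
#!/bin/sh
log='keep 1
BEGIN
drop a
END
keep 2
BEGIN
drop b
END
keep 3'

# The /BEGIN/,/END/ range opens and closes twice, so both blocks go.
both=$(printf '%s\n' "$log" | sed '/BEGIN/,/END/d')

# No line matches the end pattern, so deletion runs to the end of input.
open_ended=$(printf '%s\n' "$log" | sed '/BEGIN/,/NEVER/d')

printf '%s\n' "$both" "---" "$open_ended"
```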

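2.5 Inserting and appending text

The i (insert) command adds a new line before the specified line.

The a (append) command adds a new line after the specified line.

These two commands can only add a new line before or after the current line; they cannot modify the data in the current line.

onceday@ubuntu:test$ cat data.txt
hello word, hello word
i will be deleted
see you, see me
onceday@ubuntu:test$ sed '2i\insert' data.txt
hello word, hello word
insert
i will be deleted
see you, see me
onceday@ubuntu:test$ sed '2a\append' data.txt
hello word, hello word
i will be deleted
append
see you, see me
onceday@ubuntu:test$ sed '/will/a\append' data.txt
hello word, hello word
i will be deleted
append
see you, see me

A text pattern can also be used to locate the line. Under POSIX, i and a take a single address rather than a range; GNU sed accepts a range as an extension and applies the command to every line in it.

To add text at the end of the stream, use the $ address:

onceday@ubuntu:test$ sed '$a\this is 1 \
> this is 2' data.txt
hello word, hello word
i will be deleted
see you, see me
this is 1
this is 2

As shown, to add more than one line, every line of new text except the last must end with a backslash (\).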
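The sketch below combines i and a in one invocation, using the classic form in which the text starts on the line after the backslash. The header and footer strings are placeholders.

```shell
#!/bin/sh
body='line 1
line 2'

# Insert a header before line 1 and append a two-line footer after the
# last line; the footer's first line ends with a backslash to continue.
framed=$(printf '%s\n' "$body" | sed -e '1i\
--- start ---' -e '$a\
--- end (1) ---\
--- end (2) ---')

printf '%s\n' "$framed"
```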

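2.6 Changing lines

The c (change) command replaces the entire text of a line in the data stream; the line to change is given by an address.

onceday@ubuntu:test$ sed '2c\this is new line' data.txt
hello word, hello word
this is new line
see you, see me

A text pattern or a range can likewise be used as the address. With a range, sed does not change each line individually; it replaces all the lines in the range with a single copy of the new text.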
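The difference between a single address and a range is easy to miss, so here is a short sketch of both, assuming GNU sed. The numbered sample lines are invented.

```shell
#!/bin/sh
text='1
2
3
4'

# Single address: only line 2 is replaced.
single=$(printf '%s\n' "$text" | sed '2c\
changed')

# Range: lines 2 and 3 together become one line of new text.
ranged=$(printf '%s\n' "$text" | sed '2,3c\
changed')

printf '%s\n' "$single" "---" "$ranged"
```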

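2.7 Transforming characters

The y (transform) command operates on single characters. Its format is:

[address]y/inchars/outchars/

Each character in inchars is replaced one-for-one by the character at the same position in outchars, so the two lists must be the same length.

onceday@ubuntu:test$ sed '1y/lo/LO/' data.txt
heLLO wOrd, heLLO wOrd
i will be deleted
see you, see me

An address can limit which lines are processed; within an addressed line, every matching character is transformed.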
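As a further sketch of the one-to-one mapping, the first command below uppercases vowels on every line, and the second restricts a mapping to line 1. The sample text is made up.

```shell
#!/bin/sh
# Map each lowercase vowel to its uppercase form, on every line.
upper_vowels=$(printf 'see you, see me\n' | sed 'y/aeiou/AEIOU/')

# With an address, only line 1 is transformed.
first_only=$(printf 'aaa\nbbb\naaa\n' | sed '1y/ab/xy/')

printf '%s\n' "$upper_vowels" "$first_only"
```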

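2.8 Printing lines

The p command prints a text line; this output is in addition to the normal output.

With a pattern address, for example, -n can be used to suppress the normal output:

onceday@ubuntu:test$ sed '/hello/p' data.txt
hello word, hello word
hello word, hello word
i will be deleted
see you, see me
onceday@ubuntu:test$ sed -n '/hello/p' data.txt
hello word, hello word

This is a quick way to print particular lines or the lines that match some text.

onceday@ubuntu:test$ sed -n '1{
> p
> s/hello/HELLO/p
> }' data.txt
hello word, hello word
HELLO word, hello word

Used this way, it shows the line both before and after the modification.

To output line numbers, use the = command.

onceday@ubuntu:test$ sed "=" data.txt
1
hello word, hello word
2
i will be deleted
3
see you, see me

It can be combined with command groups, address ranges, and text patterns to print the line numbers of selected content:

onceday@ubuntu:test$ sed -n '/will/{
> =
> p
> }' data.txt
2
i will be deleted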
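Two common uses of = follow from this: counting lines and listing where a pattern occurs. A minimal sketch, reusing the three-line sample from above:

```shell
#!/bin/sh
text='hello word, hello word
i will be deleted
see you, see me'

# "$=" prints the number of the last line, which is the line count.
line_count=$(printf '%s\n' "$text" | sed -n '$=')

# A regex address with "=" lists the numbers of the matching lines.
matches=$(printf '%s\n' "$text" | sed -n '/o/=')

printf '%s\n' "$line_count" "$matches"
```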

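The l (list) command prints both the text and the non-printable characters in the data:

onceday@ubuntu:test$ sed -n "l" data.txt
hello word, hello word$
i will be deleted$
see you, see me$

The end of each line is marked with a $ character. Non-printable characters are shown as escape sequences such as \t, or as octal codes.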
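This makes l useful for spotting tabs and trailing spaces. A small sketch, assuming GNU sed's output format; the sample line contains a tab and a trailing space on purpose.

```shell
#!/bin/sh
# The input line is: a, TAB, b, SPACE. With "l" the tab is shown as \t
# and the trailing space becomes visible in front of the "$" marker.
listed=$(printf 'a\tb \n' | sed -n 'l')

printf '%s\n' "$listed"
```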

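2.9 Working with files

The w (write) command writes lines to a file. Its format is:

[address]w filename

For example:

onceday@ubuntu:test$ sed "2,\$w data2.txt" data.txt
hello word, hello word
i will be deleted
see you, see me
onceday@ubuntu:test$ cat data2.txt
i will be deleted
see you, see me

A text pattern can also be used to select the lines to write, with -n to suppress the normal output.

The r (read) command inserts the contents of a separate file into the data stream, in the form [address]r filename:

onceday@ubuntu:test$ sed '3r data2.txt' data.txt
hello word, hello word
i will be deleted
see you, see me
i will be deleted
see you, see me

The address here can also be a text pattern.

Within one command group, you can first add the new content and then delete the original line, which amounts to replacing the data.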
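The read-then-delete idea can be sketched as replacing a placeholder line with the contents of another file. The file names, the LIST placeholder, and the temporary directory are all invented for this example; the two commands are passed as separate -e fragments because the r command takes the rest of its line as the file name.

```shell
#!/bin/sh
workdir=$(mktemp -d)
printf 'Dear team,\nLIST\nRegards\n' > "$workdir/notice.txt"
printf 'alice\nbob\n' > "$workdir/names.txt"

# On the placeholder line: queue names.txt for output, then delete the
# placeholder itself.
merged=$(sed -e "/LIST/r $workdir/names.txt" -e '/LIST/d' "$workdir/notice.txt")

printf '%s\n' "$merged"
rm -rf "$workdir"
```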

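3. The gawk editor

gawk offers more functionality than sed: it provides a programming-like environment for modifying and reorganizing the data in a file.

gawk is the GNU version of the original Unix awk program and supports structured programming. With it you can:

Define variables to store data
Use arithmetic and string operators to work with data
Use structured programming concepts to add logic to data processing
Extract data elements from a data file, reorder or reformat them, and generate formatted reports

gawk is commonly used to extract data elements from large text files and format them into readable reports.

3.1 Basic format of a gawk program

gawk options program file

Common options:

Option	Description
-F fs	Specify the field separator used to split each line into data fields
-f file	Read the program from the specified file
-v var=value	Define a variable in the gawk program and its default value
-mf N	Specify the maximum number of fields in the data file to process
-mr N	Specify the maximum record size in the data file
-W keyword	Specify the compatibility mode or warning level for gawk

The -mf and -mr options are historical: they come from an earlier Bell Labs awk, and gawk, which has no such predefined limits, ignores them.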

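3.2 Reading the program script from the command line

A gawk script is delimited by a pair of braces; the script commands must be placed between them ({ }):

gawk '{print "hello word"}'

After you press Enter, gawk waits for input on STDIN, so you need to keep typing data. Each time you enter a line of text and press Enter, gawk runs the program script on that line.

As with sed, gawk runs the program on every line of text in the data stream.

To end the program, signal that the data has ended with an EOF (End-of-File) character; in bash, the Ctrl+D key combination generates one.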

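3.3 Using data field variables

By default, gawk automatically assigns a variable to each data element in a line:

$0 represents the entire line of text
$1 represents the first data field in the line
$2 represents the second data field in the line
$n represents the nth data field in the line

Data fields are delimited by a field separator: when gawk reads a line of text, it splits the line into fields using the predefined field separator, which by default is any whitespace character (such as a space or a tab).

onceday@ubuntu:test$ cat data.txt
hello word, hello word
i will be deleted
see you, see me
onceday@ubuntu:test$ gawk '{print $1}' data.txt
hello
i
see

The separator can also be set with an option:

onceday@ubuntu:test$ gawk -F, '{print $1}' data.txt
hello word
i will be deleted
see you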
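A typical use of -F is pulling columns out of colon-separated records. The sketch below uses two made-up lines in the style of /etc/passwd; the awk_bin variable is an assumption added so the example still runs where only a plain awk is installed.

```shell
#!/bin/sh
# Prefer gawk; fall back to any awk (an assumption for portability).
awk_bin=$(command -v gawk || command -v awk)

accounts='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# With ":" as the separator, $1 is the login name and $7 the shell.
shells=$(printf '%s\n' "$accounts" | "$awk_bin" -F: '{print $1, $7}')

printf '%s\n' "$shells"
```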

3.4 Using multiple commands

Commands can be separated with a semicolon:

onceday@ubuntu:test$ gawk '{print $1;print $3}' data.txt
hello
hello
i
be
see
see

Or entered one per line at the secondary prompt:

onceday@ubuntu:test$ gawk '{
> print $1 $2
> print $3
> }' data.txt
helloword,
hello
iwill
be
seeyou,
see

Or stored in a script file:

onceday@ubuntu:test$ gawk -f da.gawk data.txt
hello
 this is a text
i
 this is a text
see
 this is a text
onceday@ubuntu:test$ cat da.gawk
{
print $1
str = " this is a text"
print str
}

As shown, the variable str is used in a gawk script without a $ prefix.
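Variables can also be passed in from the command line with the -v option listed earlier. The sketch below shows the contrast between a plain variable and a field reference; tag and n are illustrative names, and awk_bin is an assumption that falls back to plain awk when gawk is missing.

```shell
#!/bin/sh
# Prefer gawk; fall back to any awk (an assumption for portability).
awk_bin=$(command -v gawk || command -v awk)

tagged=$(printf 'one\ntwo\n' | "$awk_bin" -v tag='[note]' '{
    n = n + 1            # plain variable, no $ prefix; starts out as 0
    print tag, n, $1     # $1 is a field reference, not a variable
}')

printf '%s\n' "$tagged"
```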

3.5 Running scripts before or after processing data

The BEGIN keyword specifies a script that runs before any data is processed:

onceday@ubuntu:test$ gawk 'BEGIN {print "Hello word!"}'
Hello word!

Adding a second code section supplies the script that processes the data:

onceday@ubuntu:test$ gawk 'BEGIN {print "Hello word!"}
> {print $0}' data.txt
Hello word!
hello word, hello word
i will be deleted
see you, see me

Similarly, the END keyword specifies a script that runs after all the data has been processed:

onceday@ubuntu:test$ gawk 'END {print "Hello word!"} {print $0}' data.txt
hello word, hello word
i will be deleted
see you, see me
Hello word!
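BEGIN, the main section, and END fit naturally into a small report: a header, one row per record, and a total. The sketch below uses invented sales data; awk_bin is an assumption that falls back to plain awk when gawk is missing.

```shell
#!/bin/sh
# Prefer gawk; fall back to any awk (an assumption for portability).
awk_bin=$(command -v gawk || command -v awk)

sales='apple 3
pear 5
plum 2'

report=$(printf '%s\n' "$sales" | "$awk_bin" '
BEGIN { print "item qty" }            # runs once, before any input
      { total += $2; print $1, $2 }   # runs for every input line
END   { print "total", total }        # runs once, after the last line
')

printf '%s\n' "$report"
```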

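4. Regular expressions

A regular expression is a pattern template that you define; it is used to match patterns in data.

See also: Regular Expression Basics (正则表达式基本知识), Once_day's blog on CSDN.

Regular expressions are implemented by a regular expression engine: the underlying software that interprets regular expression patterns and uses them to match text.

Two regular expression engines are popular on Linux:

The POSIX Basic Regular Expression (BRE) engine
The POSIX Extended Regular Expression (ERE) engine

Most Linux tools conform to at least the POSIX BRE engine specification, but the sed editor conforms to only a subset of it.

The POSIX ERE engine is often found in programming languages that rely on regular expressions for text filtering. It provides advanced pattern symbols and special symbols for common patterns, such as matching digits, words, and alphanumeric characters. gawk, for example, uses the ERE engine to process its regular expression patterns.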

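4.1 BRE patterns

A regular expression can match any character, including spaces, and is case-sensitive:

onceday@ubuntu:~$ echo "this is a test." | sed -n '/ /p'
this is a test.

Note that the following special characters must be escaped with a backslash (\) when they are to be matched as literal text:

. * [] ^ $ {} \ + ? | ()

The backslash is itself the escape character, so matching a literal backslash takes two in a row:

onceday@ubuntu:~$ echo "this is a $." | sed -n '/\$/p'
this is a $.
onceday@ubuntu:~$ echo "this is a \." | sed -n '/\\/p'
this is a \.

Anchor characters lock a pattern to the start or end of a line:

^ anchors the pattern to the start of the line
$ anchors the pattern to the end of the line

(For details on regular expressions, see the document linked at the start of this section; they are not repeated here or below.)

One use of anchors is to find empty lines, for example:

sed '/^$/d' xxx.txt

This expression finds the empty lines in the text and deletes them.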
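Note that /^$/ only matches lines with nothing at all on them. A line holding just spaces or tabs survives, which the following sketch shows next to a variant that also removes whitespace-only lines; the [[:space:]] class used there is an addition beyond the example above.

```shell
#!/bin/sh
# Four lines: "first", an empty line, a line with a space and a tab, "second".
no_empty=$(printf 'first\n\n \t\nsecond\n' | sed '/^$/d')

# Also treat lines made up only of whitespace as blank.
no_blank=$(printf 'first\n\n \t\nsecond\n' | sed '/^[[:space:]]*$/d')

# Count the surviving lines in the first result.
remaining=$(printf '%s\n' "$no_empty" | sed -n '$=')

printf '%s\n' "$no_blank"
```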

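The dot matches any single character except a newline.

A character class can be defined to match any one of a set of characters at a given position.

onceday@ubuntu:~$ echo "this is a Test." | sed -n '/[Tt][Ee][Ss][Tt]/p'
this is a Test.

A character class supplies a set of possible patterns, which suits filtering by special rules.

A caret at the start of a character class negates the selection:

onceday@ubuntu:~$ echo "this is a Test." | sed -n '/[Tt][^Kk][Ss][Tt]/p'
this is a Test.

This matches any character other than K or k, and a character must be present at that position.

A range can also be given directly, to avoid typing many characters:

onceday@ubuntu:~$ echo "tshj" | sed -n '/[a-k]/p'
tshj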
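Ranges combine well with the anchors from earlier. As a sketch, the hypothetical helper is_zip below accepts only strings that consist of exactly five digits; the function name and the sample values are invented.

```shell
#!/bin/sh
# Succeed only when the argument is exactly five digits. The anchors
# ^ and $ keep longer strings that contain five digits from matching.
is_zip() {
    [ -n "$(printf '%s\n' "$1" | sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p')" ]
}

is_zip 60633  && echo "60633 accepted"
is_zip 6063a  || echo "6063a rejected"
is_zip 606330 || echo "606330 rejected"
```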

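4.1.1 Special character classes

Class	Description
[[:alpha:]]	Matches any alphabetic character, uppercase or lowercase
[[:alnum:]]	Matches any alphanumeric character: 0-9, A-Z, or a-z
[[:blank:]]	Matches a space or a tab
[[:digit:]]	Matches a digit from 0 to 9
[[:lower:]]	Matches any lowercase alphabetic character a-z
[[:print:]]	Matches any printable character
[[:punct:]]	Matches a punctuation character
[[:space:]]	Matches any whitespace character: space, tab, NL, FF, VT, or CR
[[:upper:]]	Matches any uppercase alphabetic character A-Z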
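These classes can be used anywhere an ordinary character class can, including negated. A brief sketch with an invented sample string:

```shell
#!/bin/sh
sample='Order 66, Room 101'

# Negated class: delete everything that is not a digit.
digits_only=$(printf '%s\n' "$sample" | sed 's/[^[:digit:]]//g')

# Replace every uppercase letter with an underscore.
no_upper=$(printf '%s\n' "$sample" | sed 's/[[:upper:]]/_/g')

printf '%s\n' "$digits_only" "$no_upper"
```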

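4.1.2 The asterisk

An asterisk * means that the preceding character may appear zero or more times in the text being matched. It is often used to match words with common misspellings.

The asterisk can also be applied to a character class:

onceday@ubuntu:~$ echo "tshj" | sed -n '/[hj]*/p'
tshj

Because zero occurrences are allowed, a pattern consisting only of a starred item, like this one, matches every line.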
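The effect of the asterisk is clearer when it sits between other characters. In the sketch below (sample words invented), ie*k requires an i, then any number of e characters, then a k.

```shell
#!/bin/sh
words='ik
iek
ieek
ick'

# Zero, one, or two "e" characters all satisfy e*; "ick" fails because
# the character after "i" is neither "e" nor "k".
star=$(printf '%s\n' "$words" | sed -n '/ie*k/p')

# A starred item on its own also matches zero occurrences, so it
# selects every line.
anything=$(printf '%s\n' "$words" | sed -n '/[hj]*/p')

printf '%s\n' "$star"
```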

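4.2 Extended regular expressions

gawk supports ERE patterns. sed uses BRE by default; GNU sed accepts ERE patterns only when invoked with -E (or -r).

A question mark ? means the preceding character may appear zero times or once.

A plus sign + means the preceding character may appear one or more times.

Braces specify how many times a regular expression may repeat; this is usually called an interval.

[xx]{m}: the regular expression appears exactly m times
[xx]{m,n}: the regular expression appears at least m times and at most n times

Older versions of gawk did not recognize intervals unless the --re-interval command-line option was given; gawk 4.0 and later recognize them by default.

The pipe symbol | selects among several patterns, in this form:

expr1|expr2|expr3|......

The patterns are checked in order; if any one of them matches the data stream text, the text passes. If none matches, the match fails.

Expressions can be grouped with parentheses. A group is treated as a single standard character, so special characters can be applied to it just as to an ordinary character.

onceday@ubuntu:~$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'
Sat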
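Grouping, ?, |, and + can be combined in one pattern. The sketch below filters invented sample lists; awk_bin is an assumption that falls back to plain awk when gawk is missing, and a pattern given without an action prints the matching lines.

```shell
#!/bin/sh
# Prefer gawk; fall back to any awk (an assumption for portability).
awk_bin=$(command -v gawk || command -v awk)

days='Sat
Saturday
Sun
Sunday
Monday'

# Two alternatives, each with an optional grouped suffix, anchored so
# that the whole line must match.
weekend=$(printf '%s\n' "$days" | "$awk_bin" '/^(Sat(urday)?|Sun(day)?)$/')

# "+" needs at least one "e", so "bt" is not selected.
plus=$(printf 'bt\nbet\nbeet\n' | "$awk_bin" '/be+t/')

printf '%s\n' "$weekend" "$plus"
```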

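4.3 Regular expression examples

4.3.1 Counting files in directories

onceday@ubuntu:shell$ cat countfiles.sh
#!/bin/bash
# count number of files in your PATH
# turn the ":" separators in PATH into spaces
mypath=$(echo $PATH | sed 's/:/ /g')
count=0
for directory in $mypath
do
    check=$(ls "$directory")
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    # reset the counter for the next directory
    count=0
done

The script turns the : separators in PATH into spaces and then relies on bash word splitting to iterate over the directories.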
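The splitting step can be tried in isolation on a fixed string, without touching the real PATH. The sample value below is made up, and the sketch assumes directory names without spaces, as the script above does.

```shell
#!/bin/sh
sample_path='/usr/local/bin:/usr/bin:/bin'

# Replace every ":" with a space so the shell can split the result.
dirs=$(printf '%s\n' "$sample_path" | sed 's/:/ /g')

# $dirs is left unquoted on purpose: word splitting yields one
# directory per loop iteration.
n=0
for d in $dirs; do
    n=$((n + 1))
done

echo "$n directories: $dirs"
```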

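4.3.2 Validating email addresses

The basic format of an email address is:

username@hostname

The username may contain alphanumeric characters and the following special characters:

Dot
Dash
Plus sign
Underscore

The hostname part of an email address consists of one or more domain names and a server name. The server and domain names must also follow strict naming rules, allowing only alphanumeric characters and the following special characters:

Dot
Underscore

First, build the username: ^([a-zA-Z0-9_\-\.\+]+)@.

Then build the server name and subdomains: ([a-zA-Z0-9_\-\.]+).

The top-level domain has special rules: it may contain only alphabetic characters, must be at least two characters long, and may be no more than five characters long. Its regular expression is \.([a-zA-Z]{2,5})$.

Put together:

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

This expression filters out email addresses that do not follow the rules above.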
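The expression can be wrapped in a small check. The sketch below is one way to do it, under two assumptions that differ from the text: it uses sed -E (accepted by GNU sed) rather than gawk, so the {2,5} interval works without extra options, and it writes the bracket expressions with the dash placed last instead of backslash escapes, because POSIX bracket expressions treat a backslash as a literal character. The helper name is_email and the sample addresses are made up.

```shell
#!/bin/sh
# Succeed when the argument has the shape username@hostname.tld
# described above.
is_email() {
    [ -n "$(printf '%s\n' "$1" |
        sed -E -n '/^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}$/p')" ]
}

for addr in rich@here.now rich.blum@here.now rich@here.n rich/blum@here.now
do
    if is_email "$addr"; then
        echo "$addr: ok"
    else
        echo "$addr: rejected"
    fi
done
```

This only checks the shape of the address; it says nothing about whether the mailbox exists.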
