awk, sed, cut and tr: processing a text file
When you want to process a text file from the command line, awk, sed, cut and tr are the most used programs. Here you will learn the basic use of each one.
awk
This powerful tool can do simple and complex text processing tasks. It searches each line of a file for a pattern and performs some actions on the lines that match. This set of patterns and actions is called a ‘program’. When you run awk, you specify an awk ‘program’.
awk 'program' <input files>
- When the program is long, it is more convenient to put it in a file and then run it like this:
awk -f programfile <input files>
A program consists of one or more ‘rules’. On the command line, enclose the program in single quotes so the shell does not interpret it. A rule consists of a pattern and an action:
pattern { action }
You can omit the pattern (the action then runs on every line) or the action (matching lines are printed in full). You can add several rules separated by newlines or ;.
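To make the rule syntax concrete, here is a minimal sketch with two rules in one program. The sample input (three "name score" lines) and the variable names are made up for illustration.

```shell
# Hypothetical sample input: three lines of "name score".
input='alice 10
bob 25
carol 40'

# Rule 1 has a pattern and an action.
# Rule 2 has only a pattern, so awk uses the default action (print the line).
rules_out=$(printf '%s\n' "$input" | awk '$2 > 20 {print "high:", $1}; /alice/')
printf '%s\n' "$rules_out"
```

Every input line is tested against each rule in order, so one line can trigger several rules or none at all.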
A basic use of awk is printing the whole contents of a file.
awk '{print}' text.txt
# or
awk '{print $0}' text.txt
Pipes
- You can pipe the output of another command to awk.
df -h | awk '{print $0}'
awk '{print}' < text.txt
- To export the output to a file:
df -h | awk '{print}' > output.txt
Columns
- Print the first column of a whitespace-separated file (or command output). By default awk splits fields on spaces and tabs.
df -h | awk '{print $1}'
- Print last column
awk '{print $NF}' text.txt
NF stands for “Number of Fields”, so $NF is the last field of the line.
- Print several columns.
df -h | awk '{print $1,$3}'
- Using a different separator.
awk -F ":" '{print $1}' /etc/passwd
- You can also set the separator with BEGIN {FS=":"}:
awk 'BEGIN {FS=":"}; {print $1}' /etc/passwd
- You can add several code blocks inside the single quotes (awk '{code block}; {code block}').
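As a quick check that the two ways of setting the separator behave the same, the sketch below runs both on a small passwd-style sample. The sample data and variable names are assumptions for illustration, so it does not depend on the real /etc/passwd.

```shell
# Hypothetical passwd-style sample data (user:shell).
data='root:/bin/bash
daemon:/usr/sbin/nologin'

# Set the separator with the -F option.
with_flag=$(printf '%s\n' "$data" | awk -F ':' '{print $1, $NF}')

# Set the separator in a BEGIN block, which runs before any input is read.
with_begin=$(printf '%s\n' "$data" | awk 'BEGIN {FS=":"}; {print $1, $NF}')

printf '%s\n' "$with_flag"
```

Both variables should hold the same text: the first and last field of each line, separated by a space.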
Rows
- Print a specific line
awk 'NR==3 {print}' text.txt
NR stands for “Number of Records”: the number of the line being processed.
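NR can also be used in comparisons to select a range of lines. A minimal sketch, with made-up input:

```shell
# Hypothetical four-line input.
lines='first
second
third
fourth'

# Print records 2 to 3, each prefixed with its record number.
nr_out=$(printf '%s\n' "$lines" | awk 'NR>=2 && NR<=3 {print NR ": " $0}')
printf '%s\n' "$nr_out"
```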
Formatting the output
- Sum column values row by row (same for other mathematical operations).
$ awk -F ':' '{print $3+$4}' /etc/passwd
0
2
4
- Separate column values by some character (like a tab).
$ awk -F ',' '{print NR"\t"$1"\t"$2}' < test.csv
1    fechalectura    lectura
2    2021-06-04      177
3    2021-08-10      184
4    2021-10-07      190
- Print the total number of lines (records).
# END: this code block runs when all input has been processed
awk -F ':' 'END {print NR}' /etc/passwd
- Using printf.
# Print second column as a floating number with two decimals
awk -F ',' '{printf "%.2f\n",$2}' < testfile
- Define the OFS (Output Field Separator).
awk -F ',' '{OFS="\t"; $1=$1; print}' < test.csv
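The $1=$1 in the OFS example is easy to overlook: awk only rebuilds the line with the new separator when a field is assigned. The sketch below shows the difference, using made-up CSV data and a ";" separator chosen so the change is visible.

```shell
# Hypothetical two-column CSV sample.
csv='2021-06-04,177
2021-08-10,184'

# OFS is set, but no field is modified: the record is printed unchanged.
unchanged=$(printf '%s\n' "$csv" | awk -F ',' 'BEGIN {OFS=";"} {print}')

# Assigning a field to itself makes awk rebuild $0 using OFS.
rebuilt=$(printf '%s\n' "$csv" | awk -F ',' 'BEGIN {OFS=";"} {$1=$1; print}')

printf '%s\n' "$rebuilt"
```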
Conditions
- Search for a RegExp.
# look for 'sshd'
awk -F ':' '/sshd/ {print $1,$7}' /etc/passwd
- Standard “if” (always inside ‘action’ section).
# Don't print first value of column 2
awk -F ',' '{if (NR != 1) print $2}' test.csv
# Split the file based on the third column value
awk -F ',' '{if ($3 >= 1000) print $0}' < test.csv > testmore1000.csv
awk -F ',' '{if ($3 < 1000) print $0}' < test.csv > testless1000.csv
- Condition outside curly braces.
# Print odd lines
awk 'NR % 2 != 0 {print}' file
# Print lines where $1 contains 'a'
awk '$1 ~ /a/ {print}' urls.txt
# Print lines that contain abc AND xyz
awk '/abc/ && /xyz/' file
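The two styles of condition (outside the curly braces, or as an "if" inside the action) can express the same test. A small sketch with a made-up CSV, skipping the header in both cases:

```shell
# Hypothetical CSV with a header row.
csv='name,qty,price
pen,3,500
laptop,1,1500
desk,2,1000'

# Condition as a pattern: skip the header and keep rows with price >= 1000.
expensive=$(printf '%s\n' "$csv" | awk -F ',' 'NR > 1 && $3 >= 1000 {print $1}')

# Condition as an "if" inside the action: keep rows with price < 1000.
cheap=$(printf '%s\n' "$csv" | awk -F ',' '{if (NR > 1 && $3 < 1000) print $1}')

printf 'expensive:\n%s\ncheap:\n%s\n' "$expensive" "$cheap"
```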
Built-in functions
- tolower(<string>), toupper(<string>): convert a string to lowercase or to uppercase.
$ awk '{print tolower($0)}' <<< 'Hello World'
hello world
- length(<string>): show the length of a string.
$ awk '{print length($0)}' <<< 'Hello World'
11
- mktime(<date spec>): transform a date spec into a timestamp (GNU awk; the result depends on your time zone).
$ awk '{print mktime($0)}' <<< '2022 01 01 02 00 00'
1640998800
- strftime(<format>, <timestamp>): format a timestamp (GNU awk).
$ awk '{print strftime("%d-%m-%Y",$0)}' <<< '1640998800'
01-01-2022
- substr(<string>, <start>, <length>): return a substring.
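substr is the only function in this list without an example, so here is a short sketch. Positions start at 1, not 0. The variable names are only for illustration.

```shell
# Start at character 7 and take 5 characters.
word=$(echo 'Hello World' | awk '{print substr($0, 7, 5)}')

# Take the year (first 4 characters) from a date in the first CSV column.
year=$(echo '2021-06-04,177' | awk -F ',' '{print substr($1, 1, 4)}')

echo "$word $year"
```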
Some examples
- Sum values for each year
$ awk -F ',' '/2022/ {sum22 += $2} ; /2021/ {sum21 += $2} END {print "2022: " sum22 ", 2021: " sum21}' file.csv
2022: 627, 2021: 748
- Add a new calculated column and new column names
# BEGIN: this will run before file processing
# current, diff and prev are variables
$ awk -F ',' 'BEGIN {print "date,m3,diff"}; {current = $2}; {diff = current - prev}; {prev = $2}; {if (NR != 1) print $1","$2","diff}' ../file.csv | column -s ',' -t
date        m3   diff
2021-06-04  177  177
2021-08-10  184  7
2021-10-07  190  6
2021-12-06  197  7
- Remove two lines (and add line numbers)
$ awk -F ',' '{if (NR==1 || NR==3) {} else {print NR " "$0}}' test.csv | head -n5
2 2021-07-21 00:00,247
4 2021-07-21 02:00,136
5 2021-07-21 03:00,82
6 2021-07-21 04:00,84
7 2021-07-21 05:00,115
- Search multiple patterns (&&)
$ ps aux | awk '/pts/ && /bash/'
ricardo 1521 0.0 0.0 11528 5892 pts/1 Ss abr19 0:00 /bin/bash
ricardo 9482 0.0 0.0 11132 5392 pts/39 Ss+ 12:45 0:00 /bin/bash
ricardo 15174 0.0 0.0 12224 3756 pts/1 S+ 14:02 0:00 awk /pts/ && /bash/
$ awk -F ',' '/2021/ && NR==2 {start21=$2}; /2021/ {end21=$2}; /2022/ {end22=$2}; END {print "2021: "end21-start21"; 2022: "end22-end21}' file.csv
2021: 20; 2022: 17
- Display a random line (look at the use of double quotes)
$ awk "NR==$(($RANDOM % `wc -l < urls.txt`))+1 {print}" urls.txt
https://eldiario.es
$ awk "NR==$(($RANDOM % `wc -l < urls.txt`))+1 {print}" urls.txt
https://google.com
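The per-year sum in this section needs one variable and one rule per year. As a possible alternative, the sketch below uses an awk array indexed by the year, which this tutorial does not otherwise cover. The sample readings are invented, and the output is piped through sort because awk does not guarantee the order of "for (y in sum)".

```shell
# Hypothetical date,value readings.
readings='2021-06-04,177
2021-08-10,184
2022-01-05,200
2022-03-02,210'

# Use the first 4 characters of the date as the array key and add up column 2.
totals=$(printf '%s\n' "$readings" |
  awk -F ',' '{sum[substr($1, 1, 4)] += $2} END {for (y in sum) print y ": " sum[y]}' |
  sort)

printf '%s\n' "$totals"
```

This version keeps working when a new year appears in the file, without editing the program.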
sed
Use sed, the ‘Stream EDitor’, to transform text.
Replace/delete text
- Replace “word1” with “word2”.
sed 's/word1/word2/' text.txt
This command will not change the file, only show the results. You can export the output to a file with > or use -i to edit the original file (use with caution, it’s always safer to create a new file).
sed 's/word1/word2/' text.txt > newtext.txt
sed -i 's/word1/word2/' text.txt
Also, it only changes the first occurrence in each line. To change all occurrences:
sed 's/word1/word2/g' text.txt
- Find and replace a word in several files at once:
sed -i 's/word1/word2/g' *.txt
- Delete word1.
sed 's/word1//g' text.txt
- Delete first character of every line.
sed 's/^.//' text.txt
- Delete last character of every line.
sed 's/.$//' text.txt
- Replace “o” with “O” only on lines that match a pattern.
sed '/root/s/o/O/g' /etc/passwd
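To see the difference between the plain substitution, the g flag and a pattern in front of the s command, this sketch runs all three on the same made-up two-line input.

```shell
# Hypothetical input with "foo" repeated on the first line.
text='foo foo
bar foo'

# Without g: only the first match on each line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/foo/X/')

# With g: every match on each line is replaced.
all=$(printf '%s\n' "$text" | sed 's/foo/X/g')

# With a leading pattern: only lines containing "bar" are touched.
only_bar=$(printf '%s\n' "$text" | sed '/bar/s/foo/X/g')

printf '%s\n---\n%s\n---\n%s\n' "$first_only" "$all" "$only_bar"
```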
Delete lines
- Delete lines matching a pattern.
sed '/root/d' /etc/passwd
- You can use RegExp when looking for a pattern. This command will delete any empty line.
sed '/^$/d' test.txt
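- Depending on the file formatting (for example, Windows line endings), lines that look empty may contain a carriage return. In that case use '/^\r$/d'. Be careful with '/\r/d': it deletes every line that contains a carriage return, not only the empty ones.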
Print matched lines
sed -n '/pattern/p' file.txt
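The -n option matters here: without it, sed prints every line once by default and the p command prints the matching lines a second time. A small sketch using the URLs from this tutorial as inline sample data:

```shell
urls='https://elpais.com
https://eldiario.es
https://google.com'

# -n suppresses the automatic printing, so only matching lines appear.
matched=$(printf '%s\n' "$urls" | sed -n '/\.com$/p')

# Without -n, the matching line is printed twice.
doubled=$(printf '%s\n' "$urls" | sed '/\.es$/p')

printf '%s\n---\n%s\n' "$matched" "$doubled"
```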
Prepend/Append lines
- Insert text one line before every line.
sed 'i\new line' test.txt
- Append text (insert one line after every line).
sed 'a\new line' test.txt
Specify a line number
If you add a line number before the subcommand letter (a, i, d, etc.), that subcommand will only run on that line. To refer to the last line, type $.
# Delete second line
sed 2d test.txt
# Delete from line 3 to line 6
sed 3,6d test.txt
# Insert a line at the beginning
sed '1i\new line' test.txt
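The $ address for the last line is mentioned above but not shown. The sketch below uses it together with a range and a single line number, on made-up input.

```shell
# Hypothetical four-line input.
lines='one
two
three
four'

# $ refers to the last line.
no_last=$(printf '%s\n' "$lines" | sed '$d')

# A range deletes lines 2 to 3.
middle_gone=$(printf '%s\n' "$lines" | sed '2,3d')

# With -n, a line number plus p prints only that line.
third=$(printf '%s\n' "$lines" | sed -n '3p')

printf '%s\n---\n%s\n---\n%s\n' "$no_last" "$middle_gone" "$third"
```

Keep the single quotes around '$d', otherwise the shell may try to expand the $.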
Extended RegExp support
For some RegExp, you may need to use the -E option.
sed -E '/(\.com)$/d'
Replace between patterns
You can use RegExp to split a line into several groups (using parentheses) and replace only one of them.
$ cat urls.txt
https://elpais.com
https://eldiario.es
https://radiohuesca.com
https://google.com
$ sed -E 's/^(https:\/\/)(.*)(\.com)$/\1test\3/' urls.txt
https://test.com
https://eldiario.es
https://test.com
https://test.com
- You can select pattern groups with \ and its number: \1 for the first group, \2 for the second, etc. In this case, we want to print the first group (https://), add test and print the third group (.com).
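Groups can also be reordered, not only kept or dropped. A minimal sketch that swaps two groups, assuming made-up "surname, name" input:

```shell
# Hypothetical input: "surname, name" pairs.
names='Lovelace, Ada
Hopper, Grace'

# Group 1 is everything before the comma, group 2 is the rest of the line.
# The replacement prints them in the opposite order.
swapped=$(printf '%s\n' "$names" | sed -E 's/^([^,]*), (.*)$/\2 \1/')

printf '%s\n' "$swapped"
```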
More examples
- Merge lines 2 and 3. Replace ‘2’ with the line you want to merge. You can add spaces or any character between the lines.
sed '2N;s/\n//' testfile
- Change only the first occurrence in a file (the substitution runs from line 0 to the first line matching /-/; the 0 address is a GNU sed feature)
$ cat text
Hola Mundo
-
Hola Mundo
$ sed '0,/-/{s/H/h/g}' text
hola Mundo
-
Hola Mundo
- Change uppercase to lowercase and vice versa
sed 's/[[:lower:]]/\U&/g' lowertoupper.txt
sed 's/[[:upper:]]/\l&/g' uppertolower.txt # Detects accented letters
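The line-merging example above can be checked on a small input. In this sketch the newline between lines 2 and 3 is replaced by a space instead of being removed. The input is made up.

```shell
# Hypothetical four-line input.
lines='a
b
c
d'

# On line 2, N appends the next line to the current one (joined by a newline),
# then the substitution turns that newline into a space.
merged=$(printf '%s\n' "$lines" | sed '2N;s/\n/ /')

printf '%s\n' "$merged"
```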
cut
cut is a simpler alternative to awk when you only need to extract columns. You can use it to split text into columns and show one or several of them.
- Print the first column of a “:”-delimited file.
cut -d ":" -f1 /etc/passwd
- Print two columns.
cut -d ":" -f1,7 /etc/passwd
- By default, cut uses the input delimiter as the separator in the output, but you can change it with --output-delimiter=DELIMITER.
cut -d ":" -f1,7 --output-delimiter=" " /etc/passwd
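Here is a short sketch of cut on inline sample data, including a field range (-f1-3) and one practical difference from awk: cut treats every single delimiter as a field boundary, while awk by default collapses runs of spaces. The sample lines and variable names are assumptions for illustration.

```shell
# Hypothetical passwd-style sample data.
passwd_like='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# First field only.
first=$(printf '%s\n' "$passwd_like" | cut -d ':' -f1)

# A range of fields, still joined by the input delimiter.
range=$(printf '%s\n' "$passwd_like" | cut -d ':' -f1-3)

# Two spaces between a and b: cut sees an empty second field, awk sees "b".
cut_field=$(printf '%s\n' 'a  b' | cut -d ' ' -f2)
awk_field=$(printf '%s\n' 'a  b' | awk '{print $2}')

printf '%s\n---\n%s\n' "$first" "$range"
```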
tr
tr works similarly to sed, but on single characters: it translates or deletes characters from standard input, writing to standard output.
tr 'character' 'substitution' < file
For example, you can change commas to tabs:
tr ',' '\t' < file.csv
- You can achieve something similar with column -s ',' -t < file.csv.
Or you can delete all spaces:
echo {a..z} | tr -d ' '
You can change lowercase into uppercase (and vice versa) easily:
$ tr '[:lower:]' '[:upper:]' <<< 'Hello World'
HELLO WORLD
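The three uses of tr described in this section can be tried without any file, since tr only reads standard input. A minimal sketch with made-up strings:

```shell
# Translate commas to tabs.
tabbed=$(printf 'a,b,c\n' | tr ',' '\t')

# Delete all spaces.
no_spaces=$(printf 'a b c\n' | tr -d ' ')

# Translate lowercase to uppercase with character classes.
upper=$(printf 'Hello World\n' | tr '[:lower:]' '[:upper:]')

printf '%s\n%s\n%s\n' "$tabbed" "$no_spaces" "$upper"
```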