Some Text Processing Tools

Text processing is the automated analysis and sorting of raw text data to extract valuable insights. Whenever we send an email, comment on social media, or leave a message in an app or service, we leave a data trail that contains a lot of valuable information for companies. To access this information, companies first organize, sort, and measure the textual data with text processing tools. They then combine the processed data with machine learning or natural language processing (NLP) to extract insights. We can also use these text processing tools for our own projects.
Text processing tools are not only for the large, complex data sets that companies hold; we can also produce useful data from the text files in our own projects. We can put data in a file and then derive meaningful results from it. Let’s walk through a scenario. Suppose you record your monthly expenses in a file as a table with two columns, product and price, to see how much you pay for each item. To find out how much you spent in total, you add up the price column. That is exactly the kind of job these tools are made for.
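The expense scenario can be sketched with awk, one of the tools covered below. The file name expenses.txt and its three rows are hypothetical, made up only for illustration.

```shell
# Hypothetical expenses file: one "product price" pair per line.
workdir=$(mktemp -d)
cat > "$workdir/expenses.txt" <<'EOF'
bread 3
coffee 12
notebook 7
EOF

# Add up the second column (price) and print the total after the last line.
total=$(awk '{ sum += $2 } END { print sum }' "$workdir/expenses.txt")
echo "Total spent: $total"

rm -rf "$workdir"
```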
In this post, I’ll talk about simple text processing tools that you can use on the command line.

Awk is an excellent tool and programming language for UNIX/Linux shell scripting. It scans a file line by line and, by default, splits each line into fields on whitespace.
The syntax is:
awk options 'selection_criteria { action }' input-file
Take this file as an example (it will be referred to as ~/people.txt in later examples):
$1 $2 $3 $4 $5
John 34 Engineer Male
Clara 25 Doctor Female
Ed 21 Student Male
Lucy 48 Teacher Female 2400
William 75 Retired Male
When you run an awk program, the action part is executed for each line that matches selection_criteria. By default, awk splits each line on whitespace and assigns the resulting parts to variables named $1, $2, and so on; $0 represents the whole line. So for the first line, we would have $1 = John, $2 = 34, $3 = Engineer, and $4 = Male. Only the fourth line has a fifth field, so $5 = 2400 there and is empty on the other lines.
$ awk '{print $1 " " $3}' ~/people.txt
John Engineer
Clara Doctor
Ed Student
Lucy Teacher
William Retired
In this example, we are telling awk to open ~/people.txt and, for every line, take the first and third fields and print them with a space between them.
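The selection_criteria part of the syntax can be used as a filter. As a minimal sketch, the following prints the names of people older than 30. It recreates the sample file in a temporary directory so that it runs on its own; the age threshold is an arbitrary choice for illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/people.txt" <<'EOF'
John 34 Engineer Male
Clara 25 Doctor Female
Ed 21 Student Male
Lucy 48 Teacher Female 2400
William 75 Retired Male
EOF

# Selection criteria: the second field (age) is greater than 30.
# Action: print the first field (name).
older=$(awk '$2 > 30 { print $1 }' "$workdir/people.txt")
echo "$older"

rm -rf "$workdir"
```

Lines that do not satisfy the criteria are skipped, so no if statement is needed inside the action.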
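Here are some widely used built-in variables that you can use in your awk programs:
- NR: the number of the current line (record).
- NF: the number of fields in the current line.
- FS: the input field separator (whitespace by default).

Let’s look at an example that uses them: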
$ awk '{ if (NF==5) print NR,$1,$5}' ./people.txt
4 Lucy 2400
The command above does this:
- If the line has 5 fields,
- it prints the line number, then the first field and the fifth field.
What happens if our fields are separated by something other than a space? We can change the default separator (whitespace) with the -F argument. Take a file named data.csv whose fields are separated by semicolons and whose fifth field holds a salary.
If we want to sum all of the salaries, we can do the following:
$ awk -F ';' '{sum += $5} END {print sum}' data.csv
With this command, each line is split on ; and the fifth field of each line is added to the sum variable. At the end of the input, the END block runs and prints sum.
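Here is a self-contained sketch of the same idea. The rows written to data.csv below are hypothetical sample data (name; age; job; gender; salary), invented only so that the command has something to run on.

```shell
workdir=$(mktemp -d)
cat > "$workdir/data.csv" <<'EOF'
John;34;Engineer;Male;3000
Clara;25;Doctor;Female;4500
Lucy;48;Teacher;Female;2400
EOF

# -F ';' makes awk split on semicolons; $5 is the salary column.
sum=$(awk -F ';' '{ sum += $5 } END { print sum }' "$workdir/data.csv")
echo "$sum"

rm -rf "$workdir"
```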
sed performs many functions on files, such as searching, find and replace, insertion, and deletion. It supports regular expressions.
The syntax is:
sed [options] 'script' [input-file]
I’m going to use the following file (sed-example.txt) as an example:
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of war, or of wars.
4 Total war is warfare is not restricted to military targets.
5 Some war studies consider war a universal. War derives from wyrre.
Find and Replace
s is used to replace text in the file. s/pattern/replacement/ replaces the first occurrence of pattern with replacement in each line.
g means global replacement: it replaces all occurrences of the pattern in the line. s/pattern/replacement/g replaces all occurrences of pattern with replacement.
$ sed 's/war/fighting/g' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of fighting, or of fightings.
4 Total fighting is fightingfare is not restricted to military targets.
5 Some fighting studies consider fighting a universal. War derives from wyrre.
- As seen in the example above, every occurrence of war is replaced with fighting. The match is case-sensitive, so War is left untouched.
If we remove g, sed changes only the first match on each line. If we put a number n in place of g, only the nth match on each line is changed.
$ sed 's/war/fighting/' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of fighting, or of wars.
4 Total fighting is warfare is not restricted to military targets.
5 Some fighting studies consider war a universal. War derives from wyrre.
- Here, only the first occurrence of war on each line is replaced with fighting.
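The numeric flag can be tried out in the same way. This sketch replaces only the second war on each line; it rebuilds two lines of the sample file in a temporary directory so that it runs by itself.

```shell
workdir=$(mktemp -d)
cat > "$workdir/sed-example.txt" <<'EOF'
3 Warfare refers to the common activities of war, or of wars.
4 Total war is warfare is not restricted to military targets.
EOF

# The trailing 2 means: replace only the second match on each line.
second=$(sed 's/war/fighting/2' "$workdir/sed-example.txt")
echo "$second"

rm -rf "$workdir"
```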
We can also replace every match from the nth one onward on each line. This is how GNU sed treats a number combined with g; POSIX does not specify the combination.
$ sed '1,4 s/war/fight/2g' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of war, or of fights.
4 Total war is fightfare is not restricted to military targets.
5 Some war studies consider war a universal. War derives from wyrre.
- In lines 1 through 4, every occurrence of war from the second one onward is replaced with fight.
Using $, we can modify lines from the nth to the last, like below.
$ sed '4,$ s/war/fighting/g' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of war, or of wars.
4 Total fighting is fightingfare is not restricted to military targets.
5 Some fighting studies consider fighting a universal. War derives from wyrre.
- Replaces every war with fighting from the fourth line to the end of the file.
We can also use regular expressions to do more precise matching:
$ sed 's/[wW]ar/Fighting/g' sed-example.txt
1 Fighting is an intense armed conflict between governments.
2 Fighting is generally characterized by extreme violence.
3 Fightingfare refers to the common activities of Fighting, or of Fightings.
4 Total Fighting is Fightingfare is not restricted to military targets.
5 Some Fighting studies consider Fighting a universal. Fighting derives from wyrre.
- It replaces every match of the regex [wW]ar, that is, both war and War, with Fighting.
Deleting lines

$ sed '3,$d' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
- Deletes the third line and every line after it.
$ sed '/war/d' sed-example.txt
- Deletes the lines that contain war. In this example, lines 3, 4, and 5 are deleted; lines 1 and 2 remain because they only contain the capitalized War.
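To check which lines a pattern address really removes, it helps to run it on a copy of the sample file. In this small sketch, the first command uses the case-sensitive pattern and the second uses a bracket expression so that War is matched as well.

```shell
workdir=$(mktemp -d)
cat > "$workdir/sed-example.txt" <<'EOF'
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of war, or of wars.
4 Total war is warfare is not restricted to military targets.
5 Some war studies consider war a universal. War derives from wyrre.
EOF

# Case-sensitive: only lines containing lowercase "war" are deleted.
kept=$(sed '/war/d' "$workdir/sed-example.txt")
echo "$kept"

# Matching both "war" and "War" deletes every line of this file.
kept_any_case=$(sed '/[wW]ar/d' "$workdir/sed-example.txt")
[ -z "$kept_any_case" ] && echo "nothing left"

rm -rf "$workdir"
```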
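Using a file named numbers.txt, in which each line contains its own line number:
$ sed '3~2d' numbers.txt
- Starting at the third line, deletes every second line (lines 3, 5, 7, and so on). The first~step address form is a GNU sed extension.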
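A self-contained way to see this address form in action. The sketch assumes GNU sed, since first~step is a GNU extension, and uses a hypothetical eight-line numbers.txt.

```shell
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 > "$workdir/numbers.txt"

# 3~2 selects line 3 and then every second line after it (3, 5, 7).
remaining=$(sed '3~2d' "$workdir/numbers.txt")
echo "$remaining"

rm -rf "$workdir"
```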
Printing Specified Lines

$ sed -n '3,$p' sed-example.txt
3 Warfare refers to the common activities of war, or of wars.
4 Total war is warfare is not restricted to military targets.
5 Some war studies consider war a universal. War derives from wyrre.
- Prints from the third line to the end of the file. The -n option suppresses the default output, so only the lines selected by p are printed.
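To see what -n changes, the same print command can be run with and without it. A brief sketch on a hypothetical five-line file (lines.txt is a made-up name):

```shell
workdir=$(mktemp -d)
printf '%s\n' one two three four five > "$workdir/lines.txt"

# With -n, only lines 2 through 4 are printed.
middle=$(sed -n '2,4p' "$workdir/lines.txt")
echo "$middle"

# Without -n, every line is printed once and the selected lines twice.
doubled=$(( $(sed '2,4p' "$workdir/lines.txt" | wc -l) ))
echo "lines printed without -n: $doubled"

rm -rf "$workdir"
```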
grep is short for global regular expression print. It searches a file for a particular pattern of characters and prints the lines that match.
The syntax is:
grep [options] pattern [files]
Let’s edit the sed-example.txt file and save it as grep.txt for the next examples. The new file content is:
1 War is an intense armed conflict between governments.
2 It contains violence, aggression, destruction, and mortality.
3 War is generally characterized by extreme violence.
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
I have added one more line that does not contain war to the file here.
$ grep "war" grep.txt
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
- grep looks for the pattern in every line, then prints all lines that contain war.
$ grep -i "wAr" grep.txt
1 War is an intense armed conflict between governments.
3 War is generally characterized by extreme violence.
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
- Matches case-insensitively, so wAr also matches war and War.
$ grep -c "war" grep.txt
- Displays the number of matching lines.
$ grep -l "war" grep.txt example.txt
- Lists the names of the files that contain a match for the pattern.
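The following sketch runs -c and -l on two small files so that the difference is visible. grep.txt is recreated from the sample above; other.txt is a hypothetical second file with no match.

```shell
workdir=$(mktemp -d)
cat > "$workdir/grep.txt" <<'EOF'
1 War is an intense armed conflict between governments.
2 It contains violence, aggression, destruction, and mortality.
3 War is generally characterized by extreme violence.
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
EOF
echo "Nothing to see here." > "$workdir/other.txt"

# -c counts matching lines, not individual matches.
count=$(grep -c "war" "$workdir/grep.txt")
echo "$count"

# -l prints only the names of files with at least one match.
matched_files=$(cd "$workdir" && grep -l "war" grep.txt other.txt)
echo "$matched_files"

rm -rf "$workdir"
```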
By default, grep only supports a subset of regular expressions. To get full support, you need to enable extended regular expressions with the -E (--extended-regexp) switch.
$ grep -E "^[0-9]\s(War)\s" grep.txt --color
1 War is an intense armed conflict between governments.
3 War is generally characterized by extreme violence.
- It matches lines that start with a digit, followed by whitespace, then War and another whitespace character.
$ grep -Ev "^[0-9]\s(War)\s*" grep.txt --color
2 It contains violence, aggression, destruction, and mortality.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
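- With -v, grep prints the lines that do not match the pattern. Line 4 is left out as well, because \s* also accepts zero whitespace characters, so 4 Warfare matches.
Let’s create an emails.txt file for a more useful grep example.
grep also provides an option, -o, to print only the matching parts of each line.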
$ grep -Eo "[a-zA-Z0-9_.-]*@[a-zA-Z]*\.(com)" emails.txt --color
- It extracts every .com email address from the file. The pattern is deliberately simple and does not cover every valid address.
As we saw above, we can pull out the data we need with patterns like these.
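A sketch of -o on a hypothetical emails.txt; the addresses are invented and use the example.com domain. It uses + rather than * in the pattern, an adjustment of mine so that a bare @.com is not accepted.

```shell
workdir=$(mktemp -d)
cat > "$workdir/emails.txt" <<'EOF'
Contact alice@example.com for details.
bob_01@example.com and carol.smith@example.com are on leave.
This line has no address.
EOF

# -o prints each match on its own line, even when one line holds several.
emails=$(grep -Eo '[a-zA-Z0-9_.-]+@[a-zA-Z]+\.com' "$workdir/emails.txt")
echo "$emails"

rm -rf "$workdir"
```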
Regular Expressions Primer
A regular expression (regex, regexp) is a string of text that defines a pattern to help you match, locate, and manage text. Regular expressions are used on the command line and in text editors to find text within a file, and they are supported by many programming languages such as Perl, PHP, and JavaScript. Becoming fluent in regex can save you many hours if you work with a lot of data.
Here are some of the most used regex constructs. Keep in mind that regex is not exactly the same in every application or language, so some of the constructs below might be missing in your implementation.

- g.t matches any text where g is followed by any single character and then t, as in get.
- ge* matches g followed by zero or more e characters, such as g or ge; ge*t matches text such as gt.
- ge+t matches text like get, but will not match gt.
- ge?t matches text like gt, but will not match geet or anything else with more than one e.

- x{2} matches exactly xx.
- x{2,} matches xx and any longer run of x.
- g{1,3} matches runs of one to three g characters, such as g.
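- [a-z]+ matches one or more lowercase letters. [A-Z]+ matches one or more uppercase letters. [0-9]+ matches one or more digits.
- car|bus matches either car or bus.
- \d+ matches one or more digits (a Perl-style shorthand that is not available in every tool).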

You can use an online regex editor to practice and try things out.
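The quantifiers can also be tried from the command line by pointing grep -E at a short word list. The four words below are arbitrary test inputs, and -x makes grep accept a line only if the whole line matches.

```shell
words=$(printf '%s\n' gt get geet got)

# Each variable collects the words that a pattern accepts, joined by spaces.
star=$(printf '%s\n' "$words" | grep -Ex 'ge*t' | paste -sd ' ' -)      # zero or more e
plus=$(printf '%s\n' "$words" | grep -Ex 'ge+t' | paste -sd ' ' -)      # one or more e
optional=$(printf '%s\n' "$words" | grep -Ex 'ge?t' | paste -sd ' ' -)  # zero or one e
dot=$(printf '%s\n' "$words" | grep -Ex 'g.t' | paste -sd ' ' -)        # any one character

echo "ge*t: $star"
echo "ge+t: $plus"
echo "ge?t: $optional"
echo "g.t:  $dot"
```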
cut is a command that slices each line and extracts part of it. It can cut parts of a line by byte position, character position, or field.
The syntax is:
cut OPTION... [FILE]...
The -d parameter specifies the character that divides the content into fields. The -f parameter specifies the fields that we want.
Using the following file (cut-example.txt) as an example:
John 34 Engineer Male
Clara 25 Doctor Female
Ed 21 Student Male
Lucy 48 Teacher Female 2400
William 75 Retired Male
The following command splits each line on spaces and gets the first and third fields.
$ cut -d " " -f 1,3 cut-example.txt
John Engineer
Clara Doctor
Ed Student
Lucy Teacher
William Retired
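Fields can also be given as a range. In this sketch, users.txt is a hypothetical colon-separated file, and -f 2- selects everything from the second field to the end of the line.

```shell
workdir=$(mktemp -d)
cat > "$workdir/users.txt" <<'EOF'
john:34:engineer
clara:25:doctor
EOF

# -d ':' sets the delimiter; -f 2- keeps field 2 through the last field.
rest=$(cut -d ':' -f 2- "$workdir/users.txt")
echo "$rest"

rm -rf "$workdir"
```

The delimiter is kept between the selected fields in the output.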
The -c parameter selects content by character position. The following command gets the first through seventh characters of each line.
$ cut -c 1-7 cut-example.txt
John 34
Clara 2
Ed 21 S
Lucy 48
William
I end this story here, so as not to make it too long. See you in my next post.