Some Text Processing Tools

Hicran Şevik
11 min read · Jul 5, 2021

Text processing is the automated process of analyzing and sorting raw text data to gain valuable insights. Whenever we send an email, type a comment on social media, or leave a message in an app or service, we leave a data trail that contains a lot of valuable information for companies. To access this information, companies first organize, sort, and measure textual data with text processing tools. They then combine the processed data with machine learning or natural language processing (NLP) to extract valuable insights. We can also use these text processing tools for our own projects.

Text processing tools are not only for the large, complex datasets that companies hold; we can also produce valuable data from plain text files in our own projects. We can put data in a file and then derive meaningful results from it. Let’s walk through a scenario. Suppose you record your monthly expenses in a file as a table with two columns, product and price, to see how much you pay for each item. You may then want to calculate your total spending from the price column. This is where these tools come in.

In this post, I’ll talk about simple text processing tools that you can use on the command line.

awk

Awk is both a tool and a programming language, and it is an excellent fit for UNIX/Linux shell scripts. It reads its input line by line and, by default, splits each line into fields on whitespace.

The syntax is:

awk options 'selection_criteria { action }' input-file

Take this file as an example (referred to as ~/people.txt in later examples). The first row only labels the fields; it is not part of the file:

$1      $2   $3        $4      $5
John 34 Engineer Male
Clara 25 Doctor Female
Ed 21 Student Male
Lucy 48 Teacher Female 2400
William 75 Retired Male

When you run an awk program, the action part is executed for each line that matches selection_criteria. By default, awk splits each line on whitespace and assigns the resulting parts to variables named $1, $2, $3 and so on; $0 represents the whole line. So for the first line, we would have $1 = John, $2 = 34, $3 = Engineer and $4 = Male. Only Lucy’s line has a fifth field, so $5 = 2400 there and $5 is empty on the other lines.

$ awk '{print $1 " " $3}' ~/people.txt
John Engineer
Clara Doctor
Ed Student
Lucy Teacher
William Retired

In this example, we are telling awk to open ~/people.txt and, for every line, take the first and third fields and print them with a space between them.

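Here are some widely used built-in variables that you can use in your awk programs:

  • NR is the number of the current line (record).
  • NF is the number of fields in the current line.
  • FS is the input field separator (whitespace by default).
  • OFS is the output field separator (a single space by default).
  • FILENAME is the name of the file being read.

Let’s try an example with them: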

$ awk '{ if (NF==5) print NR,$1,$5}' ~/people.txt
4 Lucy 2400

The command above does this:

  • If the line has 5 fields,
  • it prints the line number, followed by the first and fifth fields.
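To see NR and NF at work on every line, here is a minimal sketch. It recreates the sample file in a temporary directory (the workdir and report names are only illustrative) and also uses $NF, which is the value of the last field on a line.

```shell
# Recreate the sample file from the text in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/people.txt" <<'EOF'
John 34 Engineer Male
Clara 25 Doctor Female
Ed 21 Student Male
Lucy 48 Teacher Female 2400
William 75 Retired Male
EOF

# NR is the current line number, NF the number of fields on that line,
# and $NF the value of the last field.
report=$(awk '{ print NR ": " NF " fields, last = " $NF }' "$workdir/people.txt")
echo "$report"

# Remove the temporary directory again.
rm -rf "$workdir"
```

Lucy’s line is the only one reported with 5 fields, which is why the NF==5 test above selects only that line.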

What happens if our fields are separated by something other than whitespace? We can change the default separator with the -F option. See the following file, data.csv:

John;34;Engineer;Male;5000
Clara;25;Doctor;Female;1200
Ed;21;Student;Male;4000
Lucy;48;Teacher;Female;7400
William;75;Retired;Male;6000

If we want to sum all of the salaries, we can do the following:

$ awk -F ';' '{sum += $5} END {print sum}' data.csv
23600

With this command, each line is split on ;, and the fifth field of every line is added to the sum variable. The END block runs once after the last line has been read, and it prints the sum variable.
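The same pattern covers the expenses scenario from the introduction. The sketch below is only an illustration: the expenses.txt file name and its three rows are made up, and the second command shows how a selection criterion placed before the action limits it to matching lines.

```shell
# Hypothetical expenses file: product name, then price.
workdir=$(mktemp -d)
cat > "$workdir/expenses.txt" <<'EOF'
bread 12
coffee 45
books 130
EOF

# Add up the second column and print the total once, in the END block.
total=$(awk '{ sum += $2 } END { print sum }' "$workdir/expenses.txt")
echo "total: $total"

# The criterion $2 > 40 runs the action only for lines with a price above 40.
expensive=$(awk '$2 > 40 { print $1 }' "$workdir/expenses.txt")
echo "$expensive"

# Remove the temporary directory again.
rm -rf "$workdir"
```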

sed

sed (stream editor) performs many operations on files, such as searching, find and replace, insertion, and deletion. It supports regular expressions.

The syntax is:

sed OPTIONS... [SCRIPT] [INPUTFILE...]

I’m going to use the following file, sed-example.txt, as an example:

1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of war, or of wars.
4 Total war is warfare is not restricted to military targets.
5 Some war studies consider war a universal. War derives from wyrre.

Find and Replace

The s command replaces text. s/pattern/replacement/ replaces the first occurrence of pattern with replacement in each line. sed writes the result to standard output and leaves the file itself unchanged.

The g flag means global replacement: it replaces all occurrences in the line. s/pattern/replacement/g replaces all occurrences of pattern with replacement.

$ sed 's/war/fighting/g' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of fighting, or of fightings.
4 Total fighting is fightingfare is not restricted to military targets.
5 Some fighting studies consider fighting a universal. War derives from wyrre.
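  • As seen in the example above, every lowercase war is replaced with fighting, even inside longer words such as warfare. The capitalized War is left alone because the match is case-sensitive.

If we remove g, sed changes only the first match in each line. If we put a number n in place of g, only the nth match in each line is changed.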

$ sed 's/war/fighting/' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of fighting, or of wars.
4 Total fighting is warfare is not restricted to military targets.
5 Some fighting studies consider war a universal. War derives from wyrre.
  • Here, only the first occurrence of war in each line is replaced with fighting.
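The numeric flag can be tried in the same way. As a small sketch, the command below feeds the fifth line of the example file to sed through a pipe (the line and second variable names are only illustrative) and replaces just the second match:

```shell
# The fifth line of the example file, passed to sed on standard input.
line='5 Some war studies consider war a universal. War derives from wyrre.'

# The flag 2 replaces only the second match of "war" in each line.
second=$(printf '%s\n' "$line" | sed 's/war/fighting/2')
echo "$second"
```

The first war stays as it is, and the capitalized War is not counted as a match at all.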

We can also combine a number with g to replace the nth match and every match after it in each line. This combination is a GNU sed extension.

$ sed '1,4 s/war/fight/2g' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of war, or of fights.
4 Total war is fightfare is not restricted to military targets.
5 Some war studies consider war a universal. War derives from wyrre.
  • In lines 1 through 4, every war from the second match onward is replaced with fight.

Using $, which stands for the last line, we can modify the lines from the nth to the last, as shown below.

$ sed '4,$ s/war/fighting/g' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
3 Warfare refers to the common activities of war, or of wars.
4 Total fighting is fightingfare is not restricted to military targets.
5 Some fighting studies consider fighting a universal. War derives from wyrre.
  • Replaces every war with fighting from the fourth line to the end of the file.

We can also use regular expressions for more precise matching:

$ sed 's/[wW]ar/Fighting/g' sed-example.txt
1 Fighting is an intense armed conflict between governments.
2 Fighting is generally characterized by extreme violence.
3 Fightingfare refers to the common activities of Fighting, or of Fightings.
4 Total Fighting is Fightingfare is not restricted to military targets.
5 Some Fighting studies consider Fighting a universal. Fighting derives from wyrre.
  • It replaces every part of the text that matches the regex, here both war and War, with the given replacement, Fighting.

Deleting lines

$ sed '3,$d' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
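  • Deletes the third line and every line after it.
$ sed '/war/d' sed-example.txt
1 War is an intense armed conflict between governments.
2 War is generally characterized by extreme violence.
  • Deletes the lines that contain war. In this example, lines 3, 4 and 5 are deleted; lines 1 and 2 remain because they only contain the capitalized War.

Using the following file, numbers.txt: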

1
2
3
4
5
6
7
8
9
10
  • Each line of the file contains its own line number.
$ sed '3~2d' numbers.txt
1
2
4
6
8
10
  • Starting from the third line, deletes every second line (lines 3, 5, 7 and 9). The first~step address form is a GNU sed extension.

Printing Specified Lines

$ sed -n '3,$p' sed-example.txt
3 Warfare refers to the common activities of war, or of wars.
4 Total war is warfare is not restricted to military targets.
5 Some war studies consider war a universal. War derives from wyrre.
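  • Prints the lines from the third to the end of the file. The -n option suppresses sed’s default output, so only the lines selected by p are shown.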

grep

grep is short for global regular expression print. It searches files for a pattern of characters and prints the matching lines.

The syntax is:

grep [options] pattern [files]

Let’s extend sed-example.txt for a better understanding and save it as grep.txt. The new file content is:

1 War is an intense armed conflict between governments.
2 It contains violence, aggression, destruction, and mortality.
3 War is generally characterized by extreme violence.
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.

I have added one more line, line 2, that does not contain war.

$ grep "war" grep.txt
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
  • grep checks every line for the pattern, then prints all lines that contain war.
$ grep -i "wAr" grep.txt
1 War is an intense armed conflict between governments.
3 War is generally characterized by extreme violence.
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
  • The -i option matches case-insensitively, so war, wAr and War all match.
$ grep -c "war" grep.txt
3
  • The -c option displays the number of matching lines.
$ grep -l "war" grep.txt example.txt
grep.txt
  • The -l option lists only the names of the files that contain a match. Here only grep.txt is listed.

By default, grep uses basic regular expressions, which support only a subset of the regex syntax. To get the full support, you need to enable extended regular expressions with the -E (--extended-regexp) switch.

$ grep -E "^[0-9]\s(War)\s" grep.txt --color
1 War is an intense armed conflict between governments.
3 War is generally characterized by extreme violence.
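  • It matches lines that start with a digit, followed by a whitespace character, the word War, and another whitespace character. \s is a GNU extension that stands for any whitespace character.
$ grep -Ev "^[0-9]\s(War)\s*" grep.txt --color
2 It contains violence, aggression, destruction, and mortality.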
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
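  • The -v option inverts the match and prints the lines that do not match the pattern. Because \s* also accepts zero whitespace characters, line 4 (Warfare) matches the pattern too and is left out.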
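The difference between basic and extended regular expressions is easy to check with the ? operator. In the sketch below, which recreates grep.txt in a temporary directory (the variable names are only illustrative), the same pattern is counted once without -E and once with it:

```shell
# Recreate grep.txt from the text in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/grep.txt" <<'EOF'
1 War is an intense armed conflict between governments.
2 It contains violence, aggression, destruction, and mortality.
3 War is generally characterized by extreme violence.
4 Warfare refers to the common activities of war, or of wars.
5 Total war is warfare is not restricted to military targets.
6 Some war studies consider war a universal. War derives from wyrre.
EOF

# Basic regex: ? is an ordinary character, so this looks for the literal
# text "wars?". grep exits with status 1 when nothing matches, hence || true.
basic=$(grep -c 'wars?' "$workdir/grep.txt" || true)

# Extended regex: s? means an optional s, so both war and wars match.
extended=$(grep -Ec 'wars?' "$workdir/grep.txt")

echo "basic: $basic, extended: $extended"   # basic: 0, extended: 3

# Remove the temporary directory again.
rm -rf "$workdir"
```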

Let’s create an emails.txt file for a more practical grep example.

someone_12@outlook.com
deneme@com
crazy_1234@
hello jenny@huawei.com
crazy_1234@someone.someone.com

grep also provides an option, -o, to print only the matching part of each line.

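$ grep -Eo "[a-zA-Z0-9_.-]*@[a-zA-Z]*\.(com)" emails.txt --color
someone_12@outlook.com
jenny@huawei.com
  • It extracts the simple name@domain.com addresses from the file. crazy_1234@someone.someone.com is not matched, because the pattern allows only a single name between @ and .com.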
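If addresses with several dot-separated parts after the @ should be caught too, the pattern can be loosened. The following is one possible sketch, not a complete email validator: it accepts one or more .label groups after the first domain part and is not limited to .com.

```shell
# Recreate emails.txt from the text in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/emails.txt" <<'EOF'
someone_12@outlook.com
deneme@com
crazy_1234@
hello jenny@huawei.com
crazy_1234@someone.someone.com
EOF

# local part, then @, then a domain label followed by one or more ".label" groups
found=$(grep -Eo '[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+' "$workdir/emails.txt")
echo "$found"

# Remove the temporary directory again.
rm -rf "$workdir"
```

With this pattern, deneme@com and crazy_1234@ are still skipped, because neither has a dot-separated domain after the @.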

As we saw above, we can extract the data we need with such patterns.

Regular Expressions Primer

A regular expression (regex, regexp) is a string of text that describes a pattern used to match, locate, and manage text. Regular expressions are used on the command line and in text editors to find text within a file, and they are also supported by many programming languages such as Perl, PHP, and JavaScript. Becoming proficient in regex can save you many hours if you work with a lot of data.

Here are some of the most used regex constructs. Keep in mind that regex is not exactly the same in every application or language, so some of the constructs below might be missing in your implementation.

  • g.t matches g followed by any single character, followed by t, as in get, got, g5t.
  • ge* matches g followed by zero or more e characters, such as g, ge, gee, geee.
  • g.*t matches g, then any number of characters, then t, such as gt, get, guilt, goat.
  • ge+t matches get, geet, geeet, but not gt.
  • ge?t matches gt and get, but not geet.
  • x{2} matches exactly xx.
  • x{2,3} matches xx and xxx.
  • [Gg]+ matches one or more g or G characters, such as g, G, gG, Ggg.
  • [a-z]+ matches one or more lowercase letters.
  • [A-Z]+ matches one or more uppercase letters.
  • [0-9]+ matches one or more digits.
  • (cars?)|bus matches car, cars and bus.
  • car|bus matches car and bus.
  • \d+ matches one or more digits in Perl-style regex. POSIX regular expressions, as used by grep -E, do not define \d, so use [0-9]+ there.
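A few of these constructs can be checked directly with grep. In this sketch, the word list is made up for illustration, and the -x option makes grep accept a line only when the whole line matches the pattern:

```shell
# A small word list, one word per line.
words=$(printf '%s\n' gt get geet got)

# ge?t: the e is optional, so gt and get match.
optional=$(printf '%s\n' "$words" | grep -Ex 'ge?t')

# ge+t: at least one e is required, so get and geet match.
one_or_more=$(printf '%s\n' "$words" | grep -Ex 'ge+t')

# g.t: exactly one character of any kind between g and t, so get and got match.
any_char=$(printf '%s\n' "$words" | grep -Ex 'g.t')

printf 'ge?t:\n%s\nge+t:\n%s\ng.t:\n%s\n' "$optional" "$one_or_more" "$any_char"
```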

You can use an online regex editor to practice and try things out.

cut

cut is a command that slices each line and extracts part of it. It can cut parts of a line by byte position, character position, or field.

The syntax is:

cut OPTION... [FILE]...

The -d option specifies the delimiter character used to split each line.

The -f option specifies which fields to extract.

Using the following file, cut-example.txt, as an example:

John 34 Engineer Male
Clara 25 Doctor Female
Ed 21 Student Male
Lucy 48 Teacher Female 2400
William 75 Retired Male

The following command splits each line on spaces and extracts the first and third fields.

$ cut -d " " -f 1,3 cut-example.txt
John Engineer
Clara Doctor
Ed Student
Lucy Teacher
William Retired

The -c option selects by character position. The following command extracts the first through seventh characters of each line.

$ cut -c 1-7 cut-example.txt
John 34
Clara 2
Ed 21 S
Lucy 48
William
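cut works with other delimiters as well. As a small sketch, the commands below take two rows of the semicolon-separated data from the awk section and also show one difference from awk: cut treats every single delimiter as a field boundary, while awk collapses runs of whitespace by default. The variable names are only illustrative.

```shell
# Split on ; and keep the first and fifth fields (name and salary).
# cut joins the selected fields with the same delimiter.
salaries=$(printf '%s\n' 'John;34;Engineer;Male;5000' 'Clara;25;Doctor;Female;1200' \
  | cut -d ';' -f 1,5)
echo "$salaries"

# With two spaces between a and b, cut sees an empty second field,
# while awk treats the run of spaces as one separator.
cut_field=$(printf 'a  b\n' | cut -d ' ' -f 2)
awk_field=$(printf 'a  b\n' | awk '{ print $2 }')
echo "cut: '$cut_field', awk: '$awk_field'"
```

This is why awk is usually the easier choice for columns aligned with a varying number of spaces, while cut fits files with a strict single-character delimiter.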

I end this story here, so as not to make it too long. See you in my next post.

Hicran Şevik

Software Engineer who loves to talk about zombies 🧟‍♀️