🛠️ Super Linux Warp Pipe
July 24th, 2018
Today at work I came across a problem that I managed to solve using multiple Linux pipes. What I needed to do to solve the problem was count the frequency of words existing between a certain context.
This is a simplified version of what the input looked like:
Where is the { FRUIT apple } ?
Is the { PLANT tree } within the ground ?
The { FRUIT banana } ate some { FRUIT banana } .
If I wanted to look for the frequency of words only with the FRUIT
label, then the intended output should look like:
2 banana
1 apple
To prevent a breach of confidentiality, I have replaced the actual words with something else.
The command that ended up doing the trick was:
cat input_file | grep -o -P '(?<={ FRUIT).*?(?= })' | grep -v "^\s$" | sort | uniq -c | sort -bnr > output_file
Below, I will break down what each part does. I will cover the following commands and flags: grep -o -P -v
, sort -bnr
, and uniq -c
. I'll cover regular expressions (regex) in another post.
grep -o
or grep --only-matching
basically matches and prints out only the matched phrase. For example, since I wanted to count the number of occurrences of each of these words or phrases, I wanted to not only search for the words between { FRUIT
and }
, but print them out.
grep -P
or grep --perl-regexp
is pretty self-explanatory. It allows you to specify a Perl regex.
grep -v
or grep --inverse-match
selects lines that don't match the regex. It's the negation of standard grep.
uniq -c
or uniq --count
prefixes each line with the number of occurrences of each phrase or word.
sort -b
or sort --ignore-leading-blanks
ignores leading blanks. Surprising.
sort -n
or sort --numeric-sort
sorts in ascending numerical order. (0, 1, 2, ... 9)
sort -r
or sort --reverse
reverses the sort.