
Text Processing with Linux

How to merge two lists, sort them and exclude repeated lines?

Option 1

cat file1name file2name | sort | uniq > outputfilename
This sorts in ascending order. To sort in descending order, add the -r option to sort:
cat file1name file2name | sort -r | uniq > outputfilename

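Option 2 (keep only the lines that occur exactly once; any line that is repeated, within one file or across both, is dropped entirely):

sort file1name file2name | uniq -u > diffLines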
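The cat | sort | uniq pipeline of Option 1 can also be written as a single sort call, because sort -u sorts and drops repeated lines in one step. A minimal sketch; the function name merge_unique and the sample file names are made up for illustration:

```shell
#!/bin/sh
# merge_unique FILE... : print the sorted union of all input lines,
# each distinct line once (same result as: cat FILE... | sort | uniq).
merge_unique() {
    sort -u "$@"
}

# Demo with two throwaway files (hypothetical names).
d=$(mktemp -d)
printf 'pear\napple\n' > "$d/list1"
printf 'apple\nfig\n'  > "$d/list2"
merge_unique "$d/list1" "$d/list2"    # prints apple, fig, pear (one per line)
rm -r "$d"
```

Use sort -ru instead of sort -u for descending order.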

How to find duplicates in two lists?

sort file1name file2name | uniq -d > duplicates 

How to compare two files line by line?

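comm -1 -2 <(sort first.txt) <(sort second.txt)
This prints only the lines that occur in both files: -1 and -2 suppress the lines unique to the first and to the second file. The <( ... ) syntax is bash process substitution.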
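comm can also report what is unique to either file, depending on which columns are suppressed. The sketch below is a variant for shells without process substitution: it sorts into temporary files first. The function name set_op is hypothetical, and mktemp is assumed to be available.

```shell
#!/bin/sh
# set_op FLAGS FILE1 FILE2 : run comm with FLAGS on sorted copies of both files.
# comm requires sorted input, so the files are sorted into temporary files first.
set_op() {
    flags=$1
    a=$(mktemp) || return 1
    b=$(mktemp) || return 1
    sort "$2" > "$a"
    sort "$3" > "$b"
    comm "$flags" "$a" "$b"
    rm -f "$a" "$b"
}

# Demo with two throwaway files.
d=$(mktemp -d)
printf 'b\na\nc\n' > "$d/first.txt"
printf 'c\nd\nb\n' > "$d/second.txt"
set_op -12 "$d/first.txt" "$d/second.txt"   # lines in both files: b, c
set_op -23 "$d/first.txt" "$d/second.txt"   # lines only in first.txt: a
set_op -13 "$d/first.txt" "$d/second.txt"   # lines only in second.txt: d
rm -r "$d"
```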

Count the number of lines in a file and get only the number?

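wc -l < myfile.txt
or, to store it in a variable:
count=$(wc -l < myfile.txt)
With input redirection wc has no file name to print, so nothing needs to be cut off. Piping "wc -l myfile.txt" through cut -d' ' -f1 returns an empty string on systems such as Mac OS X, where wc pads the number with leading spaces.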
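If the bare number is needed on every platform, one option is to read the file on standard input and strip any padding with tr. A small sketch; count_lines is a made-up helper name:

```shell
#!/bin/sh
# count_lines FILE : print only the number of lines in FILE.
# wc reads stdin, so it prints no file name; tr removes the leading
# spaces that some wc implementations add.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Demo with a throwaway file.
d=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$d/myfile.txt"
count=$(count_lines "$d/myfile.txt")
echo "$count"    # prints 3
rm -r "$d"
```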

String manipulation, matching, etc.

sed substitute command: match a regular expression and replace it with something else

Changing "<www" in the "mappingbased_properties_en.nt" file to "<http://www" in the "mappingbased_properties_en_fixed.nt" file:

sed 's/<www/<http:\/\/www/' mappingbased_properties_en.nt > mappingbased_properties_en_fixed.nt

This replaces only the first match on each line; append the g flag (s/.../.../g) to replace every match.

Replace all slashes by commas in a file

sed -e 's/\//,/g' results.csv > results-all.csv
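When the pattern or the replacement itself contains slashes, sed accepts another delimiter after the s, which avoids the \/ escapes. A sketch of the two substitutions above written that way; the function names and the sample inputs are invented for the example:

```shell
#!/bin/sh
# Both helpers read stdin and write stdout.
# Using | as the delimiter means / needs no escaping.
add_http()          { sed 's|<www|<http://www|g'; }   # every "<www" on the line
slashes_to_commas() { sed 's|/|,|g'; }

printf '<www.example.org/a> .\n' | add_http            # prints <http://www.example.org/a> .
printf '2014/01/31\n' | slashes_to_commas              # prints 2014,01,31
```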

How to split file into several with the fixed length?

split -l 1000 file.nt
This splits file.nt into pieces of 1000 lines each; by default the pieces are named xaa, xab, xac, and so on.
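split takes an optional output prefix as its last argument, which keeps the pieces recognisable. A sketch that splits a generated 2500-line file and checks that the pieces add up to the original; split_lines, the part_ prefix and the generated file are assumptions made for the demo:

```shell
#!/bin/sh
# split_lines FILE N PREFIX : write FILE as chunks of N lines named PREFIXaa, PREFIXab, ...
split_lines() {
    split -l "$2" "$1" "$3"
}

d=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 2500; i++) print i }' > "$d/file.nt"
split_lines "$d/file.nt" 1000 "$d/part_"
wc -l "$d"/part_*          # part_aa and part_ab have 1000 lines, part_ac has 500
cat "$d"/part_* | cmp - "$d/file.nt" && echo "chunks add up to the original"
rm -r "$d"
```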

How to add a character (or a string) at the end of each line of a file?

add ">" at the end of each line ($ is a regex for end of the line):

sed 's/$/>/' myfile

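This prints the result and does not modify the file. To modify the file in place, add the -i option (GNU sed; the sed shipped with Mac OS X needs an explicit backup suffix, e.g. -i ''):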

sed -i 's/$/>/' myfile

OR

save it to another file:

sed 's/$/>/' myfile > anotherfile

How to add a character (or a string) at the beginning of each line of a file?

add "<" at the beginning of each line (^ is a regex for the beginning of the line):

sed 's/^/</' myfile

This will not modify the file. To modify the file add option -i:

sed -i 's/^/</' myfile

OR

save it to another file:

sed 's/^/</' myfile > anotherfile
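Both edits can be done in one pass: in the replacement part of a sed substitution, & stands for the whole matched text, so matching the entire line and wrapping & adds a prefix and a suffix at once. A minimal sketch with a made-up helper name:

```shell
#!/bin/sh
# wrap_lines : read stdin, put "<" in front of and ">" behind every line.
# .* matches the whole line and & re-inserts it between the brackets.
wrap_lines() {
    sed 's/.*/<&>/'
}

printf 'a\nb\n' | wrap_lines    # prints <a> and <b> on separate lines
```

The same result comes from chaining the two commands shown above: sed 's/^/</; s/$/>/' myfile.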


How to rename a set of files or add an additional extension?

for i in *.*; do mv "$i" "$i.n3"; done
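To replace an extension instead of appending one, the shell's ${var%pattern} expansion can strip the old suffix before the new one is added. A sketch under the assumption that all files sit directly in one directory; change_ext and the sample file names are hypothetical:

```shell
#!/bin/sh
# change_ext DIR OLD NEW : rename every DIR/*.OLD to DIR/*.NEW.
change_ext() {
    for f in "$1"/*."$2"; do
        [ -e "$f" ] || continue      # the glob matched nothing
        mv -- "$f" "${f%.*}.$3"      # ${f%.*} drops the last extension
    done
}

d=$(mktemp -d)
touch "$d/a.txt" "$d/b.txt"
change_ext "$d" txt n3
ls "$d"        # a.n3 and b.n3
rm -r "$d"
```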

How to add a fixed header at the top of each file in your dir?

merge content of file1 with all other files in your dir (e.g. add a fixed header with file1 content at the top of each file in your dir):

for i in *.n3; do cat file1 "$i" > "$i.withhead.n3"; done

Convert 7-bit ASCII representations to UTF-8 Unicode

for i in *.n3; do ascii2uni "$i" > "$i.utf8.n3"; done

How to convert the two middle quotes of four double quotes to single quotes?

e.g. if you have a string with ...."..."..."..".....

and need to convert it to:

...."...'...'..".....

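do this (the sed script is in double quotes so that the single quotes in the replacement reach sed instead of ending the shell string):

for i in *.n3; do sed "s/\(\".*\)\"\(.*\)\"\(.*\"\)/\1'\2'\3/" "$i" > "$i.quotesfixed.n3"; done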
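Before looping over many files it helps to try the expression on a single sample line. The sketch below wraps the substitution in a helper (fix_middle_quotes is a made-up name) with the sed script in double quotes, so that the single quotes in the replacement are passed through to sed:

```shell
#!/bin/sh
# fix_middle_quotes : read stdin; on lines with four double quotes,
# turn the second and third one into single quotes.
# Inside the double-quoted script, \" is a literal double quote for sed.
fix_middle_quotes() {
    sed "s/\(\".*\)\"\(.*\)\"\(.*\"\)/\1'\2'\3/"
}

printf '%s\n' 'x "a "b" c" y' | fix_middle_quotes    # prints: x "a 'b' c" y
```

Lines with fewer than four double quotes do not match and pass through unchanged.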

Remove backslashes (replace each one with a space)

for i in *.n3; do sed 's/\\/ /g' "$i" > "$i.bfixed.n3"; done

How to remove the middle quote of three double quotes?

e.g. if you have a string with ...."..."..".....

and need to convert it to:

...."... ..".....

i.e. remove the middle one (the command puts a space in its place):
for i in *.n3; do sed 's/\(".*\)"\(.*"\)/\1 \2/' "$i" > "$i.3qfixed.n3"; done

How many lines are there in each file?

Count the number of lines of each file with extension .txt in the specified dir:

find /thepathtothedir -maxdepth 1 -name "*.txt" -print0  | xargs -0 -n 1 wc -l

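To put that list into a file use (lines.txt itself matches *.txt and may show up in the list; write it to another directory or give it another extension to avoid that):
  find . -maxdepth 1 -name "*.txt" -print0  | xargs -0 -n 1 wc -l > lines.txt

Count the files with 0 lines (the pattern allows for the leading spaces that some wc versions print):
  grep -c "^ *0 " lines.txt

or with 10 lines (the trailing space keeps 100, 101, ... from matching):
  grep -c "^ *10 " lines.txt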
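The same counts can be collected with a plain loop and compared numerically in awk, which sidesteps the padding and prefix pitfalls of matching the number with grep. A sketch; line_counts and files_with_lines are hypothetical helper names and the demo files are generated on the fly:

```shell
#!/bin/sh
# line_counts DIR : print "COUNT NAME" for every *.txt file directly in DIR.
line_counts() {
    for f in "$1"/*.txt; do
        [ -f "$f" ] || continue
        printf '%s %s\n' "$(wc -l < "$f" | tr -d ' ')" "$f"
    done
}

# files_with_lines N : read "COUNT NAME" lines on stdin and print
# how many of them have COUNT equal to N (numeric comparison).
files_with_lines() {
    awk -v n="$1" '$1 == n { c++ } END { print c + 0 }'
}

d=$(mktemp -d)
: > "$d/empty.txt"
awk 'BEGIN { for (i = 1; i <= 10; i++) print i }'  > "$d/ten.txt"
awk 'BEGIN { for (i = 1; i <= 100; i++) print i }' > "$d/hundred.txt"
line_counts "$d" | files_with_lines 0     # prints 1
line_counts "$d" | files_with_lines 10    # prints 1 (hundred.txt is not counted)
rm -r "$d"
```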


Sed tutorial

http://www.tutorialspoint.com/unix/unix-regular-expressions.htm

Awk tricks

Input file expected.constraints has lines of the form:

name Person:0.9 Organisation:0.1

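The command below keeps the name and the first label, turns the colon into a space, sorts by the score (third field) in descending numeric order and aligns the columns:

awk '{print $1,$2}' expected.constraints|tr ':' ' '|sort -k 3 -rn|column -t|less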


Output:

name Person 0.9
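The colon can also be handled inside awk with its split function, which saves the extra tr step. A sketch for the single sample line above; first_label is a made-up name:

```shell
#!/bin/sh
# first_label : for each "name Label:score ..." line on stdin,
# print "name Label score" for the first label only.
first_label() {
    awk '{ split($2, kv, ":"); print $1, kv[1], kv[2] }'
}

printf 'name Person:0.9 Organisation:0.1\n' | first_label    # prints: name Person 0.9
```

The output can be piped into sort -k 3 -rn and column -t exactly as before.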

How to Read CSV or convert it to TSV using bash on Mac OS X?

A simple CSV file can be read with awk, for example:

awk -F "\"*,\"*" '{print $1 "\t" $2 "\t" $3}' test.csv

where your test.csv file looks like this:

first,second,third
1,2,3
4,5,6
7,8,9



If instead the output contains only the first line:

first,second,third


it is very likely that the file uses CR line terminators, which need to be converted.


To check your line endings do:


file test.csv


If what you get is:


test.csv: ASCII text, with CR line terminators

you will need to remove CR line terminators.


You can use the dos2unix package for that. If you don't have it installed, you can get it using brew (run brew as a regular user, without sudo):


brew install dos2unix


Finally, convert the file with mac2unix, which is part of the dos2unix package:


mac2unix test.csv

mac2unix: converting file test.csv to Unix format...


Now run your awk script again, and it will work.
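If installing dos2unix is not an option, tr can do the same conversion. A sketch, assuming the file uses either bare CR or CRLF line endings; the helper names are invented for the example:

```shell
#!/bin/sh
# cr_to_lf   : old Mac style, every CR becomes a newline.
# crlf_to_lf : Windows style, the CR in front of each newline is dropped.
cr_to_lf()   { tr '\r' '\n'; }
crlf_to_lf() { tr -d '\r'; }

# A CR-terminated CSV becomes readable line by line again:
printf 'first,second\r1,2\r' | cr_to_lf | awk -F, '{ print $2 }'    # prints second, then 2
```

For a real file, redirect input and output, e.g. cr_to_lf < test.csv > test-unix.csv (the output name is just an example).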



---


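Put three constant columns (0, 0 and "somelabel") in front of the second and third columns of a TSV file:

awk -F'\t' 'BEGIN{OFS="\t"}{print 0,0,"somelabel",$2,$3}' input.tsv > output.tsv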


--


awk -F'\t' '{print $3}' input.tsv |cut -d' ' -f 3-


Assume the third tab-separated column has the format

0 label some text with words


The command above will print 'some text with words'.
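To see the two steps at work, the pipeline can be fed one hand-made line. The sketch assumes the string from the example sits in the third tab-separated column; the first two columns (id1, id2) and the function name are invented:

```shell
#!/bin/sh
# text_after_label : take the third tab-separated column from stdin and
# drop its first two space-separated words.
text_after_label() {
    awk -F'\t' '{ print $3 }' | cut -d' ' -f3-
}

printf 'id1\tid2\t0 label some text with words\n' | text_after_label
# prints: some text with words
```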


How to remove trailing and leading spaces from a string in awk?

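Use tr -d '[:blank:]'. Note that this deletes every space and tab, including those inside the string, so it only suits single-word values.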


for example, in a file that reads

TOPIC#:   CARS

the below command will extract CARS without any spaces:



topicid=$(grep "TOPIC#:" "$i" | awk -F':' '{print $2}' | tr -d '[:blank:]')
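When the value may contain several words, only the leading and trailing blanks should go. One way to do that is a pair of anchored sed substitutions; a sketch, with trim as a hypothetical helper name and an invented sample line:

```shell
#!/bin/sh
# trim : read stdin, strip spaces and tabs at the start and end of each line,
# leaving blanks inside the line untouched.
trim() {
    sed 's/^[[:blank:]]*//; s/[[:blank:]]*$//'
}

printf 'TOPIC#:   FAST CARS  \n' | awk -F':' '{ print $2 }' | trim    # prints: FAST CARS
```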


