Instead of just reading the man file, you could read this post about cut!
Printing columns ('fields') n to m (inclusive) from a file:
cut -d [delimiter] -f n-m filename
Thus, removing the first n-1 fields from a file:
cut -d [delimiter] -f n- filename
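A quick sketch on a made-up colon-separated file (the file name and contents here are invented):

```shell
# Hypothetical file: SNPid, chromosome, alleles, separated by ':'
printf 'rs123:1:ACGT\nrs456:2:TTGA\n' > snps.txt

# Fields 2 to 3 (inclusive)
cut -d ':' -f 2-3 snps.txt
# 1:ACGT
# 2:TTGA

# Field 2 to the end, i.e. dropping the first field
cut -d ':' -f 2- snps.txt
# 1:ACGT
# 2:TTGA
```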
[delimiter] defaults to a tab. You could also use ' ' (space), '`', ':', '-' or '_'. Apparently 'HELLO' is not an acceptable delimiter - that's not a bug, it's just that cut only accepts single-character delimiters.
If you just want a specific column, you could use awk:
awk '{ print $n }' filename
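For example, with n replaced by an actual column number (the file here is invented):

```shell
printf 'chr1 rs111 0.5\nchr2 rs222 0.9\n' > data.txt
# Print just the second column (the SNP ids)
awk '{ print $2 }' data.txt
# rs111
# rs222
```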
Or do some fancier things. Say* we have a file containing a list of chromosome numbers and SNPids and some other information separated into columns, and we want to extract just the chromosomes and SNPids, rewriting '2' as 'chr02' etc. and including a tab space. We could write
awk '{ if ($1<10) print "chr0" $1 "\t" $2; else print "chr" $1 "\t" $2 }' filename
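On a made-up two-column input this gives:

```shell
# Hypothetical input: chromosome number, then SNP id
printf '2 rs100\n11 rs200\n' > chrs.txt
awk '{ if ($1<10) print "chr0" $1 "\t" $2; else print "chr" $1 "\t" $2 }' chrs.txt
# chr02	rs100
# chr11	rs200
```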
The double-quotation marks are necessary here. In awk, column numbering starts from 1 rather than 0 (note that the chromosomes, which are in the first column, are accessed via $1) because $0 contains the full line. So you could do
awk '{ if ($1+$2 == 3) print $0; else print $1+$2,"is not 3" }' filename
if for some reason you wanted to pick out lines whose first two columns sum to three. If you try doing that and $1 or $2 don't contain something which could reasonably be added (e.g. in the SNPid example), awk will just give weird output and not realise the horrible things it's doing (non-numeric strings are silently treated as 0), so be careful with that.
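A quick illustration of both the sum and the non-numeric pitfall (the input lines are made up):

```shell
printf '1 2 keep-me\n5 5 nope\nrs123 foo bar\n' | \
  awk '{ if ($1+$2 == 3) print $0; else print $1+$2,"is not 3" }'
# 1 2 keep-me
# 10 is not 3
# 0 is not 3    <- awk quietly treated the strings as 0
```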
Note the comma (e.g. in print $1+$2,"is not 3") inserts the output field separator, which is a space by default. As per earlier, use "\t" to insert a tab.
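That separator is awk's OFS variable, which you can also change once instead of writing "\t" by hand everywhere:

```shell
# Default OFS is a space
awk 'BEGIN { print "a","b" }'
# a b

# Set OFS to a tab, and every comma now inserts a tab
awk 'BEGIN { OFS="\t"; print "a","b" }'
# a	b
```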
You could do something similar to extract all the even or odd columns in a file by silencing those you don't want:
awk '{ for(i=1;i<=NF;i+=2) $i="" }1' filename > evencols
No, the 1 is not a typo. It just tells awk to print every line. Now, this will produce some unwanted spaces between fields, so we can get rid of them with sed:
sed "s/^ //;s/  / /g" evencols
The basic thing going on here is s/string_to_replace/with_this_string, separated by ; indicating a new command for sed. In the first one we're stripping a leading whitespace from each line - ^ indicates 'start of line', so we're replacing "whitespace at start" with "nothing". The second command is simply replacing double whitespace with single whitespace. I'm sure there are more rigorous ways to do this, but this worked for me.
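Putting the two steps together on a throwaway file (note the blanked last field still leaves a trailing space, which is the kind of thing those more rigorous ways would handle):

```shell
printf 'a b c d e\n' > testfile
awk '{ for(i=1;i<=NF;i+=2) $i="" }1' testfile | sed 's/^ //;s/  / /g'
# b d   (with a trailing space left over from the blanked 5th field)
```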
What about finding things? Suppose I have a giant folder - how giant you say?
ls . | wc -l
This just pipes the output of ls . into wc which, with the -l flag, counts how many lines we have. The folder I'm looking at has 948 things in it, because I am organised like that. I want to find a file with 'wolf' in the title, so I can do
ls -l . | grep 'wolf'
I included the -l flag on ls because I'm interested in things like the biggest file with wolf in its title. Supposing I had a worryingly large number of wolf-related files, I could get straight to the biggest one by piping more commands together:
ls -l . | grep 'wolf' | sort -k5 -n | tail -1
sort -k5 -n sorts numerically on the 5th column, which is where ls -l puts the file size (a plain sort -n would try to sort on the permissions string at the start of each line). sort outputs low to high, which is why we take the last line with tail -1.
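A self-contained demo in a scratch directory (the file names and sizes are invented; sort -k5 -n sorts on the size column of ls -l output):

```shell
mkdir -p pack && cd pack
printf 'howl\n' > wolf_small.txt          # a few bytes
head -c 5000 /dev/zero > wolf_big.txt     # 5000 bytes
ls -l . | grep 'wolf' | sort -k5 -n | tail -1    # the wolf_big.txt line
cd ..
```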
Now, let's suppose I don't know which subdirectory my wolf file is in. I could do
find [directory] -name '*wolf*'
to find all files with 'wolf' anywhere in their title in the directory [directory] and all subdirectories of it. To search from the current directory, use . as [directory], etc. To only find wolf files over a certain size (say 1 MB) from the current directory, we have
find . -name '*wolf*' -size +1M
(use -size -1k to get wolf files under 1 kB) or to find all wolf files, sort them by size, and pick out the biggest one, we do
find . -name '*wolf*' -ls | sort -k5 -n | tail -1
The -ls flag tells find to give output in a sort of ls format. For me, the 5th column of this output is the file-size, so we sort based on this column (sort -k5), and the rest is the same as before (check your own output first - with GNU find -ls the size is usually the 7th column, so you'd want sort -k7 there).
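To see the find examples at once on a scratch directory tree (the names and sizes are made up):

```shell
# Two hypothetical wolf files, one nested, one over 1 kB
mkdir -p den/cubs
printf 'howl\n' > den/wolf_a.txt                 # a few bytes
head -c 2048 /dev/zero > den/cubs/wolf_b.txt     # 2 kB

find den -name '*wolf*'            # lists both files, subdirectory included
find den -name '*wolf*' -size +1k
# den/cubs/wolf_b.txt
```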
*based on a real event