Instead of just reading the man file, you could read this post about cut!
Printing columns ('fields') n to m (inclusive) from a file:
cut -d [delimiter] -f n-m filename
Thus, removing the first n-1 fields from a file:
cut -d [delimiter] -f n- filename
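A quick sketch on a made-up colon-separated file (the file name and contents here are invented):

```shell
# Hypothetical file: SNPid, chromosome, alleles, separated by ':'
printf 'rs123:1:ACGT\nrs456:2:TTGA\n' > snps.txt

# Fields 2 to 3 (inclusive)
cut -d ':' -f 2-3 snps.txt
# 1:ACGT
# 2:TTGA

# Field 2 to the end, i.e. dropping the first field
cut -d ':' -f 2- snps.txt
# 1:ACGT
# 2:TTGA
```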
[delimiter] defaults to a tab. You could also use ' ' (space), '`', ':', '-' or '_'. Apparently 'HELLO' is not an acceptable delimiter - that's not a bug, it's just that cut only accepts single-character delimiters.
If you just want a specific column, you could use awk:
awk '{ print $n }' filename
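For example, with n replaced by an actual column number (the file here is invented):

```shell
printf 'chr1 rs111 0.5\nchr2 rs222 0.9\n' > data.txt
# Print just the second column (the SNP ids)
awk '{ print $2 }' data.txt
# rs111
# rs222
```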
Or do some fancier things. Say* we have a file containing a list of chromosome numbers and SNPids and some other information separated into columns, and we want to extract just the chromosomes and SNPids, rewriting '2' as 'chr02' etc. and including a tab space. We could write
awk '{ if ($1<10) print "chr0" $1 "\t" $2; else print "chr" $1 "\t" $2 }' filename
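On a made-up two-column input this gives:

```shell
# Hypothetical input: chromosome number, then SNP id
printf '2 rs100\n11 rs200\n' > chrs.txt
awk '{ if ($1<10) print "chr0" $1 "\t" $2; else print "chr" $1 "\t" $2 }' chrs.txt
# chr02	rs100
# chr11	rs200
```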
The double-quotation marks are necessary here. In awk, column numbering starts from 1 rather than 0 (note that the chromosomes, which are in the first column, are accessed via $1) because $0 contains the full line. So you could do
awk '{ if ($1+$2 == 3) print $0; else print $1+$2,"is not 3" }' filename
if for some reason you wanted to pick out lines whose first two columns sum to three. If you try doing that and $1 or $2 don't contain something which could reasonably be added (e.g. in the SNPid example), awk will just give weird output and not realise the horrible things it's doing (non-numeric strings are silently treated as 0), so be careful with that.
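A quick illustration of both the sum and the non-numeric pitfall (the input lines are made up):

```shell
printf '1 2 keep-me\n5 5 nope\nrs123 foo bar\n' | \
  awk '{ if ($1+$2 == 3) print $0; else print $1+$2,"is not 3" }'
# 1 2 keep-me
# 10 is not 3
# 0 is not 3    <- awk quietly treated the strings as 0
```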
Note the comma (e.g. in print $1+$2,"is not 3") inserts the output field separator, which is a space by default. As per earlier, use "\t" to insert a tab.
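That separator is awk's OFS variable, which you can also change once instead of writing "\t" by hand everywhere:

```shell
# Default OFS is a space
awk 'BEGIN { print "a","b" }'
# a b

# Set OFS to a tab, and every comma now inserts a tab
awk 'BEGIN { OFS="\t"; print "a","b" }'
# a	b
```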
You could do something similar to extract all the even or odd columns in a file by silencing those you don't want:
awk '{ for(i=1;i<=NF;i+=2) $i="" }1' filename > evencols
No, the 1 is not a typo. It just tells awk to print every line. Now, this will produce some unwanted spaces between fields, so we can get rid of them with sed:
sed "s/^ //;s/  / /g" evencols
The basic thing going on here is s/string_to_replace/with_this_string, separated by ; indicating a new command for sed. In the first one we're stripping a leading whitespace from each line - ^ indicates 'start of line', so we're replacing "whitespace at start" with "nothing". The second command is simply replacing double whitespace with single whitespace. I'm sure there are more rigorous ways to do this, but this worked for me.
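Putting the two steps together on a throwaway file (note the blanked last field still leaves a trailing space, which is the kind of thing those more rigorous ways would handle):

```shell
printf 'a b c d e\n' > testfile
awk '{ for(i=1;i<=NF;i+=2) $i="" }1' testfile | sed 's/^ //;s/  / /g'
# b d   (with a trailing space left over from the blanked 5th field)
```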
What about finding things? Suppose I have a giant folder - how giant you say?
ls . | wc -l
This just pipes the output of ls . into wc which, with the -l flag, counts how many lines we have. The folder I'm looking at has 948 things in it, because I am organised like that. I want to find a file with 'wolf' in the title, so I can do
ls -l . | grep 'wolf'
I included the -l flag on ls because I'm interested in things like the biggest file with wolf in its title. Supposing I had a worryingly large number of wolf-related files, I could get straight to the biggest one by piping more commands together:
ls -l . | grep 'wolf' | sort -k5 -n | tail -1
sort -k5 -n sorts numerically on the 5th column, which is where ls -l puts the file size (a plain sort -n would try to sort on the permissions string at the start of each line). sort outputs low to high, which is why we take the last line with tail -1.
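A self-contained demo in a scratch directory (the file names and sizes are invented; sort -k5 -n sorts on the size column of ls -l output):

```shell
mkdir -p pack && cd pack
printf 'howl\n' > wolf_small.txt          # a few bytes
head -c 5000 /dev/zero > wolf_big.txt     # 5000 bytes
ls -l . | grep 'wolf' | sort -k5 -n | tail -1    # the wolf_big.txt line
cd ..
```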
Now, let's suppose I don't know which subdirectory my wolf file is in. I could do
find [directory] -name '*wolf*'
to find all files with 'wolf' anywhere in their title in the directory [directory] and all subdirectories of it. To search from the current directory, use . as [directory], etc. To only find wolf files over a certain size (say 1 MB) from the current directory, we have
find . -name '*wolf*' -size +1M
(use -size -1k to get wolf files under 1 kB) or to find all wolf files, sort them by size, and pick out the biggest one, we do
find . -name '*wolf*' -ls | sort -k5 -n | tail -1
The -ls flag tells find to give output in a sort of ls format. For me, the 5th column of this output is the file-size, so we sort based on this column (sort -k5), and the rest is the same as before (check your own output first - with GNU find -ls the size is usually the 7th column, so you'd want sort -k7 there).
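To see the find examples at once on a scratch directory tree (the names and sizes are made up):

```shell
# Two hypothetical wolf files, one nested, one over 1 kB
mkdir -p den/cubs
printf 'howl\n' > den/wolf_a.txt                 # a few bytes
head -c 2048 /dev/zero > den/cubs/wolf_b.txt     # 2 kB

find den -name '*wolf*'            # lists both files, subdirectory included
find den -name '*wolf*' -size +1k
# den/cubs/wolf_b.txt
```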
*based on a real event