The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...' --Isaac Asimov

Open Source Software Notes

I have to make a lot of notes to myself about how to do stuff on the computer.

Linux/Kubuntu

simple spellcheck:

cat [file] | aspell list | sort -f | uniq -ic | less

to copy files quickly over the network with ssh:

cd /destination/dir/ && ssh SOURCE "cd /source/dir && tar -czvf - *" | tar xzf -

to find the day of year of a particular date:

date --date='27 Nov 2007' +%j
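For example (assuming GNU date, as shipped with Kubuntu), 27 Nov 2007 falls on day 331:

```shell
# %j prints the three-digit day of year; 2007 is not a leap year
date --date='27 Nov 2007' +%j
# prints 331
```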

to get the details on an arbitrary list of files:

locate [file] > filelist.txt
while read -r file; do ls -l "$file"; done < filelist.txt

to quickly scan through a text file for a word (once inside less, use ‘n’ and ‘N’ to search forward and backward):

cat [file] | less -p [word]

to remove all of the blank lines in a text document:

sed -i '/^$/d' [file]
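A quick sanity check of that sed command on a throwaway file (the file name is just for illustration):

```shell
# Make a small file with blank lines, then strip them in place
printf 'first\n\nsecond\n\n\nthird\n' > demo.txt
sed -i '/^$/d' demo.txt
cat demo.txt
# prints the three non-blank lines with no gaps between them
```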

to add an extension (here .csv) to all files in a directory:

rename 's/$/.csv/' *

to count the number of each unique item in a list:

cat [file] | sort | uniq -c | sort -nr
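For instance, piping a small list through the chain shows each item with its count, most frequent first (uniq -c only counts adjacent duplicates, hence the first sort):

```shell
# Three b's and two a's
printf 'b\na\nb\nb\na\n' | sort | uniq -c | sort -nr
# prints "3 b" above "2 a" (with leading spaces from uniq -c)
```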

Bioinformatics

to prepend “>filename” to every FASTA file in a directory:

#!/bin/bash
for file in ./*.fasta; do
    foo=${file##*/}   # strip leading path
    bar=${foo%.*}     # strip .fasta extension
    sed -i "1i >$bar" "$file"
    echo "$bar"
done
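A minimal run of the same loop in a scratch directory (the file name sample.fasta is hypothetical):

```shell
# GNU sed's 1i inserts ">sample" as a new first line of sample.fasta
mkdir -p scratch && cd scratch
printf 'ACGTACGT\n' > sample.fasta
for file in ./*.fasta; do
    foo=${file##*/}
    bar=${foo%.*}
    sed -i "1i >$bar" "$file"
done
head -n1 sample.fasta
# prints >sample
```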

to fix the bad formatting of CAMERA export files

sed 's/^"//;s/","/\t/g;s/",$//g;s/&nbsp;//g' [infile] > [outfile]
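On a made-up CAMERA-style line, the substitutions strip the leading quote, turn the "," separators into tabs, and drop the trailing ",:

```shell
# Input line is invented for illustration
printf '"geneA","1234","some description",\n' | sed 's/^"//;s/","/\t/g;s/",$//g;s/&nbsp;//g'
# prints the three fields separated by tabs
```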

download complete genome sequences from JGI Integrated Microbial Genomes (IMG) using a list of IMG taxon ids (input.txt)

#!/bin/sh

while read -r i; do
    echo "$i"
    FILE="$i.fasta"
    BASE="http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&downloadTaxonFnaFile=1&_noHeader=1&taxon_oid="
    URL="${BASE}${i}"
    wget "$URL" -O "$FILE"
done < input.txt

to find all of the EC numbers in [file], sort, de-replicate, count, and print them by order of decreasing frequency

grep -o -P 'EC\W*\d+\.\d+\.\d+\.\d+' [file] | sort | uniq -c | sort -rn > output.txt
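A quick check on sample text (EC fields can have more than one digit, hence the \d+ variant; the enzyme names are just examples):

```shell
# Two EC numbers, one with a two-digit final field
printf 'alcohol dehydrogenase EC 1.1.1.1\npyruvate kinase (EC 2.7.1.40)\n' \
  | grep -o -P 'EC\W*\d+\.\d+\.\d+\.\d+' | sort | uniq -c | sort -rn
# prints each EC number with a count of 1
```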

ARB import filter to read full_name from a FASTA file. Save to $ARBHOME/lib/import/

Format of the FASTA file should be >[name][tab][full_name]

AUTODETECT      ">*"
        #Global settings:
KEYWIDTH        1

BEGIN   ">??*"

MATCH   ">*"
        SRT "* *=*1:*\t*=*1"
        WRITE "name"

MATCH   ">*"
        SRT "*\t*=*2"
        WRITE "full_name"

SEQUENCEAFTER   "*"
SEQUENCESRT     ""
SEQUENCECOLUMN  0
SEQUENCEEND     ">*"

# DONT_GEN_NAMES
CREATE_ACC_FROM_SEQUENCE

END     "//"

perl script to translate names in tree files or sequence files

Takes the file to convert and a two-column translation table; it will probably need to be edited depending on the type of file. Save as ‘myconvert.pl’, make it executable with ‘chmod +x myconvert.pl’, and run as ‘./myconvert.pl [treefile] [translationfile]’

#!/usr/bin/perl
use strict;

my $treefile = $ARGV[0]; # newick-like tree
my $translatefile = $ARGV[1]; #names to translate
my %namehash = ();
my %outhash = ();
open(FILE, "< $translatefile") or die;
while(<FILE>) {
    chomp;
    my @array = split(/\t/); #split on tab
    $array[1] =~ s/[ \/\(\)']/_/g; #replace bad chars with underscore
    $namehash{$array[0]} = $array[1];
}
close FILE;
open(FILE, "< $treefile") or die;
LINE: while(<FILE>) {
#   chomp; #uncomment to remove newlines
#   s/^[ \t]*//; #uncomment to replace whitespace at beginning of line
#     s/['"]//g; #uncomment to delete quotation marks
    foreach my $phyname (keys %namehash) {
        s/\Q$phyname\E/$namehash{$phyname}/g; # \Q..\E treats names as literal text, not regex
    }
    print "$_";
}
close FILE;

script to reverse sort lines by the number of tab characters in each line (for importing into R)

perl -nle 'print(($_ =~ tr/\t//) . "\t$_")' [filename] | sort -rn > [outfile]

to extract accession numbers from GenBank amino acid FASTA files into a new file: genbank IDs (-f2), refseq accessions (-f4), or descriptions (-f5):

cat *.faa | grep '>gi' | cut -f2 -d\| > [outfile]
cat *.faa | grep '>gi' | cut -f4 -d\| > [outfile]
cat *.faa | grep '>gi' | cut -f5 -d\| > [outfile]
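The pipe-delimited GI headers split out like so (the header below is invented):

```shell
# Field 2 of '>gi|12345|ref|NP_000001.1| description' is the GI number
printf '>gi|12345|ref|NP_000001.1| fake protein description\n' | grep '>gi' | cut -f2 -d\|
# prints 12345
```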

R

or gRRRRR as I sometimes call it.

to print a figure that’s on your screen:

dev.print(png,file="print.png",width=800)

to clean up and remove variables:

rm(list=ls())

to remove NA-only rows by subsetting:

mydf[apply(mydf,1,function(x)any(!is.na(x))),]

LaTeX

to generate a clean one-page HTML output of a TeX document

latex2html -split 0 -no_navigation -info 0 -address 0 [file.tex]

to convert normal quotes into LaTeX quotes

sed 's/"\([^"]*"\)/``\1/g' [inputfile] > [outputfile]
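(In standard LaTeX fonts a plain " already typesets as a closing quote, so only opening quotes need rewriting.) A quick check on one sentence:

```shell
# Opening quotes become ``; the straight closing quote is left alone
printf 'He said "hello" twice\n' | sed 's/"\([^"]*"\)/``\1/g'
# prints: He said ``hello" twice
```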

to globally comment out/not run figures in LaTeX, put it at the end of the preamble
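The command itself seems to have been dropped from this note; one common trick (an assumption on my part, not necessarily the one originally meant) is redefining \includegraphics to a no-op:

```latex
% Assumed reconstruction: silently skip every \includegraphics call.
% Must come after \usepackage{graphicx}, i.e. at the end of the preamble.
\renewcommand{\includegraphics}[2][]{}
```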

Engauge [Graph/Plot] Digitizer

Use this excellent program to convert an image of a graph into usable X/Y data points. It expects plots that do NOT have multiple Y values for a single X value, so rotate images (e.g. P vs. depth profiles) by 90° before you import them. A plot with multiple colors is easiest to digitize; in that case just use the ‘discretize’ options and turn off the ‘grid removal’ options. There are tutorials available at the Engauge site on SourceForge.

sudo apt-get install engauge-digitizer