Open Source Software Notes
I have to make a lot of notes to myself about how to do stuff on the computer.
Linux/Kubuntu
simple spellcheck:
to fast copy over the network with ssh:
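The command that belonged to this note is not preserved here. One common approach, offered as a sketch and not necessarily the one originally used, is to stream a compressed tar archive through ssh so that many small files travel as a single stream. The helper name `fast_copy` and its three arguments are made up for illustration.

```shell
# Hypothetical helper: fast_copy <local_dir> <user@host> <remote_parent_dir>
# Streams a gzip-compressed tar archive of <local_dir> over ssh and unpacks it
# under <remote_parent_dir>, which must already exist on the remote machine.
# Assumes the remote path contains no single quotes.
fast_copy() {
    src=$1; host=$2; dest=$3
    tar czf - -C "$(dirname "$src")" "$(basename "$src")" |
        ssh "$host" "tar xzf - -C '$dest'"
}
```

It would be called as, for example, `fast_copy ./results user@server /data/backup`.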
to find the day of year of a particular date:
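A minimal sketch, assuming GNU `date` (the default on Kubuntu); the date shown is only an example:

```shell
# %j prints the day of the year, zero-padded to three digits (001..366).
doy=$(date -d "2012-03-01" +%j)
echo "$doy"
```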
to get the details on an arbitrary list of files:
while IFS= read -r file; do ls -l -- "$file"; done < filelist.txt
to quickly scan through a text file for a word, then use ‘n’ and ‘N’ to search forward and backward:
to remove all of the blank lines in a text document:
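The original command is not shown here; one way to do it with `sed` is sketched below. The file names are placeholders, and lines containing only spaces or tabs are treated as blank, which is an assumption about what "blank" should mean.

```shell
# Build a small sample file (illustrative content only).
workdir=$(mktemp -d)
printf 'alpha\n\n  \nbeta\n\ngamma\n' > "$workdir/notes.txt"

# Delete every line that is empty or contains only whitespace.
sed '/^[[:space:]]*$/d' "$workdir/notes.txt" > "$workdir/notes_noblank.txt"
cat "$workdir/notes_noblank.txt"
```

With GNU sed, `sed -i` applies the same expression to the file in place.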
to add an extension (here .csv) to all files in a directory:
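A possible version as a plain shell loop (the original one-liner is not preserved). The sample directory and file names are invented for the demonstration.

```shell
# Set up a throwaway directory with two sample files.
workdir=$(mktemp -d)
touch "$workdir/run1" "$workdir/run 2"

# Append .csv to every regular file in the directory.
# The glob is expanded once, so renamed files are not visited again.
for f in "$workdir"/*; do
    if [ -f "$f" ]; then
        mv -- "$f" "$f.csv"
    fi
done
ls "$workdir"
```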
to count the number of each unique item in a list:
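The usual idiom for this is `sort | uniq -c`; the sketch below assumes one item per line in a file called `list.txt` (a placeholder name).

```shell
# Sample list, one item per line (illustrative data).
workdir=$(mktemp -d)
printf 'apple\npear\napple\nfig\npear\napple\n' > "$workdir/list.txt"

# sort groups identical lines, uniq -c counts each group,
# and the final sort -rn puts the most frequent item first.
counts=$(sort "$workdir/list.txt" | uniq -c | sort -rn)
echo "$counts"
```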
Bioinformatics
to prepend “>filename” to every FASTA file in a directory:
for file in ./*.fasta; do
    foo=${file##*/}                # strip the leading path
    bar=${foo%.*}                  # strip the extension
    sed -i "1i \>$bar" "$file"     # insert ">name" as the first line (GNU sed)
    echo "$bar"
done
to fix the bad formatting of CAMERA export files
download complete genome sequences from JGI Integrated Microbial Genomes (IMG) using a list of IMG taxon ids (input.txt)
while read -r i; do
    echo "$i"
    FILE="$i.fasta"
    BASE="http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&downloadTaxonFnaFile=1&_noHeader=1&taxon_oid="
    URL="${BASE}${i}"
    wget "$URL" -O "$FILE"    # quoted so ? and & in the URL are passed literally
done < input.txt
to find all of the EC numbers in [file], sort, de-replicate, count, and print them by order of decreasing frequency
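The original pipeline is missing here, so the following is only a sketch. It assumes EC numbers appear as four dot-separated fields where later fields may be "-", and the sample annotation lines are made up. The pattern is deliberately loose and would also match things like IP addresses, so tighten it if the file contains other dotted numbers.

```shell
# Made-up annotation lines for demonstration.
workdir=$(mktemp -d)
cat > "$workdir/annotations.txt" <<'EOF'
gene1 hexokinase (EC 2.7.1.1)
gene2 glucokinase (EC 2.7.1.2)
gene3 hexokinase (EC 2.7.1.1)
gene4 partial annotation (EC 1.1.1.-)
gene5 hexokinase EC:2.7.1.1
EOF

# Extract each EC number on its own line, sort, collapse duplicates
# with counts, and order by decreasing count.
ec_counts=$(grep -oE '[0-9]+\.[0-9-]+\.[0-9-]+\.[0-9-]+' "$workdir/annotations.txt" |
    sort | uniq -c | sort -rn)
echo "$ec_counts"
```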
ARB import filter to read full_name from a FASTA file. Save to $ARBHOME/lib/import/
Format of the FASTA file should be >[name][tab][full_name]
#Global settings:
KEYWIDTH 1
BEGIN ">??*"
MATCH ">*"
SRT "* *=*1:*\t*=*1"
WRITE "name"
MATCH ">*"
SRT "*\t*=*2"
WRITE "full_name"
SEQUENCEAFTER "*"
SEQUENCESRT ""
SEQUENCECOLUMN 0
SEQUENCEEND ">*"
# DONT_GEN_NAMES
CREATE_ACC_FROM_SEQUENCE
END "//"
perl script to translate names in tree files or sequence files
Takes the file to convert and a 2-column, tab-separated translation table (old name, new name). It will probably need to be edited depending on the type of file. Save as ‘myconvert.pl’, make it executable with ‘chmod +x myconvert.pl’, and run as ‘./myconvert.pl [treefile] [translationfile]’
#!/usr/bin/perl
use strict;
use warnings;

my $treefile      = $ARGV[0];   # newick-like tree
my $translatefile = $ARGV[1];   # names to translate
my %namehash = ();

# read the translation table: old name [tab] new name
open(my $table, '<', $translatefile) or die "cannot open $translatefile: $!";
while (<$table>) {
    chomp;
    my @array = split(/\t/);            # split on tab
    next unless @array >= 2;            # skip blank or malformed lines
    $array[1] =~ s/[ \/\(\)']/_/g;      # replace bad chars with underscore
    $namehash{$array[0]} = $array[1];
}
close $table;

open(my $tree, '<', $treefile) or die "cannot open $treefile: $!";
while (<$tree>) {
    # chomp;            # uncomment to remove newlines
    # s/^[ \t]*//;      # uncomment to strip whitespace at beginning of line
    # s/['"]//g;        # uncomment to delete quotation marks
    foreach my $phyname (keys %namehash) {
        # \Q...\E matches the name literally, even if it contains . | ( ) etc.
        s/\Q$phyname\E/$namehash{$phyname}/;
    }
    print "$_";
}
close $tree;
script to reverse sort lines by the number of tab characters in each line (for importing into R)
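The script itself is not preserved, so this is one possible reconstruction with awk, sort, and cut instead of a standalone script. The input file name is a placeholder. Putting the lines with the most tabs first presumably helps R see the full number of columns early in the file.

```shell
# Sample tab-delimited file with a varying number of columns per line.
workdir=$(mktemp -d)
printf 'a\tb\nc\nd\te\tf\tg\nh\ti\tj\n' > "$workdir/table.txt"

# Prefix each line with its tab count (fields minus one), sort on that count
# in descending numeric order (-s keeps ties in input order, GNU sort),
# then strip the helper column again.
tab=$(printf '\t')
sorted=$(awk -F'\t' '{ printf "%d\t%s\n", NF - 1, $0 }' "$workdir/table.txt" |
    sort -t "$tab" -k1,1nr -s |
    cut -f2-)
echo "$sorted"
```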
script to extract accession numbers:
genbank (-f2) or refseq (-f4) or descriptions (-f5) out of genbank amino acids fasta files into a new file
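The field numbers above suggest a `cut` on the pipe character. A sketch under that assumption follows, using the older NCBI header layout `>gi|number|db|accession|description`; the two sample records and file names are invented.

```shell
# Invented sample records in the assumed header layout.
workdir=$(mktemp -d)
cat > "$workdir/proteins.faa" <<'EOF'
>gi|111|ref|XP_000001.1| example protein A
MKTAYIAKQR
>gi|222|ref|XP_000002.1| example protein B
MSLLTEVETY
EOF

# Keep only the header lines, then cut one pipe-delimited field:
#   -f2 = GI number, -f4 = accession, -f5 = description
grep '^>' "$workdir/proteins.faa" | cut -d'|' -f4 > "$workdir/accessions.txt"
cat "$workdir/accessions.txt"
```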
R
or gRRRRR as I sometimes call it.
to print a figure that’s on your screen:
to clean up and remove variables:
to remove NA-only rows by subsetting:
LaTeX
to generate a clean one-page HTML output of a TeX document
to convert normal quotes into LaTeX quotes
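The original command is missing; a `sed` substitution along these lines is one way to do it. It assumes straight double quotes come in matched pairs on the same line, so quotations that span lines would need separate handling.

```shell
# Replace each pair of straight double quotes on a line with LaTeX-style
# opening (two backticks) and closing (two apostrophes) quotes.
latex_quoted=$(printf '%s\n' 'She said "hello" and then "goodbye".' |
    sed "s/\"\([^\"]*\)\"/\`\`\1''/g")
echo "$latex_quoted"
```

To convert a whole document, run the same `sed` expression on the .tex file and redirect the output to a new file.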
to globally comment out/not run figures in LaTeX, put it at the end of the preamble
Engauge [Graph/Plot] Digitizer
Use this excellent program to convert an image of a graph into usable X/Y data points. It expects plots that do NOT have multiple Y values for a single X value, so rotate images (e.g. P vs. depth) by 90 degrees before you import them. If your plot has multiple colors it is easiest to digitize; in that case just use the ‘discretize’ options and turn off the ‘grid removal’ options. There are tutorials available at the Engauge site on SourceForge.