The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...' --Isaac Asimov

Mapping KEGG organism codes to NCBI RefSeq Ids

January 17th, 2012 by eric

I was working on a problem today that required matching the 3-letter codes KEGG uses to identify genomes with the NCBI RefSeq codes used for the DNA molecules in the genomes I’m working with (e.g. NC_031415). I couldn’t find an existing mapping file, so I had to build one myself. Here’s how.

First, I used Linux command-line tools to download the KEGG organism list and parse it to get the organism code and the NCBI FTP directory name:

wget http://www.genome.jp/kegg/catalog/org_list.html
awk '/Prokaryotes/,0' org_list.html | grep "show_organism?org" | cut -d'>' -f3 | cut -f1 -d'<' > org_a
awk '/Prokaryotes/,0' org_list.html | grep ftp | cut -d'/' -f6 > org_b
paste org_a org_b | sort -k2 > org_list
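
One thing to watch with paste: it pairs lines purely by position, so org_a and org_b only line up if every organism row contributes exactly one FTP link. A minimal sketch of a line-count check follows. The file contents are made-up placeholders (not real KEGG codes or NCBI directory names), workdir is a scratch directory introduced for the example, and the second file is deliberately one entry short to show the failure case.

```shell
# Made-up stand-ins for org_a and org_b; not real KEGG or NCBI data.
workdir=$(mktemp -d)
printf 'aaa\nbbb\nccc\n' > "$workdir/org_a"
printf 'dir_one\ndir_three\n' > "$workdir/org_b"   # one entry short on purpose

# Count lines in each file (tr strips the padding some wc versions add).
count_a=$(wc -l < "$workdir/org_a" | tr -d ' ')
count_b=$(wc -l < "$workdir/org_b" | tr -d ' ')

# paste would silently misalign the columns if the counts differ.
if [ "$count_a" -eq "$count_b" ]; then
    status="aligned"
else
    status="mismatch"
fi
echo "org_a: $count_a lines, org_b: $count_b lines ($status)"
```

If the counts differ on the real files, the rows after the first organism without an FTP link would be paired with the wrong directory name.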

Then, I used lftp to find all of the RefSeq files within each directory. I chose the ‘.rpt’ files, but any file type that occurs once per RefSeq ID would do. I parsed the resulting listing to get the FTP directory name and the RefSeq ID.

lftp -e 'open ftp://ftp.ncbi.nih.gov/genomes/Bacteria/; ls -d */*.rpt > refseq.rpt; bye'
awk '{print $9}' refseq.rpt | cut -f1 -d'.' | cut -f1 -d'/' > ref_a
awk '{print $9}' refseq.rpt | cut -f1 -d'.' | cut -f2 -d'/' > ref_b
paste ref_b ref_a | uniq -f1 | sort -k2 > ref_list
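
To see what the parsing steps above do, here is a small sketch run on a made-up listing. The listing lines assume the server returns a long `ls -l` style format in which the path is the ninth whitespace-separated field; the directory names and RefSeq IDs are placeholders, not real entries.

```shell
# Made-up listing in the assumed long format; path is field 9.
workdir=$(mktemp -d)
cat > "$workdir/refseq.rpt" <<'EOF'
-r--r--r--   1 ftp      anonymous     1234 Jan 01  2012 dir_one/NC_000001.rpt
-r--r--r--   1 ftp      anonymous     1234 Jan 01  2012 dir_one/NC_000002.rpt
-r--r--r--   1 ftp      anonymous     1234 Jan 01  2012 dir_three/NC_000003.rpt
EOF

# Same steps as above: drop the extension, then split directory from ID.
awk '{print $9}' "$workdir/refseq.rpt" | cut -f1 -d'.' | cut -f1 -d'/' > "$workdir/ref_a"
awk '{print $9}' "$workdir/refseq.rpt" | cut -f1 -d'.' | cut -f2 -d'/' > "$workdir/ref_b"

# uniq -f1 ignores the first field (the ID) when comparing adjacent lines,
# so only the first ID listed for each directory survives.
paste "$workdir/ref_b" "$workdir/ref_a" | uniq -f1 | sort -k2 > "$workdir/ref_list"
cat "$workdir/ref_list"
```

Note that dir_one keeps only its first ID here: with `uniq -f1`, a genome with several molecules (extra chromosomes or plasmids) is represented by a single RefSeq ID. Also, because the extension is removed by cutting at the first '.', a directory name containing a '.' would be truncated.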

Then, I joined the two files on their common column: the NCBI FTP directory name. I don’t need the directory name in the final output, so I then made a new file without it.

join -j2 -t$'\t' org_list ref_list > kegg2refseq.txt
cut -f2,3 kegg2refseq.txt > kegg_refseq.tmp
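
The join step silently drops any organism whose directory name has no match, so it can be worth listing those rows too. The sketch below uses made-up placeholder data (not real KEGG codes, directory names, or RefSeq IDs); the `unmatched` file and the explicit `sort -t ... -k2,2` are additions for the example, not part of the commands above.

```shell
# Use one collation order for both sort and join.
export LC_ALL=C
tab=$(printf '\t')
workdir=$(mktemp -d)

# Made-up inputs, each sorted on the directory-name column (field 2).
printf 'aaa\tdir_one\nbbb\tdir_two\nccc\tdir_three\n' \
    | sort -t "$tab" -k2,2 > "$workdir/org_list"
printf 'NC_000001\tdir_one\nNC_000003\tdir_three\nNC_000009\tdir_zzz\n' \
    | sort -t "$tab" -k2,2 > "$workdir/ref_list"

# Join on field 2 of both files; output is: directory, KEGG code, RefSeq ID.
# Keep only the KEGG code and the RefSeq ID.
join -1 2 -2 2 -t "$tab" "$workdir/org_list" "$workdir/ref_list" \
    | cut -f2,3 > "$workdir/kegg_refseq.tmp"

# -v 1 prints the lines of the first file that found no partner.
join -v 1 -1 2 -2 2 -t "$tab" "$workdir/org_list" "$workdir/ref_list" \
    > "$workdir/unmatched"

cat "$workdir/kegg_refseq.tmp"
echo "unmatched:"
cat "$workdir/unmatched"
```

In this toy data, aaa and ccc are paired with an ID, while bbb ends up in the unmatched file because dir_two has no entry in ref_list.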

From here I’ve written a script (download: idreplacer) that uses the lproks_1.txt and lproks_2.txt files provided by NCBI to interconvert names and IDs.

./idreplacer.pl refseqacc refseqid kegg_refseq.tmp | ./idreplacer.pl refseqid refseqacclist > kegg2refseq.csv

Et voilà! Here’s the final KEGG to RefSeq ID mapping file.

Postscript: Here is a database table from MicrobesOnline that does a similar task, mapping KEGG organism to NCBI taxonomy: KEGG2Taxonomy
