Friday, May 23, 2008

Music from DNA string

I first had the idea of converting DNA strings to MIDI music files using Sean [ http://groups.google.com/groups/profile?enc_user=N0KQ5w8AAAD0VD270uUpCbIPP1utz--x ] 's MIDI module for Perl way back [ http://groups.google.com/group/DNA-Music/about ] in my JNU days, when I was working on improving Fourier-transform-based coding region prediction [ http://www.ncbi.nlm.nih.gov/pubmed/11927773 ] .
Recently I got the opportunity to participate in YSR [ http://www.sars.no/seminars/Retreat_2008_invitation.pdf ], where I finally hacked together these ideas for converting DNA strings to MIDI music. I am just coming back from there. The evening was full of serious biology talks, mostly from experimental biologists [ http://www.uib.no/People/ash022/BYSR08_program_HR_GB.pdf ], and I hope I lived up to the reputation of a bioinformatician: not just having fun with computers, but justifying our salary by plugging in the biological data somehow.
The presentation tries to explain the flowchart in 4 slides [slideshare: http://www.slideshare.net/sharma_animesh/ysr-presentation-animesh-rev/ , pdf: http://sharma.animesh.googlepages.com/YSR_Presentation_Animesh_Rev.pdf ].
The code [ http://sharma.animesh.googlepages.com/create_music_from_string.pl ] depends on Sean's MIDI module for Perl, which can be downloaded from CPAN [ http://search.cpan.org/~sburke/MIDI-Perl-0.81/lib/MIDI/Simple.pm ]. It takes in a sequence in FASTA format [ eg: http://sharma.animesh.googlepages.com/telo.fas , it does accept multiple sequences ] and outputs a MIDI file [ eg: http://sharma.animesh.googlepages.com/telo.fas.0.midi ] and the DFT values [ eg: http://sharma.animesh.googlepages.com/telo.fas.0.fft , NOTE: for multiple sequences, the counter starts from 0 for the 1st sequence and goes to N-1 for the Nth sequence ].
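To make the input and output naming above concrete, here is a minimal Python sketch. It is not the Perl script itself: the function names `read_fasta` and `output_names` are hypothetical, and only the FASTA input and the `<input>.<index>.midi` / `<input>.<index>.fft` naming pattern are taken from the example files linked above.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def output_names(fasta_path, index):
    """Return (midi_name, fft_name) for the sequence at a 0-based index.

    Hypothetical helper that only mirrors the naming seen in the
    example files, e.g. telo.fas -> telo.fas.0.midi and telo.fas.0.fft.
    """
    return (f"{fasta_path}.{index}.midi", f"{fasta_path}.{index}.fft")


if __name__ == "__main__":
    example = ">teloseq\nAGGGTTAGGGTTAGGGTTAGG\n"
    for i, (name, seq) in enumerate(read_fasta(example)):
        print(name, len(seq), output_names("telo.fas", i))
```

With several sequences in one file, the index simply counts up from 0, which matches the note about the counter above.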
To play the MIDI files [ eg: http://sharma.animesh.googlepages.com/FLJ20436.fas.0.midi generated from http://sharma.animesh.googlepages.com/FLJ20436.fas ] in Linux, I would recommend installing timidity [ http://timidity.sourceforge.net/ ].
A lot needs to be done to make this thing do something practical: I need to change the code to use the fast Fourier transform and employ a window-based strategy to walk through a whole genome and create region-based music ...
By the way, the other generated file, *.fft [ eg: http://sharma.animesh.googlepages.com/telo.fas.0.fft ], is worth a look as well: it says a lot about the repeat lengths hidden in the sequence.
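The window-based walk mentioned here could look roughly like the Python sketch below. This is only an illustration of the idea, not something the current script does: the function name `sliding_windows` and the `size` and `step` parameters are hypothetical, and dropping the trailing partial window is an arbitrary choice.

```python
def sliding_windows(sequence, size, step):
    """Yield (start, window) pairs of fixed-length windows along a sequence.

    A trailing stretch shorter than `size` is dropped (an assumption made
    to keep every window the same length).
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


if __name__ == "__main__":
    # Each window could then be turned into its own spectrum or MIDI snippet.
    for start, window in sliding_windows("AGGGTTAGGGTTAGGGTTAGG", size=12, step=3):
        print(start, window)
```

Overlapping windows (a step smaller than the size) would give a smoother walk along the genome at the cost of more computation.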
For example, the sequence ( http://sharma.animesh.googlepages.com/telo.fas ) I used initially is:
>teloseq
AGGGTTAGGGTTAGGGTTAGG
which carries the telomere repeat TTAGGG of length 6. When we look at its fft file ( http://sharma.animesh.googlepages.com/telo.fas.0.fft ) and plot it ( http://sharma.animesh.googlepages.com/telo.jpg ), we observe peaks (defined in the original paper as a signal greater than 4) around 6 and 3.
Coding regions generally show a peak at 3, the hypothesis being that it comes from codon usage bias in the coding strand, while non-coding regions lack this property (this is less useful for eukaryotic sequences, as they have introns too). So if you feed a sequence from a coding region to this code, you might find a peak at 3 in the generated *.fft file.
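For readers who want to experiment with the period-3 idea without the Perl script, here is a minimal Python sketch. It assumes a common formulation in which each base gets a binary indicator sequence and the squared DFT magnitudes of the four indicators are summed; this may differ from what the script or the cited paper does, and no normalization or peak threshold is applied. The function name `indicator_power_spectrum` is hypothetical, and the plain O(N^2) DFT is used for clarity rather than speed.

```python
import cmath


def indicator_power_spectrum(seq):
    """Return a list of (period, power) pairs for frequencies k = 1 .. N//2.

    For each base b, the indicator sequence is 1 where seq has b and 0
    elsewhere. The power at frequency k is the sum over the four bases of
    |U_b(k)|^2, and the matching period is N / k.
    """
    seq = seq.upper()
    n = len(seq)
    spectrum = []
    for k in range(1, n // 2 + 1):
        power = 0.0
        for base in "ACGT":
            # DFT of the indicator sequence: only positions holding `base` contribute.
            u = sum(
                cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, ch in enumerate(seq)
                if ch == base
            )
            power += abs(u) ** 2
        spectrum.append((n / k, power))
    return spectrum


if __name__ == "__main__":
    # The telomere example from above; inspect which periods stand out.
    for period, power in indicator_power_spectrum("AGGGTTAGGGTTAGGGTTAGG"):
        print(f"{period:6.2f}  {power:8.3f}")
```

Note that with a sequence of length N only the periods N/k are sampled, so a repeat whose length does not divide N shows up at the nearest sampled periods rather than exactly at its own length.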
To read more about MIDI encoding, I would suggest "The Musician's Guide to MIDI" by Christian Braut [ http://www.amazon.com/Musicians-Guide-Sybex-Macintosh-library/dp/0782112854 ].
Now I need to train my ears by listening to lots of coding and non-coding region music and then, given some unknown sequence, predict which of these regions it belongs to just by listening to its MIDI file.
clipped from www.sars.no
Young Scientist's Retreat - Bergen 2008, Invitation
Open to all phd students and postdocs at MBI, Dept. of Informatics,
Dept of Biomedicine, Sars Centre and CBU.
To register contact Ståle Ellingsen.
May 22
Sydneshaugen Skole 12:00-18:00
Host: MCB Research School