I first had the idea of converting DNA strings to MIDI music files, using Sean [ http://groups.google.com/groups/profile?enc_user=N0KQ5w8AAAD0VD270uUpCbIPP1utz--x ] 's MIDI module for Perl, way back [ http://groups.google.com/group/DNA-Music/about ] in my JNU days, when I was working on improving Fourier-transform-based coding region prediction [ http://www.ncbi.nlm.nih.gov/pubmed/11927773 ] .
Recently I got the opportunity to participate in YSR [ http://www.sars.no/seminars/Retreat_2008_invitation.pdf ], where I finally hacked together these ideas for converting DNA strings to MIDI music. I am just back from there. The evening was full of serious biology talks, mostly from experimental biologists [ http://www.uib.no/People/ash022/BYSR08_program_HR_GB.pdf ], and I hope I lived up to the reputation of a bioinformatician: not just having fun with computers, but justifying our salary by plugging in the biological data somehow.
The presentation tries to explain the flowchart in 4 slides [slideshare: http://www.slideshare.net/sharma_animesh/ysr-presentation-animesh-rev/ , pdf: http://sharma.animesh.googlepages.com/YSR_Presentation_Animesh_Rev.pdf ].
The code [ http://sharma.animesh.googlepages.com/create_music_from_string.pl ] depends on Sean's MIDI module for Perl, which can be downloaded from CPAN [ http://search.cpan.org/~sburke/MIDI-Perl-0.81/lib/MIDI/Simple.pm ]. It takes sequences in FASTA format [ eg: http://sharma.animesh.googlepages.com/telo.fas , it does accept multiple sequences ] and outputs a MIDI file [ eg: http://sharma.animesh.googlepages.com/telo.fas.0.midi ] and DFT values [ eg: http://sharma.animesh.googlepages.com/telo.fas.0.fft , NOTE: for multiple sequences, the counter starts from 0 for the 1st sequence and goes to N-1 for the Nth sequence].
To play the MIDI files [ eg: http://sharma.animesh.googlepages.com/FLJ20436.fas.0.midi generated from http://sharma.animesh.googlepages.com/FLJ20436.fas ] on Linux, I would recommend installing timidity [ http://timidity.sourceforge.net/ ].
A lot needs to be done to make this thing do something practical: I need to change the code to use the Fast Fourier transform, employ a window-based strategy to walk through the whole genome, and create region-based music ...
By the way, it is worth looking at the other generated file, *.fft [ eg: http://sharma.animesh.googlepages.com/telo.fas.0.fft ], as well; it says a lot about the repeat lengths hidden in the sequence.
For example, the sequence ( http://sharma.animesh.googlepages.com/telo.fas ) I used initially is:
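The pipeline described above (FASTA in, one MIDI file per sequence out, numbered from 0) can be sketched in Python using only the standard library. This is an illustration, not a port of create_music_from_string.pl: the base-to-pitch table `NOTE_FOR_BASE`, the fixed note length, and the helper names `read_fasta`, `vlq`, `midi_bytes`, and `convert` are all assumptions made for the example, since the post does not describe how the Perl script maps bases to notes.

```python
import struct

# Hypothetical base-to-pitch table (MIDI note numbers). The original
# script's mapping is not described in the post.
NOTE_FOR_BASE = {"A": 69, "C": 60, "G": 67, "T": 65}


def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def vlq(n):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))


def midi_bytes(notes, ticks_per_note=96, ppq=96):
    """Build a format-0 Standard MIDI File playing `notes` one after another."""
    track = b""
    for note in notes:
        track += vlq(0) + bytes([0x90, note, 64])               # note on, channel 0
        track += vlq(ticks_per_note) + bytes([0x80, note, 64])  # note off
    track += vlq(0) + b"\xFF\x2F\x00"                           # end-of-track meta event
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ppq)
    return header + b"MTrk" + struct.pack(">I", len(track)) + track


def convert(path):
    """Write <path>.<i>.midi for every sequence in `path`, counting from 0."""
    with open(path) as handle:
        records = read_fasta(handle.read())
    written = []
    for i, (_name, seq) in enumerate(records):
        # Bases without an entry in the table (e.g. N) are skipped.
        notes = [NOTE_FOR_BASE[b] for b in seq if b in NOTE_FOR_BASE]
        out = f"{path}.{i}.midi"
        with open(out, "wb") as fh:
            fh.write(midi_bytes(notes))
        written.append(out)
    return written
```

Calling `convert("telo.fas")` on a local copy of the example file would write `telo.fas.0.midi`, following the same 0-based naming as the Perl script.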
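The window-based walk mentioned above could look roughly like the sketch below. The function name `windows` and its `size` and `step` parameters are hypothetical, and nothing like this exists in the current script. Choosing a window size that is a multiple of 3 puts period 3 exactly on a DFT bin.

```python
def windows(seq, size, step):
    """Yield (start, subsequence) for each full window of `size` bases."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]
```

Each window could then be turned into its own MIDI fragment and its own spectrum, giving one piece of "region music" per stretch of the genome.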
>teloseq
AGGGTTAGGGTTAGGGTTAGG
which has the telomere repeat signal TTAGGG of length 6. When we look at its fft file ( http://sharma.animesh.googlepages.com/telo.fas.0.fft ) and plot it ( http://sharma.animesh.googlepages.com/telo.jpg ), we observe peaks (defined in the original paper as a signal greater than 4) around 6 and 3.
Coding regions generally show a peak at 3, the hypothesis being codon usage bias in the coding strand, while non-coding regions lack this property (not useful in eukaryotic sequences, as they have introns too). So if you feed a sequence from a coding region to this code, you might find a peak at 3 in the generated *.fft file.
To read more about MIDI encoding, I would suggest "The Musician's Guide to MIDI" by Christian Braut [ http://www.amazon.com/Musicians-Guide-Sybex-Macintosh-library/dp/0782112854 ].
Now I need to train my ears by listening to lots of coding and non-coding region music and then, given some unknown sequence, predict which of these regions it belongs to just by listening to its MIDI file.
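To see why a repeat shows up as a peak at its period, here is a small Python sketch of one common formulation: build a 0/1 indicator sequence for each base, take the DFT of each, and sum the squared magnitudes. This is an assumption about the general approach, not a reproduction of what the Perl script writes to the *.fft file; in particular it applies no normalisation, so the threshold of 4 mentioned above does not carry over. The names `dft_power` and `peaks_by_period` are made up for the example.

```python
import cmath


def dft_power(seq):
    """Total DFT power per frequency bin k, summed over the four base indicators."""
    seq = seq.upper()
    n = len(seq)
    power = []
    for k in range(n):
        total = 0.0
        for base in "ACGT":
            # DFT of the indicator sequence that is 1 where seq[i] == base.
            coeff = sum(cmath.exp(-2j * cmath.pi * k * i / n)
                        for i, b in enumerate(seq) if b == base)
            total += abs(coeff) ** 2
        power.append(total)
    return power


def peaks_by_period(seq, top=3):
    """Return the `top` strongest bins as (period in bases, power) pairs."""
    power = dft_power(seq)
    n = len(seq)
    bins = range(1, n // 2 + 1)  # skip k = 0 (the mean) and mirrored bins
    ranked = sorted(bins, key=lambda k: power[k], reverse=True)[:top]
    return [(n / k, power[k]) for k in ranked]
```

Bin k corresponds to a period of N/k bases. For the 21-base teloseq above, period 3 sits exactly on bin 7, while period 6 would be at 21/6 = 3.5 and so falls between bins 3 and 4, which is consistent with the peak being "around" 6. With four exact copies of TTAGGG (24 bases), the non-zero bins below N/2 are only those for periods 6, 3 and 2.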