Here is a rundown of the major changes with respect to the original sim4 program.
The original sim4 program only considered the standard GT/AG splice junctions. However, some other possibilities seem to exist. The code was thus modified to allow a specified set of splice junctions, which can be modified through command line options. The default set is GT/AG, GC/AG, and AT/AC.
The output of the program shows which alternative was used, and provides a simple score to show how well the splice point fits the model. The default behaviour is to consider the 10 nucleotides before and after the splice junctions, as well as the 4 nucleotides defining the junction, giving a total of 24 nucleotides. If they all correspond perfectly, we get a perfect score of 24.
This introduced the I, L, M, Y, and Z command line options.
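The 24-point splice score described above can be sketched as follows. This is only an illustration under stated assumptions, not the SIBsim4 code: it assumes that "corresponding" means the 10 exon bases on each side agree with the transcript and the 4 intron-boundary bases agree with the chosen junction type. The names splice_score and count_matches and the argument layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FLANK 10 /* nucleotides considered on each side of the junction */

/* Hypothetical helper: count positions where two sequences of length n agree. */
static int count_matches(const char *a, const char *b, size_t n)
{
    int matches = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] == b[i])
            matches++;
    return matches;
}

/*
 * Assumed layout (not taken from SIBsim4):
 *   dna_left  = FLANK exon bases followed by the 2 donor bases
 *   dna_right = the 2 acceptor bases followed by FLANK exon bases
 *   rna       = the 2 * FLANK transcript bases spanning the junction
 *   donor, acceptor = expected dinucleotides, e.g. "GT" and "AG"
 * Returns a score between 0 and 2 * FLANK + 4 (24 with the default flank).
 */
static int splice_score(const char *dna_left, const char *dna_right,
                        const char *rna,
                        const char *donor, const char *acceptor)
{
    assert(strlen(dna_left) >= FLANK + 2);
    assert(strlen(dna_right) >= FLANK + 2);
    assert(strlen(rna) >= 2 * FLANK);
    assert(strlen(donor) >= 2 && strlen(acceptor) >= 2);

    int score = 0;
    score += count_matches(dna_left, rna, FLANK);              /* upstream exon */
    score += count_matches(dna_left + FLANK, donor, 2);        /* donor site */
    score += count_matches(dna_right, acceptor, 2);            /* acceptor site */
    score += count_matches(dna_right + 2, rna + FLANK, FLANK); /* downstream exon */
    return score;
}
```

With a transcript window "ACGTACGTACTTGCATGCAA", a left DNA window "ACGTACGTACGT" and a right DNA window "AGTTGCATGCAA", scoring against the GT/AG junction gives 24 in this sketch; each disagreeing base lowers the score by one.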
The original sim4 tried to guess which of the two files contained the DNA and which the RNA, and allowed running one RNA sequence against a collection of DNA sequences.
SIBsim4 only allows running a collection of RNA sequences against a single genomic DNA sequence, and the sequence types are determined by their order on the command line: first comes the genomic DNA sequence, and then the collection of RNA sequences.
The original sim4 code had a limit on the total number of MSPs it would consider (around 200), which seemed to cause problems for very large genes (like titin). The program constructed a hash table of words in the DNA sequence and kept the table's buckets as simple lists, which led to very long computation times as the length of the DNA sequence grew.
SIBsim4 has a new structure to keep MSPs, and the hash table's buckets are kept as search trees, provided by the GNU libc library. This yields a large runtime improvement, and allows the use of very long, chromosome-scale genomic DNA sequences.
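As an illustration of hash buckets kept as search trees, here is a minimal sketch built on the tsearch() and tfind() routines from <search.h>, which the GNU C library provides. Whether SIBsim4 uses exactly these calls is an assumption. The table size, the word_entry_t type and the function names are hypothetical, and the sketch keeps only the first position seen for each word.

```c
#include <assert.h>
#include <search.h>
#include <stdlib.h>

#define NBUCKETS 1024u /* hypothetical table size */

/* Hypothetical entry: a packed word and one position in the DNA sequence. */
typedef struct {
    unsigned int word;
    unsigned int pos;
} word_entry_t;

/* Each bucket is the root of a tsearch() tree; NULL means an empty tree. */
static void *buckets[NBUCKETS];

/* Order entries by their packed word value. */
static int cmp_entry(const void *a, const void *b)
{
    const word_entry_t *x = a;
    const word_entry_t *y = b;
    return (x->word > y->word) - (x->word < y->word);
}

/* Insert a word; returns the stored entry (the old one if already present). */
static word_entry_t *bucket_insert(unsigned int word, unsigned int pos)
{
    word_entry_t *entry = malloc(sizeof *entry);
    if (entry == NULL)
        return NULL;
    entry->word = word;
    entry->pos = pos;

    void *node = tsearch(entry, &buckets[word % NBUCKETS], cmp_entry);
    if (node == NULL) { /* tsearch() could not allocate a tree node */
        free(entry);
        return NULL;
    }
    word_entry_t *stored = *(word_entry_t **)node;
    if (stored != entry) /* word was already in the tree */
        free(entry);
    return stored;
}

/* Look a word up; returns NULL when it is not in the table. */
static const word_entry_t *bucket_find(unsigned int word)
{
    word_entry_t key = { word, 0 };
    void *node = tfind(&key, &buckets[word % NBUCKETS], cmp_entry);
    return node != NULL ? *(word_entry_t **)node : NULL;
}
```

Compared with a plain linked list, a lookup in a crowded bucket walks a tree path instead of the whole chain, which is where the speedup on long DNA sequences would come from.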
An attempt was made to clean up the code as much as possible, and to use standard C library routines as much as possible. Thus:
The original sim4 had some code to try to remove polyA tails, which could be activated through a command line option.
SIBsim4 always tries to detect polyA tails, and can report their presence along with a polyA signal when requested through the -A command line option.
There are cases when a gene gets duplicated on the same piece of DNA. SIBsim4 was modified to try to cope with this situation, and will report all the better matches found between an RNA and the DNA. At some point, there will be command line options to specify cutoff values...
SIBsim4 will try to detect and report chimeras. It compares the best overall score obtained when the RNA element is forced to be colinear with the DNA against the score obtained when colinearity is not enforced. If the score obtained by the non-colinear alignment is significantly better, the RNA will be reported as a chimera.
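The chimera test above boils down to comparing two scores. The text does not say what "significantly better" means, so this sketch assumes a relative gain threshold. Both the function name looks_chimeric and the min_gain parameter are hypothetical and not taken from SIBsim4.

```c
#include <assert.h>

/*
 * colinear_score: best score with the RNA forced to be colinear with the DNA
 * free_score:     score obtained without enforcing colinearity
 * min_gain:       assumed relative improvement required, e.g. 0.5 for 50%
 * Returns 1 when the non-colinear alignment wins by more than min_gain.
 */
static int looks_chimeric(int colinear_score, int free_score, double min_gain)
{
    assert(min_gain >= 0.0);
    if (colinear_score <= 0) /* no usable colinear alignment at all */
        return free_score > 0;
    double gain = (double)(free_score - colinear_score);
    return gain > min_gain * (double)colinear_score;
}
```

For instance, with min_gain set to 0.5, a colinear score of 100 against a non-colinear score of 180 would be flagged, while 100 against 120 would not.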
Replace col_t type with collec_t type, to avoid compilation problems on AIX.
Complete rewrite of the splice point assignment code. A very simplified Smith-Waterman type of scoring is used to determine the optimal splice point.
Set match/mismatch scores as command line options (à la BLAST: -q and -r).
Try to improve the merging of closely consecutive MSPs. Introduced the -g command line option.
Merge together the msp_t and exon_t internal structures.
Assorted code cleanups.
Fix problem with small overlapping exons.
Better handling of small overlapping exons when linking MSPs.
Provide some debug info when the program crashes...
Fix some compilation issues with older GCC versions.
Fix some link_msps issues.
Improve detection and handling of polyA.
Improve splicing code.
Try to also use match/mismatch scores for extensions of the first and last exons.
Remove leftover unused functions, and some code cleanups.
Improve error messages.
Improve splicing code.
Some code cleanups and bug fixes.
Add -o switch to specify the base coordinate of the DNA sequence on the chromosome.
Avoid the creation of artefactual, very small exons.
Several code cleanups and code fixes.
Fix thinko in -o switch code.
Fix yet another thinko in the way LEN is printed for DNA.
Warn when there are multiple DNA sequences.
Improve polyA detection. Avoid spurious polyA.
Fix some failures to detect duplicated genes.
Do not compute K and C parameters from the sequence lengths. Use the default values, or the ones supplied by the user through command line switches.
Stop using the floor() function and requiring the math library.
Add the -c and -f command line switches. They should be primarily useful to detect duplicated genes, where the duplicates have already diverged significantly.
Add a -s option to control how potential duplicated genes are detected, instead of abusing the -f option.
While linking MSPs, if two consecutive groups of exons look as if they could be part of two different copies of the same gene, they will be tested to see if the score of each individual group relative to the best overall score is greater than the value given to -s. If both groups have a relative score above this threshold, they will be split.
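The split test for potential duplicated genes can be sketched as below. This is an illustration only: it assumes the threshold is a fraction between 0 and 1, which the text does not specify, and the function name should_split is hypothetical.

```c
#include <assert.h>

/*
 * score_a, score_b: scores of the two consecutive groups of exons
 * best_score:       best overall score, used as the reference
 * threshold:        assumed to be a fraction of best_score
 * Returns 1 when both groups score above the threshold relative to
 * the best overall score, i.e. when they would be split.
 */
static int should_split(int score_a, int score_b, int best_score,
                        double threshold)
{
    assert(best_score > 0);
    double rel_a = (double)score_a / (double)best_score;
    double rel_b = (double)score_b / (double)best_score;
    return rel_a > threshold && rel_b > threshold;
}
```

With a best score of 100 and a threshold of 0.75, groups scoring 80 and 90 would be split, while groups scoring 80 and 50 would stay linked.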
Add a few checks to try to avoid excessive runtime in some situations:
Implement chimera detection and reporting; add -H command line option.
Add -Wconversion to compiler options and clean up warnings.