Changes from sim4

Here is a rundown of the major changes with respect to the original sim4 program.

Splice points

The original sim4 program only considered the standard GT/AG splice junctions, although other junctions are known to occur. The code was thus modified to accept a configurable set of splice junctions, which can be changed through command line options. The default set is GT/AG, GC/AG, and AT/AC.

The output of the program shows which alternative was used, and provides a simple score showing how well the splice point fits the model. By default, the 10 nucleotides before and after the splice junction are considered, as well as the 4 nucleotides defining the junction, giving a total of 24 nucleotides. If they all correspond perfectly, the score is a perfect 24.
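
As a rough illustration of the 24-nucleotide score, the sketch below counts the positions at which an observed window agrees with an expected one. The window layout (10 nucleotides, then the 4 junction nucleotides, then 10 nucleotides), the explicit "expected" string, and the names window_score and junction_type are assumptions made for this example; the actual SIBsim4 scoring may well differ.

```c
#include <stddef.h>
#include <string.h>

#define FLANK    10                         /* nucleotides on each side */
#define JUNCTION 4                          /* e.g. "GT" followed by "AG" */
#define WINDOW   (FLANK + JUNCTION + FLANK) /* 24 nucleotides in total */

/* Default junction set named in the text: GT/AG, GC/AG, AT/AC. */
static const char *const default_junctions[] = { "GTAG", "GCAG", "ATAC" };

/* Count the positions where the observed window agrees with the expected
 * one.  Returns -1 if either string is shorter than the window. */
int window_score(const char *observed, const char *expected)
{
    int score = 0;
    size_t i;

    if (strlen(observed) < (size_t) WINDOW || strlen(expected) < (size_t) WINDOW)
        return -1;
    for (i = 0; i < (size_t) WINDOW; i++)
        if (observed[i] == expected[i])
            score++;
    return score;
}

/* Return the index in default_junctions of the junction found in the
 * middle of the window, or -1 if it is none of them. */
int junction_type(const char *observed)
{
    size_t j;

    if (strlen(observed) < (size_t) WINDOW)
        return -1;
    for (j = 0; j < sizeof default_junctions / sizeof default_junctions[0]; j++)
        if (memcmp(observed + FLANK, default_junctions[j], JUNCTION) == 0)
            return (int) j;
    return -1;
}
```

With these definitions, a window identical to its expected counterpart scores 24, and each disagreement lowers the score by one.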

This introduced the I, L, M, Y, and Z command line options.

No guesswork about DNA/RNA

The original sim4 tried to guess which of the two files contained the DNA and which the RNA, and allowed running one RNA against a collection of DNA sequences.

SIBsim4 only allows running a collection of RNA sequences against a single genomic DNA sequence, and the sequence types are specified by their order on the command line: the genomic DNA sequence comes first, followed by the collection of RNA sequences.

Modified MSP computation

The original sim4 code limited the total number of MSPs it would consider (to around 200), which seemed to cause problems for very large genes (like titin). The program also built a hash table of the words in the DNA sequence and kept the table's buckets as simple lists, which leads to very long computation times as the length of the DNA sequence grows.

SIBsim4 has a new structure to hold MSPs, and the hash table's buckets are kept as search trees, as provided by GNU libc. This greatly improves runtime speed, and allows the use of very long, chromosome-scale genomic DNA sequences.
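
To make the idea of tree-backed buckets concrete, here is a minimal sketch that assumes the search trees in question are those of the tsearch() family declared in <search.h>. The word length, the table size, the hash function, and all the names (word_entry, index_word, find_word, and so on) are made up for the example and are not taken from the SIBsim4 sources.

```c
#include <search.h>
#include <stdlib.h>
#include <string.h>

#define WORD_LEN  4    /* hypothetical word size, kept small for the example */
#define N_BUCKETS 64   /* hypothetical table size */

struct word_entry {
    char   word[WORD_LEN + 1];
    size_t first_pos;  /* first position of the word in the DNA */
    size_t count;      /* number of occurrences seen so far */
};

/* Each bucket is the root of a binary search tree managed by tsearch(). */
static void *buckets[N_BUCKETS];

static unsigned hash_word(const char *w)
{
    unsigned h = 0;
    size_t i;

    for (i = 0; i < WORD_LEN; i++)
        h = h * 31u + (unsigned char) w[i];
    return h % N_BUCKETS;
}

static int cmp_entry(const void *a, const void *b)
{
    return strcmp(((const struct word_entry *) a)->word,
                  ((const struct word_entry *) b)->word);
}

/* Record one occurrence of the WORD_LEN-letter word starting at `word`. */
int index_word(const char *word, size_t pos)
{
    struct word_entry *e, *found;
    void *node;

    e = calloc(1, sizeof *e);
    if (e == NULL)
        return -1;
    memcpy(e->word, word, WORD_LEN);
    e->first_pos = pos;
    e->count = 1;
    node = tsearch(e, &buckets[hash_word(e->word)], cmp_entry);
    if (node == NULL) {
        free(e);
        return -1;
    }
    found = *(struct word_entry **) node;
    if (found != e) {          /* word already present in this bucket's tree */
        found->count++;
        free(e);
    }
    return 0;
}

/* Index every word of a DNA string. */
int index_sequence(const char *dna)
{
    size_t len = strlen(dna), i;

    for (i = 0; i + WORD_LEN <= len; i++)
        if (index_word(dna + i, i) != 0)
            return -1;
    return 0;
}

/* Look a word up; returns NULL when it does not occur. */
const struct word_entry *find_word(const char *word)
{
    struct word_entry key;
    void *node;

    memset(&key, 0, sizeof key);
    memcpy(key.word, word, WORD_LEN);
    node = tfind(&key, &buckets[hash_word(key.word)], cmp_entry);
    return node ? *(struct word_entry **) node : NULL;
}
```

When many words land in the same bucket, a lookup in a tree costs far fewer comparisons than walking a list, which is the kind of gain described above for long DNA sequences.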

Misc fixes

An attempt was made to clean up the code as much as possible, and to use standard C library routines wherever possible. Thus:

many global variables were replaced by local ones
memory allocation is now handled by the usual xmalloc and friends
command line options are parsed using getopt
more debugging output is produced when compiled in DEBUG mode
memory leaks were fixed

PolyA tails handling

The original sim4 had some code to try to remove polyA tails, which could be activated through a command line option.

SIBsim4 always tries to detect polyA tails, and can report their presence along with a polyA signal when requested through the -A command line option.
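
The simplest possible form of this detection can be sketched as follows: measure the run of A's ending the RNA, then look for the canonical AATAAA hexamer upstream of it. This is only an illustration under those assumptions; the function names are invented, and the real code is presumably more tolerant (for instance of mismatches inside the tail).

```c
#include <stddef.h>
#include <string.h>

/* Length of the run of 'A' (either case) that ends the sequence. */
size_t polya_tail_length(const char *rna)
{
    size_t len = strlen(rna), n = 0;

    while (n < len && (rna[len - 1 - n] == 'A' || rna[len - 1 - n] == 'a'))
        n++;
    return n;
}

/* Offset of the last AATAAA hexamer lying entirely within the first
 * body_len nucleotides, or -1 if there is none. */
long find_polya_signal(const char *rna, size_t body_len)
{
    size_t i;

    if (body_len < 6)
        return -1;
    for (i = body_len - 5; i-- > 0; )
        if (memcmp(rna + i, "AATAAA", 6) == 0)
            return (long) i;
    return -1;
}
```

A caller would typically pass the sequence length minus the tail length as body_len, so that the tail itself is not mistaken for part of the signal.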

Handle duplicated genes

There are cases where a gene is duplicated on the same piece of DNA. SIBsim4 was modified to try to cope with this, and will report all the best matches found between an RNA and the DNA. At some point, there will be command line options to specify cutoff values...

Chimera detection

SIBsim4 will try to detect and report chimeras. It compares the best overall score obtained when forcing the RNA element to be colinear with the DNA with the score obtained when colinearity is not enforced. If the non-colinear alignment scores significantly better, the RNA is reported as a chimera.
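
The decision can be pictured with a small sketch. What counts as "significantly better" is not specified here, so the rule below (the gain must exceed a given fraction of the colinear score) and the names looks_chimeric and min_gain are assumptions made purely for illustration.

```c
/* Return 1 when the non-colinear score beats the colinear one by more
 * than min_gain times the colinear score, 0 otherwise.
 * min_gain is a hypothetical tuning parameter, e.g. 0.2 for 20%. */
int looks_chimeric(int colinear_score, int free_score, double min_gain)
{
    if (free_score <= colinear_score)
        return 0;              /* relaxing colinearity did not help */
    if (colinear_score <= 0)
        return 1;              /* only the non-colinear alignment scores */
    return (double) (free_score - colinear_score)
           > min_gain * (double) colinear_score;
}
```

With a min_gain of 0.2, a colinear score of 100 would need a non-colinear score above 120 for the RNA to be flagged.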

Changes between SIBsim4-0.5 and SIBsim4-0.6

Replace the col_t type with the collec_t type, to avoid compilation problems on AIX.

Complete rewrite of the splice point assignment code. A very simplified Smith-Waterman type of scoring is used to determine the optimal splice point.

Make the match/mismatch scores settable through command line options (à la BLAST: -q and -r).

Try to merge closely consecutive MSPs better. Introduced the -g command line option.

Merge together the msp_t and exon_t internal structures.

Assorted code cleanups.

Changes between SIBsim4-0.6 and SIBsim4-0.6.1

Fix problem with small overlapping exons.

Changes between SIBsim4-0.6.1 and SIBsim4-0.6.2

Better handling of small overlapping exons when linking MSPs.

Provide some debug info when the program crashes...

Changes between SIBsim4-0.6.2 and SIBsim4-0.6.3

Fix some compilation issues with older GCC versions.

Fix some link_msps issues.

Changes between SIBsim4-0.6.3 and SIBsim4-0.7

Improve detection and handling of polyA.

Improve splicing code.

Try to also use match/mismatch scores for extensions of the first and last exons.

Remove leftover unused functions, and some code cleanups.

Changes between SIBsim4-0.7 and SIBsim4-0.8

Improve error messages.

Improve splicing code.

Some code cleanups and bug fixes.

Changes between SIBsim4-0.8 and SIBsim4-0.9

Add -o switch to specify the base coordinate of the DNA sequence on the chromosome.

Avoid the creation of artefactual, very small exons.

Several code cleanups and fixes.

Changes between SIBsim4-0.9 and SIBsim4-0.10

Fix thinko in the -o switch code.

Changes between SIBsim4-0.10 and SIBsim4-0.11

Fix yet another thinko in the way LEN is printed for DNA.

Changes between SIBsim4-0.11 and SIBsim4-0.12

Warn when there are multiple DNA sequences.

Improve polyA detection. Avoid spurious polyA.

Changes between SIBsim4-0.12 and SIBsim4-0.13

Fix some failures to detect duplicated genes.

Do not compute the K and C parameters from the sequence lengths. Use the default values, or the ones supplied by the user through command line switches.

Stop using the floor() function and requiring the math library.

Changes between SIBsim4-0.13 and SIBsim4-0.14

Add the -c and -f command line switches. They should be primarily useful to detect duplicated genes, where the duplicates have already diverged significantly.

Changes between SIBsim4-0.14 and SIBsim4-0.15

Add a -s option to control how potential duplicated genes are detected, instead of abusing the -f option.

While linking MSPs, if two consecutive groups of exons look like they could be part of two different copies of the same gene, they are tested to see whether the score of each individual group, relative to the best overall score, is greater than the value given with -s. If both groups have a relative score above this threshold, they are split.
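
The test just described can be written down in a few lines. The sketch assumes that the threshold is expressed as a fraction between 0 and 1 and that scores are plain integers; the actual unit of the -s value and the name should_split are assumptions.

```c
/* Return 1 when both groups of exons score above the threshold relative
 * to the best overall score, meaning they should be split into two
 * separate copies; return 0 otherwise. */
int should_split(int score_a, int score_b, int best_score, double threshold)
{
    if (best_score <= 0)
        return 0;  /* no meaningful reference score */
    return (double) score_a / best_score > threshold
        && (double) score_b / best_score > threshold;
}
```

For instance, with a best overall score of 100 and a threshold of 0.75, groups scoring 80 and 90 would be split, while groups scoring 80 and 50 would stay together.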

Changes between SIBsim4-0.15 and SIBsim4-0.16

Add a few checks to try to avoid excessive runtime in some situations:

keep track of RNA coverage when looking for duplicated genes
check multiple coverage before looking for duplicates
avoid looking for duplicated genes when the global score is bad anyway
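
One way to picture the first two checks is a per-nucleotide depth counter over the RNA, as in the sketch below. The data structure and the function names are hypothetical; they only illustrate how coverage, and multiple coverage in particular, could be tracked cheaply.

```c
#include <stddef.h>
#include <stdlib.h>

struct coverage {
    unsigned char *depth;  /* number of exons covering each RNA position */
    size_t len;
};

/* Allocate a zeroed depth array; returns 0 on success, -1 on failure. */
int coverage_init(struct coverage *c, size_t rna_len)
{
    c->depth = calloc(rna_len ? rna_len : 1, 1);
    c->len = rna_len;
    return c->depth == NULL ? -1 : 0;
}

/* Mark RNA positions [from, to) as covered once more (saturating at 255). */
void coverage_add(struct coverage *c, size_t from, size_t to)
{
    size_t i;

    if (to > c->len)
        to = c->len;
    for (i = from; i < to; i++)
        if (c->depth[i] < 255)
            c->depth[i]++;
}

/* Number of RNA positions covered at least twice. */
size_t coverage_multiple(const struct coverage *c)
{
    size_t i, n = 0;

    for (i = 0; i < c->len; i++)
        if (c->depth[i] >= 2)
            n++;
    return n;
}

void coverage_free(struct coverage *c)
{
    free(c->depth);
    c->depth = NULL;
    c->len = 0;
}
```

If hardly any position is covered more than once, there is little reason to start the more expensive search for duplicated genes.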

Changes between SIBsim4-0.16 and SIBsim4-0.17

Implement chimera detection and reporting; add -H command line option.

Add -Wconversion to the compiler options and clean up the resulting warnings.