phyloGeoRef

Tuesday, August 23, 2011

The End of Coding Period

Hi all,
I could not blog during the last few weeks because I had to move back to my college so I couldn't find extra time to blog.
The recent weeks were fun too like all others. I was able to prepare a sufficiently lucid and simple manual for the project which is kept in the manuals folder. I was also able to generate a simple GUI for running the project which I believe would be a great help for the non-java users.
Using the library is very easy, even a java beginner should be able to generate kmls after reading and following the instructions in the manual.

This summer of code was a most remarkable experience for me, I hope that I continue contributing to the open source community in future too.

~apurv

Thursday, July 28, 2011

Sign of 180 Resolution

This is for the case when maxDistance (the distance between maximally separated nodes is 180).

Then there arises an ambiguity is choosing the globalMaxAngle as +180 or -180 in the transformed positions. And it does make a difference to the answer.

Consider for example the case of 3 positions. 10, -170 and -110.

The maximally separated nodes are 10 and -170. Suppose 10 is chosen as origin and clockwise direction is positive. Then

Case 1: -170 ----> 180 (where ----> denotes transformation)

mean = (-120 + 180 + 0 )/3 = 20

Case 2: -170 ----> -180

mean = (-120 + -180 + 0)/3 = -100

-100 seems to be a better choice because migration distances are smaller this way and it adheres to my policy of "keep the parent node at a position within the bounds of the end nodes"

To achieve this effect, I can count which hemisphere has more number of nodes. If the negative hemisphere has more nodes then its -180 and vice versa.

How to find whether a point is in the +ive or -ive hemisphere, that is given in the last post.

Wednesday, July 27, 2011

Transformation Rules

Once I find the two positions which are maximally separated namely globalAngleZero and globalAngleMax, I need to transform the origin to globalAngleZero with positive sense towards the shortest path direction towards globalAngleMax.

maxDistance is the distance between these two positions.

Case 1: maxDistance = 180

For all position in posVector

   let L = minimumAngularDistance(position, globalAngleZero)

         if (position >= globalAngleZero OR position <=globalAngleMax)
do nothing

         if(position < globalAngleZero OR position > globalAngleMax)
               L = -L (invert the sign)

         sigma += L

meanPos = sigma/n (where n is the length of posVector)

The actual mean position is obtained by adding meanPos to globalAngleZero in a positive sense.

Case 2: maxDistance < 180

For all position in posVector

        let L1 = minimumAngularDistance(position, globalAngleZero)
        let L2 = minimumAngularDistance(position, globalAngleMax)

        if (L1 + L2 = maxDistance) (if this point lies inside the region bounded by end points)
              sigma = sigma + L1

        else
             sigma = sigma - L1

meanPos = sigma/n (where n is the length of posVector)

The actual mean position can be obtained by adding meanPos to globalAngleZero in clockwise and anti-clockwise sense and that position is choosen for which L1+L2= maxDistance condition is satisfied.

Tuesday, July 26, 2011

This is hard !!!

So I encountered some problems in my centroid calculation algorithm, which I am trying to correct but I really wanted to see why am I implementing the algorithm that I have mentioned in the last to last post rather than the naive algorithm.

This gives a time complexity of O(n log n) over naive O(n^2) which would have certainly been easier. But I think this would save appreciable computational cycles.

Assuming that we are working with 4000nodes. And considering the average number of children a node has to be 8. Then here are the comparisons.

Naive Algorithm.
64 X 4000 = 256,000

New algorithm.
24 X 4000 = 96,000

Here are the steps in the new in-centroid calculation.

Get the maximally separated nodes.
(This is the part where I really used some clever techniques.)
Choose one of them as zero and the other's direction as the positive x-axis.
Transform the coordinates.
Calculate the normal mean on the transformed coordinates.
Transform back.

This is difficult to code, requires extreme patience and there are so many cases ; I need to finish it fast.

Saturday, July 23, 2011

Failure of a simple weighted mean algorithm

I will demonstrate this with an example. The thing that I was trying to do was to make the distances exactly proportional to the edge lengths. This is not possible always.

Take a very simple example in 2D, i.e a plane.

Suppose you have 4 points A,B,C and D on the plane and you want to find out a point such that the distances of these points from that central point is in the same proportions.
It is very easy to see that this is not always possible. Well to start you can choose any 3 points and draw a circle through them but now you got yourself in a problem. It is not necessary that D is also concyclic to the same circle.
Further things in spherical geometry make things even worse.

I need some other idea to use/represent this information. Had it been an animation then things would have been different.

Wednesday, July 20, 2011

Things Done and Things Undone !

Implemented denotes what is already implemented in phyloGeoRef.
TODO: demotes that things that are must to be done.
Planned: Denotes a change in implementation or adding additional features.

TREE FORMATs (Files containing the tree data)

Implemented: newick (.nwk) , nexml (.xml) , phyloxml (.xml) , nexus (.nex), (.tolxml)
METADATA FORMATs (Files which contain the spatial data)

Implemented: csv, txt
NODE POSITIONING

Implemented: Simple Mean
Planned: Weighted Mean for trees with edge lengths
Bug: Incorrect positioning algorithm. Modify it and make it efficient.
TUTORIAL
Posted: https://github.com/dapurv5/phyloGeoRef/wiki/Tutorial:-Using-the-GrandUnifiedReader,-adding-new-properties.
Planned: Tutorial on writing the main class. (1/2 day)
ERROR CHECKING:
Implemented: Error checking for implausible lat/long
Implemented: Error due to absence of information given in the tree but not in the metadata file.
Planned: Checking corner cases like when all the child nodes of a parent node have missing locations.
Planned: Checking clade consistency.
WRITING KML
Implemented: Display of tip nodes on the map and the skeletal structure of the tree.
Implemented: Levelwise slicing of the tree in folders.
Planned: Hierarchical slicing of the tree into folders.
TODO: HTML balloons associated with each node.
Planned: Associating regions with HTUs also.
Planned: Animation.
Planned: Compressing the kml file as kmz
EXTRA DATA
Planned: Taking images and other rich data from external web services.
Planned: Preparing a local cache for this data since downloading data for thousands of species on each run of the program would be a time wasting process.
Planned: Writing a command line option parser for the main class.

Pseudo Code for the new Reconstruction Algorithm

When we traverse the tree post order to assign coordinates to each of the internal nodes. Do this for each of the internal nodes.

Create 4 buckets 1 for each quadrant on a circle.
Also create four flags 1 for each quadrant. A true value of a flag for a quadrant indicates the presence of atleast one node in that quadrant.

Step 1:

for each child node, child of parent node.
Putting the node in the appropriate bucket depending on its quadrant.
Mark the flag true for that quadrant.

Step 2:

Sort each bucket so that the minimum values are towards the top or the beginning in the bucket. Use the sorting algorithm discussed in the previous post. If you have to sort negative numbers sort mod of them and then reverse the order.
It is to be noted that since you are sorting only within a bucket all elements within the same bucket have the same sign. This takes O(d) time where d is the maximum degree of any node in the phylogeny.

Step 3:

def distance (bucket a, bucket b):
the maximum distance between any two nodes in these buckets.

maxDistance = distance(bucket 1, bucket 3)
maxDistance = max { maxDistance, distance(bucket 2, bucket 4) }

maxDistance = max { maxDistance, distance(bucket 1, bucket 2)}
maxDistance = max { maxDistance, distance(bucket 2, bucket 3)}
maxDistance = max { maxDistance, distance(bucket 3, bucket 4)}
maxDistance = max { maxDistance, distance(bucket 4, bucket 1)}

I was trying to find out a general formula interms of variables that saves all this but my attempts in that direction didn't yield any feasible results. So now I believe that no one universal formula exists which can capture the centroid location for all the nodes on the globe so therefore these elaborations.

Once we find the 2 nodes between which we have the maximum distance we can transform the coordinates so as to make the coordinate of the one of these nodes zero.