Thursday 21 August 2014

The final GSOC post

Since I had delayed the last post, I am combining two of my blog posts into one, and I am going to use it to give you an update on everything that has happened in this GSOC (with special emphasis on the last four weeks).

So the first question is: where am I right now? In Mumbai. In my college. A stupid answer!
I have finished all the weekly goals of the GSOC project apart from a few things:

1) I wanted to implement at least one of the optimal triangulation algorithms (using the research paper). The particular algorithm I had to implement was decided pretty late, and I didn't find time to implement it towards the end, so it is still pending. I hope to do it soon.

2) The alpha-expansion algorithm is due. I have written the code, but there seems to be a bug. I will debug it and push the code soon.
Also, some code reviews are pending, so I am expecting some more work on the code-improvement front. But other than that, the work I was officially expected to do in GSOC is over.
Now that I am done with my status, let me talk a little bit about what has happened since my last blog post (since the virtual time of that post, if you will). For those who read my last post: I was slightly stuck integrating the factor product code with Cython. However, I soon realized that Cython can take in Python objects directly (duh), so I easily modified the factor product to work seamlessly with the extra data stored in factors. (This extra data is the assignment of the eliminated variables, i.e. the variables that were eliminated while maximizing, corresponding to every combination of the remaining variables.)
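
To make that bookkeeping concrete: each time a variable is maximized out, you remember, for every remaining combination, the value the eliminated variable took. A minimal numpy sketch of one elimination step (my own toy names, not the actual pgmpy/Cython code):

```python
import numpy as np

def max_marginalize(variables, values, var):
    """Maximize `var` out of a factor, remembering the maximizing value.

    variables : list of variable names, one per axis of `values`
    values    : ndarray of (unnormalized) potentials
    Returns the remaining variables, the reduced table, and an array
    holding the value `var` took for each remaining combination.
    """
    axis = variables.index(var)
    best = values.argmax(axis=axis)     # assignment of the eliminated variable
    reduced = values.max(axis=axis)     # potentials after elimination
    remaining = [v for v in variables if v != var]
    return remaining, reduced, best

# maximizing 'A' out of a 2x2 factor over ('A', 'B')
rem, red, best = max_marginalize(['A', 'B'], np.array([[1, 5], [4, 2]]), 'A')
# rem == ['B'], red == [4, 5], best == [1, 0]
```

Decoding a full MAP assignment is then just a walk back through these `best` arrays in reverse elimination order.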
Anyway, once I was done with that, the next task was the implementation of inference algorithms for specific special settings. One of them was the graph-cut algorithm for MAP in pairwise binary MRFs with submodular potentials. This particular algorithm requires a max-flow/min-cut subroutine, and I was really surprised to see that networkx didn't have a function returning the exact cut edges which I needed. I could have used an external Python library, but that would have created an extra dependency just for one function. So we (my mentors and I) decided that it would be best to implement the max-flow algorithm myself and use it in this algorithm. So I implemented max-flow first and then the graph-cut algorithm on top of it.

However, I was slightly busy for a week after this with resume submission deadlines and so on at the institute, and after that I came back to the GSOC work. I implemented the Gibbs sampling algorithm first and then moved on to the alpha-expansion algorithm. While implementing it, I suddenly realized that alpha-expansion might not be too helpful for practical purposes, because there are just too many constraints (all the variables must have equal cardinality, for instance). Anyway, I implemented it but have not pushed it yet, as there are a few issues to resolve in this function.
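
The Gibbs sampler itself is short to state: repeatedly resample each variable from its conditional given the current values of its neighbours. A self-contained sketch for a binary pairwise MRF, using my own toy representation rather than the pgmpy interface:

```python
import random

def gibbs_sample_binary_mrf(edges, unary, pairwise, n_steps, seed=0):
    """Run one Gibbs chain over a binary pairwise MRF.

    edges    : list of (u, v) pairs
    unary    : dict node -> [phi(0), phi(1)] unnormalized potentials
    pairwise : dict (u, v) -> 2x2 table, pairwise[(u, v)][x_u][x_v]
    """
    rng = random.Random(seed)
    nodes = sorted(unary)
    nbrs = {n: [] for n in nodes}
    for u, v in edges:
        nbrs[u].append((v, (u, v), 0))   # u sits on axis 0 of this table
        nbrs[v].append((u, (u, v), 1))   # v sits on axis 1
    state = {n: rng.randint(0, 1) for n in nodes}
    for _ in range(n_steps):
        for n in nodes:
            # unnormalized conditional p(x_n | everything else)
            w = [unary[n][0], unary[n][1]]
            for m, e, axis in nbrs[n]:
                for x in (0, 1):
                    w[x] *= (pairwise[e][x][state[m]] if axis == 0
                             else pairwise[e][state[m]][x])
            state[n] = 0 if rng.random() * (w[0] + w[1]) < w[0] else 1
    return state

# a two-node model with uniform unaries and an attractive coupling
sample = gibbs_sample_binary_mrf(
    [('x', 'y')],
    {'x': [1.0, 1.0], 'y': [1.0, 1.0]},
    {('x', 'y'): [[10.0, 1.0], [1.0, 10.0]]},
    n_steps=100)
```

With the attractive coupling above, the two variables spend most of their time agreeing, which is the behaviour you would expect from a sampler on this model.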
Once I was done with that, I spent some time on code review, examples, test cases and documentation. (Well, this is the part that I didn't like. While writing examples, I just had to copy stuff over, change the format, etc., and it was one of the most boring things I did as a part of GSOC (which kind of tells you how interesting GSOC was for me in general). I don't know if there is a better way of doing this, but copy-pasting examples, removing "self." and reformatting doesn't look like the kind of work that should have no alternative in the CS world.) I tried to find out more about it but couldn't find better methods. If you know of something, please tell me too.

I plan to write another post about the stuff I learned from GSOC. However, I think it is best written after a week or so of reflection. :)

That's all for now.
 Navin

Thursday 7 August 2014

Two weeks after mid-term eval

Oh man, Florian Fuchs is really going to kill me for this, but this blog post had completely slipped my mind. Anyway, here I am, writing it long after it was supposed to be written. I will just write about the experiences I had till the 13th (they were interesting experiences, for sure), and I will pretend throughout the post that it is the 13th today. :P

So, starting from where I left off in the last post: I was left with a broken factor function and some frustration arising from it. I had posted the issue on our group, but Shashank and Ankur were busy with other stuff and couldn't find time to solve it. So in the last GSOC chat, Shashank told me to write a new factor product function myself. He thought it would be a good Cython exercise for me, and also that it would be very difficult for me to parse the code he had initially written for the factor product function.

Also, I decided to include a new list to store the MAP values along with the numpy array we had for storing the potential values. (Corresponding to each value, there would be the values the other variables had taken during elimination.) There were a lot of tough design decisions here (how to store it: list vs. dictionary? Is a list fine?). In the end, I implemented each of the entries as a list of the eliminated variables and stored the entire thing in a list. I am not really sure what to do about the huge expense of storing it this way. For later.

After this, I pinged Shashank again about the product module, since it was making testing very difficult for me. So Shashank told me to implement it myself, since it would be a very good exercise in learning Cython for me too. I implemented this in Cython and it was working fine. However, there was one big issue: how to handle the eliminated-variable values in the Cython function, because they were stored in Python lists and we could only pass C++ data structures to the Cython functions. I am on the lookout for possible ways to handle this.

Anyway, that's all for now. (from the point of view of 13th August).

Thursday 31 July 2014

Halfway into GSOC

Edit: I just realized today that I had not published this post. How dumb!

Well, the mid-way deadline into GSOC just went by. It has been a fun journey till now. So first, let me begin with the crucial updates.

So as I had mentioned in my last post, I had begun working on the message passing framework. The entire thing looked very simple initially, but coding it up was not so simple. (It is very nice to visualize the message passing algorithm working, but coding it is a separate beast altogether.) Anyway, I did code it up with pleasure and finished the basic version of message passing, which can at least find the normalization constant. However, like every other piece of code in this world, it wasn't running as expected on the first go. Well, no big deal! Like every other bug. However, after a lot of effort (and a lot of pages wasted working things out on paper), I realized that the bug was in one of the pgmpy library functions (the function which multiplies two factors together).
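For reference, the factor product itself is not much code when written with numpy broadcasting: each factor's table is aligned to the union of the variables, and the aligned tables are multiplied elementwise. A hypothetical sketch, not pgmpy's implementation:

```python
import numpy as np

def factor_product(vars1, vals1, vars2, vals2):
    """Multiply two factors given as (variable list, potential table).

    Each axis of a table corresponds to one variable, in order; shared
    variables are matched by name.
    """
    all_vars = list(vars1) + [v for v in vars2 if v not in vars1]

    def align(vs, vals):
        # permute the axes into all_vars order, then insert size-1 axes
        # for the missing variables so numpy broadcasting lines them up
        order = [vs.index(v) for v in all_vars if v in vs]
        shape = [vals.shape[vs.index(v)] if v in vs else 1 for v in all_vars]
        return np.transpose(vals, order).reshape(shape)

    return all_vars, align(vars1, vals1) * align(vars2, vals2)

variables, table = factor_product(['A', 'B'], np.arange(6).reshape(2, 3),
                                  ['B', 'C'], np.ones((3, 2)))
# variables == ['A', 'B', 'C'] and table has shape (2, 3, 2)
```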

So the next obvious step would be to fix the bug. However, this part of the code was not commented really well, and hence I thought it would be best to ask Ankur and Shashank to solve the issue. I have posted it on the group and am hoping for a quick solution. What this means is that I really can't proceed much further without a product function (because it is a very crucial function for the upcoming things).

In any case, I have also written up the function to find the marginal probabilities. (However, again, this is untested. How do I test it, when it too is heavily dependent on the product function?) I have run it with some test cases, and the code is at least running. ;)

So now, the next task will be to quickly test the above functions and then be done with the MAP functions. (There are some issues with this, though. The factor will require a special representation, because along with each combination of values of the variables, we will also need to store the values of the eliminated variables: the values these variables took for the purpose of maximization.)

Till 13th July, bye!

Wednesday 18 June 2014

Three weeks into Coding - Progress / Roadblocks

Well, this was supposed to be written when I was two weeks into coding, but it slipped my mind, and my mentor has kindly reminded me to write it. It's about to be four weeks now, but I will just talk about the experiences of my first three weeks.

So when I had written the last blog post, I hadn't even begun the coding for GSOC and hence was all excited to start. And now, it's been three weeks. I can't use the cliched phrase "It's been a long journey", but it has been a nice journey. Let me first describe the things I have done by now, before going into more philosophical discussions.

Beginning with Week 1: the task for the week was to implement the framework for Markov networks. I had to define all those little functions for adding nodes, removing edges, setting observations, accessing potentials, etc. I had to take care to ensure that we don't pass integers as node names, that we don't access observations that haven't been set, and so on. Well, to be fair to the previous developers of pgmpy, I copied quite a few things from the code for BayesianModels. So, thanks, previous developers of pgmpy!

One mistake I made in this process was writing my own class for representing factors, when one already existed in the only module I had overlooked while going through the code. Ankur told me about it, and I had to change my code to work with the existing Factor class.

The next week was supposed to be spent building the independence relations for Markov networks. However, I felt like working on triangulation, junction trees and so on, and hence started on that module instead. The coding part of this was very interesting, as it involved implementing algorithms. I implemented a lot of things, starting from checking whether a graph is triangulated, to triangulation heuristics, to implementing the JunctionTree class and writing functions to create the junction tree from the triangulated graph.

So there is a long series of events which happened when I submitted the PR. First, I had not commented the code or added docstrings, because I expected a lot of review comments about changing the interface, and I thought it would be best to postpone the commenting until after the interface and code were fixed. However, I received a (rather angry) mail from Abinash asking me why I hadn't commented the code, and I gave him my reasons. He also told me that many of the functions I had written were already in the standard networkx package. Ah, I was heart-broken. All the effort spent writing the maximum spanning trees, finding the cliques and checking whether the graph was triangulated had just gone to waste. I might accept my mistake in not using the maximum spanning tree code of networkx (but then, it was 15 lines, and I would have written that code at least 3 times before), but who would imagine networkx hiding such specific functionality (finding cliques in chordal graphs, finding the treewidth of chordal graphs, etc.) inside its deep seas of code, disguised as free functions you wouldn't even find if you looked for methods of a graph object? Anyway, all that code had to be replaced by these library functions for maintainability etc.
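For anyone who lands here with the same problem: the functions in question are module-level in networkx rather than methods on the graph object, which is why they are so easy to miss. A quick sketch of the ones I mean, plus the maximum-spanning-tree step of building a junction tree from the cliques:

```python
import networkx as nx

# a chordal graph: two triangles sharing the edge (a, c)
G = nx.Graph([('a', 'b'), ('b', 'c'), ('c', 'a'), ('c', 'd'), ('d', 'a')])
assert nx.is_chordal(G)

cliques = list(nx.chordal_graph_cliques(G))   # maximal cliques, as frozensets
width = nx.chordal_graph_treewidth(G)         # max clique size minus one

# junction tree: maximum spanning tree of the clique graph, where each
# edge is weighted by the size of the separator (clique intersection)
cg = nx.Graph()
for i, ci in enumerate(cliques):
    for cj in cliques[i + 1:]:
        if ci & cj:
            cg.add_edge(ci, cj, weight=len(ci & cj))
jt = nx.maximum_spanning_tree(cg)
```

On this graph the two cliques are {a, b, c} and {a, c, d}, the treewidth is 2, and the junction tree is the single edge between the two cliques with separator {a, c}.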

Once I was done with this, I started commenting the code, adding docstrings and writing test cases. When I did that, I had a big realization: copying code was easy (from the Bayesian model framework to the Markov model), but documentation and testing wouldn't be, as I would actually have to do all of that myself. (Well, just because I mention copying code, don't assume that I copied all of it. I only copied small functions.)

This took quite some part of week 3. Oh man, documentation and testing have two really bad features: 1. they take a lot of time; 2. they are slightly boring to do. But then, as Ankur said, "I will realize the importance of testing and commenting".

Anyway, once I was done with that, I started working on the message passing algorithm and setting up the entire framework, and I am still on it. The next task after this will be to write up various inference algorithms using junction trees (and yes, to document and test them too :P). Hope I have a good time ahead, as I have been having till now.

Signing off for now.
Navin


Sunday 18 May 2014

The beginning of the Summer of Code

This is my first post, and hence I begin by congratulating myself on getting selected for the Google Summer of Code. Let me describe the organization I got selected for: pgmpy. The "pgm" in pgmpy stands for probabilistic graphical models, and the "py" stands for Python. You would have guessed it by now, but still: it is an organization under the PSF (Python Software Foundation) and works on probabilistic graphical models. Here is the link to the organization: https://github.com/pgmpy/pgmpy . Sadly, it still doesn't have a web page of its own, but the GitHub page will do. It is a relatively new organization and has very little code right now compared to many other organizations (a lot of previous GSOCers said it took them quite a while to get familiar with just the part of the code relevant to them).

Anyway, let's get to what I will be doing. My task is basically to implement Markov networks. After implementing the basic architecture, I will be implementing a few algorithms. Take a look at the current version of the timeline here (we decided to make a few changes here and there to the timeline I had initially proposed, though they would hardly be interesting for you): https://docs.google.com/document/d/1WRuiEUgiuwWMuxezTy-YXqSiVILP2Q1lQ50zU6kib6g/edit?usp=sharing

Now, the big question is what I have been doing from the 19th of April till today (which is approximately a month). We were not supposed to start coding yet, as this is still the community-bonding period, during which we are supposed to familiarize ourselves with the code, become friendly with the mentors, get to know people in the org, and do any preparatory work necessary before the coding begins. So I haven't started coding up stuff yet. However, I have been doing some cool stuff.

To begin with, I had my end-sem exams till the 30th of April, so I was quite inactive till then. We had an IRC chat on the 28th of April, where we had a brief chat about things and set a few plans for the community-bonding period. We also chatted a bit about IDEs, Cython and testing techniques, among other things. After the exams, I lazed away for a while, and from about the 3rd of May I started reading up on the algorithms I would be implementing. I did a course on Advanced Machine Learning this semester; however, I thought it would be a good idea to go through the algorithms from the book (the standard book on probabilistic graphical models, of course, the one and only: Probabilistic Graphical Models by Koller and Friedman).

We had the next IRC chat on the 10th of May (oh, we had decided in the previous chat that we would be having these chats every Saturday, starting from the 10th). It was more of a progress report and a clearing of doubts, if any, and we discussed plans for the subsequent weeks. As per the plan, I went through the software-preparation part of my project: I familiarized myself with Cython and the PyCharm IDE, and configured Vim to work well with Python (that was quite a struggle, but then, anything for Vim :D). I also took a look at the existing code so that I could use parts of it in my implementation. (I think I also found a bug, for which I sent a pull request.)

And that brings us to today. We had another IRC chat yesterday, and my task for the next week is setting up the Markov networks class: writing the basic class implementation, testing it and so on (look at the timeline). But the good news is that coding starts now. :)