Monday, January 13, 2014

Is Windows a good development platform?


    A couple of days ago I had a conversation with a colleague about whether Windows is a good platform for development or not. I was frustrated with some of Microsoft's software at that point, and I was arguing that even though developers are the power behind Windows as a platform, they were never treated as well as on Linux (or Mac now).
    On one hand, Windows has an arguably best-in-class IDE for C/C++, C# and Visual Basic - Visual Studio. Note that these are only three out of dozens of mainstream languages. And it is closed to external expansion - Microsoft didn't build a system like Eclipse, where others can add support for other languages. Would that add a lot of work for MS developers? Not really, if they architected VS in the right way - it's already a pluggable system (it ships a number of different languages with different syntaxes, so it just makes sense to have an abstract core where each language is an add-on), so it's just a matter of publishing that API (and yes, supporting it in future versions).
    Having different programming languages in Visual Studio brings me to the next point - project configuration and build scripts. A project is not just a set of source files, but also which compiler to use, compiler and linker flags for different configurations, external libraries, required resources and so on. Visual Studio stores all of this in an XML file, which is presumably human-readable and editable (XML is not a great format for this anyway - too much text is just too much text). But compare a four-configuration setup, where that XML contains four sections of almost identical text differing only by a couple of flags, with CMake, where it is organized as a common block plus "if" blocks holding the specifics of each configuration. When you want to edit only the Release configuration, you either have to wade through several nearly identical blocks and find what you need in a pile of options, or find the appropriate if(RELEASE) and change the options there. The same goes for external libraries and so on (it's a real pain to add a library even from Visual Studio, because you have to remember to select all configurations, as if adding a library to only one configuration were the default case everybody needs).
    OK, say we got our C++ project up and running and checked into the repository. We call up a colleague and ask them to add a feature to it... Oh wait, they just got a new machine and need to set up the environment. Can we help them somehow? Not really - only send a bunch of links so they can download a ton of software and click through hundreds of wizards. Note that we just need a (clean) clone of our environment on a different machine, but there is really no way (except using VMs) to just give them a setup script that would install everything automatically. That's a day of their life they will never get back.
     So our colleague installed everything, but our project doesn't compile on his machine. He digs in and finds out that he installed the wrong version of a third-party library, and to the wrong path, because the installer actually asked where to install it - why is there no standard place where libraries go? And no, we didn't pin the version, because how would we? Only by adding the version number to the folder name, which again is manual and pretty hard to keep track of and update later. A simple install script that installed the required versions would save time and guarantee the right version in the right place. So we institute a special place where libraries must go, or even check them into the repository to make sure everybody has the same versions, and move on, but the feeling that this OS has no idea what any developer will need never goes away.
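    To show what I mean by an install script, here is a minimal sketch in Python; the library name, version, download URL and target path are all made up for illustration, and a real script would simply list every dependency the project needs:

import os
import urllib.request
import zipfile

# Hypothetical pinned dependency: name, version and URL are illustrative only.
LIBS_DIR = r"C:\devlibs"
LIBRARY = ("somelib", "1.2.3", "https://example.com/somelib-1.2.3.zip")

def install(name, version, url):
    # Version is part of the folder name, so the layout is predictable.
    target = os.path.join(LIBS_DIR, "%s-%s" % (name, version))
    if os.path.exists(target):
        return  # already installed with the right version
    archive, _ = urllib.request.urlretrieve(url)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)

if __name__ == "__main__":
    install(*LIBRARY)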
     Switching gears, we were asked to write a web scraper. Of course you could try to do it in C#, but that would be overkill - the number of lines of code you'd write is just insane compared to a scripting language. So our choices are Perl and Python. We go with Python - and the whole dance starts again: download Python, install it by clicking through a wizard. This time there is no Visual Studio to back you up, so open Notepad... oh, wait, the default editor is just a nightmare. How, for the love of god, can the default editor be that bad? Anyway, download Notepad++ (again a wizard, click-click, time is passing by). Python has a great system for installing required libraries, which on Linux/Mac works like a charm. It works on Windows too, until a package needs to compile some C/C++ code. How is it possible that nobody has figured out how to compile that code programmatically from Python when Visual Studio is installed? It works well with GCC, but I've tried multiple times, including digging into the code that drives the compilation with Visual Studio, and failed to configure it right. Yes, there are precompiled packages you can download and install (wizard, click-click), and from what I understand they are compiled with MinGW (prove me wrong).
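    To give a sense of scale, the core of such a scraper fits in a couple dozen lines of Python using only the standard library; this is a minimal sketch, and the page URL and the tag it extracts are illustrative only:

import urllib.request
from html.parser import HTMLParser

# Collect the text of every <title> tag on a page - purely illustrative.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

html = urllib.request.urlopen("http://example.com/").read().decode("utf-8")
parser = TitleParser()
parser.feed(html)
print(parser.titles)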
    The client who asked for the web scraper also asked to save everything to a MySQL database. We need to install it (you know the drill - wizard, click-click) and configure it (UTF-8 by default, a different port and so on) - a quest to find where the configuration file lives, because who paid attention to where MySQL asked to install itself. OK, now how will our colleague, or the machine we deploy to, get the same environment? Again, write a document describing where to download everything, and attach a config file to that document.
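    Saving the scraped data is the easy part once MySQL is finally set up - a minimal sketch using the MySQLdb module, where the table, columns and credentials are made up for illustration:

import MySQLdb

# Connection parameters and schema are illustrative only.
conn = MySQLdb.connect(host="localhost", user="scraper", passwd="secret",
                       db="scraping", charset="utf8")
cursor = conn.cursor()
cursor.execute("""CREATE TABLE IF NOT EXISTS pages (
    url VARCHAR(255) PRIMARY KEY,
    title TEXT
)""")
cursor.execute("INSERT INTO pages (url, title) VALUES (%s, %s)",
               ("http://example.com/", "Example Domain"))
conn.commit()
conn.close()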
     Concluding, both *nix and Windows have their positive and negative sides. Linux has a lot of developer awareness simply because developers are a huge part of its user base, and any developer can have a say in how it should work. My point is that Windows, as an OS, should be much more aware of these issues and more open to helping developers do their job better and more easily, and as a result spread MS products. Instead, by being unhelpful in a lot of these details, it repels them. Nowadays Mac OS, by being easy to use on one hand and actually having a lot of Linux's flexibility on the other (a *nix file system, bash scripting, MacPorts and Homebrew to install packages), is growing as a primary platform where development is done. Of course, another trend is developing web applications, where you just connect to a server and write code there. Windows, by offering only a graphical remote desktop, requires much higher bandwidth, which is not always available. As a result it's Linux that shines, by providing powerful text-based IDEs in addition to a great development environment.

Tuesday, October 15, 2013

Reference for Introduction to Decision Trees

   Last weekend I was at the SF Bay ACM Data Mining Camp and gave an introductory session on Decision Trees, Stochastic Gradient Boosting and Random Forests. For people who are interested in further reading on the topic, I'm posting some links here.


Decision Trees

    The idea of decision trees is to build a hierarchical set of rules that allows predicting the value of a target variable. The rules are usually "if-then-else" statements organized in the form of a binary tree.
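    For a quick feel of how this looks in practice, here is a minimal sketch of fitting a decision tree with scikit-learn (the toy data is made up for illustration):

from sklearn.tree import DecisionTreeClassifier

# Toy dataset: two features, binary target - illustrative only.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)
print(clf.predict([[0, 1]]))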

Basic reading: 

Hardcore reading: 


Stochastic Gradient Boosting of Decision Trees

    This method of building a predictive model is based on constructing an ensemble of small regression trees that are fitted (learned) sequentially, where each next iteration learns from the errors of the previous ones.
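    A minimal sketch of this with scikit-learn's implementation (the toy data and parameter values are illustrative only; the subsample parameter is what makes the boosting "stochastic"):

from sklearn.ensemble import GradientBoostingRegressor

# Toy regression data - illustrative only.
X = [[1], [2], [3], [4], [5], [6]]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1]

# Many shallow trees, each fitted to the residual errors of the ensemble so far.
model = GradientBoostingRegressor(n_estimators=100, max_depth=2,
                                  learning_rate=0.1, subsample=0.8)
model.fit(X, y)
print(model.predict([[3.5]]))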

Basic reading:
Hardcore reading: 
    The original paper that describes the methodology: Jerome Friedman, Stochastic Gradient Boosting.


Random Forests

    The idea behind Random Forests is that you build a set of classifiers on independently sampled subsets of the data, while using a random selection of features to split each node. The generalization error of such forests converges to a limit as the number of trees in the forest becomes large.
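    A minimal sketch with scikit-learn (toy data made up for illustration; max_features controls the random selection of features at each split):

from sklearn.ensemble import RandomForestClassifier

# Toy dataset - illustrative only.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0]]
y = [0, 1, 1, 0, 1, 1]

# Each tree is grown on a bootstrap sample of the rows, and each split
# considers only a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)
print(forest.predict([[1, 2]]))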

Basic reading:
Hardcore reading:
     The original paper that describes the methodology: Leo Breiman, Random Forests.


    As a conclusion, I would say that all three methods have their own advantages and disadvantages, and it's worth learning how and when to use each one of them.

Wednesday, July 10, 2013

Introduction to Data Science books and courses

    I was asked about books and courses that would help to get started with Data Science (Data Mining, Machine Learning or Data Analysis).
   My main toolchain is Python, NumPy/SciPy/Pandas/Scikit-learn, Hadoop and MRJob. Based on this, I put together a list of books that are good to start with:

Python

Learning Python. Mark Lutz.
Book to learn Python before jumping to data science. 

Python for Data Analysis. Wes McKinney.
http://www.amazon.com/books/dp/1449319793
A book from the author of the pandas module. A great book for learning how to do descriptive statistics with Python.

Programming Collective Intelligence. Toby Segaran.
An introduction to writing machine learning algorithms yourself in Python.

Machine Learning in Action. Peter Harrington.
k-Nearest neighbors, naive Bayes, SVM and decision trees, with examples in Python.

Hadoop

Hadoop: The Definitive Guide. Tom White.
The definitive guide from one of the early contributors to the Hadoop source code, a person with vast experience working with it.

R & Stats

Data Analysis with Open Source Tools. Philipp K. Janert.
Sometimes Python is just not enough, and this book will help you start working with R.

Think Stats. Allen B. Downey.
If you are coming from a Computer Science background, you'd better get this book on probability theory and stats.

Good read on Data Science

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Eric Siegel.
A good read on the philosophy of predictive analytics, with examples of real-world tasks that people have solved with it.

Courses

  • Introduction to Data Science - A good introduction to all the main concepts a data scientist should know (SQL, NoSQL, Hadoop, R, machine learning algorithms, visualization, etc.).
  • Computing for Data Analysis - A course about learning R and solving real problems with it.
  • Machine Learning - The basics of machine learning from Andrew Ng (co-founder of Coursera and Director of the AI Lab at Stanford).
  • Computational Investing - A course that teaches how to build a trading robot for the stock exchange in Python, using all the tools a data scientist uses (see it as practical examples).

    This list of books and courses will be updated when I find something worth reading or watching on this topic. If you know a good book that I should add to this list - please let me know.

Friday, April 5, 2013

My experience with Massive Open Online Courses

     I was one of those 100,000 people who signed up for the Artificial Intelligence class by +Peter Norvig & +Sebastian Thrun and the Machine Learning class by +Andrew Ng.

    The classes were really great and I enjoyed them tremendously - more from the practical side for me, because I knew most of the theory (I had read Norvig's book on AI before, and machine learning is the field I work in) - but these classes gave me an opportunity to actually write some code for example problems and see some results.

    This made for a really fun semester - I was studying at my university (KhPI in Ukraine), taking these two courses, also signed up for remote education at the Yandex School of Data Analysis, and working part time for Salford Systems. I didn't have much time for a social life, as you can guess :)

   When the term finished I had 97.5% in Artificial Intelligence and the maximum score in Machine Learning. The Yandex school was pretty intense, but I finished it too with 90-100% scores.

    For the next term I signed up for and started attending a number of courses on Udacity and Coursera - but I never finished a single one. A big factor was that I switched to working full-time and had less spare time. On the other hand, there were so many courses I wanted to attend (computer science, gamification, physics, economics, strategic planning) that I got really unfocused.

   Udacity offers courses without deadlines, which pushed me to even harder procrastination - "I can watch this lesson next week" - and then next week something else happened. Coursera, on the other hand, was rushing with deadlines, and if you sign up for 3-4 courses at the same time and actually have a day job, you'll start missing them. As soon as that happens, the motivation to continue sinks: you stop thinking about the certificate (hey, I missed a deadline, so I won't get a good grade anyway) and switch to a more "Udacity" mode - "I can watch this next week" - and then it just stops happening after a while.

   From a short talk with a guy from Udacity at PyData, I think this is not just my issue - the variety of courses leads to dispersed attention and, as a result, fewer finished courses overall.

   Of course, Udacity and Coursera started just recently and have only about a year of experience. I'm sure we will see new developments once MOOC startups figure out how to leverage the huge amounts of data they are collecting right now and deliver better personalization, or just better lessons/quizzes/homework.

   This was one of the points of +Peter Norvig's keynote talk at PyData 2013 - that they are just starting to analyze the data collected in the AI class, and how it can lead to improvements in the lectures themselves, or even in Python's error messages for novice users.

   As I started to think about what form would allow focusing personal education on one side, and allow a flexible schedule while still having deadlines on the other, I arrived at something I call a "continuous flow of education". First you sign up and specify your interests (social media can be leveraged to see what you are interested in as well), and based on this you get a personalized queue of things to learn and do. For example, when you have 10-20 minutes (or better, an hour or two), you go to your personal queue and do the first thing on top - watch a new lesson, answer a quiz, do part of a homework. So the system actually plans, according to your interests, what you should learn, and then delivers this knowledge to you one piece after another - without really giving you much choice (unless you don't want to study something at all).
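   Purely to illustrate the idea, here is a toy sketch of such a personal queue in Python; all the course and item names are made up, and a real system would fill and reorder the queue from your interests and progress:

from collections import deque

# A personal queue of learning items, planned by the system from the user's
# interests. Everything here is a made-up illustration, not a real service.
queue = deque([
    ("Machine Learning", "lesson", "Linear regression"),
    ("Machine Learning", "quiz", "Gradient descent"),
    ("Economics", "lesson", "Supply and demand"),
    ("Machine Learning", "homework", "Implement the cost function"),
])

def next_item():
    """Pop the next thing to do - the user gets no choice, just the top item."""
    return queue.popleft() if queue else None

print(next_item())  # ('Machine Learning', 'lesson', 'Linear regression')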

   Of course, there is still the question of deadlines - they should be enforced, but because you can't move on to the next thing until you finish the previous homework, you never have three concurrent homeworks to do, and it's only up to you whether you finish one homework earlier and move on to the next, or study a bit more. On the other hand, if you missed a deadline, you still need to finish that homework to move on. Additionally, even if you are a month late with one homework, you still get a full week for the next one - i.e. being late on one thing doesn't produce a chain effect like it does right now on Coursera.
 
  Another question is that by jumping from one subject to another it may be hard to switch your mind between them - but really, this is what we were doing in high school and as undergraduates: different subjects every day, and you had to do the homework. Plus, switching back and forth will actually reveal how well you really learned a subject.

  In conclusion, MOOCs are already changing the world - see the number of testimonials from children in poor countries where school education is pretty bad. These children can now study with the best teachers in the US, and thousands of people can simply go and learn additional subjects that will help them with their day-to-day jobs. This is also a time of extensive learning for the MOOC startups themselves - about user behavior and about the best form for delivering knowledge and helping people fix it in their minds.

Wednesday, February 27, 2013

pip install numpy and scipy together

    While working on a new Python module at +Salford Systems, I discovered an issue if you run a command like this:
 pip install numpy scipy 
...
 ".tox/py25/build/scipy/setup.py", line 131, in setup_package from numpy.distutils.core import setup  ImportError: No module named numpy.distutils.core 
    The same issue appears if the numpy and scipy modules are listed in requirements.txt and you try to install them with:
pip install -r requirements.txt 
 The origin of the issue is that setup.py in scipy uses numpy.distutils.core.setup, so it cannot even be imported before numpy is installed, while pip actually queries setup.py with the egg_info command for all requirements before installing anything. This is done to prevent installing broken packages and leaving things in a half-installed state: pip first checks that everything is fine and only then starts installing.

     Now, I've made my own branch of scipy on GitHub and fixed this by making a whitelist of commands that are able to work without importing numpy. This allows running the examples listed above without issues.
     The pending pull request to scipy master is here - https://github.com/scipy/scipy/pull/453
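     For reference, the idea of the fix looks roughly like this - a simplified sketch of the whitelist pattern, not the actual code from the pull request:

# setup.py (simplified sketch of the whitelist idea, not the real scipy code)
import sys

# Commands that only need package metadata and can run without numpy installed.
NO_NUMPY_COMMANDS = {"egg_info", "--version", "clean"}

def setup_package():
    if set(sys.argv[1:]) & NO_NUMPY_COMMANDS:
        # Fall back to plain setuptools so that `setup.py egg_info`
        # works even before numpy is installed.
        from setuptools import setup
    else:
        from numpy.distutils.core import setup
    setup(name="scipy", version="0.12.0")

if __name__ == "__main__":
    setup_package()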

     PS. Related issues on the web:
  • http://projects.scipy.org/scipy/ticket/1429
  • pypa/pip#25
  • pypa/pip#272
  • sjsrey/pysal#207
  • http://bb10.com/python-testing-general/2011-06/msg00028.html

Saturday, December 29, 2012

Pymisc module

    Pymisc is a module of miscellaneous utilities for your average Python scripts and projects.
    This module was developed with the same idea as "django-misc", which I've described before - to move frequently used utilities to one specific place.
    To install it, you can use the GitHub (latest) version or the PyPI (stable) version via pip:
pip install git+git://github.com/ilblackdragon/pymisc.git
or for stable version from PyPI:
pip install pymisc
    Now that it's installed on your machine, let's discuss what you get from it:
  • decorators.py contains @logprint (function enter and exit are logged, as well as any crashes that may happen) and @memorized (a caching decorator; see the sketch after this list)
  • settings.py contains a Settings class that provides an experience close to django.conf.settings, and additionally you can actually change values and they will be auto-saved when the application closes.
  • the utils package contains a long list of routines for different purposes, which I'll describe in the GitHub documentation one day
  • the reader package contains a couple of CSV utility modules that really help when you work a lot with this format of data files
  • django and html are actually copies of django-misc stuff, so if you already use it - just ignore them
  • web.browser.Browser is a class that provides some basic routines on top of the usual urllib module to make it easier to do JSON requests, download files, etc.
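   To illustrate what a caching decorator like @memorized does, here is the general pattern - a generic sketch, not the actual pymisc implementation:

import functools

def memorized(func):
    """Cache results by arguments - the general idea behind a caching decorator."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memorized
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # each fib(n) is computed only once thanks to the cache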
   I'll continue developing this module and adding more stuff (including some docs and examples), and if you have a piece of code that you think belongs in this kind of place - let me know, or fork and send a pull request on GitHub.

Friday, December 28, 2012

Release cycle - Part II

      This is an additional part to the previous post about the release cycle.

      So I continued reading +Joel Spolsky and guess what? He actually ended up with Kanban ideas in mind in one of his latest posts. Apparently I should read all of Joel's stuff before commenting :)

      The basic idea behind a "pull"-style scheduling system (as I understand it) is getting features to the customer as fast as possible. That said, your release cycle should be getting faster and faster. And for this, of course, you will need all these modern tools (TDD and unit tests, DVCS, etc.), which allow you to push changes that are ready for deployment to the production server and know that the feature works and has high quality.

      By making "pull" system, you'll illuminate most of "waste" inventory and will be able to show customers same features you use "in-house". Imagine, you have configured you Continues what-ever system (Delivery or Deployment) - system that allows to get code that works through test-system to the production (web service or auto-update server). But this is only part of job for "pull" system.

      You also need a policy of choosing features to develop based only on the highest priority. This means that if you have a mile-long backlog, you will need to sort it by priority (number of clients asking for the feature and/or additional $$ gained) and admit to yourself that low-priority features won't be done... ever. You'll always get new feature ideas from clients or from the team, and those will come in with high priority. So don't spend time revising the old ones again - throw them out. If any of those features was really important, it will come up again.

     OK, that's all fun when you have perfect code, it's all separated into modules, it's all covered by unit tests, and so on. But let's face the truth - the code is a mess and the tests are lacking. And this means refactoring is required, while we still want to push new features out (customers are waiting and nobody wants to disappoint them).

     To get refactoring going, let's put a feature in the backlog with a concrete proposal for how to refactor parts of the code (not just "rework this", but a concrete "this should have interfaces A and B and implement algorithm C") and set a high priority on it. Then, when the current features are done, the refactoring tasks will be performed by developers one by one. The refactoring proposal should describe how the new code will be tested (unit tests, integration tests, system tests, performance tests) to ensure that it works correctly. If a new high-priority feature comes into play, it will land after whatever refactoring tasks are already in the developers' queues.

    While the code is still imperfect, the release cycle won't be continuous, but it should be minimized - from years to months as a first step. Once a new minor version is released each month, the next step is to go to weeks - just a couple of features (sometimes small and sometimes rather large). All the while, refactoring should be going on to get the code into better shape and unlock the possibility of delivering software continuously.

Wednesday, December 26, 2012

Release cycle

      And again, let's start the topic from a +Joel Spolsky post, "Picking a Ship Date". His main idea in the article is:
  • If the product is brand new - release often
  • If the product has some maturity - a 1-2 year release cycle is for you
  • If the product is a platform - at least 3 years.
      OK, the article is from 2002 (what? yep, that old). So there were no distributed version control systems (no, really, the first releases of Git and Mercurial were in April 2005).

      Now, how did that change the world? It actually changed it pretty dramatically. Before, the main workflow was to develop a set of big features and then go into a "pre-release cycle", when you fix issues, add small features and get your QA cracking at your software. Because of that, if you have a 1-year release cycle, you would have only 3-5 months to get big features into the system, and then you have this cycle of getting the software into production shape.

      Behold: DVCS gave you the option to keep the master branch in "release" form all the time. If you are doing work that requires more than one commit (which any normal work does, because you should commit frequently), you just put it in a separate branch, which gets merged when it's ready (and maybe even tested by QA). In the old world you would simply make the current state unreleasable by pushing your commit to trunk (OK, there were branches in SVN... but really, did anybody use them to build new features?).

      Let's look at GitHub - they've made 2000+ releases in a 6-month period. Some of them were surely pretty small - a bug fix, a simple button that a million customers asked for, or just a tooltip that improved usability. But some of them were pretty big - a new interface, a changed backend, or maybe a new system for distributed handling of repositories (I'm making this up, though). The idea is that each release was the same: somebody finished their work in a separate branch and merged it into the main (release) branch.

     Another example is Google Chrome - I think it has the best update system in the world. IT DOESN'T NEED ME TO DO ANYTHING (I'm looking at you, Flash, Java, ...)! Yes, and I like it. And over the past 4 years I've been using it, they have managed not to change the interface unexpectedly - so the argument that "too frequent releases will hurt usability" doesn't hold if the development process is built with usability in mind.

     As a conclusion, I would say that frequent releases are a good thing when the update system is very good (a web service, or auto-update without the person even noticing) and when you have a very concrete plan of features that will be implemented without breaking usability (features add functionality, not UI complexity).

Monday, December 24, 2012

Management Team


      Today I read a guest post from +Joel Spolsky at avc.com - Management Team. The main idea is to let your developers (QAs, etc.) do their job the way they think is right (because they have more knowledge of it) and not micromanage them.

      The idea sounds reasonable, but I have some doubts about situations when people don't have good interpersonal skills.
   
      Let's look at an example with Pete and Jack. They are both senior developers on a team that is developing a new version of super-product X. They have a product manager who wrote an awesome functional spec, and now they are discussing the implementation. Pete thinks that implementing it with B-trees will be better, while Jack really likes red-black trees. Now, they have a pretty straightforward way to figure out who is right - write the code and compare which solution is better for the particular situation. Easy, right?

     Wait... what if now you need to choose whether to use library A or library B? You can't easily anticipate future problems (oops, library A has an issue on HP-UX when a MySQL daemon is running), nor can you implement both solutions fast enough to test which one is better.

     Even worse, library A may actually be a full-blown framework for solving the problem that you just need to customize, while library B is a set of functions you can use but need to develop a wrapper around to make it work. I mean the case when you can't put an abstraction layer between your code and the library so you could decide later which one you want to use.

     Pete and Jack will argue about which library is better to use, and because Jack doesn't actually like to speak much (yep, he is better at expressing himself in code), he decides to give up and agrees with Pete to use library A.

     Now, you see +Joel Spolsky has a Team Lead on the chart, who presumably should resolve arguments like this. He takes responsibility for making large design decisions, selecting tools and setting conventions for his team. For this, a Team Lead should have experience with a lot of things and be very good at judging what will be better for the team and for further development.

     Returning to our example, Pete got promoted to Team Lead, because the guy who was Team Lead before was caught sleeping with the CEO's secretary and got kicked out. Pete was chosen because he has better "people" skills and is very knowledgeable about the product the team is developing. Oh yeah, and he is pretty good at beer pong (Jack didn't go to that party, so we don't know if he is better than Pete).
   
    Pete got a request to implement a new feature for the next version and sat down with Jack to design it. Of course Jack has some ideas about a better design - but he already argued with Pete once and lost. Plus, Pete is now his boss. So he listens to Pete's ideas, which are mostly good, and even if there are some not-so-good design decisions - he will just agree.

    So as a result, we see that people skills determine not just who gets promoted, but whose ideas get implemented.

Sunday, December 23, 2012

Functional and Technical spec in software design

      I started reading +Joel Spolsky's blog pretty heavily - his old posts from the 2000s. I'm pretty sure most of you are familiar with his blog; it's kind of famous in the software development world.
I'm reading 3-5 posts a day. I don't agree with some of his thoughts (he stopped agreeing with some of them too as time went by :) ), but most of them are pretty bright.

      One of the ideas I'm trying to employ is technical writing - guess what, this blog was made for exactly that: to practice writing in English on technical topics. But because of my laziness I haven't been doing much here. That should change in the next month or so.

      Another thing I'm trying to get used to is functional specs, as he calls them: a document that describes a feature - essentially a document that helps everyone involved in the software development process understand how the feature should work and what should be done for it.

      But a functional spec, as Joel points out, is a view from the user's point. It can be written by a program manager - a person who is not a developer, but more like a marketing/product development kind of person.

      On the other hand, in complicated situations - like developing a new product or producing a large feature (more like a feature set) - when the implementation is not clear, there should also be a technical spec. Or the functional spec should incorporate this information.

      The purpose of that is to think about design/implementation and future obstacles:

A software design is simply one of several ways to force yourself to think through the entire problem before attempting to solve it. Skilled programmers use different techniques to this end: some write a first version and throw it away, some write extensive manual pages or design documents, others fill out a code template where every requirement is identified and assigned to a specific function or comment. For example, in Berkeley DB, we created a complete set of Unix-style manual pages for the access methods and underlying components before writing any code. Regardless of the technique used, it's difficult to think clearly about program architecture after code debugging begins, not to mention that large architectural changes often waste previous debugging effort. Software architecture requires a different mind set from debugging code, and the architecture you have when you begin debugging is usually the architecture you'll deliver in that release.
      This is Design Lesson 2 from the history of Berkeley DB. Check it out - a nice article about the history of developing a pretty complicated system.