Illia Polosukhin's blog: 2013

Tuesday, October 15, 2013

Reference for Introduction to Decision Trees

Last weekend I was at the SFBay ACM Data Mining Camp and gave an introductory session on Decision Trees, Stochastic Gradient Boosting and Random Forests. For anyone interested in further reading on these topics, I'm posting some links here.

Decision Trees

The idea of decision trees is to build a set of rules in hierarchical form that lets you predict the value of a target variable. The rules usually take the form of "if-then-else" statements and are organized as a binary tree.
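To make the "if-then-else rules in a binary tree" idea concrete, here is a minimal sketch of a hand-built tree and a prediction routine. The feature names ("age", "income"), the thresholds and the labels are made up for illustration; a real tree would learn them from data.

```python
# A tiny hand-built decision tree. Every internal node is one rule:
# "if sample[feature] <= threshold then go left, else go right".
# Feature names, thresholds and labels are hypothetical.

def make_leaf(value):
    return {"value": value}

def make_node(feature, threshold, left, right):
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right}

def predict(tree, sample):
    """Walk from the root to a leaf, applying one rule per node."""
    node = tree
    while "value" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]

tree = make_node(
    "age", 30,
    make_leaf("no"),
    make_node("income", 50000, make_leaf("no"), make_leaf("yes")),
)

print(predict(tree, {"age": 25, "income": 80000}))
print(predict(tree, {"age": 45, "income": 80000}))
```

Each prediction follows a single path from the root to a leaf, so the rules that led to an answer can be read off directly.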

Basic reading:

Hardcore reading:

The original book describing the methodology: Classification and Regression Trees. Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone.

Stochastic Gradient Boosting of Decision Trees

This method of building a predictive model is based on constructing a set of small regression trees that are fitted (learned) sequentially, where each new iteration learns from the errors of the previous iterations.
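As a rough sketch of "each new tree learns from the errors of the previous ones", the code below boosts one-split regression trees (stumps) on a single feature with squared-error loss, where the errors are simply the residuals. The function names, the parameter defaults and the toy data are my own assumptions, not taken from Friedman's paper or any library.

```python
import random

def fit_stump(xs, targets):
    """Fit a one-split regression tree on a single feature by least squares."""
    mean = sum(targets) / len(targets)
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all xs identical: fall back to a constant prediction
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)           # start from the mean of the target
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # "Stochastic": each tree only sees a random subsample of the rows.
        idx = rng.sample(range(len(xs)), max(2, int(subsample * len(xs))))
        # Residuals of the current model are what the next tree learns.
        residuals = [ys[i] - predictions[i] for i in idx]
        stump = fit_stump([xs[i] for i in idx], residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(20))
ys = [x * x for x in xs]
model = boost(xs, ys)
baseline = lambda x: sum(ys) / len(ys)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

The learning rate shrinks the contribution of every tree, so many small corrections add up to the final model instead of one tree dominating.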

Basic reading:

Hardcore reading:

The original paper describing the methodology: Jerome Friedman. Stochastic Gradient Boosting.

Random Forests

The idea behind Random Forests is that you build strong classifiers (trees) on independently sampled subsets of the data, while using a random selection of features to split each node. The generalization error of such forests converges to a limit as the number of trees in the forest becomes large.
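The two sources of randomness can be sketched in plain Python: a bootstrap sample of the rows for every tree, and a random subset of features considered at every node. This is a simplified illustration under my own assumptions (Gini impurity, a depth limit, majority voting, toy data), not Breiman's reference implementation.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, n_features, rng, depth=0, max_depth=3):
    # Leaves are plain labels (strings here); internal nodes are tuples.
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    best = None
    # Only a random subset of the features is considered at this node.
    for f in rng.sample(range(len(rows[0])), n_features):
        for t in sorted(set(r[f] for r in rows)):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority(labels)
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       n_features, rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       n_features, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample rows with replacement, independently per tree.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_features, rng))
    return forest

def predict_forest(forest, row):
    # Every tree votes; the most common label wins.
    return majority([predict_tree(tree, row) for tree in forest])

rows = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3),
        (8, 8), (8, 9), (9, 8), (9, 9), (7, 9)]
labels = ["a"] * 5 + ["b"] * 5
forest = fit_forest(rows, labels)
print(predict_forest(forest, (1, 1)), predict_forest(forest, (9, 9)))
```

Individual trees can differ a lot because each sees different rows and features, and the vote across trees is what smooths those differences out.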

Basic reading:

Hardcore reading:

The original paper describing the methodology: Leo Breiman. Random Forests.

In conclusion, I would say that all three methods have their own advantages and disadvantages, and it's worth learning how and when to use each one of them.

Wednesday, July 10, 2013

Introduction to Data Science books and courses

I was asked about books and courses that help you get started with Data Science (Data Mining, Machine Learning or Data Analysis).
My main toolchain is Python, NumPy/SciPy/Pandas/Scikit-learn, Hadoop and MRJob. Based on this, I put together a list of books that are good to start with:

Python

Learning Python. Mark Lutz.

http://www.amazon.com/Learning-Python-Mark-Lutz/dp/1449355730/

A book for learning Python before jumping into data science.

Python for Data Analysis. Wes McKinney.
http://www.amazon.com/books/dp/1449319793

A book from the author of the pandas module. A great way to learn how to do descriptive statistics with Python.

Programming Collective Intelligence. Toby Segaran.

http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325

An introduction to writing machine learning algorithms yourself in Python.

Machine Learning in Action. Peter Harrington.

http://www.amazon.com/Machine-Learning-Action-Peter-Harrington/dp/1617290181

k-nearest neighbors, naive Bayes, SVM and decision trees, with examples in Python.

Hadoop

Hadoop: The Definitive Guide. Tom White.
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979

The definitive guide from one of the early contributors to the Hadoop source code, a person with vast experience working with it.

R & Stats

Data Analysis with Open Source Tools. Philipp K. Janert.

http://www.amazon.com/Data-Analysis-Source-Tools-ebook/dp/B004FGMTYA

Sometimes Python is just not enough, and this book will help you start working with R.

Think Stats. Allen B. Downey.

http://www.amazon.com/dp/1449307116

If you are coming from a Computer Science background, you had better get this book on probability theory and statistics.

Good read on Data Science

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Eric Siegel.

http://www.amazon.com/Predictive-Analytics-Power-Predict-ebook/dp/B00BGC2WGQ/

A good read on the philosophy of predictive analytics, with examples of real-world tasks that people have solved with it.

Courses

Introduction to Data Science - A good introduction to all the main concepts a data scientist should know (SQL, NoSQL, Hadoop, R, machine learning algorithms, visualization, etc.).
Computing for Data Analysis - A course about learning R and solving real problems with it.
Machine Learning - The basics of machine learning from Andrew Ng (co-founder of Coursera and director of the Stanford AI Lab).
Computational Investment - A course that teaches how to build a trading robot for the stock exchange in Python using all the tools a data scientist uses (treat it as a set of practical examples).

I will update this list of books and courses when I find something worth reading or watching on this topic. If you know a good book that I should add to this list, please let me know.

Friday, April 5, 2013

My experience with Massive Open Online Courses

I was one of the 100,000 people who signed up for the Artificial Intelligence class by Peter Norvig & Sebastian Thrun and the Machine Learning class by Andrew Ng.

The classes were really great and I enjoyed them tremendously, mostly from a practical point of view, because I already knew most of the theory (I had read Norvig's book on AI before, and machine learning is the field where I work). These classes gave me an opportunity to actually write some code for example problems and see some results.

This made for a really fun semester: I was studying at my university (KhPI in Ukraine), taking these two courses, enrolled in the remote program of the Yandex School of Data Analysis, and working part time for Salford Systems. I didn't have much time for a social life, as you can guess :)

When the term finished I had 97.5% in Artificial Intelligence and the maximum score in Machine Learning. The Yandex school was pretty intense, but I finished it too, with 90-100% scores.

For the next term I signed up for and started attending a number of courses on Udacity and Coursera, but I never finished any of them. A big factor was that I switched to working full time and had less spare time. On the other hand, there were so many courses I wanted to attend (computer science, gamification, physics, economics, strategic planning) that I got really unfocused.

Udacity offers courses without deadlines, which pushed me to procrastinate even harder: "I can watch this lesson next week", and then next week something else happened. Coursera, on the other hand, rushes you with deadlines, and if you sign up for 3-4 courses at the same time and actually have a day job, you start missing deadlines. As soon as that happens, the motivation to continue sinks: you stop thinking about credit (hey, I missed the deadline, so I won't get good credit anyway) and switch to a more "Udacity" mode of "I can watch this next week", which then stops happening after a while.

From a short talk with a guy from Udacity at PyData, I think this is not just my issue: a wide variety of courses leads to dispersed attention and, as a result, fewer finished courses overall.

Of course, Udacity and Coursera started only recently and have just one year of experience. I'm sure we will see new developments once MOOC startups figure out how to leverage the huge amounts of data they are collecting right now and deliver better personalization or just better lessons, quizzes and homework.

This was one of the points of Peter Norvig's keynote talk at PyData 2013: they are just starting to analyze the data collected in the AI class, and it can lead to improvements in the lectures themselves or even in Python's error descriptions for novice users.

As I started to think about a format that would, on the one hand, keep personal education focused and, on the other, allow a flexible schedule while still having deadlines, I arrived at something I call a "continuous flow of education". First you sign up and specify your interests (social media can be leveraged to see what you are interested in as well); based on this you get a personalized queue of things to learn and do. For example, when you have 10-20 minutes (or better, an hour or two) you go to your personal queue and do the first thing on top: watch a new lesson, answer a quiz, do a part of a homework. So the system plans what you should learn according to your interests and then delivers this knowledge one piece after another, without really giving you much choice (unless you don't want to study something).

Of course, there is still the question of deadlines. They should be enforced, but because you can't move to the next thing until you have finished the previous homework, you never have three concurrent homeworks to do, and it depends only on you whether you pass one homework early and move on to the next one or study a bit more. On the other hand, if you miss a deadline, you still need to finish that homework to move on. Additionally, even if you are a month late with a homework, you still get one week for the next one, i.e. being late on one thing doesn't produce a chain effect like it does right now on Coursera.

Another question is that when jumping from one subject to another it may be hard to switch your mind between them. But this is really what we did in high school and as undergraduates: different subjects every day, each with homework. Plus, switching back and forth will actually reveal how well you have really learned a subject.

In conclusion, MOOCs are already changing the world: see the number of testimonials from children in poor countries where school education is pretty bad. These children can now study with the best teachers in the US, and thousands of people can learn additional subjects that help them with their day-to-day jobs. This is also a time of extensive learning for the MOOC startups themselves, about user behavior and about the best form for delivering knowledge and helping people retain it.

Wednesday, February 27, 2013

pip install numpy and scipy together

While working on a new Python module at Salford Systems, I discovered an issue that appears if you run a command like this:

https://github.com/scipy/scipy

pip install numpy scipy

...

".tox/py25/build/scipy/setup.py", line 131, in setup_package
    from numpy.distutils.core import setup
ImportError: No module named numpy.distutils.core

The same issue appears if the numpy and scipy modules are listed in requirements.txt and you try to install them with:

pip install -r requirements.txt

The origin of the issue is that setup.py in scipy uses numpy.distutils.core.setup and doesn't work without it, but pip queries setup.py with egg_info for every package before installing anything, so numpy is not yet importable when scipy's setup.py runs. pip does this to avoid installing broken packages and leaving things in a half-installed state: it first checks that everything is fine and only then starts installing.

Now, I've made my own branch of scipy on GitHub and fixed this by adding a whitelist of commands that can work without importing numpy. This allows the examples listed above to run without issues.
The pending pull request to scipy master is here - https://github.com/scipy/scipy/pull/453
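A minimal sketch of what such a whitelist could look like in a setup.py is shown below. It illustrates the idea and is not the code from the pull request; the names INFO_COMMANDS, needs_numpy and load_setup, as well as the exact list of commands, are my assumptions.

```python
import sys

# Commands that only need package metadata, so they can run before numpy
# is installed. The exact list is illustrative.
INFO_COMMANDS = ("egg_info", "clean", "--version", "--help-commands")

def needs_numpy(argv):
    """Return False when the requested setup.py command is metadata-only."""
    args = argv[1:]
    if not args:
        return True
    return not ("--help" in args or args[0] in INFO_COMMANDS)

def load_setup(argv):
    """Pick the setup() implementation without importing numpy too early."""
    if needs_numpy(argv):
        # Real builds still need numpy.distutils (as scipy did in 2013).
        from numpy.distutils.core import setup
    else:
        from setuptools import setup
    return setup

# pip's metadata query looks roughly like this and must not import numpy:
print(needs_numpy(["setup.py", "egg_info", "--egg-base", "pip-egg-info"]))
print(needs_numpy(["setup.py", "install"]))
```

The import of numpy is deferred until a command that really builds something is requested, which is what lets pip collect metadata for numpy and scipy in one run.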

PS. Related issues on the web:

http://projects.scipy.org/scipy/ticket/1429
pypa/pip#25
pypa/pip#272
sjsrey/pysal#207
http://bb10.com/python-testing-general/2011-06/msg00028.html