Showing posts with label data science. Show all posts

Tuesday, October 15, 2013

Reference for Introduction to Decision Trees

   Last weekend I was at the SFBay ACM Data Mining Camp and gave an introductory session on Decision Trees, Stochastic Gradient Boosting and Random Forests. For people who are interested in further reading on these topics, I'm posting some links here.


Decision Trees

    The idea of decision trees is to build a set of rules in hierarchical form that allows you to predict the value of a target variable. The rules usually take the form of "if-then-else" statements and are organized as a binary tree.
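
    As a minimal sketch of this idea, a tree of "if-then-else" rules can be written by hand as nested nodes in plain Python. The feature names, thresholds and labels below are made up for illustration; a real tree learner would choose them from the training data.

```python
# A hand-written binary decision tree. Each internal node is a rule:
# "if sample[feature] <= threshold then go left, else go right".
# All features, thresholds and labels here are hypothetical.

def leaf(value):
    return {"value": value}


def node(feature, threshold, left, right):
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right}


tree = node("age", 30,
            leaf("no"),
            node("income", 50000, leaf("no"), leaf("yes")))


def predict(tree, sample):
    """Walk from the root to a leaf, applying one rule per level."""
    current = tree
    while "value" not in current:
        if sample[current["feature"]] <= current["threshold"]:
            current = current["left"]
        else:
            current = current["right"]
    return current["value"]


print(predict(tree, {"age": 25, "income": 80000}))  # no
print(predict(tree, {"age": 45, "income": 80000}))  # yes
```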

Basic reading: 

Hardcore reading: 


Stochastic Gradient Boosting of Decision Trees

    This method of building a predictive model constructs a set of small regression trees that are fitted (learned) sequentially, where each new tree learns from the errors of the previous iterations. The "stochastic" part is that each tree is fitted on a random subsample of the training data.
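
    The sequential fitting can be sketched in plain Python. This is a simplified illustration, not Friedman's full algorithm: it assumes squared error (so the "error" each tree learns from is the residual), uses one-split trees on a single feature, and the toy data, function names and parameter values are all my own assumptions.

```python
import random


def fit_stump(xs, targets):
    """Fit a one-split regression tree (a "stump") to 1-D inputs."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All sampled x values are identical: fall back to a constant.
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_trees=100, learning_rate=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)  # initial model: the mean of the target
    predictions = [base] * len(xs)
    stumps = []
    sample_size = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        # For squared error, the error to learn from is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        # "Stochastic": fit each tree on a random subsample (no replacement).
        idx = rng.sample(range(len(xs)), sample_size)
        stump = fit_stump([xs[i] for i in idx], [residuals[i] for i in idx])
        stumps.append(stump)
        # Add a shrunken copy of the new tree to the running prediction.
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "rate": learning_rate, "stumps": stumps}


def boosting_predict(model, x):
    return model["base"] + model["rate"] * sum(
        stump_predict(s, x) for s in model["stumps"])


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = list(range(20))
ys = [x ** 2 for x in xs]  # toy target, made up for illustration
model = fit_boosting(xs, ys)
baseline_mse = mse(ys, [model["base"]] * len(xs))
boosted_mse = mse(ys, [boosting_predict(model, x) for x in xs])
print("mean-only MSE:", baseline_mse)
print("boosted MSE:  ", boosted_mse)
```

    The learning rate shrinks each tree's contribution, so no single tree dominates and later trees keep correcting what is left of the error.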

Basic reading:
Hardcore reading: 
    Original paper, which describes methodology: Jerome Friedman. Stochastic Gradient Boosting.


Random Forests

    The idea behind Random Forests is that you build strong tree classifiers on independently sampled subsets of the data, while using a random selection of features to split each node. The generalization error of such a forest converges to a limit as the number of trees in the forest becomes large.
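
    A minimal sketch of these two sources of randomness, in plain Python. The assumptions are mine: the data subsets are bootstrap samples, splits are scored with Gini impurity, the trees are shallow, and the forest predicts by majority vote. The toy data and all function names are made up for illustration.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, n_features, rng, depth=0, max_depth=3):
    """Return a label (leaf) or a tuple (feature, threshold, left, right)."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    best = None
    # Random selection of features to consider for this node's split.
    for f in rng.sample(range(len(rows[0])), n_features):
        for t in sorted(set(r[f] for r in rows))[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # the sampled features offer no usable split
        return majority(labels)
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       n_features, rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       n_features, rng, depth + 1, max_depth))


def tree_predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def build_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_features, rng))
    return forest


def forest_predict(forest, row):
    # Each tree votes; the forest returns the most common vote.
    return majority([tree_predict(tree, row) for tree in forest])


# Toy data: two well-separated clusters (hypothetical values).
rows = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.9), (1.2, 1.1),
        (4.0, 4.2), (4.5, 3.8), (3.9, 4.9), (4.2, 4.1)]
labels = ["a"] * 4 + ["b"] * 4

forest = build_forest(rows, labels)
print(forest_predict(forest, (0.5, 0.5)))
print(forest_predict(forest, (5.0, 5.0)))
```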

Basic reading:
Hardcore reading:
     Original paper, which describes methodology: Leo Breiman, Random Forests.


    In conclusion, I would say that all three methods have their own advantages and disadvantages, and it's worth learning how and when to use each one of them.

Wednesday, July 10, 2013

Introduction to Data Science books and courses

    I was asked about books and courses that help you get started with learning Data Science (Data Mining, Machine Learning or Data Analysis).
   My main toolchain is Python, NumPy/SciPy/Pandas/Scikit-learn, Hadoop and MRJob. Based on this, I put together a list of books that are good to start with:

Python

Learning Python. Mark Lutz.
A book for learning Python before jumping into data science.

Python for Data Analysis. Wes McKinney.
http://www.amazon.com/books/dp/1449319793
A book from the author of the pandas module. A great way to learn how to do descriptive stats with Python.

Programming Collective Intelligence. Toby Segaran.
An introduction to writing your own machine learning algorithms in Python.

Machine Learning in Action. Peter Harrington.
k-nearest neighbors, naive Bayes, SVM and decision trees, with examples in Python.

Hadoop

The definitive guide from one of the early contributors to the Hadoop source code, a person with vast experience working with it.

R & Stats

Data Analysis with Open Source Tools. Philipp K. Janert.
Sometimes Python is just not enough, and this book will help you start working with R.

Think Stats. Allen B. Downey.
If you are coming from a Computer Science major, you had better get this book on probability theory and stats.

Good read on Data Science

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Eric Siegel.
A good read on the philosophy of predictive analytics, with examples of real-world tasks that people have solved with it.

Courses

  • Introduction to Data Science - A good introduction to all the main concepts that a data scientist should know (SQL, NoSQL, Hadoop, R, machine learning algorithms, visualization, etc.).
  • Computing for Data Analysis - A course on learning R and solving real problems with it.
  • Machine Learning - The basics of machine learning from Andrew Ng (co-founder of Coursera and director of the Stanford AI Lab).
  • Computational Investment - A course that teaches how to build a trading robot for the stock exchange in Python, using all the tools that a data scientist uses (treat it as a set of practical examples).

    This list of books and courses will be updated when I find something worth reading or watching on this topic. If you know a good book that I should add to this list, please let me know.