Friday, June 24, 2011

Data Mining 2011


Today I got the results from Data Mining Cup 2011. I placed 15th (out of 35) in the first task, and didn't appear in the second task's table at all (I'd like to know why).
Still, my results weren't so bad for the simple algorithm I made (more on it later):
Number  Name                              Score  Comment
1.      TU_Dortmund_1                     69835
...
12.     Uni_Siberian_Telecommunication_1  60550
13.     TU_Wien_1                         51704  <-- strange jump in scores
...
15.     Uni_Kharkov_1                     50627  <-- me
...
30.     Uni_Chile_2                       21018
...
35.     Inst_Telkom_2                     1230
So, as you can see, I ended up in the middle group, and I think everyone in this group used pretty much the same algorithm: some modification of nearest neighbors. But the top 12 used a different algorithm, and I'll try to figure out what it was.
For now, I'll describe my algorithm.
For the learning step I built a hash map where the key is an item number and the value is another hash map: its keys are the items that were viewed/ordered in the same session as the initial key, and its values are the counts of those views/orders.
For example:
1000|1|0
1000|2|0
1000|3|0
1001|1|0
1001|2|0
As a result I'll have {'1': {'2': 2, '3': 1}, '2': {'1': 2, '3': 1}, '3': {'1': 1, '2': 1}}.
To get the test results, for each session I gather the items that were already viewed/ordered and merge their values from this hash map. Then I sort the merged values and return the top 3 items that aren't already in the session.
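A minimal Python sketch of this learning step (not the original contest code). The function name `build_cooccurrence` is made up for illustration, and I assume each input line is `session|item|flag` with the third field ignored:

```python
from collections import defaultdict

def build_cooccurrence(lines):
    """Map each item to counts of the items seen in the same session."""
    # Group items by session id (third field is ignored in this sketch).
    sessions = defaultdict(list)
    for line in lines:
        session_id, item, _flag = line.split("|")
        sessions[session_id].append(item)

    # For every ordered pair of different items in a session, bump the count.
    cooc = defaultdict(lambda: defaultdict(int))
    for items in sessions.values():
        for a in items:
            for b in items:
                if a != b:
                    cooc[a][b] += 1
    return {item: dict(neighbours) for item, neighbours in cooc.items()}

log = ["1000|1|0", "1000|2|0", "1000|3|0", "1001|1|0", "1001|2|0"]
print(build_cooccurrence(log))
# {'1': {'2': 2, '3': 1}, '2': {'1': 2, '3': 1}, '3': {'1': 1, '2': 1}}
```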
For example:
1003|1|0
1003|2|0
The merged hash map will be: {'1': 2, '2': 2, '3': 2, '4': 1, '5': 3}. Next I remove 1 and 2, because they are already in the current session: {'3': 2, '4': 1, '5': 3}. Then I sort by value and return the keys: [5, 3, 4].
But if there are fewer than 3 values in the resulting list, I additionally return the top-selling items. For example, if the list after the sort step is [3], then after adding top sellers it will be [3, 10, 15], where 10 and 15 are the top 2 selling items, which were calculated during the learning step.
Additionally, I used weights of 1, 5 and 10 for views, add-to-cart events and orders when building the learned hash map.
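The merge, filter and sort steps can be sketched like this. The `recommend` name and the small `learned` map are hypothetical; the map is chosen only so that the merged counts match the example above:

```python
from collections import Counter

def recommend(cooc, session_items, k=3):
    """Return up to k items most often seen together with the session's items."""
    merged = Counter()
    for item in session_items:
        merged.update(cooc.get(item, {}))  # add neighbour counts
    for item in session_items:
        merged.pop(item, None)             # drop items already in the session
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _count in ranked[:k]]

# Hypothetical learned map, picked to reproduce the merged counts above.
learned = {
    "1": {"2": 2, "3": 1, "5": 2},
    "2": {"1": 2, "3": 1, "4": 1, "5": 1},
}
print(recommend(learned, ["1", "2"]))  # ['5', '3', '4']
```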
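A small sketch of this fallback. The helper name `pad_with_top_sellers` is made up, and skipping top sellers that are already in the session is my assumption, not something stated above:

```python
def pad_with_top_sellers(ranked, top_sellers, session_items, k=3):
    """Fill the recommendation list up to k items using the top sellers."""
    result = list(ranked[:k])
    for item in top_sellers:
        if len(result) >= k:
            break
        # Assumption: do not repeat items or suggest ones already in the session.
        if item not in result and item not in session_items:
            result.append(item)
    return result

print(pad_with_top_sellers(["3"], ["10", "15", "20"], {"1", "2"}))
# ['3', '10', '15']
```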
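One possible way to apply these weights, as a sketch only. The event codes (0 = view, 1 = add to cart, 2 = order) and the rule "add the weight of the co-occurring item's event" are assumptions for illustration; the exact scheme is not spelled out above:

```python
from collections import defaultdict

# Assumed event codes: 0 = view, 1 = add to cart, 2 = order.
WEIGHTS = {"0": 1, "1": 5, "2": 10}

def build_weighted_cooccurrence(lines):
    """Like the plain count map, but each co-occurrence adds an event weight."""
    sessions = defaultdict(list)
    for line in lines:
        session_id, item, event = line.split("|")
        sessions[session_id].append((item, WEIGHTS[event]))

    cooc = defaultdict(lambda: defaultdict(int))
    for events in sessions.values():
        for a, _weight_a in events:
            for b, weight_b in events:
                if a != b:
                    cooc[a][b] += weight_b  # weight of the neighbour's event
    return {item: dict(neighbours) for item, neighbours in cooc.items()}

weighted_log = ["1000|1|0", "1000|2|2", "1001|1|0", "1001|2|1"]
print(build_weighted_cooccurrence(weighted_log))
# {'1': {'2': 15}, '2': {'1': 2}}
```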

For task 2 I did the same, but it builds this hash map online and returns the current best choices at each step.
But I'm very interested in how the top 1-3 teams solved these tasks. If I find out and they allow it, I'll post it here =)
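A rough sketch of what such an online variant could look like. The class `OnlineRecommender` and its method names are hypothetical, and it leaves out the weights and the top-seller fallback to stay short:

```python
from collections import Counter, defaultdict

class OnlineRecommender:
    """Updates co-occurrence counts per event and recommends after each one."""

    def __init__(self, k=3):
        self.k = k
        self.cooc = defaultdict(Counter)   # item -> Counter of neighbours
        self.sessions = defaultdict(list)  # session id -> items seen so far

    def observe(self, session_id, item):
        seen = self.sessions[session_id]
        # Update counts in both directions for every earlier item in the session.
        for other in seen:
            if other != item:
                self.cooc[item][other] += 1
                self.cooc[other][item] += 1
        seen.append(item)
        return self.recommend(session_id)

    def recommend(self, session_id):
        seen = self.sessions[session_id]
        merged = Counter()
        for item in seen:
            merged.update(self.cooc[item])
        for item in seen:
            merged.pop(item, None)  # never recommend what is already there
        return [item for item, _count in merged.most_common(self.k)]

rec = OnlineRecommender()
for item in ["1", "2", "3"]:
    rec.observe("1000", item)
print(rec.observe("1001", "1"))  # ['2', '3']
print(rec.observe("1001", "2"))  # ['3']
```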