Thursday, March 13, 2008

Welcome

Welcome All! Let me introduce myself. My name is Len Bertoni and I am a member of the team acmehill consisting of myself, my wife, my daughter and my son. We are competing for the netflix prize (http://www.netflixprize.com/) and have recently cracked the top 10.

Team acmehill downloaded the netflix prize data about 1 year ago, but only recently "got serious" about attempting to do anything with it, starting our leaderboard climb in Nov 2007. Our computing power is rather modest, consisting of a 2 year old Emachine (3GHz, 2GB RAM, 160G HD) purchased from Wal-Mart. However, since the contest is as much thinking about why and how people rate movies as it is about applying proven (and not so proven!) mathematical techniques to the provided data, we have not been limited by hardware too much. All programs are written in VB.Net using Visual Basic Express 2005. All code is written "from scratch" and no external mathematical libraries are used.

My family provides me with new ideas to test out while I subject them to countless "fun hours" of delving into matrix factorization (Dad, what is a matrix anyway?), neural nets (does anyone really know how they work anyway?), etc.

I have tried with varying success the following techniques (many do not survive in my final blend as I can not bear to blend 100s of files to produce my final prediction set):

KNN on Movies,
Matrix Factorization,
User/Movie Clustering,
Neural Nets,
Restricted Boltzmann Machines,
Anchoring,
Milestones,
Global Effects Removal

Right now I am implementing KNN on users, which is quite a bit more computer intensive, as the sheer number of users (1/2 million) makes nearest neighbor determination more problematic. Hopefully team acmehill will have some success with this approach.

58 comments:

Unknown said...

Congratulations, very impressive score.

Can you please elaborate a bit more (perhaps a couple of links) on these methods:

"Anchoring"
"Milestones"

by321 said...

Hi, I implemented both viewer and movie clustering, using completely symmetrical approaches. I found movie clustering is significantly worse than viewer clustering, I still don't know if this is just the nature of things or if there's a bug in my movie clustering code.

What's your experience ?

Acmehill said...

Anchoring is described here (http://en.wikipedia.org/wiki/Anchoring). It is basically the idea that people select 1 main attribute when considering a decision and then adjust the other attributes to fit their initial assessment. For example, you decide to watch an early movie by your favorite actor. He/She wasn't famous yet and isn't the star of the movie that you are watching, but his/her performance may still be the deciding factor in whether you like the movie or not. I employ anchoring by trying to determine at any given time (yes, I use the rating date) to determine what "features" a user is basing his/her ratings on.

Milestones (my word) represent movies that, if rated, have the greatest effect on the ratings of other movies (by the same user). For example, once you've seen "The Sting", you may have a very different opinion of other "hustle" movies. By calculating the effect previously rated (viewed?) movies have, the order of ratings becomes important.

Acmehill said...

dmnewbie,

I have also implemented movie and user clustering and have seen the same effect as you.

I think that it is because multiple movies watched by 1 user vary less than multiple users watching 1 movie. The STD of the ratings around the user average is less than the STD around the movie average (slightly). This, coupled with the higher support per movie makes customer clustering more effective. It harder to cluster movies because their increased STD and greater support make finding like clusters difficult.

by321 said...

Thanks for your response, that settled one thing for me.

Would you care to describe the topology of your neural network approach ? I'd like to implement it, but my neural network experience is based on image processing, and I'm not sure how best to setup the network for this task.

Unknown said...

Hello,

If you would be interested in working on a recommender system for web startup focused on music, please contact me.

Thank you.

Sincerely,
David
david@storylog.com

by321 said...

Hi,

Did you get anywhere with user-user KNN ?

I implemented this and let it generate predictions for 10% of the probe, on residues of global effects.

The RMSE of this 10% is about 0.92, not bad but it contributes very little (0.0002) when blended with other predictions.

Unknown said...

A simple suggestion for insane Netflix customers who forget they have seen a movie --- probably most important considering our aging population.

Can you say -- you have already seen this movie - but you would really like this one.....

Of course this could cut into their revenue?

John Boehm said...

I read about your problem on the NYT. If the datasets allow you to factor other ratings from the same user that generated the review of ND, could you discount their review of ND? It seems that ND is an exception to normal correlation and needs to have the rating adjusted for relevance. I have not looked at the datasets so I am not sure if you can make a secondary rating based on individual user ratings. If ratings can be associated with a unique username and their individual preferences taken into consideration; the ND score could be adjusted to how much they would have actually rated the film if it was consistent with their outer ratings.

TO be clear, I have no idea what I am talking about and know you have devoted hours to this problem so I have no position being right or wrong. It just struck me reading the column.

Mike said...

Hi Len,

Curious if you've found any correlation between the taste of particular viewers and that of particular critics.

Particularly with regards to 'the Napoleon Dynamite problem,' I would think that critics who have more established taste and a refined set of criteria for liking certain films might offer a window into predicting what casual viewers who often agree with these critics might also like.

Good luck,
M

Aaron Scott said...

Len, some ideas...

Maybe the answer is in the descriptons of the films. There are literally hundreds of categories each film can fall into. In the case of Napoleon Dynamite, I'm thinking of words like "quirky" and "offbeat" and "unusual" and "comedy" and "high school" and "nerdy," etc.

Of course, there is also the name of the stars in the films, the amount of money it took to make it, maybe even opening day (since expected blockbusters tend to be released at certain times of the year, I understand)

As you can imagine, I could go on for a long time.

It occurred to me that I often like films by the same director (Shymalan, for instance), at least for the most part.

My way of trying to solve the matter was by first identifying hundreds of sub-categories that classify certain films.

So, while "Napoleon" is listed as, I guess, a comedy by Netflix, there are many sub-categories that also describe it.

It would seem logical that a film that is is classified very nearly like "Napoleon" in terms of category and sub-categories would likely receive a similar rating from viewers.

To tweak this, you might add weights to certain categories. For instance, "Sci-Fi" may carry more weight than "teen" when you're talking about a film sci-fi film that features teens.

And as for sequels, I would imagine that most people at least like sequels if the same stars are involved--some form of nostalgia, I suppose.

OK, so here's how you do it....

You take a sampling of the films in a particular category (drama, say). Then, using as many sub-categories as you can, you label each film with as many descriptive terms as is suitable (there might be a list of, say, 100 sub-categories from which you would choose, since presumeably you wouldn't necessarily have to identify EVERY sub-category).

Then, once each film has a "profile" based on the sub-categories which describe it, you compare things.

Is there a particular director that winds up overwhelmingly in the 5-star ranks? Or one that doesn't? Are there any sub-categories that tend to be attached to low-rated films (by the way, "film budget" might be important, too).

Further, can you identify common things that make for a well-liked film?

For instance, if you find that someone who likes "Comedy" will have a strong statistical chance of also liking comedy films that have this or that star, and this or that director...that would mean that they could likely watch ANY comedy...and Netflix could recommend another comedy based on your findings and, sure enough, they'd likely enjoy it, too.

By the way, "Year made" and "Year Portrayed" are likely important. Some people prefer older films...and some people prefer films that are about a particular time period.

If these help, please e-mail me at aaronescott@yahoo.com.--$50,000 would be plenty! Good luck!

Unknown said...

I'd like to be able to view movies in confidence that there won't be offensive content... which I realize is a very different search parameter for each of us. Perhaps the solution would be allowing users to have a customizable 2-tier aspect of search to allow personalized blacklisting of (1) words and (2) actions. 'Twould require quite a complete indexing of movies... maybe some of the "family" movie rating sites (many can be found via Google) would sell their lists of things like "12 instances of D*** and 4 occurrences of F***." Quite an unusual job for folks claiming to be offended by such things, huh?

We use the same kind of keyword filtering on our spam firewall filtering appliance at work. Seems like some of the heuristics might crossover well from that kind of technology- so there's another idea of how to add someone else's wheel rather than reinventing it.

Example movie:

"Support Your Local Sheriff", Rated G, but with regular shots of and references to the local brothel and a closing scene where the "respected leaders of the community" emerge from the brothel in various states of undress. The way that the whore house is presented is more of the problem, but in general I would like to know ahead of family movie time if prostitution is going to be shown as normal and acceptable. Doesn't mean I won't watch the movie, but I might be a little more prepared as to audience and debriefing material : )

Please forgive me for offending anyone who thinks it's morally acceptable to pay for sex; I've just got this big Jesus thing going on in my life.

ferrini said...

I read the Times article today and found the problem interesting, especially the part about "Napoleon Dynamite" and other problem movies.

Seems to me those movies share something. They are exactly the sort of films, the self described hip types hate to allow as films they'd predictably like or not like, because they fall into the netherland between hip and gauche. The game is not judging the movie qua movie, but judging the crowd in the abstract, who may pan or rave. This is a crowd of fools we're talking about and their opinions are at least as much about being appreciated as wise cultural consumers as they are about actually liking or not liking a given offering.
I don't know a fig about programming nor what data you have to use indirectly to forecast a given customer's Napoleon Dynamite tastes but I do think you could create a model which is a referendum on said customer's scores on the other hard movies. See if he/she tends to pan or like those films and guess they're more likely to see Napoleon Dynamite that way, or any of the others in the category. The issue here is that the category defies prediction because that's the objective of the customer and you have to accept a relatively high rate of error but still improve by taking it into consideration.
Just a thought.

Ishmael said...

Hi Len,
Just read the article in the NYT. Interesting Napolean Dynamite problem. Have you considered simply assigning a random rating to this film (and others like it)?

When building recommendations, ignore all ratings for these films and give it a more-or-less random value.

Milan Mitrovich
Wrightsville, PA

Unknown said...

The only way to make a leap in accuracy is to truly analyze the movies themselves.
The Tempo, Rhythm, Color, Flicker rate, speed the characters talk.

I pondered this idea and found the following research paper

"Application of Computational Media Aesthetics Methodology to Extracting Color Semantics in Film"

"ABSTRACT
Using lm grammar as the underpinning, we study the ex-
traction of structures in video based on color using a wide
con guration of clustering methods combined with existing
and new similarity measures. We study the visualisation
of these structures, which we call Scene-Cluster Temporal
Charts and show how it can bring out the interweaving of
di erent themes and settings in a lm. We also extract color
events that lmmakers use to draw/force a viewer's atten-
tion to a shot/scene. This is done by rst extracting a set
of colors used rarely in lm, and then building a probabilis-
tic model for color event detection. We demonstrate with
experimental results from ten movies that our algorithms
are e ective in the extraction of both scene-cluster tempo-
ral charts and color events."

I'm not sure what kind of off the shelf software exits to extract this kind of data http://www.virage.com and Avid or Adobe Premier Pro and/or other movie editing programs is a start.

Perhaps a company would sponsor you with there software in hopes that you will win the prize. It would be excellent way to market/advertise their company.

Congrats on your New York Times mention and Good luck!

Dave said...

Len,

Would love to talk to you about a company I have founded. I'm a successful entrepreneur already, but have a great idea that you could potentially be a part of. Ping me if you'd like to chat. -Dave

Anonymous said...

以簡單的行為愉悅他人的心靈,勝過千人低頭禱告。........................................

早餐 said...

你的部落格不錯哦,支持!!!!@@ ........................................

Anonymous said...

大奶妹貼圖區0204性影片觀賞露點自拍淫婦女生如何自慰色情站成人笑話av激情網愛視訊美女淫蕩av成人色情電話辣妹視訊聊天性關係情色vcd自慰圖淫美成人論壇台灣色情論壇成人聊天室自拍裸女貼圖視訊成人免費a片影片av成人網成人色情色情台灣辣妹小穴太太陰毛色情訊息裸女自拍色情影片a片論壇性愛技巧美女脫胸罩情色天堂av寫真色情視訊聊天做愛視訊成人影片床上戲情色聊天網火辣情色台灣女優性愛秘笈台灣av女優手淫自慰影片

Anonymous said...

做愛自拍照免費線上自拍影片自拍短片國中自拍免費自拍線上免費做愛影片線上免費無碼短片線上免費貼片線上免費漫畫網線上免費影片a線上直播免費a片線上洪爺線上看無碼卡通線上看影片a線上看碼線上美女影片線上做愛小影片線上做愛遊戲線上做愛影片區線上情色看線上情色貼線上情色影片免費線上無碼免費av線上短片免費線上短片免費試看線上看情色影片網線上看情色網線上看做愛小影片免費線上看A線上看A片武俠小說 亂倫小說 嘟嘟成人哈比寬頻成人

V定治 said...

how do u do?................................................................

Anonymous said...

天可補,海可填,南山可移,日月既往,不可復追。......................................................

Anonymous said...

從人生中拿走友誼,猶如從生活中移走陽光........................................

Anonymous said...

想像力的力量比知識更加巨大。........................................

Anonymous said...

Habit is a second nature...................................................

王BlancaMcleroy said...

好文!值得一推~~加油哦!.............................................

欣怡 said...

認清問題就等於已經解決了一半的問題。.............................................

Anonymous said...

kk視訊內衣秀 免費情色av影片 情色小魔女自拍 免費情色影片無碼 qq美美色網免費看 ut辣妹哈啦視訊聊天室 34c美女寶貝視訊 ez聊天網 avdvd成人影城 aa台辣妹有約 34c玩美女人百分百貼影片區 一葉情貼圖片區 383 天下貼圖情趣 商品 一對多視訊大學生情人 gogo2sex視訊妹 線上免費視訊聊天 yam天空辣妹影音 視訊交友etude 比基尼辣妹 kk272視訊 A片-sex520 影音聊天室fm358 sex520 net視訊美女 080視訊聊天室how2 85st免費影片觀看 小弟弟成人娛樂網 辣妹視訊聊天室 90691 免費熟女人影片 g8mm 視訊小愛 比基尼美女的奶,玩美女人影音秀 383影音城a 情色視訊,貓貓論壇 啦咧華人視訊交友聊天室 38ga成人 辣妹有約溫泉 微風愛情館論壇aio 我愛78論壇影片 飯島愛免費影片 辣手美眉甜心寶貝直播貼片 免費a片觀看 視訊交友loveliness av影片下載 avhigh免費影片 限制性影片 hi5 tv 免費聊天情人fm358 a383girl影音城 亞洲視訊34c甜心寶貝直播貼片 免費聊天mmshow 南台灣視訊網愛聊天室

FrederickBovee0722麗玲 said...

馬丁路德:「即使知道明天世界即將毀滅,我仍願在今天種下一棵小樹。」......................................................

Anonymous said...

海鷗要高飛,必先遠退。花蜜要香醇,必先久釀。......................................................

曾文詠 said...

It is no use crying over spilt milk.............................................................

tyroneb said...

幸福不是一切,人還有責任。..................................................

瓊文瓊文 said...

More haste, less speed...................................................................

Anonymous said...

看看blog放鬆一下,工作累死了....<.................................................................

Anonymous said...

知識可以傳授,智慧卻不行。每個人必須成為他自己。......................................................................

Anonymous said...

欣賞是一種美德~回應是最大的支持^^.................................................................

Anonymous said...

所有的資產,在不被諒解時,都成了負債............................................................

Anonymous said...

人生是故事的創造與遺忘。............................................................

Anonymous said...

No one knows the weight of anothers burden. ............................................................

Anonymous said...

向著星球長驅直進的人,反比踟躕在峽路上的人,更容易達到目的。............................................................

Anonymous said...

培養健全孩子最好的方法是父母先成為健全的人。............................................................

Anonymous said...

文章是心情的反應~~祝妳天天寫的都是讓人開心的好文章哦!!............................................................

Anonymous said...

你快d upday個blog啦~我等左好耐啦/_愛死呀ling之fans上 ............................................................

Anonymous said...

連接生與死這兩塊陸地的橋樑是愛…….......................................................

Anonymous said...

Poverty tries friends.............................................................

Anonymous said...

果然很有意思呀....這當然要頂一頂呀 ̄﹏ ̄............................................................

Anonymous said...

Learning makes a good man better and ill man worse.............................................................

Anonymous said...

所有的資產,在不被諒解時,都成了負債.................................................................

Anonymous said...

時間就是塑造生命的材料。

家唐銘 said...

人不能像動物一樣活著,而應該追求知識和美德............................................................

凱許倫 said...

你不能左右天氣,但你可以改變心情.................................................................

Anonymous said...

真正的朋友不會把友誼掛在嘴巴上......................................................................

Anonymous said...

愛,拆開來是心和受兩個字。用心去接受對方的一切,用心去愛對方的所有。......................................................................

Anonymous said...

困難的不在於新概念,而在於逃避舊有的概念。......................................................................

Anonymous said...

上班好累哦,看看部落格轉換心情~~~先謝謝啦!!............................................................

Anonymous said...

累了嗎?來杯咖啡休息一下吧!...............................................................

Anonymous said...

如果成為一支火柴,也要點亮一個短暫的宇宙;如果是一隻烏鴉,也要叫疼閉塞的耳膜。............................. ...................................

Anonymous said...

婚姻對男人來說是賭他的自由,對女人而言卻是賭她的幸福。............................................................