last data update: 2011/10/17, 21:11

Website loading time

during the test: 1.73 s

cable connection (average): 1.97 s

DSL connection (average): 2.2 s

modem (average): 14.64 s

HTTP headers

Information about DNS servers

horicky.blogspot.com. 3600 IN CNAME blogspot.l.google.com.

Received from the first DNS server

Query for "horicky.blogspot.com"
You used the following DNS server:
DNS Name: ns2.kotinet.com
DNS Server Address: 212.50.192.226#53
DNS server aliases:

HEADER: opcode: QUERY, status: NOERROR, id: 13173
flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 4

QUESTION SECTION:
horicky.blogspot.com. IN ANY

AUTHORITY SECTION:
blogspot.com. 28739 IN NS ns2.google.com.
blogspot.com. 28739 IN NS ns3.google.com.
blogspot.com. 28739 IN NS ns4.google.com.
blogspot.com. 28739 IN NS ns1.google.com.

ADDITIONAL SECTION:
ns1.google.com. 205580 IN A 216.239.32.10
ns2.google.com. 205580 IN A 216.239.34.10
ns3.google.com. 205580 IN A 216.239.36.10
ns4.google.com. 205580 IN A 216.239.38.10

Received 181 bytes from address 212.50.192.226#53 in 137 ms

Received from the second DNS server

Query for "horicky.blogspot.com"
You used the following DNS server:
DNS Name: ns3.kotinet.com
DNS Server Address: 82.141.108.26#53
DNS server aliases:

Received 38 bytes from address 82.141.108.26#53 in 143 ms

Query for "horicky.blogspot.com"
Host horicky.blogspot.com not found: 5(REFUSED)
Received 38 bytes from address 82.141.108.26#53 in 142 ms

Subdomains (the first 50)

Typos (misspellings)

goricky.blogspot.com
boricky.blogspot.com
noricky.blogspot.com
joricky.blogspot.com
uoricky.blogspot.com
yoricky.blogspot.com
hiricky.blogspot.com
hkricky.blogspot.com
hlricky.blogspot.com
hpricky.blogspot.com
h0ricky.blogspot.com
h9ricky.blogspot.com
hoeicky.blogspot.com
hodicky.blogspot.com
hoficky.blogspot.com
hoticky.blogspot.com
ho5icky.blogspot.com
ho4icky.blogspot.com
horucky.blogspot.com
horjcky.blogspot.com
horkcky.blogspot.com
horocky.blogspot.com
hor9cky.blogspot.com
hor8cky.blogspot.com
horixky.blogspot.com
horivky.blogspot.com
horifky.blogspot.com
horidky.blogspot.com
horicjy.blogspot.com
horicmy.blogspot.com
horicly.blogspot.com
horicoy.blogspot.com
horiciy.blogspot.com
horickt.blogspot.com
horickg.blogspot.com
horickh.blogspot.com
horicku.blogspot.com
horick7.blogspot.com
horick6.blogspot.com
oricky.blogspot.com
hricky.blogspot.com
hoicky.blogspot.com
horcky.blogspot.com
horiky.blogspot.com
horicy.blogspot.com
horick.blogspot.com
ohricky.blogspot.com
hroicky.blogspot.com
hoircky.blogspot.com
horciky.blogspot.com
horikcy.blogspot.com
horicyk.blogspot.com
hhoricky.blogspot.com
hooricky.blogspot.com
horricky.blogspot.com
horiicky.blogspot.com
horiccky.blogspot.com
horickky.blogspot.com
horickyy.blogspot.com

Location

IP: 209.85.175.132

continent: NA, country: United States (USA), city: Mountain View

Website value

rank in the traffic statistics:

There is not enough data to estimate website value.

Basic information

website built using CSS

code weight: 90.4 KB

text to code ratio: 51%

title: Pragmatic Programming Techniques

description:

keywords:

encoding: UTF-8

language: en

Website code analysis

one-word phrases repeated at least three times

Phrase | Quantity
the | 122
to | 66
of | 54
is | 46
we | 45
and | 42
user | 33
that | 27
item | 25
can | 24
as | 17
be | 15
userX | 12
with | 12
space | 12
each | 12
matrix | 11
in | 11
by | 10
concept | 10
use | 10
has | 10
similarity | 10
rating | 10
an | 10
this | 9
The | 9
movies | 9
compute | 9
In | 9
if | 8
... | 8
users | 8
items | 8
at | 8
other | 8
set | 8
will | 7
interaction | 7
then | 7
into | 7
on | 7
all | 7
between | 7
number | 7
such | 6
idea | 6
same | 6
space. | 6
do | 6
computing | 6
from | 6
row | 6
user's | 5
function | 5
recommend | 5
them | 5
these | 5
are | 5
vector | 5
metadata | 5
top | 5
have | 5
tag | 5
need | 4
for | 4
Then | 4
how | 4
words, | 4
equivalent | 4
match | 4
they | 4
For | 4
similar | 4
also | 4
test | 4
one | 4
should | 4
cell | 4
We | 4
seen | 4
following | 4
itemA | 4
rate | 4
there | 4
group | 4
both | 4
map | 3
rows | 3
example, | 3
who | 3
SVD | 3
represents | 3
Notice | 3
know | 3
find | 3
product | 3
look | 3
dot | 3
recommender | 3
our | 3
more | 3
If | 3
value | 3
existing | 3
It | 3
And | 3
determine | 3
This | 3
association | 3
which | 3
model, | 3
To | 3
algorithm | 3
first | 3
follows | 3
it | 3
what | 3
cells | 3
column | 3
or | 3
rule | 3
their | 3
time, | 3
(or | 3
given | 3
itemY | 3
Now | 3
1, | 3

two-word phrases repeated at least three times

Phrase | Quantity
to the | 13
we can | 13
is to | 11
of the | 9
the user | 8
the item | 8
number of | 7
user and | 7
the concept | 6
the same | 6
compute the | 5
can be | 5
in the | 5
that we | 5
idea is | 5
the number | 5
at the | 5
to user | 5
of movies | 5
the similarity | 5
space to | 5
the test | 4
equivalent to | 4
the following | 4
need to | 4
movies that | 4
concept space | 4
The idea | 4
item space | 4
between user | 4
set of | 4
user space | 4
that is | 4
other words, | 4
an item | 4
In other | 4
the matrix | 4
the user's | 4
In this | 4
and then | 4
space and | 3
user to | 3
userX and | 3
is equivalent | 3
with the | 3
to be | 3
and the | 3
and item | 3
Notice that | 3
represents the | 3
dot product | 3
to determine | 3
the top | 3
For example, | 3
the cell | 3
we have | 3
concept space. | 3
as follows | 3
map the | 3
we know | 3
that the | 3
computing all | 3
to item | 3
If we | 3
there are | 3
we use | 3
the set | 3
can use | 3
rating on | 3
use the | 3
then compute | 3
this model, | 3
model, we | 3
group of | 3
such as | 3
into the | 3
item to | 3
do the | 3

three-word phrases repeated at least three times

Phrase | Quantity
the number of | 5
of movies that | 4
The idea is | 4
In other words, | 4
between user and | 4
idea is to | 4
the concept space. | 3
to the item | 3
is equivalent to | 3
the item space | 3
space to the | 3
user and item | 3
the set of | 3
this model, we | 3
user space to | 3

B tags

Now, given all the metadata of users and items, as well as their interactions over time, can we answer the following questions?

What is the probability that userX purchases itemY?
What rating will userX give to itemY?
What are the top k unseen items that should be recommended to userX?

Content-based Approach

In this approach, we make use of the metadata to categorize user and item and then match them at the category level. One example is recommending jobs to candidates: we can do an IR/text search to match the user's resume with the job descriptions. Another example is recommending an item that is "similar" to the one that the user has purchased. Similarity is measured according to the item's metadata, and various distance functions can be used. The goal is to find the k nearest neighbors of the item we know the user likes.

Collaborative Filtering Approach

In this approach, we look purely at the interactions between users and items and use those to perform our recommendation. The interaction data can be represented as a matrix in which each cell represents the interaction between a user and an item. For example, the cell can contain the rating that the user gives to the item (in case the cell is a numeric value), or the cell can be just a binary value indicating whether an interaction between user and item has happened (e.g. a "1" if userX has purchased itemY, and "0" otherwise). The matrix is also extremely sparse, meaning that most of the cells are unfilled. We need to be careful about how we treat these unfilled cells; there are two common ways:

Treat the unknown cells as "0", which is equivalent to the user giving a rating of "0". This may or may not be a good idea depending on your application scenario.
Guess what the missing value should be. For example, to guess what userX will rate itemA given that we know his rating on itemB, we can look at all users (or those who are in the same age group as userX) who have rated both itemA and itemB, then compute an average rating from them. Use the average ratings of itemA and itemB to interpolate userX's rating on itemA given his rating on itemB.

User-based Collaborative Filter

In this model, we do the following:

Find a group of users that is "similar" to user X
Find all movies liked by this group that haven't been seen by user X
Rank these movies and recommend them to user X

This introduces the concept of user-to-user similarity, which is basically the similarity between two row vectors of the user/item matrix. To compute the K nearest neighbors of a particular user, a naive implementation is to compute the "similarity" to every other user and pick the top K. Different similarity functions can be used: the Jaccard distance is the size of the intersection of the movies both users have seen divided by the size of the union, while Pearson similarity first normalizes the users' ratings and then computes the cosine distance (a sketch of the naive search follows the list below). There are two problems with this approach:

Comparing userX and userY is expensive, as they have millions of attributes
Finding the top k users similar to userX requires computing all pairs of userX and userY
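The post shows no code at this point, but a minimal Python sketch of the naive user-based search, assuming binary interaction data (user mapped to the set of movies seen) and entirely made-up user and movie names, might look like this:

def jaccard(seen_a, seen_b):
    """Size of the intersection divided by size of the union of two movie sets."""
    if not seen_a and not seen_b:
        return 0.0
    return len(seen_a & seen_b) / len(seen_a | seen_b)

def top_k_similar_users(target, interactions, k=10):
    """Naive search: compare the target user against every other user, keep top K."""
    target_seen = interactions[target]
    scores = [(jaccard(target_seen, seen), user)
              for user, seen in interactions.items() if user != target]
    scores.sort(reverse=True)
    return [user for _, user in scores[:k]]

def recommend(target, interactions, k=10):
    """Movies liked by the target's nearest neighbours that the target has not seen."""
    target_seen = interactions[target]
    counts = {}
    for user in top_k_similar_users(target, interactions, k):
        for movie in interactions[user] - target_seen:
            counts[movie] = counts.get(movie, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

# Toy binary interaction data (user -> set of movies seen), invented for the example.
interactions = {
    "userX": {"m1", "m2", "m3"},
    "userA": {"m1", "m2", "m4"},
    "userB": {"m2", "m3", "m5"},
    "userC": {"m6"},
}
print(recommend("userX", interactions, k=2))   # unseen movies liked by userX's neighbours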

Locality Sensitive Hashing and Minhash

It will be expensive to permute the rows if the number of rows is large. Remember that the purpose of h(c1) is to return the row number of the first row of column c1 that is 1. So we can scan each row of c1 to see if it is 1, and if so apply a function newRowNum = hash(rowNum) to simulate a permutation, taking the minimum of the newRowNum values seen so far. As an optimization, instead of doing one column at a time, we can do it a row at a time (see the sketch below).

To solve problem 2, we need to avoid computing every other user's similarity to userX. The idea is to hash users into buckets such that similar users will fall into the same bucket. Therefore, instead of comparing against all users, we only compute the similarity of those users who are in the same bucket as userX. The idea is to horizontally partition the signature column into b bands, each with r rows. By picking the parameters b and r, we can control the likelihood (as a function of similarity) that two users will fall into the same bucket in at least one band.
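As a rough illustration only (none of this is from the original post; the hash construction, the band and row parameters, and the toy columns are invented), here is a Python sketch of the row-at-a-time minhash signatures and the banding step:

import random

PRIME = 4_294_967_311   # a prime larger than any row index, used to wrap the hashes

def make_hashes(n, seed=42):
    """Build n simulated row permutations of the form row -> (a * row + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signatures(columns, hashes, n_rows):
    """Row-at-a-time minhash: hash each row number once, then update the running
    minimum of every column that has a 1 in that row."""
    sigs = {user: [PRIME] * len(hashes) for user in columns}
    for row in range(n_rows):
        remapped = [(a * row + b) % PRIME for a, b in hashes]
        for user, rows_with_1 in columns.items():
            if row in rows_with_1:
                sigs[user] = [min(s, h) for s, h in zip(sigs[user], remapped)]
    return sigs

def lsh_buckets(signatures, b, r):
    """Split each signature into b bands of r rows; users sharing any band collide."""
    buckets = {}
    for user, sig in signatures.items():
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets.setdefault(key, set()).add(user)
    return buckets

# Toy data: each user is a column, represented by the set of item rows that are 1.
columns = {"userX": {0, 2, 5}, "userY": {0, 2, 6}, "userZ": {9}}
hashes = make_hashes(n=8)
sigs = minhash_signatures(columns, hashes, n_rows=10)
candidate_buckets = lsh_buckets(sigs, b=4, r=2)   # b * r must equal the number of hashes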
Item-based Collaborative Filter

If we transpose the user/item matrix and do the same thing, we can compute item-to-item similarity. In this model, we do the following:

Find the set of movies that user X likes (from the interaction data)
Find a group of movies that are similar to the set of movies that we know user X likes
Rank these movies and recommend them to user X

It turns out that computing an item-based collaborative filter has more benefit than computing user-to-user similarity, for the following reasons:

The number of items is typically smaller than the number of users
While users' tastes change over time, and hence the similarity matrix needs to be updated more frequently, item-to-item similarity tends to be more stable and requires fewer updates

Singular Value Decomposition

If we look back at the matrix, we can see that the matrix multiplication is equivalent to mapping an item from the item space to the user space. In other words, if we view each of the existing items as an axis in the user space (notice that each user is a vector of their ratings on the existing items), then multiplying a new item with the matrix gives a vector of the same form as a user. We can then compute a dot product between this projected new item and a user to determine their similarity. It turns out that this is equivalent to mapping the user to the item space and computing the dot product there. In other words, multiplying by the matrix is equivalent to mapping between item space and user space.

Now let's imagine there is a hidden concept space in between. Instead of jumping directly from user space to item space, we can think of jumping from user space to a concept space, and then to the item space. Notice that here we first map the user space to the concept space and also map the item space to the concept space. Then we match user and item in the concept space. This is a generalization of our recommender.

We can use SVD to factor the matrix into two parts. Let P be the m by n matrix (m rows and n columns). P = UDV, where U is an m by m matrix whose columns are the eigenvectors of P*transpose(P), and V is an n by n matrix whose rows are the eigenvectors of transpose(P)*P. D is a diagonal matrix containing the singular values, i.e. the square roots of the eigenvalues of P*transpose(P) (equivalently, of transpose(P)*P). In other words, we can decompose P into U*squareroot(D) and squareroot(D)*V. Notice that D can be thought of as the strength of each "concept" in the concept space, and its values are ordered by magnitude in decreasing order.

If we remove some of the weakest concepts by setting them to zero, we reduce the number of non-zero elements in D, which effectively generalizes the concept space (it focuses on the important concepts). Computing the SVD of a matrix with large dimensions is expensive. Fortunately, if our goal is to compute an SVD approximation (with k non-zero diagonal values), we can use the random projection mechanism as described here.
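A small sketch of this truncated decomposition, using numpy (which the post does not mention) on a made-up 4x3 rating matrix and keeping only the k strongest concepts:

import numpy as np

P = np.array([[5, 4, 0],
              [4, 5, 1],
              [0, 1, 5],
              [1, 0, 4]], dtype=float)        # rows = users, columns = items (toy data)

U, d, Vt = np.linalg.svd(P, full_matrices=False)

k = 2                                         # keep only the k strongest "concepts"
U_k, d_k, Vt_k = U[:, :k], d[:k], Vt[:k, :]

# Split D between the two factors, as in U*sqrt(D) and sqrt(D)*V above.
users_in_concept_space = U_k * np.sqrt(d_k)
items_in_concept_space = (np.sqrt(d_k)[:, None] * Vt_k).T

P_approx = U_k @ np.diag(d_k) @ Vt_k          # low-rank approximation of the ratings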

Association Rule Based

We represent each user as a basket and each viewing as an item (notice that we ignore the rating and use a binary value). After that, we use an association rule mining algorithm to detect frequent itemsets and the association rules. Then, for each user, we match the user's previously viewed items against the set of rules to determine what other movies we should recommend.
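A toy sketch of this basket/rule idea, restricted to single-item rules over pairs; the thresholds, baskets and function names are invented for illustration and are not from the post:

from itertools import combinations
from collections import Counter

def pairwise_rules(baskets, min_support=2, min_confidence=0.5):
    """Mine rules of the form {a} -> b from co-viewing counts (pairs only)."""
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        item_count.update(basket)
        pair_count.update(combinations(sorted(basket), 2))
    rules = []
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return rules

def recommend(viewed, rules):
    """Fire every rule whose left-hand side the user has already viewed."""
    return {rhs for lhs, rhs, _ in rules if lhs in viewed and rhs not in viewed}

# Each basket is the set of movies one user has viewed (toy data).
baskets = [{"m1", "m2"}, {"m1", "m2", "m3"}, {"m2", "m3"}, {"m1", "m3"}]
rules = pairwise_rules(baskets)
print(recommend({"m1"}, rules))               # movies implied by having seen m1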

Evaluate the recommender

After we have a recommender, how do we evaluate its performance? The basic idea is to separate the data into a training set and a test set. For the test set, we remove certain user-to-movie interactions (change certain cells from 1 to 0), pretending the user hasn't seen those items. Then we use the training set to train a recommender and fit the test set (with the removed interactions) to it. Performance is measured by how much the recommended items overlap with the ones we removed. In other words, a good recommender should be able to recover the set of items that we removed from the test set. Leverage tagging information on items
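Returning to the evaluation procedure described above, a minimal Python sketch of the hold-out idea; the recommend() callable, its signature and the per-user hold-out count are assumptions, not part of the post:

import random

def evaluate(interactions, recommend, holdout=1, top_k=5, seed=0):
    """Average fraction of hidden items that the recommender recovers per user."""
    rng = random.Random(seed)
    recalls = []
    for user, seen in interactions.items():
        if len(seen) <= holdout:
            continue
        hidden = set(rng.sample(sorted(seen), holdout))       # flip these 1s to 0s
        training = {u: (s - hidden if u == user else s)
                    for u, s in interactions.items()}
        recommended = set(recommend(user, training, top_k))   # assumed signature
        recalls.append(len(recommended & hidden) / len(hidden))
    return sum(recalls) / len(recalls) if recalls else 0.0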

U tags

I tags

images

file name | alternative text
My Photo
Powered by Blogger

headers

H1

H2

Thursday, September 1, 2011

Sunday, August 28, 2011

Saturday, July 9, 2011

Thursday, April 21, 2011

Saturday, March 19, 2011

Thursday, March 17, 2011

Sunday, December 5, 2010

About Me

Links

Previous Posts

Archives

H3

Thursday, September 1, 2011

Sunday, August 28, 2011

Saturday, July 9, 2011

Thursday, April 21, 2011

Saturday, March 19, 2011

Thursday, March 17, 2011

Sunday, December 5, 2010

About Me

Links

Previous Posts

Archives

H4

H5

H6

internal links

address | anchor text
12:59 PM
Links to this post
9:37 PM
Links to this post
4:35 PM
Links to this post
10:29 PM
Links to this post
6:47 PM
Links to this post
Linear and Logistic regression
Neural Network
Support Vector Machine
Decision tree
data mining
machine learning
predictive analytics
10:59 PM
Links to this post
bayesian networks
linear regression
neural networks
decision trees
support vector machines
nearest neighbors
association rules
principal component analysis
Hadoop, Map/Reduce
sequential algorithm can be restructured to run in map reduce
Business Intelligence
data mining
scalability
8:36 AM
Links to this post
Recommendation Engine
Scale Independently in the Cloud
Fraud Detection Methods
K-Means Clustering in Map Reduce
Compare Machine Learning models with ROC Curve
Predictive Analytics Conference 2011
BI at large scale
Map Reduce and Stream Processing
Scalable System Design Patterns
BigTable Model with Cassandra and HBase
October 2007
November 2007
December 2007
January 2008
February 2008
March 2008
April 2008
May 2008
June 2008
July 2008
August 2008
October 2008
November 2008
December 2008
January 2009
April 2009
May 2009
July 2009
August 2009
September 2009
October 2009
November 2009
December 2009
January 2010
February 2010
March 2010
May 2010
June 2010
July 2010
August 2010
October 2010
November 2010
December 2010
March 2011
April 2011
July 2011
August 2011
September 2011
Atom

external links

address | anchor text
random projection mechanism as described here
2 Comments
1 Comments
0 Comments
3 Comments
2 Comments
a good paper on their PLANET project
0 Comments
a big portion of machine learning algorithm
Apache Mahout project
implemented an impressive list of algorithms
4 Comments
My Photo
Ricky Ho
View my complete profile
Google News
Edit-Me
Edit-Me
Powered by Blogger