Tag: datamining

Like-minded users and Dynamic Links

Intro
You begin to browse the online catalog and realize, "hey, what's-his-face also bought this toy!" If you're like me, most of the time you'll skim through an online retailer such as Amazon.com starting at your initial search and then following all the recommended books, toys, gadgets, and gizmos that other users shopped for.

What will this short piece focus on?
Abstracting the idea behind a typical recommendation system and applying it to recommending links. "Huh?" Yeah, I said that too when I picked up this paper, From User Access Patterns to Dynamic Hypertext Linking, written back in 1996. The paper focuses on a method for recommending links to users based on their session information. But let's start from the beginning.

What is a recommendation system and what's it used for?
In the opening paragraph I put together a short example. In that example, you (or any user) visit Amazon.com and select a product; the product page comes up and a section under the product's image is displayed. The section is usually titled "Users who bought this item also bought…". This section is the "recommendation," since the items are ones Amazon.com believes you will be interested in, based on products you previously viewed and are currently viewing. See, not so hard.

Amazon.com uses a cool, neat method to recommend products. No, they can't simply choose something at random and recommend it to you; that would not serve any purpose. Amazon.com and other retail sites use recommendation algorithms, tools which allow these sites to intelligently recommend items to you. How is this done? Well, in a nutshell: when you initially visit the site, Amazon.com begins to track what you're looking at. Based on what you've already seen, a profile is created for you.

The profile normally contains a list of items you have clicked on (a click here means you're interested in the item), and your list is compared with all the other profiles Amazon has collected. When a set of users with somewhat similar profiles is found, we look at the items those users have visited and choose an item from their lists that you have yet to visit. The idea behind it: "like minds think alike," and if like minds think alike, like minds will buy alike too!

Back to the paper
So how is this related in any way to a dynamic link generator? Easy, both are the same. The overall idea behind the paper is, "like minds browse alike." In other words, if one user visits certain sections of a web site and another user visits somewhat similar pages, we can predict what the second user will select next based on this data. And if we can predict what the user will visit, we can display a link that might persuade the user to visit and hopefully purchase a product. Let's take a simple example.

A user, Armando, goes to Amazon.com, visits the "Sports" category, starts to look around the "Sports" section, and then decides, "hey, this blows, let's check out the electronics section." Armando clicks on the "Electronics" link and then decides to go into the "Pets" section. He buys a toy for Snoopy and then remembers he needs to buy something electronic to balance things out. He heads back to the electronics section.

The session's web log for Armando would look something like this:

  • GET /www.somestorehere.com/
  • GET /www.somestorehere.com/electronics
  • GET /www.somestorehere.com/petgoods
  • GET /www.somestorehere.com/electronics

Given the log above, we convert it into a vector. How do we do this?

From Logs to Vectors
I hope you're not confused yet. If you're not confused, great! It's about to get much easier. The authors propose that we use an ID system to track which pages each user visits. Each page on the site will have a unique ID. Again, another example.

Take a look at the table below and pay close attention to the ID each page contains. Notice that each page’s ID is unique.

Page                     Page ID
index.php                100
electronics/index.php    200
petgoods/index.php       300
We now have the IDs for each page. It's time to create the vector. The vector for Armando will look like this:

<(100, x), (200, x), (300, x)>


A quick note: the paper does not focus on the order in which the pages were accessed, so I won't pay any attention to it either.

As you can see, the vector contains all the pages the user visited, with duplicates removed. You can also tell that each page entry contains an x. You're probably wondering what the x is all about.

What the x is all about (weights as interest)
The "x" is a way to tell how important the page is. How do the authors rate importance? By the number of times the user visited the page during the session. Can this be something else? Sure, we can use any method we believe captures how important a page is to a user. Some people use the time spent on the page, while others literally ask users to click on a 1-5 rating scale while on the page. But let's stick to what the authors propose: page visits during the session.

Going back to the example, we look at the logs again for Armando and count the number of times Armando visited each section. We notice that he visited "index.php" once, "electronics/index.php" twice, and "petgoods/index.php" once. We take this data and place the number of times each page was accessed into the "x" value for each ID. Our final vector looks something like this:

<(100, 1), (200, 2), (300, 1)>

What do we do with this? Nothing! Trash it and call it a day. No, no, come back. The vector is used to group like-minded users based on those page IDs.
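To make the preprocessing concrete, here is a minimal Ruby sketch of the idea. The page-to-ID table and the method name are my own illustration, not anything from the paper:

# Turn a session log into a weighted vector of page_id => visit_count.
# PAGE_IDS and session_to_vector are illustrative names, not the paper's.
PAGE_IDS = {
  "/"            => 100,
  "/electronics" => 200,
  "/petgoods"    => 300
}

def session_to_vector(requested_paths)
  weights = Hash.new(0)
  requested_paths.each do |path|
    id = PAGE_IDS[path]
    weights[id] += 1 if id   # count visits per page ID, skip unknown pages
  end
  weights
end

armando_log = ["/", "/electronics", "/petgoods", "/electronics"]
p session_to_vector(armando_log)   # => {100=>1, 200=>2, 300=>1}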

Like-Minds Unite!
We now get into the meat and potatoes of the paper. The previous "stuff" was just that, "stuff": it introduced how to clean the data and how to format it, or what the paper calls "preprocessing." In this section we introduce what the paper uses to cluster the data: the "Leader Algorithm".

Leader Algorithm

The leader algorithm takes a collection of vectors and begins with the first one. With that initial vector in hand, it compares it against all the clusters already available. Since there are none yet, the algorithm creates a new cluster and continues with the next vector. For each following vector, if it resembles a cluster closely enough (measured by how close the user's visited pages are to the visited pages of the cluster's members), the vector is placed into that cluster; otherwise a new cluster is created. This continues until all vectors are analyzed. That's it!
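Here is a rough Ruby sketch of that loop, assuming the session vectors are hashes of page ID to weight. The Euclidean distance and the threshold below are placeholders of mine, not necessarily the paper's exact similarity measure:

# Leader-style clustering sketch over {page_id => weight} vectors.
# The distance function and threshold are illustrative assumptions.
def distance(a, b)
  keys = a.keys | b.keys
  Math.sqrt(keys.sum { |k| (a.fetch(k, 0) - b.fetch(k, 0))**2 })
end

def leader_cluster(vectors, threshold)
  clusters = []   # each cluster: { leader: vector, members: [vectors] }
  vectors.each do |v|
    home = clusters.find { |c| distance(c[:leader], v) <= threshold }
    if home
      home[:members] << v                        # close enough: join the cluster
    else
      clusters << { leader: v, members: [v] }    # otherwise start a new one
    end
  end
  clusters
end

sessions = [
  { 100 => 1, 200 => 2, 300 => 1 },              # Armando
  { 101 => 1, 103 => 2, 106 => 3, 100 => 1 }     # Beth
]
p leader_cluster(sessions, 2.0).size             # => 2, separate clusters

One pass over the vectors is all it takes, which is where the speed claim below comes from.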

A brief note on the paper's data sets: the authors used only specific users from the data they collected. They only kept users that accessed more than 3 pages per session, and they limit the size of each cluster to a value they do not specify.

Benefits
Faster than other clustering algorithms, since it runs over the entire data set only once.

Creating the Link
Now that we have our group of people, let's create a dynamic link. Let's do this with another example which builds on the previous Armando example.

We know that Armando accessed the index page, the electronics section, and the pets section. His overall vector was <(100, 1), (200, 2), (300, 1)>. Since there are no existing clusters to compare him to, Armando is placed into his own cluster until another person visits the site.

Beth comes to the site; she visits a few pages, and after the third page the system kicks in. Her partial session log looks like this:

  • GET /www.somestorehere.com/beautysupply/
  • GET /www.somestorehere.com/womensclothing/
  • GET /www.somestorehere.com/kidsclothing/
  • GET /www.somestorehere.com/
  • GET /www.somestorehere.com/kidsclothing/
  • GET /www.somestorehere.com/beautysupply/
  • GET /www.somestorehere.com/kidsclothing/

and her vector after preprocessing is: <(101, 1), (103, 2), (106, 3), (100, 1)> (new unique IDs are assigned to the pages she visited).

Since her vector shares no resemblance to Armando's, she's placed into a separate cluster.

Now Ambi comes to the site and begins to browse. After her third page view, her resulting partial session vector is: <(103, 1), (106, 1), (101, 1), (201, 4)>

Since Ambi has a strong similarity to Beth in the pages she's visited, we can choose one of Beth's pages that Ambi has not visited (100, for example) and present Ambi with a prominent link to that page.
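A small Ruby sketch of that last step, again with my own overlap measure and hash-style vectors rather than anything prescribed by the paper:

# Pick a page seen by the most similar earlier visitor that the current
# user has not visited yet. overlap() and recommend_link() are my names.
def overlap(a, b)
  (a.keys & b.keys).size
end

def recommend_link(current_vector, other_vectors)
  best = other_vectors.max_by { |v| overlap(current_vector, v) }
  return nil unless best
  (best.keys - current_vector.keys).first   # an unvisited page to promote
end

beth = { 101 => 1, 103 => 2, 106 => 3, 100 => 1 }
ambi = { 103 => 1, 106 => 1, 101 => 1, 201 => 4 }
p recommend_link(ambi, [beth])   # => 100, the link we would show Ambi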

Summary
The benefits to a small mom-and-pop shop are apparent; Amazon.com, I'm sure, has better algorithms than what was presented here. The paper is short but to the point. It shows how to capture the data, process it into usable vectors, and then cluster it using the leader algorithm to recommend links to users.

Given that the paper was written in 1996, it shows a very simple method for creating dynamic links based on like-minded users (users with the same pattern of clicks), and it proposes an algorithm which for its time might have been "cutting-edge."

Armando Padilla

Part I – Google News Personalization: Scalable Online Collaborative Filtering.

These are some notes for my data mining class. I haven’t read the entire article but this is what I have so far.

Main Focus:
The paper's main focus is Google's news web site, http://news.google.com, and how to recommend news stories to users. The focus is on a recommendation system that allows the user to find useful information that can't be easily located using a search engine.

Why Google News
Google News was used due to the number of users of the service (according to the article, millions a month) and the amount of "item churn" produced by the site. To clarify, item churn is the process by which news stories are added to and removed from the site. In this case item churn is high, since news articles are added to the site every few minutes and existing articles are removed from the site just as quickly.

Rating System vs Points Earned System
The paper touches on a few algorithms, but I'll go over the overall idea. The system treats a click on an article as a positive point. For example, if I click on an article, the article gets one point, while a non-click gives it no points at all. This is in contrast to a rating system much like that of Amazon.com, where a user gives an item a 1-5 rating.

Limitations
The proposed process can't give an article a negative rating. If the user clicked on the article yet hated it, there is no way to determine this; the article will retain the positive rating. Compare that to Amazon's and Netflix's systems, where a rating of 1 can be a sure indicator of a negative feeling towards the media.

The Math (Notations Only)
N = total number of users.
U = collection of click histories of all users.
Cu = the click history set of a specific user u.
s = stories within the click-history set.

What is a click history? A click history is the collection of articles the user has gone to in a specific amount of time. For example, if I click on the news article "Snoopy, hacked the Pentagon", then I read "Armandos Dog hacked the Pentagon", and finally I read "Worlds greatest dog is computer wiz – owner say, 'Pssh i know'", the click history for user1 will be:

Cu = {“Snoopy, hacked the Pentagon”,
“Armandos Dog hacked the Pentagon”,
“Worlds greatest dog is computer wiz – owner say,’Pssh i know'”}

And s = the stories above in no particular order.
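Just to make the notation concrete, this is how I picture the structures in Ruby. The user key and layout here are my own illustration, not the paper's:

# U as a hash of per-user click histories; Cu is one user's history.
U = {
  "user1" => [
    "Snoopy, hacked the Pentagon",
    "Armandos Dog hacked the Pentagon",
    "Worlds greatest dog is computer wiz - owner say, 'Pssh i know'"
  ]
}
Cu = U["user1"]     # click history of a specific user u
N  = U.keys.size    # total number of users
puts "N = #{N}, stories in Cu = #{Cu.size}"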

I'll add more and revise as I keep reading the paper.
http://www2007.org/papers/paper570.pdf

Armando Padilla

Apriori Algorithm – Ruby

AA is not only where alcoholics get help but also where we data miners go when we need a basic algorithm to locate all the combinations that appear a certain number of times in a set of information.

The Apriori algorithm takes an initial data set, beats it to a pulp, strains it, and gives you all the possible ways of cooking a Hot Pocket. With cheese, no cheese, with meat, or no meat, this algorithm gets you every possible combination and the number of times it happens in the data set.

Confused? Good! Since I like examples to understand things, let's use one. Let's say you go to the market. You buy some milk, a can of decaf, and some dog food. The supermarket chain records this info, not only for your transaction but for everyone's (talk about big brother looking over you). After about a month the supermarket chain has all this info on the things people buy. They see how some people buy milk with cookies and how some people buy squeaky toys with condoms (don't ask). But they want to find more info. They pop open their computer, run the Apriori algorithm, and ask it to find all combinations that occur in the data set more than 2 times. After running the algorithm they see that milk and tomatoes sold a lot together. Good market info if you want to have a sale on tomatoes so you can get rid of your milk 🙂 They also found that cookies with toothpaste sold in high volumes as well. In data mining terms we use the word "frequent" to describe how often an itemset (a combination of items) was sold.

So in closing, it's 1 AM and the algorithm is coded below. It will allow you to feed in a data set and receive the number of times each combination occurs, for every combination that meets your minimum supporting threshold (the minimum number of times you want the combination to appear).

Before I get to the code, let me go over it first. The code below reads a file path that we pass in. I have two files you can use: one is the test.txt file and the other is the retail.txt file. When running the test.txt file the system finishes very quickly and stores the itemsets in an output file of your choice. On the other hand, when we run retail.txt and set the minimum_sup to less than 2000, the application runs for 2 hours, a sign of either a very poor implementation or a very bad language to use for data mining.
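For reference, the script expects three arguments: the input data set, an output file, and the minimum support count. So a quick run against the small file looks something like this (the output file name and the threshold of 2 are just examples):

ruby apriori.rb test.txt output.txt 2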

Algorithm
1. Get all the unique items in the data set.
2. For each unique item (or itemset), count the number of records that contain it.
3. If the number of times it appears meets or exceeds the min_sup, keep it, and prune the values that didn't meet the min_sup.
4. With the new set of values, join them to form your new candidate list.
5. Check if the new set has any subsets that are not frequent.
6. Trash the itemsets that have subsets that are not frequent.
7. With the new, cleaned set, go to #2.
8. Continue until the new list is null; this happens when every itemset in the new list has a subset that is not frequent.
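Before the full script, here is a tiny Ruby illustration of the join-and-prune idea in steps 4-6 (the itemsets are made up for the example):

# Join two frequent itemsets that share a prefix, then prune the candidate
# if any known-infrequent subset is contained in it.
a = [1, 2, 3]
b = [1, 2, 4]
candidate = (a | b).sort                  # => [1, 2, 3, 4]
infrequent = [[3, 4]]                     # pretend {3,4} failed min_sup earlier
prune = infrequent.any? { |s| (s - candidate).empty? }
puts prune ? "prune #{candidate.inspect}" : "keep #{candidate.inspect}"
# prints "prune [1, 2, 3, 4]" because its subset {3,4} is not frequent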

Ruby Code:
##########
# Apriori Algorithm Implementation
#
# @author Armando Padilla, mando81@prodigy.net
# @homepage http://www.armando.ws
#
# v0.0.2 month.week.day
##########

#require 'profiler'
#Profiler__::start_profile
start = Time.now
puts "Started - "+start.to_s

#############
# Create the new list (Lk) for the next iteration
#
# @param itemSets Lk-1 list that has been cleared of all
# non-frequent sets.
#############
def createList(itemSets)

###
# HOLDS THE NEW LIST
###
listX = Array.new()

if !itemSets.nil?

#Check if the initial elements are the same
#if they are the same then
#Check if the last item is less than the next row’s last item.
#if its less then join both items (union on both items)

k = 1
itemSetSize = itemSets.size

itemSets.each do |list1|

n=k
until n == itemSetSize

#Check if we can combine the lists
#if initial run yes lets combine the list
#otherwise follow the steps

if itemSets[n].size == 1

if !list1.eql?(itemSets[n])

@row = Array.new()
@row.push(list1,itemSets[n])
@row = @row.sort
@row = @row.flatten!

###
# DO NOT ADD IF WE ALREADY HAVE THE SET
###
if !listX.include?(@row)
listX.push(@row)
end

end

else

####
#if the all the indexes before the last are the same
#then combine ONLY IF the last indexes are the same
####
toIndex = list1.size-2

###
# GET ALL THE VALUES UP TO THE LAST INDEX
# EG. [1,2,3,4,5] and [1,2,3,4,6]
# BECOMES [1,2,3,4] and [1,2,3,4]
###

arrayA = list1[0..toIndex]
arrayB = itemSets[n][0..toIndex]

if arrayA.eql?(arrayB)

totalCountA = list1.size-1
totalCountB = itemSets[n].size-1

if list1[totalCountA] < itemSets[n][totalCountB]

@row = Array.new()
@row.push(list1|itemSets[n])
@row = @row.sort.uniq
@row = @row.flatten!

###
# DO NOT ADD IF WE ALREADY HAVE THE SET
###
if !listX.include?(@row)
listX.push(@row)
end

end

end
#puts “=========”

end

n = n+1
end

k = k+1
end

end

return listX

end

##############
# Returns the item sets that support the minimum supporting threshold.
#
##############
def getFrequentItemSets(itemSets, minSup, fileInputPath)

########
# RUNS THROUGH THE DATA SET AND GETS THE COUNT
######
list = Array.new()

##
# FOREACH ITEM CHECK HOW MANY TIMES IT APPEARS IN THE DATA SET
##
itemSets.each do |itemSet|

itemSetSize = itemSet.size
count = 0

###
# FOR EACH ROW IN THE ITEM SET
# RUN THROUGH EACH ROW IN THE DATA SET
# AND DETERMINE IF THE ITEM SET APPEARS IN THE RECORD
###
@dataSet.each do |set|

currentRow = set

if itemSetSize == 1

if currentRow.index(itemSet[0])
count = count+1
end

else

###
# FOR EACH ROW CHECK IF THE ITEMSET IS PRESENT
#
# ARMANDO – USING INTERSECTION TO COMPARE
# =======================================
# ITEMSET = A
# CURRENT ROW = B
# TAKE THE INTERSECTION OF A AND B
# IF SIZE OF INTERSECTION IS EQUAL TO SIZE OF ITEMSET THEN
# WE LOCATED THE ITEM SET IN THE ROW
###

intersection = []
itemSet.each do |a|

set.each do |b|

if a == b
intersection.push(a)
end

end

end

if intersection.size >= itemSetSize
#if (itemSet&currentRow).size >= itemSet.size

count = count+1

end

end

end #end – run through data set

###
# DETERMINE IF IT MEETS THE MIN_SUP
# if the count is equal to or greater than the
# minimum supporting threshold, add it to our new list (Lk)
###
if count >= minSup

@row = Array.new()
@row.push(itemSet)
list.push(itemSet)

string = ""
itemSet.each do |x|
string = string+" "+x.to_s
end
@file.puts string+" :"+count.to_s
@file.flush

else

irow = Array.new()
irow.push(itemSet)
@infrequentSets.push(irow)

end
p(itemSet)

end #end – run though each item

return list.sort

end #end – getFrequentItemSets

##############
# Returns boolean when the item set is either
# frequent or infrequent
##############
def isInfrequentSets(itemSets)

globalCount = 0
###
# FOREACH ITEMSET CHECK IF ITS IN THE INFREQUENT ITEM SET
###
itemSets.each do |itemSet|

count = 0
@infrequentSets.each do |itemSetX|

###
# IF THE CANDIDATE SHARES AT LEAST 2 SUB-ITEMS WITH AN INFREQUENT SET
# COUNT IT AS CONTAINING AN INFREQUENT SUBSET
###
if(itemSet&itemSetX[0]).size >= 2
count = count+1
end

end

if count != 0
globalCount = globalCount + 1
end

end

if globalCount == itemSets.size

return true

end

end

#####
# Reads the input from a text file and create an array
# Each line is an array of elements.
# Elements broken out by white space.
#
# @param fileInputPath Path to data set file.
#
####
def getDataSet(fileInputPath)

dataSet = []

if File.exist?(fileInputPath)

###
# GET THE FILE CONTENT
###
dataContainer = File.open(fileInputPath, "r")

###
# FOREACH LINE BREAK APART BY WHITESPACE
# AND PLACE INTO ARRAY
###
flist = []
dataContainer.readlines.each do |line|

flist.push(line)

end

flist.each do |line|

elements = line.split(" ")

row = Array.new()

elements.each do |element|

row.push(element.to_i)

if @globalHash[element.to_i] != nil
@globalHash[element.to_i] = @globalHash[element.to_i]+1
else
@globalHash[element.to_i] = 1
end

end

###
# ADD THE ROW AS AN ARRAY
# TO THE DATA SET
###
dataSet.push(row)

end

else

puts "Error: File, "+fileInputPath+", does not exist"

end

return dataSet

end

#########################################################
#########################################################
#########################################################
if ARGV.size == 3

###
# INITIALIZE VARIABLES
###
min_sup = ARGV[2].to_i
fileInputPath = ARGV[0]
outputFilePath = ARGV[1]

@infrequentSets = Array.new()
@globalHash = Hash.new() #holds our initial list with counts

#####
# GET THE DATA SET FROM FILE
#####
@dataSet = Array.new()
@dataSet = getDataSet(fileInputPath)

####
# WRITE OUT THE OUTPUT
####
@file = File.open(outputFilePath, "w+")

#####
# GET ALL THE ITEMS WITH MIN_SUP > N – List1
#####
@itemsContainer = []
@globalHash.keys.each do |element|

if @globalHash[element].to_i >= min_sup

@itemsContainer.push(element)

string = element.to_s
@file.puts string+" :"+@globalHash[element].to_s
@file.flush

end

end
@itemsContainer.sort!

####
#MAKE ALL ITEMS IN CONTAINER AN ARRAY
####
@items = Array.new()
@itemsContainer.each do |x|
@row = Array.new()
@row.push(x)
@items.push(@row)
end

@items = createList(@items)

######
# RUN THROUGH THE DATA SET UNTIL WE HAVE REACHED
# ALL FREQUENT ITEMSETS.
######
k = 1
@listK = @items

foundAllFrequentSets = false
while foundAllFrequentSets == false

###
# GET ALL THE ITEMS THAT PASS THE MIN_SUP THRESHOLD
###
@frequentSet = getFrequentItemSets(@listK, min_sup, fileInputPath)
p(@frequentSet)
#######
# CREATE THE NEW LIST FROM THE FREQUENT ITEMSET
# THAT MET MIN_SUP
#######
@listK = createList(@frequentSet)

#check if the result all contain infrequent sub sets
if isInfrequentSets(@listK)
foundAllFrequentSets = true
end

k = k+1
end #while foundAllFrequentSets == false

@file.close

else

puts "Error: Must Be Format Type :: apriori.rb <fileInputPath> <outputFilePath> <min_sup>" # argument names inferred from the ARGV reads above

end

endTime = Time.now
puts "Ended - "+endTime.to_s+"\n"
puts "Total Time: "+(endTime-start).to_s+"\n"

#Profiler__::stop_profile
#Profiler__::print_profile($stderr)

BTW: the data sets were created by my prof Dr. Sun.
Armando Padilla