Apriori Algorithm – Ruby

AA is not only where alcoholics get help but also where data miners go when they need a basic algorithm to locate all the combinations that appear a certain number of times in a data set.

The Apriori algorithm takes an initial data set, beats it to a pulp, strains it, and gives you all the possible ways of cooking a Hot Pocket. With cheese, no cheese, with meat, or no meat, this algorithm gets you every possible combination and the number of times it happens in the data set.

Confused? Good! Since I like examples to understand things, let's use one. Let's say you go to the market. You buy some milk, a can of decaf, and some dog food. The supermarket chain records this info not only for your transaction but for everyone's (talk about Big Brother looking over you). After about a month the supermarket chain has all this info on the things people buy. They see how some people buy milk with cookies and how some people buy squeaky toys with condoms (don't ask). But they want to find more info. They pop open their computer, run the Apriori algorithm, and ask it to find all combinations that occur in the data set more than 2 times. After running the algorithm they see that milk and tomatoes sold a lot together. Good market info if you want to have a sale on tomatoes so you can get rid of your milk 🙂 They also find that cookies with toothpaste were items that sold in high volumes as well. In data mining terms we use the word "frequent" to state how well an itemset (the combination of items) sold.

So in closing, it's 1 a.m. and the algorithm is coded below. It will allow you to stick in a data set and receive the number of times each combination occurs, for every combination that meets your expected minimum supporting threshold: the minimum number of times you want the combination to appear.

Before I get to the code, let me go over it first. The code below reads a file path that we need to use. I have two files you can use: one is the test.txt file and the other is the retail.txt file. When running on the test.txt file the program finishes very quickly and stores the itemsets in an output file of your choice. On the other hand, when we run retail.txt and set the minimum_sup to less than 2000, the application runs for 2 hours, a sign of either a very poor implementation or a very bad language to use for data mining.
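
For reference, the script takes three arguments, in this order: the input file, the output file, and the minimum support. So a run on the test file looks something like this (output.txt is just an example name):

ruby apriori.rb test.txt output.txt 2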

Algorithm
1. Get all the unique items in the data set.
2. For each unique item, count the number of records that contain it.
3. If the count meets or exceeds the min_sup, keep the item and prune the unique values that didn't meet the min_sup.
4. With the new set of values, join them to form your new candidate list.
5. Check if the new set has any subsets that are not frequent.
6. Trash the itemsets that have subsets that are not frequent.
7. With the new, clean set, go to step 2.
8. Continue until the new list is empty; this happens when every itemset in the new list has a subset that is not frequent. (A toy sketch of the counting in steps 1-3 follows.)
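
To make the counting concrete, here is a tiny brute-force sketch on a made-up four-row data set with min_sup = 2. This is only an illustration I'm adding, not the implementation below: it enumerates every possible combination with Array#combination instead of building candidates level by level and pruning the way Apriori does.

# Brute-force support counting on a toy data set (illustration only).
data_set = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 3]]
min_sup  = 2

items = data_set.flatten.uniq.sort

(1..items.size).each do |k|
  items.combination(k) do |candidate|
    # Support = number of rows that contain every item of the candidate
    support = data_set.count { |row| (candidate & row).size == candidate.size }
    puts "#{candidate.inspect} : #{support}" if support >= min_sup
  end
end

On this tiny data set every candidate happens to be frequent; with a larger min_sup you would see the pruning kick in.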

Ruby Code:
##########
# Apriori Algorithm Implementation
#
# @author Armando Padilla, mando81@prodigy.net
# @homepage http://www.armando.ws
#
# v0.0.2 month.week.day
##########

#require 'profiler'
#Profiler__::start_profile
start = Time.now
puts "Started - " + start.to_s

#############
# Create the new candidate list (Lk) for the next iteration
#
# @param itemSets the Lk-1 list that has already been cleared
#                 of all non-frequent sets.
#############
def createList(itemSets)

  ###
  # HOLDS THE NEW LIST
  ###
  listX = Array.new

  if !itemSets.nil?

    # Check if the initial elements are the same.
    # If they are, check whether the last item is less than the
    # next row's last item; if it is, join both itemsets (a union).

    k = 1
    itemSetSize = itemSets.size

    itemSets.each do |list1|

      n = k
      until n == itemSetSize

        # On the initial run (1-itemsets) any two distinct items
        # can be combined; otherwise follow the prefix rule below.

        if itemSets[n].size == 1

          if !list1.eql?(itemSets[n])

            row = (list1 + itemSets[n]).sort

            ###
            # DO NOT ADD IF WE ALREADY HAVE THE SET
            ###
            listX.push(row) if !listX.include?(row)

          end

        else

          ####
          # If all the indexes before the last are the same,
          # combine ONLY IF the last indexes differ.
          ####
          toIndex = list1.size - 2

          ###
          # GET ALL THE VALUES UP TO THE LAST INDEX
          # E.g. [1,2,3,4,5] and [1,2,3,4,6]
          # become [1,2,3,4] and [1,2,3,4]
          ###
          arrayA = list1[0..toIndex]
          arrayB = itemSets[n][0..toIndex]

          if arrayA.eql?(arrayB)

            totalCountA = list1.size - 1
            totalCountB = itemSets[n].size - 1

            if list1[totalCountA] < itemSets[n][totalCountB]

              # The union keeps the shared prefix once and appends
              # the larger last element, so the result stays sorted.
              row = list1 | itemSets[n]

              ###
              # DO NOT ADD IF WE ALREADY HAVE THE SET
              ###
              listX.push(row) if !listX.include?(row)

            end

          end

        end

        n = n + 1
      end

      k = k + 1
    end

  end

  return listX

end

##############
# Returns the itemsets that meet the minimum supporting threshold.
##############
def getFrequentItemSets(itemSets, minSup)

  # Holds the itemsets that survive the min_sup check
  list = Array.new

  ##
  # FOR EACH ITEMSET, CHECK HOW MANY TIMES IT APPEARS IN THE DATA SET
  ##
  itemSets.each do |itemSet|

    itemSetSize = itemSet.size
    count = 0

    ###
    # FOR EACH ROW IN THE DATA SET,
    # DETERMINE IF THE ITEMSET APPEARS IN THE RECORD
    ###
    @dataSet.each do |set|

      if itemSetSize == 1

        count = count + 1 if set.index(itemSet[0])

      else

        ###
        # USING INTERSECTION TO COMPARE
        # =============================
        # ITEMSET = A, CURRENT ROW = B
        # TAKE THE INTERSECTION OF A AND B.
        # IF ITS SIZE EQUALS THE SIZE OF THE ITEMSET,
        # EVERY ITEM OF THE ITEMSET IS IN THE ROW.
        ###
        count = count + 1 if (itemSet & set).size == itemSetSize

      end

    end # end - run through data set

    ###
    # DETERMINE IF IT MEETS THE MIN_SUP.
    # If the count is equal to or greater than the minimum
    # supporting threshold, add it to our new list (Lk).
    ###
    if count >= minSup

      list.push(itemSet)

      string = ""
      itemSet.each do |x|
        string = string + " " + x.to_s
      end
      @file.puts string + " :" + count.to_s
      @file.flush

    else

      irow = Array.new
      irow.push(itemSet)
      @infrequentSets.push(irow)

    end
    p(itemSet)

  end # end - run through each itemset

  return list.sort

end # end - getFrequentItemSets

##############
# Returns true when every candidate itemset contains
# an infrequent subset, false otherwise.
##############
def isInfrequentSets(itemSets)

  globalCount = 0

  ###
  # FOR EACH ITEMSET, CHECK IF IT CONTAINS AN INFREQUENT SUBSET
  ###
  itemSets.each do |itemSet|

    count = 0
    @infrequentSets.each do |itemSetX|

      ###
      # IF THE ITEMSET SHARES AT LEAST 2 ITEMS WITH AN
      # INFREQUENT SET, MARK IT FOR REMOVAL
      ###
      if (itemSet & itemSetX[0]).size >= 2
        count = count + 1
      end

    end

    if count != 0
      globalCount = globalCount + 1
    end

  end

  return globalCount == itemSets.size

end

#####
# Reads the input from a text file and creates an array.
# Each line becomes an array of elements.
# Elements are broken out by whitespace.
#
# @param fileInputPath Path to data set file.
####
def getDataSet(fileInputPath)

  dataSet = []

  if File.exist?(fileInputPath)

    ###
    # GET THE FILE CONTENT
    ###
    dataContainer = File.open(fileInputPath, "r")

    ###
    # FOR EACH LINE, BREAK APART BY WHITESPACE
    # AND PLACE INTO AN ARRAY
    ###
    dataContainer.readlines.each do |line|

      elements = line.split(" ")

      row = Array.new

      elements.each do |element|

        row.push(element.to_i)

        # Keep a global count of each item for the first pass (L1)
        if @globalHash[element.to_i] != nil
          @globalHash[element.to_i] = @globalHash[element.to_i] + 1
        else
          @globalHash[element.to_i] = 1
        end

      end

      ###
      # ADD THE ROW AS AN ARRAY
      # TO THE DATA SET
      ###
      dataSet.push(row)

    end

    dataContainer.close

  else

    puts "Error: File, " + fileInputPath + ", does not exist"

  end

  return dataSet

end

#########################################################
#########################################################
#########################################################
if ARGV.size == 3

  ###
  # INITIALIZE VARIABLES
  ###
  fileInputPath = ARGV[0]
  outputFilePath = ARGV[1]
  min_sup = ARGV[2].to_i

  @infrequentSets = Array.new
  @globalHash = Hash.new # holds our initial list with counts

  #####
  # GET THE DATA SET FROM FILE
  #####
  @dataSet = getDataSet(fileInputPath)

  ####
  # OPEN THE OUTPUT FILE
  ####
  @file = File.open(outputFilePath, "w+")

  #####
  # GET ALL THE ITEMS WITH COUNT >= MIN_SUP (List1)
  #####
  @itemsContainer = []
  @globalHash.keys.each do |element|

    if @globalHash[element].to_i >= min_sup

      @itemsContainer.push(element)

      string = element.to_s
      @file.puts string + " :" + @globalHash[element].to_s
      @file.flush

    end

  end
  @itemsContainer.sort!

  ####
  # MAKE EVERY ITEM IN THE CONTAINER A ONE-ELEMENT ARRAY
  ####
  @items = Array.new
  @itemsContainer.each do |x|
    @items.push([x])
  end

  @items = createList(@items)

  ######
  # RUN THROUGH THE DATA SET UNTIL WE HAVE FOUND
  # ALL FREQUENT ITEMSETS.
  ######
  k = 1
  @listK = @items

  foundAllFrequentSets = false
  while foundAllFrequentSets == false

    ###
    # GET ALL THE ITEMSETS THAT PASS THE MIN_SUP THRESHOLD
    ###
    @frequentSet = getFrequentItemSets(@listK, min_sup)
    p(@frequentSet)

    #######
    # CREATE THE NEW CANDIDATE LIST FROM THE FREQUENT
    # ITEMSETS THAT MET MIN_SUP
    #######
    @listK = createList(@frequentSet)

    # Check if the candidates all contain infrequent subsets
    if isInfrequentSets(@listK)
      foundAllFrequentSets = true
    end

    k = k + 1
  end # while foundAllFrequentSets == false

  @file.close

else

  puts "Usage: ruby apriori.rb <input_file> <output_file> <min_sup>"

end

endTime = Time.now
puts "Ended - " + endTime.to_s + "\n"
puts "Total Time: " + (endTime - start).to_s + "\n"

#Profiler__::stop_profile
#Profiler__::print_profile($stderr)

BTW: the data sets were created by my professor, Dr. Sun.
Armando Padilla

Find-S Algorithm – Ruby

I created this a while back; it was one of the first machine learning algorithms that I implemented in Ruby.

Hope it helps people out.


########
# Title: Ruby Implementation of the Find-S Algorithm
# Description: The Find-S algorithm gives us a hypothesis that categorizes
# our data set, starting from the MOST SPECIFIC FORM and generalizing
# only as far as the positive examples force it to.
# Author: Armando Padilla
# California State University Los Angeles
# Computer Science
#
# Email: mando81@prodigy.net
# Home Page: http://www.armando.ws
# ========================================================================
# Find-S: Find-S is a Machine Learning algorithm, one of many, that
# given a set of training data will identify a hypothesis
# to categorize the positive examples in the data. Note this algorithm will
# ONLY take into account user-specified positive examples, starting from
# the MOST SPECIFIC CASE.
#
# LIMITATION: The problem with the Find-S hypothesis is that it
# may not take into account noise that the training data holds.
# Also, it cannot attain ALL possible hypotheses in our hypothesis space. What if
# the user wanted the most general hypothesis? It also considers only conjunctive hypotheses.
#===========================================================================
########

#### IGNORE!!! BENCHMARKING, IGNORE!!! ####
starttime = Time.now
#############################

hypothesis = Array.new
data = Array.new

# Initialize the data to use
data = [["Sunny", "Warm", "Normal", "Strong", "Warm", "Same"  , "Yes"],
        ["Sunny", "Warm", "High"  , "Strong", "Warm", "Same"  , "Yes"],
        ["Rainy", "Cold", "High"  , "Strong", "Warm", "Change", "No"],
        ["Sunny", "Warm", "High"  , "Strong", "Cold", "Change", "Yes"]
       ]

# Initialize the hypothesis.
# We set our initial hypothesis to its most specific case, represented by "%"
hypothesis = ["%", "%", "%", "%", "%", "%"]

# Remove all negative records from our data
print "\n------- REMOVING ALL NEGATIVE DATA ---------\n"

i = 0
lastElement = data[0].size - 1
while i < data.length
  if data[i][lastElement].to_s.downcase == "no"
    puts "Removing ::: " + data[i].inspect
    data.delete_at(i)
  else
    # Only advance when nothing was deleted, so the element that
    # shifted into slot i still gets checked.
    i = i + 1
  end
end

######
# Find-S
# For each attribute, check if we can make it a more general term
# to ultimately derive a hypothesis that can classify all the data.
# %             => Most Specific Case
# ?             => Most General Case
# anything else => a specific attribute value
#####
k = 0
while k < hypothesis.size
  change = 0

  i = 0
  while i < data.length

    if hypothesis[k].to_s.downcase != data[i][k].to_s.downcase
      hypothesis[k] = data[i][k].to_s
      change = change + 1
    end

    i = i + 1
  end

  # If we have changed the value for the attribute in position k more than
  # once, it is a safe bet that we can generalize that attribute and make
  # it '?'
  if change > 1
    hypothesis[k] = "?"
  end

  k = k + 1
end

########################################
# Displays the hypothesis that BEST fits
# our data.
########################################
print "\n\n"
print "\t-- Find-S Hypothesis --\n\t"
i = 0
while i < hypothesis.size
  print hypothesis[i] + " "
  i = i + 1
end
print "\n\n"

### END - BENCHMARK & DISPLAY ###
endtime = Time.now
print "Execution Time: " + (endtime - starttime).to_s + "\n"
########################
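
For what it's worth, tracing the loop over the four training records above, the run should end with the classic EnjoySport hypothesis:

-- Find-S Hypothesis --
Sunny Warm ? Strong ? ?

Sunny, Warm, and Strong never change after they are first set, so they survive, while the attributes that flip between positive rows get generalized to '?'.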

Inverted Index and Ruby

Inverted Index and YOU!

Yahoo, Google, Microsoft, and basically every other search engine had to start
somewhere, and so we start here, with an inverted index. What is it? Heck, why do we
even use it?

In the opening of this article I brought up a few search engines. They all maintain huge collections, billions and billions of records, that users can navigate and search through; hell, I even think at this point there are more unique records in those three data sets than there are stars... pssh. So with all this data, how in the world does Google get you the data super fast, in a nice little package called a search result? Well, it's called an inverted index. I'm pretty sure they use much more advanced algorithms than this basic one, but like I said before, we have to start somewhere.

For this example we have a database with 1,000,000,000 records and we need to find the words "my dog is cool" in this database. We also need to locate where that combination of words appears. Lucky me, I already know: they appear in rows 1, 8, 5000, 999999, and 999999999. If we tried to run a search on this database we would have to search through ALL 1 billion rows, comparing all the words in each row to the words I'm searching for. Not cool.

This isn't good, and it isn't optimal, because it will take a very long time to find the results that we want. Got it? If you don't, here is a very simple equation to remember.

"very big database" + "words to search for" = a long wait for results if we don't have a good algorithm.

So how can we speed this up? We use, yes you guessed it, an "inverted index". The inverted index lets us run through each row once, find where the combination of words occurs, and keep a single row containing the IDs of the rows where that combination of words appears. This lets us look at one table and read 1 row instead of 1,000,000,000 to identify where the combination of words appears. Once we know which rows contain the word combination, we can present the results to the user. Much faster, isn't it 🙂 Graphically this would look something like this.

Example #1 – Initial database without the inverted index. Table has rows from 1 to 1,000,000,000

TABLEX
ID TEXT
1 “oh hey dude my dog is cool”
2 “i hate your dog”
3 “damn it! he pooped on my head”
4 “what? he did what?”
5 “he pooped on my head man!”
6 “oh..that…yea he does that”
7 “your ok with that?”
8 “yea my dog is cool like that”
.
.
.
.
.
1000000000 “yea i guess when you said my dog is cool…”

Example #2 – With the inverted index. New table in the database.

TABLEY
TID KEY ROWS
1 “my dog is cool” {1, 8, 5000, 999999, 999999999}
2 “bleh…i hate…” {20, 26, 27, 28, 29, 10000, 202234, 99000}

Looking at the inverted index table you can see how we set things up. We have a primary key like always, the unique phrase (this can be anything, but it has to be unique), and an array, or some kind of delimiter-separated string, containing all the rows from TABLEX where the unique key appears.
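
Before the full implementation below, here is a minimal sketch of the lookup itself, using a plain Ruby hash to play the role of TABLEY (the variable names are mine, not part of the tables above):

# TABLEY as a hash: phrase => ids of the TABLEX rows that contain it
inverted_index = {
  "my dog is cool" => [1, 8, 5000, 999999, 999999999]
}

# One hash lookup replaces a scan of all 1,000,000,000 rows.
puts inverted_index["my dog is cool"].inspect

That single lookup is the whole trick: the expensive scan happens once, when the index is built, not on every search.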

So, concluding, this section was on why we need the inverted index to save the world and possibly solve world hunger. No, really: we need it to reduce the wait time when mining for information.

By the way, I went ahead and implemented Ruby code that you can use to mess with and test out the inverted index concept. Enjoy.

##############
# Inverted Index - Ruby Code
#
# Data set from:
# the book Data Mining: Concepts and Techniques
#
# @author: Armando Padilla, mando81@prodigy.net
##############

###
# INITIAL DATA SET TO USE
# Tuple set from page 181
#
# [tuple id, value 1, value 2, value 3, value 4, value 5]
###
@tuples = [['1', 'a1', 'b1', 'c1', 'd1', 'e1'],
           ['2', 'a1', 'b2', 'c1', 'd2', 'e1'],
           ['3', 'a1', 'b2', 'c1', 'd1', 'e2'],
           ['4', 'a2', 'b1', 'c1', 'd1', 'e2'],
           ['5', 'a2', 'b1', 'c1', 'd1', 'e3']
          ]

###
# Holds the inverted index
###
invertedIndex = Hash.new()

###
# FOREACH ROW
###
i = 0
until i == @tuples.size

  # Get the row
  tuple = @tuples[i]

  # Get the number of attributes
  size = tuple.size

  ###
  # FOR EACH ATTRIBUTE IN THE ROW
  ###
  j = 1
  until j == size

    # Get the current attribute
    attribute = tuple[j]

    # Check if we have already visited this attribute.
    if !invertedIndex.has_key?(attribute)

      ###
      # If we haven't, look down the column and collect
      # the TID of each row the attribute appears in,
      # then add the array container to the main
      # invertedIndex hash.
      ##
      x = Array.new
      k = 0
      until k == @tuples.size

        if @tuples[k][j] == attribute

          #puts "Attribute: " + attribute + " TID: " + @tuples[k][0]
          x.push(@tuples[k][0])

        end

        k = k + 1

      end

      invertedIndex[attribute] = x

    end # end - if the attribute has already been visited.

    j = j + 1

  end # end - loop through each attribute

  i = i + 1
end # end - loop through the rows

#######################################
# Print out the resulting Inverted Index
#######################################
invertedIndex.each { |key, value| puts key + " " + value.inspect }
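
On a modern Ruby, where hashes keep insertion order, the run should print something like this (each attribute followed by the TIDs of the tuples it appears in):

a1 ["1", "2", "3"]
b1 ["1", "4", "5"]
c1 ["1", "2", "3", "4", "5"]
d1 ["1", "3", "4", "5"]
e1 ["1", "2"]
b2 ["2", "3"]
d2 ["2"]
e2 ["3", "4"]
a2 ["4", "5"]
e3 ["5"]
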
Armando Padilla