A few weeks ago, before the Lakers and my slacking off went into high gear, I started to look at the SemanticHacker API. It had some neat items that someone could use for, let’s just say, a feed recommendation application, a simplified Google type of application based on semantics, and a butt load of other things that I can’t think of at the moment. Just go check out my previous post on it, Gnosis & SemanticHacker Review.
I started to work on a feed reader that uses the keywords and semantic meaning of an article to collect information related to it. In this case, it collects images from Flickr. This article is about my overall findings, my process, and just the plain fun I had while working with PHP and RSS feeds. Let’s get to it!
I spend an average of 1-2 hours reading and searching for interesting articles on subjects that range from football to technology. The first goal of this application was to cut down my searching by at least 10% by recommending new material. The second goal was to aggregate information pertaining to the article, so the reader gets visual information in the form of pictures based on the article’s text. A Feed Aggregator, for lack of a better term.
Every project needs an attack plan. Here is my crude attack plan.
1. Allow feeds to be displayed on page.
2. Allow feeds to be mined for key words.
3. Allow feeds to be mined for semantic meaning.
4. Use keywords to extract images from the web related to the article.
5. Allow the user to add additional feeds to the system.
6. Use Flex to create a nice GUI for the app.
7. Add recommendation capabilities.
8. Use natural language processing to extract “better” keywords.
9. Build web spiders to scour web for related articles.
I broke this pet project into three overall parts. The first part was to get a basic implementation of the concept onto the web. Nothing pretty to look at, nothing in terms of “wow factor”, just get something up. To do this I first thought about using Java or Ruby, but I decided to go with PHP and the Zend Framework because of the Zend_Service library. The library has a good set of wrappers (facades) for making API calls to Flickr, Amazon, Google, and other open APIs. You can get a better look at the wrappers here (look on the right hand side for items with the Zend_Service prefix).
The second part of the project will be to create a Flex GUI using MXML and the PHP code created in release 1. The third and final part will be the “wow” factor, where I would like to do two things: 1. implement a better keyword extractor, and 2. add a recommendation portion to the system that recommends articles and sites the user MIGHT enjoy reading.
While building this initial release, I came across some quick “ah ok wow, didn’t know that” items. I decided to write this section so I have a place to reference them again and so others might find the information helpful.
Self Extracting Tags – Not so natural language processor.
The keyword extractor pulls out all capitalized words. Since most of the time capitalized words are either people’s names or names of places (proper nouns), it’s a safe bet we can use them. This approach also picks up the “I”, “And”, “Also” type of words used at the start of a sentence. To alleviate this issue I keep only runs of 2 or more consecutive words that each start with an upper case letter.
Example: “Hello there Armando Padilla. How are you today?”
The process described above would extract “Hello”, “Armando Padilla”, “How”. After filtering this set down to “2 or more consecutive words with first letter capitalized”, we get the new set, “Armando Padilla”.
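Here is a minimal sketch of that two-step filter. It is written in Python rather than the PHP the app uses, and the function names and the regular expression are my own illustration, so treat it as a rough picture of the idea and not the actual code.

```python
import re

# A run of one or more capitalized words separated only by whitespace.
# Punctuation (such as the period after "Padilla") ends the run.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b")


def extract_capitalized_runs(text):
    """Step 1: pull out every run of capitalized words, including single words."""
    return CAPITALIZED_RUN.findall(text)


def keep_multiword_runs(runs):
    """Step 2: keep only runs made of 2 or more words."""
    return [run for run in runs if len(run.split()) >= 2]


sample = "Hello there Armando Padilla. How are you today?"
candidates = extract_capitalized_runs(sample)
keywords = keep_multiword_runs(candidates)
print(candidates)  # ['Hello', 'Armando Padilla', 'How']
print(keywords)    # ['Armando Padilla']
```

One thing to watch for: a sentence that opens with a filler word right before a name (“Also Armando Padilla”) would still come out as one three-word run, so this filter reduces the noise but does not remove all of it.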
I did read about, and plan to implement, an approach that counts the number of times each word appears in the text and then uses the bottom 10% of that dataset. For example, “I”, “the”, “it”, “he”, “she” would naturally have a higher count than “Armando Padilla” or “Sliced Peaches”. If we order the dataset of words by their word count in descending order and then take the last 10%, we will have created a new dataset with “Armando Padilla” and “Sliced Peaches” in the set.
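A rough sketch of the counting idea, again in Python and not taken from the app. It assumes single lower-cased words as the unit being counted (the examples above are two-word phrases, which would need an extra step), and the function name and the `fraction` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def rarest_terms(text, fraction=0.10):
    """Return the least frequent `fraction` of the distinct words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Order by count, highest first; ties are broken alphabetically so the
    # result is repeatable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Take the last `fraction` of the ranking, but always at least one word.
    keep = max(1, math.ceil(len(ranked) * fraction))
    return [word for word, _ in ranked[-keep:]]


print(rarest_terms("the the the it it peaches"))  # ['peaches']
```

In a real article most words appear exactly once, so the bottom 10% will be an arbitrary slice of those ties. Some extra signal, such as the capitalization filter, would probably still be needed to pick the interesting ones.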
Semantics of the Article
I took a look at how well SemanticHacker’s semantic text analyzer worked when extracting semantic meaning from a full description versus from a set of keywords produced by the method described above. Turns out the meanings never changed, from what I could tell. To extract meaning from a piece of text I decided to just pass in the full description of the feed. Works well. Will test which method is faster later.
Flickr.com and user based tags.
Finally, for this part 1 article, I tested how well Zend_Service_Flickr implemented the search API. I was forced to use the tag-based search and found that it returned very poor quality images as opposed to the regular keyword-based search. I’m currently working on adding the keyword-based search capability to my local copy of the Zend_Service_Flickr code for better results.
Another drawback I found is that the tag-based search forces you to rely on the users’ tagging abilities. In the case of an article relating to USC football, the context of the article would be “USC FOOTBALL”. If we use the tags “USC” or “Trojans”, the application might return a picture of a condom rather than a USC player due to the poor choice of tags used by the user. Still looking into this one.
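To make the tag versus keyword difference concrete, here is a small sketch that only builds the request URLs for Flickr’s `flickr.photos.search` REST method, which accepts either a `tags` parameter or a free `text` parameter. It is Python, not the Zend_Service_Flickr code, it does not send any request, and the helper name `build_search_url` and the placeholder API key are assumptions for illustration.

```python
from urllib.parse import urlencode

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, terms, mode="text", per_page=10):
    """Build a flickr.photos.search URL for a tag search or a free-text search."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "format": "json",
        "nojsoncallback": "1",
        "per_page": str(per_page),
    }
    if mode == "tags":
        # Matches only what uploaders typed into the tag field.
        params["tags"] = ",".join(terms)
        params["tag_mode"] = "all"  # require every tag, not just any one of them
    elif mode == "text":
        # Free-text search over titles, descriptions, and tags.
        params["text"] = " ".join(terms)
    else:
        raise ValueError("mode must be 'tags' or 'text'")
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder; a real key is needed to run the search.
tag_url = build_search_url("YOUR_API_KEY", ["USC", "football"], mode="tags")
text_url = build_search_url("YOUR_API_KEY", ["USC", "football"], mode="text")
print(tag_url)
print(text_url)
```

Requiring all tags at once should cut down on the “Trojans” problem a little, but it still depends on how people tagged their photos.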
This is the first release with none of the bells and whistles. Click here to run it. The application pulls up data for the TrojanWire.com feeds but can just as easily be migrated to use any feeds out there. The items highlighted in red are the keywords picked up by the algorithm I created. Unfortunately, the semantic portion could not be installed on the server for some odd reason.
Next up: creating the Flex GUI and compiling it using AIR. This will allow the application to live outside the web, something that I wanted to do from the get-go.