Part of my thesis is to analyze a set of SMS messages using part-of-speech (POS) tagging. I wanted to see what a POS tagger would come up with when given SMS messages to analyze. It was interesting. For the most part the tagger tagged each SMS message properly, but I was curious to see whether SMS messages show a higher rate of any specific noun/verb/pronoun/etc. usage. This post isn't about the outcome; it's a place to put the POS n-grams out.
Before I could do anything I wanted to see if someone had already done this. The question I wanted to answer: are there any n-grams for part-of-speech tagging specifically for SMS text? Turns out I was either not looking in the correct place or there are none. So I decided to create my own using the SMS data set located here: http://wing.comp.nus.edu.sg:8080/SMSCorpus/data/corpus/smsCorpus_en_xml_2012.04.30.zip
Creating the POS n-grams.
To create the POS n-grams I used the Treebank created by TweetNLP and then ran the output through some formatting of my own.
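As a rough illustration of the counting step (not the exact formatting code used for the files here), this is a minimal Python sketch. It assumes the tagger output has already been parsed into one list of (token, tag) pairs per message. The function names, the sample messages, and the tag labels are all made up for the example.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return every run of n consecutive tags from a single message."""
    # A message shorter than n yields an empty list.
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def count_pos_ngrams(tagged_messages, n):
    """Count POS n-grams over messages, never crossing message boundaries."""
    counts = Counter()
    for message in tagged_messages:
        tags = [tag for _token, tag in message]
        counts.update(pos_ngrams(tags, n))
    return counts


# Hypothetical tagged messages; the tag labels are illustrative only.
tagged_messages = [
    [("c", "V"), ("u", "O"), ("l8tr", "R")],
    [("i", "O"), ("c", "V"), ("u", "O")],
]

bigrams = count_pos_ngrams(tagged_messages, 2)
trigrams = count_pos_ngrams(tagged_messages, 3)

for ngram, count in bigrams.most_common():
    print(" ".join(ngram), count)
```

Counting per message rather than over one long stream keeps the last tag of one SMS from pairing with the first tag of the next.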
I now present the 2-gram, 3-gram, and 4-gram files built from the SMS messages.
Download – 2 Gram
Download – 3 Gram
Download – 4 Gram
That’s it, in a nutshell. Next, I will be exploring how to predict the word a person will type in a mobile setting. Why a mobile setting? I believe this medium forces the user to make decisions about how to spell and formulate thoughts because of either time pressure (the user is texting in the car…illegally, or doing other things) or a length limit (in Twitter’s case, 140 characters).
Yes, there has been extensive research in this field, but I feel most of these papers are either outdated (using the T9 algorithm) or use a corpus of sentences derived from books…yes, I’m saying they’re wrong when it comes to the mobile space.
I believe a better training set is SMS messages and Twitter feeds, since in both arenas the user is forced to cut corners by abbreviating abbreviations (e.g. l8tr, n2g, n e 1). With this in mind I’ve gone ahead and fetched English SMS messages (~50,000) and put together a script that uses the Twitter API to pull tweets at random.
I will provide my data once I’m done.
I will be applying the idea of context-free grammars (from NLP) to the prediction steps. I have found a few Treebank tags which are useful and will use these as my foundation. These Treebank tags do have limitations, but I will expand on them.
I will give credit and cite.
The end result
At the end of this exercise I want to compare my Twitter and SMS data against the traditional corpora in use. I would also like to see whether PCFGs (probabilistic context-free grammars) are a good tool for predicting the next word.
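For anyone unfamiliar with PCFGs: the "probabilistic" part just means each grammar rule carries a probability, usually estimated as the rule's count divided by the total count of all rules with the same left-hand side. Here is a minimal Python sketch of that estimate. The rule counts are invented for illustration and do not come from my data, and this shows only the rule-probability step, not a next-word predictor.

```python
from collections import Counter, defaultdict


def estimate_pcfg(rule_counts):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    lhs_totals = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count

    probs = defaultdict(dict)
    for (lhs, rhs), count in rule_counts.items():
        probs[lhs][rhs] = count / lhs_totals[lhs]
    return probs


# Hypothetical counts of how often each rule appeared in parsed messages.
rule_counts = Counter({
    ("S", ("NP", "VP")): 8,
    ("S", ("VP",)): 2,
    ("NP", ("PRP",)): 6,
    ("NP", ("DT", "NN")): 2,
})

pcfg = estimate_pcfg(rule_counts)

# Most probable expansion of S under these made-up counts.
best_s = max(pcfg["S"], key=pcfg["S"].get)
print(best_s, pcfg["S"][best_s])
```

The probabilities for each left-hand side sum to 1, which is what lets a parser compare alternative expansions against each other.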
More to come.