Word Prediction in a Character-Length-Restricted Setting Using PCFGs


That’s it, in a nutshell. I will be exploring how to predict the word a person will type in a mobile setting. Why a mobile setting? I believe this medium forces users to make decisions about how they spell and formulate thoughts, either because of time pressure (the user is texting in the car…illegally, or is busy doing other things) or because of a hard limit (in Twitter’s case, 140 characters per tweet).

Yes, there has been extensive research in this field, but I feel most of these papers are either outdated (using the T9 algorithm) or use a corpus of sentences derived from books…yes, I’m saying they’re wrong when it comes to the mobile space.

I believe a better training set is SMS messages and Twitter feeds, since in both arenas the user is forced to cut corners by abbreviating, sometimes even abbreviating abbreviations (e.g. l8tr, n2g, n e 1). With this in mind, I’ve gone ahead and fetched English SMS messages (~50,000) and put together a script that uses the Twitter API to pull tweets at random.

I will provide my data once I’m done.

The PCFGs.
I will be applying the idea of context-free grammars (from NLP), specifically probabilistic context-free grammars (PCFGs), to the prediction steps. I have found a few Treebank tags that are useful and will use them as the foundation for much of what I do. These Treebank tags do have limitations, but I will expand on them.

I will give credit and cite my sources.
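
To make the idea concrete, here is a minimal sketch of next-word prediction with a PCFG. Everything in it is an assumption for illustration: the grammar, the rule probabilities and the tiny texting vocabulary are invented, and only the tag names (PRP, DT, NN, VBP) are borrowed from the Penn Treebank tagset. It is a brute-force toy, not the approach this project will end up with: it enumerates every sentence the (non-recursive) grammar can produce, keeps the ones that start with what has been typed so far, and ranks the possible next words by probability.

```python
from collections import defaultdict
from itertools import product

# Toy PCFG: each left-hand side maps to (right-hand side, probability) pairs,
# and the probabilities for one left-hand side sum to 1.
# Grammar, probabilities and words are invented for illustration only.
GRAMMAR = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("PRP",), 0.6), (("DT", "NN"), 0.4)],
    "VP":  [(("VBP", "NP"), 0.7), (("VBP",), 0.3)],
    "PRP": [(("i",), 0.5), (("u",), 0.5)],
    "DT":  [(("the",), 1.0)],
    "NN":  [(("txt",), 0.5), (("msg",), 0.5)],
    "VBP": [(("c",), 0.4), (("luv",), 0.6)],
}


def expand(symbol):
    """Return every (word tuple, probability) that `symbol` can derive.

    Only terminates because the toy grammar has no recursive rules.
    """
    if symbol not in GRAMMAR:  # anything that is not a tag is a word
        return [((symbol,), 1.0)]
    results = []
    for rhs, rule_p in GRAMMAR[symbol]:
        # Expand each child, then combine one alternative per child.
        for combo in product(*(expand(child) for child in rhs)):
            words = tuple(w for part, _ in combo for w in part)
            p = rule_p
            for _, part_p in combo:
                p *= part_p
            results.append((words, p))
    return results


def predict_next(prefix, start="S"):
    """Rank candidate next words given the words typed so far."""
    prefix = tuple(prefix)
    n = len(prefix)
    scores = defaultdict(float)
    for words, p in expand(start):
        # Keep sentences that start with the prefix and continue past it.
        if len(words) > n and words[:n] == prefix:
            scores[words[n]] += p
    total = sum(scores.values())
    if total == 0:
        return []
    # Normalise so the candidate probabilities sum to 1.
    return sorted(((w, s / total) for w, s in scores.items()),
                  key=lambda item: -item[1])


if __name__ == "__main__":
    print(predict_next(["i"]))
    print(predict_next(["i", "luv"]))
```

Enumerating every sentence is only workable for a grammar this small. A real grammar with recursive rules would need a proper prefix-probability parser, and its rule probabilities would be estimated from the tagged SMS and Twitter data rather than written by hand.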

The end result
At the end of this exercise I want to compare my Twitter and SMS data against the traditional corpora in use. I would also like to see whether PCFGs are a good tool for predicting the next word.
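
One rough way to score that comparison is top-k accuracy: train the same predictor on each corpus, then count how often the word a user actually typed next shows up among the predictor's top k guesses on held-out mobile text. The sketch below is only an assumption about how this could be set up. It uses a plain bigram counter as a stand-in for whatever model is being evaluated (it is not a PCFG), and the three tiny "corpora" are invented placeholders, not the SMS or Twitter data described above.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count which word follows which; a stand-in for any next-word model."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def top_k_accuracy(model, sentences, k=3):
    """Fraction of next words that appear in the model's top-k guesses."""
    hits = 0
    total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            guesses = [w for w, _ in model[prev].most_common(k)]
            hits += nxt in guesses
            total += 1
    return hits / total if total else 0.0


# Invented placeholder data, only here to make the sketch runnable.
book_corpus = ["see you later", "talk to you later"]
sms_corpus = ["c u l8tr", "ttyl c u"]
held_out_sms = ["c u l8tr 2nite"]

book_score = top_k_accuracy(train_bigram(book_corpus), held_out_sms)
sms_score = top_k_accuracy(train_bigram(sms_corpus), held_out_sms)
print("book-trained:", book_score, "sms-trained:", sms_score)
```

The held-out messages must never appear in either training set, otherwise the scores say nothing about how well a corpus generalises to new mobile text.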

More to come.

Armando Padilla