Learn how to work with free-form text fields such as user feedback, tweets, or maintenance logs.
You can also use text to predict a continuous value, as in scoring short-answer essays.
Many example text datasets are available from sites like the UCI Machine Learning Repository.
While the full scope of natural language processing is quite broad, the Nexosis API can help you use text automatically, without digging deep into the specifics.
Getting text into features is as easy as identifying the Text data type in your metadata for one or more columns in your dataset. We attempt to identify text automatically, just like other data types, but if you need to set it yourself, see our metadata instructions for more information. When we run algorithms, each text column has a vocabulary built and features added to the dataset - that's it; you don't have to do any additional work to use the text.
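As a sketch of what that metadata might look like, the payload below marks one column as text; the dataset and column names are invented for illustration, and the exact schema is described in the metadata instructions:

```python
import json

# Hypothetical column metadata marking one column as text.
# The dataset and column names here are made up for illustration;
# see the metadata instructions for the full schema.
metadata = {
    "dataSetName": "ProductFeedback",
    "columns": {
        "comment": {"dataType": "text"},    # a vocabulary will be built from this column
        "rating": {"dataType": "numeric"},  # other columns keep their usual types
    },
}

payload = json.dumps(metadata, indent=2)
```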
The Nexosis API goes through a multi-stage process to add a column for each word found across the examples in your dataset. Ultimately we aim to add a word importance score based on term frequency-inverse document frequency (TF-IDF). You don't have to concern yourself with the details of the score, but you can think of it as a weight for each word based roughly on how often it occurs. Let's look at a few of the steps in more detail.
product_key | description | product_class |
---|---|---|
1 | “The best at cleaning” | “cleaning” |
2 | “increases your digging power” | “garden” |
word | type | occurrences |
---|---|---|
best | word | 1 |
cleaning | word | 1 |
best cleaning | word | 1 |
increases | word | 1 |
digging | word | 1 |
power | word | 1 |
increases digging | word | 1 |
digging power | word | 1 |
the | stop_word | 1 |
at | stop_word | 1 |
your | stop_word | 1 |
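The vocabulary above can be approximated with a small sketch: filter out stop words, then count the remaining unigrams and adjacent bigrams. The stop word list and whitespace tokenization here are simplified assumptions; the API's actual rules are more involved.

```python
from collections import Counter

STOP_WORDS = {"the", "at", "your"}  # tiny illustrative list, not the API's real one

def vocabulary(documents):
    """Count unigrams and adjacent bigrams, tracking stop words separately."""
    counts = Counter()
    stop_counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        stop_counts.update(t for t in tokens if t in STOP_WORDS)
        kept = [t for t in tokens if t not in STOP_WORDS]
        counts.update(kept)                                     # unigrams
        counts.update(" ".join(p) for p in zip(kept, kept[1:])) # bigrams
    return counts, stop_counts

docs = ["The best at cleaning", "increases your digging power"]
words, stops = vocabulary(docs)
```

With the two example descriptions this yields the eight words and bigrams and three stop words listed in the table.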
product_class | best cleaning | cleaning | digging |
---|---|---|---|
cleaning | 0.707106781 | 1 | - |
garden | - | - | 0.577350269 |
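Weights like those above can arise from one common TF-IDF convention: smoothed IDF followed by per-document L2 normalization. Treat that convention as an assumption here; the API's exact variant isn't spelled out. Under it, a document whose two surviving terms are equally rare gets 1/√2 ≈ 0.707 for each, and one with three such terms gets 1/√3 ≈ 0.577:

```python
import math
from collections import Counter

def tfidf(token_docs):
    """TF-IDF with smoothed IDF and per-document L2 normalization.
    This mirrors one common convention; the Nexosis API's exact
    variant is an assumption here."""
    n = len(token_docs)
    df = Counter()  # document frequency of each term
    for tokens in token_docs:
        df.update(set(tokens))
    vectors = []
    for tokens in token_docs:
        tf = Counter(tokens)
        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors

# Stop words already removed; bigrams omitted for brevity.
docs = [["best", "cleaning"], ["increases", "digging", "power"]]
vecs = tfidf(docs)
# Every term appears in exactly one document, so within each document
# all weights are equal: 1/sqrt(2) for doc 1, 1/sqrt(3) for doc 2.
```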
If you want to inspect the vocabulary for a particular column, you can list the available vocabularies through the API. Each vocabulary has a unique id and identifies the dataset and column for which it was created.
```json
{
  "id": "6e0d9884-a5a5-4a30-b9f1-fea8380be51e",
  "dataSourceName": "AirlineTweets",
  "columnName": "text",
  "dataSourceType": "dataSet",
  "createdOnDate": "2018-01-22T19:25:18.6662961+00:00",
  "createdBySessionId": "01611f54-7568-4a52-8693-a6c3d77b3964"
}
```
This id can then be used to retrieve all of the word instances, along with whether each is a stop word and, if not, its rank.
```json
{
  "id": "6e0d9884-a5a5-4a30-b9f1-fea8380be51e",
  "items": [
    {
      "text": "united",
      "type": "word",
      "rank": 0
    },
    {
      "text": "totes",
      "type": "stopWord"
    }
  ]
}
```
The word rank is the relative importance of the word as used in the model: the lowest number is the most important, and the vocabulary is returned in rank order by default.
1. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. ↩