{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Extraction from Text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook is divided into two sections:\n",
"* First, we'll find out what what is necessary to build an NLP system that can turn a body of text into a numerical array of *features* by manually calcuating frequencies and building out TF-IDF.\n",
|
|
"* Next we'll show how to perform these steps using scikit-learn tools."
|
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part One: Core Concepts on Feature Extraction\n",
"\n",
"\n",
"In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.<br>\n",
"<div class=\"alert alert-info\" style=\"margin: 20px\">This first section is for illustration only!\n",
"<br>Don't worry about memorizing this code - later on we will let Scikit-Learn Preprocessing tools do this for us.</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start with some documents:\n",
"For simplicity we won't use any punctuation in the text files One.txt and Two.txt. Let's quickly open them and read them. Keep in mind, you should avoid opening and reading entire files if they are very large, as Python could just display everything depending on how you open the file.\n"
|
|
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is a story about dogs\n",
"our canine pets\n",
"Dogs are furry animals\n",
"\n"
]
}
],
"source": [
"with open('One.txt') as mytext:\n",
"    print(mytext.read())"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This story is about surfing\n",
"Catching waves is fun\n",
"Surfing is a popular water sport\n",
"\n"
]
}
],
"source": [
"with open('Two.txt') as mytext:\n",
"    print(mytext.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading entire text as a string"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"with open('One.txt') as mytext:\n",
"    entire_text = mytext.read()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'This is a story about dogs\\nour canine pets\\nDogs are furry animals\\n'"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entire_text"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is a story about dogs\n",
"our canine pets\n",
"Dogs are furry animals\n",
"\n"
]
}
],
"source": [
"print(entire_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading Each Line as a List"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"with open('One.txt') as mytext:\n",
"    lines = mytext.readlines()"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['This is a story about dogs\\n',\n",
" 'our canine pets\\n',\n",
" 'Dogs are furry animals\\n']"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading in Words Separately"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"with open('One.txt') as f:\n",
"    words = f.read().lower().split()"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['this',\n",
" 'is',\n",
" 'a',\n",
" 'story',\n",
" 'about',\n",
" 'dogs',\n",
" 'our',\n",
" 'canine',\n",
" 'pets',\n",
" 'dogs',\n",
" 'are',\n",
" 'furry',\n",
" 'animals']"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building a vocabulary (Creating a \"Bag of Words\")\n",
"\n",
"Let's create dictionaries that correspond to unique mappings of the words in the documents. We can begin to think of this as mapping out all the possible words available for all (both) documents."
|
|
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"with open('One.txt') as f:\n",
"    words_one = f.read().lower().split()"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['this',\n",
" 'is',\n",
" 'a',\n",
" 'story',\n",
" 'about',\n",
" 'dogs',\n",
" 'our',\n",
" 'canine',\n",
" 'pets',\n",
" 'dogs',\n",
" 'are',\n",
" 'furry',\n",
" 'animals']"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"words_one"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(words_one)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"uni_words_one = set(words)"
|
|
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'a',\n",
" 'about',\n",
" 'animals',\n",
" 'are',\n",
" 'canine',\n",
" 'dogs',\n",
" 'furry',\n",
" 'is',\n",
" 'our',\n",
" 'pets',\n",
" 'story',\n",
" 'this'}"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uni_words_one"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Repeat for Two.txt**"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"with open('Two.txt') as f:\n",
"    words_two = f.read().lower().split()\n",
"    uni_words_two = set(words_two)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'a',\n",
" 'about',\n",
" 'catching',\n",
" 'fun',\n",
" 'is',\n",
" 'popular',\n",
" 'sport',\n",
" 'story',\n",
" 'surfing',\n",
" 'this',\n",
" 'water',\n",
" 'waves'}"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uni_words_two"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Get all unique words across all documents**"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"all_uni_words = set()\n",
"all_uni_words.update(uni_words_one)\n",
"all_uni_words.update(uni_words_two)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'a',\n",
" 'about',\n",
" 'animals',\n",
" 'are',\n",
" 'canine',\n",
" 'catching',\n",
" 'dogs',\n",
" 'fun',\n",
" 'furry',\n",
" 'is',\n",
" 'our',\n",
" 'pets',\n",
" 'popular',\n",
" 'sport',\n",
" 'story',\n",
" 'surfing',\n",
" 'this',\n",
" 'water',\n",
" 'waves'}"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_uni_words"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"full_vocab = dict()\n",
"i = 0\n",
"\n",
"for word in all_uni_words:\n",
"    full_vocab[word] = i\n",
"    i = i+1"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'water': 0,\n",
" 'sport': 1,\n",
" 'canine': 2,\n",
" 'pets': 3,\n",
" 'about': 4,\n",
" 'catching': 5,\n",
" 'dogs': 6,\n",
" 'furry': 7,\n",
" 'fun': 8,\n",
" 'story': 9,\n",
" 'is': 10,\n",
" 'our': 11,\n",
" 'surfing': 12,\n",
" 'animals': 13,\n",
" 'are': 14,\n",
" 'this': 15,\n",
" 'popular': 16,\n",
" 'a': 17,\n",
" 'waves': 18}"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Do not expect this to be in alphabetical order! \n",
|
|
"# The for loop goes through the set() in the most efficient way possible, not in alphabetical order!\n",
|
|
"full_vocab"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Bag of Words to Frequency Counts\n",
|
|
"\n",
|
|
"Now that we've encapsulated our \"entire language\" in a dictionary, let's perform *feature extraction* on each of our original documents:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Empty counts per doc**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 126,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Create an empty vector with space for each word in the vocabulary:\n",
|
|
"one_freq = [0]*len(full_vocab)\n",
|
|
"two_freq = [0]*len(full_vocab)\n",
|
|
"all_words = ['']*len(full_vocab)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 127,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
|
|
]
|
|
},
|
|
"execution_count": 127,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"one_freq"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 128,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
|
|
]
|
|
},
|
|
"execution_count": 128,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"two_freq"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 129,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']"
|
|
]
|
|
},
|
|
"execution_count": 129,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"all_words"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 130,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for word in full_vocab:\n",
|
|
" word_ind = full_vocab[word]\n",
|
|
" all_words[word_ind] = word "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 131,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"['water',\n",
|
|
" 'sport',\n",
|
|
" 'canine',\n",
|
|
" 'pets',\n",
|
|
" 'about',\n",
|
|
" 'catching',\n",
|
|
" 'dogs',\n",
|
|
" 'furry',\n",
|
|
" 'fun',\n",
|
|
" 'story',\n",
|
|
" 'is',\n",
|
|
" 'our',\n",
|
|
" 'surfing',\n",
|
|
" 'animals',\n",
|
|
" 'are',\n",
|
|
" 'this',\n",
|
|
" 'popular',\n",
|
|
" 'a',\n",
|
|
" 'waves']"
|
|
]
|
|
},
|
|
"execution_count": 131,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"all_words"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Add in counts per word per doc:**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 132,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# map the frequencies of each word in 1.txt to our vector:\n",
|
|
"with open('One.txt') as f:\n",
|
|
" one_text = f.read().lower().split()\n",
|
|
" \n",
|
|
"for word in one_text:\n",
|
|
" word_ind = full_vocab[word]\n",
|
|
" one_freq[word_ind]+=1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 133,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[0, 0, 1, 1, 1, 0, 2, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0]"
|
|
]
|
|
},
|
|
"execution_count": 133,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"one_freq"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 134,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Do the same for the second document:\n",
|
|
"with open('Two.txt') as f:\n",
|
|
" two_text = f.read().lower().split()\n",
|
|
" \n",
|
|
"for word in two_text:\n",
|
|
" word_ind = full_vocab[word]\n",
|
|
" two_freq[word_ind]+=1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 135,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 3, 0, 2, 0, 0, 1, 1, 1, 1]"
|
|
]
|
|
},
|
|
"execution_count": 135,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"two_freq"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 141,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>water</th>\n",
|
|
" <th>sport</th>\n",
|
|
" <th>canine</th>\n",
|
|
" <th>pets</th>\n",
|
|
" <th>about</th>\n",
|
|
" <th>catching</th>\n",
|
|
" <th>dogs</th>\n",
|
|
" <th>furry</th>\n",
|
|
" <th>fun</th>\n",
|
|
" <th>story</th>\n",
|
|
" <th>is</th>\n",
|
|
" <th>our</th>\n",
|
|
" <th>surfing</th>\n",
|
|
" <th>animals</th>\n",
|
|
" <th>are</th>\n",
|
|
" <th>this</th>\n",
|
|
" <th>popular</th>\n",
|
|
" <th>a</th>\n",
|
|
" <th>waves</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" water sport canine pets about catching dogs furry fun story is \\\n",
|
|
"0 0 0 1 1 1 0 2 1 0 1 1 \n",
|
|
"1 1 1 0 0 1 1 0 0 1 1 3 \n",
|
|
"\n",
|
|
" our surfing animals are this popular a waves \n",
|
|
"0 1 0 1 1 1 0 1 0 \n",
|
|
"1 0 2 0 0 1 1 1 1 "
|
|
]
|
|
},
|
|
"execution_count": 141,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.DataFrame(data=[one,two],columns=all_words)"
|
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By comparing the vectors we see that some words are common to both, some appear only in `One.txt`, others only in `Two.txt`. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Vectors would contain mostly zero values, making them **sparse matrices**.\n",
"\n",
"\n",
"# Concepts to Consider:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bag of Words and Tf-idf\n",
"In the above examples, each vector can be considered a *bag of words*. By itself these may not be helpful until we consider *term frequencies*, or how often individual words appear in documents. A simple way to calculate term frequencies is to divide the number of occurrences of a word by the total number of words in the document. In this way, the number of times a word appears in large documents can be compared to that of smaller documents.\n",
|
|
"\n",
|
|
"However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).\n",
|
|
"\n",
|
|
"Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)."
|
|
]
|
|
},
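{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration only, here is one way to compute tf-idf by hand from the count vectors we built above, using the plain logarithm formula. Note that scikit-learn's TfidfTransformer (Part Two) uses a smoothed idf and normalizes each row, so its numbers will differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"docs = [one_freq, two_freq]\n",
"N = len(docs)\n",
"\n",
"# Term frequency: word count divided by the total word count of the document\n",
"tf = [[count / sum(doc) for count in doc] for doc in docs]\n",
"\n",
"# Document frequency: how many documents contain each word\n",
"df = [sum(1 for doc in docs if doc[i] > 0) for i in range(len(all_words))]\n",
"\n",
"# Inverse document frequency on a logarithmic scale\n",
"idf = [math.log(N / df_i) for df_i in df]\n",
"\n",
"# tf-idf is simply the product of the two values\n",
"tfidf_scores = [[doc_tf[i] * idf[i] for i in range(len(idf))] for doc_tf in tf]\n",
"tfidf_scores"
]
},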
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stop Words and Word Stems\n",
"Some words like \"the\" and \"and\" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to only record the root of a word, say `cat` in place of both `cat` and `cats`. This will shrink our vocab array and improve performance."
]
},
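{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch (illustration only - the tiny stop word list below is hand-picked, and chopping a trailing 's' is just a stand-in for a real stemmer such as Porter's):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A tiny hand-picked stop word list; real systems use curated lists\n",
"stop_words = {'a', 'about', 'and', 'are', 'is', 'our', 'the', 'this'}\n",
"\n",
"filtered = [word for word in words_one if word not in stop_words]\n",
"\n",
"# Naive \"stemming\": strip a trailing 's' so 'dogs' and 'dog' share a root\n",
"stems = [word[:-1] if word.endswith('s') else word for word in filtered]\n",
"\n",
"print(filtered)\n",
"print(stems)"
]
},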
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenization and Tagging\n",
"When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.\n",
"\n",
"Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***."
]
},
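{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, whitespace splitting breaks down as soon as punctuation appears. A regular expression makes a slightly better (but still toy) tokenizer - this is just a sketch, not how production NLP libraries tokenize:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"sentence = \"Dogs are furry animals, aren't they?\"\n",
"\n",
"# Punctuation sticks to the words when we split on whitespace\n",
"print(sentence.split())\n",
"\n",
"# A crude regex tokenizer that keeps letters and internal apostrophes\n",
"print(re.findall(r\"[a-z']+\", sentence.lower()))"
]
},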
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part Two: Feature Extraction with Scikit-Learn\n",
"\n",
"Let's explore the more realistic process of using sklearn to complete the tasks mentioned above!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scikit-Learn's Text Feature Extraction Options"
]
},
{
"cell_type": "code",
"execution_count": 185,
"metadata": {},
"outputs": [],
"source": [
"text = ['This is a line',\n",
"        \"This is another line\",\n",
"        \"Completely different line\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CountVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 186,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer,CountVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 187,
"metadata": {},
"outputs": [],
"source": [
"cv = CountVectorizer()"
]
},
{
"cell_type": "code",
"execution_count": 188,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<3x6 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 10 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 188,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv.fit_transform(text)"
]
},
{
"cell_type": "code",
"execution_count": 189,
"metadata": {},
"outputs": [],
"source": [
"sparse_mat = cv.fit_transform(text)"
]
},
{
"cell_type": "code",
"execution_count": 190,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0, 0, 0, 1, 1, 1],\n",
"        [1, 0, 0, 1, 1, 1],\n",
"        [0, 1, 1, 0, 1, 0]], dtype=int64)"
]
},
"execution_count": 190,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sparse_mat.todense()"
]
},
{
"cell_type": "code",
"execution_count": 191,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}"
]
},
"execution_count": 191,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv.vocabulary_"
]
},
{
"cell_type": "code",
"execution_count": 192,
"metadata": {},
"outputs": [],
"source": [
"cv = CountVectorizer(stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 193,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0, 0, 1],\n",
"        [0, 0, 1],\n",
"        [1, 1, 1]], dtype=int64)"
]
},
"execution_count": 193,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv.fit_transform(text).todense()"
]
},
{
"cell_type": "code",
"execution_count": 194,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'line': 2, 'completely': 0, 'different': 1}"
]
},
"execution_count": 194,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv.vocabulary_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TfidfTransformer\n",
"\n",
"TfidfVectorizer is used on sentences, while TfidfTransformer is used on an existing count matrix, such as one returned by CountVectorizer"
|
|
]
},
{
"cell_type": "code",
"execution_count": 206,
"metadata": {},
"outputs": [],
"source": [
"tfidf_transformer = TfidfTransformer()"
]
},
{
"cell_type": "code",
"execution_count": 207,
"metadata": {},
"outputs": [],
"source": [
"cv = CountVectorizer()"
]
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {},
"outputs": [],
"source": [
"counts = cv.fit_transform(text)"
]
},
{
"cell_type": "code",
"execution_count": 209,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<3x6 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 10 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 209,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts"
]
},
{
"cell_type": "code",
"execution_count": 210,
"metadata": {},
"outputs": [],
"source": [
"tfidf = tfidf_transformer.fit_transform(counts)"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,\n",
"         0.61980538],\n",
"        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,\n",
"         0.4804584 ],\n",
"        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,\n",
"         0.        ]])"
]
},
"execution_count": 211,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfidf.todense()"
]
},
{
"cell_type": "code",
"execution_count": 212,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {},
"outputs": [],
"source": [
"pipe = Pipeline([('cv',CountVectorizer()),('tfidf',TfidfTransformer())])"
]
},
{
"cell_type": "code",
"execution_count": 219,
"metadata": {},
"outputs": [],
"source": [
"results = pipe.fit_transform(text)"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<3x6 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 10 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 220,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results"
]
},
{
"cell_type": "code",
"execution_count": 218,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,\n",
"         0.61980538],\n",
"        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,\n",
"         0.4804584 ],\n",
"        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,\n",
"         0.        ]])"
]
},
"execution_count": 218,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results.todense()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TfIdfVectorizer\n",
|
|
"\n",
|
|
"Does both above in a single step!"
|
|
]
},
{
"cell_type": "code",
"execution_count": 202,
"metadata": {},
"outputs": [],
"source": [
"tfidf = TfidfVectorizer()"
]
},
{
"cell_type": "code",
"execution_count": 203,
"metadata": {},
"outputs": [],
"source": [
"new = tfidf.fit_transform(text)"
]
},
{
"cell_type": "code",
"execution_count": 204,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,\n",
"         0.61980538],\n",
"        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,\n",
"         0.4804584 ],\n",
"        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,\n",
"         0.        ]])"
]
},
"execution_count": 204,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new.todense()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}