Natural Language Processing (NLP) for Machine Learning
Machine learning with natural language is faced with one major hurdle – its algorithms usually deal with numbers, and natural language is, well, text. So we need to transform that text into numbers, otherwise known as text vectorization.
[toc]
Form the Dataset
Positive Samples
- Twitter Information Operations: Insights into attempts to manipulate Twitter by state-backed entities.
- User Dataset: follower count, following count, account creation date, etc.
- Tweets Dataset: tweet content, hashtags, etc.
- Both datasets share a `user_id` field, which tells us which tweets belong to which user. Based on this, we can use all the tweets of an account as a feature of that user, and convert this feature into numeric values that a machine learning model can use directly.
Negative Samples
- User Dataset — “Tweets Loud and Quiet”
- Tweets Dataset — sentiment140
- These two datasets have no connection to each other, but they tell us the distribution of user information and of tweet content separately. That is why we can form negative samples simply by sampling tweets from the tweets dataset to serve as a user's tweet feature. The total number of tweets posted by a user is given by the user dataset, as is the frequency of the user's tweeting behavior (total number of tweets divided by the time horizon). A sketch of how both positive and negative samples could be assembled follows this list.
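Below is a minimal pandas sketch of how such an assembly could look. The DataFrame and column names (`users_df`, `tweets_df`, `ordinary_users_df`, `sent140_df`, `tweet_text`) are illustrative placeholders, not the actual schema of the released datasets.

```python
import pandas as pd

# Toy stand-ins for the real tables; the column names are illustrative only.
users_df = pd.DataFrame({"user_id": [1, 2], "follower_count": [32, 23], "following_count": [1, 45]})
tweets_df = pd.DataFrame({"user_id": [1, 1, 2], "tweet_text": ["RT Campaign by MEPs ...", "Another instance ...", "Fidel ..."]})
ordinary_users_df = pd.DataFrame({"user_id": [3, 4], "follower_count": [2245, 4], "following_count": [3332, 0]})
sent140_df = pd.DataFrame({"tweet_text": ["#Style used to be ...", "#AI RT @couponfree01 ...", "hello world"]})

# Positive samples: join the state-backed user table with its tweets on user_id.
pos_tweets = (tweets_df.groupby("user_id")["tweet_text"]
              .apply(" ".join)                 # all tweets of one account as a single string
              .rename("tweet_content"))
positive = users_df.set_index("user_id").join(pos_tweets, how="inner").reset_index()
positive["label"] = 1

# Negative samples: the ordinary-user table and sentiment140 are unrelated,
# so we attach randomly sampled sentiment140 tweets to each ordinary user.
negative = ordinary_users_df.copy()
negative["tweet_content"] = (sent140_df["tweet_text"]
                             .sample(n=len(negative), replace=True, random_state=42)
                             .values)
negative["label"] = 0

dataset = pd.concat([positive, negative], ignore_index=True)
print(dataset)
```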
Sample of Natural Language Dataset
| user_id (index) | follower_count | following_count | tweet_content | state-backed label |
| --- | --- | --- | --- | --- |
| 1 | 32 | 1 | @DiazCanelB: Campaign by MEPs against Cuba rejected in Belgium. Another instance of the Empire’s vulgar and interfering policy of subver… RT | 1 |
| 2 | 23 | 45 | @DiazCanelB: Fidel: “I keep in mind..that Bolivar was the man that José Martí most admired. | 1 |
| 3 | 2245 | 3332 | #Style used to be an #interaction between the #human #soul and tools that were limiting. | 0 |
| 4 | 4 | 0 | #AI RT @couponfree01: #udemy Free Discount - The Complete Node.js Developer Course (3rd Edition) | 0 |
Sample of Numeric Dataset
| user_id (index) | follower_count | following_count | against | campaign | … | Developer | mind | state-backed label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 32 | 1 | 0.63 | 0.77 | … | 0.65 | 0 | 1 |
| 2 | 23 | 45 | 0 | 0 | … | 0 | 1 | 1 |
| 3 | 2245 | 3332 | 0 | 0 | … | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | … | 0.64 | 0 | 0 |
Note: the numeric values are not the number of times a word appears in the sample; they are the TF-IDF values of the words. That is why the values are decimals instead of integers. TF-IDF is introduced in the Vectorizing Data section below. The main reason for doing this is to reduce the dimensionality and to measure the features of the samples in a more principled way.
Pre-processing Data
Remove punctuation
Punctuation can provide grammatical context that supports our understanding of a sentence. But for a vectorizer that counts words rather than context, it adds no value, so we remove all special characters.
e.g.: How are you? -> How are you
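A minimal sketch of this step in Python (assuming stripping ASCII punctuation is enough for our purposes):

```python
import string

def remove_punctuation(text):
    # Delete every ASCII punctuation character, keeping letters, digits and spaces.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("How are you?"))  # -> How are you
```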
Tokenization
Tokenization is the process of segmenting running text into sentences and words. In essence, it is the task of cutting a text into pieces called tokens, and at the same time throwing away certain characters, such as punctuation.
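For example, with NLTK's `word_tokenize` (a sketch; it assumes the tokenizer data can be downloaded, and newer NLTK versions may ask for `punkt_tab` instead of `punkt`):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, needed once

tokens = word_tokenize("How are you doing today?")
print(tokens)  # ['How', 'are', 'you', 'doing', 'today', '?']
```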
Remove stopwords
Stopwords are common words that will likely appear in any text. They don’t tell us much about our data so we remove them.
e.g.: silver or lead is fine for me -> silver, lead, fine
We pass two parameters to `CountVectorizer`: `max_df` and `stop_words`. The first tells it to ignore all words that appear in more than 85% of the documents, since those may be unimportant. The latter is a custom stop-word list. You can also use the stop words that are native to sklearn by setting `stop_words='english'`.
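A sketch of this on a toy corpus, with a hypothetical custom stop-word list:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "campaign by meps against cuba rejected in belgium",
    "fidel was the man that jose marti most admired",
    "style used to be an interaction between the human soul and tools",
]

# Hypothetical custom stop-word list; stop_words='english' would use sklearn's built-in list.
custom_stop_words = ["the", "in", "by", "to", "an", "was", "that", "and", "be"]

cv = CountVectorizer(max_df=0.85, stop_words=custom_stop_words)
word_count_vector = cv.fit_transform(docs)

print(word_count_vector.shape)           # (number of documents, vocabulary size)
print(cv.get_feature_names_out()[:10])   # first few terms kept in the vocabulary
```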
Lemmatizing
For grammatical reasons, documents use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it would be useful for a search for one of these words to return documents that contain another word in the set.
Lemmatizing is better than stemming because it uses a dictionary-based approach, i.e. a morphological analysis, to reach the root word.
e.g.: entitling, entitled -> entitle
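A sketch with NLTK's WordNet lemmatizer (assuming the WordNet data can be downloaded; the `pos="v"` hint tells it to treat the words as verbs):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the dictionary the lemmatizer looks words up in

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("entitling", pos="v"))   # entitle
print(lemmatizer.lemmatize("entitled", pos="v"))    # entitle
print(lemmatizer.lemmatize("organizing", pos="v"))  # organize
```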
Vectorizing Data
Vectorizing is the process of encoding text as numbers, i.e. converting it into numeric feature vectors so that machine learning algorithms can understand our data.
Bag-Of-Words
In its simplest (binary) form, it records 1 if a word is present in the document and 0 if it is not; more generally, it builds a document-term matrix of word counts for each text document.
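A minimal sketch with sklearn's `CountVectorizer`; `binary=True` gives the 1/0 presence form, while the default gives raw counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["silver lead fine", "silver silver mine"]

bow = CountVectorizer(binary=True)   # 1 if the word is present, 0 otherwise
matrix = bow.fit_transform(docs)

print(bow.get_feature_names_out())   # ['fine' 'lead' 'mine' 'silver']
print(matrix.toarray())              # [[1 1 0 1]
                                     #  [0 0 1 1]]
```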
Extracting features with TF-IDF
Why is TF-IDF used in Machine Learning?
As noted above, machine learning algorithms deal with numbers, so text has to be vectorized before it can be analyzed. Vectorization is a fundamental step in machine learning on text data, and different vectorization algorithms will drastically affect the end results, so you need to choose one that will deliver the results you're hoping for.
What is TF-IDF?
TF-IDF stands for Term Frequency – Inverse Document Frequency. It is one of the most important techniques in information retrieval for representing how important a specific word or phrase is to a given document.
The TF-IDF value increases in proportion to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general.
- The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency, by length of a document, or by the raw frequency of the most frequent word in a document.
- The inverse document frequency of the word across a set of documents. This means, how common or rare a word is in the entire document set. The closer it is to 0, the more common a word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain a word, and calculating the logarithm.
- So, if the word is very common and appears in many documents, this number will approach 0; if the word is rare, the number will be larger.
Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
Mathematical Formulation
To put it in more formal mathematical terms, the TF-IDF score for the word t in the document d from the document set D is calculated as follows:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

where, using the raw-count variant of term frequency described above,

$$\mathrm{tf}(t, d) = \text{count of } t \text{ in } d, \qquad \mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}$$

and $N$ is the total number of documents in $D$.
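A small hand computation of this formula on a toy corpus (note that sklearn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers would differ slightly):

```python
import math

docs = [
    "campaign against cuba rejected",
    "campaign for human rights",
    "style and human soul",
]

def tf(term, doc):
    return doc.split().count(term)                # raw count of the term in the document

def idf(term, docs):
    df = sum(term in d.split() for d in docs)     # number of documents containing the term
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("campaign", docs[0], docs))  # in 2 of 3 documents -> low score, about 0.41
print(tfidf("cuba", docs[0], docs))      # in 1 of 3 documents -> higher score, about 1.10
```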
How can one reduce the TF-IDF model size?
- The most effortless way is to filter out infrequent words. You can achieve this with the `min_df` argument, which ignores terms that have a document frequency lower than `min_df`. If it is a float, the parameter represents a proportion of documents; if an integer, an absolute count. When dealing with a relatively large corpus, a `min_df` of 5, 10, or 50 reduces the size of the vocabulary significantly while maintaining (or often improving) the accuracy.
- `max_features` considers only the top `max_features` terms, ordered by term frequency across the corpus. This is useful if you have a strict limit on the size of the TF-IDF features (e.g. up to 100,000 TF-IDF features). A sketch using both options follows this list.
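A sketch on a toy corpus; `min_df=2` is used here only because the corpus is tiny, whereas on a large corpus 5, 10, or 50 would be reasonable, and `max_features` caps the vocabulary size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; in the project each "document" would be all tweets of one user.
corpus = [
    "campaign against cuba rejected in belgium",
    "campaign by meps against cuba",
    "free udemy discount for the nodejs developer course",
    "style is an interaction between the human soul and tools",
]

vectorizer = TfidfVectorizer(
    min_df=2,            # ignore terms seen in fewer than 2 documents
    max_features=100,    # keep at most the 100 most frequent terms
    stop_words="english",
)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['campaign' 'cuba']
print(X.shape)                             # (4, 2)
```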
Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
I haven't tried this in our project. We could discuss how to do this part if we want higher performance or want to dive deeper into natural language processing.
Some Basic Idea of Constructing Features:
- The average length of the tweets posted by a user.
- The average sentence length (based on the intuition that provocative sentences tend to contain few words, to make a clear slogan). A small sketch of both features follows this list.
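A minimal sketch of these two features, splitting on whitespace for words and on basic end-of-sentence punctuation for sentences:

```python
import re

def avg_tweet_length(tweets):
    # Average number of words per tweet for one user.
    return sum(len(t.split()) for t in tweets) / len(tweets)

def avg_sentence_length(tweets):
    # Average number of words per sentence across all of a user's tweets.
    sentences = [s for t in tweets for s in re.split(r"[.!?]+", t) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

tweets = [
    "Campaign against Cuba rejected in Belgium. Another instance of interference.",
    "Fidel: Bolivar was the man Jose Marti most admired.",
]
print(avg_tweet_length(tweets))     # 9.5 words per tweet
print(avg_sentence_length(tweets))  # about 6.3 words per sentence
```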
Metric
- Accuracy can be a misleading metric for imbalanced datasets. Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives an accuracy score of 0.95.
- Precision: In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the query.
- Recall: recall is the fraction of the relevant documents that are successfully retrieved.
- Precision and recall are then defined as:

  $$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

- F-measure: the traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall (a sklearn sketch follows this list):

  $$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
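A sketch with sklearn's metric functions on an imbalanced toy example, illustrating how the all-negative baseline gets high accuracy but zero precision, recall and F1:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 9 negatives, 1 positive; the classifier predicts "negative" for everything.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))                    # 0.9 -- looks good but is misleading
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 -- no positives were predicted
print(recall_score(y_true, y_pred))                      # 0.0 -- the one positive was missed
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```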