Natural Language Processing (NLP) for Machine Learning

Machine learning with natural language is faced with one major hurdle – its algorithms usually deal with numbers, and natural language is, well, text. So we need to transform that text into numbers, otherwise known as text vectorization.

[toc]

Form the Dataset

Positive Samples

  • Twitter Information Operations: Insights into attempts to manipulate Twitter by state-backed entities.
    • User Dataset: followers count, following count, account creation date, etc.
    • Tweets Dataset: tweet content, hashtags, etc.
  • Both datasets share a user_id field, which tells us which tweets belong to which account. Based on this, we can use all the tweets of an account as a feature of that user and convert this feature into numeric values that a machine learning model can use directly (a minimal sketch follows this list).
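Below is such a sketch using pandas; the file names and column names (user_id, tweet_text, tweet_content) are illustrative assumptions rather than the exact schema of the Twitter Information Operations dumps.

```python
# Sketch: join the user table with per-account concatenated tweets (pandas).
# File paths and column names are illustrative assumptions.
import pandas as pd

users = pd.read_csv("io_users.csv")     # one row per state-backed account
tweets = pd.read_csv("io_tweets.csv")   # one row per tweet, keyed by user_id

# Concatenate every tweet of an account into one text "document".
docs = (
    tweets.groupby("user_id")["tweet_text"]
    .apply(" ".join)
    .rename("tweet_content")
    .reset_index()
)

positive = users.merge(docs, on="user_id", how="inner")
positive["label"] = 1  # state-backed
```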

Negative Samples

  • User Dataset — “Tweets Loud and Quiet”
  • Tweets Dataset — sentiment140
  • These two datasets have no direct connection, but they tell us the distribution of user information and of tweet content separately. That is why we can combine them to form the negative samples: simply sample tweets from the tweets dataset and use them as a user's tweet feature. The users dataset tells us the total number of tweets each user has posted, from which the frequency of the user's tweeting behavior can be calculated (total number divided by the time horizon). A sketch of this sampling step follows.
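Again with pandas, and again with illustrative column names (tweet_count, text) rather than the real schemas:

```python
# Sketch: build negative samples by attaching randomly sampled sentiment140
# tweets to each "Tweets Loud and Quiet" user. Column names are assumptions.
import pandas as pd

neg_users = pd.read_csv("loud_and_quiet_users.csv")
neg_tweets = pd.read_csv("sentiment140.csv")

def fake_timeline(n_tweets):
    """Sample n tweets (with replacement) and join them into one document."""
    sampled = neg_tweets["text"].sample(n=int(n_tweets), replace=True)
    return " ".join(sampled)

neg_users["tweet_content"] = neg_users["tweet_count"].apply(fake_timeline)
neg_users["label"] = 0  # ordinary, non state-backed accounts
```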

Sample of Natural Language Dataset

| user_id (index) | follower_count | following_count | tweet_content | state-backed label |
|---|---|---|---|---|
| 1 | 32 | 1 | @DiazCanelB: Campaign by MEPs against Cuba rejected in Belgium. Another instance of the Empire’s vulgar and interfering policy of subver… RT | 1 |
| 2 | 23 | 45 | @DiazCanelB: Fidel: “I keep in mind..that Bolivar was the man that José Martí most admired. | 1 |
| 3 | 2245 | 3332 | #Style used to be an #interaction between the #human #soul and tools that were limiting. | 0 |
| 4 | 4 | 0 | #AI RT @couponfree01: #udemy Free Discount - The Complete Node.js Developer Course (3rd Edition) | 0 |

Sample of Numeric Dataset

| user_id (index) | follower_count | following_count | against | campaign | Developer | mind | state-backed label |
|---|---|---|---|---|---|---|---|
| 1 | 32 | 1 | 0.63 | 0.77 | 0.65 | 0 | 1 |
| 2 | 23 | 45 | 0 | 0 | 0 | 1 | 1 |
| 3 | 2245 | 3332 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0.64 | 0 | 0 |

Note: the numeric values are not the number of times a word appears in the sample; they are the TF-IDF values of the words, which is why they are decimals instead of integers. TF-IDF is introduced in the Vectorizing Data section below. The main reason for doing this is to reduce the dimensionality and to measure the features of the samples in a more principled way.

Pre-processing Data

Remove punctuation

Punctuation can provide grammatical context to a sentence, which supports our understanding. But for our vectorizer, which counts words rather than context, it adds no value, so we remove all special characters.

e.g.: How are you? -> How are you
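A minimal sketch with Python's standard library (string.punctuation covers ASCII punctuation only; emoji and other symbols would need extra handling):

```python
import string

def remove_punctuation(text):
    """Strip all ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("How are you?"))  # -> How are you
```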

Tokenization

Tokenization is the process of segmenting running text into sentences and words. In essence, it is the task of cutting a text into pieces called tokens, and at the same time throwing away certain characters, such as punctuation.
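One simple way to tokenize is a regular expression that keeps only word characters; libraries such as NLTK or spaCy offer more careful tokenizers.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Campaign by MEPs against Cuba rejected in Belgium."))
# ['campaign', 'by', 'meps', 'against', 'cuba', 'rejected', 'in', 'belgium']
```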

Remove stopwords

Stopwords are common words that will likely appear in any text. They don’t tell us much about our data so we remove them.

e.g.: silver or lead is fine for me -> silver, lead, fine.

We pass two parameters to CountVectorizer: max_df and stop_words. The first tells the vectorizer to ignore all words that appear in more than 85% of the documents, since those are likely unimportant. The latter is a custom stop-word list; you can also use the stop words built into sklearn by setting stop_words='english'. A sketch is shown below.
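The custom stop-word list in this sketch is only a placeholder, not the list actually used in the project; tweet_documents stands in for the per-user documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

custom_stop_words = ["rt", "amp", "https"]  # placeholder, not the real list

# Ignore words that appear in more than 85% of documents, plus the stop words.
cv = CountVectorizer(max_df=0.85, stop_words=custom_stop_words)
# Or fall back to sklearn's built-in English list:
# cv = CountVectorizer(max_df=0.85, stop_words="english")

# word_counts = cv.fit_transform(tweet_documents)  # tweet_documents: list of strings
```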

Lemmatizing

For grammatical reasons, documents use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it would be useful for a search for one of these words to return documents that contain another word in the set.

Lemmatization is better than stemming because it uses a dictionary-based approach, i.e. a morphological analysis, to find the root word.

e.g.: entitling, entitled -> entitle
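A short sketch using NLTK's WordNet lemmatizer (assumes nltk is installed and the wordnet corpus can be downloaded):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the dictionary
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("entitling", pos="v"))  # entitle
print(lemmatizer.lemmatize("entitled", pos="v"))   # entitle
```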

Vectorizing Data

Vectorizing is the process of encoding text in numeric form to create feature vectors so that machine learning algorithms can understand our data.

Bag-Of-Words

In its simplest (binary) form, bag-of-words records a 1 if a word is present in the document and 0 if it is not, producing a document-term count matrix over the text documents. A sketch follows.
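A sketch of a binary bag-of-words using scikit-learn's CountVectorizer with binary=True:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["silver or lead is fine for me", "silver lead fine"]

bow = CountVectorizer(binary=True)
X = bow.fit_transform(docs)

print(bow.get_feature_names_out())
print(X.toarray())  # 1 if the word occurs in the document, 0 otherwise
```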

Extracting features with TF-IDF

Why is TF-IDF used in Machine Learning?

As noted above, machine learning algorithms deal with numbers, so natural-language text must first be vectorized. This is a fundamental step in analyzing text data with machine learning, and different vectorization algorithms will drastically affect the end results, so you need to choose one that will deliver the results you’re hoping for.

What is TF-IDF?

TF-IDF stands for Term Frequency – Inverse Document Frequency. It is one of the most important techniques used in information retrieval to represent how important a specific word or phrase is to a given document.

The TF-IDF value increases in proportion to the number of times a word appears in the document but is often offset by the frequency of the word in the corpus, which helps to adjust with respect to the fact that some words appear more frequently in general.

  • The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency, by length of a document, or by the raw frequency of the most frequent word in a document.
  • The inverse document frequency of the word across a set of documents. This means, how common or rare a word is in the entire document set. The closer it is to 0, the more common a word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain a word, and calculating the logarithm.
  • So, if the word is very common and appears in many documents, this number will approach 0; for a rare word that appears in only a few documents, it will be large.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.

Mathematical Definition

To put it in more formal mathematical terms, the TF-IDF score for the word $t$ in the document $d$ from the document set $D$ is calculated as follows:

$$\operatorname{tfidf}(t, d, D) = \operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D)$$

where, using the simplest raw-count form of term frequency,

$$\operatorname{tf}(t, d) = \operatorname{freq}(t, d), \qquad \operatorname{idf}(t, D) = \log\frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert}$$

and $N = \lvert D \rvert$ is the total number of documents in the set.
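As a learning aid, here is a from-scratch translation of those formulas into Python; in practice one would use scikit-learn's TfidfVectorizer, which applies a smoothed variant.

```python
import math

def tf(term, doc):
    """Raw count of `term` in the tokenized document `doc`."""
    return doc.count(term)

def idf(term, docs):
    """log(N / number of documents containing `term`)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

corpus = [["campaign", "against", "cuba"], ["free", "nodejs", "course"]]
print(tf_idf("campaign", corpus[0], corpus))  # 1 * log(2/1) ≈ 0.69
```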

How can one reduce the TF-IDF model size?

  • The most effortless way is to filter out infrequent words. You can achieve this by setting min_df to ignore terms that have a document frequency lower than min_df. If it is a float, the parameter represents a proportion of documents; if an integer, an absolute count. When dealing with a relatively large corpus, a min_df of 5, 10, or 50 reduces the size of the vocabulary significantly while maintaining (or often improving) the accuracy (see the sketch after this list).
  • max_features: consider only the top max_features terms, ordered by term frequency across the corpus. This is useful if you have a strict limit on the number of TF-IDF features (e.g. up to 100,000).
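A sketch of both options with scikit-learn's TfidfVectorizer; the thresholds are just the example values mentioned above, and tweet_documents is a placeholder for the list of per-user documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=5,              # drop terms appearing in fewer than 5 documents
    max_features=100_000,  # keep at most the 100,000 most frequent terms
    stop_words="english",
)
# X = tfidf.fit_transform(tweet_documents)  # tweet_documents: list of strings
```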

Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

I haven’t tried this in our project yet. We could discuss how to do this part if we want higher performance or want to dive deeper into natural language processing.

Some basic ideas for constructing features (a sketch follows the list):

  • The average length of the tweets posted by a user.
  • The average length of a sentence (based on the intuition that provocative sentences tend to use few words in order to make a clear slogan).
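A rough sketch of those two features; here tweets is assumed to be the list of raw tweets for one user, and text their concatenated content.

```python
import re

def avg_tweet_length(tweets):
    """Average number of words per tweet posted by a user."""
    return sum(len(t.split()) for t in tweets) / max(len(tweets), 1)

def avg_sentence_length(text):
    """Average number of words per sentence (naive split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / max(len(sentences), 1)

print(avg_sentence_length("Make America great. Vote now!"))  # 2.5
```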

Metrics

  • Accuracy can be a misleading metric for imbalanced data sets. Consider a sample with 95 negative and 5 positive values: classifying all values as negative gives an accuracy of 0.95.
  • Precision: In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the query.
  • Recall: recall is the fraction of the relevant documents that are successfully retrieved.
  • Precision and recall are then defined as:

$$\text{Precision} = \frac{\lvert \{\text{relevant}\} \cap \{\text{retrieved}\} \rvert}{\lvert \{\text{retrieved}\} \rvert}, \qquad \text{Recall} = \frac{\lvert \{\text{relevant}\} \cap \{\text{retrieved}\} \rvert}{\lvert \{\text{relevant}\} \rvert}$$

  • F-measure: the traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
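All three metrics are available in scikit-learn; a toy sketch (the labels and predictions below are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # toy labels (1 = state-backed)
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]  # toy classifier predictions

print(precision_score(y_true, y_pred))  # 2/3
print(recall_score(y_true, y_pred))     # 2/3
print(f1_score(y_true, y_pred))         # 2/3
```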
