Unsupervised Learning With GloVe Word Embeddings on Search Queries
In this article, I’ll discuss the applications of GloVe with the help of an example. GloVe stands for Global Vectors for Word Representation, and it’s a relatively recent, state-of-the-art natural language processing technique for creating vector space models of word semantics, more commonly known as word embeddings. GloVe is an unsupervised algorithm developed at Stanford University that generates word embeddings by aggregating a global word-word co-occurrence matrix from a corpus. GloVe is similar to Word2Vec, the primary difference being that Word2Vec is a ‘predictive’ model that predicts context given a word, while GloVe is a count-based model that learns vectors by essentially performing dimensionality reduction on the co-occurrence count matrix while minimizing a cost function. Here is an implementation of GloVe in Python along with an overview of how it works, and here is the original paper published by Stanford.
Real World Applications of Word Embeddings:
Word embeddings in general can be used in various forms to accomplish numerous tasks, and they’re a very useful tool to have in your kit. The following are just a handful of tasks that can be accomplished with the help of word embeddings:
- Keyword Research: With keywords being pulled from multiple sources for paid search and SEO campaigns, combining them into topics and groups is often a manual task. When dealing with long tail keywords and phrases, the volume of unique keywords can easily exceed several thousand, and at that point manual keyword research becomes very difficult. This is where word embeddings come into the picture. Similar to the example presented in this article, we can use word vectors for unsupervised learning and cluster keywords based on the topics they represent. These clusters can then be used to form paid search campaigns and ad groups, in addition to a wide spectrum of SEO tasks from identifying keyword opportunities to measuring SEO campaign performance. For companies that depend on search engines and organic or paid search traffic for their revenue, these tasks are crucial and indispensable to the overall process of creating, analyzing and optimizing search campaign performance.
- Chat Bots: Word embeddings can prove to be extremely helpful when developing chat bots. The underlying functions of a chat bot include providing the user with relevant information, answering the user’s question, asking a follow-up question or keeping the conversation going in some way. Understanding user intent from the meaning behind words and sentences can take natural language processing to a whole new level, which in turn enables chat bots to accomplish these tasks much more effectively.
- Internal Search Engine: Ranking documents and pages according to the user’s key phrase is another application of word embeddings. This can be achieved by weighting each page’s word embeddings by their TF-IDF (term frequency-inverse document frequency) values, to give certain words more or less importance, and by computing a distance-based similarity such as cosine similarity between the pages and the keyword or phrase typed in by the user. The pages can then be ranked by their similarity scores, with the smallest distance being the most relevant. Other word embedding techniques can work for this purpose as well (a short sketch of this idea appears right after this list).
- News Based Recommendation Engine: Imagine we have a news app which displays the latest information on a particular topic. Recommending similar articles to the user could be done easily with doc2vec embeddings: the task would be finding the most similar articles and ranking them on a distance-based score. This is a very efficient solution that eliminates any need for manual curation.
- Product Based Recommendation Engine: Building a recommendation engine or an association rule based system at the category or sub-category level can be too broad and may lead to inefficiencies due to impersonal or generic results. On the other hand, building it at the SKU level has its own difficulties given the millions of items at that level. In addition to the resource requirements, data becomes very sparse at the deepest levels and difficult to process. If products or items could be grouped based on their descriptions and other text based features, this would solve the problems at both ends of the spectrum: the lack of relevancy or personalization on one side, and the resource requirements and computation on highly sparse data structures on the other.
- Understanding Search Terms: By grouping search terms and phrases into clusters, we can get immense insight into what products are currently trending or what users are unable to locate on the website. This information can assist us in optimizing the website navigation, product availability and promotional activities for the business unit. For example, if we know the most popular search terms are around a certain model of a mobile phone, we might consider displaying it higher up in the conversion funnel (homepage or category level). This piece of insight could come in handy to our supply chain and operations teams who in turn could maximize the availability of the product to users, thereby driving up the revenue potential.
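To make the internal search engine idea above more concrete, here is a minimal sketch of ranking a handful of pages against a query by weighting word vectors with IDF and comparing them with cosine similarity. The tiny page corpus and the random stand-in vectors are purely illustrative; in a real system the vectors would come from a trained GloVe model and the pages from your own site.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = [
    "red running shoes for men",
    "wireless noise cancelling headphones",
    "waterproof hiking boots for women",
]
query = "boots for women"

# Stand-in word vectors; a real system would use GloVe embeddings instead.
rng = np.random.default_rng(42)
vocab = set(" ".join(pages + [query]).split())
word_vectors = {w: rng.normal(size=50) for w in vocab}

# IDF weights learned from the page corpus.
tfidf = TfidfVectorizer().fit(pages)
idf = {w: tfidf.idf_[i] for w, i in tfidf.vocabulary_.items()}

def embed(text):
    # IDF-weighted average of the word vectors (unknown words are skipped).
    vecs = [word_vectors[w] * idf.get(w, 1.0) for w in text.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

page_matrix = np.vstack([embed(p) for p in pages])
scores = cosine_similarity(embed(query).reshape(1, -1), page_matrix)[0]
for i in np.argsort(-scores):  # most relevant page first
    print(round(scores[i], 3), pages[i])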
Case Study: Clustering Search Engine Queries
Having seen some of the use cases of word embeddings, let’s go ahead with a case study and implement it in Python. The data in this case study is from the Google Merchandise Store, a real e-commerce site that sells branded Google merchandise. Much of their data is openly available to the public in real time and can be accessed through Google Analytics.
There are a few business objectives to make a note of before we get started:
- Understand how many impressions are driven by which keyword groups and topics.
- Compare the CTR and search ranking performance of keyword clusters.
- Use the insights from the analysis to improve search engine optimization (SEO).
The search query data can be downloaded from Google Analytics under Acquisition > Search Console > Queries, or it can be downloaded directly from Search Console. Let’s import the packages and see what the data looks like.
import pandas as pd
import numpy as np
import string
import sys
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

path = 'C:\\Users\\kbhandari\\Desktop\\Document Clustering\\'
data = pd.read_csv(path+"Search Queries 20180701-20180930.csv")
data.head(n=10)
data.describe().round(1)

Output:
Search Query Clicks Impressions CTR Average Position
0 youtube merchandise 3200 10443 0.3064 1.0
1 youtube merch 2233 10200 0.2189 1.2
2 google t shirt 1832 6384 0.2870 1.0
3 youtube shop 1509 4196 0.3596 1.4
4 youtube store 1390 4360 0.3188 1.1
5 google merchandise store 1048 26798 0.0391 1.0
6 google backpack 966 7583 0.1274 1.3
7 google shirt 866 3555 0.2436 1.0
8 youtube t shirt 710 5034 0.1410 1.3
9 google stickers 696 3315 0.2100 1.9
Clicks Impressions Average Position
count 30357.0 30357.0 30357.0
mean 1.2 86.5 48.9
std 31.3 5756.1 38.5
min 0.0 1.0 1.0
25% 0.0 1.0 11.0
50% 0.0 2.0 45.0
75% 0.0 7.0 78.0
max 3200.0 881915.0 520.0
There are 5 columns and 30,357 unique search queries. An interesting fact is that although the maximum impression count is 881K, 75% of all search queries received 7 impressions or fewer. This means that along with the less relevant terms, we have a lot of long tail search queries with very little individual search volume. But we know that when long tail keywords are put together, they make up a larger share of total impressions than the broader, more generic high-volume terms do. This makes our analysis all the more compelling and necessary for a complete understanding of how these search queries perform.
Data Preprocessing (Part 1/2)
If we look at the data closely, we can see that it isn’t clean. There are punctuation marks, non-ASCII symbols, numbers and many spelling mistakes.
To clean up the data, I’ve written a function which preprocesses it based on a selection of arguments. The full Python code for this function is available in the GitHub repository under ‘Glove_Clustering.py’.
#Preprocess data
data = data_preprocess(data, column='Search Query', lower=True, ascii_chars=True, no_numbers=True, no_punctuation=True, remove_stopwords=True, lemmatize=False, custom_blank_text='non ascii symbols punctuations numbers')

#Check length of query
print(data['Query_Modified'].str.split().str.len().describe(percentiles=[0.25,0.5,0.75,0.80,0.85,0.90,0.95]))

Output:

Done
count    30357.000000
mean 2.434562
std 1.105504
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
80% 3.000000
85% 3.000000
90% 3.000000
95% 4.000000
max 47.000000
Name: Query_Modified, dtype: float64
In the ‘data_preprocess’ function above, we’re converting all letters to lower case, keeping only ASCII characters (the characters on a standard keyboard), and removing all numbers, punctuation and stopwords such as ‘a’, ‘the’, ‘from’, etc. For this particular example, I’ve set the lemmatize argument to False because it doesn’t change the cluster output very drastically; the word vectors for similar words will be very close irrespective of the word’s grammatical form (superlative, verb, noun, adverb, etc.). And lastly, for documents which ONLY contain punctuation, numbers or Unicode symbols, the blanks that remain after removal get replaced by some custom text. In this case I’ve chosen the text to be ‘non ascii symbols punctuations numbers’. The idea is that all these nonsensical search queries can then get grouped together into one cluster.
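The full implementation is in the repo, but as a rough sketch (not the exact function, and ignoring the optional lemmatization), the cleaning boils down to something like this:

import re
import string
from nltk.corpus import stopwords   # requires nltk.download('stopwords')

stops = set(stopwords.words('english'))

def clean_query(text, custom_blank_text='non ascii symbols punctuations numbers'):
    # Rough equivalent of the cleaning steps described above.
    text = str(text).lower()                                           # lower case
    text = text.encode('ascii', 'ignore').decode()                     # ASCII only
    text = re.sub(r'\d+', ' ', text)                                   # drop numbers
    text = text.translate(str.maketrans('', '', string.punctuation))   # drop punctuation
    words = [w for w in text.split() if w not in stops]                # drop stopwords
    cleaned = ' '.join(words)
    # Queries that were nothing but symbols/numbers end up empty; tag them
    # with the custom text so they land in one cluster later.
    return cleaned if cleaned else custom_blank_text

data['Query_Modified'] = data['Search Query'].apply(clean_query)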
The second part of the output above summarizes the length of each search query. We can see there are no blanks anymore and the average length of a search query is around 2 to 3 words, with the maximum going up to 47 words. Understanding the number of words per document in the corpus will be very useful for the upcoming sections on further preprocessing and extracting word vectors.
Data Preprocessing (Part 2/2)
The next phase in preprocessing the data is to get a list of all the non-GloVe words in the corpus along with their word counts. Non-GloVe words are those which appear in our corpus but are not in GloVe’s vocabulary. Some examples are ‘tshirt’, ‘emoji’, ‘googl’ and ‘youtub’. As humans we obviously know what these words mean, but we need to change these non-GloVe words into GloVe words so that they can be used in the cluster analysis.
from gensim.models import KeyedVectors
# Load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(path+filename, binary=False)
non_glove_words_df = get_non_glove_words(dataframe = data, column = 'Query_Modified', model = model)
print(len(non_glove_words_df))

Output:

Done
3260
In the code above, I’m loading the GloVe model into the workspace and passing it as an argument to the ‘get_non_glove_words’ function which I’ve written.
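A quick note on the ‘.word2vec’ file: KeyedVectors.load_word2vec_format can’t read the raw GloVe text file directly in gensim 3.x, so the file above is a converted copy. The one-off conversion looks something like this (in gensim 4.x you can skip it and pass no_header=True to load_word2vec_format instead):

# One-off conversion of the raw GloVe text file to word2vec format (gensim 3.x).
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(path + 'glove.6B.100d.txt', path + 'glove.6B.100d.txt.word2vec')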
The output of ‘get_non_glove_words’ is a dataframe of 3,260 unique non-GloVe words along with their word counts and the cumulative percentage of word counts. To interpret this table, let’s look at the first 5 words (tshirt, tshirts, emoji, ...). If we manually replaced these with something GloVe has seen, it would account for 695, or 11.38%, of all non-GloVe words in the corpus.
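For reference, the gist of ‘get_non_glove_words’ is simply counting every token that’s missing from the GloVe vocabulary, along these lines (a simplified sketch, not the exact function, and the column names here are my guess):

from collections import Counter
import pandas as pd

def get_non_glove_words_sketch(dataframe, column, model):
    counts = Counter()
    for query in dataframe[column]:
        for word in str(query).split():
            if word not in model:   # membership check works on gensim KeyedVectors
                counts[word] += 1
    out = pd.DataFrame(counts.most_common(), columns=['Word', 'Count'])
    out['Cumulative %'] = (100 * out['Count'].cumsum() / out['Count'].sum()).round(2)
    return out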
Now obviously, it’s not practical or feasible to replace ALL of these words manually. So I’ve written a piece of code that replaces them based on regex rules. For example, I’m using regex to replace all words which contain the letters ‘s’, ‘h’, ‘r’, ‘t’ in that sequence with the word ‘shirt’. Another example is replacing ‘drinkware’ with ‘drink ware’ based on a full word match. Here’s the code that does the regex-based replacements:
#Spelling Mistakes and Abbreviations
replacements = {
r'(?:^|\W)microfleece(?:$|\W)': ' micro fleece ',
r'(?:^|\W)drinkware(?:$|\W)': ' drink ware ',
r'(?:^|\W)shopee(?:$|\W)': ' shop ',
r'(?:^|\W)trackid(?:$|\W)': ' track id ',
r'(?:^|\W)tensorflow(?:$|\W)': ' tensor flow ',
r'(?:^|\W)men39s(?:$|\W)': ' men ',
r'(?:^|\W)packable(?:$|\W)': ' pack able ',
r'(?:^|\W)tube(?:$|\W)': ' youtube ',
r'(?:^|\W)tub(?:$|\W)': ' youtube ',
r'(?:^|\W)merch(?:$|\W)': ' merchandise ',
r'(?:^|\W)tee(?:$|\W)': ' t-shirt ',
r'(?:^|\W)tees(?:$|\W)': ' t-shirts ',
r'[a-z]*y[a-z]*[a-z]*t[a-z]*[a-z]*b[a-z]*[a-z]*e[a-z0-9]*': 'youtube',
r'[a-z]*y[a-z]*[a-z]*o[a-z]*[a-z]*u[a-z]*[a-z]*t[a-z0-9]*': 'youtube',
r'[a-z]*y[a-z]*[a-z]*o[a-z]*[a-z]*u[a-z]*[a-z]*b[a-z0-9]*': 'youtube ',
r'[a-z]*u[a-z]*[a-z]*t[a-z]*[a-z]*b[a-z]*[a-z]*e[a-z0-9]*': 'youtube',
r'[a-z]*g[a-z]*[a-z]*o[a-z]*[a-z]*l[a-z]*[a-z]*e[a-z0-9]*': 'google',
r'[a-z]*g[a-z]*[a-z]*o[a-z]*[a-z]*o[a-z]*[a-z]*g[a-z0-9]*': 'google',
r'[a-z]*s[a-z]*[a-z]*h[a-z]*[a-z]*o[a-z]*[a-z]*p[a-z0-9]*': 'shop',
r'[a-z]*a[a-z]*[a-z]*n[a-z]*[a-z]*d[a-z]*[a-z]*r[a-z0-9]*[a-z]*d[a-z]*[a-z]*': 'android',
r'[a-z]*s[a-z]*[a-z]*h[a-z]*[a-z]*r[a-z]*[a-z]*t[a-z0-9]*': 'shirt',
r'[a-z]*w[a-z]*[a-z]*m[a-z]*[a-z]*e[a-z]*[a-z]*n[a-z0-9]*': 'women',
r'[a-z]*a[a-z]*[a-z]*p[a-z]*[a-z]*r[a-z]*[a-z]*l[a-z0-9]*': 'apparel',
r'[a-z]*e[a-z]*[a-z]*m[a-z]*[a-z]*o[a-z]*[a-z]*j[a-z]*[a-z]*i[a-z]*[a-z0-9]*': 'emotion',
r'[a-z]*s[a-z]*[a-z]*t[a-z]*[a-z]*o[a-z]*[a-z]*r[a-z]*[a-z]*e[a-z]*[a-z0-9]*': 'store',
r'[a-z]*a[a-z]*[a-z]*c[a-z]*[a-z]*s[a-z]*[a-z]*r[a-z]*[a-z]*s[a-z]*[a-z0-9]*': 'accessories',
r'[a-z]*a[a-z]*[a-z]*c[a-z]*[a-z]*c[a-z]*[a-z]*r[a-z]*[a-z]*s[a-z]*[a-z0-9]*': 'accessories',
r'[a-z]*m[a-z]*[a-z]*e[a-z]*[a-z]*r[a-z]*[a-z]*c[a-z]*[a-z]*d[a-z]*[a-z]*i[a-z]*[a-z0-9]*': 'merchandise',
r'[a-z]*m[a-z]*[a-z]*e[a-z]*[a-z]*r[a-z]*[a-z]*c[a-z]*[a-z]*d[a-z]*[a-z]*e[a-z]*[a-z0-9]*': 'merchandise',
r'[a-z]*m[a-z]*[a-z]*r[a-z]*[a-z]*c[a-z]*[a-z]*h[a-z]*[a-z]*d[a-z]*[a-z]*z[a-z]*[a-z0-9]*': 'merchandise',
r'[a-z]*m[a-z]*[a-z]*e[a-z]*[a-z]*c[a-z]*[a-z]*h[a-z]*[a-z]*d[a-z]*[a-z]*i[a-z]*[a-z0-9]*': 'merchandise',
r'[a-z]*t[a-z]*[a-z]*i[a-z]*[a-z]*m[a-z]*[a-z]*b[a-z]*[a-z]*u[a-z]*[a-z]*k[a-z]*[a-z0-9]*': 'timbuktu',
r'[a-z]*w[a-z]*[a-z]*a[a-z]*[a-z]*t[a-z]*[a-z]*e[a-z]*[a-z]*r[a-z]*[a-z]*b[a-z]*[a-z]*t[a-z]*[a-z]*l[a-z]*[a-z0-9]': 'water bottle',
r'[a-z]*b[a-z]*[a-z]*a[a-z]*[a-z]*g[a-z]*[a-z]*s[a-z]*[a-z]*t[a-z]*[a-z]*o[a-z]*[a-z]*r[a-z]*[a-z]*e[a-z]*[a-z0-9]': 'bag store',
}

data['Query_Modified'].replace(replacements, regex=True, inplace=True)

#Extra Spaces
data['Query_Modified'] = data['Query_Modified'].apply(lambda x: re.sub(r"\s\s+", " ", str(x.strip())))

non_glove_words_df = get_non_glove_words(dataframe = data, column = 'Query_Modified', model = model)
print(len(non_glove_words_df))

Output:

Done
1558
So we can see from the output above that the total number of non-GloVe words in our corpus has been reduced from 3,260 to 1,558 by specifying only 33 regex-based rules. That’s a reduction of more than half! Pretty amazing!
Now the final step in the data preprocessing stage is to replace the remaining non-GloVe words with empty strings (or a custom text) in order to obtain the GloVe word vectors. This can be done with the following code:
data = replace_non_glove_words(data, non_glove_words_df, 'Query_Modified')
#Blank rows
print(len(data[data['Query_Modified'] == '']))
blanks = data[data['Query_Modified'] == '']
data.loc[data['Query_Modified'] == '','Query_Modified'] = 'non ascii symbols punctuations numbers'
data = data[data['Query_Modified'] != '']

#Length of query
print(data['Query_Modified'].str.split().str.len().describe(percentiles=[0.25,0.5,0.75,0.80,0.85,0.90,0.95]))

Output:

Progress: 100%
891
count    30357.000000
mean 2.501400
std 1.184896
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
80% 3.000000
85% 3.000000
90% 4.000000
95% 5.000000
max 46.000000
Name: Query_Modified, dtype: float64
The progress bar in the output above shows the completion progress of the ‘replace_non_glove_words’ function. The next line counts the number of empty strings in the corpus: there are 891 of these, which get replaced with ‘non ascii symbols punctuations numbers’ so that they can be grouped into the same cluster as the earlier nonsensical search queries. Finally, we get a summary of the number of words in each document of the corpus.
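As with the earlier helpers, ‘replace_non_glove_words’ essentially just drops any remaining token that GloVe hasn’t seen. A stripped-down version would look something like this (a sketch, assuming the ‘Word’ column from the earlier sketch, not the exact implementation):

def replace_non_glove_words_sketch(dataframe, non_glove_words_df, column):
    # Drop every remaining token that is not in the GloVe vocabulary.
    bad_words = set(non_glove_words_df['Word'])
    dataframe = dataframe.copy()
    dataframe[column] = dataframe[column].apply(
        lambda q: ' '.join(w for w in str(q).split() if w not in bad_words)
    )
    return dataframe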
Extracting GloVe Word Vectors
There are a few different ways to extract GloVe word vectors for each document in a corpus. I experimented with two techniques:
- First n Words — If the ’n’ argument is 3, this method would extract vectors for the first 3 words only. Once the word vectors are extracted, these would be stacked horizontally. So for GloVe embeddings of 100 dimensions with n=3, we would have 300 columns. This method works well on well defined structured data where text is present in the form of columns. For example, it would work well when there’s a specific column for color of the product and a separate one for the product size or the product brand.
- Sum Word Vectors — This method simply adds up the vectors of the words in each document. Let’s take an example: consider the text ‘This is a red book’. To apply the technique to this string, we extract the GloVe vectors for ‘this’, ‘is’, ‘a’, ‘red’ and ‘book’ and simply add them all up. This technique can handle words irrespective of their position in the document and therefore worked better for this case study (a short sketch of this approach appears after this list).
Other techniques — There may be many other ways to extract document vectors. Another approach is to use TF-IDF, or POS (part-of-speech) tagging of words that represent nouns, adjectives, adverbs, etc., to pick out the top keywords first. These are viable alternatives, especially if the documents are long. Instead of taking the first ’n’ words or summing all the word vectors, one might also consider averaging them or using a min-max technique to capture the extremes of the vectors.
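Here’s a minimal sketch of the ‘sum word vectors’ approach on its own, using the gensim KeyedVectors model loaded earlier (the function name and column layout here are illustrative, not the actual ‘extract_vectors’ implementation):

import numpy as np
import pandas as pd

def sum_word_vectors_sketch(df, column, model, dims=100):
    # Sum the 100-dimensional GloVe vectors of the words in each query.
    def doc_vector(text):
        vecs = [model[w] for w in str(text).split() if w in model]
        return np.sum(vecs, axis=0) if vecs else np.zeros(dims)

    vectors = np.vstack([doc_vector(q) for q in df[column]])
    out = pd.DataFrame(vectors, columns=['dim_' + str(i) for i in range(dims)])
    out.insert(0, column, df[column].values)   # query text + 100 vector columns
    return out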
#Extracting glove vectors

#Method 1 (not run)
#cluster_dataset = extract_vectors(data,column='Query_Modified',method='first_n_words',n=3)

#Method 2
cluster_dataset = extract_vectors(data,column='Query_Modified',method='sum_word_vectors')
print(cluster_dataset.shape)

Output:
(30357, 101)
The ‘extract_vectors’ function that I wrote, called with the ‘sum_word_vectors’ argument, yields a dataframe of 101 columns, as I’m using the 100-dimension GloVe embeddings file. The first column holds the queries and the next 100 columns hold the sum of each document’s word vectors. Once we have this, we can move on to the exciting part.
Implementing K-Means Clustering
There are a vast number of different clustering techniques, such as spectral clustering, probability based clustering, density based clustering, K-medoids, hierarchical clustering, etc., but I’ve decided to go with K-means as it’s the simplest and most commonly used clustering algorithm. The code below implements the K-means algorithm and the elbow method to give a ballpark estimate of how many clusters to choose.
#Clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(cluster_dataset.iloc[:,1:])
del cluster_dataset

wcss = []
for i in range(1, 52, 5):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X_train)
    wcss.append(kmeans.inertia_)
    print('\rProgress: %d' % i, end='')
    sys.stdout.flush()

import matplotlib.pyplot as plt
plt.plot(range(1, 52, 5), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 8, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X_train)
del X_train

data['CLUSTERS'] = kmeans.labels_
data.to_csv(path+"cluster_sum_glove_vectors.csv",index=False)
From the elbow plot, we can see the elbow at around 6 clusters. The Y-axis represents the WCSS, the within-cluster sum of squares: the sum of squared distances from the observations within a cluster to their centroid. After a bit of trial and error, I’ve decided on 8 final clusters for this analysis.
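In symbols, if C_k denotes cluster k and \mu_k its centroid, the quantity plotted on the Y-axis is

WCSS = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

which is exactly what scikit-learn exposes as kmeans.inertia_ in the loop above.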
Cluster Summary
The final clusters can be downloaded from here. Sorting the dataframe by clusters would give a sense of homogeneity of the points that lie within a particular cluster. To understand the terms that fall within the 8 clusters, I’ve written a function to help us do that:
#Cluster Summary
cluster_summary = summary(data, 'CLUSTERS', 'Search Query', 'Query_Modified', top_n=10, show_original=False)
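The gist of the ‘summary’ function is a group-by on the cluster labels that aggregates the key metrics and lists the most frequent words per cluster. A stripped-down sketch (not the exact implementation; the aggregated columns are simply the metrics present in this dataset) might look like this:

from collections import Counter
import pandas as pd

def summary_sketch(df, cluster_col, original_col, modified_col, top_n=10, show_original=False):
    text_col = original_col if show_original else modified_col
    rows = []
    for cluster_id, grp in df.groupby(cluster_col):
        top_words = Counter(' '.join(grp[text_col].astype(str)).split()).most_common(top_n)
        rows.append({
            cluster_col: cluster_id,
            'Queries': len(grp),
            'Clicks': grp['Clicks'].sum(),
            'Impressions': grp['Impressions'].sum(),
            'Top Words': ', '.join(w for w, _ in top_words),
        })
    return pd.DataFrame(rows)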
The ‘top_n’ argument refers to the number of top keywords to display in the summary based on their word count. The last argument — ‘show_original’ displays the original unmodified search query column in its pristine form when set to ‘True’. Anyway, here is the summary:
- Cluster 1 (Index 0): This cluster represents all the non-ASCII symbols, numbers, punctuation and stopwords that would make the string empty when removed. Also included in this cluster are the non-GloVe words which are either nonsensical or misspelled. Queries in this cluster are less relevant to the Google Merchandise Store, and this can be seen in the low number of clicks and low CTR.
- Cluster 2 (Index 1): The second cluster has 6,622 observations and also the highest WCSS score, which means the queries in this cluster are the least homogeneous of all the clusters. Here we can see terms such as ‘Android’, ‘nest’ (which probably refers to Google’s Nest thermostat) and ‘Waze’ (an alternative to Google Maps). Since this group is so diverse, we can create sub-clusters to understand the queries better.
- Cluster 3 (Index 2): This next one is an interesting cluster. Here we’ve got branded terms such as ‘Google’, ‘YouTube’ and ‘Android’. The ‘www’ indicates that some of these queries might be navigational. Since most terms in this cluster are branded and likely to be navigational, we would expect these queries to rank higher up in the search results, and this shows in the weighted average position column: this cluster’s average position is the lowest (i.e. the best) of all the clusters. Not surprisingly, this cluster drives the highest share of impressions to the site.
- Cluster 4 (Index 3): The fourth cluster has grouped queries with words such as ‘merchandise’, ‘apparel’, ‘clothing’, ‘shop’, ‘store’ and ‘near’ which suggests that users typing these queries are looking for merchandise shops or apparel stores or places to buy clothing near them. The term ‘play’ also indicates people looking for items on Google’s play store.
- Cluster 5 (Index 4): This one has grouped generic products and items such as water bottles, cups, stickers, bags, backpack, laptops, etc. Since these are generic and non-branded terms, we can see the average rank is 35, which is a lot higher than cluster #3 consisting of branded items.
- Cluster 6 (Index 5): Here we have queries related to shirts, tees and t-shirts, men’s shirts, women’s shirts, black and white colored shirts, and shops that sell shirts. An interesting insight here is that the Google Merchandise Store isn’t ranking high enough for these types of queries. The average rank is the highest for this cluster, suggesting the website doesn’t show up high enough in the SERPs for these queries on average.
- Cluster 7 (Index 6): Queries in this cluster all revolve around YouTube, ranging from YouTube merchandise such as shirts and jackets to the YouTube logo and other random YouTube shows and channels in between. Some of these queries are highly relevant to the Google Merchandise Store and hence rank higher in the search results.
- Cluster 8 (Index 7): The last cluster has the least number of queries within it and most of them relate to shipping, address and delivery information.
Sub Cluster Summary And Insights
As some of the clusters obtained here can be highly diverse, let’s try to break things down a notch. Implementing the same steps above on the observations of cluster #4, with 2 sub-clusters chosen, gives the following summary:
The K-means algorithm has managed to group queries with terms such as ‘bottle’, ‘water’, ‘glass’, ‘oz’, etc. together in one sub-cluster, and queries containing ‘bag’, ‘backpack’, ‘canvas’, ‘stickers’, etc. in another. Interestingly, the first sub-cluster also has a higher CTR and a lower average position, indicating its performance is much better than the second one’s.
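For reference, reproducing that sub-clustering step is just a matter of re-fitting K-means on the rows of a single cluster. A rough sketch, assuming the scaled matrix X_train has been kept in memory (i.e. the ‘del X_train’ step above is skipped) and that the generic products cluster carries label 4 in your run (adjust to your own labels):

# Sub-clustering one cluster: refit K-Means on its rows only.
from sklearn.cluster import KMeans

target_cluster = 4   # label of the generic products cluster (assumption; adjust as needed)
mask = (data['CLUSTERS'] == target_cluster).values
sub_kmeans = KMeans(n_clusters=2, init='k-means++', random_state=42).fit(X_train[mask])
data.loc[data['CLUSTERS'] == target_cluster, 'SUB_CLUSTERS'] = sub_kmeans.labels_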
Recommendation: Although only one cluster is broken down into sub clusters, we can still make a useful recommendation here. Focusing SEO efforts on optimizing campaigns around water bottles, cups, mugs, drinkware, etc. would be beneficial to the Google Merchandise Store. This is because queries around water bottles are already performing better than the other categories so a small push here can improve organic search performance and help acquire more organic traffic at a faster pace than SEO campaigns on other generic product categories.
To obtain clusters which have a higher degree of homogeneity, one can either increase the number of clusters in the K-Means algorithm or drill down into a particular cluster like I’ve done and get more detailed insights. The performance of the unsupervised learning depends a LOT on the quality of data available and how it is preprocessed by us.
Doing this analysis was fun and there was a lot to learn. Please let me know in the comments below how you go about clustering text, and whether there’s a better way to go about it than word embeddings.
If you like this post, give it a 👏 and ❤️. And Many Thanks for your genuine Support, it matters.
Till then- keep Learning, keep Sharing, keep Growing.