Topic Modeling news headlines using LDA, LSA, LSI and HDP

Link to Github repo

Introduction

In this project we implement topic modeling on the Australian Broadcasting Corporation (“ABC”) headlines dataset, which combines the text and publication dates of ~1.1M ABC News article headlines published over 2003-2017. The goal of this exercise is to uncover, in an unsupervised manner, common topics shared across article headlines. These topics can then be used to assign an unseen article headline to a topic category, with potential applications in document indexing and retrieval and in content recommendation systems. While various methods exist for implementing topic modeling, this project makes use of Latent Dirichlet Allocation (“LDA”), Latent Semantic Analysis (“LSA”), Latent Semantic Indexing (“LSI”) and Hierarchical Dirichlet Process (“HDP”). Note that LSA and LSI are two names for the same underlying technique; we use “LSA” for the scikit-learn implementation and “LSI” for the Gensim one.


Exploratory Data Analysis

The ABC News dataset spans 1.1M article headlines published over 2003-2017 and covers various article categories such as news, politics, business, sports and opinion. To get a better grasp of the contents of these headlines, we can visualize the most common words in the corpus, as shown in Figure 1. The top 10 most common words are police, new, man, says, govt, court, council, interview, NSW and Australia, suggesting a focus on everyday Australian news events such as law enforcement actions, government announcements and policies, judicial proceedings and news interviews.

Figure 1. Most common words in ABC News Headline dataset

We can further visualize the distribution of the number of words and characters per headline to get a better idea of how much text we have to feed to our topic modeling algorithms. Headlines average 6.4 words and 40.2 characters in length, for a total corpus size of 7.1M words.

Figure 2. Distribution of headline word lengths in ABC News Headline dataset
Figure 3. Distribution of headline character lengths in ABC News Headline dataset

Using the TextBlob library, we can further extract the part-of-speech (“POS”) tag of each word in our headlines to better understand their common grammatical structures. As expected, nouns (NN, NNS), adjectives (JJ), prepositions (IN) and verbs (VB, VBP, VBZ) make up our top 7 POS tags by frequency:

from textblob import TextBlob
tagged_headlines = [TextBlob(reindexed_data[i]).pos_tags for i in range(reindexed_data.shape[0])]

Figure 4. Average headline character length in ABC News Headline dataset

Next we can plot ABC News’ published headline counts at the year, month and day levels for 2003-2017, to check whether any time periods are disproportionately represented in the corpus and whether publication volumes display any seasonality. As shown in Figure 5, yearly headline counts increase gradually over 2004-2014 and then decline sharply from 2014 onwards, while the month level shows sharp 50-70% decreases in headline counts in September 2006, January 2015 and January 2016.

At the day level, headline counts appear to respect a maximum of 250 headlines per day across 2003-2011. This daily maximum increases to approximately 400 headlines across 2012-2016 before falling back to 200-250 for the remainder of the period. We also observe only eight days with zero published headlines, all occurring prior to 2009:

Figure 5. Year-, month- and day-level trends in ABC News headline counts

To get a better grasp of possible seasonality in our data, we can further plot headline counts by day of the month, weekday and month. As shown in Figure 6, day-of-month headline counts show little seasonality, remaining broadly stable in the 32,500-37,500 range for each day of the month. The 31st displays ~40% fewer published headlines, in line with the fact that only seven months of the year have 31 days.
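
The ~40% figure for the 31st can be sanity-checked with a short calculation. The sketch below uses only the Python standard library and assumes, purely for illustration, that headlines are spread evenly across calendar days. It counts how often each day-of-month number occurs on the calendar between 2003 and 2017:

```python
import calendar
from collections import Counter

# Count how often each day-of-month number occurs on the calendar in 2003-2017
day_occurrences = Counter()
for year in range(2003, 2018):
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            day_occurrences[day] += 1

# Compare the number of 31sts with the number of 1sts
share_of_31st = day_occurrences[31] / day_occurrences[1]
print(day_occurrences[1], day_occurrences[31])  # 180 105
print(round(1 - share_of_31st, 3))              # 0.417
```

Under this even-spread assumption the 31st occurs roughly 42% less often than the 1st, which is consistent with the drop described above.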

Headline counts approximately halve on weekend days relative to weekdays, consistent with the expectation that most news is published during the week. We can further observe moderate seasonality at the month level, with headline counts approximately 8% lower in December, January and February (the Australian summer months) than in the rest of the year:

Figure 6. Day-, weekday- and month-level seasonality trends in ABC News headline counts

Theoretical background to topic modeling

While we won’t go too in depth into the math underlying most topic models, as other blog posts such as [1] and [2] provide rather comprehensive descriptions of this theory, it is useful to understand the assumptions and high-level intuition underlying popular topic models such as LDA. LDA assumes that documents can be represented as distributions over topics and that topics can be represented as distributions over words, with the order of words in a document considered irrelevant. Given this irrelevance of word order, each document in our corpus can be thought of as a ‘bag of words’, i.e. a high-dimensional vector of the counts of each word occurring in that document, which then serves as the feature set informing our topic predictions.
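
To make the bag-of-words idea concrete, here is a minimal sketch using only the Python standard library. The two example headlines and the bag_of_words() helper are made up for illustration and are not part of the project code:

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a document, ignoring word order."""
    return Counter(text.lower().split())

# Two hypothetical headlines containing the same words in a different order
doc_a = "police charge man after crash"
doc_b = "man crash after police charge"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Both documents map to the same bag of words because order is discarded
print(bow_a == bow_b)

# A shared, sorted vocabulary turns each bag into a fixed-length count vector
vocabulary = sorted(set(bow_a) | set(bow_b))
vector_a = [bow_a[word] for word in vocabulary]
print(vocabulary)
print(vector_a)
```

With a real corpus the vocabulary runs to thousands of words, so most entries of each count vector are zero, which is why libraries store these vectors in sparse form.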

As the first step in this topic modeling process, LDA randomly assigns each word in our corpus to one of K topics, where K is a hyperparameter we supply to control the number of topics our model sorts documents into. Assuming for simplicity a corpus of d documents of w words each, LDA next iterates through all d * w words in our corpus and computes the below statistics for each word wj:

  • p(wj | tk), or the proportion of assignments to topic tk across all documents that come from word wj. This captures how strongly word wj is associated with topic tk across our d documents.
    • This can be thought of as a more ‘global’ measure of word relevance to topic tk, since we look across our entire corpus to calculate it
    • If this number is low, it is less likely that wj, in the corpus context, is relevant for describing tk
  • p(tk | di), or the proportion of words in document di that are assigned to topic tk. This reflects LDA’s assumption that the more words in a document are mapped to topic tk, the more likely it is that another word wj occurring in that same document di also maps to tk.
    • This can be thought of as a more ‘local’ measure of topic relevance to document di, since we look only at document di to calculate it
    • If this number is low, it is less likely that wj, in the current document context, is relevant for describing tk

After calculating the above two metrics for wj across each of our topics, we multiply them to obtain a single score, proportional to p(tk | wj, di), by which to update the topic assignment of wj. Grossly simplifying, if either p(wj | tk) or p(tk | di) is low, indicating respectively that word wj is weakly associated with tk across the corpus or that a low proportion of words in the current document di map to tk, this combined score will also be low, and vice versa. Therefore, if w0 was previously assigned to topic 1 and we find that the score for topic 0 exceeds the score for topic 1 in document d0, LDA favors re-assigning w0 from topic 1 to topic 0. This process finishes once we have completed a pre-defined number of iterations, yielding K topics, as visualized in the accompanying graph.
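
In code, the update loop described above can be sketched as follows. This is a simplified collapsed Gibbs sampler written with NumPy for illustration only, not the implementation used by sklearn or Gensim. The function name lda_gibbs_sketch, the smoothing constants alpha and beta and the toy documents are all assumptions introduced here, and the new topic is sampled in proportion to the combined score rather than always taking the highest-scoring topic:

```python
import numpy as np

def lda_gibbs_sketch(docs, n_topics, vocab_size, n_iter=20, alpha=0.1, beta=0.01, seed=0):
    """Simplified collapsed Gibbs sampling for LDA over integer word ids."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign every word occurrence to one of the K topics
    assignments = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    topic_word = np.zeros((n_topics, vocab_size))  # word counts per topic
    doc_topic = np.zeros((len(docs), n_topics))    # topic counts per document
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            topic_word[t, w] += 1
            doc_topic[d, t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = assignments[d][i]
                # Remove the current word from the counts before scoring
                topic_word[t_old, w] -= 1
                doc_topic[d, t_old] -= 1
                # 'Global' term: how strongly word w is associated with each topic
                p_word_given_topic = (topic_word[:, w] + beta) / (
                    topic_word.sum(axis=1) + beta * vocab_size
                )
                # 'Local' term: how prevalent each topic is in document d
                p_topic_given_doc = (doc_topic[d] + alpha) / (
                    doc_topic[d].sum() + alpha * n_topics
                )
                score = p_word_given_topic * p_topic_given_doc
                # Sample the new topic in proportion to the combined score
                t_new = rng.choice(n_topics, p=score / score.sum())
                assignments[d][i] = t_new
                topic_word[t_new, w] += 1
                doc_topic[d, t_new] += 1
    return assignments, topic_word, doc_topic

# Toy corpus: four documents over a six-word vocabulary (word ids 0-5)
docs = [[0, 1, 2, 0], [3, 4, 3, 5], [0, 2, 1], [4, 5, 3]]
assignments, topic_word, doc_topic = lda_gibbs_sketch(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

The smoothing constants keep every score strictly positive, so a topic that currently holds no words in a document can still be selected.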


Topic Modeling

We explore two distinct approaches to preprocessing our text data to evaluate how these impact the separation of our topic groups. The first uses the scikit-learn library’s text preprocessing functions to remove stopwords only, with no lemmatization or stemming applied. The second uses the NLTK and Gensim libraries’ preprocessing functions to apply lemmatization, stemming and stopword removal to our headline text.

As we will be testing the sklearn library’s LSA and LDA model implementations in our first approach, we randomly sample 10,000 article headlines to quickly ascertain how well each algorithm is able to separate our data. We use an sklearn CountVectorizer object to remove stopwords and generate a document-term matrix of token counts, in which each row is a document, each column is a unique word token and each entry is the frequency of that token in that document. The below words2vec() function returns the fitted CountVectorizer object as well as this token frequency matrix for use in future explorations:

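from sklearn.feature_extraction.text import CountVectorizer

# tdec is a project-specific decorator that is not shown in this post;
# judging by the outputs further below, it prints each call's run time
@tdec
def words2vec(data, max_features=40000):
    # Drop English stopwords and keep at most max_features of the most frequent tokens
    count_vectorizer = CountVectorizer(stop_words='english', max_features=max_features)
    document_term_matrix = count_vectorizer.fit_transform(data)
    return count_vectorizer, document_term_matrix

# reindexed_data holds the headline text loaded earlier in the project
small_text_sample = reindexed_data.sample(n=10000, random_state=0).values
counter_vectorizer, small_document_term_matrix = words2vec(data=small_text_sample, max_features=40000)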
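
The @tdec decorator used above is not defined in this post. Judging by the "sec to complete" lines in the outputs further below, it appears to time each call. A minimal sketch of such a decorator, which is an assumption about its behavior rather than the project's actual code, could look like this (slow_sum is a made-up function used only to demonstrate it):

```python
import time
from functools import wraps

def tdec(func):
    """Print how long the wrapped function took, then return its result."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("{:.1f} sec to complete {}".format(elapsed, func))
        return result
    return wrapper

# Hypothetical function used only to demonstrate the decorator
@tdec
def slow_sum(n):
    return sum(range(n))

total = slow_sum(1000)
```

Using functools.wraps keeps the decorated function's name and docstring intact, which helps when several timed functions appear in the same log.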

We can check the output of this token frequency matrix with the below code, verifying that stopwords such as ‘four’, ‘from’ and ‘in’ are absent from the converted token counts:

# inv_transform_count_vectorizer is a project helper that is not shown in this post
print('Before preprocessing ', small_text_sample[1])
print('Words converted to vector ', small_document_term_matrix[1])
print('Word vector inverse transformed to word output ', inv_transform_count_vectorizer(counter_vectorizer, small_document_term_matrix[1]))

Out:
>>>> Before preprocessing  four freed from car crash in ocean reef
>>>> Words converted to vector    (0, 4549)	1
  (0, 2014)	1
  (0, 2854)	1
  (0, 7721)	1
  (0, 9074)	1
>>>> Word vector inverse transformed to word output  [array(['freed', 'car', 'crash', 'ocean', 'reef'], dtype='<U17')]

i) LSA using scikit-learn

With this token frequency matrix created, we can begin fitting topic models to our data. We begin by training an sklearn LSA model, specifying 8 topics as the number of components. As LSA represents topics as weighted combinations of word tokens, once this model is fitted we can visualize the top 15 words of each of our eight predicted topics to get a better grasp of the overarching themes each captures. While several words co-occur across different topics, we can map the below topics to the categories shown in Table I:

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE

lsa_model = TruncatedSVD(n_components=8)
lsa_topic_matrix = lsa_model.fit_transform(small_document_term_matrix)
lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories, lsa_counts = keys_to_counts(lsa_keys)

top_n_words_lsa = get_top_words(n = 15, n_topics = 8, keys = lsa_keys, document_term_matrix = small_document_term_matrix, count_vectorizer = counter_vectorizer)
for i in range(len(top_n_words_lsa)):
    print("Topic {}: ".format(i+1), top_n_words_lsa[i])

Out:
1.6 sec to complete <function get_top_words at 0x7f198f6f5160>
Topic 1:  police death probe missing car drug woman search fatal attack shooting assault man investigate case
Topic 2:  man charged murder jailed dies court accused guilty arrested bail woman canberra stabbing death pleads
Topic 3:  new laws year cancer years queensland sets trial zealand centre york hope opens set ceo
Topic 4:  says wa government group school claims help power minister plans mp labor funding opposition support
Topic 5:  court face accused high trial sex charges ban told faces case set challenge hears murder
Topic 6:  govt qld urged sa plan hospital work act vic boost funds closer defends cut considers
Topic 7:  council election plan water takes centre calls backs fears business lake park residents start brisbane
Topic 8:  interview australia health nsw report world china coast win australian wins sydney cup day killed

Topic | Top 6 words | Category
1 | police death probe missing car drug | Crime reports
2 | man charged murder jailed dies court | Prison-related
3 | new laws year cancer years queensland | New laws / policy announcements
4 | says wa government group school claims | Domestic politics
5 | court face accused high trial sex | Judicial proceedings
6 | govt qld urged sa plan hospital | Economy
7 | council election plan water takes centre | Election coverage / politics
8 | interview australia health nsw report world | Interview / world politics
Table I. Sklearn LSA predicted topic categories and top words
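
The get_keys() and keys_to_counts() helpers called above are not defined in this post. One plausible reading, sketched below as an assumption rather than the project's actual code, is that get_keys() assigns each headline to its highest-weighted topic and keys_to_counts() tallies how many headlines fall into each topic. The toy matrix is made up for illustration:

```python
from collections import Counter
import numpy as np

def get_keys(topic_matrix):
    """Return the index of the highest-weighted topic for each document."""
    return np.asarray(topic_matrix).argmax(axis=1).tolist()

def keys_to_counts(keys):
    """Return the topic ids that occur and how many documents map to each."""
    count_pairs = sorted(Counter(keys).items())
    categories = [topic for topic, _ in count_pairs]
    counts = [count for _, count in count_pairs]
    return categories, counts

# Hypothetical document-topic matrix: 4 documents, 3 topics
toy_topic_matrix = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.0, 0.1, 0.9],
])
keys = get_keys(toy_topic_matrix)
categories, counts = keys_to_counts(keys)
print(keys)        # [1, 0, 1, 2]
print(categories)  # [0, 1, 2]
print(counts)      # [1, 2, 1]
```

Taking the argmax works for both models: LDA rows are topic probabilities, while LSA rows are signed component weights, so for LSA the "highest-weighted topic" is only a rough notion of membership.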

We can plot the distribution of headline counts for each of these eight LSA topics, which shows that the domestic and world politics categories account for approximately 60% of all headlines in our sample:

Figure 7. Distribution of headline counts by sklearn LSA predicted topic

However, plotting a two-dimensional t-SNE representation of our LSA document-topic matrix shows relatively poor separability between our topic groups, suggesting LSA may not be appropriate for this specific topic modeling task. We next investigate whether LDA can provide better separability between predicted topics.

Figure 8. t-SNE representation of sklearn LSA predicted topics

ii) LDA using scikit-learn

Fitting an LDA model with 8 topics, we can similarly examine the top 15 words of each of our eight predicted topics and extract high-level topic categories as shown in Table II:

n_topics = 8

# Fitted here on the 10,000-headline sample vectorized above
lda_model_sklearn = LatentDirichletAllocation(n_components=n_topics, learning_method='online', random_state=0, verbose=0)
lda_topic_matrix_sklearn = lda_model_sklearn.fit_transform(small_document_term_matrix)
lda_keys_sklearn = get_keys(lda_topic_matrix_sklearn)
lda_categories_sklearn, lda_counts_sklearn = keys_to_counts(lda_keys_sklearn)

top_n_words_lda_sklearn = get_top_words(n=15, n_topics=n_topics, keys=lda_keys_sklearn, document_term_matrix=small_document_term_matrix, count_vectorizer=counter_vectorizer)
for i in range(len(top_n_words_lda_sklearn)):
    print("Topic {}: ".format(i+1), top_n_words_lda_sklearn[i])

Out:
1.6 sec to complete <function get_top_words at 0x7f198e234820>
Topic 1:  police child calls day court says abuse dead change market missing climate claims nt vic
Topic 2:  council court coast murder gold government face says national police iraq drug man case news
Topic 3:  man charged police nsw sydney home road hit crash guilty jailed melbourne centre new pleads
Topic 4:  says wa death sa abc australian report open sex final laws mp action opposition safety
Topic 5:  new qld election ban country future trial end industry hour pay port dies company cancer
Topic 6:  interview australia world cup china south accused pm hill work rain jail ahead push team
Topic 7:  police health govt hospital plan boost car minister school house probe help wins set regional
Topic 8:  new water killed high attack public farmers funding police urged years charges continue woman oil

Topic | Top 6 words | Category
1 | police child calls day court says | Crime reports
2 | council court coast murder gold government | Judicial proceedings
3 | man charged police nsw sydney home | Accident / crime reports
4 | says wa death sa abc australian | Politics / scandals
5 | new qld election ban country future | Domestic politics
6 | interview australia world cup china south | Interview / world politics
7 | police health govt hospital plan boost | Economy
8 | new water killed high attack public | Miscellaneous
Table II. Sklearn LDA predicted topic categories and top words

We can further observe roughly equal headline counts across each of our LDA topics:

Figure 9. Distribution of headline counts by sklearn LDA predicted topic

Plotting a two-dimensional t-SNE representation of our LDA predicted topic probability matrix shows much greater separability between our topic groups, suggesting LDA may be more appropriate for this specific topic modeling task. Given this improved separability, we can move forward with scaling up our analysis using LDA, this time using 100,000 headlines.

Figure 10. t-SNE representation of LDA predicted topics

Once our LDA model is retrained on a 100K headline sample, we can visualize the historical distribution of each LDA topic by year using the below correlation heatmap and grouped bar chart. We can observe a sharp increase in the prevalence of ‘Economy’ topic 7 from 2014 onwards relative to other topics, with economy-related headlines being the most represented of all topics in 2014, 2016 and 2017, possibly suggesting an increased focus on economic issues for the Australian public during this period:

Figure 11. Correlation heatmap of sklearn LDA topics by year

We can also observe a sharp increase in the number of ‘Politics / scandals’ topic 4 headlines over 2006-2008, with this topic capturing the highest number of headlines in each of these years, possibly suggesting that political issues and headline-grabbing scandals dominated the Australian news cycle during this period:

Figure 12. Sklearn LDA predicted topic frequencies by year

iii) LDA using NLTK and Gensim

To test the impact of introducing lemmatization and stemming into our data preprocessing pipeline, we use the NLTK and Gensim libraries to apply lemmatization, stemming and stopword removal in a single preprocessing function.

At a high level, lemmatization and stemming are both text preprocessing techniques that reduce inflectional forms of words to a common base form. This yields a more condensed feature set, which should make it easier for our topic model to separate predicted topics.

Stemming is the cruder of the two: a heuristic process that simply chops off the ends of words (e.g. verbs meeting and meets being chopped to base verb meet). Lemmatization instead considers the grammatical and morphological context of each word, in theory allowing for improved differentiation between words (e.g. reducing verb meeting to base verb meet while keeping noun meeting as meeting, which stemming would miss, or mapping verbs went and goes to verb go, which stemming would similarly mis-map).
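
The difference can be illustrated with a deliberately tiny pure-Python sketch. Both toy_stem() and toy_lemmatize() are made-up stand-ins written for this example; they are far simpler than NLTK's SnowballStemmer and WordNetLemmatizer and only mimic the kind of behavior described above:

```python
def toy_stem(word):
    """Crude stemmer: strip a few common suffixes without any context."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lookup standing in for a real lemmatizer's dictionary
TOY_LEMMAS = {
    ("meeting", "verb"): "meet",
    ("meets", "verb"): "meet",
    ("went", "verb"): "go",
    ("goes", "verb"): "go",
}

def toy_lemmatize(word, pos):
    """Look up the base form for (word, part of speech); fall back to the word."""
    return TOY_LEMMAS.get((word, pos), word)

examples = [("meeting", "verb"), ("meeting", "noun"), ("went", "verb"), ("goes", "verb")]
for word, pos in examples:
    print(word, pos, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word, pos))
```

The toy stemmer maps both uses of meeting to meet and cannot relate went or goes to go, while the lookup-based lemmatizer keeps the noun meeting intact and maps both irregular verb forms to go.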

We carry out this preprocessing with the below NLTK WordNetLemmatizer and SnowballStemmer functions after removing stopwords using Gensim’s stopwords list. Once our text is preprocessed, we build a Gensim Dictionary mapping each unique token to an integer ID and use its doc2bow() method to convert each headline into a bag of words, i.e. a list of (token ID, token frequency) pairs, similar to a row of the matrix produced by sklearn’s CountVectorizer:

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# WordNetLemmatizer needs the WordNet corpus, e.g. via nltk.download('wordnet')

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_and_stem(text):
    # Lemmatize the token as a verb (pos='v'), then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def lemmatize_stem_remove_stopwords(text):
    result = []
    # simple_preprocess lowercases and tokenizes the headline
    for token in simple_preprocess(text):
        # Skip stopwords and tokens of three characters or fewer
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_and_stem(token))
    return result

processed_docs = raw_data['headline_text'].map(lemmatize_stem_remove_stopwords)

#dictionary containing the mapping of all words, a.k.a tokens to their unique integer id
bowdict = gensim.corpora.Dictionary(processed_docs)
bowdict.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [bowdict.doc2bow(doc) for doc in processed_docs]
print('BOW for headline #4310:', bow_corpus[4310])

Out:
>>>> BOW for headline #4310: [(76, 1), (112, 1), (483, 1), (4014, 1)]
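
The filter_extremes() call above prunes the dictionary by document frequency. As we understand Gensim's documented behavior, it keeps tokens that appear in at least no_below documents and in no more than a no_above fraction of all documents, then keeps the keep_n most frequent of those. The pure-Python sketch below re-implements that rule on made-up tokenized headlines; filter_vocabulary() is a hypothetical helper written for this example, not a Gensim function:

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below, no_above, keep_n):
    """Return the tokens kept under document-frequency thresholds."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents each token appears in
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    kept = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    ]
    # Keep only the keep_n most frequent of the surviving tokens
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

# Made-up, already stemmed headlines
toy_docs = [
    ["polic", "charg", "man"],
    ["polic", "probe", "crash"],
    ["polic", "charg", "court"],
    ["polic", "council", "plan"],
]
kept_tokens = filter_vocabulary(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(kept_tokens)  # ['charg']
```

Here polic is dropped for appearing in every document and the one-off tokens are dropped for being too rare, which mirrors the intent of no_below=15 and no_above=0.5 on the full corpus.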

With this bag of words corpus created, we can proceed with fitting LDA using a Gensim LdaMulticore model object and look at the top 10 words of each of our eight predicted topics:

n_topics = 8

lda_model_gensim = gensim.models.LdaMulticore(bow_corpus, num_topics=n_topics, id2word=bowdict, passes=2, workers=2)
for idx, topic in lda_model_gensim.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Out:
Topic: 0 
Words: 0.027*"trump" + 0.017*"say" + 0.014*"south" + 0.010*"minist" + 0.010*"state" + 0.009*"china" + 0.008*"polit" + 0.008*"close" + 0.008*"call" + 0.008*"vote"
Topic: 1 
Words: 0.034*"australian" + 0.032*"australia" + 0.020*"world" + 0.016*"open" + 0.014*"countri" + 0.014*"elect" + 0.012*"hour" + 0.011*"year" + 0.010*"share" + 0.009*"market"
Topic: 2 
Words: 0.027*"queensland" + 0.014*"water" + 0.013*"break" + 0.011*"return" + 0.011*"citi" + 0.011*"busi" + 0.010*"news" + 0.010*"royal" + 0.009*"john" + 0.008*"take"
Topic: 3 
Words: 0.017*"canberra" + 0.017*"coast" + 0.014*"hospit" + 0.013*"rise" + 0.012*"price" + 0.011*"peopl" + 0.011*"gold" + 0.010*"victoria" + 0.010*"fall" + 0.009*"show"
Topic: 4 
Words: 0.036*"polic" + 0.021*"charg" + 0.017*"death" + 0.016*"murder" + 0.014*"court" + 0.013*"woman" + 0.013*"crash" + 0.012*"die" + 0.012*"alleg" + 0.011*"kill"
Topic: 5 
Words: 0.012*"turnbul" + 0.012*"australia" + 0.009*"win" + 0.009*"port" + 0.008*"island" + 0.008*"trial" + 0.007*"season" + 0.007*"star" + 0.007*"sydney" + 0.007*"christma"
Topic: 6 
Words: 0.014*"nation" + 0.013*"rural" + 0.012*"tasmania" + 0.012*"donald" + 0.012*"chang" + 0.011*"indigen" + 0.011*"servic" + 0.010*"concern" + 0.010*"communiti" + 0.009*"worker"
Topic: 7 
Words: 0.016*"govern" + 0.015*"plan" + 0.013*"live" + 0.013*"school" + 0.012*"tasmanian" + 0.011*"council" + 0.009*"fund" + 0.009*"industri" + 0.008*"farm" + 0.008*"power"

Topic | Top 10 words | Category
1 | trump say south minist state china polit close call vote | World politics
2 | australian australia world open countri elect hour year share market | Domestic politics
3 | queensland water break return citi busi news royal john take | General news
4 | canberra coast hospit rise price peopl gold victoria fall show | Markets
5 | polic charg death murder court woman crash die alleg kill | Crime reports
6 | turnbul australia win port island trial season star sydney christma | Domestic politics
7 | nation rural tasmania donald chang indigen servic concern communiti worker | New laws / policy announcements
8 | govern plan live school tasmanian council fund industri farm power | Government-related

The predicted topics of our Gensim LDA model are quite different from those of our sklearn LDA model, and we can further observe somewhat reduced separability between our topic clusters. It also appears that stemming and lemmatization contributed to producing more condensed, localized headline clusters than our previous sklearn approach. This is somewhat expected, given that these two preprocessing steps condense our feature set to common base word forms:

Figure 13. t-SNE representation of Gensim LDA predicted topics

We can further plot headline counts for our predicted Gensim LDA topics, which shows that the ‘General news’ category accounts for approximately 20% of all headlines:

Figure 14. Distribution of headline counts by Gensim LDA predicted topics

iv) LSI and HDP using Gensim

Implementing LSI and HDP models using Gensim is also rather straightforward and can be achieved with the following code. The predicted topics of both models appear to be of noticeably lower quality than those of our previous LDA models:

from gensim.models import LsiModel, HdpModel

bow_vectors = [bowdict.doc2bow(lemmatize_stem_remove_stopwords(doc)) for doc in headlines_raw]
lsi_model = LsiModel(corpus=bow_vectors, num_topics=10, id2word = bowdict)
lsi_model.show_topics(num_topics=8)

Out: 
[(0,
  '0.937*"iraq" + 0.179*"say" + 0.068*"troop" + 0.067*"govt" + 0.065*"missil" + 0.058*"plan" + 0.050*"report" + 0.047*"kill" + 0.046*"iraqi" + 0.045*"bomb"'),
 (1,
  '-0.860*"polic" + -0.186*"govt" + -0.156*"plan" + -0.139*"charg" + 0.131*"iraq" + -0.119*"probe" + -0.108*"protest" + -0.104*"death" + -0.080*"court" + -0.079*"anti"'),
 (2,
  '-0.683*"govt" + -0.457*"plan" + 0.338*"polic" + -0.161*"council" + 0.146*"iraq" + -0.123*"fund" + -0.116*"urg" + -0.114*"iraqi" + -0.108*"claim" + -0.092*"say"'),
 (3,
  '-0.736*"plan" + 0.587*"govt" + -0.154*"protest" + -0.153*"council" + -0.097*"anti" + -0.094*"water" + 0.061*"claim" + 0.058*"urg" + 0.055*"polic" + -0.041*"iraqi"'),
 (4,
  '0.680*"iraqi" + 0.367*"say" + 0.221*"baghdad" + -0.208*"plan" + -0.199*"govt" + 0.188*"kill" + -0.176*"iraq" + 0.119*"claim" + -0.119*"polic" + 0.118*"forc"'),
 (5,
  '0.558*"charg" + 0.458*"face" + 0.437*"court" + 0.292*"council" + -0.216*"iraqi" + -0.173*"polic" + 0.144*"murder" + -0.120*"say" + -0.104*"plan" + -0.082*"protest"'),
 (6,
  '-0.680*"council" + 0.434*"protest" + 0.335*"anti" + 0.200*"charg" + 0.177*"face" + 0.162*"court" + -0.105*"secur" + -0.105*"fund" + -0.103*"polic" + 0.102*"govt"'),
 (7,
  '-0.570*"protest" + -0.437*"council" + -0.437*"anti" + 0.324*"plan" + 0.227*"say" + 0.150*"charg" + -0.099*"warn" + -0.077*"secur" + -0.070*"claim" + -0.070*"fund"')]

from gensim.models import LsiModel, HdpModel

# HDP infers the number of topics from the data, so none is passed to the constructor
hdp_model = HdpModel(corpus=bow_vectors, id2word=bowdict)
hdp_model.show_topics(num_topics=8)

Out: 
[(0,
  '0.001*protest + 0.001*fatter + 0.001*illus + 0.001*threat + 0.001*say + 0.001*enter + 0.001*issu + 0.001*philli + 0.001*survey + 0.001*daintre + 0.001*ferret + 0.001*bastard + 0.001*sky + 0.001*agforc + 0.001*desex + 0.000*testimoni + 0.000*grader + 0.000*blame + 0.000*jonathan + 0.000*blade'),
 (1,
  '0.001*crossin + 0.001*bogey + 0.001*henti + 0.001*panesar + 0.001*daley + 0.001*undersea + 0.001*ricin + 0.001*creep + 0.001*kafelnikov + 0.001*gerrard + 0.001*agfest + 0.001*nosed + 0.001*chilean + 0.000*crossbench + 0.000*redknapp + 0.000*kon + 0.000*pyramid + 0.000*greer + 0.000*summernat + 0.000*scotland'),
 (2,
  '0.001*iraq + 0.001*shearer + 0.001*croatian + 0.001*iowa + 0.001*commod + 0.001*drop + 0.001*loan + 0.001*warnek + 0.001*crow + 0.001*govt + 0.000*finger + 0.000*trauma + 0.000*biosecur + 0.000*lone + 0.000*ivori + 0.000*boot + 0.000*hushovd + 0.000*tough + 0.000*culina + 0.000*ansett'),
 (3,
  '0.001*foil + 0.001*zika + 0.001*cochran + 0.001*brisban + 0.001*thirsti + 0.001*violat + 0.001*burley + 0.001*parole + 0.001*manual + 0.001*spread + 0.001*misunderstand + 0.000*evil + 0.000*coat + 0.000*encrypt + 0.000*huon + 0.000*artc + 0.000*leg + 0.000*querrey + 0.000*post + 0.000*fatal'),
 (4,
  '0.001*edith + 0.001*iraqi + 0.001*plan + 0.001*bolton + 0.001*cough + 0.001*photo + 0.001*hurdl + 0.001*iraq + 0.001*tomb + 0.001*groom + 0.001*gough + 0.001*macklin + 0.000*deputi + 0.000*macalist + 0.000*graem + 0.000*knive + 0.000*lavish + 0.000*tyson + 0.000*plume + 0.000*diseas'),
 (5,
  '0.001*fremantl + 0.001*die + 0.001*near + 0.001*troop + 0.001*claim + 0.001*custodian + 0.001*emb + 0.001*doubter + 0.001*journalist + 0.001*buddi + 0.001*multin + 0.001*slum + 0.001*toll + 0.001*affect + 0.001*hoffman + 0.000*notic + 0.000*locker + 0.000*implod + 0.000*steamer + 0.000*evapor'),
 (6,
  '0.001*bushrang + 0.001*windi + 0.001*dugong + 0.001*trujillo + 0.001*mouth + 0.001*weight + 0.001*gere + 0.001*slovak + 0.001*warn + 0.001*makelel + 0.001*council + 0.001*polic + 0.000*brom + 0.000*exist + 0.000*road + 0.000*haunt + 0.000*hound + 0.000*locat + 0.000*fuelwatch + 0.000*ebola'),
 (7,
  '0.001*ferdinand + 0.001*mclachlan + 0.001*sheet + 0.001*prepar + 0.001*fertil + 0.001*zombi + 0.001*reduc + 0.001*woo + 0.001*larrakia + 0.001*govt + 0.001*fran + 0.000*stalker + 0.000*ident + 0.000*wide + 0.000*off + 0.000*synthet + 0.000*kathryn + 0.000*squid + 0.000*bevan + 0.000*tango')]

Conclusion

In this project we implemented topic modeling using LDA, LSA, LSI and HDP models with the sklearn, NLTK and Gensim libraries. Our t-SNE representations showed that our sklearn-based LDA model, trained on headline data with only stopwords removed, appeared to outperform all other approaches in producing well-separated headline topic groups. We further observed that introducing lemmatization and stemming into our data preprocessing pipeline appears to contribute to a localized clustering or ‘bunching’ phenomenon that produces lower-quality topic separation. Our next steps in this project would be to experiment with other topic modeling approaches such as non-negative matrix factorization and explicit semantic analysis.

Thanks for reading!

Sources:

  • https://www.mygreatlearning.com/blog/understanding-latent-dirichlet-allocation/
  • https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
  • https://iq.opengenus.org/topic-modelling-techniques/
  • https://www.kaggle.com/therohk/million-headlines/code
  • https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df
  • https://github.com/susanli2016/NLP-with-Python/blob/master/LDA_news_headlines.ipynb
  • https://radimrehurek.com/gensim/corpora/dictionary.html
  • https://www.kaggle.com/faressayah/text-analysis-topic-modelling-with-spacy-gensim
  • https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a