Word frequency and sentiment analysis of twitter messages during Coronavirus pandemic (2024)

Nikhil Kumar Rajput
Department of Computer Science
Ramanujan College (University of Delhi)
India
n.rajput@ramanujan.du.ac.in
Bhavya Ahuja Grover
Department of Computer Science
Ramanujan College (University of Delhi)
India
b.ahuja@ramanujan.du.ac.in
Vipin Kumar Rathi
Department of Computer Science
Ramanujan College (University of Delhi)
India
vipkrathi2013@gmail.com
Riya Bansal
Department of Computer Science
University of Delhi
India
rbansal1@cs.du.ac.in

Abstract

The COVID-19 epidemic has had a great impact on social media conversation, especially on sites like Twitter, which has emerged as a hub for public reaction and information sharing. This paper analyzes a large dataset of Twitter messages related to the disease, starting from January 2020. Two approaches were used: a statistical analysis of word frequencies and a sentiment analysis to gauge user attitudes. Word frequencies are modeled using unigrams, bigrams, and trigrams, with a power law distribution as the fitting model. The validity of the model is confirmed through the Sum of Squared Errors (SSE), R-squared ($R^{2}$), and Root Mean Squared Error (RMSE). High $R^{2}$ and low SSE/RMSE values indicate a good fit for the model. Sentiment analysis is conducted to understand the general emotional tone of Twitter users’ messages. The results reveal that a majority of tweets exhibit neutral sentiment polarity, with only 2.57% expressing negative polarity.

Keywords Word Frequency $\cdot$ Power Law $\cdot$ Coronavirus $\cdot$ Social Media $\cdot$ Twitter $\cdot$ Sentiment Analysis

1 Introduction

The year 2020 etched itself into history as a year when the world grappled with a formidable foe - the Coronavirus pandemic. This enormous disaster changed lives all throughout the world, cutting across national boundaries. Its tentacles reached far beyond the immediate realm of physical health, casting long shadows on economic stability, societal norms, and most profoundly, the human psyche.

The profound psychological impact of the pandemic is increasingly evident, prompting the exploration of innovative methods to capture and analyze human emotions. Social media platforms, particularly Twitter and Facebook, offer unique windows into our collective psyche. Here, people share a wide range of emotions along with facts, fears, statistics, and dominant thoughts. This paper examines the textual content surrounding the pandemic on social media through a statistical lens, focusing specifically on the emotions expressed on Twitter. Analyzing a large collection of tweets offers a perspective on how the public grappled with the pandemic’s complex and far-reaching consequences.

To shed light on this terrain, two primary perspectives are utilized: word frequency analysis [1] and sentiment analysis [2]. In quantitative linguistics, word frequency analysis is a well-known method for figuring out how frequently a word appears in a particular text or collection of texts. It assists in determining the most commonly used terms as well as their relative significance in that context. Using the power law [3], a statistical tool proven to be effective in capturing such linguistic patterns, this study reveals patterns in the textual analysis by examining the probability distribution of word frequencies extracted from Twitter messages during the 2020 Coronavirus outbreak.

Sentiment analysis is performed in conjunction with this to determine the text’s underlying emotional tone. This technique aids in deciphering attitudes towards specific subjects, providing a nuanced understanding of the emotional content within the discourse. As an intriguing and highly relevant field of research, sentiment analysis enriches our ability to infer and comprehend the emotional dimensions encapsulated in the textual expressions during the Coronavirus pandemic. This research seeks to provide important insights into the language nuances and collective attitudes that are present in the Twitter conversation during the global health crisis by merging different analytical methodologies.

There has been increasing interest in using social media data to analyze public opinions and conversations during the COVID-19 outbreak. Studies have indicated that Twitter data in particular can offer important insights into how the public is reacting to the pandemic. For instance, Lwin et al. [4] analyzed global sentiments surrounding the COVID-19 pandemic on Twitter, highlighting the platform’s potential for understanding public trends and attitudes. Other works [5, 6] focused on the conversation around COVID-19 on Twitter during the first wave of the pandemic, applying sentiment analysis and topic modeling to tweets published in English. Kaur and Sharma [7] performed sentiment analysis of people’s opinions about COVID-19. They collected relevant tweets using the Twitter API, pre-processed them using the NLTK package, and classified them as positive, negative, or neutral using machine learning techniques. TextBlob served as the basis for the analysis, and the results were displayed using a variety of visualizations that highlighted neutral, positive, and negative opinions. Umer et al. [8] used an ensemble model that combines machine learning and deep learning models, pairing manually created features with automatic feature extraction. In that work, TextBlob and VADER were used, and unstructured data were collected, preprocessed, and analyzed prior to training the machine learning models. The effectiveness of Word2Vec, TF, and TF-IDF features was also examined. The results showed that the machine learning models performed better with TextBlob and TF-IDF.

Sunitha et al. [9] examined real-time coronavirus-related tweets using a sentiment analysis approach. First, tweets from approximately 3100 European and Indian users were gathered between March 23, 2020, and November 1, 2021. Pre-processing and exploratory analysis were then carried out to better understand the gathered data. Feature extraction used GloVe, pre-trained Word2Vec, Term Frequency-Inverse Document Frequency (TF-IDF), and fastText embeddings. The resulting feature vectors were supplied to an ensemble classifier, comprising a GRU and a CapsNet neural network, that categorized users’ emotions into four categories: fear, joy, sadness, and rage. The experimental results showed that the proposed model classified the sentiments of Indian and European users with prediction accuracies of 97.28% and 95.20%, respectively. Vijay et al. [10] examined Indian tweets on COVID-19 (Nov 2019-May 2020), categorized as positive, negative, or neutral. State-wise, monthly, and overall datasets revealed that initial negativity shifted to positivity by April 2020, with a focus on overcoming the virus. These studies underscore the significance of leveraging Twitter data and natural language processing techniques to gain a deeper understanding of public discourse and sentiments during the COVID-19 outbreak.

The paper is organized as follows: Section 2 covers word frequency analysis and the emergence of the power law, along with a brief summary of existing research in this area; Section 3 presents a statistical analysis of tweets; Section 4 gives an overview of sentiment analysis; and Section 5 describes the methodology adopted for performing sentiment analysis. The results of the sentiment analysis of the Twitter messages are presented in Section 6, and Section 7 concludes the paper.

2 Word frequency analysis and power law

Word frequency analysis determines how often each word occurs in a text. By counting occurrences and generating frequency distributions, it identifies the most prevalent and recurrent terms and thereby illuminates important themes and patterns in the data. The power law, a mathematical relationship in which one quantity varies as a power of another, also arises in word frequency analysis. It often governs the distribution of word frequencies, predicting that a few words will be used much more than most, creating a “few common, many rare” pattern prevalent in natural language and text corpora. In textual analysis, this means that a few words (often “stop words” such as “the”, “and”, and “is”) occur very frequently, while the majority of words occur rarely.

Many investigators have devised statistical and mathematical techniques to analyze literary texts. A significant method is inferring the frequency distribution of the words in the content [1]. Zipf’s law is commonly observed in word frequency distributions [11, 12]. The law states that for a word vector $x$, the word frequency distribution $\nu$ varies as an inverse power of $x$. Other widely used distributions are Zipf-Mandelbrot [13], lognormal [14, 15], and Gauss-Poisson [14]. Similar research has been conducted in several languages, including Hindi [16], Chinese [17], Japanese [18], and many more [1]. Single- and multi-word frequencies have been studied extensively. One example is [19], which studied the frequencies and versatilities of bigrams and trigrams and reported 577 distinct bigrams and 6,140 distinct trigrams.

The power law distribution is one of the best known. Because of its peculiar characteristics, this “non-normal” distribution has drawn extensive interest in the academic world. The right-skewed distribution is represented mathematically as:

$$f(x) = a x^{b} \qquad (1)$$

where $a$ is a constant and $b$ is the scaling (exponent) parameter.
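
Taking logarithms of Eq. (1) makes the role of the two parameters explicit. The identity below is standard algebra rather than a step stated in the analysis, and it assumes $a > 0$ and $x > 0$:

$$\log f(x) = \log a + b \log x$$

On log-log axes, a power law therefore appears as a straight line with slope $b$ and intercept $\log a$, which is why rank-frequency data are usually inspected on logarithmic scales.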

Power laws have been used in several research projects. The authors of [20] highlight the existence of a power law in social networks and take advantage of this characteristic to develop a degree-threshold-based similarity metric that aids link prediction. In an attempt to describe self-similar computer network traffic, the authors of [21] assert that the power spectrum of the departure process resembles a power law comparable to that of observed traffic when the data are fragmented into Ethernet frames; they further claim that the input process had no bearing on the power law. Power-law modeling of Internet topologies was demonstrated in [22]. While investigating the existence of power laws in information retrieval data, the authors of [23] demonstrated that query frequency and 5 out of 24 term frequency distributions could be best fitted by a power law. Power laws are thus useful in many different fields. In this paper, we use one to model the word frequencies of Twitter messages posted during the pandemic.

3 Statistical analysis of tweets

This section presents the details of our study of tweets posted on Twitter from January 2020 to the present, the period after the global media began reporting on the coronavirus outbreak in China. The word frequency data for these tweets were obtained from [24]. According to the data source, there were more than 4 million tweets every day starting on March 11, 2020, as awareness increased. The data predominantly capture tweets in English, Spanish, and French.

Four analyses were carried out. The first concerns the evolution of Twitter IDs, which reflects the number of tweets: it tracks how many distinct Twitter IDs tweeted about the coronavirus during a given time frame. Figure 1 illustrates the associated data and displays the trend of user activity over time. It shows that the number of Twitter IDs peaked in March and April.
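
The counting behind this first analysis can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figure 1: the records, the `twitter_id` values, and the monthly grouping are all hypothetical assumptions.

```python
from collections import defaultdict

# Hypothetical (twitter_id, ISO date) pairs standing in for real tweet metadata.
tweets = [
    ("id1", "2020-02-27"),
    ("id1", "2020-03-02"),
    ("id2", "2020-03-15"),
    ("id3", "2020-03-20"),
    ("id2", "2020-04-01"),
    ("id2", "2020-04-09"),
]

def distinct_ids_per_month(records):
    """Map each 'YYYY-MM' month to the number of distinct IDs seen in it."""
    ids_by_month = defaultdict(set)
    for twitter_id, date in records:
        ids_by_month[date[:7]].add(twitter_id)
    return {month: len(ids) for month, ids in sorted(ids_by_month.items())}

monthly_counts = distinct_ids_per_month(tweets)
print(monthly_counts)
```

Using a set per month ensures that an ID that tweets several times in the same month is counted once.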

[Figure 1: Number of distinct Twitter IDs tweeting about the coronavirus over time]

[Figure 2: Word cloud of the most frequently appearing terms]

Fig. 2 displays a word cloud visualization that highlights the most commonly appearing term, “covid”. It suggests that the public was highly engaged with the pandemic and that the term was frequently used by Twitter users in their tweets. It can be inferred that people were tweeting about cases, news, or personal experiences. The word “covid” is also a shorter and more direct way to say “COVID-19”. Its dominance indicates that users valued conciseness and speed while tweeting.

The other three analyses concern the unigram, bigram, and trigram frequencies. We examined the frequency of single words (unigrams), two-word phrases (bigrams), and three-word sequences (trigrams) in the tweets, concentrating on the top 1,000 instances of each kind. The relationship between rank (or index) and frequency for unigrams, bigrams, and trigrams is shown in the frequency distribution plots in Figs. 3, 4, and 5, respectively. These plots exhibit a power-law distribution, suggesting that the power-law model fits the data well.
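
As an illustration of what unigram, bigram, and trigram counting involves, the sketch below tallies contiguous n-token sequences with Python's standard library. It is a simplified example under stated assumptions: the whitespace tokenizer, the helper names `ngram_counts` and `top_ngrams`, and the toy corpus are hypothetical, and the study itself relies on the frequency data from [24].

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-token sequences in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def top_ngrams(texts, n, k=1000):
    """Return the k most frequent n-grams across whitespace-tokenized texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngram_counts(text.lower().split(), n))
    return counts.most_common(k)

# Hypothetical toy corpus, used only to show the shape of the output.
sample = ["stay home stay safe", "stay home save lives"]
top_unigrams = top_ngrams(sample, 1, k=3)
top_bigrams = top_ngrams(sample, 2, k=3)
top_trigrams = top_ngrams(sample, 3, k=3)
print(top_unigrams[0], top_bigrams[0])
```

Sorting the resulting counts in descending order gives the rank-frequency pairs that a power-law model is fitted to.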

The calculated exponents for unigrams, bigrams, and trigrams are -0.9305, -3.7187, and -0.6514, respectively. The corresponding parameters are reported in Table 1. Notably, heavy tails are observed, particularly in the cases of unigrams and trigrams. The effective fit by the power-law distribution suggests a distinct behavior in tweet messages compared to literary documents like novels and poems, which adhere to Zipf’s law with exponents close to 1.

[Figure 3: Rank-frequency distribution of unigrams]

[Figure 4: Rank-frequency distribution of bigrams]

[Figure 5: Rank-frequency distribution of trigrams]

Three goodness-of-fit metrics were used to assess how well the power-law distribution fits the data: SSE, $R^{2}$, and RMSE. The values obtained for the unigram, bigram, and trigram datasets are shown in Table 1. We observe a high $R^{2}$ value in all three cases: unigram (0.9158), bigram (0.9863), and trigram (0.9764). Furthermore, the SSE and RMSE values are small in each case. These results affirm the suitability of the power-law model for representing the frequency distribution of tweet message data.
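
The following sketch shows one way to fit $f(x) = ax^{b}$ and compute the three metrics. It is an illustration under assumptions, not the fitting procedure used for Table 1 (which is not specified here): the fit is an ordinary least-squares regression on log-transformed data, RMSE is computed with residual degrees of freedom $n - 2$, and the function names and synthetic data are hypothetical.

```python
import math

def fit_power_law(ranks, freqs):
    """Least-squares fit of log f = log a + b log x; returns (a, b)."""
    lx = [math.log(x) for x in ranks]
    ly = [math.log(y) for y in freqs]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                 # slope on log-log axes = exponent
    a = math.exp(my - b * mx)     # intercept on log-log axes = log a
    return a, b

def goodness_of_fit(ranks, freqs, a, b, n_params=2):
    """Return (SSE, R^2, RMSE) for the model f(x) = a * x**b."""
    preds = [a * x ** b for x in ranks]
    sse = sum((y - p) ** 2 for y, p in zip(freqs, preds))
    mean_y = sum(freqs) / len(freqs)
    sst = sum((y - mean_y) ** 2 for y in freqs)
    r2 = 1.0 - sse / sst
    rmse = math.sqrt(sse / (len(freqs) - n_params))  # assumed convention
    return sse, r2, rmse

# Synthetic data following f(x) = 2 * x**-1 exactly (not the paper's data).
ranks = list(range(1, 11))
freqs = [2.0 / r for r in ranks]
a, b = fit_power_law(ranks, freqs)
sse, r2, rmse = goodness_of_fit(ranks, freqs, a, b)
print(a, b, sse, r2, rmse)
```

A nonlinear least-squares fit on the untransformed frequencies is an equally plausible alternative and would generally give somewhat different parameter values.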

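Table 1: Power-law parameters and goodness-of-fit metrics for the unigram, bigram, and trigram frequency distributions.

Quantity	Unigram	Bigram	Trigram
Parameters
a	1.1450	0.9994	1.0566
b	-0.9305	-3.7187	-0.6514
Goodness of fit
SSE	0.2063	0.0137	0.0750
$R^{2}$	0.9158	0.9863	0.9764
RMSE	0.0143	0.0037	0.0086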
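
As a quick arithmetic check on the tabulated values, the SSE and RMSE columns can be compared under the assumption that RMSE $= \sqrt{\mathrm{SSE}/(n-2)}$, with $n = 1000$ data points (the top 1,000 n-grams) and two fitted parameters. This convention is an assumption; the text does not state how RMSE was computed.

```python
import math

# Values copied from Table 1; n = 1000 follows from the "top 1,000" n-grams.
# Assumption: RMSE = sqrt(SSE / (n - 2)), with 2 fitted parameters (a and b).
n, n_params = 1000, 2
table = {
    "unigram": {"sse": 0.2063, "rmse": 0.0143},
    "bigram": {"sse": 0.0137, "rmse": 0.0037},
    "trigram": {"sse": 0.0750, "rmse": 0.0086},
}

implied_rmse = {
    name: math.sqrt(row["sse"] / (n - n_params)) for name, row in table.items()
}
for name, value in implied_rmse.items():
    reported = table[name]["rmse"]
    print(f"{name}: implied RMSE = {value:.5f}, reported = {reported:.4f}")
```

Under this assumption, the implied and reported RMSE values agree to within about $10^{-4}$ in all three cases, which suggests the two columns are mutually consistent.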

4 Sentiment Analysis of Twitter Messages

Sentiment analysis is a fast-growing field due to its capacity to interpret the emotional tone of a text. It is commonly defined as the computational study of people’s attitudes, feelings, and views toward a topic [25]. Sentiment analysis is mostly used to evaluate viewpoints, uncover latent emotions, and finally classify their polarity as negative, positive, or neutral. Applications of sentiment analysis include customer reviews [26], news and blogs [27], and the stock market [28]. A number of techniques, such as Artificial Neural Networks [29], Naive Bayes [30], and Support Vector Machines [31], have been used for sentiment analysis. Several publications have also proposed algorithms for sentiment analysis on tweets [32], [33], [34], [35].

5 Methodology

The methodology employed for sentiment analysis on a dataset of tweets about the COVID-19 epidemic is described in this section. Data collection, data preprocessing, sentiment analysis, and data visualization are the four primary steps of the process. The process flow is illustrated in Fig. 6.

[Figure 6: Process flow of the sentiment analysis methodology]

5.1 Data Collection

Data has been collected from the COVID19 Tweets Kaggle dataset, which can be accessed at https://www.kaggle.com/datasets/gpreda/covid19-tweets. The dataset was gathered through the use of a Python script and the Twitter API, and it consists of tweets that come from various geographical places. To accumulate a substantial number of tweet samples, a daily query for the prominent hashtag #covid19 has been executed for a specific duration, starting from July 25, 2020. The initial batch consisted of 17,000 tweets, and the collection process continued on a daily basis. The dataset is characterized by its focus on tweets related to the Covid-19 pandemic, particularly those featuring the #covid19 hashtag.

5.2 Data Preprocessing

Since raw text data can contain insignificant information that hinders accurate sentiment analysis, the collected tweets underwent a preprocessing stage to refine them for further analysis. Preprocessing was carried out with Python’s NLTK (Natural Language Toolkit) module using Natural Language Processing (NLP) techniques [36]. This stage applied four techniques to clean and standardize the text data: noise elimination, text standardization, stop word removal, and lemmatization.

5.2.1 Noise Elimination

The initial preprocessing step concentrated on eliminating irrelevant information that minimally impacts sentiment analysis. This “noise removal” process targeted elements like URLs, mentions (@usernames), hashtags, special characters, and numbers. These elements were excluded because they do not significantly contribute to understanding the emotional tone of the tweets. To achieve this efficiently, Python’s “re” module was used. This module supports regular expressions, which are effective tools for pattern-based searching and removal, and so makes it possible to strip these unnecessary components from the gathered tweets. Further preprocessing steps build upon this cleaning process, ensuring that the most relevant elements of the text input are retained for precise sentiment analysis.
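
A minimal sketch of such regex-based noise removal is shown below. The specific patterns, the function name `remove_noise`, and the example tweet are illustrative assumptions; the exact expressions used in the study are not given. Punctuation is deliberately left in place here because it is handled in the standardization step.

```python
import re

def remove_noise(tweet):
    """Strip URLs, @mentions, hashtags, digits, and special characters."""
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)                # mentions and hashtags
    # Digits, symbols, and emoji; note this also drops non-ASCII letters.
    tweet = re.sub(r"[^A-Za-z\s.,!?']", " ", tweet)
    return re.sub(r"\s+", " ", tweet).strip()             # collapse whitespace

cleaned = remove_noise(
    "@WHO 5 new cases reported today!! #covid19 https://t.co/abc123 :("
)
print(cleaned)
```

Removing URLs before the other patterns matters: otherwise the special-character filter would break a URL into fragments that are harder to match.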

5.2.2 Text standardization

Following the initial noise removal, the text underwent a standardization stage to ensure consistency and eliminate potential sources of confusion for the sentiment analysis process. This standardization involved two key steps:

1. Punctuation Removal: Punctuation marks, such as commas, periods, and exclamation points, were eliminated using Python’s “translate” string method. This simplifies the text by removing elements that do not directly contribute to understanding the underlying sentiment.
2. Whitespace Trimming: Leading and trailing whitespace was removed using the “strip” method. This ensures the text starts and ends cleanly, without unnecessary characters that might affect the model’s interpretation.

By eliminating these inconsistencies, the standardization process contributes to a more uniform dataset, allowing the sentiment analysis model to focus on the essential content of the tweets.
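
The two standardization steps can be sketched with built-in string methods. This is an assumed implementation: the helper name `standardize` is hypothetical, and `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would pass through unchanged.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def standardize(text):
    """Remove ASCII punctuation, then trim leading and trailing whitespace."""
    return text.translate(_PUNCT_TABLE).strip()

standardized = standardize("  Stay safe, everyone!  ")
print(repr(standardized))
```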

5.2.3 Removal of stop words

After the text was standardized, the next step aimed to further refine the data by focusing on the words themselves. This involved stop word removal, a process that eliminates common words with minimal impact on sentiment analysis. Examples of stop words include “the”, “a”, “is”, and “and”. While these words serve grammatical functions and contribute to the overall meaning of a sentence in everyday language, they often hold little value in understanding the emotional tone of a tweet.

The NLTK library [37] in Python was utilized for this task. This library provides a readily available list of stop words in various languages, allowing for efficient removal of these common words from the collected tweets. By eliminating stop words, the focus shifts towards the words that carry more emotional weight, ultimately leading to a more accurate understanding of the sentiment expressed in the tweets.
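
The filtering itself amounts to a set-membership test per token, as the sketch below illustrates. To keep the example self-contained, it uses a small hypothetical stop-word set rather than NLTK's list; with NLTK installed and its stop-word corpus downloaded, the set could instead be built from `nltk.corpus.stopwords.words("english")`.

```python
# Hypothetical miniature stop-word set for illustration only; in practice the
# list could come from nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]

kept = remove_stop_words("The vaccine is a sign of hope and relief")
print(kept)
```

Storing the stop words in a set keeps each lookup constant-time, which matters when the same filter is applied to a large number of tweets.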

5.2.4 Lemmatization

Lemmatization is an NLP technique that reduces words to their base form, known as the lemma [38]. Unlike stemming, which strips affixes to obtain a word’s stem, lemmatization takes context and grammatical role into account to ensure the resulting base form is a meaningful word. For instance, “running” becomes “run”, and “changing”, “changes”, and “changed” all become “change”. This ensures that different grammatical variations of the same word are treated consistently, allowing the sentiment analysis model to capture the overall sentiment regardless of the specific word form used. The WordNetLemmatizer class provided by NLTK was employed to lemmatize the text data.

5.3 Sentiment Analysis using TextBlob

Following the data preprocessing stage, the sentiment analysis delves into uncovering the emotional undertones of the collected tweets. This step aims to quantify the overall sentiment, assigning a numerical value called polarity to each tweet. We utilize the TextBlob library [39, 40] in Python for this task. TextBlob offers a polarity score ranging from -1 (highly negative) to +1 (highly positive). Based on the calculated polarity scores, the tweets are categorized into specific sentiment classes:

Positive: Tweets with scores in the range (0, 1] are classified as positive.

Negative: Tweets with scores in the range [-1, 0) are classified as negative.

Neutral: Tweets with a score of exactly 0 are classified as neutral.
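
The three ranges above translate directly into a small classification function. The sketch below takes a precomputed polarity score so that it runs without external packages; the function name `classify_polarity` is hypothetical, and with TextBlob installed a score could be obtained as `TextBlob(tweet_text).sentiment.polarity`.

```python
def classify_polarity(score):
    """Map a polarity score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"polarity out of range: {score}")
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Fixed example scores are used here instead of calling a sentiment library.
labels = [classify_polarity(s) for s in (0.5, -0.25, 0.0)]
print(labels)
```

Note that the neutral class is defined by exact equality with 0, so every tweet for which the analyzer returns a score of 0 lands there.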

5.4 Data Visualization

Histograms are a powerful tool for visualizing the distribution of data, making them well-suited for sentiment analysis of the collected COVID-19 tweets. Here we created a histogram with the x-axis representing the polarity scores (ranging from -1 to 1) and the y-axis representing the number of tweets/frequency within each score range. This visualization reveals the overall distribution of sentiment across the dataset, indicating whether the majority of tweets lean towards positive, negative, or neutral sentiment.
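
The binning that underlies such a histogram can be sketched in plain Python. The bin count, the function name `polarity_histogram`, and the example scores are assumptions chosen for illustration; a plotting library would normally perform this step internally.

```python
def polarity_histogram(scores, n_bins=10):
    """Count scores in n_bins equal-width bins spanning [-1, 1]."""
    width = 2.0 / n_bins
    counts = [0] * n_bins
    for s in scores:
        # Clamp the index so that a score of exactly 1.0 falls in the last bin.
        idx = min(int((s + 1.0) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical polarity scores, not taken from the dataset.
hist = polarity_histogram([0.0, 0.0, 0.0, 0.5, -0.3, 1.0], n_bins=4)
print(hist)
```

With four bins, the bins cover [-1, -0.5), [-0.5, 0), [0, 0.5), and [0.5, 1], so scores of exactly 0 are counted in the third bin.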

6 Results of Sentiment Analysis of Twitter Messages

To illustrate the overall sentiment (i.e., whether positivity or negativity dominates), the polarity values are shown as histograms in Figs. 7 and 8. Table 2 shows the percentage of positive, negative, and neutral tweets based on the polarities in the dataset.

Table 2: Percentage of tweets by sentiment polarity.

Sentiment Polarity	Positive	Neutral	Negative
Tweets	6.45%	90.97%	2.57%

Fig. 7 shows the histogram of sentiment polarities for the COVID19 Tweets dataset. The majority of the tweets have a neutral sentiment, followed by positive. The same can be inferred from Table 2, which shows that around 90.97% of tweets are neutral, 6.45% are positive, and a mere 2.57% are negative.

[Figure 7: Histogram of sentiment polarities of the COVID19 tweets]

Fig. 8 shows the histogram produced after removing the neutral tweets. It confirms that positive tweets outnumber negative ones, indicating that among tweets carrying a sentiment, users leaned toward positivity rather than negativity.

[Figure 8: Histogram of sentiment polarities with neutral tweets removed]

7 Conclusion

This work explores the statistical analysis of tweets posted during the COVID-19 outbreak. As the pandemic swept across the globe, it left a trail of fear, uncertainty, and potential risks in its wake. This study analyzes how these anxieties and the overall situation were reflected in the content of tweets during that time. Two methods were employed to examine the COVID19 tweets: word frequency analysis and sentiment analysis. According to the statistics, messages were most prevalent in March and April 2020, which shows how tweet activity changed during that time. A word frequency analysis revealing the number of occurrences of each word in the tweets was conducted. The three most frequent terms in user-generated tweets are covid, 19, and covid19. For the top 1,000 instances, unigram, bigram, and trigram frequencies were plotted, and all of the plots fit the right-skewed power law distribution. Heavy tails were observed in the unigram and trigram plots, consistent with the fitted exponents reported in Section 3. The exponents obtained were -0.9305, -3.7187, and -0.6514 for unigrams, bigrams, and trigrams, respectively. The model was validated through SSE, $R^{2}$, and RMSE. High values of $R^{2}$ and low values of SSE and RMSE were obtained, indicating a good fit for the model. The tweet dataset was also subjected to sentiment analysis, and the sentiment polarities (positive, neutral, or negative) were plotted as histograms. The majority of tweets (90.97%) had neutral polarity, followed by positive polarity (6.45%). This suggests that the sentiment expressed toward the COVID19 situation was predominantly neutral.

References

[1] R. Harald Baayen. Word frequency distributions, volume 18. Springer Science & Business Media, 2001.
[2] Maite Taboada. Sentiment analysis: An overview from linguistics. Annual Review of Linguistics, 2:325–347, 2016.
[3] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM review, 51(4):661–703, 2009.
[4] May Oo Lwin, Jiahui Lu, Anita Sheldenkar, Peter Johannes Schulz, Wonsun Shin, Raj Gupta, and Yinping Yang. Global sentiments surrounding the covid-19 pandemic on twitter: analysis of twitter trends. JMIR public health and surveillance, 6(2):e19447, 2020.
[5] Digvijay Pandey, Subodh Wairya, Bandinee Pradhan, et al. Understanding covid-19 response by twitter users: A text analysis approach. Heliyon, 8(8), 2022.
[6] Javier J. Amores, David Blanco-Herrero, and Carlos Arcila-Calderón. The conversation around covid-19 on twitter—sentiment analysis and topic modelling to analyse tweets published in english during the first wave of the pandemic. Journalism and Media, 4(2):467–484, 2023.
[7] Chhinder Kaur and Anand Sharma. Twitter sentiment analysis on coronavirus using textblob. EasyChair 2516-2314, 2020.
[8] Muhammad Umer, Saima Sadiq, Michele Nappi, Muhammad Usman Sana, Imran Ashraf, et al. Etcnn: extra tree and convolutional neural network-based ensemble model for covid-19 tweets sentiment classification. Pattern Recognition Letters, 164:224–231, 2022.
[9] D. Sunitha, Raj Kumar Patra, N. V. Babu, A. Suresh, and Suresh Chand Gupta. Twitter sentiment analysis using ensemble based deep learning model towards covid-19 in india and european countries. Pattern Recognition Letters, 158:164–170, 2022.
[10] Tanmay Vijay, Ayan Chawla, Balan Dhanka, and Purnendu Karmakar. Sentiment analysis on covid-19 twitter data. In 2020 5th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE), pages 1–7. IEEE, 2020.
[11] George Kingsley Zipf. The psycho-biology of language. 1935.
[12] Wentian Li. Random texts exhibit zipf’s-law-like word frequency distribution. IEEE Transactions on information theory, 38(6):1842–1845, 1992.
[13] Benoît Mandelbrot. Information theory and psycholinguistics. BB Wolman and E, 1965.
[14] Harald Baayen. Statistical models for word frequency distributions: A linguistic evaluation. Computers and the Humanities, 26(5-6):347–363, 1992.
[15] John B. Carroll. Word-frequency studies and the lognormal distribution. In Proceedings of the Conference on Language and Language Behavior. Ed. EM Zale. New York: Appleton-Century-Crofts, pages 213–235, 1968.
[16] B. D. Jayaram and M. N. Vidya. Zipf’s law for indian languages. Journal of quantitative linguistics, 15(4):293–317, 2008.
[17] S. Shtrikman. Some comments on zipf’s law for the chinese language. Journal of Information Science, 20(2):142–143, 1994.
[18] Sasuke Miyazima, Youngki Lee, Tomomasa Nagamine, and Hiroaki Miyajima. Power-law distribution of family names in japanese societies. Physica A: Statistical Mechanics and its Applications, 278(1-2):282–288, 2000.
[19] Robert L. Solso, Paul F. Barbuto, and Connie L. Juel. Bigram and trigram frequencies and versatilities in the english language. Behavior Research Methods & Instrumentation, 11(5):475–484, 1979.
[20] Virinchi Srinivas and Pabitra Mitra. Link prediction in social networks: role of power law distribution. Springer, 2016.
[21] A. J. Field, Uli Harder, and P. G. Harrison. Measurement and modelling of self-similar traffic in computer networks. IEE Proceedings-Communications, 151(4):355–363, 2004.
[22] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM computer communication review, 29(4):251–262, 1999.
[23] Casper Petersen, Jakob Grue Simonsen, and Christina Lioma. Power law distributions in information retrieval. ACM Transactions on Information Systems (TOIS), 34(2):1–37, 2016.
[24] Juan M. Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu, Yuning Ding, Katya Artemova, Elena Tutubalina, and Gerardo Chowell. A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration, Feb 2023. This dataset will be updated bi-weekly at least with additional tweets, look at the github repo for these updates.
[25] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms and applications: A survey. Ain Shams engineering journal, 5(4):1093–1113, 2014.
[26] Daekook Kang and Yongtae Park. Measuring customer satisfaction of service based on an analysis of the user generated contents: sentiment analysis and aggregating function based mcdm approach. In 2012 IEEE International Conference on Management of Innovation & Technology (ICMIT), pages 244–249. IEEE, 2012.
[27] Namrata Godbole, Manja Srinivasaiah, and Steven Skiena. Large-scale sentiment analysis for news and blogs. Icwsm, 7(21):219–222, 2007.
[28] Thien Hai Nguyen and Kiyoaki Shirai. Topic modeling based sentiment analysis on social media for stock market prediction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1354–1364, 2015.
[29] Rodrigo Moraes, João Francisco Valiati, and Wilson P. Gavião Neto. Document-level sentiment classification: An empirical comparison between svm and ann. Expert Systems with Applications, 40(2):621–633, 2013.
[30] Huaxia Rui, Yizao Liu, and Andrew Whinston. Whose and what chatter matters? the effect of tweets on movie sales. Decision support systems, 55(4):863–870, 2013.
[31] Chien Chin Chen and You-De Tseng. Quality evaluation of product reviews using an information quality framework. Decision Support Systems, 50(4):755–768, 2011.
[32] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca J. Passonneau. Sentiment analysis of twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38, 2011.
[33] Manoochehr Ghiassi, James Skinner, and David Zimbra. Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Systems with applications, 40(16):6266–6282, 2013.
[34] Aliaksei Severyn and Alessandro Moschitti. Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 959–962, 2015.
[35] Hassan Saif, Yulan He, and Harith Alani. Semantic sentiment analysis of twitter. In International semantic web conference, pages 508–524. Springer, 2012.
[36] Adil Rajput. Natural language processing, sentiment analysis, and clinical analytics. In Innovation in health informatics, pages 79–97. Elsevier, 2020.
[37] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.
[38] Divya Khyani, B. S. Siddhartha, N. M. Niveditha, and B. M. Divya. An interpretation of lemmatization and stemming in natural language processing. Journal of University of Shanghai for Science and Technology, 22(10):350–357, 2021.
[39] Steven Loria. textblob documentation. Release 0.15, 2, 2018.
[40] Steven Loria. Tutorial: Quickstart – textblob 0.18.0.post0 documentation. https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis, 2024. Accessed: 2024-03-02.