
ChatGPT is a GPT (Generative Pre-trained Transformer) machine learning (ML) tool that has surprised the world. Its breathtaking capabilities impress casual users, professionals, researchers, and even its own creators. Moreover, its capacity, as an ML model trained for general tasks, to perform very well in domain-specific situations is impressive. I am a researcher, and its ability to do sentiment analysis (SA) interests me.
SA is a very widespread Natural Language Processing (NLP) task. It has many applications and thus can be used in several domains (e.g., finance, entertainment, psychology). However, some fields adopt specific terms and jargon (e.g., finance). Hence, whether general-domain ML models can be as capable as domain-specific models is still an open research question in NLP.
If you ask ChatGPT this research question, which is this article’s title, it will give you a humble answer (go on, try it). But, oh, my dear reader, I usually wouldn’t spoil this for you; you have no idea how surprisingly modest ChatGPT’s answer was…
Still, as an AI researcher, industry professional, and hobbyist, I am used to fine-tuning general-domain NLP machine learning tools (e.g., GloVe) for use in domain-specific tasks. That is because, in most domains, it was uncommon to find an out-of-the-box solution that could do well enough without some fine-tuning. I will show you why this may no longer be the case.
In this text, I compare ChatGPT to a domain-specific ML model by discussing the following topics:
- SemEval 2017 Task 5 — A domain-specific challenge
- Using ChatGPT API to label a dataset with code examples
- Verdict and results of the comparison with reproducibility details
- Conclusion and Results Discussion
- BONUS: How this comparison can be done in an applied scenario
Note 1: This is just a simple hands-on experiment that sheds some light on the subject, NOT an exhaustive scientific investigation.
Note 2: All images unless otherwise noted are by the author.
1. SemEval 2017 Task 5 — A domain-specific challenge
SemEval (Semantic Evaluation) is a renowned NLP workshop where research teams compete scientifically in sentiment analysis, text similarity, and question-answering tasks. The organizers provide textual data and gold-standard datasets created by annotators (domain specialists) and linguists to evaluate state-of-the-art solutions for each task.
In particular, SemEval’s Task 5 of the 2017 edition asked researchers to score financial microblogs and news headlines for sentiment analysis on a -1 (most negative) to 1 (most positive) scale. We’ll use the gold-standard dataset from that year’s SemEval to test ChatGPT’s performance in a domain-specific task. Subtask 2 dataset (news headlines) had two sets of sentences (maximum of 30 words each): the training (1,142 sentences) and the testing (491 sentences) sets.
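To follow along, the gold-standard file can be loaded into a pandas dataframe. A minimal sketch, assuming the headlines come as JSON records with `company`, `title`, and `sentiment` fields (the field names mirror the code used later in this article, but check them against the file you download; the two sample records below are made up for illustration):

```python
import io
import json

import pandas as pd

# Toy sample mirroring the assumed layout of the gold-standard JSON:
# one record per headline, with the target company and a score in [-1, 1].
sample = """[
  {"id": 1, "company": "Glencore", "title": "Glencore shares surge on upbeat outlook", "sentiment": 0.46},
  {"id": 2, "company": "Barclays", "title": "Barclays hit by fresh legal provisions", "sentiment": -0.37}
]"""

def load_gold_standard(source):
    """Read SemEval-style JSON records into a dataframe."""
    return pd.DataFrame(json.load(source))

df = load_gold_standard(io.StringIO(sample))
print(df[["company", "sentiment"]])
```

With the real file, you would pass an open file handle instead of the `StringIO` sample.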
Considering these sets, the distribution of sentiment scores across text sentences is displayed below. The plot shows bimodal distributions in both the training and testing sets. Moreover, the graph indicates there are more positive than negative sentences in the dataset. This will be handy information in the evaluation section.
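Those distribution figures are easy to recompute yourself. A small sketch, assuming a dataframe with a numeric `sentiment` column in [-1, 1] (the scores below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Made-up scores standing in for the real 'sentiment' column of the dataset.
train = pd.DataFrame({"sentiment": [0.46, -0.37, 0.12, 0.55, -0.08, 0.33]})

# Count positive vs. negative headlines (a score of zero would be neutral).
positive = int((train["sentiment"] > 0).sum())
negative = int((train["sentiment"] < 0).sum())

print(f"positive: {positive}, negative: {negative}")
```

On the real training set, this count exhibits the positive skew the plot shows.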
2. Using ChatGPT API to label a dataset
Using the ChatGPT API has already been discussed here on Medium for synthesizing data. You can also find sentiment labeling examples in the ChatGPT API code samples section (notice that using the API is not free). For this code example, consider SemEval’s 2017 Task 5 gold-standard dataset, which you can get here.
Then, to use the API for labeling several sentences at once, use code like the following, where I prepare a full prompt from a dataframe holding the gold-standard dataset, with the sentence to be labeled and the target company to which the sentiment refers.
```python
def prepare_long_prompt(df):
    initial_txt = "Classify the sentiment in these sentences between brackets regarding only the company specified in double-quotes. The response should be in one line with format company name in normal case followed by upper cased sentiment category in sequence separated by a semicolon:\n\n"
    prompt = "\"" + df['company'] + "\"" + " [" + df['title'] + "]"
    return initial_txt + '\n'.join(prompt.tolist())
```

Then, call the API with the text-davinci-003 engine (the GPT-3 version). Here I made some adjustments to the code to account for the maximum total length of the prompt plus the answer, which must be at most 4,097 tokens.
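Because the prompt plus the completion must fit within that limit, a long dataframe has to be sent in batches. Here is a minimal sketch of one way to do that, assuming the same `company`/`title` columns; the character budget and the per-line overhead are rough assumptions of mine, not exact token accounting:

```python
import pandas as pd

MAX_PROMPT_CHARS = 3000  # rough budget under the model limit, leaving room for the answer

def chunk_dataframe(df, max_chars=MAX_PROMPT_CHARS):
    """Split rows into consecutive chunks whose combined prompt text
    stays under max_chars, so each API call fits within the limit."""
    chunks, start, size = [], 0, 0
    for i, row in enumerate(df.itertuples(index=False)):
        line_len = len(row.company) + len(row.title) + 6  # quotes, brackets, newline
        if size + line_len > max_chars and i > start:
            chunks.append(df.iloc[start:i])
            start, size = i, 0
        size += line_len
    chunks.append(df.iloc[start:])
    return chunks

# Each chunk would then go through prepare_long_prompt and be sent to the
# completion endpoint; this needs an API key, so it is shown as a comment:
# response = openai.Completion.create(engine="text-davinci-003",
#                                     prompt=prepare_long_prompt(chunk),
#                                     max_tokens=1000)
```

The commented call uses the legacy `openai.Completion.create` interface from the openai Python package of that era; newer versions of the SDK expose a different client API.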
3. Verdict and results of the comparison
One of the most essential factors in a textual model is the size of the word embeddings. This technology has evolved since the 2017 edition of SemEval. Thus, updating this part alone could significantly improve the domain-specific model’s results.
4. Conclusion and Results Discussion
technology. This architecture was designed to work with numerical sentiment scores like those in the gold-standard dataset. Still, there are techniques (e.g., the Bullishnex index) for converting categorical sentiment, as generated by ChatGPT, into appropriate numerical values. Applying such a conversion makes it possible to use ChatGPT-labeled sentiment in such an architecture. This is an example of what you can do in such a situation, and it is what I intend to do in a future analysis.
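As an illustration of that conversion, here is a minimal sketch that parses the semicolon-separated answers the earlier prompt asks for and maps each categorical label to a number. The mapping is a simple stand-in of my own, not the actual Bullishnex formula, and the label set is an assumption:

```python
# Simple stand-in mapping (an assumption, not the actual Bullishnex index).
LABEL_TO_SCORE = {"POSITIVE": 1.0, "NEUTRAL": 0.0, "NEGATIVE": -1.0}

def parse_api_response(response_text):
    """Parse lines like 'Glencore;POSITIVE' into (company, score) pairs."""
    pairs = []
    for line in response_text.strip().splitlines():
        company, label = line.split(";")
        pairs.append((company.strip(), LABEL_TO_SCORE[label.strip().upper()]))
    return pairs

print(parse_api_response("Glencore;POSITIVE\nBarclays;NEGATIVE"))
```

The resulting numeric scores can then be fed into any architecture that expects sentiment on a -1 to 1 scale.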