Tiny Transformers for Aspect-Based Financial Sentiment Classification

How we trained tiny transformers that smoke FinBERT and match GPT-5.1 accuracy at orders of magnitude lower cost. Using LLM ensemble labeling and active learning for dataset creation, we fine-tuned Qwen3 0.6B classifiers for aspect-based financial sentiment analysis. Complete with open-source models, datasets, and training scripts.

Simon van Dyk
Tiny Transformers for Aspect-Based Financial Sentiment Classification

Why do this at all? We’ve been asked time and time again for financial sentiment classifications on our data. It’s overwhelmingly what quants care about because, fundamentally, they want to know how textual data can inform trading and investment decisions.

  • FinBERT1 performs well on the popular Financial PhraseBank2 benchmark, but poorly on real-world data.
  • LLMs are surprisingly good at labeling textual data, but intractably expensive to use at scale.

Therefore, over the past few weeks, we have trained and productionized three text classifiers to predict the financial sentiment of a text snippet, determine whether the text contains a forward-looking statement or not, and determine whether the text contains a prediction or not.

Combining these classifiers with other dimensions in NOSIBLE’s data enables us to do powerful aspect-based analysis to answer questions like:

  • What does this textual data mean for a company?
  • Is it positive or negative sentiment about the company’s future prospects?
  • Is there a trend in either direction? (e.g. a 7-day moving average of the financial sentiment of forward-looking statements about Tesla over the last three months, sketched below)
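
To make that last question concrete, here is a minimal sketch of a 7-day moving average over a labeled feed, assuming a hypothetical export with date and financial_sentiment columns (the column names and data are illustrative, not the actual DataFeeds schema):

import polars as pl

# Hypothetical feed export: one row per snippet with a publication date and a sentiment label.
feed = pl.DataFrame({
    "date": ["2025-09-01", "2025-09-01", "2025-09-02", "2025-09-03"],
    "financial_sentiment": ["positive", "negative", "positive", "neutral"],
}).with_columns(pl.col("date").str.to_date())

# Map labels to a numeric score, average per day, then take a 7-day rolling mean.
score = (
    pl.when(pl.col("financial_sentiment") == "positive").then(1)
    .when(pl.col("financial_sentiment") == "negative").then(-1)
    .otherwise(0)
    .alias("score")
)
daily = (
    feed.with_columns(score)
    .group_by("date")
    .agg(pl.col("score").mean())
    .sort("date")
    .with_columns(pl.col("score").rolling_mean(window_size=7).alias("sentiment_7d_ma"))
)
print(daily)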

In the festive spirit of Thanksgiving 🍾, we have open-sourced these models on HuggingFace, along with the datasets they were trained on. We also share source code here for use in your own projects.

Wait, but you’re a search engine, right? NOSIBLE is already a world-class search engine, but more importantly it’s incredibly fast, which unlocks the ability to build near-real-time data feeds. Our DataFeeds product lets you turn any search about any topic into a point-in-time, backtest-friendly, deduplicated time series of data.

We’ve leaned even harder in this direction, and our DataFeeds product is now better described as web-scale surveillance. To prove this isn’t just a marketing term, we’re extracting and including these signals in our datafeeds for our customers, and we’re just getting started.

Let’s get to it.

Real world data

A quality dataset is key to building a good ML model. The combination of compute, a return to neural networks, and a much larger labeled dataset (ImageNet) contributed significantly to what made AlexNet so successful back in 2012.

Using the vast amount of real-world data indexed by NOSIBLE, we sampled a data feed centered around financial news. The snippets are varied and contain a ton of signal; here is what they look like:

Some peers in the therapeutics segment have already reported their Q3 results. Gilead Sciences posted year-on-year revenue growth of 3%, beating expectations by 3.7%, and Biogen reported revenue up 2.8%, topping estimates by 8.2%. Following their reports, Gilead Sciences’s stock traded up 1.2% and Biogen’s stock was up 4.1%.

There are fun ones too:

If you thought the wait for Kingdom Hearts III was ridiculous, you ain’t see nothing yet. Now that we have an official release date for the game, Square Enix has already announced two collectors editions of the game. The first is your average run of the mill collectors edition, priced at $80 dollars. That’s not the one you want. No, the big daddy collectors edition itself, the deluxe edition is the one you’re going to want.

A human labeling 100,000 of these by hand is intractable, and frankly, in 2025 it would probably produce worse results than using LLMs to label the data. Instead, we used a combination of hand-labeling, LLM ensemble labeling, and active learning to produce a high-quality labeled dataset.

Creating a quality labeled dataset

Hand-Labeling 200 samples

Before we get too excited, we needed to understand the data we were working with in the context of financial sentiment classification. We sampled a small, representative dataset of 200 texts uniformly across categories and subcategories and labeled each one by hand as negative, neutral, or positive. This process is not sexy, but it is necessary.

After hand-labeling, we computed the class distribution of our labels. In a representative sample of the full dataset, some classes may have too few examples for a model to learn from. If this is the case, you may need to sample more data from under-represented classes to ensure a balanced dataset. Our initial sample of 100 didn’t contain enough negative sentiment, so we simply sampled another 100.
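
As a minimal sketch of that check, assuming the hand-labeled sample lives in a polars DataFrame with a human_label column (the file path and threshold are illustrative):

import polars as pl

# 200 hand-labeled samples: text plus a human-assigned sentiment label.
sample = pl.read_ipc("hand_labeled_200.ipc")  # hypothetical path

# Class distribution: counts and fractions per label.
counts = (
    sample.group_by("human_label")
    .len()
    .with_columns((pl.col("len") / sample.height).alias("fraction"))
    .sort("len")
)
print(counts)

# Flag classes that are too thin to learn from; sample more data for those.
min_per_class = 30  # illustrative floor
under_represented = counts.filter(pl.col("len") < min_per_class)["human_label"].to_list()
print("Sample more data for:", under_represented)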

Hand-labeling, while tedious, worked that biological neural network up top. It helped us understand the data: what do easy-to-label samples look like, and what do difficult ones look like? Armed with this understanding, we authored a prompt to label the same 200 samples to see where our human labels were potentially incorrect.

We chose a number of fast LLMs to use as labelers so we could ensemble them, wrote a prompt to label the data, and ran it to produce labels.

We used this ensemble:


Prompt used for labeling
"""
# TASK DESCRIPTION

Read through the following snippet of text carefully and classify the **financial sentiment** as
either negative, neutral, or positive. You must also provide a short rationale for why you
assigned the financial sentiment you did.

For clarity, here are the definitions of negative, neutral, and positive sentiments:

   - **Negative**: The snippet describes an event or development that has had, is having, or is
       expected to have a material negative impact on the company’s financial performance, share price, reputation,
       or outlook.

   - **Neutral**: The snippet is informational/descriptive and is not expected to have a material positive or
       negative impact on the company.

   - **Positive**: The snippet describes an event or development that has had, is having, or is expected to have
       a material positive impact on the company’s financial performance, share price, reputation, or outlook.

Materiality note:
- “Material impact” includes likely effects on share price, revenue, costs, profitability, cash flow, guidance,
regulatory exposure, reputation, risk exposure, or competitive position.

# TASK GUIDELINES

For the avoidance of doubt here is a decision tree that you can follow to arrive at the most appropriate
sentiment classification for the snippet. Pay careful attention to the logic. Don't deviate.

START

├── Step 1: Carefully read and understand the snippet.

├── Step 2: Check for sentiment indicators:

├── Is the snippet clearly NEGATIVE?
│     (share price decline, losses, scandals, lawsuits, layoffs, product recalls, regulatory fines,
│      leadership resignations, declining sales, market-share losses, reputational damage etc.)
│    │
│    ├── YES → Classify as "negative"
│    │      └── Provide a rationale by summarizing WHY the snippet is negative.
│    │
│    └── NO → Continue below

├── Is the snippet clearly POSITIVE?
│     (share price increases, strong earnings, favorable partnerships, successful product launches, awards,
│      expansion plans, positive analyst coverage, reputational enhancement, etc.)
│    │
│    ├── YES → Classify as "positive"
│    │      └── Provide a rationale by summarizing WHY the snippet is positive.
│    │
│    └── NO → Continue below

└── If neither clearly positive nor negative → Classify as "neutral"
         (routine product announcements without performance implications, leadership appointments,
          scheduled reports, factual statements, general industry overviews, etc.)
       └── Provide a rationale by summarizing WHY the snippet is neutral.

If there is conflicting sentiment in the snippet pick the most dominant one, otherwise default to **neutral**.

# RESPONSE FORMAT

You must respond with ONLY a valid JSON object formatted as follows. DO NOT WRITE ANY PREAMBLE JUST RETURN JSON.

{{
   "rationale": "A one-sentence rationale for your classification",
   "financial_sentiment": "either negative, neutral, or positive"
}}

# SNIPPET TO LABEL

Here is the snippet we would like you to assign a negative, neutral, or positive financial sentiment label to:

{text}

P.S. REMEMBER TO READ THE SNIPPET CAREFULLY AND FOLLOW THE GUIDELINES TO ARRIVE AT THE MOST APPROPRIATE
FINANCIAL SENTIMENT CLASSIFICATION. WHEN IN DOUBT, YOU SHOULD DEFER TO A "neutral" CLASSIFICATION FOR THE SNIPPET.
GOOD LUCK!
"""

To assess the human and LLM labels, we first ensembled the AI labels using majority vote (the most frequently occurring label). For each sample, we compared the human label with the ensemble’s majority vote label and sense-checked disagreements. Utilizing the LLM’s provided rationales, we then reasoned about whether the disagreement stemmed from an incorrect human label, an incorrect LLM label, or a prompt needing refinement, relabeling samples as necessary—a theme we revisited later in the project.
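
A minimal sketch of that comparison, assuming one column of labels per LLM alongside the human_label column (the labeler column names mirror the diagrams below; the helper and file path are ours for illustration):

from collections import Counter

import polars as pl

# One column per LLM labeler; abbreviated here, the real ensemble has 8.
llm_cols = ["grok_4_fast", "qwen3_32b"]

def majority_vote(row: dict) -> str:
    """Most frequently occurring label across the LLM labelers for one sample."""
    return Counter(row[col] for col in llm_cols).most_common(1)[0][0]

labeled = pl.read_ipc("hand_labeled_200_with_llm_labels.ipc")  # hypothetical path
labeled = labeled.with_columns(
    pl.struct(llm_cols).map_elements(majority_vote, return_dtype=pl.String).alias("majority_vote")
)

# Disagreements between the human label and the ensemble's vote are the ones we sense-check by hand.
disagreements = labeled.filter(pl.col("human_label") != pl.col("majority_vote"))
print(f"{disagreements.height} / {labeled.height} samples to review")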

To make this concrete, a common issue we identified was that earnings reports and analyst statements were often labeled neutral when in fact they contained positive or negative outcomes or forecasts for a company. This led us to refine the prompt to better capture financial performance, as it’s financial sentiment we’re after, not generic sentiment.

Financial Sentiment definition quants are looking for
    NEGATIVE           NEUTRAL           POSITIVE
        ●─────────────────●─────────────────●
        │                 │                 │
     Bearish       Balanced Outlook      Bullish

We’re reminded again: human data inspection is a crucial step in machine learning. This investment of time upfront helped us optimize the prompt before labeling the final 100,000 samples.

Auto-Labeling 10,000 samples

Scaling to 10,000 samples followed the same procedure:

  • Sample 10,000 text snippets uniformly across categories and subcategories
  • For each LLM in the ensemble, label all samples using the refined prompt (see the sketch after this list)
  • Compute the majority vote label for the ensemble
  • Check that classes are sufficiently represented for learning (class distribution)
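
The per-model labeling in the second step is just an OpenAI-compatible chat call against OpenRouter. A minimal sketch, assuming the refined prompt is wrapped in a financial_sentiment_prompt(text) helper (the same name used in the oracle snippet later) and that the model honors the JSON-only response format:

import json

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

def label_with_model(text: str, model: str) -> str:
    """Ask one LLM in the ensemble to label a single snippet with the refined prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": financial_sentiment_prompt(text)}],
        temperature=0,
    )
    # The prompt instructs the model to return only a JSON object; no retry/repair logic shown here.
    parsed = json.loads(response.choices[0].message.content)
    return parsed["financial_sentiment"]

# Example (model slug illustrative):
# label = label_with_model("Gilead Sciences posted year-on-year revenue growth of 3%...", "x-ai/grok-4-fast")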

Let’s recap our dataset transformations so far:

Dataset transformations from 200 hand labeled samples to 10,000 LLM labeled samples
┌───────────┐
│ text      │   shape (200, 1)
│ str       │
└───────────┘

           │ +1 human label

┌──────────┬─────────────┐
│ text     │ human_label │   shape (200, 2)
│ str      │ str         │
└──────────┴─────────────┘

           │ +10,000 samples, +8 LLM labels, -1 Human label

┌──────────┬─────────────┬─────┬───────────┐
│ text     │ grok_4_fast │ ... │ qwen3_32b │   shape (10_000, 9)
│ str      │ str         │     │ str       │    
└──────────┴─────────────┴─────┴───────────┘   *10_200? Nope.
           │                                   We kept the samples for the human set separate
           │ +majority vote

┌──────────┬─────────────┬─────┬───────────┬───────────────┐
│ text     │ grok_4_fast │ ... │ qwen3_32b │ majority_vote │   shape (10_000, 10)
│ str      │ str         │ str │ str       │ str           │   
└──────────┴─────────────┴─────┴───────────┴───────────────┘

Training a baseline classifier

A baseline classifier is a good reference point to compare more complex models against. However, its true value lies in what it can tell us about the dataset. The linear models we trained performed relatively well on a validation set. This indicates that:

  1. There is “signal”: There are learnable features in the dataset.
  2. Our labels are good: Some features have a linear relationship with the labels we’ve assigned.
  3. Benchmark for improvement: If a more complex model does poorly, we know the issue lies in something like the model architecture, training procedure, or a bug, not the dataset itself.

If this simple model had performed poorly, we would have stopped and investigated why. Is there a bug? Is it the dataset’s features, its size, or its labels? Or is the task inherently too difficult, meaning our model needs more expressive power?

For our baseline model, we used text embeddings as input features to a linear classifier trained to predict the majority vote label from our LLM ensemble. We chose embeddings because embedding models are trained to produce dense vector representations of text that capture semantic meaning; our hypothesis was that the signal we want to learn is already encoded in these embeddings. This makes them well suited as input data for a classification task.

The way this works is as follows:

Text classification using embeddings
                ┌───────────┐               ┌────────────┐
    Text        │ Embedding │  Embeddings   │ Classifier │  Targets    
 ─────────────▶ │   Model   │ ────────────▶ │    Model   │ ─────────▶  Label = "negative"
 ["Miss..."]    │           │  [[0.42...]]  │            │ [0,-1,...] 
                └───────────┘               └────────────┘
                
 Target mapping: {0: neutral, 1: positive, -1: negative}

Like the LLM labelers, we ensemble embedding models too. These are the embedding models we used on OpenRouter:

The best baseline model trained on this 10,000-sample dataset achieved around 85% accuracy. After scaling to 100,000 samples (discussed in the next section), we observed a 1-2% accuracy improvement. The final baseline model achieved 86.65% accuracy in training and 85.93% on a held-out validation set of 20,000 samples. It used SGDClassifier from sklearn.linear_model with the hinge loss, trained on dense text embeddings, and was in fact an ensemble of such classifiers, one per embedding model queried via OpenRouter, with majority vote used again to aggregate their label predictions (sketched after the training code below).

Code to train baseline classifier
import os
import datetime as dt

import numpy as np
import polars as pl
import simplejson
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

def embedding_model(data: pl.DataFrame, model: str, embeddings_path: str) -> tuple:
    """
    Train a linear classifier on text embeddings for financial sentiment classification.
    
    Generates or loads embeddings, trains an SGDClassifier with hinge loss, and evaluates
    performance on train/validation splits. Returns predictions, accuracy metrics, and
    confusion matrices for both splits.
    
    :param data: DataFrame containing 'text' and 'majority_vote' columns with samples and labels
    :param model: Embedding model identifier for OpenRouter API
    :param embeddings_path: Path to cache/load embeddings (created if doesn't exist)
    :return: Returns a tuple of predictions, accuracy metrics, and confusion matrices for both splits.
    """
    # Memoize the embeddings for subsequent training runs.
    if not os.path.exists(embeddings_path):
        # Generate the embeddings using OpenRouter (embed_data is a helper not shown in this snippet).
        embed_data(data=data, out_file=embeddings_path, model=model)

    # Load the embeddings, only as many samples as this training run.
    embeddings = np.load(embeddings_path)
    embeddings = embeddings[:data.shape[0]]

    # Fetch labels using majority vote and map to model targets.
    label_to_target = {"positive": 1, "neutral": 0, "negative": -1}
    labels = list(label_to_target.keys())
    targets = [label_to_target[row['majority_vote']] for row in data.to_dicts()]

    # Samples already shuffled, simple subscript split is sufficient.
    train_pct = 0.8
    n_train = int(data.shape[0] * train_pct)
    x_train, y_train = embeddings[:n_train], targets[:n_train]
    x_val, y_val = embeddings[n_train:], targets[n_train:]

    # Get the training and validation text.
    train_text = data[:n_train].select("text").to_numpy().flatten().tolist()
    val_text = data[n_train:].select("text").to_numpy().flatten().tolist()

    classifier = SGDClassifier(
        loss='hinge',
        penalty='l2',
        alpha=0.00001,
        validation_fraction=0.00001
    )
    classifier.fit(x_train, y_train)

    # Make predictions.
    train_preds = classifier.predict(x_train)
    val_preds = classifier.predict(x_val)

    # Score the preds.
    train_acc = accuracy_score(y_true=y_train, y_pred=train_preds)
    val_acc = accuracy_score(y_true=y_val, y_pred=val_preds)

    confusion_train = confusion_matrix(y_true=y_train, y_pred=train_preds)
    confusion_train = {
        r: {
            c: int(confusion_train[i, j])
            for j, c in enumerate(labels)
        }
        for i, r in enumerate(labels)
    }
    confusion_val = confusion_matrix(y_true=y_val, y_pred=val_preds)
    confusion_val = {
        r: {
            c: int(confusion_val[i, j])
            for j, c in enumerate(labels)
        }
        for i, r in enumerate(labels)
    }

    print(f"{dt.datetime.utcnow()} - Train Accuracy {train_acc:.4f}")
    print(f"{dt.datetime.utcnow()} - Val Accuracy {val_acc:.4f}")
    print(f"{dt.datetime.utcnow()} - Confusion Matrix")
    print("Train", simplejson.dumps(confusion_train, indent=4))
    print("Val", simplejson.dumps(confusion_val, indent=4))

    return (
        (y_train, train_preds, train_text),
        (y_val, val_preds, val_text),
        (train_acc, val_acc),
        (confusion_train, confusion_val)
    )
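
To reproduce the ensembled baseline numbers, one reading of the setup is: run embedding_model once per embedding model, then take a majority vote over their validation predictions. A minimal sketch, where embedding_models is a hypothetical list of OpenRouter embedding slugs and data is the labeled DataFrame from above:

import numpy as np
from scipy import stats

# One classifier per embedding model; each run returns the tuples documented above.
runs = [
    embedding_model(data=data, model=m, embeddings_path=f"embeddings_{i}.npy")
    for i, m in enumerate(embedding_models)
]

# Stack validation predictions into shape (n_models, n_val_samples).
val_preds = np.stack([run[1][1] for run in runs])
y_val = np.asarray(runs[0][1][0])

# Majority vote across models for each validation sample.
ensemble_preds, _ = stats.mode(val_preds, axis=0, keepdims=False)

print(f"Ensembled validation accuracy: {(ensemble_preds == y_val).mean():.4f}")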

With baseline models performing well on 10k samples, we were ready to scale to the full 100k dataset.

Auto-Labeling the final 100,000 samples

We followed the same process as for the 10,000 samples and ended up with the following dataset ready for training… or so we thought.

Dataset transformations to 100,000 samples with LLM ensemble labeling
┌──────────┬─────────────┬─────┬───────────┬───────────────┐
│ text     │ grok_4_fast │ ... │ qwen3_32b │ majority_vote │   shape (10_000, 10)
│ str      │ str         │ str │ str       │ str           │   
└──────────┴─────────────┴─────┴───────────┴───────────────┘

           │ +90,000 samples
           │ +8 LLM labels on new samples
           │ +majority vote on new samples


┌──────────┬─────────────┬─────┬───────────┬───────────────┐
│ text     │ grok_4_fast │ ... │ qwen3_32b │ majority_vote │   shape (100_000, 10)
│ str      │ str         │ str │ str       │ str           │   
└──────────┴─────────────┴─────┴───────────┴───────────────┘

Including some optional headers in our embedding requests to OpenRouter enabled them to measure our usage. Surprisingly, embedding a 100,000-sample dataset across 6 embedding models put us on the OpenRouter leaderboard for embeddings. Once we’ve operationalized this process into our DataFeeds product, we will sit in first place, no doubt.

OpenRouter leaderboard showing NOSIBLE ranking 6th for Qwen3-Embedding-8B usage after the project

Active-learning to improve label quality

Before training more powerful models, we decided to improve the dataset quality further. We identified samples where the linear models trained on different embeddings (Qwen3-Embedding-8B, …, Mistral: Mistral Embed 2312) unanimously predicted one label (e.g. all predicted negative) while the LLM ensemble’s majority vote label said otherwise (e.g. neutral). These disagreements revealed hard-to-classify samples with mixed sentiment or unclear subjects. When all linear models reach consensus, that unanimous agreement is a stronger signal than the LLM labelers’ majority vote, giving us confidence that the majority vote label, and not the linear models’ prediction, is the one that is incorrect.

In traditional supervised learning, human experts label datasets once and those labels are fixed as ground truth. Since our labels came from generic LLMs via majority vote rather than domain experts, they’re more fallible but also more flexible. This led us to an approach called active learning, where the target output (the label we’re trying to predict) is not fixed but can be updated during training.

Relabeling algorithm

Our first approach was to relabel all the samples where the linear models trained on different embeddings unanimously agreed on a label and the majority vote label differed. But it occurred to us that updating labels and retraining shifts the models’ learned decision boundaries. Would the set of disagreements shrink if we repeated the process? Would it converge to zero? It does!

We tested this by iteratively relabeling our dataset. The basic outline of our full training algorithm with relabeling is as follows:

  1. Label a set of 100k samples with the LLM labelers and compute majority vote.
  2. Train multiple linear models on different embeddings of the same text to predict the majority vote of the LLM labelers.
  3. Perform iterative relabeling:
    • Compare all the linear models’ predictions to the majority vote label.
    • Identify disagreements where all linear models agree with each other but not with the majority vote label (sketched after this list).
    • Consult an oracle (OpenAI: GPT-5.1, a large LLM acting as our active-learning expert) to evaluate disagreements and relabel samples when appropriate.
    • Drop the worst performing linear model from the ensemble.
    • Repeat until no additional samples require relabeling.
  4. Use the resulting dataset as the final training set for the classification models.
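
A minimal sketch of the disagreement check in step 3, assuming one prediction column per linear model next to the LLM ensemble’s majority_vote column (the prediction column names and file path are illustrative):

import polars as pl

# One column of predictions per linear model in the embedding ensemble.
linear_cols = ["pred_qwen3_embedding_8b", "pred_mistral_embed"]

df = pl.read_ipc("labels_with_linear_predictions.ipc")  # hypothetical path

# All linear models agree with each other...
consensus = pl.all_horizontal([pl.col(c) == pl.col(linear_cols[0]) for c in linear_cols])
# ...but their shared prediction differs from the LLM ensemble's majority vote.
disputed = df.filter(consensus & (pl.col(linear_cols[0]) != pl.col("majority_vote")))

print(f"{disputed.height} samples to send to the oracle for possible relabeling")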

The accuracy improvement we got was significant.

~3.5% validation accuracy improvement as a result of relabeling:

Split         Before     After      Improvement
Train         86.65%     90.03%     +3.38%
Validation    85.93%     89.48%     +3.55%

Prompt used to consult the oracle
"""
Consult the oracle to evaluate whether it agrees with the linear models prediction, or if we need to relabel this text sample.
"""
import textwrap

labels = ["negative", "neutral", "positive"]
labelling_prompt = textwrap.dedent(financial_sentiment_prompt(text))
label_options = " | ".join([f'"{label}"' for label in labels])
# The prediction variable is one of the labels above, where all linear models agreed.

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text":  textwrap.dedent(
                    f"""
                    # TASK DESCRIPTION
                    
                    Given the following LLM prompt, another model labelled this text as \"{prediction}\".
                    Is this model correct? Provide the correct label and a rationale for your answer.
                    
                    # RESPONSE FORMAT
                    
                    You must respond using this exact format. DO NOT RESPOND IN ANY OTHER WAY:

                    ```json
                    {{
                        "correct": "true" | "false",
                        "label": {label_options},
                        "reason": "A rationale for the correctness or incorrectness of your label."
                    }}
                    ```
                    
                    P.S. Think fast there is no time to waste.                        
                    """
                )
            }
        ]
    },
    {
        "role": "user",
        "content": [{ "type": "text", "text": labelling_prompt }]
    }
]

# Send these messages to OpenRouter using the OpenAI client.
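
Turning the oracle’s reply into a relabeling decision is then a small parsing step. A minimal sketch with no retry or validation logic, assuming completion is the response object returned by that OpenAI-compatible call:

import json
import re

# The oracle answers inside a ```json fenced block, per the response format above.
raw = completion.choices[0].message.content
payload = re.search(r"\{.*\}", raw, flags=re.DOTALL).group(0)
verdict = json.loads(payload)

# "label" is the oracle's final call; when "correct" is "true" it simply confirms
# the linear models' unanimous prediction, otherwise it overrides the majority vote.
new_label = verdict["label"]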

Training frontier classification models

With baseline models achieving almost 90% accuracy, and a large, quality dataset of 100,000 samples, we’re more than ready to explore more complex models.

Supervised Finetuning (SFT) for classification

We tried different base models for finetuning, like ModernBERT and Microsoft’s DeBERTa, but the best was a tiny modern causal LLM: Qwen3 0.6B. It is the smallest model in its family, suitable for edge computing and low-latency applications. In fact, it’s so small it’s not even hosted on OpenRouter.

Fine-tuning a causal language model for classification is fairly straightforward but requires some careful implementation details to get right. This approach treats classification as next-token prediction: given a prompt with the text snippet, the model predicts the label token (negative, neutral, or positive).

Three implementation details matter:

Label masking: We want the model to learn to predict the label token only, not to memorize the prompt.

We masked the prompt portion of the labels so the loss is only computed on the label token, not the prompt. This prevents the model from memorizing prompts and forces it to learn the classification task. It is done by setting the label positions that correspond to prompt tokens to -100, which instructs PyTorch’s cross-entropy loss to ignore them during training.

labels = [-100] * len(prompt_tokens["input_ids"]) + answer_tokens["input_ids"]

Left-padding: Batch processing on a GPU requires padding different-length sequences of training examples to equal length.

For causal models, left-padding is important: it keeps the real tokens, including the label token the model must predict, at the end of the sequence instead of being followed by padding.

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

EOS token handling: During evaluation, we exclude the end-of-sequence token (<|im_end|>) from accuracy calculations. We only measure whether the model correctly predicted the label token itself (positive/negative/neutral), not the conversational markup tokens.

Our initial performance results looked a little too good to be true, and this was why: the trivially predictable <|im_end|> token was inflating token-level accuracy.

Here's our full training script
import os
import datetime as dt
import random
import textwrap
from itertools import chain

import polars as pl
import torch
from datasets import Dataset, DatasetDict
from sklearn.metrics import f1_score, accuracy_score
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    TrainingArguments,
    Trainer, AutoModelForSequenceClassification, DataCollatorWithPadding,
)
from functools import partial


def eval_finbert():
    """
    Evaluate the FinBERT baseline on Financial PhraseBank and on our 20% held-out validation set.

    Uses the fin_bank_ds and val_ds datasets defined in __main__.

    :return:
    """
    import numpy as np

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)

        # Prefer Macro F1.
        f1 = f1_score(labels, predictions, average="macro")
        accuracy = accuracy_score(labels, predictions)

        return {"f1": f1, "accuracy": accuracy}

    finbert_model_id = "ProsusAI/finbert"

    finbert_label2id = {
        "positive": 0,
        "negative": 1,
        "neutral": 2,
    }
    finbert_id2label = {v: k for k, v in finbert_label2id.items()}
    finbert_num_labels = len(finbert_label2id)

    finbert_tokenizer = AutoTokenizer.from_pretrained(finbert_model_id)
    finbert_model = AutoModelForSequenceClassification.from_pretrained(
        finbert_model_id,
        num_labels=finbert_num_labels,
        label2id=finbert_label2id,
        id2label=finbert_id2label,
    )

    def tokenize_finbert(batch):
        return finbert_tokenizer(batch["text"], truncation=True, max_length=512)

    # PhraseBank with FinBERT tokenizer.
    finbert_phrase_ds = fin_bank_ds.map(tokenize_finbert, batched=True)
    val_ds_for_finbert = val_ds.map(tokenize_finbert, batched=True)

    finbert_collator = DataCollatorWithPadding(tokenizer=finbert_tokenizer)

    finbert_eval_args = TrainingArguments(
        output_dir="finbert_baseline_eval",
        per_device_eval_batch_size=32,
        do_train=False,
        do_eval=True,
        logging_strategy="no",
    )

    # Evaluate FinBERT on PhraseBank.
    finbert_trainer_phrase = Trainer(
        model=finbert_model,
        args=finbert_eval_args,
        data_collator=finbert_collator,
        eval_dataset=finbert_phrase_ds,
        compute_metrics=compute_metrics,
    )

    print("\n---")
    print("Evaluating FinBERT on Financial PhraseBank...")
    finbert_phrase_metrics = finbert_trainer_phrase.evaluate()
    print("FinBERT metrics on PhraseBank:", finbert_phrase_metrics)
    print("---")

    # Evaluate FinBERT on 20% of data_labels.
    finbert_trainer_val = Trainer(
        model=finbert_model,
        args=finbert_eval_args,
        data_collator=finbert_collator,
        eval_dataset=val_ds_for_finbert,
        compute_metrics=compute_metrics,
    )

    print("\n---")
    print("Evaluating FinBERT on 20% held-out data_labels...")
    finbert_val_metrics = finbert_trainer_val.evaluate()
    print("FinBERT metrics on 20% data_labels held-out set:", finbert_val_metrics)
    print("---\n")

    # Move the finbert model off the gpu.
    finbert_model.to(device="cpu")


def compute_metrics_without_eos(eval_pred, eos_token_id):
    """
    Compute the f1 score and the accuracy based on TOKEN classification. Given we are predicting a single token
    this is a good approximation of the actual scores.

    Make sure to remove eos token.

    :param eval_pred: Predictions.
    :param eos_token_id: The eos token id.
    :return:
    """
    predictions, labels = eval_pred
    predictions = predictions[:, :-1]
    labels = labels[:, 1:]

    # Exclude Prompt (-100) AND EOS.
    valid_mask = (labels != -100) & (labels != eos_token_id)

    pred_flat = predictions[valid_mask]
    label_flat = labels[valid_mask]

    f1 = f1_score(label_flat, pred_flat, average="macro")
    accuracy = accuracy_score(label_flat, pred_flat)
    return {"f1": f1, "accuracy": accuracy}


def preprocess_logits_for_metrics(logits, labels):
    """
    Original logits are (Batch, Seq, Vocab).
    We only need (Batch, Seq) containing the indices of the max logit.
    """
    if isinstance(logits, tuple):
        # Depending on the model and config, logits may contain extra tensors,
        # like past_key_values, but logits always come first
        logits = logits[0]
    return logits.argmax(dim=-1)


if __name__ == "__main__":

    # ----------------------------------------------------
    # STEP 1 - Load financial sentiment dataset
    # ----------------------------------------------------

    task = "financial_sentiment"

    # Target TEXT strings for the Guard model.
    label_map = {
        "positive": "positive",
        "negative": "negative",
        "neutral": "neutral"
    }

    modeling_dir = os.path.join("/", "workspace", ".nosible")

    data_labels = pl.read_ipc(os.path.join(modeling_dir, f"{task}_100.0k_iter_18.ipc"))
    fin_bank = pl.read_ndjson(os.path.join(modeling_dir, "financial_phrase_bank.ndjson"))

    # Select text and text-labels.
    # Check if 'majority_vote' exists (it does in our IPC file), otherwise fall back to 'labels'.
    if "majority_vote" in data_labels.columns:
        data = data_labels.select(["text", "majority_vote"]).rename({"majority_vote": "labels"})
    else:
        data = data_labels.select(["text", "labels"])

    fin_bank = fin_bank.select(["text", "labels"])

    print(f"Data shape: {data.shape}")

    # ----------------------------------------------------
    # STEP 2 - Create train/val splits
    # ----------------------------------------------------

    # Slice to 100k only if we have that much data, otherwise take all.
    limit = min(100_000, len(data))
    data = data[:limit]

    full_ds = Dataset.from_polars(df=data)
    fin_bank_ds = Dataset.from_polars(df=fin_bank)

    # 80/20 train/validation split.
    split_ds = full_ds.train_test_split(test_size=0.2, seed=42)
    train_ds = split_ds["train"]
    val_ds = split_ds["test"]

    ds = DatasetDict({
        "train": train_ds,
        "val": val_ds,
        "phrasebank": fin_bank_ds,
    })

    # ----------------------------------------------------
    # STEP 3 - Fine-tune Qwen 3 (Guard Approach)
    # ----------------------------------------------------

    # Use Qwen3 0.6B as the base model.
    model_id = "Qwen/Qwen3-0.6B"

    # Important: trust_remote_code=True is essential for Qwen3.
    # Make sure to set left padding.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    # Load Model.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        device_map="auto"
    )

    def tokenize(batch):
        """
        Tokenize a batch.

        :param batch: Batch to tokenize.
        :return:
        """

        input_ids_list = []
        attention_mask_list = []
        labels_list = []

        for text, label in zip(batch['text'], batch['labels']):

            # 1. Build the prompt (User part)
            system = "Classify the financial sentiment as positive, negative, or neutral."
            msgs = [
                {"role": "system", "content": system},
                {"role": "user", "content": text},
            ]
            prompt_str = tokenizer.apply_chat_template(
                msgs,
                tokenize=False,
                add_generation_prompt=True,
                enable_thinking=False, # By default, this is true.
            )

            # 2. Build the answer (Assistant part)
            answer_str = f"{label}<|im_end|>" # Standard Qwen end token.

            # 3. Tokenize separately to know lengths
            prompt_tokens = tokenizer(prompt_str, add_special_tokens=False)
            answer_tokens = tokenizer(answer_str, add_special_tokens=False)

            assert len(answer_tokens["input_ids"]) == 2

            # 4. Combine
            input_ids = prompt_tokens["input_ids"] + answer_tokens["input_ids"]
            attention_mask = prompt_tokens["attention_mask"] + answer_tokens["attention_mask"]

            # 5. CREATE LABELS WITH MASKING
            # -100 tells PyTorch to IGNORE these tokens during training.
            labels = [-100] * len(prompt_tokens["input_ids"]) + answer_tokens["input_ids"]

            # Truncate if necessary (simplified for brevity).
            if len(input_ids) > 2048:
                input_ids = input_ids[:2048]
                attention_mask = attention_mask[:2048]
                labels = labels[:2048]

            input_ids_list.append(input_ids)
            attention_mask_list.append(attention_mask)
            labels_list.append(labels)

        return {
            "input_ids": input_ids_list,
            "attention_mask": attention_mask_list,
            "labels": labels_list
        }

    # Tokenize the datasets.
    t_ds = ds.map(tokenize, batched=True, remove_columns=ds["train"].column_names, num_proc=8)

    # Set the collator for padding.
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)

    timestamp = dt.datetime.now().strftime('%Y%m%d_%H%M%S')
    task_output = f"{task}_qwen3_0.6B_{timestamp}"

    training_args = TrainingArguments(
        output_dir=task_output,

        # Qwen3 0.6B is very small, we might be able to increase batch size slightly.
        gradient_accumulation_steps=4,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,

        num_train_epochs=5,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        max_grad_norm=1.0,
        weight_decay=0.1,

        # Improvements for instruction finetuning.
        neftune_noise_alpha=5,

        # Optimizations.
        bf16=True,
        optim="adamw_torch_fused",
        logging_strategy="steps",
        logging_steps=10,
        logging_dir=task_output,
        logging_first_step=True,
        group_by_length=True,
        torch_compile=False,

        # Eval params.
        eval_strategy="steps",
        eval_steps=500,
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_val_loss",
        report_to="none",
        save_strategy="steps",
        save_steps=500,

    )

    n_eval_train = int(0.05 * len(t_ds["train"]))
    chat_end_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=t_ds["train"],
        eval_dataset={
            "train": t_ds["train"].select(list(range(n_eval_train))),
            "val": t_ds["val"],
            "phrasebank": t_ds["phrasebank"],
        },
        compute_metrics=partial(compute_metrics_without_eos, eos_token_id=chat_end_token_id),
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    )

    trainer.train()

Results

We compare the accuracy of our finetuned Qwen3 0.6B model against FinBERT1 and a variety of LLMs prompted for zero-shot classification, on the out-of-sample validation set from our 100,000-sample dataset and on the Financial PhraseBank2 dataset by Malo, P. et al. (2014). The latter contains 4,840 financial news snippets labeled for sentiment by human experts.

Although GPT-5.1 topped us by a margin, it’s orders of magnitude more expensive, so in practical terms it isn’t feasible at our (NOSIBLE’s) scale.

Cost calculation methodology:

For the LLMs: We summed input and output token costs per million in a 10:1 ratio, which reflects the typical input/output length relationship when prompted with our labeling prompt.

For the finetuned model: Since Qwen3 0.6B isn’t hosted on OpenRouter, we estimated cost at a conservative 100:1 ratio given the minimal output—just the label token and EOS token.
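
As a concrete sketch of this calculation under one reading of “summing in a 10:1 ratio” (a token-weighted blend), with placeholder prices rather than actual OpenRouter rates:

def blended_cost(input_price_per_m: float, output_price_per_m: float,
                 input_ratio: float, output_ratio: float = 1.0) -> float:
    """Blend $/1M-token input and output prices according to an input:output token ratio."""
    total = input_ratio + output_ratio
    return (input_ratio * input_price_per_m + output_ratio * output_price_per_m) / total

# Placeholder prices in $ per 1M tokens (not real rates).
llm_cost = blended_cost(input_price_per_m=1.25, output_price_per_m=10.00, input_ratio=10)   # 10:1 for the LLMs
tiny_cost = blended_cost(input_price_per_m=0.04, output_price_per_m=0.08, input_ratio=100)  # 100:1 for Qwen3 0.6B

print(f"LLM blended cost:  ${llm_cost:.4f} per 1M tokens")
print(f"Tiny blended cost: ${tiny_cost:.4f} per 1M tokens")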

Closing thoughts

  • Large LLMs like GPT-5.1 excel at financial sentiment classification but are prohibitively expensive at production scale. We show they are orders of magnitude more costly than our tiny models.
  • FinBERT1, the established benchmark model, performed poorly on our real-world data despite strong performance on Financial PhraseBank2. This suggests overfitting to cleaner, more structured benchmark datasets rather than the messy, varied text found in actual web data.
  • By training on 100,000 real-world samples from our search index, we built a model that generalizes better across both our production data and benchmark datasets like Financial PhraseBank, illustrating the importance of data quality and quantity over complicated models.
  • “Look at the data” is still an important and relevant skill no matter what level you’re at.
  • Relabeling on datasets with generated labels is an excellent idea, and can be further automated and improved.
  • We’ve been duped into believing causal language models are only useful for text generation tasks, chat in particular; in fact, they make excellent classifiers once you understand how they work.
  • In the spirit of the festive season, we’ve open-sourced all the datasets and models on HuggingFace… share with us what you build!

Try it out yourself!

Quickstart on HuggingFace

  1. Visit our model page
  2. Click Deploy -> HF Inference Endpoints
  3. Configure the endpoint:
    • For Hardware choose a beefy GPU like an L40S
    • For Inference Engine choose SGLang and set Max Prefill Tokens: 65536
    • For Container add container args: --dtype float16 --cuda-graph-max-bs 128 --disable-radix-cache
    • For Advanced Settings choose Download Pattern: Download Everything
  4. Copy your endpoint URL and bring your HuggingFace API key.
  5. Use the code below to classify some text.
  6. Profit 🐒
Client code to interact with your HF Inference Endpoint.
import math
from openai import OpenAI

# Initialize the OpenAI-compatible client pointing to your inference endpoint (SGLang in the setup above)
client = OpenAI(
    base_url="YOUR_ENDPOINT_URL_HERE/v1",
    api_key="YOUR_API_KEY_HERE"
)

model_id = "NOSIBLE/financial-sentiment-v1.1-base"

# Input text to classify
text = "The company reported a record profit margin of 15% this quarter."

# Define the classification labels
labels = ["positive", "negative", "neutral"]

# Prepare the conversation
messages = [
    {"role": "system", "content": "Classify the financial sentiment as positive, neutral, or negative."},
    {"role": "user", "content": text},
]

# Make the API call
chat_completion = client.chat.completions.create(
    model=model_id,
    messages=messages,
    temperature=0,
    max_tokens=1,
    stream=False,
    logprobs=True,              # Enable log probabilities to calculate confidence
    top_logprobs=len(labels),   # Ensure we capture logprobs for our choices
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}, # Must be set to false.
        "regex": "(positive|neutral|negative)",
    },
)

# Extract the response content
response_label = chat_completion.choices[0].message.content

# Extract the logprobs for the generated token to calculate confidence
first_token_logprobs = chat_completion.choices[0].logprobs.content[0].top_logprobs

print(f"--- Classification Results ---")
print(f"Input: {text}")
print(f"Predicted Label: {response_label}\n")

print("--- Label Confidence ---")
for lp in first_token_logprobs:
    # Convert log probability to percentage
    probability = math.exp(lp.logprob)
    print(f"Token: '{lp.token}' | Probability: {probability:.2%}")

Acknowledgments

The team involved in the project includes:


Footnotes

  1. FinBERT: Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063

  2. Financial PhraseBank: Malo, P. et al. (2014). Good debt or bad debt: Detecting semantic orientations in economic texts.