Over the past few weeks, we have trained and productionized three text classifiers to predict the financial sentiment of a text snippet, determine whether the text contains a forward-looking statement or not, and determine whether the text contains a prediction or not.
In the festive spirit of Thanksgiving, we have open-sourced these models, and the datasets they were trained on, on HuggingFace. We also share some source code here to use in your own projects.
Wait, but you’re a search engine, right? NOSIBLE is already a world-class search engine, but more importantly it’s incredibly fast, which unlocks the ability to build near-real-time data feeds. Our Data Feeds product lets you turn any search about any topic into a point-in-time, backtest-friendly, deduplicated time series of data.
We’ve leaned even harder in this direction, and our Data Feeds product is now better described as web-scale surveillance. To prove this isn’t just a marketing term, we’re extracting and including these signals in our data feeds for our customers, and we’re just getting started.
Let’s get to it.
Real world data
A quality dataset is key to building a good ML model. The combination of GPU compute, a return to neural networks (in the form of deep CNNs), and a much larger labeled dataset (ImageNet) is a big part of what made AlexNet so successful back in 2012.
Using the vast amount of real-world data indexed by NOSIBLE, we sampled a data feed centered on financial news. The snippets are varied and contain a ton of signal; here is what they look like:
Some peers in the therapeutics segment have already reported their Q3 results. Gilead Sciences posted year-on-year revenue growth of 3%, beating expectations by 3.7%, and Biogen reported revenue up 2.8%, topping estimates by 8.2%. Following their reports, Gilead Sciences’s stock traded up 1.2% and Biogen’s stock was up 4.1%.
There are fun ones too:
If you thought the wait for Kingdom Hearts III was ridiculous, you ain’t see nothing yet. Now that we have an official release date for the game, Square Enix has already announced two collectors editions of the game. The first is your average run of the mill collectors edition, priced at $80 dollars. That’s not the one you want. No, the big daddy collectors edition itself, the deluxe edition is the one you’re going to want.
Hand-labeling 100,000 of these is intractable, and frankly, in 2025 it would probably produce worse results than using LLMs to label the data. Instead, we used a combination of hand-labeling, LLM ensemble labeling, and active-learning to produce a high-quality labeled dataset.
Creating a quality labeled dataset
Labeling 200 samples to 10,000 samples
Before we got too excited, we needed to understand the data we were working with in the context of the problem: sentiment classification. We sampled a small, representative dataset of 200 texts uniformly across categories and subcategories and hand-labeled each one as reflecting negative, neutral, or positive sentiment. This process is not sexy, but it is necessary.
After hand-labeling, we computed the class distribution of our labels. In a representative sample of the full dataset, some classes may have too few examples for a model to learn from. If this is the case, you may need to sample more data from under-represented classes to ensure a balanced dataset. Our initial sample of 100 didn’t contain enough negative sentiment, so we simply sampled another 100.
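As a minimal sketch of that check (the column names and the toy rows here are purely illustrative):

import polars as pl

# Hypothetical hand-labeled sample: one row per snippet.
labeled = pl.DataFrame({
    "text": [
        "Shares slid 8% after the product recall.",
        "The AGM is scheduled for May.",
        "Revenue beat estimates by 5%.",
    ],
    "human_label": ["negative", "neutral", "positive"],
})

# Class distribution of the hand labels; if any class is badly
# under-represented, sample more snippets from it before moving on.
print(labeled["human_label"].value_counts())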
Hand-labeling, while tedious, worked that biological neural network up top. It helped us understand the data we’re working with: what do easy-to-label samples look like, and what do difficult ones look like? Armed with this understanding, we authored a prompt to label the same 200 samples to see where our human labels were potentially incorrect.
We chose a number of fast LLMs to use as labelers, wrote a prompt to label the data, and ran it for all the models. This is trivial to do: use the OpenAI client with OpenRouter in a simple loop (a sketch follows the prompt below).
We used this ensemble:
- xAI: Grok 4 Fast
- xAI: Grok 4 Fast (reasoning enabled)
- Google: Gemini 2.5 Flash
- OpenAI: GPT-5 Nano
- OpenAI: GPT-4.1 Mini
- OpenAI: gpt-oss-120b
- Meta: Llama 4 Maverick
- Qwen: Qwen3 32B
Prompt used for labeling
"""
# TASK DESCRIPTION
Read through the following snippet of text carefully and classify the **financial sentiment** as
either negative, neutral, or positive. You must also provide a short rationale for why you
assigned the financial sentiment you did.
For clarity, here are the definitions of negative, neutral, and positive sentiments:
- **Negative**: The snippet describes an event or development that has had, is having, or is
expected to have a material negative impact on the company’s financial performance, share price, reputation,
or outlook.
- **Neutral**: The snippet is informational/descriptive and is not expected to have a material positive or
negative impact on the company.
- **Positive**: The snippet describes an event or development that has had, is having, or is expected to have
a material positive impact on the company’s financial performance, share price, reputation, or outlook.
Materiality note:
- “Material impact” includes likely effects on share price, revenue, costs, profitability, cash flow, guidance,
regulatory exposure, reputation, risk exposure, or competitive position.
# TASK GUIDELINES
For the avoidance of doubt here is a decision tree that you can follow to arrive at the most appropriate
sentiment classification for the snippet. Pay careful attention to the logic. Don't deviate.
START
│
├── Step 1: Carefully read and understand the snippet.
│
├── Step 2: Check for sentiment indicators:
│
├── Is the snippet clearly NEGATIVE?
│   (share price decline, losses, scandals, lawsuits, layoffs, product recalls, regulatory fines,
│   leadership resignations, declining sales, market-share losses, reputational damage etc.)
│   │
│   ├── YES → Classify as "negative"
│   │   └── Provide a rationale by summarizing WHY the snippet is negative.
│   │
│   └── NO → Continue below
│
├── Is the snippet clearly POSITIVE?
│   (share price increases, strong earnings, favorable partnerships, successful product launches, awards,
│   expansion plans, positive analyst coverage, reputational enhancement, etc.)
│   │
│   ├── YES → Classify as "positive"
│   │   └── Provide a rationale by summarizing WHY the snippet is positive.
│   │
│   └── NO → Continue below
│
└── If neither clearly positive nor negative → Classify as "neutral"
    (routine product announcements without performance implications, leadership appointments,
    scheduled reports, factual statements, general industry overviews, etc.)
    └── Provide a rationale by summarizing WHY the snippet is neutral.
If there is conflicting sentiment in the snippet pick the most dominant one, otherwise default to **neutral**.
# RESPONSE FORMAT
You must respond with ONLY a valid JSON object formatted as follows. DO NOT WRITE ANY PREAMBLE JUST RETURN JSON.
{{
"rationale": "A one-sentence rationale for your classification",
"financial_sentiment": "either negative, neutral, or positive"
}}
# SNIPPET TO LABEL
Here is the snippet we would like you to assign a negative, neutral, or positive financial sentiment label to:
{text}
P.S. REMEMBER TO READ THE SNIPPET CAREFULLY AND FOLLOW THE GUIDELINES TO ARRIVE AT THE MOST APPROPRIATE
FINANCIAL SENTIMENT CLASSIFICATION. WHEN IN DOUBT, YOU SHOULD DEFER TO A "neutral" CLASSIFICATION FOR THE SNIPPET.
GOOD LUCK!
"""
To assess the human and LLM labels, we first ensembled the LLM labels using a majority vote, i.e., the most frequently occurring label. For each sample, we compared the human label with the ensemble’s majority-vote label and sense-checked disagreements. Using the rationales the LLMs provided, we then reasoned about whether each disagreement stemmed from an incorrect human label, an incorrect LLM label, or a prompt needing refinement, relabeling samples as necessary; this is a theme we revisit later in the post.
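A rough sketch of that comparison, assuming the human labels and the per-model LLM labels live in a single polars DataFrame (column names are illustrative):

from collections import Counter

import polars as pl

LLM_COLS = ["grok_4_fast", "gemini_2_5_flash", "gpt_5_nano", "qwen3_32b"]

def majority_vote(row: dict) -> str:
    # Most frequent label across the ensemble; ties fall back to neutral.
    counts = Counter(row[col] for col in LLM_COLS)
    label, top = counts.most_common(1)[0]
    return label if list(counts.values()).count(top) == 1 else "neutral"

df = df.with_columns(
    pl.struct(LLM_COLS)
    .map_elements(majority_vote, return_dtype=pl.Utf8)
    .alias("majority_vote")
)

# Disagreements between the human label and the ensemble are what we sense-check.
disagreements = df.filter(pl.col("human_label") != pl.col("majority_vote"))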
To make this concrete, a key issue with our initial sentiment prompt was that emotional sentiment dominated the LLM labels. An earnings report written in dry, neutral language received a neutral label, despite indicating strongly negative or positive financial performance. This mismatch led us to redefine the task, explicitly incorporating financial outlook into the label definition and arriving at “financial sentiment” as a better signal.
NEGATIVE       NEUTRAL          POSITIVE
●─────────────────●─────────────────●
│                 │                 │
Bearish   Balanced Outlook       Bullish
We’re reminded again that human data inspection is a crucial step in machine learning. This upfront investment of time helped us optimize the prompt before labeling 100,000 samples.
Auto-Labeling 10,000 samples
It’s simple to scale this up; we follow roughly the same procedure:
- Sample text snippets uniformly across categories and subcategories, this time 10,000
- For each LLM in the ensemble, label all samples using the refined prompt
- Compute majority vote label for the ensemble
- Check classes are sufficiently represented for learning (class distribution)
Talking Python, nest a thread pool executor inside a process pool executor to parallelize the network requests to OpenRouter until you’re no longer network-bound; this is a useful pattern in general.
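A sketch of that pattern, reusing the hypothetical label_snippet helper from earlier (worker counts are illustrative):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def label_all_with_model(model: str, texts: list[str]) -> list[dict]:
    # Inside each process, fan the network-bound requests out across threads.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(lambda text: label_snippet(text, model), texts))

def label_with_ensemble(models: list[str], texts: list[str]) -> dict[str, list[dict]]:
    # One process per model keeps JSON parsing and bookkeeping off a single core.
    with ProcessPoolExecutor(max_workers=len(models)) as pool:
        futures = {model: pool.submit(label_all_with_model, model, texts) for model in models}
        return {model: future.result() for model, future in futures.items()}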
To illustrate our dataset transformations so far, here’s a picture:
┌───────────┐
│   text    │   shape (200, 1)
│    str    │
└───────────┘
      │
      │  +1 human label
      ▼
┌──────────┬─────────────┐
│   text   │ human_label │   shape (200, 2)
│   str    │     str     │
└──────────┴─────────────┘
      │
      │  +10,000 samples, +8 LLM labels, -1 human label
      ▼
┌──────────┬─────────────┬─────┬───────────┐
│   text   │ grok_4_fast │ ... │ qwen3_32b │   shape (10_000, 9)
│   str    │     str     │     │    str    │
└──────────┴─────────────┴─────┴───────────┘   *10_200? Nope.
      │                                         We kept the samples for the human set separate.
      │  +majority vote
      ▼
┌──────────┬─────────────┬─────┬───────────┬───────────────┐
│   text   │ grok_4_fast │ ... │ qwen3_32b │ majority_vote │   shape (10_000, 10)
│   str    │     str     │ str │    str    │      str      │
└──────────┴─────────────┴─────┴───────────┴───────────────┘
Training a baseline classifier
A baseline classifier is a good reference point to compare more complex models against. However, its true value lies in what it can tell us about the dataset. The linear models we trained performed relatively well on a validation set, which indicates that:
- There is “signal”: There are learnable features in the dataset.
- Our labels are good: Some features have a linear relationship with the labels we’ve assigned.
- Benchmark for improvement: If a more complex model does poorly, we know the issue lies in something like the model architecture, training procedure, or a bug, not the dataset itself.
If this simple model had performed poorly, we should stop and investigate why. Is there a bug? Verify. Is it the dataset’s features, its size, or the labels, or is the task inherently too difficult, meaning our model needs more expressive power?
The best baseline model trained on this 10k dataset achieved around 85% accuracy. After scaling to 100,000 samples (discussed in the next section), we observed a 1-2% accuracy improvement. The final baseline achieved 86.65% accuracy in training and 85.93% on a held-out validation set of 20,000 samples. It used SGDClassifier from sklearn.linear_model with the hinge loss function, trained on dense text embeddings, and was in fact an ensemble of such models, each trained on a different embedding generated via OpenRouter, with majority vote used again to aggregate their predicted labels.
Code to train baseline classifier
import os
import datetime as dt

import numpy as np
import polars as pl
import simplejson
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, accuracy_score


def embedding_model(data: pl.DataFrame, model: str, embeddings_path: str) -> tuple:
    # Memoize the embeddings for subsequent training runs.
    if not os.path.exists(embeddings_path):
        # Generate the embeddings using OpenRouter.
        # (embed_data is our helper; a sketch follows the embedding-model list below.)
        embed_data(data=data, out_file=embeddings_path, model=model)

    # Load the embeddings, only as many samples as this training run.
    embeddings = np.load(embeddings_path)
    embeddings = embeddings[:data.shape[0]]

    # Fetch labels using majority vote and map to model targets.
    label_to_target = {"positive": 1, "neutral": 0, "negative": -1}
    # Order must match confusion_matrix's sorted target order (-1, 0, 1).
    labels = ["negative", "neutral", "positive"]
    targets = [label_to_target[row["majority_vote"]] for row in data.to_dicts()]

    # Samples already shuffled, simple subscript split is sufficient.
    train_pct = 0.8
    n_train = int(data.shape[0] * train_pct)
    x_train, y_train = embeddings[:n_train], targets[:n_train]
    x_val, y_val = embeddings[n_train:], targets[n_train:]

    # Get the training and validation text.
    train_text = data[:n_train].select("text").to_numpy().flatten().tolist()
    val_text = data[n_train:].select("text").to_numpy().flatten().tolist()

    classifier = SGDClassifier(
        loss="hinge",
        penalty="l2",
        alpha=0.00001,
        validation_fraction=0.00001,
    )
    classifier.fit(x_train, y_train)

    # Make predictions.
    train_preds = classifier.predict(x_train)
    val_preds = classifier.predict(x_val)

    # Score the preds.
    train_acc = accuracy_score(y_true=y_train, y_pred=train_preds)
    val_acc = accuracy_score(y_true=y_val, y_pred=val_preds)

    confusion_train = confusion_matrix(y_true=y_train, y_pred=train_preds)
    confusion_train = {
        r: {
            c: int(confusion_train[i, j])
            for j, c in enumerate(labels)
        }
        for i, r in enumerate(labels)
    }
    confusion_val = confusion_matrix(y_true=y_val, y_pred=val_preds)
    confusion_val = {
        r: {
            c: int(confusion_val[i, j])
            for j, c in enumerate(labels)
        }
        for i, r in enumerate(labels)
    }

    print(f"{dt.datetime.utcnow()} - Train Accuracy {train_acc:.4f}")
    print(f"{dt.datetime.utcnow()} - Val Accuracy {val_acc:.4f}")
    print(f"{dt.datetime.utcnow()} - Confusion Matrix")
    print("Train", simplejson.dumps(confusion_train, indent=4))
    print("Val", simplejson.dumps(confusion_val, indent=4))

    return (
        (y_train, train_preds, train_text),
        (y_val, val_preds, val_text),
        (train_acc, val_acc),
        (confusion_train, confusion_val),
    )
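The function above handles a single embedding model. A sketch of how the ensemble could be tied together, reusing embedding_model and majority-voting the per-embedding predictions on the validation split (the model identifiers are illustrative, and data is the labeled polars DataFrame):

import numpy as np

EMBEDDING_MODELS = [
    "qwen/qwen3-embedding-8b",
    "openai/text-embedding-3-large",
    "google/gemini-embedding-001",
]

val_preds_per_model = []
for model in EMBEDDING_MODELS:
    path = f"embeddings_{model.replace('/', '_')}.npy"
    _, (y_val, val_preds, _), _, _ = embedding_model(data=data, model=model, embeddings_path=path)
    val_preds_per_model.append(np.asarray(val_preds))

# Majority vote across the per-embedding classifiers (targets are -1, 0, 1).
stacked = np.stack(val_preds_per_model) + 1  # shift to 0, 1, 2 for bincount
ensemble = np.array([np.bincount(col, minlength=3).argmax() - 1 for col in stacked.T])
print(f"Ensemble validation accuracy: {(ensemble == np.asarray(y_val)).mean():.4f}")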
Apart from accuracy, we inspect the confusion matrix, which shows the raw data behind a Macro-F1 score: a metric that averages the F1 score (the harmonic mean of precision and recall) across label classes irrespective of their size.
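As a worked example of how the per-class numbers in a confusion matrix roll up into Macro-F1 (the counts are made up for illustration):

import numpy as np

# Rows = true class, columns = predicted class; order: negative, neutral, positive.
cm = np.array([
    [180,  15,   5],
    [ 20, 540,  40],
    [  5,  30, 165],
])

precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class
recall = np.diag(cm) / cm.sum(axis=1)     # per true class
f1 = 2 * precision * recall / (precision + recall)

# Macro-F1 averages the per-class F1 scores, so small classes count as much as large ones.
print({"per_class_f1": f1.round(3).tolist(), "macro_f1": round(float(f1.mean()), 3)})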
We used the following embedding models from OpenRouter:
- Qwen3-Embedding-8B
- Qwen3-Embedding-4B
- Qwen3-Embedding-0.6B
- OpenAI: Text Embedding 3 Large
- Google: Gemini Embedding 001
- Mistral: Mistral Embed 2312
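embed_data in the baseline snippet above is our own helper. Here is a rough sketch of an equivalent, assuming OpenRouter’s embeddings endpoint accepts the standard OpenAI-style client call (batch size is illustrative and error handling is left out on purpose):

import numpy as np
import polars as pl
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_API_KEY")

def embed_data(data: pl.DataFrame, out_file: str, model: str, batch_size: int = 128) -> None:
    # Embed the text column in batches and memoize the result as a numpy file.
    texts = data["text"].to_list()
    vectors = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(model=model, input=texts[i:i + batch_size])
        vectors.extend(item.embedding for item in response.data)
    np.save(out_file, np.asarray(vectors, dtype=np.float32))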
At this point, we have a baseline model, and are ready to scale up.
Auto-Labeling the final 100,000 samples
We followed the same process as for the 10,000 samples and ended up with the following dataset, ready for training… or so we thought.
┌──────────┬─────────────┬─────┬───────────┬───────────────┐
│   text   │ grok_4_fast │ ... │ qwen3_32b │ majority_vote │   shape (10_000, 10)
│   str    │     str     │ str │    str    │      str      │
└──────────┴─────────────┴─────┴───────────┴───────────────┘
      │
      │  +90,000 samples
      │  +8 LLM labels on new samples
      │  +majority vote on new samples
      │
      ▼
┌──────────┬─────────────┬─────┬───────────┬───────────────┐
│   text   │ grok_4_fast │ ... │ qwen3_32b │ majority_vote │   shape (100_000, 10)
│   str    │     str     │ str │    str    │      str      │
└──────────┴─────────────┴─────┴───────────┴───────────────┘
Including some optional headers in our embeddings requests to OpenRouter lets them attribute our usage, and, surprisingly, embedding a 100,000-sample dataset with 6 embedding models put us on the OpenRouter leaderboard for embeddings. Once we’ve operationalised this process into our Data Feeds product, we will sit in first place, no doubt.
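For reference, that attribution comes from OpenRouter’s optional app headers, which the OpenAI client can attach per request; the values below are placeholders:

response = client.embeddings.create(
    model="qwen/qwen3-embedding-8b",
    input=batch_of_texts,
    # Optional OpenRouter attribution headers; these tie usage to an app on the rankings.
    extra_headers={
        "HTTP-Referer": "https://your-app.example",
        "X-Title": "Your App Name",
    },
)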

Active-learning to improve label quality
Before training more powerful models, we decided to improve the dataset quality further. We identified samples where all of the LLM labelers (grok-4-fast, …, qwen3-32b) were unanimous (e.g., all labeled a snippet negative) but our trained model predicted differently (say, neutral). These disagreements surfaced hard-to-classify samples with mixed sentiment or unclear subjects. Unanimous agreement is a much stronger signal than a bare majority vote, so when a disagreement survives even that, it deserves scrutiny: in these cases we found it was often the label, and not the prediction, that was incorrect.
In traditional supervised learning, human experts label a dataset once and those labels are fixed as ground truth. Since our labels came from generic LLMs via majority vote rather than from domain experts, they’re more fallible but also more flexible. This led us to an active-learning approach, where the target output (the label we’re trying to predict) is not fixed but can be updated during training.
Relabeling algorithm
Our first approach was to relabel every sample where the LLM labelers were unanimous but our trained model predicted differently. It then occurred to us that updating a label and retraining shifts the model’s learned decision boundaries. Would the set of disagreements shrink if we repeated the process? Would it converge to zero?
We tested this by iteratively relabeling our dataset. The basic outline of our full training algorithm with relabeling follows; a code sketch of the disagreement-mining step appears after the list.
- Label a set of 100k samples with the LLM labelers and compute majority vote.
- Train multiple linear models on different embeddings of the same text to predict the majority-vote of the LLM labelers.
- Perform iterative relabeling:
- Compare all the linear models’ predictions to the majority vote label.
- Identify disagreements where all linear models agree but the majority vote label does not.
- Consult an oracle (OpenAI: GPT-5.1, a large LLM acting as our active-learning expert) to evaluate disagreements and relabel samples when appropriate.
- Drop the worst performing linear model from the ensemble.
- Repeat until no additional samples require relabeling.
- This is the final dataset used for training the classification models.
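A sketch of the disagreement-mining and oracle steps from the loop above (column names are illustrative, and the oracle prompt is a stand-in, not the one we used in production; client is the hypothetical OpenRouter client from earlier):

import polars as pl

def find_suspect_samples(df: pl.DataFrame, pred_cols: list[str]) -> pl.DataFrame:
    """Samples where every linear model agrees on a label that differs from the majority vote."""
    first = pl.col(pred_cols[0])
    unanimous = pl.all_horizontal([pl.col(col) == first for col in pred_cols[1:]])
    return df.filter(unanimous & (first != pl.col("majority_vote")))

def consult_oracle(text: str, current_label: str, predicted_label: str) -> str:
    # Ask a large model (GPT-5.1 via OpenRouter in our case; slug illustrative) to adjudicate.
    prompt = (
        f"Snippet:\n{text}\n\n"
        f"Label A: {current_label}\nLabel B: {predicted_label}\n"
        "Which label best reflects the financial sentiment? Answer with the label only."
    )
    response = client.chat.completions.create(
        model="openai/gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()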
Re-labeling resulted in a solid improvement of ~3.5% in validation accuracy.
| Split | Before | After | Improvement |
|---|---|---|---|
| Train | 86.65% | 90.03% | +3.38% |
| Validation | 85.93% | 89.48% | +3.55% |
Training frontier classification models
With baseline models achieving almost 90% accuracy and a large, high-quality dataset of 100,000 samples, we were more than ready to explore more complex models.
Supervised Finetuning (SFT) for classification
We tried different base models for finetuning, like ModernBERT and Microsoft’s DeBERTa, but the best was a tiny modern causal LLM: Qwen3 0.6B. It is a small model in its family, suitable for edge computing and low-latency applications. In fact, it’s so small it’s not even hosted on OpenRouter.
Fine-tuning a causal language model for classification is fairly straightforward but requires some careful implementation details to get right. This approach treats classification as next-token prediction: given a prompt with the text snippet, the model predicts the label token (negative, neutral, or positive).
Three implementation details matter:
Label masking: We want the model to learn to predict the label token only, not to memorize the prompt.
We masked the tokenized prompt so that the loss is computed only on the label token, not on the prompt. This prevents the model from memorizing prompts and forces it to learn the classification task. It is done by setting the label for every prompt position to -100, which tells PyTorch’s loss function to ignore those positions during training.
labels = [-100] * len(prompt_tokens["input_ids"]) + answer_tokens["input_ids"]
Left-padding: Batch processing on a GPU requires padding different length sequences of training examples to equal length.
For causal models, left-padding matters: it keeps each sequence’s real tokens at the end, so the position that predicts the label token comes directly after the prompt rather than after a run of padding tokens.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
EOS token handling: During evaluation, we exclude the end-of-sequence token (<|im_end|>) from accuracy calculations. We only measure whether the model correctly predicted the label token itself (positive/negative/neutral), not the conversational markup tokens.
Our initial performance results looked a little too good to be true, and this was why: the EOS token is trivially easy to predict, so including it inflated the accuracy.
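Concretely, the evaluation mask drops both the prompt positions and the EOS token; this is the relevant excerpt from compute_metrics_with_eos in the script below:

# Keep only positions that are neither masked prompt tokens (-100) nor the EOS token.
valid_mask = (labels != -100) & (labels != eos_token_id)
pred_flat = predictions[valid_mask]
label_flat = labels[valid_mask]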
Here's our full training script
import os
import datetime as dt
from functools import partial

import polars as pl
import torch
from datasets import Dataset, DatasetDict
from sklearn.metrics import f1_score, accuracy_score
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    TrainingArguments,
    Trainer
)


def compute_metrics_with_eos(eval_pred, eos_token_id):
    predictions, labels = eval_pred
    # Shift so that the prediction at position i lines up with the label at position i + 1.
    predictions = predictions[:, :-1]
    labels = labels[:, 1:]
    # Exclude prompt positions (-100) AND the EOS token.
    valid_mask = (labels != -100) & (labels != eos_token_id)
    pred_flat = predictions[valid_mask]
    label_flat = labels[valid_mask]
    f1 = f1_score(label_flat, pred_flat, average="macro")
    accuracy = accuracy_score(label_flat, pred_flat)
    return {"f1": f1, "accuracy": accuracy}


def preprocess_logits_for_metrics(logits, labels):
    """
    Original logits are (Batch, Seq, Vocab).
    We only need (Batch, Seq) containing the indices of the max logit.
    """
    if isinstance(logits, tuple):
        # Depending on the model and config, logits may contain extra tensors,
        # like past_key_values, but logits always come first.
        logits = logits[0]
    return logits.argmax(dim=-1)


if __name__ == "__main__":

    # ----------------------------------------------------
    # STEP 1 - Load financial sentiment dataset
    # ----------------------------------------------------

    task = "financial_sentiment"

    # Target TEXT strings for the Guard-style model.
    label_map = {
        "positive": "positive",
        "negative": "negative",
        "neutral": "neutral"
    }

    modeling_dir = os.path.join("/", "workspace", ".nosible")
    data_labels = pl.read_ipc(os.path.join(modeling_dir, f"{task}_100.0k_iter_18.ipc"))
    fin_bank = pl.read_ndjson(os.path.join(modeling_dir, "financial_phrase_bank.ndjson"))

    # Select text and text labels. Use 'majority_vote' if present, otherwise 'labels'.
    if "majority_vote" in data_labels.columns:
        data = data_labels.select(["text", "majority_vote"]).rename({"majority_vote": "labels"})
    else:
        data = data_labels.select(["text", "labels"])
    fin_bank = fin_bank.select(["text", "labels"])

    print(f"Data shape: {data.shape}")  # Should be (100_000, 2).

    # ----------------------------------------------------
    # STEP 2 - Create train/val splits
    # ----------------------------------------------------

    # Slice to 100k only if we have that much data, otherwise take all.
    limit = min(100_000, len(data))
    data = data[:limit]

    full_ds = Dataset.from_polars(df=data)
    fin_bank_ds = Dataset.from_polars(df=fin_bank)

    split_ds = full_ds.train_test_split(test_size=0.2, seed=42)
    train_ds = split_ds["train"]
    val_ds = split_ds["test"]

    ds = DatasetDict({
        "train": train_ds,
        "val": val_ds,
        "phrasebank": fin_bank_ds,
    })

    # ----------------------------------------------------
    # STEP 3 - Fine-tune Qwen3 0.6B (Guard-style approach)
    # ----------------------------------------------------

    model_id = "Qwen/Qwen3-0.6B"

    # Important: trust_remote_code=True is essential for Qwen3.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    # Load the model.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        device_map="auto"
    )

    def tokenize(batch):
        input_ids_list = []
        attention_mask_list = []
        labels_list = []
        for text, label in zip(batch["text"], batch["labels"]):

            # 1. Build the prompt (user part).
            system = "Classify the financial sentiment as positive, negative, or neutral."
            msgs = [
                {"role": "system", "content": system},
                {"role": "user", "content": text},
            ]
            prompt_str = tokenizer.apply_chat_template(
                msgs,
                tokenize=False,
                add_generation_prompt=True,
                enable_thinking=False,  # By default, this is true.
            )

            # 2. Build the answer (assistant part).
            # answer_str = f"{label}<|im_end|>"  # Standard Qwen end token.
            answer_str = f"{label}"

            # 3. Tokenize separately to know lengths.
            prompt_tokens = tokenizer(prompt_str, add_special_tokens=False)
            answer_tokens = tokenizer(answer_str, add_special_tokens=False)
            assert len(answer_tokens["input_ids"]) == 1

            # 4. Combine.
            input_ids = prompt_tokens["input_ids"] + answer_tokens["input_ids"]
            attention_mask = prompt_tokens["attention_mask"] + answer_tokens["attention_mask"]

            # 5. CREATE LABELS WITH MASKING.
            # -100 tells PyTorch to IGNORE these tokens during training.
            labels = [-100] * len(prompt_tokens["input_ids"]) + answer_tokens["input_ids"]

            # Truncate if necessary (simplified for brevity).
            if len(input_ids) > 2048:
                input_ids = input_ids[:2048]
                attention_mask = attention_mask[:2048]
                labels = labels[:2048]

            input_ids_list.append(input_ids)
            attention_mask_list.append(attention_mask)
            labels_list.append(labels)

        return {
            "input_ids": input_ids_list,
            "attention_mask": attention_mask_list,
            "labels": labels_list
        }

    t_ds = ds.map(tokenize, batched=True, remove_columns=ds["train"].column_names, num_proc=8)

    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)

    timestamp = dt.datetime.now().strftime("%Y%m%d_%H%M%S")
    task_output = f"{task}_qwen3_0.6B_{timestamp}"

    training_args = TrainingArguments(
        output_dir=task_output,
        # Qwen3 0.6B is very small, so these batch sizes fit comfortably.
        gradient_accumulation_steps=4,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=5,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        max_grad_norm=1.0,
        weight_decay=0.1,
        # Improvements for instruction finetuning.
        neftune_noise_alpha=5,
        # Optimizations.
        bf16=True,
        optim="adamw_torch_fused",
        logging_strategy="steps",
        logging_steps=10,
        logging_dir=task_output,
        logging_first_step=True,
        group_by_length=True,
        torch_compile=False,
        # Eval params.
        eval_strategy="steps",
        eval_steps=500,
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_val_loss",
        report_to="none",
        save_strategy="steps",
        save_steps=500,
    )

    n_eval_train = int(0.05 * len(t_ds["train"]))
    chat_end_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=t_ds["train"],
        eval_dataset={
            "train": t_ds["train"].select(list(range(n_eval_train))),
            "val": t_ds["val"],
            "phrasebank": t_ds["phrasebank"],
        },
        compute_metrics=partial(compute_metrics_with_eos, eos_token_id=chat_end_token_id),
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    )

    trainer.train()
Results
We compare the accuracy of our finetuned Qwen3 0.6B model against FinBERT and a variety of LLMs prompted for zero-shot classification, on both the out-of-sample validation set from our 100,000-sample dataset and the Financial PhraseBank dataset by Malo, P. et al. (2014). The latter contains 4,840 financial news snippets labeled for sentiment by human experts.

Although GPT-5.1 topped us by a margin, it’s orders of magnitude more expensive, so in practical terms it’s not feasible.

The way we calculated price here is:
- For the LLMs: sum the input and output token cost per million in a ratio of 10:1. This reflects the relationship between input and output length when the LLMs are prompted with our labeling prompt.
- For the finetuned model: since Qwen3 0.6B is not even quoted on OpenRouter, we estimated the cost at a very conservative ratio of 100:1, because the output is essentially two tokens: the label and the EOS token.
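One way to read that weighting as a single effective price per million tokens; a sketch with made-up numbers, not actual OpenRouter prices:

def blended_price(input_per_m: float, output_per_m: float, in_out_ratio: float = 10.0) -> float:
    # Weight the input price by the assumed input:output token ratio, then normalize.
    return (in_out_ratio * input_per_m + output_per_m) / (in_out_ratio + 1)

# Illustrative comparison: a zero-shot LLM at 10:1 vs the finetuned model at 100:1.
print(blended_price(0.20, 0.50))         # zero-shot labeler
print(blended_price(0.05, 0.10, 100.0))  # finetuned Qwen3 0.6B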
Closing thoughts
- “Look at the data” is still an important and relevant skill no matter what level you’re at.
- Relabeling on datasets with generated labels is an excellent idea, and can be further automated and improved.
- We’ve been duped into believing causal language models are only useful for text generation tasks, chat in particular; however, they are very useful once you understand them.
- In the spirit of the festive season we’ve open sourced all the datasets and models on HuggingFace… share with us what you build!
- Financial Sentiment:
  - NOSIBLE Financial Sentiment dataset
  - NOSIBLE Financial Sentiment v1.1 Base model
- Forward Looking:
  - NOSIBLE Forward-Looking dataset
  - NOSIBLE Forward-Looking v1.1 Base model
- Prediction:
  - NOSIBLE Prediction dataset
  - NOSIBLE Prediction v1.1 Base model
Acknowledgments
The team involved in the project includes:
Citations
- FinBERT: Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063
- Qwen3: Qwen Team (2025). Qwen3 Technical Report. arXiv:2505.09388
- Qwen3Guard: Qwen Team (2025). Qwen3Guard Technical Report. arXiv:2510.14276
- Financial PhraseBank: Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P. (2014). Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts. Journal of the Association for Information Science and Technology, 65(4), 782-796.
Datasets:
- NOSIBLE Financial Sentiment: Nosible Inc.
- Financial PhraseBank: Malo, P. et al. (2014).