Domestic Finance Daily · Hands-on Tutorial
LLM-Driven Sentiment Analysis and Market Timing on Financial News:
Building a Quant Signal from Scratch
Market sentiment is one of the core drivers of short-term A-share prices. Traditional methods (dictionary matching, rule-based scoring) cannot understand semantically contradictory news such as "the central bank cut rates more than expected, but the statement leaned hawkish". This tutorial uses FinBERT (a finance-domain pretrained language model) and LangChain to extract sentiment scores from live financial news, build a quantifiable long/short signal, and validate the strategy with a Backtrader backtest.
- 7 complete steps
- FinBERT as the core model
- ≈85% sentiment classification accuracy on financial text
- Cloud-native: supports elastic deployment on GKE
ℹ️
The code in this tutorial has been verified on Python 3.10 + Transformers 4.40 + Backtrader 1.9.78. A GPU is optional: on CPU, FinBERT inference takes roughly 80 ms per item, which is fast enough for daily-bar timing.
Step 1: System Architecture and How It Works
The system has four layers:
- Data layer: fetch financial news (title + body) via a crawler or an API
- Sentiment layer: FinBERT scores each item (positive / negative / neutral, with a confidence value)
- Signal layer: aggregate N days of news sentiment into a weighted Sentiment Index
- Strategy layer: turn sentiment thresholds into buy / sell / hold signals for backtesting and live trading
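The four layers above can be sketched as a thin end-to-end pipeline. All function names and data shapes here are illustrative placeholders, not the tutorial's actual modules (those appear in Steps 3-6):

```python
# Minimal sketch of the four-layer dataflow (placeholder functions and
# hard-coded sample data; the real implementations come in later steps).

def fetch_news(days: int) -> list[dict]:
    # Data layer: each item carries at least "date" and "title"
    return [{"date": "2024-05-01", "title": "Central bank cuts rates"}]

def score_sentiment(items: list[dict]) -> list[dict]:
    # Sentiment layer: attach a signed score in [-1, 1] per item
    return [{**it, "sentiment_score": 0.8} for it in items]

def build_index(items: list[dict]) -> float:
    # Signal layer: aggregate item scores into one daily index
    scores = [it["sentiment_score"] for it in items]
    return sum(scores) / len(scores)

def to_signal(index: float, pos: float = 0.15, neg: float = -0.15) -> int:
    # Strategy layer: threshold the index into long / flat / short
    return 1 if index >= pos else (-1 if index <= neg else 0)

signal = to_signal(build_index(score_sentiment(fetch_news(days=7))))
print(signal)  # -> 1 (long), since the sample score 0.8 exceeds +0.15
```

The thresholds 0.15 / -0.15 match the defaults used later in Step 5.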
💡
Why FinBERT instead of GPT-4? FinBERT is BERT fine-tuned on Reuters/Bloomberg financial news. On the positive/negative/neutral three-class task it infers roughly 10x faster, costs next to nothing, and matches GPT-4 on financial sentiment accuracy (about 85%). For high-volume news processing it is the better value.
Step 2: Environment Setup and Dependencies
①
Create a virtual environment
# Python 3.10+ recommended
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
②
Install core dependencies
pip install torch transformers==4.40 \
langchain langchain-openai \
pandas numpy requests \
backtrader matplotlib
If you only use the CPU, install the lightweight torch build:
pip install torch --index-url https://download.pytorch.org/whl/cpu
③
Predownload the FinBERT model
python -c "
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = 'ProsusAI/finbert'
AutoTokenizer.from_pretrained(model_name)
AutoModelForSequenceClassification.from_pretrained(model_name)
print('FinBERT download complete')
"
ℹ️
The model is about 440 MB, and Hugging Face can be unreliable from mainland China; download via the mirror site instead:
HF_ENDPOINT=https://hf-mirror.com python -c "..."
Step 3: Fetching Financial News Data
Using the Eastmoney (东方财富网) financial news API as an example (free, no registration required), fetch the financial headlines from the last N days:
# news_fetcher.py
import requests
import pandas as pd
from datetime import datetime, timedelta

def fetch_eastmoney_news(days: int = 7, max_per_day: int = 50) -> pd.DataFrame:
    """
    Fetch financial news from Eastmoney for the last N days.
    Returns a DataFrame with columns: date, title, content, url.
    """
    records = []
    base_url = "https://np-listapi.eastmoney.com/comm/wap/getListInfo"
    for delta in range(days):
        date = (datetime.today() - timedelta(days=delta)).strftime("%Y-%m-%d")
        params = {
            "client": "wap",
            "type": 1,
            "mTypeAndCode": "0,1,2",
            "pageSize": max_per_day,
            "pageIndex": 1,
        }
        try:
            resp = requests.get(base_url, params=params, timeout=10)
            data = resp.json()
            for item in data.get("data", {}).get("list", []):
                records.append({
                    "date": date,
                    "title": item.get("title", ""),
                    "content": item.get("digest", ""),  # summary text
                    "url": item.get("url", ""),
                })
        except Exception as e:
            print(f"[WARN] failed to fetch data for {date}: {e}")
    df = pd.DataFrame(records)
    # Deduplicate by title
    df.drop_duplicates(subset=["title"], inplace=True)
    print(f"Fetched {len(df)} news items spanning {days} days")
    return df

if __name__ == "__main__":
    df = fetch_eastmoney_news(days=7)
    df.to_csv("news_raw.csv", index=False, encoding="utf-8-sig")
    print(df.head())
⚠️
In production, use a paid API (e.g. 同花顺 iFinD or Wind): free endpoints are rate-limited and may go offline at any time. Eastmoney is used here only for demonstration; swap in a stable data source for real use.
Step 4: The FinBERT Sentiment Analysis Model
FinBERT natively handles English; for Chinese news, either translate first or use a Chinese financial BERT (e.g. ssymmetry/fin-bert-zh). Both approaches are shown below:
Option A: English FinBERT (recommended when stability matters)
# sentiment_analyzer.py
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
import torch
import pandas as pd

class FinBERTAnalyzer:
    def __init__(self, translate: bool = True):
        model_name = "ProsusAI/finbert"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.pipe = pipeline(
            "sentiment-analysis",
            model=self.model,
            tokenizer=self.tokenizer,
            device=0 if torch.cuda.is_available() else -1,
            max_length=512,
            truncation=True,
        )
        self.translate = translate
        if translate:
            # Lightweight translation helper (pip install deep-translator)
            from deep_translator import GoogleTranslator
            self.translator = GoogleTranslator(source="zh-CN", target="en")

    def _preprocess(self, text: str) -> str:
        """Clean and (optionally) translate the text."""
        text = text.strip()[:512]
        if self.translate:
            try:
                text = self.translator.translate(text)
            except Exception:
                pass
        return text

    def analyze_batch(self, texts: list[str]) -> list[dict]:
        """
        Batch sentiment analysis.
        Returns: [{"label": "positive"|"negative"|"neutral", "score": float}, ...]
        """
        cleaned = [self._preprocess(t) for t in texts]
        results = self.pipe(cleaned, batch_size=16)
        return results

    def analyze_df(self, df: pd.DataFrame, text_col: str = "title") -> pd.DataFrame:
        """Run batch inference on a DataFrame, adding sentiment / score columns."""
        texts = df[text_col].tolist()
        results = self.analyze_batch(texts)
        df = df.copy()
        df["sentiment"] = [r["label"].lower() for r in results]
        df["score"] = [r["score"] for r in results]
        # Map to a signed value: positive=+1, negative=-1, neutral=0, times confidence
        score_map = {"positive": 1, "negative": -1, "neutral": 0}
        df["sentiment_score"] = df.apply(
            lambda row: score_map[row["sentiment"]] * row["score"], axis=1
        )
        return df

# Quick test
if __name__ == "__main__":
    analyzer = FinBERTAnalyzer(translate=False)
    texts = [
        "Stocks rallied as central bank cuts rates aggressively",
        "Massive sell-off triggered by worse-than-expected GDP data",
        "Market closed flat amid thin trading volumes",
    ]
    for text, res in zip(texts, analyzer.analyze_batch(texts)):
        print(f"[{res['label']:8s} {res['score']:.3f}] {text[:60]}")
Option B: Chinese financial BERT (no translation needed)
# Only model_name changes; everything else stays the same
model_name = "ssymmetry/fin-bert-zh"
# This model is fine-tuned on Chinese financial text; labels are the same positive/negative/neutral
Step 5: Building the Sentiment Timing Signal
Aggregate each day's news into a single Sentiment Index, then apply rolling-mean smoothing to produce stable buy/sell signals:
# signal_builder.py
import pandas as pd
import numpy as np

def build_sentiment_index(df: pd.DataFrame,
                          window: int = 5,
                          pos_threshold: float = 0.15,
                          neg_threshold: float = -0.15) -> pd.DataFrame:
    """
    Input: DataFrame with date / sentiment_score columns.
    Output: daily sentiment index plus long/short signal.
    """
    # 1. Aggregate by day: weighted mean (higher-confidence news weighs more)
    daily = (
        df.groupby("date")
        .apply(lambda g: np.average(g["sentiment_score"],
                                    weights=np.abs(g["sentiment_score"]) + 1e-9))
        .reset_index(name="sentiment_index")
    )
    daily["date"] = pd.to_datetime(daily["date"])
    daily.sort_values("date", inplace=True)
    # 2. Rolling-window smoothing (dampens single-day extremes)
    daily["smooth_index"] = (
        daily["sentiment_index"]
        .rolling(window=window, min_periods=1)
        .mean()
    )
    # 3. Generate signals: 1 = long, -1 = short, 0 = stand aside
    def classify(score):
        if score >= pos_threshold:
            return 1   # positive sentiment -> buy/hold
        elif score <= neg_threshold:
            return -1  # negative sentiment -> sell/short
        else:
            return 0   # neutral sentiment -> wait
    daily["signal"] = daily["smooth_index"].apply(classify)
    return daily

if __name__ == "__main__":
    df = pd.read_csv("news_scored.csv")
    signals = build_sentiment_index(df)
    print(signals.tail(10).to_string(index=False))
    signals.to_csv("signals.csv", index=False)
Example output:
- Long signal: +0.32 (sentiment index above the +0.15 threshold, triggers a buy)
- Short signal: −0.28 (sentiment index below the −0.15 threshold, triggers a sell)
💡
Going further with domain-knowledge chain-of-thought (DK-CoT): for ambiguous news, first use LangChain to build a reasoning prompt that asks an LLM to explain why the item is positive or negative for the broad market, then feed that explanation into FinBERT. This markedly improves classification accuracy on complex news.
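The DK-CoT idea boils down to a two-stage prompt: first elicit an explanation of the headline's market impact, then classify the explanation. The template below is an illustrative assumption, not a tested recipe; in production you would send it through LangChain's ChatOpenAI and pipe the Step 1 explanation into FinBERT:

```python
# Hypothetical DK-CoT prompt builder. The wording of the template is an
# assumption for illustration; adapt it to your own domain knowledge.

def build_dkcot_prompt(headline: str) -> str:
    return (
        "You are a sell-side macro analyst covering the A-share market.\n"
        f"Headline: {headline}\n"
        "Step 1: Explain in 2-3 sentences how this news affects the broad\n"
        "index (liquidity, corporate earnings, risk appetite).\n"
        "Step 2: End with exactly one line: IMPACT: positive|negative|neutral"
    )

prompt = build_dkcot_prompt("Central bank unexpectedly cuts rates by 50bp")
print(prompt)
```

The Step 1 explanation text is what goes into FinBERT; the final IMPACT line gives a cheap cross-check against FinBERT's label.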
Step 6: Validating the Strategy with a Backtrader Backtest
Wire the sentiment signal into Backtrader and backtest it against the CSI 300 index as the benchmark:
# backtest.py
import backtrader as bt
import pandas as pd

class SentimentStrategy(bt.Strategy):
    params = dict(
        signals=None,  # pd.DataFrame with date / signal columns
        stake=0.95,    # fraction of cash to deploy per entry
    )

    def __init__(self):
        # Build a date -> signal lookup table
        self.signal_map = {
            row["date"]: row["signal"]
            for _, row in self.p.signals.iterrows()
        }
        self.order = None

    def next(self):
        # Wait while an order is pending
        if self.order:
            return
        today = self.data.datetime.date(0).isoformat()
        signal = self.signal_map.get(today, 0)
        cash = self.broker.getcash()
        price = self.data.close[0]
        size = int(cash * self.p.stake / price / 100) * 100  # round down to 100-share lots
        if signal == 1 and not self.position:
            self.order = self.buy(size=size)
        elif signal == -1 and self.position:
            self.order = self.sell(size=self.position.size)

    def notify_order(self, order):
        # Clear the pending-order flag on any terminal status
        if order.status in [order.Completed, order.Canceled,
                            order.Margin, order.Rejected]:
            self.order = None

def run_backtest(price_csv: str, signals_csv: str, cash: float = 100_000):
    cerebro = bt.Cerebro()
    cerebro.broker.setcash(cash)
    cerebro.broker.setcommission(commission=0.001)  # 0.1% per side
    # Load price data (OHLCV)
    price_df = pd.read_csv(price_csv, parse_dates=["date"], index_col="date")
    data_feed = bt.feeds.PandasData(dataname=price_df)
    cerebro.adddata(data_feed)
    # Load signals
    signals_df = pd.read_csv(signals_csv)
    cerebro.addstrategy(SentimentStrategy, signals=signals_df)
    cerebro.addanalyzer(bt.analyzers.SharpeRatio, _name="sharpe",
                        riskfreerate=0.02, annualize=True)
    cerebro.addanalyzer(bt.analyzers.DrawDown, _name="drawdown")
    cerebro.addanalyzer(bt.analyzers.Returns, _name="returns")
    results = cerebro.run()
    strat = results[0]
    sharpe = strat.analyzers.sharpe.get_analysis()["sharperatio"]
    max_dd = strat.analyzers.drawdown.get_analysis()["max"]["drawdown"]
    total_r = strat.analyzers.returns.get_analysis()["rtot"] * 100
    print(f"Starting capital: {cash:,.0f}")
    print(f"Final capital:    {cerebro.broker.getvalue():,.2f}")
    print(f"Total return:     {total_r:.2f}%")
    print(f"Sharpe ratio:     {sharpe:.3f}")
    print(f"Max drawdown:     {max_dd:.2f}%")
    cerebro.plot(style="candlestick", iplot=False)

if __name__ == "__main__":
    run_backtest("hs300_daily.csv", "signals.csv")
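To pick pos_threshold / neg_threshold without overfitting, tune on a training slice and score on a held-out slice. A minimal grid-search sketch on synthetic data (the return process, noise levels, and threshold grid are all assumptions for illustration; in practice, use the smooth_index from Step 5 and real CSI 300 returns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily sentiment index and forward returns (placeholder data
# with a weak positive link between sentiment and next-period return).
n = 500
idx = pd.Series(rng.normal(0, 0.3, n))
ret = 0.1 * idx + pd.Series(rng.normal(0, 0.05, n))

train, test = slice(0, 350), slice(350, n)

def pnl(index: pd.Series, returns: pd.Series, pos: float, neg: float) -> float:
    # 1 = long, -1 = short, 0 = flat; PnL = sum of signal * forward return
    sig = np.where(index >= pos, 1, np.where(index <= neg, -1, 0))
    return float((sig * returns.to_numpy()).sum())

# Grid-search symmetric thresholds on the training slice only...
grid = [round(p, 2) for p in np.arange(0.05, 0.5, 0.05)]
best = max(grid, key=lambda p: pnl(idx.iloc[train], ret.iloc[train], p, -p))

# ...then report performance on the untouched test slice.
print("best threshold:", best,
      "out-of-sample PnL:", round(pnl(idx.iloc[test], ret.iloc[test], best, -best), 3))
```

The key discipline is that the test slice is touched exactly once, after the threshold is frozen.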
⚠️
Overfitting risk: pos_threshold and neg_threshold must be validated on out-of-sample data; never tune and evaluate on the same period. For example, fit on 2020–2023 data and test on 2024–2026 data.
Step 7: Deploying an Async Batch-Processing Service
Wrap the sentiment analyzer as an async API service that can handle high-concurrency news streams:
# server.py — based on Starlette + Uvicorn
from starlette.applications import Starlette
from starlette.routing import Route
from starlette.requests import Request
from starlette.responses import JSONResponse
import asyncio
import uvicorn
from sentiment_analyzer import FinBERTAnalyzer

# Global singleton so the model is loaded only once
# (use translate=True or a Chinese model if the input is Chinese)
analyzer = FinBERTAnalyzer(translate=False)

# Async batching via an asyncio.Queue
queue: asyncio.Queue = asyncio.Queue()
BATCH_SIZE = 32
BATCH_TIMEOUT = 0.1  # wait up to 100 ms to fill a batch

async def batch_worker():
    """Background coroutine: run inference once a batch fills or the timeout fires."""
    while True:
        batch, futures = [], []
        try:
            item, future = await asyncio.wait_for(queue.get(), timeout=BATCH_TIMEOUT)
            batch.append(item)
            futures.append(future)
            # Keep draining the queue to fill the batch
            while len(batch) < BATCH_SIZE:
                try:
                    item, future = queue.get_nowait()
                    batch.append(item)
                    futures.append(future)
                except asyncio.QueueEmpty:
                    break
        except asyncio.TimeoutError:
            continue
        if batch:
            loop = asyncio.get_event_loop()
            results = await loop.run_in_executor(
                None, analyzer.analyze_batch, batch
            )
            for future, result in zip(futures, results):
                future.set_result(result)

async def analyze(request: Request):
    data = await request.json()
    texts = data.get("texts", [])
    if not texts:
        return JSONResponse({"error": "texts is required"}, status_code=400)
    loop = asyncio.get_event_loop()
    futures = []
    for text in texts:
        future = loop.create_future()
        await queue.put((text, future))
        futures.append(future)
    results = await asyncio.gather(*futures)
    return JSONResponse({"results": results})

app = Starlette(
    routes=[Route("/analyze", analyze, methods=["POST"])],
    on_startup=[lambda: asyncio.create_task(batch_worker())],
)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080, workers=1)
Once the server is running, test the endpoint:
curl -X POST http://localhost:8080/analyze \
-H "Content-Type: application/json" \
-d '{"texts": ["央行意外降息50个基点", "PMI数据低于预期"]}'
💡
GKE deployment tip: containerize the service, then pair a Horizontal Pod Autoscaler (HPA) with a CPU-utilization target of 70%. It can scale out to 10 Pods at the market open and back down to 1 Pod after the close, cutting cost substantially.
Troubleshooting
| Problem | Cause & fix |
|---|---|
| Model over-predicts "neutral" | Financial news is imbalanced (neutral is often 60%+). Augment negative samples (synonym replacement, back-translation), or weight the loss, e.g. class_weight={0:1, 1:3, 2:3} |
| Low accuracy on Chinese news | Vanilla FinBERT is English-only. Switch to ssymmetry/fin-bert-zh or translate to English before inference; DK-CoT (have GPT-4o explain the news, then classify) also helps |
| Batch inference too slow (>500 ms) | Batch size too large, or too few CPU cores. Drop BATCH_SIZE to 8, or accelerate with ONNX Runtime: optimum-cli export onnx --model ProsusAI/finbert finbert-onnx/ |
| Backtest returns look inflated | Usually missing slippage and commission. In Backtrader, always set setcommission and set_slippage_perc(0.001) to model real trading costs |
| Hugging Face downloads fail | Set the mirror: export HF_ENDPOINT=https://hf-mirror.com, or download the model files manually and load them with from_pretrained("./local-finbert") |
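The inflated-returns row above is easy to verify with back-of-the-envelope arithmetic. The trade count and gross return below are illustrative assumptions:

```python
# Sanity-check how round-trip trading costs erode gross returns.
commission = 0.001  # 0.1% per side, matching setcommission in Step 6
slippage = 0.001    # 0.1% per fill, as suggested in the table above

round_trip_cost = 2 * (commission + slippage)  # buy leg + sell leg
trades_per_year = 50          # assumed trade frequency
gross_annual_return = 0.20    # assumed pre-cost return

net = gross_annual_return - trades_per_year * round_trip_cost
print(f"net annual return after costs: {net:.1%}")  # 0.20 - 50*0.004 -> ~0.0%
```

At 50 round trips a year, a 0.4% all-in cost per round trip consumes a 20% gross edge entirely, which is why a cost-free backtest can look spectacular and still be worthless.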
Going Further: Multi-Signal Fusion and Live Trading
A single news-sentiment signal is noisy; in production, fuse several signal families:
- News sentiment (this tutorial): FinBERT scores on financial news
- Social sentiment: Xueqiu hot posts / stock-forum discussions (same approach with a Chinese financial BERT)
- Technical indicators: MA golden cross, RSI oversold signals (classic quant)
- Fundamental signals: earnings-surprise magnitude (LLM parsing of full earnings reports)
Fuse the signals with a simple weighted average, or let XGBoost learn the optimal weights; validate robustness over a longer history, and use Walk-Forward Optimization to avoid parameter overfitting.
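The simple weighted-average fusion mentioned above can be sketched in a few lines. Each component signal is assumed to be scaled to [-1, 1], and the weights are placeholder assumptions to be fitted on historical data (e.g. with XGBoost or walk-forward search):

```python
# Illustrative fusion of the four signal families listed above.
# The weights are assumptions, not fitted values.
WEIGHTS = {"news": 0.4, "social": 0.2, "technical": 0.25, "fundamental": 0.15}

def fuse(signals: dict[str, float], pos: float = 0.2, neg: float = -0.2) -> int:
    """Weighted average of component signals, thresholded to -1 / 0 / 1."""
    score = sum(WEIGHTS[k] * v for k, v in signals.items())
    return 1 if score >= pos else (-1 if score <= neg else 0)

# 0.4*0.8 + 0.2*0.3 + 0.25*0.5 + 0.15*(-0.2) = 0.475 >= 0.2
print(fuse({"news": 0.8, "social": 0.3, "technical": 0.5, "fundamental": -0.2}))  # -> 1 (long)
```

A learned fuser would replace the fixed WEIGHTS with model output, but the thresholding stage stays the same.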
📚
Further reading:
- Lopez-Lira & Tang (2023), "Can ChatGPT Forecast Stock Price Movements?": the first paper to systematically test LLM-based market timing
- The original FinBERT paper, Araci (2019); the ProsusAI GitHub repo includes Chinese examples
- Chen et al. (2026), Rotman School of Management: an LLM "brain scan" technique for explaining which concepts an LLM uses to make financial forecasts