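Chinese Sentiment (Positive & Negative) Sentence Classification Dataset - Dataset Details | Download - 集智 Datasets

Published: 2024-11-16 15:38:00

Dataset: Chinese Sentiment (Positive & Negative) Sentence Classification Dataset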

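This content was collected and published by 集智 for reference and study only. Publication does not mean that 集智 endorses the views expressed or vouches for the truthfulness or accuracy of the content. Do not use it for commercial purposes.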

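The tutorial below uses DistilBERT to classify positive and negative sentences (such as "天气晴朗，心情很好"). It aims to be easy to follow, reproducible, and close to real-world use.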

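Sentiment Sentence Classification with DistilBERT

Sentiment analysis is a core task in natural language processing (NLP). This article builds a lightweight sentiment classifier on top of DistilBERT that labels a sentence as positive or negative. For example, "天气晴朗，心情很好" ("The weather is clear and I feel great") is classified as positive.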

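1. Environment Setup

Install the required libraries (openpyxl is the engine pandas uses to read .xlsx files):

pip install transformers torch pandas scikit-learn openpyxl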

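2. Data Preparation

Assume you have two Excel workbooks, 积极情感数据集.xlsx (positive sentences) and 消极情感数据集.xlsx (negative sentences), each with the following fields: id, 积极（消极）情感内容 (the positive or negative sentence text), 内容分词 (the word-segmented text), and 中文拼音 (the pinyin transcription).

Load and merge the data: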

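import pandas as pd

# Load the positive and negative sentiment data
positive_data = pd.read_excel("积极情感数据集.xlsx")
negative_data = pd.read_excel("消极情感数据集.xlsx")

# Assumption: the text column is named differently in each file.
# Replace these names with the actual column names in your workbooks.
positive_data = positive_data.rename(columns={"积极情感内容": "text"})
negative_data = negative_data.rename(columns={"消极情感内容": "text"})

# Label the data: 1 = positive, 0 = negative
positive_data["label"] = 1
negative_data["label"] = 0

# Merge the two datasets and drop rows with missing text
data = pd.concat([positive_data, negative_data], ignore_index=True)
data = data.dropna(subset=["text"])

# Extract the texts and labels
texts = data["text"].astype(str).values
labels = data["label"].values

# Inspect the data
print(f"Number of samples: {len(data)}")
print(data.head())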
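
Before tokenizing, it can help to check the merged table for missing text, duplicated sentences, and class balance. The sketch below is one possible sanity check. The helper name `summarize_labels`, the column names `text` and `label`, and the toy rows are all assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

def summarize_labels(df: pd.DataFrame, text_col: str = "text", label_col: str = "label") -> dict:
    """Return basic sanity-check statistics for a labelled text dataset."""
    return {
        "rows": len(df),
        "missing_text": int(df[text_col].isna().sum()),
        "duplicate_text": int(df[text_col].duplicated().sum()),
        "label_counts": df[label_col].value_counts().to_dict(),
    }

# Hypothetical toy rows standing in for the merged Excel data
toy = pd.DataFrame({
    "text": ["天气晴朗，心情很好", "今天的工作很糟糕", "天气晴朗，心情很好", None],
    "label": [1, 0, 1, 0],
})
summary = summarize_labels(toy)
print(summary)
```

A strongly skewed `label_counts` would suggest looking at per-class metrics rather than accuracy alone.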

3. Data Preprocessing

Use the DistilBERT tokenizer from the Hugging Face transformers library to convert the sentences into the format the model accepts.

from transformers import DistilBertTokenizer

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Tokenize a batch of sentences, padding or truncating each one to max_length tokens
def tokenize_texts(texts, max_length=128):
    return tokenizer(
        list(texts),
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )

# Tokenize the sentences
tokenized_texts = tokenize_texts(texts)
print(tokenized_texts.keys())  # includes input_ids and attention_mask
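
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and a real tokenizer also preserves the special tokens at both ends when it truncates, which this sketch ignores.

```python
from typing import List, Tuple

def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Illustrative only: cut or pad a token ID list to exactly max_length."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets truncated
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids` in the later steps.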

4. Splitting the Dataset

Use scikit-learn to split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
import torch

# Split input IDs, attention masks and labels in one call so the rows stay aligned;
# stratify keeps the positive/negative ratio the same in both sets
X_train, X_test, X_train_masks, X_test_masks, y_train, y_test = train_test_split(
    tokenized_texts["input_ids"],
    tokenized_texts["attention_mask"],
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

# Convert the labels to tensors
y_train = torch.tensor(y_train)
y_test = torch.tensor(y_test)
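
If the two classes are not the same size, a stratified split keeps their ratio equal in the training and test sets. The following sketch shows this on made-up, imbalanced toy labels (the 80/20 ratio and the variable names are assumptions for illustration). It also checks that features and labels stay aligned when they are split in a single call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 80 negative (0) and 20 positive (1)
toy_labels = np.array([0] * 80 + [1] * 20)
# Each feature row stores its own original index so alignment can be verified
toy_features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())

# Every test row still carries the label of the original row it came from
aligned = bool((toy_labels[X_te[:, 0]] == y_te).all())
print("rows aligned:", aligned)
```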

5. Building the DistilBERT Model

Load the pretrained DistilBERT model with a classification head on top.

from torch.optim import AdamW
from transformers import DistilBertForSequenceClassification

# Load DistilBERT with a 2-class classification head (positive and negative)
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=2)

# Define the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)
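
For intuition about what a classification head does, the sketch below maps the encoder's first-token vector to one logit per class. It is a toy stand-in, not the library's code: the class name `ToyClassificationHead`, the layer names, the dropout value, and the random input are all assumptions chosen for illustration.

```python
import torch
from torch import nn

class ToyClassificationHead(nn.Module):
    """Illustrative head: first-token vector -> hidden layer -> one logit per class."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.2):
        super().__init__()
        self.pre_classifier = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        first_token = hidden_states[:, 0]                  # (batch, hidden)
        x = torch.relu(self.pre_classifier(first_token))
        x = self.dropout(x)
        return self.classifier(x)                          # (batch, num_labels)

# Random stand-in for encoder output: 3 sentences, 128 tokens, 768 features
fake_hidden_states = torch.randn(3, 128, 768)
head = ToyClassificationHead()
head.eval()  # disable dropout for this demonstration
logits = head(fake_hidden_states)
print(logits.shape)  # torch.Size([3, 2])
```

Because a head like this starts from random weights, it has to be fine-tuned on labelled data before its predictions mean anything.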

6. Training the Model

Train the model with a PyTorch DataLoader and a training loop.

from torch.utils.data import DataLoader, TensorDataset

# Build the dataset and DataLoader
train_data = TensorDataset(X_train, X_train_masks, y_train)
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)

# Train on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 3
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        b_input_ids, b_input_mask, b_labels = tuple(t.to(device) for t in batch)

        # Reset the gradients
        optimizer.zero_grad()

        # Forward pass (passing labels makes the model return the loss)
        outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1} | Loss: {total_loss / len(train_loader):.4f}")
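
To check the loop logic without downloading DistilBERT, the same zero_grad / forward / backward / step pattern can be run on a tiny stand-in classifier. Everything in this sketch is hypothetical: the random toy features, the labelling rule, the linear model, and the learning rate are chosen only so that the example runs in a moment on a CPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: the label is 1 when the feature sum is positive
features = torch.randn(64, 4)
targets = (features.sum(dim=1) > 0).long()
loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

toy_model = nn.Linear(4, 2)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(5):
    toy_model.train()
    total_loss = 0.0
    for batch_features, batch_targets in loader:
        toy_optimizer.zero_grad()                                  # reset gradients
        loss = loss_fn(toy_model(batch_features), batch_targets)   # forward pass
        loss.backward()                                            # backward pass
        toy_optimizer.step()                                       # parameter update
        total_loss += loss.item()
    epoch_losses.append(total_loss / len(loader))
    print(f"Epoch {epoch + 1} | Loss: {epoch_losses[-1]:.4f}")
```

If the average loss does not fall from epoch to epoch on data this simple, the loop itself is the likely culprit rather than the model.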

7. Evaluating the Model

Evaluate the model's performance on the test set.

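from sklearn.metrics import accuracy_score, classification_report

# Evaluate in batches so the whole test set is not pushed through the model at once
test_loader = DataLoader(TensorDataset(X_test, X_test_masks), batch_size=32)
predictions = []
model.eval()
with torch.no_grad():
    for b_input_ids, b_input_mask in test_loader:
        outputs = model(input_ids=b_input_ids.to(device), attention_mask=b_input_mask.to(device))
        predictions.extend(torch.argmax(outputs.logits, dim=1).cpu().tolist())

# Compute the accuracy
y_true = y_test.numpy()
accuracy = accuracy_score(y_true, predictions)
print(f"Test accuracy: {accuracy:.4f}")

# Print the classification report
print("Classification report:\n", classification_report(y_true, predictions))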
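
The precision, recall and F1 columns of the classification report can be reproduced by hand from a confusion matrix. The sketch below does this for the positive class on made-up labels and predictions (`toy_true` and `toy_pred` are hypothetical, not results from the model) and compares the result with scikit-learn's own computation.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical ground truth and predictions (1 = positive, 0 = negative)
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # of the sentences predicted positive, how many really are
recall = tp / (tp + fn)      # of the truly positive sentences, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Cross-check against scikit-learn's binary-average computation
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(
    toy_true, toy_pred, average="binary"
)
```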

8. Using the Model

Classify the sentiment of new sentences.

def predict_sentiment(sentence):
    inputs = tokenize_texts([sentence])
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    model.eval()  # make sure dropout is disabled for inference
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        prediction = torch.argmax(outputs.logits, dim=1).item()

    return "positive" if prediction == 1 else "negative"

# Try it on new sentences
sentences = [
    "天气晴朗，心情很好。",  # "The weather is clear and I feel great."
    "今天的工作很糟糕，令人沮丧。"  # "Work was terrible today; it is frustrating."
]

for sentence in sentences:
    print(f"Sentence: {sentence}\nPredicted sentiment: {predict_sentiment(sentence)}\n")

Example output

Given the following input sentences:

"天气晴朗，心情很好。"
"今天的工作很糟糕，令人沮丧。"

The expected output is:

Sentence: 天气晴朗，心情很好。
Predicted sentiment: positive

Sentence: 今天的工作很糟糕，令人沮丧。
Predicted sentiment: negative
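
A bare label hides how sure the model is. One option is to apply softmax to the logits and report the probability of the chosen class as well. In this sketch the helper `logits_to_prediction`, the label order (negative, positive), and the example logits are assumptions for illustration, not values produced by the trained model.

```python
import torch

def logits_to_prediction(logits: torch.Tensor, labels=("negative", "positive")):
    """Turn one sentence's logits into a (label, probability) pair."""
    probs = torch.softmax(logits, dim=-1)    # logits -> probabilities that sum to 1
    confidence, index = probs.max(dim=-1)    # most likely class and its probability
    return labels[index.item()], confidence.item()

# Hypothetical logits for one sentence, ordered as (negative, positive)
label, confidence = logits_to_prediction(torch.tensor([-1.0, 2.0]))
print(f"{label} ({confidence:.4f})")
```

Predictions with a probability close to 0.5 could be routed to manual review instead of being trusted outright.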

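Key Technical Points

Lightweight model: DistilBERT is a distilled version of BERT. Its authors report roughly 40% fewer parameters and about 60% faster inference while retaining about 97% of BERT's language-understanding performance, which makes it a good fit for sentence classification.
Multilingual support: the distilbert-base-multilingual-cased checkpoint covers Chinese, so it can be used to analyze Chinese sentences.
Text feature extraction: the pretrained model's context-aware representations capture the sentiment cues in a sentence.
Reproducibility: every step from data loading to training is scripted, and the train/test split uses a fixed random_state. Results can still vary between runs unless the PyTorch random seed is also fixed.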
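
On the reproducibility point, a fixed `random_state` only covers the data split. The shuffling in the DataLoader and the random initialization of the classification head depend on other random number generators. A minimal sketch of seeding them is shown below. The helper name `set_seed` and the seed value are assumptions, and bit-identical results on a GPU may need further settings that are not covered here.

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds the CUDA generators when CUDA is in use

# The same seed gives the same random numbers
set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

Calling such a helper once, before the model is loaded and the DataLoader is built, is usually enough for a script like this one.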

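With this guide you can quickly build a DistilBERT-based sentence sentiment classifier, and the same workflow carries over to other sentiment analysis tasks.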

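Semantic Extraction Datasets

Datasets of this kind usually contain annotated text in which specific entities or concepts, such as person names, organizations, and dates, are marked. They are used to train models to extract key information from free text, helping the models understand the deeper meaning of the text and pull useful information out of it.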
