RoBERTa Legal Portuguese

This repository provides resources for the paper RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese.

Tip

Check out the RoBERTa Legal Portuguese collection!

Corpora

We compiled two main corpora for pre-training:

LegalPT, a Portuguese legal corpus.
CrawlPT, a general-domain Portuguese corpus used for comparison.

Corpus	Domain	Tokens (B)	Size (GiB)
LegalPT	legal	22.5	125.1
CrawlPT
brWaC	general	2.7	16.3
CC100 (PT)	general	8.4	49.1
OSCAR-2301 (PT)	general	18.1	97.8

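Deduplication was performed with the MinHash algorithm and Locality Sensitive Hashing, following the approach of Lee et al. (2022). We used 5-grams and a signature of size 256, and treated two documents as identical when their Jaccard similarity exceeded 0.7.

LegalPT (deduplicated)
CrawlPT (deduplicated)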
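As an illustration of the deduplication settings described above, here is a minimal standard-library sketch of MinHash with LSH banding. It is not the authors' pipeline: the word-level tokenization, the salted BLAKE2b hash family, the 32 x 8 banding scheme, and all function names are assumptions made for this example. Only the 5-gram shingles, the signature size of 256, and the 0.7 threshold come from the text.

```python
import hashlib
from itertools import combinations

NGRAM = 5            # 5-gram shingles, as stated in the text
NUM_PERM = 256       # signature size, as stated in the text
THRESHOLD = 0.7      # Jaccard similarity threshold, as stated in the text
BANDS, ROWS = 32, 8  # assumption: LSH banding (32 * 8 = 256), not from the source


def shingles(text, n=NGRAM):
    """Return the set of word n-grams (word-level tokens are an assumption)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, item):
    """One member of a salted hash family, standing in for a random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash value under each of num_perm hashes."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicate_pairs(docs):
    """Return index pairs whose estimated Jaccard similarity exceeds THRESHOLD."""
    sigs = [minhash(shingles(d)) for d in docs]
    buckets = {}
    for idx, sig in enumerate(sigs):
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets.setdefault(key, []).append(idx)
    # Only documents sharing at least one band bucket are compared.
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))
    return sorted(
        pair for pair in candidates
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) > THRESHOLD
    )
```

Calling `near_duplicate_pairs(list_of_texts)` returns pairs of indices flagged as near-duplicates; one document from each pair could then be dropped. At corpus scale a dedicated library would be a more realistic choice than this pure-Python loop.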

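Datasets

The PortuLex benchmark is a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.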

Dataset	Task	Train	Dev	Test
RRI	CLS	8.26k	1.05k	1.47k
LeNER-Br	NER	7.83k	1.18k	1.39k
UlyssesNER-Br	NER	3.28k	489	524
FGV-STF	NER	415	60	119

Models

Our models were pre-trained in four different configurations:

Only on brWaC (RoBERTaTimbau-base).
Only on the LegalPT corpus (RoBERTaLegalPT-base).
Only on the CrawlPT corpus (RoBERTaCrawlPT-base).
On the combination of both corpora (RoBERTaLexPT-base).

Macro F1-scores (%) of multiple models evaluated on the PortuLex benchmark test splits:

Model	LeNER	UlyNER-PL	FGV-STF	RRIP	Average (%)
		Coarse/Fine	Coarse
BERTimbau-base	88.34	86.39/83.83	79.34	82.34	83.78
BERTimbau-large	88.64	87.77/84.74	79.71	83.79	84.60
Albertina-PT-BR-base	89.26	86.35/84.63	79.30	81.16	83.80
Albertina-PT-BR-xlarge	90.09	88.36/86.62	79.94	82.79	85.08
BERTikal-base	83.68	79.21/75.70	77.73	81.11	79.99
JurisBERT-base	81.74	81.67/77.97	76.04	80.85	79.61
BERTimbauLaw-base	84.90	87.11/84.42	79.78	82.35	83.20
Legal-XLM-R-base	87.48	83.49/83.16	79.79	82.35	83.24
Legal-XLM-R-large	88.39	84.65/84.55	79.36	81.66	83.50
Legal-RoBERTa-PT	87.96	88.32/84.83	79.57	81.98	84.02
Ours
RoBERTaTimbau-base (reproduction of BERTimbau)	89.68	87.53/85.74	78.82	82.03	84.29
RoBERTaLegalPT-base (trained on LegalPT)	90.59	85.45/84.40	79.92	82.84	84.57
RoBERTaCrawlPT-base (trained on CrawlPT)	89.24	88.22/86.58	79.88	82.80	84.83
RoBERTaLexPT-base (trained on CrawlPT + LegalPT)	90.73	88.56/86.03	80.40	83.22	85.41
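The text does not say how the last column of the table is computed. The reported values are consistent with first averaging the two UlyNER-PL scores (coarse and fine) and then taking the mean over the four datasets. The short sketch below checks that reading against the last row; the function name and the formula itself are an inference from the numbers, not something the source confirms.

```python
def portulex_average(lener, ulyner_coarse, ulyner_fine, fgv_stf, rrip):
    """Assumed aggregation: mean over four datasets, with the two
    UlyNER-PL granularities averaged into a single score first."""
    ulyner = (ulyner_coarse + ulyner_fine) / 2
    return (lener + ulyner + fgv_stf + rrip) / 4


# Last row of the table (CrawlPT + LegalPT configuration).
print(round(portulex_average(90.73, 88.56, 86.03, 80.40, 83.22), 2))  # 85.41
```

The same formula also reproduces the reported averages of other rows, for example 83.78 for the first row of the table.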

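In summary, RoBERTaLexPT consistently achieves the strongest results on legal NLP tasks despite its base size. With sufficient pre-training data, it can surpass larger models. The results highlight that domain-diverse training data matters more than sheer model scale.

Citation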

@inproceedings{garcia-etal-2024-robertalexpt,
    title = "{R}o{BERT}a{L}ex{PT}: A Legal {R}o{BERT}a Model pretrained with deduplication for {P}ortuguese",
    author = "Garcia, Eduardo A. S.  and
      Silva, Nadia F. F.  and
      Siqueira, Felipe  and
      Albuquerque, Hidelberg O.  and
      Gomes, Juliana R. S.  and
      Souza, Ellen  and
      Lima, Eliomar A.",
    editor = "Gamallo, Pablo  and
      Claro, Daniela  and
      Teixeira, Ant{\'o}nio  and
      Real, Livy  and
      Garcia, Marcos  and
      Oliveira, Hugo Gon{\c{c}}alo  and
      Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.propor-1.38",
    pages = "374--383",
}