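Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from ICLR&D, the research lab of the Incorporated Council of Law Reporting for England and Wales. Blackstone was written by Daniel Hoadley.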
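Why are we building Blackstone?
What's special about Blackstone?
Observations and other things worth noting
Installation
Install the library
Install the Blackstone model
About the model
The pipeline
Named-entity recogniser
Text categoriser
Usage
Applying the NER model
Visualising entities
Applying the text categoriser model
Custom pipeline extensions
Abbreviation and long-form definition resolution
Compound case reference detection
Legislation linker
Sentence segmenter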
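The past several years have seen a surge in activity at the intersection of law and technology. However, in the United Kingdom, the overwhelming bulk of that activity has taken place in law firms and other commercial contexts. The consequence is that, notwithstanding the never-ending flurry of development in the legal-informatics space, almost none of the research is made available on an open-source basis.
Moreover, the bulk of research in the UK legal-informatics domain (whether open or closed) has focused on the development of NLP applications for automating contracts and other legal documents that are transactional in nature. This is understandable, because the principal benefactors of legal NLP research in the UK are law firms, and law firms tend not to find it difficult to get their hands on transactional documentation that can be harnessed as training data.
The problem, as we see it, is that legal NLP research in the UK has become over-concentrated on commercial applications. We think it is worthwhile investing in legal NLP research that is openly available and that covers other kinds of legal text, such as judgments, scholarly articles, skeleton arguments and pleadings.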
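Note! It is strongly recommended that you install Blackstone into a virtual environment! For more on virtual environments, see here. Blackstone should be compatible with Python 3.6 and higher.
To install Blackstone, follow these steps:
The first step is to install the library, which at present contains a handful of custom spaCy components. Install the library like so: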
pip install blackstone
The second step is to install the spaCy model. Install the model like so:
pip install https://blackstone-model.s3-eu-west-1.amazonaws.com/en_blackstone_proto-0.0.1.tar.gz
If you are developing Blackstone, you can install from source like so:
pip install --editable .
pip install -r dev-requirements.txt
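After installation, a quick check along the following lines can confirm that both packages are importable. This is a minimal sketch using only the standard library; `check_install` is a hypothetical helper written for this example, not part of Blackstone, and the package names are taken from the install commands above.

```python
import importlib.util
import sys

# Package names taken from the install commands above.
REQUIRED_PACKAGES = ("blackstone", "en_blackstone_proto")


def check_install(packages=REQUIRED_PACKAGES):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        print("Blackstone expects Python 3.6 or higher")
    for name, found in check_install().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Run it inside the same virtual environment that you installed Blackstone into; a "missing" result usually means the wrong environment is active.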
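This is the very first release of Blackstone, and the model is best viewed as a prototype. It is rough around the edges and represents the first step in a larger, ongoing programme of open-source research into NLP on legal texts being carried out by ICLR&D.
With that out of the way, here is a brief rundown of what is happening in the proto model.
The proto model included in this release has the following elements in its pipeline:
Owing to a scarcity of labelled part-of-speech and dependency training data for legal text, the tokenizer, tagger and parser pipeline components have been taken from spaCy's en_core_web_sm model. By and large, these components appear to do a decent job, but it would be good to revisit them with custom training data at some point in the future.
The ner and textcat components are custom components trained especially for Blackstone.
The NER component of the Blackstone model has been trained to detect the following entity types: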
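| Ent | Name | Examples |
|---|---|---|
| CASENAME | Case names | e.g. Smith v Jones, Re Jones |
| CITATION | Citations (unique identifiers for reported and unreported cases) | e.g. (2002) 2 Cr App R 123 |
| INSTRUMENT | Written legal instruments | e.g. Theft Act 1968, European Convention on Human Rights, CPR |
| PROVISION | Unit within a written legal instrument | e.g. section 1, section 2(3) |
| COURT | Court or tribunal | e.g. Court of Appeal, Upper Tribunal |
| JUDGE | References to judges | e.g. Eady J, Lord Bingham of Cornhill |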
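As a small illustration of working with these labels, the sketch below groups (text, label) pairs by entity type. It uses only the standard library and hand-written sample pairs, so it runs without the model; `group_by_label` is a hypothetical helper for this example, not a Blackstone function.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Hand-written sample pairs based on the examples in the table above.
sample = [
    ("Theft Act 1968", "INSTRUMENT"),
    ("section 1", "PROVISION"),
    ("Eady J", "JUDGE"),
    ("European Convention on Human Rights", "INSTRUMENT"),
]
print(group_by_label(sample))
```

With a processed spaCy doc, the same helper could be fed `[(ent.text, ent.label_) for ent in doc.ents]`.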
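This release of Blackstone also comes with a text categoriser. In contrast with the NER component (which has been trained to identify tokens and series of tokens of interest), the text categoriser classifies longer spans of text, such as sentences.
The text categoriser has been trained to classify text according to one of five mutually exclusive categories, which are as follows: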
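| Cat | Description |
|---|---|
| AXIOM | The text appears to postulate a well-established principle |
| CONCLUSION | The text appears to make a finding, holding, determination or conclusion |
| ISSUE | The text appears to discuss an issue or question |
| LEGAL_TEST | The text appears to discuss a legal test |
| UNCAT | The text does not fall into one of the four categories above |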
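Because the categories are mutually exclusive, one natural way to consume the categoriser's scores is to take the highest-scoring label. The sketch below does that over a plain dict of label scores and falls back to UNCAT when the best score is weak. The `threshold` parameter and the fallback behaviour are assumptions made for illustration; they are not part of Blackstone.

```python
def top_category(cats, threshold=0.5, fallback="UNCAT"):
    """Return (label, score) for the best category, or the fallback label
    when there are no scores or the best score is below the threshold."""
    if not cats:
        return fallback, 0.0
    label, score = max(cats.items(), key=lambda item: item[1])
    if score < threshold:
        return fallback, score
    return label, score


# Hypothetical scores of the kind a text categoriser might produce.
print(top_category({"AXIOM": 0.1, "CONCLUSION": 0.05, "ISSUE": 0.8}))
```

Whether a threshold is appropriate, and what value to use, depends on how much a wrong label costs in your application.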
Here is an example of how the model is applied to some text taken from paragraph 31 of R (Miller) v Secretary of State for Exiting the European Union (Birnie intervening) [2017] UKSC 5; [2018] AC 61:
import spacy
# Load the model
nlp = spacy.load("en_blackstone_proto")
text = """ 31 As we shall explain in more detail in examining the submission of the Secretary of State (see paras 77 and following), it is the Secretary of State’s case that nothing has been done by Parliament in the European Communities Act 1972 or any other statute to remove the prerogative power of the Crown, in the conduct of the international relations of the UK, to take steps to remove the UK from the EU by giving notice under article 50EU for the UK to withdraw from the EU Treaty and other relevant EU Treaties. The Secretary of State relies in particular on Attorney General v De Keyser’s Royal Hotel Ltd [1920] AC 508 and R v Secretary of State for Foreign and Commonwealth Affairs, Ex p Rees-Mogg [1994] QB 552; he contends that the Crown’s prerogative power to cause the UK to withdraw from the EU by giving notice under article 50EU could only have been removed by primary legislation using express words to that effect, alternatively by legislation which has that effect by necessary implication. The Secretary of State contends that neither the ECA 1972 nor any of the other Acts of Parliament referred to have abrogated this aspect of the Crown’s prerogative, either by express words or by necessary implication.
"""
# Apply the model to the text
doc = nlp(text)
# Iterate through the entities identified by the model
for ent in doc.ents:
    print(ent.text, ent.label_)
>>> European Communities Act 1972 INSTRUMENT
>>> article 50EU PROVISION
>>> EU Treaty INSTRUMENT
>>> Attorney General v De Keyser’s Royal Hotel Ltd CASENAME
>>> [1920] AC 508 CITATION
>>> R v Secretary of State for Foreign and Commonwealth Affairs, Ex p Rees-Mogg CASENAME
>>> [1994] QB 552 CITATION
>>> article 50EU PROVISION

spaCy ships with an excellent set of visualisers, including a visualiser for NER predictions. Blackstone comes with a custom colour palette that makes it easier to distinguish entities in the source text when using displacy.
"""
Visualise entities using spaCy's displacy visualiser.
Blackstone has a custom colour palette: `from blackstone.displacy_palette import ner_displacy_options`
"""
import spacy
from spacy import displacy
from blackstone.displacy_palette import ner_displacy_options
nlp = spacy.load("en_blackstone_proto")
text = """
The applicant must satisfy a high standard. This is a case where the action is to be tried by a judge with a jury. The standard is set out in Jameel v Wall Street Journal Europe Sprl [2004] EMLR 89, para 14:
“But every time a meaning is shut out (including any holding that the words complained of either are, or are not, capable of bearing a defamatory meaning) it must be remembered that the judge is taking it upon himself to rule in effect that any jury would be perverse to take a different view on the question. It is a high threshold of exclusion. Ever since Fox’s Act 1792 (32 Geo 3, c 60) the meaning of words in civil as well as criminal libel proceedings has been constitutionally a matter for the jury. The judge’s function is no more and no less than to pre-empt perversity. That being clearly the position with regard to whether or not words are capable of being understood as defamatory or, as the case may be, non-defamatory, I see no basis on which it could sensibly be otherwise with regard to differing levels of defamatory meaning. Often the question whether words are defamatory at all and, if so, what level of defamatory meaning they bear will overlap.”
18 In Berezovsky v Forbes Inc [2001] EMLR 1030, para 16 Sedley LJ had stated the test this way:
“The real question in the present case is how the courts ought to go about ascertaining the range of legitimate meanings. Eady J regarded it as a matter of impression. That is all right, it seems to us, provided that the impression is not of what the words mean but of what a jury could sensibly think they meant. Such an exercise is an exercise in generosity, not in parsimony.”
"""
doc = nlp(text)
# Call displacy and pass `ner_displacy_options` into the options parameter
displacy.serve(doc, style="ent", options=ner_displacy_options)

This produces a rendering of the text in which each detected entity is highlighted using Blackstone's colour palette.
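For readers who want to adjust the colours, displaCy's "ent" style accepts an options dict with an `ents` list (the labels to show) and a `colors` mapping (label to colour). The sketch below builds such a dict. The colour values and the `make_ent_options` helper are invented for illustration and are not the values shipped in `blackstone.displacy_palette`.

```python
# Hypothetical colours; these are NOT the values shipped with Blackstone.
EXAMPLE_COLOURS = {
    "CASENAME": "#da3650",
    "CITATION": "#67328b",
    "INSTRUMENT": "#00a594",
    "PROVISION": "#fcc17b",
    "COURT": "#bd9dbf",
    "JUDGE": "#e5dcc8",
}


def make_ent_options(colours):
    """Build an options dict in the shape displaCy's "ent" style expects."""
    return {"ents": sorted(colours), "colors": dict(colours)}


example_options = make_ent_options(EXAMPLE_COLOURS)
print(example_options["ents"])
```

A dict of this shape can be passed as the `options` argument in place of `ner_displacy_options`.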
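Blackstone's text categoriser generates a predicted categorisation for a doc. The textcat pipeline component has been designed to be applied to individual sentences rather than to a single document consisting of many sentences.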
import spacy
# Load the model
nlp = spacy.load("en_blackstone_proto")
def get_top_cat(doc):
    """
    Function to identify the highest scoring category
    prediction generated by the text categoriser.
    """
    cats = doc.cats
    max_score = max(cats.values())
    max_cats = [k for k, v in cats.items() if v == max_score]
    max_cat = max_cats[0]
    return (max_cat, max_score)
text = """
It is a well-established principle of law that the transactions of independent states between each other are governed by other laws than those which municipal courts administer.
It is, however, in my judgment, insufficient to react to the danger of over-formalisation and “judicialisation” simply by emphasising flexibility and context-sensitivity.
The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A).
"""
# Apply the model to the text
doc = nlp(text)
# Get the sentences in the passage of text
sentences = [sent.text for sent in doc.sents]
# Print the sentence and the corresponding predicted category.
for sentence in sentences:
    doc = nlp(sentence)
    top_category = get_top_cat(doc)
    print(f"\"{sentence}\" {top_category}\n")
>>> "In my judgment, it is patently obvious that cats are a type of dog." ('CONCLUSION', 0.9990500807762146)
>>> "It is a well settled principle that theft is wrong." ('AXIOM', 0.556410014629364)
>>> "The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A)." ('ISSUE', 0.5040785074234009)

In addition to the core model, this proto release of Blackstone comes with three custom components:
Abbreviation detection, which is based on scispaCy's AbbreviationDetector() component and resolves an abbreviated form to its long-form definition, e.g. ECtHR > European Court of Human Rights.
Legislation linker, which attempts to resolve references to provisions to their parent instrument.
Compound case reference detection, which attempts to identify CASENAME and CITATION pairs, enabling a CITATION to be merged into its parent CASENAME.

It is not uncommon for authors of legal documents to abbreviate long-winded terms, with the abbreviation then used in place of the long form throughout the rest of the document. For example,
The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").
The abbreviation detection component in Blackstone seeks to address this by implementing an ever so slightly modified version of scispaCy's AbbreviationDetector() (which is itself an implementation of the approach set out in this paper: https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf). Our implementation still has some issues, but an example of its usage is as follows:
import spacy
from blackstone.pipeline.abbreviations import AbbreviationDetector
nlp = spacy.load("en_blackstone_proto")
# Add the abbreviation pipe to the spacy pipeline.
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
doc = nlp('The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").')
print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
>>> "ECtHR" (7, 10) European Court of Human Rights
>>> "ECHR" (25, 28) European Convention on Human Rights

The compound case reference detection component in Blackstone is designed to marry up CITATION entities with their parent CASENAME entities.
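Common law jurisdictions typically refer to cases through a coupling of a name (usually derived from the names of the parties in the case) and a unique citation identifying where the case has been reported, like so:
Regina v Horncastle [2010] 2 AC 373
Blackstone's NER model attempts to identify the CASENAME and CITATION entities separately. However, it is potentially useful (particularly in the context of information extraction) to pull these entities out as pairs.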
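To illustrate the idea of pairing, the sketch below couples each CASENAME with a CITATION that begins immediately after it. It is a simplified illustration over hand-written spans and should not be read as Blackstone's implementation: the `Span` tuple, the `pair_cases` helper, the `max_gap` parameter and the token offsets are all assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    label: str
    start: int  # index of the first token
    end: int    # index one past the last token
    text: str


def pair_cases(spans: List[Span], max_gap: int = 0) -> List[Tuple[str, str]]:
    """Pair each CASENAME with a CITATION starting within max_gap tokens of its end."""
    ordered = sorted(spans, key=lambda s: s.start)
    pairs = []
    for current, following in zip(ordered, ordered[1:]):
        adjacent = 0 <= following.start - current.end <= max_gap
        if current.label == "CASENAME" and following.label == "CITATION" and adjacent:
            pairs.append((current.text, following.text))
    return pairs


# Illustrative spans; the token offsets are made up for this example.
spans = [
    Span("CASENAME", 0, 3, "Regina v Horncastle"),
    Span("CITATION", 3, 9, "[2010] 2 AC 373"),
    Span("COURT", 12, 14, "Supreme Court"),
]
print(pair_cases(spans))
```

Adjacency is a deliberately strict rule here; raising `max_gap` tolerates intervening tokens such as a comma, at the cost of more false pairings.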
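CompoundCases() applies a custom pipe after the NER and identifies CASENAME/CITATION pairs in two scenarios, the standard form (e.g. Gelmini v Moriggia [1913] 2 KB 549) and the possessive form (e.g. Jones' case [1915] 1 KB 45):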
import spacy
from blackstone.pipeline.compound_cases import CompoundCases
nlp = spacy.load("en_blackstone_proto")
compound_pipe = CompoundCases(nlp)
nlp.add_pipe(compound_pipe)
text = "As I have indicated, this was the central issue before the judge. On this issue the defendants relied (successfully below) on the decision of the High Court in Gelmini v Moriggia [1913] 2 KB 549. In Jones' case [1915] 1 KB 45, the defendant wore a hat."
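doc = nlp(text)
for compound_ref in doc._.compound_cases:
    print(compound_ref)
>>> Gelmini v Moriggia [1913] 2 KB 549
>>> Jones' case [1915] 1 KB 45

Blackstone's legislation linker attempts to couple a reference to a PROVISION with its parent INSTRUMENT. It uses the NER model to identify the presence of an INSTRUMENT and then navigates the dependency tree to identify the child provision.
Once Blackstone has identified a PROVISION:INSTRUMENT pair, it attempts to generate target URLs for both the provision and the instrument on legislation.gov.uk.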
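The shape of those target URLs can be sketched as follows. This is only an illustration of the URL pattern that appears in this document's example output; `build_legislation_urls` is a hypothetical helper, and how Blackstone itself resolves an instrument's year and chapter number is not covered here. The default `ukpga` type is an assumption that fits the Acts used in the examples.

```python
BASE_URL = "http://www.legislation.gov.uk"


def build_legislation_urls(year, chapter, section, legislation_type="ukpga"):
    """Return (provision_url, instrument_url) for a section of an Act."""
    instrument_root = f"{BASE_URL}/{legislation_type}/{year}/{chapter}"
    provision_url = f"{instrument_root}/section/{section}"
    instrument_url = f"{instrument_root}/contents"
    return provision_url, instrument_url


# Section 1 of the Theft Act 1968 (1968, chapter 60).
print(build_legislation_urls(1968, 60, 1))
```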
import spacy
from blackstone.utils.legislation_linker import extract_legislation_relations
nlp = spacy.load("en_blackstone_proto")
text = "The Secretary of State was at pains to emphasise that, if a withdrawal agreement is made, it is very likely to be a treaty requiring ratification and as such would have to be submitted for review by Parliament, acting separately, under the negative resolution procedure set out in section 20 of the Constitutional Reform and Governance Act 2010. Theft is defined in section 1 of the Theft Act 1968"
doc = nlp(text)
relations = extract_legislation_relations(doc)
for provision, provision_url, instrument, instrument_url in relations:
    print(f"\n{provision}\t{provision_url}\t{instrument}\t{instrument_url}")
>>> section 20 http://www.legislation.gov.uk/ukpga/2010/25/section/20 Constitutional Reform and Governance Act 2010 http://www.legislation.gov.uk/ukpga/2010/25/contents
>>> section 1 http://www.legislation.gov.uk/ukpga/1968/60/section/1 Theft Act 1968 http://www.legislation.gov.uk/ukpga/1968/60/contents

Blackstone ships with a custom rule-based sentence segmenter that addresses a range of characteristics inherent in legal texts that tend to baffle out-of-the-box sentence segmentation rules.
This behaviour can be extended by optionally passing a list of spaCy-style Matcher patterns that will explicitly prevent sentence boundary detection inside matches.
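The idea of protecting matched spans from sentence splitting can be illustrated without spaCy. The sketch below is a simplified regex analogy, not Blackstone's segmenter (which works on tokens with spaCy Matcher patterns): a naive splitter breaks on a full stop followed by whitespace and a capital letter, unless the full stop falls inside a protected match. `CITATION_RE`, `BOUNDARY_RE` and `split_sentences` are hypothetical names introduced for this example.

```python
import re

# Hypothetical pattern (not one of Blackstone's CITATION_PATTERNS): a bracketed
# year, an abbreviated report series and a page, e.g. "[2018] Env. L.R. 21".
CITATION_RE = re.compile(r"\[\d{4}\]\s+(?:[A-Z][A-Za-z]*\.\s?)+\d+")

# Naive boundary rule: a full stop, whitespace, then a capital letter.
BOUNDARY_RE = re.compile(r"\.\s+(?=[A-Z])")


def split_sentences(text, protected_patterns=()):
    """Split on naive boundaries, skipping any that fall inside a protected match."""
    protected = [m.span() for p in protected_patterns for m in p.finditer(text)]
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        stop = match.start()  # position of the full stop
        if any(lo <= stop < hi for lo, hi in protected):
            continue  # the full stop belongs to a citation, not a sentence end
        sentences.append(text[start:stop + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "See Smith v Jones [2018] Env. L.R. 21. The central question arises."
print(split_sentences(sample))                 # splits inside the citation
print(split_sentences(sample, [CITATION_RE]))  # keeps the citation intact
```

Without the protected pattern the citation is broken after "Env."; with it, the sample yields two sentences.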
import spacy
from blackstone.pipeline.sentence_segmenter import SentenceSegmenter
from blackstone.rules import CITATION_PATTERNS
nlp = spacy.load("en_blackstone_proto")
# Add the Blackstone sentence_segmenter to the pipeline before the parser
sentence_segmenter = SentenceSegmenter(nlp.vocab, CITATION_PATTERNS)
nlp.add_pipe(sentence_segmenter, before="parser")
doc = nlp(
"""
The courts in this jurisdiction will enforce those commitments when it is legally possible and necessary to do so (see, most recently, R. (on the application of ClientEarth) v Secretary of State for the Environment, Food and Rural Affairs (No.2) [2017] P.T.S.R. 203 and R. (on the application of ClientEarth) v Secretary of State for Environment, Food and Rural Affairs (No.3) [2018] Env. L.R. 21). The central question in this case arises against that background.
"""
)
for sent in doc.sents:
    print(sent.text)

We would like to thank the following people and organisations who have helped us (directly or indirectly) to build this prototype.