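Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from the Incorporated Council of Law Reporting for England and Wales' research lab, ICLR&D. Blackstone was written by Daniel Hoadley.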
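- Why are we building Blackstone?
- What's special about Blackstone?
- Observations and other things worth noting
- Installation
  - Install the library
  - Install the Blackstone model
- About the model
  - The pipeline
  - Named-entity recogniser
  - Text categoriser
- Usage
  - Applying the NER model
  - Visualising entities
  - Applying the text categoriser model
- Custom pipeline extensions
  - Abbreviation and long-form definition resolution
  - Compound case reference detection
  - Legislation linker
  - Sentence segmenter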
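The past several years have seen a surge in activity at the intersection of law and technology. In the United Kingdom, however, the bulk of that activity has taken place in law firms and other commercial contexts. The consequence is that, notwithstanding the never-ending flurry of development in legal informatics, almost none of the research is available on an open-source basis.

Moreover, research in the UK legal-informatics domain (whether open or closed) has focused on developing NLP applications for automating contracts and other legal documents that are transactional in nature. This is understandable: the principal benefactors of legal NLP research in the UK are law firms, and law firms tend not to find it difficult to get hold of transactional documents that can be harnessed as training data.

The problem, as we see it, is that legal NLP research in the UK has become over-concentrated on commercial applications. It is worthwhile investing in legal NLP research that covers other kinds of legal text, such as judgments, scholarly articles, skeleton arguments and pleadings.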
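Note! It is strongly recommended that you install Blackstone into a virtual environment. Blackstone should be compatible with Python 3.6 and higher.

To install Blackstone, follow these steps:

The first step is to install the library, which at present contains a handful of custom spaCy components. Install the library like so:

```
pip install blackstone
```

The second step is to install the spaCy model. Install the model like so:

```
pip install https://blackstone-model.s3-eu-west-1.amazonaws.com/en_blackstone_proto-0.0.1.tar.gz
```

If you are developing Blackstone, you can install from source like so:

```
pip install --editable .
pip install -r dev-requirements.txt
```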
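After running the install steps, it can be useful to confirm that both packages are importable from the active environment. The following is a minimal sketch using only the standard library; the helper name `check_installed` is hypothetical, and the sketch assumes the library and model are importable as `blackstone` and `en_blackstone_proto`, the names used elsewhere in this README.

```python
from importlib.util import find_spec


def check_installed(module_names):
    """Return a mapping of top-level module name -> whether it can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


# Assumed import names, taken from the usage examples in this README.
status = check_installed(["blackstone", "en_blackstone_proto"])
for name, is_present in status.items():
    print(f"{name}: {'installed' if is_present else 'missing'}")
```

If either entry is reported as missing, the corresponding `pip install` step has probably been run in a different environment.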
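This is the very first release of Blackstone, and the model is best viewed as a prototype. It is rough around the edges and represents the first step in a larger, ongoing programme of open-source research into NLP on legal texts being carried out by ICLR&D.

With that out of the way, here is a brief rundown of what is happening in the proto model.

The proto model included in this release has the following elements in its pipeline: `tokenizer`, `tagger`, `parser`, `ner` and `textcat`.

Owing to the scarcity of labelled part-of-speech and dependency training data for legal text, the `tokenizer`, `tagger` and `parser` pipeline components have been taken from spaCy's `en_core_web_sm` model. By and large, these components appear to do a decent job, but it would be good to revisit them with custom training data at some point in the future.

The `ner` and `textcat` components are custom components trained especially for Blackstone.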
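The NER component of the Blackstone model has been trained to detect the following entity types:

| Ent | Name | Examples |
|---|---|---|
| CASENAME | Case names | e.g. Smith v Jones, Re Jones |
| CITATION | Citations (unique identifiers for reported and unreported cases) | e.g. (2002) 2 Cr App R 123 |
| INSTRUMENT | Written legal instruments | e.g. Theft Act 1968, European Convention on Human Rights, CPR |
| PROVISION | Unit within a written legal instrument | e.g. section 1, art 2(3) |
| COURT | Court or tribunal | e.g. Court of Appeal, Upper Tribunal |
| JUDGE | References to judges | e.g. Eady J, Lord Bingham of Cornhill |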
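Once entities carry one of the labels above, a common next step is to bucket them by label. The following is a minimal sketch, not part of Blackstone: `group_by_label` is a hypothetical helper that works on plain `(text, label)` pairs, and the sample pairs are copied from the example output shown in this README. With a real `doc`, the pairs could be produced with `((ent.text, ent.label_) for ent in doc.ents)`.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs by label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Sample pairs taken from the NER example output in this README.
sample = [
    ("European Communities Act 1972", "INSTRUMENT"),
    ("article 50EU", "PROVISION"),
    ("EU Treaty", "INSTRUMENT"),
    ("[1920] AC 508", "CITATION"),
    ("article 50EU", "PROVISION"),
]

grouped = group_by_label(sample)
for label, texts in grouped.items():
    print(label, texts)
```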
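This release of Blackstone also comes with a text categoriser. In contrast with the NER component (which has been trained to identify tokens and series of tokens of interest), the text categoriser classifies longer spans of text, such as sentences.

The text categoriser has been trained to classify text according to one of five mutually exclusive categories, which are as follows:

| Cat | Description |
|---|---|
| AXIOM | The text appears to postulate a well-established principle |
| CONCLUSION | The text appears to make a finding, holding, determination or conclusion |
| ISSUE | The text appears to discuss an issue or question |
| LEGAL_TEST | The text appears to discuss a legal test |
| UNCAT | The text does not fall into one of the four categories above |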
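Because the five categories are mutually exclusive, a consumer of the scores will usually keep only the highest-scoring label. The sketch below adds one assumption on top of that: a minimum score below which the prediction is treated as `UNCAT`. The function name, the `0.6` threshold and the sample scores are all hypothetical and are not part of Blackstone; the function takes a plain dict shaped like spaCy's `doc.cats`.

```python
def top_category(cats, min_score=0.6, fallback="UNCAT"):
    """Return (label, score) for the best category, or (fallback, score) if the score is too low."""
    label, score = max(cats.items(), key=lambda item: item[1])
    if score < min_score:
        return fallback, score
    return label, score


# Hypothetical score dicts in the shape of `doc.cats`.
confident = {"AXIOM": 0.0002, "CONCLUSION": 0.999, "ISSUE": 0.0003, "LEGAL_TEST": 0.0003, "UNCAT": 0.0002}
borderline = {"AXIOM": 0.556, "CONCLUSION": 0.2, "ISSUE": 0.1, "LEGAL_TEST": 0.1, "UNCAT": 0.044}

print(top_category(confident))
print(top_category(borderline))
```

Whether a threshold is appropriate, and where to set it, depends on how costly a wrong label is for the application at hand.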
Here is an example of how the model is applied to some text taken from paragraph 31 of R (Miller) v Secretary of State for Exiting the European Union (Birnie intervening) [2017] UKSC 5; [2018] AC 61:

```python
import spacy

# Load the model
nlp = spacy.load("en_blackstone_proto")

text = """ 31 As we shall explain in more detail in examining the submission of the Secretary of State (see paras 77 and following), it is the Secretary of State’s case that nothing has been done by Parliament in the European Communities Act 1972 or any other statute to remove the prerogative power of the Crown, in the conduct of the international relations of the UK, to take steps to remove the UK from the EU by giving notice under article 50EU for the UK to withdraw from the EU Treaty and other relevant EU Treaties. The Secretary of State relies in particular on Attorney General v De Keyser’s Royal Hotel Ltd [1920] AC 508 and R v Secretary of State for Foreign and Commonwealth Affairs, Ex p Rees-Mogg [1994] QB 552; he contends that the Crown’s prerogative power to cause the UK to withdraw from the EU by giving notice under article 50EU could only have been removed by primary legislation using express words to that effect, alternatively by legislation which has that effect by necessary implication. The Secretary of State contends that neither the ECA 1972 nor any of the other Acts of Parliament referred to have abrogated this aspect of the Crown’s prerogative, either by express words or by necessary implication.
"""

# Apply the model to the text
doc = nlp(text)

# Iterate through the entities identified by the model
for ent in doc.ents:
    print(ent.text, ent.label_)

# Output:
# European Communities Act 1972 INSTRUMENT
# article 50EU PROVISION
# EU Treaty INSTRUMENT
# Attorney General v De Keyser’s Royal Hotel Ltd CASENAME
# [1920] AC 508 CITATION
# R v Secretary of State for Foreign and Commonwealth Affairs, Ex p Rees-Mogg CASENAME
# [1994] QB 552 CITATION
# article 50EU PROVISION
```

spaCy ships with an excellent set of visualisers, including a visualiser for NER predictions. Blackstone comes with a custom colour palette that makes it easier to distinguish entities in the source text when using displacy.
```python
"""
Visualise entities using spaCy's displacy visualiser.
Blackstone has a custom colour palette: `from blackstone.displacy_palette import ner_displacy_options`
"""
import spacy
from spacy import displacy
from blackstone.displacy_palette import ner_displacy_options

nlp = spacy.load("en_blackstone_proto")

text = """
The applicant must satisfy a high standard. This is a case where the action is to be tried by a judge with a jury. The standard is set out in Jameel v Wall Street Journal Europe Sprl [2004] EMLR 89, para 14:
“But every time a meaning is shut out (including any holding that the words complained of either are, or are not, capable of bearing a defamatory meaning) it must be remembered that the judge is taking it upon himself to rule in effect that any jury would be perverse to take a different view on the question. It is a high threshold of exclusion. Ever since Fox’s Act 1792 (32 Geo 3, c 60) the meaning of words in civil as well as criminal libel proceedings has been constitutionally a matter for the jury. The judge’s function is no more and no less than to pre-empt perversity. That being clearly the position with regard to whether or not words are capable of being understood as defamatory or, as the case may be, non-defamatory, I see no basis on which it could sensibly be otherwise with regard to differing levels of defamatory meaning. Often the question whether words are defamatory at all and, if so, what level of defamatory meaning they bear will overlap.”
18 In Berezovsky v Forbes Inc [2001] EMLR 1030, para 16 Sedley LJ had stated the test this way:
“The real question in the present case is how the courts ought to go about ascertaining the range of legitimate meanings. Eady J regarded it as a matter of impression. That is all right, it seems to us, provided that the impression is not of what the words mean but of what a jury could sensibly think they meant. Such an exercise is an exercise in generosity, not in parsimony.”
"""

doc = nlp(text)

# Call displacy and pass `ner_displacy_options` into the `options` parameter
displacy.serve(doc, style="ent", options=ner_displacy_options)
```

This serves a rendering of the text in which the detected entities are highlighted using Blackstone's colour palette.
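Blackstone's text categoriser generates a predicted categorisation for a `doc`. The `textcat` pipeline component has been designed to be applied to individual sentences rather than to a single document consisting of many sentences.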
```python
import spacy

# Load the model
nlp = spacy.load("en_blackstone_proto")


def get_top_cat(doc):
    """
    Function to identify the highest scoring category
    prediction generated by the text categoriser.
    """
    cats = doc.cats
    max_score = max(cats.values())
    max_cats = [k for k, v in cats.items() if v == max_score]
    max_cat = max_cats[0]
    return (max_cat, max_score)


text = """
It is a well-established principle of law that the transactions of independent states between each other are governed by other laws than those which municipal courts administer.
It is, however, in my judgment, insufficient to react to the danger of over-formalisation and “judicialisation” simply by emphasising flexibility and context-sensitivity.
The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A).
"""

# Apply the model to the text
doc = nlp(text)

# Get the sentences in the passage of text
sentences = [sent.text for sent in doc.sents]

# Print the sentence and the corresponding predicted category.
for sentence in sentences:
    doc = nlp(sentence)
    top_category = get_top_cat(doc)
    print(f"\"{sentence}\" {top_category}\n")

# Example output; note that the first two sentences shown do not appear in the input above:
# "In my judgment, it is patently obvious that cats are a type of dog." ('CONCLUSION', 0.9990500807762146)
# "It is a well settled principle that theft is wrong." ('AXIOM', 0.556410014629364)
# "The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A)." ('ISSUE', 0.5040785074234009)
```

In addition to the core model, the proto release of Blackstone comes with three custom components:
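- Abbreviation detection: based on the `AbbreviationDetector()` component, this resolves an abbreviated form to its long-form definition, e.g. ECtHR > European Court of Human Rights.
- Compound case reference detection: identifies `CASENAME` and `CITATION` pairs, enabling the `CITATION` to be merged with its parent `CASENAME`.
- Legislation linker: attempts to link a `PROVISION` to its parent `INSTRUMENT`.

It is not uncommon for authors of legal documents to abbreviate terms and then use the abbreviation, rather than the long form, throughout the rest of the document. For example:

> The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").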
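The pattern in that sentence, a long form followed by a parenthesised short form, can be matched with a small amount of string handling. The sketch below is a simplified, stdlib-only rendering of the right-to-left character matching idea described by Schwartz and Hearst; it is not Blackstone's or scispaCy's implementation, it omits the validity checks a production detector needs, and the names `SHORT_FORM`, `best_long_form` and `find_abbreviations` are hypothetical.

```python
import re

# Matches a parenthesised short form such as ("ECtHR") or (ECHR).
SHORT_FORM = re.compile(r'\(\s*["“]?([A-Za-z][A-Za-z0-9]{1,9})["”]?\s*\)')


def best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        char = short_form[s_index].lower()
        if char.isalnum():
            # Walk left until the character matches; the first short-form
            # character must additionally sit at the start of a word.
            while (l_index >= 0 and candidate[l_index].lower() != char) or (
                s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
            ):
                l_index -= 1
            if l_index < 0:
                return None
            l_index -= 1
        s_index -= 1
    # Extend the match back to the beginning of the word it landed in.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Return (short form, long form) pairs for parenthesised abbreviations."""
    results = []
    for match in SHORT_FORM.finditer(text):
        short_form = match.group(1)
        # The search window is capped at min(|A| + 5, |A| * 2) preceding words.
        window = min(len(short_form) + 5, len(short_form) * 2)
        words = text[: match.start()].split()
        candidate = " ".join(words[-window:])
        long_form = best_long_form(short_form, candidate)
        if long_form:
            results.append((short_form, long_form))
    return results


sentence = (
    'The European Court of Human Rights ("ECtHR") is the court ultimately '
    'responsible for applying the European Convention on Human Rights ("ECHR").'
)
abbreviations = find_abbreviations(sentence)
print(abbreviations)
```

Operating on raw strings keeps the sketch short; a pipeline component would instead work on tokens so that the result can be attached to spans in the `doc`.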
The abbreviation detection component in Blackstone seeks to address this by implementing an ever so slightly modified version of scispaCy's `AbbreviationDetector()` (which is itself an implementation of the approach set out in this paper: https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf). Our implementation still has some problems, but an example of its usage is as follows:
```python
import spacy
from blackstone.pipeline.abbreviations import AbbreviationDetector

nlp = spacy.load("en_blackstone_proto")

# Add the abbreviation pipe to the spacy pipeline.
# (spaCy v2-style API: `add_pipe` receives the component object.)
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

doc = nlp('The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").')

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

# Output:
# "ECtHR" (7, 10) European Court of Human Rights
# "ECHR" (25, 28) European Convention on Human Rights
```

The compound case reference detection component in Blackstone is designed to marry up `CITATION` entities with their parent `CASENAME` entities.
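Common law jurisdictions typically refer to case reports through a coupling of a name (usually derived from the names of the parties in the case) and a unique citation that identifies where the case has been reported, like so:

Regina v Horncastle [2010] 2 AC 373

Blackstone's NER model attempts to identify the `CASENAME` and `CITATION` entities separately. However, it is potentially useful (particularly in the context of information extraction) to pull these entities out as pairs.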
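The pairing idea can be illustrated without spaCy. The sketch below is an assumption-laden simplification, not Blackstone's `CompoundCases` implementation: entities are represented by a hypothetical `Span` dataclass with token offsets, the offsets in the sample data are invented for illustration, and a `CASENAME` is paired with a `CITATION` only when the citation starts where the case name ends.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    label: str
    start: int  # index of the first token
    end: int    # index one past the last token


def pair_cases(entities, max_gap=0):
    """Pair each CASENAME with a CITATION that follows it within `max_gap` tokens."""
    ordered = sorted(entities, key=lambda ent: ent.start)
    pairs = []
    for current, following in zip(ordered, ordered[1:]):
        is_pair = current.label == "CASENAME" and following.label == "CITATION"
        if is_pair and 0 <= following.start - current.end <= max_gap:
            pairs.append((current.text, following.text))
    return pairs


# Hypothetical entities; the token offsets are made up for illustration.
entities = [
    Span("High Court", "COURT", 28, 30),
    Span("Gelmini v Moriggia", "CASENAME", 31, 34),
    Span("[1913] 2 KB 549", "CITATION", 34, 40),
    Span("Jones' case", "CASENAME", 42, 45),
    Span("[1915] 1 KB 45", "CITATION", 45, 51),
]

pairs = pair_cases(entities)
for case_name, citation in pairs:
    print(case_name, citation)
```

Raising `max_gap` would tolerate intervening tokens such as a comma between the name and the citation, at the cost of more false pairings.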
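`CompoundCases()` applies a custom pipe after the NER and identifies `CASENAME`/`CITATION` pairs in two scenarios: a case name followed directly by its citation (e.g. Gelmini v Moriggia [1913] 2 KB 549) and the possessive form (e.g. Jones' case [1915] 1 KB 45).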
```python
import spacy
from blackstone.pipeline.compound_cases import CompoundCases

nlp = spacy.load("en_blackstone_proto")

# Add the compound case pipe after the NER (spaCy v2-style API).
compound_pipe = CompoundCases(nlp)
nlp.add_pipe(compound_pipe)

text = "As I have indicated, this was the central issue before the judge. On this issue the defendants relied (successfully below) on the decision of the High Court in Gelmini v Moriggia [1913] 2 KB 549. In Jones' case [1915] 1 KB 45, the defendant wore a hat."
doc = nlp(text)

for compound_ref in doc._.compound_cases:
    print(compound_ref)

# Output:
# Gelmini v Moriggia [1913] 2 KB 549
# Jones' case [1915] 1 KB 45
```

Blackstone's legislation linker attempts to couple a reference to a `PROVISION` with its parent `INSTRUMENT`. It uses the NER model to identify the presence of an `INSTRUMENT` and then navigates the dependency tree to identify the child provision.
Once Blackstone has identified a `PROVISION`:`INSTRUMENT` pair, it attempts to generate target URLs for both the provision and the instrument on legislation.gov.uk.
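The shape of those URLs can be sketched as follows. This is an illustration only and makes no claim about how Blackstone resolves an instrument: the `KNOWN_ACTS` lookup table and the `build_urls` helper are hypothetical, and the legislation type, year and chapter values are read off the example URLs shown in this README.

```python
import re

BASE_URL = "http://www.legislation.gov.uk"

# Hypothetical lookup table: instrument title -> (legislation type, year, chapter).
KNOWN_ACTS = {
    "Constitutional Reform and Governance Act 2010": ("ukpga", 2010, 25),
    "Theft Act 1968": ("ukpga", 1968, 60),
}


def build_urls(provision, instrument, known_acts=KNOWN_ACTS):
    """Return (provision_url, instrument_url), or None if the instrument is unknown."""
    entry = known_acts.get(instrument)
    if entry is None:
        return None
    legislation_type, year, chapter = entry
    root = f"{BASE_URL}/{legislation_type}/{year}/{chapter}"
    # Only plain "section N" references are handled in this sketch.
    match = re.search(r"section\s+(\d+)", provision, flags=re.IGNORECASE)
    provision_url = f"{root}/section/{match.group(1)}" if match else None
    return provision_url, f"{root}/contents"


urls = build_urls("section 20", "Constitutional Reform and Governance Act 2010")
print(urls)
```

A static table obviously does not scale to the whole statute book; it is used here only so that the example runs without network access.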
```python
import spacy
from blackstone.utils.legislation_linker import extract_legislation_relations

nlp = spacy.load("en_blackstone_proto")

text = "The Secretary of State was at pains to emphasise that, if a withdrawal agreement is made, it is very likely to be a treaty requiring ratification and as such would have to be submitted for review by Parliament, acting separately, under the negative resolution procedure set out in section 20 of the Constitutional Reform and Governance Act 2010. Theft is defined in section 1 of the Theft Act 1968"
doc = nlp(text)

relations = extract_legislation_relations(doc)
for provision, provision_url, instrument, instrument_url in relations:
    print(f"\n{provision}\t{provision_url}\t{instrument}\t{instrument_url}")

# Output:
# section 20 http://www.legislation.gov.uk/ukpga/2010/25/section/20 Constitutional Reform and Governance Act 2010 http://www.legislation.gov.uk/ukpga/2010/25/contents
# section 1 http://www.legislation.gov.uk/ukpga/1968/60/section/1 Theft Act 1968 http://www.legislation.gov.uk/ukpga/1968/60/contents
```

Blackstone ships with a custom rule-based sentence segmenter that addresses a range of characteristics inherent in legal texts which tend to baffle out-of-the-box sentence segmentation rules.
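This behaviour can be extended by optionally passing a list of spaCy-style Matcher patterns that will explicitly prevent sentence boundary detection inside matches.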
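The principle of protecting matched spans from being split can be shown with plain regular expressions. The sketch below is an analogy rather than Blackstone's segmenter: the real component works with spaCy Matcher patterns over tokens, whereas the `BOUNDARY` rule, the `PROTECTED_PATTERNS` list and the `split_sentences` function here are hypothetical stand-ins operating on raw text.

```python
import re

# Candidate boundary: whitespace after ., ? or ! and before a capital, "(" or "[".
BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z(\[])")

# Hypothetical regex stand-ins for Matcher patterns (assumption).
PROTECTED_PATTERNS = [
    re.compile(r"\bR\. \(on the application of"),         # e.g. R. (on the application of ...)
    re.compile(r"\[\d{4}\] (?:[A-Z][a-z]*\.? ?)+\d+"),     # e.g. [2018] Env. L.R. 21
]


def split_sentences(text, protected_patterns):
    """Split on candidate boundaries, skipping any that fall inside a protected match."""
    protected = [m.span() for p in protected_patterns for m in p.finditer(text)]
    sentences, start = [], 0
    for boundary in BOUNDARY.finditer(text):
        position = boundary.start()
        if any(span_start < position < span_end for span_start, span_end in protected):
            continue  # the boundary sits inside a citation or similar span
        sentences.append(text[start:position].strip())
        start = boundary.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = (
    "The court followed R. (on the application of ClientEarth) v Secretary of State "
    "(No.3) [2018] Env. L.R. 21. The central question arises against that background."
)

for sentence in split_sentences(text, PROTECTED_PATTERNS):
    print(sentence)
```

Calling `split_sentences(text, [])` on the same input shows the problem being solved: without the protected patterns, the abbreviations inside the case name and citation are treated as sentence endings.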
```python
import spacy
from blackstone.pipeline.sentence_segmenter import SentenceSegmenter
from blackstone.rules import CITATION_PATTERNS

nlp = spacy.load("en_blackstone_proto")

# add the Blackstone sentence_segmenter to the pipeline before the parser
sentence_segmenter = SentenceSegmenter(nlp.vocab, CITATION_PATTERNS)
nlp.add_pipe(sentence_segmenter, before="parser")

doc = nlp(
    """
The courts in this jurisdiction will enforce those commitments when it is legally possible and necessary to do so (see, most recently, R. (on the application of ClientEarth) v Secretary of State for the Environment, Food and Rural Affairs (No.2) [2017] P.T.S.R. 203 and R. (on the application of ClientEarth) v Secretary of State for Environment, Food and Rural Affairs (No.3) [2018] Env. L.R. 21). The central question in this case arises against that background.
"""
)

for sent in doc.sents:
    print(sent.text)
```

We would like to thank the following people and organisations who have helped us (directly or indirectly) to build this prototype.