v2.0.0

Wikipedia2Vec

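Wikipedia2Vec is a tool for obtaining embeddings (vector representations) of words and entities (i.e., concepts that have a corresponding page in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

The tool learns embeddings of words and entities simultaneously and places similar words and entities close to one another in a continuous vector space. Embeddings can be trained with a single command, using a publicly available Wikipedia dump as input.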
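To make "close to one another in a continuous vector space" concrete, here is a minimal sketch that ranks items by cosine similarity. The three-dimensional vectors and the `word:` / `entity:` labels are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real embeddings have far more dimensions.
toy_embeddings = {
    "word:tokyo": [0.9, 0.1, 0.0],
    "entity:Tokyo": [0.8, 0.2, 0.1],
    "word:banana": [0.0, 0.2, 0.9],
}


def nearest(query, embeddings):
    """Return all other keys, ordered from most to least similar to `query`."""
    others = [key for key in embeddings if key != query]
    return sorted(
        others,
        key=lambda key: cosine_similarity(embeddings[query], embeddings[key]),
        reverse=True,
    )


print(nearest("word:tokyo", toy_embeddings))
```

In this toy data the word and the entity for Tokyo point in nearly the same direction, so the entity ranks ahead of the unrelated word.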

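The tool implements the conventional skip-gram model to learn the embeddings of words, and the extension proposed in Yamada et al. (2016) to learn the embeddings of entities.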
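For reference, the conventional skip-gram objective can be written as follows. This is the standard textbook formulation, not a transcription of the tool's source code. The symbols are assumptions of this sketch: a training sequence of T words w_1, ..., w_T, a context window of size c, a vocabulary W, and input and output vectors V and U.

```latex
\mathcal{L}_w = \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t),
\qquad
P(w_{t+j} \mid w_t) =
  \frac{\exp\!\left(\mathbf{V}_{w_t}^{\top} \mathbf{U}_{w_{t+j}}\right)}
       {\sum_{w \in W} \exp\!\left(\mathbf{V}_{w_t}^{\top} \mathbf{U}_{w}\right)}
```

Skip-gram implementations usually avoid computing the full softmax over W and approximate it, for example with negative sampling.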

An empirical comparison between Wikipedia2Vec and existing embedding tools (such as FastText, Gensim, RDF2Vec, and Wiki2vec) is also available.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

Embeddings are learned by running the train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from it:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

The learned embeddings are then written to MODEL_FILE. The train command accepts many optional parameters; see the documentation for details.
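Once MODEL_FILE exists, it can be queried from Python. The sketch below assumes the Python API described in the project documentation (`Wikipedia2Vec.load`, `get_word_vector`, `get_entity_vector`, `get_entity`, `most_similar`); check the signatures against the version you have installed. The helper name `load_and_query` is hypothetical.

```python
def load_and_query(model_file, word, entity_title, top_n=5):
    """Load a trained model, then look up one word, one entity, and the entity's neighbours."""
    # Imported inside the function so the sketch can be read without the package installed.
    from wikipedia2vec import Wikipedia2Vec

    model = Wikipedia2Vec.load(model_file)

    # Vectors for a single word and a single entity (identified by its Wikipedia title).
    word_vector = model.get_word_vector(word)
    entity_vector = model.get_entity_vector(entity_title)

    # The top_n items (words or entities) closest to the entity, as (item, score) pairs.
    neighbours = model.most_similar(model.get_entity(entity_title), top_n)
    return word_vector, entity_vector, neighbours
```

A call might look like `load_and_query("MODEL_FILE", "the", "Scarlett Johansson")`, where the word and the entity title are placeholders for entries that must exist in your trained model.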

Pretrained Embeddings

Pretrained embeddings for 12 languages (English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) are available for download; see the documentation.
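If a downloaded embedding file is in the plain word2vec text format (an assumption; check the download page for the formats actually offered), it can be read without any extra dependencies. The sketch below parses that format from any text file handle. The function name `read_word2vec_text` and the sample data are hypothetical.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'token v1 ... v_dim' rows."""
    count, dim = (int(field) for field in handle.readline().split())
    vectors = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from the right so that tokens containing spaces stay intact.
        parts = line.rsplit(" ", dim)
        token, values = parts[0], parts[1:]
        vectors[token] = [float(value) for value in values]
    if len(vectors) != count:
        raise ValueError(f"header promised {count} rows, found {len(vectors)}")
    return dim, vectors


# Tiny in-memory sample standing in for a real file.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nNew York 0.4 0.5 0.6\n")
dim, vectors = read_word2vec_text(sample)
print(dim, sorted(vectors))
```

For a real file, pass a handle opened in text mode, for example `open(path, encoding="utf-8")`, or `bz2.open(path, "rt", encoding="utf-8")` if the download is bzip2-compressed.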

Use Cases

Wikipedia2Vec has been applied to the following tasks:

Entity linking: Yamada et al., 2016, Eshel et al., 2017, Chen et al., 2019, Poerner et al., 2020, van Hulst et al., 2020.
Named entity recognition: Sato et al., 2017, Lara-Clares and Garcia-Serrano, 2019.
Question answering: Yamada et al., 2017, Poerner et al., 2020.
Entity typing: Yamada et al., 2018.
Text classification: Yamada et al., 2018, Yamada and Shindo, 2019, Alam et al., 2020.
Relation classification: Poerner et al., 2020.
Paraphrase detection: Duong et al., 2018.
Knowledge graph completion: Shah et al., 2019, Shah et al., 2020.
Fake news detection: Singh et al., 2019, Ghosal et al., 2020.
Plot analysis of movies: Papalampidi et al., 2019.
Novel entity discovery: Zhang et al., 2020.
Entity retrieval: Gerritse et al., 2020.
Deepfake detection: Zhong et al., 2020.
Conversational information seeking: Rodriguez et al., 2020.
Query expansion: Rosin et al., 2020.

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.

 @inproceedings{yamada2020wikipedia2vec,
  title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year = {2020},
  publisher = {Association for Computational Linguistics},
  pages = {23--30}
}

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

 @inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}

The text classification model implemented in this example was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

 @inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of The 23rd SIGNLL Conference on Computational Natural Language Learning},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages = {563--573}
}