The Familia open source project includes document topic inference tools, semantic matching calculation tools, and three topic models trained on industrial-grade corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE). It lets users carry out research and build applications such as text classification, text clustering, and personalized recommendation in an out-of-the-box fashion. Because training topic models is costly and few open source topic models are available, we will gradually release topic models for multiple vertical domains trained on industrial-grade corpora, together with typical industrial usage patterns for these models, to support both research on and deployment of topic model technology.
Recently, we released Familia's LDA models in PaddleHub 1.8. Depending on the training data set, they are available as lda_news, lda_novel, and lda_webpage.
PaddleHub is very convenient to use; we will walk through an example with lda_news.
Before using PaddleHub, you need to install the PaddlePaddle deep learning framework. For installation instructions, please refer to PaddlePaddle Quick Installation.
Install PaddleHub: pip install paddlehub
Install the lda_news model: hub install lda_news
Usage:
import paddlehub as hub
lda_news = hub.Module(name="lda_news")
jsd, hd = lda_news.cal_doc_distance(doc_text1="今天的天气如何,适合出去游玩吗", doc_text2="感觉今天的天气不错,可以出去玩一玩了")
# jsd = 0.003109, hd = 0.0573171
lda_sim = lda_news.cal_query_doc_similarity(query='百度搜索引擎', document='百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。')
# LDA similarity = 0.06826
results = lda_news.cal_doc_keywords_similarity('百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。')
# [{'word': '百度', 'similarity': 0.12943492762349573},
# {'word': '信息', 'similarity': 0.06139783578769882},
# {'word': '找到', 'similarity': 0.055296603463188265},
# {'word': '搜索', 'similarity': 0.04270794098349327},
# {'word': '全球', 'similarity': 0.03773627056367886},
# {'word': '超过', 'similarity': 0.03478658388202199},
# {'word': '相关', 'similarity': 0.026295857219683725},
# {'word': '获取', 'similarity': 0.021313585287833996},
# {'word': '中文', 'similarity': 0.020187103312009513},
# {'word': '搜索引擎', 'similarity': 0.007092890537169911}]
A more detailed introduction and usage instructions can be found here: https://www.paddlepaddle.org.cn/hublist?filter=en_category&value=SemanticModel
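Per the PaddleHub documentation linked above, the module also exposes methods for inspecting the model itself. The following is a minimal sketch; the method names infer_doc_topic_distribution and show_topic_keywords and their rough signatures are taken from the PaddleHub docs as we recall them, so verify them against the current documentation before relying on them.
# Infer the topic distribution of a document (sketch; verify the exact return format).
topic_dist = lda_news.infer_doc_topic_distribution("百度是全球最大的中文搜索引擎")
# topic_dist describes which topics the document is about and with what probability.
# Show the top-5 keywords of a given topic, useful for eyeballing what a topic means.
keywords = lda_news.show_topic_keywords(topic_id=0, k=5)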
For descriptions of the topic models currently included in Familia, please refer to the corresponding papers.
The application paradigm of topic models in industry can be abstracted into two categories: semantic representation and semantic matching.
Semantic representation: topic models reduce a document to its topic dimensions, yielding a semantic representation of the document. These semantic representations can be fed into downstream applications such as text classification, text content analysis, and CTR prediction.
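To make the semantic representation paradigm concrete, here is a minimal sketch that turns each document's inferred topic distribution into a fixed-length feature vector and trains an off-the-shelf classifier on it. The topic count, the infer_doc_topic_distribution call, and its return format are assumptions for illustration, not a documented Familia/PaddleHub contract.
import numpy as np
import paddlehub as hub
from sklearn.linear_model import LogisticRegression

lda_news = hub.Module(name="lda_news")
NUM_TOPICS = 2000  # assumed topic count; check the actual lda_news model

def doc_to_topic_vector(text):
    """Map a document to a dense topic-distribution feature vector (sketch)."""
    vec = np.zeros(NUM_TOPICS)
    # Assumed return format: records carrying a topic id and its probability;
    # adapt the field access to what the module actually returns.
    for item in lda_news.infer_doc_topic_distribution(text):
        vec[item['topic id']] = item['distribution']
    return vec

train_docs = ["百度是全球最大的中文搜索引擎", "感觉今天的天气不错,可以出去玩一玩了"]  # toy corpus
train_labels = [0, 1]  # toy labels, e.g. 0 = tech, 1 = lifestyle
X = np.stack([doc_to_topic_vector(d) for d in train_docs])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)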
Semantic Matching
To compute the semantic matching degree between texts, we provide two similarity calculations depending on the text types (see the sketch after this list):
- short text-long text similarity (e.g. matching a query against a document)
- long text-long text similarity (e.g. comparing two documents)
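A minimal sketch mapping each case onto the PaddleHub calls shown earlier, using the lda_news module as a stand-in:
import paddlehub as hub

lda_news = hub.Module(name="lda_news")

# Short text vs. long text: match a query against a document.
sim = lda_news.cal_query_doc_similarity(query='百度搜索引擎', document='百度是全球最大的中文搜索引擎,致力于让网民更便捷地获取信息。')

# Long text vs. long text: distance between two documents
# (smaller Jensen-Shannon divergence / Hellinger distance = more similar).
jsd, hd = lda_news.cal_doc_distance(doc_text1='今天的天气如何,适合出去游玩吗', doc_text2='感觉今天的天气不错,可以出去玩一玩了')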
For more detailed content and industrial application cases, please refer to the Familia Wiki. For a web-based visualization of the application paradigms above, see Familia-Visualization.
Third-party dependencies include gflags-2.0, glog-0.3.4, and protobuf-2.5.0. The compiler must support C++11 (g++ >= 4.8); Linux and Mac operating systems are supported. By default, running the following script automatically fetches and installs the dependencies.
$ sh build.sh # fetches and installs the third-party dependencies
$ cd model
$ sh download_model.sh
We will gradually release more topic models for different domains to meet the needs of more scenarios.
The demo in Familia includes the following features:
Semantic representation calculation: runs topic inference on the input document to obtain a topic-level, dimensionality-reduced representation of the document.
Semantic matching calculation: computes the similarity between texts, including short text-long text and long text-long text similarity.
Model content presentation: displays the model's topic words and nearest neighbors, giving users an intuitive sense of the model's topics.
For specific demo instructions, please refer to the usage documentation.
If you encounter errors with dynamic libraries such as libglog.so or libgflags.so, add third_party to the LD_LIBRARY_PATH environment variable:
export LD_LIBRARY_PATH=./third_party/lib:$LD_LIBRARY_PATH
A simple FMM (forward maximum matching) word segmentation tool is built into the code; it only performs forward matching against the vocabulary of the topic model. If you need higher segmentation and semantic accuracy, we recommend using a commercial word segmentation tool together with its custom-vocabulary feature to import the topic model's vocabulary. A sketch of the FMM idea follows.
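To make the segmentation behavior concrete, here is a minimal, self-contained sketch of forward maximum matching against a topic model's vocabulary. It illustrates the algorithm only and is not Familia's actual implementation; the single-character fallback for out-of-vocabulary text is our assumption.
# Forward maximum matching (FMM) sketch, not Familia's implementation.
def fmm_segment(text, vocab, max_word_len=8):
    """Greedily take the longest vocabulary word starting at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary hit;
        # fall back to a single character if nothing matches (assumption).
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"百度", "搜索引擎", "中文"}
print(fmm_segment("百度是全球最大的中文搜索引擎", vocab))
# ['百度', '是', '全', '球', '最', '大', '的', '中文', '搜索引擎']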
Feel free to submit questions and bug reports via GitHub Issues, or send inquiries by email to { family } at baidu.com.
docker run -d \
  --name familia \
  -e MODEL_NAME=news \
  -p 5000:5000 \
  orctom/familia
MODEL_NAME can be one of news / novel / webpage / weibo.
The Swagger API documentation is then available at http://localhost:5000/swagger/
The following article describes the Familia project and industrial use cases powered by topic modeling; it consolidates and translates the project's Chinese documentation. We recommend citing this article by default.
Di Jiang, Yuanfeng Song, Rongzhong Lian, Siqi Bao, Jinhua Peng, Huang He, Hua Wu. 2018. Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering. arXiv preprint arXiv:1808.03733.
@article{jiang2018familia,
author = {Di Jiang and Yuanfeng Song and Rongzhong Lian and Siqi Bao and Jinhua Peng and Huang He and Hua Wu},
title = {{Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering}},
journal = {arXiv preprint arXiv:1808.03733},
year = {2018}
}
Further Reading: Federated Topic Modeling
Familia is provided under the BSD-3-Clause License.