The Familia open source project includes document topic inference tools, semantic matching calculation tools, and three topic models trained on industrial-grade corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE). It lets users carry out research and build applications such as text classification, text clustering, and personalized recommendation in an out-of-the-box fashion. Because training topic models is costly and few open source topic models are available, we will gradually release topic models for multiple vertical domains trained on industrial-grade corpora, together with typical industrial usage patterns for these models, to support both research on and deployment of topic model technology.
Recently, we released Familia's LDA models in PaddleHub 1.8. Depending on the training data set, they are available as lda_news, lda_novel, and lda_webpage.
PaddleHub is very convenient to use; we will walk through an example with lda_news.
Before using PaddleHub, you need to install the PaddlePaddle deep learning framework. For installation instructions, please refer to PaddlePaddle Quick Installation.
Install PaddleHub: pip install paddlehub
Install the lda_news model: hub install lda_news
Usage:
import paddlehub as hub
lda_news = hub.Module(name="lda_news")
jsd, hd = lda_news.cal_doc_distance(doc_text1="今天的天气如何,适合出去游玩吗", doc_text2="感觉今天的天气不错,可以出去玩一玩了")
# jsd = 0.003109, hd = 0.0573171
lda_sim = lda_news.cal_query_doc_similarity(query='百度搜索引擎', document='百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。')
# LDA similarity = 0.06826
results = lda_news.cal_doc_keywords_similarity('百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。')
# [{'word': '百度', 'similarity': 0.12943492762349573},
# {'word': '信息', 'similarity': 0.06139783578769882},
# {'word': '找到', 'similarity': 0.055296603463188265},
# {'word': '搜索', 'similarity': 0.04270794098349327},
# {'word': '全球', 'similarity': 0.03773627056367886},
# {'word': '超过', 'similarity': 0.03478658388202199},
# {'word': '相关', 'similarity': 0.026295857219683725},
# {'word': '获取', 'similarity': 0.021313585287833996},
# {'word': '中文', 'similarity': 0.020187103312009513},
# {'word': '搜索引擎', 'similarity': 0.007092890537169911}]
A more detailed introduction and usage instructions can be found here: https://www.paddlepaddle.org.cn/hublist?filter=en_category&value=SemanticModel
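Per the PaddleHub documentation linked above, the module also exposes methods for inspecting the model itself. The following is a minimal sketch; the method names infer_doc_topic_distribution and show_topic_keywords and their rough signatures are taken from the PaddleHub docs as we recall them, so verify them against the current documentation before relying on them.
# Infer the topic distribution of a document (sketch; verify the exact return format).
topic_dist = lda_news.infer_doc_topic_distribution("百度是全球最大的中文搜索引擎")
# topic_dist describes which topics the document is about and with what probability.
# Show the top-5 keywords of a given topic, useful for eyeballing what a topic means.
keywords = lda_news.show_topic_keywords(topic_id=0, k=5)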
For descriptions of the topic models currently included in Familia, please refer to the corresponding papers.
The application paradigm of topic models in industry can be abstracted into two categories: semantic representation and semantic matching.
Semantic representation: topic models reduce a document to its topic dimensions, yielding a semantic representation of the document. These semantic representations can be fed into downstream applications such as text classification, text content analysis, and CTR prediction.
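To make the semantic representation paradigm concrete, here is a minimal sketch that turns each document's inferred topic distribution into a fixed-length feature vector and trains an off-the-shelf classifier on it. The topic count, the infer_doc_topic_distribution call, and its return format are assumptions for illustration, not a documented Familia/PaddleHub contract.
import numpy as np
import paddlehub as hub
from sklearn.linear_model import LogisticRegression

lda_news = hub.Module(name="lda_news")
NUM_TOPICS = 2000  # assumed topic count; check the actual lda_news model

def doc_to_topic_vector(text):
    """Map a document to a dense topic-distribution feature vector (sketch)."""
    vec = np.zeros(NUM_TOPICS)
    # Assumed return format: records carrying a topic id and its probability;
    # adapt the field access to what the module actually returns.
    for item in lda_news.infer_doc_topic_distribution(text):
        vec[item['topic id']] = item['distribution']
    return vec

train_docs = ["百度是全球最大的中文搜索引擎", "感觉今天的天气不错,可以出去玩一玩了"]  # toy corpus
train_labels = [0, 1]  # toy labels, e.g. 0 = tech, 1 = lifestyle
X = np.stack([doc_to_topic_vector(d) for d in train_docs])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)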
Semantic Matching
To compute the semantic matching degree between texts, we provide two similarity calculations depending on the text types (see the sketch after this list):
- short text-long text similarity (e.g. matching a query against a document)
- long text-long text similarity (e.g. comparing two documents)
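A minimal sketch mapping each case onto the PaddleHub calls shown earlier, using the lda_news module as a stand-in:
import paddlehub as hub

lda_news = hub.Module(name="lda_news")

# Short text vs. long text: match a query against a document.
sim = lda_news.cal_query_doc_similarity(query='百度搜索引擎', document='百度是全球最大的中文搜索引擎,致力于让网民更便捷地获取信息。')

# Long text vs. long text: distance between two documents
# (smaller Jensen-Shannon divergence / Hellinger distance = more similar).
jsd, hd = lda_news.cal_doc_distance(doc_text1='今天的天气如何,适合出去游玩吗', doc_text2='感觉今天的天气不错,可以出去玩一玩了')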
For more detailed content and industrial application cases, please refer to the Familia Wiki. For a web-based visualization of the application paradigms above, see Familia-Visualization.
Third-party dependencies include gflags-2.0, glog-0.3.4, and protobuf-2.5.0. The compiler must support C++11 (g++ >= 4.8); Linux and Mac operating systems are supported. By default, running the following script automatically fetches and installs the dependencies.
$ sh build.sh # fetches and installs the third-party dependencies
$ cd model
$ sh download_model.sh
We will gradually release more topic models for different domains to meet the needs of more scenarios.
The demo in Familia includes the following features:
Semantic representation calculation: runs topic inference on the input document to obtain a topic-level, dimensionality-reduced representation of the document.
Semantic matching calculation: computes the similarity between texts, including short text-long text and long text-long text similarity.
Model content presentation: displays the model's topic words and nearest neighbors, giving users an intuitive sense of the model's topics.
For specific demo instructions, please refer to the usage documentation.
If you encounter errors with dynamic libraries such as libglog.so or libgflags.so, add third_party to the LD_LIBRARY_PATH environment variable:
export LD_LIBRARY_PATH=./third_party/lib:$LD_LIBRARY_PATH
A simple FMM (forward maximum matching) word segmentation tool is built into the code; it only performs forward matching against the vocabulary of the topic model. If you need higher segmentation and semantic accuracy, we recommend using a commercial word segmentation tool together with its custom-vocabulary feature to import the topic model's vocabulary. A sketch of the FMM idea follows.
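To make the segmentation behavior concrete, here is a minimal, self-contained sketch of forward maximum matching against a topic model's vocabulary. It illustrates the algorithm only and is not Familia's actual implementation; the single-character fallback for out-of-vocabulary text is our assumption.
# Forward maximum matching (FMM) sketch, not Familia's implementation.
def fmm_segment(text, vocab, max_word_len=8):
    """Greedily take the longest vocabulary word starting at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary hit;
        # fall back to a single character if nothing matches (assumption).
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"百度", "搜索引擎", "中文"}
print(fmm_segment("百度是全球最大的中文搜索引擎", vocab))
# ['百度', '是', '全', '球', '最', '大', '的', '中文', '搜索引擎']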
Feel free to submit questions and bug reports via GitHub Issues, or send inquiries by email to { family } at baidu.com.
docker run -d \
  --name familia \
  -e MODEL_NAME=news \
  -p 5000:5000 \
  orctom/familia
MODEL_NAME can be one of news / novel / webpage / weibo.
The Swagger API documentation is then available at http://localhost:5000/swagger/
The following article describes the Familia project and industrial use cases powered by topic modeling; it consolidates and translates the project's Chinese documentation. We recommend citing this article by default.
Di Jiang, Yuanfeng Song, Rongzhong Lian, Siqi Bao, Jinhua Peng, Huang He, Hua Wu. 2018. Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering. arXiv preprint arXiv:1808.03733.
@article{jiang2018familia,
author = {Di Jiang and Yuanfeng Song and Rongzhong Lian and Siqi Bao and Jinhua Peng and Huang He and Hua Wu},
title = {{Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering}},
journal = {arXiv preprint arXiv:1808.03733},
year = {2018}
}
Further Reading: Federated Topic Modeling
Familia is provided under the BSD-3-Clause License.