Tao Hui http://weibo.com/taohui3
English Version
Design concept
lushan 1.0 was a lightweight key-value database. It speaks the memcached protocol and can mount multiple libraries, so you can easily build a cluster across several machines, just as with memcached. lushan 2.0 is also a lightweight application framework that can mount multiple shared libraries, giving a single process the ability to access data and compute at the same time. This makes it unprecedentedly easy to write high-performance services over large volumes of data, and it is especially suitable for Internet recommendation, advertising, and search scenarios. lushan has been used in the Sina Weibo recommendation and advertising business for many years.
Around 2013, I was developing the "Missed Weibo" recommendation feature. I needed to provide several online data stores to experiment with different algorithms, each with online and test versions, which meant deploying several separate systems. That approach was too clumsy. So, over one weekend, I developed lushan, which frees you from all of that. Very cool.
It turned out to be true: lushan has since become infrastructure for the Weibo recommendation and advertising business. Today two clusters are in operation, with 12 machines each, serving terabytes of online data and more than 1 billion queries per day.
When I finished the first version of lushan, I kept having the urge to let it mount shared libraries as well. For a long time I held back, because I believe a framework should have a clear positioning, and because the easily changed parts of an architecture should be separated from the stable parts. In 2015, however, while developing Weibo advertising, the users' interest data, relational data, and feature data for CTR estimation were easily processed with Hadoop and stored in lushan. At that point, writing just two modules was enough to implement the targeting and CTR-prediction functions, with very strong performance. So I gave up my original reservations and implemented the second version, making lushan even more powerful. In practice you can still use lushan purely as a key-value database, or deploy clusters that only serve data alongside clusters that provide computation; Weibo advertising does exactly this.
A sample library is provided in the examples directory; mount it with the following steps:
The output is the value corresponding to key 123456.
Each step is explained below:
When used as a computing framework, lushan supports two protocols: a single-line "URL" protocol similar to HTTP GET, and a protocol similar to HTTP POST that specifies the length of the value being sent and therefore also supports binary data. Two examples in the modules directory, lproxy and lecho, demonstrate these two protocols respectively.
In the lproxy example, a requested key is first looked up in redis; if redis has it, the value is returned directly, otherwise the locally mounted libraries are queried. In simple cases this example can be used in production as-is; modify the code for more complex requirements.
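The redis-first fallback just described can be sketched in Python (the helper names here are hypothetical illustrations; the real module is written in C):

```python
# Sketch of lproxy's lookup order: try redis first, fall back to the
# locally mounted hdict library on a miss. Helper names are invented.

def lookup(key, redis_get, local_get):
    """Return the redis value when present, otherwise the local value."""
    value = redis_get(key)      # hit: redis wins
    if value is not None:
        return value
    return local_get(key)       # miss: fall back to the mounted library

# Toy stand-ins for the two stores:
redis_data = {"168": "hello redis"}
local_data = {"168": "hello lushan", "187": "line 2"}

print(lookup("168", redis_data.get, local_data.get))  # redis hit
print(lookup("187", redis_data.get, local_data.get))  # falls through to local
```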
Follow the steps below:
Create a text file x.txt with the following two lines. On each line, the text before the first tab is the key, and everything after it is the value:
168 hello lushan
187 line 2
Use lushan_line_maker in the tools directory to convert it to the lushan file format. Assuming the result is hdict_20180428192000, mount it into the directory for lushan library number 1, as in the data-access example above.
Start a redis instance on the machine, set dbno to 1, and add a record with key 168 and value "hello redis".
Run make in the lproxy directory, place the generated hmodule.so and hmodule.conf in the hmod/15/1.0.0 directory, and change host and port in hmodule.conf to the IP and port of your redis deployment.
Execute:
echo -ne "hmod_open /mnt/lushan/hmod/15/1.0.0/ 15\r\n" | nc 127.0.0.1 9999
If OPENED is returned, the module was opened successfully; otherwise check whether libhiredis is in LD_LIBRARY_PATH.
Query:
echo -ne "get m15?k=1-168\r\n" | nc 127.0.0.1 9999
VALUE m15?k=1-168 0 11
hello redis
END
echo -ne "get m15?k=1-187\r\n" | nc 127.0.0.1 9999
VALUE m15?k=1-187 0 6
line 2
END
Then, as expected, if the key exists in redis, the redis result is returned; otherwise the lushan data is queried.
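For reference, the replies shown above follow the memcached text format and can be parsed mechanically. A minimal Python sketch (not the real client; it assumes values contain no embedded \r\n):

```python
# Parse a memcached-style GET reply:
#   "VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"
# Returns a list of (key, value) pairs.

def parse_get_reply(reply: bytes):
    items = []
    lines = reply.split(b"\r\n")
    i = 0
    while i < len(lines):
        if lines[i] == b"END":
            break
        if lines[i].startswith(b"VALUE "):
            _, key, flags, length = lines[i].split(b" ")[:4]
            value = lines[i + 1]
            assert len(value) == int(length)  # <bytes> must match the data line
            items.append((key.decode(), value.decode()))
            i += 2
        else:
            i += 1
    return items

reply = b"VALUE m15?k=1-168 0 11\r\nhello redis\r\nEND\r\n"
print(parse_get_reply(reply))  # [('m15?k=1-168', 'hello redis')]
```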
Close a module:
echo -ne "hmod_close 15\r\n" | nc 127.0.0.1 9999
If none of your modules use global variables, you can use hmod_open to replace the old library directly, with no disruption to online services.
The lecho example is similar; it simply echoes back the data you request. It is very simple and will not be described further.
hdict is the library format mounted by lushan. It is very simple: an hdict_xxxx directory contains two required files, dat and idx. The former holds your data; the latter maps each key to the position of its value in the dat file. The entry definition is:
typedef struct {
    uint64_t key;
    uint64_t pos;
} idx_t;
key is a 64-bit unsigned integer and does not include the library number. pos combines the length of the value with its offset in the dat file:
pos = (length << 40) | offset;
The idx file must be sorted in ascending order of idx_t.key. There is no requirement on the dat file: you can either build an index over an existing dat file, or generate the index while writing the data out.
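As an illustration of the format (not the real lushan_line_maker, and assuming a little-endian uint64 on-disk layout), an hdict library could be built like this:

```python
# Build an hdict library: idx holds (key, pos) pairs of uint64 sorted by key,
# where pos = (length << 40) | offset into the dat file.

import struct

def build_hdict(records, dat_path, idx_path):
    """records: iterable of (int key, bytes value)."""
    index = []
    with open(dat_path, "wb") as dat:
        for key, value in records:
            offset = dat.tell()
            dat.write(value)
            index.append((key, (len(value) << 40) | offset))
    index.sort()                      # idx must be ascending by key
    with open(idx_path, "wb") as idx:
        for key, pos in index:
            idx.write(struct.pack("<QQ", key, pos))  # two uint64 per entry

def unpack_pos(pos):
    """Split pos back into (length, offset)."""
    return pos >> 40, pos & ((1 << 40) - 1)

build_hdict([(168, b"hello lushan"), (187, b"line 2")], "dat", "idx")
```

This caps a single value at 2^24 bytes of length field and the dat file at 2^40 bytes, which follows directly from the bit split in the pos formula.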
Sorted files are very common in the map-reduce computing model, so you can specify the output file format in Hadoop to generate a library in hdict format directly. For example:
job.setOutputFormat(LushanFileOutputFormat.class);
There are three commands for obtaining statistics: stats, info, and hmod_info. The first outputs global data, while the latter two output data for each library and each module respectively.
echo -n -e "stats\r\n" | nc 127.0.0.1 9999
STAT pid 13810
STAT uptime 1435075686
STAT curr_connections 1411
STAT connection_structures 4061
STAT cmd_get 2099151223
STAT get_hits 3950240117
STAT get_misses 2443878402
STAT threads 16
STAT timeouts 117
STAT waiting_requests 0
STAT ialloc_failed 0
END
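A stats reply like the one above is easy to parse into a dictionary, for example to derive a hit ratio. A small convenience sketch, not part of lushan itself:

```python
# Turn "STAT <name> <value>" lines into a dict; numeric values become ints.

def parse_stats(reply: str):
    stats = {}
    for line in reply.splitlines():
        if line.startswith("STAT "):
            _, name, value = line.split(" ", 2)
            stats[name] = int(value) if value.isdigit() else value
    return stats

reply = "STAT get_hits 3950240117\nSTAT get_misses 2443878402\nEND"
s = parse_stats(reply)
ratio = s["get_hits"] / (s["get_hits"] + s["get_misses"])
print(f"hit ratio: {ratio:.2%}")
```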
echo -n -e "info\r\n" | nc 127.0.0.1 9999
id label state ref num_qry idx_num open_time path
----------------------------------------------------------------
1 interest_CF_trends OPEN 0 139922 18419392 150824-042654 /mnt/lushan/hdb/12/hdict_20150711204737
2 interest_CF_trends OPEN 0 190508 26175141 150824-050246 /mnt/lushan/hdb/12/hdict_20150711204737
echo -ne "hmod_info\r\n" | nc 127.0.0.1 9999
id label state ref num_qry open_time path
----------------------------------------------------------------
0 OPEN 0 267130787 180419-174502 /mnt/lushan/hmod/0
5 OPEN 0 336829974 180419-174503 /mnt/lushan/hmod/5
You can use lushan.php to build a graphical statistics page.
If you have experience with mysql sharding, building a simple cluster is easy. First, divide your data into groups, usually a multiple of the number of machines. Then decide how many replica sets of the service to deploy, usually two sets in different IDCs. Finally, query your data through a memcached client according to the grouping rule.
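The grouping rule can be sketched as follows (an assumed modulo scheme for illustration; lushan does not prescribe one):

```python
# Route a numeric key to a server: first map the key to one of num_groups
# data groups, then map groups onto the server list.

def group_of(key: int, num_groups: int) -> int:
    return key % num_groups           # data is pre-split into num_groups parts

def server_for(key: int, servers, num_groups: int):
    """num_groups is usually a multiple of len(servers)."""
    return servers[group_of(key, num_groups) % len(servers)]

servers = ["10.0.0.1:9999", "10.0.0.2:9999", "10.0.0.3:9999"]
print(server_for(168, servers, 12))   # the same key always routes to one server
```

Making num_groups a multiple of the server count keeps the groups evenly spread, and the same rule is applied both when uploading data and when querying.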
Although this is simple, lushan still provides a small tool, transfer.py, to handle some details of data transfer for you:
The memcached protocol is relatively simple: get supports only simple requests but can return relatively complex results, while the set-class commands support complex requests but only relatively simple results. lushan makes two changes here, so that usage resembles HTTP GET and POST. The key below can usually be thought of as corresponding to the URL in HTTP:
The "key" of a get request may exceed the 250-byte limit. Set the following when sending:
memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_VERIFY_KEY, 0);
With this setting, sending through libmemcached works without problems. When returning the result, however, the key must fit within 250 bytes; you can return either a signature of the request or its first 250 bytes, as long as you read results by the truncated key and truncated keys do not collide.
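One possible way to derive the returned key, as suggested above (the md5 choice here is an assumption for illustration, not lushan's fixed scheme):

```python
# Keep the reply key within memcached's 250-byte limit: pass short keys
# through unchanged, replace long ones with a fixed-size digest.

import hashlib

KEY_LIMIT = 250

def reply_key(request_key: str) -> str:
    raw = request_key.encode()
    if len(raw) <= KEY_LIMIT:
        return request_key
    return hashlib.md5(raw).hexdigest()   # 32-char signature of the request

print(reply_key("m15?k=1-168"))           # short keys pass through unchanged
print(len(reply_key("k" * 1000)))         # long keys shrink to a 32-char digest
```

The client must apply the same rule when matching replies back to requests, and the digest keeps collisions unlikely where a plain 250-byte prefix might not.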
Use gets to send multi-line requests. A simple get is usually enough, but sending a JSON-like request needs a richer protocol, so lushan redefines the gets command to use the same format as set. On the client, apply the settings above so the key is not validated, then send a packet in the following format:
gets key 0 0 value_len\r\n
value\r\n
Just extract the key from the returned result. lutil.h provides a wrapper for this packing: simply call hrequest_pack.
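What hrequest_pack produces can be sketched in Python (the real C API in lutil.h may differ in details):

```python
# Pack a multi-line request in the modified gets protocol:
#   "gets <key> 0 0 <value_len>\r\n<value>\r\n"

def pack_gets(key: str, value: bytes) -> bytes:
    header = f"gets {key} 0 0 {len(value)}\r\n".encode()
    return header + value + b"\r\n"

pkt = pack_gets("m15?k=1-168", b'{"uid": 168}')
print(pkt)  # b'gets m15?k=1-168 0 0 12\r\n{"uid": 168}\r\n'
```

Because the value length is declared up front, the payload may be arbitrary binary data, including embedded newlines.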
If the timeouts count in stats keeps growing, or your client logs many requests whose latency reaches or exceeds your configured TIMEOUT, try setting a larger NUM_THREADS.
If many hdict files under a given library number have not been transferred completely, it usually means your transfer script was interrupted before finishing.