There are seven classic algorithm problems here (placed in 7 folders respectively);
Q1 and Q2 are two questions related to big data processing;
The other five questions are:
1. DBSCAN clustering algorithm,
2. simple dynamic programming,
3. classic sorting algorithms,
4. full permutation of a string,
5. calculation of tf-idf
The following is the relevant documentation for Q1 and Q2:
1. Question description.
2. Ideas.
3. Environment.
4. Code execution.
Over the past few days I have been interviewing for algorithm engineer positions and came across two particularly good written-test questions. Below are the ideas and code I submitted. They may not be the best solutions, but they are the result of my study and thinking over these two days.
A table records the number of likes for each diary entry. Using a programming language you are familiar with, calculate the P90 of the diary like counts. Definition of P90: if P90 = n, then 90% of the diary entries have a like count less than or equal to n. Note: the original data is not sorted, and the number of diary entries reaches 1 billion. (A small sketch of the definition follows the examples below.)
The table structure is as follows:
note_id, like_count
1, 4
2, 6
..., ...
Examples:
1. The like counts of 10 diaries are: 1, 2, 2, 3, 3, 4, 5, 5, 6, 50; then P90 = 6
2. The like counts of 9 diaries are: 1, 2, 2, 3, 3, 4, 5, 5, 6; then P90 = 5.5
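The rule implied by the two examples: if 0.9 * n is an integer, P90 is the value at that (1-based) rank; otherwise the two neighbouring values are averaged. A minimal in-memory sketch of just this definition (the function name and the plain-list input are illustrative; the billion-row case is handled by the external sort described under Ideas):

    def p90(like_counts):
        # Sort ascending; ranks below are 1-based.
        vals = sorted(like_counts)
        r = 9 * len(vals)  # rank * 10, kept as an integer to avoid float error
        if r % 10 == 0:    # 0.9 * n lands exactly on a rank
            return vals[r // 10 - 1]
        lo = r // 10       # floor of the fractional rank
        return (vals[lo - 1] + vals[lo]) / 2  # average the two neighbours

    assert p90([1, 2, 2, 3, 3, 4, 5, 5, 6, 50]) == 6
    assert p90([1, 2, 2, 3, 3, 4, 5, 5, 6]) == 5.5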
A file contains 10 billion texts, one per line; the length of each text does not exceed 1024, and there may be duplicate texts. Using a programming language you are familiar with, find the 10,000 longest texts, subject to the following conditions:
1. No duplicates among the texts.
2. If a text is retained in the result, then all texts of the same length as that text must also be retained.
Description: external sorting to compute the P90 of the diary like table (a code sketch follows the steps)
1. Read the big file, maintaining a running total of the number of records, nTotal
2. Split it into n small files (first pass: internal sort of each small file, using an optimized quicksort; note: available memory must be at least the size of one small file)
3. Sort each small file as a run of the external sort
4. Merge all the small files into one new output file (read each file in turn from the first line to the last, compare the current heads to pick the extreme (max/min) value, and write it to the new file)
5. The result is one fully sorted big file
6. Locate the P90 position in the sorted big file, then read the actual P90 value via a file-pointer offset
Note: the small files are sorted in descending order, with the highest like counts first
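A minimal sketch of these steps, assuming the input file holds one like_count per line (the real table has note_id, like_count rows; parsing is simplified here). For brevity the runs are kept ascending rather than descending, heapq.merge performs the k-way merge of step 4 lazily, and the P90 rank is picked while streaming the merge instead of via a file-pointer offset:

    import heapq
    import itertools
    import tempfile

    def external_sort_p90(path, chunk_lines=1_000_000):
        runs = []      # sorted temporary run files (the "small files")
        n_total = 0    # step 1: total record count
        with open(path) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_lines))
                if not chunk:
                    break
                n_total += len(chunk)
                values = sorted(int(x) for x in chunk)  # step 2: in-memory sort
                run = tempfile.TemporaryFile("w+")
                run.writelines(f"{v}\n" for v in values)
                run.seek(0)
                runs.append(run)
        # steps 4-5: lazy k-way merge of all sorted runs
        merged = heapq.merge(*[(int(line) for line in run) for run in runs])
        # step 6: the value(s) at the P90 rank, per the definition above
        r = 9 * n_total
        ranks = {r // 10} if r % 10 == 0 else {r // 10, r // 10 + 1}
        picked = [v for i, v in enumerate(merged, 1) if i in ranks]
        return sum(picked) / len(picked)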
Description: pandas can process files of up to 5 TB, so pandas is used directly on the very large file (a code sketch follows the steps)
1. Read the file into a DataFrame: reader
2. Drop duplicates
3. Add a text-length column and an index column
4. Group by the text-length column (at most 1024 groups) and count the elements in each group, producing a new DataFrame: group
5. From the group table, determine the minimum length among the top_n texts
6. Filter the rows of the large reader table by that length, producing DataFrame: result
7. From the result table, return the top_n texts that meet the condition
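A sketch of these steps, assuming one text per line and that the deduplicated data fits in memory (chunked reading would be needed at the multi-TB scale; the index column of step 3 is omitted, and top_n is an illustrative parameter):

    import pandas as pd

    def top_n_longest(path, top_n=10_000):
        with open(path, encoding="utf-8") as f:          # step 1
            reader = pd.DataFrame({"text": f.read().splitlines()})
        reader = reader.drop_duplicates("text")          # step 2
        reader["length"] = reader["text"].str.len()      # step 3
        # step 4: element count per length, longest lengths first
        group = reader.groupby("length").size().sort_index(ascending=False)
        # step 5: smallest length that still falls inside the top_n
        cum = group.cumsum()
        hit = cum[cum >= top_n]
        min_len = hit.index.max() if not hit.empty else cum.index.min()
        result = reader[reader["length"] >= min_len]     # step 6
        return result["text"].tolist()                   # step 7

Because whole length groups are kept together, the result may contain slightly more than top_n texts, which is exactly what condition 2 requires.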
Description: hash map (a code sketch follows the steps)
1. Read the big file and generate hash key & value pairs, stored into 1024 small files (using bucket sort / counting sort; note: the value is not the text content but its line number in the big file)
1.1 The key is the hash value (MD5 is used for hashing)
1.2 The value is the line number in the big file
2. Read out the top N in order from largest to smallest
3. Take the top N line numbers and read the big file to retrieve the contents
4. Output the top N in a loop
MD5 digest: 128 bits, 16 bytes
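An in-memory sketch of the idea (names are illustrative; at the real 10-billion-line scale the digests and line numbers would be spilled into the 1024 small files rather than held in a set):

    import hashlib
    from collections import defaultdict

    def top_n_longest_hash(path, top_n=10_000):
        seen = set()                   # 16-byte MD5 digests of texts already seen
        by_length = defaultdict(list)  # length -> line numbers (at most 1024 keys)
        with open(path, encoding="utf-8") as f:                      # step 1
            for lineno, line in enumerate(f):
                text = line.rstrip("\n")
                digest = hashlib.md5(text.encode("utf-8")).digest()  # step 1.1
                if digest in seen:     # duplicate text: keep the first occurrence
                    continue
                seen.add(digest)
                by_length[len(text)].append(lineno)                  # step 1.2
        # step 2: walk lengths from longest to shortest until top_n is covered,
        # always taking a whole length group (condition 2)
        wanted = set()
        for length in sorted(by_length, reverse=True):
            wanted.update(by_length[length])
            if len(wanted) >= top_n:
                break
        # steps 3-4: re-read the big file and emit the selected lines
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for i, line in enumerate(f) if i in wanted]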
python3.6
pip install tqdm
pip install pandas
Entry point: Q1/get_p90.py; to run:
cd Q1
python get_p90.py
Entry point: Q2/get_top_1w_pandas.py; to run:
cd Q2
python get_top_1w_pandas.py
Entry point: Q2/get_top_1w_hash.py; to run:
cd Q2
python get_top_1w_hash.py