Article_Search source code study notes

I. Simple search engine algorithms in Python


https://blog.csdn.net/luanpeng825485697/article/details/78997189
https://blog.csdn.net/qq_28626909/article/details/81674790 **
https://blog.csdn.net/qq_31113079/article/details/55226264 PageRank
https://blog.csdn.net/qq_35993946/article/details/88087827 example
https://blog.csdn.net/sinat_29673403/article/details/80422953 keyword-based information retrieval and recommendation in Python (using a deep learning algorithm)
https://www.cnblogs.com/alexkn/p/8413295.html Python recommendation algorithms study, part 1

http://www.sohu.com/a/234187512_599691
http://www.kaneseo.com/baidu/2141.html
https://www.cnblogs.com/zhiliaoniu/p/3442053.html




II. Several Elasticsearch-based search projects used as references:

https://github.com/SnakeHacker/QA-Snake.git
https://github.com/mtianyan/FunpySpiderSearchEngine.git
https://github.com/howie6879/owllook.git
https://github.com/smile0304/Article_Search

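After comparing the projects above, Article_Search is the easiest to understand and get started with (it has only two .py files). Unfortunately, none of the projects I found implements any particular search algorithm; they basically rely on Elasticsearch's default search behavior.

III. Article_Search source references: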
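
To make "Elasticsearch's default search" concrete: these projects send a plain full-text query and let Elasticsearch rank the hits with its built-in relevance scoring (BM25 by default in recent versions). Below is a minimal sketch of such a request body. The helper name build_search_body and the field names "title" and "content" are assumptions for illustration, not taken from any of the projects above.

```python
def build_search_body(keyword, page=1, page_size=10):
    """Build a request body for Elasticsearch's _search endpoint.

    No custom scoring is configured, so ranking is left entirely
    to Elasticsearch's default relevance scoring.
    """
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                # Hypothetical field names; use the fields of your own index.
                "fields": ["title", "content"],
            }
        },
        # Simple offset-based pagination.
        "from": (page - 1) * page_size,
        "size": page_size,
    }


body = build_search_body("xss", page=2)
```

The resulting dict can be passed as the query body to whichever Elasticsearch client is in use.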

https://blog.smilehacker.net/2017/12/28/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E7%9A%84%E6%90%AD%E5%BB%BA/ article walkthrough
https://github.com/smile0304/Technical_Article_Spider/tree/master/Technical_Artical_Spider data
https://github.com/smile0304/Article_Search source code

The interesting parts:

1. Stripping HTML tags (when I needed this a few days ago, I simply extracted the Chinese characters directly)
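
For comparison, the shortcut mentioned in parentheses (keeping only the Chinese characters) can be sketched as below. This is a minimal illustration, not code from Article_Search; the name extract_chinese is hypothetical, and the range \u4e00-\u9fa5 covers only the basic CJK block, so punctuation, digits, and Latin text are all dropped.

```python
import re

# Basic CJK Unified Ideographs range; everything else is discarded.
RE_CHINESE = re.compile(r'[\u4e00-\u9fa5]+')


def extract_chinese(htmlstr):
    """Return all runs of Chinese characters in htmlstr, concatenated."""
    return ''.join(RE_CHINESE.findall(htmlstr))


sample = '<p class="a">搜索引擎 <b>Elasticsearch</b> 入门</p>'
result = extract_chinese(sample)
```

This is much shorter than tag filtering, but it loses English keywords and line structure, which is why the tag-stripping approach in the source is the more interesting one.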

import re  # required by the methods below

def get_origin(self, hit):
    """
    Get the source of an article.
    :return: origin, the display name of the source (derived from the index name)
    """
    # Default: "unknown source"; also covers hits that have no "_index" key,
    # which previously left origin unassigned.
    origin = "未知来源"
    if "_index" in hit:
        if "teachnical_4hou" == hit["_index"]:
            origin = "嘶吼"  # 4hou
        elif "article_anquanke" == hit["_index"]:
            origin = "安全客"  # Anquanke
        elif "teachnical_freebuf" == hit["_index"]:
            origin = "Freebuf"
    return origin

def filter_tags(self, htmlstr):
    """
    Filter the tags out of an HTML string.
    Removes tags and related markup from the HTML.
    :param htmlstr: HTML string
    :return: the text with the markup removed
    """
    # Filter CDATA first
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # match CDATA
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    re_stopwords = re.compile('\u3000')  # the useless '\u3000' (ideographic space) character
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove script
    s = re_style.sub('', s)  # remove style
    s = re_br.sub('\n', s)  # convert br to a newline
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    s = re_stopwords.sub('', s)
    # Collapse redundant blank lines
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    s = self.replaceCharEntity(s)  # replace character entities
    return s

def replaceCharEntity(self, htmlstr):
    """
    Replace common HTML character entities.
    Replaces the special character entities in the HTML with normal characters.
    You can add new entities to CHAR_ENTITIES to handle more HTML character entities.
    :param htmlstr: HTML string
    :return: (the return value is not important)
    """
    CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"', }

    re_charEntity = re.compile(r'&#?(?P<name>\w+);')
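
The excerpt above stops right after compiling re_charEntity. A minimal sketch of how the replacement step could be finished is given below, written as a standalone function. The name replace_char_entity and the choice to drop unknown entities are assumptions, not necessarily what Article_Search does.

```python
import re

CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                 'lt': '<', '60': '<',
                 'gt': '>', '62': '>',
                 'amp': '&', '38': '&',
                 'quot': '"', '34': '"'}

# Matches named entities (&lt;) and numeric ones (&#60;); the key is in group "name".
re_charEntity = re.compile(r'&#?(?P<name>\w+);')


def replace_char_entity(htmlstr):
    """Replace entities listed in CHAR_ENTITIES; drop unknown ones (assumed policy)."""
    def _lookup(match):
        # Unknown entity names fall back to the empty string.
        return CHAR_ENTITIES.get(match.group('name'), '')
    # A single pass, so text produced by one replacement is not re-scanned.
    return re_charEntity.sub(_lookup, htmlstr)


text = replace_char_entity('a&nbsp;&lt;b&#62; &amp; &unknown;')
```

If full entity coverage is needed rather than a hand-maintained table, the standard library's html.unescape is an alternative worth considering.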

2. The rest is fairly conventional

Data modeling through model classes

Flask for both the front end and the back end

2019.5.28