Undergraduate Innovation Project – Data Analysis (Part 1)

I read through several articles on text clustering, but when I tried to implement it myself the dataset proved too large for my machine: the script ran for over ten minutes without finishing. (It later turned out that the logic I had written for extracting the data was wrong.)

Articles online almost all use the KMeans algorithm. Since I have not yet achieved good results with it, this post is only a demo of the data-analysis process (Part 1).
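For orientation, the pipeline those articles share is: segment the text into space-separated words, weight the terms with TF-IDF, then cluster the resulting vectors. Below is a minimal sketch of that pipeline; the three-document toy corpus is invented purely for illustration, and scikit-learn's TfidfVectorizer is a shortcut for the CountVectorizer + TfidfTransformer pair used in the full demo later in this post.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus of pre-segmented documents (space-separated words),
# made up here purely for illustration.
docs = ["机器 学习 文本 分类",
        "文本 聚类 算法",
        "机器 学习 算法"]
tfidf = TfidfVectorizer().fit_transform(docs)      # sparse doc-term TF-IDF matrix
labels = KMeans(n_clusters=2).fit_predict(tfidf)   # cluster id per document
print(labels)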

References:

https://juejin.im/post/5b028ce1518825673564d357 (How to solve 90% of NLP problems in 8 machine-learning steps)
https://chuansongme.com/n/2317435 (Automated text classification with machine learning)
https://www.infoq.cn/article/machine-learning-automatic-classification-of-text-data (Automated text classification with machine learning; steps of a term-level analysis)
https://www.infoq.cn/article/machine-learning-automatic-classification-of-text-data-part2 (part 2)
http://www.lining0806.com/%E6%9D%8E%E8%88%AA%E5%8D%9A%E5%A3%AB%EF%BC%9A%E6%B5%85%E8%B0%88%E6%88%91%E5%AF%B9%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%9A%84%E7%90%86%E8%A7%A3/ (A brief take on my understanding of machine learning)
https://www.csuldw.com/2019/03/24/2019-03-24-anomaly-detection-introduction/ (Eight unsupervised anomaly-detection techniques)
https://cloud.tencent.com/developer/article/1137500 (Four clustering algorithms for unsupervised learning in Python)
http://www.raincent.com/content-10-12065-1.html (An introduction to unsupervised deep learning, with Python code)
https://blog.csdn.net/tonydz0523/article/details/87934076 (Python code)
https://blog.csdn.net/u011587401/article/details/78323706 (code implementation)
https://blog.csdn.net/weixin_34362790/article/details/87578754
https://blog.csdn.net/Luzaofa/article/details/79712638 (code only; from segmentation to statistics to clustering)
https://blog.csdn.net/yyxyyx10/article/details/63685382
https://blog.csdn.net/weixin_41276745/article/details/79611259 (decent)
https://blog.csdn.net/lzc4869/article/details/76068170
https://bainingchao.github.io/2018/12/24/Python%E6%95%B0%E6%8D%AE%E9%A2%84%E5%A4%84%E7%90%86%EF%BC%9A%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E3%80%81%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD%E9%80%9A%E7%94%A8%E6%8A%80%E6%9C%AF%EF%BC%881%EF%BC%89/
https://www.jianshu.com/p/c98df5c9e406
http://www.shareditor.com/blogshow?blogId=51
https://zhuanlan.zhihu.com/p/37157010
http://media.people.com.cn/n1/2019/0113/c424561-30524771.html
https://yq.aliyun.com/articles/593627
https://gist.github.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235
https://www.cnblogs.com/aipiaoborensheng/p/7813087.html
https://blog.csdn.net/wanpi931014/article/details/80792112

The code draws mainly on:

https://blog.csdn.net/Luzaofa/article/details/79712638
https://blog.csdn.net/wanpi931014/article/details/80792112
https://blog.csdn.net/tonydz0523/article/details/87934076
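The demo below reads pre-segmented text files from a directory (one document per line), turns them into a TF-IDF weight matrix, clusters the documents with KMeans at k = 8, and prints the ten highest-weighted terms of each cluster: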

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import os

corpus = []  # one pre-segmented document per element
rootdir = "/root/temp/data/"

# Read every file under rootdir; each line is one document whose
# words are already separated by spaces.
for name in os.listdir(rootdir):
    with open(os.path.join(rootdir, name), 'r', encoding='utf-8') as f:
        for line in f:
            corpus.append(line.strip())

# Vectorize once, after the whole corpus is read: raw term counts,
# then TF-IDF weights. Refitting this inside the reading loop would
# recompute everything per line and is extremely slow.
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
word = vectorizer.get_feature_names()  # vocabulary (get_feature_names_out() in newer scikit-learn)
weight = tfidf.toarray()  # dense document-term TF-IDF matrix

# Optional: walk the matrix to inspect individual TF-IDF values.
# for i in range(len(weight)):
#     for j in range(len(word)):
#         print(word[j], weight[i][j])

# Cluster into k groups and print the ten highest-weighted terms
# of each cluster centroid.
k = 8
clf = KMeans(n_clusters=k)
clf.fit(weight)
order_centroids = clf.cluster_centers_.argsort()[:, ::-1]
for ss in range(k):
    print("\n")
    print("Cluster %d:" % ss, end='')
    for ind in order_centroids[ss, :10]:
        print(' %s' % word[ind], end='')
# Alternative: rerun the clustering for every k from 1 to 14 and
# inspect the resulting clusters by hand.
# K = range(1, 15)
# for k in K:
#     print("clustering run: " + str(k) + "\n")
#     clf = KMeans(n_clusters=k)
#     clf.fit(weight)
#     order_centroids = clf.cluster_centers_.argsort()[:, ::-1]
#     for ss in range(k):
#         print("\n")
#         print("Cluster %d:" % ss, end='')
#         for ind in order_centroids[ss, :10]:
#             print(' %s' % word[ind], end='')
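Instead of eyeballing the output of every k as in the commented-out loop above, a common way to choose k is the elbow method. The following is only a sketch, assuming the weight matrix built above: plot KMeans's inertia_ (the within-cluster sum of squared distances) over a range of k and pick the value where the curve bends.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method over the `weight` matrix from the demo above.
ks = range(1, 15)
inertias = [KMeans(n_clusters=k).fit(weight).inertia_ for k in ks]
plt.plot(ks, inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia (within-cluster SSE)')
plt.show()  # choose k near the "elbow" where the curve flattens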

2019.5.22