[Language & Knowledge Theme Month] Simple Word Similarity Comparison with Word Vector Representation
才能我浪费99 · Posted on 2020-07-01 · Views: 1778 · Replies: 5
Last edited on 2020-08-07

1. Feature Description:
Word embedding is the collective name for a set of language-modeling and feature-learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it is a mathematical embedding from a space with one dimension per word into a continuous vector space of much lower dimension.
The word vector representation service: backed by massive amounts of high-quality web data and deep neural network technology, it makes text computable by vectorizing words, helping you quickly build applications such as semantic mining and similarity computation.
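To make the idea concrete, here is a minimal toy sketch in Python. The three-dimensional vectors below are invented purely for illustration and have nothing to do with the vectors the Baidu API actually returns:

```python
# Toy illustration of an embedding table: each word in a tiny vocabulary
# is mapped to a fixed-length vector of real numbers.
toy_embeddings = {
    "man":   [0.8, 0.1, 0.3],
    "woman": [0.7, 0.2, 0.3],
    "sky":   [0.1, 0.9, 0.6],
}

def lookup(word):
    # An embedding lookup is just a table lookup: word in, vector out.
    return toy_embeddings[word]

print(lookup("man"))  # → [0.8, 0.1, 0.3]
```

Real services use vectors of hundreds of dimensions learned from a large corpus, but the lookup interface is the same.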

2. Platform Access

The access procedure is fairly simple; you can refer to my other post, so I won't repeat it here:
http://ai.baidu.com/forum/topic/show/943327

3. Invocation Guide (Python3) and Evaluation

3.1 Authentication and authorization first:

Before calling any API, you need to complete authentication and authorization; for details see:

http://ai.baidu.com/docs#/Auth/top

The Python3 code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import json

# client_id is the AK obtained from the console, client_secret is the SK
client_id = 【AK of your Baidu Cloud application】
client_secret = 【SK of your Baidu Cloud application】

# Obtain the access token
def get_token():
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=' + client_id + '&client_secret=' + client_secret
    request = urllib.request.Request(host)
    request.add_header('Content-Type', 'application/json; charset=UTF-8')
    response = urllib.request.urlopen(request)
    token_content = response.read()
    if token_content:
        token_info = json.loads(token_content)
        token_key = token_info['access_token']
        return token_key
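As a side note, instead of concatenating the AK/SK into the URL by hand, the query string can be built with urllib.parse.urlencode, which percent-escapes special characters safely. A minimal sketch (the build_token_url helper is my own name, not part of any SDK):

```python
from urllib.parse import urlencode

def build_token_url(ak, sk):
    # Encode the query parameters instead of concatenating raw strings,
    # so any special characters in the values are escaped correctly.
    base = 'https://aip.baidubce.com/oauth/2.0/token'
    params = urlencode({
        'grant_type': 'client_credentials',
        'client_id': ak,
        'client_secret': sk,
    })
    return base + '?' + params

print(build_token_url('myAK', 'mySK'))
```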

3.2 Calling the Baidu word vector representation API:

For details, see: https://ai.baidu.com/ai-doc/NLP/fk6z52elw

The documentation there is quite clear, so it is not repeated here.

Note that the word vector representation API URL is:
https://aip.baidubce.com/rpc/2.0/nlp/v2/word_emb_vec

Example request body:

{
    "word": "张飞"
}


The Python3 invocation code is as follows:

# Query the word vector for a single word
def word_emb_vec(word):
    token=get_token()
    url = 'https://aip.baidubce.com/rpc/2.0/nlp/v2/word_emb_vec'
    params = dict()
    params['word'] = word
    params = json.dumps(params).encode('utf-8')
    access_token = token
    url = url + "?access_token=" + access_token
    request = urllib.request.Request(url=url, data=params)
    request.add_header('Content-Type', 'application/json')
    response = urllib.request.urlopen(request)
    content = response.read()
    if content:
        content=content.decode('GBK')
        data = json.loads(content)
        print (data)
        vec=data['vec']
        
        return vec
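In practice the response should be checked for an error payload before reading 'vec'. Baidu AI APIs generally report failures through error_code/error_msg fields in the JSON body (worth verifying against the official docs); here is a defensive parsing sketch with a hypothetical helper name:

```python
import json

def parse_word_emb_response(content):
    # Raise a clear error when the API returns an error payload,
    # instead of failing later with a KeyError on 'vec'.
    data = json.loads(content)
    if 'error_code' in data:
        raise RuntimeError('API error %s: %s'
                           % (data['error_code'], data.get('error_msg')))
    return data['vec']

print(parse_word_emb_response('{"vec": [0.1, 0.2]}'))  # → [0.1, 0.2]
```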

Once you have the word vectors, you can check how similar two words are by comparing the cosine of their vectors (note that this requires numpy). The code is as follows:

import numpy as np

def get_cos(word1, word2):
    vector1 = word_emb_vec(word1)
    vector2 = word_emb_vec(word2)
    cos = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
    print(cos)
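To sanity-check the cosine formula used above, here is a standalone version with toy vectors (the function name and inputs are mine, not from the API): parallel vectors score 1.0, orthogonal vectors 0.0.

```python
import numpy as np

def cosine_similarity(v1, v2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|)
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # parallel → 1.0
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal → 0.0
```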

4. Feature Evaluation:
First, look at the content of the returned word vector:
word_emb_vec('男人')
'vec': [0.14957, 0.0506594, 0.00590814, -0.741117, 0.0759627, -0.0763338, -0.0391505, 0.719658, -0.00874455, -0.0613591, 0.814288, 0.399113, 1.14151, 0.101542, -0.191723, -0.270611, 0.125958, 0.146356, 0.00681456, 0.245249, ...]
(The full vector runs to many hundreds of components; the remaining values are omitted here for readability.)

Then compare the similarity of different words through their word vectors:
get_cos('男生','女生')
0.7354834999425102

get_cos('男生','天空')
0.04052168632494217

The comparison shows that, relative to '天空' (sky), '男生' (boy) is semantically much closer to '女生' (girl), which matches our expectations.

5. Test Conclusions and Suggestions
Word vector computation maps each word in a language's vocabulary to a fixed-length vector through training. All of the word vectors together form a vector space in which every word is a point, and this is what makes text computable. Word vectors are the foundation of NLP, and this word vector representation API greatly simplifies their use. It can be applied to:
Semantic recall: represent candidate resources as word vectors and build fast index-and-recall technology on top of that representation; unlike the traditional inverted word index, it recalls results for users directly based on semantic relevance.
Personalized recommendation: model a user's interests from their historical behavior, learn the degree of interest match between the user and recommendation candidates, and deliver personalized recommendations.
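The semantic-recall idea above can be sketched in a few lines: embed the query and every candidate, then rank candidates by cosine similarity. Everything below (the recall helper, the two-dimensional vectors) is invented for illustration; a real system would use the API's full-size vectors plus an approximate nearest-neighbor index:

```python
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def recall(query_vec, candidates, top_k=2):
    # Rank (name, vector) candidates by cosine similarity to the query.
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

docs = [("girl", [0.7, 0.2]), ("sky", [0.1, 0.9]), ("boy", [0.8, 0.1])]
print(recall([0.75, 0.15], docs))  # → ['boy', 'girl']
```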

5 replies in total; last reply by worddict on 2020-08-07
#6 worddict replied on 2020-08-07

Feels very convenient.

#5 才能我浪费99 replied on 2020-07-14

This feature helps a lot of people.

#4 才能我浪费99 replied on 2020-07-14

Word vectors require a large amount of corpus data.

#3 worddict replied on 2020-07-13

Word vectors are the foundation of NLP.

#2 worddict replied on 2020-07-13

Very detailed write-up.