用于广告引擎的高效检索算法

360影视 2024-12-05 18:51 3

摘要:DNF (析取范式): 这种形式则是将逻辑命题表示为一系列"或"操作的并集,每个项都是一个简单命题。比如,(A AND B) OR (NOT A AND C) 就是一个DNF表达式

DNF (Disjunctive Normal Form) 和CNF ( Conjunctive normal Form)都是布尔代数中表示逻辑表达式的标准形式。

1、DNF (析取范式): 这种形式则是将逻辑命题表示为一系列"或"操作的并集,每个项都是一个简单命题。比如,(A AND B) OR (NOT A AND C) 就是一个DNF表达式

An example DNF expression is of the form:(A ∈ {a1, a2} ∧ B 6∈ {b1, b2} ∧ C ∈ {c1})∨ (A ∈ {a1, a3} ∧ D 6∈ {d1})where A, B, C and D are attributes, and a1, a2, a3, b1, b2, c1 and d1 are attribute values

2、CNF (析取合取范式): 它是一种将逻辑命题分解为一系列"与"(AND)操作的组合,每个组合又由一些"或"(OR)操作及其对应的否定条件组成。例如,一个CNF表达式可能是 (A OR B) AND (NOT C OR D)。

以上两种范式各有应用,其中DNF在广告检索中应用比较广泛,以下针对此算法进行详细演绎。

an example CNF expression is of the form:(A ∈ {a1, a2} ∨ B 6∈ {b1, b2} ∨ C ∈ {c1})∧ (A ∈ {a1, a3} ∨ D 6∈ {d1})

将文档中的所有条件按照DNF范式拆解出独立的conjuctions,然后按照^分隔个数量算出来K值(其中“不属于”值为0)

按照^分隔出来的最小条件和K值进行倒排

按照K值从0-2开始依次进行条件检索

有7篇文档(或者广告),及其对应的定向约束。现有一个请求查询需求:age IN 3 AND state IN CA AND gender IN M,检索出满足条件的doc列表。

文档ID文档约束doc1age IN 3 AND state IN NY OR state IN CA AND gender IN Mdoc2age IN 3 AND gender IN F OR state NOT CA;NYdoc3age IN 3 AND gender IN M AND state NOT CA OR state IN CA AND gender IN Fdoc4age IN 3;4 OR state IN CA AND gender IN Mdoc5state NOT CA;NY OR age IN 3;4doc6state NOT CA;NY OR age IN 3 AND state IN NY OR state IN CA AND gender IN Mdoc7age IN 3 AND state IN NY OR state IN CA AND gender IN F

(1)拆解出所有doc列表的ConjunctionID,conjunctionID不能重复。

文档约束ConjunctionID
age IN 3 AND state IN NYc1doc1state IN CA AND gender IN Mc2doc1age IN 3 AND gender IN Fc3doc2state NOT CA;NYc4doc2age IN 3 AND gender IN M AND state NOT CAc5doc3state IN CA AND gender IN Fc6doc3age IN 3;4c7doc4state IN CA AND gender IN Mc2doc4state NOT CA;NYc4doc5age IN 3;4c7doc5state NOT CA;NYc4doc6age IN 3 AND state IN NYc1doc6state IN CA AND gender IN Mc2doc6age IN 3 AND state IN NYc1doc7state IN CA AND gender IN Fc6doc7

(2)建立conjunctionID到docID的倒排

文档约束ConjunctionID文档ID集合age IN 3 AND state IN NYc1doc1,doc6,doc7state IN CA AND gender IN Mc2doc1,doc4,doc6age IN 3 AND gender IN Fc3doc2state NOT CA;NYc4doc2,doc5,doc6state IN CA AND gender IN Fc6doc3,doc7ConjunctionID文档约束Kc1age IN 3 AND state IN NY2c2state IN CA AND gender IN M2c3age IN 3 AND gender IN F2c4state NOT CA;NY0c6state IN CA AND gender IN F2K文档约束ConjunctionID集合0Zc4(IN)0state NOT CAc4(NOT)0state NOT NYc4(NOT)2age IN 3c1(IN),c3(IN),c5(IN)2state IN NYc1(IN)2state IN CAc2(IN),c5(NOT),c6(IN)0Zc4(IN)0state NOT CAc4(NOT)2age IN 3c1(IN),c3(IN),c5(IN)2gender IN Mc2(IN),c5(IN)

由于查询条件的K值为3,则只需要匹配K

(6)从0-K依次去获取对应的ConjunctionID结果

K文档约束ConjunctionID集合备注0Zc4(IN)遍历所有,遇到IN加入,只要遇到NOT就剔除。c4先被匹配,然后被过滤。最终没有匹配到任何结果。1age IN 3c7(IN)遍历所有,遇到IN加入,只要遇到NOT就剔除。C7被成功匹配。

K=0和1的情况,不需要匹配或者只需要匹配到任一个约束的cunjctionID就可以获得结果。

所以只需要对每个查询条件,去遍历每个约束下的集合,即可

2age IN 3c1(IN),c3(IN),c5(IN)先排序,每次先选中[0,K-1]之间的ConjunctionID集合,按照WAND算法进行匹配。本次没有匹配成功,c1跳过。下次从c2开始排序和匹配2state IN CAc2(IN),c5(NOT),c6(IN)本次匹配成功,得到c2。下次从c3开始排序和匹配2age IN 3c1(IN),c3(IN),c5(IN)本次没有匹配成功,c3跳过。下次从c5开始排序和匹配2age IN 3c1(IN),c3(IN),c5(IN)本次匹配成功,得到c5。再doublecheck一下c5在K=2的所有约束条件是否有NOT的,最后结果是过滤掉c5。本次匹配失败。下次从c6开始排序和匹配2state IN CAc2(IN),c5(NOT),c6(IN)本次没有匹配成功,c6跳过。结束匹配,最终只有c2被召回2gender IN Mc2(IN),c5(IN)2age IN 3c1(IN),c3(IN),c5(IN)

K>1的情况,只需要匹配【0,K-1】之间的结果就足够,由于NOT不计入K值里,所以在获取到匹配结果后,还需要对每个约束做一次遍历的校验,避免有被NOT过滤的情况

(7)以上匹配后最终结果是(c2,c7),查询到最后的文档ID集合是(doc1,doc4,doc5,doc6)

ConjunctionID文档ID集合c1doc1,doc6,doc7c2doc1,doc4,doc6c3doc2c4doc2,doc5,doc6c5doc3c6doc3,doc7c7doc4,doc5# -*- coding=utf-8 -*-from __future__ import unicode_literalsimport sysreload(sys)sys.setdefaultencoding('utf-8') # 设置默认编码为utf-8以支持中文字符处理from collections import defaultdict # 用于创建字典,提供默认值import re # 用于正则表达式处理# 布尔逻辑符号和其他常量定义AND = "AND" # 逻辑与OR = "OR" # 逻辑或IN = "IN" # 属于,表示"在集合中"NOT = "NOT" # 逻辑非,表示"不在集合中"SEP = ";" # 分隔符,多个值之间的分隔EOF = "EOF" # 表示列表的结束ZERO = "Z" # 表示零值SPACE = " " # 空格# Term 类用于存储键值对,比如 age:3class Term: def __init__(self, key, value): self.key = key # 属性名 self.value = value # 属性值# Assignment 类表示一个布尔条件的关系,例如 age IN 3;4,表示关系是 "IN" 或 "NOT"class Assignment: def __init__(self, relation, term): self.relation = relation # 关系类型(IN或NOT) self.term = term # 关联的 Term 对象# Conjunction 类表示一个合取子句(多个条件用 AND 连接)class Conjunction: def __init__(self, dnf): self.dnf = dnf # 保存该合取子句的原始 DNF 字符串 self.size = 0 # 条件的数量 self.assigns = # 保存布尔条件的列表 self.id = 0 # 为该合取子句分配的唯一标识符 def pushAssign(self,assign): # 添加布尔条件到 assign 列表中 self.assigns.append(assign) def setId(self,id): # 设置合取子句的唯一 ID self.id = id# 解析单个合取子句(例如 age IN 3;4 AND state NOT NY)def parseConjunction(con): conjunction = Conjunction(con) # 创建 Conjunction 对象 assigns = con.split(AND) # 将合取子句按照 AND 拆分成多个布尔条件 for ass in assigns: ass = ass.strip # 去除空格 # 处理 IN 关系,形式如 age IN 3;4 elements = ass.split(IN) if len(elements) == 2: key = elements[0].strip # 属性名(如 age) relation = IN # 关系为 IN for val in elements[1].strip.split(SEP): value = val.strip # 属性值(如 3, 4) term = Term(key, value) assignment = Assignment(relation, term) conjunction.pushAssign(assignment) # 添加到 Conjunction conjunction.size += 1 # 处理 NOT 关系,形式如 state NOT NY elements = ass.split(NOT) if len(elements) == 2: key = elements[0].strip # 属性名(如 state) relation = NOT # 关系为 NOT for val in elements[1].strip.split(SEP): value = val.strip # 属性值(如 NY) term = Term(key, value) assignment = Assignment(relation, term) conjunction.pushAssign(assignment) return conjunction # 返回解析后的 Conjunction 对象# 构建倒排索引,处理 DNF 公式中的 OR 条件def buildTwoLevelInvertedIndex(doc_dnf, doc_id, con_doc_inverted_index, con_id_map, ass_con_inverted_index): doc_dnf = re.sub('\s+', ' ', doc_dnf) # 清除多余空格 print doc_id, ":", doc_dnf if 0 == len(doc_dnf): # 如果 DNF 表达式为空,直接返回 print doc_id ," is empty !!!" return # 将 DNF 表达式按照 OR 拆分成多个合取子句 cons = doc_dnf.split(OR) for con in cons: con = con.strip # 去除每个合取子句前后的空格 # 如果该合取子句没有被映射到 ID,则为它分配新的 ID if con not in con_id_map: con_id = len(con_id_map) + 1 con_id_map[con] = con_id conjunction = parseConjunction(con) # 解析该合取子句 conjunction.setId(con_id) # 为每个布尔条件建立倒排索引 for a in conjunction.assigns: if (a.term.key, a.term.value) not in ass_con_inverted_index[conjunction.size]: ass_con_inverted_index[conjunction.size][a.term.key, a.term.value] = [(con_id, a.relation)] else: ass_con_inverted_index[conjunction.size][a.term.key, a.term.value].append((con_id, a.relation)) # 特殊处理当条件大小为 0 时 if 0 == conjunction.size: if ZERO not in ass_con_inverted_index[conjunction.size]: ass_con_inverted_index[conjunction.size][ZERO] = [(con_id, IN)] else: if (con_id, IN) not in ass_con_inverted_index[conjunction.size][ZERO]: ass_con_inverted_index[conjunction.size][ZERO].append((con_id, IN)) # 将合取子句 ID 映射到文档 ID(倒排索引的第一层) con_doc_inverted_index[con_id_map[con]].append(doc_id)# Posting List 类,用于存储倒排列表中的布尔条件条目class Plist: def __init__(self, key, conid_relations): self.key = key # 键值对(如 (age, 3)) self.conid_relations = conid_relations # 条目列表(包含表达式 ID 和关系) self.size = len(conid_relations) # 条目数量 self.current_idx = 0 # 当前索引指针 self.current_entry_id = conid_relations[0][0] # 当前条目 ID self.current_entry_relation = conid_relations[0][1] # 当前条目的关系 # 跳到下一个大于或等于 nextid 的条目 def skipToNextId(self, nextid, defaultid): skip_flag = False if self.current_idx = nextid: self.current_idx = idx self.current_entry_id = self.conid_relations[idx][0] self.current_entry_relation = self.conid_relations[idx][1] skip_flag = True break if not skip_flag: self.current_idx = EOF # 表示到达列表末尾 self.current_entry_id = defaultid # 默认值处理# 按当前条目 ID 对倒排列表排序def sortPlistByCurrentEntries(plists): # 排序算法:首先按条目 ID 和键名排序 Plists = sorted(plists, key=lambda x: (x.current_entry_id, x.key[0])) return Plists# 查询合取子句def retrievalConjunctions(query, con_id_map, ass_con_inverted_index): query = re.sub('\s+', ' ', query) # 去掉多余空格 fit_cons = set # 存储匹配的合取子句 # 解析查询字符串 conjunction = parseConjunction(query) if 0 == len(conjunction.assigns): return fit_cons # 没有匹配条件,返回空集 # 查找匹配的布尔条件 size = min(max(ass_con_inverted_index.keys), conjunction.size) for K in xrange(size, -1, -1): # 如果 K=0,专门处理无条件和 NOT 条件 if K == 0: for key, conid_relations in ass_con_inverted_index[0].items: for conid, relation in conid_relations: if relation == "IN": fit_cons.add(conid) # 将无条件 IN 条件的 ID 添加到 fit_cons 中 elif relation == "NOT": # 如果是 NOT 条件,确保从 fit_cons 中移除 fit_cons.discard(conid) # 排除包含 NOT 条件的 ID continue # 完成 K=0 处理后继续下一个 K 值 Plists = for ass in conjunction.assigns: key = (ass.term.key, ass.term.value) if key in ass_con_inverted_index[K]: value = ass_con_inverted_index[K][key] plist = Plist(key, value) Plists.append(plist) # 如果倒排列表条目数不足,跳过 if len(Plists)

来源:幕后传奇

相关推荐