文章目录[隐藏]
敏感时期,新接到了个敏感词列表,要求把线上数据全扫一下,正常按业务逻辑挨个处理表处理字段还是相当费劲的,所以改了下策略,不如直接扫特定的库,先查库里有啥表,再挨个表遍历数据,如果数据的某个字段是字符串类型就过一下过滤器。
为了方便服务器运行,开发语言直接选 Python。
敏感词过滤算法都是成熟的,从 GitHub 随便找了个 SensitivePy,就没必要自己从头写了。
Python3.6 的 MySQLdb 有点问题,所以此处又重新装了下 pymysql,如果 MySQLdb 安装顺利就不需要前两行代码了
#-*- coding:utf-8 -*-
import pymysql
pymysql.install_as_MySQLdb()
import MySQLdb
下面是 DFA 过滤算法
class GFW(object):
def __init__(self):
self.d = {}
#give a list of "ming gan ci"
def set(self,keywords):
p = self.d
q = {}
k = ''
for word in keywords:
word += chr(11)
p = self.d
for char in word:
char = char.lower()
if p=='':
q[k] = {}
p = q[k]
if not (char in p):
p[char] = ''
q = p
k = char
p = p[char]
pass
def check(self,text):
"""
>>> gfw = GFW()
>>> gfw.set(["abd","defz","bcz"])
>>> print gfw.check("xabdabczabdxaadefz")
[(1, 3, 'abd'), (5, 3, 'bcz'), (8, 3, 'abd'), (14, 4, 'defz')]
"""
p = self.d
i = 0
j = 0
result = []
ln = len(text)
while i+j<ln:
t = text[i+j].lower()
#print i,j,hex(ord(t))
if not (t in p):
j = 0
i += 1
p = self.d
continue
p = p[t]
j+=1
#print p,i,j
if chr(11) in p:
p = self.d
result.append((i,j,text[i:i+j]))
i = i+j
j = 0
return result
基本工作也就这些,剩下的直接连库扫表就可以了
class MySql:
def __init__(self, host, port, username, password):
self.conn = MySQLdb.connect(host=host, port=int(port), user=username, passwd=password, charset="utf8")
self.databases = []
for each in self.query("show databases"):
print("加载数据库 - ", each[0])
def query(self, sql, args=()):
cursor = self.conn.cursor()
cursor.execute(sql, args)
return cursor.fetchall()
gfw = GFW()
gfw.set(["敏感词一", "敏感词二"])
db = MySql("xxx", xxx, "xxx", "xxx")
db.query("use xxx")
for each in db.query("show tables"):
print("处理表:", each[0], "开始")
total = 0
startID = 0
while True:
list = db.query("select * from %s where id > %s order by id asc limit 1000" % (each[0], startID))
if len(list) == 0:
break
for record in list:
for field in record:
if isinstance(field, str) and len(gfw.check(field)) > 0:
print(each, record[0], gfw.check(field), "违规")
startID = record[0]
total = total + 1
print("处理表:", each[0], "完成,共 ", total , "条记录")
上面的代码演示了如何处理一个库里的所有表,如果库比较多可以在 show databases
后直接开循环,具体就不演示了。