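MariaDB: duplicates are being inserted

I have the following Python code that checks whether a MariaDB record already exists and then inserts it. However, duplicates are still being inserted. Is there something wrong with the code, or is there a better way to do this? I am new to using Python with MariaDB.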

import mysql.connector as mariadb 
from hashlib import sha1 

mariadb_connection = mariadb.connect(user='root', password='', database='tweets_db') 

# The values below are retrieved from Twitter API using Tweepy 
# For simplicity, I've provided some sample values 
id = '1a23bas' 
tweet = 'Clear skies' 
longitude = -84.361549 
latitude = 34.022003 
created_at = '2017-09-27' 
collected_at = '2017-09-27' 
collection_type = 'stream' 
lang = 'us-en' 
place_name = 'Roswell' 
country_code = 'USA' 
cronjob_tag = 'None' 
user_id = '23abask' 
user_name = 'tsoukalos' 
user_geoenabled = 0 
user_lang = 'us-en' 
user_location = 'Roswell' 
user_timezone = 'American/Eastern' 
user_verified = 1 
# sha1() needs bytes on Python 3, so encode the text first
tweet_hash = sha1(tweet.encode('utf-8')).hexdigest()

cursor = mariadb_connection.cursor(buffered=True) 
cursor.execute("SELECT Count(id) FROM tweets WHERE tweet_hash = %s", (tweet_hash,)) 
if cursor.fetchone()[0] == 0: 
    cursor.execute("INSERT INTO tweets(id,tweet,tweet_hash,longitude,latitude,created_at,collected_at,collection_type,lang,place_name,country_code,cronjob_tag,user_id,user_name,user_geoenabled,user_lang,user_location,user_timezone,user_verified) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", (id,tweet,tweet_hash,longitude,latitude,created_at,collected_at,collection_type,lang,place_name,country_code,cronjob_tag,user_id,user_name,user_geoenabled,user_lang,user_location,user_timezone,user_verified)) 
    mariadb_connection.commit() 
# Close the cursor in both cases; a bare `return` is not valid outside a function
cursor.close()

Here is the code for the table.

CREATE TABLE tweets (
    id VARCHAR(255) NOT NULL, 
    tweet VARCHAR(255) NOT NULL, 
    tweet_hash VARCHAR(255) DEFAULT NULL, 
    longitude FLOAT DEFAULT NULL, 
    latitude FLOAT DEFAULT NULL, 
    created_at DATETIME DEFAULT NULL, 
    collected_at DATETIME DEFAULT NULL, 
    collection_type enum('stream','search') DEFAULT NULL, 
    lang VARCHAR(10) DEFAULT NULL, 
    place_name VARCHAR(255) DEFAULT NULL, 
    country_code VARCHAR(5) DEFAULT NULL, 
    cronjob_tag VARCHAR(255) DEFAULT NULL, 
    user_id VARCHAR(255) DEFAULT NULL, 
    user_name VARCHAR(20) DEFAULT NULL, 
    user_geoenabled TINYINT(1) DEFAULT NULL, 
    user_lang VARCHAR(10) DEFAULT NULL, 
    user_location VARCHAR(255) DEFAULT NULL, 
    user_timezone VARCHAR(100) DEFAULT NULL, 
    user_verified TINYINT(1) DEFAULT NULL 
);


2017-09-26 Ham Sam

Let's see the output of 'SHOW CREATE TABLE mytable' and the actual SQL that is generated.

Sure, I have updated the question with the actual code snippet and the CREATE TABLE syntax. Thanks.

If you are looking for unique tweets, make 'tweet' 'UNIQUE', or at least 'INDEXed'. The "hash" only adds complexity.

Add a unique constraint to the tweet_hash field.

alter table tweets modify tweet_hash varchar(255) UNIQUE ;


2017-09-27 14:46:13 sfgroups

I also had to add an exception handler in the Python code to ignore the duplicate-insert error: 'except mariadb.IntegrityError'
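The unique constraint plus the exception handler mentioned above can be sketched roughly as follows. This is only an illustration, not the asker's final code: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in so that it runs without a server, the table is cut down to three columns, and the helper name `insert_tweet` is made up for the example. With `mysql.connector` the placeholders would be `%s` and the exception to catch would be `mariadb.IntegrityError`, as in the comment above.

```python
import sqlite3
from hashlib import sha1


def insert_tweet(conn, tweet_id, tweet):
    """Hypothetical helper: return True if stored, False if it was a duplicate."""
    tweet_hash = sha1(tweet.encode("utf-8")).hexdigest()
    try:
        # "with conn" commits on success and rolls back on an exception
        with conn:
            conn.execute(
                "INSERT INTO tweets (id, tweet, tweet_hash) VALUES (?, ?, ?)",
                (tweet_id, tweet, tweet_hash),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected the row, so the duplicate is skipped
        return False


# In-memory stand-in for the real database (an assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets ("
    "id TEXT NOT NULL, tweet TEXT NOT NULL, tweet_hash TEXT UNIQUE)"
)

first = insert_tweet(conn, "1a23bas", "Clear skies")
second = insert_tweet(conn, "9z87xyz", "Clear skies")  # same text, same hash
row_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Letting the database enforce uniqueness means no separate SELECT is needed before the INSERT, so two writers cannot both pass the check and then both insert the same row.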

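Every table should have a PRIMARY KEY. Should id be it? (The CREATE TABLE does not say so.) A PK is, by definition, UNIQUE, so inserting a duplicate would cause an error.

Also:

Why have tweet_hash? Index tweet instead.
Don't say 255 when there is a specific limit smaller than that.
user_id and user_name should be in another "lookup" table, not in this table.
Does user_verified belong to the user, or to each tweet?
If you expect millions of tweets, this table needs to be made smaller and properly indexed, otherwise you will run into performance problems.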
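A rough sketch of the PRIMARY KEY and lookup-table suggestions is below. It is an illustration under stated assumptions, not a recommended final schema: it again uses the standard-library `sqlite3` module in memory as a stand-in for MariaDB, keeps only a handful of the original columns, and the `users` table name and its layout are invented for the example.

```python
import sqlite3

# In-memory stand-in for the real database (an assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical lookup table holding one row per user
    CREATE TABLE users (
        user_id       TEXT PRIMARY KEY,
        user_name     TEXT,
        user_verified INTEGER
    );
    -- Tweets keep only a reference to the user; id is the PRIMARY KEY
    CREATE TABLE tweets (
        id      TEXT PRIMARY KEY,
        tweet   TEXT NOT NULL,
        user_id TEXT REFERENCES users(user_id)
    );
    """
)

conn.execute("INSERT INTO users VALUES (?, ?, ?)", ("23abask", "tsoukalos", 1))
conn.execute("INSERT INTO tweets VALUES (?, ?, ?)", ("1a23bas", "Clear skies", "23abask"))

# Inserting the same id again violates the PRIMARY KEY and raises an error
try:
    conn.execute("INSERT INTO tweets VALUES (?, ?, ?)", ("1a23bas", "Clear skies", "23abask"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# User details are joined in from the lookup table when needed
rows = conn.execute(
    "SELECT t.tweet, u.user_name FROM tweets t JOIN users u ON u.user_id = t.user_id"
).fetchall()
```

Here the duplicate is keyed on the tweet id rather than on a hash of the text, which is the behavior the PRIMARY KEY suggestion implies; whether that matches the intended notion of "duplicate" depends on the data.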


2017-09-27 15:26:39

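Thanks for the good points. Here is some reasoning and a few changes based on your suggestions. 1) 'tweet_hash' allows a fast lookup, instead of actually searching a full-text string containing multiple words. 2) I reduced the 'tweet_hash' size to 50; you correctly pointed out that the column sizes were not optimized. 3) You are right, I should restructure this into 2 tables. 4) I do have indexes, but the table does need to become smaller.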

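The randomness of a "hash" keeps it from being any faster. The overhead of the unnecessary column and index more than offsets any advantage.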
