Python: combining str.contains and merge in pandas

I have two dataframes that look somewhat like the ones below (the 'Content' column in df1 actually holds the full text of an article rather than, as in my example, a single sentence):

PDF  Content 
1 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun! 
2 1111 Johannes writes about apples and oranges and that's great. 
3 8000 Content that cannot be matched to the anything in df1.  
4 3993 There is an interesting piece on bananas plus kiwis as well. 
    ...

(5709 entries in total)

Author  Title 
1 Johannes  Apples and oranges 
2 Peter   Bananas and pears and grapes 
3 Hannah  Bananas plus kiwis 
4 Helena  Mangos and peaches 
    ...

(10228 entries in total)

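I want to merge the two dataframes by searching for the 'Title' from df2 in the 'Content' of df1. If the title appears within the first 2500 characters of the content, it is a match. Note: it is important to keep all entries from df1. In contrast, I only want to keep the df2 entries that match (i.e. a left join). Note: all titles are unique values.

Desired output (column order does not matter):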

Author  Title      PDF  Content 
1 Peter  Bananas and pears and grapes 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun! 
2 Johannes Apples and oranges   1111 Johannes writes about apples and oranges and that's great. 
3 NaN  NaN       8000 Content that cannot be matched to the anything in df2.  
4 Hannah  Bananas plus kiwis   3993 There is an interesting piece on bananas plus kiwis as well. 
    ...

I think I need a combination of pd.merge and str.contains, but I can't figure out how!

Source

2017-10-18 NynkeLys

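What behavior do you want/expect if there are multiple matches? – ASGM

All entries in the Title column are unique. As for the Content column, I want each title to be matched with the first content entry in which it is found. – NynkeLys

"First match found", as in...? First in the dataset (row by row), or first by position in the string? – ctwheels

Warning: this solution may be slow :).
1. Get the list of titles
2. Create an index based on the order of the title list
3. Concat df1 and df2 on idx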

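    import itertools
    import pandas as pd

    # Lower-cased titles; .str.lower() leaves missing (NaN) titles as NaN
    lst = df2.Title.str.lower().tolist()
    # Fresh index labels for unmatched rows, starting right after df2's last position
    unmatched = itertools.count(len(lst))

    def func(row):
        # `row` is one Content string; only its first 2500 characters are searched
        content = str(row)[:2500].lower()
        for i, item in enumerate(lst):
            # Skip non-string titles (NaN is a float and has no .lower())
            if isinstance(item, str) and item in content:
                return i  # position in df2, assuming df2 has the default 0..n-1 index
        return next(unmatched)

    df1 = df1.assign(idx=df1.Content.apply(func))

    res = pd.concat([df1.set_index('idx'), df2], axis=1)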

Output

 PDF           Content Author \ 
0 1111.0 Johannes writes about apples and oranges and t... Johannes 
1 1234.0 This article is about bananas and pears and gr...  Peter 
2 3993.0 There is an interesting piece on bananas plus ... Hannah 
3  NaN            NaN Helena 
4 8000.0 Content that cannot be matched to the anything...  NaN 

          Title 
0   Apples and oranges 
1 Bananas and pears and grapes 
2   Bananas plus kiwis 
3   Mangos and peaches 
4       NaN
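
As a possible alternative to the loop above (not part of the original answer), the search can be pushed into a single regular expression built from all titles, which is closer to the str.contains idea in the question. This is only a sketch on the sample data: the helper column name `title_key` is made up for illustration, and a pattern built from roughly 10,000 titles may still be slow. Note that a regex alternation returns the leftmost match in the text, so "first" here means first by position in the string.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Ignore rows without a usable title, then build one alternation
# of all titles, escaped so they are matched literally.
titled = df2.dropna(subset=["Title"])
pattern = "(" + "|".join(re.escape(t) for t in titled["Title"]) + ")"

# Search only the first 2500 characters, ignoring case.
# str.extract yields the leftmost match, or NaN when nothing matches.
found = df1["Content"].astype(str).str[:2500].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)

# Join on lower-cased titles so the case of the matched text does not matter.
left = df1.assign(title_key=found.str.lower())
right = titled.assign(title_key=titled["Title"].str.lower())
result = left.merge(right, on="title_key", how="left").drop(columns="title_key")

print(result[["Author", "Title", "PDF"]])
```

Because the final step is a left merge, every df1 row is kept and unmatched articles get NaN for Author and Title.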

Source

2017-10-18 16:12:37 galaxyan

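Even at the very start I get the following error, although both dataframes contain only non-null objects: AttributeError Traceback (most recent call last) in () 2 # in the first 2500 characters of the second df. ----> 4 lst = [item.lower() for item in df2.Title.tolist()] 5 end = len(lst) 6 def func(row): AttributeError: 'float' object has no attribute 'lower'. Any ideas? – NynkeLys

@NynkeLys change the content to str – galaxyan

I used the following command but still get the same error: df1.Content = df1.Content.astype('str') – NynkeLys

You can do a full Cartesian join / cross product of df1 with df2, then filter. Since you can't do a hash lookup, it shouldn't be any slower than an equivalent "join" statement: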
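
A note on the error above: the traceback points at df2.Title, not df1.Content, so converting Content would not be expected to change anything. One plausible cause, assuming some Title entries are missing, is that NaN is stored as a float and has no .lower(). The sketch below uses a small hypothetical df2 with one missing title to show how such rows could be found and handled.

```python
import numpy as np
import pandas as pd

# Hypothetical df2 with one missing title, to reproduce the situation.
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Helena"],
    "Title": ["Apples and oranges", np.nan, "Mangos and peaches"],
})

# Rows whose Title is not a string (NaN is a float) would break item.lower().
not_str = df2[~df2["Title"].map(lambda t: isinstance(t, str))]
print(not_str)

# Option 1: drop the rows without a usable title before building the list.
clean = df2.dropna(subset=["Title"])
lst = [item.lower() for item in clean["Title"].tolist()]

# Option 2: keep every row and use the vectorised .str.lower(),
# which passes missing values through instead of raising.
lowered = df2["Title"].str.lower()
```

Dropping rows changes the positions in the list, so Option 2 may be the safer choice when list positions are later used as an index into df2.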

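import pandas as pd

# Both frames need the SAME constant key so the merge pairs every df1 row with every df2 row
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean mask: does the title occur in the first 2500 characters?
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]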

which produces the table:

 PDF Author       Title \ 
0 1234.0 Johannes   Apples and oranges 
1 1234.0  Peter Bananas and pears and grapes 
4 1111.0 Johannes   Apples and oranges 
14 3993.0 Hannah   Bananas plus kiwis 

               Content 
0 This article is about bananas and pears and gr... 
1 This article is about bananas and pears and gr... 
4 Johannes writes about apples and oranges and t... 
14 There is an interesting piece on bananas plus ...

Source

2017-10-18 16:25:02 scnerd

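Thanks! I tried it, but got the following error: ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series. Any ideas? – NynkeLys

Any ideas? Running your code keeps producing an error. I'm using Python 2.7, and it happens even with dfs exactly the same as the ones I created for my question. – NynkeLys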
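
One plausible explanation for this ValueError, offered as a guess rather than a confirmed diagnosis: if the two constant key columns hold different values (for example 1 in one frame and 2 in the other), the inner merge has no rows at all, and applying a row-wise function to an empty frame does not give back a boolean column that can be assigned. The sketch below, on two small made-up frames, only demonstrates the first part of that, namely that mismatched keys give an empty merge.

```python
import pandas as pd

# Two small made-up frames standing in for df1 and df2.
left = pd.DataFrame({"Content": ["apples and oranges", "bananas"]})
right = pd.DataFrame({"Title": ["Apples and oranges", "Kiwis", "Bananas"]})

# Different constants: no key value is shared, so the inner merge is empty.
empty = pd.merge(left.assign(key=1), right.assign(key=2), on="key")
print(len(empty))  # 0

# Same constant: every left row is paired with every right row.
full = pd.merge(left.assign(key=1), right.assign(key=1), on="key")
print(len(full))  # 6
```

Checking len() of the merged frame before the apply step is a cheap way to tell whether this is what happened.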
