2017-10-18 124 views
2

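Python: combining str.contains and merge in pandas

I have two dataframes that look a bit like the ones below (the 'Content' column in df1 actually holds the full text of an article, not just a single sentence as in my example):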

PDF  Content 
1 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun! 
2 1111 Johannes writes about apples and oranges and that's great. 
3 8000 Content that cannot be matched to the anything in df2.  
4 3993 There is an interesting piece on bananas plus kiwis as well. 
    ... 

(df1, 5709 entries in total)

Author  Title 
1 Johannes  Apples and oranges 
2 Peter   Bananas and pears and grapes 
3 Hannah  Bananas plus kiwis 
4 Helena  Mangos and peaches 
    ... 

(df2, 10228 entries in total)

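I want to merge the two dataframes by searching for the 'Title' from df2 in the 'Content' of df1. If the title appears within the first 2500 characters of the content, it is a match. Note: it is important to keep all entries from df1. In contrast, I only want to keep the df2 entries that match (i.e. a left join). Note: all titles are unique values.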

Desired output (column order does not matter):

Author  Title      PDF  Content 
1 Peter  Bananas and pears and grapes 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun! 
2 Johannes Apples and oranges   1111 Johannes writes about apples and oranges and that's great. 
3 NaN  NaN       8000 Content that cannot be matched to the anything in df2.  
4 Hannah  Bananas plus kiwis   3993 There is an interesting piece on bananas plus kiwis as well. 
    ... 

I think I need some combination of pd.merge and str.contains, but I can't figure out how!

+1

What behavior do you want or expect if there are multiple matches? – ASGM

+0

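All entries in the Title column are unique. As for the Content column, I want each Title entry to be matched to the first match found among the Content entries. – NynkeLys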

+0

"First match found", as in...? First in the dataset (row by row), or first by position within the string? – ctwheels

Answers

0

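Warning: this solution may be slow :).
1. Get the list of titles
2. Create an idx column in df1 based on the order of the title list
3. Concat df1 and df2 on idx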

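import itertools
import pandas as pd

# Lower-cased titles; assumes every entry in df2.Title is a string.
lst = [item.lower() for item in df2.Title.tolist()]
# Unmatched rows get fresh index values past the end of the title list.
unmatched = itertools.count(len(lst))

def func(row):
    content = row[:2500].lower()
    for i, item in enumerate(lst):
        if item in content:
            return i  # position of the first title found in the content
    return next(unmatched)

df1 = df1.assign(idx=df1.Content.apply(func))

# Align on the index; assumes each title matches at most one Content row,
# because idx has to be unique for the column-wise concat to work.
res = pd.concat([df1.set_index('idx'), df2], axis=1)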

Output

 PDF           Content Author \ 
0 1111.0 Johannes writes about apples and oranges and t... Johannes 
1 1234.0 This article is about bananas and pears and gr...  Peter 
2 3993.0 There is an interesting piece on bananas plus ... Hannah 
3  NaN            NaN Helena 
4 8000.0 Content that cannot be matched to the anything...  NaN 

          Title 
0   Apples and oranges 
1 Bananas and pears and grapes 
2   Bananas plus kiwis 
3   Mangos and peaches 
4       NaN 
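The requested left join can also be written as an explicit merge. The following is only a sketch, not the answerer's method: it assumes that "first match" means the title occurring earliest in the text (the question leaves this open), and the helper `first_title` and its `limit` parameter are hypothetical names introduced for illustration.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Keep only real strings so a NaN or numeric title cannot raise AttributeError.
titles = [t for t in df2["Title"] if isinstance(t, str)]

def first_title(content, limit=2500):
    """Return the title found earliest in the first `limit` characters, else None."""
    head = str(content)[:limit].lower()
    # (position, title) pairs for every title that occurs in the head.
    hits = [(head.find(t.lower()), t) for t in titles if t.lower() in head]
    return min(hits)[1] if hits else None

# Attach the matched title to each article, then left-join the author data.
matched = df1.assign(Title=df1["Content"].map(first_title))
result = matched.merge(df2, on="Title", how="left")
print(result[["Author", "Title", "PDF"]])
```

Every row of df1 survives the merge, and articles without a matching title end up with missing values in the df2 columns. The loop over all titles for every article is still quadratic, so this is unlikely to be faster than the index-based version.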
+0

Even at the very start I get the following error (both dataframes contain only non-null objects): AttributeError Traceback (most recent call last) in () 2 # in the first 2500 characters of the second df. ----> 4 lst = [item.lower() for item in df2.Title.tolist()] 5 end = len(lst) 6 def func(row): AttributeError: 'float' object has no attribute 'lower'. Any ideas? – NynkeLys

+0

@NynkeLys change the content to str – galaxyan

+0

I used the following command but still get the same error: df1.Content = df1.Content.astype('str') – NynkeLys
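The traceback points at `df2.Title`, not `df1.Content`, so converting the Content column would not be expected to change it. A minimal sketch for finding the offending titles follows; the three-row `df2` here is hypothetical and only reproduces the situation, and `not_str` and `clean` are illustrative names.

```python
import pandas as pd

# Hypothetical df2 with one missing title, to reproduce the error condition.
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Helena"],
    "Title": ["Apples and oranges", float("nan"), "Mangos and peaches"],
})

# Flag the rows whose Title is not a Python string (NaN is a float).
not_str = ~df2["Title"].map(lambda t: isinstance(t, str))
print(df2[not_str])

# Drop those rows before lower-casing the titles.
clean = df2[~not_str]
lst = [item.lower() for item in clean["Title"].tolist()]
print(lst)
```

Whether such rows should be dropped or repaired depends on the real data; the sketch only shows where the float comes from.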

0

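You can do a full Cartesian join (cross product) of df1 and df2, then filter. Since you can't do a hash lookup, it shouldn't be any slower than the equivalent "join" statement: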

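import pandas as pd

# Both frames need the same constant key so the merge yields every (df1, df2) pair.
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# True where the title occurs in the first 2500 characters of the content.
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]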

This produces the table:

 PDF Author       Title \ 
0 1234.0 Johannes   Apples and oranges 
1 1234.0  Peter Bananas and pears and grapes 
4 1111.0 Johannes   Apples and oranges 
14 3993.0 Hannah   Bananas plus kiwis 

               Content 
0 This article is about bananas and pears and gr... 
1 This article is about bananas and pears and gr... 
4 Johannes writes about apples and oranges and t... 
14 There is an interesting piece on bananas plus ... 
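The filtered cross product keeps only matching pairs, so the unmatched article (PDF 8000) is missing from it. One way to get the left join asked for in the question is to merge the matches back onto df1. This is a sketch under assumptions: it treats PDF as a unique identifier in df1, and it uses `how="cross"`, which needs pandas 1.2 or newer.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Every (article, title) pair; older pandas needs a shared constant key instead.
pairs = df1.merge(df2, how="cross")
hit = pairs.apply(
    lambda row: row["Title"].lower() in row["Content"][:2500].lower(), axis=1
)
matches = pairs.loc[hit, ["PDF", "Author", "Title"]]

# Left-join the matches back onto df1 so unmatched articles are kept.
result = df1.merge(matches, on="PDF", how="left")
print(result[["PDF", "Author", "Title"]])
```

An article that contains several titles (PDF 1234 here) appears once per match, so a tie-breaking rule is still needed if exactly one row per article is wanted. At the sizes given in the question the cross product has 5709 × 10228 rows, roughly 58 million, which may not fit in memory when each row carries a full article.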
+0

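Thanks! I tried it but got the following error: ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series. Any ideas? – NynkeLys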

+0

Any ideas? Running your code keeps producing an error. I am using Python 2.7, and it happens even with exactly the same dfs I created for my question. – NynkeLys