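Python: combining str.contains and merge in pandas
I have two dataframes that look roughly like the ones below (the 'Content' column in df1 actually holds the full text of an article, not just a single sentence as in my example):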
PDF Content
1 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2 1111 Johannes writes about apples and oranges and that's great.
3 8000 Content that cannot be matched to the anything in df2.
4 3993 There is an interesting piece on bananas plus kiwis as well.
...
(5709 entries in total)
Author Title
1 Johannes Apples and oranges
2 Peter Bananas and pears and grapes
3 Hannah Bananas plus kiwis
4 Helena Mangos and peaches
...
(10228 entries in total)
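I want to merge the two dataframes by searching for the 'Title' from df2 in the 'Content' of df1. A title counts as a match if it appears within the first 2500 characters of the content. Note: it is important to keep all entries from df1. By contrast, I only want to keep the df2 entries that match (i.e. a left join). Note: all titles are unique values.
Desired output (column order does not matter):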
Author Title PDF Content
1 Peter Bananas and pears and grapes 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2 Johannes Apples and oranges 1111 Johannes writes about apples and oranges and that's great.
3 NaN NaN 8000 Content that cannot be matched to the anything in df2.
4 Hannah Bananas plus kiwis 3993 There is an interesting piece on bananas plus kiwis as well.
...
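I think I need some combination of pd.merge and str.contains, but I can't figure out how!
What behavior do you want/expect if there are multiple matches? – ASGM
All entries in the Title column are unique. As for the Content column, I want the Title entry to be matched to the first match found in the Content entry. – NynkeLys
"First match found", as in...? First in the dataset (row by row) or first according to position in the string? – ctwheels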
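One possible way to approach this, as a minimal sketch rather than a definitive answer: first work out which title (if any) belongs to each article, then do an ordinary left merge on that title. The helper name `first_title` and its `limit` parameter are made up for illustration. Two assumptions are baked in: matching is case-insensitive (the sample titles are capitalized while the article text is not), and when several titles occur in the same article, the one that appears earliest in the text wins, which is what the desired output for PDF 1234 suggests.

```python
import pandas as pd

# Sample data copied from the question (only the first four rows of each frame).
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": [
        "Apples and oranges",
        "Bananas and pears and grapes",
        "Bananas plus kiwis",
        "Mangos and peaches",
    ],
})


def first_title(content, titles, limit=2500):
    """Return the title that occurs earliest in content[:limit], or None.

    Hypothetical helper: case-insensitive plain substring search.
    """
    head = content[:limit].lower()
    best_title, best_pos = None, None
    for title in titles:
        pos = head.find(title.lower())
        if pos != -1 and (best_pos is None or pos < best_pos):
            best_title, best_pos = title, pos
    return best_title


titles = df2["Title"].tolist()

# Attach the matched title to each article, leaving df1 itself untouched.
matched = df1.assign(Title=[first_title(c, titles) for c in df1["Content"]])

# Left join: every df1 row is kept, only matching df2 rows are pulled in.
result = matched.merge(df2, on="Title", how="left")
print(result[["Author", "Title", "PDF"]])
```

This compares every article against every title, so with 5709 articles and 10228 titles it performs tens of millions of substring searches. That is slow but should still be workable; a faster variant would need a different technique than the plain loop shown here.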