2016-11-10 46 views
2

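I have the following problem: for each email (variable 'sending'), I need to find the first time a user clicked on it and put a 1 in the corresponding row. In other words, I need to find the earliest occurrence.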

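The dataset has a few thousand users ('hash') who clicked on emails that are part of a newsletter. I tried to group the rows by 'sending' and 'hash' and then find the earliest date, but I couldn't get it to work.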

So I went for a small, nasty solution instead, which however returns something strange:

My dataset (relevant variables):

>>> clicks[['datetime','hash','sending']].head() 

      datetime        hash sending 
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5 
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d  5 
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477  5 
3 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7  5 
4 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad  5 

There are 6 sending rounds, and 'datetime' has dtype datetime64[ns].

This is how I did it:

clicks['first'] = 0 

for hash in clicks['hash'].unique(): 
    t = clicks.ix[clicks.hash==hash, ['hash','datetime','sending']] 
    part = t['sending'].unique() 

    for i in part: 
        temp = t.ix[t.sending == i,'datetime'] 
        clicks.ix[t[t.datetime == np.min(temp)].index.values,'first']=1 

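First of all, I don't think this is very Pythonic, and it is rather slow. But mainly, it returns a strange type! There are 0.0 and 1.0 values, but I can't work with them: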

>>> type(clicks.first) 
    <type 'instancemethod'> 

>>> clicks.loc[clicks.first==1] 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1296, in __getitem__ 
    return self._getitem_axis(key, axis=0) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1467, in _getitem_axis 
    return self._get_label(key, axis=axis) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 93, in _get_label 
    return self.obj._xs(label, axis=axis) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1749, in xs 
    loc = self.index.get_loc(key) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py", line 1947, in get_loc 
    return self._engine.get_loc(self._maybe_cast_indexer(key)) 
    File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154) 
    File "pandas/index.pyx", line 156, in pandas.index.IndexEngine.get_loc (pandas/index.c:3977) 
    File "pandas/index.pyx", line 373, in pandas.index.Int64Engine._check_type (pandas/index.c:7634) 
KeyError: False 

So, any ideas, please? Many thanks!

----- UPDATE:------

INSTALLED VERSIONS 
    ------------------ 
    commit: None 
    python: 2.7.12.final.0 
    python-bits: 64 
    OS: Darwin 
    OS-release: 15.6.0 
    machine: x86_64 
    processor: i386 
    byteorder: little 
    LC_ALL: None 
    LANG: en_US.UTF-8 

    pandas: 0.18.1 

Answers

3

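I think you need groupby with apply, comparing each value with the minimum of its group. The output is boolean, so it needs to be converted to int (0 and 1) with astype: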

clicks = pd.DataFrame({'hash': {0: '0b1f4745df5925dfb1c8f53a56c43995', 1: '0a73d5953ebf5826fbb7f3935bad026d', 2: '605cebbabe0ba1b4248b3c54c280b477', 3: '0b1f4745df5925dfb1c8f53a56c43995', 4: '0a73d5953ebf5826fbb7f3935bad026d', 5: '605cebbabe0ba1b4248b3c54c280b477', 6: 'd26d61fb10c834292803b247a05b6cb7', 7: '48f8ab83e8790d80af628e391f3325ad'}, 'sending': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5}, 'datetime': {0: pd.Timestamp('2016-11-01 19:13:34'), 1: pd.Timestamp('2016-11-01 10:47:14'), 2: pd.Timestamp('2016-10-31 19:09:21'), 3: pd.Timestamp('2016-11-01 19:13:34'), 4: pd.Timestamp('2016-11-01 11:47:14'), 5: pd.Timestamp('2016-10-31 19:09:20'), 6: pd.Timestamp('2016-10-31 13:42:36'), 7: pd.Timestamp('2016-10-31 10:46:30')}}) 
print (clicks) 
      datetime        hash sending 
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5 
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d  5 
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477  5 
3 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5 
4 2016-11-01 11:47:14 0a73d5953ebf5826fbb7f3935bad026d  5 
5 2016-10-31 19:09:20 605cebbabe0ba1b4248b3c54c280b477  5 
6 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7  5 
7 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad  5 
# Convert the 'datetime' column if its dtype is not already datetime (not necessary for this sample) 
clicks.datetime = pd.to_datetime(clicks.datetime) 
clicks['first'] = clicks.groupby(['hash','sending'])['datetime'] \ 
         .apply(lambda x: x == x.min()) \ 
         .astype(int) 
print (clicks) 
      datetime        hash sending first 
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5  1 
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d  5  1 
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477  5  0 
3 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5  1 
4 2016-11-01 11:47:14 0a73d5953ebf5826fbb7f3935bad026d  5  0 
5 2016-10-31 19:09:20 605cebbabe0ba1b4248b3c54c280b477  5  1 
6 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7  5  1 
7 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad  5  1 
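
The same flag can also be computed without apply. The following is only a sketch, assuming a reasonably recent pandas: transform('min') broadcasts each group's minimum back onto the original rows, so the result lines up with the frame's own index, which the apply-based version may not do in newer pandas releases. The short hash values are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, shaped like the sample above).
clicks = pd.DataFrame({
    'hash': ['a', 'a', 'b', 'b', 'c'],
    'sending': [5, 5, 5, 5, 5],
    'datetime': pd.to_datetime([
        '2016-11-01 19:13:34', '2016-11-01 10:47:14',
        '2016-10-31 19:09:21', '2016-10-31 19:09:20',
        '2016-10-31 10:46:30',
    ]),
})

# Earliest click per (hash, sending), repeated on every row of its group.
earliest = clicks.groupby(['hash', 'sending'])['datetime'].transform('min')

# 1 where the row's timestamp equals its group's minimum, otherwise 0.
clicks['first'] = (clicks['datetime'] == earliest).astype(int)
print(clicks)
```

If several rows in a group share the minimum timestamp, all of them are flagged with 1, just as in the apply-based version.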


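Wowza, thanks! I tried a lambda but couldn't get it to work; I didn't know how to select the minimum from it. So this looks good, but I still can't subset on it and I get the same error, even though 'clicks.first' ends up being integer. Do you know why?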


Maybe you have a problem with duplicated minimum values. Does it work fine on the sample but not on the real data? – jezrael


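There can't be duplicates for each 'hash' and 'sending'. The subsetting error says: 'TypeError: cannot do positional indexing on with these indexers [False]', so it looks like it is no longer a 'DataFrame'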
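
A likely explanation for the error discussed in these comments, based on the `<type 'instancemethod'>` output shown in the question: 'clicks.first' does not refer to the new column but to a DataFrame method with the same name, so 'clicks.first == 1' evaluates to the plain boolean False, and '.loc[False]' then raises KeyError: False. Bracket access avoids the name clash. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Tiny frame with a column literally named 'first' (hypothetical values).
clicks = pd.DataFrame({
    'hash': ['a', 'a', 'b'],
    'first': [0, 1, 1],
})

# Bracket access always refers to the column, whatever its name is.
first_clicks = clicks[clicks['first'] == 1]
print(first_clicks)

# Attribute access can resolve to a DataFrame method of the same name instead.
# Whether a method called 'first' exists depends on the pandas version.
print(callable(getattr(pd.DataFrame, 'first', None)))
```

Renaming the column (for example to 'is_first', a name chosen here only for illustration) would also make attribute access unambiguous.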

0

Note: I'm not familiar with the pandas module, but I do work with Python regularly (systems engineering).

Why don't you just use the datetime module? You can easily sort by timestamp. For example:

Python 2.7.12 (default, Oct 26 2016, 11:37:25) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import datetime 
>>> fmt = '%Y-%m-%d %H:%M:%S' 
>>> timestamps = ['2016-11-01 19:13:34', '2016-11-01 10:47:14', 
...    '2016-10-31 19:09:21', '2016-10-31 13:42:36', 
...    '2016-10-31 10:46:30'] 
>>> def compare_dates(d1, d2): 
...     # Parse both strings so the comparison is on datetimes, not on text 
...     d1_dt = datetime.datetime.strptime(d1, fmt) 
...     d2_dt = datetime.datetime.strptime(d2, fmt) 
...     if d1_dt > d2_dt: 
...         return 1 
...     elif d1_dt == d2_dt: 
...         return 0 
...     else: 
...         return -1 
... 
>>> timestamps.sort(cmp=compare_dates) 
>>> timestamps 
['2016-10-31 10:46:30', '2016-10-31 13:42:36', '2016-10-31 19:09:21', '2016-11-01 10:47:14', '2016-11-01 19:13:34'] 
>>> 

As you can see, sorting dates with the datetime module is easy. It seems trivial to write a comparison function and sort by date to find the earliest event.
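
Note that the 'cmp' argument of 'sort' exists only in Python 2. As a rough sketch of the same idea in Python 3, without pandas, the earliest timestamp per (hash, sending) pair can be tracked in a dictionary and then used to flag each record. The records and the short hash values below are hypothetical.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

# (hash, sending, timestamp) records; hypothetical values for illustration.
records = [
    ('a', 5, '2016-11-01 19:13:34'),
    ('a', 5, '2016-11-01 10:47:14'),
    ('b', 5, '2016-10-31 19:09:21'),
    ('b', 5, '2016-10-31 19:09:20'),
]

# First pass: remember the earliest datetime seen for each (hash, sending).
earliest = {}
for user, sending, stamp in records:
    when = datetime.strptime(stamp, FMT)
    key = (user, sending)
    if key not in earliest or when < earliest[key]:
        earliest[key] = when

# Second pass: 1 if the record is the earliest click of its group, else 0.
flags = [
    int(datetime.strptime(stamp, FMT) == earliest[(user, sending)])
    for user, sending, stamp in records
]
print(flags)
```

This makes two passes over the data, which is fine for a few thousand users but will be slower than the vectorized pandas approach on large frames.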
