Using pandas to build a 2D table based on COUNTIF()s over a separate Excel sheet

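I want to build a 2D table based on COUNTIF() (and COUNTIFS()) results computed over values from another sheet. I managed to do this in Excel as a prototype, but I am stuck on two concepts in pandas: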

1. Emulating Excel COUNTIF() on pandas 
2. Dynamically build a new dataframe

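Note: COUNTIF() takes a range and a criterion as arguments. For example, if I have a list of colors and want to know how many times "Orange" appears in the list below: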

A 
Red 
Orange 
Blue 
Orange 
Black

then I would simply use the following formula:

COUNTIF(A1:A5, "Orange")

This should return 2.

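Of course, COUNTIF() can become more complex. In the COUNTIFS form, the chained arguments COUNTIFS(range1, criterion1, range2, criterion2, ...) are interpreted as an AND of the criteria. For example, if I want to count the females over 35 in a list like the following: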

A    B 
Female  19 
Female  40 
Male   45

then I would simply use the following formula:

COUNTIFS(A1:A3, "Female", B1:B3, ">35")

This should return 1.
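For comparison, here is a minimal sketch of how the two formulas above might be emulated in pandas with boolean masks. The names `colors`, `people`, `gender` and `age` are hypothetical stand-ins for the Excel ranges, not part of my actual data.

```python
import pandas as pd

# Hypothetical stand-ins for Excel column A (colors) and columns A/B (people).
colors = pd.Series(["Red", "Orange", "Blue", "Orange", "Black"])
people = pd.DataFrame({"gender": ["Female", "Female", "Male"],
                       "age": [19, 40, 45]})

# COUNTIF(range, "Orange"): summing a boolean mask counts its True values.
orange_count = int((colors == "Orange").sum())

# COUNTIFS(range1, "Female", range2, ">35"): combine the masks with & (logical AND).
female_over_35 = int(((people["gender"] == "Female") & (people["age"] > 35)).sum())

print(orange_count, female_over_35)
```

Each comparison produces a Series of True/False values, so summing it plays the role of the count.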

Back to my use case. This is the source table:

Product No Opening Date Closing Date Opening Month Closing Month 
0   1 2016-01-01 2016-06-30 2016-01-31 2016-06-30 
1   2 2016-01-01 2016-04-30 2016-01-31 2016-04-30 
2   3 2016-02-01 2016-06-30 2016-02-29 2016-06-30 
3   4 2016-02-01 2016-05-31 2016-02-29 2016-05-31 
4   5 2016-02-01 2099-12-31 2016-02-29 2099-12-31 
5   6 2016-01-01 2099-12-31 2016-01-31 2016-10-31 
6   7 2016-06-01 2016-07-31 2016-06-30 2016-07-31 
7   8 2016-06-01 2016-11-30 2016-06-30 2016-11-30 
8   9 2016-06-01 2016-07-31 2016-06-30 2016-07-31 
9   10 2016-06-01 2099-12-31 2016-06-30 2099-12-31

This is the 2D matrix I want to achieve:

  2016-01-31 2016-02-29 2016-03-31 2016-04-30 2016-05-31 \ 
2016-01-31   3   3   3   2   2 
2016-02-29   3   3   3   3   2 
2016-03-31   0   0   0   0   0 
2016-04-30   0   0   0   0   0 
2016-05-31   0   0   0   0   0 
2016-06-30   4   4   4   4   4 
2016-07-31   0   0   0   0   0 
2016-08-31   0   0   0   0   0 
2016-09-30   0   0   0   0   0 
2016-10-31   0   0   0   0   0 
2016-11-30   0   0   0   0   0 
2016-12-31   0   0   0   0   0 

      2016-06-30 2016-07-31 2016-08-31 2016-09-30 2016-10-31 \ 
2016-01-31   1   1   1   1   0 
2016-02-29   1   1   1   1   1 
2016-03-31   0   0   0   0   0 
2016-04-30   0   0   0   0   0 
2016-05-31   0   0   0   0   0 
2016-06-30   4   2   2   2   2 
2016-07-31   0   0   0   0   0 
2016-08-31   0   0   0   0   0 
2016-09-30   0   0   0   0   0 
2016-10-31   0   0   0   0   0 
2016-11-30   0   0   0   0   0 
2016-12-31   0   0   0   0   0 

      2016-11-30 2016-12-31 
2016-01-31   0   0 
2016-02-29   1   1 
2016-03-31   0   0 
2016-04-30   0   0 
2016-05-31   0   0 
2016-06-30   1   1 
2016-07-31   0   0 
2016-08-31   0   0 
2016-09-30   0   0 
2016-10-31   0   0 
2016-11-30   0   0 
2016-12-31   0   0

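Basically, I want to build a matrix of product survival over time. The vertical axis represents the month in which new products originated, while the horizontal axis shows how long those accounts persist.

For example, if 10 products are launched in January, the figure for January vs. January should be 10. If 1 of those 10 products is closed in February, the figure for January vs. February should be 9. If all remaining products are closed in June, then the January row should be 0 for June, July, August, and so on.

Products opened in February, March, April, etc. do not affect the January row.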

I managed to build the 2D matrix using the following Excel formula:

=COUNTIF(Accounts!$D$2:$D$11,Main!$A2)-COUNTIFS(Accounts!$D$2:$D$11,Main!$A2, Accounts!$E$2:$E$11,"<="&Main!B$1)

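(This populates the first cell.)

My initial strategy was to build a multidimensional list and populate it with several for loops, but I wonder whether there is a simpler (or more recommended) way to do this in pandas.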

Source

2017-09-27 James J.

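Since I don't have enough reputation to comment on your question yet, I will assume that the dates with the year 2099 in your data are typos.

I would also question how, in the 2016-06-30 row, 4 'Product No' values can exist in the earlier columns (i.e. 2016-01-31 through 2016-05-31).

If those are errors, then here is my solution:

First, make the data: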

import pandas as pd

# Make dataframe (the 2099 closing dates are assumed to be typos for 2016-12-31)
df = pd.DataFrame({
    'Product No': list(range(1, 11)),
    'Opening Date': ['2016-01-01'] * 2
                    + ['2016-02-01'] * 3
                    + ['2016-01-01']
                    + ['2016-06-01'] * 4,
    'Closing Date': ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31']
                    + ['2016-12-31'] * 2
                    + ['2016-07-31', '2016-11-30', '2016-07-31', '2016-12-31'],
    'Opening Month': ['2016-01-31'] * 2
                     + ['2016-02-29'] * 3
                     + ['2016-01-31']
                     + ['2016-06-30'] * 4,
    'Closing Month': ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31',
                      '2016-12-31', '2016-10-31', '2016-07-31', '2016-11-30',
                      '2016-07-31', '2016-12-31'],
})

# Reorder columns
df = df[['Product No', 'Opening Date', 'Closing Date',
         'Opening Month', 'Closing Month']]

# Convert dates to datetime (assigning with df[col] replaces the column,
# so it actually takes the datetime dtype)
for col in df.columns[1:]:
    df[col] = pd.to_datetime(df[col])

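Second, I create a "date range" dataframe that spans the minimum to the maximum date of the original dataset. I also include a 'Product No' column so that every product gets one row per date in the table: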

# Create date range dataframe with one row per month end
# (pd.offsets.MonthEnd() is used instead of the 'M' alias, which newer pandas deprecates)
daterange = pd.DataFrame({'daterange': pd.date_range(start=df['Opening Month'].min(),
                                                     end=df['Closing Month'].max(),
                                                     freq=pd.offsets.MonthEnd()),
                          'Product No': 1})

# Create 10 copies of the daterange (one per product) and concatenate
daterange10 = pd.concat([daterange] * 10, ignore_index=True)

# The cumulative sum within each date numbers the copies 1..10,
# so every 'Product No' ends up with one row per date
daterange10['Product No'] = daterange10.groupby('daterange')['Product No'].cumsum()

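Third, I merge the date range with the original df and restrict the rows to the dates on which a 'Product No' exists. Also note that if a product closes on the last day of a month, the comparison against the closing month must be greater than or equal to the date range, since (in my view) the product then existed for that whole month: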

# Merge df with daterange10
df = df.merge(daterange10, how='inner', on='Product No')

# Limit rows to when 'Opening Month' is <= 'daterange' and 'Closing Month' is >= 'daterange'
df = df[(df['Opening Month'] <= df['daterange']) &
        (df['Closing Month'] >= df['daterange'])]
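As a possible alternative to building ten copies of the date range by hand, the product-by-month pairing can be sketched with a cross join. This is only a sketch on a made-up two-product sample; it assumes pandas 1.2 or later, where `merge` accepts `how="cross"`, and the names `products`, `months`, `pairs` and `alive` are hypothetical.

```python
import pandas as pd

# Illustrative two-product sample; column names follow the answer above.
products = pd.DataFrame({
    "Product No": [1, 2],
    "Opening Month": pd.to_datetime(["2016-01-31", "2016-02-29"]),
    "Closing Month": pd.to_datetime(["2016-03-31", "2016-02-29"]),
})
months = pd.DataFrame({
    "daterange": pd.date_range("2016-01-31", "2016-04-30",
                               freq=pd.offsets.MonthEnd())
})

# how="cross" pairs every product with every month end.
pairs = products.merge(months, how="cross")

# Keep only the months in which each product exists (inclusive on both ends).
alive = pairs[(pairs["Opening Month"] <= pairs["daterange"])
              & (pairs["Closing Month"] >= pairs["daterange"])]

print(alive[["Product No", "daterange"]])
```

The cross join avoids hard-coding the number of products, at the cost of materializing every product and month combination before filtering.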

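Finally, I build a pivot table from the date values. Note that the vertical axis only includes dates on which products were actually opened: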

# Pivot on 'Opening Month', 'daterange'; count unique 'Product No'; fill NA with 0
df.pivot_table(index='Opening Month',
               columns='daterange',
               values='Product No',
               aggfunc=pd.Series.nunique).fillna(0)
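If the months without any openings should also appear as all-zero rows, one option is to reindex the pivot result against the full month range. The sketch below uses a small made-up table (`pivot`) in place of the real pivot output, so the values are illustrative only.

```python
import pandas as pd

# Full range of month ends wanted on both axes.
months = pd.date_range("2016-01-31", "2016-04-30", freq=pd.offsets.MonthEnd())

# Toy stand-in for the pivot result: only months with openings appear as rows.
pivot = pd.DataFrame(
    [[3, 3, 3, 2], [0, 3, 3, 3]],
    index=pd.to_datetime(["2016-01-31", "2016-02-29"]),
    columns=months,
)

# Reindex rows and columns against the full range; missing cells become 0.
full = pivot.reindex(index=months, columns=months, fill_value=0)

print(full)
```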

Source

2017-09-27 19:24:00

Much appreciated, I managed to learn a few tricks. Thanks!

Try loading your data into a pandas DataFrame, then use an iterative approach to build the product survival dataframe:

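import pandas as pd

mydata = pd.read_excel('mysourcedata.xlsx')

def product_survival(sourcedf, startdate, enddate):
    # Month-end dates used for both the rows and the columns
    daterange = pd.date_range(startdate, enddate, freq=pd.offsets.MonthEnd())

    # Preallocate the result so each cell is set on an existing label
    df = pd.DataFrame(0, index=daterange, columns=daterange)

    for i in daterange:  # Rows: opening month
        for j in daterange:  # Columns: month being checked
            # Products opened in month i that close after month j
            mask = (sourcedf['Opening Month'] == i) & (sourcedf['Closing Month'] > j)
            df.loc[i, j] = mask.sum()

    return df

print(product_survival(mydata, '2016-01-31', '2016-12-31'))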
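The nested loop can probably be replaced by NumPy broadcasting when the table grows. The following is a sketch only, using the first four rows of the question's table as sample data; `src`, `months`, `opened`, `alive` and `survival` are names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Sample data: the first four products from the question's source table.
src = pd.DataFrame({
    "Product No": [1, 2, 3, 4],
    "Opening Month": pd.to_datetime(
        ["2016-01-31", "2016-01-31", "2016-02-29", "2016-02-29"]),
    "Closing Month": pd.to_datetime(
        ["2016-06-30", "2016-04-30", "2016-06-30", "2016-05-31"]),
})

months = pd.date_range("2016-01-31", "2016-06-30", freq=pd.offsets.MonthEnd())

opening = src["Opening Month"].to_numpy()
closing = src["Closing Month"].to_numpy()
grid = months.to_numpy()

# opened[p, i] is True when product p opened in month i.
opened = opening[:, None] == grid[None, :]
# alive[p, j] is True when product p closes strictly after month j.
alive = closing[:, None] > grid[None, :]

# Summing over products: counts[i, j] = opened in month i AND closing after month j.
counts = opened.astype(int).T @ alive.astype(int)
survival = pd.DataFrame(counts, index=months, columns=months)

print(survival)
```

The matrix product sums over the product axis, which mirrors COUNTIF minus COUNTIFS with the "<=" criterion in the Excel formula.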

Source

2017-09-27 16:41:03 Dan
