2017-09-16 74 views
1

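Loading data into pandas

I'm trying to pull license information for my installed pip packages from PyPI and load it into a pandas DataFrame. I've loaded a list comprehension into pandas before, but I can't figure this one out.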

Here is what I've written so far.

from requests import get 

import pandas as pd 

import pip 

url = 'https://pypi.python.org/pypi' 

# packages_list = ['numpy','twisted'] 

installed_packages = pip.get_installed_distributions() 
installed_packages_list = sorted(["%s==%s" % (i.key, i.version) 
    for i in installed_packages]) 

packages = [] 
licenses = [] 
summarys = [] 

for index, package in enumerate(installed_packages_list): 
    package = package.split("==")[0] 
    full_url = url+'/'+ package +'/json' 
    #print 'url is ' + full_url 
    page = get(url+'/'+package+'/json').json() 


    #print 'Package: ' + package + ', license is:' + page['info']['license'] + '. ' + page['info']['summary'] 
    packages.append(package) 
    licenses.append(page['info']['license']) 
    summarys.append(page['info']['summary']) 


print(packages)


pd_packages = pd.DataFrame(
    { 
    "packages":[packages], 
    "licenses":[licenses], 
    "summarys":[summarys] 
    }) 

print(pd_packages)
+1

What exactly is the problem? –

+0

It shows something like: 0 [MIT, , MPL-2.0, LGPL, UNKNOWN, BSD-like, BSD, ... packages \ 0 [beautifulsoup4, bs4, certifi, chardet, get, i ... summarys 0 [Screen-scraping library, Dummy package ... – vkk07

+0

I want to get this data into a table-like form and dump it to CSV using pandas – vkk07

Answers

2

Try this:

import pandas as pd
import requests

def get_pkg_info(pkg, url_pat='https://pypi.python.org/pypi/{}/json'):
    # Return [name, license, summary]; fall back to None when the lookup fails.
    r = requests.get(url_pat.format(pkg))
    if r.status_code != requests.codes.ok:
        return [pkg, None, None]
    d = r.json()
    if d and 'info' in d:
        return [pkg, d['info'].get('license'), d['info'].get('summary')]
    else:
        return [pkg, None, None]

# installed_packages_list is the list of "name==version" strings from the question
data = [get_pkg_info(x.split('==')[0]) for x in installed_packages_list]

df = pd.DataFrame(data, columns=['package','license','summary'])
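
The list comprehension above depends on `installed_packages_list` from the question, which is built with `pip.get_installed_distributions()`. That function is no longer importable from the top-level `pip` module as of pip 10, so a current setup needs another source for the list. A minimal sketch, assuming Python 3.8+ and the standard library's `importlib.metadata`; the helper name `list_installed` is made up for illustration:

```python
from importlib import metadata

def list_installed():
    """Return sorted "name==version" strings for the installed distributions."""
    entries = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name, version = meta.get("Name"), meta.get("Version")
        if name and version:  # skip broken installs with incomplete metadata
            entries.add("%s==%s" % (name.lower(), version))
    return sorted(entries)

installed_packages_list = list_installed()
```

The strings have the same shape as in the question, so `x.split('==')[0]` still yields the package name.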

Demo:

In [166]: pd.options.display.max_rows = 15 

In [167]: df = pd.DataFrame(data, columns=['package','license','summary']) 

In [168]: df 
Out[168]: 
     package             license       summary
0    alabaster           None          A configurable sidebar-enabled Sphinx theme
1    anaconda-client     UNKNOWN       Anaconda Cloud command line client library
2    anaconda-navigator  Proprietary
3    anaconda-project    None          None
4    asn1crypto          MIT           Fast ASN.1 parser and serializer with definiti...
5    astroid             LGPL          A abstract syntax tree for Python with inferen...
6    astropy             BSD           Community-developed python astronomy tools
..   ...                 ...           ...
216  xarray              Apache        N-D labeled arrays and datasets in Python
217  xlrd                BSD           Library for developers to extract data from Mi...
218  xlsxwriter          BSD           A Python module for creating Excel XLSX files.
219  xlwings             BSD 3-clause  Make Excel fly: Interact with Excel from Pytho...
220  xlwt                BSD           Library to create spreadsheet files compatible...
221  xmltodict           MIT           Makes working with XML feel like you are worki...
222  yapsy               BSD           Yet another plugin system

[223 rows x 3 columns] 
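
The comments on the question also ask about dumping the table to CSV. A minimal sketch of that last step with `DataFrame.to_csv`, using two rows copied from the demo as stand-in data; the file name `licenses.csv` is an assumption:

```python
import pandas as pd

# Stand-in rows in the same [package, license, summary] shape as `data` above
data = [
    ["alabaster", None, "A configurable sidebar-enabled Sphinx theme"],
    ["yapsy", "BSD", "Yet another plugin system"],
]
df = pd.DataFrame(data, columns=["package", "license", "summary"])

# index=False leaves out the row labels; None is written as an empty field
df.to_csv("licenses.csv", index=False)
```

Missing licenses or summaries end up as empty fields in the file, which most spreadsheet tools read as blank cells.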
0

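I think the problem comes from how you create your DataFrame (pd_packages). packages, licenses and summarys are already lists, so writing [packages] turns each one into a list of lists, which explains the output you posted in your comment.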

So instead of this

pd_packages = pd.DataFrame(
    { 
    "packages":[packages], 
    "licenses":[licenses], 
    "summarys":[summarys] 
    }) 

try this

pd.DataFrame(
    { 
    "packages":packages, 
    "licenses":licenses, 
    "summarys":summarys 
    }) 
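
To see why the extra brackets matter, here is a small sketch with made-up three-element lists (placeholder values, not real PyPI data). Wrapping each list in `[...]` gives pandas a single row whose cells each hold a whole list, while passing the lists directly gives one row per package:

```python
import pandas as pd

# Placeholder values standing in for the lists built in the question's loop
packages = ["pkg_a", "pkg_b", "pkg_c"]
licenses = ["MIT", "BSD", None]
summarys = ["first", "second", "third"]

# Each column is a one-element list, and that element is itself a list
wrapped = pd.DataFrame(
    {"packages": [packages], "licenses": [licenses], "summarys": [summarys]})

# Each column is the list itself, so every element becomes its own row
flat = pd.DataFrame(
    {"packages": packages, "licenses": licenses, "summarys": summarys})

print(wrapped.shape)  # (1, 3)
print(flat.shape)     # (3, 3)
```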
+0

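Thanks, Bob. That is what I had before I added the [] around the names... I got the error "If using all scalar values, you must pass an index". That's why I added the [] – vkk07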

+0

That's odd. I wouldn't expect that error even if the lists were empty –
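
For reference, the error quoted in the comment is the one pandas raises when every value in the dict is a scalar (a plain string counts as one) rather than a list. That suggests the variables may have held single values at that point, though the thread does not confirm it. A short sketch with placeholder values:

```python
import pandas as pd

# Empty lists are fine: the result is an empty frame with three columns
empty = pd.DataFrame({"packages": [], "licenses": [], "summarys": []})
print(empty.shape)  # (0, 3)

# All-scalar values are not: pandas cannot infer how many rows to build
message = None
try:
    pd.DataFrame({"packages": "pkg_a", "licenses": "MIT", "summarys": "text"})
except ValueError as exc:
    message = str(exc)
    print(message)

# Passing an index tells pandas how many rows the scalars should fill
one_row = pd.DataFrame(
    {"packages": "pkg_a", "licenses": "MIT", "summarys": "text"}, index=[0])
print(one_row.shape)  # (1, 3)
```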