Data extraction using Python

I want to extract historical prices from this link: https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=KEL

To do this, I use the following code:

import requests 
import pandas as pd 
import time as t 

t0=t.time() 

symbols =[ 
      'HMIM', 
      'CWSM','DSIL','RAVT','PIBTL','PICT','PNSC','ASL', 
      'DSL','ISL','CSAP','MUGHAL','DKL','ASTL','INIL'] 

for symbol in symbols: 
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(str(symbol)), headers=header) 
    dfs = pd.read_html(r.text) 
    df=dfs[6] 
    df = df.iloc[2:]  # drop the first two rows; .ix was removed in pandas 1.0
    df.columns=['Date','Open','High','Low','Close','Volume'] 
    df.set_index('Date', inplace=True) 
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),columns=['Open','High','Low','Close','Volume'], 
      index_label=['Date']) 

    print(symbol) 


t1=t.time() 
print('exec time is ', t1-t0, 'seconds')

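The code above extracts the data from the link, converts it into a pandas DataFrame and saves it.

The problem is that it takes a lot of time and does not scale well to a larger number of symbols. Can anyone suggest another way to achieve the same result more efficiently?

Also, is there any other programming language that could do the same job in less time?

Source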

2017-04-18 Furqan Hashim

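I would _guess_ that a decent part of that time is spent in blocking GET requests. What happens if you try running the requests asynchronously, e.g. with requests-futures (https://github.com/ross/requests-futures)? – roganjosh

Not on my usual computer, downloading some prerequisites to test :) – roganjosh

I am new to programming, so it will take me some time to try running the requests asynchronously. Reading the documentation. –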

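A normal GET request with requests is "blocking": one request is sent, one response is received, and then it is processed. At least part of the total time is spent simply waiting for responses. With requests-futures we can instead send all the requests asynchronously and collect the responses once they are ready.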
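To make the blocking point concrete without touching the network, here is a minimal sketch that uses only the standard library. It is an illustration under stated assumptions, not a benchmark of the real site: `fake_get` and `DELAY` are hypothetical stand-ins for a GET request and its latency, and `ThreadPoolExecutor` is used in place of requests-futures, which to my understanding also runs its requests in a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # hypothetical per-request latency, in seconds


def fake_get(symbol):
    """Stand-in for a blocking HTTP GET: it only waits, then returns."""
    time.sleep(DELAY)
    return symbol


def fetch_sequential(symbols):
    # Each call starts only after the previous one has finished waiting.
    return [fake_get(s) for s in symbols]


def fetch_concurrent(symbols, max_workers=10):
    # The waits overlap because the calls run in separate threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_get, symbols))


def timed(func, *args):
    # Return the function's result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


symbols = ['HMIM', 'CWSM', 'RAVT', 'ASTL', 'INIL']
seq_result, seq_time = timed(fetch_sequential, symbols)
con_result, con_time = timed(fetch_concurrent, symbols)
print('sequential: {:.2f}s, concurrent: {:.2f}s'.format(seq_time, con_time))
```

The sequential version waits roughly once per symbol, while the concurrent version waits roughly once in total, as long as there are at least as many workers as symbols.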

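That said, I think DSIL is timing out or something similar (I need to look further). While I can get a decent speedup with a random selection from symbols, both methods take approximately the same time if DSIL is in the list.

EDIT: It seems I lied; it was just an unlucky coincidence that happened several times with "DSIL". The more tags you have in symbols, the further the async method will pull ahead of standard requests.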

import requests 
from requests_futures.sessions import FuturesSession 
import time 

start_sync = time.time() 

symbols =['HMIM','CWSM','RAVT','ASTL','INIL'] 

header = { 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", 
    "X-Requested-With": "XMLHttpRequest" 
} 

for symbol in symbols: 
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(str(symbol)), headers=header) 

end_sync = time.time() 

start_async = time.time() 
# Setup 
session = FuturesSession(max_workers=10) 
pooled_requests = [] 

# Gather request URLs 
for symbol in symbols: 
    request= 'https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(symbol) 
    pooled_requests.append(request) 

# Fire the requests 
fire_requests = [session.get(url, headers=header) for url in pooled_requests] 
responses = [item.result() for item in fire_requests] 

end_async = time.time() 

print("Synchronous requests took: {}".format(end_sync - start_sync))
print("Async requests took:  {}".format(end_async - start_async))

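With the code above I get a 3x speedup in fetching the responses. You can iterate over the responses list and process each response as normal.

EDIT 2: Going through the responses of the async requests and saving them as you did before: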

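import pandas as pd  # needed for read_html; not imported in the snippet above

for i, r in enumerate(responses):
    # responses are in the same order as symbols, so symbols[i] belongs to r
    dfs = pd.read_html(r.text)
    df = dfs[6]
    df = df.iloc[2:]  # drop the first two rows; .ix was removed in pandas 1.0
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbols[i]),
              columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label=['Date'])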
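If one symbol fails (a timeout, or a page without the expected table), the loop above stops at that symbol. The following is a rough sketch of one way to keep going, again using the standard library's `concurrent.futures` rather than requests-futures. The names `run_all` and `demo_worker` are hypothetical, and the simulated failure for 'DSIL' is only there to exercise the error path.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(symbols, worker, max_workers=10):
    """Run worker(symbol) for every symbol in a thread pool.

    Returns (results, errors): two dicts keyed by symbol.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which symbol.
        futures = {pool.submit(worker, s): s for s in symbols}
        for future in as_completed(futures):
            symbol = futures[future]
            try:
                results[symbol] = future.result()
            except Exception as exc:  # one failure should not stop the rest
                errors[symbol] = exc
    return results, errors


def demo_worker(symbol):
    # Hypothetical stand-in for "download the page, parse it, write the CSV".
    if symbol == 'DSIL':
        raise TimeoutError('simulated timeout')
    return '{}.csv'.format(symbol)


results, errors = run_all(['HMIM', 'DSIL', 'INIL'], demo_worker)
print(sorted(results), sorted(errors))
```

In real use the worker would presumably call requests.get with the same headers (and a timeout argument) and then run the pandas steps from the loop above, so that downloading and parsing for different symbols overlap.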

Source

2017-04-18 16:45:41 roganjosh

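Good job. It is much faster now, but I am unable to save the responses into DataFrames when using the async method. –

@FurqanHashim I have edited re: the DSIL tag. There should be nothing stopping you from writing the files as usual. Let me check and edit. – roganjosh

@FurqanHashim Please see EDIT 2 – roganjosh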
