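I want to scrape '.xlsx' files from the Tax Foundation website. Unfortunately, I keep getting this error message: Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
I did some research, and it suggested that the fix is to change the file extension to '.xls' instead of '.xlsx'. Can anyone help? How do I change the file extension?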
from bs4 import BeautifulSoup
import urllib.request
import os

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")
soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))

FHFA = os.chdir('C:/US_Census/Directory')

seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue

    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
        print(filename)
        print(' ')

print("All files successfully downloaded.")
P.S. I know you can simply download this file by hand, but I am scraping it in order to automate a particular process.
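A few checks that may help narrow this down, offered as a sketch rather than a confirmed diagnosis. An '.xlsx' workbook is a ZIP archive, so if the saved file is not a valid ZIP (for example, an HTML page saved under an '.xlsx' name), Excel will reject it whatever the extension says. One possible cause, which the post does not confirm, is that the `href` values are already absolute, so prefixing 'https://taxfoundation.org/' produces a URL that does not point at the workbook. The helper names `resolve_link`, `looks_like_xlsx`, and `with_xls_extension`, and the sample path '/files/example.xlsx', are hypothetical and used only for illustration.

```python
import zipfile
from pathlib import Path
from urllib.parse import urljoin

PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_link(page_url, href):
    """Build an absolute URL from href, whether href is relative or already absolute."""
    return urljoin(page_url, href)


def looks_like_xlsx(path):
    """Return True if the file at path is a ZIP archive, as every .xlsx workbook is."""
    return zipfile.is_zipfile(path)


def with_xls_extension(name):
    """Return the file name with its extension swapped to .xls (the content is unchanged)."""
    return Path(name).with_suffix(".xls").name


# Hypothetical href, used only to show how the URL is resolved.
print(resolve_link(PAGE_URL, "/files/example.xlsx"))
print(with_xls_extension("2017-FF-For-Website-7-10-2017.xlsx"))
```

In the loop from the question, `resolve_link(PAGE_URL, href)` could stand in for the string concatenation, and `looks_like_xlsx(file)` could be called after each download to flag files that are not real workbooks. Renaming to '.xls' only changes the name, so it is unlikely to help if the downloaded bytes are not a workbook in the first place.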
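What version of Python are you using? – TheDetective
The expression `any(href.endswith(x) for x in ['.xlsx'])` loops exactly once, over the single item in `['.xlsx']`, so all it does is check `href.endswith('.xlsx')`. You can shorten the condition to the simpler `if not href.endswith('.xlsx')`. – Vinny
I'm using Python 3.6 @TheDetective – bhammer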
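Vinny's simplification can be sketched as follows. The `hrefs` list and the helper name `is_xlsx_link` are made up for illustration. If more than one extension ever needs to be matched, `str.endswith` also accepts a tuple of suffixes, which avoids the `any(...)` loop entirely.

```python
def is_xlsx_link(href):
    """Return True when the link ends with the .xlsx extension (case-sensitive)."""
    return href.endswith('.xlsx')


# Hypothetical links standing in for the href values found on the page.
hrefs = ['/files/report.xlsx', '/about/', '/files/notes.pdf']

xlsx_links = [h for h in hrefs if is_xlsx_link(h)]
print(xlsx_links)

# str.endswith also takes a tuple, so several extensions need no explicit loop.
spreadsheet_links = [h for h in hrefs if h.endswith(('.xlsx', '.xls'))]
print(spreadsheet_links)
```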