2016-09-25 109 views
-1

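Can't extract data from a BeautifulSoup object

I'm scraping a website (https://www.zhihu.com/people/xie-ke-41/followers) and want to collect information about all of the followers. As you can see in Chrome, some of the followers are loaded via AJAX. Using the developer tools, I found the URL that returns them (image: the url which has followers' information).

My code: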

import requests 
from bs4 import BeautifulSoup 


zhihu_rl = 'https://www.zhihu.com/node/ProfileFollowersListV2' 

data = { 
'method': 'next', 
'params': '{"offset":20,"order_by":"created","hash_id":"86858a7a4aa77d290364625efcaacb70"}'} 

headers = { 
'Host': 'www.zhihu.com', 
'Origin': 'https://www.zhihu.com', 
'Referer': 'https://www.zhihu.com/people/xie-ke-41/followers', 
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36', 
'X-Requested-With': 'XMLHttpRequest', 
'X-Xsrftoken': 'foo', 
'Cookie':'xxxxxxxxxxxx'} 

rep = requests.post(url=zhihu_rl, data=data, headers=headers) 

bsobj = BeautifulSoup(rep.text, 'html.parser') 

print(bsobj.find_all('div', {'class': "zm-profile-card zm-profile-section-item zg-clear no-hovercard"})) 

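This returns an empty list. I can see the information in the developer tools (image: the information I see in developers' tool), so why can't bs4 extract it? PS: I can get all the divs, but as soon as I restrict by attributes, it fails.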
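Separately from the empty-list problem, the `params` field in the request above is a hand-written JSON string with `offset` fixed at 20. Since the goal is to fetch every follower, here is a minimal sketch of building that form data for successive offsets. The helper name `follower_form_data` and the page size of 20 are assumptions for illustration; the endpoint's real paging behaviour is not confirmed by this thread.

```python
import json

PAGE_SIZE = 20  # assumption: one request returns 20 follower cards


def follower_form_data(hash_id, offset, order_by="created"):
    """Build the POST form fields for one page of followers (hypothetical helper)."""
    params = {"offset": offset, "order_by": order_by, "hash_id": hash_id}
    # Compact separators reproduce the string style used in the question.
    return {"method": "next", "params": json.dumps(params, separators=(",", ":"))}


# Form data for the first three pages; each dict could be passed as data=...
for offset in range(0, 3 * PAGE_SIZE, PAGE_SIZE):
    print(follower_form_data("86858a7a4aa77d290364625efcaacb70", offset))
```

Each returned dict has the same shape as the `data` dict in the question, so it could be passed to `requests.post` unchanged.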

+0

I can't access the site. From what I see on the last line, shouldn't there be no space before "item"? 'zm-profile-section-item' –

+0

Sorry, that was my typo – dogewang

Answers

-2

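You have used a good combination of headers; without them the server might not recognize your request and would assume you don't have JavaScript enabled. Restrict by attributes with CSS selectors: use . for a class and # for an id. Other CSS selectors work fine too. You also need Selenium to execute the JavaScript (the AJAX calls), since BeautifulSoup lacks this feature. Finally, make sure the site has no anti-scraping protection. If it does, you need a JavaScript runtime such as Js2Py.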

+0

I think I already have the data; I can print bsobj and see it. The question is why I can't extract the items using attrs. – dogewang

+0

If you already have the data, use a regular expression to extract it. Another option is a call like this: 'print(bsobj.find_elements_by_class_name("zm-profile-card zm-profile-section-item zg-clear no-hovercard"))' – yabets

1

The problem is that what you have is escaped JSON. If you print bsobj, you can see output like this:

{"r":0, 
"msg": ["<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"6327483c9e474097e7dbb2493a7f277c\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u738b\u5728\u9014\"\ndata-hovercard=\"p$t$wang-zai-tu-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/wang-zai-tu-81\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$wang-zai-tu-81\" href=\"https:\/\/www.zhihu.com\/people\/wang-zai-tu-81\" class=\"zg-link author-link\" title=\"\u738b\u5728\u9014\"\n>\u738b\u5728\u9014<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/followers\" class=\"zg-link-gray-normal\">1 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/answers\" class=\"zg-link-gray-normal\">1 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"a3596eaecae6f05f0ddf95dfcc6b5517\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u7075\u9b42\"\ndata-hovercard=\"p$t$ling-hun-30-21\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/ling-hun-30-21\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div 
class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$ling-hun-30-21\" href=\"https:\/\/www.zhihu.com\/people\/ling-hun-30-21\" class=\"zg-link author-link\" title=\"\u7075\u9b42\"\n>\u7075\u9b42<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"74fad3af2b93f7da69c37eda64c31037\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u5f90\u6668\"\ndata-hovercard=\"p$t$xu-chen-77-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/xu-chen-77-49\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$xu-chen-77-49\" href=\"https:\/\/www.zhihu.com\/people\/xu-chen-77-49\" class=\"zg-link author-link\" title=\"\u5f90\u6668\"\n>\u5f90\u6668<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\">\u4f1a\u8ba1\u5e08<\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a 
target=\"_blank\" href=\"\/people\/xu-chen-77-49\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"032b36abfbe05a30913c794a4b099629\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"Shuai Zhang\"\ndata-hovercard=\"p$t$shuai-zhang-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/shuai-zhang-49\">\n<img src=\"https:\/\/pic2.zhimg.com\/v2-8aa42ff00873460e29444d62ff51acfd_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$shuai-zhang-49\" href=\"https:\/\/www.zhihu.com\/people\/shuai-zhang-49\" class=\"zg-link author-link\" title=\"Shuai Zhang\"\n>Shuai Zhang<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/followers\" class=\"zg-link-gray-normal\">79 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/asks\" class=\"zg-link-gray-normal\">1 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/answers\" class=\"zg-link-gray-normal\">119 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\" class=\"zg-link-gray-normal\">174 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" 
data-id=\"6388162f5357ca1bd872dc0b6efe4802\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u5468\u5468\"\ndata-hovercard=\"p$t$zhou-zhou-69-22\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/zhou-zhou-69-22\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$zhou-zhou-69-22\" href=\"https:\/\/www.zhihu.com\/people\/zhou-zhou-69-22\" class=\"zg-link author-link\" title=\"\u5468\u5468\"\n>\u5468\u5468<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/followers\" class=\"zg-link-gray-normal\">4 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/answers\" class=\"zg-link-gray-normal\">7 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\" class=\"zg-link-gray-normal\">1 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"3a1a9da0e0bb4abe2554fa2a6032f27f\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"\u7f8e\u7f8e\u836f\u5242\u5e08\"\ndata-hovercard=\"p$t$sui-nuo-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/sui-nuo-81\">\n<img src=\"https:\/\/pic2.zhimg.com\/ae23b8e89725a24de650dee53e9a60a5_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$sui-nuo-81\" 
href=\"https:\/\/www.zhihu.com\/people\/sui-nuo-81\" class=\"zg-link 
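To illustrate why the class filter matches nothing in output like the above, here is a minimal sketch using only the standard library's `html.parser`, the parser behind BeautifulSoup's `'html.parser'` option. The `raw_body` string is a hypothetical, shortened stand-in for the real response, and `ClassCollector` and `collect_classes` are names made up for this example. It assumes the stand-in is well-formed JSON so that `json.loads` can serve as the unescaping step.

```python
import json
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record the class attribute of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class":
                self.classes.append(value)


def collect_classes(markup):
    """Parse markup and return the list of class attribute values found."""
    parser = ClassCollector()
    parser.feed(markup)
    parser.close()
    return parser.classes


# Hypothetical, shortened stand-in for the escaped response body.
raw_body = (
    r'{"r":0,"msg":["<div class=\"zm-profile-card zg-clear\">\n'
    r'<a class=\"zg-link\" href=\"\/people\/a\">A<\/a>\n<\/div>"]}'
)

# Parsed as-is, the backslashes and quotes end up inside the attribute values.
escaped_classes = collect_classes(raw_body)
# Unescaped first, the class attribute is the plain string a filter expects.
clean_classes = collect_classes(json.loads(raw_body)["msg"][0])

print(escaped_classes)
print(clean_classes)
```

While the markup is still escaped, the parser never sees a clean `class="..."` attribute, so an exact match on the class string cannot succeed.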

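Unfortunately it is also invalid JSON, so we can't just call rep.json() and get nicely unescaped HTML. You will have to unescape it manually with string_escape (a Python 2 codec):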

In [14]: rep = requests.post(url=zhihu_rl, data=data, headers=headers) 

In [15]: bsobj = BeautifulSoup(rep.text.decode("string_escape"), 'lxml') 

In [16]: ancs = (bsobj.find_all('div', {'class': 'zm-profile-card zm-profile-section-item zg-clear no-hovercard'})) 

In [17]: len(ancs) 
Out[17]: 20 
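The session above relies on Python 2: in Python 3, `str` has no `.decode` method and the `string_escape` codec does not exist. A rough Python 3 sketch follows, under stated assumptions. The function name `extract_html` is hypothetical. It first tries `json.loads`, in case the body does parse as JSON shaped like `{"r": 0, "msg": [...]}`, and otherwise falls back to the `unicode_escape` codec, which is close to, but not identical to, what `string_escape` did.

```python
import json


def extract_html(body):
    """Return unescaped HTML from a response body (hypothetical helper)."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Fallback for bodies that are not valid JSON.
        # "\/" is a JSON escape that unicode_escape does not know, so it is
        # handled first; backslashreplace keeps non-ASCII text intact.
        text = body.replace("\\/", "/")
        return text.encode("ascii", "backslashreplace").decode("unicode_escape")
    # Assumption: "msg" is a list of HTML fragments, one per follower card.
    return "\n".join(payload["msg"])


# A deliberately truncated body, like the excerpt in this answer.
truncated = r'{"r":0,"msg":["<div class=\"zg-link\">\u5173\u6ce8<\/div>'
print(extract_html(truncated))
```

The returned string can then be passed to `BeautifulSoup(...)` in place of `rep.text`, after which the `find_all` call from the question should have real `class` attributes to match against.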

Also, the class is zm-profile-section-item, not zm-profile-section- item.

And in the future, never post your logged-in cookies; with them I could have full access to your account within a minute or two.
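One way to keep a session cookie out of code that gets pasted into a question is to read it from the environment at run time. This is only a sketch: the variable name `ZHIHU_COOKIE` and the helper `build_headers` are made up for illustration, and the header set is trimmed from the one in the question.

```python
import os


def build_headers(environ=None):
    """Build request headers, taking the cookie from the environment."""
    environ = os.environ if environ is None else environ
    cookie = environ.get("ZHIHU_COOKIE")  # hypothetical variable name
    if not cookie:
        raise RuntimeError("set ZHIHU_COOKIE before running the scraper")
    return {
        "Referer": "https://www.zhihu.com/people/xie-ke-41/followers",
        "X-Requested-With": "XMLHttpRequest",
        "Cookie": cookie,  # never hard-coded, so never pasted by accident
    }


# Demonstration with a placeholder mapping instead of the real environment.
print(sorted(build_headers({"ZHIHU_COOKIE": "placeholder"})))
```

With this in place, the script itself contains no secret and can be shared as-is.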

+0

Thank you for your patient answer! – dogewang

+0

No worries. I strongly suggest you change your password before someone else does http://s13.postimg.org/5vcsmev07/comp.png –