Python string find function does not give the position in text returned by BeautifulSoup

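I am trying to scrape part of a 10-K filing. I have a problem determining the position of 'Item 7(a)' in the text returned by BeautifulSoup, even though the text contains those words. The same code does work on a string I made myself that contains 'item 7(a)'.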

import urllib2 
import re 
import bs4 as bs 
url='https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm' 

html = urllib2.urlopen(url).read().decode('utf8') 
soup = bs.BeautifulSoup(html,'lxml') 
text = soup.get_text() 
text = text.encode('utf-8') 
text = text.lower() 
print type(text) 
print len(text) 
text1 = "hf dfbd item 7. abcd sfjsdf sdbfjkds item 7(a). adfbdf item 8. skjfbdk item 7. sdfkba ootgf sffdfd item 7(a). sfbdskf sfdf item 8. sdfbksdf " 
print text.find('item 7(a)') 
print text1.find('item 7(a)') 

Output: 
<type 'str'> 
592214 
-1 
37

Source

2017-12-03 Vinay

Are you using Python 2, by any chance? –

Yes. I am using Python 2.7. I also tried it in Python 3.6, but I got the same result. – Vinay

Did you print `text`? Maybe the server gives you a different result than it gives the web browser. – furas

The page contains the text ITEM&nbsp;7(A).

It uses the entity &nbsp; (Non-Breaking SPace, character code 160)
instead of a normal space (code 32).
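A quick way to confirm this kind of problem is to look at the character codes around the match. The following is a minimal Python 3 sketch on a made-up sample string, not the real filing; the helper name `describe_chars` is hypothetical.

```python
import unicodedata

def describe_chars(s):
    """Return (character, code point, Unicode name) for each character of s."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in s]

# Made-up sample imitating what get_text() yields for "ITEM&nbsp;7(A)" after lower().
sample = "item\u00a07(a). quantitative disclosures"

print(sample.find("item 7(a)"))  # -1, because the search string uses a normal space
print(repr(sample[:9]))          # repr() makes the \xa0 visible
for ch, code, name in describe_chars(sample[:6]):
    print(repr(ch), code, name)
```

Printing the text directly hides the difference, because a no-break space is rendered like an ordinary space; `repr()` and `ord()` make it visible.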

You can replace all characters with code 160 (chr(160)) with a normal space (" ").
In Python 2, after encoding, you have to replace two characters: 194 and 160.

text = text.replace(chr(160), " ") # Python 3 
text = text.replace(chr(194)+chr(160), " ") # Python 2
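The two characters in the Python 2 case come from UTF-8: the no-break space (U+00A0) is encoded as the two bytes 194 and 160. A small Python 3 sketch on a made-up string shows both variants, replacing on the decoded text and replacing on the encoded bytes:

```python
nbsp = "\u00a0"

# UTF-8 encodes U+00A0 as two bytes.
encoded = nbsp.encode("utf-8")
print(list(encoded))  # [194, 160]

# Made-up sample text, not the real filing.
text = "item\u00a07(a). market risk"

# Variant 1: replace on the decoded (Unicode) text.
cleaned = text.replace(nbsp, " ")
print(cleaned.find("item 7(a)"))  # 0

# Variant 2: replace on UTF-8 bytes, which is what the Python 2 code does.
raw = text.encode("utf-8")
cleaned_raw = raw.replace(b"\xc2\xa0", b" ")
print(cleaned_raw.find(b"item 7(a)"))  # 0
```

Working on the decoded text is usually simpler, since positions then count characters rather than bytes.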

Full example

#import urllib.request as urllib2 # Python 3 
import urllib2 
import re 
import bs4 as bs 

url='https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm' 

html = urllib2.urlopen(url).read().decode('utf8') 
soup = bs.BeautifulSoup(html,'lxml') 
text = soup.get_text() 
text = text.encode('utf-8') # only Python 2 
text = text.lower() 

#text = text.replace(chr(160), " ") # Python 3 
text = text.replace(chr(194)+chr(160), " ") # Python 2 

search = 'item 7(a)' 

# find every occurrence in text 
pos = 0 
while True: 
    pos = text.find(search, pos) 
    if pos == -1: 
        break 
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1])) 
    print(text[pos:pos+20]) 
    pos += 1

EDIT: tested only with Python 3

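You can also search for the string 'item 7(a)' without replacing characters in the text.
But you have to know that &nbsp; is used at this position instead of " ".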

from html import unescape 

search = unescape('item&nbsp;7(a)')

Full code

import urllib.request as urllib2 # Python 3 
#import urllib2 # Python 2 
import re 
import bs4 as bs 

url='https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm' 

html = urllib2.urlopen(url).read().decode('utf8') 
soup = bs.BeautifulSoup(html,'lxml') 
text = soup.get_text() 
text = text.lower() 

from html import unescape 

search = unescape('item&nbsp;7(a)') 

# find every occurrence in text 
pos = 0 
while True: 
    pos = text.find(search, pos) 
    if pos == -1: 
        break 
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1])) 
    print(text[pos:pos+20]) 
    pos += 1
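If you do not want to care which kind of space the page uses, a regular expression is another option. This is a different technique from the answer's `find` loop: in Python 3, `\s` in a str pattern also matches the no-break space. The sketch below runs on a made-up string rather than the real filing, and the function name `find_item_positions` is hypothetical.

```python
import re
from html import unescape

def find_item_positions(text, label="item 7(a)"):
    """Return start offsets of label, letting any whitespace run stand for a space."""
    # Escape each word so "(" and ")" are matched literally, then join with \s+.
    parts = [re.escape(p) for p in label.split()]
    pattern = re.compile(r"\s+".join(parts), re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

# Made-up sample mixing &nbsp; and normal spaces.
sample = unescape("Item&nbsp;7. abc ITEM&nbsp;7(A). def Item 8. ghi item 7(a). jkl")

print(find_item_positions(sample))  # [12, 39]
```

Because the pattern is case-insensitive, the text does not need to be lowercased first, so the offsets refer to the original text.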

Source

2017-12-03 01:45:16 furas
