Removing extra characters from a string with Python string functions

This is the HTML of the web page from which I want to extract the location information.

<div class="location"> 
    <div class="listing-location">Location</div> 
    <div class="location-areas"> 
    <span class="location">Al Bayan</span> 
    ‪,‪ 
    <span class="location">Nepal</span> 
    </div> 
    <div class="area-description"> 3.3 km from Mall of the Emirates </div> 
    </div>

The Python BeautifulSoup4 code I am using is:

try: 
    title = soup.find('span', {'id': 'listing-title-wrap'}) 
    title_result = str(title.get_text().strip()) 
    print "Title: ", title_result 
except StandardError as e: 
    title_result = "Error was {0}".format(e) 
    print title_result

Output:

"Al Bayanâ€ª,â€ª 

          Nepal"

How can I convert it to the following format?

['Al Bayan', 'Nepal']

What should the second line of the code be to get this output?

Source

2016-06-01 Panetta

What is the HTML that produces this output? – 2016-06-01 07:01:47

Are they all in that format? Some gibberish, then two newlines, then the real text? – Keatinge

Try this solution: http://stackoverflow.com/a/2743163/524743 – Samuel

You are reading the wrong element; just read the spans with the class location.

from bs4 import BeautifulSoup
# html is assumed to hold the page markup shown in the question
soup = BeautifulSoup(html, "html.parser") 
locList = [loc.text for loc in soup.find_all("span", {"class" : "location"})] 
print(locList)

This prints what you want:

['Al Bayan', 'Nepal']
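For readers who do not have BeautifulSoup installed, the same idea (read only the span elements with class location and ignore the text between them) can be sketched with the standard library's html.parser in Python 3. This is a swapped-in technique, not the answer's method: the class name LocationSpanParser and the inline html sample are illustrative assumptions, and the sketch does not handle nested spans or multiple class names.

```python
from html.parser import HTMLParser


class LocationSpanParser(HTMLParser):
    """Hypothetical helper: collects the text of each <span class="location">."""

    def __init__(self):
        super().__init__()
        self.in_location_span = False
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; match the exact class value
        if tag == "span" and ("class", "location") in attrs:
            self.in_location_span = True
            self.locations.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_location_span = False

    def handle_data(self, data):
        # Text outside the spans (the comma and the U+202A marks) is ignored
        if self.in_location_span:
            self.locations[-1] += data


# Assumed sample, modelled on the markup in the question
html = """
<div class="location-areas">
<span class="location">Al Bayan</span>
\u202a,\u202a
<span class="location">Nepal</span>
</div>
"""

parser = LocationSpanParser()
parser.feed(html)
locations = [text.strip() for text in parser.locations]
print(locations)
```

Because only the text inside the matching spans is collected, the separator and the invisible characters around it never reach the result list.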

Source

2016-06-01 07:15:41 Keatinge

[u'Al Bayan', u'Nepal'] This is the output. – Panetta

Map the list with str. That will give you the expected result: 'map(str, output_list)' –

@Panetta I changed it slightly, run it now. There is no reason to use map when there is already a list comprehension – Keatinge

There is a one-line solution. Consider a to be your string.

In [38]: [i.replace(" ","") for i in filter(None,(a.decode('unicode_escape').encode('ascii','ignore')).split('\n'))] 
Out[38]: ['Al Bayan,', 'Nepal']
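As a rough Python 3 sketch of the same clean-up (not the one-liner above, which targets Python 2), one option is to drop Unicode format characters such as U+202A before splitting on the comma. The function name split_locations and the sample string raw are assumptions made for illustration; raw is modelled on the text in the question.

```python
import unicodedata


def split_locations(raw):
    """Hypothetical helper: strip invisible format characters, then split on commas."""
    # U+202A (LEFT-TO-RIGHT EMBEDDING) has Unicode general category "Cf"
    cleaned = "".join(ch for ch in raw if unicodedata.category(ch) != "Cf")
    # Split on the comma and discard surrounding whitespace and empty pieces
    return [part.strip() for part in cleaned.split(",") if part.strip()]


# Assumed input, modelled on the scraped text in the question
raw = "Al Bayan\u202a,\u202a \n\n          Nepal"
locations = split_locations(raw)
print(locations)
```

Unlike replacing every space, stripping each piece keeps the space inside 'Al Bayan' intact.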

Source

2016-06-01 07:15:09

'ascii' codec can't encode character u'\u202a'. I tried it, and this is the error – Panetta

@Panetta What is the exact error you get, and what did you give as input? This works fine for me. –
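The character named in the error, U+202A, is an invisible directional mark. The Python 3 sketch below shows, under the assumption that the scraped text contains that character, why it can display as three stray characters and why a strict ASCII encode fails. In Python 2, calling .decode on a unicode object first encodes it implicitly as ASCII, which is one plausible cause of the reported error; that is a guess, since the exact input was not posted.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING, the character named in the error

# Its UTF-8 encoding is three bytes; shown as Windows-1252 they look like debris
utf8_bytes = LRE.encode("utf-8")
as_cp1252 = utf8_bytes.decode("cp1252")

# A strict ASCII encode cannot represent the character and raises
try:
    LRE.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# With errors="ignore" the character is silently dropped instead
stripped = ("Al Bayan" + LRE).encode("ascii", "ignore").decode("ascii")

print(utf8_bytes, as_cp1252, error_message, stripped, sep="\n")
```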

You can use a regular expression to keep only letters and spaces:

>>> import re 
>>> re.findall('[A-Za-z ]+', area_result) 
['Al Bayan', ' Nepal']

Hope it helps.
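The regex output above still carries a leading space on the second item. A small Python 3 sketch of one way to tidy that up is to strip each match and drop matches that are only whitespace. The value assigned to area_result here is an assumed sample modelled on the question, not the asker's actual data.

```python
import re

# Assumed sample input, modelled on the scraped text in the question
area_result = "Al Bayan\u202a,\u202a \n\n          Nepal"

# Find runs of letters and spaces, as in the answer above
matches = re.findall(r"[A-Za-z ]+", area_result)

# Strip surrounding spaces and discard matches that were only spaces
locations = [m.strip() for m in matches if m.strip()]
print(locations)
```

Note that this pattern only keeps ASCII letters, so place names with accented or non-Latin characters would be cut apart.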

Source

2016-06-01 07:19:24 3kt
