Remove unicode characters from set elements?

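New to Python. I'm writing a scraper that yields a set of values, all of which contain unicode characters.

I'd like to know how to remove the unicode characters from them. I was under the impression that I'm using Python 3, but I'm not sure, because the command is scrapy and I've always used Python 2. I've never used a tool that isn't run with the python command.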
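One way to settle the Python 2 versus Python 3 question is to ask the interpreter itself. The following is a minimal sketch, assuming it is run with (or pasted into a module loaded by) the same interpreter that scrapy uses; the helper name `interpreter_summary` is made up for illustration.

```python
import sys


def interpreter_summary():
    """Return a short string naming the running Python version and executable."""
    major, minor = sys.version_info[:2]
    return "Python {}.{} at {}".format(major, minor, sys.executable)


# Print the summary so it shows up in the console or crawl log.
print(interpreter_summary())
```

Running `scrapy version -v` on the command line should also list the Python version that Scrapy was installed under.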

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

The command to run it is:

scrapy crawl quotes -o output.json

How do I remove the Unicode characters from the response, or from the items in the yielded set?

Source

2017-08-10 Ryan Gedwill

All characters in a UTF-8 encoded page are Unicode characters (even _these_). What do you want to remove? – DyZ

@DYZ There is a '\u201c' in the content of the 'text' attribute of every record. I can obviously parse it out, but that will only get me so far. –
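The `\u201c` in the output is most likely not a stray character in the data but the JSON escape for a left double quotation mark (U+201C), which is how an ASCII-only JSON writer represents it. A small sketch with the standard `json` module, using a made-up sample record, illustrates the difference. For Scrapy's own feed export, the `FEED_EXPORT_ENCODING = 'utf-8'` setting is meant to have a similar effect, though that is worth confirming against the installed version.

```python
import json

# Hypothetical record resembling one scraped quote.
record = {"text": "\u201cHello\u201d"}

# Default behaviour: output is ASCII-only, so non-ASCII characters are escaped.
escaped = json.dumps(record)

# With ensure_ascii=False the real characters are kept in the output string.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to exactly the same data.
assert json.loads(escaped) == json.loads(readable) == record
```

In other words, the escaped form loses nothing; any JSON parser turns it back into the original quotation marks.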

Why not replace it with a '''? – DyZ
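The replacement idea from this comment can be sketched as a character mapping rather than a deletion, so the quotation marks survive as plain ASCII. This assumes Python 3 (where `str.maketrans` is available); `QUOTE_MAP` and `straighten_quotes` are illustrative names, not part of Scrapy.

```python
# Map typographic quotes to their plain ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text):
    """Replace curly quotes with straight ones; None passes through unchanged."""
    if text is None:
        return None
    return text.translate(QUOTE_MAP)


print(straighten_quotes("\u201cIt\u2019s fine.\u201d"))
```

In the spider this could be applied as `straighten_quotes(quote.css('span.text::text').extract_first())`; the `None` check matters because `extract_first()` returns `None` when nothing matches.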

Try something like this:

...
# Encode to ASCII, dropping anything that can't be represented, then decode back to text.
'text': quote.css('span.text::text').extract_first().encode('ascii', 'ignore').decode('ascii'),
...

Source

2017-08-10 05:29:25

This code should work:

yield {
    'text': quote.css('span.text::text').extract_first().encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore'),
    'author': quote.css('small.author::text').extract_first().encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore'),
    # extract() returns a list, so each tag has to be converted individually.
    'tags': [tag.encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore')
             for tag in quote.css('div.tags a.tag::text').extract()],
}

Or you can create a function that converts the Unicode to a string:

def convertToString(encodedString): 
    return encodedString.encode("utf-8").decode('unicode_escape').encode('ascii', 'ignore')
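On Python 3 the chained conversion above ends with `.encode(...)`, so it returns `bytes` rather than `str`, and it would raise on a `None` from `extract_first()`. A hedged alternative sketch follows, assuming the goal is simply to drop every non-ASCII character; `to_ascii` is an illustrative name, and it handles the list returned by `extract()` as well.

```python
def to_ascii(value):
    """Drop non-ASCII characters from a string, or from each string in a list."""
    if value is None:
        # extract_first() returns None when the selector matches nothing.
        return None
    if isinstance(value, (list, tuple)):
        # extract() returns a list, so convert each element in turn.
        return [to_ascii(item) for item in value]
    # Encode to ASCII, ignoring unrepresentable characters, then decode back to text.
    return value.encode("ascii", "ignore").decode("ascii")


print(to_ascii("\u201cHello\u201d"))
print(to_ascii(["caf\u00e9", "life"]))
```

Dropping characters this way is lossy: accented letters disappear along with the curly quotes, so it only suits data where that loss is acceptable.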

Source

2017-12-02 13:47:54
