2016-08-13 61 views
0

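Retrieving text from a span using an XPath or CSS query

I need to retrieve the text from the span element below as one piece, without it being split into separate text parts.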

<span class="a-size-base review-text">I purchased this from Fry's Electronics. 
 
<br/> 
 
<br/> 
 
The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 
 
<br/> 
 
<br/> 
 
I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 
 
<br/> 
 
<br/> 
 
The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 
 
<br/> 
 
<br/> 
 
The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 
 
<br/> 
 
<br/> 
 
Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 
 
<br/> 
 
<br/> 
 
NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
 
</span>

But after applying my XPath query

//*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]/text()

I get this:

Text='I purchased this from Fry's Electronics.' 
 
Text='' 
 
Text='The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.' 
 
Text='' 
 
Text='I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems.' 
 
Text='' 
 
Text='The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome.' 
 
Text='' 
 
Text='The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge.' 
 
Text='' 
 
Text='Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's).' 
 
Text='' 
 
Text='NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person.'

I want to get the text back as a single, unbroken block. I am using this XPath tester: http://www.freeformatter.com/xpath-tester.html

Answers

0

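A handy feature of Scrapy selectors is that selections can be chained, so you can start with a CSS selector and then apply XPath string functions such as string() or normalize-space(). string() returns the concatenated text of the element and all of its descendants; normalize-space() does the same, then trims the ends and collapses each run of whitespace into a single space.

Here is an example Scrapy 1.1 shell session: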

~$ scrapy shell 
2016-08-16 12:20:57 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot) 
2016-08-16 12:20:57 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'} 
(...) 
In [1]: html = '''<span class="a-size-base review-text">I purchased this from Fry's Electronics. 
    ...: <br/> 
    ...: <br/> 
    ...: The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 
    ...: <br/> 
    ...: <br/> 
    ...: I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 
    ...: <br/> 
    ...: <br/> 
    ...: The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 
    ...: <br/> 
    ...: <br/> 
    ...: The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 
    ...: <br/> 
    ...: <br/> 
    ...: Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 
    ...: <br/> 
    ...: <br/> 
    ...: NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
    ...: </span>''' 

In [2]: import scrapy 

In [3]: selector = scrapy.Selector(text=html) 

In [4]: selector.css('span.review-text').xpath('string()').extract_first() 
Out[4]: 'I purchased this from Fry\'s Electronics.\n\n\nThe picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I\'m very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.\n\n\nI wasn\'t planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems.\n\n\nThe unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you\'re ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome.\n\n\nThe stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge.\n\n\nOverall I\'m very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry\'s).\n\n\nNOTE: If you see any strange distortion in the images it\'s likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person.\n' 

In [5]: print(selector.css('span.review-text').xpath('string()').extract_first()) 
I purchased this from Fry's Electronics. 


The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 


I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 


The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 


The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 


Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 


NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 


In [6]: print(selector.css('span.review-text').xpath('normalize-space()').extract_first()) 
I purchased this from Fry's Electronics. The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
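
For readers without Scrapy at hand, the same two behaviours can be sketched with only the Python standard library. This is an illustration under assumptions, not part of the session above: it swaps in xml.etree.ElementTree for Scrapy's selectors, uses a shortened stand-in for the review markup, and relies on the fragment being well-formed XML (which the `<br/>` tags allow). The names `html`, `span`, `full_text` and `flat_text` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the review markup; <br/> keeps it well-formed XML.
html = (
    '<span class="a-size-base review-text">First paragraph.\n'
    '<br/>\n<br/>\n'
    'Second   paragraph.\n'
    '</span>'
)

span = ET.fromstring(html)

# Concatenate every descendant text node, like XPath string().
full_text = "".join(span.itertext())

# Collapse whitespace runs and trim the ends, roughly like normalize-space().
flat_text = " ".join(full_text.split())

print(repr(full_text))
print(flat_text)
```

Note that str.split() treats any Unicode whitespace as a separator, so the second step only approximates normalize-space(), which is defined over XML whitespace characters.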
+0

Thanks @paul trmbrth. Good solution. – Brayoni

0

Convert the whole <span> element to a string:

string(
    //*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]
)

Note that this only works for the first <span> element that matches the criteria. In XPath 2.0, you can use string-join(), which works for any number of <span> elements:

string-join(
    //*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]/text(),
    ""
)
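
Where only XPath 1.0 is available, one possible workaround for several matching spans is to select the elements first and take the string value of each one in the host language. The sketch below is an assumption-laden illustration rather than the answer's own method: it uses the standard library's xml.etree.ElementTree instead of lxml or Scrapy, the sample markup is invented, and `doc`, `has_class` and `reviews` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Invented sample: two review spans and one unrelated span.
doc = ET.fromstring(
    '<div>'
    '<span class="a-size-base review-text">Good picture.<br/>Easy setup.</span>'
    '<span class="a-size-base review-text">Slow menus.<br/>Fine otherwise.</span>'
    '<span class="a-size-base">Not a review.</span>'
    '</div>'
)

def has_class(element, name):
    # Whole-token match on the class attribute, the same intent as the
    # concat(" ", @class, " ") idiom in the XPath expressions.
    return name in (element.get("class") or "").split()

# One joined string per matching span, instead of one string overall.
reviews = [
    "".join(span.itertext())
    for span in doc.iter("span")
    if has_class(span, "review-text")
]
print(reviews)
```

With lxml, the per-element step could instead be written as span.xpath('string()') on each element returned by the class query.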
+0

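I am using **lxml**, which only supports _XPath 1.0_, so I can't use `string-join`. And if I convert the whole element to a `string`, the _XPath_ query seems to return a single string rather than a list. – Brayoni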

+0

The following returns a list in the scrapy shell: `response.xpath('//*[contains(concat(" ", @class, " "), concat(" ", "review-text", " "))]').extract()`. – Brayoni

0

I had to post-process the extracted markup, using a Python regular expression to remove the HTML tags.

re.sub(r'<span class="a-size-base review-text">|<br>|</span>', "", text) 

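However, I did try @har07's suggestions:

  • Scrapy uses lxml, which only supports XPath 1.0, so I can't take advantage of string-join, which is only available in XPath 2.0.
  • When I tried string(), I couldn't get a list of selectors from my XPath query.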
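
The regular-expression post-processing above depends on the exact tag spelling in the extracted markup. A slightly more tolerant variant might look like the sketch below. It assumes the only tags present are span and br, as in the sample; the input string and the names `text`, `TAG_RE` and `cleaned` are hypothetical.

```python
import re

# Hypothetical extracted markup, mixing the br spellings a serializer might emit.
text = (
    '<span class="a-size-base review-text">Good picture.'
    '<br><br/>Easy setup.<BR />Light unit.</span>'
)

# </?span[^>]*> matches opening and closing span tags with any attributes;
# <br\s*/?> matches <br>, <br/> and <br />.
TAG_RE = re.compile(r'</?span[^>]*>|<br\s*/?>', re.IGNORECASE)

cleaned = TAG_RE.sub("", text)
print(cleaned)
```

As in the original substitution, the tags are replaced with an empty string, so text on either side of a line break is joined directly. Regular expressions remain fragile on arbitrary HTML, so this is only reasonable for markup as simple as the sample.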