2017-08-31 62 views

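I want to extract data from two HTML pages. When I extract data from one page and then move to the other, some elements change: the data comes back as a list, and the contents of that list differ between pages. How can I extract the same data from both HTML pages?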

My code is below:

details_containers = soup_page.findAll("div", {"id": "RESTAURANT_DETAILS"})
details_container = details_containers[0].findAll("div", {"class": "content"})
cuisine = details_container[0].text.strip()
print(cuisine)
meals = details_container[1].text.strip()
print(meals)
hotel_features = details_container[2].text.strip()
print(hotel_features)

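From the first HTML page I want the content values for Cuisine, Meals, and Restaurant features. But this page also has extra content values for Open Hours and Average prices.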

<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS"> 
    <div class="header_with_improve wrap"> 
     <a href="/UpdateListing-g297595-d6384395-Ocellus-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')"> 
      <div class="improve_listing_btn ui_button primary">Improve this listing</div> 
     </a> 
     <h3 class="tabs_header">Restaurant Details</h3> </div> 
    <div class="details_tab"> 
     <div class="table_section"> 
      <div class="row"> 
       <div class="ratingSummary wrap"> 
        <div class="histogramCommon bubbleHistogram wrap"> 
         <div class="colTitle"> 
          Rating summary 
         </div> 
         <ul class="barChart"> 
          <li> 
           <div class="ratingRow wrap"> 
            <div class="label part "> 
             <span class="text">Food</span> 
            </div> 
            <div class="wrap row part "> 
             <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span> 
            </div> 
           </div> 
           <div class="ratingRow wrap"> 
            <div class="label part "> 
             <span class="text">Service</span> 
            </div> 
            <div class="wrap row part "> 
             <span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span> 
            </div> 
           </div> 
          </li> 
          <li> 
           <div class="ratingRow wrap"> 
            <div class="label part "> 
             <span class="text">Value</span> 
            </div> 
            <div class="wrap row part "> 
             <span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span> 
            </div> 
           </div> 
          </li> 
         </ul> 
        </div> 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Average prices 
       </div> 
       <div class="content"> 
        <span>₹&nbsp;448 - 
₹&nbsp;768</span> 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Cuisine 
       </div> 
       <div class="content"> 
        <a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-c26-Raipur_Raipur_District_Chhattisgarh.html">Italian</a>, <a href="/Restaurants-g297595-c20-Raipur_Raipur_District_Chhattisgarh.html">French</a>, <a href="/Restaurants-g297595-c11-Raipur_Raipur_District_Chhattisgarh.html">Chinese</a>, <a href="/Restaurants-g297595-c22-Raipur_Raipur_District_Chhattisgarh.html">International</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a> 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Meals 
       </div> 
       <div class="content"> 
        Breakfast, Lunch, Dinner, Brunch 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Restaurant features 
       </div> 
       <div class="content"> 
        Reservations, Seating, Takeout, Private Dining, Waitstaff 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Good for 
       </div> 
       <div class="content"> 
        Groups, Business meetings, Child-friendly 
       </div> 
      </div> 
      <div class="row"> 
       <div class="hours title"> 
        Open Hours 
       </div> 
       <div class="hours content"> 
        <div class="detail"> 
         <span class="day">Sunday</span> 
         <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span> 
        </div> 
        <div class="detail"> 
         <span class="day">Monday</span> 
         <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span> 
        </div> 
        <div class="detail"> 
         <span class="day">Tuesday</span> 
         <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span> 
        </div> 
        <div class="detail"> 
         <span class="day">Wednesday</span> 
         <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span> 
        </div> 
        <div class="detail"> 
         <span class="day">Thursday</span> 
         <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span> 
        </div> 
        <div class="detail"> 
         <span class="day">Friday</span> 
         <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span> 
        </div> 
        <div class="detail"> 
         <span class="day">Saturday</span> 
         <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span> 
        </div> 
       </div> 
      </div> 
     </div> 
     <div class="additional_info"> 
      <div class="title"> 
       Location and Contact Information </div> 
      <div class="content"> 
       <ul class="detailsContent"> 
        <li> 
         <div class="detail">Address: 
          <span> <span class="format_address"><span class="street-address">G.E. Road</span> | <span class="extended-address">Mayura Hotel</span>, <span class="locality">Raipur 492001, </span><span class="country-name">India</span> </span> 
          </span> 
         </div> 
        </li> 
        <li> 
         <div class="detail">Location: 
          <span> Asia</span> 
          <span> &nbsp;&gt;&nbsp; India</span> 
          <span> &nbsp;&gt;&nbsp; Chhattisgarh</span> 
          <span> &nbsp;&gt;&nbsp; Raipur District</span> 
          <span> &nbsp;&gt;&nbsp; Raipur</span> 
         </div> 
        </li> 
        <li> 
         <div class="detail">Phone Number: 
          <span>+91 77142 00500</span> 
         </div> 
        </li> 
        <li> 
         <span class="ui_icon email"></span> 
         <a target="_blank&quot;" href="mailto:[email protected]" onclick="ta.trackEventOnPage('Eatery_Listing','Email','6384395')"> 
E-mail </a> 
        </li> 
        <!--trkP:waypoint_for_poi_2--> 
        <!-- PLACEMENT waypoint_for_poi --> 
        <div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi"> 
        </div> 
        <!--etk--> 
       </ul> 
      </div> 
     </div> 
     <!--[if lte IE 9]> 
      <style> 
       .details_block .threeColumnList{ 
        height: 350px; 
        overflow: auto; 
       } 
      </style> 
      <![endif]--> 
    </div> 
</div> 

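From the second HTML page I want the content values for Cuisine, Meals, and Restaurant features, just as with the HTML above. But here the extra content values for Open Hours and Average prices are not present.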

<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS"> 
    <div class="header_with_improve wrap"> 
     <a href="/UpdateListing-g297595-d8595502-Barbeque_Nation-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')"> 
      <div class="improve_listing_btn ui_button primary">Improve this listing</div> 
     </a> 
     <h3 class="tabs_header">Restaurant Details</h3> </div> 
    <div class="details_tab"> 
     <div class="table_section"> 
      <div class="row"> 
       <div class="ratingSummary wrap"> 
        <div class="histogramCommon bubbleHistogram wrap"> 
         <div class="colTitle"> 
          Rating summary 
         </div> 
         <ul class="barChart"> 
          <li> 
           <div class="ratingRow wrap"> 
            <div class="label part "> 
             <span class="text">Food</span> 
            </div> 
            <div class="wrap row part "> 
             <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span> 
            </div> 
           </div> 
           <div class="ratingRow wrap"> 
            <div class="label part "> 
             <span class="text">Service</span> 
            </div> 
            <div class="wrap row part "> 
             <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span> 
            </div> 
           </div> 
          </li> 
          <li> 
           <div class="ratingRow wrap"> 
            <div class="label part "> 
             <span class="text">Value</span> 
            </div> 
            <div class="wrap row part "> 
             <span class="ui_bubble_rating bubble_40" alt="4.0 of 5 bubbles"></span> 
            </div> 
           </div> 
          </li> 
         </ul> 
        </div> 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Cuisine 
       </div> 
       <div class="content"> 
        <a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c6-Raipur_Raipur_District_Chhattisgarh.html">Barbecue</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a>, <a href="/Restaurants-g297595-zfz10697-Raipur_Raipur_District_Chhattisgarh.html">Vegan Options</a>, <a href="/Restaurants-g297595-zfz10992-Raipur_Raipur_District_Chhattisgarh.html">Gluten Free Options</a> 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Meals 
       </div> 
       <div class="content"> 
        Lunch, Dinner 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Restaurant features 
       </div> 
       <div class="content"> 
        Reservations, Seating, Waitstaff, Wheelchair Accessible, Validated Parking 
       </div> 
      </div> 
      <div class="row"> 
       <div class="title"> 
        Good for 
       </div> 
       <div class="content"> 
        Groups, Special Occasion Dining, Kids, Child-friendly 
       </div> 
      </div> 
     </div> 
     <div class="additional_info"> 
      <div class="title"> 
       Location and Contact Information </div> 
      <div class="content"> 
       <ul class="detailsContent"> 
        <li> 
         <div class="detail">Address: 
          <span> <span class="format_address"> | <span class="extended-address">Magneto The Mall, 2nd Floor</span>, <span class="locality">Raipur 429010, </span><span class="country-name">India</span> </span> 
          </span> 
         </div> 
        </li> 
        <li> 
         <div class="detail">Location: 
          <span> Asia</span> 
          <span> &nbsp;&gt;&nbsp; India</span> 
          <span> &nbsp;&gt;&nbsp; Chhattisgarh</span> 
          <span> &nbsp;&gt;&nbsp; Raipur District</span> 
          <span> &nbsp;&gt;&nbsp; Raipur</span> 
         </div> 
        </li> 
        <li> 
         <div class="detail">Phone Number: 
          <span>+91 77160 60008</span> 
         </div> 
        </li> 
        <li> 
         <span class="ui_icon email"></span> 
         <a target="_blank&quot;" href="mailto:[email protected]" onclick="ta.trackEventOnPage('Eatery_Listing','Email','8595502')"> 
    E-mail </a> 
        </li> 
        <!--trkP:waypoint_for_poi_2--> 
        <!-- PLACEMENT waypoint_for_poi --> 
        <div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi"> 
        </div> 
        <!--etk--> 
       </ul> 
      </div> 
     </div> 
     <!--[if lte IE 9]> 
       <style> 
        .details_block .threeColumnList{ 
         height: 350px; 
         overflow: auto; 
        } 
       </style> 
       <![endif]--> 
    </div> 
</div> 

If you can, please indent your HTML. It just lets others understand the structure of the document more quickly. – ContinuousLoad

Answer


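Instead of getting a list of all <div class="content"> blocks and picking a few by index (the indices change from the first page to the second), you can find every <div class="row"> that contains a title and its corresponding content.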

# details_containers[0] is the <div id="RESTAURANT_DETAILS"> element from the question
rows = details_containers[0].findAll('div', {'class': 'row'})

# used to store data extracted from HTML <div class="row"> elements 
data = {} 

for row in rows: 
    title = row.find('div', {'class': 'title'}) 
    content = row.find('div', {'class': 'content'}) 

    # skip rows (such as the rating summary) that lack a title or a content div
    if title and content:
        # here I am just formatting the dict key to be more Pythonic; totally optional
        title = title.text.strip().lower().replace(' ', '-')
        data[title] = content

# tested with the HTML from the first page
# (the outputs shown are from Python 2; on Python 3 the u prefix is absent and key order may differ)
print(data.keys())
#=> [u'cuisine', u'restaurant-features', u'average-prices', u'good-for', u'open-hours', u'meals']
print(type(data['cuisine']))
#=> <class 'bs4.element.Tag'>

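Now you can extract content items from the HTML page without caring about the order in which they appear. This code should work on any HTML that has the same general structure as the two pages you provided. I hope this helps!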
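For completeness, here is a minimal self-contained sketch of the same title-based lookup applied to both kinds of page. It assumes Beautiful Soup 4 is installed. The two HTML strings are trimmed-down stand-ins for the pages in the question, not the full markup. The names `extract_details` and `WANTED` are made up for this illustration. A wanted row that is missing from a page is returned as `None` rather than raising an error.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-ins for the two pages in the question (illustrative only).
PAGE_ONE = """
<div id="RESTAURANT_DETAILS">
  <div class="row"><div class="title"> Average prices </div><div class="content"><span>448 - 768</span></div></div>
  <div class="row"><div class="title"> Cuisine </div><div class="content"><a href="#">Indian</a>, <a href="#">Asian</a></div></div>
  <div class="row"><div class="title"> Meals </div><div class="content"> Breakfast, Lunch </div></div>
  <div class="row"><div class="title"> Restaurant features </div><div class="content"> Reservations, Seating </div></div>
</div>
"""

PAGE_TWO = """
<div id="RESTAURANT_DETAILS">
  <div class="row"><div class="title"> Cuisine </div><div class="content"><a href="#">Indian</a>, <a href="#">Barbecue</a></div></div>
  <div class="row"><div class="title"> Meals </div><div class="content"> Lunch, Dinner </div></div>
  <div class="row"><div class="title"> Restaurant features </div><div class="content"> Reservations, Waitstaff </div></div>
</div>
"""

# Hypothetical list of the keys we care about, in the dashed lowercase form used above.
WANTED = ("cuisine", "meals", "restaurant-features")


def extract_details(html):
    """Return {key: text} for the wanted rows; rows missing from the page map to None."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", {"id": "RESTAURANT_DETAILS"})
    found = {}
    if block is not None:
        for row in block.find_all("div", {"class": "row"}):
            title = row.find("div", {"class": "title"})
            content = row.find("div", {"class": "content"})
            if title is not None and content is not None:
                key = title.get_text(strip=True).lower().replace(" ", "-")
                # Collapse runs of whitespace so multi-line content becomes one line.
                found[key] = " ".join(content.get_text().split())
    return {key: found.get(key) for key in WANTED}


first = extract_details(PAGE_ONE)
second = extract_details(PAGE_TWO)
print(first)
print(second)
```

Because each value is looked up by its title instead of its position, the extra "Average prices" row on the first page does not shift the results for the other fields.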