2013-05-05 58 views
0

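This is a program that parses some websites. The first website is site1. All the logic for parsing a particular site lives in (-> config :site1). How can I write this Clojure Enlive program so that it can parse multiple URLs?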

(ns program.core 
    (:require [net.cgrand.enlive-html :as html])) 

(def config 
    {:site1 
     {:site-url 
      ["http://www.site1.com/page/1" 
      "http://www.site1.com/page/2" 
       "http://www.site1.com/page/3" 
      "http://www.site1.com/page/4"] 
     :url-encoding "iso-8859-1" 
     :parsing-index 
      {:date 
       {:selector 
        [[:td.PadMed (html/nth-of-type 1)] :table [:tr (html/nth-of-type 2)] 
        [:td (html/nth-of-type 3)] [:span]] 
       :trimming-fn 
        (comp first :content) ; first unwraps the one-element :content seq 
       } 
      :title 
       {:selector 
        [[:td.PadMed (html/nth-of-type 1)] :table :tr [:td (html/nth-of-type 2)] [:a]] 
       :trimming-fn 
        (comp first :content first :content) 
       } 
      :url 
       {:selector 
        [[:td.PadMed (html/nth-of-type 1)] :table :tr [:td (html/nth-of-type 2)] [:a]] 
       :trimming-fn 
        #(str "http://www.site.com" (:href (:attrs %))) 
       } 
      } 
     }}) 
    ;=== Fetch fn ===; 

    (defn fetch-encoded-url 
     ([url] (fetch-encoded-url url "utf-8")) 
     ([url encoding] (-> url java.net.URL. 
        .getContent 
        (java.io.InputStreamReader. encoding) 
        html/html-resource))) 

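Now I want to parse the pages listed in (-> config :site1 :site-url). In this example I only use the first URL, but how could I design this so that it actually processes all of the URLs, for example with one master for?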

(defn parse-element [element] 
    (into [] (map (-> config :site1 :parsing-index element :trimming-fn) 
      (html/select 
       (fetch-encoded-url 
       (-> config :site1 :site-url first) 
       (-> config :site1 :url-encoding)) 
       (-> config :site1 :parsing-index element :selector))))) 

(def element-lists 
    (apply map vector 
     (map parse-element (-> config :site1 :parsing-index keys)))) 

(def tagged-lists 
    (into [] (for [element-list element-lists] 
      (zipmap [:date :title :url] element-list)))) 

;==== Fn call ==== 
    (println tagged-lists) 
+1

Didn't you just ask the same question 4 hours ago? – nansen 2013-05-05 20:44:48

+0

Sorry. I just deleted the previous question, thanks! – leontalbot 2013-05-05 21:55:55

Answers

1

Pass :site1 as a parameter to parse-element and element-lists:

(defn parse-element [site element] 
    (into [] (map (-> config site :parsing-index element :trimming-fn) 
     (html/select 
      (fetch-encoded-url 
      (-> config site :site-url first) 
      (-> config site :url-encoding)) 
      (-> config site :parsing-index element :selector))))) 

(defn element-lists [site] 
    (apply map vector 
     (map (partial parse-element site) (-> config site :parsing-index keys)))) 

Then map over the :site1, :site2, ... keys.
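
As a rough sketch of that last step, mapping over the site keys could look like the following. The two-site config, the site2 URL, the stubbed element-lists and the name parse-all-sites are all assumptions made for illustration so the snippet runs without network access; in the real program element-lists would be the function defined above.

```clojure
;; Stand-in config with two sites. The real config also carries
;; :url-encoding and :parsing-index, as in the question.
(def config
  {:site1 {:site-url ["http://www.site1.com/page/1"]}
   :site2 {:site-url ["http://www.site2.com/page/1"]}})

;; Hypothetical stub for the answer's element-lists, so the sketch runs
;; offline. It only echoes the first URL of the given site.
(defn element-lists [site]
  [[(-> config site :site-url first)]])

;; Map over every site key and collect the results per site.
(defn parse-all-sites []
  (into {}
        (for [site (keys config)]
          [site (element-lists site)])))
```

Calling (parse-all-sites) returns a map keyed by site, which keeps each site's rows separate instead of mixing them into one list.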


Addendum, in response to the further question in the comments.

You can wrap html/select in a map over :site-url. Something like:

(defn parse-element [site element] 
    (let [site-urls (-> config site :site-url)] 
    (into [] (map (-> config site :parsing-index element :trimming-fn) 
     ;; the inner map yields one sequence of selected nodes per URL 
     (map 
      #(html/select 
       (fetch-encoded-url 
        % 
        (-> config site :url-encoding)) 
       (-> config site :parsing-index element :selector)) 
      site-urls))))) 

(I hope I got the parentheses right.)

Then you may need to check the :trimming-fn so that it handles the nesting. An apply should be enough.
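
To illustrate the nesting issue: mapping a select over several URLs gives one sequence of nodes per page, so the trimming function would receive a whole sequence rather than a single node. One possible reading of "an apply" is (apply concat ...), which flattens the per-page sequences first; that reading is an assumption, and mapcat does the same thing in one step. In the sketch below, fake-select is a hypothetical stand-in for (html/select (fetch-encoded-url url encoding) selector) and the URLs are shortened placeholders.

```clojure
;; Hypothetical stand-in for selecting nodes from a fetched page.
;; Returns Enlive-style node maps without touching the network.
(defn fake-select [url]
  [{:tag :span :content [(str url " first")]}
   {:tag :span :content [(str url " second")]}])

(def urls ["page/1" "page/2"])

;; Same shape as the :date trimming function in the question.
(def trimming-fn (comp first :content))

;; map yields one sequence of nodes per URL, i.e. a nested result...
(def nested (map fake-select urls))

;; ...so concatenate the per-page sequences before trimming each node.
(def trimmed
  (into [] (map trimming-fn (apply concat nested))))
```

Here trimmed holds one trimmed value per node across all pages, in page order.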

+0

Awesome! How could I do this for multiple URLs within a given site? :site-url ["http://www.site1.com/page/1" "http://www.site1.com/page/2" "http://www.site1.com/page/3" "http://www.site1.com/page/4"] – leontalbot 2013-05-07 01:52:09

+0

You will need to rewrite 'parse-element' so that it maps over the ':site-url' vector instead of taking its first element, wrapping the whole 'html/select'. See my edit. – 2013-05-07 02:13:17

+0

@user1184248 Consider upvoting and/or accepting the answer if it meets your expectations. – 2013-05-07 02:25:03
