2011-07-19

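I would appreciate suggestions/insights on how to use Clojure to efficiently parse and compare two files. There are two (log) files containing employee attendance; from these files I need to determine all the days on which two employees worked at the same time, in the same department. Samples of the log files are below. How can I parse and compare the files?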

Note: each file has a different number of entries.

First file:

Employee Id  Name   Time In   Time Out   Dept. 
mce0518   Jon  2011-01-01 06:00 2011-01-01 14:00  ER 
mce0518   Jon  2011-01-02 06:00 2011-01-01 14:00  ER 
mce0518   Jon  2011-01-04 06:00 2011-01-01 13:00  ICU 
mce0518   Jon  2011-01-05 06:00 2011-01-01 13:00  ICU 
mce0518   Jon  2011-01-05 17:00 2011-01-01 23:00  ER 

Second file:

Employee Id  Name   Time In   Time Out   Dept. 
pdm1705   Jane  2011-01-01 06:00 2011-01-01 14:00  ER 
pdm1705   Jane  2011-01-02 06:00 2011-01-01 14:00  ER 
pdm1705   Jane  2011-01-05 06:00 2011-01-01 13:00  ER 
pdm1705   Jane  2011-01-05 17:00 2011-01-01 23:00  ER 

Answers

If you are not going to do this on a regular basis,

 

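(defn data-seq
  "Reads file f, skips the header line, and splits each row on whitespace."
  [f]
  (with-open [rdr (java.io.BufferedReader. (java.io.FileReader. f))]
    (doall (map #(seq (.split ^String % "\\s+"))
                (rest (line-seq rdr))))))

(defn same-time?
  "True when rows a and b agree on every column after the id and name
  (time in, time out, and department)."
  [a b]
  (= (drop 2 a) (drop 2 b)))

(let [f1 (data-seq "f1.txt")
      f2 (data-seq "f2.txt")]
  ;; For each row of f1, collect the ids of the matching rows in f2.
  (reduce (fn [h v]
            (let [matches (filter #(same-time? v %) f2)]
              (if (empty? matches)
                h
                (conj h [(first v) (map first matches)]))))
          []
          f1))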
 

will get you:

[["mce0518" ("pdm1705")] ["mce0518" ("pdm1705")] ["mce0518" ("pdm1705")]] 

The function name same-time? is a bit misleading, since it checks both the time and the department.

Yes, I had planned to handle the comparison in two calls, but then figured that drop 2 followed by = would do.
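The two-call variant mentioned in these comments could look roughly like the sketch below. It assumes rows are whitespace-split vectors of the form id, name, date in, time in, date out, time out, department, as in the answer above. The names `same-shift?`, `same-dept?` and `same-shift-and-dept?` are made up for illustration and do not appear in the original answer.

```clojure
;; Hypothetical helper names; rows are assumed to be
;; [id name date-in time-in date-out time-out dept].

(defn same-shift?
  "True when rows a and b have identical time-in and time-out columns."
  [a b]
  (= (take 4 (drop 2 a)) (take 4 (drop 2 b))))

(defn same-dept?
  "True when rows a and b have the same department (the last column)."
  [a b]
  (= (last a) (last b)))

(defn same-shift-and-dept?
  "Combines both checks; this is what same-time? in the answer actually tests."
  [a b]
  (and (same-shift? a b) (same-dept? a b)))

;; Two rows taken from the sample files in the question.
(def jon  ["mce0518" "Jon"  "2011-01-05" "06:00" "2011-01-01" "13:00" "ICU"])
(def jane ["pdm1705" "Jane" "2011-01-05" "06:00" "2011-01-01" "13:00" "ER"])

(same-shift? jon jane)          ; => true  (same hours)
(same-shift-and-dept? jon jane) ; => false (ICU vs. ER)
```

Splitting the predicate this way makes each name say what it checks, at the cost of walking each row twice.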

Here is a somewhat shorter and (IMHO) more readable version:

(use ; moar toolz - moar fun 
    '[clojure.contrib.duck-streams :only (reader)] 
    '[clojure.string :only (split)] 
    '[clojure.contrib.str-utils :only (str-join)] 
    '[clojure.set :only (intersection)]) 

(defn read-presence [filename]
  (with-open [rdr (reader filename)] ; the file is always closed after use
    (apply hash-set ; build the employee's hash set
           (map #(str-join "--" (drop 2 (split % #" [ ]+"))) ; right to left: split the row on runs of two or more spaces, drop the first two columns, join with "--"
                (drop 1 ; omit the header line
                      (line-seq rdr)))))) ; read the file line by line

(intersection (read-presence "a.in") (read-presence "b.in")) ; now it's simple! 
;result: #{"2011-01-01 06:00--2011-01-01 14:00--ER" "2011-01-02 06:00--2011-01-01 14:00--ER" "2011-01-05 17:00--2011-01-01 23:00--ER"} 
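On newer Clojure versions, where the monolithic clojure.contrib is no longer available, the same idea can be sketched with `clojure.java.io` and `clojure.string` alone. This is a minimal sketch, not the answer's original code: `presence-set` is a hypothetical helper that works on a sequence of lines so it can be tried without files, and it assumes columns are separated by two or more whitespace characters.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.set :as set]
         '[clojure.string :as str])

(defn presence-set
  "Turns attendance lines (header first) into a set of
  time-in--time-out--dept strings. Hypothetical helper."
  [lines]
  (->> (rest lines)                               ; skip the header line
       (remove str/blank?)                        ; ignore empty lines
       (map #(str/split (str/trim %) #"\s{2,}"))  ; assumed: 2+ whitespace chars between columns
       (map #(str/join "--" (drop 2 %)))          ; drop id and name, join the rest
       set))                                      ; eager, so safe inside with-open

(defn read-presence
  "Same as above, reading the lines from a file."
  [filename]
  (with-open [rdr (io/reader filename)]
    (presence-set (line-seq rdr))))

;; In-memory stand-ins for the two sample files.
(def jon-lines
  ["Employee Id  Name  Time In  Time Out  Dept."
   "mce0518  Jon  2011-01-01 06:00  2011-01-01 14:00  ER"
   "mce0518  Jon  2011-01-04 06:00  2011-01-01 13:00  ICU"])

(def jane-lines
  ["Employee Id  Name  Time In  Time Out  Dept."
   "pdm1705  Jane  2011-01-01 06:00  2011-01-01 14:00  ER"
   "pdm1705  Jane  2011-01-05 06:00  2011-01-01 13:00  ER"])

(set/intersection (presence-set jon-lines) (presence-set jane-lines))
;; => #{"2011-01-01 06:00--2011-01-01 14:00--ER"}
```

Keeping the parsing separate from the file handling also makes the string-joining step easy to check at the REPL.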

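This assumes a.in and b.in are your files. I also assume you will have one hash set per employee; a (naive) generalization to N employees takes six more lines: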

(def employees ["greg.txt" "allison.txt" "robert.txt" "eric.txt" "james.txt" "lisa.txt"])
(for [a employees b employees
      :when (and (= a (first (sort [a b]))) ; thou shall compare greg with james ONCE
                 (not= a b))]               ; thou shall not compare greg with greg
  (str-join " -- " ; well, it's not pretty... nor pink at least
            [a b (intersection (read-presence a) (read-presence b))]))
;result: ("a.in -- b.in -- #{\"2011-01-01 06:00--2011-01-01 14:00--ER\" \"2011-01-02 06:00--2011-01-01 14:00--ER\" \"2011-01-05 17:00--2011-01-01 23:00--ER\"}") 

Actually, this loop is SOOO ugly, and it does not memoize intermediate results... to be improved.

- Edit -

I knew there had to be something elegant in core or contrib!

(use '[clojure.contrib.combinatorics :only (combinations)]) 

(def employees ["greg.txt" "allison.txt" "robert.txt" "eric.txt" "james.txt" "lisa.txt"]) 
(def employee-map (apply conj (for [e employees] {e (read-presence e)}))) 
(map (fn [[a b]] [a b (intersection (employee-map a) (employee-map b))]) 
    (combinations employees 2)) 
;result: (["a.in" "b.in" #{"2011-01-01 06:00--2011-01-01 14:00--ER" "2011-01-02 06:00--2011-01-01 14:00--ER" "2011-01-05 17:00--2011-01-01 23:00--ER"}]) 

Now it is memoized (the parsed data lives in employee-map), general and... lazy :D
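If pulling in a combinatorics library is not an option, the pairing step can be approximated with core functions only. The sketch below assumes the attendance data has already been parsed into a map from employee name to a set of shift strings. The names `pairs` and `common-shifts`, and the sample entry "sam", are made up for illustration. Unlike the version above, it also drops pairs with nothing in common.

```clojure
(require '[clojure.set :as set])

(defn pairs
  "All unordered pairs of distinct positions in coll, each pair exactly once."
  [coll]
  (let [v (vec coll)]
    (for [i (range (count v))
          j (range (inc i) (count v))]
      [(v i) (v j)])))

(defn common-shifts
  "presence maps an employee name to a set of shift strings.
  Returns [a b shared] for every pair that shares at least one shift."
  [presence]
  (for [[a b] (pairs (sort (keys presence)))   ; sorted for a stable order
        :let [shared (set/intersection (presence a) (presence b))]
        :when (seq shared)]                    ; skip pairs with no overlap
    [a b shared]))

;; Hypothetical, already-parsed data; each file is parsed only once.
(def presence
  {"jon"  #{"2011-01-01 06:00--2011-01-01 14:00--ER"
            "2011-01-04 06:00--2011-01-01 13:00--ICU"}
   "jane" #{"2011-01-01 06:00--2011-01-01 14:00--ER"}
   "sam"  #{}})

(common-shifts presence)
;; => (["jane" "jon" #{"2011-01-01 06:00--2011-01-01 14:00--ER"}])
```

The index-based `for` replaces both the sort trick and the self-comparison check from the earlier loop, and the result stays lazy.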