2014-09-04

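Comparing textual representations of binary files

I have two XML reports produced by running two programs. Each report contains a section listing every I/O operation performed and the content of each operation. Some of that content is XML and some is binary, but the data included in the report is always text, so I end up with something like this: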

.....0.................. [email protected]'F...O)v...O*......................0..........l...c...= 
Y!...!pvw.........(.........E... 
yY...-qVC......p...K,......Pm.........Si4........,.......C0....?0....'...................K0....0 
. *...H...... 
....0I1.0 ..U....US1.0...U. 
. 
Google Inc1%0#..U....Google Internet Authority G20.. 
140423121609Z. 
140722000000Z0f1.0 ..U....US1.0...U... 
California1.0...U... 
Mountain View1.0...U. 
. 
Google Inc1.0...U....*.google.com0...."0 
. *...H...... 
..........0.... 
..............>..........:...z...S...5...%f............-....*J...i.......c}m......N%...t....G..f.......y.........0x...F.........:......k...k$......!............I...A...........A...G.......q...C...g........r.......b....6.......c...|X.........F...?qs......'.........................mrM.....D....9... 
....$...v... [email protected]/... U~....r......... .........g_ ...[y...7=...i... >......b......s...........W......#w..............e..........yI.........{..............0.....0...U.%..0...+.........+.......0.........U..........0.......*.google.com... 
*.android.com....*.appengine.google.com....*.cloud.google.com....*.google-analytics.com....*.google.ca....*.google.cl....*.google.co.in....*.google.co.jp....*.google.co.uk....*.google.com.ar....*.google.com.au....*.google.com.br....*.google.com.co....*.google.com.mx....*.google.com.tr....*.google.com.vn....*.google.de....*.google.es....*.google.fr....*.google.hu....*.google.it....*.google.nl....*.google.pl....*.google.pt....*.googleapis.cn....*.googlecommerce.com....*.googlevideo.com... 
*.gstatic.com... 
*.gvt1.com....*.urchin.com....*.url.google.com....*.youtube-nocookie.com... 
*.youtube.com....*.youtubeeducation.com....*.ytimg.com....android.com....g.co....goo.gl....google-analytics.com... 
google.com....googlecommerce.com... 
urchin.com....youtu.be....youtube.com....youtubeeducation.com0h..+.........0Z0+..+.....0.....http://pki.google.com/GIAG2.crt0+..+.....0.....http://clients1.google.com/ocsp0...U.........XV.H...%....r..!.......y...'0...U.........00...U.#..0.....J............h...v...b....Z.../0...U. ..0.0.. 
+.......y...00..U...)0'0%...#...!....http://pki.google.com/GIAG2.crl0 
. *...H...... 
..........A...d...A~A..0...P-JY/........"..M...N.=...H....n%...A......u......2...X......I........F...%....%p..............K...j...A.............g$Y...h....K....E...m......s/......t.....S..SN...Wo.B6.......a......|.............q........?.............y...N....K=....1......|+......3=.....6....j...&...H?.1.....X.H..#V".k.............-.....C.....5S......$.G............eMY(...1+,.e...v"......K...C...}.....V............28K......[......4A.Vr.......C0....?0....'...................K0....0 
. *...H...... 
....0I1.0 ..U....US1.0...U. 

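I want to compare these segments to find similarities, i.e. to determine whether the two programs write similar content to, or read similar content from, the file system. Since each report contains many I/O operations (hundreds) and there are many reports (tens of thousands), the comparison also needs to be fast. I am using Java.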

Any suggestions?

How do you define "similar" content? – ControlAltDel 2014-09-04 15:27:10

That is one of the problems XD – Totem 2014-09-04 15:29:18

Compare each character/set of characters against the entire second document :P – Boop 2014-09-04 16:30:34

Answers

In the end I used the normalized compression distance. I still don't know whether it is the best approach for my data...
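
For reference, here is a minimal sketch of a normalized compression distance in Java. It assumes DEFLATE (`java.util.zip.Deflater`) as the compressor; the answer does not say which compressor was actually used, and the names `Ncd`, `compressedSize`, and `ncd` are hypothetical. The distance is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. Values near 0 suggest similar content and values near 1 suggest unrelated content.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper class; the compressor choice (DEFLATE) is an assumption.
final class Ncd {

    /** Returns the size in bytes of the DEFLATE-compressed form of data. */
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[4096];
            int total = 0;
            // Only the output length matters, so the buffer is reused.
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Normalized compression distance between two byte sequences. */
    static double ncd(byte[] x, byte[] y) {
        int cx = compressedSize(x);
        int cy = compressedSize(y);

        // Compress the concatenation xy.
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        int cxy = compressedSize(xy);

        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        // Two short made-up segments in the style of the report text.
        byte[] a = "Google Inc1%0#..U....Google Internet Authority G20.."
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] b = "California1.0...U...Mountain View1.0...U."
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("NCD(a, a) = " + ncd(a, a));
        System.out.println("NCD(a, b) = " + ncd(a, b));
    }
}
```

With hundreds of operations per report and tens of thousands of reports, the compressed size of each individual segment could be computed once and cached, so that every pairwise comparison costs only one extra compression of the concatenation. DEFLATE uses a 32 KB window, so for segments much larger than that the compressor cannot see repeats between the two halves and a different compressor may be a better fit.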