2016-11-16

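Reading HDF5 files from a *.tar.gz archive in Spark

Following this post, I can read multiple *.txt files that reside inside a *.tar.gz file. Now, however, I need to read HDF5 files inside a *.tar.gz file. A sample file can be downloaded from the million songs dataset here. Can anyone tell me how to change the following code so that the HDF5 files are read into an RDD? Thanks!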

package a.b.c 

import org.apache.spark._ 
import org.apache.spark.sql.{SQLContext, DataFrame} 
import org.apache.spark.ml.tuning.CrossValidatorModel 
import org.apache.spark.ml.regression.LinearRegressionModel 
import org.apache.spark.ml.{Pipeline, PipelineModel} 
import org.apache.spark.ml.regression.LinearRegression 
import org.apache.spark.input.PortableDataStream 
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream 
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream 
import scala.util.Try 
import java.nio.charset._ 

object Main {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("lab1").setMaster("local")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        import sqlContext.implicits._
        import sqlContext._

        val inputpath = "path/to/millionsong.tar.gz"
        val rawDF = sc.binaryFiles(inputpath, 2)
            .flatMapValues(x => extractFiles(x).toOption)
            .mapValues(_.map(decode()))
            .map(_._2)
            .flatMap(x => x)
            .flatMap { x => x.split("\n") }
            .toDF()
    }

    def extractFiles(ps: PortableDataStream, n: Int = 1024) = Try {
        val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
        Stream.continually(Option(tar.getNextTarEntry))
            // Read until the next entry is null
            .takeWhile(_.isDefined)
            // Flatten the Options
            .flatMap(x => x)
            // Drop directories
            .filter(!_.isDirectory)
            .map(e => {
                Stream.continually {
                    // Read up to n bytes
                    val buffer = Array.fill[Byte](n)(-1)
                    val i = tar.read(buffer, 0, n)
                    (i, buffer.take(i))
                }
                    // Take as long as we've read something
                    .takeWhile(_._1 > 0)
                    .map(_._2)
                    .flatten
                    .toArray
            })
            .toArray
    }

    // Decode with the charset that was passed in (the original ignored the parameter)
    def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) = new String(bytes, charset)
}

Did you find a way to do this? – Reginbald

Answer


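I managed to read the HDF5 files inside the tarball by writing each file's bytes to a local file and then opening that file as H5, using this to extract the features. Here is my code: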

import java.io.{File, FileInputStream}
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.io.{FileUtils, IOUtils}
// Assumption: HDF5Factory comes from the JHDF5 library (package ch.systemsx.cisd.hdf5)
import ch.systemsx.cisd.hdf5.HDF5Factory

// `sc` (the SparkContext), `path` (the data folder) and `getFeatures` are defined elsewhere
var tarFiles: Array[String] = Array()
val tar_path = path + "millionsongsubset.tar.gz"

//TODO: add all your tar.gz files in main folder path to tarFiles array
//should add here as many tar.gz files as wanted containing the
//hdf5 files for the songs
tarFiles = tarFiles :+ tar_path
//tarFiles = tarFiles :+ (path+"A.tar.gz")
//tarFiles = tarFiles :+ (path+"B.tar.gz")
//tarFiles = tarFiles :+ (path+"C.tar.gz")

//This reads all tar.gz files in tarFiles list, and for each .h5
//file within, it extracts each song's list of features.
//Thus, it gets a list of features for all songs in the files.
val allHDF5 = sc.parallelize(tarFiles).flatMap(tarPath => {
    val tar = new TarArchiveInputStream(new GzipCompressorInputStream(new FileInputStream(tarPath)))
    var res: List[Array[Byte]] = List()
    var i = 0
    try {
        var entry: TarArchiveEntry = tar.getNextEntry().asInstanceOf[TarArchiveEntry]
        while (entry != null) {
            if (!entry.isDirectory() && entry.getName.endsWith(".h5")) {
                val byteFile = Array.ofDim[Byte](entry.getSize.toInt)
                // A single read() may return fewer bytes than requested,
                // so keep reading until the buffer is full
                IOUtils.readFully(tar, byteFile)
                res = byteFile :: res
                i = i + 1
                if (i % 100 == 0) {
                    println("Read " + i + " files")
                }
            }
            entry = tar.getNextEntry().asInstanceOf[TarArchiveEntry]
        }
    } finally {
        tar.close()
    }
    //All files are turned into byte arrays
    res
}).map(bytes => {
    // A temp file gives a unique name; Array.toString is only an identity hash
    val tmp = File.createTempFile("song-", ".h5")
    try {
        FileUtils.writeByteArrayToFile(tmp, bytes)
        val reader = HDF5Factory.openForReading(tmp.getPath)
        try getFeatures(reader) finally reader.close()
    } finally {
        tmp.delete()
    }
})

println("Extracted songs from tar.gz, showing 5 examples")
allHDF5.take(5).foreach(x => {
    x.foreach(y => print(y + " "))
    println()
})

A few remarks:

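  1. The getFeatures method: this is a very simple adaptation of the code here. It extracts some features and returns them as an array. Note that to run this feature-extraction code you will need this library, which has good javadoc.
  2. Note that if this code runs on a cluster with multiple executors, the executors write the .h5 files locally, so if the work moves around the cluster, at some point you may try to read a file that does not exist on the local machine.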
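
One way to reduce the risk described in the second remark is to write, read, and delete each file inside a single function call, so that all three steps happen in the same task on the same machine. The following is only a minimal sketch of that idea using the JDK alone. The names `TempFileSketch` and `withTempFile` are hypothetical and do not come from the answer or from any library.

```scala
import java.io.File
import java.nio.file.Files

object TempFileSketch {
  // Hypothetical helper (not part of the original answer): writes `bytes` to a
  // uniquely named temporary file, hands that file to `use`, and deletes the
  // file afterwards, even if `use` throws.
  def withTempFile[A](bytes: Array[Byte], suffix: String = ".h5")(use: File => A): A = {
    val tmp = Files.createTempFile("song-", suffix)
    try {
      Files.write(tmp, bytes)
      use(tmp.toFile)
    } finally {
      Files.deleteIfExists(tmp)
    }
  }
}
```

Inside the RDD's `map`, the body could then become something like `TempFileSketch.withTempFile(bytes) { f => ... }`, opening `f.getPath` with the HDF5 reader and calling `getFeatures` within the braces. Because the file never outlives the call, no later stage depends on it being present on a particular executor.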