2014-11-08
1

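Reading a CSV with both paired and unpaired quotation marks

I have a CSV file, generated from MS SQL Server, that I want to read into R. It contains data similar to the following: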

# reproduce file 
possibilities <- c('this is good','"this has, a comma"','here is a " quotation','') 
newstrings <- expand.grid(possibilities,possibilities,possibilities,stringsAsFactors = F) 
xwrite <- apply(newstrings,1,paste,collapse = ",") 
xwrite <- c('v1,v2,v3',xwrite) 
writeLines(xwrite,con = 'test.csv') 

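I would normally open this in Excel, which magically reads it and writes it back out in a format that is cleaner for R, but this file exceeds Excel's row limit. If I can't figure it out, I will have to go back and export the data in another format. I have tried many variations of what I've read.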

# a few things I've tried 
(rl <- readLines('test.csv')) 
read.csv('test.csv',header = T,quote = "",stringsAsFactors = F) 
read.csv('test.csv',header = F,quote = "",stringsAsFactors = F,skip = 1) 
read.csv('test.csv',header = T,stringsAsFactors = F) 
read.csv('test.csv',header = F,stringsAsFactors = F,skip = 1) 
read.table('test.csv',header = F) 
read.table('test.csv',header = F,quote = "\"") 
read.table('test.csv',header = T,sep = ",") 
scan('test.csv',what = 'character') 
scan('test.csv',what = 'character',sep = ",") 
scan('test.csv',what = 'character',sep = ",",quote = "") 
scan('test.csv',what = 'character',sep = ",",quote = "\"") 

unlist(strsplit(rl,split = ',')) 

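The following seems to work for the data I have, but I'm not comfortable reusing it because it does not handle the sixth line, which illustrates data that could plausibly occur in another file.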

# works if only comma OR unpaired quotation but not both 
rl[grep('^[^\"]*\"[^\"]*$',rl)] <- sub('^([^\"]*)(\")([^\"]*)$','\\1\\3',rl[grep('^[^\"]*\"[^\"]*$',rl)]) 
writeLines(rl,'testfixed.csv') 
read.csv('testfixed.csv') 

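I found a similar problem, but my issue is that stray quotation marks occur on their own inside the data, rather than the file having one consistent formatting problem.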

Is it possible to get a correct data.frame from this?

Answers

0

I don't think there is a direct way to do this. Here I basically use strsplit to split on commas, but first I handle the special delimiters ," and ", separately.
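To make the first step concrete, here is a small sketch of what splitting on the special delimiters does. The two input lines are illustrative two-column samples, not the full file. A line with no quoted field comes back as a single piece, which is how the "normal case" can be told apart from the "quotation case".

```r
# Two illustrative lines: one plain, one with a quoted field containing a comma
x <- c('this is good,this is fine',
       'this has no commas,"this has, a comma"')

# Split only where a comma sits directly next to a quotation mark
parts <- strsplit(x, ',"|",')

# One piece means no special delimiter was found; split that line on plain commas
lengths(parts)
parts[[2]]
```

Note that the quotation mark that was not part of the delimiter stays attached to the field, so the pieces still need cleaning afterwards.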

lines <- readLines('test.csv') 
## separate the quotation case 
lines_spe <- strsplit(lines,',\"|\",') 
nn <- sapply(lines_spe,length)==1 
## the normal case 
lines[nn] <- strsplit(lines[nn],',',perl=TRUE) 
## aggregate the results 
lines[!nn] <- lines_spe[!nn] 
## bind to create a data.frame 
dat <- 
setNames(as.data.frame(do.call(rbind,lines[-1]),stringsAsFactors =F), 
     lines[[1]]) 
## treat the special case of strsplit('some text without second part,',',') 
dat[dat$v1==dat$v2,"v2"] <- "" 
dat 
#                         v1                      v2
# 1             this is good            this is fine
# 2       this has no commas      this has, a comma"
# 3   this has no quotations  this has a " quotation
# 4 this field has something
# 5                          now the other side does
# 6       "this has, a comma  this has a " quotation
# 7         and a final line     that should be fine

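The result is nearly right, except for lines that have no second part, where strsplit fails to return the trailing empty string. In your data this happens with "this field has something,". Here is an example that illustrates the problem: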

> strsplit('aaa,',',') 
[[1]] 
[1] "aaa" 

> strsplit(',aaa',',') 
[[1]] 
[1] "" "aaa" 
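
One possible workaround for the dropped trailing field, sketched here as an assumption rather than as part of the answer above: append one extra separator before splitting. strsplit ignores a single match at the very end of the string, so the extra separator is the one that gets discarded and the original trailing empty field survives. The helper name split_keep_trailing is hypothetical.

```r
# Hypothetical helper (not from the original answer): split on a separator
# while keeping a trailing empty field.
split_keep_trailing <- function(x, sep = ",") {
  # strsplit() discards one match at the end of the string, so add a
  # sacrificial separator for it to discard.
  strsplit(paste0(x, sep), sep, fixed = TRUE)
}

split_keep_trailing("aaa,")   # two fields: "aaa" and ""
split_keep_trailing(",aaa")   # two fields: "" and "aaa"
```

This only addresses the trailing empty string. It still splits on every comma, so quoted fields that contain commas need the special handling shown earlier.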

0

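This is closer and may do. It will fail if a lone quotation mark sits next to a comma, because I assume those are the start or end of a string that actually needs quoting.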

rl <- readLines('test.csv') 
rl <- gsub('([^,])(\")([^,])','\\1\\3',rl,perl = T) 
writeLines(rl,'testfixed.csv') 
read.csv('testfixed.csv')
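
To see what the substitution does and where it stops working, here is a small sketch that wraps the same regular expression in a helper (the name strip_inner_quotes is hypothetical) and applies it to three illustrative lines.

```r
# Hypothetical wrapper around the same pattern used above
strip_inner_quotes <- function(x) {
  # Remove a quotation mark only when neither neighbour is a comma
  gsub('([^,])(")([^,])', '\\1\\3', x, perl = TRUE)
}

# Stray quote inside a field: removed
strip_inner_quotes('here is a " quotation,this is good')

# Properly paired quotes around a field with a comma: left alone
strip_inner_quotes('"this has, a comma",this is good')

# Known limitation: a lone quote directly before a comma is left in place
strip_inner_quotes('ends with a stray",this is good')
```

In the third case the remaining quotation mark is still unpaired, so read.csv would be expected to misread that line.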