
Create a corpus in R from texts stored in JSON files

I have several JSON files in which the text is grouped into date, body, and title. For example, consider:

{"date": "December 31, 1990, Monday, Late Edition - Final", "body": "World stock markets begin 1991 facing the threat of a war in the Persian Gulf, recessions or economic slowdowns around the world, and dismal earnings -- the same factors that drove stock markets down sharply in 1990. Finally, there is the problem of the Soviet Union, the wild card in everyone's analysis. It is a country whose problems could send stock markets around the world reeling if something went seriously awry. With Russia about to implode, that just adds to the risk premium, said Mr. Dhar. LOAD-DATE: December 30, 1990 ", "title": "World Markets;"} 
{"date": "December 30, 1992, Sunday, Late Edition - Final", "body": "DATELINE: CHICAGO Gleaming new tractors are becoming more familiar sights on America's farms. Sales and profits at the three leading United States tractor makers -- Deere & Company, the J.I. Case division of Tenneco Inc. and the Ford Motor Company's Ford New Holland division -- are all up, reflecting renewed agricultural prosperity after the near-depression of the early and mid-1980's. But the recovery in the tractor business, now in its third year, is fragile. Tractor makers hope to install computers that can digest this information, then automatically concentrate the application of costly fertilizer and chemicals on the most productive land. Within the next 15 years, that capability will be commonplace, predicted Mr. Ball. LOAD-DATE: December 30, 1990 ", "title": "All About/Tractors;"} 

I have separate files containing all the texts produced by three different newspapers for each year from 1989 to 2016. My final goal is to merge all the texts into a single corpus. I did this in Python with the pandas library, and I would like to know whether the same can be done in R. Here is my pandas code, which loops over the years:

import json
import pandas as pd

appended_data = []
for i in range(1989, 2017):
    df0 = pd.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
    df1 = pd.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    df2 = pd.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
    appended_data.append(df0)
    appended_data.append(df1)
    appended_data.append(df2)

Answers


There are many options in R for reading JSON files and converting them into a data.frame/data.table.

Here is one using jsonlite and data.table:

library(data.table)
library(jsonlite)

res <- lapply(1989:2016, function(i) {
  # file names for the three newspapers in year i
  ff <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
  list_files_paths <- sprintf(ff, i)
  # parse each file and stack the three results into one data.table
  rbindlist(lapply(list_files_paths, fromJSON))
})

Here res is a list of data.tables. If you want to gather all the data.tables into a single data.table:

rbindlist(res) 

Use jsonlite::stream_in to read your files and jsonlite::rbind.pages to combine them.
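
A minimal sketch of that approach, assuming the yearly file-name pattern from the question (note that newer jsonlite versions rename rbind.pages to rbind_pages):

library(jsonlite)

# build the full list of yearly file names (pattern taken from the question)
years <- 1989:2016
papers <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
files <- sprintf(rep(papers, times = length(years)), rep(years, each = length(papers)))

# stream_in() reads newline-delimited JSON from a connection into a data.frame
dfs <- lapply(files, function(f) stream_in(file(f), verbose = FALSE))

# rbind.pages() stacks the per-file data.frames into a single data.frame
corpus_df <- rbind.pages(dfs)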


Use ndjson::stream_in to read them in, faster than jsonlite::stream_in :-)
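
A short sketch under the same assumption about file names; ndjson::stream_in() takes a file path directly and returns the parsed records, which rbindlist() can then stack:

library(ndjson)
library(data.table)

# read every yearly file for one newspaper and combine them
files <- sprintf('NYT_%d.json', 1989:2016)
corpus_dt <- rbindlist(lapply(files, ndjson::stream_in))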
