2016-03-15

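Reading an oddly formatted CSV file

I am looking to download some data from the statistics.gov.scot website. For example, I would like to source some data on hospital admission rates. The query for the data table I'm interested in has the following format: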

http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall 

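The URL above can be opened directly by anyone who wants to try it. The query produces a *.CSV file with the relevant information, but the format of the file poses some challenges.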

Example file

The file contents look like this:

Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00 
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions 
measure type,"" 
Admission Type,"" 
Age,"" 
Gender,"" 
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)" 

,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012 
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012 
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194" 

When imported into Excel, the file looks like this:

Excel import

However, when imported into R via read.csv, it looks like this:

> head(problematicFile) 
                V1      V2 
1    Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00 
2 http://statistics.gov.scot/data/hospital-admissions  Hospital Admissions 
3          measure type       
4          Admission Type       
5             Age       
6            Gender 

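Problem

The read.csv import returns only two columns. I guess the problem is related to some of the initial columns being empty. I would like to read this file in a way similar to the import shown in Excel. The point is that I intend to use the rows in columns A and B and, naturally, the data table below them. As for the resulting data.frame, I am happy for it to contain NA values where cells are empty, as long as its dimensions match those in Excel. I tried: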

read.csv(file = link, header = FALSE, na.strings = "", 
           fill = TRUE) 

But I keep arriving at the same problem.

Desired result

The desired result should look like this (extract produced by hand):

Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00 NA NA NA NA NA NA NA 
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA NA NA NA NA NA NA 
measure type NA NA NA NA NA NA NA NA 
Admission Type NA NA NA NA NA NA NA NA 
Age NA NA NA NA NA NA NA NA 
Gender NA NA NA NA NA NA NA NA 
Measure (cell values): Ratio (Rate Per 100,000 Population)   NA NA NA NA NA 
NA NA NA NA NA NA NA NA NA 
NA NA http://reference.data.gov.uk/id/year/2002 http://reference.data.gov.uk/id/year/2003 http://reference.data.gov.uk/id/year/2004 http://reference.data.gov.uk/id/year/2005 http://reference.data.gov.uk/id/year/2006 http://reference.data.gov.uk/id/year/2007 http://reference.data.gov.uk/id/year/2008 
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area 2002 2003 2004 2005 2006 2007 2008 
http://statistics.gov.scot/id/statistical-geography/S92000003 Scotland 9,351 9,262 9,261 9,347 9,723 10,517 10,293 
http://statistics.gov.scot/id/statistical-geography/S16000082 Angus South 8,236 8,500 8,523 8,371 8,616 8,978 9,325 
http://statistics.gov.scot/id/statistical-geography/S16000106 Edinburgh Northern and Leith 9,040 8,040 7,925 9,042 10,355 11,833 8,916 
http://statistics.gov.scot/id/statistical-geography/S16000140 Renfrewshire South 9,391 9,122 9,491 9,586 10,425 10,900 11,065 
http://statistics.gov.scot/id/statistical-geography/S16000108 Edinburgh Southern 5,878 5,910 6,101 6,035 7,426 9,343 6,766 
http://statistics.gov.scot/id/statistical-geography/S16000075 Aberdeen Donside 10,047 10,963 10,629 10,512 10,383 10,787 10,685 
http://statistics.gov.scot/id/statistical-geography/S16000137 Perthshire North 9,388 9,524 7,799 9,350 9,543 9,791 9,991 
http://statistics.gov.scot/id/statistical-geography/S16000077 Aberdeenshire East 7,211 7,300 7,153 7,411 7,435 7,268 7,547 
http://statistics.gov.scot/id/statistical-geography/S16000114 Galloway and West Dumfries 9,861 9,165 8,143 9,258 7,508 10,213 10,399 
http://statistics.gov.scot/id/statistical-geography/S16000096 Dumbarton 8,703 8,570 8,727 9,310 9,389 9,885 10,237 

Screenshot

Just to illustrate the point further, I would like to keep the dimensions and fill the missing values with NA:

Excel with NAs

Answers

2

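Parsing the metadata out of the file header is a bit tricky. You may prefer to download the whole normalised dataset rather than that cross-tab slice.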

> reconv <- read.csv("http://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions") 

> head(reconv) 

    GeographyCode DateCode Measurement        Units Value Gender Age 
1  S92000003  2003  Mean Average reconvictions per offender 0.62 All All 
2  S92000003  2004  Mean Average reconvictions per offender 0.33 All All 
3  S92000003  2004  Mean Average reconvictions per offender 0.61 All All 
4  S92000003  2005  Mean Average reconvictions per offender 0.60 All All 
5  S92000003  2006  Mean Average reconvictions per offender 0.60 All All 
6  S92000003  2007  Mean Average reconvictions per offender 0.11 All All 

This gives you all the metadata as factor levels (so you don't have to parse it):

> str(reconv) 

'data.frame': 10119 obs. of 7 variables: 
$ GeographyCode: Factor w/ 26 levels "S12000005","S12000006",..: 26 26 26 26 26 26 26 26 26 26 ... 
$ DateCode  : int 2003 2004 2004 2005 2006 2007 2007 2008 2008 2009 ... 
$ Measurement : Factor w/ 2 levels "Mean","Ratio": 1 1 1 1 1 1 1 1 1 1 ... 
$ Units  : Factor w/ 2 levels "Average reconvictions per offender",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ Value  : num 0.62 0.33 0.61 0.6 0.6 0.11 0.57 0.6 0.33 0.33 ... 
$ Gender  : Factor w/ 3 levels "All","Female",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ Age   : Factor w/ 6 levels "21-25","26-30",..: 4 4 4 4 4 4 4 4 4 4 ... 

You can select the slice you are interested in:

> slice <- subset(reconv, Measurement=="Ratio" & Gender=="All" & Age=="All") 

and get back to the original cross-tab slice, if you want:

> library(reshape2) 
> # first() is not defined by base R or reshape2, so take the first value explicitly
> dcast(slice, GeographyCode ~ DateCode, value.var="Value", fun.aggregate = function(x) x[1]) 

    GeographyCode 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
1  S12000005 41.4 34.3 41.0 40.7 37.4 37.2 33.3 34.6 35.8 33.0 32.8 
2  S12000006 34.9 36.0 31.9 34.2 31.1 28.7 27.9 29.6 27.5 26.8 27.0 
3  S12000008 33.7 33.2 33.7 33.2 31.7 32.8 30.4 31.5 29.1 28.1 28.7 
4  S12000010 26.7 24.5 25.7 26.9 26.7 27.8 29.3 25.1 22.4 29.0 28.2 
5  S12000013 31.7 26.1 30.6 35.4 31.6 25.9 24.0 18.9 30.5 22.8 18.6 
... 
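
If reshape2 is not available, the same long-to-wide step can be sketched in base R with tapply(). This is only an illustration on made-up toy data: the data frame `long` and its values are assumptions that mimic the column names of `reconv` above, not the real download.

```r
# Toy long-format data mimicking the columns of `reconv` (values are made up).
long <- data.frame(
  GeographyCode = c("S1", "S1", "S2", "S2"),
  DateCode      = c(2003, 2004, 2003, 2004),
  Value         = c(41.4, 34.3, 34.9, 36.0),
  stringsAsFactors = FALSE
)

# tapply() builds a GeographyCode x DateCode matrix. Taking the first value
# per cell mirrors the "first value wins" aggregation used with dcast().
wide <- with(long, tapply(Value, list(GeographyCode, DateCode),
                          function(x) x[1]))
print(wide)
```

The result is a matrix rather than a data frame, with the geography codes as row names and the years as column names.
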
1

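You need to specify col.names manually to force read.csv to read more columns. Also specifying na.strings as an empty string will keep NA values in the empty columns.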

read.csv(<parameters>, col.names=c("Col1","Col2".....), na.strings="") 

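Thanks for your interest, but I am trying to avoid that. I need this information because it contains the indicator names and some other data that I will be using. If I skip it, I will have to read the file **twice**: once to get the first 9 rows with the relevant metadata, and then again to get the actual data. I would like to avoid that. I would like to have one big table, with NAs placed in the empty columns, and then reference the values I need, **including** what is in the first column. – Konrad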

@Konrad See if this change helps –

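'more columns than column names': the problem is that I won't know the size of the file before importing it. An alternative could be to read it via 'readLines' and then derive the table from the lines with the data, and the other values from the first few lines. Ideally, I would rather have one table with NAs, so I can do 'indicatorName <- x[7,2]' or pick anything else I may need from it. – Konrad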

0

You can specify the number of columns with read.table by supplying the column names:

read.table(file = link, 
      fill = TRUE, 
      sep = ",", 
      na.strings = "", 
      col.names = paste("c", 1:12, sep = "")) 

However, I don't know whether this is a good solution, because you need to know the number of columns a priori.
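
One way around knowing the width in advance might be to count the fields per line first. According to the read.table documentation, the number of columns is determined from the first five lines of input unless col.names is longer, which would explain why only two columns come back here. The sketch below assumes a shortened inline copy of the sample file (`csv_text`) in place of the real download; the names `n_cols` and `x` are made up for illustration.

```r
# Inline stand-in for the downloaded file (shortened copy of the sample above).
csv_text <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'measure type,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  '',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)

# Count the fields on every line. comment.char = "" matters because one of
# the URLs contains '#', which would otherwise start a comment.
con <- textConnection(csv_text)
field_counts <- count.fields(con, sep = ",", quote = '"', comment.char = "")
close(con)

# The widest line decides how many columns to ask for.
n_cols <- max(field_counts, na.rm = TRUE)

# Supplying that many col.names makes read.table pad the short rows.
x <- read.table(text = csv_text, sep = ",", quote = '"', fill = TRUE,
                header = FALSE, na.strings = "", comment.char = "",
                stringsAsFactors = FALSE,
                col.names = paste0("V", seq_len(n_cols)))
```

With a real file, the same two calls could take the file path or URL instead of the text connection, at the cost of reading the file twice.
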

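Another approach is to read the whole csv as strings. You can then preprocess it by storing the header in another object (e.g. a list), and use the "table part" as a data frame.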
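
A minimal sketch of that second idea, assuming the blank line always separates the metadata from the table and that the table starts with a row of year URIs followed by the real header row (both are assumptions based on the sample file only). `raw_lines` stands in for the result of readLines(link), and `metadata` and `tab` are names made up for illustration.

```r
# Inline stand-in for readLines(link); a shortened copy of the sample file.
raw_lines <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'measure type,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  '',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)

# The first blank line separates the metadata header from the table.
blank <- which(!nzchar(trimws(raw_lines)))[1]
header_lines <- raw_lines[seq_len(blank - 1)]
table_lines  <- raw_lines[(blank + 1):length(raw_lines)]

# Header: two-column key/value pairs kept as a named list.
hdr <- read.csv(text = header_lines, header = FALSE, stringsAsFactors = FALSE)
metadata <- setNames(as.list(hdr$V2), trimws(hdr$V1))

# Table: drop the row of year URIs and use the next row as column names.
tab <- read.csv(text = table_lines[-1], header = TRUE, check.names = FALSE,
                stringsAsFactors = FALSE)

# Values such as "9,351" carry thousands separators; strip them, then convert.
tab[-(1:2)] <- lapply(tab[-(1:2)],
                      function(v) as.numeric(gsub(",", "", v)))
```

The indicator name is then available as metadata[["Measure (cell values):"]], and the file is read only once.
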

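Thanks, this is a start. I was hoping to read everything in one go somehow, so that I could subset the 'data.frame' and pick what I want. I process a list of these files in a loop, so I could split each one further into two objects, a header and a table, but I thought this could be avoided. – Konrad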