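Reading a strangely formatted CSV file
I am looking to download some data from the statistics.gov.scot website. For example, I would like to source some data on hospital admission rates. The query for the data table I am interested in has the following format: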
http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall
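, and it is accessible via this link for those who would like to try it. The query generates a *.CSV file containing the relevant information, but the file's format presents some challenges.
Example file
The file's content looks like this: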
Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"
,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"
When imported into Excel, the file keeps all of its columns. However, when imported into R via read.csv, it looks like this:
> head(problematicFile)
V1 V2
1 Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions Hospital Admissions
3 measure type
4 Admission Type
5 Age
6 Gender
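Problem
The read.csv import returns only two columns. I am guessing that the problem is related to some of the initial columns being empty. I would like to read this file in a manner similar to the import shown in Excel. The point is that I intend to use the rows in columns A and B and, naturally, the data table below them. As for the resulting data.frame, I am happy for it to contain NA values where there are empty cells, as long as its dimensions match those in Excel. I tried: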
read.csv(file = link, header = FALSE, na.strings = "",
fill = TRUE)
But I keep arriving at the same problem.
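One possible explanation, going by the read.table documentation, is that the number of columns is determined from the first five lines of input (or from col.names if that is longer), and here the first five lines have only two fields each. A minimal sketch of a workaround, assuming the widest row can be found with count.fields and that generic names V1, V2, ... are acceptable, is shown below. The inline sample and the temporary file are hypothetical stand-ins for the downloaded file and are abridged to two year columns.

```r
# Abridged stand-in for the downloaded file (two year columns only).
csv_text <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'measure type,""',
  'Admission Type,""',
  'Age,""',
  'Gender,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# Count the fields on every line and keep the maximum.
# comment.char = "" matters because one row contains a '#'.
n_cols <- max(count.fields(tmp, sep = ",", quote = "\"", comment.char = ""),
              na.rm = TRUE)

# Supplying that many column names makes read.csv allocate all columns;
# fill = TRUE pads the short metadata rows.
x <- read.csv(tmp, header = FALSE, fill = TRUE, na.strings = "",
              col.names = paste0("V", seq_len(n_cols)),
              stringsAsFactors = FALSE)

dim(x)
indicatorName <- x[7, 2]
```

This scans the file twice internally (once to count fields, once to parse), but the result is a single table that can be indexed by position, which is what x[7, 2] relies on. For a remote file it may be simpler to download it to a temporary file first so that both passes read the same content.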
Desired results
The desired results should look like this (hand-produced extract):
Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00 NA NA NA NA NA NA NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA NA NA NA NA NA NA
measure type NA NA NA NA NA NA NA NA
Admission Type NA NA NA NA NA NA NA NA
Age NA NA NA NA NA NA NA NA
Gender NA NA NA NA NA NA NA NA
Measure (cell values): Ratio (Rate Per 100,000 Population) NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA http://reference.data.gov.uk/id/year/2002 http://reference.data.gov.uk/id/year/2003 http://reference.data.gov.uk/id/year/2004 http://reference.data.gov.uk/id/year/2005 http://reference.data.gov.uk/id/year/2006 http://reference.data.gov.uk/id/year/2007 http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area 2002 2003 2004 2005 2006 2007 2008
http://statistics.gov.scot/id/statistical-geography/S92000003 Scotland 9,351 9,262 9,261 9,347 9,723 10,517 10,293
http://statistics.gov.scot/id/statistical-geography/S16000082 Angus South 8,236 8,500 8,523 8,371 8,616 8,978 9,325
http://statistics.gov.scot/id/statistical-geography/S16000106 Edinburgh Northern and Leith 9,040 8,040 7,925 9,042 10,355 11,833 8,916
http://statistics.gov.scot/id/statistical-geography/S16000140 Renfrewshire South 9,391 9,122 9,491 9,586 10,425 10,900 11,065
http://statistics.gov.scot/id/statistical-geography/S16000108 Edinburgh Southern 5,878 5,910 6,101 6,035 7,426 9,343 6,766
http://statistics.gov.scot/id/statistical-geography/S16000075 Aberdeen Donside 10,047 10,963 10,629 10,512 10,383 10,787 10,685
http://statistics.gov.scot/id/statistical-geography/S16000137 Perthshire North 9,388 9,524 7,799 9,350 9,543 9,791 9,991
http://statistics.gov.scot/id/statistical-geography/S16000077 Aberdeenshire East 7,211 7,300 7,153 7,411 7,435 7,268 7,547
http://statistics.gov.scot/id/statistical-geography/S16000114 Galloway and West Dumfries 9,861 9,165 8,143 9,258 7,508 10,213 10,399
http://statistics.gov.scot/id/statistical-geography/S16000096 Dumbarton 8,703 8,570 8,727 9,310 9,389 9,885 10,237
Screenshot
Just to illustrate the point further, I would like to maintain the dimensions and fill the missing values with NA:
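Thanks for showing interest, but I am trying to avoid this. I need this information because it contains the indicator name and some other data that I will be using. If I skip those rows, I will have to read the file **twice**: once to get the first 9 rows with the relevant metadata, and then again to get the actual data. I want to avoid that. I would like to have one big table with NAs placed in the empty columns, and then reference the values I need, **including** what is in the first columns. – Konrad
@Konrad see if this change helps –
The problem with 'more columns than column names' is that I won't know the size of the file before importing it. An alternative approach could be to read it via 'readLines' and then derive the table from the rows with data and the other values from the first few rows. Ideally, I would prefer to have one table with NAs, so that I can do 'indicatorName <- x[7,2]' or select anything else I may need from it. – Konrad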
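The readLines alternative mentioned in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the vector raw is a hypothetical stand-in for readLines(link), the data table is assumed to start at the row whose second field is "Reference Area", and the row of year URIs is assumed to sit directly above it. The lines are obtained once and then parsed in two pieces, so it does not produce the single NA-padded table asked for above.

```r
# Hypothetical stand-in for: raw <- readLines(link)
raw <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'measure type,""',
  'Admission Type,""',
  'Age,""',
  'Gender,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)

# Assumption: the table header is the first line containing ",Reference Area,".
header_row <- grep(",Reference Area,", raw, fixed = TRUE)[1]

# Metadata: everything above the year-URI row, as key/value pairs.
meta <- read.csv(text = raw[seq_len(header_row - 2)], header = FALSE,
                 col.names = c("key", "value"), stringsAsFactors = FALSE)
indicatorName <- meta$value[grepl("^Measure", meta$key)]

# Data table: header row onwards; check.names = FALSE keeps "2002" etc. as names.
dat <- read.csv(text = raw[header_row:length(raw)], header = TRUE,
                check.names = FALSE, stringsAsFactors = FALSE)

# The values use a comma as thousands separator, so strip it before converting.
year_cols <- names(dat)[-(1:2)]
dat[year_cols] <- lapply(dat[year_cols],
                         function(v) as.numeric(gsub(",", "", v)))
```

Looking the indicator name up by its key rather than by position (row 7) is a design choice that should tolerate a different number of metadata rows, at the cost of assuming the key always begins with "Measure".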