2016-11-26 42 views
1

I am new to U-SQL and am trying to write a U-SQL script that searches for a string, then groups by that string and gets the count of distinct files

str1 = \global\europe\Moscow\12345\File1.txt

str2 = \global.bee.com\europe\Moscow\12345\File1.txt

str3 = \global\europe\amsterdam\54321\File1.Rvt

str4 = \global.bee.com\europe\amsterdam\12345\File1.Rvt

Case 1: From str1 and str2 I want to extract "\europe\Moscow\12345\File1.txt", then group by "\global\europe\Moscow\12345" and take the count of distinct files under the path "\europe\Moscow\12345\". How can I do that?

So the output would look like this:

distinct_filesby_Location_Date
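
To make Case 1 concrete, the intended computation can be sketched in plain Python (not U-SQL, so it can be run locally). This is only an illustration under assumptions that are not in the question: the marker folder is `europe`, comparison is case-insensitive, and the helper names `split_location` and `distinct_files_by_location` are hypothetical.

```python
from collections import defaultdict

# The four sample paths from the question; raw strings keep backslashes literal.
paths = [
    r"\global\europe\Moscow\12345\File1.txt",
    r"\global.bee.com\europe\Moscow\12345\File1.txt",
    r"\global\europe\amsterdam\54321\File1.Rvt",
    r"\global.bee.com\europe\amsterdam\12345\File1.Rvt",
]


def split_location(path, marker="europe"):
    """Drop everything before `marker`; return (folder, file name), lower-cased."""
    parts = [p.lower() for p in path.split("\\") if p]
    start = parts.index(marker)  # raises ValueError if the marker is missing
    folder = "\\" + "\\".join(parts[start:-1])
    return folder, parts[-1]


def distinct_files_by_location(all_paths):
    """Group by folder and count the distinct file names in each group."""
    files = defaultdict(set)
    for path in all_paths:
        folder, name = split_location(path)
        files[folder].add(name)
    return {folder: len(names) for folder, names in files.items()}


counts = distinct_files_by_location(paths)
for folder, n in counts.items():
    print(folder, n)
```

Because str1 and str2 differ only in the host part, they collapse into one folder with one distinct file.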

To solve the case above I tried the following U-SQL code, but I am not sure whether I am writing the script correctly:

@inArray = SELECT new SQL.ARRAY<string>(
       filepath.Contains("\\europe")) AS path 
    FROM @t; 

@filesbyloc = 
    SELECT [ID], 
     path.Trim() AS path1 
    FROM @inArray 
    CROSS APPLY 
    EXPLODE(path1) AS r(location); 

OUTPUT @filesbyloc 
TO "/Outputs/distinctfilesbylocation.tsv" 
USING Outputters.Tsv(); 

Any help would be greatly appreciated.

Answers

1

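One way to do this is to put all the strings you want to process in a file, e.g. strings.txt, and save it in your U-SQL input folder. Also create a file with the cities you want to match, e.g. cities.txt. Then try the following U-SQL script: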

@input = 
    EXTRACT filepath string 
    FROM "/input/strings.txt" 
    USING Extractors.Tsv(); 

// Give the strings a row-number 
@input = 
    SELECT ROW_NUMBER() OVER() AS rn, 
      filepath 
    FROM @input; 


// Get the cities 
@cities = 
    EXTRACT city string 
    FROM "/input/cities.txt" 
    USING Extractors.Tsv(); 

// Ensure there is a lower-case version of city for matching/joining 
@cities = 
    SELECT city, 
      city.ToLower() AS lowercase_city 
    FROM @cities; 


// Split the filepath into an array of path elements 
@working = 
    SELECT rn, 
      new SQL.ARRAY<string>(filepath.Split('\\')) AS pathElement 
    FROM @input AS i; 

// Explode the filepath string, also changing to lower case 
@working = 
    SELECT rn, 
      x.pathElement.ToLower() AS pathElement 
    FROM @working AS i 
     CROSS APPLY 
      EXPLODE(pathElement) AS x(pathElement); 


// Create the output query, joining on lower case city name, display, normal case name 
@output = 
    SELECT c.city, 
      COUNT(*) AS records 
    FROM @working AS w 
     INNER JOIN 
      @cities AS c 
     ON w.pathElement == c.lowercase_city 
    GROUP BY c.city; 


// Output the result 
OUTPUT @output TO "/output/output.txt" 
USING Outputters.Tsv(); 

//OUTPUT @working TO "/output/output2.txt" 
//USING Outputters.Tsv(); 

My results:

My output file results

HTH
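
The split / explode / join steps of the script above can be checked locally with a small Python sketch. It is not U-SQL and not part of the original answer; the function name `count_paths_per_city` and the two-entry city list are assumptions made for illustration.

```python
from collections import Counter


def count_paths_per_city(filepaths, cities):
    """Count path elements that match a known city, ignoring case.

    Mirrors the script's logic: split each path on backslashes, lower-case
    every element, join against the lower-cased city list, and count rows
    per city using the city's original spelling.
    """
    by_lower = {city.lower(): city for city in cities}
    counts = Counter()
    for path in filepaths:
        for element in path.split("\\"):
            city = by_lower.get(element.lower())
            if city is not None:
                counts[city] += 1
    return dict(counts)


filepaths = [
    r"\global\europe\Moscow\12345\File1.txt",
    r"\global.bee.com\europe\Moscow\12345\File1.txt",
    r"\global\europe\amsterdam\54321\File1.Rvt",
    r"\global.bee.com\europe\amsterdam\12345\File1.Rvt",
]
cities = ["Moscow", "Amsterdam"]  # stand-in for cities.txt

result = count_paths_per_city(filepaths, cities)
print(result)
```

Note that this counts rows, not distinct files, just as `COUNT(*)` does in the script.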

+1

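Thank you very much wBob, you really made my job easy; I was just googling to find some way to do this. Bob, one more thing: if you have looked at my output link, you will have seen the two fields "Location" and "Date", meaning the count of files by location and date. How can that be added to the solution you provided above? Please advise. Once again, thank you very much for replying to my post so quickly :-) –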

+0

Great, you should consider marking it as the answer! – wBob

+0

Where is the date? It is not clear from your sample data. Is it in the file name, or do you need to gather it from the file itself? – wBob

1

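Taking your free-form input file as a TSV file, and without knowing all the column semantics, here is one way to write your query. Note that I made the assumptions stated in the comments.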

@d = 
    EXTRACT path string, 
      user string, 
      num1 int, 
      num2 int, 
      start_date string, 
      end_date string, 
      flag string, 
      year int, 
      s string, 
      another_date string 
    FROM @"\users\temp\citypaths.txt" 
    USING Extractors.Tsv(encoding: Encoding.Unicode); 

// I assume that you have only one DateTime format culture in your file. 
// If it becomes dependent on the region or city as expressed in the path, you need to add a lookup. 
@d = 
SELECT new SqlArray<string>(path.Split('\\')) AS steps, 
     DateTime.Parse(end_date, new CultureInfo("fr-FR", false)).Date.ToString("yyyy-MM-dd") AS end_date 
FROM @d; 

// This assumes your paths have a fixed formatting/mapping into the city 
@d = 
SELECT steps[4].ToLowerInvariant() AS city, 
     end_date 
FROM @d; 

@res = 
SELECT city, 
     end_date, 
     COUNT(*) AS count 
FROM @d 
GROUP BY city, 
     end_date; 

OUTPUT @res 
TO "/output/result.csv" 
USING Outputters.Csv(); 

// Now let's pivot the date and count. 
@res2 = 
SELECT city, MAP_AGG(end_date, count) AS date_count 
FROM @res 
GROUP BY city; 

// This assumes you know exactly which dates you are looking for. Otherwise keep it in the first file representation. 
@res2 = 
SELECT city, 
     date_count["2016-11-21"] AS [2016-11-21], 
     date_count["2016-11-22"] AS [2016-11-22] 
FROM @res2; 

OUTPUT @res2 
TO "/output/res2.csv" 
USING Outputters.Csv(); 
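
As a local sanity check of the grouping step, here is a Python sketch of counting rows per city and date. It is an illustration only: the sample rows are made up, the date format `dd/MM/yyyy` is an assumption about what the fr-FR culture parses, and the paths are assumed to start with two backslashes, which is what makes index 4 (as in `steps[4]`) land on the city.

```python
from collections import Counter
from datetime import datetime

CITY_INDEX = 4  # same fixed position as steps[4] in the script


def count_by_city_and_date(rows):
    """Count rows per (city, end_date); rows are (path, end_date) pairs."""
    counts = Counter()
    for path, end_date in rows:
        steps = path.split("\\")
        city = steps[CITY_INDEX].lower()
        # Assumed input format: day/month/year, normalised to yyyy-MM-dd.
        day = datetime.strptime(end_date, "%d/%m/%Y").date().isoformat()
        counts[(city, day)] += 1
    return dict(counts)


# Made-up sample rows, for illustration only.
rows = [
    (r"\\global\europe\Moscow\12345\File1.txt", "21/11/2016"),
    (r"\\global.bee.com\europe\Moscow\12345\File2.txt", "21/11/2016"),
    (r"\\global\europe\amsterdam\54321\File1.Rvt", "22/11/2016"),
]

res = count_by_city_and_date(rows)
print(res)
```

If the paths had only one leading backslash, index 4 would point at the numeric folder instead of the city, so the fixed position is worth checking against real data.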

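Update after receiving some sample data by private email:

Based on the data you sent me, you want to pivot the rowset city, count, date into the rowset date, city1, city2, ... where each row contains the date and the counts for each city. (The extraction and counting of the cities can be done either with the join outlined in Bob's answer, where you need to know your cities in advance, or by taking the string at the city position in the path as in my example, where you do not need to know the cities in advance.)

You can easily adjust my example above by changing the @res2 computation as follows: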

// Now let's pivot the city and count. 
@res2 = SELECT end_date, MAP_AGG(city, count) AS city_count 
     FROM @res 
     GROUP BY end_date; 

// This assumes you know exactly which cities you are looking for. Otherwise keep it in the first file representation or use script generation (see below). 
@res2 = 
SELECT end_date, 
     city_count["istanbul"]AS istanbul, 
     city_count["midlands"]AS midlands, 
     city_count["belfast"] AS belfast, 
     city_count["acoustics"] AS acoustics, 
     city_count["amsterdam"] AS amsterdam 
FROM @res2; 
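
The pivot itself (aggregate into a map per date, then look up one key per output column) can be sketched in Python as follows. The sample counts and the function name `pivot_by_date` are made up for illustration; a city with no rows for a date is shown as `None` here.

```python
def pivot_by_date(rows, cities):
    """Turn (city, end_date, count) rows into one row per date.

    Step 1 builds a city -> count map per date (the role MAP_AGG plays).
    Step 2 looks up each requested city in that map, one column per city.
    """
    city_count = {}
    for city, end_date, count in rows:
        city_count.setdefault(end_date, {})[city] = count
    return [
        {"end_date": end_date, **{city: mapping.get(city) for city in cities}}
        for end_date, mapping in sorted(city_count.items())
    ]


# Made-up counts, for illustration only.
rows = [
    ("istanbul", "2016-11-21", 3),
    ("belfast", "2016-11-21", 1),
    ("istanbul", "2016-11-22", 2),
]

table = pivot_by_date(rows, ["istanbul", "belfast"])
for row in table:
    print(row)
```

The list of cities has to be supplied up front, which is the same limitation the pivot statement has.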

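Note that in my example you need to enumerate all the cities held in the SQL.MAP column in the pivot statement. If they are not known in advance, you first have to submit a script that creates the script for you. For example, assuming your city, count, date rowset is in a file (or you could duplicate the statements that produce the rowset in both the generating script and the generated script), you could write it as the following script. Then submit its result as the actual processing script.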

// Get the rowset (could also be the actual calculation from the original file) 
@in = EXTRACT city string, count int?, date string 
     FROM "/users/temp/Revit_Last2Months_Results.tsv" 
     USING Extractors.Tsv(); 

// Generate the statements for the preparation of the data before the pivot 
@stmts = SELECT * FROM (VALUES 
        ("@s1", "EXTRACT city string, count int?, date string FROM \"/users/temp/Revit_Last2Months_Results.tsv\" USING Extractors.Tsv();"), 
        ("@s2", "SELECT date, MAP_AGG(city, count) AS city_count FROM @s1 GROUP BY date;") 
       ) AS T(stmt_name, stmt); 

// Now generate the statement doing the pivot 
@cities = SELECT DISTINCT city FROM @in; 

@pivots = 
SELECT "@s3" AS stmt_name, "SELECT date, "+String.Join(", ", ARRAY_AGG("city_count[\""+city+"\"] AS ["+city+"]"))+ " FROM @s2;" AS stmt 
FROM @cities; 

// Now generate the OUTPUT statement after the pivot. Note that the OUTPUT does not have a statement name. 
@output = 
SELECT "OUTPUT @s3 TO \"/output/pivot_gen.tsv\" USING Outputters.Tsv();" AS stmt 
FROM (VALUES(1)) AS T(x); 

// Now put the statements into one rowset. Note that nulls sort high in U-SQL, so the OUTPUT statement comes last. 
@result = 
SELECT stmt_name, "=" AS assign, stmt FROM @stmts 
UNION ALL SELECT stmt_name, "=" AS assign, stmt FROM @pivots 
UNION ALL SELECT (string) null AS stmt_name, (string) null AS assign, stmt FROM @output; 

// Now output the statements in order of the stmt_name 
OUTPUT @result 
TO "/pivot.usql" 
ORDER BY stmt_name 
USING Outputters.Text(delimiter:' ', quoting:false); 

Now download it and submit it.
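
To see what the generated pivot statement should look like before submitting anything, the string-building step can be reproduced in Python. This is a sketch only: `build_pivot_statement` is a hypothetical helper, and the two city names are taken from the earlier example.

```python
def build_pivot_statement(cities):
    """Build the text of the @s3 statement, one map lookup per city."""
    columns = ", ".join(f'city_count["{city}"] AS [{city}]' for city in cities)
    return f"@s3 = SELECT date, {columns} FROM @s2;"


stmt = build_pivot_statement(["belfast", "istanbul"])
print(stmt)
```

Comparing this text with the corresponding line in the downloaded pivot.usql is a quick way to confirm the generation step produced what you expect.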

+0

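Hi Michael, thanks for your comment. I tried applying the code you suggested above. It may be a solution, but it does not give me the expected result for my requirement. If you can share your email ID, I can send you the details, since this place is very limited for sharing details. –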

+0

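You can contact me via usql at Microsoft. One thing I would recommend is to look at the code and identify the differences between my assumptions and your scenario, to work out where you may need to change the sample. –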
