Handling data quality issues in a medium-sized CSV report: best practices



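I need help with a best-practices question.

I have an Azure Function that pulls data from several APIs and matches the records to build a final CSV report. The population is 60k-100k rows with 30 columns.

To keep the explanation simple, I'll use a small school example.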

public class Student {
    public string Grade { get; set; }
    public Name LegName { get; set; }
    public string FatherName { get; set; }
    public string TeacherId { get; set; }
    public string SchoolId { get; set; }
}
public class Name {
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

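Before building the report, I create two Dictionary<Id, Name> instances from two APIs that expose school and teacher information. And, of course, a list of Student from the Student API. I have no control over these three APIs: not the design, not the data quality, nothing.

Once I have all the data, I start creating the report.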

string GenerateTXT(Dictionary<string, string> schools, Dictionary<string, string> teachers, IEnumerable<Student> students) {
    StringBuilder content = new StringBuilder();
    foreach (var student in students) {
        content.Append($"{student.Grade}\t");
        content.Append($"{student.LegName.FirstName}\t");
        content.Append($"{student.LegName.LastName}\t");
        // The indexer throws if the id is null or not in the dictionary
        content.Append($"{schools[student.SchoolId]}\t");
        content.Append($"{teachers[student.TeacherId]}\t");
        content.Append($"{student.FatherName}\t");
        content.AppendLine();
    }
    return content.ToString();
}

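Now here is the problem. I started noticing data quality issues, and the function started throwing exceptions: for example, students without a valid school or teacher, or students without a name. I tried to handle the scenarios I could anticipate and added exception handling.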

string GenerateTXT(Dictionary<string, string> schools, Dictionary<string, string> teachers, IEnumerable<Student> students) {
    StringBuilder content = new StringBuilder();
    foreach (var student in students) {
        try {
            content.Append($"{student.Grade}\t");
            content.Append($"{student.LegName.FirstName}\t");
            content.Append($"{student.LegName.LastName}\t");
            // Unknown ids produce an empty cell instead of an exception
            if (teachers.TryGetValue(student.TeacherId, out var teacher))
                content.Append($"{teacher}\t");
            else
                content.Append("\t");
            if (schools.TryGetValue(student.SchoolId, out var school))
                content.Append($"{school}\t");
            else
                content.Append("\t");
            content.Append($"{student.FatherName}\t");
            content.AppendLine();
        }
        catch (Exception ex) {
            // log is assumed to be a logger available in the enclosing scope
            log.Error($"Error reading student {student.LegName?.FirstName}: {ex.Message}");
        }
    }
    return content.ToString();
}

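The problem with this approach is that when an unexpected error occurs, I stop reading the remaining columns for that row and skip to the next student. So if a student has no name for some random reason, that row in the report will contain only the grade and nothing else, even though I actually have the rest of the values. I could put a try/catch around every column, but remember that my real scenario has about 30 columns, possibly more, so I think that would be a very poor solution. Is there a pattern that solves this better?

Thanks in advance.

The first piece of advice I'd give you is to use CsvHelper. It is a battle-tested library that handles all the edge cases you would never think of. With that said, try this: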

using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;
using CsvHelper.Configuration;

public class Student
{
    public string Grade { get; set; }
    public Name LegName { get; set; }
    public string FatherName { get; set; }
    public string TeacherId { get; set; }
    public string SchoolId { get; set; }
}
public class Name
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}
// Flat shape of one report row; property order determines column order
public class NormalizedData
{
    public string Grade { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string School { get; set; }
    public string Teacher { get; set; }
    public string FatherName { get; set; }
}
static void GenerateCSVData(CsvHelper.CsvWriter csv, Dictionary<string, string> schools,
    Dictionary<string, string> teachers, Student[] students)
{
    var normalizedData = students.Select(x => new NormalizedData
    {
        Grade = x.Grade,
        FatherName = x.FatherName,
        FirstName = x.LegName?.FirstName, // sanity check in case LegName is null
        LastName = x.LegName?.LastName, // ...
        School = schools.ContainsKey(x.SchoolId ?? string.Empty) ? schools[x.SchoolId] : null,
        Teacher = teachers.ContainsKey(x.TeacherId ?? string.Empty) ? teachers[x.TeacherId] : null
    });
    csv.WriteRecords(normalizedData);
}
private static string GenerateStringCSVData(Dictionary<string, string> schools,
    Dictionary<string, string> teachers, Student[] students)
{
    using (var ms = new MemoryStream())
    {
        using (var sr = new StreamWriter(ms, leaveOpen: true))
        using (var csv = new CsvHelper.CsvWriter(sr,
            new CsvConfiguration(CultureInfo.InvariantCulture)
            {
                Delimiter = ",", // change this to "\t" if you want to use tabs
                Encoding = Encoding.UTF8
            }))
        {
            GenerateCSVData(csv, schools, teachers, students);
        }
        // The writers are disposed (and flushed) here; the stream is still open
        ms.Position = 0;
        return Encoding.UTF8.GetString(ms.GetBuffer(), 0, (int)ms.Length);
    }
}
private static int Main(string[] args)
{
    var teachers = new Dictionary<string, string>
    {
        { "j123", "Jimmy Carter" },
        { "r334", "Ronald Reagan" },
        { "g477", "George Bush" }
    };
    var schools = new Dictionary<string, string>
    {
        { "s123", "Jimmy Carter University" },
        { "s334", "Ronald Reagan University" },
        { "s477", "George Bush University" }
    };
    var students = new Student[]
    {
        new Student
        {
            FatherName = "Bob Jimmy",
            SchoolId = "s477",
            Grade = "5",
            LegName = new Name { FirstName = "Apple", LastName = "Jimmy" },
            TeacherId = "r334"
        },
        new Student
        {
            FatherName = "Jim Bobby",
            SchoolId = null, // intentional
            Grade = "", // intentional
            LegName = null, // intentional
            TeacherId = "invalid id" // intentional
        },
        new Student
        {
            FatherName = "Mike Michael",
            SchoolId = "s123",
            Grade = "12",
            LegName = new Name { FirstName = "Peach", LastName = "Michael" },
            TeacherId = "g477"
        },
    };
    var stringData = GenerateStringCSVData(schools, teachers, students);
    return 0;
}

This outputs:

Grade,FirstName,LastName,School,Teacher,FatherName
5,Apple,Jimmy,George Bush University,Ronald Reagan,Bob Jimmy
,,,,,Jim Bobby
12,Peach,Michael,Jimmy Carter University,George Bush,Mike Michael

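As you can see, one of the students has invalid data, but the code recovers gracefully by writing blank cells instead of crashing or throwing an exception.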
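With around 30 columns, repeating the null-key check for every dictionary lookup gets noisy. A minimal sketch of one way to fold it into a helper, assuming a hypothetical extension method named LookupOrNull (it is not part of CsvHelper or of the code above):

```csharp
using System.Collections.Generic;

public static class LookupExtensions
{
    // Hypothetical helper: returns the mapped value, or null when the key is
    // null or not present. Dictionary.TryGetValue throws on a null key, so
    // the null check has to come first.
    public static string LookupOrNull(this Dictionary<string, string> map, string key)
    {
        if (key == null) return null;
        return map.TryGetValue(key, out var value) ? value : null;
    }
}
```

Under that assumption, a lookup column would read `School = schools.LookupOrNull(x.SchoolId)`, and CsvHelper writes the null as an empty cell just as before.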

I haven't seen your raw data, so you may need to adjust this further to cover all the edge cases, but with CsvHelper as your writer those adjustments will be much easier.
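If some column expressions can still throw for reasons that are hard to anticipate, one possible adjustment is to describe each column as a header plus a getter function and wrap the getter call in a single try/catch, rather than writing 30 separate ones. The sketch below is an illustration under that assumption; SafeColumns, Column<T>, and BuildRow are hypothetical names, not CsvHelper APIs.

```csharp
using System;
using System.Collections.Generic;

public static class SafeColumns
{
    // Hypothetical column definition: a header plus a function that extracts
    // the cell value from a record of type T.
    public sealed class Column<T>
    {
        public string Header { get; }
        public Func<T, string> Getter { get; }

        public Column(string header, Func<T, string> getter)
        {
            Header = header;
            Getter = getter;
        }
    }

    // Evaluates every column for one record. A getter that throws yields an
    // empty cell and is reported through onError; the remaining columns are
    // still evaluated, so one bad value does not cost the rest of the row.
    public static string[] BuildRow<T>(T record, IReadOnlyList<Column<T>> columns,
        Action<string, Exception> onError = null)
    {
        var cells = new string[columns.Count];
        for (var i = 0; i < columns.Count; i++)
        {
            try
            {
                cells[i] = columns[i].Getter(record) ?? string.Empty;
            }
            catch (Exception ex)
            {
                cells[i] = string.Empty;
                onError?.Invoke(columns[i].Header, ex);
            }
        }
        return cells;
    }
}
```

Each resulting cell could then be written with CsvHelper's `csv.WriteField(cell)`, followed by `csv.NextRecord()` at the end of the row. The onError callback is where the column header and the exception can be logged.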
