Need to choose an appropriate collection class for a specific task


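I am looking for a design solution to the following problem:

I have a large set of items that I need to compare with other sets of items to find intersections and exceptions (set differences). At the same time, an item's internal state may change at runtime, although that state does not affect the item's identity.

I would use something like HashSet<T> to run Except/Intersect operations and to add items quickly, but then I cannot update an item's state, because there is no operation for getting an element back out of the set.

I would use Dictionary<string, T> to add items quickly and to access them quickly in order to change their state, but no set comparison operations are provided for IDictionary.

With performance in mind, how would you solve this?

As I noted in a comment above, a given value will always have the same key, which means any IDictionary<string, T> holding it will have the same KeyValuePair<string, T> for it, so you can simply use the existing extension methods on the key/value pairs.

More importantly, you can also take advantage of the fact that each item is guaranteed to have a fixed key, which means you can do the set operations based on the key alone. This lets you replicate the ISet<T> methods quickly with something like the following: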

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

//Null-checks omitted for brevity:
public static class DictionaryAsSet
{
  //Note that some, but not all, of these methods allow one to use two dictionaries
  //with different types of value, as long as they've the same type of key.
  //They also assume that the same `IEqualityComparer<TKey>` is used, and will be
  //weird in results otherwise.
  public static void ExceptWithByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count != 0)
    {
      if(dictionary == (object)other)
        dictionary.Clear();
      else
        foreach(TKey key in other.Keys)
          dictionary.Remove(key);
    }
  }
  public static void IntersectWithByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count != 0 && dictionary != (object)other )
    {
      List<TKey> toRemove = new List<TKey>();
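      //Collect the keys of this dictionary that are absent from the other one;
      //removal is deferred because a dictionary can't be modified while enumerating it.
      foreach(TKey key in dictionary.Keys)
        if(!other.ContainsKey(key))
          toRemove.Add(key);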
      if(toRemove.Count == dictionary.Count)
        dictionary.Clear();
      else
        foreach(TKey key in toRemove)
          dictionary.Remove(key);
    }
  }
  public static bool IsSubsetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count == 0 || dictionary == (object)other)
      return true;
    if(dictionary.Count > other.Count)
      return false;
    foreach(TKey key in dictionary.Keys)
      if(!other.ContainsKey(key))
        return false;
    return true;
  }
  public static bool IsProperSubsetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    return dictionary.Count < other.Count && dictionary.IsSubsetOfByKey(other);
  }
  public static bool IsSupersetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    return other.IsSubsetOfByKey(dictionary);
  }
  public static bool IsProperSupersetOfByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    return other.IsProperSubsetOfByKey(dictionary);
  }
  public static bool OverlapsByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    //An empty set shares no keys with anything, so it can't overlap.
    if(dictionary.Count == 0 || other.Count == 0)
      return false;
    if(dictionary == (object)other)
      return true;
    foreach(TKey key in dictionary.Keys)
      if(other.ContainsKey(key))
        return true;
    return false;
  }
  public static bool SetEqualsByKey<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary == (object)other)
      return true;
    if(dictionary.Count != other.Count)
      return false;
    foreach(TKey key in dictionary.Keys)
      if(!other.ContainsKey(key))
        return false;
    return true;
  }
  public static void SymmetricExceptWithByKey<TKey, TValue>(this IDictionary<TKey, TValue> dictionary, IDictionary<TKey, TValue> other)
  {
    if(dictionary.Count == 0)
      dictionary.UnionWithByKey(other);
    else if(dictionary == other)
      dictionary.Clear();
    else
    {
      List<TKey> toRemove = new List<TKey>();
      List<KeyValuePair<TKey, TValue>> toAdd = new List<KeyValuePair<TKey, TValue>>();
      foreach(var kvp in other)
        if(dictionary.ContainsKey(kvp.Key))
          toRemove.Add(kvp.Key);
        else
          toAdd.Add(kvp);
      foreach(TKey key in toRemove)
        dictionary.Remove(key);
      foreach(var kvp in toAdd)
        dictionary.Add(kvp.Key, kvp.Value);
    }
  }
  public static void UnionWithByKey<TKey, TValue>(this IDictionary<TKey, TValue> dictionary, IDictionary<TKey, TValue> other)
  {
    foreach(var kvp in other)
      if(!dictionary.ContainsKey(kvp.Key))
        dictionary.Add(kvp.Key, kvp.Value);
  }
}

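Most of these should be comparable in efficiency to HashSet<T>, though there are a few optimizations that HashSet<T> can make through access to its own internals and that are not available here.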
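To make the point concrete, here is a minimal usage sketch. The `Job` type, its `Processed` flag and the `KeySetDemo` class are invented for illustration, and `ExceptWithByKey` is repeated in simplified form only so that the snippet compiles on its own. It shows the two things the question asks for together: O(1) access to an item to change its state, and a set difference driven by keys alone.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical item type: Id is the identity, Processed is mutable state.
public sealed class Job
{
    public Job(string id) { Id = id; }
    public string Id { get; }
    public bool Processed { get; set; }
}

public static class KeySetDemo
{
    // Same idea as ExceptWithByKey above, repeated so this sketch stands alone.
    public static void ExceptWithByKey<TKey, T1, T2>(
        this IDictionary<TKey, T1> dictionary, IDictionary<TKey, T2> other)
    {
        if (ReferenceEquals(dictionary, other)) { dictionary.Clear(); return; }
        foreach (TKey key in other.Keys)
            dictionary.Remove(key);
    }

    public static Dictionary<string, Job> Run()
    {
        var pending = new Dictionary<string, Job>
        {
            ["a"] = new Job("a"),
            ["b"] = new Job("b"),
            ["c"] = new Job("c"),
        };
        var done = new Dictionary<string, Job> { ["b"] = new Job("b") };

        pending["a"].Processed = true;   // look the item up by key and mutate its state
        pending.ExceptWithByKey(done);   // set difference based on keys only
        return pending;                  // keys "a" and "c" remain
    }
}
```

Note that the values in `done` are never inspected, which is why the two dictionaries could just as well hold different value types.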

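Alternatively, if you prefer the way the System.Linq.Enumerable extension methods work, you can create optimized versions of them for this particular scenario. For example: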

public static class DictionaryAsSetEnumerable
{
  //we could also return IEnumerable<KeyValuePair<TKey, TValue1>> if we wanted
  public static IEnumerable<TValue1> Except<TKey, TValue1, TValue2>(this IDictionary<TKey, TValue1> dictionary, IDictionary<TKey, TValue2> other)
  {
    if(dictionary.Count != 0 && dictionary != (object)other)
    {
       foreach(var kvp in dictionary)
         if(!other.ContainsKey(kvp.Key))
           yield return kvp.Value;
    }
  }
  //And so on. The approach for each here should be clear from those above 
}

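Comparing this with the implementation of Enumerable.Except() should show that this version is faster, because it can make some assumptions that Enumerable.Except cannot.

A final approach is to compose collection objects. Here we create a class to represent each operation. For example: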

public static class DictionarySetExtensions
{
  public static IDictionary<TKey, TValue> ExceptByKey<TKey, TValue>(this IDictionary<TKey, TValue> dictionary, IDictionary<TKey, TValue> other)
  {
    return new ExceptDictionary<TKey, TValue>(dictionary, other);
  }
  private class ExceptDictionary<TKey, TValue> : IDictionary<TKey, TValue>
  {
    private readonly IDictionary<TKey, TValue> _source;
    private readonly IDictionary<TKey, TValue> _exclude;
    public ExceptDictionary(IDictionary<TKey, TValue> source, IDictionary<TKey, TValue> exclude)
    {
      _source = source;
      _exclude = exclude;
    }
    public TValue this[TKey key]
    {
      get
      {
        if(_exclude.ContainsKey(key))
          throw new KeyNotFoundException();
        return _source[key];
      }
      //A non-readonly version is possible, but probably ill-advised. This sort of
      //approach creates surprises if you don't use immutable results.
      set { throw new InvalidOperationException("Read Only Dictionary"); }
    }
    ICollection<TKey> IDictionary<TKey, TValue>.Keys
    {
      get
      {
        //there are more efficient approaches by creating a wrapper
        //class on this again, but this shows the principle.
        return this.Select(kvp => kvp.Key).ToList();
      }
    }
    ICollection<TValue> IDictionary<TKey, TValue>.Values
    {
      get
      {
        return this.Select(kvp => kvp.Value).ToList();
      }
    }
    //Note that Count is O(n), not O(1) as usual with collections.
    public int Count
    {
      get
      {
        int tally = 0;
        using(var en = GetEnumerator())
          while(en.MoveNext())
            ++tally;
        return tally;
      }
    }
    bool ICollection<KeyValuePair<TKey, TValue>>.IsReadOnly
    {
      get { return true; }
    }
    public bool ContainsKey(TKey key)
    {
      return _source.ContainsKey(key) && !_exclude.ContainsKey(key);
    }
    void IDictionary<TKey, TValue>.Add(TKey key, TValue value)
    {
      throw new InvalidOperationException("Read only");
    }
    bool IDictionary<TKey, TValue>.Remove(TKey key)
    {
      throw new InvalidOperationException("Read only");
    }
    public bool TryGetValue(TKey key, out TValue value)
    {
      if(_exclude.ContainsKey(key))
      {
        value = default(TValue);
        return false;
      }
      return _source.TryGetValue(key, out value);
    }
    void ICollection<KeyValuePair<TKey, TValue>>.Add(KeyValuePair<TKey, TValue> item)
    {
      throw new InvalidOperationException("Read only");
    }
    void ICollection<KeyValuePair<TKey, TValue>>.Clear()
    {
      throw new InvalidOperationException("Read only");
    }
    public bool Contains(KeyValuePair<TKey, TValue> item)
    {
      TValue cmp;
      return TryGetValue(item.Key, out cmp) && Equals(cmp, item.Value);
    }
    public void CopyTo(KeyValuePair<TKey, TValue>[] array, int arrayIndex)
    {
      //Way lazy here for demonstration sake. This is the sort of use of ToList() I hate, but you'll get the idea.
      this.ToList().CopyTo(array, arrayIndex);
    }
    bool ICollection<KeyValuePair<TKey, TValue>>.Remove(KeyValuePair<TKey, TValue> item)
    {
      throw new InvalidOperationException("Read only");
    }
    public IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator()
    {
      foreach(var kvp in _source)
        if(!_exclude.ContainsKey(kvp.Key))
          yield return kvp;
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
      return GetEnumerator();
    }
  }
}

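With this approach, calling ExceptByKey returns a new object that behaves as if it contained the result of the Except set operation. Calling UnionByKey would return an instance of a different class that takes the same approach, and so on. Most of the boilerplate can be shared through an abstract base class: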

internal abstract class ReadOnlyDictionaryBase<TKey, TValue> : IDictionary<TKey, TValue>
{
  public TValue this[TKey key]
  {
    get
    {
      TValue value;
      if(!TryGetValue(key, out value))
        throw new KeyNotFoundException();
      return value;
    }
  }
  TValue IDictionary<TKey, TValue>.this[TKey key]
  {
    get { return this[key]; }
    //NotSupportedException, to match the other mutating members of this class.
    set { throw new NotSupportedException("Read only"); }
  }
  public ICollection<TKey> Keys
  {
    get { return this.Select(kvp => kvp.Key).ToList(); }
  }
  public ICollection<TValue> Values
  {
    get { return this.Select(kvp => kvp.Value).ToList(); }
  }
  public int Count
  {
    get
    {
      int tally = 0;
      using(var en = GetEnumerator())
        while(en.MoveNext())
          ++tally;
      return tally;
    }
  }
  bool ICollection<KeyValuePair<TKey, TValue>>.IsReadOnly
  {
    get { return true; }
  }
  public bool ContainsKey(TKey key)
  {
    TValue unused;
    return TryGetValue(key, out unused);
  }
  void IDictionary<TKey, TValue>.Add(TKey key, TValue value)
  {
    throw new NotSupportedException("Read only");
  }
  bool IDictionary<TKey, TValue>.Remove(TKey key)
  {
    throw new NotSupportedException("Read only");
  }
  public abstract bool TryGetValue(TKey key, out TValue value);
  void ICollection<KeyValuePair<TKey, TValue>>.Add(KeyValuePair<TKey, TValue> item)
  {
    throw new NotSupportedException("Read only");
  }
  void ICollection<KeyValuePair<TKey, TValue>>.Clear()
  {
    throw new NotSupportedException("Read only");
  }
  public bool Contains(KeyValuePair<TKey, TValue> item)
  {
    TValue value;
    //Compare the stored value with the value in the pair, not with the pair itself.
    return TryGetValue(item.Key, out value) && Equals(value, item.Value);
  }
  public void CopyTo(KeyValuePair<TKey, TValue>[] array, int arrayIndex)
  {
    this.ToList().CopyTo(array, arrayIndex);
  }
  bool ICollection<KeyValuePair<TKey, TValue>>.Remove(KeyValuePair<TKey, TValue> item)
  {
    throw new NotSupportedException("Read only");
  }
  public abstract IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator();
  IEnumerator IEnumerable.GetEnumerator()
  {
    return GetEnumerator();
  }
}

Then you only need to implement TryGetValue() and GetEnumerator() to implement a class, for example:

internal class UnionDictionary<TKey, TValue> : ReadOnlyDictionaryBase<TKey, TValue>
{
  private readonly IDictionary<TKey, TValue> _first;
  private readonly IDictionary<TKey, TValue> _second;
  public UnionDictionary(IDictionary<TKey, TValue> first, IDictionary<TKey, TValue> second)
  {
    _first = first;
    _second = second;
  }
  public override bool TryGetValue(TKey key, out TValue value)
  {
    return _first.TryGetValue(key, out value) || _second.TryGetValue(key, out value);
  }
  public override IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator()
  {
    foreach(var kvp in _first)
      yield return kvp;
    foreach(var kvp in _second)
      if(!_first.ContainsKey(kvp.Key))
        yield return kvp;
  }
}

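You may, however, want to make some members virtual and then override them with optimized versions. For example, with the UnionDictionary above we would benefit from: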

public override int Count
{
  get
  {
    int tally = _first.Count;//O(1) if _first has an O(1) Count
    foreach(var kvp in _second)
      if(!_first.ContainsKey(kvp.Key))
        ++tally;
    return tally;
  }
}

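What is interesting here is that the relative efficiency of the different tasks is quite unlike that of the other approaches: the result is returned in O(1) time, rather than in O(n) or O(n+m) as in most of the other cases. Most calls on the object are also O(1), though still slower than calls on the original dictionaries, while Count has gone from O(1) to O(n).

It is also worth noting that the more source objects are layered inside these objects, the less efficient they become. So if we take some small dictionaries and perform a lot of set-based operations on them, this approach quickly becomes much slower, because calls to the methods end up with more and more work to do. On the other hand, if we have large dictionaries and perform only a few set operations on them, this approach can be much faster, because it avoids a lot of allocation and iteration over sequences.

This approach has another interesting advantage and an interesting disadvantage.

The interesting advantage is that it can give good thread safety. Since all of these operations produce immutable objects from arguments that are not mutated either, you can have hundreds of threads working on shared dictionaries without any risk of mutation. Of course, someone mutating a source Dictionary would break things for all of those threads, but that can be avoided either by not mutating the sources after creation, or by enforcing it: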

public ExceptDictionary(IDictionary<TKey, TValue> source, IDictionary<TKey, TValue> exclude, IEqualityComparer<TKey> comparer)
{
  _source = source.IsReadOnly ? source : source.ToDictionary(kvp => kvp.Key, kvp => kvp.Value, comparer);
  _exclude = exclude.IsReadOnly ? exclude : exclude.ToDictionary(kvp => kvp.Key, kvp => kvp.Value, comparer);
}

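Unfortunately, this only works if we know which comparer is in use. A further advantage is that, if we know the source dictionaries cannot be mutated, we can memoize the more expensive calls. Count, for example, need only be O(n) the first time and can be O(1) on subsequent calls.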
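As a rough sketch of that memoization, and assuming the two source dictionaries really are never mutated after construction, the count of a union view could be cached with Lazy<int>. The class name `CachedUnionCount` is made up for this illustration and only the Count member is shown; a fuller version would fold this into UnionDictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down union view: only Count is shown.
// Assumes neither source dictionary is mutated after construction.
public sealed class CachedUnionCount<TKey, TValue>
{
    private readonly IDictionary<TKey, TValue> _first;
    private readonly IDictionary<TKey, TValue> _second;
    private readonly Lazy<int> _count;

    public CachedUnionCount(IDictionary<TKey, TValue> first, IDictionary<TKey, TValue> second)
    {
        _first = first;
        _second = second;
        // Lazy<T> runs the factory once; later reads return the stored value.
        _count = new Lazy<int>(ComputeCount);
    }

    // O(n) on the first call, O(1) afterwards.
    public int Count { get { return _count.Value; } }

    private int ComputeCount()
    {
        int tally = _first.Count;
        foreach (var kvp in _second)
            if (!_first.ContainsKey(kvp.Key))
                ++tally;
        return tally;
    }
}
```

If a source dictionary is changed after the first read, the cached value silently goes stale, so this only makes sense together with the copy-on-construction idea shown earlier.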

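(Conversely, though it is not thread-safe, the opposite can also be useful: mutate some of the source dictionaries as application state changes, and the objects representing the set operations update automatically.)

The interesting disadvantage is how much trouble this can cause for garbage collection. This general approach is usually very good where garbage collection is concerned, because the same collection can be reused in several places. That is not the case here, though, because we can be holding objects in memory purely to indicate that a key has no matching value, or objects duplicated between the two sources of a union, and so on. After a lot of operations, you can be using a large amount of memory for a structure that semantically contains only a handful of elements. You can deal with this by periodically dumping the contents into a Dictionary and letting the garbage be collected. How often to do that is a balancing act: do it too often and you miss the whole point of the approach, do it too rarely and you leave a lot of waste.

One approach is to add an internally visible Depth field to ReadOnlyDictionaryBase, which we set at construction: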

public static IDictionary<TKey, TValue> UnionByKey<TKey, TValue>(this IDictionary<TKey, TValue> first, IDictionary<TKey, TValue> second)
{
  var firstRO = first as ReadOnlyDictionaryBase<TKey, TValue>;
  var secondRO = second as ReadOnlyDictionaryBase<TKey, TValue>;
  int depth = (firstRO == null ? 1 : firstRO.Depth) + (secondRO == null ? 1 : secondRO.Depth);
  var result = new UnionDictionary<TKey, TValue>(first, second, depth);
  return depth > MAX_DEPTH ? result.DumpToDictionary() : result;
}

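"I have a large set of items that I need to compare with other sets of items to find intersections and exceptions (set differences). At the same time, an item's internal state may change at runtime, although that state does not affect the item's identity."

While you technically can mutate a key in a Dictionary, or an object held in a HashSet, as long as none of the data you change is used in the object's GetHashCode and Equals methods, it seems a very odd thing to do. I would discourage you from doing this and suggest splitting your items apart.

Why? Some years ago I built some framework-type code in which object equality was based on some, but not all, of an object's fields (much as you describe, with some properties making up the ID and others being just extra data), and it has caused quite a few bugs since then, because other developers kept being surprised and confused by it. The lesson I took from it is that C# developers in general seem to expect objects to have one of the following:

  • Reference equality only
  • "Deep" equality based on all fields

Because this was neither of those, people would change one of the "extra" fields and then wonder why their two objects were still equal even though the extra fields differed.

Suggestion for how to split

Put the key parts into an immutable class or struct, and have a second class that contains the mutable data. You should then be able to put all the key parts into a Dictionary and update the mutable data without causing problems (or confusion).

You will have to write your own Except/Intersect methods, but that shouldn't be too hard.

As an example, instead of this: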

public class Item {
    readonly int key1;
    readonly  string key2;
    string extra1;
    DateTime extra2;
    public override bool Equals(object other) {
        var otherItem = other as Item;
        if(otherItem == null)
            return false;
        // compare against the cast instance; `other` is typed as object
        return key1 == otherItem.key1 && key2 == otherItem.key2;
    } // and equivalent GetHashCode which only checks key1 and key2
}
var data = new HashSet<Item>(); ...

you could have this:

public class ItemKey {
    readonly int key1;
    readonly string key2;
    // implement equals, gethashcode, etc
}
public class ItemData {
    string extra1;
    DateTime extra2;
    // don't implement equals, just rely on reference equality here
}
var data = new Dictionary<ItemKey, ItemData>() ...

You can then perform HashSet-style operations such as Intersect based on the key alone, simply carrying the ItemData along as you do so.
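A fuller sketch of that split might look like the following. The constructor, the property names, the hash combination and the `ItemSets.IntersectByKey` helper are my own assumptions for illustration; the only point is that equality lives entirely in ItemKey, while ItemData stays freely mutable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Immutable key: equality and hashing use every field.
public sealed class ItemKey : IEquatable<ItemKey>
{
    private readonly int _key1;
    private readonly string _key2;

    public ItemKey(int key1, string key2) { _key1 = key1; _key2 = key2; }

    public bool Equals(ItemKey other)
    {
        return !ReferenceEquals(other, null) && _key1 == other._key1 && _key2 == other._key2;
    }

    public override bool Equals(object obj) { return Equals(obj as ItemKey); }

    public override int GetHashCode()
    {
        unchecked { return (_key1 * 397) ^ (_key2 == null ? 0 : _key2.GetHashCode()); }
    }
}

// Mutable payload: no Equals override, so reference equality applies.
public sealed class ItemData
{
    public string Extra1 { get; set; }
    public DateTime Extra2 { get; set; }
}

public static class ItemSets
{
    // Keeps the entries of `left` whose keys also occur in `right`.
    public static Dictionary<ItemKey, ItemData> IntersectByKey(
        IDictionary<ItemKey, ItemData> left, IDictionary<ItemKey, ItemData> right)
    {
        return left.Where(kvp => right.ContainsKey(kvp.Key))
                   .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
    }
}
```

The result shares its ItemData instances with `left`, so a state change made through the intersection is visible through the original dictionary as well.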

I suggest using a HashSet.

Except() and Intersect() for operations against another set.
Add() for adding a new element.
ToList() (extension method) for accessing each element in the set.
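For what it's worth, on newer runtimes (.NET Core 2.0 and later, .NET Framework 4.7.2 and later) HashSet<T> also has a TryGetValue method that returns the stored instance equal to a probe value, which avoids copying the set with ToList() just to reach one element. A small sketch, where the `Entry` type and the `EntryIdComparer` are hypothetical and identity is defined by Id only:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical item: Id is the identity, Hits is mutable state.
public sealed class Entry
{
    public Entry(string id) { Id = id; }
    public string Id { get; }
    public int Hits { get; set; }
}

// Compares entries by Id only, so state changes never affect set membership.
public sealed class EntryIdComparer : IEqualityComparer<Entry>
{
    public bool Equals(Entry x, Entry y)
    {
        return string.Equals(x?.Id, y?.Id, StringComparison.Ordinal);
    }
    public int GetHashCode(Entry e) { return e.Id.GetHashCode(); }
}

public static class HashSetDemo
{
    public static HashSet<Entry> Run()
    {
        var set = new HashSet<Entry>(new EntryIdComparer()) { new Entry("a"), new Entry("b") };
        var other = new HashSet<Entry>(new EntryIdComparer()) { new Entry("b") };

        // Fetch the stored instance via a throwaway probe with the same Id.
        Entry stored;
        if (set.TryGetValue(new Entry("a"), out stored))
            stored.Hits++;

        set.ExceptWith(other);   // in-place set difference
        return set;              // only the entry with Id "a" remains
    }
}
```

The cost is one throwaway probe object per lookup, and it carries the same caveat raised above about equality that covers only part of an object's fields.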
