Running out of RAM when converting a list to a one-hot encoded list on Google Colab



I am trying to implement Word2Vec with skip-gram from scratch and am stuck at creating the input layer.

class SkipGramBatcher:
  def __init__(self, text):
    self.text = text.results
  def get_batches(self, batch_size):
    n_batches = len(self.text)//batch_size
    pairs = []

    for idx in range(0, len(self.text)):
      window_size = 5
      idx_neighbors = self._get_neighbors(self.text, idx, window_size)
      idx_pairs = [(idx,idx_neighbor) for idx_neighbor in idx_neighbors]
      pairs.extend(idx_pairs)

    for idx in range(0, len(pairs), batch_size):
      X = [pair[0] for pair in pairs[idx:idx+batch_size]]
      Y = [pair[1] for pair in pairs[idx:idx+batch_size]]
      yield X,Y
  def _get_neighbors(self, text, idx, window_size):
    text_length = len(text)
    start = max(idx-window_size,0)
    end = min(idx+window_size+1,text_length)
    neighbors_words = set(text[start:end])
    return list(neighbors_words)

For testing purposes I limit vocab_size to 1000 words. When I try to test my SkipGramBatcher, I run out of free RAM and Colab restarts.

import numpy as np

def to_one_hot(indexes):
  n_values = np.max(indexes) + 1
  return np.eye(n_values)[indexes]

for x, y in skip_gram_batcher.get_batches(64):
  x_ohe = to_one_hot(x)
  y_ohe = to_one_hot(y)
  print(x_ohe.shape, y_ohe.shape)

I guess I am doing something wrong; any help is appreciated.

Google Colab message:

Mar 5, 2019, 4:47:33 PM WARNING WARNING:root:kernel fee9eac6-2adf-4c31-9187-77e8018e2eae restarted
Mar 5, 2019, 4:47:33 PM INFO    KernelRestarter: restarting kernel (1/5), keep random ports
Mar 5, 2019, 4:47:23 PM WARNING tcmalloc: large alloc 66653388800 bytes == 0x27b4c000 @ 0x7f4533736001 0x7f4527e29b85 0x7f4527e8cb43 0x7f4527e8ea86 0x7f4527f26868 0x5030d5 0x507641 0x504c28 0x502540 0x502f3d 0x506859 0x502209 0x502f3d 0x506859 0x504c28 0x511eca 0x502d6f 0x506859 0x504c28 0x502540 0x502f3d 0x506859 0x504c28 0x502540 0x502f3d 0x507641 0x504c28 0x501b2e 0x591461 0x59ebbe 0x507c17
Mar 5, 2019, 4:39:43 PM INFO    Adapting to protocol v5.1 for kernel fee9eac6-2adf-4c31-9187-77e8018e2eae

I think I got why Google Colab allocates up to 66GB to your program.

Since X is assigned batch_size elements:

X = [pair[0] for pair in pairs[idx:idx+batch_size]]

and then converted to a one-hot encoding:

  n_values = np.max(indexes) + 1
  return np.eye(n_values)[indexes]

X is assigned a matrix of dimension (64, 64): since the indexes in the first batch also come from (0:63), np.eye essentially returns a (64, 64) matrix.
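On a toy batch (values of my own, not from the question) the same indexing trick shows that the allocation scales with the largest index, not with the batch size:

```python
import numpy as np

indexes = [0, 2, 7]
n_values = np.max(indexes) + 1       # 8, driven by the largest index
one_hot = np.eye(n_values)[indexes]  # allocates an 8x8 identity first
print(one_hot.shape)                 # (3, 8)
```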

Caveat: this only works for X.

Now this process is repeated, say, n times. Each time x and y are (64, 64) matrices, and there is also the pairs variable, which is a big list as well, so memory keeps increasing. (Note also that the pairs hold positions in the whole text, so for a later batch np.max(indexes) + 1 approaches len(self.text) and np.eye allocates a correspondingly huge square matrix.)
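As a sanity check (my own back-of-the-envelope arithmetic, assuming NumPy's default float64), the 66 GB allocation reported by tcmalloc matches a single square np.eye matrix whose side is close to the length of the text:

```python
import numpy as np

# Hypothetical side length, close to what the tcmalloc log implies
n_values = 91_278  # i.e. np.max(indexes) + 1 for a late batch
bytes_needed = n_values * n_values * np.dtype(np.float64).itemsize
print(bytes_needed)  # about 66 GB, matching the "large alloc" in the log
```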

Hint: y is a list of strings, so np.max(y) cannot be done on it.
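A possible way around both issues (a sketch of my own, not the original code) is to map words to vocabulary indexes before batching and one-hot against a fixed vocab_size, so each batch allocates only batch_size x vocab_size regardless of the index values:

```python
import numpy as np

def to_one_hot(indexes, vocab_size):
  # Allocate batch_size x vocab_size, independent of the largest index value
  ohe = np.zeros((len(indexes), vocab_size), dtype=np.float32)
  ohe[np.arange(len(indexes)), indexes] = 1.0
  return ohe

batch = [3, 0, 2]                  # toy word-index batch
print(to_one_hot(batch, 5).shape)  # (3, 5)
```

With vocab_size limited to 1000 as in the question, a 64-element batch then costs only 64 x 1000 floats per call.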
