Frequency of (comma-separated) strings in Python



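I am trying to find the frequency of the strings in the "Select Investors" field on this website: https://www.cbinsights.com/research-unicorn-companies

Is there a way to extract the frequency of each comma-separated string?

For example, how many times does the term "Sequoia Capital China" appear?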

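The solution provided by @Mazhar checks whether a term is a substring of one of the comma-separated strings. As a result, the count it returns for 'Sequoia Capital' is the sum of the counts of every string that contains 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids this problem: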

import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
freqs = defaultdict(int)
for group in df['Select Investors']:
    # Skip non-string cells such as NaN
    if hasattr(group, 'lower'):
        for raw_investor in group.lower().split(','):
            investor = raw_investor.strip()
            # Ignore empty strings produced by wrong data like this:
            # 'B Capital Group,, GE Ventures, McKesson Ventures'
            if investor:
                freqs[investor] += 1

Demo

In [57]: freqs['sequoia capital']
Out[57]: 41
In [58]: freqs['sequoia capital china']
Out[58]: 46
In [59]: freqs['sequoia capital india']
Out[59]: 25
In [60]: freqs['sequoia capital israel']
Out[60]: 2
In [61]: freqs['and sequoia capital china']
Out[61]: 1

These counts sum to 115, which matches the frequency that the currently accepted solution returns for 'sequoia capital'.
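To make the difference between the two counting strategies concrete without hitting the website, here is a minimal sketch on made-up data. The `rows` list and the helper names `substring_count` and `exact_counts` are hypothetical and only stand in for the 'Select Investors' column and the two approaches discussed above.

```python
from collections import Counter

# Hypothetical sample rows standing in for the 'Select Investors' column.
rows = [
    "Sequoia Capital China, SIG Asia Investments, Sina Weibo",
    "Sequoia Capital, Founders Fund",
    "Sequoia Capital India, Softbank Group",
    "Founders Fund,, Sequoia Capital",
]


def substring_count(term, rows):
    """Count comma-separated entries that merely contain `term`."""
    term = term.lower()
    return sum(
        term in part.lower()
        for row in rows
        for part in row.split(",")
    )


def exact_counts(rows):
    """Count each comma-separated entry as a whole, ignoring case and blanks."""
    counts = Counter()
    for row in rows:
        for part in row.split(","):
            name = part.strip().lower()
            if name:  # skip the empty string produced by ',,'
                counts[name] += 1
    return counts


counts = exact_counts(rows)
print(substring_count("Sequoia Capital", rows))  # substring match: 4
print(counts["sequoia capital"])                 # exact match: 2
```

On this toy data the substring approach also counts 'Sequoia Capital China' and 'Sequoia Capital India', which is the same inflation described above.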

I did it this way, which is correct and more Pythonic:

import itertools
import collections
import pandas as pd

def fun(x):
    # Split one cell on commas and normalize each name
    x = map(lambda y: y.strip().lower(), str(x).split(','))
    # Drop empty strings and the 'nan' that str() yields for missing cells
    return filter(lambda y: y and y != 'nan', x)

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]

# Process
investor = first_df['Select Investors'].apply(fun)
investor = list(itertools.chain.from_iterable(investor))
# Organize
final_data = list(collections.Counter(investor).items())
final_df = pd.DataFrame(final_data, columns=['Investor', 'Frequency'])
final_df

Output:

      Investor                                   Frequency
0     Sequoia Capital China                      46
1     SIG Asia Investments                       3
2     Sina Weibo                                 2
3     Softbank Group                             9
4     Founders Fund                              16
...   ...                                        ...
1187  Motive Partners. Apollo Global Management  1
1188  JBV Capital                                1
1189  Array Ventures                             1
1190  AWZ Ventures                               1
1191  Endiya Partners                            1
