我有一个数据帧,我想把一些行的一部分转换成几行。实际上,这些行在Questions
列中表示一个问题,在Answer_i
列中表示这些问题的答案。例如以下行:
QID Questions QType Answer_1 Answer_2 Answer_3 Answer_4 Answer_5 Answer_6 Answer_7 Answer_8 Answer_9 Answer_10 Answer_11 Answer_12 Answer_13 Answer_14 Answer_15
1177 The travel restrictions of COVID-19 have been ... Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider Discounted flights Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider Baggage policy Very important consideration Important consideration Somewhat consider Not an important considera... Do not consider
我想为这一行获得以下数据帧:
QID Questions QType Answer_1 Answer_2 Answer_3 Answer_4 ...
1263 1177 The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight? Likert Scale
Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider
1264 1177_1 Discounted flights Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider
1265 1177_2 Baggage policy Likert Scale Very important consideration Important consideration Somewhat consider Not an important consideration Do not consider
到目前为止,我一直在尝试迭代答案:
for i, row in df.iterrows():
passed_items = []
for cell in row:
if cell in passed_items:
print("need to create a new line")
answers = {f"Answer{i}": passed_items[i] for i in range(0, len(passed_items))} # dyanmically allocate to place them in the right columns
dict_replacing = {'Questions': questions, **answers} # dictionary that will replace the forle create the new lines
df1 = pd.DataFrame(dict_replacing)
df = df1.combine_first(df)
passed_items = []
passed_items.append(str(cell))
但它给了我回报:
Answer_2 Answer_9 Answer_0 Answer_1 Answer_10 Answer_11 Answer_12 Answer_13 Answer_14 Answer_2 Answer_3 Answer_4 Answer_5 Answer_6 Answer_7 Answer_8 QID QType Questions
0 NaN NaN Very important consideration Important consideration NaN NaN NaN NaN NaN Somewhat consider Not an important consideration Do not consider Baggage policy Discounted flights NaN NaN NaN NaN The airline/company you fly with
1 NaN NaN Very important consideration Important consideration NaN NaN NaN NaN NaN Somewhat consider Not an important consideration Do not consider Baggage policy Discounted flights NaN NaN NaN NaN The departure airport
2 NaN NaN Very important consideration Important consideration NaN NaN NaN NaN NaN Somewhat consider Not an important consideration Do not consider Baggage policy Discounted flights NaN NaN NaN NaN Duration of flight/route
3 NaN NaN Very important consideration Important consideration NaN NaN NaN NaN NaN Somewhat consider Not an important consideration Do not consider Baggage policy Discounted flights NaN NaN NaN NaN Price
4 NaN NaN Very important consideration Important consideration NaN NaN NaN NaN NaN Somewhat consider Not an important consideration Do not consider Baggage policy Discounted flights NaN NaN NaN NaN Baggage policy
5 NaN NaN Very important consideration Important consideration NaN NaN NaN NaN NaN Somewhat consider Not an important consideration Do not consider Baggage policy Discounted flights NaN NaN NaN NaN Environmental impacts
1263 Important consideration Somewhat consider No... Do not consider NaN Very important consideration Baggage policy Very important consideration Important consideration Somewhat consider Not an important considera... Do not consider NaN Do not consider Discounted flights Very important consideration Important consideration Somewhat consider Not an important consideration 1177.0 Likert Scale The travel restrictions of COVID-19 have been ...
列的顺序没有得到尊重,有些列是双重的。
更新
我一直在努力理解罗伯·雷蒙德的回答。
我不明白:
- for循环:
for i in range(3, len(r)-len(repeat)):
我们是否对每一列进行迭代,直到数据帧的最后一列 - 分解函数CCD_ 4:
根据w3ressource:
explode((函数用于将列表中的每个元素转换为行,复制索引值。
是这个东西把r转换成我想要的列表吗?
- 如何创建新的数据报。我知道这与前面的答案有关
这是代码,并附上我的评论:
import collections
df = pd.DataFrame({"QID":[1177],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?"],"QType":["Likert Scale"],"Answer0":["Very important consideration"],"Answer1":["Important consideration"],"Answer2":["Somewhat consider"],"Answer3":["Not an important consideration"],"Answer4":["Do not consider"],"Answer5":["Discounted flights"],"Answer6":["Very important consideration"],"Answer7":["Important consideration"],"Answer8":["Somewhat consider"],"Answer9":["Not an important consideration"],"Answer10":["Do not consider"],"Answer11":["Baggage policy"],"Answer12":["Very important consideration"],"Answer13":["Important consideration"],"Answer14":["Somewhat consider"],"Answer15":["Not an important consideration"],"Answer16":["Do not consider"],"Answer17":["Price of flights"],"Answer18":["Very important consideration"],"Answer19":["Important consideration"],"Answer20":["Somewhat consider"],"Answer21":["Not an important consideration"],"Answer22":["Do not consider"],"Answer23":["Insurance"],"Answer24":["Very important consideration"],"Answer25":["Important consideration"],"Answer26":["Somewhat consider"],"Answer27":["Not an important consideration"],"Answer28":["Do not consider"],"Answer29":["Airport services"],"Answer30":["Very important consideration"],"Answer31":["Important consideration"],"Answer32":["Somewhat consider"],"Answer33":["Not an important consideration"],"Answer34":["Do not consider"],"Answer35":["Environmental impact"],"Answer36":["Very important consideration"],"Answer37":["Important consideration"],"Answer38":["Somewhat consider"],"Answer39":["Not an important consideration"],"Answer40":["Do not consider"],"Answer41":["In-flight service"],"Answer42":["Very important consideration"],"Answer43":["Important consideration"],"Answer44":["Somewhat consider"],"Answer45":["Not an important consideration"],"Answer46":["Do not consider"],"Answer47":["Customer support"],"Answer48":["Very important consideration"],"Answer49":["Important consideration"],"Answer50":["Somewhat consider"],"Answer51":["Not an important consideration"],"Answer52":["Do not consider"],"Answer53":["Overcrowding on aircraft/airports"],"Answer54":["Very important consideration"],"Answer55":["Important consideration"],"Answer56":["Somewhat consider"],"Answer57":["Not an important consideration"],"Answer58":["Do not consider"],"Answer59":["Airport safety after COVID-19"],"Answer60":["Very important consideration"],"Answer61":["Important consideration"],"Answer62":["Somewhat consider"],"Answer63":["Not an important consideration"],"Answer64":["Do not consider"],"Answer65":["Refund policy"]})
def getquestions(r):
# counter
repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1}) # get all the questions
questions = []
firstfound = 0
#
for i in range(3, len(r)-len(repeat)): # I don't get this one
if r[i:i+len(repeat)].tolist()==repeat: # I think we are trying to get the subset that are repeat
if r[i+len(repeat):i+len(repeat)+1].values[0] is not None: # here we get the question
questions.append(r[i+len(repeat):i+len(repeat)+1].values[0]) # we store it
if firstfound==0: firstfound = i+len(repeat) # so when it's not 0, we do not update? Why? Why is this thing for?
if len(questions) > 0: #weird cases ?
# somethong odd, sometimes it's a list other times a str
newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
r[1] = newq
# reset all the questions that have been used by list
for i in range(firstfound, len(r)):
if isinstance(r[i], str): r[i] = None
return r
def fixqid(c):
return [id if i==0 or c[i-1]!=id else f"{id}_{i}" for i, id in enumerate(c)]
df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1) # what does explode stands for?
df["QID"] = fixqid(df["QID"].values)
df
复杂的转换
- 在核心构建问题作为嵌入问题的列表,然后
explode()
- 已使用
collections.Counter()
在Answer#列中键入唯一值,然后消除只出现一次的值 - 有了这个列表,可以在列之间移动它以找到匹配的位置。将下一列作为嵌入的问题,将其添加到列表中
- 嵌入问题后,将列表放回问题列。重置所有多余的答案
- 修复QID列的后期处理
这实际上在这个例子中找到了10个嵌入的问题。我注意到包括输出,因为它太宽了,无法合理格式化
import collections
df = pd.DataFrame({"QID":[1177,"1177R"],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?","How would you like to book your next holiday?"],"QType":["Likert Scale","Likert Scale"],"Answer0":["Very important consideration","Airline XYZ app"],"Answer1":["Important consideration","Airline XYZ website"],"Answer2":["Somewhat consider","Third party website"],"Answer3":["Not an important consideration","Third party app"],"Answer4":["Do not consider","Travel agent"],"Answer5":["Discounted flights","Call"],"Answer6":["Very important consideration",""],"Answer7":["Important consideration",""],"Answer8":["Somewhat consider",""],"Answer9":["Not an important consideration",""],"Answer10":["Do not consider",""],"Answer11":["Baggage policy",""],"Answer12":["Very important consideration",""],"Answer13":["Important consideration",""],"Answer14":["Somewhat consider",""],"Answer15":["Not an important consideration",""],"Answer16":["Do not consider",""],"Answer17":["Price of flights",""],"Answer18":["Very important consideration",""],"Answer19":["Important consideration",""],"Answer20":["Somewhat consider",""],"Answer21":["Not an important consideration",""],"Answer22":["Do not consider",""],"Answer23":["Insurance",""],"Answer24":["Very important consideration",""],"Answer25":["Important consideration",""],"Answer26":["Somewhat consider",""],"Answer27":["Not an important consideration",""],"Answer28":["Do not consider",""],"Answer29":["Airport services",""],"Answer30":["Very important consideration",""],"Answer31":["Important consideration",""],"Answer32":["Somewhat consider",""],"Answer33":["Not an important consideration",""],"Answer34":["Do not consider",""],"Answer35":["Environmental impact",""],"Answer36":["Very important consideration",""],"Answer37":["Important consideration",""],"Answer38":["Somewhat consider",""],"Answer39":["Not an important consideration",""],"Answer40":["Do not consider",""],"Answer41":["In-flight service",""],"Answer42":["Very important consideration",""],"Answer43":["Important consideration",""],"Answer44":["Somewhat consider",""],"Answer45":["Not an important consideration",""],"Answer46":["Do not consider",""],"Answer47":["Customer support",""],"Answer48":["Very important consideration",""],"Answer49":["Important consideration",""],"Answer50":["Somewhat consider",""],"Answer51":["Not an important consideration",""],"Answer52":["Do not consider",""],"Answer53":["Overcrowding on aircraft/airports",""],"Answer54":["Very important consideration",""],"Answer55":["Important consideration",""],"Answer56":["Somewhat consider",""],"Answer57":["Not an important consideration",""],"Answer58":["Do not consider",""],"Answer59":["Airport safety after COVID-19",""],"Answer60":["Very important consideration",""],"Answer61":["Important consideration",""],"Answer62":["Somewhat consider",""],"Answer63":["Not an important consideration",""],"Answer64":["Do not consider",""],"Answer65":["Refund policy",""]})
def getquestions(r):
repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1 and isinstance(k, str)})
if len(repeat)<3: return r
questions = []
firstfound = 0
for i in range(3, len(r)-len(repeat)):
if r[i:i+len(repeat)].tolist()==repeat:
if r[i+len(repeat):i+len(repeat)+1].values[0] is not None:
questions.append(r[i+len(repeat):i+len(repeat)+1].values[0])
if firstfound==0: firstfound = i+len(repeat)
if len(questions) > 0:
# somethong odd, sometimes it's a list other times a str
newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
r[1] = newq
# reset all the questions that have been used by list
for i in range(firstfound, len(r)):
if isinstance(r[i], str): r[i] = np.nan
return r
def fixqid(c):
qid = []
sub = 0
for i, id in enumerate(c):
if i==0 or c[i-1]!=id:
sub=0
qid.append(id)
else:
sub +=1
qid.append(f"{id}_{sub}")
return qid
df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1)
df["QID"] = fixqid(df["QID"].values)
df.iloc[:,:10]