如果列中存在循环,则将数据帧的一行转换为多行



我有一个数据帧,我想把一些行的一部分转换成几行。实际上,这些行在Questions列中表示一个问题,在Answer_i列中表示这些问题的答案。例如以下行:

QID     Questions   QType   Answer_1    Answer_2    Answer_3    Answer_4    Answer_5    Answer_6    Answer_7    Answer_8    Answer_9    Answer_10   Answer_11   Answer_12   Answer_13   Answer_14   Answer_15
1177    The travel restrictions of COVID-19 have been ...   Likert Scale    Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     Discounted flights  Very important consideration    Important consideration     Somewhat consider   Not an important consideration  Do not consider     Baggage policy  Very important consideration    Important consideration     Somewhat consider Not an important considera...     Do not consider

我想为这一行获得以下数据帧:

QID    Questions    QType    Answer_1     Answer_2    Answer_3    Answer_4 ...    
1263    1177    The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?    Likert Scale 
Very important consideration  Important consideration Somewhat consider   Not an important consideration  Do not consider 
1264    1177_1  Discounted flights  Likert Scale    Very important consideration    Important consideration Somewhat consider   Not an important consideration  Do not consider 
1265    1177_2  Baggage policy    Likert Scale  Very important consideration    Important consideration Somewhat consider   Not an important consideration  Do not consider

到目前为止,我一直在尝试迭代答案:

for i, row in df.iterrows():
passed_items = []
for cell in row:
if cell in passed_items:
print("need to create a new line")
answers = {f"Answer{i}": passed_items[i] for i in range(0, len(passed_items))} # dyanmically allocate to place them in the right columns
dict_replacing = {'Questions': questions, **answers} # dictionary that will replace the forle create the new lines
df1 = pd.DataFrame(dict_replacing)
df = df1.combine_first(df)
passed_items = []
passed_items.append(str(cell))

但它给了我回报:

Answer_2    Answer_9    Answer_0    Answer_1    Answer_10   Answer_11   Answer_12   Answer_13   Answer_14   Answer_2    Answer_3    Answer_4    Answer_5    Answer_6    Answer_7    Answer_8    QID     QType   Questions
0   NaN     NaN     Very important consideration    Important consideration     NaN     NaN     NaN     NaN     NaN     Somewhat consider   Not an important consideration  Do not consider     Baggage policy  Discounted flights  NaN     NaN     NaN     NaN     The airline/company you fly with
1   NaN     NaN     Very important consideration    Important consideration     NaN     NaN     NaN     NaN     NaN     Somewhat consider   Not an important consideration  Do not consider     Baggage policy  Discounted flights  NaN     NaN     NaN     NaN     The departure airport
2   NaN     NaN     Very important consideration    Important consideration     NaN     NaN     NaN     NaN     NaN     Somewhat consider   Not an important consideration  Do not consider     Baggage policy  Discounted flights  NaN     NaN     NaN     NaN     Duration of flight/route
3   NaN     NaN     Very important consideration    Important consideration     NaN     NaN     NaN     NaN     NaN     Somewhat consider   Not an important consideration  Do not consider     Baggage policy  Discounted flights  NaN     NaN     NaN     NaN     Price
4   NaN     NaN     Very important consideration    Important consideration     NaN     NaN     NaN     NaN     NaN     Somewhat consider   Not an important consideration  Do not consider     Baggage policy  Discounted flights  NaN     NaN     NaN     NaN     Baggage policy
5   NaN     NaN     Very important consideration    Important consideration     NaN     NaN     NaN     NaN     NaN     Somewhat consider   Not an important consideration  Do not consider     Baggage policy  Discounted flights  NaN     NaN     NaN     NaN     Environmental impacts
1263    Important consideration Somewhat consider No...     Do not consider     NaN     Very important consideration    Baggage policy  Very important consideration    Important consideration     Somewhat consider Not an important considera...     Do not consider     NaN     Do not consider     Discounted flights  Very important consideration    Important consideration     Somewhat consider   Not an important consideration  1177.0  Likert Scale    The travel restrictions of COVID-19 have been ...

列的顺序没有得到尊重,有些列是双重的。

更新

我一直在努力理解罗伯·雷蒙德的回答。

我不明白:

  • for循环:for i in range(3, len(r)-len(repeat)):我们是否对每一列进行迭代,直到数据帧的最后一列
  • 分解函数CCD_ 4:

根据w3ressource:

explode((函数用于将列表中的每个元素转换为行,复制索引值。

是这个东西把r转换成我想要的列表吗?

  • 如何创建新的数据报。我知道这与前面的答案有关

这是代码,并附上我的评论:

import collections
df = pd.DataFrame({"QID":[1177],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?"],"QType":["Likert Scale"],"Answer0":["Very important consideration"],"Answer1":["Important consideration"],"Answer2":["Somewhat consider"],"Answer3":["Not an important consideration"],"Answer4":["Do not consider"],"Answer5":["Discounted flights"],"Answer6":["Very important consideration"],"Answer7":["Important consideration"],"Answer8":["Somewhat consider"],"Answer9":["Not an important consideration"],"Answer10":["Do not consider"],"Answer11":["Baggage policy"],"Answer12":["Very important consideration"],"Answer13":["Important consideration"],"Answer14":["Somewhat consider"],"Answer15":["Not an important consideration"],"Answer16":["Do not consider"],"Answer17":["Price of flights"],"Answer18":["Very important consideration"],"Answer19":["Important consideration"],"Answer20":["Somewhat consider"],"Answer21":["Not an important consideration"],"Answer22":["Do not consider"],"Answer23":["Insurance"],"Answer24":["Very important consideration"],"Answer25":["Important consideration"],"Answer26":["Somewhat consider"],"Answer27":["Not an important consideration"],"Answer28":["Do not consider"],"Answer29":["Airport services"],"Answer30":["Very important consideration"],"Answer31":["Important consideration"],"Answer32":["Somewhat consider"],"Answer33":["Not an important consideration"],"Answer34":["Do not consider"],"Answer35":["Environmental impact"],"Answer36":["Very important consideration"],"Answer37":["Important consideration"],"Answer38":["Somewhat consider"],"Answer39":["Not an important consideration"],"Answer40":["Do not consider"],"Answer41":["In-flight service"],"Answer42":["Very important consideration"],"Answer43":["Important consideration"],"Answer44":["Somewhat consider"],"Answer45":["Not an important consideration"],"Answer46":["Do not consider"],"Answer47":["Customer support"],"Answer48":["Very important consideration"],"Answer49":["Important consideration"],"Answer50":["Somewhat consider"],"Answer51":["Not an important consideration"],"Answer52":["Do not consider"],"Answer53":["Overcrowding on aircraft/airports"],"Answer54":["Very important consideration"],"Answer55":["Important consideration"],"Answer56":["Somewhat consider"],"Answer57":["Not an important consideration"],"Answer58":["Do not consider"],"Answer59":["Airport safety after COVID-19"],"Answer60":["Very important consideration"],"Answer61":["Important consideration"],"Answer62":["Somewhat consider"],"Answer63":["Not an important consideration"],"Answer64":["Do not consider"],"Answer65":["Refund policy"]})
def getquestions(r):
# counter
repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1}) # get all the questions
questions = []
firstfound = 0
# 
for i in range(3, len(r)-len(repeat)): # I don't get this one
if r[i:i+len(repeat)].tolist()==repeat: # I think we are trying to get the subset that are repeat
if r[i+len(repeat):i+len(repeat)+1].values[0] is not None: # here we get the question
questions.append(r[i+len(repeat):i+len(repeat)+1].values[0]) # we store it
if firstfound==0: firstfound = i+len(repeat) # so when it's not 0, we do not update? Why? Why is this thing for?
if len(questions) > 0: #weird cases ?
# somethong odd, sometimes it's a list other times a str
newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
r[1] = newq
# reset all the questions that have been used by list
for i in range(firstfound, len(r)):
if isinstance(r[i], str): r[i] = None
return r
def fixqid(c):
return [id if i==0 or c[i-1]!=id else f"{id}_{i}" for i, id in enumerate(c)]
df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1) # what does explode stands for?
df["QID"] = fixqid(df["QID"].values)   
df

复杂的转换

  1. 在核心构建问题作为嵌入问题的列表,然后explode()
  2. 已使用collections.Counter()Answer#列中键入唯一值,然后消除只出现一次的值
  3. 有了这个列表,可以在列之间移动它以找到匹配的位置。将下一列作为嵌入的问题,将其添加到列表中
  4. 嵌入问题后,将列表放回问题列。重置所有多余的答案
  5. 修复QID列的后期处理

这实际上在这个例子中找到了10个嵌入的问题。我注意到包括输出,因为它太宽了,无法合理格式化

import collections
df = pd.DataFrame({"QID":[1177,"1177R"],"Questions":["The travel restrictions of COVID-19 have been lifted and you are looking to book a flight. To what extent are the following factors considerations in your choice of flight?","How would you like to book your next holiday?"],"QType":["Likert Scale","Likert Scale"],"Answer0":["Very important consideration","Airline XYZ app"],"Answer1":["Important consideration","Airline XYZ website"],"Answer2":["Somewhat consider","Third party website"],"Answer3":["Not an important consideration","Third party app"],"Answer4":["Do not consider","Travel agent"],"Answer5":["Discounted flights","Call"],"Answer6":["Very important consideration",""],"Answer7":["Important consideration",""],"Answer8":["Somewhat consider",""],"Answer9":["Not an important consideration",""],"Answer10":["Do not consider",""],"Answer11":["Baggage policy",""],"Answer12":["Very important consideration",""],"Answer13":["Important consideration",""],"Answer14":["Somewhat consider",""],"Answer15":["Not an important consideration",""],"Answer16":["Do not consider",""],"Answer17":["Price of flights",""],"Answer18":["Very important consideration",""],"Answer19":["Important consideration",""],"Answer20":["Somewhat consider",""],"Answer21":["Not an important consideration",""],"Answer22":["Do not consider",""],"Answer23":["Insurance",""],"Answer24":["Very important consideration",""],"Answer25":["Important consideration",""],"Answer26":["Somewhat consider",""],"Answer27":["Not an important consideration",""],"Answer28":["Do not consider",""],"Answer29":["Airport services",""],"Answer30":["Very important consideration",""],"Answer31":["Important consideration",""],"Answer32":["Somewhat consider",""],"Answer33":["Not an important consideration",""],"Answer34":["Do not consider",""],"Answer35":["Environmental impact",""],"Answer36":["Very important consideration",""],"Answer37":["Important consideration",""],"Answer38":["Somewhat consider",""],"Answer39":["Not an important consideration",""],"Answer40":["Do not consider",""],"Answer41":["In-flight service",""],"Answer42":["Very important consideration",""],"Answer43":["Important consideration",""],"Answer44":["Somewhat consider",""],"Answer45":["Not an important consideration",""],"Answer46":["Do not consider",""],"Answer47":["Customer support",""],"Answer48":["Very important consideration",""],"Answer49":["Important consideration",""],"Answer50":["Somewhat consider",""],"Answer51":["Not an important consideration",""],"Answer52":["Do not consider",""],"Answer53":["Overcrowding on aircraft/airports",""],"Answer54":["Very important consideration",""],"Answer55":["Important consideration",""],"Answer56":["Somewhat consider",""],"Answer57":["Not an important consideration",""],"Answer58":["Do not consider",""],"Answer59":["Airport safety after COVID-19",""],"Answer60":["Very important consideration",""],"Answer61":["Important consideration",""],"Answer62":["Somewhat consider",""],"Answer63":["Not an important consideration",""],"Answer64":["Do not consider",""],"Answer65":["Refund policy",""]})
def getquestions(r):
repeat = list({k:v for k,v in collections.Counter(r[3:].values).items() if v>1 and isinstance(k, str)})
if len(repeat)<3: return r
questions = []
firstfound = 0
for i in range(3, len(r)-len(repeat)):
if r[i:i+len(repeat)].tolist()==repeat:
if r[i+len(repeat):i+len(repeat)+1].values[0] is not None:
questions.append(r[i+len(repeat):i+len(repeat)+1].values[0])
if firstfound==0: firstfound = i+len(repeat)
if len(questions) > 0:
# somethong odd, sometimes it's a list other times a str
newq = r[1] + questions if isinstance(r[1], list) else [r[1]] + questions
r[1] = newq
# reset all the questions that have been used by list
for i in range(firstfound, len(r)):
if isinstance(r[i], str): r[i] = np.nan
return r
def fixqid(c):
qid = []
sub = 0
for i, id in enumerate(c):
if i==0 or c[i-1]!=id:
sub=0
qid.append(id)
else:
sub +=1
qid.append(f"{id}_{sub}")
return qid 
df = df.apply(lambda r: getquestions(r), axis=1).explode("Questions").reset_index().drop("index", 1)
df["QID"] = fixqid(df["QID"].values)   
df.iloc[:,:10]