Extract dates in different formats using regex and sort them - pandas

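I'm new to text mining, and I need to extract dates from a *.txt file and sort them. The dates appear inside sentences (one per line), and their formats may be any of the following: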

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010

If the day is missing, assume the 1st; if the month is missing, assume January.
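As an illustration of that defaulting rule for the purely numeric formats, here is a minimal sketch using only the standard library. The name parse_numeric_date and the pattern list are hypothetical, and reading two-digit years as 19xx is an assumption, not something the format list above states.

```python
import re
from datetime import date

# Most specific pattern first, so "6/2008" is not mistaken for a full date.
PATTERNS = [
    re.compile(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})'),
    re.compile(r'(?P<month>\d{1,2})[/-](?P<year>\d{4})'),
    re.compile(r'(?P<year>\d{4})'),
]

def parse_numeric_date(text):
    """Return a date for the first numeric date found in text, or None.

    Raises ValueError if the numbers do not form a valid calendar date.
    """
    for pattern in PATTERNS:
        m = pattern.search(text)
        if m is None:
            continue
        parts = m.groupdict()
        year = int(parts['year'])
        if year < 100:
            year += 1900                      # assumption: two-digit years are 19xx
        month = int(parts.get('month') or 1)  # missing month -> January
        day = int(parts.get('day') or 1)      # missing day -> the 1st
        return date(year, month, day)
    return None

print(parse_numeric_date('seen on 6/2008 for pain'))
```

The patterns are tried from most to least specific, so a month/year string only reaches the second pattern after the full month/day/year pattern has failed.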

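My idea is to extract all the dates and convert them to mm/dd/yyyy format. However, I'm unsure how to find and replace the patterns. This is what I have done so far: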

import re
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)
df2 = pd.DataFrame({'text': df})

def myfunc(x):
    # Normalise one numeric date string to mm/dd/yyyy
    if len(x) == 4 and x.isdigit():
        # Year only: assume January 1st
        x = '01/01/' + x
    else:
        # Treat '-' and '/' as the same separator
        terms = re.split('[/-]', x)
        # Two-digit years are assumed to be 19xx
        year = terms[-1] if len(terms[-1]) == 4 else '19' + terms[-1]
        if len(terms) == 2:
            # Month and year only: assume the 1st
            x = terms[0].zfill(2) + '/01/' + year
        else:
            x = terms[0].zfill(2) + '/' + terms[1].zfill(2) + '/' + year
    return x

# A callable replacement requires regex=True in recent pandas versions
df2['text'] = df2.text.str.replace(r'((?:\d+[/-])?\d+[/-]\d+|\d{4})', lambda m: myfunc(m.group(1)), regex=True)

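I have only done this for the numeric date formats, and I'm a bit confused about how to handle the alphanumeric dates.

I know the code is crude, but it's what I've got.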

I think this is one of the Coursera text mining assignments. You can use regex and str.extract to get the solution. With dates.txt, i.e.:

import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)

def date_sorter():
    # expand=False makes str.extract return a Series instead of a one-column DataFrame
    # Get the dates written with a month name
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Get the fully numeric dates
    two = df.str.extract(r'((?:\d{1,2})(?:(?:/|-)\d{1,2})(?:(?:/|-)\d{2,4}))', expand=False)
    # Get the dates with no day, i.e. only month and year, or only a year
    three = df.str.extract(r'((?:\d{1,2}(?:-|/))?\d{4})', expand=False)
    # Fill the NaNs in one from two, then from three; fix the month names that are
    # misspelled in the text file; then convert everything to datetime.
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber', 'December', regex=True).replace('Janaury', 'January', regex=True))
    return pd.Series(dates.sort_values())

date_sorter()

Output:

9     1971-04-10
84    1971-05-18
2     1971-07-08
53    1971-07-11
28    1971-09-12
474   1972-01-01
153   1972-01-13
13    1972-01-26
129   1972-05-06
98    1972-05-13
111   1972-06-10
225   1972-06-15
31    1972-07-20
171   1972-10-04
191   1972-11-30
486   1973-01-01
335   1973-02-01
415   1973-02-01
36    1973-02-14
405   1973-03-01
323   1973-03-01
422   1973-04-01
375   1973-06-01
380   1973-07-01
345   1973-10-01
57    1973-12-01
481   1974-01-01
436   1974-02-01
104   1974-02-24
299   1974-03-01

If you only want the index returned, use return pd.Series(dates.sort_values().index) instead.
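To see what the index variant gives you, here is a small sketch on three made-up, already-parsed dates (the sample values are hypothetical and not taken from dates.txt). sort_values keeps each value's original index label, so the sorted index is the list of line numbers in chronological order.

```python
import pandas as pd

# Hypothetical sample: three parsed dates, indexed by line number 0..2
dates = pd.Series(pd.to_datetime(['1972-01-13', '1971-04-10', '1971-09-12']))

by_date = dates.sort_values()     # values sorted, original index labels kept
order = pd.Series(by_date.index)  # line numbers in chronological order

print(order.tolist())
```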

Breakdown of the first regex

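#?: Non-capturing group
((?:\d{,2}\s)? # The two-digit group: at most two digits followed by a space. The `?` makes the whole group optional.
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* # The month abbreviation followed by any letters (`[a-z]`), any number of times (`*`).
(?:-|\.|\s|,) # Matches one of -, ., whitespace or ,
\s? # The `?` here applies only to the whitespace, i.e. the preceding token
\d{,2}[a-z]* # At most two digits with any number of letters after them (`*`), e.g. 1st, 13th, 22nd.
(?:-|,|\s)? # One of -, , or whitespace may occur once, or not at all because of the `?`
\s? # Whitespace may or may not occur (at most once); the `?` here applies only to the whitespace
\d{2,4}) # Matches two to four digits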
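The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace and allows # comments inside the regex. This is a sketch, not part of the answer's code: WORD_DATE and find_word_date are hypothetical names, and the strip() call is my addition, since the optional leading group can also swallow a single space before the month name.

```python
import re

WORD_DATE = re.compile(r"""
    (
      (?:\d{,2}\s)?     # optional day before the month, e.g. '20 '
      (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
      [a-z]*            # rest of the month name, e.g. 'ch' in 'March'
      (?:-|\.|\s|,)     # one separator: '-', '.', whitespace or ','
      \s?               # optional whitespace
      \d{,2}[a-z]*      # up to two digits plus an optional suffix, e.g. '20th'
      (?:-|,|\s)?       # optional separator
      \s?               # optional whitespace
      \d{2,4}           # two to four digits
    )
""", re.VERBOSE)

def find_word_date(text):
    """Return the first month-name date in text, or None if there is none."""
    m = WORD_DATE.search(text)
    # strip() drops a leading space captured by the optional first group
    return m.group(1).strip() if m else None

for sample in ['Mar-20-2009', '20 March, 2009', 'Mar 20th, 2009', 'Feb 2009']:
    print(sample, '->', find_word_date(sample))
```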

Hope it helps.
