
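I am working with a Stata dataset in which time periods are stored as strings in a rather odd way: the word "to" marks the time range, and the times use 12-hour clock markers, for example "20march2020 1 p.m. to 3 p.m.". What is the best way to parse and use this information, particularly with Stata's datetime functions? I have read through the datetime documentation; it is useful for specific times of day, but not particularly helpful when it comes to time ranges.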

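I am considering splitting each string into two, one for the start and one for the end of the range, e.g. "20march2020 1 p.m." and "20march2020 3 p.m.", but I am curious whether there is a more direct way to make these data workable. My main concern with this approach is how to advance the date automatically when an interval crosses midnight, e.g. "20march2020 11 p.m. to 1 a.m.". Any advice would be greatly appreciated.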


clear
input str28 times
"17may2020 1 p.m. to 10 p.m."
"17may2020 10 p.m. to 5 a.m."
"18may2020 5 a.m. to noon"
"18may2020 noon to 7 p.m."
"18may2020 7 p.m. to 1 a.m."
end
// Noon won't be recognized by clock(), so replace with 12 p.m.
replace times = subinstr(times, "noon", "12 p.m.", .)
// Split times in two variables
gen times_only = substr(times, 11, .)
split times_only , parse("to")
// Generate datetime variables
gen double datetime1 = clock(substr(times,1,10) + times_only1, "DMYh")
gen double datetime2 = clock(substr(times,1,10) + times_only2, "DMYh")
format datetime1 datetime2 %tc
// If datetime2 is before datetime1, add one day (86400000 milliseconds)
replace datetime2 = datetime2 + 86400000 if datetime2 < datetime1
// Drop auxiliary variables
drop times_only*
// Admire the results
list
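     +-----------------------------------------------------------------------+
     |                       times            datetime1            datetime2 |
     |-----------------------------------------------------------------------|
  1. | 17may2020 1 p.m. to 10 p.m.   17may2020 13:00:00   17may2020 22:00:00 |
  2. | 17may2020 10 p.m. to 5 a.m.   17may2020 22:00:00   18may2020 05:00:00 |
  3. | 18may2020 5 a.m. to 12 p.m.   18may2020 05:00:00   18may2020 12:00:00 |
  4. | 18may2020 12 p.m. to 7 p.m.   18may2020 12:00:00   18may2020 19:00:00 |
  5. |  18may2020 7 p.m. to 1 a.m.   18may2020 19:00:00   19may2020 01:00:00 |
     +-----------------------------------------------------------------------+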
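
Once the start and end are stored as datetimes, the length of each interval follows directly. Below is a minimal sketch, assuming the same string layout as above (a 9-character date, one space, then the two times separated by " to "). The variable names pos, start, stop and length_hours are made up for illustration. It locates the separator with strpos() instead of split, which is only an alternative, and it uses msofhours(24) in place of the literal 86400000.

```stata
clear
input str28 times
"17may2020 10 p.m. to 5 a.m."
"18may2020 noon to 7 p.m."
end

// clock() does not recognize "noon", so spell it out
replace times = subinstr(times, "noon", "12 p.m.", .)

// Position of the separator (hypothetical helper variable)
gen pos = strpos(times, " to ")

// Start: everything before " to "; end: the date prefix plus the text after " to "
gen double start = clock(substr(times, 1, pos - 1), "DMYh")
gen double stop  = clock(substr(times, 1, 10) + substr(times, pos + 4, .), "DMYh")

// An end before the start means the interval crossed midnight: add one day
replace stop = stop + msofhours(24) if stop < start

// Interval length in hours
gen double length_hours = hours(stop - start)

format start stop %tc
drop pos
list
```

Storing both datetimes as double matters here, because %tc values are milliseconds since 1960 and lose precision in the default float type.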
