你好,我想使用美国国家科学基金会数据集,但原始excel文件变量名称没有正确转置数据。有没有人有关于如何转置数据集,使其适合分析的数据格式的任何示例代码。这是原始的excel文件,这样你就可以理解问题了。
来自NSF网站的Excel原始文件
*编辑添加导入视图
记住Nick关于SO本质的恰当观点,我在这里确实有一些示例代码可能会有所帮助。不知道你所说的"调换"是什么意思我不能给出一个确切的答案,但是您可以根据您的目的调整下面的reshape
命令。
导入和查看
// Import Excel File defining TL and BR of table
import excel "nsb20197-tabs02-012.xlsx", cellrange(A6:K177) clear
list in 1/15
+----------------------------------------------------------------------------------------------------------------------------------------+
| A B C D E F G H I J K |
|----------------------------------------------------------------------------------------------------------------------------------------|
1. | Associate's level |
2. | American Indian or Alaska Native |
3. | All fields 6282 4131 2151 65.8 34.2 8935 5697 3238 63.8 36.2 |
4. | S&E 608 386 222 63.5 36.5 942 495 447 52.5 47.5 |
5. | Engineering 17 4 13 23.5 76.5 50 8 42 16 84 |
|----------------------------------------------------------------------------------------------------------------------------------------|
6. | Natural sciences 384 220 164 57.3 42.7 453 174 279 38.4 61.6 |
7. | Social and behavioral sciences 207 162 45 78.3 21.7 439 313 126 71.3 28.7 |
8. | Non-S&E 5674 3745 1929 66 34 7993 5202 2791 65.09999999999999 34.9 |
9. | Asian or Pacific Islander |
10. | All fields 27313 15522 11791 56.8 43.2 54809 30916 23893 56.4 43.6 |
|----------------------------------------------------------------------------------------------------------------------------------------|
11. | S&E 2649 1284 1365 48.5 51.5 7862 3492 4370 44.4 55.6 |
12. | Engineering 160 23 137 14.4 85.59999999999999 574 111 463 19.3 80.7 |
13. | Natural sciences 2010 939 1071 46.7 53.3 4419 1562 2857 35.3 64.7 |
14. | Social and behavioral sciences 479 322 157 67.2 32.8 2869 1819 1050 63.4 36.6 |
15. | Non-S&E 24664 14238 10426 57.7 42.3 46947 27424 19523 58.4 41.6 |
构造数据
// Create supercategories
* level = Column A if column A contains the word level (ignoring case)
gen level = word(A,1) if ustrregexm(A, "level", 1), before(A)
* demographic = Column A if next ob in column A contains the word all fields (ignoring case)
gen demographic = A if ustrregexm(A[_n+1], "all fields", 1), before(A)
* fill down demographic and level, and drop blank rows
foreach v of varlist level demographic {
replace `v' = `v'[_n-1] if missing(`v')
}
drop if mi(demographic) | demographic == A | regexm(A, level)
// rename variables
* rename A
rename A field
* rename count columns
local list "B C D E F"
local year = 2000
rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
local list "G H I J K"
local year = 2017
rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
* destring
destring *_2000 *_2017, replace
重塑为长
* reshape long
drop perc*
reshape long all_ male_ female_, i(level demographic field) j(year)
rename *_ degrees*
reshape long degrees, i(level demographic field year) j(gender) string
一个如何重塑wide by field的例子
* test reshape wide by field + ensure the variable name is less than 32 characters post reshape
replace field = lower(strtoname(field))
replace field = substr(field, 1, 32 - strlen("degrees") - 1)
reshape wide degrees, i(level demographic year gender) j(field) string