I have a one-dimensional array that I use to store a categorical feature of my dataset, for example:
Administration Oral ,Aged ,Area Under Curve ,Cholinergic Antagonists/adverse effects/*pharmacokinetics/therapeutic use ,Circadian Rhythm/physiology ,Cross-Over Studies ,Delayed-Action Preparations ,Dose-Response Relationship Drug ,Drug Administration Schedule ,Female ,Humans ,Mandelic Acids/adverse effects/blood/*pharmacokinetics/therapeutic use ,Metabolic Clearance Rate ,Middle Aged ,Urinary Incontinence/drug therapy ,Xerostomia/chemically induced ,
Adult ,Anti-Ulcer Agents/metabolism ,Antihypertensive Agents/metabolism ,Benzhydryl Compounds/administration & dosage/blood/*pharmacology ,Caffeine/*metabolism ,Central Nervous System Stimulants/metabolism ,Cresols/administration & dosage/blood/*pharmacology ,Cross-Over Studies ,Cytochromes/*pharmacology ,Debrisoquin/*metabolism ,Drug Interactions ,Humans ,Male ,Muscarinic Antagonists/pharmacology ,Omeprazole/*metabolism ,*Phenylpropanolamine ,Polymorphism Genetic ,Tolterodine Tartrate ,Urinary Bladder Diseases/drug therapy ,
...
...
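Each element of the array lists the categories that a data instance belongs to. I need to one-hot encode it so that I can use it as a feature for a training algorithm. I know this can be done with scikit-learn, but I am not sure how to implement it. (There are ~150 possible categories and about 1,000 data instances.)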
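I would suggest using the get_dummies method from pandas. The interface is much nicer, especially if you are already using pandas to store your data. The sklearn implementation is a bit more involved. If you do decide to go the sklearn route, you will need to use OneHotEncoder or LabelBinarizer. In older scikit-learn releases, OneHotEncoder required you to first convert your categories to integer values, which can be done with LabelEncoder; newer releases (0.20 and later) accept string categories directly, as does LabelBinarizer.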
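As a rough sketch of the pandas route: since each element in the question appears to hold several comma-separated categories, the string variant `Series.str.get_dummies` may be a better fit than plain `pd.get_dummies`, which treats each whole element as a single category. The two sample rows below are shortened from the question's data, and the names `records`, `cleaned`, and `one_hot` are only illustrative.

```python
import pandas as pd

# Two shortened sample rows in the same shape as the question's data:
# comma-separated categories with stray spaces and a trailing comma.
records = pd.Series([
    "Aged ,Cross-Over Studies ,Female ,Humans ,",
    "Adult ,Cross-Over Studies ,Humans ,Male ,",
])

# Normalize each row: split on commas, strip whitespace, drop empty pieces,
# then re-join with "|" (assumed not to occur inside any category name).
cleaned = records.apply(
    lambda row: "|".join(part.strip() for part in row.split(",") if part.strip())
)

# One indicator column per distinct category, one row per data instance.
one_hot = cleaned.str.get_dummies(sep="|")
print(one_hot)
```

If every instance belonged to exactly one category, `pd.get_dummies(records)` would be enough. The resulting DataFrame can be passed to an estimator directly or converted with `one_hot.to_numpy()`.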