Problem description
There are three data files containing information about 260+ people, including PD patients and controls. In PD research, it is believed that PD has several sub-types, but our data does not currently include sub-type labels. An interesting question is whether we can use machine learning or data analysis methods to identify those sub-types.
The methods we could apply include, but are not limited to, K-NN, K-means, kernel estimation, and other unsupervised learning methods. Since these are all standard approaches, you can easily find references or tutorials for them online.
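As a starting point, a K-means baseline could look like the sketch below. It assumes scikit-learn is available and uses synthetic data in place of the real files. The function name `cluster_kmeans`, the feature count of 50, and the choice to standardize features first are illustrative assumptions, not part of the problem statement.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_kmeans(X, n_clusters=4, seed=0):
    """Standardize the features, then assign each row to one of n_clusters groups."""
    X_scaled = StandardScaler().fit_transform(X)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(X_scaled)

# Synthetic stand-in for the real data (assumption): 260 people, 50 features,
# 4 hidden groups with well-separated centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(4, 50))
true_groups = rng.integers(0, 4, size=260)
X_demo = centers[true_groups] + rng.normal(size=(260, 50))

labels = cluster_kmeans(X_demo)  # one integer cluster label per person
```

On the real data, `X_demo` would be replaced by the numeric feature matrix built from the three files.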
One difficulty here is that the sample size is comparable to the dimension of the data, so directly applying traditional approaches may not work. You can try regularization methods such as l2, l1, or elastic net. If you prefer, you can also use non-convex regularization such as log, lq, SCAD, or MCP.
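To make the penalties concrete, here is a minimal sketch of several of them as functions of a coefficient vector. The function names, the elastic net mixing convention, and the default values `gamma=3.0` and `a=3.7` are assumptions chosen for illustration. How a penalty is attached to a particular clustering objective is left open.

```python
import numpy as np

def l1_penalty(beta, lam):
    """Lasso penalty: lam * sum |beta_j|."""
    return lam * np.sum(np.abs(beta))

def l2_penalty(beta, lam):
    """Ridge penalty: lam * sum beta_j^2."""
    return lam * np.sum(beta ** 2)

def elastic_net_penalty(beta, lam, alpha=0.5):
    """Convex mix of l1 and l2 (one common convention; others rescale the l2 term)."""
    return alpha * l1_penalty(beta, lam) + (1 - alpha) * l2_penalty(beta, lam)

def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax concave penalty: quadratic taper, constant beyond gamma * lam."""
    t = np.abs(beta)
    inner = lam * t - t ** 2 / (2 * gamma)
    outer = 0.5 * gamma * lam ** 2
    return np.sum(np.where(t <= gamma * lam, inner, outer))

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty: l1 near zero, quadratic transition, constant beyond a * lam."""
    t = np.abs(beta)
    small = lam * t
    middle = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    large = lam ** 2 * (a + 1) / 2
    return np.sum(np.where(t <= lam, small, np.where(t <= a * lam, middle, large)))
```

Unlike l1, the MCP and SCAD penalties level off for large coefficients, so strong signals are penalized less than they would be under the lasso.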
Another difficulty is that there are more than 2 types to cluster and the data are not balanced across groups. You may need to design a special loss function to handle this.
As for the data, you can use the study id to link records across the different files.
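Linking the files might look like the following sketch, assuming pandas. The column name `study_id`, the other column names, and the small in-memory tables are hypothetical stand-ins, since the actual file names and layouts are not given here.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the three data files.
clinical = pd.DataFrame({"study_id": [1, 2, 3], "age": [61, 70, 58]})
motor = pd.DataFrame({"study_id": [1, 2, 3], "tremor_score": [2.0, 0.5, 1.5]})
imaging = pd.DataFrame({"study_id": [1, 3], "scan_value": [0.8, 1.1]})

# An inner join keeps only the people who appear in all three tables.
merged = (
    clinical
    .merge(motor, on="study_id", how="inner")
    .merge(imaging, on="study_id", how="inner")
)
```

Using `how="outer"` instead would keep every person and leave missing values where a file has no record for them, which then needs to be handled before clustering.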
Finally, PD is believed to have 4 sub-types. So your model should have at least 4 groups.
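One possible way to respect the "at least 4 groups" requirement while allowing unequal group sizes is a Gaussian mixture model, whose mixing weights are estimated per group. The sketch below assumes scikit-learn, uses synthetic unbalanced data, and picks the number of groups between 4 and 8 by BIC. The function name `fit_mixture`, the upper bound of 8, and the diagonal covariance are illustrative choices, not requirements of the problem.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture(X, min_groups=4, max_groups=8, seed=0):
    """Fit mixtures with min_groups..max_groups components and keep the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in range(min_groups, max_groups + 1):
        model = GaussianMixture(
            n_components=k, covariance_type="diag", random_state=seed
        ).fit(X)
        bic = model.bic(X)
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model

# Synthetic stand-in (assumption): 260 people in 4 groups of unequal size.
rng = np.random.default_rng(0)
sizes = [120, 80, 40, 20]
centers = rng.normal(scale=6.0, size=(4, 5))
X_demo = np.vstack([c + rng.normal(size=(n, 5)) for c, n in zip(centers, sizes)])

model = fit_mixture(X_demo)
labels = model.predict(X_demo)   # hard group assignment per person
weights = model.weights_         # estimated share of each group
```

The estimated `weights` give a quick check on how unbalanced the recovered groups are.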