Solved – Doc2Vec for large documents

I have about 7,000,000 patents for which I would like to compute document similarity. Obviously, with a sample set that big it will take a long time to run. For now I am taking a small sample of about 5,600 patent documents, and I am preparing to use Doc2Vec to find similarity between different …

Solved – Linear regression multiple output in python

Say I have a predictor array x: (n, px) and a predicted array y: (n, py). What would be the best way in Python to calculate all (linear) regression coefficients from x to each dimension of y (1…py)? The output of the whole thing would be a matrix of shape (py, px) (for each output, px parameters). I could easily iterate …
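
One way to get every coefficient in a single call, sketched under the assumption that an ordinary least-squares fit with an intercept is what is wanted: `numpy.linalg.lstsq` accepts a two-dimensional right-hand side, so no loop over the columns of y is needed. The data below is synthetic, and the names `true_coef`, `coef`, and `intercept` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, px, py = 200, 3, 2

# Synthetic data (an assumption for illustration): y = x @ true_coef.T + noise
x = rng.normal(size=(n, px))
true_coef = rng.normal(size=(py, px))
y = x @ true_coef.T + 0.01 * rng.normal(size=(n, py))

# Prepend a column of ones so the first fitted row is the intercept.
X = np.column_stack([np.ones(n), x])

# lstsq solves for all py outputs at once; beta has shape (px + 1, py).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept = beta[0]   # shape (py,)
coef = beta[1:].T     # shape (py, px), one row of px parameters per output
print(coef.shape)
```

Each row of `coef` holds the px slopes for one output dimension, which matches the (py, px) layout described in the question.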

Solved – Using survival analysis with multiple events

Assume that I have a data set of people with the following features: age (numeric), education (factor), and the time an 'event' happened to them (it could happen multiple times in the recorded time of the person). The time is measured in months, and I want to predict when this event will likely happen again in a time domain (ex. …

Solved – Why does the model consistently perform worse in cross-validation

Okay, so I run this model manually and get around 80-90% accuracy: mlp = MLPClassifier(hidden_layer_sizes=(50, 50), activation="logistic", max_iter=500) mlp.out_activation_ = "logistic" mlp.fit(X_train, Y_train) predictions = mlp.predict(X_test) print(confusion_matrix(Y_test, predictions)) print(classification_report(Y_test, predictions)) Then I do some 10-fold cross-validation: print(cross_val_score(mlp, X_test, Y_test, scoring='accuracy', cv=10)) And I get accuracy stats something like the following for each fold: …
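
One detail of the snippet above can be checked in isolation. `cross_val_score` fits clones of the estimator, and `sklearn.base.clone` copies only the constructor parameters, so an attribute assigned afterwards, such as `out_activation_`, is not carried into the folds. The sketch below only demonstrates that behaviour; it is not a diagnosis of the accuracy gap, which may have other causes (for instance, the cross-validation here runs on X_test, Y_test rather than on the training data). The names `cloned`, `carried_over`, and `same_params` are illustrative.

```python
from sklearn.base import clone
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(50, 50), activation="logistic", max_iter=500)
mlp.out_activation_ = "logistic"  # assigned by hand, as in the question

# clone() builds a fresh, unfitted estimator from get_params() only.
cloned = clone(mlp)

# Was the hand-set attribute copied to the clone?
carried_over = hasattr(cloned, "out_activation_")
# Do the constructor parameters match?
same_params = cloned.get_params() == mlp.get_params()
print(carried_over, same_params)
```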

Solved – Array of samples from multivariate gaussian distribution Python

I am trying to build in Python the scatter plot in part 2 of Elements of Statistical Learning. First, 10 means m_k are generated from a bivariate Gaussian distribution N((1,0)^T, I) and labeled class BLUE. Similarly, 10 more are drawn from N((0,1)^T, I) and labeled class ORANGE. I draw one such mean from …
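
The first step described above, drawing 10 means per class, can be done in one call each rather than one draw at a time. This is a minimal sketch assuming NumPy's `Generator.multivariate_normal`; the seed and the names `blue_means` and `orange_means` are illustrative choices, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the draws are reproducible
identity = np.eye(2)            # covariance I

# size=10 returns an array of shape (10, 2): one row per sampled mean m_k.
blue_means = rng.multivariate_normal(mean=[1, 0], cov=identity, size=10)
orange_means = rng.multivariate_normal(mean=[0, 1], cov=identity, size=10)

print(blue_means.shape, orange_means.shape)
```

Each row is one two-dimensional mean, so `blue_means[:, 0]` and `blue_means[:, 1]` can be passed directly as the x and y coordinates of a scatter plot.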

Solved – Fixed Effects OLS Regression: Difference between Python linearmodels PanelOLS and Stata's xtreg, fe command

I'd like to perform a fixed effects panel regression with two IVs (x1 and x2) and one DV (y), using robust standard errors. In Python I used the following command: result = PanelOLS(data.y, sm2.add_constant(data[['x1', 'x2']]), entity_effects=True).fit(cov_type='robust') which results in: PanelOLS Estimation Summary. Dep. Variable: y; Estimator: PanelOLS; R-squared: 0.0008; R-squared (Between): -0.0212; No. Observations: …
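
For orientation, the point estimates of an entity fixed-effects ("within") regression can be reproduced by hand: subtract each entity's mean from y and from the regressors, then run OLS on the demeaned data. The sketch below uses synthetic data and plain NumPy. The helper `demean_by_group`, the panel dimensions, and `beta_true` are hypothetical, and the sketch does not reproduce the robust standard errors or the R-squared variants that either package reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 50, 6

# Synthetic balanced panel (an assumption for illustration).
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(size=n_entities)[entity]          # entity-specific intercepts
X = rng.normal(size=(entity.size, 2))                # columns play the role of x1, x2
beta_true = np.array([1.5, -0.5])
y = alpha + X @ beta_true + 0.1 * rng.normal(size=entity.size)

def demean_by_group(a, groups, n_groups):
    """Subtract the per-group mean from each row of a (1-D or 2-D)."""
    sums = np.zeros((n_groups,) + a.shape[1:])
    np.add.at(sums, groups, a)                       # accumulate rows per group
    counts = np.bincount(groups, minlength=n_groups)
    means = sums / counts.reshape((-1,) + (1,) * (a.ndim - 1))
    return a - means[groups]

# Within transformation: the entity intercepts drop out after demeaning.
X_within = demean_by_group(X, entity, n_entities)
y_within = demean_by_group(y, entity, n_entities)

beta_within, *_ = np.linalg.lstsq(X_within, y_within, rcond=None)
print(beta_within)
```

No constant is included after demeaning, because the within transformation removes any entity-level intercept along with the fixed effects.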