Ah, good luck! Curious to hear how it goes and happy to answer any questions I can!
-
-
Replying to @BagelDaughter
Hey, quick question, what is going on here in this whole if statement? Isn't returning col.astype('category') enough? I can see it being something with the types of the categories, but then how does len(cats) help you decide anything? https://github.com/spitz-dan-l/postgres-binary-parser/blob/64f54da0c6e6821467a217bfc3b0726f10b12173/postgres_binary_parser/schema.py#L382
2 replies 0 retweets 0 likes -
Replying to @makmanalp
The intention was: turn it categorical, and force the categories to be strings. reduce_categories was written before remove_unused_categories was added to pandas, so it was intended to make the string conversion as efficient as possible. Probably not relevant to your use case
1 reply 0 retweets 1 like -
Replying to @BagelDaughter
OK I figured it was something like that but I thought .astype('category') did all that, but I'm realizing now that maybe it didn't use to
1 reply 0 retweets 0 likes -
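For anyone following along, here's a minimal sketch of the conversion being described, assuming a pandas Series named col (the variable name and sample data are illustrative, not from the repo):

    import pandas as pd

    col = pd.Series(["a", "b", "a", 3, 3])

    # Force the values to strings first, then make the column categorical,
    # so the categories themselves end up as strings.
    cat_col = col.astype(str).astype("category")

    # Newer pandas can then drop any categories that no longer appear:
    cat_col = cat_col.cat.remove_unused_categories()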
-
-
Replying to @BagelDaughter
Thank you! Preliminary results: to_csv + COPY: approx 15 sec/million rows, binary + copy: approx 240 sec/million :-( The COPY itself is super super fast - 0.3 seconds for 100k rows - but now almost 99% of the time is spent within .store()
1 reply 0 retweets 0 likes -
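For reference, a sketch of the to_csv + COPY path being timed here, assuming psycopg2; the connection string, frame, and table name are all hypothetical:

    import io
    import pandas as pd
    import psycopg2

    def copy_via_csv(df: pd.DataFrame, conn, table: str) -> None:
        # Serialize the frame to an in-memory CSV buffer...
        buf = io.StringIO()
        df.to_csv(buf, index=False, header=False)
        buf.seek(0)
        # ...then stream it into postgres with COPY, the fast part above.
        with conn.cursor() as cur:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
        conn.commit()

    conn = psycopg2.connect("dbname=test")  # hypothetical DSN
    copy_via_csv(pd.DataFrame({"year": [2015], "export_value": [1.5]}), conn, "test")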
Replying to @makmanalp @BagelDaughter
schema looks like this:

    Schema("test", [
        id_("year"), cat("location_level"),
        num("export_value"), num("import_value"),
        num("export_rca"), num("cog"), num("distance"),
        id_("product_id"), cat("product_level"), id_("location_id"),
    ])
1 reply 0 retweets 0 likes -
Replying to @makmanalp @BagelDaughter
I also dumped a profile (.pstats file) for you if you ever want to take a look at it in the future, but I didn't really get time to dig into what was taking so long: https://gist.github.com/makmanalp/3f895a16887f199194911ebc476211a5
2 replies 0 retweets 0 likes -
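The dumped .pstats file can be inspected with the standard-library pstats module; the filename here is illustrative:

    import pstats

    # Load the dump and print the 20 entries with the highest cumulative
    # time, which should show where .store() spends its ~99%.
    stats = pstats.Stats("store_profile.pstats")
    stats.sort_stats("cumulative").print_stats(20)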
Replying to @makmanalp
Oh awesome, thanks for the data! Sorry it didn’t work quickly enough for you. I may try to optimize the Cython code again someday; the profile data will be super useful!
1 reply 0 retweets 1 like
Did you ever try turbodbc? That’s probably where I’d reach if I were doing this now
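A rough sketch of the turbodbc route; the DSN and table are hypothetical. Its executemanycolumns call takes one numpy array per column and inserts in bulk, which avoids per-row overhead:

    import numpy as np
    import turbodbc

    conn = turbodbc.connect(dsn="postgres")  # hypothetical ODBC DSN
    cursor = conn.cursor()

    # One numpy array per column, inserted in bulk rather than row by row.
    cursor.executemanycolumns(
        "INSERT INTO test (year, export_value) VALUES (?, ?)",
        [np.array([2015, 2016], dtype=np.int64),
         np.array([1.5, 2.5], dtype=np.float64)],
    )
    conn.commit()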
-
-
Replying to @BagelDaughter
will possibly give it a shot but I've heard not great things about the postgres ODBC driver ... I think since I know to_csv is my bottleneck I might just try to parallelize by table tbh.
0 replies 0 retweets 0 likes -
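A sketch of that parallelize-by-table idea with multiprocessing, where each worker streams a pre-written CSV through its own COPY; file names, DSN, and table names are all hypothetical:

    from multiprocessing import Pool

    import psycopg2

    def load_table(args):
        csv_path, table = args
        # Each worker opens its own connection; connections can't be
        # shared across processes.
        conn = psycopg2.connect("dbname=test")  # hypothetical DSN
        with conn, conn.cursor() as cur, open(csv_path) as f:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", f)
        conn.close()

    if __name__ == "__main__":
        jobs = [("exports.csv", "exports"), ("imports.csv", "imports")]
        with Pool(processes=len(jobs)) as pool:
            pool.map(load_table, jobs)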