My plan is to try something like this: https://gist.github.com/mangecoeur/1fbd63d4758c2ba0c470 but instead use to_csv with a named pipe or a StringIO buffer to avoid reading/writing a file on disk.
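A minimal sketch of that idea, assuming psycopg2 and a target table whose columns already match the DataFrame (table name and connection details below are placeholders):

```python
import io

import pandas as pd
import psycopg2


def copy_dataframe(df: pd.DataFrame, table: str, conn) -> None:
    """Stream a DataFrame into Postgres via COPY, keeping the CSV in memory."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)  # serialize to an in-memory buffer, not a file
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()


# usage (placeholder connection string):
# conn = psycopg2.connect("dbname=mydb user=me")
# copy_dataframe(df, "my_table", conn)
```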
Replying to @makmanalp
1. There is turbodbc with Arrow support: http://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#apache-arrow-support though I'm unsure about write support (it definitely has fast bulk reads with a basic SELECT).
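For the bulk-read side, a rough sketch of what the turbodbc/Arrow path looks like (DSN and table names are placeholders; requires turbodbc built with Arrow support):

```python
from turbodbc import connect

conn = connect(dsn="my_postgres_dsn")  # placeholder DSN
cur = conn.cursor()
cur.execute("SELECT * FROM my_table")

table = cur.fetchallarrow()  # pyarrow.Table assembled without a Python-level row loop
df = table.to_pandas()       # hand off to pandas if needed
```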
Replying to @BagelDaughter @makmanalp
2. Use this binary parser/writer I made for this exact use case! https://github.com/spitz-dan-l/postgres-binary-parser It uses Cython for the binary bits. It only supports a few of the datatypes, and it uses an idiosyncratic DSL for schema specification rather than e.g. SQLAlchemy.
Replying to @BagelDaughter
I did look at turbodbc, but there didn't seem to be much info on how well executemanycolumns works with a DataFrame.
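For reference, a hedged sketch of what executemanycolumns expects - one NumPy (or masked) array per parameter rather than a DataFrame directly, so a DataFrame would have to be split into its columns first (all names below are placeholders):

```python
import numpy as np
from numpy.ma import MaskedArray
from turbodbc import connect

conn = connect(dsn="my_postgres_dsn")  # placeholder DSN
cur = conn.cursor()

# one array per parameter, column-wise; a DataFrame could be fed as
# [df[c].to_numpy() for c in df.columns]
ids = np.array([1, 2, 3], dtype=np.int64)
vals = MaskedArray([1.5, 2.5, 0.0], mask=[False, False, True])  # mask marks NULLs

cur.executemanycolumns(
    "INSERT INTO my_table (id, val) VALUES (?, ?)",
    [ids, vals],
)
conn.commit()
```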
Replying to @makmanalp @BagelDaughter
That parser is very neat! I thought of going down that road, but I read this in the docs and it scared me: "a binary-format file is less portable across machine architectures and PostgreSQL versions" - what has your experience been here, especially across versions?
Replying to @makmanalp
Definitely not wrong; there's no proper "spec" for the binary format other than the postgres source code! It just reuses the code postgres uses to send values around internally. I haven't run into problems there but I've never had to do a big version upgrade.
Replying to @BagelDaughter @makmanalp
Also, though, don't underestimate the inefficiency of pandas' CSV read/write when string columns are involved. It's a huge bottleneck, and pandas significantly underperforms most other CSV read/write tools here.
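One rough way to see that overhead on your own data is to time to_csv on a numeric-only frame versus the same frame with a string column added (sizes and values below are arbitrary illustrations):

```python
import io
import time

import numpy as np
import pandas as pd

n = 1_000_000  # arbitrary size
numeric = pd.DataFrame({"a": np.random.rand(n), "b": np.random.rand(n)})
with_strings = numeric.assign(label=np.random.choice(["foo", "bar", "baz"], size=n))

for name, frame in [("numeric only", numeric), ("with a string column", with_strings)]:
    buf = io.StringIO()
    start = time.perf_counter()
    frame.to_csv(buf, index=False)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```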
Replying to @BagelDaughter
Welp, we basically tried every trick in the book, and it looks like you were right - to_csv is now the bottleneck, just like you said. Might try a parallel insert first, but this might be next.
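A sketch of one possible parallel-insert approach, splitting the DataFrame into chunks and letting each worker serialize and COPY its own slice (connection string and table name are placeholders):

```python
import io
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd
import psycopg2

DSN = "dbname=mydb user=me"  # placeholder connection string


def copy_chunk(chunk: pd.DataFrame, table: str = "my_table") -> None:
    # each worker serializes and COPYs its own slice, so to_csv itself runs in parallel
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
        conn.commit()
    finally:
        conn.close()


def parallel_insert(df: pd.DataFrame, workers: int = 4) -> None:
    chunks = np.array_split(df, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_chunk, chunks))
```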
Replying to @makmanalp
Ah, good luck! Curious to hear how it goes, and happy to answer any questions I can!
Replying to @BagelDaughter
Hey, quick question: what is going on in this whole if statement? Isn't returning col.astype('category') enough? I can see it being something to do with the types of the categories, but then how does len(cats) help you decide anything? https://github.com/spitz-dan-l/postgres-binary-parser/blob/64f54da0c6e6821467a217bfc3b0726f10b12173/postgres_binary_parser/schema.py#L382
The intention was: turn it categorical, and force the categories to be strings. reduce_categories was written before remove_unused_categories was added to pandas, so it was intended to make the string conversion as efficient as possible. It's probably not relevant to your use case.
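A small illustration of the distinction being discussed, with made-up values: plain astype('category') keeps the categories in their original dtype, converting to strings first forces string categories, and remove_unused_categories prunes categories left over after filtering.

```python
import pandas as pd

col = pd.Series([1, 2, 2, None])  # float64 with a NaN (made-up column)

as_cat = col.astype("category")                   # categories stay floats: [1.0, 2.0]
as_str_cat = col.astype(str).astype("category")   # categories forced to strings: ['1.0', '2.0', 'nan']

# after filtering, unused categories linger until explicitly pruned
filtered = as_cat[as_cat == 2]
print(filtered.cat.categories)                                 # [1.0, 2.0]
print(filtered.cat.remove_unused_categories().cat.categories)  # [2.0]
```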
Replying to @BagelDaughter
OK, I figured it was something like that, but I thought .astype('category') did all that - I'm realizing now that maybe it didn't used to.