OK, I figured it was something like that, but I thought .astype('category') did all that; I'm realizing now that maybe it didn't use to.
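(For context, a minimal sketch of the astype('category') conversion being discussed; the frame and column name here are made up, not from the actual dataset:)

    import pandas as pd

    # Hypothetical frame; 'location_level' stands in for any low-cardinality string column.
    df = pd.DataFrame({"location_level": ["country", "region", "country", "city"]})

    # Converting to categorical stores the unique labels once plus integer codes per row,
    # which is what makes repeated strings cheap to serialize.
    df["location_level"] = df["location_level"].astype("category")
    print(df["location_level"].cat.codes)       # integer code per row
    print(df["location_level"].cat.categories)  # the label table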
Replying to @BagelDaughter
Thank you! Preliminary results: to_csv + COPY: approx. 15 sec/million rows; binary + COPY: approx. 240 sec/million :-( The COPY itself is super super fast - 0.3 seconds for 100k rows - but now almost 99% of the time is spent within .store()
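(A rough sketch of the to_csv + COPY path being timed, assuming psycopg2; the connection, table name, and helper are made up and the actual harness may differ:)

    import io
    import pandas as pd
    import psycopg2

    def copy_via_csv(df: pd.DataFrame, conn, table: str) -> None:
        # Serialize the frame to an in-memory CSV buffer, then stream it through COPY.
        buf = io.StringIO()
        df.to_csv(buf, index=False, header=False)
        buf.seek(0)
        with conn.cursor() as cur:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
        conn.commit()

    # conn = psycopg2.connect("dbname=test")  # hypothetical DSN
    # copy_via_csv(df, conn, "test")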
Replying to @makmanalp @BagelDaughter
schema looks like this:

    Schema("test", [
        id_("year"), cat("location_level"),
        num("export_value"), num("import_value"), num("export_rca"),
        num("cog"), num("distance"),
        id_("product_id"), cat("product_level"), id_("location_id"),
    ])
Replying to @makmanalp @BagelDaughter
I also dumped you a profile / .pstats file in case you ever want to take a look later, but I didn't really get time to dig into what was taking all the time: https://gist.github.com/makmanalp/3f895a16887f199194911ebc476211a5
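(If anyone else wants to poke at the dump, a quick way to read a .pstats file with the stdlib; the filename is a placeholder:)

    import pstats

    # Load the saved profile; 'cumulative' shows where time accumulates down the call tree,
    # 'tottime' surfaces the hot leaf functions (e.g. the isnull calls mentioned below).
    stats = pstats.Stats("store.pstats")  # placeholder filename
    stats.sort_stats("cumulative").print_stats(20)
    stats.sort_stats("tottime").print_stats(20)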
Replying to @makmanalp @BagelDaughter
Hmm, oh shit - wonder if this was my fault. It looks like a huge amount of time is spent in isnull() stuff, which might be because of a "problem" I "fixed": I was getting a weird error about how cat_lens here https://github.com/spitz-dan-l/postgres-binary-parser/blob/64f54da0c6e6821467a217bfc3b0726f10b12173/postgres_binary_parser/psql_binary.pyx#L277 is an object instead of an int64_t
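(A hedged illustration of why isnull dominating the profile is plausible: a per-element pd.isnull call in a Python-level loop is far slower than one vectorized pass over the column; the series here is made up:)

    import numpy as np
    import pandas as pd

    col = pd.Series(["a", None, "b"] * 100_000)

    # Per-element check: one Python-level pd.isnull call per row.
    mask_slow = np.array([pd.isnull(v) for v in col])

    # Vectorized check: a single call over the whole column.
    mask_fast = col.isnull().to_numpy()

    assert (mask_slow == mask_fast).all()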
Replying to @makmanalp @BagelDaughter
which I found was coming from cat_lens being full of NaNs due to this line https://github.com/spitz-dan-l/postgres-binary-parser/blob/64f54da0c6e6821467a217bfc3b0726f10b12173/postgres_binary_parser/psql_binary.pyx#L206 which I replaced with column[~column.isnull()] ... which may have been a mistake
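(Roughly the idea behind that swap, as a pandas sketch rather than the actual Cython: computing string lengths with NaNs present gives a float/NaN-filled result, while filtering nulls first keeps the lengths as plain integers:)

    import pandas as pd

    column = pd.Series(["usa", None, "deu", "jpn"])

    lens_with_nans = column.str.len()                    # float64, NaN where the value is null
    lens_filtered  = column[~column.isnull()].str.len()  # int64, only the non-null rows

    print(lens_with_nans.dtype, lens_filtered.dtype)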
Replying to @makmanalp @BagelDaughter
Back on this ... I was wondering if repeated allocations with array.extend_buffer were slowing things down, but when I looked at the implementation it uses resize_smart, which overallocates by 50%, so if there's a hit I bet it's marginal. You've inlined all the write() funcs too ...
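(A toy sketch of why ~50% over-allocation makes repeated extends cheap: capacity grows geometrically, so the total bytes copied across all resizes stays proportional to the final size rather than quadratic. This is only an illustration, not the resize_smart code:)

    def grow(capacity: int, needed: int) -> int:
        # Geometric growth: bump capacity by ~50% until it fits, like an over-allocating resize.
        while capacity < needed:
            capacity = capacity + capacity // 2 + 1
        return capacity

    cap, copies = 0, 0
    for n in range(1, 1_000_001):       # appending one element at a time
        if n > cap:
            copies += cap               # a resize copies the existing elements
            cap = grow(cap, n)
    print(cap, copies)                  # copied elements stay O(n), not O(n^2)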
Replying to @makmanalp @BagelDaughter
Damn you for doing a good job and leaving me no low hanging fruit! :-)
is_null taking a long time is a very plausible culprit. Unsure about the type error you mentioned, unfortunately.