I use tools like Cook's D to find erroneous data. https://en.m.wikipedia.org/wiki/Cook%27s_distance … Do y'all have favorite tools / methods?
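For readers who haven't used it: Cook's distance scores how much each observation shifts a least-squares fit, so unusually large values are candidates for a closer look (not automatic proof of an error). Below is a minimal NumPy sketch of the textbook formula. The toy data, the injected bad value at index 7, and the 4/n cutoff are assumptions for illustration, not anything from the thread.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an ordinary least squares fit.

    X is the design matrix and must already include an intercept column.
    """
    n, p = X.shape
    # Hat matrix: maps observed y to fitted values.
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverage of each observation
    resid = y - H @ y                   # ordinary residuals
    s2 = resid @ resid / (n - p)        # residual variance estimate
    return (resid**2 / (p * s2)) * h / (1.0 - h) ** 2

# Toy data (assumed): a clean line plus one corrupted reading at index 7.
x = np.arange(10.0)
y = 2.0 * x + 1.0 + 0.1 * (-1.0) ** np.arange(10)
y[7] += 15.0

X = np.column_stack([np.ones_like(x), x])
d = cooks_distance(X, y)

suspect = int(np.argmax(d))                         # most influential row
flagged = np.flatnonzero(d > 4.0 / len(y)).tolist() # 4/n is a common rule of thumb
print(suspect, flagged)
```

The cutoff is a convention rather than a test; a flagged row still needs a human decision about whether it is actually wrong.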
-
-
I think that's a false dichotomy. Of course we should work on collecting without errors, but that doesn't save me from having to clean a dataset that I had no control over the collection of. Also munging!
-
One concern I have with data cleaning is that it adds forks to the 'garden of forking paths.' http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf … Perhaps cleaning is something we have to cope with, a symptom of a defect?
-
-
The only proper way to clean data is to discard data with errors. Changing it is on a different level.
-
Thankfully I work with relatively high quality data, but when this happens I like to mark the data as erroneous and include in the analysis for contextualization. If data is messy, users should know that.
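One way to mark rather than discard, sketched in pandas: keep every row, add a boolean flag, and report results with and without the flagged rows. The column names, the sensor readings, and the plausibility range of -50 to 60 are all hypothetical.

```python
import pandas as pd

# Hypothetical readings; -999.0 and 150.0 stand in for erroneous values.
df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b", "c"],
    "temp_c": [21.5, 22.1, -999.0, 20.8, 150.0],
})

# Flag instead of drop: the plausible range here is an assumed rule.
df["suspect"] = ~df["temp_c"].between(-50, 60)

# Report both views so users can see how much the flagged rows matter.
mean_all = df["temp_c"].mean()
mean_trusted = df.loc[~df["suspect"], "temp_c"].mean()
print(df)
print(mean_all, mean_trusted)
```

Because nothing is deleted, the rule that produced the flag stays visible and can be revisited later.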
-
-
Both! Even data that is clean for the purpose of the original capture (say an app) will need cleaning and transforming for analysis.
-
I'm with ya in terms of reorganizing data. When I hear 'data cleaning,' the removal of data like perceived outliers is what comes to mind. It sounds like folks may be taking a broader view.
-
-
It's been said before, but I really don't like this phrase "data cleaning". It implies the data is "dirty", which is the wrong metaphor. It removes context from the task and suggests there is a universal definition of "clean" vs "dirty".
-
But it's always about what the current question is, and the current knowledge that needs to be "manually" included in the data as judgment calls. And it's always going to be extremely specific to the question being asked.
-
-
I’m realizing I’ve been including transformations like adding columns with computed values or using `pivot_longer` as part of cleaning. Do y’all see that as a separate process?
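To make the distinction concrete, here is a small sketch of both transformations. `pivot_longer` is from R's tidyr; the example uses pandas `melt` as a rough analogue, and the table and column names are invented for illustration. Nothing here corrects or removes a value, which is arguably why it feels different from cleaning.

```python
import pandas as pd

# Hypothetical wide table: one column per year.
wide = pd.DataFrame({
    "site": ["a", "b"],
    "2019": [10, 20],
    "2020": [12, 18],
})

# Reshape wide to long (roughly what tidyr's pivot_longer does).
long = wide.melt(id_vars="site", var_name="year", value_name="count")
long["year"] = long["year"].astype(int)

# Add a computed column: each row's share of its site's total.
long["share"] = long["count"] / long.groupby("site")["count"].transform("sum")
print(long)
```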
-
I’m at the point where I prefer the more vague term of data wrangling. Sometimes there’s cleaning involved, but there’s also a great deal of just reorganizing the data.