Dark data is the name for data that is collected but never used.
It’s said that 96-98% of all data is dark data (not that I can confirm or deny those claims).
Dark data forms the bottom layer in the DIKW hierarchy below (image sourced from here).
What do you think the dark data percentage would be within OSS? Or, more specifically, within your OSS?
If you’re not going to use it, then why collect it?
I have two conflicting trains of thought here:
- The Minimum Viable Data perspective; and
- The hoard-it-all perspective: it’s relatively cheap and easy to collect and store raw data if an interface is already built, so keep everything just in case your data scientists (or automated algorithms) ever need it.
Where do you sit on the data collection spectrum?