Let’s start by asking: what is SCALE in data analytics projects?
The first two things that come to mind are probably the volume of data and the number of variables (analytics algorithms and approach).
But let’s expand our discussion of SCALE, assuming that most likely 60-80% of your time in any data analytics project will be spent readying data for analysis (extraction, loading, integration and harmonization).
How many different data silos spread across multiple systems can we integrate?
- Master data management (handling veracity in data)
- Provenance and auditing (for meeting government regulations and compliance)
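To make the integration question concrete, here is a minimal Python sketch of merging records from two silos into a single master record while remembering which system each value came from. The silo names (`crm`, `billing`), the field names, and the "first source wins" rule are all assumptions made for illustration; a real master data management product would apply far richer matching and survivorship rules.

```python
from dataclasses import dataclass, field


@dataclass
class MasterRecord:
    key: str
    attributes: dict = field(default_factory=dict)
    # Maps each attribute name to the source system that supplied it.
    provenance: dict = field(default_factory=dict)


def normalize_key(email: str) -> str:
    """Harmonize the matching key so the same person lines up across silos."""
    return email.strip().lower()


def integrate(sources: dict) -> dict:
    """Merge records from several silos; later sources only fill gaps."""
    master = {}
    for source_name, records in sources.items():
        for record in records:
            key = normalize_key(record["email"])
            entry = master.setdefault(key, MasterRecord(key=key))
            for name, value in record.items():
                if name == "email" or value is None:
                    continue
                if name not in entry.attributes:
                    entry.attributes[name] = value
                    entry.provenance[name] = source_name
    return master


# Hypothetical sample data from two silos.
crm = [{"email": "Ann@Example.com ", "name": "Ann Lee", "phone": None}]
billing = [{"email": "ann@example.com", "name": "A. Lee", "phone": "555-0100"}]

merged = integrate({"crm": crm, "billing": billing})
```

In this sketch the name comes from the CRM silo and the phone number from billing, and the `provenance` map is the kind of trail an auditor could ask for.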
How agile do we need to be?
- Adding a new data source
- Quickly adding more variables to our analysis
- Capturing real time data from connected devices
- Cloud neutrality
Can we ensure data privacy at our scale?
- Granular security
- Encryption (data, logs)
And of course we would need the right amount of computing power and storage to do the job. Think about a horizontally distributed architecture on commodity hardware or in a cloud environment.
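As a rough illustration of what "horizontally distributed" means in practice, the Python sketch below spreads record keys across a fixed number of nodes by hashing each key. The node count, the `device-N` key format, and the helper names are hypothetical; this is one simple partitioning scheme, not a description of how any particular product does it.

```python
import hashlib
from collections import defaultdict


def node_for(key: str, node_count: int) -> int:
    """Pick a node for a key using a stable hash (same key, same node)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partition(keys, node_count: int) -> dict:
    """Group keys by the node that would store them."""
    shards = defaultdict(list)
    for key in keys:
        shards[node_for(key, node_count)].append(key)
    return dict(shards)


# Hypothetical keys, e.g. readings from connected devices.
keys = [f"device-{i}" for i in range(1000)]
shards = partition(keys, node_count=4)

for node, members in sorted(shards.items()):
    print(f"node {node}: {len(members)} keys")
```

One caveat with this simple modulo scheme: changing the node count reassigns most keys, which is why distributed stores often use consistent hashing or a similar technique instead.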
These are just a few critical features to look at when you are evaluating your technology infrastructure for large-scale analytics projects.
Which NoSQL databases, data integration and analytics products have you worked with on large-scale analytics projects?