Product Design

April 1, 2019

/ business & Agile

Data Warehouse or Big data? - which one is better for your business?

Jakub Maślanka

Data Warehouse (DT) is one of the most interesting forms of large data sets usage which furthermore is on the advanced level of practical implementation. Unlike many Big Data solutions, it has a relatively long history and its dynamic growth is driven by real business needs. Among common services I took a closer look at Amazon Redshift and in this article I will present the examples of integration with Python to create basic reporting features.

What’s a Data Warehouse? 

Data Warehouse is an advanced relational database providing an insight into a company’s performance through historical data from different sources - apps, systems, and others. They’ve been designed to enable not only to store massive volumes of data but also (or even above all) analyzing them and presenting the results in an intelligible, user-friendly way

Although the history of data warehouse goes back to the ‘70s, it has taken a decade to establish first advanced works of commercial introduction. But from that moment DT development sped up and there’s been created a lot of schemes like star, snowflake, galaxy or constellation supporting the most effective use of information contained in data warehouses’ structures (dimensions, measures and facts tables). Of course, we need to remember that a DT on its own is just an extensive database with a specific architecture simplifying analysis but only after ETL - Extract, Transform, Load process implementation. 

Data warehouses vs Big Data 

The level of alignment and restrictions imposed to deliver clear and proper conclusions from data analysis are the main aspects differing Data Warehouses from Big Data. The clue of this disparity lies in the type of used information. In the traditional approach DT storages company’s internal data, which because of their quality and credibility, empower to perform a relatively factual analysis. The problem shows up in case of processing and researching non-standard information. And most of the data coming from the outside of a company is untypical but enabling to e.g. study changing consumers’ preferences. The level of their complexity complicates the research and translating the conclusions into specific business models. This is the moment when Big Data solutions come to rescue. However, they are generally more advanced and much more expensive so only the biggest players can afford to implement them. And we need to point out that with a current level of DTs’ development, most companies are able to use them and to gain a huge and valuable knowledge provided by their features. 

  Big Data  Data Warehouse 
Concept The technology of researching massive, heterogeneous, and unstructured databases. A database with a specific architecture adjusted to the goal of its usage and the topic it concerns.
Level of development A relatively new concept which has been developed for about 20 years.  The beginnings of implementation rich to ’80s. 
Data Data is diverse in terms of origin and formats (it allows i.a. audio or video files or social media posts), with quality often hard to verify.   Preferred data is credible, verified, and coherent; they mostly come from the company’s inner sources.  
Structure Mostly flat (data lake); Data is not classified till a specific query appears. Highly advanced relational databases; the data is structured in terms of a chosen schema. 
Sources Diverse, often external e.g. social media.  Mainly internal, coming from the company’s systems (transactional, HR, CRM, ERP, etc.). 
Data time range  Data can be loaded real-time or close to real-time thanks to the common implementation of in-memory technology.  Usually, historical data provided cyclically, but there’s also a possibility to implement a solution like Amazon Kinesis Data Firehose to achieve real-time data loading. 
Purpose and potential Flexibility in supported data formats and sources provide a great potential to use Big Data in researching companies environment, trends and customers’ preferences.  The conclusions from data analysis are being used mostly to build particular models and make business decisions within a well-known topic of company performance. 
Cost An effective analysis of massive amounts of diverse data requires a high level of skills (usually the company has to hire data science experts) and expensive tools.   The services are getting more and more low-cost and easier to use but all significant and advanced changes in the structure are expensive and time-consuming because of difficulties in rebuilding the Data Warehouse.  
Target group  Enterprises and big players with great innovation development budgets.   Cost and use friendliness for even smaller companies which plan to research gathered data.  
Technologies and tools  Hadoop, Data Lake, Storm, Qubole, MongoDB, Spark, HPCC.  Amazon Redshift, Teradata,  Panoply, Oracle 12c, MarkLogic.

 

Why Amazon Redshift? 

To set an example of Data Warehouse integration I’ve chosen Amazon Redshift. There’s a reason why companies like Pinterest, Nasdaq, Periscope Data or Philips bet on it and among Redshift’s values, we can list: 

  •  speed

Thanks to features like Columnar Data Storage, Advanced Compression or Massively Parallel Processing (MPP), Redshift makes it incredibly fast to query the database and that in terms of analyzing huge volumes of data is just crucial. 

  • scalability

With just a few clicks you can increase the number of nodes your Warehouse will use. Amazon will take care of everything and the changes will be applied immediately.

  • simplicity 

Redshift manages all the work of setting up, operating and scaling your warehouse.

Intuitive interface, easy to set up, automatic backups and fancy charts. It has it all.

  • cost-effectiveness

The way Redshift stores the data transfers to a very effective compression which further saving the space and your cost. Pricing includes on-demand hour rates and up to 70% discount when choosing a 3-year term. You can get 1TB for as cheap as 1000$ a year and some companies in their testimonials mention 90% savings in comparison to other data warehouses. 

How can I use Amazon Redshift with Python?

There are a few ways to integrate Amazon Redshift with Python. You can simply collect data and access it from Amazon Control Panel or SQL clients but that requires the knowledge of this programming language and some manual work. Taking this step ahead you can automate this process to create periodic reports of your data as Python is great when it comes to automation of boring tasks. 

More advanced solution is to analyze the data with specialized libraries that give us many tools to interact with the dataset. That enables us to create in-app analytics or even analytics-driven behavior. Doesn’t it sound cool?

Last but not least, it can be integrated with third-party services such as analytics dashboards (ie. Plotly, Salesforce) to give you detailed insight into your data with minimal effort.

Code example 

Redshift gives us the ability to define our own Python functions (UDFs) inside the SQL. The snippet below creates a function that returns Top Level Domain of an email address.  

CREATE OR REPLACE FUNCTION f_find_top_domain( email VARCHAR(100) )
    RETURNS VARCHAR(100) IMMUTABLE AS $$
    domain = email.split('@')[-1]
    return domain.split(‘.’)[-1]
$$ LANGUAGE plpythonu;

Let’s create a query that uses this function, it might look like this:

SELECT COUNT(*) AS domain_count,
    f_find_top_domain(email) AS domain
FROM users
WHERE email IS NOT NULL
GROUP BY domain
ORDER BY domain_count DESC
LIMIT 3;

If we were to run that query in Redshift’s Query Editor the result would be: 

Query Result in Amazon Redshift  

Using Python we can do it within our application. I will show you how you can connect to your Redshift cluster, run query and create a basic chart to display the results.

from sqlalchemy import create_engine
import plotly.graph_objs as graph
import plotly.io as plotly_io
import pandas
import requests
requests.packages.urllib3.disable_warnings()


query = '''
    SELECT COUNT(*) AS domain_count,
        f_find_top_domain(email) AS domain
    FROM users
    WHERE email IS NOT NULL
    GROUP BY domain
    ORDER BY domain_count DESC
    LIMIT 4;
'''

password = '***'
user = '***'
host = '***'
port = 5439
db = '***'

url = f'postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}'
engine = create_engine(url)

data_frame = pandas.read_sql_query(query, engine)

layout = graph.Layout(title='Top Level Domains', yaxis=dict(title='Sum'))
data = [graph.Bar(x=data_frame.domain, y=data_frame.domain_count)]
fig = graph.Figure(data=data, layout=layout)

plotly_io.write_image(fig, ‘chart.png')


Top level domain Amazon Redshift

Of course, these are only a few examples of features, which we can use while integrating Python and the potential of Amazon Redshift. I highly recommend spending some time to use and test other solutions which can make data analysis smooth and much more effortless. 

The article was originally released in polish language on justgeek.it

 

We use cookies on this site to improve performance. By browsing this site you are agreeing to this. For more information see our Privacy policy.