Statistical Data

Portfolio milestone 6

OPTION #1: NORTH-WIND DATA MINING AND STATISTICAL ANALYSIS – DATA WAREHOUSE

The purpose of this milestone assignment is to complete the tasks described below in preparation for your final project delivery.

  1. Data Warehouse:
    • Create a data warehouse database, including the fact and dimension tables (star schema).
    • Create the schema for each table.
    • Populate the tables using either ETL (Pentaho) or SQL (PostgreSQL).
  2. Preprocessing for SAS:
    • Extract data from the data warehouse, creating a file for input into SAS. The format of the file is your choice. Ensure SAS University Edition accepts your selected format.
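
The warehouse steps above can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and every table name, column name, and data row is hypothetical (a heavily trimmed, Northwind-style source). In PostgreSQL the overall shape would be similar, though date functions such as strftime differ.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source tables with made-up sample rows.
cur.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, company_name TEXT, country TEXT);
CREATE TABLE products  (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id TEXT, order_date TEXT);
CREATE TABLE order_details (order_id INTEGER, product_id INTEGER,
                            unit_price REAL, quantity INTEGER);

INSERT INTO customers VALUES ('C1', 'Example Co', 'Germany'), ('C2', 'Sample Ltd', 'Mexico');
INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO orders VALUES (1001, 'C1', '2019-01-15'), (1002, 'C2', '2019-01-16');
INSERT INTO order_details VALUES (1001, 1, 18.0, 10), (1001, 2, 19.0, 5), (1002, 1, 18.0, 2);
""")

# Star schema: one fact table surrounded by three dimension tables.
cur.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
                           customer_id TEXT, company_name TEXT, country TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY AUTOINCREMENT,
                           product_id INTEGER, product_name TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT,
                           year INTEGER, month INTEGER);
CREATE TABLE fact_sales   (order_id INTEGER,
                           customer_key INTEGER REFERENCES dim_customer(customer_key),
                           product_key  INTEGER REFERENCES dim_product(product_key),
                           date_key     INTEGER REFERENCES dim_date(date_key),
                           quantity INTEGER, sales_amount REAL);
""")

# Populate dimensions first, then the fact table, using INSERT ... SELECT.
# strftime is SQLite-specific; PostgreSQL would need a different date function.
cur.executescript("""
INSERT INTO dim_customer (customer_id, company_name, country)
SELECT customer_id, company_name, country FROM customers;

INSERT INTO dim_product (product_id, product_name)
SELECT product_id, product_name FROM products;

INSERT INTO dim_date (date_key, full_date, year, month)
SELECT DISTINCT CAST(strftime('%Y%m%d', order_date) AS INTEGER), order_date,
       CAST(strftime('%Y', order_date) AS INTEGER),
       CAST(strftime('%m', order_date) AS INTEGER)
FROM orders;

INSERT INTO fact_sales (order_id, customer_key, product_key, date_key, quantity, sales_amount)
SELECT o.order_id, dc.customer_key, dp.product_key,
       CAST(strftime('%Y%m%d', o.order_date) AS INTEGER),
       od.quantity, od.unit_price * od.quantity
FROM order_details AS od
JOIN orders AS o        ON o.order_id = od.order_id
JOIN dim_customer AS dc ON dc.customer_id = o.customer_id
JOIN dim_product AS dp  ON dp.product_id = od.product_id;
""")

fact_rows, total_sales = cur.execute(
    "SELECT COUNT(*), SUM(sales_amount) FROM fact_sales"
).fetchone()
print(fact_rows, total_sales)
```

Loading the dimensions before the fact table lets the fact rows look up their surrogate keys with plain joins.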

You should use the plan formulated in Milestone 1 of Module 3 for the detailed steps you intend to follow.

Statistical Data assignment

For this milestone assignment, you are expected to submit:

  • Screenshots of the populated data warehouse
  • Star schema design, either a drawing or screenshot
  • Row counts for the fact and dimension tables
  • Brief description of your key learnings from completing this assignment
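
One possible way to gather the row counts listed above is a small helper that runs COUNT(*) against each table. The sketch below again assumes sqlite3 as a stand-in for PostgreSQL, and the table names and rows are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales   (customer_key INTEGER, product_key INTEGER, quantity INTEGER);
INSERT INTO dim_customer VALUES (1, 'Example Co'), (2, 'Sample Ltd');
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO fact_sales VALUES (1, 1, 10), (2, 1, 4), (1, 1, 7);
""")


def row_counts(connection, tables):
    """Return {table_name: row_count} for a fixed list of table names."""
    counts = {}
    for table in tables:
        # Table names come from a hard-coded list, never from user input,
        # so building the statement with an f-string is acceptable here.
        counts[table] = connection.execute(
            f"SELECT COUNT(*) FROM {table}"
        ).fetchone()[0]
    return counts


counts = row_counts(conn, ["dim_customer", "dim_product", "fact_sales"])
for name, n in counts.items():
    print(f"{name}: {n}")
```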

Your assignment must meet the following requirements:

Refer to the Portfolio Project Milestone rubric in the Module 6 folder for more information on the expectations for this assignment.

Statistical Data Additional information

This project is based on this assignment:

PROJECT PLAN

Introduction

Data mining and statistical analysis are two major pillars of development for any organization that deals with big data. Both are used mainly to forecast and estimate specific quantities from the current state of an organization. Analysis must be conducted with great care so that the reports generated are accurate and the decisions based on them are sound. This paper outlines the tasks, activities, software requirements, and challenges involved in executing the Portfolio Project.

Tasks

This project involved a number of stipulated tasks that together produced the final analysis reports. The first task was data warehousing. Here, a database schema containing all the tables and data from the Northwind database was designed. For this project, the database and table sequences were created in PostgreSQL, and the tables were populated with the given data.

The next step in the process was to extract all the data from the tables and prepare an input file for SAS. This phase involved several tasks, including installing SAS University Edition, setting up environment variables, and extracting the data.
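
The extraction step might look like the sketch below, which writes a query result to a CSV file with a header row. CSV is an assumption here (the assignment leaves the file format open), sqlite3 stands in for PostgreSQL, and the table names, the helper export_query_to_csv, and the file name sales_for_sas.csv are all hypothetical.

```python
import csv
import os
import sqlite3
import tempfile

# Assumption: in-memory SQLite stands in for the PostgreSQL warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, quantity INTEGER, sales_amount REAL);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales VALUES (1, 10, 180.0), (2, 5, 95.0), (1, 2, 36.0);
""")


def export_query_to_csv(connection, query, path):
    """Run a query and write the result to a CSV file with a header row."""
    cursor = connection.execute(query)
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Flatten the star schema into one wide table for the statistics package.
query = """
SELECT p.product_name AS product_name,
       f.quantity     AS quantity,
       f.sales_amount AS sales_amount
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
ORDER BY f.sales_amount
"""

out_path = os.path.join(tempfile.mkdtemp(), "sales_for_sas.csv")
rows_written = export_query_to_csv(conn, query, out_path)
print(f"wrote {rows_written} rows to {out_path}")
```

Joining the fact table to its dimensions before export means the analysis tool receives a single flat file and does not need to repeat the joins.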

The next step was the actual data analysis in SAS, in which a number of tasks were executed. The extracted data was first imported into the application, and then processing began. From classification to association, the data was processed through a number of statistical measures, and reports were generated.

Activities

The three main activities involved in this project revolve around the ETL process. The extraction process was conducted on the Northwind database. In this process, data was collected from the database and then imported into the SAS software for analysis. The analysis process covered the transformation stage. In this step, all data was passed through a series of activities to transform it into the desired state. The first activity undertaken in this process was classification. This process involves partitioning different sets of data into related categories to make retrieval easy (Alasadi, 2017).

The next activity undertaken was clustering. This process is similar to classification, and it involves grouping similar entities into clusters (Smith, 2018). There are many approaches to clustering, but all of them group similar data objects together. Another important aspect of analysis that was undertaken is association. This activity involves studying patterns and identifying relationships between variables in a large set of data. With all these activities in place, a summary of the findings was produced and detailed reports were generated.
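
To make the idea of association concrete, the sketch below computes support and confidence, two commonly used measures of how strongly items co-occur. It is a plain-Python illustration on made-up baskets and is not the SAS procedure used in the project; the function name pair_rules and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions; each set holds the items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def pair_rules(transactions):
    """Return (antecedent, consequent, support, confidence) for every item pair.

    support    = share of transactions containing both items
    confidence = share of transactions with the antecedent that also
                 contain the consequent
    """
    total = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), both in pair_counts.items():
        support = both / total
        rules.append((a, b, support, both / item_counts[a]))
        rules.append((b, a, support, both / item_counts[b]))
    return rules


rules = {(a, b): (support, confidence) for a, b, support, confidence in pair_rules(baskets)}
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high confidence but low support holds in only a few transactions, so both measures are usually read together.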

Software requirements

The software applications used in the project are PostgreSQL, Pentaho Data Integration, and SAS. Each application has its own stipulated system requirements, but all of them share the following minimum requirements:

Windows 7 or later, 64-bit hardware, a minimum of 1 GB of RAM, and an up-to-date browser.

Challenges

While undertaking the portfolio project, I faced a number of challenges. First, configuration was a big challenge for the analysis process, because setting up SAS and PostgreSQL to work together was difficult. Next, preparing the analysis reports was also a challenge because of the structure of the data that I was analyzing. In addition, using Pentaho Data Integration was challenging, owing to the fact that there are few resources that explain how to use it for breaking down data.

Conclusion

Apart from the challenges faced in the project, many lessons were learnt, and their implications are significant. From the system requirements to the tasks undertaken, it is possible to deduce best practices for the data analysis process. It is also possible to identify common mistakes and misconceptions. This entire process is fundamental in establishing a firm basis for anyone venturing into data science.

References

Smith, M. J. (2018). Statistical analysis handbook: A comprehensive handbook of statistical concepts, techniques and software tools. The Winchelsea Press, Drumlin Security Ltd, Edinburgh.

Alasadi, S. (2017, September). Review of data processing techniques in data mining. Journal of Engineering and Applied Sciences. University of Babylon.

Last Updated on February 10, 2019
