What Is Data Engineering? And Why Is It Critical to Succeed in Today’s Data-Driven Landscape?
It is estimated that there will be around 200 zettabytes of data worldwide by 2025, with roughly 100 zettabytes of it stored in the cloud. 1 Storing zettabytes of data is challenging on its own; gaining value from such a huge amount of information is even harder. Collected data also carries security and governance requirements that are mandatory to meet, and poor data quality can result in misinformed business decisions and pricey mistakes. The data a business collects must be secure, clean and consistent. This is where data engineering comes into play.
Data engineering is the process of discovering, designing and building the data infrastructure that helps data owners and data users collect and analyze raw data from multiple sources and formats. This allows businesses to use the data to make critical business decisions. Without data engineering, it would be impossible to make sense of the huge amounts of data that are available.
Figure 1: The fundamentals of data engineering.
What Is the History of Data Engineering?
Around 2011 the term “data engineering” became popular at new data-driven companies. To understand the history of data engineering, let’s look at how data has changed over the years.
In the 1970s and ’80s, mainframes and midrange machines stored most enterprise data. In the 1990s, much of this shifted into distributed applications such as ERP, SCM and CRM systems. Into the 2000s, as illustrated in Figure 2, on-premises data marts and data warehouses emerged.
Data warehouses delivered so much value that the industry moved toward purpose-built data warehouse appliances. These appliances were so expensive that data modelers and data engineers had to optimize the systems to reduce operational costs.
Figure 2: Evolution of the data landscape.
Around 2006, Hadoop, an open-source framework, was introduced, and it looked like big data was going to take over. Hadoop had a massive impact on data management: the idea that compute and data storage are expensive was flipped on its head, and both became cheap. But although compute and storage were inexpensive, Hadoop itself was very complex. Technology evolved again, and today organizations are rushing to the cloud, where data storage is cheap and compute is cheap because it is consumption based.
The key factors in the evolution of data engineering are price and performance. Newer architectures like data fabric and data mesh are emerging to support data science practice using artificial intelligence (AI) and machine learning (ML).
Traditional ETL (extract, transform, load), a data processing development practice, evolved into the broader term “data engineering,” which describes handling increasing volumes of data across data infrastructure, data warehouses and data lakes, along with data modeling, data wrangling, data cataloging and metadata management.
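The extract-transform-load pattern mentioned above is the historical core of data engineering, and it can be sketched in a few lines. The following is a minimal, self-contained illustration; the file paths, column names and cleaning rules are hypothetical examples, not any particular vendor's pipeline.

```python
# Minimal ETL sketch: extract raw rows, transform (clean) them, load the result.
# Paths, fields and cleaning rules here are illustrative assumptions.
import csv

def extract(path):
    """Read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Clean raw rows: drop records missing an id, normalize names."""
    return [
        {"id": r["id"], "name": r["name"].strip().title()}
        for r in rows
        if r.get("id")  # records without an id are rejected
    ]

def load(rows, path):
    """Write the cleaned rows to a destination CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(rows)

# Demonstrate the transform step on in-memory data.
raw = [{"id": "1", "name": "  ada lovelace "}, {"id": "", "name": "unknown"}]
clean = transform(raw)
print(clean)  # [{'id': '1', 'name': 'Ada Lovelace'}]
```

Real pipelines add scheduling, monitoring and error handling around this shape, but the three stages remain the same.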
Why Is Data Engineering Important?
Data engineering is important because it allows businesses to use data to solve critical business problems. Data that is unavailable and/or of poor quality leads to potential mismanaged resources, longer time to market and loss in revenue.
Data is present in every step of the business, and it is necessary for important work: the marketing team builds customer segmentation, and the product team builds new features based on customer demand. Data truly is the backbone of a company’s success.
Cost overruns, resource constraints and technology/implementation complexity can derail your cloud data integration and management strategy. On top of that, missing or inaccurate data can lead to lost trust, wasted time, frustrated data users and poor customer service.
Effective data engineering is the answer to overcome these problems. Instead of quick fixes, a comprehensive data management platform is key to modern data engineering.
How Is Data Engineering Different from Data Science?
The data landscape is always changing. The sheer amount of data being produced makes data gathering and data management complex, and organizations want fast insights from that data. While the required skillsets for a data engineer and a data scientist may sound alike, the roles are distinct:
- Data engineers develop, test and maintain data pipelines and architectures.
- Data scientists use that data to predict trends and answer questions that are important to the organization.
The data engineer does the legwork that helps the data scientist provide accurate metrics. The role of a data engineer is very outcome-oriented. A data engineer is a superhero of sorts because she can bring all this data to life. 2
The graphic below shows how data engineering assists in data science operations.
Figure 3. How data engineering supports data science projects.
Examples of Customer Success in Data Engineering
Below are some data engineering case studies across key industries.
- Intermountain Healthcare wanted to drive digital transformation with a Digital Front Door. They adopted Informatica Data Integration and Data Quality, which help them deliver high-throughput ingestion and verification of patient data. Now they can load 300 CSV files in 10 minutes, a task that used to take a week.
- Banco ABC Brasil needed to deliver a better experience to clients by accelerating the credit application process. So, they moved financial and customer data from source systems into a Google Cloud data lake using Informatica Intelligent Data Management Cloud. By doing so, they improved customer service by processing credit applications 30% faster.
- Vita Coco, a coconut water retailer, wanted to drive business growth by analyzing downstream product sales performance data to ensure stock availability and drive sales. Using Informatica Cloud Data Integration, they can accept depletion, scan and sales data from partners in a variety of formats. This helps increase sales by working with distributors to adjust regional promotions and processes.
Data engineering in the cloud at Vita Coco
The Demand for Data Engineers
Data engineering is a rapidly growing profession. From large public cloud companies to innovators, data engineers are in high demand. There are over 220,000 job listings for a data engineer in the U.S. on LinkedIn. In fact, data engineering is the fastest growing tech job, beating data science hands down, and the demand has only increased since 2020. 3
According to The New York Times, U.S. unemployment rates for high-tech jobs range from slim to nonexistent. On average, each tech worker looking for a job is considering more than two employment offers. 4
What Are the Core Responsibilities of a Data Engineer?
The role of the data engineer is to transform raw data into a clean state so that business leaders and data science teams can use it to make decisions. Data engineers work in the background to help answer specific questions. The more data a company collects, the more time must be spent processing and analyzing it.
Below is a list of core responsibilities of a data engineer: 5
- Analyze and organize raw data
- Build data systems and data pipelines
- Evaluate business needs and objectives
- Interpret trends and patterns
- Conduct complex data analysis and report on results
- Prepare data for prescriptive and predictive modeling
- Build algorithms and prototypes
- Combine raw information from different sources
- Explore ways to enhance data quality and reliability
- Identify opportunities for data acquisition
- Develop analytical tools and programs
- Collaborate with data scientists and architects on several projects
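One responsibility above, combining raw information from different sources, is the bread and butter of the role. The sketch below joins records from two hypothetical systems (a CRM export and a billing system) on a shared key; the system names, fields and values are made up for illustration.

```python
# Combining records from two hypothetical sources on a shared key.
# All names and values below are illustrative assumptions.
crm = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
billing = [
    {"customer_id": 1, "balance": 250.0},
    {"customer_id": 3, "balance": 80.0},
]

# Index the billing records by key for O(1) lookups during the join.
billing_by_id = {r["customer_id"]: r for r in billing}

# Left join: keep every CRM record, attach a balance when one exists.
combined = [
    {**c, "balance": billing_by_id.get(c["customer_id"], {}).get("balance")}
    for c in crm
]
print(combined)
# [{'customer_id': 1, 'name': 'Acme Corp', 'balance': 250.0},
#  {'customer_id': 2, 'name': 'Globex', 'balance': None}]
```

At scale the same left-join logic would run in SQL or a distributed engine, but the reasoning — pick a key, index one side, decide what happens to unmatched rows — is identical.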
What Key Tools and Technologies Does a Data Engineer Use?
Data engineers wear many hats throughout the data lifecycle, which calls for a diverse background that goes beyond formal education. To start, a degree in computer science, engineering, applied mathematics, statistics or a related IT field is critical. Here are key technical skills that every data engineer should have:
- Deep understanding of data management concepts focusing on data lake and data warehousing
- Experience in database management concepts (relational/non-relational database management system concepts)
- Proficiency in scripting/coding languages such as SQL, R, Python, Java, etc.
- Cloud computing skills in one or more cloud service providers (e.g., Amazon Web Services , Microsoft Azure , Google Cloud Platform , etc.)
- Basic understanding of machine learning algorithms, statistical models and some mathematical functions
- Knowledge of data discovery and profiling through data cataloging and data quality tools
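The last skill above, data profiling, can be illustrated without any special tooling. The toy function below reports null and distinct counts for one column; the dataset and column names are hypothetical stand-ins for what a data quality tool would compute across a whole table.

```python
# A toy data-profiling pass: null count and distinct values for one column.
# The dataset and column name are illustrative assumptions.
def profile(rows, column):
    """Summarize completeness and cardinality of a single column."""
    values = [r.get(column) for r in rows]
    return {
        "rows": len(values),
        "nulls": sum(v in (None, "") for v in values),
        "distinct": len({v for v in values if v not in (None, "")}),
    }

orders = [
    {"country": "US"}, {"country": "DE"}, {"country": ""}, {"country": "US"},
]
print(profile(orders, "country"))  # {'rows': 4, 'nulls': 1, 'distinct': 2}
```

Profiles like this are what a data catalog surfaces so that engineers can spot incomplete or suspicious columns before they feed downstream analytics.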
Examples of Data Engineering Projects
A data engineer is the backbone of any organization. Because technology is always changing, the types of projects you can work on are diverse. Below are some examples a typical data engineer may work on:
- Data aggregation
- Website monitoring
- Real-time data analytics
  - Event data analysis
- Smart IoT infrastructure
- Shipping and distribution demand forecasting
- Virtual chatbots
- Loan prediction
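The first project type above, data aggregation, reduces to grouping records by a key and summarizing each group. Here is a minimal sketch rolling hypothetical event records up into per-day counts; the event data is invented for illustration.

```python
# Data aggregation sketch: roll up event records into per-day counts.
# The events below are illustrative assumptions.
from collections import Counter

events = [
    {"day": "2024-01-01", "type": "click"},
    {"day": "2024-01-01", "type": "view"},
    {"day": "2024-01-02", "type": "click"},
]

# Count events per day; a warehouse would express this as GROUP BY day.
per_day = Counter(e["day"] for e in events)
print(dict(per_day))  # {'2024-01-01': 2, '2024-01-02': 1}
```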
What Are Key Traits of a Successful Data Engineer?
Being a great data engineer goes beyond technical skills and advanced degrees. Having the right personality is just as important. A career in data engineering can be rewarding. It can also be overwhelming. Here are five key traits of a successful data engineer:
- Curious . Data engineers must keep up with the latest trends surrounding technology. Things change fast and you need to be able to quickly learn new tools. You should be eager to learn and always ask, “Why?”
- Flexible . There is constant change in the data industry. Data engineers should be comfortable with changing priorities.
- Problem-solver . Data engineers test and maintain the data architecture that they design. They also look for ways to improve the data processes. This requires a mind for creative problem solving.
- Multi-tasker . Not all data engineers come from a computer science or data science background. Other fields include statistics and computer engineering. It helps to be well-versed in all data topics. It is also key to know how to use tools focused on automation.
- Strong communicator . Data engineers are an important part of the data team. They must be comfortable explaining concepts to non-technical and technical stakeholders at every level.
What Does Modern Data Engineering Look Like?
As technological advancement makes overall data processing simpler, data requirements are getting more complex. Successful data engineering must take advantage of modern technologies to make processes scalable, reusable and adaptable. To do this well, businesses need a solution for cloud data warehouses and data lakes across multiple cloud service providers and on-premises systems to meet all data processing needs.
Informatica offers a cloud-native, end-to-end data management platform with the Intelligent Data Management Cloud (IDMC). Its capabilities meet almost any data management need and simplify your tasks through automation using CLAIRE, its AI-driven engine. IDMC can empower data engineers to:
- Build a foundation for analytics, AI and data science initiatives
- Support modern data engineering trends and frameworks
- Gain elastic scale to meet business demands and control costs
- Choose any cloud services at any time across IDMC as your requirements change with Informatica Processing Units (IPU)
- Solve problems and inform critical business decisions that accelerate innovation
- See how IDMC’s data management architecture for data engineering enables organizations to control business data, both in the cloud and in a combination of on-premises and cloud applications.
- Learn more about how Informatica can help you build intelligent data engineering for AI and advanced analytics.
1 How Much Data Is Created Every Day in 2022? [NEW Stats] (earthweb.com)
4 The New York Times, OnTech with Shira Ovide, June 14, 2022
Modernize your business domains with advanced data mesh architecture
From chaos to clarity: unleash powerful, game-changing insights through a centralized data lake, creating new revenue streams by enabling effective data monetization with a digital banking data platform.
Data infrastructure modernization reduces cloud costs by 50%
Sigmoid designed and implemented a reference architecture for unified trade to modernize the legacy data infrastructure which enhanced operational efficiency and fast-tracked computation of business credit scores.
Cash forecasting accuracy improved up to 95%
Sigmoid created a customized cloud-based cash forecasting solution using advanced ML algorithms to estimate future cash flows from multiple markets and optimize cash for the current year. The solution enabled the Global Operations team to identify opportunities for unlocking cash from working capital and streamline financial planning for the entire business.
Demand forecasting solution with 10+ market disruption indicators, delivers prediction accuracy of 92%
Sigmoid provided a customized demand forecasting solution utilizing advanced ML algorithms, incorporating market indicators and macroeconomic parameters. The solution enabled swift responses to market dynamics resulting in 80% faster time to insights.
Data monetization with a scalable and configurable platform for financial institutions unlocks new revenue streams
Sigmoid developed a data mesh architecture on GCP to consolidate data from multiple sources and develop data products which helped our client monetize their data and reduce the time required to commission new customers by 90%.
Global eCommerce insights hub improves demand planning, leading to 5% uplift in channel sales
Sigmoid developed a centralized data insights hub for global eCommerce as a Single Source of Truth (SSOT) for improved reporting and planning. This enabled a control tower that helped the analytics team gain visibility across the eCommerce channels leading to improved planning and sales.
Data mesh architecture enables data-as-a-product and lowers inventory costs by 15%
Sigmoid developed a data mesh architecture for modernizing enterprise data to enable next-gen analytics use cases that helped the client improve sales and inventory.
Assortment recommendation engine maps the right products with outlets leading to 14% sales growth
Sigmoid developed an assortment optimization solution with customized prediction and recommendation models using Databricks to help a leading alcoholic beverages company improve customer experience by placing the right products at the right outlets and drive sales.
Centralized data lake and automated data pipelines drive real-time KPI monitoring through a Supply Chain Control Tower
Sigmoid optimized logistics management by building a centralized data lake with automated data pipelines to facilitate fast and error-free reporting and improve dashboard performance for crucial supply chain KPIs in real-time.
Omnichannel marketing data hub optimized campaign execution resulting in a 5% uplift in lead conversion
Sigmoid built a centralized marketing data platform on Azure with automated data pipelines to optimize marketing campaigns’ ROI for a leading medical technology company.
ML based image processing with Google Vision delivers deep insights into consumer buying patterns
Sigmoid developed an image analytics based solution powered by deep learning to help an American multinational F100 consumer products manufacturer gain insights into consumer buying patterns, product preferences, and brand penetration.
ML based targeting and campaign optimization increases ROAS on Amazon by 10x
A kids nutrition brand partnered with Sigmoid to increase spend efficiency and accelerate sales on Amazon by identifying the right target audience and optimizing campaign settings like budgets, bid value, and ad frequency for each campaign.
Best practices to manage data and services on Azure optimizes cloud costs by 25%
Sigmoid evaluated the data workloads, subscriptions, and resource groups on the Azure cloud and implemented best practices such as serverless architecture and autoscaling. In the first month, a 25% reduction in cloud subscription costs across development, QA, and production environments was observed.
Real-time integration of IoT sensor data with automated data pipelines for effective logistics management
Sigmoid optimized logistics management by re-architecting the data platform and building robust and scalable data pipelines which resulted in faster data collection, improved data quality, and seamless pipeline extensibility.
Data integration & standardization on Google cloud improves time to insight by 25%
Sigmoid developed a centralized and scalable data warehouse on Google Cloud to optimize data-driven decision-making. A single source of truth for multiple data sources was created to address legacy data warehouse challenges of data discoverability, governance, and security.
Data migration from on prem systems to Snowflake reduces time to insights by 10X
Sigmoid optimized Snowflake's performance by implementing best practices for data migration, resulting in a 10X improvement in the efficiency of data pipelines.
Data hub on AWS enables near real time tracking of 20+ KPIs to improve sales performance and customer satisfaction
We created a data hub on AWS for field sales, regional sales and marketing leaders across multiple levels of hierarchy, giving them access to 20+ sales KPIs in near real time.
Generating near real-time insights for a restaurant chain on a balanced scorecard powered by a LCNC data platform
Sigmoid leveraged cloud-agnostic, Low Code/No Code (LC/NC) tools to create automated data pipelines that generate insights on 70+ key performance metrics.
Data standardization from 20 ERPs into Snowflake and BI platform development for faster access to 50+ business KPIs
An enterprise-level unified BI platform that enabled near real-time tracking and analysis of 50+ business KPIs.
Automatic preventive maintenance framework to detect faults and reduce scheduled maintenance costs
Developed an automatic preventive maintenance framework to detect failure patterns on hydroforming presses for timely maintenance.
Centralized portal for BI results in seamless access to 30K+ reports and dashboards with faster access to insights
Sigmoid developed a centralized data portal with report cataloging functionality which led to faster report discovery across teams, enhanced collaboration between teams and faster time-to-market.
ML-based assortment lifecycle solution increases market share by 0.8%
Developed assortment lifecycle intelligence to optimize overall investment in products to focus on key categories and high potential products
ML-based demand forecasting model to reduce data run-time by 20x
Developed an ML-based demand forecasting model to improve forecast accuracy and enhance visibility in the supply chain
Centralized AI deployment environment for a CPG brand reduced time to scale ML models by 85%
Built AI deployment environment for faster deployment of ML models across departments
Campaign optimization for a CPG brand resulted in a 25% improvement in ROAS
Built a robust automated solution to suggest keyword recommendations and campaign strategy changes
Automating Financial Crime Compliance analysis for a leading investment bank
Improved risk assessment time and enabled near real-time flagging of data anomalies
Data migration to Snowflake for 5x faster web analytics and insights visualization on DOMO
Automated the process of data migration to Snowflake and visualization on DOMO to analyze digital data
Cloud optimization led to a 54% reduction in cost for a leading CPG company
Cloud monitoring and optimization to reduce cost, plus custom dashboards for granular visibility into billing patterns
ML-driven recommendation to power real-time sales analytics
Building and maintaining robust data pipelines to make quality datasets ready for ML use cases
Building data pipelines to make quality datasets ready for ML use cases
ML-based Consumer Profiling and Segmentation for Improved Marketing Spend
Customer segmentation leading to targeted and tailored marketing communications
Centralized Data Lake for Real-time Analytics and Reporting
Automated data ingestion from 30+ diverse sources and enabled 2.5x faster time to insights for marketing team
Automated Production Schedule for Manufacturing Plants
Built an automated master production schedule for multiple manufacturing plants to improve agility and efficiency in resource allocation.
Social Media Analytics Accelerates Product Innovation
NLP-based social media analytics helped identify pain-passion points of customers and accelerate product innovation.
Single Source of Truth for Real-time Marketing Campaign Optimization
Created a set of 10 dashboards to be used by different business teams for real-time campaign reporting, optimization, and analysis
90% Improvement In ML Model Run Time Using MLOps
Improved model performance, reduced model run time, increased scalability and reduced cost of model deployment using MLOps
Improving Dashboard Query Performance by 10X
Re-architected efficient and near real-time ETL data pipeline while enabling centralized monitoring of dashboards
AI-powered data imputation boosts sales performance
Sigmoid leveraged a neural network-based deep learning library to predict the missing data values with up to 98% accuracy to help a semiconductor manufacturer boost its sales performance.
Data Standardization to Track Data Anomalies for Operational Efficiency
Built automated reports and visualized high-quality data on a Tableau dashboard
2.5% improvement in OEE of machines
Built and automated an AI system to improve the overall equipment effectiveness (OEE)
80% performance improvements in data pipelines
Built scalable and high performing data pipelines in an Azure data platform
Churn Analytics Improves Customer Retention by 70%
Predictive machine learning models to determine customers likely to churn
15% lift in new user conversion using MTA and MMM
Real-Time Demand Generation Attribution
Automated Data Ingestion from 10+ retailers for near real-time insights into sales trends for a CPG company
Data Lake creation
250TB+ data processed for faster customer analytics
Customer Analytics and Data Warehousing
70% better accuracy for demand forecasting
65% cost savings with efficient Cloud Migration to GCP
100MN+ rows of data per day processed for improved trade surveillance
7% sales uplift using 1:1 personalized email marketing
1:1 Personalized Marketing
Production-ready system by building and integrating scalable ML models
Productionize Demand Forecasting Models
8% profitability boost through personalized recommendation engine
11% improvement in marketing campaigns using MTA
24×7 monitoring and support using highly available, robust systems
Improved Data Pipeline Availability
11% match rate improvement with adaptive identity graphs
30% QoQ revenue uplift with price optimization engine
100MN+ personalized emails sent by productionizing MAB model
Productionize Personalized Marketing Models
33% improvement in ROMI using CLTV models
85% reduction in false positives using real-time fraud detection
Faster and more accurate settlements through automation
Property Claim Estimation
Enhanced transparency and visualization of millions of customer data points
Unified interactive analytics and external reporting for enhanced transparency
Unified Analytics Platform
15% reduction in premium using ML-backed underwriting system
Group Risk Scoring of Patients
79% accuracy of the retargeting model
Faster analysis on 250TB+ data and creation of a SSOT
Single Source of Truth Creation
Automatic bidding with 80,000+ coefficients
80% precision improvement using ML-based approach to lead buying
Professional Data Engineer
Certification exam guide
A Professional Data Engineer makes data usable and valuable for others by collecting, transforming, and publishing data. This individual evaluates and selects products and services to meet business and regulatory requirements. A Professional Data Engineer creates and manages robust data processing systems. This includes the ability to design, build, deploy, monitor, maintain, and secure data processing workloads.
Section 1: Designing data processing systems
1.1 Designing for security and compliance. Considerations include:
● Identity and Access Management (e.g., Cloud IAM and organization policies)
● Data security (encryption and key management)
● Privacy (e.g., personally identifiable information, and Cloud Data Loss Prevention API)
● Regional considerations (data sovereignty) for data access and storage
● Legal and regulatory compliance
1.2 Designing for reliability and fidelity. Considerations include:
● Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)
● Monitoring and orchestration of data pipelines
● Disaster recovery and fault tolerance
● Making decisions related to ACID (atomicity, consistency, isolation, and durability) compliance and availability
● Data validation
1.3 Designing for flexibility and portability. Considerations include:
● Mapping current and future business requirements to the architecture
● Designing for data and application portability (e.g., multi-cloud and data residency requirements)
● Data staging, cataloging, and discovery (data governance)
1.4 Designing data migrations. Considerations include:
● Analyzing current stakeholder needs, users, processes, and technologies and creating a plan to get to desired state
● Planning migration to Google Cloud (e.g., BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance, Google Cloud networking, Datastream)
● Designing the migration validation strategy
● Designing the project, dataset, and table architecture to ensure proper data governance
Section 2: Ingesting and processing the data
2.1 Planning the data pipelines. Considerations include:
● Defining data sources and sinks
● Defining data transformation logic
● Networking fundamentals
● Data encryption
2.2 Building the pipelines. Considerations include:
● Data cleansing
● Identifying the services (e.g., Dataflow, Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, Apache Spark, Hadoop ecosystem, and Apache Kafka)
○ Streaming (e.g., windowing, late arriving data)
○ Ad hoc data ingestion (one-time or automated pipeline)
● Data acquisition and import
● Integrating with new data sources
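The streaming considerations above (windowing, late-arriving data) can be illustrated with a toy tumbling-window counter. This is a plain-Python sketch of the concept only, not the API of Dataflow, Beam, or any other service listed; the tuple layout and parameter names are made up for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s, allowed_lateness_s):
    """Count events per tumbling window, keyed by event time.

    events: iterable of (event_time_s, processing_time_s) tuples.
    An event is dropped as 'too late' when its processing time exceeds
    the end of its window plus the allowed lateness.
    """
    counts = defaultdict(int)
    dropped = 0
    for event_time, processing_time in events:
        window_start = (event_time // window_size_s) * window_size_s
        window_end = window_start + window_size_s
        if processing_time > window_end + allowed_lateness_s:
            dropped += 1  # too late even for the grace period
            continue
        counts[window_start] += 1
    return dict(counts), dropped

# Three events in the 0-60s window; the last one arrives ~110s after
# the window closed and falls outside the 30s lateness allowance.
counts, dropped = tumbling_window_counts(
    [(5, 6), (30, 32), (59, 170)], window_size_s=60, allowed_lateness_s=30)
```

Real streaming engines add triggers, watermarks, and state cleanup on top of this basic idea, but the trade-off is the same: a longer allowed lateness captures more late data at the cost of holding windows open longer.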
2.3 Deploying and operationalizing the pipelines. Considerations include:
● Job automation and orchestration (e.g., Cloud Composer and Workflows)
● CI/CD (Continuous Integration and Continuous Deployment)
Section 3: Storing the data
3.1 Selecting storage systems. Considerations include:
● Analyzing data access patterns
● Choosing managed services (e.g., Bigtable, Cloud Spanner, Cloud SQL, Cloud Storage, Firestore, Memorystore)
● Planning for storage costs and performance
● Lifecycle management of data
3.2 Planning for using a data warehouse. Considerations include:
● Designing the data model
● Deciding the degree of data normalization
● Mapping business requirements
● Defining architecture to support data access patterns
3.3 Using a data lake. Considerations include:
● Managing the lake (configuring data discovery, access, and cost controls)
● Processing data
● Monitoring the data lake
3.4 Designing for a data mesh. Considerations include:
● Building a data mesh based on requirements by using Google Cloud tools (e.g., Dataplex, Data Catalog, BigQuery, Cloud Storage)
● Segmenting data for distributed team usage
● Building a federated governance model for distributed data systems
Section 4: Preparing and using data for analysis
4.1 Preparing data for visualization. Considerations include:
● Connecting to tools
● Precalculating fields
● BigQuery materialized views (view logic)
● Determining granularity of time data
● Troubleshooting poor performing queries
● Identity and Access Management (IAM) and Cloud Data Loss Prevention (Cloud DLP)
4.2 Sharing data. Considerations include:
● Defining rules to share data
● Publishing datasets
● Publishing reports and visualizations
● Analytics Hub
4.3 Exploring and analyzing data. Considerations include:
● Preparing data for feature engineering (training and serving machine learning models)
● Conducting data discovery
Section 5: Maintaining and automating data workloads
5.1 Optimizing resources. Considerations include:
● Minimizing costs per required business need for data
● Ensuring that enough resources are available for business-critical data processes
● Deciding between persistent or job-based data clusters (e.g., Dataproc)
5.2 Designing automation and repeatability. Considerations include:
● Creating directed acyclic graphs (DAGs) for Cloud Composer
● Scheduling jobs in a repeatable way
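The DAG idea behind Cloud Composer scheduling can be sketched with Python's standard-library `graphlib`. A real Composer DAG would use the Airflow API; this is only a conceptual illustration, and the task names are invented.

```python
from graphlib import TopologicalSorter

# Toy pipeline DAG: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "clean":   {"extract"},
    "load":    {"clean"},
    "report":  {"load"},
}

def run_order(dag):
    """Return one valid execution order for the DAG's tasks."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(dag)  # for this chain: extract, clean, load, report
```

Because the graph is acyclic, a scheduler can always find such an order, and independent branches can run in parallel; that property is exactly what makes DAGs the standard model for repeatable pipeline automation.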
5.3 Organizing workloads based on business requirements. Considerations include:
● Flex, on-demand, and flat rate slot pricing (index on flexibility or fixed capacity)
● Interactive or batch query jobs
5.4 Monitoring and troubleshooting processes. Considerations include:
● Observability of data processes (e.g., Cloud Monitoring, Cloud Logging, BigQuery admin panel)
● Monitoring planned usage
● Troubleshooting error messages, billing issues, and quotas
● Managing workloads, such as jobs, queries, and compute capacity (reservations)
5.5 Maintaining awareness of failures and mitigating impact. Considerations include:
● Designing system for fault tolerance and managing restarts
● Running jobs in multiple regions or zones
● Preparing for data corruption and missing data
● Data replication and failover (e.g., Cloud SQL, Redis clusters)
Case Studies #
- Data Science @Airbnb
- Data Science @Amazon
- Data Science @Baidu
- Data Science @Blackrock
- Data Science @BMW
- Data Science @Booking.com
- Data Science @CERN
- Data Science @Disney
- Data Science @DLR
- Data Science @Drivetribe
- Data Science @Dropbox
- Data Science @Ebay
- Data Science @Expedia
- Data Science @Facebook
- Data Science @Google
- Data Science @Grammarly
- Data Science @ING Fraud
- Data Science @Instagram
- Data Science @LinkedIn
- Data Science @Lyft
- Data Science @NASA
- Data Science @Netflix
- Data Science @OLX
- Data Science @OTTO
- Data Science @Paypal
- Data Science @Pinterest
- Data Science @Salesforce
- Data Science @Siemens Mindsphere
- Data Science @Slack
- Data Science @Spotify
- Data Science @Symantec
- Data Science @Tinder
- Data Science @Twitter
- Data Science @Uber
- Data Science @Upwork
- Data Science @Woot
- Data Science @Zalando
How I do Case Studies #
Data Science at Airbnb #
Podcast Episode #063: Data Engineering at Airbnb Case Study. How is Airbnb doing data engineering? Let’s check it out. (Watch on YouTube / Listen on Anchor)
Airbnb Engineering Blog: https://medium.com/airbnb-engineering
Data Infrastructure: https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
Scaling the serving tier: https://medium.com/airbnb-engineering/unlocking-horizontal-scalability-in-our-web-serving-tier-d907449cdbcf
Druid Analytics: https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c
Spark Streaming for logging events: https://medium.com/airbnb-engineering/scaling-spark-streaming-for-logging-event-ingestion-4a03141d135d
Druid Wiki: https://en.wikipedia.org/wiki/Apache_Druid
Data Science at Amazon #
https://www.datasciencecentral.com/profiles/blogs/20-data-science-systems-used-by-amazon-to-operate-its-business
https://aws.amazon.com/solutions/case-studies/amazon-migration-analytics/
Data Science at Baidu #
Data Science at Blackrock #
Data Science at BMW #
Data Science at Booking.com #
Podcast Episode #064: Data Engineering at Booking.com Case Study. How is Booking.com doing data engineering? Let’s check it out. (Watch on YouTube / Listen on Anchor)
Kafka Architecture: https://data-flair.training/blogs/kafka-architecture/
Confluent Platform: https://www.confluent.io/product/confluent-platform/
Data Science at CERN #
Podcast Episode #065: Data Engineering at CERN Case Study. How is CERN doing data engineering? They must get huge amounts of data from the Large Hadron Collider. Let’s check it out. (Watch on YouTube / Listen on Anchor)
Data Science at Disney #
Data Science at DLR #
Data Science at Drivetribe #
Data Science at Dropbox #
Data Science at Ebay #
Data Science at Expedia #
Data Science at Facebook #
Data Science at Google #
http://www.unofficialgoogledatascience.com/
https://ai.google/research/teams/ai-fundamentals-applications/
https://cloud.google.com/solutions/big-data/
https://datafloq.com/read/google-applies-big-data-infographic/385
Data Science at Grammarly #
Data Science at ING Fraud #
Data Science at Instagram #
Data Science at LinkedIn #
Podcast Episode #073: Data Engineering at LinkedIn Case Study. Let’s check out how LinkedIn is processing data :) (Watch on YouTube / Listen on Anchor)
Data Science at Lyft #
Data Science at NASA #
Podcast Episode #067: Data Engineering at NASA Case Study. A look into how NASA is doing data engineering. (Watch on YouTube / Listen on Anchor)
Data Science at Netflix #
Podcast Episode #062: Data Engineering at Netflix Case Study. How Netflix is doing data engineering using their Keystone platform. (Watch on YouTube / Listen on Anchor)
Netflix revolutionized how we watch movies and TV. Currently over 75 million users watch 125 million hours of Netflix content every day!
Netflix's revenue comes from a monthly subscription service. So, the goal for Netflix is to keep you subscribed and to get new subscribers.
To achieve this, Netflix is licensing movies from studios as well as creating its own original movies and TV series.
But offering new content is not everything. It is also very important to keep you watching content that already exists.
To recommend content, Netflix collects data from its users. And it collects a lot.
Currently, Netflix analyses about 500 billion user events per day. That results in a stunning 1.3 petabytes of data every day.
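As a rough sanity check on those figures (assuming decimal petabytes), the implied average event size works out to about 2.6 KB:

```python
events_per_day = 500e9                # ~500 billion user events per day
bytes_per_day = 1.3e15                # ~1.3 petabytes per day (decimal)

# Average payload per event, in bytes.
avg_event_size = bytes_per_day / events_per_day
print(round(avg_event_size))          # ~2600 bytes, i.e. ~2.6 KB per event
```

A few kilobytes per event is plausible for JSON-style telemetry with session and device metadata attached, which makes the two headline numbers consistent with each other.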
All this data allows Netflix to build recommender systems for you. The recommenders are showing you content that you might like, based on your viewing habits, or what is currently trending.
The Netflix batch processing pipeline #
When Netflix started out, they had a very simple batch processing system architecture.
The key components were Chukwa, a scalable data collection system; Amazon S3; and Elastic MapReduce.
Chukwa wrote incoming messages into Hadoop sequence files stored in Amazon S3. These files could then be analysed by Elastic MapReduce jobs.
Jobs were executed regularly on an hourly and daily basis. As a result, Netflix could learn how people used the service every hour or once a day.
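The hourly and daily aggregation such a batch pipeline produces can be sketched as a toy map-reduce-style count. This is plain Python, not actual Elastic MapReduce code, and the timestamps and title IDs are made up:

```python
from collections import Counter

def hourly_view_counts(events):
    """Aggregate raw view events into per-hour counts (a toy batch job).

    events: list of (timestamp_s, title_id) tuples.
    """
    # 'map' step: key each event by its hour bucket
    keyed = ((ts // 3600, title) for ts, title in events)
    # 'reduce' step: count events per (hour, title) key
    return Counter(keyed)

# One view in hour 0, three in hour 1 (two for t1, one for t2).
counts = hourly_view_counts([(10, "t1"), (3700, "t1"), (3800, "t2"), (3900, "t1")])
```

In a real MapReduce job the map and reduce steps would run distributed over many machines, but the key-then-aggregate structure is the same.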
Know what customers want #
Because you are looking at the big picture you can create new products. Netflix uses insight from big data to create new TV shows and movies.
They created House of Cards based on data. There is a very interesting TED talk about this that you should watch:
How to use data to make a hit TV show | Sebastian Wernicke
Batch processing also helps Netflix to know the exact episode of a TV show that gets you hooked. Not only globally but for every country where Netflix is available.
Check out the article from TheVerge
They know exactly what show works in what country and what show does not.
It helps them create shows that work everywhere, or select which shows to license in different countries. Germany, for instance, does not have the full library that Americans have :(
We have to put up with only a small portion of the TV shows and movies. If you have to select, why not select those that work best?
Batch processing is not enough #
As a data platform for generating insight, the Chukwa pipeline was a good start. It is very important to be able to create hourly and daily aggregated views of user behavior.
To this day Netflix is still doing a lot of batch processing jobs.
The only problem is: with batch processing you are basically looking into the past.
For Netflix, and data-driven companies in general, looking into the past is not enough. They want a live view of what is happening.
The trending now feature #
One of the newer Netflix features is "Trending Now". To the average user, it looks like "Trending Now" simply means the most-watched titles right now.
That is what gets displayed as trending for me as I write this, on a Saturday morning at 8:00 in Germany. But it is so much more.
What is currently being watched is only a part of the data that is used to generate "Trending Now".
"Trending now" is created based on two types of data sources: Play events and Impression events.
What messages these two types actually include is not really communicated by Netflix. I did some research on the Netflix Techblog, and this is what I found out:
Play events include which title you watched last, where you stopped watching, where you used the 30-second rewind, and more. Impression events are collected as you browse the Netflix library: scrolling up and down, scrolling left or right, clicking on a movie, and so on.
Basically, play events log what you do while you are watching. Impression events capture what you do on Netflix while you are not watching something.
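The two event families could be modeled roughly like this. The field names are assumptions for illustration only, since Netflix's actual message formats are not public:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlayEvent:
    """Logged while the user is watching a title (hypothetical schema)."""
    user_id: str
    title_id: str
    position_s: int      # where playback currently is, in seconds
    action: str          # e.g. "stop", "rewind_30s"

@dataclass
class ImpressionEvent:
    """Logged while the user browses the library (hypothetical schema)."""
    user_id: str
    action: str                      # e.g. "scroll_down", "click_title"
    title_id: Optional[str] = None   # only set when a title was interacted with

play = PlayEvent(user_id="u1", title_id="t42", position_s=1800, action="rewind_30s")
browse = ImpressionEvent(user_id="u1", action="scroll_down")
```

Separating the two schemas matters downstream: play events feed viewing-history features, while impression events capture interest that never converted into a play.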
Netflix real-time streaming architecture #
Netflix uses three internet-facing services to exchange data with clients' browsers or mobile apps. These services are simple Apache Tomcat-based web services.
The service for receiving play events is called "Viewing History". Impression events are collected by the "Beacon" service.
The "Recommender Service" makes recommendations based on trend data available to clients.
Messages from the Beacon and Viewing History services are put into Apache Kafka, which acts as a buffer between the data services and the analytics.
Beacon and Viewing History publish messages to Kafka topics. The analytics system subscribes to these topics and gets the messages automatically delivered in a first-in, first-out fashion.
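The buffering pattern described here can be illustrated with a toy in-memory broker. This mimics only the publish/subscribe, first-in-first-out behavior; it is not the Kafka client API, and the topic and message contents are invented:

```python
from collections import defaultdict, deque

class ToyBroker:
    """Minimal in-memory stand-in for a Kafka-like topic buffer."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        """Append a message to the end of the topic's queue."""
        self.topics[topic].append(message)

    def consume(self, topic):
        """Deliver the oldest message first (FIFO), or None if empty."""
        queue = self.topics[topic]
        return queue.popleft() if queue else None

broker = ToyBroker()
broker.publish("viewing-history", {"user": "u1", "event": "play"})
broker.publish("viewing-history", {"user": "u1", "event": "stop"})
first = broker.consume("viewing-history")  # the 'play' event comes out first
```

The point of the buffer is decoupling: the web services can keep accepting events at full speed even when the analytics consumers fall behind, because the queue absorbs the difference.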
After the analytics, the workflow is straightforward. The trending data is stored in a Cassandra key-value store. The recommender service has access to Cassandra and makes the data available to the Netflix client.
The algorithms the analytics system uses to process all this data are not known to the public. They are a Netflix trade secret.
What is known is the analytics tool they use. Back in February 2015 they wrote in the tech blog that they use a custom-made tool.
They also stated that Netflix was going to replace the custom-made analytics tool with Apache Spark Streaming in the future. My guess is that they made the switch to Spark some time ago, because their post is more than a year old.
Data Science at OLX #
Podcast Episode #083: Data Engineering at OLX Case Study. This podcast is a case study about OLX with Senior Data Scientist Alexey Grigorev as a guest. It was super fun. (Watch on YouTube / Listen on Anchor)
Data Science at OTTO #
Data Science at Paypal #
Data Science at Pinterest #
Podcast Episode #069: Engineering Culture at Pinterest. In this podcast we look into the data platform and processing at Pinterest. (Watch on YouTube / Listen on Anchor)
Data Science at Salesforce #
Data Science at Siemens Mindsphere #
Podcast Episode #059: What Is the Siemens Mindsphere IoT Platform? The Internet of Things is a huge deal, and there are many platforms available. But which one is actually good? Join me on a 50-minute dive into the Siemens Mindsphere online documentation. I have to say I was super unimpressed by what I found: many limitations, an unclear architecture, and no pricing available? Not good! (Watch on YouTube / Listen on Anchor)
Data Science at Slack #
Data Science at Spotify #
Podcast Episode #071: Data Engineering at Spotify Case Study. In this episode we are looking at data engineering at Spotify, my favorite music streaming service. How do they process all that data? (Watch on YouTube / Listen on Anchor)
Data Science at Symantec #
Data Science at Tinder #
Data Science at Twitter #
Podcast Episode #072: Data Engineering at Twitter Case Study. How is Twitter doing data engineering? Oh man, they have a lot of cool things to share about these tweets. (Watch on YouTube / Listen on Anchor)
Data Science at Uber #
Data Science at Upwork #
Data Science at Woot #
Data Science at Zalando #
Podcast Episode #087: Data Engineering at Zalando Case Study Talk. I had a great conversation about data engineering for online retailing with Michal Gancarski and Max Schultze. They showed Zalando’s data platform and how they build data pipelines. Super interesting, especially for AWS users. (Watch on YouTube)
Do me a favor and give these guys a follow on LinkedIn:
LinkedIn of Michal: https://www.linkedin.com/in/michalgancarski/
LinkedIn of Max: https://www.linkedin.com/in/max-schultze-b11996110/
Zalando has a tech blog with more info, and there is also a meetup in Berlin:
Zalando Blog: https://jobs.zalando.com/tech/blog/
Next Zalando Data Engineering Meetup: https://www.meetup.com/Zalando-Tech-Events-Berlin/events/262032282/
AWS CDK: https://docs.aws.amazon.com/cdk/latest/guide/what-is.html
Delta Lake: https://delta.io/
AWS Step Functions: https://aws.amazon.com/step-functions/
AWS State Language: https://states-language.net/spec.html
YouTube channel of the meetup: https://www.youtube.com/channel/UCxwul7aBm2LybbpKGbCOYNA/playlists
Talk at Spark+AI Summit about Zalando's Processing Platform: https://databricks.com/session/continuous-applications-at-scale-of-100-teams-with-databricks-delta-and-structured-streaming
Talk at Strata London slides: https://databricks.com/session/continuous-applications-at-scale-of-100-teams-with-databricks-delta-and-structured-streaming
Data engineering case-study in digitalized manufacturing
- Case Studies
Explore the latest data journey stories from our customers.
Uncover how a popular restaurant chain partnered with phData to build an automated verbal order intake system with the power of ML and NLP.
Read the Case Study →
Global Investment Firm
Explore how a global investment firm teamed up with phData to create a process that detects and identifies data quality issues early with their existing data pipeline.
Healthcare Technology Provider
Discover how a healthcare revenue cycle technology & services company was able to move to Snowflake quickly without interrupting their processes with phData’s help.
Dive into this story of how phData helped a family-owned window manufacturer develop an actionable data strategy to help them migrate seamlessly to Snowflake.
Major Financial Institution
Learn how a major fortune 500 financial institution was able to automate and streamline its data by moving it from on-premise to the cloud with the help of phData.
Medical Device Manufacturer
Explore this incredible story of how a top medical device company was able to successfully migrate from Oracle to Snowflake with phData’s help.
Dive into this powerful tale of how phData was able to help a top-5 restaurant chain get ML models into production faster, more efficiently, and with less risk.
Logistics Software Company
Unpack this story of how phData helped migrate a leading logistics software company’s data into Snowflake, giving them a consolidated view of their data.
Medical Insurance Provider
Uncover how a medical insurance provider was able to better understand the effectiveness of its marketing efforts thanks to phData’s data science team.
Learn how phData helped a luxury automaker leverage Snowflake’s data platform and a custom ML framework to improve sales forecasting.
Household Goods Manufacturer
Explore how phData helped a major household goods manufacturer build a foundational, cloud-based data hub to better report on consumer interactions.
Uncover how phData helped a major mortgage lender build a unified analytics platform on Snowflake that leverages data to convert more leads into revenue.
Uncover how phData transformed a major industrial manufacturer’s existing sensor-based analytics platform into a more efficient, centralized IoT data solution.
Agribusiness and Dairy Company
Learn how a top dairy co-op was able to noticeably increase profits by having phData rewrite its Monte Carlo simulation as a simple Python application.
Pharmacy Benefits Manager
Discover how a PBM was able to successfully migrate its data warehouse to Snowflake and achieve massive improvements thanks to the help of phData.
Life Insurance Company
Read about how phData helped a top-10 U.S. life insurance company migrate from Cloudera to EMR, better positioning them for future growth and scale.
Outdoor Vehicle Manufacturer
Explore how in just three months, phData was able to deliver a modern, end-to-end ML solution to help an outdoor vehicle manufacturer better forecast demand.
Global Insurance Company
Discover how a global insurance company was able to save over $1M by migrating from an on-premise data environment to an AWS cloud-based platform.
Explore how a large CRM company gained better insights into its learning and training program by working with phData to create a new Tableau dashboard.
Uncover how a nationwide fast-food chain migrated several individual use cases from their existing Airflow stack to AWS Managed Workflows thanks to phData.
Discover how a medical manufacturer was able to make machine learning-backed decisions to predict manager performance with help from phData.
Consumer Packaged Goods Enterprise
Dive into this eye-opening story of how phData helped a major CPG enterprise develop an Analytics Center of Excellence with Tableau and Data Coach.
Journey into how phData helped a top life insurance company build an automated pipeline for continuous deployment of Airflow and the underlying infrastructure.
Fast Casual Restaurant Chain
Dive into this story of how a major fast-casual restaurant chain built a custom analytics solution in Alteryx to improve compliance.
Learn how phData helped a medical manufacturer deploy Dataiku on their cloud infrastructure in full compliance with corporate cloud and security standards.
Discover the ins and outs of how phData helped a PBM reimplement its data pipeline into Spark — all in a sustainable, reliable, and fully compliant way.
Oil & Gas Company
Learn how phData helped a large U.S. Oil & Gas company build an all-new data streaming architecture that continuously fuels safety monitoring with IoT.
Health Insurance Company
Learn how a health insurance company was able to move their internal ML capabilities to the cloud using AWS without losing their data or process.
Undergoing a large data center migration is not easy, but with phData’s help, a top-10 U.S. life insurance company was able to complete theirs in no time.
Learn how a major fortune 500 financial institution transitioned to an automated reporting process in Power BI and Snowflake thanks to phData’s expertise.
Get an intimate look at how a major door & window manufacturer migrated its marketing data from an Excel spreadsheet into Snowflake with phData.
Major US Manufacturer
Explore how a manufacturing company supercharged its BI by leaning to phData for dashboard development, design, and admin capabilities in Power BI.
Major Regional Bank
Experience how phData was able to help a major bank create automated, scalable pipelines for two pilot use cases within Snowflake in just 5 weeks.
Healthcare Marketplace and IT Provider
Unpack this story of how phData helped a Healthcare Marketplace and IT provider automate their backup and recovery process in Snowflake.
Learn how phData helped NextGen migrate their data to Snowflake, build a custom software solution, and enabled analytics and reporting for their customers.
Unpack this migration story of how a major Life Insurance company was able to migrate successfully to AWS from Cloudera thanks to phData’s help.
Legal Services Provider
Unpack this story of how a legal services provider leveraged phData’s data science services to accurately forecast staffing supply and demand.
Take a journey into this powerful story of how phData helped a CRM company build a use case for a single source of truth for their contact data.
Fast Food Chain
Learn how phData helped a fast food chain determine if Amazon Redshift’s automated machine learning capabilities would benefit their business.
Medical Device Maker
Explore how phData was able to help a Fortune 500 medical device maker streamline its data ecosystem while saving thousands of dollars.
B2B Software Company
Take a look into this story of how phData was able to help a large B2B SaaS company implement Snowflake, Airflow, and dbt.
Explore how phData guided a major CPG enterprise to improve its reporting capabilities by upgrading to Power BI Premium capacity.
Local Municipal Company
Unearth this story of how a local municipal company gained better insights into customer energy usage by leveraging Power BI and Snowflake.
Consumer Packaged Good Enterprise
Dive into this story of how a large CPG enterprise migrated successfully to Tableau Cloud from Server with phData’s migration services.
Explore how phData helped a CPG company build a forecast model that allowed for better planning for inventory, shipping, and production needs.
Dive into this story of how phData was able to help a massive marketing company migrate successfully to Snowflake from Hadoop using Snowpark.
Global Talent Company
Explore how phData guided a global talent company to merge its enterprise and subsidiary data through a modernized technology stack.
Financial Services Company
Explore how a large financial services company gained quality data insight by working with phData to turn data into customer investment opportunities.
Global Manufacturing Company
Unpack this story of how a global manufacturing company migrated to Snowflake in 30 weeks with no interruptions to the business.
Learn how phData leveraged Snowflake, dbt, and Fivetran to help a regional bank optimize its live boards for their Commercial Lending reporting.
Global Financial Firm
Venture into this story of how phData created an anomaly detection model on Azure ML for a large financial firm that helps them better track experimentation.
Healthcare Technology Company
Unpack this story of how a leading healthcare technology company leveraged Snowpark to reduce its daily data transformation process from 20 hours to 13 minutes.
Thriving Medical Company
Explore how a Medical company uses deep machine learning to accelerate advances in sleep apnea technology.
Major Trucking Company
Take a closer look at how phData helped a large trucking company migrate data from their legacy on-premise systems to Snowflake for better decision-making.
Professional Services Company
Explore how a large professional services firm that provides risk management, insurance brokerage, and human resources consulting services migrated to Snowflake, partnering with phData.
Premier Automotive Lender
Journey into this story of how a prominent automotive lender was able to migrate to Snowflake to unlock better analytics and restore trust in its data.
Engineering & Construction Company
Discover how one of North America's largest engineering & construction firms successfully migrated from Teradata to Snowflake thanks to phData’s help.
SaaS Provider for the Trade Industry
Unpack this story of how a celebrated SaaS provider for the trade industry leveraged dbt to standardize metrics & centralize KPI reporting.
Cancer Research & Treatment Organization
Explore this tale of how a cancer treatment & research organization leveraged Snowflake & Sigma to create a robust data reporting system.
Prominent Regional Bank
Discover how a major banking & financial company was able to use Experian data in Alteryx to reduce data processing costs.
Esteemed Healthcare Provider
Learn how an innovative healthcare company leveraged Alteryx & phData to wield sentiment analysis to improve media monitoring.
Innovative Electronics Manufacturing Company
Dive into this powerful story of how a leading electronics manufacturing company was able to upskill its team in Sigma with phData.
Established Financial Company
Embark on a financial giant's journey, slashing latency by 90% in 12 weeks with phData. Seamless ETL migration to Snowflake, unlocking a new era of data excellence.
Premier Global Title Services Company
Discover how a leading global title services company migrated to Snowflake to streamline operations & optimize growth.
Top 10 real-world data science case studies
Data science has become integral to modern businesses and organizations, driving decision-making, optimizing operations, and improving customer experiences. From predicting machine failures in manufacturing to personalizing healthcare treatments, data science is profoundly transforming industries.
Data science, often called the "most desirable job of the 21st century," is a multidisciplinary field that combines data analysis, machine learning, and domain knowledge to extract meaningful insights from data. It has far-reaching applications in diverse industries, revolutionizing how we solve problems and make decisions.
In this blog, we will delve into the top 10 real-world data science case studies that showcase the power and versatility of data-driven insights across various sectors.
Let’s dig in!
Table of Contents
- Case study 1: Predictive maintenance in manufacturing
  - 1. General Electric (GE)
  - 2. Siemens
- Case study 2: Healthcare diagnostics and treatment personalization
  - 1. IBM Watson Health
  - 2. PathAI
- Case study 3: Fraud detection and prevention in finance
  - 1. PayPal
  - 2. Capital One
- Case study 4: Urban planning and smart cities
  - 1. Singapore
  - 2. Barcelona
- Case study 5: E-commerce personalization and recommendation systems
  - 1. Amazon
  - 2. eBay
- Case study 6: Agricultural yield prediction
  - 1. John Deere
  - 2. Caterpillar Inc.
- Case study 7: Energy consumption optimization
  - 1. EnergyOptiUS
  - 2. CarbonSmart USA
- Case study 8: Transportation and route optimization
  - 1. Uber
  - 2. Lyft
- Case study 9: Natural language processing in customer service
  - 1. Zendesk
- Case study 10: Environmental conservation and data analysis
  - 1. NASA
  - 2. WWF
- Conclusion
Case study 1: Predictive maintenance in manufacturing
1. General Electric (GE)
General Electric (GE), a global industrial conglomerate, leverages data science to implement predictive maintenance solutions. By analyzing sensor data from their industrial equipment, such as jet engines and wind turbines, GE can predict the need for maintenance before a breakdown occurs. This proactive approach minimizes downtime and reduces maintenance costs.
Here’s how data science played a pivotal role in enhancing GE's manufacturing operations through predictive maintenance:
- In their aviation division, GE has reported up to a 30% reduction in unscheduled maintenance by utilizing predictive analytics on sensor data from jet engines.
- In the renewable energy sector, GE's wind turbines have seen a 15% increase in operational efficiency due to data-driven maintenance practices.
- Over the past year, GE saved $50 million in maintenance costs across various divisions thanks to predictive maintenance models.
2. Siemens
Siemens, another industrial giant, embraces predictive maintenance through data science. They use machine learning algorithms to monitor and analyze data from their manufacturing machines. This approach allows Siemens to identify wear and tear patterns and schedule maintenance precisely when required.
As a result, Siemens has achieved substantial cost savings and increased operational efficiency:
- Siemens has reported a remarkable 20% reduction in unplanned downtime across its manufacturing facilities globally since implementing predictive maintenance solutions powered by data science.
- Through data-driven maintenance, Siemens has achieved a 15% increase in overall equipment effectiveness (OEE), resulting in improved production efficiency and reduced production costs.
- In a recent case study, Siemens documented a $25 million annual cost savings in maintenance expenditures, directly attributed to their data science-based predictive maintenance approach.
Case study 2: Healthcare diagnostics and treatment personalization
1. IBM Watson Health
IBM Watson Health employs data science to enhance healthcare by providing personalized diagnostic and treatment recommendations. Watson's natural language processing capabilities enable it to sift through vast medical literature and patient records to assist doctors in making more informed decisions.
Data science has significantly aided IBM Watson Health in healthcare diagnostics and personalized treatment:
- IBM Watson Health has demonstrated a 15% increase in the accuracy of cancer diagnoses when assisting oncologists in analyzing complex medical data, including genomic information and medical journals.
- In a recent clinical trial, IBM Watson Health's AI-powered recommendations helped reduce the average time it takes to develop a personalized cancer treatment plan from weeks to just a few days, potentially improving patient outcomes and survival rates.
- Watson's data-driven insights have contributed to a 30% reduction in medication errors in some healthcare facilities by flagging potential drug interactions and allergies in patient records.
- IBM Watson Health has processed over 200 million pages of medical literature to date, providing doctors with access to a vast knowledge base that can inform their diagnostic and treatment decisions.
2. PathAI
PathAI utilizes machine learning algorithms to assist pathologists in diagnosing diseases more accurately. By analyzing digitized pathology images, PathAI's system can identify patterns and anomalies that the human eye might miss. This analysis speeds up the diagnostic process and enhances the precision of pathology reports by 6-9%, leading to better patient care.
Data science has been instrumental in PathAI's advancements in:
- PathAI's AI-driven pathology platform has shown a 25% improvement in diagnostic accuracy compared to traditional manual evaluations when identifying challenging cases like cancer subtypes or rare diseases.
- In a recent study involving over 10,000 pathology reports, PathAI's system helped pathologists reduce the time it takes to analyze and report findings by 50%, enabling quicker treatment decisions for patients.
- By leveraging machine learning, PathAI has been able to significantly decrease the rate of false negatives and false positives in pathology reports, resulting in a 20% reduction in misdiagnoses.
- PathAI's platform has processed millions of pathology images, making it a valuable resource for pathologists to access a vast repository of data to aid in their diagnostic decisions.
Case study 3: Fraud detection and prevention in finance
1. PayPal
PayPal, a leader in online payments, employs advanced data science techniques to detect and prevent fraudulent transactions in real time. They analyze transaction data, user behavior, and other relevant factors to identify suspicious activity.
Here's how data science has helped PayPal in this regard:
- PayPal's real-time fraud detection system reported an impressive 99.9% accuracy rate in identifying and blocking fraudulent transactions, minimizing financial losses for both the company and its users.
- In a recent report, PayPal reported that their proactive fraud prevention measures saved users an estimated $2 billion in potential losses due to unauthorized transactions in a single year.
- The average time it takes for PayPal's data science algorithms to detect and respond to a fraudulent transaction is just milliseconds, ensuring that fraudulent activities are halted before they can cause harm.
- PayPal's continuous monitoring and data-driven approach to fraud prevention have resulted in a 40% reduction in the overall fraud rate across their platform over the past three years.
2. Capital One
Capital One, a major player in the banking industry, relies on data science to combat credit card fraud. Their machine-learning models assess transaction patterns and historical data to flag potentially fraudulent activities. This assessment safeguards their customers and enhances their trust in the bank's services.
Here's how data science has helped Capital One in this regard:
- Capital One's data-driven fraud detection system has achieved an industry-leading fraud detection rate of 97%, meaning that it successfully identifies and prevents fraudulent transactions with a high level of accuracy.
- In the past year, Capital One has reported a $50 million reduction in fraud-related losses, thanks to their machine-learning models, which continuously evolve to adapt to new fraud tactics.
- The bank's real-time fraud detection capabilities allow them to stop fraudulent transactions in progress, with an average response time of less than 1 second, minimizing potential financial losses for both the bank and its customers.
- Customer surveys have shown that 94% of Capital One customers feel more secure about their financial transactions due to the bank's proactive fraud prevention measures, thereby enhancing customer trust and satisfaction.
Case study 4: Urban planning and smart cities
1. Singapore
Singapore is pioneering the smart city concept, using data science to optimize urban planning and public services. They gather data from various sources, including sensors and citizen feedback, to manage traffic flow, reduce energy consumption, and improve the overall quality of life in the city-state.
Here’s how data science helped Singapore in efficient urban planning:
- Singapore's real-time traffic management system, powered by data analytics, has led to a 25% reduction in peak-hour traffic congestion, resulting in shorter commute times and lower fuel consumption.
- Through its data-driven initiatives, Singapore has achieved a 15% reduction in energy consumption across public buildings and street lighting, contributing to significant environmental sustainability gains.
- Citizen feedback platforms have seen 90% of reported issues resolved within 48 hours, reflecting the city's responsiveness in addressing urban challenges through data-driven decision-making.
- The implementation of predictive maintenance using data science has resulted in a 30% decrease in the downtime of critical public infrastructure, ensuring smoother operations and minimizing disruptions for residents.
2. Barcelona
Barcelona has embraced data science to transform into a smart city as well. They use data analytics to monitor and control waste management, parking, and public transportation services. By doing so, Barcelona improves the daily lives of its citizens and makes the city more attractive for tourists and businesses.
Data science has significantly influenced Barcelona's urban planning and the development of smart cities, reshaping the urban landscape of this vibrant Spanish metropolis by:
- Barcelona's data-driven waste management system has led to a 20% reduction in the frequency of waste collection in certain areas, resulting in cost savings and reduced environmental impact.
- The implementation of smart parking solutions using data science has reduced the average time it takes to find a parking spot by 30%, easing congestion and frustration for both residents and visitors.
- Public transportation optimization through data analytics has improved service reliability, resulting in a 10% increase in daily ridership and reduced waiting times for commuters.
- Barcelona's efforts to become a smart city have attracted 30% more tech startups and foreign investments over the past five years, stimulating economic growth and job creation in the region.
Case study 5: E-commerce personalization and recommendation systems
1. Amazon
Amazon, the e-commerce giant, relies heavily on data science to personalize the shopping experience for its customers. They use algorithms to analyze customers' browsing and purchasing history, making product recommendations tailored to individual preferences. This approach has contributed significantly to Amazon's success and customer satisfaction.
Additionally, Amazon leverages data science for:
- Amazon's data-driven product recommendations have led to a 29% increase in average order value as customers are more likely to add recommended items to their carts.
- A study found that Amazon's personalized shopping experience has resulted in a 68% improvement in click-through rates on recommended products compared to non-personalized suggestions.
- Customer service response times have been reduced by 40% due to fewer inquiries related to product recommendations, as customers find what they need more easily.
- Amazon's personalized email campaigns, driven by data science, have shown an 18% higher open rate and a 22% higher conversion rate compared to generic email promotions.
2. eBay
eBay also harnesses the power of data science to enhance user experiences. Their recommendation systems suggest relevant products and optimize search results, increasing user engagement and sales. This data-driven approach has helped eBay remain competitive in the ever-evolving e-commerce landscape.
Data science also helped eBay in:
- eBay's recommendation algorithms have contributed to a 12% increase in average order value as customers are more likely to discover and purchase complementary products.
- The optimization of search results using data science has led to a 20% reduction in bounce rates on the platform, indicating that users are finding what they're looking for more effectively.
- eBay's personalized marketing campaigns, driven by data analysis, have achieved an 18% higher conversion rate compared to generic promotions, leading to increased sales and revenue.
- Over the past year, eBay's revenue has grown by 10%, outperforming many competitors, thanks in part to their data-driven enhancements to the user experience.
Case study 6: Agricultural yield prediction
1. John Deere
John Deere, a leader in agricultural machinery, implements data science to predict crop yields. By analyzing data from sensors on their farming equipment, weather data, and soil conditions, they provide farmers with valuable insights for optimizing planting and harvesting schedules. These insights enable farmers to increase crop yields while conserving resources.
Here’s how John Deere leverages data science:
- Farmers using John Deere's data science-based crop prediction system have reported an average 15% increase in crop yields compared to traditional farming methods.
- By optimizing planting and harvesting schedules based on data insights, farmers have achieved a 20% reduction in water usage, contributing to sustainable agriculture and resource conservation.
- John Deere's predictive analytics have reduced the need for chemical fertilizers and pesticides by 25%, resulting in cost savings for farmers and reduced environmental impact.
- Over the past five years, John Deere's data-driven solutions have helped farmers increase their overall profitability by $1.5 billion through improved crop yields and resource management.
2. Caterpillar Inc.
Caterpillar Inc., a construction and mining equipment manufacturer, applies data science to support the agriculture industry. They use machine learning algorithms to analyze data from heavy machinery in the field, helping farmers identify maintenance needs and prevent costly breakdowns during critical seasons.
Here’s how Caterpillar leverages data science:
- Farmers who utilize Caterpillar's data science-based maintenance system have experienced a 30% reduction in unexpected equipment downtime, ensuring that critical operations can proceed smoothly during peak farming seasons.
- Caterpillar's predictive maintenance solutions have resulted in a 15% decrease in overall maintenance costs, as equipment issues are addressed proactively, reducing the need for emergency repairs.
- By optimizing machinery maintenance schedules, farmers have achieved a 10% increase in operational efficiency, enabling them to complete tasks more quickly and effectively.
- Caterpillar's data-driven approach has contributed to a 20% improvement in the resale value of heavy machinery, as well-maintained equipment retains its value over time.
Case study 7: Energy consumption optimization
1. EnergyOptiUS
EnergyOptiUS specializes in optimizing energy consumption in commercial buildings. They leverage data science to monitor and control heating, cooling, and lighting systems in real time. By analyzing historical data and weather forecasts, they ensure energy efficiency while maintaining occupant comfort. Additionally, they leverage data science for:
- Buildings equipped with EnergyOptiUS's energy optimization solutions have achieved an average 20% reduction in energy consumption, leading to substantial cost savings for businesses and a reduced carbon footprint.
- Real-time monitoring and control of energy systems have resulted in a 15% decrease in maintenance costs, as equipment operates more efficiently and experiences less wear and tear.
- EnergyOptiUS's data-driven approach has led to a 25% improvement in occupant comfort, as temperature and lighting conditions are continuously adjusted to meet individual preferences.
- Over the past year, businesses using EnergyOptiUS's solutions have collectively saved $50 million in energy expenses, enhancing their overall financial performance and sustainability efforts.
2. CarbonSmart USA
CarbonSmart USA uses data science to assist businesses in reducing their carbon footprint. They provide actionable insights and recommendations based on data analysis, enabling companies to adopt more sustainable practices and meet their environmental goals. Additionally, CarbonSmart USA leverages data science to:
- Businesses that have partnered with CarbonSmart USA have, on average, reduced their carbon emissions by 15% within the first year of implementing recommended sustainability measures.
- Data-driven sustainability initiatives have led to $5 million in annual cost savings for companies through reduced energy consumption and waste reduction.
- CarbonSmart USA's recommendations have helped businesses collectively achieve a 30% increase in their sustainability ratings, enhancing their reputation and appeal to environmentally conscious consumers.
- Over the past five years, CarbonSmart USA's services have contributed to the reduction of 1 million metric tons of CO2 emissions, playing a significant role in mitigating climate change.
Case study 8: Transportation and route optimization
1. Uber
Uber revolutionized the transportation industry by using data science to optimize ride-sharing and delivery routes. Their algorithms consider real-time traffic conditions, driver availability, and passenger demand to provide efficient, cost-effective transportation services. Other use cases include:
- Uber's data-driven routing and matching algorithms have led to an average 20% reduction in travel time for passengers, ensuring quicker and more efficient transportation.
- By optimizing driver routes and minimizing detours, Uber has contributed to a 30% decrease in fuel consumption for drivers, resulting in cost savings and reduced environmental impact.
- Uber's real-time demand prediction models have helped reduce passenger wait times by 25%, enhancing customer satisfaction and increasing the number of rides booked.
- Over the past decade, Uber's data-driven approach has enabled 100 million active users to complete over 15 billion trips, demonstrating the scale and impact of their transportation services.
2. Lyft
Lyft, a competitor to Uber, also relies on data science to enhance ride-sharing experiences. They use predictive analytics to match drivers with passengers efficiently and reduce wait times. This data-driven approach contributes to higher customer satisfaction and driver engagement. Additionally:
- Lyft's data-driven matching algorithms have resulted in an average wait time reduction of 20% for passengers, ensuring faster and more convenient rides.
- By optimizing driver-passenger pairings, Lyft has seen a 15% increase in driver earnings, making their platform more attractive to drivers and reducing turnover.
- Lyft's predictive analytics for demand forecasting have led to 98% accuracy in predicting peak hours, allowing for proactive driver allocation and improved service quality during high-demand periods.
- Customer surveys have shown a 25% increase in overall satisfaction among Lyft users who have experienced shorter wait times and smoother ride-sharing experiences.
Case study 9: Natural language processing in customer service
1. Zendesk
Zendesk, a customer service software company, utilizes natural language processing (NLP) to enhance customer support. Their NLP algorithms can analyze and categorize customer inquiries, automatically routing them to the most suitable support agent. This results in faster response times and improved customer experiences. Furthermore:
- Zendesk's NLP-driven inquiry routing has led to a 40% reduction in average response times for customer inquiries, ensuring quicker issue resolution and higher customer satisfaction.
- Customer support agents using Zendesk's NLP tools have reported a 25% increase in productivity, as the technology assists in categorizing and prioritizing inquiries, allowing agents to focus on more complex issues.
- Zendesk's automated categorization of customer inquiries has resulted in a 30% decrease in support ticket misrouting, reducing the chances of issues falling through the cracks and ensuring that customers' needs are addressed promptly.
- Customer feedback surveys indicate a 15% improvement in overall satisfaction since the implementation of Zendesk's NLP-enhanced customer support, highlighting the positive impact on the customer experience.
Case study 10: Environmental conservation and data analysis
1. NASA
NASA collects and analyzes vast amounts of data to better understand Earth's environment and climate. Their satellite observations, climate models, and data science tools contribute to crucial insights about climate change, weather forecasting, and natural disaster monitoring.
Here’s how NASA leverages data science:
- NASA's satellite observations have provided essential data for climate research, contributing to a 0.15°C reduction in the uncertainty of global temperature measurements, and enhancing our understanding of climate change.
- Their climate models have helped predict the sea level rise with 95% accuracy, which is vital for coastal planning and adaptation strategies in the face of rising sea levels.
- NASA's data-driven natural disaster monitoring has enabled a 35% increase in the accuracy of hurricane track predictions, allowing for better preparedness and evacuation planning.
- Over the past decade, NASA's climate data and research have led to a 20% reduction in the margin of error in long-term climate projections, improving our ability to plan for and mitigate the impacts of climate change.
2. WWF
The World Wildlife Fund (WWF) employs data science to support conservation efforts. They use data to track endangered species, monitor deforestation, and combat illegal wildlife trade. By leveraging data, WWF can make informed decisions and drive initiatives to protect the planet's biodiversity. Additionally:
- WWF's data-driven approach has led to a 25% increase in the accuracy of endangered species tracking, enabling more effective protection measures for vulnerable wildlife populations.
- Their deforestation monitoring efforts have contributed to a 20% reduction in illegal logging rates in critical rainforest regions, helping to combat deforestation and its associated environmental impacts.
- WWF's data-driven campaigns and initiatives have generated $100 million in donations and grants over the past five years, providing crucial funding for conservation projects worldwide.
- By leveraging data science, WWF has successfully influenced policy changes in 15 countries, leading to stronger regulations against illegal wildlife trade and habitat destruction.
Data science is not just a buzzword; it's a transformative force that reshapes industries and improves our daily lives. The real-world case studies mentioned above illustrate the incredible potential of data science in diverse domains, from healthcare to agriculture and beyond.
As technology advances, we can expect even more innovative applications of data science that will continue to drive progress and innovation across various sectors.
Whether predicting machine failures, personalizing healthcare treatments, or optimizing energy consumption, data science is at the forefront of solving some of the world's most pressing challenges.
Turing's expert data scientists offer tailored, cutting-edge data science solutions across industries. With ethical data practices, scalable approaches, and a commitment to continuous improvement, Turing empowers organizations to harness the full potential of data science, driving innovation and progress in an ever-evolving technological landscape.
Talk to an expert today and join 900+ Fortune 500 companies and fast-scaling startups that have trusted Turing for their engineering needs.
Aditya is a content writer with 5+ years of experience writing for various industries including Marketing, SaaS, B2B, IT, and Edtech among others. You can find him watching anime or playing games when he’s not writing.
Frequently Asked Questions
Real-world data science case studies differ significantly from academic examples. While academic exercises often feature clean, well-structured data and simplified scenarios, real-world projects tackle messy, diverse data sources with practical constraints and genuine business objectives. These case studies reflect the complexities data scientists face when translating data into actionable insights in the corporate world.
Real-world data science projects come with common challenges. Data quality issues, including missing or inaccurate data, can hinder analysis. Domain expertise gaps may result in misinterpretation of results. Resource constraints might limit project scope or access to necessary tools and talent. Ethical considerations, like privacy and bias, demand careful handling.
Lastly, as data and business needs evolve, data science projects must adapt and stay relevant, posing an ongoing challenge.
Real-world data science case studies play a crucial role in helping companies make informed decisions. By analyzing their own data, businesses gain valuable insights into customer behavior, market trends, and operational efficiencies.
These insights empower data-driven strategies, aiding in more effective resource allocation, product development, and marketing efforts. Ultimately, case studies bridge the gap between data science and business decision-making, enhancing a company's ability to thrive in a competitive landscape.
Key takeaways from these case studies for organizations include the importance of cultivating a data-driven culture that values evidence-based decision-making. Investing in robust data infrastructure is essential to support data initiatives. Collaborating closely between data scientists and domain experts ensures that insights align with business goals.
Finally, continuous monitoring and refinement of data solutions are critical for maintaining relevance and effectiveness in a dynamic business environment. Embracing these principles can lead to tangible benefits and sustainable success in real-world data science endeavors.
Data science is a powerful driver of innovation and problem-solving across diverse industries. By harnessing data, organizations can uncover hidden patterns, automate repetitive tasks, optimize operations, and make informed decisions.
In healthcare, for example, data-driven diagnostics and treatment plans improve patient outcomes. In finance, predictive analytics enhances risk management. In transportation, route optimization reduces costs and emissions. Data science empowers industries to innovate and solve complex challenges in ways that were previously unimaginable.
BCG Data Engineering Interview Questions
The Data Set folder contains 6 CSV files. Please use the data dictionary (attached in the mail) to understand the dataset, then develop your approach to perform the analytics below.
1. Primary Person
The application should perform the analyses below and store the results of each analysis.
- Analysis 1: Find the number of crashes (accidents) in which the persons killed were male.
- Analysis 2: How many two-wheelers are booked for crashes?
- Analysis 3: Which state has the highest number of accidents involving females?
- Analysis 4: Which are the Top 5th to 15th VEH_MAKE_IDs that contribute to the largest number of injuries, including deaths?
- Analysis 5: For all the body styles involved in crashes, mention the top ethnic user group of each unique body style.
- Analysis 6: Among the crashed cars, what are the Top 5 ZIP codes with the highest number of crashes with alcohol as the contributing factor? (Use the driver's ZIP code.)
- Analysis 7: Count the distinct Crash IDs where no damaged property was observed, the damage level (VEH_DMAG_SCL~) is above 4, and the car has insurance.
- Analysis 8: Determine the Top 5 vehicle makes where drivers are charged with speeding-related offences, have licensed drivers, use the top 10 vehicle colours, and have cars licensed in the Top 25 states with the highest number of offences (to be deduced from the data).
- Develop an application that is modular and follows software engineering best practices (e.g. classes, docstrings, functions, config-driven, command-line executable through spark-submit).
- Code should be properly organized in folders as a project.
- Input data sources and outputs should be config-driven.
- Code should be developed strictly using the DataFrame APIs (do not use Spark SQL).
- Share the entire project as a zip or as a link to the project in a GitHub repo.
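To make the expected shape of one analysis concrete, here is a minimal pure-Python sketch of Analysis 2 on a toy sample. It is illustrative only: the real submission must use the Spark DataFrame APIs as required above, and the column names (CRASH_ID, VEH_BODY_STYL_ID) are assumptions in the style of a crash data dictionary, not confirmed names.

```python
import csv
import io

# Toy stand-in for the Units CSV; column names are assumed, not from the
# actual data dictionary.
UNITS_CSV = """CRASH_ID,VEH_BODY_STYL_ID
1,MOTORCYCLE
2,PASSENGER CAR
3,POLICE MOTORCYCLE
4,PICKUP
"""

def count_two_wheelers(csv_text):
    """Analysis 2 sketch: count crash records whose body style marks a two-wheeler."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sum("MOTORCYCLE" in row["VEH_BODY_STYL_ID"] for row in rows)

print(count_two_wheelers(UNITS_CSV))  # 2
```

In Spark the same logic would be a filter on the body-style column followed by a count, reading the input path and writing the result location from the config file.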
Clone the repo and follow these steps:
- Go to the project directory: $ cd BCG_Big_Data_Case_Study
- In a terminal, run $ make build . This builds the project to run via spark-submit: a new folder named "dist" is created, and the code artefacts are copied into it.
- $ cd dist && spark-submit --master "local[*]" --py-files src.zip --files config.yaml main.py && cd ..
Data engineers' primary job is to ingest data from various sources into a data lake. This data should be organized in the proper formats, processed as required, and stored securely within the provided storage capacity and hardware architecture. Data engineers are in charge of developing and maintaining data infrastructure and applications, and of setting up data warehouses and pipelines. In the following figure, you can see some of the most common tasks in a data engineer's daily life.
As you can see, proficiency with multiple technologies is required. But don't worry: in the following paragraphs, we'll help you understand what you need to know to be a competitive data engineer candidate and which fundamental topics you should pay attention to for an interview.
Generally speaking, most of the knowledge required to be a data engineer can be broken up into a few core categories. Let's examine these skills together.
1 Programming knowledge
First of all, it is assumed that you can work comfortably in at least one programming language. Statistics show that over 70% of data engineering jobs require knowledge of the Python programming language. If you do not have prior Python experience, we warmly recommend that you start mastering this popular and user-friendly language today. Other highly recommended skills are proficiency in SQL, Java, and Scala. Additionally, R, Ruby, and Perl are also popular in the world of data engineering. What do you need to pay special attention to when it comes to programming?
- Be familiar with data structures. Be sure to know how to use lists, dictionaries and how to link them. Also, basic operations as searching, inserting, and appending are essential for data manipulation processes.
- Understand algorithms and programming sequences that can search the data, merge or sort features and create new elements by combining the existing ones.
- Solve practical problems by finding some existing data sets on the web, play with data, try to extract your own conclusions, and find hidden knowledge. Every newly processed data set is one level up in your experience, which will mean a lot for you at the beginning of your data engineering path.
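To make the data-structure bullet concrete, here is a minimal Python sketch of the list and dictionary operations mentioned above; the event names are invented for illustration:

```python
# Basic list operations: append, insert, and membership search.
events = ["login", "click"]
events.append("purchase")        # add to the end
events.insert(0, "page_view")    # insert at a given index
assert "purchase" in events      # linear-scan search

# Basic dictionary operations: insert/update values by key.
counts = {}
for e in events:
    counts[e] = counts.get(e, 0) + 1

# Linking structures: a dictionary whose values are lists,
# grouping the events by their first letter.
by_letter = {}
for e in events:
    by_letter.setdefault(e[0], []).append(e)

print(counts)     # {'page_view': 1, 'login': 1, 'click': 1, 'purchase': 1}
print(by_letter)  # {'p': ['page_view', 'purchase'], 'l': ['login'], 'c': ['click']}
```

Interviewers often probe exactly these fundamentals before moving on to harder pipeline questions.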
What is a suitable Python function that will transform the input vector containing two different strings, “cat” and “dog”, into integers 0 and 1?
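One common answer is the built-in map() paired with a lookup dictionary; a minimal sketch (the label vector here is invented):

```python
# Encode a vector containing two category labels ("cat"/"dog")
# as the integers 0 and 1 using map() and a lookup dictionary.
labels = ["cat", "dog", "dog", "cat"]
encoding = {"cat": 0, "dog": 1}

encoded = list(map(encoding.get, labels))
print(encoded)  # [0, 1, 1, 0]
```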
2 SQL
Although many consider SQL to be just a query language, do not underestimate its power or the need to master it! Don’t be surprised if you spend a significant amount of time discussing SQL techniques and problem-solving approaches in a data engineering interview. Beyond querying databases, SQL is regularly used as a data processing pattern within various Big Data frameworks such as KSQL (for Kafka) and Spark SQL, and through Python libraries. Proficiency in SQL itself is a valuable indicator that you can also be effective with these Big Data frameworks. A big plus for you!
Let’s suppose you want to deal with duplicate data in the SQL query. What functions are suitable for dealing with such data?
- AVG() and COUNT()
- HAVING and LAN()
- FROM and WHERE
- COUNT() and GROUP BY
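The COUNT()/GROUP BY pattern can be tried end to end with Python's built-in sqlite3 module; the table, column, and sample rows below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("a@x.com",), ("b@x.com",), ("a@x.com",), ("c@x.com",), ("a@x.com",)],
)

# Find values that appear more than once, along with their counts.
rows = conn.execute("""
    SELECT email, COUNT(*) AS n
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()

print(rows)  # [('a@x.com', 3)]
```

HAVING filters the grouped rows, which is why it (rather than WHERE) pairs with the aggregate.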
3 Data Modeling
Data modeling is closely related to SQL and is considered an essential part of the overall system design process. It means designing a data model that follows the given data patterns and specific use cases, and it is the first step in the database design process and in data analysis tasks.
Do you know what the main types of data models are?
- A physical model is one and only type
- Conceptual, Logical and Physical
- The types of data models do not exist
- Conceptual and Physical
4 Architectures and Design of Databases. Big Data Technologies
Be prepared for a company to hand you a business use case that tests your ability to design a suitable data warehouse. To respond successfully to such a challenge, look for real-life examples on the Internet, comments from the creators of online stores, and the experiences and guidelines they have shared about designing such systems. Then try building a small test version of an online store yourself, such as a board game store. If nothing else, try to sketch the whole process of developing such a system and the stages it should go through, from data collection to the application interface. When it comes to working with databases, the following Figure shows the fundamental processes involved and some of the most popular frameworks used to implement them.
Don't be discouraged by the quantity of these tools! No one expects a beginner, or even a proficient data engineer, to know every tool at once. But try to recognize all the elements of a data engineering pipeline and build a proper knowledge foundation. It will definitely signal that you are familiar with all the essential concepts of the data engineering domain. Another plus for you!
What is NOT the correct approach to validate a data migration process from one database to another?
- Null validation
- Reconciliation Checks
- Ad Hoc Testing
- Digital preservation
5 Soft Skills
You may wonder why soft skills are highlighted here when they seem negligible compared to technical abilities and expert knowledge. If you breezed through this article so far, recognizing every concept above, surely there is no need to talk about soft skills, right? Well, not precisely… By no means neglect your personal skills and problem-solving abilities! They can be of crucial importance for a data engineer position.
Practice your presentation skills and explain the path to your solution clearly and precisely. You can be the best programmer, but if you cannot present your solution well in oral communication during an interview, you will most likely receive a rejection letter. In addition to good communication skills, critical thinking and the ability to work in a team can decide whether or not you get the job.
New technology should be implemented this month in your company. How will you explain it to coworkers who are unfamiliar with it?
Answer: To answer this question, try to present your communication skills that will illustrate how well you interact with your coworkers. You can also describe a situation where you introduced a new technical topic to the audience and how you overcame an initial misunderstanding.
ADDITIONAL INTERVIEW QUESTIONS
For data engineering roles, the dominant part of the coding assessment relies on the data side, not on the algorithm side. Be ready to solve practical problems! Below are a few more examples of questions you may encounter in a data engineering interview.
The task is to construct a SQL query that will show the unique number of occurrences of one class within a single column.
Hint: You can get the required results using the following SQL keywords: SELECT, COUNT, FROM, GROUP BY.
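A sketch of one such query, run here through Python's sqlite3 with an invented table and column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (species TEXT)")
conn.executemany(
    "INSERT INTO animals VALUES (?)",
    [("cat",), ("dog",), ("cat",), ("bird",)],
)

# Number of occurrences of each class in the column.
rows = conn.execute("""
    SELECT species, COUNT(*)
    FROM animals
    GROUP BY species
    ORDER BY species
""").fetchall()

print(rows)  # [('bird', 1), ('cat', 2), ('dog', 1)]
```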
Name three Python libraries that can be used for data processing tasks.
Answer: NumPy, pandas, TensorFlow
Your assignment is to visually present outliers from a data set. Name one library and its function, which is an adequate solution for this kind of visualization.
Possible answers: a box plot or a scatter plot, e.g., matplotlib's boxplot() or scatter() functions.
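A box plot flags points that fall more than 1.5x the interquartile range beyond the quartiles; matplotlib's boxplot() draws this. As a sketch of the rule it visualizes (approximate quartiles, invented data):

```python
# The outlier rule behind a box plot: points beyond 1.5 * IQR from the
# quartiles. matplotlib.pyplot.boxplot draws these as individual points;
# here we only compute which values it would flag.
def iqr_outliers(values):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # rough quartiles, fine for a sketch
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 11, 12, 11, 10, 12, 11, 95]
print(iqr_outliers(data))  # [95]
```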
The volume of data is rapidly increasing. What is your plan to add more capacity to the existing architecture?
Possible answers: Request more database instances in the cloud (on Google Cloud Platform, for example), or suggest removing old data sets and applying better data compression. Try to research more solutions to this problem on your own.
Name the main components of Hadoop. And, of course, what is Hadoop?
Answer: Briefly, Hadoop is a framework for processing Big Data. Its two main components are HDFS (distributed storage) and MapReduce (distributed processing).
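The MapReduce model itself can be illustrated in a few lines of plain Python. Hadoop distributes these same phases across a cluster; this sketch just runs them locally on invented documents:

```python
from collections import defaultdict

docs = ["big data tools", "big data big value"]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate the grouped values for each key.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'tools': 1, 'value': 1}
```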
If you can answer the previous questions on your own, you are well on your way to passing a data engineering interview. If some of them give you trouble, investigate the gaps and work out the solutions yourself. In any case, let this tutorial be the basis and starting point for mastering new data engineering knowledge. Be curious and don't just dwell on the case studies and problems in the existing literature. Find a real problem based on existing databases, approach it from a data engineer's point of view, and go through the entire development path from data collection to analysis. It will undoubtedly be fun, and more importantly, you will gain valuable practical experience in working with data!
One Pager Cheat Sheet
- Data Engineers build and maintain data infrastructure and applications, managing ingestion, organization, processing, storage, and warehousing of data from various sources according to hardware architecture and storage capacity.
- Being a Data Engineer requires knowledge in software engineering, data warehousing, data modeling, data integration and big data technologies.
- Acquiring proficiency in programming languages such as Python, SQL, Java, Scala, R, Ruby, and Perl and understanding data structures, algorithms, and practical problem-solving are key to succeeding in data engineering.
- The map() function can be used to transform a vector of strings into integers, by mapping each string to a specific corresponding integer value.
- Mastering SQL is an essential part of being a successful Data Engineer, and proficiency in it indicates proficiency in Big Data frameworks such as KSQL, Spark SQL and Python libraries.
- The COUNT() function in combination with the GROUP BY clause can be used to identify and get an exact count of duplicate records in a column.
- Data modeling is an integral part of the system design process that involves creating a data model following particular data patterns and use cases.
- Data models are the foundations of a data system and can be classified into three main types: Conceptual, Logical, and Physical.
- Get familiar with the fundamental processes and tools used in data engineering by exploring real-life examples, trying to develop a small test version of an online store, and learning the basics of database architecture and design.
- The correct approach to validate a data migration process from one database to another would involve running tests to check for data type, record count and other discrepancies, whereas digital preservation requires a different set of processes to protect digital information over time.
- Emphasizing the importance of soft skills, it is essential for a Data Engineer to have excellent communication, problem-solving and teamwork skills in order to stand out in the job market.
- Explaining new technology to unfamiliar coworkers requires excellent communication skills and the ability to illustrate concepts in a way that is easy to understand.
- Preparing for a data engineering interview may require solving practical coding problems, such as constructing a SQL query to reveal the unique number of occurrences of one class within a single column, as well as knowing terms and libraries such as NumPy, pandas, TensorFlow, and Hadoop with its two main components, HDFS and MapReduce.
2023 Guide: 20+ Essential Data Science Case Study Interview Questions
Case studies are often the most challenging aspect of data science interview processes. They are crafted to resemble a company’s existing or previous projects, assessing a candidate’s ability to tackle prompts, convey their insights, and navigate obstacles.
To excel in data science case study interviews, practice is crucial. It will enable you to develop strategies for approaching case studies, asking the right questions to your interviewer, and providing responses that showcase your skills while adhering to time constraints.
The best way of doing this is by using a framework for answering case studies. For example, you could use the product metrics framework and the A/B testing framework to answer most case studies that come up in data science interviews.
There are four main types of data science case studies:
- Product Case Studies - This type of case study tackles a specific product or feature offering, often tied to the interviewing company. Interviewers are generally looking for business sense geared towards product metrics.
- Data Analytics Case Study Questions - Data analytics case studies ask you to propose possible metrics in order to investigate an analytics problem. Additionally, you must write a SQL query to pull your proposed metrics, and then perform analysis using the data you queried, just as you would do in the role.
- Modeling and Machine Learning Case Studies - Modeling case studies are more varied and focus on assessing your intuition for building models around business problems.
- Business Case Questions - Similar to product questions, business cases tackle issues or opportunities specific to the organization that is interviewing you. Often, candidates must assess the best option for a certain business plan being proposed, and formulate a process for solving the specific problem.
How Case Study Interviews Are Conducted
Oftentimes as an interviewee, you want to know the setting and format in which to expect the above questions to be asked. Unfortunately, this is company-specific: Some prefer real-time settings, where candidates actively work through a prompt after receiving it, while others offer some period of days (say, a week) before settling in for a presentation of your findings.
It is therefore important to have a system for answering these questions that will accommodate all possible formats, such that you are prepared for any set of circumstances (we provide such a framework below).
Why Are Case Study Questions Asked?
Case studies assess your thought process in answering data science questions. Specifically, interviewers want to see that you have the ability to think on your feet, and to work through real-world problems that likely do not have a right or wrong answer. Real-world case studies that are affecting businesses are not binary; there is no black-and-white, yes-or-no answer. This is why it is important that you can demonstrate decisiveness in your investigations, as well as show your capacity to consider impacts and topics from a variety of angles. Once you are in the role, you will be dealing directly with the ambiguity at the heart of decision-making.
Perhaps most importantly, case interviews assess your ability to effectively communicate your conclusions. On the job, data scientists exchange information across teams and divisions, so a significant part of the interviewer’s focus will be on how you process and explain your answer.
Quick tip: Because case questions in data science interviews tend to be product- and company-focused, it is extremely beneficial to research current projects and developments across different divisions, as these initiatives might end up as the case study topic.
How to Answer Data Science Case Study Questions (The Framework)
There are four main steps to tackling case questions in Data Science interviews, regardless of the type: clarify, make assumptions, gather context, and provide data points and analysis.
Step 1: Clarify
Clarifying is used to gather more information. More often than not, these case studies are designed to be confusing and vague. The data will often be intentionally disorganized, padded with extraneous details or missing key information, so it is the candidate's responsibility to dig deeper, filter out bad information, and fill the gaps. Interviewers will be observing how an applicant asks questions and reaches a solution.
For example, with a product question, you might take into consideration:
- What is the product?
- How does the product work?
- How does the product align with the business itself?
Step 2: Make Assumptions
When you have made sure that you have evaluated and understand the dataset, start investigating and discarding possible hypotheses. Developing insights on the product at this stage complements your ability to glean information from the dataset, and the exploration of your ideas is paramount to forming a successful hypothesis. You should be communicating your hypotheses with the interviewer, such that they can provide clarifying remarks on how the business views the product, and to help you discard unworkable lines of inquiry. If we continue to think about a product question, some important questions to evaluate and draw conclusions from include:
- Who uses the product? Why?
- What are the goals of the product?
- How does the product interact with other services or goods the company offers?
The goal of this is to reduce the scope of the problem at hand, and ask the interviewer questions upfront that allow you to tackle the meat of the problem instead of focusing on less consequential edge cases.
Step 3: Propose a Solution
Now that a hypothesis is formed that has incorporated the dataset and an understanding of the business-related context, it is time to apply that knowledge in forming a solution. Remember, the hypothesis is simply a refined version of the problem that uses the data on hand as its basis to being solved. The solution you create can target this narrow problem, and you can have full faith that it is addressing the core of the case study question.
Keep in mind that there isn’t a single expected solution, and as such, there is a certain freedom here to determine the exact path for investigation.
Step 4: Provide Data Points and Analysis
Finally, providing data points and analysis in support of your solution involves choosing and prioritizing a main metric. As with all prior steps, this one must be tied back to the hypothesis and the main goal of the problem. From that foundation, it is important to trace through and analyze different examples derived from the main metric in order to validate the hypothesis.
Quick tip: Every case question tends to have multiple solutions. Therefore, you should absolutely consider and communicate any potential trade-offs of your chosen method. Be sure you are communicating the pros and cons of your approach.
Note: In some special cases, solutions will also be assessed on the ability to convey information in layman’s terms. Regardless of the structure, applicants should always be prepared to solve through the framework outlined above in order to answer the prompt.
The Role of Effective Communication
Interviewers behind the data science case study portion have written and discussed it at length, and they all boil success in case studies down to one main factor: effective communication.
All the analysis in the world will not help if interviewees cannot verbally work through and highlight their thought process within the case study. At this stage of the hiring process, interviewers are looking for well-developed soft skills and problem-solving capabilities, and demonstrating those traits is key to succeeding in this round.
To this end, the best advice is to actively practice working through example case studies, such as those available in the Interview Query question bank. Exploring different topics with a friend in an interview-like setting with cold recall (no Googling in between!) will be uncomfortable and awkward, but it will also help reveal weaknesses in fleshing out an investigation.
Don’t worry if the first few times are terrible! Developing a rhythm will help with gaining self-confidence as you become better at assessing and learning through these sessions.
Product Case Study Questions
With product data science case questions, the interviewer wants to get an idea of your product-sense intuition. Specifically, these questions assess your ability to identify which metrics should be proposed in order to understand a product.
1. How would you measure the success of private stories on Instagram, where only certain close friends can see the story?
Start by answering: What is the goal of the private story feature on Instagram? You can’t evaluate “success” without knowing what the initial objective of the product was, to begin with.
One specific goal of this feature would be to drive engagement. A private story could potentially increase interactions between users, and grow awareness of the feature.
Now, what types of metrics might you propose to assess user engagement? For a high-level overview, we could look at:
- Average stories per user per day
- Average Close Friends stories per user per day
However, we would also want to further bucket our users to see the effect that Close Friends stories have on user engagement. By bucketing users by age, date joined, or another metric, we could see how engagement is affected within certain populations, giving us insight on success that could be lost if looking at the overall population.
2. How would you measure the success of acquiring new users through a 30-day free trial at Netflix?
More context: Netflix is offering a promotion where users can enroll in a 30-day free trial. After 30 days, customers will automatically be charged based on their selected package. How would you measure acquisition success, and what metrics would you propose to measure the success of the free trial?
One way we can frame the concept specifically to this problem is to think about controllable inputs, external drivers, and then the observable output. Start with the major goals of Netflix:
- Acquiring new users to their subscription plan.
- Decreasing churn and increasing retention.
Looking at acquisition output metrics specifically, there are several top-level stats that we can look at, including:
- Conversion rate percentage
- Cost per free trial acquisition
- Daily conversion rate
With these conversion metrics, we would also want to bucket users by cohort. This would help us see the percentage of free users who were acquired, as well as retention by cohort.
3. How would you measure the success of Facebook Groups?
Start by considering the key function of Facebook Groups . You could say that Groups are a way for users to connect with other users through a shared interest or real-life relationship. Therefore, the user’s goal is to experience a sense of community, which will also drive our business goal of increasing user engagement.
What general engagement metrics can we associate with this value? An objective metric like Groups monthly active users would help us see if Facebook Groups user base is increasing or decreasing. Plus, we could monitor metrics like posting, commenting, and sharing rates.
There are other products that Groups impact, however, specifically the Newsfeed. We need to consider Newsfeed quality and examine if updates from Groups clog up the content pipeline and if users prioritize those updates over other Newsfeed items. This evaluation will give us a better sense of if Groups actually contribute to higher engagement levels.
4. How would you analyze the effectiveness of a new LinkedIn chat feature that shows a “green dot” for active users?
Note: Given engineering constraints, the new feature is impossible to A/B test before release.
When you approach case study questions, remember always to clarify any vague terms. In this case, “effectiveness” is very vague. To help define that term, first consider what the goal is of adding a green dot to LinkedIn chat.
5. How would you diagnose why weekly active users are up 5%, but email notification open rates are down 2%?
What assumptions can you make about the relationship between weekly active users and email open rates? With a case question like this, you would want to first answer that line of inquiry before proceeding.
Hint: Open rate can decrease when its numerator decreases (fewer people open emails) or its denominator increases (more emails are sent overall). Taking these two factors into account, what are some hypotheses we can make about our decrease in the open rate compared to our increase in weekly active users?
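To make the numerator/denominator decomposition concrete, here is a tiny illustration with invented numbers, where opens grow but sends grow faster:

```python
# Invented numbers: weekly actives (and therefore emails sent) grow 10%,
# while opens grow only ~7.8%, so the open rate falls even as opens rise.
opens_before, sent_before = 2_000, 10_000
opens_after, sent_after = 2_156, 11_000

rate_before = opens_before / sent_before
rate_after = opens_after / sent_after
print(f"{rate_before:.3f} -> {rate_after:.3f}")  # 0.200 -> 0.196
```

So a falling open rate alongside rising weekly actives need not mean engagement dropped; it can simply reflect a larger send volume.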
Data Analytics Case Study Questions
Data analytics case studies ask you to dive into analytics problems. Typically these questions ask you to examine metrics trade-offs or investigate changes in metrics. In addition to proposing metrics, you also have to write SQL queries to generate the metrics, which is why they are sometimes referred to as SQL case study questions.
6. Using the provided data, generate some specific recommendations on how DoorDash can improve.
In this DoorDash analytics case study take-home question you are provided with the following dataset:
- Customer order time
- Restaurant order time
- Driver arrives at restaurant time
- Order delivered time
- Customer ID
- Amount of discount
- Amount of tip
With a dataset like this, there are numerous recommendations you can make. A good place to start is by thinking about the DoorDash marketplace, which includes drivers, riders and merchants. How could you analyze the data to increase revenue, driver/user retention and engagement in that marketplace?
7. After implementing a notification change, the total number of unsubscribes increases. Write a SQL query to show how unsubscribes are affecting login rates over time.
This is a Twitter data science interview question, and let’s say you implemented this new feature using an A/B test. You are provided with two tables: events (which includes login, nologin and unsubscribe) and variants (which includes control or variant).
We are tasked with comparing multiple different variables at play here. There is the new notification system, along with its effect of creating more unsubscribes. We can also see how login rates compare for unsubscribes for each bucket of the A/B test.
Given that we want to measure two different changes, we know we have to use GROUP BY for the two variables: date and bucket variant. What comes next?
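One possible shape for that query, sketched with Python's sqlite3. The schemas, column names, and sample rows are assumptions for illustration, not the actual Twitter tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INT, action TEXT, created_at TEXT);
    CREATE TABLE variants (user_id INT, bucket TEXT);
    INSERT INTO variants VALUES (1,'control'),(2,'variant'),(3,'variant');
    INSERT INTO events VALUES
        (1,'login','2023-01-01'),(2,'login','2023-01-01'),
        (3,'nologin','2023-01-01'),(2,'unsubscribe','2023-01-02');
""")

# Login rate per day per A/B bucket: share of login events among
# all login-or-nologin events, grouped by date and bucket.
rows = conn.execute("""
    SELECT e.created_at, v.bucket,
           AVG(CASE WHEN e.action = 'login' THEN 1.0 ELSE 0.0 END) AS login_rate
    FROM events e
    JOIN variants v ON v.user_id = e.user_id
    WHERE e.action IN ('login', 'nologin')
    GROUP BY e.created_at, v.bucket
    ORDER BY e.created_at, v.bucket
""").fetchall()

print(rows)  # [('2023-01-01', 'control', 1.0), ('2023-01-01', 'variant', 0.5)]
```

Comparing login_rate trends across buckets over the dates before and after users unsubscribe is then a matter of reading the grouped output.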
8. Write a query to disprove the hypothesis: Data scientists who switch jobs more often end up getting promoted faster.
More context: You are provided with a table of user experiences representing each person’s past work experiences and timelines.
This question requires a bit of creative problem-solving to understand how we can prove or disprove the hypothesis that data scientists who switch jobs more often get promoted faster.
In analyzing this dataset, we can test the hypothesis by separating the data scientists into segments based on how often they have switched jobs.
For example, if we looked at data scientists who have been in the field for five years, we could check whether the share who are managers increases with the number of career jumps:
- Never switched jobs: 10% are managers
- Switched jobs once: 20% are managers
- Switched jobs twice: 30% are managers
- Switched jobs three times: 40% are managers
9. Write a SQL query to investigate the hypothesis: Click-through rate is dependent on search result rating.
More context: You are given a table with search results on Facebook, which includes query (search term), position (the search position), and rating (human rating from 1 to 5). Each row represents a single search and includes a column has_clicked that represents whether a user clicked or not.
This question requires us to formulaically do two things: create a metric that can analyze a problem that we face and then actually compute that metric.
Think about the data we want to display to prove or disprove the hypothesis. Our output metric is CTR (clickthrough rate). If CTR is high when search result ratings are high and CTR is low when the search result ratings are low, then our hypothesis is proven. However, if the opposite is true, CTR is low when the search result ratings are high, or there is no proven correlation between the two, then our hypothesis is not proven.
With that structure in mind, we can then look at the results split into different search rating buckets. If we measure the CTR for queries that all have results rated at 1 and then measure CTR for queries that have results rated at lower than 2, etc., we can measure to see if the increase in rating is correlated with an increase in CTR.
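A sketch of the bucketed-CTR query with sqlite3; the table layout follows the prompt, while the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_results (query TEXT, position INT, rating INT, has_clicked INT)"
)
conn.executemany(
    "INSERT INTO search_results VALUES (?,?,?,?)",
    [("shoes", 1, 5, 1), ("shoes", 2, 5, 1), ("hats", 1, 2, 0), ("hats", 2, 2, 1)],
)

# CTR per rating bucket: the mean of has_clicked within each rating.
rows = conn.execute("""
    SELECT rating, AVG(has_clicked) AS ctr
    FROM search_results
    GROUP BY rating
    ORDER BY rating
""").fetchall()

print(rows)  # [(2, 0.5), (5, 1.0)]
```

If CTR rises monotonically with the rating bucket, that supports the hypothesis; a flat or inverted pattern argues against it.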
10. How would you help a supermarket chain determine which product categories should be prioritized in their inventory restructuring efforts?
You’re working as a Data Scientist in a local grocery chain’s data science team. The business team has decided to allocate store floor space by product category (e.g., electronics, sports and travel, food and beverages). Help the team understand which product categories to prioritize as well as answering questions such as how customer demographics affect sales, and how each city’s sales per product category differs.
Check out our Data Analytics Learning Path .
Modeling and Machine Learning Case Questions
Machine learning case questions assess your ability to build models to solve business problems. These questions can range from applying machine learning to a specific case scenario to assessing the validity of a hypothetical existing model. A modeling case study requires a candidate to evaluate and explain any given part of the model-building process.
11. Describe how you would build a model to predict Uber ETAs after a rider requests a ride.
Common machine learning case study problems like this are designed to explain how you would build a model. Many times this can be scoped down to specific parts of the model building process. Examining the example above, we could break it up into:
- How would you evaluate the predictions of an Uber ETA model?
- What features would you use to predict the Uber ETA for ride requests?
Our recommended framework breaks down a modeling and machine learning case study to individual steps in order to tackle each one thoroughly. In each full modeling case study, you will want to go over:
- Data processing
- Feature Selection
- Model Selection
- Cross Validation
- Evaluation Metrics
- Testing and Roll Out
12. How would you build a model that sends bank customers a text message when fraudulent transactions are detected?
Additionally, the customer can approve or deny the transaction via text response.
Let’s start out by understanding what kind of model needs to be built. Since we are working with fraud, each transaction either is or is not fraudulent.
Hint: This problem is a binary classification problem. Given the problem scenario, what considerations do we have to think about when first building this model? What would the bank fraud data look like?
13. How would you design the inputs and outputs for a model that detects potential bombs at a border crossing?
Additional questions: How would you test the model and measure its accuracy? Remember the equations for precision and recall:
precision = TP / (TP + FP), recall = TP / (TP + FN)
Because we cannot afford false negatives (a missed bomb), recall should be high when assessing the model.
14. Which model would you choose to predict Airbnb booking prices: Linear regression or random forest regression?
Start by answering this question: What are the main differences between linear regression and random forest?
Random forest regression is based on the ensemble machine learning technique of bagging . The two key concepts of random forests are:
- Random sampling of training observations when building trees.
- Random subsets of features for splitting nodes.
Random forest regressions also discretize continuous variables, since they are based on decision trees and can split categorical and continuous variables.
Linear regression, on the other hand, is the standard regression technique in which relationships are modeled using a linear predictor function, the most common example represented as y = Ax + B.
Let’s see how each model is applicable to Airbnb’s bookings. One thing we need to do in the interview is to understand more context around the problem of predicting bookings. To do so, we need to understand which features are present in our dataset.
We can assume the dataset will have features like:
- Location features.
- Number of bedrooms and bathrooms.
- Private room, shared, entire home, etc.
- External demand (conferences, festivals, sporting events).
Which model would be the best fit for this feature set?
15. Using a binary classification model that pre-approves candidates for a loan, how would you give each rejected application a rejection reason?
More context: You do not have access to the feature weights. Start by thinking about the problem like this: How would the problem change if we had ten, one thousand, or ten thousand applicants that had gone through the loan qualification program?
Pretend that we have three people: Alice, Bob, and Candace that have all applied for a loan. Simplifying the financial lending loan model, let us assume the only features are the total number of credit cards , the dollar amount of current debt , and credit age . Here is a scenario:
- Alice: 10 credit cards, 5 years of credit age, $20K in debt
- Bob: 10 credit cards, 5 years of credit age, $15K in debt
- Candace: 10 credit cards, 5 years of credit age, $10K in debt

If Candace is approved, we can logically point to the fact that Candace’s $10K in debt swung the model to approve her for a loan. How did we reason this out?
If the sample size analyzed was instead thousands of people who had the same number of credit cards and credit age with varying levels of debt, we could figure out the model’s average loan acceptance rate for each numerical amount of current debt. Then we could plot these on a graph to model the y-value (average loan acceptance) versus the x-value (dollar amount of current debt). These graphs are called partial dependence plots.
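The partial dependence computation can be sketched in a few lines of pure Python. The stand-in model and applicant data below are hypothetical; the point is the procedure: sweep the debt feature over a grid, hold each applicant's other features fixed, and average the model's predictions at each grid value.

```python
# Stand-in model (not a real lender's): approval probability falls as
# debt and card count rise; credit_age is carried but unused here.
def approve_prob(credit_cards, credit_age, debt):
    raw = 1.1 - debt / 25_000 - 0.01 * credit_cards
    return max(0.0, min(1.0, raw))

applicants = [
    {"credit_cards": 10, "credit_age": 5},
    {"credit_cards": 3, "credit_age": 12},
    {"credit_cards": 7, "credit_age": 2},
]

# Sweep debt over a grid, averaging predictions across the sample.
debt_grid = [5_000, 10_000, 15_000, 20_000]
partial_dependence = []
for debt in debt_grid:
    avg = sum(
        approve_prob(a["credit_cards"], a["credit_age"], debt)
        for a in applicants
    ) / len(applicants)
    partial_dependence.append(round(avg, 2))

print(list(zip(debt_grid, partial_dependence)))
# [(5000, 0.83), (10000, 0.63), (15000, 0.43), (20000, 0.23)]
```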
Business Case Questions
In data science interviews, business case study questions task you with addressing problems as they relate to the business. You might be asked about topics like estimation and calculation, as well as applying problem-solving to a larger case. One tip: Be sure to read up on the company’s products and ventures before your interview to expose yourself to possible topics.
16. How would you estimate the average lifetime value of customers at a business that has existed for just over one year?
More context: You know that the product costs $100 per month, averages 10% in monthly churn, and the average customer stays for 3.5 months. Remember that lifetime value is defined as the predicted net revenue attributable to the entire future relationship with a customer, averaged across all customers. Therefore, $100 * 3.5 = $350… but is it that simple?
Because this company is so new, our average customer length (3.5 months) is biased from the short possible length of time that anyone could have been a customer (one year maximum). How would you then model out LTV knowing the churn rate and product cost?
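One common approach, assuming a roughly constant monthly churn rate, treats customer lifetime as geometric, which gives LTV = price / churn:

```python
price = 100       # $ per month (from the prompt)
churn = 0.10      # 10% monthly churn

# With constant churn, the expected customer lifetime is 1 / churn months,
# so LTV = price / churn. This avoids the downward bias of the observed
# 3.5-month average, which is censored by the company's one-year history.
expected_lifetime_months = 1 / churn   # 10 months
ltv = price * expected_lifetime_months

print(ltv)  # 1000.0
```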
17. How would you go about removing duplicate product names (e.g. iPhone X vs. Apple iPhone 10) in a massive database?
18. What metrics would you monitor to know if a 50% discount promotion is a good idea for a ride-sharing company?
This question has no correct answer and is rather designed to test your reasoning and communication skills related to product/business cases. First, start by stating your assumptions. What are the goals of this promotion? It is likely that the goal of the discount is to grow revenue and increase retention. A few other assumptions you might make include:
- The promotion will be applied uniformly across all users.
- The 50% discount can only be used for a single ride.
How would we be able to evaluate this pricing strategy? An A/B test between a control group (no discount) and a test group (discount) would allow us to compare long-term revenue against the average cost of the promotion. Using these two metrics, how could we measure whether the promotion is a good idea?
19. A bank wants to create a new partner card (e.g., a Whole Foods Chase credit card). How would you determine what the next partner card should be?
More context: Say you have access to all customer spending data. With this question, there are several approaches you can take. As your first step, think about the business reason for credit card partnerships: they help increase acquisition and customer retention.
One of the simplest solutions would be to sum all transactions grouped by merchants. This would identify the merchants who see the highest spending amounts. However, the one issue might be that some merchants have a high-spend value but low volume. How could we counteract this potential pitfall? Is the volume of transactions even an important factor in our credit card business? The more questions you ask, the more may spring to mind.
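The grouping step can be sketched in pure Python (the merchants and amounts below are invented): summing by merchant surfaces high-spend partners, while a parallel transaction count guards against high-spend/low-volume merchants dominating the ranking.

```python
from collections import defaultdict

# Hypothetical transaction records: (merchant, amount in dollars).
transactions = [
    ("whole_foods", 85.0),
    ("delta", 450.0),
    ("whole_foods", 42.0),
    ("delta", 600.0),
    ("corner_cafe", 6.5),
    ("corner_cafe", 4.0),
    ("corner_cafe", 5.5),
]

total_spend = defaultdict(float)
txn_count = defaultdict(int)
for merchant, amount in transactions:
    total_spend[merchant] += amount
    txn_count[merchant] += 1

# Ranking by total spend alone favors high-value/low-volume merchants
# (delta here); also ranking by transaction count counteracts that.
by_spend = sorted(total_spend, key=total_spend.get, reverse=True)
by_volume = sorted(txn_count, key=txn_count.get, reverse=True)
print(by_spend[0], by_volume[0])  # delta corner_cafe
```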
20. How would you assess the value of keeping a TV show on a streaming platform like Netflix?
Say that Netflix is working on a deal to renew the streaming rights for a show like The Office , which has been on Netflix for one year. Your job is to value the benefit of keeping the show on Netflix.
Start by trying to understand the reasons why Netflix would want to renew the show. Netflix mainly has three goals for what their content should help achieve:
- Acquisition: To increase the number of subscribers.
- Retention: To increase the retention of active subscribers and keep them on as paying members.
- Revenue: To increase overall revenue.
One solution to value the benefit would be to estimate a lower and upper bound to understand the percentage of users that would be affected by The Office being removed. You could then run these percentages against your known acquisition and retention rates.
21. How would you determine which products are to be put on sale?
Let’s say you work at Amazon. It’s nearing Black Friday, and you are tasked with determining which products should be put on sale. You have access to historical pricing and purchasing data from items that have been on sale before. How would you determine what products should go on sale to best maximize profit during Black Friday?
To start with this question, aggregate data from previous years for products that have been on sale during Black Friday or similar events. You can then compare elements such as historical sales volume, inventory levels, and profit margins.
More Data Science Interview Resources
Case studies are one of the most common types of data science interview questions. Practice with the data science course from Interview Query, which includes product and machine learning modules.
Our Case Studies.
Sales Forecast
Using machine learning, we combined different data sources to predict the demand for different SKUs across different regions.
Producing a timely forecast that can accurately predict demand and support decisions regarding shelf time and just-in-time (JIT) logistics.
Achieved an improvement of 10 percentage points in actual forecast accuracy.
Using data from past prices and product characteristics, we were able to produce price and promotion mix recommendations every two months.
Our customer wanted to set the optimum price to balance their sales and profitability, dynamically. Each week, several factors influence the optimum price point of their SKUs, and it was important that each model was able to find explainable patterns for each product.
Achieved stability between sales and profitability for different products across different regions. This model enabled the customer to build better price-promotion mixes for more than 100 different stores.
Product Launch & Acceptance
Using data science models, we were able to find factors that justified the success of a specific product in a group of stores.
Finding stores that were more receptive to new products.
Improved targeted marketing and sales team focus to stores where there is a higher propensity for new products.
Recommendation System for New Products
We've built several recommendation engines to cross-sell new products to Telco Customers. Our Recommendation Engines took into account data from customers and products and produced "Next Best Offers" to current customers during touchpoints.
Maximize lift of cross-sell offers and reduce number of irrelevant offers to customers.
Achieved targeted improvement of lift to new and existing customers.
We advocate for a delicate balancing act between quick wins and infrastructure build. This balance is achieved with a strategic mix of careful planning and agile response. Your goal should be to make the most out of Data & AI.
We've built several customer segmentations for our telco customers. These segmentations enable our customers to better predict which customers are more likely to acquire value-added services or which ones are more likely to use the services without paying.
Building clear customer segmentations that can be fed to CRM systems and support operators on their decision making process.
Achieved an improvement in the LTV of segmented customers and reduced the percentage of the portfolio in default by avoiding the recommendation of new products to customers with a high likelihood of default.
Computer Vision for Medical Imaging
We've used models to classify images and obtain a consensus map based on several "true labels" from different doctors. The model we've built for our customer uses components of computer vision and clustering analysis to target specific areas of an x-ray that should be the ground truth for diagnosis.
Building a consensus map to help doctors make sense of different classifications.
Our model achieved an accuracy of over 90%, enabling doctors to achieve better precision when making a diagnosis based on the image.
Operational Efficiency Project
Our customer needed to extract text data from hundreds of documents in order to label those documents according to several categories. We've solved this problem using text classification algorithms that are able to use both structured and unstructured data to predict the label of a document.
Reducing the manual input by users when classifying new documents. This process took roughly 10% of the team's effort and automating it was crucial to enable the team to focus on value-added tasks.
Achieved optimization of hours used by the organization in value-added tasks by automating a huge chunk of the manual labelling process.
Data Lake Setup and Management
Creating a Data Lake that supports a Public Sector Agency across all departments and gives timely information about different sectors, industries and companies.
Introducing contextual information, external to the organization, into the decision process of the agency.
Enabled several departments to use information that was previously expensive or accessible only through complicated processes, improving the quality and speed of virtually every decision in the agency.
Knowledge Database and Relevant Search
Our customer needed to integrate information scattered throughout different documents into their decision process. We've built a graph database that ingests documents in all types of formats (spreadsheets, Word, and PDF), mapping entities and building their relationships using unstructured data.
Extract information from millions of documents to structure the data in a relational graph database that helps all departments make better decisions.
Our customer is now able to incorporate information that was scattered across files stored in deep folders. This information is crucial to support employees on a day-to-day basis, giving them visibility and insight into entity relationships.
Computer Vision for Utilities
Our customer needed to segment several elements in noisy satellite images to obtain relevant information that could be used when setting up new installations across the network. We've developed state-of-the-art models to identify several objects in a satellite image and build multi-layer image identification processes.
Identifying different objects in Satellite images to support infrastructure deployment decisions.
Our Customer is now able to save hundreds of hours of manual work by receiving automatic input from satellite images. Our model is able to detect relevant information in the images and calculate available space to deploy infrastructure on roofs or landscape.
Using a machine learning model, we've deployed an infrastructure able to predict, in real time, the probability of failure of an essential component of several machines.
If the component had some type of problem, the whole operation would stop, costing hundreds of dollars to the company. Our customer's objective was to have timely and precise predictions of when that component might fail within a specific time window.
Our customer is now able to accurately predict, in real time, when the component will fail. We were responsible for building the API that communicates with the machine, receiving data from several endpoints and sources. We've managed to make this data engineering process take only a few minutes, so that the warning is raised in time and the company can act accordingly.
We've helped a startup set up a routing optimization system from top to bottom, building the backbone for the organization's processes. These processes range from data collection to making routing decisions that impact the startup's daily operations.
Building a scalable tech environment that supports the daily operations of a US startup.
The company is able to scale their data processes with no restrictions on size or complexity. Additionally, the optimization algorithm for routing enabled the company to expand across the US.
Interim CTO - SaaS Company
We act as interim CTOs for several startups, providing advisory services, developing crucial tech processes, and helping with hiring.
Supporting startups that don't have the resources to hire a CTO or are looking for tech advisory in their early-stage phase.
We've helped a couple of startup companies scale by advising them on how to set up their data engineering infrastructure.
100+ Data Engineer Interview Questions and Answers for 2023
This blog is your one-stop solution for the top 100+ Data Engineer Interview Questions and Answers . In this blog, we have collated the frequently asked data engineer interview questions based on tools and technologies that are highly useful for a data engineer in the Big Data industry. Also, you will find some interesting data engineer interview questions that have been asked in different companies (like Facebook, Amazon, Walmart, etc.) that leverage big data analytics and tools.
Preparing for data engineer interviews makes even the bravest of us anxious. One good way to stay calm and composed for an interview is to thoroughly answer questions frequently asked in interviews. If you have an interview for a data engineer role coming up, here are some data engineer interview questions and answers, organized by required skill set, that you can use to prepare for your future data engineer interviews.
Table of Contents
- Top 100+ Data Engineer Interview Questions and Answers
- Data Engineer Interview Questions on Big Data
- Data Engineer Interview Questions on Python
- Data Engineer Interview Questions on Excel
- Data Engineer Interview Questions on SQL
- Data Engineer Interview Questions on Azure
- Data Engineer Interview Questions on AWS
- Data Engineer Interview Questions on Data Lake
- Data Engineer Technical Interview Questions
- Databricks Data Engineer Interview Questions
- Walmart Data Engineer Interview Questions
- EY Data Engineer Interview Questions
- Behavioral Data Engineering Questions
- Facebook Data Engineer Interview Questions
- Amazon Data Engineer Interview Questions
- How Data Engineering Helps Businesses | Why Is Data Engineering in Demand
- Data Engineer Job Growth and Demand in 2023
- What Skills Does a Data Engineer Need?
- Get Set Go for Your Interview with ProjectPro’s Top Data Engineer Interview Questions
- How can I pass data engineer interview?
- What are the roles and responsibilities of data engineer?
- What are the 4 most key questions a data engineer is likely to hear during an interview?
The following sections consist of the top 100+ data engineer interview questions divided based on big data fundamentals, big data tools/technologies, and big data cloud computing platforms. Furthermore, you will find a few sections on data engineer interview questions commonly asked in various companies leveraging the power of big data and data engineering.
Any organization that relies on data must perform big data engineering to stand out from the crowd. But data collection, storage, and large-scale data processing are only the first steps in the complex process of big data analysis. Complex algorithms, specialized professionals, and high-end technologies are required to leverage big data in businesses, and big data engineering ensures that organizations can utilize the power of data.
Below are some big data interview questions for data engineers based on the fundamental concepts of big data, such as data modeling, data analysis , data migration, data processing architecture, data storage, big data analytics, etc.
Differentiate between relational and non-relational database management systems.
What is data modeling?
Data modeling is a technique that defines and analyzes the data requirements needed to support business processes. It involves creating a visual representation of an entire system of data or a part of it. The process of data modeling begins with stakeholders providing business requirements to the data engineering team.
How is a data warehouse different from an operational database?
What are the four V's of Big Data?
Volume: refers to the size of the data sets to be analyzed or processed. The size is generally in terabytes and petabytes.
Velocity: the speed at which data is generated. Data is generated faster than traditional data handling techniques can process it.
Variety: the data can come from various sources and contain structured, semi-structured, or unstructured data.
Veracity: the quality of the data to be analyzed. The data has to be able to contribute in a meaningful way to generate results.
Differentiate between Star schema and Snowflake schema.
What are the differences between OLTP and OLAP?
What are some differences between a data engineer and a data scientist?
Data engineers and data scientists work very closely together, but there are some differences in their roles and responsibilities.
Both data scientist and data engineer roles require professionals with a background in computer science, engineering, or a closely related field such as mathematics, statistics, or economics. A sound command of software and programming languages is important for both a data scientist and a data engineer. Read more for a detailed comparison between data scientists and data engineers.
How is a data architect different from a data engineer?
Differentiate between structured and unstructured data.
10. How does Network File System (NFS) differ from Hadoop Distributed File System (HDFS)?
What is meant by feature selection?
Feature selection is identifying and selecting only the features relevant to the prediction variable or desired output for the model creation. A subset of the features that contribute the most to the desired output must be selected automatically or manually.
How can missing values be handled in Big Data?
Some ways you can handle missing values in Big Data are as follows:
Deleting rows with missing values: You simply delete the rows or columns in a table with missing values from the dataset. You can drop the entire column from the analysis if a column has more than half of the rows with null values. You can use a similar method for rows with missing values in more than half of the columns. This method may not work very well in cases where a large number of values are missing.
Using mean/median for missing values: If a column with missing values has a numeric data type, you can fill in the missing values using the mean or median of the remaining values in the column.
Imputation method for categorical data: If you can classify the data in a column, you can replace the missing values with the most frequently used category in that particular column. If more than half of the column values are empty, you can use a new categorical variable to place the missing values.
Predicting missing values: Regression or classification techniques can predict the values based on the nature of the missing values.
Last Observation Carried Forward (LOCF) method: The last valid observation can fill in the missing value in data variables that display longitudinal behavior.
Using Algorithms that support missing values: Some algorithms, such as the k-NN algorithm, can ignore a column if values are missing. Another such algorithm is Naive Bayes. The RandomForest algorithm can work with non-linear and categorical data.
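Two of these strategies, row deletion and median imputation, can be sketched on a toy numeric column using only the standard library (pandas' `fillna()` or scikit-learn's `SimpleImputer` do the same at scale):

```python
import statistics

# A toy numeric column with missing entries (None), purely illustrative.
ages = [25, None, 31, 40, None, 28]

# Median imputation: fill gaps with the median of the observed values.
observed = [v for v in ages if v is not None]
median = statistics.median(observed)            # 29.5
filled = [median if v is None else v for v in ages]

# Row deletion: simply drop the missing entries instead.
dropped = [v for v in ages if v is not None]

print(filled)   # [25, 29.5, 31, 40, 29.5, 28]
print(dropped)  # [25, 31, 40, 28]
```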
What is meant by outliers?
In a dataset, an outlier is an observation that lies at an abnormal distance from the other values in a random sample from a particular data set. It is left up to the analyst to determine what can be considered abnormal. Before you classify data points as abnormal, you must first identify and categorize the normal observations. Outliers may occur due to variability in measurement or a particular experimental error. Outliers must be identified and removed before further analysis of the data so that they do not cause any problems.
What is meant by logistic regression?
Logistic regression is a classification rather than a regression model, which involves modeling the probability of a discrete outcome given an input variable. It is a simple and efficient method that can approach binary and linear classification problems. Logistic regression is a statistical method that works well with binary classifications but can be generalized to multiclass classifications.
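The core of the model is the sigmoid function, which squashes the linear predictor w·x + b into a probability; the weights below are hypothetical:

```python
import math

# Logistic regression models P(y = 1 | x) as sigmoid(w*x + b);
# the weights here are made up for illustration.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b = 2.0, -1.0
x = 0.5                    # w*x + b == 0: the decision boundary
print(sigmoid(w * x + b))  # 0.5 -- exactly on the boundary
```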
15. Briefly define the Star Schema.
The star join schema, one of the most basic design schemas in the Data Warehousing concept, is also known as the star schema. It looks like a star, with fact tables and related dimension tables. The star schema is useful when handling huge amounts of data.
16. Briefly define the Snowflake Schema.
The snowflake schema, one of the popular design schemas, is a basic extension of the star schema that includes additional dimensions. The term comes from the way it resembles the structure of a snowflake. In the snowflake schema , the data is organized and, after normalization, divided into additional tables.
What is the difference between the KNN and k-means methods?
The k-means method is an unsupervised learning algorithm used as a clustering technique, whereas the K-nearest-neighbor is a supervised learning algorithm for classification and regression problems.
KNN algorithm uses feature similarity, whereas the K-means algorithm refers to dividing data points into clusters so that each data point is placed precisely in one cluster and not across many.
What is the purpose of A/B testing?
A/B testing is a randomized experiment performed on two variants, ‘A’ and ‘B.’ It is a statistics-based process involving applying statistical hypothesis testing, also known as “two-sample hypothesis testing.” In this process, the goal is to evaluate a subject’s response to variant A against its response to variant B to determine which variants are more effective in achieving a particular outcome.
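The standard evaluation is a two-proportion z-test on the conversion rates of the two variants. A sketch with invented counts, using only the standard library:

```python
from statistics import NormalDist

# Invented results: conversions / visitors for each variant.
conv_a, n_a = 200, 2000   # variant A converts at 10%
conv_b, n_b = 260, 2000   # variant B converts at 13%

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)

# Standard error of the difference under the null hypothesis.
se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), p_value < 0.05)  # 2.97 True
```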
What do you mean by collaborative filtering?
Collaborative filtering is a method used by recommendation engines. In the narrow sense, collaborative filtering is a technique used to automatically predict a user's tastes by collecting various information regarding the interests or preferences of many other users. This technique works on the logic that if person 1 and person 2 have the same opinion on one particular issue, then person 1 is likely to have the same opinion as person 2 on another issue than another random person. In general, collaborative filtering is the process that filters information using techniques involving collaboration among multiple data sources and viewpoints.
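A minimal user-based collaborative filtering sketch with invented ratings: cosine similarity over co-rated items identifies the neighbor whose tastes should drive recommendations.

```python
# Hypothetical user-item ratings on a 1-5 scale.
ratings = {
    "alice": {"item_a": 5, "item_b": 4, "item_c": 1},
    "bob":   {"item_a": 5, "item_b": 5, "item_c": 1},
    "carol": {"item_a": 1, "item_b": 2, "item_c": 5},
}

def similarity(u, v):
    # Cosine similarity over the items both users have rated.
    common = set(ratings[u]) & set(ratings[v])
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den_u = sum(ratings[u][i] ** 2 for i in common) ** 0.5
    den_v = sum(ratings[v][i] ** 2 for i in common) ** 0.5
    return num / (den_u * den_v)

# Alice is far more similar to Bob than to Carol, so Bob's other
# preferences would drive Alice's recommendations.
print(similarity("alice", "bob") > similarity("alice", "carol"))  # True
```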
What are some biases that can happen while sampling?
Some common types of bias that occur while sampling are:
Undercoverage- The undercoverage bias occurs when there is an inadequate representation of some members of a particular population in the sample.
Observer Bias- Observer bias occurs when researchers unintentionally project their expectations on the research. There may be occurrences where the researcher unintentionally influences surveys or interviews.
Self-Selection Bias- Self-selection bias, also known as volunteer response bias, happens when the research study participants take control over the decision to participate in the survey. The individuals may be biased and are likely to share some opinions that are different from those who choose not to participate. In such cases, the survey will not represent the entire population.
Survivorship Bias- The survivorship bias occurs when a sample is more concentrated on subjects that passed the selection process or criterion and ignore the subjects who did not pass the selection criteria. Survivorship biases can lead to overly optimistic results.
Recall Bias- Recall bias occurs when a respondent fails to remember things correctly.
Exclusion Bias- The exclusion bias occurs due to the exclusion of certain groups while building the sample.
What is a distributed cache?
A distributed cache pools the RAM in multiple computers networked into a single in-memory data store to provide fast access to data. Most traditional caches tend to be in a single physical server or hardware component. Distributed caches, however, grow beyond the memory limits of a single computer as they link multiple computers, providing larger and more efficient processing power. Distributed caches are useful in environments that involve large data loads and volumes. They allow scaling by adding more computers to the cluster and allowing the cache to grow based on requirements.
Explain how Big Data and Hadoop are related to each other.
Apache Hadoop is a collection of open-source libraries for processing large amounts of data. Hadoop supports distributed computing, where you process data across multiple computers in clusters. Previously, if an organization had to process large volumes of data, it had to buy expensive hardware. Hadoop has made it possible to shift the dependency from hardware to achieve high performance, reliability, and fault tolerance through the software itself. Hadoop can be useful when there is Big Data and insights generated from the Big Data. Hadoop also has robust community support and is evolving to process, manage, manipulate and visualize Big Data in new ways.
21. Briefly define COSHH.
COSHH is an acronym for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems. As the name implies, it offers scheduling at both the cluster and application levels to speed up job completion.
22. Give a brief overview of the major Hadoop components.
Working with Hadoop involves many different components, some of which are listed below:
Hadoop Common: This comprises all the tools and libraries typically used by the Hadoop application.
Hadoop Distributed File System (HDFS): When using Hadoop, all data is present in the HDFS, or Hadoop Distributed File System. It offers an extremely high bandwidth distributed file system.
Hadoop YARN: The Hadoop system uses YARN, or Yet Another Resource Negotiator, to manage resources. YARN can also be useful for task scheduling.
Hadoop MapReduce: Hadoop MapReduce is a framework for large-scale data processing that gives users access to parallel, distributed computation across the cluster.
23. List some of the essential features of Hadoop.
Hadoop is a user-friendly open source framework.
Hadoop is highly scalable. Hadoop can handle any sort of dataset effectively, including structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos) data.
Parallel computing ensures efficient data processing in Hadoop.
Hadoop ensures data availability even if one of your systems crashes by copying data across several DataNodes in a Hadoop cluster .
24. What methods does Reducer use in Hadoop?
The three primary methods to use with reducer in Hadoop are as follows:
setup(): This function is mostly useful to set input data variables and cache protocols.
cleanup(): This procedure is useful for deleting temporary files saved.
reduce(): This method is used only once for each key and is the most crucial component of the entire reducer.
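The contract of reduce(), invoked once per key with all values grouped under that key, can be mimicked in pure Python with a word-count style example (the mapper output below is invented):

```python
from itertools import groupby

# Mapper output: (key, value) pairs, as in word count.
mapped = [("data", 1), ("big", 1), ("data", 1), ("data", 1)]

def reduce_fn(key, values):
    # Invoked once per key with every value grouped under that key.
    return key, sum(values)

results = [
    reduce_fn(key, [v for _, v in group])
    for key, group in groupby(sorted(mapped), key=lambda kv: kv[0])
]
print(results)  # [('big', 1), ('data', 3)]
```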
25. What are the various design schemas in data modeling?
There are two fundamental design schemas in data modeling: star schema and snowflake schema.
Star Schema- The star schema is the most basic type of data warehouse schema . Its structure is similar to that of a star, where the star's center may contain a single fact table and several associated dimension tables. The star schema is efficient for data modeling tasks such as analyzing large data sets.
Snowflake Schema- The snowflake schema is an extension of the star schema. In terms of structure, it adds more dimensions and has a snowflake-like appearance. Data is split into additional tables, and the dimension tables are normalized.
26. What are the components that the Hive data model has to offer?
Some major components in a Hive data model are tables, partitions, and buckets.
Python is crucial in implementing data engineering techniques . Pandas, NumPy, NLTK , SciPy, and other Python libraries are ideal for various data engineering tasks such as faster data processing and other machine learning activities. Data engineers primarily focus on data modeling and data processing architecture but also need a fundamental understanding of algorithms and data structures. Take a look at some of the data engineer interview questions based on various Python concepts, including Python libraries, algorithms, data structures, etc. These data engineer interview questions cover Python libraries like Pandas , NumPy, and SciPy.
Differentiate between *args and **kwargs.
*args in function definitions are used to pass a variable number of arguments to a function when calling the function. By using the *, a variable associated with it becomes iterable.
**kwargs in function definitions are used to pass a variable number of keyworded arguments to a function while calling the function. The double star allows passing any number of keyworded arguments.
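For example:

```python
def summarize(*args, **kwargs):
    # args arrives as a tuple, kwargs as a dict.
    return f"args={args}, kwargs={kwargs}"

print(summarize(1, 2, label="total"))
# args=(1, 2), kwargs={'label': 'total'}
```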
What is the difference between “is” and “==”?
Python's “is” operator checks whether two variables point to the same object. “==” is used to check whether the values of two variables are the same.
E.g., consider the following code:

```python
a = [1, 2, 3]
b = [1, 2, 3]

a == b  # True: the values in the two lists are the same
a is b  # False: a and b refer to two different objects

c = b
c is b  # True: c and b point to the same object
```
How is memory managed in Python?
Memory in Python exists in the following way:
The objects and data structures initialized in a Python program are present in a private heap, and programmers do not have permission to access the private heap space.
You can allocate heap space for Python objects using the Python memory manager. The core API of the memory manager gives the programmer access to some of the tools for coding purposes.
Python has a built-in garbage collector that recycles unused memory and frees up memory for heap space.
What is a decorator?
A decorator is a tool in Python which allows programmers to wrap another function around a function or a class to extend the behavior of the wrapped function without making any permanent modifications to it. Functions in Python are first-class objects, meaning functions can be passed or used as arguments. A function works as the argument for another function in a decorator, which you can call inside the wrapper function.
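A minimal sketch of a decorator (the name `count_calls` is illustrative); `functools.wraps` preserves the wrapped function's metadata:

```python
import functools

def count_calls(func):
    # the wrapper adds behaviour (call counting) without modifying func itself
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def add(a, b):
    return a + b

total = add(1, 2) + add(3, 4)   # add.calls is now 2
```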
Are lookups faster with dictionaries or lists in Python?
The time complexity to look up a value in a list in Python is O(n) since the whole list iterates through to find the value. Since a dictionary is a hash table, the time complexity to find the value associated with a key is O(1). Hence, a lookup is generally faster with a dictionary, but a limitation is that dictionaries require unique keys to store the values.
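A small sketch of the two membership tests (the variable names are illustrative):

```python
n = 100_000
as_list = list(range(n))
as_dict = {value: None for value in as_list}

# membership in a list is an O(n) linear scan;
# membership in a dict is an O(1) average-case hash probe
found_in_list = (n - 1) in as_list
found_in_dict = (n - 1) in as_dict
```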
How can you return the binary of an integer?
The bin() function works on a variable to return its binary equivalent.
How can you remove duplicates from a list in Python?
A list can be converted into a set and then back into a list to remove the duplicates. Sets do not contain duplicate data in Python.
list1 = [5,9,4,8,5,3,7,3,9]
list2 = list(set(list1))
list2 will contain [5,9,4,8,3,7], though set() may not preserve the original order of items in the list.
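If the original order must be preserved, one common alternative is dict.fromkeys(), since dict keys keep insertion order in Python 3.7+:

```python
list1 = [5, 9, 4, 8, 5, 3, 7, 3, 9]

# dict keys preserve insertion order, so this drops duplicates
# while keeping each element's first occurrence in place
deduped = list(dict.fromkeys(list1))
# deduped: [5, 9, 4, 8, 3, 7]
```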
What is the difference between append and extend in Python?
The argument passed to append() is added as a single element to a list in Python. The list length increases by one, and the time complexity for append is O(1).
The argument passed to extend() is iterated over, and each element of the argument adds to the list. The length of the list increases by the number of elements in the argument passed to extend(). The time complexity for extend is O(n), where n is the number of elements in the argument passed to extend.
list1 = ["Python", "data", "engineering"]
list2 = ["projectpro", "interview", "questions"]
list2.append(list1)
list2 will now be: ["projectpro", "interview", "questions", ["Python", "data", "engineering"]]
The length of list2 is 4.
Using extend instead of append:
list2.extend(list1)
list2 will now be: ["projectpro", "interview", "questions", "Python", "data", "engineering"]
The length of list2, in this case, becomes 6.
When do you use pass, continue and break?
The break statement in Python terminates the loop that contains it. If a break statement is present in a nested loop, it terminates only the innermost loop in which it appears, and control passes to the first statement after that loop.
The continue statement forces control to stop the current iteration of the loop and execute the next iteration rather than terminating the loop completely. If a continue statement is present within a loop, it leads to skipping the code following it for that iteration, and the next iteration gets executed.
Pass statement in Python does nothing when it executes, and it is useful when a statement is syntactically required but has no command or code execution. The pass statement can write empty loops and empty control statements, functions, and classes.
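A small sketch showing all three statements in one loop:

```python
results = []
for n in range(10):
    if n == 7:
        break        # terminates the loop entirely
    if n % 2 == 0:
        continue     # skips the rest of this iteration
    if n == 5:
        pass         # placeholder: does nothing, execution falls through
    results.append(n)
# results: [1, 3, 5]
```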
How can you check if a given string contains only letters and numbers?
str.isalnum() can be used to check whether a string ‘str’ contains only letters and numbers.
Mention some advantages of using NumPy arrays over Python lists.
NumPy arrays take up less space in memory than lists.
NumPy arrays are faster than lists.
NumPy arrays have built-in functions optimized for various techniques such as linear algebra, vector, and matrix operations.
Lists in Python do not allow element-wise operations, but NumPy arrays can perform element-wise operations.
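The element-wise point can be sketched directly; note how the same operators behave differently on plain lists:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

elementwise_sum = a + b     # element-wise: array([11, 22, 33, 44])
doubled = a * 2             # element-wise: array([2, 4, 6, 8])

# the same operators on plain lists concatenate and repeat instead
concatenated = [1, 2] + [3, 4]   # [1, 2, 3, 4]
repeated = [1, 2] * 2            # [1, 2, 1, 2]
```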
In Pandas, how can you create a dataframe from a list?
import pandas as pd
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday']
# Calling DataFrame constructor on list
df = pd.DataFrame(days)
df is the data frame created from the list ‘days’.
df = pd.DataFrame(days, index=['1', '2', '3', '4'], columns=['Days'])
Can be used to create the data frame and the values for the index and columns.
In Pandas, how can you find the median value in a column “Age” from a dataframe “employees”?
The median() function can be used to find the median value in a column. E.g. employees["Age"].median()
In Pandas, how can you rename a column?
The rename() function can be used to rename columns of a data frame by passing a mapping from old names to new names.
To rename address_line_1 to 'region' and address_line_2 to 'city':
df = df.rename(columns={'address_line_1': 'region', 'address_line_2': 'city'})
How can you identify missing values in a data frame?
The isnull() function helps to identify missing values in a given data frame.
The syntax is DataFrame.isnull()
It returns a dataframe of boolean values with the same shape as the original data frame. Missing values in the original data frame are mapped to True, and non-missing values are mapped to False.
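A minimal sketch with a hypothetical employees frame; isnull().sum() is a common follow-up that counts the missing values per column:

```python
import pandas as pd
import numpy as np

# hypothetical frame with two missing values
employees = pd.DataFrame({
    "name": ["Ana", "Raj", None],
    "age":  [34.0, np.nan, 29.0],
})

mask = employees.isnull()                  # boolean frame, True where missing
missing_per_column = employees.isnull().sum()
```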
What is SciPy?
SciPy is an open-source Python library that is useful for scientific computations. SciPy is short for Scientific Python and is used to solve complex mathematical and scientific problems. SciPy is built on top of NumPy and provides effective, user-friendly functions for numerical optimization. The SciPy library comes equipped with functions to support integration, ordinary differential equation solvers, special functions, and support for several other technical computing functions.
Given a 5x5 matrix in NumPy, how will you inverse the matrix?
The function numpy.linalg.inv() can help you inverse a matrix. It takes a matrix as the input and returns its inverse. Mathematically, the inverse of a matrix M exists only when its determinant is non-zero:
if det(M) != 0:
    inverse(M) = adjoint(M) / determinant(M)
else:
    the inverse does not exist
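A sketch of the guarded inversion (the matrix here is a hand-picked invertible example, not from the original text):

```python
import numpy as np

# a deterministic, invertible 5x5 matrix (eigenvalues 1 and 1.5, so det != 0)
m = np.eye(5) + 0.1 * np.ones((5, 5))

if np.linalg.det(m) != 0:
    inverse = np.linalg.inv(m)
    # M @ M^-1 should be numerically close to the identity matrix
    is_valid = np.allclose(m @ inverse, np.eye(5))
else:
    inverse, is_valid = None, False
```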
What is an ndarray in NumPy?
In NumPy, an array is a table of elements that are all of the same type, indexed by a tuple of positive integers. To create an array in NumPy, you create an n-dimensional array object. An ndarray is the n-dimensional array object defined in NumPy to store a collection of elements of the same data type.
Using NumPy, create a 2-D array of random integers between 0 and 500 with 4 rows and 7 columns.
from numpy import random
x = random.randint(500, size=(4, 7))
Note that randint's upper bound is exclusive, so this generates integers from 0 to 499.
Find all the indices in an array of NumPy where the value is greater than 5.
import numpy as np
array = np.array([5,9,6,3,2,1,9])
To find the indices of values greater than 5:
np.where(array > 5)
Gives the output (array([1, 2, 6]),)
Microsoft Excel is one of the most popular data engineering tools in the big data industry. In contrast to BI tools, which ingest processed data supplied by the data engineering pipeline, Excel gives data engineers flexibility and control over data entry. Here are some data engineer interview questions on Microsoft Excel and its features.
What are Freeze Panes in MS Excel?
Freeze panes are used in MS Excel to lock a particular row or column. The rows or columns you lock will be visible on the screen even when scrolling the sheet horizontally or vertically.
To freeze panes on Excel:
First, select the cell to the right of the columns and below the rows to be kept visible.
Select View > Freeze Panes > Freeze Panes.
What is meant by a ribbon?
In Excel, the ribbon is the topmost area of the window. It contains the toolbars and menu items available in Excel. The ribbon has multiple tabs, each with its own command set. You can toggle the ribbon between shown and hidden using CTRL+F1.
How can you prevent someone from copying the data in your spreadsheet?
In Excel, you can protect a worksheet so that data cannot be copied from its cells. To copy and paste data from a protected worksheet, you must remove the sheet protection, unlock all cells, and then lock only those cells that are not to be changed or removed. To protect a worksheet, go to Menu -> Review -> Protect Sheet -> Password. Using a unique password, you can protect the sheet from being copied by others.
How can you find the sum of columns in Excel?
The SUM function can be used to find the sum of a range of cells in an Excel spreadsheet.
For example, =SUM(A5:F5) returns the sum of the values in columns A through F of the 5th row.
Explain macros in Excel.
Macros in Excel refer to an action or a set of actions that can be recorded and saved to run as often as required. Macros can be given names and used to save time on frequently performed tasks. Excel stores macros as VBA code, and you can view the code using the VBA editor. You can assign macros to objects, including shapes, graphics, or controls.
What is the order of operations followed for evaluating expressions in Excel?
Excel follows the same order of operations as in standard mathematics, which is indicated by “PEMDAS” where:
P - Parentheses
E - Exponent
M - Multiplication
D - Division
A - Addition
S - Subtraction
Explain pivot tables in Excel.
A pivot table is a tool consisting of a table of grouped values, in which individual items of a larger, more extensive table are aggregated within one or more discrete categories. It is useful for quickly summarizing large amounts of unstructured data. It can automatically sort, total, count, or average the data in a spreadsheet and display the results in another spreadsheet. Pivot tables save time and allow linking external data sources to Excel.
Mention some differences between SUBSTITUTE and REPLACE functions in Excel.
The SUBSTITUTE function in Excel is useful to find a match for a particular text and replace it. The REPLACE function replaces the text, which you can identify using its position.
=SUBSTITUTE(text, text_to_be_replaced, text_to_replace_old_text_with, [instance_number])
where text refers to the text in which the replacements are performed, and
instance_number refers to which occurrence of the match should be replaced; if omitted, every occurrence is replaced.
E.g. consider a cell A5 which contains "Bond007"
=SUBSTITUTE(A5, "0", "1", 1) gives the result "Bond107"
=SUBSTITUTE(A5, "0", "1", 2) gives the result "Bond017"
=SUBSTITUTE(A5, "0", "1") gives the result "Bond117"
=REPLACE(old_text, start_num, num_chars, new_text)
Where start_num - starting position of old_text to be replaced
num_chars - number of characters to be replaced
=REPLACE(A5, 5, 1, “99”) gives the result “Bond9907”
What is the use of the IF function in Excel?
The IF function in Excel performs the logic test and is used to check whether a given condition is true or false, then perform further operations based on the result.
The syntax is:
=IF (test condition, value if true, value if false)
What filter will you use if you want more than two conditions or if you want to analyze the list using the database function?
You can use the Advanced Criteria Filter to analyze a list or in cases where you need to test more than two conditions.
What does it mean if there is a red triangle at the top right-hand corner of a cell?
A red triangle at the top right-hand corner of a cell indicates a comment associated with that particular cell. You can view the comment by hovering the cursor over it.
You will spend much of your career using SQL if you are a data engineer working in an organization. Building a strong foundation in SQL is crucial since you can save significant time and effort by leveraging its features effectively. Also acquire solid knowledge of databases, whether relational systems such as Oracle or NoSQL stores. Questions addressing data modeling and database architecture test your understanding of entity-relationship modeling, normalization and denormalization, dimensional modeling, and related ideas. Below are a few data engineer interview questions on SQL concepts, data storage, data retrieval, and more.
What is meant by Aggregate Functions in SQL?
In SQL, aggregate functions are functions that group values from multiple rows to form a single summary value. Aggregate functions in SQL include COUNT(), MIN(), MAX(), SUM(), and AVG().
How would you find duplicates using an SQL query?
To find duplicates in a single column:
SELECT column_name, COUNT(column_name)
FROM table_name
GROUP BY column_name
HAVING COUNT(column_name) > 1;
will display all the values in a column that occur more than once.
To find duplicates across multiple columns of a table:
SELECT column1_name, column2_name, COUNT(*)
FROM table_name
GROUP BY column1_name, column2_name
HAVING COUNT(*) > 1;
will display all combinations of column1 and column2 values that occur more than once.
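The pattern can be checked end-to-end with Python's built-in sqlite3 module (the table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("a@x.com",), ("b@x.com",), ("a@x.com",)])

# group by the column, then keep only groups appearing more than once
dupes = conn.execute("""
    SELECT email, COUNT(email)
    FROM users
    GROUP BY email
    HAVING COUNT(email) > 1
""").fetchall()
# dupes: [('a@x.com', 2)]
```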
What is a primary key in SQL?
In SQL, a primary key refers to a field in a table that can uniquely identify rows in that table. Primary keys must have unique values, and a primary key value cannot be NULL. A table can have only one primary key and can be a single field or multiple fields. When you use multiple fields as the primary key, they are collectively known as the composite key.
What is meant by the UNIQUE constraint in SQL?
The UNIQUE constraint is used on columns in SQL to ensure that all the values in a particular column are different. The UNIQUE constraint and the PRIMARY KEY both guarantee uniqueness for a column. However, there can be only one PRIMARY KEY per table, whereas you can specify the UNIQUE constraint for multiple columns. After creating the table, you can add or drop UNIQUE constraints on columns.
What are the different kinds of joins in SQL?
A JOIN clause combines rows across two or more tables with a related column. The different kinds of joins supported in SQL are:
(INNER) JOIN: returns the records that have matching values in both tables.
LEFT (OUTER) JOIN: returns all records from the left table with their corresponding matching records from the right table.
RIGHT (OUTER) JOIN: returns all records from the right table and their corresponding matching records from the left table.
FULL (OUTER) JOIN: returns all records with a matching record in either the left or right table.
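INNER and LEFT joins can be demonstrated with sqlite3 (table names and data are illustrative; note that order 101 references a customer that does not exist):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'Ana'), (2, 'Raj')")
conn.execute("INSERT INTO orders VALUES (100, 1), (101, 3)")

# INNER JOIN: only orders whose customer_id matches an existing customer
inner = conn.execute("""
    SELECT o.id, c.name FROM orders o
    JOIN customers c ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN: every order, with NULL (None) where no customer matches
left = conn.execute("""
    SELECT o.id, c.name FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()
```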
What do you mean by index and indexing in SQL?
In SQL, an index is a special lookup table that the database search engine uses to speed up data retrieval. Indexes speed up SELECT queries and WHERE clauses but slow down UPDATE and INSERT statements, since the index must also be updated when data changes. Indexes can be created or dropped without affecting the data. Indexing is a method for optimizing database efficiency by reducing the number of disk accesses required during query execution.
How is a clustered index different from a non-clustered index in SQL?
Clustered indexes in SQL modify how you store records in the database based on the indexed column. They are useful for the speedy retrieval of data from the database. Non-clustered indexes create a different entity within the table that references the original table. They are relatively slower than clustered indexes, and SQL allows only a single clustered index but multiple non-clustered indexes.
Differentiate between IN and BETWEEN operators.
The BETWEEN operator in SQL tests if a particular expression lies between a range of values. The values can be in the form of text, dates, or numbers. You can use the BETWEEN operator with SELECT, INSERT, UPDATE, and DELETE statements. In a query, the BETWEEN condition helps to return all values that lie within the range. The range is inclusive. The syntax is of BETWEEN is as follows:
WHERE column_name BETWEEN value1 AND value2;
The IN operator tests whether an expression matches the values specified in a list of values. It helps to eliminate the need of using multiple OR conditions. NOT IN operator may exclude certain rows from the query return. IN operator may also be used with SELECT, INSERT, UPDATE, and DELETE statements. The syntax is:
WHERE column_name IN (list_of_values);
What is a foreign key in SQL?
A foreign key is a field or a collection of fields in one table that can refer to the primary key in another table. The table which contains the foreign key is the child table, and the table containing the primary key is the parent table or the referenced table. The purpose of the foreign key constraint is to prevent actions that would destroy links between tables.
What is a cursor?
A cursor is a temporary work area created in system memory. It is allocated by the server when a user performs DML operations on a table, and it holds the set of rows returned by a query. SQL provides two types of cursors:
Implicit Cursors: they are allocated by the SQL server when users perform DML operations.
Explicit Cursors: Users create explicit cursors based on requirements. Explicit cursors allow you to fetch table data in a row-by-row method.
What is an alias in SQL?
An alias enables you to give a table or a particular column in a table a temporary name to make the table or column name more readable for that specific query. Aliases only exist for the duration of the query.
The syntax for creating a column alias:
SELECT column_name AS alias_name FROM table_name;
The syntax for creating a table alias:
SELECT column_name(s) FROM table_name AS alias_name;
What is meant by normalization in SQL?
Normalization is a method used to minimize redundancy, inconsistency, and dependency in a database by organizing the fields and tables. It involves adding, deleting, or modifying fields that can go into a single table. Normalization allows you to break the tables into smaller partitions and link these partitions through different relationships to avoid redundancy.
Some rules followed in database normalization, also known as normal forms, are:
1NF - first normal form
2NF - second normal form
3NF - third normal form
BCNF - Boyce-Codd normal form
What is a stored procedure?
Stored procedures are used in SQL to save a set of statements that need to run repeatedly, so you can reuse them whenever required.
The syntax for creating a stored procedure (SQL Server):
CREATE PROCEDURE procedure_name
AS
sql_statement
GO;
Syntax for executing a stored procedure:
EXEC procedure_name;
A stored procedure can take parameters at the time of execution so that it can run based on the values passed as parameters.
Write a query to select all statements that contain “ind” in their name from a table named places.
SELECT * FROM places WHERE name LIKE '%ind%';
Which SQL query can be used to delete a table from the database but keep its structure intact?
The TRUNCATE command deletes all the rows from a table but keeps its structure intact. The columns, indexes, and constraints remain in place when using the TRUNCATE statement.
Write an SQL query to find the second highest sales from an " Apparels " table.
select min(sales) from
(select distinct sales from Apparels order by sales desc)
where rownum < 3;
This uses Oracle's ROWNUM: the subquery keeps the distinct sales values in descending order, the outer WHERE restricts the result to the top two, and MIN() returns the second highest.
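In databases that support LIMIT/OFFSET (SQLite, MySQL, PostgreSQL), an equivalent and more portable form can be sketched with sqlite3 (the sample data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Apparels (sales INTEGER)")
conn.executemany("INSERT INTO Apparels VALUES (?)",
                 [(500,), (900,), (900,), (700,)])

# DISTINCT first, so a tie for the highest value does not hide
# the true second-highest figure
second_highest = conn.execute("""
    SELECT DISTINCT sales FROM Apparels
    ORDER BY sales DESC
    LIMIT 1 OFFSET 1
""").fetchone()[0]
# second_highest: 700
```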
Is a blank space or a zero value treated the same way as the operator NULL?
NULL in SQL is not the same as zero or a blank space. NULL is used in the absence of any value and is said to be unavailable, unknown, unassigned, or inapplicable. Zero is a number, and a blank space is treated as a character. You can compare a blank space or zero to another blank space or zero, but you cannot compare one NULL with another NULL.
What is the default ordering of the ORDER BY clause and how can this be changed?
The ORDER BY clause is useful for sorting the query result in ascending or descending order. By default, the query sorts in ascending order. The following statement can change the order:
SELECT expressions FROM table_name
ORDER BY expression DESC;
Will the following query return an output?
SELECT employee_id, AVG(sales) FROM employees WHERE AVG(sales) > 70000 GROUP BY month;
No, the above query will not return an output since you cannot use the WHERE clause to restrict the groups. To generate output in this query, you should use the HAVING clause.
What is meant by SQL injection?
SQL injection is a type of vulnerability in SQL code that allows attackers to control back-end database operations and access, retrieve, or destroy sensitive data present in databases. SQL injection involves inserting malicious SQL code into a database entry field. When the code executes, the database becomes vulnerable to attack; SQL injection is also known as an SQLi attack.
What statement does the system execute whenever a database is modified?
Whenever a database is modified, the system executes a trigger command.
Write an SQL query to find all students’ names from a table named ‘Students’ that end with ‘T’.
SELECT stud_name FROM Students WHERE stud_name LIKE '%T';
Mention some differences between the DELETE and TRUNCATE statements in SQL.
DELETE is a DML command that removes rows one at a time and can take a WHERE clause to remove only specific rows; each deleted row is logged, so the operation can be rolled back, but it is slower on large tables. TRUNCATE is a DDL command that removes all rows at once, cannot take a WHERE clause, is faster, typically resets any identity counter, and keeps the table structure intact.
What is a trigger in SQL?
In SQL, a trigger refers to a set of statements in a system catalog that runs whenever DML (Data Manipulation Language) commands run on a system. It is a special stored procedure that gets called automatically in response to an event. Triggers allow the execution of a batch of code whenever an insert, update or delete command is executed for a specific table. You can create a trigger by using the CREATE TRIGGER statement. The syntax is:
CREATE TRIGGER trigger_name
[BEFORE | AFTER] [INSERT | UPDATE | DELETE]
ON table_name FOR EACH ROW
trigger_body;
Most businesses are switching to cloud infrastructure these days. Organizations employ a variety of providers, including AWS, Google Cloud, and Azure, for their BI and Machine Learning applications. Microsoft Azure allows data engineers to build and deploy applications using various solutions. Check out these common data engineer interview questions on various Microsoft Azure concepts, tools, and frameworks.
76. Explain the features of Azure Storage Explorer.
It's a robust stand-alone application that lets you manage Azure Storage from any platform, including Windows, Mac OS, and Linux.
An easy-to-use interface gives you access to many Azure data stores, including ADLS Gen2, Cosmos DB, Blobs, Queues, Tables, etc.
One of the most significant aspects of Azure Storage Explorer is that it enables users to work despite being disconnected from the Azure cloud service using local emulators.
77. What are the various types of storage available in Azure?
In Microsoft Azure , there are five storage types classified into two categories.
The first group comprises Queue Storage, Table Storage, and Blob Storage . It is built with data storage, scalability, and connectivity and is accessible through a REST API.
The second group comprises File Storage and Disk Storage , which boosts the functionalities of the Microsoft Azure Virtual Machine environment and is only accessible through Virtual Machines.
Queue Storage enables you to create versatile applications that comprise independent components depending on asynchronous message queuing. Azure Queue storage stores massive volumes of messages accessible by authenticated HTTP or HTTPS queries anywhere.
Table Storage in Microsoft Azure holds structured NoSQL data. The storage is highly extensible while also being efficient in storing data. However, if you access temporary files frequently, it becomes more expensive. This storage can be helpful to those who find Microsoft Azure SQL too costly and don't require the SQL structure and architecture.
Blob Storage supports unstructured data/huge data files such as text documents, images, audio, video files, etc. In Microsoft Azure, you can store blobs in three ways: Block Blobs, Append Blobs, and Page Blobs.
File Storage serves the needs of the Azure VM environment. You can use it to store huge data files accessible from multiple Virtual Machines. File Storage allows users to share any data file via the SMB (Server Message Block) protocol.
Disk Storage serves as a storage option for Azure virtual machines. It enables you to construct virtual machine disks. Only one virtual machine can access a disk in Disk Storage.
78. What data security solutions does Azure SQL DB provide?
In Azure SQL DB, there are several data security options:
Azure SQL Firewall Rules: There are two levels of security available in Azure.
The first are server-level firewall rules, which are present in the SQL Master database and specify which Azure database servers are accessible.
The second type of firewall rule is database-level firewall rules, which monitor database access.
Azure SQL Database Auditing: The SQL Database service in Azure offers auditing features. It allows you to define the audit policy at the database server or database level.
Azure SQL Transparent Data Encryption: TDE encrypts and decrypts databases and performs backups and transactions on log files in real-time.
Azure SQL Always Encrypted: This feature safeguards sensitive data in the Azure SQL database , such as credit card details.
79. What do you understand by PolyBase?
Polybase is a system that uses the Transact-SQL language to access external data stored in Azure Blob storage, Hadoop, or the Azure Data Lake repository. This is the most efficient way to load data into an Azure Synapse SQL Pool. Polybase facilitates bidirectional data movement between Synapse SQL Pool and external resources, resulting in faster load performance.
PolyBase allows you to access data in Hadoop, Azure Blob Storage, or Azure Data Lake Store from Azure SQL Database or Azure Synapse Analytics.
PolyBase uses relatively easy T-SQL queries to import data from Hadoop, Azure Blob Storage, or Azure Data Lake Store without any third-party ETL tool.
PolyBase allows you to export and retain data to external data repositories.
80. What is the best way to capture streaming data in Azure?
Azure has a separate analytics service called Azure Stream Analytics , which supports the Stream Analytics Query Language, a primary SQL-based language.
It enables you to extend the query language's capabilities by introducing new Machine Learning functions.
Azure Stream Analytics can analyze a massive volume of structured and unstructured data at around a million events per second and provide relatively low latency outputs.
81. Discuss the different windowing options available in Azure Stream Analytics.
Stream Analytics has built-in support for windowing functions, allowing developers to quickly create complicated stream processing jobs. Five types of temporal windows are available: Tumbling, Hopping, Sliding, Session, and Snapshot.
Tumbling window functions take a data stream and divide it into discrete temporal segments, then apply a function to each. Tumbling windows often recur, do not overlap, and one event cannot correspond to more than one tumbling window.
Hopping window functions progress in time by a set period. Think of them as Tumbling windows that can overlap and emit more frequently than the window size allows. Events can appear in multiple Hopping window result sets. Set the hop size to the same as the window size to make a Hopping window look like a Tumbling window.
Unlike Tumbling or Hopping windows, Sliding windows only emit events when the window's content changes. As a result, each window contains at least one event, and events, like hopping windows, can belong to many sliding windows.
Session window functions combine events that coincide and filter out periods when no data is available. The three primary variables in Session windows are timeout, maximum duration, and partitioning key.
Snapshot windows bring together events having the same timestamp. You can implement a snapshot window by adding System.Timestamp() to the GROUP BY clause, unlike most windowing function types that involve a specialized window function (such as SessionWindow()).
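The Tumbling-window idea can be illustrated with a toy in-memory sketch in Python (this is not Stream Analytics code; the event data and 10-second window size are assumptions for illustration):

```python
from collections import defaultdict

# (timestamp_in_seconds, value) events from a hypothetical stream
events = [(1, 5), (4, 3), (11, 7), (12, 1), (25, 9)]

WINDOW = 10  # a 10-second tumbling window
sums = defaultdict(int)
for ts, value in events:
    # each event falls into exactly one non-overlapping bucket: 0, 10, 20, ...
    window_start = (ts // WINDOW) * WINDOW
    sums[window_start] += value   # a per-window SUM aggregate
# sums: {0: 8, 10: 8, 20: 9}
```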
82. Discuss the different consistency models in Cosmos DB.
There are five distinct consistency models/levels in Azure Cosmos DB , starting from strongest to weakest-
Strong - It guarantees linearizability: operations appear to take effect atomically, in a single real-time order. Reads will always return the item's most recent committed version. Uncommitted or incomplete writes are never visible to the client, and users are always guaranteed to read the latest committed write.
Bounded staleness - It guarantees the reads to follow the consistent prefix guarantee. Reads may lag writes by "K" versions (that is, "updates") of an item or "T" time interval, whichever comes first.
Session - It guarantees reads to honor the consistent prefix, monotonic reads and writes, read-your-writes, and write-follows-reads guarantees in a single client session. This implies that only one "writer" session or several authors share the same session token.
Consistent prefix - It returns updates with a consistent prefix throughout all updates and has no gaps. Reads will never detect out-of-order writes if the prefix consistency level is constant.
Eventual - There is no guarantee for ordering of reads in eventual consistency. The replicas gradually converge in the lack of further writes.
83. What are the various types of Queues that Azure offers?
Storage queues and Service Bus queues are the two queue techniques that Azure offers.
Storage queues - Azure Storage system includes storage queues. You can save a vast quantity of messages on them. Authorized HTTP or HTTPS calls allow you to access messages from anywhere. A queue can hold millions of messages up to the storage account's overall capacity limit. Queues can build a backlog of work for asynchronous processing.
Service Bus queues are present in the Azure messaging infrastructure, including queuing, publish/subscribe, and more advanced integration patterns. They mainly connect applications or parts of applications that encompass different communication protocols, data contracts, trust domains, or network settings.
84. What are the different data redundancy options in Azure Storage?
When it comes to data replication in the primary region, Azure Storage provides two choices:
Locally redundant storage (LRS) replicates your data three times synchronously in a single physical location in the primary area. Although LRS is the cheapest replication method, it is unsuitable for high availability or durability applications.
Zone-redundant storage (ZRS) synchronizes data across three Azure availability zones in the primary region. Microsoft advises adopting ZRS in the primary region and replicating it in a secondary region for high-availability applications.
Azure Storage provides two options for moving your data to a secondary area:
Geo-redundant storage (GRS) synchronizes three copies of your data within a single physical location using LRS in the primary area. It moves your data to a single physical place in the secondary region asynchronously.
Geo-zone-redundant storage (GZRS) uses ZRS to synchronize data across three Azure availability zones in the primary region. It then asynchronously moves your data to a single physical place in the secondary region.
Data engineers may leverage cloud-based services like AWS to help enterprises overcome some of the issues they face as they deal with large data volumes. Practice these data engineering interview questions below to impress your hiring manager with your data engineering skills in cloud computing .
85. What logging capabilities does AWS Security offer?
AWS CloudTrail allows security analysis, resource change tracking, and compliance auditing of an AWS environment by providing a history of AWS API calls for an account. CloudTrail sends log files to a chosen Amazon Simple Storage Service (Amazon S3) bucket, with optional log file integrity validation.
Amazon S3 Access Logs record individual requests to Amazon S3 buckets and can be capable of monitoring traffic patterns, troubleshooting, and security and access audits. It can also assist a business in gaining a better understanding of its client base, establishing lifecycle policies, defining access policies, and determining Amazon S3 prices.
Amazon VPC Flow Logs record IP traffic between Amazon Virtual Private Cloud (Amazon VPC) network interfaces at the VPC, subnet, or individual Elastic Network Interface level. You can store Flow Log data in Amazon CloudWatch Logs or Amazon S3 and export it for enhanced network traffic analytics and visualization.
86. How can Amazon Route 53 ensure high availability while maintaining low latency?
Route 53 is built on AWS's highly available and reliable infrastructure. The widely distributed design of its DNS servers helps ensure a consistent ability to route end users to your application despite internet or network-related issues, delivering the level of dependability that critical systems demand. Route 53 also uses a global anycast network of DNS servers, so queries are automatically answered from the optimal location given current network conditions. As a result, your end users experience low query latency.
87. What is Amazon Elastic Transcoder, and how does it work?
Amazon Elastic Transcoder is a cloud-based media transcoding service.
It's intended to be a highly flexible, simple-to-use, and cost-effective solution for developers and organizations to transform (or "transcode") media files from their original format into versions suitable for smartphones, tablets, and computers.
Amazon Elastic Transcoder also includes transcoding presets for common output formats, so you don't have to guess which settings will work best on particular devices.
88. Discuss the different types of EC2 instances available.
On-Demand Instances - You pay for computing capacity by the hour or second with On-Demand instances, depending on the instances you run. There are no long-term obligations or upfront payments required. You can scale up or down your compute capacity based on your application's needs, and you only pay the per-hour prices for the instance you utilize.
Reserved Instances - When deployed in a specific Availability Zone, Amazon EC2 Reserved Instances (RI) offer a significant reduction (up to 72%) over On-Demand pricing and a capacity reservation.
Spot Instances - You can request additional Amazon EC2 computing resources for up to 90% off the On-Demand price using Amazon EC2 Spot instances.
89. Mention the AWS consistency models for modern DBs.
A database consistency model specifies how and when a successful write or change reflects in a future read of the same data.
The eventual consistency model suits systems that do not need to read updates immediately after they are written. It is Amazon DynamoDB's default consistency model and boosts read throughput; however, the results of a recently completed write may not be reflected in an eventually consistent read.
In Amazon DynamoDB, a strongly consistent read returns a result that reflects all writes that received a successful response before the read. You can set an additional parameter in a request to obtain a strongly consistent read. Processing a strongly consistent read consumes more resources than an eventually consistent read.
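The difference between the two models can be illustrated with a toy simulation. The class below is purely illustrative and is not the DynamoDB API: a primary copy acknowledges writes immediately, while a replica only catches up when replication runs, so a "consistent" read must go to the primary.

```python
class ToyReplicatedStore:
    """Toy store with one primary and one async replica, illustrating
    eventual vs. strong consistency. Not the DynamoDB API."""

    def __init__(self):
        self.primary = {}   # always current
        self.replica = {}   # lags behind the primary
        self.pending = []   # writes not yet propagated

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replication happens later

    def propagate(self):
        """Apply pending writes to the replica (async replication step)."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read(self, key, consistent=False):
        # A strongly consistent read goes to the primary;
        # an eventually consistent read may hit a stale replica.
        source = self.primary if consistent else self.replica
        return source.get(key)

store = ToyReplicatedStore()
store.write("order-1", "shipped")
print(store.read("order-1"))                   # stale replica: None
print(store.read("order-1", consistent=True))  # primary: 'shipped'
store.propagate()
print(store.read("order-1"))                   # replica caught up: 'shipped'
```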
90. What do you understand about Amazon Virtual Private Cloud (VPC)?
The Amazon Virtual Private Cloud (Amazon VPC) enables you to deploy AWS resources into a custom virtual network.
This virtual network is like a typical network run in your private data center, but with the added benefit of AWS's scalable infrastructure.
Amazon VPC allows you to create a virtual network in the cloud without VPNs, hardware, or real data centers.
You can also use Amazon VPC's advanced security features to give more selective access to and from your virtual network's Amazon EC2 instances.
91. Outline some security products and features available in a virtual private cloud (VPC).
Flow Logs - Analyze your VPC flow logs in Amazon S3 or Amazon CloudWatch to obtain operational visibility into your network dependencies and traffic patterns, discover abnormalities, prevent data leakage, etc.
Network Access Analyzer - The Network Access Analyzer tool lets you define your network security and compliance requirements and then verify that your AWS network meets them.
Traffic Mirroring - You can directly access the network packets running through your VPC via Traffic Mirroring. This functionality enables you to route network traffic from Amazon EC2 instances' elastic network interface to security and monitoring equipment for packet inspection.
92. What do you mean by RTO and RPO in AWS?
Recovery time objective (RTO): The maximum acceptable time between a service interruption and the restoration of service. This defines how much service downtime you can tolerate.
Recovery point objective (RPO): The maximum acceptable time since the last data recovery point. This defines how much data loss is acceptable.
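As a worked example, both objectives can be checked from three timestamps: the last backup, the start of the outage, and the moment service was restored. The function and targets below are hypothetical, for illustration only:

```python
from datetime import datetime, timedelta

def check_recovery_objectives(last_backup, outage_start, service_restored,
                              rto=timedelta(hours=4), rpo=timedelta(hours=1)):
    """Check whether a recovery met its objectives (hypothetical targets).
    RTO: downtime (outage start -> restoration) must not exceed the target.
    RPO: data-loss window (last backup -> outage start) must not exceed it."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_backup
    return {"rto_met": downtime <= rto, "rpo_met": data_loss_window <= rpo}

result = check_recovery_objectives(
    last_backup=datetime(2023, 6, 1, 9, 0),
    outage_start=datetime(2023, 6, 1, 9, 30),     # 30 min since last backup
    service_restored=datetime(2023, 6, 1, 14, 0), # 4.5 h of downtime
)
print(result)  # {'rto_met': False, 'rpo_met': True}
```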
93. What are the benefits of using AWS Identity and Access Management (IAM)?
AWS Identity and Access Management (IAM) supports fine-grained access management throughout the AWS infrastructure.
IAM Access Analyzer allows you to control who has access to which services and resources and under what circumstances. IAM policies let you control rights for your employees and systems, ensuring they have the least amount of access.
It also provides federated access, enabling you to grant systems and users access to resources without creating IAM users.
94. What are the various types of load balancers available in AWS?
An Application Load Balancer makes routing decisions at the application layer (HTTP/HTTPS), supports path-based routing, and can route requests to one or more ports on each container instance in your cluster. Application Load Balancers also support dynamic host port mapping.
A Network Load Balancer makes routing decisions at the transport layer (TCP/SSL). It can process millions of requests per second, and dynamic host port mapping is also available with Network Load Balancers.
A Gateway Load Balancer combines a transparent network gateway with a load balancer, distributing traffic while scaling your virtual appliances to match demand.
Data lakes are an ideal way to store a company's historical data because they hold large volumes of data at a low cost. A data lake lets users move back and forth between data engineering and use cases like interactive analytics and machine learning. Azure Data Lake, a cloud platform, supports big data analytics by providing unlimited storage for structured, semi-structured, and unstructured data. Take a look at some important data engineering interview questions on Azure Data Lake.
95. What do you understand by Azure Data Lake Analytics?
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data processing.
You write queries to transform your data and extract essential insights instead of deploying, configuring, and tuning hardware.
The analytics service can instantly handle jobs of any scale: you simply dial in how much processing power you need.
It is also cost-effective because you only pay for a job while it is running.
96. Compare Azure Data Lake Gen1 vs. Azure Data Lake Gen2.
97. What do you mean by U-SQL?
Azure Data Lake Analytics uses U-SQL as a big data query language and execution infrastructure.
U-SQL scales out custom code (.NET/C#/Python) from a Gigabyte to a Petabyte scale using typical SQL techniques and language.
Big data processing techniques like "schema on reads," custom processors, and reducers are available in U-SQL.
The language allows you to query and integrate structured and unstructured data from various data sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse , and SQL Server instances on Azure VMs.
98. Outline some of the features of Azure Data Lake Analytics.
Azure Data Lake offers high throughput for raw or other data types for analytics, real-time reporting, and monitoring.
It's highly flexible and auto-scalable, with flexible pay-per-job pricing.
U-SQL can process structured and unstructured data using familiar SQL syntax extended with custom code and functions.
It offers a highly available service for exploring data for analytics, reporting, monitoring, and Business Intelligence using various tools.
99. What are the different blob storage access tiers in Azure?
Hot tier - An online tier that stores regularly viewed or updated data. The Hot tier has the most expensive storage but the cheapest access.
Cool tier - An online tier designed for storing data that is rarely accessed or modified. The Cool tier offers lower storage costs but higher access charges than the Hot tier.
Archive tier - An offline tier designed for storing data that is rarely accessed and can tolerate higher retrieval latency. Data in the Archive tier should be kept for at least 180 days.
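Choosing a tier is essentially a cost trade-off between storage price and access price. The sketch below models that trade-off with made-up placeholder prices, not real Azure rates; look up current pricing for your region before making a real decision.

```python
# Illustrative monthly cost model for choosing a blob access tier.
# The per-GB and per-10k-operation prices are made-up placeholders,
# NOT real Azure rates.
TIERS = {
    #            $/GB-month    $/10k read ops
    "hot":     {"storage": 0.018, "reads": 0.004},
    "cool":    {"storage": 0.010, "reads": 0.010},
    "archive": {"storage": 0.002, "reads": 5.000},
}

def monthly_cost(tier, gb_stored, read_ops):
    p = TIERS[tier]
    return gb_stored * p["storage"] + (read_ops / 10_000) * p["reads"]

def cheapest_tier(gb_stored, read_ops):
    """Pick the tier with the lowest total cost for a given workload."""
    return min(TIERS, key=lambda t: monthly_cost(t, gb_stored, read_ops))

print(cheapest_tier(1000, 100_000_000))  # read-heavy -> 'hot'
print(cheapest_tier(1000, 100))          # rarely read -> 'archive'
```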
Here are some general data engineering interview questions that hiring managers use to test your technical skills and knowledge.
100. What do you mean by Blocks and Block Scanner?
A block is the smallest unit of a data file and is treated as a single entity. When Hadoop encounters a large data file, it automatically breaks the file up into smaller pieces called blocks.
A block scanner runs on each DataNode and periodically verifies the blocks stored there, checking their checksums to detect corruption.
101. How does a block scanner deal with a corrupted data block?
When the block scanner detects a corrupted data block, the DataNode reports it to the NameNode. The NameNode then creates a new replica of the block from one of its healthy replicas on other DataNodes. Once the replication factor is restored to its target, the corrupted block is scheduled for deletion.
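The checksum comparison at the heart of a block scanner can be sketched in a few lines. The class below is a toy model, not HDFS code: real HDFS records per-chunk CRC32 checksums, whereas this sketch uses SHA-256 over whole blocks.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ToyDataNode:
    """Toy model of a DataNode: record a checksum at write time, then
    periodically re-read blocks and compare (what a block scanner does)."""

    def __init__(self):
        self.blocks = {}     # block_id -> bytes
        self.checksums = {}  # block_id -> checksum recorded at write time

    def write_block(self, block_id, data):
        self.blocks[block_id] = data
        self.checksums[block_id] = checksum(data)

    def scan(self):
        """Return ids of blocks whose current bytes no longer match the
        recorded checksum, i.e. corrupted blocks to report to the NameNode."""
        return [bid for bid, data in self.blocks.items()
                if checksum(data) != self.checksums[bid]]

node = ToyDataNode()
node.write_block("blk_001", b"hello")
node.write_block("blk_002", b"world")
node.blocks["blk_002"] = b"w0rld"   # simulate on-disk corruption
print(node.scan())  # ['blk_002']
```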
102. List some of the XML configuration files present in Hadoop.
Some of the XML configuration files present in Hadoop are:
core-site.xml
hdfs-site.xml (one of the most important XML configuration files)
mapred-site.xml
yarn-site.xml
103. How would you check the validity of data migration between databases?
A data engineer's primary concerns should be maintaining the accuracy of the data and preventing data loss. The purpose of this question is to help the hiring managers understand how you would validate data.
You must be able to explain the suitable validation types in various instances. For instance, you might suggest that validation can be done through a basic comparison or after the complete data migration.
104. How does a SQL query handle duplicate data points?
In SQL, there are two main ways to handle or reduce duplicate data points. You can use the DISTINCT keyword to remove duplicate rows from query results, or add a UNIQUE constraint to prevent duplicates from being inserted in the first place. You can also use GROUP BY to collapse duplicates and aggregate over them.
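Both query-side approaches can be demonstrated with an in-memory SQLite database (the table and data below are invented for illustration):

```python
import sqlite3

# A runnable sketch of DISTINCT vs. GROUP BY, using in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1), ("a", 1), ("b", 2), ("b", 3)])

# DISTINCT drops fully duplicated rows from the result set.
distinct_rows = conn.execute(
    "SELECT DISTINCT sensor, value FROM readings "
    "ORDER BY sensor, value").fetchall()
print(distinct_rows)  # [('a', 1), ('b', 2), ('b', 3)]

# GROUP BY collapses duplicates per key so you can aggregate over them.
grouped = conn.execute(
    "SELECT sensor, COUNT(*) FROM readings "
    "GROUP BY sensor ORDER BY sensor").fetchall()
print(grouped)  # [('a', 2), ('b', 2)]
```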
Below are the Data Engineer interview questions asked at Databricks:
100. What is Databricks Runtime?
101. What Spark components are included in Azure Databricks?
102. What are the types of runtimes Azure Databricks offers?
103. What is the Databricks File System?
104. How do you access Azure Data Lake Storage from a notebook?
Some of the Data Engineer interview questions asked at Walmart are:
105. What is a case class in Scala?
106. Elaborate on the Hive architecture.
107. What are the various types of data models?
108. Can we use Hadoop commands to load data in the backend to a particular partition table?
109. How can we truncate a table in Hive?
110. What is Spark? How is it different from Hive?
Here are the most commonly asked Data Engineer interview questions at EY:
110. When should you not use a pie chart?
111. What is a dynamic database?
112. Explain the Spark architecture.
113. Explain joins in SQL.
114. What is the difference between map and flatMap?
115. What is the difference between an RDD and a DataFrame?
116. Given this role sits within the EY data analytics team, please tell us about your recent experience and exposure to data and analytics. What data-related projects, tools, platforms, and technologies have you worked on?
These are some of the behavioral Data Engineer interview questions asked in almost every data engineering interview.
117. Why are you opting for a career in data engineering, and why should we hire you?
118. What are the daily responsibilities of a data engineer?
119. What problems did you face while trying to aggregate data from multiple sources? How did you go about resolving this?
120. Do you have any experience working on Hadoop, and how did you enjoy it?
121. Do you have any experience working in a cloud computing environment? What are some challenges that you faced?
122. What are the fundamental characteristics that make a good data engineer?
123. How would you approach a new project as a data engineer?
124. Do you have any experience working with data modeling techniques?
As per Glassdoor, here are some Data Engineer interview questions asked at Facebook:
124. Given a list containing a None value, replace the None value with the previous value in the list.
125. Print the key in a dictionary corresponding to the nth highest value in the dictionary. Print just the first one if there is more than one record associated with the nth highest value.
126. Given two sentences, print the words that are present in only one of the two sentences.
127. Create a histogram using values from a given list.
128. Write a program to flatten the given list: [1, 2, 3, [4, 5, [6, 7, [8, 9]]]]
129. Write a program to remove duplicates from any given list.
130. Write a program to count the number of words in a given sentence.
131. Find the number of occurrences of a letter in a string.
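As worked examples, here are sketches for two of the questions above: replacing a None with the previous value, and flattening a nested list. Each shows one common approach, not the only valid answer.

```python
def fill_none_with_previous(values):
    """Replace each None with the most recent non-None value before it.
    A leading None has no previous value, so it stays None."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last
        filled.append(v)
        last = v
    return filled

def flatten(items):
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

print(fill_none_with_previous([1, None, 3, None, None]))   # [1, 1, 3, 3, 3]
print(flatten([1, 2, 3, [4, 5, [6, 7, [8, 9]]]]))          # [1, 2, ..., 9]
```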
Here are the Data Engineer interview questions most commonly asked at Amazon:
132. How can you tune a query? If a query takes longer than it initially did, what may be the reason, and how will you find the cause?
133. In Python, how can you find the non-duplicate numbers in the first list and create a new list preserving the order of the non-duplicates?
134. Consider a large table containing three columns corresponding to datetime, employee, and customer_response. The customer_response column is a free-text column. Assuming a phone number is embedded in the customer_response column, how can you find the top 10 employees with the most phone numbers in the customer_response column?
135. Sort an array in Python so that it produces only odd numbers.
136. How can you achieve performance tuning in SQL? Find the numbers which have the maximum count in a list.
137. Generate a new list containing the numbers repeated in two existing lists.
138. How would you tackle a data pipeline performance problem as a data engineer?
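Questions 133 and 135 are ambiguously worded; the sketches below follow one common reading of each (keep elements of the first list that are absent from a second list, and filter-then-sort the odd values). Both are illustrative solutions, not the only acceptable ones.

```python
def non_duplicates_in_order(first, second):
    """One reading of question 133: keep, in order, the elements of the
    first list that do not also appear in the second list."""
    seen = set(second)  # O(1) membership checks
    return [x for x in first if x not in seen]

def odds_sorted(nums):
    """One reading of question 135: filter to the odd values, then sort."""
    return sorted(n for n in nums if n % 2)

print(non_duplicates_in_order([3, 1, 4, 1, 5], [1, 9]))  # [3, 4, 5]
print(odds_sorted([4, 7, 2, 9, 1]))                      # [1, 7, 9]
```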
Data engineering is arguably even more significant than data science: it maintains the framework that enables data scientists to analyze data and create models, and without it, data science is not possible. A successful data-driven company relies on data engineering, which provides the data processing stack for collecting, storing, cleaning, and analyzing data in batches or in real time, making it ready for further analysis.
Furthermore, as businesses learn more about the significance of big data engineering, they turn towards AI-driven methodologies for end-to-end Data Engineering rather than employing the older techniques. Data engineering aids in finding useful data residing in any data warehouse with the help of advanced analytic methods. Data Engineering also allows businesses to collaborate with data and leads to efficient data processing.
When compared to data science , data engineering does not receive as much media coverage. However, data engineering is a career field that is rapidly expanding and in great demand. It can be a highly exciting career for people who enjoy assembling the "pieces of a puzzle" that build complex data pipelines to ingest raw data, convert it, and then optimize it for various data users. According to a LinkedIn Search as of June 2022, there are over 229,000 jobs for data engineering in the United States , and over 41,000 jobs for the same in India .
Based on Glassdoor, the average salary of a data engineer in the United States is $112,493 per annum; in India, the average data engineer salary is ₹925,000. According to Indeed, Data Engineer is the 5th highest-paying job in the United States across all sectors. These statistics make it clear that demand for data engineers, along with their lucrative paychecks, will only increase.
Below are some essential skills that a data engineer or any individual working in the data engineering field requires-
SQL: Data engineers are responsible for handling large amounts of data. Structured Query Language (SQL) is required to work on structured data in relational database management systems (RDBMS). As a data engineer, it is essential to be thorough with using SQL for simple and complex queries and optimize queries as per requirements.
Data Architecture and Data Modeling: Data engineers are responsible for building complex database management systems. They are considered the gatekeepers of business-relevant data and must design and develop safe, secure, and efficient systems for data collection and processing.
Data Warehousing: It is important for data engineers to grasp building data warehouses and to work with them. Data warehouses allow the aggregation of unstructured data from different sources, which can be used for further efficient processing and analysis.
Programming Skills: The most popular programming languages used in Big Data Engineering are Python and R, which is why it is essential to be well versed in at least one of these languages.
Microsoft Excel: Excel allows developers to arrange their data into tables. It is a commonly used tool to organize and update data regularly if required. Excel provides many tools that can be used for data analysis, manipulation, and visualization.
Apache Hadoop-Based Analytics: Apache Hadoop is a prevalent open-source tool used extensively in Big Data Engineering. The Hadoop ecosystem provides support for distributed computing, allows storage, manipulation, security, and processing of large amounts of data, and is a necessity for anyone applying for the role of a data engineer.
Operating Systems: Data engineers are often required to be familiar with operating systems like Linux, Solaris, UNIX, and Microsoft Windows.
Machine Learning: Machine learning techniques are primarily required by data scientists. However, since data scientists and data engineers work closely together, knowledge of machine learning tools and techniques will help a data engineer.
We hope these questions will help you ace your interview and land a data engineer role in your dream organization. Apart from the data engineer interview questions, here are some essential tips to keep you prepared for your next data engineering interview:
Brush up your skills: Here are some skills that are expected in a data engineer role:
Technical skills: Data engineers have to be familiar with database management systems, SQL, Microsoft Excel, programming languages (especially R and Python), and Big Data tools including Apache Hadoop and Apache Spark.
Analytical Skills: Data Engineering requires individuals with strong mathematical and statistical skills who can make sense of the large amounts of data that they constantly have to deal with.
Understanding business requirements: To design optimum databases, it is important that data engineers understand what is expected of them, and design databases as per requirements.
Be familiar with the specific company with which you are interviewing. Understand the goals and objectives of the company, some of their recent accomplishments, and any ongoing projects you can find out about. The more specific your answers to questions like “Why have you chosen Company X?”, the more you will be able to convince your interviewers that you have truly come prepared for the interview.
Have a thorough understanding of the projects you have worked on. Be prepared to answer questions based on these projects, primarily if the projects are related to Big Data and data engineering. You may be asked questions about the technology used in the data engineering projects, the datasets you used, how you obtained the required data samples, and the algorithms you used to approach the end goal. Try to recall any difficulties that you encountered during the execution of the project and how you went about solving them.
Spend time working on building up your project profile and in the process, your confidence. By working on projects, you can expand your knowledge by gaining hands-on experience. Projects can be showcased to your interviewer but will also help build up your skillset and give you a deeper understanding of the tools and techniques used in the market in the field of Big Data and data engineering.
Make sure to get some hands-on practice with ProjectPro’s solved big data projects with reusable source code that can be used for further practice with complete datasets. At any time, if you feel that you require some assistance, we provide one-to-one industry expert guidance to help you understand the code and ace your data engineering skills .
FAQs on Data Engineer Interview Questions
1. How can I pass a data engineer interview?
You can pass a data engineer interview if you have the right skill set and experience necessary for the job role. If you want to crack the data engineer interview, acquire the essential skills like data modeling, data pipelines, data analytics, etc., explore resources for data engineer interview questions, and build a solid portfolio of big data projects. Practice real-world data engineering projects on ProjectPro, Github, etc. to gain hands-on experience.
2. What are the roles and responsibilities of data engineer?
Some of the roles and responsibilities of a data engineer are:
Create and implement ETL data pipelines for a variety of clients in various sectors.
Generate accurate and useful data-driven solutions using data modeling and data warehousing techniques.
Interact with other teams (data scientists, etc.) and help them by delivering relevant datasets for analysis.
Build data pipelines for extraction and storage tasks by employing a range of big data engineering tools and various cloud service platforms.
3. What are the four key questions a data engineer is likely to hear during an interview?
The four key questions a data engineer is likely to hear during an interview are:
What are the four V’s of Big Data?
Do you have any experience working on Hadoop, and how did you enjoy it?
Do you have any experience working in a cloud computing environment, what are some challenges that you faced?
About the Author
ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning technologies, with over 270 reusable project templates in data science and big data, each with step-by-step walkthroughs.