<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<title>Python Data Science — NumPy, Pandas & Spark Mastery</title>
<link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500&family=Syne:wght@400;500;600;700&display=swap" rel="stylesheet">
<style>
*{box-sizing:border-box;margin:0;padding:0}
body{background:#0f1117;color:#ffffff;font-family:'Syne',sans-serif;min-height:100vh}
.topbar{background:#161b27;border-bottom:1px solid #2a3348;padding:.65rem 1rem;display:flex;align-items:center;justify-content:space-between;flex-wrap:wrap;gap:.5rem;position:sticky;top:0;z-index:300}
.logo{font-size:.95rem;font-weight:700;color:#8b5cf6}
.logo span{color:#9ca3af;font-weight:400;font-size:.78rem}
.topbar-right{display:flex;gap:.4rem;align-items:center;flex-wrap:wrap}
.srch{background:#1e2535;border:1px solid #2a3348;border-radius:8px;padding:.38rem .75rem;color:#ffffff;font-family:'Syne',sans-serif;font-size:.8rem;width:200px;outline:none}
.srch:focus{border-color:#8b5cf6;background:#252d40}
.srch::placeholder{color:#6b7280}
.fbtn{background:#1e2535;border:1px solid #2a3348;border-radius:8px;padding:.36rem .65rem;color:#9ca3af;font-size:.7rem;cursor:pointer;font-family:'Syne',sans-serif;transition:all .15s;white-space:nowrap}
.fbtn:hover,.fbtn.on{background:#252d40;color:#8b5cf6;border-color:#8b5cf655}
.mob-toggle{display:none;background:#8b5cf6;border:none;border-radius:8px;padding:.38rem .75rem;color:#fff;font-size:.75rem;font-weight:700;cursor:pointer;font-family:'Syne',sans-serif;gap:.35rem;align-items:center;white-space:nowrap}
.layout{display:flex;height:calc(100vh - 50px)}
.sidebar{width:300px;min-width:300px;background:#161b27;border-right:1px solid #2a3348;overflow-y:auto;height:100%;position:sticky;top:50px}
.sb-inner{padding:.4rem .35rem}
.main{flex:1;overflow-y:auto}
.wrap{padding:1.5rem 2rem 2rem;max-width:980px}
.mob-overlay{display:none;position:fixed;inset:0;background:#0f1117;z-index:400;flex-direction:column}
.mob-overlay.open{display:flex}
.mob-header{background:#161b27;border-bottom:1px solid #2a3348;padding:.65rem 1rem;display:flex;align-items:center;justify-content:space-between;gap:.5rem}
.mob-header-title{font-size:.9rem;font-weight:700;color:#ffffff}
.mob-close{background:#1e2535;border:1px solid #2a3348;border-radius:8px;padding:.35rem .7rem;color:#9ca3af;font-size:.75rem;cursor:pointer;font-family:'Syne',sans-serif}
.mob-search{padding:.6rem .75rem;background:#161b27;border-bottom:1px solid #2a3348}
.mob-search input{width:100%;background:#1e2535;border:1px solid #2a3348;border-radius:8px;padding:.4rem .75rem;color:#ffffff;font-family:'Syne',sans-serif;font-size:.82rem;outline:none}
.mob-search input:focus{border-color:#8b5cf6}
.mob-list{flex:1;overflow-y:auto;padding:.4rem .5rem}
.qi{display:flex;align-items:flex-start;gap:.45rem;padding:.52rem .6rem;border-radius:7px;cursor:pointer;border:1px solid transparent;transition:all .12s;margin-bottom:2px}
.qi:hover{background:#1e2535;border-color:#2a3348}
.qi.active{background:#1e2535;border-left:3px solid #8b5cf6;border-color:#8b5cf633}
.qi.hidden{display:none}
.qn{min-width:20px;font-size:.67rem;font-weight:700;color:#4b5563;margin-top:2px;font-family:'JetBrains Mono',monospace;flex-shrink:0}
.qi.active .qn{color:#8b5cf6}
.qt{font-size:.75rem;color:#ffffff;line-height:1.4;flex:1}
.fdot{width:5px;height:5px;border-radius:50%;margin-top:5px;flex-shrink:0}
.welcome{text-align:center;padding:3.5rem 1rem}
.welcome h2{font-size:1.35rem;font-weight:700;color:#ffffff;margin-bottom:.6rem}
.welcome p{font-size:.84rem;color:#ffffff;line-height:1.75}
.wgrid{display:grid;grid-template-columns:repeat(3,1fr);gap:.65rem;margin-top:1.5rem;text-align:left}
.wcard{background:#161b27;border:1px solid #2a3348;border-radius:8px;padding:.75rem}
.wcard h4{font-size:.72rem;font-weight:700;color:#8b5cf6;margin-bottom:.3rem}
.wcard p{font-size:.72rem;color:#ffffff;line-height:1.5}
.qhead{display:flex;align-items:flex-start;gap:.65rem;margin-bottom:.85rem;flex-wrap:wrap}
.qnum{background:#8b5cf6;color:#fff;font-size:.67rem;font-weight:700;padding:3px 8px;border-radius:10px;font-family:'JetBrains Mono',monospace;margin-top:3px;flex-shrink:0}
.qtitle{font-size:1.1rem;font-weight:700;color:#ffffff;line-height:1.35;flex:1}
.badges{display:flex;flex-wrap:wrap;gap:.3rem;margin-bottom:.8rem;align-items:center}
.fbadge{font-size:.66rem;padding:2px 7px;border-radius:10px;border:1px solid;font-weight:600;color:#ffffff}
.def-section{background:#1e2535;border-left:3px solid #8b5cf6;border-radius:0 8px 8px 0;padding:.85rem 1rem;font-size:.83rem;line-height:1.8;color:#ffffff;margin-bottom:1.2rem}
.def-section strong{color:#8b5cf6}
.slabel{font-size:.68rem;font-weight:700;letter-spacing:.12em;color:#6b7280;text-transform:uppercase;margin:1.2rem 0 .6rem;padding-left:2px}
.code-block{background:#161b27;border:1px solid #2a3348;border-radius:8px;padding:1rem;margin-bottom:1.2rem;overflow-x:auto}
.code-block code{font-family:'JetBrains Mono',monospace;font-size:.78rem;color:#ffffff;line-height:1.6;display:block;white-space:pre;text-wrap:wrap}
.keyword{color:#ec4899}
.string{color:#34d399}
.comment{color:#6b7280}
.function{color:#60a5fa}
.number{color:#fbbf24}
.expl-box{background:#1a1f2e;border:1px solid #2a3348;border-radius:8px;padding:.85rem;margin-bottom:1.2rem;font-size:.8rem;color:#e5e7eb;line-height:1.7}
.expl-box strong{color:#8b5cf6}
.example-grid{display:grid;grid-template-columns:1fr 1fr;gap:1rem;margin-bottom:1.2rem}
.example-card{background:#161b27;border:1px solid #2a3348;border-radius:8px;padding:1rem}
.example-card h5{color:#8b5cf6;font-size:.8rem;margin-bottom:.5rem;text-transform:uppercase}
.io-grid{display:grid;grid-template-columns:1fr 1fr;gap:.8rem;margin-bottom:1.2rem}
.io-box{background:#161b27;border:1px solid #2a3348;border-radius:8px;padding:.75rem}
.io-box h5{color:#34d399;font-size:.7rem;font-weight:700;margin-bottom:.4rem;text-transform:uppercase}
.io-box code{background:#0f1117;border:1px solid #2a3348;border-radius:4px;padding:.25rem .5rem;display:block;margin:.25rem 0;font-family:'JetBrains Mono',monospace;font-size:.75rem;color:#ffffff;white-space:pre-wrap}
.realworld-box{background:#1a2f3e;border-left:3px solid #06b6d4;border-radius:0 8px 8px 0;padding:1rem;margin-bottom:1.2rem;font-size:.82rem;line-height:1.7}
.realworld-box strong{color:#06b6d4}
.navrow{display:flex;gap:.5rem;margin-top:1.5rem;padding-top:1rem;border-top:1px solid #2a3348}
.nbtn{background:#1e2535;border:1px solid #2a3348;border-radius:8px;padding:.6rem .8rem;color:#ffffff;font-size:.75rem;cursor:pointer;font-family:'Syne',sans-serif;transition:all .15s;flex:1;text-align:center;line-height:1.45}
.nbtn:hover{background:#252d40;color:#ffffff;border-color:#8b5cf6}
.nbtn.off{opacity:.25;pointer-events:none}
.badge-numpy{background:#1e3a5f;border-color:#3b82f6}
.badge-pandas{background:#1f3a2a;border-color:#10b981}
.badge-spark{background:#3a1f2a;border-color:#f472b6}
@media(max-width:768px){
.layout{display:block;height:auto}
.sidebar{display:none}
.main{width:100%}
.wrap{padding:1rem .75rem 2rem}
.mob-toggle{display:flex}
.mob-overlay.open{display:flex}
.example-grid,.io-grid{grid-template-columns:1fr}
.fbtn:not(.mob-toggle){display:none}
.srch{display:none}
}
</style>
</head>
<body>
<div class="topbar">
<div class="logo">NumPy, Pandas & Spark <span>Data Science Mastery</span></div>
<div class="topbar-right">
<input type="text" class="srch" id="searchbox" placeholder="Search topics...">
<button class="fbtn" onclick="toggleFilter('all')" id="filter-all">All</button>
<button class="fbtn" onclick="toggleFilter('numpy')" id="filter-numpy">NumPy</button>
<button class="fbtn" onclick="toggleFilter('pandas')" id="filter-pandas">Pandas</button>
<button class="fbtn" onclick="toggleFilter('spark')" id="filter-spark">Spark</button>
<button class="mob-toggle" onclick="toggleMobOverlay()">☰ Menu</button>
</div>
</div>
<div class="mob-overlay" id="moboverlay">
<div class="mob-header">
<div class="mob-header-title">Data Science Topics</div>
<button class="mob-close" onclick="toggleMobOverlay()">✕</button>
</div>
<div class="mob-search">
<input type="text" id="searchbox-mob" placeholder="Search topics..." onkeyup="filterQuestions()">
</div>
<div class="mob-list" id="moblist"></div>
</div>
<div class="layout">
<div class="sidebar"><div class="sb-inner" id="deskqlist"></div></div>
<div class="main">
<div class="wrap" id="content"></div>
</div>
</div>
<script>
const topics = [
{id:1,title:"NumPy Arrays & Basics",cat:"numpy",section:"Fundamentals"},
{id:2,title:"Array Operations & Broadcasting",cat:"numpy",section:"Core Concepts"},
{id:3,title:"Indexing & Slicing",cat:"numpy",section:"Manipulation"},
{id:4,title:"Mathematical Functions",cat:"numpy",section:"Operations"},
{id:5,title:"Linear Algebra with NumPy",cat:"numpy",section:"Advanced"},
{id:6,title:"NumPy Performance Optimization",cat:"numpy",section:"Optimization"},
{id:7,title:"Pandas Series & DataFrames",cat:"pandas",section:"Fundamentals"},
{id:8,title:"Data Loading & I/O",cat:"pandas",section:"Input/Output"},
{id:9,title:"Data Cleaning & Handling Missing Data",cat:"pandas",section:"Data Prep"},
{id:10,title:"Groupby & Aggregations",cat:"pandas",section:"Analysis"},
{id:11,title:"Merging & Joining DataFrames",cat:"pandas",section:"Transformation"},
{id:12,title:"Time Series Data",cat:"pandas",section:"Specialized"},
{id:13,title:"Pandas Performance & Memory",cat:"pandas",section:"Optimization"},
{id:14,title:"Spark Basics & RDDs",cat:"spark",section:"Fundamentals"},
{id:15,title:"Spark DataFrames & SQL",cat:"spark",section:"Core Concepts"},
{id:16,title:"Spark Transformations & Actions",cat:"spark",section:"Operations"},
{id:17,title:"Spark Structured Streaming",cat:"spark",section:"Streaming"},
{id:18,title:"MLlib - Machine Learning",cat:"spark",section:"ML"},
{id:19,title:"Spark Performance Tuning",cat:"spark",section:"Optimization"},
{id:20,title:"Real-World: Data Pipeline",cat:"spark",section:"Real-World"},
];
const content = {
1:{
title:"NumPy Arrays & Basics",
def:"NumPy (Numerical Python) is the fundamental library for numerical computing in Python. Its core object is the N-dimensional array (ndarray), a homogeneous container in which every element has the same data type. NumPy is the foundation for pandas, SciPy, and most other data science libraries in Python.",
badges:["Fundamentals","NumPy Core","Arrays"],
examples:[
{title:"Creating NumPy Arrays",code:`import numpy as np
# 1D Array
arr1d = np.array([1, 2, 3, 4, 5])
print(arr1d) # [1 2 3 4 5]
# 2D Array (Matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d)
# [[1 2 3]
# [4 5 6]]
# Using built-in functions
zeros = np.zeros((3, 3)) # 3x3 array of zeros
ones = np.ones((2, 4)) # 2x4 array of ones
identity = np.eye(3) # Identity matrix
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5) # 5 values from 0 to 1`},
{title:"Array Properties",code:`import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # (2, 3) - dimensions
print(arr.dtype) # int64 on most platforms (int32 on older Windows builds of NumPy)
print(arr.size) # 6 - total elements
print(arr.ndim) # 2 - number of dimensions
print(arr.nbytes) # 48 with int64 (6 elements x 8 bytes)`}
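,
// Added sketch, not part of the original guide: it illustrates the "same data type, contiguous memory" point made in this topic's definition and explanation. The title and variable names are illustrative assumptions.
{title:"Sketch: One dtype, One Contiguous Buffer",code:`import numpy as np
# A minimal sketch, assuming only NumPy is installed.
# Mixing ints and floats upcasts every element to a single dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype) # float64
# Each element occupies the same number of bytes.
print(mixed.itemsize) # 8 bytes per float64 element
print(mixed.nbytes == mixed.size * mixed.itemsize) # True
# A freshly created array is stored in one contiguous block of memory.
print(mixed.flags['C_CONTIGUOUS']) # True`}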
],
explanation:"NumPy arrays are more efficient than Python lists because they store data contiguously in memory and are implemented in C. This makes NumPy operations significantly faster, especially for large datasets.",
inputOutput:{
input:"import numpy as np\narr = np.array([1, 2, 3, 4, 5])\nprint(arr.shape)",
output:"(5,)"
},
realWorld:"In data pipelines, NumPy arrays are used to load sensor data, image pixels, or financial time-series data before processing."
},
2:{
title:"Array Operations & Broadcasting",
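def:"Broadcasting is NumPy's mechanism for performing element-wise operations on arrays of different but compatible shapes. Shapes are compared from the trailing dimension backward; a dimension of size 1 (or a missing dimension) is virtually stretched to match the other array, with no data copied, eliminating the need for explicit loops.",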
badges:["Core Concepts","NumPy","Performance"],
examples:[
{title:"Element-wise Operations",code:`import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Element-wise operations
print(a + b) # [5 7 9]
print(a * b) # [ 4 10 18]
print(a ** b) # [  1  32 729]
print(np.sqrt(a)) # [1.         1.41421356 1.73205081]`},
{title:"Broadcasting Example",code:`import numpy as np
# (3,3) array + (3,) array
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
vector = np.array([10, 20, 30])
result = matrix + vector
print(result)
# [[11 22 33]
# [14 25 36]
# [17 28 39]]
# Scalar broadcast
result2 = matrix * 2
print(result2)
# [[ 2 4 6]
# [ 8 10 12]
# [14 16 18]]`}
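,
// Added sketch, not part of the original guide: it shows the normalization use case mentioned for this topic (subtracting column means and dividing by column standard deviations). The data values and names are made up for illustration.
{title:"Sketch: Column-wise Standardization via Broadcasting",code:`import numpy as np
# A minimal sketch of the normalization use case; the data values are made up.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]) # shape (3, 2)
mean = X.mean(axis=0) # shape (2,) - one mean per column
std = X.std(axis=0) # shape (2,) - one standard deviation per column
# (3, 2) minus (2,) broadcasts the per-column statistics across all rows
X_norm = (X - mean) / std
print(X_norm.mean(axis=0)) # approximately [0. 0.]
print(X_norm.std(axis=0)) # approximately [1. 1.]`}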
],
explanation:"Broadcasting eliminates the need for nested loops and explicit replication, which makes code more readable and significantly faster. NumPy handles the shape alignment internally: the smaller array behaves as if it were repeated to match the larger one, but no copy is actually made.",
inputOutput:{
input:"a = np.array([[1,2],[3,4]])\nb = np.array([10,20])\nprint(a + b)",
output:"[[11 22]\n [13 24]]"
},
realWorld:"In machine learning, broadcasting is used when applying normalization (subtracting means and dividing by standard deviations) to datasets without explicit looping."
},
3:{
title:"Indexing & Slicing",
def:"NumPy provides powerful indexing and slicing mechanisms to access and modify array elements. This includes integer indexing, boolean indexing, fancy indexing, and multi-dimensional slicing.",
badges:["Manipulation","Indexing","Selection"],
examples:[
{title:"Basic Indexing & Slicing",code:`import numpy as np
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Single element
print(arr[2]) # 2
print(arr[-1]) # 9 (last element)
# Slicing
print(arr[2:6]) # [2 3 4 5]
print(arr[::2]) # [0 2 4 6 8] (every 2nd element)
print(arr[::-1]) # [9 8 7 6 5 4 3 2 1 0] (reverse)
# 2D Array Indexing
arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(arr2d[1, 2]) # 6 (row 1, col 2)
print(arr2d[0, :]) # [1 2 3] (entire first row)
print(arr2d[:, 1]) # [2 5 8] (entire second column)`},
{title:"Boolean Indexing & Fancy Indexing",code:`import numpy as np
arr = np.array([10, 20, 30, 40, 50, 60])
# Boolean indexing (filtering)
mask = arr > 25
print(arr[mask]) # [30 40 50 60]
# Fancy indexing (using indices array)
indices = np.array([0, 2, 4])
print(arr[indices]) # [10 30 50]
# Conditional operations
arr2 = np.array([1, 2, 3, 4, 5])
arr2[arr2 > 3] = 0 # Set values > 3 to 0
print(arr2) # [1 2 3 0 0]`}
],
explanation:"Indexing and slicing enable efficient data access patterns. Boolean indexing is particularly useful for filtering datasets based on conditions without using explicit loops.",
inputOutput:{
input:"arr = np.array([[1,2,3],[4,5,6]])\nprint(arr[0, 1])",
output:"2"
},
realWorld:"In real-world data analysis, boolean indexing is used to filter datasets - e.g., selecting all transactions > $100 or all readings where temperature > 30°C."
},
4:{
title:"Mathematical Functions",
def:"NumPy provides a comprehensive set of mathematical functions including trigonometric, exponential, logarithmic, statistical, and rounding functions. These are optimized for vectorized operations.",
badges:["Operations","Math","Functions"],
examples:[
{title:"Common Mathematical Functions",code:`import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Trigonometric
print(np.sin(arr)) # [ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]
print(np.cos(arr))
print(np.tan(arr))
# Exponential & Logarithmic
print(np.exp(arr)) # [e^1, e^2, e^3, e^4, e^5]
print(np.log(arr)) # [0. 0.69314718 1.09861229 1.38629436 1.60943791]
print(np.sqrt(arr)) # [1. 1.41421356 1.73205081 2. 2.23606798]
# Rounding
arr_float = np.array([1.234, 2.567, 3.891])
print(np.round(arr_float)) # [1. 3. 4.]
print(np.ceil(arr_float)) # [2. 3. 4.]
print(np.floor(arr_float)) # [1. 2. 3.]`},
{title:"Aggregation Functions",code:`import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Basic statistics
print(np.sum(arr)) # 45 (sum of all elements)
print(np.mean(arr)) # 5.0 (average)
print(np.std(arr)) # Standard deviation
print(np.var(arr)) # Variance
# Along axis
print(np.sum(arr, axis=0)) # [12 15 18] (sum of columns)
print(np.sum(arr, axis=1)) # [ 6 15 24] (sum of rows)
print(np.mean(arr, axis=1)) # [2. 5. 8.]
# Min/Max
print(np.min(arr)) # 1
print(np.max(arr)) # 9
print(np.argmax(arr)) # 8 (index of max value in the flattened array)`}
],
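explanation:"These functions are vectorized: they operate on entire arrays without explicit Python loops, which is typically far faster than looping element by element. Understanding the axis parameter is crucial for data analysis: axis=0 aggregates down the rows (one result per column), while axis=1 aggregates across the columns (one result per row).",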
inputOutput:{
input:"arr = np.array([1, 2, 3, 4, 5])\nprint(np.mean(arr))",
output:"3.0"
},
realWorld:"In financial analysis, you'd use np.mean() for average returns, np.std() for volatility, and np.max()/np.min() for peak/trough prices across portfolios."
},
5:{
title:"Linear Algebra with NumPy",
def:"NumPy provides linear algebra operations through the np.linalg module, including matrix multiplication, eigenvalue decomposition, matrix inversion, determinants, and solving linear systems.",
badges:["Linear Algebra","Advanced","Mathematics"],
examples:[
{title:"Matrix Operations",code:`import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Matrix multiplication
print(np.dot(A, B))
# [[19 22]
# [43 50]]
# Element-wise multiplication
print(A * B)
# [[ 5 12]
# [21 32]]
# Matrix transpose
print(A.T)
# [[1 3]
# [2 4]]
# Matrix inverse
print(np.linalg.inv(A))
# Determinant
print(np.linalg.det(A)) # approximately -2.0 (floating-point result, may print -2.0000000000000004)`},
{title:"Eigenvalues & Eigenvectors",code:`import numpy as np
A = np.array([[4, 2], [1, 3]])
# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues) # [5. 2.]
print("Eigenvectors:")
print(eigenvectors)
# Solving linear system: Ax = b
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
x = np.linalg.solve(A, b)
print("Solution:", x) # [2. 3.]`}
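,
// Added sketch, not part of the original guide: it illustrates the PCA use case mentioned for this topic by eigendecomposing a covariance matrix. The synthetic data and all names are assumptions; np.linalg.eigh is used here instead of eig because a covariance matrix is symmetric.
{title:"Sketch: Principal Directions from a Covariance Matrix",code:`import numpy as np
# A minimal sketch of PCA via eigendecomposition; the data is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5]) # more spread along column 0
Xc = X - X.mean(axis=0) # center each column
cov = np.cov(Xc, rowvar=False) # 2x2 covariance matrix
# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
top = eigenvectors[:, -1] # direction of largest variance
projected = Xc @ top # 1D projection onto that direction
print(projected.shape) # (200,)`}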
],
explanation:"Linear algebra is fundamental in machine learning (PCA, SVD), scientific computing, and optimization problems. NumPy's implementation uses optimized BLAS/LAPACK libraries for speed.",
inputOutput:{
input:"A = np.array([[1,2],[3,4]])\nprint(np.linalg.det(A))",
output:"-2.0000000000000004 (approximately -2.0; floating-point rounding)"
},
realWorld:"In principal component analysis (PCA), eigenvalue decomposition of the data's covariance matrix is used to find the main directions of variance in high-dimensional data for dimensionality reduction."
},
6:{
title:"NumPy Performance Optimization",
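def:"Optimizing NumPy code involves understanding memory layout, choosing appropriate dtypes, using vectorized operations, and leveraging in-place operations. Vectorized NumPy code is often one to two orders of magnitude faster than an equivalent pure-Python loop, though the exact speedup depends on the operation, array size, and hardware.",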
badges:["Optimization","Performance","Advanced"],
examples:[
{title:"Vectorization vs Loops",code:`import numpy as np
import time
# Setup large array
arr = np.random.rand(1000000)
# Method 1: Pure Python loop (SLOW)
start = time.perf_counter() # perf_counter is the recommended clock for benchmarks
result1 = [x ** 2 for x in arr]
time_loop = time.perf_counter() - start
# Method 2: NumPy vectorization (FAST)
start = time.perf_counter()
result2 = arr ** 2
time_numpy = time.perf_counter() - start
# Timings vary by machine; the Python loop is typically far slower
print(f"Python loop: {time_loop:.6f}s")
print(f"NumPy array: {time_numpy:.6f}s")
print(f"Speedup: {time_loop/time_numpy:.1f}x")`},
{title:"Memory Efficiency & In-place Operations",code:`import numpy as np
# Using specific dtypes to save memory
arr_float64 = np.ones(1000000, dtype=np.float64) # 8MB
arr_float32 = np.ones(1000000, dtype=np.float32) # 4MB
arr_int32 = np.ones(1000000, dtype=np.int32) # 4MB
print(f"float64: {arr_float64.nbytes / 1e6:.1f}MB")
print(f"float32: {arr_float32.nbytes / 1e6:.1f}MB")
# In-place operations (modify original, don't create new array)
arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# NOT in-place (creates new array)
result = arr * 2
# In-place (modifies original)
arr *= 2 # Much more memory efficient`}
],
explanation:"The key to NumPy performance is vectorization - avoiding explicit Python loops. Always use NumPy functions that operate on entire arrays at once. In-place operations reduce memory overhead.",
inputOutput:{
input:"# Python loop vs NumPy\nPython: O(n) with interpreter overhead\nNumPy: O(n) but in optimized C code",
output:"NumPy is often 10-100x faster or more (machine-dependent)"
},
realWorld:"In real-time data processing (streaming data, sensor readings), vectorization is critical to process millions of data points per second without lagging."
},
7:{
title:"Pandas Series & DataFrames",
def:"Pandas provides two main data structures: Series (1D labeled array) and DataFrame (2D labeled table). DataFrames are like SQL tables or Excel spreadsheets, with rows and columns that can have different dtypes.",
badges:["Fundamentals","Pandas","Data Structure"],
examples:[
{title:"Creating Series & DataFrames",code:`import pandas as pd
# Creating a Series
s = pd.Series([10, 20, 30, 40])
print(s)
# 0 10
# 1 20
# 2 30
# 3 40
# Series with custom index
s_named = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s_named['a']) # 100
# Creating DataFrame from dict
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)
# Creating DataFrame from list of dicts
df2 = pd.DataFrame([
{'id': 1, 'name': 'Alice', 'score': 95},
{'id': 2, 'name': 'Bob', 'score': 87}
])`},
{title:"DataFrame Basic Operations",code:`import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 75000, 80000]
})
# Accessing columns
print(df['Name']) # Series
print(df[['Name', 'Age']]) # DataFrame with 2 columns
# Accessing rows
print(df.iloc[0]) # First row by position
print(df.loc[0]) # First row by label
# DataFrame info
print(df.shape) # (4, 3) - 4 rows, 3 columns
df.info() # Prints column types and non-null counts (returns None, so no print() needed)
print(df.describe()) # Statistical summary`}
],
explanation:"DataFrames are the core data structure in pandas and are used for almost all data manipulation tasks. They combine the flexibility of Python dicts with the efficiency of NumPy arrays.",
inputOutput:{
input:"df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6]})\nprint(df.shape)",
output:"(3, 2)"
},
realWorld:"DataFrames are used to store and manipulate data from CSV files, databases, APIs, and web scraping. Most data science projects start by loading data into a DataFrame."
},
8:{
title:"Data Loading & I/O",
def:"Pandas provides functions to read and write data in various formats: CSV, Excel, SQL databases, JSON, Parquet, HDF5, and more. These functions handle parsing, type inference, and missing values automatically.",
badges:["Input/Output","Data Loading","I/O"],
examples:[
{title:"Reading Data from Various Sources",code:`import pandas as pd
# Read CSV
df_csv = pd.read_csv('data.csv')
df_csv = pd.read_csv('data.csv', delimiter=';', encoding='utf-8')
# Read Excel
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read SQL Database
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:password@host/db')
df_sql = pd.read_sql('SELECT * FROM users LIMIT 1000', engine)
# Read JSON
df_json = pd.read_json('data.json')
# Read Parquet (columnar, efficient)
df_parquet = pd.read_parquet('data.parquet')`},
{title:"Writing Data to Various Formats",code:`import pandas as pd
import sqlalchemy
df = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [95, 87, 92]
})
# Write to CSV
df.to_csv('output.csv', index=False)
# Write to Excel
df.to_excel('output.xlsx', sheet_name='Data')
# Write to SQL
engine = sqlalchemy.create_engine('postgresql://user:password@host/db')
df.to_sql('users', engine, if_exists='append', index=False)
# Write to Parquet (compressed)
df.to_parquet('output.parquet', compression='snappy')`}
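,
// Added sketch, not part of the original guide: it illustrates the chunked reading (chunksize parameter) mentioned for this topic. The in-memory CSV and column names are assumptions that stand in for a file too large to load at once.
{title:"Sketch: Chunked CSV Reading",code:`import io
import pandas as pd
# A minimal sketch of the chunksize parameter; the in-memory CSV stands in for a large file.
csv_text = """id,amount
1,10
2,20
3,30
4,40
5,50
"""
total = 0
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['amount'].sum() # aggregate one chunk at a time
print(total) # 150`}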
],
explanation:"I/O operations are critical for data pipelines. Parquet is recommended for large datasets due to compression and columnar storage. CSV is human-readable but less efficient for large files.",
inputOutput:{
input:"df = pd.read_csv('sales.csv')\nprint(f'Loaded {len(df)} rows')",
output:"Loaded 10000 rows"
},
realWorld:"In production pipelines, you might read data from S3 or databases, transform it, and write results back to S3 or a data warehouse. Chunked reading (chunksize parameter) handles files larger than RAM."
},
9:{
title:"Data Cleaning & Handling Missing Data",
def:"Real-world data is messy. Data cleaning involves handling missing values, removing duplicates, fixing data types, dealing with outliers, and standardizing formats. It is often cited as the most time-consuming part of data science work.",
badges:["Data Preparation","Cleaning","Missing Values"],
examples:[
{title:"Handling Missing Values",code:`import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5],
'B': [10, np.nan, 30, 40, 50],
'C': ['cat', 'dog', np.nan, 'cat', 'dog']
})
# Check missing values
print(df.isnull()) # Boolean mask
print(df.isnull().sum()) # Count per column
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows where specific column is missing
df_clean = df.dropna(subset=['A'])
# Fill missing values
df_filled = df.fillna(0) # Fill with 0
df_filled = df.fillna(df.mean(numeric_only=True)) # Fill numeric columns with their mean
df_filled = df.ffill() # Forward fill (replaces the deprecated fillna(method='ffill'))
df_filled = df.bfill() # Backward fill`},
{title:"Data Cleaning Operations",code:`import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'alice', 'ALICE', 'Bob'],
'Age': [25, 25, 25, 30],
'Email': ['alice@example.com', 'alice@example.com', 'alice@example.com', 'bob@example.com']
})
# Remove duplicates
df_unique = df.drop_duplicates()
df_unique = df.drop_duplicates(subset=['Email'])
# Standardize string data
df['Name'] = df['Name'].str.lower().str.strip()
# Fix data types
df['Age'] = df['Age'].astype('int32')
df['Registered'] = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'])
# Rename columns
df.rename(columns={'Email': 'EmailAddress'}, inplace=True)
# Handle outliers (e.g., remove values beyond 3 std devs)
df_no_outliers = df[(df['Age'] - df['Age'].mean()).abs() <= 3 * df['Age'].std()]`}
],
explanation:"Data cleaning is an iterative process. Start by understanding your data (df.info(), df.describe()), identify issues, then apply appropriate transformations. Use domain knowledge to decide whether to drop or impute missing values.",
inputOutput:{
input:"df = pd.DataFrame({'A': [1, np.nan, 3]})\ndf.fillna(df.mean())",
output:" A\n0 1.0\n1 2.0\n2 3.0"
},
realWorld:"In production, you might build pipelines that automatically detect and flag data quality issues, log warnings, and apply standardized cleaning rules across multiple data sources."
},
10:{
title:"Groupby & Aggregations",
def:"The groupby() operation is one of pandas' most powerful features. It splits data into groups based on one or more columns, applies operations to each group, and combines results. This is similar to SQL GROUP BY.",
badges:["Analysis","Aggregation","Grouping"],
examples:[
{title:"Basic Groupby Operations",code:`import pandas as pd
df = pd.DataFrame({
'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Salary': [50000, 55000, 70000, 75000, 45000, 48000]
})
# Group by department and calculate mean salary
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)
# Department
# HR 46500.0
# IT 72500.0
# Sales 52500.0
# Multiple aggregations
agg_result = df.groupby('Department').agg({
'Salary': ['mean', 'sum', 'count', 'min', 'max']
})
print(agg_result)`},
{title:"Advanced Groupby with Custom Functions",code:`import pandas as pd
df = pd.DataFrame({
'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Q1': [100, 150, 120, 180, 110, 160],
'Q2': [110, 160, 130, 190, 120, 170],
'Q3': [120, 170, 140, 200, 130, 180]
})
# Group and apply custom function
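def custom_stat(x):
    # x is the sub-DataFrame of Q1..Q3 values for one category
    return pd.Series({
        'mean': x.to_numpy().mean(), # mean over all quarters
        'trend': x['Q3'].mean() - x['Q1'].mean() # average Q3 minus average Q1
    })
result = df.groupby('Category')[['Q1', 'Q2', 'Q3']].apply(custom_stat)
print(result)
# Count rows per category (list several column names to group by multiple columns)
multi_group = df.groupby(['Category']).size()`}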
],
explanation:"Groupby is fundamental for exploratory data analysis and reporting. Understanding how to combine groupby with aggregations is essential for generating insights from data.",
inputOutput:{
input:"df.groupby('Category')['Sales'].sum()",
output:"Category\nA 5000\nB 7500"
},
realWorld:"E-commerce: Group by customer and sum purchases for customer lifetime value. Finance: Group by account and calculate portfolio statistics. HR: Group by department for headcount, salary analysis."
},
11:{
title:"Merging & Joining DataFrames",
def:"Merging combines DataFrames on shared columns or indices (similar to SQL JOINs). Types include inner join (intersection), left join (keep left), right join (keep right), and outer join (union).",
badges:["Transformation","Joining","Data Integration"],
examples:[
{title:"DataFrame Merge Operations",code:`import pandas as pd
# Create two DataFrames
df_customers = pd.DataFrame({
'CustomerID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df_orders = pd.DataFrame({
'OrderID': [101, 102, 103, 104],
'CustomerID': [1, 2, 1, 3],
'Amount': [100, 150, 200, 75]
})
# Inner join (only matching customer IDs)
inner = pd.merge(df_customers, df_orders, on='CustomerID', how='inner')
# Left join (all customers, matching orders if exist)
left = pd.merge(df_customers, df_orders, on='CustomerID', how='left')
# Right join (all orders, matching customers if exist)
right = pd.merge(df_customers, df_orders, on='CustomerID', how='right')
# Outer join (all rows from both)
outer = pd.merge(df_customers, df_orders, on='CustomerID', how='outer')`},
{title:"Concatenation & Set Operations",code:`import pandas as pd
df1 = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
})
df2 = pd.DataFrame({
'A': [5, 6],
'B': [7, 8]
})
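# Concatenate (stack) DataFrames
df_rows = pd.concat([df1, df2], axis=0) # Stack vertically (rows appended)
df_cols = pd.concat([df1, df2], axis=1) # Place side by side (columns appended)
# Join on index
df_left = df1.set_index('A')
df_right = df2.set_index('A')
# Both frames have a column 'B', so join() needs suffixes to tell them apart.
# No index values are shared here, so an inner join would be empty; use outer.
df_joined = df_left.join(df_right, how='outer', lsuffix='_left', rsuffix='_right')`}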
],
explanation:"Joins combine data from multiple tables, so choosing the right join type matters. Inner joins keep only rows with a match on both sides, left joins preserve every row of the left ('base') table, and outer joins keep all rows from both tables, filling gaps with NaN.",
inputOutput:{
input:"pd.merge(df1, df2, on='ID', how='inner')",
output:"Returns only rows with matching IDs in both DataFrames"
},
realWorld:"In data warehouses: Customer master + transaction table (inner join for active only), inventory + sales forecasts (left join to keep all products), consolidating data from multiple teams (outer join)."
},
12:{
title:"Time Series Data",
def:"Time series data has a temporal component. Pandas provides specialized tools for handling dates, resampling at different frequencies, calculating rolling statistics, and performing time-based analysis.",
badges:["Specialized","Time Series","Temporal Analysis"],
examples:[
{title:"Time Series Basics",code:`import numpy as np
import pandas as pd
# Create time series index
dates = pd.date_range('2024-01-01', periods=30, freq='D')
data = pd.Series(range(30), index=dates)
# Access by date
print(data['2024-01-05']) # Single date
print(data['2024-01-05':'2024-01-10']) # Date range
# Create DataFrame with datetime
df = pd.DataFrame({
'Date': pd.date_range('2024-01-01', periods=100),
'Price': np.random.randn(100).cumsum() + 100,
'Volume': np.random.randint(1000, 10000, 100)
})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Time-based selection
print(df.loc['2024-01']) # Entire January
print(df.loc['2024-01-01':'2024-01-10']) # Specific range`},
{title:"Resampling & Rolling Statistics",code:`import numpy as np
import pandas as pd
# Create daily time series
dates = pd.date_range('2024-01-01', periods=365, freq='D')
daily_sales = pd.Series(np.random.randint(100, 500, 365), index=dates)
# Resample to weekly (sum aggregation)
weekly = daily_sales.resample('W').sum()
# Resample to monthly (mean)
# Note: the 'M' alias is deprecated in pandas 2.2+; use 'ME' (month end) there
monthly = daily_sales.resample('M').mean()
# Calculate rolling mean (7-day moving average)
rolling_mean = daily_sales.rolling(window=7).mean()
# Rolling statistics
rolling_std = daily_sales.rolling(window=7).std()
rolling_max = daily_sales.rolling(window=7).max()`}
],
explanation:"Time series analysis is crucial for forecasting and trend detection. Rolling statistics smooth noise, resampling changes granularity, and time-based indexing enables efficient filtering.",
inputOutput:{
input:"df.resample('M')['Price'].mean()",
output:"Date\n2024-01-31 105.3\n2024-02-29 108.7"
},
realWorld:"Stock market analysis: Daily prices → weekly/monthly trends. Website analytics: Hourly clicks → daily/weekly patterns. Sensor data: Raw measurements → hourly aggregates for anomaly detection."
},
13:{
title:"Pandas Performance & Memory",
def:"Optimizing pandas involves choosing appropriate dtypes, using categorical data, chunked reading for large files, using built-in methods instead of apply(), and avoiding unnecessary copies.",
badges:["Optimization","Memory","Performance"],
examples:[
{title:"Memory Optimization Techniques",code:`import pandas as pd
# Create large DataFrame
df = pd.DataFrame({
'int_col': [1, 2, 3] * 1000000,
'str_col': ['A', 'B', 'C'] * 1000000,
'float_col': [1.1, 2.2, 3.3] * 1000000
})
print(df.memory_usage(deep=True))
# Downcast numeric types
df['int_col'] = df['int_col'].astype('int8') # Instead of int64
# Use categorical for repeated string values
df['str_col'] = df['str_col'].astype('category')
# After optimization
print(df.memory_usage(deep=True)) # Much smaller!
# Reading large files in chunks
chunk_size = 10000
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process chunk
    result = chunk[chunk['Amount'] > 1000]`},
{title:"Efficient Operations",code:`import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ID': range(100000),
    'Category': np.random.choice(['A', 'B', 'C'], 100000),
    'Value': np.random.randn(100000),
    'temp': 0  # Scratch column, dropped below
})
# SLOW: Using apply() with lambda (runs a Python function per row)
result_slow = df.apply(lambda row: row['Value'] ** 2, axis=1)
# FAST: Vectorized operation
result_fast = df['Value'] ** 2
# Copy only when you need an independent object
df_copy = df.copy()
# Drop columns you no longer need (inplace=True mutates df but is not
# guaranteed to avoid an internal copy)
df.drop(columns=['temp'], inplace=True)
# Use groupby() instead of loops
# SLOW
result_slow = []
for group in df['Category'].unique():
    result_slow.append(df[df['Category'] == group]['Value'].sum())
# FAST
result_fast = df.groupby('Category')['Value'].sum()`}
],
explanation:"Vectorization and appropriate dtypes are key to performance; in-place operations mainly tidy up code and do not always avoid a copy. Always profile your code to identify bottlenecks. Categorical data can reduce memory by 90% or more for string columns with few distinct values.",
inputOutput:{
input:"df.dtypes",
output:"int_col int64\nstr_col object\nfloat_col float64"
},
realWorld:"Data pipelines processing millions of records daily must optimize for memory and speed. Typical measures are categorical dtypes for region/category fields, chunked reading for multi-GB files, and vectorized operations."
},
14:{
title:"Spark Basics & RDDs",
def:"Apache Spark is a distributed computing framework for processing large datasets. RDDs (Resilient Distributed Datasets) are the fundamental data structure - immutable, fault-tolerant collections partitioned across a cluster.",
badges:["Fundamentals","Spark","Distributed Computing"],
examples:[
{title:"Creating and Working with RDDs",code:`from pyspark import SparkContext
# Initialize Spark Context
sc = SparkContext("local", "RDD Example")
# Create RDD from collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
# Create RDD from external file
rdd2 = sc.textFile("hdfs://path/to/file.txt")
# RDD transformations (lazy)
rdd_mapped = rdd1.map(lambda x: x * 2) # [2, 4, 6, 8, 10]
rdd_filtered = rdd1.filter(lambda x: x > 2) # [3, 4, 5]
# RDD actions (trigger computation)
result = rdd_mapped.collect() # [2, 4, 6, 8, 10]
count = rdd_filtered.count() # 3
first = rdd1.first() # 1
sc.stop()`},
{title:"RDD Operations",code:`from pyspark import SparkContext
sc = SparkContext("local", "RDD Operations")
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
# Transformations (lazy)
mapped = rdd.map(lambda x: (x, x**2)) # Pairs
flat = rdd.flatMap(lambda x: [x, x*2]) # Flattened
# reduce() is an action: it runs immediately and returns a plain value
reduced = rdd.reduce(lambda x, y: x + y) # Sum: 21
# Pair RDD operations
pairs = rdd.map(lambda x: (x % 2, x)) # (1,1), (0,2), (1,3), ...
grouped = pairs.groupByKey() # Group by key
reduced_pairs = pairs.reduceByKey(lambda x, y: x + y) # (0, 12) and (1, 9)
# Join operations
rdd1 = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')])
rdd2 = sc.parallelize([(1, 'x'), (2, 'y')])
joined = rdd1.join(rdd2) # Inner join on key
sc.stop()`}
],
explanation:"RDDs are the foundation of Spark. Key concepts: transformations are lazy (not executed immediately), actions trigger execution, and operations are parallelized across cluster nodes.",
inputOutput:{
input:"rdd = sc.parallelize([1,2,3,4,5])\nrdd.filter(lambda x: x > 2).collect()",
output:"[3, 4, 5]"
},
realWorld:"RDDs handle unstructured data like raw text logs. Spark automatically partitions data across the cluster and handles failures by recomputing lost partitions from the original data, using the recorded lineage of transformations."
},
15:{
title:"Spark DataFrames & SQL",
def:"Spark DataFrames are a higher-level abstraction over RDDs whose queries are optimized by the Catalyst optimizer. They're similar to pandas DataFrames but distributed. Spark SQL allows querying DataFrames using SQL syntax.",
badges:["Core Concepts","SQL","Distributed Data"],
examples:[
{title:"Creating and Querying DataFrames",code:`from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName("DF_Example").getOrCreate()
# Create DataFrame from data
data = [("Alice", 25, 50000), ("Bob", 30, 60000), ("Charlie", 35, 75000)]
df = spark.createDataFrame(data, ["Name", "Age", "Salary"])
# Create DataFrame from file
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)
df_json = spark.read.json("data.json")
df_parquet = spark.read.parquet("data.parquet")
# Display DataFrame
df.show()
print(df.schema) # Print schema
df.printSchema() # Pretty print schema
# Query using DataFrame API
result = df.filter(df['Age'] > 28).select('Name', 'Salary')
result.show()
# SQL queries
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT Name, Salary FROM employees WHERE Age > 28")
result.show()`},
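{title:"DataFrame Transformations",code:`from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("DF_Transform").getOrCreate()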
df = spark.createDataFrame([
("Alice", 25, 50000),
("Bob", 30, 60000)
], ["Name", "Age", "Salary"])
# Filtering
high_earners = df.filter(F.col('Salary') > 55000)
# Aggregation
stats = df.groupby('Age').agg(F.avg('Salary').alias('AvgSalary'))
# Sorting
sorted_df = df.orderBy(F.desc('Salary'))
# Adding columns
df_with_bonus = df.withColumn('Bonus', F.col('Salary') * 0.1)
# Dropping columns
df_clean = df_with_bonus.drop('Bonus')
# Joining
df2 = spark.createDataFrame([('Alice', 'Sales')], ['Name', 'Dept'])
joined = df.join(df2, 'Name', 'left')`}
],
explanation:"DataFrames provide SQL-like syntax and rely on the Catalyst optimizer for efficient query execution. Use DataFrames instead of RDDs when working with structured data - they're usually faster and more intuitive.",
inputOutput:{
input:"df.groupby('Department').agg(F.sum('Salary')).show()",
output:"+----------+-----------+\n|Department|sum(Salary)|\n+----------+-----------+\n|     Sales|     105000|\n+----------+-----------+"
},
realWorld:"In data warehouses, DataFrames process structured data from Parquet files, databases, or S3. SQL queries enable data analysts to work alongside engineers in the same framework."
},
16:{
title:"Spark Transformations & Actions",
def:"Spark operations are either transformations (lazily describe a new RDD/DataFrame) or actions (trigger computation and return a result or write output). Understanding this distinction and lazy evaluation is crucial for performance and debugging.",
badges:["Operations","Performance","Spark"],
examples:[
{title:"Transformations (Lazy)",code:`from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Transform").getOrCreate()
# Read data (note: inferSchema=True makes Spark scan the file once to infer column types)
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
# These are transformations - NOT executed yet
filtered = df.filter(F.col('Amount') > 100)
selected = filtered.select('ID', 'Amount')
mapped = selected.withColumn('Tax', F.col('Amount') * 0.1)
# Chain transformations (still not executed)
result = (df
.filter(F.col('Amount') > 100)
.select('ID', 'Amount')
.withColumn('Tax', F.col('Amount') * 0.1))
print(result) # Still no execution!`},
{title:"Actions (Trigger Computation)",code:`from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Actions").getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
# Actions - these EXECUTE the computation