Process page

What we had on the start of our data project?

🪱 Worm samples data

At first, we have datasets based on Caenorhabditis elegans. These are fully transparent worms, which live in the soil and are commonly used in labs by researchers. Because of this, scientists directly observe internal organs and cellular structures, as well as changes in these structures over time.

Fluorescent image of Caenorhabditis elegans.

We had different datasets for worms based on their age. Usually, for this type of worms, their aging goes like this:

Development stages:
embryo → L1 → L2 → L3 → L4 → adult

2DA stands for 2-days old worms, which means young worms aka babies.

L4 stands for 4th Larvae stage of Caenorhabditis elegans, which means that these group of worms were teenagers, almost adults.

This data frame shows a glimpse on our data about younger 2DA group of worms.

	gene_id	R1232_2DA_1	R1232_2DA_2	R1232_2DA_3	R0_26_2DA_1	R0_26_2DA_2	R0_26_2DA_3	SG2_2DA	SG26_2DA	log2FoldChange	pvalue	padj	gene_name	gene_chr	gene_start	gene_end	gene_strand	gene_length	gene_biotype	gene_description	tf_family	R1232_2DA_1_count	R1232_2DA_2_count	R1232_2DA_3_count	R0_26_2DA_1_count	R0_26_2DA_2_count	R0_26_2DA_3_count	R1232_2DA_1_fpkm	R1232_2DA_2_fpkm	R1232_2DA_3_fpkm	R0_26_2DA_1_fpkm	R0_26_2DA_2_fpkm	R0_26_2DA_3_fpkm	WB_id
0	R12H7.3	10808.318182	12175.395905	10052.840549	687.508172	684.605947	755.231679	11012.184879	709.115266	3.958145	0.000000e+00	0.000000e+00	-	X	13223639	13224708	-	1070	protein_coding	-	-	9241	10208	10340	704	985	724	294.588891	332.609937	276.777893	20.014523	19.780817	21.728719	WBGene00004825
1	C56E10.3	3333.373750	3266.889634	3112.102766	12.695463	21.545974	21.905891	3237.455383	18.715776	7.416474	2.914773e-276	2.393466e-272	-	X	6645654	6651753	+	1980	protein_coding	-	-	2850	2739	3201	13	31	21	49.097665	48.228658	46.303639	0.199726	0.336425	0.340591	WBGene00016976
2	R12H7.5	3824.607776	3726.090988	3727.523275	309.573992	314.154201	323.372680	3759.407346	315.700291	3.574364	3.603489e-273	1.972670e-269	-	X	13222917	13223546	-	630	protein_coding	-	Skp1	3270	3124	3834	317	452	310	177.046920	172.881616	174.303536	15.306471	15.416637	15.801579	WBGene00004826
3	Y54G11A.3	29.240121	21.469154	18.472338	2782.259635	2846.153657	3248.330730	23.060537	2958.914674	-7.017885	7.391234e-230	3.034656e-226	-	II	14276651	14283977	+	1703	protein_coding	-	DEAD	25	18	19	2849	4095	3114	0.500733	0.368499	0.319546	50.890200	51.669116	58.719628	WBGene00013214
4	Y54G2A.25	21.052887	5.963654	8.750055	3844.772265	4545.505474	4317.546850	11.922198	4235.941530	-8.491477	1.951221e-205	6.408980e-202	-	IV	2972307	2978461	-	3817	protein_coding	-	fn3	18	5	9	3937	6540	4139	0.160854	0.045669	0.067533	31.376147	36.816912	34.821911	WBGene00002243
5	F49B2.6	1660.838851	1693.677722	1713.066252	5.859445	14.595660	15.647065	1689.194275	12.034057	7.109462	3.833036e-165	1.049166e-161	-	I	14331085	14339839	+	3372	protein_coding	-	-	1420	1420	1762	6	21	15	14.364217	14.681794	14.966249	0.054128	0.133821	0.142851	WBGene00009865
6	F09E10.11	28137.183261	28951.154468	28486.288984	5747.138629	5824.363283	6991.108719	28524.875571	6187.536877	2.205158	5.527423e-150	1.296812e-146	-	X	1504131	1504841	-	711	ncRNA	-	-	24057	24273	29300	5885	8380	6702	1154.125509	1190.233560	1180.300682	251.786954	253.259758	302.701212	WBGene00006650
7	Y54G11A.1	101.755620	67.985655	48.611415	1808.615249	1819.592252	1826.534075	72.784230	1818.247192	-4.656609	1.689870e-139	3.469093e-136	-	II	14256350	14269153	-	1878	protein_coding	-	-	87	57	50	1852	2618	1751	1.580174	1.058175	0.762551	29.998653	29.954759	29.941248	WBGene00013212
8	C06G3.5	6128.729280	5874.199141	5485.312029	1401.383846	1469.296418	1493.773155	5829.413483	1454.817806	2.002259	5.896349e-118	1.075953e-114	-	IV	7024211	7028146	+	1353	protein_coding	-	-	5240	4925	5642	1435	2114	1432	132.103603	126.907350	119.434535	32.263424	33.573681	33.987918	WBGene00015551
9	R193.2	44345.566917	45082.838404	39023.299224	8677.837529	9502.469546	6627.053670	42817.234848	8269.120248	2.371991	3.545768e-104	5.823214e-101	-	X	1132999	1148343	-	5808	protein_coding	-	VWA	37915	37798	40138	8886	13672	6353	222.672008	226.892714	197.935530	46.541037	50.582141	35.126234	WBGene00020128
10	T10B9.7	2589.505081	2638.320508	2086.401917	131.837505	250.906342	198.196159	2438.075835	193.646668	3.645498	6.581251e-95	9.825808e-92	-	II	9783818	9787398	-	1593	protein_coding	-	p450	2214	2212	2146	135	361	190	47.407059	48.411405	38.584118	2.577949	4.869487	3.830162	WBGene00011676

This data frame is an example for older L4 group of worms.

	gene_id	R123_2_L4_1	R123_2_L4_2	R123_2_L4_3	R0_26_L4_1	R0_26_L4_2	R0_26_L4_3	L4_SG2	L4_SG26	log2FoldChange	pvalue	padj	gene_name	gene_chr	gene_start	gene_end	gene_strand	gene_length	gene_biotype	gene_description	tf_family	R123_2_L4_1_count	R123_2_L4_2_count	R123_2_L4_3_count	R0_26_L4_1_count	R0_26_L4_2_count	R0_26_L4_3_count	R123_2_L4_1_fpkm	R123_2_L4_2_fpkm	R123_2_L4_3_fpkm	R0_26_L4_1_fpkm	R0_26_L4_2_fpkm	R0_26_L4_3_fpkm	WB_id
0	F28A10.1	3723.084124	3729.922355	3294.560808	25.825091	50.548356	30.235968	3582.522429	35.536471	6.680924	2.306031e-273	4.281607e-269	-	II	826187	827610	+	1046	protein_coding	-	-	3122	2956	3186	26	50	50	109.268136	110.365587	97.794884	0.774074	1.540931	0.915027	WBGene00017869
1	R12H7.3	13430.292570	13045.895546	12007.671093	1562.418000	1404.233321	1626.695057	12827.953069	1531.115460	3.065531	3.086380e-197	2.865241e-193	-	X	13223639	13224708	-	1070	protein_coding	-	-	11262	10339	11612	1573	1389	2690	385.322232	377.359848	348.437829	45.781063	41.846897	48.124253	WBGene00004825
2	DY3.7	8602.924045	8567.717454	8388.410946	1330.985455	1505.330033	1333.406172	8519.684148	1389.907220	2.616593	8.049516e-150	4.981845e-146	-	I	8792965	8797293	+	3337	protein_coding	-	-	7214	6790	8112	1340	1489	2205	79.142954	79.464746	78.050158	12.505168	14.384120	12.648758	WBGene00006324
3	F25C8.1	1705.320403	1317.333877	1229.514376	16.885636	21.230309	17.536861	1417.389552	18.550936	6.265121	2.253226e-137	1.045891e-133	-	V	20907470	20909746	-	1989	pseudogene	-	-	1430	1044	1189	17	21	29	26.320466	20.498715	19.193277	0.266168	0.340353	0.279099	WBGene00009104
4	R12H7.5	5075.415128	5136.845030	4835.331556	699.264000	535.812570	630.117565	5015.863905	621.731379	3.011539	1.532456e-134	5.690623e-131	-	X	13222917	13223546	-	630	protein_coding	-	Skp1	4256	4071	4676	704	530	1042	247.316671	252.360560	238.306563	34.799503	27.119402	31.660857	WBGene00004826
5	F35H8.1	2833.455438	2315.428796	3530.329755	159.916909	189.050850	164.483664	2893.071330	171.150474	4.081859	1.482965e-129	4.589034e-126	-	II	9552631	9553261	-	494	protein_coding	-	-	2376	1835	3414	161	187	272	176.080722	145.067474	221.890453	10.149388	12.202799	10.539923	WBGene00009446
6	C56E10.3	1272.431377	1353.926484	1180.912882	34.764545	26.285145	14.513264	1269.090248	25.187652	5.715125	3.216686e-120	8.532029e-117	-	X	6645654	6651753	+	1980	protein_coding	-	-	1067	1073	1142	35	26	24	19.728386	21.163888	18.518379	0.550483	0.423305	0.232029	WBGene00016976
7	F10D2.9	720.289177	861.819002	645.262380	4578.987274	4753.567370	5850.659733	742.456853	5061.071459	-2.770908	1.214046e-87	2.817649e-84	-	V	7151356	7153146	-	1123	protein_coding	-	-	604	683	624	4610	4702	9675	19.690172	23.752093	17.840494	127.838626	134.973237	164.917490	WBGene00001399
8	VZK822L.1	7287.561525	8394.848928	8149.539776	27974.533097	27757.113081	30200.893887	7943.983410	28644.180022	-1.850362	2.246835e-84	4.635220e-81	-	IV	11913690	11915687	-	3488	protein_coding	-	-	6111	6653	7881	28164	27456	49942	64.139878	74.490687	72.544902	251.454137	253.749727	274.084800	WBGene00001398
9	C41C4.5	2181.140571	1997.451654	2144.670156	326.786727	375.068799	361.017453	2107.754127	354.290993	2.572174	1.845654e-82	3.426825e-79	-	II	8116356	8122861	+	4156	protein_coding	-	-	1829	1583	2074	329	371	597	16.111296	14.875320	16.022684	2.465252	2.877684	2.749757	WBGene00006832
10	M18.1	7416.354954	7978.450289	8247.776838	25931.371097	27595.358343	29512.723265	7880.860693	27679.817568	-1.812418	5.581165e-82	9.420498e-79	-	IV	12108671	12110258	-	1128	protein_coding	-	Collagen	6219	6323	7976	26107	27296	48804	201.838393	218.914744	227.027306	720.756878	780.072025	828.212568	WBGene00000703

Both of datasets above are the cutted version, as our real datasets included over 10.000 data rows together. Explanation of important column names:

gene_id – unique identifier of the gene
gene_name – common name of the gene (if available)
WB_id – WormBase identifier for the gene
baseMean – average expression level of the gene across all samples
FPKM / TPM – normalized expression values that allow comparison between genes and samples
log2FoldChange – how much the gene expression changes between two conditions
lfcSE – uncertainty (error) of the fold change value
pvalue – statistical measure showing whether the change is likely due to chance
padj – corrected p-value used to identify statistically significant genes
chromosome – chromosome where the gene is located
length – length of the gene
biotype – type of gene (e.g., protein-coding or non-coding)
description – short description of gene function
is_tf / TF – indicates whether the gene is a transcription factor

Based on this information, we could explore the structure of the data and detect meaningful differences in gene expression. Therefore, we proceed with EDA to gain initial insights into the dataset and guide further analysis.

1.1. Exploration data analysis (EDA)

Using padj to indentify statistically significant genes, we have found that most significant genes are downregulated, indicating a decrease in L4_SG26 compared to L4_SG2 (different groups of adult worms). Also most significant genes are protein-coding, so we can assume that major transcriptional changes affect functional genes.

There is a clear distinction between upregulated and downregulated genes, confirming that the two conditions (L4_SG2 vs L4_SG26) differ significantly at the transcriptomic level. Together, these analyses confirm that gene expression patterns and biological pathways differ between developmental stages.

1.2. Differential expression analysis

The plot shows gene expression changes in L4 vs 2-day adults. Most genes have little change, while a subset shows strong stage-specific regulation.

1.3. Gene set enrichment analysis (GSEA)

Genes related to peroxisomes are more active for 2DA group of worms. This suggests that the cells are more actively processing fats and protecting themselves from damage.

Genes involved in phosphorylation processes are less active for L4 group of worms. This suggests reduced cell signaling and regulation.

These results highlight stage-specific differences in biological processes between 2DA and L4 worms.

This plot compares pathway activity between L4 and 2-day (2DA) worms. Each point represents a pathway.

Most pathways behave similarly in both stages, especially showing reduced activity. However, some pathways are more active in 2DA than in L4, indicating stage-specific differences.

Overall, this suggests that certain biological processes are more active in 2DA, which is consistent with the GSEA results.

1.4. Pathway analysis

We followed the idea to use g:Profiler for analysing gene lists, which we had.

sources_of_interest = [
    "KEGG",   # metabolic & signalling pathways
    "REAC",   # Reactome pathways
    "WP",     # WikiPathways
    
    "GO:BP",  # biological processes
    "GO:MF",  # molecular functions
    "GO:CC"   # cellular components
]

As was discussed within team, we decided to focus more on KEGG, REAC, WP and tried to create a different list with pathways to recreate a vizualization.

Our g:Profiler approach gave us a more broad result among downregulated genes:

	source	native	name	p_value	significant	description	term_size	query_size	intersection_size	effective_domain_size	precision	recall	query	parents
0	KEGG	KEGG:01100	Metabolic pathways	5.029174e-19	True	Metabolic pathways	897	187	116	2968	0.620321	0.129320	query_1	['KEGG:00000']
1	KEGG	KEGG:01200	Carbon metabolism	1.697108e-18	True	Carbon metabolism	103	187	37	2968	0.197861	0.359223	query_1	['KEGG:00000']
2	KEGG	KEGG:01230	Biosynthesis of amino acids	1.879931e-09	True	Biosynthesis of amino acids	69	187	22	2968	0.117647	0.318841	query_1	['KEGG:00000']
3	KEGG	KEGG:00190	Oxidative phosphorylation	9.188188e-06	True	Oxidative phosphorylation	96	187	21	2968	0.112299	0.218750	query_1	['KEGG:00000']
4	KEGG	KEGG:00010	Glycolysis / Gluconeogenesis	2.249896e-05	True	Glycolysis / Gluconeogenesis	41	187	13	2968	0.069519	0.317073	query_1	['KEGG:00000']
5	KEGG	KEGG:01212	Fatty acid metabolism	1.745630e-04	True	Fatty acid metabolism	63	187	15	2968	0.080214	0.238095	query_1	['KEGG:00000']
6	REAC	REAC:R-CEL-1430728	Metabolism	2.204675e-04	True	Metabolism	1064	264	105	4004	0.397727	0.098684	query_1	['REAC:0000000']
7	KEGG	KEGG:00640	Propanoate metabolism	2.876040e-04	True	Propanoate metabolism	30	187	10	2968	0.053476	0.333333	query_1	['KEGG:00000']
8	KEGG	KEGG:01040	Biosynthesis of unsaturated fatty acids	4.083409e-04	True	Biosynthesis of unsaturated fatty acids	25	187	9	2968	0.048128	0.360000	query_1	['KEGG:00000']
9	REAC	REAC:R-CEL-917977	Transferrin endocytosis and recycling	4.408555e-04	True	Transferrin endocytosis and recycling	20	264	9	4004	0.034091	0.450000	query_1	['REAC:R-CEL-917937']
10	REAC	REAC:R-CEL-77387	Insulin receptor recycling	4.408555e-04	True	Insulin receptor recycling	20	264	9	4004	0.034091	0.450000	query_1	['REAC:R-CEL-74752']

We compiled a list of network genes based on previously identified upregulated and downregulated gene sets. We then restricted the list to 40 genes and generated a network visualization using cytoscape for biological interpritation of gene lists using the string database by their corresponding WB_id (WormBase identifier for the gene).

Meanwhile some genes may stay uninvolved; other genes interact with each other on this string network plot. For example, you can see on visualization here that cyc-2.2 (electron carrier protein) here is interacting with many different genes at the same time, and they interact with each other as well, turning everything into string network.

After that, we used previously chosen biological pathways, which are affected by transcriptional changes in the DM1 model. We found out that most of transcriptonic changes are caused by apoptosis (cellular stress / the death of cells), mitochondrial metabolism (energy metabolism), protein processing (cells reaction on stress), and cytoskeleton dynamics (changes in cell structure).

Additional analyses (still working on it)

🏃 Human data samples

At some point, we have finished work with working on worm samples and we have been offered to work with whether mice or human data. We have chosen the last.

Data frame from dataset with human data samples:

	Unnamed: 0	DM1	DM1.1	DM1.2	DM1.3	DM1.4	DM1.5	DM1.6	DM1.7	DM1.8	DM1.9	DM1.10	DM1.11	DM1.12	DM1.13	DM1.14	DM1.15	Ctrl	Ctrl.1	Ctrl.2	Ctrl.3	Ctrl.4	Ctrl.5
0	gene	11832X3	11885X4	11545X3	11545X4	11832X2	11832X1	11885X2	11832X4	12167X5	12167X7	11885X1	11885X3	12167X4	12167X10	12167X8	12167X11	11885X13	11885X11	11885X12	11885X16	11885X15	11885X14
1	TTN:ENSG00000155657.19	2733270	5918620	2156564	2906657	5997728	12937419	5750155	3719573	4987055	3282911	5874714	2481992	5059765	4442596	4748835	2813414	6187163	2553862	3179365	2511608	2714806	4853218
2	NEB:ENSG00000183091.15	542809	1013643	326058	359468	919344	1876455	829676	764947	807184	515992	957837	461102	793888	617262	753158	451184	596629	284077	352371	337686	328065	520774
3	RN7SL2:ENSG00000265150.1	745375	424648	1049072	580437	1116446	492639	250962	534655	504309	1000137	178405	182946	577956	668912	823843	704739	470059	429154	254236	443721	463935	446942
4	MT-CO1:ENSG00000198804.2	997302	292681	613228	447243	672198	567995	349955	540519	398712	568541	347256	289823	234719	621730	464804	459394	714608	501756	405015	486477	569275	375409
5	ACTA1:ENSG00000143632.10	675226	225569	370396	315559	304708	700308	168322	239437	249291	355265	237808	142270	240724	259678	305907	180839	582098	336483	244088	390215	497616	638403
6	RN7SK:ENSG00000202198.1	403671	175302	504526	253615	591815	283486	127061	452396	181975	457371	191550	312210	170562	375007	384614	378260	256022	228484	166543	296759	319581	255374
7	MYH7:ENSG00000092054.12	402290	295401	376691	200074	493153	657088	241244	235809	574612	237310	352163	131121	339214	456909	373738	238434	453862	216134	252786	272765	316279	322987
8	MT-ND4:ENSG00000198886.2	564314	146010	308134	320027	322889	344796	202753	342929	205239	295783	185555	107024	132823	310682	211592	294189	423613	256736	243488	359049	370988	280742
9	RN7SL1:ENSG00000258486.2	286767	133721	420826	236631	341706	184395	87352	182257	307485	514830	95278	120593	293725	292615	398738	351840	157114	125915	71508	124744	155094	130832
10	MALAT1:ENSG00000251562.3	428971	238207	156317	179399	339363	503112	299572	487754	223220	264110	280429	268117	228365	130398	183467	312418	164101	132039	164530	116230	96970	124090

1.1. Exploration data analysis (EDA)

1.2. Differential expression analysis

Compare MBL1 vs CUG

def find_mbl1_dependent(cug_df, mbl1_df, stage_label):
    merged = cug_df[["gene_id", "gene_name", "log2FoldChange", "direction",
                      "gene_biotype", "gene_description", "gene_chr",
                      "pvalue", "padj"]].merge(
        mbl1_df[["gene_id", "log2FoldChange", "direction",
                 "pvalue", "padj"]],
        on="gene_id",
        suffixes=("_cug", "_mbl1")
    )

    concordant  = merged[merged["direction_cug"] == merged["direction_mbl1"]].copy()
    discordant  = merged[merged["direction_cug"] != merged["direction_mbl1"]].copy()

    concordant["stage"] = stage_label
    discordant["stage"] = stage_label

    print(f"\n{stage_label}:")
    print(f"  Total overlap:  {len(merged):>4} genes")
    print(f"  Concordant:     {len(concordant):>4} genes  ← mbl-1-dependent")
    print(f"  Discordant:     {len(discordant):>4} genes  (opposite direction, worth flagging)")
    return concordant, discordant

mbl1_dep_l4,  discord_l4  = find_mbl1_dependent(cug_l4_sig,  mbl1_l4_sig,  "L4")
mbl1_dep_2da, discord_2da = find_mbl1_dependent(cug_2da_sig, mbl1_2da_sig, "2DA")

1.3. Gene set enrichment analysis (GSEA)

	WormBase ID	Ensembl ID
0	WBGene00000022	ENSG00000167972
1	WBGene00000040	ENSG00000122729
2	WBGene00000041	ENSG00000100412
3	WBGene00000064	ENSG00000184009
4	WBGene00000066	ENSG00000184009
5	WBGene00000068	ENSG00000162104
6	WBGene00000072	ENSG00000087274
7	WBGene00000081	ENSG00000018510
8	WBGene00000086	ENSG00000110514
9	WBGene00000089	ENSG00000069974
10	WBGene00000092	ENSG00000121957

Green nodes - human genes, pink nodes - worm genes.

The visualization shows that these human genes are linked to key cell processes such as cell death, nervous system function, cell structure, and energy production. This indicates that they may play a role in development and diseases, especially those affecting the nervous system.